1 00:00:00,000 --> 00:00:04,960 [MUSIC PLAYING] 2 00:00:04,960 --> 00:00:49,182 3 00:00:49,182 --> 00:00:50,140 DAVID MALAN: All right. 4 00:00:50,140 --> 00:00:53,260 This is CS50, and this is week 3. 5 00:00:53,260 --> 00:00:56,560 And you'll recall that last week, we equipped you with a lot more tools 6 00:00:56,560 --> 00:00:59,830 by which to solve problems-- not only problems that we had proposed, 7 00:00:59,830 --> 00:01:02,110 but problems in your own code, that is to say bugs. 8 00:01:02,110 --> 00:01:05,980 And recall that those tools involve command line tools like help50 9 00:01:05,980 --> 00:01:09,580 for help with cryptic error messages that the compiler might spit out; 10 00:01:09,580 --> 00:01:11,500 style50, which gives you a bit of feedback 11 00:01:11,500 --> 00:01:14,230 on the stylization of your code, the aesthetics thereof; 12 00:01:14,230 --> 00:01:18,550 check50, which checks the correctness of your code against the specifications 13 00:01:18,550 --> 00:01:21,400 in a given problem set or lab; printf, which 14 00:01:21,400 --> 00:01:24,640 is a function that exists in some form in almost any programming 15 00:01:24,640 --> 00:01:27,185 language that you might ultimately learn, and this 16 00:01:27,185 --> 00:01:30,310 is simply a way of printing out anything you might want from the computer's 17 00:01:30,310 --> 00:01:31,345 memory onto the screen. 18 00:01:31,345 --> 00:01:33,220 Then perhaps the most powerful of these tools 19 00:01:33,220 --> 00:01:36,190 was debug50, which was this interactive debugger. 
20 00:01:36,190 --> 00:01:40,060 And even though this command debug50 is a little specific to CS50, 21 00:01:40,060 --> 00:01:42,970 what it triggers to happen, that little side window 22 00:01:42,970 --> 00:01:46,150 where you can see the stack of functions that you 23 00:01:46,150 --> 00:01:48,550 might have called during some break point, 24 00:01:48,550 --> 00:01:51,942 and you can see the local variables that you might have defined at some point 25 00:01:51,942 --> 00:01:53,650 during the execution of your code, that's 26 00:01:53,650 --> 00:01:58,900 a very common conventional feature of any debugger with most any language. 27 00:01:58,900 --> 00:02:02,500 And then lastly, recall there was this ddb, duck debugger, which 28 00:02:02,500 --> 00:02:06,490 of course, takes this physical form, if you happen to have a rubber duck lying 29 00:02:06,490 --> 00:02:07,870 around with whom you can talk. 30 00:02:07,870 --> 00:02:10,270 But I'm so pleased to say that if you lack 31 00:02:10,270 --> 00:02:16,420 that currently while at home, CS50's own Kareem and Brenda and Sophie 32 00:02:16,420 --> 00:02:19,150 have wonderfully added, if you haven't noticed already, 33 00:02:19,150 --> 00:02:22,090 that same virtual duck to CS50 IDE. 34 00:02:22,090 --> 00:02:24,400 So if you click in the top corner, you can actually 35 00:02:24,400 --> 00:02:27,328 begin to have a chat of sorts with the rubber duck. 36 00:02:27,328 --> 00:02:30,370 And while this is a certainly more playful incarnation of that same idea, 37 00:02:30,370 --> 00:02:32,560 we really can't emphasize enough the value 38 00:02:32,560 --> 00:02:36,700 of talking through problems when you're experiencing them in code with someone 39 00:02:36,700 --> 00:02:38,530 else or with something else. 
40 00:02:38,530 --> 00:02:42,250 This particular duck, not all that large of a vocabulary, 41 00:02:42,250 --> 00:02:45,100 but it's not so much what the other person says but what you say 42 00:02:45,100 --> 00:02:47,952 and what you hear yourself saying that is undoubtedly 43 00:02:47,952 --> 00:02:49,535 the most valuable part of the process. 44 00:02:49,535 --> 00:02:53,560 So our thanks to Kareem and Brenda and Sophie on that. 45 00:02:53,560 --> 00:02:58,000 Recall last week, too, that we took a look underneath the hood, 46 00:02:58,000 --> 00:03:02,020 literally in some sense, at the computer's memory in your laptop 47 00:03:02,020 --> 00:03:03,040 or desktop or phone. 48 00:03:03,040 --> 00:03:06,160 And then we decided to think about this more 49 00:03:06,160 --> 00:03:09,100 artistically as just a grid of bytes. 50 00:03:09,100 --> 00:03:11,470 So within that chip, there's a whole bunch of bits. 51 00:03:11,470 --> 00:03:14,620 And if you look at eight of them at a time, there's a whole bunch of bytes. 52 00:03:14,620 --> 00:03:16,150 And it stands to reason that we could think 53 00:03:16,150 --> 00:03:18,790 of this as the first byte, the second byte, the third byte, 54 00:03:18,790 --> 00:03:21,360 and so forth, and sort of chop this up pictorially 55 00:03:21,360 --> 00:03:25,790 into just a whole sequence of bytes in the computer's memory. 56 00:03:25,790 --> 00:03:30,160 And recall that if we zoom in on that and focus on just one contiguous block 57 00:03:30,160 --> 00:03:33,500 of memory, otherwise known as an "array," 58 00:03:33,500 --> 00:03:37,640 we can do things within this array like storing a bunch of different values. 59 00:03:37,640 --> 00:03:41,110 So recall last week, we started by defining a little-- 60 00:03:41,110 --> 00:03:44,710 goofily, multiple variables that were almost identically named, 61 00:03:44,710 --> 00:03:47,470 like scores1, scores2, scores3. 
62 00:03:47,470 --> 00:03:51,190 And then we began to clean up the design of our code by introducing an array, 63 00:03:51,190 --> 00:03:55,540 so we can have just one variable called scores, that is of size 3 64 00:03:55,540 --> 00:03:57,860 and has room for multiple values. 65 00:03:57,860 --> 00:04:02,320 So today, we'll continue to leverage this feature of many programming 66 00:04:02,320 --> 00:04:04,570 languages-- being able to store things contiguously, 67 00:04:04,570 --> 00:04:07,000 back to back to back to back, in a computer's memory, 68 00:04:07,000 --> 00:04:11,050 because this very simple layout, this very simple feature of the language, 69 00:04:11,050 --> 00:04:13,270 is going to open up all sorts of powerful features. 70 00:04:13,270 --> 00:04:17,410 And in fact, we can even revisit some of the problems 71 00:04:17,410 --> 00:04:19,990 we tried to solve way back in week 0. 72 00:04:19,990 --> 00:04:21,700 But there is a catch with arrays. 73 00:04:21,700 --> 00:04:25,405 And we didn't really emphasize this much last week. 74 00:04:25,405 --> 00:04:27,280 And that's because, even though you and I can 75 00:04:27,280 --> 00:04:30,363 glance at this picture on the screen and see immediately that, oh, there's 76 00:04:30,363 --> 00:04:33,640 seven boxes on the screen, there are seven locations in which you 77 00:04:33,640 --> 00:04:37,870 can store values, you and I can sort of have this bird's eye view of everything 78 00:04:37,870 --> 00:04:41,380 and just see what's inside that entire array all at once. 79 00:04:41,380 --> 00:04:45,370 But computers, recall, are much more methodical, more algorithmic, 80 00:04:45,370 --> 00:04:46,250 if you will. 81 00:04:46,250 --> 00:04:48,790 And so a computer, as powerful as they are, 82 00:04:48,790 --> 00:04:51,945 can technically only look at one location in an array at a time. 
83 00:04:51,945 --> 00:04:54,070 So whereas you and I can glance at this and sort of 84 00:04:54,070 --> 00:04:57,370 take it all in at once, a computer just can't glance at its memory 85 00:04:57,370 --> 00:05:00,040 and take in all at once all of the values therein, 86 00:05:00,040 --> 00:05:04,270 it has to do so more methodically, for instance, from left to right, maybe 87 00:05:04,270 --> 00:05:06,950 right to left, maybe middle onward. 88 00:05:06,950 --> 00:05:08,990 But it has to be an algorithm. 89 00:05:08,990 --> 00:05:12,850 And so today we'll formalize that notion and really kind of highlight the fact that 90 00:05:12,850 --> 00:05:16,840 this array cannot be seen all at once, you can only look at one location 91 00:05:16,840 --> 00:05:19,090 in an array at a given time. 92 00:05:19,090 --> 00:05:21,160 And this is going to have very real implications. 93 00:05:21,160 --> 00:05:23,820 For instance, if we consider that very first problem 94 00:05:23,820 --> 00:05:27,430 in the very first week where we tried to find my phone number in a phone book, 95 00:05:27,430 --> 00:05:30,640 the very naive approach was to start at the beginning and search from left 96 00:05:30,640 --> 00:05:31,180 to right. 97 00:05:31,180 --> 00:05:33,055 And we tried a couple of variants thereafter. 98 00:05:33,055 --> 00:05:35,590 But the problem, quite simply, is that of searching. 99 00:05:35,590 --> 00:05:37,690 And this is a term of art in computer science, 100 00:05:37,690 --> 00:05:40,540 super common, certainly for you and I as users on Google 101 00:05:40,540 --> 00:05:42,830 and the like to search for things all day long. 102 00:05:42,830 --> 00:05:46,690 And so certainly searching well, designing a search algorithm well, 103 00:05:46,690 --> 00:05:49,710 is certainly a compelling feature of so many of today's tools 104 00:05:49,710 --> 00:05:50,723 that you and I use. 
105 00:05:50,723 --> 00:05:52,890 So if we think of this really as a problem to solve, 106 00:05:52,890 --> 00:05:56,430 we've got some input, which, for instance, might be an array of numbers, 107 00:05:56,430 --> 00:05:59,640 or maybe an array of web pages in the case of Google. 108 00:05:59,640 --> 00:06:01,270 And the goal is to get some output. 109 00:06:01,270 --> 00:06:05,340 So if the input to the problem is an array of values, the output, 110 00:06:05,340 --> 00:06:09,630 hopefully, is going to be something as simple, really, as a bool-- 111 00:06:09,630 --> 00:06:10,650 yes or no. 112 00:06:10,650 --> 00:06:15,180 Is the value you're looking for discoverable? 113 00:06:15,180 --> 00:06:20,130 Can you search for and find that value, yes or no, true or false? 114 00:06:20,130 --> 00:06:23,190 Now, within this black box, recall, is going to be some algorithm. 115 00:06:23,190 --> 00:06:25,530 And that's where today we'll spend most of our time. 116 00:06:25,530 --> 00:06:28,680 Indeed, we won't really introduce that many more features of C. 117 00:06:28,680 --> 00:06:30,480 We won't introduce that much more code. 118 00:06:30,480 --> 00:06:33,930 We'll focus again on ideas, just taking for granted now that you 119 00:06:33,930 --> 00:06:35,670 have some more tools in your toolkit. 120 00:06:35,670 --> 00:06:38,100 Beyond loops and conditions and Boolean expressions, 121 00:06:38,100 --> 00:06:40,830 we now have this other tool known as arrays. 122 00:06:40,830 --> 00:06:45,060 But let's first introduce some other terms of art, some jargon if you will, 123 00:06:45,060 --> 00:06:46,990 related to what we'll call running time. 124 00:06:46,990 --> 00:06:48,780 So we've alluded to this a few times. 125 00:06:48,780 --> 00:06:51,510 When we're thinking about just how good or bad an algorithm is, 126 00:06:51,510 --> 00:06:53,490 we describe how long it takes to run. 127 00:06:53,490 --> 00:06:54,540 That is its running time. 
128 00:06:54,540 --> 00:06:57,330 The running time of an algorithm is how long it takes-- 129 00:06:57,330 --> 00:07:00,820 how many steps it takes, how many seconds it takes, how many iterations 130 00:07:00,820 --> 00:07:01,320 it takes. 131 00:07:01,320 --> 00:07:04,140 It doesn't really matter what your unit of measure is. 132 00:07:04,140 --> 00:07:07,350 Maybe it's time, maybe it's iterations or something else. 133 00:07:07,350 --> 00:07:10,422 But running time just refers to how long does an algorithm take. 134 00:07:10,422 --> 00:07:13,380 And there are ways that we can think about this a little more formally. 135 00:07:13,380 --> 00:07:16,020 And we kind of did this already in the first week, 136 00:07:16,020 --> 00:07:19,320 but we didn't give it this name, this italicized O, 137 00:07:19,320 --> 00:07:23,220 this capital O on the screen, is otherwise known as Big O notation. 138 00:07:23,220 --> 00:07:25,470 And computer scientists and some mathematicians 139 00:07:25,470 --> 00:07:28,410 will very frequently use, literally, this symbol 140 00:07:28,410 --> 00:07:31,410 to describe the running times of algorithms, 141 00:07:31,410 --> 00:07:33,310 or mathematically like a function. 142 00:07:33,310 --> 00:07:34,720 So recall this picture, in fact. 143 00:07:34,720 --> 00:07:38,190 When we were searching that phone book, we did it sort of good, better, best. 144 00:07:38,190 --> 00:07:41,800 We did it linearly-- that is, searching one page at a time, 145 00:07:41,800 --> 00:07:44,940 we did it twice as fast by doing two pages at a time-- 146 00:07:44,940 --> 00:07:48,690 and then we did it logarithmically by dividing and conquering, 147 00:07:48,690 --> 00:07:50,060 in half and half and half. 
148 00:07:50,060 --> 00:07:52,560 And at the time, I proposed that if we think of a phone book 149 00:07:52,560 --> 00:07:57,630 as having n pages, where n is just a number in computer science vernacular, 150 00:07:57,630 --> 00:08:00,300 we might describe the running time, or the number of steps 151 00:08:00,300 --> 00:08:04,350 involved for that first algorithm, as being maybe in the worst case n steps. 152 00:08:04,350 --> 00:08:06,780 If the person you're looking for in a phone book maybe 153 00:08:06,780 --> 00:08:10,297 alphabetically has the last name starting with Z in English, 154 00:08:10,297 --> 00:08:12,880 well, the Z might be all the way at the end of the phone book. 155 00:08:12,880 --> 00:08:15,450 So at the worst case, you might be taking n steps 156 00:08:15,450 --> 00:08:18,090 to find someone like myself in that phone book. 157 00:08:18,090 --> 00:08:20,400 The second algorithm, though, was twice as fast, 158 00:08:20,400 --> 00:08:22,510 because we went two pages at a time. 159 00:08:22,510 --> 00:08:24,985 So we might describe its running time as n divided by 2. 160 00:08:24,985 --> 00:08:28,110 And then the third algorithm, where we divided the problem in half and half 161 00:08:28,110 --> 00:08:31,680 and half, literally throwing half of the problem away again and again, 162 00:08:31,680 --> 00:08:35,760 was logarithmic-- technically log base 2 of n, which, again, is just 163 00:08:35,760 --> 00:08:39,390 a mathematical formula that refers to halving something again 164 00:08:39,390 --> 00:08:40,289 and again and again. 165 00:08:40,289 --> 00:08:43,690 And you start with, of course, n pages in that scenario. 166 00:08:43,690 --> 00:08:46,560 Well, it turns out that a computer scientist would actually 167 00:08:46,560 --> 00:08:49,027 wave their hands at some of these mathematical details. 
168 00:08:49,027 --> 00:08:51,360 Indeed, we're not going to get into the habit of writing 169 00:08:51,360 --> 00:08:53,517 very precise mathematical formulas. 170 00:08:53,517 --> 00:08:55,350 What we're instead going to do is try to get 171 00:08:55,350 --> 00:08:59,850 a sense of the order on which the running time of an algorithm 172 00:08:59,850 --> 00:09:03,870 is, just roughly how fast or how slow it is, but still 173 00:09:03,870 --> 00:09:06,210 using some symbology like n as a placeholder. 174 00:09:06,210 --> 00:09:09,480 And so a computer scientist would describe the running time of all three 175 00:09:09,480 --> 00:09:14,820 of those algorithms from week 0 as being big O of n, or big O of n/2, 176 00:09:14,820 --> 00:09:17,040 or big O of log base 2 of n. 177 00:09:17,040 --> 00:09:18,900 So "big O" just means "on the order of." 178 00:09:18,900 --> 00:09:20,520 It's sort of a wave of the hand. 179 00:09:20,520 --> 00:09:24,900 Maybe it's n minus 1, maybe it's n plus 1, maybe it's even 2n. 180 00:09:24,900 --> 00:09:28,240 But it's on the order of n or these other values. 181 00:09:28,240 --> 00:09:30,468 But in fact, too, notice that in this chart, 182 00:09:30,468 --> 00:09:33,510 there's something kind of curious. Like, these first two algorithms from 183 00:09:33,510 --> 00:09:36,870 week 0 kind of pictorially look pretty much the same. 184 00:09:36,870 --> 00:09:39,137 Like undoubtedly, the yellow line is a little lower 185 00:09:39,137 --> 00:09:41,970 and therefore a little better and a little faster than the red line. 186 00:09:41,970 --> 00:09:43,200 But they have the same shape. 187 00:09:43,200 --> 00:09:47,370 And in fact, I bet if we zoomed way out, these two straight lines 188 00:09:47,370 --> 00:09:49,380 would pretty much look identical. 189 00:09:49,380 --> 00:09:53,412 If you change your axis to be big enough and tall enough, 190 00:09:53,412 --> 00:09:54,870 these would start to blur together. 
191 00:09:54,870 --> 00:09:57,880 But clearly, the green line is fundamentally different. 192 00:09:57,880 --> 00:10:01,020 And so this speaks to a computer scientist's tendency 193 00:10:01,020 --> 00:10:03,870 to not really quibble over these details. 194 00:10:03,870 --> 00:10:06,977 Like, yes, the second algorithm in week 0 was better. 195 00:10:06,977 --> 00:10:08,310 Yes, this yellow line is better. 196 00:10:08,310 --> 00:10:12,540 But, eh, let's just call both of those algorithms running times 197 00:10:12,540 --> 00:10:13,800 on the order of n. 198 00:10:13,800 --> 00:10:18,390 That is to say, a computer scientist tends to throw away constant factors, 199 00:10:18,390 --> 00:10:20,840 like the 1/2 or the divided by 2. 200 00:10:20,840 --> 00:10:23,250 And they tend to focus only on the dominant factor, 201 00:10:23,250 --> 00:10:27,810 like which value in that mathematical expression is going to grow the most, 202 00:10:27,810 --> 00:10:28,890 grow the fastest. 203 00:10:28,890 --> 00:10:32,280 And in n divided by 2, it's n that's going to dominate over time. 204 00:10:32,280 --> 00:10:34,830 The bigger the phone book gets, the more pages you have. 205 00:10:34,830 --> 00:10:38,238 It's really n that's going to matter less so than that divided by 2. 206 00:10:38,238 --> 00:10:39,280 And same thing over here. 207 00:10:39,280 --> 00:10:41,700 If you're familiar with and remember your logarithms, 208 00:10:41,700 --> 00:10:45,120 we don't really have to even care about the base of that logarithm. 209 00:10:45,120 --> 00:10:48,940 Yes, it's base 2, but eh, we can just multiply that logarithm 210 00:10:48,940 --> 00:10:51,490 by some other number to convert it to any base we want-- 211 00:10:51,490 --> 00:10:54,040 base 10, base 3, base 7, anything. 212 00:10:54,040 --> 00:10:55,960 So let's just say it's on the order of log n. 
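The two simplifications just described can be written out as a short worked derivation. This is only a sketch of the standard reasoning; the symbol b for an arbitrary logarithm base greater than 1 is introduced here for illustration and does not appear in the lecture.

```latex
% Dropping a constant factor: n/2 is n times the constant 1/2.
\frac{n}{2} = \frac{1}{2} \cdot n
\quad\Longrightarrow\quad
O\!\left(\frac{n}{2}\right) = O(n)

% Changing the base of a logarithm only multiplies by a constant,
% because 1 / \log_2 b does not depend on n.
\log_b n = \frac{\log_2 n}{\log_2 b} = \frac{1}{\log_2 b} \cdot \log_2 n
\quad\Longrightarrow\quad
O(\log_2 n) = O(\log_b n) = O(\log n)
```

In both cases the quantity thrown away is a fixed number, so it does not change how the running time grows as n grows.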
213 00:10:55,960 --> 00:10:57,760 So this is good, because it means we're not really 214 00:10:57,760 --> 00:10:59,815 going to waste time getting really into the weeds 215 00:10:59,815 --> 00:11:02,440 mathematically when we talk about the efficiency of algorithms. 216 00:11:02,440 --> 00:11:07,580 It suffices to describe things really in terms of the variable, n in this case, 217 00:11:07,580 --> 00:11:09,370 if you will, that dominates over time. 218 00:11:09,370 --> 00:11:10,510 And indeed, let's zoom out. 219 00:11:10,510 --> 00:11:12,910 If I zoom out on this picture, boom, you begin 220 00:11:12,910 --> 00:11:16,405 to see that, yeah, these are really starting to look almost identical. 221 00:11:16,405 --> 00:11:20,140 And if we kept zooming out, you would see that they're essentially 222 00:11:20,140 --> 00:11:21,040 one and the same. 223 00:11:21,040 --> 00:11:25,420 But the green one stands out, so that's indeed on the order of log of n 224 00:11:25,420 --> 00:11:26,768 as opposed to n itself. 225 00:11:26,768 --> 00:11:28,060 So here's a little cheat sheet. 226 00:11:28,060 --> 00:11:30,280 It turns out that within computer science, 227 00:11:30,280 --> 00:11:33,250 and within the analysis of algorithms, we're 228 00:11:33,250 --> 00:11:36,410 going to tend to see some common formulas like this. 229 00:11:36,410 --> 00:11:38,290 So we've just seen on the order of n. 230 00:11:38,290 --> 00:11:39,880 We've seen on the order of log n. 231 00:11:39,880 --> 00:11:42,850 It turns out that very common, too, is going to be n times 232 00:11:42,850 --> 00:11:46,677 log n, maybe even n squared, and then even big O of 1. 233 00:11:46,677 --> 00:11:48,760 And the last of those just means that an algorithm 234 00:11:48,760 --> 00:11:50,800 takes, wonderfully, one step-- 235 00:11:50,800 --> 00:11:55,420 or maybe two steps, maybe even 10 steps, but a constant number of steps. 
236 00:11:55,420 --> 00:11:58,780 So that's sort of the best case scenario, at least among these options. 237 00:11:58,780 --> 00:12:01,820 Whereas, n squared is going to start to take a long time. 238 00:12:01,820 --> 00:12:03,820 It's going to start to feel slow, because if you 239 00:12:03,820 --> 00:12:06,070 take any value of n and square it, that's 240 00:12:06,070 --> 00:12:08,510 going to imply more and more steps. 241 00:12:08,510 --> 00:12:10,998 So just a bit of jargon, then, to start off today, 242 00:12:10,998 --> 00:12:12,790 whereby we now have this sort of vocabulary 243 00:12:12,790 --> 00:12:16,750 with which to describe the running times of an algorithm in terms of this Big O 244 00:12:16,750 --> 00:12:17,560 notation. 245 00:12:17,560 --> 00:12:19,240 But there's one other notation. 246 00:12:19,240 --> 00:12:22,960 And just as big O refers to an upper bound 247 00:12:22,960 --> 00:12:25,930 on running times, like how many steps maximally, 248 00:12:25,930 --> 00:12:27,970 how much time maximally might an algorithm take, 249 00:12:27,970 --> 00:12:31,730 this omega notation refers to the opposite. 250 00:12:31,730 --> 00:12:34,293 What's a lower bound on the running time of an algorithm? 251 00:12:34,293 --> 00:12:36,460 And we don't need another picture or other formulas. 252 00:12:36,460 --> 00:12:37,740 We can reuse the same one. 253 00:12:37,740 --> 00:12:40,150 So this cheat sheet here just proposes that, 254 00:12:40,150 --> 00:12:43,850 when describing the efficiency or inefficiency of an algorithm 255 00:12:43,850 --> 00:12:45,850 and you want to come up with a lower bound-- 256 00:12:45,850 --> 00:12:48,630 like minimally, how many steps does my algorithm take-- 257 00:12:48,630 --> 00:12:50,380 we can use the same mathematical formulas, 258 00:12:50,380 --> 00:12:54,340 but we can denote that with omega instead of big O. 
259 00:12:54,340 --> 00:12:56,620 So again, looks fancy, but it really just refers 260 00:12:56,620 --> 00:13:01,090 to a wave of the hand trying to sort of ballpark exactly what the running time 261 00:13:01,090 --> 00:13:02,600 is of an algorithm. 262 00:13:02,600 --> 00:13:05,080 And thankfully, we've seen a few algorithms already, 263 00:13:05,080 --> 00:13:08,650 including in that week 0, and now we're going to give it a more formal name. 264 00:13:08,650 --> 00:13:11,750 Linear search is what we did with that phone book 265 00:13:11,750 --> 00:13:15,010 first off by searching it page by page by page, 266 00:13:15,010 --> 00:13:18,710 looking for my phone number in that particular example. 267 00:13:18,710 --> 00:13:23,412 And so the difference today is that, unlike us humans, who 268 00:13:23,412 --> 00:13:26,620 can look down at a phone book page and see a whole bunch of names and numbers 269 00:13:26,620 --> 00:13:30,490 at once, unlike a human who can look at an array on the board a moment ago 270 00:13:30,490 --> 00:13:32,530 and sort of see everything at once, we need 271 00:13:32,530 --> 00:13:34,630 to be more methodical, more deliberate today so 272 00:13:34,630 --> 00:13:39,160 that we can translate week 0's ideas now, not into even pseudocode, 273 00:13:39,160 --> 00:13:40,510 but actual C code. 274 00:13:40,510 --> 00:13:43,810 And so wonderfully, here at the American Repertory Theater 275 00:13:43,810 --> 00:13:46,280 as we are on Harvard's campus this semester, 276 00:13:46,280 --> 00:13:48,610 we've been collaborating with the whole team 277 00:13:48,610 --> 00:13:51,670 here who are much more artistically inclined than certainly I 278 00:13:51,670 --> 00:13:52,900 could be on my own here. 279 00:13:52,900 --> 00:13:55,150 And we have these seven wonderful doors that 280 00:13:55,150 --> 00:13:58,690 were previously used in various theatrical shows that took place here 281 00:13:58,690 --> 00:13:59,680 in this theater. 
282 00:13:59,680 --> 00:14:03,130 And we've even collaborated with the theater's prop shop, 283 00:14:03,130 --> 00:14:07,030 who in back have wonderfully manufactured some delightful numbers 284 00:14:07,030 --> 00:14:08,300 and brought them to life. 285 00:14:08,300 --> 00:14:12,437 Which is to say that, behind each of these seven doors is a number. 286 00:14:12,437 --> 00:14:15,520 And this is going to be an opportunity now to really hammer home the point 287 00:14:15,520 --> 00:14:18,430 that when we want to search for some number in an array, 288 00:14:18,430 --> 00:14:22,900 it's pretty equivalent to having to search for a number, in this case, 289 00:14:22,900 --> 00:14:25,210 behind an otherwise closed door. 290 00:14:25,210 --> 00:14:27,190 You and I can't just look at all of these doors 291 00:14:27,190 --> 00:14:29,020 now and figure out where a number is. 292 00:14:29,020 --> 00:14:30,430 We have to be more methodical. 293 00:14:30,430 --> 00:14:32,440 We have to start searching these doors, maybe 294 00:14:32,440 --> 00:14:35,688 from left to right, maybe from right to left, maybe from the middle on out. 295 00:14:35,688 --> 00:14:38,980 But we need to come up with an algorithm and ultimately translate that to code. 296 00:14:38,980 --> 00:14:43,630 So for instance, suppose I were to search for the number 0. 297 00:14:43,630 --> 00:14:48,700 How could we go about searching, methodically, these seven wooden doors 298 00:14:48,700 --> 00:14:50,200 for the number 0? 299 00:14:50,200 --> 00:14:54,310 Let me take a suggestion from the audience. 300 00:14:54,310 --> 00:14:56,620 What approach might you take? 301 00:14:56,620 --> 00:15:02,540 What first step would you propose I take here on my own with these doors? 302 00:15:02,540 --> 00:15:04,540 Any recommendations? 303 00:15:04,540 --> 00:15:07,030 How do I begin to find myself the number 0? 304 00:15:07,030 --> 00:15:09,280 Florence, what do you propose? 
305 00:15:09,280 --> 00:15:13,523 AUDIENCE: I would propose starting from the left, since 0 is a smaller number. 306 00:15:13,523 --> 00:15:14,440 DAVID MALAN: OK, good. 307 00:15:14,440 --> 00:15:15,880 And hang in there with me for just a moment. 308 00:15:15,880 --> 00:15:18,630 Let me go ahead and start on the left edge as Florence proposes. 309 00:15:18,630 --> 00:15:21,040 Go ahead and open the door, and hopefully, voila-- 310 00:15:21,040 --> 00:15:21,850 no. 311 00:15:21,850 --> 00:15:23,020 It's a number 4. 312 00:15:23,020 --> 00:15:23,920 So it's not a 0. 313 00:15:23,920 --> 00:15:27,150 So Florence, what would you propose I do next? 314 00:15:27,150 --> 00:15:31,790 AUDIENCE: I'd probably start in the middle somewhere, 315 00:15:31,790 --> 00:15:35,533 if, like, in case, I don't know, it's going down by 1. 316 00:15:35,533 --> 00:15:36,200 DAVID MALAN: OK. 317 00:15:36,200 --> 00:15:37,367 So maybe it's going down. 318 00:15:37,367 --> 00:15:38,700 So let me go ahead and try that. 319 00:15:38,700 --> 00:15:41,690 So you propose middle, I could go over here, and voila-- 320 00:15:41,690 --> 00:15:42,620 nope. 321 00:15:42,620 --> 00:15:44,360 That's the number 2. 322 00:15:44,360 --> 00:15:47,820 And I wonder, where else should I look. 323 00:15:47,820 --> 00:15:49,070 Let me-- I'm a little curious. 324 00:15:49,070 --> 00:15:50,570 I'm a little nervous that I ignored these doors. 325 00:15:50,570 --> 00:15:53,320 So Florence, if you don't mind, let's go ahead and look here and-- 326 00:15:53,320 --> 00:15:56,240 no, that's the number 6, it seems. 327 00:15:56,240 --> 00:15:59,690 Let's go ahead and check in here, the number 8. 328 00:15:59,690 --> 00:16:02,670 So they're kind of going up and down. 329 00:16:02,670 --> 00:16:05,150 So Florence, how might I finish searching for this number? 330 00:16:05,150 --> 00:16:08,420 What remains to be done, would you say? 331 00:16:08,420 --> 00:16:10,490 AUDIENCE: Probably start from the right now. 
332 00:16:10,490 --> 00:16:11,157 DAVID MALAN: OK. 333 00:16:11,157 --> 00:16:14,420 So I could start from the right now, and maybe just go over here. 334 00:16:14,420 --> 00:16:16,550 And voila-- and there it is. 335 00:16:16,550 --> 00:16:17,690 So we found the number 0. 336 00:16:17,690 --> 00:16:19,820 So let me ask Florence, what was your algorithm? 337 00:16:19,820 --> 00:16:23,060 How did you go about so successfully finding the number 0 for us? 338 00:16:23,060 --> 00:16:26,130 339 00:16:26,130 --> 00:16:30,990 AUDIENCE: I guess I initially tried starting, like, by going down by 1. 340 00:16:30,990 --> 00:16:36,840 So like, if the number was not at the left, 341 00:16:36,840 --> 00:16:40,490 then going to the center, which is, like, halfway in between 342 00:16:40,490 --> 00:16:41,780 and then going to [INAUDIBLE]. 343 00:16:41,780 --> 00:16:42,410 I don't know. 344 00:16:42,410 --> 00:16:45,780 DAVID MALAN: And playfully, how did that work out for you, going to the middle? 345 00:16:45,780 --> 00:16:48,420 Better or worse, no different? 346 00:16:48,420 --> 00:16:52,290 AUDIENCE: I mean, I guess maybe it helped a little bit 347 00:16:52,290 --> 00:16:54,083 to then go all the way to the right. 348 00:16:54,083 --> 00:16:54,750 DAVID MALAN: OK. 349 00:16:54,750 --> 00:16:56,370 Yeah, we might have gleaned some information. 350 00:16:56,370 --> 00:16:59,203 But let's go ahead and take a look at all of the doors for a moment. 351 00:16:59,203 --> 00:17:00,720 There's that 4 and the 6 again. 352 00:17:00,720 --> 00:17:02,400 Here is that 8 again. 353 00:17:02,400 --> 00:17:04,920 Over in the middle we have the 2 again. 354 00:17:04,920 --> 00:17:08,490 Over here we have a 7 for the first time. 355 00:17:08,490 --> 00:17:10,109 Over here we have a 5. 356 00:17:10,109 --> 00:17:11,996 And then of course, we have a 0. 
357 00:17:11,996 --> 00:17:14,579 And if you took all of that in, honestly, Florence, you and I, 358 00:17:14,579 --> 00:17:16,079 we couldn't really have done any better. 359 00:17:16,079 --> 00:17:18,130 Because these door-- these numbers, it turns out, 360 00:17:18,130 --> 00:17:20,310 are just randomly arranged behind these doors. 361 00:17:20,310 --> 00:17:23,460 So it wasn't bad at all that you kind of hopped around. 362 00:17:23,460 --> 00:17:26,060 Although, the downside is if you hop around, 363 00:17:26,060 --> 00:17:29,642 you and I as humans can pretty easily remember where we've been before. 364 00:17:29,642 --> 00:17:32,100 But if you think about how we would translate that to code, 365 00:17:32,100 --> 00:17:34,500 I feel like we're starting to accumulate a bunch of variables 366 00:17:34,500 --> 00:17:36,417 maybe, because you have to keep track of that. 367 00:17:36,417 --> 00:17:40,140 So frankly, maybe the simplest solution-- whoops-- 368 00:17:40,140 --> 00:17:44,460 maybe the simplest solution would have been where we started in week 0, 369 00:17:44,460 --> 00:17:48,540 where we just take a very simple if naive approach of starting 370 00:17:48,540 --> 00:17:52,920 with our array, this time of size 7, behind which are some numbers. 371 00:17:52,920 --> 00:17:55,530 And if you don't know anything about those numbers, 372 00:17:55,530 --> 00:18:01,260 honestly the best you can do is just that same linear search from week 0, 373 00:18:01,260 --> 00:18:05,190 and just check, one at a time, the values behind each of these doors 374 00:18:05,190 --> 00:18:08,940 and just hope that eventually you will find it. 375 00:18:08,940 --> 00:18:12,360 So this is already sort of taking a lot of time, right? 376 00:18:12,360 --> 00:18:16,090 If I do this linear search approach like I did in week 0, 377 00:18:16,090 --> 00:18:18,840 I'm potentially going to have to search behind all of those doors. 
378 00:18:18,840 --> 00:18:21,090 I'm going to have to search behind all of those doors. 379 00:18:21,090 --> 00:18:24,485 So let's consider a little more formally exactly how I could at least implement 380 00:18:24,485 --> 00:18:25,110 that algorithm. 381 00:18:25,110 --> 00:18:26,730 Because I could take the approach that Florence 382 00:18:26,730 --> 00:18:29,105 proposed, and just kind of jumping around and maybe using 383 00:18:29,105 --> 00:18:30,030 a bit of intuition. 384 00:18:30,030 --> 00:18:31,530 But again, that's not really an algorithm. 385 00:18:31,530 --> 00:18:33,572 We really need to do something more step by step. 386 00:18:33,572 --> 00:18:36,447 And in the meantime, let's go ahead, Joe, and let's close the curtain 387 00:18:36,447 --> 00:18:39,300 and see if we can't clean those up with another problem in a moment, 388 00:18:39,300 --> 00:18:43,180 while we consider now linear search and the analysis thereof. 389 00:18:43,180 --> 00:18:46,650 So with linear search, I would propose that we 390 00:18:46,650 --> 00:18:50,160 could implement it in pseudocode first, if you will, like this. 391 00:18:50,160 --> 00:18:53,280 For i from 0 to n minus 1-- 392 00:18:53,280 --> 00:18:57,090 we'll see where we're going with this-- if the number is behind the i-th door, 393 00:18:57,090 --> 00:19:02,110 return true, otherwise at the very end return false. 394 00:19:02,110 --> 00:19:05,280 So it's a relatively simple translation into pseudocode, 395 00:19:05,280 --> 00:19:08,130 much like we did with the phone book some time ago. 396 00:19:08,130 --> 00:19:10,260 And why, though, these values? 397 00:19:10,260 --> 00:19:13,200 Because I'm now starting to express myself a little more like C, 398 00:19:13,200 --> 00:19:14,770 even though it's still pseudocode. 399 00:19:14,770 --> 00:19:16,800 So for i from 0 to n minus 1. 400 00:19:16,800 --> 00:19:19,320 So computer scientists tend to start counting from 0. 
401 00:19:19,320 --> 00:19:25,230 If there's n doors, or 7 doors in this case, you want to go from 0 on up to 6, 402 00:19:25,230 --> 00:19:27,930 or from 0 on up to n minus 1. 403 00:19:27,930 --> 00:19:31,980 So this is just a very common way of setting yourself up with a for loop, 404 00:19:31,980 --> 00:19:34,320 maybe in C, maybe in pseudocode in this case, that 405 00:19:34,320 --> 00:19:38,140 just gets you from left to right, algorithmically step by step. 406 00:19:38,140 --> 00:19:41,530 If a condition, number is behind the i-th door-- 407 00:19:41,530 --> 00:19:43,710 i-th just being a colloquial way of saying, 408 00:19:43,710 --> 00:19:45,960 what is behind the door at location i-- 409 00:19:45,960 --> 00:19:47,160 go ahead and return true. 410 00:19:47,160 --> 00:19:50,640 I have found myself the number I want, for instance, the number 0. 411 00:19:50,640 --> 00:19:54,640 And then notice that this return false is not part of an else, 412 00:19:54,640 --> 00:19:58,710 because I don't want to abort this algorithm prematurely and abort simply 413 00:19:58,710 --> 00:20:01,950 because a number is not behind the current door. 414 00:20:01,950 --> 00:20:05,820 I essentially want to wait all the way to the end of the algorithm, 415 00:20:05,820 --> 00:20:10,080 after I've checked all n doors, and if I have still not found 416 00:20:10,080 --> 00:20:14,150 the number I care about, then and only then am I going to return false. 417 00:20:14,150 --> 00:20:15,900 So a very common programming mistake might 418 00:20:15,900 --> 00:20:19,920 be to nest this internally and think about things in terms of ifs and elses. 419 00:20:19,920 --> 00:20:21,600 But you don't need to have an else. 420 00:20:21,600 --> 00:20:24,900 This is kind of a catchall here at the very end. 421 00:20:24,900 --> 00:20:29,460 But now let's consider, if this is the pseudocode for linear search, just what 422 00:20:29,460 --> 00:20:32,730 is the efficiency of linear search? 
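The linear search pseudocode above can be sketched in C roughly as follows. This is a minimal sketch, not the lecture's own code: the function name `linear_search` and the parameter names `doors`, `n`, and `target` are assumptions chosen for illustration.

```c
#include <stdbool.h>

// Sketch of the pseudocode: check doors 0 through n - 1, one at a time.
// The names doors, n, and target are illustrative, not from the lecture.
bool linear_search(const int doors[], int n, int target)
{
    for (int i = 0; i < n; i++)
    {
        if (doors[i] == target)
        {
            // The number is behind the i-th door.
            return true;
        }
    }
    // Deliberately not inside an else: this line is reached only
    // after all n doors have been checked without a match.
    return false;
}
```

Placing `return false` after the loop, rather than in an `else`, is what keeps the search from giving up at the first door that does not match.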
423 00:20:32,730 --> 00:20:35,650 What is the efficiency of linear search, which is to say, 424 00:20:35,650 --> 00:20:37,680 how well-designed is this algorithm? 425 00:20:37,680 --> 00:20:40,260 We put or gave ourselves a framework a moment ago, 426 00:20:40,260 --> 00:20:43,500 Big O notation, which is an upper bound, which we can think of for now 427 00:20:43,500 --> 00:20:45,060 as meaning like a worst case. 428 00:20:45,060 --> 00:20:49,380 In the worst case, how many steps might it take me to find the number 0-- 429 00:20:49,380 --> 00:20:51,030 or any number for that matter-- 430 00:20:51,030 --> 00:20:52,800 among n doors? 431 00:20:52,800 --> 00:20:57,060 Is it big O of n squared, big O of n times log n, big O of n, 432 00:20:57,060 --> 00:21:00,360 big O of log n, or big O of one, which, again, just means 433 00:21:00,360 --> 00:21:03,290 a constant fixed number of steps? 434 00:21:03,290 --> 00:21:06,670 Brian, could we go ahead and pull up this question? 435 00:21:06,670 --> 00:21:09,180 Let me go ahead and pull it up on my screen as well. 436 00:21:09,180 --> 00:21:14,250 If you go to our usual URL to propose what you think an upper bound 437 00:21:14,250 --> 00:21:18,210 is on the running time of linear search. 438 00:21:18,210 --> 00:21:18,710 OK. 439 00:21:18,710 --> 00:21:20,940 Indeed, if we consider now the running time of linear search, 440 00:21:20,940 --> 00:21:22,107 it's going to be big O of n. 441 00:21:22,107 --> 00:21:22,678 Why is that? 442 00:21:22,678 --> 00:21:24,720 So in the worst case, the number I'm looking for, 443 00:21:24,720 --> 00:21:27,480 0, might very well be at the end of that list, which 444 00:21:27,480 --> 00:21:31,087 is going to be on the order of n steps, or in this case precisely n steps. 445 00:21:31,087 --> 00:21:32,670 So that's one way to think about this. 446 00:21:32,670 --> 00:21:34,590 Well, now let me ask a follow-up question. 
447 00:21:34,590 --> 00:21:39,090 Proposing instead that we consider omega notation, which is a lower bound 448 00:21:39,090 --> 00:21:40,800 on the running time of an algorithm-- 449 00:21:40,800 --> 00:21:43,320 Brian, could we go ahead and ask this question next? 450 00:21:43,320 --> 00:21:47,470 At that same URL, we'll see a question asking now 451 00:21:47,470 --> 00:21:55,210 for the possible answers for the running time-- 452 00:21:55,210 --> 00:21:58,280 for a lower bound on the running time of linear search. 453 00:21:58,280 --> 00:22:00,800 So let's go ahead and take a look at this one here. 454 00:22:00,800 --> 00:22:03,220 And in just a moment, we'll see as the responses come in, 455 00:22:03,220 --> 00:22:05,830 about 75-plus percent of you are proposing 456 00:22:05,830 --> 00:22:07,630 that it's actually omega of 1. 457 00:22:07,630 --> 00:22:09,220 So omega is a lower bound. 458 00:22:09,220 --> 00:22:10,923 1 refers to constant time. 459 00:22:10,923 --> 00:22:11,590 And why is that? 460 00:22:11,590 --> 00:22:13,507 Let me just take a quick answer on this point. 461 00:22:13,507 --> 00:22:17,890 Among the 75% of you who said one step, or a constant number of steps, 462 00:22:17,890 --> 00:22:19,060 why is that? 463 00:22:19,060 --> 00:22:23,260 How do you think about this lower bound on running time? 464 00:22:23,260 --> 00:22:25,210 How about from Keith? 465 00:22:25,210 --> 00:22:27,600 Why omega of 1? 466 00:22:27,600 --> 00:22:31,120 AUDIENCE: Yeah, you can just open it and be lucky and find it in the first door. 467 00:22:31,120 --> 00:22:31,410 DAVID MALAN: Yeah. 468 00:22:31,410 --> 00:22:32,830 So it really speaks to just that. 469 00:22:32,830 --> 00:22:34,900 You might just get lucky, and the number you're looking for 470 00:22:34,900 --> 00:22:36,290 might be at the very first door. 
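To make the upper and lower bounds concrete, here is a small sketch that counts how many doors linear search opens before it stops. The helper name `linear_search_steps` is hypothetical and exists only for this illustration.

```c
// Returns how many doors were opened before the search stopped,
// whether or not the target was found. Hypothetical helper.
int linear_search_steps(const int doors[], int n, int target)
{
    int steps = 0;
    for (int i = 0; i < n; i++)
    {
        steps++; // opening door i counts as one step
        if (doors[i] == target)
        {
            break; // found it, so stop opening doors
        }
    }
    return steps;
}
```

With seven doors, a target behind the first door costs one step (the omega of 1 case), while a target behind the last door, or a target that is absent, costs all seven steps (the big O of n case).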
471 00:22:36,290 --> 00:22:39,760 So the lower bound, in the best case, if you will, of this algorithm, 472 00:22:39,760 --> 00:22:44,530 linear search might very well be omega of 1 for exactly that reason-- 473 00:22:44,530 --> 00:22:46,540 that you might get lucky and the element 474 00:22:46,540 --> 00:22:48,020 might be there at the beginning. 475 00:22:48,020 --> 00:22:48,670 So that's pretty good. 476 00:22:48,670 --> 00:22:50,378 You really can't do any better than that. 477 00:22:50,378 --> 00:22:54,880 So we have this range now of a lower bound from omega of 1 on up to big O of n 478 00:22:54,880 --> 00:22:57,940 being an upper bound on the running time of linear search. 479 00:22:57,940 --> 00:23:00,760 But of course, we have this other algorithm in our toolkit. 480 00:23:00,760 --> 00:23:04,240 And recall from week 0 that we looked at binary search-- although, 481 00:23:04,240 --> 00:23:05,530 not necessarily by name. 482 00:23:05,530 --> 00:23:09,250 It was that divide-and-conquer third algorithm, where we took the phone book 483 00:23:09,250 --> 00:23:11,950 and split it in half and half and half again. 484 00:23:11,950 --> 00:23:17,140 Now, while I fumbled there, Joe kindly has given us a new set of doors. 485 00:23:17,140 --> 00:23:19,990 If Joe, you could go ahead and reveal our seven doors again, 486 00:23:19,990 --> 00:23:22,510 behind which we still have some numbers. 487 00:23:22,510 --> 00:23:26,440 But I think this time, I'm going to be a little better off. 488 00:23:26,440 --> 00:23:30,010 Cue Joe and the doors behind. 489 00:23:30,010 --> 00:23:30,620 There we go. 490 00:23:30,620 --> 00:23:32,140 So we have our same seven doors. 491 00:23:32,140 --> 00:23:34,930 But behind those doors now is a different arrangement of numbers. 492 00:23:34,930 --> 00:23:39,532 And suppose this time, I want to find myself the number 6.
493 00:23:39,532 --> 00:23:41,740 So the number 6-- we'll change the problem slightly-- 494 00:23:41,740 --> 00:23:44,630 but I'm going to give you one other ingredient this time, 495 00:23:44,630 --> 00:23:46,990 which is going to be key to this working. 496 00:23:46,990 --> 00:23:51,580 Why were Florence and I able to do no better than linear search before? 497 00:23:51,580 --> 00:23:53,530 Why were Florence and I able to do no better 498 00:23:53,530 --> 00:23:57,610 than randomly searching even last time? 499 00:23:57,610 --> 00:24:00,760 What was it about the array of numbers, or the array 500 00:24:00,760 --> 00:24:06,390 of doors, that did not allow me previously to use binary search? 501 00:24:06,390 --> 00:24:08,755 Iris, what do you think? 502 00:24:08,755 --> 00:24:11,380 AUDIENCE: Because we didn't know the numbers are sorted or not. 503 00:24:11,380 --> 00:24:11,860 DAVID MALAN: Yeah. 504 00:24:11,860 --> 00:24:14,060 We didn't know if the numbers were sorted or not. 505 00:24:14,060 --> 00:24:16,510 And indeed, barring that detail, Florence and I 506 00:24:16,510 --> 00:24:19,550 really couldn't have done any better than, say, linear search. 507 00:24:19,550 --> 00:24:22,900 So this time, though, Joe has kindly sorted some numbers 508 00:24:22,900 --> 00:24:24,440 behind these doors for us. 509 00:24:24,440 --> 00:24:26,590 And so if I want to search for the number 6, 510 00:24:26,590 --> 00:24:29,290 now I can begin to use a bit of that information. 511 00:24:29,290 --> 00:24:31,480 So you know what, I'm going to start just like we did with the phone book 512 00:24:31,480 --> 00:24:32,813 and start roughly in the middle. 513 00:24:32,813 --> 00:24:34,310 And voila, number 5. 514 00:24:34,310 --> 00:24:34,810 All right. 515 00:24:34,810 --> 00:24:36,850 So we're pretty close, we're pretty close. 516 00:24:36,850 --> 00:24:38,740 But the thing about binary search, recall, 517 00:24:38,740 --> 00:24:41,080 is that this is now useful information. 
518 00:24:41,080 --> 00:24:44,920 If the numbers are sorted behind these doors, all of the doors to the left 519 00:24:44,920 --> 00:24:48,790 should presumably be lower than 5, and all of the doors to the right 520 00:24:48,790 --> 00:24:51,340 should presumably be larger than 5. 521 00:24:51,340 --> 00:24:54,430 Now, I might kind of cut a corner here and be like, well, if this is 5, 522 00:24:54,430 --> 00:24:57,220 6 is probably right next door, literally. 523 00:24:57,220 --> 00:24:59,530 But again, algorithmically, how might we do this? 524 00:24:59,530 --> 00:25:02,420 We don't want to necessarily consider these special cases. 525 00:25:02,420 --> 00:25:06,580 So more generally, it looks like I now have an array of size 3. 526 00:25:06,580 --> 00:25:10,400 So let me go ahead and apply that same algorithm, voila, to the middle. 527 00:25:10,400 --> 00:25:11,560 Now I have the number 7. 528 00:25:11,560 --> 00:25:14,888 And now it's becoming pretty clear that if the number 6 is present, 529 00:25:14,888 --> 00:25:16,180 it's probably behind this door. 530 00:25:16,180 --> 00:25:21,400 And indeed, if I now look at my remaining array of size 1, and voila, 531 00:25:21,400 --> 00:25:23,470 in the middle there is that number 6. 532 00:25:23,470 --> 00:25:27,700 So this time, I only had to open up three doors instead of all seven, 533 00:25:27,700 --> 00:25:31,390 potentially, or maybe all six doors to find my way to that number, 534 00:25:31,390 --> 00:25:34,060 because I was given this additional ingredient of all 535 00:25:34,060 --> 00:25:35,860 of those numbers being sorted. 536 00:25:35,860 --> 00:25:38,230 So it would seem, then, that you can apply 537 00:25:38,230 --> 00:25:40,960 the better, more efficient, better designed algorithm, now known 538 00:25:40,960 --> 00:25:46,060 as binary search, if only someone like Joe would sort the numbers for you 539 00:25:46,060 --> 00:25:46,820 in advance.
540 00:25:46,820 --> 00:25:50,110 So let's consider now a little more algorithmically 541 00:25:50,110 --> 00:25:51,440 how we might implement this. 542 00:25:51,440 --> 00:25:54,130 So with binary search, let me propose this pseudocode. 543 00:25:54,130 --> 00:25:57,980 If the number is behind the middle door, return true-- we found it. 544 00:25:57,980 --> 00:26:00,562 So if we got lucky, then we might very well 545 00:26:00,562 --> 00:26:02,520 have found the number 6 behind the middle door, 546 00:26:02,520 --> 00:26:03,687 and we would have been done. 547 00:26:03,687 --> 00:26:04,660 But that didn't happen. 548 00:26:04,660 --> 00:26:06,785 And in the general case that probably won't happen. 549 00:26:06,785 --> 00:26:10,452 So if the number is less than that behind the middle door, then 550 00:26:10,452 --> 00:26:12,910 just like with the phone book, I'm going to go to the left, 551 00:26:12,910 --> 00:26:17,020 and I'm going to search the left half of the remaining doors in the array. 552 00:26:17,020 --> 00:26:20,045 Else if the number is greater than that behind the middle door, 553 00:26:20,045 --> 00:26:22,420 then like the phone book I'm going to go ahead and search 554 00:26:22,420 --> 00:26:24,370 the right half of the remaining doors. 555 00:26:24,370 --> 00:26:28,000 But there might still be one final case potentially, 556 00:26:28,000 --> 00:26:32,170 whereby if there's no doors left at all, or no doors in the first place, 557 00:26:32,170 --> 00:26:36,430 I should at least have this one special case where I do say return false. 558 00:26:36,430 --> 00:26:40,272 For instance, if 6, for whatever reason, weren't among those doors 559 00:26:40,272 --> 00:26:41,980 and I were searching for it, I still need 560 00:26:41,980 --> 00:26:44,860 to be able to handle that situation where I can say definitively 561 00:26:44,860 --> 00:26:48,280 return false if I'm left with no further doors to search.
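One way to express the binary search pseudocode above in C is the iterative sketch below. It assumes the array is already sorted in ascending order; the names `binary_search`, `low`, `high`, and `middle` are assumptions for illustration and are not taken from the lecture.

```c
#include <stdbool.h>

// Sketch of binary search over a sorted (ascending) array.
// Assumption: the caller guarantees the array is sorted.
bool binary_search(const int doors[], int n, int target)
{
    int low = 0;       // leftmost door still under consideration
    int high = n - 1;  // rightmost door still under consideration

    // When low passes high, no doors are left to search.
    while (low <= high)
    {
        int middle = low + (high - low) / 2;

        if (doors[middle] == target)
        {
            return true; // number is behind the middle door
        }
        else if (target < doors[middle])
        {
            high = middle - 1; // keep only the left half
        }
        else
        {
            low = middle + 1; // keep only the right half
        }
    }
    // No doors left (or none in the first place).
    return false;
}
```

For a hypothetical sorted arrangement such as 1, 3, 4, 5, 6, 7, 9, a search for 6 looks at 5, then 7, then 6, which matches the three doors opened in the demonstration. Those seven values are made up here; the transcript only reveals the 5, 7, and 6.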
562 00:26:48,280 --> 00:26:51,750 So here, then, might be the pseudocode for this algorithm a bit more formally. 563 00:26:51,750 --> 00:26:53,820 Now let's consider the analysis thereof. 564 00:26:53,820 --> 00:26:58,620 Before, where we left off, linear search was big O of n. 565 00:26:58,620 --> 00:27:00,630 Linear search was big O of n. 566 00:27:00,630 --> 00:27:05,310 This time let's consider where binary search actually falls into place 567 00:27:05,310 --> 00:27:06,908 by asking a different question. 568 00:27:06,908 --> 00:27:09,450 I'm going to go ahead and go back and ask this question now-- 569 00:27:09,450 --> 00:27:13,920 what's an upper bound on the running time of binary search? 570 00:27:13,920 --> 00:27:16,920 An upper bound on the running time of binary search-- 571 00:27:16,920 --> 00:27:21,210 and go ahead and buzz in, if you'd like, similarly to before. 572 00:27:21,210 --> 00:27:25,510 What's an upper bound on the running time of binary search? 573 00:27:25,510 --> 00:27:30,960 And you can see here answers are getting pretty dominant around log n. 574 00:27:30,960 --> 00:27:33,780 And indeed, that jibes with exactly what we did in week 0. 575 00:27:33,780 --> 00:27:36,240 The correct answer is indeed log of n, because that's 576 00:27:36,240 --> 00:27:38,240 going to be the maximum number of times that you 577 00:27:38,240 --> 00:27:40,438 can take a list or an array of a given size 578 00:27:40,438 --> 00:27:42,480 and split it in half and half and half, until you 579 00:27:42,480 --> 00:27:45,360 find the number you're looking for, or ultimately you 580 00:27:45,360 --> 00:27:47,320 don't find that number at all.
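The claim that log n is the number of times a list can be split in half can be checked with a few lines of C. This is only an illustrative sketch; `halvings` is a hypothetical helper name, and it uses integer division, so it computes the base-2 logarithm rounded down.

```c
// Counts how many times n can be halved (integer division)
// before only one element remains. Hypothetical helper.
int halvings(int n)
{
    int count = 0;
    while (n > 1)
    {
        n /= 2;  // discard half of what is left
        count++;
    }
    return count;
}
```

For example, a list of 64 elements can be halved six times before one element is left, whereas linear search over the same list may need up to 64 steps.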
581 00:27:47,320 --> 00:27:54,460 Meanwhile, if we consider now not just the upper bound on this algorithm-- 582 00:27:54,460 --> 00:27:57,930 so in the worst case, binary search takes big O of log n-- 583 00:27:57,930 --> 00:28:00,210 now let's consider a related question which 584 00:28:00,210 --> 00:28:03,420 is, what's a lower bound on the running time of this same algorithm? 585 00:28:03,420 --> 00:28:05,530 What's a lower bound on the running time? 586 00:28:05,530 --> 00:28:07,470 I'll go ahead and pluck this one off myself 587 00:28:07,470 --> 00:28:10,680 and go back to some of the suggestions thus far. 588 00:28:10,680 --> 00:28:13,573 In the best case, maybe, too, you do get lucky, 589 00:28:13,573 --> 00:28:15,990 and the number you're looking for, 6 or some other number, 590 00:28:15,990 --> 00:28:18,250 is smack dab in the middle of the array. 591 00:28:18,250 --> 00:28:20,940 And so maybe indeed you can get away with just one step. 592 00:28:20,940 --> 00:28:25,110 And indeed, a lower bound on binary search now might very well just 593 00:28:25,110 --> 00:28:29,250 be an omega of 1, because in that best case you just get lucky, 594 00:28:29,250 --> 00:28:33,250 and it's right where you happen to start, in this case in the middle. 595 00:28:33,250 --> 00:28:34,690 So we seem to have a range there. 596 00:28:34,690 --> 00:28:37,950 But strictly speaking, it would seem that binary search is better. 597 00:28:37,950 --> 00:28:40,710 Binary search is better than linear search, 598 00:28:40,710 --> 00:28:44,880 because as n gets big, big, big, you can really feel that difference. 599 00:28:44,880 --> 00:28:47,970 In fact, recall from week 0 we played a little bit with these light bulbs. 600 00:28:47,970 --> 00:28:50,698 And right now, all 64 of these light bulbs are on. 
601 00:28:50,698 --> 00:28:53,490 And let's consider for a moment, just to put this into perspective, 602 00:28:53,490 --> 00:28:56,400 how long it would take to use linear search to find 603 00:28:56,400 --> 00:28:58,410 one light bulb among these 64. 604 00:28:58,410 --> 00:29:01,920 And recall that in the worst case, maybe the light bulb, or the number 605 00:29:01,920 --> 00:29:04,480 that we're looking for, is way down there at the end, 606 00:29:04,480 --> 00:29:06,100 but we don't know in advance. 607 00:29:06,100 --> 00:29:09,900 And so Sumner, if you wouldn't mind executing linear search 608 00:29:09,900 --> 00:29:13,410 on these light bulbs, let's just get a feel for the efficiency 609 00:29:13,410 --> 00:29:15,870 or inefficiency of this algorithm. 610 00:29:15,870 --> 00:29:18,360 Linear search in light bulb form. 611 00:29:18,360 --> 00:29:21,060 So you'll notice that one light bulb at a time 612 00:29:21,060 --> 00:29:24,570 is going out, implying that I've searched that door, searched that door, 613 00:29:24,570 --> 00:29:25,890 searched that door. 614 00:29:25,890 --> 00:29:27,990 But we've only gone through 10 or so bulbs, 615 00:29:27,990 --> 00:29:30,400 and we've got another 50-plus to go. 616 00:29:30,400 --> 00:29:34,920 And you can see that if we look inside of these doors one per second, 617 00:29:34,920 --> 00:29:39,010 or turn off these light bulbs one per second, it's going to take a long time. 618 00:29:39,010 --> 00:29:42,310 In fact, it doesn't seem worthwhile to even wait until the very end. 619 00:29:42,310 --> 00:29:45,143 So Sumner, if you wouldn't mind, let's bring all the lights back up, 620 00:29:45,143 --> 00:29:48,300 and let's try once more another algorithm, this one binary search, just 621 00:29:48,300 --> 00:29:52,500 to get, again, a feel of what the running time is of an algorithm, 622 00:29:52,500 --> 00:29:54,750 like binary search that runs in logarithmic time. 
623 00:29:54,750 --> 00:29:58,260 So in just a moment, we'll go ahead and execute binary search 624 00:29:58,260 --> 00:30:01,510 on these light bulbs, the idea being that there's one bulb we care about. 625 00:30:01,510 --> 00:30:04,620 Let's see how fast we can get down to just one bulb out of 64. 626 00:30:04,620 --> 00:30:07,440 So Sumner, on your marks, get set, go. 627 00:30:07,440 --> 00:30:10,850 628 00:30:10,850 --> 00:30:14,102 And we're done just a few steps later. 629 00:30:14,102 --> 00:30:15,560 And then we have this sole light bulb. 630 00:30:15,560 --> 00:30:16,880 That was so much faster. 631 00:30:16,880 --> 00:30:20,330 And in fact, we did this deliberately one iteration at a time. 632 00:30:20,330 --> 00:30:24,920 The algorithm that we just executed with Sumner's and Matt's help, 633 00:30:24,920 --> 00:30:28,400 algorithmically was operating at what's called 1 hertz, 1 hertz. 634 00:30:28,400 --> 00:30:31,400 And if you're unfamiliar with hertz, it's just one something per second. 635 00:30:31,400 --> 00:30:34,855 It's very often used in physics or just in discussions of electricity 636 00:30:34,855 --> 00:30:35,480 more generally. 637 00:30:35,480 --> 00:30:38,090 And indeed, in this case if you're doing one thing per second, 638 00:30:38,090 --> 00:30:41,930 that first algorithm, linear search, might have taken us like 64 seconds 639 00:30:41,930 --> 00:30:44,480 to get all the way to that final light bulb. 640 00:30:44,480 --> 00:30:47,390 But that second algorithm was logarithmic. 641 00:30:47,390 --> 00:30:54,350 And so by going from 64 to 32 to 16 to 8 to 4 to 2 to 1, 642 00:30:54,350 --> 00:30:58,850 we get to the final result much faster, even going at the same pace. 643 00:30:58,850 --> 00:31:01,500 So in fact, if you think of your computer's CPU, 644 00:31:01,500 --> 00:31:03,560 CPUs are also measured in hertz-- 645 00:31:03,560 --> 00:31:06,750 H-E-R-T-Z.
Probably measured in gigahertz, 646 00:31:06,750 --> 00:31:08,960 which is billions of hertz. 647 00:31:08,960 --> 00:31:11,960 So your CPU, the brain of your computer, if it's 1 gigahertz, 648 00:31:11,960 --> 00:31:15,840 that means it can literally do 1 billion things per second. 649 00:31:15,840 --> 00:31:18,770 And here we have this sort of simpler setup of just light bulbs doing 650 00:31:18,770 --> 00:31:20,510 one thing per second. 651 00:31:20,510 --> 00:31:24,290 Your computer can do 1 billion of these kinds of operations per second. 652 00:31:24,290 --> 00:31:27,260 So just imagine, therefore, how much these savings tend 653 00:31:27,260 --> 00:31:30,743 to add up over time if you can take big bites out of these problems at once, 654 00:31:30,743 --> 00:31:32,660 as opposed to doing things like we did in week 655 00:31:32,660 --> 00:31:36,620 0, just one single step at a time. 656 00:31:36,620 --> 00:31:37,380 All right. 657 00:31:37,380 --> 00:31:39,922 Well, let's now go ahead and start to translate this to code. 658 00:31:39,922 --> 00:31:42,677 We have enough tools in our toolkit in C that I think, 659 00:31:42,677 --> 00:31:44,510 based on our discussion of arrays last week, 660 00:31:44,510 --> 00:31:48,330 we can now actually start to build something in code on our own. 661 00:31:48,330 --> 00:31:51,600 So I'm going to go ahead and create a file here in just a moment, 662 00:31:51,600 --> 00:31:55,430 in CS50 IDE, called, for instance, numbers.c. 663 00:31:55,430 --> 00:32:00,320 Let me go ahead and translate this to a file in C code called numbers.c. 664 00:32:00,320 --> 00:32:03,650 The goal at hand is just to implement linear search in code, 665 00:32:03,650 --> 00:32:06,380 just so that we're no longer waving our hands at the pseudocode 666 00:32:06,380 --> 00:32:08,520 but doing things a little more concretely. 667 00:32:08,520 --> 00:32:10,880 So I'm going to go ahead and include cs50.h.
668 00:32:10,880 --> 00:32:13,160 I'm going to go ahead and include stdio.h. 669 00:32:13,160 --> 00:32:15,530 And I'm going to start with no command line arguments, 670 00:32:15,530 --> 00:32:18,647 like we left off last week, but just with main void. 671 00:32:18,647 --> 00:32:21,230 And I'm going to go ahead and give myself an array of numbers, 672 00:32:21,230 --> 00:32:22,820 seven numbers, just like the doors. 673 00:32:22,820 --> 00:32:25,670 And I'm going to go ahead and say int numbers. 674 00:32:25,670 --> 00:32:28,250 And then this is a little trick that we didn't see last week, 675 00:32:28,250 --> 00:32:30,740 but it's handy for creating an array when 676 00:32:30,740 --> 00:32:33,110 you know in advance what numbers you want, which I do, 677 00:32:33,110 --> 00:32:36,440 because I'm going to mimic the doors that Joe kindly set up for us here, I'm 678 00:32:36,440 --> 00:32:44,120 going to go ahead and say give me an array that is equal to 4, 6, 8, 2, 7, 679 00:32:44,120 --> 00:32:45,480 5, 0. 680 00:32:45,480 --> 00:32:47,480 And this is the feature we didn't see last week. 681 00:32:47,480 --> 00:32:50,690 If you know in advance the numbers that you want to assign to an array, 682 00:32:50,690 --> 00:32:54,560 you actually don't have to bother specifying the size of the array 683 00:32:54,560 --> 00:32:55,790 explicitly. 684 00:32:55,790 --> 00:32:58,640 The compiler can figure that out intelligently for you. 685 00:32:58,640 --> 00:33:01,960 But you can use these curly braces with commas 686 00:33:01,960 --> 00:33:05,420 inside to enumerate from left to right the values 687 00:33:05,420 --> 00:33:07,400 that you want to put into that array. 
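The curly-brace initializer described above might look like the sketch below. In the lecture the array is declared inside main; here it sits at file scope only so that the helper can see it. The helper `numbers_length` and its `sizeof` idiom are additions for illustration and are not something the lecture shows at this point.

```c
#include <stddef.h>

// No size between the square brackets: the compiler counts
// the seven values between the curly braces.
static const int numbers[] = {4, 6, 8, 2, 7, 5, 0};

// Hypothetical helper: total bytes in the array divided by the
// bytes in one element gives the number of elements.
size_t numbers_length(void)
{
    return sizeof(numbers) / sizeof(numbers[0]);
}
```

Computing the length this way is one possible alternative to hardcoding 7 in a loop condition, though it only works where the array itself (not a pointer to it) is in scope.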
688 00:33:07,400 --> 00:33:09,727 So after this line 6 has executed in my computer, 689 00:33:09,727 --> 00:33:11,810 I'm going to be left with an array called numbers, 690 00:33:11,810 --> 00:33:14,480 inside of which are seven integers listed 691 00:33:14,480 --> 00:33:17,895 from left to right in the computer's memory, so to speak, in this way. 692 00:33:17,895 --> 00:33:19,770 Now, what do I want to do with these numbers? 693 00:33:19,770 --> 00:33:21,560 Well, let's implement linear search. 694 00:33:21,560 --> 00:33:24,170 Linear search, as we latched on to earlier, 695 00:33:24,170 --> 00:33:27,110 is searching from left to right or equivalently right to left-- 696 00:33:27,110 --> 00:33:29,460 but convention tends to go left to right. 697 00:33:29,460 --> 00:33:31,700 So I'm going to do a standard for loop. 698 00:33:31,700 --> 00:33:34,730 For int i gets 0, i is less than 7-- 699 00:33:34,730 --> 00:33:36,980 I'm going to keep it simple for now and hardcode this, 700 00:33:36,980 --> 00:33:40,400 but we could clean this up if we want, and I'm going to do i++ on each 701 00:33:40,400 --> 00:33:41,130 iteration. 702 00:33:41,130 --> 00:33:45,330 So I'm pretty sure that my line 8 will induce a for loop that 703 00:33:45,330 --> 00:33:47,090 iterates seven total times. 704 00:33:47,090 --> 00:33:49,460 And what question do I want to ask on each iteration? 705 00:33:49,460 --> 00:33:56,270 Well, if the numbers array at location i equals equals-- 706 00:33:56,270 --> 00:34:00,170 for instance, the number I was searching for initially, 707 00:34:00,170 --> 00:34:02,060 let's go ahead and search for 0-- 708 00:34:02,060 --> 00:34:03,620 then what do I want to do? 709 00:34:03,620 --> 00:34:06,860 Let me go ahead and print out something arbitrary but useful, 710 00:34:06,860 --> 00:34:09,787 like "Found," quote, unquote, so the human knows.
711 00:34:09,787 --> 00:34:12,620 And then let me go ahead, and just for good measure, let me go ahead 712 00:34:12,620 --> 00:34:14,000 and return 0. 713 00:34:14,000 --> 00:34:16,110 And we'll come back to that in just a moment. 714 00:34:16,110 --> 00:34:20,389 But at the end of this program, I'm also going to do this-- printf "Not found" 715 00:34:20,389 --> 00:34:21,610 with a backslash n. 716 00:34:21,610 --> 00:34:23,719 And then I'm going to go ahead and return 1. 717 00:34:23,719 --> 00:34:26,000 But before we tease apart those returns, just 718 00:34:26,000 --> 00:34:27,949 consider the code in the aggregate. 719 00:34:27,949 --> 00:34:30,020 Here's my entire main function. 720 00:34:30,020 --> 00:34:32,750 And on line 6, to recap, I initialized the array, 721 00:34:32,750 --> 00:34:37,909 just as we did at the very beginning, with a seemingly random list of numbers 722 00:34:37,909 --> 00:34:39,020 behind the doors. 723 00:34:39,020 --> 00:34:41,570 Then on line 8, I'm going to iterate with this for loop 724 00:34:41,570 --> 00:34:44,600 seven total times, incrementing i in each turn. 725 00:34:44,600 --> 00:34:48,469 And then line 10, just like I was opening the doors one at a time, 726 00:34:48,469 --> 00:34:51,469 I'm going to check if the i-th number in this array 727 00:34:51,469 --> 00:34:54,949 equals equals the number I care about, 0, with that first demo. 728 00:34:54,949 --> 00:34:56,570 I'm going to print "Found." 729 00:34:56,570 --> 00:35:00,110 Otherwise-- not else, per se-- but otherwise, 730 00:35:00,110 --> 00:35:04,550 if I go through this entire loop, checking if, if, if, if, if, and I 731 00:35:04,550 --> 00:35:08,870 never actually find 0, I'm going to have this catchall at the end that just 732 00:35:08,870 --> 00:35:13,790 says no matter what, if you reach line 16, print "Not found," 733 00:35:13,790 --> 00:35:15,505 and then return 1. 734 00:35:15,505 --> 00:35:16,880 Now, this is a bit of a subtlety. 
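The logic of numbers.c as described above can be sketched as follows. In the lecture this code lives directly in main with a hardcoded loop bound of 7; the version here wraps it in a hypothetical helper, `search_status`, so that the same logic can be called with different arrays. The helper name and its parameters are assumptions for illustration.

```c
#include <stdio.h>

// Prints "Found" and returns 0 if target is in the array;
// otherwise prints "Not found" and returns 1.
// Hypothetical helper: the lecture puts this logic in main.
int search_status(const int numbers[], int n, int target)
{
    for (int i = 0; i < n; i++)
    {
        if (numbers[i] == target)
        {
            printf("Found\n");
            return 0; // 0 signals success
        }
    }
    // Catchall: reached only if the loop never found the target.
    printf("Not found\n");
    return 1; // any nonzero value signals failure
}
```

A main function could then end with `return search_status(numbers, 7, 0);` so that the value becomes the program's exit status.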
735 00:35:16,880 --> 00:35:20,800 But could someone remind us what's going on with the return 0 736 00:35:20,800 --> 00:35:26,566 on line 13 and the return 1 on line 17? 737 00:35:26,566 --> 00:35:31,930 Why 0 and 1, and why am I returning at all? 738 00:35:31,930 --> 00:35:35,630 What problem is this solving for me? 739 00:35:35,630 --> 00:35:37,380 Even though most of our programs thus far, 740 00:35:37,380 --> 00:35:39,150 we haven't bothered too much with this. 741 00:35:39,150 --> 00:35:40,710 Demi, is it? 742 00:35:40,710 --> 00:35:42,340 What do you think? 743 00:35:42,340 --> 00:35:45,600 AUDIENCE: It's Demi, but basically, return 0 744 00:35:45,600 --> 00:35:49,490 is like it was executed correctly, or it found it, 745 00:35:49,490 --> 00:35:52,170 and it kind of exits that loop saying that it was found. 746 00:35:52,170 --> 00:35:58,375 And then return 1 is like the return false, and it exits as well. 747 00:35:58,375 --> 00:35:59,250 DAVID MALAN: Exactly. 748 00:35:59,250 --> 00:36:00,917 And "exit" really is the operative word. 749 00:36:00,917 --> 00:36:02,580 In main, when you are done-- 750 00:36:02,580 --> 00:36:05,340 ready to quit the program, as we've done with the word 751 00:36:05,340 --> 00:36:09,000 "quit" in some of our pseudocode in the past, you can literally return a value. 752 00:36:09,000 --> 00:36:11,970 And recall at the end of last week, we introduced the fact 753 00:36:11,970 --> 00:36:14,250 that main always returns an int. 754 00:36:14,250 --> 00:36:17,130 You and I have ignored that for at least a week or two, 755 00:36:17,130 --> 00:36:20,130 but sometimes it's useful to return an explicit value, 756 00:36:20,130 --> 00:36:22,410 whether it's for autograding purposes, whether it's 757 00:36:22,410 --> 00:36:24,735 for automated testing of your code in the real world, 758 00:36:24,735 --> 00:36:27,360 or just so it's a signal to the user that something indeed went 759 00:36:27,360 --> 00:36:28,030 wrong.
760 00:36:28,030 --> 00:36:30,180 So you can return a value from main. 761 00:36:30,180 --> 00:36:33,507 And as Demi proposed, 0 means "all is well." 762 00:36:33,507 --> 00:36:35,340 And it's a little counter-intuitive, because 763 00:36:35,340 --> 00:36:37,330 thus far true tends to be a good thing. 764 00:36:37,330 --> 00:36:39,420 But in this case, 0 is a good thing. 765 00:36:39,420 --> 00:36:40,140 All is well. 766 00:36:40,140 --> 00:36:41,220 It's success. 767 00:36:41,220 --> 00:36:44,700 And if you return any other value, for instance 1, 768 00:36:44,700 --> 00:36:46,660 that indicates that something went wrong. 769 00:36:46,660 --> 00:36:51,390 So the reason I'm printing out, after the word "Found" I'm returning 0, 770 00:36:51,390 --> 00:36:54,498 is so that effectively the program exits at that point. 771 00:36:54,498 --> 00:36:56,790 I don't want to keep going again and again if I already 772 00:36:56,790 --> 00:36:58,320 found the number I care about. 773 00:36:58,320 --> 00:37:02,160 And down here, this one admittedly isn't strictly necessary, 774 00:37:02,160 --> 00:37:06,120 because if I hit line 16 and maybe deleted line 17, 775 00:37:06,120 --> 00:37:07,740 the program's going to end anyway. 776 00:37:07,740 --> 00:37:10,800 But there wouldn't be that so-called exit status 777 00:37:10,800 --> 00:37:12,960 that we discussed last week briefly, whereby 778 00:37:12,960 --> 00:37:15,293 you can kind of signal to the computer whether something 779 00:37:15,293 --> 00:37:17,190 was successful or unsuccessful. 780 00:37:17,190 --> 00:37:21,690 And the reason that 0 is a good thing and 1 or any other number is not, 781 00:37:21,690 --> 00:37:24,780 consider how many things can go wrong in programs that you write 782 00:37:24,780 --> 00:37:26,607 or that companies in the real world write 783 00:37:26,607 --> 00:37:28,440 when you get those error messages, sometimes 784 00:37:28,440 --> 00:37:30,060 with those cryptic error codes. 
785 00:37:30,060 --> 00:37:32,430 There are hundreds, thousands of problems 786 00:37:32,430 --> 00:37:35,790 that might happen in a computer program that could be that many error 787 00:37:35,790 --> 00:37:37,680 codes that you see on the screen, reasons 788 00:37:37,680 --> 00:37:40,630 explaining why the program crashed or froze or the like. 789 00:37:40,630 --> 00:37:46,020 But 0 is sort of special in that it's just one value that the world has 790 00:37:46,020 --> 00:37:47,760 decided means "success." 791 00:37:47,760 --> 00:37:50,460 So there's only one way to get your program right, in a sense, 792 00:37:50,460 --> 00:37:54,000 but there's so many millions of ways in which things can go wrong. 793 00:37:54,000 --> 00:37:57,810 And that's why humans have adopted that particular convention. 794 00:37:57,810 --> 00:37:58,310 All right. 795 00:37:58,310 --> 00:38:02,378 But let's consider now not just numbers, but let's make things more interesting. 796 00:38:02,378 --> 00:38:04,170 Besides the doors, suppose that we actually 797 00:38:04,170 --> 00:38:06,300 had people's names behind them. 798 00:38:06,300 --> 00:38:08,790 Well, let's go ahead and write a program this time 799 00:38:08,790 --> 00:38:13,130 that not only searches for numbers, but instead searches for names. 800 00:38:13,130 --> 00:38:16,850 So I'm going to go ahead and create a different file here called names.c. 801 00:38:16,850 --> 00:38:18,600 And I'm going to start a little similarly. 802 00:38:18,600 --> 00:38:23,460 I'm going to include cs50.h at the top, I'm going to include stdio at the top. 803 00:38:23,460 --> 00:38:26,332 But I'm also this time going to include string.h, 804 00:38:26,332 --> 00:38:28,040 which we introduced briefly last week, so 805 00:38:28,040 --> 00:38:31,110 that we have access to strlen for getting the length of a string, 806 00:38:31,110 --> 00:38:33,100 and, it turns out, some other functions. 
807 00:38:33,100 --> 00:38:35,790 Let me go ahead and declare int main void as usual. 808 00:38:35,790 --> 00:38:38,190 And then inside here, I need some arbitrary names. 809 00:38:38,190 --> 00:38:40,260 So let's come up with seven names here. 810 00:38:40,260 --> 00:38:43,825 And here, too, I can declare an array just as I did before. 811 00:38:43,825 --> 00:38:45,450 But it doesn't have to store only ints. 812 00:38:45,450 --> 00:38:47,400 It can store strings instead. 813 00:38:47,400 --> 00:38:49,560 So I've changed the data type from int to string, 814 00:38:49,560 --> 00:38:52,480 and I've changed the variable name from numbers to names. 815 00:38:52,480 --> 00:38:55,350 And I can still use this new curly brace notation, 816 00:38:55,350 --> 00:39:01,470 and I can give myself a name like Bill, and maybe Charlie, and maybe Fred, 817 00:39:01,470 --> 00:39:07,480 and maybe George, and maybe Ginny, and maybe Percy, and lastly, 818 00:39:07,480 --> 00:39:09,510 maybe a name like Ron. 819 00:39:09,510 --> 00:39:12,700 And it just barely fits on my screen. 820 00:39:12,700 --> 00:39:16,110 So with that said, I now have this array of names. 821 00:39:16,110 --> 00:39:19,620 And beyond there being a perhaps obvious pattern to them, 822 00:39:19,620 --> 00:39:23,310 there's a second less obvious, or maybe obvious, pattern to them. 823 00:39:23,310 --> 00:39:28,590 How would you describe the list of names I arbitrarily just came up with? 824 00:39:28,590 --> 00:39:31,350 What's a useful characteristic of them? 825 00:39:31,350 --> 00:39:34,410 What do you notice about these names? 826 00:39:34,410 --> 00:39:37,530 And there's at least two right answers to this question, I think. 827 00:39:37,530 --> 00:39:40,370 What do you notice about these names? 828 00:39:40,370 --> 00:39:41,533 Jack? 829 00:39:41,533 --> 00:39:43,200 AUDIENCE: They're in alphabetical order. 830 00:39:43,200 --> 00:39:43,500 DAVID MALAN: Yes. 
831 00:39:43,500 --> 00:39:46,332 So beyond being the names of the Weasley children in Harry Potter, 832 00:39:46,332 --> 00:39:47,790 they're also in alphabetical order. 833 00:39:47,790 --> 00:39:49,080 And that's the more salient detail. 834 00:39:49,080 --> 00:39:50,970 For our purposes, I've had the forethought 835 00:39:50,970 --> 00:39:53,340 this time to sort these names in advance. 836 00:39:53,340 --> 00:39:56,580 And if I've sorted these names, that means implicitly 837 00:39:56,580 --> 00:39:59,670 I can use a better algorithm than linear search. 838 00:39:59,670 --> 00:40:02,160 I can use, for instance, our old binary search. 839 00:40:02,160 --> 00:40:05,615 But let's go ahead first and just search them naively for now. 840 00:40:05,615 --> 00:40:07,740 Let's still apply linear search, because, you know, 841 00:40:07,740 --> 00:40:12,925 what we haven't yet done is necessarily compare strings against one another. 842 00:40:12,925 --> 00:40:15,300 We've done a lot of comparisons of numbers like integers. 843 00:40:15,300 --> 00:40:16,178 But what about names? 844 00:40:16,178 --> 00:40:17,470 So let me go ahead and do this. 845 00:40:17,470 --> 00:40:22,200 So for int i gets 0, just like before, i less than 7, i++-- 846 00:40:22,200 --> 00:40:25,200 and I'm doing this only because I know in advance there are seven names. 847 00:40:25,200 --> 00:40:27,630 I think we could probably improve the design of this code, 848 00:40:27,630 --> 00:40:31,200 too, by having a variable or a constant storing that value. 849 00:40:31,200 --> 00:40:34,470 But I'm going to keep it simple and focus only on the new details for now. 850 00:40:34,470 --> 00:40:38,280 And it turns out, for reasons we'll explore in more detail next week, 851 00:40:38,280 --> 00:40:42,560 it is not sufficient to do what we did before and do something like this 852 00:40:42,560 --> 00:40:44,330 if I'm searching for "Ron." 
853 00:40:44,330 --> 00:40:49,340 It turns out that in C, you can't use equals equals to compare two strings. 854 00:40:49,340 --> 00:40:52,080 You can for an int, you can for a char. 855 00:40:52,080 --> 00:40:53,870 And we've done both of those in the past. 856 00:40:53,870 --> 00:40:57,920 But there's a subtlety that we'll dive into in more detail next week that 857 00:40:57,920 --> 00:41:00,003 means you can't actually do this. 858 00:41:00,003 --> 00:41:02,420 And this is curious, because if you have prior programming 859 00:41:02,420 --> 00:41:06,270 experience in languages like Python or the like, you can do this. 860 00:41:06,270 --> 00:41:09,770 So in C you can't, but we'll see next time why. 861 00:41:09,770 --> 00:41:13,040 But for now, it turns out that C can solve this problem, 862 00:41:13,040 --> 00:41:15,600 and historically the way you do this is with a function. 863 00:41:15,600 --> 00:41:18,560 So inside of the string.h header file, there 864 00:41:18,560 --> 00:41:22,790 is not only a declaration for strlen, the length of a string like last week. 865 00:41:22,790 --> 00:41:25,400 There's another function called strcmp. 866 00:41:25,400 --> 00:41:29,000 And "stir compare," for short, S-T-R-C-M-P, 867 00:41:29,000 --> 00:41:33,590 allows me to pass in two strings, one string that I want to compare against 868 00:41:33,590 --> 00:41:34,880 another string. 869 00:41:34,880 --> 00:41:37,250 So it's not quite the same syntax. 870 00:41:37,250 --> 00:41:38,830 Indeed, it's a little harder to read. 871 00:41:38,830 --> 00:41:40,580 It's not quite as simple as equals equals. 872 00:41:40,580 --> 00:41:43,760 But strcmp, if we read the documentation for it, 873 00:41:43,760 --> 00:41:46,430 will tell us that this compares two strings. 874 00:41:46,430 --> 00:41:49,160 And it returns one of three possible values. 
875 00:41:49,160 --> 00:41:54,410 If those two strings are equal, that is, identically the same letter for letter, 876 00:41:54,410 --> 00:41:58,550 then this function is going to return 0, it turns out. 877 00:41:58,550 --> 00:42:03,470 If the first string is supposed to come before the second string 878 00:42:03,470 --> 00:42:06,760 alphabetically, in some sense, then this function 879 00:42:06,760 --> 00:42:09,140 is going to return a negative value. 880 00:42:09,140 --> 00:42:13,550 If the first string is supposed to come after the second string alphabetically, 881 00:42:13,550 --> 00:42:16,320 if you will, then it's going to return a positive value. 882 00:42:16,320 --> 00:42:20,690 So there's three possible outcomes-- either equal to 0, or less than 0, 883 00:42:20,690 --> 00:42:22,460 or greater than 0. 884 00:42:22,460 --> 00:42:26,090 But you'll notice, and in fact, if you look at the documentation sometime, 885 00:42:26,090 --> 00:42:30,890 it doesn't specify what value less than 0 or what value greater than 0. 886 00:42:30,890 --> 00:42:34,790 You have to just check for any negative value or any positive value. 887 00:42:34,790 --> 00:42:37,070 And I also told a bit of a white lie a moment ago. 888 00:42:37,070 --> 00:42:39,620 This does not check things alphabetically, 889 00:42:39,620 --> 00:42:42,530 even though it coincidentally does sometimes. 890 00:42:42,530 --> 00:42:45,590 It actually compares strings in what's called ASCII order, 891 00:42:45,590 --> 00:42:49,670 or ASCIIbetically, which is kind of a goofy way of describing that 892 00:42:49,670 --> 00:42:54,830 this function looks at every character in the two strings, from left to right, 893 00:42:54,830 --> 00:42:59,570 it checks the ASCII values of them, and then it compares those ASCII values 894 00:42:59,570 --> 00:43:01,440 character by character.
895 00:43:01,440 --> 00:43:03,680 And if one ASCII value is less than the other, 896 00:43:03,680 --> 00:43:06,780 then it returns a negative value or vice versa. 897 00:43:06,780 --> 00:43:10,790 So if you have, for instance, the letter A, capital A in the string, 898 00:43:10,790 --> 00:43:12,963 that gets converted first to 65. 899 00:43:12,963 --> 00:43:15,380 And then if you have an A in the other string capitalized, 900 00:43:15,380 --> 00:43:18,455 it, too, gets converted to 65, and those would be equal. 901 00:43:18,455 --> 00:43:21,080 But of course, all of these names have more than one character, 902 00:43:21,080 --> 00:43:25,620 so this ASCII order, or ASCIIbetical, proceeds left to right 903 00:43:25,620 --> 00:43:29,750 so that strcmp checks every character in the names for you. 904 00:43:29,750 --> 00:43:33,290 And it stops when it hits that terminating null character. 905 00:43:33,290 --> 00:43:35,480 Recall that strings, underneath the hood, 906 00:43:35,480 --> 00:43:39,780 always end in C with this backslash 0, or eight 0 bits. 907 00:43:39,780 --> 00:43:43,470 So that's how strcmp knows when to stop comparing values. 908 00:43:43,470 --> 00:43:46,257 But if I go ahead and find someone like Ron, let me go ahead 909 00:43:46,257 --> 00:43:47,840 and print out quote, unquote, "Found." 910 00:43:47,840 --> 00:43:50,840 And like before, I'll go ahead and return, like Demi proposed, 911 00:43:50,840 --> 00:43:54,710 0, just to imply that all is successful. 912 00:43:54,710 --> 00:43:57,360 Otherwise, if we get all the way to the bottom of my code, 913 00:43:57,360 --> 00:44:00,620 I'm going to print out "Not found" to tell the story that we did not 914 00:44:00,620 --> 00:44:04,310 find Ron in this array, even though he does happen to be there, 915 00:44:04,310 --> 00:44:06,570 and I'm going to go ahead and return 1.
916 00:44:06,570 --> 00:44:08,480 So even though I've hardcoded everything-- 917 00:44:08,480 --> 00:44:11,960 to hardcode something in a program means to type it out explicitly-- 918 00:44:11,960 --> 00:44:15,200 you could imagine using a command line argument like last week 919 00:44:15,200 --> 00:44:16,310 to get user's input. 920 00:44:16,310 --> 00:44:17,720 Who would you like to search for? 921 00:44:17,720 --> 00:44:20,900 You could imagine using get_string to get user's input and ask them, 922 00:44:20,900 --> 00:44:22,460 who would you like to search for? 923 00:44:22,460 --> 00:44:26,150 But for now, just for demonstration sake, I've used only Ron's name. 924 00:44:26,150 --> 00:44:28,310 And if I haven't made any typos-- 925 00:44:28,310 --> 00:44:35,030 let me go ahead and type in make names, Enter, so far so good, ./names. 926 00:44:35,030 --> 00:44:37,400 And hopefully, we'll see, indeed, "Found," 927 00:44:37,400 --> 00:44:42,140 because "Ron" is very much in this array of seven siblings. 928 00:44:42,140 --> 00:44:44,120 But the building blocks that are new here 929 00:44:44,120 --> 00:44:47,540 are, again, the fact that when we declare an array of some fixed size 930 00:44:47,540 --> 00:44:49,580 we don't strictly need to put a number here, 931 00:44:49,580 --> 00:44:51,890 and we have this curly brace notation when we 932 00:44:51,890 --> 00:44:54,110 know the array's contents in advance. 933 00:44:54,110 --> 00:44:56,150 But perhaps lastly and most powerfully, we 934 00:44:56,150 --> 00:45:01,730 do have this function in C called strcmp that will allow us to actually 935 00:45:01,730 --> 00:45:05,370 store and compare strings in this way. 
936 00:45:05,370 --> 00:45:07,550 So let me pause here and just ask if there's 937 00:45:07,550 --> 00:45:12,320 any questions about how we translated these ideas to code for numbers, 938 00:45:12,320 --> 00:45:15,650 and how we translated these ideas to code now for 939 00:45:15,650 --> 00:45:19,280 names, each time using linear search, not binary. 940 00:45:19,280 --> 00:45:21,670 Caleb, question? 941 00:45:21,670 --> 00:45:22,400 AUDIENCE: Yeah. 942 00:45:22,400 --> 00:45:29,585 So would that program still work if "Ron," for example, was like all caps, 943 00:45:29,585 --> 00:45:30,960 like if you're trying to search-- 944 00:45:30,960 --> 00:45:35,643 like, if the cases are different in terms of uppercase and lowercase? 945 00:45:35,643 --> 00:45:37,060 DAVID MALAN: Really good question. 946 00:45:37,060 --> 00:45:40,510 And let me propose an instinct that's useful to acquire in general-- 947 00:45:40,510 --> 00:45:41,457 when in doubt, try it. 948 00:45:41,457 --> 00:45:42,790 So I'm going to do exactly that. 949 00:45:42,790 --> 00:45:44,998 I do happen to know the answer, but suppose I didn't. 950 00:45:44,998 --> 00:45:47,560 Let me go ahead and change "Ron" to all caps, just because. 951 00:45:47,560 --> 00:45:49,420 Maybe the human, the Caps Lock key was on, 952 00:45:49,420 --> 00:45:51,220 and they typed it in a little sloppily. 953 00:45:51,220 --> 00:45:53,300 Let me go ahead and make no other changes. 954 00:45:53,300 --> 00:45:57,820 Notice that I'm leaving the original array alone with only a capital R. 955 00:45:57,820 --> 00:46:01,900 Let me remake this program, make names, ./names. 956 00:46:01,900 --> 00:46:06,950 And voila, he's still, in fact, found. 957 00:46:06,950 --> 00:46:10,860 958 00:46:10,860 --> 00:46:11,570 Stand by. 959 00:46:11,570 --> 00:46:14,690 960 00:46:14,690 --> 00:46:15,960 Oh, OK.
961 00:46:15,960 --> 00:46:19,910 962 00:46:19,910 --> 00:46:22,280 Caleb, you have just helped me unearth a bug that 963 00:46:22,280 --> 00:46:23,810 was latent in the previous example. 964 00:46:23,810 --> 00:46:26,600 None of you should have accepted the fact 965 00:46:26,600 --> 00:46:29,317 that the previous program worked with "RON," because I didn't 966 00:46:29,317 --> 00:46:30,900 practice literally what I'm preaching. 967 00:46:30,900 --> 00:46:34,070 So Caleb, hold that thought for just a moment so I can rewind a little bit 968 00:46:34,070 --> 00:46:35,540 and fix my apparent bug. 969 00:46:35,540 --> 00:46:37,340 So "RON" was indeed found. 970 00:46:37,340 --> 00:46:39,650 But he wasn't found because "RON" was found. 971 00:46:39,650 --> 00:46:41,240 I did something stupid here. 972 00:46:41,240 --> 00:46:45,440 And it's perhaps all the more pedagogically appropriate 973 00:46:45,440 --> 00:46:47,420 now to highlight that. 974 00:46:47,420 --> 00:46:53,870 So how did this program say "Ron" was found, even though this time it also 975 00:46:53,870 --> 00:46:57,050 says "RON" was found in all caps? 976 00:46:57,050 --> 00:47:00,480 And you know what, let me get a little curious here. 977 00:47:00,480 --> 00:47:03,920 Let me go ahead and search for, not even "Ron." 978 00:47:03,920 --> 00:47:07,310 How about we search for Ron's mom, "Molly"? 979 00:47:07,310 --> 00:47:08,600 Make names. 980 00:47:08,600 --> 00:47:09,680 All right. 981 00:47:09,680 --> 00:47:14,330 And now, just to reveal that I really did do something stupid, ./names. 982 00:47:14,330 --> 00:47:16,910 OK, now something's clearly wrong, right? 983 00:47:16,910 --> 00:47:22,473 I can even search for the father, "Arthur," make names, ./names. 984 00:47:22,473 --> 00:47:25,640 It seems that I wrote you a program that just literally always says "Found." 985 00:47:25,640 --> 00:47:27,830 So we shouldn't have accepted this as correct.
986 00:47:27,830 --> 00:47:32,870 Can anyone spot the bug based on my definition thus far? 987 00:47:32,870 --> 00:47:34,880 Can anyone spot the bug? 988 00:47:34,880 --> 00:47:38,900 In the meantime, this isn't really a bad time to open up the duck 989 00:47:38,900 --> 00:47:41,130 and say, "Hello, duck. 990 00:47:41,130 --> 00:47:50,240 I am having a problem whereby my program is always printing Found 991 00:47:50,240 --> 00:47:54,320 even when someone is not in the array." 992 00:47:54,320 --> 00:47:56,630 And I could proceed to explain my logic to the duck, 993 00:47:56,630 --> 00:48:00,640 but hopefully Sophia can point me at the solution even faster than the duck. 994 00:48:00,640 --> 00:48:03,350 AUDIENCE: You need to compare the value that we 995 00:48:03,350 --> 00:48:05,910 received from strcmp with something. 996 00:48:05,910 --> 00:48:08,030 So we need to compare it with like 0 and make sure 997 00:48:08,030 --> 00:48:11,265 that we receive the value that they're equal. 998 00:48:11,265 --> 00:48:12,140 DAVID MALAN: Perfect. 999 00:48:12,140 --> 00:48:16,400 So I said the right thing, but I literally did not do the right thing. 1000 00:48:16,400 --> 00:48:19,070 If I want to check for equality, I literally 1001 00:48:19,070 --> 00:48:22,580 need to check the return value when comparing names bracket i 1002 00:48:22,580 --> 00:48:24,380 against "Ron" to equal 0. 1003 00:48:24,380 --> 00:48:28,130 Because only in the case when the return value of strcmp is 0 1004 00:48:28,130 --> 00:48:31,280 do I actually have a match. 1005 00:48:31,280 --> 00:48:36,590 By contrast, if the function returns a negative value or the function 1006 00:48:36,590 --> 00:48:39,470 returns a positive value, that means it's not a match. 1007 00:48:39,470 --> 00:48:42,230 That means that one name is supposed to come before the other 1008 00:48:42,230 --> 00:48:44,240 or after the other.
1009 00:48:44,240 --> 00:48:47,870 But the catch with my shorthand syntax here, which is not always 1010 00:48:47,870 --> 00:48:52,880 an incorrect syntax to use, whenever you have a Boolean expression inside 1011 00:48:52,880 --> 00:48:56,340 of which is a function call like this-- 1012 00:48:56,340 --> 00:48:58,550 notice that the entirety of my Boolean expression 1013 00:48:58,550 --> 00:49:01,830 is just a call, so to speak, to strcmp. 1014 00:49:01,830 --> 00:49:06,350 I'm passing in two inputs, names bracket i and quote, unquote "Ron." 1015 00:49:06,350 --> 00:49:10,070 And therefore, I'm expecting strcmp to return output, a so-called return 1016 00:49:10,070 --> 00:49:10,730 value. 1017 00:49:10,730 --> 00:49:15,200 That return value is going to be negative or positive or 0. 1018 00:49:15,200 --> 00:49:18,530 And in fact, to be clear, if the first name being searched 1019 00:49:18,530 --> 00:49:22,760 for is "Bill" and names bracket i or names bracket 0 1020 00:49:22,760 --> 00:49:26,540 is "Bill," "Bill" comma "Ron" is effectively what 1021 00:49:26,540 --> 00:49:28,730 my input is on the first iteration. 1022 00:49:28,730 --> 00:49:31,820 "Bill," alphabetically and ASCIIbetically, 1023 00:49:31,820 --> 00:49:34,520 comes before "Ron," which means it should 1024 00:49:34,520 --> 00:49:36,980 be returning a negative value to me. 1025 00:49:36,980 --> 00:49:41,450 And the problem with Boolean expressions is, as implemented in this context, 1026 00:49:41,450 --> 00:49:44,870 is that only 0 is false. 1027 00:49:44,870 --> 00:49:50,240 Any other return value is by definition true or a Yes answer, 1028 00:49:50,240 --> 00:49:53,810 whether it's negative 1 or positive 1, negative 1 million 1029 00:49:53,810 --> 00:49:55,340 or positive 1 million-- 1030 00:49:55,340 --> 00:49:58,850 any non-zero value in a computer language like C 1031 00:49:58,850 --> 00:50:02,660 is considered true, also known as truthy. 
1032 00:50:02,660 --> 00:50:06,260 Any value that is 0 is considered false, but only 1033 00:50:06,260 --> 00:50:07,880 that value is considered false. 1034 00:50:07,880 --> 00:50:12,560 So really, I was getting lucky at first, because my program was finding "Bill," 1035 00:50:12,560 --> 00:50:14,450 but I was confusing "Bill" for "Ron." 1036 00:50:14,450 --> 00:50:17,960 Then when I did it again for Caleb and I capitalized "Ron," 1037 00:50:17,960 --> 00:50:20,690 I was getting unlucky, because suddenly I 1038 00:50:20,690 --> 00:50:23,030 knew "RON" capitalized wasn't in the array, 1039 00:50:23,030 --> 00:50:24,590 and yet I'm still saying he's found. 1040 00:50:24,590 --> 00:50:28,580 But that's because I didn't practice what I preach per Sophia's find. 1041 00:50:28,580 --> 00:50:31,490 And so if I actually compare this against 0-- and now, Caleb, 1042 00:50:31,490 --> 00:50:33,260 we come full circle to your question-- 1043 00:50:33,260 --> 00:50:37,970 I rebuild this program with make names, I now do ./names and search for all 1044 00:50:37,970 --> 00:50:43,280 caps "RON," I should now see, thankfully, "Not found." 1045 00:50:43,280 --> 00:50:46,550 So I wish I could say that was deliberate, but thus 1046 00:50:46,550 --> 00:50:48,410 is the common case of bugs. 1047 00:50:48,410 --> 00:50:50,820 So here I am 20 years later making bugs in my code. 1048 00:50:50,820 --> 00:50:53,360 So if you run up to a similar problem this week, 1049 00:50:53,360 --> 00:50:58,910 rest assured that it never ends. 1050 00:50:58,910 --> 00:51:01,430 But hopefully you won't have several people watching you 1051 00:51:01,430 --> 00:51:03,120 while you do your problem set this week. 1052 00:51:03,120 --> 00:51:03,620 All right. 1053 00:51:03,620 --> 00:51:05,660 Any questions, then, beyond Caleb's? 1054 00:51:05,660 --> 00:51:07,640 So great question, Caleb, and the answer is no. 1055 00:51:07,640 --> 00:51:09,540 It is case sensitive. 
1056 00:51:09,540 --> 00:51:11,780 So it does not find "Rob"-- 1057 00:51:11,780 --> 00:51:13,270 "RON." 1058 00:51:13,270 --> 00:51:16,530 Any questions here? 1059 00:51:16,530 --> 00:51:20,640 Any questions on linear search using strings? 1060 00:51:20,640 --> 00:51:21,140 No? 1061 00:51:21,140 --> 00:51:24,230 All right, well, let's go ahead and do one final example, I think, 1062 00:51:24,230 --> 00:51:25,280 with searching. 1063 00:51:25,280 --> 00:51:27,260 But let's introduce just one other feature. 1064 00:51:27,260 --> 00:51:29,540 And this one's actually pretty cool and powerful. 1065 00:51:29,540 --> 00:51:33,350 Up until now, we've been using data types that just come with C 1066 00:51:33,350 --> 00:51:37,770 or come from CS50, like int, and char, and float, and the like. 1067 00:51:37,770 --> 00:51:41,330 And you'll see now that there's actually sometimes 1068 00:51:41,330 --> 00:51:44,570 reasons where you or I might want to create our own custom data types, 1069 00:51:44,570 --> 00:51:48,500 our own types that didn't exist when C itself was invented. 1070 00:51:48,500 --> 00:51:51,890 So for instance, suppose that I wanted to represent not just 1071 00:51:51,890 --> 00:51:54,440 a whole bunch of numbers and not just a whole bunch of names, 1072 00:51:54,440 --> 00:51:57,065 but suppose I want to implement like a full-fledged phone book. 1073 00:51:57,065 --> 00:52:00,110 A phone book, of course, contains both names and numbers. 1074 00:52:00,110 --> 00:52:02,870 And suppose I want to combine these two ideas together. 1075 00:52:02,870 --> 00:52:06,950 Wouldn't it be nice if I could have a data structure that 1076 00:52:06,950 --> 00:52:10,550 is a data type that has some structure to it that can actually 1077 00:52:10,550 --> 00:52:11,850 store both at once? 
1078 00:52:11,850 --> 00:52:16,670 And in fact, wouldn't it be nice if C had a data type called person, 1079 00:52:16,670 --> 00:52:19,550 so that if I want to represent a person, like in a phone book, who 1080 00:52:19,550 --> 00:52:22,490 had both a name and a number, I can actually 1081 00:52:22,490 --> 00:52:27,320 implement that in code by calling that variable of type person? 1082 00:52:27,320 --> 00:52:30,170 Now, of course, the designers of C did not have the foresight 1083 00:52:30,170 --> 00:52:32,308 to create a data type called person. 1084 00:52:32,308 --> 00:52:34,100 And, indeed, that would be a slippery slope 1085 00:52:34,100 --> 00:52:37,820 if they had a data type for every real-world entity you can think of. 1086 00:52:37,820 --> 00:52:40,830 But they did give us the capabilities to do this. 1087 00:52:40,830 --> 00:52:44,810 So if a person, in our limited world here of phone books, 1088 00:52:44,810 --> 00:52:49,610 has both a name and a number, we might think of it as follows-- 1089 00:52:49,610 --> 00:52:51,770 a name and a number, both of type string. 1090 00:52:51,770 --> 00:52:53,060 But a quick check here. 1091 00:52:53,060 --> 00:52:56,330 Why have I now decided, somewhat presumptuously, 1092 00:52:56,330 --> 00:53:00,110 to call phone numbers strings as well? 1093 00:53:00,110 --> 00:53:02,360 We've been talking about ints behind these doors. 1094 00:53:02,360 --> 00:53:04,580 We've been searching for ints in code. 1095 00:53:04,580 --> 00:53:08,150 But why did I just presume to propose that we instead 1096 00:53:08,150 --> 00:53:12,560 implement a phone book using strings for names and numbers? 1097 00:53:12,560 --> 00:53:14,650 Any thoughts here, Kurt? 1098 00:53:14,650 --> 00:53:16,070 AUDIENCE: Yeah. 1099 00:53:16,070 --> 00:53:17,570 Because we're not doing math on it. 1100 00:53:17,570 --> 00:53:21,065 It's like-- a phone number could be, like, letters for all we care.
1101 00:53:21,065 --> 00:53:22,940 And in fact, I mean, sometimes you see, like, 1102 00:53:22,940 --> 00:53:26,160 1-800 Contacts or something like that, and maybe we want to allow that. 1103 00:53:26,160 --> 00:53:27,410 DAVID MALAN: Yeah, absolutely. 1104 00:53:27,410 --> 00:53:30,780 A phone number, despite its name, isn't necessarily just a number. 1105 00:53:30,780 --> 00:53:33,710 It might be 1-800 Contacts, which is an English word. 1106 00:53:33,710 --> 00:53:36,020 It might have hyphens in it or dashes. 1107 00:53:36,020 --> 00:53:37,640 It might have parentheses in it. 1108 00:53:37,640 --> 00:53:39,800 It might have a plus sign for country codes. 1109 00:53:39,800 --> 00:53:41,720 There's a lot of characters that we absolutely 1110 00:53:41,720 --> 00:53:46,257 can represent in C using strings that we couldn't represent in C using int. 1111 00:53:46,257 --> 00:53:48,090 And so indeed, even though in the real world 1112 00:53:48,090 --> 00:53:51,710 there are these "numbers" that you and I talk about once in a while like phone 1113 00:53:51,710 --> 00:53:55,580 numbers, maybe in the US Social Security numbers, credit card numbers, 1114 00:53:55,580 --> 00:54:00,170 those aren't necessarily values that you want to treat as actual integers. 1115 00:54:00,170 --> 00:54:02,570 And in fact, those of you who did the credit problem 1116 00:54:02,570 --> 00:54:04,760 and tried to validate credit card numbers 1117 00:54:04,760 --> 00:54:08,720 may very well have run into challenges by using a long to represent a credit 1118 00:54:08,720 --> 00:54:09,650 card number. 1119 00:54:09,650 --> 00:54:11,900 It probably in retrospect might very well 1120 00:54:11,900 --> 00:54:15,560 have been easier for you to treat credit card numbers as strings. 1121 00:54:15,560 --> 00:54:17,810 The catch, of course, by design is that you didn't yet 1122 00:54:17,810 --> 00:54:21,410 have strings in your vocabulary, at least in C yet. 
1123 00:54:21,410 --> 00:54:23,390 So suppose I want to create my own custom data 1124 00:54:23,390 --> 00:54:27,800 type that encapsulates, if you will, two different types of values. 1125 00:54:27,800 --> 00:54:31,430 A person shall be henceforth a name and a number. 1126 00:54:31,430 --> 00:54:34,850 It turns out that C gives us this syntax here. 1127 00:54:34,850 --> 00:54:39,770 This is the only juicy piece of new syntax besides those curly braces 1128 00:54:39,770 --> 00:54:42,880 a moment ago that we'll see today in C, typedef. 1129 00:54:42,880 --> 00:54:44,810 And as the name rather succinctly suggests, 1130 00:54:44,810 --> 00:54:47,240 this allows you to define a type. 1131 00:54:47,240 --> 00:54:50,120 And the type will be a structure of some sort. 1132 00:54:50,120 --> 00:54:52,610 So a data structure in a programming language 1133 00:54:52,610 --> 00:54:56,600 is typically a data type that has some structure to it. 1134 00:54:56,600 --> 00:54:57,980 What do we mean by "structure"? 1135 00:54:57,980 --> 00:55:01,610 It typically has one or more values inside of it. 1136 00:55:01,610 --> 00:55:04,940 So using typedef, and in turn using the struct keyword, 1137 00:55:04,940 --> 00:55:06,770 we can create our own custom types that's 1138 00:55:06,770 --> 00:55:10,320 a structure, a composition of multiple other data types. 1139 00:55:10,320 --> 00:55:14,360 So if we want to keep persons together as their own custom data type, 1140 00:55:14,360 --> 00:55:15,890 the syntax is a little cryptic here. 1141 00:55:15,890 --> 00:55:19,310 You literally do typedef struct open curly brace, 1142 00:55:19,310 --> 00:55:22,137 then one per line you specify the data types that you want 1143 00:55:22,137 --> 00:55:24,470 and the names that you want to give to those data types, 1144 00:55:24,470 --> 00:55:26,270 for instance name and number. 
1145 00:55:26,270 --> 00:55:28,850 And then outside of the closing curly brace, 1146 00:55:28,850 --> 00:55:33,290 you literally put the word "person," if that's indeed the data type 1147 00:55:33,290 --> 00:55:35,280 that you want to invent. 1148 00:55:35,280 --> 00:55:37,790 So how can we use this more powerfully? 1149 00:55:37,790 --> 00:55:42,020 Well, let's go ahead and do things the wrong way without this feature first, 1150 00:55:42,020 --> 00:55:44,060 so as to motivate its existence. 1151 00:55:44,060 --> 00:55:47,016 Let me go ahead and save this file as phonebook.c. 1152 00:55:47,016 --> 00:55:50,420 And let me start, as always, with include cs50.h. 1153 00:55:50,420 --> 00:55:53,030 And then let me go ahead and include stdio.h. 1154 00:55:53,030 --> 00:55:56,180 And then lastly, let me also include string.h, 1155 00:55:56,180 --> 00:55:59,100 because I know I'm going to be manipulating some strings in a moment. 1156 00:55:59,100 --> 00:56:03,200 Let me go ahead now, and within my main function, let me go ahead 1157 00:56:03,200 --> 00:56:06,050 and give myself initially, for the first version of this program, 1158 00:56:06,050 --> 00:56:07,370 a whole bunch of names. 1159 00:56:07,370 --> 00:56:10,940 Specifically, how about "Brian" comma "David"? 1160 00:56:10,940 --> 00:56:15,200 We'll keep it short, just so as to focus on the ideas and not the actual data 1161 00:56:15,200 --> 00:56:15,862 therein. 1162 00:56:15,862 --> 00:56:17,570 Then Brian and I each have phone numbers. 1163 00:56:17,570 --> 00:56:19,890 So let's go ahead and store them in an array-- 1164 00:56:19,890 --> 00:56:28,890 numbers equals, again the curly braces as before, and +1-617-495-1000-- 1165 00:56:28,890 --> 00:56:32,670 and indeed, there's already motivation, per Kurt's comment, to use strings, 1166 00:56:32,670 --> 00:56:35,250 because we've got a plus and a couple of dashes in there-- 1167 00:56:35,250 --> 00:56:36,460 and then my number here.
1168 00:56:36,460 --> 00:56:43,020 So we'll do +1-949-468-2750 close curly brace, semicolon. 1169 00:56:43,020 --> 00:56:45,900 So I've gone ahead and declared two arrays, one called names, 1170 00:56:45,900 --> 00:56:47,040 one called numbers. 1171 00:56:47,040 --> 00:56:52,410 And I'm just going to have sort of a handshake agreement 1172 00:56:52,410 --> 00:56:56,380 that the first name in names corresponds to the first number in numbers, 1173 00:56:56,380 --> 00:56:59,702 the second name in names corresponds to the second number in numbers. 1174 00:56:59,702 --> 00:57:02,910 And you can imagine that working well so long as you don't make any mistakes, 1175 00:57:02,910 --> 00:57:04,950 and you have just the right number in each. 1176 00:57:04,950 --> 00:57:08,520 Now let me go ahead and do int i equals 0, i less than 2-- 1177 00:57:08,520 --> 00:57:11,580 I'm going to keep that hardcoded for now just to do the demonstration. 1178 00:57:11,580 --> 00:57:14,910 And then inside of this loop, let me go ahead and search for my phone number, 1179 00:57:14,910 --> 00:57:17,350 for instance, even though I happen to be at the end. 1180 00:57:17,350 --> 00:57:22,590 So if strcmp of names bracket i equals-- 1181 00:57:22,590 --> 00:57:27,360 rather, comma "David" equals equals 0-- 1182 00:57:27,360 --> 00:57:29,200 so I'm not going to make that mistake again. 1183 00:57:29,200 --> 00:57:37,290 Let me go ahead inside of this loop, inside of this condition here. 1184 00:57:37,290 --> 00:57:41,650 And I'm going to go ahead and do the following-- print out that I found, 1185 00:57:41,650 --> 00:57:43,910 for instance my number. 1186 00:57:43,910 --> 00:57:45,160 And I'm going to plug that in. 1187 00:57:45,160 --> 00:57:47,730 So numbers bracket i. 1188 00:57:47,730 --> 00:57:50,430 And then as before, I'm going to go ahead and return 0. 
1189 00:57:50,430 --> 00:57:54,090 And if none of this works out, and I happen not to be in this array, 1190 00:57:54,090 --> 00:57:59,200 I'll go ahead and print out as before "Not found" with a semicolon. 1191 00:57:59,200 --> 00:58:00,870 And then I'll return 1 arbitrarily. 1192 00:58:00,870 --> 00:58:03,750 I can return negative 1, I could return a million, negative million. 1193 00:58:03,750 --> 00:58:07,530 But human convention would typically have you go from 0 to 1 to 2 to 3 1194 00:58:07,530 --> 00:58:11,050 on up, if you have that many possible error conditions. 1195 00:58:11,050 --> 00:58:11,760 All right. 1196 00:58:11,760 --> 00:58:14,730 So I essentially have implemented in C a phone book of sorts. 1197 00:58:14,730 --> 00:58:16,710 We did this verbally in week 0. 1198 00:58:16,710 --> 00:58:17,987 Now I'm doing it in code. 1199 00:58:17,987 --> 00:58:19,070 It's a limited phone book. 1200 00:58:19,070 --> 00:58:21,030 It's only got two names and two numbers. 1201 00:58:21,030 --> 00:58:22,988 But I could certainly implement this phone book 1202 00:58:22,988 --> 00:58:26,230 by just using two arrays, two parallel arrays, if you will, 1203 00:58:26,230 --> 00:58:29,490 by just using the honor system that the first element in names lines 1204 00:58:29,490 --> 00:58:32,072 up with the first element in numbers and so forth. 1205 00:58:32,072 --> 00:58:33,780 Now hopefully, if I don't make any typos, 1206 00:58:33,780 --> 00:58:36,030 let me go ahead and make phonebook. 1207 00:58:36,030 --> 00:58:36,530 All right. 1208 00:58:36,530 --> 00:58:41,440 It compiled OK. ./phonebook, and it found what seems to be my number there. 1209 00:58:41,440 --> 00:58:45,040 So it seems to work correctly, though I've tried to pull that one over you 1210 00:58:45,040 --> 00:58:45,540 before. 1211 00:58:45,540 --> 00:58:48,100 But I'm pretty sure this one actually works correctly. 1212 00:58:48,100 --> 00:58:51,990 And so we found my name and in turn number.
1213 00:58:51,990 --> 00:58:57,030 But why is the design of this code not necessarily the best? 1214 00:58:57,030 --> 00:59:00,090 This is starting to get more subtle, admittedly. 1215 00:59:00,090 --> 00:59:02,980 And we've seen that we can do this differently. 1216 00:59:02,980 --> 00:59:05,372 But what rubs you the wrong way about here? 1217 00:59:05,372 --> 00:59:07,830 This is another example of what we might call "code smell." 1218 00:59:07,830 --> 00:59:10,710 Like, something's a little funky here, like, ah, this 1219 00:59:10,710 --> 00:59:13,500 might not be the best solution long term. Nick, what do you think? 1220 00:59:13,500 --> 00:59:14,125 AUDIENCE: Yeah. 1221 00:59:14,125 --> 00:59:15,378 So what I'm guessing is that-- 1222 00:59:15,378 --> 00:59:18,420 like, you know how you made the data frame before the new data structure, 1223 00:59:18,420 --> 00:59:20,920 where the two things were linked together? 1224 00:59:20,920 --> 00:59:23,010 In this case, we're just banking on the fact 1225 00:59:23,010 --> 00:59:27,270 that we don't screw something up and unintentionally unlink them 1226 00:59:27,270 --> 00:59:29,390 from the same index. 1227 00:59:29,390 --> 00:59:32,812 So they're not intrinsically linked, which might not be-- 1228 00:59:32,812 --> 00:59:34,770 DAVID MALAN: That's exactly the right instinct. 1229 00:59:34,770 --> 00:59:38,790 In general, as great as a programmer as you're maybe aspiring to be, 1230 00:59:38,790 --> 00:59:39,690 you're not all that. 1231 00:59:39,690 --> 00:59:41,357 And like, you're going to make mistakes.
1232 00:59:41,357 --> 00:59:44,700 And the more you can write code that's self-defensive, 1233 00:59:44,700 --> 00:59:47,880 that protects you from yourself, the better off you're going to be, 1234 00:59:47,880 --> 00:59:49,890 the more correct your code is going to be, 1235 00:59:49,890 --> 00:59:53,460 and the more easily you're going to be able to collaborate successfully, 1236 00:59:53,460 --> 00:59:56,710 if you so choose in the real world, on real-world programming projects, 1237 00:59:56,710 --> 00:59:59,520 whether for a research project, a full-time job, a personal project 1238 00:59:59,520 --> 01:00:00,210 or the like. 1239 01:00:00,210 --> 01:00:04,290 Generally speaking, you should not trust yourself or other people 1240 01:00:04,290 --> 01:00:06,060 that-- with whom you're writing code, you 1241 01:00:06,060 --> 01:00:11,320 should have as many defense mechanisms in place exactly along these lines. 1242 01:00:11,320 --> 01:00:15,030 So yes, there's nothing wrong with what I have done in the sense 1243 01:00:15,030 --> 01:00:16,620 that this is correct. 1244 01:00:16,620 --> 01:00:21,480 But as noted, if you screw up, and maybe you get an off-by-one error-- 1245 01:00:21,480 --> 01:00:24,210 maybe you transpose two names or two numbers. 1246 01:00:24,210 --> 01:00:26,910 I mean, imagine if you've got dozens of names and numbers, 1247 01:00:26,910 --> 01:00:29,220 hundreds of names and numbers, thousands of them, 1248 01:00:29,220 --> 01:00:32,820 the odds that you or someone messes the order up at some point 1249 01:00:32,820 --> 01:00:35,370 is just probably going to be too, too high. 1250 01:00:35,370 --> 01:00:39,250 So it would be nice, then, if we could sort of keep related data together. 1251 01:00:39,250 --> 01:00:42,805 This is kind of a hack, to just on the honor system say, my arrays line up, 1252 01:00:42,805 --> 01:00:45,180 I'm just going to make sure to keep them the same length. 1253 01:00:45,180 --> 01:00:46,110 We can do better.
1254 01:00:46,110 --> 01:00:51,030 Let's keep related data together and design this a little more cleanly. 1255 01:00:51,030 --> 01:00:53,570 And I can do this by defining my own type 1256 01:00:53,570 --> 01:00:55,650 that I'll call for instance, a person. 1257 01:00:55,650 --> 01:00:58,790 So at the top of this file, before main, I'm 1258 01:00:58,790 --> 01:01:01,830 going to go ahead and typedef a structure, inside 1259 01:01:01,830 --> 01:01:04,380 of which are the two types of data that I care about, 1260 01:01:04,380 --> 01:01:08,520 string name and string number, just as before. 1261 01:01:08,520 --> 01:01:15,030 Notice, though, here that what I have done here is not give myself an array. 1262 01:01:15,030 --> 01:01:17,738 I've given myself one name and one number. 1263 01:01:17,738 --> 01:01:20,280 Outside of this curly brace, I'm going to give this data type 1264 01:01:20,280 --> 01:01:21,690 a name, which I could call "person." 1265 01:01:21,690 --> 01:01:23,482 I could call it anything I want, but person 1266 01:01:23,482 --> 01:01:25,410 seems pretty reasonable in this case. 1267 01:01:25,410 --> 01:01:29,890 And now down here, I'm going to go ahead and change this code a little bit. 1268 01:01:29,890 --> 01:01:32,370 I'm going to go ahead and give myself an array still, 1269 01:01:32,370 --> 01:01:36,240 but this time I'm going to give myself an array of persons. 1270 01:01:36,240 --> 01:01:39,360 And I'm going to call that array, somewhat playfully, "people," 1271 01:01:39,360 --> 01:01:44,940 because I want to have two persons, two people, in this program, me and Brian. 1272 01:01:44,940 --> 01:01:47,215 Now I want to go ahead and populate this array. 1273 01:01:47,215 --> 01:01:48,840 That is, I want to fill it with values. 1274 01:01:48,840 --> 01:01:50,940 And this syntax is a little new, but it's just 1275 01:01:50,940 --> 01:01:55,230 to enable us to actually store values inside of a structure. 
1276 01:01:55,230 --> 01:01:58,050 If I want to index into this array, there's 1277 01:01:58,050 --> 01:01:59,460 nothing different from last week. 1278 01:01:59,460 --> 01:02:01,380 I do people bracket 0. 1279 01:02:01,380 --> 01:02:06,270 That's going to give me the first person variable inside, so probably where 1280 01:02:06,270 --> 01:02:07,890 "Brian" is supposed to go. 1281 01:02:07,890 --> 01:02:10,350 The one last piece of syntax I need is how do I 1282 01:02:10,350 --> 01:02:14,730 go inside of that structure, that person data structure, 1283 01:02:14,730 --> 01:02:16,450 and access the person's name? 1284 01:02:16,450 --> 01:02:17,710 I literally just do a dot. 1285 01:02:17,710 --> 01:02:22,620 So people bracket 0 gives me the first person in the people array. 1286 01:02:22,620 --> 01:02:26,580 And then the dot means, go inside of it and grab the person's name. 1287 01:02:26,580 --> 01:02:29,880 I'm going to go ahead and set that name equal to quote, unquote "Brian." 1288 01:02:29,880 --> 01:02:31,990 The syntax now for his number is almost identical-- 1289 01:02:31,990 --> 01:02:39,630 people bracket 0 dot number equals quote, unquote "+1-617-495-1000" 1290 01:02:39,630 --> 01:02:40,740 semicolon. 1291 01:02:40,740 --> 01:02:43,092 Meanwhile, if I want to access a location for myself, 1292 01:02:43,092 --> 01:02:45,300 I'm going to go ahead and put it at location 1, which 1293 01:02:45,300 --> 01:02:46,620 is the second location. 1294 01:02:46,620 --> 01:02:48,630 Name will be, quote, unquote "David." 1295 01:02:48,630 --> 01:02:52,020 And then over here, I'm going to do people bracket 1 dot number equals 1296 01:02:52,020 --> 01:02:59,250 quote, unquote "+1-949-468-2750" close quote, semicolon. 1297 01:02:59,250 --> 01:03:01,320 So it's a bit verbose, admittedly.
1298 01:03:01,320 --> 01:03:05,430 But you could imagine, if we just let our thoughts run ahead of ourselves 1299 01:03:05,430 --> 01:03:08,370 here, if you used get_string, could sort of automatically do this. 1300 01:03:08,370 --> 01:03:11,453 If you used command line arguments, maybe you could populate some of this. 1301 01:03:11,453 --> 01:03:14,190 We don't just have to hardcode, that is, write my name and number 1302 01:03:14,190 --> 01:03:15,668 and Brian's into this program. 1303 01:03:15,668 --> 01:03:17,460 You can imagine doing this more dynamically 1304 01:03:17,460 --> 01:03:20,700 using some of our techniques, using get_string and so forth, from week 1. 1305 01:03:20,700 --> 01:03:22,920 But for now, it's just for demonstration's sake. 1306 01:03:22,920 --> 01:03:29,040 So now if I want to search this new array, this new single array of people, 1307 01:03:29,040 --> 01:03:31,410 I think my for loop can stay the same. 1308 01:03:31,410 --> 01:03:33,990 And I think I can still use strcmp. 1309 01:03:33,990 --> 01:03:38,970 But now I need to go inside of not names but people, 1310 01:03:38,970 --> 01:03:41,740 and look for the dot name field. 1311 01:03:41,740 --> 01:03:45,008 So data structures have fields or variables inside of them. 1312 01:03:45,008 --> 01:03:46,800 So I'm going to use the dot notation there, 1313 01:03:46,800 --> 01:03:50,670 too, go into the i-th person in the people array, 1314 01:03:50,670 --> 01:03:53,610 and compare that name against, for instance, quote, unquote "David." 1315 01:03:53,610 --> 01:03:57,300 And then if I have found "David," in this case myself, 1316 01:03:57,300 --> 01:04:01,920 go ahead and access the people array again, but print out using printf 1317 01:04:01,920 --> 01:04:02,830 the number. 1318 01:04:02,830 --> 01:04:05,820 So again, the dot operator is the only new piece of syntax 1319 01:04:05,820 --> 01:04:10,120 that's letting us go inside of this new feature known as a data structure. 
1320 01:04:10,120 --> 01:04:12,930 If I go ahead and make phonebook again after making those changes, 1321 01:04:12,930 --> 01:04:13,610 all is well. 1322 01:04:13,610 --> 01:04:14,850 It compiled OK. 1323 01:04:14,850 --> 01:04:20,410 And if I run ./phonebook, I now have hopefully found my number again. 1324 01:04:20,410 --> 01:04:24,420 So here is a seemingly useless exercise, in that all 1325 01:04:24,420 --> 01:04:28,050 I really did was re-implement the same program using more lines of code 1326 01:04:28,050 --> 01:04:29,630 and making it more complicated. 1327 01:04:29,630 --> 01:04:31,460 But it's now better designed. 1328 01:04:31,460 --> 01:04:34,130 Or it's a step toward being better designed, because now I've 1329 01:04:34,130 --> 01:04:38,330 encapsulated all inside of one variable, for instance, people bracket 0, 1330 01:04:38,330 --> 01:04:42,590 people bracket 1, all of the information we care about with respect to Brian, 1331 01:04:42,590 --> 01:04:46,800 or me, or anyone else we might put into this program. 1332 01:04:46,800 --> 01:04:49,880 And indeed, this is how programs, this is how the Googles of the world, 1333 01:04:49,880 --> 01:04:52,520 the Facebooks of the world store lots of information together. 1334 01:04:52,520 --> 01:04:55,220 Consider any of your social media accounts like Instagram, 1335 01:04:55,220 --> 01:04:57,230 or Facebook, or Snapchat and the like. 1336 01:04:57,230 --> 01:04:59,810 You have multiple pieces of data associated 1337 01:04:59,810 --> 01:05:02,810 with you on all of those platforms-- not just your username 1338 01:05:02,810 --> 01:05:06,050 but also your password, also your history of posts, 1339 01:05:06,050 --> 01:05:08,265 also your friends and followers and the like. 1340 01:05:08,265 --> 01:05:11,390 So there's a lot of information that these companies, for better or for worse, 1341 01:05:11,390 --> 01:05:12,950 are collecting on all of us.
1342 01:05:12,950 --> 01:05:17,660 And can you imagine if they just had one big array with all of our usernames, 1343 01:05:17,660 --> 01:05:21,892 one big array with all of our passwords, one big array with all of our friends? 1344 01:05:21,892 --> 01:05:23,600 Like, you can imagine certainly at scale, 1345 01:05:23,600 --> 01:05:26,418 that's got to be a bad design, to just trust 1346 01:05:26,418 --> 01:05:29,210 that you're going to get the ordering of all of these things right. 1347 01:05:29,210 --> 01:05:30,110 They don't do that. 1348 01:05:30,110 --> 01:05:34,040 They instead write code in some language that somehow encapsulates 1349 01:05:34,040 --> 01:05:36,590 all the information related to me and Brian 1350 01:05:36,590 --> 01:05:39,620 and you inside of some kind of data structure. 1351 01:05:39,620 --> 01:05:43,370 And that's what they put in their database or some other server 1352 01:05:43,370 --> 01:05:44,360 on their back end. 1353 01:05:44,360 --> 01:05:47,720 So this encapsulation is a feature we now have in terms of C. 1354 01:05:47,720 --> 01:05:51,740 And it allows us to create our own data structures that we can then use 1355 01:05:51,740 --> 01:05:56,090 in order to keep related data together. 1356 01:05:56,090 --> 01:05:59,990 All right, any questions, then, on data structures, or more 1357 01:05:59,990 --> 01:06:04,130 specifically typedef and struct, the C keywords 1358 01:06:04,130 --> 01:06:08,360 with which you can create your own custom types 1359 01:06:08,360 --> 01:06:10,760 that themselves are data structures? 1360 01:06:10,760 --> 01:06:11,720 Besley? 1361 01:06:11,720 --> 01:06:12,590 AUDIENCE: Hi. 1362 01:06:12,590 --> 01:06:17,450 So is it typical to define the new data structure outside of main, like 1363 01:06:17,450 --> 01:06:17,950 in a header? 1364 01:06:17,950 --> 01:06:19,367 DAVID MALAN: Really good question. 1365 01:06:19,367 --> 01:06:22,070 Is it typical to define a new data structure outside of main? 
1366 01:06:22,070 --> 01:06:23,270 Quite often yes. 1367 01:06:23,270 --> 01:06:26,270 In this case, it's immaterial, because I only have 1368 01:06:26,270 --> 01:06:28,190 one function in this program, main. 1369 01:06:28,190 --> 01:06:30,530 But as we'll see this week and next week and onward, 1370 01:06:30,530 --> 01:06:34,370 our programs are going to start to get a little more complicated by nature 1371 01:06:34,370 --> 01:06:35,930 of just having more features. 1372 01:06:35,930 --> 01:06:38,987 And once you have more features, you probably have more functions. 1373 01:06:38,987 --> 01:06:41,570 And when you have more functions, you want your data structure 1374 01:06:41,570 --> 01:06:44,930 to be available to all of those functions. 1375 01:06:44,930 --> 01:06:49,100 And so we'll begin to see definition of some of these structures being, 1376 01:06:49,100 --> 01:06:51,620 indeed, outside of our own functions. 1377 01:06:51,620 --> 01:06:53,000 Peter, over to you. 1378 01:06:53,000 --> 01:06:54,020 AUDIENCE: Oh, yeah. 1379 01:06:54,020 --> 01:06:58,160 Will we define new classes in header files later, 1380 01:06:58,160 --> 01:07:00,350 or will we keep defining them outside of main? 1381 01:07:00,350 --> 01:07:01,767 DAVID MALAN: Really good question. 1382 01:07:01,767 --> 01:07:05,430 Might we define our own types and our own data structures in header files? 1383 01:07:05,430 --> 01:07:05,930 Yes. 1384 01:07:05,930 --> 01:07:07,220 Eventually we'll do that, too. 1385 01:07:07,220 --> 01:07:11,210 Thus far, you and I have only been using header files that other people wrote. 1386 01:07:11,210 --> 01:07:15,770 We've been using stdio.h, string.h, that the authors of C created. 1387 01:07:15,770 --> 01:07:19,070 You've been using cs50.h which we the staff wrote. 
1388 01:07:19,070 --> 01:07:23,210 It turns out, you can also create your own header files, your own .h files, 1389 01:07:23,210 --> 01:07:27,870 inside of which are pieces of code that you want to share across multiple files 1390 01:07:27,870 --> 01:07:28,370 of your own. 1391 01:07:28,370 --> 01:07:29,453 We're not quite there yet. 1392 01:07:29,453 --> 01:07:32,990 But yes, Peter, that would be a solution to this problem 1393 01:07:32,990 --> 01:07:36,130 by putting it in one central place. 1394 01:07:36,130 --> 01:07:37,580 Thiago, over to you. 1395 01:07:37,580 --> 01:07:41,120 1396 01:07:41,120 --> 01:07:43,370 AUDIENCE: I was-- 1397 01:07:43,370 --> 01:07:54,265 I was thinking, this course really takes enough information to solve the sets, 1398 01:07:54,265 --> 01:08:00,590 because I feel there is missing information. 1399 01:08:00,590 --> 01:08:07,490 I am a freshman, and I was taking-- 1400 01:08:07,490 --> 01:08:14,780 I was so concentrating, and I can't go on, go ahead on the sets. 1401 01:08:14,780 --> 01:08:17,302 Is there anything that I'm missing? 1402 01:08:17,302 --> 01:08:19,010 DAVID MALAN: It's a really good question. 1403 01:08:19,010 --> 01:08:19,700 And quite fair. 1404 01:08:19,700 --> 01:08:21,920 We do move quite quickly, admittedly. 1405 01:08:21,920 --> 01:08:26,720 So indeed, recall from week 0 the fire hose metaphor 1406 01:08:26,720 --> 01:08:29,270 that I borrowed from MIT's water fountain. 1407 01:08:29,270 --> 01:08:31,069 Indeed, that's very much the case. 1408 01:08:31,069 --> 01:08:33,569 There's a lot of new syntax, a lot of new ideas all at once. 1409 01:08:33,569 --> 01:08:37,310 But when it comes to individual problems in the problem sets, 1410 01:08:37,310 --> 01:08:39,920 do realize that you should take those step by step. 1411 01:08:39,920 --> 01:08:45,460 And invariably, they tend to work from less complicated to more complicated. 
1412 01:08:45,460 --> 01:08:47,960 And throughout each of the lectures and each of the examples 1413 01:08:47,960 --> 01:08:50,390 that we do, either live or via the examples that 1414 01:08:50,390 --> 01:08:53,060 are premade on the course's website for your review, 1415 01:08:53,060 --> 01:08:57,529 there's always little clues or hints or examples that you can then do. 1416 01:08:57,529 --> 01:09:01,910 And certainly, by way of other resources like labs and the like, 1417 01:09:01,910 --> 01:09:04,200 will you see additional building blocks as well. 1418 01:09:04,200 --> 01:09:06,080 So feel free to reach out more individually afterward. 1419 01:09:06,080 --> 01:09:07,997 Happy to point you at some of those resources. 1420 01:09:07,997 --> 01:09:10,910 In fact, most recently, too, will you notice on the course's website 1421 01:09:10,910 --> 01:09:13,399 what we call "shorts," which are shorter videos made 1422 01:09:13,399 --> 01:09:17,120 by another colleague of mine, CS50's own Doug Lloyd, which are literally 1423 01:09:17,120 --> 01:09:19,680 short videos on very specific topics. 1424 01:09:19,680 --> 01:09:22,490 So after today, you'll see short videos by Doug 1425 01:09:22,490 --> 01:09:25,580 with a different perspective on linear search, on binary search, 1426 01:09:25,580 --> 01:09:28,260 and on a number of other algorithms as well. 1427 01:09:28,260 --> 01:09:28,859 Good question. 1428 01:09:28,859 --> 01:09:30,567 Sophia, back to you. 1429 01:09:30,567 --> 01:09:32,609 AUDIENCE: I was wondering, with the return values 1430 01:09:32,609 --> 01:09:37,380 that we have for different error cases, would that be-- 1431 01:09:37,380 --> 01:09:39,630 like, what's an example of what we would use that for? 1432 01:09:39,630 --> 01:09:42,540 Is that for later if there are like several different cases 1433 01:09:42,540 --> 01:09:44,356 and we want to somehow keep track of them? 1434 01:09:44,356 --> 01:09:45,689 DAVID MALAN: Exactly the latter.
1435 01:09:45,689 --> 01:09:47,790 So right now, honestly, it's kind of stupid 1436 01:09:47,790 --> 01:09:52,027 that we're even bothering to spend time returning 0 or returning 1. 1437 01:09:52,027 --> 01:09:54,360 Like, we don't really need to do that, because we're not 1438 01:09:54,360 --> 01:09:55,410 using the information. 1439 01:09:55,410 --> 01:09:58,200 But what we're trying to do is lay the foundation 1440 01:09:58,200 --> 01:09:59,782 for more complicated programs. 1441 01:09:59,782 --> 01:10:01,740 And indeed, this week and next week and beyond, 1442 01:10:01,740 --> 01:10:05,520 as your own programs get a little longer, and as we, the course, 1443 01:10:05,520 --> 01:10:08,520 start providing you with starter code or distribution 1444 01:10:08,520 --> 01:10:12,210 code, that is, lines of code that the staff and I write that you then 1445 01:10:12,210 --> 01:10:15,630 have to build upon, it's going to be a very useful mechanism 1446 01:10:15,630 --> 01:10:19,360 to be able to signal that this went wrong or this other thing went wrong. 1447 01:10:19,360 --> 01:10:22,110 So all we're doing is preparing for that inevitability, 1448 01:10:22,110 --> 01:10:25,870 even if right now it doesn't really seem to be scratching an itch. 1449 01:10:25,870 --> 01:10:26,630 Anthony? 1450 01:10:26,630 --> 01:10:28,672 AUDIENCE: I was just going to ask really quickly, 1451 01:10:28,672 --> 01:10:32,200 obviously in this code we have "Brian" and your name, "David." 1452 01:10:32,200 --> 01:10:33,310 And that's two people. 1453 01:10:33,310 --> 01:10:35,500 So let's say we had 10 or 20 or even 30 people. 1454 01:10:35,500 --> 01:10:37,500 I know it was a question in the chat, but I just 1455 01:10:37,500 --> 01:10:39,460 wanted to clarify for myself, too. 1456 01:10:39,460 --> 01:10:42,000 DAVID MALAN: And the "what if" being what would change? 1457 01:10:42,000 --> 01:10:44,210 Or, what was the end of that question? 1458 01:10:44,210 --> 01:10:44,850 AUDIENCE: Yeah. 
1459 01:10:44,850 --> 01:10:46,100 What would change in the code? 1460 01:10:46,100 --> 01:10:48,017 Or what we do exactly to address that problem? 1461 01:10:48,017 --> 01:10:48,850 DAVID MALAN: Ah, OK. 1462 01:10:48,850 --> 01:10:49,450 Good question. 1463 01:10:49,450 --> 01:10:53,305 So if we were to have more names, like a third name or a tenth name or the like, 1464 01:10:53,305 --> 01:10:56,430 the only things that we would have to change in this version of the program 1465 01:10:56,430 --> 01:10:58,947 is first, on line 14, the size of the array. 1466 01:10:58,947 --> 01:11:00,780 So if we're going to have 10 people, we need 1467 01:11:00,780 --> 01:11:03,180 to decide in advance that we're going to have 10 people. 1468 01:11:03,180 --> 01:11:08,048 Better still, I could, for instance, allocate myself a constant up here. 1469 01:11:08,048 --> 01:11:09,840 So let me actually go up here, just like we 1470 01:11:09,840 --> 01:11:17,670 did in a previous class, where we did something like this-- const int NUMBER. 1471 01:11:17,670 --> 01:11:19,380 And I'll just initialize this to 10. 1472 01:11:19,380 --> 01:11:21,450 And recall that const means constant. 1473 01:11:21,450 --> 01:11:23,140 That means this variable can't change. 1474 01:11:23,140 --> 01:11:24,930 Int, of course means it's an integer. 1475 01:11:24,930 --> 01:11:27,450 The fact that I've capitalized it is just a human convention 1476 01:11:27,450 --> 01:11:29,640 to make a little visually clear that this is 1477 01:11:29,640 --> 01:11:32,200 a constant, just so you don't forget. 1478 01:11:32,200 --> 01:11:33,660 But it has no functional role. 1479 01:11:33,660 --> 01:11:36,360 And then this, of course, is just the value to assign to NUMBER. 
1480 01:11:36,360 --> 01:11:40,187 Then I could go down here on line 16 and plug in that variable 1481 01:11:40,187 --> 01:11:42,270 so that I don't have to hardcode what people would 1482 01:11:42,270 --> 01:11:44,730 call a "magic number," which is just a number that 1483 01:11:44,730 --> 01:11:46,410 appears seemingly out of nowhere. 1484 01:11:46,410 --> 01:11:50,520 Now I've put all of my special numbers at the top of my file, 1485 01:11:50,520 --> 01:11:54,190 or toward the top of my file, and now I'm using this variable here. 1486 01:11:54,190 --> 01:11:57,390 And then what I could do-- and I alluded to this only verbally before-- 1487 01:11:57,390 --> 01:12:00,120 I could absolutely start hardcoding in, for instance, 1488 01:12:00,120 --> 01:12:03,420 Montague's name and number, and Rithvik's and Benedict's, and Cody's 1489 01:12:03,420 --> 01:12:04,110 and others. 1490 01:12:04,110 --> 01:12:07,048 But honestly, this seems kind of stupid if you're just hardcoding 1491 01:12:07,048 --> 01:12:08,340 all of these names and numbers. 1492 01:12:08,340 --> 01:12:10,423 And in a few weeks, we'll see how you can actually 1493 01:12:10,423 --> 01:12:13,200 store all of the same information in like a spreadsheet, 1494 01:12:13,200 --> 01:12:14,910 or what's called a CSV file-- 1495 01:12:14,910 --> 01:12:19,560 Comma Separated Values-- or even in a proper database, which the Facebooks 1496 01:12:19,560 --> 01:12:21,065 and Googles of the world would use. 1497 01:12:21,065 --> 01:12:23,190 But what I could do for now is something like this. 1498 01:12:23,190 --> 01:12:28,940 For int i gets 0, i less than the number of people, i++. 1499 01:12:28,940 --> 01:12:30,720 And maybe I could do something like this-- 1500 01:12:30,720 --> 01:12:38,250 people bracket i dot name equals get_string, "What's the name" 1501 01:12:38,250 --> 01:12:39,150 question mark. 
1502 01:12:39,150 --> 01:12:43,080 And then here I could do people bracket i dot number equals 1503 01:12:43,080 --> 01:12:46,380 get_string, "What's their number?" 1504 01:12:46,380 --> 01:12:48,220 And I could ask that question, too. 1505 01:12:48,220 --> 01:12:51,320 So now the program's getting to be a little better designed. 1506 01:12:51,320 --> 01:12:53,700 I'm not arbitrarily hardcoding just me and Brian. 1507 01:12:53,700 --> 01:12:54,780 Now it's dynamic. 1508 01:12:54,780 --> 01:12:58,330 And technically, the phone book only supports 10 people at the moment, 1509 01:12:58,330 --> 01:12:59,850 but I could make that dynamic, too. 1510 01:12:59,850 --> 01:13:01,630 I could also call get_int. 1511 01:13:01,630 --> 01:13:04,860 Or, like you did this past week, use a command line argument 1512 01:13:04,860 --> 01:13:07,770 and parameterize the code so that it can actually 1513 01:13:07,770 --> 01:13:09,990 be for 2 people, 10 people-- whatever you want, 1514 01:13:09,990 --> 01:13:15,270 the program can dynamically adapt to it for you. 1515 01:13:15,270 --> 01:13:16,890 Other questions? 1516 01:13:16,890 --> 01:13:22,420 On structs, on types, or the like? 1517 01:13:22,420 --> 01:13:23,170 No? 1518 01:13:23,170 --> 01:13:23,670 All right. 1519 01:13:23,670 --> 01:13:25,110 So how did we get here? 1520 01:13:25,110 --> 01:13:27,810 Recall that we started with this problem of searching, 1521 01:13:27,810 --> 01:13:30,420 whereby we just want to find someone in the doors. 1522 01:13:30,420 --> 01:13:32,625 We just want to find someone in the array. 1523 01:13:32,625 --> 01:13:34,500 We've sort of escalated things pretty quickly 1524 01:13:34,500 --> 01:13:36,720 to finding not just numbers or names but now 1525 01:13:36,720 --> 01:13:40,380 names with numbers in the form of these data structures. 1526 01:13:40,380 --> 01:13:43,770 But to do this efficiently really requires a smarter algorithm 1527 01:13:43,770 --> 01:13:44,850 like binary search. 
1528 01:13:44,850 --> 01:13:50,100 Up until now, we've only used in C code linear search, even though, recall, 1529 01:13:50,100 --> 01:13:54,240 that we did have at our disposal the pseudocode for binary search. 1530 01:13:54,240 --> 01:13:58,020 But with binary search, we're going to need the data to be sorted. 1531 01:13:58,020 --> 01:14:01,080 And so if you want to get the speed benefits of searching 1532 01:14:01,080 --> 01:14:04,290 more quickly by having sorted numbers, somehow someone 1533 01:14:04,290 --> 01:14:06,120 is going to have to do that for us. 1534 01:14:06,120 --> 01:14:09,748 Joe, for instance, sorted behind the curtain all of these numbers for us. 1535 01:14:09,748 --> 01:14:11,790 But what algorithm did he use is going to open up 1536 01:14:11,790 --> 01:14:15,005 a whole can of worms as to how we can sort numbers efficiently. 1537 01:14:15,005 --> 01:14:17,130 And indeed, if you're the Googles and the Facebooks 1538 01:14:17,130 --> 01:14:19,240 and the Instagrams of the world, with millions, 1539 01:14:19,240 --> 01:14:22,740 billions of pieces of data and users, you surely 1540 01:14:22,740 --> 01:14:24,850 want to keep that data sorted, presumably, 1541 01:14:24,850 --> 01:14:26,940 so that you can use algorithms like binary search 1542 01:14:26,940 --> 01:14:30,520 to find information quickly when you're searching for friends or for content. 1543 01:14:30,520 --> 01:14:32,730 But let's go ahead and here take a five-minute break. 1544 01:14:32,730 --> 01:14:35,160 And when we come back, we'll consider a few algorithms 1545 01:14:35,160 --> 01:14:39,600 for sorting that's going to enable us to do everything we've just now discussed. 1546 01:14:39,600 --> 01:14:40,910 See you in five. 1547 01:14:40,910 --> 01:14:41,790 All right. 1548 01:14:41,790 --> 01:14:43,050 We are back. 
1549 01:14:43,050 --> 01:14:47,760 So to recap, we have a couple different algorithms for searching, linear search 1550 01:14:47,760 --> 01:14:48,870 and binary search. 1551 01:14:48,870 --> 01:14:52,560 Binary search is clearly the winner from all measures we've seen thus far. 1552 01:14:52,560 --> 01:14:56,910 The catch is that the data needs to be sorted in advance in order 1553 01:14:56,910 --> 01:14:58,107 to apply that algorithm. 1554 01:14:58,107 --> 01:14:59,940 So let's just give ourselves a working model 1555 01:14:59,940 --> 01:15:01,350 for what it means to sort something. 1556 01:15:01,350 --> 01:15:04,560 Well, as always, if you think of this as just another problem to be solved, 1557 01:15:04,560 --> 01:15:07,970 it's got input and output, and the goal is to take that input 1558 01:15:07,970 --> 01:15:08,970 and produce that output. 1559 01:15:08,970 --> 01:15:09,990 Well, what's the input? 1560 01:15:09,990 --> 01:15:12,360 It's going to be a whole bunch of unsorted values. 1561 01:15:12,360 --> 01:15:14,462 And the goal, of course, is to get sorted values. 1562 01:15:14,462 --> 01:15:16,420 So the interesting part of the process is going 1563 01:15:16,420 --> 01:15:18,270 to be whatever there is in the middle. 1564 01:15:18,270 --> 01:15:22,710 But just to be even more concrete, if we think now in terms of this unsorted 1565 01:15:22,710 --> 01:15:25,650 input as being an array of input-- because after all, 1566 01:15:25,650 --> 01:15:28,470 that's perhaps the most useful mechanism we've seen thus far, 1567 01:15:28,470 --> 01:15:32,880 to pass around a bunch of values at once using just one variable name-- 1568 01:15:32,880 --> 01:15:37,710 we might have an array like this, 6 3 8 5 2 7 4 1, which seems to be, indeed, 1569 01:15:37,710 --> 01:15:40,050 randomly ordered, that is unsorted. 1570 01:15:40,050 --> 01:15:43,594 And we want to turn that into an equivalent array that's just 1 2 1571 01:15:43,594 --> 01:15:45,120 3 4 5 6 7 8.
1572 01:15:45,120 --> 01:15:47,490 So eight numbers this time instead of seven. 1573 01:15:47,490 --> 01:15:51,190 But the goal this time is not to search them, per se, but to sort them. 1574 01:15:51,190 --> 01:15:53,730 But before I get ahead of myself, could someone 1575 01:15:53,730 --> 01:15:56,640 push back on this whole intellectual exercise 1576 01:15:56,640 --> 01:15:59,400 we're about to do with sorting in the first place? 1577 01:15:59,400 --> 01:16:02,280 Could someone make an argument as to why we might not 1578 01:16:02,280 --> 01:16:06,990 want to bother using a sorted array, why we might not 1579 01:16:06,990 --> 01:16:12,000 want to bother sorting the elements, and heck, let's just use linear search 1580 01:16:12,000 --> 01:16:13,570 to find some element-- 1581 01:16:13,570 --> 01:16:17,460 whether it's a number behind a door, a name in an array. 1582 01:16:17,460 --> 01:16:23,850 Like, when might we want to just use linear search and not bother sorting? 1583 01:16:23,850 --> 01:16:25,940 Sophia, what do you think? 1584 01:16:25,940 --> 01:16:28,370 AUDIENCE: We could encounter errors in sorting, 1585 01:16:28,370 --> 01:16:33,010 and that might cause errors, like, unpredictability in terms of, 1586 01:16:33,010 --> 01:16:34,620 like, if we can find something. 1587 01:16:34,620 --> 01:16:37,143 Versus linear search, we know we can find it. 1588 01:16:37,143 --> 01:16:38,310 DAVID MALAN: OK, quite fair. 1589 01:16:38,310 --> 01:16:41,940 I will concede that implementing binary search, not in pseudocode, which we've 1590 01:16:41,940 --> 01:16:44,850 already done, but in code is actually more difficult, 1591 01:16:44,850 --> 01:16:47,040 because you have to deal with rounding, especially 1592 01:16:47,040 --> 01:16:49,790 if you've got a weird number of doors, like an odd number of doors 1593 01:16:49,790 --> 01:16:52,410 versus an even number of doors or an array of those lengths. 
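The rounding concern raised here can be sketched in C. This is one common way to write binary search, not necessarily the version the course goes on to use; the function name and the choice of -1 for "not found" are assumptions. Integer division rounds down, which is what settles the odd-versus-even question.

```c
// Binary search over an int array that is already sorted in increasing order.
// Returns the index of target, or -1 if target is absent.
// (The name and the -1 convention are assumptions for this sketch.)
int binary_search(const int values[], int n, int target)
{
    int low = 0;
    int high = n - 1;
    while (low <= high)
    {
        // Integer division truncates, so the middle rounds down
        // whether the remaining range has odd or even length.
        int middle = low + (high - low) / 2;
        if (values[middle] == target)
        {
            return middle;
        }
        else if (values[middle] < target)
        {
            // Target can only be in the right half
            low = middle + 1;
        }
        else
        {
            // Target can only be in the left half
            high = middle - 1;
        }
    }
    return -1;
}
```

Writing the midpoint as `low + (high - low) / 2` rather than `(low + high) / 2` is a defensive habit that avoids overflow on very large arrays; for seven or eight doors either form behaves the same.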
1594 01:16:52,410 --> 01:16:54,618 Honestly, you've got to deal with these corner cases, 1595 01:16:54,618 --> 01:16:57,180 like rounding down or rounding up, because any time 1596 01:16:57,180 --> 01:16:59,857 you divide something by 2, you might get a fractional value 1597 01:16:59,857 --> 01:17:01,190 or you might get a whole number. 1598 01:17:01,190 --> 01:17:02,350 So we've got to make some decisions. 1599 01:17:02,350 --> 01:17:03,630 So it's totally solvable. 1600 01:17:03,630 --> 01:17:06,870 And humans for decades have been writing code that implements binary search. 1601 01:17:06,870 --> 01:17:08,160 It's totally possible. 1602 01:17:08,160 --> 01:17:09,450 There's libraries you can use. 1603 01:17:09,450 --> 01:17:12,492 But it's definitely more challenging, and you open yourselves up to risk. 1604 01:17:12,492 --> 01:17:14,385 But let me stipulate that that's OK. 1605 01:17:14,385 --> 01:17:17,763 I am good enough at this point in my progression where I'm pretty 1606 01:17:17,763 --> 01:17:19,180 sure I can implement it correctly. 1607 01:17:19,180 --> 01:17:21,480 So correctness is not my concern. 1608 01:17:21,480 --> 01:17:26,880 What else might demotivate me from sorting an array of elements? 1609 01:17:26,880 --> 01:17:30,420 And what might motivate me to, ah, just use linear search. 1610 01:17:30,420 --> 01:17:33,150 It's so simple. 1611 01:17:33,150 --> 01:17:34,930 Can anyone propose why? 1612 01:17:34,930 --> 01:17:36,480 Olivia, what do you think? 1613 01:17:36,480 --> 01:17:39,420 AUDIENCE: If the name of the game is efficiency, 1614 01:17:39,420 --> 01:17:42,180 and you have a small enough data set, then 1615 01:17:42,180 --> 01:17:46,040 you might as well just search it versus sort 1616 01:17:46,040 --> 01:17:47,920 it, which would be an extra expense. 1617 01:17:47,920 --> 01:17:49,420 DAVID MALAN: Yeah, really well said.
1618 01:17:49,420 --> 01:17:51,450 If you've got a relatively small data set, 1619 01:17:51,450 --> 01:17:55,050 and your computer operates at a billion operations per second, 1620 01:17:55,050 --> 01:17:59,130 for instance, my God, who cares if your code sucks and it's a little bit slow? 1621 01:17:59,130 --> 01:18:00,580 Just do it the inefficient way. 1622 01:18:00,580 --> 01:18:01,080 Why? 1623 01:18:01,080 --> 01:18:04,205 Because it's going to take you maybe a few minutes to implement the simpler 1624 01:18:04,205 --> 01:18:07,200 algorithm like linear search, even though it's going to take longer 1625 01:18:07,200 --> 01:18:10,350 to run, whereas it might take you tens of minutes, maybe an hour 1626 01:18:10,350 --> 01:18:12,810 or so, to not only write but debug something 1627 01:18:12,810 --> 01:18:16,200 like a fancier algorithm, like binary search, at which point 1628 01:18:16,200 --> 01:18:19,710 you might have spent more time writing the code, the faster code, than you 1629 01:18:19,710 --> 01:18:22,342 would have just running the slower code. 1630 01:18:22,342 --> 01:18:23,800 And I can speak to this personally. 1631 01:18:23,800 --> 01:18:26,008 Back in grad school, some of the research I was doing 1632 01:18:26,008 --> 01:18:28,170 involved analysis of very large data sets. 1633 01:18:28,170 --> 01:18:30,960 And I had to write code in order to analyze this data. 1634 01:18:30,960 --> 01:18:33,330 And I could have spent hours, days, even, 1635 01:18:33,330 --> 01:18:37,320 writing the best designed algorithm I could to analyze 1636 01:18:37,320 --> 01:18:39,480 the data as efficiently as possible. 1637 01:18:39,480 --> 01:18:42,240 Or, frankly, I could write the crappy version of the code, 1638 01:18:42,240 --> 01:18:44,550 go to sleep for eight hours, and my code will just 1639 01:18:44,550 --> 01:18:46,770 produce the output I want by morning. 1640 01:18:46,770 --> 01:18:49,513 And that is a very real-world, reasonable trade-off to make. 
1641 01:18:49,513 --> 01:18:51,930 And indeed, this is going to be thematic in the weeks that 1642 01:18:51,930 --> 01:18:54,615 proceed in the course, where there's going to be this trade-off. 1643 01:18:54,615 --> 01:18:56,490 And quite often, the trade-off is going to be 1644 01:18:56,490 --> 01:19:01,110 time, or complexity, or the amount of space or memory that you're using. 1645 01:19:01,110 --> 01:19:04,740 And part of the art of being a good computer scientist, 1646 01:19:04,740 --> 01:19:08,140 and in turn programmer, is trying to decide where the line is. 1647 01:19:08,140 --> 01:19:11,430 Do you exert more effort upfront to make a better, faster, more efficient 1648 01:19:11,430 --> 01:19:13,800 algorithm, or do you maybe cut some corners 1649 01:19:13,800 --> 01:19:17,580 there so that you can focus your most precious resource, human time, 1650 01:19:17,580 --> 01:19:20,430 on other, more fundamentally challenging problems? 1651 01:19:20,430 --> 01:19:22,620 So we, for the course's problem sets and labs, 1652 01:19:22,620 --> 01:19:24,880 will always prescribe what's most important. 1653 01:19:24,880 --> 01:19:27,113 But in a few weeks' time, with one of our problem 1654 01:19:27,113 --> 01:19:29,280 sets, you will implement your very own spell checker. 1655 01:19:29,280 --> 01:19:30,988 And among the goals of that spell checker 1656 01:19:30,988 --> 01:19:33,690 are going to be to minimize the amount of time 1657 01:19:33,690 --> 01:19:37,980 your code is taking to run, and also to minimize the amount of space or memory 1658 01:19:37,980 --> 01:19:41,070 that your program is taking while running. 1659 01:19:41,070 --> 01:19:44,010 And so we'll begin to appreciate those trade-offs ever more.
1660 01:19:44,010 --> 01:19:47,370 But indeed, it's the case-- and I really like Olivia's formulation of it-- 1661 01:19:47,370 --> 01:19:50,340 if your data set is pretty small, it's probably not worth 1662 01:19:50,340 --> 01:19:54,330 writing the fastest, best designed algorithm as possible. 1663 01:19:54,330 --> 01:19:56,400 Just write it the simple way, the correct way, 1664 01:19:56,400 --> 01:19:58,440 and get the answer quickly, and move on. 1665 01:19:58,440 --> 01:20:01,530 But that's not going to be the case for a lot of problems, dare I say, 1666 01:20:01,530 --> 01:20:03,420 most problems in life. 1667 01:20:03,420 --> 01:20:06,150 If you're building Facebook or Instagram or Whatsapp, 1668 01:20:06,150 --> 01:20:10,980 or any of today's most popular services that are getting thousands, millions 1669 01:20:10,980 --> 01:20:13,500 of new pieces of data at a time, you can't just 1670 01:20:13,500 --> 01:20:17,310 linearly search all of your friends or connections on LinkedIn efficiently. 1671 01:20:17,310 --> 01:20:20,430 You can't just linearly search the billions of web pages 1672 01:20:20,430 --> 01:20:23,820 that Google and Microsoft index in their search engines. 1673 01:20:23,820 --> 01:20:25,450 You've got to be smarter about it. 1674 01:20:25,450 --> 01:20:28,200 And undoubtedly, the more successful your programs are 1675 01:20:28,200 --> 01:20:31,350 and your code are, or websites, your apps, whatever the case may be, 1676 01:20:31,350 --> 01:20:33,810 the more important design does come into play. 1677 01:20:33,810 --> 01:20:38,850 So indeed, let's stipulate now that the goal is not to search these doors once; 1678 01:20:38,850 --> 01:20:41,040 the goal is not to search these light bulbs once; 1679 01:20:41,040 --> 01:20:44,430 the goal is not to search the phone book once, but rather again 1680 01:20:44,430 --> 01:20:45,970 and again and again. 
1681 01:20:45,970 --> 01:20:48,600 And if that's going to be the case, then we probably 1682 01:20:48,600 --> 01:20:52,890 should spend a little more time and a little more complexity upfront 1683 01:20:52,890 --> 01:20:56,250 getting our code, not only right but also efficient, 1684 01:20:56,250 --> 01:20:59,670 so that we can benefit from that efficiency again and again 1685 01:20:59,670 --> 01:21:01,480 and again, over time. 1686 01:21:01,480 --> 01:21:04,050 So how might we go about sorting some numbers? 1687 01:21:04,050 --> 01:21:06,390 So in fact, let me see, to do this, if we can maybe 1688 01:21:06,390 --> 01:21:10,470 get a hand from Brian in back. 1689 01:21:10,470 --> 01:21:12,180 Brian, do you mind helping with sorting? 1690 01:21:12,180 --> 01:21:13,210 BRIAN: Yeah, absolutely. 1691 01:21:13,210 --> 01:21:16,950 So I've got eight numbers here right now that all seem to be in unsorted order. 1692 01:21:16,950 --> 01:21:17,700 DAVID MALAN: Yeah. 1693 01:21:17,700 --> 01:21:21,400 And Brian, could you go ahead, and could you sort these eight numbers for us? 1694 01:21:21,400 --> 01:21:22,900 BRIAN: Yeah, I'll put them in order. 1695 01:21:22,900 --> 01:21:27,640 So we'll take these and-- 1696 01:21:27,640 --> 01:21:35,450 um-- and all right. 1697 01:21:35,450 --> 01:21:37,033 I think these are now in sorted order. 1698 01:21:37,033 --> 01:21:38,117 DAVID MALAN: Yeah, indeed. 1699 01:21:38,117 --> 01:21:38,630 I agree. 1700 01:21:38,630 --> 01:21:42,420 And now let's take some critique from the audience, some observations. 1701 01:21:42,420 --> 01:21:49,140 Would someone mind explaining how Brian just sorted those eight numbers? 1702 01:21:49,140 --> 01:21:54,650 What did Brian just do, step by step, in order to get to that end result? 1703 01:21:54,650 --> 01:21:57,780 The input was unsorted, the output now is sorted. 1704 01:21:57,780 --> 01:21:58,550 So what did he do? 1705 01:21:58,550 --> 01:22:01,190 Peter, what did you see happen?
1706 01:22:01,190 --> 01:22:03,200 AUDIENCE: He went through them step by step. 1707 01:22:03,200 --> 01:22:07,200 And if they were in increasing order, he flipped them, 1708 01:22:07,200 --> 01:22:10,730 and kept doing it until they were all in the correct [INAUDIBLE]. 1709 01:22:10,730 --> 01:22:11,480 DAVID MALAN: Yeah. 1710 01:22:11,480 --> 01:22:13,820 He kept step by step kind of looking for small values 1711 01:22:13,820 --> 01:22:16,130 and moving them to the left, and looking for big values 1712 01:22:16,130 --> 01:22:18,380 and moving them to the right, so effectively selecting 1713 01:22:18,380 --> 01:22:21,330 numbers one at a time and putting it into its right place. 1714 01:22:21,330 --> 01:22:24,163 So let's see this, maybe in more slow motion, if you will, Brian. 1715 01:22:24,163 --> 01:22:25,913 And if you could be a little more pedantic 1716 01:22:25,913 --> 01:22:27,900 and explain exactly what you're doing. 1717 01:22:27,900 --> 01:22:32,390 I see you've already reset the numbers to their original, unsorted order. 1718 01:22:32,390 --> 01:22:35,240 Why don't we go ahead and start a little more methodically? 1719 01:22:35,240 --> 01:22:37,910 And could you go ahead and for us, more slowly this time, 1720 01:22:37,910 --> 01:22:40,100 select the smallest value. 1721 01:22:40,100 --> 01:22:41,860 Because I do think, per Peter, it's going 1722 01:22:41,860 --> 01:22:44,100 to need to end up at the far left. 1723 01:22:44,100 --> 01:22:44,850 BRIAN: Yeah, sure. 1724 01:22:44,850 --> 01:22:47,370 So I'm looking at the numbers, and the 1 is the smallest. 1725 01:22:47,370 --> 01:22:49,340 So I now have the smallest value. 1726 01:22:49,340 --> 01:22:49,580 DAVID MALAN: All right. 1727 01:22:49,580 --> 01:22:50,872 So you did that really quickly. 1728 01:22:50,872 --> 01:22:52,880 But I feel like you took the liberty of being 1729 01:22:52,880 --> 01:22:56,120 a human who can kind of have this bird's eye view of everything all at once.
1730 01:22:56,120 --> 01:22:58,162 But be a little more computer-like, if you could. 1731 01:22:58,162 --> 01:23:00,650 And if these eight numbers are technically in an array, 1732 01:23:00,650 --> 01:23:02,690 kind of like my seven doors out here, such 1733 01:23:02,690 --> 01:23:06,710 that you can only look at one number at a time, can you be even more methodical 1734 01:23:06,710 --> 01:23:10,010 and deliberate this time in telling us how you found the smallest 1735 01:23:10,010 --> 01:23:11,760 number to put into place? 1736 01:23:11,760 --> 01:23:12,260 BRIAN: Sure. 1737 01:23:12,260 --> 01:23:15,110 I guess, since the computer can only look at one number at a time, 1738 01:23:15,110 --> 01:23:17,220 I would start at the left side of this array 1739 01:23:17,220 --> 01:23:20,490 and work my way through the right, looking at each number one at a time. 1740 01:23:20,490 --> 01:23:22,910 So I might start with the 6 and say, OK, this right now 1741 01:23:22,910 --> 01:23:25,180 is the smallest number I've looked at so far. 1742 01:23:25,180 --> 01:23:28,430 But then I look at the next number, and it's a 3, and that's smaller than a 6. 1743 01:23:28,430 --> 01:23:31,700 So now the 3, that's the smallest number I found so far. 1744 01:23:31,700 --> 01:23:33,440 So I'll remember that and keep looking. 1745 01:23:33,440 --> 01:23:36,107 The 8 is bigger than the 3, so I don't need to worry about that. 1746 01:23:36,107 --> 01:23:37,520 The 5 is bigger than the 3. 1747 01:23:37,520 --> 01:23:39,680 The 2 is smaller than the 3, so that now is 1748 01:23:39,680 --> 01:23:41,905 the smallest number I've found so far. 1749 01:23:41,905 --> 01:23:42,780 But I'm not done yet. 1750 01:23:42,780 --> 01:23:43,655 So I'll keep looking. 1751 01:23:43,655 --> 01:23:46,490 The 7 is bigger than the 2, the 4 is bigger than the 2. 1752 01:23:46,490 --> 01:23:47,990 But the 1 is smaller than the 2. 
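Brian's scan, looking at one number at a time and remembering the smallest seen so far, might be sketched in C like this. The function name `index_of_smallest` and the `start` parameter are assumptions added for illustration; they are not from the lecture.

```c
// Returns the index of the smallest value in values[start] .. values[n - 1].
// Assumes 0 <= start < n. (Name and signature are hypothetical.)
int index_of_smallest(const int values[], int start, int n)
{
    // The first number looked at is, so far, the smallest seen
    int smallest = start;
    for (int i = start + 1; i < n; i++)
    {
        // One comparison per remaining number: is this one smaller?
        if (values[i] < values[smallest])
        {
            smallest = i;
        }
    }
    return smallest;
}
```

On the array 6 3 8 5 2 7 4 1 with `start` at 0, the remembered position moves from the 6 to the 3, then to the 2, and finally to the 1 at index 7, just as narrated.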
1753 01:23:47,990 --> 01:23:50,750 So now I've made my way all the way to the end of the array. 1754 01:23:50,750 --> 01:23:53,023 And 1, I can say, is the smallest number that I found. 1755 01:23:53,023 --> 01:23:53,690 DAVID MALAN: OK. 1756 01:23:53,690 --> 01:23:56,232 So what I'm hearing is you're doing all of these comparisons, 1757 01:23:56,232 --> 01:23:58,867 also similar to what Peter implied, and you keep checking, 1758 01:23:58,867 --> 01:24:00,950 is this smaller, is this smaller, is this smaller, 1759 01:24:00,950 --> 01:24:04,100 and you're keeping track of the currently smallest number you've seen? 1760 01:24:04,100 --> 01:24:05,180 BRIAN: Yeah, that sounds about right. 1761 01:24:05,180 --> 01:24:05,420 DAVID MALAN: All right. 1762 01:24:05,420 --> 01:24:06,260 So you found it. 1763 01:24:06,260 --> 01:24:07,930 And I think it belongs at the beginning. 1764 01:24:07,930 --> 01:24:09,760 So how do we put this into place now? 1765 01:24:09,760 --> 01:24:11,250 BRIAN: Yeah, so I want to put it at the beginning. 1766 01:24:11,250 --> 01:24:12,583 There's not really space for it. 1767 01:24:12,583 --> 01:24:15,383 So I could make space for it, just by shifting these numbers over. 1768 01:24:15,383 --> 01:24:16,050 DAVID MALAN: OK. 1769 01:24:16,050 --> 01:24:16,500 Wait, wait. 1770 01:24:16,500 --> 01:24:19,375 But I feel like you're just-- now you're doubling the amount of work. 1771 01:24:19,375 --> 01:24:20,708 I feel like-- don't do all that. 1772 01:24:20,708 --> 01:24:23,167 That feels like you're going to do more steps than we need. 1773 01:24:23,167 --> 01:24:24,530 What else could we do here? 1774 01:24:24,530 --> 01:24:25,030 BRIAN: OK. 1775 01:24:25,030 --> 01:24:27,343 So the other option is, it needs to go in this spot, 1776 01:24:27,343 --> 01:24:28,760 like this first spot in the array. 1777 01:24:28,760 --> 01:24:30,267 So I could just put it there. 
1778 01:24:30,267 --> 01:24:33,350 But if I do that, I'm going to have to take the 6 which is there right now 1779 01:24:33,350 --> 01:24:34,790 and pull the 6 out. 1780 01:24:34,790 --> 01:24:35,060 DAVID MALAN: All right, but I think that's-- 1781 01:24:35,060 --> 01:24:37,170 BRIAN: So the 1 is in the right place, but the 6 isn't. 1782 01:24:37,170 --> 01:24:37,670 DAVID MALAN: Yeah, I agree. 1783 01:24:37,670 --> 01:24:38,878 But I think that's OK, right? 1784 01:24:38,878 --> 01:24:42,660 Because these numbers started randomly, and so the 6 is in the wrong place 1785 01:24:42,660 --> 01:24:43,160 anyway. 1786 01:24:43,160 --> 01:24:46,310 I don't think we're making the problem any worse by just moving it elsewhere. 1787 01:24:46,310 --> 01:24:49,280 And indeed, it's a lot faster, I would think, to just swap two numbers, 1788 01:24:49,280 --> 01:24:51,530 move one to the other and vice versa, than 1789 01:24:51,530 --> 01:24:54,020 shift all of those numbers in between. 1790 01:24:54,020 --> 01:24:54,520 BRIAN: Yeah. 1791 01:24:54,520 --> 01:24:56,728 So I took the 1 out of the position at the very end 1792 01:24:56,728 --> 01:24:58,770 of the array, all the way on the right-hand side. 1793 01:24:58,770 --> 01:25:01,320 So I guess I could take the 6 and just put it there, 1794 01:25:01,320 --> 01:25:04,100 because that's where there's an open space to put the number. 1795 01:25:04,100 --> 01:25:04,370 DAVID MALAN: Yeah. 1796 01:25:04,370 --> 01:25:07,110 And it's not exactly in the right space, but again, it's no worse off. 1797 01:25:07,110 --> 01:25:07,610 So I like that. 1798 01:25:07,610 --> 01:25:07,820 All right.
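The swap just described, putting the 1 where the 6 was and the 6 where the 1 was, could look like this in C. The name `swap` and its index-based signature are assumptions for this sketch.

```c
// Exchanges the values at positions i and j of the array.
// This takes a fixed three assignments no matter how far apart i and j are,
// whereas shifting everything in between would take work proportional
// to the distance between them.
void swap(int values[], int i, int j)
{
    int tmp = values[i];   // set the first value aside ("pull the 6 out")
    values[i] = values[j]; // put the smallest where it belongs
    values[j] = tmp;       // drop the displaced value into the open spot
}
```

Swapping positions 0 and 7 of 6 3 8 5 2 7 4 1 gives 1 3 8 5 2 7 4 6, with the six numbers in between untouched.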
1799 01:25:07,820 --> 01:25:10,487 But now, the fact that the 1 is in the right place-- and indeed, 1800 01:25:10,487 --> 01:25:12,380 you've illuminated it to indicate as much-- 1801 01:25:12,380 --> 01:25:14,630 I feel like we can pretty much ignore the 1 henceforth 1802 01:25:14,630 --> 01:25:16,680 and now just select the next smallest element. 1803 01:25:16,680 --> 01:25:18,210 So can you walk us through that? 1804 01:25:18,210 --> 01:25:19,790 BRIAN: Yeah, so I guess I'd repeat the same process. 1805 01:25:19,790 --> 01:25:20,870 I'd start with the 3. 1806 01:25:20,870 --> 01:25:22,928 That's the smallest number I've found so far. 1807 01:25:22,928 --> 01:25:23,720 And I keep looking. 1808 01:25:23,720 --> 01:25:26,750 The 8 is bigger than the 3, the 5 is bigger than the 3. 1809 01:25:26,750 --> 01:25:27,990 The 2 is smaller than the 3. 1810 01:25:27,990 --> 01:25:29,060 So I'll remember that 2. 1811 01:25:29,060 --> 01:25:31,023 That's the smallest thing I've seen so far. 1812 01:25:31,023 --> 01:25:34,190 And then I just need to check to see if there's anything smaller than the 2. 1813 01:25:34,190 --> 01:25:36,558 And I look at the 7, the 4, and the 6. 1814 01:25:36,558 --> 01:25:38,100 None of those are smaller than the 2. 1815 01:25:38,100 --> 01:25:41,623 So the 2, I can say is the next smallest number for the array. 1816 01:25:41,623 --> 01:25:42,290 DAVID MALAN: OK. 1817 01:25:42,290 --> 01:25:43,808 And where would you put that then? 1818 01:25:43,808 --> 01:25:45,600 BRIAN: That needs to go in the second spot. 1819 01:25:45,600 --> 01:25:47,060 So I need to pull the 3 out. 1820 01:25:47,060 --> 01:25:50,450 And I guess I can take the 3 and just put it into this open spot, where 1821 01:25:50,450 --> 01:25:51,770 there's available space. 1822 01:25:51,770 --> 01:25:52,520 DAVID MALAN: Yeah. 
1823 01:25:52,520 --> 01:25:54,650 And I feel like it's starting to become clear 1824 01:25:54,650 --> 01:25:56,870 that we're inside some kind of loop, because you pretty much told 1825 01:25:56,870 --> 01:25:58,670 the same story again but with a different number. 1826 01:25:58,670 --> 01:26:00,710 Do you mind just continuing the algorithm to the end 1827 01:26:00,710 --> 01:26:03,168 and select the next smallest, next smallest, next smallest, 1828 01:26:03,168 --> 01:26:04,001 and get that sorted? 1829 01:26:04,001 --> 01:26:04,501 BRIAN: Sure. 1830 01:26:04,501 --> 01:26:05,225 So we got the 8. 1831 01:26:05,225 --> 01:26:07,768 5 is smaller than that, 3 is smaller than that. 1832 01:26:07,768 --> 01:26:09,560 And then the rest of the numbers are 7, 4, 6. 1833 01:26:09,560 --> 01:26:10,760 Those are all bigger. 1834 01:26:10,760 --> 01:26:13,850 So the 3, that's going to go into sorted position here. 1835 01:26:13,850 --> 01:26:15,900 And I'll take the 8 and swap it. 1836 01:26:15,900 --> 01:26:17,380 Now I'm going to look at the 5. 1837 01:26:17,380 --> 01:26:18,830 8 and 7 are both bigger. 1838 01:26:18,830 --> 01:26:21,680 The 4 is smaller than the 5, but the 6 is bigger. 1839 01:26:21,680 --> 01:26:24,540 So the 4, that's the smallest number I've seen so far. 1840 01:26:24,540 --> 01:26:28,160 So the 4, that's going to go into place, and I'll swap it with the 5. 1841 01:26:28,160 --> 01:26:29,485 And now I've got the 8. 1842 01:26:29,485 --> 01:26:31,610 The 7 is smaller than the 8, so I'll remember that. 1843 01:26:31,610 --> 01:26:33,040 5 is smaller than that. 1844 01:26:33,040 --> 01:26:34,550 The 6 is bigger. 1845 01:26:34,550 --> 01:26:37,190 So the 5, that's going to be the next number. 1846 01:26:37,190 --> 01:26:39,877 And now I'm left with 7. 1847 01:26:39,877 --> 01:26:41,960 8 is bigger, so 7 is still the smallest I've seen. 1848 01:26:41,960 --> 01:26:45,350 But 6 is smaller, so 6 goes next.
1849 01:26:45,350 --> 01:26:47,570 And now I'm down to the last two. 1850 01:26:47,570 --> 01:26:50,900 And between the last two, the 8 and the 7, the 7 is smaller. 1851 01:26:50,900 --> 01:26:53,450 So the 7 is going to go in this spot. 1852 01:26:53,450 --> 01:26:55,730 And at this point, I've only got one number left. 1853 01:26:55,730 --> 01:26:58,460 So that number must be in sorted position. 1854 01:26:58,460 --> 01:27:01,310 And now I would say that this is a sorted array of numbers. 1855 01:27:01,310 --> 01:27:02,060 DAVID MALAN: Nice. 1856 01:27:02,060 --> 01:27:04,250 So it definitely seems to be correct. 1857 01:27:04,250 --> 01:27:05,600 It felt a little slow. 1858 01:27:05,600 --> 01:27:07,808 But of course, the computer could do this much faster 1859 01:27:07,808 --> 01:27:09,132 than we, using an actual array. 1860 01:27:09,132 --> 01:27:11,090 And if you don't mind my making an observation, 1861 01:27:11,090 --> 01:27:15,870 it looks like if we have eight numbers to begin with, or n more generally, 1862 01:27:15,870 --> 01:27:20,120 it looks like you essentially did n minus 1 comparisons, 1863 01:27:20,120 --> 01:27:23,897 because you kept comparing numbers again-- actually, did n comparisons. 1864 01:27:23,897 --> 01:27:25,730 You looked at the first number, and then you 1865 01:27:25,730 --> 01:27:29,000 compared it again and again and again at all of the other possible values 1866 01:27:29,000 --> 01:27:31,440 in order to find the smallest element. 1867 01:27:31,440 --> 01:27:31,940 BRIAN: Yeah. 1868 01:27:31,940 --> 01:27:35,190 Because for each of the numbers in the array, I had to do a comparison to see, 1869 01:27:35,190 --> 01:27:38,250 is it smaller than the smallest thing that I've seen so far? 1870 01:27:38,250 --> 01:27:40,540 And if it is smaller, then I needed to remember that. 1871 01:27:40,540 --> 01:27:41,290 DAVID MALAN: Yeah.
1872 01:27:41,290 --> 01:27:44,510 So in each pass you considered every number, so a total of n numbers first. 1873 01:27:44,510 --> 01:27:46,760 So you found the number 1, you put it in its place, 1874 01:27:46,760 --> 01:27:50,300 and that left you, to be clear, with n minus 1 numbers thereafter. 1875 01:27:50,300 --> 01:27:53,630 And then after that, n minus 2 numbers, n minus 3 numbers, dot, 1876 01:27:53,630 --> 01:27:56,130 dot, dot, all the way down to one final number. 1877 01:27:56,130 --> 01:27:57,440 So I think this is correct. 1878 01:27:57,440 --> 01:28:00,050 And I think that's a pretty deliberate way 1879 01:28:00,050 --> 01:28:03,428 of sorting these elements, a little more deliberately than your first approach, 1880 01:28:03,428 --> 01:28:05,720 Brian, which I might describe as a little more organic. 1881 01:28:05,720 --> 01:28:06,560 You kind of did it like-- 1882 01:28:06,560 --> 01:28:09,352 more like a human, just kind of eyeballing things and moving things 1883 01:28:09,352 --> 01:28:09,930 around. 1884 01:28:09,930 --> 01:28:11,722 But if we were to translate this into code, 1885 01:28:11,722 --> 01:28:13,790 recall that we have to be ever so precise. 1886 01:28:13,790 --> 01:28:18,470 And so let me consider altogether how exactly we might translate what Brian 1887 01:28:18,470 --> 01:28:20,820 did ultimately to, again, pseudocode. 1888 01:28:20,820 --> 01:28:23,280 So what he did is actually an algorithm that has a name. 1889 01:28:23,280 --> 01:28:24,990 It's called selection sort. 1890 01:28:24,990 --> 01:28:25,530 Why? 1891 01:28:25,530 --> 01:28:27,380 Well, it's sorting the elements ultimately. 1892 01:28:27,380 --> 01:28:30,530 And it's doing so by having Brian, or really the computer, 1893 01:28:30,530 --> 01:28:33,830 select the smallest elements again and again and again. 1894 01:28:33,830 --> 01:28:35,660 And once you've found each such small element, 1895 01:28:35,660 --> 01:28:37,820 you get the added benefit of just ignoring it.
1896 01:28:37,820 --> 01:28:39,680 Indeed, every time Brian lit up a number, 1897 01:28:39,680 --> 01:28:43,610 he didn't need to keep comparing it, so the amount of work he was doing 1898 01:28:43,610 --> 01:28:47,300 was decreasing each iteration-- n numbers, then n minus 1, 1899 01:28:47,300 --> 01:28:49,950 then n minus 2, n minus 3, and so forth. 1900 01:28:49,950 --> 01:28:53,450 And so we can think about the running time of this algorithm 1901 01:28:53,450 --> 01:28:56,820 as being manifest in its actual pseudocode. 1902 01:28:56,820 --> 01:28:58,483 So how might we define the pseudocode? 1903 01:28:58,483 --> 01:29:00,650 Well, let me propose that we think of it like this-- 1904 01:29:00,650 --> 01:29:03,300 for i from 0 to n minus 1. 1905 01:29:03,300 --> 01:29:05,390 Now, undoubtedly this is probably the most cryptic 1906 01:29:05,390 --> 01:29:07,972 looking line of the three lines of pseudocode on the screen. 1907 01:29:07,972 --> 01:29:09,680 But again, this is the kind of thing that 1908 01:29:09,680 --> 01:29:13,470 should become rote memory over time, or just instincts with code. 1909 01:29:13,470 --> 01:29:15,470 We've seen in C how you can write a for loop. 1910 01:29:15,470 --> 01:29:18,680 For loops typically, by convention, start counting at 0. 1911 01:29:18,680 --> 01:29:22,400 But if you have n elements, you don't want to count up through n. 1912 01:29:22,400 --> 01:29:27,560 You want to count up to n or equivalently up through n minus 1, 1913 01:29:27,560 --> 01:29:29,430 so from 0 to n minus 1. 1914 01:29:29,430 --> 01:29:29,930 All right. 1915 01:29:29,930 --> 01:29:31,580 Now what do I want to do on the next-- 1916 01:29:31,580 --> 01:29:32,900 on the first iteration? 1917 01:29:32,900 --> 01:29:37,580 Find the smallest item between the i-th item and the last item. 1918 01:29:37,580 --> 01:29:40,580 So this is not quite obvious, I think, at first glance. 
1919 01:29:40,580 --> 01:29:43,700 But I do think it's a fair characterization of what Brian did. 1920 01:29:43,700 --> 01:29:47,540 Because if i is initialized to 0, that was like Brian pointing 1921 01:29:47,540 --> 01:29:51,830 his left hand at the first number on the very left of the shelf. 1922 01:29:51,830 --> 01:29:56,250 And what he then did was he found the smallest element between the i-th item, 1923 01:29:56,250 --> 01:29:58,655 the first item 0, and the last item. 1924 01:29:58,655 --> 01:30:00,530 So that's kind of a very fancy way of saying, 1925 01:30:00,530 --> 01:30:04,790 Brian, find the smallest elements among all n elements. 1926 01:30:04,790 --> 01:30:09,680 Then what he did was swapped the smallest item with the i-th item. 1927 01:30:09,680 --> 01:30:12,140 So he just did that switcheroo, so as to not have 1928 01:30:12,140 --> 01:30:15,080 to waste time shifting everything over. 1929 01:30:15,080 --> 01:30:17,330 He instead just made room for it by swapping it 1930 01:30:17,330 --> 01:30:20,217 with the value that was in its wrong place. 1931 01:30:20,217 --> 01:30:23,300 But now in the next iteration of this loop, consider how a for loop works. 1932 01:30:23,300 --> 01:30:25,582 You do an i++ implicitly in pseudocode. 1933 01:30:25,582 --> 01:30:26,790 That's what's happening here. 1934 01:30:26,790 --> 01:30:28,370 So now i equals 1. 1935 01:30:28,370 --> 01:30:33,800 Find the smallest item between the i-th item, item 1 0 indexed, 1936 01:30:33,800 --> 01:30:34,880 and the last item. 1937 01:30:34,880 --> 01:30:39,140 So this is a fancy way of saying, Brian, check all of the n elements 1938 01:30:39,140 --> 01:30:42,200 again, except for the first, because now you're 1939 01:30:42,200 --> 01:30:45,680 starting at location 1 instead of location 0. 1940 01:30:45,680 --> 01:30:47,300 And now the algorithm proceeds. 
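The three lines of pseudocode can be translated into C roughly as follows. This is a sketch rather than the course's official implementation: the function name is an assumption, and the outer loop stops one position early because, as Brian observed, the single number left at the end must already be in sorted position.

```c
// Selection sort, following the pseudocode:
//   for i from 0 to n - 1
//       find smallest item between i-th item and last item
//       swap smallest item with i-th item
void selection_sort(int values[], int n)
{
    // Stopping at n - 2 is safe: once the other n - 1 items are placed,
    // the last remaining item is necessarily the largest.
    for (int i = 0; i < n - 1; i++)
    {
        // Find the smallest item between the i-th item and the last item
        int smallest = i;
        for (int j = i + 1; j < n; j++)
        {
            if (values[j] < values[smallest])
            {
                smallest = j;
            }
        }

        // Swap the smallest item with the i-th item
        int tmp = values[i];
        values[i] = values[smallest];
        values[smallest] = tmp;
    }
}
```

Everything to the left of `i` is already sorted and is never revisited, which corresponds to the numbers Brian lit up and then ignored.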
1941 01:30:47,300 --> 01:30:49,490 So you could write this code in different ways 1942 01:30:49,490 --> 01:30:51,410 in English-like pseudocode, but this seems 1943 01:30:51,410 --> 01:30:54,650 to be a reasonable formulation of exactly that algorithm. 1944 01:30:54,650 --> 01:30:57,470 But let's see it a little more visually now, 1945 01:30:57,470 --> 01:31:00,808 without all of the switching around of the humans moving around the numbers. 1946 01:31:00,808 --> 01:31:02,600 Let me go ahead and use this visualization. 1947 01:31:02,600 --> 01:31:04,040 And we'll put a link on the course's website 1948 01:31:04,040 --> 01:31:05,707 if you'd like to play with this as well. 1949 01:31:05,707 --> 01:31:10,520 This is just someone's visualization of an array of numbers. 1950 01:31:10,520 --> 01:31:14,540 But this time, rather than represent the numbers as symbols, decimal digits, 1951 01:31:14,540 --> 01:31:17,510 now this person is using vertical bars, like a bar chart. 1952 01:31:17,510 --> 01:31:20,720 And what this means is that a small bar is like a small number, 1953 01:31:20,720 --> 01:31:22,430 and a big bar is a big number. 1954 01:31:22,430 --> 01:31:26,810 So the goal here is to sort these bars, which equivalently might as well be numbers, 1955 01:31:26,810 --> 01:31:29,777 from short bars over to tall bars, left to right. 1956 01:31:29,777 --> 01:31:30,860 And I'm going to go ahead. 1957 01:31:30,860 --> 01:31:34,100 And along the top here, I can choose my sorting algorithm. 1958 01:31:34,100 --> 01:31:36,830 And the one we just described, recall, was selection sort. 1959 01:31:36,830 --> 01:31:39,170 So let me go ahead and do this. 1960 01:31:39,170 --> 01:31:41,900 And notice, it takes a moment, I think, to wrap your mind
1962 01:31:43,740 --> 01:31:48,200 But notice that this pink line is going from left to right, because that's 1963 01:31:48,200 --> 01:31:49,730 essentially what Brian was doing. 1964 01:31:49,730 --> 01:31:52,730 He was walking back and forth, back and forth, back and forth 1965 01:31:52,730 --> 01:31:56,180 through that shelf of numbers, looking for the next smallest number, 1966 01:31:56,180 --> 01:32:00,260 and he kept putting the smallest number over on the left where it belongs. 1967 01:32:00,260 --> 01:32:02,660 And indeed, that's why in this visualization 1968 01:32:02,660 --> 01:32:07,730 you see the small numbers beginning to be put into place on the left 1969 01:32:07,730 --> 01:32:09,500 as we keep swooping through. 1970 01:32:09,500 --> 01:32:13,760 But notice, the colored bar keeps starting later and later, 1971 01:32:13,760 --> 01:32:18,222 more rightward and more rightward, just like Brian was not retracing his steps. 1972 01:32:18,222 --> 01:32:20,430 As soon as he lit up the numbers, he left them alone. 1973 01:32:20,430 --> 01:32:23,257 And voila, all of these numbers are now sorted. 1974 01:32:23,257 --> 01:32:26,090 So that's just a graphical way of thinking about the same algorithm. 1975 01:32:26,090 --> 01:32:28,970 But how efficient or inefficient was that? 1976 01:32:28,970 --> 01:32:31,100 Well, let's see if we can apply some numbers here. 1977 01:32:31,100 --> 01:32:33,517 But there's also ways to do this a little more intuitively 1978 01:32:33,517 --> 01:32:35,160 over time, which we'll do, too. 1979 01:32:35,160 --> 01:32:38,450 So if the first time through the shelf of numbers, he 1980 01:32:38,450 --> 01:32:41,420 had eight numbers at his disposal-- he had to look at all eight numbers 1981 01:32:41,420 --> 01:32:43,560 in order to decide which of these is the smallest. 1982 01:32:43,560 --> 01:32:45,380 So that's n steps initially. 
1983 01:32:45,380 --> 01:32:48,090 The next time he did a pass through the shelf, 1984 01:32:48,090 --> 01:32:51,140 he ignored the brightly lit number 1, because it was already 1985 01:32:51,140 --> 01:32:53,660 in place by definition of what he had already done. 1986 01:32:53,660 --> 01:32:56,480 So now he had n minus 1 steps to go. 1987 01:32:56,480 --> 01:33:01,860 Then he did another n minus 2 steps, then n minus 3, n minus 4, n minus 5, 1988 01:33:01,860 --> 01:33:05,270 dot, dot, dot, all the way down to the final step, where he just 1989 01:33:05,270 --> 01:33:08,210 had to find and leave alone the number 8, 1990 01:33:08,210 --> 01:33:11,280 because that was the biggest number, so one single step. 1991 01:33:11,280 --> 01:33:13,988 So this is some kind of series here, mathematically. 1992 01:33:13,988 --> 01:33:17,030 You might recall something like this in, like, the back of your math book 1993 01:33:17,030 --> 01:33:19,697 or in high school, or back of your physics textbook or the like. 1994 01:33:19,697 --> 01:33:23,330 It turns out that this actually sums up to this formula here-- 1995 01:33:23,330 --> 01:33:25,692 n times n plus 1 divided by 2. 1996 01:33:25,692 --> 01:33:28,400 And if that's not familiar, you don't remember that, no big deal. 1997 01:33:28,400 --> 01:33:31,280 Just let me stipulate that the mathematical formula with which we 1998 01:33:31,280 --> 01:33:35,000 began, where we had the series of n, plus n minus 1, plus n minus 2, 1999 01:33:35,000 --> 01:33:38,150 plus n minus 3, dot, dot, dot, simply sums up ultimately 2000 01:33:38,150 --> 01:33:42,650 to the more succinct n times n plus 1 divided by 2. 2001 01:33:42,650 --> 01:33:47,000 This, of course, if we multiply it out, gives us n squared plus n divided by 2. 2002 01:33:47,000 --> 01:33:50,770 And this now, I will propose, gives us just this-- 2003 01:33:50,770 --> 01:33:54,440 n squared divided by 2 plus n/2. 
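For anyone who wants to check the stipulated formula, one standard way to see it is to write the series twice, once forwards and once backwards, and add the two copies term by term. The symbol S for the total number of steps is an assumption introduced here for illustration.

```latex
\begin{align*}
S  &= n + (n-1) + (n-2) + \dots + 1 \\
S  &= 1 + 2 + 3 + \dots + n \\
2S &= \underbrace{(n+1) + (n+1) + \dots + (n+1)}_{n \text{ terms}} = n(n+1) \\
S  &= \frac{n(n+1)}{2} = \frac{n^2 + n}{2} = \frac{n^2}{2} + \frac{n}{2}
\end{align*}
```

With the eight numbers on the shelf, n = 8 gives 8 times 9 divided by 2, or 36 steps, which agrees with adding 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 by hand.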
2004 01:33:54,440 --> 01:33:56,900 So if we really wanted to be nit-picky, this 2005 01:33:56,900 --> 01:34:01,400 is the total number of steps, or operations, or seconds, 2006 01:34:01,400 --> 01:34:04,040 however we want to measure Brian's running time. 2007 01:34:04,040 --> 01:34:07,970 This seems to be the precise mathematical formula therefore. 2008 01:34:07,970 --> 01:34:11,700 But at the beginning of this week, we considered again, 2009 01:34:11,700 --> 01:34:13,160 the sort of Big O notation. 2010 01:34:13,160 --> 01:34:16,760 With a wave of the hand, we care more about the order of magnitude 2011 01:34:16,760 --> 01:34:18,080 on which an algorithm operates. 2012 01:34:18,080 --> 01:34:22,850 I really don't care about these divided by 2 and n/2. 2013 01:34:22,850 --> 01:34:26,120 Because which of these factors is going to matter as n gets big? 2014 01:34:26,120 --> 01:34:29,210 The bigger the phone book gets, the more doors we have, 2015 01:34:29,210 --> 01:34:33,370 the more light bulbs we have, the more numbers we have on the shelf, 2016 01:34:33,370 --> 01:34:35,810 n is going to keep getting bigger and bigger and bigger. 2017 01:34:35,810 --> 01:34:38,430 And given that, which is the dominant factor? 2018 01:34:38,430 --> 01:34:42,080 Rongxin, if we could call on someone here, which of these factors, 2019 01:34:42,080 --> 01:34:48,320 n squared divided by 2, or n divided by 2, really matters in the long run 2020 01:34:48,320 --> 01:34:54,090 as our problems get bigger and bigger, as n gets bigger and bigger? 2021 01:34:54,090 --> 01:34:57,630 Which of those factors mathematically dominates? 2022 01:34:57,630 --> 01:34:58,130 Anika? 2023 01:34:58,130 --> 01:35:00,218 AUDIENCE: Oh, it's Anika, but-- 2024 01:35:00,218 --> 01:35:01,010 DAVID MALAN: Anika. 2025 01:35:01,010 --> 01:35:02,635 AUDIENCE: It would be the-- no problem. 2026 01:35:02,635 --> 01:35:04,272 It would be the n squared. 2027 01:35:04,272 --> 01:35:05,480 DAVID MALAN: Yeah, n squared. 
2028 01:35:05,480 --> 01:35:05,980 Right. 2029 01:35:05,980 --> 01:35:08,390 If you take any number for n and you square it, 2030 01:35:08,390 --> 01:35:11,150 that's going to be bigger, certainly in the long run, 2031 01:35:11,150 --> 01:35:12,720 than just doing n divided by 2. 2032 01:35:12,720 --> 01:35:15,470 And so with our Big O notation, we could describe the running time 2033 01:35:15,470 --> 01:35:20,930 of Brian's selection sort implementation as, ah, it's on the order of n squared. 2034 01:35:20,930 --> 01:35:23,150 Yes, I'm ignoring some numbers, and yes, if we really 2035 01:35:23,150 --> 01:35:25,760 wanted to be nit-picky and count up every single step 2036 01:35:25,760 --> 01:35:29,450 that Brian took, yes, it's n squared divided by 2 plus n/2. 2037 01:35:29,450 --> 01:35:33,080 But again, if you think about the problem over time and n 2038 01:35:33,080 --> 01:35:36,170 getting really large, sort of Facebook-sized, Twitter-sized, 2039 01:35:36,170 --> 01:35:39,890 Google-sized, what's really going to dominate mathematically 2040 01:35:39,890 --> 01:35:41,690 is this bigger factor here. 2041 01:35:41,690 --> 01:35:44,420 That's what's going to make the total number of steps way 2042 01:35:44,420 --> 01:35:47,370 bigger than just those smaller order terms. 2043 01:35:47,370 --> 01:35:49,250 So in Big O notation, selection sort would 2044 01:35:49,250 --> 01:35:51,540 seem to be on the order of n squared. 2045 01:35:51,540 --> 01:35:54,140 So if we consider our chart from before where 2046 01:35:54,140 --> 01:35:57,920 we had the upper bounds on our searching algorithms, 2047 01:35:57,920 --> 01:36:01,410 both linear and binary, this one, unfortunately, 2048 01:36:01,410 --> 01:36:05,765 is at really the tip top of this particular list of running times. 2049 01:36:05,765 --> 01:36:07,140 And there's infinitely many more. 
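To make the point about the dominant term concrete, here is a small worked comparison of the two terms. The sample values of n are chosen purely for illustration.

```latex
\begin{gather*}
n = 8: \qquad \frac{n^2}{2} = 32, \qquad \frac{n}{2} = 4 \\
n = 1{,}000{,}000: \qquad \frac{n^2}{2} = 5 \times 10^{11}, \qquad \frac{n}{2} = 5 \times 10^{5} \\
\frac{n/2}{n^2/2} = \frac{1}{n} \longrightarrow 0 \quad \text{as } n \to \infty
\end{gather*}
```

At n = 8 the smaller term still contributes 4 of the 36 steps, but at a million elements it is only one millionth the size of the squared term, which is why it is dropped when describing the running time as on the order of n squared.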
2050 01:36:07,140 --> 01:36:09,470 These are just a subset of the more common formulas 2051 01:36:09,470 --> 01:36:11,807 that a computer scientist might use and think about. 2052 01:36:11,807 --> 01:36:13,640 Selection sort is kind of at the top of the list. 2053 01:36:13,640 --> 01:36:15,800 And being number one on this list is bad. 2054 01:36:15,800 --> 01:36:18,980 n squared is certainly much slower than, say, 2055 01:36:18,980 --> 01:36:22,610 big O of 1, which, of course, was constant time or one step. 2056 01:36:22,610 --> 01:36:24,580 So I wonder if we could be-- 2057 01:36:24,580 --> 01:36:25,880 if we could do a little better. 2058 01:36:25,880 --> 01:36:27,770 I wonder if we could do a little better. 2059 01:36:27,770 --> 01:36:29,970 And Peter actually did say something else earlier, 2060 01:36:29,970 --> 01:36:33,890 which was about like sharing two numbers and fixing problems. 2061 01:36:33,890 --> 01:36:35,720 And if I can kind of run with that, let me 2062 01:36:35,720 --> 01:36:39,620 propose that we, Brian, return to you for a look at an algorithm that 2063 01:36:39,620 --> 01:36:43,390 might be called instead bubble sort, bubble sort 2064 01:36:43,390 --> 01:36:45,740 being a different algorithm, this one that 2065 01:36:45,740 --> 01:36:47,287 tries to fix problems more locally. 2066 01:36:47,287 --> 01:36:49,370 So in fact, Brian, if you look at the numbers that 2067 01:36:49,370 --> 01:36:51,440 are in front of you, which you've kindly reset 2068 01:36:51,440 --> 01:36:55,100 to their original, unsorted location, I feel like this really, 2069 01:36:55,100 --> 01:36:58,578 if we focus on just pairs of numbers, it's just a lot of small numbers. 2070 01:36:58,578 --> 01:37:00,620 Like last time, we tried to solve the big problem 2071 01:37:00,620 --> 01:37:02,030 and sorting the whole thing. 2072 01:37:02,030 --> 01:37:06,050 What if we just look at pairs of numbers that are adjacent to one another?
2073 01:37:06,050 --> 01:37:09,860 Can we maybe make some little tweaks and change our algorithm fundamentally? 2074 01:37:09,860 --> 01:37:14,450 So for instance, Brian, 6 and 3, what observation can you make there for us? 2075 01:37:14,450 --> 01:37:15,200 BRIAN: Yeah, sure. 2076 01:37:15,200 --> 01:37:18,233 So 6 and 3, that's the first pair of numbers in the array. 2077 01:37:18,233 --> 01:37:20,900 And if I want the array to be sorted, I want the smaller numbers 2078 01:37:20,900 --> 01:37:23,670 to be on the left and the bigger numbers to be on the right. 2079 01:37:23,670 --> 01:37:27,227 So just looking at this pair, I can tell you that the 6 and 3 are out of order. 2080 01:37:27,227 --> 01:37:29,810 The 3 should be on the left, and the 6 should be on the right. 2081 01:37:29,810 --> 01:37:30,140 DAVID MALAN: All right. 2082 01:37:30,140 --> 01:37:31,932 So let's go ahead and do that, and go ahead 2083 01:37:31,932 --> 01:37:34,070 and fix that by swapping those two. 2084 01:37:34,070 --> 01:37:35,632 And just fix a small little problem. 2085 01:37:35,632 --> 01:37:37,340 And now let's repeat this process, right? 2086 01:37:37,340 --> 01:37:39,690 Loops seem to be omnipresent in a lot of our algorithms. 2087 01:37:39,690 --> 01:37:41,213 So 6 and 8 is the next such pair. 2088 01:37:41,213 --> 01:37:43,130 What you want-- what do you think about those? 2089 01:37:43,130 --> 01:37:46,362 BRIAN: That particular pair seems OK, because the 6 is smaller and already 2090 01:37:46,362 --> 01:37:47,070 on the left side. 2091 01:37:47,070 --> 01:37:48,560 So I think I can leave this pair alone. 2092 01:37:48,560 --> 01:37:49,518 DAVID MALAN: All right. 2093 01:37:49,518 --> 01:37:50,380 How about 8 and 5? 2094 01:37:50,380 --> 01:37:51,920 BRIAN: The 8 is bigger than the 5. 2095 01:37:51,920 --> 01:37:53,300 So I'm going to swap these two. 2096 01:37:53,300 --> 01:37:55,190 The 5 should be on the left of the 8.
2097 01:37:55,190 --> 01:37:56,148 DAVID MALAN: All right. 2098 01:37:56,148 --> 01:37:56,870 And 8 and 2? 2099 01:37:56,870 --> 01:37:58,537 BRIAN: Same thing here, the 8 is bigger. 2100 01:37:58,537 --> 01:38:00,337 So the 8 is going to be swapped with the 2. 2101 01:38:00,337 --> 01:38:01,670 DAVID MALAN: All right, 8 and 7. 2102 01:38:01,670 --> 01:38:06,133 BRIAN: The 8 is bigger than the 7, so the 8 I should switch with the 7. 2103 01:38:06,133 --> 01:38:07,425 DAVID MALAN: All right 8 and 4? 2104 01:38:07,425 --> 01:38:09,968 BRIAN: 8 and 4, same thing, it's bigger than the 4. 2105 01:38:09,968 --> 01:38:11,010 DAVID MALAN: And 8 and 1. 2106 01:38:11,010 --> 01:38:12,385 BRIAN: I can do it one last time. 2107 01:38:12,385 --> 01:38:14,912 The 8 is bigger than the 1, and I think that's all. 2108 01:38:14,912 --> 01:38:16,870 DAVID MALAN: And with a nice dramatic flourish, 2109 01:38:16,870 --> 01:38:18,900 if you step off to the side, voila-- 2110 01:38:18,900 --> 01:38:20,340 not sorted. 2111 01:38:20,340 --> 01:38:23,070 In fact, it doesn't really look all that much better. 2112 01:38:23,070 --> 01:38:26,070 But I do think Brian's done something smart here. 2113 01:38:26,070 --> 01:38:29,550 Brian, can you speak to at least some of the marginal improvements 2114 01:38:29,550 --> 01:38:30,300 that you've made? 2115 01:38:30,300 --> 01:38:30,590 BRIAN: Yeah. 2116 01:38:30,590 --> 01:38:32,298 So there are some improvements, at least. 2117 01:38:32,298 --> 01:38:36,890 The 1 originally was all the way at the very end, and it moved back one spot. 2118 01:38:36,890 --> 01:38:39,390 And the other improvement, I think, is that the 8 originally 2119 01:38:39,390 --> 01:38:42,150 was way over here on the left side of the array somewhere. 2120 01:38:42,150 --> 01:38:44,130 But because the 8 is the biggest number, I 2121 01:38:44,130 --> 01:38:46,380 kept switching it over and over again until it made it 2122 01:38:46,380 --> 01:38:47,640 all the way to the end. 
2123 01:38:47,640 --> 01:38:51,270 And so now actually, I think this 8 is in the correct place. 2124 01:38:51,270 --> 01:38:54,660 It's the biggest number, and it ended up moving its way all the way 2125 01:38:54,660 --> 01:38:56,190 to the right side of the array. 2126 01:38:56,190 --> 01:38:56,550 DAVID MALAN: Yeah. 2127 01:38:56,550 --> 01:38:59,130 And this is where this algorithm that we'll see the rest of in just a moment 2128 01:38:59,130 --> 01:39:00,720 gets its name, bubble sort-- 2129 01:39:00,720 --> 01:39:04,470 alludes to the fact that the biggest numbers start bubbling their way up 2130 01:39:04,470 --> 01:39:08,220 to the top of, or the end of, the list, at the right-hand side 2131 01:39:08,220 --> 01:39:09,570 of the shelf as Brian notes. 2132 01:39:09,570 --> 01:39:14,350 But notice, as Brian does, too, the number 1 only moved over one position. 2133 01:39:14,350 --> 01:39:16,080 So there's clearly more work to be done. 2134 01:39:16,080 --> 01:39:18,690 And that's obvious from the other numbers being misordered as well. 2135 01:39:18,690 --> 01:39:19,950 But we have improved things. 2136 01:39:19,950 --> 01:39:23,760 The 8 is in place, and the 1 is closer to being in place. 2137 01:39:23,760 --> 01:39:25,377 So how might we proceed next? 2138 01:39:25,377 --> 01:39:28,210 Well, Brian, let's continue to solve some small bite-sized problems. 2139 01:39:28,210 --> 01:39:29,668 Let's start at the beginning again. 2140 01:39:29,668 --> 01:39:30,400 3 and 6? 2141 01:39:30,400 --> 01:39:30,900 BRIAN: Sure. 2142 01:39:30,900 --> 01:39:33,540 The 3 and the 6, those seem to be in order, so I'll leave those alone. 2143 01:39:33,540 --> 01:39:34,590 DAVID MALAN: 6 and 5. 2144 01:39:34,590 --> 01:39:37,420 BRIAN: 6 and 5 are out of order, so I'll go ahead and take the 6 2145 01:39:37,420 --> 01:39:38,420 and put it to the right. 2146 01:39:38,420 --> 01:39:39,300 DAVID MALAN: 6 and 2.
2147 01:39:39,300 --> 01:39:42,045 BRIAN: Those are out of order as well, so I'll swap the 2 and the 6. 2148 01:39:42,045 --> 01:39:42,930 DAVID MALAN: 6 and 7. 2149 01:39:42,930 --> 01:39:44,032 BRIAN: 6 and 7 are OK. 2150 01:39:44,032 --> 01:39:44,740 They're in order. 2151 01:39:44,740 --> 01:39:45,690 DAVID MALAN: 7 and 4. 2152 01:39:45,690 --> 01:39:48,535 BRIAN: Those are out of order, so I'll switch the 4 and the 7. 2153 01:39:48,535 --> 01:39:49,410 DAVID MALAN: 7 and 1. 2154 01:39:49,410 --> 01:39:52,300 BRIAN: And those two are out of order as well, so I'll swap those. 2155 01:39:52,300 --> 01:39:55,800 And now I think the 7 has made its way to the sorted position as well. 2156 01:39:55,800 --> 01:39:56,783 DAVID MALAN: Indeed. 2157 01:39:56,783 --> 01:39:58,200 So now we're making some progress. 2158 01:39:58,200 --> 01:40:02,100 7 has bubbled its way up to the top of the list, stopping just before the 8, 2159 01:40:02,100 --> 01:40:05,550 whereas the 1 has continued its advance to its correct location. 2160 01:40:05,550 --> 01:40:08,490 So I bet, Brian, if we keep doing this again and again 2161 01:40:08,490 --> 01:40:11,610 and again, so long as the list remains in part unsorted, 2162 01:40:11,610 --> 01:40:13,530 I think we'll probably get to the finish line. 2163 01:40:13,530 --> 01:40:15,690 Do you want to take it from here and sort the rest? 2164 01:40:15,690 --> 01:40:16,270 BRIAN: Yeah, sure. 2165 01:40:16,270 --> 01:40:17,830 So I just repeat the process again. 2166 01:40:17,830 --> 01:40:19,410 The 3 and the 5 are OK. 2167 01:40:19,410 --> 01:40:22,070 The 2 and the 5 are out of order, so I'll swap them. 2168 01:40:22,070 --> 01:40:24,330 The 5 and the 6, those are fine as a pair. 2169 01:40:24,330 --> 01:40:28,200 The 6 and the 4, out of order relative to each other, so I'll switch those. 2170 01:40:28,200 --> 01:40:31,510 And the 6 and the 1, those are out of order as well, so I'll swap those. 
2171 01:40:31,510 --> 01:40:34,680 And now the 6, that I can say is in its correct position. 2172 01:40:34,680 --> 01:40:36,010 And I'll repeat it again. 2173 01:40:36,010 --> 01:40:38,820 The 3 and the 2 are out of order, so those get switched. 2174 01:40:38,820 --> 01:40:40,380 The 3 and the 5 are OK. 2175 01:40:40,380 --> 01:40:43,080 The 5 and the 4 are out of order, so those get switched. 2176 01:40:43,080 --> 01:40:47,520 And then the 5 and the 1 need to be switched as well. 2177 01:40:47,520 --> 01:40:49,325 So there's the 5 in sorted position. 2178 01:40:49,325 --> 01:40:50,700 And now I'm left with these four. 2179 01:40:50,700 --> 01:40:53,070 The 2 and the 3 are OK, the 3 and the 4 OK. 2180 01:40:53,070 --> 01:40:55,110 But the 4 and the 1 are out of order. 2181 01:40:55,110 --> 01:40:58,790 So those get switched, and now the 4, that's in its place. 2182 01:40:58,790 --> 01:41:02,720 The 2 and the 3 are OK, but the 3 and the 1 are not, so I'll swap those. 2183 01:41:02,720 --> 01:41:05,700 And now the 3 goes into its sorted place. 2184 01:41:05,700 --> 01:41:08,900 And then finally, the last pair to consider is just the 2 and the 1. 2185 01:41:08,900 --> 01:41:12,690 Those are out of order, so I'll swap those, and now the 2 is in place. 2186 01:41:12,690 --> 01:41:16,280 And 1 is the only remaining number, so I can say that that one's in place, too. 2187 01:41:16,280 --> 01:41:18,440 And now I think we have a sorted array. 2188 01:41:18,440 --> 01:41:19,190 DAVID MALAN: Nice. 2189 01:41:19,190 --> 01:41:21,860 So it felt like this was a fundamentally different approach, 2190 01:41:21,860 --> 01:41:23,762 but we still got to the same end point. 2191 01:41:23,762 --> 01:41:26,720 So that really now invites the question as to whether bubble sort was 2192 01:41:26,720 --> 01:41:29,370 better or worse or maybe no different. 2193 01:41:29,370 --> 01:41:33,270 But notice, too, that we've solved the same problem fundamentally differently.
2194 01:41:33,270 --> 01:41:36,427 The first time, we took the more human natural intuition of just, 2195 01:41:36,427 --> 01:41:37,510 find the smallest element. 2196 01:41:37,510 --> 01:41:39,480 All right, do it again, do it again, do it again. 2197 01:41:39,480 --> 01:41:42,230 This time, we sort of viewed the problem through a different lens. 2198 01:41:42,230 --> 01:41:44,360 And we thought about, it would seem, what does it 2199 01:41:44,360 --> 01:41:46,130 mean for the list to be unsorted? 2200 01:41:46,130 --> 01:41:48,260 As Peter noted, it's when things are out of order. 2201 01:41:48,260 --> 01:41:51,380 Like that very basic primitive where something is out of order 2202 01:41:51,380 --> 01:41:54,380 suggests an opportunity to solve the problem that way. 2203 01:41:54,380 --> 01:41:57,170 Just fix all of the tiny bite-sized problems. 2204 01:41:57,170 --> 01:42:00,470 And it would seem that using a loop, if we repeat that intuition, 2205 01:42:00,470 --> 01:42:03,506 is going to pay off eventually by fixing, fixing, fixing, 2206 01:42:03,506 --> 01:42:06,620 fixing all of the little problems until the big one itself 2207 01:42:06,620 --> 01:42:08,000 would seem to go away. 2208 01:42:08,000 --> 01:42:12,150 Well, let me return to the visualization from before, re-randomize the bars-- 2209 01:42:12,150 --> 01:42:15,480 short bar is small number, big bar is big number. 2210 01:42:15,480 --> 01:42:17,900 And let me go ahead and run the bubble sort algorithm, 2211 01:42:17,900 --> 01:42:20,390 this time with this visualization. 2212 01:42:20,390 --> 01:42:24,680 And you'll notice now sweeping from left to right are two colored bars that 2213 01:42:24,680 --> 01:42:30,320 represent the comparison of two adjacent numbers again and again and again. 
2214 01:42:30,320 --> 01:42:33,770 And you'll see this time that the bars are being a little smart, 2215 01:42:33,770 --> 01:42:36,470 and they're not going all the way to the end every time, 2216 01:42:36,470 --> 01:42:39,440 just like Brian illuminated the numbers and stopped 2217 01:42:39,440 --> 01:42:43,080 looking at the 8 and the 7 and the 6 once they were in place. 2218 01:42:43,080 --> 01:42:46,910 But he and this visualization do indeed keep returning to the beginning, 2219 01:42:46,910 --> 01:42:50,330 doing another pass, another pass, and another pass. 2220 01:42:50,330 --> 01:42:53,270 So if we think ahead to the analysis of this algorithm, 2221 01:42:53,270 --> 01:42:57,560 it sort of invites us to consider, well, how many total comparisons are there 2222 01:42:57,560 --> 01:42:58,610 this time? 2223 01:42:58,610 --> 01:43:01,280 It would seem that the very first time through the bars, 2224 01:43:01,280 --> 01:43:04,250 or equivalently the very first time through the shelf, Brian 2225 01:43:04,250 --> 01:43:07,830 and this visualization did like n minus 1 comparisons. 2226 01:43:07,830 --> 01:43:10,090 So n minus 1 comparisons from left to right, out 2227 01:43:10,090 --> 01:43:13,880 of n elements you can compare n minus 1 adjacencies. 2228 01:43:13,880 --> 01:43:17,780 After that it was n minus 2, n minus 3, n minus 4, 2229 01:43:17,780 --> 01:43:22,050 n minus 5, until just two or one remain, and at that point you're done. 2230 01:43:22,050 --> 01:43:25,070 So even though this algorithm fundamentally took a different approach 2231 01:43:25,070 --> 01:43:29,160 and achieved the same goal, it sorted the elements successfully. 2232 01:43:29,160 --> 01:43:31,400 Let's consider how it was implemented in code 2233 01:43:31,400 --> 01:43:35,270 and whether it's actually a little faster or a little slower. 2234 01:43:35,270 --> 01:43:37,910 And let's set one final bar, in fact, too. 
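The comparison counts just described can be added up the same way as for selection sort. This is a sketch under the assumption that each pass is one comparison shorter than the previous one, as in Brian's demonstration; the symbol C for the total number of comparisons is introduced here for illustration.

```latex
\begin{align*}
C &= (n-1) + (n-2) + (n-3) + \dots + 1 \\
  &= \frac{(n-1)\,n}{2} = \frac{n^2}{2} - \frac{n}{2}
\end{align*}
```

For the eight numbers on the shelf that is 7 times 8 divided by 2, or 28 comparisons. The leading term is again n squared divided by 2, so on a list that needs every pass, this count grows on the same order as the one for selection sort.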
2235 01:43:37,910 --> 01:43:42,218 Earlier, we considered only the upper bound on selection sort, 2236 01:43:42,218 --> 01:43:44,510 just so that we have something to compare this against. 2237 01:43:44,510 --> 01:43:48,590 Let's also consider for a moment what the running time is 2238 01:43:48,590 --> 01:43:52,940 of selection sort in terms of a lower bound-- best case scenario. 2239 01:43:52,940 --> 01:43:56,690 With selection sort, if you have n elements, 2240 01:43:56,690 --> 01:43:59,810 and you keep looking for the next smallest element, again and again 2241 01:43:59,810 --> 01:44:04,390 and again, it turns out that selection sort is not really our friend. 2242 01:44:04,390 --> 01:44:07,790 Here's, for instance, the chart of where we left off in terms of omega notation 2243 01:44:07,790 --> 01:44:08,450 before. 2244 01:44:08,450 --> 01:44:10,850 Linear search and binary search could very well 2245 01:44:10,850 --> 01:44:13,910 get lucky and take just one step if you happen to open a door 2246 01:44:13,910 --> 01:44:17,340 and, voila, the number you're looking for is already there. 2247 01:44:17,340 --> 01:44:20,360 But with selection sort, as we've implemented it, 2248 01:44:20,360 --> 01:44:23,150 both with Brian and with the visualization, 2249 01:44:23,150 --> 01:44:26,990 unfortunately it's not so good with the lower bound. 2250 01:44:26,990 --> 01:44:27,920 Why? 2251 01:44:27,920 --> 01:44:32,720 Well, Brian pretty naively, every time he searched for a number, 2252 01:44:32,720 --> 01:44:37,230 started at the left and went all the way to the right, started at the left, 2253 01:44:37,230 --> 01:44:38,480 went all the way to the right. 2254 01:44:38,480 --> 01:44:41,808 To be fair, he did ignore the numbers that were already in place. 2255 01:44:41,808 --> 01:44:44,600 So he didn't keep looking at the 1, he didn't keep looking at the 2 2256 01:44:44,600 --> 01:44:46,260 once they were in place.
2257 01:44:46,260 --> 01:44:50,390 But he did keep repeating himself again and again, touching 2258 01:44:50,390 --> 01:44:52,400 those numbers multiple times each. 2259 01:44:52,400 --> 01:44:55,400 So again, even though you and I, the humans, could look at those numbers 2260 01:44:55,400 --> 01:44:57,692 and be like, obviously there's the 1, obviously there's 2261 01:44:57,692 --> 01:44:59,540 the 2, obviously there's the 3, Brian 2262 01:44:59,540 --> 01:45:01,430 had to do it much more methodically. 2263 01:45:01,430 --> 01:45:07,220 And in fact, even if that list of numbers were perfectly sorted, 2264 01:45:07,220 --> 01:45:09,480 he would have wasted just as much time. 2265 01:45:09,480 --> 01:45:11,840 In fact, Brian, if you don't mind, could you quickly 2266 01:45:11,840 --> 01:45:14,510 sort all eight numbers again? 2267 01:45:14,510 --> 01:45:17,210 And Brian, if we start with a sorted list, 2268 01:45:17,210 --> 01:45:21,190 this is kind of a nice perversion to consider, if you will, algorithmically. 2269 01:45:21,190 --> 01:45:22,940 When analyzing an algorithm, sometimes you 2270 01:45:22,940 --> 01:45:25,358 want to consider best cases and worst cases. 2271 01:45:25,358 --> 01:45:28,400 And there would seem to be nothing better than, heck, the list is already 2272 01:45:28,400 --> 01:45:31,530 sorted, you got lucky, there's really no work to be done. 2273 01:45:31,530 --> 01:45:34,040 The worst case is the list is maybe completely backwards, 2274 01:45:34,040 --> 01:45:36,020 and that's a huge amount of work to be done. 2275 01:45:36,020 --> 01:45:40,262 Unfortunately, selection sort doesn't really optimize for that lucky case 2276 01:45:40,262 --> 01:45:41,470 where they're already sorted. 2277 01:45:41,470 --> 01:45:44,780 So Brian, I see you've resorted the numbers for us from left to right.
2278 01:45:44,780 --> 01:45:48,380 If we were to re-execute selection sort as before, 2279 01:45:48,380 --> 01:45:51,005 how would you go about finding the smallest number? 2280 01:45:51,005 --> 01:45:53,630 BRIAN: So we decided earlier that, to find the smallest number, 2281 01:45:53,630 --> 01:45:55,422 I need to look at all the numbers from left 2282 01:45:55,422 --> 01:45:58,580 to right in the array and each time check to see if I found something 2283 01:45:58,580 --> 01:45:59,390 smaller. 2284 01:45:59,390 --> 01:46:00,590 So I would start with the 1. 2285 01:46:00,590 --> 01:46:02,530 That's the smallest thing I've seen so far. 2286 01:46:02,530 --> 01:46:04,070 But I would have to keep looking, because maybe there's 2287 01:46:04,070 --> 01:46:05,780 a 0 or a negative number later on. 2288 01:46:05,780 --> 01:46:08,020 I need to check to see if there's anything smaller. 2289 01:46:08,020 --> 01:46:11,240 So I would check, the 2 is bigger, the 3, 4, 5, 6, 7, 8. 2290 01:46:11,240 --> 01:46:12,117 They're all bigger. 2291 01:46:12,117 --> 01:46:13,700 So it turns out I was right all along. 2292 01:46:13,700 --> 01:46:16,730 The 1 was the smallest number, and it's already in place. 2293 01:46:16,730 --> 01:46:18,590 So now that number is in place. 2294 01:46:18,590 --> 01:46:20,090 DAVID MALAN: And then to find the next smallest number, 2295 01:46:20,090 --> 01:46:21,260 what would you have done? 2296 01:46:21,260 --> 01:46:22,635 BRIAN: I would do the same thing. 2297 01:46:22,635 --> 01:46:24,320 2 is the smallest number I found so far. 2298 01:46:24,320 --> 01:46:25,580 And then I would look through all the rest 2299 01:46:25,580 --> 01:46:27,497 to see if there's anything smaller than the 2. 2300 01:46:27,497 --> 01:46:30,170 And I would look at 3, 4, 5, 6, 7, 8. 2301 01:46:30,170 --> 01:46:31,880 Nothing's smaller than the 2.
2302 01:46:31,880 --> 01:46:34,910 So I go back to the two and say, OK, that number must now 2303 01:46:34,910 --> 01:46:36,327 be in its sorted position. 2304 01:46:36,327 --> 01:46:37,160 DAVID MALAN: Indeed. 2305 01:46:37,160 --> 01:46:40,035 And that story would be the same for the 3, for the 4, and for the 5. 2306 01:46:40,035 --> 01:46:43,610 Like, nowhere in selection sort pseudocode or actual code 2307 01:46:43,610 --> 01:46:46,850 is there any sort of intelligence of, eh, if the numbers are already sorted, 2308 01:46:46,850 --> 01:46:47,630 quit. 2309 01:46:47,630 --> 01:46:50,540 Like, there was no opportunity to short circuit and abort 2310 01:46:50,540 --> 01:46:51,645 that algorithm earlier. 2311 01:46:51,645 --> 01:46:53,520 Brian would literally be doing the same work, 2312 01:46:53,520 --> 01:46:55,340 whether they're all sorted from the get-go 2313 01:46:55,340 --> 01:46:57,620 or completely unsorted, and even backwards. 2314 01:46:57,620 --> 01:47:00,660 And so selection sort doesn't really perform very highly. 2315 01:47:00,660 --> 01:47:03,048 So now we're hoping bubble sort, indeed, does. 2316 01:47:03,048 --> 01:47:05,590 So toward that end, let's take a look at some proposed pseudo 2317 01:47:05,590 --> 01:47:09,040 code for bubble sort, assuming that the input is anything. 2318 01:47:09,040 --> 01:47:10,790 Whether sorted or unsorted, the pseudocode 2319 01:47:10,790 --> 01:47:13,070 is always going to look like this. 2320 01:47:13,070 --> 01:47:14,510 Repeat until sorted. 2321 01:47:14,510 --> 01:47:17,100 For i from 0 to n minus 2-- 2322 01:47:17,100 --> 01:47:18,190 now, what does this mean? 2323 01:47:18,190 --> 01:47:22,010 0 to n minus 1 goes from the first element to the last. 2324 01:47:22,010 --> 01:47:27,050 So 0 to n minus 2 goes from the first element to the second to last. 2325 01:47:27,050 --> 01:47:27,950 Why am I doing that? 2326 01:47:27,950 --> 01:47:29,210 We'll see in just a moment. 
2327 01:47:29,210 --> 01:47:34,010 The condition inside of this loop is, if the i-th and the i plus 1th elements 2328 01:47:34,010 --> 01:47:36,720 are out of order, swap them. 2329 01:47:36,720 --> 01:47:38,590 So this is me being a little clever. 2330 01:47:38,590 --> 01:47:41,090 If you think about all of these numbers as being in an array 2331 01:47:41,090 --> 01:47:46,100 or behind doors, if you iterate from 0 to n minus 2, 2332 01:47:46,100 --> 01:47:49,010 that's like going from the first door to the second to last door. 2333 01:47:49,010 --> 01:47:52,830 But that's good, because my condition is checking door i and i plus 1. 2334 01:47:52,830 --> 01:47:57,530 So if I start at the beginning here, and I only iterate up to this door, 2335 01:47:57,530 --> 01:47:58,520 that's a good thing. 2336 01:47:58,520 --> 01:48:02,070 Because when I compared door i and i plus 1, at the very end 2337 01:48:02,070 --> 01:48:05,100 I'm going to compare door i and i plus 1. 2338 01:48:05,100 --> 01:48:09,020 What I don't want to do is compare this door i against door i 2339 01:48:09,020 --> 01:48:10,763 plus 1, which doesn't even exist. 2340 01:48:10,763 --> 01:48:13,430 And indeed, that's going to be an error that probably all of you 2341 01:48:13,430 --> 01:48:14,510 make at some point-- 2342 01:48:14,510 --> 01:48:19,040 going beyond the boundary of an array, touching memory 2343 01:48:19,040 --> 01:48:22,233 that is going one or more spaces too far in the array, 2344 01:48:22,233 --> 01:48:24,150 even though you didn't allocate memory for it. 2345 01:48:24,150 --> 01:48:26,940 So this hedges against that possibility. 2346 01:48:26,940 --> 01:48:29,630 So this would seem to be a pretty smart algorithm. 2347 01:48:29,630 --> 01:48:35,030 But as written, it's not actually as performant as might be ideal. 2348 01:48:35,030 --> 01:48:39,830 With bubble sort, suppose the list were entirely sorted. 
2349 01:48:39,830 --> 01:48:43,910 Brian, not to make you sort and resort numbers too many times. 2350 01:48:43,910 --> 01:48:47,300 Do you mind giving us a sorted list one more time real quick? 2351 01:48:47,300 --> 01:48:50,660 In a moment, I want to see, if we consider that same sorted list as 2352 01:48:50,660 --> 01:48:54,920 before, this time with bubble sort, can we do fundamentally better? 2353 01:48:54,920 --> 01:48:58,160 I have this code saying, repeat until sorted. 2354 01:48:58,160 --> 01:48:59,330 So how might this change? 2355 01:48:59,330 --> 01:49:01,310 So Brian, you've got the sorted numbers again. 2356 01:49:01,310 --> 01:49:02,660 This should be a good case. 2357 01:49:02,660 --> 01:49:05,990 But selection sort did not benefit from this input, 2358 01:49:05,990 --> 01:49:07,700 even though we could have gotten lucky. 2359 01:49:07,700 --> 01:49:10,085 Bubble sort, what would your thought process be here? 2360 01:49:10,085 --> 01:49:11,960 BRIAN: So the thought process for bubble sort 2361 01:49:11,960 --> 01:49:14,030 was to go through each of the pairs one at a time 2362 01:49:14,030 --> 01:49:17,000 and see if I need to make a swap for that particular pair. 2363 01:49:17,000 --> 01:49:18,620 So I'd look at the 1 and the 2. 2364 01:49:18,620 --> 01:49:20,600 Those two are OK, I don't need to swap them. 2365 01:49:20,600 --> 01:49:21,763 The 2 and the 3 are OK. 2366 01:49:21,763 --> 01:49:23,180 I don't need to make a swap there. 2367 01:49:23,180 --> 01:49:24,270 The 3 and the 4 are OK. 2368 01:49:24,270 --> 01:49:25,820 The 4 and the 5 are OK. 2369 01:49:25,820 --> 01:49:29,750 Same with the 5 and the 6, and the 6 and the 7, and the 7 and the 8. 2370 01:49:29,750 --> 01:49:32,480 So I made my way through the entire array, 2371 01:49:32,480 --> 01:49:36,020 and I never needed to make any swap, because every pair that I looked at, 2372 01:49:36,020 --> 01:49:38,840 they were already in the correct order relative to each other. 
2373 01:49:38,840 --> 01:49:39,673 DAVID MALAN: Indeed. 2374 01:49:39,673 --> 01:49:42,440 And so it would be foolish and so obvious this time 2375 01:49:42,440 --> 01:49:45,140 if Brian literally retraced those steps and did it 2376 01:49:45,140 --> 01:49:49,130 again with n minus 1 elements, and then did it again with n minus 2 elements. 2377 01:49:49,130 --> 01:49:52,490 I mean, if he didn't do any work, any swaps the first pass, 2378 01:49:52,490 --> 01:49:54,830 he's literally wasting his own time by even 2379 01:49:54,830 --> 01:49:57,050 doing another pass or another pass. 2380 01:49:57,050 --> 01:50:00,650 And so that's kind of implicit in the pseudocode, this repeat until sorted. 2381 01:50:00,650 --> 01:50:03,110 Even though it doesn't translate perfectly into a for loop 2382 01:50:03,110 --> 01:50:07,250 or a while loop in C, it kind of says intuitively what he should do-- 2383 01:50:07,250 --> 01:50:08,390 repeat until sorted. 2384 01:50:08,390 --> 01:50:11,150 Brian has already identified the fact, by nature of him 2385 01:50:11,150 --> 01:50:13,670 not having made any swaps, that this list is sorted. 2386 01:50:13,670 --> 01:50:16,400 Therefore, he can just stop, and this loop does not 2387 01:50:16,400 --> 01:50:18,020 have to continue again and again. 2388 01:50:18,020 --> 01:50:21,620 We can map this to C-like code a little more explicitly. 2389 01:50:21,620 --> 01:50:24,770 We can by default say, do the following n minus 1 times. 2390 01:50:24,770 --> 01:50:29,570 Because among n elements, you can look at n minus 1 total pairs from left 2391 01:50:29,570 --> 01:50:31,620 to right without going too far. 2392 01:50:31,620 --> 01:50:35,100 But notice, I can add an additional line of code here 2393 01:50:35,100 --> 01:50:40,070 which might say, if no swaps, quit from the algorithm altogether. 
2394 01:50:40,070 --> 01:50:42,890 So, so long as Brian is keeping track of how many swaps 2395 01:50:42,890 --> 01:50:47,160 he made or didn't make through one pass, as with a variable called counter 2396 01:50:47,160 --> 01:50:50,940 or whatever, he can simply abort this algorithm early and certainly 2397 01:50:50,940 --> 01:50:52,687 then save us some time. 2398 01:50:52,687 --> 01:50:55,020 So with that said, let's consider for just a moment what 2399 01:50:55,020 --> 01:50:59,280 the running time of bubble sort might be in terms of an upper bound, 2400 01:50:59,280 --> 01:51:01,120 in the worst case, if you will. 2401 01:51:01,120 --> 01:51:05,070 Well, in the case of bubble sort, notice with the pseudocode 2402 01:51:05,070 --> 01:51:08,160 where we're doing something n minus 1 times, 2403 01:51:08,160 --> 01:51:11,820 and inside of that we're doing something n minus 1 times. 2404 01:51:11,820 --> 01:51:14,130 So again, repeat n minus 1 times literally 2405 01:51:14,130 --> 01:51:17,040 says, do the following n minus 1 times. 2406 01:51:17,040 --> 01:51:19,830 The for loop here, which is just a different way in pseudocode 2407 01:51:19,830 --> 01:51:24,540 of expressing a similar idea but giving us a variable this time, for i from 0 2408 01:51:24,540 --> 01:51:25,950 to n minus 1-- 2409 01:51:25,950 --> 01:51:32,890 n minus 2, is a total number of n minus 1 comparisons. 2410 01:51:32,890 --> 01:51:37,490 So this is an n minus 1 thing inside the repeat, 2411 01:51:37,490 --> 01:51:39,680 and an n minus 1 outside the repeat. 2412 01:51:39,680 --> 01:51:44,720 So I think what that gives me is n minus 1 things times n minus 1 times. 2413 01:51:44,720 --> 01:51:47,180 So now if I just kind of FOIL this, sort of in high school 2414 01:51:47,180 --> 01:51:51,080 or middle school math, n squared minus 1n minus 1n plus 1. 2415 01:51:51,080 --> 01:51:54,115 We can combine like terms, n squared minus 2n plus 1. 
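Written out, the arithmetic above is just the product of the two loop counts. The block below restates it in LaTeX, using the same n as the lecture; the final step drops the lower-order terms, which is the standard way of reading an upper bound.

```latex
\begin{align*}
(n - 1)(n - 1) &= n^2 - 1n - 1n + 1 \\
               &= n^2 - 2n + 1 \\
               &\in O(n^2)
\end{align*}
```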
2416 01:51:54,115 --> 01:51:57,240 But per our discussion earlier, ugh, this is really getting into the weeds. 2417 01:51:57,240 --> 01:51:59,870 Who cares about the 2n or the 1? 2418 01:51:59,870 --> 01:52:04,320 The dominant factor as n gets large is definitely going to be the n squared. 2419 01:52:04,320 --> 01:52:06,440 So it would seem that bubble sort, if you actually 2420 01:52:06,440 --> 01:52:10,130 do out the math and the formulas, is going to have an upper bound of n 2421 01:52:10,130 --> 01:52:13,200 squared, or rather, on the order of n squared steps. 2422 01:52:13,200 --> 01:52:17,420 So in that sense, it is equivalent to selection sort. 2423 01:52:17,420 --> 01:52:19,490 It is no better fundamentally. 2424 01:52:19,490 --> 01:52:22,430 It's what we would call asymptotically equivalent. 2425 01:52:22,430 --> 01:52:25,220 That is, as n gets really large, this formula 2426 01:52:25,220 --> 01:52:28,670 is, for all intents and purposes, equivalent to the selection sort 2427 01:52:28,670 --> 01:52:30,750 formula, even though they differed slightly 2428 01:52:30,750 --> 01:52:32,930 in terms of their lower order terms. 2429 01:52:32,930 --> 01:52:37,080 For all intents and purposes, ah, they're on the order of n squared both. 
2430 01:52:37,080 --> 01:52:40,460 But if we consider a lower bound, perhaps, 2431 01:52:40,460 --> 01:52:43,520 even though bubble sort has the same upper bound running time, 2432 01:52:43,520 --> 01:52:47,420 if we consider a lower bound, as with this smarter code, where Brian might 2433 01:52:47,420 --> 01:52:50,840 actually have the wherewithal to notice, wait a minute, I didn't do any swaps, 2434 01:52:50,840 --> 01:52:54,630 I'm just going to exit out of this looping pretty much early-- 2435 01:52:54,630 --> 01:52:56,540 not even prematurely but early, because it 2436 01:52:56,540 --> 01:52:59,120 would be fruitless to keep doing more and more work-- 2437 01:52:59,120 --> 01:53:01,760 we can then whittle down this running time. 2438 01:53:01,760 --> 01:53:07,610 I think-- not quite as good as omega of 1, which was constant time-- 2439 01:53:07,610 --> 01:53:12,200 like, you cannot conclude definitively that an array is sorted unless you 2440 01:53:12,200 --> 01:53:14,360 minimally look at all of the elements once. 2441 01:53:14,360 --> 01:53:17,540 So constant time is completely naive and unrealistic. 2442 01:53:17,540 --> 01:53:21,230 You can't look at one element, or two or three, and say, yes, this is sorted. 2443 01:53:21,230 --> 01:53:24,650 You've got to obviously look at all of the elements at least once. 2444 01:53:24,650 --> 01:53:28,400 So this would seem to suggest that the omega notation for it, that 2445 01:53:28,400 --> 01:53:31,520 is, the lower bound on bubble sort's running time, 2446 01:53:31,520 --> 01:53:37,850 if we're clever and don't retrace our steps unnecessarily, is in omega of n. 2447 01:53:37,850 --> 01:53:39,950 Or technically, it's n minus 1 steps, right? 2448 01:53:39,950 --> 01:53:41,950 Because if you've got n elements and you compare 2449 01:53:41,950 --> 01:53:44,030 these two, these two, these two, these two, 2450 01:53:44,030 --> 01:53:45,650 that's n minus 1 total comparisons. 
2451 01:53:45,650 --> 01:53:47,450 But who cares about the minus 1? 2452 01:53:47,450 --> 01:53:53,380 It's on the order of n, or omega of n notation here. 2453 01:53:53,380 --> 01:53:57,240 So to recap, selection sort selects the next smallest element again and again 2454 01:53:57,240 --> 01:53:57,870 and again. 2455 01:53:57,870 --> 01:54:01,020 Unfortunately, based on how it's implemented in pseudocode and actual 2456 01:54:01,020 --> 01:54:03,330 code, it's in Big O of n squared. 2457 01:54:03,330 --> 01:54:05,370 But it's also in omega of n squared, which 2458 01:54:05,370 --> 01:54:10,170 means it's always going to take the same amount of time asymptotically, that is, 2459 01:54:10,170 --> 01:54:11,490 as n gets large. 2460 01:54:11,490 --> 01:54:16,083 Unfortunately, too, bubble sort is no better, it would seem, 2461 01:54:16,083 --> 01:54:17,250 in terms of the upper bound. 2462 01:54:17,250 --> 01:54:19,500 It's going to take as many as n squared steps, too. 2463 01:54:19,500 --> 01:54:23,610 But it's at least marginally better when it comes to using something 2464 01:54:23,610 --> 01:54:26,280 like an input that's already sorted. 2465 01:54:26,280 --> 01:54:30,880 It can short circuit and not waste time. 2466 01:54:30,880 --> 01:54:32,850 But honestly, n squared is bad. 2467 01:54:32,850 --> 01:54:34,950 Like, n squared is really going to add up quickly. 2468 01:54:34,950 --> 01:54:39,180 If you've got n squared and n is a million or n is a billion, I mean, 2469 01:54:39,180 --> 01:54:40,860 my God, that's a lot of 0's. 2470 01:54:40,860 --> 01:54:44,820 That's a lot of steps in the total running time of your algorithm. 2471 01:54:44,820 --> 01:54:46,180 Can we do better? 2472 01:54:46,180 --> 01:54:47,500 Can we do better? 2473 01:54:47,500 --> 01:54:48,720 And it turns out we can. 2474 01:54:48,720 --> 01:54:52,560 And we'll consider one final algorithm today that does fundamentally better. 
2475 01:54:52,560 --> 01:54:57,000 Just like in week 0, we sort of latched onto binary search and again today-- 2476 01:54:57,000 --> 01:55:01,390 it's just fundamentally better than linear search by an order of magnitude, 2477 01:55:01,390 --> 01:55:01,890 so to speak. 2478 01:55:01,890 --> 01:55:05,250 Its picture representation was fundamentally different. 2479 01:55:05,250 --> 01:55:09,048 I think we can do fundamentally better than bubble sort and selection sort. 2480 01:55:09,048 --> 01:55:10,840 And so while both bubble sort and selection 2481 01:55:10,840 --> 01:55:13,465 sort might be the sort of thing that I was using in grad school 2482 01:55:13,465 --> 01:55:15,840 just to whip up the code quickly and then go to sleep, 2483 01:55:15,840 --> 01:55:18,272 it's not going to work well for very large data sets. 2484 01:55:18,272 --> 01:55:19,980 And frankly, it wouldn't have worked well 2485 01:55:19,980 --> 01:55:22,140 if I didn't want to just sleep through the problem. 2486 01:55:22,140 --> 01:55:26,190 Rather, we want to do things as efficiently as we can from the get-go. 2487 01:55:26,190 --> 01:55:29,560 And let me propose that we leverage a technique-- 2488 01:55:29,560 --> 01:55:32,310 and this is a technique that you can use in almost any programming 2489 01:55:32,310 --> 01:55:33,960 language, C among them-- 2490 01:55:33,960 --> 01:55:35,430 known as recursion. 2491 01:55:35,430 --> 01:55:41,940 And recursion, quite simply, is the ability for a function to call itself. 2492 01:55:41,940 --> 01:55:44,760 Up until now, we have not seen any examples of this. 2493 01:55:44,760 --> 01:55:47,070 We've seen functions calling other functions. 2494 01:55:47,070 --> 01:55:49,170 Main keeps calling printf. 2495 01:55:49,170 --> 01:55:51,000 Main has started to call strlen. 2496 01:55:51,000 --> 01:55:54,930 Main called strcmp, compare, earlier today. 2497 01:55:54,930 --> 01:55:56,700 But we've never seen main call main. 
2498 01:55:56,700 --> 01:55:59,770 And people don't do that, so that's not going to solve the problem. 2499 01:55:59,770 --> 01:56:02,640 But we can implement our own functions and have 2500 01:56:02,640 --> 01:56:05,100 our own functions call themselves. 2501 01:56:05,100 --> 01:56:07,290 Now, this would seem to be a bad idea in principle. 2502 01:56:07,290 --> 01:56:10,020 If a function calls itself, my God, where does it end? 2503 01:56:10,020 --> 01:56:11,820 It would seem to just do something forever, 2504 01:56:11,820 --> 01:56:13,487 and then something bad probably happens. 2505 01:56:13,487 --> 01:56:14,132 And it could. 2506 01:56:14,132 --> 01:56:15,840 And that's the danger of using recursion. 2507 01:56:15,840 --> 01:56:17,495 You can screw it up easily. 2508 01:56:17,495 --> 01:56:19,620 But it's also a very powerful technique, because it 2509 01:56:19,620 --> 01:56:21,870 allows us to think about potential solutions 2510 01:56:21,870 --> 01:56:25,770 to problems in a very interesting, and daresay elegant, way. 2511 01:56:25,770 --> 01:56:29,160 So we're not only going to be able to achieve correctness but also better 2512 01:56:29,160 --> 01:56:32,610 design, because of better efficiency, it would seem, here. 2513 01:56:32,610 --> 01:56:33,960 So let me propose this. 2514 01:56:33,960 --> 01:56:37,830 Recall this code from week 0, which was the pseudocode for finding someone 2515 01:56:37,830 --> 01:56:38,790 in a phone book. 2516 01:56:38,790 --> 01:56:42,180 And recall that, among the features of this pseudocode, 2517 01:56:42,180 --> 01:56:44,520 were these lines here, "Go back to line 3." 2518 01:56:44,520 --> 01:56:48,950 And we described those in week 0 as being representative of loops, 2519 01:56:48,950 --> 01:56:52,950 a programming construct that has something happen again and again. 
2520 01:56:52,950 --> 01:56:56,640 But you know what, there's a missed opportunity here in this pseudocode 2521 01:56:56,640 --> 01:56:59,590 to use a technique known as recursion. 2522 01:56:59,590 --> 01:57:02,280 This implementation is what we would call iterative. 2523 01:57:02,280 --> 01:57:04,290 It is purely loop based. 2524 01:57:04,290 --> 01:57:07,230 It tells me literally, go back to this line, go back to this line, 2525 01:57:07,230 --> 01:57:08,400 go back to this line. 2526 01:57:08,400 --> 01:57:10,230 There's no calling yourself. 2527 01:57:10,230 --> 01:57:13,620 But what if I changed week 0's pseudocode to be a little more 2528 01:57:13,620 --> 01:57:14,550 like this? 2529 01:57:14,550 --> 01:57:19,110 Let me go ahead and get rid of, not just that one line but two lines 2530 01:57:19,110 --> 01:57:21,000 in both of those conditions. 2531 01:57:21,000 --> 01:57:23,310 And let me quite simply say, instead of open 2532 01:57:23,310 --> 01:57:26,280 to the middle of the left half of the book and then go back to line 3, 2533 01:57:26,280 --> 01:57:29,910 or open to the middle of the right half of the book and then go back to line 3, 2534 01:57:29,910 --> 01:57:34,950 why don't I just more elegantly say, search left half of book, 2535 01:57:34,950 --> 01:57:36,660 search right half of book? 2536 01:57:36,660 --> 01:57:39,780 Now, immediately I can shorten the code a little bit. 2537 01:57:39,780 --> 01:57:44,880 But I claim that by just saying search left half of book and search right 2538 01:57:44,880 --> 01:57:49,140 half of book, I claim that this is enough information 2539 01:57:49,140 --> 01:57:50,760 to implement the very same algorithm. 2540 01:57:50,760 --> 01:57:53,100 But it's not using a loop per se. 2541 01:57:53,100 --> 01:57:56,010 It's going to induce me the human or me the computer 2542 01:57:56,010 --> 01:57:57,840 to do something again and again. 
2543 01:57:57,840 --> 01:58:00,270 But there's other ways to do things again and again-- 2544 01:58:00,270 --> 01:58:03,690 not by way of a for loop, or a while loop, or a do while loop, 2545 01:58:03,690 --> 01:58:06,360 or a repeat block, or a forever block-- 2546 01:58:06,360 --> 01:58:08,610 you can actually use recursion. 2547 01:58:08,610 --> 01:58:12,690 And recursion, again, is this technique where a function can call itself. 2548 01:58:12,690 --> 01:58:15,660 And if we consider, after all, the pseudocode we are looking at 2549 01:58:15,660 --> 01:58:18,210 is the pseudocode for searching. 2550 01:58:18,210 --> 01:58:24,210 And on line 7 and 9 now, I am literally saying, "Search left half of book," 2551 01:58:24,210 --> 01:58:28,800 and "Search right half of book," this is already, even in pseudocode form, 2552 01:58:28,800 --> 01:58:30,270 an example of recursion. 2553 01:58:30,270 --> 01:58:34,530 Here I have in 11 lines of code an algorithm or a function 2554 01:58:34,530 --> 01:58:36,440 that searches a phone book. 2555 01:58:36,440 --> 01:58:40,830 In lines 7 and 9, I have lines of code that literally say, search 2556 01:58:40,830 --> 01:58:44,460 a phone book, but more specifically, search half of the phone book. 2557 01:58:44,460 --> 01:58:47,520 And that's where recursion really works its magic. 2558 01:58:47,520 --> 01:58:50,760 It would be foolish and incorrect and completely counterproductive 2559 01:58:50,760 --> 01:58:53,148 to just have a function call itself with the same input, 2560 01:58:53,148 --> 01:58:55,440 with the same input, with the same input, because you'd 2561 01:58:55,440 --> 01:58:57,960 have to be kind of crazy to expect different output 2562 01:58:57,960 --> 01:59:00,120 if the input is constantly the same. 2563 01:59:00,120 --> 01:59:03,540 But that's not what we did in week 0, and that's not what we're doing now. 
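As a rough illustration of "search left half" and "search right half," here is how the same idea might look in C when the phone book is replaced by a sorted array of integers. The function name search, its parameters, and the choice of integers are assumptions made for this sketch; it also stops when the range becomes empty, so that the function is not calling itself forever.

```c
#include <stdbool.h>

// Recursively searches sorted[low..high] (inclusive) for target.
// The array must already be in ascending order.
bool search(const int sorted[], int low, int high, int target)
{
    // Nothing left to search: the target is not here.
    if (low > high)
    {
        return false;
    }

    // Open to the middle of the current range.
    int middle = low + (high - low) / 2;

    if (sorted[middle] == target)
    {
        return true;
    }
    else if (target < sorted[middle])
    {
        // Search left half: same function, smaller input.
        return search(sorted, low, middle - 1, target);
    }
    else
    {
        // Search right half: same function, smaller input.
        return search(sorted, middle + 1, high, target);
    }
}
```

There is no for loop or while loop here; the repetition comes entirely from the function calling itself on half of the previous range.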
2564 01:59:03,540 --> 01:59:07,110 If you use the same function, or equivalently algorithm, 2565 01:59:07,110 --> 01:59:11,580 but change the input to be smaller and smaller and smaller, 2566 01:59:11,580 --> 01:59:14,770 it's probably OK that a function is calling itself, 2567 01:59:14,770 --> 01:59:18,120 so long as you have at least one line of code in there 2568 01:59:18,120 --> 01:59:20,820 that very intelligently says, if you're out of doors, 2569 01:59:20,820 --> 01:59:23,310 if you're out of phone book pages, quit. 2570 01:59:23,310 --> 01:59:25,500 You need to have a so-called base case. 2571 01:59:25,500 --> 01:59:28,650 You need some line of code that's going to notice, wait a minute, there's 2572 01:59:28,650 --> 01:59:31,950 no more problem to be solved, quit now. 2573 01:59:31,950 --> 01:59:35,190 And so how can we map this to actual code? 2574 01:59:35,190 --> 01:59:38,130 Well, let's consider something very familiar from week 1. 2575 01:59:38,130 --> 01:59:40,440 Recall when you reconstructed one of Mario's pyramids. 2576 01:59:40,440 --> 01:59:43,330 It looked a little something like this. 2577 01:59:43,330 --> 01:59:46,050 And let's consider that this is a pyramid of blocks, 2578 01:59:46,050 --> 01:59:47,880 of bricks, that's of height 4. 2579 01:59:47,880 --> 01:59:48,420 Why 4? 2580 01:59:48,420 --> 01:59:52,450 Well, there's 1, then 2, then 3, then 4 bricks from top to bottom. 2581 01:59:52,450 --> 01:59:54,210 So the total height here is 4. 2582 01:59:54,210 --> 01:59:58,740 But let me ask the question, a little naively, how do you go about creating, 2583 01:59:58,740 --> 02:00:02,730 or how do you go about printing a pyramid of height 4? 2584 02:00:02,730 --> 02:00:06,000 Well, it turns out that this simple Mario pyramid, that's 2585 02:00:06,000 --> 02:00:09,210 ever more clear if we get rid of the unnecessary background, 2586 02:00:09,210 --> 02:00:12,700 is a recursive structure of some sort. 
2587 02:00:12,700 --> 02:00:14,250 It's a recursive physical structure. 2588 02:00:14,250 --> 02:00:15,020 Why? 2589 02:00:15,020 --> 02:00:18,780 Well, notice that this structure, this brick, this pyramid, 2590 02:00:18,780 --> 02:00:20,970 is kind of defined in terms of itself. 2591 02:00:20,970 --> 02:00:21,690 Why? 2592 02:00:21,690 --> 02:00:24,660 Well, how do you make a pyramid of height 4? 2593 02:00:24,660 --> 02:00:27,870 I would argue, a little obnoxiously, a little circularly, well, 2594 02:00:27,870 --> 02:00:30,240 you create a pyramid of height 3, and then 2595 02:00:30,240 --> 02:00:32,670 you add an additional row of bricks. 2596 02:00:32,670 --> 02:00:33,400 All right. 2597 02:00:33,400 --> 02:00:34,440 Well, let's continue that logic. 2598 02:00:34,440 --> 02:00:35,107 All right, fine. 2599 02:00:35,107 --> 02:00:38,160 How do you build a pyramid of height 3? 2600 02:00:38,160 --> 02:00:41,670 Well, you sort of smile and say, well, you build a pyramid of height 2, 2601 02:00:41,670 --> 02:00:43,110 and then you add one more layer. 2602 02:00:43,110 --> 02:00:43,590 All right, fine. 2603 02:00:43,590 --> 02:00:45,215 How do you build a pyramid of height 2? 2604 02:00:45,215 --> 02:00:49,165 Well, you build a pyramid of height 1, and then you add one more layer. 2605 02:00:49,165 --> 02:00:51,040 Well, how do you build a pyramid of height 1? 2606 02:00:51,040 --> 02:00:53,610 Well, you just put the stupid brick down. 2607 02:00:53,610 --> 02:00:56,280 You have a base case, where you sort of state the obvious 2608 02:00:56,280 --> 02:00:57,990 and just do something once. 2609 02:00:57,990 --> 02:00:59,550 You hardcode the logic. 2610 02:00:59,550 --> 02:01:02,610 But notice what's kind of mind bending, or kind 2611 02:01:02,610 --> 02:01:05,550 of obnoxious in a human interaction, like, 2612 02:01:05,550 --> 02:01:08,640 you're just defining the answer in terms of itself. 2613 02:01:08,640 --> 02:01:10,470 I keep saying the same thing. 
2614 02:01:10,470 --> 02:01:14,790 But that's OK, because the pyramid keeps getting smaller and smaller and smaller 2615 02:01:14,790 --> 02:01:16,895 until I can handle that one special case. 2616 02:01:16,895 --> 02:01:19,770 And so we can do this just for fun with these little cardboard bricks 2617 02:01:19,770 --> 02:01:20,920 here, for instance. 2618 02:01:20,920 --> 02:01:24,010 If I want to build a pyramid of height 4, how do I do it? 2619 02:01:24,010 --> 02:01:26,220 Well, I can build a pyramid of height 3. 2620 02:01:26,220 --> 02:01:29,280 All right, let me go ahead and build a pyramid of height 3. 2621 02:01:29,280 --> 02:01:31,095 How do I build a pyramid of height 3? 2622 02:01:31,095 --> 02:01:33,970 All right, well, I build a pyramid of height 2, and then I add to it. 2623 02:01:33,970 --> 02:01:37,090 OK, how do I build a pyramid of height 2? 2624 02:01:37,090 --> 02:01:38,790 Well, you build a pyramid of height 1. 2625 02:01:38,790 --> 02:01:39,773 How do I do that? 2626 02:01:39,773 --> 02:01:41,190 Well, you just put the brick down. 2627 02:01:41,190 --> 02:01:43,140 And so here's where things kind of bottom out, 2628 02:01:43,140 --> 02:01:45,390 and it's no longer a cyclical argument. 2629 02:01:45,390 --> 02:01:47,640 You eventually just do some actual work. 2630 02:01:47,640 --> 02:01:51,780 But in my mind, I have to remember all of the instructions you just gave me, 2631 02:01:51,780 --> 02:01:53,040 or I gave myself. 2632 02:01:53,040 --> 02:01:56,850 I had to build a pyramid of height 4; nope, 3; nope, 2; nope, 1. 2633 02:01:56,850 --> 02:01:58,270 Now I'm actually doing that. 2634 02:01:58,270 --> 02:02:00,090 So here's a pyramid of height 1. 2635 02:02:00,090 --> 02:02:02,610 How do I now build a pyramid of height 2? 2636 02:02:02,610 --> 02:02:04,350 Well, rewind in the story. 
2637 02:02:04,350 --> 02:02:07,950 To build a pyramid of height 2, you build a pyramid of height 1, 2638 02:02:07,950 --> 02:02:09,940 and then you add one more layer. 2639 02:02:09,940 --> 02:02:14,440 So I think to add one more layer, I essentially need to do this. 2640 02:02:14,440 --> 02:02:14,940 All right. 2641 02:02:14,940 --> 02:02:16,740 Now I have a pyramid of height 2. 2642 02:02:16,740 --> 02:02:17,490 But wait a minute. 2643 02:02:17,490 --> 02:02:20,010 The story began with, how do I build a pyramid of height 3? 2644 02:02:20,010 --> 02:02:22,740 Well, you take a pyramid of height 2, which I have here, 2645 02:02:22,740 --> 02:02:24,330 and you add an additional layer. 2646 02:02:24,330 --> 02:02:26,370 So I've got to build this additional layer. 2647 02:02:26,370 --> 02:02:30,900 I'm going to go ahead and give myself the layer, the layer, the layer. 2648 02:02:30,900 --> 02:02:34,950 And then I'm going to put the original pyramid of height 2 on top of it. 2649 02:02:34,950 --> 02:02:37,598 And voila, it's a pyramid of height 3 now. 2650 02:02:37,598 --> 02:02:38,640 Well, how did I get here? 2651 02:02:38,640 --> 02:02:40,230 Well, let me keep rewinding in the story. 2652 02:02:40,230 --> 02:02:42,120 The very first question I asked myself was, 2653 02:02:42,120 --> 02:02:43,828 how do you build a pyramid of height 4? 2654 02:02:43,828 --> 02:02:45,870 Well, the answer was build a pyramid of height 3. 2655 02:02:45,870 --> 02:02:47,110 Great, that's done. 2656 02:02:47,110 --> 02:02:49,222 Then add one additional layer. 2657 02:02:49,222 --> 02:02:51,930 And if I had more hands, I could do this a little more elegantly, 2658 02:02:51,930 --> 02:02:54,190 but let me go ahead and just lay this out. 2659 02:02:54,190 --> 02:02:57,280 Here's the new level of height 3. 2660 02:02:57,280 --> 02:02:58,920 And now I'm going to go-- 2661 02:02:58,920 --> 02:03:00,840 of width 4. 
2662 02:03:00,840 --> 02:03:05,940 Now I'm going to go and put the pyramid of height 3 on top of it, until voila, 2663 02:03:05,940 --> 02:03:09,840 I have this form here of Mario's pyramid. 2664 02:03:09,840 --> 02:03:13,020 So it's a bit cyclical in that, every time I 2665 02:03:13,020 --> 02:03:15,627 asked myself to build a pyramid of a certain height, 2666 02:03:15,627 --> 02:03:18,210 I kind of punted and said, no, build a pyramid of this height. 2667 02:03:18,210 --> 02:03:19,290 No, build a pyramid of this height. 2668 02:03:19,290 --> 02:03:20,940 No, build a pyramid of this height. 2669 02:03:20,940 --> 02:03:25,890 But the magic of that algorithm was that there was constantly 2670 02:03:25,890 --> 02:03:29,520 this, do a little more work, build a layer, do a little more work, 2671 02:03:29,520 --> 02:03:30,690 build a layer. 2672 02:03:30,690 --> 02:03:35,160 And it's in that implicit building of layer after layer after layer 2673 02:03:35,160 --> 02:03:38,483 that the pyramid itself, the end goal, actually emerges. 2674 02:03:38,483 --> 02:03:41,400 So you could implement the same thing with a for loop or a while loop. 2675 02:03:41,400 --> 02:03:42,330 And frankly, you did. 2676 02:03:42,330 --> 02:03:45,210 It was a slightly different shape for problem set 1, 2677 02:03:45,210 --> 02:03:47,340 but you did the same thing using a loop. 2678 02:03:47,340 --> 02:03:51,060 And you kind of had to do it that way, at least as we prescribed it. 2679 02:03:51,060 --> 02:03:54,327 Because with printf, you have to print from the top of the screen 2680 02:03:54,327 --> 02:03:54,910 to the bottom. 2681 02:03:54,910 --> 02:03:57,910 Like, we haven't shown you a technique yet to print a layer 2682 02:03:57,910 --> 02:03:59,253 and then go back on top. 2683 02:03:59,253 --> 02:04:01,420 So I'm kind of taking some real-world liberties here 2684 02:04:01,420 --> 02:04:03,503 by lifting these things up and moving them around. 
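The brick-by-brick story above can be approximated in C. The following is only a sketch under a few assumptions: the function name draw is hypothetical, the pyramid is left-aligned and made of # characters, and the base case is a pyramid of height 0 (draw nothing) rather than the single brick used in the demonstration, so that a height of 1 is handled by the same code as every other height. Because printf can only print top to bottom, the smaller pyramid is printed first and the new, wider layer is added underneath it.

```c
#include <stdio.h>

// Prints a left-aligned pyramid of the given height, one row per line.
// Returns the total number of bricks printed (a hypothetical extra,
// added only to make the result easy to check).
int draw(int height)
{
    // Base case: a pyramid of height 0 has nothing to draw.
    if (height <= 0)
    {
        return 0;
    }

    // Recursive case: first build the pyramid of height - 1 ...
    int bricks = draw(height - 1);

    // ... then add one more layer, which is `height` bricks wide.
    for (int i = 0; i < height; i++)
    {
        printf("#");
    }
    printf("\n");

    return bricks + height;
}
```

Calling draw(4) prints rows of 1, 2, 3, and 4 bricks, for 10 bricks in total.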
2685 02:04:03,503 --> 02:04:06,010 You'd have to be a little more clever in code. 2686 02:04:06,010 --> 02:04:07,300 But the idea is the same. 2687 02:04:07,300 --> 02:04:09,280 And so even physical objects like this can 2688 02:04:09,280 --> 02:04:12,770 have some recursive definition to them. 2689 02:04:12,770 --> 02:04:14,800 And so we present this sort of goofy example, 2690 02:04:14,800 --> 02:04:18,708 because this notion of recursion is a fundamental programming technique 2691 02:04:18,708 --> 02:04:20,500 that you can leverage now to solve problems 2692 02:04:20,500 --> 02:04:22,480 in a fundamentally different way. 2693 02:04:22,480 --> 02:04:26,500 And I think for this, we need one final visualization of merge sort, 2694 02:04:26,500 --> 02:04:28,450 with both Brian's help and the computer's. 2695 02:04:28,450 --> 02:04:32,170 And merge sort is going to be an algorithm whose pseudocode is, daresay, 2696 02:04:32,170 --> 02:04:35,440 the simplest we've seen thus far, but deceptively simple. 2697 02:04:35,440 --> 02:04:38,740 The pseudocode for merge sort, quite simply, is this-- 2698 02:04:38,740 --> 02:04:42,690 sort the left half of numbers, sort the right half of numbers, 2699 02:04:42,690 --> 02:04:44,950 merge the sorted halves. 2700 02:04:44,950 --> 02:04:48,640 And notice, even at first glance this feels kind of unfair. 2701 02:04:48,640 --> 02:04:51,310 Like, here's an algorithm for sorting, and yet I'm 2702 02:04:51,310 --> 02:04:54,820 literally using the word "sort" in my algorithm for sorting. 2703 02:04:54,820 --> 02:04:57,070 It's like in English if you're asked to define a word, 2704 02:04:57,070 --> 02:04:59,290 and you literally use the word in the definition. 2705 02:04:59,290 --> 02:05:03,950 Like, that rarely flies, because you're just making a circular argument. 
2706 02:05:03,950 --> 02:05:08,740 But in code, it's OK, so long as there's one special step that's doing something 2707 02:05:08,740 --> 02:05:10,990 a little differently, and so long as the problem keeps 2708 02:05:10,990 --> 02:05:12,157 getting smaller and smaller. 2709 02:05:12,157 --> 02:05:13,180 And indeed it is. 2710 02:05:13,180 --> 02:05:16,593 This pseudocode is not saying, sort the numbers, sort the numbers, 2711 02:05:16,593 --> 02:05:17,260 sort the numbers. 2712 02:05:17,260 --> 02:05:22,060 No, it's dividing the problem in half and then solving the other half 2713 02:05:22,060 --> 02:05:22,700 as well. 2714 02:05:22,700 --> 02:05:25,090 So it's shrinking the problem on each iteration. 2715 02:05:25,090 --> 02:05:28,667 Now, I will disclaim we're going to need that so-called base case again. 2716 02:05:28,667 --> 02:05:31,000 I'm going to have to do something stupid, but necessary, 2717 02:05:31,000 --> 02:05:33,670 and say, if there's only one number, quit. 2718 02:05:33,670 --> 02:05:34,540 It's sorted. 2719 02:05:34,540 --> 02:05:36,370 That's the so-called base case. 2720 02:05:36,370 --> 02:05:39,970 The recursive case is where the function calls itself. 2721 02:05:39,970 --> 02:05:44,920 But this is, indeed, our third and final sorting algorithm called merge sort. 2722 02:05:44,920 --> 02:05:48,290 And we'll focus here really on the juiciest pieces, 2723 02:05:48,290 --> 02:05:49,870 one, this notion of merging. 2724 02:05:49,870 --> 02:05:52,270 So in fact, Brian, can we come over to you 2725 02:05:52,270 --> 02:05:56,300 just so we can define, before we look at the merge sort algorithm itself, 2726 02:05:56,300 --> 02:06:00,010 what do we even mean when we say merge sorted halves? 2727 02:06:00,010 --> 02:06:04,510 So for instance, Brian has on his shelf here two arrays of size 4. 2728 02:06:04,510 --> 02:06:08,680 In the first array on the left are four integers, 3, 5, 6, 8. 
2729 02:06:08,680 --> 02:06:12,460 And in the right side, in another array of size 4, 2730 02:06:12,460 --> 02:06:15,160 are four numbers, too, 1, 2, 4, 7. 2731 02:06:15,160 --> 02:06:18,370 Both the left is sorted and the right is sorted. 2732 02:06:18,370 --> 02:06:21,880 But now, Brian, I would like you to merge these sorted halves. 2733 02:06:21,880 --> 02:06:23,480 Tell us what that means. 2734 02:06:23,480 --> 02:06:23,980 BRIAN: Sure. 2735 02:06:23,980 --> 02:06:26,680 So if I have a left half that's sorted from smallest 2736 02:06:26,680 --> 02:06:30,490 to largest and a right half that's also sorted from smallest to largest, 2737 02:06:30,490 --> 02:06:34,120 I want to merge them into a new list that has all of the same numbers 2738 02:06:34,120 --> 02:06:36,090 also from smallest to largest. 2739 02:06:36,090 --> 02:06:38,890 And I guess where I could start here is that the smallest 2740 02:06:38,890 --> 02:06:43,570 number of the combined array needs to begin with either the smallest 2741 02:06:43,570 --> 02:06:46,540 number of the left half or the smallest number of the right half. 2742 02:06:46,540 --> 02:06:49,690 So on the left the smallest number is the 3, and on the right 2743 02:06:49,690 --> 02:06:51,580 the smallest number is the 1. 2744 02:06:51,580 --> 02:06:55,030 One of those two has got to be the smallest number for the entire array. 2745 02:06:55,030 --> 02:06:58,030 And between the 3 and the 1, the 1 is smaller. 2746 02:06:58,030 --> 02:07:03,100 So I would take that 1, and that's going to be the first number, the smallest 2747 02:07:03,100 --> 02:07:06,297 number, of the merged two halves. 2748 02:07:06,297 --> 02:07:08,380 And then I guess I would repeat the process again. 2749 02:07:08,380 --> 02:07:10,960 On the left side the smallest number is the 3. 2750 02:07:10,960 --> 02:07:13,060 On the right side the smallest number is the 2. 2751 02:07:13,060 --> 02:07:16,120 And between the 3 and the 2, 2 is smaller.
2752 02:07:16,120 --> 02:07:19,270 So I would take the 2 [INAUDIBLE] and that's going to be the next number. 2753 02:07:19,270 --> 02:07:22,090 So I'm slowly building up this sorted array that 2754 02:07:22,090 --> 02:07:23,770 is the result of combining these two. 2755 02:07:23,770 --> 02:07:27,040 Now I'm comparing the 3 on the left to the 4 on the right. 2756 02:07:27,040 --> 02:07:29,170 Between the 3 and the 4, the 3 is smaller. 2757 02:07:29,170 --> 02:07:33,020 So I'll take the 3, and we'll put that one into position. 2758 02:07:33,020 --> 02:07:36,010 Now I'm comparing the 5 on the left with the 4 on the right. 2759 02:07:36,010 --> 02:07:38,380 Between the 5 and the 4, the 4 is smaller. 2760 02:07:38,380 --> 02:07:41,130 So that one goes into position. 2761 02:07:41,130 --> 02:07:44,970 And then now I'm comparing the 5 on the left with the 7 on the right. 2762 02:07:44,970 --> 02:07:47,811 5 is smaller, so the 5 goes next. 2763 02:07:47,811 --> 02:07:50,880 Next I'm comparing the 6 on the left with the 7 on the right. 2764 02:07:50,880 --> 02:07:55,200 The 6 is still smaller, so that one is going to go next. 2765 02:07:55,200 --> 02:07:58,290 Now I'm comparing the 8 and the 7, the only two numbers left. 2766 02:07:58,290 --> 02:08:00,650 The 7 is the smaller between the two. 2767 02:08:00,650 --> 02:08:03,530 So I'll take the 7 and put that into place. 2768 02:08:03,530 --> 02:08:05,280 And now I'm only left with one number that 2769 02:08:05,280 --> 02:08:09,900 hasn't been put into the merging of the two halves, and that's the number 8. 2770 02:08:09,900 --> 02:08:12,510 So that number is going to take up the final position. 2771 02:08:12,510 --> 02:08:16,440 And now I've taken these two halves, each of which was originally sorted, 2772 02:08:16,440 --> 02:08:19,887 and made one complete array that has all of those numbers in sorted order. 2773 02:08:19,887 --> 02:08:20,720 DAVID MALAN: Indeed.
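The merging Brian just walked through can be sketched in C. This is a minimal illustration rather than code shown in the lecture: the function name `merge`, its parameters, and the separate output array `out` are all assumptions made for the example.

```c
#include <stddef.h>

// Merge two already-sorted int arrays into `out`, which must have room
// for left_len + right_len elements. All names here are illustrative
// assumptions, not taken from the lecture.
void merge(const int left[], size_t left_len,
           const int right[], size_t right_len,
           int out[])
{
    size_t i = 0; // index of the smallest unused number on the left
    size_t j = 0; // index of the smallest unused number on the right
    size_t k = 0; // next free position in the merged array

    // While both halves still have numbers, compare the two front
    // numbers and take the smaller one.
    while (i < left_len && j < right_len)
    {
        if (left[i] <= right[j])
        {
            out[k++] = left[i++];
        }
        else
        {
            out[k++] = right[j++];
        }
    }

    // One half is used up, so copy whatever remains of the other half.
    while (i < left_len)
    {
        out[k++] = left[i++];
    }
    while (j < right_len)
    {
        out[k++] = right[j++];
    }
}
```

Called with the left half 3, 5, 6, 8 and the right half 1, 2, 4, 7, this fills `out` with 1 through 8 in order. Each number is copied exactly once, so merging n numbers takes on the order of n steps.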
2774 02:08:20,720 --> 02:08:21,930 And consider what we've done. 2775 02:08:21,930 --> 02:08:24,120 We've essentially verbally and physically kind of 2776 02:08:24,120 --> 02:08:27,360 defined a helper function, our own custom function if you will, 2777 02:08:27,360 --> 02:08:32,430 whereby Brian has defined what does it mean to merge two arrays-- 2778 02:08:32,430 --> 02:08:35,080 specifically merge two sorted arrays. 2779 02:08:35,080 --> 02:08:35,640 Because why? 2780 02:08:35,640 --> 02:08:37,050 Well, that's a building block that I think 2781 02:08:37,050 --> 02:08:38,920 we're going to want in this merge sort algorithm. 2782 02:08:38,920 --> 02:08:40,795 So just like in actual C code, you might have 2783 02:08:40,795 --> 02:08:43,230 defined a function that does some small task, 2784 02:08:43,230 --> 02:08:46,440 so have we now verbally and physically defined the notion of merging. 2785 02:08:46,440 --> 02:08:49,620 The mind bending part here is that "Sort left 2786 02:08:49,620 --> 02:08:52,770 half of numbers" and "Sort right half of numbers" 2787 02:08:52,770 --> 02:08:54,690 is kind of already implemented. 2788 02:08:54,690 --> 02:08:58,170 There's nothing more for Brian or me to define. 2789 02:08:58,170 --> 02:09:01,890 All that remains is for us to execute this algorithm, focusing especially 2790 02:09:01,890 --> 02:09:04,300 on these three highlighted lines of code. 2791 02:09:04,300 --> 02:09:08,220 And let me disclaim that of the algorithms we've looked at thus far, 2792 02:09:08,220 --> 02:09:10,290 odds are this will be the one that doesn't really 2793 02:09:10,290 --> 02:09:11,887 sink in as quickly as the others. 
2794 02:09:11,887 --> 02:09:14,220 Even if the others might have taken you a moment, a day, 2795 02:09:14,220 --> 02:09:16,887 a week to settle in-- or maybe you're still not quite there yet, 2796 02:09:16,887 --> 02:09:17,520 that's fine-- 2797 02:09:17,520 --> 02:09:20,610 merge sort is a bit of a mind bending one, 2798 02:09:20,610 --> 02:09:23,130 because it seems to work magically. 2799 02:09:23,130 --> 02:09:25,387 But it really just works more intelligently. 2800 02:09:25,387 --> 02:09:27,720 And you'll begin to get more comfortable with harnessing 2801 02:09:27,720 --> 02:09:31,190 these kinds of primitives so that we can ultimately, indeed, solve problems 2802 02:09:31,190 --> 02:09:31,960 more efficiently. 2803 02:09:31,960 --> 02:09:35,130 So Brian has kindly put the numbers again on the top shelf. 2804 02:09:35,130 --> 02:09:37,740 And he has put them into their original, unsorted order, 2805 02:09:37,740 --> 02:09:40,080 just like for selection sort and bubble sort. 2806 02:09:40,080 --> 02:09:44,050 And Brian, I'd like to propose now that we execute this merge sort algorithm. 2807 02:09:44,050 --> 02:09:47,500 And if you don't mind, I'll recite aloud first the few steps. 2808 02:09:47,500 --> 02:09:51,690 So here is one array of size 8 with unsorted numbers. 2809 02:09:51,690 --> 02:09:54,150 The goal is to sort these numbers using merge sort. 2810 02:09:54,150 --> 02:09:57,520 And recall that merge sort essentially is just three steps-- 2811 02:09:57,520 --> 02:10:00,330 sort left half, sort right half, merge sorted halves. 2812 02:10:00,330 --> 02:10:02,280 So Brian, looking at those numbers there, 2813 02:10:02,280 --> 02:10:04,860 could you go ahead and sort the left half of numbers? 2814 02:10:04,860 --> 02:10:05,100 BRIAN: All right. 2815 02:10:05,100 --> 02:10:06,225 So there are eight numbers. 2816 02:10:06,225 --> 02:10:09,810 The left half would be these four numbers, so I will sort those.
2817 02:10:09,810 --> 02:10:13,020 Except I'm not really sure how do I now sort these four numbers. 2818 02:10:13,020 --> 02:10:13,770 DAVID MALAN: Yeah. 2819 02:10:13,770 --> 02:10:16,353 So granted, we've seen selection sort, we've seen bubble sort. 2820 02:10:16,353 --> 02:10:19,530 But we don't want to regress to those older, slower algorithms. 2821 02:10:19,530 --> 02:10:22,110 Brian, I can kind of be a little clever here. 2822 02:10:22,110 --> 02:10:24,220 Well, I'm giving you a sorting algorithm. 2823 02:10:24,220 --> 02:10:28,140 So now you effectively have a smaller problem, an array of size 4, 2824 02:10:28,140 --> 02:10:30,870 and I'm pretty sure we can use the same algorithm, merge sort, 2825 02:10:30,870 --> 02:10:32,670 by sorting left half, sorting right half, 2826 02:10:32,670 --> 02:10:34,520 and then merging the sorted halves. 2827 02:10:34,520 --> 02:10:38,232 So could you go ahead and sort the left half of these four numbers? 2828 02:10:38,232 --> 02:10:38,940 BRIAN: All right. 2829 02:10:38,940 --> 02:10:40,020 So I have these four numbers. 2830 02:10:40,020 --> 02:10:41,228 I want to sort the left half. 2831 02:10:41,228 --> 02:10:42,893 That's these two numbers. 2832 02:10:42,893 --> 02:10:45,060 So now I need to figure out how to sort two numbers. 2833 02:10:45,060 --> 02:10:45,390 DAVID MALAN: All right. 2834 02:10:45,390 --> 02:10:48,490 Now, us with human intuition might obviously know what we have to do here. 2835 02:10:48,490 --> 02:10:51,360 But again, let's apply the algorithm-- sort left half, sort right half, 2836 02:10:51,360 --> 02:10:52,170 merge sorted halves. 2837 02:10:52,170 --> 02:10:55,257 Brian, could you sort the left half of this array of size 2? 2838 02:10:55,257 --> 02:10:57,340 BRIAN: So I've got the array of two, so I'll first 2839 02:10:57,340 --> 02:11:00,150 sort the left half of the array of two, which is the 6.
2840 02:11:00,150 --> 02:11:03,120 DAVID MALAN: And this is where the base case in white on the slide 2841 02:11:03,120 --> 02:11:04,260 comes into play-- 2842 02:11:04,260 --> 02:11:06,180 if only one number, quit. 2843 02:11:06,180 --> 02:11:07,830 So Brian, I can let you off the hook. 2844 02:11:07,830 --> 02:11:12,220 That list of size one with the number 6 is sorted. 2845 02:11:12,220 --> 02:11:13,650 So that's step one of three done. 2846 02:11:13,650 --> 02:11:16,950 Brian, could you sort the right half of that array of size two? 2847 02:11:16,950 --> 02:11:18,620 BRIAN: The right half is the number 3. 2848 02:11:18,620 --> 02:11:20,770 It's also just one number, so that one is done. 2849 02:11:20,770 --> 02:11:21,520 DAVID MALAN: Good. 2850 02:11:21,520 --> 02:11:22,890 So think about where we are in the story. 2851 02:11:22,890 --> 02:11:25,710 We've sorted the left half, and we've sorted the right half, 2852 02:11:25,710 --> 02:11:29,030 even though it looks like neither Brian nor I have done any useful work yet. 2853 02:11:29,030 --> 02:11:30,270 But now the magic happens. 2854 02:11:30,270 --> 02:11:34,290 Brian, you now have two arrays of size 1. 2855 02:11:34,290 --> 02:11:36,222 Could you merge them together? 2856 02:11:36,222 --> 02:11:36,930 BRIAN: All right. 2857 02:11:36,930 --> 02:11:38,670 So I'm going to merge these two together. 2858 02:11:38,670 --> 02:11:40,530 Between the 6 and the 3, the 3 is smaller. 2859 02:11:40,530 --> 02:11:42,300 So that one I'll put there first. 2860 02:11:42,300 --> 02:11:44,700 And then I'll take the 6, and that one goes next. 2861 02:11:44,700 --> 02:11:47,892 And now I have a sorted array of size 2 that is now done. 2862 02:11:47,892 --> 02:11:48,850 DAVID MALAN: All right. 2863 02:11:48,850 --> 02:11:51,725 And this is where you now need to start remembering step by step sort 2864 02:11:51,725 --> 02:11:53,850 of in your brain as the things pile up.
2865 02:11:53,850 --> 02:11:55,180 How did we get to this point? 2866 02:11:55,180 --> 02:11:57,120 We started with a list of size 8. 2867 02:11:57,120 --> 02:12:00,360 We then looked at the left half, which was an array of size 4. 2868 02:12:00,360 --> 02:12:03,630 We then looked at the left half of that, which was an array of size 2, 2869 02:12:03,630 --> 02:12:06,990 then two arrays of size 1, then we merged those two sorted halves. 2870 02:12:06,990 --> 02:12:09,600 So I think now if I rewind in that story, Brian, 2871 02:12:09,600 --> 02:12:14,262 you need to sort the right half of the left half of the original numbers. 2872 02:12:14,262 --> 02:12:14,970 BRIAN: All right. 2873 02:12:14,970 --> 02:12:16,770 So the left half is these four. 2874 02:12:16,770 --> 02:12:20,430 The right half of the left half is going to be these two numbers. 2875 02:12:20,430 --> 02:12:23,790 And so now to those two, I guess I would repeat the process again-- 2876 02:12:23,790 --> 02:12:25,240 look at the numbers individually. 2877 02:12:25,240 --> 02:12:27,825 I would look at the left half of these two, which is the 8. 2878 02:12:27,825 --> 02:12:29,250 That one is done. 2879 02:12:29,250 --> 02:12:31,180 And the 5, that one is done as well. 2880 02:12:31,180 --> 02:12:32,138 DAVID MALAN: All right. 2881 02:12:32,138 --> 02:12:35,262 So step three of three, then, is merge those two sorted halves. 2882 02:12:35,262 --> 02:12:35,970 BRIAN: All right. 2883 02:12:35,970 --> 02:12:39,830 So between the 8 and the 5, the 5 is smaller, so that one will go in first. 2884 02:12:39,830 --> 02:12:41,460 And the 8 will go after that. 2885 02:12:41,460 --> 02:12:45,157 And now I have a second array of size 2 that is also now sorted. 2886 02:12:45,157 --> 02:12:45,990 DAVID MALAN: Indeed. 2887 02:12:45,990 --> 02:12:49,750 So here's where, again, you have to rewind in your mind's eye. 
2888 02:12:49,750 --> 02:12:53,940 We've just now sorted the left half, and we've 2889 02:12:53,940 --> 02:12:58,240 sorted the left half and the right half of the left half. 2890 02:12:58,240 --> 02:13:02,700 So I think the third and final step at this part of the story is, Brian, 2891 02:13:02,700 --> 02:13:07,692 to merge those sorted halves, each of which now is of size 2. 2892 02:13:07,692 --> 02:13:08,400 BRIAN: All right. 2893 02:13:08,400 --> 02:13:12,008 I have two arrays of size 2, each of which is sorted, that I need to merge. 2894 02:13:12,008 --> 02:13:14,300 So I'm going to compare the smallest numbers from each. 2895 02:13:14,300 --> 02:13:16,110 I'm going to compare the 3 and the 5. 2896 02:13:16,110 --> 02:13:18,570 The 3 is smaller, so that one will go in first. 2897 02:13:18,570 --> 02:13:21,960 Now between these two arrays, I have a 6 and a 5 to compare. 2898 02:13:21,960 --> 02:13:24,210 The 5 is smaller, so that one will go next. 2899 02:13:24,210 --> 02:13:26,830 Between the 6 and the 8, the 6 is smaller. 2900 02:13:26,830 --> 02:13:28,530 And I'm left with just the 8. 2901 02:13:28,530 --> 02:13:32,250 So if we go back to the original story of eight numbers that I was sorting, 2902 02:13:32,250 --> 02:13:35,970 I think I have now sorted the left half of the left four numbers 2903 02:13:35,970 --> 02:13:37,230 from that original array. 2904 02:13:37,230 --> 02:13:37,560 DAVID MALAN: Indeed. 2905 02:13:37,560 --> 02:13:39,210 So if you're playing along at home, think 2906 02:13:39,210 --> 02:13:41,335 about-- you've got all these thoughts probably kind 2907 02:13:41,335 --> 02:13:42,430 of piling up in your mind. 2908 02:13:42,430 --> 02:13:43,860 That's indeed supposed to be the case. 2909 02:13:43,860 --> 02:13:46,152 And admittedly, it's hard to keep track of all of that. 
2910 02:13:46,152 --> 02:13:48,520 So we'll let Brian now execute this altogether 2911 02:13:48,520 --> 02:13:52,213 together doing the same thing now, by sorting the right half all 2912 02:13:52,213 --> 02:13:53,130 the way to completion. 2913 02:13:53,130 --> 02:13:53,963 Brian, if you could. 2914 02:13:53,963 --> 02:13:54,672 BRIAN: All right. 2915 02:13:54,672 --> 02:13:56,340 So the right half, you got four numbers. 2916 02:13:56,340 --> 02:13:59,670 I'm going to start by sorting the left half of the right half, which 2917 02:13:59,670 --> 02:14:01,180 is these two numbers here. 2918 02:14:01,180 --> 02:14:02,930 To do that, I'll repeat the same process-- 2919 02:14:02,930 --> 02:14:06,240 sort the left half of these two numbers, which is just the 2. 2920 02:14:06,240 --> 02:14:08,080 That one's done, it's only one number. 2921 02:14:08,080 --> 02:14:09,372 Same thing with the right half. 2922 02:14:09,372 --> 02:14:11,100 The 7 is only one number, so it's done. 2923 02:14:11,100 --> 02:14:13,410 And now I'll merge the sorted halves together. 2924 02:14:13,410 --> 02:14:17,160 Between the 2 and the 7, the 2 is smaller and then the 7. 2925 02:14:17,160 --> 02:14:21,540 So here now is the left half of the right half, an array of size 2, 2926 02:14:21,540 --> 02:14:22,500 that is sorted. 2927 02:14:22,500 --> 02:14:25,260 And I'll do the same thing with the right half of the right half, 2928 02:14:25,260 --> 02:14:27,120 starting with the left half, which is 4. 2929 02:14:27,120 --> 02:14:28,090 That's done. 2930 02:14:28,090 --> 02:14:29,200 The 1 is done. 2931 02:14:29,200 --> 02:14:30,950 And now to merge these two together, I'll 2932 02:14:30,950 --> 02:14:33,190 compare them and say the 1 is smaller. 2933 02:14:33,190 --> 02:14:36,330 So I'll put the 1 down and then the 4. 
2934 02:14:36,330 --> 02:14:39,870 So now I have two sorted arrays, each of size 2, 2935 02:14:39,870 --> 02:14:42,360 that I now need to backtrack and now merge together 2936 02:14:42,360 --> 02:14:44,460 to form an array of size 4. 2937 02:14:44,460 --> 02:14:45,930 So I'll compare the 2 and the 1. 2938 02:14:45,930 --> 02:14:48,000 Between those two, the 1 is smaller. 2939 02:14:48,000 --> 02:14:49,980 Then I'll compare the 2 with the 4. 2940 02:14:49,980 --> 02:14:51,570 The 2 is smaller. 2941 02:14:51,570 --> 02:14:53,190 Then I'll compare the 7 with the 4. 2942 02:14:53,190 --> 02:14:54,480 The 4 is smaller. 2943 02:14:54,480 --> 02:14:57,210 And then finally, I'll just take the 7, the last number, 2944 02:14:57,210 --> 02:14:59,010 and put that in the final spot. 2945 02:14:59,010 --> 02:15:01,740 And so now from the original array of eight numbers, 2946 02:15:01,740 --> 02:15:05,520 I've now sorted the left half, and I've sorted the right half. 2947 02:15:05,520 --> 02:15:09,030 DAVID MALAN: And now that brings us to our third and very final step. 2948 02:15:09,030 --> 02:15:11,950 Could you, Brian, merge the sorted halves? 2949 02:15:11,950 --> 02:15:12,450 BRIAN: Yeah. 2950 02:15:12,450 --> 02:15:14,520 And I think this is actually an example we've seen already. 2951 02:15:14,520 --> 02:15:16,728 And what I'm going to do in order to merge these two halves 2952 02:15:16,728 --> 02:15:19,080 is just take the smaller number from each half 2953 02:15:19,080 --> 02:15:20,640 and compare them again and again. 2954 02:15:20,640 --> 02:15:24,240 So between the 3 and the 1, the 1, that's the smallest number. 2955 02:15:24,240 --> 02:15:25,800 So that goes into place. 2956 02:15:25,800 --> 02:15:29,070 Then between the 3 and the 2, the 2 is smaller, 2957 02:15:29,070 --> 02:15:31,350 so we'll take that and put that into place. 2958 02:15:31,350 --> 02:15:33,930 Now I'm comparing the 3 with the 4. 2959 02:15:33,930 --> 02:15:36,420 The 3, that goes next.
2960 02:15:36,420 --> 02:15:38,640 Next I'm comparing the 5 with the 4. 2961 02:15:38,640 --> 02:15:42,450 4 is smaller, so the 4 goes into place next. 2962 02:15:42,450 --> 02:15:44,670 Now I'm comparing the 5 with the 7. 2963 02:15:44,670 --> 02:15:47,930 5 is smaller, so that one goes into place. 2964 02:15:47,930 --> 02:15:51,630 And next, comparing the 6 with the 7, so the 6 is smaller. 2965 02:15:51,630 --> 02:15:52,870 That goes next. 2966 02:15:52,870 --> 02:15:55,320 And now I'm left with two numbers, the 8 and the 7. 2967 02:15:55,320 --> 02:15:59,100 The 7 is the smaller of the two, so that one goes next. 2968 02:15:59,100 --> 02:16:02,200 And at this point, I only have one number left, which is the 8. 2969 02:16:02,200 --> 02:16:05,100 And so that one's going to go into its sorted position 2970 02:16:05,100 --> 02:16:06,607 at the end of the array. 2971 02:16:06,607 --> 02:16:07,440 DAVID MALAN: Indeed. 2972 02:16:07,440 --> 02:16:10,110 So even though it felt like we weren't really doing anything 2973 02:16:10,110 --> 02:16:12,300 at several points in that story, it all sort of 2974 02:16:12,300 --> 02:16:16,110 came together when we started merging and merging and merging these lists. 2975 02:16:16,110 --> 02:16:19,650 And it's not an accident that Brian was using multiple shelves, 2976 02:16:19,650 --> 02:16:22,320 moving the numbers from top to bottom, to make clear 2977 02:16:22,320 --> 02:16:26,285 just how many times he was effectively dividing that list up. 2978 02:16:26,285 --> 02:16:28,410 We started with a list of eight, and we essentially 2979 02:16:28,410 --> 02:16:33,510 took it to two lists of size 4, four lists of size 2, eight lists of size 1. 2980 02:16:33,510 --> 02:16:35,740 And while it wasn't exactly in that order, 2981 02:16:35,740 --> 02:16:38,850 if you rewind and analyze all of the steps, that's indeed what he did. 2982 02:16:38,850 --> 02:16:42,809 He went from 8 to two 4's to four 2's to eight 1's.
2983 02:16:42,809 --> 02:16:47,040 And that's why he moved those numbers from the top shelf down three times-- 2984 02:16:47,040 --> 02:16:50,920 from 8's, to 4's, to 2's, to 1's. 2985 02:16:50,920 --> 02:16:53,100 So how many times did he move the numbers? 2986 02:16:53,100 --> 02:16:55,590 He moved them three times total. 2987 02:16:55,590 --> 02:17:00,150 And on each of those shelves, how many numbers did he have to merge together? 2988 02:17:00,150 --> 02:17:04,020 On each of those shelves, he ultimately touched all eight numbers. 2989 02:17:04,020 --> 02:17:08,087 He first inserted the smallest number, then the second smallest, then 2990 02:17:08,087 --> 02:17:08,879 the third smallest. 2991 02:17:08,879 --> 02:17:13,709 But unlike selection sort, he had smartly already sorted those halves, 2992 02:17:13,709 --> 02:17:16,072 so he was just plucking them off one at a time. 2993 02:17:16,072 --> 02:17:18,030 He wasn't going back and forth, back and forth. 2994 02:17:18,030 --> 02:17:22,469 He was constantly taking from the beginning of each of those half lists. 2995 02:17:22,469 --> 02:17:26,309 So on every shelf, he was doing, let's say, n steps, 2996 02:17:26,309 --> 02:17:29,790 because he was merging in all n elements of that shelf. 2997 02:17:29,790 --> 02:17:34,379 But how many times did he merge n elements together? 2998 02:17:34,379 --> 02:17:36,633 Well, he did that three total times. 2999 02:17:36,633 --> 02:17:39,550 But if you think about binary search, and really the process of divide 3000 02:17:39,550 --> 02:17:43,070 and conquer more generally, anytime you divide something in half and half 3001 02:17:43,070 --> 02:17:47,150 and half, as he was doing from 8's to 4's to 2's to 1's, 3002 02:17:47,150 --> 02:17:48,170 that's a logarithm. 3003 02:17:48,170 --> 02:17:49,639 That's log base 2. 3004 02:17:49,639 --> 02:17:52,730 And indeed, that is wonderfully the height of this shelf.
3005 02:17:52,730 --> 02:17:56,480 If you have eight elements on the shelf, the number of additional shelves 3006 02:17:56,480 --> 02:18:03,180 Brian used, 3, is exactly what you get by doing the math log base 2 of 8. 3007 02:18:03,180 --> 02:18:08,059 Which is to say, Brian did n things log n times. 3008 02:18:08,059 --> 02:18:10,520 And again with a wave of the hand, computer scientists 3009 02:18:10,520 --> 02:18:13,129 don't bother mentioning the base with Big O notation. 3010 02:18:13,129 --> 02:18:14,840 It suffices just to say log n-- 3011 02:18:14,840 --> 02:18:18,410 Brian did n things log n times. 3012 02:18:18,410 --> 02:18:22,490 And so if we consider, then, the asymptotic complexity 3013 02:18:22,490 --> 02:18:25,430 of this algorithm, that is to say the running time of this algorithm, 3014 02:18:25,430 --> 02:18:30,230 in terms of big O notation, notice that it performs strictly better than 3015 02:18:30,230 --> 02:18:32,660 selection sort and bubble sort-- 3016 02:18:32,660 --> 02:18:34,780 n times log n. 3017 02:18:34,780 --> 02:18:37,760 And even, again, if you're a little rusty on logarithms, log n, 3018 02:18:37,760 --> 02:18:40,040 we have seen as of week 0 in binary search, 3019 02:18:40,040 --> 02:18:43,040 is definitely faster than n steps. 3020 02:18:43,040 --> 02:18:45,530 So n squared is n times n. 3021 02:18:45,530 --> 02:18:49,459 n log n is n times log n, which is indeed mathematically 3022 02:18:49,459 --> 02:18:51,980 better than n squared. 3023 02:18:51,980 --> 02:18:55,400 As with merge sort, though, if we consider the lower bound, 3024 02:18:55,400 --> 02:19:00,049 notice that bubble sort, yes, got us as low as omega of n. 3025 02:19:00,049 --> 02:19:03,750 Turns out merge sort is a little bit like selection sort 3026 02:19:03,750 --> 02:19:07,940 in that it doesn't optimize itself and get you out of the algorithm early. 3027 02:19:07,940 --> 02:19:13,375 It's always n log n, so its lower bound is omega of n log n.
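To make "n things log n times" concrete, here is the arithmetic for the eight numbers on the shelf, written out as a rough count of steps rather than an exact tally of comparisons. The figure for a million numbers is an added illustration, not something from the lecture, and it uses the approximation that log base 2 of a million is about 20.

```latex
\begin{align*}
\text{number of halvings (extra shelves)} &= \log_2 n = \log_2 8 = 3 \\
\text{merging work per shelf} &\approx n = 8 \\
\text{total work for merge sort} &\approx n \log_2 n = 8 \times 3 = 24 \\
\text{compared with } n^2 &= 8 \times 8 = 64 \\[6pt]
\text{for } n = 10^6:\quad n \log_2 n &\approx 10^6 \times 20 = 2 \times 10^7
\quad\text{versus}\quad n^2 = 10^{12}
\end{align*}
```

The gap is modest at n = 8 but grows quickly, which is why the difference between n log n and n squared matters more and more as n gets large.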
3028 02:19:13,375 --> 02:19:14,750 And that might not be acceptable. 3029 02:19:14,750 --> 02:19:16,730 Sometimes you might have certain data inputs 3030 02:19:16,730 --> 02:19:19,700 where maybe it tends to be sorted and you don't want to waste time. 3031 02:19:19,700 --> 02:19:21,799 So maybe you'd be OK with bubble sort. 3032 02:19:21,799 --> 02:19:24,920 But honestly, as n gets large, the probability 3033 02:19:24,920 --> 02:19:29,540 that the input to your sorting algorithm is just by chance going to be sorted 3034 02:19:29,540 --> 02:19:33,200 is probably so, so low that you're just better 3035 02:19:33,200 --> 02:19:36,940 off in the general case using an algorithm like merge sort that's 3036 02:19:36,940 --> 02:19:38,660 n log n always. 3037 02:19:38,660 --> 02:19:41,299 We can see this visually using our bars, too. 3038 02:19:41,299 --> 02:19:43,850 And notice, just as Brian was dividing and conquering 3039 02:19:43,850 --> 02:19:47,150 the problem in half and half and half, and then reconstituting 3040 02:19:47,150 --> 02:19:51,620 the array by merging those halves, you can kind of see that visually here. 3041 02:19:51,620 --> 02:19:53,618 There's a lot more going on. 3042 02:19:53,618 --> 02:19:56,660 And it's going to seem in a moment that everything just kind of magically 3043 02:19:56,660 --> 02:19:57,160 worked. 3044 02:19:57,160 --> 02:20:00,680 But you can see in the faded purple bars that, indeed, this 3045 02:20:00,680 --> 02:20:04,400 is sorting things in halves and then merging those halves together. 3046 02:20:04,400 --> 02:20:06,590 And this visualization was a little different. 3047 02:20:06,590 --> 02:20:08,510 It did not have the luxury of three shelves. 3048 02:20:08,510 --> 02:20:10,760 It just moved top to bottom, top to bottom. 3049 02:20:10,760 --> 02:20:13,460 And honestly, Brian could have been a little more optimal there. 3050 02:20:13,460 --> 02:20:16,250 We wanted to make clear how many total shelves there were. 
3051 02:20:16,250 --> 02:20:18,470 But honestly, there's no reason he couldn't have just 3052 02:20:18,470 --> 02:20:21,950 moved the numbers down then back up, then back down then back up. 3053 02:20:21,950 --> 02:20:24,440 And indeed, that's the price you pay with merge sort. 3054 02:20:24,440 --> 02:20:27,920 Even though n log n is better than n squared, and ergo 3055 02:20:27,920 --> 02:20:31,790 merge sort is arguably better than selection sort and bubble sort, 3056 02:20:31,790 --> 02:20:32,960 you pay a price. 3057 02:20:32,960 --> 02:20:35,750 And this speaks to the trade-off I mentioned earlier. 3058 02:20:35,750 --> 02:20:39,560 Almost always, when you do something better in code 3059 02:20:39,560 --> 02:20:42,897 or solve a problem more intelligently, you have paid a price. 3060 02:20:42,897 --> 02:20:45,230 Maybe you spent more time as the human writing the code, 3061 02:20:45,230 --> 02:20:47,420 because it was harder and took more sophistication. 3062 02:20:47,420 --> 02:20:48,620 That is a cost. 3063 02:20:48,620 --> 02:20:51,680 Maybe you had to use actually more space. 3064 02:20:51,680 --> 02:20:56,950 Brian had to have at least one extra shelf in order to implement merge sort. 3065 02:20:56,950 --> 02:20:59,360 If implementing merge sort in code in C, 3066 02:20:59,360 --> 02:21:05,180 you will need at least a second array to temporarily put the numbers into as you 3067 02:21:05,180 --> 02:21:06,800 merge things back and forth. 3068 02:21:06,800 --> 02:21:09,740 If you want to be extravagant, you can have three separate arrays 3069 02:21:09,740 --> 02:21:11,090 or four separate arrays. 3070 02:21:11,090 --> 02:21:14,240 But it suffices, per the graphical representation of merge sort, 3071 02:21:14,240 --> 02:21:16,103 to just use a second array. 3072 02:21:16,103 --> 02:21:18,020 Now, that might not seem like such a big deal. 3073 02:21:18,020 --> 02:21:21,710 But implicitly, you need twice as much space.
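Here is one way the whole algorithm, including that second array, might look in C. It is a sketch under stated assumptions, not the lecture's own code: the names `merge_sort`, `merge_sort_range`, and `scratch` are made up for illustration, and the choice to allocate one scratch array up front and copy each merged run back is just one of several reasonable designs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Sort numbers[lo..hi) (hi is exclusive), using scratch[lo..hi) as the
// temporary "second shelf". Names are illustrative assumptions.
static void merge_sort_range(int numbers[], int scratch[], size_t lo, size_t hi)
{
    // Base case: if there's only one number (or none), quit. It's sorted.
    if (hi - lo < 2)
    {
        return;
    }

    size_t mid = lo + (hi - lo) / 2;
    merge_sort_range(numbers, scratch, lo, mid); // sort left half
    merge_sort_range(numbers, scratch, mid, hi); // sort right half

    // Merge the sorted halves into the scratch array, always taking the
    // smaller of the two front numbers.
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
    {
        if (numbers[i] <= numbers[j])
        {
            scratch[k++] = numbers[i++];
        }
        else
        {
            scratch[k++] = numbers[j++];
        }
    }
    while (i < mid)
    {
        scratch[k++] = numbers[i++];
    }
    while (j < hi)
    {
        scratch[k++] = numbers[j++];
    }

    // Copy the merged run back so the caller sees it in `numbers`.
    memcpy(&numbers[lo], &scratch[lo], (hi - lo) * sizeof(int));
}

// Sort `numbers` in place. Returns 0 on success, or -1 if the second
// array could not be allocated.
int merge_sort(int numbers[], size_t n)
{
    if (n < 2)
    {
        return 0;
    }

    // The price of merge sort: a second array as large as the first.
    int *scratch = malloc(n * sizeof(int));
    if (scratch == NULL)
    {
        return -1;
    }

    merge_sort_range(numbers, scratch, 0, n);
    free(scratch);
    return 0;
}
```

Calling `merge_sort` on Brian's original shelf of 6, 3, 8, 5, 2, 7, 4, 1 leaves the array as 1 through 8. The single `malloc` of n integers is the extra shelf: the memory in use roughly doubles while the sort runs.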
3074 02:21:21,710 --> 02:21:23,270 And that might be a big deal. 3075 02:21:23,270 --> 02:21:27,590 If you've got a million things to sort, and you now need two arrays, 3076 02:21:27,590 --> 02:21:30,530 that's 2 million chunks of memory that you need. 3077 02:21:30,530 --> 02:21:32,052 And maybe that's not tenable. 3078 02:21:32,052 --> 02:21:34,010 So there, too, there's going to be a trade-off. 3079 02:21:34,010 --> 02:21:36,320 And maybe while slower, selection sort or bubble sort, maybe 3080 02:21:36,320 --> 02:21:38,820 it's better because it's a little more efficient with space. 3081 02:21:38,820 --> 02:21:41,120 It's going to depend on what you care about 3082 02:21:41,120 --> 02:21:42,870 and what you want to optimize for. 3083 02:21:42,870 --> 02:21:44,750 And honestly, money is sometimes a factor. 3084 02:21:44,750 --> 02:21:48,500 In the real world, maybe it's better to write slightly slower code 3085 02:21:48,500 --> 02:21:52,130 so that you don't have to buy twice as many servers or twice as much memory 3086 02:21:52,130 --> 02:21:53,210 for your computer. 3087 02:21:53,210 --> 02:21:55,730 It depends there on what resource is more important-- 3088 02:21:55,730 --> 02:22:00,770 your time, the computer's time, your wallet, or some other resource 3089 02:22:00,770 --> 02:22:01,340 altogether. 3090 02:22:01,340 --> 02:22:03,590 So we'll continue to see these kinds of trade-offs. 3091 02:22:03,590 --> 02:22:07,500 But perhaps the most mind blowing thing we can do as we wrap up here 3092 02:22:07,500 --> 02:22:12,920 is share a few visualizations of how these algorithms actually compare. 3093 02:22:12,920 --> 02:22:17,900 And one last piece of jargon is this one final Greek symbol, theta. 3094 02:22:17,900 --> 02:22:21,920 It turns out that, thanks to selection sort and merge sort, 3095 02:22:21,920 --> 02:22:26,570 we can actually apply one more term of art here, this theta notation.
3096 02:22:26,570 --> 02:22:30,140 Anytime an algorithm has both the same upper bound 3097 02:22:30,140 --> 02:22:33,050 as its lower bound running time, you can actually 3098 02:22:33,050 --> 02:22:37,980 describe it in just one sentence instead of two in terms of theta notation. 3099 02:22:37,980 --> 02:22:41,375 So because selection sort was in both big O of n squared and omega 3100 02:22:41,375 --> 02:22:44,870 of n squared, you can actually just say, ah, it's in theta of n squared. 3101 02:22:44,870 --> 02:22:49,010 It's always n squared either in the upper bound or in the lower bound. 3102 02:22:49,010 --> 02:22:50,210 Same thing for merge sort. 3103 02:22:50,210 --> 02:22:52,310 It's in theta of n log n. 3104 02:22:52,310 --> 02:22:57,860 We cannot use theta for bubble sort or for binary search or for linear search, 3105 02:22:57,860 --> 02:23:01,150 because they had different upper and lower bounds. 3106 02:23:01,150 --> 02:23:05,600 Well, let me go ahead now and prepare a final demonstration, 3107 02:23:05,600 --> 02:23:08,090 this time using some random inputs. 3108 02:23:08,090 --> 02:23:11,750 So you'll see here a video comparing selection sort, bubble sort, 3109 02:23:11,750 --> 02:23:13,760 and merge sort all together. 3110 02:23:13,760 --> 02:23:16,560 All three of them start with random data. 3111 02:23:16,560 --> 02:23:18,830 But let's just see what it means for an algorithm 3112 02:23:18,830 --> 02:23:26,295 to be in n squared in the worst case or in n log n in this case instead. 3113 02:23:26,295 --> 02:23:26,920 [MUSIC PLAYING] 3114 02:23:26,920 --> 02:23:29,295 Selection sort's on the top, bubble sort's on the bottom, 3115 02:23:29,295 --> 02:23:30,490 merge sort's in the middle. 3116 02:23:30,490 --> 02:23:35,352 And would you believe it, merge sort is already done.
3117 02:23:35,352 --> 02:23:37,280 [MUSIC INTENSIFIES] 3118 02:23:37,280 --> 02:23:40,902 And meanwhile, we have some very trendy music we can listen to, 3119 02:23:40,902 --> 02:23:42,610 which is really just there to distract us 3120 02:23:42,610 --> 02:23:47,060 from the fact of how slow n squared actually is in practice. 3121 02:23:47,060 --> 02:23:49,420 And notice, there's not that many bars here. 3122 02:23:49,420 --> 02:23:51,310 There's maybe like a hundred or so bars. 3123 02:23:51,310 --> 02:23:52,690 Like, n is 100. 3124 02:23:52,690 --> 02:23:53,882 That's not even a big value. 3125 02:23:53,882 --> 02:23:56,590 When we're talking about the Twitters, the Facebooks, the Googles 3126 02:23:56,590 --> 02:23:59,080 of the world, these are trivial sizes. 3127 02:23:59,080 --> 02:24:03,220 And yet, my God, we're still waiting for selection sort and bubble sort 3128 02:24:03,220 --> 02:24:04,300 to finish. 3129 02:24:04,300 --> 02:24:08,110 And so you can see here that it really matters when you exercise a little bit 3130 02:24:08,110 --> 02:24:10,980 more cleverness, and you leverage a more efficient algorithm-- 3131 02:24:10,980 --> 02:24:14,495 and finally, selection sort is done, bubble sort still taking 3132 02:24:14,495 --> 02:24:15,370 a little longer here. 3133 02:24:15,370 --> 02:24:16,810 And this is going to depend on the input. 3134 02:24:16,810 --> 02:24:18,610 Sometimes you can get lucky or unlucky. 3135 02:24:18,610 --> 02:24:22,390 But I think it's convincing that merge sort has won in this case. 3136 02:24:22,390 --> 02:24:26,940 [MUSIC PLAYING] 3137 02:24:26,940 --> 02:25:25,000