[MUSIC PLAYING] DAVID MALAN: All right. This is CS50, and this is week 3. And you'll recall that last week, we equipped you with a lot more tools by which to solve problems-- not only problems that we had proposed, but problems in your own code, that is to say bugs. And recall that those tools involve command line tools like help50 for help with cryptic error messages that the compiler might spit out; style50, which gives you a bit of feedback on the stylization of your code, the aesthetics thereof; check50, which checks the correctness of your code against the specifications in a given problem set or lab; printf, which is a function that exists in some form in almost any programming language that you might ultimately learn, and this is simply a way of printing out anything you might want from the computer's memory onto the screen.
Then perhaps the most powerful of these tools was debug50, which was this interactive debugger. And even though this command debug50 is a little specific to CS50, what it triggers to happen, that little side window where you can see the stack of functions that you might have called during some break point, and you can see the local variables that you might have defined at some point during the execution of your code, that's a very common conventional feature of any debugger with most any language.
And then lastly, recall there was this ddb, duck debugger, which of course, takes this physical form, if you happen to have a rubber duck lying around with whom you can talk. But I'm so pleased to say that if you lack that currently while at home, CS50's own Kareem and Brenda and Sophie have wonderfully added, if you haven't noticed already, that same virtual duck to CS50 IDE. So if you click in the top corner, you can actually begin to have a chat of sorts with the rubber duck. And while this is a certainly more playful incarnation of that same idea, we really can't emphasize enough the value of talking through problems when you're experiencing them in code with someone else or with something else.
This particular duck, not all that large of a vocabulary, but it's not so much what the other person says but what you say and what you hear yourself saying that is undoubtedly the most valuable part of the process. So our thanks to Kareem and Brenda and Sophie on that.
Recall last week, too, that we took a look underneath the hood, literally in some sense, at the computer's memory in your laptop or desktop or phone. And then we decided to think about this more artistically as just a grid of bytes. So within that chip, there's a whole bunch of bits. And if you look at eight of them at a time, there's a whole bunch of bytes. And it stands to reason that we could think of this as the first byte, the second byte, the third byte, and so forth, and sort of chop this up pictorially into just a whole sequence of bytes in the computer's memory.
And recall that if we zoom in on that and focus on just one contiguous block of memory, otherwise known as an "array," we can do things within this array like storing a bunch of different values. So recall last week, we started by defining, a little goofily, multiple variables that were almost identically named, like scores1, scores2, scores3. And then we began to clean up the design of our code by introducing an array, so we can have just one variable called scores, that is of size 3 and has room for multiple values.
So today, we'll continue to leverage this feature of many programming languages-- being able to store things contiguously, back to back to back to back, in a computer's memory, because this very simple layout, this very simple feature of the language, is going to open up all sorts of powerful features. And in fact, we can even revisit some of the problems we tried to solve way back in week 0.
But there is a catch with arrays. And we didn't really emphasize this much last week. And that's because, even though you and I can glance at this picture on the screen and see immediately that, oh, there's seven boxes on the screen, there are seven locations in which you can store values, you and I can sort of have this bird's eye view of everything and just see what's inside that entire array all at once. But computers, recall, are much more methodical, more algorithmic, if you will. And so a computer, as powerful as they are, can technically only look at one location in an array at a time.
So whereas you and I can glance at this and sort of take it all in at once, a computer just can't glance at its memory and take in all at once all of the values therein. It has to do so more methodically, for instance, from left to right, maybe right to left, maybe middle onward. But it has to be an algorithm. And so today we'll formalize that notion and really kind of highlight the fact that this array cannot be seen all at once; you can only look at one location in an array at a given time.
And this is going to have very real implications. For instance, if we consider that very first problem in the very first week where we tried to find my phone number in a phone book, the very naive approach was to start at the beginning and search from left to right. And we tried a couple of variants thereafter. But the problem, quite simply, is that of searching. And this is a term of art in computer science, super common, certainly for you and I as users on Google and the like to search for things all day long. And so certainly searching well, designing a search algorithm well, is certainly a compelling feature of so many of today's tools that you and I use.
So if we think of this really as a problem to solve, we've got some input, which, for instance, might be an array of numbers, or maybe an array of web pages in the case of Google. And the goal is to get some output. So if the input to the problem is an array of values, the output, hopefully, is going to be something as simple, really, as a bool-- yes or no. Is the value you're looking for discoverable? Can you search for and find that value, yes or no, true or false?
Now, within this black box, recall, is going to be some algorithm. And that's where today we'll spend most of our time. Indeed, we won't really introduce that many more features of C. We won't introduce that much more code. We'll focus again on ideas, just taking for granted now that you have some more tools in your toolkit. Beyond loops and conditions and Boolean expressions, we now have this other tool known as arrays.
But let's first introduce some other terms of art, some jargon if you will, related to what we'll call running time. So we've alluded to this a few times. When we're thinking about just how good or bad an algorithm is, we describe how long it takes to run. That is its running time. The running time of an algorithm is how long it takes-- how many steps it takes, how many seconds it takes, how many iterations it takes.
It doesn't really matter what your unit of measure is. Maybe it's time, maybe it's iterations or something else. But running time just refers to how long an algorithm takes. And there are ways that we can think about this a little more formally. And we kind of did this already in the first week, but we didn't give it this name. This italicized O, this capital O on the screen, is otherwise known as Big O notation.
And computer scientists and some mathematicians will very frequently use, literally, this symbol to describe the running times of algorithms, or mathematically like a function. So recall this picture, in fact. When we were searching that phone book, we did it sort of good, better, best. We did it linearly-- that is, searching one page at a time, we did it twice as fast by doing two pages at a time-- and then we did it logarithmically by dividing and conquering, in half and half and half.
And at the time, I proposed that if we think of a phone book as having n pages, where n is just a number in computer science vernacular, we might describe the running time, or the number of steps involved for that first algorithm, as being maybe in the worst case n steps. If the person you're looking for in a phone book maybe alphabetically has the last name starting with Z in English, well, the Z might be all the way at the end of the phone book. So at the worst case, you might be taking n steps to find someone like myself in that phone book.
The second algorithm, though, was twice as fast, because we went two pages at a time. So we might describe its running time as n divided by 2. And then the third algorithm, where we divided the problem in half and half and half, literally throwing half of the problem away again and again, was logarithmic-- technically log base 2 of n, which, again, is just a mathematical formula that refers to halving something again and again and again. And you start with, of course, n pages in that scenario.
Well, it turns out that a computer scientist would actually wave their hands at some of these mathematical details. Indeed, we're not going to get into the habit of writing very precise mathematical formulas. What we're instead going to do is try to get a sense of the order on which the running time of an algorithm is, just roughly how fast or how slow it is, but still using some symbology like n as a placeholder.
And so a computer scientist would describe the running time of all three of those algorithms from week 0 as being big O of n, or big O of n/2, or big O of log base 2 of n. So "big O" just means "on the order of." It's sort of a wave of the hand. Maybe it's n minus 1, maybe it's n plus 1, maybe it's even 2n. But it's on the order of n or these other values.
But in fact, too, notice that in this chart, there's something kind of curious. Like, these first two algorithms from week 0 kind of pictorially look pretty much the same. Like undoubtedly, the yellow line is a little lower and therefore a little better and a little faster than the red line. But they have the same shape.
And in fact, I bet if we zoomed way out, these two straight lines would pretty much look identical. If you change your axis to be big enough and tall enough, these would start to blur together. But clearly, the green line is fundamentally different.
And so this speaks to a computer scientist's tendency to not really quibble over these details. Like, yes, the second algorithm in week 0 was better. Yes, this yellow line is better. But, eh, let's just call both of those algorithms' running times on the order of n. That is to say, a computer scientist tends to throw away constant factors, like the 1/2 or the divided by 2. And they tend to focus only on the dominant factor, like which value in that mathematical expression is going to grow the most, grow the fastest. And in n divided by 2, n is going to dominate over time. The bigger the phone book gets, the more pages you have. It's really n that's going to matter, less so than that divided by 2.
And same thing over here. If you're familiar with and remember your logarithms, we don't really have to even care about the base of that logarithm. Yes, it's base 2, but eh, we can just multiply that logarithm by some other number to convert it to any base we want-- base 10, base 3, base 7, anything. So let's just say it's on the order of log n.
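The reason the base can be dropped is the standard change-of-base identity: switching bases only multiplies the logarithm by a constant, and constants are exactly what big O throws away. Base 10 is used below purely as an example.

```latex
\log_2 n = \frac{\log_{10} n}{\log_{10} 2} \approx 3.32 \, \log_{10} n
\quad\Longrightarrow\quad
O(\log_2 n) = O(\log_{10} n) = O(\log n)
```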
So this is good, because it means we're not really going to waste time getting really into the weeds mathematically when we talk about the efficiency of algorithms. It suffices to describe things really in terms of the variable, n in this case, if you will, that dominates over time. And indeed, let's zoom out. If I zoom out on this picture, boom, you begin to see that, yeah, these are really starting to look almost identical. And if we kept zooming out, you would see that they're essentially one and the same. But the green one stands out, so that's indeed on the order of log of n as opposed to n itself.
So here's a little cheat sheet. It turns out that within computer science, and within the analysis of algorithms, we're going to tend to see some common formulas like this. So we've just seen on the order of n. We've seen on the order of log n.
It turns out that very common, too, is going to be n times log n, maybe even n squared, and then even big O of 1. And the last of those just means that an algorithm takes, wonderfully, one step-- or maybe two steps, maybe even 10 steps, but a constant number of steps. So that's sort of the best case scenario, at least among these options. Whereas, n squared is going to start to take a long time. It's going to start to feel slow, because if you take any value of n and square it, that's going to imply more and more steps.
So just a bit of jargon, then, to start off today, whereby we now have this sort of vocabulary with which to describe the running times of an algorithm in terms of this Big O notation. But there's one other notation. And just as big O refers to an upper bound on running times, like how many steps maximally, how much time maximally might an algorithm take, this omega notation refers to the opposite. What's a lower bound on the running time of an algorithm?
And we don't need another picture or other formulas. We can reuse the same one. So this cheat sheet here just proposes that, when describing the efficiency or inefficiency of an algorithm and you want to come up with a lower bound-- like minimally, how many steps does my algorithm take-- we can use the same mathematical formulas, but we can note that with omega instead of big O.
So again, looks fancy, but it really just refers to a wave of the hand trying to sort of ballpark exactly what the running time is of an algorithm. And thankfully, we've seen a few algorithms already, including in that week 0, and now we're going to give it a more formal name. Linear search is what we did with that phone book first off by searching it page by page by page, looking for my phone number in that particular example.
And so the difference today is that, unlike us humans, who can look down at a phone book page and see a whole bunch of names and numbers at once, unlike a human who can look at an array on the board a moment ago and sort of see everything at once, we need to be more methodical, more deliberate today so that we can translate week 0's ideas now, not into even pseudocode, but actual C code.
And so wonderfully, here at the American Repertory Theater as we are on Harvard's campus this semester, we've been collaborating with the whole team here who are much more artistically inclined than certainly I could be on my own here. And we have these seven wonderful doors that were previously used in various theatrical shows that took place here in this theater. And we've even collaborated with the theater's prop shop, who in back have wonderfully manufactured some delightful numbers and brought them to life.
Which is to say that, behind each of these seven doors is a number. And this is going to be an opportunity now to really hammer home the point that when we want to search for some number in an array, it's pretty equivalent to having to search for a number, in this case, behind an otherwise closed door. You and I can't just look at all of these doors now and figure out where a number is. We have to be more methodical. We have to start searching these doors, maybe from left to right, maybe from right to left, maybe from the middle on out. But we need to come up with an algorithm and ultimately translate that to code.
So for instance, suppose I were to search for the number 0. How could we go about searching, methodically, these seven wooden doors for the number 0? Let me take a suggestion from the audience. What approach might you take? What first step would you propose I take here on my own with these doors? Any recommendations? How do I begin to find myself the number 0?
Florence, what do you propose?
AUDIENCE: I would propose starting from the left, since 0 is a smaller number.
DAVID MALAN: OK, good. And hang in there with me for just a moment. Let me go ahead and start on the left edge as Florence proposes. Go ahead and open the door, and hopefully, voila-- no. It's a number 4. So it's not a 0. So Florence, what would you propose I do next?
AUDIENCE: I'd probably start in the middle somewhere, if, like, in case, I don't know, it's going down by 1.
DAVID MALAN: OK. So maybe it's going down. So let me go ahead and try that. So you propose middle, I could go over here, and voila-- nope. That's the number 2. And I wonder, where else should I look. Let me-- I'm a little curious. I'm a little nervous that I ignored these doors. So Florence, if you don't mind, let's go ahead and look here and-- no, that's the number 6, it seems.
Let's go ahead and check in here, the number 8. So they're kind of going up and down. So Florence, how might I finish searching for this number? What remains to be done, would you say?
AUDIENCE: Probably start from the right now.
DAVID MALAN: OK. So I could start from the right now, and maybe just go over here. And voila-- and there it is. So we found the number 0. So let me ask Florence, what was your algorithm? How did you go about so successfully finding the number 0 for us?
AUDIENCE: I guess I initially tried starting, like, by going down by 1. So like, if the number was not at the left, then going to the center, which is, like, halfway in between and then going to [INAUDIBLE]. I don't know.
DAVID MALAN: And playfully, how did that work out for you, going to the middle? Better or worse, no different?
AUDIENCE: I mean, I guess maybe it helped a little bit to then go all the way to the right.
DAVID MALAN: OK. Yeah, we might have gleaned some information. But let's go ahead and take a look at all of the doors for a moment. There's that 4 and the 6 again. Here is that 8 again. Over in the middle we have the 2 again. Over here we have a 7 for the first time. Over here we have a 5. And then of course, we have a 0.
And if you took all of that in, honestly, Florence, you and I, we couldn't really have done any better. Because these door-- these numbers, it turns out, are just randomly arranged behind these doors. So it wasn't bad at all that you kind of hopped around. Although, the downside is if you hop around, you and I as humans can pretty easily remember where we've been before. But if you think about how we would translate that to code, I feel like we're starting to accumulate a bunch of variables maybe, because you have to keep track of that.
So frankly, maybe the simplest solution-- whoops-- maybe the simplest solution would have been where we started in week 0, where we just take a very simple if naive approach of starting with our array, this time of size 7, behind which are some numbers. And if you don't know anything about those numbers, honestly the best you can do is just that same linear search from week 0, and just check, one at a time, the values behind each of these doors and just hope that eventually you will find it.
So this is already sort of taking a lot of time, right? If I do this linear search approach like I did in week 0, I'm potentially going to have to search behind all of those doors. I'm going to have to search behind all of those doors.
So let's consider a little more formally exactly how I could at least implement that algorithm. Because I could take the approach that Florence proposed, and just kind of jumping around and maybe using a bit of intuition. But again, that's not really an algorithm. We really need to do something more step by step.
And in the meantime, let's go ahead, Joe, and let's close the curtain and see if we can't clean those up with another problem in a moment, while we consider now linear search and the analysis thereof. So with linear search, I would propose that we could implement it in pseudocode first, if you will, like this. For i from 0 to n minus 1-- we'll see where we're going with this-- if the number is behind the i-th door, return true, otherwise at the very end return false.
So it's a relatively simple translation into pseudocode, much like we did with the phone book some time ago. And why, though, these values? Because I'm now starting to express myself a little more like C, even though it's still pseudocode. So for i from 0 to n minus 1. So computer scientists tend to start counting from 0. If there's n doors, or 7 doors in this case, you want to go from 0 on up to 6, or from 0 on up to n minus 1.
So this is just a very common way of setting yourself up with a for loop, maybe in C, maybe in pseudocode in this case, that just gets you from left to right, algorithmically step by step. If a condition, number is behind the i-th door-- i-th just being a colloquial way of saying, what is behind the door at location i-- go ahead and return true. I have found myself the number I want, for instance, the number 0.
And then notice that this return false is not part of an else, because I don't want to abort this algorithm prematurely and abort simply because a number is not behind the current door. I essentially want to wait all the way to the end of the algorithm, after I've checked all n doors, and if I have still not found the number I care about, then and only then am I going to return false. So a very common programming mistake might be to nest this internally and think about things in terms of ifs and elses. But you don't need to have an else. This is kind of a catchall here at the very end.
But now let's consider, if this is the pseudocode for linear search, just what is the efficiency of linear search? What is the efficiency of linear search, which is to say, how well-designed is this algorithm? We put or gave ourselves a framework a moment ago, Big O notation, which is an upper bound, which we can think of for now as meaning like a worst case. In the worst case, how many steps might it take me to find the number 0-- or any number for that matter-- among n doors? Is it big O of n squared, big O of n times log n, big O of n, big O of log n, or big O of one, which, again, just means a constant fixed number of steps?
Brian, could we go ahead and pull up this question? Let me go ahead and pull it up on my screen as well. If you go to our usual URL to propose what you think an upper bound is on the running time of linear search. OK. Indeed, if we consider now the running time of linear search, it's going to be big O of n. Why is that?
So in the worst case, the number I'm looking for, 0, might very well be at the end of that list, which is going to be on the order of n steps, or in this case precisely n steps. So that's one way to think about this. Well, now let me ask a follow-up question. Proposing instead that we consider omega notation, which is a lower bound on the running time of an algorithm-- Brian, could we go ahead and ask this question next? At that same URL, we'll see a question asking now for the possible answers for the running time-- for a lower bound on the running time of linear search.
So let's go ahead and take a look at this one here. And in just a moment, we'll see as the responses come in, about 75-plus percent of you are proposing that it's actually omega of 1. So omega is a lower bound. 1 refers to constant time. And why is that?
Let me just take a quick answer on this point. Among the 75% of you who said one step, or a constant number of steps, why is that? How do you think about this lower bound on running time? How about from Keith? Why omega of 1?
AUDIENCE: Yeah, you can just open it and be lucky and find it in the first door.
DAVID MALAN: Yeah. So it really speaks to just that. You might just get lucky, and the number you're looking for might be at the very first door. So the lower bound, in the best case, if you will, of this algorithm, linear search might very well be omega of 1 for exactly that reason-- that you have-- get lucky and the element might be there at the beginning. So that's pretty good. You really can't do any better than that.
So we have this range now of a lower bound from omega of 1 on up to big O of n being an upper bound on the running time of linear search. But of course, we have this other algorithm in our toolkit. And recall from week 0 that we looked at binary search-- although, not necessarily by name. It was that divide-and-conquer third algorithm, where we took the phone book and split it in half and half and half again.
Now, while I fumbled there, Joe kindly has given us a new set of doors. If Joe, you could go ahead and reveal our seven doors again, behind which we still have some numbers. But I think this time, I'm going to be a little better off. Cue Joe and the doors behind. There we go.
So we have our same seven doors. But behind those doors now is a different arrangement of numbers. And suppose this time, I want to find myself the number 6. So the number 6-- we'll change the problem slightly-- but I'm going to give you one other ingredient this time, which is going to be key to this working. Why were Florence and I able to do no better than linear search before? Why were Florence and I able to do no better than randomly searching even last time?
What was it about the array of numbers, or the array of doors, that did not allow me previously to use binary search? Iris, what do you think?
AUDIENCE: Because we didn't know the numbers are sorted or not.
DAVID MALAN: Yeah. We didn't know if the numbers were sorted or not. And indeed, barring that detail, Florence and I really couldn't have done any better than, say, linear search. So this time, though, Joe has kindly sorted some numbers behind these doors for us. And so if I want to search for the number 6, now I can begin to use a bit of that information.
So you know what, I'm going to start just like we did with the phone book and start roughly in the middle. And voila, number 5. All right. So we're pretty close, we're pretty close. But the thing about binary search, recall, is that this is now useful information. If the numbers are sorted behind these doors, all of the doors to the left should presumably be lower than 5, and all of the doors to the right should presumably be larger than 5.
Now, I might kind of cut a corner here and be like, well, if this is 5, 6 is probably right next door, literally. But again, algorithmically, how might we do this? We don't want to necessarily consider these special cases. So more generally, it looks like I now have an array of size 3. So let me go ahead and apply that same algorithm, voila, to the middle. Now I have the number 7.
And now it's becoming pretty clear that if the number 6 is present, it's probably behind this door. And indeed, if I now look at my remaining array of size 1, and voila, in the middle there is that number 6. So this time, I only had to open up three doors instead of all seven, potentially, or maybe all six doors to find my way to that number, because I was given this additional ingredient of all of those numbers being sorted.
So it would seem, then, that you can apply the better, more efficient, better designed algorithm, now known as binary search, if only someone like Joe would sort the numbers for you in advance. So let's consider now a little more algorithmically how we might implement this. So with binary search, let me propose this pseudocode.
If the number is behind the middle door, return true-- we found it. So if we got lucky, then we might very well have found the number 6 behind the middle door, and we would have been done. But that didn't happen. And in the general case that probably won't happen. So if the number is less than that behind the middle door, then just like with the phone book, I'm going to go to the left, and I'm going to search the left half of the remaining doors in the array.
Else if the number is greater than that behind the middle door, then like the phone book I'm going to go ahead and search the right half of the remaining doors. But there might still be one final case potentially, whereby if there's no doors left at all, or no doors in the first place, I should at least have this one special case where I do say return false. For instance, if 6, for whatever reason, weren't among those doors and I were searching for it, I still need to be able to handle that situation where I can say definitively return false if I'm left with no further doors to search.
So here, then, might be the pseudocode for this algorithm a bit more formally. Now let's consider the analysis thereof. Before, where we left off, linear search was big O of n. Linear search was big O of n. This time let's consider where binary search actually falls into place by asking a different question. I'm going to go ahead and go back and ask this question now-- what's an upper bound on the running time of binary search?
An upper bound on the running time of binary search-- and go ahead and buzz in, if you'd like, similarly to before. What's an upper bound on the running time of binary search? And you can see here answers are getting pretty dominant around log n. And indeed, that jibes with exactly what we did in week 0. The correct answer is indeed log of n, because that's going to be the maximum number of times that you can take a list or an array of a given size and split it in half and half and half, until you find the number you're looking for, or ultimately you don't find that number at all.
Meanwhile, if we consider now not just the upper bound on this algorithm-- so in the worst case, binary search takes big O of log n-- now let's consider a related question which is, what's a lower bound on the running time of this same algorithm? What's a lower bound on the running time?
I'll go ahead and pluck this one off myself and go back to some of the suggestions thus far. In the best case, maybe, too, you do get lucky, and the number you're looking for, 6 or some other number, is smack dab in the middle of the array. And so maybe indeed you can get away with just one step. And indeed, a lower bound on binary search now might very well just be an omega of 1, because in that best case you just get lucky, and it's right where you happen to start, in this case in the middle.
So we seem to have a range there. But strictly speaking, it would seem that binary search is better. Binary search is better than linear search, because as n gets big, big, big, you can really feel that difference. In fact, recall from week 0 we played a little bit with these light bulbs. And right now, all 64 of these light bulbs are on.
And let's consider for a moment, just to put this into perspective, how long it would take to use linear search to find one light bulb among these 64. And recall that in the worst case, maybe the light bulb, or the number that we're looking for, is way down there at the end, but we don't know in advance. And so Sumner, if you wouldn't mind executing linear search on these light bulbs, let's just get a feel for the efficiency or inefficiency of this algorithm. Linear search in light bulb form.
So you'll notice that one light bulb at a time is going out, implying that I've searched that door, searched that door, searched that door. But we've only gone through 10 or so bulbs, and we've got another 50-plus to go. And you can see that if we look inside of these doors one per second, or turn off these light bulbs one per second, it's going to take a long time. In fact, it doesn't seem worthwhile to even wait until the very end.
So Sumner, if you wouldn't mind, let's bring all the lights back up, and let's try once more another algorithm, this one binary search, just to get, again, a feel of what the running time is of an algorithm, like binary search that runs in logarithmic time. So in just a moment, we'll go ahead and execute binary search on these light bulbs, the idea being that there's one bulb we care about. Let's see how fast we can get down to just one bulb out of 64.
So Sumner, on your marks, get set, go. And we're done just a few steps later. And then we have this sole light bulb. That was so much faster. And in fact, we did this deliberately one iteration at a time. The algorithm that we just executed with Sumner's and Matt's help, algorithmically was operating at what's called 1 hertz, 1 hertz. And if you're unfamiliar with hertz, it's just one something per second. It's very often used in physics or just in discussions of electricity more generally.
And indeed, in this case if you're doing one thing per second, that first algorithm, linear search, might have taken us like 64 seconds to get all the way to that final light bulb. But that second algorithm was logarithmic. And so by going from 64 to 32 to 16 to 8 to 4 to 2 to 1, we get to the final result much faster, even going at the same pace.
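To put numbers on that comparison, here is a minimal sketch in C that counts the steps each approach needs in the worst case. The function names `linear_steps` and `halving_steps` are made up for this illustration; they are not part of the lecture's code.

```c
// Worst case for linear search: look at every one of the n bulbs.
int linear_steps(int n)
{
    return n;
}

// Binary search halves what is left on every step: 64, 32, 16, 8, 4, 2, 1.
// This counts how many halvings it takes to get from n down to 1.
int halving_steps(int n)
{
    int steps = 0;
    while (n > 1)
    {
        n /= 2;
        steps++;
    }
    return steps;
}
```

At one step per second, `linear_steps(64)` is 64 seconds in the worst case, while `halving_steps(64)` is only 6.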
So in fact, if you think of your computer's CPU, CPUs are also measured in hertz-- H-E-R-T-Z. Probably measured in gigahertz, which is billions of hertz. So your CPU, the brain of your computer, if it's 1 gigahertz, that means it can literally do 1 billion things per second. And here we have this sort of simpler setup of just light bulbs doing one thing per second. Your computer can do 1 billion of these kinds of operations every second.
So just imagine, therefore, how much these savings tend to add up over time if you can take big bites out of these problems at once, as opposed to doing things like we did in week 0, just one single step at a time. All right. Well, let's now go ahead and start to translate this to code. We have enough tools in our toolkit in C that I think, based on our discussion of arrays last week, we can now actually start to build something in code on our own.
So I'm going to go ahead and create a file here in just a moment, in CS50 IDE, called, for instance, numbers.c. Let me go ahead and translate this to a file in C code called numbers.c. The goal at hand is just to implement linear search in code, just so that we're no longer waving our hands at the pseudocode but doing things a little more concretely.
So I'm going to go ahead and include cs50.h. I'm going to go ahead and include stdio.h. And I'm going to start with no command line arguments, like we left off last week, but just with main void. And I'm going to go ahead and give myself an array of numbers, seven numbers, just like the doors. And I'm going to go ahead and say int numbers.
And then this is a little trick that we didn't see last week, but it's handy for creating an array when you know in advance what numbers you want, which I do, because I'm going to mimic the doors that Joe kindly set up for us here, I'm going to go ahead and say give me an array that is equal to 4, 6, 8, 2, 7, 5, 0. And this is the feature we didn't see last week. If you know in advance the numbers that you want to assign to an array, you actually don't have to bother specifying the size of the array explicitly. The compiler can figure that out intelligently for you. But you can use these curly braces with commas inside to enumerate from left to right the values that you want to put into that array.
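As a small sketch of that curly-brace notation: the array below is written with empty square brackets, and the compiler works out the length from the seven values. The helper `count_numbers` is hypothetical and only there to show that the inferred length really is seven. In the lecture the array is declared inside main; it sits at file scope here just to keep the sketch short.

```c
// The size is left out of the brackets; the compiler counts the
// seven initializers and makes the array exactly that long.
int numbers[] = {4, 6, 8, 2, 7, 5, 0};

// Hypothetical helper: the number of elements in the array above,
// computed as total bytes divided by the bytes in one element.
int count_numbers(void)
{
    return sizeof(numbers) / sizeof(numbers[0]);
}
```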
So after this line 6 has executed in my computer, I'm going to be left with an array called numbers, inside of which are seven integers listed from left to right in the computer's memory, so to speak, in this way. Now, what do I want to do with these numbers? Well, let's implement linear search.
Linear search, as we latched on to earlier, is searching from left to right or equivalently right to left-- but convention tends to go left to right. So I'm going to do a standard for loop. For int i gets 0, i is less than 7-- I'm going to keep it simple for now and hardcode this, but we could clean this up if we want, and I'm going to do i++ on each iteration.
So I'm pretty sure that my line 8 will induce a for loop that iterates seven total times. And what question do I want to ask on each iteration? Well, if the numbers array at location i equals equals-- for instance, the number I was searching for initially, let's go ahead and search for 0-- then what do I want to do? Let me go ahead and print out something arbitrary but useful, like "Found," quote, unquote, so the human knows. And then let me go ahead, and just for good measure, let me go ahead and return 0. And we'll come back to that in just a moment.
But at the end of this program, I'm also going to do this-- printf "Not found" with a backslash n. And then I'm going to go ahead and return 1. But before we tease apart those returns, just consider the code in the aggregate. Here's my entire main function. And on line 6, to recap, I initialized the array, just as we did at the very beginning, with a seemingly random list of numbers behind the doors. Then on line 8, I'm going to iterate with this for loop seven total times, incrementing i in each turn.
And then line 10, just like I was opening the doors one at a time, I'm going to check if the i-th number in this array equals equals the number I care about, 0, with that first demo. I'm going to print "Found." Otherwise-- not else, per se-- but otherwise, if I go through this entire loop, checking if, if, if, if, if, and I never actually find 0, I'm going to have this catchall at the end that just says no matter what, if you reach line 16, print "Not found," and then return 1.
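Put together, the loop and the catchall might look roughly like the sketch below. One assumption to flag: the logic is wrapped in a made-up function, `search_numbers`, so it can be tried with different targets. In the lecture the same code sits directly inside main with 0 hardcoded, which is why the return values end up as the program's exit status. The line numbers mentioned in the lecture won't match this sketch.

```c
#include <stdio.h>

// Hypothetical wrapper around the loop described above.
// Returns 0 if target is in the array, 1 otherwise.
int search_numbers(int target)
{
    int numbers[] = {4, 6, 8, 2, 7, 5, 0};

    // Check each element from left to right, seven in total
    for (int i = 0; i < 7; i++)
    {
        if (numbers[i] == target)
        {
            printf("Found\n");
            return 0;
        }
    }

    // Only reached if the loop never found the target
    printf("Not found\n");
    return 1;
}
```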
Now, this is a bit of a subtlety. But could someone remind us what's going on with the return 0 on line 13 and the return 1 on line 17? Why 0 and 1, and why am I returning at all? What problem is this solving for me? Even though most of our programs thus far, we haven't bothered too much with this.
Demi, is it? What do you think?
AUDIENCE: It's Demi, but basically, return 0 is like it was executed correctly, or it found it, and it kind of exits that loop saying that it was found. And then return 1 is like the return false, and it exits as well.
DAVID MALAN: Exactly. And "exit" really is the operative word. In main, when you are done-- ready to quit the program, as we've done with the word "quit" in some of our pseudocode in the past, you can literally return a value. And recall at the end of last week, we introduced the fact that main always returns an int. You and I have ignored that for at least a week or two, but sometimes it's useful to return an explicit value, whether it's for autograding purposes, whether it's for automated testing of your code in the real world, or just so it's a signal to the user that something indeed went wrong.
So you can return a value from main. And as Demi proposed, 0 means "all is well." And it's a little counter-intuitive, because thus far true tends to be a good thing. But in this case, 0 is a good thing. All is well. It's success. And if you return any other value, for instance 1, that indicates that something went wrong.
So the reason I'm returning 0 after printing the word "Found" is so that effectively the program exits at that point. I don't want to keep going again and again if I already found the number I care about. And down here, this one admittedly isn't strictly necessary, because if I hit line 16 and maybe deleted line 17, the program's going to end anyway. But there wouldn't be that so-called exit status that we discussed last week briefly, whereby you can kind of signal to the computer whether something was successful or unsuccessful.
And the reason that 0 is a good thing and 1 or any other number is not, consider how many things can go wrong in programs that you write or that companies in the real world write when you get those error messages, sometimes with those cryptic error codes. There are hundreds, thousands of problems that might happen in a computer program, and there could be that many error codes that you see on the screen, reasons explaining why the program crashed or froze or the like.
But 0 is sort of special in that it's just one value that the world has decided means "success." So there's only one way to get your program right, in a sense, but there's so many millions of ways in which things can go wrong. And that's why humans have adopted that particular convention.
All right. But let's consider now not just numbers, but let's make things more interesting. Besides the doors, suppose that we actually had people's names behind them. Well, let's go ahead and write a program this time that not only searches for numbers, but instead searches for names. So I'm going to go ahead and create a different file here called names.c.
And I'm going to start a little similarly. I'm going to include cs50.h at the top, I'm going to include stdio at the top. But I'm also this time going to include string.h, which we introduced briefly last week, so that we have access to strlen for getting the length of a string, and, it turns out, some other functions. Let me go ahead and declare int main void as usual.
And then inside here, I need some arbitrary names. So let's come up with seven names here. And here, too, I can declare an array just as I did before. But it doesn't have to store only ints. It can store strings instead.
So I've changed the data type from int to string, and I've changed the variable name from numbers to names. And I can still use this new curly brace notation, and I can give myself a name like Bill, and maybe Charlie, and maybe Fred, and maybe George, and maybe Ginny, and maybe Percy, and lastly, maybe a name like Ron. And it just barely fits on my screen.
So with that said, I now have this array of names. And beyond there being a perhaps obvious pattern to them, there's a second less obvious, or maybe obvious, pattern to them. How would you describe the list of names I arbitrarily just came up with? What's a useful characteristic of them? What do you notice about these names?
And there's at least two right answers to this question, I think. What do you notice about these names? Jack?
AUDIENCE: They're in alphabetical order.
DAVID MALAN: Yes. So beyond being the names of the Weasley children in Harry Potter, they're also in alphabetical order. And that's the more salient detail. For our purposes, I've had the forethought this time to sort these names in advance. And if I've sorted these names, that means implicitly I can use a better algorithm than linear search. I can use, for instance, our old binary search.
But let's go ahead first and just search them naively for now. Let's still apply linear search, because, you know, what we haven't yet done is necessarily compare strings against one another. We've done a lot of comparisons of numbers like integers. But what about names?
So let me go ahead and do this. So for int i gets 0, just like before, i less than 7, i++-- and I'm doing this only because I know in advance there are seven names. I think we could probably improve the design of this code, too, by having a variable or a constant storing that value. But I'm going to keep it simple and focus only on the new details for now.
And it turns out, for reasons we'll explore in more detail next week, it is not sufficient to do what we did before and do something like this if I'm searching for "Ron." It turns out that in C, you can't use equals equals to compare two strings. You can for an int, you can for a char. And we've done both of those in the past. But there's a subtlety that we'll dive into in more detail next week that means you can't actually do this.
And this is curious, because if you have prior programming experience in languages like Python or the like, you can do this. So in C you can't, but we'll see next time why. But for now, it turns out that C can solve this problem, and historically the way you do this is with a function.
So inside of the string.h header file, there is not only a declaration for strlen, the length of a string like last week. There's another function called strcmp. And "stir compare," for short, S-T-R-C-M-P, allows me to pass in two strings, one string that I want to compare against another string.
So it's not quite the same syntax. Indeed, it's a little harder to read. It's not quite as simple as equals equals. But strcmp, if we read the documentation for it, will tell us that this compares two strings.
And it returns one of three possible values. If those two strings are equal, that is, identically the same letter for letter, then this function is going to return 0, it turns out. If the first string is supposed to come before the second string alphabetically, in some sense, then this function is going to return a negative value. If the first string is supposed to come after the second string alphabetically, if you will, then it's going to return a positive value. So there's three possible outcomes-- either equal to 0, or less than 0, or greater than 0.
But you'll notice, and in fact, if you look at the documentation some time, it doesn't specify what value less than 0 or what value greater than 0. You have to just check for any negative value or any positive value. And I also told a bit of a white lie a moment ago. This does not check things alphabetically, even though it coincidentally does sometimes.
It actually compares strings in what's called ASCII order, or ASCIIbetically, which is kind of a goofy way of saying that this function looks at every character in the two strings, from left to right, checks their ASCII values, and compares those ASCII values character by character. And if one ASCII value is less than the other, then it returns a negative value, or vice versa.
So if you have, for instance, the letter A, capital A in the string, that gets converted first to 65. And then if you have an A in the other string capitalized, it, too, gets converted to 65, and those would be equal. But of course, all of these names have more than one character, so this ASCII order, or ASCIIbetical, proceeds left to right so that strcmp checks every character in the names for you.
And it stops when it hits that terminating null character. Recall that strings, underneath the hood, always end in C with this backslash 0, or eight 0 bits. So that's how strcmp knows when to stop comparing values. But if I go ahead and find someone like Ron, let me go ahead and print out quote, unquote, "Found." And like before, I'll go ahead and return, like Demi proposed, 0, just to imply that all is successful. Otherwise, if we get all the way to the bottom of my code, I'm going to print out "Not found" to tell the story that we did not find Ron in this array, even though he does happen to be there, and I'm going to go ahead and return 1.
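Since only the sign of strcmp's return value is specified, one way to get a feel for it is to collapse the result down to -1, 0, or 1. The helper `compare_sign` below is invented for this sketch, and plain `const char *` stands in for CS50's `string` type so that the code compiles without cs50.h.

```c
#include <string.h>

// Hypothetical helper: reduce strcmp's result to -1, 0, or 1,
// because only the sign of the return value is guaranteed.
int compare_sign(const char *a, const char *b)
{
    int result = strcmp(a, b);
    if (result < 0)
    {
        return -1; // a comes before b in ASCII order
    }
    if (result > 0)
    {
        return 1;  // a comes after b in ASCII order
    }
    return 0;      // identical, character for character
}
```

With this helper, `compare_sign("Bill", "Ron")` is -1 and `compare_sign("Ron", "Ron")` is 0. And `compare_sign("RON", "Ron")` is -1, because capital O is 79 in ASCII while lowercase o is 111, which is where ASCII order and alphabetical order part ways.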
So even though I've hardcoded everything-- to hardcode something in a program means to type it out explicitly-- you could imagine using a command line argument like last week to get user's input. Who would you like to search for? You could imagine using get_string to get user's input and ask them, who would you like to search for?
But for now, just for demonstration's sake, I've used only Ron's name. And if I haven't made any typos-- let me go ahead and type in make names, Enter, so far so good, ./names. And hopefully, we'll see, indeed, "Found," because "Ron" is very much in this array of seven siblings. But the building blocks that are new here are, again, the fact that when we declare an array whose contents we know in advance, we don't strictly need to put a number here, and we have this curly brace notation to list those contents. But perhaps lastly and most powerfully, we do have this function in C called strcmp that will allow us to actually compare strings in this way.
So let me pause here and just ask if there's any questions about how we translated these ideas to code for numbers, and how we translated these ideas to code for now names, each time using linear search, not binary. Caleb, question?
AUDIENCE: Yeah. So would that program still work if "Ron," for example, was like all caps, like if you're trying to search-- like, if the cases are different in terms of uppercase and lowercase?
DAVID MALAN: Really good question. And let me propose an instinct that's useful to acquire in general-- when in doubt, try it. So I'm going to do exactly that. I do happen to know the answer, but suppose I didn't. Let me go ahead and change "Ron" to all caps, just because. Maybe the human, the Caps Lock key was on, and they typed it in a little sloppily.
Let me go ahead and make no other changes. Notice that I'm leaving the original array alone with only a capital R. Let me remake this program, make names, ./names. And voila, he's still, in fact, found. Stand by. Oh, OK.
Caleb, you have just helped me unearth a bug that was latent in the previous example. None of you should have accepted the fact that the previous program worked with "RON," because I didn't practice literally what I'm preaching. So Caleb, hold that thought for just a moment so I can rewind a little bit and fix my apparent bug.
So "RON" was indeed found. But he wasn't found because "RON" was found. I did something stupid here. And it's perhaps all the more pedagogically appropriate now to highlight that. So how did this program say "Ron" was found, even though this time it also says "RON" was found in all caps?
And you know what, let me get a little curious here. Let me go ahead and search for, not even "Ron." How about we search for Ron's mom, "Molly"? Make names. All right. And now, just to reveal that I really did do something stupid, ./names. OK, now something's clearly wrong, right? I can even search for the father "Arthur", make names, ./names. It seems that I wrote you a program that just literally always says "Found."
So we shouldn't have accepted this as correct. Can anyone spot the bug based on my definition thus far? Can anyone spot the bug? In the meantime, this isn't really a bad time to open up the duck and say, "Hello, duck. I am having a problem whereby my program is always printing Found even when someone is not in the array." And I could proceed to explain my logic to the duck, but hopefully Sophia can point me at the solution even faster than the duck.
AUDIENCE: You need to compare the value that we received from strcmp with something. So we need to compare it with like 0 and make sure that we receive the value that they're equal.
DAVID MALAN: Perfect. So I said the right thing, but I literally did not do the right thing. If I want to check for equality, I literally need to check the return value when comparing names bracket i against "Ron" to equal 0. Because only in the case when the return value of strcmp is 0 do I actually have a match.
By contrast, if the function returns a negative value or the function returns a positive value, that means it's not a match. That means that one name is supposed to come before the other or after the other. But the catch with my shorthand syntax here, which is not always an incorrect syntax to use, whenever you have a Boolean expression inside of which is a function call like this-- notice that the entirety of my Boolean expression is just a call, so to speak, to strcmp. I'm passing in two inputs, names bracket i and quote, unquote "Ron." And therefore, I'm expecting strcmp to return output, a so-called return value.
That return value is going to be negative or positive or 0. And in fact, to be clear, if the first name being searched for is "Bill" and names bracket i or names bracket 0 is "Bill," "Bill" comma "Ron" is effectively what my input is on the first iteration. "Bill," alphabetically and ASCIIbetically, comes before "Ron," which means it should be returning a negative value to me.
And the problem with Boolean expressions, as implemented in this context, is that only 0 is false. Any other return value is by definition true or a Yes answer, whether it's negative 1 or positive 1, negative 1 million or positive 1 million-- any non-zero value in a computer language like C is considered true, also known as truthy. Any value that is 0 is considered false, but only that value is considered false.
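The difference between the buggy condition and the one Sophia proposed can be sketched side by side. Both function names here are made up for illustration, and `const char *` again stands in for CS50's `string`.

```c
#include <string.h>

// The buggy condition: any nonzero strcmp result counts as true,
// so this reports a "match" exactly when the strings differ.
int buggy_match(const char *a, const char *b)
{
    if (strcmp(a, b))
    {
        return 1;
    }
    return 0;
}

// The corrected condition: only a return value of 0 means equal.
int correct_match(const char *a, const char *b)
{
    if (strcmp(a, b) == 0)
    {
        return 1;
    }
    return 0;
}
```

On the first iteration the comparison is "Bill" against "Ron," which gives a negative and therefore truthy value. So `buggy_match("Bill", "Ron")` is 1 and the program prints "Found" right away, while `correct_match("Bill", "Ron")` is 0.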
So really, I was getting lucky at first, because my program was finding "Bill," but I was confusing "Bill" for "Ron." Then when I did it again for Caleb and I capitalized "Ron," I was getting unlucky, because suddenly I knew "RON" capitalized wasn't in the array, and yet I'm still saying he's found. But that's because I didn't practice what I preach per Sophia's find.
And so if I actually compare this against 0-- and now, Caleb, we come full circle to your question-- I rebuild this program with make names, I now do ./names and search for all caps "RON," I should now see, thankfully, "Not found." So I wish I could say that was deliberate, but thus is the common case of bugs. So here I am 20 years later making bugs in my code. So if you run up to a similar problem this week, rest assured that it never ends. But hopefully you won't have several people watching you while you do your problem set this week.
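With the comparison against 0 in place, the name search might look something like this sketch. As assumptions: the logic is wrapped in a hypothetical function, `search_names`, so that different names can be tried without recompiling (the lecture hardcodes the name inside main), and `const char *` stands in for CS50's `string`.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical wrapper around the corrected loop.
// Returns 0 if target is in the array, 1 otherwise.
int search_names(const char *target)
{
    const char *names[] = {"Bill", "Charlie", "Fred", "George", "Ginny", "Percy", "Ron"};

    for (int i = 0; i < 7; i++)
    {
        // Only a return value of 0 from strcmp means the strings are equal
        if (strcmp(names[i], target) == 0)
        {
            printf("Found\n");
            return 0;
        }
    }
    printf("Not found\n");
    return 1;
}
```

Here `search_names("Ron")` returns 0, while `search_names("RON")` and `search_names("Molly")` both return 1, since the comparison is case sensitive and Molly isn't in the array.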
All right. Any questions, then, beyond Caleb's? So great question, Caleb, and the answer is no. It is case sensitive. So it does not find "RON." Any questions here? Any questions on linear search using strings? No?
All right, well, let's go ahead and do one final example, I think, with searching. But let's introduce just one other feature. And this one's actually pretty cool and powerful. Up until now, we've been using data types that just come with C or come from CS50, like int, and char, and float, and the like. And you'll see now that there's actually sometimes reasons where you or I might want to create our own custom data types, our own types that didn't exist when C itself was invented.
So for instance, suppose that I wanted to represent not just a whole bunch of numbers and not just a whole bunch of names, but suppose I want to implement like a full-fledged phone book. A phone book, of course, contains both names and numbers. And suppose I want to combine these two ideas together. Wouldn't it be nice if I could have a data structure that is a data type that has some structure to it that can actually store both at once?
And in fact, wouldn't it be nice if C had a data type called person, so that if I want to represent a person, like in a phone book, who had both a name and a number, I can actually implement that in code by calling that variable of type person? Now, of course, the designers of C did not have the foresight to create a data type called person. And, indeed, that would be a slippery slope if they had a data type for every real-world entity you can think of. But they did give us the capabilities to do this.
So if a person, in our limited world here of phone books, has both a name and a number, we might think of it as follows-- a name and a number, both of type string. But a quick check here. Why have I now decided, somewhat presumptuously, to call phone numbers strings as well? We've been talking about ints behind these doors. We've been searching for ints in code. But why did I just presume to propose that we instead implement a phone book using strings for names and numbers?
Any thoughts here, Kurt?
AUDIENCE: Yeah. Because we're not doing math on it. It's like-- a phone number could be, like, letters for all we care. And in fact, I mean, sometimes you see, like, 1-800 Contacts or something like that, and maybe we want to allow that.
DAVID MALAN: Yeah, absolutely. A phone number, despite its name, isn't necessarily just a number. It might be 1-800 Contacts, which is an English word. It might have hyphens in it or dashes. It might have parentheses in it. It might have a plus sign for country codes. There's a lot of characters that we absolutely can represent in C using strings that we couldn't represent in C using int. And so indeed, even though in the real world there are these "numbers" that you and I talk about once in a while like phone numbers, maybe in the US Social Security numbers, credit card numbers, those aren't necessarily values that you want to treat as actual integers.
And in fact, those of you who did the credit problem and tried to validate credit card numbers may very well have run into challenges by using a long to represent a credit card number. It probably in retrospect might very well have been easier for you to treat credit card numbers as strings. The catch, of course, by design is that you didn't yet have strings in your vocabulary, at least in C.
So suppose I want to create my own custom data type that encapsulates, if you will, two different types of values. A person shall be henceforth a name and a number. It turns out that C gives us this syntax here. This is the only juicy piece of new syntax besides those curly braces a moment ago that we'll see today in C, typedef. And as the name rather succinctly suggests, this allows you to define a type. And the type will be a structure of some sort.
So a data structure in a programming language is typically a data type that has some structure to it. What do we mean by "structure"? It typically has one or more values inside of it. So using typedef, and in turn using the struct keyword, we can create our own custom type that's a structure, a composition of multiple other data types.
So if we want to keep persons together as their own custom data type, the syntax is a little cryptic here. You literally do typedef struct open curly brace, then one per line you specify the data types that you want and the names that you want to give to those data types, for instance name and number. And then outside of the closing curly brace, you literally put the word "person," if that's indeed the data type that you want to invent.
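Written out, that syntax looks roughly like the following. One substitution to flag: `const char *` is used in place of CS50's `string`, so that the sketch compiles without cs50.h.

```c
// A person is a name plus a number, both stored as text.
typedef struct
{
    const char *name;
    const char *number;
}
person;
```

After this definition, `person` can be used anywhere a built-in type like int could be, for example to declare a variable or an array.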
So how can we use this more powerfully? Well, let's go ahead and do things the wrong way without this feature first, so as to motivate its existence. Let me go ahead and save this file as phonebook.c. And let me start, as always, with include cs50.h. And then let me go ahead and include stdio.h. And then lastly, let me also include string.h, because I know I'm going to be manipulating some strings in a moment.
Let me go ahead now, and within my main function, let me go ahead and give myself initially, for the first version of this program, a whole bunch of names. Specifically, how about "Brian" comma "David"? We'll keep it short, just so as to focus on the ideas and not the actual data therein. Then Brian and I each have phone numbers. So let's go ahead and store them in an array-- numbers equals, again the curly braces as before, and +1-617-495-1000-- and indeed, there's already motivation, per Kurt's comment, to use strings, because we've got a plus and a couple of dashes in there-- and then my number here. So we'll do +1-949-468-2750 close curly brace, semicolon.
So I've gone ahead and declared two arrays, one called names, one called numbers. And I'm just going to have sort of a handshake agreement that the first name in names corresponds to the first number in numbers, the second name in names corresponds to the second number in numbers. And you can imagine that working well so long as you don't make any mistakes, and you have just the right number in each.
Now let me go ahead and do for int i equals 0, i less than 2, i++-- I'm going to keep that hardcoded for now just to do the demonstration. And then inside of this loop, let me go ahead and search for my phone number, for instance, even though I happen to be at the end. So if strcmp of names bracket i equals-- rather, comma "David" equals equals 0-- so I'm not going to make that mistake again.
Let me go ahead inside of this loop, inside of this condition here. And I'm going to go ahead and do the following-- print out that I found, for instance my number. And I'm going to plug that in. So numbers bracket i.
And then as before, I'm going to go ahead and return 0. And if none of this works out, and I happen not to be in this array, I'll go ahead and print out as before "Not found" with a semicolon. And then I'll return 1 arbitrarily. I can return negative 1, I could return a million, negative million. But human convention would typically have you go from 0 to 1 to 2 to 3 on up, if you have that many possible error conditions.
All right. So I essentially have implemented in C a phone book of sorts. We did this verbally in week 0. Now I'm doing it in code. It's a limited phone book. It's only got two names and two numbers. But I could certainly implement this phone book by just using two arrays, two parallel arrays, if you will, by just using the honor system that the first element in names lines up with the first element in numbers and so forth.
Now hopefully, if I don't make any typos, let me go ahead and make phonebook. All right. It compiled OK. ./phonebook, and it found what seems to be my number there. So it seems to work correctly, though I've tried to pull that one over you before. But I'm pretty sure this one actually works correctly. And so we found my name and in turn number.
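A sketch of this first, parallel-arrays version follows. The wrapper function `lookup_parallel` is an invention for this example (the lecture puts the code in main and hardcodes "David"), and `const char *` stands in for CS50's `string`.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical wrapper: look up a name using two parallel arrays.
// Returns 0 and prints the number if found, otherwise returns 1.
int lookup_parallel(const char *target)
{
    // Honor system: names[i] and numbers[i] belong to the same person
    const char *names[] = {"Brian", "David"};
    const char *numbers[] = {"+1-617-495-1000", "+1-949-468-2750"};

    for (int i = 0; i < 2; i++)
    {
        if (strcmp(names[i], target) == 0)
        {
            printf("Found %s\n", numbers[i]);
            return 0;
        }
    }
    printf("Not found\n");
    return 1;
}
```

Nothing in the code ties the two arrays together. If an entry were added to one array and not the other, or two entries were swapped in just one of them, it would still compile and quietly print the wrong number.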
But why is the design of this code not necessarily the best? This is starting to get more subtle, admittedly. And we've seen that we can do this differently. But what rubs you the wrong way about here? This is another example of what we might call "code smell." Like, something's a little funky here, like, ah, this might not be the best solution long term. Nick, what do you think?
AUDIENCE: Yeah. So what I'm guessing is that-- like, you know how you made the data frame before the new data structure, where the two things were linked together? In this case, we're just banking on the fact that we don't screw something up and unintentionally unlink them from the same index. So they're not intrinsically linked, which might not be--
DAVID MALAN: That's exactly the right instinct. In general, as great a programmer as you're maybe aspiring to be, you're not all that. And like, you're going to make mistakes. And the more you can write code that's self-defensive, that protects you from yourself, the better off you're going to be, the more correct your code is going to be, and the more easily you're going to be able to collaborate successfully, if you so choose in the real world, on real-world programming projects, whether for a research project, a full-time job, a personal project or the like. Generally speaking, you should not trust yourself or other people with whom you're writing code. You should have as many defense mechanisms in place exactly along these lines.
So yes, there's nothing wrong with what I have done in the sense that this is correct. But as noted, if you screw up, and maybe you get an off by one error-- maybe you transpose two names or two numbers. I mean, imagine if you've got dozens of names and numbers, hundreds of names and numbers, thousands of them, the odds that you or someone messes the order up at some point are just probably going to be too, too high.
So it would be nice, then, if we could sort of keep related data together. This is kind of a hack, to just on the honor system say, my arrays line up, I'm just going to make sure to keep them the same length. We can do better. Let's keep related data together and design this a little more cleanly.
And I can do this by defining my own type that I'll call for instance, a person. So at the top of this file, before main, I'm going to go ahead and typedef a structure, inside of which are the two types of data that I care about, string name and string number, just as before. Notice, though, here that what I have done here is not give myself an array. I've given myself one name and one number.
Outside of this curly brace, I'm going to give this data type a name, which I could call "person." I could call it anything I want, but person seems pretty reasonable in this case. And now down here, I'm going to go ahead and change this code a little bit. I'm going to go ahead and give myself an array still, but this time I'm going to give myself an array of persons.
And I'm going to call that array, somewhat playfully, "people," because I want to have two persons, two people, in this program, me and Brian. Now I want to go ahead and populate this array. That is, I want to fill it with values. And this syntax is a little new, but it's just to enable us to actually store values inside of a structure.
If I want to index into this array, there's nothing different from last week. I do people bracket 0. That's going to give me the first person variable inside, so probably where "Brian" is supposed to go. The one last piece of syntax I need is how do I go inside of that structure, that person data structure, and access the person's name? I literally just do a dot.
So people bracket 0 gives me the first person in the people array. And then the dot means, go inside of it and grab that person's name. I'm going to go ahead and set that name equal to quote, unquote "Brian." The syntax now for his number is almost identical-- people bracket 0 dot number equals quote, unquote "+1-617-495-1000" semicolon.
Meanwhile, if I want to access a location for myself, I'm going to go ahead and put it at location 1, which is the second location. Name will be, quote, unquote "David." And then over here, I'm going to do people bracket 1 dot number equals quote, unquote "+1-949-468-2750" close quote, semicolon.
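The type definition and the four assignments described above can be sketched as follows. This is a minimal sketch rather than the exact file from the lecture: the `string` typedef is a local stand-in for the one that cs50.h provides, and the function name `populate` is hypothetical, used only to give the assignments a home.

```c
// Stand-in for the string type that cs50.h provides (an assumption made
// so this sketch compiles without the CS50 library).
typedef char *string;

// One person: a name and a number kept together in a single structure.
typedef struct
{
    string name;
    string number;
}
person;

// Fills a two-element array with the values used in the lecture.
// The name "populate" is hypothetical.
void populate(person people[])
{
    // people[0] is the first person; the dot reaches inside the structure
    people[0].name = "Brian";
    people[0].number = "+1-617-495-1000";

    // people[1] is the second person
    people[1].name = "David";
    people[1].number = "+1-949-468-2750";
}
```

In `main`, this would be used by declaring `person people[2];` and then calling `populate(people);`.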
So it's a bit verbose, admittedly. But you could imagine, if we just let our thoughts run ahead of ourselves here, if you used get_string, you could sort of automatically do this. If you used command line arguments, maybe you could populate some of this. We don't just have to hardcode, that is, write my name and number and Brian's into this program. You can imagine doing this more dynamically using some of our techniques, using get_string and so forth, from week 1.
But for now, it's just for demonstration's sake. So now if I want to search this new array, this new single array of people, I think my for loop can stay the same. And I think I can still use strcmp. But now I need to go inside of not names but people, and look for the dot name field. So data structures have fields or variables inside of them.
So I'm going to use the dot notation there, too, go into the i-th person in the people array, and compare that name against, for instance, quote, unquote "David." And then if I have found "David," in this case myself, go ahead and access the people array again, but print out using printf the number. So again, the dot operator is the only new piece of syntax that's letting us go inside of this new feature known as a data structure.
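The search over the single array of people might look like the sketch below. It assumes the same `person` type as in the lecture, with `string` defined locally as a stand-in for cs50.h. The lecture prints the number directly inside `main`; wrapping the loop in a function named `find_number` is a choice made here for illustration only.

```c
#include <stddef.h>
#include <string.h>

// Stand-in for the string type from cs50.h (an assumption for this sketch).
typedef char *string;

typedef struct
{
    string name;
    string number;
}
person;

// Linear search: returns the number stored for `name`, or NULL if nobody
// in the first `count` elements has that name. "find_number" is a
// hypothetical name, not one used in the lecture.
string find_number(person people[], int count, string name)
{
    for (int i = 0; i < count; i++)
    {
        // Go into the i-th person and compare its name field
        if (strcmp(people[i].name, name) == 0)
        {
            return people[i].number;
        }
    }
    return NULL;
}
```

A caller could then print the result with `printf("%s\n", ...)` when it is not `NULL`, and report "Not found" otherwise.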
If I go ahead and make phonebook again after making those changes, all is well. It compiled OK. And if I run ./phonebook, I now have hopefully found my number again. So here is a seemingly useless exercise, in that all I really did was re-implement the same program using more lines of code and making it more complicated. But it's now better designed. Or it's a step toward being better designed, because now I've encapsulated all inside of one variable, for instance, people bracket 0, people bracket 1, all of the information we care about with respect to Brian, or me, or anyone else we might put into this program.
And indeed, this is how programs, this is how the Googles of the world, the Facebooks of the world store lots of information together. Consider any of your social media accounts like Instagram, or Facebook, or Snapchat and the like. You have multiple pieces of data associated with you on all of those platforms-- not just your username but also your password, also your history of posts, also your friends and followers and the like. So there's a lot of information that these companies, for better or for worse, are collecting on all of us.
And can you imagine if they just had one big array with all of our usernames, one big array with all of our passwords, one big array with all of our friends? Like, you can imagine certainly at scale, that's got to be a bad design, to just trust that you're going to get the ordering of all of these things right. They don't do that. They instead write code in some language that somehow encapsulates all the information related to me and Brian and you inside of some kind of data structure. And that's what they put in their database or some other server on their back end.
So this encapsulation is a feature we now have in terms of C. And it allows us to create our own data structures that we can then use in order to keep related data together. All right, any questions, then, on data structures, or more specifically typedef and struct, the C keywords with which you can create your own custom types that themselves are data structures? Besley?
AUDIENCE: Hi. So is it typical to define the new data structure outside of main, like in a header?
DAVID MALAN: Really good question. Is it typical to define a new data structure outside of main? Quite often yes. In this case, it's immaterial, because I only have one function in this program, main.
But as we'll see this week and next week and onward, our programs are going to start to get a little more complicated by nature of just having more features. And once you have more features, you probably have more functions. And when you have more functions, you want your data structure to be available to all of those functions. And so we'll begin to see definition of some of these structures being, indeed, outside of our own functions.
Peter, over to you.
AUDIENCE: Oh, yeah. Will we define new classes in header files later, or will we keep defining them outside of main?
DAVID MALAN: Really good question. Might we define our own types and our own data structures in header files? Yes. Eventually we'll do that, too. Thus far, you and I have only been using header files that other people wrote. We've been using stdio.h, string.h, that the authors of C created. You've been using cs50.h which we the staff wrote.
It turns out, you can also create your own header files, your own .h files, inside of which are pieces of code that you want to share across multiple files of your own. We're not quite there yet. But yes, Peter, that would be a solution to this problem by putting it in one central place.
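As a preview only, since the lecture defers this topic, a header of your own could look roughly like the sketch below. The file name `person.h` and the guard macro `PERSON_H` are hypothetical, and the `string` typedef is a local stand-in; a real CS50 project would more likely include cs50.h for it.

```c
// person.h (hypothetical file name)
// The #ifndef/#define/#endif lines are an include guard: they keep the
// contents from being processed twice if several files include this header.
#ifndef PERSON_H
#define PERSON_H

// Stand-in for the string type from cs50.h (an assumption for this sketch).
typedef char *string;

// The shared type: every .c file that includes this header sees the same
// definition of person.
typedef struct
{
    string name;
    string number;
}
person;

#endif
```

Each source file that needs the type would then start with `#include "person.h"`, using quotes rather than angle brackets because the file lives alongside your own code.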
Thiago, over to you.
AUDIENCE: I was-- I was thinking, this course really takes enough information to solve the sets, because I feel there is missing information. I am a freshman, and I was taking-- I was so concentrating, and I can't go on, go ahead on the sets. Is there anything that I'm missing?
DAVID MALAN: It's a really good question. And quite fair. We do move quite quickly, admittedly. So indeed, recall from week 0 the fire hose metaphor that I borrowed from MIT's water fountain. Indeed, that's very much the case. There's a lot of new syntax, a lot of new ideas all at once.
But when it comes to individual problems in the problem sets, do realize that you should take those step by step. And invariably, they tend to work from less complicated to more complicated. And throughout each of the lectures and each of the examples that we do, either live or via the examples that are premade on the course's website for your review, there's always little clues or hints or examples that you can then do. And certainly, by way of other resources like labs and the like, will you see additional building blocks as well.
So feel free to reach out more individually afterward. Happy to point you at some of those resources. In fact, most recently, too, will you notice on the course's website what we call "shorts," which are shorter videos made by another colleague of mine, CS50's own Doug Lloyd, which are literally short videos on very specific topics. So after today, you'll see short videos by Doug with a different perspective on linear search, on binary search, and on a number of other algorithms as well. Good question.
Sophia, back to you.
AUDIENCE: I was wondering, with the return values that we have for different error cases, would that be-- like, what's an example of what we would use that for? Is that for later if there are like several different cases and we want to somehow keep track of them?
DAVID MALAN: Exactly the latter. So right now, honestly, it's kind of stupid that we're even bothering to spend time returning 0 or returning 1. Like, we don't really need to do that, because we're not using the information. But what we're trying to do is lay the foundation for more complicated programs.
And indeed, this week and next week and beyond, as your own programs get a little longer, and as we, the course, start providing you with starter code or distribution code, that is, lines of code that the staff and I write that you then have to build upon, it's going to be a very useful mechanism to be able to signal that this went wrong or this other thing went wrong. So all we're doing is preparing for that inevitability, even if right now it doesn't really seem to be scratching an itch. Anthony?
AUDIENCE: I was just going to ask really quickly, obviously in this code we have "Brian" and your name, "David." And that's two people. So let's say we had 10 or 20 or even 30 people. I know it was a question in the chat, but I just wanted to clarify for myself, too.
DAVID MALAN: And the "what if" being what would change? Or, what was the end of that question?
AUDIENCE: Yeah. What would change in the code? Or what we do exactly to address that problem?
DAVID MALAN: Ah, OK. Good question. So if we were to have more names, like a third name or a tenth name or the like, the only things that we would have to change in this version of the program is first, on line 14, the size of the array. So if we're going to have 10 people, we need to decide in advance that we're going to have 10 people.
Better still, I could, for instance, allocate myself a constant up here. So let me actually go up here, just like we did in a previous class, where we did something like this-- const int NUMBER. And I'll just initialize this to 10. And recall that const means constant. That means this variable can't change. Int, of course means it's an integer. The fact that I've capitalized it is just a human convention to make it a little visually clear that this is a constant, just so you don't forget. But it has no functional role. And then this, of course, is just the value to assign to NUMBER.
Then I could go down here on line 16 and plug in that variable so that I don't have to hardcode what people would call a "magic number," which is just a number that appears seemingly out of nowhere. Now I've put all of my special numbers at the top of my file, or toward the top of my file, and now I'm using this variable here. And then what I could do-- and I alluded to this only verbally before-- I could absolutely start hardcoding in, for instance, Montague's name and number, and Rithvik's and Benedict's, and Cody's and others. But honestly, this seems kind of stupid if you're just hardcoding all of these names and numbers.
And in a few weeks, we'll see how you can actually store all of the same information in like a spreadsheet, or what's called a CSV file-- Comma Separated Values-- or even in a proper database, which the Facebooks and Googles of the world would use. But what I could do for now is something like this. For int i gets 0, i less than the number of people, i++. And maybe I could do something like this-- people bracket i dot name equals get_string, "What's the name" question mark.
And then here I could do people bracket i dot number equals get_string, "What's their number?" And I could ask that question, too. So now the program's getting to be a little better designed. I'm not arbitrarily hardcoding just me and Brian. Now it's dynamic. And technically, the phone book only supports 10 people at the moment, but I could make that dynamic, too. I could also call get_int. Or, like you did this past week, use a command line argument and parameterize the code so that it can actually be for 2 people, 10 people-- whatever you want, the program can dynamically adapt to it for you.
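The loop just described can be sketched without the CS50 library as follows. Everything beyond the loop itself is an assumption made so the sketch stands alone: `read_line` is a hypothetical stand-in for `get_string` built on `fgets`, `fill_phonebook` is a hypothetical name, and the input stream is passed as a parameter. With cs50.h available, the loop body would simply call `get_string` as in the lecture.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Stand-in for the string type from cs50.h (an assumption for this sketch).
typedef char *string;

typedef struct
{
    string name;
    string number;
}
person;

// The phone book's capacity lives in one place instead of being a magic number.
const int NUMBER = 10;

// Hypothetical stand-in for get_string: prints a prompt, reads one line
// from `in`, and returns a heap-allocated copy without the trailing newline.
// Returns NULL at end of input or if memory runs out. Lines longer than the
// buffer are split, a simplification accepted for this sketch.
string read_line(FILE *in, const char *prompt)
{
    char buffer[256];
    printf("%s", prompt);
    if (fgets(buffer, sizeof buffer, in) == NULL)
    {
        return NULL;
    }
    buffer[strcspn(buffer, "\n")] = '\0';
    string copy = malloc(strlen(buffer) + 1);
    if (copy != NULL)
    {
        strcpy(copy, buffer);
    }
    return copy;
}

// Asks for up to `count` names and numbers; returns how many people were
// filled in. The strings are never freed here, which is fine for a short
// program but would be a leak in a long-running one.
int fill_phonebook(person people[], int count, FILE *in)
{
    for (int i = 0; i < count; i++)
    {
        string name = read_line(in, "What's the name? ");
        if (name == NULL)
        {
            return i;
        }
        string number = read_line(in, "What's their number? ");
        if (number == NULL)
        {
            free(name);
            return i;
        }
        people[i].name = name;
        people[i].number = number;
    }
    return count;
}
```

In `main`, this would be used as `person people[NUMBER];` followed by `fill_phonebook(people, NUMBER, stdin);`.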
Other questions? On structs, on types, or the like? No?
All right. So how did we get here? Recall that we started with this problem of searching, whereby we just want to find someone in the doors. We just want to find someone in the array. We've sort of escalated things pretty quickly to finding not just numbers or names but now names with numbers in the form of these data structures.
But to do this efficiently really requires a smarter algorithm like binary search. Up until now, we've only used in C code linear search, even though, recall, that we did have at our disposal the pseudocode for binary search. But with binary search, we're going to need the data to be sorted.
And so if you want to get the speed benefits of searching more quickly by having sorted numbers, somehow someone is going to have to do that for us. Joe, for instance, sorted behind the curtain all of these numbers for us. But which algorithm he used is going to open up a whole can of worms as to how we can sort numbers efficiently. And indeed, if you're the Googles and the Facebooks and the Instagrams of the world, with millions, billions of pieces of data and users, you surely want to keep that data sorted, presumably, so that you can use algorithms like binary search to find information quickly when you're searching for friends or for content.
But let's go ahead and here take a five-minute break. And when we come back, we'll consider a few algorithms for sorting that are going to enable us to do everything we've just now discussed. See you in five.
All right. We are back. So to recap, we have a couple different algorithms for searching, linear search and binary search. Binary search is clearly the winner by all measures we've seen thus far. The catch is that the data needs to be sorted in advance in order to apply that algorithm.
So let's just give ourselves a working model for what it means to sort something. Well, as always, if you think of this as just another problem to be solved, it's got input and output, and the goal is to take that input and produce that output. Well, what's the input? It's going to be a whole bunch of unsorted values. And the goal, of course, is to get sorted values. So the interesting part of the process is going to be whatever there is in the middle.
But just to be even more concrete, if we think now in terms of this unsorted input as being an array of input-- because after all, that's perhaps the most useful mechanism we've seen thus far, to pass around a bunch of values at once using just one variable name-- we might have an array like this, 6 3 8 5 2 7 4 1, which seems to be, indeed, randomly ordered, that is unsorted. And we want to turn that into an equivalent array that's just 1 2 3 4 5 6 7 8. So eight numbers this time instead of seven.
But the goal this time is not to search them, per se, but to sort them. But before I get ahead of myself, could someone push back on this whole intellectual exercise we're about to do with sorting in the first place? Could someone make an argument as to why we might not want to bother using a sorted array, why we might not want to bother sorting the elements, and heck, let's just use linear search to find some element-- whether it's a number behind a door, a name in an array. Like, when might we want to just use linear search and not bother sorting?
Sophia, what do you think?
AUDIENCE: We could encounter errors in sorting, and that might cause errors, like, unpredictability in terms of, like, if we can find something. Versus linear search, we know we can find it.
DAVID MALAN: OK, quite fair. I will concede that implementing binary search, not in pseudocode, which we've already done, but in code is actually more difficult, because you have to deal with rounding, especially if you've got a weird number of doors, like an odd number of doors versus an even number of doors or an array of those lengths. Honestly, you've got to deal with these corner cases, like rounding down or rounding up, because any time you divide something by 2, you might get a fractional value or you might get a whole number.
So we've got to make some decisions. So it's totally solvable. And humans for decades have been writing code that implements binary search. It's totally possible. There's libraries you can use. But it's definitely more challenging, and you open yourselves up to risk.
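To make the rounding decision concrete, here is one possible C sketch of binary search over a sorted array of integers. It is an illustration under stated assumptions, not code from the lecture: the function name `binary_search` and its signature are hypothetical, and the choice made here is to let integer division round the midpoint down.

```c
#include <stdbool.h>

// Returns true if `target` appears among the n values, which must already
// be sorted in increasing order. The name and signature are hypothetical.
bool binary_search(int values[], int n, int target)
{
    int low = 0;
    int high = n - 1;

    while (low <= high)
    {
        // Integer division discards any fraction, so when the range has an
        // even number of elements this picks the left of the two middle ones.
        int middle = low + (high - low) / 2;

        if (values[middle] == target)
        {
            return true;
        }
        else if (values[middle] < target)
        {
            // Target can only be in the right half
            low = middle + 1;
        }
        else
        {
            // Target can only be in the left half
            high = middle - 1;
        }
    }

    // The range became empty without finding the target
    return false;
}
```

Rounding up instead would also work, provided the two half-ranges still exclude `middle` itself so that the range shrinks on every iteration.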
But let me stipulate that that's OK. I am good enough at this point in my progression where I'm pretty sure I can implement it correctly. So correctness is not my concern. What else might demotivate me from sorting an array of elements? And what might motivate me to, ah, just use linear search. It's so simple.
Can anyone propose why? Olivia, what do you think?
AUDIENCE: If the name of the game is efficiency, and you have a small enough data set, then you might as well just search it versus sort it, which would be an extra expense.
DAVID MALAN: Yeah, really well said. If you've got a relatively small data set, and your computer operates at a billion operations per second, for instance, my God, who cares if your code sucks and it's a little bit slow? Just do it the inefficient way. Why? Because it's going to take you maybe a few minutes to implement the simpler algorithm like linear search, even though it's going to take longer to run, whereas it might take you tens of minutes, maybe an hour or so, to not only write but debug something like a fancier algorithm, like binary search, at which point you might have spent more time writing the code, the faster code, than you would have just running the slower code.
And I can speak to this personally. Back in grad school, some of the research I was doing involved analysis of very large data sets. And I had to write code in order to analyze this data. And I could have spent hours, days, even, writing the best designed algorithm I could to analyze the data as efficiently as possible. Or, frankly, I could write the crappy version of the code, go to sleep for eight hours, and my code will just produce the output I want by morning.
And that is a very real-world, reasonable trade-off to make. And indeed, this is going to be thematic in the weeks that proceed in the course, where there's going to be this trade-off. And quite often, the trade-off is going to be time, or complexity, or the amount of space or memory that you're using. And part of the art of being a good computer scientist, and in turn programmer, is trying to decide where the line is. Do you exert more effort upfront to make a better, faster, more efficient algorithm, or do you maybe cut some corners there so that you can focus your most precious resource, human time, on other, more fundamentally challenging problems?
So we for the course's problem sets and labs will always prescribe what's most important. But in a few weeks' time, with one of our problem sets will you implement your very own spell checker. And among the goals of that spell checker are going to be to minimize the amount of time your code is taking to run, and also to minimize the amount of space or memory that your program is taking while running. And so we'll begin to appreciate those trade-offs ever more.
But indeed, it's the case-- and I really like Olivia's formulation of it-- if your data set is pretty small, it's probably not worth writing the fastest, best designed algorithm possible. Just write it the simple way, the correct way, and get the answer quickly, and move on. But that's not going to be the case for a lot of problems, dare I say, most problems in life.
If you're building Facebook or Instagram or Whatsapp, or any of today's most popular services that are getting thousands, millions of new pieces of data at a time, you can't just linearly search all of your friends or connections on LinkedIn efficiently. You can't just linearly search the billions of web pages that Google and Microsoft index in their search engines. You've got to be smarter about it.
And undoubtedly, the more successful your programs and your code are, or your websites, your apps, whatever the case may be, the more important design does come into play. So indeed, let's stipulate now that the goal is not to search these doors once; the goal is not to search these light bulbs once; the goal is not to search the phone book once, but rather again and again and again. And if that's going to be the case, then we probably should spend a little more time and a little more complexity upfront getting our code not only right but also efficient, so that we can benefit from that efficiency again and again and again, over time.
So how might we go about sorting some numbers? So in fact, let me see, to do this, if we can maybe get a hand from Brian in back. Brian, do you mind helping with sorting?
BRIAN: Yeah, absolutely. So I've got eight numbers here right now that all seem to be in unsorted order.
DAVID MALAN: Yeah. And Brian, could you go ahead, and could you sort these eight numbers for us?
BRIAN: Yeah, I'll put them in order. So we'll take these and-- um-- and all right. I think these are now in sorted order.
DAVID MALAN: Yeah, indeed. I agree. And now let's take some critique from the audience, some observations. Would someone mind explaining how Brian just sorted those eight numbers? What did Brian just do, step by step, in order to get to that end result?
The input was unsorted, the output now is sorted. So what did he do? Peter, what did you see happen?
AUDIENCE: He went through them step by step. And if they were in increasing order, he flipped them, and kept doing it until they were all in the correct [INAUDIBLE].
DAVID MALAN: Yeah. He kept step by step kind of looking for small values and moving them to the left, and looking for big values and moving them to the right, so effectively selecting numbers one at a time and putting each into its right place. So let's see this, maybe in more slow motion, if you will, Brian. And if you could be a little more pedantic and explain exactly what you're doing.
I see you've already reset the numbers to their original, unsorted order. Why don't we go ahead and start a little more methodically? And could you go ahead and for us, more slowly this time, select the smallest value. Because I do think, per Peter, it's going to need to end up at the far left.
BRIAN: Yeah, sure. So I'm looking at the numbers, and the 1 is the smallest. So I now have the smallest value.
DAVID MALAN: All right. So you did that really quickly. But I feel like you took the liberty of being a human who can kind of have this bird's eye view of everything all at once. But be a little more computer-like, if you could. And if these eight numbers are technically in an array, kind of like my seven doors out here, such that you can only look at one number at a time, can you be even more methodical and deliberate this time in telling us how you found the smallest number to put into place?
BRIAN: Sure. I guess, since the computer can only look at one number at a time, I would start at the left side of this array and work my way through the right, looking at each number one at a time. So I might start with the 6 and say, OK, this right now is the smallest number I've looked at so far. But then I look at the next number, and it's a 3, and that's smaller than a 6.
So now the 3, that's the smallest number I found so far. So I'll remember that and keep looking. The 8 is bigger than the 3, so I don't need to worry about that. The 5 is bigger than the 3. The 2 is smaller than the 3, so that now is the smallest number I've found so far. But I'm not done yet. So I'll keep looking.
The 7 is bigger than the 2, the 4 is bigger than the 2. But the 1 is smaller than the 2. So now I've made my way all the way to the end of the array. And 1, I can say, is the smallest number that I found.
DAVID MALAN: OK. So what I'm hearing is you're doing all of these comparisons, also similar to what Peter implied, and you keep checking, is this smaller, is this smaller, is this smaller, and you're keeping track of the currently smallest number you've seen?
BRIAN: Yeah, that sounds about right.
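Brian's left-to-right scan, remembering the smallest number seen so far, can be sketched in C as below. The function name `index_of_smallest` and its parameters are hypothetical; it returns a position rather than a value so that the number can later be moved.

```c
// Returns the index of the smallest value among values[start] through
// values[n - 1], looking at one element at a time from left to right.
// Assumes 0 <= start < n. The name is hypothetical.
int index_of_smallest(int values[], int start, int n)
{
    // The first number looked at is, so far, the smallest
    int smallest = start;

    for (int j = start + 1; j < n; j++)
    {
        // Found something smaller: remember where it is and keep looking
        if (values[j] < values[smallest])
        {
            smallest = j;
        }
    }
    return smallest;
}
```

For the array 6 3 8 5 2 7 4 1 with `start` equal to 0, the remembered position moves from the 6 to the 3 to the 2 and finally to the 1 at index 7.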
DAVID MALAN: All right. So you found it. And I think it belongs at the beginning. So how do we put this into place now?
BRIAN: Yeah, so I want to put it at the beginning. There's not really space for it. So I could make space for it, just by shifting these numbers over.
DAVID MALAN: OK. Wait, wait. But I feel like you're just-- now you're doubling the amount of work. I feel like-- don't do all that. That feels like you're going to do more steps than we need. What else could we do here?
BRIAN: OK. So the other option is, it needs to go in this spot, like this first spot in the array. So I could just put it there. But if I do that, I'm going to have to take the 6 which is there right now and pull the 6 out.
DAVID MALAN: All right, but I think that's--
BRIAN: So the 1 is in the right place, but the 6 isn't.
DAVID MALAN: Yeah, I agree. But I think that's OK, right? Because these numbers started randomly, and so the 6 is in the wrong place anyway. I don't think we're making the problem any worse by just moving it elsewhere. And indeed, it's a lot faster, I would think, to just swap two numbers, move one to the other and vice versa, than shift all of those numbers in between.
BRIAN: Yeah. So I took the 1 out of the position at the very end of the array, all the way on the right-hand side. So I guess I could take the 6 and just put it there, because that's where there's an open space to put the number.
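The swap Brian performs, pulling one number out, moving the other into the gap, and then filling the remaining open spot, corresponds to using a temporary variable. A minimal sketch, with a hypothetical function name and array-index parameters:

```c
// Exchanges the values at positions i and j of the array.
// The name and signature are hypothetical.
void swap(int values[], int i, int j)
{
    // Hold one value aside, like Brian holding a number in his hand
    int temporary = values[i];

    // Move the other value into the spot that is now free
    values[i] = values[j];

    // Put the held value into the spot that just opened up
    values[j] = temporary;
}
```

This takes three assignments no matter how far apart the two positions are, whereas shifting would move every number in between.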
DAVID MALAN: Yeah. And it's not exactly in the right space, but again, it's no worse off. So I like that. All right. But now, the fact that the 1 is in the right place-- and indeed, you've illuminated it to indicate as much-- I feel like we can pretty much ignore the 1 henceforth and now just select the next smallest element. So can you walk us through that?
BRIAN: Yeah, so I guess I'd repeat the same process. I'd start with the 3. That's the smallest number I've found so far. And I keep looking. The 8 is bigger than the 3, the 5 is bigger than the 3. The 2 is smaller than the 3. So I'll remember that 2. That's the smallest thing I've seen so far.
And then I just need to check to see if there's anything smaller than the 2. And I look at the 7, the 4, and the 6. None of those are smaller than the 2. So the 2, I can say is the next smallest number for the array.
DAVID MALAN: OK. And where would you put that then?
BRIAN: That needs to go in the second spot. So I need to pull the 3 out. And I guess I can take the 3 and just put it into this open spot, where there's available space.
DAVID MALAN: Yeah. And I feel like it's starting to become clear that we're inside some kind of loop, because you pretty much told the same story again but with a different number. Do you mind just continuing the algorithm to the end and select the next smallest, next smallest, next smallest, and get that sorted?
BRIAN: Sure. So we got the 8. 5 is smaller than that, 3 is smaller than that. And then the rest of the numbers, 7, 4, 6, those are all bigger. So the 3, that's going to go into sorted position here. And I'll take the 8 and swap it.
Now I'm going to look at the 5. 8 and 7 are both bigger. The 4 is smaller than the 5, but the 6 is bigger. So the 4, that's the smallest number I've seen so far. So the 4, that's going to go into place, and I'll swap it with the 5.
And now I've got the 8. The 7 is smaller than the 8, so I'll remember that. 5 is smaller than that. The 6 is bigger. So the 5, that's going to be the next number.
And now I'm left with 7. 8 is bigger, so 7 is still the smallest I've seen. But 6 is smaller, so 6 goes next.
And now I'm down to the last two. And between the last two, the 8 and the 7, the 7 is smaller. So the 7 is going to go in this spot. And at this point, I've only got one number left. So that number must be in sorted position.
And now I would say that this is a sorted array of numbers.
DAVID MALAN: Nice. So it definitely seems to be correct. It felt a little slow. But of course, the computer could do this much faster than we, using an actual array. And if you don't mind my making an observation, it looks like if we have eight numbers to begin with, or n more generally, it looks like you essentially did n minus 1 comparisons, because you kept comparing numbers again-- actually, did n comparisons. You looked at the first number, and then you compared it again and again and again at all of the other possible values in order to find the smallest element.
BRIAN: Yeah. Because for each of the numbers in the array, I had to do a comparison to see, is it smaller than the smallest thing that I've seen so far? And if it is smaller, then I needed to remember that.
DAVID MALAN: Yeah. So in each pass you considered every number, so a total of n numbers first. And so you found the number 1, you put it in its place, and that left you, to be clear, with n minus 1 numbers thereafter. And then after that, n minus 2 numbers, n minus 3 numbers, dot, dot, dot, all the way down to one final number.
So I think this is correct. And I think that's a pretty deliberate way of sorting these elements, a little more deliberately than your first approach, Brian, which I might describe as a little more organic. You kind of did it like-- more like a human, just kind of eyeballing things and moving things around. But if we were to translate this into code, recall that we have to be ever so precise.
And so let me consider altogether how exactly we might translate what Brian did ultimately to, again, pseudocode. So what he did is actually an algorithm that has a name. It's called selection sort. Why? Well, it's sorting the elements ultimately. And it's doing so by having Brian, or really the computer, select the smallest elements again and again and again. And once you found each such small element, you get the added benefit of just ignoring it.
Indeed, every time Brian lit up a number, he didn't need to keep comparing it, so the amount of work he was doing was decreasing each iteration-- n numbers, then n minus 1, then n minus 2, n minus 3, and so forth. And so we can think about the running time of this algorithm as being manifest in its actual pseudocode.
So how might we define the pseudocode? Well, let me propose that we think of it like this-- for i from 0 to n minus 1. Now, undoubtedly this is probably the most cryptic looking line of the three lines of pseudocode on the screen. But again, this is the kind of thing that should become rote memory over time, or just instincts with code.
We've seen in C how you can write a for loop. For loops typically, by convention, start counting at 0. But if you have n elements, you don't want to count up through n. You want to count up to n or equivalently up through n minus 1, so from 0 to n minus 1.
All right. Now what do I want to do on the next-- on the first iteration? Find the smallest item between the i-th item and the last item. So this is not quite obvious, I think, at first glance. But I do think it's a fair characterization of what Brian did. Because if i is initialized to 0, that was like Brian pointing his left hand at the first number on the very left of the shelf.
And what he then did was he found the smallest element between the i-th item, the first item 0, and the last item. So that's kind of a very fancy way of saying, Brian, find the smallest elements among all n elements. Then what he did was swapped the smallest item with the i-th item. So he just did that switcheroo, so as to not have to waste time shifting everything over. He instead just made room for it by swapping it with the value that was in its wrong place.
But now in the next iteration of this loop, consider how a for loop works. You do an i++ implicitly in pseudocode. That's what's happening here. So now i equals 1. Find the smallest item between the i-th item, item 1, 0-indexed, and the last item. So this is a fancy way of saying, Brian, check all of the n elements again, except for the first, because now you're starting at location 1 instead of location 0.
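The three lines of pseudocode translate fairly directly into C. The following is one possible sketch rather than the course's own implementation; the function name `selection_sort` is hypothetical, and the comments map each part back to the pseudocode.

```c
// Sorts the n values into increasing order using selection sort.
// The name and signature are hypothetical.
void selection_sort(int values[], int n)
{
    // For i from 0 to n - 1
    for (int i = 0; i < n; i++)
    {
        // Find the smallest item between the i-th item and the last item
        int smallest = i;
        for (int j = i + 1; j < n; j++)
        {
            if (values[j] < values[smallest])
            {
                smallest = j;
            }
        }

        // Swap the smallest item with the i-th item
        int temporary = values[i];
        values[i] = values[smallest];
        values[smallest] = temporary;
    }
}
```

On the final iteration only one item remains, so the inner loop does nothing and the swap exchanges that item with itself. Stopping the outer loop one step earlier would give the same result.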
And now the algorithm proceeds. So you could write this code in different ways in English like pseudocode, but this seems to be a reasonable formulation of exactly that algorithm. But let's see it a little more visually now, without all of the switching around of the humans moving around the numbers. Let me go ahead and use this visualization. And we'll put a link on the course's website if you'd like to play with this as well.
This is just someone's visualization of an array of numbers. But this time, rather than represent the numbers as symbols, decimal digits, now this person is using vertical bars, like a bar chart. And what this means is that a small bar is like a small number, and a big bar is a big number. So the goal here is to sort these bars, which equivalently might as well be numbers, from short bars over to tall bars, left to right.
And I'm going to go ahead. And along the top here, I can choose my sorting algorithm. And the one we just described, recall, was selection sort. So let me go ahead and do this. And notice, it takes a moment, I think, to wrap your mind around what's happening here. But notice that this pink line is going from left to right, because that's essentially what Brian was doing. He was walking back and forth, back and forth, back and forth through that shelf of numbers, looking for the next smallest number, and he kept putting the smallest number over on the left where it belongs.
And indeed, that's why in this visualization you see the small numbers beginning to be put into place on the left as we keep swooping through. But notice, the colored bar keeps starting later and later, more rightward and more rightward, just like Brian was not retracing his steps. As soon as he lit up the numbers, he left them alone. And voila, all of these numbers are now sorted.
So that's just a graphical way of thinking about the same algorithm. But how efficient or inefficient was that? Well, let's see if we can apply some numbers here. But there's also ways to do this a little more intuitively over time, which we'll do, too.
So if the first time through the shelf of numbers, he had eight numbers at his disposal-- he had to look at all eight numbers in order to decide which of these is the smallest. So that's n steps initially. The next time he did a pass through the shelf, he ignored the brightly lit number 1, because it was already in place by definition of what he had already done. So now he had n minus 1 steps to go.
Then he did another n minus 2 steps, then n minus 3, n minus 4, n minus 5, dot, dot, dot, all the way down to the final step, where he just had to find and leave alone the number 8, because that was the biggest number, so one single step. So this is some kind of series here, mathematically. You might recall something like this in, like, the back of your math book or in high school, or back of your physics textbook or the like.
It turns out that this actually sums up to this formula here-- n times n plus 1 divided by 2. And if that's not familiar, you don't remember that, no big deal. Just let me stipulate that the mathematical formula with which we began, where we had the series of n, plus n minus 1, plus n minus 2, plus n minus 3, dot, dot, dot, simply sums up ultimately to the more succinct n times n plus 1 divided by 2. This, of course, if we multiply it out, gives us n squared plus n divided by 2.
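Written out, the series and its closed form look like this, with Brian's shelf of eight numbers as a worked instance (the index k is introduced here only for the summation):

```latex
\begin{align*}
n + (n-1) + (n-2) + \dots + 1 &= \sum_{k=1}^{n} k = \frac{n(n+1)}{2} = \frac{n^2 + n}{2} \\
\text{for } n = 8:\quad 8 + 7 + 6 + \dots + 1 &= \frac{8 \cdot 9}{2} = 36
\end{align*}
```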
And this now, I will propose, gives us just this-- n squared divided by 2 plus n/2. So if we really wanted to be nit-picky, this is the total number of steps, or operations, or seconds, however we want to measure Brian's running time. This seems to be the precise mathematical formula therefore. But at the beginning of this week, we considered again, the sort of Big O notation. With a wave of the hand, we care more about the order of magnitude on which an algorithm operates. I really don't care about these divided by 2 and n/2.
Because which of these factors is going to matter as n gets big? The bigger the phone book gets, the more doors we have, the more light bulbs we have, the more numbers we have on the shelf, n is going to keep getting bigger and bigger and bigger. And given that, which is the dominant factor? Rongxin, if we could call on someone here, which of these factors, n squared divided by 2, or n divided by 2, really matters in the long run as our problems get bigger and bigger, as n gets bigger and bigger?
Which of those factors mathematically dominates? Anika?
AUDIENCE: Oh, it's Anika, but--
DAVID MALAN: Anika.
AUDIENCE: It would be the-- no problem. It would be the n squared.
DAVID MALAN: Yeah, n squared. Right. If you take any number for n and you square it, that's going to be bigger, certainly in the long run, than just doing n divided by 2. And so with our Big O notation, we could describe the running time of Brian's selection sort implementation as, ah, it's on the order of n squared. Yes, I'm ignoring some numbers, and yes, if we really wanted to be nit-picky and count up every single step that Brian took, yes, it's n squared divided by 2 plus n/2.
But again, if you think about the problem over time and n getting really large, sort of Facebook-sized, Twitter-sized, Google-sized, what's really going to dominate mathematically is this bigger factor here. That's what's going to make the total number of steps way bigger than just those smaller order terms. So in Big O notation, selection sort would seem to be on the order of n squared.
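One way to see why the n squared term dominates is to compare the two terms directly. Their ratio is n itself, so the gap keeps widening as n grows. The value n = 1,000,000 below is just an illustrative choice:

```latex
\begin{align*}
\frac{n^2/2}{n/2} &= n \\
\text{for } n = 10^6:\quad \frac{n^2}{2} &= 5 \times 10^{11}, \qquad \frac{n}{2} = 5 \times 10^{5}
\end{align*}
```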
So if we consider our chart from before where we had the upper bounds on our searching algorithms, both linear and binary, this one, unfortunately, is at really the tip top of this particular list of running times. And there's infinitely many more. These are just a subset of the more common formulas that a computer scientist might use and think about. Selection sort is kind of at the top of the list.
And being number one on this list is bad. n squared is certainly much slower than, say, big O of 1, which, of course, was constant time or one step. So I wonder if we could be-- if we could do a little better. I wonder if we could do a little better.
And Peter actually did say something else earlier, which was about like sharing two numbers and fixing problems. And if I can kind of run with that, let me propose that we, Brian, return to you for a look at an algorithm that might be called instead bubble sort, bubble sort being a different algorithm, this one that tries to fix problems more locally.
So in fact, Brian, if you look at the numbers that are in front of you, which you've kindly reset to their original, unsorted location, I feel like this really, if we focus on just pairs of numbers, it's just a lot of small numbers. Like last time, we tried to solve the big problem and sorting the whole thing. What if we just look at pairs of numbers that are adjacent to one another? Can we maybe make some little tweaks and change our algorithm fundamentally?
So for instance, Brian, 6 and 3, what observation can you make there for us?
BRIAN: Yeah, sure. So 6 and 3, that's the first pair of numbers in the array. And if I want the array to be sorted, I want the smaller numbers to be on the left and the bigger numbers to be on the right. So just looking at this pair, I can tell you that the 6 and 3 are out of order. The 3 should be on the left, and the 6 should be on the right.
DAVID MALAN: All right. So let's go ahead and do that, and go ahead and fix that by swapping those two. And just fix a small little problem. And now let's repeat this process, right? Loops seem to be omnipresent in a lot of our algorithms.
So 6 and 8 is the next such pair. What you want-- what do you think about those?
BRIAN: That particular pair seems OK, because the 6 is smaller and already on the left side. So I think I can leave this pair alone.
DAVID MALAN: All right. How about 8 and 5?
BRIAN: The 8 is bigger than the 5. So I'm going to swap these two. The 5 should be on the left of the 8.
DAVID MALAN: All right. And 8 and 2?
BRIAN: Same thing here, the 8 is bigger. So the 8 is going to be swapped with the 2.
DAVID MALAN: All right, 8 and 7.
BRIAN: The 8 is bigger than the 7, so the 8 I should switch with the 7.
DAVID MALAN: All right 8 and 4?
BRIAN: 8 and 4, same thing, it's bigger than the 4.
DAVID MALAN: And 8 and 1.
BRIAN: I can do it one last time. The 8 is bigger than the 1, and I think that's all.
DAVID MALAN: And with a nice dramatic flourish, if you step off to the side, voila-- not sorted. In fact, it doesn't really look all that much better. But I do think Brian's done something smart here. Brian, can you speak to at least some of the marginal improvements that you've made?
BRIAN: Yeah. So there are some improvements, at least. The 1 originally was all the way at the very end, and it moved back one spot. And the other improvement, I think, is that the 8 originally was way over here on the left side of the array somewhere. But because the 8 is the biggest number, I kept switching it over and over again until it made it all the way to the end.
And so now actually, I think this 8 is in the correct place. It's the biggest number, and it ended up moving its way all the way to the right side of the array.
DAVID MALAN: Yeah. And this is where this algorithm that we'll see the rest of in just a moment gets its name, bubble sort-- alludes to the fact that the biggest numbers start bubbling their way up to the top of, or the end of, the list, at the right-hand side of the shelf as Brian notes. But notice, as Brian does, too, the number 1 only moved over one position. So there's clearly more work to be done. And that's obvious from the other numbers being misordered as well.
But we have improved things. The 8 is in place, and the 1 is closer to being in place. So how might we proceed next? Well, Brian, let's continue to solve some small bite-sized problems. Let's start at the beginning again. 3 and 6?
BRIAN: Sure. The 3 and the 6, those seem to be in order, so I'll leave those alone.
DAVID MALAN: 6 and 5.
BRIAN: 6 and 5 are out of order, so I'll go ahead and take the 6 and put it to the right.
DAVID MALAN: 6 and 2.
BRIAN: Those are out of order as well, so I'll swap the 2 and the 6.
DAVID MALAN: 6 and 7.
BRIAN: 6 and 7 are OK. They're in order.
DAVID MALAN: 7 and 4.
BRIAN: Those are out of order, so I'll switch the 4 and the 7.
DAVID MALAN: 7 and 1.
BRIAN: And those two are out of order as well, so I'll swap those. And now I think the 7 has made its way to the sorted position as well.
DAVID MALAN: Indeed. So now we're making some progress. 7 has bubbled its way up to the top of the list, stopping just before the 8, whereas the 1 has continued its advance to its correct location. So I bet, Brian, if we keep doing this again and again and again, so long as the list remains in part unsorted, I think we'll probably get to the finish line. Do you want to take it from here and sort the rest?
BRIAN: Yeah, sure. So I just repeat the process again. The 3 and the 5 are OK. The 2 and the 5 are out of order, so I'll swap them. The 5 and the 6, those are fine as a pair. The 6 and the 4, out of order relative to each other, so I'll switch those. And the 6 and the 1, those are out of order as well, so I'll swap those. And now the 6, that I can say is in its correct position.
And I'll repeat it again. The 3 and the 2 are out of order, so those get switched. The 3 and the 5 are OK. The 5 and the 4 are out of order, so those get switched. And then the 5 and the 1 need to be switched as well. So there's the 5 in sorted position.
And now I'm left with these four. The 2 and the 3 are OK, the 3 and the 4 OK. But the 4 and the 1 are out of order. So those get switched, and now the 4, that's in its place. The 2 and the 3 are OK, but the 3 and the 1 are not, so I'll swap those. And now the 3 goes into its sorted place.
And then finally, the last pair to consider is just the 2 and the 1. Those are out of order, so I'll swap those, and now the 2 is in place. And 1 is the only remaining number, so I can say that that one's in place, too. And now I think we have a sorted array.
DAVID MALAN: Nice. So it felt like this was a fundamentally different approach, but we still got to the same end point. So that really now invites the question as to whether bubble sort was better or worse or maybe no different. But notice, too, that we've solved the same problem fundamentally differently. The first time, we took the more human natural intuition of just, find the smallest element. All right, do it again, do it again, do it again.
This time, we sort of viewed the problem through a different lens. And we thought about, it would seem, what does it mean for the list to be unsorted? As Peter noted, it's when things are out of order. Like that very basic primitive where something is out of order suggests an opportunity to solve the problem that way. Just fix all of the tiny bite-sized problems.
And it would seem that using a loop, if we repeat that intuition, is going to pay off eventually by fixing, fixing, fixing, fixing all of the little problems until the big one itself would seem to go away. Well, let me return to the visualization from before, re-randomize the bars-- short bar is small number, big bar is big number. And let me go ahead and run the bubble sort algorithm, this time with this visualization.
And you'll notice now sweeping from left to right are two colored bars that represent the comparison of two adjacent numbers again and again and again. And you'll see this time that the bars are being a little smart, and they're not going all the way to the end every time, just like Brian illuminated the numbers and stopped looking at the 8 and the 7 and the 6 once they were in place. But he and this visualization do indeed keep returning to the beginning, doing another pass, another pass, and another pass.
So if we think ahead to the analysis of this algorithm, it sort of invites us to consider, well, how many total comparisons are there this time? It would seem that the very first time through the bars, or equivalently the very first time through the shelf, Brian and this visualization did like n minus 1 comparisons. So n minus 1 comparisons from left to right, out of n elements you can compare n minus 1 adjacencies.
After that it was n minus 2, n minus 3, n minus 4, n minus 5, until just two or one remain, and at that point you're done. So even though this algorithm fundamentally took a different approach and achieved the same goal, it sorted the elements successfully. Let's consider how it was implemented in code and whether it's actually a little faster or a little slower.
And let's set one final bar, in fact, too. Earlier, we considered only the upper bound on selection sort, just so that we have something to compare this against. Let's also consider for a moment what the running time is of selection sort in terms of a lower bound-- best case scenario. With selection sort, if you have n elements, and you keep looking for the next smallest element, again and again and again, it turns out that selection sort is not really our friend.
Here's, for instance, the chart of where we left off in terms of omega notation before. Linear search and binary search could very well get lucky and take just one step if you happen to open a door and, voila, the number you're looking for is already there. But with selection sort, as we've implemented it, both with Brian and with the visualization, unfortunately it's not so good with the lower bound. Why?
Well, Brian pretty naively, every time he searched for a number, started at the left and went all the way to the right, started at the left, went all the way to the right. To be fair, he did ignore the numbers that were already in place. So he didn't keep looking at the 1, he didn't keep looking at the 2 once they were in place. But he did keep repeating himself again and again, touching those numbers multiple times each.
So again, even though you and I, the humans, could look at those numbers and be like, obviously there's the 1, obviously there's the 2, obviously there's the 3, Brian had to do it much more methodically. And in fact, even if that list of numbers were perfectly sorted, he would have wasted just as much time.
In fact, Brian, if you don't mind, could you quickly sort all eight numbers again? And Brian, if we start with a sorted list, this is kind of a nice perversion to consider, if you will, algorithmically. When analyzing an algorithm, sometimes you want to consider best cases and worst cases. And there would seem to be nothing better than, heck, the list is already sorted, you got lucky, there's really no work to be done. The worst case is the list is maybe completely backwards, and that's a huge amount of work to be done.
Unfortunately, selection sort doesn't really optimize for that lucky case where they're already sorted. So Brian, I see you've resorted the numbers for us from left to right. If we were to re-execute selection sort as before, how would you go about finding the smallest number?
BRIAN: So we decided earlier that, to find the smallest number, I need to look at all the numbers from left to right in the array and each time check to see if I found something smaller. So I would start with the 1. That's the smallest thing I've seen so far. But I would have to keep looking, because maybe there's a 0 or a negative number later on. I need to check to see if there's anything smaller.
So I would check, the 2 is bigger, the 3, 4, 5, 6, 7, 8. They're all bigger. So it turns out I was right all along. The 1 was the smallest number, and it's already in place. So now that number is in place.
DAVID MALAN: And then to find the next smallest number, what would you have done?
BRIAN: I would do the same thing. 2 is the smallest number I found so far. And then I would look through all the rest to see if there's anything smaller than the 2. And I would look at 3, 4, 5, 6, 7, 8. Nothing's smaller than the 2. So I go back to the 2 and say, OK, that number must now be in its sorted position.
DAVID MALAN: Indeed. And that story would be the same for the 3, for the 4, and for the 5. Like, nowhere in selection sort pseudocode or actual code is there any sort of intelligence of, eh, if the numbers are already sorted, quit. Like, there was no opportunity to short circuit and abort that algorithm earlier. Brian would literally be doing the same work, whether they're all sorted from the get-go or completely unsorted, and even backwards.
And so selection sort doesn't really perform very highly. So now we're hoping bubble sort, indeed, does. So toward that end, let's take a look at some proposed pseudocode for bubble sort, assuming that the input is anything. Whether sorted or unsorted, the pseudocode is always going to look like this. Repeat until sorted. For i from 0 to n minus 2-- now, what does this mean? 0 to n minus 1 goes from the first element to the last. So 0 to n minus 2 goes from the first element to the second to last.
Why am I doing that? We'll see in just a moment. The condition inside of this loop is, if the i-th and the i plus 1th elements are out of order, swap them. So this is me being a little clever. If you think about all of these numbers as being in an array or behind doors, if you iterate from 0 to n minus 2, that's like going from the first door to the second to last door.
But that's good, because my condition is checking door i and i plus 1. So if I start at the beginning here, and I only iterate up to this door, that's a good thing. Because when I compared door i and i plus 1, at the very end I'm going to compare door i and i plus 1. What I don't want to do is compare this door i against door i plus 1, which doesn't even exist.
And indeed, that's going to be an error that probably all of you make at some point-- going beyond the boundary of an array, touching memory that is going one or more spaces too far in the array, even though you didn't allocate memory for it. So this hedges against that possibility. So this would seem to be a pretty smart algorithm.
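To make the boundary concrete, here is a minimal sketch of a single left-to-right pass in C. The helper name `bubble_pass` and the idea of returning a swap count are assumptions for illustration, not something the lecture defines. Note that `i` stops at `n - 2`, so `numbers[i + 1]` never reaches past the last element.

```c
// One left-to-right pass of bubble sort over numbers[0..n-1].
// Returns how many swaps were made during the pass.
// Hypothetical helper: the name and return value are illustrative only.
int bubble_pass(int numbers[], int n)
{
    int swaps = 0;

    // i stops at n - 2 so that numbers[i + 1] always exists
    for (int i = 0; i <= n - 2; i++)
    {
        // If the i-th and (i + 1)-th elements are out of order, swap them
        if (numbers[i] > numbers[i + 1])
        {
            int tmp = numbers[i];
            numbers[i] = numbers[i + 1];
            numbers[i + 1] = tmp;
            swaps++;
        }
    }
    return swaps;
}
```

On Brian's shelf of 6, 3, 8, 5, 2, 7, 4, 1, one such pass makes six swaps and leaves 3, 6, 5, 2, 7, 4, 1, 8, with the 8 bubbled to the far right.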
But as written, it's not actually as performant as might be ideal. With bubble sort, suppose the list were entirely sorted. Brian, not to make you sort and resort numbers too many times. Do you mind giving us a sorted list one more time real quick? In a moment, I want to see, if we consider that same sorted list as before, this time with bubble sort, can we do fundamentally better? I have this code saying, repeat until sorted. So how might this change?
So Brian, you've got the sorted numbers again. This should be a good case. But selection sort did not benefit from this input, even though we could have gotten lucky. Bubble sort, what would your thought process be here?
BRIAN: So the thought process for bubble sort was to go through each of the pairs one at a time and see if I need to make a swap for that particular pair. So I'd look at the 1 and the 2. Those two are OK, I don't need to swap them.
The 2 and the 3 are OK. I don't need to make a swap there. The 3 and the 4 are OK. The 4 and the 5 are OK. Same with the 5 and the 6, and the 6 and the 7, and the 7 and the 8. So I made my way through the entire array, and I never needed to make any swap, because every pair that I looked at, they were already in the correct order relative to each other.
DAVID MALAN: Indeed. And so it would be foolish and so obvious this time if Brian literally retraced those steps and did it again with n minus 1 elements, and then did it again with n minus 2 elements. I mean, if he didn't do any work, any swaps the first pass, he's literally wasting his own time by even doing another pass or another pass. And so that's kind of implicit in the pseudocode, this repeat until sorted. Even though it doesn't translate perfectly into a for loop or a while loop in C, it kind of says intuitively what he should do-- repeat until sorted.
Brian has already identified the fact, by nature of him not having made any swaps, that this list is sorted. Therefore, he can just stop, and this loop does not have to continue again and again. We can map this to C-like code a little more explicitly. We can by default say, do the following n minus 1 times. Because among n elements, you can look at n minus 1 total pairs from left to right without going too far.
But notice, I can add an additional line of code here which might say, if no swaps, quit from the algorithm altogether. So, so long as Brian is keeping track of how many swaps he made or didn't make through one pass, as with a variable called counter or whatever, he can simply abort this algorithm early and certainly then save us some time. So with that said, let's consider for just a moment what the running time of bubble sort might be in terms of an upper bound, in the worst case, if you will.
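A sketch of that pseudocode in C might look like the following. It is one possible rendering under stated assumptions: the function name `bubble_sort`, the `swapped` flag (standing in for the counter mentioned above), and the returned pass count are all illustrative choices, the last one only there to make the early exit observable.

```c
#include <stdbool.h>

// Sort numbers[0..n-1] in ascending order using bubble sort.
// Returns the number of passes actually made over the array.
// Hypothetical helper: the name and return value are illustrative only.
int bubble_sort(int numbers[], int n)
{
    int passes = 0;

    // Repeat n - 1 times (at most)
    for (int pass = 0; pass < n - 1; pass++)
    {
        bool swapped = false;

        // For i from 0 to n - 2
        for (int i = 0; i <= n - 2; i++)
        {
            // If the i-th and (i + 1)-th elements are out of order, swap them
            if (numbers[i] > numbers[i + 1])
            {
                int tmp = numbers[i];
                numbers[i] = numbers[i + 1];
                numbers[i + 1] = tmp;
                swapped = true;
            }
        }
        passes++;

        // If no swaps, quit
        if (!swapped)
        {
            break;
        }
    }
    return passes;
}
```

On an already sorted array of eight numbers this makes a single pass and stops. On Brian's original shelf it needs all seven passes, because the 1 moves left only one position per pass.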
Well, in the case of bubble sort, notice with the pseudocode where we're doing something n minus 1 times, and inside of that we're doing something n minus 1 times. So again, repeat n minus 1 times literally says, do the following n minus 1 times. The for loop here, which is just a different way in pseudocode of expressing a similar idea but giving us a variable this time, for i from 0 to n minus 1-- n minus 2, is a total number of n minus 1 comparisons. So this is an n minus 1 thing inside the repeat, and an n minus 1 outside the repeat. So I think what that gives me is n minus 1 things times n minus 1 times.
So now if I just kind of FOIL this, sort of in high school or middle school math, n squared minus 1n minus 1n plus 1. We can combine like terms, n squared minus 2n plus 1. But per our discussion earlier, ugh, this is really getting into the weeds. Who cares about the 2n or the 1? The dominant factor as n gets large is definitely going to be the n squared.
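The multiplication described above, written out, with n = 8 as a worked instance:

```latex
\begin{align*}
(n-1)(n-1) &= n^2 - n - n + 1 = n^2 - 2n + 1 \\
\text{for } n = 8:\quad 7 \cdot 7 &= 64 - 16 + 1 = 49
\end{align*}
```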
So it would seem that bubble sort, if you actually do out the math and the formulas, is going to have an upper bound of n squared, or rather, on the order of n squared steps. So in that sense, it is equivalent to selection sort. It is no better fundamentally. It's what we would call asymptotically equivalent. That is, as n gets really large, this formula is, for all intents and purposes, equivalent to the selection sort formula, even though they differed slightly in terms of their lower order terms. For all intents and purposes, ah, they're on the order of n squared both.
But if we consider a lower bound, perhaps, even though bubble sort has the same upper bound running time, if we consider a lower bound, as with this smarter code, where Brian might actually have the wherewithal to notice, wait a minute, I didn't do any swaps, I'm just going to exit out of this looping pretty much early-- not even prematurely but early, because it would be fruitless to keep doing more and more work-- we can then whittle down this running time. I think-- not quite as good as omega of 1, which was constant time-- like, you cannot conclude definitively that an array is sorted unless you minimally look at all of the elements once. So constant time is completely naive and unrealistic.
You can't look at one element, or two or three, and say, yes, this is sorted. You've got to obviously look at all of the elements at least once. So this would seem to suggest that the omega notation for it, that is, the lower bound on bubble sort's running time, if we're clever and don't retrace our steps unnecessarily, is in omega of n.
Or technically, it's n minus 1 steps, right? Because if you've got n elements and you compare these two, these two, these two, these two, that's n minus 1 total comparisons. But who cares about the minus 1? It's on the order of n, or omega of n notation here.
So to recap, selection sort selects the next smallest element again and again and again. Unfortunately, based on how it's implemented in pseudocode and actual code, it's in Big O of n squared. But it's also in omega of n squared, which means it's always going to take the same amount of time asymptotically, that is, as n gets large. Unfortunately, too, bubble sort is no better, it would seem, in terms of the upper bound. It's going to take as many as n squared steps, too. But it's at least marginally better when it comes to using something like an input that's already sorted. It can short circuit and not waste time.
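The recap can be checked empirically by counting comparisons. The sketch below is an assumption-laden illustration: both function names are hypothetical, and it counts pairwise comparisons rather than items looked at, so selection sort on n items reports n(n-1)/2 instead of the n(n+1)/2 steps tallied earlier. The order of growth is the same either way.

```c
#include <stdbool.h>

// Selection sort that returns how many comparisons it made (illustrative).
long selection_comparisons(int numbers[], int n)
{
    long comparisons = 0;
    for (int i = 0; i < n - 1; i++)
    {
        int smallest = i;
        for (int j = i + 1; j < n; j++)
        {
            comparisons++;
            if (numbers[j] < numbers[smallest])
            {
                smallest = j;
            }
        }
        int tmp = numbers[i];
        numbers[i] = numbers[smallest];
        numbers[smallest] = tmp;
    }
    return comparisons;
}

// Bubble sort with early exit that returns how many comparisons it made.
long bubble_comparisons(int numbers[], int n)
{
    long comparisons = 0;
    for (int pass = 0; pass < n - 1; pass++)
    {
        bool swapped = false;
        for (int i = 0; i <= n - 2; i++)
        {
            comparisons++;
            if (numbers[i] > numbers[i + 1])
            {
                int tmp = numbers[i];
                numbers[i] = numbers[i + 1];
                numbers[i + 1] = tmp;
                swapped = true;
            }
        }
        if (!swapped)
        {
            break; // No swaps in this pass, so the array is sorted
        }
    }
    return comparisons;
}
```

With eight numbers, selection sort makes 28 comparisons whether the input is sorted or not. Bubble sort makes only 7 on sorted input, but 49 on input in reverse order.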
But honestly, n squared is bad. Like, n squared is really going to add up quickly. If you've got n squared and n is a million or n is a billion, I mean, my God, that's a lot of 0's. That's a lot of steps in the total running time of your algorithm. Can we do better? Can we do better?
And it turns out we can. And we'll consider one final algorithm today that does fundamentally better. Just like in week 0, we sort of latched onto binary search and again today-- it's just fundamentally better than linear search by an order of magnitude, so to speak. Its picture representation was fundamentally different. I think we can do fundamentally better than bubble sort and selection sort.
And so while both bubble sort and selection sort might be the sort of thing that I was using in grad school just to whip up the code quickly and then go to sleep, it's not going to work well for very large data sets. And frankly, it wouldn't have worked well if I didn't want to just sleep through the problem. Rather, we want to do things as efficiently as we can from the get go.
And let me propose that we leverage a technique-- and this is a technique that you can use in almost any programming language, C among them-- known as recursion. And recursion, quite simply, is the ability for a function to call itself. Up until now, we have not seen any examples of this. We've seen functions calling other functions. Main keeps calling printf. Main has started to call strlen. Main called strcmp, compare, earlier today.
But we've never seen main call main. And people don't do that, so that's not going to solve the problem. But we can implement our own functions and have our own functions call themselves. Now, this would seem to be a bad idea in principle. If a function calls itself, my God, where does it end? It would seem to just do something forever, and then something bad probably happens.
And it could. And that's the danger of using recursion. You can screw it up easily. But it's also a very powerful technique, because it allows us to think about potential solutions to problems in a very interesting, and daresay elegant, way. So we're not only going to be able to achieve correctness but also better design, because of better efficiency, it would seem, here.
So let me propose this. Recall this code from week 0, which was the pseudocode for finding someone in a phone book. And recall that, among the features of this pseudocode, were these lines here, "Go back to line 3." And we describe those in week 0 as being representative of loops, a programming construct that has something happen again and again.
But you know what, there's a missed opportunity here in this pseudocode to use a technique known as recursion. This implementation is what we would call iterative. It is purely loop based. It tells me literally, go back to this line, go back to this line, go back to this line. There's no calling yourself.
But what if I changed week 0's pseudocode to be a little more like this? Let me go ahead and get rid of, not just that one line but two lines in both of those conditions. And let me quite simply say, instead of open to the middle of the left half of the book and then go back to line 3, or open to the middle of the right half of the book and then go back to line 3, why don't I just more elegantly say, search left half of book, search right half of book?
Now, immediately I can shorten the code a little bit. But I claim that by just saying search left half of book and search right half of book, I claim that this is enough information to implement the very same algorithm. But it's not using a loop per se. It's going to induce me the human or me the computer to do something again and again. But there's other ways to do things again and again-- not by way of a for loop, or a while loop, or a do while loop, or a repeat block, or a forever block-- you can actually use recursion.
And recursion, again, is this technique where a function can call itself. And if we consider, after all, the pseudocode we are looking at is the pseudocode for searching. And on line 7 and 9 now, I am literally saying, "Search left half of book," and "Search right half of book," this is already, even in pseudocode form, an example of recursion. Here I have in 11 lines of code an algorithm or a function that searches a phone book.
In lines 7 and 9, I have lines of code that literally say, search a phone book, but more specifically, search half of the phone book. And that's where recursion really works its magic. It would be foolish and incorrect and completely counterproductive to just have a function call itself with the same input, with the same input, with the same input, because you'd have to be kind of crazy to expect different output if the input is constantly the same.
But that's not what we did in week 0, and that's not what we're doing now. If you use the same function, or equivalently algorithm, but change the input to be smaller and smaller and smaller, it's probably OK that a function is calling itself, so long as you have at least one line of code in there that very intelligently says, if you're out of doors, if you're out of phone book pages, quit.
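The same shape carries over to code. As a hedged sketch, here is a recursive binary search over a sorted array of ints instead of a phone book. The function name `search` and its `low`/`high` parameters are assumptions for illustration. Each recursive call receives a strictly smaller range, and the check at the top is the line that says quit when there is nothing left.

```c
#include <stdbool.h>

// Recursively search the sorted range numbers[low..high] for target.
// Hypothetical helper: the name and parameters are illustrative only.
bool search(const int numbers[], int low, int high, int target)
{
    // Nothing left to search: quit
    if (low > high)
    {
        return false;
    }

    int middle = low + (high - low) / 2;

    if (numbers[middle] == target)
    {
        return true;
    }
    else if (target < numbers[middle])
    {
        // Search left half
        return search(numbers, low, middle - 1, target);
    }
    else
    {
        // Search right half
        return search(numbers, middle + 1, high, target);
    }
}
```

For an array of eight sorted numbers, the initial call would be `search(numbers, 0, 7, target)`.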
You need to have a so-called base case. You need some line of code that's going to notice, wait a minute, there's no more problem to be solved, quit now. And so how can we map this to actual code? Well, let's consider something very familiar from week 1. Recall when you reconstructed one of Mario's pyramids. It looked a little something like this.
And let's consider that this is a pyramid of blocks, of bricks, that's of height 4. Why 4? Well, there's 1, then 2, then 3, then 4 bricks from top to bottom. So the total height here is 4. But let me ask the question, a little naively, how do you go about creating, or how do you go about printing a pyramid of height 4?
Well, it turns out that this simple Mario pyramid, that's ever more clear if we get rid of the unnecessary background, is a recursive structure of some sort. It's a recursive physical structure. Why? Well, notice that this structure, this brick, this pyramid, is kind of defined in terms of itself.
Why? Well, how do you make a pyramid of height 4? I would argue, a little obnoxiously, a little circularly, well, you create a pyramid of height 3, and then you add an additional row of bricks. All right. Well, let's continue that logic. All right, fine. How do you build a pyramid of height 3?
Well, you sort of smile and say, well, you build a pyramid of height 2, and then you add one more layer. All right, fine. How do you build a pyramid of height 2? Well, you build a pyramid of height 1, and then you add one more layer.
Well, how do you build a pyramid of height 1? Well, you just put the stupid brick down. You have a base case, where you sort of state the obvious and just do something once. You hardcode the logic. But notice what's kind of mind bending, or kind of obnoxious in a human interaction, like, you're just defining the answer in terms of itself. I keep saying the same thing. But that's OK, because the pyramid keeps getting smaller and smaller and smaller until I can handle that one special case.
And so we can do this just for fun with these little cardboard bricks here, for instance. If I want to build a pyramid of height 4, how do I do it? Well, I can build a pyramid of height 3. All right, let me go ahead and build a pyramid of height 3. How do I build a pyramid of height 3? All right, well, I build a pyramid of height 2, and then I add to it.
OK, how do I build a pyramid of height 2? Well, you build a pyramid of height 1. How do I do that? Well, you just put the brick down.
And so here's where things kind of bottom out, and it's no longer a cyclical argument. You eventually just do some actual work. But in my mind, I have to remember all of the instructions you just gave me, or I gave myself. I had to build a pyramid of height 4; nope, 3; nope, 2; nope, 1. Now I'm actually doing that. So here's a pyramid of height 1.
How do I now build a pyramid of height 2? Well, rewind in the story. To build a pyramid of height 2, you build a pyramid of height 1, and then you add one more layer. So I think to add one more layer, I essentially need to do this. All right. Now I have a pyramid of height 2.
But wait a minute. The story began with, how do I build a pyramid of height 3? Well, you take a pyramid of height 2, which I have here, and you add an additional layer. So I've got to build this additional layer. I'm going to go ahead and give myself the layer, the layer, the layer. And then I'm going to put the original pyramid of height 2 on top of it. And voila, it's a pyramid of height 3 now.
Well, how did I get here? Well, let me keep rewinding in the story. The very first question I asked myself was, how do you build a pyramid of height 4? Well, the answer was build a pyramid of height 3. Great, that's done. Then add one additional layer.
And if I had more hands, I could do this a little more elegantly, but let me go ahead and just lay this out. Here's the new level of height 3. And now I'm going to go-- of width 4. Now I'm going to go and put the pyramid of height 3 on top of it, until voila, I have this form here of Mario's pyramid.
So it's a bit cyclical in that, every time I asked myself to build a pyramid of a certain height, I kind of punted and said, no, build a pyramid of this height. No, build a pyramid of this height. No, build a pyramid of this height. But the magic of that algorithm was that there was constantly this, do a little more work, build a layer, do a little more work, build a layer. And it's in that implicit building of layer after layer after layer that the pyramid itself, the end goal, actually emerges.
So you could implement the same thing with a for loop or a while loop. And frankly, you did. It was a slightly different shape for problem set 1, but you did the same thing using a loop. And you kind of had to do it that way, at least as we prescribed it. Because with printf, you have to print from the top of the screen to the bottom. Like, we haven't shown you a technique yet to print a layer and then go back on top.
So I'm kind of taking some real-world liberties here by lifting these things up and moving them around. You'd have to be a little more clever in code. But the idea is the same. And so even physical objects like this can have some recursive definition to them.
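One way the pyramid's recursive definition might look in C is sketched below. This is an illustration under stated assumptions, not necessarily the code from the lecture: the function name `draw` and its return value (the number of bricks printed, added only so the result is easy to check) are assumptions, and the base case is height 0 rather than height 1 so that a non-positive height also terminates. Because the recursive call happens before the row is printed, the smaller pyramid appears above the new row, which respects printf's top-to-bottom order.

```c
#include <stdio.h>

// Prints a left-aligned pyramid of height n, one '#' per brick,
// and returns the total number of bricks printed.
int draw(int n)
{
    // Base case: a pyramid of height 0 has nothing to print.
    if (n <= 0)
    {
        return 0;
    }

    // Recursive case: first print the pyramid of height n - 1 ...
    int bricks = draw(n - 1);

    // ... then add one more row of n bricks beneath it.
    for (int i = 0; i < n; i++)
    {
        printf("#");
    }
    printf("\n");

    return bricks + n;
}
```

Calling `draw(4)` prints rows of 1, 2, 3, and 4 bricks from top to bottom, 10 bricks in all.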
And so we present this sort of goofy example, because this notion of recursion is a fundamental programming technique that you can leverage now to solve problems in a fundamentally different way. And I think for this, we need one final visualization of merge sort, with both Brian's help and the computer's. And merge sort is going to be an algorithm whose pseudocode is, daresay, the simplest we've seen thus far, but deceptively simple.
The pseudocode for merge sort, quite simply, is this-- sort the left half of numbers, sort the right half of numbers, merge the sorted halves. And notice, even at first glance this feels kind of unfair. Like, here's an algorithm for sorting, and yet I'm literally using the word "sort" in my algorithm for sorting. It's like in English if you're asked to define a word, and you literally use the word in the definition. Like, that rarely flies, because you're just making a circular argument.
But in code, it's OK, so long as there's one special step that's doing something a little differently, and so long as the problem keeps getting smaller and smaller. And indeed it is. This pseudocode is not saying, sort the numbers, sort the numbers, sort the numbers. No, it's dividing the problem in half and then solving the other half as well. So it's shrinking the problem on each iteration.
Now, I will disclaim we're going to need that so-called base case again. I'm going to have to do something stupid, but necessary, and say, if there's only one number, quit. It's sorted. That's the so-called base case. The recursive case is where the function calls itself.
But this is, indeed, our third and final sorting algorithm called merge sort. And we'll focus here really on the juiciest pieces, one, this notion of merging. So in fact, Brian, can we come over to you just so we can define, before we look at the merge sort algorithm itself, what do we even mean when we say merge sorted halves?
So for instance, Brian has on his shelf here two arrays of size 4. In the first array on the left are four integers, 3, 5, 6, 8. And in the right side, in another array of size 4, are four numbers, too, 1, 2, 4, 7. Both the left is sorted and the right is sorted. But now, Brian, I would like you to merge these sorted halves. Tell us what that means.
BRIAN: Sure. So if I have a left half that's sorted from smallest to largest and a right half that's also sorted from smallest to largest, I want to merge them into a new list that has all of the same numbers also from smallest to largest. And I guess where I could start here is that the smallest number of the combined array needs to begin with either the smallest number of the left half or the smallest number of the right half. So on the left the smallest number is the 3, and on the right the smallest number is the 1. One of those two has got to be the smallest number for the entire array.
And between the 3 and the 1, the 1 is smaller. So I would take that 1, and that's going to be the first number, the smallest number, of the merged two halves. And then I guess I would repeat the process again. On the left side the smallest number is the 3. On the right side the smallest number is the 2. And between the 3 and the 2, 2 is smaller. So I would take the 2 [INAUDIBLE] and that's going to be the next number.
So I'm slowly building up this sorted array that is the result of combining these two. Now I'm comparing the 3 on the left to the 4 on the right. Between the 3 and the 4, the 3 is smaller. So I'll take the 3, and we'll put that one into position.
Now I'm comparing the 5 on the left with the 4 on the right. Between the 5 and the 4, the 4 is smaller. So that one goes into position. And then now I'm comparing the 5 on the left with the 7 on the right. 5 is smaller, so the 5 goes next. Next I'm comparing the 6 on the left with the 7 on the right. The 6 is still smaller, so that one is going to go next.
Now I'm comparing the 8 and the 7, the only two numbers left. The 7 is the smaller between the two. So I'll take the 7 and put that into place. And now I'm only left with one number that hasn't been put into the merging of the two halves, and that's the number 8. So that number is going to take up the final position.
And now I've taken these two halves, each of which was originally sorted, and made one complete array that has all of those numbers in sorted order.
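The merging steps Brian just walked through can be sketched in C roughly as follows. The function name `merge`, its parameter names, and the choice to write the result into a separate `out` array are assumptions made for illustration.

```c
// Merges the sorted arrays left[0..n_left) and right[0..n_right) into out,
// which must have room for n_left + n_right ints.
void merge(const int left[], int n_left, const int right[], int n_right, int out[])
{
    int i = 0; // next unused number in left
    int j = 0; // next unused number in right
    int k = 0; // next free position in out

    // While both halves still have numbers, take the smaller front number.
    while (i < n_left && j < n_right)
    {
        if (left[i] <= right[j])
        {
            out[k++] = left[i++];
        }
        else
        {
            out[k++] = right[j++];
        }
    }

    // One half is now used up; copy whatever remains of the other.
    while (i < n_left)
    {
        out[k++] = left[i++];
    }
    while (j < n_right)
    {
        out[k++] = right[j++];
    }
}
```

With `left` holding 3, 5, 6, 8 and `right` holding 1, 2, 4, 7, the function fills `out` with 1 through 8 in order. It only ever looks at the front of each half, so each of the n numbers is placed exactly once.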
DAVID MALAN: Indeed. And consider what we've done. We've essentially verbally and physically kind of defined a helper function, our own custom function if you will, whereby Brian has defined what does it mean to merge two arrays-- specifically merge two sorted arrays. Because why? Well, that's a building block that I think we're going to want in this merge sort algorithm. So just like in actual C code, you might have defined a function that does some small task, so have we now verbally and physically defined the notion of merging.
The mind bending part here is that "Sort left half of numbers" and "Sort right half of numbers" is kind of already implemented. There's nothing more for Brian or me to define. All that remains is for us to execute this algorithm, focusing especially on these three highlighted lines of code.
And let me disclaim that of the algorithms we've looked at thus far, odds are this will be the one that doesn't really sink in as quickly as the others. Even if the others might have taken you a moment, a day, a week to settle in-- or maybe you're still not quite there yet, that's fine-- merge sort is a bit of a mind bending one, because it seems to work magically. But it really just works more intelligently. And you'll begin to get more comfortable with harnessing these kinds of primitives so that we can ultimately, indeed, solve problems more efficiently.
So Brian has kindly put the numbers again on the top shelf. And he has put them into their original, unsorted order, just like for selection sort and bubble sort. And Brian, I'd like to propose now that we execute this merge sort algorithm. And if you don't mind, I'll recite aloud first the few steps.
So here is one array of size 8 with unsorted numbers. The goal is to sort these numbers using merge sort. And recall that merge sort essentially is just three steps-- sort left half, sort right half, merge sorted halves. So Brian, looking at those numbers there, could you go ahead and sort the left half of numbers?
BRIAN: All right. So there are eight numbers. The left half would be these four numbers, so I will sort those. Except I'm not really sure how do I now sort these four numbers.
DAVID MALAN: Yeah. So granted, we've seen selection sort, we've seen bubble sort. But we don't want to regress to those older, slower algorithms. Brian, I can kind of be a little clever here. Well, I'm giving you a sorting algorithm. So now you effectively have a smaller problem, an array of size 4, and I'm pretty sure we can use the same algorithm, merge sort, by sorting left half, sorting right half, and then merging the sorted halves.
So could you go ahead and sort the left half of these four numbers?
BRIAN: All right. So I have these four numbers. I want to sort the left half. That's these two numbers. So now I need to figure out how to sort two numbers.
DAVID MALAN: All right. Now, us with human intuition might obviously know what we have to do here. But again, let's apply the algorithm-- sort left half, sort right half, merge sorted halves. Brian, could you sort the left half of this array of size 2?
BRIAN: So I've got the array of two, so I'll first sort the left half of the array of two, which is the 6.
DAVID MALAN: And this is where the base case in white on the slide comes into play-- if only one number, quit. So Brian, I can let you off the hook. That list of size one with the number 6 is sorted. So that's step one of three done. Brian, could you sort the right half of that array of size two?
BRIAN: The right half is the number 3. It's also just one number, so that one is done.
DAVID MALAN: Good. So think about where we are in the story. We've sorted the left half, and we've sorted the right half, even though it looks like neither Brian nor I have done any useful work yet. But now the magic happens.
Brian, you now have two arrays of size 1. Could you merge them together?
BRIAN: All right. So I'm going to merge these two together. Between the 6 and the 3, the 3 is smaller. So that one I'll put there first. And then I'll take the 6, and that one goes next. And now I have a sorted array of size 2 that is now done.
DAVID MALAN: All right. And this is where you now need to start remembering step by step sort of in your brain as the things pile up. How did we get to this point? We started with a list of size 8. We then looked at the left half, which was an array of size 4. We then looked at the left half of that, which was an array of size 2, then two arrays of size 1, then we merged those two sorted halves.
So I think now if I rewind in that story, Brian, you need to sort the right half of the left half of the original numbers.
BRIAN: All right. So the left half is these four. The right half of the left half is going to be these two numbers. And so now to those two, I guess I would repeat the process again-- look at the numbers individually. I would look at the left half of these two, which is the 8. That one is done. And the 5, that one is done as well.
DAVID MALAN: All right. So step three of three, then, is merge those two sorted halves.
BRIAN: All right. So between the 8 and the 5, the 5 is smaller, so that one will go in first. And the 8 will go after that. And now I have a second array of size 2 that is also now sorted.
DAVID MALAN: Indeed. So here's where, again, you have to rewind in your mind's eye. We've just now sorted the left half of the left half, and we've sorted the right half of the left half. So I think the third and final step at this part of the story is, Brian, to merge those sorted halves, each of which now is of size 2.
BRIAN: All right. I have two arrays of size 2, each of which is sorted, that I need to merge. So I'm going to compare the smallest numbers from each. I'm going to compare the 3 and the 5. The 3 is smaller, so that one will go in first. Now between these two arrays, I have a 6 and a 5 to compare. The 5 is smaller, so that one will go next.
Between the 6 and the 8, the 6 is smaller. And I'm left with just the 8. So if we go back to the original story of eight numbers that I was sorting, I think I have now sorted the left half, the left four numbers, from that original array.
DAVID MALAN: Indeed. So if you're playing along at home, think about-- you've got all these thoughts probably kind of piling up in your mind. That's indeed supposed to be the case. And admittedly, it's hard to keep track of all of that. So we'll let Brian now execute this all together, doing the same thing now, by sorting the right half all the way to completion. Brian, if you could.
BRIAN: All right. So the right half, you got four numbers. I'm going to start by sorting the left half of the right half, which is these two numbers here. To do that, I'll repeat the same process-- sort the left half of these two numbers, which is just the 2. That one's done, it's only one number. Same thing with the right half. The 7 is only one number, so it's done.
And now I'll merge the sorted halves together. Between the 2 and the 7, the 2 is smaller and then the 7. So here now is the left half of the right half, an array of size 2, that is sorted. And I'll do the same thing with the right half of the right half, starting with the left half, which is 4. That's done. The 1 is done.
And now to merge these two together, I'll compare them and say the 1 is smaller. So I'll put the 1 down and then the 4. So now I have two sorted arrays, each of size 2, that I now need to backtrack and now merge together to form an array of size 4.
So I'll compare the 2 and the 1. Between those two, the 1 is smaller. Then I'll compare the 2 with the 4. The 2 is smaller. Then I'll compare the 7 with the 4. The 4 is smaller.
And then finally, I'll just take the 7, the last number, and put that in the final spot. And so now from the original array of eight numbers, I've now sorted the left half, and I've sorted the right half.
DAVID MALAN: And now that brings us to our third and very final step. Could you, Brian, merge the sorted halves?
BRIAN: Yeah. And I think this is actually an example we've seen already. And what I'm going to do in order to merge these two halves is just take the smaller number from each half and compare them again and again. So between the 3 and the 1, the 1, that's the smallest number. So that goes into place. Then between the 3 and the 2, the 2 is smaller, so we'll take that and put that into place.
Now I'm comparing the 3 with the 4. The 3, that goes next. Next I'm comparing the 5 with the 4. 4 is smaller, so the 4 goes into place next. Now I'm comparing the 5 with the 7. 5 is smaller, so that one goes into place. And next, comparing the 6 with the 7, so the 6 is smaller. That goes next.
And now I'm left with two numbers, the 8 and the 7. The 7 is the smaller of the 2, so that one goes next. And at this point, I only have one number left, which is the 8. And so that one's going to go into its sorted position at the end of the array.
DAVID MALAN: Indeed. So even though it felt like we weren't really doing anything at several points in that story, it all sort of came together when we started merging and merging and merging these lists. And it's not an accident that Brian was using multiple shelves, moving the numbers from top to bottom, to make clear just how many times he was effectively dividing that list up.
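Putting the three steps and the base case together, a merge sort in C might look like the sketch below. It is one possible arrangement, not the course's reference code: the names `merge_sort` and `sort_range`, the half-open index ranges, and the single scratch array allocated with `malloc` are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Sorts numbers[lo..hi) in place, using tmp as scratch space.
// tmp must be at least as long as numbers.
static void sort_range(int numbers[], int tmp[], int lo, int hi)
{
    // Base case: zero or one number is already sorted.
    if (hi - lo < 2)
    {
        return;
    }

    int mid = lo + (hi - lo) / 2;
    sort_range(numbers, tmp, lo, mid); // sort left half
    sort_range(numbers, tmp, mid, hi); // sort right half

    // Merge the sorted halves into tmp by repeatedly taking the smaller front number.
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
    {
        tmp[k++] = (numbers[i] <= numbers[j]) ? numbers[i++] : numbers[j++];
    }
    while (i < mid)
    {
        tmp[k++] = numbers[i++];
    }
    while (j < hi)
    {
        tmp[k++] = numbers[j++];
    }

    // Copy the merged run back into the original array.
    memcpy(numbers + lo, tmp + lo, (size_t) (hi - lo) * sizeof(int));
}

// Hypothetical entry point: sorts numbers[0..n) in ascending order.
// Returns false if the scratch array could not be allocated.
bool merge_sort(int numbers[], int n)
{
    if (n < 2)
    {
        return true;
    }

    int *tmp = malloc((size_t) n * sizeof(int));
    if (tmp == NULL)
    {
        return false;
    }

    sort_range(numbers, tmp, 0, n);
    free(tmp);
    return true;
}
```

Run on the eight numbers from the shelf in their original order (6, 3, 8, 5, 2, 7, 4, 1), the recursive calls happen in the same order as Brian's steps: the left half is fully sorted before the right half is touched, and the final merge comes last.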
We started with a list of eight, and we essentially took it to two lists of size 4, four lists of size 2, eight lists of size 1. And while it wasn't exactly in that order, if you rewind and analyze all of the steps, that's indeed what he did. He went from 8 to two 4's to four 2's to eight 1's. And that's why he moved those numbers from the top shelf down three times-- from 8's, to 4's, to 2's, to 1's.
So how many times did he move the numbers? He moved them three times total. And on each of those shelves, how many numbers did he have to merge together? On each of those shelves, he ultimately touched all eight numbers. He first inserted the smallest number, then the second smallest, then the third smallest.
But unlike selection sort, he had smartly already sorted those halves, so he was just plucking them off one at a time. He wasn't going back and forth, back and forth. He was constantly taking from the beginning of each of those half lists.
So on every shelf, he was doing, let's say, n steps, because he was merging in all n elements of that shelf. But how many times did he merge n elements together? Well, he did that three total times.
But if you think about binary search, and really the process of divide and conquer more generally, anytime you divide something in half and half and half, as he was doing from 8's to 4's to 2's to 1's, that's a logarithm. That's log base 2. And indeed, that is wonderfully the height of this shelf. If you have eight elements on the shelf, the number of additional shelves Brian used, 3, is exactly what you get by doing the math log base 2 of 8. Which is to say, Brian did n things log n times.
And again with a wave of the hand, computer scientists don't bother mentioning the base with Big O notation. It suffices just to say log n-- Brian did n things log n times. And so if we consider, then, the asymptotic complexity of this algorithm, that is to say the running time of this algorithm, in terms of big O notation, notice that it performs strictly better than selection sort and bubble sort-- n times log n.
And even, again, if you're a little rusty on logarithms, log n, we have seen as of week 0 in binary search, is definitely faster than n steps. So n squared is n times n. n log n is n times log n, which is indeed mathematically better than n squared. As for merge sort, though, if we consider the lower bound, notice that bubble sort, yes, got us as low as omega of n. Turns out merge sort is a little bit like selection sort in that it doesn't optimize itself and get you out of the algorithm early. It's always n log n, so its lower bound is omega of n log n.
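To make the arithmetic concrete, here is a small sketch in C that counts the halvings and compares the rough step counts. The function names `halvings`, `merge_sort_steps`, and `quadratic_steps` are made up for illustration, and the counts are back-of-the-envelope estimates rather than exact operation counts.

```c
// Counts how many times n can be halved (rounding down) before reaching 1.
// For a power of 2, this is exactly log base 2 of n.
int halvings(int n)
{
    int count = 0;
    while (n > 1)
    {
        n /= 2;
        count++;
    }
    return count;
}

// Rough estimate for merge sort: all n numbers are merged once
// on each of halvings(n) shelves.
long long merge_sort_steps(int n)
{
    return (long long) n * halvings(n);
}

// Rough estimate for an n squared algorithm such as selection sort.
long long quadratic_steps(int n)
{
    return (long long) n * n;
}
```

For n = 8, `halvings` gives 3, so the estimates are 24 steps versus 64. For n = 1,000,000 the gap is far wider: roughly 19 million steps versus a trillion.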
And that might not be acceptable. Sometimes you might have certain data inputs where maybe it tends to be sorted and you don't want to waste time. So maybe you'd be OK with bubble sort. But honestly, as n gets large, the probability that the input to your sorting algorithm is just by chance going to be sorted is probably so, so low that you're just better off in the general case using an algorithm like merge sort that's n log n always.
We can see this visually using our bars, too. And notice, just as Brian was dividing and conquering the problem in half and half and half, and then reconstituting the array by merging those halves, you can kind of see that visually here. There's a lot more going on. And it's going to seem in a moment that everything just kind of magically worked.
But you can see in the faded purple bars that, indeed, this is sorting things in halves and then merging those halves together. And this visualization was a little different. It did not have the luxury of three shelves. It just moved top to bottom, top to bottom. And honestly, Brian could have been a little more optimal there. We wanted to make clear how many total shelves there were. But honestly, there's no reason he couldn't have just moved the numbers down then back up, then back down then back up.
And indeed, that's the price you pay with merge sort. Even though n log n is better than n squared, and ergo merge sort is arguably better than selection sort and bubble sort, you pay a price. And this speaks to the trade-off I mentioned earlier. Almost always, when you do something better in code or solve a problem more intelligently, you have paid a price. Maybe you spent more time as the human writing the code, because it was harder and took more sophistication. That is a cost.
Maybe you had to use actually more space. Brian had to have at least one extra shelf in order to implement merge sort. If implementing merge sort in code in C, you will need at least a second array to temporarily put the numbers into as you merge things back and forth. If you want to be extravagant, you can have three separate arrays or four separate arrays. But it suffices, per the graphical representation of merge sort, to just use a second array.
Now, that might not seem like such a big deal. But implicitly, you need twice as much space. And that might be a big deal. If you've got a million things to sort, and you now need two arrays, that's 2 million chunks of memory that you need. And maybe that's not tenable. So there, too, there's going to be a trade-off.
And maybe while slower, selection sort or bubble sort, maybe it's better because it's a little more efficient with space. It's going to depend on what you care about and what you want to optimize for. And honestly, money is sometimes a factor. In the real world, maybe it's better to write slightly slower code so that you don't have to buy twice as many servers or twice as much memory for your computer. It depends there on what resource is more important-- your time, the computer's time, your wallet, or some other resource altogether. So we'll continue to see these kinds of trade-offs.
But perhaps the most mind blowing thing we can do as we wrap up here is share a few visualizations of how these algorithms actually compare. And one last piece of jargon is this one final Greek symbol, theta. It turns out that, thanks to selection sort and merge sort, we can actually apply one more term of art here, this theta notation. Anytime an algorithm has both the same upper bound as its lower bound running time, you can actually describe it in just one sentence instead of two in terms of theta notation.
So because selection sort was in both big O of n squared and omega of n squared, you can actually just say, ah, it's in theta of n squared. It's always n squared either in the upper bound or in the lower bound. Same thing for merge sort. It's in theta of n log n. We cannot use theta for bubble sort or for binary search or for linear search, because they had different upper and lower bounds.
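A rough way to see why theta applies to selection sort but not to bubble sort is to count comparisons. The sketch below does that in C. The function names and the counting scheme are assumptions for illustration, and the bubble sort shown is the variant that quits early when a pass makes no swaps.

```c
#include <stdbool.h>

// Selection sort; returns the number of comparisons it made.
long selection_sort(int a[], int n)
{
    long comparisons = 0;
    for (int i = 0; i < n - 1; i++)
    {
        // Find the smallest remaining number.
        int smallest = i;
        for (int j = i + 1; j < n; j++)
        {
            comparisons++;
            if (a[j] < a[smallest])
            {
                smallest = j;
            }
        }
        // Swap it into position i.
        int t = a[i];
        a[i] = a[smallest];
        a[smallest] = t;
    }
    return comparisons;
}

// Bubble sort that stops after a pass with no swaps;
// returns the number of comparisons it made.
long bubble_sort(int a[], int n)
{
    long comparisons = 0;
    bool swapped = true;
    for (int pass = 0; pass < n - 1 && swapped; pass++)
    {
        swapped = false;
        for (int j = 0; j < n - 1 - pass; j++)
        {
            comparisons++;
            if (a[j] > a[j + 1])
            {
                int t = a[j];
                a[j] = a[j + 1];
                a[j + 1] = t;
                swapped = true;
            }
        }
    }
    return comparisons;
}
```

On eight numbers, this selection sort makes 28 comparisons whether the input is already sorted or fully reversed, which is the behavior theta of n squared describes. This bubble sort makes 7 comparisons on sorted input but 28 on reversed input, so its upper and lower bounds differ.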
Well, let me go ahead now and prepare a final demonstration, this time using some random inputs. So you'll see here a video comparing selection sort, bubble sort, and merge sort all together. All three of them start with random data. But let's just see what it means for an algorithm to be in n squared in the worst case or in n log n in this case instead.
[MUSIC PLAYING]
Selection sort's on the top, bubble sort's on the bottom, merge sort's in the middle. And would you believe it, merge sort is already done.
[MUSIC INTENSIFIES]
And meanwhile, we have some very trendy music we can listen to, which is really just there to distract us from the fact of how slow n squared actually is in practice. And notice, there's not that many bars here. There's maybe like a hundred or so bars. Like, n is 100. That's not even a big value. When we're talking about the Twitters, the Facebooks, the Googles of the world, these are trivial sizes. And yet, my God, we're still waiting for selection sort and bubble sort to finish.
And so you can see here that it really matters when you exercise a little bit more cleverness, and you leverage a more efficient algorithm-- and finally, selection sort is done, bubble sort still taking a little longer here. And this is going to depend on the input. Sometimes you can get lucky or unlucky. But I think it's convincing that merge sort has won in this case.
[MUSIC PLAYING]