[Week 3] [David J. Malan - Harvard University] [This is CS50. - CS50.TV] Let me steer us in the direction of where we left off last time, which was starting to think a bit more than about syntax and trying to think a bit less about all of the minutiae that takes a bit of time to acclimate to thus far in terms of semicolons and parentheses and curly braces, to start taking things a little bit to a higher conceptual level so that the problems we now start solving over the next several weeks are going to involve much more into higher level conceptual problems and a bit less in the syntactical as you get your feet wet and your hands dirty with some of the syntax from these past few weeks. So recall that last week we introduced this notion of an array. And an array in English can be described as what? >>[inaudible student response] Sorry? A collection of? >>[inaudible student response] >>Okay, good. A collection of items. So we saw arrays in Scratch. If you happened to use for pset 0 one of Scratch's lists that you can drag things like oranges and bananas into, an inventory of sorts, that's kind of like what an array is. And then more technically, in the context of an actual computer, an array is simply a contiguous chunk of memory. In other words, you have a byte, then another byte, then another byte, then another byte, and if you were to draw those bytes in a picture, they would be back to back to back to back. That's what we mean by contiguous. So it's byte number 1, then 2, then 3. It doesn't mean up here, up here, up here, up here. An array is a contiguous chunk of 0 or more bytes. So what are they useful for? Recall we had this sort of contrived example of storing people's quiz grades in a program to compute your quiz average for some course, and recall that we could start writing that program by declaring a variable quiz1. Then we could have another variable called quiz2. But then if there were 3 quizzes in this class, quiz4. Or if there was a weekly quiz, it would be quiz5, quiz6, quiz7. So you would have all of these variables declared inside of main or somewhere else in your program, and the problem with that approach, easy though it is to just copy and paste, is it just very quickly becomes unwieldy. God forbid you actually have 30 quizzes or 50 quizzes. If it's like a high school style daily pop quiz, then you just have a ridiculously long list of variables being declared, and this just very quickly gets out of control. It's ugly, it's hard to maintain, it's so much easier to make a typo if you get 1 number mistyped somewhere in your program. So we introduced the notion of an array instead. And recall that we implemented this program by doing a little something like this. Let me go into today's Source 3 Monday directory and open up array, which we saw last time. And even though there were a couple of new C tricks here, among them the notion of a constant, recall that we declared multiple floats essentially by using this syntax: float, then the name of the variable, then we used square braces really for the first time, and what we did inside of those square braces was effectively put a number. But instead of putting a number, I put this capitalized word, QUIZZES. And what was the motivation for putting a capitalized word like QUIZZES and then using line 17's trick here to actually give that a number? What was the motivation there? Yeah. [inaudible student response] >>Exactly. If we want to change that value 2, we only have to change it in 1 place because consider--I don't even remember what this program did exactly, but if you just skim it you see QUIZZES, QUIZZES. You see QUIZZES, down here more QUIZZES. So if we didn't have this constant, this use of sharp define, we would have typed 2, then 2, then 2, then 2, which is fine. It would be just as correct. But suppose that next year we have 3 quizzes in CS50. So I have to go and update the code, I have to recompile it, but the problem is if I do something stupid, like I overlook 1 mention of 2 and forget to plug in 3, the whole program could very well break. So we're just asking for trouble. So the notion of a constant is all about factoring out some piece of data, whether it's a string or a char or a float or whatever, and declaring it 1 place so that you can more readily change it in the future. And it's also, frankly, a bit easier to read because if you just think of this now, it's QUIZZES, or we could even rename it something like NUMBER_OF_QUIZZES or something more explicit. The code just becomes a little more obvious as to what it's doing, and you wonder a little less what number 2 might happen to mean. So the constant had nothing to do fundamentally with arrays. The array was introduced by way of these square braces. So notice that in line 23 we ask the user, "What were your quiz scores?" Then we just have this loop which apparently asks the user for their grades. How? It iterates from 0 to 2. And I say 2 because QUIZZES in all caps is currently 2. So it iterates from 0 up to 2 and then it prints out Quiz # something of something, and then it uses GetFloat to get a value from the user. So notice this is the only other new piece of syntax from last Wednesday. If you want to store something in a particular location in that array, you again use the square brackets. So there's a bit of dichotomy here. The first time you use the square brackets you use it to specify how big you want the array to be. But this next context here where we again employ these square brackets means where in that array do you want to put some value? And the distinction here can be inferred from context. Notice here we have a data type, then we have the name of a variable, then we have our square braces with a number inside, semicolon. That's it. So that's a declaration. It's just as though we had done something like float grade1; float grade2; but again, this very quickly devolves into way too much copy, paste, so instead we just simplified it as such, which means henceforth we have a grade that can be stored at bracket 0, we have another grade that can be stored at bracket 1, but what if I goof and, for instance, my loop goes so far-- for instance, I make this less than or equal to, which recall was the source of a previous bug-- which effectively means that on some third accidental iteration of this loop I use bracket 2. Effectively, what might happen here? Sorry? [student] It's going to be replaced. >>Is it going to be replaced? What would be replaced? This literally is saying replace what is at location 2 with the return value of GetFloat. But the problem is how big is the array at this point in the story? [inaudible student response] >>The array is still only of size 2 because the array, like any variable, was declared first, before we used it, and we specified here because of this constant that I have 2 grades that I'm going to put. But remember, the computer scientists start counting from 0. So the first location in that array is bracket 0. The next location is 1. This thing is ever so slightly too far over to the side. So in other words, if I actually had this array-- and let me see how well this cooperates here for us-- if I have an array that I've simply drawn as follows and I've allocated space for 2 elements, I might draw this like this in memory where this big white canvas is. It's just the RAM I have in my computer, a gig of RAM, 2 gigs of RAM, whatever, but these 2 boxes now individually represent a float, 32 bits. So if I put 1 number here like 1.0, then I put another number here like 3.2 but then I do bracket 2, that's like putting something here. And as the picture suggests, there is nothing there. It's sort of like no man's land because I have not asked the operating system to give me this third quiz. If I did want that third quiz, I should have had the forethought to ask the operating system for it by declaring QUIZZES to be not 2 but to instead equal 3. So in other words, the picture that we effectively have at hand looks like this here. This again is no man's land. We better not try writing values here. But again, because computer scientists count from 0, when we talk about this location in the array, that's supposed to be location 0, this is supposed to be location 1, and this doesn't even exist because we only asked the operating system for 2 such places. So those of you with prior programming experience from other languages might know that this is not always the case with arrays or things called vectors. Rather, you can just keep adding and adding and adding things to arrays, which, frankly, we had that ability in Scratch and yet we seem to have given it up here because with C you are programming much more explicitly. It's only you and the computer right now, and the computer is only going to do what you tell it to do. So if you only tell it to give you 2 floats by way of line 22 here, that's all you're going to get back from the operating system: space for 2. So increasingly are your programs going to occasionally be buggy with regard to arrays. This is just sort of the nature of the beast whereby all of us are fallible, and at some point you will very likely index beyond the boundary of your array. And that's just a fancy way of saying you went into bracket something and something was just too big of a number. You went beyond the bounds of your array. But the upside now is this. The rest of this program really has nothing fundamentally to do with arrays. It's all just about some simple arithmetic for computing averages. So we have here in this for loop here first a variable called sum that we initialize to 0. Then we iterate from 0 up to 2 again and we add to that summation variable the ith grade, so bracket 0 then bracket 1. And then as you would do in grade school to compute the average, we simply take that sum, divide it by the total number of quizzes, and then for good measure we call a function here called round. Now, as an aside, what is the deal with this parenthetical int on line 34? It might have come up already in section, haven't really talked about it formally here, but what is this int in parens probably doing? >>[inaudible student response] Yeah, this refers to casting or typecasting, which means taking 1 data type and converting it to another. You can't do this with all data types because sometimes it would be a bit strange. But in this case, if the return value of round is a float because, after all, I'm taking a float and dividing it by a number like 2, I'm going to get back a float. But grade school people don't really like to know that their average was 93.4 because they'll realize they were ever so close to that 95 rounding point. So we want to instead use int to round everyone to the nearest int, which in this case is going to be 94 with no point after it. So that's just a little mathematical trick. And we'll come back to this notion of casting because it will have implications, if you haven't discovered already, for problem set 2. So an array then, you can think of--it's going to make me smile all day. It looks like this if you draw a picture of it, but the key is that the size is also selected by you when you request it from the operating system. Any questions then on arrays? Yeah. [inaudible student question] Ah, good question. The question is what happens to the null 0 in the array? It does not exist in this context. That only exists in the context of strings, which we're about to come to in just a moment. But for an array, as in this case, all you get is what you ask the operating system for. And as an aside, lest this be unclear, I keep saying you ask the operating system, ask the operating system. An operating system, as you probably know, is Mac OS, Windows, Linux. When you're calling functions like GetFloat or you are declaring variables like grades, at the end of the day you are effectively asking someone else to give you that memory because we as aspiring programmers have no idea how to actually get physical access to memory. But someone does: the operating system. So besides presenting us with pretty icons and menus and folders and the like that you see on your desktop, whether a Mac or PC, operating systems also do the low level mundane stuff, the highly technical stuff of managing the gigabyte or 2 gigabytes of memory that you have, managing the CPU that you have, and so forth. So when you're writing code, you're really hooking in to your operating system in that sense. I'm going to have to minimize that. All right. Other questions about arrays? No? Okay. So the transition naturally from arrays is actually to a topic that's a bit familiar. And we looked ever so briefly at this last time too. This was a string example from Wednesday. This string example was a pretty simple program, and I've actually simplified it by a couple of lines for today's purposes. All it does in line 19 is get a string from the user, stores it in a variable called s. Then in line 22 onward it's apparently printing that string 1 character per line. But how is it doing this? We're declaring a variable i, setting it equal to 0, and this is becoming old habit now. We hadn't seen this until Wednesday, but you can kind of infer from its name strlen just returns what when given s? The length of the string. So if I pass it a string, quote-unquote DAVID, it's hopefully going to return to me the number 5 because of D-A-V-I-D. So that's its purpose in life is to take a string, whether hard coded by you or in this case plugged in as a variable, as an argument, and it figures out what the length of that string is. So here now we're borrowing some notation from the previous quiz example. This has nothing to do with floats, has nothing to do with quizzes, but it turns out that the little white lie we've been telling you since week 1 is that a string doesn't really exist in C. A string at the end of the day is really just an array. It's an array of bytes, so byte, byte, byte, byte, which recall is just 8 bits, so chunk of memory, chunk of memory, chunk of memory, chunk of memory. And the means by which a string is implemented is by putting the first character here, then here, then here, then here, back to back to back in the computer's memory. So if you wanted to spell out a word like HELLO, you would put 1 character H, then E, then L then L, then O--5 characters in total--somewhere in your computer's RAM. But the key detail here is that they're going to be back to back to back to back, right next to one another. When when I say s[i], what in English is this giving me? What does s[i] represent in this case? Yeah. [student] The ith character in the string. >>Exactly. The ith character in the string. Now, i is going to start at 0 as per my for loop here, but that's good because everything starts counting from 0. So s[0] is going to represent the letter H in a word like HELLO, s[1] is going to represent a letter like E in a word like HELLO, and so forth. And what we seem to be doing on each iteration of this loop is temporarily storing the ith character in a variable called c, which is just a char, and then we're printing out c so that at the end of the day what this program does is the following. If I go into the source directory and I make string1 and I go ahead and run string1, and then I type a word like HELLO, Enter, all it does is print this 1 character at a time. So there's an opportunity for refinement here. I'm kind of doing more work, even though it's more clear maybe this way, than necessary. Which line of code here can I probably throw away altogether? Yeah. Line 24. In line 24 I'm declaring a variable c. I'm storing the ith character of s in it, but then I'm using c here. So I'm using c, so I feel like I can't just throw line 24 away. [inaudible student comment] >>Exactly. So when it comes to talking about the design of programs, notice this slight simplification of the code, which is just as readable, but realize that s is just a variable, its data type is an array, so s[i] is just going to instantly return to you the ith character in that string. And if you want to print it, that's fine. You just have to use %c because you're not printing a string, you're printing a character in a string, and this too has the effect of printing the ith character. And recall the only difference really from last week with using printf is that whereas in the weeks past we would do something super simple like the %s placeholder then the name of a string here, now we're diving in a little deeper underneath the hood and saying, don't print the string; print the single character therein. So we can do something a little different here because there's 1 other--not bug because this program is right, but I'm doing something stupid that I mentioned briefly on Wednesday. But thinking back, how could this program's design be improved even further? Yeah. [inaudible student response] >>Oh, good. So recall that we introduced a second variable called n last time, which seems to be contradicting ourselves because my goal a second ago was just to throw away a variable as unnecessary, but recall that on Wednesday we actually did this. I changed the for loop to actually have a comma here, then n = strlen, and then over here I did i < n, then I did my increments. What is the fundamental gain that I'm achieving by changing my initialization to this and my condition to this now? >>[inaudible student response] >>Exactly. I'm not recalling strlen again and again and again because recall how the for loop works. Even if they start to get more complicated-looking, recall that the thing before the first semicolon is the initialization, which happens once. The condition, though, is in the middle, and this gets checked every time that you go through the loop. So it's kind of stupid to be asking the computer the same question again and again-- What's the length of HELLO? What's the length of HELLO? What's the length of HELLO?-- because as we'll see today and on Wednesday, this is definitely going to take time, and it's not a very good use of time because to figure out the length of a string actually takes a bit of effort. It's not instantaneous, as it is in some languages. So by changing this to n, the price I'm paying is what? We're seeing a trade-off here. I can save time by not asking the same damn question again and again, but it's going to cost me something, which is what? [student] You lose a certain amount of memory. >>Exactly. It's going to cost me some memory. So in this case it costs me what? Another 32 bits because n is just an int, as implied by the word int here. But is that okay? Frankly, that's probably okay because if you think about it, the longer the string is, the more time I'm going to be wasting because strlen is going to get called again and again and again for every iteration of the loop. And these days, my Mac has 2 gigs of RAM, these days 4 gigs of RAM sometimes. I think I can afford 4 of those bytes to actually speed things up. But this is going to be a trade-off and a theme really in programming and in computer science of never really getting anything for free. If you want to improve something here, you have to pay for it in the other hand somehow. Space versus time in this case. So this was all leading up toward something cryptic like this, which, as you probably figured out by now, actually says? [inaudible student response] >>Yeah, so this is, Be sure to drink your Ovaltine, actually using an algorithm called ROT13, R-O-T 1-3, which just means rotate all of the letters 13 places, which means take A and then add 13 to it and go dot, dot, dot all the way to the 13th letter away, do the same thing for B and for C and for D and so forth. And so if we actually convert this here using a shift of 13 places, we'll get back to what little Ralphie had, which was, Be sure to drink your Ovaltine. But now for problem set 2, in the standard edition at least, you have to kind of do this enciphering yourself, and we have to somehow take in input like this and encrypt it or decrypt it. So which of these fundamentals sort of leads us to that opportunity? Let's take a look at this third example here. First of all, it's called ASCII. What does ASCII refer back to? American Standard Code for Information Interchange, which is a really long way of saying what? What is ASCII? [inaudible student response] >>What's that? >>[student] A character map. >>A character map. It just maps numbers to letters because the world has standardized what numbers will represent what letters so that all of us can use computers and our programs all are just compatible when it comes to printing out things on the screen. So recall that 65 happens to represent A, 97 happens to represent lowercase a. And so this simple program here ASCII is taking advantage of that fact-- that the world knows that capital A is 65--and it's just printing the mapping. So before we dive into this code, let me instead open up a terminal window. Let me go ahead and make ASCII, and then let's just run this thing just to spoil the output. And it just does this: a really big chart that just tells me all of the various codes for all of the various letters. So a super simple program, but I didn't have to hard code those 52 lines of output: 26 uppercase, 26 lowercase. Instead, I did this programmatically with a couple of loops. Notice what I did here. I iterated from i is 65 on up to 65 + 26 because I wanted to print out 26 letters in the English alphabet, i++ on each iteration, and now notice this again. It's the recurrence of our friend typecasting whereby you convert 1 data type to another because what do I want to do in this particular program? I want to count numerically because that's how I grew up counting--65, 66, 67, and so forth-- but I don't want to print just the numbers. I want to print the letter followed by the number. I want to print A: number, B: number, but I can do this with the same exact variable. So I print out %c as a placeholder for a character, %d as a placeholder for a digit or number. Then what do I plug in for those 2 placeholders? I first plug in the character equivalent of i, and then I print out i itself. So notice this too just works. Just as I can cast from a float to an int in order to go from a real number to an integer, here I can go from an int to a char, which is a little weird-- doesn't quite map onto the real world--but in computers a char is just a number underneath the hood, so we're being ever so explicit here to the computer, saying, printf, print out not i as 65, print it out as its numeric equivalent. And it turns out I technically don't even need this. What I was doing a moment ago is explicitly casting by specifying what data type I want to go from and to. But notice that I already have this placeholder %c and this other %c placeholder here. Even though this isn't int, the computer realizes that a char, it's just an int underneath the hood. So if I actually recompile this and rerun the ASCII program, notice it still just works because the computer realizes that there is this correspondence. Now, it's more important to do the explicit casting in the world of floats to ints because there you're actually making a calculated decision: throw away everything after the decimal point. Here there's really nothing to throw away because a character is just a number, and a string is just an array of characters. So when it comes time to implementing some encryption or decryption, how is it that we can actually translate something like this nonsense to, Be sure to drink your Ovaltine? What if we know right now--let's take as assumption--that the key, the number that we're rotating all of these letters by, is the number 13? So we went from the letter B all the way to O at the start of the sentence, Be sure to drink your Ovaltine, because if I do B and then I go C, D, E, F, G, H, I, J, K, L, M, N, O, that's why the encryption of the letter B becomes O because I just added 13 to it. So if I want to decrypt this, I essentially have to take O and then subtract 13 from it. Or, frankly, because there's 26 letters in the alphabet, this is wonderfully symmetric, we can also just add 13 and we'll get back to the letter B. But how do you go about implementing something like this in Caesar or really manipulating strings in general? If the letter B is what number? What's the letter B? So it's 66, right? So if the letter A is 65 and the letter B is 66, so 66, all I have to do is add 13 to it, and this gives me 79. And if we go to our little cheat sheet, 79 indeed maps onto O. But there is a bit of a corner case here. What is, say, the letter Z? If we do 66 + 25 to get all the way to the end of the alphabet, we're at 91. 91 + 13 gives me 104, and guess what? 104 does not equal an uppercase letter. Let's go back to a little cheat sheet here. If I rerun this program in the appliance, notice that 104, if I go back to the terminal window, 104 is apparently the lowercase h. So we need some key trick here in order to make sure that when we start at Z and we add 13 to it we don't want to just keep forging ahead to bigger and bigger numbers. What do we really want to do? You want to wrap around. So it turns out, as you've seen probably in section now or in the problem set spec itself realized that there is this other operator in C that also is a percent sign, but whereas we've used % here to specify a placeholder, know that, especially for problem set 2, there's also something like this: int x = y % z. Let me just present this as a very generic form of this. Percent means what in a programming language? >>[student] Modulo. Modulo, which is a fancy way of saying the remainder. Even though there's a slight distinction with the definition there, this means divide y by z but don't return the result of that division; instead, return the remainder. So if y is actually 3 and z is actually 2, 3 divided by 2 is 1 with a remainder of 1, so what does x actually equal in this scenario? 1. This is such a simple, low-level idea. It takes a little time to get your mind wrapped around it because it's probably been a while since you even had to care about remainders and actually use them for something purposeful, but in this case the simple fact that you can go from a big number like 3 to a relatively small number like 2 and then wrap around effectively by using the remainder to a smaller value like 1 is going to be an invaluable trick that we can use both for something like Caesar and this other thing Vigenere in problem set 2, but this is going to be a recurring trick throughout the semester. This simple, simple idea of just taking the remainder in general is going to allow us to wrap around. And as we start playing more with arrays, as we start playing more with memory itself, this is going to become more and more of a powerful trick. So any questions then on ASCII or the representation of strings as arrays? And we'll take it up 1 notch further. Yeah. [inaudible student question] >>Good question. What does it mean when a variable has an asterisk in front of it? Let me postpone answering that in any detail, but that refers to a topic known as a pointer. Pointers have to do with memory, and we're actually today taking the first step toward that discussion, but for now, let me pretend that the star doesn't exist and we'll continue calling strings strings instead of using char*, which you've probably seen before and I'll put on the screen in just a moment as a teaser. So we'll come back to that in way more detail than many of you will probably like. Eventually, not today. Yeah. [inaudible student question] In what context do you have to provide the sign for a character? >>[student] Yeah. So by default, when you don't put a +, just positive numbers are assumed. So if just write the number 1, it's a positive 1. If you actually want to specify the negation of a value, you literally have to do -1 on your keyboard. But this probably isn't your question. >>[inaudible student response] Good question. Okay. So this has to do, I gather, with some kind of bug you ran into because you were converting an integer to a character, but somehow negativity got involved, and so the character just came out munged somehow. So for now, let me oversimplify a bit until we come back to this kind of topic. For now, think of things this way--and this is an oversimplification. But in the world of an integer, you have how many bits at your disposal? You have 32 bits. And thus far, we've talked about the total number of integers you can therefore represent is roughly 4 billion in total because you have 32 bits, so that's 2 to the 32, so that's roughly 4 billion. But we saw a week or 2 ago that you don't really have a range of numbers from 0 on up to 4 billion. The range instead goes from roughly negative 2 billion to positive 2 billion. But this then begs the question, how do you represent the notion of negative 2 billion let alone negative 1? For now, we can oversimplify and just say that we're going to use the leftmost bit of those 32 bits, and if it's a 1 it's a negative number, and if it's a 0 it's a positive number. The problem with that simplified representation of negative numbers is that if you were deliberately being clever and trying to convert from a character to a number or vice versa, there's no such thing as a negative character. In the world of ASCII, which uses only 8 bits, all 8 of those bits matter, and the leftmost bit has nothing to do with negativity. And just to be clear, when I say leftmost bits, recall that when we did our bit-related examples in the first week recall that we drew things like 1001101, something like this. When I say the leftmost bit, I just literally mean the 1 that you write all the way over to the left. So in the world of characters there is no notion of negativity, so that leftmost bit actually has something to do with ASCII, nothing to do with negativity. So it sounds like--and out of context it's hard to answer exactly-- but somehow, your code was confusing that leftmost bit as representing a negative value when it really was part of the character in question. And again, I'm oversimplifying because computers actually do something a little fancier than just changing that leftmost bit to a 1 for a negative sign versus a 0. They instead, if you're curious to Google, use something typically called 2's complement, which is a bit more sophisticated of an approach but the idea is ultimately the same. So in short, it had to do with the fact that you were massaging a number to a character or vice versa but your code was not cognizant of the fact that 1 of those bits had significance in the numeric world. That's not the case in the character world. But it sounds like you fixed, in which case moot now. Other questions. Okay. So thus far, all of the programs we've written have taken input maybe from the user in the form of functions like GetInt, GetString, or if you've been reading ahead in various books or online references, you yourselves might have used functions like scanf which, frankly, we use in the CS50 library. But in a week or 2, we'll actually show you how the CS50 library is implemented so that we can take those training wheels off altogether. But it turns out there's another way to get input from a user. In fact, we ourselves have been using command line arguments for a couple of weeks now. Every time we have run Clang or we have run make, we haven't just typed clang, Enter, we haven't typed make, Enter. What have we typically written after the word clang at our terminal windows prompt? [student] The file name. >>The file name, right? Hello.c or mario.c or whatever the relevant file name is. And in that sense what you've really done is you've influenced the behavior of Clang because certainly the people who wrote Clang had no idea that little old you was going to write a program called mario.c years later. So you had to somehow influence the behavior of that program, and that program Clang had to be written in such a way that it can accept input from you by the addition of words on the prompt before the user hits Enter. So it turns out that for some time we have been declaring almost all of our programs to start like this--int main(void)--and then we've gone ahead and started writing our code. And we might have some sharp includes at the top of the file, but almost all of our programs thus far have begun with this even though you might have seen in section, in books, online references that this does not in fact have to be void. Another legitimate form for this to take is int argc and then string argv[]. So now what is this implying? It turns out that argc, which is a human convention--you could call this foo, but it would just be a lot less clear to readers-- argc just is an argument to the function called main that represents what? What does argc stand for for those familiar? [inaudible student response] >>Yeah, number of arguments or argument count. It's as simple as that. How many arguments were passed to this program? What does that mean? If on the command line I have run something like this--clang mario.c-- argc when I hit Enter is going to take on a value of, somewhat confusingly, 2. So it turns out that argc is argument count, but for historical reasons, the name of the program itself is included in that count. So argc is 2 when I wrote clang mario.c. What does argv contain? First of all, argv looks like a string but not quite because as of last Wednesday and all the more today, these square brackets denote what? That's an array. There's no number in the array, and that should make sense intuitively because the people who wrote Clang years ago certainly had no idea how many words people like us would type at the prompt before hitting Enter. So in this case here they have declared the function main as taking an array of arguments, 0 or more arguments. They don't know in advance how many there are, so there is deliberately no number inside of these square brackets. But the fact that the square brackets are there are telling the computer, expect an array. Argv is just shorthand notation for argument vector. A vector is a fancy way of saying array, and array is a fancy way of saying a list or collection. So this just means that if you write main like this instead of like how we've been doing it for the past couple of weeks, your program now has the power to accept command line arguments so that no longer do you have to write mario and then hit Enter, then type in a number for how many blocks high you want the pyramid to be, then hit Enter again. We don't even need to use GetString anymore or GetInt or GetFloat for that matter. We can just expect the user to type those words at the prompt itself just like the authors of Clang decided it would be a really annoying program if to compile your code you first typed clang, hit Enter, then we said to the user, please type the name of the file you want to compile, then we type in mario.c and hit Enter. But that's exactly what we've been doing to our users the past couple of weeks. We use GetString and we wait until the program is running to prompt them for input. That no longer needs to be the case. So in this example here, we now have string argv, and this too is an oversimplification, training wheels that will very soon come off. This is the more proper way of writing this alternative declaration of main because it turns out that what we keep calling string actually has a star, an asterisk, in its actual definition, but this just looks complicated, it's confusing at first, so we simplify by just creating a synonym of sorts in the CS50 library that maps char* to this more user-friendly word string. So let's actually try this then. Let me go ahead and open up gedit here. Let me go ahead and open argv of 1. This program apparently prints the arguments, but in English terms, by looking at this code, what does this do more specifically? If I type in the command a.out foo bar, what gets printed in my black and white window? A.out foo bar, Enter. Go ahead. Yeah. >>[inaudible student response] Good. So a.out, new line, foo, new line, bar, new line. Why is this? We can certainly confirm in just a moment. This is kind of a fluffy line of code. It just prints a new line just to make things prettier on the screen. This is a loop that's iterating from 0 on up to argc, and this is incrementing on each iteration ++. So this is now saying print a string, as implied by this %s. Argv[i] is pretty much the same idea from the previous example. We used to call the variable s; now it's called, arbitrarily, argv. This means print the ith argument that was typed at the command line, and then after this whole thing is done, just for good measure print another new line. So let's see this. Let me open up the terminal window. Let me compile argv of 1, and now let me run argv of 1, Enter. Hmm. Okay. Let's run foo bar. Interesting. Baz. And if you've ever wondered why I type this, this is just also a stupid computer science convention. The world often needs just verbal placeholders for words. So if you want to talk about some generic string, computer scientists just tend to say foo when they need a random word, then they say bar if they need a second random word, then they say baz if they need a third word, then they say qux if they need a fourth word, and then there's a huge religious debate online as to what comes after qux, so you can Google that to figure out what the other arbitrary word should be. But these have no meaning whatsoever, though foo bar, if you Google that, that does have meaning, which is part of the etymology here. So all this is doing then is printing 1 of these strings per line. So if I instead, though, wanted to get a little fancier, suppose that I didn't want to print each string per line; I wanted to print each character from each string per line. How could I instead do that? What do I need to change about this program if I want to print not each word but I want to print each word letter by letter by letter, then the next word letter by letter by letter? How do we combine these ideas thus far? Yeah. [student] %c. >>All right. So we somewhere need a %c. Good, because I don't want to print whole strings, I want to print characters. What else? [inaudible student response] >>Interesting. So we need sort of a second dimension here now because think of argv as an array, but it's an array of strings. But as of, like, 15 minutes ago, what's a string? It's an array of characters. So really, argv is an array of an array of characters, an array of arrays of characters. So it turns out that we can use just more square bracket notations. So let's do this. In the top of this loop on line 19, I'm going to iterate from i up to argc, but then I'm going to do this: for--I can't use i now. I need another variable because I want to iterate over the words but then also over the letters in the words so I sort of have a vertical axis and a horizontal axis, sort of conceptually. So int j gets 0, then I want to do j as long as j is less than--and I'll clean this up in a bit. How do I iterate over the letters in a string? We did this a moment ago. Strlen of argv[i]. Good. And again, I'm making a little inefficiency here by not creating n or whatever, but we'll come back to that. So now j++. Now I have to indent further here. What do I now want to print on each iteration? [inaudible student response] >>So [i] will give me the word. [i] [j], sort of like a matrix. Those of you with math-y backgrounds, we're sort of indexing even deeper into this matrix or this array of arrays, this 2-dimensional structure. So now let's see what happens here. Let me open up my bigger terminal window. Let me rerun make of argv of 1. And I've screwed up here, which is a good lesson because I too forgot to do this. Implicitly declaring C library function 'strlen' with type 'unsigned-- I don't even know what the rest of that means, but I have seen this before, implicitly declaring. Whenever we see this error, what does this usually signify? [inaudible student response] >>I forgot a library up top. But wait a minute. Usually I've screwed up because I forgot the CS50 library, but that's there. Usually I've screwed up because I've forgotten standard I/O. And frankly, I don't even need this. We're not using GetString today. So what am I missing? There's another library that now we need to use occasionally called string.h, and this is just yet another library that has more functions that aren't in standard I/O. So let's go back to my big terminal window. Okay. Now, damn it, I guess I was wrong. I was using the CS50 library. So we can fix this in either of 2 ways. We can take the training wheels off right now and just do this, or let's kind of keep that simplification just for now, paste this back in, solve that problem, and now go back to the terminal window. So to be clear, in the CS50 library is not just functions, it's also the keyword string, which is why that error just happened. So here we go. I fixed both of the library issues. Enter. Good. Argv of 1, foo bar, Enter. Excellent. So now we have each letter of each word printed 1 per line, which does not make for a very interesting program, but notice now we have the capability of not only iterating over words but also over individual letters in words, which sounds awfully familiar to even the simplest of applications like scrambling letters in a string like this. Let's go ahead and take our 5-minute break here. And when we come back, we'll start talking about the efficiency with which we can do these things better. All right. We are back. Thanks to one of our TFs who plays a lot of bananagrams, we actually have a whole bunch of chars with us here today physically incarnated with these little plastic pieces, and let me propose that this blank white slate here represents the RAM in my computer-- laptop, desktop, whatever--and there looks like a lot of it because if we start chopping up this RAM into small byte-size pieces, let's arbitrarily say that something that size and that blurry represents-- there we go, and let's zoom out a little bit here-- let's say something that size represents a single byte. So we can indeed fit a whole bunch of bytes or characters inside of this memory, as suggested by the relative size here. So suppose now that the goal is to allocate memory for a string. How does this actually work? In the programs we've been writing, we've usually been using GetString, but now, clearly, there's this other channel via which we can get user input in argv via command line arguments. But what's really going on underneath the hood? It turns out if we call--let's scroll back to GetString--the function GetString in the CS50 library, the user is prompted for a string, the user types in some word--let's call it HELLO. And we've been saying for the past couple of weeks that the return value of GetString is in fact a string, like the word HELLO. But what is GetString really doing? As the user types in H-E-L-L-O, Enter, GetString is figuring out, okay, how many characters is this? This is H-E-L-L-O. So it needs to allocate, it needs to ask the operating system--Linux in this case-- for at least 5 bytes to store HELLO. And what it then proceeds to do once it gets back those 5 bytes from the operating system is to lay out H-E-L-L-O back to back to back to back. And so what's really returned from GetString is a chunk of data that looks like this. But this is a bit inaccurate because it turns out that it's not as simple as just storing H-E-L-L-O in the computer's memory because suppose that my program that I'm writing in C then calls GetString again, and the next word the user types in is B-Y-E, BYE. Well, I need to fit that word BYE somewhere in memory. I can't clobber HELLO. For instance, I don't want the computer to just start overwriting like this the original word because I might still be using the word HELLO in a variable somewhere else in my program. So B-Y-E has to end up somewhere else in memory. But the convention typically is that the next string you allocate probably, but not always, is going to end up at the next available memory location. And if I haven't asked the operating system for any memory since the last time I called GetString, odds are the word BYE is going to end up right after the word HELLO in memory. But at this point you can perhaps see where a potential problem arises. Because the next chunks of memory, the next bytes that were just free-- clean white slate--in the computer's memory were right next to HELLO, it feels like the first string I asked for might suddenly now change because I've essentially changed it to HELLOBYE instead of somehow demarcing the start of BYE and the end of HELLO. So it turns out that what's really going on underneath the hood, which you might have glimpsed in online references or section or books or not at all just yet is that there is actually a deliberate demarcation between words in a computer's memory. And in fact, in this case here, rather than just put BYE right next to HELLO, instead, the computer puts a special character, the special null character, so to speak, which is represented with a marker with backslash 0. So long story short, recall that characters are represented in ASCII. ASCII is just a mapping between numbers and letters, and most of those letters start roughly 65 for capital A, but it turns out you can certainly represent the number 0 as an integer or in binary, and it turns out the world decided long, long ago, "You know what?" "Let's reserve number 0 as not representing any characters on the keyboard-- "no letters, no numbers, no punctuation. 0 is special." "It's going to be the special null character, and we're going to write it as \0." The difference being if we just wrote 0, 0 is a character. Recall that there are ASCII codes for 0, for 1, for 2, for 3 because the character 0 is different from the number 0. And you can see that if you look back from week 1 when we first talked about ASCII, 0 and 1 and 2 and 3 all the way up to 9 had their own ASCII codes. They are not, coincidentally, 0 through 9. They're very different. So 0 just means "I am special," and the \0 means, literally, "I'm not the 0 character." "I'm this special value, the null character." So I actually need another one of these since I can't make the same mistake twice. So after the word BYE we're also going to need another one of these null characters. Let me grab my pen here and let me quickly draw another \0 so that after I have asked the operating system for 2 strings via GetString followed by another call to GetString, this is what's actually in memory. So when I get back a string, I'm really getting back that, and when I get the next string, I'm really getting back that. So this begs the question, strlen, first of all, what should it return? When I call strlen on the string s and s was the word HELLO that the user typed in, what did we obviously say the length of HELLO was a few minutes ago? It was 5, right? H-E-L-L-O. And that's indeed how strlen works. It returns what a normal human being would expect the length of a string to be. But in reality, how big is the array of characters that's storing hello? It's actually 6. So strlen does not mention that fact to you. But underneath the hood the computer is indeed using 6 bytes to store a 5-letter word, and this is true no matter how long the word is. There's always going to be a special null terminating character at the end of the string to demarc its total length. So then if you are now the person implementing strlen 20, 30 years ago, how do you go about implementing strlen itself? We take for granted that it exists, just like we take for granted that printf exists, but if HELLO is the word in question and what I have in memory is something that looks like this, if you had to reimplement strlen because you were asked to or because, frankly, you didn't know strlen existed-- you had to roll this one on your own--how could you implement strlen when given something that looks like this? Now that we know a string is an array, we can iterate over each of the individual characters using something like-- Let's try to do this on the fly. Let me go into the appliance. Let me create a new file, strlen.c. Let me go ahead now and do include stdio.h so that we have access to printf. Let me do int main(void). Oh. I'll just do this on my own for now then. [chuckles] Thank you. This is what I'm doing. All right. So before I turned on the screen, I typed all of that. And now what I'm going to do is the following: printf("Give me a string: ") That's just fluffy instructions. Now let me do string s = GetString. I already need to make a change now. I'm using the CS50 library suddenly, so let me go ahead and type in cs50.h. And now let's do this: printf("The length is: %d, strlen[s]-- and I'm not done yet. What else do I have to add to this program? [student] string.h. >>string.h. So for now, we're using strlen, so let's make sure the compiler knows where that is, so a little sanity check. I'm getting a string in line 8, and in line 9 I'm printing out its length with %d. So let's go ahead and open this up. We have make strlen--compiles okay-- strlen--let me zoom in--Enter, H-E-L-L-O, Enter. The length is 5. Okay, so strlen seems to work, but the world knew that. So let's now implement strlen ourselves as follows. Let me take this library away. We no longer have access to string.h because I didn't even know it existed. But that's okay because I can implement strlen myself and have it take a string called input, and now I need to figure out the length of this string. So how can I do this? What if I do--let's see how to do this-- What do you want to do? [inaudible student response] >>Okay. So we can do this in a bunch of ways. Let me try to take this approach. Let me give myself an int variable i, so i starts at 0. And let me say this: while input[i] is not equal to what? \0. So it turns out, as with the case with all chars when writing them literally in a program, you have to use single quotes, not double quotes. So if I were writing the letter a, I would do that, the letter b, I would do that. This, by contrast, would be a string, not an individual character. So I want \0 literally. What do I want to do in this loop? Actually, I need another variable, so int length gets 0. Even if you weren't sure why we started the way we did, now that we're going down this road, what do I want to do on line 9? length++ and then down here on line 10, return length. So how is strlen implemented? It's actually implemented probably like this. Maybe the person used a for loop, maybe a do while loop--who knows? We'd really have to look underneath the hood at the actual source code in some file called string.c probably. But here let's think about what I'm doing. I'm declaring a variable called i, setting it equal to 0. I'm then declaring another variable called length, setting it equal to 0. Then I'm saying while the ith character in input is not equal to the special null character, \0, increment the length. But as soon as the ith character is this special character, what happens to the loop? It short circuits. It stops, which means we then instantly return length. So if I didn't mess up, let's go ahead and go back to my terminal window. Let me recompile. And I did screw up. Incompatible redeclaration of library function strlen. So I was trying to get too clever for my own good here. The compiler actually knows that there is a function called strlen even though we haven't included the library. That's fine. Whatever. We're just going to cooperate then. Let's rename this length. Let me change the use of it to length here, and this will make Clang happier. As an aside, because some of these functions are so darn common-- strlen, prinf--they actually have sort of special status. And so Clang just knows a little something special about them. That's not always the case with most functions, so that's why we got yelled at. Let me try again. Thankfully, it worked that time. So now let me run my own strlen program. Give me a string: H-E-L-L-O, Enter. And I have screwed up. Why? >>[inaudible student response] >>Exactly. So I have myself here a very nice-looking infinite loop because even though I'm incrementing length on each iteration, what am I clearly not doing? I'm not incrementing i. Okay. Easy fix. Yes? Okay. No. Now we would run afoul of some other common mistake where I need brackets. And frankly, this code is starting to look ugly, so we'll take a stab at cleaning this up in a moment. But now I'm incrementing both length and i. Frankly, I already see an opportunity for improvement here, but we'll come back to that. So now let's just make sure we're at least making progress. This has happened to a few of you, and I neglected to mention this in advance. When you do have the misfortune of a scenario like this, how do you fix this short of restarting the appliance or your computer or closing the window? It's actually easy. Control C will send this little carrot symbol C, and that just terminates most programs. If you have a really bad infinite loop that's printing stuff infinitely many times, sometimes you might have to hit Control C a thousand times to make it actually hear it. So just realize now because I'm not printing anything, that was quite easy. And technically, once suffices, but I get impatient and I usually hit it that many times. So strlen. Give me a string: HELLO. Is it going to work this time? Okay. Another common mistake. Have to recompile. That was deliberate, that one. All right. So strlen, H-E-L-L-O, Enter. Excellent. So we now have a strlen to 5. So we have literally reimplemented that wheel. So now let's clean this up because this does not make me impressed with the design of my code. What can we clearly eliminate in this program to clean this up? [inaudible student response] >>Yeah. Literally, we're treating i and length identically. So why don't we just get smart and say while length? Rather, let's just call it length to begin with, initialize it to 0 because by default the string has no length until we figure out what it is. Now we do this, and now this is a pretty elegant program. One variable. I cleaned it up, tightened it up. So now let's go back to my terminal window. Let's go ahead and run this. Make strlen. Looks good. Run strlen again, Enter. Give me a string: HELLO, Enter. And it seems to be working as 5. Now, to be clear, if I had not written, for instance, HELLO in 1 string and then BYE in another, we can certainly have multiple words. If the expression I actually wanted to type was not HELLO but, for instance, HELLO WORLD, notice that what we would not have is this situation here, right? That would suggest that that's 2 strings. You certainly can have space bar characters, so if we actually typed in a longer phrase like HELLO WORLD, what we would really have in memory looks a little something like that there. All right. Any questions then about the representation here of strings? No? All right. So I said earlier that calling strlen again and again deliberately like that probably isn't the best idea because you're going to be doing a whole lot of work again and again and again. Indeed, what kind of work is necessary for figuring out the length of a string, apparently? You have to start at the beginning and then look, look, look, look, look until you finally see that special character, at which point, ah, now I know the length. So earlier when we had strlen being called again and again and again, the reason I proposed that was kind of stupid is because again, that string looks like that. It's not going to change every time you iterate through some loop, so you're doing unnecessary work. At the same time you should know, as an aside, that compilers like Clang these days have been developed over many years, and compiler writers, programmers, are pretty smart. And so it turns out that Clang and other compilers can actually figure out that, okay, yes, you wrote strlen in your condition, which technically means that we would call it again and again and again. But smart compilers can actually optimize those kinds of poor user decisions out of your code to redress things. So do just realize that sometimes the compiler is smarter than us and will kind of hide our own mistakes. But certainly when it comes to problem sets and the like, do be thinking about those fundamentally erroneous design decisions potentially for the simple reason that we'd be doing way more work than we actually have to do. But how much more work? In the case of HELLO WORLD, let's start to generalize the size of this problem. What's the length of the problem or the size of the problem when the word the user typed in is HELLO? It's apparently 5, maybe 6. Plus or minus 1. Whatever. It's so close we'll just call it 5. So what's the size of the problem here when trying to figure out the length of HELLO? It's 1, 2, 3, 4, 5, and maybe 6 for the last character, but let's generalize that as n. So n, just the variable n, is what computer scientists would typically use to describe the size of a problem, and the problem at hand is how long is HELLO? How much time does strlen take? It takes on the order of n steps, where each step means look at a character, look at a character, look at a character. And we had this discussion a while back, the number of operations something takes. The very first day of class we had everyone awkwardly stand up, and then everyone started pairing off with each other in order to actually count ideally how many people were in the room. And we also did another thing whereby if I instead did it the old school way of just starting 1, 2, 3, 4, 5, 6 and so forth, that too, the size of that problem was of size n. There were n people in the room. But I could speed that up, right? Grade school style I could start counting in 2s. 2, 4, 6, 8, 10, 12. And that feels so much faster, and indeed it is. It's literally twice as fast, but again, if another 400 people walked into this room all at once, those algorithms would take another 400 or maybe 200 steps. But by contrast, if we really get smart and we instead have all of you count yourselves, recall how that algorithm worked. You all stood up. Let me fast-forward to this. You all stood up, you paired off, then half of you sat down, half of you sat down, half of you sat down, and on each iteration of this loop from week 0, we halved the problem at hand and went to n/2, then n/4, then n/8. And the implication of that is that if another 400 people walk into the room, no big deal, it will take us 1 more round, not 400 more rounds, not 200 more rounds. And so the story we told a while back had to do a little something with this. This red line here is linear, it's straight, and it's labeled as n because as the size of a problem grows, if your algorithm or program with which you're solving it takes n steps, we can plot it as a straight line where it takes more time the bigger the size of the problem. And the twosies approach, counting 2, 4, 6, 8, still a straight line, just a little better. It takes a little less time, so the yellow line is below the red line point for point. But even better was this holy grail of what we called logarithmic time where even if again we double the number of people in the room, we double the size of that phone book from the first day of class, no big deal, it takes 1 more page tear, takes 1 more sitting down in order to solve a problem that's twice as big. And so the conversation we now get to start having is how do we actually solve problems efficiently if we consider the simplest of problems like this? Suppose we have 8 doors behind which are some numbers, and each of these numbers is not sorted in any way, they're just random integers behind these doors, and we ask the question how do you go about finding the number--who knows-- 7 behind these doors? What would you, a human, do in order to find me the number 7 if again each of these are doors and to see a value you have to open a door? What would your algorithm be perhaps? [inaudible student response] >>So start with the left and open a door, open a door, open a door. And in the worst case, how long is it going to take us to find the number 7? And again, they're not sorted, so it's not as easy as, well, I'm going to open the 7th door. It could take us, maximally, 8 steps. In the worst case, 7 is randomly at the very end of the line of doors, so we might have to try all n doors. So again here, we seem to have a linear algorithm. In fact, we did this just a couple of years ago. One of your predecessors was challenged with precisely this where we didn't have a digital version, we instead had a blackboard with some pieces of paper on it. And what I thought I would do is take a quick look back at how this went, one of the best and perhaps most awkward opportunities on stage to have a demonstration right here on Sanders. We had 2 rows of numbers. We're only going to look at what happens here with Sean for the very top of these rows. Unless no one ever again volunteers in CS50, we had Sean's blessing to keep this on camera, so he knows that hundreds of people have been watching this now for years. But Sean did an amazing job--or did he?--at actually finding us a particular number. So let's see how he solved this algorithm so that we'll resume this conversation before long of how we find things efficiently. [Malan on video] I have hidden behind these doors the number 7, but tucked away in some of these doors as well are other non-negative numbers, and your goal is to think of this top row of numbers as just an array or just a sequence of pieces of paper with numbers behind them, and your goal is, only using the top array here, find me the number 7. And we are then going to critique how you go about doing it. >>All right. [Malan] Find us the number 7, please. [laughter] [Malan] No. [laughter] 5, 19, 13, [laughter]. It's not a trick question. 1. [laughter] At this point your score is not very good, so you might as well keep going. [laughter] 3. Go on. Frankly, I can't help but wonder what you're even thinking about. [laughter] Only the top row, so you've got 3 left. So find me 7. [students murmuring] [Malan] 17. [students murmuring] [Malan] 7! [applause] So on Wednesday we will dive into this and more sophisticated algorithms for finding things. For now we'll leave you with Sean and see you on Wednesday. [CS50.TV]