[Week 7] [David J. Malan - Harvard University] [This is CS50. - CS50.TV] All right. Welcome back. This is CS50, and this is the start of week 7. A couple of little announcements: Pset5 is now in progress, or soon shall be, and let me say, quite honestly, this does tend to be among the more challenging of the course's problem sets, so let me mention this now so that this week more than ever you don't wait until, say, Wednesday night or Thursday night to dive in. This is definitely an interesting pset. We think it's fun. If you actually get it fully correct and can then challenge the so-called Big Board, you'll have an opportunity to match wits with some of the course's staff and some of your classmates. The Big Board is this: once you have your spell-checker working, you'll be able to go to cs50.net after running a command--purely opt in--and then the amount of time and the amount of RAM and more that you have used in your implementation will be exhibited here on the course's home page. You'll notice that a whole bunch of these folks here are listed as staff, since over the weekend the staff thought it would be fun to try to outdo each other. So realize that the goal here is not to outdo the staff. Even I am only here at number 13. Purely opt in, but it's an opportunity to see just how little RAM and how few CPU seconds you can use vis-a-vis some of your classmates. And I'll admit that for Kevin Michael Schmid, currently in the number 1 position as one of the TFs, this is an implementation that we call not possible, given that he's using almost 0 RAM and almost 0 seconds for loading. So we'll take care of Kevin offline. [laughter] There are certain skills that Kevin is putting to the test here. One of the things we thought we'd do too: CS50x is now a week in progress, and you guys are as much a part of this experiment as those students are. We've asked them as part of their pset0, which was similarly to submit a Scratch project of interest to them--a game, an interactive piece of art, an animation, or the like--to also submit a 1- to 2-minute video, if they would like, saying hello to the world and who they actually are. I thought I'd share with you just a couple of the videos that have been submitted thus far, because for us on the staff at least it really has been exciting and inspiring to see these folks from all over the world--countries all over the world--tuning in, of all things, to a computer science course on the Internet, whether it's because they want to continue their own studies, they want to take their careers in a new direction, they want to fill in gaps in their own knowledge--some of the same reasons that perhaps brought you guys here. So I give you 1 such student here. You could raise the volume just a little bit. Here is one of our student's 1-minute submissions. Hello, world. I am a student of industrial engineering here in Malaga, Spain. I am excited about this online course because I love computer science, I really do, and I truly appreciate that I get to explore it. And the fact that I can learn the same all of you guys do but instead of being in Harvard I am in Malaga, how awesome is that? Well, I am Fernando, and this is CS50. See you guys. [laughter] Another clip we particularly like, you'll find that this gentleman's English isn't so strong. It looks like he had it machine translated, so the translations themselves are a bit imperfect, but this was one of our favorites thus far as well. [♪♪] Hello, world. 
[speaking in Japanese] [I have to greet in Japanese because my English is very unreliable.] [I have delivered the message to you from the city of Gifu, Japan.] [I can be a student for the first time in 20 years, as can be seen.] [I am very grateful to Harvard University who gave me this opportunity and edX.] [Golf is a guitar and my favorite thing running.] [laughter] [♪♪] [Why do you think I was trying to attend a cs50x.] [Harvard University, it is my longing.] [Especially if I am distant presence lived in Japan.] [I wanted to try immediately aware of the existence of such edX when.] [Do not you think so you do not related to the age of learning I.] [cs50 is my longing. My name is Kazu, and this is cs50.] [♪♪] [applause and cheering] Another favorite of ours was this submission here from someone. [♪♪] [Malan] Google it if you're unfamiliar with this meme. And then lastly, a couple of others that got posted that perhaps win the adorable award. [students] Aww! >>[Malan] We'll have to listen. This is short, so listen closely. [female speaker] What's your name? >>Louie. [female speaker] What's this? >>[giggles] CS50. [laughter] [Malan] He did 2 takes, though. Here we go, the last. My name is Louie, and this is CS50. [laughter] This then is CS50x. Thank you to all of those of you following along at home who have been partaking thus far. Today we conclude our discussion of data structures, at least some of the most fundamental, and then we continue our conversation about HTML and web programming. Indeed, we've spent the past 7 or so weeks looking at the fundamentals of programming--algorithms, data structures, and the like--and C, as you may have experienced thus far, isn't necessarily the most accessible of languages with which to implement some of those ideas. And so starting this week and next week and then the following, we'll finally be able to transition from C, which is generally known as a fairly low-level language, to things higher level, among them PHP, JavaScript, and the like, which we'll see draw upon the same lessons that we've learned over the past few weeks, but you'll find that declaring things like arrays and hash tables and searching and sorting become so much easier, because the languages themselves that we'll start using will be more powerful. But first, an application of trees. It's very common these days to need to compress information. In what context would you want to compress some kind of digital information? Yeah. >>[student] When you need to send it over the Web. Yeah, when you want to send something over the Web. If you want to download a big file, it's ideal if someone on the other end has compressed that file using a zip format or something like that, so that you're sending fewer bits than might otherwise be transmitted. So how do you compress information? It all boils down to using fewer bits than are required by default. But this is kind of a curious thing, because think back to weeks 0 and 1 when we talked about ASCII and binary, and we talked about ASCII in particular as using 8 bits to represent letters of the alphabet, so that the letter A is represented by 65, lowercase a is the number 97, and however you represent the 65 or 97, you're using 7 or 8 bits. But the catch is that there are some letters in the English alphabet that aren't as popular as others. Z isn't all that popular, Q isn't all that popular, but A and E are super popular. And yet for all of these letters, by default the world uses the same number of bits, just 8. 
So wouldn't it have been smarter if, instead of using 8 bits for every letter, even the most infrequently used like Q and Z, we used fewer bits for A and E and S, the most popular letters, and used more bits for the less popular letters, the idea being let's optimize for the common case? That's a theme in computer science: trying to optimize what's going to happen the most, and spending a little more time, a little more space on the things that, yeah, might happen but not necessarily as frequently. So let's take an example. Suppose that we want to encode information fairly efficiently. You might have grown up knowing a little something about Morse code, and odds are you didn't know the actual code, but you might recall that it's at least this series of dots and dashes. This is a fairly efficient coding, and notice that the most popular letter--for instance, E--uses the shortest of beeps. Morse code is all about beep-beep-beep-beep-beep-beep and holding tones either for short periods of time or long periods of time. E, as denoted by the dot, is a super short beep, just beep, and that would represent E. By contrast, T would be a longer beep, like beep [prolongs sound], and that would represent T. But that's still pretty short because, by contrast, if you look at Z, to express Z you would go beep, beep [longer sound], beep, beep [shorter sound]. So it's longer because it's less common. But the gotcha here is that Morse code is a bit flawed in that it's not immediately decodable. For instance, suppose that you hear on some end of the wire beep [short], beep [long]. What message did I just receive? A dot and a dash. What does that represent? [student] A. >>[Malan] Maybe. It could also be E followed by T. In other words, Morse code, though it leverages this principle of optimizing the common case, doesn't lend itself to immediate decodability. That is, the human who is hearing or receiving these dots and dashes has to somehow figure out where the breaks are between letters, because if you don't know where those breaks are, you might confuse A for ET or vice versa. So what might you do? In Morse code you could just pause between each of the letters. But pausing is kind of counter to the whole point of speeding things up. So what if instead we came up with a code where there was not this bad situation where E is a prefix, for instance, of A--in other words, if we could make sure that the patterns are still short for the popular letters, long for the less popular letters, but there's no possible confusion? A man by the name of Huffman years ago invented this scheme called Huffman coding that actually leverages one of the data structures we've spent a bit of time talking about this past week, that of trees, binary trees specifically--a binary tree meaning that it has no more than 2 children. It has maybe a left child, maybe a right child, and that's it. So suppose just for the sake of discussion that someone wants to send a message that looks like this. It's complete nonsense, but it's composed of As, Bs, Cs, Ds, and Es. And if you actually count up all of the As, Bs, Cs, Ds, and Es and then divide by the total number of letters, this little chart here says that 45% of the letters are Es, 20% are As, 10% Bs, and so forth. So in other words, assume that the quoted string there is just some message that you want to send. 
It happens to be nonsense just so we can use as few letters as possible, but it's indeed the case that E remains the most popular, and B and C are the least popular, at least of these 5 letters of the alphabet. So how can we go about coming up with an encoding, a binary encoding, a pattern of 0s and 1s for each of these letters, in such a way that E is a short pattern and maybe B and C are slightly longer patterns, again, the idea being that we want to use fewer bits most of the time and more bits only once in a while? According to Huffman coding, you can create a forest of trees. There's sort of a story line here that involves trees and also the process of building them up. Let's begin. I propose that you start with this forest, so to speak, of 5 trees, each of which is a pretty stupid tree. The tree is composed of just a single node, as represented here by a circle. So each of these things might be a C struct, and inside of the C struct might be a float representing the frequency count and then maybe a char representing the letter. So think of these nodes as just any old C struct but, for now, higher level. This is a forest of 5 trees, each of which has only a single node. What Huffman proposed is that we start to combine those trees that have the smallest frequency counts into slightly bigger trees by connecting them with a new root node. So among the letters here, notice that for convenience I've sorted them from left to right, although that's not strictly necessary, and notice that the smallest nodes are currently 10% and 10%. So Huffman proposed that we merge those 2 smallest nodes into a new tree by introducing a new parent node and then give that parent a left child and a right child, where B is arbitrarily the left and C is arbitrarily the right. And then Huffman further proposed that we now just think of the left child in one of these trees as always being represented by 0 and the right child as always being represented by the number 1. It doesn't matter if you flip them, so long as you're consistent. So now we have 4 trees in this forest. And I say 4 because now the tree on the left--and it's not so much a tree in the sense that it grows this way; it's more like a family tree, where now the 0.2 is sort of the parent of the 2 children--notice that in that parent we've drawn 0.2. We've added the frequency counts of the 2 children and given the new node the total sum. So now we just repeat this process. Find the 2 smallest nodes, join them into a new tree, and then repeat the process further. Right now we have a few candidates: 20%, 15%, and another 20%. In this case we have to break the tie. We can do it arbitrarily; we should just do it consistently. In this case I'll arbitrarily go with the one on the left, and I now merge the 20% and the 15% to give me a new parent called 35%, whose left child is 0, whose right child is 1, and now we have just 3 trees in the forest. You can perhaps see where this is going. If we repeat this a couple more times, we're going to have just 1 bigger tree, all of whose edges are labeled with 0s and 1s. Let's do it again. 35% is that tree's root, and then there's 20% and 45%, so we're going to merge the 35% and the 20%. Now we have this tree here. We add those together; we have 55%. Now there are only 2 trees in the forest. We do this 1 final time, and hopefully mathematically all the frequencies add up, because they should, since we computed them from the get-go to add up to 100%. And now we have 1 tree. So this is a Huffman coding tree. 
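(Aside: the lecture builds this tree only on slides; purely for reference, here is a minimal sketch of the same algorithm in C, using the example's frequencies. The struct matches the lecture's description--a char for the symbol, a float for the frequency, and 2 child pointers--but the helper names and the tie-breaking rule are this sketch's own assumptions, so the codes it prints may differ, validly, from the slides'.)

    #include <stdio.h>
    #include <stdlib.h>

    // A node per the lecture's description: a symbol (meaningful only at
    // leaves), a frequency, and up to 2 children.
    typedef struct node
    {
        char symbol;
        float frequency;
        struct node *left;
        struct node *right;
    }
    node;

    // Hypothetical helper, not from lecture: allocate and fill in a node
    // (no error checking, for brevity).
    node *new_node(char symbol, float frequency, node *left, node *right)
    {
        node *n = malloc(sizeof(node));
        n->symbol = symbol;
        n->frequency = frequency;
        n->left = left;
        n->right = right;
        return n;
    }

    // Walk the finished tree, printing each leaf's code:
    // left edges append a '0', right edges append a '1'.
    void print_codes(node *n, char *code, int depth)
    {
        if (n->left == NULL && n->right == NULL)
        {
            code[depth] = '\0';
            printf("%c: %s\n", n->symbol, code);
            return;
        }
        code[depth] = '0';
        print_codes(n->left, code, depth + 1);
        code[depth] = '1';
        print_codes(n->right, code, depth + 1);
    }

    int main(void)
    {
        // the forest of 5 single-node trees from the example
        node *forest[5] = {
            new_node('E', 0.45, NULL, NULL), new_node('A', 0.20, NULL, NULL),
            new_node('D', 0.15, NULL, NULL), new_node('B', 0.10, NULL, NULL),
            new_node('C', 0.10, NULL, NULL)
        };
        int count = 5;

        // repeatedly merge the 2 roots with the smallest frequencies
        // under a new parent, until only 1 tree remains
        while (count > 1)
        {
            // find indices of the smallest (a) and second smallest (b) roots
            int a = 0, b = 1;
            if (forest[b]->frequency < forest[a]->frequency)
            {
                a = 1;
                b = 0;
            }
            for (int i = 2; i < count; i++)
            {
                if (forest[i]->frequency < forest[a]->frequency)
                {
                    b = a;
                    a = i;
                }
                else if (forest[i]->frequency < forest[b]->frequency)
                {
                    b = i;
                }
            }
            // which child goes left vs. right (and how ties break) is
            // arbitrary, as the lecture notes -- sender and receiver just
            // have to agree; here the larger goes left, echoing the slides
            node *parent = new_node('\0',
                forest[a]->frequency + forest[b]->frequency,
                forest[b], forest[a]);
            forest[a] = parent;
            forest[b] = forest[--count];  // keep the forest array compact
        }

        char code[8];  // plenty of room for this small alphabet
        print_codes(forest[0], code, 0);  // omitting free() for brevity
        return 0;
    }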
It kind of took a while to get there verbally, but the reality is with a for loop or with a recursive function you could build this thing up pretty fast. So now we have 1 new node, and all of these inner nodes have been malloc'd, presumably, along the way. So now at the top of this tree we have 100%, but now notice we have a path from this new great-great-great-grandparent to all of the great-great-great-grandchildren all the way at the bottom, to all of the leaves. What we're going to do now is propose that in order to represent the letter E, we will simply use the number 1. Why? Because if we traverse this tree from the final root down to the leaf known as E, we follow just 1 edge, the right edge, and that of course is labeled, at top right, with a 1. So the implication here for Huffman was that E's encoding in binary shall just be 1. And that's pretty damn efficient. Can't really get any smaller than that. By contrast, A is going to be represented, if you follow the logic, by what pattern of bits instead? 01. So to get to A, we start at the root and we go left and then we go right, which means we followed a 0 and then a 1. So we shall represent the letter A with the pattern 0 and 1. And now notice we already have a property of immediate decodability that we didn't have in Morse code. Even though both of these patterns are pretty short--E is 1 bit, A is 2 bits--notice that they can't be confused one for the other, because if you see a 1 it's got to be an E, and if you see a 0 then a 1 it's obviously got to be an A. Similarly, what's D? 001. What is C? 0001. And what is B? 0000. And again, because all of the letters we care about are at the leaves and none of them are kind of middlemen in the path from root to leaf, there's no risk of conflating 2 letters' different encodings, because all of these bit patterns are deterministic. 0000 will always be B. There's no node somewhere in between that you might confuse 1 letter for the other. So what's the implication here? The most popular letter--in this case E--has gotten the shortest encoding, A has gotten the next shortest encoding, and B and C, which we already knew from the get-go were kind of the least popular at 10% frequency each, they have gotten the longest encoding. And so what this means now is that if you want to send a message that's compressed over the Internet or in an email or the like, rather than using standard ASCII, you can send a Huffman-coded message, whereby if you want to send the letter E, you send just a single bit. If you want to send an A, you send 2 bits, 01, instead of sending 8 bits followed by another 8 bits followed by another 8 bits and so forth. But there is a gotcha here. It's not sufficient to just construct this tree and then have Alice start sending Bob the shorter bit patterns rather than ASCII, because Alice also has to inform Bob of what, if Bob is going to be able to read her compressed message? [inaudible student response] >>What's that? [inaudible student response] >>Of what the tree is. Or even more specifically, what those encodings are, especially since during this story we made a judgment call at one point. Remember that we had to pick arbitrarily between the 2 different 20% nodes? So it's not the case that Bob, the recipient, can just reconstruct the tree on his own, because maybe he will create the tree ever so slightly differently from Alice. Moreover, Bob doesn't even know what the original message is, because the only thing Alice is sending him, of course, is the compressed message. 
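(Aside: assuming Bob does somehow have the very same tree, decoding is mechanical--start at the root, follow a 0 left and a 1 right, and emit a letter every time a leaf is reached, then start over at the root. A sketch, reusing the node struct from the sketch above:)

    // Decode a string of '0'/'1' characters against a given Huffman tree.
    void decode(node *root, const char *bits)
    {
        node *n = root;
        for (int i = 0; bits[i] != '\0'; i++)
        {
            // 0 means go left, 1 means go right
            n = (bits[i] == '0') ? n->left : n->right;
            if (n->left == NULL && n->right == NULL)
            {
                printf("%c", n->symbol);  // reached a letter
                n = root;                 // next letter starts at the root again
            }
        }
        printf("\n");
    }

    // With the slides' codes, decode(root, "101") prints EA: the 1 can only
    // be E, and the 01 can only be A -- no Morse-style ambiguity, because
    // letters live only at the leaves.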
So the catch with compression like this is that, yes, Alice can save a whole lot of bits by sending 1 for E and 01 for A and so forth, but she also has to inform Bob what the mapping is between letters and bits, because they clearly can't rely on just ASCII anymore if we're not using ASCII. So she can either send him the tree somehow--write it down, store it as binary data or something like that--or just send him a little cheat sheet, an Excel file, that shows the mappings. So the effectiveness of compression really assumes that the messages you're sending are pretty big, at least medium-sized, because if you're sending a super short message, if you just want to send the message BAD, which happens to be a word we can spell here, B-A-D, you're probably going to use fewer bits, but the catch is that if you also have to inform Bob what the tree is or what those encodings are, you're probably going to outweigh all of the savings of having compressed things to begin with. So it can actually be the case that if you try compressing, even with something like zip or file formats you might be familiar with, pretty small files, even empty files, sometimes those files might get bigger and not smaller. But realistically, that happens only for small file sizes, so it's not going to make a gigabyte file be 2 gigabytes; we're really talking bytes or just a couple of kilobytes. Some programs like zip are smart enough to realize that, "You're going to spend more bits compressing this. Let me not bother compressing it for you at all." So this is just 1 way then of compressing text. We could implement something like this in C. For instance, here is how we might represent a node in this tree, where we have a char for the symbol, a floating-point value for the frequency, and, as we've seen with our other data structures, 2 pointers, 1 to the left child, 1 to the right, either of which can be NULL, but if not, it refers to a left child and a right child. So this then is Huffman coding, and it's 1 way that you can go about compressing information, and it's certainly one of the easiest to implement in the context of, say, last week's data structures, though even more sophisticated algorithms exist that can do even more sophisticated mutations of your data. Any questions then on trees, binary trees, or compression of text? [student] Is there some ambiguity, like if [inaudible] split into 01, then 011 would be ambiguous, right? [inaudible] >>Good question. Ambiguity. Let me summarize by referring to this picture here. Because the characters you are compressing--the representations of them--by definition of this algorithm always remain at the leaves, you'll never accidentally use the same pattern of bits as the prefix of multiple letters. So in other words, you're concerned about, it sounds like, an ambiguity arising whereby 001 might be the start of B or the start of C or something like that. But that can't be the case, because notice that all of the letters of the alphabet we're encoding are at the leaves. The ambiguity can only arise, as in the case of Morse code, if, for instance, C were somewhere along the path from the root to B. [student] Right. So in that case, say A has 2 leaves. >>Say A has--Say that again. [student] Say A has 2 leaves, F and G, and then G-- >>Okay. But it can't. A itself could not have leaves F and G, because those letters F and G would themselves be leaves somewhere to the left of B or the right of E. So by definition, they must be leaves. 
Otherwise, you're exactly right, we've not solved the problem that Morse code faces. Good question. Other questions? All right. This notion of bits--it turns out we've had power all along that we've not actually used when it comes to manipulating these 0s and 1s. We asked about this on one of the earliest problem sets: namely, how do you go about converting uppercase to lowercase or vice versa? Or more concretely, one of those first psets asked how many bits you actually have to flip in order to change A to lowercase a or vice versa. Here's a quick reminder of what 65 and 97 look like in binary. And even if that question has sort of faded in your memory, you can see again here how many bits need to be flipped to change capital A to lowercase a: just 1. They only differ in 1 location, the third bit from the left. Whereas A starts with 010, little a starts with 011. So somehow we need to just be able to flip that bit, and we can then capitalize or lowercase letters. We've done this in the past by actually using if conditions and checking if the letter is between capital A and capital Z, then computing something like A - "a" + 26 or something like that. You probably did an arithmetic change to the letters of the alphabet. But what if we could just flip that single bit? How could you go about taking 1 byte's worth of bits, so 8 bits like 01000001 and 01100001? If you had those patterns of bits, how can we go about changing just 1 of them? What if we introduce, in yellow here, this other pattern of bits: the whole yellow string 0s except for the 1 bit that I want to change? And then let me introduce a new operator known as a bitwise operator--bitwise in the sense that it operates on individual bits, not on an entire byte or 4 bytes all at once. This vertical bar there in yellow suggests the following: what if we take the representation of capital A and bitwise "or" it with the yellow sequence of bits? In other words, think back to our discussion of Boolean expressions in Scratch and then in C. Doing a Boolean or means that to be true, either the first thing has to be true or the second thing has to be true or they both have to be true, and then the resulting output is itself true. In this case here, what do we get if we take 0 "or"ed with 0? False or false? It's still false, so that bit remains a 0, just as in lowercase a. What if instead we do 1 or 0? This now remains 1, but notice what's about to happen here. If we start with capital A and we continue to "or" its individual bits as we're doing here, 0 or the yellow 1 gives us what down here? This gives us 1. In fact, suppose we didn't know what the lowercase version of capital A actually was. Let's go do this. Let me move this back over here. Let's do this again. 0 or 0 gives me 0. 1 or 0 gives me 1. 0 or 1 gives me 1. 0 or 0 gives me 0. The next one is 0, the next one is 0, the next one is 0. 1 or 0 gives me 1. And so even if we didn't know in advance what lowercase a was, simply by "or"ing A with this pattern of bits that we've presented here in yellow, you can lowercase a capital A by flipping that bit. We used this expression weeks ago: flipping a bit. How do you actually do that programmatically? You use what's generally called a mask, a sequence of bits that in this case just so happens to look like this number here, and then you "or" it together using this new C operator--not ||; you use a single |--and you would actually get this answer here. Why? Because this is the 1s place, 2s place, 4s, 8s, 16s, 32s: the mask, with its lone 1 in the 32s place, is the number 32. 
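(Aside: concretely, the whole trick fits in a couple of lines of C--a hypothetical test program, not the lecture's own code:)

    #include <stdio.h>

    int main(void)
    {
        // 'A' is 01000001; the mask 00100000, i.e. 32, has a 1 in exactly
        // the bit where 'A' and 'a' (01100001) differ
        char c = 'A';
        char lowered = c | 32;  // bitwise OR turns that 1 bit on
        printf("%c becomes %c\n", c, lowered);  // prints: A becomes a
        return 0;
    }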
So it turns out that if you take a capital letter A and bitwise "or" it with the integer 32, because the integer 32 when you look at it as bits looks like this, that means you can flip the bit that you actually want. And similarly--and we'll look at code in just a moment-- suppose we want to go the other direction. How do you go from lowercase a to capital A? Which bit needs to change? It's the same one. We want to change that third bit from a 1 to a 0. And how might we go about doing this? How do we turn off a bit? With what pattern of bits could we turn off a bit? What if we sort of invert the mask? Whereas before we made the whole yellow mask 0s except for the 1 bit we wanted to turn on, what if this time we make the whole mask 1s except for the bit that we want to turn off and then use what operator? What if we "and" things? Let's take a look. If we now flip to this, suppose that again I create a mask that's all 1s except for the 1 bit that I want to turn off and then rather than "or" the white numbers up top with the yellow numbers down here, what if I instead "and" them together? It's called a bitwise and. Logically, it's the same thing as a Boolean and. This gives me 0 & 1 is 0. So false and true is false. True and true is true. And here is the magic: True and false is now false, so we've turned off that bit. And now the rest of the story is somewhat straightforward. Because the rest of the mask is 1s, it doesn't matter what the numbers are in white. When you "and" something with true, you're not going to change its value. If it is true, it will remain true. If it was false, it will remain false. But the magic happens when you take something that was true and you then "and" it with false. This has the effect of turning off that bit. So a little cryptic there. Let's actually look at some code, which might actually look even more cryptic, but let's take a look here at tolower. If I look at tolower, going from capital A to lowercase a, let's see how we might implement this program. Here's main, and it's not taking any command-line arguments. I'm declaring a character c for the letter that the user is going to type in. I then use a familiar do while loop to just make sure that the user definitely gives me a capital A or B or C...Z, so they give me something between A and Z. And now what am I doing here? I'm "or"ing this with 0x20, but that's actually the same as-- and we'll come back to this in a moment--32. So again, 32 is this pattern of bits here. Why do we know this? Just think back to week 0. This is the 1s place, 2s place, 4s, 8s, 16s, 32s place. So this yellow number happens to be 32. I can then take a letter like the char here, bitwise "or" it with literally the number 32, and what do I get back? The lowercase version of that char. A moment ago, though, I expressed this in a different base notation. What did this represent? >>[student] Hexadecimal. [Malan] This happens to represent hexadecimal. We haven't talked about hexadecimal all that much, but it's actually convenient in cases like this. Even though it looks more complex and even though it looks like 20 and not 32, it turns out that hexadecimal is actually super convenient notation because in hexadecimal every digit after the 0x--and this means nothing; this is just human convention that says here comes a hexadecimal number-- each of these digits, the 2 and then the 0, themselves can be represented with exactly 4 bits. 
So if we do this, let me open up a text editor here--weird autocomplete--if we do a little text editor here, the number 0x20 means here are 4 bits, and here's another 4 bits. Let's do the rightmost 4 bits first. 0 when represented with 4 bits is what? Super easy. Just all 0s. So 4 bits as 0s. How do you represent 2? It's been a while since we did this, but it's 0010. So this is the 1s place, this is the 2s place, and then it doesn't matter what the other places are. In other words, in hexadecimal you might say 0x20, but if you then think about what is the 2 and how is it represented in binary, what is the 0 and how is it represented in binary, the answers to those questions are this and this, respectively. So 0x20 happens to represent this pattern of 8 bits, which is precisely the mask that we wanted. So this is for the moment just an intellectual exercise, but the reality is that in code it's typically more common to write constants like this in hexadecimal, because then the programmer can relatively easily, even if it requires some paper and pencil, figure out what that pattern of bits is, because you can't just express 0s and 1s typically in code. You can't go 00010 and so forth. You have to pick decimal or hexadecimal or octal or other notations. Most people tend to pick hexadecimal simply so that each digit represents 4 bits and you can do this quick math. And I'll wave my hand at toupper, which is almost the same; it looks almost identical. Toupper happens to use not the or operator but rather this guy, &, and df. What does df represent? df? Anyone? >>[student] 255. 255? Not 255. That would be ff. We'll leave this one as a little exercise. But if you go from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, what comes after 9? We're kind of out of decimal digits, but in hexadecimal what comes after 9? [student] a. >>So a, b, c, d. You can figure out from there what pattern of bits d actually represents. And if we do the math, we'll see that the mask you end up getting back is identical to this. This is f, all 1s, and this is d. So df represents that mask. All right. And lastly, not to make things sound super, super technical, but suppose we wanted to write a program that does this. Let me go ahead and make binary, which is a program in a file called binary.c. And now let me run binary and give me a non-negative integer. Let's start easy and type in 0. This now is a program that prints out an integer in its binary representation. So if I play this game again and type in just 1, I should get a 32-bit representation of 1. If I do this again with 2, I should get that. If I do 7, I should get a few 1s at the end and so forth. It turns out I mention this because with bitwise operations you can actually do 1 other thing as well. You can create these masks dynamically. Take a look at this 1 final example involving bitwise operations. Here is the first part of the code: prompt the user for a number, and insist that the user give me a non-negative integer. So that's sort of old school stuff. But here is something that's kind of interesting. How do I go about printing a number in binary? I first iterate from what to what? What's the size of an int typically, at least in the appliance? >>[student] 4. It's 4. So 4 * 8 is 32; minus 1 is 31. So if I'm starting to count from 31, that represents, it turns out, just conceptually, the 31st bit or the highest order bit, which is this guy over here, whereas this is going to be bit 0. So this is bit 0, bit 1, and so on, up to bit 31. So what is this code doing? 
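(Aside: binary.c's code isn't shown here in full, but the loop being described might look roughly like this--a sketch, using plain scanf rather than the course's own library:)

    #include <stdio.h>

    int main(void)
    {
        // insist on a non-negative integer
        int n;
        do
        {
            printf("Non-negative integer please: ");
            scanf("%d", &n);
        }
        while (n < 0);

        // iterate over the bits, from bit 31 (highest order) down to bit 0
        for (int i = 31; i >= 0; i--)
        {
            unsigned int mask = 1u << i;       // a lone 1, shifted i places left
            printf("%d", (n & mask) ? 1 : 0);  // is that bit of n turned on?
        }
        printf("\n");
        return 0;
    }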
Notice this for loop, even though it looks cryptic, is just iterating from 31 down to 0. That's it. So the interesting part now must be in these 5 lines here. Notice that in this line I'm declaring a variable called mask, to be consistent with our story of these yellow numbers. And then what is this doing? This is another bitwise operator we've not seen before, most likely. It's the left shift operator, <<. This operator does this. Here is the number 1, and if you do 1 << i, a left shift, what do you think that has the effect of doing to that individual 1? Literally shifting it over. So if the number 1 is what you have on the left and you start by initializing i to 31, what is that going to do? It's going to take this number 1 and shift it 31 places over here. And because there are obviously no other digits behind it, those will by default be replaced with 0s. So you'll start out with the number 1, which of course looks like this--and let me draw it over here in the center. And then as you shift things to the left, this guy essentially goes this way. But as soon as you do that, a 0 gets filled in. If you shift it a second time, it goes this way and another 0 gets filled in. You shift it again and then another 0 gets filled in. So if you do this thing of 1 << i with i being 31, you end up getting a mask that is 32 bits long, the leftmost one of which is a 1, all of the rest of which are 0s. And it turns out, as an aside, that shifting a number to the left like this also coincidentally, and sometimes conveniently, has the effect of doing what to that number? >>[student] Doubling it. Doubling it, because each of the columns--the 1s place, 2s place, 4s place, 8s place, 16s place--they're all doubling as you go to the left. Or rather, when you shift the bits to the left, you're going to end up doubling the value of the number. You can end up doing interesting transformations of digits by shifting everything over in this way by powers of 2. So how does this work? This then gives me a mask that's all 0s except for a 1 in precisely the place I want it, and then this expression, which is stolen from toupper.c, is simply saying take the number n that the user typed in, "and" it with that mask, and what are you going to get? You're going to get a 1 if there's a 1 in that masked location, or you're going to get a 0 if there's not. And so all this program does, effectively, is it has a loop, and it creates a mask with a 1 over here, then a 1 over here, then a 1 over here, and it uses this bitwise and trick to say is there a 1 bit in the user's input here? Is there a 1 bit in the user's input here? And if so, literally print 1, else print 0. We're doing this with ints, which is why we're doing 32 bits instead of 8, but what we've introduced then is this bitwise and, this bitwise or, and this left shift operator, which aren't often terribly helpful, but it turns out they can be. In fact, if you were to represent something like an array of Booleans just to represent true or false, suppose you wanted to keep track of whether or not each student in a room full of 300 is present, you could declare an array of size 300 of type bool, so that you get 300 bools, and you can set each to true if someone is here and false otherwise. Why is that representation in that data structure inefficient? What's bad about the design of that data structure, an array of 300 bools? What is a bool in fact underneath the hood? This too is something that might not be familiar. It turns out there is no bool. 
Remember we sort of created that with the cs50.h file, which itself includes the standard bool header. C is kind of dumb, though, when it comes to bool. It uses 8 bits to represent every bool, which is completely wasteful, because obviously, how many bits do you need to represent a bool? Just 1. So it turns out that if you now have the ability, with bitwise operators, to manipulate individual bits, even in a char, even in a single byte, it turns out you could decrease the memory required to represent something stupid like that attendance-style data structure by a factor of 8. Instead of using 8 bits to represent true or false, you could literally use 1, by using a single byte for every 8 students in the class and toggling individual bits from 0 to 1 by using these kinds of low-level tricks. Are there any questions about bitwise operations? Yeah. >>[student] Is there an exclusive or operator? Yes. There is an exclusive or operator that looks like this, ^, the caret symbol, which means only the first thing or the second thing can be a 1 for the output to be a 1. There is also a not, ~, which will allow you to invert a 0 to a 1 or vice versa as well. And there's also a right shift operator, >>, which is the opposite of the one we saw. All right. Let's take things now to a higher level. We started by talking about text and then compressing it and representing the text with fewer bits; we talked a bit about how we can now start manipulating things on a bitwise level. Let's now zoom back up 10,000 feet to the representation of more complex things like graphics. Here we have a flag of Germany, here we have one of France. These might be represented in file formats you might know--GIFs, for instance. If you've ever seen an image on the Web that ends in .gif, this is a graphics interchange format. These 2 flags here sort of lend themselves to compression for what perhaps obvious reason? >>[inaudible student response] There's a lot of repetition, right? In order to send Germany's flag, think of this as being an image on the screen back in your Scratch days. You might recall that there are individual pixels or dots that compose an image. There's a whole row of black dots and another whole row of black dots. There are a bunch of rows of black dots that we could see if we really zoomed in, much like when we zoomed in on Rob's face in Photoshop. As soon as we got deeper and deeper and deeper into the image, you started seeing the pixelation, all of the squares that composed his eye in that case. Same deal here. If we zoomed in quite a bit, you would see individual dots. Well, this is kind of a waste of bits. If a third of the flag is black and a third of the flag is yellow and so forth, why can't we somehow compress this flag? And even the French flag could be compressed, even though the pattern is a little bit different. It turns out the GIF file format is a lossless compression format, which means you can take an image like the German flag here and you can throw away a lot of its bits without sacrificing quality. This is in contrast to something like JPEGs, with which most of us are probably more familiar. Facebook photos and Flickr photos and the like are almost always saved as JPEGs when they're uploaded, but JPEG is a lossy--L-O-S-S-Y--format, whereby you do throw away bits but you also throw away quality. 
And so if you compress photos with Photoshop or upload them to Facebook or take them on a really crappy phone, you know that the picture starts to get very splotchy and pixelated, and that's because it's being compressed by the computer or phone by literally throwing information away. But GIF is amazing in that it can use fewer bits than it might by default without losing any information. And it essentially does so as follows. Rather than store in a file, like a BMP would, an RGB triple for black, black, black, black, black, black, black, black, black, black, black, black and so forth, rather, the GIF format is going to say, "Black," and then, "Repeat this 100 times," or something like that. "Black, repeat this 100 times, black, repeat this 100 times..." "Yellow, repeat this 100 times." And so it remembers, essentially, the leftmost pixel and then encodes somehow the notion of repeating that pixel again and again. So GIFs can then compress themselves without losing any information. But if you had to guess, if that is the algorithm that GIFs use, which of these flags, even though they look identical in size, is going to be smaller when saved on disk as a GIF? >>[student] Germany. Germany is going to be smaller? Why? [student] Because you repeat it many, many times horizontally and then you repeat another time. >>Exactly. Because the people who invented GIF just kind of arbitrarily decided that the repetition will be leveraged horizontally and not vertically. There's a lot more repetition horizontally here in the German flag than in the French flag. So if we actually open up a folder on my hard drive that has these GIFs, you can actually see that the German flag here is 2 kilobytes and the French one is 4 kilobytes. It happens to be a coincidence that one is twice the other, but it's in fact the case that the French flag is much larger. Even though we're talking here about graphics, the same ideas can apply not to things like flags but to images that are a little more complex. If you take a picture of an apple, surely there's a lot of duplication there, so we could somehow remember that the default background is blue and not, as the right-hand picture suggests, have to remember the color of every single pixel in this picture. So we can throw bits away there without losing information. The apple still looks just the same. In this example here, you might see what happens in a movie. These represent old school film reels, whereby in the top image there you have an RV driving past a house and a tree. And as that van drives past from left to right, what's obviously not changing? The house isn't going anywhere, and the tree is not going anywhere. The only thing that's moving is the van in this case. So as Background Unchanged suggests, what you can do in movies is similarly just throw away information that doesn't change in between frames. This is generally known as interframe compression, whereby if this frame looks almost identical to this one, let's not bother storing on disk any of the identical information on these intermediate frames; let's only use key frames once in a while that actually store that information redundantly, just as a little sanity check. By contrast, another approach to compressing video is in this second and lower example here, where rather than store 30 frames a second, why don't you just store 15 frames a second instead? 
Rather than the movie kind of flowing beautifully, perfectly, it might look like it's stuttering a little bit, a little old school, but the net effect will be to use far fewer bits than might otherwise be necessary. So where does this then leave us? That was a bit of an aside on where else you can go with compression. For more on that, take a class like CS175 here. Here's another example within video. If the bee is the only thing moving, you can really throw away information in those middle frames, because the flower and sky and leaves are not changing. But let's now consider 1 last thing. In the next 5 minutes we leave C behind forever--in lecture, at least. Yes. Not in the psets, though. Last story about C, and then we get to very sexy stuff involving HTML and the Web and woo-hoo. All right. Here we go. That's the motivation. It turns out all this time, when we have been writing programs, we run Clang. And Clang, we've said since the first week pretty much, takes source code and converts it into object code. It takes C and converts it into 0s and 1s. I've kind of been lying to you for a few weeks, because it's not quite as simple as that. There's a lot more going on underneath the hood when you run a program like Clang. In fact, the process of compiling a program can really be summarized, as you might recall from Rob's video on compilers, into these 4 steps: pre-processing, compiling itself, assembling, and linking. But we in class, and most people in the world, typically summarize all of these steps as just "compiling." But if we start with source code like this--recall this is perhaps the simplest C program we've written thus far--recall that when compiled it ends up looking like this. But there's actually an intermediate step, and those steps are as follows. First there's this thing at the very top of this and most of our programs, #include <stdio.h>. What does #include do for us? It pretty much copies and pastes the contents of stdio.h into my file. And why? Why do I care about the contents of stdio.h? What's in there of interest? Printf's declaration, its prototype, so that the compiler then knows what I mean when I mention this function printf. So step 1 in compiling is pre-processing, whereby a program like Clang, or some helper program that Clang comes with, reads your code top to bottom, left to right, and any time it sees a # symbol followed by a keyword like include, it performs that operation, copying and pasting, in this case, stdio.h into your file. That's step 1. Then you have a much bigger C file because of the huge copy-paste job that's just happened. Step 2 now is compiling. But it turns out compiling takes source code that looks like this and turns it into something that looks like this, which for those familiar is called? >>[student] Assembly. >>Assembly language. This is actually something you'll dive into in more detail if you take CS61. This is just about as close as you can get to writing 0s and 1s yourself while still writing things in such a way that makes at least a little bit of sense. These are machine instructions, and if we scroll down to the main function here, notice that there is this push instruction, move instruction, subtract instruction, call instruction, and so forth. When you hear that your computer has Intel inside--you have an Intel CPU in your Mac or PC--what does that mean? A CPU built by companies like Intel comes understanding certain instructions. 
They have no idea what functions like swap are or main are per se, but they do know what very low-level instructions like add, subtract, push, move, call, and so forth are. So when you compile C code into assembly language, your very user-friendly-looking code is converted into something that looks like this, something that literally moves bytes, or 4 bytes at a time, in and out of the CPU in small units like that. But finally, when Clang is ready to take this representation of your program into 0s and 1s, then the step called assembling happens, and this again all happens in the blink of an eye when running Clang. We start here, it outputs a file like this, and then it converts it to these 0s and 1s. And if you want to go back at some point and actually see this in action, if I go into hello1.c--this is one of the very first programs we looked at--normally we would compile this with Clang hello1.c, and this would give us a.out. If by contrast you instead give it the -S flag, what you'll get is hello1.s, and you'll actually see the assembly language. I'm doing this for a very short program, but if you go back to Scramble or Recover or any program you've written and just out of curiosity want to see what it actually looks like, what's actually being fed into the CPU, you can use that -S flag with Clang. But then lastly, there's still 1 gotcha. Here are the 0s and 1s that represent my implementation of hello, world. But I used someone else's function in my program. So even though the process has been that I take hello.c, it gets compiled into assembly code, and then it gets assembled into 0s and 1s, the only 0s and 1s that are outputted at this point in time are the ones that result from my code. But the person who wrote printf--they compiled their code 20 years ago, and it's now installed somewhere on the appliance--so we somehow have to merge his or her 0s and 1s with my 0s and 1s, and that brings us to the 4th and final step of compiling, known as linking. So on the left-hand side we have the exact same picture as before: hello.c becomes assembly code, becomes 0s and 1s. But recall that I used the standard I/O library in my code, and that means somewhere on the computer there's a file called stdio.c, or at least the compiled version thereof, because someone some years ago compiled stdio.c into assembly code and then a whole bunch of 0s and 1s. This is what's known as a static or a dynamic library. It's some file sitting somewhere in the appliance. But lastly, I have to take my 0s and 1s and that person's 0s and 1s and somehow link them together, literally combine those 0s and 1s into a single file called a.out or hello1 or whatever I called my program, so that the end result has all of the 1s and 0s that should compose my program. So all this time this semester, when you've been using Clang, and even more recently running make in order to run Clang, all of these steps have been happening sort of instantaneously but very deliberately. And so if you continue on in computer science, namely CS61, this is the layer that you'll continue to peel back, talking about efficiency, security implications, and the like of these lower-level details. But with that, we're about to leave C behind. Let's go ahead and take our 5-minute break now, and when we come back: the Internet. All right. We are back. 
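(One last C aside, for the curious: Clang can actually be stopped after each of those 4 steps individually. The -S flag is the one from lecture; -E and -c are Clang's standard flags for the other steps, though they weren't demonstrated in lecture.)

    clang -E hello1.c         # 1. preprocess only: #include'd files copied and pasted in
    clang -S hello1.c         # 2. compile to assembly language, producing hello1.s
    clang -c hello1.s         # 3. assemble into 0s and 1s, producing hello1.o
    clang hello1.o -o hello1  # 4. link with printf's 0s and 1s into an executable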
Now we begin our look not just at HTML because, as you will see, HTML itself is actually pretty simple but really at web programming more generally, networking more generally, and how all of these technologies come together to allow us to create much more sophisticated programs atop the Internet than thus far we've been able to in these black and white windows. Indeed, at this point in the semester even though we will spend relatively less time on PHP, HTML, CSS, JavaScript, SQL and more, most students do end up doing final projects that are web-based because as you'll see, the background you now have in C is very much applicable to these higher level languages. And as you start thinking about your final project, which, much like Problem Set 0 where you were encouraged to do most anything of interest to you in Scratch, the final project is your opportunity to take your newfound knowledge and savvy with C or PHP or JavaScript or the like out for a spin and create your very own piece of software for the world to see. And to seed you with ideas, know that you can head here, projects.cs50.net. Every year we solicit ideas from faculty and staff and student groups on campus just to submit their ideas for interesting things that could be solved using computers, using websites, using software. So if you're struggling to come up with an idea of your own, by all means scroll through the ideas there from this year and last. It is perfectly okay to tackle a project that has been tackled before. We have seen many apps for seeing the status of laundry on campus, many apps for navigating the dining hall menu, many apps for navigating the course catalog and the like. And indeed, in a future lecture and in future seminars, we will introduce you to some publicly available APIs, both commercially available as well as here available from CS50 on campus so that you have access to data and can then do interesting things with it. So more on final projects in a few days when we release the specification, but for now, know that you can work solo or with 1 or 2 friends on most any project of interest to you. The Internet. You go ahead and pull out your laptop, you go to facebook.com for the first time, not having logged in recently, and hit Enter. What exactly happens? When you hit Enter on your computer, a whole bunch of steps start sort of magically happening. So you here on the left, web server like Facebook is here on the right, and somehow you're using this language called HTTP, Hypertext Transfer Protocol. HTTP isn't a programming language. It's more of a protocol. It is a set of conventions that web browsers and web servers use when intercommunicating. And what this means is as follows. Much like in the real world, we have these conventions where if you meet some human for the first time, if you don't mind humoring me here, I might come up to you, say, "Hi, my name is David." >>Hi, David. My name is Sammy. "Hi, David. My name is Sammy." So now we have just engaged in this sort of silly human protocol where I have initiated the protocol, Sammy has responded, we've shaken hands, and the transaction is complete. HTTP is very similar in spirit. When your web browser requests www.facebook.com, what your browser is really doing is extending its hand, so to speak, to the server and it's sending it a message. And that message is typically something like get--what do you want to get?-- get me the home page, which is typically denoted by a single slash at the end of a URL. 
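Written out the way a browser actually sends it, that request looks something like this:

    GET / HTTP/1.1
    Host: www.facebook.com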
And just so you know what language I'm speaking, I the browser am going to tell you that I'm speaking HTTP version 1.1, and also, for good measure, I'm going to tell you that the host whose home page I want is facebook.com. Typically, a web browser, unbeknownst to you the human, sends this message across the Internet when you simply type www.facebook.com, Enter, into your browser. And what does Facebook respond with? It responds with some similar-looking cryptic details but also much more. Let me go ahead to Facebook's home page here. This is the screen that most of us probably never see if you stay logged in all of the time, but this is indeed their home page. If we do this in Chrome, notice that you can pull up these little context menus. Using Chrome, whether on Mac OS, Windows, Linux, or the like, if you right click or Control click, you can typically pull up a menu that looks like this, where a few options await, one of which is View Page Source. You can also typically get to these things by going to the View menu and poking around. For instance, here under View, Developer is the same thing. I'm going to go ahead and look at View Page Source. What you'll see is the HTML that Mark has written to represent facebook.com. It's a complete mess here, but we'll see that this makes a little more sense before long. But there are some patterns here. Let me scroll down to stuff like this. This is hard for a human to read, but notice that there's this pattern of angled brackets with keywords like option, keywords like value, some quoted strings. This is where, when you signed up for the very first time, you specified what your birth year is. That drop-down menu of birth years is somehow encoded here in this language called HTML, HyperText Markup Language. In other words, when your browser requests a web page, it speaks this convention called HTTP. But what does facebook.com respond to that request with? It responds with some of these cryptic messages, as we'll see in a moment. But most of its response is in the form of HTML, HyperText Markup Language. That's the actual language in which a web page is written. And what a web browser really does then is, upon receipt of something that looks like this, read it top to bottom, left to right, and any time it sees one of these angled brackets followed by a keyword like option, display that markup language in the appropriate way. In this case it would display a drop-down menu of years. But again, this is a complete mess to look at. This is not because Facebook's developers would merit 0 for 5 for style, for instance. This is because most of the code that they write is in fact written beautifully, well commented, nicely indented, and the like, but of course machines, computers, browsers really don't give a damn whether your code is well styled. And in fact, it's completely wasteful to hit the tab key all those times and to put comments all throughout your code and to choose really descriptive variable names, because if the browser doesn't care, all you're doing at the end of the day is wasting bytes. 
So it turns out that what most websites do is this: even though the source code for facebook.com, for cs50.net, and all of these other websites on the Internet is typically well written and well commented and nicely indented and the like, typically before the website is put onto the Internet, the code is minified, whereby the HTML and the CSS--something else we'll soon see--and the JavaScript code we'll soon see are compressed, whereby long variable names become X and Y and Z, and all of that whitespace that makes everything look so readable is all thrown away, because if you think about it this way, Facebook gets a billion page hits a day--something crazy like that--so what if a programmer, just to be anal, hit the space bar 1 extra time just to indent some line of code ever so much more? What's the implication if Facebook preserves that whitespace in all of the bytes they send back to people on the Internet? Hitting the space bar once gives you an extra byte in your file. And if a billion people then proceed to download the home page that day, how much more data have you transmitted over the Internet? A gigabyte for no good reason. And granted, for a lot of websites this is not such a scalable issue, but for Facebook, for Google, for some of the most popular websites, there's great incentive financially to make your code look like a mess, so that you're using as few bytes as possible, in addition to then compressing it using something like zip, an algorithm called gzip, that the browser does for you automatically. But this is awful. We'll never learn anything about other people's websites and how to design web pages if we have to look at it like this. So fortunately, browsers like Chrome and IE and Firefox these days typically come with built-in developer tools. In fact, if I go down here to Inspect Element, or if I go to View, Developer, and go to Developer Tools explicitly, this window at the bottom of my screen now pops up. It's a little intimidating at first, because there are a lot of unfamiliar tabs here, but if I click on Elements all the way at the bottom left, Chrome is obviously pretty smart. It knows how to interpret all of this code. And so what Chrome does is it cleans up all of Facebook's HTML. Even though there's not whitespace there, there's not indentation there, now notice that I can begin to navigate this web page all the more hierarchically. It turns out that every web page written in a language called HTML5 should start with this, this DOCTYPE declaration, so to speak: <!DOCTYPE html>. It's kind of light and gray there, but that's the very first line of code in this file, and that just tells the browser, "Hey, here comes some HTML5. Here comes a web page." The first open bracket beyond that happens to be this thing, an open bracket HTML tag, <html>, and then if I dive in deeper--these arrows are completely meaningless; they are just for presentation's sake, they are not actually in the file--notice that inside of Facebook's HTML tag, anything that starts with an open bracket and then has a word is called a tag. So inside the HTML tag is apparently a head tag, <head>, and a body tag, <body>. Inside of the head tag now is a whole mess for Facebook, because they have a lot of metadata and other things for marketing and advertising. But if we scroll down, down, down, down, let's see where it is. Here it is. This one is at least somewhat familiar. The title of Facebook's home page, if you ever look in the tab in your title bar, is Welcome to Facebook - Log In, Sign Up or Learn More. 
That's what you would see in Chrome's title bar, and that's how it's represented in code. If we ignore everything else in the head, most of the guts of a web page are in the body, and it turns out that Facebook's code is going to look more complex than most things we'll write initially, just because it's been built up over the years, but there's a whole lot of script tags, JavaScript code, that makes the website very interactive: seeing status updates instantaneously using languages like JavaScript. There's something called a div, which is a division of a page. But before we get to that detail, let's try to zoom out and look at a simpler version of Facebook 1.0, so to speak. Here is the hello, world of web pages. It has that DOCTYPE declaration at the very top, which is a little different from everything else. Nothing else we write in a web page is going to start with <! like that. Beyond that come the now-familiar tags: an open <html> tag, a head with a title, and a body whose content is simply hello, world. Now suppose I want the word world to appear in bold. I can surround it with <b>, for bold. Again, the story is the same: h-e-l-l-o, comma, then <b> says start making this bold, then world gets printed in bold, and </b> means stop printing this in bold. Let me go ahead and save my file, go back to Chrome, I'll zoom in just so we can see it better, and reload, and you'll see that world is now in bold. The Web is all about hyperlinks, so let's go ahead and do this: my favorite website is, let's say, youtube.com. Save, reload. Okay. There are a couple of problems now besides the hideousness of the website. 1, I'm pretty sure I hit Enter here. And I did. I not only hit Enter, I also indented, practicing what we've been preaching about style, but "my" is right next to "world." So why is this? Browsers only do what you tell them to do. I have not told the browser, "Break lines here. Insert paragraph break here." So the browser--it doesn't matter if I hit Return 30 times--is still going to put "my" right next to "world." What I really have to do here is say something like <br>
to insert a line break. And actually, a line break is kind of a weird thing, because you can't really start moving to another line, then do something, and then stop moving to a new line. It's kind of an atomic operation. You either do it or you don't. You hit Enter or you don't. So br is a little bit of a different tag, and so I need to sort of both open and close it all at once. The syntax for that is <br/>. Technically, you could do something like <br></br> in some versions of HTML, but that's just stupid, because there's no reason to start and stop something if you can instead do it all at once. Realize that HTML5 does not strictly require this slash, so you will see textbooks and online resources that don't have it, but for good measure let's practice the symmetry that we've seen thus far. This means that the tag is both opened and closed. So now let me save my file, go back here. Okay, so it's starting to look better, except the Web I know is kind of clickable, and yet youtube here doesn't seem to lead to anything. That's because even though it looks like a link, the browser doesn't know that per se, so I have to tell the browser that this is a link. The way to do this is to use an anchor tag, <a href="http://www.youtube.com">, and let me move this to a new line just so it's a little more readable, and I'll shrink the font size. Am I done yet? No. There's going to be this dichotomy. This tag, the anchor tag, does indeed take an attribute, which modifies its behavior, and the value of that attribute is apparently YouTube's URL. But notice the dichotomy: just because that's the URL you're going to, that doesn't mean that has to be the word that you're underlining and making a link. Rather, that can be whatever text you like--here, the youtube.com I've already typed. So I have to say stop making this word a hyperlink, by using the close anchor tag, </a>. Notice I'm not writing the attribute out again in the close tag. For 1, that would just be a waste of everyone's time, and it's not necessary. To close a tag, you only mention the name of the tag again. You don't mention any of the attributes. So let's save that, go back. Okay, voila, now it's blue and hyperlinked. If I click it, I actually do go to YouTube. So even though my web page is not on the Internet, it is at least HTML, and if we let the Internet catch up, we would actually end up here at youtube.com. And I can go back, and here's my web page. But notice this. If you've ever gotten spam or a phishing attack, now you have the ability, after just 5 minutes, to do the same. We can go here and make the URL something like www.badguy.com, or whatever the sketchy website is, and then say verify your PayPal account. [laughter] And now this is going to go to badguy.com, which I'm not going to click on, because I have no idea where that leads. [laughter] But we now have the ability to actually end up there. So we're really just starting to scratch the surface. We're not programming per se; we're writing markup language. But as soon as we round out our vocabulary in HTML, we'll introduce PHP, an actual programming language that will allow us to generate HTML automatically, generate CSS automatically, so that we can begin on Wednesday to implement, say, our own search engine and more. But more on that in a couple of days. We'll see you then. [CS50.TV]