1 00:00:00,000 --> 00:00:02,440 [Week 7] 2 00:00:02,440 --> 00:00:04,730 [David J. Malan - Harvard University] 3 00:00:04,730 --> 00:00:07,490 [This is CS50. - CS50.TV] 4 00:00:07,490 --> 00:00:12,280 All right. Welcome back. This is CS50, and this is the start of week 7. 5 00:00:12,280 --> 00:00:14,690 A couple of little announcements: 6 00:00:14,690 --> 00:00:18,150 Pset5 is now in progress, or soon shall be, 7 00:00:18,150 --> 00:00:21,590 and let me say, quite honestly, this does tend to be among the more challenging 8 00:00:21,590 --> 00:00:24,460 of the course's problem sets, so let me mention this now 9 00:00:24,460 --> 00:00:28,190 so that this week more than ever you don't wait until, say, Wednesday night 10 00:00:28,190 --> 00:00:29,920 or Thursday night to dive in. 11 00:00:29,920 --> 00:00:32,369 This is definitely an interesting pset. We think it's fun. 12 00:00:32,369 --> 00:00:36,110 If you actually get it fully correct and can then challenge the so-called Big Board, 13 00:00:36,110 --> 00:00:39,830 you'll have an opportunity to match wits with some of the course's staff 14 00:00:39,830 --> 00:00:41,620 and some of your classmates. 15 00:00:41,620 --> 00:00:44,670 What The Big Board is is once you have your spell-checker working, 16 00:00:44,670 --> 00:00:48,860 you'll be able to go to cs50.net after running a command, 17 00:00:48,860 --> 00:00:52,430 purely opt in, and then the amount of time and the amount of RAM and more 18 00:00:52,430 --> 00:00:56,130 that you have used in your implementation will be exhibited here on the course's home page. 19 00:00:56,130 --> 00:00:59,740 You'll notice that a whole bunch of these folks here are listed as staff 20 00:00:59,740 --> 00:01:04,220 since over the weekend, the staff thought it would be fun to try to outdo each other. 21 00:01:04,220 --> 00:01:07,390 So realize that the goal here is not to outdo the staff. 22 00:01:07,390 --> 00:01:09,790 Even I am only here at number 13. 23 00:01:09,790 --> 00:01:13,790 Purely opt in, but it's an opportunity to see just how little RAM 24 00:01:13,790 --> 00:01:16,790 and how few CPU seconds you can use vis-a-vis some of your classmates. 25 00:01:16,790 --> 00:01:20,540 >> And I'll admit that Kevin Michael Schmid, 26 00:01:20,540 --> 00:01:23,750 currently in number 1 position as one of the TFs, 27 00:01:23,750 --> 00:01:28,120 this is an implementation that we call not possible 28 00:01:28,120 --> 00:01:32,700 given that he's using almost 0 RAM and almost 0 seconds for loading. 29 00:01:32,700 --> 00:01:35,670 So we'll take care of Kevin offline. [laughter] 30 00:01:35,670 --> 00:01:40,950 There are certain skills that Kevin is putting to the test here. 31 00:01:40,950 --> 00:01:45,280 One of the things we thought we'd do too is now CS50x is a week in progress, 32 00:01:45,280 --> 00:01:49,520 and you guys are as much a part of this experiment as those students are. 33 00:01:49,520 --> 00:01:53,720 We've asked them as part of their pset0, which was similarly to submit a Scratch project 34 00:01:53,720 --> 00:01:58,280 of interest to them--a game, an interactive piece of art, an animation, or the like-- 35 00:01:58,280 --> 00:02:03,700 a 1- to 2-minute video, if they would like, saying hello to the world and who they actually are. 36 00:02:03,700 --> 00:02:06,780 I thought I'd share with you just a couple of the videos that have been submitted thus far 37 00:02:06,780 --> 00:02:10,759 because for us, on the staff at least, it really has been exciting 38 00:02:10,759 --> 00:02:14,220 and inspiring to see these folks from all over the world--countries all over the world-- 39 00:02:14,220 --> 00:02:18,160 tuning in, of all things, to a computer science course on the Internet, 40 00:02:18,160 --> 00:02:20,410 whether it's because they want to continue their own studies, 41 00:02:20,410 --> 00:02:22,300 they want to take their careers in a new direction, 42 00:02:22,300 --> 00:02:24,390 they want to fill in gaps in their own knowledge, 43 00:02:24,390 --> 00:02:27,190 so some of the same reasons that you guys perhaps have been here. 44 00:02:27,190 --> 00:02:31,090 >> So I give you one such student here. You could raise the volume just a little bit. 45 00:02:31,090 --> 00:02:35,520 Here is one of our student's 1-minute submissions. 46 00:02:35,520 --> 00:02:40,380 Hello, world. I am a student of industrial engineering here in Malaga, Spain. 47 00:02:40,380 --> 00:02:45,840 I am excited about this online course because I love computer science, I really do, 48 00:02:45,840 --> 00:02:48,880 and I truly appreciate that I get to explore it. 49 00:02:48,880 --> 00:02:51,940 And the fact that I can learn the same all of you guys do 50 00:02:51,940 --> 00:02:57,040 but instead of being in Harvard I am in Malaga, how awesome is that? 51 00:02:57,040 --> 00:03:02,040 Well, I am Fernando, and this is CS50. See you guys. 52 00:03:02,040 --> 00:03:07,100 [laughter] Another clip we particularly like, you'll find that this gentleman's English isn't so strong. 53 00:03:07,100 --> 00:03:11,520 It looks like he had it machine translated, so the translations themselves are a bit imperfect, 54 00:03:11,520 --> 00:03:15,790 but this was one of our favorites thus far as well. 55 00:03:25,080 --> 00:03:29,980 [♪♪] 56 00:03:29,980 --> 00:03:32,370 Hello, world. [speaking in Japanese] 57 00:03:32,370 --> 00:03:39,830 [I have to greet in Japanese because my English is very unreliable.] 58 00:03:39,830 --> 00:03:45,380 [I have delivered the message to you from the city of Gifu, Japan.] 59 00:03:45,380 --> 00:03:49,820 [I can be a student for the first time in 20 years, as can be seen.] 60 00:03:49,820 --> 00:03:54,640 [I am very grateful to Harvard University who gave me this opportunity and edX.] 61 00:03:54,640 --> 00:04:01,510 [Golf is a guitar and my favorite thing running.] [laughter] 62 00:04:01,510 --> 00:04:05,750 [♪♪] 63 00:04:05,750 --> 00:04:10,790 [Why do you think I was trying to attend a cs50x.] 64 00:04:10,790 --> 00:04:14,990 [Harvard University, it is my longing.] 65 00:04:14,990 --> 00:04:19,740 [Especially if I am distant presence lived in Japan.] 66 00:04:19,740 --> 00:04:26,680 [I wanted to try immediately aware of the existence of such edX when.] 67 00:04:26,680 --> 00:04:32,500 [Do not you think so you do not related to the age of learning I.] 68 00:04:32,500 --> 00:04:38,350 [cs50 is my longing. My name is Kazu, and this is cs50.] 69 00:04:38,350 --> 00:04:43,090 [♪♪] [applause and cheering] 70 00:04:43,090 --> 00:04:49,220 Another favorite of ours was this submission here from someone. 71 00:04:51,070 --> 00:04:55,380 [♪♪] [Malan] Google it if you're unfamiliar with this meme. 72 00:04:55,380 --> 00:05:01,480 >> And then lastly, a couple of others that got posted that perhaps win the adorable award. 73 00:05:01,480 --> 00:05:06,820 [students] Aww! >>[Malan] We'll have to listen. This is short, so listen closely. 74 00:05:08,580 --> 00:05:11,150 [female speaker] What's your name? >>Louie. 75 00:05:11,150 --> 00:05:16,120 [female speaker] What's this? >>[giggles] CS50. [laughter] 76 00:05:16,120 --> 00:05:19,510 [Malan] He did two takes, though. 77 00:05:19,510 --> 00:05:22,240 Here we go, the last. 78 00:05:23,030 --> 00:05:26,980 My name is Louie, and this is CS50. 79 00:05:26,980 --> 00:05:30,250 [laughter] This then is CS50x. 80 00:05:30,250 --> 00:05:33,230 Thank you to all of those of you while following along at home 81 00:05:33,230 --> 00:05:35,620 who have been partaking thus far. 82 00:05:35,620 --> 00:05:39,510 Today, we conclude our discussion of data structures, 83 00:05:39,510 --> 00:05:41,160 at least some of the most fundamental, 84 00:05:41,160 --> 00:05:44,760 and then we continue our conversation about HTML and web programming. 85 00:05:44,760 --> 00:05:48,520 Indeed, we've spent the past some seven weeks looking at the fundamentals of programming-- 86 00:05:48,520 --> 00:05:50,450 algorithms, data structures, and the like-- 87 00:05:50,450 --> 00:05:53,050 and C, as you may have experienced thus far, 88 00:05:53,050 --> 00:05:57,060 isn't necessarily the most accessible of languages 89 00:05:57,060 --> 00:05:59,090 with which to implement some of those ideas. 90 00:05:59,090 --> 00:06:01,880 And so starting this week and next week and then the following, 91 00:06:01,880 --> 00:06:07,110 we'll finally be able to transition from C, which is generally known as a fairly low-level language, 92 00:06:07,110 --> 00:06:11,190 to things higher level, among them PHP, JavaScript, and the like, 93 00:06:11,190 --> 00:06:14,850 which we'll see draw upon the same lessons that we've learned over the past few weeks, 94 00:06:14,850 --> 00:06:19,430 but you'll find that declaring things like arrays and hash tables and searching and sorting 95 00:06:19,430 --> 00:06:23,370 become so much easier because the languages themselves we'll start using 96 00:06:23,370 --> 00:06:25,290 will become more powerful. 97 00:06:25,290 --> 00:06:27,410 But first, an application of trees. 98 00:06:27,410 --> 00:06:30,240 It's very common these days to need to compress information. 99 00:06:30,240 --> 00:06:34,770 In what context would you want to compress some kind of digital information? 100 00:06:37,190 --> 00:06:39,670 >> Yeah. >>[student] When you need to send it over the Web. 101 00:06:39,670 --> 00:06:41,450 Yeah, when you want to send something over the Web. 102 00:06:41,450 --> 00:06:44,950 If you want to download a big file, it's ideal if someone on the other end 103 00:06:44,950 --> 00:06:48,760 has compressed that file using a zip format or something like that 104 00:06:48,760 --> 00:06:53,760 so that you're sending fewer bits than might otherwise be transmitted. 105 00:06:53,760 --> 00:06:55,500 So how do you compress information? 106 00:06:55,500 --> 00:07:00,540 It all boils down to using fewer bits than are required by default. 107 00:07:00,540 --> 00:07:03,220 But this is kind of a curious thing because think back to weeks 0 and 1 108 00:07:03,220 --> 00:07:07,370 when we talked about ASCII and binary and we talked about ASCII in particular 109 00:07:07,370 --> 00:07:10,690 as using 8 bits to represent letters of the alphabet 110 00:07:10,690 --> 00:07:16,120 so that the letter A is represented by 65, lowercase a is the number 97, 111 00:07:16,120 --> 00:07:21,210 and however you represent the 65 or 97, you're using 7 or 8 bits. 112 00:07:21,210 --> 00:07:24,120 But the catch is that there are some letters in the English alphabet 113 00:07:24,120 --> 00:07:26,230 that aren't as popular as others. 114 00:07:26,230 --> 00:07:31,600 Z isn't all that popular, Q isn't all that popular, but A and E are super popular. 115 00:07:31,600 --> 00:07:37,280 And yet for all of these letters, by default the world uses the same number of bits, just 8. 116 00:07:37,280 --> 00:07:42,690 So wouldn't it have been smarter if instead of using 8 bits for every letter, 117 00:07:42,690 --> 00:07:47,440 even the most infrequently used like Q and Z, 118 00:07:47,440 --> 00:07:51,910 what if we used fewer bits for A and E and S and the most popular letters 119 00:07:51,910 --> 00:07:55,000 and used more bits for the less popular letters, 120 00:07:55,000 --> 00:07:57,770 the idea being let's optimize for the common case, 121 00:07:57,770 --> 00:08:01,160 which is a theme in computer science of trying to optimize what's going to happen the most 122 00:08:01,160 --> 00:08:05,310 and spend a little more time, a little more space on the things that, yeah, might happen 123 00:08:05,310 --> 00:08:07,680 but not necessarily as frequently. 124 00:08:07,680 --> 00:08:09,330 So let's take an example. 125 00:08:09,330 --> 00:08:12,610 >> Suppose that we want to encode information fairly efficiently. 126 00:08:12,610 --> 00:08:15,090 You might have grown up knowing a little something about Morse code, 127 00:08:15,090 --> 00:08:17,450 and odds are you didn't know the actual code, 128 00:08:17,450 --> 00:08:21,750 but you might recall that it's at least this series of dots and dashes. 129 00:08:21,750 --> 00:08:26,640 This is a fairly efficient coding, and notice that the most popular letter--for instance, E-- 130 00:08:26,640 --> 00:08:28,980 uses the shortest of beeps. 131 00:08:28,980 --> 00:08:31,740 Morse code is all about beep-beep-beep-beep-beep-beep and holding tones 132 00:08:31,740 --> 00:08:34,799 either for short periods of time or long periods of time. 133 00:08:34,799 --> 00:08:40,330 E, as denoted by the dot, is a super short beep, just beep, and that would represent E. 134 00:08:40,330 --> 00:08:43,960 By contrast, T would be a longer beep, like beep [prolongs sound], 135 00:08:43,960 --> 00:08:45,710 and that would represent T. 136 00:08:45,710 --> 00:08:48,840 But that's still pretty short because, by contrast, if you look at Z, 137 00:08:48,840 --> 00:08:52,690 to express Z you would go beep, beep [longer sound], beep, beep [shorter sound]. 138 00:08:52,690 --> 00:08:55,360 So it's longer because it's less common. 139 00:08:55,360 --> 00:08:58,150 But the gotcha here is that Morse code is a bit flawed 140 00:08:58,150 --> 00:09:00,610 in that it's not immediately decodable. 141 00:09:00,610 --> 00:09:07,350 For instance, suppose that you hear on some end of the wire beep [short], beep [long]. 142 00:09:07,350 --> 00:09:12,480 What message did I just receive? A dot and a dash. What does that represent? 143 00:09:12,480 --> 00:09:15,330 [student] A. >>[Malan] Maybe. 144 00:09:15,330 --> 00:09:18,270 It could also be E followed by T. 145 00:09:18,270 --> 00:09:23,390 In other words, Morse code, though it leverages this principle of optimizing the corner case, 146 00:09:23,390 --> 00:09:26,250 it doesn't lend itself to immediate decodability. 147 00:09:26,250 --> 00:09:29,850 That is, the human who is hearing or receiving these dots and dashes 148 00:09:29,850 --> 00:09:34,540 has to somehow figure out where the breaks are between letters, 149 00:09:34,540 --> 00:09:39,660 because if you don't know where those breaks are, you might confuse A for ET or vice versa. 150 00:09:39,660 --> 00:09:43,880 >> So what might you do? In Morse code you could just pause between each of the letters. 151 00:09:43,880 --> 00:09:47,660 But pausing is kind of counter to the whole point of speeding things up. 152 00:09:47,660 --> 00:09:52,880 So what if instead we came up with a code where there was not this bad situation 153 00:09:52,880 --> 00:09:56,570 where E is a prefix, for instance, of A-- 154 00:09:56,570 --> 00:10:00,020 in other words, if we could make sure that the patterns are still short for the popular letters 155 00:10:00,020 --> 00:10:04,850 long for the less popular letters, but there's no possible confusion? 156 00:10:04,850 --> 00:10:08,930 A man by the name of Huffman years ago invented this scheme called Huffman coding 157 00:10:08,930 --> 00:10:12,390 that actually leverages one of the data structures we've spent a bit of time talking about 158 00:10:12,390 --> 00:10:16,560 this past week, that of trees, binary trees specifically-- 159 00:10:16,560 --> 00:10:19,710 a binary tree meaning that it has no more than 2 children. 160 00:10:19,710 --> 00:10:22,720 It has maybe a left child, maybe a right child, and that's it. 161 00:10:22,720 --> 00:10:26,510 So suppose just for the sake of discussion that someone wants to send a message 162 00:10:26,510 --> 00:10:31,270 that looks like this. It's complete nonsense but it's composed of As, Bs, Cs, Ds, and Es. 163 00:10:31,270 --> 00:10:34,890 And if you actually count up all of the As, Bs, Cs, Ds, and Es 164 00:10:34,890 --> 00:10:36,870 and then divide by the total number of letters, 165 00:10:36,870 --> 00:10:42,710 this little chart here says that 45% of the letters are Es, 20% are As, 166 00:10:42,710 --> 00:10:45,010 10% Bs, and so forth. 167 00:10:45,010 --> 00:10:47,330 So in other words, assume that the quoted string there 168 00:10:47,330 --> 00:10:49,080 is just some message that you want to send. 169 00:10:49,080 --> 00:10:52,180 It happens to be nonsense just so we can use as few letters as possible, 170 00:10:52,180 --> 00:10:55,220 but it's indeed the case that E remains the most popular, 171 00:10:55,220 --> 00:11:01,450 and B and C are the least popular, at least of these 5 letters of the alphabet. 172 00:11:01,450 --> 00:11:04,040 So how can we go about coming up with an encoding, 173 00:11:04,040 --> 00:11:08,430 a binary encoding, a pattern of 0s and 1s for each of these letters 174 00:11:08,430 --> 00:11:14,820 in such a way that E is a short pattern and maybe B and C are slightly longer patterns, 175 00:11:14,820 --> 00:11:19,270 again, the idea being that we want to use fewer bits most of the time 176 00:11:19,270 --> 00:11:21,790 and more bits only once in a while. 177 00:11:21,790 --> 00:11:26,070 According to Huffman coding, you can create a forest of trees. 178 00:11:26,070 --> 00:11:31,190 There's sort of a story line here that involves trees and also the process of building them up. 179 00:11:31,190 --> 00:11:32,420 Let's begin. 180 00:11:32,420 --> 00:11:36,140 >> I propose that you start with this forest, so to speak, of 5 trees, 181 00:11:36,140 --> 00:11:38,260 each of which is a pretty stupid tree. 182 00:11:38,260 --> 00:11:42,800 The tree is composed of just a single node, as represented here by a circle. 183 00:11:42,800 --> 00:11:45,310 So each of these things might be a C struct 184 00:11:45,310 --> 00:11:50,200 and inside of the C struct might be a float representing the frequency count 185 00:11:50,200 --> 00:11:52,510 and then maybe a char representing the letter. 186 00:11:52,510 --> 00:11:56,470 So think of these nodes as just any old C struct but, for now, higher level. 187 00:11:56,470 --> 00:12:01,230 This is a forest of 5 trees, each of who only have a single node. 188 00:12:01,230 --> 00:12:06,830 What Huffman proposed is that we start to combine those trees 189 00:12:06,830 --> 00:12:11,140 that have the smallest frequency counts into slightly bigger trees 190 00:12:11,140 --> 00:12:13,490 by connecting them with a new root node. 191 00:12:13,490 --> 00:12:17,560 So among the letters here, notice that for convenience I've sorted them from left to right, 192 00:12:17,560 --> 00:12:21,420 although that's not strictly necessary, and notice that the smallest nodes 193 00:12:21,420 --> 00:12:23,930 are currently 10% and 10%. 194 00:12:23,930 --> 00:12:28,940 So Huffman proposed that we merge those 2 smallest nodes into a new tree 195 00:12:28,940 --> 00:12:34,450 by introducing a new parent node and then give that parent a left child and a right child 196 00:12:34,450 --> 00:12:37,720 where B is arbitrarily the left and C is arbitrarily the right. 197 00:12:37,720 --> 00:12:41,590 And then Huffman further proposed that let's now just think of the left child 198 00:12:41,590 --> 00:12:44,790 in one of these trees always as being represented by 0 199 00:12:44,790 --> 00:12:47,890 and the right child always as being represented by the number 1. 200 00:12:47,890 --> 00:12:50,680 >> It doesn't matter if you flip them so long as you're consistent. 201 00:12:50,680 --> 00:12:54,650 So now we have four trees in this forest. 202 00:12:54,650 --> 00:12:58,050 And I say four because now the tree on the left-- 203 00:12:58,050 --> 00:13:00,570 and it's not so much a tree in the sense that it grows this way, 204 00:13:00,570 --> 00:13:05,170 it's more like a family tree where now the 0.2 is sort of the parent of the two children-- 205 00:13:05,170 --> 00:13:07,930 notice that in that parent we've drawn 0.2. 206 00:13:07,930 --> 00:13:13,370 We've added the frequency counts of the two children and given the new node the total sum. 207 00:13:13,370 --> 00:13:15,310 So now we just repeat this process. 208 00:13:15,310 --> 00:13:19,490 Find the two smallest nodes and then join them into a new tree 209 00:13:19,490 --> 00:13:21,380 and then repeat the process further. 210 00:13:21,380 --> 00:13:26,390 Right now we have a few candidates, 20%, 15%, and another 20%. 211 00:13:26,390 --> 00:13:29,780 In this case, we have to break the tie. We can do it arbitrarily. 212 00:13:29,780 --> 00:13:31,540 We should just do it consistently. 213 00:13:31,540 --> 00:13:33,760 In this case, I'll arbitrarily go with the one on the left, 214 00:13:33,760 --> 00:13:39,880 and I now merge the 20% and the 15% to give me a new parent called 35%, 215 00:13:39,880 --> 00:13:46,310 whose left child is 0, whose right child is 1, and now we have just three trees in the forest. 216 00:13:46,310 --> 00:13:47,960 You can perhaps see where this is going. 217 00:13:47,960 --> 00:13:51,150 If we repeat this a couple more times, we're going to have just one bigger tree, 218 00:13:51,150 --> 00:13:53,900 all of whose edges are labeled with 0s and 1s. 219 00:13:53,900 --> 00:13:55,710 Let's do it again. 220 00:13:55,710 --> 00:14:02,600 35% is that tree's root. 20% and 45%, so we're going to merge the 35% and 20%. 221 00:14:02,600 --> 00:14:05,610 Now we have this tree here. We add those together, we have 55%. 222 00:14:05,610 --> 00:14:07,910 Now there's only two trees in the forest. 223 00:14:07,910 --> 00:14:11,900 We do this one final time, and hopefully mathematically all the frequencies add up 224 00:14:11,900 --> 00:14:15,570 because they should since we computed them from the get-go to add up to 100%. 225 00:14:15,570 --> 00:14:17,960 And now we have one tree. 226 00:14:17,960 --> 00:14:20,580 So this is a Huffman coding tree. 227 00:14:20,580 --> 00:14:24,400 It kind of took a while to get there verbally, but the reality is with a for loop 228 00:14:24,400 --> 00:14:27,620 or with a recursive function, you could build this thing up pretty fast. 229 00:14:27,620 --> 00:14:32,440 So now we have one new node, and all of these inner nodes have been malloc'd, 230 00:14:32,440 --> 00:14:34,690 presumably, along the way. 231 00:14:34,690 --> 00:14:38,650 So now at the top of this tree we have 100%, but now notice we have a path 232 00:14:38,650 --> 00:14:43,780 from this new great-great-great-grandparent to all of the great-great-great-grandchildren 233 00:14:43,780 --> 00:14:45,930 all the way at the bottom, to all of the leaves. 234 00:14:45,930 --> 00:14:52,840 >> What we're going to do now is propose that in order to represent the letter E, 235 00:14:52,840 --> 00:14:55,670 we will simply use the number 1. Why? 236 00:14:55,670 --> 00:15:01,000 Because if we traverse this tree from the final root down to the leaf known as E, 237 00:15:01,000 --> 00:15:06,050 we follow just one edge, the right edge, and that's labeled of course at top right 1. 238 00:15:06,050 --> 00:15:11,550 So the implication here for Huffman was that E's encoding in binary shall just be 1. 239 00:15:11,550 --> 00:15:14,490 And that's pretty damn efficient. Can't really get any smaller than that. 240 00:15:14,490 --> 00:15:18,350 By contrast, A is going to be represented, if you follow the logic, 241 00:15:18,350 --> 00:15:21,610 by what pattern of bits instead? 01. 242 00:15:21,610 --> 00:15:25,500 So to get to A, we start at the root and we go left and then we go right, 243 00:15:25,500 --> 00:15:28,580 which means we followed a 0 and then a 1. 244 00:15:28,580 --> 00:15:32,810 So we shall represent the letter A with the pattern 0 and 1. 245 00:15:32,810 --> 00:15:36,010 And now notice we already have a property of immediate decodability 246 00:15:36,010 --> 00:15:38,090 that we didn't have in Morse code. 247 00:15:38,090 --> 00:15:42,840 Even though both of these patterns are pretty short--E is 1 bit, A is 2 bits-- 248 00:15:42,840 --> 00:15:45,080 notice that they can't be confused one or the other, 249 00:15:45,080 --> 00:15:54,870 because if you see a 1 it's got to be an E, if you see a 0 then a 1 it's obviously got to be an A. 250 00:15:54,870 --> 00:15:58,410 Similarly, what's D? 001. 251 00:15:58,410 --> 00:16:01,440 What is C? 0001. 252 00:16:01,440 --> 00:16:05,320 And what is B? 0000. 253 00:16:05,320 --> 00:16:09,550 And again, because all of the letters we care about are at the leaves 254 00:16:09,550 --> 00:16:13,890 and none of them are kind of middlemen in the path from root to leaf, 255 00:16:13,890 --> 00:16:18,760 there's no risk of conflating 2 letters' different encodings 256 00:16:18,760 --> 00:16:22,300 because all of these bit patterns are deterministic. 257 00:16:22,300 --> 00:16:25,280 0000 will always be B. 258 00:16:25,280 --> 00:16:29,480 There's no node somewhere in between that you might confuse one letter for the other. 259 00:16:29,480 --> 00:16:31,150 So what's the implication here? 260 00:16:31,150 --> 00:16:35,080 >> The most popular letter--in this case E--has gotten the shortest encoding, 261 00:16:35,080 --> 00:16:37,430 A has gotten the next shortest encoding, 262 00:16:37,430 --> 00:16:41,390 and B and C, which we already knew from the get-go were kind of the least popular 263 00:16:41,390 --> 00:16:45,390 at 10% frequency each, they have gotten the longest encoding. 264 00:16:45,390 --> 00:16:49,410 And so what this means now is that if you want to send a message that's compressed 265 00:16:49,410 --> 00:16:51,950 over the Internet or in an email or the like, 266 00:16:51,950 --> 00:16:56,730 rather than using standard ASCII, you can send a Huffman coded message 267 00:16:56,730 --> 00:17:01,720 whereby if you want to send the letter E, you send just a single bit. 268 00:17:01,720 --> 00:17:05,680 If you want to send an A, you send 2 bits, 01, instead of sending 8 bits 269 00:17:05,680 --> 00:17:10,190 followed by another 8 bits followed by another 8 bits and so forth. 270 00:17:10,190 --> 00:17:11,940 But there is a gotcha here. 271 00:17:11,940 --> 00:17:17,079 It's not sufficient to just construct this tree and then start sending from Alice to Bob 272 00:17:17,079 --> 00:17:20,010 the shorter bit pattern, string from ASCII, 273 00:17:20,010 --> 00:17:23,140 because Alice also has to inform Bob of what 274 00:17:23,140 --> 00:17:26,880 if Bob is going to be able to read her compressed message? 275 00:17:26,880 --> 00:17:30,770 [inaudible student response] >>What's that? 276 00:17:30,770 --> 00:17:32,310 [inaudible student response] >>Of what the tree is. 277 00:17:32,310 --> 00:17:35,160 Or even more specifically, what those encodings are, 278 00:17:35,160 --> 00:17:39,010 especially since during this story we made a judgment call at one point. 279 00:17:39,010 --> 00:17:43,640 Remember that we had to pick arbitrarily between the 2 different 20% nodes? 280 00:17:43,640 --> 00:17:49,800 So it's not the case that Bob, the recipient, can just reconstruct the tree on his own 281 00:17:49,800 --> 00:17:53,390 because maybe he will create the tree ever so slightly differently from Alice. 282 00:17:53,390 --> 00:17:56,670 Moreover, Bob doesn't even know what the original message is 283 00:17:56,670 --> 00:18:00,770 because the only thing Alice is sending him, of course, is the compressed message. 284 00:18:00,770 --> 00:18:05,900 >> So the catch with compression like this is that, yes, Alice can save a whole lot of bits 285 00:18:05,900 --> 00:18:09,900 by sending 1 for E and 01 for A and so forth, 286 00:18:09,900 --> 00:18:15,180 but she also has to inform Bob what the mapping is between letters and bits 287 00:18:15,180 --> 00:18:19,620 because they can't clearly rely on just ASCII anymore if we're not using ASCII. 288 00:18:19,620 --> 00:18:22,200 So she can either send him the tree somehow-- 289 00:18:22,200 --> 00:18:26,600 write it down, store it as binary data or something like that-- 290 00:18:26,600 --> 00:18:30,280 or just send him a little cheat sheet, an Excel file, that shows the mappings. 291 00:18:30,280 --> 00:18:36,480 So the effectiveness of compression really assumes that the messages that you're sending 292 00:18:36,480 --> 00:18:40,230 are pretty big, at least medium-sized, 293 00:18:40,230 --> 00:18:42,180 because if you're sending a super short message, 294 00:18:42,180 --> 00:18:45,390 if you just want to send the message BAD, which happens to be a word we can spell here, 295 00:18:45,390 --> 00:18:49,550 B-A-D, you're probably going to use fewer bits, 296 00:18:49,550 --> 00:18:53,130 but the catch is if you also have to inform Bob what the tree is 297 00:18:53,130 --> 00:18:57,530 or what those encodings are, you're going to probably outweigh all of the savings 298 00:18:57,530 --> 00:19:00,110 of having compressed things to begin with. 299 00:19:00,110 --> 00:19:02,210 So it can actually be the case that if you try compressing 300 00:19:02,210 --> 00:19:05,330 even with something like zip or file formats you might be familiar with-- 301 00:19:05,330 --> 00:19:07,780 pretty small files, even empty files-- 302 00:19:07,780 --> 00:19:10,930 sometimes those files might get bigger and not smaller. 303 00:19:10,930 --> 00:19:14,320 But realistically, that happens only for small file sizes, 304 00:19:14,320 --> 00:19:16,920 so it's not going to make a gigabyte file be 2 gigabytes; 305 00:19:16,920 --> 00:19:19,480 we're really talking bytes or just a couple kilobytes. 306 00:19:19,480 --> 00:19:22,330 >> Some programs like zip are smart enough to realize that, 307 00:19:22,330 --> 00:19:24,590 "You're going to spend more bits compressing this." 308 00:19:24,590 --> 00:19:27,460 "Let me not bother compressing it for you at all." 309 00:19:27,460 --> 00:19:30,160 So this is just one way then of compressing text format. 310 00:19:30,160 --> 00:19:32,300 We could implement something like this in C. 311 00:19:32,300 --> 00:19:35,370 For instance, here is how we might represent a node in this tree 312 00:19:35,370 --> 00:19:39,320 where we have a char for the symbol, a floating value for the frequency, 313 00:19:39,320 --> 00:19:42,250 and as we've seen with our other data structures, 2 pointers, 314 00:19:42,250 --> 00:19:47,080 1 to the left child, 1 to the right, either of which can be NULL, 315 00:19:47,080 --> 00:19:50,850 but if not, it refers to a left child and a right child. 316 00:19:50,850 --> 00:19:55,130 So this then is Huffman coding, and it's one way that you can go about compressing information, 317 00:19:55,130 --> 00:19:57,880 and it's certainly one of the most easy to implement 318 00:19:57,880 --> 00:20:00,830 in the context of, say, last week's data structures, 319 00:20:00,830 --> 00:20:03,250 though even more sophisticated algorithms exist 320 00:20:03,250 --> 00:20:08,220 that can do even more sophisticated mutations of your data. 321 00:20:08,220 --> 00:20:11,640 Any questions then on trees, binary trees, or compression of text? 322 00:20:11,640 --> 00:20:15,590 [student] Is there some ambiguity, like if [inaudible] split into 01, 323 00:20:15,590 --> 00:20:19,160 then 011 would be ambiguous, right? 324 00:20:19,160 --> 00:20:22,730 [inaudible] >>Good question. Ambiguity. 325 00:20:22,730 --> 00:20:25,940 Let me summarize by referring to this picture here. 326 00:20:25,940 --> 00:20:29,650 Because the characters you are compressing, the representations of, 327 00:20:29,650 --> 00:20:32,850 by definition of this algorithm always remain the leaves, 328 00:20:32,850 --> 00:20:41,870 you'll never accidentally use the same pattern of bits for the prefix of multiple letters. 329 00:20:41,870 --> 00:20:46,740 So in other words, you're concerned about, it sounds like, an ambiguity arising 330 00:20:46,740 --> 00:20:51,580 whereby 001 might be the start of B or the start of C or something like that. 331 00:20:51,580 --> 00:20:56,780 But that can't be the case because notice that all of the letters of the alphabet we're encoding 332 00:20:56,780 --> 00:20:58,290 are at the leaves. 333 00:20:58,290 --> 00:21:01,910 >> The ambiguity can only arise, as in the case of Morse code, 334 00:21:01,910 --> 00:21:06,770 if, for instance, C was somewhere along the path from the root to B. 335 00:21:06,770 --> 00:21:12,290 [student] Right. So in that case, say A has 2 leaves. >>Say A has-- Say that again. 336 00:21:12,290 --> 00:21:18,760 [student] Say A has 2 leaves, F and G, and then G-- >>Okay. But it can't. 337 00:21:18,760 --> 00:21:23,230 A itself could not have leaves F and G because those letters F and G 338 00:21:23,230 --> 00:21:27,560 would themselves be leaves somewhere to the left of B or the right of E. 339 00:21:27,560 --> 00:21:28,900 So by definition, they must be leaves. 340 00:21:28,900 --> 00:21:32,940 Otherwise, you're exactly right, we've not solved the problem that Morse code faces. 341 00:21:32,940 --> 00:21:38,150 Good question. Other questions? All right. 342 00:21:38,150 --> 00:21:42,050 This notion of bits, it turns out we've had power all along that we've not actually used 343 00:21:42,050 --> 00:21:44,200 when it came to manipulating these 0s and 1s. 344 00:21:44,200 --> 00:21:46,600 We asked about this on one of the earliest problem sets: 345 00:21:46,600 --> 00:21:52,340 namely, how do you go about converting uppercase to lowercase or vice versa? 346 00:21:52,340 --> 00:21:55,460 Or, more concretely, one of those first psets asked 347 00:21:55,460 --> 00:22:01,090 how many bits do you actually have to flip in order to change A to lowercase a or vice versa? 348 00:22:01,090 --> 00:22:05,580 Here's a quick reminder of what 65 and 97 look like in binary. 349 00:22:05,580 --> 00:22:08,060 And even if that question has sort of faded in your memory, 350 00:22:08,060 --> 00:22:11,290 you can see again here that how many bits need to be flipped 351 00:22:11,290 --> 00:22:15,810 to change capital A to lowercase a? Just one. 352 00:22:15,810 --> 00:22:19,650 >> They only differ in one location, the third bit from the left. 353 00:22:19,650 --> 00:22:24,240 Whereas A has a 010, little a has a 011. 354 00:22:24,240 --> 00:22:26,250 So somehow, we need to just be able to flip that bit, 355 00:22:26,250 --> 00:22:29,410 and we can then capitalize or lowercase letters. 356 00:22:29,410 --> 00:22:32,720 We've done this in the past by actually using if conditions 357 00:22:32,720 --> 00:22:35,930 and checking if the letter is between capital A and capital Z, 358 00:22:35,930 --> 00:22:41,480 then outputs like A - a + 26 or something like that. 359 00:22:41,480 --> 00:22:46,130 You probably did an arithmetic change to the letters of the alphabet. 360 00:22:46,130 --> 00:22:49,270 But what if we could just flip that single bit? 361 00:22:49,270 --> 00:22:59,080 How could you go about taking one byte's worth of bits, so 8 bits like 01000001 and 01100001? 362 00:22:59,080 --> 00:23:03,170 If you had those patterns of bits, how can we go about changing just one of them? 363 00:23:03,170 --> 00:23:07,610 What if we introduce in yellow here this other pattern of bits? 364 00:23:07,610 --> 00:23:13,420 If I make the whole yellow string 0s except for the one bit that I want to change 365 00:23:13,420 --> 00:23:17,900 and then I introduce a new operator known as a bitwise operator-- 366 00:23:17,900 --> 00:23:21,210 bitwise in the sense that it operates on individual bits, 367 00:23:21,210 --> 00:23:25,360 not on an entire byte or four bytes all at once. 368 00:23:25,360 --> 00:23:31,170 This vertical bar there in yellow suggests that what if we take the representation of capital A 369 00:23:31,170 --> 00:23:37,060 and bitwise OR it with the yellow sequence of bits? 370 00:23:37,060 --> 00:23:41,300 In other words, think back to our discussion of Boolean expressions in Scratch and then in C. 371 00:23:41,300 --> 00:23:47,520 >> Doing a Boolean or means that to be true, either the first thing has to be true 372 00:23:47,520 --> 00:23:50,700 or the second thing has to be true or they both have to be true, 373 00:23:50,700 --> 00:23:53,270 and then the resulting output is itself true. 374 00:23:53,270 --> 00:24:00,230 In this case here, what do we get if we take 0 "or"ed with 0? False or false? 375 00:24:00,230 --> 00:24:04,280 It's still false, so the lowercase a remains as expected. 376 00:24:04,280 --> 00:24:07,540 What if instead we do 1 or 0? 377 00:24:07,540 --> 00:24:12,640 This now remains 1, but notice what's about to happen here. 378 00:24:12,640 --> 00:24:18,630 If we start with capital A and we continue to "or" its individual bits as we're doing here, 379 00:24:18,630 --> 00:24:25,180 0 or the yellow one gives us what down here? This gives us 1. 380 00:24:25,180 --> 00:24:35,120 In fact, suppose we didn't know what the uppercase version of little a actually was. 381 00:24:35,120 --> 00:24:38,270 Let's go do this. Let me move this back over here. 382 00:24:38,270 --> 00:24:42,340 Let's do this again. 0 or 0 gives me 0. 383 00:24:42,340 --> 00:24:45,020 1 or 0 gives me 1. 384 00:24:45,020 --> 00:24:48,020 0 or 1 gives me 1. 385 00:24:48,020 --> 00:24:52,880 0 or 0 gives me 0. The next one is 0, the next one is 0, the next one is 0. 386 00:24:52,880 --> 00:24:55,660 1 or 0 gives me 1. 387 00:24:55,660 --> 00:24:59,140 And so even if we didn't know in advance what lowercase a was, 388 00:24:59,140 --> 00:25:04,770 simply by "or"ing A with this pattern of bits that we've presented here in yellow, 389 00:25:04,770 --> 00:25:09,400 you can lowercase a capital A by flipping that bit. 390 00:25:09,400 --> 00:25:11,580 We used this expression weeks ago: flipping a bit. 391 00:25:11,580 --> 00:25:13,710 How do you actually do that programmatically? 392 00:25:13,710 --> 00:25:16,390 You use what's generally called a mask, a sequence of bits, 393 00:25:16,390 --> 00:25:19,980 that in this case just so happens to look like this number here, 394 00:25:19,980 --> 00:25:22,980 and then you "or" it together using this new C operator, 395 00:25:22,980 --> 00:25:29,940 not ||, you use a single | and you would actually get this answer here because why? 396 00:25:29,940 --> 00:25:35,120 This is the 1s place, 2s place, 4s, 8s, 16s, 32s. 397 00:25:35,120 --> 00:25:42,280 So it turns out that if you take a capital letter A and bitwise OR it with the integer 32, 398 00:25:42,280 --> 00:25:47,520 because the integer 32, when you look at it as bits, looks like this, 399 00:25:47,520 --> 00:25:50,860 that means you can flip the bit that you actually want. 400 00:25:50,860 --> 00:25:52,630 And similarly--and we'll look at code in just a moment-- 401 00:25:52,630 --> 00:25:54,210 suppose we want to go the other direction. 402 00:25:54,210 --> 00:25:58,210 >> How do you go from lowercase a to capital A? Which bit needs to change? 403 00:25:58,210 --> 00:25:59,820 It's the same one. 404 00:25:59,820 --> 00:26:03,970 We want to change that third bit from a 1 to a 0. 405 00:26:03,970 --> 00:26:06,310 And how might we go about doing this? 406 00:26:06,310 --> 00:26:10,130 How do we turn off a bit? With what pattern of bits could we turn off a bit? 407 00:26:11,580 --> 00:26:14,070 What if we sort of invert the mask? 408 00:26:14,070 --> 00:26:17,350 Whereas before, we made the whole yellow mask 0s 409 00:26:17,350 --> 00:26:19,930 except for the one bit we wanted to turn on, 410 00:26:19,930 --> 00:26:25,580 what if this time, we make the whole mask 1s except for the bit that we want to turn off 411 00:26:25,580 --> 00:26:28,330 and then use what operator? 412 00:26:28,330 --> 00:26:30,560 What if we "and" things? Let's take a look. 413 00:26:30,560 --> 00:26:34,880 If we now flip to this, suppose that again I create a mask that's all 1s 414 00:26:34,880 --> 00:26:37,650 except for the one bit that I want to turn off 415 00:26:37,650 --> 00:26:43,860 and then rather than "or" the white numbers up top with the yellow numbers down here, 416 00:26:43,860 --> 00:26:46,940 what if I instead "and" them together? It's called a bitwise and. 417 00:26:46,940 --> 00:26:49,450 Logically, it's the same thing as a Boolean and. 418 00:26:49,450 --> 00:26:55,160 This gives me 0 & 1 is 0. So false and true is false. 419 00:26:55,160 --> 00:26:58,160 True and true is true. 420 00:26:58,160 --> 00:27:04,020 And here is the magic: True and false is now false, so we've turned off that bit. 421 00:27:04,020 --> 00:27:06,560 And now the rest of the story is somewhat straightforward. 422 00:27:06,560 --> 00:27:11,970 Because the rest of the mask is 1s, it doesn't matter what the numbers are in white. 423 00:27:11,970 --> 00:27:15,580 When you "and" something with true, you're not going to change its value. 424 00:27:15,580 --> 00:27:20,200 If it is true, it will remain true. If it was false, it will remain false. 425 00:27:20,200 --> 00:27:23,190 >> But the magic happens when you take something that was true 426 00:27:23,190 --> 00:27:25,430 and you then "and" it with false. 427 00:27:25,430 --> 00:27:30,030 This has the effect of turning off that bit. 428 00:27:30,030 --> 00:27:31,980 So a little cryptic there. 429 00:27:31,980 --> 00:27:35,390 Let's actually look at some code, which might actually look even more cryptic, 430 00:27:35,390 --> 00:27:38,220 but let's take a look here at tolower. 431 00:27:38,220 --> 00:27:45,880 If I look at tolower, going from capital A to lowercase a, 432 00:27:45,880 --> 00:27:47,730 let's see how we might implement this program. 433 00:27:47,730 --> 00:27:51,280 Here's main, and it's not taking any command-line arguments. 434 00:27:51,280 --> 00:27:55,980 I'm declaring a character c for the letter that the user is going to type in. 435 00:27:55,980 --> 00:28:00,690 I then use a familiar do while loop to just make sure that the user definitely gives me a capital A 436 00:28:00,690 --> 00:28:05,010 or B or C...Z, so they give me something between A and Z. 437 00:28:05,010 --> 00:28:08,580 And now what am I doing here? 438 00:28:08,580 --> 00:28:14,870 I'm "or"ing this with 0x20, but that's actually the same as-- 439 00:28:14,870 --> 00:28:19,500 and we'll come back to this in a moment--32. 440 00:28:19,500 --> 00:28:24,830 So again, 32 is this pattern of bits here. Why do we know this? 441 00:28:24,830 --> 00:28:26,320 Just think back to week 0. 442 00:28:26,320 --> 00:28:31,010 This is the 1s place, 2s place, 4s, 8s, 16s, 32s place. 443 00:28:31,010 --> 00:28:33,470 So this yellow number happens to be 32. 444 00:28:33,470 --> 00:28:40,570 I can then take a letter like the char here, bitwise "or" it with literally the number 32, 445 00:28:40,570 --> 00:28:45,250 and what do I get back? The lowercase version of that char. 446 00:28:45,250 --> 00:28:48,830 A moment ago, though, I expressed this in a different base notation. 447 00:28:48,830 --> 00:28:51,370 What did this represent? >>[student] Hexadecimal. 448 00:28:51,370 --> 00:28:53,050 [Malan] This happens to represent hexadecimal. 449 00:28:53,050 --> 00:28:55,170 We haven't talked about hexadecimal all that much, 450 00:28:55,170 --> 00:28:57,330 but it's actually convenient in cases like this. 451 00:28:57,330 --> 00:29:01,730 >> Even though it looks more complex and even though it looks like 20 and not 32, 452 00:29:01,730 --> 00:29:06,240 it turns out that hexadecimal is actually super convenient notation 453 00:29:06,240 --> 00:29:10,810 because in hexadecimal every digit after the 0x--and this means nothing; 454 00:29:10,810 --> 00:29:13,960 this is just human convention that says here comes a hexadecimal number-- 455 00:29:13,960 --> 00:29:18,590 each of these digits, the 2 and then the 0, themselves can be represented 456 00:29:18,590 --> 00:29:20,800 with exactly 4 bits. 457 00:29:20,800 --> 00:29:27,840 So if we do this, let me open up a text editor here--weird autocomplete-- 458 00:29:27,840 --> 00:29:35,940 if we do a little text editor here, the number 0x20 means here is 4 bits, here's another 4 bits. 459 00:29:35,940 --> 00:29:38,050 Let's do the rightmost 4 bits first. 460 00:29:38,050 --> 00:29:44,690 0 when represented with 4 bits is what? Super easy. Just all 0s. 461 00:29:44,690 --> 00:29:46,780 So 4 bits as 0s. 462 00:29:46,780 --> 00:29:53,510 How do you represent 2? It's been a while since we did this, but it's 0100. 463 00:29:53,510 --> 00:29:57,310 So this is the 1s place, this is the 2s place, and then it doesn't matter what the other places are. 464 00:29:57,310 --> 00:30:00,610 In other words, in hexadecimal you might say 0x20, 465 00:30:00,610 --> 00:30:04,340 but if you then think about what is the 2 and how is it represented in binary, 466 00:30:04,340 --> 00:30:07,130 what is the 0 and how is it represented in binary, 467 00:30:07,130 --> 00:30:10,440 the answers to those questions are this and this, respectively. 468 00:30:10,440 --> 00:30:14,380 So 0x20 happens to represent this pattern of 8 bits, 469 00:30:14,380 --> 00:30:16,880 which is precisely the mask that we wanted. 470 00:30:16,880 --> 00:30:20,140 So this is for the moment just an intellectual exercise, 471 00:30:20,140 --> 00:30:24,520 but the reality is in code it's typically more common to write constants like this 472 00:30:24,520 --> 00:30:28,360 in hexadecimal because then the programmer can relatively easily, 473 00:30:28,360 --> 00:30:32,560 even if it requires some paper and pencil, figure out what that pattern of bits is 474 00:30:32,560 --> 00:30:35,960 because you can't just express 0s and 1s typically in code. 475 00:30:35,960 --> 00:30:38,540 You can't go 00010 and so forth. 476 00:30:38,540 --> 00:30:42,380 >> You have to pick decimal or hexadecimal or octal or other notations. 477 00:30:42,380 --> 00:30:47,540 Most people tend to pick hexadecimal simply so that each digit represents 4 bits 478 00:30:47,540 --> 00:30:49,320 and you can do this quick math. 479 00:30:49,320 --> 00:30:54,990 And I'll wave my hand at toupper, which is almost the same; it looks almost identical. 480 00:30:54,990 --> 00:31:01,900 Toupper happens to use not the or operator but rather this guy and df. 481 00:31:01,900 --> 00:31:09,300 What does df represent? df? Anyone? >>[student] 255. 482 00:31:09,300 --> 00:31:12,780 255? Not 255. That would be ff. 483 00:31:12,780 --> 00:31:15,210 We'll leave this one as a little exercise. 484 00:31:15,210 --> 00:31:23,460 But if you go from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and then what comes after 9? 485 00:31:23,460 --> 00:31:26,510 We're kind of out of decimal digits, but in hexadecimal what comes after 9? 486 00:31:26,510 --> 00:31:29,510 [student] a. >>So a, b, c, d. 487 00:31:29,510 --> 00:31:33,470 You can figure out from there what pattern of bits d actually represents. 488 00:31:33,470 --> 00:31:38,850 And if we do the math, we'll see that the mask you end up getting back is identical to this. 489 00:31:38,850 --> 00:31:45,580 This is f, all 1s, and this is d. So df represents that mask. All right. 490 00:31:45,580 --> 00:31:50,980 And lastly, not to make things sound super, super technical, 491 00:31:50,980 --> 00:31:53,840 but suppose we wanted to write a program that does this. 492 00:31:53,840 --> 00:31:58,960 Let me go ahead and make binary, which is a program in a file called binary.c. 493 00:31:58,960 --> 00:32:02,050 And now let me run binary and give me a non-negative integer. 494 00:32:02,050 --> 00:32:03,960 Let's start easy and type in 0. 495 00:32:03,960 --> 00:32:09,010 This now is a program that prints out an integer in its binary representation. 496 00:32:09,010 --> 00:32:13,470 So if I play this game again and type in just 1, I should get a 32-bit representation of 1. 497 00:32:13,470 --> 00:32:15,490 If I do this again with 2, I should get that. 498 00:32:15,490 --> 00:32:19,310 If I do 7, I should get a few 1s at the end and so forth. 499 00:32:19,310 --> 00:32:22,740 It turns out I mention this because with bitwise operations 500 00:32:22,740 --> 00:32:25,490 you can actually do one other thing as well. 501 00:32:25,490 --> 00:32:29,130 You can create these masks dynamically. 502 00:32:29,130 --> 00:32:32,800 Take a look at this one final example involving bitwise operations. 503 00:32:32,800 --> 00:32:35,490 Here is the first part of the code, prompt the user for a number, 504 00:32:35,490 --> 00:32:38,130 and it insists that you give me a non-negative integer. 505 00:32:38,130 --> 00:32:39,780 So that's sort of old school stuff. 506 00:32:39,780 --> 00:32:41,980 But here is something that's kind of interesting. 507 00:32:41,980 --> 00:32:44,910 >> How do I go about printing a number in binary? 508 00:32:44,910 --> 00:32:48,970 I first iterate from what to what? 509 00:32:48,970 --> 00:32:52,270 What's the size of an int typically, at least in the appliance? >>[student] 4. 510 00:32:52,270 --> 00:32:57,130 It's 4. So 4 * 8 is 32 - 1 is 31. 511 00:32:57,130 --> 00:33:02,590 So if I'm starting to count from 31, that represents, it turns out, 512 00:33:02,590 --> 00:33:07,630 just conceptually, the 31st bit or the highest order bit, which is this guy over here, 513 00:33:07,630 --> 00:33:09,650 whereas this is going to be bit 0. 514 00:33:09,650 --> 00:33:12,850 So this is bit 01...bit 31. 515 00:33:12,850 --> 00:33:14,950 So what is this code doing? 516 00:33:14,950 --> 00:33:20,140 Notice this for loop, even though it looks cryptic, is just iterating from 31 down to 0. That's it. 517 00:33:20,140 --> 00:33:24,530 So the interesting part now must be in these 5 lines here. 518 00:33:24,530 --> 00:33:28,110 Notice that in this line I'm declaring a variable called mask 519 00:33:28,110 --> 00:33:30,790 to be consistent with our story of these yellow numbers. 520 00:33:30,790 --> 00:33:32,200 And then what is this doing? 521 00:33:32,200 --> 00:33:35,720 This is another bitwise operator we've not seen before, most likely. 522 00:33:35,720 --> 00:33:38,300 It's the left shift operator. 523 00:33:38,300 --> 00:33:40,060 This operator does this. 524 00:33:40,060 --> 00:33:44,920 Here is the number 1, and if you do i left shift, left shift, 525 00:33:44,920 --> 00:33:49,260 what do you think that has the effect of doing to that individual 1? 526 00:33:49,260 --> 00:33:51,290 Literally shifting it over. 527 00:33:51,290 --> 00:33:57,540 So if the number 1 is what you have on the left and you start by initializing i to 31, 528 00:33:57,540 --> 00:34:03,490 what is that going to do? It's going to take this number 1 and shift it 31 places over here. 529 00:34:03,490 --> 00:34:06,210 And because there's obviously no other digits behind it, 530 00:34:06,210 --> 00:34:10,350 those will by default be replaced with 0s. 531 00:34:10,350 --> 00:34:15,120 So you'll start out with the number 1, which of course looks like this-- 532 00:34:15,120 --> 00:34:18,659 and let me draw it over here in the center. 533 00:34:18,659 --> 00:34:22,139 And then as you shift things to the left, this guy essentially goes this way. 534 00:34:22,139 --> 00:34:24,659 But as soon as you do that, a 0 gets filled in. 535 00:34:24,659 --> 00:34:28,360 If you shift it a second time, it goes this way and another 0 gets filled in. 536 00:34:28,360 --> 00:34:31,000 >> You shift it again and then another 0 gets filled in. 537 00:34:31,000 --> 00:34:37,900 So if you do this thing of 1 << i 31 places, you end up getting a mask 538 00:34:37,900 --> 00:34:42,550 that is 32 characters long, the leftmost one of which is a 1, 539 00:34:42,550 --> 00:34:45,199 all of the rest of which are a 0. 540 00:34:45,199 --> 00:34:50,880 And it turns out, as an aside, shifting a number to the left like this 541 00:34:50,880 --> 00:34:53,530 also coincidentally, and sometimes conveniently, 542 00:34:53,530 --> 00:34:57,520 has the effect of doing what to that number? >>[student] Doubling it. 543 00:34:57,520 --> 00:35:00,980 Doubling it because each of the columns--the 1s place, 2s place, 4s place, 544 00:35:00,980 --> 00:35:05,030 8s place, 16s place--they're all doubling as you go to the left. 545 00:35:05,030 --> 00:35:09,500 Or rather, when you shift the 1s you're going to end up doubling the value of the number. 546 00:35:09,500 --> 00:35:12,070 You can end up doing interesting transformations of digits 547 00:35:12,070 --> 00:35:15,640 by shifting everything over in this way by powers of 2. 548 00:35:15,640 --> 00:35:17,150 So how does this work? 549 00:35:17,150 --> 00:35:22,580 This then gives me a mask that's all 0s except for a 1 in precisely the place I want it, 550 00:35:22,580 --> 00:35:27,920 and then this expression, which is stolen from toupper.c, 551 00:35:27,920 --> 00:35:31,770 is simply saying take the number n that the user typed in, 552 00:35:31,770 --> 00:35:34,730 "and" it with that mask, and what are you going to get? 553 00:35:34,730 --> 00:35:39,200 You're going to get a 1 if there's a 1 in that masked location, 554 00:35:39,200 --> 00:35:41,570 or you're going to get a 0 if there's not. 555 00:35:41,570 --> 00:35:44,370 And so all this program does effectively is it has a loop, 556 00:35:44,370 --> 00:35:48,340 and it creates a mask with a 1 over here, then a 1 over here, then a 1 over here, 557 00:35:48,340 --> 00:35:52,950 and it uses this bitwise AND trick to say is there a 1 bit in the user's input here? 558 00:35:52,950 --> 00:35:59,220 >> Is there a 1 bit in the user's input here? And if so, literally print 1, else print 0. 559 00:35:59,220 --> 00:36:03,780 We're doing this with ints just because that's why we're doing 32 bits instead of 8, 560 00:36:03,780 --> 00:36:06,900 but what we've introduced then is this bitwise AND, this bitwise OR, 561 00:36:06,900 --> 00:36:10,450 and this left shift operator, which aren't often terribly helpful, 562 00:36:10,450 --> 00:36:12,230 but it turns out they can be. 563 00:36:12,230 --> 00:36:16,560 In fact, if you were to represent something like an array of Booleans 564 00:36:16,560 --> 00:36:21,260 just to represent true or false, suppose you wanted to keep track of whether or not 565 00:36:21,260 --> 00:36:24,630 a room full of 300 students is present, 566 00:36:24,630 --> 00:36:29,420 you could declare an array of size 300 of type bool so that you get 300 bools, 567 00:36:29,420 --> 00:36:33,090 and you can set each to true if someone is here and false otherwise. 568 00:36:33,090 --> 00:36:37,550 Why is that representation in that data structure inefficient? 569 00:36:39,370 --> 00:36:44,800 What's bad about the design of that data structure, an array of 300 bools? 570 00:36:46,190 --> 00:36:49,600 What is a bool, in fact, underneath the hood? 571 00:36:49,600 --> 00:36:52,310 This, too, is something that might not be familiar. 572 00:36:52,310 --> 00:36:53,720 It turns out there is no bool. 573 00:36:53,720 --> 00:36:56,620 Remember we sort of created that with the cs50.h file, 574 00:36:56,620 --> 00:36:58,630 which itself includes standard bool. 575 00:36:58,630 --> 00:37:00,930 C is kind of dumb, though, when it comes to bool. 576 00:37:00,930 --> 00:37:04,880 It uses 8 bits to represent every bool, which is completely wasteful 577 00:37:04,880 --> 00:37:09,040 because obviously, how many bits do you need to represent a bool? Just 1. 578 00:37:09,040 --> 00:37:13,190 So it turns out that if you now have the ability with bitwise operators 579 00:37:13,190 --> 00:37:17,760 to manipulate individual bits even in a char, even in a single byte, 580 00:37:17,760 --> 00:37:21,380 it turns out you could decrease the memory required to represent something stupid 581 00:37:21,380 --> 00:37:25,490 like that attendance styled data structure by a factor of 8. 582 00:37:25,490 --> 00:37:29,820 Instead of using eight bits to represent true or false, you could literally use one 583 00:37:29,820 --> 00:37:34,500 by using a single byte for every eight students in the class 584 00:37:34,500 --> 00:37:41,990 and toggling from 0 to 1 individual bits by using these kinds of low-level tricks. 585 00:37:43,850 --> 00:37:49,460 That really put an end to the energy. Are there any questions about bitwise operations? 586 00:37:49,460 --> 00:37:52,710 >> Yeah. >>[student] Is there an exclusive or operator? 587 00:37:52,710 --> 00:37:56,440 Yes. There is an exclusive or operator that looks like this, ^, the carrot symbol, 588 00:37:56,440 --> 00:38:02,070 which means only the first thing or the second thing can be a 1 for the output to be a 1. 589 00:38:02,070 --> 00:38:07,750 There is also a not, ~, which will allow you to invert a 0 to a 1 or vice versa as well. 590 00:38:07,750 --> 00:38:11,600 And there's also a right shift operator, >>, which is the opposite of the one we saw. 591 00:38:11,600 --> 00:38:13,850 All right. Let's take things now to a higher level. 592 00:38:13,850 --> 00:38:16,770 We started by talking about text and then compressing it 593 00:38:16,770 --> 00:38:19,650 and representing the text with fewer numbers of bits; 594 00:38:19,650 --> 00:38:22,890 we talked a bit about how we can now start manipulating things on a bitwise level. 595 00:38:22,890 --> 00:38:26,640 Let's now zoom back up 10,000 feet to representation 596 00:38:26,640 --> 00:38:29,250 of more complex things like graphics. 597 00:38:29,250 --> 00:38:32,950 Here we have a flag of Germany, here we have one of France. 598 00:38:32,950 --> 00:38:36,350 These might be represented in file formats you might know--GIFs, for instance. 599 00:38:36,350 --> 00:38:40,030 If you've ever seen an image on the Web that ends in .gif, 600 00:38:40,030 --> 00:38:43,000 this is a graphics interchange format. 601 00:38:43,000 --> 00:38:47,530 These two flags here sort of lend themselves to compression 602 00:38:47,530 --> 00:38:52,050 for what perhaps obvious reason? >>[inaudible student response] 603 00:38:52,050 --> 00:38:53,440 There's a lot of repetition, right? 604 00:38:53,440 --> 00:38:57,270 In order to send Germany's flag, think of this as being an image on the screen 605 00:38:57,270 --> 00:38:59,030 back in your Scratch days. 606 00:38:59,030 --> 00:39:02,380 You might recall that there's individual pixels or dots that compose an image. 607 00:39:02,380 --> 00:39:06,650 >> There's a whole row of black dots and another whole row of black dots. 608 00:39:06,650 --> 00:39:10,110 There's a bunch of rows of black dots that we could see if we really zoomed in, 609 00:39:10,110 --> 00:39:13,370 much like when we zoomed in on Rob's face in Photoshop. 610 00:39:13,370 --> 00:39:15,500 As soon as we got deeper and deeper and deeper into the image, 611 00:39:15,500 --> 00:39:19,990 you started seeing the pixelation, all of the squares that composed his eye in that case. 612 00:39:19,990 --> 00:39:24,130 Same deal here. If we zoomed in quite a bit, you would see individual dots. 613 00:39:24,130 --> 00:39:27,110 Well, this is kind of a waste of bits. 614 00:39:27,110 --> 00:39:32,120 If a third of the flag is black and a third of the flag is yellow and so forth, 615 00:39:32,120 --> 00:39:34,860 why can't we somehow compress this flag? 616 00:39:34,860 --> 00:39:39,560 And even the French flag could be compressed even though the pattern is a little bit different. 617 00:39:39,560 --> 00:39:44,120 It turns out the GIF file format is a lossless compression format, 618 00:39:44,120 --> 00:39:48,420 which means you can take an image like the German flag here, 619 00:39:48,420 --> 00:39:53,540 you can throw away a lot of its bits without sacrificing quality. 620 00:39:53,540 --> 00:39:55,340 This is in contrast to something like JPEGs, 621 00:39:55,340 --> 00:39:57,050 with which most of us are probably more familiar. 622 00:39:57,050 --> 00:39:59,000 Facebook photos and Flickr photos and the like 623 00:39:59,000 --> 00:40:02,200 are almost always saved as JPEGs when they're uploaded, 624 00:40:02,200 --> 00:40:08,100 but JPEGs is a lossy--L-O-S-S-Y--format whereby you do throw away bits 625 00:40:08,100 --> 00:40:10,430 but you also throw away quality. 626 00:40:10,430 --> 00:40:13,890 And so if you compress photos with Photoshop or upload them to Facebook 627 00:40:13,890 --> 00:40:15,580 or take them on a really crappy phone, 628 00:40:15,580 --> 00:40:19,510 you know that the picture starts to get very splotchy and pixelated, 629 00:40:19,510 --> 00:40:22,290 and that's because it's being compressed by the computer or phone 630 00:40:22,290 --> 00:40:24,550 by literally throwing information away. 631 00:40:24,550 --> 00:40:28,500 But GIF is amazing in that it can use fewer bits than it might by default 632 00:40:28,500 --> 00:40:30,750 without losing any information. 633 00:40:30,750 --> 00:40:32,410 >> And it essentially does so as follows. 634 00:40:32,410 --> 00:40:38,740 Rather than store in a file like a BMP would an RGB triple for black, black, black, black, 635 00:40:38,740 --> 00:40:42,570 black, black, black, black, black, black, black, black and so forth, 636 00:40:42,570 --> 00:40:45,640 rather, the GIF format is going to say, "Black," 637 00:40:45,640 --> 00:40:48,330 and then, "Repeat this 100 times," or something like that. 638 00:40:48,330 --> 00:40:52,280 "Black, repeat this 100 times, black, repeat this 100 times..." 639 00:40:52,280 --> 00:40:54,530 "Yellow, repeat this 100 times." 640 00:40:54,530 --> 00:40:57,200 And so it remembers, essentially, the leftmost pixel 641 00:40:57,200 --> 00:41:02,160 and then encodes somehow the notion of repeating that pixel again and again. 642 00:41:02,160 --> 00:41:06,110 So GIFs can then compress themselves without losing any information. 643 00:41:06,110 --> 00:41:09,510 But if you had to guess, if that is the algorithm that GIFs use, 644 00:41:09,510 --> 00:41:13,180 which of these flags, even though they look identical in size, 645 00:41:13,180 --> 00:41:19,620 is going to be smaller when saved on disk as a GIF? >>[student] Germany. 646 00:41:19,620 --> 00:41:21,660 Germany is going to be smaller? Why? 647 00:41:21,660 --> 00:41:26,620 [student] Because you repeat it many, many times horizontally 648 00:41:26,620 --> 00:41:29,010 and then you repeat another time. >>Exactly. 649 00:41:29,010 --> 00:41:32,020 Because the people who invented GIF just kind of arbitrarily decided 650 00:41:32,020 --> 00:41:36,040 that the repetition will be leveraged horizontally and not laterally. 651 00:41:36,040 --> 00:41:40,900 There's a lot more repetition laterally here in the German flag than in the French flag. 652 00:41:40,900 --> 00:41:44,430 So if we actually open up a folder on my hard drive that has these GIFs, 653 00:41:44,430 --> 00:41:51,920 you can actually see that the German flag here is 2 kilobytes and the French one is 4 kilobytes. 654 00:41:51,920 --> 00:41:54,080 It happens to be a coincidence that one is twice the other, 655 00:41:54,080 --> 00:41:57,960 but it's in fact the case that the French flag is much larger. 656 00:41:57,960 --> 00:42:01,250 >> Even though we're talking here about graphics, the same ideas can apply to 657 00:42:01,250 --> 00:42:05,150 not things like flags but images that are a little more complex. 658 00:42:05,150 --> 00:42:08,170 If you take a picture of an apple, surely there's a lot of duplication there, 659 00:42:08,170 --> 00:42:11,040 so we could somehow remember that the default background is blue 660 00:42:11,040 --> 00:42:13,230 and not, as the right-hand picture suggests, 661 00:42:13,230 --> 00:42:16,830 have to remember the color of every single pixel in this picture. 662 00:42:16,830 --> 00:42:21,060 So we can throw bits away there without losing information. 663 00:42:21,060 --> 00:42:23,340 The apple still looks just the same. 664 00:42:23,340 --> 00:42:27,510 In this example here, you might see what happens in a movie. 665 00:42:27,510 --> 00:42:31,970 These represent old-school film reels whereby in the top image there 666 00:42:31,970 --> 00:42:36,900 you have an RV driving past a house and a tree. 667 00:42:36,900 --> 00:42:42,130 And as that van drives past from left to right, what's obviously not changing? 668 00:42:42,130 --> 00:42:45,320 The house isn't going anywhere, and the tree is not going anywhere. 669 00:42:45,320 --> 00:42:47,700 The only thing that's moving is the van in this case. 670 00:42:47,700 --> 00:42:51,650 So as Background Unchanged suggests, what you can do in movies 671 00:42:51,650 --> 00:42:56,530 is similarly just throw away information that doesn't change in between frames. 672 00:42:56,530 --> 00:42:58,900 This is generally known as interframe compression 673 00:42:58,900 --> 00:43:02,120 whereby if this frame looks almost identical to this one, 674 00:43:02,120 --> 00:43:05,390 let's not bother storing on disk any of the identical information 675 00:43:05,390 --> 00:43:09,250 on these intermediate frames, let's only use key frames once in a while 676 00:43:09,250 --> 00:43:13,420 that actually store that information redundantly just as a little sanity check. 677 00:43:13,420 --> 00:43:18,620 >> By contrast, another approach to compressing video is in this second and lower example here, 678 00:43:18,620 --> 00:43:23,970 where rather than store 30 frames, why don't you just store 15 frames a second instead? 679 00:43:23,970 --> 00:43:27,070 Rather than the movie kind of flowing beautifully, perfectly, 680 00:43:27,070 --> 00:43:30,060 it might look like it's stuttering a little bit, a little old school, 681 00:43:30,060 --> 00:43:37,190 but the net effect will be to use far fewer bits than might otherwise be necessary. 682 00:43:37,190 --> 00:43:39,240 So where does this then leave us? 683 00:43:39,240 --> 00:43:41,700 That was a bit of an aside on where else you can go with compression. 684 00:43:41,700 --> 00:43:45,140 For more on that, take a class like CS175 here. 685 00:43:45,140 --> 00:43:46,990 Here's another example within video. 686 00:43:46,990 --> 00:43:49,190 If the bee is the only thing moving, 687 00:43:49,190 --> 00:43:51,790 you can really throw away information in those middle frames 688 00:43:51,790 --> 00:43:55,260 because the flower and sky and leaves are not changing. 689 00:43:55,260 --> 00:43:57,960 But let's now consider one last thing. 690 00:43:57,960 --> 00:44:03,890 In the next 5 minutes we leave C behind forever in lecture? Yes. Not in the psets, though. 691 00:44:03,890 --> 00:44:10,210 Last story about C and then we get to very sexy stuff 692 00:44:10,210 --> 00:44:13,870 involving HTML and Web and woo-hoo. All right. 693 00:44:13,870 --> 00:44:16,050 Here we go. That's the motivation. 694 00:44:16,050 --> 00:44:20,020 It turns out all this time when we have been writing programs we run Clang. 695 00:44:20,020 --> 00:44:23,890 And Clang, we've said since the first week pretty much, takes source code 696 00:44:23,890 --> 00:44:25,740 and converts it into object code. 697 00:44:25,740 --> 00:44:28,540 It takes C and converts it into 0s and 1s. 698 00:44:28,540 --> 00:44:32,150 I've kind of been lying to you for a few weeks because it's not quite as simple as that. 699 00:44:32,150 --> 00:44:36,750 >> There's a lot more going on underneath the hood when you run a program like Clang. 700 00:44:36,750 --> 00:44:39,560 In fact, the process of compiling a program can really be summarized, 701 00:44:39,560 --> 00:44:42,210 as you might recall from Rob's video on compilers, 702 00:44:42,210 --> 00:44:47,580 into these 4 steps: pre-processing, compiling itself, assembling, and linking. 703 00:44:47,580 --> 00:44:51,950 But we in class and most people in the world typically summarize all of these steps 704 00:44:51,950 --> 00:44:54,410 as just "compiling." 705 00:44:54,410 --> 00:44:58,070 But if we start with source code like this, recall this is perhaps the simplest C program 706 00:44:58,070 --> 00:45:03,530 we've written thus far, recall that when compiled it ends up looking like this. 707 00:45:03,530 --> 00:45:07,310 But there's actually an intermediate step, and those steps are as follows. 708 00:45:07,310 --> 00:45:10,750 First there's this thing at the very top of this and most of our programs, 709 00:45:10,750 --> 00:45:13,550 #include 710 00:45:13,550 --> 00:45:17,210 What does #include do for us? 711 00:45:17,210 --> 00:45:24,150 It pretty much copies and pastes the contents of stdio.h into my file so that why? 712 00:45:24,150 --> 00:45:27,220 Why do I care about the contents of stdio.h? What's in there of interest? 713 00:45:27,220 --> 00:45:32,310 Printf's declaration, its prototype, so that the compiler then knows what I mean 714 00:45:32,310 --> 00:45:34,900 when I mention this function printf. 715 00:45:34,900 --> 00:45:39,390 So step 1 in compiling is pre-processing, whereby a program like Clang 716 00:45:39,390 --> 00:45:43,450 or some helper program that Clang comes with reads your code top to bottom, 717 00:45:43,450 --> 00:45:47,740 left to right, and any time it sees a # symbol followed by a keyword like include, 718 00:45:47,740 --> 00:45:53,980 it performs that operation, copying and pasting in this case stdio.h into your file. 719 00:45:53,980 --> 00:45:55,510 That's step 1. 720 00:45:55,510 --> 00:45:59,620 Then you have a much bigger C file because of the huge copy, paste job that's just happened. 721 00:45:59,620 --> 00:46:01,710 >> Step 2 now is compiling. 722 00:46:01,710 --> 00:46:04,880 But it turns out compiling takes source code that looks like this 723 00:46:04,880 --> 00:46:08,160 and turns it into something that looks like this, 724 00:46:08,160 --> 00:46:12,560 which for those familiar is called? >>[student] Assembly. >>Assembly language. 725 00:46:12,560 --> 00:46:16,700 This is actually something if you take CS61 you'll dive into in more detail. 726 00:46:16,700 --> 00:46:22,380 This is just about as close as you can get to writing 0s and 1s yourself 727 00:46:22,380 --> 00:46:25,850 but writing things in such a way that still makes at least a little bit of sense. 728 00:46:25,850 --> 00:46:30,760 These are machine instructions, and if we scroll down to the main function here, 729 00:46:30,760 --> 00:46:35,470 notice that there is this push instruction, move instruction, subtract instruction, 730 00:46:35,470 --> 00:46:38,550 call instruction, and so forth. 731 00:46:38,550 --> 00:46:42,930 When you hear that your computer has Intel inside, 732 00:46:42,930 --> 00:46:46,180 you have an Intel CPU in your Mac or PC, what does that mean? 733 00:46:46,180 --> 00:46:51,200 A CPU comes built by companies like Intel understanding certain instructions. 734 00:46:51,200 --> 00:46:55,770 They have no idea what functions like swap are or main are per se, 735 00:46:55,770 --> 00:47:00,060 but they do know what very low-level instructions like add, subtract, push, 736 00:47:00,060 --> 00:47:02,430 move, call, and so forth are. 737 00:47:02,430 --> 00:47:06,170 So when you compile C code into assembly language, 738 00:47:06,170 --> 00:47:11,820 your very user friendly-looking code is converted into something that looks like this, 739 00:47:11,820 --> 00:47:21,670 that literally moves bytes or 4 bytes around in such small units in and out of the CPU. 740 00:47:21,670 --> 00:47:26,820 But finally, when Clang is ready to take this representation of your program 741 00:47:26,820 --> 00:47:30,940 into 0s and 1s, then the step called assembling happens, 742 00:47:30,940 --> 00:47:33,850 and this again all happens in the blink of an eye when running Clang. 743 00:47:33,850 --> 00:47:39,300 We start here, it outputs a file like this, and then it converts it to these 0s and 1s. 744 00:47:39,300 --> 00:47:42,000 And if you want to go back at some point and actually see this in action, 745 00:47:42,000 --> 00:47:48,220 if I go into hello1.c--this is one of the very first programs we looked at-- 746 00:47:48,220 --> 00:47:53,710 normally we would compile this with Clang hello1.c and this would give us a.out. 747 00:47:53,710 --> 00:47:59,890 If by contrast you instead give it the -S flag, what you'll get is hello1.s 748 00:47:59,890 --> 00:48:02,750 and you'll actually see the assembly language. 749 00:48:02,750 --> 00:48:05,750 >> I'm doing this for a very short program, but if you go back for Scramble 750 00:48:05,750 --> 00:48:08,740 or Recover or any program you've written and just out of curiosity 751 00:48:08,740 --> 00:48:13,240 want to see what it actually looks like, what's actually being fed into the CPU, 752 00:48:13,240 --> 00:48:15,700 you can use that -S flag with Clang. 753 00:48:15,700 --> 00:48:17,770 But then lastly, there's still one gotcha. 754 00:48:17,770 --> 00:48:21,810 Here are the 0s and 1s that represent my implementation of hello, world. 755 00:48:21,810 --> 00:48:25,530 But I used someone else's function in my program. 756 00:48:25,530 --> 00:48:28,710 So even though the process has been I take hello.c, 757 00:48:28,710 --> 00:48:34,280 it gets compiled into assembly code, and then it gets assembled into 0s and 1s, 758 00:48:34,280 --> 00:48:37,460 the only 0s and 1s that are outputted at this point in time 759 00:48:37,460 --> 00:48:40,270 are the ones that result from my code. 760 00:48:40,270 --> 00:48:44,400 But the person who wrote printf, they compiled their code 20 years ago 761 00:48:44,400 --> 00:48:47,000 and it's now installed somewhere on the appliance, 762 00:48:47,000 --> 00:48:51,610 so we somehow have to merge his or her 0s and 1s with my 0s and 1s, 763 00:48:51,610 --> 00:48:56,160 and that brings us to the 4th and final step of compiling, known as linking. 764 00:48:56,160 --> 00:48:58,680 So on the left-hand side we have the exact same picture as before: 765 00:48:58,680 --> 00:49:02,580 hello.c becomes assembly code becomes 0s and 1s. 766 00:49:02,580 --> 00:49:05,960 But recall that I used the standard I/O library in my code, 767 00:49:05,960 --> 00:49:10,350 and that means somewhere on the computer there's a file called stdio.c 768 00:49:10,350 --> 00:49:13,980 or at least the compiled version thereof because someone some years ago 769 00:49:13,980 --> 00:49:18,530 compiled stdio.c into assembly code and then a whole bunch of 0s and 1s. 770 00:49:18,530 --> 00:49:21,130 This is what's known as a static or a dynamic library. 771 00:49:21,130 --> 00:49:23,350 It's some file sitting somewhere in the appliance. 772 00:49:23,350 --> 00:49:28,710 >> But lastly, I have to take my 0s and 1s and that person's 0s and 1s 773 00:49:28,710 --> 00:49:32,760 and somehow link them together, literally combine those 0s and 1s 774 00:49:32,760 --> 00:49:37,900 into a single file called a.out or hello1 or whatever I called my program 775 00:49:37,900 --> 00:49:43,320 so that the end result has all of the 1s and 0s that should compose my program. 776 00:49:43,320 --> 00:49:45,660 So all this time this semester when you've been using Clang 777 00:49:45,660 --> 00:49:48,750 and even more recently running make in order to run Clang, 778 00:49:48,750 --> 00:49:53,580 all of these steps have been happening sort of instantaneously but very deliberately. 779 00:49:53,580 --> 00:49:57,830 And so if you continue on in computer science, namely CS61, 780 00:49:57,830 --> 00:50:00,850 this is the layer that you'll continue to peel back off there 781 00:50:00,850 --> 00:50:06,980 talking about efficiency, security implications, and the like of these lower level details. 782 00:50:06,980 --> 00:50:09,220 But with that, we're about to leave C behind. 783 00:50:09,220 --> 00:50:11,420 Let's go ahead and take our 5-minute break now, 784 00:50:11,420 --> 00:50:14,190 and when we come back: the Internet. 785 00:50:17,280 --> 00:50:19,170 All right. We are back. 786 00:50:19,170 --> 00:50:23,590 Now we begin our look not just at HTML because, as you will see, 787 00:50:23,590 --> 00:50:26,050 HTML itself is actually pretty simple 788 00:50:26,050 --> 00:50:29,270 but really at web programming more generally, networking more generally, 789 00:50:29,270 --> 00:50:31,770 and how all of these technologies come together 790 00:50:31,770 --> 00:50:35,400 to allow us to create much more sophisticated programs atop the Internet 791 00:50:35,400 --> 00:50:38,690 than thus far we've been able to in these black and white windows. 792 00:50:38,690 --> 00:50:42,140 Indeed, at this point in the semester even though we will spend relatively less time 793 00:50:42,140 --> 00:50:46,200 on PHP, HTML, CSS, JavaScript, SQL and more, 794 00:50:46,200 --> 00:50:48,480 most students do end up doing final projects that are web-based 795 00:50:48,480 --> 00:50:51,230 because as you'll see, the background you now have in C 796 00:50:51,230 --> 00:50:54,450 is very much applicable to these higher-level languages. 797 00:50:54,450 --> 00:50:56,800 >> And as you start thinking about your final project, 798 00:50:56,800 --> 00:50:59,940 which, much like Problem Set 0, where you were encouraged 799 00:50:59,940 --> 00:51:02,160 to do most anything of interest to you in Scratch, 800 00:51:02,160 --> 00:51:05,790 the final project is your opportunity to take your newfound knowledge and savvy with C 801 00:51:05,790 --> 00:51:09,850 or PHP or JavaScript or the like out for a spin 802 00:51:09,850 --> 00:51:12,330 and create your very own piece of software for the world to see. 803 00:51:12,330 --> 00:51:17,770 And to seed you with ideas, know that you can head here, projects.cs50.net. 804 00:51:17,770 --> 00:51:21,800 Every year, we solicit ideas from faculty and staff and student groups on campus 805 00:51:21,800 --> 00:51:27,330 just to submit their ideas for interesting things that could be solved using computers, 806 00:51:27,330 --> 00:51:29,860 using websites, using software. 807 00:51:29,860 --> 00:51:32,360 So if you're struggling to come up with an idea of your own, 808 00:51:32,360 --> 00:51:35,790 by all means scroll through the ideas there from this year and last. 809 00:51:35,790 --> 00:51:39,990 It is perfectly okay to tackle a project that has been tackled before. 810 00:51:39,990 --> 00:51:44,540 We have seen many apps for seeing the status of laundry on campus, 811 00:51:44,540 --> 00:51:47,000 many apps for navigating the dining hall menu, 812 00:51:47,000 --> 00:51:49,540 many apps for navigating the course catalog and the like. 813 00:51:49,540 --> 00:51:53,680 And indeed, in a future lecture and in future seminars, 814 00:51:53,680 --> 00:51:57,750 we will introduce you to some publicly available APIs, both commercially available 815 00:51:57,750 --> 00:52:02,520 as well as here available from CS50 on campus so that you have access to data 816 00:52:02,520 --> 00:52:04,910 and can then do interesting things with it. 817 00:52:04,910 --> 00:52:09,380 So more on final projects in a few days when we release the specification, 818 00:52:09,380 --> 00:52:12,990 but for now, know that you can work solo or with one or two friends 819 00:52:12,990 --> 00:52:16,010 on most any project of interest to you. 820 00:52:16,010 --> 00:52:18,080 The Internet. 821 00:52:18,080 --> 00:52:22,300 You go ahead and pull out your laptop, you go to facebook.com for the first time, 822 00:52:22,300 --> 00:52:27,020 not having logged in recently, and hit Enter. What exactly happens? 823 00:52:27,020 --> 00:52:30,150 >> When you hit Enter on your computer, a whole bunch of steps 824 00:52:30,150 --> 00:52:32,600 start sort of magically happening. 825 00:52:32,600 --> 00:52:35,960 So you here on the left, web server like Facebook is here on the right, 826 00:52:35,960 --> 00:52:42,500 and somehow you're using this language called HTTP, Hypertext Transfer Protocol. 827 00:52:42,500 --> 00:52:46,770 HTTP isn't a programming language. It's more of a protocol. 828 00:52:46,770 --> 00:52:52,310 It is a set of conventions that web browsers and web servers use when intercommunicating. 829 00:52:52,310 --> 00:52:54,360 And what this means is as follows. 830 00:52:54,360 --> 00:52:56,790 Much like in the real world, we have these conventions 831 00:52:56,790 --> 00:53:00,140 where if you meet some human for the first time, if you don't mind humoring me here, 832 00:53:00,140 --> 00:53:03,980 I might come up to you, say, "Hi, my name is David." >>Hi, David. My name is Sammy. 833 00:53:03,980 --> 00:53:05,770 "Hi, David. My name is Sammy." 834 00:53:05,770 --> 00:53:08,310 So now we have just engaged in this sort of silly human protocol 835 00:53:08,310 --> 00:53:12,200 where I have initiated the protocol, Sammy has responded, 836 00:53:12,200 --> 00:53:15,060 we've shaken hands, and the transaction is complete. 837 00:53:15,060 --> 00:53:18,260 HTTP is very similar in spirit. 838 00:53:18,260 --> 00:53:23,350 When your web browser requests www.facebook.com, 839 00:53:23,350 --> 00:53:27,020 what your browser is really doing is extending its hand, so to speak, 840 00:53:27,020 --> 00:53:29,960 to the server and it's sending it a message. 841 00:53:29,960 --> 00:53:34,220 And that message is typically something like get--what do you want to get?-- 842 00:53:34,220 --> 00:53:38,740 get me the home page, which is typically denoted by a single slash at the end of a URL. 843 00:53:38,740 --> 00:53:43,790 And just so you know what language I'm speaking, I the browser am going to tell you 844 00:53:43,790 --> 00:53:46,930 that I'm speaking HTTP version 1.1, 845 00:53:46,930 --> 00:53:51,980 And also for good measure, I'm going to tell you that the host that I want the home page of 846 00:53:51,980 --> 00:53:54,120 is facebook.com. 847 00:53:54,120 --> 00:53:57,730 Typically, a web browser, unbeknownst to you, the human, 848 00:53:57,730 --> 00:54:03,350 sends this message across the Internet when you simply type www.facebook.com, 849 00:54:03,350 --> 00:54:05,370 >> Enter, into your browser. 850 00:54:05,370 --> 00:54:07,300 And what does Facebook respond with? 851 00:54:07,300 --> 00:54:12,540 It responds with some similar-looking cryptic details but also much more. 852 00:54:12,540 --> 00:54:14,310 Let me go ahead to Facebook's home page here. 853 00:54:14,310 --> 00:54:17,480 This is the screen that most of us probably never see if you stay logged in all of the time, 854 00:54:17,480 --> 00:54:19,830 but this is indeed their home page. 855 00:54:19,830 --> 00:54:24,150 If we do this in Chrome, notice that you can pull up these little context menus. 856 00:54:24,150 --> 00:54:26,980 Using Chrome, whether on Mac OS, Windows, Linux, or the like, 857 00:54:26,980 --> 00:54:31,840 if you Control click or left click, you can typically pull up a menu that looks like this, 858 00:54:31,840 --> 00:54:35,870 where a few options await, one of which is View Page Source. 859 00:54:35,870 --> 00:54:39,920 You can also typically get to these things by going to the View menu and poking around. 860 00:54:39,920 --> 00:54:42,750 For instance, here under View, Developer is the same thing. 861 00:54:42,750 --> 00:54:45,780 I'm going to go ahead and look at View Page Source. 862 00:54:45,780 --> 00:54:50,800 What you'll see is the HTML that Mark has written to represent facebook.com. 863 00:54:50,800 --> 00:54:55,910 It's a complete mess here, but we'll see that this makes a little more sense before long. 864 00:54:55,910 --> 00:54:59,840 But there are some patterns here. Let me scroll down to stuff like this. 865 00:54:59,840 --> 00:55:05,730 This is hard for a human to read, but notice that there's this pattern of angled brackets 866 00:55:05,730 --> 00:55:10,360 with keywords like option, keywords like value, some quoted strings. 867 00:55:10,360 --> 00:55:15,660 This is where, when you signed up for the very first time, specified what your birth year is. 868 00:55:15,660 --> 00:55:19,020 That drop-down menu of birth years is somehow encoded here 869 00:55:19,020 --> 00:55:23,870 in this language called HTML, HyperText Markup Language. 870 00:55:23,870 --> 00:55:27,730 In other words, when your browser requests a web page, 871 00:55:27,730 --> 00:55:30,610 it speaks this convention called HTTP. 872 00:55:30,610 --> 00:55:35,170 But what does facebook.com respond to that request with? 873 00:55:35,170 --> 00:55:38,260 >> It responds with some of these cryptic messages, as we'll see in a moment. 874 00:55:38,260 --> 00:55:43,760 But most of its response is in the form of HTML, HyperText Markup Language. 875 00:55:43,760 --> 00:55:47,170 That's the actual language in which a web page is written. 876 00:55:47,170 --> 00:55:52,030 And what a web browser really does then is, upon receipt of something that looks like this, 877 00:55:52,030 --> 00:55:57,120 reads it top to bottom, left to right, and any time it sees one of these angled brackets 878 00:55:57,120 --> 00:56:03,370 followed by a keyword like option, it displays that markup language in the appropriate way. 879 00:56:03,370 --> 00:56:06,820 In this case, it would display a drop-down menu of years. 880 00:56:06,820 --> 00:56:09,240 But again, this is a complete mess to look at. 881 00:56:09,240 --> 00:56:16,630 This is not because Facebook developers manifest 0 for 5 for style, for instance. 882 00:56:16,630 --> 00:56:20,190 This is because most of the code that they write is, in fact, written beautifully, 883 00:56:20,190 --> 00:56:22,450 well commented, nicely indented, and the like, 884 00:56:22,450 --> 00:56:26,080 but of course machines, computers, browsers really don't give a damn 885 00:56:26,080 --> 00:56:27,890 whether your code is well-styled. 886 00:56:27,890 --> 00:56:33,100 And in fact, it's completely wasteful to hit the tab key all those times 887 00:56:33,100 --> 00:56:37,650 and to put comments all throughout your code and to choose really descriptive variable names 888 00:56:37,650 --> 00:56:42,340 because if the browser doesn't care, all you're doing at the end of the day is wasting bytes. 889 00:56:42,340 --> 00:56:46,660 >> So it turns out what most websites do is even though the source code for facebook.com, 890 00:56:46,660 --> 00:56:49,550 for cs50.net and all of these other websites on the Internet 891 00:56:49,550 --> 00:56:53,730 are typically well written and well commented and nicely indented and the like, 892 00:56:53,730 --> 00:56:59,270 typically before the website is put onto the Internet, the code is minified, 893 00:56:59,270 --> 00:57:02,970 whereby the HTML and the CSS--something else we'll soon see-- 894 00:57:02,970 --> 00:57:05,960 the JavaScript code we'll soon see is compressed, 895 00:57:05,960 --> 00:57:09,250 whereby long variable names become X and Y and Z, 896 00:57:09,250 --> 00:57:13,900 and all of that whitespace that makes everything look so readable is all thrown away, 897 00:57:13,900 --> 00:57:17,700 because if you think about it this way, Facebook gets a billion page hits a day-- 898 00:57:17,700 --> 00:57:21,670 something crazy like that--so what if a programmer just to be anal 899 00:57:21,670 --> 00:57:26,660 hit the space bar one extra time just to indent some line of code ever so much more? 900 00:57:26,660 --> 00:57:29,500 What's the implication if Facebook preserves that whitespace 901 00:57:29,500 --> 00:57:32,880 in all of the bytes they send back to people on the Internet? 902 00:57:32,880 --> 00:57:36,400 Hitting the space bar once gives you an extra byte in your file. 903 00:57:36,400 --> 00:57:39,730 And if a billion people then proceed to download the home page that day, 904 00:57:39,730 --> 00:57:42,060 how much more data have you transmitted over the Internet? 905 00:57:42,060 --> 00:57:45,200 A gigabyte for no good reason. 906 00:57:45,200 --> 00:57:48,510 And granted, for a lot of websites this is not such a scalable issue, 907 00:57:48,510 --> 00:57:51,030 but for Facebook, for Google, for some of the most popular websites 908 00:57:51,030 --> 00:57:54,860 there's great incentive financially to make your code look like a mess 909 00:57:54,860 --> 00:57:58,980 so that you're using as few bytes as possible in addition to then compressing it 910 00:57:58,980 --> 00:58:01,500 using something like zip, an algorithm called gzip, 911 00:58:01,500 --> 00:58:04,250 that the browser does for you automatically. But this is awful. 912 00:58:04,250 --> 00:58:08,060 We'll never learn anything about other people's websites and how to design web pages 913 00:58:08,060 --> 00:58:09,680 if we have to look at it like this. 914 00:58:09,680 --> 00:58:13,620 >> So fortunately, browsers like Chrome and IE and Firefox these days 915 00:58:13,620 --> 00:58:16,450 typically come with built-in developer tools. 916 00:58:16,450 --> 00:58:21,730 In fact, if I go down here to Inspect Element or if I go to View, Developer, 917 00:58:21,730 --> 00:58:25,220 and go to Developer Tools explicitly, 918 00:58:25,220 --> 00:58:27,640 this window at the bottom of my screen now pops up. 919 00:58:27,640 --> 00:58:31,230 It's a little intimidating at first because there's a lot of unfamiliar tabs here, 920 00:58:31,230 --> 00:58:34,510 but if I click on Elements all the way at the bottom left, 921 00:58:34,510 --> 00:58:38,810 Chrome is obviously pretty smart. It knows how to interpret all of this code. 922 00:58:38,810 --> 00:58:42,320 And so what Chrome does is it cleans up all of Facebook's HTML. 923 00:58:42,320 --> 00:58:45,680 Even though there's not whitespace there, there's not indentation there, 924 00:58:45,680 --> 00:58:51,120 now notice that I can begin to navigate this web page all the more hierarchically. 925 00:58:51,120 --> 00:58:56,910 It turns out that every web page written in a language called HTML5 should start with this, 926 00:58:56,910 --> 00:59:03,980 this DOCTYPE declaration, so to speak: 927 00:59:03,980 --> 00:59:07,840 It's kind of light and gray there, but that's the very first line of code in this file, 928 00:59:07,840 --> 00:59:12,080 and that just tells the browser, "Hey, here comes some HTML5. Here comes a web page." 929 00:59:12,080 --> 00:59:18,490 The first open bracket beyond that happens to be this thing, an open bracket HTML tag, 930 00:59:18,490 --> 00:59:22,320 and then if I dive in deeper--these arrows are completely meaningless; 931 00:59:22,320 --> 00:59:25,140 they are just for presentation's sake, they are not actually in the file-- 932 00:59:25,140 --> 00:59:30,300 notice that inside of Facebook's HTML tag, anything that starts with an open bracket 933 00:59:30,300 --> 00:59:32,910 and then has a word is called a tag. 934 00:59:32,910 --> 00:59:38,610 So inside the HTML tag is apparently a head tag and a body tag. 935 00:59:38,610 --> 00:59:41,930 Inside of the head tag now is a whole mess for Facebook 936 00:59:41,930 --> 00:59:45,620 because they have a lot of metadata and other things for marketing and advertising. 937 00:59:45,620 --> 00:59:50,600 >> But if we scroll down, down, down, down, let's see where it is. Here it is. 938 00:59:50,600 --> 00:59:52,210 This one is at least somewhat familiar. 939 00:59:52,210 --> 00:59:55,990 The title of Facebook's home page, if you ever look in the tab in your title bar, 940 00:59:55,990 --> 00:59:59,060 is Welcome to Facebook - Log In, Sign Up or Learn More. 941 00:59:59,060 --> 01:00:01,110 That's what you would see in Chrome's title bar, 942 01:00:01,110 --> 01:00:03,100 and that's how it's represented in code. 943 01:00:03,100 --> 01:00:08,090 If we ignore everything else in the head, most of the guts of a web page are in the body, 944 01:00:08,090 --> 01:00:10,940 and it turns out that Facebook's code is going to look more complex 945 01:00:10,940 --> 01:00:14,540 than most things we'll write initially just because it's been built up over the years, 946 01:00:14,540 --> 01:00:17,260 but there's a whole lot of script tags, JavaScript code, 947 01:00:17,260 --> 01:00:18,870 that makes the website very interactive: 948 01:00:18,870 --> 01:00:22,330 seeing status updates instantaneously using languages like JavaScript. 949 01:00:22,330 --> 01:00:25,270 There's something called a div, which is a division of a page. 950 01:00:25,270 --> 01:00:27,940 But before we get to that detail, let's try to zoom out 951 01:00:27,940 --> 01:00:31,920 and look at a simpler version of Facebook 1.0, so to speak. 952 01:00:31,920 --> 01:00:34,740 Here is the hello, world of web pages. 953 01:00:34,740 --> 01:00:37,370 It has that DOCTYPE declaration at the very top 954 01:00:37,370 --> 01:00:40,280 which is a little different from everything else. 955 01:00:40,280 --> 01:00:46,130 Nothing else we write in a web page is going to start with 01:00:48,880 and except for something called comments in HTML. 957 01:00:48,880 --> 01:00:53,000 But for the most part, everything in a web page is open bracket, keyword, close bracket. 958 01:00:53,000 --> 01:00:56,220 >> In this case you can see the simplest of web pages possible. 959 01:00:56,220 --> 01:01:00,260 The HTML tag contains a head tag and it contains a body tag, 960 01:01:00,260 --> 01:01:04,580 but notice that there's this notion of starting and stopping tags. 961 01:01:04,580 --> 01:01:11,360 This is the start tag for HTML, this is the close tag or end tag. 962 01:01:11,360 --> 01:01:15,400 Notice that they're sort of opposites in the sense that the close tag or end tag 963 01:01:15,400 --> 01:01:20,030 has this forward slash inside of itself. 964 01:01:20,030 --> 01:01:23,540 Meanwhile, there's an open head tag here and a close head tag here. 965 01:01:23,540 --> 01:01:26,880 >> There's an open title and a close title tag here. 966 01:01:26,880 --> 01:01:29,850 The fact that I've put the title on one line, purely arbitrary. 967 01:01:29,850 --> 01:01:33,760 It just looked like it would fit nicely on one line, so I didn't bother hitting Enter a couple times. 968 01:01:33,760 --> 01:01:38,200 Meanwhile, the body I did indent just to be ever so clear. 969 01:01:38,200 --> 01:01:41,050 Notice that HTML is a pretty dumb language. 970 01:01:41,050 --> 01:01:43,410 In fact, back in the day before there were WYSIWYG editors 971 01:01:43,410 --> 01:01:46,770 and Microsoft Word where you can say, "Make this bold, make this italics," 972 01:01:46,770 --> 01:01:50,850 you would actually type little commands in essays 20+ years ago 973 01:01:50,850 --> 01:01:55,740 whereby you would say, "Start making this text bold. Stop making this text bold." 974 01:01:55,740 --> 01:01:59,010 "Start making this text italics. Stop making this text italics." 975 01:01:59,010 --> 01:02:01,850 >> That's what HTML or any markup language is. 976 01:02:01,850 --> 01:02:05,530 This first tag says, "Hey, browser. Here comes some HTML." 977 01:02:05,530 --> 01:02:09,880 The next tag says, "Hey, browser. Here comes the head, the header of my web page." 978 01:02:09,880 --> 01:02:11,650 "Hey, browser. Here comes the title." 979 01:02:11,650 --> 01:02:15,880 And then over here, "Hey, browser. That's it for the title." 980 01:02:15,880 --> 01:02:20,000 So this is how the browser knows to no longer display more characters than hello, world 981 01:02:20,000 --> 01:02:21,860 in the title bar. 982 01:02:21,860 --> 01:02:23,640 Meanwhile, this says, "That's it for the head." 983 01:02:23,640 --> 01:02:28,340 This says, "Here comes the body. Here is the actual body"--literally, the words hello, world. 984 01:02:28,340 --> 01:02:33,190 And this says here, "That's it for the body. That's it for the HTML." 985 01:02:33,190 --> 01:02:34,640 So browsers are pretty dumb. 986 01:02:34,640 --> 01:02:39,920 They just read this stuff top to bottom, left to right, and do exactly what they are told to do. 987 01:02:39,920 --> 01:02:41,860 Let's actually do a little example here. 988 01:02:41,860 --> 01:02:46,240 Let me open up the simplest of programs on my Mac here, namely TextEdit. 989 01:02:46,240 --> 01:02:48,220 On Windows you could use Notepad.exe. 990 01:02:48,220 --> 01:02:50,520 But this is all you need to start making web pages. 991 01:02:50,520 --> 01:02:53,730 I'm going to go ahead and just copy and paste this code into this file. 992 01:02:53,730 --> 01:02:57,210 I'm going to go ahead and save it on my desktop, 993 01:02:57,210 --> 01:03:01,220 and I'm going to save this as hello.html, 994 01:03:01,220 --> 01:03:03,840 and now the file is named hello.html. 995 01:03:03,840 --> 01:03:05,690 Here it is on my desktop. 996 01:03:05,690 --> 01:03:11,130 Let me now go into a browser and drag the file into the browser. 997 01:03:11,130 --> 01:03:14,060 And voila, here is my very first web page. 998 01:03:14,060 --> 01:03:17,340 Notice that the title of the tab is hello, world as per the title tag, 999 01:03:17,340 --> 01:03:20,040 and notice that hello, world is the body of my web page, 1000 01:03:20,040 --> 01:03:22,190 and woo-hoo, I'm on the Internet. 1001 01:03:22,190 --> 01:03:24,700 >> I'm not really, right, because this file is not on the Internet. 1002 01:03:24,700 --> 01:03:28,330 It happens to be on my local hard drive at that particular path. 1003 01:03:28,330 --> 01:03:32,720 But the idea is the same. All we now need is a web server to which to upload it. 1004 01:03:32,720 --> 01:03:37,410 But first let's actually introduce a little more complexity and a little more stylization. 1005 01:03:37,410 --> 01:03:39,890 This is a simple, if boring, web page. 1006 01:03:39,890 --> 01:03:41,990 It turns out there are other types of tags we can use. 1007 01:03:41,990 --> 01:03:45,530 For instance, here in yellow I've introduced 2 new tags. 1008 01:03:45,530 --> 01:03:49,630 We won't play much with these today, but notice that the link tag 1009 01:03:49,630 --> 01:03:52,520 somehow looks different from everything else. 1010 01:03:52,520 --> 01:03:55,370 The link tag takes what are called attributes, 1011 01:03:55,370 --> 01:03:59,770 and an attribute is something that modifies the behavior of a tag. 1012 01:03:59,770 --> 01:04:03,840 In this case this is not the best choice of names, link, because it's kind of meaningless, 1013 01:04:03,840 --> 01:04:11,590 but this link tag says, essentially, include the file called styles.css inside of my web page. 1014 01:04:11,590 --> 01:04:15,400 You can think of this as analogous to C's #include directive. 1015 01:04:15,400 --> 01:04:19,650 Styles.css is referring to a different language altogether that we won't play with today, 1016 01:04:19,650 --> 01:04:23,790 but it's for aesthetics: font sizes, colors, padding, indentation, margins, 1017 01:04:23,790 --> 01:04:26,040 and all of that kind of aesthetics detail. 1018 01:04:26,040 --> 01:04:28,820 Meanwhile, the script tag is functionally similar, 1019 01:04:28,820 --> 01:04:33,140 but rather than include CSS, that language, it includes another language, JavaScript. 1020 01:04:33,140 --> 01:04:37,810 So in other words, with these 2 tags I will eventually be able to write my own web page 1021 01:04:37,810 --> 01:04:41,490 but also pull in code that I or someone else has written 1022 01:04:41,490 --> 01:04:44,350 so that we can stand on other people's shoulders, we can practice good design, 1023 01:04:44,350 --> 01:04:46,120 factoring out common code. 1024 01:04:46,120 --> 01:04:49,090 If I've got 10 different web pages, this means that some of my aesthetics 1025 01:04:49,090 --> 01:04:52,490 can be factored out, much like #include, into a separate file. 1026 01:04:52,490 --> 01:04:54,420 So we're getting there. 1027 01:04:54,420 --> 01:04:57,180 But let's actually first do something more interesting with this file. 1028 01:04:57,180 --> 01:05:01,110 >> Again, this is just TextEdit. I'm not technically on the Internet yet, but we'll get there. 1029 01:05:01,110 --> 01:05:04,910 I'd like to make hello, world a little bolder than it is. 1030 01:05:04,910 --> 01:05:10,890 So hello, let's arbitrarily say for bold. 1031 01:05:10,890 --> 01:05:15,910 Again, the story is the same: h-e-l-l-o, comma, start making this bold, 1032 01:05:15,910 --> 01:05:19,730 then world gets printed in bold, and this means stop printing this in bold. 1033 01:05:19,730 --> 01:05:24,020 Let me go ahead and save my file, go back to Chrome, I'll zoom in just so we can see it better, 1034 01:05:24,020 --> 01:05:27,870 and reload, and you'll see that world is now in bold. 1035 01:05:27,870 --> 01:05:31,810 The Web is all about hyperlinks, so let's go ahead and do this: 1036 01:05:31,810 --> 01:05:38,550 my favorite website is, let's say, youtube.com. 1037 01:05:38,550 --> 01:05:43,810 Save, reload. Okay. There's a couple problems now besides the hideousness of the website. 1038 01:05:43,810 --> 01:05:47,310 1, I'm pretty sure I hit Enter here. And I did. 1039 01:05:47,310 --> 01:05:51,590 I not only hit Enter, I also indented, practicing what we've been preaching about style, 1040 01:05:51,590 --> 01:05:54,930 but my is right next to world. 1041 01:05:54,930 --> 01:05:58,410 So why is this? Browsers only do what you tell them to do. 1042 01:05:58,410 --> 01:06:04,010 I have not told the browser, "Break lines here. Insert paragraph break here." 1043 01:06:04,010 --> 01:06:07,820 So the browser, it doesn't matter if I hit Return 30 times, 1044 01:06:07,820 --> 01:06:10,820 it's still going to put my right next to world. 1045 01:06:10,820 --> 01:06:15,930 What I really have to do here is say something like
, insert a line break. 1046 01:06:15,930 --> 01:06:17,940 >> And actually, a line break is kind of a weird thing 1047 01:06:17,940 --> 01:06:21,650 because you can't really start moving to another line, then do something, 1048 01:06:21,650 --> 01:06:25,380 and then stop moving to a new line. It's kind of an atomic operation. 1049 01:06:25,380 --> 01:06:28,140 You either do it or you don't. You hit Enter or you don't. 1050 01:06:28,140 --> 01:06:33,390 So br is a little bit of a different tag, and so I need to sort of both open and close it 1051 01:06:33,390 --> 01:06:35,230 all at once. 1052 01:06:35,230 --> 01:06:37,500 The syntax for that is this. 1053 01:06:37,500 --> 01:06:41,760 Technically, you could do something like this in some versions of HTML, 1054 01:06:41,760 --> 01:06:45,600 but this is just stupid because there's no reason to start and stop something 1055 01:06:45,600 --> 01:06:48,420 if you can instead do it all at once. 1056 01:06:48,420 --> 01:06:52,310 Realize that HTML5 does not strictly require this slash, 1057 01:06:52,310 --> 01:06:55,410 so you will see textbooks and online resources that don't have it, 1058 01:06:55,410 --> 01:06:59,780 but for good measure let's practice the symmetry that we've seen thus far. 1059 01:06:59,780 --> 01:07:02,870 This means that the tag is both opened and closed. 1060 01:07:02,870 --> 01:07:05,220 So now let me save my file, go back here. 1061 01:07:05,220 --> 01:07:10,240 Okay, so it's starting to look better, except the Web I know is kind of clickable, 1062 01:07:10,240 --> 01:07:13,610 and yet youtube here doesn't seem to lead to anything. 1063 01:07:13,610 --> 01:07:17,560 That's because even though it looks like a link, the browser doesn't know that per se, 1064 01:07:17,560 --> 01:07:20,670 so I have to tell the browser that this is a link. 1065 01:07:20,670 --> 01:07:22,620 >> The way to do this is to use an anchor tag: 1066 01:07:22,620 --> 01:07:26,770 01:07:35,900 ="http://www.youtube.com"> 1068 01:07:35,900 --> 01:07:38,490 and let me move this to a new line just so it's a little more readable, 1069 01:07:38,490 --> 01:07:40,060 and I'll shrink the font size. 1070 01:07:40,060 --> 01:07:43,890 Am I done yet? No. There's going to be this dichotomy. 1071 01:07:43,890 --> 01:07:46,760 This tag, the anchor tag, does indeed take an attribute, 1072 01:07:46,760 --> 01:07:52,900 which modifies its behavior, and the value of that attribute is apparently YouTube's URL. 1073 01:07:52,900 --> 01:07:56,380 But notice the dichotomy is that just because that's the URL you're going to, 1074 01:07:56,380 --> 01:08:01,020 that doesn't mean that has to be the word that you're underlining and making a link. 1075 01:08:01,020 --> 01:08:03,960 Rather, that can be something like this. 1076 01:08:03,960 --> 01:08:10,870 So I have to say stop making this word a hyperlink by using the close anchor tag. 1077 01:08:10,870 --> 01:08:12,650 Notice I'm not doing this. 1078 01:08:12,650 --> 01:08:15,890 1, this would just be a waste of everyone's time and it's not necessary. 1079 01:08:15,890 --> 01:08:19,290 >> To close a tag, you only mention the name of the tag again. 1080 01:08:19,290 --> 01:08:21,800 You don't mention any of the attributes. 1081 01:08:21,800 --> 01:08:26,189 So let's save that, go back. Okay, voila, now it's blue and hyperlinked. 1082 01:08:26,189 --> 01:08:29,430 If I click it, I actually do go to YouTube. 1083 01:08:29,430 --> 01:08:32,529 So even though my web page is not on the Internet, it is at least HTML, 1084 01:08:32,529 --> 01:08:37,930 and if we let the Internet catch up, we would actually end up here at youtube.com. 1085 01:08:37,930 --> 01:08:40,670 And I can go back and here's my web page. But notice this. 1086 01:08:40,670 --> 01:08:43,120 If you've ever gotten spam or a phishing attack, 1087 01:08:43,120 --> 01:08:45,850 now you have the ability after just five minutes to do the same. 1088 01:08:45,850 --> 01:08:50,920 We can go here and do something like www.badguy.com 1089 01:08:50,920 --> 01:08:59,319 or whatever the sketchy website is, and then you can say verify your PayPal account. 1090 01:08:59,319 --> 01:09:04,840 [laughter] And now this is going to go to badguy.com, which I'm not going to click on 1091 01:09:04,840 --> 01:09:08,000 because I have no idea where that leads. [laughter] 1092 01:09:08,000 --> 01:09:10,859 >> But we now have the ability to actually end up there. 1093 01:09:10,859 --> 01:09:12,640 So we're really just starting to scratch the surface. 1094 01:09:12,640 --> 01:09:15,830 We're not programming per se; we're writing markup language. 1095 01:09:15,830 --> 01:09:18,569 But as soon as we round out our vocabulary in HTML, 1096 01:09:18,569 --> 01:09:21,520 we'll introduce PHP, an actual programming language 1097 01:09:21,520 --> 01:09:26,859 that will allow us to generate HTML automatically, generate CSS automatically, 1098 01:09:26,859 --> 01:09:29,430 so that we can begin on Wednesday to implement, say, 1099 01:09:29,430 --> 01:09:31,700 our own search engine and more. 1100 01:09:31,700 --> 01:09:34,770 But more on that in a couple of days. We'll see you then. 1101 01:09:34,870 --> 01:09:39,000 >> [CS50.TV]