[Week 7, Continued] [David J. Malan, Harvard University] [This is CS50.] [CS50.TV] All right. Welcome Back. This is CS50, and this is the end of week 7. So one of these stupid little things that goes around the Internet and we slurped up, and it should now make a little bit of geeky sense to you. Well, it was funnier to this guy than it was to you guys. Speaking of, well, guys, today is Nate's birthday. To give you a sense of just how good Nate and I are at web development based on Monday's class and based now on this, I thought I'd pull up Nate's home page, if you haven't seen it yet. This here is Nate's HTML. So see his source code if you'd like to see how to do this, and Nate, if we could embarrass you just briefly, the staff got you a little something if you'd like to share some dessert with some of the kids in the class here. If you'd like to come on down. You all applaud and are very nice, but no one is sitting anywhere near Nate, for some reason, in that back zone. So perhaps you can find some folks to enjoy these with. Happy Birthday, Nate. Additional hellos: We showed a couple clips from our CS50x students. If you would like to see who else it is in the world that's following along, you can head to this URL, where Joseph, one of our TFs, has put together a montage of sorts of everyone who has been submitting these videos, among them Rick Astley. And if you scroll through these, it's really quite inspiring to see the diversity of countries and cities from which people are hailing. So if you'd like to take a look at that, that will be up through the end of the semester. Today we continue our look at the Web, web programming, HTML and the like, and we also have lunch coming up this Friday if you would like, and particularly, have not done so before. This Friday's theme will be Nate's birthday, so if you would like to have birthday lunch with Nate and others, some of our friends from industry, please head to that URL there. Space, as always, is limited. 
Also, if you've forgotten, realize that next week is the deadline for problem set 4's scavenger hunt, whereby after recovering all of those JPEGs from card.raw, you and your section mates, if you would like, can try photographing as many of the computer scientists from that memory card as possible, and you and your section will then win a fabulous prize. Refer back to pset 4's specification as to what to submit and by when. Also, if you would like to have your handiwork immortalized on the course's website and its history of apparel, know that you are welcome now to start submitting designs for this year's T-shirts and sweatshirts and the like. We'll do our best to include as many as we can, but we'll have some members of the staff review all of the designs to make sure they're consistent with the specifications, and we then pick generally a handful of them to be exhibited. So if you are the design type, just know that the requirements for graphics are PNG, at least 200 DPI; they shouldn't be more than 4000 x 4000 pixels, and no more than 10 MB, but you're welcome to use things like Photoshop or GIMP or various graphics programs, whatever you have at your disposal. Also on the horizon is the final project. The final project really is the climax of 50, whereby of all the assignments in the course, it's your opportunity really to do your own thing. And that can be simply to do something for fun, it can be to solve some pressing problem your student group has, for some new website, some new collection mechanism for data. It can be a mobile application for Android, for iOS. 
Really, the sky is the limit, and over the next few weeks, as we transition from C to these higher-level languages like PHP and JavaScript, you'll find yourself increasingly familiarized with some real-world techniques, some real-world tools, and to supplement that, know that the course has a history of seminars, whereby over the next several weeks, some of the teaching staff and friends of ours from on campus will offer optional seminars which go above and beyond what's typically done in section to introduce you to things like Android programming, to introduce you to things like iOS programming or more advanced web-development techniques. There's a whole history of these already online. If you go to cs50.net/seminars, we've been doing this for quite some years, and you'll see that archived here with PDFs and videos and the like are several dozen videos of seminars. Last year, for instance, we had a seminar on acing your technical interviews, if you're actually looking to go off and do an internship or full-time gig. Windows mobile development, Android development, Google Maps, API, CSS, developing for the BlackBerry, Emacs. Really, you are welcome to take a look at any of these seminars at your convenience. And we'll be holding some new ones this semester, as well. So what is ahead with the final project? Well, first, even though this date is somewhat imminent, this is really just an opportunity to start thinking about the final project quite realistically. We know only the beginnings of some of what we'll still be covering in the course-- HTML, PHP and the like--but you're all familiar with the Web, and I bias this conversation toward the Web only because most people end up doing Web-based final projects, but that is by no means requisite. Using C is fine, Objective-C, Java, any other language you might know or want to know is quite fine. 
But to get the juices flowing initially, we'll expect the submission of a preproposal which, per the PDF on the website, which is now at cs50.net, and at the top left you'll see final project is the specification for the final project, and in there are details on the preproposal and the like. It pretty much boils down to an email to your teaching fellow just to strike up a conversation with him or her about what you're thinking. On projects.cs50.net is a repository of ideas from folks on campus if you're struggling to come up with some idea, and manual.cs50.net/apis is a repository of links to APIs. What, though, is an API? What's an API? I've said it at least twice, according to the transcripts of the past several weeks. What's that? [Student, unintelligible] >>Okay, good. So something programming interface. Application programming interface, and this can take several forms, but what this really boils down to is code that someone else has written or data that someone else has collected that is made available to you in some programmatic way. You can write code in C, PHP, Python, Ruby, whatever your language of choice typically is, and you can somehow build upon someone else's functionality or someone else's data set. For instance, if I go to this link here, and you'll see a pair of links on the subsequent page whereby we have CS50's own APIs, which are very Harvard-centric, and then third-party APIs. Among the third-party APIs are really useful things like being able to send SMS's to people, being able to receive SMS text messages from people. And things like that that you might have no idea how to implement yourself, but thanks to services, some free and some commercial, you can build atop those and do something of interest to you. Among CS50's APIs are these campus-centric things like Harvard courses, energy, events, food, maps, news, tweets, and Shuttleboy's own, and these are APIs that look a little something like this. Let me pull up the HarvardFood API. 
If you've ever been to HUD's website, you've probably been there to just see what's for dinner or to see what the hours are for some d-hall. Well, it's not particularly easy to navigate, and so what we did some time ago was we wrote software-- it happens to be in PHP--that actually screen scrapes the entirety of HUD's website. To screen scrape something means to write a program in a language like PHP that pretends to be a browser, even though you might run it at a command prompt, that pretends to be a browser, connects to a website, downloads its HTML, the language in which it's written, and then reads it, or more specifically, parses it top to bottom, left to right. And what we did was we wrote our code in such a way that any time we saw something in that HTML that looked like something on the menu, like hamburger, we would then import that into our own database. And any time we saw nutritional content, we would import that into our own database. And what we did was leverage the fact that HUD's website, even though it might be a bit of a challenge for us humans to navigate underneath the hood, all of the HTML is generated by their own computer programs. So all of their HTML, even though it might look messy, like most websites underneath the hood, it follows a pattern. So we just spent a couple hours figuring out that pattern so that in the end, we throw away all of the messy HTML, all of the aesthetics of bold facing and italics and the like, and what we are then able to do is expose that same data. For instance, in this way. 
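To make the screen-scraping idea concrete, here is a minimal sketch in Python rather than the PHP the course used. The markup in SAMPLE_HTML and the class name "item" are invented for illustration; the real dining-hall pages and the actual pattern the staff relied on are not shown in the lecture. The point is only that once generated HTML follows a pattern, a parser can pull out the data and ignore the bold facing and italics around it.

```python
from html.parser import HTMLParser

# Hypothetical markup: a stand-in for a machine-generated menu page.
SAMPLE_HTML = """
<html><body>
  <div class="menu">
    <b><span class="item">Hamburger</span></b>
    <i><span class="item">Veggie Burger</span></i>
    <span class="note">Hours: 5pm-7pm</span>
  </div>
</body></html>
"""


class MenuParser(HTMLParser):
    """Collects the text inside every <span class="item"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "span" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_item = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a menu item.
        if self.in_item and data.strip():
            self.items.append(data.strip())


def scrape_menu(html):
    """Parse the page top to bottom and return the menu item names."""
    parser = MenuParser()
    parser.feed(html)
    return parser.items


print(scrape_menu(SAMPLE_HTML))
```

A real scraper would first download the page over HTTP and then store the items in a database; both steps are left out here to keep the sketch self-contained.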
So we, according to the documentation here, have informed the world that if you request a URL that looks like this, food.cs50.net/something, and you provide certain parameters, which we'll talk about today, like end-date time, start-date time, meal, and so forth, what our servers will return to you, for instance, is a CSV file, comma-separated values like an Excel file, containing everything for breakfast on this particular date in March of last year when I happened to write up this documentation. For those familiar, CSV is not the only file format. There's another format that's all the more versatile called JSON, JavaScript Object Notation. The data can come back in that format. So the takeaway here is that whether you dive into this API or any other of CS50's or anything out there on the Internet, or not at all, realize that the world has increasingly started to standardize how machines intercommunicate. We use standard data formats like CSV or JSON. And what this means for you is you can write the interesting part of a program that lets your user search a dining hall menu, that lets them create lists of favorites, that lets them get text alerts when their favorite meal is about to be served in some d-hall by using someone else's data sets and building on top of their APIs. So more on that in the form of seminars and the documentation that you have here online. So those, then, are APIs. That brings us back to HTML. Quick recap. What is HTML? [Student, unintelligible] >>Good. HyperText Markup Language. Someone else, what is HyperText Markup Language? HyperText Markup Language. Okay. So HTML, HyperText. HyperText just refers to the Web, for the most part. Markup means that it's not actually a programming language, HTML. It's not a language that you can express logic in. It doesn't have loops. It doesn't have conditions. It doesn't have functions, per se. Rather, it has these things called tags, or, more properly, elements. 
And those elements have start tags and end tags, or open tags and close tags, and what those tags generally mean for a browser is, start doing something and then stop doing something, though there are exceptions to that. Sometimes it's just "put a line break here," for instance. And we saw examples of that the other day, between bold facing, line breaks, and then a couple of other tags. So HTML is the language in which web pages are written. So if I go to something like Google.com and pull up just their home page, recall that if you right click or control click and look at view page source, typically it's a complete mess these days underneath the hood, but that's because computers don't care about white space, so this doesn't have to look pretty. But if we zoom in on parts of it, notice that Chrome, just to be nice, has color coded things. Indeed, this is the very first tag that we saw in a web page. And again, HTML 5, the latest version of this language, does have this thing at the beginning, the doctype declaration, <!DOCTYPE html>. [...] >>Yeah, we've solved this before by explicitly telling the browser "put a line break here." And that's because, again, a browser's only going to do explicitly what the markup language tells it to do, so even though you might have hit Enter once or twice or even ten times, it's going to combine that all into a single space, just by convention. So if you really want a line break, you have to use the br tag, and now notice, like Monday, I put the / inside of this tag, only because this just doesn't feel right to start a line break then stop it with nothing in between. So the convention in HTML is to open and close a tag simultaneously. As an aside, you'll see a lot of websites and books not doing that. It is correct either to do it or not to do it, but we would argue that design-wise and stylistically, this is just better because then every tag is both opened and closed somehow. So now let's save and reload. Go back to the browser, okay. 
Now we're making some progress, but it's not quite enough. Let's go ahead and start typing in some longer body of text. So let's say, "A quick brown fox jumps over a lazy dog." And now let me just copy and paste this a few times so that we have a paragraph of text. Let me go back over here. So it's not looking very good. I do have a line break, so it's okay, but now, once we're getting to the point of having a web page that has lots of content and not just single lines to demonstrate HTML, we can start to think of these things as actual paragraphs. And we can start to structure our web page a little more cleanly. And indeed, what I can do is go up here inside of my body tag, and you know what, if "This is CS50. . ." really demarks the beginning of a paragraph, well, let's tag it as such. Let me indent the text; just by convention, let me say that this paragraph ends here, and then rather than do this line break, let me just say that this belongs there and as a new paragraph, and I'll just quickly indent by just clobbering all of this stuff. So now we have an indented paragraph there, and now our markup is starting to get a little more semantically consistent with what we're trying to do. We have a paragraph, so let's call it a paragraph with the p tag. We have a second paragraph, so let's call it a paragraph with the p tag. And now, what the browser will typically do is just like in an English book or essay, where you typically see some line breaks between paragraphs. Browsers will do that for you automatically. So now we have two paragraphs and we can continue this. But, of course, on the Web, when you have bodies of text, it's not typically just huge blobs of text. There are often hyperlinks in there. So if we want to, for instance, include some links there, suppose what might be of interest in whatever web page I'm creating here is-- let me go to Google.com, and let me search for a quick brown fox. Go to Google images, and, how about--this is cute. 
We'll go with this. So here we have a quick brown fox jumping over a lazy dog. So what I'm going to do here, just for the sake of demonstration, is suppose that this image was on my server, and I had been creating these images. What I just did was right click or control click on the image, and what you'll see in most browsers is a little menu-- stop doing that--a little menu that allows you to choose copy link location or copy URL. So let me go back now to my HTML, and suppose that I want to hyperlink this to another web page. What was the tag called for that? [Student, unintelligible] >>Yeah. So a href for hyper reference. Let me go ahead and paste that in. It's a pretty long URL, so let me zoom back out. Close brackets, so now notice I'm way over here because that URL happened to be pretty long. Let me scroll over here to the end of quick brown fox, and then let me close this tag with </a>. [...] So everything at the top in blue is just a comment. This is my doctype declaration, which again, you can just copy and paste on faith, for now. This just tells the browser, "Here comes some HTML 5." Below that, on line 14, is the first of my actual tags, and this just says, as before, here comes some HTML, here comes the head of my page, here comes the title, and then, conversely, that's it for the title, that's it for the head. Here now comes the body of my page. So a couple new tags now: h1 stands for heading 1. There's a tradition in HTML from many years back of having different sizes of text. And back in the day, h1 meant, generally, just big and bold. But there's also h2, which is big but not quite as big and bold. There's h3, which is kind of big but not nearly as big and bold, and so forth, all the way down to h6. These days, though, h1, h2, and h3 are really meant to have more semantic meaning to them, whereby h1 is really a heading: the heading of a web page, the heading of a column or something like that of text. So I've deliberately said

<h1>CS50 search</h1>

to specify that this is really the heading, the title of my page. Not the title in the title bar sense, but the title that you actually see in the web page itself, in the body. Now this, you can probably guess what it is, even though we have a few new pieces of syntax. This is a form. So the web really gets interesting when websites take input from users. In this class, in the problem set on web programming, we're not going to make a website, per se, with static content that shows photographs that you've taken, or this is my resume, and things about me, because those things are relatively easy to put together. It's hard to make things beautiful on the Web, but at least putting up content is pretty trivial. But things get really interesting when someone can visit your website and provide input and can fill out forms, can check off checkboxes and can interact with your website. And indeed, probably every website you care about these days, in any detail, is somehow interactive. Facebook, Google, and the like, that take user input and produce customized output. So let's start to do that now. Let's transition now from just using HTML for markup of static content to using it instead as a delivery mechanism for dynamic content. And toward that end, let's implement our own search engine. Let's do it as follows. Here's the form tag. The action attribute specifies that when the user fills out this form with their keyboard, it will be submitted to this URL here. So I'm kind of cheating. It's going to take us a little longer than one class to implement the whole search engine, so we'll just do the front end, so to speak. We'll do the part that lets the user search, and we'll sort of punt to Google the hard part of finding search results, but, specifically, I'm going to talk to Google's web server using one of two very popular methods. One being get, another, that we'll eventually see, being post, although there are others that are less often used. 
So get just conjures up the idea of I want to get some content, get some search results. This, you can perhaps guess what this does. This is some kind of input; it's, in fact, going to look like a text field, and the name of that input, the name of that variable, so to speak, is going to be q for query by convention. And again, the type of this input is not going to be a checkbox; it's not going to be a menu; it's going to be a text field as denoted by this attribute here, and this text box, like a line break, is either there or not. So we have an empty element with the slash inside that tag. Then I'm going to put a line break, and you can, perhaps, guess what this is going to do. This is another sort of form input. This one's going to be used for submitting the form. So this is going to be the big button that the user can click to submit the form, and the label on that button is going to be "CS50 Search." Close form, close body, close HTML. Let's see what we have in the form of this web page. So let me go to my browser, let me go, still, to localhost. This is still index.html, so if I want to see this file called search0, I can simply do /search0.html, Enter-- and the first of my mistakes. What's going on? I clearly don't have permission to access this file, for some reason. But that's because, unlike the work we've done thus far in C, where the programs you write are assumed to be runnable by you, executable by you, that's not really the case on the Web, whereby sometimes you might want to create files on a server, but you don't want the whole world to be able to see them. Rather, you want the world to see some files but not others, just for privacy's sake. So it's more of an opt-in basis when you're doing things on the Web. And so let me actually type ls here, and you see the files I have, but recall that if I do ls -l for long, I'll get a longer listing that gives me some more details about these files that are now, really, for the first time relevant to us. 
Notice that on the far right are the names of my files, and then the time at which they were last modified or copied. This number here is what? Do you recall? The size in bytes, how big the file is. So I seem to have some kind of logo in here that's bigger than all the other files. This is who I am, this is what I am and what group I'm in. But then, over here on the left is a bit of cryptic sequence, and we talked, I think, briefly about this in the past, but this has to do with permissions. And even if that's a little hazy, RW probably means read and write. So it turns out that these dashes denote different sets of permissions for different people. And the pattern is, essentially, as follows. When you see a sequence of dashes here, they look as follows. There's a dash, then there's three more dashes, then there's another three, then there's another three. The first one is either a dash or it's a d for directory. So that one's pretty easy. If it's a folder, it says d, otherwise it's a hyphen. There's a couple other cases, but for now we'll just care about files and directories. These next three dashes--and I've artificially inserted the spaces. They were, obviously, not there when we saw them a moment ago. These are the file owner's permissions, and recall from a second ago that it was read and write. That was because I, as the person who created this file a moment ago, I, just by default, on a Linux computer, have the ability to continue reading and writing that file. So the operating system just gives me RW automatically. The middle ones relate to my group, that of students, which is sort of meaningless on the appliance because I'm the only person using the appliance. So let me just wave my hands at that for now. But the last ones are most important for the Web. This is everyone else in the world, and the fact that that is --- means that no one else in the world has any permissions to this file. 
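The layout just described, one character for the type followed by three groups of three, can be sketched in a few lines of Python. This is only an illustration: the function names are made up, it assumes a plain 10-character string as printed by ls -l, and for simplicity it treats anything that isn't a d as a file, ignoring the "couple other cases" mentioned above.

```python
def describe_permissions(mode):
    """Split a 10-character ls -l mode string such as '-rw-------'."""
    if len(mode) != 10:
        raise ValueError("expected 10 characters, e.g. '-rw-------'")
    # First character: d for a directory, otherwise treated as a file here.
    kind = "directory" if mode[0] == "d" else "file"
    return {
        "kind": kind,
        "owner": mode[1:4],   # the file owner's permissions
        "group": mode[4:7],   # the group's permissions
        "world": mode[7:10],  # everyone else in the world
    }


def world_can_read(mode):
    """True if the last group of three starts with r."""
    return describe_permissions(mode)["world"][0] == "r"


print(describe_permissions("-rw-------"))
print(world_can_read("-rw-------"))
```

For the file in question, the last group is ---, so world_can_read returns False, which matches the browser being denied access.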
Clearly a problem, so I need to fix this by somehow giving the world what? Read and write? That's probably dumb, right? I don't want anyone on the Web to go to visit my page and somehow change that file, even though they really couldn't with an HTML file, but just in principle, probably just want them to be able to read it. What does it mean to read it? It doesn't mean they're going to care about the actual HTML, but the browser needs to be able to parse that markup language, top to bottom, left to right. So someone on the Web needs to be able to read it, so I minimally need to give it r. I can do this in a few different ways, but perhaps the simplest is to run this command here. chmod, change mode, then a+r, so all, everyone in the world, plus read, and then the name of the file, search0.html. Now if I do ls -l again, notice that that file has changed, and indeed, I've turned on r for everyone. I've also turned it on for my group, but that's fine, because if I turned it on for everyone, my group is a subset of that. So that's fine too. This just means the computer has now made it readable. Now let me go back to my browser, click reload. Ah-ha. We now have CS50 Search. I've zoomed in a little artificially--pretty hideous search engine. But let's see if it actually works. First, let me do a quick sanity check, let me control click and view page source. Notice that within Chrome, we're now seeing the same HTML that I myself created. Don't get confused here, though. I can't start changing the code here, because the browser has a read-only view of this code. The browser has just asked localhost for a file called search0.html. It is now pure coincidence that the appliance happens to be on the same computer as my browser. I could, equivalently, have typed in www.facebook.com/search0.html, and if Facebook had a file called that, I would then be seeing their HTML. And, of course, I can't change the file that comes back from Facebook, either. 
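For the curious, the effect of the chmod a+r command from a moment ago can be reproduced from a program. The sketch below is a rough Python equivalent, assuming a Linux-like system such as the appliance; the helper name add_read_for_all is made up, and the file is a throwaway copy created in a temporary directory rather than the real search0.html.

```python
import os
import stat
import tempfile


def add_read_for_all(path):
    """Rough equivalent of `chmod a+r path`: turn on r for owner, group, world."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


# Demonstrate on a throwaway file that starts out readable only by its owner.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "search0.html")
    with open(path, "w") as f:
        f.write("<!DOCTYPE html>")
    os.chmod(path, 0o600)                             # rw for owner, nothing else
    before = stat.filemode(os.stat(path).st_mode)
    add_read_for_all(path)
    after = stat.filemode(os.stat(path).st_mode)

print(before, "->", after)
```

The strings printed are in the same format as the left-hand column of ls -l, so the r that appears in the last group of three is the one the web server needs.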
So now we're sort of blurring the lines. The appliance is both a server, serving up web pages, but it's also a client in the sense that I'm using a browser to actually talk to that server. So let's see if my Google search engine works. Let me go ahead and search for quick brown fox, Enter. And voila, I now have my own search engine. But how does this work? Bit of a stretch, but--and now you can't see, precisely, the part that's of interest. Notice what happens. Notice the URL. It turns out that that method, called get, is super simple. When you specify in a form that you want to 'get' results from some server, what it's going to do is take whatever you typed into the form and put it in the URL. It's going to standardize how it gets put into the URL as follows. Notice that this is the URL that was the value of my action attribute. That's where I wanted the form to end up. But then notice this question mark. This is a convention on the Web whereby to provide user input to a website, you append to the URL a question mark, and then you have a whole bunch of key-value pairs. The name of a key, otherwise known as a parameter in the Web, then you have an equal sign, then you have the value of that parameter. So it's essentially a variable name and a variable value, but those variables' names and values came from the HTML form. Why are the pluses there, do you think? Because I did not type + in between my words. [Student, unintelligible] >>Yeah, it's just for spacing. Odds are, whenever you've seen a URL, there's never any spaces in it, if only because if there were, you couldn't really copy and paste it into an IM or into an email because it would break. You want the whole thing to be one contiguous string of characters. So the browser is smart enough to realize, uh-uh. Don't just put a space there. Let me encode the space in some standard way. One of the conventions for doing so is to have the browser automatically put a + where you would otherwise have a space. 
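What the browser does with a get form can be imitated with Python's standard library. This is a sketch only: the action URL below is an assumed stand-in for whatever the form's action attribute actually contained, and build_search_url is a made-up helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(action, query):
    """Append ?q=... to the action URL, encoding spaces the way a form does."""
    return action + "?" + urlencode({"q": query})


# Assumed action URL; the parameter name q comes from the form's text field.
url = build_search_url("https://www.google.com/search", "quick brown fox")
print(url)

# Going the other way: recover the key-value pairs from the URL.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

urlencode puts a + wherever the input had a space, and parse_qs turns the + back into a space, which is the same round trip the browser and the server perform.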
So now, notice Google has been kind of user-friendly. I certainly did not create this web page, but they have prepopulated their own text field with what, precisely, I typed in. Suppose I want to search for something else, like a lazy dog. I can just type this here, re-search. Notice that the URL changes up here, but notice then that I can actually search for anything I want just by understanding how URLs work. I could do lazy cat, Enter, and notice now I'm getting a very lazy--should we? I feel like we should. I get a very lazy cat. All right. This is one of the stupidest things we've done. But that is a lazy cat. Anyhow, what's the key takeaway here? Now we're sort of playing in the world of HTTP. HTML is just this markup language, open tag, close tag, that tells a browser how to render content on a web page. But when you start transmitting data across the Internet between web browser and server, that's where this protocol known as HyperText Transfer Protocol takes over. This is the sort of human convention; when Sam and I shook hands on Monday, starting a connection and then closing a connection, same idea here. How are Google's results coming back to me? How is my form submission going to Google? Well, recall from the other day that what's really going on underneath the hood when you request a web page is your browser is sending a somewhat cryptic message like GET / HTTP/1.1 for the default home page. Or, in this case, because I specifically requested earlier search0.html, this then would be the somewhat-cryptic message that my browser sends to the appliance. Or, in this case of Google, what's actually sent is a request to /search, and then ?q=lazy cat, with a plus there. So this message that I, the human, am never typing, but is being sent by my browser, this is how HTTP happens. This is the equivalent of our having shaken hands. This is the request, and the server's about to send a response. So let's take a look at this underneath the hood. 
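The text of such a request is simple enough to assemble by hand. The sketch below builds, but does not send, the kind of message described above; build_get_request is a made-up helper, and the Host header is included because HTTP/1.1 requests carry one. A real browser sends several more headers than this.

```python
def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request (nothing is sent)."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # which site on the server is being asked
        "",                      # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


request = build_get_request("www.google.com", "/search?q=lazy+cat")
print(request)
```

Lines in an HTTP message end with a carriage return and a line feed, which is why the pieces are joined with "\r\n" instead of a bare newline.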
As before, we can open up this special field in a browser. View Page, Inspect Elements. So under Inspect Element, notice that what's happened in Chrome, and IE and Firefox have similar mechanisms, we have these developer tools accessible to us. Normal people do not use these tabs. But we, now, are interested in what's going on underneath the hood at the network level. So if I pull up the network level here, let me go ahead and expand this window, open up this entry here, and look at the headers. So what happens when I request a file from a web server is my browser sends a whole bunch of things. And let me view source. So under request headers, and this is just Chrome showing me some diagnostic output, sort of like a debugger of some sort, notice that what I've highlighted here is precisely what Chrome is sending to the server in order to request a file called search0.html. It is telling the server what it thinks its name is, thanks to this host colon field, then there's some pretty esoteric stuff in here, like something to do with dates and times, something to do with the languages that the browser understands, but the really important lines are these first two here. What does the server respond with? Well, if we scroll down here and view source of this thing, notice that the server has responded with a somewhat cryptic message as well, 304 not modified. That's a little strange; let me actually try to fix this. Let me hold down Shift and click Reload up here to force the browser to actually make this request for the first time. Then let me zoom in, and we'll see now that the server's response, because I held Shift, is 200 OK. So you've probably never seen the number 200 in the context of the Web, but what numbers have you sometimes seen unexpectedly from a server? 404, file not found; 403, forbidden; 500, server error. 
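Python's standard library happens to ship a table of these status codes, which makes for a quick way to look them up. The helper names below are made up, and the official phrases differ slightly from the informal ones used in lecture (for instance, "Not Found" rather than "file not found").

```python
from http import HTTPStatus


def describe(code):
    """Return a status code together with its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_error(code):
    """Codes in the 400s and 500s signal that something went wrong."""
    return code >= 400


for code in (200, 304, 403, 404, 500):
    print(describe(code), "(error)" if is_error(code) else "")
```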
So there are these numeric codes that the world uses in the Web to signify errors, just like C functions can return errors and main can return exit codes. 200, though, you rarely see because it means all is well. And 304 you probably never see because what is it signifying? That nothing has--let's see if we can simulate this again-- Oh, now it's not cooperating. 304 said not modified, so why was the server even responding? Well, for efficiency, a web server automatically for you, if the file hasn't changed, it won't retransmit the whole HTML file. It'll just tell the browser it hasn't changed. Just use the copy you already have. So there's this notion of caching on the Web for performance, so that you don't waste time and waste bandwidth downloading files again and again unnecessarily. But this web page, now, was super-simple, and it only showed me the HTML that came back. Let's actually use the network tab now to do a Google search like quick brown fox. Let me then click CS50 Search, and now, notice in the bottom here a whole bunch of stuff came back because when I visit a real website like Google.com, they have images, they have text, they have a language called JavaScript there. So every row in this table down here represents something that Google spit out in response to my single request. The one I care about, though, is this first one. And if I go to the search, request, click View Source here, notice that, indeed, the cryptic message that my browser sent to Google was these two lines here, followed by some arcane information down here which we'll ignore for now. But notice, too, what Chrome is pretty handy with, it's also showing me the query string that was sent in. So rather than show me this, which was literally sent, if I view it decoded, Chrome, just for debugging purposes, for developers like us, it's just showing me a human-friendly version of-- that is not how you spell fox, apparently. 
I'm just noticing this now--but it's showing you what I, apparently, typed. Meanwhile, the response that came back from the server is again 200 OK. But included in that response, of course, if we actually view the page's HTML-- sorry, this is a little keyboard shortcut gone awry today. I'll deal with this later. So if we actually view the page's source, which I can do down here by clicking response, this is what was actually spit back, in addition to that cryptic 200 OK message from the server. A little cryptic, but where is all this coming from? Well, let's do one other thing here. Another somewhat cryptic command, but this one's kind of neat in that it reveals to us exactly what's going on underneath the hood. So I'm back on my Mac here, I have connected via a program called SSH, Secure Shell, to another server because most of Harvard's computers block the command we're about to run because there's this command on some servers called traceroute that allows you to trace the route between points a and b, and thus far we've been taking completely for granted that I can type in Google.com and somehow get data back from halfway across the country or halfway across the world. With traceroute we can actually dive in a little deeper as to how the Internet works, and see what's going on underneath the hood. So let's go ahead and arbitrarily trace a route to, say, Stanford.edu, which is across the country, and hit Enter. This command can be super fast or super slow, but what we're seeing now, line by line, is every one of the steps or hops between us and Palo Alto, or Stanford, where they have their web server. So what does each of these lines represent more concretely, though? A piece of jargon from the Internet? [Student, unintelligible] >>What's that? [Student, unintelligible] >>Oh, so there are times, but what does each row--what do I mean by hop? Well, there are these things on the Internet called routers. 
And routers, as the name suggests, route information from point a to point b. But there are several points beyond a and b. There's c and d and e and f between row 1, which happens to be my computer's IP address, or my numeric address, which uniquely identifies my computer, and step 15, which is actually the sixth web server, apparently, which I'm inferring from this, or version 6 of their web server at Stanford. But what's kind of neat is, we can see the path that my 0's and 1's are taking from my computer to Stanford. So step 1 is my own computer's address. Every computer on the Internet has a unique identifier that looks like this. Number.number.number.number. Somewhere on this campus, probably in the science center, is a router called Core Gateway 2 -te83, whatever that means, so this is one of Harvard's big fancy routers that routes a lot of their traffic. Here's another of Harvard's routers, this one is Border Gateway, border meaning it's probably on the periphery of campus somewhere. Then there's nox one, row 4, which is Northern Crossroads, which is a big ISP, Internet service provider, that places like Harvard connect up to. But then things get a little interesting in line 6. Where are my bits all of a sudden? Kansas. The world has a habit of using airport codes in a lot of these things, or at least abbreviations for states or cities, so it looks like, in just 60 ms, a packet of information, 0's and 1's from my laptop got all the way to Kansas, and again, in 60 ms. Moreover, after Kansas, they took a tour through Houston, probably, as suggested by the name of this server. So just as a server on the Internet must have a numeric address, it can also, optionally, have a slightly more human-friendly address that humans came up with. Now, in step 8, we don't know what this is. Sometimes routers just kind of ignore you, and they just don't answer the questions, so that's fine. The one after step 8 is apparently where? L.A. 
Notice that what takes us humans like 6+ hours to do physically takes packets of information on the Internet only 78 ms. Step 10 is in L.A. as well, and step 11 seems to have gone north, up near Stanford. This is their boundary router, or border router. A couple steps at Stanford that are ignoring us, and lastly, we reach the web server in just 87 ms. Now, all of these numbers, as an aside, just tell you how long it takes for data to get from me to each of these routers, and it's not cumulative. What this program does is it first sends a message, essentially, to the first router. Then one to the second router; then one to the third router, measuring each time. So in theory, these times will be growing or at least pretty close to one another, and, indeed, the ones that are right here on campus are super-small. As soon as you start going across the country, it takes data a little longer to travel, closer to 100 ms, give or take. But let's go the other direction now. How about Cambridge University in the UK? Let me instead run traceroute of www.cam for Cambridge, .ac for academic, .uk, and hit Enter here. That was pretty damn fast. My data literally went to Cambridge, England, in that split second of time. So let's see the path that it took. Harvard, Harvard, Harvard, Northern Crossroads, which is an ISP, and then this is Northern Crossroads, and then bam. What is in between steps 6 and 7, routers 6 and 7? The Atlantic Ocean. And we're inferring this from the fact that we go from 20 ms here to 80 ms here. So something took 60 ms, give or take, to get over. And that was probably a big body of water. What goes on after that? Well, here we are in London, just 88 ms later.
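The inference about the ocean, spotting the one hop where the time suddenly jumps, can be sketched in a few lines of Python. The function name `largest_jump` is hypothetical, and the sample timings are made up apart from the 20 ms and 80 ms readings at hops 6 and 7 and the 88 ms reading that follows.

```python
def largest_jump(rtts_ms):
    """Return (hop_number, jump_ms) for the biggest increase in
    round-trip time between consecutive hops that answered.

    Hops are numbered from 1. Routers that ignored us are given
    as None and skipped.
    """
    best_hop, best_jump = None, 0
    previous = None
    for hop, rtt in enumerate(rtts_ms, start=1):
        if rtt is None:
            continue  # this router did not answer
        if previous is not None and rtt - previous > best_jump:
            best_hop, best_jump = hop, rtt - previous
        previous = rtt
    return best_hop, best_jump


# Hypothetical per-hop timings in milliseconds.
sample = [1, 1, 2, 5, 18, 20, 80, 88]
```

With these numbers, `largest_jump(sample)` points at hop 7 with a 60 ms jump, which is where we guessed the Atlantic Ocean sits. It is only a heuristic: a slow or busy router can produce a jump too.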
More London, more London, not sure where this is, but we'll assume it's outside of London, Cambridge here, and finally we--literally, University of Cambridge .something.net, and then, finally, in line 16, their web server is apparently called Scorpius underneath the hood, even though we know it as www. Kind of mind-blowing, I think. The first time I ever did this, it totally blew my mind. Unfortunately, Harvard blocks this kind of traffic, typically, on the network. So you can't do it super easily. Realize, though, this here is possible. All right. Let's take our 5-minute break here. We'll come back and dive in deeper. So we are back, and we've kind of ambled about in a few different directions here. So let's summarize exactly what's been going on here. We started the conversation talking about this language called HTML. Again, not a programming language. It's just a markup language that is largely about aesthetics and structuring of content in the form of a webpage. But HTML, therefore, needs some kind of mechanism for traveling between web browser and server. HTML therefore sort of rides on top of this other language, or more properly, a protocol, known as HTTP. And HTTP, as we've seen it thus far, is kind of analogous to this human convention of shaking hands. When a browser wants to request a page from a server, it sends that "get" request from browser to server, and then the server responds with a number like 200, all is okay, as well as the HTML or some bad number like 404, file not found. But meanwhile, HTTP itself isn't the Internet, per se. HTTP is just a service, a feature of the Internet much like G chat is another service, much like email is another service. There's all sorts of things we can do on the Internet. HTTP is just one of those applications. So on top of--HTTP is on top of something else which we didn't mention by name, you might have heard of by name, TCP/IP. 
So the story we just told there is all about how data travels from point a to point b. And in this case, we saw at a very low level router to router to router to router, how the data is actually being transmitted. But along the way, it is going to encounter various impediments. Besides these routers, there are things called firewalls on the Internet, and so data, such as that we were just transmitting from me to Stanford, from me to Cambridge, is sent to, at this level, something called an IP address. We saw this a moment ago, and an IP address is just a numeric address of the form w.x.y.z, where each of these is between, give or take, 0 and 255, though you can't quite use all of those numbers. But each of these placeholders is a number between 0 and 255, which is to say 8 bits. So an IP address these days is 32 bits. Now, that gives us how many possible IP addresses in the world? Roughly 4 billion, because 32 bits can take on 2 to the 32nd power different values, which is about 4.3 billion. So that's a lot of IP addresses, but you might have read, or you might now notice in the popular press, a push toward a new version of IP called IPv6. Right now we're using version 4. There really hasn't been a version 5, we're just jumping right to 6. Version 6 is going to use 128 bits for IP addresses, which is freaking huge. We should not run out for quite some time now, but we have begun to run out of version 4 IP addresses, because all of us have not only things like laptops and desktops, a lot of us have phones, a lot of us have other devices like TiVo and the like that have IP addresses themselves. Harvard itself has tens of thousands of computers. So the world is genuinely running out of IP addresses, at least of this form. So over the next few years, you are going to see the addresses on your own computers probably slowly change as more and more companies and universities start to support the newer version.
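The arithmetic behind those counts is easy to check. Here is a minimal Python sketch using the standard `ipaddress` module; the variable names are made up for illustration, and the sample address is the one that comes up for CNN later in this lecture.

```python
import ipaddress

# How many distinct addresses fit in 32 bits versus 128 bits.
ipv4_total = 2 ** 32    # about 4.3 billion
ipv6_total = 2 ** 128   # vastly more

# An address of the form w.x.y.z is really just one 32-bit number.
addr = ipaddress.ip_address("50.112.94.127")
as_int = int(addr)

# Rebuild that number by hand: each of the four parts contributes 8 bits.
octets = [int(part) for part in str(addr).split(".")]
rebuilt = sum(octet << shift for octet, shift in zip(octets, (24, 16, 8, 0)))
```

Here `rebuilt` and `as_int` come out equal, which is the sense in which four numbers between 0 and 255 add up to 32 bits.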
But an IP address is not sufficient for computer a to request data from computer b. Because computer b could be a server, and a server, as I mentioned earlier, can do bunches of things. It can host web pages, it can be an email server, it can be a Skype server, it can be a G chat server. All these different services that can be provided on a server could all, physically, be on the same machine. So in addition to IP addresses, the world has things called ports on the Internet. A port is just a number; so there is a unique number for HTTP. Its number is 80. The encrypted version of HTTP, HTTPS, uses a different number, 443. Whenever you see the s, for secure, that's using that different number. There are other numbers, like 25, used for something called SMTP, otherwise known as email. There's 22 for SSH, and there's a whole bunch of other ports out there. Now, we humans rarely see these numbers. However, when you type in an address like http://www.facebook.com, the browser is secretly inserting 80, because you're using HTTP. If you, instead, type HTTPS, it's secretly inserting 443. And we can kind of see this manually if I pull up a browser and go to http://www.facebook.com:80, thereby explicitly citing not just the name of the website but the port that I want to talk to, and hit Enter. Notice it disappears, because the browser assumes, oh, 80, I'm not even going to bother showing that to you. But the reason for this is that if I actually wanted to send someone an email, I would really be sending it to them on port 25, that being SMTP. A bit of an oversimplification, but some of you have friends who actually work at Facebook, and they, similarly, have servers that receive email. Any time you send an email, what Gmail is doing for you or Outlook or whatever program you use, it's sort of secretly inserting that number as well, 25 in that case.
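What the browser is doing when it "secretly inserts" a port can be sketched like this in Python. The function name `effective_port` and the small `DEFAULT_PORTS` table are assumptions for illustration, covering only the two schemes mentioned here; `urlsplit` is from the standard library.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes discussed in lecture.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url: str) -> int:
    """Return the port a URL refers to: the explicit one after the
    colon if present, otherwise the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]
```

So `effective_port("http://www.facebook.com")` and `effective_port("http://www.facebook.com:80")` both come out to 80, which is why the browser doesn't bother showing the `:80`.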
It's this combination of IP address and port number that uniquely identifies a computer on the Internet and a specific service on that computer. Now, of course, most of us have probably never typed manually an IP address. Maybe you have in the appliance, but in the real world, not so much. Why do we not type IP addresses into browsers? It would work, in fact, we can see this; let me show you one other command that should work most anywhere on Harvard's campus on a Mac or a PC. There's this command called nslookup, name server lookup. If I look up www.cnn.com, it turns out that CNN has--oh, interesting. CNN has started using Amazon Web Services. You might know of cloud computing; Amazon's one of the big players in cloud computing. What I just did was I said, "Give me the address of CNN's web server," but it turns out that CNN's web server is managed by Amazon, Amazon Web Services, this suggests. And the address of that server is this here. So I'm not sure if this will work, because they didn't used to use Amazon. But let's try this; http://, IP address, Enter, and-- is it going to work? Yes. It is going to work. Internet is super slow today. But, in a moment, you will see some news story. There we go. Bank of America's being sued. All right. This is because this IP address just happens to be synonymous with www.cnn.com. Of course, it would be horrible marketing to say, visit us on the Web at 50.112.94.127. You'd never remember. So even these days you might recall things like 1-800-COLLECT or mnemonics the world came up with for phone numbers, which were rather hard to remember before cell phones let you just type a number in once and forget about it. So the Web, too, has this convention of names and IP addresses, and there are these things out there called DNS servers, Domain Name System servers, that translate names into IP addresses and vice versa. So that's what's going on underneath the hood.
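The same kind of lookup that nslookup performs can be asked for from a program. This is a minimal Python sketch using the standard `socket` module; the wrapper name `lookup` is made up, and any address you get back for a real site today will very likely differ from the one shown in lecture.

```python
import socket


def lookup(name: str) -> str:
    """Ask the operating system's resolver (which in turn asks DNS)
    for one IPv4 address for this host name, as a dotted string.

    If the argument is already an IPv4 address, it comes back unchanged.
    """
    return socket.gethostbyname(name)


# Example (needs a working network connection, so it is left as a comment):
# print(lookup("www.cnn.com"))
```

This only covers the name-to-address direction; going from an address back to a name is a separate kind of DNS query.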
In the end, we have TCP/IP, which is this very low-level protocol that, really, just gets 0's and 1's across the Internet, and it does so by putting them into a virtual envelope, if you will, and writing on the outside of the envelope the IP address of the destination, as well as the numeric port number of the service on that destination that it wants to talk to. Meanwhile, on the envelope there's also something known as a return address, which is your IP address, so that when CNN gets a packet of information from you, opens this virtual envelope, sees that you want the home page, it knows from the sender part of this virtual envelope whom to send the HTML back to. So let's take a look at this in a little more detail. This is from a company called Ericsson, from a few years back. And they took some liberties with how the Internet actually works, but it paints a much more visual picture than mere chalk up here. So I give you "A Bit of the Internet." [Narrator] For the first time in history, people and machinery are working together, realizing a dream. A uniting force that knows no geographical boundaries. Without regard to race, creed, or color. A new era where communication truly brings people together. This is The Dawn of the Net. Want to know how it works? Click here to begin your journey into the Net. Now, exactly what happened when you clicked on that link? You started a flow of information. This information travels down into your own personal mailroom where Mr. IP packages it, labels it, and sends it on its way. Each packet is limited in its size. The mailroom must decide how to divide the information and how to package it. Now, the package needs a label containing important information such as sender's address, receiver's address, and the type of packet it is. Because this particular packet is going out onto the Internet, it also gets an address for the proxy server, which has a special function, as we'll see later.
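The envelope and its label can be modeled as a tiny data structure. This is only a sketch of the analogy in Python, not how TCP/IP is actually laid out on the wire: the class name `Envelope`, the function `reply_to`, the client address 192.0.2.10, and the client-side port 51000 are all assumptions for illustration (the sender also has a port number, a detail the envelope story glosses over).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """A toy stand-in for a packet: addresses on the outside, data inside."""
    src_ip: str     # return address (the sender)
    src_port: int   # hypothetical port on the sender's side
    dst_ip: str     # destination address
    dst_port: int   # which service on the destination, e.g. 80 for HTTP
    payload: str    # the 0's and 1's, shown here as text


def reply_to(request: Envelope, payload: str) -> Envelope:
    """Build the response envelope by swapping sender and receiver."""
    return Envelope(
        src_ip=request.dst_ip,
        src_port=request.dst_port,
        dst_ip=request.src_ip,
        dst_port=request.src_port,
        payload=payload,
    )


# A made-up client asking a web server (port 80) for its home page.
request = Envelope("192.0.2.10", 51000, "50.112.94.127", 80, "GET / HTTP/1.1")
response = reply_to(request, "HTTP/1.1 200 OK")
```

The response is addressed to whatever was written in the sender part of the request, which is exactly how the server knows whom to send the HTML back to.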
The packet is now launched onto your local area network, or LAN. This network is used to connect all the local computers, routers, printers, et cetera, for information exchange within the physical walls of the building. The LAN is a pretty uncontrolled place, and, unfortunately, accidents can happen. The highway of the LAN is packed with all types of information. These are IP packets, Novell packets, AppleTalk packets. They're going against traffic, as usual. The local router reads the address and, if necessary, lifts the packet onto another network. Ah, the router. A symbol of control in a seemingly disorganized world. [Router mumbling and talking to itself] [Narrator] There he is, systematic, uncaring, methodical, conservative, and sometimes not quite up to speed. But at least he is exact, for the most part. As the packets leave the router, they make their way into the corporate intranet and head for the router switch. A bit more efficient than the router, the router switch plays fast and loose with IP packets, deftly routing them along their way. A digital "pinball wizard," if you will. [Router switch talking to itself] [Narrator] As packets arrive at their destination, they're picked up by the network interface, ready to be sent to the next level. In this case, the proxy. The proxy is used by many companies as sort of a middle man in order to lessen the load on the Internet connection and for security reasons, as well. As you can see, the packets are all of various sizes depending upon their content. The proxy opens the packet and looks for the web address or URL. Depending upon whether the address is acceptable, the packet is sent on to the Internet. There are, however, some addresses which do not meet with the approval of the proxy. That is to say, corporate or management guidelines. These are summarily dealt with. We'll have none of that. For those who make it, it's on the road again. Next up, the firewall. The corporate firewall serves two purposes.
It prevents some rather nasty things from the Internet from coming in to the Intranet, and it can also prevent sensitive corporate information from being sent out onto the Internet. Once through the firewall, a router picks up the packet and places it onto a much narrower road, or bandwidth, as we say. Obviously, the road is not broad enough to take them all. Now, you might wonder what happens to all those packets which don't make it along the way. Well, when Mr. IP doesn't receive an acknowledgement that a packet has been received in due time, he simply sends a replacement packet. We are now ready to enter the world of the Internet. A spiderweb of interconnected networks which span our entire globe. Here, routers and switches establish links between networks. Now, the Net is an entirely different environment than you'll find within the protective walls of your LAN. Out here, it's the Wild West. Plenty of space, plenty of opportunities, plenty of things to explore and places to go. Thanks to very little control and regulation, new ideas find fertile soil to push the envelope of their possibilities. But because of this freedom, certain dangers also lurk. You'll never know when you'll meet the dreaded ping of death, a special version of a normal request ping, which some idiot thought up to mess up unsuspecting hosts. The path our packets take may be via satellite, telephone lines, wireless, or even transoceanic cable. They don't always take the fastest or shortest routes possible, but they will get there eventually. Maybe that's why it's sometimes called "The World Wide Wait." But when everything is working smoothly, you can circumvent the globe five times over at the drop of a hat, literally. And all for the cost of a local call or less. Near the end of our destination, we'll find another firewall. Depending upon your perspective as a data packet, the firewall could be a bastion of security or a dreaded adversary. 
It all depends on which side you're on and what your intentions are. The firewall is designed to let in only those packets that meet its criteria. This firewall is operating on ports 80 and 25. All attempts to enter through other ports are closed for business. Port 25 is used for mail packets, while port 80 is the entrance for packets from the Internet to the web server. Inside the firewall, packets are screened more thoroughly. Some packets make it easily through customs, while others look just a bit dubious. Now, the firewall officer is not easily fooled, such as when this ping of death packet tries to disguise itself as a normal ping packet. [Firewall officer talking to packets] [Narrator] For those packets lucky enough to make it this far, the journey is almost over. It's just a line up on the interface to be taken up into the web server. Nowadays, a web server can run on many things, from a mainframe to a web cam to the computer on your desk. Why not your refrigerator? With the proper setup, you can find out if you have the makings for Chicken Cacciatore, or if you have to go shopping. Remember, this is the dawn of the Net. Almost anything is possible. One by one, the packets are received, opened, and unpacked. The information they contain, that is, your request for information, is sent on to the web server application. The packet itself is recycled, ready to be used again, and filled with your requested information, addressed, and sent out on its way back to you. Back past the firewall, routers, and on through to the Internet. Back through your corporate firewall and onto your interface, ready to supply your web browser with the information you've requested. That is, this film. Pleased with their efforts, and trusting the better world, our trusty data packets ride off blissfully into the sunset of another day, knowing fully they have served their masters well. Now, isn't that a happy ending? [Malan] Okay, that's enough. We'll see you next week. [CS50.TV]