[MUSIC PLAYING] DAVID MALAN: All right. This is CS50, and last time where we left off was here, focusing on data structures. And indeed, one of the last data structures we looked at was that of a hash table. But that was the result of a progression of data structures that we began with this thing here, an array. Recall that an array was a data structure that was actually introduced back in week two of CS50, but it was advantageous at the time, because it allowed us to do things efficiently, like binary search, and it was very easy to use with its square bracket notation and adding integers or strings or whatever it is. But it had limitations, recall. And among those limitations were its lack of resizability, its lack of dynamism. We had to decide in advance how big we wanted this data structure to be, and if we wanted it to be any bigger, or for that matter any smaller, we would have to dynamically ourselves resize it and copy all of the old elements into a new array, and then go about our business. And so we introduced last time this thing here instead, a linked list that addresses that problem by having us on demand allocate these things that we called nodes, storing inside them an integer or really any data type that we want, but connecting those nodes with these arrows pictured here, specifically connecting them or threading them together using something called pointers. Whereby pointers are just addresses of those nodes in memory. So while we pay a bit of a price in terms of more memory in order to link these nodes together, we gain this flexibility, because now when we want to grow or shrink this kind of data structure, we simply use our friend malloc or free or similar functions still. But then, using linked lists-- and at a lower level, pointers-- as a new building block, did we begin to solve other problems.
We considered the problem of a stack of trays in the cafeteria, and we presented an abstract data type known as a stack. And a stack supports operations like push and pop. But what's interesting about a stack for our purposes, recall, is that we don't need to commit necessarily to implementing it one way or another. Indeed, we can abstract away the underlying implementation details of a stack, and implement it using an array if we want, if we find that easier or convenient. Or for that matter we can implement it using a linked list, if we want that additional ability to grow and shrink. And so the data type itself in a stack has these two operations, push and pop, but they're independent, ultimately, of how we actually implement things underneath the hood. And that holds true as well for this thing here. A line, or more properly a queue, whereby instead of having this last in, first out or LIFO property, we want something more fair in the human world, a first in, first out. So that when you enqueue or dequeue some piece of data, whatever was enqueued first is the first thing to get out of that queue as well. And here too did we see that we could implement these things using a linked list or using an array, and I would wager there are yet other possible implementations as well. And then we transitioned from these abstract data types to another sort of paradigm for building a data structure in memory. Rather than just linking things together in a unidirectional way, so to speak, with a linked list, we introduced trees, and we introduced things like binary search trees, that so long as you keep these data structures pretty well balanced, such that the height of them is logarithmic and is not linear like a linked list, can we achieve the kind of efficiency that we saw back in week zero when we did binary search on a phone book.
But now, thanks to these pointers and thanks to malloc and free can we grow and shrink the data structure without committing in advance to an actual fixed size array. And similarly did we solve another real world problem. Recall that a few weeks ago we looked at forensics, and most recently did we look at compression, both of which happen to involve files. And in this case, the goal was to compress information, ideally losslessly, without throwing away any of the underlying information. And thanks to Huffman coding did we see one technique for doing that, whereby instead of using seven or eight bits for every letter or punctuation symbol in some text, we can instead come up with our own coding that uses one bit, like a one, to represent a super common letter like e, and two or three or four or more bits for the less common letters in our world. And then again we came to hash tables. And a hash table too is an abstract type that we could implement using an array or using a linked list or using an array and a linked list. And indeed, we looked first at a hash table as little more than an array. But we introduced this idea of a hash function, that allows you, given some input, to decide on some output-- an index, a numeric value, typically-- that allows you to decide where to put some value. But if you use something like an array, of course, you might paint yourself into a corner, such that you don't have enough room ultimately for everything. And so we introduced separate chaining, whereby a hash table in this form is really just an array, pictured here vertically, and a set of linked lists hanging off that array, pictured here horizontally, that allows us to get some pretty good efficiency in terms of hashing, finding the chain that we want in pretty much constant time, and then maybe incurring a bit of linear cost if we actually have a number of collisions.
Now today, after leaving behind these data structures-- among them a trie, which recall was our last data structure that allowed us in theory in constant time to look up or insert or even delete words in a data structure, depending only on the length of the string, not how many strings were in there-- do we continue to use these ideas, these building blocks, these data structures. But now today we literally leave behind the world of C and start to enter the world of web programming, or really the world of web pages and dynamic outputs and databases, ultimately, and all of the things that most of us are familiar with every day. But it turns out that this time we don't have to leave behind those ingredients. Indeed, something like this, which you'll soon know as HTML-- the language in which web pages are written-- HyperText Markup Language-- even this textual document, which seems to have a bit of structure to it, as you might glean here from the indentation, can underneath the hood be itself represented as a tree. A DOM, or Document Object Model, but indeed, we'll see now some real world, very modern applications of the same data structures in software that we ourselves use. Because today, we look at how the internet works, and in turn how we actually build software atop it. But first, a teaser. [VIDEO PLAYBACK] [MUSIC PLAYING] -He came with a message, with a protocol all his own. He came to a world of cruel firewalls, uncaring routers, and dangers far worse than death. He's fast. He's strong. He's TCP/IP, and he's got your address. Warriors of the Net. [END PLAYBACK] DAVID MALAN: All right, so coming soon is how the internet works. And it's not quite like that. But we'll see in a bit more detail. But let's consider first something a little more familiar, if abstractly, like our own home.
So odds are, before coming to a place like this, you had internet access at home or at school or at work or the like. And inside of that building-- let's call it your home-- you had a number of devices. Maybe a laptop, maybe a desktop, maybe both, maybe multiple. And you had some kind of internet service provider, Comcast or Verizon or companies like that, that actually run some kind of wired connection, typically-- though it could be wireless-- into your home, and via that connection are you on your laptop or desktop able to get out onto the internet. Well it turns out that the internet itself is a pretty broad term. The internet is really just this interconnection of lots of different networks. Harvard here has a network. Yale has a network. Google has a network. Facebook has a network. Your home has a network and the like. And so the internet really is the interconnection of all of those physical networks. And on top of this internet, do there run services, things like the web or the world wide web. Things like email. Things like Facebook Messenger. Things like Skype. And any number of applications that we use every day run on top of this physical layer known as the internet. But how does this internet itself work? Well, when you first plug in your computer to a home modem that you might get from Verizon or Comcast-- it might be a cable modem or a DSL modem or another technology still-- or more commonly these days, you connect wirelessly, such that your Mac or PC laptop connects somehow wirelessly to this device, what actually happens? Like the first time you have internet installed on your home, how does your computer know how to connect to that device, and how does that device know how to get your laptop's data to and from the rest of the internet? 
Well, odds are you know on your Mac or PC you at least get to choose the name of your network, whether it's Harvard University or Yale or Linksys or AirPort Extreme or whatever it is at home, and then once you're connected to that, it turns out that there's special software running on this device in your home called a router. And actually, it can be called any number of things. But one of its primary functions is to route information, and also to assign certain settings to your computer. Indeed, running inside of this so-called router in your home typically is a protocol, a special type of software called DHCP-- Dynamic Host Configuration Protocol. And this is just a fancy way of saying that that little device in your home knows how to get you onto the internet. And how does it do that? Well, the first time you turn on your Mac or PC and connect to your home network-- or Harvard's or Yale's for that matter-- you are assigned, thanks to this technology DHCP, an IP address, a numeric address, something of the form something dot something dot something dot something that uniquely in theory identifies your computer on the internet, so long as your computer speaks this protocol IP, or the Internet Protocol. And we'll see in a bit that IP and TCP-- or more commonly known as TCP/IP-- is really just a set of conventions that governs how computers talk to each other on the internet. And the first way they do that is by agreeing upon in advance what each of their addresses look like. Now, these addresses are actually changing in format over time, because frankly, we're running out of these addresses. But the most common address right now still is an IP version 4, or V4 address, that is literally of the form something dot something dot something dot something. And so when your computer first turns on in your home network, you are given a number that looks a little something like that.
And via that address now can you talk to other computers on the internet, because this is like your from address in the physical world, and you can receive responses from computers on the internet, because they now know you via this address. So much like the CS building here is at 33 Oxford Street, Cambridge, Massachusetts, or the CS building at Yale is at 51 Prospect Street, New Haven, Connecticut, much as those addresses uniquely identify those two buildings, so do IP addresses in the world of computers uniquely identify computers. So here for instance just happens to be by convention what most of Harvard's own IP addresses look like. Now that I'm on this network here, odds are my IP address starts with 140.247 dot something dot something, or 128.103 dot something dot something. Or in New Haven at Yale, it might look like 130.132 dot something dot something, or 128.36 dot something dot something. And it turns out that each of these somethings simply is by definition a number between 0 and 255. 0 to 255. I feel like we've heard these numbers before. And indeed, if you can count from 0 to 255, that means you're using what? 8 bits. And so each of these numbers is 8 bits plus 8 plus 8 plus 8. So that's 32 bits. And indeed, an IP address typically these days-- at least version 4-- is a 32-bit value, which means there can be in total no more than 4 billion or so computers on the internet. And we're actually starting to bump up against that, because everything these days seems to be on the internet, whether it's your phone, laptop, or even some smart device in your home. And so there is a way to mitigate that. It turns out that your computer, even if you're on campus, might not quite have one of those Harvard or Yale IPs. You might instead have, depending on where you are on campus, a private IP address, or if you're in your home, you similarly might have one of these addresses.
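To make the 8 plus 8 plus 8 plus 8 arithmetic concrete, here is a minimal Python sketch that packs a dotted address into a single 32-bit number. The function name `ipv4_to_int` and the sample address `140.247.0.1` are assumptions for illustration only; the lecture mentions just the `140.247` prefix.

```python
def ipv4_to_int(address: str) -> int:
    """Pack a dotted IPv4 string into one 32-bit integer (hypothetical helper)."""
    octets = [int(part) for part in address.split(".")]
    # Each "something" must be a number from 0 to 255, i.e. fit in 8 bits.
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        # Shift the bits seen so far left by 8 and append the next 8 bits.
        value = (value << 8) | octet
    return value


# 32 bits allow 2 ** 32 distinct values, which is the "4 billion or so" ceiling.
TOTAL_ADDRESSES = 2 ** 32

if __name__ == "__main__":
    print(ipv4_to_int("140.247.0.1"))  # made-up address on the 140.247 prefix
    print(TOTAL_ADDRESSES)
```

Treating the address as one integer is only one way to look at it, but it shows why no "something" can exceed 255: a larger number would spill into the neighboring 8 bits.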
And these are private in the sense that they are used to route information within your home or within your school or within your company, but these addresses are not meant to be used by the outside world. Instead, what you get from Harvard or Yale or Comcast or Verizon when you connect to their network typically is at least the ability to have one or more public IP addresses that the rest of the world knows you by. So what does this actually mean? Well, sometimes it doesn't really mean anything at all. And in fact, if you look at popular media today or various television shows, you'll see that IP is either miscommunicated or outright misunderstood. Let's take a look. [VIDEO PLAYBACK] -It's a 32-bit IPv4 address. -IP, as in the internet? -Private network. To meet is private network. It's just so amazing. It's in their IP address. She's letting us watch what she's doing in real time. [END PLAYBACK] DAVID MALAN: No, no, that is not what a hacker does in real time, and that is not how you watch a hacker in real time. Indeed, if you zoom in on this screen here, you'll see that what's actually being looked at has nothing to do with networking per se. This is actually programming code written in a language called Objective-C, which happens to be used conventionally for Mac applications or more recently iOS applications. And of all the things for them to have pulled out, they use this code, which has to be something related to some kind of drawing program insofar as it's talking about crayons. Moreover, if you actually look at one of the other scenes from this show, this was the IP address in question. This too is not technically accurate. What's wrong with this IP address in this frame here from the show? Yeah, so if the IP addresses can only be from 0 to 255, 275 is definitely too big.
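Both checks, whether an address is even valid and whether it is private, can be sketched with Python's standard `ipaddress` module. The three sample addresses below are assumptions: `275.3.6.28` is a made-up address with an out-of-range first number, `192.168.1.10` falls in one of the standard private ranges, and `140.247.0.1` is an invented address on the Harvard prefix mentioned earlier.

```python
import ipaddress


def describe(address: str) -> str:
    """Classify a dotted IPv4 string as 'invalid', 'private', or 'public'."""
    try:
        ip = ipaddress.IPv4Address(address)
    except ipaddress.AddressValueError:
        # Raised, for example, when one of the four numbers exceeds 255.
        return "invalid"
    return "private" if ip.is_private else "public"


if __name__ == "__main__":
    for sample in ("275.3.6.28", "192.168.1.10", "140.247.0.1"):
        print(sample, "->", describe(sample))
```

This only inspects the number itself; it says nothing about whether any machine actually answers at that address.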
Now, in their defense, this is probably a good thing, because now they're not broadcasting some random, unsuspecting person's actual IP address. But there too there's a technical limitation. But of course, we humans, when we visit websites using Safari or Chrome or IE or Edge or whatever, we rarely if ever type in the address of websites or servers by these numeric IP addresses. Rather, we seem to use more user-friendly words, like www.google.com, or harvard.edu, or yale.edu, or facebook.com, or the like. And thankfully, there exists in the world another system, another technology known as DNS-- Domain Name System. And what DNS does is it simply converts more human-friendly host names, or fully qualified domain names, to numeric IP addresses, and vice versa. Which is to say when I first sit down at my Mac or my PC on my home network or Harvard's or Yale's and I type in something like www.google.com and hit Enter, the way that my computer actually talks to google.com is by way of those numeric IP addresses. But the way my Mac or PC figures out what that IP address is of google.com is it asks the local operating system-- Mac OS or Windows-- and if Mac OS or Windows doesn't know, my operating system asks Harvard's network or Yale's network or Comcast's network, wherever I physically am, because each of those networks has its own DNS server, whose purpose in life is to convert IP addresses to host names and host names to IP addresses. And in the event that Comcast or Yale or Harvard, wherever I am, doesn't know the answer to what is the IP address for www.google.com, there exist root servers in the world, servers that are globally administered and that, at the end of the day, can at least help those DNS servers figure out what the answers are. And indeed, when you buy, or really rent, a domain name, among the things you're doing is informing the world via a set of standards what your server's IP addresses are. And so that's exactly what Google and others have done.
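The chain of "ask the operating system, then the local network's DNS server, then someone farther up" can be modeled as a toy in Python. This is a simplified sketch, not how real DNS software is written: the `Resolver` class and the three resolver names are hypothetical, the upstream server stands in for the whole hierarchy of root and other servers, and the address is the one that appears for www.google.com in this lecture's own lookup.

```python
class Resolver:
    """A toy DNS resolver that caches answers and asks upstream when unsure."""

    def __init__(self, name, records=None, upstream=None):
        self.name = name
        self.cache = dict(records or {})  # host name -> IP address
        self.upstream = upstream          # the next resolver to ask, if any

    def lookup(self, hostname):
        """Return (address, name of the resolver that already knew it)."""
        if hostname in self.cache:
            return self.cache[hostname], self.name
        if self.upstream is None:
            raise LookupError(f"no record for {hostname}")
        address, answered_by = self.upstream.lookup(hostname)
        self.cache[hostname] = address    # remember the answer for next time
        return address, answered_by


# Hypothetical three-level chain: operating system -> campus DNS -> upstream.
upstream = Resolver("upstream", {"www.google.com": "172.217.4.36"})
campus = Resolver("campus DNS", upstream=upstream)
operating_system = Resolver("operating system", upstream=campus)

if __name__ == "__main__":
    print(operating_system.lookup("www.google.com"))  # answered far upstream
    print(operating_system.lookup("www.google.com"))  # now answered locally
```

The second lookup never leaves the first resolver, which is the point of caching: most questions get answered close to where they were asked.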
But of course, the data at the end of the day still has to get from my laptop to Google. And then my search results have to get from Google to me. And how does that happen? I mean, most of Google's servers are probably out in Mountain View, California or maybe here on the East Coast somewhere, if they have multiple servers. Or maybe somewhere in the world. And indeed, big companies these days have servers all over the place. So how does one little old laptop know how to request search results from Google or how to request my news feed from Facebook or how to do any number of other things on the internet? Well it does it by way of these things called routers. It turns out that between me and most any other point on the internet, there's one or more routers-- special servers that could be this big, this big, any number of sizes these days. They're just computers that typically live in data centers of some sort. And these routers' purpose in life is to quite simply route information. So when my Mac wants to talk to google.com, my Mac constructs what we call a packet of information inside of which is my request. Give me all of your search results for cats, for instance, if that's what I'm searching for. And that packet is handed off to the nearest router. That router happens to be, at this point in the story, at Harvard here. Harvard has its own routers. And Harvard's routers are somehow wired or wirelessly connected to other routers in the world. And those routers, typically no more than 30 routers away, can get my data by routing it, routing it, routing it, routing it, routing it, until it eventually reaches its correct destination. In its simplest form, what you can think of these routers as doing is looking at those IP addresses-- something dot something dot something dot something-- and deciding, based on those numbers, which direction to go. So maybe if my IP address starts arbitrarily with 1, maybe the packet should go that way to that router. 
If it starts with 2, it should go that way and be routed to that router, or that way, or that way. It doesn't really matter. This all happens dynamically thanks to software. But routers just use those IP addresses to decide which way to route your information. And we can actually see this. Let me go ahead into CS50 IDE, and Macs and PCs and other computers have the same software. This will allow me to do a number of things at my command line here. For instance, suppose that I wanted to check what the IP address is for google.com. Because if I want to send Google a letter, like a packet of information requesting a whole bunch of search results about cats, I need to know their IP address. So what I can do at the command line here is run a command that's pretty popular called nslookup-- name server lookup. And I can type in something like www.google.com Enter, and voila, I seem to get the answer here that Google's IP address is apparently 172.217.4.36. And I know that answer, because Harvard's server-- and I know it's Harvard, because it starts with 140.247-- Harvard's DNS server somewhere here on campus just knew that result. But it's non-authoritative, in the sense that Harvard does not run google.com. But Harvard has previously asked Google or someone else for Google's IP address. And so Harvard is answering the question for me, but not authoritatively. It's a delegate who is relaying that information to me. Now, suppose I want to do this for another site. Let me go ahead and run nslookup on, say, www.facebook.com. And you'll see here that Facebook's IP address is apparently 31.13.80.36. And there's some more cleverness going on here. It turns out there's other types of DNS records or entries, starmini.c10r.facebook.com. I don't really know what that means. Facebook's a big enough company that there's probably a lot more complexity going on. But just out of curiosity, let me go ahead and copy this IP address here. And in a browser, go to http:// that IP address.
Enter. And voila, I make my way to Facebook.com. But it would be pretty bad for business if everyone in the world had to know that Facebook's IP address is this. Back in the day when people still used phone numbers, you might have services like 1-800-COLLECT, C-O-L-L-E-C-T, these mnemonics, so that it was easier for humans to remember phone numbers. Thankfully, DNS does all of this automatically. We just have to remember facebook.com, and DNS does that conversion even more dynamically than the old school 1-800-COLLECT tricks that the world adopted. So that's how my computer would get the TO address. So at this point in the story, if I want to send a request to google.com-- and this is just an envelope in which I might send a letter-- I need to have two pieces of information. I need to have the TO address here, which for Google recall-- let me look it up again-- is 172.217.4.36. And so I'm going to put that in the TO field of this envelope. And now I need to know my own IP address. So it turns out my computer has its own IP address. And so when I send this request over the internet to Google, I'm going to need to include my own IP address, which Windows or Mac OS knows for me. And so in the top corner of this envelope might I write my actual IP address as well. So now I have to actually route this information. I first have to write Google a note, and I might say on this blank sheet of paper, search for cats. So this might be my search request. And I'm going to go ahead and just bundle this up, put this inside of this envelope. But now I need to send this envelope or this so-called packet of information to www.google.com. And who knows where they are? Maybe they're in California. Maybe they're here on the East Coast. Maybe they're somewhere else. How do I route this information? Well, turns out that Harvard has a router, again, and Harvard's routers know of other routers.
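The envelope with its TO and FROM fields, and the earlier simplification that a router picks a direction by looking at how the destination address starts, can be sketched in Python as follows. Everything named here is an assumption for illustration: the `Packet` class, the router names, the routing table, and the sender's address `140.247.0.1`. Real routers match on address prefixes of varying length rather than on the first number alone.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """The envelope: who it is from, who it is to, and the note inside."""
    source: str
    destination: str
    payload: str


# Hypothetical table: first number of the destination -> neighboring router.
ROUTES = {"172": "router-west", "31": "router-east"}
DEFAULT_ROUTE = "router-default"


def next_hop(packet: Packet) -> str:
    """Pick the neighbor to hand the packet to, based on its TO address."""
    first_number = packet.destination.split(".")[0]
    return ROUTES.get(first_number, DEFAULT_ROUTE)


if __name__ == "__main__":
    request = Packet("140.247.0.1", "172.217.4.36", "search for cats")
    print(next_hop(request))
```

Each router along the way would repeat the same decision with its own table until the packet reaches its destination.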
And in turn, using the same command prompt, can we actually see the path that my data should take if I trace the route one query at a time from here to www.google.com. And now what you see, one row at a time, is the following. The first hop between me and Google is apparently this router here. Row number one, mr-sc-1-gw-vl427.fas.net.harvard.edu. Don't quite understand all of that, but I do know just from knowing the people there, MR is the machine room. So here at Harvard Science Center, there is a room with machines. And that's where this server apparently is. SC means Science Center. GW by convention means gateway, which is just a synonym for router, this kind of device. And then I don't know what VL427 means. But I do know that if we continue to the next hop here, row two, Core Science Center gateway, or Core Science Center router. So one router is connected to another router. The third hop to which my data is delivered is bdrgw2, which I know by convention means border gateway. And so this data is being passed from hop one to two to three. And once it goes there, it goes to hop four or router number four, which is nox1sumgw. So nox is the Northern Crossroads, which is a common peering point here in the Northeast of the US, which just means lots of different internet service providers interconnect their cabling and their technology so as to route data to and from locations. That's apparently where we're connected here. Then I don't know where row five is, but it looks like it's owned by Internet2, which is a fast level of internet service that a lot of universities use. Then routers 6, 7, 8, 9, 10, and 11 don't even disclose that they have names. And they might not. Routers don't and computers don't need to have domain names or human-friendly terms, it's just useful for us humans. But then lastly in hop 12, we finally make our way to whatever this is, which seems to be some kind of synonym or alias for one of Google's servers.
So it seems that in just 12 hops, I can get data from here to Google. And you know how long it takes to get from here to Google, wherever they are? 9 milliseconds in total. That's pretty darn fast to make a request from my computer to some other computer, especially when that computer could be most anywhere in the world or in the country. Now, there's a lot of variability. If you look at each of these rows-- 1.5 milliseconds, 1.9, 2.9, 25, 25, 25. These aren't cumulative. What my computer is doing is sending a packet to the first router, then to the second router, then to the third router, and measuring each time how long it takes. So you really just get a rough sense, an average of sorts, based on running this command like this. So it seems to take between 10 and 30 milliseconds to get my data from me to Google. Now, I don't know where Google's servers are, but I do know that UC Berkeley is in California, and their servers I do think are in California. So let's do another by tracing the route to www.berkeley.edu where some of our friends there are. That was super fast, even though it still took some 93 milliseconds. So I'm going to infer that the server of Google's that I'm talking to isn't all the way in California, because to get to California in reality seems to take a good 100 or 90 milliseconds. But let's see what we can glean here. So Machine Room Science Center. It's a core gateway. It's a border gateway to Northern Crossroads, to an unnamed server. Don't know what this one is. But I can guess maybe what this is. And notice in particular, router number six jumps from seven milliseconds to like 49. That's a pretty good distance. And indeed, if you look at the name here, Hous, this I'm guessing is a router that's in Houston, Texas, halfway across the country. After that, maybe Los Angeles here in step 8. And that, indeed takes a little more time. So you can probably infer that it's farther away. No name, no name. This one here, I'm not really sure.
But now we seem to be in Berkeley's campus and CalWeb-- California web, their server farm production. Indeed, it takes some 90 milliseconds in total to get to Berkeley. What about MIT? MIT should be pretty close. Let's do a trace route to mit.edu. And it takes-- all right, so it seems that two routers between us and MIT aren't even cooperating, and that's their prerogative. Not actually responding to our requests. And so in about 10 milliseconds, we get to MIT's server, which seems to be hosted by a third party company called Akamai, which is a content delivery network, among other things. Which means MIT has outsourced to some third party the physical hosting of their servers, which is not uncommon. But let's do one more. Let's do one like for CNN, but not here in the US. But maybe .co.jp for the Japanese version of CNN's website. Let's go ahead and run this. Initially following the same route, Machine Room, Core Gateway, border. And then voila, 189 milliseconds later, we seem to have gotten to Japan. But what can we glean from these numbers? I'm not quite sure where all of these hops are. But what is interesting to me is this one here between routers 8 and 9, what do you notice? That's a sizable jump in time. And it's not a fluke. It's not an anomaly, because indeed, it seems to persist. So if we go farther and farther into this trace, then indeed it's staying at 170 plus milliseconds. So what do you think is in between routers number 8 and 9? What would be between these? I dare say there's an entire ocean between them. And we can see that thanks to this animation here, there's a whole lot going on between points A and B, including sometimes some pretty big cables and some pretty big oceans. Let's take a look. [MUSIC PLAYING] All right, there's something about really cool music that makes lines cool.
But indeed, those pictures capture the complexity of all the wiring that's actually interconnecting all of the continents and countries of the world that actually explains more technically some of those differences in timings. But at the end of the day, this packet has to get somewhere. And suppose it does make its way over to Google servers, and Google receives this packet of information, realizes, oh, someone is searching for cats again. What does Google actually do in order to respond to that request? Well, it turns out that Google too is going to use a whole bunch of packets. And whereas previously, it was their address in the TO field and my address in the FROM field, now they're just going to simply reverse this so that the TO field now is to me, the FROM field is from Google. And inside of this envelope is going to be their various search results. Now turns out we found one such search result here. So if Google has decided to send me back this search result. Maybe I was feeling lucky and clicked that button. So I just get back one result. They're going to put the cat into the envelope. But sometimes, the data is pretty big. Sometimes this image might be kilobytes, megabytes, or if it's a video file, could be gigabytes large. And it would be kind of rude if Google, in order to send me a really big response, shoved a really big piece of information in its packet and then clogged the internet so-called tubes on their way back to my laptop, thereby preventing anyone else from talking to Google or nearby websites at that same moment in time. So indeed, what Google and what many websites do is they leverage a feature of IP, and its sister protocol TCP that lets us fragment this. And indeed, they will take this perfectly nice picture of a cat, and they will fragment it, thanks to IP, into maybe four different pieces, each of which is smaller than the original. And inside of this envelope then goes one piece at a time. 
And so if I put one such piece in this first envelope, I can then much more efficiently clearly proceed to transmit this. And then if I do the same with a second and a third and maybe a fourth envelope, now Google can respond with one, two, three and maybe more packets of information that make their way on the internet, not even necessarily following the same path. In fact, there's no guarantee that A to B is going to be the same route as B to A. Things change dynamically over time. But Google's going to have to include a little bit more information on this envelope. It's not sufficient anymore just to send me four envelopes. What else had they probably best do so that I can actually see my cat when it gets back to me? I've got to know how many packets they sent me, and I need to know in what order. So it turns out that what Google is probably going to do is something like this, write on this envelope the number of the packet and really how many there are. And this is a bit of a white lie, it's actually done a little differently thanks to some other fields that are inside of this envelope. But we can think of it really as 1/4, 2/4, 3/4, 4/4, so that if I only get two of these envelopes or three of these envelopes or four, I now know definitively, wait a minute, I only got 3/4 of my cat. And moreover, the ones I did get, I know the order in which I can reassemble those packets. Now, I mentioned this other protocol, TCP, that indeed often works in conjunction with IP. And you can think of IP as giving you features like addressing, assigning every computer in the world a unique address, and fragmentation, being able to chop things up. But TCP further allows us to associate sequence numbers with packets that allows me the receiver to know, wait a minute, I'm missing one or more packets. So TCP is often said to guarantee delivery, and it is this protocol. So long as your Mac or your PC or your computer supports it, which they all do these days.
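The 1/4, 2/4, 3/4, 4/4 labeling can be sketched in Python as follows. This follows the simplified envelope picture rather than the actual header fields that IP and TCP use, and the names `fragment` and `reassemble` are assumptions for illustration.

```python
def fragment(data: bytes, pieces: int) -> list:
    """Cut data into up to `pieces` chunks labeled (number, total, chunk)."""
    if not data or pieces < 1:
        raise ValueError("need some data and at least one piece")
    size = -(-len(data) // pieces)  # ceiling division: bytes per envelope
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    total = len(chunks)
    return [(number, total, chunk) for number, chunk in enumerate(chunks, start=1)]


def reassemble(envelopes: list):
    """Return (data, []) if complete, else (None, numbers of missing pieces)."""
    if not envelopes:
        raise ValueError("no envelopes received")
    total = envelopes[0][1]
    received = {number: chunk for number, _, chunk in envelopes}
    missing = [n for n in range(1, total + 1) if n not in received]
    if missing:
        return None, missing
    # Arrival order does not matter; the numbers say how to put pieces back.
    return b"".join(received[n] for n in range(1, total + 1)), []


if __name__ == "__main__":
    cat = b"picture of a cat"
    envelopes = fragment(cat, 4)
    print(reassemble(list(reversed(envelopes))))  # arrives out of order
    print(reassemble(envelopes[:2] + envelopes[3:]))  # envelope 3 was lost
```

Because every envelope carries both its number and the total, the receiver can tell the difference between "all here, just shuffled" and "only 3/4 of the cat".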
If it determines, hey, wait a minute, I'm missing this packet, TCP is the protocol, the set of conventions, that say Google, I need this packet again or these packets again, and they will be retransmitted. Now, you pay a price in terms of performance, because now you might have to wait for the rest of the cat. So there might be a bit of a latency in order to get back that response. And that might not always be desirable. And indeed, I can think of some scenarios, like if you're watching a baseball game on TV or soccer or football where you're watching a live stream-- or maybe it's the Oscars or the Emmys, or something live, where you really want to stay in sync with that broadcast, even if sometimes there's network issues or there's buffering-- you don't necessarily want it to buffer. You don't necessarily want lost information to be retransmitted. You'd rather just lose a few seconds of the show so that at least you're staying current, especially if you're there with a bunch of other people and it would be just silly if you gradually over time drift out of date. And so the rest of the world is finished watching the show or the game, and you're still chugging along. So as an alternative to TCP, there's other protocols, one of which is called UDP that's very often used for live streaming and for video and applications like that, where you really just want the software to forge ahead, rather than wait for some new data to get transmitted. But there's other things we can do with the internet. And indeed, there's lots of things we ourselves do every day. It's not just the web, like in downloading cats from Google. But there's email, and there's Skype, and Facebook Messenger, and any number of other services. 
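The numbered-envelope idea (1/4, 2/4, 3/4, 4/4) can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not how IP or TCP are actually implemented; as noted, the real protocols use other header fields. The function names `fragment`, `missing`, and `reassemble` are made up for this sketch.

```python
import random


def fragment(data: bytes, piece_size: int) -> list[tuple[int, int, bytes]]:
    """Chop data into envelopes labeled (number, total, payload), numbered from 1."""
    pieces = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    total = len(pieces)
    return [(number, total, payload) for number, payload in enumerate(pieces, start=1)]


def missing(received: list[tuple[int, int, bytes]]) -> list[int]:
    """Return the envelope numbers that never arrived (assumes at least one did)."""
    total = received[0][1]
    seen = {number for number, _, _ in received}
    return [number for number in range(1, total + 1) if number not in seen]


def reassemble(received: list[tuple[int, int, bytes]]) -> bytes:
    """Put the payloads back in order, refusing if any envelope is missing."""
    gaps = missing(received)
    if gaps:
        raise ValueError(f"still waiting on envelopes {gaps}")
    in_order = sorted(received, key=lambda envelope: envelope[0])
    return b"".join(payload for _, _, payload in in_order)


if __name__ == "__main__":
    cat = b"a picture of a cat"
    envelopes = fragment(cat, 5)          # four envelopes: 1/4 through 4/4
    random.shuffle(envelopes)             # arrival order is not guaranteed
    arrived = [e for e in envelopes if e[0] != 3]  # pretend 3/4 got lost
    print("ask sender to retransmit:", missing(arrived))
    print(reassemble(envelopes))          # once everything arrives, the cat is whole
```

A UDP-style receiver, by contrast, would simply carry on with whatever arrived rather than asking for the missing envelope again.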
So how in the world does a computer upon receiving a packet of information know if it is an email or if it is a web page, or put more concretely, how do I know if I should show this user this cat in his or her email program or in his or her browser, which might be the same? In other words, how do I distinguish between one type of program running on the internet from another? Well, turns out that TCP also provides a standardization of services. And that is just a fancy way of saying that in addition to saying on this envelope to whom it is and what number it is and from whom it is, I also need to uniquely identify the type of service whose information is in that packet. And I do this just by writing a number. And I typically write one of these numbers. 80 if that packet is meant to be web information. So HTTP is the string that most of us type most every day-- or at least see these days, even though our browsers generally fill it in if we don't explicitly type it. It turns out that the world decided years ago that if you want to send information from yourself to a web server like Google to request cats, you had better write the number 80 in the TO field in addition to Google's IP address. This way, Google knows it's not an email destined for Gmail, knows it's not a message destined for Google Hangouts or the like. Google servers can actually distinguish this as an HTTP request or web request from any number of other services. If you're using encryption, HTTPS, that special number that the world standardized on is 443. You rarely see this, but it's on the envelopes that your Macs or PCs are actually sending to Google servers. Meanwhile, there's other port numbers, so to speak. If you've ever heard of FTP, File Transfer Protocol. This is software that's not recommended anymore, because it's completely unencrypted. But it's still unfortunately popular in some applications or with some less expensive web services. 21 is the number that identifies that service. 
And that just means inside of this packet is information related to transferring files, not a web page per se. 22, SSH, Secure Shell. This is a very popular protocol, at least among computer scientists and others, that allows you to run commands on your Mac or PC on a remote server, but in an encrypted way. And those kinds of packets contain the number 22. SMTP-- Simple Mail Transfer Protocol-- is what email generally is for outbound email. So if you send an email, your envelopes have 25 on them. And then lastly, DNS is again that service that converts host names to IP addresses and vice versa. So when your Mac or PC asks the world, hey, wait a minute, what is the IP address for www.google.com? That envelope has the number 53 on the outside. And dot dot dot, there's dozens or even hundreds of these other things, for Skype and for Google Hangouts and the like. But these here are just some of the most common. So the envelope, at the end of the day, has a decent amount of information on it. The TO address, the FROM address, and that TO address furthermore has a port number associated with it. And then, if it's been fragmented especially, there's got to be some kind of number that identifies the packet itself so that you can detect if something is missing. But there's kind of a side effect, or really a feature of having this level of detail on each of these envelopes. You've probably heard of a firewall. Maybe not in the real world. In the real world, a firewall is literally a wall that's meant to block fire, typically in like strip malls and offices or stores that are next to each other physically. A firewall is meant to keep a fire that breaks out in one store from traveling into another store, creating even more damage. But in the software world, a firewall is a piece of software that really keeps packets out that you don't want coming in, or keeps packets in that you don't want going out. 
So a firewall might be used by parents to prevent kids from accessing Facebook or Google, or silly things during the day for instance, if they want them focusing on other things. It might be used by universities or corporations to block access to certain websites that you simply don't want your students or your staff actually accessing. It might be used to keep corporate data inside, so that nothing accidentally leaks out-- financial information, or emails, or the like. You can use a firewall to block outbound access as well. But this invites the question then, how is a firewall implemented? Well, it's not all that hard, really. Because if the internet is just a whole bunch of these packets flying back and forth between computers, between routers, leaving and entering our own network, whether that's my home or my campus or my company, I could just have my routers, for instance, look at every one of those envelopes, look at the TO address, maybe look at the FROM address, and just blacklist certain addresses. Indeed, if I know that I don't want my employees accessing Facebook, I could, for instance, just say to my routers, configure my routers, do not allow any data going to or from IP address 31.13.80.36. Now, it might be easier said than done, because in reality, Facebook probably has multiple IP addresses. So we might have to grow this list or dig a little deeper in order to block them. And better yet, we could potentially look inside of the envelopes themselves to see, is this a Facebook packet? But if they're using encryption, which they do by default these days, that might not really be feasible. So we can have kind of a heavy-handed solution there, and just block everything we think is Facebook.com. But certainly, things might leak out potentially over time if things change. But what else could we do? 
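The kind of rule a router might apply can be sketched as a small Python function that looks only at what is written on the outside of each envelope. This is a toy model under stated assumptions: the `Envelope` class, the `allowed` function, and the choice to block port 25 are all invented for illustration, and the other addresses are placeholder documentation addresses. Only 31.13.80.36 comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """The outside of a packet: who it's from, who it's to, and which service."""
    src_ip: str
    dst_ip: str
    dst_port: int


# Addresses we refuse to talk to (the single example address from the lecture).
BLOCKED_ADDRESSES = {"31.13.80.36"}

# Hypothetical policy: also refuse outbound email by blocking SMTP's port.
BLOCKED_PORTS = {25}


def allowed(envelope: Envelope) -> bool:
    """Decide, from the envelope alone, whether to let the packet through."""
    if envelope.src_ip in BLOCKED_ADDRESSES or envelope.dst_ip in BLOCKED_ADDRESSES:
        return False
    if envelope.dst_port in BLOCKED_PORTS:
        return False
    return True


if __name__ == "__main__":
    laptop = "192.0.2.10"  # placeholder address for a machine on our network
    print(allowed(Envelope(laptop, "31.13.80.36", 443)))    # False: blocked address
    print(allowed(Envelope(laptop, "198.51.100.7", 25)))    # False: blocked port
    print(allowed(Envelope(laptop, "198.51.100.7", 443)))   # True
```

Note that nothing here looks inside the envelope, which is why this approach still works, crudely, even when the contents are encrypted.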
Suppose that I really don't want people Skyping during the day, or I don't want people using Facebook Messenger, or some software that has its own unique TCP port number that some company or the world has standardized on. You could block all outbound email by just blocking port 25, it would seem, or a few other ports that are popular. You could block all web access by blocking 80 and 443. You could block all DNS traffic, if you really want. And indeed, a lot of companies do this, especially like Starbucks kind of places, internet cafes in airports and the like. Sometimes they only want you using their DNS server, not your own company's or your own home's. And so they can block access to any DNS server other than their own. This is unfortunately often or sometimes for advertising reasons, so that they can actually keep track of what you're accessing and where and why-- or where, at least. But it's all possible technologically with this underneath the hood. So what are some of the defenses in place, especially when you want to visit some site that isn't necessarily encrypted? Or maybe you want to visit some site that is blocked, and you want to simply be able to work around this, because you're traveling or you need to be able to access something privately at your home or your work. Well it turns out, that there are services called VPNs or Virtual Private Networks. And Harvard has one VPN at vpn.harvard.edu. And Yale has one as well at access.yale.edu. And this is simply software that you generally download to your phone or your computer that allows you to connect via some protocol and some port to your company or to your home's network, but in an encrypted way. So a VPN gives you an encrypted tunnel, so to speak, so that you are connected to the internet. That's a precondition. You have to get on the internet itself. But then you configure your Mac or PC to route all-- in theory-- of your internet traffic through the VPN. 
So even if I'm just visiting Gmail or Facebook or whatever on my Mac, if I'm connected to Harvard's VPN, all of that traffic by design is going through Harvard.edu first, and then it's going out to Facebook or Google or wherever it's destined. Similarly, if I'm traveling in a foreign country that happens to block a lot of internet access, if they do allow VPN access, I can, in my hotel room or wherever, connect to Harvard or to Yale, route all of my internet traffic through Harvard or Yale, and then from Harvard or Yale to wherever I'm going on the internet. And the upside of this is that it's entirely encrypted, which means no one at that company or in that country in theory knows what data is going through the tunnel. But it also potentially costs me a good amount of time. We've seen that we're really only talking milliseconds, but hundreds of milliseconds can certainly add up. So if I'm abroad, for instance, trying to connect to some website, and my traffic is going from that country to Harvard, to the destination, back to Harvard, back to the country I'm in, my internet connectivity might be slower, but at least it's not actually permanently blocked. So if you've ever heard of friends of yours actually accessing services like Netflix or Hulu, that for licensing reasons, do restrict you typically to being in this country-- this is why you might have read that Hulu and Netflix and others are cracking down on people using VPNs, whether it's Harvard's or Yale's or a third-party company's, so as to circumvent those licensing restrictions. But technologically, all it's doing is giving you an encrypted tunnel between you and someone you have an affiliation with, like Harvard or Yale, and encrypting all of your traffic in between there, and routing all of your traffic through it. So with that said, we've looked at DNS, and we've looked at DHCP, and we've looked at routers. 
And there's other hardware still, whether it's in your home or campus or office, there's things like switches, which are fairly simple devices that just have lots of ethernet jacks, so to speak, that you can plug physical cables into, and those cables can then intercommunicate, so that you can wire computers together en masse. There are things called access points or APs. Those are the things around campus that have the little bunny ear antennas that are often blinking. Those are the wireless access points. And access points often have firewalls, often have routing software built in. So the line is increasingly blurry these days as to what these small devices do. So it really is the services that matter. And indeed, while a little dated, I thought it would be fun to take a look now at a longer form version of the 60 second trailer of Warriors of the Net that was made a few years ago to paint a more visual picture of how the internet works. It definitely takes some liberties with, shall we say, accuracy. But it also helps paint a picture of what really is going on underneath the hood. So let's take a look at the internet. 00:44:13,549 --> 00:44:15,046 [MUSIC PLAYING] [VIDEO PLAYBACK] 00:45:13,340 --> 00:45:18,450 -For the first time in history, people and machinery are working together, realizing a dream. A uniting force that knows no geographical boundaries, without regard to race, creed, or color. A new era, where communication truly brings people together. This is the Dawn of the Net. 00:45:41,850 --> 00:45:43,710 Want to know how it works? Click here to begin your journey into the net. 00:45:51,490 --> 00:45:54,710 Now exactly what happened when you clicked on that link? You started a flow of information. This information travels down into your own personal mail room, where Mr. IP packages it, labels it, and sends it on its way. Each packet is limited in its size. The mailroom must decide how to divide the information, and how to package it. 
Now the package needs a label, containing important information, such as sender's address, receiver's address, and the type of packet it is. 00:46:38,510 --> 00:46:41,900 Because this particular packet is going out on to the internet, it also gets an address for the proxy server, which has a special function, as we'll see later. The packet is now launched onto your Local Area Network, or LAN. This network is used to connect all the local computers, routers, printers, et cetera for information exchange within the physical walls of the building. The LAN is a pretty uncontrolled place, and unfortunately, accidents can happen. 00:47:18,110 --> 00:47:20,470 The highway of the LAN is packed with all types of information. These are IP packets, Novell packets, AppleTalk packets. They're going against traffic, as usual. The local router reads the address, and if necessary, lifts the packet onto another network. Ah, the router. A symbol of control in a seemingly disorganized world. [METHODICAL MUTTERING] 00:47:52,280 --> 00:47:57,690 There he is, systematic, uncaring, methodical, conservative, and sometimes not quite up to speed. But at least he is exact, for the most part. 00:48:18,900 --> 00:48:21,890 As the packets leave the router, they make their way into the corporate intranet and head for the router switch. A bit more efficient than the router, the router switch plays fast and loose with IP packets, deftly routing them along the way. A digital pinball wizard, if you will. [ERRATIC MUTTERING] 00:48:56,550 --> 00:48:59,210 As packets arrive at their destination, they're picked up by the network interface, ready to be sent to the next level. In this case, the proxy. The proxy is used by many companies as sort of a middleman in order to lessen the load on their internet connection, and for security reasons as well. As you can see, the packets are all of various sizes, depending on their content. 
00:49:42,010 --> 00:49:48,750 The proxy opens the packet and looks for the web address or URL. Depending upon whether the address is acceptable, the packet is sent on to the internet. 00:50:00,660 --> 00:50:04,060 There are, however, some addresses which do not meet with the approval of the proxy. That is to say, corporate or management guidelines. 00:50:12,830 --> 00:50:17,380 These are summarily dealt with. We'll have none of that. For those who make it, it's on the road again. 00:50:33,580 --> 00:50:35,920 Next up, the firewall. 00:50:39,867 --> 00:50:44,250 The corporate firewall serves two purposes. It prevents some rather nasty things from the internet from coming into the intranet, and it can also prevent sensitive corporate information from being sent out onto the internet. 00:50:57,240 --> 00:50:59,980 Once through the firewall, a router picks up the packet, and places it onto a much narrower road, or bandwidth, as we say. Obviously, the road is not broad enough to take them all. Now, you might wonder what happens to all those packets which don't make it along the way. Well, when Mr. IP doesn't receive an acknowledgement that a packet has been received in due time, he simply sends a replacement packet. 00:51:26,810 --> 00:51:32,260 We are now ready to enter the world of the internet, a spider web of interconnected networks which span our entire globe. Here, routers and switches establish links between networks. Now, the net is an entirely different environment than you'll find within the protective walls of your LAN. Out here, it's the Wild West. Plenty of space, plenty of opportunities, plenty of things to explore and places to go. Thanks to very little control and regulation, new ideas find fertile soil to push the envelope of their possibilities. But because of this freedom, certain dangers also lurk. You'll never know when you'll meet the dreaded ping of death. 
A special version of a normal request ping, which some idiot thought up to mess up unsuspecting hosts. The path our packets take may be via satellite, telephone lines, wireless, or even transoceanic cable. They don't always take the fastest or shortest routes possible, but they will get there eventually. Maybe that's why it's sometimes called the world wide wait. But when everything is working smoothly, you can circumvent the globe five times over at the drop of a hat, literally. And all for the cost of a local call or less. Near the end of our destination, we'll find another firewall. 00:53:02,200 --> 00:53:05,320 Depending upon your perspective as a data packet, the firewall could be a bastion of security or a dreaded adversary. It all depends on which side you're on and what your intentions are. The firewall is designed to let in only those packets that meet its criteria. This firewall is operating on ports 80 and 25. All attempts to enter through other ports are closed for business. 00:53:44,310 --> 00:53:52,580 Port 25 is used for mail packets, while port 80 is the entrance for packets from the internet to the web server. 00:53:57,710 --> 00:54:02,170 Inside the firewall, packets are screened more thoroughly. Some packets make it easily through customs, while others look just a bit dubious. The firewall officer is not easily fooled, such as when this ping of death packet tries to disguise itself as a normal ping packet. 00:54:24,240 --> 00:54:27,320 For those packets lucky enough to make it this far, the journey is almost over. 00:54:32,620 --> 00:54:39,120 It's just a line up on the interface to be taken up into the web server. Nowadays, a web server can run on many things, from a mainframe to a webcam to the computer on your desk. Why not your refrigerator? With a proper set up, you can find out if you have the makings for chicken cacciatore or if you have to go shopping. Remember, this is the dawn of the net. Almost anything's possible. 
00:55:04,300 --> 00:55:09,160 One by one, the packets are received, opened, and unpacked. 00:55:13,230 --> 00:55:18,050 The information they contain, that is, your request for information, is sent on to the web server application. 00:55:28,420 --> 00:55:37,000 The packet itself is recycled, ready to be used again, and filled with your requested information, addressed, and sent out on its way back to you, back past the firewall, routers, and on through to the internet, back through your corporate firewall, and onto your interface, ready to supply your web browser with the information you requested, that is, this film. 00:56:26,550 --> 00:56:30,460 Pleased with their efforts, and trusting in a better world, our trusty data packets ride off blissfully into the sunset of another day, knowing fully they have served their masters well. 00:56:43,730 --> 00:56:47,010 Now isn't that a happy ending? 00:56:55,454 --> 00:56:56,933 [END PLAYBACK] DAVID MALAN: All right, so that is how the internet works. And as has been our tendency over the past few weeks, now that we know how we can get data from point A to point B, we can abstract above that, and just take for granted now that we can move data from point A to point B and start moving the actual data. So that invites the question now of what is inside this envelope. When I get a response back from Google containing a whole bunch of cats, or when I get back my news feed from Facebook, or my inbox from Google. Well, inside of these packets quite often is messages that conform to HTTP, the Hypertext Transfer Protocol. So this is just one of those services that we alluded to earlier. Among them also were SSH, and DNS, and SMTP, and yet others. But HTTP is perhaps by far the most common one in so far as we use the web so much these days. So inside of HTTP, there are certain types of messages, messages that conform to certain patterns by which we get information. Now, what is the P in HTTP? HTTP, Hypertext Transfer Protocol. 
Well, let me borrow Arthuro over here. And we have this silly human convention of course that when you meet someone for the first time or the first time in a while, you say, oh, hi, my name is David. Nice to meet you, Arthuro. And we exchange hands. And when I put out my hand, Arthuro knows to put out his hand. And then we do this silly handshake. Why is that? Well, it's just a protocol. It's a convention. It's a set of conventions that we humans for better or for worse have adopted by which we greet each other. Similarly do computers have protocols via which they communicate, and sets of conventions that govern how you start to communicate and how you finish communicating. So what do those messages actually look like? The simplest of them is quite literally this verb here, GET, whereby inside of this envelope, when I'm requesting information of Google for the first time-- and indeed, I put that message before, search for cats-- that actually has a certain message at the top of it, really, that is literally GET. There's a little more information, but at the end of the day, it just is GET. Specifically, these are the first couple of lines inside of any request that my browser makes of a web server, like in this case, harvard.edu. If I want to get the default home page of Harvard, I literally, inside of my envelope, write this message-- GET, space, slash, space, HTTP/1.1, which is the latest version of HTTP that people use. Then below that, I specify the host that I want to talk to, just in case Harvard or Google or whoever has multiple domain names physically running on the same servers, which is possible. So I say Host: www.harvard.edu. And then maybe there's some other text. But this first line or two is really the most important. 
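Written out exactly, that message is just a few lines of text. Here is a minimal Python sketch that builds it; the helper name `build_get_request` is made up for illustration, and real browsers send many more headers than this.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build the smallest useful HTTP/1.1 request: a GET line and a Host header.

    Each line ends with a carriage return and newline, and a blank line
    marks the end of the headers.
    """
    lines = [
        f"GET {path} HTTP/1.1",  # the verb, what we want, and the protocol version
        f"Host: {host}",         # which site, in case one server hosts several
        "",                      # blank line: end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(build_get_request("www.harvard.edu").decode("ascii"))
```

Sending these bytes to port 80 of a web server is, at bottom, what a browser does when you visit an unencrypted URL.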
And then what comes back from the server, whether the request was sent to Harvard or sent to Yale, is a response that hopefully says, literally, OK, inside of which is the cat or inside of which is the inbox for Gmail or inside of which is my news feed from Facebook. All of which typically are in this language here, HTML-- HyperText Markup Language. So whereas HTTP is a protocol, like a sort of handshake agreement that governs that when I want to request information of a server, I should say GET and then a few other words, and then the server should respond with OK and a few other words, HTML is the language in which the actual web pages that are coming back from Google or Facebook or Harvard or Yale are actually written. It's not a programming language like C or Scratch. It's a markup language, as we'll see, that really controls formatting and layout. There aren't ifs and loops and other such constructs. But that's what's below the dot dot dot when the response comes back from Harvard or Yale or Google, this language HTML. Now, 200 is a status code, so to speak, that we almost never actually see from a server. But odds are, some of you have seen at least one of these status codes before. And perhaps the most obvious or the most familiar is probably this one here, when you've requested some web page, and either it doesn't exist anymore or you have a typo more commonly or the URL is broken for some reason. Odds are you have literally seen the status code 404, because the server is just showing it to you. But at a lower level, these numbers are actually typically sent in these packets of information back and forth from the server to me. But we'll see before long that you can use status codes like 301 and 302 to induce redirects, so to speak. If you want to send the user from one URL to another-- maybe the domain name has changed-- you can do that there. For efficiency, a server can say 304, not modified. As in, you already asked me for this page. 
It hasn't been modified since you asked me for it, so I'm not going to send it to you again, thereby saving a bit of time and bandwidth. Unauthorized (401) or forbidden (403) generally means that you don't have access to the file for some reason. And 500's actually pretty bad. So we'll probably induce this ourselves before long when we actually write programs that run on a web server. But 500 means there's generally a problem in your code that's supposed to be serving up web content to browsers. So let's actually see these kinds of things too. It turns out that I can pretend to be a browser at my command line here. In fact, I can use a program called Telnet, which is an older program, similar in spirit to something called SSH, which I mentioned earlier, but it's not encrypted. But it allows me to connect to a remote server specifically on a certain port. So I, for instance, can connect to harvard.edu and on port 80 specifically. I could actually with textual commands send emails to Harvard in this way, or send chat messages if they support that. But for now, we're focusing only on HTTP, the unencrypted version. And if I go ahead and hit enter, you'll see that I'm connected to www.harvard.edu.cdn.cloudflare.net, which is curious. But it turns out-- and we could see this if we poked around with nslookup again. It turns out that Harvard is also outsourcing its home page to a third party CDN-- Content Delivery Network-- called Cloudflare, so Harvard's servers really live elsewhere. And now I talked too long and the connection got automatically closed. So let me go ahead and redo this, and just pretend to be a browser by typing GET / HTTP/1.1, then Host: www.harvard.edu, and then Enter twice. And it flew across the screen, but let me scroll back up to the top. This is-- even though it might look cryptic to you at the moment if you've never made web pages before-- this is this language called HTML. And it's quite a lot of HTML, so let me keep scrolling up and up and up and up. 
Until hopefully if we go up high enough-- oh, I've exceeded my buffer. So I'm going to do this differently. I'm going to go ahead and-- you might recall from a past problem, where you can actually redirect the output to a file. So I'm going to go ahead and save this in a file called output.txt. GET / HTTP/1.1, Host: www.harvard.edu, enter, enter. And now I'm going to go ahead and open this file, which is here. And you can see that what just happened was this. The server responded with 200 OK, which is great. And then the date of the server in Greenwich Mean Time. And then a bunch of information. Cookies, we'll come back to these before long. But those will be germane to when we actually write our own software for the web. Drupal, seems that Harvard's website is using Drupal, a popular content management software for websites. And then there's some other stuff about caching and when the site expires and so forth. This is a little strange. Harvard's website apparently expired in 1978. But more on that another time. And so there's some interesting HTTP headers besides things like the host field that we sent and the GET and the OK that I mentioned earlier as well. Now, Telnet is not a very user-friendly way to do this. I'm going to actually redo this with a different command, curl, whereby I can do a curl -I, and I'm going to then do the full URL-- www.harvard.edu, Enter. And now what's nice with curl is that I don't actually see the HTML. I only see in this case the HTTP headers, which are still quite a few, but we can now at least see them a little more readily. In fact, let me go and do the same now for yale.edu, and see if we can glean any differences in their servers. There we go here. So the headers that are coming back for Yale are these that I've highlighted. And it looks too that there's some interesting stuff going on. It seems that Yale also uses Drupal. So it seems that both universities are doing something rather familiar. 
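To make the shape of such a response concrete, here is a small Python sketch that pulls apart the status line and headers. The response text in `SAMPLE` is invented for illustration and is not what Harvard's or Yale's servers actually sent; `parse_head` is likewise a made-up helper. The standard library's `http.HTTPStatus` is used only to look up the conventional phrase for each status code mentioned so far.

```python
from http import HTTPStatus


def parse_head(raw: str) -> tuple[int, str, dict[str, str]]:
    """Split a raw HTTP response into (status code, reason phrase, headers)."""
    head, _, _body = raw.partition("\r\n\r\n")   # headers end at the first blank line
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")     # a header is a word, a colon, a value
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


# An invented response, loosely modeled on the kinds of headers described above.
SAMPLE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Expires: Sun, 19 Nov 1978 05:00:00 GMT\r\n"
    "\r\n"
    "<!DOCTYPE html>..."
)

if __name__ == "__main__":
    code, reason, headers = parse_head(SAMPLE)
    print(code, reason)
    print(headers["expires"])
    # The status codes mentioned so far, with their standard phrases.
    for number in (200, 301, 302, 304, 401, 403, 404, 500):
        print(number, HTTPStatus(number).phrase)
```

Running `curl -I` against a real site prints the same kind of status line and headers, just with whatever values that server chooses to send.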
But most of this information is not all that useful. But it is useful if maybe we do this. What if we visit, for instance-- why don't we go to HTTP-- how about we go to reference.cs50.net, which you might use as an alternative to man pages. And this is a little curious. It moved permanently. This is not 200 OK. Moved permanently. Where did it go? Well, wait a minute, let me go ahead and highlight that URL. And let me go ahead in another tab and just go there. OK, it's there. So where did it move to? And in fact, if I look at the domain again, it is indeed there, but notice this. Almost all of CS50's websites actually run not over HTTP per se but HTTPS, where the S means secure, whereby all of our websites for the most part are encrypted. But that's not what I typed. I just went to http://reference.cs50.net. And yet when I do that with this command line interface, which mimics the behavior of a browser, if I visit HTTP, I'm told by CS50's server, moved permanently, status code 301. But notice this one other header that's kind of interesting-- location. This location header-- and a header to be clear is just a word, a colon, and then a value. This header specifies where we moved to. So this seems to be a mechanism whereby using HTTP headers-- sort of messages inside the envelope that the human doesn't really see, but that the browser does understand. This seems to be a way that we can forcibly redirect all users from the insecure version of our website to the secure version, so that thereafter, all of the information is secure. And frankly, there's not all that much private information going on there. But if you don't really want the whole world or the NSA or Harvard or Yale knowing what pages, what functions you need to look up on reference.cs50.net, by forcing everything to HTTPS, in theory, everything is perfectly secure now so that only you know what pages you're visiting. And we, since we run the server. But no one in between. 
And indeed, that's one of the biggest values of using HTTPS-based URLs, so that even if there is some man in the middle, so to speak, a bad guy, an adversary between you and that remote server, whether it's here on campus or in Starbucks or the airport or some random adversary on the internet, he or she in theory should not be able to see anything between points A and B if you are, one, as before, using a VPN between those points or, two, using a protocol like HTTPS that by design is encrypting information. And suffice it to say the encryption is far fancier than Caesar or Vigenère. But it is indeed similar in spirit, where those zeros and ones going back and forth are scrambled in some way that only you and the point B server can actually decode them or decrypt them. So let's visit an actual website now, Google. But before we do that, let's turn off some of the more modern features by going to Settings, going to Search Settings, and turning off so-called instant results. Because for our purposes today, instant results use a technology or language called JavaScript, which we'll get to in a few weeks' time, but for now it's just going to be a distraction from the underlying HTTP feature. So I'm going to go ahead and indeed never show instant results. So that now when I search for something like cats on google.com and hit Enter, I'm going to find myself at a fairly long URL, indeed this URL here. And I have no idea what most of this URL means, not knowing how Google works underneath the hood. But I'm looking for some familiar patterns. And indeed, if I pretty much a little ignorantly but hopefully cleverly just delete anything I don't understand, I'm going to deliberately leave myself with just the essence of this URL. So notice, I didn't type this URL. I ended up at this URL after I typed in cats to that search box and hit Enter. Now I found myself at a really long URL and then I just started deleting things I didn't understand to distill this URL into quite simply this. 
https://www.google.com/search?q=cats. Well, it turns out that much like in the world of C, you have functions from CS50 like getString and getInt, or if you implement them yourself, scanf or other such functions whereby you can get user input. It's less obvious at first glance how a web server can get input from a user. Because there is no-- well, rather, you can see the search box that I typed into, but until I hit Enter, the server doesn't see that information necessarily. And that's a bit of a white lie, because nowadays thanks to JavaScript and thanks to autocomplete, Google's actually seeing every keystroke you type. But in theory, when I hit Enter, only when I hit Enter, do they see the full word cats. And how do they get access to it not having physical access to my keyboard? They see it in the URL here. And so indeed HTTP, beyond supporting status codes and the sort of digital equivalent of my handshake with Arthuro, also supports input, specifically an input parameter that in this case is arbitrarily but reasonably called q, because back in the day, Google decided that the default input to its search page would be q for query. And indeed, if I hit Enter now, the results seem no different. So for whatever reason, Google uses by default a lot more parameters, all of which I deleted. But the only necessary one is q, whose value here is cats. And notice even without changing the page, I can go up in here and change my cats to dogs and hit Enter. And now notice I've searched for dogs just as though I had typed this myself. But indeed, the only thing I've been changing up here is the keyword. And if I search for mice now, I'm changing the search result. So it seems that the essence of an HTTP request boils down to what is sent here. So let's try this as well. Let me go ahead and copy that URL. And just for good measure, I can go ahead and do something like curl and then paste this URL. And let me go ahead and quote it, just because it has a question mark that could break things. 
And hit Enter. It's pretty overwhelming here, but this is all of the HTML that's coming back from google.com. So when I see these search results in google.com, this web page is written in this language called HTML. And HTML, as we'll see, is a little overwhelming perhaps at first glance, but follows some very simple patterns. And we can see them better in browsers like Chrome as follows. If you Control-Click or right click on your web page, most any web page if you're using Chrome, you can choose Inspect. And there's keyboard shortcuts and other menu options by which you can access this. And notice among the elements tab here that just popped up. And notice now, again a little overwhelming. But what's nice about Chrome-- and Edge can do this and Firefox and Safari and others-- it can pretty print your HTML. Sort of like Style 50 you can sort of see through any messiness, similarly, can the browser kind of look at the mess that just came across the wire from Google and format it as follows. And indeed, it looks like this language HTML follows a certain pattern. There's always this at the top, open bracket, exclamation point, doc type, HTML, close bracket. Then there's open bracket html in lower case, then some other words and quotes and equals signs perhaps. Then a head, then a body. Maybe some divs for divisions of the page. And even though this is quite a lot, let's look at a simpler one just for kicks real fast. Let's go to harvard.edu and hit Enter. And indeed-- well, actually, it looks just about as complicated. Here's the HTML that composes harvard.edu. So let's try to distill this into its essence. I showed a web page earlier. Let's go back to that to point out-- to be clear, these were called query strings. Let's come back to HTML. So HTML is up to version 5 these days. And this governs what syntax you should use when writing HTML. And here per the earlier slide is perhaps the simplest web page we can make. 
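The slide itself isn't captured in this transcript, so the following is a reconstruction rather than the exact slide: a minimal sketch of an HTML5 page with the parts named so far (a document type declaration, an html element, a head with a title, and a body), using the lecture's hello, world text.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- the title appears in the browser's tab, not in the page itself -->
        <title>hello, world</title>
    </head>
    <body>
        <!-- plain text inside the body is what the page actually displays -->
        hello, world
    </body>
</html>
```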
So the key components-- and there's others we can add and others we will soon add-- boil down to this. This first line, this is the so-called document type declaration. This is just a fancy way of saying, you have to type this line first in your file in order to tell the browser that's reading this file top to bottom, left to right, that this web page is written in version 5 of HTML. Previous versions either didn't have this or had longer versions of this. It's just a globally-understood symbol that means version 5. Then below that are your actual HTML tags. So web pages are composed of HTML tags, or more properly, elements. And most elements have an open tag and a close tag-- a start tag and an end tag-- that are identical, except for typically the slash. So indeed, notice the symmetry. This tag here, insofar as it's what we'll call an open tag or start tag, means hey browser, here comes a web page written in HTML. Hey browser, here comes the head of the web page. Hey browser, here comes the title of the web page. And there's no technical reason I wrote this all on one line instead of putting hello world on its own line and this other tag on its own line. It just felt short enough to just write on one line, so I went with it. But notice that title is opened here. Then there's literally some hard coded text, hello world. And then there is the opposite, so to speak, of the tag. It's the same word for the tag, but with this forward slash inside of the tag, which closes or ends the tag and sort of ends the whole title element. Meanwhile, that's it for the head, at least in this example. So hey browser, that's it for the head. Oh hey, browser, here comes the body. Hey browser, here's some actual text. Hey browser, that's it for the body. Hey browser, that's it for the web page. So I've also by convention-- and for stylistic purposes like in C-- indented things to be very pretty printed, very readable to humans. But the browser certainly doesn't care. 
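To make the point about indentation concrete, here is a minimal hello, world page of the kind just described, written as a sketch with no line breaks or indentation at all. A browser should display it the same way as a neatly indented version; only human readers lose out.

```html
<!DOCTYPE html><html><head><title>hello, world</title></head><body>hello, world</body></html>
```

The tags still open and close in the same order, which is all the browser relies on.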
And indeed, we saw when we looked at the mess that is Google's website, it's just a big mess of tags and markup so to speak. But for Google, that makes sense, because you don't want to have to transmit any characters unnecessarily. Indeed, if you think about it, if Google's website gets visited by a billion people per day, which actually feels kind of reasonable. And suppose that a programmer at Google hits the space bar just one extra time and saves Google's home page. Well what's the implication of Google having just one additional space in their web page? If that web page is downloaded a billion times, that's a billion extra ASCII characters that gets downloaded per day. And a billion ASCII characters is a billion bytes, which is one gigabyte. So just by hitting the spacebar can really big players like Google cost themselves a huge amount of space and maybe cost or time. So that's why a lot of big websites minify or compress their information, whereas we will be a little more lax here, because it's more important for now certainly that things be readable and understandable. But the white space does not matter to the browser. So let's actually do something with this. Keeping in mind the following, just as this indentation kind of implies, this really if you think about it is a tree structure. There's some document on the screen, which I will literally call document, because that's what browsers do. The top element of which-- I'll draw with a rectangle, distinguish it from the document itself-- is the HTML element that starts here and ends here. And in so far as it starts here and ends here, everything that's inside of it, you can think of as children in a family tree. And the first child is head, the second child is body, left and right respectively. The head tag meanwhile has the title child, and so that's why we see title here. And then I'll draw it with an ellipse, just different shape because it's raw text. It's not an actual tag. 
And similarly does body have some text below it. So this is just a tree. It's not a binary tree, although it might be by coincidence here, because there aren't many children. But it's some kind of tree structure, each of whose nodes has zero or more children. And indeed, underneath the hood what is IE, what is Edge or Firefox or Chrome or Safari actually doing when it downloads a web page like this? Some programmer or programmers have after taking classes like CS50 and knowing what these data structures are implemented in code a tree that represents that web page. And indeed, once in a few weeks we get to JavaScript using yet another language will you be able to manipulate that tree in real time to change the contents of a web page and what a user is seeing. Indeed, if you kind of fast forward in your mind, suppose that you do use something like Facebook and Messenger built into it for sending messages to people or Gmail, where you suddenly get new rows of emails and your web page, what's really happening? Every time you get a message in Facebook, it's just as though this tree is getting modified with like another child somewhere in here. Every time you get a new email in Gmail, it's like another node is appearing in this tree. So there really is this equivalence to this markup language HTML and the tree structures that we've just come from in recent weeks. So let's actually now do something with this. I'm going to go over to CS50 IDE, and I'm going to go ahead and make if you will the simplest of web pages as follows. I'm going to go ahead and create a new file, a text file. I'm going to call it hello.html. And I'm going to go ahead and populate this with exactly what we saw a moment ago. Doc type, HTML. Open bracket, HTML. And notice that CS50 IDE is trying to be helpful here, and when it notices you typing something familiar, it's going to try to finish your thought for you. So indeed, it did. I'm going to go ahead and open now the head of the page. 
It's going to complete that thought. I'm going to open the title of the page, hello world. And now I'm going to move my cursor down here physically to do body, close bracket, hello comma world, save. So I have written code. It's source code, but it's code written in HTML-- HyperText Markup Language. And indeed, you see no loops or conditions or functions. There's no logic. This is just markup. Do this, stop doing this. Do this, stop doing this. It's fairly mundane. But it's going to allow us to actually visit this file in a browser. Indeed, let me go into a browser now and visit this page hello.html. Incredibly underwhelming. Indeed, this is a huge screen. And all I've created is a web page that says hello world up here. And if I scrolled up, I could actually see the tab whose title is also hello world. But that's my first web page. And if I now apply a lesson learned, if I go ahead and right click or Control-Click Chrome's backdrop and choose inspect, now you'll notice finally here's a simple web page, and not all the messiness that was Harvard's or Google's. You can actually see your HTML. You can't permanently change the files here, because you need to do that in CS50 IDE and change the files. And so here's where there's a potential point of confusion. CS50 IDE is of course a cloud based service, and it's where I'm writing and saving my files. And it just so happens that built into CS50 IDE is its own web server just for serving students work. So when I visit this web here in another tab, I'm visiting not CS50 IDE per se, but the web server running on a certain port on CS50 IDE so I can serve up these web pages. So let's go ahead and do something a little more interesting than that. Let me go ahead now and create another file say as follows. Let me go ahead and copy this just for good measure so I don't have to recreate the whole thing. And let me go ahead and create a new file called Image.html. Paste this in here. 
And instead of hello world, I'm just going to write say image up here. And how do I embed an image? Well, turns out that there is literally an image tag-- img to be succinct. Indeed, you might want to write this out. But nope, back in the day people decided that img is sufficient. I'm going to go ahead and give it a source. What should the source of this be? Well, let me just do a quick search for like a grumpy cat. And there's a good one. So I'm going to go ahead and Control-Click or Right Click and copy for our purposes now just the image address here. We'll assume this is my image and I'm grabbing the address here for the moment. I'm going to paste it in here, and that there is the URL of a JPEG that is of a grumpy cat. Now with an image, there isn't really the same concept of like starting an image and stopping an image like there is start the title, stop the title, start the body, stop the body. And so there are so-called empty elements in HTML that you can express either by doing this, which feels a little silly. Like you're opening the image tag and then immediately closing it, which feels a little ridiculous. And so there's shorter hand syntax where you can actually put the slash inside of the open tag like this so that the element is empty so to speak. Open and closed. It's not strictly required, but at least this way we're making clear our intent is to open and close the thing all at once. Now for accessibility purposes, for someone who has trouble with vision, you might want to provide some alternative text like grumpy cat so that if they're using a screen reader or some other device, they can actually have assistive support explaining what it is that you might otherwise be seeing. So let me go ahead now and open this file image.html. And it's pretty darn simple. But there is my own web page with this big white background, and nothing else yet but this grumpy cat. All right, but of course this web page doesn't do anything. 
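The image page described above might look roughly like this. It's a sketch, not the lecture's actual file: the image address pasted in the lecture isn't in the transcript, so the src value below is a placeholder URL.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>image</title>
    </head>
    <body>
        <!-- img is an empty element: no separate close tag, just an optional slash -->
        <!-- src is a placeholder address; alt is the text a screen reader can read aloud -->
        <img alt="Grumpy Cat" src="https://example.com/grumpy-cat.jpg"/>
    </body>
</html>
```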
It would be nice if I could click on something and go somewhere. So let's do that. Let's do another example whereby-- I'll call this link.html. And in here-- let me get started just by copying and pasting that-- instead of the cat, let me go ahead and do an anchor. So it's a little counterintuitive. It's not link, it's anchor. And then anchor, confusingly, has an href, a hyper reference, which is the link to which it goes. And I'm going to go ahead and do something clever like https://www.google.com/search?q=cats. And then close bracket. And now notice CS50 IDE is trying to be helpful. It closes the tag for me, and I can just write the word cats. But let me finish this thought. Let me say search for cats period. And so now, even though we've seen only some simple tags so far, you can use HTML inline, so to speak, sort of in the middle of another thought. If I want to convey the sentence search for cats, but I want cats to be clickable so that when you click on the word cats it actually goes to Google and searches for cats, I can borrow the idea from earlier-- and I just happen to remember that q is the query that I have to pass in. And notice that I surround cats with the open tag and the close tag. So that now if I open a browser with this file, I see again, a very simple web page. And I can even zoom in to make this more clear. All it says is search for cats period. But notice, it's the link alone that's underlined. And it happens to be purple by default, because we already searched for cats earlier, and browsers typically remember URLs you visited. So that's why it's purple and not say blue, which tends to be the default. But if I click on this, indeed, I get a page full of cats. I can combine these ideas. Let me actually go into the IDE, and instead of the word cats, let me go ahead and paste the image tag. So it's a little hard to see all on one line here, but notice I have search for, then a href, close this tag. 
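The text-link version of link.html described above can be sketched as follows. The URL and the sentence come from the lecture; the title text is an assumption.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>link</title>
    </head>
    <body>
        <!-- a is the anchor element; href holds the URL the link goes to -->
        <!-- only the word between the open and close tags is clickable -->
        Search for <a href="https://www.google.com/search?q=cats">cats</a>.
    </body>
</html>
```

Swapping the word cats between the anchor's tags for an img element is what makes the whole picture clickable instead.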
And then immediately open the image tag with its same value as before. And then close that. And then close the anchor tag. Save that, reload. Now it's a little stupid grammatically. Search for cat picture. But notice if I hover over the cat, my cursor becomes a little pointer. And indeed, if I look in Chrome's bottom left corner, I'll see that if I click, it's going to lead me to a URL. And indeed, if I click on the cat, anywhere on the cat, now I've made a hyperlink. So now the world wide web so to speak is getting more interesting. It's getting pretty ugly, but at least it's getting more interesting. So what are these things? They're not tags, per se. These are what we'll call attributes. So indeed, it seems that based on these simple examples alone certain tags like image can have their behavior modified with these attributes. And the format for those is a keyword like alt for alternative equals and then quote unquote some value, and source-- src-- which is by design. You can't write out source S-O-U-R-C-E. You'd have to do src per the documentation equals quote unquote some URL. And you would only know that these things exist by googling around, reading some online documentation, taking a class. But thankfully, there's not terribly, terribly many of them. And most every one can be looked up on demand when you're curious how to do something. In fact, let's take a look at a few other tags some this time that I've put together in advance. We have a whole bunch of online examples that you're welcome to look for online. Here's one that has a whole bunch of paragraphs. So in this page here, notice that I've done a couple of things. Inside of my body, I have a bunch of Latin paragraphs. Sort of nonsensical Latin, but I've wrapped each of them in an open p tag and a closed p tag, simply because I want these to be three separate blocks of text. And let me go ahead into my browser now and open this file in today's directory as paragraphs.html. And that's it. 
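A page like the paragraphs.html just opened might be structured as follows. This is a sketch, not the course's actual file: the Latin is abbreviated placeholder text, and the meta tag in the head is the standard viewport declaration, included on the assumption that it matches the one the lecture turns to next.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- asks phones to lay the page out at the device's own width -->
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>paragraphs</title>
    </head>
    <body>
        <!-- each p element is rendered as its own block of text -->
        <p>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </p>
        <p>
            Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
        </p>
        <p>
            Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris.
        </p>
    </body>
</html>
```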
It's a little more interesting now that it fills the screen. But indeed, there are distinct paragraphs. There's one other tag that I proactively included here, which is a little cryptic at first glance. But this is a meta tag that has to go in the head of the web page. And here too you would know this from some online reference. And it's cryptic only insofar as there's a lot of words here. But the effect of this essentially is that if this same web page is viewed not on my browser but on my phone, which might otherwise be pretty small to look at, and I'd have to squint to see the text, this tag is one technique for actually telling the web page to sort of resize itself and the text for whatever the device width is. So without this tag, you might have to squint to actually read these three paragraphs on an Android phone or an iPhone. With that tag, the font size will sort of grow to take into account the fact that this is a smaller device and everything should not just be squeezed in on there. But otherwise, syntactically, everything else there is the same. Let's look at another example. If I go into headings.html, this one doesn't do all that much. But it seems to demonstrate tags called H1 through H6, literally saying one, two, three, four, five, six. And by convention, though this differs ever so slightly by browsers, H1 is big bold text. H2 is not quite as big, but still bold text. H3 is not quite as big. H4 not quite as big. Headings that you might see in a research paper or in the chapters and sections or subsections of a book. It's a way of adding sort of semantic headings to a web page that in our case might look ultimately like this. From bigger to smaller. And so these might just be the section headings in some book or some kind of reference like that. What about lists, which are pretty common? Well, if we go into list.html, it's pretty common on the web or in various applications to have bulleted lists or ordered lists. 
This is an unordered list of bullets, foo, bar, and baz, which are just silly variable names in computer science. And if we want to see what this one is, if I go into list.html, you'll see quite simply that we just have a little more nesting. Body, UL, and LI. So UL is Unordered List, LI is List Item, and foo, bar, and baz are each of the three list items. If I change this ever so slightly to OL, Ordered List, and then go back to that web page and reload, now it's an automatically numbered list. So there's a lot of features you sort of get for free here, not unlike a typical word processor. If we want to go really all out and see a lot of nesting, you can see a table here, which might be useful if you want to show a whole bunch of tabular data for research purposes or maybe sports scores and data on an ESPN site or the like. It's a little more involved, but if you just read it top to bottom, it all becomes pretty intuitive. Inside of this page's body there's an HTML table. This table has a TR, Table Row. And that table row has table data, table data, table data. So three columns, left to right. And another row with another three columns, another row with another three columns, another row with another three columns. And I chose these values arbitrarily just to kind of mock up an old school telephone keypad, because indeed, if we go into this with table.html, you see this. You can add borders, and we'll see ways you can actually tweak the aesthetics. But it's just laying things out in a grid here, like you might tabular style data. But none of these have been all that pretty thus far. Indeed, I'm just using the default fonts and sizes, which apparently are just black text, white background, Times New Roman font, and pretty small text at that. The web of course these days is much prettier than this. So how do you actually start to stylize things? Well, as we often do, let's take a progression of ideas. Let me go into version zero of this file. css0.html. 
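The list and table pages walked through above can be sketched in one file. This combines what the lecture keeps in separate files (list.html and table.html), and the keypad cell values are an assumption based on the old school telephone keypad remark.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>list and table</title>
    </head>
    <body>
        <!-- ul is an unordered (bulleted) list; changing ul to ol numbers the items -->
        <ul>
            <li>foo</li>
            <li>bar</li>
            <li>baz</li>
        </ul>

        <!-- each tr is one row; each td is one cell, left to right -->
        <table>
            <tr>
                <td>1</td>
                <td>2</td>
                <td>3</td>
            </tr>
            <tr>
                <td>4</td>
                <td>5</td>
                <td>6</td>
            </tr>
            <tr>
                <td>7</td>
                <td>8</td>
                <td>9</td>
            </tr>
            <tr>
                <td>*</td>
                <td>0</td>
                <td>#</td>
            </tr>
        </table>
    </body>
</html>
```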
That does something terribly simple. It's more interesting than any of the pages we've seen thus far, if only because we have some slightly differing font sizes and some actual content, but it's still pretty simple. So what am I doing? This is big and bold and centered. This is kind of medium and centered. And this is kind of small, this copyright holder there. So let's solve this in one way, but then iteratively improve upon this as follows. Let me go into css0.html, and we'll see that I've introduced amazingly already another language. CSS-- Cascading Style Sheets-- is another language that is almost always used in conjunction with HTML these days. And whereas HTML is all about formatting-- rather, all about markup and all about layouts and sort of semantically tagging things in a way that makes sense, CSS is used to kind of take things the last mile and stylize things so that they look and appear in exactly the way that you intend. So this is a little messy at the moment, because I seem to be co-mingling my HTML and CSS literally as follows. Turns out that in HTML there's a generic tag called the div for just a division of the page. If you want to think of the page as having rectangular regions, div would be one way of doing that. Or you could use a p tag or paragraph. And I can add a style attribute here, that is, style equals font-size colon 36 pixels semi-colon font-weight colon bold semi-colon. And not all of the semi-colons, at least on the end there, are necessary. But this is two CSS properties. A property called font-size with a value of 36 pixels, and a property called font-weight with a value of bold. And then similarly, notice that with a div tag outside of this have I wrapped it all with text-align center. And that's a property called text-align. Its value is center, and it's going to center all of its children so to speak. So we can use the same language from our discussion of data structures and trees. 
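The inline-style approach described above might look like this. It's a sketch under stated assumptions: the property names and sizes come from the lecture, but the text inside each div is placeholder content, since the transcript doesn't record what the page actually says.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>css0</title>
    </head>
    <body>
        <!-- the outer div centers everything nested inside it -->
        <div style="text-align: center;">
            <div style="font-size: 36px; font-weight: bold;">
                John Harvard
            </div>
            <div style="font-size: 24px;">
                Welcome to my home page!
            </div>
            <div style="font-size: 12px;">
                Copyright &#169; John Harvard
            </div>
        </div>
    </body>
</html>
```

Each style attribute holds CSS property and value pairs separated by semi-colons, which is exactly the co-mingling of markup and aesthetics that the later versions clean up.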
Meanwhile, you'll notice that my middle div is slightly smaller at 24 pixels and not bold, and my last one is 12 pixels. But this is a little messy now, because I've co-mingled my HTML markup with my CSS. It would be kind of nice if we could factor out the aesthetics, put them in one central spot to make it easier to edit. And so let me propose this instead. I've now simplified the body of my page to just have three divs, each of which has a unique ID. Turns out there's an attribute in HTML called ID that allows you to have a unique identifier. You can use almost any word you want, though there are some restrictions on the letters you can use, or where you can have numbers, and so forth. But I'm just going to sort of conveniently call the divs top, middle, and bottom. And those are unique. And now that I have the ability to identify those divs uniquely, let's look at another tag up here. Inside of the head of my web page now, notice I have a style tag. Not a style attribute, an actual style tag. And the syntax here is a little different from before, but it's kind of reminiscent of C. But none of this has to do with programming per se, this is just aesthetics now. This syntax here says, hey, browser, apply to the body tag the following CSS properties in between curly braces. Text-align center for the entire body. Hey, browser, apply the following properties to whatever HTML tag has a unique ID of top. So the hashtag here means ID. It's just a symbol that the world has adopted. So this means whatever HTML tag has a unique ID of top, apply these two properties to it. Notice the semi-colons on the end, and I've indented everything to keep things nice and pretty. Middle will have this property, bottom will have that property. So now it's cleaner in that I've relegated to the top to one central spot all of the aesthetics of my web page. I've left all of the lower level markup down here. 
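The style-tag version described above can be sketched like this. The selectors, IDs, and property values follow the lecture; the file name css1.html and the text inside the divs are assumptions.

```html
<!DOCTYPE html>

<html>
    <head>
        <style>
            /* applies to the body element and everything inside it */
            body
            {
                text-align: center;
            }

            /* # selects the one element whose id is top */
            #top
            {
                font-size: 36px;
                font-weight: bold;
            }

            #middle
            {
                font-size: 24px;
            }

            #bottom
            {
                font-size: 12px;
            }
        </style>
        <title>css1</title>
    </head>
    <body>
        <div id="top">
            John Harvard
        </div>
        <div id="middle">
            Welcome to my home page!
        </div>
        <div id="bottom">
            Copyright &#169; John Harvard
        </div>
    </body>
</html>
```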
So that if on a whim tomorrow I want to change the font size or the color or the layout, I can do that very simply without actually changing the data. So the data is things like these white words here. And I've got some metadata, these red tags and green attributes, here, so that I can uniquely identify things in the page. But the aesthetics are now fundamentally separated. But it's still a little messy, because they're still in the same file. So let me open a third version of this, css2.html, which makes the file even smaller. What do I seem to have done here? So in this case, I seem to have similarly given IDs to these three divs. But I've introduced into the head of the page not a style tag, but a link tag, confusingly named, because it's not an anchor tag, it's link with an href. So even more confusing. But all this means is hey, browser, grab the contents of this file-- css2.css-- the relation to this file is that of style sheet. So it's stylisation. And then apply it to this web page. What is in css2.css? It's just those same CSS rules as before, but in their own file. So what's the purpose of this? At the end of the day, the result in each of these three cases is an identical web page. All three of these things look exactly like this, so they're no prettier. But from a design perspective underneath the hood, these things are fundamentally better designed, because now this CSS file in theory could be shared across multiple pages. Multiple pages of mine could now have this one link tag up top, so that once a browser downloads css2.css or whatever the file is, it can reuse and cache the file for my entire website so that as the user clicks around my website, they don't have to re-download the CSS file. And indeed, even if the browser tries, it can get that HTTP 304 not modified message so that it doesn't waste time or bandwidth redownloading the file. So this also allows me to use, as we'll eventually see in future problems, third party libraries. 
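The external-stylesheet version described above might look like the sketch below. The link tag, its href, and its rel value follow the lecture; the text inside the divs is placeholder content, and css2.css is assumed to sit in the same directory and to hold the body, #top, #middle, and #bottom rules.

```html
<!DOCTYPE html>

<html>
    <head>
        <!-- fetch css2.css and apply it to this page as a style sheet -->
        <link href="css2.css" rel="stylesheet">
        <title>css2</title>
    </head>
    <body>
        <div id="top">
            John Harvard
        </div>
        <div id="middle">
            Welcome to my home page!
        </div>
        <div id="bottom">
            Copyright &#169; John Harvard
        </div>
    </body>
</html>
```

Any other page on the same site can carry the same link tag, which is what lets the browser cache the one CSS file and reuse it.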
It turns out that a lot of people in the world who are better than little old me at design certainly have created files ending in .css that have some really beautiful stylizations that you can apply to your own web pages so that you don't have to worry about as much the aesthetics. Bootstrap is one such tool formerly from Twitter, and other such libraries exist that allow you to stylize your site just by using themes or skins, so to speak, that other people have created. There is one last piece of syntax here I should draw attention to is this thing here. So this cryptic sequence of characters is what's known as an HTML entity. It turns out there are some symbols that to my knowledge I can't type on my Mac's keyboard, like the copyright symbol. You can maybe do it on iOS these days via special software support. But this is the canonical way of putting certain special characters inside of a web page that you might not be able to express or easily express on your keyboard. And these are standardized, too. So if I actually googled HTML entities, I could actually see whole charts telling me that ampersand hashtag 169 semi-colon will give me the copyright symbol. And just to be clear, when that's actually rendered, you don't see that in the page. You instead see the more familiar copyright symbol there. So let's now finally try to tie some of these things together. I know that Google supports search queries via GET. And this is in contrast just to be clear with one other thing. That is POST. It would be a little worrisome if every time you logged into Facebook or Google or any website, or any time you bought something on Amazon or any website, if your credit card and your password and all your sort of semi-private information appeared in the URL just like these Google search queries. So it turns out that HTTP supports another verb. And there's a few others, but the two we'll focus on are GET and POST. 
And POST is inside the envelope's initial message, just like my handshake to AJ, almost identically. But instead of GET, it's POST. What do you want to post information to and what protocol do you want to use? This is an example of a snippet of how I might log into Facebook. When I log in to Facebook, I don't want my friends or my siblings or my family members being able to see in my browser's history or the search box what my user name or really what my password is. And exposing those in the URL is exactly what HTTP GET does by design. POST is just another way of submitting information to a server, still using the same conventions of HTTP parameter equals some value. And indeed, you can send multiple ones by separating them in this case with an ampersand. No relationship to the ampersand we just saw in an HTML entity. But notice that this email and password are deliberately below the HTTP headers. So they're not in the URL bar, they're instead deeper inside the envelope, if you will. But I need to know this because when I make my own web pages, this becomes relevant. Let me go ahead and create a super simple web page called search.html that again has the doc type declaration at the top, that then has my HTML tags, my head tags, my title tags-- and I'll call this search. And then over here I will have the body of the page. And then I'm just going to do an H1 for CS50 search, which is just a big bold heading on the page. And now I'm going to have a form. And I'm going to have action equals https://www.google.com/search. The method I want to use is necessarily GET, not POST. Though in different contexts, I might want to use POST. But I'm not doing logons or something like that. I'm using Google's search engine. So now I have the HTML form element, which we've not yet seen. But it turns out there's another tag called input that you can give a name to like q, that can be a type like text, and it's empty. And then we can have another input whose type might be quote unquote submit and close that tag. 
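Put together, the search page being typed above might look like this. It's a sketch following the spoken description; the exact capitalization of the heading and the self-closing slashes are assumptions.

```html
<!DOCTYPE html>

<html>
    <head>
        <title>search</title>
    </head>
    <body>
        <h1>CS50 Search</h1>
        <!-- action is where the form's input gets sent; method="get" puts it in the URL -->
        <form action="https://www.google.com/search" method="get">
            <!-- the name q becomes the parameter name in the query string -->
            <input name="q" type="text"/>
            <input type="submit"/>
        </form>
    </body>
</html>
```

Typing cats into the text box and submitting should send the browser to https://www.google.com/search?q=cats, the same URL distilled by hand earlier.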
And then save the page. If I now go back into this file and go to search.html, if I zoom in, we see if you will, version one of Google, without any aesthetics. And indeed, the actual version one of Google wasn't all that much more complicated. But if I now type in cats, submit this query, I go to actual Google, typing in effectively cats, because of the URL I was redirected to-- which is to say that using HTML, we can reconstruct exactly what Google's been doing all this time. Because if you distill the essence of Google into just a few lines of code, this is it. And indeed, this is essentially what Google looked like a few years ago. Although, to be fair, they also had this. They had another input whose type was submit, and whose value even early on was I'm Feeling Lucky. And if we save this, it's not going to actually do anything, because we need a little more logic in order to make that work. But if I reload, now we get the second Google button as well. And so all we've implemented for now is the front end of Google, so to speak. We have completely punted to Google's back end, their own databases, their own software, the actual searching of things, because we don't really have a language yet, a way of expressing searches ourselves. Indeed, we could using C and using HTML and using CSS start to build our own server, and we could actually write code in C that receives something like q equals cats, parses the cats, like to read it, extract it from that string, then figures out in our own database where can I find some cats. But it's going to be incredibly, incredibly tedious to do that in C. In fact, if you think back to the problems Vigenère and Caesar and the like, even just manipulating strings in C is really non-trivial and gets quickly tedious. And so we really need a better language. And that language is going to be in the coming weeks Python, which is a higher level language than C. In fact, the Python interpreter so to speak itself is written in C. 
So the world some years ago used C to write support for really what many would call a better language for solving problems like this. And so not only can you use Python for command line applications and processing and analyzing data like a data scientist might use it for, we can also use Python to actually write the back end of google.com, or the back end of Facebook, or the back end of any web server that has to read the parameters, understand them, maybe look up some data or store some data in a database, and respond to the user with dynamic output. So all that and more in the weeks ahead. [MUSIC PLAYING]