DAVID J. MALAN: But why don't we begin now with internet technology. The goal of the session being to better understand how this thing works that most of us use every day, how we can leverage it, and most importantly, try to build it conceptually from the ground up. So that after today and after tomorrow, you have not only the practical benefit of being able to diagnose things in the real world a little more effectively, you've sort of had the opportunity to think through, with the proverbial engineering hat, how you could go about building things that at first glance might seem so terribly complex, but at the end of the day are the result of some fairly logical well-defined decisions, again layering from bottom to top. So with that said, what is the internet? Let's start there. Half of you are on it right now, so you must know. What is the internet? Yes. Sean. AUDIENCE: It's a way to access different websites, I guess, [INAUDIBLE]. DAVID J. MALAN: OK. So a way to access different websites, network structure. OK. Let's dive in a little more physically. So that's what it does. What is it actually? What do you got, [? Abi ?]? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: OK. Good. More precise. So a network of devices. And a network is just an interconnection of things. And those devices, we'll just simplify and say they're computers-- though very much in vogue these days is IoT or internet of things. Which means like every stupid little thing in the world is going to be on the internet for some reason. So that's trendy right now. But for today we'll just focus on devices or computers. And what does it mean to network them? Well, back in the day it meant physically connecting them with cables, so that you could actually have a physical connection between devices.
Nowadays, we, of course, have these antennas on the wall called access points, or Wi-Fi routers, or any number of other names for them. But that allows us to all transmit data wirelessly. So that might give us locally what we call a local area network or LAN. People don't say that as much anymore but that's what a local area network is. You also have WLAN, which is a wireless local area network, which probably even fewer people say these days. But these are indeed local networks. So what is the internet? Well, the internet is really just a network of networks. So we might think of this campus, of course, as being Harvard.edu. Down the road is MIT.edu. Across the country is stanford.edu. Not to mention all of the many companies and universities and homes that are out there. And so all of those, if they're interconnected, indeed give us the internet or inter-network. All right. So how does it work? So I've just walked into this room a little bit ago. I open my laptop screen. And somehow I'm magically on the internet. I'm connected to the internet. But a few steps had to precede that. Some of you engaged in those steps while you were here on campus this weekend, or first thing this morning by following some of the instructions. What was one of the first things you had to do according to the instructions in this booklet today? You had to connect and log on. So most of you probably have the intuitive-- the instincts these days, when you want to connect on Wi-Fi, you go to your Wi-Fi menu in the top right or bottom right corner of your screen, whatever operating system you're using. And you choose either the most familiar or the most unlocked network that you possibly can and try to get online. Sometimes you're asked for a password. Sometimes you aren't. And that may or may not have some actual implications that we'll perhaps scratch the surface of today. Or talk about in more detail tomorrow morning with security. But what then happened thereafter? 
Well, it turns out that every computer on the internet has a unique address. A unique IP address. I'll just toss up jargon here as we discuss some of these things. IP stands for internet protocol. And it's just a convention for assigning numeric addresses to most every computer on the internet. So we right now are at One Brattle Square in Cambridge, Massachusetts, 02138. And that pretty much uniquely identifies us. Especially if we tack on the room number or the floor number and so forth. And so that's how the mail-- the post office actually gets mail uniquely to this particular destination. So my laptop too isn't addressed, of course, by nature of postal addresses, but by way of numeric addresses. Just because computers kind of prefer numeric addresses. Turns out, fun fact, that these are 32-bit addresses mostly. And that means we can have how many total computers on the internet? 2 to the 32. Nice way to punt there. Four billion roughly. So 2 to the 32 or four billion. Which actually these days isn't all that many, if you consider all of the humans in the world, and all of the laptops and desktops and phones. And again, because of internet of things, every toaster and thermostat. And any number of other devices. So the world has been transitioning away from something called IPv4, which is version 4, the version we've been using for decades right now, to IPv6. Which has been around for a while, but no one really got around to implementing it until really recent years. At least on any kind of scale, with big companies like Google and Comcast and the like finally starting to give people not just 32-bit addresses, but-- anyone want to take a guess what comes after 64-- uh, dammit. What comes after 32-bit addresses? No. 128-bit addresses, actually. So I don't even know why I said that. 128-bit addresses. Which is kind of unprecedented.
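The arithmetic behind "four billion roughly" can be checked directly. The following is only a small sketch of that calculation; the constant and variable names are made up for illustration and are not from the lecture.

```python
# Number of distinct addresses for a given address width in bits.
IPV4_BITS = 32    # IPv4 addresses are 32 bits wide
IPV6_BITS = 128   # IPv6 addresses are 128 bits wide

ipv4_addresses = 2 ** IPV4_BITS
ipv6_addresses = 2 ** IPV6_BITS

print(f"IPv4: {ipv4_addresses:,} possible addresses")
print(f"IPv6: {ipv6_addresses:,} possible addresses")
# IPv6 is not 4x larger than IPv4; it is 2**96 times larger.
print(f"Ratio: 2**{IPV6_BITS - IPV4_BITS}")
```

Going from 32 to 128 bits quadruples the width of each address, but the number of addresses grows exponentially with that width.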
Rarely do humans actually have the foresight to not just go one notch above but two notches above where we currently are. And so 128-bit addresses means that we really can have an internet of things. If I pull up my massive calculator here. 2 to the 32, of course, gives us four billion computers. 2 to the 128, also unpronounceable by me. That's a lot of devices. That's a lot of things on the internet. So we've got a lot of IP addresses available, at least in the works right now. So what does it mean to have an IP address? Well, once my computer has an IP address-- and how it gets that we'll come back to-- I can now communicate on the network. And indeed the way devices on the network work is essentially you can think of computers as sending envelopes of information. So here's like an old school envelope here. And on this in the human world we might typically put a to address and a from address. And those would be physical postal addresses. But in the digital world, of course, it's going to be a numeric address, in both the to field and the from field. Which is to say that both I and the recipient of any information I send on the internet have got to have such a numeric address. So Amazon has such an address, Google has such an address. And actually those bigger fish in the world tend to often have multiple IP addresses. And we can see this. So first on Mac OS, let me go to System Preferences and Network. And Windows has an analog as well. I'm going to click Advanced, and I'm going to click on TCP/IP. And indeed, you can see that my computer has currently the IP address 10.254.16.128 and a whole bunch of other things. So that is the unique address for my computer right now. And I should really say "unique" address, in quotes. Because it turns out that even though much of the world still uses IPv4 or 32-bit addresses, the world also has started using a technology for some time now called NAT. Does anyone use this technology at home? David? I do.
This is one of those trick questions. Like who has internet at home? OK so with very-- and how many people have Wi-Fi at home? Almost everyone? OK. So with very, very high probability, all of us in this room use NAT. NAT is network address translation. And what this means is that whether your ISP-- internet service provider-- is Comcast or RCN or Verizon or any number of other companies, probably when they installed it in your house they gave you some kind of little device. A little router as it might be called. Although they have different names. Sometimes they're wireless, sometimes they're wired. But connected to that is probably a telephone line or a coaxial cable, like your cable box and so forth. And that is what gives your house internet access. And that device comes with typically one IP address. And that IP address is associated with that machine. The problem, of course, is that most of us have a phone and a laptop or desktop or roommates or parents or siblings or kids, all of whom themselves have devices. And so it would be unfortunate if you only had one unique address for all of you and only one of you could therefore be on the internet at a time. NAT came about-- network address translation-- because it's a feature of modern hardware that allows you to have one IP address for that one device, the router you bought or leased from Comcast or the like. But behind that device, on your home network-- a.k.a. LAN-- you can have dozens of devices, if not hundreds of devices, all having not that public IP address, but private IP addresses. And so network address translation essentially works like this. If this here is your home, and you have a little box from Verizon or Comcast. And this here is the internet. This device here is your home router. And it has a public IP on this side. And inside the house are let's say many private IP addresses. One for each of the devices in your home.
And so any time you send a request from your laptop for CNN.com or Google.com or Facebook.com or the like, that request initially, by nature of the wires or Wi-Fi in your house, first goes through that device. Because it's the only point coming in or going out to your internet service provider. It looks at your from address, which is going to be a private address. Which just means it's not meant to be public on the internet. It quickly crosses that out. This device puts its own public IP address that you get from Comcast or Verizon where the from field used to be on that envelope. Sends it out on the internet. Google or Facebook or whoever responds. That response comes to that little device in your home. Your little device in your home checks its records. Saying hm, who requested this piece of information a split second ago? It crosses out the to field, which was sent to the public IP. Puts in the private address of the laptop or desktop or phone in your home that originally requested it. And all of this happens nearly instantaneously. So we don't even notice the difference. But as a result of this network address translation from public IP to private IP addresses, plural, we have the ability to put multiple devices on the network at once. It wasn't all that long ago, 20 or so years ago, when if you wanted to have two computers on the network at home, someone like Comcast or RCN would just charge you twice or three times to actually give you dedicated devices or IP addresses. So this is actually a wonderful feature. And it has the side effect also of firewalling your computer. And this is why these terms get so commingled these days. A firewall in the real world is like a physical wall between stores, especially in strip malls and the like. So that if one shop gets on fire, the others next to it with high probability are safe. Because the fire can't get through the firewall.
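The crossing out of the from and to fields described above might be sketched roughly as follows. This is a toy model, not how any particular router is implemented: the class names, the starting port 40000, and the addresses (taken from ranges reserved for documentation) are all assumptions for illustration. It also assumes the router keeps its "records" keyed by port number, which the lecture does not go into.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str        # the "from" address on the envelope
    dst: str        # the "to" address on the envelope
    src_port: int
    dst_port: int


class NatRouter:
    """Toy home router: one public IP in front of many private IPs."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000   # hypothetical first public port to hand out
        self.records = {}        # public port -> (private ip, private port)

    def outbound(self, pkt):
        # Remember who asked, then replace the private "from" with the public IP.
        public_port = self.next_port
        self.next_port += 1
        self.records[public_port] = (pkt.src, pkt.src_port)
        return Packet(src=self.public_ip, dst=pkt.dst,
                      src_port=public_port, dst_port=pkt.dst_port)

    def inbound(self, pkt):
        # Check the records: who requested this a split second ago?
        entry = self.records.get(pkt.dst_port)
        if entry is None:
            return None          # nobody inside asked for this, so drop it
        private_ip, private_port = entry
        return Packet(src=pkt.src, dst=private_ip,
                      src_port=pkt.src_port, dst_port=private_port)


router = NatRouter(public_ip="203.0.113.7")
request = Packet(src="192.168.1.20", dst="198.51.100.10", src_port=51000, dst_port=80)
sent = router.outbound(request)
reply = Packet(src="198.51.100.10", dst=sent.src, src_port=80, dst_port=sent.src_port)
delivered = router.inbound(reply)
unsolicited = router.inbound(
    Packet(src="198.51.100.99", dst="203.0.113.7", src_port=80, dst_port=12345))
```

In this sketch `delivered` is addressed back to the laptop at 192.168.1.20, while `unsolicited` is `None`, because the router has no record of anyone inside having asked for it.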
In the virtual world, you have firewalls that are software, that are designed to keep data from the outside coming in or perhaps from the inside going out. And so the fact that there is this translation from public to private also means as a side effect, typically, that if you have some adversary, some bad guy on the internet trying to get at your laptop or desktop in your home, he or she can't actually do that. Because the only way you can talk to a private IP address is if the request were initiated from the inside out. Now, you can firewall your home and your business even more securely than that. And we'll come back to that as we discuss security itself. And there are ways around the scenario I just described. But that tends to be a side effect. It's a good thing, actually, there's this network address translation. Because it means people can't very easily get in from the outside. The downside of that is that if you're trying to run your own web server-- for instance, you're trying to start a company or you're building a website, and you want friends or associates or just customers to be able to access your website and it's inside your home, odds are they're not going to be able to. Because you can't talk to a private IP address from the outside world in unless you configure your home network or your office network especially for that. So there's some interesting layers of protection that are in here. Also on the screen here are a couple of other addresses. So router. Let's come here. I just drew the internet as this big cloud. Sort of which has become cloud computing these days. But what is the internet itself? Or what really gets data from point A to point B? If this is me at home, and this is Amazon.com, how do I actually get data to and from Amazon.com? Well, I certainly don't have a Wi-Fi connection to Amazon in Seattle or wherever they are. I certainly don't have a dedicated wire from my laptop to Amazon. 
So once the data leaves my laptop, where does it go if I'm trying to shop on Amazon.com? To a cable? OK. So some cable somewhere. Indeed. And my laptop probably is talking wirelessly to some of these antennas in the room. Those antennas-- even though you don't see them, because they've been mounted pretty cleanly on the wall-- do have an ethernet cable that is quite like this thing here. It's like a thicker phone cable, if you remember what phone cables are like. This is an RJ45 connector. Phone cables are RJ11, which just means their size and specs. But that thing in the wall is somehow connected probably to a device called a switch. So that thing is called an access point or AP. Switches are fairly dumb devices that just have a lot of phone jack like connectors, a lot of ethernet connectors. So that once you plug in all the ethernet cables, the devices can all intercommunicate with a minimal amount of security. And from there, the switch is connected to probably one of Harvard's routers. What is a router? Guess even if you've never heard the term. A router routes information. Yeah. It really is that. So you can think of the internet as being speckled with a whole bunch of routers represented by those dots there. And there are edges between these dots or these nodes. And what's interesting about the internet is that data doesn't necessarily travel one predictable path or even one same path on subsequent requests. It's sort of this mesh network, whereby each of those nodes is a router, each of those dots is a router. Each of the edges is just a point of connectivity, wired typically or wireless. And so data can travel from point A to point B by getting relayed by all of these different routers. And so when a router receives a packet like this one here, it looks at the to address and sees, oh, this is destined for 1.2.3.4. I know where that server is. It's that way. And so the router routes the packet this way.
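One common way a router decides which way is "that way" is to keep a table of address ranges and pick the most specific range that contains the destination. The lecture does not say how the decision is made, so the following is only an illustrative sketch of that longest-prefix idea; the table entries and neighbor names are invented.

```python
import ipaddress

# Hypothetical forwarding table: address range -> which neighbor to hand the packet to.
FORWARDING_TABLE = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream ISP",   # default: everything else
    ipaddress.ip_network("1.2.0.0/16"): "router B",
    ipaddress.ip_network("1.2.3.0/24"): "router C",
}


def next_hop(destination):
    """Return the neighbor for the most specific range containing the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in FORWARDING_TABLE if addr in net]
    # The default route matches every address, so matches is never empty.
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]


print(next_hop("1.2.3.4"))   # most specific match is 1.2.3.0/24
print(next_hop("1.2.9.9"))   # only 1.2.0.0/16 and the default match
print(next_hop("8.8.8.8"))   # only the default matches
```

Each router along the path would repeat this kind of lookup with its own table, which is why no single router needs to know the whole route.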
The next router in turn might look at this and be like, oh, I know where this machine is. It's that way. And so it might hand this off to the next router. And by nature of how the internet is configured, typically it requires a max of 30 or so hops, hand-offs, to get from one point A to one point B around the entire world. And typically it's far fewer than that. And it is the routers that decide how to get the data with high probability closer and closer and closer. Not necessarily geographically. Sometimes it's faster to take a different direction. Sometimes it's less expensive to take a different direction. But eventually the data actually makes its way from point A to point B. Indeed, you can see here the IP address of a router. Which of the routers in this story is this IP address? 10.254.16.1. And notice the similarity. My IP address is almost the same, but ends with 128. Router IP addresses often, by human convention, end with the number .1. So you know that they're on the same network. Where is this router? Whose router is this? Or where in the story is it, would you think? Harvard, yeah. It's probably quite proximal to this building. Maybe it's in the basement. Maybe it's around the corner. It's one of Harvard's routers presumably. And in turn, that has some kind of connectivity to these other routers as well. Typically these edges, these left right decisions are dynamically configured. So that if you unplug a router-- which wouldn't typically happen but could happen-- or if one router gets congested, special protocols, software that these routers are running will dynamically figure out which is the better or newer or correct way to send the data. And so it's all very adaptive without humans necessarily having to get involved. Yeah? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Totally depends. And it's different people along the way. So Harvard, of course, owns its one or more routers. I, in my home, might own my tiny little router.
Which really can just get data out of my home. In the middle are lots of internet service providers, big and small. Level 3 is a very big one. Verizon and Comcast have their own networks as well. There are yet other bigger fish. Google has its own fiber network and so forth-- AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Money. They're peering points, so to speak. So these larger internet service providers typically have financial agreements between them that govern how much they will pay to send their data this way or that way in order to get data from one place to another. They themselves are internally incentivized to actually relay the data to someone if they actually want to have customers whose data can go from point A to point B. And so sometimes the decisions to go either to the left or to the right, so to speak, might be made not so much on technological decisions but just on business decisions. With whom do we have a peering connection? And so even if you as a business owner, for instance, want to bring your-- run your own servers, which isn't as common an instinct these days with cloud computing all the rage-- more on that later-- if you might have your own servers in a data center in some warehouse out in Western Massachusetts or New Jersey or those kinds of places, you would typically decide for yourself who do you want to pay to physically run a cable into your servers, into your part of the data center, to establish exactly one of those connections. And that too would be a financial decision and a reputation decision. And not so much a technology one. Yeah, David? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Ah, a good question. Why wouldn't you just have one router and lots of switches for the whole campus? Part of it is distributed management. So Harvard, for instance, is a big place.
So let me oversimplify and say each of the schools to some extent might run its own local network so that they can have their own policies, their own infrastructure, and so forth. But they want to interconnect to the rest of campus. These days, Harvard has been transitioning to having more of a monolithic infrastructure. But there are still side effects of this. For instance, in a couple of the offices that I spend time in, we can't actually-- we can have the offices talk to one another certainly. But we can't create the illusion of what's called a VLAN or virtual local area network, whereby two separate buildings appear to be the same network. Simply because of legacy and actual hardware limitations. There's also performance. For instance internal to campus, there's only so much traffic. But there's certainly a bottleneck when you're leaving campus. So you might want to have a separate router, a more souped-up router, that can actually handle that outbound traffic. Whereas you have smaller less expensive routers internally. And so it boils down to those kinds of economic and logistical decisions. Good question. There are also security implications too. A switch typically operates technologically at a certain level that doesn't allow you the same amount of control over what comes in and out of your network. Whereas a router is more of a deliberate bottleneck that you have more control over. But the line is blurred to some extent these days between routers and switches and their features. As an aside. This is a more arcane detail. But does anyone-- has anyone probably seen subnet mask before? Does someone know what a subnet mask is? We don't have to get too far into the weeds here. But that is simply a number that allows the local computer-- my Mac in this case-- to decide when it is sending data from point A to some other point B, if that other computer is on the same local network or if it's elsewhere on the internet.
And so essentially this subnet mask, 255.255.240.0, represents a pattern of ones and zeros. It uses that pattern of ones and zeros to determine, hm, I am trying to request or send information to this other server. If that pattern of ones and zeros tells my Mac that, ooh, that other computer is on the local network, what's nice is my computer will use a different approach, a different protocol, ethernet specifically, to actually get the data from point A to point B. It will never go out on the public internet. By contrast, if that number reveals, oh, this is actually a computer that's far away, that's how the computer decides to send it not to the local network, LAN, but to the next router instead, to that IP. So it all boils down to what do you write on the envelope? The local address or the router's address instead. So let's see if we can't see this. Let me try to pull up. Give me just one moment to see if I can get into a server here that will let us do this. Nope, that won't do it. That doesn't work. Give me one moment. Come on. All right, let's do this. All right. Let's see if this works. And then I'll explain what we're doing here. Whoops. Transfer out. All right. Let's see if this gives me what I want. Turn this around. OK, this works. OK. So let's go ahead and try this as follows. -q 1. OK, perfect. MIT is seven hops away. What did I just do? So this is a command line program, a text-based program on my Mac-- though an equivalent exists for PCs as well-- and I ran traceroute with -q 1, query one. So just send one request at a time, to www.MIT.edu. Because I am technically interested in how data gets from point A here at my laptop to point B at MIT down the road.
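The subnet mask decision described a moment ago, local delivery versus handing off to the router, can be sketched with Python's standard `ipaddress` module. The laptop address, mask, and router address are the ones shown on screen in the lecture; the function name `first_stop` and the test destinations are made up for illustration.

```python
import ipaddress

# Values shown on screen in the lecture.
my_interface = ipaddress.ip_interface("10.254.16.128/255.255.240.0")
router = ipaddress.ip_address("10.254.16.1")

# The mask as a pattern of ones and zeros: 20 ones followed by 12 zeros.
print(format(int(my_interface.netmask), "032b"))
print(my_interface.network)   # the local network implied by address + mask


def first_stop(destination):
    """Where the packet goes first: straight to a local machine, or to the router."""
    dest = ipaddress.ip_address(destination)
    if dest in my_interface.network:
        return dest      # same local network: deliver directly
    return router        # elsewhere: hand it to the router


print(first_stop("10.254.20.5"))    # inside the local network
print(first_stop("172.217.0.46"))   # outside it, so the router's address
```

The leading ones in the mask mark which bits of two addresses must match for them to count as being on the same network.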
And it turns out that as we saw a moment ago, the first hop that my data takes in order to talk to MIT's web server is to that address. Which is the IP of what again? Yeah, the router that my Mac was preconfigured to use. I don't recognize hop two. It's just some other IP address. But I do know it's also private. It turns out that any IP address that starts with the number 10 is a private IP address. So you know it's being administered locally by Harvard or by a family member or someone in your company. And if you're curious, this is a non-exhaustive list. But anything starting with 10 dot something, anything ending in-- starting with 172.16.something, anything starting with 192.168.something. So indeed if you go home tonight or this coming week and try to find your own Mac or PC's IP address at home, or even at work, perhaps, odds are it starts with one of these values. These are private IPs that are, by human decision years ago, never to appear on the public internet. They're meant to be used in homes and businesses and campuses and the like. So I know that one and two are somewhere here on campus because they have private IP addresses. But then steps 3 and 4 get interesting because they have what are called host names. Semi-human friendly names that look like domain names, and indeed they are. And I know it's Harvard for sure. But I don't really know where this is. I do see its IP address. This is a public IP address at this point. And I only know from convention, coregw means core gateway. Gateway as a synonym for router. So this is like a core router at Harvard. So it's a really important router at Harvard is as much as I can glean. And what's interesting is that it took three milliseconds for my data to reach that router. It took 1.75 for it to reach the second router. And two milliseconds to reach the first router. And what strikes you about these three values? Especially since I read them in reverse order.
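The three private ranges just listed can be checked mechanically. This is a minimal sketch using the ranges reserved by RFC 1918; note that the 172 range actually spans 172.16 through 172.31, slightly more than "172.16.something". The function name is made up for illustration.

```python
import ipaddress

# The three private IPv4 ranges reserved by RFC 1918.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),       # 10.anything
    ipaddress.ip_network("172.16.0.0/12"),    # 172.16.x.x through 172.31.x.x
    ipaddress.ip_network("192.168.0.0/16"),   # 192.168.anything
]


def is_private_ipv4(address):
    """True if the address falls in one of the three RFC 1918 ranges."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in PRIVATE_RANGES)


for candidate in ["10.254.16.128", "10.254.16.1", "192.168.1.1", "172.217.0.46"]:
    print(candidate, is_private_ipv4(candidate))
```

The last candidate is a reminder that starting with 172 is not enough: only 172.16 through 172.31 are private.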
AUDIENCE: [INAUDIBLE] DAVID J. MALAN: What's that? They're not all the same. So there's a lot of variability or non-determinism when it comes to sending data on the internet. The routers might be slightly more busy at some times or other. And by busy I mean maybe more people are sending data. And so it takes a moment for the router to figure out all of the different decisions it has to make. And that slows down my data or someone else's data. These are all pretty close. They're all essentially two, three milliseconds. So it's still pretty fast. But that variability is to be expected. And these are not cumulative. What this program is doing is it's kind of putting its toe in the water a little deeper and deeper each time. How fast can I get to the first hop? Then to the second hop? Then to the third hop? So it's progressively going. They're not additive times. Step four is another router that I just know from convention is border gateway. So this is probably another router that Harvard owns that's on the border. So figuratively speaking, the edge, or maybe literally speaking, the edge of campus. Then it looks like Harvard's internet service provider is Qwest, which is a really big ISP as well. Like Level 3 and others, to your question earlier. So Qwest would be one of our peers to whom we connect. It looks like they have at least two routers that are curiously named the same with the same address. I do not know why that is happening. That seems to be a quirk or a bug of some sort. And then curiously, it looks like MIT's website has been outsourced to a company that you might know of called Akamai. They, among other things, are a CDN, a content delivery network. Which just means they run servers for people to store their static files on. So that MIT doesn't have to run its own web servers, its own physical machines. They just pay Akamai some number of dollars per month or per year to store it for them. I have no idea what this number means.
It's probably just a unique random identifier. And this is apparently where people deploy stuff to. So that's all we know. But it took 7 milliseconds to get to MIT, as opposed to a good 15-20 minutes by car or by [? T ?] or by bike from here. All right, so let's try another. Traceroute to Stanford.edu and let's see what changes. Same initial path, but now step 5 is a little different. The stars generally mean that the router, for whatever reason of misconfiguration or deliberate security, is just not responding to the queries I'm sending. So it's sort of anonymous. Unfortunately, there aren't many named routers, it seems, between us and Stanford.edu. And there's a lot of anonymity between us and them. So it started off somewhat interesting, and now is devolving into not very interesting. Let's try someone else. UCBerkeley.ed-- AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Oh, we'll get there. Berkeley.edu. This one's juicier. OK. So what do we see here? So here's-- this was new. Nox, this is the Northern Crossroads. This is a really big peering point where lots of different internet service providers interconnect in the Northeast. Northern Crossroads-- it looks like Internet2. Internet2 is a network used primarily by universities to be super fast, to allow them to share data and information in silly little tests like this more effectively. It looks like, curiously, that steps eight and nine are kind of showing their hand as to where they are. Notice what changes between steps seven and eight. What's striking about these two routers? What's different? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: I'm sorry? The time really jumps, right, from seven milliseconds-- previous to that was four and two, two, two milliseconds. Now it's 50 milliseconds. Where might step eight be? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: What's that? It's not a private versus public.
In fact, pretty much everything right now is public, by nature of it being routers. Right now everything you're seeing is inside this internet. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Distance, yeah. And it turns out that you can sometimes infer from the host name, just because of human conventions for naming things that step eight, that router, is probably physically in Houston, I'm guessing? Texas. So it's a good distance away from wherever step seven actually was. I'm not sure where it is but probably the Northeast. Meanwhile, step 10 is even farther from step nine, I'm guessing, because of another jump in time. Where might step 10 be? LAX? For whatever reason, system administrators have historically liked naming routers after airports. So this is probably in LA. This one here, svl. Still LAX. I don't know what that is, so I'll go with the LAX. And then finally, we've reached Berkeley. And then for some reason, it's just not responding further. So it takes about 90 milliseconds to send data to Berkeley. Or about 5 or 6 hours to fly to Berkeley to put things into sort of human terms. And indeed, I think-- someone suggested something abroad. If we do like CNN.co.jp, the Japanese version of CNN's website, let's see if it cooperates here. So again, we seem to have taken the same direction. A little anonymity there. And then some really interesting stuff going on in step seven, eight, nine, and 10. Massive jump here. What might explain the striking increase in time between hops nine and 10? Pacific what? Yeah, so the Pacific Ocean. So indeed, there's huge trans-Atlantic cables. And trans-Pacific cables. And just generally, oceanic cables that carry a huge amount of internet traffic that really big ships just slowly drag and leave at the bottom of the ocean. And indeed, it might take 100 additional milliseconds to go from maybe California, can't really tell from the names here alone, to the coast of Japan so many miles away. 
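Spotting where the time "really jumps" between hops is easy to automate. The sketch below is only an illustration: the list of round-trip times is invented, loosely echoing the figures mentioned for the Berkeley trace, and is not the actual traceroute output from the lecture.

```python
# Illustrative round-trip times in milliseconds, one per hop (hop 1 first).
# These values are made up; they are not the real output shown in the lecture.
rtts_ms = [2.0, 1.75, 3.0, 2.0, 2.0, 4.0, 7.0, 50.0, 52.0, 90.0]


def largest_jump(rtts):
    """Return (hop_number, increase_ms) for the biggest rise over the previous hop."""
    jumps = [
        (hop, later - earlier)
        for hop, (earlier, later) in enumerate(zip(rtts, rtts[1:]), start=2)
    ]
    return max(jumps, key=lambda jump: jump[1])


hop, increase = largest_jump(rtts_ms)
print(f"Biggest jump arrives at hop {hop}: +{increase} ms")
```

Because the times are not cumulative and vary from probe to probe, a small negative difference between neighboring hops is normal; only a large, sustained increase hints at distance or congestion.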
So you can get a sense there of distance as well. And meanwhile, let's see if we do-- one last one, let's do like Yale.edu which is still on this coast, and see what we get. Here too we're going to get similar results. So nine or so milliseconds alone. So this puts into more real terms what's actually going on. And a system administrator, if he or she is trying to diagnose some issue with a network, this is actually a real tool that real people might actually use to figure out where is the data flowing. Is one of the millisecond counts way bigger than others, is it anomalously large? That might mean that one router is malfunctioning or it's just completely congested. And so this is just one such diagnostic tool. But another one that's useful to play with is this. I proposed earlier that people like Google.com do not have just one IP address publicly identifying them. And indeed, if we hit Google.com, in this case-- let's see if I'm about to prove my point or not. Nope. OK. Google has just one IP address. Which is 172.217.0.46, which is a little misleading. But let's see what happens. http:// and that IP address. And indeed, it brings me to Google's website. So if Google has this one IP address-- which does not demonstrate the point I was hoping to make-- why do they not just advertise their URL as http://172.217.0.46? Right, it's not meaningful. I can't even read it, let alone memorize it. But this is kind of an interesting upgrade from yesteryear. Right? Back in the day, some of us might remember 1-800-COLLECT. Which to this day, I don't know what number that is. But I know the mnemonic allowed me to remember how to dial it, so long as there's a mapping on the phone pad to letters in this case. And this is what DNS is for the internet, essentially. But it's automatic. DNS is domain name system. Which is to say that there are special servers in the world whose purpose in life is to translate numeric addresses to fully qualified domain names.
Host names like Google.com and vice versa. So we humans only have to remember, and even our computers initially only have to write the domain name that the human provided. And some other server, a DNS server, will actually do the automatic conversion. And actually write at the end of the day the numeric address on the so-called virtual envelope. But where-- David? AUDIENCE: [INAUDIBLE] 00:33:36,564 --> 00:33:37,980 DAVID J. MALAN: You would hope so. Not in this case, though. So DNS also is a distributed system. And it's a hierarchical system. Which means there's lots of caching that happens. So it would be-- in the extreme scenario, suppose there is just one DNS server in the world that knows about all IP addresses and all names. What's the downside of that design intuitively? If it goes down, gets jammed, can't handle all of the traffic. So that just feels like bad design, whether or not you're an engineer. So the DNS system has multiple servers. But it doesn't just have duplicative server, it has a hierarchical system. So that there are some servers, typically by convention, at least 13 root servers, whose purpose in life isn't to know all of the answers, but to know who has the answers. And the root servers might know who would know the answers for all of the .com's, for all of the .gov's, for all of the .jp's, and so forth. Meanwhile, there is-- if you think of a family tree. If the root servers are up here, you have a second tier of servers whose purpose in life is to know those actual answers. Or if they don't, to know whom to ask in turn. And that goes all the way down to my laptop. For efficiency purposes, when my Mac first requests Google.com, it obviously does not know the IP address. Because Apple did not ship this in every Mac with the IP address of Google.com, especially since it might change day to day or week to week or a year to year. But Harvard does have a DNS server. 
But Harvard doesn't know the answers to all IP addresses and host names in the world. But Harvard has its own internet service provider that it could ask. And maybe that internet service provider has a bigger internet service provider. And if that person doesn't know, then at least the root servers can help us figure out who would actually know. But along the way, once my Mac gets that answer back the first time, by convention, Mac OS and Windows are going to remember or cache the answer locally. And why would they do that? So you don't have to ask it again. Which is good for efficiency. And frankly, even browsers do this too. Like Chrome and Internet Explorer might actually remember that information locally along with other information as well. So there's lots of layers of caching for efficiency. Of course, a side effect of this, if you put back on the engineering hat, what could go wrong if you're caching information, especially at multiple layers? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Yeah, it makes it really genuinely hard for Google to change its IP. Because if they change it, well, when do they change it? Well, if they change it right now, my laptop might remember the old IP address for multiple minutes, hours, days. It depends on how it was configured or misconfigured. So what does Google then do? Well, maybe they could do it late at night. Unfortunately, Google is a global company and there is every possible time zone. So just doing it at night really has no meaning. So that's not a solution. They could run two servers in parallel. Or at least two servers in parallel. One with the old IP, one with the new IP. That gets us by. But then we're getting new data on both servers which probably isn't good if we're trying to move to some new server with an IP. So this is actually a massive, massive challenge these days. And one of the ways that companies like Google avoid this is one, they only have one IP address. 
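The hierarchy and caching just described, and the stale-answer problem that comes with caching, can be sketched as a toy simulation. Everything here is hypothetical: the server names, the `CachingResolver` class, and the addresses (taken from the range reserved for documentation) stand in for real DNS machinery, which is considerably more involved.

```python
import time

# Hypothetical, hard-coded tables standing in for real DNS servers.
ROOT = {"com": "tld-com", "edu": "tld-edu"}  # root: who knows each TLD
TLD = {  # second tier: who knows each domain
    "tld-com": {"example.com": "ns-example"},
    "tld-edu": {"example.edu": "ns-example-edu"},
}
AUTHORITATIVE = {  # the servers that hold the actual answers
    "ns-example": {"example.com": "192.0.2.10"},
    "ns-example-edu": {"example.edu": "192.0.2.20"},
}


class CachingResolver:
    """Toy resolver that walks the hierarchy once, then remembers the answer."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}   # name -> (ip, expires_at)
        self.lookups = 0  # how many times the hierarchy was actually walked

    def resolve(self, name):
        entry = self.cache.get(name)
        if entry and entry[1] > self.clock():
            return entry[0]  # cached and not yet expired (possibly stale)
        self.lookups += 1
        tld_server = ROOT[name.rsplit(".", 1)[-1]]  # ask the root who knows .com
        name_server = TLD[tld_server][name]         # ask that tier who knows the name
        ip = AUTHORITATIVE[name_server][name]       # ask the server with the answer
        self.cache[name] = (ip, self.clock() + self.ttl)
        return ip


if __name__ == "__main__":
    resolver = CachingResolver(ttl_seconds=60)
    print(resolver.resolve("example.com"))  # walks the hierarchy
    AUTHORITATIVE["ns-example"]["example.com"] = "192.0.2.99"  # the owner changes IP
    print(resolver.resolve("example.com"))  # cache still returns the old address
```

The second lookup returns the old address until the cached entry expires, which is the difficulty with changing an IP address.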
And they use it as sort of the entry point to the entire infrastructure. So that hopefully, Google never has to change this IP address, certainly not frequently. But they can change any number of servers that are behind that IP address so to speak. So indeed, it is the case-- and we'll talk more about this after lunch-- that there's a technology called load balancing, whereby even though my white lie earlier was that every computer on the internet has an IP address, that doesn't necessarily mean that's the IP address to whom we speak when we send data. There might be many other computers behind a device that has that IP address for purposes of balancing load. But we'll come back to that again when talking about cloud computing. But not all companies do it that way. If we look at Yahoo.com, Yahoo! it would seem has three IP addresses. This is a more obvious design. They have at least three servers. And frankly, they probably have thousands of servers. But these are the three ones they expose publicly in terms of IPs. And I just ran the same command again. What did you notice is different? The order changed. And again, the order is, again, different. So it seems-- I'm just gleaning this from running the command again and again. It's the same three IPs but they're changing order. It would seem that Yahoo uses round robin load balancing-- more on this in a bit-- whereby they give all users the same three IP addresses, but they change the order so that my computer by default uses just the first. And this way, they put a third of their users here, a third of their users here, a third of the users here. Just probabilistically, based on returning with 33% odds a different one at the top of the list each time. All right. So what actually happens? Well, let's do this. Suppose now, to make this more clear as to what's going on here, the internet as you may have heard is filled with pictures of cats. 
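The reordering seen when running the lookup repeatedly can be sketched as follows. This is a minimal sketch that assumes a strict rotation, whereas a real DNS server might shuffle randomly instead. The `RoundRobinDNS` class and the three addresses are hypothetical.

```python
from collections import Counter, deque


class RoundRobinDNS:
    """Toy name server that hands out the same addresses in rotating order."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        result = list(self._addresses)
        self._addresses.rotate(-1)  # next caller sees a different first entry
        return result


if __name__ == "__main__":
    server = RoundRobinDNS(["198.51.100.1", "198.51.100.2", "198.51.100.3"])
    # Each client is assumed to connect to the first address it is given.
    chosen = Counter(server.answer()[0] for _ in range(300))
    for address, count in sorted(chosen.items()):
        print(address, count)
```

With a strict rotation each address ends up first exactly a third of the time. A random shuffle would give roughly a third on average.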
So suppose that one of you in back, let's say Sean, has requested a picture of a cat. And I am the server, imgur.com or Flickr or whatever. And I am going to send in this picture of a cat. Unfortunately, it's a pretty big picture. It's a couple megabytes. And that's not great for everyone else in the room who might want to be sending data to each other or to Sean or from Sean at the same time. And so it turns out that what IP does-- which doesn't just describe IP addresses. IP stands for itself, internet protocol. And it works in conjunction with another protocol called TCP, transmission control protocol. These are essentially conventions that govern how the data gets from point A to point B. And a way to think of a protocol is if I can come over here-- what's a human protocol? Hello, my name is David. [? Shavan ?]? So this is a protocol, right? Like I say hello, I extend my hand. [? Shavan ?] knew sort of intuitively to extend his hand, if awkwardly, to stick my hand in the middle of class. And then our transaction was complete. Similarly do IP and TCP govern just how computers speak. They follow an initial kind of hello, a subsequent kind of goodbye. And it's just sort of preprogrammed conventions that they adhere to. And one of the features, meanwhile, of these protocols, IP in particular, is the ability to fragment things. Because now, even though in the real world I've kind of ruined this picture, in the digital world I've just broken it up into four chunks. A quarter of the bits are here, a quarter of the bits are here, quarter here, quarter here. And now, I can put each of these chunks in its own virtual envelope. Which depending on the context you might call a packet typically, or datagram, or segment, which have minor differences semantically. But it just means some kind of virtual envelope that zeros and ones go in. And then I have to make sure, of course, to address these. 
So I'm going to go ahead and say that Sean's IP address will be the number 1. 1. 1. So I'm just writing To: 1, To: 1. So every envelope now has this on it, if Sean was the first one to get an IP address in the room. But I need a little more information in case he wants to acknowledge receipt of this cat-- which is what information? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: The from address. So I'll be the second computer in the world. So I'm going to put From: 2, From: 2, From: 2, and From: 2. And then lastly, you might have noticed this, I held this up. What had I also written on the envelope sort of preemptively. Yeah, the order, the number. So 1/4, 2/4, 3/4. Why did I do that? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: So he knows how to put them back together. One, two, three, four. And also, I had a denominator there for a reason. So that he knows what? When to stop. It's not just that it's an infinite string of pieces of cats like a puzzle where you don't know where the edge is. Now he knows where the edges are. And now much like we saw with some of the data, that the data took variable length paths, which might have meant taking different routes sometimes. And much like there's multiple interconnections, all of this data is going to leave my hands through the same router. [? Shavan ?] is my default router. But now if you could presumptuously pass them, but not necessarily the same direction, the effect here is to have a room full of routers, each of whom roughly knows where Sean is. But is making perhaps independent decisions. Routing around blockages. If some classmate's not paying attention, that might be the router is congested. And so you pass it to someone else instead. Two of the packets have gotten there quite quickly. One is kind of stuck all over here. And let me if I can steal this one as though something went wrong. And if, Sean, you'd like to reassemble the cat and draw some conclusion. 
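The envelope demonstration above (a To address, a From address, and "n of 4" written on each piece) can be sketched in code. This is a simplified illustration, not how IP actually lays out its headers. The `Envelope` type and its field names are hypothetical, and the addresses 1 and 2 are the ones used in the room.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    to: int         # recipient's address (Sean is 1 in the demo)
    sender: int     # sender's address (the server is 2 in the demo)
    seq: int        # which piece this is, counting from 1
    total: int      # how many pieces exist, so the receiver knows when to stop
    payload: bytes  # this piece of the picture


def fragment(data: bytes, to: int, sender: int, pieces: int = 4) -> list:
    """Split data into at most `pieces` chunks, one per addressed envelope."""
    size = max(1, -(-len(data) // pieces))  # ceiling division, at least 1 byte
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [Envelope(to, sender, n, len(chunks), chunk)
            for n, chunk in enumerate(chunks, start=1)]


def reassemble(envelopes: list):
    """Return (data, []) if every piece arrived, else (None, missing_numbers)."""
    if not envelopes:
        return None, []
    total = envelopes[0].total
    by_seq = {e.seq: e.payload for e in envelopes}
    missing = [n for n in range(1, total + 1) if n not in by_seq]
    if missing:
        return None, missing
    return b"".join(by_seq[n] for n in range(1, total + 1)), []


if __name__ == "__main__":
    cat = b"pretend these bytes are a picture of a cat"
    envelopes = fragment(cat, to=1, sender=2)
    random.shuffle(envelopes)  # pieces may take different routes and arrive out of order
    data, missing = reassemble(envelopes)
    print(data == cat, missing)
```

Because each envelope carries its own number and the total, arrival order does not matter and a gap is easy to spot.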
Unfortunately, there's a problem. I deliberately dropped or let a packet be dropped. And that happens. I mean, much like this happening physically, routers will by design drop packets. Especially if they were overloaded. Now I was sort of maliciously-- I took it away from David here. But just as soon, I still dropped it on the floor. So what's your conclusion, Sean? What are you missing? AUDIENCE: I'm missing the bottom right half. [INAUDIBLE]. DAVID J. MALAN: OK. So it's corrupted or lost. So it was probably, if I numbered them right, four out of four or something like that. And so TCP is this protocol that works in conjunction with IP. Whereas IP is responsible for just fragmenting things ultimately for size and also addressing things, TCP, among its features, is to guarantee delivery. And it does it by way of something called sequence numbers, like my one, two, three, four, and so forth. It doesn't quite do it in the same way. But it is a numbering scheme from which Sean and people like him can infer what packet is missing. He is going to now send a response back that he was missing four. And TCP is a protocol that allows me to, uh-oh, let me go ahead and resend him-- not everything, because that could just lead to the same cyclical problem-- but let me just send him the missing piece. And hopefully, with high probability, it will get through this time. Now, even though I've lost the piece of the cat, we're talking zeros and ones. So at the end of the day, it's just duplicating data. So there's infinite supply of copies I can make. And so hopefully this will now get to Sean as well. But this seems a given. Like why is that a feature to guarantee delivery on the internet? Well, it turns out there is another protocol that's sometimes used called UDP, universal datagram protocol. Did I remember my acronym right? User datagram protocol, which does not guarantee delivery. 
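The difference between resending only the missing piece and not resending at all can be sketched as a small simulation. As noted above, real TCP does not number things quite this way, so this is only an illustration of the idea. Both function names are hypothetical, and loss is modeled as "dropped on the first attempt only."

```python
def deliver_reliably(chunks, drop_once):
    """TCP-like idea: keep resending whatever the receiver reports missing.

    chunks: list of bytes, numbered from 1.
    drop_once: set of numbers that are lost the first time they are sent.
    Returns (received_data, number_of_sending_rounds).
    """
    total = len(chunks)
    received = {}
    pending_drops = set(drop_once)
    to_send = list(range(1, total + 1))
    rounds = 0
    while to_send:
        rounds += 1
        for seq in to_send:
            if seq in pending_drops:
                pending_drops.discard(seq)  # lost this time, gets through next time
                continue
            received[seq] = chunks[seq - 1]
        # The receiver reports which numbers it still lacks; only those are resent.
        to_send = [s for s in range(1, total + 1) if s not in received]
    return b"".join(received[s] for s in range(1, total + 1)), rounds


def deliver_best_effort(chunks, drop_once):
    """UDP-like idea: send once, and whatever is lost stays lost."""
    return b"".join(c for seq, c in enumerate(chunks, start=1)
                    if seq not in drop_once)


if __name__ == "__main__":
    pieces = [b"top-left ", b"top-right ", b"bottom-left ", b"bottom-right"]
    print(deliver_reliably(pieces, drop_once={4}))
    print(deliver_best_effort(pieces, drop_once={4}))
```

The reliable version needs an extra round trip to fill the gap. The best-effort version finishes sooner but delivers an incomplete cat.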
Why in the world might you want to use IP and UDP when implementing some application for your business or for fun? As opposed to TCP/IP, which is the more commonly paired. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Speed, why? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Good. It's one fewer decision to make. So if Sean doesn't even have to think about what's missing, great. Less work. Less communicating with me. Less resending by me. That surely must take less time. Good. That-- AUDIENCE: Anonymity? DAVID J. MALAN: What's that? AUDIENCE: Anonymity? DAVID J. MALAN: Anonymity. Oh, so that's an interesting one. A key ingredient or assumption of Sean's ability to re-request packets is that he has to know from where it came. Now in this case, it's not quite applicable because he presumably asked me for the cat. So he already showed his hand as to who he is. But that would be true in other scenarios where you might want to send information, maybe maliciously, like spam or whatever without it being traced back to you. You don't want to resend. If the recipient doesn't get your spam, eh, so be it. AUDIENCE: [INAUDIBLE] DAVID J. MALAN: It could be. But any of these could be peer to peer. Just depends on how you use them. Yeah, Sean? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: The sender doesn't-- and why might I not want a response? AUDIENCE: Because it's [INAUDIBLE] so many devices out there, and [INAUDIBLE]. DAVID J. MALAN: OK, good. I mean if Sean had to acknowledge each of the packets, that seems expensive. It's doubling the amount of traffic coming to me, so that might be undesirable. And for what kinds of applications would it actually be a feature not to waste time retransmitting data if it's lost? Instead just blowing ahead and forgetting about it. AUDIENCE: Video transmission? DAVID J. MALAN: Video, why video. AUDIENCE: Like if you're Netflix gives out [INAUDIBLE] back and watch what's already passed. 
You just want to pick up and go. DAVID J. MALAN: OK, good. Although I would push back slightly. I feel like Netflix users would be annoyed if they're just skipping parts of the show or movie. But I can-- I would propose tweaking your answer to be a specialized use of video. AUDIENCE: Yeah, [INAUDIBLE]-- I was going to say, if it gets pixel-y, I don't want to rewatch them. DAVID J. MALAN: OK, that's fair. AUDIENCE: I just want to move on. DAVID J. MALAN: OK, so that's fair. But that would be, I think, a different feature. More quality of service. Netflix and other companies, YouTube have decided that users probably would rather the screen suddenly get pixillated, but the audio still go through and the video still be discernible. Even if it's not as good of an experience. As opposed to buffering, buffering, which is annoying. But when might you-- but it would be annoying, I think, if all of a sudden at the end of the reveal. Like Sherlock is about to discover the case and you just skip it because you lost those bits. Minorly annoying. But when is that less annoying? What types of video? AUDIENCE: Live stream [INAUDIBLE]? DAVID J. MALAN: Live. Yeah. So like baseball games, sporting events, concerts. Anything where the user, just by nature of the experience, would probably prefer to be in real time, even if it means an inferior experience than having a great experience that's buffered. Especially if you're-- it's a little awkward if you and your friends are sort of rooting for your team to win when your team won 10 minutes ago because of buffering. It kind of takes you out of reality. So there might be a conscious decisions. And not even just video-- not video for like sports and events. But what about video conferencing? Right? Then it's really problematic if you're talking to someone, but their remarks were uttered seconds or minutes ago. 
Really you want to just kind of blow through it and rely on the humans to retransmit their own voice again, as opposed to resending the bits that might have gotten lost. All right. So if we have these ingredients, where the heck did all this information come from? Like all of you this morning, if you'd never used Harvard's network before, opened your laptop. Went to Harvard University or Harvard guest or whatnot, typed in a password. But none of you typed in a DNS address, none of you typed in a router address, none of you certainly typed in a subnet mask or any of the things we've been assuming we have. So where does that come from? Well it turns out there is one other protocol that's super popular and helpful these days called DHCP. And you might have glimpsed this on my screen. How is my Mac configured for IPv4? Apparently using DHCP. And this just means it's a protocol, dynamic host configuration protocol, that all Macs and PCs speak these days. And its purpose in life is to automatically configure your Mac and PC. Some of you might have had internet service years ago where the technician would come out. He or she would have a sheet of paper. And he or she or you would have to manually type in your IP address, your router address, your subnet mask. Even if you don't remember it all these years later. That was before there was DHCP. Or at least supported by the ISP and your computer. But with DHCP, you open up your laptop, you choose the Wi-Fi network. And your computer essentially says the equivalent of hello, what IP address should I use, what DNS server should I use, what router should I use? And Harvard, somewhere on campus, has a DHCP server that just responds with that information. How does it know what to use? Well, it turns out that Macs and PCs have, confusingly, a MAC address. Which does not mean Macintosh, it means media access control. Which is this hexadecimal address, where hexadecimal is a 16-digit alphabet. Binary is two, 0 and 1. 
Decimal is 10, 0-9. Hexadecimal is 0-F, where you count 0, 1, 2, 3, 4, 5, 6, 7, 8, 9-- can't say 10 because that's two digits. So 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f is the convention. So this is hexadecimal. But that's just a really big number expressed with some letters and digits. That is how Harvard knows that you are you. So even if you're cleverly, like while here for the weekend or here on campus in general, sort of using incognito mode or private mode in your browser, thinking, ooh, I'm being all incognito, like Harvard knows who you are the entirety of the time. Because if you registered your computer on Harvard's network, one of the pieces of information they glean, besides your name and email or password or whatever, is your computer's MAC address. Otherwise known as a hardware address or ethernet address. That is ultimately tied to you. So there really is no anonymity on a place like campus or in a business. Because even that lowest level detail is there. And even though we haven't discussed the MAC address, it turns out we've been over-simplifying. Inside of each of these envelopes, much like those old school Russian dolls, there isn't just a cat. The cat is actually inside this envelope. And this envelope is inside this envelope. And each of those envelopes has slightly different information. The outermost one might have the IP address, to and from. But the innermost one might have a MAC address, depending on where it is in the story. Yeah, Avi. AUDIENCE: [INAUDIBLE]? DAVID J. MALAN: An IP address is used to get your data from one point to another on a WAN, wide area network, or for all intents and purposes, the internet. A MAC address is used to route your data on a local network. So in this room or in this building, behind a router. As soon as you involve a router, you need IP. If you don't have a router or need a router, you can rely on a MAC address. 
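The local-versus-router distinction just described comes down to one comparison: is the destination on my own network or not? Here is a minimal sketch using Python's standard `ipaddress` module. The `next_hop` function and the sample addresses are hypothetical, and the subnet mask is assumed to be one of the settings the computer received automatically.

```python
import ipaddress


def next_hop(my_ip: str, netmask: str, destination: str, router_ip: str) -> str:
    """Return the IP of the machine whose MAC address goes on the envelope.

    If the destination is on the same local network, deliver directly.
    Otherwise hand the envelope to the router.
    """
    local_network = ipaddress.ip_network(f"{my_ip}/{netmask}", strict=False)
    if ipaddress.ip_address(destination) in local_network:
        return destination
    return router_ip


if __name__ == "__main__":
    settings = dict(my_ip="192.168.1.20", netmask="255.255.255.0",
                    router_ip="192.168.1.1")
    print(next_hop(destination="192.168.1.77", **settings))  # same room: direct
    print(next_hop(destination="203.0.113.9", **settings))   # elsewhere: via router
```

In both cases the IP address on the envelope stays the final destination. Only the local, hardware-level addressing changes.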
And so when I alluded to subnet mask earlier, it's the subnet mask that decides should I address this packet using the recipient's MAC address? Or the router's MAC address? Inside or outside? Internal or external? AUDIENCE: DNS? DAVID J. MALAN: Felicia. I'm sorry? AUDIENCE: DNS? DAVID J. MALAN: DNS is domain name system. That's the device that translates IP addresses to host names and back. So that is why I, the human, can type Google.com and hit Enter. My computer is going to quickly and transparently ask the local DNS server, what's the IP address for Google.com? OK. Let me write that on the envelope instead of the words. Because routers understand IP addresses, not domain names. David? AUDIENCE: [INAUDIBLE] to your MAC address? DAVID J. MALAN: They would not, no. So the outside world does not see-- let me make sure I'm not misspeaking-- they do not see MAC addresses. Because the Mac address gets rewritten. With every hop, it would be used from router to router, for instance. But the recipient would not see it. And so I misspoke when I said-- when I was describing the envelope inside of an envelope, it isn't necessarily your MAC address that's in there the whole time. It's actually the MAC address of the next hop. The next router, router, router. Yes. AUDIENCE: [INAUDIBLE] right? [INAUDIBLE] somewhere [INAUDIBLE] [INAUDIBLE] probably they are doing it because [INAUDIBLE] anonymous, right? You don't have to [INAUDIBLE] 00:53:41,340 --> 00:53:45,970 DAVID J. MALAN: It's a good q-- the MAC address typically wouldn't-- you wouldn't get caught by way of your MAC address. Unless the police or the Harvard people, the security people, were monitoring the local network looking for your laptop to appear. So for instance, let's see, if the FBI or whoever, NSA, were-- and let's just suppose they are these days-- monitoring all the traffic in this room, they would see inside of my virtual envelopes, my MAC address. 
And so if they know from Harvard that David has this MAC address, then the whole world-- then rather-- we-- actually it occurs to me, this is David's MAC address apparently I'm showing on the whole world here-- though it doesn't matter for the same reasons we just discussed-- you could identify a user by their MAC address if you knew who owned it in advance. Harvard is figuring out your MAC address because when you register it, when you first logged in, the local network was detecting it. And associating that MAC address with your identity. AUDIENCE: [INAUDIBLE] So how do they do it? [INAUDIBLE]. DAVID J. MALAN: How do they get caught or not get caught? AUDIENCE: How do they figure out who is sending [? email? ?] DAVID J. MALAN: You know, so you have to make a mistake along the way. Or you have to somehow associate yourself with the addresses that the watchers are seeing. So your computer has at least two unique addresses. Your IP address, which is publicly routable, and your MAC address, which is locally routable. The moment any of you registered your computer on the Wi-Fi network earlier today or earlier this year, you forever associated your identity with that MAC address or that IP address. Now, Harvard knows that. So Harvard could be subpoenaed for that information. Even if you're a bad guy who's just bought a disposable iPad, the way to avoid this is you somehow have to change your MAC address, which is possible, or spoof your IP address to match that of someone else who is already registered. That is absolutely already possible. But barring that, even if you do that, all the bad guy has to do is mess up just once. If he or she visit some website that is not encrypted-- that is the information's not scrambled-- and they just log into their email account or some service once, if the NSA has been logging all of that traffic, they can just go back through weeks or years worth of data. 
And all you have to do is to have screwed up once in order for them to then rewind history and say oh, if this was David at this point in time, it must have been David with high probability in all previous points in time. And that's the danger of something like what the NSA was doing for some time. They're not just looking at moments in time. If they're storing information for a ridiculous amount of time, all you have to do is reveal yourself once. And you can reconstruct that entire history. And that's what's especially frightening about the storage of so much data. 00:56:36,060 --> 00:56:37,850 So in short, it's really hard. You have to be so careful. And not just-- and never screw up really. Or never screw up and be noticed. And in fact, there's-- I forget what the case is. I just read the other night and might pull it up for tomorrow's discussion of security. There was a case where the bad guys thought they were being clever by not actually sending emails. They were logging into some shared email account. They would compose a draft email, not send it. But then the other bad guy would log into the same account, look at the drafts email. Because they were thinking presumably it's not going out on the internet. Which did narrow the scope of the threat to their malicious behavior. But even then, the server, whoever it was, Yahoo or Facebook or whoever, was surely logging who was accessing that shared account. So with corporate cooperation can you reveal what's going on as well. AUDIENCE: So maybe my next question is so where [INAUDIBLE] And how do they control [INAUDIBLE] So where-- how do they control that? And my last question is [INAUDIBLE]. But it would be [INAUDIBLE] on the internet. Right? So if you've changed IP addresses to your numbers, how is that being done? Is there-- DAVID J. MALAN: There is an Internet Assigned Numbers Authority, which is a nonprofit entity that's responsible for allocating IP addresses throughout the world. 
They typically sell or rent IP addresses to bigger fish like internet service providers. Who in turn rent them effectively to little people like us. Our home or our smaller business or smaller school or the like. So there's a hierarchical system for allocating them in a way that ensures that the same IP address is not rented to multiple people. In terms of where you sniff traffic, you can certainly sniff it in any of these points. And that's what's so frightening about sending data on the internet from point A to point B, where B is who knows where. Because anyone with physical access to any of these wires or any of these physical machines could absolutely be snooping on all of our data. Our best defense is disinterest. If no one really cares what we're doing, that's sort of our best protection. But if you do have a threat or someone is just fishing for information, whether it's a government or a company or a hacker or the like, they can look at all of the unencrypted traffic inside of this network. Which is a lot of unencrypted traffic today. In terms of companies or countries, they would typically, especially in certain Asian countries or Middle Eastern countries where there are very tight restrictions these days on internet connectivity, they will generally have relatively few-- or in the extreme case, only one router that routes data from inside the country to outside the country. And so the great firewall of China or restrictions in Pakistan or other countries might just have relatively few devices that are imposing those firewall rules, preventing Facebook traffic from coming in or going out, for instance. AUDIENCE: [INAUDIBLE] So how do they control them? [INAUDIBLE] DAVID J. MALAN: It's harder. I mean, if you are running-- well, if you've built your own Facebook knockoff, you have one or more IP addresses associated with it. 
So even if you're running those within the boundaries of the country, if you only have a finite number of IP addresses, the country could, if they control DNS, simply prevent resolution for like ourFacebook.com from actually being converted to IP addresses. So there are sort of choke points that you could exercise control over. Whether it's at the packet level or at the system level like this. In fact, there was a mistake at one point where some country or company accidentally brought down much of the internet abroad for a brief amount of time. Just because of a misconfiguration of DNS. And because so many people are reliant on DNS. If this starts returning bogus information or just incorrect information, it has this cascading effect of breaking most anything. So there's lots of different ways. I mean, mostly what's happening in recent years and months with all these revelations is people are just realizing how insecure the network has always been. These are not new threats. These are just threats that are being more publicized. But lots more scary stories tomorrow, as well. All right. So where did this come from? DHCP. So dynamic host configuration protocol. So that's just something that our Mac or PC is pre-configured to know about. So what is that actually let us do ultimately? Is actually use the internet without actually having to manually configure our machine. All right. So let's toss a couple of more items into alphabet soup. Which is actually germane to exactly that chat. And some of you might use this at work, even if you're not quite sure what it's doing for you. Who here uses a VPN for work? OK. So about a quarter of the folks here. Why do you use it? Or what does it do for you? Grace? AUDIENCE: Just to access any of our secure data or certain websites or tools [INAUDIBLE] VPN. DAVID J. MALAN: OK, to access certain websites or tools within the company. [? Avi ?]? AUDIENCE: [INAUDIBLE] 01:01:39,782 --> 01:01:41,990 DAVID J. 
MALAN: To remotely access the local network, back home or at your company. Anyone else have disparate use cases? This is generally the principle. VPN is virtual private network. And it allows you, by running some software, usually logging in with the username and password that maybe get preconfigured, to create the illusion that your computer is not on Harvard's network, for instance, but on your own company's network. And so you will suddenly have an IP address that appears to be not only-- you have two IP addresses. One that's at Harvard, and often one that is at your company or at your home, wherever the end point is for that VPN. The upside of that is that if your company or your home or your campus' system administrators have decided this financial software or whatever is just too sensitive to be on the internet, we want people to be physically or virtually on our network, they can restrict access to that piece of software or website or whatever to only those people who are on the network physically, or virtually, as via VPN in the latter scenario. The upside of this is that you have a secure, encrypted connection. All of the traffic between Grace's laptop, wherever she is in the world, and her company is encrypted. So that even if a bad guy sees zeros and ones flying by, they're seemingly random. And they're not information that they could glean much detail from. The VPN has another application. Why do some people abroad in the countries like we've been describing or the scenarios use VPN? What problem might it also solve? AUDIENCE: For instance, China where Facebook is blocked, people use VPN to access Facebook. DAVID J. MALAN: Yeah. It's surprising how many Harvard undergrads seem to visit China and post on Facebook while in China, which is blocked. 
But the reality is that if you have enough technical savvy and enough access technologically, you could, in theory, in China or any other country use your laptop to establish a VPN connection, a virtual private network connection, to a place like Harvard or your company. Or you can even pay third parties these days to have a VPN connection in any number of countries to really bounce yourself around the world. And what happens then is that China or whatever the country is would know that you have an internet connection between you and the outside world. But by nature of it being encrypted, they can't see inside. And so they therefore either by conscious choice or disinterest or oversight, allow you to maintain that connection. But the result is that all of your Facebook traffic doesn't go directly from you in that country to Facebook.com. It first goes through America or wherever your VPN server actually is or company actually is. Then it goes to Facebook. Then it goes back to your company. Then it goes to you. So just intuitively, what's the downside of this approach to circumventing those kinds of protections? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: It's got to be slower if you're adding distance. And frankly, you're adding cryptography. More on that tomorrow. You're adding the scrambling of information which takes a non-zero amount of time. Might slow things down. But the upside, of course, is that you can access, theoretically, protected resources. Now, there is no reason the country or the company couldn't prevent VPN connections, or prevent VPN connections to known places like Harvard.edu. But assuming the cryptography, the scrambling information is correct and secure, they at least can't see what's inside of it if they do allow it. So these are the kinds of trade-offs. But VPN services seems to be very much in vogue these days. And people do not so much for issues of-- well, still for issues of circumvention. 
VPNs are very popular among Netflix subscribers and Hulu subscribers. Why? AUDIENCE: Access location specific streams? DAVID J. MALAN: Yeah, location specific streams. So like certain shows only air in the UK initially. And so you might want-- like Downton Abbey, if you were really into that, and you want to get access before PBS in the US has it, you can VPN into the UK and watch it on whatever the network-- the BBC's channels there, perhaps. Netflix was in the press recently because they've been clamping down on their own customers' use of VPNs. Because if you're traveling abroad-- even if you're a legitimate Netflix subscriber but you travel abroad, and you go to Netflix.com, log in with your American account, you might still not be able to access the resources. Because of whatever partner or licensing arrangements they've made with the big film studios and TV studios, they just won't stream it to you. Unless you pretend to be in America, as you could with a VPN service. But there are certain VPN services that have gotten super popular, it seems. So Netflix simply blacklists their IP addresses. So that even that doesn't work. Yeah? AUDIENCE: [INAUDIBLE] 01:06:16,585 --> 01:06:17,460 DAVID J. MALAN: Sure. So Tor is an interesting thing. This is the-- Tor, The Onion Router, so to speak, which is an anonymity technology. This essentially allows you and your laptop or desktop to create essentially a multi-hop VPN connection. Where you connect to someone else, he or she connects to someone else, he or she connects to someone else. So you have these several layers of indirection that aren't at the router level. They're just at individual laptops or desktops. But the result is that you have plausible deniability in some way. Whereby if you are requesting some website that's way over here, but your traffic appears to, not unlike in the movies, have come from here to here to here to here to here to here to here to here, the resulting website isn't going to know who you are.
They're going to know you only as whoever the most recent hop in that network was. So that might seem to put them at a disadvantage and you at an advantage. But insofar as you're participating in this anonymization network, the same might be true when another person wants to access something anonymously: their request might appear to be coming from your laptop. The catch, though, is that the anonymity certainly breaks down if all of the middlemen in the story reveal their logs. Everyone in the story knows who has been connecting to them and to whom they're connecting. So you're really just making it harder for subpoenas to chip away at this problem. You'd have to subpoena this machine, this machine, this machine, this machine. And if they're well configured technologically, they won't remember any of this information by design anyway. But it's not foolproof. In fact, a few years ago there was a bomb scare at Harvard during, coincidentally, exam period, which was actually traced back to a student, I believe. Even though he-- and this is all publicly documented in newspaper articles-- was using, I believe, Tor at the time. The problem, though, is that-- and I'm only inferring, I think, from the details-- if you're the only Tor user, or one of few Tor users on campus when a bomb threat is called in, you might appear to be anonymous. But that protocol is still identifiable. So the authorities might not know what those few Tor users were doing. But they certainly knew who presumably to question first. And so it only helps in certain scenarios. AUDIENCE: [INAUDIBLE] 01:08:31,880 --> 01:08:32,880 DAVID J. MALAN: Correct. The more, the better. But even then-- read up on it, the security of Tor networks. The folks at the edges are sometimes exposed nonetheless because of issues like this. So it's not something to sort of build a business or super secure communication on necessarily. It helps, but nothing is foolproof.
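As a rough illustration of the point about hops and logs, here is a toy Python sketch. It is not how Tor is actually implemented (real Tor also wraps traffic in layers of encryption); the machine names and both function names are made up for this example. It only models the idea that each machine sees the previous hop, and that combining everyone's logs leads back to the origin.

```python
def senders_seen(client, path):
    """Map each machine on the path to the only sender it can observe."""
    seen = {}
    sender = client
    for hop in path:
        seen[hop] = sender  # each hop sees just the machine right before it
        sender = hop
    return seen


def trace_back(seen, start):
    """Follow the logs backwards, one 'subpoena' per machine, to the origin."""
    current = start
    while current in seen:
        current = seen[current]
    return current


# Hypothetical three-relay path ending at the website being visited.
path = ["relay_a", "relay_b", "relay_c", "website"]
seen = senders_seen("your_laptop", path)

print(seen["website"])               # the website only knows the last relay
print(trace_back(seen, "website"))   # with every log in hand, the origin is found
```

If any one relay keeps no logs, the chain in `trace_back` would be broken at that machine, which is the "well configured technologically" case from the discussion.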
AUDIENCE: [INAUDIBLE] 01:08:55,990 --> 01:08:57,086 DAVID J. MALAN: To what? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Oh. So that's very different. I know less about that one. But let's come back to that one tomorrow morning in more detail. Lest we go too far into security today. Other questions? All right. So we have this whole alphabet soup, all of which together actually allows us to do something. So what is that something? What is one of the most common things? Let's conclude our look at internet technologies with the one that most of us are using so much these days, which is web related stuff. And in fact the one acronym we haven't mentioned yet that you sort of see or use all the time is http. Hypertext transfer protocol. Which is yet another sort of handshake kind of protocol that dictates how computers interact with servers. And what is http used for? Well, typically in a browser, back in the day, you would type http://www.google.com and hit Enter. Nowadays most of us probably don't type http. We've sort of fallen out of that habit. Most of us probably don't even type www these days. Why is it that none of us seem to do that anymore? DNS, true. So long as google.com has an IP address and not just www.google.com, it will just work. And in fact, web servers can forcibly redirect you. If you visit google.com, they can send a response that is not Google's home page. It's an initial response that says no, go here instead. Your browser will then behind the scenes request www.google.com, thereby filling in the blanks for you. Why do we humans rarely if ever type http anymore? 01:10:35,614 --> 01:10:36,406 AUDIENCE: Protocol? DAVID J. MALAN: New protocol? Not even. This is more of a-- AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: What's that? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: The browser does it for you. Right?
If you're using a browser-- even though browsers actually can do a few different things, typically, different protocols FTP, http, and a few others sometimes-- the reality is 99 point whatever percent of the time you're just using them for http traffic. So you want your browser to speak this protocol, which we'll see in a moment. And so it's just inferred. And in fact, browsers even assume that if you type in something dot com and something dot com doesn't exist, they will often presumptuously pre-pend for you www.-- try that address just in case the website's not configured for you. In fact, I still remember, it was an amazing example for class discussions, years ago http://harvard.edu did not work. Which was infuriating, because you'd go to it and you'd hit a dead end. Which sort of doesn't reflect well, I think. But it was an amazing opportunity to discuss in class why it was broken. And then at some point, I must-- someone new was hired at Harvard who similarly thought this was ridiculous. And within days it was fixed. So unfortunately, it's not a good story anymore. But it is purely by human convention that most websites still start with www. And in fact, it's why you see on advertisements something advertised as www.something.com, because for a lot of users out there, technophobes some of them or less technical people, it's a visual identifier that this is a website. Now these days, you might argue that do you really need it? Like what's another good visual cue that when you see something you should type it into a browser. .com alone is pretty good. But nowadays, this is a new problem that will start to resurge again. Most of us have seen .com and .edu and probably .net and probably .org. But there is bunches of others like .travel-- AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Dot what? AUDIENCE: [INAUDIBLE]. .gl, .tv, .io. 
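The "filling in the blanks" behavior described above, where the browser assumes http and may try a www. prefix, can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular browser is implemented; `candidate_urls` is a hypothetical name, and real browsers today may well try https first.

```python
def candidate_urls(typed):
    """Return the URLs a browser-like client might try, in order."""
    typed = typed.strip()
    if "://" not in typed:
        typed = "http://" + typed  # no protocol typed, so assume http
    scheme, rest = typed.split("://", 1)
    host, _, path = rest.partition("/")
    urls = [f"{scheme}://{host}/{path}"]
    if not host.startswith("www."):
        # Fallback guess in case the bare domain is not configured.
        urls.append(f"{scheme}://www.{host}/{path}")
    return urls


print(candidate_urls("harvard.edu"))
print(candidate_urls("http://www.google.com"))
```

For `harvard.edu` the list has two entries, the bare domain and then the www. variant, which is the fallback that would have papered over the old broken harvard.edu address.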
And to make a bad problem worse, if you go to a website now, Registrar-- more on this in a moment-- there are an atrocious number of generic TLD's as they're called-- and this is a recent change in the past year or so where the world just got really messy-- OK. .guru, .clothing, .singles, .holdings, .ventures, .equipment, .plumbing, .cam-- I mean, it's just-- I'm embarrassed to say them all. They're just so innumerable now. So gone are the days-- now you just have to look for a period. And when you see a period or a dot, like type it into a browser it would seem. So there's kind of this interesting issue now, whereby much like we benefited with the phone numbers for years. Like if you saw 800 dash something, dash something, you just knew intuitively it was a phone number. No one had to tell you to call this number necessarily. And we've had that same tendency for first with http. Then we just www. Then with .com. And now there's perhaps a regression in so far as I know lots of people-- and even I wouldn't necessarily know that if they see something dot something that it's not just some trendy way of marketing your sort of syntax as opposed to it being an actual domain name. So it'll be interesting to see what happens now. But this is to say that websites ultimately have domain names. And they're accessed by way of this protocol, http. And let's see what it looks like. It turns out that if I go to google.com, I of course see a page that looks a little something like this. Most browsers have a feature somewhere like under Chrome, it's under Developer, View Source, where you can actually see the underlying code that composes the web page. This isn't programming code per se. This is mostly a language called html and css, more on that later this afternoon. But this is the language in which web pages are written. It's scary looking here because Google has compacted it as much as possible. And it does a little more than a typical website might in terms of features. 
But this is what was inside the virtual envelope. So Sean got a cat. But when I go to google.com, I get all of this inside of my virtual envelopes. And together, remarkably, all of this implements just this simplicity. But there's actually some programming code in a language called JavaScript there as well. But notice, I can get that same response. I'm going to pretend to be a browser here. I'm going to run a program called Telnet, which is an old school program, good for diagnostics these days, to google.com. That's the server to which I want to connect. And specifically, I want to connect to port 80. It turns out that TCP, which we discussed earlier, which guarantees delivery, also, among its features, assigns numbers to different services. So if I summarize a couple of these, 80 is for-- actually, we'll do it right here. Http uses TCP port number 80. Https, which most folks probably know means secure, it's encrypted somehow-- more on that tomorrow-- is 443. Email, otherwise known as simple mail transfer protocol, is usually on port 25 or 487 or-- I'm getting this wrong-- 5-- 46-- I can't remember offhand. Usually on 25. And two other numbers. 400-something and 500-something. But in short, I'm going to use this Telnet program now to connect to Google's server on that port. So I'm not sending an email, as it might look like if I did this. I'm instead pretending to be a browser. And notice that I'm connected to google.com. I can now type textually, just for demonstration's sake, the following. GET / HTTP/1.1 from the host www.google.com. What I have just typed is the digital equivalent of my having extended my hand to [? Shavan ?] earlier. This is what's inside my virtual envelope when I request a page from Google. When I hit Enter now, I get back all of that messiness. Which previously we saw in Chrome, now I just see in my little black and white window. But that's all http is. It has more commands and more instructions that the server understands.
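The Telnet demonstration can be approximated with Python's standard `socket` module. This is a minimal sketch, not a real browser: the function names `build_request` and `fetch` are invented for this example, and the `Connection: close` header is an addition not typed in the lecture, included so the server hangs up once it has sent the response.

```python
import socket


def build_request(host, path="/"):
    """Build the text a browser would put inside its 'virtual envelope'."""
    lines = [
        f"GET {path} HTTP/1.1",  # the command, the path, the protocol version
        f"Host: {host}",         # which site on that server we want
        "Connection: close",     # assumption: ask the server to close when done
        "",
        "",                      # a blank line ends the request headers
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, port=80, path="/"):
    """Open a TCP connection, send the request, and return the raw reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_request("www.google.com").decode("ascii"))
```

Calling `fetch("www.google.com")` on a machine with internet access should return the status line and headers followed by the same kind of messy HTML seen in the browser's View Source.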
But at its essence, all you are saying is get me a specific URL with a version, like 1.1, of the particular protocol. And that's all that's happening there. And if you're using https, all of that goes and comes back encrypted, scrambled. So what does that mean at the end of the day? At the end of the day, this web page still needs to get written in another language. And we'll come back to that later. The other language is going to be something called HTML, HyperText Markup Language. With something called CSS, cascading style sheets. Also some JavaScript. More on that tomorrow. And if I'm ever rattling off acronyms just super fast, it's because usually it doesn't matter what they stand for, but how they work or how they're relevant. And HTML and CSS are the actual languages that implement the web page underneath the hood. So let's answer one final question. Where did www.google.com come from? Or if you're starting a business or if you have some personal portfolio site and you want to put yourself on the internet with your own domain name, what do you do? Does anyone want to offer if you've done this before? AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Register, OK. So you go to an internet registrar so to speak. Registrar. And there's dozens, if not hundreds, of them these days. And from them you buy a domain name. Or you kind of rent the domain name. Because by nature of how the internet was set up, we pay annually for these things. From like $5 to maybe $200 depending on which top level domain or TLD you want. By TLD I mean .com or .org or .guru. In fact, if we go back to Namecheap, which is a popular, fairly inexpensive website, if I want to get computerscience.guru, search, let's see if this is available. It is not available, someone already owns that. Let's see what it is. 01:18:43,162 --> 01:18:44,870 Nope, they're not doing anything with it. Oh and notice, maybe now even Chrome's error messages make a little more sense.
Computer science guru server's DNS address could not be found. So that means that my computer tried to figure out the IP address of computerscience.guru, and it seems the person might have paid for this domain name but not actually done step two. Which is actually an amazing segue to what step two should be. Once you own the domain name, what comes next if you've done it? AUDIENCE: Link it to an IP address. DAVID J. MALAN: You have to link it to an IP address. So you need some kind of web host, which would be the generic way of describing it. And this is either your own server. You could in theory run your own server, get an IP address or plug it into the internet and so forth. Most people don't do that these days. They'll use a cloud service. More on that after lunch time. But you'll sign up for someone like dreamhost.com or heroku.com or Amazon Web Services or Microsoft Azure or Google App Engine. Or any number of third parties who run servers and software and give you storage space where you can put all of your files, all of your web content, and they give you an IP address, or really a shared IP address, maybe with other sites using it. And you actually put your content here. And so there's two steps, and both of these involve some kind of financial transaction if you're indeed using someone else's servers and software to run your website. And there's more steps in between. Generally, it's documented and it completely varies by whom you partner with. But we'll see later today what it looks like to actually write code that you might put on to step two servers.
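The "DNS address could not be found" error and the step of linking a domain name to an IP address can both be illustrated with a small lookup helper. This is a sketch under assumptions: `lookup` and `fake_resolver` are hypothetical names, and the fake resolver and its one entry exist only so the example runs without a network. By default the helper uses the real `socket.gethostbyname` from Python's standard library.

```python
import socket


def lookup(name, resolver=socket.gethostbyname):
    """Return the IP address for a domain name, or None if DNS has no answer."""
    try:
        return resolver(name)
    except socket.gaierror:
        # Roughly what the browser reports as "DNS address could not be found".
        return None


def fake_resolver(name):
    """Stand-in for DNS: only one made-up name has been linked to an address."""
    records = {"linked.example": "192.0.2.10"}  # illustrative address only
    if name not in records:
        raise socket.gaierror("name not found")
    return records[name]


print(lookup("linked.example", fake_resolver))      # domain with step two done
print(lookup("registered-only.example", fake_resolver))  # paid for, never linked
```

With the default resolver, `lookup("www.google.com")` would ask the real DNS system instead of the made-up table.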