WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:02.480 --> 00:00:16.368 [MUSIC PLAYING] 00:00:16.368 --> 00:00:19.760 SPEAKER: You type an address into a browser, you send an email, 00:00:19.760 --> 00:00:22.613 you perhaps have a video conference or a chat online. 00:00:22.613 --> 00:00:24.530 Have you ever stopped to consider what exactly 00:00:24.530 --> 00:00:28.068 is going on underneath the hood, so to speak, of those pieces of software? 00:00:28.068 --> 00:00:30.110 And really the entire infrastructure that somehow 00:00:30.110 --> 00:00:33.470 connects you to the person or persons with whom you're communicating. 00:00:33.470 --> 00:00:36.080 Well it turns out that there's a whole stack, so to speak, 00:00:36.080 --> 00:00:39.770 of internet technologies that underlie the software that you and I use 00:00:39.770 --> 00:00:41.390 these days, every day. 00:00:41.390 --> 00:00:43.670 And indeed, the software that we use, browsers 00:00:43.670 --> 00:00:46.520 and email clients and the like, are really abstractions, 00:00:46.520 --> 00:00:50.420 very user friendly abstractions, on top of some lower level implementation 00:00:50.420 --> 00:00:51.200 details. 00:00:51.200 --> 00:00:54.800 And these days, too, have we built abstractions even above those so known 00:00:54.800 --> 00:00:58.280 as the cloud, an abstraction on top of this underlying infrastructure 00:00:58.280 --> 00:01:02.450 that enables us to do most anything we want computationally without even 00:01:02.450 --> 00:01:04.220 having that hardware locally. 00:01:04.220 --> 00:01:08.110 So let's see if we can't distill what goes on when you do type an address 00:01:08.110 --> 00:01:12.200 or a URL into the address bar of a browser and then hit Enter. 00:01:12.200 --> 00:01:13.550 Or you type out an email-- 00:01:13.550 --> 00:01:15.890 specify someone's email address and then hit Enter. 00:01:15.890 --> 00:01:18.770 What exactly is going on underneath the hood? 
00:01:18.770 --> 00:01:22.160 Well, at the end of the day, I dare say that what your laptop, and my laptop, 00:01:22.160 --> 00:01:25.070 and our desktops, and even our servers are capable of really 00:01:25.070 --> 00:01:29.390 is just sending messages in envelopes back and forth across the internet. 00:01:29.390 --> 00:01:31.100 Virtual envelopes, if you will. 00:01:31.100 --> 00:01:34.550 Now in our human world, an envelope needs a few things on the outside. 00:01:34.550 --> 00:01:37.790 If you want to send a letter or a card or something old school to someone, 00:01:37.790 --> 00:01:39.260 you need to address it, of course. 00:01:39.260 --> 00:01:42.600 And you need to put, perhaps in the middle, the recipient's name, 00:01:42.600 --> 00:01:44.330 and address, and other details. 00:01:44.330 --> 00:01:47.510 You might put in the top left hand corner, by convention, your own name 00:01:47.510 --> 00:01:48.443 and or address. 00:01:48.443 --> 00:01:50.360 You might even put a little memo in the bottom 00:01:50.360 --> 00:01:53.990 that specifies what's inside or fragile or some other annotation. 00:01:53.990 --> 00:01:56.390 So this metaphor of the physical world is actually 00:01:56.390 --> 00:02:00.140 pretty apt for what's going on underneath the hood in computers. 00:02:00.140 --> 00:02:02.960 When you have a computer plugged into a network 00:02:02.960 --> 00:02:05.630 or connected wirelessly to a network, it really 00:02:05.630 --> 00:02:08.765 is just sending and receiving envelopes, virtual envelopes, 00:02:08.765 --> 00:02:11.390 that at the end of the day are just patterns of zeros and ones, 00:02:11.390 --> 00:02:15.750 but collectively, those zeros and ones represent your email or the request 00:02:15.750 --> 00:02:18.500 that you've made of a web server, the response you're getting back 00:02:18.500 --> 00:02:19.940 from that web server. 
00:02:19.940 --> 00:02:23.630 So let's see if we can't formalize exactly what these lower level 00:02:23.630 --> 00:02:27.200 primitives are, consider exactly how they're layered on top of one another, 00:02:27.200 --> 00:02:30.290 because thereafter we can build almost anything we want 00:02:30.290 --> 00:02:33.770 on top of this infrastructure once we understand what those underlying 00:02:33.770 --> 00:02:36.020 building blocks actually are. 00:02:36.020 --> 00:02:38.990 So let's consider how we actually address 00:02:38.990 --> 00:02:40.950 this envelope in the first place. 00:02:40.950 --> 00:02:43.940 After all, when I turn on my laptop or turn on my phone 00:02:43.940 --> 00:02:47.570 or open up my desktop in the morning, how does that computer or that phone 00:02:47.570 --> 00:02:50.420 even know what its own address is on the internet? 00:02:50.420 --> 00:02:52.610 Because just as in our human world, wherein 00:02:52.610 --> 00:02:56.000 you need to be uniquely addressable in the physical world 00:02:56.000 --> 00:02:59.720 in order to even receive an envelope or a card or a package, 00:02:59.720 --> 00:03:04.220 so do computers need to be uniquely identifiable on the internet. 00:03:04.220 --> 00:03:07.490 Now for our purposes, now we can consider the internet just 00:03:07.490 --> 00:03:11.810 to be an internetworked collection of computers connected 00:03:11.810 --> 00:03:13.730 via wires, connected wirelessly. 00:03:13.730 --> 00:03:16.610 There's some kind of interconnectivity among all of these devices 00:03:16.610 --> 00:03:19.730 and these days our phones and internet of things devices and other things 00:03:19.730 --> 00:03:20.360 still. 00:03:20.360 --> 00:03:22.580 So let's just stipulate that somehow or other there's 00:03:22.580 --> 00:03:25.550 a physical connection, or even a wireless connection, 00:03:25.550 --> 00:03:28.100 between all of these various devices. 
00:03:28.100 --> 00:03:31.340 So those devices all need unique addresses, 00:03:31.340 --> 00:03:34.550 just like a building in the human world needs an address. 00:03:34.550 --> 00:03:37.970 For instance, the computer science building here on campus is at 33 Oxford 00:03:37.970 --> 00:03:41.780 Street, Cambridge, Massachusetts, 02138, USA. 00:03:41.780 --> 00:03:47.872 With that precise information, can you send us a real mail or a package 00:03:47.872 --> 00:03:50.330 or anything else through the physical world in order for it 00:03:50.330 --> 00:03:51.920 to arrive on our doorstep. 00:03:51.920 --> 00:03:54.490 But what if you, instead, wanted to send us an email 00:03:54.490 --> 00:03:56.240 and get it to that building, or really me, 00:03:56.240 --> 00:04:00.470 wherever I am physically in the world on my internetworked device? 00:04:00.470 --> 00:04:02.810 You need to know my computer's address, you 00:04:02.810 --> 00:04:05.570 need to know my phone's address, or at least the mail server 00:04:05.570 --> 00:04:08.660 that's responsible for receiving that message from you. 00:04:08.660 --> 00:04:12.710 Well, it turns out that most any network on a campus, in a corporation, 00:04:12.710 --> 00:04:16.279 even at home these days has a DHCP server. 00:04:16.279 --> 00:04:19.040 Stands for a Dynamic Host Configuration Protocol, 00:04:19.040 --> 00:04:22.940 and that's just a fancy way of describing a server that is constantly 00:04:22.940 --> 00:04:26.540 listening for new laptops, new desktops, new phones, new other devices, to wake 00:04:26.540 --> 00:04:31.310 up or be turned on and to shout out the digital equivalent of hello, world, 00:04:31.310 --> 00:04:32.690 what is my address? 00:04:32.690 --> 00:04:35.250 Because the purpose in life of these DHCP 00:04:35.250 --> 00:04:37.610 servers is to answer that question. 00:04:37.610 --> 00:04:41.990 To say, David you're going to go ahead and be address 1.2.3.4 today. 
00:04:41.990 --> 00:04:47.570 Or David, you're going to be 4.5.6.7 or 5.6.7.8. 00:04:47.570 --> 00:04:51.560 Any number of possibilities can be used to represent 00:04:51.560 --> 00:04:53.840 uniquely my particular device. 00:04:53.840 --> 00:04:57.890 So DHCP servers are run by the system administrators on a campus, 00:04:57.890 --> 00:05:00.530 in a company, in an internet service provider. 00:05:00.530 --> 00:05:03.770 More generally, they're run by whoever provides us 00:05:03.770 --> 00:05:05.540 with our internet connectivity. 00:05:05.540 --> 00:05:07.400 They just exist on our network. 00:05:07.400 --> 00:05:10.250 But these DHCP servers also give us other information. 00:05:10.250 --> 00:05:14.300 After all, it's not really sufficient just to know what my own address is. 00:05:14.300 --> 00:05:16.950 How do I know where anyone else in the world is? 00:05:16.950 --> 00:05:18.870 Well, it turns out that the internet is filled 00:05:18.870 --> 00:05:22.350 with devices called routers whose purpose in life, 00:05:22.350 --> 00:05:25.770 as their name suggests, is to route information from point A 00:05:25.770 --> 00:05:28.690 to point B to point C and so on. 00:05:28.690 --> 00:05:31.470 And those routers, similarly, need to know these addresses 00:05:31.470 --> 00:05:34.350 so that they know upon receiving some packet of information, 00:05:34.350 --> 00:05:37.810 some virtual envelope, in which direction to send it off. 00:05:37.810 --> 00:05:43.350 So these DHCP servers also tell me not just my address, but also the address 00:05:43.350 --> 00:05:45.360 of the next hop, so to speak. 00:05:45.360 --> 00:05:47.940 I, as a little old laptop or phone or a desktop, 00:05:47.940 --> 00:05:51.840 I have no idea where 99.999 percent of the computers in the world 00:05:51.840 --> 00:05:53.790 are, even higher than that perhaps. 
00:05:53.790 --> 00:05:57.900 But I do need to know where the next computer is on the internet, 00:05:57.900 --> 00:06:01.260 so that if I want to send information that leaves this room, 00:06:01.260 --> 00:06:03.960 it needs to go to a router whose purpose in life 00:06:03.960 --> 00:06:06.690 is to, again, route it further along. 00:06:06.690 --> 00:06:09.840 And generally there might be one, two, maybe even 30 steps 00:06:09.840 --> 00:06:14.550 or hops in between me and my destination for that email or virtual envelope, 00:06:14.550 --> 00:06:17.610 and those routers are all configured by people who aren't me, 00:06:17.610 --> 00:06:21.060 system administrators beyond this, beyond these walls 00:06:21.060 --> 00:06:22.800 to know how to route that data. 00:06:22.800 --> 00:06:26.158 So we can actually see evidence of this that you yourself 00:06:26.158 --> 00:06:28.200 have had underneath your fingertips all this time 00:06:28.200 --> 00:06:29.970 and you might not have ever poked around. 00:06:29.970 --> 00:06:32.610 For instance, if you want to see your own address, 00:06:32.610 --> 00:06:34.950 keep an eye out for a number of this form. 00:06:34.950 --> 00:06:39.330 It's a number dot number dot number dot number, and each of those place holders 00:06:39.330 --> 00:06:44.520 represents a specific value, either starting at zero or ending at 255. 00:06:44.520 --> 00:06:49.530 In other words, each of these hashes can be any value between 0 and 255, 00:06:49.530 --> 00:06:54.960 and that range 0 to 255 well that's 256 total possible values. 00:06:54.960 --> 00:06:55.770 That's eight bits. 00:06:55.770 --> 00:07:00.300 Ergo, each of these place holders represents 8 bits, 8 more bits, 8 more, 00:07:00.300 --> 00:07:01.020 8 more. 00:07:01.020 --> 00:07:05.070 So an IP address, by definition, is 32 bits. 00:07:05.070 --> 00:07:05.760 And there it is. 
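As a rough illustration of the arithmetic just described, the sketch below packs the four 0-to-255 numbers of a dotted address into a single 32-bit value and unpacks it again. The function names are hypothetical, chosen only for this example.

```python
def dotted_quad_to_int(address: str) -> int:
    """Pack a number.number.number.number address into one 32-bit integer."""
    parts = [int(part) for part in address.split(".")]
    if len(parts) != 4 or any(not 0 <= part <= 255 for part in parts):
        raise ValueError(f"not a valid dotted address: {address!r}")
    value = 0
    for part in parts:
        # Each placeholder contributes 8 bits, so shift left by 8 and add it.
        value = (value << 8) | part
    return value


def int_to_dotted_quad(value: int) -> str:
    """Unpack a 32-bit integer back into four 8-bit numbers."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

For instance, `dotted_quad_to_int("1.2.3.4")` yields the same 32 bits that `int_to_dotted_quad` turns back into `"1.2.3.4"`.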
00:07:05.760 --> 00:07:07.950 IP, an acronym you've probably seen somewhere, 00:07:07.950 --> 00:07:10.270 even if you've not thought hard about what it is, 00:07:10.270 --> 00:07:11.940 stands for Internet Protocol. 00:07:11.940 --> 00:07:15.360 Internet Protocol mandates that every computer on the internet, 00:07:15.360 --> 00:07:19.920 at the risk of oversimplification, has a unique address called an IP address. 00:07:19.920 --> 00:07:22.860 And those IP addresses look like this. 00:07:22.860 --> 00:07:27.720 If these IP addresses are composed of 32 bits, how many possible IPs are there 00:07:27.720 --> 00:07:30.840 and therefore how many possible machines can we have on our internet? 00:07:30.840 --> 00:07:36.060 Well, 2 times 2 times 2, 2 to the 32, so that's four billion, give or take. 00:07:36.060 --> 00:07:40.170 By design of IP addresses, you can have four billion, 00:07:40.170 --> 00:07:44.858 give or take, possible permutations of zeros and ones if you have 32 in total, 00:07:44.858 --> 00:07:46.650 and that gives you four billion, maximally, 00:07:46.650 --> 00:07:51.180 computers and phones and internet of things devices, and the like. 00:07:51.180 --> 00:07:53.760 Now that sounds big, but not when each of us 00:07:53.760 --> 00:07:57.780 personally probably carries one IP address in our pocket in our phone, 00:07:57.780 --> 00:08:00.572 maybe another on our wrist these days, one or more computers 00:08:00.572 --> 00:08:03.780 in our life, not to mention all of the other devices and servers in the world 00:08:03.780 --> 00:08:05.190 that need these addresses, too. 00:08:05.190 --> 00:08:08.250 So long story short, this is version 4 of IP. 00:08:08.250 --> 00:08:12.090 It's decades old, but there's also a newcomer on the field 00:08:12.090 --> 00:08:15.180 called IPv6, version 6. 00:08:15.180 --> 00:08:17.250 There isn't really to be a version 5. 
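The "four billion, give or take" figure can be checked with a one-line calculation. This is only a sketch of the arithmetic; `address_count` is a hypothetical helper name.

```python
def address_count(bits: int) -> int:
    """Number of distinct bit patterns, hence addresses, with this many bits."""
    return 2 ** bits


IPV4_BITS = 32
print(f"{address_count(IPV4_BITS):,}")  # prints 4,294,967,296
```

The same function works for any address width, which makes it easy to compare sizes.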
00:08:17.250 --> 00:08:19.830 And IPv6 is only finally gaining traction 00:08:19.830 --> 00:08:22.290 because we're running so short on IPs that it's 00:08:22.290 --> 00:08:25.530 becoming a problem for campuses, for companies, and beyond. 00:08:25.530 --> 00:08:30.310 But IPv6 will use 128 bits instead of 32, 00:08:30.310 --> 00:08:33.630 which gives us many, many, many, many, more possibilities, bigger 00:08:33.630 --> 00:08:35.530 numbers than I can even pronounce. 00:08:35.530 --> 00:08:38.370 So that should cut it for quite some time. 00:08:38.370 --> 00:08:43.110 But not every computer on the internet needs a public IP address, only 00:08:43.110 --> 00:08:47.970 those envelopes, so to speak, that need to leave my pocket, or my home, 00:08:47.970 --> 00:08:50.100 or my campus, or my company. 00:08:50.100 --> 00:08:54.540 It turns out, as a short term mechanism to squeeze a bit more utility out 00:08:54.540 --> 00:08:58.050 of our 32-bit addresses, which are still omnipresent 00:08:58.050 --> 00:09:00.780 and the most popular among the versions, well 00:09:00.780 --> 00:09:02.850 we can actually distinguish between public IP 00:09:02.850 --> 00:09:06.300 addresses that do actually go out on the internet and private addresses. 00:09:06.300 --> 00:09:08.790 And indeed, if your own IP address happens 00:09:08.790 --> 00:09:11.820 to start with the number 10 and then a dot or the number 00:09:11.820 --> 00:09:17.640 172.16 and then a dot, or the number 192.168 and then a dot, 00:09:17.640 --> 00:09:21.810 and then something else, well, odds are, your computer has a private IP address. 
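One way to check whether an address falls in a private range is sketched below with Python's standard `ipaddress` module. The three blocks listed are the private ranges reserved by RFC 1918 (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16); note that the second block runs from 172.16 through 172.31, which is slightly wider than the spoken shorthand. The name `is_private_ipv4` is made up for this example.

```python
import ipaddress

# The three IPv4 blocks set aside for private use (RFC 1918).
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_ipv4(address: str) -> bool:
    """Return True if the address lies inside one of the private blocks."""
    ip = ipaddress.IPv4Address(address)
    return any(ip in network for network in PRIVATE_NETWORKS)
```

A home address such as `192.168.1.139` would report as private, while `1.2.3.4` would not.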
00:09:21.810 --> 00:09:25.860 And this is just a feature of the little router that's probably in your home, 00:09:25.860 --> 00:09:28.920 or the bigger router on your campus or corporate network, 00:09:28.920 --> 00:09:33.690 that enables you to have an IP address that's only used within the company, 00:09:33.690 --> 00:09:36.790 only used within your home, and cannot, by definition, 00:09:36.790 --> 00:09:40.410 be routed publicly beyond your company, beyond your home, 00:09:40.410 --> 00:09:42.390 because the router will stop it. 00:09:42.390 --> 00:09:45.750 And so here we actually have the beginnings of a firewalling mechanism, 00:09:45.750 --> 00:09:46.292 if you will. 00:09:46.292 --> 00:09:48.000 In the real world, a firewall is a device 00:09:48.000 --> 00:09:51.690 that prevents fire from going from one store to another, for instance. 00:09:51.690 --> 00:09:54.330 In the virtual world, a firewall is a piece of software 00:09:54.330 --> 00:09:57.480 that prevents zeros and ones from going from one place to another. 00:09:57.480 --> 00:09:59.940 And in this case do we already have a mechanism 00:09:59.940 --> 00:10:03.840 via public and private addresses of keeping some data securely, 00:10:03.840 --> 00:10:06.690 or with high probability securely, within our company 00:10:06.690 --> 00:10:09.490 versus allowing it to go out on the internet. 00:10:09.490 --> 00:10:13.980 So we'll see now some screenshots of some actual computers from Mac OS 00:10:13.980 --> 00:10:16.165 and Windows alike that reveal their IP addresses, 00:10:16.165 --> 00:10:18.290 and you yourself can see this on your own machines. 00:10:18.290 --> 00:10:20.860 For instance, here on Windows 10 is a screenshot 00:10:20.860 --> 00:10:24.050 of what your Network Preferences, so to speak, might look like. 00:10:24.050 --> 00:10:26.680 And if you focus down here, it's a bit arcane at first glance, 00:10:26.680 --> 00:10:32.470 but IPv4 address is 192.168.1.139 when we took that screenshot. 
00:10:32.470 --> 00:10:35.230 And indeed, it starts with 192.168 which means it's private, 00:10:35.230 --> 00:10:38.260 and indeed, I took this screenshot while we were within a home network, 00:10:38.260 --> 00:10:41.980 and so that suggests it can be used to route among computers in that home 00:10:41.980 --> 00:10:43.480 but not beyond. 00:10:43.480 --> 00:10:45.730 You'll see, too, if we move on to the next screen 00:10:45.730 --> 00:10:47.740 where you see more advanced network properties, 00:10:47.740 --> 00:10:51.390 you can also see the mention of this default gateway, which is 00:10:51.390 --> 00:10:53.830 synonymous with router, default router. 00:10:53.830 --> 00:10:56.740 192.168.1.1. 00:10:56.740 --> 00:11:00.340 So a default router or default gateway is that first hop, 00:11:00.340 --> 00:11:03.370 so that if I want to send an email outside of my home, 00:11:03.370 --> 00:11:05.890 I want to visit a web page outside of my company, 00:11:05.890 --> 00:11:09.820 all I need do is hand that virtual envelope containing that email 00:11:09.820 --> 00:11:14.050 or that web request off to the machine on the local network that 00:11:14.050 --> 00:11:15.340 has that IP address. 00:11:15.340 --> 00:11:19.450 I have no idea where it's going to go thereafter, to hops two and three 00:11:19.450 --> 00:11:22.150 and beyond, but that's why we have this whole internet 00:11:22.150 --> 00:11:23.710 and even more routers out there. 00:11:23.710 --> 00:11:27.280 They, the routers, intercommunicate and relay that data, 00:11:27.280 --> 00:11:31.000 hop to hop to hop, until it finally reaches its destination. 00:11:31.000 --> 00:11:33.310 Now where did I get my IPv4 address from, 00:11:33.310 --> 00:11:35.215 where did I get my default gateway from? 00:11:35.215 --> 00:11:39.610 From the DHCP server in my home, in my company, or whatever network 00:11:39.610 --> 00:11:40.740 I happen to be on. 00:11:40.740 --> 00:11:41.800 And Mac OS is the same. 
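The first-hop decision described above can be sketched roughly as follows. The network range `192.168.1.0/24` is an assumption made for illustration (the screenshots show only the device and gateway addresses, not the size of the local network), and `next_hop` is a hypothetical name.

```python
import ipaddress


def next_hop(destination: str, local_network: str, default_gateway: str) -> str:
    """Decide where a device hands off an envelope first."""
    if ipaddress.ip_address(destination) in ipaddress.ip_network(local_network):
        # The destination is on the same local network: deliver directly.
        return destination
    # Anything else goes to the default gateway, which routes it onward.
    return default_gateway


# Assumed home network; only the gateway address comes from the screenshot.
print(next_hop("1.2.3.4", "192.168.1.0/24", "192.168.1.1"))
```

The device never needs to know hops two, three and beyond; it only needs the gateway's address.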
00:11:41.800 --> 00:11:44.560 If these screens are unfamiliar, you might recognize this, 00:11:44.560 --> 00:11:46.210 under System Preferences in Mac OS. 00:11:46.210 --> 00:11:48.790 Here, while connected to Harvard University's network, 00:11:48.790 --> 00:11:53.560 you can actually see that my IP address was 10.254.16.242. 00:11:53.560 --> 00:11:57.010 That number, too, starting with one of those internal or private prefixes, 00:11:57.010 --> 00:11:59.350 indicative of the fact that even within Harvard, 00:11:59.350 --> 00:12:03.190 where we're keeping all of our Harvard traffic internal to Harvard, 00:12:03.190 --> 00:12:05.328 and then not exposing that externally. 00:12:05.328 --> 00:12:07.870 And indeed, if we look in the more advanced preferences here, 00:12:07.870 --> 00:12:13.060 we can see that the router for my Mac was 10.254.16.1. 00:12:13.060 --> 00:12:16.390 Which is to say this Mac, when it's ready to send something off campus, 00:12:16.390 --> 00:12:20.890 simply hands that envelope off to this particular router here. 00:12:20.890 --> 00:12:25.480 And the router's job, ultimately, that first hop, a border gateway or border 00:12:25.480 --> 00:12:27.400 router, literally referring to a computer 00:12:27.400 --> 00:12:30.730 that physically or metaphorically is on the edge of a campus 00:12:30.730 --> 00:12:35.230 or company, its purpose in life is to simply change 00:12:35.230 --> 00:12:38.110 what's on that envelope initially from the private IP 00:12:38.110 --> 00:12:41.800 address to one or more public IP addresses, thereby 00:12:41.800 --> 00:12:43.180 maintaining this mapping. 00:12:43.180 --> 00:12:46.720 So this might result in everyone else in the world thinking 00:12:46.720 --> 00:12:49.930 that I and you and everyone else in my company or campus 00:12:49.930 --> 00:12:53.980 are all actually at the same IP address, but that's not true. 
00:12:53.980 --> 00:12:56.950 Each of our personal devices has a private IP address 00:12:56.950 --> 00:12:59.830 and that router can actually translate via something 00:12:59.830 --> 00:13:03.190 called network address translation, or NAT, from a private address 00:13:03.190 --> 00:13:05.320 to public and back. 00:13:05.320 --> 00:13:09.400 And so in this way, too, can a company help mask the origin 00:13:09.400 --> 00:13:13.000 or the identity of whoever it is that's accessing some internet-based service. 00:13:13.000 --> 00:13:16.840 Of course, that same company could log what it is that's leaving the company 00:13:16.840 --> 00:13:19.720 and coming back in, so via subpoena or another mechanism, 00:13:19.720 --> 00:13:22.630 could someone certainly figure out who was accessing 00:13:22.630 --> 00:13:25.532 that service at a particular time, but the outside world 00:13:25.532 --> 00:13:26.740 would need help knowing that. 00:13:26.740 --> 00:13:29.600 And so here, even within Harvard, it's done perhaps for that reason. 00:13:29.600 --> 00:13:32.710 But also perhaps in order to use one public IP 00:13:32.710 --> 00:13:36.340 address among hundreds or thousands of university affiliates, 00:13:36.340 --> 00:13:39.680 so that frankly we just don't need as many IP addresses. 00:13:39.680 --> 00:13:42.100 So what might be both a technological motivation 00:13:42.100 --> 00:13:46.550 can also have these policy side effects as well. 00:13:46.550 --> 00:13:48.400 So IP itself. 00:13:48.400 --> 00:13:49.030 Protocol. 00:13:49.030 --> 00:13:50.488 Well, what does that actually mean? 00:13:50.488 --> 00:13:53.820 A protocol-- it's not a language, per se, it's not a programming language. 00:13:53.820 --> 00:13:55.570 It's really just a set of conventions that 00:13:55.570 --> 00:13:58.210 govern how computers intercommunicate. 
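The private-to-public translation described a moment ago can be modeled, very loosely, as a lookup table kept by the router. Everything here is a toy: `ToyNat` is a made-up class, `203.0.113.5` is a placeholder public address from a range reserved for documentation, and the `slot` number stands in for whatever the router uses to tell conversations apart (in practice, commonly a port number).

```python
from dataclasses import dataclass, field


@dataclass
class ToyNat:
    """A toy model of the mapping a translating router maintains."""
    public_ip: str
    _next_slot: int = 1
    _by_slot: dict = field(default_factory=dict)
    _by_private: dict = field(default_factory=dict)

    def outbound(self, private_ip: str) -> tuple:
        """Rewrite the sender on an envelope that is leaving the network."""
        if private_ip not in self._by_private:
            slot = self._next_slot
            self._next_slot += 1
            self._by_private[private_ip] = slot
            self._by_slot[slot] = private_ip
        # The outside world only ever sees the shared public address.
        return self.public_ip, self._by_private[private_ip]

    def inbound(self, slot: int) -> str:
        """Find which private device a returning envelope belongs to."""
        return self._by_slot[slot]


nat = ToyNat("203.0.113.5")  # placeholder public address
print(nat.outbound("10.254.16.242"))
```

Two different private devices come out with the same public address but different slots, which is how replies find their way back.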
00:13:58.210 --> 00:14:01.840 IP, specifically, says that if you want to send a message on the internet, 00:14:01.840 --> 00:14:07.480 you shall write a sender address on the envelope and a recipient address 00:14:07.480 --> 00:14:10.420 on the envelope, and that will ensure that the routers know 00:14:10.420 --> 00:14:12.670 what to do with it, and they'll send it back and forth 00:14:12.670 --> 00:14:14.140 in the appropriate directions. 00:14:14.140 --> 00:14:17.600 IP gives us some other features, as well, fragmentation among them. 00:14:17.600 --> 00:14:20.680 It turns out for efficiency if you've got a really big email or a really 00:14:20.680 --> 00:14:23.830 big file, whether a PowerPoint file or video file, 00:14:23.830 --> 00:14:27.070 it's not really fair to everyone else to kind of jam that onto the network 00:14:27.070 --> 00:14:30.820 and to the exclusion of other people's data at any given point in time. 00:14:30.820 --> 00:14:34.960 And so IP tends to fragment big files into smaller pieces 00:14:34.960 --> 00:14:37.540 and send them in multiple envelopes that eventually 00:14:37.540 --> 00:14:41.210 get reassembled at the other end, so that there is a compelling feature as 00:14:41.210 --> 00:14:41.710 well. 00:14:41.710 --> 00:14:44.710 But this leads, of course, to a slippery slope of implications 00:14:44.710 --> 00:14:47.560 for net neutrality and for companies or governments 00:14:47.560 --> 00:14:50.710 to actually then start to distinguish between quality 00:14:50.710 --> 00:14:53.890 of service of this type of data and this other type of data. 00:14:53.890 --> 00:14:54.850 Why can they do that? 00:14:54.850 --> 00:14:57.423 Well, it's all quantized at a very small unit of measure, 00:14:57.423 --> 00:14:59.590 and within these packets are additional information. 00:14:59.590 --> 00:15:04.510 Not just those addresses, but hints as to what type of data is in the packet. 
00:15:04.510 --> 00:15:08.320 Is it an email, is it a web page, is it a video conference, is it Netflix, 00:15:08.320 --> 00:15:10.030 is it some competitor service? 00:15:10.030 --> 00:15:12.880 And so ISPs or companies or governments can certainly 00:15:12.880 --> 00:15:18.130 distinguish among these types of packets and treat them theoretically, and all 00:15:18.130 --> 00:15:20.350 too really these days, differently. 00:15:20.350 --> 00:15:23.230 So that they're derived simply from these basic primitives. 00:15:23.230 --> 00:15:25.900 Now we can very quickly go pretty low level. 00:15:25.900 --> 00:15:28.990 If you actually look back at the formal definition 00:15:28.990 --> 00:15:33.580 that humans crafted decades ago for what IP is, this is how they drew it. 00:15:33.580 --> 00:15:36.700 You might call this ASCII art, to borrow a phrase from our look 00:15:36.700 --> 00:15:38.140 at computational thinking. 00:15:38.140 --> 00:15:41.140 It's sort of an artist's rendition of some structure 00:15:41.140 --> 00:15:43.960 just by using the keys on his or her keyboard. 00:15:43.960 --> 00:15:45.970 And so these dashes and pluses, really, just 00:15:45.970 --> 00:15:48.430 are meant to draw a rectangular picture, nothing more. 00:15:48.430 --> 00:15:52.700 The numbers on top represent units of 10 bits at a time. 00:15:52.700 --> 00:15:57.610 Here's bit 0, here's 10, here's 20, and over here is the 32nd such bit. 00:15:57.610 --> 00:16:02.140 So start at zero, and you count as high as 31, so that's our 32nd bit. 00:16:02.140 --> 00:16:04.810 And we can see a few details within here. 00:16:04.810 --> 00:16:08.320 We can see details like the source address, 00:16:08.320 --> 00:16:12.190 and it's the whole width of this picture indicating that indeed this 00:16:12.190 --> 00:16:17.210 is a 32-bit value that composes the source address or the sender address. 00:16:17.210 --> 00:16:20.290 Destination address is just as wide, so there's another 32 bits. 
00:16:20.290 --> 00:16:22.240 There's options and other time to live. 00:16:22.240 --> 00:16:25.420 You can specify just how many routers this can be handed off 00:16:25.420 --> 00:16:28.300 to before the router should say, we just don't know where 00:16:28.300 --> 00:16:30.370 this destination is, we shall give up. 00:16:30.370 --> 00:16:32.710 And there's other fields as well in here. 00:16:32.710 --> 00:16:34.330 Now what are we really looking at? 00:16:34.330 --> 00:16:36.430 This is just an artist's rendition of what 00:16:36.430 --> 00:16:38.140 it means to send a pattern of bits. 00:16:38.140 --> 00:16:41.020 The first few bits somehow relate to version. 00:16:41.020 --> 00:16:44.580 The next few bits relate to IHL and type of service and total length. 00:16:44.580 --> 00:16:46.330 Eventually, the pattern of bits represents 00:16:46.330 --> 00:16:48.500 source address and destination address. 00:16:48.500 --> 00:16:52.600 So any computer that's receiving just a series of bits wirelessly 00:16:52.600 --> 00:16:57.160 or over the wire in the form of wavelengths of light or of electricity 00:16:57.160 --> 00:17:01.990 on a wire, simply needs to realize, oh, once I've received this many bits, 00:17:01.990 --> 00:17:04.270 I can infer that those bits were my source address, 00:17:04.270 --> 00:17:06.069 those were my destination address. 00:17:06.069 --> 00:17:09.250 But again, this is so low level, it's a lot more pleasant to sort of think 00:17:09.250 --> 00:17:11.230 about things at the virtual level. 00:17:11.230 --> 00:17:14.589 An envelope that just has this information written on it, and let's 00:17:14.589 --> 00:17:18.640 not worry about an abstraction level below this one, wherein 00:17:18.640 --> 00:17:21.250 we get into the weeds of this data. 00:17:21.250 --> 00:17:25.776 But it turns out that IP is not the only protocol that drives the internet. 
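To make the "once I've received this many bits" idea concrete, here is a minimal sketch that reads the fixed 20-byte portion of an IPv4 header using Python's `struct` module. The field order follows the standard IPv4 header layout; the sample header is fabricated for illustration (its checksum is left at zero, so it is not a valid packet), and `parse_ipv4_header` is a hypothetical name.

```python
import ipaddress
import struct

# Fixed part of an IPv4 header: 20 bytes, big-endian ("network") byte order.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")


def parse_ipv4_header(raw: bytes) -> dict:
    """Pull the named fields out of the first 20 bytes of a packet."""
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = IPV4_HEADER.unpack(raw[:20])
    return {
        "version": version_ihl >> 4,    # first 4 bits
        "ihl": version_ihl & 0x0F,      # next 4 bits: header length in 32-bit words
        "total_length": total_length,
        "ttl": ttl,                     # time to live: remaining hops allowed
        "protocol": protocol,           # hint as to what the envelope carries
        "source": str(ipaddress.IPv4Address(src)),
        "destination": str(ipaddress.IPv4Address(dst)),
    }


# Illustrative header only: version 4, IHL 5, TTL 64, checksum not computed.
sample = IPV4_HEADER.pack(
    (4 << 4) | 5, 0, 20, 0, 0, 64, 6, 0,
    ipaddress.IPv4Address("1.2.3.4").packed,
    ipaddress.IPv4Address("5.6.7.8").packed,
)
print(parse_ipv4_header(sample))
```

The receiver does nothing cleverer than count bits: the source address is always bytes 12 through 15, the destination bytes 16 through 19.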
00:17:25.776 --> 00:17:28.359 In fact there's several, but perhaps the other most common one 00:17:28.359 --> 00:17:30.410 that you've heard of is that one here. 00:17:30.410 --> 00:17:31.630 TCP. 00:17:31.630 --> 00:17:33.460 Transmission Control Protocol. 00:17:33.460 --> 00:17:36.700 Now this is just a protocol that solves a different problem. 00:17:36.700 --> 00:17:40.660 Rather than simply focus on addressing computers on the internet 00:17:40.660 --> 00:17:43.090 and ensuring data gets from one point to another, 00:17:43.090 --> 00:17:46.630 TCP is about, among other things, guaranteeing delivery. 00:17:46.630 --> 00:17:52.330 TCP adds some additional zeros and ones to that envelope on the outside of it 00:17:52.330 --> 00:17:55.450 that helps us get that envelope to its destination with much 00:17:55.450 --> 00:17:56.737 higher probability. 00:17:56.737 --> 00:17:58.570 In other words, the internet's a busy place. 00:17:58.570 --> 00:18:01.870 Servers are constantly getting new users, 00:18:01.870 --> 00:18:05.260 routers are receiving any number of packets at any given time, 00:18:05.260 --> 00:18:07.360 and sometimes there are spikes in connectivity. 00:18:07.360 --> 00:18:10.630 People might all be tuning into some news broadcast online streaming 00:18:10.630 --> 00:18:13.630 lots of video, or downloading the latest news all at once, 00:18:13.630 --> 00:18:15.745 or everyone's playing the latest game online, 00:18:15.745 --> 00:18:17.620 and so there can be these bursts of activity. 00:18:17.620 --> 00:18:22.690 And honestly humans don't necessarily engineer with those bursts of activity 00:18:22.690 --> 00:18:26.680 in mind, and so routers get busy, computers get busy. 00:18:26.680 --> 00:18:29.770 And when they get busy, they might receive an envelope of information 00:18:29.770 --> 00:18:32.395 and realize, wait a minute, I don't have enough hands for this, 00:18:32.395 --> 00:18:35.200 and packets get dropped, so to speak. 
00:18:35.200 --> 00:18:39.070 In fact that's a term of art, to drop a packet just means to ignore it. 00:18:39.070 --> 00:18:42.490 You don't have enough memory, enough RAM inside of your system 00:18:42.490 --> 00:18:46.360 to hang onto it for any length of time, so you just ignore it. 00:18:46.360 --> 00:18:49.570 Now this would be pretty darn frustrating if you send an email 00:18:49.570 --> 00:18:52.210 and only with some probability does it go through. 00:18:52.210 --> 00:18:54.370 Now in practice that might feel like it happens, 00:18:54.370 --> 00:18:56.787 especially when things get caught up in spam and the like, 00:18:56.787 --> 00:19:00.520 but in practice you really do want emails that are sent to be received. 00:19:00.520 --> 00:19:03.190 When you request a web page, you want the entire web page. 00:19:03.190 --> 00:19:06.880 And even if those are big emails or big web pages that are therefore 00:19:06.880 --> 00:19:10.780 chopped into fragments, you really want to receive all of the fragments 00:19:10.780 --> 00:19:13.720 and not just only some of the paragraphs in the email, 00:19:13.720 --> 00:19:16.540 or only some sections of the web page. 00:19:16.540 --> 00:19:21.670 So TCP ensures that you get all of that data at the end of the day. 00:19:21.670 --> 00:19:24.620 Well hopefully not at the end of the day, but ultimately. 00:19:24.620 --> 00:19:29.140 And so what TCP adds to the envelope is essentially a little mental note 00:19:29.140 --> 00:19:32.740 that this is packet number one of two, or one of three, or one of four, 00:19:32.740 --> 00:19:34.480 in the case of an even larger file. 00:19:34.480 --> 00:19:39.467 And so when the recipient of this email or this web request gets the envelope 00:19:39.467 --> 00:19:42.550 and realizes, wait a minute, I've got numbers two and three and four, wait 00:19:42.550 --> 00:19:44.620 a minute, I'm missing the first envelope. 
00:19:44.620 --> 00:19:47.800 TCP tells that Mac or PC or other computer, 00:19:47.800 --> 00:19:50.410 go ahead and send a message back to the sender saying, hey, 00:19:50.410 --> 00:19:53.850 I got everything except packet one, please resend. 00:19:53.850 --> 00:19:55.850 That's going to take a little bit of extra time, 00:19:55.850 --> 00:19:59.800 but that packet can be resent and TCP knows 00:19:59.800 --> 00:20:01.600 how to reassemble them in the proper order 00:20:01.600 --> 00:20:06.370 so that the human ultimately sees their entire email or that entire web page 00:20:06.370 --> 00:20:09.530 and not just some portion thereof. 00:20:09.530 --> 00:20:11.072 So what does TCP really look like? 00:20:11.072 --> 00:20:13.780 Well, let's just take a quick peek underneath the hood here, too. 00:20:13.780 --> 00:20:18.150 And here we see a similar pattern of bits but not addresses, that, again, 00:20:18.150 --> 00:20:22.530 is handled by IP itself, but you see mention of source port, 00:20:22.530 --> 00:20:23.820 and destination port. 00:20:23.820 --> 00:20:25.980 Sequence number, which helps with the delivery, 00:20:25.980 --> 00:20:28.320 and then other options as well, all of which 00:20:28.320 --> 00:20:31.660 relate to the delivery of that information. 00:20:31.660 --> 00:20:35.820 But these two up here, looks like 16 bits each, source port 00:20:35.820 --> 00:20:37.080 and destination port. 00:20:37.080 --> 00:20:39.750 Those two have value, because TCP does something else. 00:20:39.750 --> 00:20:42.820 It doesn't just guarantee that data gets from one point to another, 00:20:42.820 --> 00:20:47.670 it also helps servers distinguish one type of data from another, and in turn 00:20:47.670 --> 00:20:52.170 allows companies and universities and internet service providers 00:20:52.170 --> 00:20:54.720 or governments to distinguish different types of data 00:20:54.720 --> 00:20:57.510 because it's right there on the outside of the envelope. 
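The header fields just described (a 16-bit source port, a 16-bit destination port, then a sequence number) can be sketched in a few lines of Python. This is only an illustration: the helper names are made up for this example, the 32-bit width of the sequence number comes from the TCP specification rather than from the talk, and the rest of the real header is left out.

```python
import struct

# Hypothetical helper: models only the first three fields of a TCP header.
# "!" means network byte order, "H" is 16 bits, "I" is 32 bits.
HEADER_PREFIX = struct.Struct("!HHI")


def pack_prefix(source_port, destination_port, sequence_number):
    """Write the three fields onto the 'outside of the envelope' as bytes."""
    return HEADER_PREFIX.pack(source_port, destination_port, sequence_number)


def unpack_prefix(data):
    """Read the three fields back from the start of a byte string."""
    source_port, destination_port, sequence_number = HEADER_PREFIX.unpack(
        data[:HEADER_PREFIX.size]
    )
    return {
        "source_port": source_port,
        "destination_port": destination_port,
        "sequence_number": sequence_number,
    }
```

Packing source port 50000, destination port 80, and sequence number 1 yields 8 bytes, and the destination port lands in bytes 2 and 3, right where a server or router would look for it.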
00:20:57.510 --> 00:21:01.890 In particular, TCP specifies what protocol 00:21:01.890 --> 00:21:06.210 is being used to convey this packet of information from one computer 00:21:06.210 --> 00:21:07.090 to another. 00:21:07.090 --> 00:21:09.632 In other words, there's lots of internet services these days. 00:21:09.632 --> 00:21:12.190 There's email, there's chat, there's video conferencing, 00:21:12.190 --> 00:21:13.650 there's web browsers, and more. 00:21:13.650 --> 00:21:16.930 So that's a lot of possibilities, a lot of patterns of zeros and ones 00:21:16.930 --> 00:21:18.490 that can be in these envelopes. 00:21:18.490 --> 00:21:23.470 So how, upon receiving an envelope, does a server know what type of information 00:21:23.470 --> 00:21:24.030 is in it? 00:21:24.030 --> 00:21:25.200 Especially big companies. 00:21:25.200 --> 00:21:27.480 Google, for instance, supports all of those services. 00:21:27.480 --> 00:21:29.550 Video conferencing, email, chat, and more. 00:21:29.550 --> 00:21:32.760 So when Google's servers receive a packet of information, 00:21:32.760 --> 00:21:35.610 how does Google know that this is an email from you, 00:21:35.610 --> 00:21:40.350 as opposed to a chat message from you, as opposed to a video from you 00:21:40.350 --> 00:21:41.872 that you're uploading to YouTube? 00:21:41.872 --> 00:21:44.580 You need to be able to distinguish these various services because 00:21:44.580 --> 00:21:47.010 at the end of the day, they're just patterns of bits. 00:21:47.010 --> 00:21:49.560 Well, if we reserve some of those bits, or really 00:21:49.560 --> 00:21:53.820 some of the markings on this virtual envelope, for just one more number 00:21:53.820 --> 00:21:55.700 we can distinguish services pretty easily. 
00:21:55.700 --> 00:21:59.130 In fact, HTTP, an acronym that you might not 00:21:59.130 --> 00:22:01.530 know what it means but you've surely seen it a lot, 00:22:01.530 --> 00:22:03.720 stands for hypertext transfer protocol, and it's 00:22:03.720 --> 00:22:07.890 the conventions via which browsers and servers send web pages back and forth. 00:22:07.890 --> 00:22:09.870 Well, by convention, humans decided years 00:22:09.870 --> 00:22:12.600 ago to call that service number 80. 00:22:12.600 --> 00:22:15.180 TCP port 80, so to speak. 00:22:15.180 --> 00:22:19.920 And the secure version of that, HTTPS, they decided to number that 443, 00:22:19.920 --> 00:22:23.070 just because they'd already used quite a few numbers in between those two 00:22:23.070 --> 00:22:23.700 values. 00:22:23.700 --> 00:22:27.630 IMAP is the protocol via which you can receive emails or check 00:22:27.630 --> 00:22:30.930 your email; that uses different ports depending on whether you're using it 00:22:30.930 --> 00:22:33.480 securely or insecurely, like 143 or 993. 00:22:33.480 --> 00:22:38.820 SMTP, which is outbound email, can use similarly 25, 465, or 587. 00:22:38.820 --> 00:22:41.970 And then, if familiar, there's something called SSH, secure shell. 00:22:41.970 --> 00:22:44.070 This is what developers might use at a lower level 00:22:44.070 --> 00:22:47.730 to connect from one computer, say a laptop, to a remote server. 00:22:47.730 --> 00:22:49.750 That tends to use port 22. 00:22:49.750 --> 00:22:51.750 And there's hundreds, there's actually thousands 00:22:51.750 --> 00:22:55.920 of others, as many as 65,000 possibilities, but only some of those 00:22:55.920 --> 00:22:57.720 are actually standardized. 
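The port numbers mentioned above can be collected into a small lookup table, sketched here in Python. The table holds only the ports named in the talk, and the names `PORTS` and `service_for_port` are invented for this example. The "65,000 possibilities" figure follows from the port field being 16 bits wide, which gives exactly 65,536 values.

```python
# Ports named in the talk, grouped by service (not a complete registry).
PORTS = {
    "SSH": [22],
    "SMTP": [25, 465, 587],
    "HTTP": [80],
    "IMAP": [143, 993],
    "HTTPS": [443],
}

# A port number occupies 16 bits on the envelope, so this many values exist.
PORT_FIELD_BITS = 16
POSSIBLE_PORTS = 2 ** PORT_FIELD_BITS


def service_for_port(port):
    """Return the service conventionally associated with a port, if listed."""
    for name, ports in PORTS.items():
        if port in ports:
            return name
    return None
```

A server receiving an envelope marked 443 would, by this convention, hand it to its HTTPS software. A port missing from the table simply has no conventional meaning in this sketch.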
00:22:57.720 --> 00:22:59.910 So this is to say what ultimately is going 00:22:59.910 --> 00:23:03.600 on the outside of an envelope is not just a user's address 00:23:03.600 --> 00:23:07.170 but when I as a computer send a message to some other server 00:23:07.170 --> 00:23:11.700 and for instance my address is 5.6.7.8 I'll write 00:23:11.700 --> 00:23:13.570 that in the top corner of the envelope. 00:23:13.570 --> 00:23:17.430 If the recipients of this envelope are supposed to be 1.2.3.4 00:23:17.430 --> 00:23:19.810 I do write that in the middle of the envelope, 00:23:19.810 --> 00:23:25.860 but I need to further specify IP address 1.2.3.4 but port number, 00:23:25.860 --> 00:23:28.530 let's say, 80, if it's a request for a web page. 00:23:28.530 --> 00:23:31.433 So conventionally you would do :80 to distinguish that service. 00:23:31.433 --> 00:23:34.100 And then of course because of TCP I need to number these things, 00:23:34.100 --> 00:23:37.380 so if it's a big request or a big response I better write one of two, 00:23:37.380 --> 00:23:39.760 one of three, or one of four, or the like. 00:23:39.760 --> 00:23:41.760 And so the envelope I'm ultimately left with 00:23:41.760 --> 00:23:44.250 is something a little more like this. 00:23:44.250 --> 00:23:47.010 On the outside is this recipient's address, on the outside 00:23:47.010 --> 00:23:49.290 is the sender's address, and on the outside 00:23:49.290 --> 00:23:52.770 is the sequence number of some sort that specifies 00:23:52.770 --> 00:23:56.830 how many packets I've actually sent and hopefully will be received. 00:23:56.830 --> 00:24:00.840 So TCP then allows the recipient to see this envelope, realize, oh this 00:24:00.840 --> 00:24:01.800 is for my web server. 
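The "one of two, one of three" numbering described above can be sketched as follows. This follows the talk's simplified envelope metaphor and is not how TCP literally works: real TCP sequence numbers count bytes rather than whole packets, and the function names are made up for this example.

```python
def split_into_packets(message, size):
    """Chop a message into (number, total, chunk) envelopes, numbered from 1."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)] or [b""]
    total = len(chunks)
    return [(number, total, chunk) for number, chunk in enumerate(chunks, start=1)]


def missing_numbers(received):
    """List the packet numbers the recipient should ask the sender to resend."""
    if not received:
        return []  # nothing has arrived, so the total is not yet known
    total = received[0][1]
    seen = {number for number, _total, _chunk in received}
    return [number for number in range(1, total + 1) if number not in seen]


def reassemble(received):
    """Put the chunks back in order, whatever order they arrived in."""
    if missing_numbers(received):
        raise ValueError("cannot reassemble: some packets are still missing")
    by_number = {number: chunk for number, _total, chunk in received}
    return b"".join(by_number[number] for number in sorted(by_number))
```

If packets two and three arrive but packet one does not, `missing_numbers` reports `[1]`, which is the "please resend" message in the metaphor. Once packet one arrives, `reassemble` restores the original order.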
00:24:01.800 --> 00:24:04.470 Google can hand it off to the appropriate piece of software 00:24:04.470 --> 00:24:06.420 that governs its web servers and so it's not 00:24:06.420 --> 00:24:09.330 confused for something else like an email, a chat message, a voice 00:24:09.330 --> 00:24:11.910 conference, or the like. 00:24:11.910 --> 00:24:14.400 And again, all of these features derive quite 00:24:14.400 --> 00:24:17.400 simply from these patterns of bits that esoterically happen 00:24:17.400 --> 00:24:20.340 to be laid out in this way, but if we abstract away from that 00:24:20.340 --> 00:24:24.340 and stipulate that just think about it like the real world with an envelope, 00:24:24.340 --> 00:24:27.150 it's really just these numeric values that somehow help 00:24:27.150 --> 00:24:31.770 us get data from one point to another. 00:24:31.770 --> 00:24:36.690 Collectively now, these two protocols, which are so often used hand in hand, 00:24:36.690 --> 00:24:39.570 are generally abbreviated TCP/IP. 00:24:39.570 --> 00:24:41.970 It's two separate protocols, two separate conventions 00:24:41.970 --> 00:24:43.398 used in conjunction. 00:24:43.398 --> 00:24:45.940 Some of this information is just written in different places, 00:24:45.940 --> 00:24:49.650 if you will, on the virtual envelope, but TCP/IP settings are 00:24:49.650 --> 00:24:51.990 what you might look for on a Mac or PC or server 00:24:51.990 --> 00:24:55.020 to actually configure this level of detail. 00:24:55.020 --> 00:24:57.870 But of course, I've taken some liberties here. 00:24:57.870 --> 00:25:01.170 If my goal is to send a message from one computer 00:25:01.170 --> 00:25:04.230 to another, a chat message, an email, anything else, 00:25:04.230 --> 00:25:07.320 you know what, I'm pretty sure I have no idea what 00:25:07.320 --> 00:25:09.570 the IP address is of any colleague. 
00:25:09.570 --> 00:25:12.960 And I have no idea what the IP address is of Google or Facebook 00:25:12.960 --> 00:25:16.740 or any number of popular websites that I might even visit daily. 00:25:16.740 --> 00:25:20.310 I don't even know people's phone numbers anymore but that's another matter. 00:25:20.310 --> 00:25:23.280 In the context of words, though, on the internet all of us, 00:25:23.280 --> 00:25:26.700 of course, type words, not numbers, when we want to reach some destination. 00:25:26.700 --> 00:25:30.900 We go to facebook.com or gmail.com or google.com or bing.com 00:25:30.900 --> 00:25:34.520 or any number of other domain names, so to speak. 00:25:34.520 --> 00:25:36.270 And of course, that's what you would write 00:25:36.270 --> 00:25:38.640 on the outside of an envelope in the human world, 00:25:38.640 --> 00:25:43.530 ideally as many words as possible, not just numbers let alone bits alone. 00:25:43.530 --> 00:25:46.170 And ideally our computers would similarly 00:25:46.170 --> 00:25:49.770 express exactly what we humans know, which is these domain 00:25:49.770 --> 00:25:52.380 names that are part of URLs. 00:25:52.380 --> 00:25:56.460 So it turns out we need the help of at least one more service among all 00:25:56.460 --> 00:25:57.930 of these internet technologies. 00:25:57.930 --> 00:26:01.580 We need the help of a service called DNS, domain name system. 00:26:01.580 --> 00:26:07.020 A DNS server is a server that quite simply translates domain names 00:26:07.020 --> 00:26:11.670 like gmail.com and bing.com and google.com into their corresponding IP 00:26:11.670 --> 00:26:12.390 addresses. 00:26:12.390 --> 00:26:15.023 We, the humans, might have no idea what they are, 00:26:15.023 --> 00:26:17.940 but odds are there's at least one human or more in the world, probably 00:26:17.940 --> 00:26:20.220 who works for those companies, that does know. 
00:26:20.220 --> 00:26:22.620 And provided he or she configures their DNS 00:26:22.620 --> 00:26:26.370 servers to know that association of domain name to IP 00:26:26.370 --> 00:26:29.940 address, the equivalent of just an Excel file with one column with names 00:26:29.940 --> 00:26:33.990 and the other column with numbers, IP addresses well their server can then 00:26:33.990 --> 00:26:36.330 answer questions from little old me. 00:26:36.330 --> 00:26:39.810 And indeed what my phone knows how to do these days, what my Mac, my PC knows 00:26:39.810 --> 00:26:44.430 how to do is when my human types in gmail.com and hits enter, 00:26:44.430 --> 00:26:47.730 the very first thing that my browser, and in turn my operating 00:26:47.730 --> 00:26:51.510 system like Mac OS or Windows does, is it asks the local DNS 00:26:51.510 --> 00:26:55.680 server for the conversion of whatever I typed in, gmail.com, 00:26:55.680 --> 00:26:58.200 to the corresponding IP address. 00:26:58.200 --> 00:27:04.050 And hopefully, my own network be it at home or on campus or in work, 00:27:04.050 --> 00:27:05.870 has the answer to that question. 00:27:05.870 --> 00:27:07.590 But the world's a big place, and odds are 00:27:07.590 --> 00:27:11.100 my home does not know the IP address of every server in the world. 00:27:11.100 --> 00:27:14.670 Odds are my campus or company doesn't know the IP address of every server 00:27:14.670 --> 00:27:17.730 in the world, especially since they're surely changing continually 00:27:17.730 --> 00:27:21.300 as new sites are coming online and others are going offline. 00:27:21.300 --> 00:27:22.590 So how do we know? 00:27:22.590 --> 00:27:25.530 Well DNS is a whole hierarchical system whereby 00:27:25.530 --> 00:27:30.360 you might have a small DNS server, so to speak conceptually here on site. 
00:27:30.360 --> 00:27:34.170 But then your internet service provider or ISP, Comcast, Verizon, 00:27:34.170 --> 00:27:38.730 or some other entity, they probably have a bigger DNS server with more memory, 00:27:38.730 --> 00:27:42.273 with a longer list of domain names and IP addresses. 00:27:42.273 --> 00:27:44.440 And you know what, even if they don't know everyone, 00:27:44.440 --> 00:27:47.280 there are probably what are called root servers in the world, 00:27:47.280 --> 00:27:50.430 that much like the root of a tree, is where everything starts. 00:27:50.430 --> 00:27:53.190 And indeed, you can find out from these actual root 00:27:53.190 --> 00:27:56.100 servers on the internet, the mapping, effectively, 00:27:56.100 --> 00:27:58.890 between all of the dot coms and their IP addresses. 00:27:58.890 --> 00:28:01.680 All of the dot govs or the dot nets and their IP addresses. 00:28:01.680 --> 00:28:05.700 And frankly, even if they don't know the answer by definition of root server 00:28:05.700 --> 00:28:08.790 they will be configured to know who knows. 00:28:08.790 --> 00:28:13.050 And so DNS is very hierarchical, and it's also recursive. 00:28:13.050 --> 00:28:17.040 You might ask a local server, which might ask a more remote server, which 00:28:17.040 --> 00:28:19.110 might ask an even further away server. 00:28:19.110 --> 00:28:22.440 That server might say, wait a minute, I know, this server knows, and then 00:28:22.440 --> 00:28:24.870 the answer eventually bubbles its way back to you. 00:28:24.870 --> 00:28:26.673 And long story short, we can be efficient. 00:28:26.673 --> 00:28:28.590 We don't have to constantly ask this question. 00:28:28.590 --> 00:28:30.660 We can cache those results locally. 00:28:30.660 --> 00:28:34.590 Remember them in my browser, in my Mac or my PC. 00:28:34.590 --> 00:28:36.480 There's downsides there, though, too. 
00:28:36.480 --> 00:28:39.630 By remembering that mapping of domain name to IP address, 00:28:39.630 --> 00:28:43.680 I can save myself the trouble of asking that same question multiple times a day 00:28:43.680 --> 00:28:45.810 or even per week or even per minute. 00:28:45.810 --> 00:28:48.690 The catch, though, is that if Google changes something, or Facebook 00:28:48.690 --> 00:28:51.300 reconfigures something and that IP changes, 00:28:51.300 --> 00:28:53.110 caching might actually be a bad thing. 00:28:53.110 --> 00:28:55.530 And so here, too, even at the level of the internet 00:28:55.530 --> 00:28:57.300 do we see these series of trade-offs. 00:28:57.300 --> 00:29:01.860 You might save time by caching, but you might sacrifice correctness, 00:29:01.860 --> 00:29:06.990 because now the server's recollection of that IP address might become outdated. 00:29:06.990 --> 00:29:09.000 And so this is a whole can of worms, ultimately, 00:29:09.000 --> 00:29:11.130 and speaks to what it really means to be an engineer 00:29:11.130 --> 00:29:13.880 in the world of internet technologies to anticipate to think about 00:29:13.880 --> 00:29:15.930 and ultimately to solve these problems. 00:29:15.930 --> 00:29:21.000 There is no sure fire solution other than to expect that you'll need 00:29:21.000 --> 00:29:23.650 to accommodate these changes over time. 00:29:23.650 --> 00:29:25.440 So in Windows, can you see this yourself? 00:29:25.440 --> 00:29:28.500 Well, if you open up those same Wi-Fi properties or wired 00:29:28.500 --> 00:29:31.470 properties that you have, you'll see again, not only your IPv4 address, 00:29:31.470 --> 00:29:35.550 but it was there all this time, your IPv4 DNS servers one 00:29:35.550 --> 00:29:39.750 or more IP addresses turns out it's exactly the same by coincidence 00:29:39.750 --> 00:29:44.640 but also by design on this computer of my router or my default gateway 00:29:44.640 --> 00:29:47.070 192.168.1.1. 
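The caching trade-off described a moment ago, saving time at the risk of a stale answer, can be sketched with a small cache whose entries expire. The expiry time (often called a time to live, or TTL) is an assumption added here as one common way to bound staleness. The class name, the `lookup` callback, and the injectable `clock` are all hypothetical and exist only for this example.

```python
import time


class CachingResolver:
    """Remembers name-to-address answers for a limited number of seconds."""

    def __init__(self, lookup, ttl_seconds, clock=time.monotonic):
        self._lookup = lookup          # function that asks the "bigger" server
        self._ttl = ttl_seconds        # how long a remembered answer is trusted
        self._clock = clock            # injectable so the sketch can be tested
        self._cache = {}               # name -> (address, expiry time)

    def resolve(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]            # fast path: possibly outdated answer
        address = self._lookup(name)   # slow path: ask upstream again
        self._cache[name] = (address, now + self._ttl)
        return address
```

With a 60-second expiry, a change made upstream goes unnoticed for up to a minute. That is the loss of correctness the talk mentions, traded for not asking the same question over and over.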
00:29:47.070 --> 00:29:50.950 Which is to say that if this PC needs to know an answer to the question, 00:29:50.950 --> 00:29:55.830 what is gmail.com's IP address it is simply going to ask the local server 00:29:55.830 --> 00:29:59.310 that has that address and that DNS server, and this is important, 00:29:59.310 --> 00:30:00.900 cannot have itself a name. 00:30:00.900 --> 00:30:04.320 We need to know what its IP address is, otherwise, of course, 00:30:04.320 --> 00:30:05.580 we get into this endless loop. 00:30:05.580 --> 00:30:08.640 If we know only the name of our DNS server but only the DNS server 00:30:08.640 --> 00:30:12.370 can convert that to an IP address, we'll never actually answer that question. 00:30:12.370 --> 00:30:14.430 It's more of a catch-22. 00:30:14.430 --> 00:30:16.380 And even if it does have a name, you need 00:30:16.380 --> 00:30:21.370 to know manually, via your DHCP server somehow, what its IP address actually 00:30:21.370 --> 00:30:21.870 is. 00:30:21.870 --> 00:30:22.800 Mac OS, the same. 00:30:22.800 --> 00:30:26.370 And here on campus, Harvard happens to have redundancy like most any company. 00:30:26.370 --> 00:30:29.470 They don't have just one DNS server they have at least three here, 00:30:29.470 --> 00:30:33.250 128.103.1.1, and a couple of others, as well. 00:30:33.250 --> 00:30:36.300 And again, I got these automatically when I turned on my Mac or my phone 00:30:36.300 --> 00:30:41.220 or my PC via that local DHCP server. 00:30:41.220 --> 00:30:44.550 So let's see if we can't mimic what it is my Mac, 00:30:44.550 --> 00:30:48.330 your PC, your phone is doing everyday all day long, but rather 00:30:48.330 --> 00:30:49.710 unbeknownst to us. 00:30:49.710 --> 00:30:51.750 Here I have what's called a terminal window. 00:30:51.750 --> 00:30:54.300 This is just a textual interface to my computer here. 00:30:54.300 --> 00:30:57.660 Can exist on Macs, or PCs, or other operating systems, as well. 
00:30:57.660 --> 00:31:01.260 And it allows me to execute by typing commands textually, 00:31:01.260 --> 00:31:04.620 only at my keyboard, no mouse, exactly the types of commands 00:31:04.620 --> 00:31:08.850 that your browser and other software are effectively executing or running 00:31:08.850 --> 00:31:09.570 for you. 00:31:09.570 --> 00:31:11.430 For instance, suppose I genuinely do want 00:31:11.430 --> 00:31:12.990 to know the IP address of gmail.com. 00:31:12.990 --> 00:31:16.020 I can ask this program as follows. 00:31:16.020 --> 00:31:19.170 nslookup, for name server look up, and then I can go ahead 00:31:19.170 --> 00:31:22.140 and type literally gmail.com and hit Enter. 00:31:22.140 --> 00:31:28.620 Here, visually, we see on the screen one answer that it's 172.217.3.37. 00:31:28.620 --> 00:31:31.740 And this comes from a server whose IP address in this room 00:31:31.740 --> 00:31:36.270 is 10.0.0.2, which we know now to be a private IP address, 00:31:36.270 --> 00:31:38.640 and indeed, here on campus we have servers 00:31:38.640 --> 00:31:42.480 that are local only to this room, this building, or this set of buildings 00:31:42.480 --> 00:31:42.987 here. 00:31:42.987 --> 00:31:44.820 Now this is a little interesting because I'm 00:31:44.820 --> 00:31:48.000 pretty sure business is good for Google, and surely they 00:31:48.000 --> 00:31:51.000 don't have just one server and therefore one IP address. 00:31:51.000 --> 00:31:54.720 Well, it turns out that there's a whole hierarchy of servers out there, most 00:31:54.720 --> 00:31:59.520 likely, that my data goes to and thereafter through on Google's end. 
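The lookup that nslookup performed above can be approximated from a program as well. This sketch uses Python's standard `socket.getaddrinfo`, which asks the operating system's resolver rather than querying a specific DNS server the way nslookup can. The function name `lookup_ipv4` is made up for this example.

```python
import socket


def lookup_ipv4(hostname):
    """Return the IPv4 addresses the operating system's resolver reports."""
    results = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each result is (family, type, proto, canonname, sockaddr), and for IPv4
    # sockaddr is an (address, port) pair. Only the address is kept.
    return sorted({sockaddr[0] for _f, _t, _p, _c, sockaddr in results})
```

Calling `lookup_ipv4("gmail.com")` on a networked machine returns whichever addresses are being handed out at that moment. As the talk notes, those may differ from the one shown on screen, and there may be more than one.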
00:31:59.520 --> 00:32:05.160 The one IP address that they're telling me is theirs is 172.217.3.37, 00:32:05.160 --> 00:32:08.850 but once my packet of information gets there to Mountain View, California, 00:32:08.850 --> 00:32:11.580 or wherever their servers happen to be closest to me, 00:32:11.580 --> 00:32:16.290 then they might have any number of servers, dozens, hundreds, thousands, 00:32:16.290 --> 00:32:18.800 that can actually receive that packet next. 00:32:18.800 --> 00:32:22.380 This just happens to be the outward facing IP that my own Mac or PC 00:32:22.380 --> 00:32:24.480 or phone actually sees. 00:32:24.480 --> 00:32:28.260 Well, let's see if we can't trace the route to gmail.com via another command, 00:32:28.260 --> 00:32:32.550 literally traceroute, can I see the packets of information line 00:32:32.550 --> 00:32:36.840 by line leaving my computer and making their way, ultimately, to Google. 00:32:36.840 --> 00:32:39.270 I'm going to go ahead and do this once, so dash q1 00:32:39.270 --> 00:32:41.502 means do one query, please, at a time. 00:32:41.502 --> 00:32:43.710 And then I'm going to go ahead and say, quite simply, 00:32:43.710 --> 00:32:45.990 gmail.com, and then Enter. 00:32:45.990 --> 00:32:50.400 And we will see, line by line, the sequence of IP addresses 00:32:50.400 --> 00:32:54.930 of every router that is to say hop between me and Gmail. 00:32:54.930 --> 00:32:57.253 On occasion we'll see these asterisks instead, 00:32:57.253 --> 00:32:59.670 which indicates that that router isn't having any of this, 00:32:59.670 --> 00:33:03.660 it's not responding to my requests, so we can't see its IP or anything 00:33:03.660 --> 00:33:04.650 else about it. 
00:33:04.650 --> 00:33:09.420 But we can see that in 17 steps does data leave my laptop 00:33:09.420 --> 00:33:12.150 and end up at gmail.com, and along the way 00:33:12.150 --> 00:33:16.830 it encounters all of these routers that have these unique IP addresses but not 00:33:16.830 --> 00:33:18.870 names, it seems, and the amount of time it 00:33:18.870 --> 00:33:21.420 takes for my data to get from my laptop to gmail.com 00:33:21.420 --> 00:33:26.220 is, oh my, 0.967 milliseconds. 00:33:26.220 --> 00:33:29.940 Less than one millisecond is required to get data or an email 00:33:29.940 --> 00:33:32.760 from my computer to gmail.com itself. 00:33:32.760 --> 00:33:36.450 Now what about all of these other measurements of time up above? 00:33:36.450 --> 00:33:38.580 Each of these represents the number of milliseconds 00:33:38.580 --> 00:33:43.410 it took during this process for data to go from my laptop to this router, 00:33:43.410 --> 00:33:45.940 then to this router, then to this router. 00:33:45.940 --> 00:33:48.390 Now, of course, it seems strange that it takes more time 00:33:48.390 --> 00:33:52.390 to get these to these close routers than it does to these further away. 00:33:52.390 --> 00:33:54.410 But there, too, if I ran this all day long 00:33:54.410 --> 00:33:56.160 I would get different numbers continually, 00:33:56.160 --> 00:33:58.737 it depends how busy those routers are at that moment in time. 00:33:58.737 --> 00:34:00.570 It depends what else everyone here on campus 00:34:00.570 --> 00:34:03.840 is doing, or other people in the world at that moment in time. 00:34:03.840 --> 00:34:06.660 Routers might be a little slow to respond because they're 00:34:06.660 --> 00:34:07.920 busy doing something else. 00:34:07.920 --> 00:34:10.050 My data might get dropped in other contexts 00:34:10.050 --> 00:34:12.400 and need to be resent, which is just going to take time, 00:34:12.400 --> 00:34:15.030 and I don't even see that happening on the screen. 
00:34:15.030 --> 00:34:18.870 But it's fair to say that these give us a sense of the range of times 00:34:18.870 --> 00:34:21.270 it might take to go from a point A to point B, 00:34:21.270 --> 00:34:26.040 and let's say 1 to 20 milliseconds or even 32 milliseconds, somewhere 00:34:26.040 --> 00:34:29.370 in there is our average, and that can vary over time. 00:34:29.370 --> 00:34:32.489 But that's pretty fast, and indeed, even though it took a moment 00:34:32.489 --> 00:34:36.120 to run this whole test, this is why an email can be sent from your computer 00:34:36.120 --> 00:34:38.460 and be received nearly instantly by someone 00:34:38.460 --> 00:34:41.760 around the world, because at the end of the day, we're limited, really, 00:34:41.760 --> 00:34:44.880 ultimately, by the speed of light and little more. 00:34:44.880 --> 00:34:48.449 Well, to be fair, hardware and cost and everything in between, 00:34:48.449 --> 00:34:52.469 but you can certainly transmit your data faster than you can yourself. 00:34:52.469 --> 00:34:55.139 But what if we want to go farther away than gmail.com? 00:34:55.139 --> 00:34:57.690 Odds are they probably do have servers in California, 00:34:57.690 --> 00:35:01.380 but probably here on the east coast of the US as well, let alone abroad. 00:35:01.380 --> 00:35:04.050 What if I deliberately try to access a domain that is, 00:35:04.050 --> 00:35:06.150 in fact, abroad and go there? 00:35:06.150 --> 00:35:09.160 Well, let me go ahead and visit via traceroute, 00:35:09.160 --> 00:35:16.230 say, www.cnn.co.jp, the domain name for CNN's Japanese website. 00:35:16.230 --> 00:35:18.390 And then we'll add just dash q1 this time 00:35:18.390 --> 00:35:22.020 at the end, which is fine, too, to query the server just once. 
00:35:22.020 --> 00:35:24.360 And here we see the sequence of steps, one 00:35:24.360 --> 00:35:27.900 after another, whereby the data's leaving my laptop and in turn campus, 00:35:27.900 --> 00:35:30.300 and then we see some anonymous routers in between. 00:35:30.300 --> 00:35:36.540 But the 30th there seems to be just in time, because within it 00:35:36.540 --> 00:35:40.710 seems 178 milliseconds do we make our way to Japan. 00:35:40.710 --> 00:35:44.220 Now that's quite a few milliseconds more, but that rather makes sense. 00:35:44.220 --> 00:35:47.400 Whereas it might take one to 20 to 32 milliseconds 00:35:47.400 --> 00:35:50.940 to get from here to Gmail either on the east coast or west coast, 00:35:50.940 --> 00:35:54.120 I'm kind of not surprised that it takes an order of magnitude 00:35:54.120 --> 00:35:59.070 more, almost a factor of 10, to get to Japan, because there's not only 00:35:59.070 --> 00:36:01.920 a whole continent between us here in Cambridge and Japan, 00:36:01.920 --> 00:36:04.770 there's also an entire Pacific Ocean between us. 00:36:04.770 --> 00:36:09.030 And indeed, there are transatlantic, transpacific, and transoceanic cables 00:36:09.030 --> 00:36:12.000 all around the world these days that actually transmit our data, 00:36:12.000 --> 00:36:15.450 not to mention all of the wireless technologies we have, satellites 00:36:15.450 --> 00:36:16.385 and below. 00:36:16.385 --> 00:36:19.260 And so it does stand to reason that even though none of these routers 00:36:19.260 --> 00:36:22.230 were paying attention to me at that moment for privacy sake, 00:36:22.230 --> 00:36:26.460 this last one indicates that 200 milliseconds later we can get halfway 00:36:26.460 --> 00:36:28.950 across the world digitally. 
00:36:28.950 --> 00:36:32.310 And so that does rather speak to just how quickly these low level 00:36:32.310 --> 00:36:35.340 primitives operate, and we can talk far longer 00:36:35.340 --> 00:36:38.400 about how these things work than it actually takes time 00:36:38.400 --> 00:36:40.890 to actually get the data there. 00:36:40.890 --> 00:36:44.940 So then together we have TCP/IP. Via DHCP can we 00:36:44.940 --> 00:36:49.410 get the addresses that we need to use to address my envelopes and others, 00:36:49.410 --> 00:36:49.980 as well. 00:36:49.980 --> 00:36:54.180 Via DNS can we convert those domain names into IP addresses and even back. 00:36:54.180 --> 00:36:56.910 And those internet technologies are ultimately 00:36:56.910 --> 00:37:02.890 what govern how our data gets from point A to point B. But what is the data? 00:37:02.890 --> 00:37:05.700 Indeed, everything thus far is really just metadata. 00:37:05.700 --> 00:37:09.150 Information that helps our actual data that we care about get from one 00:37:09.150 --> 00:37:10.192 point to another. 00:37:10.192 --> 00:37:12.900 But it's the data at the end of the day that I really care about. 00:37:12.900 --> 00:37:16.140 The contents of my email, the contents of my chat message, 00:37:16.140 --> 00:37:19.860 the voice that I'm sending over a video conference, or even just 00:37:19.860 --> 00:37:21.275 the contents of a web page. 00:37:21.275 --> 00:37:24.150 Indeed, perhaps the most popular service that you and I use every day 00:37:24.150 --> 00:37:27.270 is just that, pulling up pages on the web. 00:37:27.270 --> 00:37:32.250 So just how is a web page specifically requested and received? 00:37:32.250 --> 00:37:36.300 Well, it turns out that http:// that you've surely seen, 00:37:36.300 --> 00:37:39.450 but probably not typed for some time, because your browser, odds are, 00:37:39.450 --> 00:37:42.840 just inserts it automatically or even invisibly for you. 
00:37:42.840 --> 00:37:47.970 That HTTP is yet another protocol in this stack of internet technologies. 00:37:47.970 --> 00:37:50.550 Hypertext transfer protocol. 00:37:50.550 --> 00:37:55.830 A set of conventions that browsers and web servers have agreed upon long ago 00:37:55.830 --> 00:37:57.900 to use when intercommunicating. 00:37:57.900 --> 00:38:00.540 And to be clear, then, what exactly is a protocol? 00:38:00.540 --> 00:38:01.760 Well, it's just a convention. 00:38:01.760 --> 00:38:04.620 We humans have protocols even though we might not call them such. 00:38:04.620 --> 00:38:07.453 When I meet someone new on the street I might reach up to him or her 00:38:07.453 --> 00:38:09.420 and say, hello, my name is David. 00:38:09.420 --> 00:38:13.020 And that protocol results in that other person, 00:38:13.020 --> 00:38:16.770 if polite, in extending their hand too, reaching into mine 00:38:16.770 --> 00:38:20.430 and probably saying as well, hello, nice to meet you or how are you. 00:38:20.430 --> 00:38:23.190 That's a human protocol that we were taught some time ago, 00:38:23.190 --> 00:38:26.820 and culturally we have all agreed here in the US to, generally speaking, 00:38:26.820 --> 00:38:28.740 greet each other in that manner. 00:38:28.740 --> 00:38:31.020 Computers, similarly, have standardized what 00:38:31.020 --> 00:38:35.850 goes not only on the outside of these envelopes but what goes in the inside, 00:38:35.850 --> 00:38:37.020 as well. 00:38:37.020 --> 00:38:39.270 And so if, for instance, the goal at hand 00:38:39.270 --> 00:38:44.490 is to request a web page of a canonical website like www.example.com, 00:38:44.490 --> 00:38:47.610 let's consider exactly what is inside of this envelope. 00:38:47.610 --> 00:38:51.810 Well, first of all here we have a proper URL, uniform resource locator. 
00:38:51.810 --> 00:38:56.010 These days, your browser, whether it's Chrome or Safari or Edge or Firefox, 00:38:56.010 --> 00:38:58.950 probably doesn't even show you all of this information. 00:38:58.950 --> 00:39:02.280 In the interests of simpler user interfaces or UIs, 00:39:02.280 --> 00:39:05.370 browsers have started to hide this so-called protocol here 00:39:05.370 --> 00:39:09.210 at the left, even the www here, the hostname in the middle, 00:39:09.210 --> 00:39:12.780 leaving you oftentimes with just example.com or the equivalent 00:39:12.780 --> 00:39:14.657 somewhere at the top of your screen. 00:39:14.657 --> 00:39:16.740 But if you click on that address, typically you'll 00:39:16.740 --> 00:39:18.930 see more information such as that here. 00:39:18.930 --> 00:39:22.020 And sometimes there's more information that's just implicit. 00:39:22.020 --> 00:39:26.670 It turns out if you try to visit http://www.example.com 00:39:26.670 --> 00:39:31.310 or any similar domain name, what you're likely reaching for is 00:39:31.310 --> 00:39:33.290 a very specific file on that server. 00:39:33.290 --> 00:39:34.550 But how do we reach it? 00:39:34.550 --> 00:39:37.850 Well, highlighted in yellow here is what's called the domain name itself, 00:39:37.850 --> 00:39:39.260 example.com. 00:39:39.260 --> 00:39:41.690 This is something that you buy, or really rent, 00:39:41.690 --> 00:39:45.620 on an annual basis via an internet registrar, a company, that 00:39:45.620 --> 00:39:48.860 via the associations on the internet that govern IP addresses 00:39:48.860 --> 00:39:51.560 and domain names has been authorized to sell, or really 00:39:51.560 --> 00:39:55.490 rent, you and anyone else a domain name for some amount of time, 00:39:55.490 --> 00:39:58.280 usually one year or two years or 10 or anywhere 00:39:58.280 --> 00:40:00.860 in between, for some dollar amount. 
00:40:00.860 --> 00:40:03.680 And what you get, then, is the ability, for that amount of time 00:40:03.680 --> 00:40:07.250 renewable thereafter, to use that specific domain name. 00:40:07.250 --> 00:40:10.070 It might be dot com, or dot net or dot org, 00:40:10.070 --> 00:40:15.050 or any number of hundreds of others of TLDs, or top level domains. 00:40:15.050 --> 00:40:18.943 Indeed, that suffix there is what represents the type of website, 00:40:18.943 --> 00:40:20.360 at least historically, that it is. 00:40:20.360 --> 00:40:24.680 Dot com for commercial, dot net for network, dot edu for education, 00:40:24.680 --> 00:40:26.460 or dot gov for government. 00:40:26.460 --> 00:40:28.820 Of course, all of those TLDs, or top level domains, 00:40:28.820 --> 00:40:32.120 were very US centric by design, insofar as it 00:40:32.120 --> 00:40:36.350 was generally a cohort of Americans that designed a lot of this system 00:40:36.350 --> 00:40:37.070 initially. 00:40:37.070 --> 00:40:39.890 Of course, other countries have shorter TLDs. 00:40:39.890 --> 00:40:44.150 Country codes, dot US, dot JP and others that signify 00:40:44.150 --> 00:40:46.100 a specific country in which they're in. 00:40:46.100 --> 00:40:49.430 And these days anyone can buy a dot com or dot net, 00:40:49.430 --> 00:40:53.930 but not everyone can buy a dot gov or dot edu, or several other top level 00:40:53.930 --> 00:40:54.650 domains, as well. 00:40:54.650 --> 00:40:58.700 It depends on whoever controls that particular suffix. 00:40:58.700 --> 00:41:03.380 This here we might call the hostname, the name of the specific server 00:41:03.380 --> 00:41:06.380 that you were trying to visit that lives within that domain name. 00:41:06.380 --> 00:41:09.050 In other contexts, you might call this a subdomain, 00:41:09.050 --> 00:41:12.770 indicating what subdivision of a company or university you're actually 00:41:12.770 --> 00:41:14.450 trying to access.
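To make those labels concrete, here is a minimal sketch in Python that pulls the protocol, hostname, domain name, and TLD out of a URL. The function name describe_url is hypothetical, and the sketch assumes, naively, that the domain name is always the last two dot-separated labels, which does not hold for suffixes like co.uk.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the parts named in the lecture (naive sketch)."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,           # e.g. "http"
        "hostname": ".".join(labels[:-2]),  # e.g. "www" ("" if none was given)
        "domain": ".".join(labels[-2:]),    # assumption: last two labels
        "tld": labels[-1],                  # e.g. "com"
    }


info = describe_url("http://www.example.com/")
print(info)
```

With http://harvard.edu as input, the same function reports an empty hostname, since only the domain name was given.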
00:41:14.450 --> 00:41:19.070 And then down here on the right, implicitly so to speak, is a file name. 00:41:19.070 --> 00:41:21.830 It is human convention, but not required, that the name 00:41:21.830 --> 00:41:27.440 of the file that contains the web page that a server serves up by default, 00:41:27.440 --> 00:41:30.350 happens to be traditionally index.html. 00:41:30.350 --> 00:41:35.060 It could also be index.htm or any number of other names or extensions, 00:41:35.060 --> 00:41:37.140 but this is among the most common. 00:41:37.140 --> 00:41:40.450 So if you don't mention that via just a slash, it's implied, 00:41:40.450 --> 00:41:43.070 and it's that file or any other file, that's 00:41:43.070 --> 00:41:48.230 implied or even specified explicitly that is inside of this envelope. 00:41:48.230 --> 00:41:51.740 That's the whole point of this virtual packet of information, 00:41:51.740 --> 00:41:56.450 to encapsulate the request for a page and the actual page itself. 00:41:56.450 --> 00:41:59.600 At the end of the day, it's HTML, Hypertext Markup Language, 00:41:59.600 --> 00:42:03.620 an actual language in which pages are written, that's inside that envelope, 00:42:03.620 --> 00:42:06.740 but it's transmitted there via HTTP. 00:42:06.740 --> 00:42:10.820 The protocol, the set of conventions via which browser and server agree 00:42:10.820 --> 00:42:14.850 to send and receive that information. 00:42:14.850 --> 00:42:16.770 So what does that information look like? 00:42:16.770 --> 00:42:19.670 And just what have these computers agreed on? 00:42:19.670 --> 00:42:22.400 It turns out that inside of this envelope, 00:42:22.400 --> 00:42:27.080 when it represents a request for a web page like my URL there, 00:42:27.080 --> 00:42:29.030 are these lines here. 
00:42:29.030 --> 00:42:33.230 GET / HTTP/1.1, where GET is clearly a verb, 00:42:33.230 --> 00:42:35.900 by definition in all caps in this protocol, 00:42:35.900 --> 00:42:40.500 slash means the default page of the website index.html or something else. 00:42:40.500 --> 00:42:44.060 And then often a mention of host colon and then the name of the host 00:42:44.060 --> 00:42:45.650 that you're actually looking for. 00:42:45.650 --> 00:42:47.990 Because it turns out servers can do so many things. 00:42:47.990 --> 00:42:51.110 Not just Google servers with voice and chat and other services, 00:42:51.110 --> 00:42:54.410 one web server can actually serve up multiple websites. 00:42:54.410 --> 00:42:59.660 Example.com, acme.com, Harvard.edu, google.com, all of us 00:42:59.660 --> 00:43:04.070 can actually have shared tenancy, so to speak, on the same server in theory. 00:43:04.070 --> 00:43:08.510 And so by mentioning what actual website you want inside of the envelope, 00:43:08.510 --> 00:43:12.560 the recipient of this envelope can make sure that it serves you my home page 00:43:12.560 --> 00:43:14.030 and not someone else's. 00:43:14.030 --> 00:43:17.210 But beyond that, there needs to be additional information, as well. 00:43:17.210 --> 00:43:20.030 You might explicitly specify the name of the file. 00:43:20.030 --> 00:43:23.750 And again, we humans have nothing to do with any of this, ultimately, 00:43:23.750 --> 00:43:25.760 we have just typed that URL. 00:43:25.760 --> 00:43:29.000 But it's our browser, on Mac OS or Windows or phones, 00:43:29.000 --> 00:43:32.870 that's packaging up this information inside of a virtual envelope 00:43:32.870 --> 00:43:36.470 and sending it out, ultimately, on our behalf.
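As a rough illustration of what the browser packages up, here is a minimal Python sketch that assembles the text of such a request. The function name build_request is hypothetical, and the Connection: close header is an assumption added only to keep the example self-contained; a real browser sends many more headers than this.

```python
def build_request(host: str, path: str = "/") -> bytes:
    """Return the text placed inside the envelope for a simple GET."""
    lines = [
        f"GET {path} HTTP/1.1",  # verb, requested path, protocol version
        f"Host: {host}",         # which website on a possibly shared server
        "Connection: close",     # assumption: ask for one reply, then hang up
        "",                      # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_request("www.example.com")
print(request.decode("ascii"))
```

Passing a path such as /index.html as the second argument names the file explicitly instead of relying on the server's default.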
00:43:36.470 --> 00:43:40.430 And indeed, if all goes well and that envelope reaches point B 00:43:40.430 --> 00:43:44.780 and it's opened up and it represents the name of a web page that does, 00:43:44.780 --> 00:43:47.540 in fact, exist, the response that I hope to get back 00:43:47.540 --> 00:43:51.230 in another envelope from point B to point A 00:43:51.230 --> 00:43:54.590 is going to contain an HTTP message like this. 00:43:54.590 --> 00:43:59.480 Literally the name of the protocol again, HTTP/1.1, and then a number, 00:43:59.480 --> 00:44:01.140 and optionally a phrase. 00:44:01.140 --> 00:44:03.800 200 is perhaps a number you've never actually seen, 00:44:03.800 --> 00:44:06.490 even though it is the best possible response to get. 00:44:06.490 --> 00:44:09.560 200 means, quite literally, OK. 00:44:09.560 --> 00:44:11.480 The web page you requested has been found 00:44:11.480 --> 00:44:15.170 and has been delivered in this response envelope, OK. 00:44:15.170 --> 00:44:19.220 The type of content you've received is in this case text/html. 00:44:19.220 --> 00:44:23.660 Which is to say inside of that envelope is a clue to your browser 00:44:23.660 --> 00:44:26.240 what kind of content is inside deeper. 00:44:26.240 --> 00:44:29.120 Is it text/html, like the contents of a web page? 00:44:29.120 --> 00:44:33.570 Is it an image/png like a graphic, or image/gif, something animated, 00:44:33.570 --> 00:44:39.720 or video/mp4, an actual video file, this so-called MIME type or content 00:44:39.720 --> 00:44:43.080 type is inside of the envelope for your browser so as to provide a hint, 00:44:43.080 --> 00:44:45.660 so as to know how to display it on the screen. 00:44:45.660 --> 00:44:48.750 There's so many other headers, as well, but these two alone 00:44:48.750 --> 00:44:51.330 really specify almost as much information 00:44:51.330 --> 00:44:54.950 as you need in order to render that response for the user.
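To see how little a browser needs from those first lines, here is a minimal Python sketch that reads the status line and the headers out of a response. The function name parse_response_head and the sample text are hypothetical, made up to mirror the two lines just described; real responses carry more headers and a body after the blank line.

```python
def parse_response_head(head: str):
    """Return (version, code, phrase, headers) from a response's first lines."""
    lines = head.split("\r\n")
    # Status line: protocol version, a number, and optionally a phrase.
    version, code, *phrase = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:  # a blank line ends the headers
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), (phrase[0] if phrase else ""), headers


sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
version, code, phrase, headers = parse_response_head(sample)
print(code, phrase, headers["content-type"])  # prints: 200 OK text/html
```

Header names are lowercased here because HTTP treats them as case-insensitive, which makes the lookup predictable.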
00:44:54.950 --> 00:44:56.700 Now as an aside, there are other versions. 00:44:56.700 --> 00:44:59.370 And increasingly in vogue, though not yet omnipresent, 00:44:59.370 --> 00:45:03.450 is HTTP/2 which has additional features, particularly for performance 00:45:03.450 --> 00:45:05.820 and getting data to you even more quickly. 00:45:05.820 --> 00:45:08.580 It simply replaces that 1.1 with a two and the response, 00:45:08.580 --> 00:45:11.880 though, comes back almost the same. 00:45:11.880 --> 00:45:15.840 So let's consider an example then, such as harvard.edu. 00:45:15.840 --> 00:45:23.310 It turns out that http://harvard.edu is not where Harvard wants you to be. 00:45:23.310 --> 00:45:25.890 In fact, let me go ahead and pull up my browser here 00:45:25.890 --> 00:45:28.630 and visit precisely that URL. 00:45:28.630 --> 00:45:33.480 http://harvard.edu, Enter. 00:45:33.480 --> 00:45:37.200 And within seconds do I find myself not at harvard.edu, but rather 00:45:37.200 --> 00:45:45.480 at www.harvard.edu and moreover at https://www.harvard.edu. 00:45:45.480 --> 00:45:48.990 In other words, even though I specified a protocol of HTTP, 00:45:48.990 --> 00:45:53.320 a domain name of harvard.edu, and no hostname, so to speak, 00:45:53.320 --> 00:45:55.890 I have actually been whisked away, seemingly magically, 00:45:55.890 --> 00:46:01.260 to this URL instead, for reasons both technical and perhaps marketing alike. 00:46:01.260 --> 00:46:05.250 For today, though, let's focus on exactly how this came to pass. 00:46:05.250 --> 00:46:08.100 Well, it turns out that inside of the envelope with which Harvard, 00:46:08.100 --> 00:46:11.850 or any server, replies to me can be additional metadata, as well. 00:46:11.850 --> 00:46:15.180 Not just 200 OK, but really the equivalent of uh-uh, 00:46:15.180 --> 00:46:18.360 there's nothing to see here, go here instead.
00:46:18.360 --> 00:46:22.410 So let me go ahead and run a program, again in that black and white window 00:46:22.410 --> 00:46:24.480 known as my terminal window, whereby I can 00:46:24.480 --> 00:46:27.240 pretend to be a browser without all of the graphics 00:46:27.240 --> 00:46:29.970 and without all of the distraction and focus only 00:46:29.970 --> 00:46:32.760 on the contents of those digital envelopes. 00:46:32.760 --> 00:46:36.840 Here the program I'm going to run is called curl for connect to a URL, 00:46:36.840 --> 00:46:41.820 and I'm going to specify dash I which is to say I only want the HTTP headers. 00:46:41.820 --> 00:46:48.240 I'm going to go ahead now and say http://harvard.edu, nothing more. 00:46:48.240 --> 00:46:51.030 When I hit Enter now, here are the complete headers 00:46:51.030 --> 00:46:53.100 that come back from the server. 00:46:53.100 --> 00:46:56.130 No dot dot dot this time, we see everything, in fact, here, 00:46:56.130 --> 00:46:57.510 but notice the first line. 00:46:57.510 --> 00:47:02.100 It's not 200 OK, but rather 301 moved permanently. 00:47:02.100 --> 00:47:03.510 Like, where did Harvard go? 00:47:03.510 --> 00:47:07.410 Well, it turns out that Harvard has specified its new location 00:47:07.410 --> 00:47:13.950 down here as https://www.harvard.edu. 00:47:13.950 --> 00:47:16.080 Now there's other lines of headers there, 00:47:16.080 --> 00:47:19.410 HTTP headers as they're called, each of which starts with a word, 00:47:19.410 --> 00:47:23.010 perhaps with some punctuation, and a colon, followed by the value. 00:47:23.010 --> 00:47:27.930 Location, value, go to this location is the general paradigm there. 00:47:27.930 --> 00:47:31.650 But why might Harvard not want to show me their web page at the address 00:47:31.650 --> 00:47:32.670 that I typed? 00:47:32.670 --> 00:47:36.420 Well, it turns out that HTTP is by definition insecure. 
00:47:36.420 --> 00:47:39.210 The extent to which the message is encoded 00:47:39.210 --> 00:47:41.740 is quite literally in English or English-like syntax, 00:47:41.740 --> 00:47:43.980 such as that we've been looking at here. 00:47:43.980 --> 00:47:47.050 It's just that text that's inside the envelope. 00:47:47.050 --> 00:47:50.850 If instead, though, you want to encrypt those contents so that no one knows 00:47:50.850 --> 00:47:53.370 what web page you're requesting or receiving, 00:47:53.370 --> 00:47:57.690 and your employer and your university administrator or your internet service 00:47:57.690 --> 00:48:00.900 provider or country does not know what you're doing, 00:48:00.900 --> 00:48:04.380 be it for personal reasons, financial, or otherwise, well then 00:48:04.380 --> 00:48:06.300 you want to use HTTPS. 00:48:06.300 --> 00:48:09.030 And Harvard University, like so many companies today, 00:48:09.030 --> 00:48:12.030 is insistent that you actually visit them securely, 00:48:12.030 --> 00:48:14.850 if only because it's best practice, but it also 00:48:14.850 --> 00:48:18.390 prevents potentially private information from leaking. 00:48:18.390 --> 00:48:20.640 And so here with this location line is Harvard saying, 00:48:20.640 --> 00:48:24.930 no, we will not respond to you with OK via HTTP, 00:48:24.930 --> 00:48:28.980 we have moved permanently to a secure address at HTTPS, 00:48:28.980 --> 00:48:31.020 where the S denotes secure. 00:48:31.020 --> 00:48:32.320 But why the www? 00:48:32.320 --> 00:48:34.710 Back in the day, you probably did have to type 00:48:34.710 --> 00:48:40.230 for many companies, www.example.com instead of just going to example.com 00:48:40.230 --> 00:48:43.200 and hoping that you end up in the right place.
00:48:43.200 --> 00:48:45.780 Well, humans have gotten more comfortable with the internet 00:48:45.780 --> 00:48:48.240 over the past years, over the past decades, and indeed, 00:48:48.240 --> 00:48:52.080 whereas years ago, in order to advertise yourself effectively on the web, 00:48:52.080 --> 00:48:54.810 you might have indeed needed to go to press on your business card 00:48:54.810 --> 00:49:00.500 or advertisement with http://www.something.com. 00:49:00.500 --> 00:49:05.505 But all of us have kind of seen HTTP enough, if not HTTPS as well, 00:49:05.505 --> 00:49:07.130 you don't need to tell me to type that. 00:49:07.130 --> 00:49:09.660 And indeed, my browser no longer requires me to type 00:49:09.660 --> 00:49:13.080 that, so now you see business cards and advertisements 00:49:13.080 --> 00:49:15.805 with just www.something.com. 00:49:15.805 --> 00:49:18.480 But you know what, I'm not new to the internet. 00:49:18.480 --> 00:49:22.140 I know what www is, and I know what dot com is as well, 00:49:22.140 --> 00:49:27.780 don't even bother showing me or telling me on your card or your website or ad 00:49:27.780 --> 00:49:32.740 that it's www.something.com, just tell me something.com. 00:49:32.740 --> 00:49:35.328 And so browsers have been getting more user friendly 00:49:35.328 --> 00:49:37.120 and humans have been getting more familiar, 00:49:37.120 --> 00:49:40.540 and so we tend not to see those prefixes anymore. 00:49:40.540 --> 00:49:44.500 But it turns out that for technical reasons, for security reasons, 00:49:44.500 --> 00:49:47.370 it tends to be useful to have a subdomain. 00:49:47.370 --> 00:49:49.120 As an aside, for things like cookies, it's 00:49:49.120 --> 00:49:51.520 useful to keep cookies in a subdomain as opposed 00:49:51.520 --> 00:49:56.330 to the domain itself just to narrow the scope via which they can be accessed.
00:49:56.330 --> 00:49:58.960 But also for marketing sake, it would be nice 00:49:58.960 --> 00:50:04.120 if everyone in the world, whether they type harvard.edu or www.harvard.edu, 00:50:04.120 --> 00:50:08.020 ultimately end up in the same location just because that's how we 00:50:08.020 --> 00:50:09.940 want to present ourselves to the world. 00:50:09.940 --> 00:50:13.150 And so for both technical and marketing and security reasons alike might 00:50:13.150 --> 00:50:18.100 Harvard or a company want to redirect to a URL like this one here. 00:50:18.100 --> 00:50:19.810 Now what does your browser know to do? 00:50:19.810 --> 00:50:23.440 Well, when your browser receives not 200 OK, in which case 00:50:23.440 --> 00:50:27.460 it just shows you the page, but instead receives 301 moved permanently, 00:50:27.460 --> 00:50:31.390 it instead looks for that location line and takes you there instead, 00:50:31.390 --> 00:50:35.920 at which point then you'll get that 200 OK. 00:50:35.920 --> 00:50:38.350 And so this, again, is what browsers do. 00:50:38.350 --> 00:50:42.820 HTTP is what they understand, and know by definition of that protocol 00:50:42.820 --> 00:50:45.160 how to handle these cases. 00:50:45.160 --> 00:50:50.140 But not everything is always OK and not always has something moved permanently. 00:50:50.140 --> 00:50:52.670 Sometimes something's just not found. 00:50:52.670 --> 00:50:55.210 And in fact, of all of these numbers we've seen thus far, 00:50:55.210 --> 00:51:00.160 odds are you've not seen or cared about 200 or even 301, but most of us 00:51:00.160 --> 00:51:03.070 have probably at least once seen 404. 00:51:03.070 --> 00:51:04.060 Why? 00:51:04.060 --> 00:51:06.640 Why in the world is that the number we somehow 00:51:06.640 --> 00:51:11.140 see anytime you visit a web page that's gone, or anytime you mistype an address 00:51:11.140 --> 00:51:12.610 and you reach a dead end?
00:51:12.610 --> 00:51:16.300 Well, for better or for worse the designers of websites for years 00:51:16.300 --> 00:51:20.360 have exposed this value to end users even though it's not all that useful. 00:51:20.360 --> 00:51:22.720 But it's indeed the unique value that humans 00:51:22.720 --> 00:51:27.550 decided some years ago would uniquely represent the notion of a page 00:51:27.550 --> 00:51:28.720 not being found. 00:51:28.720 --> 00:51:33.070 So if inside of that virtual envelope comes back a message 404 not found, 00:51:33.070 --> 00:51:36.030 the browser can say that literally or perhaps display 00:51:36.030 --> 00:51:41.140 a cute message to that effect, but the reason that you're seeing that 404 00:51:41.140 --> 00:51:44.200 is because quite literally and mind numbingly that 00:51:44.200 --> 00:51:49.480 is just the low level status code that has come back from an HTTP server. 00:51:49.480 --> 00:51:50.890 And there's more of these, too. 00:51:50.890 --> 00:51:53.260 In fact, 200 OK is the best you might get. 00:51:53.260 --> 00:51:55.300 301 moved permanently we've seen. 00:51:55.300 --> 00:51:57.910 302 found is another form of redirection, 00:51:57.910 --> 00:51:59.640 but a temporary one instead. 00:51:59.640 --> 00:52:04.480 304 not modified is a response that a server can send for efficiency. 00:52:04.480 --> 00:52:07.240 If you visited a web page just a moment ago 00:52:07.240 --> 00:52:09.520 and you happen to hit reload or click on a link 00:52:09.520 --> 00:52:13.060 and get back the same content again, it's not terribly efficient or good 00:52:13.060 --> 00:52:16.750 business for a company to incur the time and perhaps financial cost 00:52:16.750 --> 00:52:20.740 to retransmit all of those bits to you, and so it might instead 00:52:20.740 --> 00:52:25.540 respond with an envelope more succinctly with 304 not modified 00:52:25.540 --> 00:52:30.310 without anything else deeper in that envelope, no additional content. 
00:52:30.310 --> 00:52:34.030 And so this way your browser will just rely on its own cache, its own copy, 00:52:34.030 --> 00:52:36.550 so to speak, of the original request. 00:52:36.550 --> 00:52:39.990 Meanwhile, if you're not allowed to visit some web page because you've not 00:52:39.990 --> 00:52:41.990 logged in or you don't have authorization there, 00:52:41.990 --> 00:52:45.400 too, well 401 unauthorized might instead come back. 00:52:45.400 --> 00:52:46.960 As might 403 forbidden. 00:52:46.960 --> 00:52:49.420 404 not found means there's just nothing there. 00:52:49.420 --> 00:52:52.750 418 I'm a teapot was an April Fool's joke some years 00:52:52.750 --> 00:52:55.270 ago where someone went to the lengths of actually writing 00:52:55.270 --> 00:53:00.340 a formal specification for what a server should say when it is in fact a teapot. 00:53:00.340 --> 00:53:03.350 But the worst error you might see, and most users would never see this, 00:53:03.350 --> 00:53:08.620 but developers of software would is five zero zero, 500, which 00:53:08.620 --> 00:53:12.520 represents an internal server error, and almost always represents 00:53:12.520 --> 00:53:17.290 a logical or a syntactic error in the code that someone has written, 00:53:17.290 --> 00:53:21.410 be it in Python or any number of other languages. 00:53:21.410 --> 00:53:24.340 And now a fun example, perhaps, to bring all this home. 00:53:24.340 --> 00:53:28.510 It turns out that safetyschool.org is an actual address on the web. 00:53:28.510 --> 00:53:30.940 And indeed, it happens to have been bought or rented 00:53:30.940 --> 00:53:33.790 for years now by some Harvard alum. 00:53:33.790 --> 00:53:36.580 And indeed, if you visit safetyschool.org, 00:53:36.580 --> 00:53:40.330 you shall find yourself at this website here. 00:53:40.330 --> 00:53:44.170 http://safetyschool.org. 00:53:44.170 --> 00:53:49.060 We find ourselves whisked away to www.yale.edu. 00:53:49.060 --> 00:53:50.320 But how is that implemented?
00:53:50.320 --> 00:53:52.420 Well, let's again turn to our terminal window, 00:53:52.420 --> 00:53:55.870 where we can see really the contents of that virtual envelope. 00:53:55.870 --> 00:53:59.710 And if in here in my terminal window I again type curl dash I, 00:53:59.710 --> 00:54:05.470 http://safetyschool.org, well I see all of the headers 00:54:05.470 --> 00:54:07.300 that are exactly coming back. 00:54:07.300 --> 00:54:11.800 And indeed, here, safetyschool.org has permanently moved for years now 00:54:11.800 --> 00:54:17.740 to this location, http://www.yale.edu. 00:54:17.740 --> 00:54:21.220 A fun jab at our rivals that some alum has been paying now 00:54:21.220 --> 00:54:25.760 for years on an annual basis. 00:54:25.760 --> 00:54:29.140 So we now have a pair of protocols, TCP and IP, 00:54:29.140 --> 00:54:32.020 via which we can get data, any data, from point A 00:54:32.020 --> 00:54:33.890 to point B on the internet. 00:54:33.890 --> 00:54:38.800 Sometimes that data is itself HTTP data that is a request for a web page 00:54:38.800 --> 00:54:41.080 or a response with a web page. 00:54:41.080 --> 00:54:45.880 But what if there are so many others trying to access data at point B-- 00:54:45.880 --> 00:54:49.360 that is to say, business is good, and a web server out there is receiving 00:54:49.360 --> 00:54:54.400 so many packets per second that the server cannot quite yet keep up? 00:54:54.400 --> 00:54:57.700 The routers in between might very well be able to handle that load perfectly 00:54:57.700 --> 00:55:02.320 because those are much bigger servers, conceptually and physically, with far 00:55:02.320 --> 00:55:05.650 more CPUs and RAM and therefore can handle that load, 00:55:05.650 --> 00:55:10.930 but some business' server out there is only finite in capacity. 00:55:10.930 --> 00:55:15.250 And so what happens when you need to scale to handle more users?
00:55:15.250 --> 00:55:17.680 Well, you might have initially just one server 00:55:17.680 --> 00:55:20.150 such as that Dell server pictured here. 00:55:20.150 --> 00:55:22.450 This is what's called a rack server, insofar 00:55:22.450 --> 00:55:26.170 as it's designed to exist on a rack that you slide this thing into, 00:55:26.170 --> 00:55:29.630 and it happens to be one rack unit or 1.75 inches, 00:55:29.630 --> 00:55:32.230 which is simply a standardization thereof. 00:55:32.230 --> 00:55:35.500 Inside of this rack server is its hard drive, and RAM, 00:55:35.500 --> 00:55:39.340 and CPU, and more pieces, but it's exactly the same technology 00:55:39.340 --> 00:55:41.770 that you might have in a box under your desk 00:55:41.770 --> 00:55:45.910 or even in the form factor of a laptop, just bigger and faster. 00:55:45.910 --> 00:55:48.400 And, to be fair, more expensive. 00:55:48.400 --> 00:55:52.150 But it's only so big, indeed, it's only 1.75 inches tall 00:55:52.150 --> 00:55:54.790 and some number of inches deep, which is to say there's only 00:55:54.790 --> 00:55:57.160 a finite amount of RAM in there. 00:55:57.160 --> 00:55:59.770 There's only a fixed number of CPUs in there, 00:55:59.770 --> 00:56:04.540 and there's only so many gigabytes, presumably, of disk storage space. 00:56:04.540 --> 00:56:07.480 At some point or other, we're going to run out of one or more 00:56:07.480 --> 00:56:08.830 of those resources. 00:56:08.830 --> 00:56:13.600 And even though we've not really gotten into the weeds of how a server handles 00:56:13.600 --> 00:56:16.360 and reads these envelopes, it certainly stands 00:56:16.360 --> 00:56:19.780 to reason that it can only read with finite resources 00:56:19.780 --> 00:56:23.620 some finite number of packets per unit of time, 00:56:23.620 --> 00:56:26.290 be it second or minutes or days.
00:56:26.290 --> 00:56:28.750 And so at some point if business is booming, 00:56:28.750 --> 00:56:32.080 we might receive at any given point more packets 00:56:32.080 --> 00:56:35.380 of information than we can handle and indeed, like some routers, if they're 00:56:35.380 --> 00:56:38.890 overwhelmed, we might just drop these incoming packets, 00:56:38.890 --> 00:56:43.210 or worse yet, not expect them and just crash or freeze or somehow 00:56:43.210 --> 00:56:45.580 behave unpredictably. 00:56:45.580 --> 00:56:47.630 And that's probably not good for our business. 00:56:47.630 --> 00:56:49.960 So how can we go about solving this problem? 00:56:49.960 --> 00:56:54.430 Well, the easiest way, quite simply, is to scale vertically, so to speak. 00:56:54.430 --> 00:56:56.020 That is don't use that server. 00:56:56.020 --> 00:57:01.000 Instead, buy one that's bigger with more RAM and more CPUs and more disk space 00:57:01.000 --> 00:57:03.970 and faster internet connectivity, and really just 00:57:03.970 --> 00:57:06.550 avoid that problem altogether. 00:57:06.550 --> 00:57:08.030 Why is this compelling? 00:57:08.030 --> 00:57:11.560 Well, cost of it aside, you don't have to change your code, 00:57:11.560 --> 00:57:14.260 you needn't change your configuration in software, 00:57:14.260 --> 00:57:20.050 you need only throw hardware and in turn, to be fair, money at the problem. 00:57:20.050 --> 00:57:23.470 Now that in and of itself might alone be a deal breaker, the money alone, 00:57:23.470 --> 00:57:27.640 but at some point if we want to handle that business, we've got to scale up, 00:57:27.640 --> 00:57:29.830 but even this is shortsighted. 00:57:29.830 --> 00:57:35.620 Because at the end of the day, Dell only sells servers that operate so quickly 00:57:35.620 --> 00:57:37.660 and have so much disk space. 00:57:37.660 --> 00:57:40.210 Those resources, too, are ultimately finite.
00:57:40.210 --> 00:57:42.790 And while next year there might be an even bigger version 00:57:42.790 --> 00:57:44.980 of this same machine out there, this year 00:57:44.980 --> 00:57:47.020 you might have the top of the line. 00:57:47.020 --> 00:57:50.800 So at some point, one server, even with so many resources, 00:57:50.800 --> 00:57:54.910 might not be able to handle all of the packets and business you're getting. 00:57:54.910 --> 00:57:56.980 So what do you then do? 00:57:56.980 --> 00:58:01.660 Well, there is an opportunity to scale not vertically, so to speak, 00:58:01.660 --> 00:58:04.120 but horizontally instead. 00:58:04.120 --> 00:58:07.930 Focusing not on the top tier machines, but instead, 00:58:07.930 --> 00:58:12.820 two of the smaller ones, or as needed three or four or more of the same. 00:58:12.820 --> 00:58:15.940 In other words, spending lower on that cost curve, 00:58:15.940 --> 00:58:18.520 getting more hardware, hopefully, for your money, 00:58:18.520 --> 00:58:21.700 but such that the net effect is even more CPU power 00:58:21.700 --> 00:58:25.630 and more disk space and more RAM than you might have gotten with that one 00:58:25.630 --> 00:58:27.240 souped up machine itself. 00:58:27.240 --> 00:58:28.990 And heck, if you really need the capacity, 00:58:28.990 --> 00:58:32.200 you can buy any number of these big servers, 00:58:32.200 --> 00:58:35.980 but you do somehow ultimately have to interconnect them. 00:58:35.980 --> 00:58:38.260 And here now is where there's a trade-off. 
00:58:38.260 --> 00:58:40.660 Whereas money was really the only barrier 00:58:40.660 --> 00:58:44.140 to solving this problem initially, though easier said than done, 00:58:44.140 --> 00:58:48.370 now we have to re-engineer our system, because no longer 00:58:48.370 --> 00:58:51.910 are packets of internet data coming in from our customers 00:58:51.910 --> 00:58:54.670 and ending up in one place, they now have to somehow 00:58:54.670 --> 00:58:58.750 be spread across multiple servers. 00:58:58.750 --> 00:59:00.910 So how might we do this back in the day? 00:59:00.910 --> 00:59:04.780 Well, back in the late 90s, when Larry and Sergey of Google fame 00:59:04.780 --> 00:59:07.270 built out their first cluster of servers, 00:59:07.270 --> 00:59:09.970 they didn't have those pretty Dell boxes, rather, 00:59:09.970 --> 00:59:12.940 this was, now in Google's museum, reportedly 00:59:12.940 --> 00:59:15.550 one of their first racks of servers. 00:59:15.550 --> 00:59:18.100 Notice there's no shiny cases, let alone logos, 00:59:18.100 --> 00:59:22.180 but instead, lots of circuit boards on which are hard drive after hard drive 00:59:22.180 --> 00:59:26.650 after hard drive and suffice to say so many wires connecting everything. 00:59:26.650 --> 00:59:30.240 And even though, ironically, this picture seems to be vertical, 00:59:30.240 --> 00:59:36.150 this is, perhaps, one of the earliest examples in our internet era of scaling 00:59:36.150 --> 00:59:37.230 horizontally. 00:59:37.230 --> 00:59:40.110 Each of these servers, which is represented by each of these boards, 00:59:40.110 --> 00:59:43.260 is somehow interconnected in such a way that those servers 00:59:43.260 --> 00:59:45.100 can intercommunicate. 00:59:45.100 --> 00:59:46.140 But how? 00:59:46.140 --> 00:59:48.540 Well, let's consider, with the proverbial engineering hat 00:59:48.540 --> 00:59:51.750 on, how servers might somehow intercommunicate. 
00:59:51.750 --> 00:59:54.690 If up here, for instance, is just some artist's rendition 00:59:54.690 --> 00:59:58.590 of someone's laptop, a potential customer who's sending us packets, 00:59:58.590 --> 01:00:03.780 that customer might previously have been accessing our server, which we'll 01:00:03.780 --> 01:00:06.270 represent here with a box and just call it 01:00:06.270 --> 01:00:09.250 A. Server A. There's no other servers involved, 01:00:09.250 --> 01:00:12.600 but there is some internet in between us here, so we'll assume that this 01:00:12.600 --> 01:00:14.880 is the so-called cloud, so to speak. 01:00:14.880 --> 01:00:17.730 And I, as this laptop or the customer, have a connection 01:00:17.730 --> 01:00:20.940 and so does that one and only server have the same. 01:00:20.940 --> 01:00:22.570 This picture is fairly straightforward. 01:00:22.570 --> 01:00:24.720 Now you request a web page via this browser, 01:00:24.720 --> 01:00:27.360 it somehow traverses the internet via routers, 01:00:27.360 --> 01:00:31.150 and then ultimately ends up at that server A. 01:00:31.150 --> 01:00:33.180 But what if, instead, there's not just A, 01:00:33.180 --> 01:00:35.580 even if it's top of the line, because that's not enough, 01:00:35.580 --> 01:00:40.710 but instead there are servers A and B together here for this website. 01:00:40.710 --> 01:00:44.790 Well, here we might now have two boxes, the same size or bigger or smaller, 01:00:44.790 --> 01:00:46.860 but ultimately finite, as well. 01:00:46.860 --> 01:00:51.330 And somehow we need to now decide how to route information 01:00:51.330 --> 01:00:54.460 from customer to server A or B. In other words, 01:00:54.460 --> 01:00:57.420 there is now virtually a fork in the road, 01:00:57.420 --> 01:01:00.700 left or right, that packets need to traverse. 01:01:00.700 --> 01:01:02.790 So how can we implement this building block? 01:01:02.790 --> 01:01:05.580 Well, again, as always, go back to first principles.
01:01:05.580 --> 01:01:07.710 We know from our stack of internet technologies 01:01:07.710 --> 01:01:10.710 that we already have a mechanism via which to translate 01:01:10.710 --> 01:01:13.080 domain names into IP addresses. 01:01:13.080 --> 01:01:15.300 And if each of these servers, by definition of IP, 01:01:15.300 --> 01:01:21.180 has its own IP address, why not just use DNS to solve this problem? 01:01:21.180 --> 01:01:25.680 When a customer requests example.com, perhaps answer that request 01:01:25.680 --> 01:01:29.790 with the IP address of A. And then when a second customer somewhere else out 01:01:29.790 --> 01:01:34.110 there on his or her laptop asks for example.com next, 01:01:34.110 --> 01:01:39.060 return to them the IP address of B and vice versa, again and again. 01:01:39.060 --> 01:01:41.580 Literally adopting a round robin technique 01:01:41.580 --> 01:01:45.330 of sorts, whereby one time you answer A, the next time you answer B, 01:01:45.330 --> 01:01:47.580 and back and forth you go. 01:01:47.580 --> 01:01:51.690 On average, you would like to think that this uniform distribution of answers 01:01:51.690 --> 01:01:55.110 will give you 50% load, that is to say traffic, 01:01:55.110 --> 01:01:58.020 on one server and 50% on the other. 01:01:58.020 --> 01:02:00.840 But perhaps this customer is more of a shopper than this one, 01:02:00.840 --> 01:02:03.960 and they end up imposing even more load on A than on B, 01:02:03.960 --> 01:02:08.067 so there with that simple heuristic you can get skew. 01:02:08.067 --> 01:02:10.650 You might not even use round robin, you could just use random, 01:02:10.650 --> 01:02:15.390 but there on average, yes, you'll send 50% traffic left and 50% right, 01:02:15.390 --> 01:02:19.170 but some of those users might be heavier users than others. 01:02:19.170 --> 01:02:22.020 So perhaps we should have some form of feedback loop, 01:02:22.020 --> 01:02:24.910 and DNS alone might not be sufficient.
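To see how that skew can arise, here is a minimal Python sketch of round robin answers. The IP addresses are hypothetical, taken from a range reserved for documentation, and the per-customer request counts are made up for illustration; a real DNS server involves caching and time-to-live values that this ignores.

```python
from collections import Counter
from itertools import cycle

# Hypothetical addresses for servers A and B (documentation-only range).
SERVERS = {"A": "192.0.2.1", "B": "192.0.2.2"}

# Round robin: answer A, then B, then A, then B, and so on.
answers = cycle(SERVERS.values())

# Assumption for illustration: how many requests each customer goes on to make.
requests_per_customer = [10, 1, 10, 1]

load = Counter()
for request_count in requests_per_customer:
    ip = next(answers)          # the answer this customer receives
    load[ip] += request_count   # every later request lands on that server

print(load)
```

Each server was handed out to exactly two customers, yet A ends up with 20 requests and B with only 2, because the answers are balanced and the customers are not.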
01:02:24.910 --> 01:02:28.710 We really need there to be a middle man, such as this dot here, 01:02:28.710 --> 01:02:33.840 that decides more intelligently whether to send data to A or to B. 01:02:33.840 --> 01:02:37.560 And we'll call this thing here, this dot now, a load balancer. 01:02:37.560 --> 01:02:40.030 Aptly named insofar as it balances load that's 01:02:40.030 --> 01:02:42.270 incoming across multiple servers. 01:02:42.270 --> 01:02:43.410 But how? 01:02:43.410 --> 01:02:47.970 Well, if these connections between A and B and this load balancer 01:02:47.970 --> 01:02:50.460 are not unidirectional, but bidirectional somehow, 01:02:50.460 --> 01:02:54.360 literally a cable that allows bits to flow left and right, 01:02:54.360 --> 01:02:58.920 could we perhaps have A just continually report back to that load balancer 01:02:58.920 --> 01:03:02.040 saying, I have capacity, I have capacity? 01:03:02.040 --> 01:03:04.800 Whereas B might say, I've got too many customers. 01:03:04.800 --> 01:03:08.850 And logically, then, this load balancer can just start sending no traffic to B 01:03:08.850 --> 01:03:11.280 and send all of it to A or vice versa. 01:03:11.280 --> 01:03:13.350 Of course, logically, we could find ourselves 01:03:13.350 --> 01:03:18.270 in a situation where both A and B are too busy, what then do we do? 01:03:18.270 --> 01:03:21.390 Well, at some point we have to throw money at the problem 01:03:21.390 --> 01:03:23.760 and solve it by just adding hardware. 01:03:23.760 --> 01:03:27.000 And so C might be added to the mix with that same logic, 01:03:27.000 --> 01:03:30.990 but the load balancer just has to know about it. 01:03:30.990 --> 01:03:32.910 So all fine and good, we seem to have solved 01:03:32.910 --> 01:03:37.680 the problem in a very straightforward way, but as with computer science 01:03:37.680 --> 01:03:40.740 more generally, there's probably a price paid and a trade-off, 01:03:40.740 --> 01:03:42.570 and not just financial.
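The feedback loop described here can be sketched as a toy Python class. The lecture does not say how servers report their load or how the balancer chooses between them, so everything below is an assumption made for illustration: the class name `LoadBalancer`, the idea of a fixed per-server capacity, and the policy of picking the least-loaded server are all hypothetical.

```python
class LoadBalancer:
    """Toy balancer that routes to the least-loaded server with spare capacity."""

    def __init__(self, capacity_per_server):
        # Assumption: every server can handle the same number of customers.
        self.capacity = capacity_per_server
        self.reported_load = {}

    def add_server(self, name):
        # The balancer only routes to servers it has been told about.
        self.reported_load[name] = 0

    def report(self, name, current_load):
        # A server "reports back" how many customers it is handling.
        self.reported_load[name] = current_load

    def pick(self):
        # Consider only servers that still say "I have capacity".
        available = {
            name: load
            for name, load in self.reported_load.items()
            if load < self.capacity
        }
        if not available:
            return None  # every server is too busy
        return min(available, key=available.get)


balancer = LoadBalancer(capacity_per_server=100)
balancer.add_server("A")
balancer.add_server("B")

balancer.report("A", 10)   # A: "I have capacity"
balancer.report("B", 100)  # B: "I've got too many customers"
first_choice = balancer.pick()

balancer.report("A", 100)  # now both A and B are too busy
second_choice = balancer.pick()

balancer.add_server("C")   # throw money at the problem: add hardware
third_choice = balancer.pick()

print(first_choice, second_choice, third_choice)
```

In this sketch traffic goes to A while B is full, nothing can be routed once both are full, and C starts receiving traffic as soon as the balancer knows about it.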
01:03:42.570 --> 01:03:47.760 Unfortunately, even though I have two, maybe even three servers now, therefore 01:03:47.760 --> 01:03:50.490 seemingly having high availability of service, 01:03:50.490 --> 01:03:53.340 that is any one of these servers theoretically could go down 01:03:53.340 --> 01:03:57.090 and I've still got 2/3 of my capacity. 01:03:57.090 --> 01:04:02.520 But there's a single point of failure here, an SPOF so to speak, 01:04:02.520 --> 01:04:05.070 that could really derail the whole process. 01:04:05.070 --> 01:04:09.150 What happens if this load balancer, which while pictorially is just a dot, 01:04:09.150 --> 01:04:13.470 is actually a server underneath the hood itself? 01:04:13.470 --> 01:04:15.990 What if that load balancer goes down, or what if that load 01:04:15.990 --> 01:04:17.670 balancer itself gets overwhelmed? 01:04:17.670 --> 01:04:22.290 It does not matter how many servers you have here, A through Z, if none of them 01:04:22.290 --> 01:04:23.710 can be reached. 01:04:23.710 --> 01:04:28.050 So this simple architecture alone is not a solution. 01:04:28.050 --> 01:04:32.470 And indeed, this is what is meant by architecting a network itself. 01:04:32.470 --> 01:04:36.320 This design is probably not the best, especially for business. 01:04:36.320 --> 01:04:40.300 And so let's start anew, at least down here inside this company, 01:04:40.300 --> 01:04:44.650 and consider if one load balancer is not great, what's better than one? 01:04:44.650 --> 01:04:46.510 Well, honestly, two. 01:04:46.510 --> 01:04:48.550 And so let's now draw them a bit bigger, where 01:04:48.550 --> 01:04:51.820 here we have a load balancer on the left, and here on the right, 01:04:51.820 --> 01:04:53.770 and we'll number them 1 and 2. 01:04:53.770 --> 01:04:58.330 Whereas our servers we'll continue to name A and B and C and perhaps even 01:04:58.330 --> 01:05:02.170 through Z.
And now we just have to ensure that we have connections 01:05:02.170 --> 01:05:04.300 to both load balancers, and that each load 01:05:04.300 --> 01:05:09.460 balancer can connect to each server in this sort of mesh network here. 01:05:09.460 --> 01:05:12.730 It's wonderfully redundant now, albeit a bit complex. 01:05:12.730 --> 01:05:15.340 But because we have all of these interconnections now, 01:05:15.340 --> 01:05:21.280 we can ensure that even if one or two go down, data can still reach A, B, or C. 01:05:21.280 --> 01:05:25.660 But how to know whether load balancer 1 should be doing all of this 01:05:25.660 --> 01:05:27.040 or load balancer 2? 01:05:27.040 --> 01:05:30.850 You know what, why don't we draw another connection between 1 and 2? 01:05:30.850 --> 01:05:35.410 And a very common paradigm in systems is heartbeats, quite simply. 01:05:35.410 --> 01:05:40.120 Much like you and I have every second or so a heartbeat saying we are alive, 01:05:40.120 --> 01:05:42.730 we are alive, hello world, hello world, if you will, 01:05:42.730 --> 01:05:48.340 so here might load balancers 1 and 2, themselves just servers, 01:05:48.340 --> 01:05:53.050 say I am alive, I am alive, hello number 2, hello number 2. 01:05:53.050 --> 01:05:57.010 And if 2 does not hear 1 eventually, or if 1 does not hear 2, 01:05:57.010 --> 01:05:59.800 the other can just commandeer that role. 01:05:59.800 --> 01:06:03.100 By default, only 1 will be load balancing, but if it goes offline, 01:06:03.100 --> 01:06:05.470 2 will presume to take over. 01:06:05.470 --> 01:06:08.680 And now we have this property generally known as high availability. 01:06:08.680 --> 01:06:12.070 Even if we lose one or more servers can we still stay up, 01:06:12.070 --> 01:06:16.000 and there is no single point of failure, at least here in this picture, 01:06:16.000 --> 01:06:18.770 because we now have that second load balancer. 
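The heartbeat arrangement between load balancers 1 and 2 can be sketched from the point of view of number 2, the standby. This is a simplified illustration rather than a real failover protocol: the class name `StandbyBalancer`, the three-second timeout, and the choice to hand the role back as soon as number 1 is heard again are all assumptions. Timestamps are passed in explicitly so the example behaves the same on every run.

```python
class StandbyBalancer:
    """Load balancer 2: takes over if load balancer 1 goes quiet for too long."""

    def __init__(self, timeout_seconds, started_at=0.0):
        self.timeout = timeout_seconds
        self.last_heard = started_at  # time of the most recent heartbeat
        self.active = False           # by default, only number 1 is balancing

    def on_heartbeat(self, now):
        # Number 1 said "I am alive", so number 2 stays (or goes back to)
        # standby. Stepping back down is an assumption of this sketch.
        self.last_heard = now
        self.active = False

    def tick(self, now):
        # Called periodically: if number 1 has been silent longer than the
        # timeout, presume it is offline and commandeer its role.
        if now - self.last_heard > self.timeout:
            self.active = True
        return self.active


standby = StandbyBalancer(timeout_seconds=3.0)

standby.on_heartbeat(now=1.0)
standby.on_heartbeat(now=2.0)
still_standby = standby.tick(now=2.5)  # heard from number 1 recently

took_over = standby.tick(now=6.0)      # 4 seconds of silence exceeds 3

standby.on_heartbeat(now=7.0)          # number 1 comes back online
stepped_down = standby.tick(now=7.5)

print(still_standby, took_over, stepped_down)
```

A real deployment would also have to handle the case where both load balancers are healthy but the link between them fails, which this sketch ignores.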
01:06:18.770 --> 01:06:21.250 But if we look a little higher, it would seem 01:06:21.250 --> 01:06:24.440 that we do actually have another single point of failure in here, 01:06:24.440 --> 01:06:26.380 and now we go down the rabbit hole. 01:06:26.380 --> 01:06:29.500 If this line here to the cloud, the internet, 01:06:29.500 --> 01:06:33.520 represents my internet connection, my ISP, 01:06:33.520 --> 01:06:38.110 what if that ISP, Comcast, Verizon or any other, itself goes down, 01:06:38.110 --> 01:06:42.790 a big storm and a loss of power might take my whole business offline. 01:06:42.790 --> 01:06:44.830 Well, the best way to solve that would be 01:06:44.830 --> 01:06:48.040 to access someone else's internet connectivity 01:06:48.040 --> 01:06:49.820 and make sure you're connected to that. 01:06:49.820 --> 01:06:54.460 And in fact, if we keep going with load balancer 3 or even 4 01:06:54.460 --> 01:06:59.770 or server D, E, and F, this picture very quickly starts to get so intertwined. 01:06:59.770 --> 01:07:01.240 But this is how you do it. 01:07:01.240 --> 01:07:06.580 And not too long ago was this done entirely with wires and hardware. 01:07:06.580 --> 01:07:10.450 But these days this topology, if you will, this architecture, 01:07:10.450 --> 01:07:12.640 is increasingly done in software. 01:07:12.640 --> 01:07:15.730 And indeed, the whole thing is done in the cloud. 01:07:15.730 --> 01:07:19.420 Less frequently do staff of companies find themselves crawling 01:07:19.420 --> 01:07:22.630 along the floor and in wiring closets and in data centers, 01:07:22.630 --> 01:07:25.510 so to speak, making these connections possible, 01:07:25.510 --> 01:07:28.390 but rather they do it virtually in software. 01:07:28.390 --> 01:07:30.920 And indeed, thus was born the cloud. 
01:07:30.920 --> 01:07:33.910 Well, it turns out that as Moore's law, so to speak, 01:07:33.910 --> 01:07:37.930 helps us in each passing year, we seem to have computers that are 01:07:37.930 --> 01:07:40.330 half as expensive and twice as fast. 01:07:40.330 --> 01:07:44.230 Can we ride that sort of curve of innovation in such a way 01:07:44.230 --> 01:07:47.830 that we can solve even more problems each year more quickly? 01:07:47.830 --> 01:07:50.610 And yet, with each passing year, I the human 01:07:50.610 --> 01:07:53.560 am not getting any better or faster at checking my email 01:07:53.560 --> 01:07:57.490 or using the web, so we increasingly have on our laptops and desks 01:07:57.490 --> 01:08:00.370 and our server rooms more computational power 01:08:00.370 --> 01:08:03.440 frankly than we really know what to do with. 01:08:03.440 --> 01:08:08.500 And so increasingly in vogue these days is to virtualize that hardware, 01:08:08.500 --> 01:08:13.330 and to take physical hardware with so many CPUs and so much RAM and so much 01:08:13.330 --> 01:08:16.149 disk space and to write software that runs 01:08:16.149 --> 01:08:22.359 on it that creates the illusion that one computer is two, or one computer is 10. 01:08:22.359 --> 01:08:25.359 That is to say, through software can you write 01:08:25.359 --> 01:08:31.210 code that virtualizes that hardware, thereby creating the illusion that you 01:08:31.210 --> 01:08:36.340 can have one server per customer but all 10 of those customers 01:08:36.340 --> 01:08:38.529 are on the same machine. 01:08:38.529 --> 01:08:42.120 Virtualization includes products like VMware and Parallels 01:08:42.120 --> 01:08:43.870 and those of other companies as well, and it's just 01:08:43.870 --> 01:08:45.580 software that runs on top of the hardware 01:08:45.580 --> 01:08:50.439 and creates this illusion, which then is all the better for business.
01:08:50.439 --> 01:08:53.200 If you can sell one piece of hardware multiple times 01:08:53.200 --> 01:08:55.960 but not necessarily in a way that you're over-provisioning it 01:08:55.960 --> 01:09:00.220 to multiple customers, but rather you're isolating each of those customers 01:09:00.220 --> 01:09:03.880 from one another, giving them not only the illusion of their own machine 01:09:03.880 --> 01:09:07.180 but indeed, the constraints whereby my data can't 01:09:07.180 --> 01:09:12.850 be accessed by another customer who only has cloud access there, too. 01:09:12.850 --> 01:09:17.470 And indeed, this is really in part, why we have now this cloud. 01:09:17.470 --> 01:09:20.000 The cloud is more of a buzz word than anything technical. 01:09:20.000 --> 01:09:23.770 Indeed, using the cloud just means using servers somewhere else 01:09:23.770 --> 01:09:25.750 that someone else is managing. 01:09:25.750 --> 01:09:28.930 No longer do companies with as much frequency have their own server 01:09:28.930 --> 01:09:32.580 room in their office, or their own data center in some warehouse somewhere. 01:09:32.580 --> 01:09:35.790 Rather, they virtualized even that piece of their product 01:09:35.790 --> 01:09:38.729 using Amazon or Microsoft or Google or others 01:09:38.729 --> 01:09:41.207 out there that provide you with access to servers 01:09:41.207 --> 01:09:43.290 that they themselves control, but they provide you 01:09:43.290 --> 01:09:49.319 with access to the illusion of your very own servers known as virtual machines. 01:09:49.319 --> 01:09:52.590 And via this process can we take ever more advantage 01:09:52.590 --> 01:09:56.130 of so many of those new CPUs and disk space and RAM 01:09:56.130 --> 01:09:58.020 that otherwise might frankly go to waste, 01:09:58.020 --> 01:10:02.250 because there's only so much we can typically do with one such machine. 
01:10:02.250 --> 01:10:06.540 Hence you might now think of this design, this stack, so to speak, 01:10:06.540 --> 01:10:07.200 as follows. 01:10:07.200 --> 01:10:10.170 In green here pictured is infrastructure, the physical hardware 01:10:10.170 --> 01:10:11.130 that you have bought. 01:10:11.130 --> 01:10:14.340 Here in blue is the hypervisor, the software 01:10:14.340 --> 01:10:18.660 called VMware or Parallels or something else, that virtualizes this hardware 01:10:18.660 --> 01:10:21.240 and creates the illusion that you actually have 01:10:21.240 --> 01:10:23.605 three machines, for instance, on one. 01:10:23.605 --> 01:10:25.980 And within each of those machines, which you can think of 01:10:25.980 --> 01:10:29.220 as just a separate window on that computer, double click 01:10:29.220 --> 01:10:32.610 to open computer A so to speak and computer B and computer 01:10:32.610 --> 01:10:37.170 C, each of those virtual machines, you can install your own operating system 01:10:37.170 --> 01:10:39.750 differently in each of those virtual machines. 01:10:39.750 --> 01:10:42.360 Some version of Windows in A, another in B, 01:10:42.360 --> 01:10:46.590 and maybe Linux or Unix or something else in C. And then within A, B, and C 01:10:46.590 --> 01:10:51.240 can you install your own apps, your own software, or so can your customers, 01:10:51.240 --> 01:10:56.190 thereby being isolated, not so much physically but virtually, 01:10:56.190 --> 01:10:57.750 from everyone else. 01:10:57.750 --> 01:10:59.970 But of course, there's always a price. 01:10:59.970 --> 01:11:02.820 While this might take better advantage of the increasing 01:11:02.820 --> 01:11:06.180 computational resources that we have in these boxes, 01:11:06.180 --> 01:11:08.820 there seems to be some duplication here.
01:11:08.820 --> 01:11:10.950 And indeed, in computer science, anytime you 01:11:10.950 --> 01:11:14.130 start duplicating resources or efforts there's 01:11:14.130 --> 01:11:16.710 probably an opportunity for better design. 01:11:16.710 --> 01:11:19.650 And while this technology itself is still nascent, 01:11:19.650 --> 01:11:22.680 there's a newcomer to the field called containerization, 01:11:22.680 --> 01:11:24.180 and it exists in multiple forms. 01:11:24.180 --> 01:11:28.140 But containerization shares more software, in some sense, 01:11:28.140 --> 01:11:31.680 underneath the hood, so that you might install an operating system not three 01:11:31.680 --> 01:11:36.870 times but once, and share it across those machines but in such a way that 01:11:36.870 --> 01:11:39.330 one cannot access the other. 01:11:39.330 --> 01:11:42.630 And on top of that layer, here called Docker, 01:11:42.630 --> 01:11:44.970 one of the most popular incarnations thereof, 01:11:44.970 --> 01:11:47.730 you have as before your infrastructure, the actual hardware, 01:11:47.730 --> 01:11:51.510 on top of which is your own operating system, be it Windows or Linux. 01:11:51.510 --> 01:11:55.230 On top of that is this program called Docker that provides you 01:11:55.230 --> 01:11:58.350 then with the ability to run A through F apps 01:11:58.350 --> 01:12:02.640 instead of say, just three because the overhead, so to speak, 01:12:02.640 --> 01:12:06.780 computationally is not quite as much as with virtual machines. 01:12:06.780 --> 01:12:10.620 Here we have three operating systems, each installed independently 01:12:10.620 --> 01:12:14.010 on the same hardware, which just surely consumes time, 01:12:14.010 --> 01:12:17.820 whereas here you have just one operating system theoretically and then 01:12:17.820 --> 01:12:21.000 more room for more apps thereupon.
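The difference in overhead can be made concrete with a little arithmetic. All of the numbers below are invented for illustration (a host with 14 GB of RAM, 2 GB per operating system, 2 GB per app), and the model ignores whatever the hypervisor or Docker itself consumes, so treat it as a rough sketch and not a measurement.

```python
# Made-up figures, chosen only to illustrate the trade-off.
HOST_RAM_GB = 14
OS_RAM_GB = 2
APP_RAM_GB = 2


def apps_with_virtual_machines(host_gb, os_gb, app_gb):
    # Each app lives in its own virtual machine, so every app pays for
    # a full operating system of its own.
    return host_gb // (os_gb + app_gb)


def apps_with_containers(host_gb, os_gb, app_gb):
    # The operating system is installed once and shared, so only the
    # apps themselves consume the remaining memory.
    return (host_gb - os_gb) // app_gb


vm_apps = apps_with_virtual_machines(HOST_RAM_GB, OS_RAM_GB, APP_RAM_GB)
container_apps = apps_with_containers(HOST_RAM_GB, OS_RAM_GB, APP_RAM_GB)

print(vm_apps, container_apps)
```

With these assumed figures the same hardware fits three apps as virtual machines but six as containers, which mirrors the three apps versus A through F in the picture described above.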
01:12:21.000 --> 01:12:24.660 So whereas containerization allows you ultimately 01:12:24.660 --> 01:12:27.300 to isolate one app from another, and virtual machines 01:12:27.300 --> 01:12:30.390 allow you to isolate one machine from another, 01:12:30.390 --> 01:12:34.410 they do this through different techniques and with disparate overhead. 01:12:34.410 --> 01:12:37.200 And surely in the years to come will this overhead only 01:12:37.200 --> 01:12:39.660 get chipped away at as we humans get better 01:12:39.660 --> 01:12:44.940 about running more and more software on less and less but more capable 01:12:44.940 --> 01:12:46.290 hardware. 01:12:46.290 --> 01:12:49.260 There, then, we have these internet technologies all the way up 01:12:49.260 --> 01:12:52.830 to cloud computing itself, whereas the technologies we've looked at 01:12:52.830 --> 01:12:57.360 are fairly low level protocols that simply get zeros and ones from point A 01:12:57.360 --> 01:13:02.340 to point B. Once we have that ability and we can stipulate that we can do it, 01:13:02.340 --> 01:13:05.510 we can build any number of abstractions on top of it. 01:13:05.510 --> 01:13:09.300 In HTTP, for instance, do we have effectively an application, 01:13:09.300 --> 01:13:14.220 known as web browsing, via which we can transmit text and images and sounds 01:13:14.220 --> 01:13:15.420 and so much more. 01:13:15.420 --> 01:13:18.270 And via the cloud itself do we have the ability now 01:13:18.270 --> 01:13:21.400 to slice up individual machines as though they are multiple 01:13:21.400 --> 01:13:25.200 and that picture before can be implemented not with two load 01:13:25.200 --> 01:13:30.990 balancers and three servers physically, but maybe, just maybe, with just one.
01:13:30.990 --> 01:13:33.780 One server that's been so virtualized or in turn 01:13:33.780 --> 01:13:38.310 containerized so that you can have different parts of its hardware each 01:13:38.310 --> 01:13:41.490 implementing different pieces of functionality that collectively 01:13:41.490 --> 01:13:43.070 implement that architecture. 01:13:43.070 --> 01:13:45.570 And so whereas back in the day might you actually physically 01:13:45.570 --> 01:13:49.080 wire all of those disparate types of machines together, now 01:13:49.080 --> 01:13:52.980 can you do it virtually in software literally with keystrokes and mouse 01:13:52.980 --> 01:13:56.430 clicks because someone has written software that abstracts away 01:13:56.430 --> 01:14:00.690 that underlying hardware in such a way that you can think about it virtually. 01:14:00.690 --> 01:14:03.810 Now at the end of the day, the servers in Google's and Microsoft's 01:14:03.810 --> 01:14:06.750 and Amazon's closets are still completely physical themselves 01:14:06.750 --> 01:14:11.610 with so many cables, but you can reroute information, those zeros and ones, 01:14:11.610 --> 01:14:15.000 different ways virtually thanks to these layers 01:14:15.000 --> 01:14:19.550 that we've built on top of these internet technologies.