DAVID MALAN: All right. So we are back. Now it's time for the cloud. What the heck is the cloud? Who's in the cloud? Who uses the cloud? Yeah, OK. Everyone over here and over here. OK. So what does this mean? We kind of take it for granted now. But what does cloud computing mean? Yeah? AUDIENCE: Off premise? DAVID MALAN: Off premise. OK, good. So what's off premise? AUDIENCE: You're hosting all your data off of your physical premises. [INAUDIBLE]. DAVID MALAN: Good. AUDIENCE: [INAUDIBLE]. DAVID MALAN: OK, applications, too. David? AUDIENCE: [INAUDIBLE] off-site storage. DAVID MALAN: Yeah, and this holds true for both businesses and consumers. Like, for us consumers in the room, we might be in the cloud in the sense that we're backing up our photos to iCloud or you're using Dropbox or some such service to store your data in the cloud. But a company could certainly do it, too. And that's where we'll focus today-- what you actually do if you're doing something for a business and you don't want to run your own on-premise servers. You don't want to have to worry about power and cooling and electricity and physical security and hiring someone to run all of that and paying for all of that. Rather, you'd much rather leverage someone else's expertise in and infrastructure for that kind of stuff and actually use their off-premise servers. And so cloud computing kind of came onto the scene a few years ago, when really it was just a nice new rebranding of what it means to outsource or to rent someone else's server space. But it's been driven, in part, by technological trends. Does anyone have a sense of what it is that's been happening in industry technologically that's made cloud computing all the more the rage and all the more technologically possible? AUDIENCE: It's faster? DAVID MALAN: Faster? What's faster? AUDIENCE: The give and take of the data between [INAUDIBLE]. DAVID MALAN: OK, so transfer rates, bandwidth between points A and B. It's possible to move data around faster. And now that that's the case, it's not such a big deal, maybe, if your photos are stored in the cloud because you can see them almost as quickly on your device, anyway. Sean? AUDIENCE: Cellular technology. DAVID MALAN: Cellular technology. How so? AUDIENCE: As far as speed. [INAUDIBLE]. DAVID MALAN: Yeah. Yeah, this is definitely kind of the bottleneck these days. But it's getting better. There was, like, EDGE, and then 3G, and now LTE and other such variants thereof. And so it's becoming a little more seamless, such that it doesn't matter where the data is because if you see it pretty much instantaneously, it doesn't matter how close or how far the data is. Other trends? AUDIENCE: Web hosts. DAVID MALAN: What's that? AUDIENCE: Web hosts. DAVID MALAN: Web-- AUDIENCE: Like, AWS, Google [INAUDIBLE]. DAVID MALAN: Oh, OK, sure. So these providers, these big players especially that have really popularized this. Amazon, in particular, was one of the first some years ago to start building out this fairly generic infrastructure as a service, IaaS, which is kind of the silly buzzword or buzz acronym for it-- Infrastructure as a Service, which describes the sort of virtualization of low-level services. And we'll come back to this in just a bit as to what that menu of options is and how they are representative of other offerings from, like, Google and Microsoft and others. What-- Grace? AUDIENCE: [INAUDIBLE] scale much faster? Like, the volumes of data [INAUDIBLE].
DAVID MALAN: OK. And what do you mean by the ability to scale faster? Where does that come from? AUDIENCE: Cloud server could scale up additional memory that you couldn't do if you were in a physical location. DAVID MALAN: OK. AUDIENCE: Like, having to buy and build out more servers at Harvard is much harder to do, to get more space. DAVID MALAN: Yeah. Absolutely. And let me toss in the word spikiness or spiky traffic, especially when websites have gotten mentioned on Buzzfeed or Slashdot or other such websites, where all of a sudden your baseline users might be some number of hundreds or thousands of people per day or per second or whatever. And then all of a sudden, there's a massive spike. And in yesteryear, the result of that would be your website goes offline. Well, why is that? Well, let's actually focus on that for a moment. Why would your website or web server physically go offline or break or crash just because you have a lot of users? What's going on? Sean? AUDIENCE: Kind of [INAUDIBLE] over here. The computer only has so many connections, I guess? DAVID MALAN: Uh-huh, OK. AUDIENCE: Maxed out? DAVID MALAN: Yeah. So if you-- a computer, of course, can only do a finite amount of work per unit of time, right? Because it's a finite device. There's some ceiling on how much disk space it has, CPU cycles, so to speak, how much it can do per second, how much RAM it actually has. So if you try to exceed that amount, sometimes the behavior is undefined, especially if the programmers didn't really worry about that upper bound scenario. And at the end of the day, there's really nothing you can do. If you are just getting request after request after request, at some point, something's got to break along the way. And maybe it's the routers in between you and point B that just start dropping the data. Maybe it's your own server that can't handle the load. And it just gets so consumed with handling request, request, request, even if it runs out of RAM, it might use virtual memory, as we discussed. But then it's spending all of its time temporarily moving those requests until you get locked in this cycle where now you've used all of your disk space. And frankly, computers do not like it when they run out of disk space. Bad things happen, mostly because the software doesn't anticipate that actually happening. And so things slow to a crawl, and the server effectively freezes, crashes, or some other ill-defined behavior. And so your server goes offline. So what do you do in cases of that spiky traffic, at least before cloud computing? AUDIENCE: More servers. DAVID MALAN: Right, more servers. So you sort of hope that the customers will still be there tomorrow or next week or next month when you've actually bought the equipment and plugged it in and installed it and configured it. And the thing is, there's a lot of complexity when it comes to wiring things up, both virtually and physically, as to how you design your software. So we'll come to that in just a moment. So cloud computing's gotten super alluring insofar as you can amortize your costs over all of the other hardware there-- you and other people can do this. And so when you do get spiky behavior-- assuming that not every other company and website on the internet is also getting a spike of behavior, which stands to reason, because there's only a finite number of users, so they have to go one way or the other. And there's many different providers out there.
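To make that finite-capacity point concrete, here is a toy Python sketch of a traffic spike overwhelming a server; the capacity and arrival numbers are entirely made up for illustration:

```python
# A toy model of spiky traffic: a server that can finish a fixed number of
# requests per second, fed by a spike that temporarily exceeds that capacity.
CAPACITY_PER_SECOND = 1000
arrivals = [800] * 5 + [5000] * 5 + [800] * 5   # a spike in the middle

backlog = 0
for second, incoming in enumerate(arrivals):
    backlog += incoming
    handled = min(backlog, CAPACITY_PER_SECOND)
    backlog -= handled
    print(f"t={second:2d}s  incoming={incoming:5d}  handled={handled:5d}  backlog={backlog:6d}")

# During the spike the backlog grows every second; in practice, long before it
# drains, timeouts and exhausted RAM or disk make the server appear to crash.
```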
You can consume all the more of Amazon or Microsoft's or Google's services, and then as soon as people lose interest in your site or the article that got reblogged or whatnot, then you sort of turn off those rented servers that you were borrowing from someone else. Now, ideally, all of this is automatic, and you yourself don't have to log in anywhere or make a phone call and actually scale up your services by saying, hey, we'd like to place an order for two more servers. And indeed, all the rage these days would be something like autoscaling, where you actually configure the service or write software that monitors, well, how much of my RAM am I using? How much disk space am I using? How many users are currently on my website right now? And you yourself define some threshold such that if your servers are at, like, 80% of capacity, you still have 20%, of course. But maybe you should get ahead of that curve, turn on automatically some more servers configured identically so that your overall utilization is maybe in the more comfortable zone of 50% or whatever it is you want to be comfortable with so that you can sort of grow and contract based on actual load. Yeah? AUDIENCE: That's like an elastic cloud. DAVID MALAN: Exactly. So elastic cloud is-- that's using two different Amazon terms. But yes, anything elastic means exactly this-- having the software automatically add or subtract resources based on actual load or thresholds that you set. Absolutely. So what's the downside of this? AUDIENCE: Security? DAVID MALAN: Security? How so? AUDIENCE: Getting all of your data types. DAVID MALAN: OK. So yeah. I mean, especially for particularly sensitive data, whether it's HR or financial or intellectual property or otherwise. Cloud computing literally means moving your data off of your own, what might have been internal servers, to someone else's servers. Now, there are private clouds, which is a way of mitigating this. And this is sort of a sillier marketing term. Private cloud means just having your own servers, like you used to. But maybe more technically it will often mean running certain software on it so that you abstract away the detail that those are your own servers so that functionally, they're configured to behave identically to the third-party servers. And all it is is like a line in a configuration file that says, send users to our private cloud or send users to our public cloud. So abstraction comes into play here, where it doesn't matter if it's a Dell computer or IBM computer or anything else. You're running software that creates the illusion to your own code, your own programs, that it could just be third-party servers or your own. It doesn't matter. But putting your data out there might not be acceptable. In fact, there are so many popular web services out there. GitHub, if you're familiar, for instance, is a very popular service for hosting your programming code. And your connection to GitHub will be encrypted between point A and B. But your code on their servers isn't going to be encrypted because the whole purpose of that site is to be able to share your code, either publicly or even internally, privately, to other people without jumping over hoops with encryption and whatnot. So it could exist, but it doesn't. But so many companies are putting really all of their software in the cloud because it's kind of trendy to do or it's cheaper to do or they didn't really think it through, any number of reasons. But it's very much in vogue, certainly, these days. 
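To make the autoscaling threshold idea above concrete, here is a minimal Python sketch of the arithmetic involved; the thresholds and function name are illustrative, and in practice a cloud provider's autoscaling service would apply rules like these for you:

```python
import math

# Sketch of the "80% capacity -> add servers until we're back near 50%" logic.
SCALE_UP_AT = 0.80    # "if your servers are at, like, 80% of capacity..."
TARGET = 0.50         # "...the more comfortable zone of 50%"

def desired_server_count(current_servers: int, avg_utilization: float) -> int:
    """Return how many identically configured servers we would like running."""
    if avg_utilization < SCALE_UP_AT:
        return current_servers                      # nothing to do yet
    total_load = current_servers * avg_utilization  # load in "server units"
    return math.ceil(total_load / TARGET)           # spread the load thinner

# Example: 4 servers at 90% utilization carry 3.6 servers' worth of load,
# which at a 50% target means we want ceil(3.6 / 0.5) = 8 servers.
print(desired_server_count(4, 0.90))   # 8
```

The same rule run in reverse (a scale-down threshold) is what lets the fleet contract again once the spike passes.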
What's another downside of using the cloud or enabling autoscaling? AUDIENCE: If you don't have internet, you don't have cloud. DAVID MALAN: Yeah. So if you don't have internet access, you don't have, in turn, the cloud. And this has actually happened in some ways, too. It's very much in vogue these days for software developers to just assume constant internet access and that third-party services will just be alive so much so that-- and we ourselves here on campus do this because it's cheaper and easier at the end of the day. But we increase these risks. If this is little old me on my laptop, and we're using some third-party service like GitHub here-- and there's equivalents of this-- and then maybe this is Amazon Web Services over here. And this is our cloud provider, this is the middleman that's storing our programming code just because it's convenient and it's an easy way for us to share so that I can use it, my buddies here on their laptops can use it. And so all of us might-- let's just put one arrow. So if these lines represent our laptops' connections to GitHub, where we're constantly sharing code and using it in a cloud sense to kind of distribute. If I make changes, I can push it here, then persons B and C can also access the same. It's very common these days with cloud services to have what are called hooks, so to speak, whereby a hook is a reaction to some event. And by that I mean if I, for instance, make some change to our website and I push it, so to speak, to GitHub or whatever third-party service, you can define a hook on GitHub's server that says any time you hear a push from one of our customers, go ahead and deploy that customer's code to some servers that have been preconfigured to receive code from GitHub. So you use this as sort of a middleman so that everyone can push changes here. Then that code automatically gets pushed to Amazon Web Services, where our customers can actually then see those changes. And this is an example of something called Continuous Deployment or CD, whereby, whereas in yesteryear or yester-yesteryear, companies would update their software once a year, every two years. You would literally receive in the mail a shrink-wrapped box before the internet. And then even when there was the internet and things like Microsoft Office, they might update themselves every few years. Microsoft Office 2008, Microsoft Office 2013, or whatever the milestones were. Much more in vogue these days, certainly among startups, is continuous deployment, whereby you might update your website's code five times a day, 20 times a day, 30 times a day. Anytime someone makes even the smallest change to the code, it gets pushed to some central repository, you maybe have some tests run-- also known as Continuous Integration, whereby certain tests are run automatically to make sure the code is working as expected. And if so, it gets pushed to someone like Amazon. So among the upsides here is just the ease of use. We as users over here, we don't have to worry about how to run our servers. We don't have to worry about how to share our code among collaborators. We can just pay each of those folks a few dollars per month, and they just make all this happen for us. But beyond money, what other price must we be paying? What are the risks? AUDIENCE: [INAUDIBLE]. DAVID MALAN: What's that? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Sure.
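To illustrate the hook idea just described, here is a minimal sketch of a tiny web service that a host like GitHub could be configured to POST to on every push; the deploy script path is hypothetical, the payload handling is simplified, and a real receiver would also verify the webhook's signature before acting on it:

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class PushHookHandler(BaseHTTPRequestHandler):
    """React to a "push" event by redeploying the site (sketch only)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        branch = payload.get("ref", "")           # e.g., "refs/heads/main"
        if branch.endswith("/main"):
            # The "reaction to some event": pull the new code and redeploy it.
            subprocess.run(["/opt/deploy.sh"], check=False)  # hypothetical script
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), PushHookHandler).serve_forever()
```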
So we're consuming a lot more bandwidth, which at least for stuff like code, not too worrisome, since it's small. Video sites, absolutely, would consume an order of magnitude more. Netflix has run up against that. Yeah, Victor? AUDIENCE: Latency? DAVID MALAN: Latency? How so? AUDIENCE: Servers are on different sides of the country. DAVID MALAN: Yeah. So whereas if you had a central server at your company, in this building, for instance, we could save our code centrally within a few milliseconds, let's say, or, like, a second. But if we have to push it to California or Virginia or whatever GitHub's main servers are, that could take longer, more milliseconds, more seconds. So you might have latency, the time between when you start an action and it actually completes. Other thoughts? Yeah. AUDIENCE: What about trust, security issue, [INAUDIBLE]. DAVID MALAN: Yeah, this is kind of one of the underappreciated details. I mean, fortunately for our academic uses, we're not really too worried about people stealing our code or the code that we use to administer assignments and such. But we're fairly unique in that sense. Any number of companies that actually use services like GitHub are putting all of their intellectual property in a third party's hands because it's convenient. But there's a massive potential downside there. Now thankfully, as an aside, just because I'm picking on GitHub as the most popular, they have something called GitHub Enterprise edition, which is the same software, but you get to run it on your own computer or your own servers or in the cloud. But even then, Amazon, in theory, has access to it because Amazon has physical humans as employees who could certainly physically access those devices, as well. And really all that's protecting you in that case there are SLAs or policy agreements between you and the provider. But this is, I think, an underappreciated thing. I mean, most any internet startup certainly just jumps on the latest and greatest technologies, I dare say without really thinking through the implications. But all it takes is for GitHub or whatever third-party service to be compromised. You could lose all of your intellectual property. But even more mundanely but significantly, what else could go wrong here? AUDIENCE: If the server went down-- DAVID MALAN: Yeah. When GitHub goes down, half of the internet seems to break these days, at least among trendy startups and so forth who are using this third-party service. Because if your whole system is built to deploy your code through this pipeline, so to speak, if this piece breaks, you're kind of dead in the water. Now, your servers are probably still running. But they're running older code because you haven't been able to push those changes. Now, you could certainly circumvent this. There's no fixed requirement of using this middleman. But it's going to take time, and then you have to re-engineer things. And it's just kind of a headache. So it's probably better just to kind of ride it out or wait until it resolves. But there's that issue, too. Very common too is for software development-- more on this tomorrow-- to rely on third-party libraries. A library is a bunch of code that someone else has written, often open source, freely available, that you can use in your own project. But it's very much in vogue these days to have deployment time-- to resolve your dependencies at deployment time. What do I mean by this? Suppose that I'm writing software that uses some third-party library. 
Like, I have no idea how to send emails, but I know that someone else wrote a library, a bunch of code, that knows how to send emails. So I'm using his or her library in my website to send my emails out. It's very common these days not to save copies of the libraries you're using in your own code repository, partly 'cause of principle. It's just redundant, and if someone else is already saving and archiving different versions of their email library, why should you do the same? It's wasting space. Things might get out of sync. And so what people will sometimes do is you store only your website's code here. You push it to some central source. And the moment it gets deployed to Amazon Web Services or wherever is when automatically, some program grabs all these other third-party services that you might have been using that get linked in as well. And we'll call these libraries. Of course, the problem there is exactly the same as if GitHub goes down. And it's funny, it's the stupidest thing-- let me see-- node.js left shift, left pad. OK. So this was all the rage. Here you go-- how one developer just broke Node, Babel, and thousands of projects. So this was delightful to read because in recent years, it's just become very common, this kind of paradigm-- to not only use libraries, which has been happening for decades, but to use third-party libraries that are hosted elsewhere and are pulled in dynamically for your own project, which has some upsides but also some downsides. And essentially, for reasons I'll defer to this article or can send the URL around later, someone who was hosting a library called left-pad, whose purpose in life is just to add, I think, white space-- so space characters-- to the left of a sentence, if you want. If you want to kind of shift a sentence over this way, it's not hard to do in code. But someone wrote this, and it's very popular to make it open source. And so a lot of people were relying on this very small library. And for whatever reason-- some of them, I think, personal-- this fellow removed his library from public distribution. And to this article's headline, all of these projects suddenly broke because these companies and persons are trying to deploy their code or update their code and no longer can this dependency be resolved. And I think-- I mean, what's amazing is how simple this is. So let me see if I can find the code. OK, so even if you're unfamiliar with programming-- well, this is not that much code. It looks a little scary because it has so many lines here. But half of these lines are what are called comments, just human-readable strings. This is not, like, a huge amount of intellectual property. Someone could whip this up in probably a few minutes and a bit of testing. But thousands of projects were apparently using this tiny, tiny piece of software. And the unavailability of it suddenly broke all of these projects. So these are the kinds of decisions, too, that might come as a surprise, certainly, to managers and folks wondering, why is the website down? Well, someone took down their third-party library. This is not, like, a great threat to software development per se. But it is sort of a side effect of very popular trends and paradigms in engineering-- having very distributed approaches to building your software, but you introduce a lot of what we would call, more generally, single points of failure. Like if GitHub goes down, you go down.
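The library in question was only a handful of lines of JavaScript; here is a rough Python equivalent of what a left-pad function does (not the original code), just to show how little software thousands of projects were depending on:

```python
def left_pad(value, length, fill=" "):
    """Pad the string form of `value` on the left until it is `length` long."""
    text = str(value)
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    while len(text) < length:
        text = fill + text
    return text

print(left_pad("foo", 5))      # "  foo"
print(left_pad(17, 5, "0"))    # "00017"
```

Python even ships with this behavior built in as str.rjust, which is part of what made the episode so striking.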
If Amazon Web Services go down, if you haven't engineered around this, you go down, as well. And so that's what's both exciting and sort of risky about the cloud, is if you don't necessarily understand what building blocks exist and how you can assemble all of those together. So let's come back to one final question, but first Vanessa. AUDIENCE: So what would be a best practice? 'Cause I know engineers I've worked with don't want to create dependency in their code. So they would do exactly [INAUDIBLE]. DAVID MALAN: Yeah. It kind of depends on cost and convenience. Like the reality is it is just-- especially for a young startup, where you really want to have high returns quickly from the limited resources and labor that you have. You don't necessarily want your humans spending a day, a weekend, a week sort of setting up a centralized code repository and all of the sort of configuration required for that. You don't necessarily want them to have to set up their own servers locally because that could take, like, a week or a month to buy and a week to configure. And so it's kind of this judgment call, whereby, yes, those would be better in terms of security and robustness and uptime. But it's going to cost us a month or two of labor or effort by that person. And so now we're two months behind where we want to be. So I would say it is not uncommon to do this. For startups, it's probably fine because you're young enough and small enough that if you go offline, it's not great, but you're not going to be losing millions of dollars a day as a big fish like Amazon might. So it kind of depends on what the cost benefit ratio is. And only you and they could determine that. I would say it's very common to do this. It is not hard to add your dependencies to your own repository. And this is perhaps a stupid trend. So I would just do that because it's really no cost. But then there's other issues here that we'll start to explore in a moment because you can really go overboard when it comes to redundancy and planning for the worst. And if there's only a, like, 0.001% chance of your website going offline, do you really want to spend 10,000 times more to avoid that threat? So it depends on what the expected cost is, as well. So we'll come to those decisions in just a moment. So one final question-- what else has spurred forward the popularity of cloud computing besides the sort of benefits to users and companies? What technologically has made this, perhaps, all the more of a thing? Ave? AUDIENCE: We're so reliant on [INAUDIBLE]. DAVID MALAN: Yeah. So this is a biggie. I mean, I alluded earlier to this verbal list of, like, power and cooling and physical space, not to mention the money required to procure servers. And back in the day-- it was only, like, 10 or so years ago-- I still remember doing this consulting gig once where we bought a whole lot of hardware because we wanted to run our own servers and run them in a data center where we were renting space. And maybe the first time around, it was fun to kind of crawl around on the floor and wire everything together and make sure that all of the individual servers had multiple hard drives for redundancy and multiple power supplies for redundancy and think through all of this. But once you've kind of done that once and spent for that much redundancy only to find that, well, occasionally your usage is here. Maybe it's over here. But you sort of have to pay for up here. It's not all that compelling.
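That expected-cost question about redundancy can be made concrete with some back-of-the-envelope arithmetic; every number below is made up for illustration:

```python
# Compare the expected loss from an outage against the cost of preventing it.
p_outage_per_year = 0.00001      # the "0.001% chance" mentioned above
loss_if_outage = 500_000         # hypothetical revenue lost in one outage
cost_of_redundancy = 50_000      # hypothetical yearly cost of the extra hardware

expected_loss = p_outage_per_year * loss_if_outage
print(f"expected yearly loss without redundancy: ${expected_loss:,.2f}")
print(f"cost of preventing it:                   ${cost_of_redundancy:,.2f}")
# Here you'd be spending $50,000 to avoid an expected $5 of loss -- the point
# being that redundancy only pays for itself when the expected cost is high.
```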
And it's also a huge amount of work that doesn't need to be done by you and your more limited team. So that's certainly driven it. Not to mention lack of space. Like at Harvard, we started using the cloud, in part, because we, for our team-- we had no space. We had no cooling. We kind of didn't really have power. So we really had no other options other than putting it under someone's desk. AUDIENCE: I was gonna say one other [INAUDIBLE] server and storage technology that makes it actually cost effective for these companies to do this. Where before, they could only do it for themselves. DAVID MALAN: That's what's really helped technologically. If you've heard of Moore's law, whose definition kind of gets tweaked every few years-- but it generally says that the number of transistors on a CPU doubles every 18 months or 12 months or 24 months, depending on when you've looked at the definition. But it essentially says that technological trends double every year, give or take, which means you have twice as many transistors inside of your computer every year. You have twice as much storage space for the same amount of money every year. You have twice as much CPU speed or cores, so to speak, inside of your computer every year. So there's this doubling. And if you think about a doubling, it's the opposite of the logarithmic curve we saw earlier, which still rises, but ever more slowly. Something like Moore's law is more like a hockey stick, where we're kind of more on this side nowadays, where the returns of having things double and double and double and double have really started to yield some exciting returns, so much so that this Mac here-- let's see, About This Mac. This is three gigahertz, running an Intel Core i7, which is a type of CPU, 16 gigabytes of RAM. So this means in terms of processor, CPU, speed, my computer can do 3.1 billion things per second. What is the limiting factor, then, in using a computer? I can only check my email so fast or reload Facebook so quickly. The human is by far the slowest piece of equipment standing in front of this laptop. And so we're at the point even with desktop or laptop computers that we have far more resources than even we humans know what to do with. And servers, by contrast, will have not just one or two CPUs. They might have 16 or 32 or 64 brains inside of them. They might have tens of gigabytes or hundreds of gigabytes of RAM that those CPUs can use. They might have terabytes and terabytes of space to use. And it's sort of more than individuals might necessarily need. You have so much more hardware and performance packed into such a small package that it would be nice to amortize the costs over multiple users. But at the same time, I don't want my intellectual property and my code and my data sitting alongside Nicholas's data and Ave's data and Sarah's data. I want at least my own user accounts and administrative privileges. I want some kind of barrier between my data and their data. And so the way that has-- what's really popularized this of late has been virtualization or virtual machines. And this is a diagram drawn from Docker's website, which is an even newer incarnation of this general idea. But if you're unfamiliar with virtualization, the user-facing feature that it provides is it allows you to run one operating system on top of another. So if you're running Mac OS, you can have Windows running in a window on your computer.
And conversely, if you run Windows, you can, in theory, run Mac OS in a window on your computer. But Apple doesn't like to let people do this, so it's hard. But you can run Linux or Unix-- these are other operating systems-- on top of Mac OS or on top of Windows, again, sort of visually within a window. But what that means is you are virtualizing one operating system and one computer, and using one computer to pretend that it can actually support multiple ones. So pictorially, you might have this. So infrastructure is just referring to your Mac or PC in this story. Host operating system is going to be Mac OS or Windows for most people in the room. Hypervisor is the fancy name given to a virtual machine monitor. It's virtualization software, like VMware Fusion, VMware Workstation, VMware Player-- suffice it to say, VMware is a big company in this space-- Oracle VirtualBox, Microsoft Virtual PC. And there's a few others-- the company's name might be Parallels. Parallels for Mac OS. There's a lot of different software that can do this. And as this picture suggests, it runs on top of the operating system. So it's just a program running on Mac OS or Windows. But then as these three towers suggest, what the hypervisor does for you is it lets you run as many as three different operating systems, even more, on top of your own. And you can think of it as being in separate windows. So now that this is possible, if I might go out and rent, effectively, in the cloud a really big server with way more disk space, way more RAM, way more CPU cycles than I need for my little business, well, you know what? I could chop this up, effectively, for Nicholas, Ave, Sarah, and myself so that each of us can run our own operating system-- different operating systems, no less. We can each have our own usernames and passwords. All of our data and code can be isolated from everyone else's. Now, whoever owns that machine, in theory, could access all of our work by having physical access. But at least Nicholas, Sarah, and Ave are compartmentalized, as am I, so that no one else can get at our own data. And so one of the reasons that virtualization is so trendy these days is we just have almost more CPUs and more space and more memory than we even know what to do with, at least within the footprint of a single machine. So that too, has spurred things forward. Now, as an aside, there's another technology-- no break yet. There's another technology that I alluded to a moment ago called containerization, which is, if you've not heard the term, an even lighter-weight version of this, whereby containers are similar in spirit to virtual machines but can be started and can be booted much faster than full-fledged virtual machines. We'll have more on those another time. Yeah, Anessa. AUDIENCE: So I know at least the team that I worked with [INAUDIBLE] containerization is the thing right now. And they're even building [INAUDIBLE]. What are some of the-- I just want to get a better understanding of the values and the risks of containerization. DAVID MALAN: Sure. So big fan. In fact, I and our team are in the process of containerizing everything that we do right now. So big fan. Let me see, what is Docker? So Docker is sort of the de facto standard right now, though there's variations of this idea. And the picture I showed is actually from their own comparison. Oh, they seem to have changed it. Now they've changed it to blue. But here is kind of a side-by-side comparison of the two ideas.
So on the left is virtualization, a sort of two-dimensional version of what we just saw in blue. And on the right is containerization. So one of the takeaways the picture is meant to convey is look how much lighter-weight Docker is on the right-hand side. There's just less clutter there. But that's kind of true. Containerization does the following-- or rather, virtualization has you running one base operating system and hypervisor on top of it, and then multiple copies of some other OS or OSes on top of those. Containerization has you run one operating system that all of your so-called containers share access to. So you install one operating system underneath it all. And then all of your containers share some other operating system of your choice. So that's already reducing from three down to just one operating system, for instance. Moreover, containerization tends to use a technique called union file system. A file system is just the fancy term for the way in which you store data on your hard drives and solid state drives and so forth. A union file system gives you the ability to layer things so that, for instance, the owner of this machine would install some base layer of software-- like, only the minimal amount of software necessary to boot the computer. But then Anessa, you and your team might need-- you might be writing your product in Python. So you need certain Python software and certain libraries. I, by contrast, might be writing my site in PHP. I don't need that layer. I need this layer of software. And what containerization allows you to do is all share everything that's down here, but only optionally add these layers such that only you see your layer, only I see my layer. But we share enough of the common resources that we can do more work on that machine per unit of time because we're not trying to run one, two three separate operating systems. We're really just running one at that layer. So that's the gist of it. And I would say the risks and the downsides are it's just so bleeding edge, still. I mean, it's very popular. I just came back from the Dockercon, the Docker conference in Seattle a few weeks ago. And there were a couple thousand people there. It was apparently doubled in size from last year. So containerization is all the rage. But the result of which is even on my own computer-- you can see Docker is installed on my computer. Actually, you can see the version number there. I am running version 1.12.0, release candidate three, beta 18, which means this is the 18th beta version or test version of the software. So stuff breaks. And bleeding edge can be painful. So the upside, though, on the other hand, is that Amazon, Google, Microsoft and others are all starting to support this. And what's nice is that it's a nice commoditization of what have been cloud providers. For many years, you would have to write your code and build your product in a way that's very specific to Google or Microsoft or Amazon or any number of third-party companies. And it's great for them. You're kind of bought in. But it's not great for you if you want to jump ship or change or use multiple cloud providers. So containerization is nice, popularized by Docker, the sort of leading player in this, in that it allows you to abstract away-- perfect tie in to earlier-- what it means to be the cloud. And you write your software for Docker, and you don't have to care if it's ending up on Google or Amazon or Microsoft or the like. So it's great in that regard. 
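As a loose analogy for that layering (and not Docker's actual implementation), Python's ChainMap behaves like a union of layers: each layer records only its own files, and a lookup falls through from the topmost layer down to the shared base:

```python
from collections import ChainMap

# Each "layer" only records its own files; containers share the base layer.
base_os_layer = {"/bin/sh": "shell", "/lib/libc": "C library"}
python_layer  = {"/usr/bin/python3": "interpreter"}   # one team's layer
php_layer     = {"/usr/bin/php": "interpreter"}       # another (hypothetical) layer

# Two "containers" built on the same base, each seeing only its own additions.
python_container = ChainMap(python_layer, base_os_layer)
php_container    = ChainMap(php_layer, base_os_layer)

print("/bin/sh" in python_container, "/bin/sh" in php_container)   # True True
print("/usr/bin/php" in python_container)                          # False
```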
So you shouldn't have any regrets, but you should realize that maybe with higher probability, you'll run into technical headaches versus other technologies. Really good question. All right. So if we now have the ability to have all of these various-- if we have the ability to run so many different things all on the same hardware, that means that no longer do we have to have just one server for our website. And indeed, this was inevitable because if you have a server that can only handle so many users per second or per day, surely once you're popular enough, you're going to need more hardware to support more users. So let's consider what starts to happen when we do that. So if I have just one little server here, called my www web server, per our conversation before lunch, what does that web server need to have in order to work on the internet? AUDIENCE: An IP address. DAVID MALAN: An IP address. So it has to have an IP address. And it has to have a DNS entry so that if I type in www.something.com, the servers in the world can convert that to an IP address, and Macs and PCs and everyone can find this on the internet. And we'll just abstract away the internet as a cloud so that the server is somehow connected to the internet. So that's great. The world is nice and simple when you just have one server. Now, suppose I want to store data for my website. Users are registering. Users are buying things. I want to store that information. And where, of course, is data like that usually stored, if you're generally familiar? What kind of technology do you use to store data? AUDIENCE: A database. DAVID MALAN: A database. Yeah, so a database. You could just store it in files. You could just save text files every time someone buys something, and that works. But it's not very scalable. It's not very searchable. So databases are products like-- Microsoft Access is a tiny version. Microsoft's SQL Server is a bigger one. Oracle is a behemoth. There's PostgreSQL. There's MySQL and bunches of others. But at the end of the day-- and actually, these are only the relational databases. There's also things like MongoDB. There's Redis for certain applications, though not necessarily as persistent, and bunches of others still. And those are object-oriented databases or document stores. But there's just a long list of ways of storing your data. And generally what all of these things provide, a database, is a way to save information, delete information, update information, and search for information. And the last one is the really juicy detail because especially as you're big and you're popular-- and to your analytics comment earlier, it'd be nice if you could actually select and search over data quickly so as to get answers more quickly. And that's what databases do. Oracle's intellectual property is sort of the secret sauce that helps you find your data fast, and same with all of these products, doing it better, for instance, than the competitor. So with that said, you could run not only web server software and a database on one physical server. In fact, super common, especially for startups or someone who's just got a test server under his or her desk. You just run all of these same servers on the same device. And among the servers you might run, you might have-- so these are databases. Let me keep this all together. These are database technologies. And on the other hand, we might have web servers like Microsoft IIS-- Internet Information Server.
Apache is a very popular web server for Linux and other operating systems. There's NGINX, which is also very popular, and bunches of others. So this is web server software. This is the server software that knows what to do when it receives a request like GET / HTTP/1.1. So when we did that quick example earlier when I visited google.com, they are running something like this on their server. But if they want to store data because people are buying things or they're logging information, they probably need to also run one of these servers. And a server, even though almost all of us think of it as a physical device, a server is really just a piece of software. And you can have multiple servers running on one physical device, one server. So it's confusing. The term means different things in different contexts. But you can certainly run multiple things on the same server. In fact, if this server is supposed to send email confirmations when people check out, this could be an email server, as well. If they've got built-in chat software for customer service, it could also be running there. But at the end of the day, no matter how much work this thing is doing, it can only do a finite amount of work. So what starts to break as soon as we need a second server? So suppose we need to invest in a second server. We have the money. We can do so. What do you do now if this now becomes www1, and this becomes www2? What kinds of questions do you need to ask? Or what might the engineers need to do to make this work? And I've deliberately removed the line because now, what gets wired to what? How does it all work? Yeah? AUDIENCE: If you update www1, does www2 get updated, as well? DAVID MALAN: Good question. Updated in what sense? Like, your code? AUDIENCE: Yeah, [INAUDIBLE] any server aspect. DAVID MALAN: Yeah, hopefully. So there's this wrinkle, right? If you want to update the servers, you could try to push the updates simultaneously. But there could be a slight delay. So one user might see the old software. One user might see the new, which doesn't feel great, but is a reality. What else comes to mind? Yeah? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah. It's more worrisome with the database. If I now continue my super simple world where I have a web server and an email server and a database server all on the same physical box, what if I happen to log in here, but Sarah-- sorry. I happen to end up here. Sarah ends up here. Now our data is on separate servers. And then maybe tomorrow, we visit the website again, and somehow, Sarah ends up over here, doesn't see her data. I end up over here. I don't see my data. This does not feel like a great design. So already, our super simple initial design of assume one server, everything running on it, breaks. What else might break? Or what else might we want to consider before we start fixing? Or if you put on the hat of the manager-- so I'm the engineering guy. I can answer all your questions. But you have to ask me the technical questions to get to a place of comfort yourself that this will actually work. What other questions should spring to mind? This is your business. No questions? Because I will just as soon leave everything disconnected. AUDIENCE: Yeah, I was gonna say, how do you have two databases-- no matter how a person logs into our website, how do we make sure their data is intact no matter where they log in?
DAVID MALAN: Ah. OK. Well, I've thought about that. Don't worry. We're actually going to have a third server. It's often drawn as a cylinder like this here. This will be the database. And these guys are both going to be connected to it. So I'm now going to have two tiers. And let me introduce some new jargon. I would typically call this my front end, here. And this I shall call the back end. And generally, front end means anything user facing, that the user's laptop or desktop might somehow talk to. Back end is something that the servers might only talk to. The user's not going to be allowed to talk here. All right? So I've answered that. We're going to centralize the database here so that there is no more data on individual servers. It's now centralized here. What other questions have you now? AUDIENCE: Do we need another DNS entry? Or how does it-- we have one IP address? DAVID MALAN: Well, we'll just tell our customers to go to www1.something.com. Or if it seems busy, go to www2. So how do we fix that? AUDIENCE: Both those should go to the same IP address. DAVID MALAN: Both of those should go to the same-- ideally, yes. OK, so I-- oh, Anessa, do you want to comment? AUDIENCE: I mean, you somehow need to be able to run [INAUDIBLE]. DAVID MALAN: And though, to be clear, I claim now there is no right server because the database is central. So now these are commodity. It doesn't matter which one you end up on so long as it has capacity for you. AUDIENCE: Right. So you need to do something to make sure that you're going to one [INAUDIBLE]. DAVID MALAN: OK. So what's the simplest way we've seen a company do this so far? We've only seen one. What did Yahoo do to balance load across their servers? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah, the round robin. Rotating their traffic via DNS. So, for instance, if someone's laptop out there on the internet requests www.something.com or Yahoo, the DNS server, which is not pictured here but is somewhere-- let's just say, yeah. We have a DNS server. It's over here. And I won't bother drawing the lines because it's kind of-- we'll just assume it exists. The first time someone asks for something.com, I'm going to give them the IP address of this server. The second time someone asks, I'm going to give them the IP address of that server. And then this one, and then this, and then da, da, da, and back and forth. What's good about this? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Now one doesn't get too busy, and-- AUDIENCE: It's pretty simple. DAVID MALAN: Simple is good, right? And this is underappreciated, but the easier you can architect your systems, in theory, the fewer things that might go wrong. So simple is good. It's really like one line in a configuration file on Yahoo's end to implement load balancing in apparently that way, though, to be fair, they have more than three servers. So there's a whole other layer of load balancing they're absolutely doing. So I'm oversimplifying. What's bad about this approach? Yeah? AUDIENCE: It depends what you're having people do when they get to those servers. If people are doing things that are radically out of scale with one another, [INAUDIBLE]. DAVID MALAN: Exactly. Even though with 50% odds, you're going one place or the other, what if Sarah is spending way more time on the website than I am? So she's consuming disproportionately more resources.
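A minimal sketch of that round-robin behavior, with made-up addresses, just to show how mechanical it is:

```python
from itertools import cycle

# One name, several IP addresses, handed out in rotation -- round-robin DNS.
SERVERS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]
rotation = cycle(SERVERS)

def resolve(hostname):
    """Pretend DNS lookup: each query for the same name gets the next address."""
    return next(rotation)

for _ in range(5):
    print("www.something.com ->", resolve("www.something.com"))
# .10, .11, .12, .10, .11 -- simple, but blind to how busy each server actually is.
```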
I might want to send more users to the server I'm on and avoid sending anyone to her server for some amount of time. So 50-50. Overall, given an infinite amount of time and an infinite number of resources, we'll all just kind of average out. But in reality, there might be spikiness. So that might not necessarily be the best way. So what else could we do? DNS feels overly simplistic. Let's not go there. Some companies, as an aside-- and look for this in life. It doesn't seem to happen too often, but it usually happens with kind of bad, bigger companies. Sometimes, you will see in the URL that you are at, literally, www1.something.com and www2.something.com. And this is, frankly, because of moronic technical design decisions where they are somehow redirecting the user to a different named server simply to balance load. And I say moronic partly to be judgmental, technologically, but also because it's technologically unnecessary, and it actually has downsides. Why might you not want to send a user to a name like www1, www2, and so forth? Why might you regret that decision? AUDIENCE: Then they might go there themselves. DAVID MALAN: They might go there themselves. So I go to www2.something.com. Why is that bad? Won't it be there? AUDIENCE: Maybe. DAVID MALAN: Maybe. AUDIENCE: Maybe next time, they want you to go to one. And now [INAUDIBLE]. DAVID MALAN: Yeah. So if your users maybe bookmark a specific URL, and they just kind of out of habit always go back to that bookmark, now your whole 50-50 fantasy is kind of out the window, unless people bookmark the websites with equal probability, which might be the case. But in either case, you're sort of losing a bit of control over the process. What if-- we're only talking about two servers, but what if it was www20? And you know what? You only need 19 servers nowadays, so you turned off number 20 or you stopped renting or paying for those resources. Now they've bookmarked a dead end, which isn't good. And frankly, most users won't have the wherewithal to realize when they click on that bookmark or whatnot, why isn't it working? They're just going to assume your whole business is offline and maybe shop elsewhere. So that's not good, either. So what could we do to solve this? And, in fact, let's not even give these things names because if we don't want users knowing about them or seeing them, they might as well just have IP addresses. This is IP address number one-- we'll just abstract it away. This is IP address number two. But the user doesn't need to know or care what those are. Yeah, Daria? AUDIENCE: Are those both running the same amount of services? Like, you've got an email server on one and, like, the web server software on one? Each of those IP one and two have everything except for the database? DAVID MALAN: At the moment, I was assuming that. But as an aside, even if they weren't, turns out with most of the strategies we'll figure out, you can weight your percentages a little differently. So it doesn't have to be 50-50. Could be 75-25, or you can take into account how much. AUDIENCE: It's like, can you pull more things out and just run them like the database? DAVID MALAN: Ah, OK. So let's say there is an email server. So let's call this the email server. Let's factor that out because it was consuming some resources unnecessarily. So I like the instincts. Unfortunately, you can only do this finitely many times until all that's left is the web server on both.
And even then, if we get more users than we can handle, we're just talking about two servers. We might need a third or a fourth. So we can never quite escape this problem. We can just postpone it, which is reasonable. Katie? AUDIENCE: Is there a way to put a cap on once one server has a certain amount of traffic, go to the other server? DAVID MALAN: Absolutely. We could impose caps. But to my same comment here, that still breaks as soon as we overload server number one and server number two. So we're still going to need to add more. But even then, how do we decide how to route the traffic? One idea that doesn't seem to have come up yet-- a buzzword, too, that we can toss up here-- is what's called vertical scaling. You can throw money at the problem, so to speak. So we kind of skipped a step, right? Instead of going from-- we went from one server to two servers, which was nice because it created a lot of possibilities but also problems. But why don't we just sell the old server, buy a bigger, better server, more RAM, more CPUs, more disk space, and just literally throw money at the problem, an upside of which is this whole conversation we're having now, let's just avoid it. Right? Let's just get rid of this. This is just too hard. Too many problems arise. Let's just put this as our web server. What's the upside? What's an upside? AUDIENCE: Simplest. DAVID MALAN: Simplest, right? I literally didn't have to think about this. All I had to do was buy a server, configure it, but it's configured identically. I just spent more money on it. What's the downside, of course? Same thing. I spent a lot of money on it. And, more fundamentally, what's the problem here? AUDIENCE: It's not a long-term solution. DAVID MALAN: It's not a long-term solution. I've postponed the issue, which is reasonable, if I've just got to get through some sales cycle or somehow get through the holiday season or something like that. But there's going to be this ceiling on just how many resources you can fit into one machine. Typically, especially from companies like Apple and even Dell and others, you're going to pay a premium for getting the very top of the line. So you're overspending on the hardware. And so companies like Google years ago began to popularize what has been called horizontal scaling, where instead of getting one big, souped-up version of something, you get the cheapest version, perhaps. You go the other extreme and just get lots and lots of cheaper or medium-spec devices. But unfortunately-- well, fortunately, that's great because in theory, it allows you to scale infinitely, so long as you have the money and the space and so forth for it. But it creates a whole slew of problems. So we're kind of back to where we were before. So DNS we proposed. Eh, it doesn't really cut it. It's not smart enough because DNS has no notion of weights or load. It has no feedback loop. All it does is translate domain names to IP addresses and vice versa. So what could we introduce to help solve this problem? The answer's kind of implicit on the board because we used a technique twice already now that could help us balance load. Yeah. AUDIENCE: Could you just have a feedback loop so that when you need more server space, you scale up, and when you need less, you scale down? DAVID MALAN: OK. So that'll get us to the point of autoscaling. And that'll allow us to add IP address number three and four and five.
But fundamentally, two is interesting because it's representative of an infinite supply of problems now, which is what if you have more than one server? The question at hand is how do we decide or what pieces of hardware or features do we need to add to this story in order to get data from users to server one or two or three or four or five or six. Yeah? AUDIENCE: Could you put-- I don't know-- another server or something on top of it that's just directing? DAVID MALAN: Yeah. In fact, it has a wonderful-- AUDIENCE: Like a router? DAVID MALAN: --word. Yeah, it wouldn't technically be called a router, though it's similar in spirit. Load balancer, which is, in a sense, a router. So a lot of these terms are interchangeable and more just conventions than anything else. I'll call this LB for load balancer. And that's exactly right. Now let me connect some lines. This looks like CB. That's LB, load balancer. So now, it is on the internet somehow with a public IP address. And these two servers have an IP address. But you know what? I'm going to call this private. And this one too will be private. This guy needs an IP that's public, which is not unlike our home router. So calling it a router is not unreasonable in that regard. And what does this load balancer need to do? Well, he's got to decide whether to route data to the left or to the right. And just to be clear, what might feed into that decision? AUDIENCE: Usage. DAVID MALAN: Usage. So I'm going to specifically draw these lines as bi-directional arrows. So there's some kind of feedback loop, or constantly these servers are saying, I'm at 10% capacity. I'm at 20% capacity. Or I have 1,000 users, or I have no users. Whatever the metric is that you care about, there could be that feedback loop. And then the load balancer could indeed route the traffic. And so long as the response that goes back also knows to go through the load balancer, it'll just kind of work seamlessly, much like our home network. So we've fixed that problem. I like that. What new problems have we created at this point in the whole story? AUDIENCE: Bottleneck? DAVID MALAN: Bottleneck? Where at? AUDIENCE: In the load balancer? DAVID MALAN: Yeah. This is kind of besides the point, right? Like, Grace, haven't you kind of broken-- it's a regression. Like we solved our problem of load earlier by doubling the number of servers. But to get that to work, you've proposed that we go back to one server because then it all just kind of works and we somehow flail the traffic to the left or to the right. So it's not wrong. So what's a pushback? This is OK, in some sense. Why is this OK, even though before it was not OK to just confine ourselves to one server? AUDIENCE: Because the load balancer's only job is to push people to [INAUDIBLE]. DAVID MALAN: Exactly. That's its sole purpose in life. And if it's reasonable to assume, which it kind of is, that the web servers probably have to do a little more work-- right? They have to talk to a database. They have to check a user out. They might have to trigger emails to be sent. It just feels like there's a bunch of work they need to do. The load balancer literally, in the dumbest sense, needs to just send 50% of traffic here, 50% here. But we know we can do better. So it needs to have a little bit of a sense of metrics. But at the end of the day, it's just like a traffic cop going this way or that way. Intuitively, that feels like a little bit less work.
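Here is a minimal sketch of that decision, assuming each back-end server reports its utilization through some feedback channel; the addresses and percentages are made up:

```python
def choose_backend(utilization_by_server):
    """Send the next request to whichever back-end server is least busy."""
    return min(utilization_by_server, key=utilization_by_server.get)

# "I'm at 10% capacity. I'm at 20% capacity." -- the reported metrics.
reported = {"10.0.0.1": 0.10, "10.0.0.2": 0.20}
print(choose_backend(reported))   # 10.0.0.1
```

Real load balancers offer several such policies (round robin, least connections, weighted splits), but the core job is exactly this small.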
And so indeed, you could throw maybe more resources at this one server and then get really good economy of scale by horizontally scaling your front end tier, so to speak, up until some actual threshold. And the thresholds are going to be twofold-- whether this is software or hardware that you either download or you buy physically, either you're going to have one, licensing restrictions, whereby whatever company you buy it from is going to say, this can handle 10,000 concurrent connections at a time. After that, you need to upgrade to our more expensive device or something like that. Or it could just be technological, like this device only has so much capacity. It can only physically handle 10,000 connections at a time, after which you're going to need to upgrade to some other device altogether. So it can be a mix of those. So that actually is a nice revelation of the next problem. OK, so I can easily spread this load out here to three. And I can add in another one over here. But what's going to break next, if not my front end web tier, so to speak? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Database? And what makes you think that? AUDIENCE: [INAUDIBLE] simultaneous queries. DAVID MALAN: Yeah. Like, to my concern earlier, we're horizontally scaling this to handle more users. But we just still are sending all the data to one place. And at some point, if we're successful-- and it's a good problem to have-- we're going to overwhelm our one database. So what might we do there? How do we fix that? More, all right. But now, the problem with doing this to your database is that the database, of course, is stateful. It actually stores information, otherwise generally known as state. The web servers I proposed earlier can be ignorant of state. All of their permanent data gets stored on the database. So if we do add another database into the picture, like down here, where do I put my data? Right? I could put my data here, Sarah's data here. But then we need to make sure that every subsequent request from me goes here and from Sarah goes here so that it's consistent. Or, of course, she's not going to see her data. I'm not going to see my data. So how do we solve that problem? AUDIENCE: The types of data [INAUDIBLE]. DAVID MALAN: OK. So we can use a technique that-- let me toss in a buzzword, sharding. To shard data means to, using some decision-making process, send certain data this way and other data this way. And an example we often use here on campus is back in the day of Facebook or thefacebook.com, when Mark finally started expanding it to other campuses, it was harvard.thefacebook.com and mit.thefacebook.com and berkeley.thefacebook.com or whatever the additional schools were, which was a way of sharding your setup because Harvard people were going to one setup and MIT people were going to another setup. But, of course, a downside early on, if you remember back then, is you couldn't really be friends with people in other networks. That was a whole other feature. But that was an example of sharding. Now, if you just have one website, it's possible and reasonable that anyone whose name starts with D might go to the left database. Anyone whose name starts with S might go to the second database. Or you could divide the alphabet similarly. So you could shard based on that. What might bite you there, if you're just sharding based on people's names? AUDIENCE: Growth. DAVID MALAN: Growth? You could maybe put A through M on one database server and N through Z on the other. 
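A minimal sketch of that A-through-M versus N-through-Z rule, with placeholder server names standing in for the two databases:

```python
# Shard users across two databases based on the first letter of their name.
SHARDS = {
    "A-M": "db1.internal",   # hypothetical first database server
    "N-Z": "db2.internal",   # hypothetical second database server
}

def shard_for(name):
    """Return which database server should hold this user's data."""
    first = name.strip().upper()[0]
    return SHARDS["A-M"] if "A" <= first <= "M" else SHARDS["N-Z"]

print(shard_for("David"))   # db1.internal
print(shard_for("Sarah"))   # db2.internal
```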
And that works fine initially, but eventually, you've got to split, like, the A's through the M's. And what about-- then you get down to the A's and the BA's, and then you have just A's. But one server's not enough for all the A's. So it's an ongoing problem. What else? AUDIENCE: Would sharding also apply if you're topically separating your data? Like, this becomes my sales database and my profile database-- DAVID MALAN: Absolutely. AUDIENCE: --and my product database. DAVID MALAN: Yep, you can decide it however you want. Names are one approach for very consumer-oriented databases. You can put different types of data on different databases. Same problem ultimately, as Griff proposes, whereby if you just have so much sales data, you might still need to shard that somehow. So at some point, you need to kind of figure out how to part the waters and scale out. But quite possible. Sarah? AUDIENCE: Could you shard based on time? Like long-term data gets stored-- DAVID MALAN: Oh, of course. Yep. You could put short-term data on some servers, long-term data on others. And long-term data, frankly, could be on older, even slower, or temporarily offline servers. And in fact, I can only conjecture, but I really don't understand why certain banks and big companies like Comcast and others only let me see my statements from the past year. Surely in 2016, if I can get my order history on Amazon from the 2000s, you can give me my bank statements from 13 months ago. And it's probably either a foolish or a conscious design decision that they're just offloading old data because it costs them, ironically, too much money to keep around their users' data. Other thoughts? AUDIENCE: What about instead of sharding, what about having some sort of overlay that's able to point to-- DAVID MALAN: OK. So we could have, even for sharding, some notion of load balancing here. And it's not load balancing in a generic sense. It needs to be a decision maker. Am I misinterpreting? AUDIENCE: Yes. So it can choose either one. It could figure out which database it's in. DAVID MALAN: OK. So based on that, I'm either going to call it-- well, it wouldn't be a load balancer if it's just generically doing this. It's some kind of device that's doing the sharding. And the sharding could either be done in software, to be fair, and the web servers could be programmed to send the A through M's over here and the N's through Z's over here. You don't necessarily need a middleman. But we certainly could, and it could be making those decisions such that all of these servers talk to that middleman first. But it turns out you could load balance. We could take the same principle of earlier of load balancing, but also solve this problem in a different way. Let me push this up for a moment. And if we erase this, let me actually call this a load balancer. And let me assume that now this is going to go to databases as follows. Let's actually do this, just so I have some more room. So here, I'll draw a slightly bigger database. Not uncommon with databases maybe to throw a little more money at it because it's a lot easier to keep your data initially all in one place. And so we might just vertically scale this thing initially. So throw money at it, so we still have the simplicity of one database, albeit a problem for downtime if this thing goes offline. But more on that in a moment. But what I'm really concerned with is writes. Changing information is what's ideally centralized. But reading information could come from redundant copies.
And so what's fairly common is maybe you have a bunch of read replicas. 00:59:20,537 --> 00:59:22,370 And that's not necessarily a technical term. It's just kind of a term of art here, where the replica, as the name suggests, is really just a duplication of this thing. But maybe it's a little smaller or slower or cheaper. But there's some kind of synchronization from this one to this one. So writes are coming into this one, and I'll represent writes with a W. But when the user's code running on the web servers wants to read information, that data's not going to come from here. So that arrow is not going to exist. Rather, all of the reads going back to the web servers are going to come from this device, historically called the slave or secondary, whereas this would be the master or primary. And what's nice about this topology is that we could have multiple read replicas. We could even add a third one in here. And the decision as to whether or not this works well is kind of a side effect of whatever your business is or the use case is. If your website is very read-heavy, this works great. You have one or just a few database servers devoted to writes-- so changes, deletions, additions, that kind of thing. But you can have as many read replicas allocated as you want, which are just real-time copies of the master database that your code actually reads from. So something like Facebook-- depends on the user. Many people on Facebook probably read more information than they post, right? Every time you log in, you might post one thing, maybe, let's say. But you might read 10 posts from friends. So in that sense, you're sort of read-heavy. And you can imagine other applications-- maybe on Amazon, you tend to window shop more. So you rarely buy things, but you shop around a lot. So you might be very read-heavy, but you only check out infrequently. So this topology might work well. Now, there's kind of a problem here. If I keep drawing more and more databases like this, what might start to break, eventually? 01:01:21,436 --> 01:01:22,360 AUDIENCE: The master. DAVID MALAN: The master, right? If we're asking the master to copy itself-- in parallel, no less-- to all of these secondary databases, at some point this has got to break, right? It can't infinitely handle traffic over here and then infinitely duplicate itself to all of these replicas down here. But what's nice and what's not uncommon is to have a whole hierarchical structure. You know what? So if that is worrisome, let's just then have one medium-sized or large replica here, but then replicate in sort of tree fashion off of it to these other replicas. So kind of push the problem away from the super-important special database, the write database-- write with a W. And then the read replicas down here have their own sort of hierarchy that gives us a bit of defense against that issue. Now, what problem still remains in this picture? What could go wrong, critique, somehow? AUDIENCE: I don't understand how you would [INAUDIBLE]. 01:02:24,640 --> 01:02:27,840 DAVID MALAN: So what I'm proposing is this is allowing us to scale.
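As a minimal sketch of that write/read split, assuming one primary that takes all writes and a couple of read replicas-- the host names are made up, and a real implementation would inspect queries far more carefully:

    import random

    # A minimal sketch of routing queries in a primary/replica setup.
    # Host names are hypothetical; in real code these would be database
    # connections rather than strings.

    PRIMARY = "db-primary"                       # all writes go here
    REPLICAS = ["db-replica-1", "db-replica-2"]  # reads spread across these

    def execute(query):
        # Crude check: writes (INSERT/UPDATE/DELETE) go to the primary;
        # everything else can be served by any replica.
        if query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            target = PRIMARY
        else:
            target = random.choice(REPLICAS)
        print(f"sending to {target}: {query}")

    execute("INSERT INTO posts VALUES ('hello')")  # goes to db-primary
    execute("SELECT * FROM posts")                 # goes to some replica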
So if, I assume, as in the Facebook scenario described, that most of my business involves reads, where I want to have as many read servers as possible, but I can get by with just one writeable server, then what's nice about this is that we can add additional read replicas, so to speak, and handle more and more and more users without having to deal with the problem that I proposed earlier as a bit of a headache-- how do we actually decide, based on sharding or some other logic, how to split our data? I just avoid splitting our data altogether. So that's the problem we've solved, is scaling. We can handle lots and lots and lots of reads this way. But there's still a problem, even if this is plenty of capacity for writing. Grace? AUDIENCE: The timing, then, between the writing and the reading and if you want to overwrite or update something you've just written where you're reading it from. DAVID MALAN: Yeah. There's a bit of latency, again, so to speak, between the time you start to do something and that actually happens. And you can sometimes see this, in fact, because of latency or caching, even on things like Facebook. An example I always think of is on occasion, I feel like I've posted or commented something on Facebook. Then in another tab, I might hit Reload, and I don't see the change. And then I have to reload, and then it's there. But it didn't have the immediate effect that I assumed it would. And that could be any number of reasons. But one of them could just be propagation delays. Like, this does not happen instantly. It's going to take some amount of time, some number of milliseconds, for that data to propagate. So you get minor inconsistencies. And that is problematic if, for instance, you read a value here, you write that change, then you're like, oh, no. Wait a minute. I want to fix whatever I just did. I want to fix a typo or something. You might end up changing this version instead of that version. 01:04:13,260 --> 01:04:16,090 That has to be a conscious design decision. Yes, that is possible, so this is imperfect. What else is worrisome here? Management should not sign off on this design, I would propose, at least if management has money with which they're willing to solve this problem. AUDIENCE: From a user point of view, it might be a lot of time [INAUDIBLE]. DAVID MALAN: Good-- not bad instinct. But we're talking milliseconds. So I would push back at this point and say, users are rarely if ever going to notice this. But fair. 01:04:49,250 --> 01:04:50,170 What could go wrong? Always consider that, and especially if you are starting something. What are the questions you would ask of the engineers you're working with? What could break first? You don't even have to be an engineer to sort of identify intuitively what could go wrong. Grace? AUDIENCE: You still have one master where you're writing everything. There's no redundancy there. There's no [INAUDIBLE]. DAVID MALAN: Yeah. And the buzz word here is single point of failure. Single points of failure, generally bad. And here, too, it does not take an engineering degree to isolate that, so long as you have a conceptual understanding of things. The fact that we have just one database for writes, one master database-- very bad in that sense. If this goes offline, it would seem that our entire back end goes offline, which is probably bad if our back end is where we're storing all of our products or all of our user data or all of our Facebook posts or whatever the tool might be doing. 
This is really the stuff we care about, the actual data. So not good. And, in fact, we dealt with that earlier by introducing some redundancy at the web tier. So consider what's good and bad about this-- suppose that, all right, I'm going to deal with this by adding a second writeable database. I realize it's going to cost money. But if you want me to fix this, that's the price we pay, literally. But there's technical prices we now need to pay. How do we kind of wire this thing in, the second database? Or what questions does it invite again? AUDIENCE: Write things simultaneously? DAVID MALAN: Yeah. Like, all right. So why don't we do that-- like sharding sounded like so much work. It's so hard to solve. It doesn't fundamentally solve the problem long-term. Let me just go ahead and write my data in duplicate to two places. Well, this would be an incorrect approach, but the theory isn't bad. What would typically happen, though, is this-- there's the notion of master-master relationships, whereby it doesn't matter which one you write to. You can configure databases to make sure that any changes that happen on one immediately and automatically get synced to the other. So it's called master-master replication, in this case. So that helps with that. And now, frankly, this opens up really interesting opportunities because we could have another database over here, and it could have its own databases. So we have this whole family tree thing going on that really gives us a lot of capacity-- horizontal scaling even though, paradoxically, it's all hierarchical in this case. So that's pretty good. And that's kind of a pretty common solution, if you have the money with which to cover that and you're willing to take on the additional complexity. Like Anessa, to some of your concerns with the team, this is more complicated. To someone else's point earlier, simple is good. This is no longer very simple. Thankfully, this is a common problem, so there's plenty of documentation and precedent for doing this. But this is the added kind of complexity that we have. There's still some other single points of failure on the screen. What are those? AUDIENCE: Load balancer? DAVID MALAN: Yeah, the load balancer. So damn it, that was such a nice solution earlier. But if we really want to be uptight about this, got to fix this, too. So let me go in there and put in one, two load balancers. Of course now, all right-- so we have two problems, on the outside and on the inside. So how does this actually work? What would you propose we do to get this topology working? AUDIENCE: Put another load balancer on top? DAVID MALAN: Yeah, we could kind of do this all day long, just kind of keep stacking and unstacking and stacking. So not bad, but not going to solve the problem fundamentally. What else might work here? AUDIENCE: Master-master load balancer [INAUDIBLE]. DAVID MALAN: Yeah, that's not bad. That doesn't really solve-- let's solve the first problem first. How do we decide for the users where their traffic ends up? So I am someone on a laptop. I type in something.com. I'm here in the cloud. I'm coming out of the cloud, ready to go to your website. Which load balancer do I hit and how? Yeah, Anessa? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah. So we didn't talk about this earlier. You could use DNS to segregate your users based on geography.
And companies like Akamai have done this for years, whereby if they detect an IP address that they're pretty sure is coming from North America, they might send you to one destination. If you're coming from Africa, you might go to another destination. And with high probability, they can take into account where the IP addresses are coming from just based on the allocation of IP addresses globally around the world, which is a centralized process. So that could help. We could do a little bit of that, and that's not a bad thing, thereby splitting our traffic. Unfortunately, in that model, if America's load balancer goes offline, then only African customers can visit. Or conversely, if the other one goes down, only the American customers can. So we'd have to do something adaptive, where we'd then have to quickly change DNS so that, OK, if the American load balancer's offline, then we better send all of our traffic to the servers and load balancer that's in Africa. Unfortunately, in the world of DNS, what have other DNS servers and Macs and PCs and browsers unfortunately done? They've cached the damn old address, which means some of our users might still be given the appearance that we're offline, even though we're up and running just fine with our servers in Africa in this case. So trade-off there. I'll spoil this one, only because it's perhaps not obvious. Typically, what you would do in a load balancer situation is only one of them would really be operational at a time, just because it's nice and simple and it avoids exactly that kind of slippery slope of a problem. But these two things are talking to one another via a technique you generally call heartbeats. And as the name implies, both of them kind of have heartbeats. And that means, in technical terms, that each of them just kind of sends a signal to the other every second or minute or whatever-- I'm alive. I'm alive. I'm alive. Because when one of them stops hearing the heartbeat from the other, it can infer with reasonable accuracy that, oh, that server must have died, literally or figuratively. I am going to proactively take over its IP address and start listening for requests on the internet on that same IP address. So you have one IP address, still, but it floats between the load balancers based on whichever one has decided I am now in charge, which then allows you to tolerate one of them going offline. And in theory, you could make this true for three or four, but you get diminishing returns. We could do the same thing down here. So if we really care and worry about this, we could do the same kind of heartbeat approach down here. And then with the databases, we're still OK with this kind of hierarchy. And this isn't so much a heartbeat, recall, as a synchronization. But dear God, look at what we've just built. What a nightmare, right? Remember what we started with. We started with this. And now we're up to this. But this is truly what it means to design, like, an enterprise-class architecture and to be resilient against failure and to handle load and to have built into the whole architecture the ability to scale. So a lot of desirable features. And we didn't go from that directly to that. It was, hopefully, a fairly logical story of frustrations and solutions along the way. But dear God, the complexity. Like, we are a far cry from what was proposed as simple before. So what does this mean? So in the real, physical world back in the day, you would buy these servers. You would wire them together. You would configure them.
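As an aside, to make the heartbeat idea from a moment ago concrete, here is a minimal sketch, assuming one load balancer listens for its peer's periodic "I'm alive" messages and claims the shared IP address if they stop; the timing and the takeover step are placeholders, since that part is OS- and network-specific in practice.

    import time

    # A minimal sketch of the heartbeat idea between two load balancers.
    # Everything here (timings, the takeover stub) is illustrative only.

    HEARTBEAT_TIMEOUT = 3.0        # seconds of silence before we assume our peer died
    last_heartbeat = time.time()   # updated whenever the peer says "I'm alive"

    def on_heartbeat_received():
        """Called whenever the other load balancer sends its 'I'm alive' message."""
        global last_heartbeat
        last_heartbeat = time.time()

    def take_over_shared_ip():
        """Stub: in reality, claim the floating IP so traffic now comes to us."""
        print("Peer seems dead; taking over the shared IP address.")

    def check_peer():
        if time.time() - last_heartbeat > HEARTBEAT_TIMEOUT:
            take_over_shared_ip()

    check_peer()   # would be called on a timer, e.g., once per second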
And when something dies, hopefully you have alerts set up. And it's just a laundry list of operational things. And so a common role in a company would be ops or operations, which is all the hardware stuff, the networking stuff, the behind-the-scenes stuff, the lower-level details. And it's fun, and it appeals to certain people. But it is being supplanted, in part, by cloud services. And what you get from these-- where's our acronym? Might have erased it-- Infrastructure As A Service, IAAS, is these same capabilities from companies like Amazon and Microsoft, but in the cloud. So if you want a load balancer, you don't buy a server and physically plug it in. You click a button on a website that gives you a software implementation of a load balancer. So what's nice is because of virtualization and because of software being so configurable, you can implement in software what has historically been a physical device. You can create the illusion that it's the same thing. And so that's what you're doing with a lot of cloud services. You're saying, give me a load balancer. Give it this IP address. Give me two back end servers or four back end servers. Give me a database, two databases. And it's all click, click, click or with a command line, textual interface. You're wiring things together virtually. So at the end of the day, it's the same skill set other than you don't need to physically plug things in anymore. But it's the same mental paradigm. It's the same kind of decision process, the same amount of complexity. But it's more virtual than it is physical. And, in fact, let me pull up the one I keep mentioning, only because I tend to use them myself here. This is Amazon Web Services. You'll see an overwhelming list of products these days. Frankly, it's to a fault, I think, how many damn different services they have. It is completely overwhelming, I think, to the uninitiated. And even I have kind of started to get confused as to what exists. But just to give you a teaser so you've at least heard of a few of them, Amazon EC2 is Elastic Compute Cloud. This is what you would use, typically, to implement your front end, your web server tier. But it really is just generic, virtualized servers that can do anything you want them to do. In our story, we would use them as web servers. Amazon has Elastic Load Balancing, which replaces our load balancers. And it's elastic in the sense that if you start to get a lot of traffic, they give you more capacity, either by moving your load balancer to a different virtual machine, a bigger one with more resources, or maybe giving you multiple ones. But what's nice and what's beautiful about their latching on to this word elastic is you don't have to think about it. Conceptually, though, it's doing this. So understanding the problem is still important and, I daresay, requisite. But you don't have to worry as much about the solution. Autoscaling is what decides how many of these front end web servers in our story to turn on. So Amazon, for you, albeit with some complexity-- it's not nearly as easily done as said-- can decide to turn these things on or off and give you one or two or three or 100 based on the load you're currently experiencing and the metrics you've specified. 01:15:23,630 --> 01:15:26,120 Amazon RDS, Relational Database Service. That's what can replace all of this complexity. You can just say, give me one big database server, and they'll figure out how big to make it and how to contract and expand.
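And just to make the autoscaling idea a bit more concrete before moving on, the rule it automates is conceptually something like the following sketch; the thresholds and limits are made up, and real autoscaling policies have more knobs (cooldown periods, minimum and maximum counts, and so on).

    # A conceptual sketch of an autoscaling rule: add servers when average
    # CPU is high, remove them when it's low. The numbers are hypothetical.

    def desired_server_count(current_count, average_cpu):
        if average_cpu > 0.75 and current_count < 100:   # too busy: scale out
            return current_count + 1
        if average_cpu < 0.25 and current_count > 1:     # mostly idle: scale in
            return current_count - 1
        return current_count                             # otherwise, leave it alone

    print(desired_server_count(3, 0.90))   # 4
    print(desired_server_count(3, 0.10))   # 2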
And actually, you can literally check a box that says replicate this, so you get an automated backup of it. So this is the kind of stuff that you would just spend so much time and money on as a human, building up, figuring out, updating, configuring. All of this has been abstracted away-- there, too, to our story earlier about abstraction. We certainly won't go through most of these, partly because I don't know all of them and partly because they're not so germane. But S3 is a common one-- Simple Storage Service. This is a way of getting nearly infinite disk space in the cloud. So for some years, Dropbox, for instance, was using Amazon S3 for their data. I believe they have since moved to running their own servers, presumably because of cost or security or the like. But you have gigabytes or terabytes of space. And, in fact, for courses I teach, we move all of our video files, which tend to be big-- we don't run anything on Harvard's servers. It's all run in the Amazon cloud because they abstract all of that detail away for us. So it's both overwhelming but also exciting in that there are all these ingredients. And the fact that this is so low-level, so to speak, is what makes it Infrastructure As A Service. Frankly, for startups and the like, more appealing, I think, is Platform As A Service, which is, if you will, a layer on top of this in terms of abstraction because at the end of the day, as interesting as it might be to a computer scientist or an engineer, oh, my dear God. I don't really care about load balancing, sharding. I don't really need to think about this to build my business, especially if it's just me or just a few people. The returns are probably higher on focusing on our application, not this whole narrative. And so if you go to companies like Heroku, which is very popular and is built on top of Amazon, you'll see, one, a much more reasonable list of options. But what's nice is-- let me find the page. It's a Platform As A Service in the sense that, ah, now we're focusing on what I care about as a software developer. What language am I using? What database technology do I want to use? I don't care what's wired to what or where the load balancing is happening. Just please abstract that all away from me, so you just give me a black box, effectively, that I can put my product on, and it just runs. And so here, your first decision point isn't the lower-level details I just rattled off on Amazon. It's higher-level details like languages, which we'll talk about more tomorrow, as well. So for a younger startup, honestly, starting at the Heroku layer tends to be much more pleasurable than the Amazon layer. Google, for instance, has App Engine and Compute Engine. They have any number of options, as well. Microsoft has their own. So if you google Microsoft or you bing Microsoft Azure, you'll find your way here. And in terms of how you decide which to use, I would generally, especially for a startup, go with what you know or what you know someone knows. Looking for recommendations? You can google around. Just so I don't forget to mention it, if you go to Hacker News, whose website is news.ycombinator.com-- Y Combinator being an investment fund-- this is a good place to stay current with these kinds of technologies. I would say that quora.com is very good, too. It's kind of the right community in which to discuss these kinds of technical decisions. And TechCrunch, although that's more newsy than it is thoughtful discussion.
So those three sites together, I would say-- especially if you are part of a startup, keeping those kinds of sources in mind and just kind of passively reading those things will help keep you at least current on a lot of these options and tools and techniques. But more on that tomorrow, as well. 01:19:04,040 --> 01:19:06,421 Any questions? Yeah? AUDIENCE: Could you go back to Heroku? DAVID MALAN: Sure. 01:19:12,188 --> 01:19:15,152 AUDIENCE: So essentially, it's all about [INAUDIBLE]. 01:19:19,407 --> 01:19:20,240 DAVID MALAN: You do. And let me see if I can find one more screen. The docs kind of change pretty frequently. 01:19:29,945 --> 01:19:32,180 AUDIENCE: [INAUDIBLE]. DAVID MALAN: I'm sorry? AUDIENCE: [INAUDIBLE]. 01:19:40,650 --> 01:19:42,020 DAVID MALAN: Oh, let's see. Deploy, build-- oh, yeah. So this is actually a very clever way-- OK. So this is all the stuff we just spent an hour talking about. Heroku makes the world feel like that. Yeah, so infrastructure, platform, infrastructure, platform. So this is nice, and this is compelling. So what are some of the down-- work with, let's see. This is just random [INAUDIBLE]. I do want to show one thing. Let's see. Pricing is interesting. Hobby. Dyno. So they have some of their-- dyno is not really a technical term. This is their own marketing thing for how much resource you get on their servers. This is what I wanted, like databases. No, that's not what I want. Features, add-ons. Explore add-ons pricing. OK, this is what's fun. So most of these-- well, it's also overwhelming, too. There are so many third-party tools, databases, libraries, software, services. We could spend a month, I'm sure, just even looking at the definitions of these things. What's nice about Heroku is they have figured out how to install all of this stuff, how to configure this stuff. And so if you want it, you just sort of click it and add it to your shopping cart, so to speak. And so long as you adhere to certain conventions that Heroku has-- so you have to design your app a little bit to be consistent with their design approach. So you're tied a little bit to their topology, though not hugely. You can just use these services so much more easily. And it's a beautiful, beautiful thing. But what's the downside then? Sounds like all win. 01:21:19,030 --> 01:21:21,410 AUDIENCE: A little less customizable. DAVID MALAN: A little less customizable. You're much more dependent on their own design decisions and their sort of optimization, presumably, for common cases that maybe your own unique cases or whatever don't fit well. Yeah, Sarah? AUDIENCE: So are they monitoring the databases less frequently, so you're less likely to predict something bad is going to happen down the road? DAVID MALAN: Yeah, good point. So maybe they're monitoring less frequently. And I would also say, even if that's not true, the fact that there's another party involved-- so it's not just Amazon. Now there's multiple layers where things could go wrong. That, too, feels worrisome along those lines. 01:22:01,090 --> 01:22:02,240 Sounds all good. So what's the catch? What's another catch? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah, it costs more. It's great to have all of these features and live in this little world where none of that underlying infrastructure exists, which is probably fine for a lot of people, certainly when they're first starting or getting a startup off of the ground.
But certainly once you have good problems like lots of users, that's the point at which you might need to decide, did we really engineer this at the right level? Should we have maybe started in advance, albeit at greater cost or complexity, so that over time, we're just ready to go and we're ready to scale? Or was it the right call to simplify things, pay for this value-added service, and not have to worry about those lower-level implementation details? So it totally depends. But I would say in general, certainly for a small, up-and-coming startup, simpler is probably good, at least if you have the money to cover the marginal costs. And I'll defer to the pricing pages here. But what is typical here is if we look at AWS pricing, for instance, there's any number of things they charge for, nickel and diming here and there. But it's often literally nickels and dimes. So thankfully, it takes a while for the costs to actually add up. Just to give you a sense, though, if you get, let's call it, a small server that has two gigabytes of memory, which is about as much as a small laptop might have these days, and essentially one CPU, you'll pay about 2.6 cents per hour to use that server. So if you do this-- if we pull up my calculator again, so it's that many cents per hour. There's 24 hours in a day. There's 365 days in a year. You'll pay $227 to rent that server year round. And this, too, is where there's a trade-off. In the consulting gig I alluded to earlier, we had to handle some-- I forget the number offhand-- hundreds of thousands of hits per day, which was kind of a lot. And we were moving from one old architecture to another. And so when we did the math back in the day, and it was some years ago, it was actually going to cost us quite a bit in the cloud because we were going to have so many recurring costs. Certainly after a year, after two years, we worried it was really going to add up. By contrast, we happened to go the hardware route at the time. The cloud and Amazon were not as mature at the time, either, so there were some risk concerns. But the upfront costs, significant-- thousands and thousands of dollars. But very low marginal costs or recurring costs thereafter. So that, too, was a trade-off. Other questions or comments? AUDIENCE: [INAUDIBLE]. DAVID MALAN: I'm sorry? AUDIENCE: Why do they do it by hour? Why not do it by year? DAVID MALAN: Oh, so that's a good question. So why do they do it per hour rather than per year? It's not uncommon in this cloud-based economy to really just want to spin up servers for a few hours. Like, if you get spiky traffic, you might get a real hit around the holidays. Or maybe you get blogged about, and so you have to tolerate this for a few hours, a few days or weeks. But after that, you definitely don't want to commit, necessarily, for the whole year. One of the best articles years ago when Amazon was first maturing was the New York Times-- let me see. New York Times Amazon EC2 TIFF. They did this-- yeah, 2008. Oh, no. It's this one, 2007. So this was an article I remember being so inspired by at the time. They had-- let's see, public domain articles. They had 11 million articles as PDF. Or they wanted to create, it sounds like, 11 million articles as PDFs. So this was an example of something that hopefully wasn't going to take them a whole year.
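As a quick sanity check on that per-hour arithmetic, assuming a rate of about $0.026 per hour, which is consistent with the roughly $227-per-year figure above:

    # Rough cost of renting one small server around the clock for a year,
    # assuming an hourly rate of $0.026 (consistent with the ~$227 figure above).

    hourly_rate = 0.026                              # dollars per hour
    hours_per_year = 24 * 365                        # 8,760 hours
    print(round(hourly_rate * hours_per_year, 2))    # roughly 227.76 dollars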
And, in fact, if you read through the article, one of the inspiring takeaways at the time was that the person who set this up used Amazon's cloud service to sort of scale up suddenly from zero to, I don't know, a few hundred or a few thousand servers, ran them for a few hours or days, shut it all down, and paid some dollar amount, but some modest dollar amount in the article. And I think one of his cute comments is he screwed up at one point. And the PDFs didn't come out right. But no big deal, they just ran it again. So twice the cost, but it was still relatively few dollars. And that was without having to buy or set up a single server at the New York Times. So for those kinds of workloads or data analytics where you really just need to do a lot of number crunching, then shut it all down, the cloud is amazing because it would cost you massive amounts of money and time to do it locally otherwise. Other questions or comments? 01:26:27,720 --> 01:26:29,080 That was the cloud. Let me propose this-- I sent around an email last night. And if you haven't already, do read that email, and make sure you are able to log in to cs50.io during the break. If not, just call me over, and I'll lend a hand. But otherwise, why don't we take our 15-minute break here and come back right after 3:15 to finish off the day?