KENNY YU: So, hi, everyone. I'm Kenny. I'm a software engineer at Facebook. And today, I'm going to talk about this one problem-- how do you deploy a service at scale? If you have any questions, we'll save them for the end.

So how do you deploy a service at scale? You may be wondering, how hard can this actually be? It works on my laptop-- how hard can it be to deploy it? I thought this too four years ago, when I had just graduated from Harvard. In this talk, I'll explain why it's hard. We'll go on a journey through many of the challenges you'll hit when you actually start to run a service at scale, and I'll talk about how Facebook has approached some of these challenges along the way.

A bit about me-- I graduated from Harvard in the class of 2014, concentrating in computer science. I took CS50 in fall 2010 and then TF'd it the following year, and I TF'd other classes at Harvard as well. TFing was one of my favorite experiences at Harvard, so if you have the opportunity, I highly recommend it. After I graduated, I went to Facebook, and I've been on the same team for the past four years, and I really like this team. My team is Tupperware. Tupperware is Facebook's cluster management system and container platform. Those are a lot of big words, and my goal is that by the end of this talk, you'll have a good overview of the challenges we face in cluster management, how Facebook is tackling some of those challenges, and, once you understand those, how this all relates to how we deploy services at scale.

So our goal is to deploy a service in production at scale. But first, what is a service? Let's define that. A service can have one or more replicas, and it's a long-running program-- it's not meant to terminate. It receives requests and gives responses back. As an example, you can think of a web server: if you're running Python, maybe a Python web server, or if you're running PHP, an Apache web server. It receives a request and gives a response back. You might have multiple of these, and multiple replicas together compose the service that you want to provide.

As an example, Facebook uses Thrift for most of its back-end services. Thrift is open source, and it makes it easy to do something called remote procedure calls, or RPCs-- it makes it easy for one service to talk to another.

As an example of a service at Facebook, let's take the website. For those of you that don't know, the entire website is pushed as one monolithic unit every hour. The thing that actually runs the website is HHVM, which runs our version of PHP, called Hack, which is a type-safe language. Both of these are open source. The way the website is deployed is that there are many, many instances of this web server running in the world. This service might call other services in order to fulfill your request. So let's say I hit the home page of Facebook-- it might need to fetch my profile and render some ads, so it will call, say, the profile service or the ad service. Anyhow, the website is updated every hour, and more importantly, as a Facebook user, you don't even notice this.

So here's a picture of what this all looks like. First, we have the Facebook web service. We have many copies of our web server, HHVM, running. Requests from Facebook users-- either from your browser or from your phone-- go to these replicas.
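To make the definition concrete, here is a minimal sketch of such a long-running service in Python-- a toy stand-in, not anything Facebook actually runs. The port and the endpoint paths are made-up choices for illustration.

```python
# A toy "service": a long-running process that answers requests until stopped.
# The endpoint names and the port are illustrative assumptions, not a real API.
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyServiceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # A supervisor could poll this to ask "are you alive?"
            body = b"OK"
        else:
            # Normal request handling; a real replica might call other
            # services (profile, ads, ...) before responding.
            body = b"Hello from one replica of the service!"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Long-running: serve_forever() only returns when the process is stopped.
    HTTPServer(("0.0.0.0", 8080), ToyServiceHandler).serve_forever()
```

You can ask it "are you alive?" with a request to /health; that health-check idea comes back later in the talk.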
Back to the picture: in order to fulfill those requests, a replica might have to talk to other services-- the profile service or the ad service-- and once it's gotten all the data it needs, it returns the response.

So how did we get there? We have something that works on our local laptop. Let's say you're starting a new web app, and you have a prototype working on your laptop. Now you actually want to run it in production, and there are challenges just to get that first instance running in production. Now let's say your app takes off. You get a lot of users, a lot of requests start coming in, and that single instance can no longer handle all the load-- so now you need multiple instances in production. And now let's say you start to add more features and more products. Your application gets more complex, and to manage that complexity, you might want to extract some of the responsibilities into separate components. So instead of just one service in production, you have multiple services in production. Each of these transitions involves lots of challenges, and I'll go over them along the way.

First, let's focus on the first transition: from your laptop to that first instance in production. What does this look like? The first challenge you might hit when you want to start that first copy in production is reproducing the same environment as your laptop. Let's say you're running a Python web app. You might have particular Python libraries and a particular Python version installed on your laptop, and now you need to reproduce exactly the same versions and libraries in the production environment. Your app might also make assumptions about where certain files are located. Let's say my web app needs some configuration file. It might be stored in one place on my laptop, and it might not even exist in the production environment, or it might exist in a different location. So the first challenge is that you need to reproduce the environment from your laptop on the production machine-- all the files and binaries that you need to run.

The next challenge is: how do you make sure that other stuff on the machine doesn't interfere with my work, and vice versa? Let's say there's something more important running on the machine, and I want to make sure my little web app doesn't interfere with it. As an example, let's say my service-- the dotted red box-- should use four gigabytes of memory and maybe two cores, and something else on the machine wants to use two gigabytes of memory and one core. I want to make sure that other service doesn't take more memory, start eating into my service's memory, and cause my service to crash or slow down-- and vice versa, I don't want to interfere with the resources used by that other service. This is a resource isolation problem: you want to ensure that no workload on the machine interferes with my workload, and vice versa.

Another problem with interference is protection. Let's say I have my workload in the red dotted box, and something else running on the machine in the purple dotted box. One thing I want to ensure is that the other thing can't kill, restart, or otherwise terminate my program accidentally-- say there's a bug in the other program and it goes haywire.
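Staying on the resource-isolation point for a second: on Linux, the usual mechanism for enforcing limits like "four gigabytes and two cores" is control groups (cgroups), which come up again shortly. Below is a rough, purely illustrative sketch of the idea using the cgroup v2 filesystem-- the cgroup name is invented, and it assumes cgroup v2 is mounted at /sys/fs/cgroup, that the memory and cpu controllers are enabled for child cgroups, and that you have root privileges. It is not how Tupperware actually does it.

```python
# Rough sketch: cap a process at 4 GiB of memory and 2 CPUs using cgroup v2.
# Assumes cgroup v2 at /sys/fs/cgroup, controllers enabled in the parent's
# cgroup.subtree_control, and root privileges. The cgroup name is made up.
import os
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/my_toy_service")


def limit(pid: int) -> None:
    # Creating a directory in the cgroup filesystem creates a new cgroup.
    CGROUP.mkdir(exist_ok=True)
    # Hard memory limit: past 4 GiB the kernel reclaims or kills this workload
    # instead of letting it eat a neighbor's memory.
    (CGROUP / "memory.max").write_text(str(4 * 1024**3))
    # CPU limit: 200ms of CPU time per 100ms period, i.e. at most 2 cores.
    (CGROUP / "cpu.max").write_text("200000 100000")
    # Move the process into the cgroup so the limits apply to it.
    (CGROUP / "cgroup.procs").write_text(str(pid))


if __name__ == "__main__":
    limit(os.getpid())
```

The kernel then throttles the process's CPU and caps its memory rather than letting it interfere with the other workloads on the machine.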
As for protection: the effects of that other program should be contained in its own environment, and it shouldn't be able to touch important files that I need for my service. Let's say my service needs some configuration file-- I'd really like it if nothing else could touch that file. So I want to isolate the environments of these different workloads.

The next problem is: how do you ensure that your service stays alive? Let's say your service is up, there's some bug, and it crashes. If it crashes, users can't use your service. Imagine if Facebook went down and users were unable to use Facebook-- that's a terrible experience for everyone. Or let's say it doesn't crash-- it's just misbehaving or slowing down-- and restarting it might mitigate the issue temporarily. What I'd really like is: if my service has an issue, restart it automatically, so that user impact is kept to a minimum. One way to do this is to ask the service, "Hey, are you alive?" Yes. "Are you alive?" No response. And if there's still no response after a few seconds, restart the service. The goal is that the service should always be up and running.

So here's a summary of the challenges in going from your laptop to one copy in production. How do you reproduce the same environment as your laptop? Once you're running on a production machine, how do you make sure no other workload affects my service, and my service doesn't affect anything critical on that machine? And how do I make sure my service is always up and running? Because the goal is for users to be able to use your service all the time.

There are multiple ways to tackle this. Two typical approaches companies take are virtual machines and containers. For virtual machines, the way I think about it is: you have your application running on top of an operating system, and that operating system is itself running on top of another operating system. If you've ever run Windows inside macOS on your Mac, it's a very similar idea. There are some issues with this: it's usually slower to create a virtual machine, and there's also an efficiency cost in terms of CPU. The other approach is containers: we run the application in an isolated environment that provides all the same guarantees as before, but directly on the machine's operating system. We avoid the overhead of a virtual machine, and this tends to be faster to create and more efficient. Here's a diagram showing how these relate to each other. On the left, you have my service-- the blue box-- running on top of a guest operating system, which itself is running on top of another operating system, and there's overhead because you're running two operating systems at the same time. With the container, we eliminate the overhead of that middle operating system and run the application directly on the machine, with some protection around it.

The way Facebook has approached these problems is to use containers. For us, the overhead of virtual machines is too much, so we use containers. To do this, we have a program called the Tupperware agent running on every machine at Facebook, and it's responsible for creating containers. To reproduce the environment, we use container images, and our container images are based on btrfs snapshots.
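To give a feel for what "container images based on btrfs snapshots" means in practice, here is a heavily simplified sketch that shells out to the standard btrfs command-line tool to snapshot a prepared image subvolume into a fresh root directory for a new container. The paths are invented, and this is not Tupperware's actual code-- just the general idea.

```python
# Sketch: create a container root filesystem by snapshotting a btrfs subvolume.
# Paths are invented; requires btrfs-progs, root privileges, and a btrfs mount
# where the image has been prepared as a subvolume.
import subprocess


def create_container_rootfs(image_subvolume: str, container_id: str) -> str:
    rootfs = f"/var/containers/{container_id}"
    # A btrfs snapshot is copy-on-write: it shares the image's data blocks,
    # so "copying" a multi-gigabyte image takes roughly constant time.
    subprocess.run(
        ["btrfs", "subvolume", "snapshot", image_subvolume, rootfs],
        check=True,
    )
    return rootfs


if __name__ == "__main__":
    print(create_container_rootfs("/var/images/my-web-app", "container-0"))
```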
Btrfs is a file system that makes it very fast to create copies of entire subtrees of a file system, and that makes it very fast for us to create new containers. For resource isolation, we use a feature of Linux called control groups, which lets us say: for this workload, you're allowed to use this much memory, CPU, or whatever other resources, and no more. If you try to use more than that, we'll throttle you, or we'll kill your workload to prevent you from harming the other workloads on the machine. And for protection, we use various Linux namespaces. I won't go into too much detail here-- there's a lot of jargon-- but if you want more detail, we have a public talk from our Systems @Scale conference in July 2018 that covers this more in depth.

Here's a picture that summarizes how this all fits together. On the left, you have the Tupperware agent-- the program running on every machine at Facebook that creates containers and ensures they're running and healthy. To actually create the environment for your container, we use container images, based on btrfs snapshots. And the protection layer we put around the container includes several things: control groups to control resources, and various namespaces to ensure that the environments of two containers are essentially invisible to each other, so they can't affect each other.

So now that we have one instance of the service in production, how can we get many instances in production? This brings a new set of challenges. The first challenge is: how do I start multiple replicas of a container? One approach is: given one machine, just start multiple replicas on that machine. And that works until the machine runs out of resources, and you need multiple machines to run all the copies of your service. So now you have to use multiple machines to start your containers, and you hit a classic new set of problems, because this is now a distributed systems problem: you have to get multiple machines to work together to accomplish some goal. In this diagram, what is the component that creates the containers on the multiple machines? There needs to be something that knows to tell the first machine to create containers, and the second machine to create containers-- or to stop containers as well.

And what if a machine fails? Let's say I have two copies of my service running on two different machines, and for some reason machine two loses power. This happens in the real world all the time. What happens then? I need two copies of my service running at all times in order to serve all my traffic, but now that machine two is down, I don't have enough capacity. Ideally, I'd want something to notice: "Hey, the copy on machine two is down. I know machine three has available resources. Please start a new copy on machine three for me." Ideally, some component would have all this logic and do all of this automatically for me. When a real-world failure happens and the workload is restarted on a different machine, that's known as a failover.

Now let's look at this problem from the caller's point of view. The callers, or clients, of your service have a different set of issues now.
In the beginning, there are two copies of my service running-- one on machine one and one on machine two-- and the caller knows they're on machine one and machine two. Now machine two loses power. The caller still thinks a copy is running on machine two, so it keeps sending traffic there, those requests fail, and users have a hard time. Now let's say there's some automation that knows, "Hey, machine two is down-- please start another copy on machine three." How is the client made aware of this? The client still thinks the replicas are on machine one and machine two; it doesn't know there's a new copy on machine three. So something needs to tell the client, "Hey, the copies are now on machine one and machine three." This is known as the service discovery problem. The question service discovery tries to answer is: where is my service running?

Another problem is: how do you deploy your service? Remember, I said the website is updated every hour. We have many copies of the service, it's updated every hour, and users never even notice. How is this possible? The key observation is that you never want to take down all the replicas of your service at the same time, because if all your replicas are down during that window, requests to Facebook would fail and users would have a hard time. So instead of taking them all down at once, one approach is to take down only a percentage at a time. As an example, let's say I have three replicas of my service and I can tolerate one replica being down at any given moment. Say I want to update my containers from blue to purple. What I would do is take down one, start a new one with the new software version, and wait until that's healthy. Once it's healthy and traffic is back to normal, I can take down the next one and update it, and once that's healthy, the next one. At the end, all my replicas are updated and healthy, and users have had no issues throughout this whole period.

Another challenge is: what if your traffic spikes? Let's say at noon the number of users using your app increases by 2x, and now you need to increase the number of replicas to handle the load. But at nighttime, when the number of users drops, it becomes too expensive to run the extra replicas, so you might want to tear some of them down and use those machines to run something else more important. So how do you handle this dynamic resizing of your service based on traffic?

So here's a summary of the challenges you face going from one copy to many copies. You have to actually be able to start multiple replicas across multiple machines, so there needs to be something that coordinates that. You need to handle machine failures, because once you have many machines, machines will fail in the real world. If containers are moving around between machines, how are clients made aware of the movement? How do you update your service without affecting clients? And how do you handle traffic spikes-- how do you add replicas, and how do you spin them down? These are all problems you'll face once you have multiple instances in production.

Our approach to solving this at Facebook is to introduce a new component, the Tupperware control plane, which manages the lifecycle of containers across many machines.
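To give a flavor of what a control plane does, here is a heavily simplified reconciliation loop. Every class, field, and helper below is invented for illustration-- the real system is far more involved-- but the core idea is the same: compare the number of healthy replicas you want with what is actually running, fix the difference, and update service discovery as you go.

```python
# Toy reconciliation loop: keep N replicas running despite machine failures.
# Everything here (classes, fields, helpers) is invented for illustration.
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    alive: bool = True                      # in reality, heartbeats decide this
    containers: set[str] = field(default_factory=set)


def reconcile(service: str, want: int, machines: list[Machine],
              discovery: dict[str, set[str]]) -> None:
    """One pass of the control loop for a single service."""
    # 1. Forget replicas on machines that look dead, and stop advertising them.
    for m in machines:
        if not m.alive and service in m.containers:
            m.containers.discard(service)
            discovery[service].discard(m.name)

    # 2. Count what is actually running.
    running = [m for m in machines if m.alive and service in m.containers]

    # 3. Start replacements on machines with free capacity, then publish them
    #    to service discovery so clients learn the new locations.
    for m in machines:
        if len(running) >= want:
            break
        if m.alive and service not in m.containers:
            m.containers.add(service)       # stand-in for "create a container"
            discovery[service].add(m.name)
            running.append(m)


if __name__ == "__main__":
    machines = [Machine("machine1"), Machine("machine2"), Machine("machine3")]
    discovery: dict[str, set[str]] = {"my_service": set()}

    reconcile("my_service", want=2, machines=machines, discovery=discovery)
    print(discovery)        # e.g. {'my_service': {'machine1', 'machine2'}}

    machines[1].alive = False               # machine two loses power
    reconcile("my_service", want=2, machines=machines, discovery=discovery)
    print(discovery)        # e.g. {'my_service': {'machine1', 'machine3'}}
```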
The control plane acts as a central coordination point for all the Tupperware agents in our fleet, and it solves the following problems. It is the thing that starts multiple replicas across many machines. If a machine goes down, it notices, and it's responsible for re-creating that container on another machine. It's responsible for publishing service discovery information, so that clients are made aware that a container is now running on a new machine. And it handles deployments of the service in a safe way.

Here's how this all fits together. You have the Tupperware control plane-- the green box-- which is responsible for creating and stopping containers. You have the service discovery system, which I'll just draw as a cloud and treat as a black box. It provides an abstraction where you give it a service name, and it tells you the list of machines that the service is currently running on. Right now there are no replicas of my service, so it returns an empty list. Clients of my service want to talk to my replicas, so the first thing they do is ask the service discovery system, "Hey, where are the replicas running?"

Let's say we start two replicas for the first time, on machine one and machine two, so you have two containers running. The next step is to update the service discovery system so that clients know the replicas are on machine one and machine two. Now everything is healthy and fine. Now let's say machine two loses power. Eventually the control plane will notice, because it's heartbeating with every agent in the fleet. It sees that machine two has been unresponsive for too long, it deems the work on machine two dead, and it updates the service discovery system to say the service is no longer running on machine two-- so clients stop sending traffic there. Meanwhile, it sees that machine three has available resources to run another copy of my service, so it creates a container on machine three, and once that container is healthy, it updates the service discovery system to let clients know they can send traffic to machine three as well. So this is how failover and service discovery are handled by the Tupperware control plane. I'll save questions for the end.

So what about deployments? Let's say I have three replicas already running and healthy, and clients know they're on machines one, two, and three. Now I want to push a new version of my service, and I can tolerate one replica being down at a time. Say I want to update the replica on machine one first. The first thing I want to do is make sure clients stop sending traffic there before I tear down the container. So first I tell the service discovery system that machine one is no longer available, and clients stop sending traffic there. Once they have, I can tear down the container on machine one and create a new one running the new version. Once that's healthy, I update the service discovery system to say, "Hey, clients, you can send traffic to machine one again"-- and in fact, you'll be getting the new version of the service there. The process repeats for machine two: we disable machine two, stop the container, re-create it with the new version, and then publish that information as well. And then we repeat again for machine three: we disable its entry so clients stop sending traffic there, stop the container, and re-create it.
And now clients can send traffic there as well. At this point, we've updated all the replicas of our service, we never had more than one replica down, and users are totally unaware that anything happened.

So now we're able to start many replicas of our service in production, update them, handle failovers, and scale to handle load. Now let's say your web app gets more complicated. The number of features and products grows, and it gets unwieldy having everything in one service, so you separate responsibilities into multiple services. Now your app is multiple services that you need to deploy to production, and you hit a different set of challenges.

To understand those challenges, here's some background about Facebook. Facebook has many data centers around the world, and this is a bird's-eye view of one of them. Each building has many thousands of machines, serving the website or ads, or running databases that store user information. Data centers are very expensive to create-- construction costs, purchasing the hardware, electricity, and maintenance. This is a very expensive investment for Facebook. Separately, there are many products at Facebook, and as a result, many services to support those products. So given that data centers are so expensive, how can we utilize all these resources efficiently? And another problem: reasoning about physical infrastructure is actually really hard. There are a lot of machines, and a lot of failures can happen. How can we hide as much of this complexity from engineers as possible, so that they can focus on their business logic?

The first problem-- how can we effectively use the resources we have-- becomes a bin-packing problem, and the Tupperware logo is actually a good illustration of this. Let's say the square represents one machine, and each container represents a different service or piece of work we want to run. We want to stack as many containers as will fit onto a single machine, to use that machine's resources most effectively. It's kind of like playing Tetris with machines. So our approach is to stack multiple containers onto as few machines as possible, resources permitting.

Our data centers are also spread out geographically across the world, and this introduces a different set of challenges. As an example, let's say we have a West Coast data center and an East Coast data center, and my service happens to be running only in the East Coast data center. Now a hurricane hits the East Coast and takes down that data center-- it loses power. Suddenly, users of that service can't use it until we create new replicas somewhere else. Ideally, we should spread our containers across the two data centers, so that when disaster hits one, the service continues to operate. The property we'd like is to spread replicas across what are known as fault domains, where a fault domain is a group of things that are likely to fail together. As an example, a data center is a fault domain of machines, because the machines are located together geographically and might all lose power at the same time.

Another issue is that hardware fails in the real world. Data center operators frequently need to put machines into repair: disk drives are failing, or a machine needs to be rebooted for whatever reason.
Or machines need to be replaced with a newer generation of hardware. So operators frequently need to say, "I need to take these machines into maintenance"-- but on those 1,000 machines there might be services from many different teams running. Without automation, the data center operators would have to interact with every one of those teams to have them safely move their replicas away before taking down all 1,000 machines. So the goal is: how can we safely replace those 1,000 machines in an automated way, with as little involvement from service owners as possible? In this example, a single machine might be running containers from five different teams, so without automation, five different teams would need to do work to move those containers elsewhere. And this can be genuinely hard for teams: sometimes a container stores local state on the machine, and that state needs to be copied somewhere else before the machine goes down. Or sometimes a team doesn't have enough replicas elsewhere, so if you take down this machine, they won't be able to serve all their traffic. So there are a lot of issues in doing this safely.

To recap the issues we face here: we want to use all the resources we have efficiently, we want replicas to be spread out in a safe way, and hardware will fail in the real world, so we want repairing hardware to be as safe and seamless as possible. The approach Facebook has taken is to provide abstractions that make it easier for engineers to reason about physical infrastructure. For example, we can stack multiple containers on a machine, and users don't need to know how that works. We provide abstractions that let users say, "I want to spread across these fault domains," and we take care of that-- they don't need to understand how it actually works. And we let engineers specify policies for how their containers may be moved around the fleet, with a high-level API for doing so, and we take care of how that actually happens.

So here's a recap. We have a service running on our laptop or developer environment, and we want to run it in production for real. Then we have more traffic than one instance can handle, so we start multiple replicas. Then our app gets more complicated, so instead of one service, we have many services in production. All the problems we've faced in this talk are problems in the cluster management space, and these are the problems my team is tackling.

So what exactly is cluster management? One way to understand it is through the stakeholders in the system. For much of this talk, we've focused on the perspective of service developers: they have an app they want to deploy to production as easily, reliably, and safely as possible, and ideally they should spend most of their energy on business logic, not on the physical infrastructure and the concerns around it. But our services need to run on real machines, and in the real world, machines fail. So data center operators-- what they want from the system is to be able to automatically and safely fix machines, and to have the system move containers around as needed.
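To make the data-center-operator side concrete, here is a toy sketch of the kind of safety check that "automatically and safely fix machines" implies: before draining a set of machines for maintenance, confirm that every affected service would still have enough healthy replicas elsewhere. Every name, data structure, and threshold here is invented for illustration; this is not Tupperware's API.

```python
# Toy safety check before draining machines for maintenance.
# All names, placements, and thresholds are invented for illustration.
from collections import Counter

# Which machines run which services (machine -> set of service names).
PLACEMENT = {
    "machine1": {"web", "ads"},
    "machine2": {"web", "profile"},
    "machine3": {"web", "ads", "profile"},
}

# Minimum number of replicas each service needs to keep serving its traffic.
MIN_REPLICAS = {"web": 2, "ads": 1, "profile": 1}


def safe_to_drain(machines_to_drain: set[str]) -> bool:
    """Return True only if every service keeps enough replicas after the drain."""
    remaining = Counter()
    for machine, services in PLACEMENT.items():
        if machine not in machines_to_drain:
            remaining.update(services)
    return all(remaining[svc] >= need for svc, need in MIN_REPLICAS.items())


if __name__ == "__main__":
    print(safe_to_drain({"machine3"}))               # True: every service still has enough
    print(safe_to_drain({"machine1", "machine2"}))   # False: "web" would drop to 1, below 2
```

In the real system this goes hand in hand with actually moving the containers (and any local state) off the machines first, but the basic constraint-- never let a drain take a service below the replica count it needs to serve its traffic-- is the same idea.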
A third stakeholder is efficiency engineers. They want to make sure we're actually using the resources we have as efficiently as possible, because data centers are expensive and we want to get the most bang for our buck. The intersection of all these stakeholders is cluster management: the problems we face and the challenges of working with these stakeholders are all in the cluster management space. To put it concisely, the goal of cluster management is to make it easy for engineers to develop services, while utilizing resources as efficiently as possible and keeping services running as safely as possible in the presence of real-world failures. For more information about what my team is working on, we have a public talk from the Systems @Scale conference in July that covers what we're currently working on and what we'll be working on for the next few years.

So to recap, this is my motto now: coding is the easy part; productionizing is the hard part. It's very easy to write some code and get a prototype working. The hard part is making sure it's reliable, highly available, efficient, debuggable, and testable so you can easily make changes in the future, handling all the existing legacy requirements, because you might have many existing users and use cases, and being able to withstand requirements changing over time. Let's say you get it working now-- what if the load grows by 10x or 100x? Will the design still work as intended? Cluster management systems and container platforms make at least the productionizing part a bit easier. Thank you.

[APPLAUSE]