KENNY YU: So, hi, everyone. I'm Kenny. I'm a software engineer at Facebook. And today, I'm going to talk about this one problem: how do you deploy a service at scale? And if you have any questions, we'll save them for the end.

So how do you deploy a service at scale? You may be wondering, how hard can this actually be? It works on my laptop; how hard can it be to deploy it? I thought this too four years ago, when I had just graduated from Harvard as well. In this talk, I'll talk about why it's hard, and we'll go on a journey to explore many of the challenges you'll hit when you actually start to run a service at scale. And I'll talk about how Facebook has approached some of these challenges along the way.

A bit about me: I graduated from Harvard, class of 2014, concentrating in computer science. I took CS50 in fall of 2010 and then TF'd it the following year. And I TF'd other classes at Harvard as well. I think one of my favorite experiences at Harvard was TFing, so if you have the opportunity, I highly recommend it. After I graduated, I went to Facebook, and I've been on this one team for the past four years. And I really like this team. My team is Tupperware, and Tupperware is Facebook's cluster management system and container platform. Those are a lot of big words, and my goal is that by the end of this talk, you'll have a good overview of the challenges we face in cluster management, how Facebook is tackling some of these challenges, and then, once you understand these, how this relates to how we deploy services at scale.

So our goal is to deploy a service in production at scale. But first, what is a service? Let's first define what a service is. A service can have one or more replicas, and it's a long-running program. It's not meant to terminate. It responds to requests and gives a response back. As an example, you can think of a web server: if you're running Python, a Python web server, or if you're running PHP, an Apache web server. It responds to requests, and you get a response back.
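To make that concrete, here's a minimal sketch of such a long-running program, using just Python's standard library. The port and response body are arbitrary; real production services are more involved, but the shape is the same: start, listen, and keep answering requests.

```python
# A minimal sketch of what "a service" means here: a long-running program
# that accepts requests and returns responses. Purely illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every request with a small response body.
        body = b"hello from one replica of my service\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen forever; the process is not meant to terminate on its own.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```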
And you might have multiple of these, and multiple of these together compose the service that you want to provide.

As an example, Facebook uses Thrift for most of its backend services. Thrift is open source, and it makes it easy to do something called Remote Procedure Calls, or RPCs. It makes it easy for one service to talk to another.

As an example of a service at Facebook, let's take the website. For those of you that don't know, the entire website is pushed as one monolithic unit every hour. And the thing that actually runs the website is HHVM. It runs our version of PHP, called Hack, as a type-safe language. Both of these are open source. And the way the website is deployed is that there are many, many instances of this web server running in the world. This service might call other services in order to fulfill your request. So let's say I hit the home page for Facebook. I might want to get my profile and render some ads. So the service will call maybe the profile service or the ads service. Anyhow, the website is updated every hour, and more importantly, as a Facebook user, you don't even notice this.

So here's a picture of what this all looks like. First, we have the Facebook web service. We have many copies of our web server, HHVM, running. Requests from Facebook users -- either from your browser or from your phone -- go to these replicas. And in order to fulfill the responses for these requests, it might have to talk to other services, like the profile service or the ads service. And once it's gotten all the data it needs, it will return the response back.

So how did we get there? We have something that works on our local laptop. Let's say you're starting a new web app. You have something working -- a prototype -- on your laptop. Now you actually want to run it in production. So there are some challenges there to get that first instance running in production. And now let's say your app takes off. You get a lot of users. A lot of requests start coming to your app.
And now that single instance you're running can no longer handle all the load, so now you need multiple instances in production. And now let's say you start to add more features to your app. You add more products. The complexity of your application grows. In order to simplify that, you might want to extract some of the responsibilities into separate components. And now instead of just having one service in production, you have multiple services in production. Each of these transitions involves lots of challenges, and I'll go over each of these challenges along the way.

First, let's focus on the first one: from your laptop to that first instance in production. What does this look like?

The first challenge you might hit when you want to start that first copy in production is reproducing the same environment as your laptop. Let's say you're running a Python web app. You might have various Python libraries or Python versions installed on your laptop, and now you need to reproduce the same exact versions and libraries in that production environment. So versions and libraries -- you have to make sure they're installed in the production environment. And also, your app might make assumptions about where certain files are located. Let's say my web app needs some configuration file. It might be stored in one place on my laptop, and it might not even exist in the production environment, or it may exist in a different location. So the first challenge here is that you need to reproduce the environment that you have on your laptop on the production machine. This includes all the files and the binaries that you need to run.

The next challenge is: how do you make sure that stuff on the machine doesn't interfere with my work, and vice versa? Let's say there's something more important running on the machine, and I want to make sure my dummy web app doesn't interfere with that work. As an example, let's say my service -- the dotted red box -- should use four gigabytes of memory, maybe two cores.
And something else on the machine wants to use two gigabytes of memory and one core. I want to make sure that that other service doesn't take more memory and start using some of my service's memory, causing my service to crash or slow down -- and vice versa, I don't want to interfere with the resources used by that other service. So this is a resource isolation problem. You want to ensure that no workload on the machine interferes with my workload, and vice versa.

Another problem with interference is protection. Let's say I have my workload in the red dotted box, and something else running on the machine in the purple dotted box. One thing I want to ensure is that that other thing doesn't somehow kill or restart or terminate my program accidentally. Let's say there's a bug in the other program and it goes haywire. The effects of that service should be isolated to its own environment. And also, that other thing shouldn't be touching important files that I need for my service. Let's say my service needs some configuration file. I would really like it if something else doesn't touch that file that I need to run my service. So I want to isolate the environments of these different workloads.

The next problem you might have is: how do you ensure that a service is alive? Let's say you have your service up. There's some bug, and it crashes. If it crashes, this means users will not be able to use your service. Imagine if Facebook went down and users were unable to use Facebook -- that's a terrible experience for everyone. Or let's say it doesn't crash; it's just misbehaving or slowing down, and restarting it might help mitigate the issue temporarily. So what I would really like is, if my service has an issue, please restart it automatically so that user impact is at a minimum. And one way you might be able to do this is to ask the service: hey, are you alive? Yes. Are you alive? No response. And then after a few seconds of that, if there's still no response, restart the service. So the goal is that the service should always be up and running.
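Here's a minimal sketch of that idea: a small supervisor that launches the service, polls a health endpoint, and restarts the process after a few missed responses. The command, URL, and thresholds are made up for illustration; this is not how Facebook's health checking actually works.

```python
# A toy liveness checker, assuming the service exposes a hypothetical
# /health endpoint and can be restarted by relaunching its process.
import subprocess
import time
import urllib.request

CMD = ["python3", "my_service.py"]            # hypothetical service command
HEALTH_URL = "http://localhost:8080/health"   # hypothetical health endpoint

def is_alive() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

proc = subprocess.Popen(CMD)
misses = 0
while True:
    time.sleep(5)                      # ask "are you alive?" every few seconds
    if is_alive():
        misses = 0
        continue
    misses += 1
    if misses >= 3:                    # a few failed checks in a row
        proc.kill()                    # give up on this copy...
        proc.wait()
        proc = subprocess.Popen(CMD)   # ...and restart it automatically
        misses = 0
```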
So here's a summary of the challenges to go from your laptop to one copy in production. How do you reproduce the same environment as your laptop? How do you make sure that, once you're running on a production machine, no other workload is affecting my service, and my service isn't affecting anything critical on that machine? And then how do I make sure that my service is always up and running? Because the goal is for users to be able to use your service all the time.

There are multiple ways to tackle this. Two typical ways companies have approached this problem are virtual machines and containers. For virtual machines, the way that I think about it is that you have your application, it's running on top of an operating system, and that operating system is running on top of another operating system. If you've ever used dual boot on your Mac -- running Windows inside a Mac -- that's a very similar idea. There are some issues with this. It's usually slower to create a virtual machine, and there's also an efficiency cost in terms of CPU.

Another approach that companies take is to create containers. We can run our application in some isolated environment that provides all the guarantees as before and run it directly on the machine's operating system. We can avoid the overhead of a virtual machine, and this tends to be faster to create and more efficient. And here's a diagram that shows how these relate to each other. On the left, you have my service -- the blue box -- running on top of a guest operating system, which itself is running on top of another operating system. There's some overhead because you're running two operating systems at the same time. Versus the container: we eliminate that extra overhead of the middle operating system and run our application directly on the machine, with some protection around it.

So the way Facebook has approached these problems is to use containers. For us, the overhead of using virtual machines is too much, and that's why we use containers.
To do this, we have a program called the Tupperware agent running on every machine at Facebook, and it's responsible for creating containers. And to reproduce the environment, we use container images. Our container images are based on btrfs snapshots. Btrfs is a file system that makes it very fast to create copies of entire subtrees of a file system, and this makes it very fast for us to create new containers.

Then for resource isolation, we use a feature of Linux called control groups, which allows us to say: for this workload, you're allowed to use this much memory, CPU, whatever resources, and no more. If you try to use more than that, we'll throttle you, or we'll kill your workload to prevent you from harming the other workloads on the machine.

And for protection, we use various Linux namespaces. I'm not going to go into too much detail here -- there's a lot of jargon. If you want more detailed information, we have a public talk from our Systems @Scale conference in July 2018 that covers this more in depth. But here's a picture that summarizes how this all fits together. On the left, you have the Tupperware agent. This is a program running on every machine at Facebook that creates containers and ensures that they're all running and healthy. Then, to actually create the environment for your container, we use container images, based on btrfs snapshots. And the protection layer we put around the container includes multiple things: control groups to control resources, and various namespaces to ensure that the environments of two containers are essentially invisible to each other and they can't affect each other.
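As a rough illustration of the control-groups mechanism, here's a sketch that talks to the Linux cgroup v2 filesystem directly to cap a process at four gigabytes of memory and two cores. The group name and command are made up, it needs root and a cgroup v2 mount, and it shows the underlying kernel interface rather than how the Tupperware agent is actually implemented.

```python
# Cap a workload's memory and CPU with Linux control groups (cgroup v2).
import os
import subprocess

CGROUP = "/sys/fs/cgroup/my_demo_container"   # hypothetical group name

os.makedirs(CGROUP, exist_ok=True)

# Allow at most 4 GiB of memory; the kernel reclaims or kills beyond this.
with open(os.path.join(CGROUP, "memory.max"), "w") as f:
    f.write(str(4 * 1024**3))

# Allow at most 2 CPUs' worth of time: 200ms of CPU per 100ms period.
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write("200000 100000")

# Launch the workload and move it into the group so the limits apply.
proc = subprocess.Popen(["python3", "my_service.py"])   # hypothetical command
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(proc.pid))
```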
So now that we have one instance of the service in production, how can we get many instances of the service in production? This brings a new set of challenges. The first challenge is: how do I start multiple replicas of a container? One approach you might take is, OK, given one machine, let's just start multiple on that machine. And that works until that machine runs out of resources, and you need to use multiple machines to start multiple copies of your service.

So now you have to use multiple machines to start your containers, and now you're going to hit a new set of classic problems, because this is now a distributed systems problem. You have to get multiple machines to work together to accomplish some goal. And in this diagram, what is the component that creates the containers on the multiple machines? There needs to be something that knows to tell the first machine to create containers and the second machine to create containers, or to stop containers as well.

And now what if a machine fails? Let's say I have two copies of my service running on two different machines. For some reason, machine two loses power. This happens in the real world all the time. What happens then? I need two copies of my service running at all times in order to serve all the traffic my service has. But now that machine two is down, I don't have enough capacity. Ideally, I would want something to notice: hey, the copy on machine two is down; I know machine three has available resources; please start a new copy on machine three for me. So ideally, some component would have all this logic and do all this automatically for me, and this problem is known as failover. When real-world failures happen, we ideally want to be able to restart that workload on a different machine, and that's known as failover.

So now let's look at this problem from the caller's point of view. The callers, or clients, of your service have a different set of issues now. In the beginning, there are two copies of my service running -- a copy on machine one and a copy on machine two. The caller knows that it's on machine one and machine two. Now machine two loses power. The caller still thinks that a copy is running on machine two. It's still going to send traffic there. The requests are going to fail. Users are going to have a hard time.
Now let's say there is some automation that knows: hey, machine two is down, please start another copy on machine three. How is the client made aware of this? The client still thinks the replicas are on machine one and machine two. It doesn't know that there's a new copy on machine three. So something needs to tell the client: hey, the copies are now on machine one and machine three. And this problem is known as the service discovery problem. The question service discovery tries to answer is: where is my service running?

Another problem we might face is: how do you deploy your service? Remember I said the website is updated every hour. We have many copies of the service, it's updated every hour, and users never even notice that it's being updated. So how is this even possible? The key observation here is that you never want to take down all the replicas of your service at the same time, because if all of your replicas are down in that time period, requests to Facebook would fail, and then users would have a hard time. So instead of taking them all down at once, one approach you might take is to take down only a percentage at a time.

As an example, let's say I have three replicas of my service, and I can tolerate one replica being down at any given moment. Let's say I want to update my containers from blue to purple. What I would do is take down one, start a new one with the new software update, and wait until that's healthy. Once that's healthy and traffic is back to normal again, I can take down the next one and update that. And then once that's healthy, I can take down the last one. And now all my replicas are updated and healthy, and users have not had any issues throughout this whole time period.
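Here's a minimal sketch of that rolling-update loop, assuming hypothetical replica objects with stop/start/health operations. The point is simply that at most one replica is ever down at a time; the same loop generalizes to taking down a fixed percentage at a time.

```python
# A toy rolling update: update replicas one at a time so that at most one
# is ever down. The replica API here is a hypothetical stand-in.
import time

def wait_until_healthy(replica, poll_seconds=5):
    while not replica.is_healthy():
        time.sleep(poll_seconds)

def rolling_update(replicas, new_version):
    for replica in replicas:
        replica.stop()                       # take down the old (blue) copy
        replica.start(version=new_version)   # bring up the new (purple) copy
        wait_until_healthy(replica)          # only then move on to the next one
```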
Another challenge you might hit is: what if your traffic spikes? Let's say at noon, the number of users of your app increases by 2x, and now you need to increase the number of replicas to handle the load. But then at nighttime, the number of users decreases, and it becomes too expensive to run the extra replicas, so you might want to tear some of them down and use those machines to run something else more important. So how do you handle this dynamic resizing of your service based on traffic?

So here's a summary of some of the challenges you might face when you go from one copy to many copies. First, you have to actually be able to start multiple replicas on multiple machines, so there needs to be something that coordinates that. You need to be able to handle machine failures, because once you have many machines, machines will fail in the real world. Then, if containers are moving around between machines, how are clients made aware of this movement? How do you update your service without affecting clients? And how do you handle traffic spikes -- how do you add more replicas, and how do you spin down replicas? These are all problems you'll face when you have multiple instances in production.

Our approach to solving this at Facebook is to introduce a new component, the Tupperware control plane, which manages the lifecycle of containers across many machines. It acts as a central coordination point between all the Tupperware agents in our fleet. This solves the following problems. It is the thing that starts multiple replicas across many machines. If a machine goes down, it will notice, and it is its responsibility to recreate that container on another machine. It is responsible for publishing service discovery information, so that clients are made aware that the container is running on a new machine. And it handles deployments of the service in a safe way.

So here's how this all fits together. You have the Tupperware control plane, which is this green box. It's responsible for creating and stopping containers. You have the service discovery system, which I'll just draw as a cloud, as a black box.
It provides an abstraction where you give it a service name, and it tells you the list of machines that the service is running on. Right now, there are no replicas of my service, so it returns an empty list for my service. Clients of my service want to talk to my replicas, but the first thing they do is ask the service discovery system: hey, where are the replicas running?

So let's say we start two replicas for the first time, on machine one and machine two. You have two containers running. The next step is to update the service discovery system so that clients know they're running on machine one and machine two. So now things are all healthy and fine.

And now let's say machine two loses power. Eventually, the control plane will notice, because it's trying to heartbeat with every agent in the fleet. It sees that machine two has been unresponsive for too long. It deems the work on machine two as dead, and it will update the service discovery system to say: hey, the service is no longer running on machine two. And now clients will stop sending traffic there. Meanwhile, it sees that machine three has available resources to create another copy of my service. It will create a container on machine three. And once the container on machine three is healthy, it will update the service discovery system to let clients know: you can send traffic to machine three now as well. So this is how failover and service discovery are managed by the Tupperware control plane. I'll save questions for the end.
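Here's a toy sketch, in the same spirit, of the failover loop just described: heartbeat every machine, mark replicas on unresponsive machines dead, remove them from service discovery, and recreate them wherever there is spare capacity. Every name here (ping, has_capacity, start_container, the discovery object) is a hypothetical stand-in, not Tupperware's real interface.

```python
# A toy control-plane failover loop over hypothetical machine/discovery objects.
import time

HEARTBEAT_TIMEOUT = 30  # seconds without a response before a machine is "dead"

def failover_loop(machines, discovery, service):
    last_seen = {m: time.time() for m in machines}
    while True:
        for m in list(machines):
            if m.ping():                          # heartbeat with the agent
                last_seen[m] = time.time()
                continue
            if time.time() - last_seen[m] <= HEARTBEAT_TIMEOUT:
                continue                          # missed a beat, not dead yet
            # Unresponsive for too long: treat its replica as dead.
            discovery.remove(service, m)          # clients stop sending traffic
            machines.remove(m)                    # stop heartbeating it
            # Recreate the lost replica on a machine with spare resources.
            spare = next(x for x in machines if x.has_capacity())
            spare.start_container(service)
            # A real system would wait for the new container to become
            # healthy before publishing it to service discovery.
            discovery.add(service, spare)
        time.sleep(5)
```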
So what about deployments? Let's say I have three replicas already running and healthy. Clients know they're on machines one, two, and three. Now I want to push a new version of my service, and I can tolerate one replica being down at a time. Let's say I want to update the replica on machine one first. The first thing I want to do is make sure clients stop sending traffic there before I tear down the container. So first, I tell the service discovery system that machine one is no longer available, and now clients will stop sending traffic there. Once clients stop sending traffic there, I can tear down the container on machine one and create a new one with the new software version. And once that's healthy, I can update the service discovery system to say: hey, clients, you can send traffic to machine one again -- and in fact, you'll be getting the new version of the service there.

The process repeats for machine two. We disable machine two, stop the container, recreate the container, and then publish that information as well. And then we repeat again for machine three: we disable the entry so that clients stop sending traffic there, we stop the container, and then recreate the container. And now clients can send traffic there as well. So at this point, we've updated all the replicas of our service, we've never had more than one replica down, and users are totally unaware that anything has happened in this process.

So now we are able to start many replicas of our service in production. We can update them, we can handle failovers, and we can scale to handle load.

And now let's say your web app gets more complicated. The number of features or products grows. It gets a bit complicated having one service, so you want to separate out responsibilities into multiple services. So now your app is multiple services that you need to deploy to production, and you'll hit a different set of challenges. To understand these challenges, here's some background about Facebook. Facebook has many data centers in the world, and this is an example of a data center -- a bird's-eye view. Each building has many thousands of machines serving the website or ads, or running databases to store user information. And they are very expensive to create -- the construction costs, purchasing the hardware, electricity, and maintenance. This is a big deal. This is a very expensive investment for Facebook. And now, separately, there are many products at Facebook.
And as a result, there are many services to support those products.

So given that data centers are so expensive, how can we utilize all the resources efficiently? And another problem we have is that reasoning about physical infrastructure is actually really hard. There are a lot of machines, and a lot of failures can happen. How can we hide as much of this complexity from engineers as possible, so that engineers can focus on their business logic?

So the first problem -- how can we effectively use the resources we have -- becomes a bin-packing problem. The Tupperware logo is actually a good illustration of this. Let's say this square represents one machine, and each container represents a different service or piece of work we want to run. We want to stack as many containers as will fit onto a single machine to most effectively utilize that machine's resources. So it's kind of like playing Tetris with machines. Our approach to solving this is to stack multiple containers onto as few machines as possible, resources permitting.

And now, our data centers are spread out geographically across the world, and this introduces a different set of challenges. As an example, let's say we have a West Coast data center and an East Coast data center, and my service just so happens to only be running in the East Coast data center. Now a hurricane hits the East Coast and takes down that data center; it loses power. Suddenly, users of that service cannot use it until we create new replicas somewhere else. So ideally, we should spread our containers across these two data centers, so that when disaster hits one, the service continues to operate. The property we would like is to spread replicas across something known as fault domains. A fault domain is a group of things likely to fail together. As an example, a data center is a fault domain of machines, because the machines are located geographically together, and they might all lose power at the same time.
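As a rough illustration of those two placement ideas, here's a toy greedy placer: it packs replicas onto machines with free capacity (a simple first-fit style of bin packing) while preferring fault domains the service isn't already in. The data model is made up, and real placement in a cluster manager weighs far more constraints than this.

```python
# Toy placement: bin-pack replicas onto machines, spreading across fault domains.
def place_replicas(replicas, machines):
    """replicas: list of (service, cpu_needed) tuples.
    machines: list of dicts with 'free_cpu' and 'fault_domain'.
    Returns {replica_index: chosen machine}."""
    placement = {}
    used_domains = {}  # service -> set of fault domains already used
    for i, (service, cpu) in enumerate(replicas):
        domains = used_domains.setdefault(service, set())
        # Prefer machines in a fault domain this service isn't in yet,
        # then fall back to any machine with room (first fit).
        candidates = sorted(
            (m for m in machines if m["free_cpu"] >= cpu),
            key=lambda m: m["fault_domain"] in domains,
        )
        if not candidates:
            raise RuntimeError("not enough capacity for " + service)
        chosen = candidates[0]
        chosen["free_cpu"] -= cpu          # pack onto the chosen machine
        domains.add(chosen["fault_domain"])
        placement[i] = chosen
    return placement
```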
Another issue you might have is that hardware fails in the real world. Data center operators frequently need to put machines into repair: disk drives are failing, a machine needs to be rebooted for whatever reason, or machines need to be replaced with a newer generation of hardware. So they frequently need to say: I need to take these 1,000 machines into maintenance. But on those 1,000 machines might be many different teams' services. Without automation, the data center operators would need to interact with all those different teams in order to have them safely move their replicas away before taking down all 1,000 machines. So the goal is: how can we safely replace those 1,000 machines in an automated way, with as little involvement from service owners as possible?

In this example, a single machine might be running containers from five different teams. So if we had no automation, five different teams would need to do work to move those containers elsewhere. And this can be challenging for teams, because sometimes a container stores local state on the machine, and that local state needs to be copied somewhere else before you take the machine down. Or sometimes a team might not have enough replicas elsewhere, so if you take down this machine, they will actually be unable to serve all their traffic. So there are a lot of issues in how we do this safely.
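Here's a hypothetical sketch of what "safely" draining one machine for maintenance might look like: bring up a replacement for each replica first, copy any local state over, and only then stop the old copy and release the machine. All of the objects and methods are stand-ins for what a real cluster manager would provide, not an actual API.

```python
# A toy "safe drain" of one machine before maintenance.
def drain_machine(machine, control_plane):
    for container in machine.containers():
        service = container.service
        # Bring up a replacement somewhere else first, so the service
        # never drops below the replica count it needs.
        replacement = control_plane.start_replica_elsewhere(service)
        control_plane.wait_until_healthy(replacement)
        # If the container keeps local state, copy it over before the
        # old copy goes away.
        if container.has_local_state():
            container.copy_state_to(replacement)
        control_plane.stop(container)
    machine.mark_ready_for_maintenance()
```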
So, a recap of some of the issues we face here: we want to efficiently use all the resources that we have; we want to make sure replicas are spread out in a safe way; and hardware will fail in the real world, so we want to make repairing hardware as safe and as seamless as possible.

The approach Facebook has taken here is to provide abstractions -- abstractions that make it easier for engineers to reason about physical infrastructure. As an example, we can stack multiple containers on a machine, and users don't need to know how that works. We provide abstractions that let users say, "I want to spread across these fault domains," and we take care of that; they don't need to understand how it actually works. And we allow engineers to specify policies on how to move containers around in the fleet, and we take care of how that actually happens. We give them a high-level API to do that.

So here's a recap. We have a service running on our local laptop or developer environment. We want to start running it in production for real. Suddenly, we have more traffic than one instance can handle, so we start multiple replicas. And then our app gets more complicated, so instead of just one service, we have many services in production. All the problems we faced in this talk are problems in the cluster management space, and these are the problems my team is tackling.

So what exactly is cluster management? The way you can understand cluster management is by understanding the stakeholders in the system. For much of this talk, we've been focusing on the perspective of service developers. They have an app they want to deploy to production as easily, reliably, and safely as possible. And ideally, they should focus most of their energy on the business logic and not on the physical infrastructure and the concerns around it. But our services need to run on real-world machines, and in the real world, machines will fail. So what data center operators want from the system is to be able to automatically and safely fix machines, and have the system move containers around as needed. And a third stakeholder is efficiency engineers. They want to make sure we're actually using the resources we have as efficiently as possible, because data centers are expensive, and we want to get the most bang for our buck. The intersection of all these stakeholders is cluster management. The problems we face and the challenges of working with these stakeholders are all in the cluster management space.
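To make that kind of high-level abstraction concrete, here's a purely hypothetical example of the sort of declarative spec a service owner might write, pulling together the ideas from this talk: replica count, resources, fault-domain spread, an update policy, and health checking, with the cluster manager left to decide which machines to use and how to move containers around. This is illustrative only and is not Tupperware's actual configuration format.

```python
# A hypothetical declarative service spec; names and fields are invented.
my_service_spec = {
    "name": "profile_service",           # hypothetical service name
    "image": "profile_service:v42",      # container image to run
    "replicas": 3,
    "resources": {"cpu_cores": 2, "memory_gb": 4},
    # Spread replicas so one data center outage can't take them all down.
    "spread_across": "data_center",
    # Rolling updates: never have more than one replica down at a time.
    "update_policy": {"max_unavailable": 1},
    # Health checking: restart a replica that stops answering.
    "health_check": {"interval_s": 5, "failures_before_restart": 3},
}
```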
So, to put it concisely, the goal of cluster management is to make it easy for engineers to develop services, while utilizing resources as efficiently as possible and making the services run as safely as possible in the presence of real-world failures. And for more information about what my team is working on, we have a public talk from the Systems @Scale conference in July, which covers what we're currently working on and what we will be working on for the next few years.

So, to recap, this is my motto now: coding is the easy part; productionizing is the hard part. It's very easy to write some code and get a prototype working. The real hard part is to make sure it's reliable, highly available, efficient, debuggable, and testable so you can easily make changes in the future; to handle all existing legacy requirements, because you might have many existing users or use cases; and to be able to withstand requirements changing over time. Let's say you get it working now -- what if the number of users grows by 10x or 100x? Will this design still work as intended? So cluster management systems and container platforms make at least the productionizing part a bit easier.

Thank you.

[APPLAUSE]