KENNY YU: So, hi, everyone. I'm Kenny. I'm a software engineer at Facebook. And today, I'm going to talk about this one problem: how do you deploy a service at scale? And if you have any questions, we'll save them for the end.

So how do you deploy a service at scale? You may be wondering, how hard can this actually be? It works on my laptop; how hard can it be to deploy it? I thought this too four years ago, when I had just graduated from Harvard as well. In this talk, I'll talk about why it's hard, and we'll go on a journey to explore many of the challenges you'll hit when you actually start to run a service at scale. And I'll talk about how Facebook has approached some of these challenges along the way.

A bit about me: I graduated from Harvard, class of 2014, concentrating in computer science. I took CS50 in fall of 2010 and then TF'd it the following year. And I TF'd other classes at Harvard as well. I think one of my favorite experiences at Harvard was TFing, so if you have the opportunity, I highly recommend it. After I graduated, I went to Facebook, and I've been on this one team for the past four years. And I really like this team. My team is Tupperware, and Tupperware is Facebook's cluster management system and container platform. Those are a lot of big words, and my goal is that by the end of this talk, you'll have a good overview of the challenges we face in cluster management, how Facebook is tackling some of these challenges, and then, once you understand these, how this relates to how we deploy services at scale.

So our goal is to deploy a service in production at scale. But first, what is a service? Let's first define what a service is. A service can have one or more replicas, and it's a long-running program. It's not meant to terminate. It responds to requests and gives a response back. As an example, you can think of a web server: if you're running Python, a Python web server, or if you're running PHP, an Apache web server. It responds to requests, and you get a response back.
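To make that concrete, here's a minimal sketch of such a long-running program, using just Python's standard library. The port and response body are arbitrary; real production services are more involved, but the shape is the same: start, listen, and keep answering requests.

```python
# A minimal sketch of what "a service" means here: a long-running program
# that accepts requests and returns responses. Purely illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every request with a small response body.
        body = b"hello from one replica of my service\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listen forever; the process is not meant to terminate on its own.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```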
And you might have multiple of these, and multiple of these together compose the service that you want to provide.

As an example, Facebook uses Thrift for most of its backend services. Thrift is open source, and it makes it easy to do something called Remote Procedure Calls, or RPCs. It makes it easy for one service to talk to another.

As an example of a service at Facebook, let's take the website. For those of you that don't know, the entire website is pushed as one monolithic unit every hour. And the thing that actually runs the website is HHVM. It runs our version of PHP, called Hack, as a type-safe language. Both of these are open source. And the way the website is deployed is that there are many, many instances of this web server running in the world. This service might call other services in order to fulfill your request. So let's say I hit the home page for Facebook. I might want to get my profile and render some ads. So the service will call maybe the profile service or the ads service. Anyhow, the website is updated every hour, and more importantly, as a Facebook user, you don't even notice this.

So here's a picture of what this all looks like. First, we have the Facebook web service. We have many copies of our web server, HHVM, running. Requests from Facebook users -- either from your browser or from your phone -- go to these replicas. And in order to fulfill the responses for these requests, it might have to talk to other services, like the profile service or the ads service. And once it's gotten all the data it needs, it will return the response back.

So how did we get there? We have something that works on our local laptop. Let's say you're starting a new web app. You have something working -- a prototype -- on your laptop. Now you actually want to run it in production. So there are some challenges there to get that first instance running in production. And now let's say your app takes off. You get a lot of users. A lot of requests start coming to your app.
And now that single instance you're running can no longer handle all the load, so now you need multiple instances in production. And now let's say you start to add more features to your app. You add more products. The complexity of your application grows. In order to simplify that, you might want to extract some of the responsibilities into separate components. And now instead of just having one service in production, you have multiple services in production. Each of these transitions involves lots of challenges, and I'll go over each of these challenges along the way.

First, let's focus on the first one: from your laptop to that first instance in production. What does this look like?

The first challenge you might hit when you want to start that first copy in production is reproducing the same environment as your laptop. Let's say you're running a Python web app. You might have various Python libraries or Python versions installed on your laptop, and now you need to reproduce the same exact versions and libraries in that production environment. So versions and libraries -- you have to make sure they're installed in the production environment. And also, your app might make assumptions about where certain files are located. Let's say my web app needs some configuration file. It might be stored in one place on my laptop, and it might not even exist in the production environment, or it may exist in a different location. So the first challenge here is that you need to reproduce the environment that you have on your laptop on the production machine. This includes all the files and the binaries that you need to run.

The next challenge is: how do you make sure that stuff on the machine doesn't interfere with my work, and vice versa? Let's say there's something more important running on the machine, and I want to make sure my dummy web app doesn't interfere with that work. As an example, let's say my service -- the dotted red box -- should use four gigabytes of memory, maybe two cores.
And something else on the machine wants to use two gigabytes of memory and one core. I want to make sure that that other service doesn't take more memory and start using some of my service's memory, causing my service to crash or slow down -- and vice versa, I don't want to interfere with the resources used by that other service. So this is a resource isolation problem. You want to ensure that no workload on the machine interferes with my workload, and vice versa.

Another problem with interference is protection. Let's say I have my workload in the red dotted box, and something else running on the machine in the purple dotted box. One thing I want to ensure is that that other thing doesn't somehow kill or restart or terminate my program accidentally. Let's say there's a bug in the other program and it goes haywire. The effects of that service should be isolated to its own environment. And also, that other thing shouldn't be touching important files that I need for my service. Let's say my service needs some configuration file. I would really like it if something else doesn't touch that file that I need to run my service. So I want to isolate the environments of these different workloads.

The next problem you might have is: how do you ensure that a service is alive? Let's say you have your service up. There's some bug, and it crashes. If it crashes, this means users will not be able to use your service. Imagine if Facebook went down and users were unable to use Facebook -- that's a terrible experience for everyone. Or let's say it doesn't crash; it's just misbehaving or slowing down, and restarting it might help mitigate the issue temporarily. So what I would really like is, if my service has an issue, please restart it automatically so that user impact is at a minimum. And one way you might be able to do this is to ask the service: hey, are you alive? Yes. Are you alive? No response. And then after a few seconds of that, if there's still no response, restart the service. So the goal is that the service should always be up and running.
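Here's a minimal sketch of that idea: a small supervisor that launches the service, polls a health endpoint, and restarts the process after a few missed responses. The command, URL, and thresholds are made up for illustration; this is not how Facebook's health checking actually works.

```python
# A toy liveness checker, assuming the service exposes a hypothetical
# /health endpoint and can be restarted by relaunching its process.
import subprocess
import time
import urllib.request

CMD = ["python3", "my_service.py"]            # hypothetical service command
HEALTH_URL = "http://localhost:8080/health"   # hypothetical health endpoint

def is_alive() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

proc = subprocess.Popen(CMD)
misses = 0
while True:
    time.sleep(5)                      # ask "are you alive?" every few seconds
    if is_alive():
        misses = 0
        continue
    misses += 1
    if misses >= 3:                    # a few failed checks in a row
        proc.kill()                    # give up on this copy...
        proc.wait()
        proc = subprocess.Popen(CMD)   # ...and restart it automatically
        misses = 0
```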
So here's a summary of the challenges to go from your laptop to one copy in production. How do you reproduce the same environment as your laptop? How do you make sure that, once you're running on a production machine, no other workload is affecting my service, and my service isn't affecting anything critical on that machine? And then how do I make sure that my service is always up and running? Because the goal is for users to be able to use your service all the time.

There are multiple ways to tackle this. Two typical ways companies have approached this problem are virtual machines and containers. For virtual machines, the way that I think about it is that you have your application, it's running on top of an operating system, and that operating system is running on top of another operating system. If you've ever used dual boot on your Mac -- running Windows inside a Mac -- that's a very similar idea. There are some issues with this. It's usually slower to create a virtual machine, and there's also an efficiency cost in terms of CPU.

Another approach that companies take is to create containers. We can run our application in some isolated environment that provides all the guarantees as before and run it directly on the machine's operating system. We can avoid the overhead of a virtual machine, and this tends to be faster to create and more efficient. And here's a diagram that shows how these relate to each other. On the left, you have my service -- the blue box -- running on top of a guest operating system, which itself is running on top of another operating system. There's some overhead because you're running two operating systems at the same time. Versus the container: we eliminate that extra overhead of the middle operating system and run our application directly on the machine, with some protection around it.

So the way Facebook has approached these problems is to use containers. For us, the overhead of using virtual machines is too much, and that's why we use containers.
To do this, we have a program called the Tupperware agent running on every machine at Facebook, and it's responsible for creating containers. And to reproduce the environment, we use container images. Our container images are based on btrfs snapshots. Btrfs is a file system that makes it very fast to create copies of entire subtrees of a file system, and this makes it very fast for us to create new containers.

Then for resource isolation, we use a feature of Linux called control groups, which allows us to say: for this workload, you're allowed to use this much memory, CPU, whatever resources, and no more. If you try to use more than that, we'll throttle you, or we'll kill your workload to prevent you from harming the other workloads on the machine.

And for protection, we use various Linux namespaces. I'm not going to go into too much detail here -- there's a lot of jargon. If you want more detailed information, we have a public talk from our Systems @Scale conference in July 2018 that covers this more in depth. But here's a picture that summarizes how this all fits together. On the left, you have the Tupperware agent. This is a program running on every machine at Facebook that creates containers and ensures that they're all running and healthy. Then, to actually create the environment for your container, we use container images, based on btrfs snapshots. And the protection layer we put around the container includes multiple things: control groups to control resources, and various namespaces to ensure that the environments of two containers are essentially invisible to each other and they can't affect each other.
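As a rough illustration of the control-groups mechanism, here's a sketch that talks to the Linux cgroup v2 filesystem directly to cap a process at four gigabytes of memory and two cores. The group name and command are made up, it needs root and a cgroup v2 mount, and it shows the underlying kernel interface rather than how the Tupperware agent is actually implemented.

```python
# Cap a workload's memory and CPU with Linux control groups (cgroup v2).
import os
import subprocess

CGROUP = "/sys/fs/cgroup/my_demo_container"   # hypothetical group name

os.makedirs(CGROUP, exist_ok=True)

# Allow at most 4 GiB of memory; the kernel reclaims or kills beyond this.
with open(os.path.join(CGROUP, "memory.max"), "w") as f:
    f.write(str(4 * 1024**3))

# Allow at most 2 CPUs' worth of time: 200ms of CPU per 100ms period.
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write("200000 100000")

# Launch the workload and move it into the group so the limits apply.
proc = subprocess.Popen(["python3", "my_service.py"])   # hypothetical command
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(proc.pid))
```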
So now that we have one instance of the service in production, how can we get many instances of the service in production? This brings a new set of challenges. The first challenge is: how do I start multiple replicas of a container? One approach you might take is, OK, given one machine, let's just start multiple on that machine. And that works until that machine runs out of resources, and you need to use multiple machines to start multiple copies of your service.

So now you have to use multiple machines to start your containers, and now you're going to hit a new set of classic problems, because this is now a distributed systems problem. You have to get multiple machines to work together to accomplish some goal. And in this diagram, what is the component that creates the containers on the multiple machines? There needs to be something that knows to tell the first machine to create containers and the second machine to create containers, or to stop containers as well.

And now what if a machine fails? Let's say I have two copies of my service running on two different machines. For some reason, machine two loses power. This happens in the real world all the time. What happens then? I need two copies of my service running at all times in order to serve all the traffic my service has. But now that machine two is down, I don't have enough capacity. Ideally, I would want something to notice: hey, the copy on machine two is down; I know machine three has available resources; please start a new copy on machine three for me. So ideally, some component would have all this logic and do all this automatically for me, and this problem is known as failover. When real-world failures happen, we ideally want to be able to restart that workload on a different machine, and that's known as failover.

So now let's look at this problem from the caller's point of view. The callers, or clients, of your service have a different set of issues now. In the beginning, there are two copies of my service running -- a copy on machine one and a copy on machine two. The caller knows that it's on machine one and machine two. Now machine two loses power. The caller still thinks that a copy is running on machine two. It's still going to send traffic there. The requests are going to fail. Users are going to have a hard time.
Now let's say there is some automation that knows: hey, machine two is down, please start another copy on machine three. How is the client made aware of this? The client still thinks the replicas are on machine one and machine two. It doesn't know that there's a new copy on machine three. So something needs to tell the client: hey, the copies are now on machine one and machine three. And this problem is known as the service discovery problem. The question service discovery tries to answer is: where is my service running?

Another problem we might face is: how do you deploy your service? Remember I said the website is updated every hour. We have many copies of the service, it's updated every hour, and users never even notice that it's being updated. So how is this even possible? The key observation here is that you never want to take down all the replicas of your service at the same time, because if all of your replicas are down in that time period, requests to Facebook would fail, and then users would have a hard time. So instead of taking them all down at once, one approach you might take is to take down only a percentage at a time.

As an example, let's say I have three replicas of my service, and I can tolerate one replica being down at any given moment. Let's say I want to update my containers from blue to purple. What I would do is take down one, start a new one with the new software update, and wait until that's healthy. Once that's healthy and traffic is back to normal again, I can take down the next one and update that. And then once that's healthy, I can take down the last one. And now all my replicas are updated and healthy, and users have not had any issues throughout this whole time period.
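Here's a minimal sketch of that rolling-update loop, assuming hypothetical replica objects with stop/start/health operations. The point is simply that at most one replica is ever down at a time; the same loop generalizes to taking down a fixed percentage at a time.

```python
# A toy rolling update: update replicas one at a time so that at most one
# is ever down. The replica API here is a hypothetical stand-in.
import time

def wait_until_healthy(replica, poll_seconds=5):
    while not replica.is_healthy():
        time.sleep(poll_seconds)

def rolling_update(replicas, new_version):
    for replica in replicas:
        replica.stop()                       # take down the old (blue) copy
        replica.start(version=new_version)   # bring up the new (purple) copy
        wait_until_healthy(replica)          # only then move on to the next one
```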
Another challenge you might hit is: what if your traffic spikes? Let's say at noon, the number of users of your app increases by 2x, and now you need to increase the number of replicas to handle the load. But then at nighttime, the number of users decreases, and it becomes too expensive to run the extra replicas, so you might want to tear some of them down and use those machines to run something else more important. So how do you handle this dynamic resizing of your service based on traffic?

So here's a summary of some of the challenges you might face when you go from one copy to many copies. First, you have to actually be able to start multiple replicas on multiple machines, so there needs to be something that coordinates that. You need to be able to handle machine failures, because once you have many machines, machines will fail in the real world. Then, if containers are moving around between machines, how are clients made aware of this movement? How do you update your service without affecting clients? And how do you handle traffic spikes -- how do you add more replicas, and how do you spin down replicas? These are all problems you'll face when you have multiple instances in production.

Our approach to solving this at Facebook is to introduce a new component, the Tupperware control plane, which manages the lifecycle of containers across many machines. It acts as a central coordination point between all the Tupperware agents in our fleet. This solves the following problems. It is the thing that starts multiple replicas across many machines. If a machine goes down, it will notice, and it is its responsibility to recreate that container on another machine. It is responsible for publishing service discovery information, so that clients are made aware that the container is running on a new machine. And it handles deployments of the service in a safe way.

So here's how this all fits together. You have the Tupperware control plane, which is this green box. It's responsible for creating and stopping containers. You have the service discovery system, which I'll just draw as a cloud, as a black box.
It provides an abstraction where you give it a service name, and it tells you the list of machines that the service is running on. Right now, there are no replicas of my service, so it returns an empty list for my service. Clients of my service want to talk to my replicas, but the first thing they do is ask the service discovery system: hey, where are the replicas running?

So let's say we start two replicas for the first time, on machine one and machine two. You have two containers running. The next step is to update the service discovery system so that clients know they're running on machine one and machine two. So now things are all healthy and fine.

And now let's say machine two loses power. Eventually, the control plane will notice, because it's trying to heartbeat with every agent in the fleet. It sees that machine two has been unresponsive for too long. It deems the work on machine two as dead, and it will update the service discovery system to say: hey, the service is no longer running on machine two. And now clients will stop sending traffic there. Meanwhile, it sees that machine three has available resources to create another copy of my service. It will create a container on machine three. And once the container on machine three is healthy, it will update the service discovery system to let clients know: you can send traffic to machine three now as well. So this is how failover and service discovery are managed by the Tupperware control plane. I'll save questions for the end.
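Here's a toy sketch, in the same spirit, of the failover loop just described: heartbeat every machine, mark replicas on unresponsive machines dead, remove them from service discovery, and recreate them wherever there is spare capacity. Every name here (ping, has_capacity, start_container, the discovery object) is a hypothetical stand-in, not Tupperware's real interface.

```python
# A toy control-plane failover loop over hypothetical machine/discovery objects.
import time

HEARTBEAT_TIMEOUT = 30  # seconds without a response before a machine is "dead"

def failover_loop(machines, discovery, service):
    last_seen = {m: time.time() for m in machines}
    while True:
        for m in list(machines):
            if m.ping():                          # heartbeat with the agent
                last_seen[m] = time.time()
                continue
            if time.time() - last_seen[m] <= HEARTBEAT_TIMEOUT:
                continue                          # missed a beat, not dead yet
            # Unresponsive for too long: treat its replica as dead.
            discovery.remove(service, m)          # clients stop sending traffic
            machines.remove(m)                    # stop heartbeating it
            # Recreate the lost replica on a machine with spare resources.
            spare = next(x for x in machines if x.has_capacity())
            spare.start_container(service)
            # A real system would wait for the new container to become
            # healthy before publishing it to service discovery.
            discovery.add(service, spare)
        time.sleep(5)
```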
So what about deployments? Let's say I have three replicas already running and healthy. Clients know they're on machines one, two, and three. Now I want to push a new version of my service, and I can tolerate one replica being down at a time. Let's say I want to update the replica on machine one first. The first thing I want to do is make sure clients stop sending traffic there before I tear down the container. So first, I tell the service discovery system that machine one is no longer available, and now clients will stop sending traffic there. Once clients stop sending traffic there, I can tear down the container on machine one and create a new one with the new software version. And once that's healthy, I can update the service discovery system to say: hey, clients, you can send traffic to machine one again -- and in fact, you'll be getting the new version of the service there.

The process repeats for machine two. We disable machine two, stop the container, recreate the container, and then publish that information as well. And then we repeat again for machine three: we disable the entry so that clients stop sending traffic there, we stop the container, and then recreate the container. And now clients can send traffic there as well. So at this point, we've updated all the replicas of our service, we've never had more than one replica down, and users are totally unaware that anything has happened in this process.

So now we are able to start many replicas of our service in production. We can update them, we can handle failovers, and we can scale to handle load.

And now let's say your web app gets more complicated. The number of features or products grows. It gets a bit complicated having one service, so you want to separate out responsibilities into multiple services. So now your app is multiple services that you need to deploy to production, and you'll hit a different set of challenges. To understand these challenges, here's some background about Facebook. Facebook has many data centers in the world, and this is an example of a data center -- a bird's-eye view. Each building has many thousands of machines serving the website or ads, or running databases to store user information. And they are very expensive to create -- the construction costs, purchasing the hardware, electricity, and maintenance. This is a big deal. This is a very expensive investment for Facebook. And now, separately, there are many products at Facebook.
And as a result, there are many services to support those products.

So given that data centers are so expensive, how can we utilize all the resources efficiently? And another problem we have is that reasoning about physical infrastructure is actually really hard. There are a lot of machines, and a lot of failures can happen. How can we hide as much of this complexity from engineers as possible, so that engineers can focus on their business logic?

So the first problem -- how can we effectively use the resources we have -- becomes a bin-packing problem. The Tupperware logo is actually a good illustration of this. Let's say this square represents one machine, and each container represents a different service or piece of work we want to run. We want to stack as many containers as will fit onto a single machine to most effectively utilize that machine's resources. So it's kind of like playing Tetris with machines. Our approach to solving this is to stack multiple containers onto as few machines as possible, resources permitting.

And now, our data centers are spread out geographically across the world, and this introduces a different set of challenges. As an example, let's say we have a West Coast data center and an East Coast data center, and my service just so happens to only be running in the East Coast data center. Now a hurricane hits the East Coast and takes down that data center; it loses power. Suddenly, users of that service cannot use it until we create new replicas somewhere else. So ideally, we should spread our containers across these two data centers, so that when disaster hits one, the service continues to operate. The property we would like is to spread replicas across something known as fault domains. A fault domain is a group of things likely to fail together. As an example, a data center is a fault domain of machines, because the machines are located geographically together, and they might all lose power at the same time.
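As a rough illustration of those two placement ideas, here's a toy greedy placer: it packs replicas onto machines with free capacity (a simple first-fit style of bin packing) while preferring fault domains the service isn't already in. The data model is made up, and real placement in a cluster manager weighs far more constraints than this.

```python
# Toy placement: bin-pack replicas onto machines, spreading across fault domains.
def place_replicas(replicas, machines):
    """replicas: list of (service, cpu_needed) tuples.
    machines: list of dicts with 'free_cpu' and 'fault_domain'.
    Returns {replica_index: chosen machine}."""
    placement = {}
    used_domains = {}  # service -> set of fault domains already used
    for i, (service, cpu) in enumerate(replicas):
        domains = used_domains.setdefault(service, set())
        # Prefer machines in a fault domain this service isn't in yet,
        # then fall back to any machine with room (first fit).
        candidates = sorted(
            (m for m in machines if m["free_cpu"] >= cpu),
            key=lambda m: m["fault_domain"] in domains,
        )
        if not candidates:
            raise RuntimeError("not enough capacity for " + service)
        chosen = candidates[0]
        chosen["free_cpu"] -= cpu          # pack onto the chosen machine
        domains.add(chosen["fault_domain"])
        placement[i] = chosen
    return placement
```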
Another issue you might have is that hardware fails in the real world. Data center operators frequently need to put machines into repair: disk drives are failing, a machine needs to be rebooted for whatever reason, or machines need to be replaced with a newer generation of hardware. So they frequently need to say: I need to take these 1,000 machines into maintenance. But on those 1,000 machines might be many different teams' services. Without automation, the data center operators would need to interact with all those different teams in order to have them safely move their replicas away before taking down all 1,000 machines. So the goal is: how can we safely replace those 1,000 machines in an automated way, with as little involvement from service owners as possible?

In this example, a single machine might be running containers from five different teams. So if we had no automation, five different teams would need to do work to move those containers elsewhere. And this can be challenging for teams, because sometimes a container stores local state on the machine, and that local state needs to be copied somewhere else before you take the machine down. Or sometimes a team might not have enough replicas elsewhere, so if you take down this machine, they will actually be unable to serve all their traffic. So there are a lot of issues in how we do this safely.
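Here's a hypothetical sketch of what "safely" draining one machine for maintenance might look like: bring up a replacement for each replica first, copy any local state over, and only then stop the old copy and release the machine. All of the objects and methods are stand-ins for what a real cluster manager would provide, not an actual API.

```python
# A toy "safe drain" of one machine before maintenance.
def drain_machine(machine, control_plane):
    for container in machine.containers():
        service = container.service
        # Bring up a replacement somewhere else first, so the service
        # never drops below the replica count it needs.
        replacement = control_plane.start_replica_elsewhere(service)
        control_plane.wait_until_healthy(replacement)
        # If the container keeps local state, copy it over before the
        # old copy goes away.
        if container.has_local_state():
            container.copy_state_to(replacement)
        control_plane.stop(container)
    machine.mark_ready_for_maintenance()
```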
So, a recap of some of the issues we face here: we want to efficiently use all the resources that we have; we want to make sure replicas are spread out in a safe way; and hardware will fail in the real world, so we want to make repairing hardware as safe and as seamless as possible.

The approach Facebook has taken here is to provide abstractions -- abstractions that make it easier for engineers to reason about physical infrastructure. As an example, we can stack multiple containers on a machine, and users don't need to know how that works. We provide abstractions that let users say, "I want to spread across these fault domains," and we take care of that; they don't need to understand how it actually works. And we allow engineers to specify policies on how to move containers around in the fleet, and we take care of how that actually happens. We give them a high-level API to do that.

So here's a recap. We have a service running on our local laptop or developer environment. We want to start running it in production for real. Suddenly, we have more traffic than one instance can handle, so we start multiple replicas. And then our app gets more complicated, so instead of just one service, we have many services in production. All the problems we faced in this talk are problems in the cluster management space, and these are the problems my team is tackling.

So what exactly is cluster management? The way you can understand cluster management is by understanding the stakeholders in the system. For much of this talk, we've been focusing on the perspective of service developers. They have an app they want to deploy to production as easily, reliably, and safely as possible. And ideally, they should focus most of their energy on the business logic and not on the physical infrastructure and the concerns around it. But our services need to run on real-world machines, and in the real world, machines will fail. So what data center operators want from the system is to be able to automatically and safely fix machines, and have the system move containers around as needed. And a third stakeholder is efficiency engineers. They want to make sure we're actually using the resources we have as efficiently as possible, because data centers are expensive, and we want to get the most bang for our buck. The intersection of all these stakeholders is cluster management. The problems we face and the challenges of working with these stakeholders are all in the cluster management space.
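To make that kind of high-level abstraction concrete, here's a purely hypothetical example of the sort of declarative spec a service owner might write, pulling together the ideas from this talk: replica count, resources, fault-domain spread, an update policy, and health checking, with the cluster manager left to decide which machines to use and how to move containers around. This is illustrative only and is not Tupperware's actual configuration format.

```python
# A hypothetical declarative service spec; names and fields are invented.
my_service_spec = {
    "name": "profile_service",           # hypothetical service name
    "image": "profile_service:v42",      # container image to run
    "replicas": 3,
    "resources": {"cpu_cores": 2, "memory_gb": 4},
    # Spread replicas so one data center outage can't take them all down.
    "spread_across": "data_center",
    # Rolling updates: never have more than one replica down at a time.
    "update_policy": {"max_unavailable": 1},
    # Health checking: restart a replica that stops answering.
    "health_check": {"interval_s": 5, "failures_before_restart": 3},
}
```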
So, to put it concisely, the goal of cluster management is to make it easy for engineers to develop services, while utilizing resources as efficiently as possible and making the services run as safely as possible in the presence of real-world failures. And for more information about what my team is working on, we have a public talk from the Systems @Scale conference in July, which covers what we're currently working on and what we will be working on for the next few years.

So, to recap, this is my motto now: coding is the easy part; productionizing is the hard part. It's very easy to write some code and get a prototype working. The real hard part is to make sure it's reliable, highly available, efficient, debuggable, and testable so you can easily make changes in the future; to handle all existing legacy requirements, because you might have many existing users or use cases; and to be able to withstand requirements changing over time. Let's say you get it working now -- what if the number of users grows by 10x or 100x? Will this design still work as intended? So cluster management systems and container platforms make at least the productionizing part a bit easier.

Thank you.

[APPLAUSE]