[MUSIC PLAYING]

RICK HOULIHAN: All right. Hi, everybody. My name is Rick Houlihan. I'm a senior principal solutions architect at AWS. I focus on NoSQL and DynamoDB technologies. I'm here today to talk to you a little bit about those.

My background is primarily in the data layer. I spent half my development career writing database and data access solutions for various applications. I've been in cloud virtualization for about 20 years. So before the cloud was the cloud, we used to call it utility computing. And the idea was, it's like PG&E, you pay for what you use. Today we call it the cloud.

But over the years, I've worked for a couple of companies you've probably never heard of. But I've compiled a list of technical accomplishments, I guess you'd say. I have eight patents in cloud systems virtualization, microprocessor design, complex event processing, and other areas as well.

So these days, I focus mostly on NoSQL technologies and the next-generation database. And that's generally what I'm going to be here talking to you about today. So what can you expect from this session? We'll go through a brief history of data processing. It's always helpful to understand where we came from and why we're where we are. And we'll talk a little bit about NoSQL technology from a fundamental standpoint.

We will get into some of the DynamoDB internals. DynamoDB is AWS's NoSQL flavor. It's a fully managed and hosted NoSQL solution. And we'll talk a little bit about table structure, APIs, data types, indexes, and some of the internals of that DynamoDB technology. We'll get into some of the design patterns and best practices. We'll talk about how you use this technology for some of today's applications. And then we'll talk a little bit about the evolution, or the emergence, of a new paradigm in programming called event-driven applications and how DynamoDB plays in that as well. And we'll leave you with a little bit of a reference architecture discussion so we can talk about some of the ways you can use DynamoDB.

So first off-- this is a question I hear a lot-- what's a database?
A lot of people think they know what a database is. If you Google it, you'll see this: it's a structured set of data held in a computer, especially one that is accessible in various ways. I suppose that's a good definition of a modern database. But I don't like it, because it implies a couple of things. It implies structure. And it implies that it's on a computer. And databases didn't always exist on computers. Databases actually existed in many forms.

So a better definition of a database is something like this: a database is an organized mechanism for storing, managing, and retrieving information. This is from About.com. So I like this because it really talks about a database being a repository, a repository of information, not necessarily something that sits on a computer. And throughout history, we haven't always had computers.

Now, if I ask the average developer today what a database is, that's the answer I get: somewhere I can stick stuff. Right? And it's true. But it's unfortunate. Because the database is really the foundation of the modern app. It's the foundation of every application. And how you build that database, how you structure that data, is going to dictate how that application performs as you scale.

So a lot of my job today is dealing with what happens when developers take this approach, and dealing with the aftermath of an application that is now scaling beyond the original intent and suffering from bad design. So hopefully when you walk away today, you'll have a couple of tools in your belt that'll keep you from making those same mistakes.

All right. So let's talk about a little bit of the timeline of database technology. I think I read an article not that long ago and it said something along these lines-- it's a very poetic statement. It said the history of data processing is full of high watermarks of data abundance. OK. Now, I guess that's kind of true. But I actually look at it as: the history is actually filled with high watermarks of data pressure.
Because the data rate of ingestion never goes down. It only goes up.

And innovation occurs when we see data pressure, which is the amount of data that is now coming into the system and cannot be processed efficiently either in time or in cost. And that's when we start to look at data pressure.

So when we look at the first database, this is the one that was between our ears. We're all born with it. It's a nice database. It has high availability. It's always on. You can always get to it.

But it's single user. I can't share my thoughts with you. You can't get my thoughts when you want them. And its durability is not so good. We forget things. Every now and then, one of us leaves and moves on to another existence, and we lose everything that was in that database. So that's not all that good.

And this worked well over time, back in the day when all we really needed to know was where we were going to go tomorrow or where we could gather the best food. But as we started to grow as a civilization, and government started to come into being, and businesses started to evolve, we started to realize we needed a little more than what we could put in our heads. All right?

We needed systems of record. We needed places to be able to store data. So we started writing documents, creating libraries and archives. We started developing a system of ledger accounting. And that system of ledger accounting ran the world for many centuries, and maybe even millennia, as we kind of grew to the point where that data load surpassed the ability of those systems to be able to contain it.

And this actually happened in the 1880s. Right? In the 1880 US Census. This is really the turning point of modern data processing. This is the point at which the amount of data that was being collected by the US government got to the point where it took eight years to process.
Now, eight years-- as you know, the census runs every 10 years-- so it's pretty obvious that by the time we got to the 1890 census, the amount of data that was going to be processed by government was going to exceed the 10 years it would take before the next census was launched. This was a problem.

So a guy named Herman Hollerith came along, and he invented unit record punch cards, the punch card reader, the punch card tabulator, and the collation mechanisms for this technology. And the company that he formed at the time, along with a couple of others, actually became one of the pillars of a small company we know today called IBM.

So IBM originally was in the database business. And that's really what they did. They did data processing.

And so came the proliferation of punch cards, and ingenious mechanisms for being able to leverage that technology to poll sorted result sets. You can see in this picture there we have a little-- it's a little small-- but you can see a very ingenious mechanical mechanism where we have a punch card deck. And somebody's taking a little screwdriver and sticking it through the slots and lifting it up to get that match, that sorted result set.

This is an aggregation. We do this all the time today in the computer, where you do it in the database. We used to do it manually, right? People put these things together. And it was the proliferation of these punch cards into what we called data drums and data reels, paper tape.

The data processing industry took a lesson from the player pianos. Player pianos back at the turn of the century used paper reels with slots in them to tell the piano which keys to play. So that technology was adapted eventually to store digital data, because they could put that data onto those paper tape reels.

Now, as a result, how you accessed this data was directly dependent on how you stored it. So if I put the data on a tape, I had to access the data linearly. I had to roll through the whole tape to access all the data. If I put the data in punch cards, I could access it in a little more random fashion, maybe not as quickly.
But there were limitations in how we could access data based on how it was stored. And so this was a problem going into the '50s. Again, we can start to see that as we develop new technologies to process the data, right, it opens up the door for new solutions, for new programs, new applications for that data. And really, governance may have been the reason why we developed some of these systems. But business rapidly became the driver behind the evolution of the modern database and the modern file system.

So the next thing that came up, in the '50s, was the file system and the development of random access storage. This was beautiful. Now, all of a sudden, we can put our files anywhere on these hard drives and we can access this data randomly. We can parse that information out of files. And we solved all the world's problems with data processing.

And that lasted about 20 or 30 years, until the evolution of the relational database, which is when the world decided we now need to have a repository that defeats the sprawl of data across the file systems that we've built. Right? Too much data distributed in too many places, the duplication of data, and the cost of storage was enormous.

In the '70s, the most expensive resource that a computer had was the storage. The processor was viewed as a fixed cost. When I buy the box, the CPU does some work. It's going to be spinning whether it's actually working or not. That's really a sunk cost.

But what costs me as a business is storage. If I have to buy more disks next month, that's a real cost that I pay. And that storage is expensive.

Now we fast forward 40 years and we have a different problem. The compute is now the most expensive resource. The storage is cheap. I mean, we can go anywhere on the cloud and we can find cheap storage. But what I can't find is cheap compute.

So the evolution of today's technology, of database technology, is really focused around distributed databases that don't suffer from the same type of scale limitations as relational databases. We'll talk a little bit about what that actually means.
But one of the reasons and the driver behind this-- we talked about the data pressure. Data pressure is something that drives innovation. And if you look at the last five years, this is a chart of what the data load across the general enterprise looks like over the last five years.

And the general rule of thumb these days-- if you go Google it-- is that 90% of the data we store today was generated within the last two years. OK. Now, this is not a trend that's new. This is a trend that's been going on for 100 years. Ever since Herman Hollerith developed the punch card, we've been building data repositories and gathering data at phenomenal rates.

So over the last 100 years, we've seen this trend. That's not going to change. Going forward, we're going to see this trend continue, if not accelerate. And you can see what that looks like.

If a business in 2010 had one terabyte of data under management, today that means they're managing 6.5 petabytes of data. That's 6,500 times more data. And I know this. I work with these businesses every day.

Five years ago, I would talk to companies who would talk to me about what a pain it is to manage terabytes of data. And they would talk to me about how they could see this probably becoming a petabyte or two within a couple of years.

These same companies today I'm meeting with, and they're talking to me about the problems they're having managing tens, 20 petabytes of data. So the explosion of data in the industry is driving the enormous need for better solutions. And the relational database is just not living up to the demand.

And so there's a linear correlation between data pressure and technical innovation. History has shown us this: that over time, whenever the volume of data that needs to be processed exceeds the capacity of the system to process it in a reasonable time or at a reasonable cost, then new technologies are invented to solve those problems. Those new technologies, in turn, open the door to another set of problems, which is gathering even more data.
Now, we're not going to stop this. Right? We're not going to stop this. Why? Because you can't know everything there is to know in the universe. And as long as we've been alive, throughout the history of man, we have always been driven to know more.

So it seems like every inch we move down the path of scientific discovery, we are multiplying the amount of data that we need to process exponentially, as we uncover more and more and more about the inner workings of life, about how the universe works, about driving the scientific discovery and the invention that we're doing today. The volume of data just continually increases. So being able to deal with this problem is enormous.

So one of the things we look at is, why NoSQL? How does NoSQL solve this problem? Well, relational databases, Structured Query Language, SQL-- that's really a construct of the relational database-- these things are optimized for storage.

Back in the '70s, again, disk is expensive. The provisioning exercise of storage in the enterprise is never-ending. I know. I lived it. I wrote storage drivers for an enterprise superserver company back in the '90s. And the bottom line is, racking another storage array was just something that happened every day in the enterprise. And it never stopped. Higher density storage, demand for high density storage, and for more efficient storage devices-- it's never stopped.

And SQL is a great technology from that standpoint because it normalizes the data. It de-duplicates the data. It puts the data in a structure that is agnostic to every access pattern. Multiple applications can hit that SQL database, run ad hoc queries, and get data in the shape that they need to process for their workloads. That sounds fantastic. But the bottom line is, with any system, if it's agnostic to everything, it is optimized for nothing. OK?

And that's what we get with the relational database. It's optimized for storage. It's normalized. It's relational. It supports the ad hoc queries.
And it scales vertically.

If I need to get a bigger SQL database or a more powerful SQL database, I go buy a bigger piece of iron. OK? I've worked with a lot of customers that have been through major upgrades in their SQL infrastructure, only to find out six months later they're hitting the wall again. And the answer from Oracle or MSSQL or anybody else is, get a bigger box.

Well, sooner or later, you can't buy a bigger box, and that's the real problem. We need to actually change things. So where does this work? It works well for offline analytics, OLAP-type workloads. And that's really where SQL belongs. Now, it's used today in many online transactional processing-type applications. And it works just fine at some level of utilization, but it just doesn't scale the way that NoSQL does. And we'll talk a little bit about why that is.

Now, NoSQL, on the other hand, is more optimized for compute. OK? It is not agnostic to the access pattern. It's what we call a de-normalized structure or a hierarchical structure. The data in a relational database is joined together from multiple tables to produce the view that you need. The data in a NoSQL database is stored in a document that contains the hierarchical structure. All of the data that would normally be joined together to produce that view is stored in a single document. And we'll talk a little bit about how that works in a couple of charts.

But the idea here is that you store your data as these instantiated views. OK? You scale horizontally. Right? If I need to increase the size of my NoSQL cluster, I don't need to get a bigger box. I get another box. And I cluster those together, and I can shard that data. We'll talk a bit more about what sharding is-- being able to scale that database across multiple physical devices and remove the barrier that requires me to scale vertically. So it's really built for online transaction processing and scale. There's a rough sketch of the idea below.
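As a toy illustration of sharding (this is not how DynamoDB actually partitions data, just the core idea), horizontal scaling usually means routing each key to one of several independent nodes, for example by hashing the key:

```python
import hashlib

# Toy sketch of sharding: hash each key to pick one of several independent
# storage "nodes", so capacity grows by adding nodes instead of buying a
# bigger box. Real systems (DynamoDB included) use more sophisticated
# partitioning schemes; this only shows the basic routing idea.

class ShardedStore:
    def __init__(self, num_shards=4):
        # each "shard" is just a dict standing in for a separate machine
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> dict:
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key: str, value) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str):
        return self._shard_for(key).get(key)

store = ShardedStore()
store.put("customer#42", {"name": "Alice"})
print(store.get("customer#42"))  # served by whichever shard owns the key
```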
There's a big distinction here between that and reporting, right? Reporting-- I don't know the questions I'm going to ask. Right? If someone from my marketing department wants to know, say, how many of my customers have this particular characteristic and bought on this day-- I don't know what query they're going to ask. So I need to be agnostic.

Now, in an online transactional application, I know what questions I'm asking. I built the application for a very specific workflow. OK? So if I optimize the data store to support that workflow, it's going to be faster. And that's why NoSQL can really accelerate the delivery of those types of services.

All right. So we're going to get into a little bit of theory here. And some of you, your eyes might roll back a little bit. But I'll try to keep it as high level as I can.

So if you're in project management, there's a construct called the triangle of constraints. OK. The triangle of constraints dictates that you can't have everything all the time. You can't have your pie and eat it too. So in project management, that triangle of constraints is: you can have it cheap, you can have it fast, or you can have it good. Pick two. Because you can't have all three. Right? OK.

So you hear about this a lot. It's the triple constraint, the triangle of triple constraints, or the iron triangle-- when you talk to project managers, they'll talk about this. Now, databases have their own iron triangle. And the iron triangle of data is what we call the CAP theorem. OK?

The CAP theorem dictates how databases operate under a very specific condition. And we'll talk about what that condition is. But the three points of the triangle, so to speak, are C, consistency. OK? So in CAP, consistency means that all clients who can access the database will always have a very consistent view of the data. Nobody's going to see two different things. OK? If I see the database, I'm seeing the same view as my partner who sees the same database. That's consistency.
Availability means that if the database is online, if it can be reached, all clients will always be able to read and write. OK? So every client that can read the database will always be able to read data and write data. And if that's the case, it's an available system.

And the third point is what we call partition tolerance. OK? Partition tolerance means that the system works well despite physical network partitions between the nodes. OK? So if nodes in the cluster can't talk to each other, what happens? All right.

So relational databases choose-- you can pick two of these. OK. So relational databases choose to be consistent and available. If a partition happens between the DataNodes in the data store, the database crashes. Right? It just goes down. OK.

And this is why they have to grow with bigger boxes. Right? Because there's no-- well, there are clustered databases, but there are not very many of them that operate that way. Most databases scale vertically within a single box. Because they need to be consistent and available. If a partition were to be injected, then you would have to make a choice. You have to make a choice between being consistent and available.

And that's what NoSQL databases do. All right. So a NoSQL database comes in two flavors. We have-- well, it comes in many flavors, but it comes with two basic characteristics-- what we would call a CP database, or a consistent and partition tolerant system. These guys make the choice that when the nodes lose contact with each other, we're not going to allow people to write any more. OK?

Until that partition is removed, write access is blocked. That means they're not available. They're consistent. When we see that partition inject itself, we are now consistent, because we're not going to allow the data to change on the two sides of the partition independently of each other. We will have to reestablish communication before any update to the data is allowed. OK?
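To make that CP choice concrete, here is a toy sketch in Python-- purely illustrative, not how any real database does it (real CP systems use consensus protocols such as Paxos or Raft)-- of a node that refuses writes while it cannot reach a quorum of its replicas:

```python
# Toy sketch of a CP node: while it cannot reach a majority of its peer
# replicas, it rejects writes (stays consistent, gives up availability).
# Names and numbers here are illustrative only.

class CPNode:
    def __init__(self, peers_reachable: int, total_replicas: int):
        self.peers_reachable = peers_reachable
        self.total_replicas = total_replicas
        self.data = {}

    def has_quorum(self) -> bool:
        # this node itself plus the peers it can still reach
        return (self.peers_reachable + 1) > self.total_replicas // 2

    def write(self, key, value):
        if not self.has_quorum():
            raise RuntimeError("partitioned from quorum: write rejected")
        self.data[key] = value

node = CPNode(peers_reachable=0, total_replicas=3)  # partition: no peers visible
try:
    node.write("k", "v")
except RuntimeError as err:
    print(err)  # writes stay blocked until the partition is removed
```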
The next flavor would be an AP system, or an available and partition tolerant system. These guys don't care. Right? Any node that gets a write, we'll take it. So I'm replicating my data across multiple nodes. These nodes each serve a client; a client comes in and says, I'm going to write some data. The node says, no problem. The node next to it gets a write on the same record, and it's going to say, no problem. Somewhere back on the back end, that data's going to replicate. And then someone's going to realize-- uh-oh, the system will realize, uh-oh-- there's been an update to two sides. What do we do? And what they do then is they do something which allows them to resolve that data state. And we'll talk about that in the next chart.

Thing to point out here-- and I'm not going to get too much into this, because this gets into deep data theory-- but there's a transactional framework that runs in a relational system that allows me to safely make updates to multiple entities in the database. And those updates will occur all at once or not at all. And this is called ACID transactions. OK?

ACID gives us atomicity, consistency, isolation, and durability. OK? Atomicity means transactions are atomic: all my updates either happen or they don't. Consistency means that the database will always be brought into a consistent state after an update. I will never leave the database in a bad state after applying an update. OK?

So it's a little different than CAP consistency. CAP consistency means all my clients can always see the data. ACID consistency means that when a transaction's done, the data's good. My relationships are all good. I'm not going to delete a parent row and leave a bunch of orphan children in some other table. That can't happen if I'm consistent in an ACID transaction.

Isolation means that transactions will always occur one after the other.
The end result of the data will be the same state as if those transactions that were issued concurrently were executed serially. So it's concurrency control in the database. So basically, I can't increment the same value twice with two operations.

But if I say add 1 to this value, and two transactions come in and try to do it, one's going to get there first and the other one's going to get there after. So in the end, I added two. You see what I mean? OK.

Durability is pretty straightforward. When the transaction is acknowledged, it's going to be there even if the system crashes. When that system recovers, that transaction that was committed is actually going to be there. So those are the guarantees of ACID transactions. Those are pretty nice guarantees to have on a database, but they come at a cost. Right?

Because the problem with this framework is, if there is a partition in the data set, I have to make a decision. I'm going to have to allow updates on one side or the other. And if that happens, then I'm no longer going to be able to maintain those characteristics. They won't be consistent. They won't be isolated. This is where it breaks down for relational databases. This is the reason relational databases scale vertically.

On the other hand, we have what's called BASE technology. And these are your NoSQL databases. All right. So we have our CP and AP databases. And these are what you call Basically Available, Soft state, Eventually consistent. OK?

Basically available, because they're partition tolerant. They will always be there, even if there's a network partition between the nodes. If I can talk to a node, I'm going to be able to read data. OK? I might not always be able to write data if I'm a consistent platform. But I'll be able to read data.

The soft state indicates that when I read that data, it might not be the same as on other nodes.
If a write was issued on a node somewhere else in the cluster and it hasn't replicated across the cluster yet when I read that data, that state might not be consistent. However, it will be eventually consistent, meaning that when a write is made to the system, it will replicate across the nodes. And eventually, that state will be brought into order, and it will be a consistent state.

Now, the CAP theorem really comes into play in only one condition. That condition is when this happens. Because whenever it's operating in normal mode, there's no partition, and everything's consistent and available. You only worry about CAP when we have that partition. So those are rare. But how the system reacts when those occur dictates what type of system we're dealing with.

So let's take a look at what that looks like for AP systems. OK? AP systems come in two flavors. They come in the flavor that is a master-master, 100%, always available. And they come in the other flavor, which says, you know what, I'm going to worry about this partitioning thing when an actual partition occurs. Otherwise, there are going to be primary nodes that take the writes. OK?

So take something like Cassandra. Cassandra would be a master-master: it lets me write to any node. So what happens? So I have an object in the database that exists on two nodes. Let's call that object S. So we have state for S. We have some operations on S that are ongoing. Cassandra allows me to write to multiple nodes. So let's say I get a write for S on two nodes.

Well, what ends up happening is we call that a partitioning event. There may not be a physical network partition. But because of the design of the system, it's actually partitioning as soon as I get a write on two nodes. It's not forcing me to write all through one node. I'm writing on two nodes. OK?

So now I have two states. OK? What's going to happen is, sooner or later, there's going to be a replication event.
There's going to be what we call a partition recovery, which is where these two states come back together, and there's going to be an algorithm that runs inside the database and decides what to do. OK? By default, last update wins in most AP systems.

So there's usually a default algorithm, what they call a callback function, something that will be called when this condition is detected, to execute some logic to resolve that conflict. OK? The default callback and default resolver in most AP databases is, guess what, timestamp wins. This was the last update. I'm going to put that update in there. I may dump the record that got discarded into a recovery log so that the user can come back later and say, hey, there was a collision. What happened? And you can actually dump a record of all the collisions and the rollbacks and see what happened.

Now, as a user, you can also include logic in that callback. So you can change that callback operation. You can say, hey, I want to remediate this data. And I want to try and merge those two records. But that's up to you. The database doesn't know how to do that by default. Most of the time, the only thing the database knows how to do is say, this one was the last record. That's the one that's going to win, and that's the value I'm going to put in. Once that partition recovery and replication occurs, we have our state, which is now S prime, which is the merged state of all those objects.

So AP systems have this. CP systems don't need to worry about this. Because as soon as a partition comes into play, they just stop taking writes. OK? It's very easy to stay consistent when you don't accept any updates. That's what CP systems do. All right.
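Here is a small sketch of that conflict resolution idea-- a default last-write-wins resolver plus an optional user-supplied merge callback. It is illustrative only; the record shape and field names are made up, and real AP databases wire this in very differently:

```python
# Toy sketch of partition recovery in an AP store: by default the version
# with the latest timestamp wins; a user-supplied callback can merge the
# two sides instead. Record shape and field names are purely illustrative.

def last_write_wins(a: dict, b: dict) -> dict:
    return a if a["timestamp"] >= b["timestamp"] else b

def resolve(side_a: dict, side_b: dict, callback=last_write_wins) -> dict:
    # called when replication discovers the same record was updated
    # independently on both sides of a partition
    return callback(side_a, side_b)

# two conflicting updates to the same record, one from each side
a = {"id": "S", "cart": ["book"], "timestamp": 100}
b = {"id": "S", "cart": ["album"], "timestamp": 105}

print(resolve(a, b))  # default resolver: the later timestamp (b) wins

# a custom resolver that merges the two sides instead of discarding one
def merge_carts(x: dict, y: dict) -> dict:
    merged = dict(last_write_wins(x, y))
    merged["cart"] = sorted(set(x["cart"]) | set(y["cart"]))
    return merged

print(resolve(a, b, callback=merge_carts))  # "S prime": the merged state
```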
So let's talk a little bit about access patterns. When we talk about NoSQL, it's all about the access pattern. Now, SQL is ad hoc queries. It's a relational store. We don't have to worry about the access pattern. I write a very complex query. It goes and gets the data. That's what this looks like: normalization.

So in this particular structure, we're looking at a products catalog. I have different types of products. I have books. I have albums. I have videos. The relationship between Products and any one of these Books, Albums, and Videos tables is 1:1. All right? I've got a product ID, and that ID corresponds to a book, an album, or a video. OK? That's a 1:1 relationship across those tables.

Now, books-- all they have is root properties. No problem. That's great. One-to-one relationship: I get all the data I need to describe that book. Albums-- albums have tracks. This is what we call one-to-many. Every album could have many tracks. So for every track on the album, I could have another record in this child table. So I create one record in my Albums table. I create multiple records in the Tracks table. One-to-many relationship.

This relationship is what we call many-to-many. OK? You see that actors could be in many movies, many videos. So what we do is we put this mapping table between those, which just maps the actor ID to the video ID. Now I can create a query that joins Videos through Actor Videos to Actors, and it gives me a nice list of all the movies and all the actors who were in each movie.

OK. So here we go. One-to-one is the top-level relationship; one-to-many, albums to tracks; many-to-many. Those are the three top-level relationships in any database. If you know how those relationships work together, then you know a lot about databases already.
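As a minimal sketch of the normalized catalog being described (the table and column names below are my own stand-ins, not ones from the talk), the 1:1, one-to-many, and many-to-many relationships look roughly like this:

```python
import sqlite3

# Minimal sketch of the normalized catalog: Products is 1:1 with Books,
# Albums, and Videos; Albums is one-to-many with Tracks; ActorVideos is the
# many-to-many mapping table between Actors and Videos.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products    (product_id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE Books       (product_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
CREATE TABLE Albums      (product_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Tracks      (album_id INTEGER, track_no INTEGER, name TEXT);
CREATE TABLE Videos      (product_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Actors      (actor_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ActorVideos (actor_id INTEGER, video_id INTEGER);
""")

# The many-to-many case: join Videos through the mapping table out to Actors
# to list every movie with the actors who were in it.
rows = conn.execute("""
SELECT v.title, a.name
FROM Products p
JOIN Videos v       ON v.product_id = p.product_id
JOIN ActorVideos av ON av.video_id  = v.product_id
JOIN Actors a       ON a.actor_id   = av.actor_id
""").fetchall()
print(rows)  # empty here, since no rows were inserted
```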
So NoSQL works a little differently. Let's think for a second about what it looks like to go get all my products. In a relational store, if I want to get all my products on a list of all my products, that's a lot of queries. I've got a query for all my books. I've got a query for my albums. And I've got a query for all my videos. And I've got to put it all together in a list and serve it back up to the application that's requesting it.

To get my books, I join Products and Books. To get my albums, I've got to join Products, Albums, and Tracks. And to get my videos, I have to join Products to Videos, join through Actor Videos, and bring in the Actors. So that's three queries. Very complex queries to assemble one result set.

That's less than optimal. This is why, when we talk about a data structure that's built to be agnostic to the access pattern-- well, that's great. And you can see this is really nice, how we've organized the data. And you know what? I only have one record for an actor.

That's cool. I've deduplicated all my actors, and I've maintained my associations in this mapping table. However, getting the data out becomes expensive. I'm sending the CPU all over the system, joining these data structures together, to be able to pull that data back.

So how do I get around that? In NoSQL, it's about aggregation, not normalization. So we want to say, we want to support the access pattern. If the access pattern of the application is, I need to get all my products, let's put all the products in one table. If I put all the products in one table, I can just select all the products from that table and I get it all. Well, how do I do that? Well, in NoSQL there's no structure to the table. We'll talk a little bit about how this works in DynamoDB. But you don't have the same attributes and the same properties in every single row, in every single item, like you do in an SQL table. And what this allows me to do is a lot of things, and it gives me a lot of flexibility.

In this particular case, I have my product documents. And in this particular example, everything is a document in the Products table. And the product for a book might have a type ID that specifies a book. And the application would switch on that ID.

At the application tier, I'm going to say, oh, what record type is this? Oh, it's a book record. Book records have these properties.
Let me create a book object. So I'm going to fill the book object with this item. The next item comes in and says, what's this item? Well, this item is an album. Oh, I've got a whole different processing routine for that, because it's an album. You see what I mean?

So at the application tier, I just select all these records. They all start coming in. They could all be different types. And it's the application's logic that switches across those types and decides how to process them.

Again, we're optimizing the schema for the access pattern. We're doing it by collapsing those tables. We're basically taking these normalized structures, and we're building hierarchical structures. Inside each one of these records I'm going to see array properties.

Inside this document for Albums, I'm seeing arrays of tracks. Those tracks now become-- it's basically this child table that exists right here in this structure. So you can do this in DynamoDB. You can do this in MongoDB. You can do this in any NoSQL database. You create these types of hierarchical data structures that allow you to retrieve data very quickly, because now I don't have to conform.

When I insert a row into the Tracks table, or a row into the Albums table, I have to conform to that schema. I have to have the attribute or the property that is defined on that table-- every one of them-- when I insert that row. That's not the case in NoSQL.

I can have totally different properties in every document that I insert into the collection. So it's a very powerful mechanism. And it's really how you optimize the system. Because now that query, instead of joining all these tables and executing half a dozen queries to pull back the data I need, I'm executing one query. And I'm iterating across the result set. It gives you an idea of the power of NoSQL.
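To sketch that aggregated, single-table approach in contrast to the joins above (the attribute names here are illustrative, not a prescribed DynamoDB schema), one collection of heterogeneous documents with a type switch at the application tier might look like this:

```python
# Sketch of the aggregated approach: one Products "table" holding
# heterogeneous documents, with the application switching on a type
# attribute. Attribute names and values are illustrative only.

products = [
    {"product_id": "1", "type": "book",
     "title": "Example Book", "author": "Somebody", "pages": 320},
    {"product_id": "2", "type": "album",
     "title": "Example Album",
     "tracks": [{"no": 1, "name": "Intro"}, {"no": 2, "name": "Outro"}]},
    {"product_id": "3", "type": "video",
     "title": "Example Movie", "actors": ["Actor A", "Actor B"]},
]

# One query (a single select against one table) returns every product;
# the application tier decides how to process each record type.
for item in products:
    if item["type"] == "book":
        print("book:", item["title"], "-", item["pages"], "pages")
    elif item["type"] == "album":
        print("album:", item["title"], "with", len(item["tracks"]), "tracks")
    elif item["type"] == "video":
        print("video:", item["title"], "starring", ", ".join(item["actors"]))
```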
753 00:31:55,660 --> 00:31:58,672 But it's important to understand because if we look at the top 754 00:31:58,672 --> 00:32:00,380 here at this chart, what we're looking at 755 00:32:00,380 --> 00:32:04,030 is what we call the technology hype curve. 756 00:32:04,030 --> 00:32:06,121 And what this means is new stuff comes into play. 757 00:32:06,121 --> 00:32:07,120 People think it's great. 758 00:32:07,120 --> 00:32:09,200 I've solved all my problems. 759 00:32:09,200 --> 00:32:11,630 >> This could be the end all, be all to everything. 760 00:32:11,630 --> 00:32:12,790 And they start using it. 761 00:32:12,790 --> 00:32:14,720 And they say, this stuff doesn't work. 762 00:32:14,720 --> 00:32:17,600 This is not right. 763 00:32:17,600 --> 00:32:19,105 The old stuff was better. 764 00:32:19,105 --> 00:32:21,230 And they go back to doing things the way they were. 765 00:32:21,230 --> 00:32:22,730 And then eventually they go, you know what? 766 00:32:22,730 --> 00:32:24,040 This stuff is not so bad. 767 00:32:24,040 --> 00:32:26,192 Oh, that's how it works. 768 00:32:26,192 --> 00:32:28,900 And once they figure out how it works, they start getting better. 769 00:32:28,900 --> 00:32:32,050 >> And the funny thing about it is, it kind of lines up to what 770 00:32:32,050 --> 00:32:34,300 we call the Technology Adoption Curve. 771 00:32:34,300 --> 00:32:36,910 So what happens is we have some sort technology trigger. 772 00:32:36,910 --> 00:32:39,100 In the case of databases, it's data pressure. 773 00:32:39,100 --> 00:32:42,200 We talked about the high water points of data pressure throughout time. 774 00:32:42,200 --> 00:32:46,310 When that data pressure hits a certain point, that's a technology trigger. 775 00:32:46,310 --> 00:32:47,830 >> It's getting too expensive. 776 00:32:47,830 --> 00:32:49,790 It takes too long to process the data. 777 00:32:49,790 --> 00:32:50,890 We need something better. 778 00:32:50,890 --> 00:32:52,890 You get the innovators out there running around, 779 00:32:52,890 --> 00:32:55,050 trying to find out what's the solution. 780 00:32:55,050 --> 00:32:56,050 What's the new idea? 781 00:32:56,050 --> 00:32:58,170 >> What's the next best way to do this thing? 782 00:32:58,170 --> 00:32:59,530 And they come up with something. 783 00:32:59,530 --> 00:33:03,140 And the people with the real pain, the guys at the bleeding edge, 784 00:33:03,140 --> 00:33:06,390 they'll jump all over it, because they need an answer. 785 00:33:06,390 --> 00:33:09,690 Now what inevitably happens-- and it's happening right now in NoSQL. 786 00:33:09,690 --> 00:33:11,090 I see it all the time. 787 00:33:11,090 --> 00:33:13,610 >> What inevitably happens is people start using the new tool 788 00:33:13,610 --> 00:33:15,490 the same way they used the old tool. 789 00:33:15,490 --> 00:33:17,854 And they find out it doesn't work so well. 790 00:33:17,854 --> 00:33:20,020 I can't remember who I was talking to earlier today. 791 00:33:20,020 --> 00:33:22,080 But it's like, when the jackhammer was invented, 792 00:33:22,080 --> 00:33:24,621 people didn't swing it over their head to smash the concrete. 793 00:33:24,621 --> 00:33:27,360 794 00:33:27,360 --> 00:33:30,610 >> But that is what's happening with NoSQL today. 795 00:33:30,610 --> 00:33:33,900 If you walk in to most shops, they are trying to be NoSQL shops. 796 00:33:33,900 --> 00:33:36,510 What they're doing is they're using NoSQL, 797 00:33:36,510 --> 00:33:39,900 and they're loading it full of relational schema. 
798 00:33:39,900 --> 00:33:41,630 Because that's how they design databases. 799 00:33:41,630 --> 00:33:44,046 And they're wondering, why is it not performing very well? 800 00:33:44,046 --> 00:33:45,230 Boy, this thing stinks. 801 00:33:45,230 --> 00:33:49,900 I had to maintain all my joins in-- it's like, no, no. 802 00:33:49,900 --> 00:33:50,800 Maintain joins? 803 00:33:50,800 --> 00:33:52,430 Why are you joining data? 804 00:33:52,430 --> 00:33:54,350 You don't join data in NoSQL. 805 00:33:54,350 --> 00:33:55,850 You aggregate it. 806 00:33:55,850 --> 00:34:00,690 >> So if you want to avoid this, learn how the tool works before you actually 807 00:34:00,690 --> 00:34:02,010 start using it. 808 00:34:02,010 --> 00:34:04,860 Don't try and use the new tools the same way you used the old tools. 809 00:34:04,860 --> 00:34:06,500 You're going to have a bad experience. 810 00:34:06,500 --> 00:34:08,848 And every single time that's what this is about. 811 00:34:08,848 --> 00:34:11,389 When we start coming up here, it's because people figured out 812 00:34:11,389 --> 00:34:13,449 how to use the tools. 813 00:34:13,449 --> 00:34:16,250 >> They did the same thing when relational databases were invented, 814 00:34:16,250 --> 00:34:17,969 and they were replacing file systems. 815 00:34:17,969 --> 00:34:20,420 They tried to build file systems with relational databases 816 00:34:20,420 --> 00:34:22,159 because that's what people understood. 817 00:34:22,159 --> 00:34:23,049 It didn't work. 818 00:34:23,049 --> 00:34:26,090 So understanding the best practices of the technology you're working with 819 00:34:26,090 --> 00:34:26,730 is huge. 820 00:34:26,730 --> 00:34:29,870 Very important. 821 00:34:29,870 --> 00:34:32,440 >> So we're going to get into DynamoDB. 822 00:34:32,440 --> 00:34:36,480 DynamoDB is AWS's fully-managed NoSQL platform. 823 00:34:36,480 --> 00:34:37,719 What does fully-managed mean? 824 00:34:37,719 --> 00:34:40,010 It means you don't need to really worry about anything. 825 00:34:40,010 --> 00:34:42,060 >> You come in, you tell us, I need a table. 826 00:34:42,060 --> 00:34:43,409 It needs this much capacity. 827 00:34:43,409 --> 00:34:47,300 You hit the button, and we provision all the infrastructure behind the scene. 828 00:34:47,300 --> 00:34:48,310 Now that is enormous. 829 00:34:48,310 --> 00:34:51,310 >> Because when you talk about scaling a database, 830 00:34:51,310 --> 00:34:53,917 NoSQL data clusters at scale, running petabytes, 831 00:34:53,917 --> 00:34:55,750 running millions of transactions per second, 832 00:34:55,750 --> 00:34:58,180 these things are not small clusters. 833 00:34:58,180 --> 00:35:00,830 We're talking thousands of instances. 834 00:35:00,830 --> 00:35:04,480 Managing thousands of instances, even virtual instances, 835 00:35:04,480 --> 00:35:06,350 is a real pain in the butt. 836 00:35:06,350 --> 00:35:09,110 I mean, think about every time an operating system patch comes out 837 00:35:09,110 --> 00:35:11,552 or a new version of the database. 838 00:35:11,552 --> 00:35:13,260 What does that mean to you operationally? 839 00:35:13,260 --> 00:35:16,330 That means you got 1,200 servers that need to be updated. 840 00:35:16,330 --> 00:35:18,960 Now even with automation, that can take a long time. 841 00:35:18,960 --> 00:35:21,480 That can cause a lot of operational headaches, 842 00:35:21,480 --> 00:35:23,090 because I might have services down. 
843 00:35:23,090 --> 00:35:26,070 >> As I update these databases, I might do blue green deployments 844 00:35:26,070 --> 00:35:29,420 where I deploy and upgrade half my nodes, and then upgrade the other half. 845 00:35:29,420 --> 00:35:30,490 Take those down. 846 00:35:30,490 --> 00:35:33,410 So managing the infrastructure scale is enormously painful. 847 00:35:33,410 --> 00:35:36,210 And AWS takes that pain out of it. 848 00:35:36,210 --> 00:35:39,210 And NoSQL databases can be extraordinarily painful 849 00:35:39,210 --> 00:35:41,780 because of the way they scale. 850 00:35:41,780 --> 00:35:42,926 >> They scale horizontally. 851 00:35:42,926 --> 00:35:45,550 If you want to get a bigger NoSQL database, you buy more nodes. 852 00:35:45,550 --> 00:35:48,660 Every node you buy is another operational headache. 853 00:35:48,660 --> 00:35:50,830 So let somebody else do that for you. 854 00:35:50,830 --> 00:35:52,000 AWS can do that. 855 00:35:52,000 --> 00:35:54,587 >> We support document and key value data. 856 00:35:54,587 --> 00:35:56,670 Now we didn't go into this too much on the other chart. 857 00:35:56,670 --> 00:35:58,750 There's a lot of different flavors of NoSQL. 858 00:35:58,750 --> 00:36:02,670 They're all kind of getting munged together at this point. 859 00:36:02,670 --> 00:36:06,260 You can look at DynamoDB and say yes, we're both a document and a key value 860 00:36:06,260 --> 00:36:08,412 store at this point. 861 00:36:08,412 --> 00:36:10,620 And you can argue the features of one over the other. 862 00:36:10,620 --> 00:36:13,950 To me, a lot of this is really six of one, half a dozen of the other. 863 00:36:13,950 --> 00:36:18,710 Every one of these technologies is a fine technology and a fine solution. 864 00:36:18,710 --> 00:36:23,390 I wouldn't say MongoDB is better or worse than Couch, than Cassandra, 865 00:36:23,390 --> 00:36:25,994 than Dynamo, or vice versa. 866 00:36:25,994 --> 00:36:27,285 I mean, these are just options. 867 00:36:27,285 --> 00:36:29,850 868 00:36:29,850 --> 00:36:32,700 >> It's fast and it's consistent at any scale. 869 00:36:32,700 --> 00:36:36,210 So this is one of the biggest bonuses you get with AWS 870 00:36:36,210 --> 00:36:40,850 and DynamoDB: the ability to get low single digit 871 00:36:40,850 --> 00:36:44,040 millisecond latency at any scale. 872 00:36:44,040 --> 00:36:45,720 That was a design goal of the system. 873 00:36:45,720 --> 00:36:49,130 And we have customers that are doing millions of transactions per second. 874 00:36:49,130 --> 00:36:52,670 >> Now I'll go through some of those use cases in a few minutes here. 875 00:36:52,670 --> 00:36:55,660 Integrated access control-- we have what we call 876 00:36:55,660 --> 00:36:57,920 Identity and Access Management, or IAM. 877 00:36:57,920 --> 00:37:01,980 It permeates every system, every service that AWS offers. 878 00:37:01,980 --> 00:37:03,630 DynamoDB is no exception. 879 00:37:03,630 --> 00:37:06,020 You can control access to the DynamoDB tables 880 00:37:06,020 --> 00:37:09,960 across all your AWS accounts by defining access roles and permissions 881 00:37:09,960 --> 00:37:12,140 in the IAM infrastructure. 882 00:37:12,140 --> 00:37:16,630 >> And it's a key and integral component in what we call Event Driven Programming. 883 00:37:16,630 --> 00:37:19,056 Now this is a new paradigm. 884 00:37:19,056 --> 00:37:22,080 >> AUDIENCE: How's your rate of true positives versus false negatives 885 00:37:22,080 --> 00:37:24,052 on your access control system?
886 00:37:24,052 --> 00:37:26,260 RICK HOULIHAN: True positives versus false negatives? 887 00:37:26,260 --> 00:37:28,785 AUDIENCE: Returning what you should be returning? 888 00:37:28,785 --> 00:37:33,720 As opposed to once in a while it doesn't return when it should validate? 889 00:37:33,720 --> 00:37:36,260 890 00:37:36,260 --> 00:37:38,050 >> RICK HOULIHAN: I couldn't tell you that. 891 00:37:38,050 --> 00:37:40,140 If there's any failures whatsoever on that, 892 00:37:40,140 --> 00:37:42,726 I'm not the person to ask that particular question. 893 00:37:42,726 --> 00:37:43,850 But that's a good question. 894 00:37:43,850 --> 00:37:45,905 I would be curious to know that myself, actually. 895 00:37:45,905 --> 00:37:48,810 896 00:37:48,810 --> 00:37:51,320 >> And so then again, the new paradigm is event driven programming. 897 00:37:51,320 --> 00:37:55,160 This is the idea that you can deploy complex applications that 898 00:37:55,160 --> 00:37:59,720 can operate at a very, very high scale without any infrastructure whatsoever. 899 00:37:59,720 --> 00:38:02,120 Without any fixed infrastructure whatsoever. 900 00:38:02,120 --> 00:38:04,720 And we'll talk a little bit about what that means as we 901 00:38:04,720 --> 00:38:06,550 get on to the next couple of charts. 902 00:38:06,550 --> 00:38:08,716 >> The first thing we'll do is we'll talk about tables, 903 00:38:08,716 --> 00:38:10,857 APIs, and data types for Dynamo. 904 00:38:10,857 --> 00:38:13,190 And the first thing you'll notice when you look at this, 905 00:38:13,190 --> 00:38:17,930 if you're familiar with any database, databases have really two kinds of APIs, 906 00:38:17,930 --> 00:38:18,430 I'd call it. 907 00:38:18,430 --> 00:38:21,570 Or two sets of APIs. 908 00:38:21,570 --> 00:38:23,840 One of those would be the administrative API. 909 00:38:23,840 --> 00:38:26,710 >> The things that take care of the functions of the database. 910 00:38:26,710 --> 00:38:31,340 Configuring the storage engine, setting up and adding tables, 911 00:38:31,340 --> 00:38:35,180 creating database catalogs and instances. 912 00:38:35,180 --> 00:38:40,450 These things-- in DynamoDB, you have very short, short lists. 913 00:38:40,450 --> 00:38:43,120 >> So in other databases, you might see dozens 914 00:38:43,120 --> 00:38:45,680 of commands, of administrative commands, for configuring 915 00:38:45,680 --> 00:38:47,290 these additional options. 916 00:38:47,290 --> 00:38:51,234 In DynamoDB you don't need those because you don't configure the system, we do. 917 00:38:51,234 --> 00:38:54,150 So the only thing you need to do is tell us what size table you need. 918 00:38:54,150 --> 00:38:55,660 So you get a very limited set of commands. 919 00:38:55,660 --> 00:38:58,618 >> You get Create Table, Update Table, Delete Table, and Describe Table. 920 00:38:58,618 --> 00:39:01,150 Those are the only things you need for DynamoDB. 921 00:39:01,150 --> 00:39:03,294 You don't need a storage engine configuration. 922 00:39:03,294 --> 00:39:04,960 I don't need to worry about replication. 923 00:39:04,960 --> 00:39:06,490 I don't need to worry about sharding. 924 00:39:06,490 --> 00:39:07,800 >> I don't need to worry about any of this stuff. 925 00:39:07,800 --> 00:39:08,740 We do it all for you. 926 00:39:08,740 --> 00:39:11,867 So that's a huge amount of overhead that's just lifted off your plate. 927 00:39:11,867 --> 00:39:13,200 Then we have the CRUD operators.
928 00:39:13,200 --> 00:39:17,740 CRUD is what we call in databases the 929 00:39:17,740 --> 00:39:19,860 Create, Read, Update, Delete operators. 930 00:39:19,860 --> 00:39:24,180 These are your common database operations. 931 00:39:24,180 --> 00:39:31,299 Things like put item, get item, update item, delete item, batch, query, scan. 932 00:39:31,299 --> 00:39:32,840 If you want to scan the entire table. 933 00:39:32,840 --> 00:39:34,220 Pull everything off the table. 934 00:39:34,220 --> 00:39:37,130 One of the nice things about DynamoDB is it allows parallel scanning. 935 00:39:37,130 --> 00:39:40,602 So you can actually let us know how many threads you want to run on that scan. 936 00:39:40,602 --> 00:39:41,810 And we can run those threads. 937 00:39:41,810 --> 00:39:43,985 We can spin that scan up across multiple threads 938 00:39:43,985 --> 00:39:49,060 so you can scan the entire table space very, very quickly in DynamoDB. 939 00:39:49,060 --> 00:39:51,490 >> The other API we have is what we call our Streams API. 940 00:39:51,490 --> 00:39:52,940 We're not going to talk too much about this right now. 941 00:39:52,940 --> 00:39:55,189 I've got some content later on in the deck about this. 942 00:39:55,189 --> 00:39:59,910 But Streams is really a running-- think of it as the time ordered 943 00:39:59,910 --> 00:40:01,274 and partitioned change log. 944 00:40:01,274 --> 00:40:03,940 Everything that's happening on the table shows up on the stream. 945 00:40:03,940 --> 00:40:05,940 >> Every write to the table shows up on the stream. 946 00:40:05,940 --> 00:40:08,370 You can read that stream, and you can do things with it. 947 00:40:08,370 --> 00:40:10,150 We'll talk about what types of things you 948 00:40:10,150 --> 00:40:13,680 can do with it, things like replication, creating secondary indexes. 949 00:40:13,680 --> 00:40:17,620 All kinds of really cool things you can do with that. 950 00:40:17,620 --> 00:40:19,150 >> Data types. 951 00:40:19,150 --> 00:40:23,320 In DynamoDB, we support both key value and document data types. 952 00:40:23,320 --> 00:40:26,350 On the left hand side of the screen here, we've got our basic types. 953 00:40:26,350 --> 00:40:27,230 Key value types. 954 00:40:27,230 --> 00:40:30,040 These are strings, numbers, and binaries. 955 00:40:30,040 --> 00:40:31,640 >> So just three basic types. 956 00:40:31,640 --> 00:40:33,700 And then you can have sets of those. 957 00:40:33,700 --> 00:40:37,650 One of the nice things about NoSQL is you can contain arrays as properties. 958 00:40:37,650 --> 00:40:42,050 And with DynamoDB you can contain arrays of basic types as a root property. 959 00:40:42,050 --> 00:40:43,885 >> And then there's the document types. 960 00:40:43,885 --> 00:40:45,510 How many people are familiar with JSON? 961 00:40:45,510 --> 00:40:47,130 You guys familiar with JSON so much? 962 00:40:47,130 --> 00:40:49,380 It's basically JavaScript Object Notation. 963 00:40:49,380 --> 00:40:52,510 It allows you to basically define a hierarchical structure. 964 00:40:52,510 --> 00:40:58,107 >> You can store a JSON document on DynamoDB using common components 965 00:40:58,107 --> 00:41:00,940 or building blocks that are available in most programming languages. 966 00:41:00,940 --> 00:41:03,602 So if you have Java, you're looking at maps and lists. 967 00:41:03,602 --> 00:41:05,060 I can create objects that are a map. 968 00:41:05,060 --> 00:41:08,030 A map has key values stored as properties.
969 00:41:08,030 --> 00:41:10,890 And it might have lists of values within those properties. 970 00:41:10,890 --> 00:41:13,490 You can store this complex hierarchical structure 971 00:41:13,490 --> 00:41:16,320 as a single attribute of a DynamoDB item. 972 00:41:16,320 --> 00:41:19,010 973 00:41:19,010 --> 00:41:24,460 >> So tables in DynamoDB, like most NoSQL databases, tables have items. 974 00:41:24,460 --> 00:41:26,469 In MongoDB you would call these documents. 975 00:41:26,469 --> 00:41:27,760 And it would be the same in Couchbase, 976 00:41:27,760 --> 00:41:28,900 also a document database. 977 00:41:28,900 --> 00:41:29,941 You call these documents. 978 00:41:29,941 --> 00:41:32,930 Documents or items have attributes. 979 00:41:32,930 --> 00:41:35,850 Attributes can exist or not exist on the item. 980 00:41:35,850 --> 00:41:38,520 In DynamoDB, there's one mandatory attribute. 981 00:41:38,520 --> 00:41:43,880 Just like in a relational database, you have a primary key on the table. 982 00:41:43,880 --> 00:41:46,010 >> DynamoDB has what we call a hash key. 983 00:41:46,010 --> 00:41:48,280 Hash key must be unique. 984 00:41:48,280 --> 00:41:52,580 So when I define a hash table, basically what I'm saying 985 00:41:52,580 --> 00:41:54,110 is every item will have a hash key. 986 00:41:54,110 --> 00:41:58,520 And every hash key must be unique. 987 00:41:58,520 --> 00:42:01,200 >> Every item is defined by that unique hash key. 988 00:42:01,200 --> 00:42:02,940 And there can only be one. 989 00:42:02,940 --> 00:42:05,820 This is OK, but oftentimes what people need 990 00:42:05,820 --> 00:42:08,170 is they want this hash key to do a little bit more 991 00:42:08,170 --> 00:42:11,010 than just be a unique identifier. 992 00:42:11,010 --> 00:42:15,240 Oftentimes we want to use that hash key as the top level aggregation bucket. 993 00:42:15,240 --> 00:42:19,160 And the way we do that is by adding what we call a range key. 994 00:42:19,160 --> 00:42:22,460 >> So if it's a hash only table, this must be unique. 995 00:42:22,460 --> 00:42:27,040 If it's a hash and range table, the combination of the hash and the range 996 00:42:27,040 --> 00:42:28,640 must be unique. 997 00:42:28,640 --> 00:42:30,110 So think about it this way. 998 00:42:30,110 --> 00:42:32,140 If I have a forum. 999 00:42:32,140 --> 00:42:39,010 And the forum has topics, it has posts, and it has responses. 1000 00:42:39,010 --> 00:42:42,630 >> So I might have a hash key, which is the topic ID. 1001 00:42:42,630 --> 00:42:46,650 And I might have a range key, which is the response ID. 1002 00:42:46,650 --> 00:42:49,650 That way if I want to get all the responses for a particular topic, 1003 00:42:49,650 --> 00:42:52,370 I can just query the hash. 1004 00:42:52,370 --> 00:42:55,190 I can just say give me all the items that have this hash. 1005 00:42:55,190 --> 00:43:01,910 And I'm going to get every question or post for that particular topic. 1006 00:43:01,910 --> 00:43:03,910 These top level aggregations are very important. 1007 00:43:03,910 --> 00:43:07,370 They support the primary access pattern of the application. 1008 00:43:07,370 --> 00:43:09,420 Generally speaking, this is what we want to do. 1009 00:43:09,420 --> 00:43:11,780 We want that table-- as you load the table, 1010 00:43:11,780 --> 00:43:16,640 we want to structure the data within the table in such a way 1011 00:43:16,640 --> 00:43:20,140 that the application can very quickly retrieve those results.
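As a rough sketch of the forum example just described, assuming boto3 and hypothetical table and attribute names ("ForumResponses", "TopicId", "ResponseId"), defining the hash and range keys and then querying one topic's aggregation might look something like this:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Sketch only: a hypothetical forum table keyed on topic ID (hash)
    # and response ID (range).
    client = boto3.client("dynamodb")
    client.create_table(
        TableName="ForumResponses",
        AttributeDefinitions=[
            {"AttributeName": "TopicId", "AttributeType": "S"},
            {"AttributeName": "ResponseId", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "TopicId", "KeyType": "HASH"},      # top level aggregation bucket
            {"AttributeName": "ResponseId", "KeyType": "RANGE"},  # unique within each topic
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
    )

    # In practice you would wait for the table to become ACTIVE before using it.
    # One query on the hash key returns every response for that topic; the
    # range key also supports rich conditions like begins_with or between.
    table = boto3.resource("dynamodb").Table("ForumResponses")
    responses = table.query(KeyConditionExpression=Key("TopicId").eq("topic-123"))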
1012 00:43:20,140 --> 00:43:24,510 And oftentimes the way to do that is to maintain these aggregations as we 1013 00:43:24,510 --> 00:43:25,650 insert the data. 1014 00:43:25,650 --> 00:43:31,110 Basically, we're spreading the data into the right bucket as it comes in. 1015 00:43:31,110 --> 00:43:35,210 >> Range keys allow me-- hash keys have to be equality. 1016 00:43:35,210 --> 00:43:39,490 When I query a hash, I have to say give me a hash that equals this. 1017 00:43:39,490 --> 00:43:41,950 When I query a range, I can say give me a range 1018 00:43:41,950 --> 00:43:47,040 that is using any kind of rich operator that we support. 1019 00:43:47,040 --> 00:43:49,200 Give me all the items for a hash. 1020 00:43:49,200 --> 00:43:52,520 Is it equal, greater than, less than, does it begin with, 1021 00:43:52,520 --> 00:43:54,145 does it exist between these two values? 1022 00:43:54,145 --> 00:43:56,811 So it's these types of range queries that we're always interested in. 1023 00:43:56,811 --> 00:43:59,650 Now one thing about data, when you look at accessing data, when 1024 00:43:59,650 --> 00:44:02,360 you access the data, it's always about an aggregation. 1025 00:44:02,360 --> 00:44:05,770 It's always about the records that are related to this. 1026 00:44:05,770 --> 00:44:10,390 Give me everything here that's-- all the transactions on this credit card 1027 00:44:10,390 --> 00:44:12,500 for the last month. 1028 00:44:12,500 --> 00:44:13,960 That's an aggregation. 1029 00:44:13,960 --> 00:44:17,490 >> Almost everything you do in the database is some kind of aggregation. 1030 00:44:17,490 --> 00:44:21,530 So being able to define these buckets and give you these range 1031 00:44:21,530 --> 00:44:24,950 attributes to be able to query on, those rich queries support many, 1032 00:44:24,950 --> 00:44:27,165 many, many application access patterns. 1033 00:44:27,165 --> 00:44:30,990 1034 00:44:30,990 --> 00:44:35,000 >> So the other thing the hash key does is it gives us a mechanism 1035 00:44:35,000 --> 00:44:37,740 to be able to spread the data around. 1036 00:44:37,740 --> 00:44:40,390 NoSQL databases work best when the data is evenly 1037 00:44:40,390 --> 00:44:41,740 distributed across the cluster. 1038 00:44:41,740 --> 00:44:44,530 1039 00:44:44,530 --> 00:44:47,050 How many people are familiar with hashing algorithms? 1040 00:44:47,050 --> 00:44:49,860 When I say hash and hashing-- because a hashing algorithm 1041 00:44:49,860 --> 00:44:54,140 is a way of being able to generate a random value from any given value. 1042 00:44:54,140 --> 00:44:59,300 So in this particular case, the hash algorithm we run is MD5 based. 1043 00:44:59,300 --> 00:45:04,765 >> And if I have an ID, and this is my hash key, I have 1, 2, 3. 1044 00:45:04,765 --> 00:45:07,390 When I run the hash algorithm, it's going to come back and say, 1045 00:45:07,390 --> 00:45:10,800 well 1 equals 7B, 2 equals 48, 3 equals CD. 1046 00:45:10,800 --> 00:45:13,092 They're spread all over the key space. 1047 00:45:13,092 --> 00:45:14,050 And why do you do this? 1048 00:45:14,050 --> 00:45:17,120 Because that makes sure that I can put the records across multiple nodes. 1049 00:45:17,120 --> 00:45:19,574 >> If I'm doing this incrementally, 1, 2, 3.
1050 00:45:19,574 --> 00:45:21,990 And I have a hash range that runs in this particular case, 1051 00:45:21,990 --> 00:45:24,785 a small hash space, it runs from 00 to FF, 1052 00:45:24,785 --> 00:45:27,951 then the records are going to come in and they're going to go 1, 2, 3, 4, 5, 1053 00:45:27,951 --> 00:45:30,390 6, 7, 8, 9, 10, 11, 12. 1054 00:45:30,390 --> 00:45:31,800 What happens? 1055 00:45:31,800 --> 00:45:34,860 Every insert is going to the same node. 1056 00:45:34,860 --> 00:45:36,070 You see what I mean? 1057 00:45:36,070 --> 00:45:40,910 >> Because when I split the space, and I spread these records across, 1058 00:45:40,910 --> 00:45:45,950 and I partition, I'm going to say partition 1 has key space 0 to 54. 1059 00:45:45,950 --> 00:45:47,720 Partition 2 is 55 to 89. 1060 00:45:47,720 --> 00:45:49,780 Partition 3 is AA to FF. 1061 00:45:49,780 --> 00:45:53,740 So if I'm using linearly incrementing IDs, you can see what's happening. 1062 00:45:53,740 --> 00:45:57,410 1, 2, 3, 4, 5, 6, all way up to 54. 1063 00:45:57,410 --> 00:46:00,030 So as I'm hammering the records into the system, 1064 00:46:00,030 --> 00:46:02,030 everything ends up going to one node. 1065 00:46:02,030 --> 00:46:03,160 >> That's not good. 1066 00:46:03,160 --> 00:46:04,820 That's an antipattern. 1067 00:46:04,820 --> 00:46:08,760 In MongoDB they have this problem if you don't use a hash key. 1068 00:46:08,760 --> 00:46:11,325 MongoDB gives you the option of hashing the key value. 1069 00:46:11,325 --> 00:46:13,950 You should always do that, if you're using an incrementing hash 1070 00:46:13,950 --> 00:46:17,380 key in MongoDB, or you'll be nailing every write to one node, 1071 00:46:17,380 --> 00:46:21,290 and you will be limiting your write throughput badly. 1072 00:46:21,290 --> 00:46:24,896 >> AUDIENCE: Is that A9 169 in decimal? 1073 00:46:24,896 --> 00:46:28,450 >> RICK HOULIHAN: Yeah, it's somewhere around there. 1074 00:46:28,450 --> 00:46:29,950 A9, I don't know. 1075 00:46:29,950 --> 00:46:32,200 You'd have to get my binary to decimal calculator. 1076 00:46:32,200 --> 00:46:34,237 My brain doesn't work like that. 1077 00:46:34,237 --> 00:46:36,320 AUDIENCE: Just a quick one of your Mongo comments. 1078 00:46:36,320 --> 00:46:39,530 So is the object ID that comes natively with Mongo do that? 1079 00:46:39,530 --> 00:46:40,179 1080 00:46:40,179 --> 00:46:41,470 RICK HOULIHAN: Does it do that? 1081 00:46:41,470 --> 00:46:42,970 If you specify it. 1082 00:46:42,970 --> 00:46:45,030 With MongoDB, you have the option. 1083 00:46:45,030 --> 00:46:48,930 You can specify-- every document in MongoDB has to have an underscore ID. 1084 00:46:48,930 --> 00:46:50,300 That's the unique value. 1085 00:46:50,300 --> 00:46:55,240 >> In MongoDB you can specify whether to hash it or not. 1086 00:46:55,240 --> 00:46:56,490 They just give you the option. 1087 00:46:56,490 --> 00:46:58,198 If you know that it's random, no problem. 1088 00:46:58,198 --> 00:46:59,640 You don't need to do that. 1089 00:46:59,640 --> 00:47:04,260 If you know that it's not random, that it's incrementing, then do the hash. 1090 00:47:04,260 --> 00:47:06,880 >> Now the thing about hashing, once you hash 1091 00:47:06,880 --> 00:47:08,800 a value-- and this is why hash keys are always 1092 00:47:08,800 --> 00:47:13,740 unique queries, because I've changed the value, now I can't do a range query. 
1093 00:47:13,740 --> 00:47:15,640 I can't say is this between this or that, 1094 00:47:15,640 --> 00:47:20,800 because the hash value is not going to be equivalent to the actual value. 1095 00:47:20,800 --> 00:47:24,570 So when you hash that key, it's equality only. 1096 00:47:24,570 --> 00:47:28,700 This is why in DynamoDB hash key queries are always equality only. 1097 00:47:28,700 --> 00:47:32,090 1098 00:47:32,090 --> 00:47:34,700 >> So now in a range key-- when I add that range key, 1099 00:47:34,700 --> 00:47:38,180 those range key records all come in and they get stored on the same partition. 1100 00:47:38,180 --> 00:47:42,430 So they are very quickly, easily retrieved because this is the hash, 1101 00:47:42,430 --> 00:47:43,220 this is the range. 1102 00:47:43,220 --> 00:47:44,928 And you see everything with the same hash 1103 00:47:44,928 --> 00:47:48,550 gets stored on the same partition space. 1104 00:47:48,550 --> 00:47:53,889 You can use that range key to help locate your data close to its parent. 1105 00:47:53,889 --> 00:47:55,180 So what am I really doing here? 1106 00:47:55,180 --> 00:47:57,320 This is a one to many relationship. 1107 00:47:57,320 --> 00:48:01,490 The relationship between a hash key and the range key is one to many. 1108 00:48:01,490 --> 00:48:03,490 I can have multiple hash keys. 1109 00:48:03,490 --> 00:48:07,610 And I can have multiple range keys within every hash key. 1110 00:48:07,610 --> 00:48:11,910 >> The hash defines the parent, the range defines the children. 1111 00:48:11,910 --> 00:48:15,240 So you can see there's an analog here between the relational construct 1112 00:48:15,240 --> 00:48:18,840 and the same types of constructs in NoSQL. 1113 00:48:18,840 --> 00:48:20,760 People talk about NoSQL as nonrelational. 1114 00:48:20,760 --> 00:48:22,200 It's not nonrelational. 1115 00:48:22,200 --> 00:48:24,680 Data always has relationships. 1116 00:48:24,680 --> 00:48:28,172 Those relationships are just modeled differently. 1117 00:48:28,172 --> 00:48:29,880 Let's talk a little bit about durability. 1118 00:48:29,880 --> 00:48:34,860 When you write to DynamoDB, writes are always three-way replicated. 1119 00:48:34,860 --> 00:48:37,550 Meaning that we have three AZ's. 1120 00:48:37,550 --> 00:48:39,160 AZ's are Availability Zones. 1121 00:48:39,160 --> 00:48:43,430 You can think of an Availability Zone as a data center 1122 00:48:43,430 --> 00:48:45,447 or a collection of data centers. 1123 00:48:45,447 --> 00:48:47,780 These things are geographically isolated from each other 1124 00:48:47,780 --> 00:48:51,610 across different fault zones, across different power grids and floodplains. 1125 00:48:51,610 --> 00:48:54,510 A failure in one AZ is not going to take down another. 1126 00:48:54,510 --> 00:48:56,890 They are also linked together with dark fiber. 1127 00:48:56,890 --> 00:49:01,240 It supports sub-1 millisecond latency between AZs. 1128 00:49:01,240 --> 00:49:05,390 So real time data replication is possible across multiple AZs. 1129 00:49:05,390 --> 00:49:09,990 >> And oftentimes multi AZ deployments meet the high availability requirements 1130 00:49:09,990 --> 00:49:12,930 of most enterprise organizations. 1131 00:49:12,930 --> 00:49:16,139 So DynamoDB is spread across three AZs by default. 1132 00:49:16,139 --> 00:49:19,430 We're only going to acknowledge the write when two of those three nodes come back 1133 00:49:19,430 --> 00:49:21,470 and say, Yeah, I got it. 1134 00:49:21,470 --> 00:49:22,050 Why is that?
1135 00:49:22,050 --> 00:49:25,950 Because on the read side we're only going to give you the data back when 1136 00:49:25,950 --> 00:49:27,570 we get it from two nodes. 1137 00:49:27,570 --> 00:49:30,490 >> If I'm replicating across three, and I'm reading from two, 1138 00:49:30,490 --> 00:49:32,840 I'm always guaranteed to have at least one 1139 00:49:32,840 --> 00:49:35,720 of those reads to be the most current copy of data. 1140 00:49:35,720 --> 00:49:38,340 That's what makes DynamoDB consistent. 1141 00:49:38,340 --> 00:49:42,450 Now you can choose to turn those consistent reads off. 1142 00:49:42,450 --> 00:49:45,070 In which case I'm going to say, I'll only read from one node. 1143 00:49:45,070 --> 00:49:47,430 And I can't guarantee it's going to be the most current data. 1144 00:49:47,430 --> 00:49:49,450 >> So if a write is coming in and it hasn't replicated yet, 1145 00:49:49,450 --> 00:49:50,360 you're going to get that copy. 1146 00:49:50,360 --> 00:49:52,220 That's an eventually consistent read. 1147 00:49:52,220 --> 00:49:54,640 And what that is is half the cost. 1148 00:49:54,640 --> 00:49:56,140 So this is something to think about. 1149 00:49:56,140 --> 00:50:00,160 When you're reading out of DynamoDB, and you're setting up your read capacity 1150 00:50:00,160 --> 00:50:04,430 units, if you choose eventually consistent reads, it's a lot cheaper, 1151 00:50:04,430 --> 00:50:06,010 it's about half the cost. 1152 00:50:06,010 --> 00:50:09,342 >> And so it saves you money. 1153 00:50:09,342 --> 00:50:10,300 But that's your choice. 1154 00:50:10,300 --> 00:50:12,925 If you want a consistent read or an eventually consistent read. 1155 00:50:12,925 --> 00:50:15,720 That's something that you can choose. 1156 00:50:15,720 --> 00:50:17,659 >> Let's talk about indexes. 1157 00:50:17,659 --> 00:50:19,450 So we mentioned that top level aggregation. 1158 00:50:19,450 --> 00:50:23,720 We've got hash keys, and we've got range keys. 1159 00:50:23,720 --> 00:50:24,320 That's nice. 1160 00:50:24,320 --> 00:50:26,950 And that's on the primary table, I got one hash key, I got one range key. 1161 00:50:26,950 --> 00:50:27,783 >> What does that mean? 1162 00:50:27,783 --> 00:50:30,410 I've got one attribute that I can run rich queries against. 1163 00:50:30,410 --> 00:50:31,800 It's the range key. 1164 00:50:31,800 --> 00:50:35,530 The other attributes on that item-- I can filter on those attributes. 1165 00:50:35,530 --> 00:50:40,050 But I can't do things like, it begins with, or is greater than. 1166 00:50:40,050 --> 00:50:40,820 >> How do I do that? 1167 00:50:40,820 --> 00:50:42,860 I create an index. 1168 00:50:42,860 --> 00:50:45,340 There are two types of indexes in DynamoDB. 1169 00:50:45,340 --> 00:50:49,002 An index is really another view of the table. 1170 00:50:49,002 --> 00:50:50,490 There's the global secondary index and the local secondary index. 1171 00:50:50,490 --> 00:50:51,781 >> The first one we'll talk about is the local secondary index. 1172 00:50:51,781 --> 00:50:57,740 So local secondaries are co-located on the same partition as the data. 1173 00:50:57,740 --> 00:51:00,240 And as such, they are on the same physical node. 1174 00:51:00,240 --> 00:51:01,780 They are what we call consistent. 1175 00:51:01,780 --> 00:51:04,599 Meaning, they will acknowledge the write along with the table. 1176 00:51:04,599 --> 00:51:06,890 When the write comes in, we'll write through the index. 1177 00:51:06,890 --> 00:51:09,306 We'll write up to the table, and then we will acknowledge. 1178 00:51:09,306 --> 00:51:10,490 So that's consistent.
1179 00:51:10,490 --> 00:51:13,174 Once the write has been acknowledged from the table, 1180 00:51:13,174 --> 00:51:15,090 it's guaranteed that the local secondary index 1181 00:51:15,090 --> 00:51:18,380 will have the same view of the data. 1182 00:51:18,380 --> 00:51:22,390 But what they allow you to do is define alternate range keys. 1183 00:51:22,390 --> 00:51:25,260 >> They have to use the same hash key as the primary table, 1184 00:51:25,260 --> 00:51:29,050 because they are co-located on the same partition, and they're consistent. 1185 00:51:29,050 --> 00:51:33,110 But I can create an index with different range keys. 1186 00:51:33,110 --> 00:51:41,590 So for example, if I had a manufacturer that had a raw parts table coming in. 1187 00:51:41,590 --> 00:51:44,590 And raw parts come in, and they're aggregated by assembly. 1188 00:51:44,590 --> 00:51:46,840 And maybe there's a recall. 1189 00:51:46,840 --> 00:51:50,240 >> Any part that was made by this manufacturer after this date, 1190 00:51:50,240 --> 00:51:52,840 I need to pull from my line. 1191 00:51:52,840 --> 00:51:55,950 I can spin up an index that would be looking, 1192 00:51:55,950 --> 00:52:00,760 aggregating on the date of manufacture of that particular part. 1193 00:52:00,760 --> 00:52:03,930 So if my top level table was already hashed by manufacturer, 1194 00:52:03,930 --> 00:52:07,655 maybe it was ranged on part ID, I can create an index off that table 1195 00:52:07,655 --> 00:52:11,140 as hashed by manufacturer and ranged on date of manufacture. 1196 00:52:11,140 --> 00:52:14,490 And that way I could say, anything that was manufactured between these dates, 1197 00:52:14,490 --> 00:52:16,804 I need to pull from the line. 1198 00:52:16,804 --> 00:52:18,220 So that's a local secondary index. 1199 00:52:18,220 --> 00:52:22,280 >> These have the effect of limiting your hash key space. 1200 00:52:22,280 --> 00:52:24,360 Because they co-exist on the same storage node, 1201 00:52:24,360 --> 00:52:26,860 they limit the hash key space to 10 gigabytes. 1202 00:52:26,860 --> 00:52:28,950 DynamoDB, under the covers, will partition 1203 00:52:28,950 --> 00:52:31,380 your table every 10 gigabytes. 1204 00:52:31,380 --> 00:52:34,760 When you put 10 gigs of data in, we go [PHH], and we add another node. 1205 00:52:34,760 --> 00:52:38,120 1206 00:52:38,120 --> 00:52:42,070 >> We will not split the LSI across multiple partitions. 1207 00:52:42,070 --> 00:52:43,200 We'll split the table. 1208 00:52:43,200 --> 00:52:44,679 But we won't split the LSI. 1209 00:52:44,679 --> 00:52:46,470 So that's something important to understand 1210 00:52:46,470 --> 00:52:50,070 is if you're doing very, very, very large aggregations, 1211 00:52:50,070 --> 00:52:53,860 then you're going to be limited to 10 gigabytes on your LSIs. 1212 00:52:53,860 --> 00:52:56,640 >> If that's the case, we can use global secondaries. 1213 00:52:56,640 --> 00:52:58,630 Global secondaries are really another table. 1214 00:52:58,630 --> 00:53:01,720 They exist completely off to the side of your primary table. 1215 00:53:01,720 --> 00:53:04,680 And they allow me to define a completely different structure. 1216 00:53:04,680 --> 00:53:08,010 So think of it as data is being inserted into two different tables, structured 1217 00:53:08,010 --> 00:53:09,220 in two different ways. 1218 00:53:09,220 --> 00:53:11,360 >> I can define a totally different hash key. 1219 00:53:11,360 --> 00:53:13,490 I can define a totally different range key.
1220 00:53:13,490 --> 00:53:15,941 And I can run this completely independently. 1221 00:53:15,941 --> 00:53:18,190 As a matter of fact, I've provisioned my read capacity 1222 00:53:18,190 --> 00:53:21,090 and write capacity for my global secondary indexes 1223 00:53:21,090 --> 00:53:24,240 completely independently of my primary table. 1224 00:53:24,240 --> 00:53:26,640 If I define that index, I tell it how much read and write 1225 00:53:26,640 --> 00:53:28,610 capacity it's going to be using. 1226 00:53:28,610 --> 00:53:31,490 >> And that is separate from my primary table. 1227 00:53:31,490 --> 00:53:35,240 Now both of the indexes allow us to not only define hash and range keys, 1228 00:53:35,240 --> 00:53:38,610 but they allow us to project additional values. 1229 00:53:38,610 --> 00:53:44,950 So if I want to read off the index, and I want to get some set of data, 1230 00:53:44,950 --> 00:53:48,327 I don't need to go back to the main table to get the additional attributes. 1231 00:53:48,327 --> 00:53:50,660 I can project those additional attributes into the table 1232 00:53:50,660 --> 00:53:53,440 to support the access pattern. 1233 00:53:53,440 --> 00:53:57,700 I know we're probably getting into some really, really-- getting into the weeds 1234 00:53:57,700 --> 00:53:58,910 here on some of this stuff. 1235 00:53:58,910 --> 00:54:02,725 Now I got to drift out of this. 1236 00:54:02,725 --> 00:54:07,320 >> AUDIENCE: [INAUDIBLE] --table key meant was a hash? 1237 00:54:07,320 --> 00:54:08,840 The original hash? 1238 00:54:08,840 --> 00:54:09,340 Multi-slats? 1239 00:54:09,340 --> 00:54:10,200 >> RICK HOULIHAN: Yes. 1240 00:54:10,200 --> 00:54:11,070 Yes. 1241 00:54:11,070 --> 00:54:15,260 The table key basically points back to the item. 1242 00:54:15,260 --> 00:54:19,280 So an index is a pointer back to the original items on the table. 1243 00:54:19,280 --> 00:54:22,910 Now you can choose to build an index that only has the table key, 1244 00:54:22,910 --> 00:54:24,840 and no other properties. 1245 00:54:24,840 --> 00:54:26,570 And why might I do that? 1246 00:54:26,570 --> 00:54:28,570 Well, maybe I have very large items. 1247 00:54:28,570 --> 00:54:31,660 >> I really only need to know which-- my access pattern might say, 1248 00:54:31,660 --> 00:54:33,760 which items contain this property? 1249 00:54:33,760 --> 00:54:35,780 Don't need to return the item. 1250 00:54:35,780 --> 00:54:37,800 I just need to know which items contain it. 1251 00:54:37,800 --> 00:54:40,700 So you can build indexes that only have the table key. 1252 00:54:40,700 --> 00:54:43,360 >> But that's primarily what an index in database is for. 1253 00:54:43,360 --> 00:54:46,280 It's for being able to quickly identify which records, 1254 00:54:46,280 --> 00:54:49,470 which rows, which items in the table have 1255 00:54:49,470 --> 00:54:51,080 the properties that I'm searching for. 1256 00:54:51,080 --> 00:54:53,610 1257 00:54:53,610 --> 00:54:54,860 >> GSIs, so how do they work? 1258 00:54:54,860 --> 00:54:58,340 GSIs basically are asynchronous. 1259 00:54:58,340 --> 00:55:02,570 The update comes into the table, table is then asynchronously updated 1260 00:55:02,570 --> 00:55:03,720 all of your GSIs. 1261 00:55:03,720 --> 00:55:06,680 This is why GSIs are eventually consistent. 
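To illustrate the points above, here is a rough sketch of defining a global secondary index with its own keys, its own projection, and its own provisioned capacity, using boto3. The table name "Orders", the index name "ByCustomer", and the attribute names are hypothetical examples, not anything from the talk.

    import boto3
    from boto3.dynamodb.conditions import Key

    # Sketch only: a hypothetical orders table keyed on order ID, with a GSI
    # that re-aggregates the same items by customer and order date.
    client = boto3.client("dynamodb")
    client.create_table(
        TableName="Orders",
        AttributeDefinitions=[
            {"AttributeName": "OrderId", "AttributeType": "S"},
            {"AttributeName": "CustomerId", "AttributeType": "S"},
            {"AttributeName": "OrderDate", "AttributeType": "S"},
        ],
        KeySchema=[{"AttributeName": "OrderId", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
        GlobalSecondaryIndexes=[
            {
                "IndexName": "ByCustomer",
                "KeySchema": [
                    {"AttributeName": "CustomerId", "KeyType": "HASH"},
                    {"AttributeName": "OrderDate", "KeyType": "RANGE"},
                ],
                # Project only what the access pattern needs (or ALL / INCLUDE).
                "Projection": {"ProjectionType": "KEYS_ONLY"},
                # The index gets its own capacity, separate from the table.
                "ProvisionedThroughput": {"ReadCapacityUnits": 50, "WriteCapacityUnits": 50},
            }
        ],
    )

    # Querying the index (eventually consistent) instead of the base table.
    orders = boto3.resource("dynamodb").Table("Orders")
    recent = orders.query(
        IndexName="ByCustomer",
        KeyConditionExpression=(
            Key("CustomerId").eq("customer-42") & Key("OrderDate").gt("2015-01-01")
        ),
    )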
1262 00:55:06,680 --> 00:55:09,440 >> It is important to note that when you're building GSIs, 1263 00:55:09,440 --> 00:55:13,110 and you understand you're creating another dimension of aggregation-- 1264 00:55:13,110 --> 00:55:16,594 now let's say a good example here is a manufacturer. 1265 00:55:16,594 --> 00:55:19,260 I think I might have talked about a medical device manufacturer. 1266 00:55:19,260 --> 00:55:23,870 Medical device manufacturers oftentimes have serialized parts. 1267 00:55:23,870 --> 00:55:28,070 The parts that go into a hip replacement all 1268 00:55:28,070 --> 00:55:30,200 have a little serial number on them. 1269 00:55:30,200 --> 00:55:33,584 And they could have millions and millions and billions of parts 1270 00:55:33,584 --> 00:55:35,000 in all the devices that they ship. 1271 00:55:35,000 --> 00:55:37,440 Well, they need to aggregate under different dimensions, all the parts 1272 00:55:37,440 --> 00:55:39,520 in an assembly, all the parts that were made 1273 00:55:39,520 --> 00:55:41,670 on a certain line, all the parts that came 1274 00:55:41,670 --> 00:55:44,620 in from a certain manufacturer on a certain date. 1275 00:55:44,620 --> 00:55:47,940 And these aggregations sometimes get up into the billions. 1276 00:55:47,940 --> 00:55:50,550 >> So I work with some of these guys who are suffering 1277 00:55:50,550 --> 00:55:53,156 because they're creating these ginormous aggregations 1278 00:55:53,156 --> 00:55:54,280 in their secondary indexes. 1279 00:55:54,280 --> 00:55:57,070 They might have a raw parts table that comes as hash only. 1280 00:55:57,070 --> 00:55:59,090 Every part has a unique serial number. 1281 00:55:59,090 --> 00:56:00,975 I use the serial number as the hash. 1282 00:56:00,975 --> 00:56:01,600 It's beautiful. 1283 00:56:01,600 --> 00:56:04,160 My raw data table is spread all across the key space. 1284 00:56:04,160 --> 00:56:05,930 My [? write ?] [? ingestion ?] is awesome. 1285 00:56:05,930 --> 00:56:07,876 I take a lot of data. 1286 00:56:07,876 --> 00:56:09,500 Then what they do is they create a GSI. 1287 00:56:09,500 --> 00:56:12,666 And I say, you know what, I need to see all the parts for this manufacturer. 1288 00:56:12,666 --> 00:56:15,060 Well, all of a sudden I'm taking a billion rows, 1289 00:56:15,060 --> 00:56:17,550 and stuff them onto one node, because when 1290 00:56:17,550 --> 00:56:21,170 I aggregate as the manufacturer ID as the hash, 1291 00:56:21,170 --> 00:56:25,410 and part number as the range, then all of the sudden I'm 1292 00:56:25,410 --> 00:56:30,530 putting a billion parts into what this manufacturer has delivered me. 1293 00:56:30,530 --> 00:56:34,447 >> That can cause a lot of pressure on the GSI, 1294 00:56:34,447 --> 00:56:36,030 again, because I'm hammering one node. 1295 00:56:36,030 --> 00:56:38,350 I'm putting all these inserts into one node. 1296 00:56:38,350 --> 00:56:40,940 And that's a real problematic use case. 1297 00:56:40,940 --> 00:56:43,479 Now, I got a good design pattern for how you avoid that. 1298 00:56:43,479 --> 00:56:45,770 And that's one of the problems that I always work with. 1299 00:56:45,770 --> 00:56:49,590 But what happens, is the GSI might not have enough write capacity 1300 00:56:49,590 --> 00:56:52,330 to be able to push all those rows into a single node. 1301 00:56:52,330 --> 00:56:55,390 And what happens then is the primary, the client table, 1302 00:56:55,390 --> 00:57:00,180 the primary table will be throttled because the GSI can't keep up. 
1303 00:57:00,180 --> 00:57:02,980 So my insert rate will fall on the primary table 1304 00:57:02,980 --> 00:57:06,230 as my GSI tries to keep up. 1305 00:57:06,230 --> 00:57:08,850 >> All right, so GSI's, LSI's, which one should I use? 1306 00:57:08,850 --> 00:57:12,290 LSI's are consistent. 1307 00:57:12,290 --> 00:57:13,750 GSI's are eventually consistent. 1308 00:57:13,750 --> 00:57:17,490 If that's OK, I recommend using a GSI, they're much more flexible. 1309 00:57:17,490 --> 00:57:20,270 LSI's can be modeled as a GSI. 1310 00:57:20,270 --> 00:57:27,040 And if the data size per hash key in your collection exceeds 10 gigabytes, 1311 00:57:27,040 --> 00:57:31,050 then you're going to want to use that GSI because it's just a hard limit. 1312 00:57:31,050 --> 00:57:32,035 >> All right, so scaling. 1313 00:57:32,035 --> 00:57:35,210 1314 00:57:35,210 --> 00:57:37,460 Throughput in Dynamo DB, you can provision [INAUDIBLE] 1315 00:57:37,460 --> 00:57:38,680 throughput to a table. 1316 00:57:38,680 --> 00:57:42,740 We have customers that have provisioned 60 billion-- 1317 00:57:42,740 --> 00:57:45,970 are doing 60 billion requests, regularly running at over a million requests 1318 00:57:45,970 --> 00:57:47,790 per second on our tables. 1319 00:57:47,790 --> 00:57:50,360 There's really no theoretical limit to how much 1320 00:57:50,360 --> 00:57:53,730 and how fast the table can run in Dynamo DB. 1321 00:57:53,730 --> 00:57:55,920 There are some soft limits on your account 1322 00:57:55,920 --> 00:57:58,170 that we put in there so that you don't go crazy. 1323 00:57:58,170 --> 00:58:00,070 If you want more than that, not a problem. 1324 00:58:00,070 --> 00:58:00,820 You come tell us. 1325 00:58:00,820 --> 00:58:02,810 We'll turn up the dial. 1326 00:58:02,810 --> 00:58:08,210 >> Every account is limited to some level in every service, just off the bat 1327 00:58:08,210 --> 00:58:11,920 so people don't go crazy and get themselves into trouble. 1328 00:58:11,920 --> 00:58:12,840 No limit in size. 1329 00:58:12,840 --> 00:58:14,940 You can put any number of items on a table. 1330 00:58:14,940 --> 00:58:17,620 The size of an item is limited to 400 kilobytes each, 1331 00:58:17,620 --> 00:58:20,050 that would be the item, not the attributes. 1332 00:58:20,050 --> 00:58:24,200 So the sum of all attributes is limited to 400 kilobytes. 1333 00:58:24,200 --> 00:58:27,300 And then again, we have that little LSI issue 1334 00:58:27,300 --> 00:58:30,405 with the 10 gigabyte limit per hash. 1335 00:58:30,405 --> 00:58:33,280 AUDIENCE: Small number, I'm missing what you're telling me, that is-- 1336 00:58:33,280 --> 00:58:36,830 RICK HOULIHAN: Oh, 400 kilobytes is the maximum size per item. 1337 00:58:36,830 --> 00:58:39,570 So an item has all the attributes. 1338 00:58:39,570 --> 00:58:43,950 So 400 k is the total size of that item, 400 kilobytes. 1339 00:58:43,950 --> 00:58:46,170 So of all the attributes combined, all the data 1340 00:58:46,170 --> 00:58:49,140 that's in all those attributes, rolled up into a total size, 1341 00:58:49,140 --> 00:58:51,140 currently today the item limit is 400 k. 1342 00:58:51,140 --> 00:58:54,390 1343 00:58:54,390 --> 00:58:57,046 So scaling again, achieved through partitioning. 1344 00:58:57,046 --> 00:58:58,920 Throughput is provisioned at the table level. 1345 00:58:58,920 --> 00:59:00,160 And there's really two knobs. 1346 00:59:00,160 --> 00:59:02,400 We have read capacity and write capacity.
1347 00:59:02,400 --> 00:59:05,530 >> So these are adjusted independently of each other. 1348 00:59:05,530 --> 00:59:08,640 RCU's measure strictly consistent reads. 1349 00:59:08,640 --> 00:59:13,005 OK, so if you're saying I want 1,000 RCU's, those are strictly consistent, 1350 00:59:13,005 --> 00:59:14,130 those are consistent reads. 1351 00:59:14,130 --> 00:59:17,130 If you say I want eventually consistent reads, 1352 00:59:17,130 --> 00:59:19,402 you can provision 1,000 RCU's, you're going 1353 00:59:19,402 --> 00:59:21,840 to get 2,000 eventually consistent reads. 1354 00:59:21,840 --> 00:59:25,940 And it's half the price for those eventually consistent reads. 1355 00:59:25,940 --> 00:59:28,520 >> Again, adjusted independently of each other. 1356 00:59:28,520 --> 00:59:32,900 And they have the throughput-- if you consume 100% of your RCU, 1357 00:59:32,900 --> 00:59:35,960 you're not going to impact the availability of your writes. 1358 00:59:35,960 --> 00:59:40,161 So they are completely independent of each other. 1359 00:59:40,161 --> 00:59:43,160 All right, so one of the things that I mentioned briefly was throttling. 1360 00:59:43,160 --> 00:59:44,320 Throttling is bad. 1361 00:59:44,320 --> 00:59:47,311 Throttling indicates bad NoSQL. 1362 00:59:47,311 --> 00:59:50,310 There are things we can do to help you alleviate the throttling that you 1363 00:59:50,310 --> 00:59:51,040 are experiencing. 1364 00:59:51,040 --> 00:59:53,240 But the best solution to this is let's take 1365 00:59:53,240 --> 00:59:58,000 a look at what you're doing, because there's an anti-pattern in play here. 1366 00:59:58,000 --> 01:00:02,140 >> These things, things like non-uniform workloads, hot keys, hot partitions. 1367 01:00:02,140 --> 01:00:06,210 I'm hitting a particular key space very hard for some particular reason. 1368 01:00:06,210 --> 01:00:07,080 Why am I doing this? 1369 01:00:07,080 --> 01:00:08,710 Let's figure that out. 1370 01:00:08,710 --> 01:00:10,427 I'm mixing my hot data with cold data. 1371 01:00:10,427 --> 01:00:12,510 I'm letting my tables get huge, but there's really 1372 01:00:12,510 --> 01:00:15,970 only some subset of the data that's really interesting to me. 1373 01:00:15,970 --> 01:00:20,290 So for log data, for example, a lot of customers, they get log data every day. 1374 01:00:20,290 --> 01:00:22,490 They got a huge amount of log data. 1375 01:00:22,490 --> 01:00:25,940 >> If you're just dumping all that log data into one big table, over time 1376 01:00:25,940 --> 01:00:28,070 that table's going to get massive. 1377 01:00:28,070 --> 01:00:30,950 But I'm really only interested in the last 24 hours, the last seven days, 1378 01:00:30,950 --> 01:00:31,659 the last 30 days. 1379 01:00:31,659 --> 01:00:34,074 Whatever the window of time that I'm interested in looking 1380 01:00:34,074 --> 01:00:37,010 for the event that bothers me, or the event that's interesting to me, 1381 01:00:37,010 --> 01:00:39,540 that's the only window of time that I need. 1382 01:00:39,540 --> 01:00:42,470 So why am I putting 10 years worth of log data in the table? 1383 01:00:42,470 --> 01:00:45,030 What that causes is the table to fragment. 1384 01:00:45,030 --> 01:00:45,880 >> It gets huge. 1385 01:00:45,880 --> 01:00:48,340 It starts spreading out across thousands of nodes. 1386 01:00:48,340 --> 01:00:51,380 And since your capacity is so low, you're 1387 01:00:51,380 --> 01:00:54,090 actually rate limiting on each one of those individual nodes.
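A rough, back-of-the-envelope sketch of why that hurts. The numbers, and the assumption that provisioned throughput is spread evenly across partitions, are illustrative only, not an exact partitioning formula.

    # Sketch only: a table splits roughly every 10 gigabytes, and the table's
    # provisioned throughput is shared across those partitions, so a huge,
    # mostly cold table leaves very little capacity on any one partition.
    table_size_gb = 1000        # say, ten years of log data
    partition_size_gb = 10      # approximate split point mentioned above
    provisioned_rcu = 2000      # read capacity for the whole table

    partitions = table_size_gb // partition_size_gb      # ~100 partitions
    rcu_per_partition = provisioned_rcu / partitions     # ~20 RCUs each
    print(partitions, rcu_per_partition)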
1388 01:00:54,090 --> 01:00:57,120 So let's start looking at how do we roll that table over. 1389 01:00:57,120 --> 01:01:01,502 How do we manage that data a little better to avoid these problems. 1390 01:01:01,502 --> 01:01:02,710 And what does that look like? 1391 01:01:02,710 --> 01:01:04,370 This is what that looks like. 1392 01:01:04,370 --> 01:01:06,790 This is what bad NoSQL looks like. 1393 01:01:06,790 --> 01:01:07,830 >> I got a hot key here. 1394 01:01:07,830 --> 01:01:10,246 If you look on the side here, these are all my partitions. 1395 01:01:10,246 --> 01:01:12,630 I got 16 partitions up here on this particular database. 1396 01:01:12,630 --> 01:01:13,630 We do this all the time. 1397 01:01:13,630 --> 01:01:15,046 I run this for customers all time. 1398 01:01:15,046 --> 01:01:16,550 It's called the heat map. 1399 01:01:16,550 --> 01:01:20,590 Heat map tells me how you're accessing your key space. 1400 01:01:20,590 --> 01:01:23,700 And what this is telling me is that there's one particular hash 1401 01:01:23,700 --> 01:01:26,330 that this guy likes an awful lot, because he's 1402 01:01:26,330 --> 01:01:28,250 hitting it really, really hard. 1403 01:01:28,250 --> 01:01:29,260 >> So the blue is nice. 1404 01:01:29,260 --> 01:01:29,900 We like blue. 1405 01:01:29,900 --> 01:01:30,720 We don't like red. 1406 01:01:30,720 --> 01:01:33,120 Red's where the pressure gets up to 100%. 1407 01:01:33,120 --> 01:01:35,560 100%, now you're going to be throttled. 1408 01:01:35,560 --> 01:01:39,030 So whenever you see any red lines like this-- and it's not just Dynamo DB-- 1409 01:01:39,030 --> 01:01:41,630 every NoSQL database has this problem. 1410 01:01:41,630 --> 01:01:44,640 There are anti-patterns that can drive these types of conditions. 1411 01:01:44,640 --> 01:01:49,070 What I do is I work with customers to alleviate these conditions. 1412 01:01:49,070 --> 01:01:51,840 >> And what does that look like? 1413 01:01:51,840 --> 01:01:54,260 And this is getting the most out of Dynamo DB throughput, 1414 01:01:54,260 --> 01:01:56,176 but it's really getting the most out of NoSQL. 1415 01:01:56,176 --> 01:01:58,740 This is not restricted to Dynamo. 1416 01:01:58,740 --> 01:02:02,050 This is definitely-- I used to work at Mongo. 1417 01:02:02,050 --> 01:02:04,090 I'm familiar with many NoSQL platforms. 1418 01:02:04,090 --> 01:02:06,830 Every one has these types of hot key problems. 1419 01:02:06,830 --> 01:02:10,320 To get the most out of any NoSQL database, specifically Dynamo DB, 1420 01:02:10,320 --> 01:02:13,320 you want to create the tables where the hash key element has 1421 01:02:13,320 --> 01:02:18,590 a large number of distinct values, a high degree of cardinality. 1422 01:02:18,590 --> 01:02:22,530 Because that means I'm writing to lots of different buckets. 1423 01:02:22,530 --> 01:02:24,870 >> The more buckets I'm writing to, the more likely 1424 01:02:24,870 --> 01:02:29,100 I am to spread that write load or read load out across multiple nodes, 1425 01:02:29,100 --> 01:02:33,560 the more likely I am to have a high throughput on the table. 1426 01:02:33,560 --> 01:02:37,440 And then I want the values to be requested fairly evenly over time 1427 01:02:37,440 --> 01:02:39,430 and uniformly as randomly as possible. 1428 01:02:39,430 --> 01:02:42,410 Well, that's kind of interesting, because I can't really 1429 01:02:42,410 --> 01:02:43,960 control when the users come. 
1430 01:02:43,960 --> 01:02:47,645 So suffice to say, if we spread things out across the key space, 1431 01:02:47,645 --> 01:02:49,270 we'll probably be in better shape. 1432 01:02:49,270 --> 01:02:51,522 >> There's a certain amount of time delivery 1433 01:02:51,522 --> 01:02:53,230 that you're not going to be able control. 1434 01:02:53,230 --> 01:02:55,438 But those are really the two dimensions that we have, 1435 01:02:55,438 --> 01:02:58,800 space, access evenly spread, time, requests 1436 01:02:58,800 --> 01:03:01,040 arriving evenly spaced in time. 1437 01:03:01,040 --> 01:03:03,110 And if those two conditions are being met, 1438 01:03:03,110 --> 01:03:05,610 then that's what it's going to look like. 1439 01:03:05,610 --> 01:03:07,890 This is much nicer. 1440 01:03:07,890 --> 01:03:08,890 We're really happy here. 1441 01:03:08,890 --> 01:03:10,432 We've got a very even access pattern. 1442 01:03:10,432 --> 01:03:13,098 Yeah, maybe you're getting a little pressure every now and then, 1443 01:03:13,098 --> 01:03:14,830 but nothing really too extensive. 1444 01:03:14,830 --> 01:03:17,660 So it's amazing how many times, when I work with customers, 1445 01:03:17,660 --> 01:03:20,670 that first graph with the big red bar and all that ugly yellow it's 1446 01:03:20,670 --> 01:03:23,147 all over the place, we get done with the exercise 1447 01:03:23,147 --> 01:03:24,980 after a couple of months of re-architecture, 1448 01:03:24,980 --> 01:03:28,050 they're running the exact same workload at the exact same load. 1449 01:03:28,050 --> 01:03:30,140 And this is what it's looking like now. 1450 01:03:30,140 --> 01:03:36,600 So what you get with NoSQL is a data schema that is absolutely 1451 01:03:36,600 --> 01:03:38,510 tied to the access pattern. 1452 01:03:38,510 --> 01:03:42,170 >> And you can optimize that data schema to support that access pattern. 1453 01:03:42,170 --> 01:03:45,490 If you don't, then you're going to see those types of problems 1454 01:03:45,490 --> 01:03:46,710 with those hot keys. 1455 01:03:46,710 --> 01:03:50,518 >> AUDIENCE: Well, inevitably some places are going to be hotter than others. 1456 01:03:50,518 --> 01:03:51,450 >> RICK HOULIHAN: Always. 1457 01:03:51,450 --> 01:03:51,960 Always. 1458 01:03:51,960 --> 01:03:54,620 Yeah, I mean there's always a-- and again, there's 1459 01:03:54,620 --> 01:03:56,980 some design patterns we'll get through that will talk about how you deal 1460 01:03:56,980 --> 01:03:58,480 with these super large aggregations. 1461 01:03:58,480 --> 01:04:01,260 I mean, I got to have them, how do we deal with them? 1462 01:04:01,260 --> 01:04:03,760 I got a pretty good use case that we'll talk about for that. 1463 01:04:03,760 --> 01:04:05,940 >> All right, so let's talk about some customers now. 1464 01:04:05,940 --> 01:04:06,950 These guys are AdRoll. 1465 01:04:06,950 --> 01:04:08,990 I don't know if you're familiar with AdRoll. 1466 01:04:08,990 --> 01:04:10,781 You probably see them a lot on the browser. 1467 01:04:10,781 --> 01:04:14,230 They're ad re-targeting, they're the largest ad re-targeting business 1468 01:04:14,230 --> 01:04:14,940 out there. 1469 01:04:14,940 --> 01:04:17,792 They normally regularly run over 60 billion transactions per day. 1470 01:04:17,792 --> 01:04:20,000 They're doing over a million transactions per second. 1471 01:04:20,000 --> 01:04:22,660 They've got a pretty simple table structure, the busiest table. 
1472 01:04:22,660 --> 01:04:26,450 It's basically just a hash key is the cookie, 1473 01:04:26,450 --> 01:04:29,010 the range is the demographic category, and then 1474 01:04:29,010 --> 01:04:31,220 the third attribute is the score. 1475 01:04:31,220 --> 01:04:33,720 >> So we all have cookies in our browser from these guys. 1476 01:04:33,720 --> 01:04:35,900 And when you go to a participating merchant, 1477 01:04:35,900 --> 01:04:39,390 they basically score you across various demographic categories. 1478 01:04:39,390 --> 01:04:42,070 When you go to a website and you say I want to see this ad-- 1479 01:04:42,070 --> 01:04:44,920 or basically you don't say that-- but when you go to the website 1480 01:04:44,920 --> 01:04:47,550 they say you want to see this ad. 1481 01:04:47,550 --> 01:04:49,370 And they go get that ad from AdRoll. 1482 01:04:49,370 --> 01:04:51,130 AdRoll looks you up on their table. 1483 01:04:51,130 --> 01:04:52,115 They find your cookie. 1484 01:04:52,115 --> 01:04:53,990 The advertisers telling them, I want somebody 1485 01:04:53,990 --> 01:04:58,632 who's middle-aged, 40-year-old man, into sports. 1486 01:04:58,632 --> 01:05:01,590 And they score you in those demographics and they decide whether or not 1487 01:05:01,590 --> 01:05:02,740 that's a good ad for you. 1488 01:05:02,740 --> 01:05:10,330 >> Now they have a SLA with their advertising providers 1489 01:05:10,330 --> 01:05:14,510 to provide sub-10 millisecond response on every single request. 1490 01:05:14,510 --> 01:05:16,090 So they're using Dynamo DB for this. 1491 01:05:16,090 --> 01:05:18,131 They're hitting us a million requests per second. 1492 01:05:18,131 --> 01:05:21,120 They're able to do all their lookups, triage all that data, 1493 01:05:21,120 --> 01:05:26,130 and get that add link back to that advertiser in under 10 milliseconds. 1494 01:05:26,130 --> 01:05:29,800 It's really pretty phenomenal implementation that they have. 1495 01:05:29,800 --> 01:05:36,210 >> These guys actually-- are these the guys. 1496 01:05:36,210 --> 01:05:38,010 I'm not sure if it's these guys. 1497 01:05:38,010 --> 01:05:40,127 Might be these guys. 1498 01:05:40,127 --> 01:05:42,210 Basically told us-- no, I don't think it was them. 1499 01:05:42,210 --> 01:05:43,000 I think it was somebody else. 1500 01:05:43,000 --> 01:05:44,750 I was working with a customer that told me 1501 01:05:44,750 --> 01:05:47,040 that now that they've gone to Dynamo DB, they're 1502 01:05:47,040 --> 01:05:50,330 spending more money on snacks for their development team every month 1503 01:05:50,330 --> 01:05:52,886 than they spend on their database. 1504 01:05:52,886 --> 01:05:54,760 So it'll give you an idea of the cost savings 1505 01:05:54,760 --> 01:05:57,889 that you can get in Dynamo DB is huge. 1506 01:05:57,889 --> 01:05:59,430 All right, dropcam's another company. 1507 01:05:59,430 --> 01:06:02,138 These guy's kind of-- if you think of internet of things, dropcam 1508 01:06:02,138 --> 01:06:05,150 is basically internet security video. 1509 01:06:05,150 --> 01:06:06,660 You put your camera out there. 1510 01:06:06,660 --> 01:06:08,180 Camera has a motion detector. 1511 01:06:08,180 --> 01:06:10,290 Someone comes along, triggers a cue point. 1512 01:06:10,290 --> 01:06:13,540 Camera starts recording for a while till it doesn't detect any motion anymore. 1513 01:06:13,540 --> 01:06:15,310 Puts that video up on the internet. 
1514 01:06:15,310 --> 01:06:19,800 >> Dropcam is a company that basically switched to Dynamo DB 1515 01:06:19,800 --> 01:06:22,200 because they were experiencing enormous growing pains. 1516 01:06:22,200 --> 01:06:25,820 And what they told us was, suddenly, petabytes of data. 1517 01:06:25,820 --> 01:06:28,070 They had no idea their service would be so successful. 1518 01:06:28,070 --> 01:06:32,310 More inbound video than YouTube is what these guys are getting. 1519 01:06:32,310 --> 01:06:36,780 They use DynamoDB to track all the metadata on all their video key points. 1520 01:06:36,780 --> 01:06:40,282 >> So they have S3 buckets they push all the binary artifacts into. 1521 01:06:40,282 --> 01:06:41,990 And then they have Dynamo DB records that 1522 01:06:41,990 --> 01:06:44,070 point people to those S3 objects. 1523 01:06:44,070 --> 01:06:47,070 When they need to look at a video, they look up the record in Dynamo DB. 1524 01:06:47,070 --> 01:06:47,903 They click the link. 1525 01:06:47,903 --> 01:06:49,770 They pull down the video from S3. 1526 01:06:49,770 --> 01:06:51,590 So that's kind of what this looks like. 1527 01:06:51,590 --> 01:06:53,580 And this is straight from their team. 1528 01:06:53,580 --> 01:06:56,010 >> Dynamo DB reduced their delivery time for video events 1529 01:06:56,010 --> 01:06:57,590 from five to 10 seconds-- 1530 01:06:57,590 --> 01:07:00,470 in their old relational store, they used to have to go and execute 1531 01:07:00,470 --> 01:07:03,780 multiple complex queries to figure out which videos to pull down-- 1532 01:07:03,780 --> 01:07:06,690 to less than 50 milliseconds. 1533 01:07:06,690 --> 01:07:08,990 So it's amazing, amazing how much performance 1534 01:07:08,990 --> 01:07:12,990 you can get when you optimize and you tune the underlying database 1535 01:07:12,990 --> 01:07:15,110 to support the access pattern. 1536 01:07:15,110 --> 01:07:20,500 Halfbrick, these guys, what is it, Fruit Ninja I guess is their thing. 1537 01:07:20,500 --> 01:07:22,590 That all runs on Dynamo DB. 1538 01:07:22,590 --> 01:07:26,810 And these guys, they are a great development team, great development 1539 01:07:26,810 --> 01:07:27,670 shop. 1540 01:07:27,670 --> 01:07:29,364 >> Not a good ops team. 1541 01:07:29,364 --> 01:07:31,280 They didn't have a lot of operations resources. 1542 01:07:31,280 --> 01:07:33,940 They were struggling trying to keep their application infrastructure up 1543 01:07:33,940 --> 01:07:34,290 and running. 1544 01:07:34,290 --> 01:07:35,000 They came to us. 1545 01:07:35,000 --> 01:07:36,251 They looked at Dynamo DB. 1546 01:07:36,251 --> 01:07:37,291 They said, that's for us. 1547 01:07:37,291 --> 01:07:39,470 They built their whole application framework on it. 1548 01:07:39,470 --> 01:07:43,640 Some really nice comments here from the team on their ability 1549 01:07:43,640 --> 01:07:46,800 to now focus on building the game and not 1550 01:07:46,800 --> 01:07:49,010 having to maintain the infrastructure, which 1551 01:07:49,010 --> 01:07:51,910 was becoming an enormous amount of overhead for their team. 1552 01:07:51,910 --> 01:07:56,170 So this is the kind of benefit that you get from Dynamo DB. 1553 01:07:56,170 --> 01:08:00,930 >> All right, getting into data modeling here. 1554 01:08:00,930 --> 01:08:03,440 And we talked a little about this one to one, one to many, 1555 01:08:03,440 --> 01:08:05,060 and many to many type relationships. 1556 01:08:05,060 --> 01:08:07,630 And how do you maintain those in Dynamo?
1557 01:08:07,630 --> 01:08:10,500 In Dynamo DB we use indexes, generally speaking, 1558 01:08:10,500 --> 01:08:12,910 to rotate the data from one flavor to the other. 1559 01:08:12,910 --> 01:08:15,210 Hash keys, range keys, and indexes. 1560 01:08:15,210 --> 01:08:18,540 >> In this particular example, most states 1561 01:08:18,540 --> 01:08:23,802 have a licensing requirement of only one driver's license per person. 1562 01:08:23,802 --> 01:08:26,510 You can't go get two driver's licenses in the state of Massachusetts. 1563 01:08:26,510 --> 01:08:27,500 I can't do it in Texas. 1564 01:08:27,500 --> 01:08:28,708 That's kind of the way it is. 1565 01:08:28,708 --> 01:08:32,779 And so at the DMV, we have lookups, we want to look up the driver's license 1566 01:08:32,779 --> 01:08:35,180 by the social security number. 1567 01:08:35,180 --> 01:08:39,990 I want to look up the user details by the driver's license number. 1568 01:08:39,990 --> 01:08:43,620 >> So we might have a users table that has a hash key on the social security 1569 01:08:43,620 --> 01:08:47,830 number, and various attributes defined on the item. 1570 01:08:47,830 --> 01:08:49,859 Now on that table I could define a GSI that 1571 01:08:49,859 --> 01:08:53,370 flips that around, that says I want a hash key on the license number and then 1572 01:08:53,370 --> 01:08:54,252 all the other attributes. 1573 01:08:54,252 --> 01:08:57,210 Now if I want to query and find the license number for any given Social 1574 01:08:57,210 --> 01:08:59,609 Security number, I can query the main table. 1575 01:08:59,609 --> 01:09:02,130 >> If I want to query and I want to get the social security 1576 01:09:02,130 --> 01:09:05,735 number or other attributes by a license number, I can query the GSI. 1577 01:09:05,735 --> 01:09:08,689 That models the one to one relationship. 1578 01:09:08,689 --> 01:09:12,460 Just a very simple GSI, flip those things around. 1579 01:09:12,460 --> 01:09:13,979 Now, let's talk about one to many. 1580 01:09:13,979 --> 01:09:16,450 One to many is basically your hash range key. 1581 01:09:16,450 --> 01:09:20,510 Where we see this use case a lot is monitoring data. 1582 01:09:20,510 --> 01:09:23,880 Monitoring data comes in at regular intervals, like internet of things. 1583 01:09:23,880 --> 01:09:26,890 We always get all these records coming in all the time. 1584 01:09:26,890 --> 01:09:31,420 >> And I want to find all the readings between a particular time period. 1585 01:09:31,420 --> 01:09:34,220 It's a very common query in monitoring infrastructure. 1586 01:09:34,220 --> 01:09:38,430 The way to go about that is with a simple table structure, one table. 1587 01:09:38,430 --> 01:09:42,250 I've got a device measurements table with a hash key on the device ID. 1588 01:09:42,250 --> 01:09:47,340 And I have a range key on the timestamp, or in this case, the epoch. 1589 01:09:47,340 --> 01:09:50,350 And that allows me to execute complex queries against that range key 1590 01:09:50,350 --> 01:09:54,950 and return those records that are relevant to the result 1591 01:09:54,950 --> 01:09:56,310 set that I'm looking for. 1592 01:09:56,310 --> 01:09:58,360 And it builds that one to many relationship 1593 01:09:58,360 --> 01:10:02,340 into the primary table using the hash key, range key structure. 1594 01:10:02,340 --> 01:10:04,600 >> So that's kind of built into the table in Dynamo DB. 1595 01:10:04,600 --> 01:10:07,290 When I define a hash and range table, I'm 1596 01:10:07,290 --> 01:10:09,240 defining a one to many relationship.
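Here is a minimal sketch of that one-to-many query, assuming boto3 and a hypothetical DeviceMeasurements table with device_id as the hash key and an epoch timestamp as the range key (the names and values are illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('DeviceMeasurements')   # hypothetical table name

    # All readings for one device (the parent) within a time window (the children).
    resp = table.query(
        KeyConditionExpression=(
            Key('device_id').eq('sensor-1234') &
            Key('epoch').between(1438000000, 1438086400)
        )
    )
    for item in resp['Items']:
        print(item['epoch'], item.get('reading'))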
1597 01:10:09,240 --> 01:10:12,770 It's a parent-child relationship. 1598 01:10:12,770 --> 01:10:14,620 >> Let's talk about many to many relationships. 1599 01:10:14,620 --> 01:10:19,170 And for this particular example, again, we're going to use GSI's. 1600 01:10:19,170 --> 01:10:23,500 And let's talk about gaming scenario where I have a given user. 1601 01:10:23,500 --> 01:10:26,500 I want to find out all the games that he's registered for or playing in. 1602 01:10:26,500 --> 01:10:29,600 And for a given game, I want to find all the users. 1603 01:10:29,600 --> 01:10:31,010 So how do I do that? 1604 01:10:31,010 --> 01:10:34,330 My user games table, I'm going to have a hash key of user ID 1605 01:10:34,330 --> 01:10:35,810 and a range key of the game. 1606 01:10:35,810 --> 01:10:37,810 >> So a user can have multiple games. 1607 01:10:37,810 --> 01:10:41,380 It's a one to many relationship between the user and the games he plays. 1608 01:10:41,380 --> 01:10:43,410 And then on the GSI, I'll flip that around. 1609 01:10:43,410 --> 01:10:46,679 I'll hash on the game and I'll range on the user. 1610 01:10:46,679 --> 01:10:48,970 So if I want to get all the game the user's playing in, 1611 01:10:48,970 --> 01:10:49,950 I'll query the main table. 1612 01:10:49,950 --> 01:10:52,699 If I want to get all the users that are playing a particular game, 1613 01:10:52,699 --> 01:10:53,887 I query the GSI. 1614 01:10:53,887 --> 01:10:54,970 So you see how we do this? 1615 01:10:54,970 --> 01:10:58,369 You build these GSI's to support the use case, the application, the access 1616 01:10:58,369 --> 01:10:59,410 pattern, the application. 1617 01:10:59,410 --> 01:11:01,440 >> If I need to query on this dimension, let 1618 01:11:01,440 --> 01:11:03,500 me create an index on that dimension. 1619 01:11:03,500 --> 01:11:05,850 If I don't, I don't care. 1620 01:11:05,850 --> 01:11:09,060 And depending on the use case, I may need the index or I might not. 1621 01:11:09,060 --> 01:11:12,390 If it's a simple one to many, the primary table is fine. 1622 01:11:12,390 --> 01:11:15,860 If I need to do these many to many's, or I need to do one to ones, 1623 01:11:15,860 --> 01:11:18,390 then maybe I do need to second the index. 1624 01:11:18,390 --> 01:11:20,840 So it all depends on what I'm trying to do 1625 01:11:20,840 --> 01:11:24,550 and what I'm trying to get accomplished. 1626 01:11:24,550 --> 01:11:28,000 >> Probably I'm not going to spend too much time talking about documents. 1627 01:11:28,000 --> 01:11:31,460 This gets a little bit, probably, deeper than we need to go into. 1628 01:11:31,460 --> 01:11:33,710 Let's talk a little bit about rich query expression. 1629 01:11:33,710 --> 01:11:37,831 So in Dynamo DB we have the ability to create 1630 01:11:37,831 --> 01:11:39,330 what we call projection expressions. 1631 01:11:39,330 --> 01:11:42,660 Projection expressions are simply picking the fields or the values 1632 01:11:42,660 --> 01:11:44,290 that you want to display. 1633 01:11:44,290 --> 01:11:46,000 OK, so I make a selection. 1634 01:11:46,000 --> 01:11:48,010 I make a query against Dynamo DB. 1635 01:11:48,010 --> 01:11:51,730 And I say, you know what, show me only the five star reviews 1636 01:11:51,730 --> 01:11:54,544 for this particular product. 1637 01:11:54,544 --> 01:11:55,710 So that's all I want to see. 1638 01:11:55,710 --> 01:11:57,320 I don't want to see all the other attributes of the row, 1639 01:11:57,320 --> 01:11:58,319 I just want to see this. 
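A minimal sketch of that kind of selective read, assuming boto3 and a hypothetical ProductReviews table with product_id as the hash key and review_id as the range key; the attribute names are illustrative:

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    dynamodb = boto3.resource('dynamodb')
    reviews = dynamodb.Table('ProductReviews')     # hypothetical table name

    resp = reviews.query(
        KeyConditionExpression=Key('product_id').eq('B00EXAMPLE'),
        FilterExpression=Attr('stars').eq(5),                  # drop non-5-star items after the read
        ProjectionExpression='review_id, stars, review_title'  # return only the fields the view needs
    )
    print(resp['Items'])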
1640 01:11:58,319 --> 01:12:01,209 It's just like in SQL: when you say select star from table, 1641 01:12:01,209 --> 01:12:02,000 you get everything. 1642 01:12:02,000 --> 01:12:05,450 When I say select name from table, I only get one attribute. 1643 01:12:05,450 --> 01:12:09,070 It's the same kind of thing in Dynamo DB or other NoSQL databases. 1644 01:12:09,070 --> 01:12:14,510 Filter expressions allow me to basically cut the result set down. 1645 01:12:14,510 --> 01:12:15,540 So I make a query. 1646 01:12:15,540 --> 01:12:17,260 The query may come back with 500 items. 1647 01:12:17,260 --> 01:12:20,255 But I only want the items that have an attribute that says this. 1648 01:12:20,255 --> 01:12:23,380 OK, so let's filter out those items that don't match that particular query. 1649 01:12:23,380 --> 01:12:25,540 So we have filter expressions. 1650 01:12:25,540 --> 01:12:28,310 >> Filter expressions can be run on any attribute. 1651 01:12:28,310 --> 01:12:30,260 They're not like range queries. 1652 01:12:30,260 --> 01:12:32,690 Range queries are more selective. 1653 01:12:32,690 --> 01:12:36,470 Filter queries require me to go get the entire result set and then 1654 01:12:36,470 --> 01:12:39,170 carve out the data I don't want. 1655 01:12:39,170 --> 01:12:40,660 Why is that important? 1656 01:12:40,660 --> 01:12:42,770 Because I read it all. 1657 01:12:42,770 --> 01:12:46,597 In a query, I'm going to read, and it's going to be a giant amount of data. 1658 01:12:46,597 --> 01:12:48,430 And then I'm going to carve out what I need. 1659 01:12:48,430 --> 01:12:52,080 And if I'm only carving out a couple of rows, then that's OK. 1660 01:12:52,080 --> 01:12:53,620 It's not so inefficient. 1661 01:12:53,620 --> 01:12:57,800 >> But if I'm reading a whole pile of data, just to carve out one item, 1662 01:12:57,800 --> 01:13:01,490 then I'm going to be better off using a range query, 1663 01:13:01,490 --> 01:13:03,030 because it's much more selective. 1664 01:13:03,030 --> 01:13:06,330 It's going to save me a lot of money, because I pay for that read. 1665 01:13:06,330 --> 01:13:10,430 The result set that comes back across that wire might be smaller, 1666 01:13:10,430 --> 01:13:11,890 but I'm paying for the full read. 1667 01:13:11,890 --> 01:13:14,340 So understand how you're getting the data. 1668 01:13:14,340 --> 01:13:16,420 That's very important in Dynamo DB. 1669 01:13:16,420 --> 01:13:19,710 >> Conditional expressions, this is what you might call optimistic locking. 1670 01:13:19,710 --> 01:13:28,470 Update IF EXISTS, or if this value is equivalent to what I specify. 1671 01:13:28,470 --> 01:13:31,494 And if I have a timestamp on a record, I might read the data. 1672 01:13:31,494 --> 01:13:32,535 I might change that data. 1673 01:13:32,535 --> 01:13:35,030 I might go write that data back to the database. 1674 01:13:35,030 --> 01:13:38,100 If somebody has changed the record, the timestamp might have changed. 1675 01:13:38,100 --> 01:13:40,370 And that way my conditional update could say update 1676 01:13:40,370 --> 01:13:42,340 if the timestamp equals this. 1677 01:13:42,340 --> 01:13:46,290 Or the update will fail because somebody updated the record in the meantime. 1678 01:13:46,290 --> 01:13:48,290 >> That's what we call optimistic locking. 1679 01:13:48,290 --> 01:13:50,670 It means that somebody can come in and change it, 1680 01:13:50,670 --> 01:13:53,100 and I'm going to detect it when I go back to write.
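Here is a minimal sketch of that optimistic-locking pattern, assuming boto3 and a hypothetical Documents table where each item carries a last_updated timestamp (names and values are illustrative):

    import boto3
    from botocore.exceptions import ClientError
    from boto3.dynamodb.conditions import Attr

    dynamodb = boto3.resource('dynamodb')
    docs = dynamodb.Table('Documents')     # hypothetical table name

    # Read the item and remember the timestamp we saw.
    item = docs.get_item(Key={'doc_id': 'doc-42'})['Item']
    seen_ts = item['last_updated']

    try:
        # Only write if nobody changed the record since we read it.
        docs.update_item(
            Key={'doc_id': 'doc-42'},
            UpdateExpression='SET message_body = :b, last_updated = :now',
            ConditionExpression=Attr('last_updated').eq(seen_ts),
            ExpressionAttributeValues={':b': 'new content', ':now': '2015-08-01T12:00:00Z'},
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            # Someone updated the record in the meantime: re-read, merge, and retry.
            pass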
1681 01:13:53,100 --> 01:13:56,106 And then I can actually read that data and say, oh, he changed this. 1682 01:13:56,106 --> 01:13:57,230 I need to account for that. 1683 01:13:57,230 --> 01:14:00,490 And I can change the data in my record and apply another update. 1684 01:14:00,490 --> 01:14:04,330 So you can catch those incremental updates that occur between the time 1685 01:14:04,330 --> 01:14:08,740 that you read the data and the time you might write the data. 1686 01:14:08,740 --> 01:14:11,520 >> AUDIENCE: And the filter expression actually means not 1687 01:14:11,520 --> 01:14:13,020 in the number or not-- 1688 01:14:13,020 --> 01:14:14,316 >> [INTERPOSING VOICES] 1689 01:14:14,316 --> 01:14:16,232 RICK HOULIHAN: I won't get too much into this. 1690 01:14:16,232 --> 01:14:17,700 This is a reserved keyword. 1691 01:14:17,700 --> 01:14:20,130 The pound is in front of views because views is a reserved keyword in Dynamo DB. 1692 01:14:20,130 --> 01:14:24,500 Every database has its own reserved names for collections you can't use. 1693 01:14:24,500 --> 01:14:27,240 In Dynamo DB, if you specify a pound in front of this, 1694 01:14:27,240 --> 01:14:29,310 you can define those names up above. 1695 01:14:29,310 --> 01:14:31,840 This is a referenced value. 1696 01:14:31,840 --> 01:14:34,880 It's probably not the best syntax to have up there for this discussion, 1697 01:14:34,880 --> 01:14:38,090 because it gets into some real-- I would have been talking more 1698 01:14:38,090 --> 01:14:41,360 about that at a deeper level. 1699 01:14:41,360 --> 01:14:46,130 >> But suffice to say, this could be a query or scan where the views-- 1700 01:14:46,130 --> 01:14:50,190 or pound views-- value is greater than 10. 1701 01:14:50,190 --> 01:14:54,660 It is a numerical value, yes. 1702 01:14:54,660 --> 01:14:57,322 If you want, we can talk about that after the discussion. 1703 01:14:57,322 --> 01:15:00,030 All right, so we're getting into some scenarios and best practices 1704 01:15:00,030 --> 01:15:02,000 where we're going to talk about some apps here. 1705 01:15:02,000 --> 01:15:03,810 What are the use cases for Dynamo DB? 1706 01:15:03,810 --> 01:15:06,120 What are the design patterns in Dynamo DB? 1707 01:15:06,120 --> 01:15:09,110 >> And the first one we're going to talk about is the internet of things. 1708 01:15:09,110 --> 01:15:15,010 So we get a lot of-- I guess, what is it-- more than 50% 1709 01:15:15,010 --> 01:15:19,370 of traffic on the internet these days is actually generated by machines, 1710 01:15:19,370 --> 01:15:21,930 automated processes, not by humans. 1711 01:15:21,930 --> 01:15:25,140 I mean, this thing that you carry around in your pocket, 1712 01:15:25,140 --> 01:15:28,840 how much data that thing is actually sending around without you 1713 01:15:28,840 --> 01:15:30,550 knowing it is absolutely amazing. 1714 01:15:30,550 --> 01:15:34,970 Your location, information about how fast you're going. 1715 01:15:34,970 --> 01:15:38,400 How do you think Google Maps works when they tell you what the traffic is? 1716 01:15:38,400 --> 01:15:41,275 It's because there are millions and millions of people driving around 1717 01:15:41,275 --> 01:15:44,667 with phones that are sending data all over the place all the time. 1718 01:15:44,667 --> 01:15:46,500 So one of the things about this type of data 1719 01:15:46,500 --> 01:15:50,980 that comes in, monitor data, log data, time series data, is that it's 1720 01:15:50,980 --> 01:15:53,540 usually only interesting for a little bit of time.
1721 01:15:53,540 --> 01:15:55,580 After that time, it's not so interesting. 1722 01:15:55,580 --> 01:15:58,390 So we talked about, don't let those tables grow without bounds. 1723 01:15:58,390 --> 01:16:03,410 The idea here is that maybe I've got 24 hours worth of events in my hot table. 1724 01:16:03,410 --> 01:16:06,160 And that hot table is going to be provisioned at a very high rate, 1725 01:16:06,160 --> 01:16:07,950 because it's taking a lot of data. 1726 01:16:07,950 --> 01:16:10,920 It's taking a lot of data in and I'm reading it a lot. 1727 01:16:10,920 --> 01:16:14,560 I've got a lot of operational queries running against that data. 1728 01:16:14,560 --> 01:16:18,120 >> After 24 hours, hey, you know what, I don't care. 1729 01:16:18,120 --> 01:16:21,150 So maybe every midnight I roll my table over to a new table 1730 01:16:21,150 --> 01:16:22,430 and I deprovision this table. 1731 01:16:22,430 --> 01:16:26,440 And I'll take the RCUs and WCUs down because 24 hours later 1732 01:16:26,440 --> 01:16:28,630 I'm not running as many queries against that data. 1733 01:16:28,630 --> 01:16:30,200 So I'm going to save money. 1734 01:16:30,200 --> 01:16:32,940 And maybe 30 days later I don't even need to care about it at all. 1735 01:16:32,940 --> 01:16:35,020 I could take the WCUs all the way down to one, 1736 01:16:35,020 --> 01:16:36,990 because you know what, it's never going to get written to. 1737 01:16:36,990 --> 01:16:38,300 The data is 30 days old. 1738 01:16:38,300 --> 01:16:40,000 It never changes. 1739 01:16:40,000 --> 01:16:44,200 >> And it's almost never going to get read, so let's just take that RCU down to 10. 1740 01:16:44,200 --> 01:16:49,372 And I'm saving a ton of money on this data, and only paying for my hot data. 1741 01:16:49,372 --> 01:16:52,330 So that's the important thing to look at when you look at time series 1742 01:16:52,330 --> 01:16:54,716 data coming in in volume. 1743 01:16:54,716 --> 01:16:55,590 These are strategies. 1744 01:16:55,590 --> 01:16:58,010 Now, I could just let it all go to the same table 1745 01:16:58,010 --> 01:16:59,461 and just let that table grow. 1746 01:16:59,461 --> 01:17:01,460 Eventually, I'm going to see performance issues. 1747 01:17:01,460 --> 01:17:04,060 I'm going to have to start to archive some of that data off the table, 1748 01:17:04,060 --> 01:17:04,720 and whatnot. 1749 01:17:04,720 --> 01:17:07,010 >> It's much better to design your application 1750 01:17:07,010 --> 01:17:08,900 so that you can operate this way from the start. 1751 01:17:08,900 --> 01:17:11,460 So it's just automatic in the application code. 1752 01:17:11,460 --> 01:17:13,580 At midnight every night it rolls the table. 1753 01:17:13,580 --> 01:17:17,170 Maybe what I need is a sliding window of 24 hours of data. 1754 01:17:17,170 --> 01:17:20,277 Then on a regular basis I'm culling data off the table. 1755 01:17:20,277 --> 01:17:22,360 I'm trimming it with a cron job and I'm putting it 1756 01:17:22,360 --> 01:17:24,160 onto these other tables, whatever you need. 1757 01:17:24,160 --> 01:17:25,940 So if a rollover works, that's great. 1758 01:17:25,940 --> 01:17:27,080 If not, trim it. 1759 01:17:27,080 --> 01:17:29,640 But let's keep that hot data away from your cold data. 1760 01:17:29,640 --> 01:17:32,535 It'll save you a lot of money and make your tables more performant. 1761 01:17:32,535 --> 01:17:35,960 1762 01:17:35,960 --> 01:17:38,210 So the next thing we'll talk about is product catalog.
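Before moving on to product catalog, here is a minimal sketch of that roll-and-deprovision step, assuming boto3; the table name and the capacity numbers are illustrative, not from the talk:

    import boto3

    client = boto3.client('dynamodb')

    # Yesterday's table is now cold: almost no writes, very few reads.
    # Dial its provisioned throughput down so you only pay for the hot table.
    client.update_table(
        TableName='events-2015-08-01',       # hypothetical rolled-over table
        ProvisionedThroughput={'ReadCapacityUnits': 10, 'WriteCapacityUnits': 1},
    )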
1763 01:17:38,210 --> 01:17:42,000 Product catalog is a pretty common use case. 1764 01:17:42,000 --> 01:17:46,600 This is actually a very common pattern that we'll see in a variety of things. 1765 01:17:46,600 --> 01:17:48,870 You know, Twitter for example, a hot tweet. 1766 01:17:48,870 --> 01:17:51,280 Everyone's coming and grabbing that tweet. 1767 01:17:51,280 --> 01:17:52,680 Product catalog, I got a sale. 1768 01:17:52,680 --> 01:17:54,120 I got a hot sale. 1769 01:17:54,120 --> 01:17:57,277 I got 70,000 requests per second coming for a product 1770 01:17:57,277 --> 01:17:58,860 description out of my product catalog. 1771 01:17:58,860 --> 01:18:02,384 We see this on the retail operation quite a bit. 1772 01:18:02,384 --> 01:18:03,550 So how do we deal with that? 1773 01:18:03,550 --> 01:18:04,924 There's no way around it. 1774 01:18:04,924 --> 01:18:07,110 All my users want to see the same piece of data. 1775 01:18:07,110 --> 01:18:09,410 They're coming in, concurrently. 1776 01:18:09,410 --> 01:18:11,920 And they're all making requests for the same piece of data. 1777 01:18:11,920 --> 01:18:16,240 This gives me that hot key, that big red stripe on my chart that we don't like. 1778 01:18:16,240 --> 01:18:17,720 And that's what that looks like. 1779 01:18:17,720 --> 01:18:22,290 So across my key space I'm getting hammered on the sale items. 1780 01:18:22,290 --> 01:18:24,070 I'm getting nothing anywhere else. 1781 01:18:24,070 --> 01:18:26,050 >> How do I alleviate this problem? 1782 01:18:26,050 --> 01:18:28,410 Well, we alleviate this with a cache. 1783 01:18:28,410 --> 01:18:33,630 With a cache, you basically put an in-memory partition in front of the database. 1784 01:18:33,630 --> 01:18:37,260 We have a managed [INAUDIBLE] cache, or you 1785 01:18:37,260 --> 01:18:40,260 can set up your own cache, [INAUDIBLE] cache, whatever you want. 1786 01:18:40,260 --> 01:18:42,220 Put that up in front of the database. 1787 01:18:42,220 --> 01:18:47,250 And that way you can store the data from those hot keys up in that cache 1788 01:18:47,250 --> 01:18:49,390 space and read through the cache. 1789 01:18:49,390 --> 01:18:51,962 >> And then most of your reads start looking like this. 1790 01:18:51,962 --> 01:18:54,920 I got all these cache hits up here and I got nothing going on down here 1791 01:18:54,920 --> 01:18:59,330 because the database is sitting behind the cache and the reads never come through. 1792 01:18:59,330 --> 01:19:02,520 If I change the data in the database, I have to update the cache. 1793 01:19:02,520 --> 01:19:04,360 We can use something like streams to do that. 1794 01:19:04,360 --> 01:19:07,360 And I'll explain how that works. 1795 01:19:07,360 --> 01:19:09,060 All right, messaging. 1796 01:19:09,060 --> 01:19:11,180 Email, we all use email. 1797 01:19:11,180 --> 01:19:12,540 >> This is a pretty good example. 1798 01:19:12,540 --> 01:19:14,950 We've got some sort of messages table. 1799 01:19:14,950 --> 01:19:17,040 And we got inbox and outbox. 1800 01:19:17,040 --> 01:19:19,760 This is what the SQL would look like to build that inbox. 1801 01:19:19,760 --> 01:19:23,350 We use the same kind of strategy, using GSIs-- GSIs 1802 01:19:23,350 --> 01:19:25,320 for my inbox and my outbox. 1803 01:19:25,320 --> 01:19:27,600 So I got raw messages coming into my messages table. 1804 01:19:27,600 --> 01:19:30,194 And the first approach to this might be, say, OK, no problem. 1805 01:19:30,194 --> 01:19:31,110 I've got raw messages.
1806 01:19:31,110 --> 01:19:33,710 Messages coming in [INAUDIBLE], message ID, that's great. 1807 01:19:33,710 --> 01:19:35,070 That's my unique hash. 1808 01:19:35,070 --> 01:19:38,280 I'm going to create two GSIs, one for my inbox, one for my outbox. 1809 01:19:38,280 --> 01:19:40,530 And the first thing I'll do is I'll say my hash key is 1810 01:19:40,530 --> 01:19:43,310 going to be the recipient and I'm going to range on the date. 1811 01:19:43,310 --> 01:19:44,220 This is fantastic. 1812 01:19:44,220 --> 01:19:45,890 I got my nice view here. 1813 01:19:45,890 --> 01:19:47,780 But there's a little issue here. 1814 01:19:47,780 --> 01:19:50,891 And you run into this in relational databases as well. 1815 01:19:50,891 --> 01:19:52,390 It's called vertical partitioning. 1816 01:19:52,390 --> 01:19:55,840 You want to keep your big data away from your little data. 1817 01:19:55,840 --> 01:20:00,470 >> And the reason why is because I gotta go read the items to get the attributes. 1818 01:20:00,470 --> 01:20:05,570 And if my message bodies are all on here, then even reading just a few items, 1819 01:20:05,570 --> 01:20:08,560 if my body length is averaging 256 kilobytes each, 1820 01:20:08,560 --> 01:20:10,991 the math gets pretty ugly. 1821 01:20:10,991 --> 01:20:12,490 So say I want to read David's inbox. 1822 01:20:12,490 --> 01:20:14,520 David's inbox has 50 items. 1823 01:20:14,520 --> 01:20:17,880 The average item size is 256 kilobytes. 1824 01:20:17,880 --> 01:20:21,730 My conversion ratio for RCUs is four kilobytes. 1825 01:20:21,730 --> 01:20:24,450 >> OK, let's go with eventually consistent reads. 1826 01:20:24,450 --> 01:20:28,640 I'm still eating 1,600 RCUs just to read David's inbox. 1827 01:20:28,640 --> 01:20:29,950 Ouch. 1828 01:20:29,950 --> 01:20:31,980 OK, now let's think about how the app works. 1829 01:20:31,980 --> 01:20:35,340 If I'm in an email app and I'm looking at my inbox, 1830 01:20:35,340 --> 01:20:39,680 am I looking at the body of every message? No, I'm looking at the summaries. 1831 01:20:39,680 --> 01:20:41,850 I'm looking at only the headers. 1832 01:20:41,850 --> 01:20:46,310 So let's build a table structure that looks more like that. 1833 01:20:46,310 --> 01:20:49,470 >> So here's the information that my workflow needs. 1834 01:20:49,470 --> 01:20:50,890 It's in my inbox GSI. 1835 01:20:50,890 --> 01:20:53,800 It's the date, the sender, the subject, and then 1836 01:20:53,800 --> 01:20:56,790 the message ID, which points back to the messages table 1837 01:20:56,790 --> 01:20:57,850 where I can get the body. 1838 01:20:57,850 --> 01:21:01,260 1839 01:21:01,260 --> 01:21:04,420 Well, these would be record IDs. 1840 01:21:04,420 --> 01:21:09,850 They would point back to the item IDs on the Dynamo DB table. 1841 01:21:09,850 --> 01:21:12,220 Every index always has the item 1842 01:21:12,220 --> 01:21:15,750 ID as part of what comes with the index. 1843 01:21:15,750 --> 01:21:17,414 >> All right. 1844 01:21:17,414 --> 01:21:19,080 AUDIENCE: It tells it where it's stored? 1845 01:21:19,080 --> 01:21:21,420 RICK HOULIHAN: Yes, it tells exactly-- that's exactly what it does. 1846 01:21:21,420 --> 01:21:22,644 It says here's my record. 1847 01:21:22,644 --> 01:21:24,310 And it'll point back to that record. 1848 01:21:24,310 --> 01:21:26,460 Exactly. 1849 01:21:26,460 --> 01:21:29,490 OK, so now my inbox is actually much smaller. 1850 01:21:29,490 --> 01:21:32,210 And this actually supports the workflow of an email app.
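A minimal sketch of reading that slimmed-down inbox, assuming boto3, a hypothetical Messages table keyed on message_id, and a GSI named inbox-gsi with recipient as its hash key, the date as its range key, and only the header fields projected (all names are illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource('dynamodb')
    messages = dynamodb.Table('Messages')    # hypothetical table name

    # Latest 50 headers for David: sender, subject, date, and the message ID
    # that points back at the full body in the main table.
    inbox = messages.query(
        IndexName='inbox-gsi',               # hypothetical GSI name
        KeyConditionExpression=Key('recipient').eq('david@example.com'),
        ScanIndexForward=False,              # newest first
        Limit=50,
    )

    # Only when the user opens a message do we pay to fetch the 256 KB body.
    first = inbox['Items'][0]
    body = messages.get_item(Key={'message_id': first['message_id']})['Item']['message_body']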
1851 01:21:32,210 --> 01:21:34,230 So my inbox, I click. 1852 01:21:34,230 --> 01:21:38,160 I go along and I click on the message, and that's when I need to go get the body, 1853 01:21:38,160 --> 01:21:40,180 because I'm going to go to a different view. 1854 01:21:40,180 --> 01:21:43,870 So if you think about an MVC type of framework, model view controller. 1855 01:21:43,870 --> 01:21:46,120 >> The model contains the data that the view needs 1856 01:21:46,120 --> 01:21:48,130 and the controller interacts with. 1857 01:21:48,130 --> 01:21:51,670 When I change the frame, when I change the perspective, 1858 01:21:51,670 --> 01:21:55,080 it's OK to go back to the server and repopulate the model, 1859 01:21:55,080 --> 01:21:56,860 because that's what the user expects. 1860 01:21:56,860 --> 01:22:00,530 When they change views, that's when we can go back to the database. 1861 01:22:00,530 --> 01:22:02,480 So email, click. 1862 01:22:02,480 --> 01:22:03,710 I'm looking for the body. 1863 01:22:03,710 --> 01:22:04,330 Round trip. 1864 01:22:04,330 --> 01:22:05,680 Go get the body. 1865 01:22:05,680 --> 01:22:06,950 >> I read a lot less data. 1866 01:22:06,950 --> 01:22:09,960 I'm only reading the bodies that David needs when he needs them. 1867 01:22:09,960 --> 01:22:14,230 And I'm not burning 1,600 RCUs just to show his inbox. 1868 01:22:14,230 --> 01:22:17,670 So now that-- this is the way that LSI or GSI-- I'm sorry, 1869 01:22:17,670 --> 01:22:19,900 GSI, would work out. 1870 01:22:19,900 --> 01:22:25,450 We've got our hash on the recipient. 1871 01:22:25,450 --> 01:22:27,030 We've got the range key on the date. 1872 01:22:27,030 --> 01:22:31,380 And we've got the projected attributes that we need only to support the view. 1873 01:22:31,380 --> 01:22:34,300 >> We rotate that for the outbox. 1874 01:22:34,300 --> 01:22:35,770 Hash on sender. 1875 01:22:35,770 --> 01:22:39,612 And in essence, we have a very nice, clean view. 1876 01:22:39,612 --> 01:22:41,570 And it's basically-- we have this nice messages 1877 01:22:41,570 --> 01:22:45,870 table that's being spread nicely because it's hash only, hashed on message ID. 1878 01:22:45,870 --> 01:22:51,750 And we have two indexes that are rotating off of that table. 1879 01:22:51,750 --> 01:22:57,411 All right, so the idea here is don't keep the big data and the small data 1880 01:22:57,411 --> 01:22:57,910 together. 1881 01:22:57,910 --> 01:23:00,700 Partition vertically, partition those tables. 1882 01:23:00,700 --> 01:23:03,150 Don't read data you don't have to. 1883 01:23:03,150 --> 01:23:04,850 All right, gaming. 1884 01:23:04,850 --> 01:23:06,990 We all like games. 1885 01:23:06,990 --> 01:23:10,902 At least I like games. 1886 01:23:10,902 --> 01:23:12,735 So some of the things that we deal with when 1887 01:23:12,735 --> 01:23:14,193 we're thinking about gaming, right? 1888 01:23:14,193 --> 01:23:16,999 Gaming these days, especially mobile gaming, is all about APIs. 1889 01:23:16,999 --> 01:23:19,540 And I'm going to rotate here a little bit away from DynamoDB. 1890 01:23:19,540 --> 01:23:21,373 I'm going to bring in some of the discussion 1891 01:23:21,373 --> 01:23:24,240 around some of the other AWS technologies. 1892 01:23:24,240 --> 01:23:28,930 >> But the idea about gaming is to think in terms of APIs, APIs that are, 1893 01:23:28,930 --> 01:23:31,730 generally speaking, HTTP and JSON. 1894 01:23:31,730 --> 01:23:34,550 It's how mobile games kind of interact with their back ends.
1895 01:23:34,550 --> 01:23:35,850 They do JSON posting. 1896 01:23:35,850 --> 01:23:40,660 They get data, and it's all, generally speaking, in nice JSON APIs. 1897 01:23:40,660 --> 01:23:44,950 >> Things like get friends, get the leaderboard, exchange data, 1898 01:23:44,950 --> 01:23:47,699 user generated content, push back up to the system, 1899 01:23:47,699 --> 01:23:49,740 these are the types of things that we're going to do. 1900 01:23:49,740 --> 01:23:52,542 Binary asset data, this data might not sit in the database. 1901 01:23:52,542 --> 01:23:54,250 This might sit in an object store, right? 1902 01:23:54,250 --> 01:23:56,541 But the database is going to end up telling the system, 1903 01:23:56,541 --> 01:23:59,140 telling the application where to go get it. 1904 01:23:59,140 --> 01:24:03,550 And inevitably, multiplayer servers, back end infrastructure, 1905 01:24:03,550 --> 01:24:06,180 all designed for high availability and scalability. 1906 01:24:06,180 --> 01:24:09,400 So these are things that we all want in the gaming infrastructure today. 1907 01:24:09,400 --> 01:24:12,160 >> So let's take a look at what that looks like. 1908 01:24:12,160 --> 01:24:16,070 Got a core back end, very straightforward. 1909 01:24:16,070 --> 01:24:19,880 We've got a system here with multiple availability zones. 1910 01:24:19,880 --> 01:24:23,780 We talked about AZs as being-- think of them as separate data centers. 1911 01:24:23,780 --> 01:24:26,040 More than one data center per AZ, but that's OK, 1912 01:24:26,040 --> 01:24:28,831 just think of them as separate data centers that are geographically 1913 01:24:28,831 --> 01:24:30,090 and fault isolated. 1914 01:24:30,090 --> 01:24:32,172 >> We're going to have a couple of EC2 instances. 1915 01:24:32,172 --> 01:24:33,880 We're going to have some back end server. 1916 01:24:33,880 --> 01:24:35,800 Maybe if you're a legacy architecture, we're 1917 01:24:35,800 --> 01:24:38,920 using what we call RDS, relational database services. 1918 01:24:38,920 --> 01:24:42,040 Could be MSSQL, MySQL, or something like that. 1919 01:24:42,040 --> 01:24:47,080 This is the way a lot of applications are designed today. 1920 01:24:47,080 --> 01:24:49,594 >> Where we might want to go with this is when we scale out. 1921 01:24:49,594 --> 01:24:51,510 We'll go ahead and put the S3 bucket up there. 1922 01:24:51,510 --> 01:24:54,200 And that S3 bucket, instead of serving up those objects from our servers-- 1923 01:24:54,200 --> 01:24:55,220 we could do that. 1924 01:24:55,220 --> 01:24:57,210 You put all your binary objects on your servers 1925 01:24:57,210 --> 01:24:59,751 and you can use those server instances to serve that data up. 1926 01:24:59,751 --> 01:25:01,860 But that's pretty expensive. 1927 01:25:01,860 --> 01:25:05,107 >> A better way to do it is to go ahead and put those objects in an S3 bucket. 1928 01:25:05,107 --> 01:25:06,315 S3 is an object repository. 1929 01:25:06,315 --> 01:25:10,860 It's built specifically for serving up these types of things. 1930 01:25:10,860 --> 01:25:13,690 And let those clients request directly from those object buckets, 1931 01:25:13,690 --> 01:25:15,390 offload the servers. 1932 01:25:15,390 --> 01:25:17,020 So we're starting to scale out here. 1933 01:25:17,020 --> 01:25:19,140 >> Now we've got users all over the world. 1934 01:25:19,140 --> 01:25:19,730 I got users. 1935 01:25:19,730 --> 01:25:23,380 I need to have content locally located close to these users, right?
1936 01:25:23,380 --> 01:25:26,200 I've created an S3 bucket as my source repository. 1937 01:25:26,200 --> 01:25:29,370 And I'll front that with a CloudFront distribution. 1938 01:25:29,370 --> 01:25:31,720 >> CloudFront is a CDN, a content delivery network. 1939 01:25:31,720 --> 01:25:35,750 Basically it takes data that you specify and caches it all over the internet 1940 01:25:35,750 --> 01:25:39,230 so users everywhere can have a very quick response when 1941 01:25:39,230 --> 01:25:40,960 they request those objects. 1942 01:25:40,960 --> 01:25:41,960 >> So you get an idea. 1943 01:25:41,960 --> 01:25:48,230 You're kind of leveraging all the aspects of AWS here to get this done. 1944 01:25:48,230 --> 01:25:50,790 And eventually, we throw in an auto scaling group. 1945 01:25:50,790 --> 01:25:52,737 So our EC2 instances, our game servers, 1946 01:25:52,737 --> 01:25:54,820 as they start to get busier and busier and busier, 1947 01:25:54,820 --> 01:25:57,236 they'll just spin another instance, spin another instance, 1948 01:25:57,236 --> 01:25:58,210 spin another instance. 1949 01:25:58,210 --> 01:26:02,090 So the technology AWS has allows you to specify the parameters 1950 01:26:02,090 --> 01:26:04,650 around which your servers will grow. 1951 01:26:04,650 --> 01:26:08,110 So you can have n number of servers out there at any given time. 1952 01:26:08,110 --> 01:26:11,870 And if your load goes away, they'll shrink, the number will shrink. 1953 01:26:11,870 --> 01:26:15,250 And if the load comes back, it'll grow back out, elastically. 1954 01:26:15,250 --> 01:26:17,050 >> So this looks great. 1955 01:26:17,050 --> 01:26:19,800 We've got a lot of EC2 instances. 1956 01:26:19,800 --> 01:26:21,671 We can put cache in front of the databases, 1957 01:26:21,671 --> 01:26:23,045 try and accelerate the databases. 1958 01:26:23,045 --> 01:26:25,030 The next pressure point typically people see 1959 01:26:25,030 --> 01:26:28,850 is when they scale a game using a relational database system. 1960 01:26:28,850 --> 01:26:30,790 Jeez, the database performance is terrible. 1961 01:26:30,790 --> 01:26:31,932 How do we improve that? 1962 01:26:31,932 --> 01:26:33,640 Let's try putting cache in front of that. 1963 01:26:33,640 --> 01:26:36,780 >> Well, cache doesn't work so great in games, right? 1964 01:26:36,780 --> 01:26:39,330 For games, writing is painful. 1965 01:26:39,330 --> 01:26:40,930 Games are very write heavy. 1966 01:26:40,930 --> 01:26:43,610 Cache doesn't work when you're write heavy because you've always 1967 01:26:43,610 --> 01:26:44,610 got to update the cache. 1968 01:26:44,610 --> 01:26:47,780 If you're always updating the cache, the caching is irrelevant. 1969 01:26:47,780 --> 01:26:49,780 It's actually just extra work. 1970 01:26:49,780 --> 01:26:51,970 >> So where do we go here? 1971 01:26:51,970 --> 01:26:54,400 You've got a big bottleneck down there in the database. 1972 01:26:54,400 --> 01:26:57,661 And the place to go obviously is partitioning. 1973 01:26:57,661 --> 01:26:59,410 Partitioning is not easy to do when you're 1974 01:26:59,410 --> 01:27:01,900 dealing with relational databases. 1975 01:27:01,900 --> 01:27:05,080 With relational databases, you're responsible for managing, effectively, 1976 01:27:05,080 --> 01:27:06,210 the key space. 1977 01:27:06,210 --> 01:27:10,527 You're saying users between A and M go here, between N and Z go there. 1978 01:27:10,527 --> 01:27:12,360 And you're switching across the application.
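A minimal sketch of what that application-tier routing ends up looking like on a relational stack; the host names are hypothetical, and this is exactly the bookkeeping the managed NoSQL store takes off your hands:

    # Hand-rolled sharding: the application owns the key space.
    SHARDS = {
        ('a', 'm'): 'users-db-1.example.internal',
        ('n', 'z'): 'users-db-2.example.internal',
    }

    def shard_for(username: str) -> str:
        first = username[0].lower()
        for (lo, hi), host in SHARDS.items():
            if lo <= first <= hi:
                return host
        raise ValueError('no shard covers ' + username)

    # Every query now has to be routed, every cross-shard operation has to be
    # stitched together by hand, and re-balancing the ranges means migrating
    # data yourself.
    print(shard_for('Rick'))   # users-db-2.example.internal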
1979 01:27:12,360 --> 01:27:15,000 So you're dealing with this partition data source. 1980 01:27:15,000 --> 01:27:18,670 You have transactional constraints that don't span partitions. 1981 01:27:18,670 --> 01:27:20,560 You've got all kinds of messiness that you're 1982 01:27:20,560 --> 01:27:23,040 dealing with down there trying to deal with scaling out 1983 01:27:23,040 --> 01:27:25,120 and building a larger infrastructure. 1984 01:27:25,120 --> 01:27:27,284 It's just no fun. 1985 01:27:27,284 --> 01:27:30,930 >> AUDIENCE: So are you saying that increasing source points speeds up 1986 01:27:30,930 --> 01:27:31,430 the process? 1987 01:27:31,430 --> 01:27:32,513 RICK HOULIHAN: Increasing? 1988 01:27:32,513 --> 01:27:33,520 AUDIENCE: Source points. 1989 01:27:33,520 --> 01:27:34,410 RICK HOULIHAN: Source points? 1990 01:27:34,410 --> 01:27:37,500 AUDIENCE: From the information, where the information is coming from? 1991 01:27:37,500 --> 01:27:38,250 RICK HOULIHAN: No. 1992 01:27:38,250 --> 01:27:41,820 What I'm saying is increasing the number of partitions in the data store 1993 01:27:41,820 --> 01:27:44,060 improves throughput. 1994 01:27:44,060 --> 01:27:48,300 So what's happening here is users coming into the EC2 instance up here, 1995 01:27:48,300 --> 01:27:50,780 well, if I need a user that's A to M, I'll go here. 1996 01:27:50,780 --> 01:27:53,560 From N to p, I'll go here. 1997 01:27:53,560 --> 01:27:55,060 From P to Z, I'll go here. 1998 01:27:55,060 --> 01:27:57,120 >> AUDIENCE: OK, those so those are all stored in different nodes? 1999 01:27:57,120 --> 01:27:57,911 >> RICK HOULIHAN: Yes. 2000 01:27:57,911 --> 01:28:00,210 Think of these as different silos of data. 2001 01:28:00,210 --> 01:28:01,660 So you're having to do this. 2002 01:28:01,660 --> 01:28:02,910 If you're trying to do this, if you're trying 2003 01:28:02,910 --> 01:28:05,730 to scale on a relational platform, this is what you're doing. 2004 01:28:05,730 --> 01:28:08,100 You're taking data and you're cutting it down. 2005 01:28:08,100 --> 01:28:10,975 And you're partitioning it across multiple instances of the database. 2006 01:28:10,975 --> 01:28:13,580 And you're managing all that at the application tier. 2007 01:28:13,580 --> 01:28:14,729 It's no fun. 2008 01:28:14,729 --> 01:28:15,770 So what do we want to go? 2009 01:28:15,770 --> 01:28:20,240 We want to go DynamoDB, fully managed, NoSQL data store, provision throughput. 2010 01:28:20,240 --> 01:28:22,680 We use secondary indexes. 2011 01:28:22,680 --> 01:28:26,154 It's basically HTTP API and includes document support. 2012 01:28:26,154 --> 01:28:28,570 So you don't have to worry about any of that partitioning. 2013 01:28:28,570 --> 01:28:30,740 We do it all for you. 2014 01:28:30,740 --> 01:28:33,260 So now, instead, you just write to the table. 2015 01:28:33,260 --> 01:28:36,490 If the table needs to be partitioned, that happens behind the scenes. 2016 01:28:36,490 --> 01:28:40,642 You're completely insulated from that as a developer. 2017 01:28:40,642 --> 01:28:42,350 So let's talk about some of the use cases 2018 01:28:42,350 --> 01:28:47,564 that we run into in gaming, common gaming scenarios, leaderboard. 2019 01:28:47,564 --> 01:28:49,980 So you've got users coming in, the BoardNames that they're 2020 01:28:49,980 --> 01:28:52,930 on, the scores for this user. 2021 01:28:52,930 --> 01:28:57,700 We might be hashing on the UserID, and then we have range on the game. 
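Here is a minimal sketch of that base-table read, assuming boto3 and a hypothetical UserGames table hashed on the user ID and ranged on the game name (attribute names are illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource('dynamodb')
    user_games = dynamodb.Table('UserGames')   # hypothetical table name

    # One user's personal leaderboard: every game they've played, with their top score.
    resp = user_games.query(KeyConditionExpression=Key('UserId').eq('user-123'))
    for item in resp['Items']:
        print(item['BoardName'], item['TopScore'])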
2022 01:28:57,700 --> 01:28:59,960 So every user wants to see all the game he's played 2023 01:28:59,960 --> 01:29:01,770 and all his top score across all the game. 2024 01:29:01,770 --> 01:29:04,000 So that's his personal leaderboard. 2025 01:29:04,000 --> 01:29:10,010 >> Now I want to go in and I want to get-- so I get these personal leaderboards. 2026 01:29:10,010 --> 01:29:12,827 What I want to do is go get the top score across all users. 2027 01:29:12,827 --> 01:29:13,660 So how do I do that? 2028 01:29:13,660 --> 01:29:18,070 When my record is hashed on the UserID, ranged on the game, 2029 01:29:18,070 --> 01:29:20,740 well I'm going to go ahead and restructure, create a GSI, 2030 01:29:20,740 --> 01:29:22,370 and I'm going to restructure that data. 2031 01:29:22,370 --> 01:29:27,310 >> Now I'm going to hash on the BoardName, which is the game. 2032 01:29:27,310 --> 01:29:29,800 And I'm going to range on the top score. 2033 01:29:29,800 --> 01:29:31,540 And now I've created different buckets. 2034 01:29:31,540 --> 01:29:34,790 I'm using the same table, the same item data. 2035 01:29:34,790 --> 01:29:39,870 But I'm creating a bucket that gives me an aggregation of top score by game. 2036 01:29:39,870 --> 01:29:43,180 >> And I can query that table to get that information. 2037 01:29:43,180 --> 01:29:50,890 So I've set that query pattern up to be supported by a secondary index. 2038 01:29:50,890 --> 01:29:54,556 Now they can be sorted by BoardName and sorted by TopScore, depending on. 2039 01:29:54,556 --> 01:29:57,180 So you can see, these are types of use cases you get in gaming. 2040 01:29:57,180 --> 01:30:02,190 Another good use case we get in gaming is awards and who's won the awards. 2041 01:30:02,190 --> 01:30:05,340 And this is a great use case where we call sparse indexes. 2042 01:30:05,340 --> 01:30:07,340 Sparse indexes are the ability to generate 2043 01:30:07,340 --> 01:30:10,850 an index that doesn't necessarily contain every single item on the table. 2044 01:30:10,850 --> 01:30:11,470 And why not? 2045 01:30:11,470 --> 01:30:14,540 Because the attribute that's being indexed doesn't exist on every item. 2046 01:30:14,540 --> 01:30:16,460 >> So in this particular use case, I'm saying, 2047 01:30:16,460 --> 01:30:19,240 you know what, I'm going to create an attribute called Award. 2048 01:30:19,240 --> 01:30:22,970 And I'm going to give every user that has an award that attribute. 2049 01:30:22,970 --> 01:30:25,950 Users that don't have awards are not going to have that attribute. 2050 01:30:25,950 --> 01:30:27,800 So when I create the index, the only users 2051 01:30:27,800 --> 01:30:28,960 that are going to show up in the index are 2052 01:30:28,960 --> 01:30:31,050 the ones that actually have won awards. 2053 01:30:31,050 --> 01:30:34,440 So that's a great way to be able to create filtered indexes that 2054 01:30:34,440 --> 01:30:40,580 are very, very selective that don't have to index the entire table. 2055 01:30:40,580 --> 01:30:43,050 >> So we're getting low on time here. 2056 01:30:43,050 --> 01:30:49,190 I'm going to go ahead and skip out and skip this scenario. 2057 01:30:49,190 --> 01:30:52,625 Talk a little bit about-- 2058 01:30:52,625 --> 01:30:54,460 >> AUDIENCE: Can I ask a quick question? 2059 01:30:54,460 --> 01:30:56,722 One is write heavy? 2060 01:30:56,722 --> 01:30:57,680 RICK HOULIHAN: What is? 2061 01:30:57,680 --> 01:30:58,596 AUDIENCE: Write heavy. 2062 01:30:58,596 --> 01:31:01,270 RICK HOULIHAN: Write heavy. 
2063 01:31:01,270 --> 01:31:03,460 Let me see. 2064 01:31:03,460 --> 01:31:06,220 >> AUDIENCE: Or is that not something you can just 2065 01:31:06,220 --> 01:31:08,809 voice to in a matter of seconds? 2066 01:31:08,809 --> 01:31:10,850 RICK HOULIHAN: We go through the voting scenario. 2067 01:31:10,850 --> 01:31:11,670 It's not that bad. 2068 01:31:11,670 --> 01:31:14,580 Do you guys have a few minutes? 2069 01:31:14,580 --> 01:31:15,860 OK. 2070 01:31:15,860 --> 01:31:17,890 >> So we'll talk about voting. 2071 01:31:17,890 --> 01:31:20,250 So real time voting, we have requirements for voting. 2072 01:31:20,250 --> 01:31:25,250 Requirements are that we allow each person to vote only once. 2073 01:31:25,250 --> 01:31:28,060 We want nobody to be able to change their vote. 2074 01:31:28,060 --> 01:31:31,045 We want real-time aggregation and analytics for demographics 2075 01:31:31,045 --> 01:31:34,210 that we're going to be showing to users on the site. 2076 01:31:34,210 --> 01:31:35,200 >> Think of this scenario. 2077 01:31:35,200 --> 01:31:37,550 We work a lot of reality TV shows where they're 2078 01:31:37,550 --> 01:31:38,960 doing these exact type of things. 2079 01:31:38,960 --> 01:31:41,584 So you can think of the scenario, we have millions and millions 2080 01:31:41,584 --> 01:31:43,959 of teenage girls there with their cell phones 2081 01:31:43,959 --> 01:31:46,250 and voting, and voting, and voting for whoever they are 2082 01:31:46,250 --> 01:31:48,610 find to be the most popular. 2083 01:31:48,610 --> 01:31:50,830 So these are some of the requirements we run out. 2084 01:31:50,830 --> 01:31:52,990 >> And so the first take in solving this problem 2085 01:31:52,990 --> 01:31:55,090 would be to build a very simple application. 2086 01:31:55,090 --> 01:31:56,490 So I've got this app. 2087 01:31:56,490 --> 01:31:57,950 I have some voters out there. 2088 01:31:57,950 --> 01:31:59,980 They come in, they hit the voting app. 2089 01:31:59,980 --> 01:32:03,440 I've got some raw votes table I'll just dump those votes into. 2090 01:32:03,440 --> 01:32:05,780 I'll have some aggregate votes table that 2091 01:32:05,780 --> 01:32:09,490 will do my analytics and demographics, and we'll put all this in there. 2092 01:32:09,490 --> 01:32:11,420 >> And this is great. 2093 01:32:11,420 --> 01:32:12,332 Life is good. 2094 01:32:12,332 --> 01:32:15,040 Life's good until we find out that there's always only one or two 2095 01:32:15,040 --> 01:32:16,879 people that are popular in an election. 2096 01:32:16,879 --> 01:32:19,420 There's only one or two things that people really care about. 2097 01:32:19,420 --> 01:32:22,340 And if you're voting at scale, all of a sudden I'm 2098 01:32:22,340 --> 01:32:26,360 going to be hammering the hell out of two candidates, one or two candidates. 2099 01:32:26,360 --> 01:32:29,390 A very limited number of items people find to be popular. 2100 01:32:29,390 --> 01:32:31,710 >> This is not a good design pattern. 2101 01:32:31,710 --> 01:32:33,549 This is actually a very bad design pattern 2102 01:32:33,549 --> 01:32:36,340 because it creates exactly what we talked about which was hot keys. 2103 01:32:36,340 --> 01:32:38,960 Hot keys are something we don't like. 2104 01:32:38,960 --> 01:32:40,470 >> So how do we fix that? 
2105 01:32:40,470 --> 01:32:47,640 And really, the way to fix this is by taking those candidate buckets 2106 01:32:47,640 --> 01:32:51,490 and for each candidate we have, we're going to append a random value, 2107 01:32:51,490 --> 01:32:54,192 something that we know, random value between one and 100, 2108 01:32:54,192 --> 01:32:56,620 between 100 and 1,000, or between one and 1,000, 2109 01:32:56,620 --> 01:32:59,940 however many random values you want to append onto the end of that candidate. 2110 01:32:59,940 --> 01:33:01,330 >> And what have I really done then? 2111 01:33:01,330 --> 01:33:05,830 If I'm using the candidate ID as the bucket to aggregate votes, 2112 01:33:05,830 --> 01:33:08,780 if I've added a random number to the end of that, 2113 01:33:08,780 --> 01:33:12,000 I've created now 10 buckets, a hundred buckets, a thousand buckets 2114 01:33:12,000 --> 01:33:14,160 that I'm aggregating votes across. 2115 01:33:14,160 --> 01:33:18,030 >> So I have millions, and millions, and millions of records coming in 2116 01:33:18,030 --> 01:33:22,050 for these candidates, I am now spreading those votes across Candidate A_1 2117 01:33:22,050 --> 01:33:24,630 through Candidate A_100, because every time a vote comes in, 2118 01:33:24,630 --> 01:33:26,530 I'm generating a random value between one and 100. 2119 01:33:26,530 --> 01:33:29,446 I'm tacking it onto the end of the candidate that person's voting for. 2120 01:33:29,446 --> 01:33:31,120 I'm dumping it into that bucket. 2121 01:33:31,120 --> 01:33:33,910 >> Now on the backside, I know that I got a hundred buckets. 2122 01:33:33,910 --> 01:33:36,350 So when I want to go ahead and aggregate the votes, 2123 01:33:36,350 --> 01:33:38,244 I read from all those buckets. 2124 01:33:38,244 --> 01:33:39,160 So I go ahead and add. 2125 01:33:39,160 --> 01:33:42,410 And then I do the scatter gather where I go out and say hey, 2126 01:33:42,410 --> 01:33:45,399 you know what, this candidate's key spaces is over a hundred buckets. 2127 01:33:45,399 --> 01:33:47,940 I'm going to gather all the votes from those hundred buckets. 2128 01:33:47,940 --> 01:33:49,981 I'm going to aggregate them and I'm going to say, 2129 01:33:49,981 --> 01:33:53,830 Candidate A now has total vote count of x. 2130 01:33:53,830 --> 01:33:55,690 >> Now both the write query and the read query 2131 01:33:55,690 --> 01:33:58,160 are nicely distributed because I'm writing across 2132 01:33:58,160 --> 01:34:00,320 and I'm reading across hundreds of keys. 2133 01:34:00,320 --> 01:34:03,500 I'm not writing and reading across one key now. 2134 01:34:03,500 --> 01:34:04,950 So that's a great pattern. 2135 01:34:04,950 --> 01:34:08,090 >> This is actually probably one of the most important design 2136 01:34:08,090 --> 01:34:10,420 patterns for scale in NoSQL. 2137 01:34:10,420 --> 01:34:14,470 You will see this type of design pattern in every flavor. 2138 01:34:14,470 --> 01:34:19,100 MongoDB, DynamoDB, it doesn't matter, we all have to do this. 2139 01:34:19,100 --> 01:34:21,840 Because when you're dealing with those huge aggregations, 2140 01:34:21,840 --> 01:34:26,650 you have to figure out a way to spread them out across buckets. 2141 01:34:26,650 --> 01:34:29,512 So this is the way you do that. 2142 01:34:29,512 --> 01:34:31,220 All right, so what you're doing right now 2143 01:34:31,220 --> 01:34:35,252 is you're trading off read cost for write scalability. 
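Here is a minimal sketch of that write-sharded voting pattern, assuming boto3 and hypothetical RawVotes and VoteCounts tables; the suffix count of 100 is just the example from the talk:

    import random
    import boto3

    dynamodb = boto3.resource('dynamodb')
    raw_votes = dynamodb.Table('RawVotes')        # hypothetical: de-dupe on voter_id
    vote_counts = dynamodb.Table('VoteCounts')    # hypothetical: sharded counters

    def cast_vote(voter_id, candidate):
        # One vote per person: the write fails if this voter_id already exists.
        raw_votes.put_item(
            Item={'voter_id': voter_id, 'candidate': candidate},
            ConditionExpression='attribute_not_exists(voter_id)',
        )
        # Spread the hot aggregate across 100 buckets: Candidate_A_1 .. Candidate_A_100.
        bucket = '{}_{}'.format(candidate, random.randint(1, 100))
        vote_counts.update_item(
            Key={'candidate_bucket': bucket},
            UpdateExpression='ADD vote_count :one',
            ExpressionAttributeValues={':one': 1},
        )

    def total_votes(candidate):
        # Scatter-gather: read all 100 buckets and sum them.
        total = 0
        for i in range(1, 101):
            item = vote_counts.get_item(
                Key={'candidate_bucket': '{}_{}'.format(candidate, i)}
            ).get('Item')
            total += item['vote_count'] if item else 0
        return total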
2144 01:34:35,252 --> 01:34:37,085 The cost of my read is a little more complex 2145 01:34:37,085 --> 01:34:40,220 and I have to go read from a hundred buckets instead of one. 2146 01:34:40,220 --> 01:34:41,310 But I'm able to write. 2147 01:34:41,310 --> 01:34:44,860 And my throughput, my write throughput, is incredible. 2148 01:34:44,860 --> 01:34:49,450 So it's usually a valuable technique for scaling DynamoDB, 2149 01:34:49,450 --> 01:34:51,350 or any NoSQL database for that matter. 2150 01:34:51,350 --> 01:34:53,824 2151 01:34:53,824 --> 01:34:55,240 So we figured out how to scale it. 2152 01:34:55,240 --> 01:34:56,930 And we figured out how to eliminate our hot keys. 2153 01:34:56,930 --> 01:34:57,820 And this is fantastic. 2154 01:34:57,820 --> 01:34:58,960 And we got this nice system. 2155 01:34:58,960 --> 01:35:02,043 And it's given us very correct voting because we de-dupe each recorded vote. 2156 01:35:02,043 --> 01:35:03,130 It's built into DynamoDB. 2157 01:35:03,130 --> 01:35:05,380 We talked about conditional writes. 2158 01:35:05,380 --> 01:35:08,170 >> When a voter comes in, puts an insert on the table, 2159 01:35:08,170 --> 01:35:11,220 they insert with their voter ID, and if they try to insert another vote, 2160 01:35:11,220 --> 01:35:13,320 I do a conditional write. 2161 01:35:13,320 --> 01:35:16,960 Say only write this if this doesn't exist. 2162 01:35:16,960 --> 01:35:19,270 So as soon as I see that that vote's hit the table, 2163 01:35:19,270 --> 01:35:20,460 nobody else is going to be able to put their vote in. 2164 01:35:20,460 --> 01:35:21,634 And that's fantastic. 2165 01:35:21,634 --> 01:35:23,550 And we're incrementing our candidate counters. 2166 01:35:23,550 --> 01:35:25,466 And we're doing our demographics and all that. 2167 01:35:25,466 --> 01:35:29,110 But what happens if my application falls over? 2168 01:35:29,110 --> 01:35:31,350 Now all of a sudden votes are coming in, and I 2169 01:35:31,350 --> 01:35:34,840 don't know if they're getting processed into my analytics and demographics 2170 01:35:34,840 --> 01:35:36,040 anymore. 2171 01:35:36,040 --> 01:35:38,462 And when the application comes back up, how 2172 01:35:38,462 --> 01:35:41,420 the hell do I know what votes have been processed and where do I start? 2173 01:35:41,420 --> 01:35:44,530 >> So this is a real problem when you start to look at this type of scenario. 2174 01:35:44,530 --> 01:35:45,571 And how do we solve that? 2175 01:35:45,571 --> 01:35:48,070 We solve it with what we call DynamoDB Streams. 2176 01:35:48,070 --> 01:35:53,470 Streams is a time ordered and partitioned change log of every access 2177 01:35:53,470 --> 01:35:55,700 to the table, every write access to the table. 2178 01:35:55,700 --> 01:35:58,810 Any data that's written to the table shows up on the stream. 2179 01:35:58,810 --> 01:36:01,815 >> It's basically a 24 hour queue. 2180 01:36:01,815 --> 01:36:03,690 Items hit the stream, they live for 24 hours. 2181 01:36:03,690 --> 01:36:05,990 They can be read multiple times. 2182 01:36:05,990 --> 01:36:09,400 Guaranteed to be delivered only once to the stream, 2183 01:36:09,400 --> 01:36:11,180 could be read n number of times. 2184 01:36:11,180 --> 01:36:14,910 So however many processes you want to consume that data, you can consume it. 2185 01:36:14,910 --> 01:36:16,350 It will appear on every update. 2186 01:36:16,350 --> 01:36:18,455 Every write will only appear once on the stream.
2187 01:36:18,455 --> 01:36:20,621 So you don't have to worry about processing it twice 2188 01:36:20,621 --> 01:36:22,500 from the same process. 2189 01:36:22,500 --> 01:36:25,350 >> It's strictly ordered per item. 2190 01:36:25,350 --> 01:36:28,180 When we say time ordered and partitioned, 2191 01:36:28,180 --> 01:36:30,680 you'll see it per partition on the stream. 2192 01:36:30,680 --> 01:36:33,169 You will see items, updates in order. 2193 01:36:33,169 --> 01:36:35,210 We are not guaranteeing on the stream that you're 2194 01:36:35,210 --> 01:36:40,240 going to get every transaction in order across items. 2195 01:36:40,240 --> 01:36:42,440 >> So streams are idempotent. 2196 01:36:42,440 --> 01:36:44,037 Do we all know what idempotent means? 2197 01:36:44,037 --> 01:36:46,620 Idempotent means you can do it over, and over, and over again. 2198 01:36:46,620 --> 01:36:48,200 The result's going to be the same. 2199 01:36:48,200 --> 01:36:49,991 >> Streams are idempotent, but they have to be 2200 01:36:49,991 --> 01:36:54,860 played from the starting point, wherever you choose, to the end, 2201 01:36:54,860 --> 01:36:57,950 or they will not result in the same values. 2202 01:36:57,950 --> 01:36:59,727 >> Same thing with MongoDB. 2203 01:36:59,727 --> 01:37:01,560 MongoDB has a construct they call the oplog. 2204 01:37:01,560 --> 01:37:04,140 It is the exact same construct. 2205 01:37:04,140 --> 01:37:06,500 Many NoSQL databases have this construct. 2206 01:37:06,500 --> 01:37:08,790 They use it to do things like replication, which 2207 01:37:08,790 --> 01:37:10,475 is exactly what we do with streams. 2208 01:37:10,475 --> 01:37:12,350 AUDIENCE: Maybe a heretical question, but you 2209 01:37:12,350 --> 01:37:13,975 talk about apps going down and so forth. 2210 01:37:13,975 --> 01:37:16,089 Are streams guaranteed to never possibly go down? 2211 01:37:16,089 --> 01:37:18,630 RICK HOULIHAN: Yeah, streams are guaranteed to never go down. 2212 01:37:18,630 --> 01:37:21,040 We manage the infrastructure behind it. Streams automatically 2213 01:37:21,040 --> 01:37:22,498 deploy in their auto scaling group. 2214 01:37:22,498 --> 01:37:25,910 We'll go through a little bit about what happens. 2215 01:37:25,910 --> 01:37:30,060 >> I shouldn't say they're guaranteed to never go down. 2216 01:37:30,060 --> 01:37:33,110 The elements are guaranteed to appear in the stream. 2217 01:37:33,110 --> 01:37:36,740 And the stream will be accessible. 2218 01:37:36,740 --> 01:37:40,580 So what goes down or comes back up, that happens underneath. 2219 01:37:40,580 --> 01:37:43,844 It covers-- it's OK. 2220 01:37:43,844 --> 01:37:46,260 All right, so you get different view types off the stream. 2221 01:37:46,260 --> 01:37:51,040 The view types that are important to a programmer typically are, what was it? 2222 01:37:51,040 --> 01:37:52,370 I get the old view. 2223 01:37:52,370 --> 01:37:55,630 When an update hits the table, it'll push the old view to the stream 2224 01:37:55,630 --> 01:38:02,070 so the data can be archived, or used for change control, change identification, change 2225 01:38:02,070 --> 01:38:03,600 management. 2226 01:38:03,600 --> 01:38:07,160 >> The new image, what it is now after the update, that's another type of view 2227 01:38:07,160 --> 01:38:07,660 you can get. 2228 01:38:07,660 --> 01:38:09,660 You can get both the old and new images. 2229 01:38:09,660 --> 01:38:10,660 Maybe I want them both. 2230 01:38:10,660 --> 01:38:11,790 I want to see what it was.
2231 01:38:11,790 --> 01:38:13,290 I want to see what it changed to. 2232 01:38:13,290 --> 01:38:15,340 >> I have a compliance type of process that runs. 2233 01:38:15,340 --> 01:38:17,430 It needs to verify that when these things change, 2234 01:38:17,430 --> 01:38:21,840 that they're within certain limits or within certain parameters. 2235 01:38:21,840 --> 01:38:23,840 >> And then maybe I only need to know what changed. 2236 01:38:23,840 --> 01:38:26,240 I don't care what item changed. 2237 01:38:26,240 --> 01:38:28,580 I don't need to know what attributes changed. 2238 01:38:28,580 --> 01:38:30,882 I just need to know that the items are being touched. 2239 01:38:30,882 --> 01:38:33,340 So these are the types of views that you get off the stream 2240 01:38:33,340 --> 01:38:35,960 and you can interact with. 2241 01:38:35,960 --> 01:38:37,840 >> The application that consumes the stream, 2242 01:38:37,840 --> 01:38:39,298 this is kind of the way this works. 2243 01:38:39,298 --> 01:38:42,570 DynamoDB clients push data to the tables. 2244 01:38:42,570 --> 01:38:44,750 Streams deploy on what we call shards. 2245 01:38:44,750 --> 01:38:47,380 Shards are scaled independently of the table. 2246 01:38:47,380 --> 01:38:50,660 They don't line up completely to the partitions of your table. 2247 01:38:50,660 --> 01:38:52,540 And the reason why is because they line up 2248 01:38:52,540 --> 01:38:55,430 to the capacity, the current capacity of the table. 2249 01:38:55,430 --> 01:38:57,600 >> They deploy in their own auto scaling group, 2250 01:38:57,600 --> 01:39:00,800 and they start to spin out depending on how many writes are coming in, 2251 01:39:00,800 --> 01:39:03,090 how many reads-- really it's writes. 2252 01:39:03,090 --> 01:39:05,820 There's no reads-- but how many writes are coming in. 2253 01:39:05,820 --> 01:39:08,200 >> And then on the back end, we have what we 2254 01:39:08,200 --> 01:39:11,390 call a KCL, or Kinesis Client Library. 2255 01:39:11,390 --> 01:39:19,190 Kinesis is a stream data processing technology from Amazon. 2256 01:39:19,190 --> 01:39:22,040 And streams is built on that. 2257 01:39:22,040 --> 01:39:25,670 >> So you use a KCL enabled application to read the stream. 2258 01:39:25,670 --> 01:39:28,752 The Kinesis Client Library actually manages the workers for you. 2259 01:39:28,752 --> 01:39:30,460 And it also does some interesting things. 2260 01:39:30,460 --> 01:39:35,630 It will create some tables up in your DynamoDB tablespace 2261 01:39:35,630 --> 01:39:38,410 to track which items have been processed. 2262 01:39:38,410 --> 01:39:41,190 So this way, if it falls over and comes back and gets 2263 01:39:41,190 --> 01:39:45,570 stood back up, it can determine where it was in processing the stream. 2264 01:39:45,570 --> 01:39:48,360 >> That's very important when you're talking about replication. 2265 01:39:48,360 --> 01:39:50,350 I need to know what data has been processed 2266 01:39:50,350 --> 01:39:52,810 and what data has yet to be processed. 2267 01:39:52,810 --> 01:39:57,380 So the KCL library for streams will give you a lot of that functionality. 2268 01:39:57,380 --> 01:39:58,990 It takes care of all the housekeeping. 2269 01:39:58,990 --> 01:40:01,140 It stands up a worker for every shard. 2270 01:40:01,140 --> 01:40:04,620 It creates an administrative table for every shard, for every worker.
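[EDITOR'S NOTE: The KCL hides the plumbing, but underneath it is a plain streams API. A rough sketch of reading one pass over the shards directly with boto3 is below; the stream ARN is hypothetical, and checkpointing is omitted because that is exactly the housekeeping the KCL does for you.]

```python
import boto3

streams = boto3.client("dynamodbstreams")

# Assumes the table's stream is already enabled (for example with the
# NEW_AND_OLD_IMAGES view type) and that we know its ARN.
stream_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/Votes/stream/2015-01-01T00:00:00.000"  # hypothetical

desc = streams.describe_stream(StreamArn=stream_arn)
for shard in desc["StreamDescription"]["Shards"]:
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest record still in the shard
    )["ShardIterator"]
    while iterator:
        page = streams.get_records(ShardIterator=iterator)
        for record in page["Records"]:
            # Each record carries the event name plus whatever images the view type provides.
            print(record["eventName"], record["dynamodb"].get("NewImage"))
        iterator = page.get("NextShardIterator")
        if not page["Records"]:
            break  # caught up on this shard; a real worker keeps polling and checkpoints its position
```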
2271 01:40:04,620 --> 01:40:07,560 And as those workers fire, they maintain those tables 2272 01:40:07,560 --> 01:40:10,510 so you know this record was read and processed. 2273 01:40:10,510 --> 01:40:13,850 And then that way if the process dies and comes back online, 2274 01:40:13,850 --> 01:40:17,940 it can resume right where it left off. 2275 01:40:17,940 --> 01:40:20,850 >> So we use this for cross-region replication. 2276 01:40:20,850 --> 01:40:24,680 A lot of customers have the need to move data or parts of their data tables 2277 01:40:24,680 --> 01:40:25,920 around to different regions. 2278 01:40:25,920 --> 01:40:29,230 There are nine regions all around the world. 2279 01:40:29,230 --> 01:40:32,100 So there might be a need-- I might have users in Asia, users 2280 01:40:32,100 --> 01:40:34,150 in the East Coast of the United States. 2281 01:40:34,150 --> 01:40:38,980 They have different data that needs to be locally distributed. 2282 01:40:38,980 --> 01:40:42,510 And maybe a user flies from Asia over to the United States, 2283 01:40:42,510 --> 01:40:45,020 and I want to replicate his data with him. 2284 01:40:45,020 --> 01:40:49,340 So when he gets off the plane, he has a good experience using his mobile app. 2285 01:40:49,340 --> 01:40:52,360 >> You can use the cross-region replication library to do this. 2286 01:40:52,360 --> 01:40:55,730 Basically we have provided two technologies. 2287 01:40:55,730 --> 01:40:59,400 One's a console application you can stand up on your own EC2 instance. 2288 01:40:59,400 --> 01:41:01,240 It runs pure replication. 2289 01:41:01,240 --> 01:41:02,720 And then we gave you the library. 2290 01:41:02,720 --> 01:41:06,070 The library you can use to build your own application if you 2291 01:41:06,070 --> 01:41:10,740 want to do crazy things with that data-- filter, replicate only part of it, 2292 01:41:10,740 --> 01:41:14,120 rotate the data, move it into a different table, so on and so forth. 2293 01:41:14,120 --> 01:41:18,700 2294 01:41:18,700 --> 01:41:20,520 So that's kind of what that looks like. 2295 01:41:20,520 --> 01:41:23,690 >> DynamoDB Streams can be processed by what we call Lambda. 2296 01:41:23,690 --> 01:41:27,394 We mentioned a little bit about event driven application architectures. 2297 01:41:27,394 --> 01:41:28,810 Lambda is a key component of that. 2298 01:41:28,810 --> 01:41:32,840 Lambda is code that fires on demand in response to a particular event. 2299 01:41:32,840 --> 01:41:36,020 One of those events could be a record appearing on the stream. 2300 01:41:36,020 --> 01:41:39,100 If a record appears on the stream, we'll call this Java function. 2301 01:41:39,100 --> 01:41:44,980 Well, this is JavaScript, and Lambda supports Node.js, Java, Python, 2302 01:41:44,980 --> 01:41:47,820 and will soon support other languages as well. 2303 01:41:47,820 --> 01:41:50,940 And suffice to say, it's pure code. 2304 01:41:50,940 --> 01:41:53,610 If you write in Java, you define a class. 2305 01:41:53,610 --> 01:41:55,690 You push the JAR up into Lambda. 2306 01:41:55,690 --> 01:42:00,200 And then you specify which class to call in response to which event. 2307 01:42:00,200 --> 01:42:04,770 And then the Lambda infrastructure behind that will run that code. 2308 01:42:04,770 --> 01:42:06,730 >> That code can process records off the stream. 2309 01:42:06,730 --> 01:42:08,230 It can do anything it wants with it. 2310 01:42:08,230 --> 01:42:11,650 In this particular example, all we're really doing is logging the attributes.
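[EDITOR'S NOTE: The slide's handler is JavaScript; an equivalent minimal sketch in Python, which just logs the attributes of each stream record the way the slide's example does, might look like this. The handler name and return shape are illustrative.]

```python
import json

def handler(event, context):
    # Lambda invokes this with a batch of DynamoDB Streams records.
    for record in event["Records"]:
        event_name = record["eventName"]                # INSERT, MODIFY, or REMOVE
        new_image = record["dynamodb"].get("NewImage", {})
        # All we're really doing here is logging the attributes.
        print(event_name, json.dumps(new_image))
    return {"processed": len(event["Records"])}
```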
2311 01:42:11,650 --> 01:42:13,480 But this is just code. 2312 01:42:13,480 --> 01:42:15,260 Code can do anything, right? 2313 01:42:15,260 --> 01:42:16,600 >> So you can rotate that data. 2314 01:42:16,600 --> 01:42:18,160 You can create a derivative view. 2315 01:42:18,160 --> 01:42:21,160 If it's a document structure, you can flatten the structure. 2316 01:42:21,160 --> 01:42:24,300 You can create alternate indexes. 2317 01:42:24,300 --> 01:42:27,100 All kinds of things you can do with the DynamoDB Streams. 2318 01:42:27,100 --> 01:42:28,780 >> And really, that's what that looks like. 2319 01:42:28,780 --> 01:42:29,940 So you get those updates coming in. 2320 01:42:29,940 --> 01:42:31,190 They're coming off the stream. 2321 01:42:31,190 --> 01:42:32,720 They're read by the Lambda function. 2322 01:42:32,720 --> 01:42:37,480 They're rotating the data and pushing it up in derivative tables, 2323 01:42:37,480 --> 01:42:42,200 notifying external systems of change, and pushing data into ElastiCache. 2324 01:42:42,200 --> 01:42:45,900 >> We talked about how to put the cache in front of the database for that sales 2325 01:42:45,900 --> 01:42:46,450 scenario. 2326 01:42:46,450 --> 01:42:50,049 Well what happens if I update the item description? 2327 01:42:50,049 --> 01:42:52,340 Well, if I had a Lambda function running on that table, 2328 01:42:52,340 --> 01:42:55,490 if I update the item description, it'll pick up the record off the stream, 2329 01:42:55,490 --> 01:42:58,711 and it'll update the ElastiCache instance with the new data. 2330 01:42:58,711 --> 01:43:00,460 So that's a lot of what we do with Lambda. 2331 01:43:00,460 --> 01:43:02,619 It's glue code, connectors. 2332 01:43:02,619 --> 01:43:04,410 And it actually gives the ability to launch 2333 01:43:04,410 --> 01:43:07,930 and to run very complex applications without a dedicated server 2334 01:43:07,930 --> 01:43:10,371 infrastructure, which is really cool. 2335 01:43:10,371 --> 01:43:13,100 >> So let's go back to our real-time voting architecture. 2336 01:43:13,100 --> 01:43:17,984 This is new and improved with our streams and KCL enabled application. 2337 01:43:17,984 --> 01:43:20,150 Same as before, we can handle any scale of election. 2338 01:43:20,150 --> 01:43:21,100 We like this. 2339 01:43:21,100 --> 01:43:24,770 We're doing our scatter gathers across multiple buckets. 2340 01:43:24,770 --> 01:43:26,780 We've got optimistic locking going on. 2341 01:43:26,780 --> 01:43:30,192 We can keep our voters from changing their votes. 2342 01:43:30,192 --> 01:43:31,400 They can only vote once. 2343 01:43:31,400 --> 01:43:32,880 This is fantastic. 2344 01:43:32,880 --> 01:43:35,895 Real-time fault tolerance, scalable aggregation now. 2345 01:43:35,895 --> 01:43:38,270 If the thing falls over, it knows where to restart itself 2346 01:43:38,270 --> 01:43:41,300 when it comes back up because we're using the KCL app. 2347 01:43:41,300 --> 01:43:45,700 And then we can also use that KCL application to push data out 2348 01:43:45,700 --> 01:43:48,820 to Redshift for other app analytics, or use 2349 01:43:48,820 --> 01:43:51,990 Elastic MapReduce to run real-time streaming aggregations off 2350 01:43:51,990 --> 01:43:53,180 of that data. 2351 01:43:53,180 --> 01:43:55,480 >> So these are things we haven't talked about much. 2352 01:43:55,480 --> 01:43:57,375 But they're additional technologies that come 2353 01:43:57,375 --> 01:44:00,310 to bear when you're looking at these types of scenarios.
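[EDITOR'S NOTE: Going back to the cache-refresh example a moment ago, the same kind of stream-triggered handler could push the updated item into ElastiCache. A hedged sketch follows, assuming a Redis-flavored ElastiCache endpoint reachable from Lambda and the redis Python client; the endpoint, key scheme, and attribute names are all invented for illustration.]

```python
import json
import redis

# Hypothetical ElastiCache (Redis) endpoint.
cache = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            item_id = new_image["item_id"]["S"]  # stream images use DynamoDB's typed attribute format
            # Refresh the cached copy so the read cache in front of the table stays current.
            cache.set(f"item:{item_id}", json.dumps(new_image))
        elif record["eventName"] == "REMOVE":
            old_image = record["dynamodb"]["OldImage"]
            cache.delete(f"item:{old_image['item_id']['S']}")
```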
2354 01:44:00,310 --> 01:44:03,160 >> All right, so that's about analytics with DynamoDB Streams. 2355 01:44:03,160 --> 01:44:05,340 You can collect de-dupe data, do all kinds 2356 01:44:05,340 --> 01:44:09,490 of nice stuff, aggregate data in memory, create those derivative tables. 2357 01:44:09,490 --> 01:44:13,110 That's a huge use case that a lot of customers 2358 01:44:13,110 --> 01:44:16,950 are involved with, taking the nested properties of those JSON documents 2359 01:44:16,950 --> 01:44:18,946 and creating additional indexes. 2360 01:44:18,946 --> 01:44:21,680 2361 01:44:21,680 --> 01:44:23,150 >> We're at the end. 2362 01:44:23,150 --> 01:44:26,689 Thank you for bearing with me. 2363 01:44:26,689 --> 01:44:28,480 So let's talk about reference architecture. 2364 01:44:28,480 --> 01:44:33,440 DynamoDB sits in the middle of so much of the AWS infrastructure. 2365 01:44:33,440 --> 01:44:37,090 Basically you can hook it up to anything you want. 2366 01:44:37,090 --> 01:44:45,600 Applications built using Dynamo include Lambda, ElastiCache, CloudSearch, 2367 01:44:45,600 --> 01:44:49,890 push the data out into Elastic MapReduce, import export from DynamoDB 2368 01:44:49,890 --> 01:44:52,370 into S3, all kinds of workflows. 2369 01:44:52,370 --> 01:44:54,120 But probably the best thing to talk about, 2370 01:44:54,120 --> 01:44:56,119 and this is what's really interesting is when we 2371 01:44:56,119 --> 01:44:58,350 talk about event driven applications. 2372 01:44:58,350 --> 01:45:00,300 >> This is an example of an internal project 2373 01:45:00,300 --> 01:45:04,850 that we have where we're actually publishing to gather survey results. 2374 01:45:04,850 --> 01:45:07,700 So in an email link that we send out, there'll 2375 01:45:07,700 --> 01:45:11,350 be a little link saying click here to respond to the survey. 2376 01:45:11,350 --> 01:45:14,070 And when a person clicks that link, what happens 2377 01:45:14,070 --> 01:45:18,020 is they pull down a secure HTML survey form from S3. 2378 01:45:18,020 --> 01:45:18,980 There's no server. 2379 01:45:18,980 --> 01:45:20,600 This is just an S3 object. 2380 01:45:20,600 --> 01:45:22,770 >> That form comes up, loads up in the browser. 2381 01:45:22,770 --> 01:45:24,240 It's got Backbone. 2382 01:45:24,240 --> 01:45:30,160 It's got complex JavaScript that it's running. 2383 01:45:30,160 --> 01:45:33,557 So it's very rich application running in the client's browser. 2384 01:45:33,557 --> 01:45:36,390 They don't know that they're not interacting with a back end server. 2385 01:45:36,390 --> 01:45:38,220 At this point, it's all browser. 2386 01:45:38,220 --> 01:45:41,780 >> They publish the results to what we call the Amazon API Gateway. 2387 01:45:41,780 --> 01:45:46,270 API Gateway is simply a web API that you can define and hook up 2388 01:45:46,270 --> 01:45:47,760 to whatever you want. 2389 01:45:47,760 --> 01:45:50,990 In this particular case, we're hooked up to a Lambda function. 2390 01:45:50,990 --> 01:45:54,797 >> So my POST operation is happening with no server. 2391 01:45:54,797 --> 01:45:56,380 Basically that API Gateway sits there. 2392 01:45:56,380 --> 01:45:58,770 It costs me nothing until people start POSTing to it, right? 2393 01:45:58,770 --> 01:46:00,269 The Lambda function just sits there. 2394 01:46:00,269 --> 01:46:03,760 And it costs me nothing until people start hitting it. 2395 01:46:03,760 --> 01:46:07,270 So you can see, as the volume increases, that's when the charges come. 
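[EDITOR'S NOTE: A minimal sketch of what the Lambda function behind that API Gateway POST might look like in Python. The proxy-style event shape, the SurveyResponses table, and every field name here are assumptions for illustration; the internal project itself is not published, and the talk goes on to describe splitting the PII out of the payload.]

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

table = boto3.resource("dynamodb").Table("SurveyResponses")  # hypothetical

def handler(event, context):
    # API Gateway hands the HTTP POST body to Lambda as a string.
    body = json.loads(event["body"])

    # Record a little metadata about the submission. The function only runs
    # (and only costs anything) when someone actually POSTs to the endpoint.
    table.put_item(Item={
        "response_id": str(uuid.uuid4()),
        "survey_id": body.get("survey_id", "unknown"),
        "received_at": datetime.now(timezone.utc).isoformat(),
    })
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```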
2396 01:46:07,270 --> 01:46:09,390 I'm not running a server 7/24. 2397 01:46:09,390 --> 01:46:12,310 >> So I pull the form down out of the bucket, 2398 01:46:12,310 --> 01:46:15,719 and I post through the API Gateway into the Lambda function. 2399 01:46:15,719 --> 01:46:17,510 And then the Lambda function says, you know 2400 01:46:17,510 --> 01:46:20,600 what, I've got some PII, some personally identifiable information 2401 01:46:20,600 --> 01:46:21,480 in these responses. 2402 01:46:21,480 --> 01:46:23,020 I got comments coming from users. 2403 01:46:23,020 --> 01:46:24,230 I've got email addresses. 2404 01:46:24,230 --> 01:46:26,190 I've got usernames. 2405 01:46:26,190 --> 01:46:27,810 >> Let me split this off. 2406 01:46:27,810 --> 01:46:30,280 I'm going to generate some metadata off this record. 2407 01:46:30,280 --> 01:46:32,850 And I'm going to push the metadata into DynamoDB. 2408 01:46:32,850 --> 01:46:36,059 And I could encrypt all the data and push it into DynamoDB if I want. 2409 01:46:36,059 --> 01:46:38,600 But it's easier for me, in this use case, to go ahead and say, 2410 01:46:38,600 --> 01:46:42,800 I'm going to push the raw data into an encrypted S3 bucket. 2411 01:46:42,800 --> 01:46:47,240 So I use built in S3 server side encryption and Amazon's Key Management 2412 01:46:47,240 --> 01:46:51,600 Service so that I have a key that can rotate on a regular interval, 2413 01:46:51,600 --> 01:46:55,010 and I can protect that PII data as part of this whole workflow. 2414 01:46:55,010 --> 01:46:55,870 >> So what have I done? 2415 01:46:55,870 --> 01:47:00,397 I've just deployed a whole application, and I have no server. 2416 01:47:00,397 --> 01:47:02,980 So that's what event-driven application architecture does for you. 2417 01:47:02,980 --> 01:47:05,730 >> Now if you think about the use case for this-- 2418 01:47:05,730 --> 01:47:08,730 we have other customers I'm talking to about this exact architecture who 2419 01:47:08,730 --> 01:47:14,560 run phenomenally large campaigns, who are looking at this and going, oh my. 2420 01:47:14,560 --> 01:47:17,840 Because now, they can basically push it out there, 2421 01:47:17,840 --> 01:47:21,900 let that campaign just sit there until it launches, and not 2422 01:47:21,900 --> 01:47:24,400 have to worry a fig about what kind of infrastructure 2423 01:47:24,400 --> 01:47:26,120 is going to be there to support it. 2424 01:47:26,120 --> 01:47:28,600 And then as soon as that campaign is done, 2425 01:47:28,600 --> 01:47:31,520 it's like the infrastructure just immediately goes away 2426 01:47:31,520 --> 01:47:33,680 because there really is no infrastructure. 2427 01:47:33,680 --> 01:47:35,660 It's just code that sits on Lambda. 2428 01:47:35,660 --> 01:47:38,560 It's just data that sits in DynamoDB. 2429 01:47:38,560 --> 01:47:41,340 It is an amazing way to build applications. 2430 01:47:41,340 --> 01:47:43,970 >> AUDIENCE: So is it more ephemeral than it would be 2431 01:47:43,970 --> 01:47:45,740 if it was stored on an actual server? 2432 01:47:45,740 --> 01:47:46,823 >> RICK HOULIHAN: Absolutely. 2433 01:47:46,823 --> 01:47:49,190 Because that server instance would have to be up 7/24. 2434 01:47:49,190 --> 01:47:51,954 It has to be available for somebody to respond to. 2435 01:47:51,954 --> 01:47:52,620 Well guess what? 2436 01:47:52,620 --> 01:47:55,410 S3 is available 7/24. 2437 01:47:55,410 --> 01:47:57,100 S3 always responds. 2438 01:47:57,100 --> 01:47:59,320 And S3 is very, very good at serving up objects.
2439 01:47:59,320 --> 01:48:02,590 Those objects can be HTML files, or JavaScript files, or whatever you want. 2440 01:48:02,590 --> 01:48:07,430 You can run very rich web applications out of S3 buckets, and people do. 2441 01:48:07,430 --> 01:48:10,160 >> And so the idea here is to get away from the way 2442 01:48:10,160 --> 01:48:11,270 we used to think about it. 2443 01:48:11,270 --> 01:48:14,270 We all used to think in terms of servers and hosts. 2444 01:48:14,270 --> 01:48:16,580 It's not about that anymore. 2445 01:48:16,580 --> 01:48:19,310 It's about infrastructure as code. 2446 01:48:19,310 --> 01:48:22,470 Deploy the code to the cloud and let the cloud run it for you. 2447 01:48:22,470 --> 01:48:24,980 And that's what AWS is trying to do. 2448 01:48:24,980 --> 01:48:29,690 >> AUDIENCE: So your gold box in the middle of the API Gateway is not server-like, 2449 01:48:29,690 --> 01:48:30,576 but instead is just-- 2450 01:48:30,576 --> 01:48:32,850 >> RICK HOULIHAN: You can think of it as a server facade. 2451 01:48:32,850 --> 01:48:38,040 All it is is it'll take an HTTP request and map it to another process. 2452 01:48:38,040 --> 01:48:39,192 That's all it does. 2453 01:48:39,192 --> 01:48:41,525 And in this case, we're mapping it to a Lambda function. 2454 01:48:41,525 --> 01:48:44,119 2455 01:48:44,119 --> 01:48:45,410 All right, so that's all I got. 2456 01:48:45,410 --> 01:48:46,190 Thank you very much. 2457 01:48:46,190 --> 01:48:46,800 I appreciate it. 2458 01:48:46,800 --> 01:48:48,100 I know we went a little bit over time. 2459 01:48:48,100 --> 01:48:49,980 And hopefully you guys got a little bit of information 2460 01:48:49,980 --> 01:48:51,410 that you can take away today. 2461 01:48:51,410 --> 01:48:53,520 And I apologize if I went over some of your heads, 2462 01:48:53,520 --> 01:48:56,697 but there's a good lot of fundamental foundational knowledge 2463 01:48:56,697 --> 01:48:58,280 that I think is very valuable for you. 2464 01:48:58,280 --> 01:48:59,825 So thank you for having me. 2465 01:48:59,825 --> 01:49:00,325 [APPLAUSE] 2466 01:49:00,325 --> 01:49:02,619 AUDIENCE: [INAUDIBLE] is when you were saying 2467 01:49:02,619 --> 01:49:05,160 you had to go through the thing from the beginning to the end 2468 01:49:05,160 --> 01:49:07,619 to get the right values or the same values, 2469 01:49:07,619 --> 01:49:09,410 how would the values change if [INAUDIBLE]. 2470 01:49:09,410 --> 01:49:10,480 >> RICK HOULIHAN: Oh, idempotent? 2471 01:49:10,480 --> 01:49:11,800 How would the values change? 2472 01:49:11,800 --> 01:49:15,180 Well, because if I didn't run it all the way to the end, 2473 01:49:15,180 --> 01:49:19,770 then I don't know what changes were made in the last mile. 2474 01:49:19,770 --> 01:49:22,144 It's not going to be the same data as what I saw. 2475 01:49:22,144 --> 01:49:24,560 AUDIENCE: Oh, so you just haven't gotten the entire input. 2476 01:49:24,560 --> 01:49:24,770 RICK HOULIHAN: Right. 2477 01:49:24,770 --> 01:49:26,895 You have to go from beginning to end, and then it's 2478 01:49:26,895 --> 01:49:29,280 going to be a consistent state. 2479 01:49:29,280 --> 01:49:31,520 Cool. 2480 01:49:31,520 --> 01:49:35,907 >> AUDIENCE: So you showed us DynamoDB can do document or key value. 2481 01:49:35,907 --> 01:49:38,740 And we spent a lot of time on the key value with a hash and the ways 2482 01:49:38,740 --> 01:49:40,005 to flip it around. 2483 01:49:40,005 --> 01:49:43,255 When you looked at those tables, is that leaving behind the document approach?
2484 01:49:43,255 --> 01:49:44,600 >> RICK HOULIHAN: I wouldn't say leaving it behind. 2485 01:49:44,600 --> 01:49:45,855 >> AUDIENCE: They were separated from the-- 2486 01:49:45,855 --> 01:49:49,140 >> RICK HOULIHAN: With the document approach, the document type in DynamoDB, 2487 01:49:49,140 --> 01:49:50,880 you can just think of it as another attribute. 2488 01:49:50,880 --> 01:49:53,560 It's an attribute that contains a hierarchical data structure. 2489 01:49:53,560 --> 01:49:56,980 And then in the queries, you can use the properties 2490 01:49:56,980 --> 01:49:59,480 of those objects using Object Notation. 2491 01:49:59,480 --> 01:50:03,562 So I can filter on a nested property of the JSON document. 2492 01:50:03,562 --> 01:50:05,520 AUDIENCE: So any time I do a document approach, 2493 01:50:05,520 --> 01:50:07,906 I can sort of arrive at the tabular-- 2494 01:50:07,906 --> 01:50:08,780 AUDIENCE: Absolutely. 2495 01:50:08,780 --> 01:50:09,800 AUDIENCE: --indexes and things you just talked about. 2496 01:50:09,800 --> 01:50:11,280 RICK HOULIHAN: Yeah, the indexes and all that, 2497 01:50:11,280 --> 01:50:13,363 when you want to index the properties of the JSON, 2498 01:50:13,363 --> 01:50:18,230 the way that we'd have to do that is if you insert a JSON object or a document 2499 01:50:18,230 --> 01:50:20,780 into Dynamo, you would use streams. 2500 01:50:20,780 --> 01:50:22,400 Streams would read the input. 2501 01:50:22,400 --> 01:50:24,340 You'd get that JSON object and you'd say OK, 2502 01:50:24,340 --> 01:50:26,030 what's the property I want to index? 2503 01:50:26,030 --> 01:50:28,717 >> You create a derivative table. 2504 01:50:28,717 --> 01:50:30,300 Now that's the way it works right now. 2505 01:50:30,300 --> 01:50:32,650 We don't allow you to index those properties directly. 2506 01:50:32,650 --> 01:50:33,520 >> AUDIENCE: Tabularizing your documents. 2507 01:50:33,520 --> 01:50:36,230 >> RICK HOULIHAN: Exactly, flattening it, tabularizing it, exactly. 2508 01:50:36,230 --> 01:50:37,415 That's what you do with it. 2509 01:50:37,415 --> 01:50:37,860 >> AUDIENCE: Thank you. 2510 01:50:37,860 --> 01:50:39,609 >> RICK HOULIHAN: Yep, absolutely, thank you. 2511 01:50:39,609 --> 01:50:42,240 AUDIENCE: So it's kind of Mongo meets Redis classifiers. 2512 01:50:42,240 --> 01:50:43,990 >> RICK HOULIHAN: Yeah, it's a lot like that. 2513 01:50:43,990 --> 01:50:45,940 That's a good description for it. 2514 01:50:45,940 --> 01:50:47,490 Cool. 2515 01:50:47,490 --> 01:50:49,102
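[EDITOR'S NOTE: As a footnote to that last exchange, filtering on a nested property of a document-typed attribute looks roughly like the sketch below in Python with boto3; the table, key, and dotted path names are invented for the example.]

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Products")  # hypothetical

# Query on the key as usual, then filter on a nested path inside the
# document attribute using object (dotted) notation.
resp = table.query(
    KeyConditionExpression=Key("product_id").eq("prod-123"),
    FilterExpression=Attr("detail.dimensions.weight_kg").lt(5),
)
items = resp["Items"]
```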