1
00:00:00,000 --> 00:00:03,500
[MUSIC PLAYING]

2
00:00:03,500 --> 00:00:17,457


3
00:00:17,457 --> 00:00:19,290
BRIAN YU: All right,
welcome back, everyone,

4
00:00:19,290 --> 00:00:21,570
to Web Programming with
Python and JavaScript.

5
00:00:21,570 --> 00:00:25,750
And for our final topic, we're going
to explore scalability and security.

6
00:00:25,750 --> 00:00:28,470
So far in the class, we've
been building web applications.

7
00:00:28,470 --> 00:00:31,635
And we've been building web applications
that work on our own computer.

8
00:00:31,635 --> 00:00:33,510
But if we want to take
those web applications

9
00:00:33,510 --> 00:00:36,000
and deploy them to the world so
people all across the internet

10
00:00:36,000 --> 00:00:37,958
can begin to use them,
then we're going to need

11
00:00:37,958 --> 00:00:40,980
to host our web application
on some sort of web server--

12
00:00:40,980 --> 00:00:44,192
some dedicated piece of hardware
that is listening for web requests

13
00:00:44,192 --> 00:00:46,650
and responding to them with
the response that we would like

14
00:00:46,650 --> 00:00:48,660
for our web application to deliver.

15
00:00:48,660 --> 00:00:51,030
And when we do so, this
introduces a whole bunch

16
00:00:51,030 --> 00:00:54,338
of interesting issues surrounding
scalability and security.

17
00:00:54,338 --> 00:00:56,130
So we'll take a look
at these issues today,

18
00:00:56,130 --> 00:00:59,970
beginning with problems concerning
scalability-- what those problems are

19
00:00:59,970 --> 00:01:02,650
and how we might go
about addressing them.

20
00:01:02,650 --> 00:01:04,410
So when we deploy our
web applications, we

21
00:01:04,410 --> 00:01:06,720
deploy them by putting
them onto a web server

22
00:01:06,720 --> 00:01:08,970
that I'm, here, just
representing with this rectangle.

23
00:01:08,970 --> 00:01:12,840
But all the server is is some dedicated
computer, some piece of hardware that

24
00:01:12,840 --> 00:01:14,620
is listening for incoming requests.

25
00:01:14,620 --> 00:01:18,750
So we'll draw this line to represent
an incoming web request from a user.

26
00:01:18,750 --> 00:01:21,660
The server takes that
request and responds to it.

27
00:01:21,660 --> 00:01:23,880
But ultimately, our web
application isn't just

28
00:01:23,880 --> 00:01:25,530
going to be servicing one user.

29
00:01:25,530 --> 00:01:28,080
If it becomes popular,
it might have many users

30
00:01:28,080 --> 00:01:31,560
that are all trying to connect
to that server at the same time.

31
00:01:31,560 --> 00:01:34,790
And as multiple people start to connect
to that server at the same time,

32
00:01:34,790 --> 00:01:37,560
here is where we start to deal
with issues of scalability.

33
00:01:37,560 --> 00:01:41,040
A single computer or a single server
can only service so many users

34
00:01:41,040 --> 00:01:42,273
at any given time.

35
00:01:42,273 --> 00:01:44,190
And so, therefore, we
need to think in advance

36
00:01:44,190 --> 00:01:47,640
about how we're going to deal
with those issues of scale.

37
00:01:47,640 --> 00:01:49,920
But the first question,
before we even get there,

38
00:01:49,920 --> 00:01:52,320
is where these servers actually exist.

39
00:01:52,320 --> 00:01:56,010
And nowadays, there are two main options
for where these servers can exist.

40
00:01:56,010 --> 00:02:00,210
These servers can be on the
cloud or they can be on premise.

41
00:02:00,210 --> 00:02:02,400
And on-premise servers,
you might imagine

42
00:02:02,400 --> 00:02:05,160
is if a company is running
their own web application.

43
00:02:05,160 --> 00:02:08,340
On-premise servers are servers that
are inside of the company's walls.

44
00:02:08,340 --> 00:02:10,710
The company owns the
physical servers, maybe

45
00:02:10,710 --> 00:02:12,840
on some server racks inside of a room.

46
00:02:12,840 --> 00:02:14,970
And therefore, they
have very direct control

47
00:02:14,970 --> 00:02:17,940
over all of the servers-- exactly
what kind of servers they are,

48
00:02:17,940 --> 00:02:19,830
exactly what software
is running on them.

49
00:02:19,830 --> 00:02:23,280
They can go and physically look at the
servers and debug them, if need be,

50
00:02:23,280 --> 00:02:25,830
in order to make sure that
any issues are dealt with.

51
00:02:25,830 --> 00:02:28,170
But increasingly, we're
starting to move into a world

52
00:02:28,170 --> 00:02:31,170
where cloud computing is
becoming increasingly popular.

53
00:02:31,170 --> 00:02:35,190
In cloud computing, rather than have
dedicated servers that are on premise,

54
00:02:35,190 --> 00:02:37,290
we have servers that are
somewhere in the cloud

55
00:02:37,290 --> 00:02:40,950
where cloud computing companies
like Amazon, or Google, or Microsoft

56
00:02:40,950 --> 00:02:42,720
are able to run their own servers.

57
00:02:42,720 --> 00:02:46,860
And we simply use those servers that
are provided by those third parties,

58
00:02:46,860 --> 00:02:50,130
whether it's Amazon, or Google,
or Microsoft, or someone else.

59
00:02:50,130 --> 00:02:51,330
And there are trade-offs.

60
00:02:51,330 --> 00:02:54,950
With cloud computing, we no longer have
as direct control over the machines

61
00:02:54,950 --> 00:02:56,700
themselves because
they're not on premise.

62
00:02:56,700 --> 00:02:59,190
We can't physically
manipulate those computers.

63
00:02:59,190 --> 00:03:01,620
But we have the advantage
of not having to worry

64
00:03:01,620 --> 00:03:05,070
about dealing with physical
objects that are inside

65
00:03:05,070 --> 00:03:08,280
of the premise of the company whose
servers we'd like to run code for.

66
00:03:08,280 --> 00:03:10,770
When it's on the cloud,
everything is managed externally

67
00:03:10,770 --> 00:03:14,205
by some other company, and we can
simply use the servers that we need to.

68
00:03:14,205 --> 00:03:16,830
And we'll see that this lends
itself to other benefits as well.

69
00:03:16,830 --> 00:03:20,490
As we might need more servers, as we
start to get more sophisticated web

70
00:03:20,490 --> 00:03:24,120
applications that need more users,
these cloud-computing companies

71
00:03:24,120 --> 00:03:26,220
can allow us to create
web applications that

72
00:03:26,220 --> 00:03:29,280
are able to scale across
multiple different servers

73
00:03:29,280 --> 00:03:31,910
as we start to get more and more users.

74
00:03:31,910 --> 00:03:35,460
But we'll discuss those issues
of scale as we get to them.

75
00:03:35,460 --> 00:03:37,890
The question we need to ask
after we have these servers--

76
00:03:37,890 --> 00:03:40,348
whether they're servers that
are on premise or servers that

77
00:03:40,348 --> 00:03:42,240
are operating somewhere in the cloud--

78
00:03:42,240 --> 00:03:47,328
is, how many users can the server
actually service at any given time?

79
00:03:47,328 --> 00:03:48,370
And that's going to vary.

80
00:03:48,370 --> 00:03:51,300
It's going to vary based on the
size of the server, the computing

81
00:03:51,300 --> 00:03:52,470
power of the server.

82
00:03:52,470 --> 00:03:56,250
And it's going to be dependent
upon how long it takes to process

83
00:03:56,250 --> 00:03:58,110
any particular user's request.

84
00:03:58,110 --> 00:04:00,420
If user requests are
quite expensive, it might

85
00:04:00,420 --> 00:04:03,870
mean that there are fewer users that
can be serviced at any given time.

86
00:04:03,870 --> 00:04:05,880
And it's for that reason
that a helpful tool

87
00:04:05,880 --> 00:04:08,850
is to do some kind of benchmarking,
some process of trying

88
00:04:08,850 --> 00:04:12,630
to do some analysis on how many
users a server can actually

89
00:04:12,630 --> 00:04:14,730
be handling at any particular time.

90
00:04:14,730 --> 00:04:16,950
And there are numerous
different tools that allow

91
00:04:16,950 --> 00:04:18,779
us to do this kind of benchmarking.

92
00:04:18,779 --> 00:04:22,470
ApacheBench, otherwise
known as ab, is a popular tool

93
00:04:22,470 --> 00:04:24,250
for doing this kind of thing.

94
00:04:24,250 --> 00:04:28,290
But benchmarking is going to be useful
so that we know how many users one

95
00:04:28,290 --> 00:04:29,550
particular server can handle.
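
As a rough illustration of what a benchmark measures, here is a minimal Python sketch. It does not use ApacheBench itself: `handle_request` is a hypothetical stand-in for a real request handler, and the 10-millisecond delay is an assumed cost chosen only for the example. With ApacheBench, a comparable run might look like `ab -n 1000 -c 100 http://localhost:8000/`, where the URL is a placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    """Hypothetical stand-in for the work a server does per request."""
    time.sleep(0.01)  # assumed cost of handling one request
    return request_id


def benchmark(handler, total_requests, concurrency):
    """Run total_requests calls to handler on `concurrency` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(handler, range(total_requests)))
    elapsed = time.perf_counter() - start
    return {
        "completed": len(results),
        "seconds": elapsed,
        "requests_per_second": len(results) / elapsed,
    }


if __name__ == "__main__":
    print(benchmark(handle_request, total_requests=200, concurrency=20))
```

Raising `concurrency` until the requests-per-second figure stops improving gives a rough sense of where a single server tops out.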

96
00:04:29,550 --> 00:04:31,290
Maybe it can handle 50 users.

97
00:04:31,290 --> 00:04:32,700
Maybe it can handle 100 users.

98
00:04:32,700 --> 00:04:35,160
Maybe it can handle
more at any given time.

99
00:04:35,160 --> 00:04:37,830
But ultimately, it's going
to be some finite limit.

100
00:04:37,830 --> 00:04:40,680
Every computer just has some
finite amount of resources,

101
00:04:40,680 --> 00:04:42,030
and servers are no exception.

102
00:04:42,030 --> 00:04:45,360
There's going to be some number of
users after which the server is not

103
00:04:45,360 --> 00:04:47,020
going to be able to handle it.

104
00:04:47,020 --> 00:04:48,850
So what do we do in that situation?

105
00:04:48,850 --> 00:04:53,130
What do we do if our server can only
handle 100 users at any given time,

106
00:04:53,130 --> 00:04:58,020
but 101 users are trying to use our
web application at the same time?

107
00:04:58,020 --> 00:04:59,440
Something needs to change.

108
00:04:59,440 --> 00:05:01,740
We need to deal with
some sort of scaling

109
00:05:01,740 --> 00:05:04,500
to make sure that our web
application can scale.

110
00:05:04,500 --> 00:05:07,770
And there are a couple of different
types of scaling that we can try.

111
00:05:07,770 --> 00:05:10,530
One approach is to do what's
called vertical scaling, which

112
00:05:10,530 --> 00:05:12,780
might be the simplest way
you could imagine scaling.

113
00:05:12,780 --> 00:05:15,900
If this server is not good enough
for handling the number of users

114
00:05:15,900 --> 00:05:18,890
that we need it to handle,
well, just get a bigger server.

115
00:05:18,890 --> 00:05:21,260
In vertical scaling,
we just take the server

116
00:05:21,260 --> 00:05:23,930
and get a bigger server,
a more powerful server,

117
00:05:23,930 --> 00:05:26,480
a server that can handle
more users at any given time.

118
00:05:26,480 --> 00:05:27,730
It's going to cost more.

119
00:05:27,730 --> 00:05:29,480
But if we need it to
handle more users, we

120
00:05:29,480 --> 00:05:33,110
can just get a bigger server to
be able to deal with that problem.

121
00:05:33,110 --> 00:05:34,607
This approach is fairly simple.

122
00:05:34,607 --> 00:05:37,190
It just involves swapping out
one server for another, one that

123
00:05:37,190 --> 00:05:39,410
can handle more users concurrently.

124
00:05:39,410 --> 00:05:40,830
But it also has drawbacks.

125
00:05:40,830 --> 00:05:44,330
There is some limit to how big the
server can be, to how many users

126
00:05:44,330 --> 00:05:47,390
any physical one server is going to
be able to handle because there's

127
00:05:47,390 --> 00:05:50,870
a physical limitation on what is
the biggest, fastest, most powerful

128
00:05:50,870 --> 00:05:53,310
server we could possibly get.

129
00:05:53,310 --> 00:05:55,970
So when vertical scaling
ends up not being enough,

130
00:05:55,970 --> 00:05:59,720
an alternative-- as you might imagine--
is what's known as horizontal scaling.

131
00:05:59,720 --> 00:06:01,970
And the idea behind
horizontal scaling is

132
00:06:01,970 --> 00:06:06,560
that, when one server isn't enough to
be able to service all of the users that

133
00:06:06,560 --> 00:06:10,070
might be trying to use a web
application at the same time, well,

134
00:06:10,070 --> 00:06:13,010
then we can take the approach of
saying, well, rather than just using

135
00:06:13,010 --> 00:06:17,840
one server, let's go ahead and split
it up into two different servers.

136
00:06:17,840 --> 00:06:21,420
We now have two servers that are
both running the web application.

137
00:06:21,420 --> 00:06:24,980
And now, effectively, we've been
able to double the number of users

138
00:06:24,980 --> 00:06:26,600
that this web application can handle.

139
00:06:26,600 --> 00:06:29,690
Rather than just a single server
that can service 100 users,

140
00:06:29,690 --> 00:06:33,200
if we have two of them, now we can
service 200 users at any given time

141
00:06:33,200 --> 00:06:37,670
if you imagine 100 of them using
server A over here and 100 of them

142
00:06:37,670 --> 00:06:40,460
using server B over there.

143
00:06:40,460 --> 00:06:44,220
But this then lends itself to some
other questions that we have to answer,

144
00:06:44,220 --> 00:06:47,630
which is, how do these servers get
their users in the first place?

145
00:06:47,630 --> 00:06:50,450
When a user requests a web
page, how does that user

146
00:06:50,450 --> 00:06:54,140
get directed either to
server A or to server B?

147
00:06:54,140 --> 00:06:57,980
It seems that they need some way to
make that decision in order to decide

148
00:06:57,980 --> 00:07:00,690
whether to go one direction or another.

149
00:07:00,690 --> 00:07:04,010
And it's for that reason that we might
introduce another piece of hardware

150
00:07:04,010 --> 00:07:05,240
into this picture.

151
00:07:05,240 --> 00:07:09,070
And that additional piece of hardware
is what we might call a load balancer.

152
00:07:09,070 --> 00:07:11,510
And a load balancer is just
another piece of hardware

153
00:07:11,510 --> 00:07:14,910
that is going to sit in front
of these servers, so to speak.

154
00:07:14,910 --> 00:07:17,660
In other words, when a user
makes a request to a web page,

155
00:07:17,660 --> 00:07:21,170
rather than immediately getting that
request to one of these web servers,

156
00:07:21,170 --> 00:07:25,250
the request is first going to
go through this load balancer

157
00:07:25,250 --> 00:07:27,800
where the request first
comes into the load balancer.

158
00:07:27,800 --> 00:07:31,160
And the load balancer then decides
whether to send that request to server

159
00:07:31,160 --> 00:07:35,330
A or to send that request to
server B. And this process

160
00:07:35,330 --> 00:07:38,300
is likely less expensive than
actually dealing with and processing

161
00:07:38,300 --> 00:07:39,330
that request.

162
00:07:39,330 --> 00:07:42,440
So the load balancer is effectively
just acting as a dispatcher.

163
00:07:42,440 --> 00:07:44,310
It waits for those requests to come in.

164
00:07:44,310 --> 00:07:46,670
And when the requests do
come in, the load balancer

165
00:07:46,670 --> 00:07:49,628
directs those requests either to
go to one server or to another.

166
00:07:49,628 --> 00:07:52,670
And you might imagine the story where
we have more than just two servers.

167
00:07:52,670 --> 00:07:54,260
Maybe we have many servers.

168
00:07:54,260 --> 00:07:56,660
And the load balancer
is just going to balance

169
00:07:56,660 --> 00:07:59,030
between all of those different servers.

170
00:07:59,030 --> 00:08:02,570
And this process of deciding
which server to send a request to

171
00:08:02,570 --> 00:08:05,840
is known as load balancing, which is
what the load balancer is ultimately

172
00:08:05,840 --> 00:08:06,618
doing.

173
00:08:06,618 --> 00:08:09,410
And there are various different
methods that you might use in order

174
00:08:09,410 --> 00:08:11,042
to perform this load balancing.

175
00:08:11,042 --> 00:08:13,250
So you might imagine thinking
about this intuitively.

176
00:08:13,250 --> 00:08:16,490
How would the load balancer
decide, given some request,

177
00:08:16,490 --> 00:08:19,220
should we send the request
to this server,

178
00:08:19,220 --> 00:08:22,910
or should we send the request
to some other server instead?

179
00:08:22,910 --> 00:08:26,120
And there are many different approaches
that our load balancer might take.

180
00:08:26,120 --> 00:08:27,440
And here are just a couple.

181
00:08:27,440 --> 00:08:30,230
Random choice might be
the simplest of options.

182
00:08:30,230 --> 00:08:34,480
Given a user that shows up and tries
to make a request to our web server,

183
00:08:34,480 --> 00:08:36,620
the load balancer first
takes a look at the user

184
00:08:36,620 --> 00:08:40,497
and just randomly assigns them to
one of the various different servers

185
00:08:40,497 --> 00:08:42,080
that might be processing that request.

186
00:08:42,080 --> 00:08:46,340
If there are 10 different servers, it
randomly chooses among those 10 servers

187
00:08:46,340 --> 00:08:50,030
to decide which of them is going
to be servicing that request.

188
00:08:50,030 --> 00:08:52,020
This has the advantage
of being very simple.

189
00:08:52,020 --> 00:08:53,300
It's just a quick calculation.

190
00:08:53,300 --> 00:08:56,330
The computers can pretty
readily generate random numbers.

191
00:08:56,330 --> 00:08:58,310
And based on that random
number, the computer

192
00:08:58,310 --> 00:09:02,720
can dispatch the user to one
server or to another server.
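
Random choice can be sketched in a few lines of Python. This is only an illustration: the server names are hypothetical, and the simulation uses a fixed seed so that a run is repeatable.

```python
import random
from collections import Counter

SERVERS = ["server-1", "server-2", "server-3"]  # hypothetical server names


def pick_random(servers, rng=random):
    """Choose one server uniformly at random for an incoming request."""
    return rng.choice(servers)


def simulate(servers, request_count, seed=0):
    """Count how many of request_count requests land on each server."""
    rng = random.Random(seed)
    return Counter(pick_random(servers, rng) for _ in range(request_count))


if __name__ == "__main__":
    print(simulate(SERVERS, 1000))
```

Printing the counts for a small number of requests tends to show how uneven the split can be by chance.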

193
00:09:02,720 --> 00:09:06,620
But it might not be the best option
because, if we happen to get unlucky,

194
00:09:06,620 --> 00:09:10,190
we might end up with many more
users on one server than another.

195
00:09:10,190 --> 00:09:12,890
Or we might end up with
servers that are entirely

196
00:09:12,890 --> 00:09:15,230
unused if it just so
happens that we don't end up

197
00:09:15,230 --> 00:09:17,300
randomly selecting that server.

198
00:09:17,300 --> 00:09:20,780
Now, in practice with many users that
are all using this load balancer, all

199
00:09:20,780 --> 00:09:24,260
being dispatched, odds are high that
eventually all of them will be used.

200
00:09:24,260 --> 00:09:26,837
But it might not be a
totally even distribution.

201
00:09:26,837 --> 00:09:28,670
And so for that reason,
another approach you

202
00:09:28,670 --> 00:09:32,570
might take is a round-robin approach where the idea is, instead,
where the approach is, instead,

203
00:09:32,570 --> 00:09:36,650
for the very first user, go ahead and
assign that user to server number one.

204
00:09:36,650 --> 00:09:38,840
For the next user, assign
them to server number two.

205
00:09:38,840 --> 00:09:40,760
And maybe, if there are
five servers, you say,

206
00:09:40,760 --> 00:09:44,150
the third user goes to server three,
user four goes to server four,

207
00:09:44,150 --> 00:09:47,420
user five goes to server
five, and then user six

208
00:09:47,420 --> 00:09:49,070
goes back to server number one.

209
00:09:49,070 --> 00:09:51,257
You basically rotate
going one through five.

210
00:09:51,257 --> 00:09:53,840
And then, once you've assigned
someone to each of the servers,

211
00:09:53,840 --> 00:09:55,760
you go back to the beginning.

212
00:09:55,760 --> 00:09:59,360
This is also a relatively easy thing to
implement because you can simply just

213
00:09:59,360 --> 00:10:01,520
keep count somewhere
in the load balancer

214
00:10:01,520 --> 00:10:04,730
saying, what was the most recent
server that I assigned a user to?

215
00:10:04,730 --> 00:10:07,550
And the next time a request
comes in, go ahead and assign it

216
00:10:07,550 --> 00:10:09,710
to the next server, and
the next server after that,

217
00:10:09,710 --> 00:10:12,220
effectively doing a
round-robin style approach

218
00:10:12,220 --> 00:10:16,040
where you go through all the servers
once before going through the servers

219
00:10:16,040 --> 00:10:17,140
again.
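
The round-robin rotation described above can be sketched as follows. The class and method names are illustrative assumptions, not part of any particular load balancer.

```python
class RoundRobinBalancer:
    """Hand out servers in order, wrapping around after the last one."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.next_index = 0  # position of the server to use next

    def pick(self):
        server = self.servers[self.next_index]
        # Advance the counter, returning to 0 after the last server.
        self.next_index = (self.next_index + 1) % len(self.servers)
        return server


if __name__ == "__main__":
    balancer = RoundRobinBalancer([1, 2, 3, 4, 5])
    print([balancer.pick() for _ in range(7)])
```

The only state the balancer keeps is a single counter, which is why this approach is cheap.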

220
00:10:17,140 --> 00:10:19,750
Now, this might seem better
than random choice in the sense

221
00:10:19,750 --> 00:10:23,230
that it's going to more equitably
decide whether to assign

222
00:10:23,230 --> 00:10:26,710
any particular request
to any particular server.

223
00:10:26,710 --> 00:10:29,110
But it also suffers
from certain problems.

224
00:10:29,110 --> 00:10:31,510
Round robin might be
great, but if some requests

225
00:10:31,510 --> 00:10:34,975
take longer than other requests,
we might also get unlucky,

226
00:10:34,975 --> 00:10:36,850
and the requests that
are taking longer might

227
00:10:36,850 --> 00:10:40,160
end up all going to one of the
servers as opposed to another server.

228
00:10:40,160 --> 00:10:43,310
So there are other approaches that
we might want to go to as well--

229
00:10:43,310 --> 00:10:45,880
for example, something
like fewest connections

230
00:10:45,880 --> 00:10:50,430
where the approach there is to say, go
ahead, and when a user makes a request,

231
00:10:50,430 --> 00:10:53,050
the load balancer should
pick which of the servers

232
00:10:53,050 --> 00:10:57,370
currently has the fewest active
connections from other users

233
00:10:57,370 --> 00:11:01,060
and other requests that are currently
connected to those servers instead.

234
00:11:01,060 --> 00:11:04,120
And by choosing the server that
happens to have the fewest connections,

235
00:11:04,120 --> 00:11:07,330
you're probably going to do a
better job of trying to balance out

236
00:11:07,330 --> 00:11:09,340
between all of the
various different requests

237
00:11:09,340 --> 00:11:12,220
that might be happening inside
of your web application.

238
00:11:12,220 --> 00:11:15,220
And while this might do a better job,
there are trade-offs here as well.

239
00:11:15,220 --> 00:11:18,700
It might be more expensive, for
example, to compute which of the servers

240
00:11:18,700 --> 00:11:21,310
happens to have the fewest
number of connections,

241
00:11:21,310 --> 00:11:24,880
whereas it's much easier just to
say, choose a server at random

242
00:11:24,880 --> 00:11:29,740
or to do the round-robin style approach
of just 1, 2, 3, 4, 5, 1, 2, 3, 4, 5,

243
00:11:29,740 --> 00:11:32,590
again, and again, and again.
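
A fewest-connections balancer might look something like this sketch. The names are hypothetical, and it assumes the balancer is told both when a connection opens (`acquire`) and when it closes (`release`).

```python
class FewestConnectionsBalancer:
    """Send each new request to the server with the fewest active connections."""

    def __init__(self, servers):
        # Active connection count per server, starting at zero.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Scan every server for the smallest count; ties go to the
        # server listed first. This scan is the extra cost compared
        # with random choice or round robin.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Record that a connection to `server` has finished."""
        self.active[server] -= 1


if __name__ == "__main__":
    balancer = FewestConnectionsBalancer(["a", "b"])
    first = balancer.acquire()
    second = balancer.acquire()
    balancer.release(first)
    print(first, second, balancer.acquire())
```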

244
00:11:32,590 --> 00:11:36,410
But all of these approaches
naively have yet another problem,

245
00:11:36,410 --> 00:11:38,030
which has to do with sessions.

246
00:11:38,030 --> 00:11:40,150
And you'll recall that
sessions we used whenever

247
00:11:40,150 --> 00:11:44,110
we wanted to store information
about the user's current interaction

248
00:11:44,110 --> 00:11:45,220
with the web application.

249
00:11:45,220 --> 00:11:46,780
When you log into a website--

250
00:11:46,780 --> 00:11:50,300
you log into your email, or you
log into Amazon, for example--

251
00:11:50,300 --> 00:11:53,740
and then you come back to that website
or visit another page on that website--

252
00:11:53,740 --> 00:11:56,470
make another request, for example--

253
00:11:56,470 --> 00:11:59,800
it's not the case that you have to sign
in yet again, that the web browser has

254
00:11:59,800 --> 00:12:01,720
totally forgotten who you are.

255
00:12:01,720 --> 00:12:04,450
When I go back to my mail account,
or when I go back to Amazon

256
00:12:04,450 --> 00:12:08,205
for a second time, my mail account or
Amazon remembers me from the last time

257
00:12:08,205 --> 00:12:08,830
that I visited.

258
00:12:08,830 --> 00:12:13,060
I have some sort of session where it's
keeping track of who is logged in,

259
00:12:13,060 --> 00:12:15,670
maybe information about what
I've been doing on the page,

260
00:12:15,670 --> 00:12:18,790
and allows me to continue
interacting with the web application,

261
00:12:18,790 --> 00:12:21,880
even if I'm making multiple requests.

262
00:12:21,880 --> 00:12:24,310
And this, you might
imagine, could be a problem

263
00:12:24,310 --> 00:12:26,440
for this type of load balancing.

264
00:12:26,440 --> 00:12:31,630
If I have multiple different servers,
imagine if I try to log into a website.

265
00:12:31,630 --> 00:12:34,990
And the first time I make a request,
I'm directed to server number one.

266
00:12:34,990 --> 00:12:37,690
And I'm now logged in
on server number one.

267
00:12:37,690 --> 00:12:39,400
But then I make another request.

268
00:12:39,400 --> 00:12:41,162
I'm directed back to the load balancer.

269
00:12:41,162 --> 00:12:43,120
And maybe the load
balancer, this time, decides

270
00:12:43,120 --> 00:12:45,310
to send me to server number two.

271
00:12:45,310 --> 00:12:48,190
But if the session is stored in
server number one somewhere--

272
00:12:48,190 --> 00:12:51,010
server number one remembers
who I am and what I'm doing--

273
00:12:51,010 --> 00:12:54,282
then server number two is
not going to know who I am.

274
00:12:54,282 --> 00:12:56,740
And therefore, it's not going
to remember that I've already

275
00:12:56,740 --> 00:12:58,660
logged into this web application.

276
00:12:58,660 --> 00:13:01,710
And as a result, I might be
prompted to log in again.

277
00:13:01,710 --> 00:13:04,630
And if I go make another request,
and I end up on yet another server,

278
00:13:04,630 --> 00:13:07,580
I might be logged out again and
have to log in for a third time.

279
00:13:07,580 --> 00:13:11,590
So the problem comes about when
our load balancing happens,

280
00:13:11,590 --> 00:13:14,290
but we're not doing so
in a session-aware way--

281
00:13:14,290 --> 00:13:18,310
that our load balancer isn't caring
about when a user visits the page

282
00:13:18,310 --> 00:13:22,300
and then visits another page on
the same web application again--

283
00:13:22,300 --> 00:13:25,720
because we want to remember
information from the previous time

284
00:13:25,720 --> 00:13:27,475
that the user was here.

285
00:13:27,475 --> 00:13:28,850
So how can we solve this problem?

286
00:13:28,850 --> 00:13:30,820
How can we make sure
that, when we do this load

287
00:13:30,820 --> 00:13:33,010
balancing across multiple
different servers,

288
00:13:33,010 --> 00:13:34,795
that we do so in a session-aware way?

289
00:13:34,795 --> 00:13:36,670
Well, there are multiple
different approaches

290
00:13:36,670 --> 00:13:39,310
to session-aware load balancing.

291
00:13:39,310 --> 00:13:42,610
One approach is this general
idea known as sticky sessions

292
00:13:42,610 --> 00:13:46,150
where the idea is that, when I
come back to the load balancer,

293
00:13:46,150 --> 00:13:49,940
the load balancer will remember
what server I was sent to last time

294
00:13:49,940 --> 00:13:52,210
and send me there yet again.

295
00:13:52,210 --> 00:13:54,670
So for example, if I
log into a website once,

296
00:13:54,670 --> 00:13:57,490
and I'm directed to server
number two, for example, then

297
00:13:57,490 --> 00:14:00,130
the next time I visit
this web application,

298
00:14:00,130 --> 00:14:03,520
even if I should be directed to
server three or four according

299
00:14:03,520 --> 00:14:07,600
to random choice or according to fewest
connections or any of these other load

300
00:14:07,600 --> 00:14:09,700
balancing methods, the
load balancer should

301
00:14:09,700 --> 00:14:12,310
remember that, last time
I came to this site,

302
00:14:12,310 --> 00:14:14,240
I got directed to server number two.

303
00:14:14,240 --> 00:14:16,210
And so this time, the
load balancer is going

304
00:14:16,210 --> 00:14:18,550
to direct me to server
number two yet again.

305
00:14:18,550 --> 00:14:22,000
That way, server number two, which
contains information about my session,

306
00:14:22,000 --> 00:14:25,000
is going to see me again and
remember who it is that I am.

307
00:14:25,000 --> 00:14:28,180
And it's not going to make me log
in again into the exact same website

308
00:14:28,180 --> 00:14:30,570
for a second time, for example.
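
One way to picture sticky sessions is a balancer that remembers its earlier decisions. This is a minimal sketch under stated assumptions: `client_id` stands for whatever the balancer uses to recognize a returning user (the lecture does not specify this), and new users are assigned in round-robin order.

```python
import itertools


class StickyBalancer:
    """Send a returning client back to the server it was first assigned."""

    def __init__(self, servers):
        self._rotation = itertools.cycle(servers)  # used for new clients only
        self._assigned = {}  # client_id -> server

    def pick(self, client_id):
        if client_id not in self._assigned:
            # First visit: choose the next server in the rotation.
            self._assigned[client_id] = next(self._rotation)
        # Every later visit reuses the remembered server.
        return self._assigned[client_id]


if __name__ == "__main__":
    balancer = StickyBalancer(["s1", "s2", "s3"])
    for client in ["alice", "bob", "alice", "carol"]:
        print(client, "->", balancer.pick(client))
```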

309
00:14:30,570 --> 00:14:33,280
And so sticky sessions are one
way of dealing with this problem.

310
00:14:33,280 --> 00:14:35,363
But again, with all of
these approaches-- and this

311
00:14:35,363 --> 00:14:38,410
will be a recurring theme as we talk
about scalability and security--

312
00:14:38,410 --> 00:14:39,730
there are trade-offs here.

313
00:14:39,730 --> 00:14:44,200
A trade-off to sticky sessions is that
it's possible that one of these servers

314
00:14:44,200 --> 00:14:47,950
is going to end up getting far more
load than another if one server happens

315
00:14:47,950 --> 00:14:50,620
to have a lot of users that
keep coming back to the website

316
00:14:50,620 --> 00:14:52,390
and keep requesting additional pages.

317
00:14:52,390 --> 00:14:54,940
But other
servers might have

318
00:14:54,940 --> 00:14:58,010
had users that decided not
to come back, for example.

319
00:14:58,010 --> 00:15:01,390
And so there's a difference in
utilization where some of our servers

320
00:15:01,390 --> 00:15:03,880
might be more heavily
utilized than other servers,

321
00:15:03,880 --> 00:15:07,580
and we're not doing a very good
job of balancing between them.

322
00:15:07,580 --> 00:15:11,980
And so one approach is to store
sessions inside of the database

323
00:15:11,980 --> 00:15:15,580
rather than store information
about sessions inside of the servers

324
00:15:15,580 --> 00:15:18,730
themselves so that, if I get
directed to another server,

325
00:15:18,730 --> 00:15:20,710
that other server doesn't
remember who I am,

326
00:15:20,710 --> 00:15:24,310
doesn't remember information about
my interaction with this website.

327
00:15:24,310 --> 00:15:27,890
If we instead choose to store
sessions inside of a database--

328
00:15:27,890 --> 00:15:31,210
and, in particular, inside of a
database that all of the servers

329
00:15:31,210 --> 00:15:33,100
have the ability to access--

330
00:15:33,100 --> 00:15:36,400
well, then it doesn't matter which
of the servers I get directed to

331
00:15:36,400 --> 00:15:39,370
and which server the load
balancer decides to send me to

332
00:15:39,370 --> 00:15:42,310
because, regardless of which
server I end up getting sent to,

333
00:15:42,310 --> 00:15:44,235
the session information
is in the database.

334
00:15:44,235 --> 00:15:46,360
And each of the servers
can connect to the database

335
00:15:46,360 --> 00:15:49,390
to find out who I am, to find out
whether I've logged into the site

336
00:15:49,390 --> 00:15:52,660
already, and therefore
is able to recognize me.
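
To make the shared-database idea concrete, here is a small sketch in which two servers read and write the same session table. An in-memory SQLite database stands in for the shared database, and the table layout, class names, and method names are all assumptions made for the example.

```python
import sqlite3


class SessionStore:
    """Session table in a database that every app server can reach."""

    def __init__(self, connection):
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS sessions "
            "(session_id TEXT PRIMARY KEY, username TEXT NOT NULL)"
        )

    def save(self, session_id, username):
        self.connection.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?)",
            (session_id, username),
        )
        self.connection.commit()

    def lookup(self, session_id):
        row = self.connection.execute(
            "SELECT username FROM sessions WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        return row[0] if row else None


class AppServer:
    """Hypothetical web server that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, username):
        self.store.save(session_id, username)

    def whoami(self, session_id):
        return self.store.lookup(session_id)


if __name__ == "__main__":
    store = SessionStore(sqlite3.connect(":memory:"))
    server_one = AppServer("one", store)
    server_two = AppServer("two", store)
    server_one.login("abc123", "brian")
    print(server_two.whoami("abc123"))
```

Because the login is recorded in the store rather than on server one, server two can recognize the same session.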

337
00:15:52,660 --> 00:15:54,670
And so that might be
one approach as well.

338
00:15:54,670 --> 00:15:57,702
Another approach is to store
sessions on the client side.

339
00:15:57,702 --> 00:15:59,410
We've talked a little
bit about this idea

340
00:15:59,410 --> 00:16:03,100
of cookies, where
the web server can set a cookie so

341
00:16:03,100 --> 00:16:06,460
that your web browser is able to
present that cookie the next time

342
00:16:06,460 --> 00:16:09,020
it makes a request to
the same web application.

343
00:16:09,020 --> 00:16:12,430
And inside this cookie, you can store
a whole bunch of information, including

344
00:16:12,430 --> 00:16:14,000
information about the session.

345
00:16:14,000 --> 00:16:16,690
You might, inside of a
cookie, store information

346
00:16:16,690 --> 00:16:19,340
about what user is currently
logged in, for example,

347
00:16:19,340 --> 00:16:21,500
or other session-related information.

348
00:16:21,500 --> 00:16:23,080
But here, too, there are drawbacks.

349
00:16:23,080 --> 00:16:25,750
If you're not careful, someone
could manipulate that cookie

350
00:16:25,750 --> 00:16:27,380
and maybe pretend to be someone else.

351
00:16:27,380 --> 00:16:29,230
And so for that reason,
you might want to do

352
00:16:29,230 --> 00:16:32,020
some encryption or some
kind of signing to make sure

353
00:16:32,020 --> 00:16:35,832
that you can't fake a cookie and
pretend to be someone that you're not.
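
As one possible illustration of signing a cookie, the sketch below attaches an HMAC to the cookie value so that the server can detect tampering. The secret key and the `value.signature` layout are assumptions for the example. Note that signing only detects changes; it does not hide the contents, which would require encryption.

```python
import hashlib
import hmac

# Assumption: a secret known only to the servers, never sent to the client.
SECRET_KEY = b"replace-with-a-real-secret"


def _mac(value):
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()


def sign(value):
    """Return the cookie text: the value followed by its signature."""
    return f"{value}.{_mac(value)}"


def verify(cookie):
    """Return the original value if the signature matches, else None."""
    value, _, mac = cookie.rpartition(".")
    # compare_digest compares in constant time to avoid timing leaks.
    if hmac.compare_digest(mac.encode(), _mac(value).encode()):
        return value
    return None


if __name__ == "__main__":
    cookie = sign("user=alice")
    print(verify(cookie))
    forged = "user=admin." + cookie.rpartition(".")[2]
    print(verify(forged))
```

A client that edits the value without knowing the secret key cannot produce a matching signature, so the forged cookie is rejected.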

354
00:16:35,832 --> 00:16:37,540
But another concern
is that, as you start

355
00:16:37,540 --> 00:16:40,130
to store more and more information
inside of these cookies,

356
00:16:40,130 --> 00:16:43,540
these cookies keep getting sent back and
forth between the server and the client

357
00:16:43,540 --> 00:16:45,250
every time a request is made.

358
00:16:45,250 --> 00:16:48,040
That can start to get expensive,
too-- more and more information

359
00:16:48,040 --> 00:16:52,090
passing back and forth between
the client and between the server.

360
00:16:52,090 --> 00:16:54,580
So lots of possible
approaches-- no one approach

361
00:16:54,580 --> 00:16:57,040
that is necessarily the right
approach or the best approach

362
00:16:57,040 --> 00:16:58,270
to use in all cases.

363
00:16:58,270 --> 00:17:00,850
But things to be aware
of-- things to think about

364
00:17:00,850 --> 00:17:03,520
as we begin to deal with these
issues of scale, of making

365
00:17:03,520 --> 00:17:07,270
sure we have multiple servers that
are available for usage in case we do

366
00:17:07,270 --> 00:17:07,869
need it.

367
00:17:07,869 --> 00:17:10,930
But also making sure that, when
we do so, we don't break the user

368
00:17:10,930 --> 00:17:14,920
experience-- we don't result in a
situation where a user is logged in

369
00:17:14,920 --> 00:17:18,160
but then, suddenly,
isn't logged in at all.

370
00:17:18,160 --> 00:17:21,460
And so horizontal scaling gives
us this kind of capacity--

371
00:17:21,460 --> 00:17:24,760
the ability to have multiple
different servers, all of which

372
00:17:24,760 --> 00:17:27,880
can be dealing with user requests
and responding to those user requests

373
00:17:27,880 --> 00:17:28,890
as well.

374
00:17:28,890 --> 00:17:34,240
But a reasonable question to ask is,
how many of those servers do we need?

375
00:17:34,240 --> 00:17:36,850
Now, we can use benchmarking
to try to estimate this.

376
00:17:36,850 --> 00:17:40,190
If we have an estimate of how many
users are going to be on our website

377
00:17:40,190 --> 00:17:42,430
at any given time, we
can benchmark and see

378
00:17:42,430 --> 00:17:46,420
how many users can be handled by
a single server and extrapolate,

379
00:17:46,420 --> 00:17:49,330
based on that information,
to infer how many servers we

380
00:17:49,330 --> 00:17:52,000
might need in our web
application to be able to service

381
00:17:52,000 --> 00:17:53,650
all of these different users.
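
The extrapolation is simple arithmetic, sketched here in Python. The function name `servers_needed` and the sample numbers are made up for illustration; the per-server capacity would come from an actual benchmark.

```python
import math


def servers_needed(expected_users: int, users_per_server: int) -> int:
    """Estimate the server count from a benchmarked per-server capacity."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    # Round up: a partially loaded server is still a whole server. Keep at least one.
    return max(1, math.ceil(expected_users / users_per_server))
```

For example, if a benchmark suggests one server handles about 300 concurrent users, an estimate of 1,000 users gives `servers_needed(1000, 300)`, which rounds 3.33 up to 4.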

382
00:17:53,650 --> 00:17:56,680
But it might be the case that our
web application doesn't always

383
00:17:56,680 --> 00:17:58,540
have the same number of users.

384
00:17:58,540 --> 00:18:01,660
Maybe, sometimes, there are going to
be far more users than at other times.

385
00:18:01,660 --> 00:18:05,140
You might imagine, for example, that
in a news organization's website--

386
00:18:05,140 --> 00:18:07,690
like the web application
for a newspaper--

387
00:18:07,690 --> 00:18:09,720
when there's breaking
news, some big story,

388
00:18:09,720 --> 00:18:11,470
there's going to be a
lot more people that

389
00:18:11,470 --> 00:18:15,380
are all trying to access the website
at the same time than at other times.

390
00:18:15,380 --> 00:18:18,310
So one approach might
be, consider the maximum.

391
00:18:18,310 --> 00:18:20,650
What is the most number
of users that ever

392
00:18:20,650 --> 00:18:23,620
might be trying to use our web
application at any given time?

393
00:18:23,620 --> 00:18:26,830
And choose a number of servers
based on that maximum so that,

394
00:18:26,830 --> 00:18:28,960
no matter how high the
number of users gets,

395
00:18:28,960 --> 00:18:32,800
we will have enough servers to be
able to service all of those users.

396
00:18:32,800 --> 00:18:35,560
But that's probably not
a great economical choice

397
00:18:35,560 --> 00:18:39,250
if, in the vast majority of cases,
there will be far fewer users.

398
00:18:39,250 --> 00:18:42,625
In that case, you're going to have a
lot of servers that are underutilized--

399
00:18:42,625 --> 00:18:45,250
where you don't need that many
servers, but you're still paying

400
00:18:45,250 --> 00:18:47,770
for the electricity, for
keeping all of them running--

401
00:18:47,770 --> 00:18:50,740
which might not be an
ideal choice either.

402
00:18:50,740 --> 00:18:52,120
So one solution to this--

403
00:18:52,120 --> 00:18:54,970
quite popular, especially in
this world of cloud computing--

404
00:18:54,970 --> 00:18:58,660
is the idea of autoscaling
where you can have an autoscaler

405
00:18:58,660 --> 00:19:03,460
to say that, you know what, let's
start with, for example, two servers.

406
00:19:03,460 --> 00:19:05,470
But if there's enough
traffic to the website,

407
00:19:05,470 --> 00:19:07,678
if enough people are making
requests to the website--

408
00:19:07,678 --> 00:19:10,360
maybe it's a peak time where
people are using the website--

409
00:19:10,360 --> 00:19:11,830
go ahead and scale up.

410
00:19:11,830 --> 00:19:15,880
Go ahead and add a third server where
now our load balancer can balance

411
00:19:15,880 --> 00:19:18,100
between all three of those servers.

412
00:19:18,100 --> 00:19:20,710
And if even more traffic ends
up coming to the website--

413
00:19:20,710 --> 00:19:24,280
more users are trying to use this
application all at the same time--

414
00:19:24,280 --> 00:19:27,160
well, then we can go ahead and
add a fourth server as well.

415
00:19:27,160 --> 00:19:28,660
And we can continue to do that.

416
00:19:28,660 --> 00:19:31,510
Most autoscalers will let
you configure, for example,

417
00:19:31,510 --> 00:19:34,480
a minimum number of servers and
a maximum number of servers.

418
00:19:34,480 --> 00:19:37,420
And dependent on how many users
happen to be using your web

419
00:19:37,420 --> 00:19:40,300
application at any given
time, the autoscaler

420
00:19:40,300 --> 00:19:44,410
can scale up or scale down, adding
new servers as more users come

421
00:19:44,410 --> 00:19:47,410
to the website, removing
servers as fewer users are

422
00:19:47,410 --> 00:19:49,870
using the website as well.
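
The scaling decision itself can be sketched as a clamp between the configured minimum and maximum. This is a simplified model rather than the behavior of any specific cloud autoscaler: `target_server_count` and its default limits are assumptions for this example.

```python
import math


def target_server_count(active_users: int, users_per_server: int,
                        min_servers: int = 2, max_servers: int = 10) -> int:
    """Return how many servers should be running for the current load."""
    needed = math.ceil(active_users / users_per_server)
    # Never drop below the configured minimum or exceed the configured maximum.
    return max(min_servers, min(max_servers, needed))
```

With these defaults, light traffic keeps two servers running, a spike to 1,000 users at 300 users per server scales up to four, and an extreme spike stops at the maximum of ten.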

423
00:19:49,870 --> 00:19:52,425
And so this can be a nice
solution to this problem of scale

424
00:19:52,425 --> 00:19:55,050
where you don't have to worry
about how many servers there are.

425
00:19:55,050 --> 00:19:57,580
It just autoscales entirely on its own.

426
00:19:57,580 --> 00:19:59,080
Now, there are trade-offs here, too.

427
00:19:59,080 --> 00:20:01,250
This autoscaling
process might take time.

428
00:20:01,250 --> 00:20:05,260
And if a lot of users all come into
your website all at the exact same time,

429
00:20:05,260 --> 00:20:08,350
well, it's going to take
some time to be able to add

430
00:20:08,350 --> 00:20:10,630
all of these additional
servers to start them up.

431
00:20:10,630 --> 00:20:13,700
And so there might be some
trade-offs there, too,

432
00:20:13,700 --> 00:20:17,330
where you might not be able to
service all of the users immediately.

433
00:20:17,330 --> 00:20:19,380
And another problem
worth thinking about is,

434
00:20:19,380 --> 00:20:21,510
as you add more and
more of these servers,

435
00:20:21,510 --> 00:20:23,877
you introduce opportunities for failure.

436
00:20:23,877 --> 00:20:25,710
Now, it's better than
having a single server

437
00:20:25,710 --> 00:20:29,490
where, if that single server fails,
now suddenly the entire web application

438
00:20:29,490 --> 00:20:30,390
doesn't work at all.

439
00:20:30,390 --> 00:20:33,240
That's what we generally call
a single point of failure--

440
00:20:33,240 --> 00:20:37,410
a single place where, if it fails, the
entire system is going to be broken.

441
00:20:37,410 --> 00:20:39,720
One advantage of having
multiple servers is

442
00:20:39,720 --> 00:20:43,530
that we no longer have a single server
that acts as a point of failure.

443
00:20:43,530 --> 00:20:46,140
If one of the servers
goes down then, ideally,

444
00:20:46,140 --> 00:20:49,780
our load balancer should be able
to know, based on that information,

445
00:20:49,780 --> 00:20:53,370
to no longer send a request to
that particular server-- to,

446
00:20:53,370 --> 00:20:58,470
instead, balance the load across
the remaining three servers instead.

447
00:20:58,470 --> 00:21:00,640
Now, there's an interesting
question there as well,

448
00:21:00,640 --> 00:21:04,200
which is, how does the load
balancer know that this server is

449
00:21:04,200 --> 00:21:05,450
no longer responding?

450
00:21:05,450 --> 00:21:07,200
For some reason, it
has some sort of error

451
00:21:07,200 --> 00:21:09,763
that it's not able to process
requests appropriately.

452
00:21:09,763 --> 00:21:11,680
Well, there are multiple
ways you can do this.

453
00:21:11,680 --> 00:21:15,090
But one of the most common is what's
simply known as a heartbeat where,

454
00:21:15,090 --> 00:21:18,240
effectively, every so often,
every some number of seconds,

455
00:21:18,240 --> 00:21:20,700
the load balancer pings
all of the servers--

456
00:21:20,700 --> 00:21:23,280
just sends a quick request
to all the servers.

457
00:21:23,280 --> 00:21:26,250
And all of the servers are
supposed to respond back.

458
00:21:26,250 --> 00:21:29,010
And using that information,
the load balancer

459
00:21:29,010 --> 00:21:31,920
knows a little bit about the
latency of each of the servers--

460
00:21:31,920 --> 00:21:34,920
how long it took for the server
to respond to the request.

461
00:21:34,920 --> 00:21:37,440
But also, it can get
information about whether or not

462
00:21:37,440 --> 00:21:39,450
the server is functioning properly.

463
00:21:39,450 --> 00:21:42,157
If one of the servers
doesn't respond to the ping,

464
00:21:42,157 --> 00:21:44,490
well, then the load balancer
knows that there's probably

465
00:21:44,490 --> 00:21:47,640
something wrong with the server, that
we probably shouldn't be directing

466
00:21:47,640 --> 00:21:50,570
more users to that server at all.

467
00:21:50,570 --> 00:21:53,730
And so this can solve for the
problem of a single point of failure

468
00:21:53,730 --> 00:21:57,570
by allowing ourselves multiple servers
where, if any one of the servers fails,

469
00:21:57,570 --> 00:22:00,450
the load balancer learns
about that via heartbeat

470
00:22:00,450 --> 00:22:03,540
and then, based on that information,
can begin to redirect traffic

471
00:22:03,540 --> 00:22:05,847
to the other servers instead.
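
A heartbeat-driven load balancer can be sketched as follows. This is a toy model under stated assumptions: the `LoadBalancer` class is hypothetical, `ping` stands in for whatever quick request a real balancer would send, and round-robin is just one possible balancing policy.

```python
class LoadBalancer:
    """Round-robin balancer that skips servers which failed the last heartbeat."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)  # assume every server starts healthy
        self._next = 0

    def heartbeat(self, ping):
        """Ping every server; `ping(server)` returns True if the server responded."""
        self.healthy = {server for server in self.servers if ping(server)}

    def route(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

A real balancer would call `heartbeat` on a timer, every few seconds, and could also record how long each response took to track latency.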

472
00:22:05,847 --> 00:22:08,430
Now, one thing you might notice
is that, even in this picture,

473
00:22:08,430 --> 00:22:11,970
now the load balancer appears to
be like a single point of failure

474
00:22:11,970 --> 00:22:14,460
where, if the load balancer
happens to fail, well, now

475
00:22:14,460 --> 00:22:16,668
nothing is going to work
because the load balancer is

476
00:22:16,668 --> 00:22:18,810
the one responsible for
directing traffic to all

477
00:22:18,810 --> 00:22:20,190
of the various different servers.

478
00:22:20,190 --> 00:22:23,790
And so even though there is no single
server that is a point of failure,

479
00:22:23,790 --> 00:22:27,370
this load balancer also appears
to be a single point of failure.

480
00:22:27,370 --> 00:22:28,540
And that's definitely true.

481
00:22:28,540 --> 00:22:31,470
And you might imagine instead
having multiple load balancers

482
00:22:31,470 --> 00:22:35,310
where one load balancer goes down,
another load balancer can swoop in,

483
00:22:35,310 --> 00:22:39,000
acting as a hot spare where it picks up
all of the traffic that was originally

484
00:22:39,000 --> 00:22:40,650
going to the first load balancer.

485
00:22:40,650 --> 00:22:44,550
And if it ever goes down, a second
one is ready to take its place.

486
00:22:44,550 --> 00:22:47,700
And it might also be doing this kind
of heartbeat process-- checking up

487
00:22:47,700 --> 00:22:48,845
on the first load balancer.

488
00:22:48,845 --> 00:22:51,970
And if all goes well, the second load
balancer doesn't have to do anything.

489
00:22:51,970 --> 00:22:54,490
But if the first load
balancer ever were to fail,

490
00:22:54,490 --> 00:22:56,640
well, then the second
load balancer can step in

491
00:22:56,640 --> 00:22:59,700
and begin servicing those
requests, directing them to all

492
00:22:59,700 --> 00:23:01,840
of these individual servers as well.

493
00:23:01,840 --> 00:23:02,705
And so there, too--

494
00:23:02,705 --> 00:23:05,580
another opportunity to think about
where the single points of failure

495
00:23:05,580 --> 00:23:09,300
are and thinking about how we might
address the single points of failure

496
00:23:09,300 --> 00:23:12,330
in order to make sure that our
web applications are scalable.

497
00:23:12,330 --> 00:23:14,820
So that then deals with
issues about how we might

498
00:23:14,820 --> 00:23:17,070
go about scaling up these servers.

499
00:23:17,070 --> 00:23:20,340
But ultimately, the servers are
not the entirety of the story.

500
00:23:20,340 --> 00:23:22,350
Inside of our applications,
we've mostly been

501
00:23:22,350 --> 00:23:25,918
writing web applications that interact
and deal with data in some way.

502
00:23:25,918 --> 00:23:28,710
And there are multiple different
databases that we've talked about.

503
00:23:28,710 --> 00:23:30,900
SQLite has been the
default one that Django

504
00:23:30,900 --> 00:23:34,200
provides to us, which just
stores data inside of a file.

505
00:23:34,200 --> 00:23:36,020
But as we begin to
grow our applications,

506
00:23:36,020 --> 00:23:39,270
if we want to begin to scale them,
it's quite popular and quite common

507
00:23:39,270 --> 00:23:41,530
to put databases entirely
somewhere separate--

508
00:23:41,530 --> 00:23:44,340
to have a separate database server
running somewhere else where

509
00:23:44,340 --> 00:23:46,800
the servers are all
communicating with that database,

510
00:23:46,800 --> 00:23:50,550
whether we're running MySQL, or
Postgres, or some other database system

511
00:23:50,550 --> 00:23:51,750
instead.

512
00:23:51,750 --> 00:23:55,410
And all of the servers then
have access to that database.

513
00:23:55,410 --> 00:23:57,990
And so there, too, are
considerations that we

514
00:23:57,990 --> 00:24:00,420
need to take into account--
issues of how it is that we

515
00:24:00,420 --> 00:24:03,840
go about scaling up these databases.

516
00:24:03,840 --> 00:24:06,960
In this picture, for example,
you might imagine a load balancer

517
00:24:06,960 --> 00:24:08,730
that is communicating with two servers.

518
00:24:08,730 --> 00:24:10,950
But both of those
servers, for example, need

519
00:24:10,950 --> 00:24:13,200
to be communicating with this database.

520
00:24:13,200 --> 00:24:16,140
And much like any server can only
handle some number of requests,

521
00:24:16,140 --> 00:24:19,380
some number of users at any
given time, databases, too,

522
00:24:19,380 --> 00:24:23,280
can only handle some number of requests,
some concurrent number of connections

523
00:24:23,280 --> 00:24:24,250
at any given time.

524
00:24:24,250 --> 00:24:26,130
And so we need to begin
to think about issues

525
00:24:26,130 --> 00:24:30,120
of how it is that we scale these
databases as well in order to be

526
00:24:30,120 --> 00:24:33,330
able to handle more and more users.

527
00:24:33,330 --> 00:24:35,580
Now, one approach, the first
thing we might try to do,

528
00:24:35,580 --> 00:24:38,160
is something called database
partitioning-- effectively,

529
00:24:38,160 --> 00:24:42,270
splitting up what is a big data
set into multiple different parts

530
00:24:42,270 --> 00:24:43,470
to that data set.

531
00:24:43,470 --> 00:24:46,560
And we've already seen some
examples of database partitioning.

532
00:24:46,560 --> 00:24:49,890
We've seen one example where-- for
example, when we talked about SQL,

533
00:24:49,890 --> 00:24:53,130
we looked at a table of flights
where each flight had an origin

534
00:24:53,130 --> 00:24:57,840
city, the origin city's airport code,
the destination city, the destination

535
00:24:57,840 --> 00:25:00,120
city's airport code, and
some number of minutes,

536
00:25:00,120 --> 00:25:02,850
the duration for that particular flight.

537
00:25:02,850 --> 00:25:05,820
And we decided that storing all
of this data in a single table

538
00:25:05,820 --> 00:25:07,590
probably wasn't the best idea.

539
00:25:07,590 --> 00:25:10,170
And instead, we wanted
to split that data up

540
00:25:10,170 --> 00:25:13,380
in a type of partitioning where,
instead, we said, all right, let's just

541
00:25:13,380 --> 00:25:16,230
have one table that will
have all of the airports.

542
00:25:16,230 --> 00:25:20,440
And so each airport gets its own
row inside of this airports table.

543
00:25:20,440 --> 00:25:22,640
And we also had another
table which was just

544
00:25:22,640 --> 00:25:26,270
the flights table which, rather
than storing all of those columns,

545
00:25:26,270 --> 00:25:28,820
just mapped two airports to each other.

546
00:25:28,820 --> 00:25:32,660
With any given flight, it has an
origin ID, meaning which object,

547
00:25:32,660 --> 00:25:36,800
which row in the airports
table represents the flight's origin,

548
00:25:36,800 --> 00:25:39,680
and then which row in
the airports table is

549
00:25:39,680 --> 00:25:42,860
going to represent the
destination for that flight.

550
00:25:42,860 --> 00:25:45,530
So we took one table and
effectively split it up

551
00:25:45,530 --> 00:25:49,940
into multiple tables, each of
which ultimately had fewer columns.

552
00:25:49,940 --> 00:25:52,850
And this might be something we
call the vertical partitioning

553
00:25:52,850 --> 00:25:56,810
of a database where, instead of
just having single big long tables,

554
00:25:56,810 --> 00:25:59,420
we split them up into
multiple tables, each

555
00:25:59,420 --> 00:26:01,820
of which have fewer columns
that are able to represent

556
00:26:01,820 --> 00:26:03,497
data in a more relational way.

557
00:26:03,497 --> 00:26:05,330
And that's something
we've seen before, too.
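
The airports and flights split can be sketched with Python's built-in `sqlite3` module. The column names and the sample row are assumptions chosen for illustration; the point is that each flight stores only two airport IDs and a duration, and a join recovers the full picture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE airports (
    id INTEGER PRIMARY KEY,
    code TEXT NOT NULL,
    city TEXT NOT NULL
);
CREATE TABLE flights (
    id INTEGER PRIMARY KEY,
    origin_id INTEGER NOT NULL REFERENCES airports(id),
    destination_id INTEGER NOT NULL REFERENCES airports(id),
    duration INTEGER NOT NULL
);
INSERT INTO airports (id, code, city) VALUES (1, 'JFK', 'New York');
INSERT INTO airports (id, code, city) VALUES (2, 'LHR', 'London');
INSERT INTO flights (origin_id, destination_id, duration) VALUES (1, 2, 415);
""")

# Joining the two narrower tables reproduces the original wide row.
rows = conn.execute("""
SELECT o.city, o.code, d.city, d.code, f.duration
FROM flights AS f
JOIN airports AS o ON f.origin_id = o.id
JOIN airports AS d ON f.destination_id = d.id
""").fetchall()
```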

558
00:26:05,330 --> 00:26:07,460
But in addition to
vertical partitioning,

559
00:26:07,460 --> 00:26:11,090
we can also do horizontal
partitioning where the idea there

560
00:26:11,090 --> 00:26:13,340
is that we take a table
and just split it up

561
00:26:13,340 --> 00:26:17,390
into multiple tables that are all
storing effectively the same data,

562
00:26:17,390 --> 00:26:19,380
but split up into different data sets.

563
00:26:19,380 --> 00:26:22,520
So the same type of data, but
just in different tables--

564
00:26:22,520 --> 00:26:25,100
where we might have originally
had a flights table,

565
00:26:25,100 --> 00:26:28,490
and instead we split it up
into a domestic flights table

566
00:26:28,490 --> 00:26:30,380
and an international flights table.

567
00:26:30,380 --> 00:26:32,870
Each of these tables still
has the exact same columns.

568
00:26:32,870 --> 00:26:34,555
They still have a destination column.

569
00:26:34,555 --> 00:26:35,930
They still have an origin column.

570
00:26:35,930 --> 00:26:38,250
They still have a duration
column, for example.

571
00:26:38,250 --> 00:26:41,210
But we've just now taken the
data that used to be in one table

572
00:26:41,210 --> 00:26:46,040
and split up that data into two or more
multiple different tables instead--

573
00:26:46,040 --> 00:26:49,940
one for all the domestic flights, one
for all the international flights.

574
00:26:49,940 --> 00:26:52,370
And the advantage there
is that we no longer

575
00:26:52,370 --> 00:26:55,760
need to search through the entirety
of the data set if we're just looking

576
00:26:55,760 --> 00:26:57,780
for one domestic flight, for example.

577
00:26:57,780 --> 00:27:00,680
If you know the flight you're
looking for is a domestic flight,

578
00:27:00,680 --> 00:27:04,820
well, then it can be more efficient to
just search the domestic flights table

579
00:27:04,820 --> 00:27:08,270
and not bother searching through
the international flights table.

580
00:27:08,270 --> 00:27:11,300
And so if we're intelligent about
how we choose to take a table

581
00:27:11,300 --> 00:27:14,540
and split it up into multiple
different tables, the effect of that

582
00:27:14,540 --> 00:27:16,880
is that we can often
improve the efficiency

583
00:27:16,880 --> 00:27:19,190
of our searches, the
efficiency of our operations,

584
00:27:19,190 --> 00:27:21,830
because we're dealing with
multiple smaller tables

585
00:27:21,830 --> 00:27:24,320
where these operations can come faster.
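
Horizontal partitioning can be sketched the same way with `sqlite3`. The table names `flights_domestic` and `flights_international` and the helper functions are hypothetical; both tables share the same columns and only the rows differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both partitions have exactly the same columns.
COLUMNS = "(id INTEGER PRIMARY KEY, origin TEXT, destination TEXT, duration INTEGER)"
conn.execute(f"CREATE TABLE flights_domestic {COLUMNS}")
conn.execute(f"CREATE TABLE flights_international {COLUMNS}")


def table_for(domestic: bool) -> str:
    """Pick the partition that holds this kind of flight."""
    return "flights_domestic" if domestic else "flights_international"


def add_flight(origin, destination, duration, domestic):
    conn.execute(
        f"INSERT INTO {table_for(domestic)} (origin, destination, duration) VALUES (?, ?, ?)",
        (origin, destination, duration),
    )


def find_flights(origin, domestic):
    """Search only the one partition that can contain the answer."""
    return conn.execute(
        f"SELECT origin, destination, duration FROM {table_for(domestic)} WHERE origin = ?",
        (origin,),
    ).fetchall()


def find_all_flights(origin):
    """Searching across partitions has to query both tables and combine the results."""
    return conn.execute(
        "SELECT origin, destination, duration FROM flights_domestic WHERE origin = ? "
        "UNION ALL "
        "SELECT origin, destination, duration FROM flights_international WHERE origin = ?",
        (origin, origin),
    ).fetchall()
```

`find_flights` touches a single smaller table, while `find_all_flights` shows the extra work needed whenever a query spans both partitions.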

586
00:27:24,320 --> 00:27:27,350
One drawback though is that,
as we begin to split data

587
00:27:27,350 --> 00:27:31,250
across multiple different tables,
it becomes more expensive if ever we

588
00:27:31,250 --> 00:27:33,980
need to join this data
back together and connect

589
00:27:33,980 --> 00:27:36,290
all the domestic and
international flights by running

590
00:27:36,290 --> 00:27:37,790
separate queries on each.

591
00:27:37,790 --> 00:27:40,010
And so in that case, we'll
want to think about trying

592
00:27:40,010 --> 00:27:42,710
to separate our data in such
a way that, generally, we're

593
00:27:42,710 --> 00:27:46,750
only going to need to deal with one
table or the other at any given time.

594
00:27:46,750 --> 00:27:49,280
And so domestic and international
might be a reasonable way

595
00:27:49,280 --> 00:27:52,970
to split up our flights table because
maybe, most of the time, our airport

596
00:27:52,970 --> 00:27:54,860
just cares about
searching domestic flights

597
00:27:54,860 --> 00:27:56,630
if we know we're looking
for one kind of flight,

598
00:27:56,630 --> 00:27:59,030
or just cares about searching
for international flights

599
00:27:59,030 --> 00:28:01,405
if there are different people
or different computers that

600
00:28:01,405 --> 00:28:05,090
are going to handle each of
those different types of systems.

601
00:28:05,090 --> 00:28:08,630
And so partitioning our database can
sometimes help with issues of scale

602
00:28:08,630 --> 00:28:11,480
by making it faster to search
through large amounts of data

603
00:28:11,480 --> 00:28:14,480
and being able to represent
data a little bit more cleanly.

604
00:28:14,480 --> 00:28:17,840
But it still seems to represent
a single point of failure--

605
00:28:17,840 --> 00:28:22,850
that we have multiple servers now that
are all connected to the same database.

606
00:28:22,850 --> 00:28:24,890
And there, again, is a
single point of failure.

607
00:28:24,890 --> 00:28:27,353
If the database fails for
some reason, well now,

608
00:28:27,353 --> 00:28:29,270
suddenly, none of our
web application is going

609
00:28:29,270 --> 00:28:31,940
to work because all of
those servers are all

610
00:28:31,940 --> 00:28:35,180
connected to that exact same database.

611
00:28:35,180 --> 00:28:36,980
And so it's for that
reason that we might--

612
00:28:36,980 --> 00:28:39,230
just as we tried to add
more servers in order

613
00:28:39,230 --> 00:28:42,530
to solve the problem of a single
point of failure with our servers,

614
00:28:42,530 --> 00:28:45,410
we might also try database replication.

615
00:28:45,410 --> 00:28:48,860
Rather than just have a single
database in our web application,

616
00:28:48,860 --> 00:28:50,870
in order to guard against
potential failure,

617
00:28:50,870 --> 00:28:54,410
we might replicate our database--
have multiple different databases

618
00:28:54,410 --> 00:28:59,297
and, therefore, reduce the likelihood
that our application entirely fails.

619
00:28:59,297 --> 00:29:01,130
And there are a couple
of approaches that we

620
00:29:01,130 --> 00:29:03,020
can use for database replication.

621
00:29:03,020 --> 00:29:06,800
Two of the most common are what are
known as single-primary replication

622
00:29:06,800 --> 00:29:09,190
and multi-primary replication.

623
00:29:09,190 --> 00:29:11,760
And in single-primary
database replication,

624
00:29:11,760 --> 00:29:14,040
we have multiple different databases.

625
00:29:14,040 --> 00:29:17,930
But one of those databases is
considered to be the primary database.

626
00:29:17,930 --> 00:29:20,510
And what we mean by a primary
database is a database

627
00:29:20,510 --> 00:29:22,310
to which we can both read data--

628
00:29:22,310 --> 00:29:24,560
meaning select rows from the table--

629
00:29:24,560 --> 00:29:27,350
but also write data,
meaning insert rows,

630
00:29:27,350 --> 00:29:31,200
or update rows, or delete
rows to any of those tables.

631
00:29:31,200 --> 00:29:34,070
So in single-primary replication,
we have a single database

632
00:29:34,070 --> 00:29:36,260
where we can both read and write.

633
00:29:36,260 --> 00:29:38,680
And we have some number of
other databases-- in this case,

634
00:29:38,680 --> 00:29:40,100
two other databases--

635
00:29:40,100 --> 00:29:41,900
from which we can only read data.

636
00:29:41,900 --> 00:29:44,220
So we can get data from those databases.

637
00:29:44,220 --> 00:29:48,560
But we can't update, or insert,
or delete from those databases.

638
00:29:48,560 --> 00:29:52,490
And now we need some mechanism to
make sure that all of these databases

639
00:29:52,490 --> 00:29:53,750
are kept in sync.

640
00:29:53,750 --> 00:29:57,620
And ultimately, what that means is
that, any time the database changes,

641
00:29:57,620 --> 00:29:59,660
all of the databases are informed.

642
00:29:59,660 --> 00:30:02,390
Now, the only database that
can change is our primary one.

643
00:30:02,390 --> 00:30:04,250
This is the only one
that can be written to,

644
00:30:04,250 --> 00:30:06,740
the only one that allows
for the data to change.

645
00:30:06,740 --> 00:30:08,180
The others are read only.

646
00:30:08,180 --> 00:30:12,170
So anytime this primary database
updates or changes in some way,

647
00:30:12,170 --> 00:30:16,540
it needs to inform the other
databases of that update.

648
00:30:16,540 --> 00:30:18,920
And so it informs the other
databases of that update.

649
00:30:18,920 --> 00:30:21,230
And now all of the
databases are kept in sync

650
00:30:21,230 --> 00:30:23,960
where, if you try and run a
query on any of these databases

651
00:30:23,960 --> 00:30:25,910
to select and get some
information, you'll

652
00:30:25,910 --> 00:30:30,440
get the same results from all of
these various different databases.
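
Single-primary replication can be modeled in a few lines of Python, with plain dictionaries standing in for databases. The `SinglePrimaryCluster` class is hypothetical, and it propagates each write to the replicas immediately, which ignores the delay a real system would have.

```python
class SinglePrimaryCluster:
    """One writable primary plus read-only replicas, modeled as dictionaries."""

    def __init__(self, replica_count: int = 2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_read = 0

    def write(self, key, value):
        """All writes go to the primary, which then informs every replica."""
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value  # simplification: synchronous, instant propagation

    def read(self, key):
        """Reads are spread across the replicas (or the primary if there are none)."""
        nodes = self.replicas or [self.primary]
        node = nodes[self._next_read % len(nodes)]
        self._next_read += 1
        return node.get(key)
```

Because every write passes through `self.primary`, the sketch also makes the drawback visible: the primary carries the whole write load, and nothing can be written if it fails.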

653
00:30:30,440 --> 00:30:32,990
Now, the single-primary
approach has some drawbacks.

654
00:30:32,990 --> 00:30:36,950
It has the drawback that only one of
these databases can be written to.

655
00:30:36,950 --> 00:30:38,750
So if you have a lot
of users that are all

656
00:30:38,750 --> 00:30:42,550
trying to write data to the
database at the exact same time,

657
00:30:42,550 --> 00:30:44,360
well, there might be
some issues here where

658
00:30:44,360 --> 00:30:46,370
this one database is
going to be carrying

659
00:30:46,370 --> 00:30:49,100
all of that load for all of
the people that might be trying

660
00:30:49,100 --> 00:30:51,860
to update and change that database.

661
00:30:51,860 --> 00:30:54,140
And it also has a
slightly smaller version

662
00:30:54,140 --> 00:30:57,140
of the same problem of a
single point of failure.

663
00:30:57,140 --> 00:31:00,770
There is no longer a single point of
failure for reading from that data.

664
00:31:00,770 --> 00:31:03,750
If you want to read from the data,
and one of the databases goes out,

665
00:31:03,750 --> 00:31:07,340
you can read data from any of the other
databases, and they'll work just fine.

666
00:31:07,340 --> 00:31:10,670
But it does have the drawback
that, if this database fails,

667
00:31:10,670 --> 00:31:13,040
if our primary database
fails, well, then

668
00:31:13,040 --> 00:31:14,750
we're no longer able to write data.

669
00:31:14,750 --> 00:31:17,150
If we want to update data
inside of our database,

670
00:31:17,150 --> 00:31:19,910
this one database is no longer
going to be operational.

671
00:31:19,910 --> 00:31:24,673
And none of the other databases are
going to allow us to write new changes.

672
00:31:24,673 --> 00:31:27,840
So there are a couple of approaches we
can use to try to solve this problem.

673
00:31:27,840 --> 00:31:31,145
One approach though is, instead of
having a single-primary database--

674
00:31:31,145 --> 00:31:33,950
a single database to which
we can read and write--

675
00:31:33,950 --> 00:31:36,610
to use a multi-primary approach.

676
00:31:36,610 --> 00:31:40,160
And in the multi-primary approach, we
have multiple databases, all of which

677
00:31:40,160 --> 00:31:41,810
we can read and write to.

678
00:31:41,810 --> 00:31:44,230
We can select rows
from all the databases.

679
00:31:44,230 --> 00:31:48,780
And we can insert, update, and delete
rows in all of these databases as well.

680
00:31:48,780 --> 00:31:52,050
But now the synchronization process
becomes a little bit trickier.

681
00:31:52,050 --> 00:31:54,050
And here, now, is the
trade-off-- that now we've

682
00:31:54,050 --> 00:31:55,850
replicated the number
of reads and writes

683
00:31:55,850 --> 00:31:59,870
we can do by having many databases to
which we can read data and write data.

684
00:31:59,870 --> 00:32:02,870
But anytime any of
these databases changes,

685
00:32:02,870 --> 00:32:07,695
every database needs to inform all of
the other databases of those updates.

686
00:32:07,695 --> 00:32:10,070
And that's, certainly, going
to take some amount of time.

687
00:32:10,070 --> 00:32:13,160
It introduces some complexity
into our system as well.

688
00:32:13,160 --> 00:32:16,550
And it also introduces the
possibility for conflicts.

689
00:32:16,550 --> 00:32:19,550
You might imagine situations
where, if two people are editing

690
00:32:19,550 --> 00:32:21,830
similar data at the
same time, you might run

691
00:32:21,830 --> 00:32:24,080
into a number of different
types of conflicts.

692
00:32:24,080 --> 00:32:27,560
So one type of conflict, for
example, would be an update conflict.

693
00:32:27,560 --> 00:32:30,170
If I tried to edit one
row in one database,

694
00:32:30,170 --> 00:32:34,040
and someone else tries to edit the same
row in another database, when they sync

695
00:32:34,040 --> 00:32:36,230
up with each other via
this update process,

696
00:32:36,230 --> 00:32:38,600
our database system
needs some way to decide

697
00:32:38,600 --> 00:32:42,200
how it's going to resolve those
various different updates.

698
00:32:42,200 --> 00:32:44,880
Another conflict might
be a uniqueness conflict.

699
00:32:44,880 --> 00:32:46,907
We've seen, in the case
of databases in SQL

700
00:32:46,907 --> 00:32:48,740
that, when we're designing
our tables, I can

701
00:32:48,740 --> 00:32:51,980
specify that this particular
field should be a unique field--

702
00:32:51,980 --> 00:32:56,030
common one being the ID field, for
example, where every single row is

703
00:32:56,030 --> 00:32:58,100
going to have its own unique ID.

704
00:32:58,100 --> 00:33:01,670
Well, what happens if two people
try to insert data at the same time

705
00:33:01,670 --> 00:33:03,350
into two different databases?

706
00:33:03,350 --> 00:33:07,610
They're each given a unique ID, but it's
the same ID on both of the databases,

707
00:33:07,610 --> 00:33:11,240
because neither database knows that the
other database has added a new row yet.

708
00:33:11,240 --> 00:33:14,540
So when they sync back up, we might
run into a uniqueness conflict

709
00:33:14,540 --> 00:33:18,290
where two different databases
have assigned the same exact ID

710
00:33:18,290 --> 00:33:19,730
to multiple different entries.

711
00:33:19,730 --> 00:33:23,117
So we need some way to be able to
resolve those conflicts as well.
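
The uniqueness conflict can be reproduced with a small Python model. The `Primary` class and `find_conflicts` helper are hypothetical, and the second half shows one possible mitigation that is not described in the lecture: giving each primary its own interleaved ID sequence so the two can never hand out the same ID.

```python
class Primary:
    """A writable database that assigns IDs on its own, without asking its peers."""

    def __init__(self, first_id: int = 1, step: int = 1):
        self.next_id = first_id
        self.step = step
        self.rows = {}

    def insert(self, data):
        row_id = self.next_id
        self.next_id += self.step
        self.rows[row_id] = data
        return row_id


def find_conflicts(a: Primary, b: Primary):
    """Return IDs that both databases assigned to different rows."""
    return sorted(i for i in a.rows.keys() & b.rows.keys() if a.rows[i] != b.rows[i])


# Naive setup: both primaries count 1, 2, 3, ... and collide on the first insert.
a, b = Primary(), Primary()
a.insert("alice")
b.insert("bob")
naive_conflicts = find_conflicts(a, b)

# Assumed mitigation: one primary uses odd IDs, the other uses even IDs.
c, d = Primary(first_id=1, step=2), Primary(first_id=2, step=2)
c.insert("alice")
d.insert("bob")
c.insert("carol")
interleaved_conflicts = find_conflicts(c, d)
```

Interleaving avoids ID collisions only; update and delete conflicts on the same row still need their own resolution rule.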

712
00:33:23,117 --> 00:33:24,950
And there are many other
conflicts you might

713
00:33:24,950 --> 00:33:28,340
imagine trying to deal with-- one
example being, for instance, delete

714
00:33:28,340 --> 00:33:31,430
conflicts, where one person
tries to delete a row

715
00:33:31,430 --> 00:33:33,710
and another person tries
to update that row.

716
00:33:33,710 --> 00:33:35,278
Well, which should take precedence?

717
00:33:35,278 --> 00:33:36,320
Should we update the row?

718
00:33:36,320 --> 00:33:37,610
Should we delete the row?

719
00:33:37,610 --> 00:33:41,450
We need some way to be able to
make those decisions because there

720
00:33:41,450 --> 00:33:45,150
is some latency between when
a change is made to a database

721
00:33:45,150 --> 00:33:48,600
and when that database is able to
communicate with another database.

722
00:33:48,600 --> 00:33:51,290
So these issues of scale,
these issues of synchronization

723
00:33:51,290 --> 00:33:53,330
are always going to come
up as we start to deal

724
00:33:53,330 --> 00:33:56,970
with programs that are interacting with
more and more of this kind of data.

725
00:33:56,970 --> 00:33:59,810
And as a result, we need to
design more and more sophisticated

726
00:33:59,810 --> 00:34:04,040
systems that are able to deal
with those issues of scale.

727
00:34:04,040 --> 00:34:09,139
Now, ultimately, we'd ideally like to
reduce the number of different database

728
00:34:09,139 --> 00:34:10,130
servers that we have.

729
00:34:10,130 --> 00:34:12,692
Every additional database
server is going to cost time.

730
00:34:12,692 --> 00:34:13,900
It's going to cost resources.

731
00:34:13,900 --> 00:34:17,060
It costs money in terms of keeping
all of these servers running.

732
00:34:17,060 --> 00:34:20,960
And so, ideally, we'd like not
to have to talk to this database

733
00:34:20,960 --> 00:34:22,590
if we don't need to.

734
00:34:22,590 --> 00:34:26,360
So you might imagine, for example, a
news organization's website, something

735
00:34:26,360 --> 00:34:28,275
like the front page
of the New York Times.

736
00:34:28,275 --> 00:34:30,650
If you go to the home page of
the New York Times website,

737
00:34:30,650 --> 00:34:33,230
it displays all of the
day's headlines with images

738
00:34:33,230 --> 00:34:36,860
and with information about what each
of the stories are about, for example.

739
00:34:36,860 --> 00:34:39,983
And you might imagine that the way
they're doing something like this

740
00:34:39,983 --> 00:34:41,900
is that they have some
kind of database that's

741
00:34:41,900 --> 00:34:43,670
storing all of these news articles.

742
00:34:43,670 --> 00:34:46,040
And when you visit the front
page of the New York Times,

743
00:34:46,040 --> 00:34:48,290
it's going to do some
kind of database query--

744
00:34:48,290 --> 00:34:51,500
selecting all of the recent
top headlines, for example--

745
00:34:51,500 --> 00:34:56,460
and rendering all of that information
in an HTML page that you can see.

746
00:34:56,460 --> 00:34:57,930
And that would certainly work.

747
00:34:57,930 --> 00:35:00,440
But if a lot of people are
all requesting the front page

748
00:35:00,440 --> 00:35:04,670
at the same time, well, it probably
doesn't make all that much sense

749
00:35:04,670 --> 00:35:08,390
if the web application, every time,
is making a database query, getting

750
00:35:08,390 --> 00:35:13,040
the latest articles, and then displaying
that information to all of the users

751
00:35:13,040 --> 00:35:16,130
because the articles might not
be changing all that frequently.

752
00:35:16,130 --> 00:35:18,440
If one person makes
a request one second,

753
00:35:18,440 --> 00:35:21,710
and another person makes the
same request half a second later,

754
00:35:21,710 --> 00:35:26,150
it probably is not going to be useful
to re-request all of the information

755
00:35:26,150 --> 00:35:29,450
from the database, regenerate that
template yet again, because it's

756
00:35:29,450 --> 00:35:33,050
an expensive process of requesting
data from the database, of generating

757
00:35:33,050 --> 00:35:33,800
that template.

758
00:35:33,800 --> 00:35:36,710
We'd, ideally, like some way
of dealing with that problem.

759
00:35:36,710 --> 00:35:40,040
And the way we can deal with that
problem is some form of caching.

760
00:35:40,040 --> 00:35:44,300
And caching refers to a whole bunch
of different types of ideas and tools

761
00:35:44,300 --> 00:35:47,660
that we can use at various different
places inside of our system.

762
00:35:47,660 --> 00:35:50,390
But in general, when we're
talking about caching,

763
00:35:50,390 --> 00:35:54,680
we're talking about storing a saved
version of some information in a way

764
00:35:54,680 --> 00:35:58,340
that we can access it more quickly so
that we don't need to continue making

765
00:35:58,340 --> 00:36:00,720
requests to a database, for example.

766
00:36:00,720 --> 00:36:02,930
And so there are a number
of ways we can do caching.

767
00:36:02,930 --> 00:36:07,010
One way we can do caching is on the
client side via client-side caching

768
00:36:07,010 --> 00:36:08,850
where the idea is that your browser--

769
00:36:08,850 --> 00:36:11,030
whether it's Safari, or
Chrome, or something else--

770
00:36:11,030 --> 00:36:13,700
is able to cache data,
store information,

771
00:36:13,700 --> 00:36:17,070
so that the browser doesn't need
to re-request the same information

772
00:36:17,070 --> 00:36:19,050
the next time it visits the page.

773
00:36:19,050 --> 00:36:21,680
For example, if you request a
page and it loads an image--

774
00:36:21,680 --> 00:36:23,210
on the page, for example--

775
00:36:23,210 --> 00:36:25,850
and you reload the page,
well, your web browser

776
00:36:25,850 --> 00:36:28,760
might try and make a request
again for the exact same image

777
00:36:28,760 --> 00:36:30,020
and then display it to you.

778
00:36:30,020 --> 00:36:33,500
But an alternative might be
that your web browser could just

779
00:36:33,500 --> 00:36:35,960
save a copy of the
image inside of a cache

780
00:36:35,960 --> 00:36:40,280
to locally store a version of
the image so that, the next time

781
00:36:40,280 --> 00:36:42,860
that the user makes a request
to the website, the user

782
00:36:42,860 --> 00:36:45,410
doesn't need to reload
that entire image.

783
00:36:45,410 --> 00:36:48,650
And that might be true of entire
web pages and web resources--

784
00:36:48,650 --> 00:36:51,770
that if there is some page that
doesn't change very often then,

785
00:36:51,770 --> 00:36:55,850
if the web browser just stores a
cached, a saved version of that page,

786
00:36:55,850 --> 00:36:58,340
then the next time the user
goes to their web browser,

787
00:36:58,340 --> 00:37:03,020
tries to access that page, rather than
re-request to the server and make a new

788
00:37:03,020 --> 00:37:06,440
request that the server needs to
respond to, if the browser has that page

789
00:37:06,440 --> 00:37:09,530
cached, the browser can
just display the cached--

790
00:37:09,530 --> 00:37:13,830
saved-- version of the page, saving
the need to talk to the server at all.

791
00:37:13,830 --> 00:37:16,970
So this can certainly help to
reduce the load on any given server.

792
00:37:16,970 --> 00:37:20,360
If users are caching information
inside of the web browser,

793
00:37:20,360 --> 00:37:22,480
it makes the experience
faster for the user

794
00:37:22,480 --> 00:37:24,980
because they can see the
information immediately rather than

795
00:37:24,980 --> 00:37:28,070
need to make a request and wait
for a response to come back.

796
00:37:28,070 --> 00:37:30,140
And it's good for the
server because the server

797
00:37:30,140 --> 00:37:33,740
doesn't need to be dealing with as
many requests if some of those requests

798
00:37:33,740 --> 00:37:35,160
are getting cached.

799
00:37:35,160 --> 00:37:37,400
And so one approach to
trying to do this is

800
00:37:37,400 --> 00:37:42,290
by adding this inside of the
headers of an HTTP response.

801
00:37:42,290 --> 00:37:44,960
When your web server
responds to some requests,

802
00:37:44,960 --> 00:37:48,770
the web server can include a line
like this inside of the response--

803
00:37:48,770 --> 00:37:53,210
something like Cache-Control:
max-age=86400--

804
00:37:53,210 --> 00:37:56,330
in effect, specifying
the number of seconds

805
00:37:56,330 --> 00:37:58,850
that you should cache this resource for.

806
00:37:58,850 --> 00:38:02,510
But if I try to access
this page 10 seconds later,

807
00:38:02,510 --> 00:38:04,910
well, that's less than 86,400.

808
00:38:04,910 --> 00:38:08,600
So rather than reload and
re-request the entire page,

809
00:38:08,600 --> 00:38:11,390
we're just going to use the
version of the page that happens

810
00:38:11,390 --> 00:38:13,750
to be cached inside of the web browser.
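
The decision the browser is making there can be sketched in a few lines of Python. This is a simplification under stated assumptions: parse_max_age and is_fresh are hypothetical helper names, and a real browser looks at more than just max-age when deciding whether a cached copy is usable.

```python
import re


def parse_max_age(cache_control):
    """Return the max-age value in seconds, or None if it is absent."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None


def is_fresh(cache_control, age_seconds):
    """True if a copy stored age_seconds ago can still be reused."""
    max_age = parse_max_age(cache_control)
    return max_age is not None and age_seconds < max_age


header_value = "max-age=86400"   # the value of the Cache-Control header

print(is_fresh(header_value, 10))      # 10 seconds after caching
print(is_fresh(header_value, 90000))   # more than a day after caching
```

Ten seconds is less than 86,400, so the first check says the cached copy can be reused; after more than a day, the second check says to go back to the server.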

811
00:38:13,750 --> 00:38:16,250
And so this has several advantages,
that we've talked about,

812
00:38:16,250 --> 00:38:19,640
in terms of reducing the amount of time
it takes to see the content of the page

813
00:38:19,640 --> 00:38:23,570
because it's already saved and reducing
the load on any particular server.

814
00:38:23,570 --> 00:38:25,040
But it also has drawbacks.

815
00:38:25,040 --> 00:38:29,180
If, for example, the resource
changes within this amount of time--

816
00:38:29,180 --> 00:38:32,240
maybe in 60 seconds,
the page has changed--

817
00:38:32,240 --> 00:38:35,120
if I try and load the
page again, well, then

818
00:38:35,120 --> 00:38:37,400
if it's loading the cached
version of the page,

819
00:38:37,400 --> 00:38:40,400
I might be seeing an outdated
version of a web page.

820
00:38:40,400 --> 00:38:42,470
I'm seeing an older
version of the web page

821
00:38:42,470 --> 00:38:45,320
because my web browser
just so happens to have

822
00:38:45,320 --> 00:38:47,570
that particular resource cached.

823
00:38:47,570 --> 00:38:49,610
And this might be true of a web page.

824
00:38:49,610 --> 00:38:53,630
It's especially true of other static
resources, things like CSS files

825
00:38:53,630 --> 00:38:54,760
or JavaScript files.

826
00:38:54,760 --> 00:38:58,860
The CSS of a web page probably
doesn't change all that often.

827
00:38:58,860 --> 00:39:02,120
And so, as a result, it's pretty
natural that your web browser--

828
00:39:02,120 --> 00:39:05,870
rather than request the exact same CSS
files again, and again, and again--

829
00:39:05,870 --> 00:39:08,650
might just save a copy
of those CSS files,

830
00:39:08,650 --> 00:39:12,380
cache them, such that it's able
to just reuse the cached version.

831
00:39:12,380 --> 00:39:14,690
But if the website were
to update their CSS,

832
00:39:14,690 --> 00:39:16,355
you might not see the latest changes.

833
00:39:16,355 --> 00:39:18,230
And you might have
experienced this yourself.

834
00:39:18,230 --> 00:39:21,410
If you're working on your own web
applications, when you change your CSS

835
00:39:21,410 --> 00:39:23,270
and refresh the page,
you might not always

836
00:39:23,270 --> 00:39:27,900
see those changes reflected if your
web browser is caching those results.

837
00:39:27,900 --> 00:39:30,710
And so, in most web browsers,
you can do a hard refresh

838
00:39:30,710 --> 00:39:33,740
to say, ignore whatever is in
the cache, and actually go out

839
00:39:33,740 --> 00:39:36,030
and make a new request
and get some new data.

840
00:39:36,030 --> 00:39:38,810
But ultimately, if you
don't do that, you're

841
00:39:38,810 --> 00:39:42,230
subject to this cache control where
the web browser is going to say,

842
00:39:42,230 --> 00:39:44,750
unless this number of
seconds has elapsed,

843
00:39:44,750 --> 00:39:48,500
we're going to reuse the
existing version of the page.

844
00:39:48,500 --> 00:39:51,590
And so an alternative to this approach--
and this approach certainly works

845
00:39:51,590 --> 00:39:52,670
and is quite popular--

846
00:39:52,670 --> 00:39:56,950
we can add to this approach by
adding what's known as an ETag.

847
00:39:56,950 --> 00:40:00,290
An ETag for a resource--
like a CSS file, or an image,

848
00:40:00,290 --> 00:40:01,590
or a JavaScript file--

849
00:40:01,590 --> 00:40:04,190
is just some unique
sequence of characters

850
00:40:04,190 --> 00:40:07,610
that identifies a particular
version of a resource,

851
00:40:07,610 --> 00:40:11,300
that identifies a particular version
of a CSS file or a JavaScript file,

852
00:40:11,300 --> 00:40:12,930
for example.

853
00:40:12,930 --> 00:40:14,840
And what this allows a program to do--

854
00:40:14,840 --> 00:40:16,010
like a web browser--

855
00:40:16,010 --> 00:40:18,230
is that, when a web browser
requests a resource--

856
00:40:18,230 --> 00:40:21,410
makes a request for a CSS
file or a JavaScript file--

857
00:40:21,410 --> 00:40:22,370
they get it back.

858
00:40:22,370 --> 00:40:25,760
And they get its
associated ETag value, so I

859
00:40:25,760 --> 00:40:28,310
know that this is the
value that is associated

860
00:40:28,310 --> 00:40:31,040
with this version of the CSS file.

861
00:40:31,040 --> 00:40:35,720
And if the web server were ever to
change that CSS file, replace it

862
00:40:35,720 --> 00:40:41,820
with a new updated CSS file, the
corresponding ETag will also change.

863
00:40:41,820 --> 00:40:43,650
So why is this helpful?

864
00:40:43,650 --> 00:40:46,730
Well, it means that if I am
trying to decide, should I

865
00:40:46,730 --> 00:40:50,070
load a new version of
the resource or not,

866
00:40:50,070 --> 00:40:53,510
should I try and make another request
to get the latest version of the CSS,

867
00:40:53,510 --> 00:40:55,970
what I can do first
is just ask for, what

868
00:40:55,970 --> 00:40:59,660
is the ETag value, the short sequence
that can be answered very quickly?

869
00:40:59,660 --> 00:41:02,090
Very quickly, we can
just respond and say,

870
00:41:02,090 --> 00:41:05,360
you know what, if the ETag value
is the same as what I remembered

871
00:41:05,360 --> 00:41:07,850
from last time, well,
then I don't need to get

872
00:41:07,850 --> 00:41:10,340
a whole new version of that resource.

873
00:41:10,340 --> 00:41:13,070
And so this is quite common,
too, that a web browser will say,

874
00:41:13,070 --> 00:41:15,110
hey, let me request this resource.

875
00:41:15,110 --> 00:41:19,200
But I already have a version of the
resource with this particular ETag.

876
00:41:19,200 --> 00:41:24,110
So if that ETag is still the ETag for
the most recent version of a particular

877
00:41:24,110 --> 00:41:26,450
resource-- like a CSS
or JavaScript file--

878
00:41:26,450 --> 00:41:30,650
then no need for the web server to
send a new version of that file.

879
00:41:30,650 --> 00:41:33,650
Just go ahead and respond and say,
the version you have-- that one

880
00:41:33,650 --> 00:41:34,920
works-- totally fine.

881
00:41:34,920 --> 00:41:38,280
But if there is a new version, well,
then the web server can respond with

882
00:41:38,280 --> 00:41:41,130
the new asset-- the new
CSS file, for example--

883
00:41:41,130 --> 00:41:43,430
but also the new ETag value.
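
Here's a minimal sketch of that exchange in Python. In HTTP, the browser sends the ETag it remembers in an If-None-Match header, and the server can answer 304 Not Modified with no body. The make_etag and respond functions are hypothetical names, and hashing the file contents is just one common way to produce an ETag, assumed here for illustration.

```python
import hashlib


def make_etag(content: bytes) -> str:
    """Derive an ETag from the content (hashing is an assumption here)."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def respond(content: bytes, if_none_match=None):
    """Return (status, headers, body) for a possibly conditional request."""
    etag = make_etag(content)
    if if_none_match == etag:
        # The browser's copy is still current: send no body at all.
        return 304, {"ETag": etag}, b""
    # First request, or the file changed: send the file and its ETag.
    return 200, {"ETag": etag}, content


css_v1 = b"body { color: black; }"
css_v2 = b"body { color: navy; }"

# First visit: the browser gets the file and remembers its ETag.
status, headers, body = respond(css_v1)
cached_etag = headers["ETag"]

print(respond(css_v1, cached_etag)[0])   # file unchanged on the server
print(respond(css_v2, cached_etag)[0])   # file replaced on the server
```

When the file is unchanged, the response carries only a status and a short ETag, which is much cheaper than sending the whole file again.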

884
00:41:43,430 --> 00:41:46,160
So these two approaches can
work in concert with each other.

885
00:41:46,160 --> 00:41:49,220
You can say, go ahead and cache
this for some number of seconds

886
00:41:49,220 --> 00:41:51,020
so that, for some number
of seconds, you're

887
00:41:51,020 --> 00:41:54,680
not going to ever request a
new version of that resource.

888
00:41:54,680 --> 00:41:57,710
But even if you do ask for a
new version of the resource

889
00:41:57,710 --> 00:41:59,900
after this number of
seconds has elapsed,

890
00:41:59,900 --> 00:42:02,390
if the ETag value
hasn't updated, then no

891
00:42:02,390 --> 00:42:06,090
need to redownload a whole new
version of a particular file.

892
00:42:06,090 --> 00:42:08,750
You can just reuse the
version that happens

893
00:42:08,750 --> 00:42:10,890
to be cached already in the browser.

894
00:42:10,890 --> 00:42:14,270
So caching in the browser can
be an incredibly powerful tool

895
00:42:14,270 --> 00:42:17,000
for trying to speed up these
requests, for trying to reduce

896
00:42:17,000 --> 00:42:19,070
the load on any particular server.

897
00:42:19,070 --> 00:42:21,290
But the client side
is not the only place

898
00:42:21,290 --> 00:42:23,510
where we can begin to
do this kind of caching.

899
00:42:23,510 --> 00:42:26,330
We also have the ability
to do server-side caching.

900
00:42:26,330 --> 00:42:30,560
And in server-side caching, we're going
to introduce to our picture the notion

901
00:42:30,560 --> 00:42:31,940
of a cache--

902
00:42:31,940 --> 00:42:34,160
that we have these multiple
servers that are all

903
00:42:34,160 --> 00:42:35,720
communicating with the database.

904
00:42:35,720 --> 00:42:38,300
But these servers can also
communicate with a cache--

905
00:42:38,300 --> 00:42:41,360
someplace where we've
stored information that we

906
00:42:41,360 --> 00:42:46,340
might want to reuse later rather than
have to do all of that recalculation.

907
00:42:46,340 --> 00:42:49,280
And Django, it turns out, has
an entire cache framework,

908
00:42:49,280 --> 00:42:51,530
a whole host of features
that Django offers

909
00:42:51,530 --> 00:42:54,860
that allow us to leverage
this ability to use the cache

910
00:42:54,860 --> 00:42:56,470
to be able to speed up requests.

911
00:42:56,470 --> 00:42:59,150
So there are per-view
caches where you can

912
00:42:59,150 --> 00:43:02,720
specify a cache on a particular
view to say that, rather than run

913
00:43:02,720 --> 00:43:05,540
through all this Python code
every time someone makes

914
00:43:05,540 --> 00:43:09,410
a request to this
particular view, instead,

915
00:43:09,410 --> 00:43:14,150
just cache the view so that, for
the next 30 seconds or 30 minutes,

916
00:43:14,150 --> 00:43:16,940
the next time someone tries
to visit the same view,

917
00:43:16,940 --> 00:43:19,910
go ahead and just reuse the
results of the last time

918
00:43:19,910 --> 00:43:21,665
that that view was loaded.
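
The idea behind a per-view cache can be sketched in plain Python as a decorator. This is not Django's implementation; in Django itself, the decorator for this is cache_page, from django.views.decorators.cache, which takes a timeout in seconds. The cache_view name, the front_page view, and the clock parameter below are made up for this sketch.

```python
import time
from functools import wraps


def cache_view(seconds, clock=time.monotonic):
    """Reuse a view's last result for up to `seconds` seconds."""
    def decorator(view):
        saved = {}  # maps path -> (expires_at, response)

        @wraps(view)
        def wrapper(path):
            now = clock()
            hit = saved.get(path)
            if hit is not None and now < hit[0]:
                return hit[1]                    # still fresh: reuse it
            response = view(path)                # run the real view code
            saved[path] = (now + seconds, response)
            return response
        return wrapper
    return decorator


calls = []  # records each time the expensive view body actually runs


@cache_view(seconds=30)
def front_page(path):
    calls.append(path)   # stands in for database queries and rendering
    return f"<h1>Headlines for {path}</h1>"


front_page("/")
front_page("/")          # served from the cache, view body not run again
print(len(calls))
```

Two requests come in, but the view's Python code runs only once; the second visitor just gets the saved response.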

919
00:43:21,665 --> 00:43:23,540
And this can work not
just for a single view.

920
00:43:23,540 --> 00:43:25,657
It can work for fragments
inside of a template.

921
00:43:25,657 --> 00:43:27,740
Your template might have
multiple different parts.

922
00:43:27,740 --> 00:43:31,190
On your web page, you might render
the navigation bar, and the sidebar,

923
00:43:31,190 --> 00:43:33,800
and the footer, maybe based
on information about today

924
00:43:33,800 --> 00:43:36,050
that might change the next day.

925
00:43:36,050 --> 00:43:38,510
But if you expect that
the sidebar of your page

926
00:43:38,510 --> 00:43:41,570
is not going to change very
often within the same minute

927
00:43:41,570 --> 00:43:43,820
or within the same hour,
well, then you might imagine

928
00:43:43,820 --> 00:43:46,910
caching that part of the
template so that, the next time

929
00:43:46,910 --> 00:43:49,160
that Django tries to load
that entire template,

930
00:43:49,160 --> 00:43:52,550
it doesn't need to recalculate how to
generate the sidebar for your website.

931
00:43:52,550 --> 00:43:56,330
It just knows that we can use
the same version of the sidebar

932
00:43:56,330 --> 00:43:59,786
from the last time that we
loaded this website instead.

933
00:43:59,786 --> 00:44:03,600
And Django also gives you access
to a lower level cache API

934
00:44:03,600 --> 00:44:07,080
where, for any information that you
might want to cache and store for use

935
00:44:07,080 --> 00:44:10,140
later, you can save that
information inside of the cache.

936
00:44:10,140 --> 00:44:12,180
You make an expensive
database query that

937
00:44:12,180 --> 00:44:15,360
takes a couple of milliseconds or
a couple of seconds to process.

938
00:44:15,360 --> 00:44:17,760
You can save those
results inside of a cache

939
00:44:17,760 --> 00:44:20,550
to make it easier to access
that same data if ever you

940
00:44:20,550 --> 00:44:22,930
try to get access to that again.
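
That pattern of "check the cache first, and only run the query on a miss" might look something like this. Django's real cache object lives in django.core.cache and has get and set methods; the SimpleCache class below is only a stand-in written for this sketch, and top_headlines is a hypothetical function name.

```python
import time


class SimpleCache:
    """A tiny in-memory cache where entries expire after a timeout."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # maps key -> (expires_at, value)
        self._clock = clock

    def set(self, key, value, timeout):
        self._data[key] = (self._clock() + timeout, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[0]:
            return default   # missing or expired
        return entry[1]


cache = SimpleCache()
query_log = []  # records each time the "database" is actually queried


def top_headlines():
    headlines = cache.get("top_headlines")
    if headlines is None:
        query_log.append("SELECT ...")       # pretend this is slow
        headlines = ["Story A", "Story B"]
        cache.set("top_headlines", headlines, timeout=60)
    return headlines


top_headlines()
top_headlines()          # answered from the cache
print(len(query_log))
```

Only the first call pays for the query; for the next 60 seconds, every other call reuses the saved result.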

941
00:44:22,930 --> 00:44:26,430
So caching allows us to be able
to deal with these issues of scale

942
00:44:26,430 --> 00:44:29,910
by reducing load on our servers,
but also on our databases.

943
00:44:29,910 --> 00:44:33,330
Rather than need to talk to the
database every single time we

944
00:44:33,330 --> 00:44:36,750
make a new request for a
particular web application,

945
00:44:36,750 --> 00:44:39,060
we can just reuse
information that happens

946
00:44:39,060 --> 00:44:42,930
to be in the cache to allow our web
applications to become even more

947
00:44:42,930 --> 00:44:44,350
scalable.

948
00:44:44,350 --> 00:44:48,000
So that then was a look at some
issues concerning scalability.

949
00:44:48,000 --> 00:44:50,580
And we'll next turn our
attention to security--

950
00:44:50,580 --> 00:44:53,610
trying to make sure that, as we build
our web applications, as we deploy

951
00:44:53,610 --> 00:44:56,370
our web applications and
more users start to use them,

952
00:44:56,370 --> 00:44:58,290
we want to make sure
that they're secure.

953
00:44:58,290 --> 00:45:00,570
And there are a whole bunch
of security considerations

954
00:45:00,570 --> 00:45:03,170
to take into account
across all of the topics

955
00:45:03,170 --> 00:45:04,650
that we've looked at in the course.

956
00:45:04,650 --> 00:45:06,525
We've looked at a number
of different topics.

957
00:45:06,525 --> 00:45:09,400
And with each of them, there
are security vulnerabilities.

958
00:45:09,400 --> 00:45:12,720
There are ideas to be mindful of
when it comes to making sure

959
00:45:12,720 --> 00:45:14,580
that our applications are secure.

960
00:45:14,580 --> 00:45:18,420
And we can begin our story, in fact, by
talking about Git and version control.

961
00:45:18,420 --> 00:45:20,370
Git is all about trying
to make sure we're

962
00:45:20,370 --> 00:45:22,860
able to keep track of
different versions of our code.

963
00:45:22,860 --> 00:45:24,780
And one thing that goes
hand-in-hand with Git

964
00:45:24,780 --> 00:45:27,480
is this idea of open-source software.

965
00:45:27,480 --> 00:45:30,930
On websites like GitHub and other
services that host Git repositories,

966
00:45:30,930 --> 00:45:33,930
increasingly, a lot of software
is becoming open source

967
00:45:33,930 --> 00:45:38,190
where anyone can see and contribute
to the source code of an application.

968
00:45:38,190 --> 00:45:40,868
And this is great in the sense
that it allows for many people

969
00:45:40,868 --> 00:45:42,660
to be able to collaborate
and work together

970
00:45:42,660 --> 00:45:46,590
in order to try to find bugs that might
exist inside of a web application.

971
00:45:46,590 --> 00:45:48,810
But it also comes with
drawbacks-- drawbacks

972
00:45:48,810 --> 00:45:51,333
where, if there is a
bug in the application,

973
00:45:51,333 --> 00:45:54,000
now someone who's looking through
the source code of our program

974
00:45:54,000 --> 00:45:56,250
might be able to spot that bug.

975
00:45:56,250 --> 00:45:58,920
Or you might imagine
that, because Git keeps

976
00:45:58,920 --> 00:46:01,830
track of different versions
of our code every time

977
00:46:01,830 --> 00:46:04,050
we make a commit to our
repository, you have

978
00:46:04,050 --> 00:46:07,110
to be very careful when it comes
to credentials or things that

979
00:46:07,110 --> 00:46:08,910
might leak inside of the source code.

980
00:46:08,910 --> 00:46:12,600
You generally never want to put
passwords or any secure information

981
00:46:12,600 --> 00:46:15,990
inside of the Git repository
because the Git repository could

982
00:46:15,990 --> 00:46:19,000
be shared with other people and
might be open to anyone to look at.

983
00:46:19,000 --> 00:46:22,200
And so those are security
considerations to be mindful of there as

984
00:46:22,200 --> 00:46:25,920
well-- that if you make a commit, and
accidentally make a commit to your code

985
00:46:25,920 --> 00:46:29,610
where you expose those credentials,
you might remove those credentials

986
00:46:29,610 --> 00:46:32,160
and commit again so the
latest version of your program

987
00:46:32,160 --> 00:46:34,140
doesn't have those credentials in it.

988
00:46:34,140 --> 00:46:36,540
But someone who has access
to the Git repository

989
00:46:36,540 --> 00:46:39,150
has access not just to the
latest version of your code,

990
00:46:39,150 --> 00:46:41,110
but to every version of your code.

991
00:46:41,110 --> 00:46:43,650
And that person could,
theoretically, go back

992
00:46:43,650 --> 00:46:46,770
through the history of the
repository and find the commit

993
00:46:46,770 --> 00:46:51,040
where the credentials were exposed
and see those credentials as well.

994
00:46:51,040 --> 00:46:54,270
So while Git is a very powerful
tool, it's also one to be mindful of.

995
00:46:54,270 --> 00:46:57,840
Any change you make could potentially
get saved inside of a commit--

996
00:46:57,840 --> 00:47:00,690
could potentially, therefore,
be accessed later on.

997
00:47:00,690 --> 00:47:04,380
And so if ever credentials are
exposed inside of the repository,

998
00:47:04,380 --> 00:47:07,260
you want to make sure to wipe
out all of those previous commits

999
00:47:07,260 --> 00:47:09,690
and not just make some
new commit in order

1000
00:47:09,690 --> 00:47:13,740
to try and hide the previous credentials
that can be exposed because they can

1001
00:47:13,740 --> 00:47:17,010
still be retrieved if someone
goes back through the history

1002
00:47:17,010 --> 00:47:19,300
of any particular repository.
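
One common way to keep credentials out of a repository in the first place is to read them from the environment at runtime rather than writing them in the code. Here's a minimal sketch of that, assuming a hypothetical get_secret helper and a hypothetical DATABASE_PASSWORD variable name; neither comes from this course's code.

```python
import os


def get_secret(name, environ=os.environ):
    """Read a secret from the environment instead of the source code."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first")
    return value


# For the demonstration, a plain dictionary stands in for the real
# environment, and the value is a placeholder, not a real password.
fake_env = {"DATABASE_PASSWORD": "example-only"}
password = get_secret("DATABASE_PASSWORD", environ=fake_env)
print(len(password))
```

Because the real value lives only on the server and never in a file that gets committed, there's nothing sensitive for anyone to find in the history.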

1003
00:47:19,300 --> 00:47:23,025
And so that, then, was a look at
some issues that might surround Git.

1004
00:47:23,025 --> 00:47:24,900
We also talked at the
beginning of the course

1005
00:47:24,900 --> 00:47:28,110
about HTML, and about what it
is that we can use with HTML,

1006
00:47:28,110 --> 00:47:32,040
and how we can use this language in
order to design the structure of a web

1007
00:47:32,040 --> 00:47:36,150
page, in order to decide where all
of the paragraphs are going to be,

1008
00:47:36,150 --> 00:47:38,070
what tables are going to be on the page.

1009
00:47:38,070 --> 00:47:40,710
We talked about links and
how we can use anchor tags

1010
00:47:40,710 --> 00:47:42,960
to link one page to another page.

1011
00:47:42,960 --> 00:47:47,640
Now, one concern is this type of attack
known as a phishing attack with HTML.

1012
00:47:47,640 --> 00:47:49,830
And a phishing attack
really just comes down

1013
00:47:49,830 --> 00:47:53,100
to a little bit of HTML that looks
like this-- very easy to write,

1014
00:47:53,100 --> 00:47:57,690
where I have an anchor tag that is
going to direct the user to URL 1.

1015
00:47:57,690 --> 00:48:01,860
But it looks like it
directs the user to URL 2.

1016
00:48:01,860 --> 00:48:03,930
So what might an example of this be?

1017
00:48:03,930 --> 00:48:05,380
All right, so we'll take a look.

1018
00:48:05,380 --> 00:48:09,280
I'll go ahead and open up link.html.

1019
00:48:09,280 --> 00:48:11,770
And in link.html, I have a
website that I've written

1020
00:48:11,770 --> 00:48:13,950
that appears to have a link to Google.

1021
00:48:13,950 --> 00:48:16,030
But if I click on that
link, I'm suddenly

1022
00:48:16,030 --> 00:48:19,162
directed to this course's
website, for example.

1023
00:48:19,162 --> 00:48:20,120
So how did that happen?

1024
00:48:20,120 --> 00:48:20,953
Why did that happen?

1025
00:48:20,953 --> 00:48:22,670
It seems like it's linking to Google.

1026
00:48:22,670 --> 00:48:26,290
Well, if you look at the code, if
I go ahead and open up link.html,

1027
00:48:26,290 --> 00:48:31,360
we'll see that here I have an anchor
tag that actually links to the course

1028
00:48:31,360 --> 00:48:34,150
website but appears to
be linking-- the text

1029
00:48:34,150 --> 00:48:37,900
that the user sees appears that
it is linking instead to Google.

1030
00:48:37,900 --> 00:48:41,360
And so this is a very common attack
vector, especially in emails,

1031
00:48:41,360 --> 00:48:41,980
for example.

1032
00:48:41,980 --> 00:48:45,040
You might see an email that tells
you to click on a particular link.

1033
00:48:45,040 --> 00:48:48,070
But that link takes you to
somewhere else entirely instead.

1034
00:48:48,070 --> 00:48:50,380
And as a result, someone
might inadvertently

1035
00:48:50,380 --> 00:48:54,010
share their bank account credentials
or other sensitive information.

1036
00:48:54,010 --> 00:48:57,220
And so here, too, something to be mindful
of as you interact with the web,

1037
00:48:57,220 --> 00:49:00,490
maybe not necessarily on your own
website, but in other websites

1038
00:49:00,490 --> 00:49:03,940
that you might interact with, just to be
mindful about where links are actually

1039
00:49:03,940 --> 00:49:04,580
taking you.

1040
00:49:04,580 --> 00:49:07,300
And most web browsers,
if you hover over a link,

1041
00:49:07,300 --> 00:49:09,400
will show you where
that link might actually

1042
00:49:09,400 --> 00:49:12,010
be directing you to because it
might be different than what

1043
00:49:12,010 --> 00:49:17,930
the text of that particular anchor tag
might appear to link you to instead.

1044
00:49:17,930 --> 00:49:21,017
So HTML has all these various
different vulnerabilities

1045
00:49:21,017 --> 00:49:24,100
where, because you can just decide
what you want the structure of the page

1046
00:49:24,100 --> 00:49:26,710
to be, it leaves open the
possibility that someone

1047
00:49:26,710 --> 00:49:29,770
might try to trick you into thinking
that you were going to a page

1048
00:49:29,770 --> 00:49:31,420
that you're not actually on.

1049
00:49:31,420 --> 00:49:34,150
And this problem is more
widespread because anyone

1050
00:49:34,150 --> 00:49:36,580
can look at the HTML for any page.

1051
00:49:36,580 --> 00:49:38,950
HTML comes back from the server.

1052
00:49:38,950 --> 00:49:42,310
And therefore, the web browser
has access to all of that HTML

1053
00:49:42,310 --> 00:49:46,270
and can use that HTML in order
to render a page, for example.

1054
00:49:46,270 --> 00:49:49,150
And this leaves open other
vulnerabilities, too.

1055
00:49:49,150 --> 00:49:54,760
For example, let me go ahead and
go to bankofamerica.com, just

1056
00:49:54,760 --> 00:49:55,900
Bank of America's website.

1057
00:49:55,900 --> 00:49:57,850
You can go to any other website instead.

1058
00:49:57,850 --> 00:50:01,600
If I wanted to create a fake version
of Bank of America's website,

1059
00:50:01,600 --> 00:50:03,820
for example, to trick
people into thinking

1060
00:50:03,820 --> 00:50:05,740
they're going to Bank
of America's website

1061
00:50:05,740 --> 00:50:08,950
when really they're going to my
website, well, then what I can do

1062
00:50:08,950 --> 00:50:11,420
is just go ahead and view
the source of this page.

1063
00:50:11,420 --> 00:50:13,940
I go ahead and view page source.

1064
00:50:13,940 --> 00:50:17,990
And here is all of the HTML
for Bank of America's website.

1065
00:50:17,990 --> 00:50:21,410
And nothing then stops me
from copying all this content,

1066
00:50:21,410 --> 00:50:27,440
going into an HTML file, and creating a
new file that I'll just call bank.html.

1067
00:50:27,440 --> 00:50:31,350
And I'll go ahead and paste in
the contents of that HTML file,

1068
00:50:31,350 --> 00:50:34,700
so here, then, is all of
Bank of America's HTML.

1069
00:50:34,700 --> 00:50:37,190
And now, if I open up bank.html--

1070
00:50:37,190 --> 00:50:39,920
that HTML file that I have
now written, but really

1071
00:50:39,920 --> 00:50:42,320
just copied from Bank of America--

1072
00:50:42,320 --> 00:50:43,730
I open it up.

1073
00:50:43,730 --> 00:50:47,000
And now here, on my
page, is a web page that

1074
00:50:47,000 --> 00:50:48,680
appears to look like Bank of America.

1075
00:50:48,680 --> 00:50:51,170
It's using all of Bank
of America's HTML.

1076
00:50:51,170 --> 00:50:56,130
But instead, it is my HTML page
and not, actually, Bank of America.

1077
00:50:56,130 --> 00:51:00,350
And so you might imagine combining
these to create an even more concerning

1078
00:51:00,350 --> 00:51:03,050
attack vector where, instead
of linking to google.com,

1079
00:51:03,050 --> 00:51:06,461
let me try and link
to bankofamerica.com.

1080
00:51:06,461 --> 00:51:12,170
But where I'm actually going to
link to is bank.html, my version

1081
00:51:12,170 --> 00:51:14,180
of Bank of America's website.

1082
00:51:14,180 --> 00:51:18,170
Now, if I open up
link.html, here appears

1083
00:51:18,170 --> 00:51:20,900
to be a link that links
me to Bank of America.

1084
00:51:20,900 --> 00:51:23,180
If I click on that link,
I get to a page that

1085
00:51:23,180 --> 00:51:25,250
looks like Bank of America's website.

1086
00:51:25,250 --> 00:51:27,260
But it's not Bank of America's website.

1087
00:51:27,260 --> 00:51:30,490
It's my bank.html file
that I have written.

1088
00:51:30,490 --> 00:51:33,140
It just so happens to look
like Bank of America's website

1089
00:51:33,140 --> 00:51:36,620
because I copied all of
that underlying HTML.
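
[Editor's sketch, not part of the lecture] The deceptive link described above comes down to a mismatch between the text a link displays and the address it actually points to. As a rough illustration, and assuming hypothetical helper names (LinkAuditor, looks_deceptive) that do not appear in the course, Python's standard html.parser can list each link's visible text next to its real destination:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAuditor(HTMLParser):
    """Collect (visible text, href) pairs for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # pieces of visible text inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def looks_deceptive(text, href):
    """Flag a link whose visible text names a host that the href does not point to."""
    shown = urlparse(text if "//" in text else "//" + text).hostname
    actual = urlparse(href).hostname
    return bool(shown) and "." in shown and shown != actual


# The link from the demonstration: it displays one site but opens a local copy.
page = '<a href="bank.html">bankofamerica.com</a>'
auditor = LinkAuditor()
auditor.feed(page)
for text, href in auditor.links:
    verdict = "deceptive" if looks_deceptive(text, href) else "ok"
    print(text, "->", href, verdict)
```

This is only a heuristic for illustration; it would not catch every way a page can disguise where a link leads.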

1090
00:51:36,620 --> 00:51:39,860
So HTML has the ability to describe
the structure of our web page.

1091
00:51:39,860 --> 00:51:43,790
But anytime you're writing this HTML,
it's good to be mindful of the fact

1092
00:51:43,790 --> 00:51:48,110
that anyone can copy your HTML, could
theoretically pretend to be you.

1093
00:51:48,110 --> 00:51:50,090
These are security
vulnerabilities that are

1094
00:51:50,090 --> 00:51:53,240
worth bearing in mind as we
start to develop web applications

1095
00:51:53,240 --> 00:51:56,910
and interact with web
applications as well.

1096
00:51:56,910 --> 00:52:01,070
So ultimately, we used HTML in the
context of designing web applications

1097
00:52:01,070 --> 00:52:02,960
using Django, a framework.

1098
00:52:02,960 --> 00:52:05,690
And how exactly, then,
did these web frameworks

1099
00:52:05,690 --> 00:52:10,250
work in terms of creating these web
servers that are listening for requests

1100
00:52:10,250 --> 00:52:12,650
and that are responding
to those requests?

1101
00:52:12,650 --> 00:52:14,390
Well, ultimately, much
of the internet is

1102
00:52:14,390 --> 00:52:17,930
based around this idea of a client
communicating with a server or, more

1103
00:52:17,930 --> 00:52:20,420
generally, any one
computer communicating

1104
00:52:20,420 --> 00:52:23,810
with another computer using
HTTP and, in particular,

1105
00:52:23,810 --> 00:52:28,618
HTTPS, a more secure version
of the HTTP protocol.

1106
00:52:28,618 --> 00:52:31,160
And so you imagine that what
these protocols are really about

1107
00:52:31,160 --> 00:52:34,200
is how information gets
from one person to another

1108
00:52:34,200 --> 00:52:36,110
and what we're storing
with that information.

1109
00:52:36,110 --> 00:52:39,680
We have one computer trying to
communicate with some other computer.

1110
00:52:39,680 --> 00:52:42,440
And in order to do so,
information is generally

1111
00:52:42,440 --> 00:52:45,020
going to flow through these routers.

1112
00:52:45,020 --> 00:52:47,270
You might imagine information
going back and forth

1113
00:52:47,270 --> 00:52:49,610
between one computer
and another computer,

1114
00:52:49,610 --> 00:52:53,540
going through these intermediate
routers along the way.

1115
00:52:53,540 --> 00:52:56,390
And as a result, one
thing to be cautious about

1116
00:52:56,390 --> 00:52:58,400
is, how do you know that
this information that's

1117
00:52:58,400 --> 00:53:02,390
getting passed back and forth is
getting passed back and forth securely?

1118
00:53:02,390 --> 00:53:05,150
Ideally, when I send a
message to another computer--

1119
00:53:05,150 --> 00:53:07,190
I'm sending an email
to someone else, I'm

1120
00:53:07,190 --> 00:53:09,800
sending a message, I'm making
a request to a website that

1121
00:53:09,800 --> 00:53:13,130
might contain sensitive information,
like my bank account, for example--

1122
00:53:13,130 --> 00:53:17,030
I don't want it so that any intercepting
router that is taking my request

1123
00:53:17,030 --> 00:53:18,260
and passing it along--

1124
00:53:18,260 --> 00:53:21,170
I don't want those routers to
be able to look at that request

1125
00:53:21,170 --> 00:53:24,950
and see the contents of my email
or the contents of what password

1126
00:53:24,950 --> 00:53:27,620
I happen to be sending
across the web or not.

1127
00:53:27,620 --> 00:53:31,005
Ideally, I'd like for this
information to be encrypted.

1128
00:53:31,005 --> 00:53:33,380
And so here, we'll talk a
little bit about cryptography--

1129
00:53:33,380 --> 00:53:35,450
this process of trying
to make sure that I

1130
00:53:35,450 --> 00:53:37,850
am able to communicate
with some other person

1131
00:53:37,850 --> 00:53:42,860
without some eavesdropper in the middle
being able to intercept that message.

1132
00:53:42,860 --> 00:53:45,555
Obviously, if I just
take a plain text version

1133
00:53:45,555 --> 00:53:47,930
of the message I'm trying to
send and just literally take

1134
00:53:47,930 --> 00:53:51,560
the text of the message I'm trying
to send and effectively pass it along

1135
00:53:51,560 --> 00:53:53,660
across the internet,
well, then anyone who

1136
00:53:53,660 --> 00:53:57,430
is able to see that message is going to
know what the text of that message is.

1137
00:53:57,430 --> 00:53:59,420
And so I want to do
some kind of encryption,

1138
00:53:59,420 --> 00:54:02,900
some way of encrypting that message
so that someone along the way

1139
00:54:02,900 --> 00:54:06,230
won't be able to do that decryption
if a router in the middle

1140
00:54:06,230 --> 00:54:09,408
or someone in the middle is
able to intercept that message.

1141
00:54:09,408 --> 00:54:11,450
And so the first approach
we'll look at is what's

1142
00:54:11,450 --> 00:54:14,030
known as secret-key cryptography.

1143
00:54:14,030 --> 00:54:19,160
In secret-key cryptography, I have
not just the plaintext, but some key,

1144
00:54:19,160 --> 00:54:23,600
some secret piece of information
that can be used in order to encrypt

1145
00:54:23,600 --> 00:54:25,550
or decrypt information.

1146
00:54:25,550 --> 00:54:29,600
And so I'll use both the
key and the plaintext

1147
00:54:29,600 --> 00:54:33,710
to generate what's known as the
ciphertext, the encrypted version

1148
00:54:33,710 --> 00:54:35,690
of the message I'm trying to send.

1149
00:54:35,690 --> 00:54:39,080
And then, instead of
sending the plaintext

1150
00:54:39,080 --> 00:54:41,540
across the internet
to the other person, I

1151
00:54:41,540 --> 00:54:44,870
might instead want to just send
the ciphertext across the internet

1152
00:54:44,870 --> 00:54:48,050
to the other person so that I'm
not sending the plain version

1153
00:54:48,050 --> 00:54:49,700
of the message across the internet.

1154
00:54:49,700 --> 00:54:51,560
So the ciphertext goes across.

1155
00:54:51,560 --> 00:54:54,270
And the other person
will also need the key.

1156
00:54:54,270 --> 00:54:57,835
Now, if the other person has
both the ciphertext and the key,

1157
00:54:57,835 --> 00:54:59,960
well, then using that
information, the other person

1158
00:54:59,960 --> 00:55:02,960
can use the key to
decrypt the ciphertext

1159
00:55:02,960 --> 00:55:05,800
and obtain the original plaintext.

1160
00:55:05,800 --> 00:55:10,340
And this key is what we might call a
symmetric key encryption and decryption

1161
00:55:10,340 --> 00:55:10,840
key.

1162
00:55:10,840 --> 00:55:13,820
You use the key in order
to encrypt messages.

1163
00:55:13,820 --> 00:55:17,600
And you use the same key in order
to do the decryption process.

1164
00:55:17,600 --> 00:55:21,050
And as long as both I and the person
I'm communicating with both have access

1165
00:55:21,050 --> 00:55:25,760
to that key, well, then we'll be able to
encrypt messages and decrypt messages.

1166
00:55:25,760 --> 00:55:28,610
And someone who just has the
ciphertext but not the key

1167
00:55:28,610 --> 00:55:33,160
likely won't be able to figure out
what that original message was.
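
[Editor's sketch, not part of the lecture] One very small way to see secret-key encryption in action is a one-time pad, where the same random key both encrypts and decrypts by XOR. The lecture does not name a specific cipher, so the choice of a one-time pad and the helper name xor_bytes are assumptions made purely for illustration; real systems use vetted ciphers such as AES.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching byte of key."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


plaintext = b"Hello, world!"

# The shared secret: both sides must already hold this exact key.
key = secrets.token_bytes(len(plaintext))

# Encrypting and decrypting are the same operation with the same key.
ciphertext = xor_bytes(plaintext, key)
recovered = xor_bytes(ciphertext, key)

print(ciphertext.hex())
print(recovered)
```

Note that the key is generated once and never reused; reusing a one-time pad key breaks its security.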

1168
00:55:33,160 --> 00:55:36,370
But there's a problem here, especially
in the context of the internet.

1169
00:55:36,370 --> 00:55:41,500
And that is that both I and the other
person need to have access to this key.

1170
00:55:41,500 --> 00:55:45,320
The key is what I use to do the
encryption and the decryption.

1171
00:55:45,320 --> 00:55:48,978
And I can't just send the key across
the internet to the other person

1172
00:55:48,978 --> 00:55:51,520
because, if I do that, well,
then someone in the middle who's

1173
00:55:51,520 --> 00:55:54,130
intercepting all of my
requests could intercept

1174
00:55:54,130 --> 00:55:56,740
both the ciphertext and the key.

1175
00:55:56,740 --> 00:56:00,670
And therefore, they would be able
to decrypt the message because they

1176
00:56:00,670 --> 00:56:03,260
have both the ciphertext and the key.

1177
00:56:03,260 --> 00:56:07,090
Now, if I were able to go to another
person in person and exchange

1178
00:56:07,090 --> 00:56:10,390
this secret key in secret,
well, then this scheme

1179
00:56:10,390 --> 00:56:12,490
might work, because
we both have the key.

1180
00:56:12,490 --> 00:56:16,360
And I didn't share the key publicly with
anyone who might intercept the message.

1181
00:56:16,360 --> 00:56:18,970
Only I and the other person had the key.

1182
00:56:18,970 --> 00:56:21,157
But in general, when
communicating on the internet,

1183
00:56:21,157 --> 00:56:22,990
you're not communicating
with servers you've

1184
00:56:22,990 --> 00:56:25,210
necessarily communicated with before.

1185
00:56:25,210 --> 00:56:27,880
I might be trying to make
a request to a new website.

1186
00:56:27,880 --> 00:56:32,770
And we somehow still need to agree on
a system where I can encrypt messages

1187
00:56:32,770 --> 00:56:35,110
but only the other
person on the other side

1188
00:56:35,110 --> 00:56:38,990
is able to decrypt
those messages instead.

1189
00:56:38,990 --> 00:56:42,460
So this kind of cryptography--
probably not great

1190
00:56:42,460 --> 00:56:47,300
for initially trying to create
a secure connection on the internet.

1191
00:56:47,300 --> 00:56:49,810
And for that reason, a major
advancement in cryptography

1192
00:56:49,810 --> 00:56:54,970
that allows for the internet to work is
this notion of public-key cryptography.

1193
00:56:54,970 --> 00:56:56,890
In secret-key cryptography,
it's important

1194
00:56:56,890 --> 00:57:00,280
that the key is secret because, if
the key were known by everyone, well,

1195
00:57:00,280 --> 00:57:03,040
then anyone would be
able to decrypt messages.

1196
00:57:03,040 --> 00:57:06,730
In public-key cryptography, we're
able to create a secure encryption

1197
00:57:06,730 --> 00:57:09,790
system where the key is
allowed to be public,

1198
00:57:09,790 --> 00:57:11,980
or one of the keys, as we'll soon see.

1199
00:57:11,980 --> 00:57:16,030
And the idea here is that we're
using two keys instead of just one--

1200
00:57:16,030 --> 00:57:20,072
that we have both a public key
and what's known as a private key.

1201
00:57:20,072 --> 00:57:22,030
The private key-- your
private key is something

1202
00:57:22,030 --> 00:57:25,840
you should not share with other people
to keep the encryption scheme secure.

1203
00:57:25,840 --> 00:57:30,340
But the public key is one that
is OK to share with other people.

1204
00:57:30,340 --> 00:57:34,150
And the distinction between the
two is that the public key will be

1205
00:57:34,150 --> 00:57:36,640
used in order to encrypt information.

1206
00:57:36,640 --> 00:57:40,090
And the private key will be
used to decrypt information

1207
00:57:40,090 --> 00:57:41,870
that was encrypted by the public key.

1208
00:57:41,870 --> 00:57:44,620
And the public key and the private
key are mathematically related.

1209
00:57:44,620 --> 00:57:47,287
And there are a couple of ways
that we might imagine doing that.

1210
00:57:47,287 --> 00:57:51,160
But the idea now is that, if I want
to communicate with another person,

1211
00:57:51,160 --> 00:57:54,100
that person sends me their public key.

1212
00:57:54,100 --> 00:57:56,890
And it's OK for the public key
to travel across the internet.

1213
00:57:56,890 --> 00:58:01,000
Anyone is allowed to see the public
key because the public key is only

1214
00:58:01,000 --> 00:58:03,610
used for encrypting that data.

1215
00:58:03,610 --> 00:58:06,610
So I can then take the
plaintext and the public key

1216
00:58:06,610 --> 00:58:11,350
and use that to generate the ciphertext,
the encrypted version of the message

1217
00:58:11,350 --> 00:58:13,930
that I am trying to send
across the internet.

1218
00:58:13,930 --> 00:58:16,960
And then I send the
ciphertext to the other person

1219
00:58:16,960 --> 00:58:18,640
with whom I'm trying to communicate.

1220
00:58:18,640 --> 00:58:24,080
And the other person now, using the
ciphertext, then uses the private key--

1221
00:58:24,080 --> 00:58:26,800
the private key that they did
not share, and the private key

1222
00:58:26,800 --> 00:58:29,710
that has the ability to
decrypt information that

1223
00:58:29,710 --> 00:58:32,600
was encrypted using the public key.

1224
00:58:32,600 --> 00:58:35,800
So using a combination of the
ciphertext and the private key,

1225
00:58:35,800 --> 00:58:38,830
the person I'm communicating
with can decrypt that information

1226
00:58:38,830 --> 00:58:43,070
and get back whatever the original
plaintext of that information

1227
00:58:43,070 --> 00:58:44,360
happened to be.
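
[Editor's sketch, not part of the lecture] The lecture does not say which public-key algorithm is in use, so as an assumption this sketch uses textbook RSA with deliberately tiny primes to show the idea: anyone holding the public key can encrypt, but only the holder of the private key can decrypt. The numbers are far too small to be secure, and real RSA also adds padding.

```python
# Toy RSA key generation with tiny primes (illustration only, NOT secure).
p, q = 61, 53
n = p * q                      # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, chosen coprime to phi
d = pow(e, -1, phi)            # private exponent: modular inverse (Python 3.8+)

public_key = (e, n)            # safe to send across the internet
private_key = (d, n)           # never shared


def encrypt(message: int, key: tuple) -> int:
    """Encrypt an integer smaller than n with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(cipher: int, key: tuple) -> int:
    """Recover the original integer with the private key."""
    exponent, modulus = key
    return pow(cipher, exponent, modulus)


message = 65
ciphertext = encrypt(message, public_key)
recovered = decrypt(ciphertext, private_key)
print(ciphertext, recovered)
```

An eavesdropper who sees only public_key and ciphertext would need to factor n to find d, which is what makes the scheme hard to break at realistic key sizes.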

1228
00:58:44,360 --> 00:58:46,630
And so this, then, is
how we can do a lot

1229
00:58:46,630 --> 00:58:48,430
of this communication on the internet.

1230
00:58:48,430 --> 00:58:50,830
By using this
public-private key pair, we

1231
00:58:50,830 --> 00:58:53,560
can say, use the public
key to do the encrypting,

1232
00:58:53,560 --> 00:58:55,690
use the private key
to do the decrypting.

1233
00:58:55,690 --> 00:58:58,690
And now two computers that have
never interacted with each other

1234
00:58:58,690 --> 00:59:00,970
before, without having
the opportunity to meet,

1235
00:59:00,970 --> 00:59:04,630
to exchange some secret information,
can use a technique like this

1236
00:59:04,630 --> 00:59:07,060
in order to securely
communicate with each other--

1237
00:59:07,060 --> 00:59:10,300
to send a message back and forth
without anyone in the middle

1238
00:59:10,300 --> 00:59:15,140
being able to intercept the message
and identify what the message is about.

1239
00:59:15,140 --> 00:59:18,310
And once you have this ability, the
ability to communicate with one another

1240
00:59:18,310 --> 00:59:21,730
secretly, well, then you can
imagine agreeing on some secret key

1241
00:59:21,730 --> 00:59:25,780
and then using secret-key encryption to
be able to encrypt and decrypt messages

1242
00:59:25,780 --> 00:59:26,470
as well.

1243
00:59:26,470 --> 00:59:28,262
And so that's an approach
that you can also

1244
00:59:28,262 --> 00:59:31,460
take when trying to communicate with
other people across the internet.

1245
00:59:31,460 --> 00:59:34,950
But this idea of encryption
is what allows for HTTPS,

1246
00:59:34,950 --> 00:59:39,190
the secure version of the HTTP protocol,
to actually work to make sure that--

1247
00:59:39,190 --> 00:59:42,690
when you are communicating with
your bank's website, for example--

1248
00:59:42,690 --> 00:59:46,300
that someone along the way won't be
able to intercept that information

1249
00:59:46,300 --> 00:59:48,770
and identify what it is that
you're communicating about

1250
00:59:48,770 --> 00:59:51,090
and, instead, only has
the encrypted version

1251
00:59:51,090 --> 00:59:55,720
of the information and a public key
with which they can encrypt information,

1252
00:59:55,720 --> 00:59:57,850
but not a private key
that can ultimately

1253
00:59:57,850 --> 01:00:02,150
be used in order to decrypt
information as well.

1254
01:00:02,150 --> 01:00:05,920
And so that then is how we might allow
for this kind of secure communication

1255
01:00:05,920 --> 01:00:09,010
on the internet and allow our
web applications to be secure.

1256
01:00:09,010 --> 01:00:12,130
But in addition to our web applications
just listening for requests

1257
01:00:12,130 --> 01:00:14,180
and then providing
some sort of response,

1258
01:00:14,180 --> 01:00:17,560
our web applications were
also dealing with data.

1259
01:00:17,560 --> 01:00:19,720
We introduced the idea
of SQL data tables

1260
01:00:19,720 --> 01:00:22,240
where we had tables of
data with rows and columns

1261
01:00:22,240 --> 01:00:23,950
that are representing information.

1262
01:00:23,950 --> 01:00:26,980
And we've also created web
applications in this course where

1263
01:00:26,980 --> 01:00:28,900
we've had applications that have users.

1264
01:00:28,900 --> 01:00:32,940
Users sign in with a user name
and a password, for example.

1265
01:00:32,940 --> 01:00:35,450
And so how might we
represent that information

1266
01:00:35,450 --> 01:00:37,100
about users and their passwords?

1267
01:00:37,100 --> 01:00:41,070
Well, one way would be to just store it
inside of a table like this.

1268
01:00:41,070 --> 01:00:42,410
Here's a table of users.

1269
01:00:42,410 --> 01:00:44,210
Every user has an ID.

1270
01:00:44,210 --> 01:00:47,490
They have a user name,
and they have a password.

1271
01:00:47,490 --> 01:00:50,750
But this turns out to be
an incredibly insecure way

1272
01:00:50,750 --> 01:00:53,090
to store passwords--
to be storing passwords

1273
01:00:53,090 --> 01:00:56,120
in what might be called
plaintext, just to literally store

1274
01:00:56,120 --> 01:00:58,040
the passwords inside of a database.

1275
01:00:58,040 --> 01:01:01,910
And we should never do this in practice
because of the security vulnerabilities

1276
01:01:01,910 --> 01:01:03,090
associated with it.

1277
01:01:03,090 --> 01:01:06,680
If ever someone were to get unauthorized
access to this database,

1278
01:01:06,680 --> 01:01:10,140
they would be able to see all of
the passwords for all of the users.

1279
01:01:10,140 --> 01:01:13,010
So if this database ever leaked
for whatever reason, suddenly

1280
01:01:13,010 --> 01:01:14,852
all of these passwords are now known.

1281
01:01:14,852 --> 01:01:16,310
And this kind of thing does happen.

1282
01:01:16,310 --> 01:01:19,460
If companies are not careful about
how they represent user names

1283
01:01:19,460 --> 01:01:22,380
and passwords inside of their
databases, and if ever there's

1284
01:01:22,380 --> 01:01:27,040
some sort of database leak,
suddenly a whole bunch of passwords

1285
01:01:27,040 --> 01:01:29,008
could potentially be compromised.

1286
01:01:29,008 --> 01:01:31,300
And it's for that reason that
the recommended approach,

1287
01:01:31,300 --> 01:01:34,060
rather than store an actual
password, is to store

1288
01:01:34,060 --> 01:01:38,740
a hashed version of the same
password using a hash function where

1289
01:01:38,740 --> 01:01:41,680
a hash function, in this
context, is some function that

1290
01:01:41,680 --> 01:01:46,630
takes a password as input
and outputs some hash--

1291
01:01:46,630 --> 01:01:49,540
some sequence of characters
and numbers, in this case--

1292
01:01:49,540 --> 01:01:51,850
that represents that
particular password,

1293
01:01:51,850 --> 01:01:53,650
a hashed version of the password.

1294
01:01:53,650 --> 01:01:55,870
But the important thing
about this hash function

1295
01:01:55,870 --> 01:01:58,120
is that it's a one-way hash function.

1296
01:01:58,120 --> 01:02:01,750
From the password, you can get to
the sequence of letters and numbers.

1297
01:02:01,750 --> 01:02:04,480
But it is very, very difficult
to go the other way around

1298
01:02:04,480 --> 01:02:09,490
to use this information to figure out
what the original password actually

1299
01:02:09,490 --> 01:02:10,240
was.

1300
01:02:10,240 --> 01:02:12,940
And so what this means is that
the companies won't actually

1301
01:02:12,940 --> 01:02:18,550
know what any particular user's
password is when a user tries to log in.

1302
01:02:18,550 --> 01:02:21,760
What we'll do is take their password
that they're trying to log in with.

1303
01:02:21,760 --> 01:02:25,090
We'll hash it and compare
that hash against the hash

1304
01:02:25,090 --> 01:02:27,580
that we've stored in the database.

1305
01:02:27,580 --> 01:02:31,030
If the hashes match up, that means the
user probably typed in their password

1306
01:02:31,030 --> 01:02:33,130
correctly and, therefore,
we can sign the user in.

1307
01:02:33,130 --> 01:02:35,830
And otherwise, that's a
sign that the user did not

1308
01:02:35,830 --> 01:02:38,270
type their password in correctly.
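
[Editor's sketch, not part of the lecture] The hash-then-compare login flow can be sketched with Python's standard library. The lecture does not name a hash function, so PBKDF2 with a per-user random salt is an assumption here, as are the function names and the iteration count; frameworks such as Django handle this for you.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor; tune for your hardware


def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest); only these are stored, never the password."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Hash the login attempt the same way and compare the two hashes."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# At registration: store the salt and the hash in the users table.
stored_salt, stored_hash = hash_password("12345")

# At login: hash what the user typed and compare against the stored hash.
print(verify_password("12345", stored_salt, stored_hash))
print(verify_password("guess", stored_salt, stored_hash))
```

Because only the salt and hash are stored, a leaked table does not directly reveal anyone's password, and the site itself cannot tell a user what their password was.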

1309
01:02:38,270 --> 01:02:40,330
So this, then, is the
reason why companies--

1310
01:02:40,330 --> 01:02:42,670
if they're obeying these
best practices-- usually

1311
01:02:42,670 --> 01:02:44,740
can't tell you what
your password actually

1312
01:02:44,740 --> 01:02:46,810
is if you forget your password.

1313
01:02:46,810 --> 01:02:49,930
If you forget your password, the company
will let you reset your password.

1314
01:02:49,930 --> 01:02:52,242
They can update the data
inside of the table.

1315
01:02:52,242 --> 01:02:53,950
But the company won't
be able to tell you

1316
01:02:53,950 --> 01:02:57,760
what your password actually is because
the company doesn't know your password.

1317
01:02:57,760 --> 01:03:00,460
The company only knows
some hashed version

1318
01:03:00,460 --> 01:03:04,970
of the password, some result of passing
that password through a hash function.

1319
01:03:04,970 --> 01:03:07,870
And as a result, they're
able to know whether you

1320
01:03:07,870 --> 01:03:10,600
logged in successfully or not
with the correct credentials

1321
01:03:10,600 --> 01:03:14,000
without actually knowing what
your password actually is.

1322
01:03:14,000 --> 01:03:15,940
And so this is another
area where you might

1323
01:03:15,940 --> 01:03:19,270
imagine that, if you're not careful
about how you're storing this data,

1324
01:03:19,270 --> 01:03:22,360
it could be a security
vulnerability inside of your program

1325
01:03:22,360 --> 01:03:26,220
where, if ever that data is leaked,
passwords suddenly become known.

1326
01:03:26,220 --> 01:03:29,890
And there are other more subtle ways
that web applications could potentially

1327
01:03:29,890 --> 01:03:32,410
leak information that
you, as the web developer,

1328
01:03:32,410 --> 01:03:34,330
need to decide if you're OK with or not.

1329
01:03:34,330 --> 01:03:37,570
Imagine a website, for example, where
you do have a place where you can say,

1330
01:03:37,570 --> 01:03:39,700
if you forgot your
password, you can be sent

1331
01:03:39,700 --> 01:03:43,173
to a place where you can reset
your password, for example.

1332
01:03:43,173 --> 01:03:46,090
You might imagine that, if you type
in your email address, click Reset

1333
01:03:46,090 --> 01:03:49,270
Password, you might get a message
like, all right, password reset email

1334
01:03:49,270 --> 01:03:50,530
has been sent.

1335
01:03:50,530 --> 01:03:54,070
But you might imagine typing in an email
address and getting something like,

1336
01:03:54,070 --> 01:03:57,400
error, there is no user
with that email address.

1337
01:03:57,400 --> 01:04:00,250
And here, again, is a potential
security vulnerability

1338
01:04:00,250 --> 01:04:02,320
in terms of leaked information.

1339
01:04:02,320 --> 01:04:06,340
This page that just seems to send you
an email if you forgot your password is

1340
01:04:06,340 --> 01:04:10,720
now leaking information about which
users happened to have accounts

1341
01:04:10,720 --> 01:04:14,140
on your website and which users do
not because all someone needs to do

1342
01:04:14,140 --> 01:04:18,100
is type in an email address and find out
whether it results in an error or not

1343
01:04:18,100 --> 01:04:22,310
in order to know whether a user happens
to have an account on the website

1344
01:04:22,310 --> 01:04:22,810
or not.

1345
01:04:22,810 --> 01:04:24,685
And maybe that's not a
big deal if that's not

1346
01:04:24,685 --> 01:04:26,170
something you care about securing.

1347
01:04:26,170 --> 01:04:30,160
But if it's a website where
you do care about making sure

1348
01:04:30,160 --> 01:04:32,650
that, if someone has an account
or doesn't have an account,

1349
01:04:32,650 --> 01:04:35,350
that information is kept private
and secure only to the user,

1350
01:04:35,350 --> 01:04:37,630
unless they want to share
it, well, then this type

1351
01:04:37,630 --> 01:04:40,570
of page, this type of
interface with the database

1352
01:04:40,570 --> 01:04:43,570
could potentially be leaking
that kind of information.
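
[Editor's sketch, not part of the lecture] One way to avoid revealing which email addresses have accounts is to return the same message whether or not the address is registered. The names below (REGISTERED_EMAILS, send_reset_email, request_password_reset) are hypothetical stand-ins for a real user table and mail system.

```python
# Hypothetical stand-in for the users table.
REGISTERED_EMAILS = {"harry@example.com"}

# Hypothetical stand-in for a mail system: records who would be emailed.
sent = []


def send_reset_email(email: str) -> None:
    sent.append(email)


GENERIC_MESSAGE = (
    "If an account exists for that address, "
    "a password reset email has been sent."
)


def request_password_reset(email: str) -> str:
    """Send a reset email only to real users, but always answer the same way."""
    if email.lower() in REGISTERED_EMAILS:
        send_reset_email(email)
    # Same response in both cases, so the page reveals nothing about accounts.
    return GENERIC_MESSAGE
```

Whether this trade-off is worth it depends on the site: the uniform message protects privacy but gives a user who mistyped their address no feedback.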

1353
01:04:43,570 --> 01:04:46,120
And information can be leaked
in all sorts of different ways.

1354
01:04:46,120 --> 01:04:48,700
You can even leak information
just based on the time

1355
01:04:48,700 --> 01:04:52,780
it takes for the database to be able
to respond to a particular request.

1356
01:04:52,780 --> 01:04:55,450
You might imagine, if you
make a request about a user,

1357
01:04:55,450 --> 01:04:58,180
and it takes longer to
respond, that might tell you

1358
01:04:58,180 --> 01:05:01,150
something about the number of
database queries it needs to run

1359
01:05:01,150 --> 01:05:04,210
or the amount of information that's
stored about that user as opposed

1360
01:05:04,210 --> 01:05:06,200
to if a request takes less time.

1361
01:05:06,200 --> 01:05:09,850
So even something like how many
milliseconds it takes for a web server

1362
01:05:09,850 --> 01:05:13,780
to respond to a request can
reveal or leak information

1363
01:05:13,780 --> 01:05:16,720
about the data that is stored
inside of the database.

1364
01:05:16,720 --> 01:05:19,750
And there have been examples of
researchers who actually try and see

1365
01:05:19,750 --> 01:05:23,702
what information they can get just from
looking at this kind of information.

1366
01:05:23,702 --> 01:05:25,660
It doesn't seem like it
would leak information,

1367
01:05:25,660 --> 01:05:29,580
but it might actually
reveal information as well.

1368
01:05:29,580 --> 01:05:32,740
Now, another concern when dealing with
SQL and databases we've talked about

1369
01:05:32,740 --> 01:05:34,707
is the context of SQL injection--

1370
01:05:34,707 --> 01:05:36,790
this threat where, if
you're not careful about how

1371
01:05:36,790 --> 01:05:40,090
it is that you run your SQL
code, you could inadvertently

1372
01:05:40,090 --> 01:05:43,390
end up executing code that you
don't mean to be executing.

1373
01:05:43,390 --> 01:05:46,390
Situations like here-- we're in
a username and password field.

1374
01:05:46,390 --> 01:05:48,010
We've seen this example before--

1375
01:05:48,010 --> 01:05:50,620
where, if a user tries to log
in, you might imagine a query

1376
01:05:50,620 --> 01:05:53,200
like this is run selecting
from the users table

1377
01:05:53,200 --> 01:05:57,190
where user name equals whatever was
typed in as the user name and password

1378
01:05:57,190 --> 01:05:59,800
equals whatever was
typed in as the password.

1379
01:05:59,800 --> 01:06:04,200
And we saw how, for a normal user--
someone who types in harry and

1380
01:06:04,200 --> 01:06:06,970
12345 as their username and password--

1381
01:06:06,970 --> 01:06:09,380
that this type of query works just fine.

1382
01:06:09,380 --> 01:06:11,890
But if a hacker tries
to log into a website

1383
01:06:11,890 --> 01:06:15,520
and maybe includes a double
quotation mark and two hyphens,

1384
01:06:15,520 --> 01:06:18,640
for example, where two
hyphens mean a comment in SQL,

1385
01:06:18,640 --> 01:06:22,760
and we were to literally substitute
these values into our SQL queries,

1386
01:06:22,760 --> 01:06:27,010
well, then you might end up
substituting hacker"-- into the query,

1387
01:06:27,010 --> 01:06:30,100
with the hyphens creating a comment that
ignores the rest of this query,

1388
01:06:30,100 --> 01:06:33,640
effectively ignoring any kind of
password checking that we might

1389
01:06:33,640 --> 01:06:35,560
want our web application to be doing.

1390
01:06:35,560 --> 01:06:37,390
So this, too-- another
vulnerability that

1391
01:06:37,390 --> 01:06:40,570
comes about whenever we're
dealing with executing

1392
01:06:40,570 --> 01:06:42,520
SQL code inside of a database.

1393
01:06:42,520 --> 01:06:44,860
And in order to deal with
this, we want to make sure

1394
01:06:44,860 --> 01:06:48,640
that we're escaping any of these
potentially dangerous characters that

1395
01:06:48,640 --> 01:06:50,710
might show up inside of our SQL queries.

1396
01:06:50,710 --> 01:06:52,870
And Django's models do this for us.

1397
01:06:52,870 --> 01:06:56,980
When we do these kinds of queries
using Django saying, .objects, .filter,

1398
01:06:56,980 --> 01:07:00,580
to be able to filter out for only
certain versions of a particular model,

1399
01:07:00,580 --> 01:07:04,330
it is going to take care of the process
of making sure that it's not subject

1400
01:07:04,330 --> 01:07:06,770
to these kinds of SQL injection attacks.

1401
01:07:06,770 --> 01:07:09,340
But if ever you're writing a
web application that is directly

1402
01:07:09,340 --> 01:07:12,070
executing SQL code, which
you might imagine doing,

1403
01:07:12,070 --> 01:07:14,080
you do want to be
careful about making sure

1404
01:07:14,080 --> 01:07:16,240
that you're not exposing
the application to be

1405
01:07:16,240 --> 01:07:20,070
vulnerable to these
kinds of threats as well.
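
[Editor's sketch, not part of the lecture] When writing SQL directly, the usual defense is to pass user input as query parameters rather than pasting it into the query string. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table contents and function names are assumptions for illustration, single quotes are used for string literals (so the attack string ends in '-- rather than "--), and the password is stored in plaintext only to mirror the query from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, password TEXT)"
)
# Plaintext password only to mirror the lecture's query; hash it in practice.
conn.execute(
    "INSERT INTO users (username, password) VALUES ('hacker', 'correct-horse')"
)


def login_vulnerable(username: str, password: str) -> bool:
    """Builds the query by string substitution: open to SQL injection."""
    query = (
        f"SELECT id FROM users WHERE username = '{username}' "
        f"AND password = '{password}'"
    )
    return conn.execute(query).fetchone() is not None


def login_parameterized(username: str, password: str) -> bool:
    """Passes the values as parameters, so they are never parsed as SQL."""
    query = "SELECT id FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None


# The injected quote closes the string and -- comments out the password check.
print(login_vulnerable("hacker'--", ""))
print(login_parameterized("hacker'--", ""))
```

With the parameterized version, the whole string hacker'-- is treated as a username to look up, and no such user exists.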

1406
01:07:20,070 --> 01:07:21,920
So those, then, are potential
threats that come

1407
01:07:21,920 --> 01:07:24,935
about when we're just talking about
what's happening on the server.

1408
01:07:24,935 --> 01:07:26,810
But we also can think
about what might happen

1409
01:07:26,810 --> 01:07:28,700
when we're interacting
with other servers--

1410
01:07:28,700 --> 01:07:31,380
when we're interacting
with APIs, for example.

1411
01:07:31,380 --> 01:07:33,770
So we talked about JavaScript
and using JavaScript

1412
01:07:33,770 --> 01:07:37,400
to be able to make additional requests
to APIs or to other services that

1413
01:07:37,400 --> 01:07:40,302
are able to return back with
certain types of information.

1414
01:07:40,302 --> 01:07:42,260
And with APIs, there are
a number of techniques

1415
01:07:42,260 --> 01:07:46,040
that we can use in APIs to allow
them to be more scalable, to allow

1416
01:07:46,040 --> 01:07:48,290
them to be more secure.

1417
01:07:48,290 --> 01:07:50,780
One is this notion of
rate limiting where

1418
01:07:50,780 --> 01:07:52,940
we might want to make
sure that no user is

1419
01:07:52,940 --> 01:07:56,480
able to make more than a certain
number of requests to an API

1420
01:07:56,480 --> 01:07:59,000
in any particular amount of time.

1421
01:07:59,000 --> 01:08:01,130
This is in response to
a security threat that

1422
01:08:01,130 --> 01:08:03,440
has to do with the
scalability of a system, which

1423
01:08:03,440 --> 01:08:06,560
is known as a DoS, or Denial
of Service, attack where,

1424
01:08:06,560 --> 01:08:09,920
effectively, if you just make a whole
bunch of requests to a single server

1425
01:08:09,920 --> 01:08:13,543
over, and over, and over again, you
could potentially shut down that system

1426
01:08:13,543 --> 01:08:15,710
because you're making so
many requests that it's not

1427
01:08:15,710 --> 01:08:19,050
able to handle that many
requests all at the same time.

1428
01:08:19,050 --> 01:08:22,310
And for that reason, because it's
so easy to make an API request--

1429
01:08:22,310 --> 01:08:27,170
you can do so using just a single line
of Python or JavaScript, for example--

1430
01:08:27,170 --> 01:08:29,840
APIs will often institute
some kind of rate

1431
01:08:29,840 --> 01:08:32,960
limiting to limit the number of
requests you can make so that you're not

1432
01:08:32,960 --> 01:08:35,630
going to overwhelm the server
or overwhelm the database that

1433
01:08:35,630 --> 01:08:39,080
needs to be queried in order
to respond to those requests.

1434
01:08:39,080 --> 01:08:42,229
And so this kind of
limiting might work as well.
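
[Editor's sketch, not part of the lecture] Rate limiting can be implemented in several ways; the lecture does not pick one. As an assumption, this sketch uses a sliding window that remembers the times of each client's recent requests and refuses a request once the client has used up its allowance. The class name and parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per client in any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                    # injectable so it can be tested
        self._hits = defaultdict(deque)       # client id -> request times

    def allow(self, client_id) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                      # over the limit: reject
        hits.append(now)
        return True


# Example: each client may make 3 requests per minute.
limiter = RateLimiter(max_requests=3, window_seconds=60)
for attempt in range(5):
    print(attempt, limiter.allow("203.0.113.7"))
```

This version keeps its counts in one process's memory; with multiple servers behind a load balancer, the counts would need to live in shared storage instead.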

1435
01:08:42,229 --> 01:08:45,800
APIs might also want to add some
kind of route authentication.

1436
01:08:45,800 --> 01:08:49,527
You might not want everybody to
access the same data via an API.

1437
01:08:49,527 --> 01:08:51,319
Maybe there's some sort
of permission model

1438
01:08:51,319 --> 01:08:54,800
where only certain users are able
to access certain pieces of data

1439
01:08:54,800 --> 01:08:55,880
from the API.

1440
01:08:55,880 --> 01:09:00,290
So you might imagine that a user needs
to have an API key, for example--

1441
01:09:00,290 --> 01:09:03,830
effectively, a password that
they need to pass around anytime

1442
01:09:03,830 --> 01:09:06,710
they're making an API
request to your API

1443
01:09:06,710 --> 01:09:09,140
and that allows you to then
be able to look at that key

1444
01:09:09,140 --> 01:09:12,390
and verify that they are
who they say they are.

1445
01:09:12,390 --> 01:09:16,010
Now, with those API keys comes other
potential security vulnerabilities

1446
01:09:16,010 --> 01:09:17,090
to be mindful of.

1447
01:09:17,090 --> 01:09:21,290
One is that, just as you should never be
putting passwords inside of your source

1448
01:09:21,290 --> 01:09:23,899
code-- inside of your Git
repository, for example--

1449
01:09:23,899 --> 01:09:27,290
you likewise generally shouldn't
be putting your API keys

1450
01:09:27,290 --> 01:09:31,700
inside of your web applications as well,
inside of the source code of those web

1451
01:09:31,700 --> 01:09:34,069
applications, because
then anyone who has access

1452
01:09:34,069 --> 01:09:36,020
to the source code for
the web application

1453
01:09:36,020 --> 01:09:38,960
can see what your API
key is, could then use

1454
01:09:38,960 --> 01:09:42,439
the API key to pretend to be
you and, therefore, get access

1455
01:09:42,439 --> 01:09:46,609
to potential API routes that they
should not be able to access.

1456
01:09:46,609 --> 01:09:50,930
One common solution to this is to use
what are known as environment variables

1457
01:09:50,930 --> 01:09:55,190
where, effectively, you in your
program say that your API key is not

1458
01:09:55,190 --> 01:09:59,220
going to be some predetermined string
that is in the text of your program

1459
01:09:59,220 --> 01:10:03,170
but instead is going to be drawn from
the environment in which the program is

1460
01:10:03,170 --> 01:10:04,040
being run.

1461
01:10:04,040 --> 01:10:07,430
And then, on the server, when
you're running the web application,

1462
01:10:07,430 --> 01:10:11,000
you'll first make sure the server has
all of those environment variables set

1463
01:10:11,000 --> 01:10:16,400
correctly so that, rather than have
the API key actually in the source

1464
01:10:16,400 --> 01:10:20,570
code of the program, the API key is
simply in the environment on the server

1465
01:10:20,570 --> 01:10:22,340
where the web application is running.

1466
01:10:22,340 --> 01:10:25,370
And the server can just draw that
information from the environment

1467
01:10:25,370 --> 01:10:29,720
so that it knows what the API
key should be without the API key

1468
01:10:29,720 --> 01:10:34,590
actually having to be inside of the
web application source code itself.
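
As a small sketch of the environment-variable approach: the variable name `API_KEY` and the helper `get_api_key` are assumptions made for illustration, and `os.environ` is Python's standard way to read the environment.

```python
import os


def get_api_key(name="API_KEY"):
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get(name)
    if key is None:
        # Fail loudly so a misconfigured server is noticed early.
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

On the server, the variable would be set before the application starts, so the key never appears in the repository.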

1469
01:10:34,590 --> 01:10:36,470
And so as we begin to
deal with APIs, you

1470
01:10:36,470 --> 01:10:40,070
might notice that many APIs will
require you to have an API key.

1471
01:10:40,070 --> 01:10:42,170
And often, it's for
these sorts of reasons--

1472
01:10:42,170 --> 01:10:45,310
to make sure that we're able to
authenticate users effectively

1473
01:10:45,310 --> 01:10:48,560
and also to make sure that we're able
to limit users to make sure that they're

1474
01:10:48,560 --> 01:10:51,140
not making too many
requests to the server

1475
01:10:51,140 --> 01:10:54,170
or to the database at
any particular time.

1476
01:10:54,170 --> 01:10:57,440
But this, then, starts to get us into
other potential vulnerabilities--

1477
01:10:57,440 --> 01:11:00,470
in particular, vulnerabilities
concerning JavaScript.

1478
01:11:00,470 --> 01:11:02,600
JavaScript, again, is
a programming language

1479
01:11:02,600 --> 01:11:05,840
that we use in order to write code
that runs inside of our web browser--

1480
01:11:05,840 --> 01:11:08,730
a browser like Chrome, or
Safari, or something like that.

1481
01:11:08,730 --> 01:11:14,210
And as a result, JavaScript has a lot of
power to manipulate things on the page.

1482
01:11:14,210 --> 01:11:16,220
It can simulate the clicking of buttons.

1483
01:11:16,220 --> 01:11:20,120
It can change the content of what
happens to be on any particular page.

1484
01:11:20,120 --> 01:11:22,370
And as a result, there are
many, many vulnerabilities

1485
01:11:22,370 --> 01:11:26,750
that come about when it comes
to thinking about JavaScript.

1486
01:11:26,750 --> 01:11:30,750
And one such vulnerability is this
notion of cross-site scripting--

1487
01:11:30,750 --> 01:11:33,380
that, in general, when
on your web application,

1488
01:11:33,380 --> 01:11:37,760
you only want JavaScript to run
if you, yourself have written it.

1489
01:11:37,760 --> 01:11:39,830
Cross-site scripting
is a potential threat

1490
01:11:39,830 --> 01:11:45,050
where someone else might be able to get
JavaScript code to run on your website

1491
01:11:45,050 --> 01:11:48,890
when it's JavaScript code that someone
else wrote instead of you, yourself.

1492
01:11:48,890 --> 01:11:51,710
And this is a potential vulnerability
because, if someone else can

1493
01:11:51,710 --> 01:11:55,280
write the JavaScript code, they
can manipulate the contents of what

1494
01:11:55,280 --> 01:11:56,830
happens to be on your website.

1495
01:11:56,830 --> 01:11:59,300
They can potentially
manipulate the user experience

1496
01:11:59,300 --> 01:12:02,260
to get a result that is
not, actually, desired.

1497
01:12:02,260 --> 01:12:06,860
So let's go ahead and take a look at
one example of cross-site scripting.

1498
01:12:06,860 --> 01:12:09,770
All right, so I've prepared a
web application in advance--

1499
01:12:09,770 --> 01:12:14,900
it's called security-- inside of which
is a single Django app called XSS,

1500
01:12:14,900 --> 01:12:16,590
for Cross-Site Scripting.

1501
01:12:16,590 --> 01:12:19,670
And inside of here, we'll
first take a look at the URLs.

1502
01:12:19,670 --> 01:12:24,290
So there's a single URL that just
allows us to provide any path.

1503
01:12:24,290 --> 01:12:27,330
And then it's going to
load the Index view.

1504
01:12:27,330 --> 01:12:31,910
And on the Index view, we're
going to display an HTTP response.

1505
01:12:31,910 --> 01:12:35,210
It says, here was the path that
just happened to be requested.

1506
01:12:35,210 --> 01:12:37,910
So you might imagine this is
a simplified version of what

1507
01:12:37,910 --> 01:12:41,240
you might see on other websites, for
example, where websites might show you

1508
01:12:41,240 --> 01:12:45,170
on any particular page what path
you're on in order to get to that page,

1509
01:12:45,170 --> 01:12:49,610
some indication of where you are
inside of this web application.

1510
01:12:49,610 --> 01:12:53,150
So I'll go ahead and cd into
security and run the server--

1511
01:12:53,150 --> 01:12:57,640
python manage.py runserver.

1512
01:12:57,640 --> 01:12:59,320
So I am now running the server.

1513
01:12:59,320 --> 01:13:06,420
And now I'll go ahead and go into my
web application, /hello, for example.

1514
01:13:06,420 --> 01:13:09,570
And so what I see here is
the requested path hello,

1515
01:13:09,570 --> 01:13:11,230
which is what I would expect it to be.

1516
01:13:11,230 --> 01:13:13,960
I can change it to
something else, like hi.

1517
01:13:13,960 --> 01:13:15,270
So here's requested path hi.

1518
01:13:15,270 --> 01:13:17,760
Here's hi/2, for example.

1519
01:13:17,760 --> 01:13:20,430
Whatever page I visit,
it gives me a page

1520
01:13:20,430 --> 01:13:23,190
that says, requested
path, and then whatever

1521
01:13:23,190 --> 01:13:25,770
path I happened to be visiting.

1522
01:13:25,770 --> 01:13:29,520
But watch what happens if I
try and visit this URL instead.

1523
01:13:29,520 --> 01:13:39,600
I'm going to visit the URL /, then an opening script tag,
alert('hi'), and then a closing script tag.

1524
01:13:39,600 --> 01:13:40,650
So I run it.

1525
01:13:40,650 --> 01:13:44,990
And suddenly, an alert shows
up on my page that says, hi.

1526
01:13:44,990 --> 01:13:45,850
And I press OK.

1527
01:13:45,850 --> 01:13:47,790
And it says, all right, requested path.

1528
01:13:47,790 --> 01:13:49,680
That alert was a JavaScript alert.

1529
01:13:49,680 --> 01:13:53,250
It was JavaScript code
running on my web application.

1530
01:13:53,250 --> 01:13:56,940
But it was not JavaScript code that I
wrote inside of my web application.

1531
01:13:56,940 --> 01:14:00,150
It was someone else who
wrote based on the URL

1532
01:14:00,150 --> 01:14:03,780
to run particular JavaScript
on my particular page.

1533
01:14:03,780 --> 01:14:06,120
And so someone linked
to my web application

1534
01:14:06,120 --> 01:14:09,000
and passed in this script
tag as part of the URL.

1535
01:14:09,000 --> 01:14:12,840
Someone who clicked on that link might
have been taken to my web application

1536
01:14:12,840 --> 01:14:17,630
but ultimately had JavaScript run
that was created by someone else.

1537
01:14:17,630 --> 01:14:19,980
And that, ultimately, is
potentially dangerous.

1538
01:14:19,980 --> 01:14:22,440
It leaves open the
possibility that someone else

1539
01:14:22,440 --> 01:14:24,990
could run JavaScript code on my page.

1540
01:14:24,990 --> 01:14:27,300
And it might not just be
something like an alert.

1541
01:14:27,300 --> 01:14:29,940
You might imagine someone
not just displaying an alert,

1542
01:14:29,940 --> 01:14:33,720
but modifying something inside of the
DOM-- changing the contents of the web

1543
01:14:33,720 --> 01:14:36,960
page, making API requests,
doing other types of tasks

1544
01:14:36,960 --> 01:14:39,870
that you can do using JavaScript
inside of a web browser

1545
01:14:39,870 --> 01:14:44,580
that, ultimately, leave my page open
to potential security vulnerabilities.

1546
01:14:44,580 --> 01:14:47,580
And so these are cases where it's
important to be mindful of when you're

1547
01:14:47,580 --> 01:14:51,720
designing these pages, if ever there is
a possibility that someone could inject

1548
01:14:51,720 --> 01:14:54,630
their own JavaScript
into your page somehow,

1549
01:14:54,630 --> 01:14:57,780
you'll want to either detect
that or escape it in some way.

1550
01:14:57,780 --> 01:15:02,025
Or take other precautions to make sure
that this kind of cross-site scripting

1551
01:15:02,025 --> 01:15:03,150
isn't going to be possible.

1552
01:15:03,150 --> 01:15:06,240
You might imagine that, in a
messaging application-- for example,

1553
01:15:06,240 --> 01:15:07,740
if you're messaging back and forth--

1554
01:15:07,740 --> 01:15:10,282
you don't want it to be the case
that, if you message someone

1555
01:15:10,282 --> 01:15:13,260
else some JavaScript code
that, when they receive it,

1556
01:15:13,260 --> 01:15:16,380
that code actually ends up
running as some JavaScript that

1557
01:15:16,380 --> 01:15:18,210
runs on that particular page.

1558
01:15:18,210 --> 01:15:20,450
You want to be sure to
escape that information so

1559
01:15:20,450 --> 01:15:22,830
that they just see the
text of the JavaScript code

1560
01:15:22,830 --> 01:15:25,430
but that the code isn't
actually executed.
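
One way to picture escaping, as a minimal sketch: the function `render_requested_path` is a hypothetical stand-in for the view shown earlier, and it uses `html.escape` from Python's standard library. Django templates also escape variables by default, which is a reason to prefer rendering through a template over building HTML by hand.

```python
from html import escape


def render_requested_path(path):
    """Build the response body, escaping the user-controlled path first."""
    # escape() turns characters such as < and > into HTML entities, so the
    # browser displays them as text instead of treating them as markup.
    return f"Requested Path: {escape(path)}"


body = render_requested_path("<script>alert('hi')</script>")
```

With this in place, the script tag from the URL would show up on the page as literal text rather than run as JavaScript.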

1561
01:15:25,430 --> 01:15:28,140
And this is a similar threat to
that threat of SQL injection.

1562
01:15:28,140 --> 01:15:30,480
It all comes back to
the idea of not wanting

1563
01:15:30,480 --> 01:15:33,120
to allow someone else
to be able to inject

1564
01:15:33,120 --> 01:15:35,280
their own code into your program.

1565
01:15:35,280 --> 01:15:39,540
You don't want someone else to be able
to inject SQL code into the queries you

1566
01:15:39,540 --> 01:15:40,770
run on your database.

1567
01:15:40,770 --> 01:15:44,640
And you don't want someone to be able
to inject JavaScript code into your web

1568
01:15:44,640 --> 01:15:49,850
page because that leaves open potential
security vulnerabilities as well.

1569
01:15:49,850 --> 01:15:51,882
One type of security
vulnerability that Django

1570
01:15:51,882 --> 01:15:54,590
is quite good at defending against
is one that we've seen before,

1571
01:15:54,590 --> 01:15:57,470
but we'll explore in more
detail how it might work.

1572
01:15:57,470 --> 01:16:00,530
And it's this idea of
cross-site request forgery where

1573
01:16:00,530 --> 01:16:05,270
you fake a request to a website when
you didn't intend to actually make

1574
01:16:05,270 --> 01:16:07,020
a request to that website.

1575
01:16:07,020 --> 01:16:10,830
So you might imagine that,
if your bank, for example,

1576
01:16:10,830 --> 01:16:12,982
had a URL that allowed
you to transfer money

1577
01:16:12,982 --> 01:16:14,690
from one person to
another person-- we've

1578
01:16:14,690 --> 01:16:16,430
talked about this idea a little bit.

1579
01:16:16,430 --> 01:16:20,480
But imagine now how you could implement
this if it really was just a URL.

1580
01:16:20,480 --> 01:16:24,740
You could go to /transfer
and say, as GET parameters,

1581
01:16:24,740 --> 01:16:26,060
who am I transferring money to?

1582
01:16:26,060 --> 01:16:27,950
And what is the amount
that I'm transferring?

1583
01:16:27,950 --> 01:16:32,120
Then someone else on some other website
could, in the body of their page,

1584
01:16:32,120 --> 01:16:35,270
just have a link where
that link says, click here.

1585
01:16:35,270 --> 01:16:37,460
And it links to
yourbank.com, or whatever

1586
01:16:37,460 --> 01:16:41,390
your bank is, transferring
money to me in this amount.

1587
01:16:41,390 --> 01:16:44,720
And if some user unknowingly just
clicked on that link not knowing

1588
01:16:44,720 --> 01:16:46,640
where it would take
them, this website might

1589
01:16:46,640 --> 01:16:49,640
be able to forge a
request to the bank-- make

1590
01:16:49,640 --> 01:16:52,070
it seem like the user
had gone to the bank

1591
01:16:52,070 --> 01:16:54,350
and tried to initiate
some kind of transfer

1592
01:16:54,350 --> 01:16:56,360
and, ultimately, tried
to transfer money.

1593
01:16:56,360 --> 01:16:59,330
And it doesn't even necessarily
need to be in a link.

1594
01:16:59,330 --> 01:17:03,230
How else might you get some new request
to happen inside of the web browser?

1595
01:17:03,230 --> 01:17:05,690
You might imagine-- though
it might seem a bit strange--

1596
01:17:05,690 --> 01:17:08,450
to put this inside of an image.

1597
01:17:08,450 --> 01:17:13,250
Image source, the source of the
image, is this particular URL--

1598
01:17:13,250 --> 01:17:14,493
the bank's transfer page.

1599
01:17:14,493 --> 01:17:16,160
Now, that doesn't really make any sense.

1600
01:17:16,160 --> 01:17:17,840
The transfer page is not an image.

1601
01:17:17,840 --> 01:17:19,340
But it doesn't matter.

1602
01:17:19,340 --> 01:17:24,590
All an image tag is going to do is try
to make a request to this source URL

1603
01:17:24,590 --> 01:17:28,527
to get that image and then try to
display it in the user's web browser.

1604
01:17:28,527 --> 01:17:31,610
But the first part is what's important--
the fact that this source ends up

1605
01:17:31,610 --> 01:17:33,650
being requested by the web browser.

1606
01:17:33,650 --> 01:17:36,380
Without the user having to
click on or do anything,

1607
01:17:36,380 --> 01:17:40,850
they might try and request from
yourbank.com/transfer this particular

1608
01:17:40,850 --> 01:17:45,500
request, which might initiate some sort
of bank transfer without the user even

1609
01:17:45,500 --> 01:17:46,580
realizing it.

1610
01:17:46,580 --> 01:17:49,160
And it's for that reason that
we generally suggest that,

1611
01:17:49,160 --> 01:17:54,560
anytime you're creating a website that
is going to allow for the manipulation

1612
01:17:54,560 --> 01:17:57,500
of some kind of state-- that
allows for some change to happen,

1613
01:17:57,500 --> 01:17:59,210
something like transferring money--

1614
01:17:59,210 --> 01:18:02,450
you don't want that to be a GET
request, something that you could just

1615
01:18:02,450 --> 01:18:06,515
load in an image or load by clicking on
a link that takes you to another page.

1616
01:18:06,515 --> 01:18:08,390
You don't want that to
happen because then it

1617
01:18:08,390 --> 01:18:12,350
makes it very easy for someone
else to fake a request to your page

1618
01:18:12,350 --> 01:18:16,790
by just creating an image or
linking to, somehow, a website,

1619
01:18:16,790 --> 01:18:20,005
transferring funds from
one user to another.

1620
01:18:20,005 --> 01:18:22,130
So a solution to this--
and we've talked about it--

1621
01:18:22,130 --> 01:18:24,920
is that, generally, we
only want POST requests

1622
01:18:24,920 --> 01:18:27,860
to be able to manipulate
something inside of the database,

1623
01:18:27,860 --> 01:18:32,330
to be able to actually initiate a
transfer from one user to another user.
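
A rough sketch of restricting a state-changing route to POST. The function `handle_transfer` and its `(status, message)` return shape are hypothetical and framework-independent; in Django itself, the `require_POST` decorator from `django.views.decorators.http` serves a similar purpose.

```python
def handle_transfer(method, params):
    """Hypothetical handler: only a POST request may start a transfer."""
    if method != "POST":
        # A link or an image tag can only trigger a GET, so it is rejected here.
        return 405, "Method Not Allowed"
    return 200, f"Transferred {params['amount']} to {params['to']}"
```

As the lecture goes on to explain, this check alone is not enough, since a form on another site can still submit a POST request.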

1624
01:18:32,330 --> 01:18:35,210
But even then, this is
not perfectly secure.

1625
01:18:35,210 --> 01:18:38,660
You could still be tricked
into submitting a POST request.

1626
01:18:38,660 --> 01:18:42,320
Imagine an adversarial website
that had a form like this--

1627
01:18:42,320 --> 01:18:47,120
a form whose action was
yourbank.com/transfer and whose method was

1628
01:18:47,120 --> 01:18:48,200
POST.

1629
01:18:48,200 --> 01:18:52,370
And now here-- two input fields
whose type is hidden, meaning you

1630
01:18:52,370 --> 01:18:55,040
won't actually be able to
see those input fields when

1631
01:18:55,040 --> 01:18:56,420
the user is looking at the page.

1632
01:18:56,420 --> 01:18:59,090
They'd only know about it
if they inspected the source

1633
01:18:59,090 --> 01:19:03,120
code of this particular HTML page.

1634
01:19:03,120 --> 01:19:05,550
Here, there's a hidden
input whose name is to,

1635
01:19:05,550 --> 01:19:07,840
meaning the person I'd
like to transfer money to.

1636
01:19:07,840 --> 01:19:10,470
Here is the amount, the value
that I would like to transfer.

1637
01:19:10,470 --> 01:19:14,153
And all the user is going to see
is a button that says, click here.

1638
01:19:14,153 --> 01:19:17,320
They're not going to see either of the
input fields, because they're hidden.

1639
01:19:17,320 --> 01:19:19,740
But if they do click the
Click Here button, well, then

1640
01:19:19,740 --> 01:19:22,950
suddenly they're going to be
submitting a POST request to the bank

1641
01:19:22,950 --> 01:19:25,525
and initiating some transfer
when they didn't intend to.

1642
01:19:25,525 --> 01:19:28,650
Now, maybe this seems like, oh, it's
not a big deal, because the user still

1643
01:19:28,650 --> 01:19:29,850
needs to click a button.

1644
01:19:29,850 --> 01:19:31,767
And the user shouldn't
be clicking on a button

1645
01:19:31,767 --> 01:19:33,990
if they don't know what
the button is going to do.

1646
01:19:33,990 --> 01:19:38,280
Well, for one, it's probably reasonable
to imagine that an adversary might

1647
01:19:38,280 --> 01:19:41,010
embed this button inside of
a page where it looks totally

1648
01:19:41,010 --> 01:19:42,820
safe to be able to click on a button.

1649
01:19:42,820 --> 01:19:45,960
But moreover, the user doesn't
even need to click on it in order

1650
01:19:45,960 --> 01:19:47,010
to submit the form.

1651
01:19:47,010 --> 01:19:49,170
We can just add a little
bit of JavaScript.

1652
01:19:49,170 --> 01:19:52,710
You might imagine that an adversary
could do something like this.

1653
01:19:52,710 --> 01:19:55,560
Add an onload attribute
to the body that says,

1654
01:19:55,560 --> 01:19:59,250
when the body of the page is done
loading, go to document.forms--

1655
01:19:59,250 --> 01:20:01,680
meaning all of the
forms for this web page.

1656
01:20:01,680 --> 01:20:04,590
Get the first one, and submit it.

1657
01:20:04,590 --> 01:20:06,320
Submit the form.

1658
01:20:06,320 --> 01:20:09,450
And what that's going to do is, even
without the user doing anything--

1659
01:20:09,450 --> 01:20:12,330
even without the user clicking
on the Click Here button--

1660
01:20:12,330 --> 01:20:15,420
as soon as this page is loaded,
this form is going to submit,

1661
01:20:15,420 --> 01:20:19,050
submitting a POST request to the
bank, and attempting to transfer funds

1662
01:20:19,050 --> 01:20:21,120
from one user to another user.

1663
01:20:21,120 --> 01:20:23,760
And so this is what we might
call a cross-site request

1664
01:20:23,760 --> 01:20:29,220
forgery where some adversarial website
has forged a request to our website.

1665
01:20:29,220 --> 01:20:32,870
And ideally, we wouldn't like
for that to be able to happen.

1666
01:20:32,870 --> 01:20:35,030
So how do we guard against this?

1667
01:20:35,030 --> 01:20:39,780
Well, what Django allows us to do and
a very common approach is to add a CSRF

1668
01:20:39,780 --> 01:20:42,390
token-- a Cross-Site
Request Forgery token--

1669
01:20:42,390 --> 01:20:46,320
that is going to be
regenerated for every session

1670
01:20:46,320 --> 01:20:48,740
such that, only if
that token is present,

1671
01:20:48,740 --> 01:20:51,610
will the transfer be able to go through.

1672
01:20:51,610 --> 01:20:57,360
So on our website, we can include the
CSRF token inside of this HTML form

1673
01:20:57,360 --> 01:21:00,510
and, as a result, make sure that
we're able to transfer money only

1674
01:21:00,510 --> 01:21:02,650
when the CSRF token is present.

1675
01:21:02,650 --> 01:21:05,220
But if some other website
tries to forge a request,

1676
01:21:05,220 --> 01:21:07,710
they won't know what
the CSRF token should be

1677
01:21:07,710 --> 01:21:09,840
because it changes for every session.

1678
01:21:09,840 --> 01:21:14,730
And therefore, they won't be able to
actually forge a request from one user

1679
01:21:14,730 --> 01:21:16,510
to another.
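
The token check can be sketched in a few lines. This is an illustration of the idea only, not Django's actual implementation; `new_csrf_token` and `is_valid_csrf` are hypothetical names, built on the standard `secrets` and `hmac` modules.

```python
import hmac
import secrets


def new_csrf_token():
    """Generate an unpredictable token to store in the user's session."""
    return secrets.token_urlsafe(32)


def is_valid_csrf(session_token, submitted_token):
    """Accept the form only if it carries the token issued for this session."""
    if not session_token or not submitted_token:
        return False
    # compare_digest compares in constant time, which avoids leaking how much
    # of a guessed token was correct.
    return hmac.compare_digest(session_token.encode(), submitted_token.encode())
```

A forged form on another site has no way to read the session's token, so its submission would fail this check.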

1680
01:21:16,510 --> 01:21:19,590
So all across the various
different tools and technologies

1681
01:21:19,590 --> 01:21:20,340
we've been using--

1682
01:21:20,340 --> 01:21:25,710
Python, HTTP, Django, HTML in
terms of creating these web

1683
01:21:25,710 --> 01:21:27,990
applications using
JavaScript, and the APIs

1684
01:21:27,990 --> 01:21:29,460
that we might be interacting with--

1685
01:21:29,460 --> 01:21:31,710
there are security
considerations all throughout.

1686
01:21:31,710 --> 01:21:33,623
We've only touched on
a couple of them here.

1687
01:21:33,623 --> 01:21:36,540
But it just goes to show how it's
important to be mindful as you think

1688
01:21:36,540 --> 01:21:39,790
about the practice of web programming,
thinking about what you're going to add

1689
01:21:39,790 --> 01:21:42,960
to your web applications and what
features your web application supports,

1690
01:21:42,960 --> 01:21:46,260
to think about what the potential
vulnerabilities are as well--

1691
01:21:46,260 --> 01:21:49,920
how someone might exploit your web
application in order to do something

1692
01:21:49,920 --> 01:21:51,690
with it that they probably shouldn't.

1693
01:21:51,690 --> 01:21:54,450
And as you take your web
applications from applications

1694
01:21:54,450 --> 01:21:57,015
that are just running on
your own local computer

1695
01:21:57,015 --> 01:21:59,940
to applications that are
running in some web server

1696
01:21:59,940 --> 01:22:02,130
that many people are
starting to use, these

1697
01:22:02,130 --> 01:22:04,420
are the types of questions
to start to be asking.

1698
01:22:04,420 --> 01:22:07,740
How can you make sure that your
web application is scalable?

1699
01:22:07,740 --> 01:22:11,740
How can you make sure that
your web application is secure?

1700
01:22:11,740 --> 01:22:15,392
So now that we've explored a lot
of web programming, what comes next?

1701
01:22:15,392 --> 01:22:17,850
In this course, we've explored
a number of different tools,

1702
01:22:17,850 --> 01:22:19,470
and technologies, and languages.

1703
01:22:19,470 --> 01:22:21,540
But there are many other
web frameworks and ways

1704
01:22:21,540 --> 01:22:23,850
you can build web applications as well.

1705
01:22:23,850 --> 01:22:26,220
We spent most of our time
looking at the Django web

1706
01:22:26,220 --> 01:22:27,580
framework, written in Python.

1707
01:22:27,580 --> 01:22:29,430
But you can use other
programming languages

1708
01:22:29,430 --> 01:22:31,560
to build web applications as well.

1709
01:22:31,560 --> 01:22:34,980
Express.js, for example, is a
very popular JavaScript framework

1710
01:22:34,980 --> 01:22:36,480
for building web applications.

1711
01:22:36,480 --> 01:22:41,390
Ruby on Rails is a popular server-side
web framework built using Ruby.

1712
01:22:41,390 --> 01:22:43,020
And there are many others as well.

1713
01:22:43,020 --> 01:22:44,730
And there are also
client-side frameworks

1714
01:22:44,730 --> 01:22:48,540
used primarily with JavaScript to
be able to build user interfaces.

1715
01:22:48,540 --> 01:22:51,750
We've seen a little bit of React to build
both dynamic and interactive user

1716
01:22:51,750 --> 01:22:52,620
interfaces.

1717
01:22:52,620 --> 01:22:56,490
Other popular client-side frameworks
include AngularJS, and Vue.js,

1718
01:22:56,490 --> 01:22:58,343
and a number of others as well.

1719
01:22:58,343 --> 01:23:00,510
And then, once you've built
these web applications--

1720
01:23:00,510 --> 01:23:03,600
using any of these server-side
frameworks and client-side frameworks--

1721
01:23:03,600 --> 01:23:06,360
then you might imagine wanting
to take these applications

1722
01:23:06,360 --> 01:23:07,645
and deploy them to the web.

1723
01:23:07,645 --> 01:23:10,020
And to do that, there are a
number of ways we can do this

1724
01:23:10,020 --> 01:23:13,950
as well-- a number of different services
including Amazon Web Services, AWS,

1725
01:23:13,950 --> 01:23:17,730
Google Cloud, and Microsoft Azure
that can be used in order to deploy

1726
01:23:17,730 --> 01:23:19,530
these web applications.

1727
01:23:19,530 --> 01:23:22,320
Heroku is a service that
uses AWS and tries

1728
01:23:22,320 --> 01:23:26,100
to simplify the process of making it
easier to deploy your web applications.

1729
01:23:26,100 --> 01:23:29,340
And if your web application is
really just static-- it's just HTML,

1730
01:23:29,340 --> 01:23:33,300
and CSS, and JavaScript-- well, then
you can use something like GitHub Pages

1731
01:23:33,300 --> 01:23:37,945
to be able to host a web application for
free on GitHub's own servers instead.

1732
01:23:37,945 --> 01:23:41,070
And there are many other ways you can
imagine deploying web applications as

1733
01:23:41,070 --> 01:23:43,395
well-- different services
that you can use in order

1734
01:23:43,395 --> 01:23:46,020
to take the web applications that
you have been building or web

1735
01:23:46,020 --> 01:23:47,940
applications you might
build in the future

1736
01:23:47,940 --> 01:23:52,870
and make them available on the internet
for others to be able to use as well.

1737
01:23:52,870 --> 01:23:56,550
So as we look back on the various topics
within web programming we've explored,

1738
01:23:56,550 --> 01:23:58,690
we've seen a lot of
tools and technologies

1739
01:23:58,690 --> 01:24:02,760
we can use that we can leverage in order
to build interesting web applications.

1740
01:24:02,760 --> 01:24:06,930
We started by taking a
closer look at HTML and CSS,

1741
01:24:06,930 --> 01:24:10,080
diving into how we can use that to
describe the structure of our page,

1742
01:24:10,080 --> 01:24:12,210
and then taking advantage
of tools like Sass

1743
01:24:12,210 --> 01:24:15,570
that allow us to generate
CSS that allows for much more

1744
01:24:15,570 --> 01:24:18,270
complex styling for our website
that would have been much more

1745
01:24:18,270 --> 01:24:21,090
difficult to do with just CSS alone.

1746
01:24:21,090 --> 01:24:24,240
As we started to build larger web
applications, we took a look at Git--

1747
01:24:24,240 --> 01:24:26,610
version control tools
that we can use in order

1748
01:24:26,610 --> 01:24:29,370
to make sure that we keep track
of versions and changes we

1749
01:24:29,370 --> 01:24:33,240
make to our code, allowing multiple
people to collaborate on a project

1750
01:24:33,240 --> 01:24:34,547
simultaneously.

1751
01:24:34,547 --> 01:24:37,380
We then took a look at Python,
looking at various different features

1752
01:24:37,380 --> 01:24:40,697
that the language offered--
functions, and conditions, and loops,

1753
01:24:40,697 --> 01:24:42,780
as we've seen in many other
programming languages.

1754
01:24:42,780 --> 01:24:45,210
But also object-oriented
programming-- the ability

1755
01:24:45,210 --> 01:24:47,700
to represent objects, and
methods, and functions

1756
01:24:47,700 --> 01:24:49,950
that operate on those
particular objects, which

1757
01:24:49,950 --> 01:24:53,940
prove especially powerful in the context
of dealing with data inside of our web

1758
01:24:53,940 --> 01:24:55,380
applications.

1759
01:24:55,380 --> 01:24:58,500
Django was the example of a
web framework written in Python

1760
01:24:58,500 --> 01:25:00,510
that we used to very
quickly be able to start up

1761
01:25:00,510 --> 01:25:04,500
a web application, that's able to
listen for requests, and make responses.

1762
01:25:04,500 --> 01:25:06,600
Django has a whole lot
of features built in that

1763
01:25:06,600 --> 01:25:10,072
really make it easy to get started
with building a web application.

1764
01:25:10,072 --> 01:25:12,030
And in particular, it
makes it easy for writing

1765
01:25:12,030 --> 01:25:14,260
web applications that deal with data.

1766
01:25:14,260 --> 01:25:16,860
So Django allows us the
ability to build models

1767
01:25:16,860 --> 01:25:20,760
that interact with SQL without us
having to actually write any SQL code.

1768
01:25:20,760 --> 01:25:25,320
Django can generate the SQL for us just
using these models and migrations that

1769
01:25:25,320 --> 01:25:29,020
allow us to continually apply
changes that we make to our database.

1770
01:25:29,020 --> 01:25:33,330
As we add new tables, add and modify
existing fields on those tables,

1771
01:25:33,330 --> 01:25:36,065
Django can take care of all of that.

1772
01:25:36,065 --> 01:25:38,190
After that, as you'll
recall, we turned our attention

1773
01:25:38,190 --> 01:25:40,440
towards the second of the
main programming languages

1774
01:25:40,440 --> 01:25:44,950
in the course, JavaScript, which has a
lot of uses and is very, very popular.

1775
01:25:44,950 --> 01:25:46,920
But we primarily use
it on the client side

1776
01:25:46,920 --> 01:25:50,460
to be able to build interesting
user interfaces-- using JavaScript

1777
01:25:50,460 --> 01:25:52,680
to manipulate the DOM,
the structure of the page,

1778
01:25:52,680 --> 01:25:54,930
to change what it is the user sees.

1779
01:25:54,930 --> 01:25:56,850
And also to add event
handling-- so that when

1780
01:25:56,850 --> 01:25:59,880
the user clicks on a button, when
the user hovers over something, when

1781
01:25:59,880 --> 01:26:02,550
the user interacts with the
page in some sort of way,

1782
01:26:02,550 --> 01:26:04,590
our code is able to respond to it.

1783
01:26:04,590 --> 01:26:09,540
And we saw React, a client-side
framework that uses JavaScript in order

1784
01:26:09,540 --> 01:26:13,470
to allow us to create really interesting
and interactive user interfaces

1785
01:26:13,470 --> 01:26:15,893
with not all that much code at all.

1786
01:26:15,893 --> 01:26:18,060
And then, finally, in these
last couple of lectures,

1787
01:26:18,060 --> 01:26:21,350
we've been looking at some best
practices-- how we can design tests,

1788
01:26:21,350 --> 01:26:23,520
tests that test the server,
but also the client

1789
01:26:23,520 --> 01:26:25,800
to make sure that our code
is working appropriately,

1790
01:26:25,800 --> 01:26:28,860
and also some industry practices
like continuous integration

1791
01:26:28,860 --> 01:26:31,140
and continuous delivery
that just help to make sure

1792
01:26:31,140 --> 01:26:34,740
that, as we make changes to our code,
we're able to deploy and deliver them

1793
01:26:34,740 --> 01:26:37,050
rapidly and effectively
and make sure that we're

1794
01:26:37,050 --> 01:26:39,630
able to make incremental
changes to our code base

1795
01:26:39,630 --> 01:26:42,460
rather than need to wait
on longer release cycles.

1796
01:26:42,460 --> 01:26:44,520
And then finally, today,
we've been talking

1797
01:26:44,520 --> 01:26:47,820
about issues of scalability
and security, especially important

1798
01:26:47,820 --> 01:26:50,880
as we begin to take our applications
and move them to the web.

1799
01:26:50,880 --> 01:26:53,562
We want to make sure that these
applications are scalable,

1800
01:26:53,562 --> 01:26:55,770
that they're able to handle
multiple different users,

1801
01:26:55,770 --> 01:26:57,720
and also to make sure
that they're secure--

1802
01:26:57,720 --> 01:27:01,050
that we're not exposing ourselves to
potential vulnerabilities like someone

1803
01:27:01,050 --> 01:27:05,370
who might inject SQL or inject
JavaScript code into our pages

1804
01:27:05,370 --> 01:27:08,730
or who might try to access some data
that they're not supposed to access.

1805
01:27:08,730 --> 01:27:12,420
We want to make sure that, when we go
about designing these web applications,

1806
01:27:12,420 --> 01:27:17,330
we're able to do so in a scalable
and, ultimately, in a secure way.

1807
01:27:17,330 --> 01:27:19,080
So hopefully, you
enjoyed this exploration

1808
01:27:19,080 --> 01:27:21,747
into the world of web programming
with Python and JavaScript.

1809
01:27:21,747 --> 01:27:23,580
Best of luck with the
web programs that you

1810
01:27:23,580 --> 01:27:26,130
yourself might build with the
tools we've seen here today,

1811
01:27:26,130 --> 01:27:29,310
and also other tools that are
inspired by or use similar tools

1812
01:27:29,310 --> 01:27:32,130
and techniques and ideas as the
things that we've ultimately

1813
01:27:32,130 --> 01:27:32,880
talked about here.

1814
01:27:32,880 --> 01:27:35,672
A big thanks to the course's teaching
staff and the production team

1815
01:27:35,672 --> 01:27:37,255
for making this entire class possible.

1816
01:27:37,255 --> 01:27:39,130
I look forward to seeing
the web applications

1817
01:27:39,130 --> 01:27:40,620
that you might go on to create.

1818
01:27:40,620 --> 01:27:45,110
This was Web Programming
with Python and JavaScript.