1
00:00:00,000 --> 00:00:00,500


2
00:00:00,500 --> 00:00:03,872
[MUSIC PLAYING]

3
00:00:03,872 --> 00:00:12,110


4
00:00:12,110 --> 00:00:14,130
BRIAN YU: OK, let's get started.

5
00:00:14,130 --> 00:00:16,594
Welcome, everyone, to the
final day of CS50 Beyond.

6
00:00:16,594 --> 00:00:19,010
And the goal for today is going
to be to take a look at things

7
00:00:19,010 --> 00:00:20,150
at a bit of a higher level.

8
00:00:20,150 --> 00:00:22,340
There is going to be less
code in today's lecture.

9
00:00:22,340 --> 00:00:24,440
The focus of today is
on two main topics--

10
00:00:24,440 --> 00:00:26,914
security and scalability--
which are both important as you

11
00:00:26,914 --> 00:00:30,080
begin to think about, you're writing
all this code for your web application.

12
00:00:30,080 --> 00:00:32,570
You're ready to deploy it so
that people can actually use it.

13
00:00:32,570 --> 00:00:35,153
What are the sorts of considerations
you need to bear in mind?

14
00:00:35,153 --> 00:00:37,310
What are the security
considerations in making

15
00:00:37,310 --> 00:00:42,110
sure that wherever you're hosting the
application, you and the application

16
00:00:42,110 --> 00:00:45,920
itself are secure and that your users are
secure from potential vulnerabilities

17
00:00:45,920 --> 00:00:47,120
or potential threats?

18
00:00:47,120 --> 00:00:49,235
And also, from a
scalability perspective,

19
00:00:49,235 --> 00:00:51,860
we've been designing applications
that so far probably only you

20
00:00:51,860 --> 00:00:53,690
or a couple other
people have been using.

21
00:00:53,690 --> 00:00:55,815
But what sorts of things
do you need to think about

22
00:00:55,815 --> 00:00:59,060
as your applications begin to scale, as
more and more people begin to use them,

23
00:00:59,060 --> 00:01:02,150
and you have to begin to think about
this idea of multiple people trying

24
00:01:02,150 --> 00:01:04,995
to use the same application
at the same time?

25
00:01:04,995 --> 00:01:07,370
So a number of different
considerations come about there.

26
00:01:07,370 --> 00:01:09,190
We'll show a couple of code examples.

27
00:01:09,190 --> 00:01:12,440
But the main idea of this is going to
be high level, just thinking abstractly,

28
00:01:12,440 --> 00:01:15,560
sort of trying to design the product,
trying to design the project,

29
00:01:15,560 --> 00:01:18,860
trying to figure out how exactly we
need to be adjusting our application

30
00:01:18,860 --> 00:01:22,460
to make sure that it's secure and
to make sure that it's scalable.

31
00:01:22,460 --> 00:01:24,230
So we'll go ahead and
start with security.

32
00:01:24,230 --> 00:01:25,500
And on the topic of
security, we're going

33
00:01:25,500 --> 00:01:28,010
to look at a number of different
security considerations

34
00:01:28,010 --> 00:01:30,935
as we move all throughout the week,
from the beginning of the week

35
00:01:30,935 --> 00:01:33,560
until the end of the week, thinking
about the types of security

36
00:01:33,560 --> 00:01:35,370
implications that come about.

37
00:01:35,370 --> 00:01:38,180
And so one of the first things we
introduced in the class was Git,

38
00:01:38,180 --> 00:01:39,971
the version control
tool that we were using

39
00:01:39,971 --> 00:01:42,374
to keep track of different
versions of our code

40
00:01:42,374 --> 00:01:45,290
in order to manage different branches
of our code, so on and so forth.

41
00:01:45,290 --> 00:01:48,165
And so a couple of important security
considerations to be aware of with

42
00:01:48,165 --> 00:01:49,040
regards to Git.

43
00:01:49,040 --> 00:01:51,440
You all probably created
GitHub repositories

44
00:01:51,440 --> 00:01:53,990
over the course of this week,
maybe for the first time.

45
00:01:53,990 --> 00:01:56,870
And GitHub repositories
by default are public.

46
00:01:56,870 --> 00:02:00,230
And this is in the spirit of the idea
of open source software, the idea

47
00:02:00,230 --> 00:02:01,790
that anyone can see the code.

48
00:02:01,790 --> 00:02:03,590
Anyone can contribute to the code.

49
00:02:03,590 --> 00:02:05,150
And that, of course,
comes with its trade offs.

50
00:02:05,150 --> 00:02:07,580
On one hand, everyone being
able to see the code certainly

51
00:02:07,580 --> 00:02:10,850
means that anyone can help you
to find bugs and identify bugs.

52
00:02:10,850 --> 00:02:13,640
But it also means that anyone on
the internet can see the code,

53
00:02:13,640 --> 00:02:15,680
look for potential
vulnerabilities, and then

54
00:02:15,680 --> 00:02:18,060
potentially take advantage
of those vulnerabilities.

55
00:02:18,060 --> 00:02:20,230
So definitely, trade offs,
costs, and benefits that

56
00:02:20,230 --> 00:02:21,980
come along with open source software.

57
00:02:21,980 --> 00:02:25,760
And another thing just to be aware of,
we mentioned this earlier in the week,

58
00:02:25,760 --> 00:02:30,740
but your Git commit history is going
to store the entire history of any

59
00:02:30,740 --> 00:02:33,200
of the commits that you have
made, as the name might imply.

60
00:02:33,200 --> 00:02:35,416
And so if you make a
commit and you do something

61
00:02:35,416 --> 00:02:38,540
you shouldn't have done, for instance--
you make a commit that accidentally

62
00:02:38,540 --> 00:02:41,509
includes database credentials
inside of the commit somewhere

63
00:02:41,509 --> 00:02:43,300
or includes a password
inside of the commit

64
00:02:43,300 --> 00:02:45,920
somewhere-- you can later
on remove those credentials

65
00:02:45,920 --> 00:02:48,560
and make another commit
and remove the credentials.

66
00:02:48,560 --> 00:02:51,320
But the credentials are still
there inside of the history.

67
00:02:51,320 --> 00:02:53,630
If you go back, you could
still find the credentials

68
00:02:53,630 --> 00:02:56,060
if you had access to the
entire Git repository

69
00:02:56,060 --> 00:02:58,754
and could go back and find
that point in Git's history.

70
00:02:58,754 --> 00:03:01,670
So what are the potential solutions
for if you do something like this,

71
00:03:01,670 --> 00:03:04,640
accidentally expose credentials
at some point in the repository

72
00:03:04,640 --> 00:03:06,410
and then remove them?

73
00:03:06,410 --> 00:03:08,090
What could you do?

74
00:03:08,090 --> 00:03:08,662
Yeah?

75
00:03:08,662 --> 00:03:09,840
AUDIENCE: Change the credentials.

76
00:03:09,840 --> 00:03:10,160
BRIAN YU: Certainly.

77
00:03:10,160 --> 00:03:12,993
Changing the credentials, something
you should almost definitely do.

78
00:03:12,993 --> 00:03:13,900
Change the password.

79
00:03:13,900 --> 00:03:16,400
It's not enough just to remove
them and make another commit.

80
00:03:16,400 --> 00:03:19,310
And there's also something you
can do known as purging Git history, where

81
00:03:19,310 --> 00:03:23,120
you can effectively purge the history
of a commit, sort of overwrite history,

82
00:03:23,120 --> 00:03:25,470
so to speak, in order to
replace that, as well.

83
00:03:25,470 --> 00:03:27,800
But even that, if it's
been online on GitHub,

84
00:03:27,800 --> 00:03:30,258
who knows who may have been
able to access the credentials?

85
00:03:30,258 --> 00:03:33,830
So definitely always a good
idea to remove those, as well.

86
00:03:33,830 --> 00:03:36,860
On the first day, we
also took a look at HTML.

87
00:03:36,860 --> 00:03:38,767
We were designing basic HTML pages.

88
00:03:38,767 --> 00:03:40,850
And there are a number of
security vulnerabilities

89
00:03:40,850 --> 00:03:43,490
you could create just with HTML alone.

90
00:03:43,490 --> 00:03:48,320
Perhaps one of the most basic is just
the idea that the contents of a link

91
00:03:48,320 --> 00:03:50,529
can differ from where
the link takes you to.

92
00:03:50,529 --> 00:03:52,820
There's probably a pretty
obvious point where you often

93
00:03:52,820 --> 00:03:54,751
have text that links you
to a particular page.

94
00:03:54,751 --> 00:03:56,750
But this can often be
misleading and is commonly

95
00:03:56,750 --> 00:03:59,150
used in phishing email
attacks, for instance,

96
00:03:59,150 --> 00:04:02,300
whereby you have a link
that takes you to URL one,

97
00:04:02,300 --> 00:04:06,260
but by default, it shows you URL two,
which can be misleading, for sure.

98
00:04:06,260 --> 00:04:10,640
Or I can have situations where I could--

99
00:04:10,640 --> 00:04:15,440
let's go into link.html--

100
00:04:15,440 --> 00:04:18,230
I have a link that presumably
takes me to google.com.

101
00:04:18,230 --> 00:04:20,870
But if I click on google.com,
it could take me anywhere else--

102
00:04:20,870 --> 00:04:22,580
to some other site, for instance.

103
00:04:22,580 --> 00:04:28,660
And the way that it does
that is quite simply by just

104
00:04:28,660 --> 00:04:33,310
having a link that takes you to a
URL, but the contents of that URL

105
00:04:33,310 --> 00:04:35,594
are something different or
something else entirely.
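
A minimal Python sketch of how that mismatch could be detected, using only the standard library. The names LinkAuditor, host_of, and misleading_links are hypothetical helpers made up for this illustration, and example.com stands in for "some other site"; the lecture's link.html itself is not reproduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAuditor(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(value):
    """Return the lowercase host name in value, or '' if none is found."""
    if "//" not in value:
        value = "//" + value  # lets urlparse treat "google.com" as a host
    return (urlparse(value).hostname or "").lower()


def misleading_links(html):
    """Return links whose visible text names a different host than the href."""
    auditor = LinkAuditor()
    auditor.feed(html)
    suspicious = []
    for href, text in auditor.links:
        shown, actual = host_of(text), host_of(href)
        # Only flag text that looks like a domain (has a dot, no spaces).
        if "." in shown and " " not in text and shown != actual:
            suspicious.append((text, href))
    return suspicious


# The text says google.com, but the href points somewhere else.
page = '<a href="https://example.com/login">google.com</a>'
print(misleading_links(page))
```

This is only a rough heuristic: it compares host names and ignores links whose text is ordinary wording such as "Sign In", which can be just as misleading.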

106
00:04:35,594 --> 00:04:37,510
And so that alone is
something to be aware of.

107
00:04:37,510 --> 00:04:40,060
But that problem is compounded
when you consider the idea

108
00:04:40,060 --> 00:04:42,580
that even though your server-side
code-- application code

109
00:04:42,580 --> 00:04:45,040
you write in Python and
Flask, for instance--

110
00:04:45,040 --> 00:04:48,040
you can keep secret from
your users, HTML code is not

111
00:04:48,040 --> 00:04:49,030
kept secret from users.

112
00:04:49,030 --> 00:04:52,051
Any users can see HTML and do
whatever they want with it.

113
00:04:52,051 --> 00:04:53,800
And so on the first
day, you may have been

114
00:04:53,800 --> 00:04:56,650
trying to take a look at an HTML
page and try and replicate it

115
00:04:56,650 --> 00:04:59,280
using your own HTML
and CSS, for example.

116
00:04:59,280 --> 00:05:01,030
The simplest way to
do something like that

117
00:05:01,030 --> 00:05:02,613
would just be to copy the source code.

118
00:05:02,613 --> 00:05:09,790
So I could go to bankofamerica.com, for
instance, Control-Click on the page,

119
00:05:09,790 --> 00:05:12,070
view the page source, and all right.

120
00:05:12,070 --> 00:05:15,100
Here's all the HTML on Bank
of America's home page.

121
00:05:15,100 --> 00:05:26,720
I could copy that, create a new
file, and call it bank.html.

122
00:05:26,720 --> 00:05:29,690
Paste the contents of it in here.

123
00:05:29,690 --> 00:05:32,830
Go ahead and save that.

124
00:05:32,830 --> 00:05:35,072
And now, open up bank.html.

125
00:05:35,072 --> 00:05:38,280
And now, I've got a page that basically
looks like Bank of America's website.

126
00:05:38,280 --> 00:05:39,180
And now, I could go in.

127
00:05:39,180 --> 00:05:41,679
I could modify the links, change
where Sign In takes you to,

128
00:05:41,679 --> 00:05:43,600
make it take you to
somewhere else entirely.

129
00:05:43,600 --> 00:05:45,420
And so these are potential
threats, vulnerabilities,

130
00:05:45,420 --> 00:05:48,300
to be aware of on the internet
that are quite easy to actually do.

131
00:05:48,300 --> 00:05:51,777
So this is less about when you're
designing your own web applications

132
00:05:51,777 --> 00:05:54,360
but, when you're using web
applications, the types of security

133
00:05:54,360 --> 00:05:56,420
concerns to definitely be aware of.

134
00:05:56,420 --> 00:06:00,274


135
00:06:00,274 --> 00:06:02,690
So let's keep moving forward
in the week-- yeah, question?

136
00:06:02,690 --> 00:06:05,350
AUDIENCE: Can you copy JavaScript
source code in the same way?

137
00:06:05,350 --> 00:06:05,933
BRIAN YU: Yes.

138
00:06:05,933 --> 00:06:08,760
Any JavaScript code that is
on the client, you can access

139
00:06:08,760 --> 00:06:09,990
and you can modify.

140
00:06:09,990 --> 00:06:12,300
You can change variables
and so on and so forth.

141
00:06:12,300 --> 00:06:15,420
And this is actually a
pretty easy thing to do.

142
00:06:15,420 --> 00:06:20,640
So if I go to like, I don't know, The
New York Times website, for instance,

143
00:06:20,640 --> 00:06:24,070
and I look at the source code there--

144
00:06:24,070 --> 00:06:26,670
let me go ahead and inspect
the element, and I'll

145
00:06:26,670 --> 00:06:31,040
try and hover over a main headline.

146
00:06:31,040 --> 00:06:32,760
OK.

147
00:06:32,760 --> 00:06:35,147
This is the name of a CSS class.

148
00:06:35,147 --> 00:06:36,480
You could access any JavaScript.

149
00:06:36,480 --> 00:06:39,460
You can also run any JavaScript
in the console arbitrarily.

150
00:06:39,460 --> 00:06:45,750
So I could say, all right,
document.querySelectorAll-- let's

151
00:06:45,750 --> 00:06:48,510
get everything with that CSS class.

152
00:06:48,510 --> 00:06:51,880
Or maybe it's just the first one,
because it's two CSS classes.

153
00:06:51,880 --> 00:06:52,380
All right.

154
00:06:52,380 --> 00:06:53,040
Great.

155
00:06:53,040 --> 00:06:56,790
I'll take the first one,
set its innerHTML to be,

156
00:06:56,790 --> 00:07:01,800
like, welcome to CS50 Beyond.

157
00:07:01,800 --> 00:07:05,400
And you can play around with websites
in order to mess around, change them.

158
00:07:05,400 --> 00:07:07,890
So all of the JavaScript,
CSS classes, all of that,

159
00:07:07,890 --> 00:07:10,410
is accessible to anyone who is
using the page, for example.

160
00:07:10,410 --> 00:07:14,980


161
00:07:14,980 --> 00:07:16,520
Other questions before I go on?

162
00:07:16,520 --> 00:07:17,020
Yeah.

163
00:07:17,020 --> 00:07:19,520
AUDIENCE: Any thoughts on
JavaScript obfuscation?

164
00:07:19,520 --> 00:07:22,270
BRIAN YU: JavaScript obfuscation--
certainly something you can do.

165
00:07:22,270 --> 00:07:26,950
So since JavaScript is available to
anyone who has access to the web page,

166
00:07:26,950 --> 00:07:29,910
there are programs called
JavaScript obfuscators

167
00:07:29,910 --> 00:07:32,320
that basically take plain
old looking JavaScript

168
00:07:32,320 --> 00:07:34,840
and convert it into something
that's still JavaScript

169
00:07:34,840 --> 00:07:37,480
but that's very difficult
for any human to decipher.

170
00:07:37,480 --> 00:07:41,140
It changes variable names and does
a bunch of tricks in JavaScript

171
00:07:41,140 --> 00:07:46,257
to still execute the exact same
way but that looks quite obscure.

172
00:07:46,257 --> 00:07:47,590
Definitely something you can do.

173
00:07:47,590 --> 00:07:49,930
Still not totally foolproof,
because there are ways

174
00:07:49,930 --> 00:07:53,500
of trying to deobfuscate JavaScript
code, at least to some extent.

175
00:07:53,500 --> 00:07:57,780
So it's not perfect, but definitely
something that you can do.

176
00:07:57,780 --> 00:08:00,470
Other things?

177
00:08:00,470 --> 00:08:00,970
All right.

178
00:08:00,970 --> 00:08:01,810
Let's take a look at--

179
00:08:01,810 --> 00:08:03,670
OK, when we were writing
Flask applications,

180
00:08:03,670 --> 00:08:05,235
we were writing web servers.

181
00:08:05,235 --> 00:08:08,110
And so one thing that's just good
to know from a security perspective

182
00:08:08,110 --> 00:08:11,410
is the difference between HTTP,
the Hypertext Transfer Protocol,

183
00:08:11,410 --> 00:08:13,960
and the secure version of it, HTTPS.

184
00:08:13,960 --> 00:08:16,930
And that has to do with the
idea that on the internet,

185
00:08:16,930 --> 00:08:19,420
we have computer servers that
are trying to communicate

186
00:08:19,420 --> 00:08:22,582
with each other that are trying to
send information back and forth.

187
00:08:22,582 --> 00:08:25,540
And when these computers are trying
to send information back and forth,

188
00:08:25,540 --> 00:08:27,760
we would like for that
to happen securely,

189
00:08:27,760 --> 00:08:31,090
that when one computer is sending
information to another computer,

190
00:08:31,090 --> 00:08:34,090
that information is going through
a number of different routers.

191
00:08:34,090 --> 00:08:36,790
And each of those routers
could hypothetically

192
00:08:36,790 --> 00:08:38,289
have information that's intercepted.

193
00:08:38,289 --> 00:08:41,890
Someone could try and intercept a
packet on its way from computer number

194
00:08:41,890 --> 00:08:43,780
one to computer number two.

195
00:08:43,780 --> 00:08:47,680
So how do we securely try and
transfer information from one location

196
00:08:47,680 --> 00:08:48,495
to the other?

197
00:08:48,495 --> 00:08:50,870
And this has to do with the
entire field of cryptography,

198
00:08:50,870 --> 00:08:52,390
which is a huge field that
we're only going to be

199
00:08:52,390 --> 00:08:54,370
able to barely scratch the surface of.

200
00:08:54,370 --> 00:08:56,740
But the basic idea here is
that we would like some way

201
00:08:56,740 --> 00:09:00,670
to encrypt our information, that if I
have some plain text that I would like

202
00:09:00,670 --> 00:09:03,400
to send from my computer
to someone else's computer,

203
00:09:03,400 --> 00:09:07,510
I would like to encrypt that plain text,
send it across in some encrypted way,

204
00:09:07,510 --> 00:09:10,540
such that the person on the
other end could decrypt it.

205
00:09:10,540 --> 00:09:13,150
And so this is perhaps a
more sophisticated version

206
00:09:13,150 --> 00:09:15,760
of what you might have done
in CS50's problem set two

207
00:09:15,760 --> 00:09:18,180
when you were using the
Caesar or the Vigenere cipher

208
00:09:18,180 --> 00:09:19,430
in order to encrypt something.

209
00:09:19,430 --> 00:09:22,870
The ciphers that are used in computing
on the internet, for instance,

210
00:09:22,870 --> 00:09:25,390
are just much more secure, for example.

211
00:09:25,390 --> 00:09:27,490
But they follow a similar principle.

212
00:09:27,490 --> 00:09:31,450
And so one form of cryptography
is called secret-key cryptography,

213
00:09:31,450 --> 00:09:33,550
where the idea is that if
I am a computer up here

214
00:09:33,550 --> 00:09:36,010
and I have some plain text
that I want to encrypt,

215
00:09:36,010 --> 00:09:39,050
I also have some key that only I know.

216
00:09:39,050 --> 00:09:41,830
And I can take the plain
text, and I can take that key

217
00:09:41,830 --> 00:09:43,780
and run an algorithm on it.

218
00:09:43,780 --> 00:09:47,680
And that generates some ciphertext,
some encrypted version of the plain text

219
00:09:47,680 --> 00:09:49,750
that was encrypted using the key.

220
00:09:49,750 --> 00:09:52,460
I can then send that ciphertext
along to the other person.

221
00:09:52,460 --> 00:09:56,050
And so long as the other person
has both the ciphertext and the key

222
00:09:56,050 --> 00:09:58,660
used to encrypt it, they
can run the process in reverse

223
00:09:58,660 --> 00:10:01,840
and just decrypt it, generating
the plain text from it.

224
00:10:01,840 --> 00:10:04,540
That way, the ciphertext is
transferred, not the plain text,

225
00:10:04,540 --> 00:10:07,400
from one side to the other
side of this communication.

226
00:10:07,400 --> 00:10:10,810
And so long as both parties in this
instance have access to the same key,

227
00:10:10,810 --> 00:10:14,344
they can encrypt and
decrypt messages at will.
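
A toy Python sketch of that shared-key idea, assuming a simple XOR scheme chosen purely for illustration. It is not the cipher used on the internet and it is not secure (no nonce, no authentication); keystream, xor_with_key, and the key value are hypothetical names made up here.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Stretch the shared key into `length` pseudo-random bytes."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    stream = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


shared_key = b"hypothetical shared secret"

# Sender: plain text + key -> ciphertext.
ciphertext = xor_with_key(b"hello, world", shared_key)

# Receiver: ciphertext + the same key -> plain text.
plaintext = xor_with_key(ciphertext, shared_key)
print(plaintext)
```

The same function encrypts and decrypts, which is why both sides need the same key, and why anyone else who obtains that key can read the message too.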

228
00:10:14,344 --> 00:10:16,510
Why doesn't this quite work
on the internet, though?

229
00:10:16,510 --> 00:10:19,464
What is the problem with this model?

230
00:10:19,464 --> 00:10:19,964
Yeah?

231
00:10:19,964 --> 00:10:22,850
AUDIENCE: If you're sending the
key as well as the ciphertext,

232
00:10:22,850 --> 00:10:28,600
then it's just as revealing as sending
the plain text itself.

233
00:10:28,600 --> 00:10:29,350
BRIAN YU: Exactly.

234
00:10:29,350 --> 00:10:32,260
When we transfer the ciphertext
across, the other person

235
00:10:32,260 --> 00:10:33,950
also needs access to the key.

236
00:10:33,950 --> 00:10:35,950
We need to transfer the
key across the internet,

237
00:10:35,950 --> 00:10:38,180
as well, to give it to the other person.

238
00:10:38,180 --> 00:10:40,990
And so anyone who is
intercepting the ciphertext

239
00:10:40,990 --> 00:10:43,960
could also have intercepted
the key and therefore could

240
00:10:43,960 --> 00:10:47,110
have decrypted the information
and gotten the plain text

241
00:10:47,110 --> 00:10:47,860
as a result of it.

242
00:10:47,860 --> 00:10:50,740
So this secret-key
cryptography, ultimately, it

243
00:10:50,740 --> 00:10:53,000
doesn't work in the
context of the internet

244
00:10:53,000 --> 00:10:55,330
if it needs to be the
case that the key is just

245
00:10:55,330 --> 00:10:56,716
transferred across the internet.

246
00:10:56,716 --> 00:10:58,840
Now, you could try encrypting
the key, for example.

247
00:10:58,840 --> 00:11:00,730
But then whatever key you
used to encrypt the key,

248
00:11:00,730 --> 00:11:02,650
that also needs to be
sent across the internet,

249
00:11:02,650 --> 00:11:05,899
and you end up with this problem where
you can never figure out a way in order

250
00:11:05,899 --> 00:11:09,610
to make sure that information
can be transferred securely.

251
00:11:09,610 --> 00:11:12,910
So the solution to this lies in a
different idea called public-key

252
00:11:12,910 --> 00:11:17,430
cryptography, where the idea here
is that instead of having one key,

253
00:11:17,430 --> 00:11:18,880
we'll have two keys--

254
00:11:18,880 --> 00:11:21,361
one called a public key,
one called a private key.

255
00:11:21,361 --> 00:11:24,610
And the idea here is that a public key
is something you can share with anyone.

256
00:11:24,610 --> 00:11:26,290
Doesn't matter who has it.

257
00:11:26,290 --> 00:11:28,974
And a private key is a key
that you keep to yourself

258
00:11:28,974 --> 00:11:31,390
that you don't give to anyone,
even the person that you're

259
00:11:31,390 --> 00:11:33,580
trying to communicate with.

260
00:11:33,580 --> 00:11:36,820
And because we have two keys, each key
is going to serve a different purpose.

261
00:11:36,820 --> 00:11:38,270
They're going to be
mathematically related.

262
00:11:38,270 --> 00:11:40,061
And take a theory of
computing class if you

263
00:11:40,061 --> 00:11:42,910
want to understand the exact
mathematics behind this.

264
00:11:42,910 --> 00:11:48,010
But the basic idea is that the public
key can be used to encrypt messages,

265
00:11:48,010 --> 00:11:51,820
and the private key can be
used to decrypt messages that

266
00:11:51,820 --> 00:11:54,680
were encrypted using the public key.

267
00:11:54,680 --> 00:11:56,640
And so what does this model look like?

268
00:11:56,640 --> 00:11:59,080
Well, I have some
public and private key.

269
00:11:59,080 --> 00:12:01,710
And if I want some other
person to send me information,

270
00:12:01,710 --> 00:12:03,439
I will give them my public key.

271
00:12:03,439 --> 00:12:06,480
Just give the other person the public
key so that they have access to it.

272
00:12:06,480 --> 00:12:09,570
Remember, the public key
is used to encrypt data.

273
00:12:09,570 --> 00:12:12,270
So they can use the public key
and encrypt the plain text,

274
00:12:12,270 --> 00:12:13,745
generate some ciphertext.

275
00:12:13,745 --> 00:12:16,620
And then all the other person needs
to do is send me that ciphertext.

276
00:12:16,620 --> 00:12:18,990
The ciphertext comes across to me.

277
00:12:18,990 --> 00:12:20,970
And I now have the private
key, the key that I

278
00:12:20,970 --> 00:12:23,040
can use to decrypt the information.

279
00:12:23,040 --> 00:12:24,990
And using the private
key and the ciphertext,

280
00:12:24,990 --> 00:12:29,080
I can then decrypt the message
and generate the plain text.

281
00:12:29,080 --> 00:12:31,330
So this is the basic idea
of public-key cryptography,

282
00:12:31,330 --> 00:12:35,170
this idea that we use a public key to
encrypt information and a private key

283
00:12:35,170 --> 00:12:36,580
to decrypt information.

284
00:12:36,580 --> 00:12:39,280
And by separating this out
into two different keys,

285
00:12:39,280 --> 00:12:41,590
we can share the public
key freely without needing

286
00:12:41,590 --> 00:12:44,770
to worry about the potential
for internet traffic

287
00:12:44,770 --> 00:12:47,700
to be intercepted and
decrypted, for example.

288
00:12:47,700 --> 00:12:50,140
And so this is the basis on
which internet security works.
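
A toy Python sketch of the public/private key flow, assuming RSA as the scheme. The lecture does not name a specific algorithm, so RSA is a choice made for this illustration. The primes are tiny and there is no padding, so this shows the mechanics only and is not secure.

```python
# Key generation (done once by the person who wants to receive messages).
p, q = 61, 53                 # two small primes; real keys use enormous ones
n = p * q                     # modulus, part of both keys
phi = (p - 1) * (q - 1)       # kept secret along with p and q
e = 17                        # public exponent, chosen coprime to phi
d = pow(e, -1, phi)           # private exponent (needs Python 3.8+)

public_key = (e, n)           # safe to share with anyone
private_key = (d, n)          # never leaves the receiver's machine


def encrypt(message, key):
    """Encrypt an integer message smaller than n with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, key):
    """Recover the integer message with the matching private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


message = 65
ciphertext = encrypt(message, public_key)     # sender only needs public_key
recovered = decrypt(ciphertext, private_key)  # only the receiver can do this
print(message, ciphertext, recovered)
```

Only the ciphertext and the public key ever cross the network; the private exponent d is derived from p and q, which stay with the receiver.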

289
00:12:50,140 --> 00:12:50,743
Yeah?

290
00:12:50,743 --> 00:12:54,580
AUDIENCE: What if someone
else intercepts the ciphertext

291
00:12:54,580 --> 00:12:57,496
and they also have a private key?

292
00:12:57,496 --> 00:12:58,872
Would they be able to decrypt it?

293
00:12:58,872 --> 00:13:02,204
BRIAN YU: If someone else intercepts the
ciphertext and they have a private key,

294
00:13:02,204 --> 00:13:04,660
they won't be able to decrypt
it, because the private key

295
00:13:04,660 --> 00:13:08,710
and the public key are
mathematically related in such a way

296
00:13:08,710 --> 00:13:11,740
that if you encrypt
something with a public key,

297
00:13:11,740 --> 00:13:14,980
you can only decrypt it with
the corresponding private key.

298
00:13:14,980 --> 00:13:18,730
And so generally speaking,
you'll generate both the public

299
00:13:18,730 --> 00:13:22,420
and the private key at the same time,
such that only messages encrypted

300
00:13:22,420 --> 00:13:24,190
with one can be
decrypted with the other.

301
00:13:24,190 --> 00:13:27,398
So you can't just have some other random
private key and decrypt the message.

302
00:13:27,398 --> 00:13:30,153
A private key can only decrypt messages
encrypted with its own public key.

303
00:13:30,153 --> 00:13:33,400
AUDIENCE: So how did this person
get that specific [INAUDIBLE]?

304
00:13:33,400 --> 00:13:36,450
BRIAN YU: So this person down
here generated both the public

305
00:13:36,450 --> 00:13:38,130
and the private key at the same time.

306
00:13:38,130 --> 00:13:40,755
There's just an algorithm that
you can use to randomly generate

307
00:13:40,755 --> 00:13:41,880
a public and private key.

308
00:13:41,880 --> 00:13:45,390
You share the public key with anyone you
want to be able to send you messages.

309
00:13:45,390 --> 00:13:48,930
That person you share it with can use
the public key to encrypt the message.

310
00:13:48,930 --> 00:13:51,600
And then you, the person
who generated these keys,

311
00:13:51,600 --> 00:13:55,230
can take the encrypted message, use
the private key that you generated,

312
00:13:55,230 --> 00:13:58,331
and get the plain text out of that.

313
00:13:58,331 --> 00:13:58,830
Yeah?

314
00:13:58,830 --> 00:14:01,920
AUDIENCE: How difficult is it to get
the private key from the public key?

315
00:14:01,920 --> 00:14:04,070
Is it impossible?

316
00:14:04,070 --> 00:14:07,250
BRIAN YU: How difficult is it to get
the private key from the public key?

317
00:14:07,250 --> 00:14:09,410
Long story short, we don't really know.

318
00:14:09,410 --> 00:14:11,460
We think it is very difficult to do.

319
00:14:11,460 --> 00:14:13,400
We think that it would
take a very long time.

320
00:14:13,400 --> 00:14:18,470
If you took a computer and tried
to get it to go from the public key

321
00:14:18,470 --> 00:14:22,700
to the private key, we think it would
probably take billions, trillions, more

322
00:14:22,700 --> 00:14:27,350
years if a computer was operating at
top speed trying to do this calculation.

323
00:14:27,350 --> 00:14:31,070
But no one has been able to
technically prove that it is difficult.

324
00:14:31,070 --> 00:14:33,650
And so this is a big open
question in computing right now.

325
00:14:33,650 --> 00:14:35,570
You can take a theory
of computation class

326
00:14:35,570 --> 00:14:37,740
for more information
on this sort of thing.

327
00:14:37,740 --> 00:14:40,220
But there are some open
unsolved problems in computing,

328
00:14:40,220 --> 00:14:41,940
and this happens to be one of them.

329
00:14:41,940 --> 00:14:42,440
Yeah?

330
00:14:42,440 --> 00:14:46,196
AUDIENCE: Is it based on primes
and very large primes, and you

331
00:14:46,196 --> 00:14:47,340
multiply them together?

332
00:14:47,340 --> 00:14:49,800
BRIAN YU: Yes, this is basically
the idea of very large prime numbers

333
00:14:49,800 --> 00:14:50,980
that you multiply together.

334
00:14:50,980 --> 00:14:53,370
The long story short of it
is it's based on the idea

335
00:14:53,370 --> 00:14:55,890
that there are some mathematical
operations that are easy

336
00:14:55,890 --> 00:14:58,860
and some mathematical operations
that are believed to be difficult.

337
00:14:58,860 --> 00:15:01,140
And if you take two
very big prime numbers,

338
00:15:01,140 --> 00:15:03,549
a computer can multiply
those numbers very easily

339
00:15:03,549 --> 00:15:05,840
and calculate what the product
of those two numbers is.

340
00:15:05,840 --> 00:15:07,950
It's just a simple
multiplication algorithm.

341
00:15:07,950 --> 00:15:11,880
But if you have that result,
that big product of two primes,

342
00:15:11,880 --> 00:15:14,070
it's very difficult
to factor that number

343
00:15:14,070 --> 00:15:16,980
and figure out which two prime
numbers were multiplied together

344
00:15:16,980 --> 00:15:18,690
in order to generate that number.

345
00:15:18,690 --> 00:15:22,650
And nobody has been able to come up with
an efficient algorithm for factoring

346
00:15:22,650 --> 00:15:23,250
it.

347
00:15:23,250 --> 00:15:26,250
And so as a result, because we
believe factoring numbers to be

348
00:15:26,250 --> 00:15:28,650
a very difficult problem,
we use it as the basis

349
00:15:28,650 --> 00:15:33,140
for computing security on the internet.
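
A small Python sketch of that asymmetry, assuming trial division as the factoring method. Better factoring algorithms exist, but none known is efficient for numbers of the size used in practice. The function name factor_semiprime is hypothetical.

```python
def factor_semiprime(n: int):
    """Find p, q with p * q == n by trial division (feasible only for small n)."""
    candidate = 2
    while candidate * candidate <= n:
        if n % candidate == 0:
            return candidate, n // candidate
        candidate += 1
    raise ValueError("n has no nontrivial factors")


# Multiplying two primes is a single cheap operation...
p, q = 104_729, 1_299_709
n = p * q

# ...while undoing it means searching; here about a hundred thousand trial
# divisions are needed before the smaller prime turns up.
print(factor_semiprime(n))
```

The search cost here grows with the square root of n, so each extra digit in the primes multiplies the work, while the multiplication itself stays cheap.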

350
00:15:33,140 --> 00:15:36,260
Brief teaser of theory of computation.

351
00:15:36,260 --> 00:15:39,777
Take any of the 120 series
here at Harvard, at least,

352
00:15:39,777 --> 00:15:41,110
for more information about that.

353
00:15:41,110 --> 00:15:44,870


354
00:15:44,870 --> 00:15:45,500
Other things?

355
00:15:45,500 --> 00:15:48,711


356
00:15:48,711 --> 00:15:51,460
Some other security considerations
when designing web applications

357
00:15:51,460 --> 00:15:53,660
to be aware of-- we
mentioned this before,

358
00:15:53,660 --> 00:15:56,455
but when it comes to
storing credentials,

359
00:15:56,455 --> 00:15:58,330
you should generally
always store credentials

360
00:15:58,330 --> 00:16:01,240
in environment variables
inside of your application

361
00:16:01,240 --> 00:16:05,060
rather than have inside of
your Python code some password,

362
00:16:05,060 --> 00:16:07,074
whether it's the secret
key of your application,

363
00:16:07,074 --> 00:16:08,990
whether it's the credentials
to your database,

364
00:16:08,990 --> 00:16:11,710
whether it's some other
credentials for an API key,

365
00:16:11,710 --> 00:16:13,930
for example, that you're
using the server to access.

366
00:16:13,930 --> 00:16:16,750
Usually best not to put that in
the code in case someone else

367
00:16:16,750 --> 00:16:18,250
gets access to the code.

368
00:16:18,250 --> 00:16:20,020
Generally best to put
it in an environment

369
00:16:20,020 --> 00:16:24,240
variable, a variable that's just
stored in the command line environment

370
00:16:24,240 --> 00:16:26,080
where your server's being run from.

371
00:16:26,080 --> 00:16:31,190
And then add code that just pulls
the credentials from the environment.

372
00:16:31,190 --> 00:16:34,540
You can use in Python,
at least, os.environ.get

373
00:16:34,540 --> 00:16:37,570
to mean get some information from
the application's environment.

374
00:16:37,570 --> 00:16:41,200
And this is generally going to be a
more secure way of doing the same thing.
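
A minimal Python sketch of pulling credentials from the environment with os.environ.get. The variable names SECRET_KEY and DATABASE_URL and the helper require_env are assumptions made for illustration; use whatever names your own application expects.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable `name`, failing loudly if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set in the environment")
    return value


# os.environ.get returns None (or the given default) when a variable is
# missing, so nothing secret ever needs to appear in the source code.
secret_key = os.environ.get("SECRET_KEY", "dev-only-placeholder")
database_url = os.environ.get("DATABASE_URL")  # None if not configured

print("database configured:", database_url is not None)
```

In a deployed application, something like require_env("DATABASE_URL") at startup makes a missing credential fail immediately instead of surfacing later as a confusing connection error.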

375
00:16:41,200 --> 00:16:41,847
Yeah?

376
00:16:41,847 --> 00:16:44,516
AUDIENCE: How do we do
that in Heroku if we

377
00:16:44,516 --> 00:16:46,140
want to upload our code to the website?

378
00:16:46,140 --> 00:16:46,895
BRIAN YU: Yeah.

379
00:16:46,895 --> 00:16:50,020
So if you're uploading this to Heroku,
if you go to your Heroku application

380
00:16:50,020 --> 00:16:52,330
and go to the Settings
panel, there is a section,

381
00:16:52,330 --> 00:16:55,990
I think it's called config vars, that
basically just lets you add environment

382
00:16:55,990 --> 00:16:57,644
variables to the Heroku application.

383
00:16:57,644 --> 00:17:00,310
And that will automatically set
those environment variables such

384
00:17:00,310 --> 00:17:02,018
that when you run the
application, it can

385
00:17:02,018 --> 00:17:03,830
draw from those environment variables.

386
00:17:03,830 --> 00:17:04,420
Yeah?

387
00:17:04,420 --> 00:17:10,120
AUDIENCE: Is it [INAUDIBLE]
yesterday, or is that something

388
00:17:10,120 --> 00:17:11,440
you can't have access to?

389
00:17:11,440 --> 00:17:14,254
Because if you just did
[INAUDIBLE] and then the key,

390
00:17:14,254 --> 00:17:17,717
it goes away when you close
the terminal, correct?

391
00:17:17,717 --> 00:17:18,300
BRIAN YU: Yes.

392
00:17:18,300 --> 00:17:19,500
So that's true.

393
00:17:19,500 --> 00:17:21,900
So you can certainly,
on your own computer,

394
00:17:21,900 --> 00:17:24,611
set aliases or environment
variables inside

395
00:17:24,611 --> 00:17:27,569
of your profile that automatically
set credentials in a particular way.

396
00:17:27,569 --> 00:17:30,180
The idea is that you never want
to be taking those credentials

397
00:17:30,180 --> 00:17:33,360
and committing them to a
repository that other people might

398
00:17:33,360 --> 00:17:35,040
be able to see, for instance.

399
00:17:35,040 --> 00:17:37,320
That's where things
start to get less secure.

400
00:17:37,320 --> 00:17:41,550


401
00:17:41,550 --> 00:17:42,250
OK.

402
00:17:42,250 --> 00:17:45,730
Moving on in the week to talk about
some other security considerations.

403
00:17:45,730 --> 00:17:47,590
We'll talk about SQL,
the idea of databases.

404
00:17:47,590 --> 00:17:50,800
And when we introduce databases, there
are a lot of security considerations

405
00:17:50,800 --> 00:17:52,030
that come about.

406
00:17:52,030 --> 00:17:53,930
But we'll just touch
on a couple of them.

407
00:17:53,930 --> 00:17:56,170
The first is how you store passwords.

408
00:17:56,170 --> 00:17:58,210
So you can imagine that
inside of a database,

409
00:17:58,210 --> 00:18:00,670
you might be storing users
and passwords together.

410
00:18:00,670 --> 00:18:04,450
And maybe we have a whole users
table that has an ID column,

411
00:18:04,450 --> 00:18:07,720
a column for people's usernames,
and a column for people's passwords.

412
00:18:07,720 --> 00:18:12,320
And you could imagine just storing
passwords inside of the row.

413
00:18:12,320 --> 00:18:14,185
But why is this not particularly secure?

414
00:18:14,185 --> 00:18:21,430


415
00:18:21,430 --> 00:18:21,930
Yeah?

416
00:18:21,930 --> 00:18:24,294
AUDIENCE: If anyone gets
access to the data table,

417
00:18:24,294 --> 00:18:25,960
they can see what all the passwords are.

418
00:18:25,960 --> 00:18:26,710
BRIAN YU: Exactly.

419
00:18:26,710 --> 00:18:29,140
If anyone gets access to the
database, they immediately

420
00:18:29,140 --> 00:18:30,666
have access to all of the passwords.

421
00:18:30,666 --> 00:18:33,040
And this is probably not a
secure way to go about things,

422
00:18:33,040 --> 00:18:35,440
because you probably hear
in the news from time

423
00:18:35,440 --> 00:18:39,610
to time that databases aren't perfectly
secure, that every once in a while,

424
00:18:39,610 --> 00:18:43,600
there's some big security vulnerability
where someone's able to get access

425
00:18:43,600 --> 00:18:45,370
to passwords inside of a database.

426
00:18:45,370 --> 00:18:47,530
And that becomes a
major security concern.

427
00:18:47,530 --> 00:18:50,020
And so one way to try
and mitigate this problem

428
00:18:50,020 --> 00:18:53,920
is, instead of storing passwords
inside of the database,

429
00:18:53,920 --> 00:18:56,220
store a hashed version of the password.

430
00:18:56,220 --> 00:18:59,710
A hash function, as you might recall
from CS50, just takes some input

431
00:18:59,710 --> 00:19:02,590
and returns some deterministic output.

432
00:19:02,590 --> 00:19:05,860
And a hash function can
generally take any input password

433
00:19:05,860 --> 00:19:09,430
and turn it into what looks like a whole
bunch of random sequences of letters

434
00:19:09,430 --> 00:19:10,256
and numbers.

435
00:19:10,256 --> 00:19:12,130
And the idea here is
that it's deterministic.

436
00:19:12,130 --> 00:19:16,030
The same password will always
result in the same hash value

437
00:19:16,030 --> 00:19:19,690
whereby when someone tries to log
in, when they type in their password,

438
00:19:19,690 --> 00:19:21,820
rather than just literally
compare their password

439
00:19:21,820 --> 00:19:25,120
and say does the password match up
with the password in this column,

440
00:19:25,120 --> 00:19:27,670
you can say, all right, let's
hash the password first.

441
00:19:27,670 --> 00:19:30,700
And if the hashes match up,
then with very high probability,

442
00:19:30,700 --> 00:19:34,600
the user actually signed in to the
website with the correct password.

443
00:19:34,600 --> 00:19:36,190
And you can then log the user in.

444
00:19:36,190 --> 00:19:39,100
And now, if someone was able
to get access to the database,

445
00:19:39,100 --> 00:19:41,020
they wouldn't get access
to all the passwords.

446
00:19:41,020 --> 00:19:43,452
They would only get access
to the password hashes.

447
00:19:43,452 --> 00:19:45,160
Now, it's still a
security vulnerability,

448
00:19:45,160 --> 00:19:48,850
because someone could, in
theory, be able to figure out

449
00:19:48,850 --> 00:19:51,340
information about the password
from the password hashes.

450
00:19:51,340 --> 00:19:54,460
But better, certainly, than
literally storing the raw text

451
00:19:54,460 --> 00:19:55,891
of the password in the database.
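
A rough sketch of storing and checking a password hash, using only Python's standard library. Note the assumptions: it uses PBKDF2 with a per-user random salt, which goes a step beyond the plain deterministic hash described here, and the iteration count and function names are illustrative choices rather than anything from the lecture. With a salt, the same password and the same salt still always give the same hash, so the login check works the same way.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your hardware


def hash_password(password, salt=None):
    """Return (salt, digest); store both in the users table, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each new password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the login attempt with the stored salt and compare the digests."""
    _, digest = hash_password(password, salt)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(digest, expected_digest)
```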

452
00:19:55,891 --> 00:19:56,390
Yeah?

453
00:19:56,390 --> 00:19:59,660
AUDIENCE: Do we know how the hash
functions generate that code?

454
00:19:59,660 --> 00:20:00,290
BRIAN YU: Yeah.

455
00:20:00,290 --> 00:20:02,420
The hash functions tend
to be deterministic,

456
00:20:02,420 --> 00:20:05,330
and you can look up what the hash
functions themselves are.

457
00:20:05,330 --> 00:20:07,640
So there are a couple of
quite popular hash functions

458
00:20:07,640 --> 00:20:10,437
that are out there that
do this sort of thing.

459
00:20:10,437 --> 00:20:12,770
But the idea of the hash
function is similar to the idea

460
00:20:12,770 --> 00:20:16,430
of public and private keys, that
it's very easy to hash something,

461
00:20:16,430 --> 00:20:19,250
and it's very difficult to
go in the other direction.

462
00:20:19,250 --> 00:20:21,170
I can easily hash a
password and generate

463
00:20:21,170 --> 00:20:22,490
something that looks like this.

464
00:20:22,490 --> 00:20:25,830
But it's a difficult operation to
take something that looks like this

465
00:20:25,830 --> 00:20:30,400
and go backwards and figure out what
it was that the original password was.

466
00:20:30,400 --> 00:20:33,820
And so that's one of the
properties of a good hash function.

467
00:20:33,820 --> 00:20:34,320
Yes?

468
00:20:34,320 --> 00:20:37,920
AUDIENCE: Did you actually hash these,
or did you just hit the keyboard?

469
00:20:37,920 --> 00:20:41,280
BRIAN YU: I think these are probably--

470
00:20:41,280 --> 00:20:43,800
there might be hidden messages
here if you look carefully.

471
00:20:43,800 --> 00:20:44,960
But separate issue.

472
00:20:44,960 --> 00:20:47,519


473
00:20:47,519 --> 00:20:48,060
Other things?

474
00:20:48,060 --> 00:20:51,321


475
00:20:51,321 --> 00:20:51,820
OK.

476
00:20:51,820 --> 00:20:57,537
So how is it that potential data is
leaked as a result of using a database?

477
00:20:57,537 --> 00:21:00,370
Well, there are a number of ways
that applications can inadvertently

478
00:21:00,370 --> 00:21:02,150
leak information.

479
00:21:02,150 --> 00:21:03,460
Take a simple example.

480
00:21:03,460 --> 00:21:06,190
Oftentimes, you'll see websites
that have a Forgot Your Password

481
00:21:06,190 --> 00:21:10,390
screen where you type in an email
address, and you click Reset Password.

482
00:21:10,390 --> 00:21:12,850
And that helps you to send
you an email that allows you

483
00:21:12,850 --> 00:21:15,267
to reset your password, for example.

484
00:21:15,267 --> 00:21:17,350
And you imagine that you
type in an email address,

485
00:21:17,350 --> 00:21:21,640
and you get, OK, password
reset email has been sent.

486
00:21:21,640 --> 00:21:24,220
But maybe some applications
work such that if you type

487
00:21:24,220 --> 00:21:27,070
in an email address
that doesn't exist, then

488
00:21:27,070 --> 00:21:28,900
you get an error that says, OK, error.

489
00:21:28,900 --> 00:21:31,930
There is no user with
that email address.

490
00:21:31,930 --> 00:21:34,120
What data has this
application now exposed?

491
00:21:34,120 --> 00:21:36,717


492
00:21:36,717 --> 00:21:39,800
What information can you get just by
using this part of a web application,

493
00:21:39,800 --> 00:21:40,570
for instance?

494
00:21:40,570 --> 00:21:41,070
Yeah?

495
00:21:41,070 --> 00:21:45,104
AUDIENCE: You know that that email
address is not in the system,

496
00:21:45,104 --> 00:21:47,150
so you know that person
is not using that app.

497
00:21:47,150 --> 00:21:48,150
BRIAN YU: Yeah, exactly.

498
00:21:48,150 --> 00:21:50,631
Just using the Forgot Password
part of this application,

499
00:21:50,631 --> 00:21:53,130
you can tell exactly who has
an account for this application

500
00:21:53,130 --> 00:21:56,740
and who doesn't just by typing email
addresses and seeing what comes back.

501
00:21:56,740 --> 00:21:59,190
So there's potential
vulnerabilities in terms of data

502
00:21:59,190 --> 00:22:00,577
that gets leaked there, as well.
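
One common way to avoid this particular leak, sketched under assumptions: the handler returns the same message whether or not the address is registered. The names request_password_reset, registered_emails, and send_email are hypothetical stand-ins for whatever the application actually uses.

```python
GENERIC_MESSAGE = (
    "If an account exists for that address, a reset email has been sent."
)


def request_password_reset(email, registered_emails, send_email):
    """Respond identically for known and unknown addresses."""
    if email in registered_emails:
        send_email(email)  # only real accounts get an email
    # The caller sees the same text either way, so the response
    # no longer reveals who has an account.
    return GENERIC_MESSAGE
```

This only addresses the message itself; differences in response time could still give information away.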

503
00:22:00,577 --> 00:22:03,660
And there are all sorts of different
ways that information can get leaked.

504
00:22:03,660 --> 00:22:05,850
Oftentimes, there's a
growing field whereby

505
00:22:05,850 --> 00:22:10,440
you can tell just based on the amount
of time it takes for an HTTP request

506
00:22:10,440 --> 00:22:13,590
to come back whether or not--

507
00:22:13,590 --> 00:22:16,110
you can get information about
the data inside of a database

508
00:22:16,110 --> 00:22:19,809
based on that whereby if you make a
request that takes a long time, that

509
00:22:19,809 --> 00:22:22,350
can tell you something different
than if a request comes back

510
00:22:22,350 --> 00:22:24,930
very quickly, because that might
mean fewer database requests

511
00:22:24,930 --> 00:22:27,780
were required in order to make
that particular operation work

512
00:22:27,780 --> 00:22:29,340
or any number of different things.

513
00:22:29,340 --> 00:22:33,401
And so there are security
vulnerabilities there, as well.

514
00:22:33,401 --> 00:22:33,900
Final one.

515
00:22:33,900 --> 00:22:35,700
I'll briefly mention the SQL injection.

516
00:22:35,700 --> 00:22:36,690
We've already talked about that.

517
00:22:36,690 --> 00:22:38,910
But again, something to be
aware of just to make sure

518
00:22:38,910 --> 00:22:40,710
that whenever you're
making database queries,

519
00:22:40,710 --> 00:22:42,751
you're protecting yourself
against SQL injection,

520
00:22:42,751 --> 00:22:46,560
that you're making sure to either use a
library that takes care of this for you

521
00:22:46,560 --> 00:22:48,990
or escape any characters
that you might be using that

522
00:22:48,990 --> 00:22:55,155
could ultimately result
in vulnerabilities in SQL.
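
As an illustration of letting a library handle this, here is a small sketch using Python's built-in sqlite3 module with a placeholder in the query. The users table and the find_user helper are assumptions made up for the example.

```python
import sqlite3


def find_user(conn, username):
    """Look up a user with a parameterized query."""
    # The ? placeholder passes username to the driver as data,
    # so it is never interpreted as part of the SQL statement.
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchall()


# Hypothetical in-memory table for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))
```

A classic injection string such as ' OR '1'='1 is then treated as an odd username that matches no rows, rather than as SQL.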

523
00:22:55,155 --> 00:22:56,125
Yeah?

524
00:22:56,125 --> 00:22:57,920
AUDIENCE: How about
the websites or tools

525
00:22:57,920 --> 00:23:02,280
like LastPass that store your
credentials for other sites?

526
00:23:02,280 --> 00:23:06,180
Don't they have to have some way
of reversing their own hash on it

527
00:23:06,180 --> 00:23:10,880
in order to give you that credential
when you go to another site?

528
00:23:10,880 --> 00:23:13,980
So when it auto fills your
username and password,

529
00:23:13,980 --> 00:23:17,889
it has to-- if they're storing a hashed
version on their side but filling

530
00:23:17,889 --> 00:23:21,673
in the plain text version
in the password field,

531
00:23:21,673 --> 00:23:24,961
how are they able to reverse
that in a way that is secure?

532
00:23:24,961 --> 00:23:27,316
They would have to have a
table of keys or something

533
00:23:27,316 --> 00:23:30,365
that then is just as vulnerable
as leaving the password.

534
00:23:30,365 --> 00:23:30,990
BRIAN YU: Yeah.

535
00:23:30,990 --> 00:23:34,694
So for password manager-type
applications, it's a good question.

536
00:23:34,694 --> 00:23:37,860
I think the way most of them do this
is that you have a master password that

537
00:23:37,860 --> 00:23:42,480
unlocks the entire database of the
passwords that are stored there.

538
00:23:42,480 --> 00:23:44,760
And the idea would be
that they're encrypted

539
00:23:44,760 --> 00:23:48,272
using the master password as
the key to be the unlocker such

540
00:23:48,272 --> 00:23:49,230
that they're encrypted.

541
00:23:49,230 --> 00:23:51,180
And only by getting the
master password correct

542
00:23:51,180 --> 00:23:53,054
can you then decrypt
the information and then

543
00:23:53,054 --> 00:23:55,770
access the plain text version of
the passwords that are inside.

544
00:23:55,770 --> 00:23:59,520
And so hashing and encryption and
decryption are slightly different.

545
00:23:59,520 --> 00:24:01,320
In the case of encryption
and decryption,

546
00:24:01,320 --> 00:24:05,310
you still want to be able to go from
the ciphertext back to the plain text,

547
00:24:05,310 --> 00:24:07,470
whereas in the case of
the password hashing,

548
00:24:07,470 --> 00:24:11,032
you don't really care about the ability
to reverse engineer it to go backwards.

549
00:24:11,032 --> 00:24:14,810


550
00:24:14,810 --> 00:24:15,317
All right.

551
00:24:15,317 --> 00:24:17,150
And finally, on the
topic of security, we'll

552
00:24:17,150 --> 00:24:18,710
talk a little bit about JavaScript.

553
00:24:18,710 --> 00:24:21,950
JavaScript opens a whole host of
different potential vulnerabilities

554
00:24:21,950 --> 00:24:23,290
from a security standpoint.

555
00:24:23,290 --> 00:24:25,070
But we'll talk about a couple.

556
00:24:25,070 --> 00:24:28,400
The first is this idea
called cross-site scripting,

557
00:24:28,400 --> 00:24:32,960
or the idea of taking a script and
being effectively able to inject it

558
00:24:32,960 --> 00:24:36,350
into some other site by putting
some JavaScript that the web

559
00:24:36,350 --> 00:24:40,440
application didn't intend into
the web application itself.

560
00:24:40,440 --> 00:24:44,876
And so here's a very simple web
application written in Flask.

561
00:24:44,876 --> 00:24:46,500
And this is the entire web application.

562
00:24:46,500 --> 00:24:49,430
It's got a route, a default route,
called / that just returns, "Hello,

563
00:24:49,430 --> 00:24:50,120
world!"

564
00:24:50,120 --> 00:24:52,640
And it's got an error handler that
we didn't really see in the class.

565
00:24:52,640 --> 00:24:54,410
But basically, it
handles whenever there's

566
00:24:54,410 --> 00:24:58,010
a 404 error, whenever you're trying
to access a page that was not found.

567
00:24:58,010 --> 00:25:02,120
And it just returns, "Not found,"
followed by request.path, whatever it

568
00:25:02,120 --> 00:25:04,700
is that was the URL that you requested.

569
00:25:04,700 --> 00:25:06,770
And so I could run this application.

570
00:25:06,770 --> 00:25:10,280
I'll go ahead and start up
Chrome, and I'll go ahead

571
00:25:10,280 --> 00:25:18,020
and go to the source code for XSS1.

572
00:25:18,020 --> 00:25:20,520
I'll run this application.

573
00:25:20,520 --> 00:25:21,020
Go here.

574
00:25:21,020 --> 00:25:22,430
It says, "Hello, world!"

575
00:25:22,430 --> 00:25:27,350
And if I go to helloworld/foo, for
example, some route that doesn't exist,

576
00:25:27,350 --> 00:25:30,770
I get not found, /foo, because that's
not a route that's available on this

577
00:25:30,770 --> 00:25:31,550
page.

578
00:25:31,550 --> 00:25:32,870
I go to /bar.

579
00:25:32,870 --> 00:25:34,579
Not found, /bar.

580
00:25:34,579 --> 00:25:35,620
What could go wrong here?

581
00:25:35,620 --> 00:25:38,760


582
00:25:38,760 --> 00:25:41,550
Where's the security
vulnerability, again,

583
00:25:41,550 --> 00:25:43,170
thinking in the context of JavaScript?

584
00:25:43,170 --> 00:25:47,620


585
00:25:47,620 --> 00:25:52,420
The page my application is returning
is literally just "not found"

586
00:25:52,420 --> 00:25:56,400
followed by whatever was
typed into the request path.

587
00:25:56,400 --> 00:26:04,630
And so what I could do is you could
imagine that instead of running /foo,

588
00:26:04,630 --> 00:26:09,250
I could instead make a request
that looks something like /script

589
00:26:09,250 --> 00:26:13,240
alert('hi) and then
/script, for instance,

590
00:26:13,240 --> 00:26:17,860
injecting some JavaScript into the
request path whereby if I do that,

591
00:26:17,860 --> 00:26:22,150
I say, OK, /script alert('hi') /script.

592
00:26:22,150 --> 00:26:23,370
Press Return.

593
00:26:23,370 --> 00:26:25,756
And OK, Chrome is
being smart about this.

594
00:26:25,756 --> 00:26:27,630
Chrome actually isn't
allowing me to do this,

595
00:26:27,630 --> 00:26:30,370
because Chrome has some more
advanced features that are basically

596
00:26:30,370 --> 00:26:32,800
saying Chrome detected
unusual code on this page

597
00:26:32,800 --> 00:26:36,430
and blocked it to protect your
personal information and error blocked

598
00:26:36,430 --> 00:26:37,720
by XSS auditor.

599
00:26:37,720 --> 00:26:38,980
That's cross-site scripting.

600
00:26:38,980 --> 00:26:40,930
So Chrome is automatically
auditing for this.

601
00:26:40,930 --> 00:26:42,520
But not all browsers are like that.

602
00:26:42,520 --> 00:26:44,260
And I can, I think--

603
00:26:44,260 --> 00:26:48,190
let's see if I can disable--

604
00:26:48,190 --> 00:26:51,320
if I disable cross-site
scripting protections,

605
00:26:51,320 --> 00:26:53,680
I think I can get this to-- yeah, OK.

606
00:26:53,680 --> 00:26:55,720
Disabling cross-site
scripting protections,

607
00:26:55,720 --> 00:26:58,540
we can still type in the URL
and actually get some JavaScript

608
00:26:58,540 --> 00:27:04,080
that the page didn't intend to still
run on this particular web page.

609
00:27:04,080 --> 00:27:08,550
And so if someone were to send you
a link that took you to this page,

610
00:27:08,550 --> 00:27:11,280
/script alert('hi'), you could
get JavaScript to run that you

611
00:27:11,280 --> 00:27:12,180
didn't intend.

612
00:27:12,180 --> 00:27:13,589
And maybe that's not a big deal.

613
00:27:13,589 --> 00:27:15,630
But it could be a bigger
deal in a situation that

614
00:27:15,630 --> 00:27:18,690
looks like this, where
we have JavaScript

615
00:27:18,690 --> 00:27:23,730
and document.write is a function
that just adds something to the page.

616
00:27:23,730 --> 00:27:27,720
And here, we're loading
an image, img src,

617
00:27:27,720 --> 00:27:29,970
and the source is some hacker's website.

618
00:27:29,970 --> 00:27:33,540
And then we say, cookie=
and then document.cookie.

619
00:27:33,540 --> 00:27:37,050
Document.cookie stores the
cookie for this particular page.

620
00:27:37,050 --> 00:27:39,330
And so effectively, what's
happening in this script

621
00:27:39,330 --> 00:27:43,110
is that your page, when you
load it, is going to make a web

622
00:27:43,110 --> 00:27:45,630
request to the hacker's URL.

623
00:27:45,630 --> 00:27:47,970
And it's going to provide
it as an argument whatever

624
00:27:47,970 --> 00:27:51,457
the value of your
cookie is, for instance.

625
00:27:51,457 --> 00:27:53,790
And that cookie could be
something that you use in order

626
00:27:53,790 --> 00:27:55,890
to log in as the credentials
for some website,

627
00:27:55,890 --> 00:27:57,630
like a bank application or whatnot.

628
00:27:57,630 --> 00:27:59,970
And as a result, the
hacker now has access

629
00:27:59,970 --> 00:28:02,910
to whatever the value of
your cookie is, because they

630
00:28:02,910 --> 00:28:04,680
can look at their list
of all the requests

631
00:28:04,680 --> 00:28:06,570
that have been made to the
application much in the same way

632
00:28:06,570 --> 00:28:08,361
that you've been able
to do in the terminal

633
00:28:08,361 --> 00:28:10,530
to see all the requests
for your Flask application.

634
00:28:10,530 --> 00:28:15,030
And they can see that someone requested
hacker_url?cookie= this cookie,

635
00:28:15,030 --> 00:28:18,090
and they can then use that cookie to
be able to sign in to other sites,

636
00:28:18,090 --> 00:28:18,630
as well.

637
00:28:18,630 --> 00:28:21,480
So most modern browsers,
like Chrome, are

638
00:28:21,480 --> 00:28:24,580
pretty good at defending
against this sort of thing.

639
00:28:24,580 --> 00:28:28,650
But definitely something that is a
potential vulnerability, especially

640
00:28:28,650 --> 00:28:31,380
for older browsers.
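
On the server side, the usual defense is to escape user-controlled text before putting it into a page, rather than relying on the browser. A minimal sketch, assuming a hypothetical helper not_found_body that builds the 404 response text from the request path:

```python
from html import escape


def not_found_body(path):
    """Build the 404 message with the requested path HTML-escaped."""
    # escape() turns <, >, &, and quotes into HTML entities, so an
    # injected <script> tag is displayed as text instead of being run.
    return "Not found: " + escape(path)
```

In a Flask error handler, the same idea would mean escaping request.path before returning it; templates rendered through Jinja2 in Flask escape variables by default.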

641
00:28:31,380 --> 00:28:33,220
Questions about this
cross-site scripting?

642
00:28:33,220 --> 00:28:34,502
Yeah?

643
00:28:34,502 --> 00:28:36,397
AUDIENCE: Are you getting
the user's cookie,

644
00:28:36,397 --> 00:28:37,980
or whose cookie are you getting there?

645
00:28:37,980 --> 00:28:39,354
BRIAN YU: Whoever opens the page.

646
00:28:39,354 --> 00:28:42,090
So the user's cookie, potentially
on an entirely different site.

647
00:28:42,090 --> 00:28:45,120
The idea is that if your site
is vulnerable to cross-site

648
00:28:45,120 --> 00:28:48,270
scripting in this form, then
you open up a possibility

649
00:28:48,270 --> 00:28:52,050
where someone could generate
a link to your website that

650
00:28:52,050 --> 00:28:56,310
includes some JavaScript injected
like this whereby someone else could

651
00:28:56,310 --> 00:28:59,280
steal the cookies of your
users on your website.

652
00:28:59,280 --> 00:29:01,310
And they could get the
cookies for themselves

653
00:29:01,310 --> 00:29:03,690
and use those cookies to
sign into your website

654
00:29:03,690 --> 00:29:06,010
and pretend to be people that
they're not, for example.

655
00:29:06,010 --> 00:29:07,760
There's a potential
security threat there.

656
00:29:07,760 --> 00:29:10,950


657
00:29:10,950 --> 00:29:14,330
So cross-site scripting is one
example of a JavaScript vulnerability.

658
00:29:14,330 --> 00:29:17,780
Another vulnerability is called
cross-site request forgery.

659
00:29:17,780 --> 00:29:20,900
Imagine that you have a
bank website, for instance,

660
00:29:20,900 --> 00:29:23,390
and that bank gives you
a way to transfer money.

661
00:29:23,390 --> 00:29:27,997
And if you go to that URL /transfer and
then you provide arguments as to who

662
00:29:27,997 --> 00:29:30,830
you're transferring money to and
how much money you're transferring,

663
00:29:30,830 --> 00:29:31,910
you can transfer money.

664
00:29:31,910 --> 00:29:35,000
Might be a web request
that allows you to do that.

665
00:29:35,000 --> 00:29:38,690
Imagine some other
website, some website where

666
00:29:38,690 --> 00:29:41,840
hackers are trying to steal
money, where they have code that

667
00:29:41,840 --> 00:29:43,430
looks a little something like this.

668
00:29:43,430 --> 00:29:45,480
They have a link that
says, "Click Here!"

669
00:29:45,480 --> 00:29:49,820
And when you click on the link, that
takes you to yourbank.com/transfer

670
00:29:49,820 --> 00:29:53,090
transferring to a particular person,
transferring a particular amount.

671
00:29:53,090 --> 00:29:56,240
And some unsuspecting user on this
website could click the button.

672
00:29:56,240 --> 00:29:58,750
And as a result, that
takes them to their bank.

673
00:29:58,750 --> 00:30:01,250
And if they happen to be logged
into their bank at the time,

674
00:30:01,250 --> 00:30:04,050
that could result in actually
making that transfer.

675
00:30:04,050 --> 00:30:06,260
So cross-site request
forgery is the idea

676
00:30:06,260 --> 00:30:11,630
that some other site can make a request
on your site by, in this case,

677
00:30:11,630 --> 00:30:13,890
linking to it.

678
00:30:13,890 --> 00:30:18,180
This still isn't an amazing threat,
because the person actually still needs

679
00:30:18,180 --> 00:30:22,590
to click on the button in order to be
able to load in order to actually go

680
00:30:22,590 --> 00:30:25,639
to yourbank.com/transfer/whatever.

681
00:30:25,639 --> 00:30:28,680
But you can imagine that a clever
hacker might be able to get around this

682
00:30:28,680 --> 00:30:31,380
by doing something like this--

683
00:30:31,380 --> 00:30:34,807
rendering an image, for example,
and saying the source of the image

684
00:30:34,807 --> 00:30:35,640
is going to be this.

685
00:30:35,640 --> 00:30:39,257
And when an HTML page has an image tag, the
browser is just going to go to that URL

686
00:30:39,257 --> 00:30:40,590
and try and download that image.

687
00:30:40,590 --> 00:30:43,560
It's going to go to the URL,
try and fetch that resource.

688
00:30:43,560 --> 00:30:47,280
And here, that resource is
yourbank.com/transfer and then

689
00:30:47,280 --> 00:30:48,510
transferring that money.

690
00:30:48,510 --> 00:30:50,730
So the user doesn't even
have to click on anything.

691
00:30:50,730 --> 00:30:54,750
And by making a GET request
to yourbank.com/transfer,

692
00:30:54,750 --> 00:30:57,780
if yourbank.com isn't implemented
particularly securely and just allows

693
00:30:57,780 --> 00:31:02,302
you to go to a URL like this to transfer
money, then that could be the result.

694
00:31:02,302 --> 00:31:03,760
So how do you protect against this?

695
00:31:03,760 --> 00:31:08,280


696
00:31:08,280 --> 00:31:10,320
How would you protect
against your website

697
00:31:10,320 --> 00:31:11,430
being able to do something like this?

698
00:31:11,430 --> 00:31:12,870
Because your website
probably wants some way

699
00:31:12,870 --> 00:31:15,740
of being able to transfer money
if you have a bank application,

700
00:31:15,740 --> 00:31:21,090
but you don't want to allow
people to make requests like that.

701
00:31:21,090 --> 00:31:22,001
Answer, yeah?

702
00:31:22,001 --> 00:31:22,626
AUDIENCE: Yeah.

703
00:31:22,626 --> 00:31:23,060
It's facetious.

704
00:31:23,060 --> 00:31:24,010
BRIAN YU: Go for it.

705
00:31:24,010 --> 00:31:25,010
AUDIENCE: You get a better bank.

706
00:31:25,010 --> 00:31:25,760
BRIAN YU: Get a better bank.

707
00:31:25,760 --> 00:31:26,570
OK.

708
00:31:26,570 --> 00:31:29,810
Certainly something that would work.

709
00:31:29,810 --> 00:31:31,260
Other thoughts?

710
00:31:31,260 --> 00:31:32,110
Yeah?

711
00:31:32,110 --> 00:31:35,482
AUDIENCE: Change the form request
type so it's not literally in your own

712
00:31:35,482 --> 00:31:36,065
[INAUDIBLE].

713
00:31:36,065 --> 00:31:36,690
BRIAN YU: Yeah.

714
00:31:36,690 --> 00:31:39,231
Change the form request type so
that it's not literally here.

715
00:31:39,231 --> 00:31:41,172
So this right here is a GET request.

716
00:31:41,172 --> 00:31:44,130
You might imagine that instead, it's
a form that's submitted by a POST,

717
00:31:44,130 --> 00:31:46,005
like a POST request, a
form that you actually

718
00:31:46,005 --> 00:31:50,420
have to submit, click on a Submit
button, in order to submit that form.

719
00:31:50,420 --> 00:31:55,630
And so now, you could imagine
that someone could still

720
00:31:55,630 --> 00:31:58,510
create a vulnerability by
doing something like this.

721
00:31:58,510 --> 00:32:03,130
They have a form whose action is
yourbank.com/transfer submitting

722
00:32:03,130 --> 00:32:04,480
by a method POST.

723
00:32:04,480 --> 00:32:07,300
And now, they have these
inputs that are type hidden,

724
00:32:07,300 --> 00:32:10,440
which are just input fields that
don't show up inside of a page.

725
00:32:10,440 --> 00:32:12,190
And they can have
hidden input fields that

726
00:32:12,190 --> 00:32:16,120
specify who it's to, what the amount
is, and then just some button that says,

727
00:32:16,120 --> 00:32:17,410
"Click Here!"

728
00:32:17,410 --> 00:32:19,420
And if they click
here, then unwittingly,

729
00:32:19,420 --> 00:32:21,730
the user could be submitting
a form to the bank that's

730
00:32:21,730 --> 00:32:24,270
initiating some transfer.

731
00:32:24,270 --> 00:32:27,900
And in fact, if the hacker
is being particularly clever,

732
00:32:27,900 --> 00:32:29,880
you don't even need the
user to click anything,

733
00:32:29,880 --> 00:32:32,640
because we can use event
listeners to get around this.

734
00:32:32,640 --> 00:32:35,130
I could say body onload--

735
00:32:35,130 --> 00:32:37,800
in other words, when the body
of the page is done loading,

736
00:32:37,800 --> 00:32:39,270
run this JavaScript.

737
00:32:39,270 --> 00:32:42,930
Document.forms returns an array of
all the forms in the web document.

738
00:32:42,930 --> 00:32:45,540
Square bracket 0 says
get the first form.

739
00:32:45,540 --> 00:32:49,192
And there's a function in JavaScript
called .submit that submits a form.

740
00:32:49,192 --> 00:32:51,900
So you can say, all right, get
all the forms, get the first form,

741
00:32:51,900 --> 00:32:52,980
and run submit.

742
00:32:52,980 --> 00:32:55,440
And that's going to result
in submitting this form,

743
00:32:55,440 --> 00:32:58,320
making a POST request to
yourbank.com/transfer,

744
00:32:58,320 --> 00:33:02,704
which results in some
amount being transferred.

745
00:33:02,704 --> 00:33:04,620
So this is a potential
vulnerability, as well.

746
00:33:04,620 --> 00:33:06,540
If you're writing this
bank application, you

747
00:33:06,540 --> 00:33:10,350
don't want to allow code like this to
be able to get through your security,

748
00:33:10,350 --> 00:33:13,811
because that opens up a whole host of
potential security vulnerabilities.

749
00:33:13,811 --> 00:33:15,810
And in general, the way
that people tend to deal

750
00:33:15,810 --> 00:33:20,190
with this is by adding what's called
a CSRF token, a Cross-Site Request

751
00:33:20,190 --> 00:33:25,350
Forgery token, basically adding
some special value that changes

752
00:33:25,350 --> 00:33:28,800
into their own forms and
then, anytime someone submits

753
00:33:28,800 --> 00:33:31,860
the form, checking to make
sure the value of that token

754
00:33:31,860 --> 00:33:33,510
is, in fact, a valid token.

755
00:33:33,510 --> 00:33:37,260
And that way, someone couldn't
fake it because some other form

756
00:33:37,260 --> 00:33:41,370
on some other hacker's website
isn't going to have a valid CSRF

757
00:33:41,370 --> 00:33:44,580
token inside of their form page.

758
00:33:44,580 --> 00:33:49,020
And so larger scale web application
frameworks, like Django,

759
00:33:49,020 --> 00:33:52,384
offer easy ways to add CSRF
tokens to your forms, as well.

760
00:33:52,384 --> 00:33:54,300
But just something to
be aware of as you begin

761
00:33:54,300 --> 00:33:56,752
to think about, when you're
designing a web application,

762
00:33:56,752 --> 00:33:57,960
how could someone exploit it?

763
00:33:57,960 --> 00:34:00,360
How could someone make
requests on behalf of users

764
00:34:00,360 --> 00:34:02,640
that they don't intend
to in order to get

765
00:34:02,640 --> 00:34:05,880
some malicious result to come about?

766
00:34:05,880 --> 00:34:09,679
So lots of security things
to be thinking about.

767
00:34:09,679 --> 00:34:11,930
Questions about security or
any of the security topics

768
00:34:11,930 --> 00:34:13,387
that we've covered or talked about?

769
00:34:13,387 --> 00:34:13,943
Yeah?

770
00:34:13,943 --> 00:34:17,807
AUDIENCE: [INAUDIBLE] the token
is generated [INAUDIBLE] event,

771
00:34:17,807 --> 00:34:20,425
or it's a unique token for every user?

772
00:34:20,425 --> 00:34:21,050
BRIAN YU: Yeah.

773
00:34:21,050 --> 00:34:23,300
Imagine that in the
case of CS50 Finance,

774
00:34:23,300 --> 00:34:26,239
for instance, that when I click
on the Buy page that takes me

775
00:34:26,239 --> 00:34:29,480
to the page where I can buy
stocks, my route for buy

776
00:34:29,480 --> 00:34:32,090
is going to basically
generate a new token

777
00:34:32,090 --> 00:34:35,090
and insert it into the form
that then gets displayed to me.

778
00:34:35,090 --> 00:34:37,489
And then when I submit that
form, it gets submitted back

779
00:34:37,489 --> 00:34:38,582
to the same application.

780
00:34:38,582 --> 00:34:40,040
And the application can then check.

781
00:34:40,040 --> 00:34:43,678
Did the token that came back match the
token that I inserted into the page?

782
00:34:43,678 --> 00:34:45,469
And if they do, in
fact, match, then that's

783
00:34:45,469 --> 00:34:48,110
a way of sort of verifying
that the user was actually

784
00:34:48,110 --> 00:34:51,608
submitting the actual form
and not some fake form

785
00:34:51,608 --> 00:34:53,232
that they were tricked into submitting.
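As a rough sketch of the token check just described, assuming a hypothetical in-memory `sessions` store and hypothetical helper names (`issue_csrf_token`, `is_valid_csrf_token`) rather than any particular framework's API, the buy route could issue a token when it renders the form and compare it when the form comes back:

```python
import hmac
import secrets

# Hypothetical in-memory session store: session ID -> session data.
sessions = {}


def issue_csrf_token(session_id):
    """Generate a fresh token, remember it in the session, return it for the form."""
    token = secrets.token_urlsafe(32)
    sessions.setdefault(session_id, {})["csrf_token"] = token
    return token


def is_valid_csrf_token(session_id, submitted_token):
    """Check that the token that came back matches the one put into the page."""
    expected = sessions.get(session_id, {}).get("csrf_token")
    if expected is None or submitted_token is None:
        return False
    # Constant-time comparison, so timing does not leak how much of the token matched.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```

A form on some other site has no way to read the token stored for this session, so a forged submission fails the check. In practice a framework's built-in CSRF support is usually the safer choice.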

786
00:34:53,232 --> 00:34:57,220


787
00:34:57,220 --> 00:34:57,720
All right.

788
00:34:57,720 --> 00:34:59,636
In that case, let's
switch gears a little bit,

789
00:34:59,636 --> 00:35:01,320
and let's talk about scalability.

790
00:35:01,320 --> 00:35:02,940
Here again, there's going
to be even less code.

791
00:35:02,940 --> 00:35:05,523
And the idea is just going to
be, all right, what happens when

792
00:35:05,523 --> 00:35:07,110
we begin to scale our web application?

793
00:35:07,110 --> 00:35:09,750
We've got some web server,
and we've got some users

794
00:35:09,750 --> 00:35:13,020
that are using that web server, which
we're going to represent as that line.

795
00:35:13,020 --> 00:35:16,246
And so what happens
when that server starts

796
00:35:16,246 --> 00:35:18,120
to have more users that
are all trying to use

797
00:35:18,120 --> 00:35:19,980
the application at the same time?

798
00:35:19,980 --> 00:35:21,460
What do we do?

799
00:35:21,460 --> 00:35:24,810
Well, the first thing to probably
do is figure out how many users

800
00:35:24,810 --> 00:35:26,460
our website can actually support.

801
00:35:26,460 --> 00:35:29,532
How many can it handle before it
stops being able to support users?

802
00:35:29,532 --> 00:35:31,740
And so this is where
benchmarking is quite important.

803
00:35:31,740 --> 00:35:35,880
Benchmarking is just this process by
which we can test and sort of load test

804
00:35:35,880 --> 00:35:40,920
our application to
see how many users we could potentially

805
00:35:40,920 --> 00:35:42,626
handle on our server.

806
00:35:42,626 --> 00:35:45,000
And so what happens if we find
out via benchmarking that,

807
00:35:45,000 --> 00:35:49,590
OK, our server can only hold 100 users?

808
00:35:49,590 --> 00:35:53,015
What if we need to support
101 users or 102 users?

809
00:35:53,015 --> 00:35:53,640
What can we do?

810
00:35:53,640 --> 00:35:59,390


811
00:35:59,390 --> 00:36:02,317
One thing we can do is called
vertical scaling, where the idea here

812
00:36:02,317 --> 00:36:03,650
is, all right, we have a server.

813
00:36:03,650 --> 00:36:05,960
And that server only supports 100 users.

814
00:36:05,960 --> 00:36:08,510
All right, well, let's just
get a bigger server, right?

815
00:36:08,510 --> 00:36:11,630
Let's get a server that
supports 200 users or 300 users.

816
00:36:11,630 --> 00:36:14,022
And that's going to be able
to better handle that load.

817
00:36:14,022 --> 00:36:15,480
But there's a limit to this, right?

818
00:36:15,480 --> 00:36:19,160
There's a limit to how much you can
just increase the size of a server

819
00:36:19,160 --> 00:36:21,800
and increase its ability to handle load.

820
00:36:21,800 --> 00:36:24,565
And so what could you do to
be able to handle more users?

821
00:36:24,565 --> 00:36:25,752
AUDIENCE: More servers.

822
00:36:25,752 --> 00:36:26,710
BRIAN YU: More servers.

823
00:36:26,710 --> 00:36:27,210
Great.

824
00:36:27,210 --> 00:36:29,500
And this is an idea called
horizontal scaling, where

825
00:36:29,500 --> 00:36:31,360
the idea is that we have some server.

826
00:36:31,360 --> 00:36:33,412
And let's say, instead
of having one server,

827
00:36:33,412 --> 00:36:36,370
let's go ahead and have two servers
that are running the exact same web

828
00:36:36,370 --> 00:36:37,390
application.

829
00:36:37,390 --> 00:36:41,410
And now, we have two servers that
are able to run the application

830
00:36:41,410 --> 00:36:44,020
and handle twice as many people.

831
00:36:44,020 --> 00:36:47,770
What problems come
about now, logistically?

832
00:36:47,770 --> 00:36:51,412
User tries to access our
website, and now what?

833
00:36:51,412 --> 00:36:55,160


834
00:36:55,160 --> 00:36:55,680
Yeah?

835
00:36:55,680 --> 00:36:58,263
AUDIENCE: That means you could
have a race condition situation

836
00:36:58,263 --> 00:37:01,915
or how the servers communicate
to each other [INAUDIBLE].

837
00:37:01,915 --> 00:37:02,540
BRIAN YU: Yeah.

838
00:37:02,540 --> 00:37:04,070
How do the servers
communicate with each other?

839
00:37:04,070 --> 00:37:06,470
Certainly, race conditions
become a threat, as well.

840
00:37:06,470 --> 00:37:10,090
And then a fundamental problem
is a user comes to the site,

841
00:37:10,090 --> 00:37:12,650
and which server do they go to, right?

842
00:37:12,650 --> 00:37:16,911
We need some way of deciding which
server to direct a particular user to.

843
00:37:16,911 --> 00:37:19,910
And so generally, this is solved by
adding yet another piece of hardware

844
00:37:19,910 --> 00:37:23,150
into the mix, adding some load
balancer in between the user

845
00:37:23,150 --> 00:37:25,964
and the servers whereby a user,
when they request the page,

846
00:37:25,964 --> 00:37:28,880
rather than going straight to the
server, they go to the load balancer

847
00:37:28,880 --> 00:37:29,715
first.

848
00:37:29,715 --> 00:37:32,090
And from there on, the load
balancer can split people up,

849
00:37:32,090 --> 00:37:35,120
say certain people go to this server,
certain people go to that server,

850
00:37:35,120 --> 00:37:38,090
and try and decide how it is
that people are going to be

851
00:37:38,090 --> 00:37:41,240
divided into the different servers.

852
00:37:41,240 --> 00:37:44,570
And so how could a load balancer decide?

853
00:37:44,570 --> 00:37:47,880
If there are five servers
and a user comes along,

854
00:37:47,880 --> 00:37:51,959
how should a load balancer decide
which server to send a user to?

855
00:37:51,959 --> 00:37:53,500
There is no one right answer to this.

856
00:37:53,500 --> 00:37:56,041
There are a number of possible
options, a number of different

857
00:37:56,041 --> 00:37:57,690
what are called load balancing methods.

858
00:37:57,690 --> 00:37:59,890
But how could you decide
where to send a user?

859
00:37:59,890 --> 00:38:01,529
Yeah?

860
00:38:01,529 --> 00:38:04,490
AUDIENCE: The server with the
least amount of users currently.

861
00:38:04,490 --> 00:38:04,730
BRIAN YU: Sure.

862
00:38:04,730 --> 00:38:06,610
The server with the fewest
users currently, what's often

863
00:38:06,610 --> 00:38:08,900
called the fewest connections
load balancing method.

864
00:38:08,900 --> 00:38:11,800
You try and figure out which
server has the fewest people on it.

865
00:38:11,800 --> 00:38:14,620
And whichever one has the fewest
people on it, send the user there.

866
00:38:14,620 --> 00:38:18,100
Definitely good for trying to make sure
that each one has about an equal load,

867
00:38:18,100 --> 00:38:20,037
but potentially
computationally expensive.

868
00:38:20,037 --> 00:38:22,620
You're doing a lot of calculation
now, so there's a trade off.

869
00:38:22,620 --> 00:38:22,850
Yeah?

870
00:38:22,850 --> 00:38:24,200
AUDIENCE: You could just do it randomly.

871
00:38:24,200 --> 00:38:24,880
BRIAN YU: You could do it randomly.

872
00:38:24,880 --> 00:38:27,250
You could just generate a
random number between 1 and 5

873
00:38:27,250 --> 00:38:29,374
and randomly assign someone
to a particular server.

874
00:38:29,374 --> 00:38:30,970
Definitely something you could do.

875
00:38:30,970 --> 00:38:32,080
Other things?

876
00:38:32,080 --> 00:38:34,224
Certainly the random approach is quick.

877
00:38:34,224 --> 00:38:36,640
It doesn't involve having to
do any calculation across all

878
00:38:36,640 --> 00:38:38,252
the different servers.

879
00:38:38,252 --> 00:38:40,210
But if you're unlucky,
you could end up putting

880
00:38:40,210 --> 00:38:43,600
a lot of people on server number two and
not many people on server number eight

881
00:38:43,600 --> 00:38:44,290
or whatnot.

882
00:38:44,290 --> 00:38:45,220
And so what else could we do?

883
00:38:45,220 --> 00:38:45,720
Yeah?

884
00:38:45,720 --> 00:38:49,125
AUDIENCE: Just set up
a counter [INAUDIBLE].

885
00:38:49,125 --> 00:38:49,750
BRIAN YU: Sure.

886
00:38:49,750 --> 00:38:50,625
Some sort of counter.

887
00:38:50,625 --> 00:38:53,260
If you only have two, you just
alternate odd, even, odd, even.

888
00:38:53,260 --> 00:38:54,010
Go to this server.

889
00:38:54,010 --> 00:38:54,820
Go to that one.

890
00:38:54,820 --> 00:38:57,153
If you've got eight, you just
rotate amongst the eight--

891
00:38:57,153 --> 00:38:59,045
1, 2, 3, 4, 5, 6, 7, 8 and go back to 1.

892
00:38:59,045 --> 00:39:02,170
And so these are probably three of the
most common load balancing methods--

893
00:39:02,170 --> 00:39:05,336
random choice, whereby you just pick a
random server, direct the user there;

894
00:39:05,336 --> 00:39:09,040
round robin, where we do exactly that,
just basically go one up until the end

895
00:39:09,040 --> 00:39:12,220
and then go back to server number one;
and then fewest connections, whereby

896
00:39:12,220 --> 00:39:14,530
you try and actually calculate
which server currently

897
00:39:14,530 --> 00:39:16,810
has the fewest number
of people on it and then

898
00:39:16,810 --> 00:39:20,887
try and direct the user to that
one with the fewest connections.
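The three methods can be sketched in a few lines each. This is only an illustration: the function names and the `active` dictionary of connection counts are assumptions, not part of any real load balancer's interface.

```python
import random
from itertools import cycle


def random_choice(servers):
    """Random choice: pick any server, with no bookkeeping at all."""
    return random.choice(servers)


def make_round_robin(servers):
    """Round robin: hand out servers in order, wrapping back to the first."""
    rotation = cycle(servers)
    return lambda: next(rotation)


def fewest_connections(active):
    """Fewest connections: `active` maps each server to its current connection count."""
    return min(active, key=active.get)
```

Random choice and round robin need no knowledge of the servers' state, while fewest connections needs an up-to-date count for every server, which is where the extra cost comes from.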

899
00:39:20,887 --> 00:39:22,720
There are other methods
in addition to this,

900
00:39:22,720 --> 00:39:24,520
but these are perhaps
three of the most intuitive

901
00:39:24,520 --> 00:39:26,460
where you can start to
see their trade offs.

902
00:39:26,460 --> 00:39:28,210
Depending upon the
type of user experience

903
00:39:28,210 --> 00:39:30,460
you want, depending
on how computationally

904
00:39:30,460 --> 00:39:34,420
expensive certain operations are, you
might choose different load balancing

905
00:39:34,420 --> 00:39:36,610
methods.

906
00:39:36,610 --> 00:39:37,750
Yeah?

907
00:39:37,750 --> 00:39:41,470
AUDIENCE: [INAUDIBLE] benchmarking, and
what are some common ways to do that?

908
00:39:41,470 --> 00:39:44,304
BRIAN YU: Yeah, there are
software tools that can do this.

909
00:39:44,304 --> 00:39:46,970
There are a number of different
ones-- the names are escaping me

910
00:39:46,970 --> 00:39:47,690
at the moment--

911
00:39:47,690 --> 00:39:51,260
where you can basically
test on a particular URL

912
00:39:51,260 --> 00:39:54,550
and get a sense for how well
it's able to handle that load.

913
00:39:54,550 --> 00:39:59,430
And if you have particular use cases, I
can chat with you about that, as well.

914
00:39:59,430 --> 00:40:01,980
So all right, let's imagine
we have two servers now.

915
00:40:01,980 --> 00:40:04,860
And every time a user
makes an HTTP request

916
00:40:04,860 --> 00:40:06,900
to a server, every time
they request a page,

917
00:40:06,900 --> 00:40:09,240
we direct them to one server
or the other server using

918
00:40:09,240 --> 00:40:12,330
one of these methods, either by
choosing randomly or by round robin

919
00:40:12,330 --> 00:40:15,510
or by figuring out which one currently
has the fewest users connected to it

920
00:40:15,510 --> 00:40:17,580
or is handling the fewest connections.

921
00:40:17,580 --> 00:40:18,330
What can go wrong?

922
00:40:18,330 --> 00:40:21,094


923
00:40:21,094 --> 00:40:24,260
Whenever we're dealing with issues of
scale, we just try and solve a problem

924
00:40:24,260 --> 00:40:26,134
and figure out what new
problems have arisen.

925
00:40:26,134 --> 00:40:33,724


926
00:40:33,724 --> 00:40:34,720
Yeah?

927
00:40:34,720 --> 00:40:37,400
AUDIENCE: You only have five
servers, and now you need six.

928
00:40:37,400 --> 00:40:37,640
BRIAN YU: Yeah.

929
00:40:37,640 --> 00:40:40,010
Certainly, if you only have five
servers and suddenly you need six,

930
00:40:40,010 --> 00:40:42,051
that could potentially
become a problem, as well.

931
00:40:42,051 --> 00:40:44,180
But let's even assume that
we have enough servers.

932
00:40:44,180 --> 00:40:47,600
We have five servers, and
every time someone loads a page,

933
00:40:47,600 --> 00:40:51,470
they get sent to a different server
based on one of these methods.

934
00:40:51,470 --> 00:40:54,804
What can still go wrong
with the user experience?

935
00:40:54,804 --> 00:40:56,470
And in particular, I'll give you a hint.

936
00:40:56,470 --> 00:40:58,740
Let's think about sessions.

937
00:40:58,740 --> 00:40:59,490
What can go wrong?

938
00:40:59,490 --> 00:41:04,290


939
00:41:04,290 --> 00:41:07,040
Remember, sessions were ways of
storing information-- in our case,

940
00:41:07,040 --> 00:41:08,702
inside of the server--

941
00:41:08,702 --> 00:41:10,910
about the user's current
interaction with the server.

942
00:41:10,910 --> 00:41:12,574
It stored which user was logged in.

943
00:41:12,574 --> 00:41:14,740
It stored the current state
of the tic-tac-toe game.

944
00:41:14,740 --> 00:41:15,940
It stored other information.

945
00:41:15,940 --> 00:41:17,126
Yeah?

946
00:41:17,126 --> 00:41:19,994
AUDIENCE: You have to
pick one [INAUDIBLE].

947
00:41:19,994 --> 00:41:24,691


948
00:41:24,691 --> 00:41:25,690
BRIAN YU: Yeah, exactly.

949
00:41:25,690 --> 00:41:29,689
If I initially load a page and I go
to server one and some information

950
00:41:29,689 --> 00:41:32,230
about me is stored in the session,
like whether I'm logged in

951
00:41:32,230 --> 00:41:34,790
or the current state of
my game or something else,

952
00:41:34,790 --> 00:41:37,000
and then I load another
page and it takes

953
00:41:37,000 --> 00:41:40,750
me to server four this
time, well, now, that server

954
00:41:40,750 --> 00:41:43,900
doesn't have access to the
same session information

955
00:41:43,900 --> 00:41:46,420
that server one had if the
information about the session

956
00:41:46,420 --> 00:41:47,654
was stored in the server.

957
00:41:47,654 --> 00:41:49,070
And now, that information is lost.

958
00:41:49,070 --> 00:41:50,986
So I could load a page,
and suddenly, now, I'm

959
00:41:50,986 --> 00:41:53,036
logged out of the page
for no apparent reason

960
00:41:53,036 --> 00:41:54,910
even though I've logged
in just a moment ago.

961
00:41:54,910 --> 00:41:56,680
And then I could go to another
page, and maybe by chance,

962
00:41:56,680 --> 00:41:59,390
I'm back to server one, and
now I'm logged in again.

963
00:41:59,390 --> 00:42:01,730
So strange things can begin to happen.

964
00:42:01,730 --> 00:42:03,640
And so to solve that, what could we do?

965
00:42:03,640 --> 00:42:06,410


966
00:42:06,410 --> 00:42:08,780
How can we make sure that
sessions are preserved

967
00:42:08,780 --> 00:42:11,060
when the user is requesting pages?

968
00:42:11,060 --> 00:42:13,860


969
00:42:13,860 --> 00:42:15,270
Again, no one correct answer.

970
00:42:15,270 --> 00:42:16,440
Multiple possibilities here.

971
00:42:16,440 --> 00:42:19,170


972
00:42:19,170 --> 00:42:21,574
How do we solve this problem?

973
00:42:21,574 --> 00:42:22,074
Yeah?

974
00:42:22,074 --> 00:42:25,380
AUDIENCE: Would there be any way to store
the session on the load balancer?

975
00:42:25,380 --> 00:42:27,750
BRIAN YU: Store the session
on the load balancer.

976
00:42:27,750 --> 00:42:28,650
That's a good idea.

977
00:42:28,650 --> 00:42:30,858
And that will actually get
me at the first idea here,

978
00:42:30,858 --> 00:42:33,689
which is this idea of sticky sessions.

979
00:42:33,689 --> 00:42:34,980
And this is slightly different.

980
00:42:34,980 --> 00:42:40,020
Rather than store all the session
information in the load balancer,

981
00:42:40,020 --> 00:42:42,750
it just needs to store for
this particular user which

982
00:42:42,750 --> 00:42:45,400
server has their session information.

983
00:42:45,400 --> 00:42:47,640
So if I went to server
number one initially,

984
00:42:47,640 --> 00:42:51,649
the load balancer will remember me based
on my IP address, cookie, or whatever

985
00:42:51,649 --> 00:42:53,940
and say, all right, next time
I try and request a page,

986
00:42:53,940 --> 00:42:56,760
let me direct them back to
server number one, for instance.

987
00:42:56,760 --> 00:43:00,624
That way, whenever I come back, I'm
always going to go to the same place.
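A minimal sketch of sticky sessions, assuming a hypothetical `StickyLoadBalancer` class and using round robin for a client's first visit: the load balancer stores only which server each client was assigned, not the session data itself.

```python
from itertools import cycle


class StickyLoadBalancer:
    def __init__(self, servers):
        self._rotation = cycle(servers)
        # Client key (an IP address or cookie value, say) -> assigned server.
        self._assigned = {}

    def route(self, client_key):
        """Send a returning client to its original server; assign new clients in turn."""
        if client_key not in self._assigned:
            self._assigned[client_key] = next(self._rotation)
        return self._assigned[client_key]
```

One trade-off of this approach is that if the assigned server goes away, the session information stored on it goes with it.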

988
00:43:00,624 --> 00:43:02,790
There are other ways to
solve this problem, as well.

989
00:43:02,790 --> 00:43:04,915
You could store session
information in the database

990
00:43:04,915 --> 00:43:06,510
that all the servers have access to.

991
00:43:06,510 --> 00:43:09,210
You could store session information
on the client side, whereby

992
00:43:09,210 --> 00:43:11,850
it doesn't matter what server you go to,
because all the session information is

993
00:43:11,850 --> 00:43:12,624
inside the client.

994
00:43:12,624 --> 00:43:14,790
So there are a number of
ways to solve this problem,

995
00:43:14,790 --> 00:43:18,300
but these generally fall under
the heading of session-aware load

996
00:43:18,300 --> 00:43:20,870
balancing.

997
00:43:20,870 --> 00:43:23,750
Someone mentioned the problem of,
OK, well, I have five servers,

998
00:43:23,750 --> 00:43:26,720
but what happens when I need six?

999
00:43:26,720 --> 00:43:28,930
To solve this in the
world of cloud computing,

1000
00:43:28,930 --> 00:43:31,520
where nowadays most people don't
maintain their own hardware

1001
00:43:31,520 --> 00:43:33,680
for their web applications,
they just rent out

1002
00:43:33,680 --> 00:43:37,310
hardware on someone else's servers,
for instance, on AWS, for instance,

1003
00:43:37,310 --> 00:43:39,830
use Amazon servers--

1004
00:43:39,830 --> 00:43:44,800
you can take advantage of auto scaling,
which automatically will grow or shrink

1005
00:43:44,800 --> 00:43:47,550
the number of servers based upon
load, whereby you could initially

1006
00:43:47,550 --> 00:43:48,262
have two servers.

1007
00:43:48,262 --> 00:43:50,220
But if more users come
about and you need more,

1008
00:43:50,220 --> 00:43:51,845
we can add a third server into the mix.

1009
00:43:51,845 --> 00:43:53,511
More people come about, we need even more.

1010
00:43:53,511 --> 00:43:54,600
We add a fourth server.

1011
00:43:54,600 --> 00:43:57,200
And auto scaling goes
in both directions.

1012
00:43:57,200 --> 00:43:59,790
So if suddenly we find, all
right, we had a lot of load

1013
00:43:59,790 --> 00:44:02,340
at this particular peak time
of the day but now there are

1014
00:44:02,340 --> 00:44:05,374
fewer users on the site, the auto
scaler can sort of say,

1015
00:44:05,374 --> 00:44:07,290
all right, we don't need
four servers anymore.

1016
00:44:07,290 --> 00:44:09,780
Let's go back to three and then
later on, if it needs doing,

1017
00:44:09,780 --> 00:44:10,821
go back up to four again.

1018
00:44:10,821 --> 00:44:15,810
And it can automatically, dynamically
reconfigure the number of servers

1019
00:44:15,810 --> 00:44:19,050
in order to figure out
what the optimal number is

1020
00:44:19,050 --> 00:44:23,170
given the number of users that are
currently using the application.
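The core decision can be sketched as a small calculation. The figure of 100 users per server comes from the earlier benchmarking example; the function name and the `min_servers` and `max_servers` bounds are assumptions, and real cloud auto scaling typically reacts to measured metrics rather than a simple user count.

```python
import math


def desired_server_count(current_users, users_per_server=100,
                         min_servers=1, max_servers=10):
    """Return how many servers to run for the current load, within fixed bounds."""
    needed = math.ceil(current_users / users_per_server)
    return max(min_servers, min(max_servers, needed))
```

Because the result is recomputed from the current load, the same rule grows the pool at peak times and shrinks it again afterward.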

1021
00:44:23,170 --> 00:44:28,450
What happens, though, when one of
the servers fails for some reason?

1022
00:44:28,450 --> 00:44:31,740
The server just dies, for instance.

1023
00:44:31,740 --> 00:44:34,817
The load balancer doesn't
necessarily know about that.

1024
00:44:34,817 --> 00:44:37,650
And so if it's still directing
people across four different servers,

1025
00:44:37,650 --> 00:44:43,200
it could direct users to that server
that is no longer operational.

1026
00:44:43,200 --> 00:44:45,730
Any thoughts on how we
might solve that problem?

1027
00:44:45,730 --> 00:44:46,230
Yeah?

1028
00:44:46,230 --> 00:44:49,395
AUDIENCE: Have the load balancer ping
the server at determined intervals

1029
00:44:49,395 --> 00:44:50,520
to see if it's still there.

1030
00:44:50,520 --> 00:44:51,900
BRIAN YU: Yeah, some
sort of ping to make sure

1031
00:44:51,900 --> 00:44:53,040
that the server is still there.

1032
00:44:53,040 --> 00:44:55,206
And often, one of the easiest
ways that this is done

1033
00:44:55,206 --> 00:44:57,780
is via what's called a heartbeat,
whereby each of the servers

1034
00:44:57,780 --> 00:45:01,680
gives off a heartbeat every fixed number
of seconds or minutes, for instance,

1035
00:45:01,680 --> 00:45:04,920
whereby if every 10 seconds
the server pings the heartbeat,

1036
00:45:04,920 --> 00:45:06,540
that gets sent to the load balancer.

1037
00:45:06,540 --> 00:45:09,660
If ever the load balancer doesn't
hear the heartbeat from the server,

1038
00:45:09,660 --> 00:45:12,579
it can know that that server is no
longer operational, and it can say,

1039
00:45:12,579 --> 00:45:13,620
all right, you know what?

1040
00:45:13,620 --> 00:45:17,535
Let's stop sending users there and only
send users to the other three servers.
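Here is one way the load balancer's side of a heartbeat might look, as a sketch. The `HeartbeatMonitor` class and the 30-second timeout are assumptions (roughly three missed 10-second beats), and times are passed in explicitly so the logic is easy to follow.

```python
class HeartbeatMonitor:
    def __init__(self, servers, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        # Time of the most recent heartbeat from each server (None = never heard).
        self.last_seen = {server: None for server in servers}

    def record_heartbeat(self, server, now):
        """Called whenever a heartbeat arrives from a server."""
        self.last_seen[server] = now

    def healthy_servers(self, now):
        """Servers whose last heartbeat arrived within the timeout window."""
        return [
            server
            for server, seen in self.last_seen.items()
            if seen is not None and now - seen <= self.timeout_seconds
        ]
```

The load balancer would then choose only among `healthy_servers(now)` when routing a request.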

1041
00:45:17,535 --> 00:45:20,200


1042
00:45:20,200 --> 00:45:24,740
Questions about that or any of the
ideas of how we scale our servers

1043
00:45:24,740 --> 00:45:26,840
to be able to handle load?

1044
00:45:26,840 --> 00:45:29,340
We decided, all right, if too
many people are on one server,

1045
00:45:29,340 --> 00:45:31,550
we need to split up into
two different servers.

1046
00:45:31,550 --> 00:45:33,140
But that introduced a
bunch of problems that we

1047
00:45:33,140 --> 00:45:36,098
had to solve-- problems about load
balancing, problems about what to do

1048
00:45:36,098 --> 00:45:38,700
about sessions, so on and so forth.

1049
00:45:38,700 --> 00:45:39,820
Yeah?

1050
00:45:39,820 --> 00:45:43,215
AUDIENCE: Do you hear a lot
about distributed servers?

1051
00:45:43,215 --> 00:45:45,640
I'm wondering how they [INAUDIBLE].

1052
00:45:45,640 --> 00:45:48,966


1053
00:45:48,966 --> 00:45:49,590
BRIAN YU: Sure.

1054
00:45:49,590 --> 00:45:52,260
How do servers share data?

1055
00:45:52,260 --> 00:45:54,030
Well, they use databases.

1056
00:45:54,030 --> 00:45:57,930
And of course, as we start to figure out
what to do with more and more servers,

1057
00:45:57,930 --> 00:46:00,180
we also need to figure out
what to do about databases,

1058
00:46:00,180 --> 00:46:03,720
figure out how to scale databases
and make sure that as we scale them,

1059
00:46:03,720 --> 00:46:06,430
the databases are able to
handle that load, as well.

1060
00:46:06,430 --> 00:46:08,880
And so in the past, we've had,
all right, a load balancer.

1061
00:46:08,880 --> 00:46:10,050
We've got servers.

1062
00:46:10,050 --> 00:46:13,170
And in our model right now, we have
a database that both of these servers

1063
00:46:13,170 --> 00:46:15,270
are connected to.

1064
00:46:15,270 --> 00:46:18,990
But of course, the problem is
soon going to arise of, all right,

1065
00:46:18,990 --> 00:46:20,910
now we've got a lot of
servers that are all

1066
00:46:20,910 --> 00:46:23,250
trying to connect to the same database.

1067
00:46:23,250 --> 00:46:25,110
And now, we've got yet
another single point

1068
00:46:25,110 --> 00:46:27,210
where things could
potentially go wrong or where

1069
00:46:27,210 --> 00:46:29,190
we could potentially be overloaded.

1070
00:46:29,190 --> 00:46:31,020
So how do we solve this type of problem?

1071
00:46:31,020 --> 00:46:33,900
One of the most common ways
is database partitioning.

1072
00:46:33,900 --> 00:46:36,610
One form of database partitioning
you've, in fact, already seen,

1073
00:46:36,610 --> 00:46:39,180
and it's just an extension of
what we've been doing with SQL,

1074
00:46:39,180 --> 00:46:41,430
whereby we have this flights table.

1075
00:46:41,430 --> 00:46:45,540
And we could say, all right, rather than
store the origin and the origin code,

1076
00:46:45,540 --> 00:46:47,679
let's go ahead and separate
what's in one table

1077
00:46:47,679 --> 00:46:48,970
into a couple different tables.

1078
00:46:48,970 --> 00:46:51,780
Let's separate the flights
table into a locations table

1079
00:46:51,780 --> 00:46:55,140
where the locations table has a
number for each possible location.

1080
00:46:55,140 --> 00:46:57,210
And then it also, in
the flights table, now,

1081
00:46:57,210 --> 00:47:03,240
only needs to store a single number for
the origin ID and the destination ID.
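As an illustration of that split, using Python's built-in `sqlite3` module. The exact column names and the sample rows are assumptions made up for this sketch, not the schema from the earlier lectures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (
        id INTEGER PRIMARY KEY,
        code TEXT NOT NULL,
        name TEXT NOT NULL
    );
    CREATE TABLE flights (
        id INTEGER PRIMARY KEY,
        origin_id INTEGER NOT NULL REFERENCES locations(id),
        destination_id INTEGER NOT NULL REFERENCES locations(id),
        duration INTEGER NOT NULL
    );
    INSERT INTO locations (id, code, name)
        VALUES (1, 'JFK', 'New York'), (2, 'LHR', 'London');
    INSERT INTO flights (origin_id, destination_id, duration)
        VALUES (1, 2, 415);
""")

# Each flight stores only two numbers; the codes come back via a join.
rows = conn.execute("""
    SELECT origin.code, destination.code, flights.duration
    FROM flights
    JOIN locations AS origin ON flights.origin_id = origin.id
    JOIN locations AS destination ON flights.destination_id = destination.id
""").fetchall()
```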

1082
00:47:03,240 --> 00:47:05,550
We could also separate
tables in different ways.

1083
00:47:05,550 --> 00:47:09,450
If we have some general
way we could partition

1084
00:47:09,450 --> 00:47:11,700
a table into different
parts that are generally

1085
00:47:11,700 --> 00:47:13,980
going to be queried
separately, then we can

1086
00:47:13,980 --> 00:47:16,560
do another partition where
I could say, all right,

1087
00:47:16,560 --> 00:47:18,330
my flights table is getting big.

1088
00:47:18,330 --> 00:47:19,620
Let's split it up.

1089
00:47:19,620 --> 00:47:23,670
And all right, at my airline, the
international departures and arrivals

1090
00:47:23,670 --> 00:47:26,520
are handled separately from the
domestic departures and arrivals.

1091
00:47:26,520 --> 00:47:28,647
So no need for those to
be in the same table.

1092
00:47:28,647 --> 00:47:30,855
Let me just go ahead and
take flights and separate it

1093
00:47:30,855 --> 00:47:33,480
into a domestic flights table and
an international flights table,

1094
00:47:33,480 --> 00:47:34,060
for instance.

1095
00:47:34,060 --> 00:47:36,900
One way to just partition things
into two different tables that

1096
00:47:36,900 --> 00:47:39,570
could potentially be stored in
different places that ultimately

1097
00:47:39,570 --> 00:47:43,680
allows for handling of scale.

1098
00:47:43,680 --> 00:47:45,887
But ultimately, all
of these are problems

1099
00:47:45,887 --> 00:47:48,720
that are still going to lead to the
fundamental problem of if I only

1100
00:47:48,720 --> 00:47:52,530
have one database and 10 or
dozens of servers that are all

1101
00:47:52,530 --> 00:47:54,510
trying to communicate
with that same database,

1102
00:47:54,510 --> 00:47:55,884
we're going to run into problems.

1103
00:47:55,884 --> 00:47:58,830
The database can only handle
some fixed number of connections.

1104
00:47:58,830 --> 00:48:03,120
And so one solution to this
is database replication.

1105
00:48:03,120 --> 00:48:06,600
So all right, how does
database replication work?

1106
00:48:06,600 --> 00:48:10,140
Well, probably the simplest
form of database replication

1107
00:48:10,140 --> 00:48:13,410
is what's called single
primary replication, whereby

1108
00:48:13,410 --> 00:48:16,470
I have one what's called
primary database and maybe

1109
00:48:16,470 --> 00:48:18,460
three databases in total,
but only one that I'm

1110
00:48:18,460 --> 00:48:20,460
going to consider the primary one.

1111
00:48:20,460 --> 00:48:22,980
And you can read data
from any of the databases.

1112
00:48:22,980 --> 00:48:25,380
You can get data out of
any of the three databases,

1113
00:48:25,380 --> 00:48:29,100
whereby if there are three servers
and each one wants to read data,

1114
00:48:29,100 --> 00:48:31,620
they can just share among the
three databases reading data

1115
00:48:31,620 --> 00:48:33,578
to make sure that we're
not overloading any one

1116
00:48:33,578 --> 00:48:35,600
database with too many connections.

1117
00:48:35,600 --> 00:48:39,970
But you can only write
data to a single database.

1118
00:48:39,970 --> 00:48:42,220
And by only writing data
to a single database,

1119
00:48:42,220 --> 00:48:44,860
that means that anytime
this database is updated,

1120
00:48:44,860 --> 00:48:47,276
then this database,
our primary database,

1121
00:48:47,276 --> 00:48:49,150
just needs to update
the other two databases.

1122
00:48:49,150 --> 00:48:52,000
Say, all right, there's been a
change made to the primary database.

1123
00:48:52,000 --> 00:48:54,310
And it's the primary
database's responsibility

1124
00:48:54,310 --> 00:48:59,005
to then communicate to the other two
databases what those changes are.

1125
00:48:59,005 --> 00:49:00,970
And so that's
single-primary replication.
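A toy model of the idea, with plain dictionaries standing in for databases. The class names are hypothetical, and the copy to the replicas happens immediately here, whereas a real system would propagate changes with some delay.

```python
class Database:
    def __init__(self):
        self.rows = {}


class SinglePrimaryCluster:
    def __init__(self, replica_count=2):
        self.primary = Database()
        self.replicas = [Database() for _ in range(replica_count)]
        self._next_read = 0

    def write(self, key, value):
        """All writes go to the primary, which then updates the other databases."""
        self.primary.rows[key] = value
        for replica in self.replicas:
            replica.rows[key] = value

    def read(self, key):
        """Reads are shared among all the databases in turn."""
        databases = [self.primary] + self.replicas
        db = databases[self._next_read % len(databases)]
        self._next_read += 1
        return db.rows.get(key)
```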

1126
00:49:00,970 --> 00:49:01,470
Yeah?

1127
00:49:01,470 --> 00:49:04,720
AUDIENCE: How is that more efficient
than just communicating with all three

1128
00:49:04,720 --> 00:49:05,341
of them?

1129
00:49:05,341 --> 00:49:07,090
Because I think you're
sending information

1130
00:49:07,090 --> 00:49:09,460
from the first database
to the second and third.

1131
00:49:09,460 --> 00:49:16,160
[INAUDIBLE] information sent that's
just rewriting to all three of them.

1132
00:49:16,160 --> 00:49:17,410
BRIAN YU: That's true, though.

1133
00:49:17,410 --> 00:49:19,330
Databases could potentially
batch information

1134
00:49:19,330 --> 00:49:21,354
together into transactions
and things and groups

1135
00:49:21,354 --> 00:49:23,020
so as to be a little bit more efficient.

1136
00:49:23,020 --> 00:49:24,400
So certainly ways around that problem.

1137
00:49:24,400 --> 00:49:25,358
But yeah, a good point.

1138
00:49:25,358 --> 00:49:29,400


1139
00:49:29,400 --> 00:49:32,160
Of course, this helps the read problem.

1140
00:49:32,160 --> 00:49:35,220
It makes it easier to be able
to read data out of databases.

1141
00:49:35,220 --> 00:49:37,590
But it leaves open a
potential vulnerability

1142
00:49:37,590 --> 00:49:40,500
or a potential scalability problem
with regard to writing data,

1143
00:49:40,500 --> 00:49:43,410
because there is still only a single
database on which I can actually

1144
00:49:43,410 --> 00:49:46,590
write data to if that one database
is responsible for updating

1145
00:49:46,590 --> 00:49:48,090
all of the other databases.

1146
00:49:48,090 --> 00:49:50,160
And so a more complex
version of this is what's

1147
00:49:50,160 --> 00:49:52,200
known as multi-primary
replication, where

1148
00:49:52,200 --> 00:49:55,410
the idea is that each database
can be read from and written to.

1149
00:49:55,410 --> 00:49:57,900
But now, updates get a
lot more complicated.

1150
00:49:57,900 --> 00:50:00,480
All of the databases need to
have some notion and some way

1151
00:50:00,480 --> 00:50:02,340
of being able to update each other.

1152
00:50:02,340 --> 00:50:04,170
And there, conflicts begin to arise.

1153
00:50:04,170 --> 00:50:07,680
You can have update conflicts
where two different databases

1154
00:50:07,680 --> 00:50:09,129
have updated the same row.

1155
00:50:09,129 --> 00:50:10,920
All right, how do you
resolve that problem?

1156
00:50:10,920 --> 00:50:13,500
You can have uniqueness
conflicts, whereby

1157
00:50:13,500 --> 00:50:17,220
if you add a row to each of two
databases at the same time, maybe

1158
00:50:17,220 --> 00:50:18,570
they get the same ID.

1159
00:50:18,570 --> 00:50:21,390
Maybe this one only has
27 rows, so this database

1160
00:50:21,390 --> 00:50:24,507
adds a new row with ID number 28, and
this database does the same thing.

1161
00:50:24,507 --> 00:50:26,340
And now, when they try
to update each other,

1162
00:50:26,340 --> 00:50:28,096
we have two rows with the same ID.

1163
00:50:28,096 --> 00:50:29,970
And now, we need some
way of resolving those,

1164
00:50:29,970 --> 00:50:31,770
because the IDs are
supposed to be unique.

1165
00:50:31,770 --> 00:50:34,620
And so that can create
problems, as well.
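
[Editor's sketch] The uniqueness conflict described here can be reproduced with two in-memory stand-ins for the databases. Everything below is hypothetical: the Replica class, its offset and stride parameters, and the conflict check are illustrations, not part of the lecture. The stride idea (each database only hands out IDs from its own slot) is one common way to avoid the collision, named here as an assumption rather than as the lecture's recommendation.

```python
class Replica:
    """Toy stand-in for one database in a multi-primary setup (hypothetical)."""

    def __init__(self, rows, offset=0, stride=1):
        self.rows = dict(rows)  # maps row ID -> row data
        self.offset = offset    # which ID slot this replica owns
        self.stride = stride    # number of replicas sharing the ID space

    def insert(self, data):
        # Naive auto-increment: next ID is one past the largest ID seen locally.
        new_id = max(self.rows, default=0) + 1
        # With stride > 1, skip ahead until the ID falls in this replica's slot.
        while new_id % self.stride != self.offset:
            new_id += 1
        self.rows[new_id] = data
        return new_id


def uniqueness_conflicts(a, b):
    """IDs present in both replicas that refer to different rows."""
    return sorted(i for i in a.rows.keys() & b.rows.keys() if a.rows[i] != b.rows[i])


shared = {i: f"row {i}" for i in range(1, 28)}  # both start with the same 27 rows

# Default settings: both replicas pick ID 28 for different rows.
a, b = Replica(shared), Replica(shared)
a.insert("alice")
b.insert("bob")
naive_conflicts = uniqueness_conflicts(a, b)

# Assumed mitigation: replica a owns even IDs, replica b owns odd IDs.
a2, b2 = Replica(shared, offset=0, stride=2), Replica(shared, offset=1, stride=2)
id_a = a2.insert("alice")
id_b = b2.insert("bob")
strided_conflicts = uniqueness_conflicts(a2, b2)
```

In the naive case both inserts land on ID 28, which is exactly the clash described above. With the assumed stride scheme, the two inserts receive different IDs, so there is nothing to reconcile for that row.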

1166
00:50:34,620 --> 00:50:37,470
And then there are other types of
conflicts, too-- delete conflicts,

1167
00:50:37,470 --> 00:50:40,320
whereby one database tries to
delete a row at the same time

1168
00:50:40,320 --> 00:50:42,329
another database tries to update that same row.

1169
00:50:42,329 --> 00:50:43,120
So which do you do?

1170
00:50:43,120 --> 00:50:43,920
Do you update the row?

1171
00:50:43,920 --> 00:50:44,940
Do you delete the row?

1172
00:50:44,940 --> 00:50:47,356
And so these are all conflicts
that when you're setting up

1173
00:50:47,356 --> 00:50:49,410
a multi-primary replication
system, you need

1174
00:50:49,410 --> 00:50:52,320
to figure out how you're going to
ultimately resolve those conflicts.

1175
00:50:52,320 --> 00:50:54,780
You gain the ability to
write to all the databases,

1176
00:50:54,780 --> 00:50:57,550
but new problems arise
as you begin to do that.

1177
00:50:57,550 --> 00:50:58,157
Yeah?

1178
00:50:58,157 --> 00:51:01,973
AUDIENCE: So is the information
in each database the same?

1179
00:51:01,973 --> 00:51:04,055
Are they [INAUDIBLE] with each other?

1180
00:51:04,055 --> 00:51:04,680
BRIAN YU: Yeah.

1181
00:51:04,680 --> 00:51:06,690
In this model, the
databases in general are

1182
00:51:06,690 --> 00:51:09,454
going to be the same, though
they're not always perfectly going

1183
00:51:09,454 --> 00:51:12,120
to be in sync, which is yet another
problem, whereby there might

1184
00:51:12,120 --> 00:51:14,640
be some time after I
write to this database

1185
00:51:14,640 --> 00:51:18,491
before that data propagates through
all of the databases, for instance.

1186
00:51:18,491 --> 00:51:20,989
AUDIENCE: So why not keep it in one?

1187
00:51:20,989 --> 00:51:23,530
BRIAN YU: You could keep all
the information in one database.

1188
00:51:23,530 --> 00:51:27,070
But a single database server can
only handle so many connections.

1189
00:51:27,070 --> 00:51:30,310
And so you might imagine that having
three different servers, three

1190
00:51:30,310 --> 00:51:33,070
different computers that are all
able to handle incoming requests,

1191
00:51:33,070 --> 00:51:35,106
just increases the capacity
of your application

1192
00:51:35,106 --> 00:51:36,730
to be able to handle that kind of load.

1193
00:51:36,730 --> 00:51:41,690


1194
00:51:41,690 --> 00:51:42,530
All right.

1195
00:51:42,530 --> 00:51:46,820
Questions about databases, database
replication, any of the scale problems

1196
00:51:46,820 --> 00:51:49,681
that come about there?

1197
00:51:49,681 --> 00:51:50,180
All right.

1198
00:51:50,180 --> 00:51:53,013
Final thing I'll mention on the
topic of scaling that can be helpful

1199
00:51:53,013 --> 00:51:54,302
is just the idea of caching.

1200
00:51:54,302 --> 00:51:56,510
Caching is something we've
talked about a lot before.

1201
00:51:56,510 --> 00:52:00,380
But a general idea could be that in
order to try and solve this problem

1202
00:52:00,380 --> 00:52:03,350
of constantly having to request
information from the database,

1203
00:52:03,350 --> 00:52:06,650
if we could store data in some
other place-- in particular,

1204
00:52:06,650 --> 00:52:07,632
inside of a cache--

1205
00:52:07,632 --> 00:52:10,340
then we don't need to access the
database as often, because we've

1206
00:52:10,340 --> 00:52:12,330
got the information already stored.

1207
00:52:12,330 --> 00:52:14,960
And so one way to do this
is via client-side caching.

1208
00:52:14,960 --> 00:52:20,070
And so inside of the HTTP
headers, when an HTTP response

1209
00:52:20,070 --> 00:52:22,670
is sending back
information to a user, you

1210
00:52:22,670 --> 00:52:26,660
can add an HTTP header called
Cache-Control that basically

1211
00:52:26,660 --> 00:52:32,450
says for up to this number of seconds,
you can just store information

1212
00:52:32,450 --> 00:52:35,870
about this page and not
request it again if you try

1213
00:52:35,870 --> 00:52:37,674
and request the page for a second time.

1214
00:52:37,674 --> 00:52:40,340
And this helps to make sure that
if the browser tries to request

1215
00:52:40,340 --> 00:52:41,870
the page again, it doesn't need to.

1216
00:52:41,870 --> 00:52:45,350
It can just use the version
that's stored inside of the cache.
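
[Editor's sketch] As a rough illustration of the header described here, the snippet below builds a Cache-Control value with a max-age directive and shows the check a client could make before reusing its stored copy. The function names and the plain-dictionary representation of headers are assumptions made for this example; real browsers apply more rules than this.

```python
import re


def cache_control(max_age_seconds):
    """Response header telling the client how long it may reuse the page."""
    return {"Cache-Control": f"max-age={max_age_seconds}"}


def is_fresh(headers, stored_at, now):
    """True if a copy stored at `stored_at` may still be reused at time `now`.

    Both times are in seconds. A response without max-age is treated as
    not reusable, which is a simplification for this sketch.
    """
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match is None:
        return False
    return now - stored_at < int(match.group(1))


headers = cache_control(60)                  # reusable for up to 60 seconds
reuse_after_30s = is_fresh(headers, stored_at=1000, now=1030)
reuse_after_60s = is_fresh(headers, stored_at=1000, now=1060)
```

While the stored copy is younger than max-age, the client can skip the request entirely; once it is older, the client has to go back to the server.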

1217
00:52:45,350 --> 00:52:50,240
And a more recent development is this
idea of an ETag, or an entity tag.

1218
00:52:50,240 --> 00:52:53,990
And the idea here is that if we have
some web resource, some document,

1219
00:52:53,990 --> 00:52:57,200
some piece of data from a database
that our web application is sending out

1220
00:52:57,200 --> 00:53:01,790
to users, when I send users
that resource, that document,

1221
00:53:01,790 --> 00:53:06,230
I'll send that document, and
I'll also send an entity tag that

1222
00:53:06,230 --> 00:53:09,260
corresponds to that particular
version of the document

1223
00:53:09,260 --> 00:53:10,730
and send them both to the user.

1224
00:53:10,730 --> 00:53:12,240
And imagine this is a big document.

1225
00:53:12,240 --> 00:53:16,880
It's a lot of data, so it's expensive
to query and to send to the user.

1226
00:53:16,880 --> 00:53:21,230
The next time the user tries to
request this page, what the user can do

1227
00:53:21,230 --> 00:53:25,970
is the user can send the entity tag,
the ETag, along with their request.

1228
00:53:25,970 --> 00:53:29,750
I would like to request this
resource, and, oh, by the way,

1229
00:53:29,750 --> 00:53:32,840
I already have this version
of the entity stored

1230
00:53:32,840 --> 00:53:35,570
locally inside of my computer's cache.

1231
00:53:35,570 --> 00:53:38,130
And if the web application then
looks at that ETag and says,

1232
00:53:38,130 --> 00:53:39,171
all right, you know what?

1233
00:53:39,171 --> 00:53:41,300
That's the latest
version of the document.

1234
00:53:41,300 --> 00:53:44,120
The web application can just respond--

1235
00:53:44,120 --> 00:53:47,990
in particular, with an HTTP status
code of 304, meaning not modified,

1236
00:53:47,990 --> 00:53:49,190
to just say, you know what?

1237
00:53:49,190 --> 00:53:52,310
This entity tag is the
most recent entity tag.

1238
00:53:52,310 --> 00:53:54,650
Don't bother trying to
request the document again.

1239
00:53:54,650 --> 00:53:57,800
Just use the version you
saved locally in your cache.

1240
00:53:57,800 --> 00:54:00,200
And if, on the off chance,
the document's been updated

1241
00:54:00,200 --> 00:54:03,230
and therefore has a new ETag
value, then the web application

1242
00:54:03,230 --> 00:54:07,000
goes through the process of sending
that entire document back to the user.

1243
00:54:07,000 --> 00:54:09,335
But by taking advantage
of technologies like this,

1244
00:54:09,335 --> 00:54:11,450
this can allow us to
make sure that we're not

1245
00:54:11,450 --> 00:54:13,940
making too many requests
to the database,

1246
00:54:13,940 --> 00:54:19,950
that we don't make redundant requests
if a particular resource hasn't changed.
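
[Editor's sketch] The ETag exchange described above can be sketched as a single server-side function. Deriving the tag from a hash of the document is an assumption of this example (a server may compute ETags however it likes), and make_etag and respond are hypothetical names. In real HTTP, the client sends its stored tag back in the If-None-Match request header.

```python
import hashlib


def make_etag(body):
    """Derive a tag that changes whenever the document's bytes change."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body, if_none_match=None):
    """Return (status, headers, payload) for a request for `body`.

    `if_none_match` is the ETag the client says it already has cached.
    """
    etag = make_etag(body)
    if if_none_match == etag:
        # Client's copy is current: 304 Not Modified, no document sent.
        return 304, {"ETag": etag}, b""
    # First request, or the document changed: send the whole document.
    return 200, {"ETag": etag}, body


first = respond(b"version 1 of a big document")
cached_tag = first[1]["ETag"]
repeat = respond(b"version 1 of a big document", if_none_match=cached_tag)
updated = respond(b"version 2 of a big document", if_none_match=cached_tag)
```

The repeat request costs only a short header exchange, while the request made after the document changed gets the full new version along with a new tag.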

1247
00:54:19,950 --> 00:54:21,940
So caching can be done
on the client side.

1248
00:54:21,940 --> 00:54:24,190
Caching can also be done
on the server side, which

1249
00:54:24,190 --> 00:54:26,920
changes our diagram slightly
so as to look a little bit more

1250
00:54:26,920 --> 00:54:30,640
like this, whereby now, we've
got some more complications here.

1251
00:54:30,640 --> 00:54:32,830
We've got some load balancer
that's communicating

1252
00:54:32,830 --> 00:54:34,247
with a bunch of different servers.

1253
00:54:34,247 --> 00:54:36,580
All of those servers have to
interact with the database,

1254
00:54:36,580 --> 00:54:39,850
and maybe you've got multiple databases
going on here that are each able to do

1255
00:54:39,850 --> 00:54:42,340
reads and writes, either
in a single-primary model

1256
00:54:42,340 --> 00:54:43,840
or a multi-primary model.

1257
00:54:43,840 --> 00:54:47,530
And those servers also have access
to some cache that makes it easier

1258
00:54:47,530 --> 00:54:51,280
to access data quickly,
in a sense, saying,

1259
00:54:51,280 --> 00:54:53,470
if there's some
expensive database query,

1260
00:54:53,470 --> 00:54:56,530
don't bother performing the database
query again and again and again.

1261
00:54:56,530 --> 00:54:59,050
Take the results of that
database query once.

1262
00:54:59,050 --> 00:55:00,910
Save it inside of the cache.

1263
00:55:00,910 --> 00:55:03,220
And from then on, the server
can just look to the cache

1264
00:55:03,220 --> 00:55:06,830
and get information out of there.
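
[Editor's sketch] A minimal version of the server-side cache described here: run an expensive query once, save the result, and answer later requests from the saved copy. QueryCache and slow_query are hypothetical names, and this sketch deliberately ignores the question of when cached results should be discarded after the underlying data changes.

```python
class QueryCache:
    """Remembers query results so the database is asked only once per query."""

    def __init__(self, run_query):
        self._run_query = run_query  # function that actually hits the database
        self._results = {}           # maps query text -> saved result

    def get(self, sql):
        if sql not in self._results:
            # Cache miss: perform the expensive query and save its result.
            self._results[sql] = self._run_query(sql)
        # Cache hit (or freshly filled entry): answer from the cache.
        return self._results[sql]


database_calls = []


def slow_query(sql):
    """Stand-in for an expensive database query; records each real call."""
    database_calls.append(sql)
    return [("row", 1), ("row", 2)]


cache = QueryCache(slow_query)
first_result = cache.get("SELECT * FROM flights")
second_result = cache.get("SELECT * FROM flights")
```

Both calls return the same rows, but only the first one reaches the database. In a deployment with several servers, the cache would typically live in a shared service rather than inside one server's memory, which is what the diagram in the lecture suggests.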

1265
00:55:06,830 --> 00:55:09,680
So lot of security and
scalability concerns

1266
00:55:09,680 --> 00:55:12,740
that can potentially come about as
you begin web application development.

1267
00:55:12,740 --> 00:55:14,740
And so goal of today was
really just to give you

1268
00:55:14,740 --> 00:55:17,212
a sense for the types of
concerns to be aware of,

1269
00:55:17,212 --> 00:55:18,920
the types of things
to be thinking about,

1270
00:55:18,920 --> 00:55:20,480
and the types of issues
that will come about

1271
00:55:20,480 --> 00:55:23,810
if you decide to take a web application
and begin to have more and more people

1272
00:55:23,810 --> 00:55:26,090
actually start to use it.

1273
00:55:26,090 --> 00:55:28,730
So questions about that or
about any of the other topics

1274
00:55:28,730 --> 00:55:32,190
we've covered this week?

1275
00:55:32,190 --> 00:55:32,690
All right.

1276
00:55:32,690 --> 00:55:36,057
So with the remainder of this morning,
between now and about 12:30 or so,

1277
00:55:36,057 --> 00:55:38,390
we'll leave it open to more
project time, an opportunity

1278
00:55:38,390 --> 00:55:40,070
to work on any of the
projects you've worked on

1279
00:55:40,070 --> 00:55:42,800
so far over the course of this week
and also an opportunity to work

1280
00:55:42,800 --> 00:55:44,383
on something new if you would like to.

1281
00:55:44,383 --> 00:55:47,600
I know many of you yesterday decided
to start on new projects, projects

1282
00:55:47,600 --> 00:55:49,730
of your own choosing
built in React or Flask

1283
00:55:49,730 --> 00:55:51,980
or using JavaScript or any
of the other technologies

1284
00:55:51,980 --> 00:55:53,450
we've talked about this week.

1285
00:55:53,450 --> 00:55:56,750
Before we conclude, though, I do
have to say a couple of thank yous,

1286
00:55:56,750 --> 00:55:59,850
first to David for helping to advise
the class, to the teaching fellows--

1287
00:55:59,850 --> 00:56:01,740
Josh and Christian
and Athena and Julia--

1288
00:56:01,740 --> 00:56:03,470
for being excellent in
helping to answer questions

1289
00:56:03,470 --> 00:56:06,380
and helping to make sure that the
course can run smoothly, to Andrew up

1290
00:56:06,380 --> 00:56:09,020
in the back, who's been taking care
of the production side of everything

1291
00:56:09,020 --> 00:56:11,780
over the course of this week, making
sure that all the lectures are recorded

1292
00:56:11,780 --> 00:56:14,240
and making sure they're posted
online, such that afterwards, you,

1293
00:56:14,240 --> 00:56:15,890
when you're here or
when you're not here,

1294
00:56:15,890 --> 00:56:17,370
are able to come online to see them.

1295
00:56:17,370 --> 00:56:19,860
So thank you to everyone for
helping to make the course possible.

1296
00:56:19,860 --> 00:56:21,500
Thank you to all of you
for coming to the course.

1297
00:56:21,500 --> 00:56:22,333
Hope you enjoyed it.

1298
00:56:22,333 --> 00:56:23,810
Hope you got things out of it.

1299
00:56:23,810 --> 00:56:25,340
We've really only scratched
the surface, though,

1300
00:56:25,340 --> 00:56:27,020
of a lot of the topics
that we've covered

1301
00:56:27,020 --> 00:56:28,394
over the course of the past week.

1302
00:56:28,394 --> 00:56:32,720
There's a lot more to CSS and HTML
and JavaScript and Flask and Python

1303
00:56:32,720 --> 00:56:35,870
and React than we were really able to
touch on over the course of the week.

1304
00:56:35,870 --> 00:56:37,870
It was really meant to
be more of an opportunity

1305
00:56:37,870 --> 00:56:40,820
to give you some exposure to some
of the fundamentals of these ideas,

1306
00:56:40,820 --> 00:56:43,236
some of the tools and the
concepts that you can ultimately

1307
00:56:43,236 --> 00:56:45,930
use as you begin to design
web applications of your own.

1308
00:56:45,930 --> 00:56:47,660
So I do hope that you've learned
something from the week but,

1309
00:56:47,660 --> 00:56:50,576
in particular, that you found things
that are interesting to you, such

1310
00:56:50,576 --> 00:56:52,880
that you continue to take
those ideas and explore them.

1311
00:56:52,880 --> 00:56:55,940
Go beyond just what we've been able
to cover over the course of this week

1312
00:56:55,940 --> 00:57:00,290
and explore what else these technologies
and these tools and these ideas

1313
00:57:00,290 --> 00:57:01,465
ultimately have to offer.

1314
00:57:01,465 --> 00:57:02,340
So thank you so much.

1315
00:57:02,340 --> 00:57:04,655
We'll stick around until 12:30
to help with project time.

1316
00:57:04,655 --> 00:57:05,155
[APPLAUSE]

1317
00:57:05,155 --> 00:57:08,020
But this was CS50 Beyond.

1318
00:57:08,020 --> 00:57:10,050