WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:17.213 --> 00:00:20.380
DOUG LLOYD: Now that we know a bit more
about the internet and how it works,

00:00:20.380 --> 00:00:23.200
let's reintroduce the subject of
security with this new context.

00:00:23.200 --> 00:00:26.100
And let's start by talking
about Git and GitHub.

00:00:26.100 --> 00:00:28.540
Recall that Git and GitHub
are technologies that

00:00:28.540 --> 00:00:31.990
are used by programmers
to version control

00:00:31.990 --> 00:00:34.690
their software, which basically
allows them the ability

00:00:34.690 --> 00:00:39.010
to save code to an internet-based
repository in case of some failure

00:00:39.010 --> 00:00:41.830
locally, they have a backup
place to put it, but also

00:00:41.830 --> 00:00:43.750
keep track of all the
changes they've made

00:00:43.750 --> 00:00:46.120
and possibly go back in
time in case they produce

00:00:46.120 --> 00:00:48.460
a version of code that is broken.

00:00:48.460 --> 00:00:50.440
GitHub has some great
advantages, but it also

00:00:50.440 --> 00:00:53.110
has potential disadvantages
because of this structure

00:00:53.110 --> 00:00:54.590
of being able to go back in time.

00:00:54.590 --> 00:00:58.180
So for example, imagine that what we
have is an initial commit, and commit

00:00:58.180 --> 00:01:01.828
is just GitHub parlance
for a set of code

00:01:01.828 --> 00:01:03.370
that you are sending to the internet.

00:01:03.370 --> 00:01:07.720
So I've decided to take file A, file B,
and file C in their current versions.

00:01:07.720 --> 00:01:12.190
I've saved them using control S or
command S literally on my machine,

00:01:12.190 --> 00:01:14.800
and I want to send those
versions to GitHub to be

00:01:14.800 --> 00:01:17.410
stored permanently or semi-permanently.

00:01:17.410 --> 00:01:19.900
You would package those up
in what's called a commit

00:01:19.900 --> 00:01:23.560
and then push that code to GitHub
where it would then be visible online.

00:01:23.560 --> 00:01:25.270
And this would be packaged as a commit.

00:01:25.270 --> 00:01:29.860
And all the files that we view on
GitHub are tracked in terms of commits.

00:01:29.860 --> 00:01:31.450
And commits chain together.

00:01:31.450 --> 00:01:34.210
And we've seen this idea of
chaining in the past when we've

00:01:34.210 --> 00:01:36.600
discussed linked lists, for example.

00:01:36.600 --> 00:01:39.100
So every commit knows about the
one that came before it,

00:01:39.100 --> 00:01:43.810
and by following that chain we can reach
all of the ones that preceded it.

00:01:43.810 --> 00:01:47.110
So imagine we have an initial
commit where we post some code

00:01:47.110 --> 00:01:49.870
and then we write some more--
we make some more changes.

00:01:49.870 --> 00:01:52.510
We perhaps update our
database in such a way

00:01:52.510 --> 00:01:57.790
where when we post or push-- excuse
me-- our second commit to GitHub,

00:01:57.790 --> 00:02:00.460
we accidentally expose
the database credentials.

00:02:00.460 --> 00:02:03.250
So perhaps someone
inadvertently typed the password

00:02:03.250 --> 00:02:06.760
for how to access the database into
some Python code that would then

00:02:06.760 --> 00:02:09.639
be used to access that database.

00:02:09.639 --> 00:02:10.930
That's not a good thing.

00:02:10.930 --> 00:02:13.833
And maybe somebody quickly realized
it and said, you know what?

00:02:13.833 --> 00:02:15.250
We need to get this off of GitHub.

00:02:15.250 --> 00:02:16.570
It is a source repository.

00:02:16.570 --> 00:02:17.920
It's available online.

00:02:17.920 --> 00:02:22.390
And so they push a third commit to
GitHub that deletes those credentials.

00:02:22.390 --> 00:02:26.740
It stores them somewhere else that's not
going to be saved on this repository.

00:02:26.740 --> 00:02:29.977
But have we actually solved the problem?

00:02:29.977 --> 00:02:31.810
And you can probably
imagine that the answer

00:02:31.810 --> 00:02:34.930
is no, because we have this
idea of version control

00:02:34.930 --> 00:02:39.700
where every past iteration
of all of these files

00:02:39.700 --> 00:02:43.840
is stored still on GitHub such that, if
I needed to, I could go back in time.

00:02:43.840 --> 00:02:48.220
So even though I attempted to
solve the security crisis I just

00:02:48.220 --> 00:02:52.360
created for myself by
introducing a new commit that

00:02:52.360 --> 00:02:54.520
removes the credentials
from those files such that,

00:02:54.520 --> 00:02:57.070
if I'm looking just at the most
recent version of the files,

00:02:57.070 --> 00:02:58.147
I don't see it anymore.

00:02:58.147 --> 00:02:59.980
I still have the ability
to go back in time,

00:02:59.980 --> 00:03:03.790
so this doesn't actually
solve a problem.
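
NOTE
A minimal sketch, not from the lecture, of why a later commit does not
erase an earlier leak. It assumes each commit keeps a reference to the
commit before it, much like a linked list. The Commit class, the file
name db.py, and the password value are all hypothetical.
```python
from dataclasses import dataclass
from typing import Optional
@dataclass(frozen=True)
class Commit:
    message: str
    files: dict  # file name mapped to that file's contents at this commit
    parent: Optional["Commit"] = None
def history(tip):
    """Walk from the newest commit back to the first by following parent links."""
    node = tip
    while node is not None:
        yield node
        node = node.parent
# Hypothetical three-commit history matching the scenario described above.
first = Commit("initial commit", {"db.py": "connect()"})
leak = Commit("update database", {"db.py": "connect(password='hunter2')"}, first)
fix = Commit("remove credentials", {"db.py": "connect(password=load_secret())"}, leak)
# The newest version looks clean, but an older commit still holds the password.
tip_is_clean = "hunter2" not in fix.files["db.py"]
still_in_history = any("hunter2" in c.files["db.py"] for c in history(fix))
print(tip_is_clean, still_in_history)
```
Both values come out true: the latest files are clean, yet walking the
chain backward still finds the password.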

00:03:03.790 --> 00:03:05.800
See, one of the interesting
things about GitHub

00:03:05.800 --> 00:03:08.230
is the model that is used for it.

00:03:08.230 --> 00:03:10.120
At the very beginning
of GitHub's existence,

00:03:10.120 --> 00:03:14.260
it relied pretty extensively on
this idea of you sign up for free,

00:03:14.260 --> 00:03:16.030
you get a free account
for GitHub, and you

00:03:16.030 --> 00:03:20.170
have a limited number of private
repositories, repositories that are not

00:03:20.170 --> 00:03:24.250
publicly viewable or searchable, and
you could pay to have more of them

00:03:24.250 --> 00:03:25.930
if you wanted to.

00:03:25.930 --> 00:03:29.650
But the majority of your
repositories, assuming

00:03:29.650 --> 00:03:33.610
you did not opt into a paid
account, were public, which

00:03:33.610 --> 00:03:37.720
meant anybody on the internet could
search them using GitHub's search tool,

00:03:37.720 --> 00:03:40.600
or using even a regular
search engine such as Google,

00:03:40.600 --> 00:03:42.790
could just look for something.

00:03:42.790 --> 00:03:46.990
And if your GitHub repositories happen
to match what that person searched

00:03:46.990 --> 00:03:49.660
or specifically, if you're looking
within GitHub's search feature,

00:03:49.660 --> 00:03:52.620
if a user is looking for
specific lines of code,

00:03:52.620 --> 00:03:56.138
anything in a public
repository, it is available.

00:03:56.138 --> 00:03:58.180
Now, GitHub has recently
changed to a model where

00:03:58.180 --> 00:04:01.720
there are more private repo--
or there's a higher limit

00:04:01.720 --> 00:04:04.840
on the number of private repositories
that somebody could have.

00:04:04.840 --> 00:04:10.090
But this was part of GitHub's
design to really encourage

00:04:10.090 --> 00:04:13.780
developers and programmers to sort of
create this open source community where

00:04:13.780 --> 00:04:18.310
anybody could view someone else's
code, and in GitHub parlance,

00:04:18.310 --> 00:04:21.670
fork their code, which basically
means to take their entire repository

00:04:21.670 --> 00:04:26.830
or collection of files and copy it
into their own GitHub repository

00:04:26.830 --> 00:04:29.760
to perhaps make changes
or suggest changes,

00:04:29.760 --> 00:04:33.040
pushing those back into the
code base with the idea being

00:04:33.040 --> 00:04:35.810
that it would make the
entire community better.

00:04:35.810 --> 00:04:38.680
A side effect, of
course, is that items get

00:04:38.680 --> 00:04:43.360
revealed when we do so because of this
public repository setup we have here.

00:04:43.360 --> 00:04:47.200
So GitHub is great in terms
of its ability for programmers

00:04:47.200 --> 00:04:49.930
to refer to materials on the internet.

00:04:49.930 --> 00:04:52.750
They don't have to rely on their
own local machines to store code.

00:04:52.750 --> 00:04:57.070
It allows people to work
from multiple workstations,

00:04:57.070 --> 00:04:59.590
similar to how Dropbox or
Google Drive, for example,

00:04:59.590 --> 00:05:02.470
might allow you to access
files from different machines.

00:05:02.470 --> 00:05:04.970
You don't have to be on a
specific machine to access a file,

00:05:04.970 --> 00:05:08.500
as we used to have to do before
these cloud-based document storage

00:05:08.500 --> 00:05:10.060
services existed.

00:05:10.060 --> 00:05:12.310
And it encourages collaboration.

00:05:12.310 --> 00:05:16.390
For example, if you and I were to
collaborate on a GitHub repository,

00:05:16.390 --> 00:05:20.000
I could push changes to that
repository that you could then pull.

00:05:20.000 --> 00:05:22.750
And we could then be working
off of the same code base again.

00:05:22.750 --> 00:05:25.690
We sort of have this central repo--

00:05:25.690 --> 00:05:28.630
central area where we share
our code with one another.

00:05:28.630 --> 00:05:30.580
And we can each
individually make changes

00:05:30.580 --> 00:05:33.520
and incorporate one another's
changes into the final products.

00:05:33.520 --> 00:05:38.110
So we're always working off
of the same base of material.

00:05:38.110 --> 00:05:40.210
The side effect, though,
again, is this material

00:05:40.210 --> 00:05:44.260
is generally public unless you have
opted into a private repository where

00:05:44.260 --> 00:05:46.450
you have specific
individuals who are logged

00:05:46.450 --> 00:05:49.990
in with their GitHub
accounts who want to share.

00:05:49.990 --> 00:05:52.420
So is there a way to solve
this problem, though, of we

00:05:52.420 --> 00:05:55.087
accidentally expose our
credentials in a public repository?

00:05:55.087 --> 00:05:56.920
Of course, if we're in
a private repository,

00:05:56.920 --> 00:05:58.220
this might not be as alarming.

00:05:58.220 --> 00:05:59.920
It's still probably not something you--

00:05:59.920 --> 00:06:03.130
it shouldn't be encouraged
to have credentials

00:06:03.130 --> 00:06:07.480
for anything stored anywhere, whether
public or private, on the internet.

00:06:07.480 --> 00:06:08.830
It's a little riskier.

00:06:08.830 --> 00:06:12.402
But is there a way to get rid of this or
to prevent this problem from happening?

00:06:12.402 --> 00:06:14.860
And fortunately, there are a
number of different safeguards

00:06:14.860 --> 00:06:17.680
specific to Git and
GitHub that we can use

00:06:17.680 --> 00:06:22.240
to prevent the accidental leakage
of information, so to speak.

00:06:22.240 --> 00:06:25.330
So for example, one way we can handle
this is using a program or utility

00:06:25.330 --> 00:06:27.340
called git-secrets.

00:06:27.340 --> 00:06:31.000
git-secrets works by looking for
what's called a regular expression.

00:06:31.000 --> 00:06:33.640
And a regular expression is
computer science parlance

00:06:33.640 --> 00:06:37.600
for a particular formation of
a string, so a certain number

00:06:37.600 --> 00:06:41.360
of characters, a certain number of
digit characters, maybe some punctuation

00:06:41.360 --> 00:06:41.860
marks.

00:06:41.860 --> 00:06:46.360
You can say, I'm looking for
strings that match this idea.

00:06:46.360 --> 00:06:49.630
And you can express this idea
where this idea is all capital

00:06:49.630 --> 00:06:52.900
letters, all lowercase letters, this
many numbers, and this many punctuation

00:06:52.900 --> 00:06:55.750
marks, and so on using this tool
called a regular expression.

00:06:55.750 --> 00:06:59.410
But git-secrets contains a list
of these regular expressions

00:06:59.410 --> 00:07:02.710
and will warn you when you are
about to make a commit, when you're

00:07:02.710 --> 00:07:05.650
about to push code or send
code to GitHub to be stored

00:07:05.650 --> 00:07:10.030
in its online repository that you have
a string that matches this pattern

00:07:10.030 --> 00:07:11.950
that you wanted me to warn you about.

00:07:11.950 --> 00:07:15.190
And so be sure before
you commit this code

00:07:15.190 --> 00:07:19.600
and push this code that you
actually intend to send this up

00:07:19.600 --> 00:07:23.380
to GitHub, because it may be that this
matches a password string that you're

00:07:23.380 --> 00:07:24.560
trying to avoid.
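
NOTE
A rough sketch of the idea just described: check text against a list of
regular expressions before it is committed. This is not the tool's own
code. The two patterns, the find_secrets helper, and the sample text are
assumptions made up for illustration.
```python
import re
# Hypothetical patterns for illustration; a real tool ships its own list.
SECRET_PATTERNS = [
    re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
    re.compile(r"\b[A-Z0-9]{20}\b"),  # 20 capital letters or digits in a row
]
def find_secrets(text):
    """Return (line_number, line) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.strip()))
    return hits
staged = "import db\nconn = db.connect(password='hunter2')\nprint('ready')\n"
for number, line in find_secrets(staged):
    print(f"warning: possible secret on line {number}: {line}")
```
Only the second line of the sample is flagged, because it is the only
one that matches a pattern in the list.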

00:07:24.560 --> 00:07:27.580
So that's an interesting tool
that can be used for that.

00:07:27.580 --> 00:07:31.150
You also want to consider
limiting third party app access.

00:07:31.150 --> 00:07:35.930
GitHub accounts are actually very
common to use as other forms of login,

00:07:35.930 --> 00:07:36.770
for example.

00:07:36.770 --> 00:07:39.190
So there's a standard
on the internet called

00:07:39.190 --> 00:07:42.190
OAuth which allows you to use,
for example, your Facebook

00:07:42.190 --> 00:07:44.977
account or your Google account
to log into other services.

00:07:44.977 --> 00:07:47.560
Perhaps you've encountered this
in your own experience working

00:07:47.560 --> 00:07:49.510
with different services on the internet.

00:07:49.510 --> 00:07:54.010
Instead of creating a login for site x,
you could use your Facebook or Google

00:07:54.010 --> 00:07:58.150
login, or, in many instances as
well, your GitHub login to do so.

00:07:58.150 --> 00:08:01.610
When you do so, though, you are
allowing that third party application,

00:08:01.610 --> 00:08:07.090
someone that's not GitHub, the ability
to use and access your GitHub identity

00:08:07.090 --> 00:08:08.120
or credential.

00:08:08.120 --> 00:08:12.640
And so you should be very careful with
not only GitHub but other services

00:08:12.640 --> 00:08:17.560
as well, thinking about whether you
want that other service to have access

00:08:17.560 --> 00:08:21.940
to your GitHub, or Facebook, or Google
account information to use it even just

00:08:21.940 --> 00:08:23.380
for authentication.

00:08:23.380 --> 00:08:26.320
It's a good idea to try and
limit how much third party app

00:08:26.320 --> 00:08:30.340
access you're giving to other services.

00:08:30.340 --> 00:08:33.520
Another tool is to use
something called a commit hook.

00:08:33.520 --> 00:08:36.460
Now, a commit hook is just a
fancy term for a short program

00:08:36.460 --> 00:08:42.070
or set of instructions that executes
when a commit is pushed to GitHub.

00:08:42.070 --> 00:08:44.740
So for example, many
of the course websites

00:08:44.740 --> 00:08:48.490
that we use here at Harvard
for CS50 are GitHub-based,

00:08:48.490 --> 00:08:52.030
which means that when we want to change
the content on the course website,

00:08:52.030 --> 00:08:56.350
we update some HTML, or Python,
or JavaScript files, we push those

00:08:56.350 --> 00:09:01.000
to GitHub, and that triggers a commit
hook where basically that commit

00:09:01.000 --> 00:09:04.570
hook copies those files
into our web server,

00:09:04.570 --> 00:09:07.420
runs some tests on them to make
sure that there are no errors in them.

00:09:07.420 --> 00:09:10.390
For example, if we wrote some
JavaScript or Python that was breaking,

00:09:10.390 --> 00:09:15.250
it had a bug in it, we'd rather
not deploy that bug so to speak.

00:09:15.250 --> 00:09:17.710
We wouldn't want the
broken version of the code

00:09:17.710 --> 00:09:21.190
to replace the currently
working website.

00:09:21.190 --> 00:09:23.750
And so a commit hook can be
used to do testing as well.

00:09:23.750 --> 00:09:26.170
And then once all the
tests pass, we then

00:09:26.170 --> 00:09:28.300
are able to activate those
files on the web server

00:09:28.300 --> 00:09:29.890
and the changes have happened.

00:09:29.890 --> 00:09:32.530
So we're using GitHub
to store the changes

00:09:32.530 --> 00:09:35.650
that we want to make on our
site, the HTML, the Python,

00:09:35.650 --> 00:09:37.870
the JavaScript changes
that we want to make.

00:09:37.870 --> 00:09:41.650
And then we're using this commit
hook, a set of instructions,

00:09:41.650 --> 00:09:45.340
to copy them over and actually
deploy those changes to the website

00:09:45.340 --> 00:09:48.430
once we've verified that we
haven't made anything break.

00:09:48.430 --> 00:09:52.210
You can also use commit hooks, for
example, to check for passwords

00:09:52.210 --> 00:09:56.830
and have it warn you if you have
perhaps leaked a credential.

00:09:56.830 --> 00:10:00.040
And then you can undo
that with a technique

00:10:00.040 --> 00:10:02.480
that we'll see in just a moment.

00:10:02.480 --> 00:10:06.250
Another thing that you can do when
using GitHub to protect or verify

00:10:06.250 --> 00:10:09.180
your identity is to use an SSH key.

00:10:09.180 --> 00:10:12.653
SSH keys are a special form
of a public and private key.

00:10:12.653 --> 00:10:15.070
In this case, it's really not
used for encryption, though.

00:10:15.070 --> 00:10:17.535
It's actually used as identification.

00:10:17.535 --> 00:10:19.410
And so this idea of
digital signatures, which

00:10:19.410 --> 00:10:22.860
you may recall from a few lectures
ago, comes back into play.

00:10:22.860 --> 00:10:27.600
Whenever I use an SSH key to push
my code to GitHub, what happens

00:10:27.600 --> 00:10:33.150
is I also digitally sign the
commit when I send it up.

00:10:33.150 --> 00:10:36.870
And so before that commit
gets posted to GitHub,

00:10:36.870 --> 00:10:40.200
GitHub verifies this by
checking my public key

00:10:40.200 --> 00:10:43.230
and verifying, using the mathematics
that we've seen in the past,

00:10:43.230 --> 00:10:46.650
that, yes, only Doug
could have sent this to me

00:10:46.650 --> 00:10:53.160
because only Doug's public key will
unscramble this set of zeros and ones

00:10:53.160 --> 00:10:57.180
that I received that only could have
then been created by his private key.

00:10:57.180 --> 00:10:59.550
These two things are
reciprocal of one another.
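
NOTE
A toy sketch of the sign-then-verify idea, assuming textbook RSA with
tiny primes so the arithmetic stays readable. It is not how GitHub or
SSH implement signatures, and keys this small offer no security. The
names N, PUBLIC_E, PRIVATE_D, sign, and verify are illustrative.
```python
import hashlib
# Toy key pair built from tiny primes (61 and 53); real keys are vastly larger.
N = 3233          # public modulus, 61 * 53
PUBLIC_E = 17     # public exponent, shared with everyone
PRIVATE_D = 2753  # private exponent, kept secret by the signer
def digest(message):
    """Shrink a message to a number below N using SHA-256."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N
def sign(message):
    """Only the holder of PRIVATE_D can produce this value."""
    return pow(digest(message), PRIVATE_D, N)
def verify(message, signature):
    """Anyone with the public key can check the signature."""
    return pow(signature, PUBLIC_E, N) == digest(message)
commit = b"third commit: remove credentials"
signature = sign(commit)
print(verify(commit, signature))            # the genuine signature
print(verify(commit, (signature + 1) % N))  # an altered signature
```
The genuine signature verifies and the altered one does not, which is
the reciprocal relationship between the two keys.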

00:10:59.550 --> 00:11:01.980
So we can use SSH keys
and digital signatures

00:11:01.980 --> 00:11:05.850
as an identity verification
scheme as well for GitHub

00:11:05.850 --> 00:11:08.430
as we might be able to for
mailing documents, or sending

00:11:08.430 --> 00:11:11.160
documents, or something like that.

00:11:11.160 --> 00:11:15.300
Now, imagine we have posted
the credentials accidentally.

00:11:15.300 --> 00:11:17.130
Is there a way to get rid of them?

00:11:17.130 --> 00:11:18.930
GitHub does track our entire history.

00:11:18.930 --> 00:11:20.430
But what if we do make a mistake?

00:11:20.430 --> 00:11:22.410
Human beings are fallible.

00:11:22.410 --> 00:11:25.980
And so there is a way to
actually eliminate the history.

00:11:25.980 --> 00:11:29.697
And that is using a
command called Git Rebase.

00:11:29.697 --> 00:11:32.280
So let's go back to the illustration
we had a moment ago where

00:11:32.280 --> 00:11:34.250
we have several different commits.

00:11:34.250 --> 00:11:37.210
And I've added a fourth commit here
just for purposes of illustration.

00:11:37.210 --> 00:11:38.960
So our first commit
and our second commit,

00:11:38.960 --> 00:11:42.180
and then it's after that that we
expose the credentials accidentally,

00:11:42.180 --> 00:11:47.010
and then we have a fourth commit where
we actually delete that mistake that we

00:11:47.010 --> 00:11:48.300
had previously made.

00:11:48.300 --> 00:11:51.810
When we want to Git
Rebase, the idea is we want

00:11:51.810 --> 00:11:54.370
to delete a portion of the history.

00:11:54.370 --> 00:11:56.120
Now, deleting a portion
of the history has

00:11:56.120 --> 00:11:59.075
a side effect of any changes
that I made here or here.

00:11:59.075 --> 00:12:01.950
In this illustration, we're going
to get rid of the last two commits.

00:12:01.950 --> 00:12:05.460
Any changes that I've made besides
accidentally exposing the credentials

00:12:05.460 --> 00:12:07.170
are also going to be destroyed.

00:12:07.170 --> 00:12:11.220
And so it's going to be incumbent
on us to make sure to copy and save

00:12:11.220 --> 00:12:15.150
the changes we actually want to preserve
in case we've done more than just

00:12:15.150 --> 00:12:16.530
expose the credentials.

00:12:16.530 --> 00:12:19.170
And then we'll have to make a
new commit in this new history

00:12:19.170 --> 00:12:23.100
we create so that we can still preserve
those changes that we want to make.

00:12:23.100 --> 00:12:25.620
But let's say, other
than the credentials,

00:12:25.620 --> 00:12:27.900
I didn't actually do anything else.

00:12:27.900 --> 00:12:33.330
One thing I could do is rebase or
set as a new start point, basically,

00:12:33.330 --> 00:12:36.190
this second commit as
the end of the chain.

00:12:36.190 --> 00:12:40.590
So instead of going all the way to here
and having that preserved ad infinitum,

00:12:40.590 --> 00:12:44.430
I want to just get rid of everything
from the second commit forward.

00:12:44.430 --> 00:12:45.300
And I can do that.

00:12:45.300 --> 00:12:49.110
And then those commits are no
longer remembered by GitHub.

00:12:49.110 --> 00:12:52.110
And as soon as the next
commit I have would go here,

00:12:52.110 --> 00:12:56.760
right after second commit as opposed
to imagining a fifth one there

00:12:56.760 --> 00:12:59.580
right after credentials
being removed, those commits

00:12:59.580 --> 00:13:03.570
are, for all intents and
purposes on GitHub, forgotten.
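
NOTE
A conceptual sketch of the rewind just described, assuming a branch is
simply a name that points at the newest commit. It models the idea only
and does not reproduce what the rebase command does internally. The
Commit class and log helper are hypothetical.
```python
class Commit:
    def __init__(self, message, parent=None):
        self.message = message
        self.parent = parent
def log(tip):
    """List commit messages from the first commit up to the tip."""
    messages = []
    while tip is not None:
        messages.append(tip.message)
        tip = tip.parent
    return messages[::-1]
first = Commit("first commit")
second = Commit("second commit", first)
leak = Commit("credentials exposed", second)
removal = Commit("credentials removed", leak)
branch = removal  # the branch name points at the newest commit
branch = second   # rewind: the branch now ends at the second commit
branch = Commit("next commit", branch)
print(log(branch))
```
After the rewind, the new commit attaches to the second commit, and the
two credential commits can no longer be reached from the branch.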

00:13:03.570 --> 00:13:06.330
And finally, one more thing
that we can do when using GitHub

00:13:06.330 --> 00:13:09.420
is to mandate the use of
two-factor authentication.

00:13:09.420 --> 00:13:12.810
Recall we've discussed two-factor
authentication a little bit previously.

00:13:12.810 --> 00:13:16.890
And the idea is that you
have a backup mechanism

00:13:16.890 --> 00:13:19.650
to prevent unauthorized login.

00:13:19.650 --> 00:13:21.720
And the two factors in
two-factor authentication

00:13:21.720 --> 00:13:26.520
are not two passwords, because those
are fundamentally quite similar.

00:13:26.520 --> 00:13:29.850
The idea is that you want to have
something that you know, for example,

00:13:29.850 --> 00:13:33.150
a password-- that's usually very
commonly one of the two factors

00:13:33.150 --> 00:13:35.220
in two-factor authentication--

00:13:35.220 --> 00:13:37.590
and something that you
have, the thought being

00:13:37.590 --> 00:13:42.900
that an adversary is incredibly unlikely
to have both things at the same time.

00:13:42.900 --> 00:13:45.120
They may know your
password, but they probably

00:13:45.120 --> 00:13:49.320
don't have your cell phone,
for example, or your RSA key.

00:13:49.320 --> 00:13:54.360
They may have stolen your phone or
they may have stolen your RSA key,

00:13:54.360 --> 00:13:57.390
but they probably don't
also know your password.

00:13:57.390 --> 00:14:00.690
And so the idea is that this provides
an additional level of defense

00:14:00.690 --> 00:14:04.080
against potential hacking,
or breaking into accounts,

00:14:04.080 --> 00:14:06.660
or unauthorized behavior in
accounts that you obviously

00:14:06.660 --> 00:14:08.190
don't want to happen.

00:14:08.190 --> 00:14:11.562
Now, an RSA key, if you're unfamiliar,
is something that looks like this.

00:14:11.562 --> 00:14:13.020
There's different versions of them.

00:14:13.020 --> 00:14:14.437
They've sort of evolved over time.

00:14:14.437 --> 00:14:18.660
This one is actually a
combined RSA key and USB drive.

00:14:18.660 --> 00:14:22.020
And inside the window
here of the RSA key

00:14:22.020 --> 00:14:26.010
is a six digit number that just
changes every 60 seconds or so.

00:14:26.010 --> 00:14:28.900
So when you are given one
of these, for example,

00:14:28.900 --> 00:14:32.310
perhaps at a firm or a business,
it is assigned to you specifically.

00:14:32.310 --> 00:14:35.530
There's a server that
your IT team will have

00:14:35.530 --> 00:14:39.960
set up that maps the serial number
on the back of this RSA key

00:14:39.960 --> 00:14:42.120
to your employee ID, for example.

00:14:42.120 --> 00:14:47.010
But they otherwise don't know what the
number currently on the RSA key is.

00:14:47.010 --> 00:14:51.840
They only know who owns it, who is
physically in possession of it, which

00:14:51.840 --> 00:14:53.210
employee ID it maps to.

00:14:53.210 --> 00:14:54.990
And every 60 seconds
it changes according

00:14:54.990 --> 00:14:59.430
to some mathematical algorithm that
is built into the key that generates

00:14:59.430 --> 00:15:02.190
numbers in a pseudo random way.

00:15:02.190 --> 00:15:05.490
And after 60 seconds, that code
will change into something else.

00:15:05.490 --> 00:15:10.130
And you'll need to actually have
the key on you to complete a login.
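
NOTE
A sketch of one common way to produce a code that changes on a timer,
in the style of time-based one-time passwords. It assumes the device
holds a secret seed and derives the code from the current 60-second
window. The lecture does not say which algorithm this particular key
uses, so the seed and the six_digit_code function are illustrative.
```python
import hashlib
import hmac
import struct
def six_digit_code(seed, unix_time, step=60):
    """Derive a six-digit code from a secret seed and the current time window."""
    window = int(unix_time // step)  # stays constant for `step` seconds
    mac = hmac.new(seed, struct.pack(">Q", window), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F          # choose which 4 bytes of the hash to use
    number = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return f"{number % 1_000_000:06d}"
seed = b"hypothetical-device-seed"
print(six_digit_code(seed, 120), six_digit_code(seed, 179))  # same window
print(six_digit_code(seed, 180))                             # next window
```
Two times inside the same 60-second window give the same code, and the
code is recomputed once the window rolls over.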

00:15:10.130 --> 00:15:12.810
If an RSA key is being
used to secure a login such

00:15:12.810 --> 00:15:15.483
that you need to enter a
password and your RSA key value,

00:15:15.483 --> 00:15:16.650
you would need to have both.

00:15:16.650 --> 00:15:19.872
No other employee's RSA key--
well, hypothetically, I

00:15:19.872 --> 00:15:21.830
guess there's a one in
a million chance that it

00:15:21.830 --> 00:15:24.705
would happen to be randomly showing
the same number at the same time.

00:15:24.705 --> 00:15:28.100
But no other employee's RSA
key could be used to log in.

00:15:28.100 --> 00:15:30.690
Only yours could be used to log in.

00:15:30.690 --> 00:15:32.690
Now, there are several
different tools out there

00:15:32.690 --> 00:15:35.810
that can be used to provide
two-factor authentication services.

00:15:35.810 --> 00:15:39.628
And there's really no technical
reason not to use these services.

00:15:39.628 --> 00:15:42.170
You'll find them as applications
on cell phones, most likely.

00:15:42.170 --> 00:15:46.310
And you'll find ones like this, Google
Authenticator, Authy, Duo Mobile.

00:15:46.310 --> 00:15:47.360
There are lots of others.

00:15:47.360 --> 00:15:50.390
And if you don't want to use one
of those applications specifically,

00:15:50.390 --> 00:15:53.210
many services also just allow
you to receive a text message

00:15:53.210 --> 00:15:54.902
from the service itself.

00:15:54.902 --> 00:15:56.860
And you'll just get that
via SMS on your phone,

00:15:56.860 --> 00:16:00.470
so still on your phone, just not
tied to a specific application.

00:16:00.470 --> 00:16:05.690
And while there's no technical reason
to avoid two-factor authentication,

00:16:05.690 --> 00:16:08.600
there is sort of this
social friction surrounding

00:16:08.600 --> 00:16:13.580
two-factor authentication in that human
beings tend to find it annoying, right?

00:16:13.580 --> 00:16:15.860
It used to be username,
password, you're logged in.

00:16:15.860 --> 00:16:16.920
It's pretty quick.

00:16:16.920 --> 00:16:19.630
Now it's username, password, you
get brought to another screen,

00:16:19.630 --> 00:16:22.880
you're asked to enter a six-digit code,
or maybe in some advanced applications

00:16:22.880 --> 00:16:26.390
you get a push notification sent to
your device that you have to unlock

00:16:26.390 --> 00:16:28.970
and then hit OK on the device.

00:16:28.970 --> 00:16:31.280
And people just find that inconvenient.

00:16:31.280 --> 00:16:34.400
We haven't yet reached
this point culturally

00:16:34.400 --> 00:16:39.440
where two-factor
authentication is the norm.

00:16:39.440 --> 00:16:43.610
And so it's sort of a linchpin
when we talk about security

00:16:43.610 --> 00:16:49.400
in the internet context, is human
beings being the limiting factor

00:16:49.400 --> 00:16:51.980
for how secure we can be.

00:16:51.980 --> 00:16:56.810
We have the technology to take
steps to protect ourselves,

00:16:56.810 --> 00:16:59.360
but we don't feel compelled to do so.

00:16:59.360 --> 00:17:03.260
And we'll see this pattern reemerge
in a few other places today.

00:17:03.260 --> 00:17:06.315
But just know that that
is why perhaps you're

00:17:06.315 --> 00:17:08.690
not seeing so much adoption
of two-factor authentication.

00:17:08.690 --> 00:17:11.480
It's not that it's technically
infeasible to do so.

00:17:11.480 --> 00:17:14.900
It's just that we just
find it annoying to do so,

00:17:14.900 --> 00:17:19.401
and so we don't adopt it as
aggressively as perhaps we should.

00:17:19.401 --> 00:17:21.109
Now let's discuss the
type of attack that

00:17:21.109 --> 00:17:24.109
occurs on the internet with
unfortunate regularity,

00:17:24.109 --> 00:17:27.270
and that is the idea of a
denial of service attack.

00:17:27.270 --> 00:17:29.450
Now, the idea behind
these attacks is basically

00:17:29.450 --> 00:17:32.000
to cripple the
infrastructure of a website.

00:17:32.000 --> 00:17:34.460
Now, the reason for
this might be financial.

00:17:34.460 --> 00:17:36.050
You want to try and sabotage somebody.

00:17:36.050 --> 00:17:39.380
There might be other motivations,
distraction, for example,

00:17:39.380 --> 00:17:42.380
by tying up their resources,
trying to stop the attack.

00:17:42.380 --> 00:17:44.510
It opens up another avenue
to do something else,

00:17:44.510 --> 00:17:46.077
to perhaps steal information.

00:17:46.077 --> 00:17:48.410
There are many different
motivations for why they do this.

00:17:48.410 --> 00:17:51.020
And some of them are
honestly just boredom or fun.

00:17:51.020 --> 00:17:54.140
Amateur hackers sometimes
think it's fun to just initiate

00:17:54.140 --> 00:17:57.110
a denial of service attack
against an entity that

00:17:57.110 --> 00:17:59.870
is not prepared to handle it.

00:17:59.870 --> 00:18:02.480
Now, in the associated
materials for this course,

00:18:02.480 --> 00:18:06.380
we provided an article called Making
Cyberspace Safe for Democracy, which

00:18:06.380 --> 00:18:08.870
we really do encourage you
to take a look at, read,

00:18:08.870 --> 00:18:10.597
and discuss with your group.

00:18:10.597 --> 00:18:12.680
But I also want to take a
little bit of time right

00:18:12.680 --> 00:18:15.590
now just to talk about
this article in particular

00:18:15.590 --> 00:18:18.680
and draw your attention
to some areas of concern

00:18:18.680 --> 00:18:21.710
or some areas that might
lead to more discussion.

00:18:21.710 --> 00:18:25.070
Now, the biggest of
these is these attacks

00:18:25.070 --> 00:18:28.875
tend not to be taken very seriously
by people when they hear about them.

00:18:28.875 --> 00:18:31.250
You'll occasionally hear about
these attacks in the news,

00:18:31.250 --> 00:18:33.350
denial of service
attacks, or their cousin,

00:18:33.350 --> 00:18:35.930
distributed denial of service attacks.

00:18:35.930 --> 00:18:39.800
But culturally, again,
us being humans and sort

00:18:39.800 --> 00:18:42.650
of neglecting some of the
real security concerns here,

00:18:42.650 --> 00:18:44.420
we don't think of it as an attack.

00:18:44.420 --> 00:18:48.740
And that's maybe because of how we
hear about other kinds of attacks

00:18:48.740 --> 00:18:52.340
on the news that seem more
physically devastating,

00:18:52.340 --> 00:18:55.310
that have more real consequences.

00:18:55.310 --> 00:19:00.860
And it makes it hard to have a serious
conversation about cyber attacks

00:19:00.860 --> 00:19:06.650
because there's this friction that we
face trying to get people to understand

00:19:06.650 --> 00:19:08.600
that these are meaningful and real.

00:19:08.600 --> 00:19:12.530
And in particular, these
attacks are kind of insidious.

00:19:12.530 --> 00:19:17.355
They're really easy to execute
without much difficulty at all,

00:19:17.355 --> 00:19:20.480
especially against a small business
that might be running its own server as

00:19:20.480 --> 00:19:22.640
opposed to relying on a cloud service.

00:19:22.640 --> 00:19:29.150
A pretty top-of-the-line, commercially
available machine might be able

00:19:29.150 --> 00:19:33.200
to execute a denial of service
or DoS attack on its own.

00:19:33.200 --> 00:19:37.310
It doesn't even require
exceptional resources.

00:19:37.310 --> 00:19:41.450
Now, when we start to attack mid-sized
companies, or larger companies

00:19:41.450 --> 00:19:45.110
or entities, one single computer
from one single IP address

00:19:45.110 --> 00:19:47.480
is not typically going to be enough.

00:19:47.480 --> 00:19:52.730
And so instead, you would have a
distributed denial of service attack.

00:19:52.730 --> 00:19:54.620
In a distributed denial
of service attack,

00:19:54.620 --> 00:19:58.070
there is still generally one core
hacker, or one collective group

00:19:58.070 --> 00:19:59.960
of hackers or adversaries
that are trying

00:19:59.960 --> 00:20:03.647
to penetrate some company's defenses.

00:20:03.647 --> 00:20:05.480
But they can't do it
with their own machine.

00:20:05.480 --> 00:20:08.210
And so what they do is create
something called a botnet.

00:20:08.210 --> 00:20:09.890
Perhaps you've heard this term before.

00:20:09.890 --> 00:20:12.590
A botnet basically
happens, or is created,

00:20:12.590 --> 00:20:17.103
when hackers or adversaries
distribute worms or viruses sort of

00:20:17.103 --> 00:20:17.770
surreptitiously.

00:20:17.770 --> 00:20:19.700
Perhaps they packaged
them into some download.

00:20:19.700 --> 00:20:22.780
People don't notice anything
about the worm or anything

00:20:22.780 --> 00:20:25.750
about this program that has been
covertly installed on their machine.

00:20:25.750 --> 00:20:30.010
It doesn't do anything in
particular until it is activated.

00:20:30.010 --> 00:20:32.500
And then it becomes
an agent or a zombie--

00:20:32.500 --> 00:20:34.930
sometimes you'll hear
it termed that as well--

00:20:34.930 --> 00:20:36.400
controlled by the hackers.

00:20:36.400 --> 00:20:39.130
And so all of a sudden
the adversaries gain

00:20:39.130 --> 00:20:42.190
control of many different
devices, hundreds or thousands

00:20:42.190 --> 00:20:46.450
or tens of thousands, or even
more in some of the bigger attacks

00:20:46.450 --> 00:20:50.602
that have happened, basically
turning these computers--

00:20:50.602 --> 00:20:52.310
rendering all of them
under their control

00:20:52.310 --> 00:20:55.130
and being able to direct them to
take whatever action they want.

00:20:55.130 --> 00:20:58.870
And in particular, in the case of a
distributed denial of service attack,

00:20:58.870 --> 00:21:03.190
all of these computers are
going to make web requests

00:21:03.190 --> 00:21:07.810
to the same server or same
website, because that's the idea.

00:21:07.810 --> 00:21:09.180
You have so many requests.

00:21:09.180 --> 00:21:10.930
With distributed denial
of service attacks

00:21:10.930 --> 00:21:13.972
or just regular denial of service
attacks, it's just a question of scale,

00:21:13.972 --> 00:21:15.610
really.

00:21:15.610 --> 00:21:18.430
We're hitting those servers
with so many web requests.

00:21:18.430 --> 00:21:19.390
I want to access this.

00:21:19.390 --> 00:21:22.210
I want to access this, hundreds,
thousands, tens of thousands

00:21:22.210 --> 00:21:26.110
of these requests a second such that
the computer can't possibly-- the server

00:21:26.110 --> 00:21:28.210
can't possibly field
all of these inquiries

00:21:28.210 --> 00:21:33.010
that are coming and trying to give these
requests the data they're asking for.

00:21:33.010 --> 00:21:35.425
Ultimately, that would
eventually, after enough time,

00:21:35.425 --> 00:21:38.300
result in the server just crashing,
throwing up its hands and saying,

00:21:38.300 --> 00:21:39.430
I don't know what to do.

00:21:39.430 --> 00:21:41.388
I can't possibly process
all of these requests.

00:21:41.388 --> 00:21:45.010
But by tying it up in
this way, the adversary

00:21:45.010 --> 00:21:49.840
has succeeded in damaging the
infrastructure of the server.

00:21:49.840 --> 00:21:52.960
It's either denied the server
the ability to process customers

00:21:52.960 --> 00:21:55.840
and payments or it's just
taken down the entire website

00:21:55.840 --> 00:21:58.840
so there's no information available
about the company anymore to anybody

00:21:58.840 --> 00:22:01.630
who's trying to look it up.

00:22:01.630 --> 00:22:04.990
These attacks are actually
really, really common.

00:22:04.990 --> 00:22:06.910
There are some surveys
that have been out that

00:22:06.910 --> 00:22:12.292
assess that roughly one sixth to one
third of average-sized businesses that

00:22:12.292 --> 00:22:14.500
are part of this tech survey
that goes out every year

00:22:14.500 --> 00:22:20.680
suffer some sort of DoS attack in
a given year, so 16% to 35% or so

00:22:20.680 --> 00:22:23.910
of businesses, which is a lot of
businesses when you think about it.

00:22:23.910 --> 00:22:25.660
And these attacks are
usually quite small,

00:22:25.660 --> 00:22:27.610
and they're certainly not newsworthy.

00:22:27.610 --> 00:22:28.870
They might last a few minutes.

00:22:28.870 --> 00:22:30.190
They might last a few hours.

00:22:30.190 --> 00:22:31.690
But they're enough to be disruptive.

00:22:31.690 --> 00:22:32.898
They're certainly noteworthy.

00:22:32.898 --> 00:22:36.310
And they're something to
avoid if it's possible.

00:22:36.310 --> 00:22:41.660
Cloud computing has made
this problem kind of worse.

00:22:41.660 --> 00:22:45.190
And the reason for this is that,
in a cloud computing context,

00:22:45.190 --> 00:22:47.980
your server that is
running your business

00:22:47.980 --> 00:22:50.350
is not physically
located on your premises.

00:22:50.350 --> 00:22:54.270
It was often the case that when
a business would run a website

00:22:54.270 --> 00:23:00.430
or would run their business, they
would have a server room that

00:23:00.430 --> 00:23:03.790
had the software that was
necessary to run their website

00:23:03.790 --> 00:23:07.060
or to run whatever software-based
services they provided.

00:23:07.060 --> 00:23:10.415
And it was all local to that business.

00:23:10.415 --> 00:23:12.980
No one else could possibly be affected.

00:23:12.980 --> 00:23:15.070
But in a cloud computing
context, we are generally

00:23:15.070 --> 00:23:20.860
renting server space and server power
from an entity such as Amazon Web

00:23:20.860 --> 00:23:24.790
Services, or Google Cloud Services,
or some other large provider where

00:23:24.790 --> 00:23:30.460
it might be that 10, 20, 50, depending
on the size of the business in question

00:23:30.460 --> 00:23:31.510
here--

00:23:31.510 --> 00:23:35.920
multiple businesses are sharing
the same physical resources,

00:23:35.920 --> 00:23:37.990
and they're sharing
the same server space,

00:23:37.990 --> 00:23:41.260
such that if any one
of those 50, let's say,

00:23:41.260 --> 00:23:44.950
businesses is targeted
by hackers or adversaries

00:23:44.950 --> 00:23:49.570
for a denial of service attack, that
might actually, as collateral damage,

00:23:49.570 --> 00:23:52.390
take out the other 49 businesses.

00:23:52.390 --> 00:23:54.400
They weren't even part of the attack.

00:23:54.400 --> 00:23:55.930
But cloud computing is--

00:23:55.930 --> 00:23:57.820
we've heard about it
as it's a great thing.

00:23:57.820 --> 00:24:00.640
It allows us to scale
out our websites, make it

00:24:00.640 --> 00:24:02.800
so that we can handle more customers.

00:24:02.800 --> 00:24:06.280
It takes away the problem of
security, web-based security,

00:24:06.280 --> 00:24:11.090
because we're outsourcing that to the
cloud provider to give that to us.

00:24:11.090 --> 00:24:15.490
But it now introduces this new problem
of, if we're all sharing the resources

00:24:15.490 --> 00:24:18.790
and any one of us gets
attacked, then all of us

00:24:18.790 --> 00:24:21.760
lose the ability to access
those resources and use them,

00:24:21.760 --> 00:24:24.550
which might cause all of
our organizations to suffer

00:24:24.550 --> 00:24:28.090
the consequences of one single attack.

00:24:28.090 --> 00:24:30.700
This collateral damage
can get even worse

00:24:30.700 --> 00:24:33.050
when you think about servers that are--

00:24:33.050 --> 00:24:38.590
or businesses whose service
is providing the internet, OK?

00:24:38.590 --> 00:24:40.970
So a very common example of
this, or a noteworthy example

00:24:40.970 --> 00:24:44.260
of this, happened in 2016
with a service called

00:24:44.260 --> 00:24:49.480
DYN, D-Y-N. DYN is a
DNS service provider,

00:24:49.480 --> 00:24:52.390
DNS being the domain name system.

00:24:52.390 --> 00:25:00.450
And the idea there is to map the things
like www.google.com to its IP address.

00:25:00.450 --> 00:25:02.950
Because in order to actually
access anything on the internet

00:25:02.950 --> 00:25:06.140
or to have a communication with anyone,
you need to know their IP address.

00:25:06.140 --> 00:25:09.220
And as human beings, we tend
not to actually remember

00:25:09.220 --> 00:25:14.020
what some website's IP address is, much
like we may not recall a certain phone

00:25:14.020 --> 00:25:14.590
number.

00:25:14.590 --> 00:25:17.170
But if it has a mnemonic
attached to it-- so for example,

00:25:17.170 --> 00:25:20.530
you know back in the day we had
1-800-COLLECT for collect calls.

00:25:20.530 --> 00:25:25.750
If you forgot the number, the
literal digits of that phone number,

00:25:25.750 --> 00:25:29.290
you could still remember the idea of
it because you had this mnemonic device

00:25:29.290 --> 00:25:30.760
to help remind you.

00:25:30.760 --> 00:25:35.110
Domain names, www.whatever.com,
are just mnemonic devices

00:25:35.110 --> 00:25:37.570
that we use to refer to an IP address.

00:25:37.570 --> 00:25:41.770
And DNS servers provide
this service to us.

00:25:41.770 --> 00:25:46.990
DYN is one of the major DNS
providers for the internet overall.

00:25:46.990 --> 00:25:49.630
And if a denial of service
attack, or in this case

00:25:49.630 --> 00:25:53.800
it was certainly a distributed denial of
service attack because it was enormous,

00:25:53.800 --> 00:25:58.480
goes after pinging the IP address
or hitting that server over

00:25:58.480 --> 00:26:03.070
and over and over, then it is unable
to field requests from anyone else,

00:26:03.070 --> 00:26:06.880
because it's just getting pummeled by
all of these requests from some botnet

00:26:06.880 --> 00:26:11.250
that some adversary or collective
of adversaries has taken control of.

00:26:11.250 --> 00:26:13.990
The collateral damage
is that no one can

00:26:13.990 --> 00:26:17.110
map a domain name to
an IP address, which

00:26:17.110 --> 00:26:19.720
means no one can visit
any of these websites

00:26:19.720 --> 00:26:24.250
unless you happen to know at the
outset what the IP address of any given

00:26:24.250 --> 00:26:24.850
website was.

00:26:24.850 --> 00:26:27.243
If you knew the IP address,
this wasn't a problem.

00:26:27.243 --> 00:26:29.410
You could just still directly
go to that IP address.

00:26:29.410 --> 00:26:31.000
That's not the kind of attack here.

00:26:31.000 --> 00:26:33.460
But the attack instead
tied up the ability

00:26:33.460 --> 00:26:38.410
to translate these mnemonic
names into numbers.
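
NOTE
A minimal sketch of the name-to-number translation described here, not a real DNS client.
The table HYPOTHETICAL_RECORDS, the function resolve, and the resolver_is_up flag are
assumptions made up for illustration; the addresses come from ranges reserved for documentation.
```python
# Toy stand-in for a DNS lookup table; names and addresses are hypothetical.
HYPOTHETICAL_RECORDS = {
    "www.example.com": "192.0.2.10",
    "shop.example.com": "198.51.100.7",
}
def resolve(domain_name, resolver_is_up=True):
    """Return the IP address for a name, or None if it cannot be resolved."""
    if not resolver_is_up:
        return None  # resolver overwhelmed: the name cannot be translated
    return HYPOTHETICAL_RECORDS.get(domain_name.lower())
print(resolve("www.example.com"))
print(resolve("www.example.com", resolver_is_up=False))
```
When resolver_is_up is False the lookup returns None: the website itself may be fine, but its name can no longer be turned into an address.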

00:26:38.410 --> 00:26:42.400
And as you can see,
DYN was a DNS-- or is

00:26:42.400 --> 00:26:45.490
a DNS provider for much of the
eastern half of the United States

00:26:45.490 --> 00:26:48.842
as well as the Pacific
Northwest and California.

00:26:48.842 --> 00:26:50.800
And if you think about
what kinds of businesses

00:26:50.800 --> 00:26:53.950
are headquartered in
the Pacific Northwest

00:26:53.950 --> 00:26:58.810
and in California and in the
New York area, for example,

00:26:58.810 --> 00:27:01.060
you probably see that some
major, major services,

00:27:01.060 --> 00:27:03.435
including GitHub, which we've
already talked about today,

00:27:03.435 --> 00:27:06.190
but also Facebook and others--

00:27:06.190 --> 00:27:09.940
Harvard University's website was
also taken down for several hours.

00:27:09.940 --> 00:27:12.320
This attack lasted about 10
hours, so quite prolonged.

00:27:12.320 --> 00:27:15.810
It really did a lot
of damage on that day.

00:27:15.810 --> 00:27:18.310
It really crippled the ability
of people to use the internet

00:27:18.310 --> 00:27:22.420
for a long period of time,
so kind of very interesting.

00:27:22.420 --> 00:27:28.330
This article also talks a bit about
how the United States government has

00:27:28.330 --> 00:27:31.450
decided to-- or legislature--

00:27:31.450 --> 00:27:35.293
handle these kinds of issues,
computer-based attacks.

00:27:35.293 --> 00:27:37.460
It takes a look at the
Computer Fraud and Abuse

00:27:37.460 --> 00:27:41.290
Act, which is codified at 18 USC 1030.

00:27:41.290 --> 00:27:47.020
And this is really the only computer
crimes, general computer crimes,

00:27:47.020 --> 00:27:49.990
law that is on the books
and talks about what

00:27:49.990 --> 00:27:53.710
it means to be a protected computer.

00:27:53.710 --> 00:27:57.430
And you'll be interested to know
perhaps that any computer pretty much is

00:27:57.430 --> 00:27:58.780
a protected computer.

00:27:58.780 --> 00:28:02.320
The law specifically calls out
government computers as well as

00:28:02.320 --> 00:28:04.990
any computer that may be
involved in interstate commerce,

00:28:04.990 --> 00:28:08.200
which, as you can imagine, means that
for anybody who uses the internet,

00:28:08.200 --> 00:28:11.030
their computer then falls
under the ambit of this act.

00:28:11.030 --> 00:28:13.030
So it's another interesting
thing to take a look

00:28:13.030 --> 00:28:20.320
at if you're interested in how we
deal with processing or prosecuting

00:28:20.320 --> 00:28:23.020
computer-based crimes.

00:28:23.020 --> 00:28:26.330
All of it is actually sort of dealt
with in the Computer Fraud and Abuse

00:28:26.330 --> 00:28:29.500
Act, which is not terribly long
and hasn't been updated extensively

00:28:29.500 --> 00:28:32.150
since the 1980s other than
some small amendments.

00:28:32.150 --> 00:28:34.150
So it's kind of interesting
that we have not yet

00:28:34.150 --> 00:28:38.440
gotten to the point where we
are defining and prosecuting

00:28:38.440 --> 00:28:42.400
specific types of computer crime,
even though we've begun to figure out

00:28:42.400 --> 00:28:47.620
different types of computer crimes,
such as DoS attacks, such as phishing,

00:28:47.620 --> 00:28:49.370
and so on.

00:28:49.370 --> 00:28:52.690
Now, hypothetically, a simple
denial of service attack

00:28:52.690 --> 00:28:53.950
should be pretty easy to stop.

00:28:53.950 --> 00:28:59.230
And the reason for that is that there's
only one person making the attack.

00:28:59.230 --> 00:29:03.130
All web requests, recall, happen via HTTP,
which travels over the internet in IP packets.

00:29:03.130 --> 00:29:07.585
And each IP packet requires that
the sender's IP address

00:29:07.585 --> 00:29:09.460
be part of that envelope
that gets sent over,

00:29:09.460 --> 00:29:12.880
such that the server who wants to
respond to the client, or the sender,

00:29:12.880 --> 00:29:13.980
can just reference it.

00:29:13.980 --> 00:29:14.980
It's the return address.

00:29:14.980 --> 00:29:17.438
You need to be able to know
where to send the data back to.

00:29:17.438 --> 00:29:19.680
And so any request that is coming from--

00:29:19.680 --> 00:29:21.430
there are thousands
of requests that might

00:29:21.430 --> 00:29:23.680
be coming from a single IP address.

00:29:23.680 --> 00:29:27.490
If you see that happening, you can
just decide as a server in the software

00:29:27.490 --> 00:29:31.570
to stop accepting requests
from that address.
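
NOTE
A minimal sketch of refusing a single noisy source, under stated assumptions.
The class SingleSourceBlocker, its limit parameter, and the threshold of 100 are hypothetical
and not from the lecture; a real server would also reset the counts at the end of each time window.
```python
from collections import Counter
MAX_REQUESTS_PER_WINDOW = 100  # hypothetical threshold, chosen for illustration
class SingleSourceBlocker:
    """Count requests per source IP in one time window and block heavy senders."""
    def __init__(self, limit=MAX_REQUESTS_PER_WINDOW):
        self.limit = limit
        self.counts = Counter()
        self.blocked = set()
    def allow(self, source_ip):
        if source_ip in self.blocked:
            return False
        self.counts[source_ip] += 1
        if self.counts[source_ip] > self.limit:
            self.blocked.add(source_ip)  # stop accepting requests from this address
            return False
        return True
blocker = SingleSourceBlocker(limit=3)
results = [blocker.allow("203.0.113.5") for _ in range(5)]
print(results)
print(blocker.allow("198.51.100.20"))
```
Once the one address passes the limit it is refused from then on, while a different address is still served.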

00:29:31.570 --> 00:29:34.360
DDoS attacks, distributed
denial of service attacks,

00:29:34.360 --> 00:29:36.160
are much harder to stop.

00:29:36.160 --> 00:29:40.390
And it's exactly because of the fact
that there is not a single source.

00:29:40.390 --> 00:29:42.880
If there's a single source,
again, we would just completely

00:29:42.880 --> 00:29:48.250
stop accepting any requests of
any type from that computer.

00:29:48.250 --> 00:29:51.370
However, because we have so many
different computers to contend with,

00:29:51.370 --> 00:29:54.010
the options to handle this
are a bit more limited.

00:29:54.010 --> 00:29:57.400
There are some techniques for
averting them or stopping them

00:29:57.400 --> 00:30:01.960
once they are detected, however,
the first of which is firewalling.

00:30:01.960 --> 00:30:04.270
So the idea of a firewall
is we are only going

00:30:04.270 --> 00:30:06.700
to allow requests of a certain type.

00:30:06.700 --> 00:30:08.950
We're going to allow
them from any IP address,

00:30:08.950 --> 00:30:11.950
but we're only going to
accept them into this port.

00:30:11.950 --> 00:30:15.880
Recall that TCP/IP gives us the
ability to say this service

00:30:15.880 --> 00:30:19.390
comes in via this port, so HTTP
requests come in via port 80.

00:30:19.390 --> 00:30:24.360
HTTPS requests come in via port 443.

00:30:24.360 --> 00:30:27.030
So imagine a distributed
denial of service attack

00:30:27.030 --> 00:30:33.100
where typically the site would expect
to be receiving requests on HTTPS.

00:30:33.100 --> 00:30:37.650
It generally only uses
secured HTTP in order

00:30:37.650 --> 00:30:40.300
to process whatever
requests are coming in.

00:30:40.300 --> 00:30:44.160
So it's expecting to receive
a lot of traffic on port 443.

00:30:44.160 --> 00:30:47.970
And then all of a sudden a
distributed denial of service attack

00:30:47.970 --> 00:30:51.930
begins and it's receiving
lots of requests on port 80.

00:30:51.930 --> 00:30:55.440
One way to stop that attack before
it starts to tie up resources

00:30:55.440 --> 00:30:57.540
is to just put a
firewall up and say, I'm

00:30:57.540 --> 00:31:00.210
not actually going to accept
any requests on port 80.

00:31:00.210 --> 00:31:03.650
And this may have a side effect of
denying certain legitimate requests

00:31:03.650 --> 00:31:04.710
from getting through.

00:31:04.710 --> 00:31:07.920
But since the vast majority of the
traffic that I receive on the site

00:31:07.920 --> 00:31:12.805
comes in via HTTPS on port 443,
that's a small price to pay.

00:31:12.805 --> 00:31:15.180
I'd rather just allow the
legitimate requests to come in.

00:31:15.180 --> 00:31:17.140
So that's one technique.
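
NOTE
A minimal sketch of the port-based firewall rule just described: any address is allowed,
but only on port 443. The names ALLOWED_PORTS and firewall_accepts and the sample addresses
are assumptions for illustration; real firewalls are configured in the operating system or network hardware.
```python
ALLOWED_PORTS = {443}  # HTTPS only; port 80 (HTTP) is deliberately closed
def firewall_accepts(source_ip, destination_port):
    """Accept traffic from any address, but only on an allowed port."""
    return destination_port in ALLOWED_PORTS
incoming = [("203.0.113.5", 80), ("198.51.100.20", 443), ("192.0.2.44", 80)]
accepted = [packet for packet in incoming if firewall_accepts(*packet)]
print(accepted)
```
Note that the source address is ignored entirely, which is why a legitimate visitor on port 80 is turned away along with the attack traffic.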

00:31:17.140 --> 00:31:19.950
Another technique is
something called sinkholing.

00:31:19.950 --> 00:31:22.350
And it's exactly what
you probably think it is.

00:31:22.350 --> 00:31:24.860
So a sinkhole, as you
probably know, is a hole

00:31:24.860 --> 00:31:26.610
in the ground that
swallows everything up.

00:31:26.610 --> 00:31:32.730
And a sinkhole in a digital context is
a big black hole, basically, for data.

00:31:32.730 --> 00:31:34.890
It's just going to swallow
up every single request

00:31:34.890 --> 00:31:36.960
and just not allow any of them out.

00:31:36.960 --> 00:31:39.962
So this would, again, stop
the denial of service attack

00:31:39.962 --> 00:31:41.670
because it's just
taking all the requests

00:31:41.670 --> 00:31:44.190
and basically throwing
them in the trash.

00:31:44.190 --> 00:31:48.120
This won't take down the website of
the company that's being attacked,

00:31:48.120 --> 00:31:49.590
so that's a good thing.

00:31:49.590 --> 00:31:52.590
But it's also not going to allow
any legitimate traffic of any type

00:31:52.590 --> 00:31:54.460
through, so that might be a bad thing.

00:31:54.460 --> 00:31:56.460
But depending on the
length of the attack, if it

00:31:56.460 --> 00:31:59.520
seems like it's going to be
short, if the requests trickle off

00:31:59.520 --> 00:32:02.670
and stop because the attackers
realize, we're not making any progress,

00:32:02.670 --> 00:32:04.020
we're not actually doing--

00:32:04.020 --> 00:32:06.510
we're not getting the results
that we had hoped for,

00:32:06.510 --> 00:32:08.490
then perhaps they would give up.

00:32:08.490 --> 00:32:11.903
Then the sinkhole could be
stopped and regular traffic

00:32:11.903 --> 00:32:13.320
could start to flow through again.

00:32:13.320 --> 00:32:16.590
So sinkholing is basically just
taking all the traffic that comes in

00:32:16.590 --> 00:32:18.665
and just throwing it in the trash.
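
NOTE
A toy sketch of the sinkhole idea, assuming a hypothetical handle_request function and
sinkhole_active switch. In practice sinkholing is usually done by rerouting traffic at the
network level rather than inside application code.
```python
def handle_request(request, sinkhole_active):
    """Return a response, or None when the request is silently discarded."""
    if sinkhole_active:
        return None  # every request, malicious or legitimate, is dropped
    return "response to " + request
print(handle_request("GET /", sinkhole_active=True))
print(handle_request("GET /", sinkhole_active=False))
```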

00:32:18.665 --> 00:32:20.665
And then finally, another
technique we could use

00:32:20.665 --> 00:32:22.950
is something called packet analysis.

00:32:22.950 --> 00:32:27.390
So again, HTTP, we know,
is how requests are made via the web.

00:32:27.390 --> 00:32:30.120
And we learned a little
bit that we have headers

00:32:30.120 --> 00:32:33.060
that are packaged alongside
those HTTP packets

00:32:33.060 --> 00:32:38.010
that say where the request originated
from and where it's going to.

00:32:38.010 --> 00:32:40.440
There's a whole lot of
other metadata as well.

00:32:40.440 --> 00:32:44.250
You'll know, for example, what type
of browser the individual is using

00:32:44.250 --> 00:32:46.290
and what operating system
perhaps they are using

00:32:46.290 --> 00:32:50.950
and where, as in sort of a
geographical generalization, are they.

00:32:50.950 --> 00:32:52.440
Are they in the US Northeast?

00:32:52.440 --> 00:32:55.350
Are they in South America and so on?

00:32:55.350 --> 00:32:59.160
Instead of deciding to restrict
traffic via specific ports

00:32:59.160 --> 00:33:03.540
or just restrict all traffic, we could
still allow all traffic to come in

00:33:03.540 --> 00:33:06.460
but inspect all of the
packets as they come in.

00:33:06.460 --> 00:33:09.060
So for example, perhaps most
of the traffic on our site we

00:33:09.060 --> 00:33:11.650
are expecting to come from the--

00:33:11.650 --> 00:33:13.400
just because I used
that example already--

00:33:13.400 --> 00:33:14.700
US Northeast.

00:33:14.700 --> 00:33:16.650
And then all of a sudden
we are experiencing

00:33:16.650 --> 00:33:20.640
tons of packets coming in that have IP
addresses that all seem to be based--

00:33:20.640 --> 00:33:24.050
or they have, as part of
their packets, information

00:33:24.050 --> 00:33:25.800
that says that they're
from South America,

00:33:25.800 --> 00:33:29.790
or they're from the US West Coast, or
somewhere else that we don't expect.

00:33:29.790 --> 00:33:32.430
We can decide, after taking
a quick look at that packet

00:33:32.430 --> 00:33:36.240
and analyzing those individual
headers, that I'm not

00:33:36.240 --> 00:33:39.240
going to accept any
packets from that location.

00:33:39.240 --> 00:33:42.970
The ones that match locations
I'm expecting, I'll let through.

00:33:42.970 --> 00:33:45.948
And this, again, might prevent certain
customers from getting through,

00:33:45.948 --> 00:33:48.990
certain legitimate customers who might
actually be based in South America

00:33:48.990 --> 00:33:50.460
from getting through.

00:33:50.460 --> 00:33:54.980
But in general, it's going to
block most of the damaging traffic.
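
NOTE
A minimal sketch of filtering by inspected metadata, under stated assumptions.
EXPECTED_REGIONS, REGION_BY_ADDRESS, and accept_packet are hypothetical names, and the
hard-coded table stands in for the geolocation lookup a real system would perform.
```python
EXPECTED_REGIONS = {"US Northeast"}  # where this hypothetical site's customers usually are
REGION_BY_ADDRESS = {
    "192.0.2.10": "US Northeast",
    "198.51.100.7": "South America",
    "203.0.113.9": "US West Coast",
}
def accept_packet(packet):
    """Inspect one packet's metadata and accept it only from an expected region."""
    region = REGION_BY_ADDRESS.get(packet["source_ip"], "unknown")
    return region in EXPECTED_REGIONS
packets = [{"source_ip": "192.0.2.10"}, {"source_ip": "198.51.100.7"}, {"source_ip": "203.0.113.9"}]
accepted = [p["source_ip"] for p in packets if accept_packet(p)]
print(accepted)
```
Unlike the port rule or the sinkhole, every packet is examined individually, so traffic from expected locations keeps flowing.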

00:33:54.980 --> 00:33:57.900
DDoS attacks are really
frustrating for companies

00:33:57.900 --> 00:34:01.470
because they really
can do a lot of damage.

00:34:01.470 --> 00:34:04.480
Usually the resources of the
company will eventually-- especially

00:34:04.480 --> 00:34:08.280
if they're cloud-based and they rely
on their cloud provider to help them

00:34:08.280 --> 00:34:12.290
scale up, usually the resources
of the company being attacked

00:34:12.290 --> 00:34:14.699
are enough to eventually
overwhelm and stop

00:34:14.699 --> 00:34:18.780
the attacker who usually has a
much more limited set of resources.

00:34:18.780 --> 00:34:22.570
But again, depending on the type of
business being attacked in this way--

00:34:22.570 --> 00:34:25.580
again, think of the example
of DYN, the DNS provider.

00:34:25.580 --> 00:34:27.330
The ramifications for
one of these attacks

00:34:27.330 --> 00:34:31.350
can be really quite severe and
really quite annoying and costly

00:34:31.350 --> 00:34:34.480
for a business that suffers it.

00:34:34.480 --> 00:34:38.050
So we just talked about
HTTP and HTTPS a moment ago

00:34:38.050 --> 00:34:40.050
when we were talking about
firewalling, allowing

00:34:40.050 --> 00:34:42.790
some traffic on some of the
ports but not other ports,

00:34:42.790 --> 00:34:47.290
so maybe allowing HTTPS
traffic but not HTTP traffic.

00:34:47.290 --> 00:34:51.120
Let's take a look at these two
technologies in a bit more detail.

00:34:51.120 --> 00:34:54.330
So HTTP, again, is the
hypertext transfer protocol.

00:34:54.330 --> 00:34:58.530
It is how hypertext or web pages
are transmitted over the internet.

00:34:58.530 --> 00:35:04.530
If I am a client and I make a
request to you for some HTML content,

00:35:04.530 --> 00:35:08.130
then you as a server would
send a response back to me,

00:35:08.130 --> 00:35:11.550
and then I would be able to see
the page that I had requested.

00:35:11.550 --> 00:35:17.090
And every HTTP request has a specific
format at the beginning of it.

00:35:17.090 --> 00:35:24.560
For example, we might see something
like this, GET /execed HTTP/1.1, host:

00:35:24.560 --> 00:35:25.790
law.harvard.edu.

00:35:25.790 --> 00:35:28.670
Let's just quickly pick these
apart again one more time.

00:35:28.670 --> 00:35:31.910
If you see GET at the
beginning of an HTTP request,

00:35:31.910 --> 00:35:36.680
it means please fetch or get
for me, literally, this page.

00:35:36.680 --> 00:35:40.970
The page I'm requesting
specifically is /execed.

00:35:40.970 --> 00:35:46.520
And the host that I'm asking it from
is, in this case, law.harvard.edu.

00:35:46.520 --> 00:35:50.690
So basically what I'm saying
here is please fetch for me,

00:35:50.690 --> 00:35:54.120
or retrieve for me, the
HTML content that comprises

00:35:54.120 --> 00:36:00.410
http://law.harvard.edu/execed.

00:36:00.410 --> 00:36:05.990
And specifically I'm doing this
using HTTP protocol version 1.1.

00:36:05.990 --> 00:36:08.270
We're still using
version 1.1 even though I

00:36:08.270 --> 00:36:13.250
believe version 2 was standardized
back in 2015.

00:36:13.250 --> 00:36:17.030
And basically this is just
HTTP's way of identifying

00:36:17.030 --> 00:36:19.040
how you're asking the question.

00:36:19.040 --> 00:36:23.540
So it's similar to me making a
request and saying, oh, by the way,

00:36:23.540 --> 00:36:26.690
the rest of this request is written
in French, or, oh, by the way,

00:36:26.690 --> 00:36:29.630
the rest of this request
is written in Spanish.

00:36:29.630 --> 00:36:32.750
It's more like here are
the parameters that you

00:36:32.750 --> 00:36:35.150
should expect to see
because this request is

00:36:35.150 --> 00:36:39.540
in version 1.1, which differed
non-trivially from version 1.0.

00:36:39.540 --> 00:36:45.590
So it's just an identifier for how
exactly we are formatting our request.
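
NOTE
A minimal sketch of the request format picked apart above, using the same host and path.
The helper names build_get_request and parse_request_line are assumptions for illustration;
the sketch only builds and splits text and does not contact any server.
```python
def build_get_request(host, path):
    """Build the text of a minimal HTTP/1.1 GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
def parse_request_line(raw_request):
    """Split the first line into method, path, and protocol version."""
    request_line = raw_request.split("\r\n", 1)[0]
    method, path, version = request_line.split(" ")
    return method, path, version
raw = build_get_request("law.harvard.edu", "/execed")
print(parse_request_line(raw))
```
The three pieces of the first line are the method (GET), the page being requested, and the protocol version that tells the server how to read the rest.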

00:36:45.590 --> 00:36:47.950
But HTTP is not encrypted.

00:36:47.950 --> 00:36:51.232
And so if we think about
making a request to a server,

00:36:51.232 --> 00:36:52.940
if we're the client
on the left and we're

00:36:52.940 --> 00:36:56.120
making a request to a server on the
right, it might go something like this.

00:36:56.120 --> 00:37:00.530
Because the odds are pretty low
that, if we're making a request,

00:37:00.530 --> 00:37:03.350
we are so close to the
server that would serve

00:37:03.350 --> 00:37:05.660
that request to us that
it wouldn't need to hop

00:37:05.660 --> 00:37:07.480
through any routers along the way.

00:37:07.480 --> 00:37:09.410
Remember, routers,
their purpose in life is

00:37:09.410 --> 00:37:11.260
to send traffic in the right direction.

00:37:11.260 --> 00:37:13.350
And they contain a table
of information that says,

00:37:13.350 --> 00:37:15.800
oh, if I'm making a request
to some server over there,

00:37:15.800 --> 00:37:18.920
then the best path is to go here,
and then I'll send it over there,

00:37:18.920 --> 00:37:20.890
and then it will send it there.

00:37:20.890 --> 00:37:23.480
Their job is to optimize
and find the best path

00:37:23.480 --> 00:37:26.370
to get the request to
where it needs to be.

00:37:26.370 --> 00:37:31.145
So if I'm initiating a request
to, as the client, the server,

00:37:31.145 --> 00:37:33.020
it's going to first go
through router A who's

00:37:33.020 --> 00:37:35.760
going to say, OK, I'm going to
move it closer to the server

00:37:35.760 --> 00:37:38.960
so that it receives that request,
goes to router B, goes to router C.

00:37:38.960 --> 00:37:41.900
And eventually router C perhaps
is close enough to the server

00:37:41.900 --> 00:37:45.380
that it can just hand
off the request directly.
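
NOTE
A toy sketch of the hop-by-hop forwarding just described, assuming each router holds a
table naming the next hop toward a destination. The NEXT_HOP table and trace_path function
are hypothetical; real routing tables are keyed by address ranges and change with network conditions.
```python
# Hypothetical forwarding tables: for each machine, the next hop toward a destination.
NEXT_HOP = {
    "client": {"server": "router A"},
    "router A": {"server": "router B"},
    "router B": {"server": "router C"},
    "router C": {"server": "server"},
}
def trace_path(start, destination):
    """Follow next-hop entries from start until the destination is reached."""
    path = [start]
    current = start
    while current != destination:
        current = NEXT_HOP[current][destination]
        path.append(current)
    return path
print(trace_path("client", "server"))
```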

00:37:45.380 --> 00:37:48.568
The server's then going to get
that request, read it as HTTP/1.1,

00:37:48.568 --> 00:37:51.860
look at all the other metadata inside of
the request to see if there's anything

00:37:51.860 --> 00:37:55.030
else that it's being asked for, and
then it's going to send the information

00:37:55.030 --> 00:37:55.530
back.

00:37:55.530 --> 00:37:57.620
And in this example
I'm having it go back

00:37:57.620 --> 00:38:00.860
exactly through the same chain
of routers but in reverse.

00:38:00.860 --> 00:38:02.540
But in reality, that might be different.

00:38:02.540 --> 00:38:04.430
It might not go through
the exact same three

00:38:04.430 --> 00:38:06.620
routers in this example in reverse.

00:38:06.620 --> 00:38:12.110
It might actually go from C to A to
B, back to A depending on traffic

00:38:12.110 --> 00:38:14.780
that's happening on the network
and how congested things are

00:38:14.780 --> 00:38:19.310
and whether there might be a new path
that is better in the amount of time

00:38:19.310 --> 00:38:23.210
it took to process the
request that I asked for.

00:38:23.210 --> 00:38:25.880
But remember, HTTP, not secured.

00:38:25.880 --> 00:38:26.720
Not encrypted.

00:38:26.720 --> 00:38:29.000
This is plain,
over-the-air communication.

00:38:29.000 --> 00:38:33.560
We saw previously, when we
took a look at a screenshot

00:38:33.560 --> 00:38:36.530
from a tool called
Wireshark, that it's not

00:38:36.530 --> 00:38:41.420
that difficult on an unsecured network
using an unsecured protocol to read,

00:38:41.420 --> 00:38:44.150
literally, the contents of
those packets going to and from.

00:38:44.150 --> 00:38:46.320
So that's a vulnerability here for sure.

00:38:46.320 --> 00:38:48.980
Another vulnerability is
any one of these computers

00:38:48.980 --> 00:38:51.060
along the way could be compromised.

00:38:51.060 --> 00:38:54.320
So for example, router
A perhaps was infected

00:38:54.320 --> 00:38:57.510
by somebody who-- a router
is just a computer as well.

00:38:57.510 --> 00:39:00.200
So perhaps it was
infected by an adversary

00:39:00.200 --> 00:39:03.950
with some worm that will eventually
make it part of some botnet,

00:39:03.950 --> 00:39:07.580
and it'll eventually start
spamming some server somewhere.

00:39:07.580 --> 00:39:11.960
If router A is compromised in such a
way that an adversary can just read all

00:39:11.960 --> 00:39:14.010
the traffic that flows
through it-- and again,

00:39:14.010 --> 00:39:17.780
we're sending all of our traffic
in an unencrypted fashion--

00:39:17.780 --> 00:39:21.230
then we have another security
loophole to deal with.

00:39:21.230 --> 00:39:27.440
So HTTPS resolves this problem
by securing or encrypting

00:39:27.440 --> 00:39:32.150
all of the communications
between a client and a server.

00:39:32.150 --> 00:39:33.762
So HTTP requests go to one port.

00:39:33.762 --> 00:39:34.970
We talked about that already.

00:39:34.970 --> 00:39:36.950
They go to port 80 by convention.

00:39:36.950 --> 00:39:40.790
HTTPS requests go to port
443 by convention.

00:39:40.790 --> 00:39:44.840
In order for HTTPS to
work, the server is

00:39:44.840 --> 00:39:52.100
responsible for possessing and presenting
what's called a valid SSL or TLS

00:39:52.100 --> 00:39:52.670
certificate.

00:39:52.670 --> 00:39:55.550
SSL is actually a
deprecated technology now.

00:39:55.550 --> 00:39:58.070
It's been subsumed into TLS.

00:39:58.070 --> 00:40:01.580
But typically these things are still
referred to as SSL certificates.

00:40:01.580 --> 00:40:04.430
And perhaps you've seen a
screen that looks like this when

00:40:04.430 --> 00:40:05.990
you're trying to visit some website.

00:40:05.990 --> 00:40:08.240
You get a warning that your
connection is not private.

00:40:08.240 --> 00:40:10.970
And at the very end of
that warning, you are

00:40:10.970 --> 00:40:13.640
informed that the cert date is invalid.

00:40:13.640 --> 00:40:18.900
Basically this just means that
their SSL certificate has expired.

00:40:18.900 --> 00:40:21.510
Now, what is an SSL certificate?

00:40:21.510 --> 00:40:27.000
So there are services that work
alongside the internet called

00:40:27.000 --> 00:40:28.020
certificate authorities.

00:40:28.020 --> 00:40:32.520
And like GlobalSign, for example,
from whom I borrowed the screenshots--

00:40:32.520 --> 00:40:35.280
GoDaddy, who is also a very
popular domain name provider,

00:40:35.280 --> 00:40:37.780
is also a certificate authority.

00:40:37.780 --> 00:40:42.600
And what they do is they verify
that a particular website owns

00:40:42.600 --> 00:40:44.270
a particular private key--

00:40:44.270 --> 00:40:48.230
or excuse me, a particular public key
which has a corresponding private key.

00:40:48.230 --> 00:40:49.980
And the way they do
that is they digitally

00:40:49.980 --> 00:40:51.928
sign something to the
certificate authority.

00:40:51.928 --> 00:40:54.720
The certificate authority then goes
through those exact same checks

00:40:54.720 --> 00:40:56.595
that we've seen before
for digital signatures

00:40:56.595 --> 00:40:59.460
to verify that, yes, this
person must own this public key.

00:40:59.460 --> 00:41:03.810
And the idea for this
is we're trusting that,

00:41:03.810 --> 00:41:06.750
when I send a communication
to you as the website

00:41:06.750 --> 00:41:12.120
owner using the public key that you
say is yours, then it really is yours.

00:41:12.120 --> 00:41:16.110
There really is somebody out
there or some third party

00:41:16.110 --> 00:41:19.530
that we've decided to collectively
trust, the certificate authority, who

00:41:19.530 --> 00:41:20.670
is going to verify this.
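
NOTE
A minimal Python sketch of the trust check described above, assuming the
standard-library ssl module (not something shown in the lecture). The default
client context loads the system's trusted certificate authorities and requires
the server to present a certificate they have vouched for. Nothing here
contacts a real server; the variable names are illustrative only.

NOTE
```python
import ssl
# Default client-side settings: trust the system's certificate authorities.
context = ssl.create_default_context()
# The server must present a certificate signed by a trusted authority...
requires_certificate = context.verify_mode == ssl.CERT_REQUIRED
# ...and that certificate must match the host name we asked for.
checks_hostname = context.check_hostname
print(requires_certificate, checks_hostname)
```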

00:41:20.670 --> 00:41:23.100
Now, why does this matter?

00:41:23.100 --> 00:41:27.570
Why do we need to verify that someone's
public key is what they say it is?

00:41:27.570 --> 00:41:31.032
Well, it turns out that this
idea of asymmetric encryption,

00:41:31.032 --> 00:41:33.990
or public and private key cryptography
that we've previously discussed,

00:41:33.990 --> 00:41:38.520
does form part of the core of HTTPS.

00:41:38.520 --> 00:41:43.200
But as we'll see in a moment, we don't
actually use public and private keys

00:41:43.200 --> 00:41:47.100
to communicate except at the very,
very beginning of our interaction

00:41:47.100 --> 00:41:52.680
with some site when we are using HTTPS.

00:41:52.680 --> 00:41:56.370
So the way this really
happens underneath the hood

00:41:56.370 --> 00:42:00.780
is via the secure sockets layer, SSL,
which is now known as the transport

00:42:00.780 --> 00:42:02.950
layer security overall protocol.

00:42:02.950 --> 00:42:06.270
There's other things that are folded
into it, but SSL is part of it.

00:42:06.270 --> 00:42:09.210
And this is what happens.

00:42:09.210 --> 00:42:14.970
When I am requesting a page from
you, and you are the server,

00:42:14.970 --> 00:42:18.540
and I am requesting this
via HTTPS, I am going

00:42:18.540 --> 00:42:22.800
to initially make a request using
the public key that I believe

00:42:22.800 --> 00:42:24.780
is yours because the
certificate authority has

00:42:24.780 --> 00:42:30.395
vouched for you, saying that I would
like to make an encrypted request.

00:42:30.395 --> 00:42:32.520
And I don't want to send
that request over the air.

00:42:32.520 --> 00:42:34.145
I don't want to send that in the clear.

00:42:34.145 --> 00:42:37.110
I want to send it to you using the
encryption that you say is yours.

00:42:37.110 --> 00:42:41.160
So I send a request to you,
encrypting it using your public key.

00:42:41.160 --> 00:42:42.180
You receive the request.

00:42:42.180 --> 00:42:45.150
You decrypt it using your private key.

00:42:45.150 --> 00:42:48.900
You see, OK, I see now that Doug
wants to initiate a request with me,

00:42:48.900 --> 00:42:51.300
and you're going to fulfill the request.

00:42:51.300 --> 00:42:53.610
But you're also going
to do one other thing.

00:42:53.610 --> 00:42:57.420
You're going to set a key.

00:42:57.420 --> 00:43:00.270
And you're going to
send me back a key, not

00:43:00.270 --> 00:43:04.322
your public or private key, a different
key, alongside the request that I made.

00:43:04.322 --> 00:43:06.780
And you're going to send it
back to me using my public key.

00:43:06.780 --> 00:43:10.620
So the initial volley of communications
back and forth between us

00:43:10.620 --> 00:43:13.230
is the same as any other
encrypted communication

00:43:13.230 --> 00:43:16.140
using public and private keys
that we've previously seen.

00:43:16.140 --> 00:43:18.270
I send a message to you
using your public key.

00:43:18.270 --> 00:43:20.040
You decrypt it using your private key.

00:43:20.040 --> 00:43:26.340
You respond to me using my public key,
and I decrypt it using my private key.

00:43:26.340 --> 00:43:28.260
But this is really slow.

00:43:28.260 --> 00:43:34.780
If we're just having communications back
and forth via mail or even via text,

00:43:34.780 --> 00:43:39.210
the difference of a few
milliseconds is immaterial.

00:43:39.210 --> 00:43:41.450
We don't really notice it.

00:43:41.450 --> 00:43:44.757
But on the web, we do
notice it, especially

00:43:44.757 --> 00:43:46.590
if we're making multiple
requests or there's

00:43:46.590 --> 00:43:49.680
multiple packets going back and
forth and every single one of them

00:43:49.680 --> 00:43:51.520
needs to be encrypted.

00:43:51.520 --> 00:43:55.650
So beyond this initial volley,
public and private key encryption

00:43:55.650 --> 00:44:01.360
is no longer used
because it's too slow.

00:44:01.360 --> 00:44:03.610
We would notice it if we did.

00:44:03.610 --> 00:44:09.150
Instead, as I mentioned, the server
is going to respond with a key.

00:44:09.150 --> 00:44:11.205
And that key is the key to a cipher.

00:44:11.205 --> 00:44:14.910
And we've talked about ciphers before
and we know that they are reversible.

00:44:14.910 --> 00:44:19.350
The particular cipher in question
here is something called AES.

00:44:19.350 --> 00:44:20.520
But it is just a cipher.

00:44:20.520 --> 00:44:21.960
It is reversible.

00:44:21.960 --> 00:44:24.360
And the key that you
receive is the key that you

00:44:24.360 --> 00:44:28.410
are supposed to use to decrypt
all future communications.

00:44:28.410 --> 00:44:30.060
This key is called the session key.

00:44:30.060 --> 00:44:33.360
And you use it to decrypt
all future communications

00:44:33.360 --> 00:44:37.230
and use it to encrypt all future
communications to the server

00:44:37.230 --> 00:44:40.350
until the session,
so-called, is terminated.

00:44:40.350 --> 00:44:43.320
And the session is basically
as long as you're on the site

00:44:43.320 --> 00:44:46.770
and you haven't logged
out or closed the window.

00:44:46.770 --> 00:44:48.240
That is the idea of a session.

00:44:48.240 --> 00:44:53.685
It is one singular
experience with a page

00:44:53.685 --> 00:44:57.750
or with a set of pages that are
all part of the same domain name.

00:44:57.750 --> 00:45:00.960
We're just going to use a cipher for
the rest of the time that we talk.

00:45:00.960 --> 00:45:03.932
Now, this may seem
insecure for reasons we've

00:45:03.932 --> 00:45:05.640
talked about when we
talked about ciphers

00:45:05.640 --> 00:45:07.470
and how they are inherently flawed.

00:45:07.470 --> 00:45:10.470
Recall that when we were talking about
some of the really early ciphers,

00:45:10.470 --> 00:45:13.090
those are classic ciphers
like Caesar and Vigenere,

00:45:13.090 --> 00:45:14.430
those are very easy to break.

00:45:14.430 --> 00:45:17.630
AES is much more complex than that.

00:45:17.630 --> 00:45:22.080
And the other upside is that
this key, like I mentioned,

00:45:22.080 --> 00:45:23.910
is only good for a session.

00:45:23.910 --> 00:45:29.040
So in the unlikely event that the server
chooses a bad key, for example, if we

00:45:29.040 --> 00:45:32.490
think about it as if it was Caesar,
if they choose a key of zero,

00:45:32.490 --> 00:45:35.240
which would be a very bad key that doesn't
actually shift the letters at all, or a

00:45:35.240 --> 00:45:40.113
key of one, even
if the key is compromised,

00:45:40.113 --> 00:45:41.780
it's only good for a particular session.

00:45:41.780 --> 00:45:44.240
That's not a very long amount of time.

00:45:44.240 --> 00:45:47.240
But the upside is the
ability to encipher

00:45:47.240 --> 00:45:49.520
and decipher information is much faster.

00:45:49.520 --> 00:45:53.390
If it's reversible, it's pretty quick
to do some mathematical manipulation

00:45:53.390 --> 00:45:57.140
and transform it into something
that looks obscured and gibberish

00:45:57.140 --> 00:45:59.240
and to undo that as well.

00:45:59.240 --> 00:46:03.020
And so even though public
and private keys are--

00:46:03.020 --> 00:46:05.780
we consider effectively
unbreakable, like to the point

00:46:05.780 --> 00:46:10.040
of it's mathematically untenable
to crack a message using

00:46:10.040 --> 00:46:11.510
public and private key encryption,

00:46:11.510 --> 00:46:16.010
we don't rely on it for SSL because
it is impractical to actually expect

00:46:16.010 --> 00:46:17.450
communications to go that slowly.

00:46:17.450 --> 00:46:19.610
And so we do fall back on these ciphers.

00:46:19.610 --> 00:46:24.260
And that really is when you're using
secured encrypted communication

00:46:24.260 --> 00:46:26.270
via HTTPS.

00:46:26.270 --> 00:46:27.980
You're just relying
on a cipher that just

00:46:27.980 --> 00:46:31.700
happens to be a very, very fancy
cipher that should hypothetically

00:46:31.700 --> 00:46:36.060
be very difficult to figure
out the key to as well.

00:46:36.060 --> 00:46:40.280
You may have also seen a few changes
in your browser, especially recently.

00:46:40.280 --> 00:46:42.170
This screenshot shows
a couple of changes

00:46:42.170 --> 00:46:48.080
that are designed to warn you when
you are not using HTTPS encryption.

00:46:48.080 --> 00:46:51.980
And it's not necessary to use
HTTPS for every interaction you

00:46:51.980 --> 00:46:53.480
have on the internet.

00:46:53.480 --> 00:46:56.750
For example, if you are going to a
site that is purely informational,

00:46:56.750 --> 00:47:00.900
it's just static content, it's just a
list of information, there's no login,

00:47:00.900 --> 00:47:05.190
there's no buying, there's no clicking
on things that might then get tracked,

00:47:05.190 --> 00:47:08.280
for example, it's not really
necessary to use HTTPS.

00:47:08.280 --> 00:47:11.630
So don't be necessarily
alarmed if you visit a site

00:47:11.630 --> 00:47:14.180
and you're warned it's not secure.

00:47:14.180 --> 00:47:17.480
We're told that over time this will
turn red and become perhaps even

00:47:17.480 --> 00:47:19.950
more concerning as more
versions of this come out

00:47:19.950 --> 00:47:23.850
and as more and more adopters
of HTTPS exist as well.

00:47:23.850 --> 00:47:25.850
But you're going to start
getting notifications.

00:47:25.850 --> 00:47:27.725
And you may have seen
these as well in green.

00:47:27.725 --> 00:47:29.870
If you are using HTTPS and
you log into something,

00:47:29.870 --> 00:47:33.120
you'll see a little lock icon here
and you'll be told that it is secure.

00:47:33.120 --> 00:47:35.570
And again, this is just
because human beings

00:47:35.570 --> 00:47:40.460
tend not to be as concerned
about their digital privacy

00:47:40.460 --> 00:47:43.430
and their digital security
when using the internet.

00:47:43.430 --> 00:47:48.260
And now the technology is
trying to provide clues and tips

00:47:48.260 --> 00:47:54.880
to entice you to be more
concerned about these things.

00:47:54.880 --> 00:47:57.330
Now let's take a look
at a couple of attacks

00:47:57.330 --> 00:47:59.640
that are derived from
things we typically consider

00:47:59.640 --> 00:48:02.130
to be advantages of using the internet.

00:48:02.130 --> 00:48:07.050
The first of these is the idea
of cross-site scripting, XSS.

00:48:07.050 --> 00:48:09.450
We've previously discussed
this idea of the distinction

00:48:09.450 --> 00:48:11.700
between server-side code
and client-side code.

00:48:11.700 --> 00:48:14.400
Client-side code, recall, is
something that runs locally

00:48:14.400 --> 00:48:16.710
on our computer where
our browser, for example,

00:48:16.710 --> 00:48:19.380
is expected to interpret
and execute that code.

00:48:19.380 --> 00:48:22.000
Server-side code is run on the server.

00:48:22.000 --> 00:48:25.060
And when we get
information from a server,

00:48:25.060 --> 00:48:27.630
we're not getting back
the actual lines of code.

00:48:27.630 --> 00:48:31.028
We're getting back the output of that
code having run in the first place.

00:48:31.028 --> 00:48:34.320
So for example, there might be some code
on the server, some Python code or PHP

00:48:34.320 --> 00:48:38.220
code that generates HTML for us.

00:48:38.220 --> 00:48:42.570
The actual Python or PHP code in this
example would be server-side code.

00:48:42.570 --> 00:48:44.430
We don't actually ever see that code.

00:48:44.430 --> 00:48:46.890
We only see the output of that code.

00:48:46.890 --> 00:48:50.550
A cross-site script
vulnerability exists when

00:48:50.550 --> 00:48:57.180
an adversary is able to trick a client's
browser to run something locally.

00:48:57.180 --> 00:49:01.860
And it will do something that
presumably the person, the client,

00:49:01.860 --> 00:49:04.965
didn't actually intend to do.

00:49:04.965 --> 00:49:07.590
Let's take a look at an example
of this using a very simple web

00:49:07.590 --> 00:49:09.150
server called Flask.

00:49:09.150 --> 00:49:10.575
We have here some Python code.

00:49:10.575 --> 00:49:13.200
And don't be too worried if this
doesn't all make sense to you.

00:49:13.200 --> 00:49:20.050
It's just a pretty short, simple
web server that does two things.

00:49:20.050 --> 00:49:22.170
So this is just some
bookkeeping stuff in Flask.

00:49:22.170 --> 00:49:26.460
And Flask is a package of Python
that is used to create web servers.

00:49:26.460 --> 00:49:29.100
This web server has two
things, though, that it does.

00:49:29.100 --> 00:49:34.350
The first is when I visit
slash on my web server--

00:49:34.350 --> 00:49:36.750
so let's say this is Doug's site.

00:49:36.750 --> 00:49:41.912
If I go to dougspage.com/, which you may
not actually explicitly type anymore

00:49:41.912 --> 00:49:43.620
but most browsers just
add it, slash just

00:49:43.620 --> 00:49:47.730
means the root page of your server.

00:49:47.730 --> 00:49:50.430
I'm going to call the following
function whose name happens

00:49:50.430 --> 00:49:52.440
to be called index in this case.

00:49:52.440 --> 00:49:53.970
Return hello world.

00:49:53.970 --> 00:49:58.770
And what this basically means
is if I visit dougspage.com/,

00:49:58.770 --> 00:50:05.730
what I receive is an HTML page
whose content is just hello world.

00:50:05.730 --> 00:50:09.060
So it's just an HTML file
that says hello world.

00:50:09.060 --> 00:50:11.730
Again, this code here
is all server-side code.

00:50:11.730 --> 00:50:14.130
You don't actually see this code.

00:50:14.130 --> 00:50:18.933
You only see the output of this
code, which is this here, this HTML.

00:50:18.933 --> 00:50:21.100
It's just a simple string
in this case, but it would

00:50:21.100 --> 00:50:25.080
be interpreted by the browser as HTML.

00:50:25.080 --> 00:50:27.920
If, however, I get a 404--

00:50:27.920 --> 00:50:31.470
a 404 is a not found error. It means
the page I requested doesn't exist.

00:50:31.470 --> 00:50:35.370
And since I've only defined the
behavior for literally one page,

00:50:35.370 --> 00:50:41.790
slash, the index page of my server, then
I want to call this function not found.

00:50:41.790 --> 00:50:46.590
Return not found plus whatever
page I tried to visit.

00:50:46.590 --> 00:50:50.550
So it basically is another very simple
page, much like hello world here,

00:50:50.550 --> 00:50:53.980
where instead of saying hello
world, it says not found.

00:50:53.980 --> 00:50:57.560
And then it also concatenates onto
the very end of that whatever page

00:50:57.560 --> 00:50:59.760
I tried to visit.

00:50:59.760 --> 00:51:03.960
This is a major cross-site
scripting vulnerability.

00:51:03.960 --> 00:51:05.640
And let's see why.

00:51:05.640 --> 00:51:10.920
Let's imagine I go to
/foo, so dougspage.com/foo.

00:51:10.920 --> 00:51:14.130
Recall that our error handler function,
which I've reproduced down here,

00:51:14.130 --> 00:51:17.330
will return not found /foo.

00:51:17.330 --> 00:51:18.330
Seems pretty reasonable.

00:51:18.330 --> 00:51:22.260
It seems like the behavior I
expected or intended to have happen.

00:51:22.260 --> 00:51:24.970
But what about if I go
to a page like this one?

00:51:24.970 --> 00:51:29.490
So this is what I literally type in the
browser, dougspage.com/ angle bracket,

00:51:29.490 --> 00:51:36.450
script, angle bracket alert(hi)
and then a closed script tag there.

00:51:36.450 --> 00:51:42.770
This script here, script
here, looks a lot like HTML.

00:51:42.770 --> 00:51:47.640
And in fact, when the browser sees
this, it will interpret it as HTML.

00:51:47.640 --> 00:51:53.340
And so I will get returned by visiting
this page not found, and then everything

00:51:53.340 --> 00:51:57.150
here except for the
leading slash, which means

00:51:57.150 --> 00:52:02.550
that when I receive this and my
client is interpreting the HTML,

00:52:02.550 --> 00:52:05.502
I'm going to generate an alert.
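
NOTE
A hedged sketch of why the handler described above is vulnerable, written as
plain Python rather than real Flask code. The function name not_found and the
"Not Found: " prefix are assumptions for illustration; the point is only that
the requested path is pasted into the response unchanged.

NOTE
```python
def not_found(path: str) -> str:
    # Mirrors the handler described in the lecture: the requested path is
    # concatenated into the response with no escaping.
    return "Not Found: " + path
# A harmless path comes back as plain text.
plain = not_found("foo")
# A path containing a script tag comes back as markup a browser would run.
risky = not_found("<script>alert('hi')</script>")
print(plain)
print(risky)
```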

00:52:05.502 --> 00:52:06.210
What is an alert?

00:52:06.210 --> 00:52:09.025
Well, if you've ever gone to a
website and had a pop-up box display

00:52:09.025 --> 00:52:11.400
some information, you have to
click OK or click X to make

00:52:11.400 --> 00:52:13.590
it go away, that's what an alert is.

00:52:13.590 --> 00:52:16.350
So I visit this page on
my website, I've actually

00:52:16.350 --> 00:52:21.330
tricked my browser into
giving me a JavaScript alert,

00:52:21.330 --> 00:52:23.850
or I've tricked whoever
visits this page's browser

00:52:23.850 --> 00:52:26.070
to give them a JavaScript alert.

00:52:26.070 --> 00:52:29.980
So that's probably not
exactly a good thing.

00:52:29.980 --> 00:52:33.540
But it can get a little bit
more nefarious than that.

00:52:33.540 --> 00:52:36.670
Let's instead imagine-- instead
of having this be on my server,

00:52:36.670 --> 00:52:41.250
it might be easier to imagine it
like this, that this is what I wrote.

00:52:41.250 --> 00:52:45.698
This script tag here is what I wrote
into my Facebook profile, for example.

00:52:45.698 --> 00:52:48.240
So Facebook gives you the ability
to write a short little bio

00:52:48.240 --> 00:52:49.500
about yourself.

00:52:49.500 --> 00:52:54.927
Let's imagine that my bio was this
script document.write, image source,

00:52:54.927 --> 00:52:56.760
and then I have a hacker
URL and everything.

00:52:56.760 --> 00:52:58.760
And imagine that I own hacker URL.

00:52:58.760 --> 00:53:04.800
So I own hacker URL and I wrote
this in my Facebook profile.

00:53:04.800 --> 00:53:08.010
Assuming that Facebook did not
defend against cross-site scripting

00:53:08.010 --> 00:53:11.740
attacks, which they do, but
assuming that they did not,

00:53:11.740 --> 00:53:15.540
anytime somebody visited
my profile, their browser

00:53:15.540 --> 00:53:19.810
would be forced to contend
with this script tag here.

00:53:19.810 --> 00:53:20.310
Why?

00:53:20.310 --> 00:53:22.590
Because they're trying
to visit my profile page.

00:53:22.590 --> 00:53:26.610
My profile page contains
literally these characters which

00:53:26.610 --> 00:53:29.540
are going to be interpreted as HTML.

00:53:29.540 --> 00:53:33.990
And it's going to add document.write--
that's a JavaScript way of saying add

00:53:33.990 --> 00:53:38.490
the following line in addition
to the HTML of the page--

00:53:38.490 --> 00:53:44.700
image source equals hacker
url?cookie= and then document.cookie.

00:53:44.700 --> 00:53:48.210
So imagine that I, again,
control hacker URL.

00:53:48.210 --> 00:53:50.730
Presumably, as somebody
who is running a website,

00:53:50.730 --> 00:53:54.810
I also maintain logs of every time
somebody tries to access my website,

00:53:54.810 --> 00:53:57.960
what page on my site
they're trying to visit.

00:53:57.960 --> 00:54:00.690
If somebody goes to my Facebook
profile and executes this,

00:54:00.690 --> 00:54:06.270
I'm going to get notified via my hacker
URL logs that somebody has tried to go

00:54:06.270 --> 00:54:12.560
to that page ?cookie=
and then document.cookie.

00:54:12.560 --> 00:54:14.910
Now, document.cookie in
this case, because this

00:54:14.910 --> 00:54:21.670
exists on my Facebook profile, is
an individual's cookie for Facebook.

00:54:21.670 --> 00:54:24.000
So here what I am
doing-- again, Facebook

00:54:24.000 --> 00:54:26.310
does defend against
cross-site scripting attacks,

00:54:26.310 --> 00:54:28.230
so this can't actually
happen on Facebook.

00:54:28.230 --> 00:54:31.980
But assuming that they did not
defend against them adequately,

00:54:31.980 --> 00:54:36.210
what I'm basically doing
is getting told via my log

00:54:36.210 --> 00:54:38.520
that somebody tried to
visit some page on my URL,

00:54:38.520 --> 00:54:41.400
but the page that they
tried to visit, I'm

00:54:41.400 --> 00:54:46.170
plugging in and basically stealing
the cookie that they use for Facebook.

00:54:46.170 --> 00:54:48.873
And a cookie, recall, is
sort of like a hand stamp.

00:54:48.873 --> 00:54:50.790
It's basically me, instead
of having to re-log

00:54:50.790 --> 00:54:53.602
into Facebook every time I want
to use it, going up to Facebook

00:54:53.602 --> 00:54:54.310
and saying, here.

00:54:54.310 --> 00:54:56.070
You've already verified my identity.

00:54:56.070 --> 00:54:59.040
Just take a look at
this, and you get let in.

00:54:59.040 --> 00:55:04.920
And now I hypothetically know
someone else's Facebook cookie.

00:55:04.920 --> 00:55:07.890
And if I was clever, I
could try and use that

00:55:07.890 --> 00:55:12.060
to change what my Facebook cookie
is to that person's Facebook cookie.

00:55:12.060 --> 00:55:17.220
And then suddenly I'm able to log in
and view their profile and act as them.

00:55:17.220 --> 00:55:19.290
This image tag here
is just a clever trick

00:55:19.290 --> 00:55:24.150
because the idea is that it's trying
to pull some resource from my site.

00:55:24.150 --> 00:55:25.060
It doesn't exist.

00:55:25.060 --> 00:55:27.270
I don't have a list of all
the cookies on Facebook.

00:55:27.270 --> 00:55:32.040
But I'm being told that somebody is
trying to access this URL on my site.

00:55:32.040 --> 00:55:34.950
So the image tag is just
sort of a trick to force

00:55:34.950 --> 00:55:38.760
it to log something on my hacker URL.

00:55:38.760 --> 00:55:43.170
But the idea here is that I would
be able to steal somebody's Facebook

00:55:43.170 --> 00:55:47.610
cookie where this attack's
not well-defended against.

00:55:47.610 --> 00:55:51.960
So what techniques can we
use either for our own sites

00:55:51.960 --> 00:55:55.980
when we are running them to avoid
cross-site scripting vulnerabilities

00:55:55.980 --> 00:56:01.270
or to protect against cross-site
scripting vulnerabilities?

00:56:01.270 --> 00:56:04.770
The first technique that we can
use is to sanitize, so to speak,

00:56:04.770 --> 00:56:08.400
all of the inputs that
come in to our page.

00:56:08.400 --> 00:56:10.610
So let's take a look at how
exactly we might do this.

00:56:10.610 --> 00:56:13.500
So it turns out that
there are things called

00:56:13.500 --> 00:56:19.080
HTML entities, which are other ways of
representing certain characters in HTML

00:56:19.080 --> 00:56:22.950
that might be considered special or
control characters, so things like,

00:56:22.950 --> 00:56:26.460
for example, this or this.

00:56:26.460 --> 00:56:29.610
Typically, when a browser
sees a character left

00:56:29.610 --> 00:56:31.770
angle bracket or right
angle bracket, it's

00:56:31.770 --> 00:56:37.740
going to automatically interpret that as
some HTML that it should then process.

00:56:37.740 --> 00:56:39.930
So in the example I just
showed a moment ago,

00:56:39.930 --> 00:56:44.130
I was using the fact that whenever
it sees angle brackets with script

00:56:44.130 --> 00:56:47.050
around it, it's going to
try and interpret whatever

00:56:47.050 --> 00:56:49.470
is between those tags as a script.

00:56:49.470 --> 00:56:52.920
One way for me to prevent that
from being interpreted as a script

00:56:52.920 --> 00:56:58.800
is to call this or call this something
else other than just left angle bracket

00:56:58.800 --> 00:57:00.130
and right angle bracket.

00:57:00.130 --> 00:57:03.780
And it turns out that there are these
things called HTML entities that

00:57:03.780 --> 00:57:08.250
can be used to refer to
these characters instead,

00:57:08.250 --> 00:57:13.440
such that if I sanitize
my input in such a way

00:57:13.440 --> 00:57:20.278
that every time somebody literally
typed the character left angle bracket,

00:57:20.278 --> 00:57:23.070
I had written some code that
automatically took that and changed it

00:57:23.070 --> 00:57:25.470
into ampersand lt;.

00:57:25.470 --> 00:57:29.440
And then every time somebody
wrote a greater than character,

00:57:29.440 --> 00:57:35.670
or right angle bracket, I changed
that in the code to ampersand gt;.

00:57:35.670 --> 00:57:40.170
Then when my page was responsible for
processing or interpreting something,

00:57:40.170 --> 00:57:44.640
it wouldn't interpret this-- it would
still display this character as a left

00:57:44.640 --> 00:57:47.580
angle bracket or less than-- that's
what the lt stands for here--

00:57:47.580 --> 00:57:49.290
or a right angle bracket, greater than.

00:57:49.290 --> 00:57:52.210
That's what the gt stands for there.

00:57:52.210 --> 00:57:55.960
It would literally just show those
characters and not treat them as HTML.

00:57:55.960 --> 00:58:00.030
So that's the idea of what it means
to sanitize input when we're talking

00:58:00.030 --> 00:58:04.510
about HTML entities, for example.
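
NOTE
One way to sketch the sanitizing step just described, assuming Python's
standard-library html.escape does the replacement for us. The function name
not_found and the "Not Found: " prefix are illustrative assumptions, not the
lecture's exact code.

NOTE
```python
import html
def not_found(path: str) -> str:
    # html.escape rewrites <, > and & (and quotes) as HTML entities, so the
    # browser displays the characters instead of treating them as tags.
    return "Not Found: " + html.escape(path)
safe = not_found("<script>alert('hi')</script>")
print(safe)
```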

00:58:04.510 --> 00:58:08.160
Another thing that we could do is
just disable JavaScript entirely.

00:58:08.160 --> 00:58:10.290
This would have some
upsides and some downsides.

00:58:10.290 --> 00:58:13.440
The upside is you're pretty protected
against cross-site scripting

00:58:13.440 --> 00:58:17.820
vulnerabilities because they're usually
going to be introduced via JavaScript.

00:58:17.820 --> 00:58:20.100
The downside is JavaScript
is pretty convenient.

00:58:20.100 --> 00:58:20.670
It's nice.

00:58:20.670 --> 00:58:22.770
It makes for a better user experience.

00:58:22.770 --> 00:58:24.930
Sometimes there might
be parts of our page

00:58:24.930 --> 00:58:29.040
that just don't work if
JavaScript is completely disabled,

00:58:29.040 --> 00:58:30.540
and so trade-offs there.

00:58:30.540 --> 00:58:33.360
You're protecting yourself,
but you might be doing

00:58:33.360 --> 00:58:37.050
other sorts of non-material damage.

00:58:37.050 --> 00:58:40.142
Or we could decide to just handle
the JavaScript in a special way.

00:58:40.142 --> 00:58:41.850
So for example, we
might not allow what's

00:58:41.850 --> 00:58:44.940
called inline JavaScript, for
example, like the script tags

00:58:44.940 --> 00:58:46.470
that I just showed a moment ago.

00:58:46.470 --> 00:58:50.010
But we might allow JavaScript
written in separate JavaScript files

00:58:50.010 --> 00:58:52.870
which can also be linked
into your HTML pages.

00:58:52.870 --> 00:58:56.280
So those would be allowed, but inline
JavaScript, like what we just saw,

00:58:56.280 --> 00:58:57.690
would not be allowed.

00:58:57.690 --> 00:59:01.890
We could sandbox the JavaScript and
run it separately somewhere else first

00:59:01.890 --> 00:59:06.210
to see if it does something weird,
and if it doesn't do something weird,

00:59:06.210 --> 00:59:08.580
then allow it to be displayed.

00:59:08.580 --> 00:59:12.390
We could also use a
content security policy.

00:59:12.390 --> 00:59:15.570
Content security policy
is another header

00:59:15.570 --> 00:59:20.370
that we can add to our HTML
pages or HTTP responses.

00:59:20.370 --> 00:59:22.350
And we can define certain
behavior to happen

00:59:22.350 --> 00:59:25.800
such that we'll allow certain lines or
certain types of JavaScript through

00:59:25.800 --> 00:59:28.167
but not others.
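
NOTE
A small sketch of what such a header might look like, assuming a policy that
only allows scripts loaded from the site's own origin. The helper name
csp_header is hypothetical; Content-Security-Policy and script-src are the
standard header and directive names.

NOTE
```python
def csp_header(script_sources: list[str]) -> tuple[str, str]:
    # script-src lists where scripts may load from; because 'unsafe-inline'
    # is absent, browsers that enforce the policy skip inline script tags.
    value = "script-src " + " ".join(script_sources)
    return ("Content-Security-Policy", value)
# Allow only script files served from the page's own origin.
name, value = csp_header(["'self'"])
print(name + ": " + value)
```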

00:59:28.167 --> 00:59:30.000
Now, there's another
type of attack that can

00:59:30.000 --> 00:59:34.800
be used that relies heavily on the fact
that we use cookies so extensively,

00:59:34.800 --> 00:59:40.650
and that is a cross-site
request forgery, or a CSRF.

00:59:40.650 --> 00:59:43.680
Now, cross-site scripting
attacks generally

00:59:43.680 --> 00:59:48.840
involve receiving some content
and the client's browser

00:59:48.840 --> 00:59:53.610
being tricked into doing something
locally that it didn't want to do.

00:59:53.610 --> 00:59:58.170
In a CSRF request, or
CSRF attack, rather,

00:59:58.170 --> 01:00:02.430
the trick is we're relying
on the fact that there

01:00:02.430 --> 01:00:04.980
is a cookie that can
be exploited to make

01:00:04.980 --> 01:00:11.595
an outbound request, an outbound HTTP
request that we did not intend to make.

01:00:11.595 --> 01:00:13.470
And again, this relies
extensively on cookies

01:00:13.470 --> 01:00:18.300
because they are this shorthand,
short-form way to log into something.

01:00:18.300 --> 01:00:22.230
And we can make a fraudulent
request appear legitimate

01:00:22.230 --> 01:00:24.480
if we can rely on someone's cookie.

01:00:24.480 --> 01:00:28.110
Now, again, if you ever use
a cloud service for example,

01:00:28.110 --> 01:00:31.560
they're going to have CSRF
defenses built into them.

01:00:31.560 --> 01:00:33.780
This is really if you're
building a simple site

01:00:33.780 --> 01:00:35.368
and you don't defend against this.

01:00:35.368 --> 01:00:38.160
Flask, for example, does not defend
against this particularly well,

01:00:38.160 --> 01:00:40.568
but Flask is a very simple
web framework for servers.

01:00:40.568 --> 01:00:43.110
They're generally going to be
much more complicated than that

01:00:43.110 --> 01:00:46.620
and have much more additional
functionality to be more featureful.

01:00:46.620 --> 01:00:48.840
So let's walk through what
these cross-site request

01:00:48.840 --> 01:00:50.280
forgeries might look like.

01:00:50.280 --> 01:00:53.820
And for context, let's imagine
that I send you an email

01:00:53.820 --> 01:00:56.137
asking you to click on some URL.

01:00:56.137 --> 01:00:57.720
So you're going to click on this link.

01:00:57.720 --> 01:00:59.820
It's going to redirect you to some page.

01:00:59.820 --> 01:01:02.310
Maybe that page looks
something like this.

01:01:02.310 --> 01:01:04.470
It's pretty simple,
not much going on here.

01:01:04.470 --> 01:01:05.320
I have a body.

01:01:05.320 --> 01:01:07.500
And inside of it I have one more link.

01:01:07.500 --> 01:01:15.422
And the link is http://hackbank.com/
transfer?to=doug&amt=500.

01:01:15.422 --> 01:01:18.630
Now, perhaps you don't hover over it
and see the link at the beginning of it.

01:01:18.630 --> 01:01:20.960
But maybe you are a
customer of Hack Bank.

01:01:20.960 --> 01:01:24.480
And maybe I know that you're a customer
of Hack Bank such that if you click

01:01:24.480 --> 01:01:28.290
on this link and if you happen to be
logged in, and if you happen to have

01:01:28.290 --> 01:01:32.730
your cookie set for hackbank.com, and
this was the way that they actually

01:01:32.730 --> 01:01:37.650
executed transfers, by having you go
to /transfer and say to whom you want

01:01:37.650 --> 01:01:40.200
to send money and in what amount--

01:01:40.200 --> 01:01:42.938
And fortunately, most banks
don't actually do this.

01:01:42.938 --> 01:01:46.230
Usually, if you're going to do something
that manipulates the database, as this

01:01:46.230 --> 01:01:48.938
would, because it's going to be
transferring some amount of money

01:01:48.938 --> 01:01:51.930
somewhere that would be
via HTTP POST request--

01:01:51.930 --> 01:01:55.530
this is just a straightforward
GET request I'm making here.

01:01:55.530 --> 01:01:57.722
If you were logged in,
though, to Hack Bank,

01:01:57.722 --> 01:01:59.430
or if your cookie
for Hack Bank was set

01:01:59.430 --> 01:02:03.555
and you clicked on this link,
hypothetically, a transfer of $500--

01:02:03.555 --> 01:02:05.430
again, assuming that
this was how you did it,

01:02:05.430 --> 01:02:07.740
you specified a person and
you specified an amount--

01:02:07.740 --> 01:02:13.288
would be transferred from your
account to presumably my account.

01:02:13.288 --> 01:02:15.330
That's probably not
something you intended to do.

01:02:15.330 --> 01:02:18.867
So that would be an example of why
this is a cross-site request forgery.

01:02:18.867 --> 01:02:19.950
It's a legitimate request.

01:02:19.950 --> 01:02:23.130
It appears that you intended to
do this because it came from you.

01:02:23.130 --> 01:02:24.330
It's using your cookie.

01:02:24.330 --> 01:02:28.090
But you didn't actually
intend for it to happen.

01:02:28.090 --> 01:02:29.460
Here's another example.

01:02:29.460 --> 01:02:32.260
You click on the link in my email
and you get brought to this page.

01:02:32.260 --> 01:02:35.250
So there's not actually even a
second link to click anymore.

01:02:35.250 --> 01:02:37.410
Now it's just trying to load an image.

01:02:37.410 --> 01:02:40.660
Now, looking at this URL, we can
tell there's not an image there.

01:02:40.660 --> 01:02:43.920
It doesn't end in .jpeg
or .png or the like.

01:02:43.920 --> 01:02:45.540
It's the same URL as before.

01:02:45.540 --> 01:02:49.397
But my browser sees image source
equals something and says,

01:02:49.397 --> 01:02:51.480
well, I'm at least going
to try and go to that URL

01:02:51.480 --> 01:02:55.040
and see if there is an
image there to load for you.

01:02:55.040 --> 01:02:57.710
Again, you just click on
the link in the email.

01:02:57.710 --> 01:03:00.140
This page loads.

01:03:00.140 --> 01:03:03.320
My browser tries to go to this
page, or your browser in this case

01:03:03.320 --> 01:03:06.230
tries to go to this page
to load the image there.

01:03:06.230 --> 01:03:10.910
But in so doing, it's, again,
executing this unintended transfer,

01:03:10.910 --> 01:03:14.750
relying on your cookie at hackbank.com.

01:03:14.750 --> 01:03:17.120
Another example of this might be a form.

01:03:17.120 --> 01:03:20.120
So again, it appears that you
click on the link in the email.

01:03:20.120 --> 01:03:23.870
You get brought to a form that now
has just a button at the bottom of it

01:03:23.870 --> 01:03:24.892
that says Click Here.

01:03:24.892 --> 01:03:26.600
And the reason it just
has a button, even

01:03:26.600 --> 01:03:31.990
though there's other stuff written, is
that those first two fields are hidden.

01:03:31.990 --> 01:03:35.000
They are type equals hidden,
which means you wouldn't actually

01:03:35.000 --> 01:03:37.040
see them when you load your browser.

01:03:37.040 --> 01:03:40.160
Now, contrast this, for
example, with a field

01:03:40.160 --> 01:03:43.340
whose type is text, which you might
see if you're doing a straightforward

01:03:43.340 --> 01:03:44.090
login.

01:03:44.090 --> 01:03:48.020
You would type characters in and
see the actual characters appear.

01:03:48.020 --> 01:03:50.660
That's text versus a password
field where you would

01:03:50.660 --> 01:03:52.580
type characters in and see all stars.

01:03:52.580 --> 01:03:55.640
It would visually
obscure what you typed.

01:03:55.640 --> 01:03:58.760
The action of this
form, or so to say where

01:03:58.760 --> 01:04:02.313
the form-- what happens when you click
on the Submit button at the bottom

01:04:02.313 --> 01:04:03.230
is the same as before.

01:04:03.230 --> 01:04:06.140
It's hackbank.com/transfer.

01:04:06.140 --> 01:04:07.970
And then I'm using
these parameters here;

01:04:07.970 --> 01:04:13.550
to Doug, the amount of $500, Click Here.

01:04:13.550 --> 01:04:17.090
Now, notice also that I actually
am using a POST request

01:04:17.090 --> 01:04:19.500
to try to initiate this
transfer, again, assuming

01:04:19.500 --> 01:04:24.380
that this was how Hack Bank
structured transfer requests.

01:04:24.380 --> 01:04:27.650
So if you clicked here and this
was otherwise validly structured

01:04:27.650 --> 01:04:31.340
and you were logged in, or your
cookie was valid for Hack Bank,

01:04:31.340 --> 01:04:33.800
then this would initiate
a transfer of $500.

01:04:33.800 --> 01:04:37.850
And I can play another similar trick to
what I did a moment ago with the image

01:04:37.850 --> 01:04:43.070
by doing something like this
where, when the page is loaded,

01:04:43.070 --> 01:04:44.435
instantly submit this form.

01:04:44.435 --> 01:04:46.310
So you don't even have
to click here anymore.

01:04:46.310 --> 01:04:47.630
It's just going to go
through the document,

01:04:47.630 --> 01:04:50.780
document being JavaScript's way of
referring to the entire web page,

01:04:50.780 --> 01:04:53.600
find the first form,
forms[0], assuming

01:04:53.600 --> 01:04:57.380
this is the first form on
the page, and just submit it.

01:04:57.380 --> 01:04:59.840
Doesn't matter what else is going on.

01:04:59.840 --> 01:05:00.860
Just submit this form.

01:05:00.860 --> 01:05:06.110
That would also initiate a transfer if
you clicked on that link from my email.

01:05:06.110 --> 01:05:10.010
So a quick summary of these
two different types of attacks.

01:05:10.010 --> 01:05:12.740
Cross-site scripting
attacks, the adversary

01:05:12.740 --> 01:05:16.940
tricks you into executing code on
your browser to do something locally

01:05:16.940 --> 01:05:19.070
that you probably did not intend.

01:05:19.070 --> 01:05:22.280
And a cross-site request
forgery, something

01:05:22.280 --> 01:05:27.320
that appears to be a legitimate
request from your browser

01:05:27.320 --> 01:05:31.220
because it's relying on cookies, you're
ostensibly logged in in that way,

01:05:31.220 --> 01:05:35.670
but you don't actually
mean to make that request.
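
NOTE A minimal sketch of one common defense against the forged requests described above, assuming the server keeps a random token in each user's session and requires it on every state-changing request. The function names are hypothetical and not part of Flask or any other framework. The idea is that another site can cause the browser to send the cookie, but it cannot read the token.

```python
import hmac
import secrets
from typing import Optional


def new_csrf_token() -> str:
    """Generate an unguessable token to store in the user's session."""
    return secrets.token_urlsafe(32)


def is_request_allowed(method: str, session_token: str,
                       submitted_token: Optional[str]) -> bool:
    """Allow read-only methods; require a matching token for everything else."""
    if method.upper() in ("GET", "HEAD", "OPTIONS"):
        # Assumption: these methods never change data on this server.
        return True
    if not submitted_token:
        return False
    # Constant-time comparison of the two tokens.
    return hmac.compare_digest(session_token.encode("utf-8"),
                               submitted_token.encode("utf-8"))
```

NOTE In this sketch the server would place the token in a hidden field of its own forms and reject any POST that arrives without it. Leaving GET unchecked is only reasonable if, as the lecture recommends, transfers and other database changes are never done by GET.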

01:05:35.670 --> 01:05:37.670
Now let's talk about a
couple of vulnerabilities

01:05:37.670 --> 01:05:40.340
that exist in the context
of a database, which I

01:05:40.340 --> 01:05:42.600
know you've discussed recently as well.

01:05:42.600 --> 01:05:46.170
So imagine that I have a
table of users on my database

01:05:46.170 --> 01:05:49.580
that looks like this, that each of them
has an ID number, they have a username,

01:05:49.580 --> 01:05:51.170
and they have a password.

01:05:51.170 --> 01:05:53.630
Now, the obvious
vulnerability here is I really

01:05:53.630 --> 01:05:57.800
shouldn't be storing my users'
passwords like this in the clear.

01:05:57.800 --> 01:06:01.370
If somebody were to ever hack and
get a hold of this database file,

01:06:01.370 --> 01:06:03.020
that's really, really bad.

01:06:03.020 --> 01:06:08.740
I am not taking best practices to
protect my customers' information.

01:06:08.740 --> 01:06:09.990
So I want to avoid doing that.

01:06:09.990 --> 01:06:14.060
So instead what I might do, as we've
discussed, is hash their passwords,

01:06:14.060 --> 01:06:17.540
run them through some hash function
so that when they're actually stored,

01:06:17.540 --> 01:06:19.880
they get stored looking
something like this.

01:06:19.880 --> 01:06:23.120
You have no idea what the
original password was.

01:06:23.120 --> 01:06:25.050
And because it's a
hash, it's irreversible.

01:06:25.050 --> 01:06:28.280
You should not be able
to undo what I did

01:06:28.280 --> 01:06:30.390
when I ran through the hash function.

01:06:30.390 --> 01:06:33.560
But there's actually still
a vulnerability here.

01:06:33.560 --> 01:06:35.840
And the vulnerability
here is not technical.

01:06:35.840 --> 01:06:38.570
It's human again.

01:06:38.570 --> 01:06:41.785
And the vulnerability that
exists here is that we see--

01:06:41.785 --> 01:06:43.910
we're using a hash function,
so it's deterministic.

01:06:43.910 --> 01:06:47.300
When we pass some data through it, we're
going to get the same output every time

01:06:47.300 --> 01:06:48.810
we pass data through it.

01:06:48.810 --> 01:06:53.900
And two of our users, Charlie
and Eric, have the same hash.

01:06:53.900 --> 01:06:56.390
We saw this makes sense,
because if we go back a moment,

01:06:56.390 --> 01:06:59.840
they also had the same actual password
when it was stored in plain text.

01:06:59.840 --> 01:07:03.530
We've gone out of our way to try and
defend against that by hashing it.

01:07:03.530 --> 01:07:06.860
But somebody who gets a hold of
this database file, for example,

01:07:06.860 --> 01:07:11.750
they hack into it, they get it, they'll
see two people have the same password.

01:07:11.750 --> 01:07:14.540
And maybe this is a very
small subset of my user base.

01:07:14.540 --> 01:07:17.150
And maybe there's hundreds
of thousands of people.

01:07:17.150 --> 01:07:20.720
And maybe 10% of them
all have the same hash.

01:07:20.720 --> 01:07:26.670
Well, again, human beings, we are not
the best at defending our own stuff.

01:07:26.670 --> 01:07:29.090
It's a sad truth that
the most common password

01:07:29.090 --> 01:07:32.997
is password followed by some of these
other examples we had a second ago.

01:07:32.997 --> 01:07:34.580
All of these are pretty bad passwords.

01:07:34.580 --> 01:07:38.990
They're all on the list of some of
the most commonly used passwords

01:07:38.990 --> 01:07:42.920
for all services, which means
that if you see a hash like this,

01:07:42.920 --> 01:07:45.620
it doesn't matter that
we have taken steps

01:07:45.620 --> 01:07:49.130
to protect our users against this.

01:07:49.130 --> 01:07:55.700
If we see a hash like this many, many
times in our database, a clever hacker,

01:07:55.700 --> 01:07:58.732
a clever adversary
might think, oh, well,

01:07:58.732 --> 01:08:00.440
I'm seeing this password
10% of the time,

01:08:00.440 --> 01:08:04.400
so I'm going to guess that Charlie's
password for the service is 12345

01:08:04.400 --> 01:08:05.330
and they're wrong.

01:08:05.330 --> 01:08:08.480
And then they'll maybe try abcdef
and they're wrong, and then maybe try

01:08:08.480 --> 01:08:10.520
password and they're right.

01:08:10.520 --> 01:08:13.910
And then all of a sudden every
time they see that hash, they

01:08:13.910 --> 01:08:18.090
can assume that the password is password
for every single one of those users.
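
NOTE A small sketch of the guessing attack just described, assuming passwords are stored as unsalted SHA-256 digests. The table contents and helper names are made up for illustration. It also shows one widely used mitigation that this segment does not cover, a random per-user salt, so that equal passwords no longer produce equal stored values.

```python
import hashlib
import os
from typing import Dict, List, Optional, Tuple


def unsalted_hash(password: str) -> str:
    """Deterministic: the same password always gives the same digest."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Hypothetical users table; Charlie and Eric chose the same password.
users = {"alice": "hunter2!x", "charlie": "password", "eric": "password"}
stored = {name: unsalted_hash(pw) for name, pw in users.items()}


def crack_with_guesses(stored_hashes: Dict[str, str],
                       guesses: List[str]) -> Dict[str, str]:
    """Hash each guess once, then match it against every stored digest."""
    lookup = {unsalted_hash(g): g for g in guesses}
    return {name: lookup[h] for name, h in stored_hashes.items() if h in lookup}


def salted_hash(password: str,
                salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, 100_000)
    return salt, digest
```

NOTE With the unsalted table, one correct guess of "password" exposes both Charlie and Eric at once. With salted_hash, the two stored digests differ, so a match for one account says nothing about the other, although a weak password can still be guessed for each account individually.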

01:08:18.090 --> 01:08:24.960
So again, nothing we can do as
technologists to solve this problem.

01:08:24.960 --> 01:08:29.510
This is really just
getting folks to understand

01:08:29.510 --> 01:08:33.276
that using different passwords,
using non-standard passwords,

01:08:33.276 --> 01:08:34.109
is really important.

01:08:34.109 --> 01:08:37.067
That's why we talked about password
managers and maybe not even knowing

01:08:37.067 --> 01:08:41.160
your own passwords in a prior lecture.

01:08:41.160 --> 01:08:45.140
There's another problem that can exist,
though, with databases, in particular,

01:08:45.140 --> 01:08:47.120
when we see screens like this.

01:08:47.120 --> 01:08:51.560
So this is a contrived login screen
that has a username and password

01:08:51.560 --> 01:08:55.220
field and a Forgot Password
button whose purpose in life

01:08:55.220 --> 01:08:59.149
is, if you type in your
email address and you--

01:08:59.149 --> 01:09:01.189
which is the username
in this case, and you

01:09:01.189 --> 01:09:05.510
have the Forgot Password box
checked, and you try and click login,

01:09:05.510 --> 01:09:09.418
instead of actually logging you in,
it's going to email you, hopefully,

01:09:09.418 --> 01:09:11.960
a link to your password, not
your actual password for reasons

01:09:11.960 --> 01:09:14.970
we previously discussed as well.

01:09:14.970 --> 01:09:20.640
But what if when we click
on this button we see this?

01:09:20.640 --> 01:09:22.310
OK.

01:09:22.310 --> 01:09:25.520
We've emailed you a link
to change your password.

01:09:25.520 --> 01:09:29.660
Does that seem inherently problematic?

01:09:29.660 --> 01:09:30.479
Perhaps not.

01:09:30.479 --> 01:09:34.600
But what about if you see this as well?

01:09:34.600 --> 01:09:37.100
Somebody might see this if
they're logged in as well.

01:09:37.100 --> 01:09:40.490
Sorry, no user with that email address.

01:09:40.490 --> 01:09:44.870
Does that perhaps seem problematic
when you compare it against this?

01:09:44.870 --> 01:09:48.350
This is an example of something
called information leakage.

01:09:48.350 --> 01:09:51.710
Perhaps an adversary has
hacked some other database

01:09:51.710 --> 01:09:55.040
where folks were not being
as secure with credentials.

01:09:55.040 --> 01:09:58.970
And so they have a whole set of email
addresses mapped to credentials.

01:09:58.970 --> 01:10:02.570
And because human beings tend
to reuse the same credentials

01:10:02.570 --> 01:10:06.650
on multiple different services,
they are trying different services

01:10:06.650 --> 01:10:09.170
that they believe that
these users might also

01:10:09.170 --> 01:10:13.550
use using those same username
and password combinations.

01:10:13.550 --> 01:10:18.860
If this is the way that we field these
types of forgot password inquiries,

01:10:18.860 --> 01:10:22.130
we're revealing some
information potentially.

01:10:22.130 --> 01:10:27.650
If Alice is a user, we're now
saying, yes, Alice is a user of this.

01:10:27.650 --> 01:10:29.300
Try this password.

01:10:29.300 --> 01:10:34.490
If we get something like this, then
the adversary might not bother trying.

01:10:34.490 --> 01:10:37.820
They've realized, oh, Alice
is not a user of this service.

01:10:37.820 --> 01:10:41.720
And even if they're not trying to hack
into it, if we do something like this,

01:10:41.720 --> 01:10:45.230
we're also telling that adversary
quite a bit about Alice.

01:10:45.230 --> 01:10:49.340
Now we know Alice uses this service,
and this service, and this service,

01:10:49.340 --> 01:10:50.600
and not this service.

01:10:50.600 --> 01:10:54.050
And they can sort of create a
picture of who Alice might be.

01:10:54.050 --> 01:11:00.398
They're sort of using her digital
footprint to understand more about her.

01:11:00.398 --> 01:11:03.190
A better response in this case
might be to say something like this,

01:11:03.190 --> 01:11:04.550
request received.

01:11:04.550 --> 01:11:07.702
If you're in our system, you'll receive
an email with instructions shortly.

01:11:07.702 --> 01:11:09.410
That's not tipping
our hand either way as

01:11:09.410 --> 01:11:12.890
to whether the user is in the
database or not in the database.

01:11:12.890 --> 01:11:15.860
No information leakage here,
and generally a better way

01:11:15.860 --> 01:11:19.610
to protect our customer's privacy.
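
NOTE A minimal sketch contrasting the leaky and the uniform forgot-password responses discussed above. The set of registered emails and the outbox list are hypothetical stand-ins for a users table and a mail sender; the message strings follow the ones read out in the lecture.

```python
from typing import List

# Hypothetical stand-in for the users table.
REGISTERED_EMAILS = {"alice@example.com"}

GENERIC_REPLY = ("Request received. If you're in our system, "
                 "you'll receive an email with instructions shortly.")


def leaky_reply(email: str) -> str:
    """Reveals whether the address belongs to a user."""
    if email in REGISTERED_EMAILS:
        return "OK. We've emailed you a link to change your password."
    return "Sorry, no user with that email address."


def uniform_reply(email: str, outbox: List[str]) -> str:
    """Sends a reset email only to real users, but always answers the same way."""
    if email in REGISTERED_EMAILS:
        outbox.append(email)  # stand-in for actually sending the reset link
    return GENERIC_REPLY
```

NOTE An adversary probing uniform_reply sees identical text for every address, so the page itself no longer confirms who has an account.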

01:11:19.610 --> 01:11:22.850
Now, that's not the only problem
that we can have with databases.

01:11:22.850 --> 01:11:25.610
We've alluded to this
idea of SQL injection.

01:11:25.610 --> 01:11:28.100
And there's this comic that
makes the rounds quite a bit

01:11:28.100 --> 01:11:30.620
when we talk about SQL injection
from a web comic called

01:11:30.620 --> 01:11:35.240
XKCD that involves a SQL injection
attack, which is basically

01:11:35.240 --> 01:11:39.080
providing some information that--

01:11:39.080 --> 01:11:42.670
or providing some text or some query
that we want to make to a database

01:11:42.670 --> 01:11:46.690
where that query actually
does something unintended.

01:11:46.690 --> 01:11:50.700
It actually itself is SQL as opposed
to just plugging in some parameter,

01:11:50.700 --> 01:11:53.750
like what is your name, and then
searching the database for that name.

01:11:53.750 --> 01:11:55.708
Instead of giving you my
name, I might give you

01:11:55.708 --> 01:11:58.040
something that is actually
a SQL query that's

01:11:58.040 --> 01:12:01.050
going to be executed that
you don't want me to execute.

01:12:01.050 --> 01:12:03.750
So let's see an example
of how this might work.

01:12:03.750 --> 01:12:07.800
So here's another simple
username and password field.

01:12:07.800 --> 01:12:11.580
And in this example, I've written my
password field poorly intentionally

01:12:11.580 --> 01:12:14.000
for purposes of the example
so that it will actually

01:12:14.000 --> 01:12:16.970
show you the text that is
typed as opposed to showing

01:12:16.970 --> 01:12:19.640
you stars like a password field should.

01:12:19.640 --> 01:12:23.300
So this is something that the user
sees when they access my site.

01:12:23.300 --> 01:12:26.718
And perhaps on the back end in the
server-side code, inside of Python

01:12:26.718 --> 01:12:29.510
somewhere I have written a SQL
query that looks like the following.

01:12:29.510 --> 01:12:35.540
When the login button is clicked,
execute the following SQL query.

01:12:35.540 --> 01:12:40.040
SELECT star from users where
username equals uname--

01:12:40.040 --> 01:12:45.230
and uname here in yellow referring
to whatever was typed in this box--

01:12:45.230 --> 01:12:48.050
and password equals
pword, where, again, pword

01:12:48.050 --> 01:12:51.140
is referring to whatever
was typed in this box.

01:12:51.140 --> 01:12:54.120
So we're doing a SQL query
to select star from users,

01:12:54.120 --> 01:12:57.360
get all of the information
from the users table

01:12:57.360 --> 01:13:01.170
where the username equals
whatever they typed in that box

01:13:01.170 --> 01:13:05.560
and the password equals
whatever they typed in that box.

01:13:05.560 --> 01:13:07.410
And so, for example,
if I have somebody who

01:13:07.410 --> 01:13:09.810
logs in with the username
Alice and the password

01:13:09.810 --> 01:13:14.580
12345, what the query would actually
look like with these values plugged

01:13:14.580 --> 01:13:19.920
into it might look something like this;
SELECT star from users where username

01:13:19.920 --> 01:13:25.200
equals Alice and password equals 12345.

01:13:25.200 --> 01:13:30.420
If there is nobody with username Alice
or Alice's password is not 12345,

01:13:30.420 --> 01:13:31.770
then this will fail.

01:13:31.770 --> 01:13:34.890
Both of those conditions
need to be true.

01:13:34.890 --> 01:13:37.890
But what about this?

01:13:37.890 --> 01:13:46.800
Someone whose username is hacker and
their password is 1' or '1' equals '1.

01:13:49.800 --> 01:13:51.848
That looks pretty weird.

01:13:51.848 --> 01:13:53.640
And the reason that
that looks pretty weird

01:13:53.640 --> 01:13:57.390
is because this is an
attempt to inject SQL,

01:13:57.390 --> 01:14:02.820
to trick SQL into doing something that
is presumably not intended by the code

01:14:02.820 --> 01:14:04.050
that we wrote.

01:14:04.050 --> 01:14:07.980
Now, it probably helps to take a
look at it plugging the data in

01:14:07.980 --> 01:14:11.580
to see what exactly this is going to do.

01:14:11.580 --> 01:14:16.270
SELECT star from users where
username equals hacker or--

01:14:16.270 --> 01:14:23.190
excuse me, and password equals
'1' or and so on and so on.

01:14:26.880 --> 01:14:30.180
Maybe I do have a person whose
username actually is hacker,

01:14:30.180 --> 01:14:33.000
but that's probably not their password.

01:14:33.000 --> 01:14:34.050
That doesn't matter.

01:14:34.050 --> 01:14:37.350
I'm still going to be
able to log in if I

01:14:37.350 --> 01:14:39.140
have somebody whose username is hacker.

01:14:39.140 --> 01:14:41.850
And the reason for that
is because of this or.

01:14:41.850 --> 01:14:45.780
I have sort of short circuited
the end of the SQL query.

01:14:45.780 --> 01:14:50.370
I have this quote mark that demarcates
the end of what the user presumably

01:14:50.370 --> 01:14:51.780
typed in.

01:14:51.780 --> 01:14:54.660
But I've actually literally
typed those into my password

01:14:54.660 --> 01:14:59.060
to trick SQL such that if
hacker's password equals 1,

01:14:59.060 --> 01:15:03.420
it just happens to literally be the
character 1, OK, I have succeeded.

01:15:03.420 --> 01:15:05.250
I guess that's a really
bad password, and I

01:15:05.250 --> 01:15:08.100
shouldn't be able to log in that
way, but maybe that is the case

01:15:08.100 --> 01:15:09.060
and I'm able to log in.

01:15:09.060 --> 01:15:13.560
But even if not, this
other thing is true.

01:15:13.560 --> 01:15:18.660
'1' does equal '1'.

01:15:18.660 --> 01:15:23.030
So as long as somebody whose username
is hacker exists in the database,

01:15:23.030 --> 01:15:27.330
I am now able to log in as
hacker because this is true.

01:15:27.330 --> 01:15:29.230
This part's probably not true, right?

01:15:29.230 --> 01:15:31.860
It's unlikely that their password is 1.

01:15:31.860 --> 01:15:36.960
Regardless of what their password
is, this part actually is true.

01:15:36.960 --> 01:15:40.200
It's a very simple SQL injection attack.

01:15:40.200 --> 01:15:44.490
I'm basically logging in as someone
who I'm presumably not supposed

01:15:44.490 --> 01:15:48.780
to be able to log in as, but it
illustrates the kind of thing

01:15:48.780 --> 01:15:50.550
that could happen.

01:15:50.550 --> 01:15:54.450
You are allowing people
to bypass logins.

01:15:54.450 --> 01:15:59.100
Now, it could get worse if your
database administrator username

01:15:59.100 --> 01:16:01.710
is admin or something very common.

01:16:01.710 --> 01:16:04.683
The default for this is typically admin.

01:16:04.683 --> 01:16:06.600
This would potentially
give people the ability

01:16:06.600 --> 01:16:08.760
to be database
administrators, that they're

01:16:08.760 --> 01:16:14.370
able to execute exactly this
kind of trick on the admin user.

01:16:14.370 --> 01:16:16.830
Now they have administrative
access to your database, which

01:16:16.830 --> 01:16:19.580
means they can do things like
manipulate the data in the database,

01:16:19.580 --> 01:16:23.350
change things, add things, delete things
that you don't want to have deleted.

01:16:23.350 --> 01:16:28.170
And in the case of a database,
deletion is pretty permanent.

01:16:28.170 --> 01:16:32.580
You can't undo a delete most
of the time in a database

01:16:32.580 --> 01:16:35.890
the way you might be
able to do with other files.

01:16:35.890 --> 01:16:38.430
Now, are there techniques to
avoid this kind of attack?

01:16:38.430 --> 01:16:40.108
Fortunately, there are.

01:16:40.108 --> 01:16:42.900
Right now I'd like to just
take a look at a very simple Python

01:16:42.900 --> 01:16:45.720
program that replicates
the kind of thing

01:16:45.720 --> 01:16:50.080
that one could do in a more
robust, more complex SQL situation.

01:16:50.080 --> 01:16:52.080
So let's pull up a program
here where we're just

01:16:52.080 --> 01:16:54.870
simulating this idea
of a SQL injection just

01:16:54.870 --> 01:17:00.230
to show you how it's not that
difficult to defend against it.

01:17:00.230 --> 01:17:03.840
So let's pull up the code
here in this file login.py.

01:17:03.840 --> 01:17:06.060
So there's not that much going on here.

01:17:06.060 --> 01:17:07.950
I have x equals input username.

01:17:07.950 --> 01:17:10.920
So x, recall, is a Python variable.

01:17:10.920 --> 01:17:14.460
And input username is basically going
to prompt the user with the string

01:17:14.460 --> 01:17:17.405
username and then expect them
to type something after that.

01:17:17.405 --> 01:17:19.530
And then we do exactly the
same thing with password

01:17:19.530 --> 01:17:21.270
except storing the result there in y.

01:17:21.270 --> 01:17:24.000
So whatever the user types after
username will get stored in x.

01:17:24.000 --> 01:17:27.270
Whatever they type after
password will get stored in y.

01:17:27.270 --> 01:17:29.030
And then here I'm just going to print.

01:17:29.030 --> 01:17:33.310
And in the SQL context, this would be
the query that actually gets executed.

01:17:33.310 --> 01:17:35.610
So imagine that that's
what's happening instead.

01:17:35.610 --> 01:17:39.850
SELECT star from users where username
equals and then this symbol here,

01:17:39.850 --> 01:17:40.350
'{x}'.

01:17:44.180 --> 01:17:46.680
What I'm doing here is just
using a Python-formatted string.

01:17:46.680 --> 01:17:48.560
That's what this f
here-- it's not a typo--

01:17:48.560 --> 01:17:51.810
at the beginning means, is I'm going to
plug in whatever the person, the user,

01:17:51.810 --> 01:17:55.640
typed at the first prompt,
which I stored in x here,

01:17:55.640 --> 01:17:59.933
and whatever the user typed at the
second prompt, which I stored in y there.

01:17:59.933 --> 01:18:01.600
So let's actually just run this program.

01:18:01.600 --> 01:18:03.980
So let's pop open here for a second.

01:18:03.980 --> 01:18:07.780
The name of this program is
login.py, so I'm going to type python

01:18:07.780 --> 01:18:10.880
login.py, Enter.

01:18:10.880 --> 01:18:13.290
Username, Doug.

01:18:13.290 --> 01:18:16.308
Password, 12345.

01:18:16.308 --> 01:18:19.600
And then the query, hypothetically, that
would get executed if I constructed it

01:18:19.600 --> 01:18:22.480
in this way is SELECT star
from users where username

01:18:22.480 --> 01:18:25.210
equals Doug and password equals 12345.

01:18:25.210 --> 01:18:26.320
Seems reasonable.

01:18:26.320 --> 01:18:30.130
But if I try and do the adversary
thing that I did a moment ago,

01:18:30.130 --> 01:18:38.380
username equals Doug, password
equals 1' or '1' equals '1, not

01:18:38.380 --> 01:18:42.850
a final single quote, and I hit
Enter, then I end up with SELECT star

01:18:42.850 --> 01:18:49.865
from users where username equals Doug
and password equals 1 or 1 equals 1.

01:18:49.865 --> 01:18:52.000
And the latter part of that is true.

01:18:52.000 --> 01:18:53.890
The former part is false.

01:18:53.890 --> 01:18:56.860
But it's good enough that
I would be able to log in

01:18:56.860 --> 01:18:59.650
if I did something like that.
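
NOTE A reconstruction of what login.py is described as doing, not the file shown on screen. The lecture's version reads x and y with input(); here they are function arguments so the two runs above can be reproduced side by side. The exact spacing and capitalization of the query text are assumptions.

```python
def build_query(x: str, y: str) -> str:
    """Paste user input directly into the SQL text (the unsafe pattern)."""
    return f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'"


print(build_query("Doug", "12345"))
# SELECT * FROM users WHERE username = 'Doug' AND password = '12345'

print(build_query("Doug", "1' OR '1' = '1"))
# SELECT * FROM users WHERE username = 'Doug' AND password = '1' OR '1' = '1'
```

NOTE In the second query the typed single quotes close the string early, so OR '1' = '1' is read as SQL rather than as part of the password.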

01:18:59.650 --> 01:19:02.200
But we want to try and get around that.

01:19:02.200 --> 01:19:05.200
So now let's take a look at a second
file that might solve this problem.

01:19:05.200 --> 01:19:11.380
So I'm going to open up
login2.py in my editor here.

01:19:11.380 --> 01:19:15.610
So now it starts out exactly the same,
x equals something, y equals something.

01:19:15.610 --> 01:19:18.640
But I'm making a pretty
basic substitution.

01:19:18.640 --> 01:19:23.020
I'm replacing every time that I see
single quotes with double quotes.

01:19:23.020 --> 01:19:25.050
So I'm replacing every
instance of single quote,

01:19:25.050 --> 01:19:26.800
and I have to preface
it with a backslash.

01:19:26.800 --> 01:19:30.160
Because notice I'm actually using
single quotes to identify the character.

01:19:30.160 --> 01:19:33.880
It just so happens that it's to indicate
that I'm trying to substitute something

01:19:33.880 --> 01:19:35.350
which I'm putting in single quotes.

01:19:35.350 --> 01:19:38.440
The thing I'm trying to substitute
actually is a single quote,

01:19:38.440 --> 01:19:42.130
and so I need to put a
backslash in front of it

01:19:42.130 --> 01:19:44.440
to escape that character
such that it actually

01:19:44.440 --> 01:19:48.310
gets treated as a single quotation
mark character as opposed

01:19:48.310 --> 01:19:50.308
to some special Python--

01:19:50.308 --> 01:19:52.850
Python's not going to try and
interpret it in some other way.

01:19:52.850 --> 01:19:56.890
So I want to replace every instance of
a single quote in x with a double quote,

01:19:56.890 --> 01:20:00.010
and I want to replace every
instance of a single quote in y

01:20:00.010 --> 01:20:01.030
with a double quote.

01:20:01.030 --> 01:20:02.650
Now, why do I want to do that?

01:20:02.650 --> 01:20:07.240
Because notice in my
actual Python string here

01:20:07.240 --> 01:20:12.670
I'm using single quotes to set
off the variables for purposes

01:20:12.670 --> 01:20:14.290
of SQL's interpretation of them.

01:20:14.290 --> 01:20:16.520
So where the user name
equals this string,

01:20:16.520 --> 01:20:18.830
I'm using single quotes to do that.

01:20:18.830 --> 01:20:23.920
So if my username or my password
also contained single quotation mark

01:20:23.920 --> 01:20:27.430
characters, when SQL
was interpreting it,

01:20:27.430 --> 01:20:32.080
it might think that the next single
quote character it sees is the end.

01:20:32.080 --> 01:20:34.300
I'm done with what I've prompted.

01:20:34.300 --> 01:20:37.420
And that's exactly how I tricked
it in the previous example.

01:20:37.420 --> 01:20:40.930
I used that first single quote,
which seemed kind of random and out

01:20:40.930 --> 01:20:44.380
of nowhere, to trick SQL into
thinking I'm done with this.

01:20:44.380 --> 01:20:48.850
Then I used the keyword or, back
now in SQL and not in some string

01:20:48.850 --> 01:20:52.570
that I'm searching for, and then I
would continue this trick going forward.

01:20:52.570 --> 01:20:55.732
So this is designed to
eliminate all the single quotes,

01:20:55.732 --> 01:20:57.940
because the single quotes
mean something very special

01:20:57.940 --> 01:21:01.510
in the context of my SQL query itself.

01:21:01.510 --> 01:21:06.610
If you're actually using SQL
libraries that are tied into Python,

01:21:06.610 --> 01:21:11.108
the ability to replace things is
much more robust than this example.

01:21:11.108 --> 01:21:12.900
But even this very
simple example where I'm

01:21:12.900 --> 01:21:16.480
doing just this very basic
substitution is good enough

01:21:16.480 --> 01:21:20.390
to get around the injection
attack that we just looked at.
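
NOTE A reconstruction of the substitution login2.py is described as making, again with function arguments in place of input(). The helper name is hypothetical. This is the lecture's deliberately simple illustration, not a complete defense.

```python
def neutralize_quotes(value: str) -> str:
    """Replace every single quote with a double quote."""
    return value.replace('\'', '"')


def build_query(x: str, y: str) -> str:
    """Build the query only after the single quotes have been replaced."""
    x = neutralize_quotes(x)
    y = neutralize_quotes(y)
    return f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'"


print(build_query("Doug", "1' OR '1' = '1"))
# SELECT * FROM users WHERE username = 'Doug' AND password = '1" OR "1" = "1'
```

NOTE The only single quotes left are the ones the program wrote itself, so everything the user typed stays inside the password string.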

01:21:20.390 --> 01:21:23.350
So this is now in login2.py.

01:21:23.350 --> 01:21:24.520
Let's do this.

01:21:24.520 --> 01:21:26.895
Let's Python login2.py.

01:21:26.895 --> 01:21:28.270
And we'll start out the same way.

01:21:28.270 --> 01:21:30.890
We'll do Doug and 12345.

01:21:30.890 --> 01:21:32.895
And it appears that nothing has changed.

01:21:32.895 --> 01:21:35.020
The behavior is otherwise
identical because I'm not

01:21:35.020 --> 01:21:36.730
trying to do any tricks like that.

01:21:36.730 --> 01:21:41.440
SELECT star from users where username
equals Doug and password equals 12345.

01:21:41.440 --> 01:21:45.250
But if I now try that same
trick that I did a moment ago,

01:21:45.250 --> 01:21:55.090
so password is 1' or '1'
equals '1 and I hit Enter,

01:21:55.090 --> 01:21:59.020
now I'm not subject to that same SQL
injection anymore because I'm trying

01:21:59.020 --> 01:22:02.800
to select all the information from the
users table where the username is Doug

01:22:02.800 --> 01:22:03.970
and the password equals--

01:22:03.970 --> 01:22:06.950
And notice that here is
the first single quote.

01:22:06.950 --> 01:22:08.440
Here is the second one.

01:22:08.440 --> 01:22:11.770
So it's thinking that entire
thing now is the password.

01:22:11.770 --> 01:22:20.468
Only if my password is
literally 1" or "1" equals "1,

01:22:20.468 --> 01:22:22.010
then I would be literally logging in.

01:22:22.010 --> 01:22:23.980
If that happened to be my
password, this would work.

01:22:23.980 --> 01:22:25.150
But otherwise I've escaped.

01:22:25.150 --> 01:22:28.630
I've stopped the adversary
from being able to leverage

01:22:28.630 --> 01:22:33.080
a simple trick like this
to break in to my database

01:22:33.080 --> 01:22:34.930
when perhaps they're
not intended to do so.
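
A minimal sketch of the kind of basic substitution described here. The contents of login2.py are not shown, so the function name build_query and the choice to swap each single quote for a double quote are assumptions for illustration, inferred from the query text read aloud above.

```python
def build_query(username: str, password: str) -> str:
    """Build the login query after a naive quote substitution."""
    # Assumption: replace each single quote with a double quote so that
    # user input cannot close the SQL string literal early.
    safe_user = username.replace("'", '"')
    safe_pass = password.replace("'", '"')
    return (
        "SELECT * FROM users "
        f"WHERE username = '{safe_user}' AND password = '{safe_pass}'"
    )


# The injection attempt now stays inside one pair of single quotes.
print(build_query("Doug", "1' or '1' = '1"))
```

With the substitution in place, the only single quotes left in the query are the two that wrap each value, so the whole input is compared as one literal password.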

01:22:34.930 --> 01:22:41.140
And again, in actual SQL injection
defense, the substitutions that we make

01:22:41.140 --> 01:22:42.640
are much more complicated than this.

01:22:42.640 --> 01:22:45.932
We're not just looking for single quote
characters and double quote characters,

01:22:45.932 --> 01:22:48.610
but we're considering semicolons
or any other special characters

01:22:48.610 --> 01:22:51.460
that SQL would interpret
as part of a statement.

01:22:51.460 --> 01:22:53.900
We can escape those out so
that users could literally

01:22:53.900 --> 01:22:59.720
use single quotes or semicolons
or the like in their passwords

01:22:59.720 --> 01:23:03.160
without necessarily compromising
the integrity of the entire database

01:23:03.160 --> 01:23:04.510
overall.
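
One common alternative to hand-written escaping is to let the database driver bind the values. The sketch below uses Python's built-in sqlite3 module with `?` placeholders. The in-memory database, the users table layout, and the check_login helper are assumptions made for illustration, and storing a plain-text password is only to mirror the example above.

```python
import sqlite3


def check_login(conn: sqlite3.Connection, username: str, password: str) -> bool:
    """Return True if a row matches both values exactly."""
    # The ? placeholders pass the values separately from the SQL text,
    # so quotes or semicolons in the input are treated as plain data.
    cur = conn.execute(
        "SELECT 1 FROM users WHERE username = ? AND password = ?",
        (username, password),
    )
    return cur.fetchone() is not None


# Hypothetical in-memory table standing in for the real users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("Doug", "12345"))

print(check_login(conn, "Doug", "12345"))           # True
print(check_login(conn, "Doug", "1' or '1' = '1"))  # False
```

Because the input never becomes part of the statement text, a password that really does contain single quotes or semicolons can still be stored and matched.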

01:23:04.510 --> 01:23:08.480
So we've taken a look at several of
the most common, most obvious ways

01:23:08.480 --> 01:23:11.180
that an adversary might be
able to extract information

01:23:11.180 --> 01:23:13.910
either from a business or an individual.

01:23:13.910 --> 01:23:17.660
And these ways are kind of
attention-getting in some context.

01:23:17.660 --> 01:23:19.880
But let's focus now-- let's
go back and bring things

01:23:19.880 --> 01:23:22.280
full circle to something
I've mentioned many times,

01:23:22.280 --> 01:23:28.400
which is humans are the core fatal
flaw in all of these security things

01:23:28.400 --> 01:23:29.800
that we're dealing with here.

01:23:29.800 --> 01:23:31.800
And so let's bring things
full circle by talking

01:23:31.800 --> 01:23:34.220
about phishing, what phishing is.

01:23:34.220 --> 01:23:39.140
So phishing is just an attempt
by an adversary to prey upon us

01:23:39.140 --> 01:23:45.440
and our unfortunate general ignorance
of basic security protocols.

01:23:45.440 --> 01:23:47.900
So it's just an attempt
to socially engineer,

01:23:47.900 --> 01:23:49.730
basically, information out of someone.

01:23:49.730 --> 01:23:52.460
You pretend to be
someone that you are not.

01:23:52.460 --> 01:23:54.710
And if you do so
convincingly enough, you

01:23:54.710 --> 01:23:58.190
might be able to extract
information about that person.

01:23:58.190 --> 01:24:01.053
Now, phishing you'll also see
in other contexts that are--

01:24:01.053 --> 01:24:03.470
computer scientists like to
be clever with their wordplay.

01:24:03.470 --> 01:24:06.800
You'll see things like netting, which
is basically a phishing attack that

01:24:06.800 --> 01:24:08.780
launches against many
people at once, hoping

01:24:08.780 --> 01:24:11.060
they'll be able to get one or two.

01:24:11.060 --> 01:24:13.400
There's spear phishing,
which is a phishing

01:24:13.400 --> 01:24:17.240
attack that targets one specific person
trying to get information from them.

01:24:17.240 --> 01:24:20.090
And then there's whaling,
which is a phishing attack that

01:24:20.090 --> 01:24:23.330
is targeted against somebody who is
perceived to have a lot of information

01:24:23.330 --> 01:24:25.413
or whose information is
particularly valuable such

01:24:25.413 --> 01:24:28.820
that you'd be phishing
for some big whale.

01:24:28.820 --> 01:24:31.730
Now, one of the most obvious and
easy types of phishing attack

01:24:31.730 --> 01:24:32.900
looks like this.

01:24:32.900 --> 01:24:35.450
It's a simple URL substitution.

01:24:35.450 --> 01:24:39.590
This is how we can write a link in HTML.

01:24:39.590 --> 01:24:43.480
A is the HTML tag for anchor,
which we use for hyperlinks.

01:24:43.480 --> 01:24:46.460
Href is where we are going to.

01:24:46.460 --> 01:24:50.660
And then we also have the ability to
specify some text at the end of that.

01:24:50.660 --> 01:24:54.830
These two items do not have
to match, as you can see here.

01:24:54.830 --> 01:25:02.750
I can say we're going to URL2
but actually send you to URL1.

01:25:02.750 --> 01:25:08.420
This is an incredibly common way
to get information from somebody.

01:25:08.420 --> 01:25:12.830
They think they're going one place but
they're actually going someplace else.
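
To make the mismatch concrete, here is a rough sketch that scans HTML for links whose visible text is a URL on one host while the href points at another. The hosts url1.example and url2.example are placeholders standing in for URL1 and URL2, and the class and helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def host_of(value):
    """Return the lowercase hostname if value is an absolute URL, else None."""
    return urlparse(value.strip()).hostname if "://" in value else None


class LinkMismatchFinder(HTMLParser):
    """Collect anchors whose visible text names a different host than href."""

    def __init__(self):
        super().__init__()
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text pieces seen inside that <a>
        self.mismatches = []  # (visible_text, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            shown, actual = host_of(text), host_of(self._href)
            # Only flag links whose text is itself a URL on another host.
            if shown and actual and shown != actual:
                self.mismatches.append((text, self._href))
            self._href = None


finder = LinkMismatchFinder()
finder.feed('<a href="https://url1.example/">https://url2.example/</a>')
print(finder.mismatches)
```

Links whose text is ordinary wording rather than a URL are ignored here, so this only catches the specific substitution shown on the slide.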

01:25:12.830 --> 01:25:16.430
And to show you, as a very
basic example, just how easy it

01:25:16.430 --> 01:25:21.560
is to potentially trick somebody into
going somewhere they're not supposed to

01:25:21.560 --> 01:25:25.220
and potentially then
revealing credentials as well,

01:25:25.220 --> 01:25:28.580
let's just take a simple
example here with Facebook.

01:25:28.580 --> 01:25:31.970
And why don't we just take a moment
to build our own version of Facebook

01:25:31.970 --> 01:25:36.410
and see if we can't get somebody to
potentially reveal information to us?

01:25:36.410 --> 01:25:38.750
So let's imagine that I
have acquired some domain

01:25:38.750 --> 01:25:41.390
name that's really
similar to Facebook.com,

01:25:41.390 --> 01:25:44.150
like it's off by one character.

01:25:44.150 --> 01:25:45.350
It's a common typo.

01:25:45.350 --> 01:25:48.198
For example fs maybe is a common thing.

01:25:48.198 --> 01:25:49.990
People mistype the A
or something like that

01:25:49.990 --> 01:25:54.800
that would be really not necessarily
obvious to somebody at the outset.

01:25:54.800 --> 01:25:59.240
One way that I might be able to just
take advantage of somebody's thinking

01:25:59.240 --> 01:26:01.670
that they're logging into
Facebook is to make a page that

01:26:01.670 --> 01:26:05.150
looks exactly the same as Facebook.

01:26:05.150 --> 01:26:07.640
That's actually not
very difficult to do.

01:26:07.640 --> 01:26:09.680
All you have to do is
open up Facebook here.

01:26:09.680 --> 01:26:14.720
And because its HTML is available
to me, I can right click on it,

01:26:14.720 --> 01:26:18.530
view page source, take
a second to load here--

01:26:18.530 --> 01:26:20.480
Facebook is a pretty big site--

01:26:20.480 --> 01:26:27.080
and then I can just control A, copy,
select all, copy all of the content,

01:26:27.080 --> 01:26:33.500
and paste this in to my
index.html, and we will save.

01:26:36.140 --> 01:26:40.970
And then we'll head back
into our terminal here,

01:26:40.970 --> 01:26:45.170
and I will start Chrome on
the file index.html, which

01:26:45.170 --> 01:26:49.400
is the file that I literally just
saved my Facebook information in.

01:26:49.400 --> 01:26:51.040
So start Chrome index.html.

01:26:51.040 --> 01:26:53.360
You'll notice that it
brings me to this URL

01:26:53.360 --> 01:26:56.670
here, which is the file
for where I currently live,

01:26:56.670 --> 01:26:58.310
or where this file currently lives.

01:26:58.310 --> 01:27:00.920
And this page looks like Facebook,
except for the fact that,

01:27:00.920 --> 01:27:04.220
when I log in, I then
get redirected back

01:27:04.220 --> 01:27:07.370
to something that actually is Facebook
and is not something that I control.

01:27:07.370 --> 01:27:10.820
But at the outset, my page
here at the very beginning

01:27:10.820 --> 01:27:14.810
looks identical to Facebook.

01:27:14.810 --> 01:27:16.790
Now, the trick here
would be to do something

01:27:16.790 --> 01:27:20.780
so that the user would provide
information here in the email box

01:27:20.780 --> 01:27:24.397
and then here in the password field
such that when they click Login,

01:27:24.397 --> 01:27:26.480
I might be able to get
that information from them.

01:27:26.480 --> 01:27:30.500
Maybe I just am waiting to
capture their information.

01:27:30.500 --> 01:27:35.450
So the next step for me might be to go
back into my random set of stuff here.

01:27:35.450 --> 01:27:38.570
There's a lot of random code
that we don't really care about.

01:27:38.570 --> 01:27:41.030
But the one thing I do care
about is what happens when

01:27:41.030 --> 01:27:43.790
somebody clicks on this Login button.

01:27:43.790 --> 01:27:45.590
That is interesting to me.

01:27:45.590 --> 01:27:48.230
So I'm going to go through
this and just do control F,

01:27:48.230 --> 01:27:51.968
control F just being
find, the string login.

01:27:51.968 --> 01:27:54.260
That's the text that's
literally written on the button,

01:27:54.260 --> 01:27:55.843
so hopefully I'll find that somewhere.

01:27:55.843 --> 01:27:58.160
I'm told I have eight results.

01:27:58.160 --> 01:27:59.990
So this is, if I just
kind of look around

01:27:59.990 --> 01:28:01.698
for context to try
and figure out where I

01:28:01.698 --> 01:28:05.660
am in the code, the title of
something, so that's probably not it.

01:28:05.660 --> 01:28:07.180
So I don't want to go there.

01:28:07.180 --> 01:28:10.640
Create an account or login,
not quite what I'm looking for.

01:28:10.640 --> 01:28:12.620
So go to the next one.

01:28:12.620 --> 01:28:15.890
OK, here we go, input
value equals login.

01:28:15.890 --> 01:28:18.680
So now I found an input
that is called login.

01:28:18.680 --> 01:28:22.110
So this is presumably a button
that's presumably part of some form.

01:28:22.110 --> 01:28:25.820
So if I scroll up a little
bit higher, hopefully I

01:28:25.820 --> 01:28:29.570
will find a form, which I do, form ID.

01:28:29.570 --> 01:28:30.920
And it has an action.

01:28:30.920 --> 01:28:34.040
The action is to go to
this particular page,

01:28:34.040 --> 01:28:37.310
facebook.com/login/ and so on and so on.

01:28:37.310 --> 01:28:39.820
But maybe I want to
send it somewhere else.

01:28:39.820 --> 01:28:44.000
So if I replace this entire URL with
where I actually want to send the user,

01:28:44.000 --> 01:28:46.160
where maybe I'm going to
capture their information,

01:28:46.160 --> 01:28:49.220
maybe I'll store this in login.html.

01:28:49.220 --> 01:28:51.140
And so that's what's
going to come in here.

01:28:51.140 --> 01:28:56.210
And then we'll save the file such
that our changes have been captured.

01:28:56.210 --> 01:28:58.370
So presumably what should
happen is now, when

01:28:58.370 --> 01:29:02.420
you click on the Login
button in my fake Facebook,

01:29:02.420 --> 01:29:08.000
you instead get redirected to login.html
rather than the Facebook actual login

01:29:08.000 --> 01:29:10.458
as we saw just a moment ago.
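
The form's action attribute decides where the typed credentials are sent, so it is the thing worth inspecting. A small sketch, assuming a saved copy of a page, that lists each form's resolved action and flags any that do not go to the expected host. The host social.example, the form id, and the function names are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormActionCollector(HTMLParser):
    """Record the absolute URL each <form> on a page would submit to."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing action means the form submits to the page itself.
            action = dict(attrs).get("action") or ""
            self.actions.append(urljoin(self.page_url, action))


def off_site_actions(html, page_url, expected_host):
    """Return form targets whose hostname differs from expected_host."""
    collector = FormActionCollector(page_url)
    collector.feed(html)
    return [a for a in collector.actions if urlparse(a).hostname != expected_host]


# Hypothetical copied page whose form action was changed to a local file.
copied = '<form id="login_form" action="login.html"><input type="submit" value="Log In"></form>'
print(off_site_actions(copied, "file:///home/user/index.html", "social.example"))
```

For the copied page, the relative action resolves next to index.html on the local disk rather than on the real site, which is exactly the redirect the demo relies on.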

01:29:10.458 --> 01:29:11.250
So let's try again.

01:29:11.250 --> 01:29:14.870
We'll go back here to
our fake Facebook page.

01:29:14.870 --> 01:29:18.880
We will refresh so that
we get our new content.

01:29:18.880 --> 01:29:20.860
Remember, we just
changed the HTML content,

01:29:20.860 --> 01:29:23.900
so we actually need to reload
it so that our browser has it.

01:29:23.900 --> 01:29:31.250
And we'll type in abc@cs50.net and then
some password here and click Login,

01:29:31.250 --> 01:29:32.990
and we get redirected here.

01:29:32.990 --> 01:29:35.630
Sorry, we are unable to
log you in at this time.

01:29:35.630 --> 01:29:38.270
But notice we're still
in a file that I created.

01:29:38.270 --> 01:29:41.973
I didn't show you login.html, but
that's exactly what I put there.

01:29:41.973 --> 01:29:44.390
Now, I'm not actually going
to phish for information here.

01:29:44.390 --> 01:29:46.370
And I'm going to do something
that would arguably vio--

01:29:46.370 --> 01:29:48.100
even though I'm using
fake data here, I'm

01:29:48.100 --> 01:29:50.808
not going to do something that
would violate the terms of service

01:29:50.808 --> 01:29:54.500
or get myself in trouble by actually
attempting to do some phishing here.

01:29:54.500 --> 01:29:58.070
But imagine instead of some HTML
I had some Python code that was

01:29:58.070 --> 01:30:00.740
able to read the data from that field.

01:30:00.740 --> 01:30:02.840
We saw that a moment ago
with passwords, right?

01:30:02.840 --> 01:30:06.860
We know that the possibility exists
that if the user types something

01:30:06.860 --> 01:30:10.850
into a field, we have the
ability to extract it.

01:30:10.850 --> 01:30:13.340
What I could do here is very simple.

01:30:13.340 --> 01:30:18.200
I could just read those two fields where
they typed a username and a password

01:30:18.200 --> 01:30:20.032
but then display this content.

01:30:20.032 --> 01:30:22.490
Perhaps it's been the case that
you've gone to some website

01:30:22.490 --> 01:30:26.300
and seen, oh, yeah, sorry, the server
can't handle this request right now,

01:30:26.300 --> 01:30:28.820
or something along those lines.

01:30:28.820 --> 01:30:30.650
And you maybe think nothing of it.

01:30:30.650 --> 01:30:33.530
Or maybe I even would then have
a link here that says, try again.

01:30:33.530 --> 01:30:35.870
And if you click Try Again,
it would bring you back

01:30:35.870 --> 01:30:39.860
to Facebook's actual login where you
would then enter your credentials

01:30:39.860 --> 01:30:42.560
and try again and perhaps
think everything was fine.

01:30:42.560 --> 01:30:46.520
But if on this login page I had
extracted your username and password

01:30:46.520 --> 01:30:49.120
by tricking you into thinking
you were logging into Facebook,

01:30:49.120 --> 01:30:51.203
and then maybe I save those
in some file somewhere

01:30:51.203 --> 01:30:54.882
and then just display this to you,
you think, ah, they just had an error.

01:30:54.882 --> 01:30:56.090
Things are a little bit busy.

01:30:56.090 --> 01:30:57.050
I'll try again.

01:30:57.050 --> 01:30:58.910
And when you try again, it works.

01:30:58.910 --> 01:31:00.770
It's really that easy.

01:31:00.770 --> 01:31:05.600
And the way to avoid phishing
expeditions, so to speak,

01:31:05.600 --> 01:31:07.530
is just to be mindful
of what you're doing.

01:31:07.530 --> 01:31:11.000
Take a look at the URL bar to
make sure that you're on the page

01:31:11.000 --> 01:31:12.983
that you think you're on.
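
Checking the URL bar amounts to comparing the exact hostname against the one you expect. A minimal sketch of that comparison follows. The helper name is_expected_site and the lookalike hosts are hypothetical, and a real browser applies many more checks than this.

```python
from urllib.parse import urlparse


def is_expected_site(url, expected_hosts):
    """True only for an https URL whose hostname is exactly one we expect."""
    parts = urlparse(url)
    return parts.scheme == "https" and parts.hostname in expected_hosts


expected = {"www.facebook.com"}
print(is_expected_site("https://www.facebook.com/login/", expected))
# A one-character lookalike host and a local file both fail the check.
print(is_expected_site("https://www.faceb00k.example/login/", expected))
print(is_expected_site("file:///home/user/index.html", expected))
```

The comparison is against the whole hostname on purpose: a check that only looks for "facebook.com" somewhere in the address would accept a host such as www.facebook.com.attacker.example.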

01:31:12.983 --> 01:31:14.900
Hopefully you've come
away now with a bit more

01:31:14.900 --> 01:31:16.775
of an understanding of
cybersecurity and some

01:31:16.775 --> 01:31:19.700
of the best practices that
are put in place to deal

01:31:19.700 --> 01:31:21.740
with potential cybersecurity threats.

01:31:21.740 --> 01:31:24.320
Now it's incumbent upon
us to use the technology

01:31:24.320 --> 01:31:28.130
that we have available to help us
protect ourselves from ourselves,

01:31:28.130 --> 01:31:33.020
and not only ourselves and our own data,
but also to protect our clients

01:31:33.020 --> 01:31:35.200
and their data as well.