WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:17.213 --> 00:00:20.380 DOUG LLOYD: Now that we know a bit more about the internet and how it works, 00:00:20.380 --> 00:00:23.200 let's reintroduce the subject of security with this new context. 00:00:23.200 --> 00:00:26.100 And let's start by talking about Git and GitHub. 00:00:26.100 --> 00:00:28.540 Recall that Git and GitHub are a technology that 00:00:28.540 --> 00:00:31.990 are used by programmers to version control 00:00:31.990 --> 00:00:34.690 their software, which basically allows them the ability 00:00:34.690 --> 00:00:39.010 to save code to an internet-based repository in case of some failure 00:00:39.010 --> 00:00:41.830 locally, they have a backup place to put it, but also 00:00:41.830 --> 00:00:43.750 keep track of all the changes they've made 00:00:43.750 --> 00:00:46.120 and possibly go back in time in case they produce 00:00:46.120 --> 00:00:48.460 a version of code that is broken. 00:00:48.460 --> 00:00:50.440 GitHub has some great advantages, but it also 00:00:50.440 --> 00:00:53.110 has the potential disadvantages because of this structure 00:00:53.110 --> 00:00:54.590 of being able to go back in time. 00:00:54.590 --> 00:00:58.180 So for example, imagine that what we have is an initial commit, and commit 00:00:58.180 --> 00:01:01.828 is just GitHub parlance for a set of code 00:01:01.828 --> 00:01:03.370 that you are sending to the internet. 00:01:03.370 --> 00:01:07.720 So I've decided to take file A, file B, and file C in their current versions. 00:01:07.720 --> 00:01:12.190 I've saved them using control S or command S literally on my machine, 00:01:12.190 --> 00:01:14.800 and I want to send those versions to GitHub to be 00:01:14.800 --> 00:01:17.410 stored permanently or semi-permanently. 00:01:17.410 --> 00:01:19.900 You would package those up in what's called a commit 00:01:19.900 --> 00:01:23.560 and then push that code to GitHub where it would then be visible online. 
00:01:23.560 --> 00:01:25.270 And this would be packaged as a commit. 00:01:25.270 --> 00:01:29.860 And all the files that we view on GitHub are tracked in terms of commits. 00:01:29.860 --> 00:01:31.450 And commits chain together. 00:01:31.450 --> 00:01:34.210 And we've seen this idea of chaining in the past when we've 00:01:34.210 --> 00:01:36.600 discussed linked lists, for example. 00:01:36.600 --> 00:01:39.100 So every commit knows about the one that comes after it once 00:01:39.100 --> 00:01:43.810 that commit is eventually pushed as well as all of the ones that preceded it. 00:01:43.810 --> 00:01:47.110 So imagine we have an initial commit where we post some code 00:01:47.110 --> 00:01:49.870 and then we write some more-- we make some more changes. 00:01:49.870 --> 00:01:52.510 We perhaps update our database in such a way 00:01:52.510 --> 00:01:57.790 where when we post or push-- excuse me-- our second commit to GitHub, 00:01:57.790 --> 00:02:00.460 we accidentally expose the database credentials. 00:02:00.460 --> 00:02:03.250 So perhaps someone inadvertently typed the password 00:02:03.250 --> 00:02:06.760 for how to access the database into some Python code that would then 00:02:06.760 --> 00:02:09.639 be used to access that database. 00:02:09.639 --> 00:02:10.930 That's not a good thing. 00:02:10.930 --> 00:02:13.833 And maybe somebody quickly realized it and said, you know what? 00:02:13.833 --> 00:02:15.250 We need to get this off of GitHub. 00:02:15.250 --> 00:02:16.570 It is a source repository. 00:02:16.570 --> 00:02:17.920 It's available online. 00:02:17.920 --> 00:02:22.390 And so they push a third commit to GitHub that deletes those credentials. 00:02:22.390 --> 00:02:26.740 It stores them somewhere else that's not going to be saved on this repository. 00:02:26.740 --> 00:02:29.977 But have we actually solved the problem? 
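To make the three-commit scenario concrete, here is a minimal sketch of a commit chain as a linked list. The `Commit` class, the file name `db.py`, and the password value are all made up for illustration. In Git itself, each commit records a pointer to its parent, which is the link this model follows.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Commit:
    """Hypothetical, simplified commit: a snapshot of files plus a parent link."""
    message: str
    files: Dict[str, str]
    parent: Optional["Commit"] = None


def history(tip: Commit) -> Iterator[Commit]:
    """Walk from the newest commit back to the first, like a linked list."""
    node: Optional[Commit] = tip
    while node is not None:
        yield node
        node = node.parent


first = Commit("initial commit", {"db.py": "PASSWORD = load_from_env()"})
second = Commit("update database", {"db.py": "PASSWORD = 'hunter2'"}, parent=first)
third = Commit("remove credentials", {"db.py": "PASSWORD = load_from_env()"}, parent=second)

# The newest snapshot no longer contains the password.
latest_is_clean = "hunter2" not in third.files["db.py"]

# Every earlier snapshot is still reachable by walking the chain.
leaky = [c.message for c in history(third) if "hunter2" in c.files["db.py"]]
print(latest_is_clean, leaky)
```

The point of the sketch is that a later commit adds a new snapshot on top of the old ones; it does not rewrite them.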
00:02:29.977 --> 00:02:31.810 And you can probably imagine that the answer 00:02:31.810 --> 00:02:34.930 is no, because we have this idea of version control 00:02:34.930 --> 00:02:39.700 where every past iteration of all of these files 00:02:39.700 --> 00:02:43.840 is stored still on GitHub such that, if I needed to, I could go back in time. 00:02:43.840 --> 00:02:48.220 So even though I attempted to solve the security crisis I just 00:02:48.220 --> 00:02:52.360 created for myself by introducing a new commit that 00:02:52.360 --> 00:02:54.520 removes the credentials from those files such that, 00:02:54.520 --> 00:02:57.070 if I'm looking just at the most recent version of the files, 00:02:57.070 --> 00:02:58.147 I don't see it anymore. 00:02:58.147 --> 00:02:59.980 I still have the ability to go back in time, 00:02:59.980 --> 00:03:03.790 so this doesn't actually solve a problem. 00:03:03.790 --> 00:03:05.800 See, one of the interesting things about GitHub 00:03:05.800 --> 00:03:08.230 is the model that is used for it. 00:03:08.230 --> 00:03:10.120 At the very beginning of GitHub's existence, 00:03:10.120 --> 00:03:14.260 it relied pretty extensively on this idea of you sign up for free, 00:03:14.260 --> 00:03:16.030 you get a free account for GitHub, and you 00:03:16.030 --> 00:03:20.170 have a limited number of private repositories, repositories that are not 00:03:20.170 --> 00:03:24.250 publicly viewable or searchable, and you could pay to have more of them 00:03:24.250 --> 00:03:25.930 if you wanted to. 00:03:25.930 --> 00:03:29.650 But the majority of your repositories, assuming 00:03:29.650 --> 00:03:33.610 you did not opt into a paid account, were free, which 00:03:33.610 --> 00:03:37.720 meant anybody on the internet could search them using GitHub's search tool, 00:03:37.720 --> 00:03:40.600 or using even a regular search engine such as Google, 00:03:40.600 --> 00:03:42.790 could just look for something. 
00:03:42.790 --> 00:03:46.990 And if your GitHub repositories happen to match what that person searched 00:03:46.990 --> 00:03:49.660 or specifically, if you're looking within GitHub's search feature, 00:03:49.660 --> 00:03:52.620 if a user is looking for specific lines of code, 00:03:52.620 --> 00:03:56.138 anything in a public repository, it is available. 00:03:56.138 --> 00:03:58.180 Now, GitHub has recently changed to a model where 00:03:58.180 --> 00:04:01.720 there are more private repo-- or there's a higher limit 00:04:01.720 --> 00:04:04.840 on the number of private repositories that somebody could have. 00:04:04.840 --> 00:04:10.090 But this was part of GitHub's design to really encourage 00:04:10.090 --> 00:04:13.780 developers and programmers to sort of create this open source community where 00:04:13.780 --> 00:04:18.310 anybody could view someone else's code, and in GitHub parlance, 00:04:18.310 --> 00:04:21.670 fork their code, which basically means to take their entire repository 00:04:21.670 --> 00:04:26.830 or collection of files and copy it into their own GitHub repository 00:04:26.830 --> 00:04:29.760 to perhaps make changes or suggest changes, 00:04:29.760 --> 00:04:33.040 pushing those back into the code base with the idea being 00:04:33.040 --> 00:04:35.810 that it would make the entire community better. 00:04:35.810 --> 00:04:38.680 A side effect, of course, is that items get 00:04:38.680 --> 00:04:43.360 revealed when we do so because of this public repository setup we have here. 00:04:43.360 --> 00:04:47.200 So GitHub is great in terms of its ability for programmers 00:04:47.200 --> 00:04:49.930 to refer to materials on the internet. 00:04:49.930 --> 00:04:52.750 They don't have to rely on their own local machines to store code. 
00:04:52.750 --> 00:04:57.070 It allows people to work from multiple workstations, 00:04:57.070 --> 00:04:59.590 similar to how Dropbox or Google Drive, for example, 00:04:59.590 --> 00:05:02.470 might allow you to access files from different machines. 00:05:02.470 --> 00:05:04.970 You don't have to be on a specific machine to access a file, 00:05:04.970 --> 00:05:08.500 as we used to have to do before these cloud-based document storage 00:05:08.500 --> 00:05:10.060 services existed. 00:05:10.060 --> 00:05:12.310 And it encourages collaboration. 00:05:12.310 --> 00:05:16.390 For example, if you and I were to collaborate on a GitHub repository, 00:05:16.390 --> 00:05:20.000 I could push changes to that repository that you could then pull. 00:05:20.000 --> 00:05:22.750 And we could then be working off of the same code base again. 00:05:22.750 --> 00:05:25.690 We sort of have this central repo-- 00:05:25.690 --> 00:05:28.630 central area where we share our code with one another. 00:05:28.630 --> 00:05:30.580 And we can each individually make changes 00:05:30.580 --> 00:05:33.520 and incorporate one another's changes into the final products. 00:05:33.520 --> 00:05:38.110 So we're always working off of the same base of material. 00:05:38.110 --> 00:05:40.210 The side effect, though, again, is this material 00:05:40.210 --> 00:05:44.260 is generally public unless you have opted into a private repository where 00:05:44.260 --> 00:05:46.450 you have specific individuals who are logged 00:05:46.450 --> 00:05:49.990 in with their GitHub accounts who want to share. 00:05:49.990 --> 00:05:52.420 So is there a way to solve this problem, though, of we 00:05:52.420 --> 00:05:55.087 accidentally expose our credentials in a public repository? 00:05:55.087 --> 00:05:56.920 Of course, if we're in a private repository, 00:05:56.920 --> 00:05:58.220 this might not be as alarming. 
00:05:58.220 --> 00:05:59.920 It's still probably not something you-- 00:05:59.920 --> 00:06:03.130 it shouldn't be encouraged to have credentials 00:06:03.130 --> 00:06:07.480 for anything stored anywhere, whether public or private, on the internet. 00:06:07.480 --> 00:06:08.830 It's a little riskier. 00:06:08.830 --> 00:06:12.402 But is there a way to get rid of this or to prevent this problem from happening? 00:06:12.402 --> 00:06:14.860 And fortunately, there are a number of different safeguards 00:06:14.860 --> 00:06:17.680 specific to Git and GitHub that we can use 00:06:17.680 --> 00:06:22.240 to prevent the accidental leakage of information, so to speak. 00:06:22.240 --> 00:06:25.330 So for example, one way we can handle this is using a program or utility 00:06:25.330 --> 00:06:27.340 called GitSecrets. 00:06:27.340 --> 00:06:31.000 GitSecrets works by looking for what's called a regular expression. 00:06:31.000 --> 00:06:33.640 And a regular expression is computer science parlance 00:06:33.640 --> 00:06:37.600 for a particular formation of a string, so a certain number 00:06:37.600 --> 00:06:41.360 of characters, a certain number of digit characters, maybe some punctuation 00:06:41.360 --> 00:06:41.860 marks. 00:06:41.860 --> 00:06:46.360 You can say, I'm looking for strings that match this idea. 00:06:46.360 --> 00:06:49.630 And you can express this idea where this idea is all capital 00:06:49.630 --> 00:06:52.900 letters, all lowercase letters, this many numbers, and this many punctuation 00:06:52.900 --> 00:06:55.750 marks, and so on using this tool called a regular expression. 
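As a rough illustration of matching strings by their shape, here is a small sketch using Python's `re` module. The two patterns, the `find_suspects` helper, and the sample text are assumptions invented for this example; they are not the pattern list that any real scanning tool ships with.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    # The word "password", an equals sign, then a quoted value.
    "password assignment": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
    # Exactly 20 capital letters or digits in a row.
    "20 capital letters or digits": re.compile(r"\b[A-Z0-9]{20}\b"),
}


def find_suspects(text):
    """Return (line number, pattern label) for every line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


sample = (
    'db = connect(user="admin")\n'
    'PASSWORD = "hunter2"\n'
    'key = "ABCDEFGHIJ0123456789"\n'
)
print(find_suspects(sample))
# [(2, 'password assignment'), (3, '20 capital letters or digits')]
```

Patterns like these flag look-alikes as well as real secrets, so a match is best treated as a prompt to look again rather than as proof of a leak.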
00:06:55.750 --> 00:06:59.410 But GitSecrets contains a list of these regular expressions 00:06:59.410 --> 00:07:02.710 and will warn you when you are about to make a commit, when you're 00:07:02.710 --> 00:07:05.650 about to push code or send code to GitHub to be stored 00:07:05.650 --> 00:07:10.030 in its online repository that you have a string that matches this pattern 00:07:10.030 --> 00:07:11.950 that you wanted me to warn you about. 00:07:11.950 --> 00:07:15.190 And so be sure before you commit this code 00:07:15.190 --> 00:07:19.600 and push this code that you actually intend to send this up 00:07:19.600 --> 00:07:23.380 to GitHub, because it may be that this matches a password string that you're 00:07:23.380 --> 00:07:24.560 trying to avoid. 00:07:24.560 --> 00:07:27.580 So that's an interesting tool that can be used for that. 00:07:27.580 --> 00:07:31.150 You also want to consider limiting third party app access. 00:07:31.150 --> 00:07:35.930 GitHub accounts are actually very common to use as other forms of login, 00:07:35.930 --> 00:07:36.770 for example. 00:07:36.770 --> 00:07:39.190 So there's a platform on the internet called 00:07:39.190 --> 00:07:42.190 OAuth which allows you to use, for example, your Facebook 00:07:42.190 --> 00:07:44.977 account or your Google account to log into other services. 00:07:44.977 --> 00:07:47.560 Perhaps you've encountered this in your own experience working 00:07:47.560 --> 00:07:49.510 with different services on the internet. 00:07:49.510 --> 00:07:54.010 Instead of creating a login for site x, you could use your Facebook or Google 00:07:54.010 --> 00:07:58.150 login, or, in many instances as well, your GitHub log in to do so. 00:07:58.150 --> 00:08:01.610 When you do so, though, you are allowing that third party application, 00:08:01.610 --> 00:08:07.090 someone that's not GitHub, the ability to use and access your GitHub identity 00:08:07.090 --> 00:08:08.120 or credential. 
00:08:08.120 --> 00:08:12.640 And so you should be very careful with not only GitHub but other services 00:08:12.640 --> 00:08:17.560 as well, thinking about whether you want that other service to have access 00:08:17.560 --> 00:08:21.940 to your GitHub, or Facebook, or Google account information to use it even just 00:08:21.940 --> 00:08:23.380 for authentication. 00:08:23.380 --> 00:08:26.320 It's a good idea to try and limit how much third party app 00:08:26.320 --> 00:08:30.340 access you're giving to other services. 00:08:30.340 --> 00:08:33.520 Another tool is to use something called a commit hook. 00:08:33.520 --> 00:08:36.460 Now, commit hook is just a fancy term for a short program 00:08:36.460 --> 00:08:42.070 or set of instructions that executes when a commit is pushed to GitHub. 00:08:42.070 --> 00:08:44.740 So for example, many of the course websites 00:08:44.740 --> 00:08:48.490 that we use here at Harvard for CS50 are GitHub-based, 00:08:48.490 --> 00:08:52.030 which means that when we want to change the content on the course website, 00:08:52.030 --> 00:08:56.350 we update some HTML, or Python, or JavaScript files, we push those 00:08:56.350 --> 00:09:01.000 to GitHub, and that triggers a commit hook where basically that commit 00:09:01.000 --> 00:09:04.570 hook copies those files into our web server, 00:09:04.570 --> 00:09:07.420 runs some tests on them to make sure that there's no errors in them. 00:09:07.420 --> 00:09:10.390 For example, if we wrote some JavaScript or Python that was breaking, 00:09:10.390 --> 00:09:15.250 it had a bug in it, we'd rather not deploy that bug so to speak. 00:09:15.250 --> 00:09:17.710 We wouldn't want the broken version of the code 00:09:17.710 --> 00:09:21.190 to replace the currently working website. 00:09:21.190 --> 00:09:23.750 And so commit hook can be used to do testing as well. 
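The test-then-deploy flow described here can be sketched in a few lines. Everything below (`python_files_compile`, `run_hook`, the `live_site` dictionary standing in for a web server) is hypothetical and only models the idea; it is not how the course's own deployment works. For reference, Git's built-in hooks are scripts in a repository's `.git/hooks` directory, and a `pre-commit` hook that exits with a non-zero status stops the commit.

```python
from typing import Callable, Dict, List

Files = Dict[str, str]  # file name -> file contents


def python_files_compile(files: Files) -> List[str]:
    """Report any .py file that has a syntax error."""
    problems = []
    for name, body in files.items():
        if name.endswith(".py"):
            try:
                compile(body, name, "exec")  # parses the code without running it
            except SyntaxError as err:
                problems.append(f"{name}: syntax error on line {err.lineno}")
    return problems


def run_hook(files: Files,
             checks: List[Callable[[Files], List[str]]],
             deploy: Callable[[Files], None]) -> List[str]:
    """Run every check; deploy only if none of them reports a problem."""
    problems = [p for check in checks for p in check(files)]
    if not problems:
        deploy(files)
    return problems


live_site: Files = {}  # stands in for the web server

good = run_hook({"app.py": "print('hello')"}, [python_files_compile], live_site.update)
bad = run_hook({"app.py": "def broken(:\n    pass"}, [python_files_compile], live_site.update)
print(good, bad, live_site)
```

The broken version never reaches `live_site`, which is the behavior we want: the working site stays up until the new files pass.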
00:09:23.750 --> 00:09:26.170 And then once all the tests pass, we then 00:09:26.170 --> 00:09:28.300 are able to activate those files on the web server 00:09:28.300 --> 00:09:29.890 and the changes have happened. 00:09:29.890 --> 00:09:32.530 So we're using GitHub to store the changes 00:09:32.530 --> 00:09:35.650 that we want to make on our site, the HTML, the Python, 00:09:35.650 --> 00:09:37.870 the JavaScript changes that we want to make. 00:09:37.870 --> 00:09:41.650 And then we're using this commit hook, a set of instructions, 00:09:41.650 --> 00:09:45.340 to copy them over and actually deploy those changes to the website 00:09:45.340 --> 00:09:48.430 once we've verified that we haven't made anything break. 00:09:48.430 --> 00:09:52.210 You can also use commit hooks, for example, to check for passwords 00:09:52.210 --> 00:09:56.830 and have it warn you if you have perhaps leaked a credential. 00:09:56.830 --> 00:10:00.040 And then you can undo that with a technique 00:10:00.040 --> 00:10:02.480 that we'll see in just a moment. 00:10:02.480 --> 00:10:06.250 Another thing that you can do when using GitHub to protect or verify 00:10:06.250 --> 00:10:09.180 your identity is to use an SSH key. 00:10:09.180 --> 00:10:12.653 SSH keys are a special form of a public and private key. 00:10:12.653 --> 00:10:15.070 In this case, it's really not used for encryption, though. 00:10:15.070 --> 00:10:17.535 It's actually used as identification. 00:10:17.535 --> 00:10:19.410 And so this idea of digital signatures, which 00:10:19.410 --> 00:10:22.860 you may recall from a few lectures ago, comes back into play. 00:10:22.860 --> 00:10:27.600 Whenever I use an SSH key to push my code to GitHub, what happens 00:10:27.600 --> 00:10:33.150 is I also digitally sign the commit when I send it up. 
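The sign-with-the-private-key, check-with-the-public-key idea can be shown with a toy example. The tiny primes, the helper names, and the message are assumptions chosen so the arithmetic stays readable; real keys are vastly larger, and this is not the protocol GitHub or SSH actually runs.

```python
import hashlib

# Toy key pair built from tiny primes (insecure, for illustration only).
P, Q = 61, 53
N = P * Q                           # public modulus: 3233
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and shrink the hash to fit under the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Only the holder of the private exponent D can produce this value."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Anyone with the public values (N, E) can check the signature."""
    return pow(signature, E, N) == digest(message)


commit = b"third commit: remove credentials"
signature = sign(commit)
print(verify(commit, signature))            # True: signature matches the message
print(verify(commit, (signature + 1) % N))  # False: an altered signature is rejected
```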
00:10:33.150 --> 00:10:36.870 And so before that commit gets posted to GitHub, 00:10:36.870 --> 00:10:40.200 GitHub verifies this by checking my public key 00:10:40.200 --> 00:10:43.230 and verifying, using the mathematics that we've seen in the past, 00:10:43.230 --> 00:10:46.650 that, yes, only Doug could have sent this to me 00:10:46.650 --> 00:10:53.160 because only Doug's public key will unscramble this set of zeros and ones 00:10:53.160 --> 00:10:57.180 that I received that only could have then been created by his private key. 00:10:57.180 --> 00:10:59.550 These two things are reciprocal of one another. 00:10:59.550 --> 00:11:01.980 So we can use SSH keys and digital signatures 00:11:01.980 --> 00:11:05.850 as an identity verification scheme as well for GitHub 00:11:05.850 --> 00:11:08.430 as we might be able to for mailing documents, or sending 00:11:08.430 --> 00:11:11.160 documents, or something like that. 00:11:11.160 --> 00:11:15.300 Now, imagine we have posted the credentials accidentally. 00:11:15.300 --> 00:11:17.130 Is there a way to get rid of them? 00:11:17.130 --> 00:11:18.930 GitHub does track our entire history. 00:11:18.930 --> 00:11:20.430 But what if we do make a mistake? 00:11:20.430 --> 00:11:22.410 Human beings are fallible. 00:11:22.410 --> 00:11:25.980 And so there is a way to actually eliminate the history. 00:11:25.980 --> 00:11:29.697 And that is using a command called Git Rebase. 00:11:29.697 --> 00:11:32.280 So let's go back to the illustration we had a moment ago where 00:11:32.280 --> 00:11:34.250 we have several different commits. 00:11:34.250 --> 00:11:37.210 And I've added a fourth commit here just for purposes of illustration. 
00:11:37.210 --> 00:11:38.960 So our first commit and our second commit, 00:11:38.960 --> 00:11:42.180 and then it's after that that we expose the credentials accidentally, 00:11:42.180 --> 00:11:47.010 and then we have a fourth commit where we actually delete that mistake that we 00:11:47.010 --> 00:11:48.300 had previously made. 00:11:48.300 --> 00:11:51.810 When we want to Git Rebase, the idea is we want 00:11:51.810 --> 00:11:54.370 to delete a portion of the history. 00:11:54.370 --> 00:11:56.120 Now, deleting a portion of the history has 00:11:56.120 --> 00:11:59.075 a side effect of any changes that I made here or here. 00:11:59.075 --> 00:12:01.950 In this illustration, we're going to get rid of the last two commits. 00:12:01.950 --> 00:12:05.460 Any changes that I've made besides accidentally exposing the credentials 00:12:05.460 --> 00:12:07.170 are also going to be destroyed. 00:12:07.170 --> 00:12:11.220 And so it's going to be incumbent on us to make sure to copy and save 00:12:11.220 --> 00:12:15.150 the changes we actually want to preserve in case we've done more than just 00:12:15.150 --> 00:12:16.530 expose the credentials. 00:12:16.530 --> 00:12:19.170 And then we'll have to make a new commit in this new history 00:12:19.170 --> 00:12:23.100 we create so that we can still preserve those changes that we want to make. 00:12:23.100 --> 00:12:25.620 But let's say, other than the credentials, 00:12:25.620 --> 00:12:27.900 I didn't actually do anything else. 00:12:27.900 --> 00:12:33.330 One thing I could do is rebase or set as a new start point, basically, 00:12:33.330 --> 00:12:36.190 this second commit as the end of the chain. 00:12:36.190 --> 00:12:40.590 So instead of going all the way to here and having that preserved ad infinitum, 00:12:40.590 --> 00:12:44.430 I want to just get rid of everything from the second commit forward. 00:12:44.430 --> 00:12:45.300 And I can do that. 
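Here is a small model of what moving the end of the chain back does. The commit names `c1` through `c5` and the `parents` table are made up for this sketch, and it models only the branch's own history, not the actual Git commands.

```python
# Each hypothetical commit id maps to its parent's id.
parents = {"c1": None, "c2": "c1", "c3": "c2", "c4": "c3"}
branch = {"main": "c4"}  # the branch name points at the newest commit


def reachable(tip):
    """List every commit you can get to by following parent links from the tip."""
    chain = []
    while tip is not None:
        chain.append(tip)
        tip = parents[tip]
    return chain


before = reachable(branch["main"])  # c3 (the leak) and c4 (the fix) are in the history

# Move the end of the chain back to the second commit.
branch["main"] = "c2"

# The next commit attaches right after c2 instead of after c4.
parents["c5"] = branch["main"]
branch["main"] = "c5"

after = reachable(branch["main"])
print(before, after)
```

After the move, nothing on the branch leads to `c3` or `c4` anymore, which is also why any other changes made in those commits have to be saved and committed again.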
00:12:45.300 --> 00:12:49.110 And then those commits are no longer remembered by GitHub. 00:12:49.110 --> 00:12:52.110 And as soon as the next commit I have would go here, 00:12:52.110 --> 00:12:56.760 right after second commit as opposed to imagining a fifth one there 00:12:56.760 --> 00:12:59.580 right after credentials being removed, those commits 00:12:59.580 --> 00:13:03.570 are, for all intents and purposes on GitHub, forgotten. 00:13:03.570 --> 00:13:06.330 And finally, one more thing that we can do when using GitHub 00:13:06.330 --> 00:13:09.420 is to mandate the use of two-factor authentication. 00:13:09.420 --> 00:13:12.810 Recall we've discussed two-factor authentication a little bit previously. 00:13:12.810 --> 00:13:16.890 And the idea is that you have a backup mechanism 00:13:16.890 --> 00:13:19.650 to prevent unauthorized login. 00:13:19.650 --> 00:13:21.720 And the two factors in two-factor authentication 00:13:21.720 --> 00:13:26.520 are not two passwords, because those are fundamentally quite similar. 00:13:26.520 --> 00:13:29.850 The idea is that you want to have something that you know, for example, 00:13:29.850 --> 00:13:33.150 a password-- that's usually very commonly one of the two factors 00:13:33.150 --> 00:13:35.220 in two-factor authentication-- 00:13:35.220 --> 00:13:37.590 and something that you have, the thought being 00:13:37.590 --> 00:13:42.900 that an adversary is incredibly unlikely to have both things at the same time. 00:13:42.900 --> 00:13:45.120 They may know your password, but they probably 00:13:45.120 --> 00:13:49.320 don't have your cell phone, for example, or your RSA key. 00:13:49.320 --> 00:13:54.360 They may have stolen your phone or they may have stolen your RSA key, 00:13:54.360 --> 00:13:57.390 but they probably don't also know your password. 
00:13:57.390 --> 00:14:00.690 And so the idea is that this provides an additional level of defense 00:14:00.690 --> 00:14:04.080 against potential hacking, or breaking into accounts, 00:14:04.080 --> 00:14:06.660 or unauthorized behavior in accounts that you obviously 00:14:06.660 --> 00:14:08.190 don't want to happen. 00:14:08.190 --> 00:14:11.562 Now, an RSA key, if you're unfamiliar, is something that looks like this. 00:14:11.562 --> 00:14:13.020 There's different versions of them. 00:14:13.020 --> 00:14:14.437 They've sort of evolved over time. 00:14:14.437 --> 00:14:18.660 This one is actually a combined RSA key and USB drive. 00:14:18.660 --> 00:14:22.020 And inside the window here of the RSA key 00:14:22.020 --> 00:14:26.010 is a six digit number that just changes every 60 seconds or so. 00:14:26.010 --> 00:14:28.900 So when you are given one of these, for example, 00:14:28.900 --> 00:14:32.310 perhaps at a firm or a business, it is assigned to you specifically. 00:14:32.310 --> 00:14:35.530 There's a server that your IT team will have 00:14:35.530 --> 00:14:39.960 set up that maps the serial number on the back of this RSA key 00:14:39.960 --> 00:14:42.120 to your employee ID, for example. 00:14:42.120 --> 00:14:47.010 But they otherwise don't know what the number currently on the RSA key is. 00:14:47.010 --> 00:14:51.840 They only know who owns it, who is physically in possession of it, which 00:14:51.840 --> 00:14:53.210 employee ID it maps to. 00:14:53.210 --> 00:14:54.990 And every 60 seconds it changes according 00:14:54.990 --> 00:14:59.430 to some mathematical algorithm that is built into the key that generates 00:14:59.430 --> 00:15:02.190 numbers in a pseudo random way. 00:15:02.190 --> 00:15:05.490 And after 60 seconds, that code will change into something else. 00:15:05.490 --> 00:15:10.130 And you'll need to actually have the key on you to complete a login. 
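To give a feel for how a six-digit code can change on a timer, here is a sketch of the open time-based one-time password scheme (TOTP, RFC 6238) that authenticator apps commonly use. This is a stand-in: the hardware token shown in the lecture uses its own algorithm, and the 60-second step, the function name, and the sample secret are assumptions. In this scheme the device and the verifying server share the same secret, so both can compute the current code from the clock.

```python
import hashlib
import hmac
import struct


def one_time_code(secret: bytes, unix_time: int, step: int = 60, digits: int = 6) -> str:
    """Derive a short numeric code from a shared secret and the current time."""
    counter = unix_time // step                      # which time window we are in
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                          # pick 4 bytes based on the last nibble
    number = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)  # keep the last few digits


secret = b"12345678901234567890"  # sample secret used in the RFC test vectors

print(one_time_code(secret, 0))    # 755224
print(one_time_code(secret, 59))   # 755224, still inside the first 60-second window
print(one_time_code(secret, 60))   # 287082, the window has rolled over
```

A different secret gives an unrelated sequence of codes, which is why one person's token is of no use for logging in as someone else.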
00:15:10.130 --> 00:15:12.810 If an RSA key is being used to secure such 00:15:12.810 --> 00:15:15.483 that you need to enter a password and your RSA key value, 00:15:15.483 --> 00:15:16.650 you would need to have both. 00:15:16.650 --> 00:15:19.872 No other employee RSA key-- well, hypothetically, I 00:15:19.872 --> 00:15:21.830 guess there's a one in a million chance that it 00:15:21.830 --> 00:15:24.705 would happen to be randomly showing the same number at the same time. 00:15:24.705 --> 00:15:28.100 But no other employee's RSA key could be used to log in. 00:15:28.100 --> 00:15:30.690 Only yours could be used to log in. 00:15:30.690 --> 00:15:32.690 Now, there are several different tools out there 00:15:32.690 --> 00:15:35.810 that can be used to provide two-factor authentication services. 00:15:35.810 --> 00:15:39.628 And there's really no technical reason not to use these services. 00:15:39.628 --> 00:15:42.170 You'll find them as applications on cell phones, most likely. 00:15:42.170 --> 00:15:46.310 And you'll find ones like this, Google Authenticator, Authy, Duo Mobile. 00:15:46.310 --> 00:15:47.360 There are lots of others. 00:15:47.360 --> 00:15:50.390 And if you don't want to use one of those applications specifically, 00:15:50.390 --> 00:15:53.210 many services also just allow you to receive a text message 00:15:53.210 --> 00:15:54.902 from the service itself. 00:15:54.902 --> 00:15:56.860 And you'll just get that via SMS on your phone, 00:15:56.860 --> 00:16:00.470 so still on your phone, just not tied to a specific application. 00:16:00.470 --> 00:16:05.690 And while there's no technical reason to avoid two-factor authentication, 00:16:05.690 --> 00:16:08.600 there is sort of this social friction surrounding 00:16:08.600 --> 00:16:13.580 two-factor authentication in that human beings tend to find it annoying, right? 00:16:13.580 --> 00:16:15.860 It used to be username, password, you're logged in. 00:16:15.860 --> 00:16:16.920 It's pretty quick. 
00:16:16.920 --> 00:16:19.630 Now it's username, password, you get brought to another screen, 00:16:19.630 --> 00:16:22.880 you're asked to enter a six-digit code, or maybe in some advanced applications 00:16:22.880 --> 00:16:26.390 you get a push notification sent to your device that you have to unlock 00:16:26.390 --> 00:16:28.970 and then hit OK on the device. 00:16:28.970 --> 00:16:31.280 And people just find that inconvenient. 00:16:31.280 --> 00:16:34.400 We haven't yet reached this point culturally 00:16:34.400 --> 00:16:39.440 where two-factor authentication is the norm. 00:16:39.440 --> 00:16:43.610 And so it's sort of a linchpin when we talk about security 00:16:43.610 --> 00:16:49.400 in the internet context, is human beings being the limiting factor 00:16:49.400 --> 00:16:51.980 for how secure we can be. 00:16:51.980 --> 00:16:56.810 We have the technology to take steps to protect ourselves, 00:16:56.810 --> 00:16:59.360 but we don't feel compelled to do so. 00:16:59.360 --> 00:17:03.260 And we'll see this pattern reemerge in a few other places today. 00:17:03.260 --> 00:17:06.315 But just know that that is why perhaps you're 00:17:06.315 --> 00:17:08.690 not seeing so much adoption of two-factor authentication. 00:17:08.690 --> 00:17:11.480 It's not that it's technically infeasible to do so. 00:17:11.480 --> 00:17:14.900 It's just that we just find it annoying to do so, 00:17:14.900 --> 00:17:19.401 and so we don't adopt it as aggressively as perhaps we should. 00:17:19.401 --> 00:17:21.109 Now let's discuss the type of attack that 00:17:21.109 --> 00:17:24.109 occurs on the internet with unfortunate regularity, 00:17:24.109 --> 00:17:27.270 and that is the idea of a denial of service attack. 00:17:27.270 --> 00:17:29.450 Now, the idea behind these attacks is basically 00:17:29.450 --> 00:17:32.000 to cripple the infrastructure of a website. 00:17:32.000 --> 00:17:34.460 Now, the reason for this might be financial. 
00:17:34.460 --> 00:17:36.050 You want to try and sabotage somebody. 00:17:36.050 --> 00:17:39.380 There might be other motivations, distraction, for example, 00:17:39.380 --> 00:17:42.380 by tying up their resources, trying to stop the attack. 00:17:42.380 --> 00:17:44.510 It opens up another avenue to do something else, 00:17:44.510 --> 00:17:46.077 to perhaps steal information. 00:17:46.077 --> 00:17:48.410 There's many different motivations for why they do this. 00:17:48.410 --> 00:17:51.020 And some of them are honestly just boredom or fun. 00:17:51.020 --> 00:17:54.140 Amateur hackers sometimes think it's fun to just initiate 00:17:54.140 --> 00:17:57.110 a denial of service attack against an entity that 00:17:57.110 --> 00:17:59.870 is not prepared to handle it. 00:17:59.870 --> 00:18:02.480 Now, in the associated materials for this course, 00:18:02.480 --> 00:18:06.380 we provided an article called Making Cyberspace Safe for Democracy, which 00:18:06.380 --> 00:18:08.870 we really do encourage you to take a look at, read, 00:18:08.870 --> 00:18:10.597 and discuss with your group. 00:18:10.597 --> 00:18:12.680 But I also want to take a little bit of time right 00:18:12.680 --> 00:18:15.590 now just to talk about this article in particular 00:18:15.590 --> 00:18:18.680 and draw your attention to some areas of concern 00:18:18.680 --> 00:18:21.710 or some areas that might lead to more discussion. 00:18:21.710 --> 00:18:25.070 Now, the biggest of these is these attacks 00:18:25.070 --> 00:18:28.875 tend not to be taken very seriously by people when they hear about them. 00:18:28.875 --> 00:18:31.250 You'll occasionally hear about these attacks in the news, 00:18:31.250 --> 00:18:33.350 denial of service attacks, or their cousin, 00:18:33.350 --> 00:18:35.930 distributed denial of service attacks. 
00:18:35.930 --> 00:18:39.800 But culturally, again, us being humans and sort 00:18:39.800 --> 00:18:42.650 of neglecting some of the real security concerns here, 00:18:42.650 --> 00:18:44.420 we don't think of it as an attack. 00:18:44.420 --> 00:18:48.740 And that's maybe because of how we hear about other kinds of attacks 00:18:48.740 --> 00:18:52.340 on the news that seem more physically devastating, 00:18:52.340 --> 00:18:55.310 that have more real consequences. 00:18:55.310 --> 00:19:00.860 And it makes it hard to have a serious conversation about cyber attacks 00:19:00.860 --> 00:19:06.650 because there's this friction that we face trying to get people to understand 00:19:06.650 --> 00:19:08.600 that these are meaningful and real. 00:19:08.600 --> 00:19:12.530 And in particular, these attacks are kind of insidious. 00:19:12.530 --> 00:19:17.355 They're really easy to execute without much difficulty at all, 00:19:17.355 --> 00:19:20.480 especially against a small business that might be running its own server as 00:19:20.480 --> 00:19:22.640 opposed to relying on a cloud service. 00:19:22.640 --> 00:19:29.150 A pretty top-of-the-line, commercially available machine might be able 00:19:29.150 --> 00:19:33.200 to execute a denial of service or DoS attack on its own. 00:19:33.200 --> 00:19:37.310 It doesn't even require exceptional resources. 00:19:37.310 --> 00:19:41.450 Now, when we start to attack mid-sized companies, or larger companies 00:19:41.450 --> 00:19:45.110 or entities, one single computer from one single IP address 00:19:45.110 --> 00:19:47.480 is not typically going to be enough. 00:19:47.480 --> 00:19:52.730 And so instead, you would have a distributed denial of service attack. 
00:19:52.730 --> 00:19:54.620 In a distributed denial of service attack, 00:19:54.620 --> 00:19:58.070 there is still generally one core hacker, or one collective group 00:19:58.070 --> 00:19:59.960 of hackers or adversaries that are trying 00:19:59.960 --> 00:20:03.647 to penetrate some company's defenses. 00:20:03.647 --> 00:20:05.480 But they can't do it with their own machine. 00:20:05.480 --> 00:20:08.210 And so what they do is create something called a botnet. 00:20:08.210 --> 00:20:09.890 Perhaps you've heard this term before. 00:20:09.890 --> 00:20:12.590 A botnet basically happens, or is created, 00:20:12.590 --> 00:20:17.103 when hackers or adversaries distribute worms or viruses sort of 00:20:17.103 --> 00:20:17.770 surreptitiously. 00:20:17.770 --> 00:20:19.700 Perhaps they packaged them into some download. 00:20:19.700 --> 00:20:22.780 People don't notice anything about the worm or anything 00:20:22.780 --> 00:20:25.750 about this program that has been covertly installed on their machine. 00:20:25.750 --> 00:20:30.010 It doesn't do anything in particular until it is activated. 00:20:30.010 --> 00:20:32.500 And then it becomes an agent or a zombie-- 00:20:32.500 --> 00:20:34.930 sometimes you'll hear it termed that as well-- 00:20:34.930 --> 00:20:36.400 controlled by the hackers. 00:20:36.400 --> 00:20:39.130 And so all of a sudden the adversaries gain 00:20:39.130 --> 00:20:42.190 control of many different devices, hundreds or thousands 00:20:42.190 --> 00:20:46.450 or tens of thousands, or even more in some of the bigger attacks 00:20:46.450 --> 00:20:50.602 that have happened, basically turning these computers-- 00:20:50.602 --> 00:20:52.310 rendering all of them under their control 00:20:52.310 --> 00:20:55.130 and being able to direct them to take whatever action they want. 
00:20:55.130 --> 00:20:58.870 And in particular, in the case of a distributed denial of service attack, 00:20:58.870 --> 00:21:03.190 all of these computers are going to make web requests 00:21:03.190 --> 00:21:07.810 to the same server or same website, because that's the idea. 00:21:07.810 --> 00:21:09.180 You have so many requests. 00:21:09.180 --> 00:21:10.930 With distributed denial of service attacks 00:21:10.930 --> 00:21:13.972 or just regular denial of service attacks, it's just a question of scale, 00:21:13.972 --> 00:21:15.610 really. 00:21:15.610 --> 00:21:18.430 We're hitting those servers with so many web requests. 00:21:18.430 --> 00:21:19.390 I want to access this. 00:21:19.390 --> 00:21:22.210 I want to access this, hundreds, thousands, tens of thousands 00:21:22.210 --> 00:21:26.110 of these requests a second such that the computer can't possibly-- the server 00:21:26.110 --> 00:21:28.210 can't possibly field all of these inquiries 00:21:28.210 --> 00:21:33.010 that are coming and trying to give these requests the data they're asking for. 00:21:33.010 --> 00:21:35.425 Ultimately, that would eventually, after enough time, 00:21:35.425 --> 00:21:38.300 result in the server just crashing, throwing up its hands and saying, 00:21:38.300 --> 00:21:39.430 I don't know what to do. 00:21:39.430 --> 00:21:41.388 I can't possibly process all of these requests. 00:21:41.388 --> 00:21:45.010 But by tying it up in this way, the adversary 00:21:45.010 --> 00:21:49.840 has succeeded in damaging the infrastructure of the server. 00:21:49.840 --> 00:21:52.960 It's either denied the server the ability to process customers 00:21:52.960 --> 00:21:55.840 and payments or it's just taken down the entire website 00:21:55.840 --> 00:21:58.840 so there's no information available about the company anymore to anybody 00:21:58.840 --> 00:22:01.630 who's trying to look it up. 00:22:01.630 --> 00:22:04.990 These attacks are actually really, really common. 
00:22:04.990 --> 00:22:06.910 There are some surveys that have been out that 00:22:06.910 --> 00:22:12.292 assess that roughly one sixth to one third of average-sized businesses that 00:22:12.292 --> 00:22:14.500 are part of this tech survey that goes out every year 00:22:14.500 --> 00:22:20.680 suffer some sort of DoS attack in a given year, so 16% to 35% or so 00:22:20.680 --> 00:22:23.910 of business, which is a lot of businesses when you think about it. 00:22:23.910 --> 00:22:25.660 And these attacks are usually quite small, 00:22:25.660 --> 00:22:27.610 and they're certainly not newsworthy. 00:22:27.610 --> 00:22:28.870 They might last a few minutes. 00:22:28.870 --> 00:22:30.190 They might last a few hours. 00:22:30.190 --> 00:22:31.690 But they're enough to be disruptive. 00:22:31.690 --> 00:22:32.898 They're certainly noteworthy. 00:22:32.898 --> 00:22:36.310 And they're something to avoid if it's possible. 00:22:36.310 --> 00:22:41.660 Cloud computing has made this problem kind of worse. 00:22:41.660 --> 00:22:45.190 And the reason for this is that, in a cloud computing context, 00:22:45.190 --> 00:22:47.980 your server that is running your business 00:22:47.980 --> 00:22:50.350 is not physically located on your premises. 00:22:50.350 --> 00:22:54.270 It was often the case that when a business would run a website 00:22:54.270 --> 00:23:00.430 or would run their business, they would have a server room that 00:23:00.430 --> 00:23:03.790 had the software that was necessary to run their website 00:23:03.790 --> 00:23:07.060 or to run whatever software-based services they provided. 00:23:07.060 --> 00:23:10.415 And it was all local to that business. 00:23:10.415 --> 00:23:12.980 No one else could possibly be affected. 
00:23:12.980 --> 00:23:15.070 But in a cloud computing context, we are generally 00:23:15.070 --> 00:23:20.860 renting server space and server power from an entity such as Amazon Web 00:23:20.860 --> 00:23:24.790 Services, or Google Cloud Services, or some other large provider where 00:23:24.790 --> 00:23:30.460 it might be that 10, 20, 50, depending on the size of the business in question 00:23:30.460 --> 00:23:31.510 here-- 00:23:31.510 --> 00:23:35.920 multiple businesses are sharing the same physical resources, 00:23:35.920 --> 00:23:37.990 and they're sharing the same server space, 00:23:37.990 --> 00:23:41.260 such that if any one of those 50, let's say, 00:23:41.260 --> 00:23:44.950 businesses is targeted by hackers or adversaries 00:23:44.950 --> 00:23:49.570 for a denial of service attack, that might actually, as collateral damage, 00:23:49.570 --> 00:23:52.390 take out the other 49 businesses. 00:23:52.390 --> 00:23:54.400 They weren't even part of the attack. 00:23:54.400 --> 00:23:55.930 But cloud computing is-- 00:23:55.930 --> 00:23:57.820 we've heard about it as it's a great thing. 00:23:57.820 --> 00:24:00.640 It allows us to scale out our websites, make it 00:24:00.640 --> 00:24:02.800 so that we can handle more customers. 00:24:02.800 --> 00:24:06.280 It takes away the problem of security, web-based security, 00:24:06.280 --> 00:24:11.090 because we're outsourcing that to the cloud provider to give that to us. 00:24:11.090 --> 00:24:15.490 But it now introduces this new problem of, if we're all sharing the resources 00:24:15.490 --> 00:24:18.790 and any one of us gets attacked, then all of us 00:24:18.790 --> 00:24:21.760 lose the ability to access those resources and use them, 00:24:21.760 --> 00:24:24.550 which might cause all of our organizations to suffer 00:24:24.550 --> 00:24:28.090 the consequences of one single attack. 
00:24:28.090 --> 00:24:30.700 This collateral damage can get even worse 00:24:30.700 --> 00:24:33.050 when you think about servers that are-- 00:24:33.050 --> 00:24:38.590 or businesses whose service is providing the internet, OK? 00:24:38.590 --> 00:24:40.970 So a very common example of this, or a noteworthy example 00:24:40.970 --> 00:24:44.260 of this, happened in 2016 with a service called 00:24:44.260 --> 00:24:49.480 DYN, D-Y-N. DYN is a DNS service provider, 00:24:49.480 --> 00:24:52.390 DNS being the domain name system. 00:24:52.390 --> 00:25:00.450 And the idea there is to map the things like www.google.com to its IP address. 00:25:00.450 --> 00:25:02.950 Because in order to actually access anything on the internet 00:25:02.950 --> 00:25:06.140 or to have a communication with anyone, you need to know their IP address. 00:25:06.140 --> 00:25:09.220 And as human beings, we tend not to actually remember 00:25:09.220 --> 00:25:14.020 what some website's IP address is, much like we may not recall a certain phone 00:25:14.020 --> 00:25:14.590 number. 00:25:14.590 --> 00:25:17.170 But if it has a mnemonic attached to it-- so for example, 00:25:17.170 --> 00:25:20.530 you know back in the day we had 1-800-COLLECT for collect calls. 00:25:20.530 --> 00:25:25.750 If you forgot the number, the literal digits of that phone number, 00:25:25.750 --> 00:25:29.290 you could still remember the idea of it because you had this mnemonic device 00:25:29.290 --> 00:25:30.760 to help remind you. 00:25:30.760 --> 00:25:35.110 Domain names, www.whatever.com, are just mnemonic devices 00:25:35.110 --> 00:25:37.570 that we use to refer to an IP address. 00:25:37.570 --> 00:25:41.770 And DNS servers provide this service to us. 00:25:41.770 --> 00:25:46.990 DYN is one of the major DNS providers for the internet overall. 
00:25:46.990 --> 00:25:49.630 And if a denial of service attack, or in this case 00:25:49.630 --> 00:25:53.800 it was certainly a distributed denial of service attack because it was enormous, 00:25:53.800 --> 00:25:58.480 goes after pinging the IP address or hitting that server over 00:25:58.480 --> 00:26:03.070 and over and over, then it is unable to field requests from anyone else, 00:26:03.070 --> 00:26:06.880 because it's just getting pummeled by all of these requests from some botnet 00:26:06.880 --> 00:26:11.250 that some adversary or collective of adversaries has taken control of. 00:26:11.250 --> 00:26:13.990 This, the collateral damage, is no one can ever 00:26:13.990 --> 00:26:17.110 map a domain name to an IP address, which 00:26:17.110 --> 00:26:19.720 means no one can visit any of these websites 00:26:19.720 --> 00:26:24.250 unless you happen to know at the outset what the IP address of any given 00:26:24.250 --> 00:26:24.850 website was. 00:26:24.850 --> 00:26:27.243 If you knew the IP address, this wasn't a problem. 00:26:27.243 --> 00:26:29.410 You could just still directly go to that IP address. 00:26:29.410 --> 00:26:31.000 That's not the kind of attack here. 00:26:31.000 --> 00:26:33.460 But the attack instead tied up the ability 00:26:33.460 --> 00:26:38.410 to translate these mnemonic names into numbers. 00:26:38.410 --> 00:26:42.400 And as you can see, DYN was a DNS-- or is 00:26:42.400 --> 00:26:45.490 a DNS provider for much of the eastern half of the United States 00:26:45.490 --> 00:26:48.842 as well as the Pacific Northwest and California. 
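To make the collateral damage concrete, here is a minimal sketch of a name lookup that fails when the resolver is unavailable while a literal IP address still works. The record table, the `resolver_up` flag, and the function names are assumptions for illustration only; the address comes from a range reserved for documentation, and this is not how Dyn or any real resolver is implemented.

```python
import ipaddress

# Hypothetical records; the address is from the documentation-only range.
RECORDS = {"www.example.com": "192.0.2.10"}


class ResolverDown(Exception):
    """Raised when the (simulated) DNS provider cannot answer."""


def resolve(name, resolver_up=True):
    """Map a domain name to an IP address, if the resolver is reachable."""
    if not resolver_up:
        raise ResolverDown(f"cannot resolve {name}")
    return RECORDS[name]


def connect(target, resolver_up=True):
    """Return the IP address we would connect to for a name or a literal IP."""
    try:
        ipaddress.ip_address(target)  # already an IP address: no lookup needed
        return target
    except ValueError:
        return resolve(target, resolver_up)


normal = connect("www.example.com")
by_ip_during_attack = connect("192.0.2.10", resolver_up=False)
```

With the resolver down, `connect("www.example.com", resolver_up=False)` raises `ResolverDown`, which mirrors the point above: the websites themselves were fine, but the names could not be translated.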
00:26:48.842 --> 00:26:50.800 And if you think about what kinds of businesses 00:26:50.800 --> 00:26:53.950 are headquartered in the Pacific Northwest 00:26:53.950 --> 00:26:58.810 and in California and in the New York area, for example, 00:26:58.810 --> 00:27:01.060 you probably see that some major, major services, 00:27:01.060 --> 00:27:03.435 including GitHub, which we've already talked about today, 00:27:03.435 --> 00:27:06.190 but also Facebook and others-- 00:27:06.190 --> 00:27:09.940 Harvard University's website was also taken down for several hours. 00:27:09.940 --> 00:27:12.320 This attack lasted about 10 hours, so quite prolonged. 00:27:12.320 --> 00:27:15.810 It really did a lot of damage on that day. 00:27:15.810 --> 00:27:18.310 It really crippled the ability of people to use the internet 00:27:18.310 --> 00:27:22.420 for a long period of time, so kind of very interesting. 00:27:22.420 --> 00:27:28.330 This article also talks a bit about how the United States government has 00:27:28.330 --> 00:27:31.450 decided to-- or legislature-- 00:27:31.450 --> 00:27:35.293 handle these kinds of issues, computer-based attacks. 00:27:35.293 --> 00:27:37.460 It takes a look at the Computer Fraud and Abuse 00:27:37.460 --> 00:27:41.290 Act, which is codified at 18 USC 1030. 00:27:41.290 --> 00:27:47.020 And this is really the only computer crimes, general computer crimes, 00:27:47.020 --> 00:27:49.990 law that is on the books and talks about what 00:27:49.990 --> 00:27:53.710 it means to be a protected computer. 00:27:53.710 --> 00:27:57.430 And you'll be interested to know perhaps that any computer pretty much is 00:27:57.430 --> 00:27:58.780 a protected computer.
00:27:58.780 --> 00:28:02.320 The law specifically calls out government computers as well as 00:28:02.320 --> 00:28:04.990 any computer that may be involved in interstate commerce, 00:28:04.990 --> 00:28:08.200 which is you can imagine anybody who uses the internet, 00:28:08.200 --> 00:28:11.030 their computer then falls under the ambit of this act. 00:28:11.030 --> 00:28:13.030 So it's another interesting thing to take a look 00:28:13.030 --> 00:28:20.320 at if you're interested in how we deal with processing or prosecuting 00:28:20.320 --> 00:28:23.020 violations of computer-based crimes. 00:28:23.020 --> 00:28:26.330 All of it is actually sort of dealt with in the Computer Fraud and Abuse 00:28:26.330 --> 00:28:29.500 Act, which is not terribly long and hasn't been updated extensively 00:28:29.500 --> 00:28:32.150 since the 1980s other than some small amendments. 00:28:32.150 --> 00:28:34.150 So it's kind of interesting that we have not yet 00:28:34.150 --> 00:28:38.440 gotten to the point where we are defining and prosecuting 00:28:38.440 --> 00:28:42.400 specific types of computer crime, even though we've begun to figure out 00:28:42.400 --> 00:28:47.620 different types of computer crimes, such as DoS attacks, such as phishing, 00:28:47.620 --> 00:28:49.370 and so on. 00:28:49.370 --> 00:28:52.690 Now, hypothetically, a simple denial of service attack 00:28:52.690 --> 00:28:53.950 should be pretty easy to stop. 00:28:53.950 --> 00:28:59.230 And the reason for that is that there's only one person making the attack. 00:28:59.230 --> 00:29:03.130 All requests, recall, that happen over the internet happen via HTTP. 00:29:03.130 --> 00:29:07.585 And HTTP requires that the sender's IP address 00:29:07.585 --> 00:29:09.460 be part of that envelope that gets sent over, 00:29:09.460 --> 00:29:12.880 such that the server who wants to respond to the client, or the sender, 00:29:12.880 --> 00:29:13.980 can just reference. 
00:29:13.980 --> 00:29:14.980 It's the return address. 00:29:14.980 --> 00:29:17.438 You need to be able to know where to send the data back to. 00:29:17.438 --> 00:29:19.680 And so any request that is coming from-- 00:29:19.680 --> 00:29:21.430 there are thousands of requests that might 00:29:21.430 --> 00:29:23.680 be coming from a single IP address. 00:29:23.680 --> 00:29:27.490 If you see that happening, you can just decide as a server in the software 00:29:27.490 --> 00:29:31.570 to stop accepting requests from that address. 00:29:31.570 --> 00:29:34.360 DDoS attacks, distributed denial of service attacks, 00:29:34.360 --> 00:29:36.160 are much harder to stop. 00:29:36.160 --> 00:29:40.390 And it's exactly because of the fact that there is not a single source. 00:29:40.390 --> 00:29:42.880 If there's a single source, again, we would just completely 00:29:42.880 --> 00:29:48.250 stop accepting any requests of any type from that computer. 00:29:48.250 --> 00:29:51.370 However, because we have so many different computers to contend with, 00:29:51.370 --> 00:29:54.010 the options to handle this are a bit more limited. 00:29:54.010 --> 00:29:57.400 There are some techniques for averting them or stopping them 00:29:57.400 --> 00:30:01.960 once they are detected, however, the first of which is firewalling. 00:30:01.960 --> 00:30:04.270 So the idea of a firewall is we are only going 00:30:04.270 --> 00:30:06.700 to allow requests of a certain type. 00:30:06.700 --> 00:30:08.950 We're going to allow them from any IP address, 00:30:08.950 --> 00:30:11.950 but we're only going to accept them into this port. 00:30:11.950 --> 00:30:15.880 Recall that TCP/IP gives us the ability to say this service 00:30:15.880 --> 00:30:19.390 comes in via this port, so HTTP requests come in via port 80. 00:30:19.390 --> 00:30:24.360 HTTPS requests come in via port 443.
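The single-source case described above, where one address sends thousands of requests and the server simply stops listening to it, can be sketched roughly as follows. The threshold, the function names, and the sample addresses (all from documentation-only ranges) are assumptions for illustration, not a production rate limiter.

```python
from collections import Counter

# Hypothetical threshold: how many requests one address may send in a
# single counting window before the server stops accepting its traffic.
MAX_REQUESTS_PER_WINDOW = 100


def find_blocked(request_ips, limit=MAX_REQUESTS_PER_WINDOW):
    """Return the set of source addresses that exceeded the limit."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > limit}


def accept(ip, blocked):
    """Accept a request only if its source address is not blocked."""
    return ip not in blocked


# One noisy source and two ordinary visitors in the same window.
window = ["203.0.113.7"] * 5000 + ["198.51.100.1", "198.51.100.2"]
blocked = find_blocked(window)
```

This only works because there is one return address to block; against a botnet each zombie may stay under the threshold, which is why the distributed case is harder.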
00:30:24.360 --> 00:30:27.030 So imagine a distributed denial of service attack 00:30:27.030 --> 00:30:33.100 where typically the site would expect to be receiving requests on HTTPS. 00:30:33.100 --> 00:30:37.650 It generally only uses secured HTTP in order 00:30:37.650 --> 00:30:40.300 to process whatever requests are coming in. 00:30:40.300 --> 00:30:44.160 So it's expecting to receive a lot of traffic on port 443. 00:30:44.160 --> 00:30:47.970 And then all of a sudden a distributed denial of service attack 00:30:47.970 --> 00:30:51.930 begins and it's receiving lots of requests on port 80. 00:30:51.930 --> 00:30:55.440 One way to stop that attack before it starts to tie up resources 00:30:55.440 --> 00:30:57.540 is to just put a firewall up and say, I'm 00:30:57.540 --> 00:31:00.210 not actually going to accept any requests on port 80. 00:31:00.210 --> 00:31:03.650 And this may have a side effect of denying certain legitimate requests 00:31:03.650 --> 00:31:04.710 from getting through. 00:31:04.710 --> 00:31:07.920 But since the vast majority of the traffic that I receive on the site 00:31:07.920 --> 00:31:12.805 comes in via HTTPS on port 443, that's a small price to pay. 00:31:12.805 --> 00:31:15.180 I'd rather just allow the legitimate requests to come in. 00:31:15.180 --> 00:31:17.140 So that's one technique. 00:31:17.140 --> 00:31:19.950 Another technique is something called sinkholing. 00:31:19.950 --> 00:31:22.350 And it's exactly what you probably think it is. 00:31:22.350 --> 00:31:24.860 So a sinkhole, as you probably know, is a hole 00:31:24.860 --> 00:31:26.610 in the ground that swallows everything up. 00:31:26.610 --> 00:31:32.730 And a sinkhole in a digital context is a big black hole, basically, for data. 00:31:32.730 --> 00:31:34.890 It's just going to swallow up every single request 00:31:34.890 --> 00:31:36.960 and just not allow any of them out.
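The port-based firewall rule from the example above (keep port 443 open, refuse port 80) might look something like this minimal sketch. The packet dictionaries, the `dst_port` field name, and the addresses are hypothetical stand-ins; a real firewall works on actual network packets rather than Python dictionaries.

```python
# HTTPS only: port 80 (HTTP) is closed for the duration of the attack.
ALLOWED_PORTS = {443}


def firewall(packets, allowed_ports=ALLOWED_PORTS):
    """Keep packets addressed to an allowed port, from any source address."""
    return [p for p in packets if p["dst_port"] in allowed_ports]


traffic = [
    {"src": "198.51.100.1", "dst_port": 443},  # ordinary HTTPS visitor
    {"src": "203.0.113.7", "dst_port": 80},    # attack traffic on port 80
    {"src": "203.0.113.8", "dst_port": 80},    # attack traffic on port 80
]
passed = firewall(traffic)
```

Note that the rule never looks at the source address, which is the point: it filters by the kind of request rather than by who sent it, at the cost of dropping any legitimate port 80 visitors.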
00:31:36.960 --> 00:31:39.962 So this would, again, stop the denial of service attack 00:31:39.962 --> 00:31:41.670 because it's just taking all the requests 00:31:41.670 --> 00:31:44.190 and basically throwing them in the trash. 00:31:44.190 --> 00:31:48.120 This won't take down the website of the company that's being attacked, 00:31:48.120 --> 00:31:49.590 so that's a good thing. 00:31:49.590 --> 00:31:52.590 But it's also not going to allow any legitimate traffic of any type 00:31:52.590 --> 00:31:54.460 through, so that might be a bad thing. 00:31:54.460 --> 00:31:56.460 But depending on the length of the attack, if it 00:31:56.460 --> 00:31:59.520 seems like it's going to be short, if the requests trickle off 00:31:59.520 --> 00:32:02.670 and stop because the attackers realize, we're not making any progress, 00:32:02.670 --> 00:32:04.020 we're not actually doing-- 00:32:04.020 --> 00:32:06.510 we're not getting the results that we had hoped for, 00:32:06.510 --> 00:32:08.490 then perhaps they would give up. 00:32:08.490 --> 00:32:11.903 Then the sinkhole could be stopped and regular traffic 00:32:11.903 --> 00:32:13.320 could start to flow through again. 00:32:13.320 --> 00:32:16.590 So a sinkhole is basically just take all the traffic that comes in 00:32:16.590 --> 00:32:18.665 and just throw it in the trash. 00:32:18.665 --> 00:32:20.665 And then finally, another technique we could use 00:32:20.665 --> 00:32:22.950 is something called packet analysis. 00:32:22.950 --> 00:32:27.390 So again, HTTP we know is requests via the web. 00:32:27.390 --> 00:32:30.120 And we learned a little bit that we have headers 00:32:30.120 --> 00:32:33.060 that are packaged alongside those HTTP packets 00:32:33.060 --> 00:32:38.010 where the request originated from, where it's going to. 00:32:38.010 --> 00:32:40.440 There's a whole lot of other metadata as well. 
00:32:40.440 --> 00:32:44.250 You'll know, for example, what type of browser the individual is using 00:32:44.250 --> 00:32:46.290 and what operating system perhaps they are using 00:32:46.290 --> 00:32:50.950 and where, as in sort of a geographical generalization, are they. 00:32:50.950 --> 00:32:52.440 Are they in the US Northeast? 00:32:52.440 --> 00:32:55.350 Are they in South America and so on? 00:32:55.350 --> 00:32:59.160 Instead of deciding to restrict traffic via specific ports 00:32:59.160 --> 00:33:03.540 or just restrict all traffic, we could still allow all traffic to come in 00:33:03.540 --> 00:33:06.460 but inspect all of the packets as they come in. 00:33:06.460 --> 00:33:09.060 So for example, perhaps most of the traffic on our site we 00:33:09.060 --> 00:33:11.650 are expecting to come from the-- 00:33:11.650 --> 00:33:13.400 just because I used that example already-- 00:33:13.400 --> 00:33:14.700 US Northeast. 00:33:14.700 --> 00:33:16.650 And then all of a sudden we are experiencing 00:33:16.650 --> 00:33:20.640 tons of packets coming in that have IP addresses that all seem to be based-- 00:33:20.640 --> 00:33:24.050 or they have, as part of their packets, information 00:33:24.050 --> 00:33:25.800 that says that they're from South America, 00:33:25.800 --> 00:33:29.790 or they're from the US West Coast, or somewhere else that we don't expect. 00:33:29.790 --> 00:33:32.430 We can decide, after taking a quick look at that packet 00:33:32.430 --> 00:33:36.240 and analyzing those individual headers, that I'm not 00:33:36.240 --> 00:33:39.240 going to accept any packets from that location. 00:33:39.240 --> 00:33:42.970 The ones that match locations I'm expecting, I'll let through. 00:33:42.970 --> 00:33:45.948 And this, again, might prevent certain customers from getting through, 00:33:45.948 --> 00:33:48.990 certain legitimate customers who might actually be based in South America 00:33:48.990 --> 00:33:50.460 from getting through. 
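As a rough sketch of the packet analysis idea, the filter below lets all traffic arrive but inspects each packet and keeps only those from an expected region. The `region` field and the region labels are assumptions for illustration; in practice a location would typically be inferred from the source IP address using a geolocation database rather than read directly from the packet.

```python
# Hypothetical: the regions this site normally expects traffic from.
EXPECTED_REGIONS = {"US-Northeast"}


def inspect(packet, expected=EXPECTED_REGIONS):
    """Return True if the packet's (assumed) region field is an expected one."""
    return packet.get("region") in expected


packets = [
    {"src": "198.51.100.1", "region": "US-Northeast"},
    {"src": "203.0.113.7", "region": "South-America"},
    {"src": "203.0.113.8", "region": "US-West"},
    {"src": "198.51.100.2", "region": "US-Northeast"},
]
accepted = [p for p in packets if inspect(p)]
rejected = [p for p in packets if not inspect(p)]
```

The rejected list is where the trade-off shows up: a legitimate customer in South America would land there alongside the attack traffic.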
00:33:50.460 --> 00:33:54.980 But in general, it's going to block most of the damaging traffic. 00:33:54.980 --> 00:33:57.900 DDoS attacks are really frustrating for companies 00:33:57.900 --> 00:34:01.470 because they really can do a lot of damage. 00:34:01.470 --> 00:34:04.480 Usually the resources of the company will eventually-- especially 00:34:04.480 --> 00:34:08.280 if they're cloud-based and they rely on their cloud provider to help them 00:34:08.280 --> 00:34:12.290 scale up, usually the resources of the company being attacked 00:34:12.290 --> 00:34:14.699 are enough to eventually overwhelm and stop 00:34:14.699 --> 00:34:18.780 the attacker who usually has a much more limited set of resources. 00:34:18.780 --> 00:34:22.570 But again, depending on the type of business being attacked in this way-- 00:34:22.570 --> 00:34:25.580 again, think of the example of DYN, the DNS provider. 00:34:25.580 --> 00:34:27.330 The ramifications for one of these attacks 00:34:27.330 --> 00:34:31.350 can be really quite severe and really quite annoying and costly 00:34:31.350 --> 00:34:34.480 for a business that suffers it. 00:34:34.480 --> 00:34:38.050 So we just talked about HTTP and HTTPS a moment ago 00:34:38.050 --> 00:34:40.050 when we were talking about firewalling, allowing 00:34:40.050 --> 00:34:42.790 some traffic on some of the ports but not other ports, 00:34:42.790 --> 00:34:47.290 so maybe allowing HTTPS traffic but not HTTP traffic. 00:34:47.290 --> 00:34:51.120 Let's take a look at these two technologies in a bit more detail. 00:34:51.120 --> 00:34:54.330 So HTTP, again, is the hypertext transfer protocol. 00:34:54.330 --> 00:34:58.530 It is how hypertext or web pages are transmitted over the internet.
00:34:58.530 --> 00:35:04.530 If I am a client and I make a request to you for some HTML content, 00:35:04.530 --> 00:35:08.130 then you as a server would send a response back to me, 00:35:08.130 --> 00:35:11.550 and then I would be able to see the page that I had requested. 00:35:11.550 --> 00:35:17.090 And every HTTP request has a specific format at the beginning of it. 00:35:17.090 --> 00:35:24.560 For example, we might see something like this, GET /execed HTTP/1.1, host: 00:35:24.560 --> 00:35:25.790 law.harvard.edu. 00:35:25.790 --> 00:35:28.670 Let's just quickly pick these apart again one more time. 00:35:28.670 --> 00:35:31.910 If you see GET at the beginning of an HTTP request, 00:35:31.910 --> 00:35:36.680 it means please fetch or get for me, literally, this page. 00:35:36.680 --> 00:35:40.970 The page I'm requesting specifically is /execed. 00:35:40.970 --> 00:35:46.520 And the host that I'm asking it from is, in this case, law.harvard.edu. 00:35:46.520 --> 00:35:50.690 So basically what I'm saying here is please fetch for me, 00:35:50.690 --> 00:35:54.120 or retrieve for me, the HTML content that comprises 00:35:54.120 --> 00:36:00.410 http://law.harvard.edu/execed. 00:36:00.410 --> 00:36:05.990 And specifically I'm doing this using HTTP protocol version 1.1. 00:36:05.990 --> 00:36:08.270 We're still using version 1.1 even though I 00:36:08.270 --> 00:36:13.250 believe version 2.0 was defined almost 20 years ago now probably. 00:36:13.250 --> 00:36:17.030 And basically this is just HTTP's way of identifying 00:36:17.030 --> 00:36:19.040 how you're asking the question. 00:36:19.040 --> 00:36:23.540 So it's similar to me making a request and saying, oh, by the way, 00:36:23.540 --> 00:36:26.690 the rest of this request is written in French, or, oh, by the way, 00:36:26.690 --> 00:36:29.630 the rest of this request is written in Spanish.
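The pieces of that request can be picked apart in a few lines of code. This is a minimal sketch that handles only the example request from the lecture; the `parse_request` helper is a hypothetical name, and real servers use far more careful parsers.

```python
# The example request from the lecture, in its on-the-wire text form.
RAW_REQUEST = "GET /execed HTTP/1.1\r\nHost: law.harvard.edu\r\n\r\n"


def parse_request(raw):
    """Split a simple HTTP request into method, path, version and headers."""
    head = raw.split("\r\n\r\n", 1)[0]      # everything before the blank line
    lines = head.split("\r\n")
    method, path, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


method, path, version, headers = parse_request(RAW_REQUEST)

# Host plus path gives the page being asked for.
url = "http://" + headers["host"] + path
```

Here `method` is the verb (GET), `path` is the page (/execed), `version` identifies how the request is formatted (HTTP/1.1), and the host header says which server is being asked.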
00:36:29.630 --> 00:36:32.750 It's more like here are the parameters that you 00:36:32.750 --> 00:36:35.150 should expect to see because this request is 00:36:35.150 --> 00:36:39.540 in version 1.1, which differed non-trivially from version 1.0. 00:36:39.540 --> 00:36:45.590 So it's just an identifier for how exactly we are formatting our request. 00:36:45.590 --> 00:36:47.950 But HTTP is not encrypted. 00:36:47.950 --> 00:36:51.232 And so if we think about making a request to a server, 00:36:51.232 --> 00:36:52.940 if we're the client on the left and we're 00:36:52.940 --> 00:36:56.120 making a request to a server on the right, it might go something like this. 00:36:56.120 --> 00:37:00.530 Because the odds are pretty low that, if we're making a request, 00:37:00.530 --> 00:37:03.350 we are so close to the server that would serve 00:37:03.350 --> 00:37:05.660 that request to us that it wouldn't need to hop 00:37:05.660 --> 00:37:07.480 through any routers along the way. 00:37:07.480 --> 00:37:09.410 Remember, routers, their purpose in life is 00:37:09.410 --> 00:37:11.260 to send traffic in the right direction. 00:37:11.260 --> 00:37:13.350 And they contain a table of information that says, 00:37:13.350 --> 00:37:15.800 oh, if I'm making a request to some server over there, 00:37:15.800 --> 00:37:18.920 then the best path is to go here, and then I'll send it over there, 00:37:18.920 --> 00:37:20.890 and then it will send it there. 00:37:20.890 --> 00:37:23.480 Their job is to optimize and find the best path 00:37:23.480 --> 00:37:26.370 to get the request to where it needs to be. 00:37:26.370 --> 00:37:31.145 So if I'm initiating a request to, as the client, the server, 00:37:31.145 --> 00:37:33.020 it's going to first go through router A who's 00:37:33.020 --> 00:37:35.760 going to say, OK, I'm going to move it closer to the server 00:37:35.760 --> 00:37:38.960 so that it receives that request, goes to router B, goes to router C. 
00:37:38.960 --> 00:37:41.900 And eventually router C perhaps is close enough to the server 00:37:41.900 --> 00:37:45.380 that it can just hand off the request directly. 00:37:45.380 --> 00:37:48.568 The server's then going to get that request, read it as HTTP/1.1, 00:37:48.568 --> 00:37:51.860 look at all the other metadata inside of the request to see if there's anything 00:37:51.860 --> 00:37:55.030 else that it's being asked for, and then it's going to send the information 00:37:55.030 --> 00:37:55.530 back. 00:37:55.530 --> 00:37:57.620 And in this example I'm having it go back 00:37:57.620 --> 00:38:00.860 exactly through the same chain of routers but in reverse. 00:38:00.860 --> 00:38:02.540 But in reality, that might be different. 00:38:02.540 --> 00:38:04.430 It might not go through the exact same three 00:38:04.430 --> 00:38:06.620 routers in this example in reverse. 00:38:06.620 --> 00:38:12.110 It might actually go from C to A to B, back to A depending on traffic 00:38:12.110 --> 00:38:14.780 that's happening on the network and how congested things are 00:38:14.780 --> 00:38:19.310 and whether there might be a new path that is better in the amount of time 00:38:19.310 --> 00:38:23.210 it took to process the request that I asked for. 00:38:23.210 --> 00:38:25.880 But remember, HTTP, not secured. 00:38:25.880 --> 00:38:26.720 Not encrypted. 00:38:26.720 --> 00:38:29.000 This is plain, over-the-air communication. 00:38:29.000 --> 00:38:33.560 We saw previously, when we took a look at a screenshot 00:38:33.560 --> 00:38:36.530 from a tool called Wireshark, that it's not 00:38:36.530 --> 00:38:41.420 that difficult on an unsecured network using an unsecured protocol to read, 00:38:41.420 --> 00:38:44.150 literally, the contents of those packets going to and from. 00:38:44.150 --> 00:38:46.320 So that's a vulnerability here for sure. 
00:38:46.320 --> 00:38:48.980 Another vulnerability is any one of these computers 00:38:48.980 --> 00:38:51.060 along the way could be compromised. 00:38:51.060 --> 00:38:54.320 So for example, router A perhaps was infected 00:38:54.320 --> 00:38:57.510 by somebody who-- a router is just a computer as well. 00:38:57.510 --> 00:39:00.200 So perhaps it was infected by an adversary 00:39:00.200 --> 00:39:03.950 with some worm that will eventually make it part of some botnet, 00:39:03.950 --> 00:39:07.580 and it'll eventually start spamming some server somewhere. 00:39:07.580 --> 00:39:11.960 If router A is compromised in such a way that an adversary can just read all 00:39:11.960 --> 00:39:14.010 the traffic that flows through it-- and again, 00:39:14.010 --> 00:39:17.780 we're sending all of our traffic in an unencrypted fashion-- 00:39:17.780 --> 00:39:21.230 then we have another security loophole to deal with. 00:39:21.230 --> 00:39:27.440 So HTTPS resolves this problem by securing or encrypting 00:39:27.440 --> 00:39:32.150 all of the communications between a client and a server. 00:39:32.150 --> 00:39:33.762 So HTTP requests go to one port. 00:39:33.762 --> 00:39:34.970 We talked about that already. 00:39:34.970 --> 00:39:36.950 They go to port 80 by convention. 00:39:36.950 --> 00:39:40.790 HTTPS requests go to port 443 by convention. 00:39:40.790 --> 00:39:44.840 In order for HTTPS to work, the server is 00:39:44.840 --> 00:39:52.100 responsible for providing or possessing a valid what's called an SSL or TLS 00:39:52.100 --> 00:39:52.670 certificate. 00:39:52.670 --> 00:39:55.550 SSL is actually a deprecated technology now. 00:39:55.550 --> 00:39:58.070 It's been subsumed into TLS. 00:39:58.070 --> 00:40:01.580 But typically these things are still referred to as SSL certificates. 00:40:01.580 --> 00:40:04.430 And perhaps you've seen a screen that looks like this when 00:40:04.430 --> 00:40:05.990 you're trying to visit some website.
00:40:05.990 --> 00:40:08.240 You get a warning that your connection is not private. 00:40:08.240 --> 00:40:10.970 And at the very end of that warning, you are 00:40:10.970 --> 00:40:13.640 informed that the cert date is invalid. 00:40:13.640 --> 00:40:18.900 Basically this just means that their SSL certificate has expired. 00:40:18.900 --> 00:40:21.510 Now, what is an SSL certificate? 00:40:21.510 --> 00:40:27.000 So there are services that work alongside the internet called 00:40:27.000 --> 00:40:28.020 certificate authorities. 00:40:28.020 --> 00:40:32.520 And like GlobalSign, for example, from whom I borrowed the screenshots-- 00:40:32.520 --> 00:40:35.280 GoDaddy, who is also a very popular domain name provider, 00:40:35.280 --> 00:40:37.780 is also a certificate authority. 00:40:37.780 --> 00:40:42.600 And what they do is they verify that a particular website owns 00:40:42.600 --> 00:40:44.270 a particular private key-- 00:40:44.270 --> 00:40:48.230 or excuse me, a particular public key which has a corresponding private key. 00:40:48.230 --> 00:40:49.980 And the way they do that is they digitally 00:40:49.980 --> 00:40:51.928 sign something to the certificate authority. 00:40:51.928 --> 00:40:54.720 The certificate authority then goes through those exact same checks 00:40:54.720 --> 00:40:56.595 that we've seen before for digital signatures 00:40:56.595 --> 00:40:59.460 to verify that, yes, this person must own this public key. 00:40:59.460 --> 00:41:03.810 And the idea for this is we're trusting that, 00:41:03.810 --> 00:41:06.750 when I send a communication to you as the website 00:41:06.750 --> 00:41:12.120 owner using the public key that you say is yours, then it really is yours. 00:41:12.120 --> 00:41:16.110 There really is somebody out there or some third party 00:41:16.110 --> 00:41:19.530 that we've decided to collectively trust, the certificate authority, who 00:41:19.530 --> 00:41:20.670 is going to verify this. 
00:41:20.670 --> 00:41:23.100 Now, why does this matter? 00:41:23.100 --> 00:41:27.570 Why do we need to verify that someone's public key is what they say it is? 00:41:27.570 --> 00:41:31.032 Well, it turns out that this idea of asymmetric encryption, 00:41:31.032 --> 00:41:33.990 or public and private key cryptography that we've previously discussed, 00:41:33.990 --> 00:41:38.520 does form part of the core of HTTPS. 00:41:38.520 --> 00:41:43.200 But as we'll see in a moment, we don't actually use public and private keys 00:41:43.200 --> 00:41:47.100 to communicate except at the very, very beginning of our interaction 00:41:47.100 --> 00:41:52.680 with some site when we are using HTTPS. 00:41:52.680 --> 00:41:56.370 So the way this really happens underneath the hood 00:41:56.370 --> 00:42:00.780 is via the secure sockets layer, SSL, which is now known as the transport 00:42:00.780 --> 00:42:02.950 layer security overall protocol. 00:42:02.950 --> 00:42:06.270 There's other things that are folded into it, but SSL is part of it. 00:42:06.270 --> 00:42:09.210 And this is what happens. 00:42:09.210 --> 00:42:14.970 When I am requesting a page from you, and you are the server, 00:42:14.970 --> 00:42:18.540 and I am requesting this via HTTPS, I am going 00:42:18.540 --> 00:42:22.800 to initially make a request using the public key that I believe 00:42:22.800 --> 00:42:24.780 is yours because the certificate authority has 00:42:24.780 --> 00:42:30.395 vouched for you, saying that I would like to make an encrypted request. 00:42:30.395 --> 00:42:32.520 And I don't want to send that request over the air. 00:42:32.520 --> 00:42:34.145 I don't want to send that in the clear. 00:42:34.145 --> 00:42:37.110 I want to send it to you using the encryption that you say is yours. 00:42:37.110 --> 00:42:41.160 So I send a request to you, encrypting it using your public key. 00:42:41.160 --> 00:42:42.180 You receive the request.
00:42:42.180 --> 00:42:45.150 You decrypt it using your private key. 00:42:45.150 --> 00:42:48.900 You see, OK, I see now that Doug wants to initiate a request with me, 00:42:48.900 --> 00:42:51.300 and you're going to fulfill the request. 00:42:51.300 --> 00:42:53.610 But you're also going to do one other thing. 00:42:53.610 --> 00:42:57.420 You're going to set a key. 00:42:57.420 --> 00:43:00.270 And you're going to send me back a key, not 00:43:00.270 --> 00:43:04.322 your public or private key, a different key, alongside the request that I made. 00:43:04.322 --> 00:43:06.780 And you're going to send it back to me using my public key. 00:43:06.780 --> 00:43:10.620 So the initial volley of communications back and forth between us 00:43:10.620 --> 00:43:13.230 is the same as any other encrypted communication 00:43:13.230 --> 00:43:16.140 using public and private keys that we've previously seen. 00:43:16.140 --> 00:43:18.270 I send a message to you using your public key. 00:43:18.270 --> 00:43:20.040 You decrypt it using your private key. 00:43:20.040 --> 00:43:26.340 You respond to me using my public key, and I decrypt it using my private key. 00:43:26.340 --> 00:43:28.260 But this is really slow. 00:43:28.260 --> 00:43:34.780 If we're just having communications back and forth via mail or even via text, 00:43:34.780 --> 00:43:39.210 the difference of a few milliseconds is immaterial. 00:43:39.210 --> 00:43:41.450 We don't really notice it. 00:43:41.450 --> 00:43:44.757 But on the web, we do notice it, especially 00:43:44.757 --> 00:43:46.590 if we're making multiple requests or there's 00:43:46.590 --> 00:43:49.680 multiple packets going back and forth and every single one of them 00:43:49.680 --> 00:43:51.520 needs to be encrypted. 00:43:51.520 --> 00:43:55.650 So beyond this initial volley, public and private key encryption 00:43:55.650 --> 00:44:01.360 is no longer needed because it's no longer used, because it's too slow. 
00:44:01.360 --> 00:44:03.610 We would notice it if we did. 00:44:03.610 --> 00:44:09.150 Instead, as I mentioned, the server is going to respond with a key. 00:44:09.150 --> 00:44:11.205 And that key is the key to a cipher. 00:44:11.205 --> 00:44:14.910 And we've talked about ciphers before and we know that they are reversible. 00:44:14.910 --> 00:44:19.350 The particular cipher in question here is something called AES. 00:44:19.350 --> 00:44:20.520 But it is just a cipher. 00:44:20.520 --> 00:44:21.960 It is reversible. 00:44:21.960 --> 00:44:24.360 And the key that you receive is the key that you 00:44:24.360 --> 00:44:28.410 are supposed to use to decrypt all future communications. 00:44:28.410 --> 00:44:30.060 This key is called the session key. 00:44:30.060 --> 00:44:33.360 And you use it to decrypt all future communications 00:44:33.360 --> 00:44:37.230 and use it to encrypt all future communications to the server 00:44:37.230 --> 00:44:40.350 until the session, so-called, is terminated. 00:44:40.350 --> 00:44:43.320 And the session is basically as long as you're on the site 00:44:43.320 --> 00:44:46.770 and you haven't logged out or closed the window. 00:44:46.770 --> 00:44:48.240 That is the idea of a session. 00:44:48.240 --> 00:44:53.685 It is one singular experience with a page 00:44:53.685 --> 00:44:57.750 or with a set of pages that are all part of same domain name. 00:44:57.750 --> 00:45:00.960 We're just going to use a cipher for the rest of the time that we talk. 00:45:00.960 --> 00:45:03.932 Now, this may seem insecure for reasons we've 00:45:03.932 --> 00:45:05.640 talked about when we talked about ciphers 00:45:05.640 --> 00:45:07.470 and how they are inherently flawed. 00:45:07.470 --> 00:45:10.470 Recall that when we were talking about some of the really early ciphers, 00:45:10.470 --> 00:45:13.090 those are classic ciphers like Caesar and Vigenere, 00:45:13.090 --> 00:45:14.430 those are very easy to break. 
00:45:14.430 --> 00:45:17.630 AES is much more complex than that. 00:45:17.630 --> 00:45:22.080 And the other upside is that this key, like I mentioned, 00:45:22.080 --> 00:45:23.910 is only good for a session. 00:45:23.910 --> 00:45:29.040 So in the unlikely event that the server chooses a bad key, for example, if we 00:45:29.040 --> 00:45:32.490 think about it as if it was Caesar, if they choose a key of zero, 00:45:32.490 --> 00:45:35.240 which would be a very bad key, one that doesn't actually 00:45:35.240 --> 00:45:40.113 shift the letters at all, even if the key is compromised, 00:45:40.113 --> 00:45:41.780 it's only good for a particular session. 00:45:41.780 --> 00:45:44.240 That's not a very long amount of time. 00:45:44.240 --> 00:45:47.240 But the upside is the ability to encipher 00:45:47.240 --> 00:45:49.520 and decipher information is much faster. 00:45:49.520 --> 00:45:53.390 If it's reversible, it's pretty quick to do some mathematical manipulation 00:45:53.390 --> 00:45:57.140 and transform it into something that looks obscured and gibberish 00:45:57.140 --> 00:45:59.240 and to undo that as well. 00:45:59.240 --> 00:46:03.020 And so even though public and private keys are-- 00:46:03.020 --> 00:46:05.780 we consider effectively unbreakable, like to the point 00:46:05.780 --> 00:46:10.040 of it's mathematically untenable to crack a message using 00:46:10.040 --> 00:46:11.510 public and private key encryption, 00:46:11.510 --> 00:46:16.010 we don't rely on it for SSL because it is impractical to actually expect 00:46:16.010 --> 00:46:17.450 communications to go that slowly. 00:46:17.450 --> 00:46:19.610 And so we do fall back on these ciphers. 00:46:19.610 --> 00:46:24.260 And that really is when you're using secured encrypted communication 00:46:24.260 --> 00:46:26.270 via HTTPS.
00:46:26.270 --> 00:46:27.980 You're just relying on a cipher that just 00:46:27.980 --> 00:46:31.700 happens to be a very, very fancy cipher that should hypothetically 00:46:31.700 --> 00:46:36.060 be very difficult to figure out the key to as well. 00:46:36.060 --> 00:46:40.280 You may have also seen a few changes in your browser, especially recently. 00:46:40.280 --> 00:46:42.170 This screenshot shows a couple of changes 00:46:42.170 --> 00:46:48.080 that are designed to warn you when you are not using HTTPS encryption. 00:46:48.080 --> 00:46:51.980 And it's not necessary to use HTTPS for every interaction you 00:46:51.980 --> 00:46:53.480 have on the internet. 00:46:53.480 --> 00:46:56.750 For example, if you are going to a site that is purely informational, 00:46:56.750 --> 00:47:00.900 it's just static content, it's just a list of information, there's no login, 00:47:00.900 --> 00:47:05.190 there's no buying, there's no clicking on things that might then get tracked, 00:47:05.190 --> 00:47:08.280 for example, it's not really necessary to use HTTPS. 00:47:08.280 --> 00:47:11.630 So don't be necessarily alarmed if you visit a site 00:47:11.630 --> 00:47:14.180 and you're warned it's not secure. 00:47:14.180 --> 00:47:17.480 We're told that over time this will turn red and become perhaps even 00:47:17.480 --> 00:47:19.950 more concerning as more versions of this come out 00:47:19.950 --> 00:47:23.850 and as more and more adopters of HTTPS exist as well. 00:47:23.850 --> 00:47:25.850 But you're going to start getting notifications. 00:47:25.850 --> 00:47:27.725 And you may have seen these as well in green. 00:47:27.725 --> 00:47:29.870 If you are using HTTPS and you log into something, 00:47:29.870 --> 00:47:33.120 you'll see a little lock icon here and you'll be told that it is secure.
00:47:33.120 --> 00:47:35.570 And again, this is just because human beings 00:47:35.570 --> 00:47:40.460 tend not to be as concerned about their digital privacy 00:47:40.460 --> 00:47:43.430 and their digital security when using the internet. 00:47:43.430 --> 00:47:48.260 And now the technology is trying to provide clues and tips 00:47:48.260 --> 00:47:54.880 to entice you to be more concerned about these things. 00:47:54.880 --> 00:47:57.330 Now let's take a look at a couple of attacks 00:47:57.330 --> 00:47:59.640 that are derived from things we typically consider 00:47:59.640 --> 00:48:02.130 to be advantages of using the internet. 00:48:02.130 --> 00:48:07.050 The first of these is the idea of cross-site scripting, XSS. 00:48:07.050 --> 00:48:09.450 We've previously discussed this idea of the distinction 00:48:09.450 --> 00:48:11.700 between server-side code and client-side code. 00:48:11.700 --> 00:48:14.400 Client-side code, recall, is something that runs locally 00:48:14.400 --> 00:48:16.710 on our computer where our browser, for example, 00:48:16.710 --> 00:48:19.380 is expected to interpret and execute that code. 00:48:19.380 --> 00:48:22.000 Server-side code is run on the server. 00:48:22.000 --> 00:48:25.060 And when we get information from a server, 00:48:25.060 --> 00:48:27.630 we're not getting back the actual lines of code. 00:48:27.630 --> 00:48:31.028 We're getting back the output of that code having run in the first place. 00:48:31.028 --> 00:48:34.320 So for example, there might be some code on the server, some Python code or PHP 00:48:34.320 --> 00:48:38.220 code that generates HTML for us. 00:48:38.220 --> 00:48:42.570 The actual Python or PHP code in this example would be server-side code. 00:48:42.570 --> 00:48:44.430 We don't actually ever see that code. 00:48:44.430 --> 00:48:46.890 We only see the output of that code. 
00:48:46.890 --> 00:48:50.550 A cross-site scripting vulnerability exists when 00:48:50.550 --> 00:48:57.180 an adversary is able to trick a client's browser into running something locally. 00:48:57.180 --> 00:49:01.860 And it will do something that presumably the person, the client, 00:49:01.860 --> 00:49:04.965 didn't actually intend to do. 00:49:04.965 --> 00:49:07.590 Let's take a look at an example of this using a very simple web 00:49:07.590 --> 00:49:09.150 server called Flask. 00:49:09.150 --> 00:49:10.575 We have here some Python code. 00:49:10.575 --> 00:49:13.200 And don't be too worried if this doesn't all make sense to you. 00:49:13.200 --> 00:49:20.050 It's just a pretty short, simple web server that does two things. 00:49:20.050 --> 00:49:22.170 So this is just some bookkeeping stuff in Flask. 00:49:22.170 --> 00:49:26.460 And Flask is a package of Python that is used to create web servers. 00:49:26.460 --> 00:49:29.100 This web server has two things, though, that it does. 00:49:29.100 --> 00:49:34.350 The first is when I visit slash on my web server-- 00:49:34.350 --> 00:49:36.750 so let's say this is Doug's site. 00:49:36.750 --> 00:49:41.912 If I go to dougspage.com/, a slash you may not actually explicitly type anymore 00:49:41.912 --> 00:49:43.620 but most browsers just add, slash just 00:49:43.620 --> 00:49:47.730 means the root page of your server. 00:49:47.730 --> 00:49:50.430 I'm going to call the following function whose name happens 00:49:50.430 --> 00:49:52.440 to be index in this case. 00:49:52.440 --> 00:49:53.970 Return hello world. 00:49:53.970 --> 00:49:58.770 And what this basically means is if I visit dougspage.com/, 00:49:58.770 --> 00:50:05.730 what I receive is an HTML page whose content is just hello world. 00:50:05.730 --> 00:50:09.060 So it's just an HTML file that says hello world. 00:50:09.060 --> 00:50:11.730 Again, this code here is all server-side code. 00:50:11.730 --> 00:50:14.130 You don't actually see this code.
00:50:14.130 --> 00:50:18.933 You only see the output of this code, which is this here, this HTML. 00:50:18.933 --> 00:50:21.100 It's just a simple string in this case, but it would 00:50:21.100 --> 00:50:25.080 be interpreted by the browser as HTML. 00:50:25.080 --> 00:50:27.920 If, however, I get a 404-- 00:50:27.920 --> 00:50:31.470 a 404 is a not found error. It means the page I requested doesn't exist. 00:50:31.470 --> 00:50:35.370 And since I've only defined the behavior for literally one page, 00:50:35.370 --> 00:50:41.790 slash, the index page of my server, then I want to call this function not found. 00:50:41.790 --> 00:50:46.590 Return not found plus whatever page I tried to visit. 00:50:46.590 --> 00:50:50.550 So it basically is another very simple page, much like hello world here, 00:50:50.550 --> 00:50:53.980 where instead of saying hello world, it says not found. 00:50:53.980 --> 00:50:57.560 And then it also concatenates onto the very end of that whatever page 00:50:57.560 --> 00:50:59.760 I tried to visit. 00:50:59.760 --> 00:51:03.960 This is a major cross-site scripting vulnerability. 00:51:03.960 --> 00:51:05.640 And let's see why. 00:51:05.640 --> 00:51:10.920 Let's imagine I go to /foo, so dougspage.com/foo. 00:51:10.920 --> 00:51:14.130 Recall that our error handler function, which I've reproduced down here, 00:51:14.130 --> 00:51:17.330 will return not found /foo. 00:51:17.330 --> 00:51:18.330 Seems pretty reasonable. 00:51:18.330 --> 00:51:22.260 It seems like the behavior I expected or intended to have happen. 00:51:22.260 --> 00:51:24.970 But what about if I go to a page like this one? 00:51:24.970 --> 00:51:29.490 So this is what I literally type in the browser, dougspage.com/ angle bracket, 00:51:29.490 --> 00:51:36.450 script, angle bracket alert(hi) and then a closed script tag there. 00:51:36.450 --> 00:51:42.770 This script here looks a lot like HTML.
00:51:42.770 --> 00:51:47.640 And in fact, when the browser sees this, it will interpret it as HTML. 00:51:47.640 --> 00:51:53.340 And so I will get returned by visiting this page not found, and then everything 00:51:53.340 --> 00:51:57.150 here except for the leading slash, which means 00:51:57.150 --> 00:52:02.550 that when I receive this and my client is interpreting the HTML, 00:52:02.550 --> 00:52:05.502 I'm going to generate an alert. 00:52:05.502 --> 00:52:06.210 What is an alert? 00:52:06.210 --> 00:52:09.025 Well, if you've ever gone to a website and had a pop-up box display 00:52:09.025 --> 00:52:11.400 some information, you have to click OK or click X to make 00:52:11.400 --> 00:52:13.590 it go away, that's what an alert is. 00:52:13.590 --> 00:52:16.350 So I visit this page on my website, I've actually 00:52:16.350 --> 00:52:21.330 tricked my browser into giving me a JavaScript alert, 00:52:21.330 --> 00:52:23.850 or I've tricked whoever visits this page's browser 00:52:23.850 --> 00:52:26.070 into giving them a JavaScript alert. 00:52:26.070 --> 00:52:29.980 So that's probably not exactly a good thing. 00:52:29.980 --> 00:52:33.540 But it can get a little bit more nefarious than that. 00:52:33.540 --> 00:52:36.670 Let's instead imagine-- instead of having this be on my server, 00:52:36.670 --> 00:52:41.250 it might be easier to imagine it like this, that this is what I wrote. 00:52:41.250 --> 00:52:45.698 This script tag here's what I wrote into my Facebook profile, for example. 00:52:45.698 --> 00:52:48.240 So Facebook gives you the ability to write a short little bio 00:52:48.240 --> 00:52:49.500 about yourself. 00:52:49.500 --> 00:52:54.927 Let's imagine that my bio was this script document.write, image source, 00:52:54.927 --> 00:52:56.760 and then I have a hacker URL and everything. 00:52:56.760 --> 00:52:58.760 And imagine that I own hacker URL. 00:52:58.760 --> 00:53:04.800 So I own hacker URL and I wrote this in my Facebook profile.
00:53:04.800 --> 00:53:08.010 Assuming that Facebook did not defend against cross-site scripting 00:53:08.010 --> 00:53:11.740 attacks, which they do, but assuming that they did not, 00:53:11.740 --> 00:53:15.540 anytime somebody visited my profile, their browser 00:53:15.540 --> 00:53:19.810 would be forced to contend with this script tag here. 00:53:19.810 --> 00:53:20.310 Why? 00:53:20.310 --> 00:53:22.590 Because they're trying to visit my profile page. 00:53:22.590 --> 00:53:26.610 My profile page contains literally these characters which 00:53:26.610 --> 00:53:29.540 are going to be interpreted as HTML. 00:53:29.540 --> 00:53:33.990 And it's going to add document.write-- that's a JavaScript way of saying add 00:53:33.990 --> 00:53:38.490 the following line in addition to the HTML of the page-- 00:53:38.490 --> 00:53:44.700 image source equals hacker url?cookie= and then document.cookie. 00:53:44.700 --> 00:53:48.210 So imagine that I, again, control hacker URL. 00:53:48.210 --> 00:53:50.730 Presumably, as somebody who is running a website, 00:53:50.730 --> 00:53:54.810 I also maintain logs of every time somebody tries to access my website, 00:53:54.810 --> 00:53:57.960 what page on my site they're trying to visit. 00:53:57.960 --> 00:54:00.690 If somebody goes to my Facebook profile and executes this, 00:54:00.690 --> 00:54:06.270 I'm going to get notified via my hacker URL logs that somebody has tried to go 00:54:06.270 --> 00:54:12.560 to that page ?cookie= and then document.cookie. 00:54:12.560 --> 00:54:14.910 Now, document.cookie in this case, because this 00:54:14.910 --> 00:54:21.670 exists on my Facebook profile, is an individual's cookie for Facebook. 00:54:21.670 --> 00:54:24.000 So here what I am doing-- again, Facebook 00:54:24.000 --> 00:54:26.310 does defend against cross-site scripting attacks, 00:54:26.310 --> 00:54:28.230 so this can't actually happen on Facebook. 
00:54:28.230 --> 00:54:31.980 But assuming that they did not defend against them adequately, 00:54:31.980 --> 00:54:36.210 what I'm basically doing is getting told via my log 00:54:36.210 --> 00:54:38.520 that somebody tried to visit some page on my URL, 00:54:38.520 --> 00:54:41.400 but the page that they tried to visit, I'm 00:54:41.400 --> 00:54:46.170 plugging in and basically stealing the cookie that they use for Facebook. 00:54:46.170 --> 00:54:48.873 And a cookie, recall, is sort of like a hand stamp. 00:54:48.873 --> 00:54:50.790 It's basically me, instead of having to re-log 00:54:50.790 --> 00:54:53.602 into Facebook every time I want to use it, going up to Facebook 00:54:53.602 --> 00:54:54.310 and saying, here. 00:54:54.310 --> 00:54:56.070 You've already verified my identity. 00:54:56.070 --> 00:54:59.040 Just take a look at this, and you get let in. 00:54:59.040 --> 00:55:04.920 And now I hypothetically know someone else's Facebook cookie. 00:55:04.920 --> 00:55:07.890 And if I was clever, I could try and use that 00:55:07.890 --> 00:55:12.060 to change what my Facebook cookie is to that person's Facebook cookie. 00:55:12.060 --> 00:55:17.220 And then suddenly I'm able to log in and view their profile and act as them. 00:55:17.220 --> 00:55:19.290 This image tag here is just a clever trick 00:55:19.290 --> 00:55:24.150 because the idea is that it's trying to pull some resource from my site. 00:55:24.150 --> 00:55:25.060 It doesn't exist. 00:55:25.060 --> 00:55:27.270 I don't have a list of all the cookies on Facebook. 00:55:27.270 --> 00:55:32.040 But I'm being told that somebody is trying to access this URL on my site. 00:55:32.040 --> 00:55:34.950 So the image tag is just sort of a trick to force 00:55:34.950 --> 00:55:38.760 it to log something on my hacker URL. 00:55:38.760 --> 00:55:43.170 But the idea here is that I would be able to steal somebody's Facebook 00:55:43.170 --> 00:55:47.610 cookie where this attack's not well-defended against. 
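The error handler walked through above can be sketched without Flask as a plain Python function. This is only an illustration under stated assumptions: the function name `not_found` and the exact "Not Found: " wording are hypothetical stand-ins for what is on the slide, and the requested path is passed in directly rather than coming from a real web request.

```python
# Hypothetical stand-in for the lecture's 404 handler: it builds the page by
# concatenating whatever path was requested onto a fixed "Not Found: " string.
def not_found(path: str) -> str:
    # Vulnerable: the path is chosen by whoever wrote the link, and it is
    # echoed back verbatim into a page the browser will interpret as HTML.
    return "Not Found: " + path


harmless = not_found("foo")
malicious = not_found("<script>alert('hi')</script>")

print(harmless)
print(malicious)  # the script tag survives intact in the response body
```

Because the second response still contains a literal script tag, a browser rendering it as HTML would run the alert rather than display the text.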
00:55:47.610 --> 00:55:51.960 So what techniques can we use either for our own sites 00:55:51.960 --> 00:55:55.980 when we are running to avoid cross-site scripting vulnerabilities 00:55:55.980 --> 00:56:01.270 or to protect against cross-site scripting vulnerabilities? 00:56:01.270 --> 00:56:04.770 The first technique that we can use is to sanitize, so to speak, 00:56:04.770 --> 00:56:08.400 all of the inputs that come in to our page. 00:56:08.400 --> 00:56:10.610 So let's take a look at how exactly we might do this. 00:56:10.610 --> 00:56:13.500 So it turns out that there are things called 00:56:13.500 --> 00:56:19.080 HTML entities, which are other ways of representing certain characters in HTML 00:56:19.080 --> 00:56:22.950 that might be considered special or control characters, so things like, 00:56:22.950 --> 00:56:26.460 for example, this or this. 00:56:26.460 --> 00:56:29.610 Typically, when a browser sees a character left 00:56:29.610 --> 00:56:31.770 angle bracket or right angle bracket, it's 00:56:31.770 --> 00:56:37.740 going to automatically interpret that as some HTML that it should then process. 00:56:37.740 --> 00:56:39.930 So in the example I just showed a moment ago, 00:56:39.930 --> 00:56:44.130 I was using the fact that whenever it sees angle brackets with script 00:56:44.130 --> 00:56:47.050 around it, they're going to try and interpret whatever 00:56:47.050 --> 00:56:49.470 is between those tags as a script. 00:56:49.470 --> 00:56:52.920 One way for me to prevent that from being interpreted as a script 00:56:52.920 --> 00:56:58.800 is to call this or call this something else other than just left angle bracket 00:56:58.800 --> 00:57:00.130 and right angle bracket. 
00:57:00.130 --> 00:57:03.780 And it turns out that there are these things called HTML entities that 00:57:03.780 --> 00:57:08.250 can be used to refer to these characters instead, 00:57:08.250 --> 00:57:13.440 such that if I sanitize my input in such a way 00:57:13.440 --> 00:57:20.278 that every time somebody literally typed the character left angle bracket, 00:57:20.278 --> 00:57:23.070 I had written some code that automatically took that and changed it 00:57:23.070 --> 00:57:25.470 into ampersand lt;. 00:57:25.470 --> 00:57:29.440 And then every time somebody wrote a greater than character, 00:57:29.440 --> 00:57:35.670 or right angle bracket, I changed that in the code to ampersand gt;. 00:57:35.670 --> 00:57:40.170 Then when my page was responsible for processing or interpreting something, 00:57:40.170 --> 00:57:44.640 it wouldn't interpret this-- it would still display this character as a left 00:57:44.640 --> 00:57:47.580 angle bracket or less than-- that's what the lt stands for here-- 00:57:47.580 --> 00:57:49.290 or a right angle bracket, greater than. 00:57:49.290 --> 00:57:52.210 That's what the gt stands for there. 00:57:52.210 --> 00:57:55.960 It would literally just show those characters and not treat them as HTML. 00:57:55.960 --> 00:58:00.030 So that's the idea of what it means to sanitize input when we're talking 00:58:00.030 --> 00:58:04.510 about HTML entities, for example. 00:58:04.510 --> 00:58:08.160 Another thing that we could do is just disable JavaScript entirely. 00:58:08.160 --> 00:58:10.290 This would have some upsides and some downsides. 00:58:10.290 --> 00:58:13.440 The upside is you're pretty protected against cross-site scripting 00:58:13.440 --> 00:58:17.820 vulnerabilities because they're usually going to be introduced via JavaScript. 00:58:17.820 --> 00:58:20.100 The downside is JavaScript is pretty convenient. 00:58:20.100 --> 00:58:20.670 It's nice. 00:58:20.670 --> 00:58:22.770 It makes for a better user experience. 
00:58:22.770 --> 00:58:24.930 Sometimes there might be parts of our page 00:58:24.930 --> 00:58:29.040 that just don't work if JavaScript is completely disabled, 00:58:29.040 --> 00:58:30.540 and so trade-offs there. 00:58:30.540 --> 00:58:33.360 You're protecting yourself, but you might be doing 00:58:33.360 --> 00:58:37.050 other sorts of non-material damage. 00:58:37.050 --> 00:58:40.142 Or we could decide to just handle the JavaScript in a special way. 00:58:40.142 --> 00:58:41.850 So for example, we might not allow what's 00:58:41.850 --> 00:58:44.940 called inline JavaScript, for example, like the script tags 00:58:44.940 --> 00:58:46.470 that I just showed a moment ago. 00:58:46.470 --> 00:58:50.010 But we might allow JavaScripts written in separate JavaScript files 00:58:50.010 --> 00:58:52.870 which can also be linked into your HTML pages. 00:58:52.870 --> 00:58:56.280 So those would be allowed, but inline JavaScript, like what we just saw, 00:58:56.280 --> 00:58:57.690 would not be allowed. 00:58:57.690 --> 00:59:01.890 We could sandbox the JavaScript and run it separately somewhere else first 00:59:01.890 --> 00:59:06.210 to see if it does something weird, and if it doesn't do something weird, 00:59:06.210 --> 00:59:08.580 then allow it to be displayed. 00:59:08.580 --> 00:59:12.390 We could also execute the content security policy. 00:59:12.390 --> 00:59:15.570 Content security policy is another header 00:59:15.570 --> 00:59:20.370 that we can add to our HTML pages or HTTP responses. 00:59:20.370 --> 00:59:22.350 And we can define certain behavior to happen 00:59:22.350 --> 00:59:25.800 such that will allow certain lines or certain types of JavaScript through 00:59:25.800 --> 00:59:28.167 but not others. 
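Two of the defenses described above, sanitizing input with HTML entities and sending a content security policy header, can be sketched with the Python standard library. This is a minimal sketch, not the lecture's own code: `not_found_escaped` and `csp_headers` are hypothetical names, and the policy value shown is just one plausible choice that permits scripts loaded from the site's own files while blocking inline script tags.

```python
import html


# Hypothetical, sanitized version of a "not found" handler: the requested
# path is escaped before it is concatenated into the page.
def not_found_escaped(path: str) -> str:
    # html.escape turns < into &lt; and > into &gt; (and & into &amp;),
    # so the browser displays the characters instead of treating them as tags.
    return "Not Found: " + html.escape(path)


page = not_found_escaped("<script>alert('hi')</script>")
print(page)

# Assumed example of a Content-Security-Policy response header. With
# script-src 'self', scripts from the site's own origin are allowed,
# while inline <script> blocks are not executed.
csp_headers = {"Content-Security-Policy": "script-src 'self'"}
print(csp_headers)
```

The escaped page still shows the angle brackets to the reader, but they arrive as entities, so nothing in the path is interpreted as a script.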
00:59:28.167 --> 00:59:30.000 Now, there's another type of attack that can 00:59:30.000 --> 00:59:34.800 be used that relies heavily on the fact that we use cookies so extensively, 00:59:34.800 --> 00:59:40.650 and that is a cross-site request forgery, or a CSRF. 00:59:40.650 --> 00:59:43.680 Now, cross-site scripting attacks generally 00:59:43.680 --> 00:59:48.840 involve receiving some content and the client's browser 00:59:48.840 --> 00:59:53.610 being tricked into doing something locally that it didn't want to do. 00:59:53.610 --> 00:59:58.170 In a CSRF request, or CSRF attack, rather, 00:59:58.170 --> 01:00:02.430 the trick is we're relying on the fact that there 01:00:02.430 --> 01:00:04.980 is a cookie that can be exploited to make 01:00:04.980 --> 01:00:11.595 an outbound request, an outbound HTTP request that we did not intend to make. 01:00:11.595 --> 01:00:13.470 And again, this relies extensively on cookies 01:00:13.470 --> 01:00:18.300 because they are this shorthand, short-form way to log into something. 01:00:18.300 --> 01:00:22.230 And we can make a fraudulent request appear legitimate 01:00:22.230 --> 01:00:24.480 if we can rely on someone's cookie. 01:00:24.480 --> 01:00:28.110 Now, again, if you ever use a cloud service for example, 01:00:28.110 --> 01:00:31.560 they're going to have CSRF defenses built into them. 01:00:31.560 --> 01:00:33.780 This is really if you're building a simple site 01:00:33.780 --> 01:00:35.368 and you don't defend against this. 01:00:35.368 --> 01:00:38.160 Flask, for example, does not defend against this particularly well, 01:00:38.160 --> 01:00:40.568 but Flask is a very simple web framework for servers. 01:00:40.568 --> 01:00:43.110 They're generally going to be much more complicated than that 01:00:43.110 --> 01:00:46.620 and have much more additional functionality to be more featureful.
01:00:46.620 --> 01:00:48.840 So let's walk through what these cross-site request 01:00:48.840 --> 01:00:50.280 forgeries might look like. 01:00:50.280 --> 01:00:53.820 And for context, let's imagine that I send you an email 01:00:53.820 --> 01:00:56.137 asking you to click on some URL. 01:00:56.137 --> 01:00:57.720 So you're going to click on this link. 01:00:57.720 --> 01:00:59.820 It's going to redirect you to some page. 01:00:59.820 --> 01:01:02.310 Maybe that page looks something like this. 01:01:02.310 --> 01:01:04.470 It's pretty simple, not much going on here. 01:01:04.470 --> 01:01:05.320 I have a body. 01:01:05.320 --> 01:01:07.500 And inside of it I have one more link. 01:01:07.500 --> 01:01:15.422 And the link is http://hackbank.com/transfer?to=doug&amt=500. 01:01:15.422 --> 01:01:18.630 Now, perhaps you don't hover over it and see the link at the beginning of it. 01:01:18.630 --> 01:01:20.960 But maybe you are a customer of Hack Bank. 01:01:20.960 --> 01:01:24.480 And maybe I know that you're a customer of Hack Bank such that if you click 01:01:24.480 --> 01:01:28.290 on this link and if you happen to be logged in, and if you happen to have 01:01:28.290 --> 01:01:32.730 your cookie set for hackbank.com, and this was the way that they actually 01:01:32.730 --> 01:01:37.650 executed transfers, by having you go to /transfer and say to whom you want 01:01:37.650 --> 01:01:40.200 to send money and in what amount-- 01:01:40.200 --> 01:01:42.938 And fortunately, most banks don't actually do this. 01:01:42.938 --> 01:01:46.230 Usually, if you're going to do something that manipulates the database, as this 01:01:46.230 --> 01:01:48.938 would, because it's going to be transferring some amount of money 01:01:48.938 --> 01:01:51.930 somewhere, that would be via an HTTP POST request-- 01:01:51.930 --> 01:01:55.530 this is just a straightforward GET request I'm making here.
01:01:55.530 --> 01:01:57.722 If you were logged in, though, to Hack Bank, 01:01:57.722 --> 01:01:59.430 or if your cookie for Hack Bank was set 01:01:59.430 --> 01:02:03.555 and you clicked on this link, hypothetically, a transfer of $500-- 01:02:03.555 --> 01:02:05.430 again, assuming that this was how you did it, 01:02:05.430 --> 01:02:07.740 you specified a person and you specified an amount-- 01:02:07.740 --> 01:02:13.288 would be transferred from your account to presumably my account. 01:02:13.288 --> 01:02:15.330 That's probably not something you intended to do. 01:02:15.330 --> 01:02:18.867 So that would be an example of why this is a cross-site request forgery. 01:02:18.867 --> 01:02:19.950 It's a legitimate request. 01:02:19.950 --> 01:02:23.130 It appears that you intended to do this because it came from you. 01:02:23.130 --> 01:02:24.330 It's using your cookie. 01:02:24.330 --> 01:02:28.090 But you didn't actually intend for it to happen. 01:02:28.090 --> 01:02:29.460 Here's another example. 01:02:29.460 --> 01:02:32.260 You click on the link in my email and you get brought to this page. 01:02:32.260 --> 01:02:35.250 So there's not actually even a second link to click anymore. 01:02:35.250 --> 01:02:37.410 Now it's just trying to load an image. 01:02:37.410 --> 01:02:40.660 Now, looking at this URL, we can tell there's not an image there. 01:02:40.660 --> 01:02:43.920 It doesn't end in .jpeg or .png or the like. 01:02:43.920 --> 01:02:45.540 It's the same URL as before. 01:02:45.540 --> 01:02:49.397 But my browser sees image source equals something and says, 01:02:49.397 --> 01:02:51.480 well, I'm at least going to try and go to that URL 01:02:51.480 --> 01:02:55.040 and see if there is an image there to load for you. 01:02:55.040 --> 01:02:57.710 Again, you just click on the link in the email. 01:02:57.710 --> 01:03:00.140 This page loads.
01:03:00.140 --> 01:03:03.320 My browser tries to go to this page, or your browser in this case 01:03:03.320 --> 01:03:06.230 tries to go to this page to load the image there. 01:03:06.230 --> 01:03:10.910 But in so doing, it's, again, executing this unintended transfer, 01:03:10.910 --> 01:03:14.750 relying on your cookie at hackbank.com. 01:03:14.750 --> 01:03:17.120 Another example of this might be a form. 01:03:17.120 --> 01:03:20.120 So again, it appears that you click on the link in the email. 01:03:20.120 --> 01:03:23.870 You get brought to a form that just has now just a button at the bottom of it 01:03:23.870 --> 01:03:24.892 that says Click Here. 01:03:24.892 --> 01:03:26.600 And the reason it just has a button, even 01:03:26.600 --> 01:03:31.990 though there's other stuff written, is that those first two fields are hidden. 01:03:31.990 --> 01:03:35.000 They are type equals hidden, which means you wouldn't actually 01:03:35.000 --> 01:03:37.040 see them when you load your browser. 01:03:37.040 --> 01:03:40.160 Now, contrast this, for example, with a field 01:03:40.160 --> 01:03:43.340 whose type is text, which you might see if you're doing a straightforward 01:03:43.340 --> 01:03:44.090 login. 01:03:44.090 --> 01:03:48.020 You would type characters in and see the actual characters appear. 01:03:48.020 --> 01:03:50.660 That's text versus a password field where you would 01:03:50.660 --> 01:03:52.580 type characters in and see all stars. 01:03:52.580 --> 01:03:55.640 It would visually obscure what you typed. 01:03:55.640 --> 01:03:58.760 The action of this form, or so to say where 01:03:58.760 --> 01:04:02.313 the form-- what happens when you click on the Submit button at the bottom 01:04:02.313 --> 01:04:03.230 is the same as before. 01:04:03.230 --> 01:04:06.140 It's hackbank.com/transfer. 01:04:06.140 --> 01:04:07.970 And then I'm using these parameters here; 01:04:07.970 --> 01:04:13.550 to Doug, the amount of $500, Click Here. 
01:04:13.550 --> 01:04:17.090 Now notice also I actually am using a POST request 01:04:17.090 --> 01:04:19.500 to try to initiate this transfer, again, assuming 01:04:19.500 --> 01:04:24.380 that this was how Hack Bank structured transfer requests in this way. 01:04:24.380 --> 01:04:27.650 So if you clicked here and this was otherwise validly structured 01:04:27.650 --> 01:04:31.340 and you were logged in, or your cookie was valid for Hack Bank, 01:04:31.340 --> 01:04:33.800 then this would initiate a transfer of $500. 01:04:33.800 --> 01:04:37.850 And I can play another similar trick to what I did a moment ago with the image 01:04:37.850 --> 01:04:43.070 by doing something like this where, when the page is loaded, 01:04:43.070 --> 01:04:44.435 instantly submit this form. 01:04:44.435 --> 01:04:46.310 So you don't even have to click here anymore. 01:04:46.310 --> 01:04:47.630 It's just going to go through the document, 01:04:47.630 --> 01:04:50.780 document being JavaScript's way of referring to the entire web page, 01:04:50.780 --> 01:04:53.600 find the first form, forms[0], assuming 01:04:53.600 --> 01:04:57.380 this is the first form on the page, and just submit it. 01:04:57.380 --> 01:04:59.840 Doesn't matter what else is going on. 01:04:59.840 --> 01:05:00.860 Just submit this form. 01:05:00.860 --> 01:05:06.110 That would also initiate transfer if you clicked on that link from my email. 01:05:06.110 --> 01:05:10.010 So a quick summary of these two different types of attacks. 01:05:10.010 --> 01:05:12.740 Cross-site scripting attacks, the adversary 01:05:12.740 --> 01:05:16.940 tricks you into executing code on your browser to do something locally 01:05:16.940 --> 01:05:19.070 that you probably did not intend.
01:05:19.070 --> 01:05:22.280 And a cross-site request forgery, something 01:05:22.280 --> 01:05:27.320 that appears to be a legitimate request from your browser 01:05:27.320 --> 01:05:31.220 because it's relying on cookies, you're ostensibly logged in in that way, 01:05:31.220 --> 01:05:35.670 but you don't actually mean to make that request. 01:05:35.670 --> 01:05:37.670 Now let's talk about a couple of vulnerabilities 01:05:37.670 --> 01:05:40.340 that exist in the context of a database, which I 01:05:40.340 --> 01:05:42.600 know you've discussed recently as well. 01:05:42.600 --> 01:05:46.170 So imagine that I have a table of users on my database 01:05:46.170 --> 01:05:49.580 that looks like this, that each of them has an ID number, they have a username, 01:05:49.580 --> 01:05:51.170 and they have a password. 01:05:51.170 --> 01:05:53.630 Now, the obvious vulnerability here is I really 01:05:53.630 --> 01:05:57.800 shouldn't be storing my users' passwords like this in the clear. 01:05:57.800 --> 01:06:01.370 If somebody were to ever hack and get a hold of this database file, 01:06:01.370 --> 01:06:03.020 that's really, really bad. 01:06:03.020 --> 01:06:08.740 I am not taking best practices to protect my customers' information. 01:06:08.740 --> 01:06:09.990 So I want to avoid doing that. 01:06:09.990 --> 01:06:14.060 So instead what I might do, as we've discussed, is hash their passwords, 01:06:14.060 --> 01:06:17.540 run them through some hash function so that when they're actually stored, 01:06:17.540 --> 01:06:19.880 they get stored looking something like this. 01:06:19.880 --> 01:06:23.120 You have no idea what the original password was. 01:06:23.120 --> 01:06:25.050 And because it's a hash, it's irreversible. 01:06:25.050 --> 01:06:28.280 You should not be able to undo what I did 01:06:28.280 --> 01:06:30.390 when I ran through the hash function. 01:06:30.390 --> 01:06:33.560 But there's actually still a vulnerability here.
01:06:33.560 --> 01:06:35.840 And the vulnerability here is not technical. 01:06:35.840 --> 01:06:38.570 It's human again. 01:06:38.570 --> 01:06:41.785 And the vulnerability that exists here is that we see-- 01:06:41.785 --> 01:06:43.910 we're using a hash function, so it's deterministic. 01:06:43.910 --> 01:06:47.300 When we pass some data through it, we're going to get the same output every time 01:06:47.300 --> 01:06:48.810 we pass data through it. 01:06:48.810 --> 01:06:53.900 And two of our users, Charlie and Eric, have the same hash. 01:06:53.900 --> 01:06:56.390 We saw this makes sense, because if we go back a moment, 01:06:56.390 --> 01:06:59.840 they also had the same actual password when it was stored in plain text. 01:06:59.840 --> 01:07:03.530 We've gone out of our way to try and defend against that by hashing it. 01:07:03.530 --> 01:07:06.860 But somebody who gets a hold of this database file, for example, 01:07:06.860 --> 01:07:11.750 they hack into it, they get it, they'll see two people have the same password. 01:07:11.750 --> 01:07:14.540 And maybe this is a very small subset of my user base. 01:07:14.540 --> 01:07:17.150 And maybe there's hundreds of thousands of people. 01:07:17.150 --> 01:07:20.720 And maybe 10% of them all have the same hash. 01:07:20.720 --> 01:07:26.670 Well, again, human beings, we are not the best at defending our own stuff. 01:07:26.670 --> 01:07:29.090 It's a sad truth that the most common password 01:07:29.090 --> 01:07:32.997 is password followed by some of these other examples we had a second ago. 01:07:32.997 --> 01:07:34.580 All of these are pretty bad passwords. 01:07:34.580 --> 01:07:38.990 They're all on the list of some of the most commonly used passwords 01:07:38.990 --> 01:07:42.920 for all services, which means that if you see a hash like this, 01:07:42.920 --> 01:07:45.620 it doesn't matter that we have taken steps 01:07:45.620 --> 01:07:49.130 to protect our users against this. 
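The determinism described here is easy to see in a few lines. This is a minimal sketch, assuming SHA-256 as the hash function (the lecture does not name one) and using made-up passwords; the usernames Charlie and Eric echo the example, and the other two rows are hypothetical filler.

```python
import hashlib
from collections import Counter


def toy_hash(password):
    """Plain SHA-256, chosen here only to illustrate determinism."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Hypothetical table: every password below is invented for the example.
users = {
    "alice": "hunter2",
    "bob": "correct horse",
    "charlie": "password",
    "eric": "password",
}

stored = {name: toy_hash(pw) for name, pw in users.items()}

# Same input, same output: Charlie and Eric end up with identical hashes.
print(stored["charlie"] == stored["eric"])

# Anyone holding the table can count how often each hash value appears.
counts = Counter(stored.values())
shared = [name for name, digest in stored.items() if counts[digest] > 1]
print(shared)
```

The hashes themselves reveal nothing about the passwords, but the repetition does: it tells a reader of the table which accounts share a password.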
01:07:49.130 --> 01:07:55.700 If we see a hash like this many, many times in our database, a clever hacker, 01:07:55.700 --> 01:07:58.732 a clever adversary might think, oh, well, 01:07:58.732 --> 01:08:00.440 I'm seeing this password 10% of the time, 01:08:00.440 --> 01:08:04.400 so I'm going to guess that Charlie's password for the service is 12345 01:08:04.400 --> 01:08:05.330 and they're wrong. 01:08:05.330 --> 01:08:08.480 And then they'll maybe try abcdef and they're wrong, and then maybe try 01:08:08.480 --> 01:08:10.520 password and they're right. 01:08:10.520 --> 01:08:13.910 And then all of a sudden every time they see that hash, they 01:08:13.910 --> 01:08:18.090 can assume that the password is password for every single one of those users. 01:08:18.090 --> 01:08:24.960 So again, nothing we can do as technologists to solve this problem. 01:08:24.960 --> 01:08:29.510 This is really just getting folks to understand 01:08:29.510 --> 01:08:33.276 that using different passwords, using non-standard passwords, 01:08:33.276 --> 01:08:34.109 is really important. 01:08:34.109 --> 01:08:37.067 That's why we talked about password managers and maybe not even knowing 01:08:37.067 --> 01:08:41.160 your own passwords in a prior lecture. 01:08:41.160 --> 01:08:45.140 There's another problem that can exist, though, with databases, in particular, 01:08:45.140 --> 01:08:47.120 when we see screens like this. 
01:08:47.120 --> 01:08:51.560 So this is a contrived login screen that has a username and password 01:08:51.560 --> 01:08:55.220 field and a Forgot Password button whose purpose in life 01:08:55.220 --> 01:08:59.149 is, if you type in your email address and you-- 01:08:59.149 --> 01:09:01.189 which is the username in this case, and you 01:09:01.189 --> 01:09:05.510 have the Forgot Password box checked, and you try and click login, 01:09:05.510 --> 01:09:09.418 instead of actually logging you in, it's going to email you, hopefully, 01:09:09.418 --> 01:09:11.960 a link to your password, not your actual password for reasons 01:09:11.960 --> 01:09:14.970 we previously discussed as well. 01:09:14.970 --> 01:09:20.640 But what if when we click on this button we see this? 01:09:20.640 --> 01:09:22.310 OK. 01:09:22.310 --> 01:09:25.520 We've emailed you a link to change your password. 01:09:25.520 --> 01:09:29.660 Does that seem inherently problematic? 01:09:29.660 --> 01:09:30.479 Perhaps not. 01:09:30.479 --> 01:09:34.600 But what about if you see this as well? 01:09:34.600 --> 01:09:37.100 Somebody might see this if they're logged in as well. 01:09:37.100 --> 01:09:40.490 Sorry, no user with that email address. 01:09:40.490 --> 01:09:44.870 Does that perhaps seem problematic when you compare it against this? 01:09:44.870 --> 01:09:48.350 This is an example of something called information leakage. 01:09:48.350 --> 01:09:51.710 Perhaps an adversary has hacked some other database 01:09:51.710 --> 01:09:55.040 where folks were not being as secure with credentials. 01:09:55.040 --> 01:09:58.970 And so they have a whole set of email addresses mapped to credentials.
01:09:58.970 --> 01:10:02.570 And because human beings tend to reuse the same credentials 01:10:02.570 --> 01:10:06.650 on multiple different services, they are trying different services 01:10:06.650 --> 01:10:09.170 that they believe that these users might also 01:10:09.170 --> 01:10:13.550 use using those same username and password combinations. 01:10:13.550 --> 01:10:18.860 If this is the way that we field these types of forgot password inquiries, 01:10:18.860 --> 01:10:22.130 we're revealing some information potentially. 01:10:22.130 --> 01:10:27.650 If Alice is a user, we're now saying, yes, Alice is a user of this. 01:10:27.650 --> 01:10:29.300 Try this password. 01:10:29.300 --> 01:10:34.490 If we get something like this, then the adversary might not bother trying. 01:10:34.490 --> 01:10:37.820 They've realized, oh, Alice is not a user of this service. 01:10:37.820 --> 01:10:41.720 And even if they're not trying to hack into it, if we do something like this, 01:10:41.720 --> 01:10:45.230 we're also telling that adversary quite a bit about Alice. 01:10:45.230 --> 01:10:49.340 Now we know Alice uses this service, and this service, and this service, 01:10:49.340 --> 01:10:50.600 and not this service. 01:10:50.600 --> 01:10:54.050 And they can sort of create a picture of who Alice might be. 01:10:54.050 --> 01:11:00.398 They're sort of using her digital footprint to understand more about her. 01:11:00.398 --> 01:11:03.190 A better response in this case might be to say something like this, 01:11:03.190 --> 01:11:04.550 request received. 01:11:04.550 --> 01:11:07.702 If you're in our system, you'll receive an email with instructions shortly. 01:11:07.702 --> 01:11:09.410 That's not tipping our hand either way as 01:11:09.410 --> 01:11:12.890 to whether the user is in the database or not in the database. 01:11:12.890 --> 01:11:15.860 No information leakage here, and generally a better way 01:11:15.860 --> 01:11:19.610 to protect our customer's privacy. 
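The neutral response described above can be sketched as a handler that does different work internally but always answers the same way. This is an illustration under assumptions: the set of registered addresses, the outbox list standing in for an email sender, and the function name forgot_password are all hypothetical.

```python
# Hypothetical set of registered addresses and a stand-in for sending mail.
REGISTERED = {"alice@example.com"}
outbox = []

NEUTRAL_REPLY = (
    "Request received. If you're in our system, "
    "you'll receive an email with instructions shortly."
)


def forgot_password(email):
    """Queue a reset email only for known users, but always reply the same way."""
    if email in REGISTERED:
        outbox.append(email)  # a real system would send a reset link here
    return NEUTRAL_REPLY


known = forgot_password("alice@example.com")
unknown = forgot_password("mallory@example.com")
print(known == unknown)
print(outbox)
```

Because the two replies are identical, the page no longer confirms or denies that a given address belongs to a user.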
01:11:19.610 --> 01:11:22.850 Now, that's not the only problem that we can have with databases. 01:11:22.850 --> 01:11:25.610 We've alluded to this idea of SQL injection. 01:11:25.610 --> 01:11:28.100 And there's this comic that makes the rounds quite a bit 01:11:28.100 --> 01:11:30.620 when we talk about SQL injection from a web comic called 01:11:30.620 --> 01:11:35.240 XKCD that involves a SQL injection attack, which is basically 01:11:35.240 --> 01:11:39.080 providing some information that-- 01:11:39.080 --> 01:11:42.670 or providing some text or some query that we want to make to a database 01:11:42.670 --> 01:11:46.690 where that query actually does something unintended. 01:11:46.690 --> 01:11:50.700 It actually itself is SQL as opposed to just plugging in some parameter, 01:11:50.700 --> 01:11:53.750 like what is your name, and then searching the database for that name. 01:11:53.750 --> 01:11:55.708 Instead of giving you my name, I might give you 01:11:55.708 --> 01:11:58.040 something that is actually a SQL query that's 01:11:58.040 --> 01:12:01.050 going to be executed that you don't want me to execute. 01:12:01.050 --> 01:12:03.750 So let's see an example of how this might work. 01:12:03.750 --> 01:12:07.800 So here's another simple username and password field. 01:12:07.800 --> 01:12:11.580 And in this example, I've written my password field poorly intentionally 01:12:11.580 --> 01:12:14.000 for purposes of the example so that it will actually 01:12:14.000 --> 01:12:16.970 show you the text that is typed as opposed to showing 01:12:16.970 --> 01:12:19.640 you stars like a password field should. 01:12:19.640 --> 01:12:23.300 So this is something that the user sees when they access my site. 01:12:23.300 --> 01:12:26.718 And perhaps on the back end in the server-side code, inside of Python 01:12:26.718 --> 01:12:29.510 somewhere I have written a SQL query that looks like the following.
01:12:29.510 --> 01:12:35.540 When the login button is clicked, execute the following SQL query. 01:12:35.540 --> 01:12:40.040 SELECT star from users where username equals uname-- 01:12:40.040 --> 01:12:45.230 and uname here in yellow referring to whatever was typed in this box-- 01:12:45.230 --> 01:12:48.050 and password equals pword, where, again, pword 01:12:48.050 --> 01:12:51.140 is referring to whatever was typed in this box. 01:12:51.140 --> 01:12:54.120 So we're doing a SQL query to select star from users, 01:12:54.120 --> 01:12:57.360 get all of the information from the users table 01:12:57.360 --> 01:13:01.170 where the username equals whatever they typed in that box 01:13:01.170 --> 01:13:05.560 and the password equals whatever they typed in that box. 01:13:05.560 --> 01:13:07.410 And so, for example, if I have somebody who 01:13:07.410 --> 01:13:09.810 logs in with the username Alice and the password 01:13:09.810 --> 01:13:14.580 12345, what the query would actually look like with these values plugged 01:13:14.580 --> 01:13:19.920 into it might look something like this; SELECT star from users where username 01:13:19.920 --> 01:13:25.200 equals Alice and password equals 12345. 01:13:25.200 --> 01:13:30.420 If there is nobody with username Alice or Alice's password is not 12345, 01:13:30.420 --> 01:13:31.770 then this will fail. 01:13:31.770 --> 01:13:34.890 Both of those conditions need to be true. 01:13:34.890 --> 01:13:37.890 But what about this? 01:13:37.890 --> 01:13:46.800 Someone whose username is hacker and their password is 1' or '1' equals '1. 01:13:49.800 --> 01:13:51.848 That looks pretty weird. 01:13:51.848 --> 01:13:53.640 And the reason that that looks pretty weird 01:13:53.640 --> 01:13:57.390 is because this is an attempt to inject SQL, 01:13:57.390 --> 01:14:02.820 to trick SQL into doing something that is presumably not intended by the code 01:14:02.820 --> 01:14:04.050 that we wrote. 
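To see what that input does once it is plugged into the query, here is a minimal sketch that runs it against an in-memory SQLite table. The table layout follows the description above; the stored password 'not-1', the function name unsafe_login, and the choice of SQLite are assumptions made for the example.

```python
import sqlite3

# In-memory table with one hypothetical user; the stored password is made up.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES (1, 'hacker', 'not-1')")


def unsafe_login(uname, pword):
    """Build the query by pasting user input straight into the SQL text."""
    query = (
        f"SELECT * FROM users WHERE username = '{uname}' "
        f"AND password = '{pword}'"
    )
    return query, db.execute(query).fetchall()


# An ordinary wrong password matches nothing.
_, wrong_rows = unsafe_login("hacker", "wrong")
print(wrong_rows)

# The crafted password closes the quote early and adds its own OR clause.
query, rows = unsafe_login("hacker", "1' OR '1' = '1")
print(query)
print(rows)
```

In SQLite, AND binds more tightly than OR, so the crafted condition is evaluated as (username and password match) OR ('1' = '1'), and the second half is true for every row in the table.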
01:14:04.050 --> 01:14:07.980 Now, it probably helps to take a look at it plugging the data in 01:14:07.980 --> 01:14:11.580 to see what exactly this is going to do. 01:14:11.580 --> 01:14:16.270 SELECT star from users where username equals hacker or-- 01:14:16.270 --> 01:14:23.190 excuse me, and password equals '1' or and so on and so on. 01:14:26.880 --> 01:14:30.180 Maybe I do have a person whose username actually is hacker, 01:14:30.180 --> 01:14:33.000 but that's probably not their password. 01:14:33.000 --> 01:14:34.050 That doesn't matter. 01:14:34.050 --> 01:14:37.350 I'm still going to be able to log in if I 01:14:37.350 --> 01:14:39.140 have somebody whose username is hacker. 01:14:39.140 --> 01:14:41.850 And the reason for that is because of this or. 01:14:41.850 --> 01:14:45.780 I have sort of short circuited the end of the SQL query. 01:14:45.780 --> 01:14:50.370 I have this quote mark that demarcates the end of what the user presumably 01:14:50.370 --> 01:14:51.780 typed in. 01:14:51.780 --> 01:14:54.660 But I've actually literally typed those into my password 01:14:54.660 --> 01:14:59.060 to trick SQL such that if hacker's password equals 1, 01:14:59.060 --> 01:15:03.420 it just happens to literally be the character 1, OK, I have succeeded. 01:15:03.420 --> 01:15:05.250 I guess that's a really bad password, and I 01:15:05.250 --> 01:15:08.100 shouldn't be able to log in that way, but maybe that is the case 01:15:08.100 --> 01:15:09.060 and I'm able to log in. 01:15:09.060 --> 01:15:13.560 But even if not, this other thing is true. 01:15:13.560 --> 01:15:18.660 '1' does equal '1'. 01:15:18.660 --> 01:15:23.030 So as long as somebody whose username is hacker exists in the database, 01:15:23.030 --> 01:15:27.330 I am now able to log in as hacker because this is true. 01:15:27.330 --> 01:15:29.230 This part's probably not true, right? 01:15:29.230 --> 01:15:31.860 It's unlikely that their password is 1.
01:15:31.860 --> 01:15:36.960 Regardless of what their password is, this part actually is true. 01:15:36.960 --> 01:15:40.200 It's a very simple SQL injection attack. 01:15:40.200 --> 01:15:44.490 I'm basically logging in as someone who I'm presumably not supposed 01:15:44.490 --> 01:15:48.780 to be able to log in as, but it illustrates the kind of thing 01:15:48.780 --> 01:15:50.550 that could happen. 01:15:50.550 --> 01:15:54.450 You are allowing people to bypass logins. 01:15:54.450 --> 01:15:59.100 Now, it could get worse if your database administrator username 01:15:59.100 --> 01:16:01.710 is admin or something very common. 01:16:01.710 --> 01:16:04.683 The default for this is typically admin. 01:16:04.683 --> 01:16:06.600 This would potentially give people the ability 01:16:06.600 --> 01:16:08.760 to be database administrators, that they're 01:16:08.760 --> 01:16:14.370 able to execute exactly this kind of trick on the admin user. 01:16:14.370 --> 01:16:16.830 Now they have administrative access to your database, which 01:16:16.830 --> 01:16:19.580 means they can do things like manipulate the data in the database, 01:16:19.580 --> 01:16:23.350 change things, add things, delete things that you don't want to have deleted. 01:16:23.350 --> 01:16:28.170 And in the case of a database, deletion is pretty permanent. 01:16:28.170 --> 01:16:32.580 You can't undo a delete most of the time in a database 01:16:32.580 --> 01:16:35.890 as the way you might be able to do with other files. 01:16:35.890 --> 01:16:38.430 Now, are there techniques to avoid this kind of attack? 01:16:38.430 --> 01:16:40.108 Fortunately, there are. 01:16:40.108 --> 01:16:42.900 Right now I'd like just to just take a look at a very simple Python 01:16:42.900 --> 01:16:45.720 program that replicates the kind of thing 01:16:45.720 --> 01:16:50.080 that one could do in a more robust, more complex SQL situation. 
01:16:50.080 --> 01:16:52.080 So let's pull up a program here where we're just 01:16:52.080 --> 01:16:54.870 simulating this idea of a SQL injection just 01:16:54.870 --> 01:17:00.230 to show you how it's not that difficult to defend against it. 01:17:00.230 --> 01:17:03.840 So let's pull up the code here in this file login.py. 01:17:03.840 --> 01:17:06.060 So there's not that much going on here. 01:17:06.060 --> 01:17:07.950 I have x equals input username. 01:17:07.950 --> 01:17:10.920 So x, recall, is a Python variable. 01:17:10.920 --> 01:17:14.460 And input username is basically going to prompt the user with the string 01:17:14.460 --> 01:17:17.405 username and then expect them to type something after that. 01:17:17.405 --> 01:17:19.530 And then we do exactly the same thing with password 01:17:19.530 --> 01:17:21.270 except storing the result there in y. 01:17:21.270 --> 01:17:24.000 So whatever the user types after username will get stored in x. 01:17:24.000 --> 01:17:27.270 Whatever they type after password will get stored in y. 01:17:27.270 --> 01:17:29.030 And then here I'm just going to print. 01:17:29.030 --> 01:17:33.310 And in the SQL context, this would be the query that actually gets executed. 01:17:33.310 --> 01:17:35.610 So imagine that that's what's happening instead. 01:17:35.610 --> 01:17:39.850 SELECT star from users where username equals and then this symbol here, 01:17:39.850 --> 01:17:40.350 '{x}'. 01:17:44.180 --> 01:17:46.680 What I'm doing here is just using a Python-formatted string. 01:17:46.680 --> 01:17:48.560 That's what this f here-- it's not a typo-- 01:17:48.560 --> 01:17:51.810 at the beginning means, is I'm going to plug in whatever the person, the user, 01:17:51.810 --> 01:17:55.640 typed at the first prompt, which I stored in x here, 01:17:55.640 --> 01:17:59.933 and whatever the user typed at the second prompt that's stored in y there. 01:17:59.933 --> 01:18:01.600 So let's actually just run this program.
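The file is only described aloud, so the following is a reconstruction rather than the actual login.py: a minimal sketch that follows the spoken description (two prompts stored in x and y, then an f-string printed). The function names build_query and main, and the exact prompt text, are assumptions.

```python
def build_query(x, y):
    """Plug both inputs directly into the query text with an f-string."""
    return f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'"


def main():
    # Interactive version, matching the two prompts described above.
    x = input("Username: ")
    y = input("Password: ")
    print(build_query(x, y))


# Non-interactive demonstration of the string that would be produced.
normal = build_query("Doug", "12345")
print(normal)
```

Calling main() gives the prompt-driven behavior; the query is only printed, never executed, which is what makes this a safe way to watch what user input does to the SQL text.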
01:18:01.600 --> 01:18:03.980 So let's pop open here for a second. 01:18:03.980 --> 01:18:07.780 The name of this program is login.py, so I'm going to type python 01:18:07.780 --> 01:18:10.880 login.py, Enter. 01:18:10.880 --> 01:18:13.290 Username, Doug. 01:18:13.290 --> 01:18:16.308 Password, 12345. 01:18:16.308 --> 01:18:19.600 And then the query, hypothetically, that would get executed if I constructed it 01:18:19.600 --> 01:18:22.480 in this way is SELECT star from users where username 01:18:22.480 --> 01:18:25.210 equals Doug and password equals 12345. 01:18:25.210 --> 01:18:26.320 Seems reasonable. 01:18:26.320 --> 01:18:30.130 But if I try and do the adversary thing that I did a moment ago, 01:18:30.130 --> 01:18:38.380 username equals Doug, password equals 1' or '1' equals '1, not 01:18:38.380 --> 01:18:42.850 a final single quote, and I hit Enter, then I end up with SELECT star 01:18:42.850 --> 01:18:49.865 from users where username equals Doug and password equals 1 or 1 equals 1. 01:18:49.865 --> 01:18:52.000 And the latter part of that is true. 01:18:52.000 --> 01:18:53.890 The former part is false. 01:18:53.890 --> 01:18:56.860 But it's good enough that I would be able to log in 01:18:56.860 --> 01:18:59.650 if I did something like that. 01:18:59.650 --> 01:19:02.200 But we want to try and get around that. 01:19:02.200 --> 01:19:05.200 So now let's take a look at a second file that might solve this problem. 01:19:05.200 --> 01:19:11.380 So I'm going to open up login2.py in my editor here. 01:19:11.380 --> 01:19:15.610 So now it starts out exactly the same, x equals something, y equals something. 01:19:15.610 --> 01:19:18.640 But I'm making a pretty basic substitution. 01:19:18.640 --> 01:19:23.020 I'm replacing every time that I see single quotes with double quotes. 01:19:23.020 --> 01:19:25.050 So I'm replacing every instance of single quote, 01:19:25.050 --> 01:19:26.800 and I have to preface it with a backslash. 
01:19:26.800 --> 01:19:30.160 Because notice I'm actually using single quotes to identify the character. 01:19:30.160 --> 01:19:33.880 It just so happens that it's to indicate that I'm trying to substitute something 01:19:33.880 --> 01:19:35.350 which I'm putting in single quotes. 01:19:35.350 --> 01:19:38.440 The thing I'm trying to substitute actually is a single quote, 01:19:38.440 --> 01:19:42.130 and so I need to put a backslash in front of it 01:19:42.130 --> 01:19:44.440 to escape that character such that it actually 01:19:44.440 --> 01:19:48.310 gets treated as a single quotation mark character as opposed 01:19:48.310 --> 01:19:50.308 to some special Python-- 01:19:50.308 --> 01:19:52.850 Python's not going to try and interpret it in some other way. 01:19:52.850 --> 01:19:56.890 So I want to replace every instance of a single quote in x with a double quote, 01:19:56.890 --> 01:20:00.010 and I want to replace every instance of a single quote in y 01:20:00.010 --> 01:20:01.030 with a double quote. 01:20:01.030 --> 01:20:02.650 Now, why do I want to do that? 01:20:02.650 --> 01:20:07.240 Because notice in my actual Python string here 01:20:07.240 --> 01:20:12.670 I'm using single quotes to set off the variables for purposes 01:20:12.670 --> 01:20:14.290 of SQL's interpretation of them. 01:20:14.290 --> 01:20:16.520 So where the user name equals this string, 01:20:16.520 --> 01:20:18.830 I'm using single quotes to do that. 01:20:18.830 --> 01:20:23.920 So if my username or my password also contained single quotation mark 01:20:23.920 --> 01:20:27.430 characters, when SQL was interpreting it, 01:20:27.430 --> 01:20:32.080 it might think that the next single quote character it sees is the end. 01:20:32.080 --> 01:20:34.300 I'm done with what I've prompted. 01:20:34.300 --> 01:20:37.420 And that's exactly how I tricked it in the previous example. 
01:20:37.420 --> 01:20:40.930 I used that first single quote, which seemed kind of random and out 01:20:40.930 --> 01:20:44.380 of nowhere, to trick SQL into thinking I'm done with this. 01:20:44.380 --> 01:20:48.850 Then I used the keyword or back now into a SQL and not some string 01:20:48.850 --> 01:20:52.570 that I'm searching for, and then I would continue this trick going forward. 01:20:52.570 --> 01:20:55.732 So this is designed to eliminate all the single quotes, 01:20:55.732 --> 01:20:57.940 because the single quotes mean something very special 01:20:57.940 --> 01:21:01.510 in the context of my SQL query itself. 01:21:01.510 --> 01:21:06.610 If you're actually using SQL libraries that are tied into Python, 01:21:06.610 --> 01:21:11.108 the ability to replace things is much more robust than this example. 01:21:11.108 --> 01:21:12.900 But even this very simple example where I'm 01:21:12.900 --> 01:21:16.480 doing just this very basic substitution is good enough 01:21:16.480 --> 01:21:20.390 to get around the injection attack that we just looked at. 01:21:20.390 --> 01:21:23.350 So this is now in login2.py. 01:21:23.350 --> 01:21:24.520 Let's do this. 01:21:24.520 --> 01:21:26.895 Let's Python login2.py. 01:21:26.895 --> 01:21:28.270 And we'll start out the same way. 01:21:28.270 --> 01:21:30.890 We'll do Doug and 12345. 01:21:30.890 --> 01:21:32.895 And it appears that nothing has changed. 01:21:32.895 --> 01:21:35.020 The behavior is otherwise identical because I'm not 01:21:35.020 --> 01:21:36.730 trying to do any tricks like that. 01:21:36.730 --> 01:21:41.440 SELECT star from users where username equals Doug and password equals 12345. 
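A sketch of what login2.py might look like, again reconstructed from the spoken description rather than copied from the real file. The helper name swap_quotes and the function name build_query are assumptions; the substitution itself (every single quote becomes a double quote, written with a backslash escape) is the one described above.

```python
def swap_quotes(value):
    """Replace every single quote with a double quote."""
    # '\'' is a one-character string containing a single quote; the
    # backslash keeps Python from treating it as the end of the string.
    return value.replace('\'', '"')


def build_query(x, y):
    """Build the query text after neutralizing single quotes in both inputs."""
    x = swap_quotes(x)
    y = swap_quotes(y)
    return f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'"


normal = build_query("Doug", "12345")
tricky = build_query("Doug", "1' OR '1' = '1")
print(normal)
print(tricky)
```

With the quotes swapped, the only single quotes left in the query are the ones the program itself wrote, so the whole crafted input stays inside one string. As the lecture itself says, this is a simplified illustration and not a complete defense.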
01:21:41.440 --> 01:21:45.250 But if I now try that same trick that I did a moment ago, 01:21:45.250 --> 01:21:55.090 so password is 1' or '1' equals '1 and I hit Enter, 01:21:55.090 --> 01:21:59.020 now I'm not subject to that same SQL injection anymore because I'm trying 01:21:59.020 --> 01:22:02.800 to select all the information from the users table where the username is Doug 01:22:02.800 --> 01:22:03.970 and the password equals-- 01:22:03.970 --> 01:22:06.950 And notice that here is the first single quote. 01:22:06.950 --> 01:22:08.440 Here is the second one. 01:22:08.440 --> 01:22:11.770 So it's thinking that entire thing now is the password. 01:22:11.770 --> 01:22:20.468 Only if my password is literally 1" or "1" equals "1, 01:22:20.468 --> 01:22:22.010 then I would be literally logging in. 01:22:22.010 --> 01:22:23.980 If that happened to be my password, this would work. 01:22:23.980 --> 01:22:25.150 But otherwise I've escaped. 01:22:25.150 --> 01:22:28.630 I've stopped the adversary from being able to leverage 01:22:28.630 --> 01:22:33.080 a simple trick like this to break in to my database 01:22:33.080 --> 01:22:34.930 when perhaps they're not intended to do so. 01:22:34.930 --> 01:22:41.140 And again, in actual SQL injection defense, the substitutions that we make 01:22:41.140 --> 01:22:42.640 are much more complicated than this. 01:22:42.640 --> 01:22:45.932 We're not just looking for single quote characters and double quote characters, 01:22:45.932 --> 01:22:48.610 but we're considering semicolons or any other special characters 01:22:48.610 --> 01:22:51.460 that SQL would interpret as part of a statement. 01:22:51.460 --> 01:22:53.900 We can escape those out so that users could literally 01:22:53.900 --> 01:22:59.720 use single quotes or semicolons or the like in their passwords 01:22:59.720 --> 01:23:03.160 without necessarily compromising the integrity of the entire database 01:23:03.160 --> 01:23:04.510 overall. 
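One common form the more robust, library-provided approach takes is a parameterized query, where the SQL text and the user's values are handed to the database separately. Note that this is a different mechanism from the quote substitution shown in the lecture, not a stronger version of it. The sketch below uses Python's built-in sqlite3 module and its ? placeholders; the table contents and the function name login are assumptions, and the plaintext password is there only to keep the toy small.

```python
import sqlite3

# In-memory table with one hypothetical user (plaintext only for the demo).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Doug', '12345')")


def login(uname, pword):
    """Pass user input as parameters so it is never parsed as SQL."""
    cursor = db.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (uname, pword),
    )
    return cursor.fetchall()


good = login("Doug", "12345")
attack = login("Doug", "1' OR '1' = '1")
print(good)
print(attack)
```

The crafted input is compared as a literal password, quotes and all, so it matches only if that odd string really is what was stored.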
01:23:04.510 --> 01:23:08.480 So we've taken a look at several of the most common, most obvious ways 01:23:08.480 --> 01:23:11.180 that an adversary might be able to extract information 01:23:11.180 --> 01:23:13.910 either from a business or an individual. 01:23:13.910 --> 01:23:17.660 And these ways are kind of attention-getting in some context. 01:23:17.660 --> 01:23:19.880 But let's focus now-- let's go back and bring things 01:23:19.880 --> 01:23:22.280 full circle to something I've mentioned many times, 01:23:22.280 --> 01:23:28.400 which is humans are the core fatal flaw in all of these security things 01:23:28.400 --> 01:23:29.800 that we're dealing with here. 01:23:29.800 --> 01:23:31.800 And so let's bring things full circle by talking 01:23:31.800 --> 01:23:34.220 about phishing, what phishing is. 01:23:34.220 --> 01:23:39.140 So phishing is just an attempt by an adversary to prey upon us 01:23:39.140 --> 01:23:45.440 and our unfortunate general ignorance of basic security protocols. 01:23:45.440 --> 01:23:47.900 So it's just an attempt to socially engineer, 01:23:47.900 --> 01:23:49.730 basically, information out of someone. 01:23:49.730 --> 01:23:52.460 You pretend to be someone that you are not. 01:23:52.460 --> 01:23:54.710 And if you do so convincingly enough, you 01:23:54.710 --> 01:23:58.190 might be able to extract information about that person. 01:23:58.190 --> 01:24:01.053 Now, phishing you'll also see in other contexts that are-- 01:24:01.053 --> 01:24:03.470 computer scientists like to be clever with their wordplay. 01:24:03.470 --> 01:24:06.800 You'll see things like netting, which is basically a phishing attack that 01:24:06.800 --> 01:24:08.780 launches against many people at once, hoping 01:24:08.780 --> 01:24:11.060 they'll be able to get one or two. 01:24:11.060 --> 01:24:13.400 There's spear phishing, which is a phishing 01:24:13.400 --> 01:24:17.240 attack that targets one specific person trying to get information from them. 
01:24:17.240 --> 01:24:20.090 And then there's whaling, which is a phishing attack that 01:24:20.090 --> 01:24:23.330 is targeted against somebody who is perceived to have a lot of information 01:24:23.330 --> 01:24:25.413 or whose information is particularly valuable such 01:24:25.413 --> 01:24:28.820 that you'd be phishing for some big whale. 01:24:28.820 --> 01:24:31.730 Now, one of the most obvious and easy types of phishing attack 01:24:31.730 --> 01:24:32.900 looks like this. 01:24:32.900 --> 01:24:35.450 It's a simple URL substitution. 01:24:35.450 --> 01:24:39.590 This is how we can write a link in HTML. 01:24:39.590 --> 01:24:43.480 A is the HTML tag for anchor, which we use for hyperlinks. 01:24:43.480 --> 01:24:46.460 Href is where we are going to. 01:24:46.460 --> 01:24:50.660 And then we also have the ability to specify some text at the end of that. 01:24:50.660 --> 01:24:54.830 These two items do not have to match, as you can see here. 01:24:54.830 --> 01:25:02.750 I can say we're going to URL2 but actually send you to URL1. 01:25:02.750 --> 01:25:08.420 This is an incredibly common way to get information from somebody. 01:25:08.420 --> 01:25:12.830 They think they're going one place but they're actually going someplace else. 01:25:12.830 --> 01:25:16.430 And to show you, as a very basic example, just how easy it 01:25:16.430 --> 01:25:21.560 is to potentially trick somebody into going somewhere they're not supposed to 01:25:21.560 --> 01:25:25.220 and potentially then revealing credentials as well, 01:25:25.220 --> 01:25:28.580 let's just take a simple example here with Facebook. 01:25:28.580 --> 01:25:31.970 And why don't we just take a moment to build our own version of Facebook 01:25:31.970 --> 01:25:36.410 and see if we can't get somebody to potentially reveal information to us? 
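The mismatch between where a link says it goes and where it actually goes can be checked mechanically. This is a minimal sketch using Python's standard html.parser and urllib.parse modules; the class name LinkChecker, the helper names, and the url1.example and url2.example addresses are all hypothetical stand-ins for URL1 and URL2.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkChecker(HTMLParser):
    """Collect (href, visible text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host(url):
    """Return the hostname of a URL, or None if the text is not a URL."""
    return urlparse(url).hostname


def mismatched_links(html):
    """Return links whose visible text names a different host than the href."""
    checker = LinkChecker()
    checker.feed(html)
    checker.close()
    return [
        (href, text)
        for href, text in checker.links
        if host(text) is not None and host(text) != host(href)
    ]


page = (
    '<a href="https://url1.example/">https://url2.example/</a> '
    '<a href="https://url2.example/">Click here</a>'
)
suspicious = mismatched_links(page)
print(suspicious)
```

Only links whose visible text is itself a URL are compared, so ordinary link text such as "Click here" is ignored rather than flagged.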
01:25:36.410 --> 01:25:38.750 So let's imagine that I have acquired some domain 01:25:38.750 --> 01:25:41.390 name that's really similar to Facebook.com, 01:25:41.390 --> 01:25:44.150 like it's off by one character. 01:25:44.150 --> 01:25:45.350 It's a common typo. 01:25:45.350 --> 01:25:48.198 For example fs maybe is a common thing. 01:25:48.198 --> 01:25:49.990 People mistype the A or something like that 01:25:49.990 --> 01:25:54.800 that would be really not necessarily obvious to somebody at the outset. 01:25:54.800 --> 01:25:59.240 One way that I might be able to just take advantage of somebody's thinking 01:25:59.240 --> 01:26:01.670 that they're logging into Facebook is to make a page that 01:26:01.670 --> 01:26:05.150 looks exactly the same as Facebook. 01:26:05.150 --> 01:26:07.640 That's actually not very difficult to do. 01:26:07.640 --> 01:26:09.680 All you have to do is open up Facebook here. 01:26:09.680 --> 01:26:14.720 And because its HTML is available to me, I can right click on it, 01:26:14.720 --> 01:26:18.530 view page source, take a second to load here-- 01:26:18.530 --> 01:26:20.480 Facebook is a pretty big site-- 01:26:20.480 --> 01:26:27.080 and then I can just control A, copy, select all, copy all of the content, 01:26:27.080 --> 01:26:33.500 and paste this in to my index.html, and we will save. 01:26:36.140 --> 01:26:40.970 And then we'll head back into our terminal here, 01:26:40.970 --> 01:26:45.170 and I will start Chrome on the file index.html, which 01:26:45.170 --> 01:26:49.400 is the file that I literally just saved my Facebook information in. 01:26:49.400 --> 01:26:51.040 So start Chrome index.html. 01:26:51.040 --> 01:26:53.360 You'll notice that it brings me to this URL 01:26:53.360 --> 01:26:56.670 here, which is the file for where I currently live, 01:26:56.670 --> 01:26:58.310 or where this file currently lives. 
01:26:58.310 --> 01:27:00.920 And this page looks like Facebook, except for the fact that, 01:27:00.920 --> 01:27:04.220 when I log in, I then get redirected back 01:27:04.220 --> 01:27:07.370 to something that actually is Facebook and is not something that I control. 01:27:07.370 --> 01:27:10.820 But at the outset, my page here at the very beginning 01:27:10.820 --> 01:27:14.810 looks identical to Facebook. 01:27:14.810 --> 01:27:16.790 Now, the trick here would be to do something 01:27:16.790 --> 01:27:20.780 so that the user would provide information here in the email box 01:27:20.780 --> 01:27:24.397 and then here in the password field such that when they click Login, 01:27:24.397 --> 01:27:26.480 I might be able to get that information from them. 01:27:26.480 --> 01:27:30.500 Maybe I just am waiting to capture their information. 01:27:30.500 --> 01:27:35.450 So the next step for me might be to go back into my random set of stuff here. 01:27:35.450 --> 01:27:38.570 There's a lot of random code that we don't really care about. 01:27:38.570 --> 01:27:41.030 But the one thing I do care about is what happens when 01:27:41.030 --> 01:27:43.790 somebody clicks on this Login button. 01:27:43.790 --> 01:27:45.590 That is interesting to me. 01:27:45.590 --> 01:27:48.230 So I'm going to go through this and just do control F, 01:27:48.230 --> 01:27:51.968 control F just being find, the string login. 01:27:51.968 --> 01:27:54.260 That's the text that's literally written on the button, 01:27:54.260 --> 01:27:55.843 so hopefully I'll find that somewhere. 01:27:55.843 --> 01:27:58.160 I'm told I have eight results. 01:27:58.160 --> 01:27:59.990 So this is, if I just kind of look around 01:27:59.990 --> 01:28:01.698 for context to try and figure out where I 01:28:01.698 --> 01:28:05.660 am in the code, the title of something, so that's probably not it. 01:28:05.660 --> 01:28:07.180 So I don't want to go there. 
01:28:07.180 --> 01:28:10.640 Create an account or login, not quite what I'm looking for. 01:28:10.640 --> 01:28:12.620 So go to the next one. 01:28:12.620 --> 01:28:15.890 OK, here we go, input value equals login. 01:28:15.890 --> 01:28:18.680 So now I found an input whose value is login. 01:28:18.680 --> 01:28:22.110 So this is presumably a button that's part of some form. 01:28:22.110 --> 01:28:25.820 So if I scroll up a little bit higher, hopefully I 01:28:25.820 --> 01:28:29.570 will find a form, which I do, form ID. 01:28:29.570 --> 01:28:30.920 And it has an action. 01:28:30.920 --> 01:28:34.040 The action is to go to this particular page, 01:28:34.040 --> 01:28:37.310 facebook.com/login/ and so on and so on. 01:28:37.310 --> 01:28:39.820 But maybe I want to send it somewhere else. 01:28:39.820 --> 01:28:44.000 So I'll replace this entire URL with where I actually want to send the user, 01:28:44.000 --> 01:28:46.160 where maybe I'm going to capture their information. 01:28:46.160 --> 01:28:49.220 Maybe I'll store this in login.html. 01:28:49.220 --> 01:28:51.140 And so that's what's going to come in here. 01:28:51.140 --> 01:28:56.210 And then we'll save the file such that our changes have been captured. 01:28:56.210 --> 01:28:58.370 So presumably what should happen is now, when 01:28:58.370 --> 01:29:02.420 you click on the Login button in my fake Facebook, 01:29:02.420 --> 01:29:08.000 you instead get redirected to login.html rather than the actual Facebook login 01:29:08.000 --> 01:29:10.458 as we saw just a moment ago. 01:29:10.458 --> 01:29:11.250 So let's try again. 01:29:11.250 --> 01:29:14.870 We'll go back here to our fake Facebook page. 01:29:14.870 --> 01:29:18.880 We will refresh so that we get our new content. 01:29:18.880 --> 01:29:20.860 Remember, we just changed the HTML content, 01:29:20.860 --> 01:29:23.900 so we actually need to reload it so that our browser has it.
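Since the whole trick rests on a form's action pointing somewhere other than the real site, one way to see it from the defender's side is to resolve each form's action against the page's URL and compare the host to the one you expected. This is only a sketch under stated assumptions: `FormActionCollector` and `forms_posting_elsewhere` are hypothetical names, the `.example` domain and the form id in the usage note are made up, and it looks only at the `action` attribute in static HTML (it would not notice a script that rewrites the form).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormActionCollector(HTMLParser):
    """Collects the action attribute of every <form> tag. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing action means the form submits to the page's own URL.
            self.actions.append(dict(attrs).get("action") or "")


def forms_posting_elsewhere(html, page_url, expected_host):
    """Return resolved form targets whose host is not expected_host."""
    parser = FormActionCollector()
    parser.feed(html)
    parser.close()
    flagged = []
    for action in parser.actions:
        # Resolve relative actions such as "login.html" against the page URL.
        target = urljoin(page_url, action)
        host = (urlparse(target).hostname or "").lower()
        if host != expected_host.lower():
            flagged.append(target)
    return flagged
```

With a copied page served from a lookalike domain, a call like `forms_posting_elsewhere('<form id="login_form" action="login.html"></form>', "https://fscebook.example/index.html", "www.facebook.com")` would flag the form, because the relative action resolves to the lookalike host rather than to Facebook.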
01:29:23.900 --> 01:29:31.250 And we'll type in abc@cs50.net and then some password here and click Login, 01:29:31.250 --> 01:29:32.990 and we get redirected here. 01:29:32.990 --> 01:29:35.630 Sorry, we are unable to log you in at this time. 01:29:35.630 --> 01:29:38.270 But notice we're still in a file that I created. 01:29:38.270 --> 01:29:41.973 I didn't show you login.html, but that's exactly what I put there. 01:29:41.973 --> 01:29:44.390 Now, I'm not actually going to phish for information here. 01:29:44.390 --> 01:29:46.370 And I'm not going to do something that would arguably vio-- 01:29:46.370 --> 01:29:48.100 even though I'm using fake data here, I'm 01:29:48.100 --> 01:29:50.808 not going to do something that would violate the terms of service 01:29:50.808 --> 01:29:54.500 or get myself in trouble by actually attempting to do some phishing here. 01:29:54.500 --> 01:29:58.070 But imagine instead of some HTML I had some Python code that was 01:29:58.070 --> 01:30:00.740 able to read the data from that field. 01:30:00.740 --> 01:30:02.840 We saw that a moment ago with passwords, right? 01:30:02.840 --> 01:30:06.860 We know that the possibility exists that if the user types something 01:30:06.860 --> 01:30:10.850 into a field, we have the ability to extract it. 01:30:10.850 --> 01:30:13.340 What I could do here is very simple. 01:30:13.340 --> 01:30:18.200 I could just read those two fields where they typed a username and a password 01:30:18.200 --> 01:30:20.032 but then display this content. 01:30:20.032 --> 01:30:22.490 Perhaps it's been the case that you've gone to some website 01:30:22.490 --> 01:30:26.300 and seen, oh, yeah, sorry, the server can't handle this request right now, 01:30:26.300 --> 01:30:28.820 or something along those lines. 01:30:28.820 --> 01:30:30.650 And you maybe think nothing of it. 01:30:30.650 --> 01:30:33.530 Or maybe I even would then have a link here that says, try again.
01:30:33.530 --> 01:30:35.870 And if you click Try Again, it would bring you back 01:30:35.870 --> 01:30:39.860 to Facebook's actual login where you would then enter your credentials 01:30:39.860 --> 01:30:42.560 and try again and perhaps think everything was fine. 01:30:42.560 --> 01:30:46.520 But if on this login page I had extracted your username and password 01:30:46.520 --> 01:30:49.120 by tricking you into thinking you were logging into Facebook, 01:30:49.120 --> 01:30:51.203 and then maybe I save those in some file somewhere 01:30:51.203 --> 01:30:54.882 and then just display this to you, you think, ah, they just had an error. 01:30:54.882 --> 01:30:56.090 Things are a little bit busy. 01:30:56.090 --> 01:30:57.050 I'll try again. 01:30:57.050 --> 01:30:58.910 And when you try again, it works. 01:30:58.910 --> 01:31:00.770 It's really that easy. 01:31:00.770 --> 01:31:05.600 And the way to avoid phishing expeditions, so to speak, 01:31:05.600 --> 01:31:07.530 is just to be mindful of what you're doing. 01:31:07.530 --> 01:31:11.000 Take a look at the URL bar to make sure that you're on the page 01:31:11.000 --> 01:31:12.983 that you think you're on. 01:31:12.983 --> 01:31:14.900 Hopefully you've come away now with a bit more 01:31:14.900 --> 01:31:16.775 of an understanding of cybersecurity and some 01:31:16.775 --> 01:31:19.700 of the best practices that are put in place to deal 01:31:19.700 --> 01:31:21.740 with potential cybersecurity threats. 01:31:21.740 --> 01:31:24.320 Now it's incumbent upon us to use the technology 01:31:24.320 --> 01:31:28.130 that we have available to help us protect ourselves from ourselves, 01:31:28.130 --> 01:31:33.020 and to protect not only ourselves and our own data, but also our clients 01:31:33.020 --> 01:31:35.200 and their data as well.