WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.982 [MUSIC PLAYING] 00:00:17.407 --> 00:00:18.490 DAVID J. MALAN: All right. 00:00:18.490 --> 00:00:21.620 This is CS50's Introduction to Cybersecurity. 00:00:21.620 --> 00:00:22.690 My name is David Malan. 00:00:22.690 --> 00:00:25.310 And this week, let's focus on securing software, 00:00:25.310 --> 00:00:27.670 whether it's the software you use or whether it's 00:00:27.670 --> 00:00:29.710 the software you write as a programmer. 00:00:29.710 --> 00:00:32.680 Consider, for instance, one of our topics from earlier in the class, 00:00:32.680 --> 00:00:36.100 namely, phishing, this attempt by an adversary to phish or obtain 00:00:36.100 --> 00:00:37.600 information from you. 00:00:37.600 --> 00:00:41.950 Let's consider today how you might go about implementing this kind of attack, 00:00:41.950 --> 00:00:44.470 and equivalently, how you, as the user, might 00:00:44.470 --> 00:00:47.120 go about noticing an attack like this. 00:00:47.120 --> 00:00:51.550 Well, here, again, is a language called HTML, Hypertext Markup Language. 00:00:51.550 --> 00:00:53.770 And it's the language in which web pages are written. 00:00:53.770 --> 00:00:56.650 This is maybe the simplest example that we could put together 00:00:56.650 --> 00:01:02.020 that represents the kind of text that a web server would send to a web browser 00:01:02.020 --> 00:01:05.140 when it wants to display information on your screen, 00:01:05.140 --> 00:01:09.160 be it the day's news, your email inbox, or anything else that is web based. 00:01:09.160 --> 00:01:12.310 Dot, dot, dot is where I've put placeholders just to represent where 00:01:12.310 --> 00:01:16.150 some additional code might actually go, like the content of this actual web 00:01:16.150 --> 00:01:16.660 page. 
00:01:16.660 --> 00:01:22.850 Well, for instance, suppose that we consider in the abstract just a simple 00:01:22.850 --> 00:01:25.040 example of the so-called tags. 00:01:25.040 --> 00:01:28.220 And in fact, recall that everything that you just saw sort of 00:01:28.220 --> 00:01:31.670 had these open brackets, but also the same words again and again. 00:01:31.670 --> 00:01:35.840 For instance, if we wanted to have a paragraph in this language called HTML, 00:01:35.840 --> 00:01:40.290 we would have this thing here called a tag, or an open tag, or a start tag, 00:01:40.290 --> 00:01:44.000 and then this thing here at the end, an end tag or a close tag. 00:01:44.000 --> 00:01:46.190 And those are meant typically to be symmetric. 00:01:46.190 --> 00:01:48.900 This one begins a thought for the browser. 00:01:48.900 --> 00:01:50.760 Hey, browser, here comes a paragraph. 00:01:50.760 --> 00:01:54.560 And this one here with the forward slash inside of those angled brackets 00:01:54.560 --> 00:01:57.510 means hey, browser, that's it for the paragraph. 00:01:57.510 --> 00:02:01.460 So any time you see HTML in a file, it's really 00:02:01.460 --> 00:02:04.880 telling the browser what to start doing and what to stop doing. 00:02:04.880 --> 00:02:07.340 So here is how a browser, therefore, might 00:02:07.340 --> 00:02:10.460 know that it needs to display a paragraph of text, maybe separated 00:02:10.460 --> 00:02:13.370 by some whitespace from other paragraphs of text. 00:02:13.370 --> 00:02:15.410 Here, for instance, is how a browser might 00:02:15.410 --> 00:02:19.040 know that there's some code inside of the web page, typically written 00:02:19.040 --> 00:02:20.720 in a language called JavaScript. 00:02:20.720 --> 00:02:25.890 So this script tag opening here and this close script tag here would be hey, 00:02:25.890 --> 00:02:27.530 browser, here's some code to execute. 00:02:27.530 --> 00:02:28.220 Dot, dot, dot. 
00:02:28.220 --> 00:02:31.620 And hey, browser, that's it for the code to execute. 00:02:31.620 --> 00:02:35.960 So how might there actually be a threat or an opportunity 00:02:35.960 --> 00:02:38.930 here for an adversary to phish information from you? 00:02:38.930 --> 00:02:41.750 Well, here, for instance, is how you might in this language 00:02:41.750 --> 00:02:46.130 called HTML create a link, otherwise known as a hyperlink in a web page. 00:02:46.130 --> 00:02:48.620 And it, too, uses an open tag and a close tag. 00:02:48.620 --> 00:02:52.490 This A tag represents an anchor, like anchor, some hyperlink, 00:02:52.490 --> 00:02:54.590 some link in the web page right here. 00:02:54.590 --> 00:02:57.600 And this tag here means, OK, that's it for the link. 00:02:57.600 --> 00:03:00.410 And if we want a link, for instance, to Harvard's website 00:03:00.410 --> 00:03:03.170 in between that open tag and close tag would 00:03:03.170 --> 00:03:07.050 you actually put the text of what the link is meant to be. 00:03:07.050 --> 00:03:10.040 So if you want the user to see a link on the page that says Harvard, 00:03:10.040 --> 00:03:13.610 you would put literally Harvard between this open tag and close tag. 00:03:13.610 --> 00:03:17.150 But that's not actually enough to link to some other page. 00:03:17.150 --> 00:03:22.310 You have to tell the browser what URL or what file you want clicking on Harvard 00:03:22.310 --> 00:03:24.170 to actually lead the user to. 00:03:24.170 --> 00:03:27.860 So for that, we need to introduce one other concept in this language called 00:03:27.860 --> 00:03:30.390 HTML, namely, an attribute. 00:03:30.390 --> 00:03:33.200 So href stands for hyper-reference. 00:03:33.200 --> 00:03:35.600 And it's a fancy way of saying, this is what 00:03:35.600 --> 00:03:39.595 I want the browser to link the user to when they click on that word, Harvard. 00:03:39.595 --> 00:03:41.720 Now, at the moment, I've just done a dot, dot, dot. 
00:03:41.720 --> 00:03:45.800 But that would, for instance, be the URL of Harvard's own website. 00:03:45.800 --> 00:03:48.920 So consider this very specific example, now, 00:03:48.920 --> 00:03:51.860 whereby, we still have our anchor tag opening here, 00:03:51.860 --> 00:03:54.320 we still have our anchor tag closing here. 00:03:54.320 --> 00:03:59.180 We now have, though, an href attribute that's telling the browser that when 00:03:59.180 --> 00:04:06.620 the word, Harvard is clicked, I want the user to end up at https://harvard.edu. 00:04:06.620 --> 00:04:09.110 So that all seems fine and good. 00:04:09.110 --> 00:04:12.020 And this is the way the web is supposed to work. 00:04:12.020 --> 00:04:15.200 And this is what links in your own web pages 00:04:15.200 --> 00:04:17.810 will look like if you poke around underneath the hood. 00:04:17.810 --> 00:04:19.490 So where is the actual danger? 00:04:19.490 --> 00:04:21.170 Well, hopefully, there is none. 00:04:21.170 --> 00:04:24.200 And hopefully, when you open this kind of HTML 00:04:24.200 --> 00:04:27.920 in the context of a larger file with even more tags than just 00:04:27.920 --> 00:04:30.080 this anchor tag, you'd see a browser window that 00:04:30.080 --> 00:04:31.788 looks a little something like this, you'd 00:04:31.788 --> 00:04:35.570 see a link typically underlined, though, not necessarily to Harvard. 00:04:35.570 --> 00:04:38.810 And then, and only then, if you hover over 00:04:38.810 --> 00:04:41.300 that link, can you see where you will go, 00:04:41.300 --> 00:04:43.080 even if you don't actually click the link. 00:04:43.080 --> 00:04:45.080 So suffice it to say if you just click the link, 00:04:45.080 --> 00:04:47.190 you're going to end up on Harvard's own website. 
00:04:47.190 --> 00:04:51.740 But if you a little more cautiously with a bit more paranoia, a bit more 00:04:51.740 --> 00:04:55.340 consciousness now of cybersecurity hover over that link 00:04:55.340 --> 00:04:58.760 and focus on your browser's bottom left-hand corner typically, 00:04:58.760 --> 00:05:03.260 at least, on a laptop or desktop, you'll actually see the URL to which 00:05:03.260 --> 00:05:07.200 you will actually be whisked away when you click on that link. 00:05:07.200 --> 00:05:08.630 So this actually looks OK. 00:05:08.630 --> 00:05:10.220 The word here is Harvard. 00:05:10.220 --> 00:05:15.470 The URL it's going to link me to is https://harvard.edu. 00:05:15.470 --> 00:05:18.150 So I think all is well in the world. 00:05:18.150 --> 00:05:21.600 And indeed, you can do this on most any web page on your laptop and desktop 00:05:21.600 --> 00:05:24.750 if you want to proactively preemptively see 00:05:24.750 --> 00:05:28.270 where it is you're going to go before you actually click on that link. 00:05:28.270 --> 00:05:29.790 So where's the danger? 00:05:29.790 --> 00:05:33.720 Well, let's get a little more specific and a little more malicious, if I may. 00:05:33.720 --> 00:05:37.660 So here, we have the exact same HTML as before. 00:05:37.660 --> 00:05:42.150 But let's go ahead now and not just say Harvard inside of this open tag 00:05:42.150 --> 00:05:42.990 and closed tag. 00:05:42.990 --> 00:05:47.430 Suppose that for whatever reason, I want the user to see a little more obviously 00:05:47.430 --> 00:05:49.750 the URL to which they are going to be linked. 00:05:49.750 --> 00:05:54.060 So I may change Harvard, capital H, to harvard.edu, 00:05:54.060 --> 00:05:58.320 so the actual domain name that I want the user to be led to. 00:05:58.320 --> 00:06:00.430 Here is now what the user would see. 00:06:00.430 --> 00:06:04.500 So it's a little more obvious that it's harvard.edu and not some other Harvard 00:06:04.500 --> 00:06:05.070 website. 
00:06:05.070 --> 00:06:07.050 And indeed, if we hover over that, we'll see 00:06:07.050 --> 00:06:10.180 that it still is going to the same URL. 00:06:10.180 --> 00:06:10.680 All right. 00:06:10.680 --> 00:06:12.000 So that seems fine. 00:06:12.000 --> 00:06:13.530 And not all that enlightening. 00:06:13.530 --> 00:06:15.090 Let's go one step further. 00:06:15.090 --> 00:06:21.080 Suppose that you really want the user to see a URL in the body of the web page, 00:06:21.080 --> 00:06:24.925 so now I'm actually going to put in between the open tag and the close tag, 00:06:24.925 --> 00:06:28.530 https://harvard.edu. 00:06:28.530 --> 00:06:30.280 Now, notice this looks a little redundant. 00:06:30.280 --> 00:06:32.110 And it is in some sense because I literally 00:06:32.110 --> 00:06:34.600 have the URL in two different places. 00:06:34.600 --> 00:06:38.050 But that's because those two values serve different purposes. 00:06:38.050 --> 00:06:42.130 The one in between the open tag and close tag is what the human sees. 00:06:42.130 --> 00:06:46.720 The one inside of the quote marks, the so-called value of the href attribute, 00:06:46.720 --> 00:06:48.830 is where the user will actually end up. 00:06:48.830 --> 00:06:50.710 So if you want them to be equivalent, you 00:06:50.710 --> 00:06:53.660 have to type the exact same thing twice in this case. 00:06:53.660 --> 00:06:55.690 So now, of course, if I go back to the web page, 00:06:55.690 --> 00:06:58.750 the human is now going to see literally this URL. 00:06:58.750 --> 00:07:03.220 And if they hover before clicking, they'll see confirmation as much. 00:07:03.220 --> 00:07:04.970 So where are we going with this? 
00:07:04.970 --> 00:07:08.350 Well, here is among the lessons of the course is to think about now, 00:07:08.350 --> 00:07:12.940 how can you take a perfectly reasonable technical solution to a problem, 00:07:12.940 --> 00:07:16.420 creating a link in a page in this case, and how might an adversary 00:07:16.420 --> 00:07:17.140 abuse it? 00:07:17.140 --> 00:07:19.540 How might you, as the end user, be vulnerable, 00:07:19.540 --> 00:07:22.060 in this case, to a so-called phishing attack? 00:07:22.060 --> 00:07:24.280 Well, there's nothing stopping me from putting 00:07:24.280 --> 00:07:29.410 anything I want in either this value or in this name of the link. 00:07:29.410 --> 00:07:30.335 So you know what? 00:07:30.335 --> 00:07:31.960 Why don't I be a little malicious here? 00:07:31.960 --> 00:07:35.200 And why don't I tell the user that they're going to harvard.edu, 00:07:35.200 --> 00:07:40.910 but they're actually going to yale.edu instead, another school down the road? 00:07:40.910 --> 00:07:42.250 So what does the human see now? 00:07:42.250 --> 00:07:44.320 If we go back to the browser, they still see 00:07:44.320 --> 00:07:48.310 what appears to be https://harvard.edu. 00:07:48.310 --> 00:07:51.850 But if they hover over it, and only if they hover over it, 00:07:51.850 --> 00:07:54.160 will they see this little clue that, uh-uh, 00:07:54.160 --> 00:07:57.070 you're actually going to be whisked away to yale.edu. 00:07:57.070 --> 00:08:00.040 And if they click on the link, they'll actually find themselves 00:08:00.040 --> 00:08:03.230 at the actual yale.edu website. 00:08:03.230 --> 00:08:04.852 So what's the big deal? 00:08:04.852 --> 00:08:07.310 Well, this might just be a silly prank, in which case, it's 00:08:07.310 --> 00:08:09.020 probably inconsequential. 00:08:09.020 --> 00:08:13.350 And if you do link from one website to a completely different one, 00:08:13.350 --> 00:08:15.282 it's not necessarily a phishing attack. 
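To make the harvard.edu versus yale.edu mismatch concrete, here is a minimal sketch in Python, which is not a language used in the lecture. It assumes that a link is suspicious when its visible text looks like a domain but its href points to a different host. The names LinkCollector and misleading_links are hypothetical, and this is a simple heuristic, not how any browser or mail client actually works.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only record text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def misleading_links(html):
    """Return (text, href) pairs where the text names one host but the href names another."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href, text in collector.links:
        # Skip ordinary wording such as "Harvard" or "click here".
        if not text or any(ch.isspace() for ch in text):
            continue
        shown = urlsplit(text if "://" in text else "//" + text).hostname
        actual = urlsplit(href).hostname
        if shown and "." in shown and actual and shown != actual:
            flagged.append((text, href))
    return flagged


print(misleading_links('<a href="https://yale.edu">https://harvard.edu</a>'))
```

A link whose text is just the word Harvard is not flagged, since plain wording says nothing about the destination. That is exactly why hovering over the link remains the real check.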
00:08:15.282 --> 00:08:18.240 It might be confusing because the user thinks they're going to Harvard, 00:08:18.240 --> 00:08:19.615 but they find themselves at Yale. 00:08:19.615 --> 00:08:22.970 But there's not necessarily any danger in that mislead. 00:08:22.970 --> 00:08:25.910 But what if the adversary in this case 00:08:25.910 --> 00:08:30.230 doesn't link to a very common popular website like yale.edu, 00:08:30.230 --> 00:08:34.610 but maybe a website like harvard.edu, where just one of the characters 00:08:34.610 --> 00:08:37.149 is slightly misspelled, as we've discussed in the past? 00:08:37.149 --> 00:08:41.220 Such that you and I, unless we really, really look carefully, 00:08:41.220 --> 00:08:44.480 we might not even notice that we're not at the real harvard.edu. 00:08:44.480 --> 00:08:47.300 And what if further, the adversary went 00:08:47.300 --> 00:08:51.080 through the trouble of copying all of the HTML 00:08:51.080 --> 00:08:53.930 that implements Harvard's website and pastes it 00:08:53.930 --> 00:08:56.960 into their own fake version of Harvard's website 00:08:56.960 --> 00:08:59.990 that lives at, again, a URL that is almost the same? 00:08:59.990 --> 00:09:02.660 Here is now where there's a phishing opportunity 00:09:02.660 --> 00:09:06.350 because if you think you're going to harvard.edu, and you click the link, 00:09:06.350 --> 00:09:10.530 and it looks like you're at harvard.edu, and you don't notice a subtlety like, 00:09:10.530 --> 00:09:12.950 wait a minute, that's not quite the right URL, 00:09:12.950 --> 00:09:17.060 you might now be inclined to and comfortable with maybe logging 00:09:17.060 --> 00:09:21.410 in to the fake "harvard.edu" website with your username and your password. 00:09:21.410 --> 00:09:25.573 And voila, now the adversary has that information from you. 00:09:25.573 --> 00:09:26.990 And it doesn't have to be Harvard. 00:09:26.990 --> 00:09:28.115 It doesn't have to be Yale. 
00:09:28.115 --> 00:09:29.300 It might be your bank. 00:09:29.300 --> 00:09:33.500 It might be paypal.com or something where you could actually lose money 00:09:33.500 --> 00:09:36.180 or some other asset you care about. 00:09:36.180 --> 00:09:38.990 And so that's really the essence of the implementation 00:09:38.990 --> 00:09:43.430 details of a phishing attack, at least, in the context of web pages 00:09:43.430 --> 00:09:44.960 and/or emails. 00:09:44.960 --> 00:09:48.470 It all boils down to these primitives of HTML 00:09:48.470 --> 00:09:50.570 being the language in which web pages are written. 00:09:50.570 --> 00:09:54.680 And adversaries, by knowing HTML, also now logically 00:09:54.680 --> 00:10:00.150 can misuse HTML by understanding how these basics work. 00:10:00.150 --> 00:10:04.790 So let me pause here and see if there's any questions about phishing, or HTML, 00:10:04.790 --> 00:10:07.640 or this convergence of the two when it comes 00:10:07.640 --> 00:10:11.510 to this form of social engineering, as we called it before. 00:10:11.510 --> 00:10:16.730 AUDIENCE: Would it be possible to write my IP or some other means to get 00:10:16.730 --> 00:10:20.893 to their website and not the URL's? 00:10:20.893 --> 00:10:22.310 DAVID J. MALAN: Short answer, yes. 00:10:22.310 --> 00:10:25.365 If you have access to dedicated IP addresses, 00:10:25.365 --> 00:10:28.490 which are these unique identifiers you can use for servers on the internet, 00:10:28.490 --> 00:10:35.220 you can absolutely have a URL that is http:// and then the IP address. 00:10:35.220 --> 00:10:38.660 Now, typically, it would be HTTP and not HTTPS 00:10:38.660 --> 00:10:42.170 when using an IP address, in which case, that might be 00:10:42.170 --> 00:10:43.940 a clue to the user that wait a minute. 00:10:43.940 --> 00:10:46.700 This is making me nervous that this isn't legitimate. 
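The "raw IP address in the URL" clue mentioned in this answer can be checked mechanically. What follows is a minimal sketch in Python using only the standard library. The function name host_is_raw_ip and the sample address 192.0.2.10 (from a range reserved for documentation) are illustrative assumptions, not something shown in the lecture.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_raw_ip(url):
    """Return True if the URL's host is a numeric IP address rather than a domain name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP
    except ValueError:
        return False
    return True


print(host_is_raw_ip("http://192.0.2.10/login"))  # numeric host
print(host_is_raw_ip("https://harvard.edu"))      # domain name
```

A numeric host is not proof of an attack. As noted above, there is technically nothing wrong with it, so a check like this is best treated as one signal among several.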
00:10:46.700 --> 00:10:49.400 But honestly, I think we can all think of people in our lives 00:10:49.400 --> 00:10:52.280 who wouldn't have the instincts to notice, wait a minute. 00:10:52.280 --> 00:10:55.130 What is this weird numeric address in my browser bar 00:10:55.130 --> 00:10:56.537 and stop what they're doing. 00:10:56.537 --> 00:10:58.370 That's, indeed, among the goals of the class 00:10:58.370 --> 00:11:00.990 like this is to give you those instincts and that training 00:11:00.990 --> 00:11:02.990 to be a little suspicious when you see something 00:11:02.990 --> 00:11:05.690 like a raw IP address in the browser. 00:11:05.690 --> 00:11:07.490 Technically, there's nothing wrong with it. 00:11:07.490 --> 00:11:09.980 But it's a little bit of a weird branding or marketing 00:11:09.980 --> 00:11:11.270 decision for a website. 00:11:11.270 --> 00:11:13.940 And I think a corollary of this then logically 00:11:13.940 --> 00:11:16.160 is that if you are running a website of your own 00:11:16.160 --> 00:11:18.830 or if you're running a business with a website of your own, 00:11:18.830 --> 00:11:24.200 you should really avoid using many different URL formats, 00:11:24.200 --> 00:11:28.070 or many different domains, or having any sort of curiosities 00:11:28.070 --> 00:11:31.580 or weirdnesses in your domain names because you're really just teaching 00:11:31.580 --> 00:11:35.533 users implicitly that your URL format might change from time to time. 00:11:35.533 --> 00:11:38.450 And certainly, you never want to use just an IP address because you're 00:11:38.450 --> 00:11:40.800 going to train people to expect that. 00:11:40.800 --> 00:11:45.050 And so standardizing on one or very few domain names or subdomains 00:11:45.050 --> 00:11:47.670 is generally best for that. 00:11:47.670 --> 00:11:49.700 So what are some other attacks that we should 00:11:49.700 --> 00:11:52.130 be mindful of when it comes to our own software? 
00:11:52.130 --> 00:11:55.250 Well, a class of attacks, or a category of attacks, 00:11:55.250 --> 00:11:57.500 are generally known as code injection, which 00:11:57.500 --> 00:12:01.370 is an opportunity for an adversary to somehow inject code 00:12:01.370 --> 00:12:05.960 into your software and often trick your software into executing that code, 00:12:05.960 --> 00:12:08.090 even if you, yourself, didn't write it. 00:12:08.090 --> 00:12:10.640 Well, let's consider one example of this. 00:12:10.640 --> 00:12:12.740 A common attack on the web in particular, 00:12:12.740 --> 00:12:17.780 too, is what's known as cross-site scripting, or XSS for short. 00:12:17.780 --> 00:12:21.530 Cross-site scripting refers to this potential opportunity 00:12:21.530 --> 00:12:26.390 for an adversary to trick one website into executing code that they, 00:12:26.390 --> 00:12:28.220 again, themselves did not write. 00:12:28.220 --> 00:12:30.170 So what form might this actually take? 00:12:30.170 --> 00:12:35.510 Well, suppose that you, yourself, visit google.com, and suppose that Google 00:12:35.510 --> 00:12:37.370 isn't aware of this particular attack. 00:12:37.370 --> 00:12:38.880 They certainly are nowadays. 00:12:38.880 --> 00:12:42.690 But suppose that they weren't yet aware that this attack exists. 00:12:42.690 --> 00:12:45.980 And so when someone like you or I goes to google.com and searches 00:12:45.980 --> 00:12:49.830 for something like cats, suppose they do the following. 00:12:49.830 --> 00:12:51.740 They show you a whole page of search results. 00:12:51.740 --> 00:12:53.657 And I won't bother showing the actual results. 00:12:53.657 --> 00:12:59.660 But as of today, there were 6,420,000,000 cats on the internet 00:12:59.660 --> 00:13:00.740 that Google knows about. 00:13:00.740 --> 00:13:02.930 And they would show up, of course, down here. 00:13:02.930 --> 00:13:07.250 Now, notice a few characteristics about google.com as it typically behaves. 
00:13:07.250 --> 00:13:11.490 Well, one, you still see a text box containing what it is you searched for, 00:13:11.490 --> 00:13:13.890 so that you can change it, or at least, see what it is. 00:13:13.890 --> 00:13:17.540 And notice, too, that in smaller text here in this particular version 00:13:17.540 --> 00:13:20.750 of Google, it tells you not only how many results there 00:13:20.750 --> 00:13:23.810 are, but specifically, how many cats there are. 00:13:23.810 --> 00:13:28.430 So that is to say, if Google is using your own input, 00:13:28.430 --> 00:13:32.600 not only to remind you of what you searched for in the search box, 00:13:32.600 --> 00:13:37.970 but also in the body of the web page, that very simple idea is 00:13:37.970 --> 00:13:39.620 vulnerable to an attack. 00:13:39.620 --> 00:13:40.160 Why? 00:13:40.160 --> 00:13:43.340 Because who wrote the word, cats, C-A-T-S? 00:13:43.340 --> 00:13:45.240 Well, it wasn't Google, per se. 00:13:45.240 --> 00:13:46.310 It was me. 00:13:46.310 --> 00:13:49.440 Now, fortunately, cats, in and of itself, is not dangerous. 00:13:49.440 --> 00:13:52.850 But suppose I knew a little something about HTML, and browsers, 00:13:52.850 --> 00:13:56.540 and how the internet works, and suppose that I, now an adversary, 00:13:56.540 --> 00:13:58.190 did something like this. 00:13:58.190 --> 00:14:02.360 And knowing that Google is probably inside of their web page 00:14:02.360 --> 00:14:05.720 rendering HTML that looks like this, a paragraph of text 00:14:05.720 --> 00:14:07.880 per the open paragraph and closed paragraph 00:14:07.880 --> 00:14:13.670 tag and an English sentence like this, about 6,420,000,000 cats, if I know 00:14:13.670 --> 00:14:18.510 they're putting my input, cats into HTML that looks like this, 00:14:18.510 --> 00:14:22.430 let me see if I can try to trick Google into outputting something 00:14:22.430 --> 00:14:24.090 that they might not have anticipated. 
00:14:24.090 --> 00:14:27.170 So instead of cats, let me type something a little weird 00:14:27.170 --> 00:14:28.500 that looks like this. 00:14:28.500 --> 00:14:29.700 Now, what are we looking at? 00:14:29.700 --> 00:14:33.560 We're looking at now an HTML tag, the script tag, 00:14:33.560 --> 00:14:37.010 both opened and closed here, and a little bit of code in a language 00:14:37.010 --> 00:14:38.030 called JavaScript. 00:14:38.030 --> 00:14:41.840 Now, thankfully, this, in and of itself, is not actually a compelling attack. 00:14:41.840 --> 00:14:44.360 It's literally just going to display "attack," 00:14:44.360 --> 00:14:45.810 quote, unquote, on the screen. 00:14:45.810 --> 00:14:50.390 So it's just meant to be representative I claim of how I could potentially 00:14:50.390 --> 00:14:54.710 trick a website like Google into executing code that I wrote, 00:14:54.710 --> 00:14:56.330 not that they wrote. 00:14:56.330 --> 00:14:57.830 So what do I mean by that? 00:14:57.830 --> 00:15:00.200 Notice that I've got this open script tag here 00:15:00.200 --> 00:15:03.440 and the close script tag here, which means everything in between there 00:15:03.440 --> 00:15:06.530 is script, that is JavaScript, this particular language. 00:15:06.530 --> 00:15:09.530 Well, it turns out this particular language in the context of browsers 00:15:09.530 --> 00:15:12.960 comes with a function, a feature known as alert. 00:15:12.960 --> 00:15:15.560 And if you want to alert the user with some message, 00:15:15.560 --> 00:15:18.470 you literally write the word, alert and then an open parentheses 00:15:18.470 --> 00:15:21.000 and a closed parentheses on the left and right. 00:15:21.000 --> 00:15:23.360 And then inside of single quotes or double quotes, 00:15:23.360 --> 00:15:27.140 you put whatever word or words that you want to alert the user to. 
00:15:27.140 --> 00:15:30.740 So this is often used for displaying messages to the user, not 00:15:30.740 --> 00:15:32.540 actual attacks, but useful messages. 00:15:32.540 --> 00:15:34.620 And there's more elegant ways to do this as well. 00:15:34.620 --> 00:15:39.800 But this is the simplest representation of an attack that I could propose here. 00:15:39.800 --> 00:15:42.740 Now, I haven't yet hit Enter on this page 00:15:42.740 --> 00:15:45.860 because, indeed, we still see that we're on the page relating 00:15:45.860 --> 00:15:49.100 to cats as my search result. But as soon as I 00:15:49.100 --> 00:15:53.810 hit Enter after searching for this string of JavaScript code, 00:15:53.810 --> 00:15:58.430 or really, HTML inside of which is JavaScript code, what might happen? 00:15:58.430 --> 00:16:02.330 Well, this could potentially happen. 00:16:02.330 --> 00:16:05.420 Now, again, this is not a bad thing in this specific case. 00:16:05.420 --> 00:16:07.740 It's just throwing up an alert to the user. 00:16:07.740 --> 00:16:10.340 And in this sense, too, I'm really only attacking myself 00:16:10.340 --> 00:16:12.770 because if I'm the adversary, and this is my browser, 00:16:12.770 --> 00:16:15.920 and I've just tricked Google into executing some JavaScript code such 00:16:15.920 --> 00:16:19.220 that a pop-up appears saying attack, well, I'm just hacking myself. 00:16:19.220 --> 00:16:20.570 So this is inconsequential. 00:16:20.570 --> 00:16:23.360 But again, it's representative of how we could potentially 00:16:23.360 --> 00:16:27.140 trick a website into executing code that they did not intend. 00:16:27.140 --> 00:16:29.510 Now, why is this displaying? 00:16:29.510 --> 00:16:35.180 But notice no more cats and no input here that I typed myself. 00:16:35.180 --> 00:16:38.360 Last time when I searched for cats, I saw the word cats here. 00:16:38.360 --> 00:16:41.450 But now I'm seeing nothing at all. 00:16:41.450 --> 00:16:42.810 Now, why is that? 
00:16:42.810 --> 00:16:46.370 Well, underneath the hood previously, I claimed that Google was probably 00:16:46.370 --> 00:16:48.920 rendering HTML like this, a paragraph of text, 00:16:48.920 --> 00:16:53.720 open tag, close tag, and then the sentence about 6,420,000,000 cats, 00:16:53.720 --> 00:16:54.740 so HTML. 00:16:54.740 --> 00:16:57.980 And they were just plugging in whatever I, the human, typed in. 00:16:57.980 --> 00:17:01.310 Well, this time, I conjecture that if I typed 00:17:01.310 --> 00:17:05.589 in what looks like and is HTML with a bit of scary JavaScript 00:17:05.589 --> 00:17:08.079 code in between, what Google is probably going 00:17:08.079 --> 00:17:12.220 to try to output in the body of the web page they send to my browser 00:17:12.220 --> 00:17:15.790 is this, an open paragraph tag and a closed paragraph 00:17:15.790 --> 00:17:18.640 tag still the beginning of a sentence, assuming 00:17:18.640 --> 00:17:23.260 that there are this many attacks in the world, 6,420,000,000. 00:17:23.260 --> 00:17:24.460 But notice this. 00:17:24.460 --> 00:17:29.170 Because I, the user, literally typed in the script tag, and the closed script 00:17:29.170 --> 00:17:31.510 tag, and then that JavaScript code in between, 00:17:31.510 --> 00:17:35.660 the browser doesn't know that that came from me and not Google. 00:17:35.660 --> 00:17:37.720 So the browser is just going to read this 00:17:37.720 --> 00:17:42.280 as, hey, browser, start a paragraph, about 6,420,000,000. 00:17:42.280 --> 00:17:46.330 Hey, browser, here comes a script that is a program that you should execute. 00:17:46.330 --> 00:17:47.350 What do I execute? 00:17:47.350 --> 00:17:48.910 Alert, quote, unquote, "attack." 00:17:48.910 --> 00:17:50.770 Hey, browser, that's it for the script. 00:17:50.770 --> 00:17:53.150 Hey, browser, that's it for the paragraph. 
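The "blindly plugging in whatever the human typed" behavior conjectured here can be sketched in a few lines of Python. This is an assumption-laden illustration, not Google's actual code. The function name render_result_count is hypothetical, and the result count is simply the figure quoted in the lecture.

```python
def render_result_count(query, count="6,420,000,000"):
    """Deliberately vulnerable: the user's query is pasted into the HTML unchanged."""
    return f"<p>About {count} {query}</p>"


# Ordinary input produces ordinary HTML.
print(render_result_count("cats"))

# Input that is itself HTML becomes part of the page's markup,
# so a browser would treat the script tag as code to execute.
print(render_result_count("<script>alert('attack')</script>"))
```

In the second call the output contains a real script element nested inside the paragraph, which is why the browser cannot tell that it came from the user rather than from the site.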
00:17:53.150 --> 00:17:57.340 So if Google is just blindly copying and pasting what I, the human, 00:17:57.340 --> 00:18:00.910 am typing in, I might trick Google into rendering HTML 00:18:00.910 --> 00:18:02.720 that Google did not intend. 00:18:02.720 --> 00:18:05.850 And the side effect in this case is that I see this alert. 00:18:05.850 --> 00:18:11.090 But really, that's indicative of a potential exploit in Google's website 00:18:11.090 --> 00:18:15.000 if they were not detecting this on their own. 00:18:15.000 --> 00:18:17.060 So this is the code that's dangerous. 00:18:17.060 --> 00:18:19.430 And what then is the fundamental problem? 00:18:19.430 --> 00:18:24.350 They are just literally outputting what I, the adversary, 00:18:24.350 --> 00:18:26.700 typed in to their page. 00:18:26.700 --> 00:18:28.790 So how do we go about mitigating something 00:18:28.790 --> 00:18:31.310 like this to avoid this kind of attack? 00:18:31.310 --> 00:18:35.060 Well, let's first propose that what we want the effect to be 00:18:35.060 --> 00:18:37.040 is something a little more like this. 00:18:37.040 --> 00:18:39.572 Assuming, again, that there are 6,420,000,000 00:18:39.572 --> 00:18:41.780 attacks in the world, what I want to see is literally 00:18:41.780 --> 00:18:43.430 that English sentence here. 00:18:43.430 --> 00:18:45.330 I don't want any sort of pop-up. 00:18:45.330 --> 00:18:48.650 So this, I would propose, is the correct behavior, 00:18:48.650 --> 00:18:51.020 assuming we see more search results down below. 00:18:51.020 --> 00:18:52.400 I have searched for this. 00:18:52.400 --> 00:18:55.560 Google is telling me or reminding me what I searched for, 00:18:55.560 --> 00:18:57.320 but there's no pop-ups in this case. 
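One common way to get the behavior just described, where the input shows up literally and no pop-up appears, is to escape HTML's special characters before writing user input into the page. Below is a minimal sketch in Python using the standard library's html.escape. It is an assumption that this stands in for whatever Google actually does, and the function name render_result_count_safely is hypothetical.

```python
from html import escape


def render_result_count_safely(query, count="6,420,000,000"):
    """Escape the user's query so the browser displays it as text, not as markup."""
    return f"<p>About {count} {escape(query)}</p>"


page = render_result_count_safely("<script>alert('attack')</script>")
print(page)
# <p>About 6,420,000,000 &lt;script&gt;alert(&#x27;attack&#x27;)&lt;/script&gt;</p>
```

The browser renders &lt; and &gt; as the visible characters < and >, so the user sees exactly what they typed while the page contains no script element.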
00:18:57.320 --> 00:19:00.260 So somehow or other, based on this screenshot alone, 00:19:00.260 --> 00:19:03.530 there must be a way of ensuring on Google's end 00:19:03.530 --> 00:19:07.790 that even if the human types in HTML with, perhaps, some JavaScript 00:19:07.790 --> 00:19:11.930 code in the middle, that they don't actually treat it as HTML 00:19:11.930 --> 00:19:13.760 and JavaScript code in the middle. 00:19:13.760 --> 00:19:18.440 They just display it literally character by character whatever I, the user, 00:19:18.440 --> 00:19:19.290 typed in. 00:19:19.290 --> 00:19:21.860 So what is our concern with this particular symptom? 00:19:21.860 --> 00:19:23.870 Well, it turns out that an adversary can 00:19:23.870 --> 00:19:26.840 wage what we would call a reflected attack, 00:19:26.840 --> 00:19:30.290 whereby, we could leverage this symptom in such a way 00:19:30.290 --> 00:19:34.010 that maybe we could construct a URL that if clicked by a user, 00:19:34.010 --> 00:19:37.100 actually triggers this kind of behavior, but moreover, 00:19:37.100 --> 00:19:40.040 doesn't just trigger this fairly innocuous behavior like alerting 00:19:40.040 --> 00:19:42.840 the user with a message like attack just to scare them. 00:19:42.840 --> 00:19:46.460 But what if we wrote even more malicious JavaScript code that 00:19:46.460 --> 00:19:49.920 maybe steals their cookies or does something more than that? 00:19:49.920 --> 00:19:52.520 Well, how do you wage what's called here a reflected attack? 00:19:52.520 --> 00:19:56.520 Well, let's first consider what a basic link in a web page or an email 00:19:56.520 --> 00:19:57.020 looks like. 00:19:57.020 --> 00:19:59.490 It's again, an anchor tag that starts and ends. 00:19:59.490 --> 00:20:01.790 It has an href attribute that represents 00:20:01.790 --> 00:20:05.450 the URL or file to which we're going to link the user and then some text 00:20:05.450 --> 00:20:08.300 that the human will actually see in the web page. 
00:20:08.300 --> 00:20:12.710 And now let's notice when we correctly search for cats, as we did before, 00:20:12.710 --> 00:20:16.250 that not only do we see cats in the text box here, not only do we 00:20:16.250 --> 00:20:20.360 see cats in the body of the web page, but notice now the URL. 00:20:20.360 --> 00:20:23.510 It turns out when you search for something on Google, 00:20:23.510 --> 00:20:26.438 you end up at a URL that looks essentially like this. 00:20:26.438 --> 00:20:27.980 It might actually be a little longer. 00:20:27.980 --> 00:20:30.980 But a lot of those parameters, so to speak, in the URL 00:20:30.980 --> 00:20:32.340 aren't strictly necessary. 00:20:32.340 --> 00:20:34.580 So this is the shortest possible URL that 00:20:34.580 --> 00:20:37.430 will work on Google if you want to search for cats. 00:20:37.430 --> 00:20:46.580 And notice what it is, https://www.google.com/search?q=cats. 00:20:46.580 --> 00:20:49.700 So this is to say that the way Google works is 00:20:49.700 --> 00:20:51.800 that if you want to search for cats, you simply 00:20:51.800 --> 00:20:53.233 visit a URL that looks like this. 00:20:53.233 --> 00:20:56.150 If you want to search for dogs, you visit a URL that looks almost like 00:20:56.150 --> 00:21:01.520 this, but instead has q=dogs, which is to say there's just a very standard 00:21:01.520 --> 00:21:04.670 format on google.com, and a lot of other websites too, 00:21:04.670 --> 00:21:09.050 for searching for things or really sending input to a web server. 00:21:09.050 --> 00:21:12.500 And this web form or this text box that you typically 00:21:12.500 --> 00:21:15.320 use to type in cats, or dogs, or anything else 00:21:15.320 --> 00:21:18.740 is just generating a URL that looks like this. 00:21:18.740 --> 00:21:20.850 And then Google knows what to do with it. 00:21:20.850 --> 00:21:25.310 So how can we leverage now that reality a little more maliciously? 00:21:25.310 --> 00:21:27.080 Well, let's go back to our HTML. 
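The URL format just described can be sketched in a few lines of Python. This is only an illustration of how a search form might turn what you type into a URL; the helper name `search_url` is hypothetical and is not anything Google publishes.

```python
from urllib.parse import urlencode


def search_url(term: str) -> str:
    """Build the short search URL from the lecture for a given term.

    `search_url` is a hypothetical helper used only for illustration.
    """
    # urlencode turns {"q": "cats"} into the query string "q=cats".
    return "https://www.google.com/search?" + urlencode({"q": term})


print(search_url("cats"))
print(search_url("dogs"))
```

Only the value after `q=` changes from one search to the next, which is what makes the format easy for an adversary to imitate.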
00:21:27.080 --> 00:21:29.300 And let's again assume that the adversary 00:21:29.300 --> 00:21:32.300 is trying to construct some HTML for their own email 00:21:32.300 --> 00:21:37.190 or for their own website in order to attack some unsuspecting users. 00:21:37.190 --> 00:21:40.260 Well, instead of the dot, dot, dots, let's be more specific. 00:21:40.260 --> 00:21:43.370 Let's actually, in a good way, in an honest way, 00:21:43.370 --> 00:21:46.130 say that we're going to let the user click on a word 00:21:46.130 --> 00:21:49.190 cats, which is in between my open tag and close tag. 00:21:49.190 --> 00:21:52.490 And if they click on that, they're going to end up at the legitimate Google 00:21:52.490 --> 00:21:59.480 website at https://www.google.com/search?q=cats. 00:21:59.480 --> 00:22:00.480 So this is correct. 00:22:00.480 --> 00:22:02.030 This is not yet an attack. 00:22:02.030 --> 00:22:04.600 But what if I am a little malicious? 00:22:04.600 --> 00:22:09.220 And instead of using the legitimate URL there for searching for cats, 00:22:09.220 --> 00:22:12.400 suppose I construct something a little more cleverly 00:22:12.400 --> 00:22:16.330 that says we're going to give them cats, but actually, we're 00:22:16.330 --> 00:22:18.340 going to bring them to this URL. 00:22:18.340 --> 00:22:19.870 Now, this is a bit of a mouthful. 00:22:19.870 --> 00:22:22.240 And in fact, it wraps onto two lines this time. 00:22:22.240 --> 00:22:30.850 But notice the URL starts the same, https://www.google.com/search?q= 00:22:30.850 --> 00:22:39.310 and then some weird text, %3Cscript%3Ealert wrapping onto 00:22:39.310 --> 00:22:41.030 the other line and so forth. 00:22:41.030 --> 00:22:46.330 So I dare say, you're seeing some familiar phrases now, script and alert, 00:22:46.330 --> 00:22:49.160 but there's also some weird syntax there as well. 00:22:49.160 --> 00:22:53.470 Now, that weird syntax is just a representation of URL escaping. 
00:22:53.470 --> 00:22:57.220 It turns out that certain characters in URLs, like angled brackets 00:22:57.220 --> 00:23:02.210 and other syntax are not good to include in URLs because they might be mistaken 00:23:02.210 --> 00:23:04.350 by the browser for something else. 00:23:04.350 --> 00:23:09.320 And so URLs typically escape punctuation symbols and other characters 00:23:09.320 --> 00:23:11.390 using this percent syntax. 00:23:11.390 --> 00:23:15.590 Now, it looks a little weird to the user, but what's more worrisome 00:23:15.590 --> 00:23:19.310 is what this is going to be used for on Google's end. 00:23:19.310 --> 00:23:24.830 If the q value equals this whole bunch of text 00:23:24.830 --> 00:23:29.240 and it's just the browser that's encoding those special characters 00:23:29.240 --> 00:23:32.630 in this way, what Google's really going to see on its end 00:23:32.630 --> 00:23:38.870 is the actual script tag with the actual alert and the actual close script tag 00:23:38.870 --> 00:23:40.760 that you and I constructed earlier. 00:23:40.760 --> 00:23:43.340 That is to say, what Google's going to receive 00:23:43.340 --> 00:23:48.300 from that URL is no longer "cats," quote, unquote, but this, quote, 00:23:48.300 --> 00:23:51.050 unquote, because the servers are going to automatically convert 00:23:51.050 --> 00:23:53.960 the percent signs and those weird characters back 00:23:53.960 --> 00:23:55.340 to the original. 00:23:55.340 --> 00:23:57.210 That's how URL encoding works. 00:23:57.210 --> 00:24:00.000 So what the server is going to receive is this. 00:24:00.000 --> 00:24:04.850 And again, if the server is vulnerable to naively just outputting literally 00:24:04.850 --> 00:24:08.810 whatever the human typed in, the risk is that they're 00:24:08.810 --> 00:24:11.750 going to now execute that code. 00:24:11.750 --> 00:24:13.640 And what if the code isn't just an alert? 
00:24:13.640 --> 00:24:17.060 Maybe it's something like this, which still isn't, in and of itself, 00:24:17.060 --> 00:24:19.550 a bad thing because it's just an alert. 00:24:19.550 --> 00:24:22.340 But this is actually some JavaScript code now, 00:24:22.340 --> 00:24:29.510 alert(document.cookie) that would actually throw up a dialog window that 00:24:29.510 --> 00:24:34.370 shows the user the value of the cookies they have there on Google's website. 00:24:34.370 --> 00:24:35.820 OK, not such a big deal. 00:24:35.820 --> 00:24:39.360 It's not all that different from just saying, quote, unquote, "attack." 00:24:39.360 --> 00:24:42.950 But what this means is that in JavaScript, 00:24:42.950 --> 00:24:47.060 you have access to all of the cookies for a website, at least, 00:24:47.060 --> 00:24:49.190 those that are made available to JavaScript. 00:24:49.190 --> 00:24:52.730 And if an adversary doesn't use the alert function, 00:24:52.730 --> 00:24:58.250 but maybe uses a little more code to send the value of document.cookie 00:24:58.250 --> 00:25:03.290 to their own website or to somehow send other information from the web page, 00:25:03.290 --> 00:25:07.190 the user's username or any other personally identifying information, 00:25:07.190 --> 00:25:13.100 suffice it to say, that by being able to write code in JavaScript and by being 00:25:13.100 --> 00:25:17.000 able to trick a server like Google in this story into executing that code, 00:25:17.000 --> 00:25:23.510 you can effectively by transitivity trick a user's browser into executing 00:25:23.510 --> 00:25:25.530 that code for you. 00:25:25.530 --> 00:25:28.430 So the adversary is sending the code into Google, 00:25:28.430 --> 00:25:31.220 and it's being reflected back to some user 00:25:31.220 --> 00:25:34.610 if they click that same link in an email or a website. 00:25:34.610 --> 00:25:38.850 And at this point, things like their own cookies might be vulnerable. 
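The percent syntax described above is ordinary URL encoding, and the round trip can be sketched with Python's standard library. The payload string here is an assumption that mirrors the harmless alert example from the lecture; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Assumed payload, mirroring the harmless alert example from the lecture.
payload = "<script>alert('attack')</script>"

# What ends up in the link: every reserved character is percent-encoded.
encoded = quote(payload, safe="")
link = "https://www.google.com/search?q=" + encoded
print(link)

# What a server typically does on receipt: decode the query string,
# which yields the original characters, angle brackets and all.
received = parse_qs(urlsplit(link).query)["q"][0]
print(received)
```

The encoded form starts with `%3Cscript%3Ealert`, as in the lecture's URL, yet the value the server works with after decoding is the script tag itself.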
00:25:38.850 --> 00:25:42.270 And again, to be clear, this, in and of itself, should not hurt you. 00:25:42.270 --> 00:25:46.100 I'm just using alert as demonstrative of what could be possible. 00:25:46.100 --> 00:25:48.200 But you could do any number of other things 00:25:48.200 --> 00:25:51.320 with document.cookie or other values from a web page 00:25:51.320 --> 00:25:55.610 as soon as you have this ability to write JavaScript that's reflected back 00:25:55.610 --> 00:25:57.680 into someone else's browser. 00:25:57.680 --> 00:26:02.130 Any questions then on this particular attack? 00:26:02.130 --> 00:26:05.500 AUDIENCE: I just wanted to ask a question about the JavaScript blockage 00:26:05.500 --> 00:26:09.700 because many of the browsers [INAUDIBLE] uses to block the JavaScript. 00:26:09.700 --> 00:26:12.580 How can you use the websites and browser which 00:26:12.580 --> 00:26:16.193 blocks the JavaScript without getting tricked into the JavaScript? 00:26:16.193 --> 00:26:18.110 DAVID J. MALAN: That's a really good question. 00:26:18.110 --> 00:26:20.800 And the short answer is nowadays that that's not really 00:26:20.800 --> 00:26:22.930 the best technique to just block JavaScript. 00:26:22.930 --> 00:26:27.610 The reality is in this point in time, so many websites, most websites, 00:26:27.610 --> 00:26:32.200 dare say, use, if not, rely on JavaScript to do any number of features 00:26:32.200 --> 00:26:33.880 or render their own content. 00:26:33.880 --> 00:26:39.155 And so it's just, I think, not realistic to just disable JavaScript 00:26:39.155 --> 00:26:41.530 in order to protect yourself from these kinds of attacks. 00:26:41.530 --> 00:26:44.890 In a bit, we'll discuss ways to mitigate this kind of attack, where you 00:26:44.890 --> 00:26:47.470 disable some JavaScript, but not all. 00:26:47.470 --> 00:26:51.400 But in general, I don't think that's a realistic solution, at least, 00:26:51.400 --> 00:26:54.040 on most websites nowadays. 
00:26:54.040 --> 00:26:57.130 So how else might these same principles be misused? 00:26:57.130 --> 00:27:00.520 Well, it turns out there's another class of attacks known as stored attacks, 00:27:00.520 --> 00:27:04.630 whereby, the adversary's input isn't just immediately reflected back 00:27:04.630 --> 00:27:07.180 from the server to some unsuspecting user 00:27:07.180 --> 00:27:10.960 as it might be when you're using the URL to contain the code. 00:27:10.960 --> 00:27:15.190 But suppose that a website were vulnerable to actually storing 00:27:15.190 --> 00:27:18.790 the user's input, even if the user's input includes 00:27:18.790 --> 00:27:22.427 HTML with some JavaScript inside, well, that would be a stored attack. 00:27:22.427 --> 00:27:23.635 And it might work as follows. 00:27:23.635 --> 00:27:28.180 And at the risk of picking on Google, suppose that-- when using Gmail, 00:27:28.180 --> 00:27:32.975 suppose that you, if you sent someone an email with that exact same code, 00:27:32.975 --> 00:27:35.350 whereby, you're just alerting, quote, unquote, "attack--" 00:27:35.350 --> 00:27:37.767 again, that, in and of itself, isn't going to hurt anyone. 00:27:37.767 --> 00:27:40.090 But it's representative of what you could do with code. 00:27:40.090 --> 00:27:43.720 Now, presumably, when you send an email in Gmail, or Outlook, 00:27:43.720 --> 00:27:47.770 or any other service, that email is going to be stored on a server 00:27:47.770 --> 00:27:50.860 until it's read and until it's deleted, perhaps, by the user. 00:27:50.860 --> 00:27:53.810 And if it's never deleted, it's going to stay stored on the server. 00:27:53.810 --> 00:27:56.950 So this type of attack assumes that the server might actually 00:27:56.950 --> 00:28:00.940 be saving in a database or a file somewhere the user's input. 
00:28:00.940 --> 00:28:03.770 Now, suppose that here, too, Google didn't 00:28:03.770 --> 00:28:07.290 know about these kinds of cross-site scripting attacks, 00:28:07.290 --> 00:28:11.600 and they just allow you and me to input HTML and JavaScript into an email, 00:28:11.600 --> 00:28:14.580 and they just blindly save it into their database. 00:28:14.580 --> 00:28:17.180 And then when the recipient opens this email, 00:28:17.180 --> 00:28:20.820 they just show the recipient the contents of that email. 00:28:20.820 --> 00:28:22.200 Well, what could go wrong? 00:28:22.200 --> 00:28:26.150 Well, if the recipient opens that particular email and Google is 00:28:26.150 --> 00:28:29.540 literally rendering the script tag with the JavaScript inside, 00:28:29.540 --> 00:28:32.810 the recipient of that email, when they open their inbox, 00:28:32.810 --> 00:28:35.450 may very well suffer some kind of an attack. 00:28:35.450 --> 00:28:37.380 Again, it just says attack on the screen. 00:28:37.380 --> 00:28:41.990 But it represents being tricked into running code that someone else wrote, 00:28:41.990 --> 00:28:44.270 and in this case, someone else sent you. 00:28:44.270 --> 00:28:46.940 Ideally, what we would want to have happen instead 00:28:46.940 --> 00:28:50.820 is not have Google show us the attack message here. 00:28:50.820 --> 00:28:54.770 But rather, I would like my inbox to show me the code I was sent, 00:28:54.770 --> 00:28:56.210 but not execute it. 00:28:56.210 --> 00:29:00.380 That is, just like I wanted to know that there are 6,420,000,000 00:29:00.380 --> 00:29:03.500 cats among the search results, so would I 00:29:03.500 --> 00:29:08.090 want Gmail to just show me what it is the adversary typed in 00:29:08.090 --> 00:29:11.120 without actually interpreting it or executing it 00:29:11.120 --> 00:29:14.090 as HTML with some JavaScript inside. 
00:29:14.090 --> 00:29:19.010 So that could be a stored attack and would be a stored attack 00:29:19.010 --> 00:29:22.940 if, thankfully, Google weren't actually protecting us 00:29:22.940 --> 00:29:24.480 against this, which they are. 00:29:24.480 --> 00:29:27.870 So how do you go about preventing an attack like this in software? 00:29:27.870 --> 00:29:30.740 Well, the general answer is character escapes. 00:29:30.740 --> 00:29:34.280 That is taking any characters in user's input 00:29:34.280 --> 00:29:39.350 that might potentially be misinterpreted at best, or at worst, 00:29:39.350 --> 00:29:41.970 might be dangerous to the users. 00:29:41.970 --> 00:29:44.000 Now, what characters might be worrisome? 00:29:44.000 --> 00:29:47.420 Well, in something like HTML, anything with an angled bracket, 00:29:47.420 --> 00:29:50.210 a less than sign, would probably potentially 00:29:50.210 --> 00:29:53.310 be mistaken for the beginning of an HTML tag. 00:29:53.310 --> 00:29:58.250 So I dare say that a less than sign is a dangerous character, similarly 00:29:58.250 --> 00:30:01.580 might a greater than sign represent the end of a tag. 00:30:01.580 --> 00:30:04.070 So that, too, might be something to give us concern. 00:30:04.070 --> 00:30:06.570 And there's probably a few other characters as well. 00:30:06.570 --> 00:30:09.260 So what should servers be doing? 00:30:09.260 --> 00:30:14.480 What should software be doing to avoid this kind of cross-site scripting 00:30:14.480 --> 00:30:16.430 attacks, whether reflected or stored? 00:30:16.430 --> 00:30:19.730 Well, ideally, something like this would not just 00:30:19.730 --> 00:30:25.040 be blindly outputted by Google or by / but rather, it 00:30:25.040 --> 00:30:28.250 would be escaped in this very weird looking way. 00:30:28.250 --> 00:30:30.830 But let me highlight just a subset of these characters. 
00:30:30.830 --> 00:30:34.400 Highlighted in yellow now are only the character escapes 00:30:34.400 --> 00:30:35.510 to which I'm referring. 00:30:35.510 --> 00:30:40.580 It turns out that this language, HTML, has standardized some special sequences 00:30:40.580 --> 00:30:43.760 of characters that represent the less than sign, 00:30:43.760 --> 00:30:45.800 that represent the greater than sign. 00:30:45.800 --> 00:30:47.780 They're a little more verbose to type. 00:30:47.780 --> 00:30:50.720 You have to type out four characters in this particular case. 00:30:50.720 --> 00:30:56.510 But browsers are designed to know that when they see &lt;, 00:30:56.510 --> 00:30:59.780 they should not show on the screen &lt;. 00:30:59.780 --> 00:31:01.700 They should show a less than sign. 00:31:01.700 --> 00:31:07.280 And similarly, when browsers see &gt;, they should display not literally that, 00:31:07.280 --> 00:31:09.120 but a greater than sign. 00:31:09.120 --> 00:31:14.780 So this is to say, if Google were smart, they would take any user input you 00:31:14.780 --> 00:31:17.000 and I give them, but they would make sure 00:31:17.000 --> 00:31:22.520 to escape any potentially dangerous characters with these kinds of escape 00:31:22.520 --> 00:31:24.153 sequences, so to speak. 00:31:24.153 --> 00:31:26.570 And Google's got to look it up in a book, or on a website, 00:31:26.570 --> 00:31:30.230 or in the specification to know what escape they should use. 00:31:30.230 --> 00:31:33.320 But these are very well documented and standardized. 00:31:33.320 --> 00:31:37.310 And indeed, we have one here, one here for the open script tag and then 00:31:37.310 --> 00:31:40.130 another here and here for the closed script tag. 00:31:40.130 --> 00:31:44.285 But notice, we don't have to escape all of the punctuation, like the slash, 00:31:44.285 --> 00:31:47.210 or the English letters in the tag name, or the like. 00:31:47.210 --> 00:31:51.960 We're only escaping a certain list of these characters. 
00:31:51.960 --> 00:31:53.030 Well, what is that list? 00:31:53.030 --> 00:31:56.150 Here are the five that minimally, we should generally 00:31:56.150 --> 00:31:58.220 be escaping depending on the context. 00:31:58.220 --> 00:32:02.820 The less than sign should be &lt;, the greater than sign, &gt;. 00:32:02.820 --> 00:32:06.930 The ampersand sign, for the very reason that we're now potentially creating 00:32:06.930 --> 00:32:09.720 a new problem-- if we're using ampersand all over the place, 00:32:09.720 --> 00:32:11.775 what if the user's input has an ampersand? 00:32:11.775 --> 00:32:16.090 We don't want to confuse the ampersand in the user's input for a character 00:32:16.090 --> 00:32:16.590 escape. 00:32:16.590 --> 00:32:21.690 So there actually needs to be a more verbose way, &amp; 00:32:21.690 --> 00:32:23.760 to represent literally an ampersand. 00:32:23.760 --> 00:32:32.820 Then there's one for a double quote, &quot; and a single quote or apostrophe, &apos;. 00:32:32.820 --> 00:32:34.060 And there's more as well. 00:32:34.060 --> 00:32:37.630 But generally, these are the five that could otherwise get you in trouble. 00:32:37.630 --> 00:32:40.830 So all of the examples we've seen thus far where Google is somehow 00:32:40.830 --> 00:32:43.440 reflecting back or storing potential attack code 00:32:43.440 --> 00:32:47.130 will not happen if Google is just smart, whereby, 00:32:47.130 --> 00:32:51.360 they're escaping that input from a user before sending 00:32:51.360 --> 00:32:58.210 it back out as output to google.com search results or to Gmail inboxes. 00:32:58.210 --> 00:33:01.170 So how else might we actually prevent attacks like these? 00:33:01.170 --> 00:33:04.210 Well, we can also put in place other measures as well. 00:33:04.210 --> 00:33:08.160 And recall from past classes we discussed this notion of HTTP headers. 
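Before moving on to headers, the escaping of those five characters can be sketched with Python's standard `html.escape`. This is a minimal illustration, not how any particular website is implemented; `render_naive` and `render_escaped` are hypothetical names, and the page fragment they produce is invented for the example.

```python
import html

# Assumed payload, mirroring the harmless alert example from the lecture.
payload = "<script>alert('attack')</script>"


def render_naive(query: str) -> str:
    """Vulnerable: copies the user's input into the page verbatim."""
    return f"<p>Results for {query}</p>"


def render_escaped(query: str) -> str:
    """Safer: escapes &, <, >, and both kinds of quotes before output."""
    # Note: Python writes the apostrophe as the numeric escape &#x27;,
    # which browsers display the same way as the named escape.
    return f"<p>Results for {html.escape(query, quote=True)}</p>"


print(render_naive(payload))    # contains a live <script> tag
print(render_escaped(payload))  # contains only text the browser will display
```

With the escaped version, the browser shows the characters the user typed rather than treating them as a tag to execute.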
00:33:08.160 --> 00:33:11.220 And an HTTP header is a line of text that's 00:33:11.220 --> 00:33:13.320 stored in those virtual envelopes that get 00:33:13.320 --> 00:33:16.530 sent from browsers to servers and from servers to browsers. 00:33:16.530 --> 00:33:20.190 Inside of the envelope typically is the actual request for a web page 00:33:20.190 --> 00:33:23.400 or the actual contents of, the response for a web page. 00:33:23.400 --> 00:33:26.070 But also in those envelopes are additional information, 00:33:26.070 --> 00:33:28.770 namely, these HTTP headers, which are key value 00:33:28.770 --> 00:33:31.810 pairs that provide additional instructions, if you will, 00:33:31.810 --> 00:33:33.330 to the browser or server. 00:33:33.330 --> 00:33:40.200 So for instance, suppose that we want to ensure that this kind of reflected 00:33:40.200 --> 00:33:44.280 or stored attack isn't possible, whereby, we're accidentally embedding 00:33:44.280 --> 00:33:48.090 script tags in our own website's HTML. 00:33:48.090 --> 00:33:49.840 Well, suppose that the website in question 00:33:49.840 --> 00:33:53.520 now isn't google.com specifically, but more generally, example.com. 00:33:53.520 --> 00:33:58.990 And suppose that example.com's web server is configured to output always 00:33:58.990 --> 00:34:01.930 in those virtual envelopes an HTTP header that 00:34:01.930 --> 00:34:05.500 is Content-Security-Policy:. 00:34:05.500 --> 00:34:07.570 So that string of text is the key. 00:34:07.570 --> 00:34:13.900 And the value of that key is script-src then the URL 00:34:13.900 --> 00:34:17.889 that we want to allow scripts from only. 00:34:17.889 --> 00:34:19.120 So what does this mean? 
00:34:19.120 --> 00:34:24.340 Albeit, fairly cryptic, if you configure a web server with this HTTP header, 00:34:24.340 --> 00:34:30.130 this will ensure that you can only load JavaScript code from actual files, 00:34:30.130 --> 00:34:34.719 typically ending in .js that are sent separately from the server 00:34:34.719 --> 00:34:35.840 to the browser. 00:34:35.840 --> 00:34:41.300 This line of an HTTP header prevents inline scripts, so to speak. 00:34:41.300 --> 00:34:45.699 Whereas by default the browser would execute any old script tag in the web page, 00:34:45.699 --> 00:34:49.040 this prevents that default behavior. 00:34:49.040 --> 00:34:53.590 So as such, even if Google, even if example.com messes up and forgets 00:34:53.590 --> 00:34:58.870 to use character escapes when rendering user input that came from a URL 00:34:58.870 --> 00:35:00.910 or came from an email or any other source, 00:35:00.910 --> 00:35:03.400 this header should at least tell the browser, at least, 00:35:03.400 --> 00:35:04.930 newer browsers, uh-uh. 00:35:04.930 --> 00:35:10.000 Even if you accidentally see a script tag with some JavaScript inside of it 00:35:10.000 --> 00:35:12.910 in my web page, don't execute it. 00:35:12.910 --> 00:35:18.490 Only allow me to execute JavaScript code that came from a separate file. 00:35:18.490 --> 00:35:21.220 The only type of JavaScript that will now be allowed 00:35:21.220 --> 00:35:24.520 is if I have a tag that looks like this in my HTML, which 00:35:24.520 --> 00:35:27.220 is an alternative version of the script tag. 00:35:27.220 --> 00:35:31.570 But instead of embedding any code inside of the open tag and the close tag 00:35:31.570 --> 00:35:37.240 itself, it refers to the source of, abbreviated src, some file, typically, 00:35:37.240 --> 00:35:38.800 again ending in .js. 
00:35:38.800 --> 00:35:43.180 So if this dot, dot, dot were the URL of a file that 00:35:43.180 --> 00:35:46.660 contains JavaScript code, that would be allowed because the presumption 00:35:46.660 --> 00:35:49.690 there is that if someone went through the trouble of creating 00:35:49.690 --> 00:35:53.980 that file on our own server, example.com, presumably, 00:35:53.980 --> 00:35:55.000 that code is safe. 00:35:55.000 --> 00:35:58.000 But what this line in our header does is it 00:35:58.000 --> 00:36:01.960 ensures that we can only execute JavaScript code if it comes 00:36:01.960 --> 00:36:04.540 from example.com in a separate file, 00:36:04.540 --> 00:36:07.960 using an HTML tag like this. The HTTP header will prohibit 00:36:07.960 --> 00:36:12.400 any script tags that are inlined 00:36:12.400 --> 00:36:15.790 in the body of our actual web pages. 00:36:15.790 --> 00:36:16.970 What else can we do? 00:36:16.970 --> 00:36:19.300 Well, it turns out-- and we haven't talked about it in this class. 00:36:19.300 --> 00:36:22.175 There's other languages that you can use in the context of web pages, 00:36:22.175 --> 00:36:25.090 not only HTML, not only JavaScript, but also a language 00:36:25.090 --> 00:36:27.670 called CSS, or Cascading Style Sheets, which 00:36:27.670 --> 00:36:29.650 is generally used to style your page. 00:36:29.650 --> 00:36:33.070 If familiar, or if you take a course on web development, 00:36:33.070 --> 00:36:35.920 know that there's similarly a mechanism whereby 00:36:35.920 --> 00:36:41.410 you can specify that only CSS from a specific server like example.com 00:36:41.410 --> 00:36:45.370 should be allowed, not inline style tags 00:36:45.370 --> 00:36:47.450 with which you might be familiar as well. 
00:36:47.450 --> 00:36:52.270 So here instead of script source, we see style source, which is just another way 00:36:52.270 --> 00:36:57.610 of using this HTTP header mechanism to just ensure that the browser, at least, 00:36:57.610 --> 00:37:00.880 if it's new enough, will not blindly execute script 00:37:00.880 --> 00:37:05.110 tags in the first case or style tags in the second case when 00:37:05.110 --> 00:37:06.850 these kinds of headers are present. 00:37:06.850 --> 00:37:10.210 It's an additional layer of defense against these kinds 00:37:10.210 --> 00:37:12.580 of reflected or stored attacks. 00:37:12.580 --> 00:37:15.490 Indeed, that particular HTTP header would only 00:37:15.490 --> 00:37:18.790 allow us to include CSS in our web page if it 00:37:18.790 --> 00:37:22.240 uses a tag like this, namely, a link tag with an href value. 00:37:22.240 --> 00:37:26.950 That dot, dot, dot of which in this case would be the URL of a CSS file 00:37:26.950 --> 00:37:30.490 on the particular server, example.com, the relationship 00:37:30.490 --> 00:37:35.170 of which is to the page that of this thing called a style sheet. 00:37:35.170 --> 00:37:38.680 Questions, then, on this use of HTTP headers 00:37:38.680 --> 00:37:42.910 to prevent these kinds of stored or reflected attacks or anything 00:37:42.910 --> 00:37:44.200 else thus far? 00:37:44.200 --> 00:37:48.463 AUDIENCE: What do the backslash P and backslash A [INAUDIBLE] sequence do? 00:37:48.463 --> 00:37:49.630 DAVID J. MALAN: Backslash P? 00:37:49.630 --> 00:37:52.000 Oh, in the HTML. 00:37:52.000 --> 00:37:58.060 So recall that a lot of our tags have open tags and close tags and the slash. 00:37:58.060 --> 00:38:00.160 It's actually a forward, not a backslash. 00:38:00.160 --> 00:38:04.310 The forward here just finishes the thought for the browser. 00:38:04.310 --> 00:38:05.620 So this starts the tag. 00:38:05.620 --> 00:38:06.940 This ends the tag. 
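For the script-src header just described, here is a minimal sketch of a server attaching it to every response, written as a plain WSGI application in Python. The function name `app`, the placeholder page body, and the exact origin string `https://example.com/` are assumptions for illustration; a real deployment would more likely set the header in the web server's configuration.

```python
# Assumed policy value: only scripts loaded from this origin may run.
CSP_VALUE = "script-src https://example.com/"


def app(environ, start_response):
    """Hypothetical WSGI app that sends the header with every page."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Key and value as described in the lecture: the browser is told
        # to refuse inline script tags and allow only scripts from files
        # served by the listed origin.
        ("Content-Security-Policy", CSP_VALUE),
    ]
    start_response("200 OK", headers)
    # Placeholder body; a real page would go here.
    return [b"<!DOCTYPE html><html><body>...</body></html>"]
```

To try it locally, the app could be served with the standard library's `wsgiref.simple_server.make_server`. A browser that honors the header would then decline to run any inline script that slipped into the page.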
00:38:06.940 --> 00:38:11.650 And you use the same word, script in this case, script in this case or A, 00:38:11.650 --> 00:38:13.450 or P, as you described. 00:38:13.450 --> 00:38:18.470 That is what closes or ends the tag in question, 00:38:18.470 --> 00:38:23.070 so that you know where the tag ends or where the paragraph ends. 00:38:23.070 --> 00:38:24.870 Other questions? 00:38:24.870 --> 00:38:27.480 AUDIENCE: Pertaining that React framework, as far 00:38:27.480 --> 00:38:32.460 as I understand [INAUDIBLE] format, you use interchangeably 00:38:32.460 --> 00:38:36.720 both JavaScript and HTML, how that isn't a security 00:38:36.720 --> 00:38:39.672 risk for these kind of attacks? 00:38:39.672 --> 00:38:41.880 DAVID J. MALAN: Really good question beyond the scope 00:38:41.880 --> 00:38:44.830 of this class for those who don't have a programming background. 00:38:44.830 --> 00:38:48.450 However, yes, React and other frameworks use a technique 00:38:48.450 --> 00:38:51.690 called JSX, which combines JavaScript with HTML 00:38:51.690 --> 00:38:54.270 with CSS that are rendered by the browser. 00:38:54.270 --> 00:38:57.270 In that case, though, Mateo, the browser is 00:38:57.270 --> 00:39:01.230 running JavaScript code that comes from the React library that 00:39:01.230 --> 00:39:05.760 is reading as input that JSX code and converting it 00:39:05.760 --> 00:39:10.690 to the resulting code that should be executed within the browser. 00:39:10.690 --> 00:39:16.440 So long as all of that code comes from .js files, or .css files, or the like, 00:39:16.440 --> 00:39:17.380 all is well. 00:39:17.380 --> 00:39:20.640 But if you just inline it and you're outputting headers like this, 00:39:20.640 --> 00:39:22.180 it won't execute at all. 00:39:22.180 --> 00:39:23.650 So the same rules apply. 00:39:23.650 --> 00:39:26.240 You would have to use an external file. 
00:39:26.240 --> 00:39:29.470 So when it comes to code injection, there are other types of attacks, 00:39:29.470 --> 00:39:32.140 particularly, in the context of what's called SQL, 00:39:32.140 --> 00:39:35.350 or Structured Query Language, which is a language that's typically used 00:39:35.350 --> 00:39:37.940 with databases, so again, on a server. 00:39:37.940 --> 00:39:41.380 So let's consider how you might also trick software 00:39:41.380 --> 00:39:44.830 into executing SQL code, that is, code written 00:39:44.830 --> 00:39:48.280 in this particular language, when it comes to databases specifically. 00:39:48.280 --> 00:39:51.790 Well, here, for instance, is some representative code in this language 00:39:51.790 --> 00:39:56.680 called SQL, whereby, you have a line like SELECT * FROM users WHERE 00:39:56.680 --> 00:40:00.940 username= quote, unquote and then username in curly braces. 00:40:00.940 --> 00:40:03.310 Now, consider this to be pseudo code of sorts 00:40:03.310 --> 00:40:08.470 because I'm mixing some SQL syntax with some Python syntax in this case 00:40:08.470 --> 00:40:12.520 because it turns out that when you're using this language, SQL or sequel, 00:40:12.520 --> 00:40:15.910 you typically use it in combination with some other language, 00:40:15.910 --> 00:40:19.420 be it Python, or PHP, or Java, or something else. 00:40:19.420 --> 00:40:21.370 And you use that other language typically 00:40:21.370 --> 00:40:26.600 to construct queries dynamically based on values that humans have typed in. 
00:40:26.600 --> 00:40:29.140 So for instance, if you're logging into a website 00:40:29.140 --> 00:40:31.480 and you type in your username and hit Enter, 00:40:31.480 --> 00:40:36.280 very often, if that website is implemented in Python, or PHP, or Java, 00:40:36.280 --> 00:40:41.440 it might use one of those languages to construct a SQL query that is then 00:40:41.440 --> 00:40:44.800 actually sent to the database to look up that specific user who's 00:40:44.800 --> 00:40:46.070 trying to log in. 00:40:46.070 --> 00:40:49.450 So what I have here then is mostly SQL syntax, 00:40:49.450 --> 00:40:54.070 except for in these curly braces some Python-specific syntax. 00:40:54.070 --> 00:40:57.580 And what this curly brace with username inside of it represents 00:40:57.580 --> 00:41:02.320 is hey, server, plug in whatever the human typed in as their username 00:41:02.320 --> 00:41:03.920 into that part of the string. 00:41:03.920 --> 00:41:06.460 So the curly braces and the word username 00:41:06.460 --> 00:41:09.460 should be replaced with literally something like Malan 00:41:09.460 --> 00:41:12.100 if that is the user who's trying to log in. 00:41:12.100 --> 00:41:15.850 And that will then resulting-- the resulting code will 00:41:15.850 --> 00:41:17.950 be sent to the database to select everything 00:41:17.950 --> 00:41:21.850 we know from the user's table, so to speak, in that database 00:41:21.850 --> 00:41:23.960 about that particular username. 00:41:23.960 --> 00:41:26.360 So what could potentially go wrong here? 00:41:26.360 --> 00:41:30.520 Well, it all has to do with, again, trusting input from the user. 00:41:30.520 --> 00:41:32.560 And that should now be emerging as a theme. 00:41:32.560 --> 00:41:37.797 You should generally always mistrust input that comes from users. 00:41:37.797 --> 00:41:39.130 You should do something with it. 
00:41:39.130 --> 00:41:43.000 But you should sanitize it or scrub it in such a way, 00:41:43.000 --> 00:41:46.120 that any potentially dangerous characters are somehow escaped. 00:41:46.120 --> 00:41:49.990 And that's exactly what the solution was to those cross-site scripting 00:41:49.990 --> 00:41:53.440 attacks, whereby, so long as we escaped the user's input 00:41:53.440 --> 00:41:56.860 and changed the less than sign, and the greater than sign, and maybe 00:41:56.860 --> 00:42:00.580 some other symbols as well to the equivalent character escapes, 00:42:00.580 --> 00:42:02.090 all was well. 00:42:02.090 --> 00:42:04.870 So here, too, is an example now in the context of databases 00:42:04.870 --> 00:42:09.340 where a bit of paranoia will go a long way to keeping your software secure. 00:42:09.340 --> 00:42:10.150 Why? 00:42:10.150 --> 00:42:13.420 Well, suppose that my username is, indeed, Malan, 00:42:13.420 --> 00:42:16.180 but suppose that's not what I type into the website 00:42:16.180 --> 00:42:17.900 when trying to log in, for instance. 00:42:17.900 --> 00:42:21.010 So instead of typing just my username, suppose 00:42:21.010 --> 00:42:24.460 I am suspicious as the adversarially that this website is probably 00:42:24.460 --> 00:42:25.540 using a database. 00:42:25.540 --> 00:42:28.210 And that database is probably using this language, SQL. 00:42:28.210 --> 00:42:32.380 So what could I do to kind of mess with the owners of this website 00:42:32.380 --> 00:42:36.670 and try to trick their database into executing my code 00:42:36.670 --> 00:42:38.510 and not just their own? 00:42:38.510 --> 00:42:40.180 How do I inject code of my own? 00:42:40.180 --> 00:42:43.180 Well, instead of Malan, let me a little cryptically type 00:42:43.180 --> 00:42:46.690 this in to the website where I'm prompted for my username. 00:42:46.690 --> 00:42:48.250 Now, this does look cryptic. 
00:42:48.250 --> 00:42:50.200 And odds are an adversary is not going to know 00:42:50.200 --> 00:42:55.330 exactly what to type the very first time they try to hack into a server. 00:42:55.330 --> 00:42:58.330 Rather, it's through trial and error very often 00:42:58.330 --> 00:43:01.120 that an adversary might eventually realize, ah, 00:43:01.120 --> 00:43:04.150 this is what I could probably type into that website 00:43:04.150 --> 00:43:06.520 to inject some code of my own. 00:43:06.520 --> 00:43:08.140 So to be clear, what have I typed? 00:43:08.140 --> 00:43:12.910 I've still typed my username, M-A-L-A-N, but then I've typed a single quote, 00:43:12.910 --> 00:43:18.070 and then a semicolon, and then DELETE FROM users, and then another semicolon, 00:43:18.070 --> 00:43:19.300 and then a dash, dash. 00:43:19.300 --> 00:43:22.680 Now, if you don't know SQL-- and you're not expected to know SQL for this 00:43:22.680 --> 00:43:23.250 course-- 00:43:23.250 --> 00:43:25.410 this looks weird, probably. 00:43:25.410 --> 00:43:29.250 But each of these symbols, each of these punctuation symbols, in particular, 00:43:29.250 --> 00:43:33.100 means something specific and serves a particular purpose. 00:43:33.100 --> 00:43:34.510 Now, what might that be? 00:43:34.510 --> 00:43:36.390 Well, let me go back to the original query. 00:43:36.390 --> 00:43:41.280 And now let me assume that in yellow here, the curly braces with username, 00:43:41.280 --> 00:43:43.440 is where my username is supposed to go. 00:43:43.440 --> 00:43:45.510 And my username is supposed to be Malan. 00:43:45.510 --> 00:43:49.020 But what if I type in that long sequence of cryptic text? 00:43:49.020 --> 00:43:51.360 Here's what's going to happen on the server. 00:43:51.360 --> 00:43:54.870 Because it's using a language like Python, or PHP, or Java, 00:43:54.870 --> 00:43:58.530 this yellow value is going to be "interpolated," 00:43:58.530 --> 00:44:01.750 that is, replaced with whatever the human typed in.
00:44:01.750 --> 00:44:02.490 So let's do that. 00:44:02.490 --> 00:44:06.360 Let me paste in what I, the adversary, typed in. 00:44:06.360 --> 00:44:09.700 And notice I've kept yellow the user's input. 00:44:09.700 --> 00:44:12.810 So everything in white is still the part of the SQL query 00:44:12.810 --> 00:44:16.290 that the designers of the database came up with in advance. 00:44:16.290 --> 00:44:20.130 But everything in yellow is what came from a form on the web, an adversary, 00:44:20.130 --> 00:44:21.100 in my case. 00:44:21.100 --> 00:44:23.040 And this looks a little cryptic still. 00:44:23.040 --> 00:44:26.400 But even if you've never seen SQL before, 00:44:26.400 --> 00:44:29.190 you might have an intuition for what could go wrong. 00:44:29.190 --> 00:44:34.380 Because I, the adversary, typed not only Malan, but a single quote here, 00:44:34.380 --> 00:44:38.070 notice that, oh, my goodness, that perfectly lines up 00:44:38.070 --> 00:44:42.960 with the single quote that the database designer used in their query. 00:44:42.960 --> 00:44:45.960 And so even though this white quote is meant 00:44:45.960 --> 00:44:49.620 to be closed by this white quote way over here, 00:44:49.620 --> 00:44:52.350 notice that grammatically in this language, 00:44:52.350 --> 00:44:56.280 not to mention in English and other human languages, this single quote here 00:44:56.280 --> 00:44:58.950 or apostrophe, because it comes first, will 00:44:58.950 --> 00:45:04.620 be presumed to close this single quote or this apostrophe here. 00:45:04.620 --> 00:45:08.160 The semicolon, it turns out in this language, SQL, ends a thought. 00:45:08.160 --> 00:45:09.690 It's like a period in English. 00:45:09.690 --> 00:45:14.080 And so anything after a semicolon is like a new command altogether. 00:45:14.080 --> 00:45:18.330 So notice that DELETE FROM users semicolon is like a second command that 00:45:18.330 --> 00:45:20.400 came entirely from me, the adversary. 
00:45:20.400 --> 00:45:23.520 And then dash, dash, it turns out-- and this is very clever. 00:45:23.520 --> 00:45:27.720 Dash, dash in a lot of versions of SQL represents a "comment." 00:45:27.720 --> 00:45:32.280 And a comment in a programming language means ignore everything after this 00:45:32.280 --> 00:45:36.720 because a problem right now is that single quote, that apostrophe, 00:45:36.720 --> 00:45:39.840 was meant to surround the user's username. 00:45:39.840 --> 00:45:43.230 But because I, the adversary, already gave you a single quote 00:45:43.230 --> 00:45:47.230 to use accidentally as closing this thought, 00:45:47.230 --> 00:45:50.200 well, we don't need this single quote at the very end anymore. 00:45:50.200 --> 00:45:53.860 So this is why the adversary, or me in this story is doing dash, dash. 00:45:53.860 --> 00:45:56.820 That's just going to tell the server, OK, ignore everything 00:45:56.820 --> 00:46:01.710 after that, including the single quote that we do not need grammatically. 00:46:01.710 --> 00:46:03.150 So let me reformat this a bit. 00:46:03.150 --> 00:46:06.113 I'm going to go ahead and add some new lines, some white space, just 00:46:06.113 --> 00:46:07.530 to make it a little more readable. 00:46:07.530 --> 00:46:13.260 What you see on the screen here right now is equivalent to this. 00:46:13.260 --> 00:46:16.410 Notice that I've moved the Delete command to its own line 00:46:16.410 --> 00:46:17.760 just for readability's sake. 00:46:17.760 --> 00:46:21.300 I've gotten rid of the final apostrophe, the single quote, 00:46:21.300 --> 00:46:23.820 because it was after a comment, which means by design, 00:46:23.820 --> 00:46:25.090 it's meant to be ignored.
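To make the interpolation concrete, here is a minimal Python sketch of what a server like this might be doing. The function name `build_query` is hypothetical, and nothing is sent to a database; the code only builds the string, using the query and the adversary's input described above.

```python
def build_query(username: str) -> str:
    """Hypothetical helper: naively interpolate user input into SQL (unsafe)."""
    return f"SELECT * FROM users WHERE username = '{username}'"


# A well-behaved user types only a username.
honest_query = build_query("malan")

# The adversary types a quote, a second command, and a comment marker.
injected_query = build_query("malan'; DELETE FROM users; --")

print(honest_query)
print(injected_query)
```

The second string contains two statements followed by a comment that swallows the designer's closing quote, which is exactly the shape of the reformatted query in the lecture.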
00:46:25.090 --> 00:46:29.640 So what I have done as the adversary because I presumed, or inferred, 00:46:29.640 --> 00:46:33.330 or figured out that this website is using single quotes and they're just 00:46:33.330 --> 00:46:37.290 blindly interpolating, that is, replacing those curly braces 00:46:37.290 --> 00:46:41.790 and username with literally anything I type in, I can trick the server 00:46:41.790 --> 00:46:45.990 into finishing this first command by saying SELECT * FROM users WHERE 00:46:45.990 --> 00:46:48.360 username='malan';. 00:46:48.360 --> 00:46:51.630 And worse, I can trick this particular database 00:46:51.630 --> 00:46:55.710 into executing a second SQL command, which even if again, you've never seen 00:46:55.710 --> 00:46:58.490 SQL, deleting is probably a bad thing. 00:46:58.490 --> 00:47:00.240 It's probably a destructive thing that you 00:47:00.240 --> 00:47:02.700 don't want some random adversary on the internet being 00:47:02.700 --> 00:47:04.810 able to do on your server. 00:47:04.810 --> 00:47:07.300 So what's the goal of these lines here? 00:47:07.300 --> 00:47:09.600 Well, the original intent of the first query, 00:47:09.600 --> 00:47:13.047 presumably, I claimed, was just to search for the user in the database, 00:47:13.047 --> 00:47:14.130 so that they could log in. 00:47:14.130 --> 00:47:16.770 So when I type in Malan and hit Enter, I am somehow 00:47:16.770 --> 00:47:20.340 able to log in, probably, after typing also a password, maybe 00:47:20.340 --> 00:47:22.200 a two-factor code or the like. 00:47:22.200 --> 00:47:25.950 But SELECT * FROM users WHERE username='malan', fine. 00:47:25.950 --> 00:47:28.110 That's probably going to retrieve the information 00:47:28.110 --> 00:47:29.640 that it was supposed to retrieve. 00:47:29.640 --> 00:47:33.000 The dangerous part here is that I tricked this server 00:47:33.000 --> 00:47:35.230 into executing a second command. 00:47:35.230 --> 00:47:36.780 And this one looks destructive.
00:47:36.780 --> 00:47:41.160 DELETE FROM users; means delete all of the users from the system. 00:47:41.160 --> 00:47:44.610 So it doesn't help the adversary in this case get into the system 00:47:44.610 --> 00:47:49.590 or do anything with the Malan account other than delete it and literally 00:47:49.590 --> 00:47:52.240 every other account in the system. 00:47:52.240 --> 00:47:53.710 So this is bad. 00:47:53.710 --> 00:47:59.190 This is representative of a SQL injection, whereby I, the adversary, 00:47:59.190 --> 00:48:03.510 wrote code that you, the designer of this database, accidentally, 00:48:03.510 --> 00:48:08.100 naively treated as part of your own commands. 00:48:08.100 --> 00:48:11.710 So how else could things go wrong? 00:48:11.710 --> 00:48:14.790 Well, not only could you do something destructive like deleting data 00:48:14.790 --> 00:48:15.540 from the database. 00:48:15.540 --> 00:48:18.210 But suppose that the user is prompted at the same time 00:48:18.210 --> 00:48:20.820 for a username and a password now in this story, 00:48:20.820 --> 00:48:25.240 and suppose, therefore, that the query in the software is this, 00:48:25.240 --> 00:48:28.210 SELECT * FROM users WHERE username equals, 00:48:28.210 --> 00:48:30.970 quote, unquote, "username" in curly braces, but one more 00:48:30.970 --> 00:48:35.440 phrase, AND password equals, quote, unquote, 00:48:35.440 --> 00:48:39.340 "password" based on whatever the human typed in as their password. 00:48:39.340 --> 00:48:42.520 So again, to be clear, in this story, the user 00:48:42.520 --> 00:48:44.560 is prompted for a username and a password. 00:48:44.560 --> 00:48:48.910 And the SQL command that's going to use those two values looks like this. 00:48:48.910 --> 00:48:52.720 But here, too, we're setting the stage for an injection attack. 00:48:52.720 --> 00:48:53.260 Why?
00:48:53.260 --> 00:48:56.350 Because based on these placeholders with the curly braces 00:48:56.350 --> 00:48:58.960 around username and password, it looks like we're just 00:48:58.960 --> 00:49:04.180 going to blindly plug in to this command exactly what it 00:49:04.180 --> 00:49:08.420 is the human had typed for their username and password respectively. 00:49:08.420 --> 00:49:11.540 So what could a more sophisticated adversary now do? 00:49:11.540 --> 00:49:14.140 Well, maybe instead of typing in Malan and then 00:49:14.140 --> 00:49:17.692 whatever Malan's actual password is, suppose that they just 00:49:17.692 --> 00:49:20.650 want to get into someone's account, maybe Malan's, maybe someone else's 00:49:20.650 --> 00:49:21.490 altogether? 00:49:21.490 --> 00:49:27.360 What if the adversary doesn't just type Malan, not to mention Malan's password, 00:49:27.360 --> 00:49:32.820 but what if they type in this specifically for Malan's password? 00:49:32.820 --> 00:49:33.770 Now, this is weird. 00:49:33.770 --> 00:49:36.180 And I'll tell you now, this is not, in fact, my password. 00:49:36.180 --> 00:49:38.180 But what has the adversary typed in? 00:49:38.180 --> 00:49:42.050 A single quote, the word or, then another single quote 00:49:42.050 --> 00:49:45.710 with a one and a single quote equals a single quote and a one. 00:49:45.710 --> 00:49:48.080 So this looks very, very weird. 00:49:48.080 --> 00:49:49.590 But let's see what happens. 00:49:49.590 --> 00:49:51.620 And again, most adversaries wouldn't figure this 00:49:51.620 --> 00:49:53.480 out the first time they try. 00:49:53.480 --> 00:49:56.930 Odds are, they'd be trying a whole bunch of techniques and heuristics 00:49:56.930 --> 00:49:59.210 to figure out what might actually work for them. 
00:49:59.210 --> 00:50:02.210 So we're fast forwarding to the end of the story where the adversary has 00:50:02.210 --> 00:50:07.280 figured out that this weird sequence of characters can hack into this server 00:50:07.280 --> 00:50:10.560 by tricking it into executing code that wasn't intended. 00:50:10.560 --> 00:50:14.180 So here again in yellow is exactly what the adversary has typed in, 00:50:14.180 --> 00:50:16.940 Malan, which may very well be a legitimate username, 00:50:16.940 --> 00:50:21.560 but then for the password in yellow, single quote or single quote, 00:50:21.560 --> 00:50:24.500 one single quote equals single quote one. 00:50:24.500 --> 00:50:27.020 And based on our previous example, you can, perhaps, 00:50:27.020 --> 00:50:29.120 see what's starting to go on here. 00:50:29.120 --> 00:50:32.120 We've finished the Malan thought naturally. 00:50:32.120 --> 00:50:35.030 We didn't type anything malicious for the username this time. 00:50:35.030 --> 00:50:38.450 But we did type something seemingly malicious for the password. 00:50:38.450 --> 00:50:41.660 And the first single quote in yellow quickly 00:50:41.660 --> 00:50:44.990 finishes the password thought, quote, unquote, nothing in between. 00:50:44.990 --> 00:50:49.040 But then we're saying or, quote, unquote, one equals, quote, one. 00:50:49.040 --> 00:50:49.670 Why? 00:50:49.670 --> 00:50:52.400 Well, the adversary in this case kind of figured out, or knew, 00:50:52.400 --> 00:50:57.300 or guessed that the SQL command ends with a single quote itself. 
00:50:57.300 --> 00:51:00.200 So the whole point here is even though this, too, probably looks 00:51:00.200 --> 00:51:04.790 very cryptic is that grammatically, what the adversary has typed 00:51:04.790 --> 00:51:08.390 in not only perfectly aligns with the username field, 00:51:08.390 --> 00:51:10.400 because it's just Malan, nothing special there, 00:51:10.400 --> 00:51:13.760 but very cleverly, the adversary has finished 00:51:13.760 --> 00:51:16.910 the thought of this single quote and also finished 00:51:16.910 --> 00:51:18.840 the thought of this single quote. 00:51:18.840 --> 00:51:21.890 So we've made everything balanced, just like you would not only in SQL, 00:51:21.890 --> 00:51:23.720 but in a language like English. 00:51:23.720 --> 00:51:25.700 So let me go ahead and clean this up a little 00:51:25.700 --> 00:51:28.920 bit too to make clear why this is dangerous. 00:51:28.920 --> 00:51:31.970 This command now, once formed by the server, 00:51:31.970 --> 00:51:35.022 based on that adversary's input, is really the same as this. 00:51:35.022 --> 00:51:37.730 And I'm just going to add, again, a new line and some white space 00:51:37.730 --> 00:51:40.170 just to help us wrap our minds around what's going on. 00:51:40.170 --> 00:51:42.650 So I've just moved the or line to the bottom. 00:51:42.650 --> 00:51:45.440 And just like in math class years ago, let 00:51:45.440 --> 00:51:47.870 me go ahead and put parentheses around things here 00:51:47.870 --> 00:51:52.580 that makes clear what the precedence is of things like and and or. 00:51:52.580 --> 00:51:55.070 It turns out that and, like multiplication, 00:51:55.070 --> 00:51:56.667 binds at a higher precedence. 00:51:56.667 --> 00:51:57.500 It's more important. 00:51:57.500 --> 00:51:58.792 You're supposed to do it first. 00:51:58.792 --> 00:52:02.000 So I'm going to add parentheses now to this first expression. 00:52:02.000 --> 00:52:03.350 They're not strictly necessary. 
00:52:03.350 --> 00:52:04.253 They're implied. 00:52:04.253 --> 00:52:06.920 I'm just making them explicit now to show you, just like in math 00:52:06.920 --> 00:52:09.210 class, the order of operations. 00:52:09.210 --> 00:52:10.590 Now, what does this mean? 00:52:10.590 --> 00:52:13.680 This means that the database is going to say, SELECT * FROM users-- 00:52:13.680 --> 00:52:15.500 so select everything from the users table-- 00:52:15.500 --> 00:52:20.480 WHERE the username is 'malan' and the password is quote unquote. 00:52:20.480 --> 00:52:22.580 Now, that is probably not my password. 00:52:22.580 --> 00:52:24.560 My password is definitely not nothing. 00:52:24.560 --> 00:52:25.730 It's not empty. 00:52:25.730 --> 00:52:27.350 But that doesn't matter now. 00:52:27.350 --> 00:52:28.040 Why? 00:52:28.040 --> 00:52:33.650 Because even if this first clause, WHERE username = 'malan' and password = quote 00:52:33.650 --> 00:52:37.490 unquote-- even if that doesn't find anyone in the database with a username 00:52:37.490 --> 00:52:41.060 of 'malan' and a password of quote unquote, it doesn't matter, 00:52:41.060 --> 00:52:44.690 because we've tricked the database command into including an OR, 00:52:44.690 --> 00:52:48.050 which is so stupid that it's always true-- 00:52:48.050 --> 00:52:49.970 OR '1' = '1'. 00:52:49.970 --> 00:52:55.580 Well, 1 always equals 1, which means that now, logically, this query is 00:52:55.580 --> 00:52:59.690 going to return everything we know about users from the database. 00:52:59.690 --> 00:53:01.290 And why is this problematic? 00:53:01.290 --> 00:53:04.880 Well, when you're logging users into a database-- logging users into a website 00:53:04.880 --> 00:53:08.210 or application, you're typically searching for them in the database. 00:53:08.210 --> 00:53:11.090 And typically, if you get back one or more users, 00:53:11.090 --> 00:53:14.660 you're going to assume that the very first user is the one that you want.
00:53:14.660 --> 00:53:17.060 And maybe in this case, it's Malan, but it's also 00:53:17.060 --> 00:53:21.800 very common in servers for the very first user that was created to be you, 00:53:21.800 --> 00:53:23.240 the person that designed the site. 00:53:23.240 --> 00:53:25.820 And you probably have administrative privileges-- 00:53:25.820 --> 00:53:28.290 that is, access over everything in this system. 00:53:28.290 --> 00:53:32.040 And so if a query like this is returning all of the users, 00:53:32.040 --> 00:53:36.290 including you as the very first one, if there's additional code in the system 00:53:36.290 --> 00:53:40.310 that we won't put on the screen here or bother hypothesizing about, 00:53:40.310 --> 00:53:44.180 it means that you could be now letting the adversary 00:53:44.180 --> 00:53:47.810 log in maybe as Malan, but worse, maybe as you, 00:53:47.810 --> 00:53:51.050 all because you trusted user input. 00:53:51.050 --> 00:53:54.620 But you should never trust that your users, if called Malan, 00:53:54.620 --> 00:53:56.240 are going to type in just 'malan'. 00:53:56.240 --> 00:53:59.540 You should always assume that there's someone out there, very annoyingly, 00:53:59.540 --> 00:54:03.890 very maliciously, that's going to try using some single quotes, 00:54:03.890 --> 00:54:08.660 some semicolons-- or in HTML, we saw a less-than sign or greater-than sign. 00:54:08.660 --> 00:54:11.210 You should always expect that someone on the internet 00:54:11.210 --> 00:54:15.500 will have enough time and interest in hacking your website or application 00:54:15.500 --> 00:54:19.080 that this might indeed happen to you and your software. 00:54:19.080 --> 00:54:20.900 So what's the solution, then? 00:54:20.900 --> 00:54:25.062 How do you avoid a query that's equivalent ultimately to this-- 00:54:25.062 --> 00:54:27.020 because if there's no 'malan' with no password, 00:54:27.020 --> 00:54:30.680 it's still the same as asking for WHERE '1' = '1', which is anything. 
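The always-true OR can be tried out safely against a throwaway in-memory database. The following is only a sketch: the table layout, the sample accounts, and the function name `unsafe_login` are assumptions made for illustration, and plain-text passwords are used solely to mirror the query on screen.

```python
import sqlite3

# Throwaway in-memory database with two hypothetical accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.executemany(
    "INSERT INTO users (username, password) VALUES (?, ?)",
    [("admin", "first-account-secret"), ("malan", "not-my-real-password")],
)


def unsafe_login(username: str, password: str) -> list:
    """Deliberately vulnerable: interpolates input straight into the SQL text."""
    query = (
        f"SELECT * FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    return conn.execute(query).fetchall()


# A wrong password matches no rows, as intended.
wrong_guess = unsafe_login("malan", "wrong")

# The adversary's "password" turns the WHERE clause into
# (username = 'malan' AND password = '') OR '1' = '1', which is always true.
injected = unsafe_login("malan", "' OR '1' = '1")

print(len(wrong_guess), len(injected))
```

With the injected input the query returns every row in the table, including the first account created, even though no valid password was supplied.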
00:54:30.680 --> 00:54:32.510 And to be clear, I didn't have to use 1. 00:54:32.510 --> 00:54:35.210 I could have used 2 or 3 or 4. 00:54:35.210 --> 00:54:37.865 I could have used "cat" or "dog" or anything else. 00:54:37.865 --> 00:54:40.490 So long as the thing on the left equals the thing on the right, 00:54:40.490 --> 00:54:44.150 and I type that into the application, the same thing 00:54:44.150 --> 00:54:47.090 would certainly equal itself, is the point here, 00:54:47.090 --> 00:54:49.820 and 1 is just the simplest thing we could think of. 00:54:49.820 --> 00:54:54.840 So what is the solution here, to SQL injection attack specifically? 00:54:54.840 --> 00:54:58.370 Well, it's very similar in spirit to the notion of character escapes. 00:54:58.370 --> 00:55:01.940 But in the world of SQL, there tend to be standard ways 00:55:01.940 --> 00:55:04.130 of escaping dangerous characters. 00:55:04.130 --> 00:55:05.540 You don't have to do it yourself. 00:55:05.540 --> 00:55:09.020 And much like security in general, with encryption specifically, 00:55:09.020 --> 00:55:13.160 you probably should not be writing code yourself to solve problems 00:55:13.160 --> 00:55:15.950 like these that hundreds, thousands, millions of people 00:55:15.950 --> 00:55:19.890 before you have already had to deal with and have probably solved correctly. 00:55:19.890 --> 00:55:23.900 Do not reinvent wheels when you don't need to in the context of security. 00:55:23.900 --> 00:55:27.650 So this is to say, in the world of databases, most databases support 00:55:27.650 --> 00:55:32.060 what are called prepared statements, which is a fancy way of saying that you 00:55:32.060 --> 00:55:34.760 provide the code for your SQL query. 00:55:34.760 --> 00:55:38.780 You provide placeholders for wherever you want user input. 00:55:38.780 --> 00:55:44.690 But let the database itself replace or interpolate those placeholders 00:55:44.690 --> 00:55:46.770 with the user's actual input. 
00:55:46.770 --> 00:55:50.330 And let the database handle escaping anything dangerous. 00:55:50.330 --> 00:55:54.350 And we've seen that things are dangerous, like apostrophes thus far. 00:55:54.350 --> 00:55:59.390 So, for instance, instead of writing a single apostrophe-- and this 00:55:59.390 --> 00:56:00.560 is weird, admittedly. 00:56:00.560 --> 00:56:05.030 In the world of SQL, the way you typically escape an apostrophe 00:56:05.030 --> 00:56:06.170 is not like HTML. 00:56:06.170 --> 00:56:09.800 You don't do &apos semicolon. 00:56:09.800 --> 00:56:13.580 You don't, like in some languages, put a backslash in front of it, typically. 00:56:13.580 --> 00:56:18.650 The way, weirdly, you escape a single quote or an apostrophe in SQL 00:56:18.650 --> 00:56:22.760 is very often by putting two of them in a row. 00:56:22.760 --> 00:56:23.810 So why? 00:56:23.810 --> 00:56:25.470 We'll defer that to another day. 00:56:25.470 --> 00:56:27.140 But this is just the convention. 00:56:27.140 --> 00:56:29.450 Now, this means that you could write code that changes 00:56:29.450 --> 00:56:31.520 any single quote to two single quotes. 00:56:31.520 --> 00:56:33.350 But again, don't bother doing that. 00:56:33.350 --> 00:56:35.870 Use functionality that comes with the database 00:56:35.870 --> 00:56:37.530 or whatever library you're using. 00:56:37.530 --> 00:56:40.730 So, for instance, if we go back to that very first query that 00:56:40.730 --> 00:56:45.590 was vulnerable to being injected with something like, DELETE FROM users, 00:56:45.590 --> 00:56:47.130 what if we now do this? 00:56:47.130 --> 00:56:50.630 Let's change our Python-based placeholder, using 00:56:50.630 --> 00:56:54.800 curly braces in yellow here, and let's change that and get rid of the quotes 00:56:54.800 --> 00:56:56.240 and just put a question mark. 
00:56:56.240 --> 00:56:59.575 This is one of the common conventions in prepared statements, where 00:56:59.575 --> 00:57:02.450 you put a question mark not because you don't know what to put there, 00:57:02.450 --> 00:57:05.630 but because you want the database to replace that question 00:57:05.630 --> 00:57:08.360 mark with a user's own input. 00:57:08.360 --> 00:57:09.600 Then what happens? 00:57:09.600 --> 00:57:13.160 Well, if the user types in that dangerous command 00:57:13.160 --> 00:57:16.890 with the DELETE inside of it, notice what happens. 00:57:16.890 --> 00:57:18.650 Here's the single quote. 00:57:18.650 --> 00:57:20.330 Here's the close single quote. 00:57:20.330 --> 00:57:23.630 And the database has given that to you automatically. 00:57:23.630 --> 00:57:26.870 The prepared statement adds those single quotes for you. 00:57:26.870 --> 00:57:29.450 Notice that even though I, the adversary, 00:57:29.450 --> 00:57:33.980 only typed in 'malan' single quote semicolon, 00:57:33.980 --> 00:57:38.630 the prepared statement has gone ahead and escaped a single quote 00:57:38.630 --> 00:57:41.610 or apostrophe with two of them instead. 00:57:41.610 --> 00:57:44.600 And nothing else here thereafter is actually too worrisome. 00:57:44.600 --> 00:57:48.570 That alone is sufficient to solve the problem. 00:57:48.570 --> 00:57:51.530 Now, this looks a little weird, right, because it kind of 00:57:51.530 --> 00:57:55.110 looks like, logically, well, you still have this quote and this quote, 00:57:55.110 --> 00:57:58.460 which line up, and you still have this quote and this quote, which line up. 00:57:58.460 --> 00:58:01.460 So it looks like we haven't really fundamentally solved the problem, 00:58:01.460 --> 00:58:02.370 but we have. 
00:58:02.370 --> 00:58:06.770 It turns out, in SQL databases, anytime they see two single quotes back 00:58:06.770 --> 00:58:09.110 to back, they don't try to pair them with something 00:58:09.110 --> 00:58:10.910 to the left or something to the right. 00:58:10.910 --> 00:58:16.140 They just treat it as one special escape sequence, so to speak. 00:58:16.140 --> 00:58:18.030 So that would then fix this query. 00:58:18.030 --> 00:58:21.020 And if we go back to the second query, which had two placeholders, 00:58:21.020 --> 00:58:23.570 username and password using this Python syntax, 00:58:23.570 --> 00:58:26.480 let me go ahead and change that to prepared statement syntax using, 00:58:26.480 --> 00:58:30.440 in this case, question marks without quotes and trust that the database 00:58:30.440 --> 00:58:34.400 itself will add any necessary quotes and escape any potentially dangerous 00:58:34.400 --> 00:58:35.190 characters. 00:58:35.190 --> 00:58:37.910 Now, what did the adversary type in in that second scenario? 00:58:37.910 --> 00:58:41.000 Well, it was just innocuously 'malan', and so that comes back from 00:58:41.000 --> 00:58:44.870 the prepared statement as being prepared with quote unquote on the outside. 00:58:44.870 --> 00:58:45.890 No problem there. 00:58:45.890 --> 00:58:51.080 But this other input from the user, from the adversary's password, which 00:58:51.080 --> 00:58:53.490 was very cryptic with lots of single quotes-- 00:58:53.490 --> 00:58:57.770 notice that every single quote in the adversary's so-called password 00:58:57.770 --> 00:59:01.310 has been escaped so that the single quote here becomes two. 00:59:01.310 --> 00:59:02.810 The single quote here becomes two. 00:59:02.810 --> 00:59:04.010 The single quote here becomes two. 00:59:04.010 --> 00:59:05.427 The single quote here becomes two. 00:59:05.427 --> 00:59:09.830 And the prepared statement automatically adds a final single quote 00:59:09.830 --> 00:59:10.920 at the very end.
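The question-mark placeholders can be sketched with Python's built-in sqlite3 module. The table contents and the function name `safe_login` are assumptions for illustration. One caveat: sqlite3 passes the values to the database separately from the SQL text rather than doubling quotes textually, so the mechanism differs slightly from the escaped string on screen, but the effect is the same in that the input is only ever treated as data.

```python
import sqlite3

# Throwaway in-memory database with one hypothetical account.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute(
    "INSERT INTO users (username, password) VALUES (?, ?)",
    ("malan", "not-my-real-password"),
)


def safe_login(username: str, password: str) -> list:
    """Look up a user with ? placeholders; the database fills in the values."""
    return conn.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (username, password),
    ).fetchall()


# Both of the adversary's inputs are now compared as literal text.
tautology_attempt = safe_login("malan", "' OR '1' = '1")
delete_attempt = safe_login("malan'; DELETE FROM users; --", "")

# The legitimate login still works, and the table is intact.
legitimate = safe_login("malan", "not-my-real-password")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(tautology_attempt, delete_attempt, legitimate, remaining)
```

Neither attack string matches any row, and the users table still holds its one account afterward.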
00:59:10.920 --> 00:59:14.900 But I've kept highlighted in yellow everything that represents the user's 00:59:14.900 --> 00:59:18.200 input now that it's been properly escaped, because, again, 00:59:18.200 --> 00:59:21.680 even though you might try mentally to pair this quote with this one, 00:59:21.680 --> 00:59:23.670 this one with this one, this one with this one, 00:59:23.670 --> 00:59:26.060 and so forth, that's not what the database does. 00:59:26.060 --> 00:59:29.900 Whenever it sees, in this case, two apostrophes back to back, 00:59:29.900 --> 00:59:32.880 they are treated as special escape sequences. 00:59:32.880 --> 00:59:37.040 And so the only quotes that ultimately are treated as lining up 00:59:37.040 --> 00:59:42.258 are the two around the username and the two around the entire password here. 00:59:42.258 --> 00:59:44.300 So the characters are still in there, but they've 00:59:44.300 --> 00:59:48.440 been escaped, sanitized, or scrubbed, so to speak, in such a way 00:59:48.440 --> 00:59:52.340 that now the database is smart enough not to mistake those for quotes 00:59:52.340 --> 00:59:55.520 that should be matched with ones that we might have otherwise 00:59:55.520 --> 00:59:56.750 written previously. 00:59:56.750 --> 01:00:01.430 Now, there is another class of attacks that similarly involve injection 01:00:01.430 --> 01:00:04.710 into your software, particularly command injection. 01:00:04.710 --> 01:00:07.580 Those of you familiar with a command-line interface 01:00:07.580 --> 01:00:09.920 in the context of a terminal window or the like 01:00:09.920 --> 01:00:12.560 might be familiar with how on a system you 01:00:12.560 --> 01:00:15.620 type commands as opposed to always using your mouse to point 01:00:15.620 --> 01:00:17.390 and click on menus and buttons.
01:00:17.390 --> 01:00:20.000 The problem with command injection is that it's all 01:00:20.000 --> 01:00:22.820 too easy in a lot of today's programming languages 01:00:22.820 --> 01:00:27.480 to write code that invokes commands on systems, 01:00:27.480 --> 01:00:31.880 whether it's to copy files, delete files, move files, execute 01:00:31.880 --> 01:00:33.320 other commands altogether. 01:00:33.320 --> 01:00:35.540 And that's because a lot of programming languages 01:00:35.540 --> 01:00:40.610 come with a function called system, which is literally 01:00:40.610 --> 01:00:45.020 a feature of some programming languages that allows you in your program 01:00:45.020 --> 01:00:48.170 to execute a command on the underlying system, a command 01:00:48.170 --> 01:00:49.820 on the underlying operating system. 01:00:49.820 --> 01:00:53.270 And that might be useful for you because in addition 01:00:53.270 --> 01:00:55.820 to writing your own code in some higher-level language, 01:00:55.820 --> 01:00:59.300 you can occasionally run a command on the system itself. 01:00:59.300 --> 01:01:02.060 But the problem is that if you, the programmer, 01:01:02.060 --> 01:01:07.070 somehow take user input and you just blindly pass that user's input 01:01:07.070 --> 01:01:10.550 to the command line, so to speak-- to the terminal window, 01:01:10.550 --> 01:01:12.350 to the underlying operating system-- 01:01:12.350 --> 01:01:16.580 that is yet another context in which potentially dangerous characters, 01:01:16.580 --> 01:01:21.500 like semicolons or the like, could accidentally finish your thought 01:01:21.500 --> 01:01:24.470 but then start a completely new one from the adversary 01:01:24.470 --> 01:01:27.860 so that they, too, on your system can not only 01:01:27.860 --> 01:01:30.320 delete things like data from your database, 01:01:30.320 --> 01:01:32.780 but even files from your file system.
01:01:32.780 --> 01:01:36.020 They could perhaps send email or spam or do anything 01:01:36.020 --> 01:01:39.830 in a command-line environment that you yourself could do on the same. 01:01:39.830 --> 01:01:42.080 In other programming languages, the same idea 01:01:42.080 --> 01:01:45.230 exists in the context of another function called eval, 01:01:45.230 --> 01:01:48.410 which evaluates whatever you pass to it. 01:01:48.410 --> 01:01:52.370 So there, too-- if you're in the habit of using system or eval, 01:01:52.370 --> 01:01:58.130 taking user input, and passing that user input as part of the input to system 01:01:58.130 --> 01:02:02.210 or eval without having sanitized it or scrubbed it or, more generally, 01:02:02.210 --> 01:02:04.970 "escaped" potentially dangerous characters, 01:02:04.970 --> 01:02:10.550 you're putting your entire system at risk and any and all software that's 01:02:10.550 --> 01:02:13.440 installed on or running on the same. 01:02:13.440 --> 01:02:14.930 So what's the solution here? 01:02:14.930 --> 01:02:17.240 In any of those programming languages to which 01:02:17.240 --> 01:02:19.850 I'm alluding that have functions like system or eval-- 01:02:19.850 --> 01:02:23.170 they almost always come with another function 01:02:23.170 --> 01:02:27.110 or built into these functions a way of escaping the user's input. 01:02:27.110 --> 01:02:30.610 So I would always take care to read the documentation if you yourself 01:02:30.610 --> 01:02:35.620 are or want to become a programmer, that whenever you take user input, 01:02:35.620 --> 01:02:38.560 you always figure out and think to yourself, wait a minute, 01:02:38.560 --> 01:02:41.770 how can I escape this properly so that I can't 01:02:41.770 --> 01:02:48.400 be tricked into executing some command, some SQL, or some HTML and JavaScript 01:02:48.400 --> 01:02:50.772 within my own software? 01:02:50.772 --> 01:02:51.980 All right, that's been a lot. 
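As one illustration of the escaping functions alluded to here, Python's standard library offers `shlex.quote` for shell strings and `subprocess.run` with an argument list that avoids the shell entirely. This is a sketch under assumptions: the "filename" input is made up, the unsafe string is built but never executed, and a small Python child process stands in for whatever command a real program would run.

```python
import shlex
import subprocess
import sys

# Hypothetical input where the adversary appends a second command.
user_input = "notes.txt; echo injected"

# Unsafe pattern, shown as a string only and never executed here:
# a shell would treat the semicolon as the start of a new command.
unsafe_command = f"ls {user_input}"

# If a shell string is unavoidable, quote the input so it stays one argument.
quoted_command = f"ls {shlex.quote(user_input)}"

# Better still, skip the shell: each list element is passed as one argument,
# so the semicolon has no special meaning.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.strip()

print(quoted_command)
print(echoed)
```

The child process receives the whole input, semicolon included, as a single literal argument rather than as two commands.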
01:02:51.980 --> 01:02:54.022 Let's go ahead here and take a five-minute break. 01:02:54.022 --> 01:02:56.800 And when we resume, we'll look at a whole other category 01:02:56.800 --> 01:03:00.000 of potential attacks on software. 01:03:00.000 --> 01:03:01.570 All right, we're back. 01:03:01.570 --> 01:03:05.040 Let's go ahead and return to this world of HTML on the web, 01:03:05.040 --> 01:03:08.430 if only because so much of today's software is actually web based. 01:03:08.430 --> 01:03:10.920 And indeed, even on your Macs or PCs or phones, 01:03:10.920 --> 01:03:13.690 what looks like a native application, so to speak, 01:03:13.690 --> 01:03:16.320 might actually still be implemented in HTML 01:03:16.320 --> 01:03:18.840 with that other language, JavaScript and CSS. 01:03:18.840 --> 01:03:21.030 Well, it turns out, in the context of a browser, 01:03:21.030 --> 01:03:23.652 there's very often a feature called developer tools. 01:03:23.652 --> 01:03:26.110 And indeed, if you've done any web development of your own, 01:03:26.110 --> 01:03:27.540 you might have played with this feature. 01:03:27.540 --> 01:03:30.748 And these developer tools, which might be called something slightly different 01:03:30.748 --> 01:03:33.750 across different browsers, allow you to poke around 01:03:33.750 --> 01:03:38.760 the HTML, the CSS, and the JavaScript that compose some web page either 01:03:38.760 --> 01:03:41.310 that you yourself have made or that someone else has made 01:03:41.310 --> 01:03:44.340 and you have downloaded or accessed via your own browser. 01:03:44.340 --> 01:03:49.020 Let's consider now, though, what you can do if you have access 01:03:49.020 --> 01:03:50.680 to these developer tools. 01:03:50.680 --> 01:03:53.490 So, for instance, here is some HTML using 01:03:53.490 --> 01:03:57.310 a tag called input, which would create a checkbox on a website. 
01:03:57.310 --> 01:04:01.000 We haven't seen this one yet, but it's similar in spirit to the paragraph tag 01:04:01.000 --> 01:04:05.320 and the anchor tag, in which case it is interpreted by the browser as meaning, 01:04:05.320 --> 01:04:07.208 hey, browser, here comes a checkbox. 01:04:07.208 --> 01:04:09.250 The only thing different that's worth noting here 01:04:09.250 --> 01:04:13.270 is that some HTML tags don't actually need a close tag, 01:04:13.270 --> 01:04:16.240 because whereas a paragraph starts somewhere 01:04:16.240 --> 01:04:19.450 and then ends somewhere else after some number of words, 01:04:19.450 --> 01:04:22.010 a checkbox is either there or it isn't there, 01:04:22.010 --> 01:04:25.480 so there's really no conceptual notion of it starting and stopping. 01:04:25.480 --> 01:04:29.380 So some HTML tags don't even need end tags or close tags. 01:04:29.380 --> 01:04:31.010 This, then, is one of them. 01:04:31.010 --> 01:04:35.350 This, then, is an input tag that gives us a type of input, namely, a checkbox. 01:04:35.350 --> 01:04:40.630 And another curiosity about this is that some HTML attributes don't need values. 01:04:40.630 --> 01:04:44.870 We saw the href attribute for the anchor tag earlier, and that, of course, 01:04:44.870 --> 01:04:45.550 had a value. 01:04:45.550 --> 01:04:48.280 In quotes was the URL that you want to link to. 01:04:48.280 --> 01:04:50.980 Here, we see that same paradigm-- type equals, quote unquote, 01:04:50.980 --> 01:04:55.090 checkbox to give us specifically a checkbox type of input. 01:04:55.090 --> 01:04:59.170 But you'll also notice here another attribute specifically for this input 01:04:59.170 --> 01:05:01.240 tag that's literally called disabled. 01:05:01.240 --> 01:05:04.570 And strictly speaking, you don't need to give it a value, 01:05:04.570 --> 01:05:06.590 because it's either there or it isn't.
01:05:06.590 --> 01:05:09.970 And if it is there, that just means that this checkbox is exactly 01:05:09.970 --> 01:05:13.720 that, disabled, which means you can see it, but it's not checked, 01:05:13.720 --> 01:05:15.550 and you can't actually check it. 01:05:15.550 --> 01:05:19.180 It's disabled and lightly grayed out, typically, on a browser. 01:05:19.180 --> 01:05:20.612 So why might this be? 01:05:20.612 --> 01:05:22.570 Well, maybe there's some feature in the browser 01:05:22.570 --> 01:05:25.150 that you don't want to give some users access to. 01:05:25.150 --> 01:05:28.240 Maybe based on who has logged in, they should or should not 01:05:28.240 --> 01:05:30.040 have access to some feature. 01:05:30.040 --> 01:05:34.480 The problem, though, with HTML and CSS and JavaScript 01:05:34.480 --> 01:05:38.410 or really anything that is web based or using those languages 01:05:38.410 --> 01:05:44.590 is that you're sending this HTML to the user's own device, 01:05:44.590 --> 01:05:48.490 to their browser on their phone or laptop or desktop, 01:05:48.490 --> 01:05:53.860 which means they can not only see this HTML code, but theoretically, 01:05:53.860 --> 01:05:54.820 they can edit it. 01:05:54.820 --> 01:05:58.660 They can't edit it on the server, because that's your own copy, assuming 01:05:58.660 --> 01:06:00.100 they can't hack into the server. 01:06:00.100 --> 01:06:02.900 But they can edit their own copy thereof. 01:06:02.900 --> 01:06:04.762 Now, usually, that's not such a big deal, 01:06:04.762 --> 01:06:06.470 because what's the worst that can happen? 01:06:06.470 --> 01:06:11.260 They can hack themselves by changing HTML on their own computer or phone. 01:06:11.260 --> 01:06:15.940 But it is problematic if you are using HTML or even 01:06:15.940 --> 01:06:21.040 JavaScript to try to prevent certain user interactions with your server. 
01:06:21.040 --> 01:06:24.070 So, for instance, if you simply don't want a user 01:06:24.070 --> 01:06:27.700 to be able to check this box so that when they submit a form, 01:06:27.700 --> 01:06:30.580 they're not agreeing to something, or they're not adding something 01:06:30.580 --> 01:06:33.820 to their shopping cart by using this checkbox, 01:06:33.820 --> 01:06:37.420 well, you might rely therefore on this HTML attribute, disabled. 01:06:37.420 --> 01:06:42.100 Just prevent them on the client, in the browser, from checking this box. 01:06:42.100 --> 01:06:44.770 But it turns out, with developer tools, which 01:06:44.770 --> 01:06:47.980 are accessible usually via a menu at the top of the screen 01:06:47.980 --> 01:06:50.770 or by right-clicking or Control-clicking on a web page 01:06:50.770 --> 01:06:55.450 and then selecting an option, a user with access to HTML 01:06:55.450 --> 01:06:59.290 can change any and all of that HTML on their own computer, which 01:06:59.290 --> 01:07:02.290 means they could just remove this disabled attribute 01:07:02.290 --> 01:07:05.020 in their own copy of your HTML on their computer 01:07:05.020 --> 01:07:09.250 and effectively enable that checkbox by getting rid of that attribute. 01:07:09.250 --> 01:07:10.580 Now, what does this mean? 01:07:10.580 --> 01:07:12.490 Well, there's no problem yet. 01:07:12.490 --> 01:07:16.690 But if they do now check that checkbox, and you didn't want them to be able to, 01:07:16.690 --> 01:07:19.480 and they submit the checkbox to the server, 01:07:19.480 --> 01:07:24.130 as by clicking a submit button in a web form, you on the server, 01:07:24.130 --> 01:07:27.970 if you're not paranoid enough, might just 01:07:27.970 --> 01:07:31.750 trust that if I see a checked box being submitted via a form, 01:07:31.750 --> 01:07:35.260 they must have been allowed to do it, so I will trust them, but no.
01:07:35.260 --> 01:07:39.440 Here, too, is an example of where you should never trust user's input, 01:07:39.440 --> 01:07:42.670 because if you're trying to disable them from doing something on the client, 01:07:42.670 --> 01:07:44.410 they don't have to respect that. 01:07:44.410 --> 01:07:48.730 They can override the HTML in their own browser, remove any such defenses, 01:07:48.730 --> 01:07:51.130 and then send the checkbox to you anyway. 01:07:51.130 --> 01:07:54.250 The takeaway here then is that you really should never 01:07:54.250 --> 01:07:57.280 rely on client-side validation alone. 01:07:57.280 --> 01:07:59.450 And this disabled attribute is just one more 01:07:59.450 --> 01:08:03.080 minor incarnation of that, where you're relying on the client 01:08:03.080 --> 01:08:07.400 to ensure that the checkbox is disabled, and its value 01:08:07.400 --> 01:08:09.350 can't be sent to the server. 01:08:09.350 --> 01:08:12.980 But client-side validation is really on the honor system 01:08:12.980 --> 01:08:15.950 only, because if someone knows how to use these developer tools 01:08:15.950 --> 01:08:20.149 and removes the disabled attribute, or if they know how to use developer tools 01:08:20.149 --> 01:08:25.220 and maybe disable JavaScript altogether for your website on their computer, 01:08:25.220 --> 01:08:29.689 any form of client-side validation in HTML or JavaScript 01:08:29.689 --> 01:08:33.380 that you wrote on the server but that your server sent to their browser 01:08:33.380 --> 01:08:36.200 and that's therefore executed by their browser 01:08:36.200 --> 01:08:39.270 is vulnerable to simply being turned off. 01:08:39.270 --> 01:08:42.200 So the catch is that even though client-side validation tends 01:08:42.200 --> 01:08:44.870 to be nice in terms of user experience-- the button 01:08:44.870 --> 01:08:47.930 is obviously disabled, so I should not be able to click on it. 
01:08:47.930 --> 01:08:50.750 Or my email address is improperly formatted, 01:08:50.750 --> 01:08:52.819 so I should not be allowed to submit the form. 01:08:52.819 --> 01:08:56.689 Any forms of client-side validation tend to give the user immediate 01:08:56.689 --> 01:08:59.420 and often very useful visual feedback. 01:08:59.420 --> 01:09:02.510 But if it's not accompanied, this client-side validation, 01:09:02.510 --> 01:09:06.140 by server-side validation, your server software 01:09:06.140 --> 01:09:09.180 is still vulnerable to attack in some way. 01:09:09.180 --> 01:09:12.979 So what else might-- what other form might this take? 01:09:12.979 --> 01:09:16.370 Well, here's another example of an HTML input. 01:09:16.370 --> 01:09:19.970 This time, it's of type text, which means it's a text box, or text field. 01:09:19.970 --> 01:09:23.930 Suppose that you really want them to provide that value. 01:09:23.930 --> 01:09:26.210 Maybe this text box represents the user's name 01:09:26.210 --> 01:09:29.340 or their email address or their password or something like that. 01:09:29.340 --> 01:09:32.300 And so if you know a little bit of HTML, you 01:09:32.300 --> 01:09:35.029 know that there's not only a disabled attribute available to you, 01:09:35.029 --> 01:09:36.547 but also a required attribute. 01:09:36.547 --> 01:09:38.630 And it doesn't have an equal sign or a quote mark. 01:09:38.630 --> 01:09:39.529 You don't need that. 01:09:39.529 --> 01:09:42.260 It suffices just to say this input is required.
01:09:42.260 --> 01:09:45.670 But the catch here is, too, if a user doesn't want to give you a name, 01:09:45.670 --> 01:09:47.420 doesn't want to give you an email address, 01:09:47.420 --> 01:09:50.060 doesn't want to give you a password or some other value, 01:09:50.060 --> 01:09:52.279 well, they can use these so-called developer tools, 01:09:52.279 --> 01:09:55.700 click a button on their browser, remove the required attribute, 01:09:55.700 --> 01:09:59.447 and, voila, now they do not need to submit that value. 01:09:59.447 --> 01:10:01.280 Now, that in and of itself is not a problem, 01:10:01.280 --> 01:10:03.322 because again, they're only "hacking themselves." 01:10:03.322 --> 01:10:07.310 But if they're then allowed to submit this form to your server, 01:10:07.310 --> 01:10:12.680 and your server just trusts or assumes that every user will send you 01:10:12.680 --> 01:10:16.010 a username, an email address, a password, or the like, 01:10:16.010 --> 01:10:21.050 that's where things can break, if you're trusting client-side validation alone 01:10:21.050 --> 01:10:25.500 to ensure that the user's input is as expected. 01:10:25.500 --> 01:10:27.950 So if they're allowed to get rid of something 01:10:27.950 --> 01:10:31.430 as simple as this required attribute, effectively making the input like this, 01:10:31.430 --> 01:10:33.420 you might be missing some value. 01:10:33.420 --> 01:10:38.270 And so you must therefore use server-side validation again. 01:10:38.270 --> 01:10:40.133 And we won't get into the particulars of how 01:10:40.133 --> 01:10:42.050 you do this, because it will completely depend 01:10:42.050 --> 01:10:45.380 on the type of server software you're using, the programming language 01:10:45.380 --> 01:10:46.160 that you're using. 
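The particulars do depend on the server software, but the principle can be sketched in a few lines. The Python below is a minimal illustration, assuming the server has already parsed the submitted form into a dictionary; the field names `email` and `agree`, the function name, and the error messages are all hypothetical. It re-checks on the server both rules that the `required` and `disabled` attributes only suggest on the client. It relies on the fact that browsers leave disabled and unchecked checkboxes out of a submission, so the mere presence of the checkbox's name means someone checked it.

```python
def validate_form(form: dict, may_use_checkbox: bool) -> list:
    """Return a list of validation errors; an empty list means the input is acceptable.

    `form` maps submitted field names to string values. `may_use_checkbox`
    is whatever the server itself decided about this user, never something
    read from the request.
    """
    errors = []

    # Server-side counterpart of the HTML `required` attribute.
    email = form.get("email", "").strip()
    if not email:
        errors.append("email is required")
    elif "@" not in email:
        # Deliberately crude format check, only for illustration.
        errors.append("email is malformed")

    # Server-side counterpart of the HTML `disabled` attribute: if the
    # checkbox arrived at all, it was checked, so confirm that is allowed.
    if "agree" in form and not may_use_checkbox:
        errors.append("checkbox not permitted for this user")

    return errors


if __name__ == "__main__":
    # A user who removed `required` and `disabled` via developer tools:
    print(validate_form({"agree": "on"}, may_use_checkbox=False))
```

The client-side attributes can stay in the HTML for the sake of immediate feedback; the function above is what actually decides whether the input is accepted.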
01:10:46.160 --> 01:10:49.760 And so it's really the principle today that's important that client-side 01:10:49.760 --> 01:10:54.470 validation, whereby the browser or the user's own copy of your software tries 01:10:54.470 --> 01:10:58.910 to preempt mistakes and require or disable certain inputs-- 01:10:58.910 --> 01:10:59.720 that's fine. 01:10:59.720 --> 01:11:03.740 That tends to give good, immediate, useful user feedback. 01:11:03.740 --> 01:11:09.330 But it must still be always accompanied by server-side validation 01:11:09.330 --> 01:11:15.110 so that you have the final say over what the user input looks like 01:11:15.110 --> 01:11:19.100 and if and how it's actually stored into your system. 01:11:19.100 --> 01:11:21.920 Again, the particulars of how you do one or the other 01:11:21.920 --> 01:11:26.570 is the topic for an actual programming class or a class on web development 01:11:26.570 --> 01:11:27.450 specifically. 01:11:27.450 --> 01:11:29.030 But for now, it's this principle. 01:11:29.030 --> 01:11:31.490 Just because you have client-side validation 01:11:31.490 --> 01:11:34.928 doesn't mean you shouldn't also have server-side validation. 01:11:34.928 --> 01:11:37.220 And, in fact, if you've got to choose one or the other, 01:11:37.220 --> 01:11:39.740 always choose server-side validation. 01:11:39.740 --> 01:11:43.160 Client-side validation is really just icing on the cake. 01:11:43.160 --> 01:11:47.480 It adds to the experience, but it's not the prerequisite one. 01:11:47.480 --> 01:11:51.320 Questions, then, on these so-called developer tools 01:11:51.320 --> 01:11:57.410 or these kinds of threats when it comes to validating the user's input? 01:11:57.410 --> 01:12:00.020 AUDIENCE: Yeah, so my question is more related to SQL 01:12:00.020 --> 01:12:01.580 and command injections for a second. 
01:12:01.580 --> 01:12:07.310 Isn't it really easy to just not run the user's commands with admin or root 01:12:07.310 --> 01:12:10.270 privileges to delete certain records from a database or something? 01:12:10.270 --> 01:12:12.020 DAVID J. MALAN: Yes, another defense would 01:12:12.020 --> 01:12:14.420 be to make sure that whatever username you're 01:12:14.420 --> 01:12:18.920 using to execute these SQL commands does not have the ability to delete anything 01:12:18.920 --> 01:12:19.620 at all. 01:12:19.620 --> 01:12:23.480 However, some threats only need select access. 01:12:23.480 --> 01:12:26.930 So the second example I showed you, whereby we tricked the database 01:12:26.930 --> 01:12:30.440 into just selecting * from users WHERE '1' = '1'-- 01:12:30.440 --> 01:12:33.890 that was an example of one where, permission-wise, it probably 01:12:33.890 --> 01:12:37.040 would work, and it might allow the adversary still to log in. 01:12:37.040 --> 01:12:41.460 But your suggestion is a good one as an additional defense, not an alternative. 01:12:41.460 --> 01:12:45.650 Let's consider another class of attack to which your code might be vulnerable 01:12:45.650 --> 01:12:48.860 if it's on a server using the same language, HTML-- 01:12:48.860 --> 01:12:52.190 namely, Cross-Site Request Forgeries, or CSRFs. 01:12:52.190 --> 01:12:54.980 So this one's more of a mouthful, but it too 01:12:54.980 --> 01:12:59.700 relates to a mistake you might otherwise make when writing software on a server 01:12:59.700 --> 01:13:02.190 if you're not already familiar with this kind of threat. 01:13:02.190 --> 01:13:07.350 So first, HTTP, recall, is this protocol, this convention by which 01:13:07.350 --> 01:13:09.240 web browsers and servers communicate. 01:13:09.240 --> 01:13:12.780 Well, it turns out there's different ways that browsers can get information 01:13:12.780 --> 01:13:13.720 to a server. 01:13:13.720 --> 01:13:17.070 And one of those ways is literally called GET by convention.
01:13:17.070 --> 01:13:19.470 In other words, inside of that virtual envelope 01:13:19.470 --> 01:13:24.360 is typically literally this word, GET, followed by the file name 01:13:24.360 --> 01:13:26.370 that the browser wants to get from a server. 01:13:26.370 --> 01:13:28.560 But more importantly for our purposes is that 01:13:28.560 --> 01:13:31.890 whenever using this GET method to send information 01:13:31.890 --> 01:13:36.510 to a server, all of the information that you want to get is embedded in the URL 01:13:36.510 --> 01:13:37.510 itself. 01:13:37.510 --> 01:13:38.530 So what does that mean? 01:13:38.530 --> 01:13:41.310 Well, consider, for instance, this sample link in HTML. 01:13:41.310 --> 01:13:43.020 Here's my anchor tag beginning. 01:13:43.020 --> 01:13:44.700 Here's my anchor tag ending. 01:13:44.700 --> 01:13:47.490 Notice here is the text "Buy Now." 01:13:47.490 --> 01:13:50.190 Well, let's suppose that you're in the US here. 01:13:50.190 --> 01:13:52.380 And on amazon.com in the US, there's actually 01:13:52.380 --> 01:13:54.780 this feature where you can "buy now." 01:13:54.780 --> 01:13:58.290 That is to say, when you visit the page of a product on Amazon's website 01:13:58.290 --> 01:14:02.080 in the US, if not beyond, you can skip the steps 01:14:02.080 --> 01:14:04.300 of having to add an item to your shopping cart 01:14:04.300 --> 01:14:06.733 and check out and choose your payment method 01:14:06.733 --> 01:14:09.400 and then, some number of clicks later, actually buy the product. 01:14:09.400 --> 01:14:12.370 Rather, if you configure your account in advance, 01:14:12.370 --> 01:14:15.940 you can go to any product's page, literally click a link or a button 01:14:15.940 --> 01:14:18.100 that says "Buy Now," and that's it. 01:14:18.100 --> 01:14:20.140 In a single click, for better or for worse, 01:14:20.140 --> 01:14:22.610 that product will be shipped to your home. 01:14:22.610 --> 01:14:24.537 So how might Amazon be implementing this?
01:14:24.537 --> 01:14:26.620 Well, they might indeed be using a link like this, 01:14:26.620 --> 01:14:38.560 the href value of which is a URL, like https://www.amazon.com/dp/B07XLQ2FSK. 01:14:38.560 --> 01:14:44.110 In other words, that seems to be enough information in the URL alone 01:14:44.110 --> 01:14:47.560 via which to buy the product whose unique identifier is apparently 01:14:47.560 --> 01:14:49.317 that string of text at the end. 01:14:49.317 --> 01:14:51.400 Now, that's all fine and good, and that actually 01:14:51.400 --> 01:14:56.320 seems very user friendly, because with a single click on Buy Now, 01:14:56.320 --> 01:14:58.270 I can indeed buy that product. 01:14:58.270 --> 01:15:02.140 But the danger here is that this link might not just be 01:15:02.140 --> 01:15:05.890 on amazon.com but also in some adversary's website 01:15:05.890 --> 01:15:08.530 or maybe in an email that is sent to you. 01:15:08.530 --> 01:15:14.140 If it's that easy to buy something now, you could trick someone, potentially, 01:15:14.140 --> 01:15:17.830 into buying things that they didn't actually intend in this case, 01:15:17.830 --> 01:15:21.340 or doing anything else on a web server into which they're already 01:15:21.340 --> 01:15:26.900 logged in if GET is the method being used to get something from that server. 01:15:26.900 --> 01:15:31.220 Well, why is this, exactly, that this URL is problematic? 01:15:31.220 --> 01:15:34.900 Well, consider, for instance, the following HTML instead. 01:15:34.900 --> 01:15:37.630 Suppose that you visit an adversary site who 01:15:37.630 --> 01:15:42.430 just likes to create havoc in the world, and that adversary site doesn't even 01:15:42.430 --> 01:15:45.670 have an anchor tag or a link that they want to trick you into clicking. 01:15:45.670 --> 01:15:48.820 So it's not even as deliberate as a phishing attack 01:15:48.820 --> 01:15:50.470 that they want you to click some link.
01:15:50.470 --> 01:15:55.420 Suppose they're using something like an image tag, which it turns out in HTML, 01:15:55.420 --> 01:15:58.630 img for short, is how you embed an image on a web page. 01:15:58.630 --> 01:16:00.280 And how do you specify what image? 01:16:00.280 --> 01:16:04.390 You specify the source thereof, src for short, the value of which 01:16:04.390 --> 01:16:08.470 can be the URL of or the name of the image you want to display. 01:16:08.470 --> 01:16:13.630 But strictly speaking, that URL doesn't have to actually lead to an image. 01:16:13.630 --> 01:16:17.080 It could actually lead to an Amazon product page. 01:16:17.080 --> 01:16:19.540 But the way images work on web pages, recall, 01:16:19.540 --> 01:16:23.830 is that, typically, when you visit a web page, the images automatically load. 01:16:23.830 --> 01:16:27.280 You don't have to click or do anything, typically, for the images 01:16:27.280 --> 01:16:29.277 to appear on a web page-- maybe in emails, 01:16:29.277 --> 01:16:30.860 and that's an anti-phishing mechanism. 01:16:30.860 --> 01:16:34.000 But in web pages, you typically don't have to click on anything 01:16:34.000 --> 01:16:35.050 to see the images. 01:16:35.050 --> 01:16:37.840 That is to say, the value of the source attributes 01:16:37.840 --> 01:16:41.540 are just automatically downloaded and displayed to the user. 01:16:41.540 --> 01:16:44.740 Now, this, in fairness, is not an image. 01:16:44.740 --> 01:16:48.680 But the browser doesn't necessarily know that from the get-go. 01:16:48.680 --> 01:16:53.680 And so if this HTML is in some adversary's website that you've somehow 01:16:53.680 --> 01:16:56.410 been tricked into visiting, and you don't even click a link-- 01:16:56.410 --> 01:16:58.180 you just visit that web page-- 01:16:58.180 --> 01:17:01.442 that means this image tag is going to try to download this source. 
01:17:01.442 --> 01:17:03.400 And even though it's not going to get an image, 01:17:03.400 --> 01:17:06.770 it is going to buy that product for you. 01:17:06.770 --> 01:17:07.480 Why? 01:17:07.480 --> 01:17:10.210 Because if you're logged into your Amazon account, 01:17:10.210 --> 01:17:15.130 even though this is in another tab, it's as though your browser requested 01:17:15.130 --> 01:17:18.670 that URL via that "GET" method because all 01:17:18.670 --> 01:17:24.100 of the relevant information for buying that product is in the URL alone. 01:17:24.100 --> 01:17:27.460 So it turns out that using GET is actually 01:17:27.460 --> 01:17:32.260 not a good thing when it comes to changing state on the server. 01:17:32.260 --> 01:17:35.590 To get technical, the GET method is meant to be "safe," 01:17:35.590 --> 01:17:41.660 safe whereby it does not change any state or values on the server. 01:17:41.660 --> 01:17:44.290 So it would actually be incorrect, or definitely 01:17:44.290 --> 01:17:48.160 bad practice by Amazon, if they were implementing their Buy Now 01:17:48.160 --> 01:17:53.680 button simply with a simple URL and a simple GET request. 01:17:53.680 --> 01:17:58.360 It should not be that easy to buy things on the internet, let alone change state 01:17:58.360 --> 01:17:59.890 on a server in other ways. 01:17:59.890 --> 01:18:02.520 So there are, thankfully, other methods, but even 01:18:02.520 --> 01:18:05.070 these are potentially vulnerable to this kind of attack. 01:18:05.070 --> 01:18:06.960 There's a POST method, which is typically 01:18:06.960 --> 01:18:11.400 used by browsers when you want to post your credit card information 01:18:11.400 --> 01:18:12.900 or your password to a server. 01:18:12.900 --> 01:18:15.900 You don't want your credit card-- you don't want your password typically 01:18:15.900 --> 01:18:19.980 ending up in the URL of your browser for privacy's sake. 
01:18:19.980 --> 01:18:22.500 So rather, POST will kind of hide it more deeply 01:18:22.500 --> 01:18:24.930 in that virtual envelope to which we keep alluding. 01:18:24.930 --> 01:18:28.470 POST might also be used if you want to upload images or video files 01:18:28.470 --> 01:18:31.920 to a server because those don't really fit in URLs, it would seem. 01:18:31.920 --> 01:18:37.230 And so POST is an alternative that is meant to change state on the server, 01:18:37.230 --> 01:18:39.580 for instance, buy products for you. 01:18:39.580 --> 01:18:41.430 But even this can perhaps be abused. 01:18:41.430 --> 01:18:43.020 Well, let's take a look how. 01:18:43.020 --> 01:18:46.500 Here is now some HTML-- and it's more HTML than we've seen thus far, 01:18:46.500 --> 01:18:49.200 but we'll wrap our minds around each piece of it-- 01:18:49.200 --> 01:18:53.370 that represents an alternative implementation 01:18:53.370 --> 01:18:56.280 of the Buy Now button on Amazon that's no longer 01:18:56.280 --> 01:19:01.200 a simple anchor tag with everything that's needed in the URL. 01:19:01.200 --> 01:19:03.510 This is more of a traditional web form. 01:19:03.510 --> 01:19:06.510 And it's fine if the form is super short and only has a single button. 01:19:06.510 --> 01:19:08.740 It doesn't need text fields or anything like that. 01:19:08.740 --> 01:19:10.150 But there's a lot going on here. 01:19:10.150 --> 01:19:10.990 So let's see. 01:19:10.990 --> 01:19:13.800 So here's the form tag, the opening tag. 01:19:13.800 --> 01:19:14.940 Here's the close tag. 01:19:14.940 --> 01:19:17.520 So everything in between must be implementing this form. 01:19:17.520 --> 01:19:20.190 The action of this form, I claim, is going 01:19:20.190 --> 01:19:24.480 to be to submit the information to this amazon.com URL here. 01:19:24.480 --> 01:19:27.450 But the method that we're going to use is explicitly POST. 
01:19:27.450 --> 01:19:30.420 So it turns out, in an HTML form, if you don't specify a method, 01:19:30.420 --> 01:19:32.940 it will use GET by default. So explicitly, I'm 01:19:32.940 --> 01:19:36.720 at least using POST because I don't want everything to be in the URL alone. 01:19:36.720 --> 01:19:39.210 Well, I've got two inputs here, one of which 01:19:39.210 --> 01:19:42.700 is of type hidden. Well, what's going on here? 01:19:42.700 --> 01:19:45.000 Well, it turns out that, in HTML forms, you 01:19:45.000 --> 01:19:48.370 can create key-value pairs to send input to a server. 01:19:48.370 --> 01:19:53.670 So if you recall previously, I used the "dp" part of the URL 01:19:53.670 --> 01:19:56.850 as separating amazon.com from the product ID. 01:19:56.850 --> 01:19:59.850 And here-- and now I'm making this up for the sake of discussion-- 01:19:59.850 --> 01:20:05.280 I'm supposing that Amazon supports a web form name called 01:20:05.280 --> 01:20:10.230 dp whose type is hidden because the user doesn't need to see this, 01:20:10.230 --> 01:20:12.970 but the value is that same product ID. 01:20:12.970 --> 01:20:16.920 So this is an alternative to embedding that product ID in the URL. 01:20:16.920 --> 01:20:23.428 Instead, I'm saying there's an HTTP parameter called dp, the value of which 01:20:23.428 --> 01:20:23.970 will be this. 01:20:23.970 --> 01:20:25.678 But it's hidden, so the user doesn't even 01:20:25.678 --> 01:20:29.610 see it, which is fine, because the whole point is a nice, simple Buy Now button. 01:20:29.610 --> 01:20:30.750 How do we get that button? 01:20:30.750 --> 01:20:33.328 We use a button tag in HTML, the type of which 01:20:33.328 --> 01:20:36.120 is just submit, because its purpose in life is to submit this form. 01:20:36.120 --> 01:20:40.290 And the text that the user sees for this button is indeed "Buy Now." 01:20:40.290 --> 01:20:41.340 So what am I doing?
01:20:41.340 --> 01:20:43.620 This will make more sense, admittedly, to those of you 01:20:43.620 --> 01:20:45.787 who've already studied a bit of web development, who 01:20:45.787 --> 01:20:47.130 have written HTML yourself. 01:20:47.130 --> 01:20:51.900 But I'm essentially making it harder for an adversary 01:20:51.900 --> 01:20:56.320 to automate an attack on a user's Amazon account. 01:20:56.320 --> 01:20:56.880 Why? 01:20:56.880 --> 01:21:01.230 Because I am not just using a link anymore 01:21:01.230 --> 01:21:06.630 that the user might click or a URL that could be subtly hidden in an image tag. 01:21:06.630 --> 01:21:08.910 Now I have an actual web form. 01:21:08.910 --> 01:21:12.660 And at least based on my naive understanding of HTML 01:21:12.660 --> 01:21:15.510 at the moment in this story, this would seem 01:21:15.510 --> 01:21:18.330 to require that a human click an actual button. 01:21:18.330 --> 01:21:22.120 Like, I cannot use this as the source of an image. 01:21:22.120 --> 01:21:23.010 It's not a URL. 01:21:23.010 --> 01:21:24.670 It's all of this complexity. 01:21:24.670 --> 01:21:27.120 So if you're familiar with GET versus POST, 01:21:27.120 --> 01:21:29.610 you might be inclined to think that, OK, POST surely 01:21:29.610 --> 01:21:34.260 solves the problem by using this web form because you make sure in this way 01:21:34.260 --> 01:21:37.920 that someone clicks the button before they can buy anything. 01:21:37.920 --> 01:21:41.190 Now, why is that indeed naive? 01:21:41.190 --> 01:21:44.460 Well, it turns out using not just HTML but this language 01:21:44.460 --> 01:21:48.150 we've seen a little bit of today, namely, JavaScript, can 01:21:48.150 --> 01:21:52.210 be used to automate the process of submitting a form. 
01:21:52.210 --> 01:21:56.790 So if an adversary now has this HTML in their website, 01:21:56.790 --> 01:21:59.010 they don't have to wait and hope that someone 01:21:59.010 --> 01:22:03.060 like you or me is going to come along and click the button explicitly, 01:22:03.060 --> 01:22:05.700 because that would be a little weird to click a button thinking 01:22:05.700 --> 01:22:08.880 you're going to buy something now, but it's not on the actual amazon.com. 01:22:08.880 --> 01:22:10.350 That alone doesn't matter. 01:22:10.350 --> 01:22:14.190 If the adversary can just trick you into visiting their website, 01:22:14.190 --> 01:22:20.250 and their website contains this HTML and this additional JavaScript, 01:22:20.250 --> 01:22:23.850 they can immediately submit this form for you 01:22:23.850 --> 01:22:26.820 to Amazon without you clicking a thing. 01:22:26.820 --> 01:22:27.360 Why? 01:22:27.360 --> 01:22:30.880 Well, inside of this script tag that I've added down below, 01:22:30.880 --> 01:22:33.810 I've simply said, document.forms[0]. 01:22:33.810 --> 01:22:36.600 So this means get me the first form on the page-- 01:22:36.600 --> 01:22:38.730 and I'm presuming there's only one in this story-- 01:22:38.730 --> 01:22:40.240 and then submit it. 01:22:40.240 --> 01:22:43.170 So this is to say, in JavaScript, not only can you 01:22:43.170 --> 01:22:46.620 do things like trigger alerts on the screen, those dialog windows. 01:22:46.620 --> 01:22:51.270 You can similarly, through code, automatically submit forms. 01:22:51.270 --> 01:22:53.460 So it doesn't matter that you're using POST. 01:22:53.460 --> 01:22:56.850 It doesn't matter that you have an actual button that must be clicked. 01:22:56.850 --> 01:22:59.100 It doesn't have to be a human that clicks that button. 01:22:59.100 --> 01:23:03.180 It can be their browser automatically executing this JavaScript code 01:23:03.180 --> 01:23:07.600 in the adversary's website that just submits that form for them. 
01:23:07.600 --> 01:23:12.510 So this is the essence now of a cross-site request forgery. 01:23:12.510 --> 01:23:16.080 If HTML like this exists in the adversary's 01:23:16.080 --> 01:23:19.740 website, some other website, you can nonetheless 01:23:19.740 --> 01:23:24.600 trick users into executing operations across websites, 01:23:24.600 --> 01:23:28.170 on amazon.com in this case, even though the users themselves 01:23:28.170 --> 01:23:29.970 are not on amazon.com. 01:23:29.970 --> 01:23:33.450 So that's the cross-site aspect of these attacks. 01:23:33.450 --> 01:23:36.960 And it's a request forgery in the sense that it's 01:23:36.960 --> 01:23:40.380 sending all of the right information, but it's forged by the adversary. 01:23:40.380 --> 01:23:44.190 It's not coming from the amazon.com developers themselves. 01:23:44.190 --> 01:23:48.300 But it is this simple, because if Amazon does not defend against this attack, 01:23:48.300 --> 01:23:51.390 there is nothing stopping you or me or any adversary 01:23:51.390 --> 01:23:54.390 from including code like this on our websites, 01:23:54.390 --> 01:23:57.060 somehow tricking users into visiting our websites, 01:23:57.060 --> 01:24:01.860 and, boom, having products sent to them automatically-- assuming they have 01:24:01.860 --> 01:24:06.330 an amazon.com account, and they're already logged into it in another tab, 01:24:06.330 --> 01:24:10.210 or at least earlier in the day, for instance. 01:24:10.210 --> 01:24:14.040 All right, any questions now on this particular attack, 01:24:14.040 --> 01:24:17.640 these cross-site request forgeries, whether implemented 01:24:17.640 --> 01:24:21.390 using GET with simple URLs or even implemented 01:24:21.390 --> 01:24:24.660 with POST using actual forms? 01:24:24.660 --> 01:24:28.410 AUDIENCE: How can AI model and quantum computing change the way 01:24:28.410 --> 01:24:31.043 that we look at cyber security? 01:24:31.043 --> 01:24:32.460 DAVID J. MALAN: Quantum computing? 
01:24:32.460 --> 01:24:34.860 Let me address that another time because I daresay that's 01:24:34.860 --> 01:24:36.840 a bit far from today's goals. 01:24:36.840 --> 01:24:42.300 But quantum computing is bad if the bad guys have it and you and I don't. 01:24:42.300 --> 01:24:43.590 I'll put it that way. 01:24:43.590 --> 01:24:46.080 All right, so how can we defend against this threat 01:24:46.080 --> 01:24:49.860 even when there's JavaScript code automatically inducing 01:24:49.860 --> 01:24:52.590 submission of these forms, which have enough information in them 01:24:52.590 --> 01:24:54.870 in order to buy something on our behalf? 01:24:54.870 --> 01:25:00.120 Well, it turns out that we could include something like a special token. 01:25:00.120 --> 01:25:04.380 And it turns out a common way to address this problem is by having the server 01:25:04.380 --> 01:25:07.230 not just output a simple HTML form but to output 01:25:07.230 --> 01:25:12.980 an HTML form that additionally has another value, often hidden as well. 01:25:12.980 --> 01:25:17.200 And by convention, in some worlds, it's called the CSRF token, which is just 01:25:17.200 --> 01:25:19.370 a fancy way of saying an extra value. 01:25:19.370 --> 01:25:21.820 But its value is typically meant to be random. 01:25:21.820 --> 01:25:25.600 And I've chosen something fairly pronounceable here, "1234abcd," 01:25:25.600 --> 01:25:29.225 but assume that that value is randomly generated by the server. 01:25:29.225 --> 01:25:31.850 And it might have a bunch of numbers, a bunch of letters in it. 01:25:31.850 --> 01:25:35.050 But the point is that it's randomly generated by the server. 01:25:35.050 --> 01:25:36.920 Now, why is this important? 
01:25:36.920 --> 01:25:42.400 The implication of this mechanism, by having a web server output not 01:25:42.400 --> 01:25:48.430 only the product ID and also the button inside of a form 01:25:48.430 --> 01:25:51.190 that they might be using to implement this Buy Now feature-- 01:25:51.190 --> 01:25:53.890 the point is that the server should also be generating 01:25:53.890 --> 01:25:58.210 some secret, randomly generated piece of information in that server 01:25:58.210 --> 01:26:00.440 as well, in that HTML as well. 01:26:00.440 --> 01:26:03.190 And the server should remember that value 01:26:03.190 --> 01:26:06.520 as by using its own database or some other mechanism. 01:26:06.520 --> 01:26:11.440 The point here, though, is that only the server, amazon.com in this case, 01:26:11.440 --> 01:26:15.400 knows what that value should be for you, a specific user. 01:26:15.400 --> 01:26:20.200 And so an adversary, even if they trick you into visiting their website, 01:26:20.200 --> 01:26:22.910 where they have HTML that looks quite like this-- 01:26:22.910 --> 01:26:25.750 the adversary, unless they've hacked amazon.com, 01:26:25.750 --> 01:26:27.610 which is not part of this story-- 01:26:27.610 --> 01:26:32.740 they would have no idea what this random value is that amazon.com 01:26:32.740 --> 01:26:35.628 is using for your Buy Now buttons, because again, 01:26:35.628 --> 01:26:37.420 if it's just an adversary on the internet-- 01:26:37.420 --> 01:26:38.980 they haven't taken over amazon.com. 01:26:38.980 --> 01:26:41.680 They haven't taken over your computer itself. 01:26:41.680 --> 01:26:46.300 They are just trying to trick you into visiting a web page of their own 01:26:46.300 --> 01:26:47.740 that has HTML like this. 01:26:47.740 --> 01:26:49.820 They won't know what value to put there. 
01:26:49.820 --> 01:26:51.850 Now, they can try to guess, and maybe they 01:26:51.850 --> 01:26:57.040 could guess "1234abcd," but assuming it's more random than that, the odds 01:26:57.040 --> 01:27:01.660 that that adversary guesses your CSRF token value is 01:27:01.660 --> 01:27:04.960 just so small and low probability that it's just not 01:27:04.960 --> 01:27:07.250 going to happen realistically. 01:27:07.250 --> 01:27:11.380 So now, even if you visit a web page that contains this HTML, even if it 01:27:11.380 --> 01:27:13.930 has some JavaScript that automatically submits the form, 01:27:13.930 --> 01:27:17.410 because the adversary doesn't know the value of this CSRF token, 01:27:17.410 --> 01:27:21.400 amazon.com can just ignore the request to buy that product now. 01:27:21.400 --> 01:27:23.320 And they can throw up an error message or say 01:27:23.320 --> 01:27:25.040 something went wrong or the like. 01:27:25.040 --> 01:27:28.120 But the point is that only the real amazon.com 01:27:28.120 --> 01:27:33.040 should be able to generate and remember this CSRF token value. 01:27:33.040 --> 01:27:36.310 And so therefore, they can validate server-side 01:27:36.310 --> 01:27:40.120 that it's indeed you who intends to buy something now. 01:27:40.120 --> 01:27:43.580 The adversary-- if they put a blank value there or any other value, 01:27:43.580 --> 01:27:45.880 it's not going to be validated server-side 01:27:45.880 --> 01:27:47.650 because the server realizes, uh-uh. 01:27:47.650 --> 01:27:49.600 That's not the value I'm using for David. 01:27:49.600 --> 01:27:52.310 That's not the value I'm using for you. 01:27:52.310 --> 01:27:53.920 So this is a very common technique. 01:27:53.920 --> 01:27:56.110 It does require a bit more complexity on the server. 
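The token mechanism just described can be sketched roughly as follows. This is a minimal illustration, not how amazon.com or any particular library actually implements it: the in-memory `csrf_tokens` dictionary, the session IDs, the `/buy` action, and the field names `id` and `csrf_token` are all assumptions made for the example.

```python
import hmac
import html
import secrets

# Hypothetical in-memory store mapping a session ID to its CSRF token.
# A real server would more likely use its session store or a database.
csrf_tokens = {}


def issue_csrf_token(session_id):
    """Generate a random token, remember it for this session, and return it."""
    token = secrets.token_hex(16)  # 128 random bits, written as 32 hex characters
    csrf_tokens[session_id] = token
    return token


def render_buy_form(session_id, product_id):
    """Return a Buy Now form that carries the token in a hidden field."""
    token = issue_csrf_token(session_id)
    safe_id = html.escape(str(product_id), quote=True)
    return (
        '<form action="/buy" method="post">'
        f'<input name="id" type="hidden" value="{safe_id}">'
        f'<input name="csrf_token" type="hidden" value="{token}">'
        '<button type="submit">Buy Now</button>'
        '</form>'
    )


def is_valid_request(session_id, submitted_token):
    """Accept the request only if the submitted token matches the stored one."""
    expected = csrf_tokens.get(session_id)
    if expected is None or not submitted_token:
        return False
    # compare_digest compares in constant time to avoid leaking partial matches.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```

A forged form on the adversary's website would have to guess all 128 random bits, so a blank value or a guess like "1234abcd" is rejected, while the form the server itself rendered for that session is accepted.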
01:27:56.110 --> 01:27:58.360 Very often, programming languages, like Python, 01:27:58.360 --> 01:28:01.170 will come with libraries, third-party libraries or code 01:28:01.170 --> 01:28:02.920 that other people have written, that allow 01:28:02.920 --> 01:28:06.220 you to add this functionality to your own software. 01:28:06.220 --> 01:28:08.530 But you have to know that the threat exists, 01:28:08.530 --> 01:28:11.860 and you have to look for a solution there too 01:28:11.860 --> 01:28:16.270 or implement it yourself if need be, but almost always is a library 01:28:16.270 --> 01:28:17.630 the answer to this problem. 01:28:17.630 --> 01:28:20.500 There's another way you can solve this same problem, which doesn't 01:28:20.500 --> 01:28:22.570 involve outputting any HTML at all. 01:28:22.570 --> 01:28:25.420 It is also possible to send these kinds of tokens 01:28:25.420 --> 01:28:28.840 as HTTP headers as well, as might commonly 01:28:28.840 --> 01:28:32.200 be the case when a website is very heavily using JavaScript 01:28:32.200 --> 01:28:36.580 and is using JavaScript to talk directly to a server without even any HTML. 01:28:36.580 --> 01:28:41.300 Well, the same values can be sent via this other mechanism as well. 01:28:41.300 --> 01:28:45.670 So if you're interested in these kinds of web-centric attacks especially, 01:28:45.670 --> 01:28:49.750 you might find it interesting to explore the Open Worldwide Application Security 01:28:49.750 --> 01:28:54.220 Project, which has documentation of, discussion of, recommendations 01:28:54.220 --> 01:28:58.148 for all of these kinds of web-centric attacks and more. 01:28:58.148 --> 01:29:00.440 For now, though, let's go ahead and take a short break. 01:29:00.440 --> 01:29:02.380 And when we come back, we'll look at problems 01:29:02.380 --> 01:29:05.650 that go beyond the world of the web, specifically to software 01:29:05.650 --> 01:29:10.140 that you might have running on your own Macs, PCs, or phones. 
01:29:10.140 --> 01:29:11.650 All right, we're back. 01:29:11.650 --> 01:29:15.210 Let's now consider a class of attacks that is particularly common when 01:29:15.210 --> 01:29:19.530 it comes to software that's running on your own Mac or your PC or your phone, 01:29:19.530 --> 01:29:21.410 so not web based, but local instead. 01:29:21.410 --> 01:29:23.160 And the first of those is generally called 01:29:23.160 --> 01:29:26.910 arbitrary code execution, the potential for an adversary 01:29:26.910 --> 01:29:32.910 to somehow trick your own computer into executing code that they, the adversary, 01:29:32.910 --> 01:29:36.780 have written and that's not embedded in the actual software that's 01:29:36.780 --> 01:29:38.020 meant to execute it. 01:29:38.020 --> 01:29:42.090 This is an example more generally of what might be called a remote code 01:29:42.090 --> 01:29:45.420 execution, whereby the same attack can happen even if the adversary is 01:29:45.420 --> 01:29:47.550 somewhere else in the world, perhaps connected 01:29:47.550 --> 01:29:50.100 to you somehow via the internet. 01:29:50.100 --> 01:29:52.530 And how might these attacks be possible? 01:29:52.530 --> 01:29:56.940 Very, very common mechanism for waging these kinds of attacks whereby 01:29:56.940 --> 01:30:01.530 an adversary tricks your own system into executing code that the adversary wrote 01:30:01.530 --> 01:30:04.170 is through something generally known as a buffer overflow. 01:30:04.170 --> 01:30:05.920 Now, to be fair, this is a topic you would 01:30:05.920 --> 01:30:09.120 explore in more detail in a class on programming 01:30:09.120 --> 01:30:11.320 specifically, computer science more generally. 01:30:11.320 --> 01:30:13.270 But we'll give you a high-level sense of what 01:30:13.270 --> 01:30:16.840 the threat is as it relates to software that might very well be 01:30:16.840 --> 01:30:18.950 running on your own computer. 01:30:18.950 --> 01:30:20.860 So what is a buffer overflow? 
01:30:20.860 --> 01:30:23.410 Well, for this, we need a mental model for what's 01:30:23.410 --> 01:30:26.140 going on inside of your computer when running a program. 01:30:26.140 --> 01:30:29.170 And when you double-click a program on your Mac or PC or your phone, 01:30:29.170 --> 01:30:32.470 and it opens up and loads into the computer's memory, 01:30:32.470 --> 01:30:35.770 the memory, you can think of, is this big, rectangular region 01:30:35.770 --> 01:30:39.070 that represents all of the bytes or megabytes or gigabytes 01:30:39.070 --> 01:30:41.650 that are in your Mac, your PC, or your phone. 01:30:41.650 --> 01:30:44.290 And the computer, or device, more generally, 01:30:44.290 --> 01:30:47.300 uses different parts of this memory for different purposes. 01:30:47.300 --> 01:30:50.200 And this is just because humans came up with conventions years ago 01:30:50.200 --> 01:30:53.770 to lay out the computer's memory in this way-- using some of it 01:30:53.770 --> 01:30:56.110 up here for one purpose, using some of the memory 01:30:56.110 --> 01:30:58.640 down here for another purpose instead. 01:30:58.640 --> 01:31:01.870 So, for instance, if this big rectangle represents 01:31:01.870 --> 01:31:05.440 your phone or your computer's memory, let me just propose that, 01:31:05.440 --> 01:31:07.570 at the top of it, so to speak-- although memory 01:31:07.570 --> 01:31:09.820 doesn't have a top, bottom, left, or right because it totally 01:31:09.820 --> 01:31:11.195 depends on how you're holding it. 01:31:11.195 --> 01:31:14.440 But assume conceptually that at the top of your computer's memory 01:31:14.440 --> 01:31:17.470 is the machine code for the program you're running. 01:31:17.470 --> 01:31:21.760 So long story short, when you write software, at the end of the day, 01:31:21.760 --> 01:31:23.590 zeros and ones are involved. 
01:31:23.590 --> 01:31:27.070 And the zeros and ones represent the instructions or commands 01:31:27.070 --> 01:31:30.100 that that software wants to execute on your computer. 01:31:30.100 --> 01:31:34.210 When you click an icon or double-click an icon and load a program into memory, 01:31:34.210 --> 01:31:38.410 the actual program's machine code-- the zeros and ones, if you will-- 01:31:38.410 --> 01:31:41.140 are stored up here in your computer's memory. 01:31:41.140 --> 01:31:44.320 Meanwhile, while a program is running, it 01:31:44.320 --> 01:31:47.380 might need more or less additional memory 01:31:47.380 --> 01:31:50.092 as it executes instructions therein. 01:31:50.092 --> 01:31:51.050 So what does this mean? 01:31:51.050 --> 01:31:53.680 Well, if the program is prompting you for input 01:31:53.680 --> 01:31:56.260 or if it needs to load a new level from a game, 01:31:56.260 --> 01:31:57.910 it might need more and more memory. 01:31:57.910 --> 01:32:00.260 But eventually, it might not need that memory anymore. 01:32:00.260 --> 01:32:02.260 So the memory requirements of a program tend 01:32:02.260 --> 01:32:05.770 to go up and down all the time based on what you, the human, 01:32:05.770 --> 01:32:09.760 are doing with the software and based on what the software is designed to do. 01:32:09.760 --> 01:32:12.940 So computers typically use this bottom area 01:32:12.940 --> 01:32:16.540 of the computer's memory for a so-called stack, very similar in spirit 01:32:16.540 --> 01:32:19.090 to any physical thing that you might stack one 01:32:19.090 --> 01:32:23.680 on top of the other, like clothes in a closet or trays in a cafeteria. 01:32:23.680 --> 01:32:26.650 Stacking means literally from bottom on up. 01:32:26.650 --> 01:32:29.500 But the weird thing is about a computer's memory 01:32:29.500 --> 01:32:32.530 is that, by convention, when the computer needs memory, 01:32:32.530 --> 01:32:35.540 it first uses some memory from the very bottom. 
01:32:35.540 --> 01:32:38.590 And then if it needs more, it uses more above that. 01:32:38.590 --> 01:32:41.480 When it needs more, it uses more above that. 01:32:41.480 --> 01:32:45.220 So instead of just going top to bottom, it actually deliberately, by design, 01:32:45.220 --> 01:32:48.200 goes bottom up for reasons we won't get into in this course. 01:32:48.200 --> 01:32:53.140 But just take on faith that, indeed, this stack of memory grows upward. 01:32:53.140 --> 01:32:57.610 The catch, though, is that sometimes software doesn't necessarily 01:32:57.610 --> 01:33:02.980 know in advance or predict correctly in advance how much input you, the human, 01:33:02.980 --> 01:33:03.670 might give it. 01:33:03.670 --> 01:33:05.650 So, for instance, a computer program might 01:33:05.650 --> 01:33:08.770 decide to take up this much memory at the bottom 01:33:08.770 --> 01:33:10.990 but then not realize that, oh, wait a minute, what 01:33:10.990 --> 01:33:15.490 if the human types in a really long name or a really long essay or just 01:33:15.490 --> 01:33:19.540 gives me more keystrokes as input than I, the programmer who 01:33:19.540 --> 01:33:21.220 wrote this software, anticipated? 01:33:21.220 --> 01:33:23.590 That might mean that even as you allocate 01:33:23.590 --> 01:33:27.250 what are called frames of memory on this stack, 01:33:27.250 --> 01:33:31.090 the user's input might not stay confined to that particular frame. 01:33:31.090 --> 01:33:33.790 If they type in too many characters at their keyboard, what's 01:33:33.790 --> 01:33:37.010 supposed to go here might end up going down here, 01:33:37.010 --> 01:33:39.890 so overflowing these frames of memory. 01:33:39.890 --> 01:33:42.610 So the computer or the programmer makes hopefully 01:33:42.610 --> 01:33:45.610 an educated guess as to how much input the user might have. 
01:33:45.610 --> 01:33:48.280 But if they're wrong, that input might be too tall 01:33:48.280 --> 01:33:52.120 and therefore overlap other parts of the computer's memory. 01:33:52.120 --> 01:33:53.980 Now, there are ways to defend against this. 01:33:53.980 --> 01:33:56.470 So the scenario we're worried about here is often 01:33:56.470 --> 01:33:59.140 when programmers don't know how to anticipate this 01:33:59.140 --> 01:34:03.070 or when you are using software written by programmers who didn't anticipate 01:34:03.070 --> 01:34:05.630 or implement the solution properly. 01:34:05.630 --> 01:34:07.160 So what might go wrong? 01:34:07.160 --> 01:34:09.480 Well, for instance, when a program first starts 01:34:09.480 --> 01:34:12.690 running, one of the first things it does if it's 01:34:12.690 --> 01:34:16.950 calling another routine or another function, when you click on a button, 01:34:16.950 --> 01:34:19.020 when you start typing keystrokes, the computer 01:34:19.020 --> 01:34:21.150 might start using some of this memory, and it 01:34:21.150 --> 01:34:23.675 might be moving around among these zeros and ones, 01:34:23.675 --> 01:34:25.050 executing different instructions. 01:34:25.050 --> 01:34:27.600 If you click this menu option, it'll use this code. 01:34:27.600 --> 01:34:31.170 If you click on this menu option, it'll use this code. 01:34:31.170 --> 01:34:33.060 So in other words, the computer logically 01:34:33.060 --> 01:34:35.850 is kind of moving around among all those zeros and ones 01:34:35.850 --> 01:34:38.500 and executing them accordingly. 01:34:38.500 --> 01:34:42.160 So one of the first things your computer does when a program is running 01:34:42.160 --> 01:34:45.970 is it just jots down at the bottom of the computer's memory 01:34:45.970 --> 01:34:49.433 what is the address to which I should return after doing this. 
01:34:49.433 --> 01:34:52.350 So it's kind of like in the real world, if you go off and do something 01:34:52.350 --> 01:34:55.560 over there, eventually, you want to remember to come back over here 01:34:55.560 --> 01:34:57.000 to pick up where you left off. 01:34:57.000 --> 01:34:58.800 And that's what we mean by return address. 01:34:58.800 --> 01:35:00.960 It's a little reminder to yourself that, no matter 01:35:00.960 --> 01:35:02.820 what you go off and do right now, you got 01:35:02.820 --> 01:35:05.340 to come back and resume where you left off. 01:35:05.340 --> 01:35:07.530 And what this return address then does is 01:35:07.530 --> 01:35:12.870 it refers to some specific location in the machine code, some specific pattern 01:35:12.870 --> 01:35:15.810 of zeros and ones that eventually the software should come back 01:35:15.810 --> 01:35:18.240 to pick up when the user leaves off. 01:35:18.240 --> 01:35:22.740 So, for instance, if you open the File menu and go to Print, 01:35:22.740 --> 01:35:25.170 and you go through the steps of printing a document, 01:35:25.170 --> 01:35:28.440 the return address might be to go back to whatever 01:35:28.440 --> 01:35:32.192 you were doing in that document before you initiated the print command. 01:35:32.192 --> 01:35:34.650 So the software is constantly jumping around in this sense. 01:35:34.650 --> 01:35:38.190 Suppose now that the user clicks some button within the software 01:35:38.190 --> 01:35:39.330 to search for something. 01:35:39.330 --> 01:35:40.590 Maybe it's cats. 01:35:40.590 --> 01:35:43.650 Well, because this is a new function that's being called, 01:35:43.650 --> 01:35:46.740 the search function, what the computer might do inside of its memory 01:35:46.740 --> 01:35:49.380 is this-- it might put a little note to self to say, 01:35:49.380 --> 01:35:53.700 go back to this location in the machine code once the user is done searching, 01:35:53.700 --> 01:35:55.740 just like the user might be done printing. 
01:35:55.740 --> 01:35:57.900 And then suppose the user types in "cats." 01:35:57.900 --> 01:36:00.180 Well, "cats" is stored in the computer's memory 01:36:00.180 --> 01:36:03.990 just above this frame on the stack because, again, I said, by convention, 01:36:03.990 --> 01:36:08.020 whenever the software uses memory, it starts at the bottom then goes up, 01:36:08.020 --> 01:36:09.540 then goes up, then goes up. 01:36:09.540 --> 01:36:13.510 So after now this software is done searching itself for cats, 01:36:13.510 --> 01:36:16.967 then that frame on the stack is sort of removed because we 01:36:16.967 --> 01:36:18.550 don't need to know about cats anymore. 01:36:18.550 --> 01:36:19.880 We're done searching for them. 01:36:19.880 --> 01:36:22.900 So the last thing in memory is this reminder, go to machine code. 01:36:22.900 --> 01:36:26.650 And this is how the software knows to go back to a particular location in code, 01:36:26.650 --> 01:36:29.980 where maybe it's just sitting there waiting for me to click some other menu 01:36:29.980 --> 01:36:31.660 option instead. 01:36:31.660 --> 01:36:35.320 But what if an adversary is the one at the keyboard, so to speak, 01:36:35.320 --> 01:36:38.770 and it's not a good user just typing in short phrases like "cats," 01:36:38.770 --> 01:36:41.750 but maybe it's an adversary who's typing something more? 01:36:41.750 --> 01:36:45.400 So suppose that an adversary actually pulls up the search feature. 01:36:45.400 --> 01:36:47.230 And in general, therefore, the software is 01:36:47.230 --> 01:36:49.313 going to remember to put the return address there, 01:36:49.313 --> 01:36:52.930 so specifically something like "go to" this location in the machine code. 01:36:52.930 --> 01:36:55.840 But suppose that the adversary doesn't type in "cats," 01:36:55.840 --> 01:37:00.010 doesn't type in "dogs," but types in, for the sake of discussion, 01:37:00.010 --> 01:37:04.580 some pattern of zeros and ones that represents actual code. 
01:37:04.580 --> 01:37:06.520 Maybe it's the pattern of zeros and ones that 01:37:06.520 --> 01:37:09.280 represents, delete everything from a server, 01:37:09.280 --> 01:37:11.530 or start sending emails or the like. 01:37:11.530 --> 01:37:15.220 Or maybe, more cleverly, maybe it means skip 01:37:15.220 --> 01:37:20.080 whatever menu keeps prompting me to register or activate my software. 01:37:20.080 --> 01:37:22.870 In other words, the adversary wants to trick this software 01:37:22.870 --> 01:37:26.025 into running zeros and ones that didn't come with the software. 01:37:26.025 --> 01:37:28.900 Now, in practice, you can't just type zeros and ones at the keyboard. 01:37:28.900 --> 01:37:31.753 It would be a different way that the adversary inputs this data. 01:37:31.753 --> 01:37:34.420 But for the sake of discussion, assume that the adversary is not 01:37:34.420 --> 01:37:37.060 typing in "cats" but is typing in the zeros and ones. 01:37:37.060 --> 01:37:39.730 And they know enough about binary, zeros and ones, 01:37:39.730 --> 01:37:42.220 that they know what patterns to choose. 01:37:42.220 --> 01:37:45.280 Now suppose that this code, this so-called attack code, 01:37:45.280 --> 01:37:51.640 is way longer than C-A-T-S, and it's many more characters or bytes long. 01:37:51.640 --> 01:37:54.970 It is possible, by definition of how memory is used, that this attack 01:37:54.970 --> 01:37:57.190 code might be so big that it takes up not only 01:37:57.190 --> 01:37:59.590 this space that's been allocated for it, but it 01:37:59.590 --> 01:38:02.920 overflows other things in memory. 01:38:02.920 --> 01:38:06.400 You can now think of this frame, this rectangular region 01:38:06.400 --> 01:38:10.090 on the stack to which I keep referring, as like a buffer, a room 01:38:10.090 --> 01:38:11.720 for some amount of information. 
01:38:11.720 --> 01:38:14.470 But if the adversary provides so much information, so much 01:38:14.470 --> 01:38:17.500 attack code, zeros and ones, that it overflows that buffer, 01:38:17.500 --> 01:38:20.620 it might actually overwrite that note to self 01:38:20.620 --> 01:38:23.740 with dot, dot, dot, something else. 01:38:23.740 --> 01:38:28.210 And what's clever here, though, is that if the adversary is smart enough-- 01:38:28.210 --> 01:38:30.970 and this is often through lots and lots of trial and error. 01:38:30.970 --> 01:38:32.950 They don't often just get it the first time. 01:38:32.950 --> 01:38:35.720 If the adversary is ultimately clever enough, 01:38:35.720 --> 01:38:39.250 they can actually put not just some random zeros and ones there, 01:38:39.250 --> 01:38:43.840 but they can put the equivalent of a note to self that says, 01:38:43.840 --> 01:38:46.220 "go to attack code." 01:38:46.220 --> 01:38:48.580 In other words, instead of typing in "cats," 01:38:48.580 --> 01:38:50.830 they type in two things that are pretty long. 01:38:50.830 --> 01:38:54.040 One are the zeros and ones that represent some form of attack, 01:38:54.040 --> 01:38:57.890 like circumvent the registration or the activation for the software 01:38:57.890 --> 01:39:01.390 so I can use it for free, or do something else that's malicious. 01:39:01.390 --> 01:39:04.600 And if the second thing they provide in just so 01:39:04.600 --> 01:39:09.000 happens to be cleverly the address of their own attack code, which 01:39:09.000 --> 01:39:12.060 they can figure out mathematically perhaps through trial and error, 01:39:12.060 --> 01:39:15.060 the adversary can trick the computer into not 01:39:15.060 --> 01:39:16.890 going back up here and running the machine 01:39:16.890 --> 01:39:18.660 code that came with the software. 01:39:18.660 --> 01:39:22.320 The adversary can trick the software into executing code 01:39:22.320 --> 01:39:26.070 that the adversary themselves injected. 
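The overflow just described can be imitated with a toy model. Python itself checks list bounds, so this is only a simulation under simplifying assumptions: a frame is a hypothetical list of `BUFFER_SIZE` cells followed by one slot holding the return address, and the strings "machine code" and "attack code" stand in for real memory addresses.

```python
BUFFER_SIZE = 4            # room reserved for the search term, e.g. "cats"
RETURN_SLOT = BUFFER_SIZE  # the "note to self" sits right after the buffer


def new_frame(return_address):
    """Build a toy frame: BUFFER_SIZE empty cells, then the return address."""
    return [None] * BUFFER_SIZE + [return_address]


def unsafe_copy(frame, user_input):
    """Copy input cell by cell without checking its length (the bug)."""
    for i, value in enumerate(user_input):
        if i >= len(frame):
            break  # the toy frame ends here; real memory would keep going
        frame[i] = value


def safe_copy(frame, user_input):
    """Same copy, but refuse input that does not fit in the buffer."""
    if len(user_input) > BUFFER_SIZE:
        raise ValueError("input longer than buffer")
    for i, value in enumerate(user_input):
        frame[i] = value


# A well-behaved user: "cats" fits, so the return address is untouched.
frame = new_frame("machine code")
unsafe_copy(frame, list("cats"))
print(frame[RETURN_SLOT])

# An adversary: four cells of attack code plus a replacement return address.
frame = new_frame("machine code")
unsafe_copy(frame, ["attack"] * BUFFER_SIZE + ["attack code"])
print(frame[RETURN_SLOT])
```

The first print shows the original return address and the second shows the adversary's replacement. With `safe_copy`, the oversized input is rejected before anything is written, which is the kind of length check a careful programmer would add.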
01:39:26.070 --> 01:39:27.480 Now, what does that mean? 01:39:27.480 --> 01:39:31.230 If you are running the software under your user name, whatever 01:39:31.230 --> 01:39:35.500 you can do on the software and the system, be it your Mac or PC or phone, 01:39:35.500 --> 01:39:37.770 so now can the adversary. 01:39:37.770 --> 01:39:40.350 And maybe they'll now delete all of your files. 01:39:40.350 --> 01:39:43.290 Maybe they will now auto-register the software. 01:39:43.290 --> 01:39:44.970 Maybe they will now start sending spam. 01:39:44.970 --> 01:39:51.190 Anything at all is possible based on what the adversary has passed in here. 01:39:51.190 --> 01:39:55.140 So if you've ever heard of a website called stackoverflow.com, 01:39:55.140 --> 01:39:58.260 which is a popular website for programmers to ask questions and get 01:39:58.260 --> 01:40:01.050 answers of a community, that specifically 01:40:01.050 --> 01:40:06.330 is the allusion to exactly this kind of bug or mistake, 01:40:06.330 --> 01:40:10.920 whereby, if not programmed properly, the stack can overflow. 01:40:10.920 --> 01:40:14.520 And if the software or programmer does not anticipate or detect it, 01:40:14.520 --> 01:40:16.120 bad things can happen. 01:40:16.120 --> 01:40:18.600 You can have arbitrary code executed, or you 01:40:18.600 --> 01:40:22.620 can have remote code executed if the adversary isn't even at that keyboard 01:40:22.620 --> 01:40:27.090 but is somehow sending this code into your software via some network 01:40:27.090 --> 01:40:27.840 connection. 01:40:27.840 --> 01:40:30.772 What then might this mean? 
01:40:30.772 --> 01:40:32.730 So if you've ever heard of the term "cracking," 01:40:32.730 --> 01:40:36.630 which typically refers to figuring out someone's password or, in this case, 01:40:36.630 --> 01:40:39.450 breaking into software, cracking might refer 01:40:39.450 --> 01:40:42.330 to eliminating the need for a serial number 01:40:42.330 --> 01:40:45.000 or an activation code or the like, because if you 01:40:45.000 --> 01:40:49.740 can inject any code that you want into someone's software, 01:40:49.740 --> 01:40:53.220 you could tell that software to just skip the lines of code, the zeros 01:40:53.220 --> 01:40:56.340 and ones that represent asking you for that activation code, 01:40:56.340 --> 01:40:59.520 or they can do something much more malicious. 01:40:59.520 --> 01:41:02.640 So this is an example, in some sense, of what we might also 01:41:02.640 --> 01:41:04.110 call reverse engineering. 01:41:04.110 --> 01:41:07.020 Reverse engineering refers to the ability for someone 01:41:07.020 --> 01:41:10.590 to figure out how something was engineered, how it was built. 01:41:10.590 --> 01:41:12.780 Now, at the end of the day, most of the software 01:41:12.780 --> 01:41:15.690 that you and I install on our Macs or PCs or phones 01:41:15.690 --> 01:41:17.790 is pretty much just zeros and ones. 01:41:17.790 --> 01:41:22.120 So it's very nonobvious to an adversary even what is actually going on. 01:41:22.120 --> 01:41:25.410 But with certain techniques, with certain trial and error, 01:41:25.410 --> 01:41:28.620 they can actually figure out what those zeros and ones represent. 01:41:28.620 --> 01:41:31.560 And depending on the language that was used to generate that software, 01:41:31.560 --> 01:41:34.290 they might be able to glean even more information than that. 
01:41:34.290 --> 01:41:37.710 Now, there's a good side of reverse engineering, whereby 01:41:37.710 --> 01:41:41.700 if you and I are in the business of figuring out 01:41:41.700 --> 01:41:45.360 how malware was implemented so that you and I can contribute solutions 01:41:45.360 --> 01:41:47.400 to antivirus software and the like, well, 01:41:47.400 --> 01:41:50.730 malware analysis uses these same kinds of techniques, 01:41:50.730 --> 01:41:54.750 trying to figure out what's going on underneath the hood of software 01:41:54.750 --> 01:41:57.780 as by reverse engineering it, so using trial and error, 01:41:57.780 --> 01:42:00.000 maybe injecting some code of our own, to figure out 01:42:00.000 --> 01:42:05.760 exactly what instructions are embedded among all of those zeros and ones. 01:42:05.760 --> 01:42:10.530 Now, how might you hedge against these kinds of threats of remote code 01:42:10.530 --> 01:42:14.040 execution, arbitrary code execution, with software of your own? 01:42:14.040 --> 01:42:17.640 Well, you could start using open-source software, for instance. 01:42:17.640 --> 01:42:21.820 Open-source software just means that the code that implements that software, 01:42:21.820 --> 01:42:28.500 be it in Python or PHP or Java or C# or C++ or any number of other languages is 01:42:28.500 --> 01:42:29.550 itself open source. 01:42:29.550 --> 01:42:31.800 That is, you and I and anyone on the internet 01:42:31.800 --> 01:42:35.700 typically can read the source code and see exactly what instructions will 01:42:35.700 --> 01:42:37.950 be executed on your computer or phone. 01:42:37.950 --> 01:42:41.370 Now, that doesn't necessarily mean that the version 01:42:41.370 --> 01:42:44.820 of the software that you are running is exactly 01:42:44.820 --> 01:42:46.470 the same as the open-source version. 
01:42:46.470 --> 01:42:49.290 There's still a threat whereby the code might be open source, 01:42:49.290 --> 01:42:52.170 but maybe you were tricked via some phishing email 01:42:52.170 --> 01:42:54.600 or some malicious website into installing 01:42:54.600 --> 01:42:58.750 a fake version of some software that actually has malicious code in it. 01:42:58.750 --> 01:43:01.120 So malware might still be a problem. 01:43:01.120 --> 01:43:03.630 But a lot of folks think that open source 01:43:03.630 --> 01:43:05.730 tends to be a good thing because you can audit-- 01:43:05.730 --> 01:43:07.740 smart people on the internet can audit the code 01:43:07.740 --> 01:43:11.570 and make sure that there are no "backdoors" or malicious instructions 01:43:11.570 --> 01:43:14.870 that might do things that you wouldn't expect the software to do. 01:43:14.870 --> 01:43:18.230 Now, again, that's not necessarily the case that the version you're running 01:43:18.230 --> 01:43:20.390 doesn't still have some form of infection. 01:43:20.390 --> 01:43:23.000 But this might give you at least a bit more reassurance. 01:43:23.000 --> 01:43:27.110 Now the flip side, though, is that if code is open source, even if it's 01:43:27.110 --> 01:43:29.150 devoid of anything malicious, it might still 01:43:29.150 --> 01:43:33.560 have bugs, mistakes, that human programmers accidentally made, 01:43:33.560 --> 01:43:37.400 which might very well make open-source software, or any software, 01:43:37.400 --> 01:43:39.090 vulnerable to attack. 01:43:39.090 --> 01:43:45.080 I mean, you're literally giving the adversaries the plans to your software. 01:43:45.080 --> 01:43:47.390 It's like the plans to the Death Star in Star Wars, 01:43:47.390 --> 01:43:49.370 such that they can probably figure out what 01:43:49.370 --> 01:43:52.250 the weaknesses are in your software because you're 01:43:52.250 --> 01:43:54.410 giving them the blueprint therefore. 
01:43:54.410 --> 01:43:57.020 So an alternative to open-source software 01:43:57.020 --> 01:43:59.790 is perhaps the default, which is closed-source software. 01:43:59.790 --> 01:44:02.420 So any software that you might download or buy from companies 01:44:02.420 --> 01:44:04.700 that is not open source is typically closed 01:44:04.700 --> 01:44:08.180 source, which means only they, only their employees, have access to it. 01:44:08.180 --> 01:44:11.410 Now, the downside is that you, the user, do not 01:44:11.410 --> 01:44:13.660 have access to closed-source software. 01:44:13.660 --> 01:44:15.730 Only the authors, therefore, do. 01:44:15.730 --> 01:44:21.370 But the upside, arguably, is that now adversaries on the internet 01:44:21.370 --> 01:44:23.030 do not have access to it either. 01:44:23.030 --> 01:44:26.680 So maybe the probability that that software not only has mistakes, 01:44:26.680 --> 01:44:30.230 but those mistakes are exploited, is perhaps lower. 01:44:30.230 --> 01:44:32.500 And so this is perhaps more of a debate. 01:44:32.500 --> 01:44:34.750 And you yourselves as you consider this might 01:44:34.750 --> 01:44:37.780 have to acquire your own opinions on open source versus closed source. 01:44:37.780 --> 01:44:40.000 But another argument in favor of open source 01:44:40.000 --> 01:44:42.550 is often that with so many people around the world 01:44:42.550 --> 01:44:46.750 having eyes on software, perhaps that actually increases the probability 01:44:46.750 --> 01:44:51.430 that we will detect bugs or detect potential exploits because so many more 01:44:51.430 --> 01:44:54.543 smart people are looking at it and therefore weighing in. 01:44:54.543 --> 01:44:57.460 The downside, of course, if one of those smart people is an adversary, 01:44:57.460 --> 01:45:00.730 they find it and don't tell anyone, then we're back to a problem 01:45:00.730 --> 01:45:04.510 from a previous class, wherein we discussed those zero-day attacks. 
01:45:04.510 --> 01:45:07.780 But this is one way, one mental model you might have, 01:45:07.780 --> 01:45:11.320 for evaluating just how secure your own software might 01:45:11.320 --> 01:45:14.950 be that you're either using as a user or developing as a company. 01:45:14.950 --> 01:45:18.580 What's another way that you might gain some assurance that the software you're 01:45:18.580 --> 01:45:23.920 installing and using is not infected with some form of vulnerability 01:45:23.920 --> 01:45:26.560 or malicious intent? 01:45:26.560 --> 01:45:28.600 Well, you could download all of the software 01:45:28.600 --> 01:45:32.860 that you use only from an approved app store, be it in the world of iPhones 01:45:32.860 --> 01:45:35.980 or Android devices, macOS, Windows, or the like, 01:45:35.980 --> 01:45:39.100 whereby you have some other entity, like a Google, an Apple, 01:45:39.100 --> 01:45:45.190 or a Microsoft, a big company that is at least analyzing the applications that 01:45:45.190 --> 01:45:49.120 are being uploaded to these app stores before they're 01:45:49.120 --> 01:45:51.370 allowed to be distributed to people like you and me. 01:45:51.370 --> 01:45:54.245 Now, that's not to say that Apple and Microsoft and Google or others 01:45:54.245 --> 01:45:54.850 are perfect. 01:45:54.850 --> 01:45:59.110 There have absolutely been many cases where even applications in these app 01:45:59.110 --> 01:46:03.430 stores have some malicious feature that they only realize after the fact. 01:46:03.430 --> 01:46:07.330 But again, it's probably increasing the probability 01:46:07.330 --> 01:46:09.880 that some smart people or automated software 01:46:09.880 --> 01:46:13.960 is going to detect those things first before it even reaches your device.
01:46:13.960 --> 01:46:17.350 And therefore, it makes it harder for the adversary-- raises the bar, 01:46:17.350 --> 01:46:20.320 raises the cost, raises the risk to them to even get 01:46:20.320 --> 01:46:21.740 something like that distributed. 01:46:21.740 --> 01:46:22.730 So what does this mean? 01:46:22.730 --> 01:46:24.772 Well, when you install software in your computer, 01:46:24.772 --> 01:46:27.610 perhaps you should get it only from Microsoft or Google or Apple 01:46:27.610 --> 01:46:31.000 and not from some random website, and certainly not from some random email 01:46:31.000 --> 01:46:34.480 that someone sent you with a link to download some piece of software. 01:46:34.480 --> 01:46:36.320 Now, that's not always going to be the case. 01:46:36.320 --> 01:46:40.210 And particularly, if you yourself are an aspiring programmer, a software 01:46:40.210 --> 01:46:43.300 developer, you might need to be in the habit of installing 01:46:43.300 --> 01:46:46.382 sort of "unauthorized software" for which you might have 01:46:46.382 --> 01:46:49.090 to jump through some hoops and change some settings in your phone 01:46:49.090 --> 01:46:54.310 or in your Mac or PC to even allow you to install unauthorized software if you 01:46:54.310 --> 01:46:55.360 know what you're doing. 01:46:55.360 --> 01:46:59.860 But these kinds of mechanisms, even though they create dissatisfaction 01:46:59.860 --> 01:47:02.110 with this idea of a walled garden, whereby 01:47:02.110 --> 01:47:04.510 you need some corporate entity's permission 01:47:04.510 --> 01:47:08.960 just to distribute your software-- they do serve a good purpose as well. 01:47:08.960 --> 01:47:11.740 So there, too, you might fall on one side or the other 01:47:11.740 --> 01:47:14.080 of that sort of argument too. 01:47:14.080 --> 01:47:19.150 Now, how do those app stores enforce the fact 01:47:19.150 --> 01:47:23.660 that you can only install the software if it is in the app store itself? 
01:47:23.660 --> 01:47:27.490 Well, it turns out we can revisit some of our primitives from past classes, 01:47:27.490 --> 01:47:31.270 whereby we talked about encryption and also hashing 01:47:31.270 --> 01:47:35.500 and also digital signatures, the latter two of which are particularly germane here. 01:47:35.500 --> 01:47:39.910 It turns out that cryptography really is the solution to a lot of the world's 01:47:39.910 --> 01:47:42.250 current problems when it comes to cybersecurity 01:47:42.250 --> 01:47:46.570 if we use these primitives, hashing, encryption, and digital signing 01:47:46.570 --> 01:47:48.740 as building blocks to solutions. 01:47:48.740 --> 01:47:52.180 So, for instance, when you develop a piece of software, 01:47:52.180 --> 01:47:55.510 or some company does this for you, and they upload their software 01:47:55.510 --> 01:47:58.750 to Apple or Google or Microsoft for distribution, 01:47:58.750 --> 01:48:00.790 what are those companies doing? 01:48:00.790 --> 01:48:04.510 Well, first, you, as the author of the software, 01:48:04.510 --> 01:48:08.360 are using your own public and private key, 01:48:08.360 --> 01:48:09.910 which you came up with in advance. 01:48:09.910 --> 01:48:14.500 And you are running your software through some special function 01:48:14.500 --> 01:48:16.940 or algorithm and getting back a hash thereof. 01:48:16.940 --> 01:48:20.240 Again, a hash is this fixed length representation of your software. 01:48:20.240 --> 01:48:22.280 So even if you wrote a really big program, 01:48:22.280 --> 01:48:26.800 you have this unique identifier, or highly probably unique 01:48:26.800 --> 01:48:29.060 identifier, called a hash. 01:48:29.060 --> 01:48:32.860 And what you can then do with that hash is use your private key 01:48:32.860 --> 01:48:37.270 and sign that hash, giving you a digital signature.
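To make the fixed-length point concrete, here is a minimal Python sketch. It assumes SHA-256 as the hash function, which the lecture does not name, and the two byte strings are made-up stand-ins for a small program and a much larger one.

```python
import hashlib

# Hypothetical stand-ins for a tiny program and a much larger one.
small_program = b"print('hello')"
large_program = b"print('hello')\n" * 100_000

# SHA-256 is an assumed choice; the lecture only says "some special function".
small_hash = hashlib.sha256(small_program).hexdigest()
large_hash = hashlib.sha256(large_program).hexdigest()

# Both digests have the same length (64 hex characters, i.e. 256 bits),
# no matter how large the input was.
print(len(small_program), len(small_hash))
print(len(large_program), len(large_hash))

# Changing the input, even by a single byte, gives a different digest.
modified_hash = hashlib.sha256(small_program + b" ").hexdigest()
print(small_hash != modified_hash)
```

The digest is what gets signed in the next step, which is why a signature stays small even when the program itself is large.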
01:48:37.270 --> 01:48:41.680 That signature can be verified by Google or Microsoft or Apple as being, 01:48:41.680 --> 01:48:44.920 OK, I know that David Malan wrote this software because I 01:48:44.920 --> 01:48:47.200 know that only he has that private key. 01:48:47.200 --> 01:48:50.170 And so long as I, David Malan, registered my public key 01:48:50.170 --> 01:48:52.240 with Apple or Google or Microsoft in advance, 01:48:52.240 --> 01:48:55.060 they can assume that, OK, this new version 01:48:55.060 --> 01:48:58.120 of the software or this new program came from David Malan 01:48:58.120 --> 01:49:01.960 and not from some random person on the internet pretending to be David Malan. 01:49:01.960 --> 01:49:07.810 Conversely, what Google and Microsoft and Apple and others 01:49:07.810 --> 01:49:12.420 can do is the same thing-- once you have uploaded your software to their app 01:49:12.420 --> 01:49:18.570 store, they can ensure that they run the software through the same function, 01:49:18.570 --> 01:49:22.140 getting back a hash thereof, a unique representation thereof. 01:49:22.140 --> 01:49:26.220 They can use their own private key from their own app store 01:49:26.220 --> 01:49:30.480 to take that hash as input and produce a digital signature, this time signed 01:49:30.480 --> 01:49:34.530 by Apple or Microsoft or Google or whoever else is running the app store. 01:49:34.530 --> 01:49:40.440 And then when you or I install that software on our Mac and our PC, 01:49:40.440 --> 01:49:43.920 or on our phones, our phones and devices can 01:49:43.920 --> 01:49:48.030 ensure that any software you and I are installing on our device 01:49:48.030 --> 01:49:51.840 was digitally signed by that app store, by Google 01:49:51.840 --> 01:49:55.300 or Microsoft or Apple or the like. 
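The two-step flow just described (the developer signs, the store checks and then signs again, and the device checks the store's signature) can be sketched in Python. This is a toy under loud assumptions: it uses textbook RSA with small Mersenne primes and no padding, purely so that it runs with the standard library, and the helper names `make_keypair`, `sign`, and `verify` are hypothetical. Real app stores rely on vetted cryptographic libraries, much larger keys, and certificates.

```python
import hashlib

# Toy RSA for illustration only: primes this small and the lack of padding
# make it insecure. Requires Python 3.8+ for pow(x, -1, m).
E = 65537  # commonly used public exponent


def make_keypair(p, q):
    """Return (public_key, private_key) from two distinct primes."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # private exponent
    return (n, E), (n, d)


def digest(data, n):
    """Hash with SHA-256 (an assumed choice), reduced to fit the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(data, private_key):
    n, d = private_key
    return pow(digest(data, n), d, n)


def verify(data, signature, public_key):
    n, e = public_key
    return pow(signature, e, n) == digest(data, n)


# Hypothetical keys: one pair for the developer, one for the app store.
developer_public, developer_private = make_keypair(2**61 - 1, 2**31 - 1)
store_public, store_private = make_keypair(2**89 - 1, 2**19 - 1)

app = b"hypothetical app, version 1.0"

# 1. The developer hashes the app and signs the hash with their private key.
developer_signature = sign(app, developer_private)

# 2. The store checks the upload against the developer's registered public
#    key, then signs the same app with the store's own private key.
accepted = verify(app, developer_signature, developer_public)
store_signature = sign(app, store_private) if accepted else None

# 3. The device only needs the store's public key to check what it installs.
installable = verify(app, store_signature, store_public)
tampered_ok = verify(app + b" plus malware", store_signature, store_public)
print(accepted, installable, tampered_ok)
```

Note that the device never needs the developer's key in this sketch; it trusts the store, and the store in turn vouches for the developer.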
01:49:55.300 --> 01:49:59.010 So again, just by using this basic building block of digital signatures 01:49:59.010 --> 01:50:02.160 and hashing in this case, you can both attest in one direction 01:50:02.160 --> 01:50:04.170 that I am David Malan. 01:50:04.170 --> 01:50:06.330 Trust that my software was written by me. 01:50:06.330 --> 01:50:09.070 Conversely, when people install my software, 01:50:09.070 --> 01:50:11.890 they can trust, if Apple or Google or Microsoft or others 01:50:11.890 --> 01:50:15.710 trust that software, that they should indeed be allowed to double-click it. 01:50:15.710 --> 01:50:19.090 And it's only when you download some unauthorized software, 01:50:19.090 --> 01:50:22.870 from the internet, typically, that you get often nowadays on your screen 01:50:22.870 --> 01:50:25.420 an alert saying, this has not been signed, 01:50:25.420 --> 01:50:29.230 or this is from an unauthorized third-party developer or the like 01:50:29.230 --> 01:50:32.620 if they're not playing nicely in this same ecosystem. 01:50:32.620 --> 01:50:35.750 But again, digital signatures take us a long way there. 01:50:35.750 --> 01:50:39.170 So another mechanism you can consider, which is similar in spirit 01:50:39.170 --> 01:50:42.550 but is terminology typically used in the world of Linux computers 01:50:42.550 --> 01:50:45.400 or similar, is package managers. 01:50:45.400 --> 01:50:47.680 And different programming languages also come 01:50:47.680 --> 01:50:50.260 with an ecosystem of libraries, third-party code, 01:50:50.260 --> 01:50:54.220 that people write and they make freely available, often as open source. 01:50:54.220 --> 01:50:57.310 But there are standard ways by which these package managers 01:50:57.310 --> 01:51:02.020 can let you and me install the software on our own Macs, PCs, phones, 01:51:02.020 --> 01:51:02.710 or the like.
01:51:02.710 --> 01:51:08.890 And it's using tools like pip for Python, gem for Ruby, npm for Node.js, 01:51:08.890 --> 01:51:11.350 and there's others as well-- apt for Linux. 01:51:11.350 --> 01:51:13.690 These package managers, though, typically adopt 01:51:13.690 --> 01:51:18.310 a very similar mechanism, whereby they are digitally signing these packages so 01:51:18.310 --> 01:51:23.080 that you and I can have our computers verify those signatures before they're 01:51:23.080 --> 01:51:24.760 actually allowed to be installed. 01:51:24.760 --> 01:51:28.430 And in general, this involves operating systems as well. 01:51:28.430 --> 01:51:31.360 The operating systems that you and I are running nowadays, 01:51:31.360 --> 01:51:34.720 at least if you stayed current and are in the habit of automatic updates 01:51:34.720 --> 01:51:36.490 or frequent manual updates-- 01:51:36.490 --> 01:51:40.270 odds are today's more modern operating systems are increasingly 01:51:40.270 --> 01:51:44.390 building in native support for these kinds of checks. 01:51:44.390 --> 01:51:47.650 Downside is it's getting a little more difficult, a little more annoying, 01:51:47.650 --> 01:51:50.410 to install third-party software on our devices. 01:51:50.410 --> 01:51:53.560 But the upside is that if you trust these app 01:51:53.560 --> 01:51:57.100 stores, these package managers, then, by transitivity, 01:51:57.100 --> 01:52:00.970 you can with higher probability trust the software being 01:52:00.970 --> 01:52:02.230 distributed there too. 01:52:02.230 --> 01:52:05.320 Now this too, though, is not fail safe, and it has often 01:52:05.320 --> 01:52:08.500 happened that even when software has been uploaded to these app stores 01:52:08.500 --> 01:52:11.050 or package managers and made available to folks, 01:52:11.050 --> 01:52:15.220 and version 1 might be perfectly safe, version 2 might be perfectly safe, 01:52:15.220 --> 01:52:19.900 version 3 might be malicious for some reason. 
01:52:19.900 --> 01:52:24.848 Maybe the developer finally decided to do what their intention was all along. 01:52:24.848 --> 01:52:26.890 Maybe the developer-- and this has happened too-- 01:52:26.890 --> 01:52:29.920 sold their software to someone else, and the third party now 01:52:29.920 --> 01:52:32.080 is adding ads or something malicious to it. 01:52:32.080 --> 01:52:34.930 Or someone has hacked their computer or account 01:52:34.930 --> 01:52:37.990 and gained access to their private key and not just their public key 01:52:37.990 --> 01:52:39.800 and therefore are masquerading as them. 01:52:39.800 --> 01:52:44.200 So even now, you can't necessarily trust the software 01:52:44.200 --> 01:52:45.547 you're running on your computer. 01:52:45.547 --> 01:52:48.130 But again, that brings us back to some of our earliest lessons 01:52:48.130 --> 01:52:50.213 in the class, where what we're really trying to do 01:52:50.213 --> 01:52:53.170 is raise the bar to the adversary, increase the cost, 01:52:53.170 --> 01:52:56.260 increase the risk to them, and conversely decrease 01:52:56.260 --> 01:53:00.220 the probability to us that any one of these pieces of software 01:53:00.220 --> 01:53:02.830 might actually be malicious. 01:53:02.830 --> 01:53:06.700 Now, there are models that the world has been experimenting with over time 01:53:06.700 --> 01:53:10.270 to try to figure out how best to reduce these probabilities further. 01:53:10.270 --> 01:53:14.740 And there's this notion of bug bounties, whereby some companies will actually 01:53:14.740 --> 01:53:17.980 steer in to the reality that there are people out there with the skills, 01:53:17.980 --> 01:53:22.900 not only to do malicious things with their software, but also good things 01:53:22.900 --> 01:53:23.660 as well. 
01:53:23.660 --> 01:53:29.560 For instance, people who might very well want to try to find bugs in software, 01:53:29.560 --> 01:53:31.990 particularly ones that relate to security-- 01:53:31.990 --> 01:53:36.100 if they know that the company in whose software they're discovering these bugs 01:53:36.100 --> 01:53:38.990 is willing to pay for it-- and not in a ransom sense, 01:53:38.990 --> 01:53:44.410 not in a malicious ransom sense, but in a bounty sense, whereby there tends 01:53:44.410 --> 01:53:47.290 to be this marketplace for some companies and some products, 01:53:47.290 --> 01:53:50.080 whereby if you do discover a bug in their software, 01:53:50.080 --> 01:53:54.140 and you disclose it only to the designers of the software, 01:53:54.140 --> 01:53:57.340 at least during some window of time before you tell the world about it, 01:53:57.340 --> 01:53:58.540 they will pay you. 01:53:58.540 --> 01:54:02.320 So once they can fix the bug, they then 01:54:02.320 --> 01:54:05.050 pay out, because it's a net positive for everyone. 01:54:05.050 --> 01:54:05.560 Win-win. 01:54:05.560 --> 01:54:06.340 You have benefited. 01:54:06.340 --> 01:54:07.173 They have benefited. 01:54:07.173 --> 01:54:09.280 And hopefully, no adversaries have found it first. 01:54:09.280 --> 01:54:11.200 And depending on the severity of these bugs, 01:54:11.200 --> 01:54:14.750 you might get paid more or less based on the same. 01:54:14.750 --> 01:54:17.260 And so the idea here of these bug bounty programs 01:54:17.260 --> 01:54:21.220 is to try to leverage the collective intelligence and technical skill 01:54:21.220 --> 01:54:23.590 of people who, frankly, without these programs, 01:54:23.590 --> 01:54:27.400 maybe would be using their skills for evil and trying to hack these systems 01:54:27.400 --> 01:54:29.140 and monetize them through ransomware.
01:54:29.140 --> 01:54:33.700 But perhaps we could channel those funds instead toward paying people 01:54:33.700 --> 01:54:35.180 to do this kind of work. 01:54:35.180 --> 01:54:38.920 So this too is something to consider not so much as a user using software 01:54:38.920 --> 01:54:42.980 but perhaps a company developing software. 01:54:42.980 --> 01:54:44.980 So where can you learn more? 01:54:44.980 --> 01:54:46.900 And what has the world come up with to keep 01:54:46.900 --> 01:54:49.060 track of all of these possible threats? 01:54:49.060 --> 01:54:52.420 And what we focused on today really are representative attacks 01:54:52.420 --> 01:54:54.370 using some languages and technologies that 01:54:54.370 --> 01:54:56.740 are quite omnipresent and fairly accessible, 01:54:56.740 --> 01:54:58.580 at least at the level we've explained them. 01:54:58.580 --> 01:55:02.557 But it turns out that there is a whole inventory of vulnerabilities 01:55:02.557 --> 01:55:05.140 that have been detected over the years, Common Vulnerabilities 01:55:05.140 --> 01:55:09.290 and Exposures, or CVEs, such that a lot of the kinds of attacks 01:55:09.290 --> 01:55:12.050 we've been talking about today and, more specifically, 01:55:12.050 --> 01:55:15.500 bugs and flaws in specific software and versions 01:55:15.500 --> 01:55:19.640 thereof are often assigned a unique identifier, a CVE 01:55:19.640 --> 01:55:23.480 number that system administrators, companies, and even end users 01:55:23.480 --> 01:55:27.320 can keep track of to make sure they are always current with the latest threats 01:55:27.320 --> 01:55:28.010 out there. 01:55:28.010 --> 01:55:32.120 There is also a Common Vulnerability Scoring System, or CVSS, 01:55:32.120 --> 01:55:34.850 which is a standardized way of assigning a score 01:55:34.850 --> 01:55:36.800 to the severity of a vulnerability. 01:55:36.800 --> 01:55:37.850 Is it a big deal? 01:55:37.850 --> 01:55:39.500 Or is it not so much a big deal? 
01:55:39.500 --> 01:55:41.510 It might still be a vulnerability, a bug. 01:55:41.510 --> 01:55:43.350 But is it that problematic? 01:55:43.350 --> 01:55:45.920 And so there's this scale so that you can prioritize things-- 01:55:45.920 --> 01:55:49.550 for instance, given limited resources or time, which of the bugs 01:55:49.550 --> 01:55:52.760 you should be fixing, which of the software you should be updating, 01:55:52.760 --> 01:55:54.680 or maybe which of the software you should not 01:55:54.680 --> 01:55:58.880 be using, at least while it's vulnerable to something that's highly severe. 01:55:58.880 --> 01:56:01.790 There's an Exploit Prediction Scoring System out there, 01:56:01.790 --> 01:56:05.690 EPSS, which refers to what do people in the real world 01:56:05.690 --> 01:56:11.060 think the probability is that this particular bug or mistake in software 01:56:11.060 --> 01:56:12.710 will be exploited. 01:56:12.710 --> 01:56:16.550 And this then might give you a sense of just how problematic it is. 01:56:16.550 --> 01:56:18.890 Even if there's something very severe, is it more 01:56:18.890 --> 01:56:21.680 of a hypothetical threat or an actual threat, something 01:56:21.680 --> 01:56:25.520 that IT people might indeed take into account when designing 01:56:25.520 --> 01:56:27.290 how to respond to some system. 01:56:27.290 --> 01:56:31.610 And then there's a Known Exploited Vulnerabilities catalog, KEV, 01:56:31.610 --> 01:56:34.220 which refers to all of these kinds of bugs 01:56:34.220 --> 01:56:36.800 that are known to have been exploited.
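One way these scores might feed into prioritization is sketched below in Python. Everything here is an assumption for illustration: the three entries are made-up placeholders rather than real CVEs, and the ordering rule (known-exploited first, then likelihood of exploitation, then severity) is just one plausible policy, not a standard. The only facts relied on are that CVSS scores run from 0.0 to 10.0 and that EPSS is a probability between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Vulnerability:
    cve_id: str   # placeholder identifier, not a real CVE
    cvss: float   # severity score, 0.0 to 10.0
    epss: float   # estimated probability of exploitation, 0.0 to 1.0
    in_kev: bool  # listed as known to have been exploited


def priority(v):
    # Assumed policy: known-exploited first, then likelihood, then severity.
    return (v.in_kev, v.epss, v.cvss)


# Made-up findings for three hypothetical vulnerabilities.
findings = [
    Vulnerability("CVE-XXXX-0001", cvss=9.8, epss=0.02, in_kev=False),
    Vulnerability("CVE-XXXX-0002", cvss=7.5, epss=0.90, in_kev=True),
    Vulnerability("CVE-XXXX-0003", cvss=5.3, epss=0.40, in_kev=False),
]

# Highest priority first.
ranked = sorted(findings, key=priority, reverse=True)
for v in ranked:
    print(v.cve_id, v.cvss, v.epss, v.in_kev)
```

Under this policy the most severe entry lands last, because nobody is known or expected to be exploiting it, which is the hypothetical-versus-actual distinction the lecture draws. A team that cares more about worst-case damage could reorder the tuple in `priority`.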
01:56:36.800 --> 01:56:39.200 So suffice it to say, here, we're now seeing 01:56:39.200 --> 01:56:42.650 evidence of just how big of a world, how big of a space 01:56:42.650 --> 01:56:46.508 cybersecurity is that we have all of these lists and taxonomies for keeping 01:56:46.508 --> 01:56:49.550 track of things, because if you're feeling a little overwhelmed with just 01:56:49.550 --> 01:56:52.534 some of the concepts, imagine just how many hundreds, 01:56:52.534 --> 01:56:56.720 thousands of actual threats and vulnerabilities 01:56:56.720 --> 01:56:59.180 there are in the actual wild. 01:56:59.180 --> 01:57:01.280 All right, so that's a whole lot of threats 01:57:01.280 --> 01:57:04.730 to the security of your software, whether you're using it as a user 01:57:04.730 --> 01:57:06.410 or writing it as a developer. 01:57:06.410 --> 01:57:08.840 But hopefully, by way of today's examples 01:57:08.840 --> 01:57:12.410 of how software works, how adversaries can take advantage of it, 01:57:12.410 --> 01:57:15.800 and how you can defend against it, you have a much better sense 01:57:15.800 --> 01:57:17.840 of how to manage those threats. 01:57:17.840 --> 01:57:21.890 Up ahead is how we might now preserve our own privacy. 01:57:21.890 --> 01:57:24.700 More on that next time.