1 00:00:00,000 --> 00:00:03,904 [MUSIC PLAYING] 2 00:00:03,904 --> 00:00:17,047 3 00:00:17,047 --> 00:00:18,130 DAVID J. MALAN: All right. 4 00:00:18,130 --> 00:00:21,220 This is CS50's Introduction to Cybersecurity. 5 00:00:21,220 --> 00:00:25,450 My name is David Malan, and this week, we'll focus on securing data. 6 00:00:25,450 --> 00:00:27,987 Last week, recall, we focused on accounts, 7 00:00:27,987 --> 00:00:30,070 and particularly one of the mechanisms by which we 8 00:00:30,070 --> 00:00:33,880 protect our accounts is generally by way of these things called passwords. 9 00:00:33,880 --> 00:00:38,080 But we focused last time really on our having the responsibility 10 00:00:38,080 --> 00:00:39,640 to keep these things secure. 11 00:00:39,640 --> 00:00:41,650 And yet, there's another party involved whenever 12 00:00:41,650 --> 00:00:43,775 you have an account with a username and a password, 13 00:00:43,775 --> 00:00:46,240 and that's the server or app that is actually 14 00:00:46,240 --> 00:00:49,160 storing that password in some form long-term 15 00:00:49,160 --> 00:00:51,460 so that you can actually authenticate yourself-- 16 00:00:51,460 --> 00:00:53,980 that is, prove to this application or website 17 00:00:53,980 --> 00:00:56,110 that you are who you claim to be. 18 00:00:56,110 --> 00:01:00,010 Well, in the simplest form, perhaps these servers 19 00:01:00,010 --> 00:01:04,120 that are storing our usernames and passwords for which we have registered 20 00:01:04,120 --> 00:01:06,250 are maybe doing something very simple like this. 
21 00:01:06,250 --> 00:01:10,210 For instance, if a website or app has two users at the moment, at least, 22 00:01:10,210 --> 00:01:12,670 Alice and Bob, and suppose for simplicity 23 00:01:12,670 --> 00:01:16,760 that Alice's password is apple and Bob's password is banana, 24 00:01:16,760 --> 00:01:20,390 you could imagine that website or that app simply storing 25 00:01:20,390 --> 00:01:25,700 in a very simple text file these key value pairs-- 26 00:01:25,700 --> 00:01:28,370 username, colon, password, new line. 27 00:01:28,370 --> 00:01:30,980 Username, colon, password, new line. 28 00:01:30,980 --> 00:01:33,020 And in fact, that's actually very commonly 29 00:01:33,020 --> 00:01:36,110 how passwords are stored on systems, at least certain operating 30 00:01:36,110 --> 00:01:39,350 systems like Linux, not necessarily as simply as this. 31 00:01:39,350 --> 00:01:42,290 They often have a little more information off to the right there, 32 00:01:42,290 --> 00:01:44,700 but in essence, it's the username and password. 33 00:01:44,700 --> 00:01:48,980 But it wouldn't be a good thing to store the passwords exactly like this. 34 00:01:48,980 --> 00:01:49,550 Why? 35 00:01:49,550 --> 00:01:53,030 Well, suppose that this website or this app and its database 36 00:01:53,030 --> 00:01:54,920 are somehow hacked by an adversary. 37 00:01:54,920 --> 00:01:58,310 If someone gains access to that file containing these usernames 38 00:01:58,310 --> 00:02:00,440 and passwords, well, at that point, they literally 39 00:02:00,440 --> 00:02:02,810 have everyone's username and password. 
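The plaintext layout described here, one `username:password` pair per line, might be sketched as follows. This is only an illustration of the scheme being warned against; the function name `parse_credentials` and the sample file contents are assumptions made for the example.

```python
def parse_credentials(text):
    """Parse 'username:password' lines into a dictionary."""
    credentials = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so passwords may contain colons
        username, password = line.split(":", 1)
        credentials[username] = password
    return credentials


# Hypothetical contents of the plaintext file
PLAINTEXT_FILE = "alice:apple\nbob:banana\n"

users = parse_credentials(PLAINTEXT_FILE)
# Anyone who can read the file can read every password directly
print(users)
```

The point of the sketch is how little work the adversary has to do: reading the file is the whole attack.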
40 00:02:02,810 --> 00:02:05,270 And we talked last time about attacks like credential 41 00:02:05,270 --> 00:02:09,620 stuffing whereby an adversary, once they know your username and password on one 42 00:02:09,620 --> 00:02:12,050 system, they can try stuffing that username 43 00:02:12,050 --> 00:02:14,240 and password into other systems, other websites, 44 00:02:14,240 --> 00:02:18,620 other apps just in hopes that you are, unfortunately, using the same username 45 00:02:18,620 --> 00:02:21,000 and password elsewhere as well. 46 00:02:21,000 --> 00:02:23,810 So this is generally not a good thing if an adversary gets access 47 00:02:23,810 --> 00:02:26,360 to everyone's usernames and passwords. 48 00:02:26,360 --> 00:02:30,110 And even though, of course, in an ideal world, that would never happen, 49 00:02:30,110 --> 00:02:32,480 we should probably, as the administrators, 50 00:02:32,480 --> 00:02:34,760 as the creators of this website or app, we 51 00:02:34,760 --> 00:02:37,790 should probably do everything we can to at least minimize 52 00:02:37,790 --> 00:02:42,260 the fallout, the downsides, the damages that might result if, 53 00:02:42,260 --> 00:02:47,880 and daresay, when our database or this text file here are somehow compromised. 54 00:02:47,880 --> 00:02:49,880 So how might we go about doing that here? 55 00:02:49,880 --> 00:02:53,850 Rather than just storing apple and banana in clear text, 56 00:02:53,850 --> 00:02:57,020 so to speak, literally in the English words themselves, 57 00:02:57,020 --> 00:03:00,263 why don't we go ahead and employ a technique known as hashing? 58 00:03:00,263 --> 00:03:02,180 Now if you've studied computer science before, 59 00:03:02,180 --> 00:03:06,020 you might actually know this phrase in the context of hash tables and data 60 00:03:06,020 --> 00:03:06,650 structures. 
61 00:03:06,650 --> 00:03:11,275 Well, it turns out the idea in this world of securing data is very similar, 62 00:03:11,275 --> 00:03:14,150 and in fact, this is a technique that's incredibly common for solving 63 00:03:14,150 --> 00:03:15,560 all sorts of problems. 64 00:03:15,560 --> 00:03:17,780 Well, what do we mean by hashing in this context? 65 00:03:17,780 --> 00:03:20,960 Hashing is the process of taking a password as input 66 00:03:20,960 --> 00:03:25,250 and somehow converting it to a so-called hash or hash value. 67 00:03:25,250 --> 00:03:27,980 Now these hash values don't look like English. 68 00:03:27,980 --> 00:03:32,240 They're typically strings of text that might have letters, might have numbers, 69 00:03:32,240 --> 00:03:35,060 but they're generally of some fixed length typically. 70 00:03:35,060 --> 00:03:39,170 And in this case here, when we go about taking our password as input, 71 00:03:39,170 --> 00:03:43,580 converting it somehow via an algorithm or some code that we wrote, 72 00:03:43,580 --> 00:03:45,680 we want to convert it into this hash value 73 00:03:45,680 --> 00:03:50,400 and then store that hash value in that database of passwords instead. 74 00:03:50,400 --> 00:03:53,450 So here's a proverbial black box, and let's stipulate for the moment 75 00:03:53,450 --> 00:03:57,710 that I have no idea how hashing works, but I do know that this box can do it. 76 00:03:57,710 --> 00:03:59,630 So how do I think about this process? 77 00:03:59,630 --> 00:04:03,500 Well generally speaking, there's going to be some input to this box. 78 00:04:03,500 --> 00:04:06,630 Ultimately, I want to get some output from that box. 79 00:04:06,630 --> 00:04:11,150 And what this box really represents is, in fact, a hash function. 
80 00:04:11,150 --> 00:04:14,060 You can think of this as a device like some kind of machine; 81 00:04:14,060 --> 00:04:16,970 you can think of it like a program, some piece of software; 82 00:04:16,970 --> 00:04:20,560 or you can even think about it as a mathematical function that operates 83 00:04:20,560 --> 00:04:22,760 simply on numbers coming in as input. 84 00:04:22,760 --> 00:04:24,610 In fact, if you're mathematically inclined, 85 00:04:24,610 --> 00:04:28,540 though we won't use this syntax often, you can think of that hash function 86 00:04:28,540 --> 00:04:33,250 as being represented by f, you can think of the input as being represented by x, 87 00:04:33,250 --> 00:04:37,678 and you can think of the output of this process as being so-called f of x. 88 00:04:37,678 --> 00:04:39,970 If you're not familiar with that notation, that's fine, 89 00:04:39,970 --> 00:04:43,750 but this directly connects hashing to basic mathematics 90 00:04:43,750 --> 00:04:45,880 as well that you might encounter before long. 91 00:04:45,880 --> 00:04:49,930 But what we care about is passing into this black box a password 92 00:04:49,930 --> 00:04:53,440 and getting out a hash, and then storing that hash and not 93 00:04:53,440 --> 00:04:57,670 the password in our database or text file of usernames and passwords. 94 00:04:57,670 --> 00:04:59,600 So how might we go about doing this? 95 00:04:59,600 --> 00:05:03,760 Well, if I were to provide apple as an input to this hash function, 96 00:05:03,760 --> 00:05:06,910 let's think about the simplest hash function possible 97 00:05:06,910 --> 00:05:12,100 that doesn't output apple, but some representation of apple 98 00:05:12,100 --> 00:05:14,325 that I can eventually store in that database. 
99 00:05:14,325 --> 00:05:17,450 So I'm going to propose very simply that maybe the simplest hash function 100 00:05:17,450 --> 00:05:20,330 we can come up with-- and indeed, if you've studied computer science 101 00:05:20,330 --> 00:05:23,000 or taken CS50 itself, you might recall that we 102 00:05:23,000 --> 00:05:27,300 can hash our inputs based on specific letters therein. 103 00:05:27,300 --> 00:05:29,480 So apple starts with A. So you know what? 104 00:05:29,480 --> 00:05:31,490 A is the first letter of the English alphabet. 105 00:05:31,490 --> 00:05:33,470 So I'm going to create a hash function here 106 00:05:33,470 --> 00:05:38,360 pictorially that outputs 1 whenever the input happens to start with an A, 107 00:05:38,360 --> 00:05:39,530 as does apple. 108 00:05:39,530 --> 00:05:41,990 Meanwhile, if we pass in banana, I'm going 109 00:05:41,990 --> 00:05:44,540 to have this hash function output 2 because B 110 00:05:44,540 --> 00:05:46,670 is the second letter of the English alphabet. 111 00:05:46,670 --> 00:05:51,500 And dot-dot-dot, we might get to cherry or other passwords as well that might 112 00:05:51,500 --> 00:05:53,240 output 3 and beyond. 113 00:05:53,240 --> 00:05:56,870 And you could imagine doing this for all letters of the English alphabet. 114 00:05:56,870 --> 00:05:59,870 Now unfortunately, this isn't the best hash function 115 00:05:59,870 --> 00:06:01,640 because it's fairly simplistic. 116 00:06:01,640 --> 00:06:05,180 And in fact, I can quickly think of some other fruits like avocados 117 00:06:05,180 --> 00:06:08,893 that also start with A and that would give me the same hash value. 
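The first-letter hash function described above can be sketched in a few lines of Python. The name `first_letter_hash` is an assumption for illustration, and this toy function is useful only for seeing the idea and its weakness, not for protecting anything.

```python
import string


def first_letter_hash(password):
    """Toy hash: position of the first letter in the English alphabet (a = 1)."""
    first = password[:1].lower()
    if len(first) != 1 or first not in string.ascii_lowercase:
        raise ValueError("password must start with an English letter")
    return string.ascii_lowercase.index(first) + 1


for word in ["apple", "banana", "cherry", "avocado"]:
    print(word, first_letter_hash(word))
```

Both apple and avocado map to the same value, which is the ambiguity noted above: two different inputs can share one output.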
118 00:06:08,893 --> 00:06:11,060 And that's actually a characteristic we'll come back 119 00:06:11,060 --> 00:06:15,830 to whereby when you hash values, there can actually be ambiguities, 120 00:06:15,830 --> 00:06:20,580 potentially, whereby two inputs might actually have the same output, 121 00:06:20,580 --> 00:06:23,800 and we'll consider eventually what the implications of that might be. 122 00:06:23,800 --> 00:06:26,640 But for now, I dare say that's a little too simplistic. 123 00:06:26,640 --> 00:06:30,570 And what might be better than outputting 1 or 2 or 3 124 00:06:30,570 --> 00:06:34,320 is a little something more cryptic, because that's just too helpful. 125 00:06:34,320 --> 00:06:35,470 That's too much of a hint. 126 00:06:35,470 --> 00:06:38,280 If I see that your hash value is 1, I at least 127 00:06:38,280 --> 00:06:42,330 know that your password now clearly starts with an A, which means at best, 128 00:06:42,330 --> 00:06:46,570 I can do 1/26th the amount of work to figure out what it actually is. 129 00:06:46,570 --> 00:06:50,610 So we want these hashes generally to be a little weird-looking and really 130 00:06:50,610 --> 00:06:53,440 unguessable and not leak any information. 131 00:06:53,440 --> 00:06:57,960 So for instance, a very common older hash function for apple might actually 132 00:06:57,960 --> 00:07:05,580 output this-- ..ekWXa83dhiA with some mixed uppercase and lowercase letters 133 00:07:05,580 --> 00:07:06,180 therein. 134 00:07:06,180 --> 00:07:10,180 Now it looks weird, you probably can't and shouldn't see any kind of pattern 135 00:07:10,180 --> 00:07:10,680 in there. 136 00:07:10,680 --> 00:07:14,490 There is a fancy math formula that took as input apple 137 00:07:14,490 --> 00:07:18,420 and outputted as its hash value that string of text 138 00:07:18,420 --> 00:07:22,170 there, but in and of itself, it doesn't really leak any information 139 00:07:22,170 --> 00:07:24,270 like the number 1 or 2 or 3 would. 
140 00:07:24,270 --> 00:07:25,920 So we've already made an improvement. 141 00:07:25,920 --> 00:07:28,510 Banana, meanwhile, would look like this. 142 00:07:28,510 --> 00:07:31,180 And cherry, meanwhile, would look like that. 143 00:07:31,180 --> 00:07:34,240 So notice that these values are indeed quite different. 144 00:07:34,240 --> 00:07:37,320 So using this better hash function, I claim, that doesn't just 145 00:07:37,320 --> 00:07:39,060 look at the first letter of the alphabet, 146 00:07:39,060 --> 00:07:42,210 but looks at maybe all of the letters in the input-- 147 00:07:42,210 --> 00:07:46,380 C-H-E-R-R-Y in this case, we can probably come up with something more 148 00:07:46,380 --> 00:07:48,930 interesting, more cryptic-looking, if you will, 149 00:07:48,930 --> 00:07:50,550 like the values that we've just seen. 150 00:07:50,550 --> 00:07:54,750 So let me propose now that what we should do in our database of passwords 151 00:07:54,750 --> 00:07:59,400 is not store alice, apple, bob, banana, but let's instead 152 00:07:59,400 --> 00:08:03,400 store the hashes of apple and banana respectively. 153 00:08:03,400 --> 00:08:07,410 So instead in this password database, I'm going to store this instead. 154 00:08:07,410 --> 00:08:11,250 The exact same values that we just saw coming as outputs from that black box, 155 00:08:11,250 --> 00:08:13,920 but in this case now, I'm storing in my database 156 00:08:13,920 --> 00:08:18,900 of passwords usernames and hash values. 157 00:08:18,900 --> 00:08:21,260 Now why is this perhaps a good thing? 158 00:08:21,260 --> 00:08:23,750 Well, one, if someone now attacks this server 159 00:08:23,750 --> 00:08:28,220 and somehow gains access to all of these usernames and hashes, what they don't 160 00:08:28,220 --> 00:08:31,170 have is an entire list of passwords. 
161 00:08:31,170 --> 00:08:35,510 So they can't quite as easily go about credential stuffing and figuring out 162 00:08:35,510 --> 00:08:39,710 maybe if this database will give me access to my accounts somewhere else. 163 00:08:39,710 --> 00:08:42,289 I'm at least creating some work for the adversary. 164 00:08:42,289 --> 00:08:46,070 But at the same time, I feel like I've kind of broken the whole system 165 00:08:46,070 --> 00:08:49,610 because previously, presumably, when you log into a website or app 166 00:08:49,610 --> 00:08:53,150 and you type in your username and then you type in your password, what 167 00:08:53,150 --> 00:08:55,470 does the website or app probably do? 168 00:08:55,470 --> 00:08:58,190 Well, once that username and password are sent over the internet, 169 00:08:58,190 --> 00:09:00,950 typically to that server, well, the server probably 170 00:09:00,950 --> 00:09:04,760 compares what you typed in against the username in their database, 171 00:09:04,760 --> 00:09:07,640 or their text file, and the server compares 172 00:09:07,640 --> 00:09:10,790 what you typed in as your password against whatever 173 00:09:10,790 --> 00:09:12,410 password is in their database. 174 00:09:12,410 --> 00:09:13,790 But now we have a problem. 175 00:09:13,790 --> 00:09:16,940 We have you typing the username and we do have the username 176 00:09:16,940 --> 00:09:18,200 still in the database. 177 00:09:18,200 --> 00:09:22,030 Case in point, Alice and Bob are still here. 178 00:09:22,030 --> 00:09:25,290 But what we don't have is apple and banana. 179 00:09:25,290 --> 00:09:27,600 We've replaced those altogether with hashes. 
180 00:09:27,600 --> 00:09:30,840 So even if you type in-- or Alice types in apple, 181 00:09:30,840 --> 00:09:34,200 well we don't want to compare A-P-P-L-E to this because it obviously 182 00:09:34,200 --> 00:09:37,200 doesn't match; and Bob's banana, we don't want to compare against this 183 00:09:37,200 --> 00:09:40,240 because it's not going to match; and so forth. 184 00:09:40,240 --> 00:09:41,830 So what can we do? 185 00:09:41,830 --> 00:09:46,560 Well, the way authentication typically works on the server side 186 00:09:46,560 --> 00:09:49,180 when using hashing is as follows. 187 00:09:49,180 --> 00:09:52,920 When you first create an account or register for this website or app, 188 00:09:52,920 --> 00:09:58,020 you type in, if you're Alice, Alice, Enter, and then apple, for instance, 189 00:09:58,020 --> 00:09:58,800 Enter. 190 00:09:58,800 --> 00:10:02,890 That username, Alice, that password, apple, are sent to the server. 191 00:10:02,890 --> 00:10:06,270 But what the server does before saving the username and password 192 00:10:06,270 --> 00:10:10,300 is it runs that hash function on Alice's password, 193 00:10:10,300 --> 00:10:16,200 which is apple, converts it thereafter to this value, and stores Alice's 194 00:10:16,200 --> 00:10:21,570 username and the hash of Alice's password only and throws away apple, 195 00:10:21,570 --> 00:10:23,640 deletes it, forgets it in memory. 196 00:10:23,640 --> 00:10:25,600 What then happens next? 197 00:10:25,600 --> 00:10:29,400 Well, the next time Alice tries to log into this website-- 198 00:10:29,400 --> 00:10:33,270 maybe the next day, a week from then, a year from then for the second or third 199 00:10:33,270 --> 00:10:35,130 or more time, what happens? 
200 00:10:35,130 --> 00:10:38,010 Well, Alice types in Alice as her username, hopefully 201 00:10:38,010 --> 00:10:41,880 apple as her password, hits Enter, those get sent to the server as usual, 202 00:10:41,880 --> 00:10:44,100 and obviously the server can't just compare 203 00:10:44,100 --> 00:10:46,770 username against username and password against password 204 00:10:46,770 --> 00:10:49,890 because it doesn't have the password in its database, so 205 00:10:49,890 --> 00:10:51,090 what can the server do? 206 00:10:51,090 --> 00:10:53,910 The server can repeat the very same process, 207 00:10:53,910 --> 00:10:57,870 taking Alice's password as inputted, A-P-P-L-E, 208 00:10:57,870 --> 00:11:02,190 run it through the exact same hash function a day, a week, a year later, 209 00:11:02,190 --> 00:11:07,530 and then compare that resulting hash value to whatever is stored in this 210 00:11:07,530 --> 00:11:09,510 text file or database. 211 00:11:09,510 --> 00:11:13,860 And now admittedly, we're creating a whole lot more work for ourselves, 212 00:11:13,860 --> 00:11:16,770 but it's not that big a deal because this is just a math function, 213 00:11:16,770 --> 00:11:19,800 or if you know how to program, it's just a few lines of code 214 00:11:19,800 --> 00:11:23,610 that you've written in software that converts passwords to hash values. 215 00:11:23,610 --> 00:11:26,670 And honestly, nowadays, you wouldn't even be writing most of this code 216 00:11:26,670 --> 00:11:29,910 yourself, you'd be using a library, third-party code that someone 217 00:11:29,910 --> 00:11:32,850 else smarter than you, maybe, has written and gotten it just right, 218 00:11:32,850 --> 00:11:36,060 no bugs or mistakes, so you're just relying on someone else's code 219 00:11:36,060 --> 00:11:37,770 anyway to achieve this goal. 
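The register-then-log-in flow described above might look like the following minimal sketch. It assumes SHA-256 from Python's standard `hashlib` as a stand-in hash function, which is not the function shown in the lecture, and an in-memory dictionary named `database` in place of the text file; `register` and `login` are hypothetical names. Real systems generally prefer deliberately slow password-hashing functions over a fast general-purpose hash like this one.

```python
import hashlib
import hmac

# Stand-in for the text file or database of usernames and hashes
database = {}


def hash_password(password):
    """Convert a password to a fixed-length hash value (SHA-256, for illustration)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def register(username, password):
    """Store only the username and the hash; the password itself is not kept."""
    database[username] = hash_password(password)


def login(username, password):
    """Re-hash the submitted password and compare it with the stored hash."""
    stored = database.get(username)
    if stored is None:
        return False
    # compare_digest compares the two strings in constant time
    return hmac.compare_digest(stored, hash_password(password))


register("alice", "apple")
print(login("alice", "apple"))   # correct password
print(login("alice", "banana"))  # wrong password
```

The server never needs the original password again: it only ever compares hash against hash.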
220 00:11:37,770 --> 00:11:42,420 But the upside now, to be clear, is if this file is compromised somehow, 221 00:11:42,420 --> 00:11:45,570 the server's hacked into and this data is leaked, 222 00:11:45,570 --> 00:11:52,620 at least they only know the usernames on your system, not the actual passwords. 223 00:11:52,620 --> 00:11:55,620 And let me pause here and see if there's any questions on this technique 224 00:11:55,620 --> 00:12:01,220 of hashing for passwords specifically. 225 00:12:01,220 --> 00:12:04,430 STUDENT: You said yourself, we are using libraries 226 00:12:04,430 --> 00:12:09,920 more often than write the hash functions ourselves if we are not 227 00:12:09,920 --> 00:12:13,610 taking the course on CS50. 228 00:12:13,610 --> 00:12:17,180 So then it's easy to hack these hashes, right? 229 00:12:17,180 --> 00:12:20,705 Because we can go through 10, 40, I don't 230 00:12:20,705 --> 00:12:24,620 know, hash functions that are available in the libraries, 231 00:12:24,620 --> 00:12:29,392 and then you can reverse the hash results, is that right? 232 00:12:29,392 --> 00:12:30,350 DAVID J. MALAN: Almost. 233 00:12:30,350 --> 00:12:32,960 You can do exactly what you described first whereby 234 00:12:32,960 --> 00:12:37,940 you use the same library, the same code, to create hash values to then compare 235 00:12:37,940 --> 00:12:41,330 those against what's in the database, but generally, these hashes 236 00:12:41,330 --> 00:12:43,280 are not reversible, per se. 237 00:12:43,280 --> 00:12:46,130 You can compare them, but you can't reverse the process 238 00:12:46,130 --> 00:12:47,720 for reasons we'll come back to. 239 00:12:47,720 --> 00:12:49,310 But your intuition is right. 240 00:12:49,310 --> 00:12:51,710 And so really, the takeaway here is that we 241 00:12:51,710 --> 00:12:54,620 haven't made our system absolutely secure, 242 00:12:54,620 --> 00:12:56,940 we've made it relatively more secure. 243 00:12:56,940 --> 00:12:57,440 Why? 
244 00:12:57,440 --> 00:13:01,100 Because we've increased the cost to the adversary, to the hacker. 245 00:13:01,100 --> 00:13:05,540 They now have to do more work to figure out what the actual passwords are 246 00:13:05,540 --> 00:13:07,430 if they want to benefit from this hack. 247 00:13:07,430 --> 00:13:10,610 So again, it just raises the bar, it does not 248 00:13:10,610 --> 00:13:12,680 keep the adversary necessarily out or even 249 00:13:12,680 --> 00:13:15,210 stop them from figuring out one person's password, 250 00:13:15,210 --> 00:13:17,220 but it might take them a lot more time, it 251 00:13:17,220 --> 00:13:21,030 might take them a lot more resources like server or cloud costs or money, 252 00:13:21,030 --> 00:13:25,680 or it might even heighten the risk before they actually are successful. 253 00:13:25,680 --> 00:13:28,950 How about one other question here on hashing? 254 00:13:28,950 --> 00:13:32,430 STUDENT: If the password is intercepted before-- 255 00:13:32,430 --> 00:13:34,950 after the website is hacked and the password 256 00:13:34,950 --> 00:13:40,950 is intercepted before it's encrypted, so wouldn't that pose a problem? 257 00:13:40,950 --> 00:13:42,660 DAVID J. MALAN: Yes, absolutely. 258 00:13:42,660 --> 00:13:44,040 Then all bets are off. 259 00:13:44,040 --> 00:13:45,840 Everything we just discussed is not useful 260 00:13:45,840 --> 00:13:49,320 at all if the adversary has actually intercepted the password 261 00:13:49,320 --> 00:13:50,758 before it has even been hashed. 262 00:13:50,758 --> 00:13:53,550 Now thankfully, there's going to be solutions to that problem, too, 263 00:13:53,550 --> 00:13:57,180 and we'll come to them today, but for now, focusing only on hashes, 264 00:13:57,180 --> 00:13:59,380 it solves one problem but not all. 265 00:13:59,380 --> 00:14:04,140 In fact, it turns out that those attacks we talked about last time with respect 266 00:14:04,140 --> 00:14:06,240 to our accounts are still possible. 
267 00:14:06,240 --> 00:14:09,360 You can still use a dictionary, for instance, of English words, 268 00:14:09,360 --> 00:14:12,240 or better yet, a dictionary of English fruits, 269 00:14:12,240 --> 00:14:18,150 and you could, one fruit at a time, run each of those values as input 270 00:14:18,150 --> 00:14:20,550 into the same hash function, the library or code 271 00:14:20,550 --> 00:14:22,560 that you're using to achieve this, and then 272 00:14:22,560 --> 00:14:25,860 that's going to give you one hash value after another. 273 00:14:25,860 --> 00:14:28,170 And you could compare each of those hash values 274 00:14:28,170 --> 00:14:31,890 against whatever is in the database or the file of passwords 275 00:14:31,890 --> 00:14:35,940 that you, the hacker in this story, might have actually stolen somehow. 276 00:14:35,940 --> 00:14:37,920 You have to do more work though, because it's 277 00:14:37,920 --> 00:14:41,700 no longer as simple as just comparing apple against apple and banana 278 00:14:41,700 --> 00:14:42,420 against banana. 279 00:14:42,420 --> 00:14:44,680 You actually have to do some work. 280 00:14:44,680 --> 00:14:46,860 You have to do some computational work. 281 00:14:46,860 --> 00:14:50,100 And if the file is only a few values, of course, not a big deal. 282 00:14:50,100 --> 00:14:54,150 If it's thousands or millions of rows, it might actually take a lot more 283 00:14:54,150 --> 00:14:55,948 time, energy, and effort. 284 00:14:55,948 --> 00:14:58,740 So again, we're just raising the bar, but not keeping the adversary 285 00:14:58,740 --> 00:15:00,010 out altogether. 286 00:15:00,010 --> 00:15:02,770 And even if you don't have a dictionary available, 287 00:15:02,770 --> 00:15:06,030 and even if the passwords are not all fruits in English, 288 00:15:06,030 --> 00:15:10,080 well, you can still, as the adversary, resort to brute-force attacks. 
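The dictionary attack described above, hashing one candidate word at a time and comparing against the stolen hashes, can be sketched as follows. SHA-256 is again an assumed stand-in for whatever hash function the server uses, and `dictionary_attack`, the stolen data, and the fruit list are all hypothetical.

```python
import hashlib


def hash_password(password):
    """Same stand-in hash function the server is assumed to use."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def dictionary_attack(stolen, wordlist):
    """Return {username: password} for each stolen hash matched by a candidate word."""
    cracked = {}
    for word in wordlist:
        candidate = hash_password(word)  # the computational work per guess
        for username, stored in stolen.items():
            if stored == candidate:
                cracked[username] = word
    return cracked


# Hypothetical stolen file of usernames and hashes
stolen = {"alice": hash_password("apple"), "bob": hash_password("banana")}
fruits = ["avocado", "apple", "banana", "cherry"]

print(dictionary_attack(stolen, fruits))
```

Each guess costs one hash computation, so the cost to the adversary grows with the size of the word list, which is the sense in which hashing raises the bar without removing the threat.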
289 00:15:10,080 --> 00:15:15,390 And you can try even the simplest of passwords like 0000 or maybe eight 290 00:15:15,390 --> 00:15:19,680 0's instead, and you can hash that and see what the resulting hash value is 291 00:15:19,680 --> 00:15:22,170 and compare that against what's in the database. 292 00:15:22,170 --> 00:15:28,620 Then you can try 00000001, hash that, compare that against 293 00:15:28,620 --> 00:15:31,380 what's in the database, and then move on to the next and the next, 294 00:15:31,380 --> 00:15:34,080 doing this not just for numbers, but for letters as well. 295 00:15:34,080 --> 00:15:38,400 A, A, A, A, A, A, A, A, A, hash that and compare. 296 00:15:38,400 --> 00:15:40,590 Eventually, apple will be on that list. 297 00:15:40,590 --> 00:15:42,682 Eventually, banana will be on that list. 298 00:15:42,682 --> 00:15:44,640 But there, too, the brute force attack is still 299 00:15:44,640 --> 00:15:46,120 going to take some amount of time. 300 00:15:46,120 --> 00:15:48,660 So it's just increasing the cost or the complexity 301 00:15:48,660 --> 00:15:51,810 for the adversary in this particular case. 302 00:15:51,810 --> 00:15:54,810 But there's yet another threat that's possible in the context now 303 00:15:54,810 --> 00:15:58,170 of the hashes, which is worth knowing about. 304 00:15:58,170 --> 00:16:00,690 There's a term of art known as a rainbow table, which 305 00:16:00,690 --> 00:16:05,040 is a very beautiful way of saying that adversaries in advance 306 00:16:05,040 --> 00:16:09,150 might have already hashed all possible English words in a dictionary. 307 00:16:09,150 --> 00:16:13,400 Adversaries might have already hashed all possible passwords of length 4 308 00:16:13,400 --> 00:16:16,410 or 5 or 6 or 7 or 8 or something else. 
309 00:16:16,410 --> 00:16:18,630 And maybe if they have a big enough hard drive, 310 00:16:18,630 --> 00:16:21,510 they are storing a big table, like an Excel file 311 00:16:21,510 --> 00:16:25,950 or a CSV file of all of the words that they've tried, all of the passwords 312 00:16:25,950 --> 00:16:30,090 they've tried, and all of the hash values they've already computed. 313 00:16:30,090 --> 00:16:31,320 Then it's even easier. 314 00:16:31,320 --> 00:16:34,260 Then they don't even need to do a brute-force attack, per se, 315 00:16:34,260 --> 00:16:36,480 hashing and hashing and hashing and hashing. 316 00:16:36,480 --> 00:16:38,880 Then they can just compare, compare, compare. 317 00:16:38,880 --> 00:16:41,490 Because indeed, a rainbow table simply contains 318 00:16:41,490 --> 00:16:46,110 all of the passwords they've tried, all of the hash values they've generated, 319 00:16:46,110 --> 00:16:48,510 and so they just compare left to right whatever 320 00:16:48,510 --> 00:16:52,110 the user typed in against the hash value they've already computed. 321 00:16:52,110 --> 00:16:56,430 Now for certain hash functions, this threat of a rainbow table 322 00:16:56,430 --> 00:16:57,690 is just not feasible. 323 00:16:57,690 --> 00:17:03,060 You might need terabytes or petabytes of data, which means a lot of hard drives 324 00:17:03,060 --> 00:17:06,630 and a lot of money, so there are potential downward pressures 325 00:17:06,630 --> 00:17:09,690 on this kind of an attack, but it can certainly speed things up. 326 00:17:09,690 --> 00:17:12,060 Certainly if you're pre-computing-- that is, 327 00:17:12,060 --> 00:17:14,930 pre-calculating some of the hashes for at least words 328 00:17:14,930 --> 00:17:17,900 in an English dictionary, and certainly some short list like all 329 00:17:17,900 --> 00:17:19,890 of the fruits in the world. 
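The precomputed table described above can be sketched as a dictionary that maps each hash value back to the password that produced it. This is a simplified lookup table under stated assumptions: SHA-256 stands in for the real hash function, `build_table` is a hypothetical name, and the candidate space is limited to four-digit codes. Rainbow tables in the stricter technical sense use chains of hashes and reduction functions to save storage, which this sketch does not attempt.

```python
import hashlib
import string
from itertools import product


def hash_password(password):
    """Stand-in hash function assumed to match the server's."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def build_table(alphabet, length):
    """Precompute hash -> password for every string of the given length."""
    table = {}
    for chars in product(alphabet, repeat=length):
        candidate = "".join(chars)
        table[hash_password(candidate)] = candidate
    return table


# All 10,000 four-digit codes, hashed once in advance
table = build_table(string.digits, 4)

# Later, cracking a stolen hash is a single lookup, with no hashing at all
stolen_hash = hash_password("1234")
print(table.get(stolen_hash))
```

The table grows with the number of candidates, which is why storage becomes the limiting cost for longer passwords and larger alphabets.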
330 00:17:19,890 --> 00:17:21,980 But there's another problem that we might 331 00:17:21,980 --> 00:17:25,010 encounter on the server with regard to our passwords. 332 00:17:25,010 --> 00:17:28,369 Alice might have a password of apple, Bob might have a password of banana, 333 00:17:28,369 --> 00:17:34,460 but suppose that both Carol and Charlie have a password of cherry. 334 00:17:34,460 --> 00:17:38,030 And just by coincidence, they both chose the same password 335 00:17:38,030 --> 00:17:39,940 and are in this same database. 336 00:17:39,940 --> 00:17:42,950 Now we've already concluded, I think, that we definitely don't 337 00:17:42,950 --> 00:17:45,380 want to store the plaintext passwords. 338 00:17:45,380 --> 00:17:50,030 We don't want to store literally in the clear apple, banana, cherry, and cherry 339 00:17:50,030 --> 00:17:53,690 because this is just too easy for the adversary to do bad things with it. 340 00:17:53,690 --> 00:17:56,390 So we at least want to hash this, but here's 341 00:17:56,390 --> 00:18:00,800 where hashing can leak information, so to speak. 342 00:18:00,800 --> 00:18:02,900 If I go ahead and use the same function I've 343 00:18:02,900 --> 00:18:06,500 been using to hash apple and banana and now cherry, 344 00:18:06,500 --> 00:18:12,620 what do you notice about Carol's and Charlie's hash values? 345 00:18:12,620 --> 00:18:17,180 Curiously, but maybe not surprisingly, they're exactly the same. 346 00:18:17,180 --> 00:18:19,620 That's, after all, how functions typically work, 347 00:18:19,620 --> 00:18:21,740 be it in math or in software, in code. 348 00:18:21,740 --> 00:18:25,370 If you pass the exact same input, unless there's some randomness going on, 349 00:18:25,370 --> 00:18:27,860 you're going to get the same output again and again. 350 00:18:27,860 --> 00:18:29,640 Now why is this a big deal? 
351 00:18:29,640 --> 00:18:33,020 Well, if some adversary attacks this database and gains 352 00:18:33,020 --> 00:18:36,380 access to all of these usernames and hashes, 353 00:18:36,380 --> 00:18:40,640 we have leaked information in the sense that the adversary, just 354 00:18:40,640 --> 00:18:43,760 by glancing at this file, knows that, OK, I 355 00:18:43,760 --> 00:18:46,860 don't know what Carol's password is or what Charlie's password is, 356 00:18:46,860 --> 00:18:50,090 but I know it's the same password, and that alone 357 00:18:50,090 --> 00:18:54,350 might be enough information to figure out with higher probability what it is. 358 00:18:54,350 --> 00:18:56,430 Maybe Carol and Charlie are related. 359 00:18:56,430 --> 00:19:01,100 So maybe you focus on words or numbers that are common to both of them. 360 00:19:01,100 --> 00:19:05,360 Maybe there's some information that's implied by this if they both are-- 361 00:19:05,360 --> 00:19:08,090 they both like the same TV shows, they both like the same movies. 362 00:19:08,090 --> 00:19:12,840 You can try to find, in your mind, maybe the intersection of information that 363 00:19:12,840 --> 00:19:15,810 might lead you, with higher probability, to figure out, 364 00:19:15,810 --> 00:19:20,220 without brute force, even, what Carol's password is and Charlie's password is. 365 00:19:20,220 --> 00:19:24,090 So this is a common problem, and we only have four users in this database. 366 00:19:24,090 --> 00:19:25,770 You can imagine having many more. 367 00:19:25,770 --> 00:19:28,468 Odds are, some of us are going to have the same username-- not 368 00:19:28,468 --> 00:19:31,260 the same username, some of us are going to have the same passwords. 369 00:19:31,260 --> 00:19:35,490 In fact, without raising your hands or admitting to this for the whole world 370 00:19:35,490 --> 00:19:43,530 to see, do any of you have a password of 1234 in some website or app? 371 00:19:43,530 --> 00:19:44,580 Maybe a little harder? 
372 00:19:44,580 --> 00:19:48,030 12345678? 373 00:19:48,030 --> 00:19:49,500 Something very simple like this. 374 00:19:49,500 --> 00:19:51,600 Maybe it's an account you don't really care about. 375 00:19:51,600 --> 00:19:54,420 Well, that's a perfect example of where, if you 376 00:19:54,420 --> 00:19:57,810 have an account on the same system as someone else here in the classroom, 377 00:19:57,810 --> 00:20:01,950 you're going to have, in that database, presumably, the same hash values, 378 00:20:01,950 --> 00:20:06,540 and that might be alone enough information to leak and increase 379 00:20:06,540 --> 00:20:09,330 the probability that you, and not Alice or Bob, 380 00:20:09,330 --> 00:20:12,250 are actually compromised with respect to your account. 381 00:20:12,250 --> 00:20:13,540 So how can we fix this? 382 00:20:13,540 --> 00:20:16,500 Well, it turns out there's another technique in the world of data 383 00:20:16,500 --> 00:20:20,252 that we can use to perturb this process. 384 00:20:20,252 --> 00:20:21,960 And you can think of it metaphorically as 385 00:20:21,960 --> 00:20:25,450 like sprinkling a little bit of salt on the hash function 386 00:20:25,450 --> 00:20:27,660 so as to change what its output is. 387 00:20:27,660 --> 00:20:31,090 It's not random, per se, but you are perturbing the output 388 00:20:31,090 --> 00:20:34,470 so that it's much less likely that two people with the same passwords 389 00:20:34,470 --> 00:20:36,970 are going to have the same hash value. 390 00:20:36,970 --> 00:20:38,190 So how does this work? 391 00:20:38,190 --> 00:20:42,370 In this case before, when we passed in cherry as our input, 392 00:20:42,370 --> 00:20:45,640 we got the same hash again and again. 393 00:20:45,640 --> 00:20:50,220 But let me propose that we modify our hash function to take two inputs now. 394 00:20:50,220 --> 00:20:55,260 Not just the password, but also a salt value, so to speak. 
395 00:20:55,260 --> 00:20:58,740 A little bit of a sprinkling of, in this case, just two characters-- 396 00:20:58,740 --> 00:21:01,690 two numbers, two letters, or a combination thereof. 397 00:21:01,690 --> 00:21:04,380 Now this hash function that I'm describing is still 398 00:21:04,380 --> 00:21:06,660 going to output a hash value, but notice, 399 00:21:06,660 --> 00:21:09,840 it's different from the one before, and even if you don't quite remember 400 00:21:09,840 --> 00:21:11,940 what it was before, it was not this. 401 00:21:11,940 --> 00:21:15,790 But worth noting is that in the output of this hash function 402 00:21:15,790 --> 00:21:18,130 now is the salt itself. 403 00:21:18,130 --> 00:21:22,650 So the salt isn't something that's meant to be private or secret or secure, it's 404 00:21:22,650 --> 00:21:26,430 just sprinkled in there to make sure that whatever hash value comes out 405 00:21:26,430 --> 00:21:29,820 of this black box is a little bit different than if you 406 00:21:29,820 --> 00:21:32,920 had put a different salt value instead. 407 00:21:32,920 --> 00:21:38,190 So for instance, suppose that for Carol and for Charlie, 408 00:21:38,190 --> 00:21:39,870 we use different salts. 409 00:21:39,870 --> 00:21:40,920 And that's the idea. 410 00:21:40,920 --> 00:21:43,140 Different users should have different salt values 411 00:21:43,140 --> 00:21:45,540 just in case they choose the same passwords. 412 00:21:45,540 --> 00:21:48,690 So instead of 50 and cherry, suppose that Charlie 413 00:21:48,690 --> 00:21:51,930 uses a salt value of, say, 49. 414 00:21:51,930 --> 00:21:55,140 49 is not a number that Charlie or you or me have to pick. 415 00:21:55,140 --> 00:21:57,150 This is all done by the server automatically, 416 00:21:57,150 --> 00:22:00,940 picking a random two characters like 4-9 or 5-0. 417 00:22:00,940 --> 00:22:02,190 But notice what just happened. 
418 00:22:02,190 --> 00:22:07,830 If I rewind to cherry with a salt of 50, this was the hash value, the first two 419 00:22:07,830 --> 00:22:10,770 characters of which are the salt. If, though, 420 00:22:10,770 --> 00:22:16,110 I change the salt from 50 to 49, the hash changes completely, 421 00:22:16,110 --> 00:22:19,800 and it prefixes it with now 49 instead of 50. 422 00:22:19,800 --> 00:22:24,390 This ensures that even if Carol and Charlie have the exact same password, 423 00:22:24,390 --> 00:22:28,380 there's no way I, the adversary, am going to know by looking at it. 424 00:22:28,380 --> 00:22:32,860 Because indeed, what ends up in the file now are these two values. 425 00:22:32,860 --> 00:22:37,110 One is prefixed with 50, one is prefixed with 49, the rest of the hash values 426 00:22:37,110 --> 00:22:39,310 clearly are completely different. 427 00:22:39,310 --> 00:22:42,720 So again, the upside of this approach, where the hash function 428 00:22:42,720 --> 00:22:46,050 takes two inputs, the password and a salt, 429 00:22:46,050 --> 00:22:51,570 and then outputs one hash value, is that we're not leaking information 430 00:22:51,570 --> 00:22:53,040 except-- 431 00:22:53,040 --> 00:22:55,020 except-- so there is a corner case-- 432 00:22:55,020 --> 00:22:58,470 if by chance, by bad luck, the system chooses 433 00:22:58,470 --> 00:23:01,410 the same salt for both Carol and Charlie, 434 00:23:01,410 --> 00:23:03,990 yes, there might still be information leaked. 435 00:23:03,990 --> 00:23:06,480 And honestly, that may very well happen if you've 436 00:23:06,480 --> 00:23:09,270 got thousands, millions of users, then you're 437 00:23:09,270 --> 00:23:11,550 going to run out of two-character possibilities, 438 00:23:11,550 --> 00:23:13,010 you're going to have to reuse salts. 439 00:23:13,010 --> 00:23:15,020 But the idea is that we're just trying to put 440 00:23:15,020 --> 00:23:20,240 downward pressure on the probability of being attacked successfully.
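The salting idea above can be sketched roughly as follows. This is a toy under stated assumptions, not the function shown on screen: SHA-256 over the salt followed by the password stands in for the unnamed two-input hash, the salt alphabet of letters and digits is assumed from "two numbers, two letters, or a combination thereof," and `new_salt` and `salted_hash` are hypothetical names. Plain SHA-256 is used only to show the effect of the salt; it is not a recommendation for storing real passwords.

```python
import hashlib
import secrets
import string

# Assumed salt alphabet: upper- and lowercase letters plus digits.
SALT_ALPHABET = string.ascii_letters + string.digits

def new_salt(length: int = 2) -> str:
    # The server, not the user, picks the salt at random.
    return "".join(secrets.choice(SALT_ALPHABET) for _ in range(length))

def salted_hash(password: str, salt: str) -> str:
    # Hash salt and password together, then store the salt as a prefix.
    digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    return salt + digest

carol = salted_hash("cherry", "50")
charlie = salted_hash("cherry", "49")

print(carol[:2], charlie[:2])    # 50 49
print(carol[2:] == charlie[2:])  # False
print(len(SALT_ALPHABET) ** 2)   # 3844 possible two-character salts
```

With only 3,844 two-character salts under this assumed alphabet, a system with thousands of users would have to repeat some, which is the corner case mentioned above.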
441 00:23:20,240 --> 00:23:23,120 We're trying to equivalently raise the bar to the adversary 442 00:23:23,120 --> 00:23:28,160 so that they are not as likely to gain access to my data or, in turn, 443 00:23:28,160 --> 00:23:29,360 my account. 444 00:23:29,360 --> 00:23:34,650 Questions now on salting or hashing itself? 445 00:23:34,650 --> 00:23:35,940 STUDENT: Oh, I'm curious. 446 00:23:35,940 --> 00:23:37,562 Where do we store the salt? 447 00:23:37,562 --> 00:23:39,520 DAVID J. MALAN: So where do you store the salt? 448 00:23:39,520 --> 00:23:43,470 The salt is actually stored in the hash value itself, 449 00:23:43,470 --> 00:23:46,590 according to this algorithm, in the first two characters. 450 00:23:46,590 --> 00:23:50,580 And the value of storing the salt in the first two characters of the hash 451 00:23:50,580 --> 00:23:51,550 is as follows. 452 00:23:51,550 --> 00:23:56,310 The next time Carol logs in, she types in her username, Carol, and hits Enter. 453 00:23:56,310 --> 00:23:59,880 The server now knows, OK, I'm expecting a password from Carol, 454 00:23:59,880 --> 00:24:01,320 let's see what she types in. 455 00:24:01,320 --> 00:24:04,080 Suppose that she types in correctly cherry. 456 00:24:04,080 --> 00:24:06,720 Now the system is not storing cherry, so it's not 457 00:24:06,720 --> 00:24:08,910 going to compare literally what Carol typed in, 458 00:24:08,910 --> 00:24:14,160 but it is going to hash cherry, but first, the system is going to check, 459 00:24:14,160 --> 00:24:17,010 what is Carol's hash-- what is Carol's salt? 460 00:24:17,010 --> 00:24:21,060 And it's going to infer as much by looking at Carol's hash value 461 00:24:21,060 --> 00:24:24,013 and looking only at the first two characters by convention. 
462 00:24:24,013 --> 00:24:27,180 Then what the server is going to do, it's going to take whatever Carol typed 463 00:24:27,180 --> 00:24:33,850 in, cherry, C-H-E-R-R-Y, it's going to pass in 50, 5-0, and then hopefully, 464 00:24:33,850 --> 00:24:36,730 it's going to get back to this same value here, 465 00:24:36,730 --> 00:24:38,480 this whole string in yellow. 466 00:24:38,480 --> 00:24:43,180 And if those are correct, then Carol will be considered authenticated. 467 00:24:43,180 --> 00:24:47,470 By contrast, if the username happens to be Charlie and Charlie hits Enter, 468 00:24:47,470 --> 00:24:50,560 then what the server is going to do is look at Charlie's hash value, 469 00:24:50,560 --> 00:24:53,680 grab the first two characters for Charlie's salt, 470 00:24:53,680 --> 00:24:57,340 use that salt and cherry as the input to the hash function, 471 00:24:57,340 --> 00:25:02,350 and hope that the result is Charlie's value, not Carol's. 472 00:25:02,350 --> 00:25:04,040 Really good question. 473 00:25:04,040 --> 00:25:07,120 Other questions on salting or hashing? 474 00:25:07,120 --> 00:25:09,790 STUDENT: Is there any sense in rehashing a password? 475 00:25:09,790 --> 00:25:14,770 So hashing it a first time to get a string, 476 00:25:14,770 --> 00:25:17,080 then rehashing it for a second string? 477 00:25:17,080 --> 00:25:19,210 Or it's just impractical? 478 00:25:19,210 --> 00:25:22,300 DAVID J. MALAN: No, you could certainly hash the value multiple times, 479 00:25:22,300 --> 00:25:25,430 but a good hash function should not require that of you. 
480 00:25:25,430 --> 00:25:28,330 Especially now, more recent modern hashes, one of which 481 00:25:28,330 --> 00:25:32,320 we'll look at in a moment, they should have sufficiently calculated 482 00:25:32,320 --> 00:25:36,580 and proven characteristics that allow you to hash it just once 483 00:25:36,580 --> 00:25:39,430 and you will get a seemingly random string 484 00:25:39,430 --> 00:25:41,800 that represents whatever that input is. 485 00:25:41,800 --> 00:25:44,320 And here, too, is where I should emphasize 486 00:25:44,320 --> 00:25:47,980 that when it comes to this world of hashing and salting 487 00:25:47,980 --> 00:25:51,670 and today's other topics ultimately, these are not wheels 488 00:25:51,670 --> 00:25:54,640 that you or I should be reinventing. 489 00:25:54,640 --> 00:25:57,880 Unless you are the researcher or the company that's actually 490 00:25:57,880 --> 00:26:02,320 developing the algorithm, stress-testing them, analyzing them theoretically 491 00:26:02,320 --> 00:26:05,890 and practically so often in industry or the real world, 492 00:26:05,890 --> 00:26:10,300 when people like you and me invent our own systems for storing information, 493 00:26:10,300 --> 00:26:13,000 we just haven't spent nearly as much time 494 00:26:13,000 --> 00:26:16,120 or we're just not nearly as sharp as some of the security researchers 495 00:26:16,120 --> 00:26:18,680 out there who have really given this some thought. 496 00:26:18,680 --> 00:26:22,510 So when it comes to all things security-- and let me get on my soapbox 497 00:26:22,510 --> 00:26:26,200 here and say, you and I should not be solving these problems unless it is 498 00:26:26,200 --> 00:26:29,710 your full-time job or calling in life. 499 00:26:29,710 --> 00:26:31,930 There's just too many corner cases unless you're 500 00:26:31,930 --> 00:26:35,000 collaborating with a smart team. 501 00:26:35,000 --> 00:26:35,500 All right. 
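The login check described in answer to the question about where the salt is stored, combined with the advice to rely on vetted functions rather than homemade ones, might look something like this in Python. It is a sketch under assumptions: PBKDF2 from the standard library is swapped in for the lecture's hash function, the record layout (salt, a dollar sign, then the digest) is invented for this example, and `SALT_BYTES`, `ITERATIONS`, `make_record`, and `check_password` are hypothetical names and values.

```python
import hashlib
import hmac
import secrets

SALT_BYTES = 16        # assumed salt size, longer than the lecture's two characters
ITERATIONS = 200_000   # assumed work factor for this sketch

def make_record(password: str) -> str:
    # Pick a fresh random salt per user and keep it in front of the digest.
    salt = secrets.token_bytes(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt.hex() + "$" + digest.hex()

def check_password(attempt: str, record: str) -> bool:
    # Read the salt back out of the stored record, rehash the attempt with it,
    # and compare the result against the stored digest.
    salt_hex, digest_hex = record.split("$")
    salt = bytes.fromhex(salt_hex)
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))

carol_record = make_record("cherry")
charlie_record = make_record("cherry")

print(check_password("cherry", carol_record))  # True
print(check_password("banana", carol_record))  # False
print(carol_record == charlie_record)          # False, because the salts differ
```

The comparison uses `hmac.compare_digest` rather than `==` so that the time taken does not depend on how many leading bytes match.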
502 00:26:35,500 --> 00:26:40,230 With that said, here is what hashes generally 503 00:26:40,230 --> 00:26:41,870 look like nowadays in practice. 504 00:26:41,870 --> 00:26:43,620 For the sake of discussion, I deliberately 505 00:26:43,620 --> 00:26:48,240 chose a fairly simple hash function that was using a fairly short salt, 506 00:26:48,240 --> 00:26:52,260 just two characters, and a fairly short hash value as output. 507 00:26:52,260 --> 00:26:57,540 Here, in a smaller font, no less, is how Alice's and Bob's and Carol's 508 00:26:57,540 --> 00:27:00,180 and Charlie's passwords would probably be 509 00:27:00,180 --> 00:27:03,720 stored nowadays using a more recent modern hash function 510 00:27:03,720 --> 00:27:06,900 that, notice, by the sheer length of the text on the screen, 511 00:27:06,900 --> 00:27:09,210 outputs a much larger value. 512 00:27:09,210 --> 00:27:12,030 If you're familiar from computer science with the notion of bits, 513 00:27:12,030 --> 00:27:15,420 0's and 1's that are used to store information in systems, 514 00:27:15,420 --> 00:27:19,350 these hash values use many more bits, many more 0's and 1's. 515 00:27:19,350 --> 00:27:23,910 You and I as humans are seeing them as alphabetical letters and as numbers, 516 00:27:23,910 --> 00:27:27,300 but underneath the hood, these are just more and more 0's and 1's 517 00:27:27,300 --> 00:27:31,470 that the computer is storing, which means it's much, much less likely 518 00:27:31,470 --> 00:27:33,840 that someone who steals this kind of file 519 00:27:33,840 --> 00:27:37,260 is going to be able to figure out efficiently what 520 00:27:37,260 --> 00:27:38,920 those original passwords were. 521 00:27:38,920 --> 00:27:41,500 And you can see, too, that for both Carol and Charlie, 522 00:27:41,500 --> 00:27:43,900 even though their passwords are still cherry, 523 00:27:43,900 --> 00:27:47,980 these two strings along the bottom look completely different.
524 00:27:47,980 --> 00:27:49,720 Except in one location here. 525 00:27:49,720 --> 00:27:52,720 It turns out that the scheme a lot of systems have adopted 526 00:27:52,720 --> 00:27:56,080 is that if you look between dollar signs at the beginning of what 527 00:27:56,080 --> 00:28:01,330 seems to be the hash value, you'll see a code like y or y or y or y 528 00:28:01,330 --> 00:28:03,520 or other numbers or letters as well. 529 00:28:03,520 --> 00:28:07,780 That's a little cheat sheet that tells the system exactly what hash function 530 00:28:07,780 --> 00:28:10,037 was used to generate the rest of it. 531 00:28:10,037 --> 00:28:12,370 And that's in the documentation that you can read online 532 00:28:12,370 --> 00:28:14,720 for any number of hash functions. 533 00:28:14,720 --> 00:28:18,850 So that's just to say, when you create an account on some new website or app, 534 00:28:18,850 --> 00:28:22,840 if they are doing things well in a manner consistent with best practices 535 00:28:22,840 --> 00:28:27,100 and they are being mindful of your security, they are probably in a file 536 00:28:27,100 --> 00:28:29,890 or in a database or some other mechanism storing 537 00:28:29,890 --> 00:28:34,780 values that look quite like these based on whatever password you actually 538 00:28:34,780 --> 00:28:35,605 typed in. 539 00:28:35,605 --> 00:28:38,980 In fact, just to give you a sense of how easy or difficult 540 00:28:38,980 --> 00:28:43,690 it might be to crack passwords-- that is, figure out what they are based only 541 00:28:43,690 --> 00:28:47,080 on these hashes, in the case of our first hash function 542 00:28:47,080 --> 00:28:50,470 whereby we had a fairly short hash value being outputted 543 00:28:50,470 --> 00:28:52,810 with or without the salt, turns out, there's 544 00:28:52,810 --> 00:28:56,480 18 quintillion possible hash values. 545 00:28:56,480 --> 00:28:57,640 Now that's a lot. 
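The dollar-sign convention described above, where a short code between dollar signs at the start of the stored value names the hash function, can be read with a small helper. This is a sketch: `scheme_id` is a hypothetical name, and the example string is a made-up placeholder rather than a real hash or a real scheme code.

```python
def scheme_id(stored: str) -> str:
    # Expect a leading "$", then the identifier, then another "$".
    if not stored.startswith("$"):
        raise ValueError("no scheme prefix")
    parts = stored.split("$")
    # parts[0] is the empty string before the first "$".
    if len(parts) < 3 or not parts[1]:
        raise ValueError("malformed scheme prefix")
    return parts[1]

example = "$id$examplesalt$examplehash"  # made-up placeholder, not a real hash
print(scheme_id(example))  # id
```

A system would look the returned identifier up in its documentation or configuration to decide which hash function to run on the next login attempt.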
546 00:28:57,640 --> 00:29:00,580 That's bigger than last time's quadrillion value. 547 00:29:00,580 --> 00:29:04,880 But, with enough time, enough money, and enough cloud computing, 548 00:29:04,880 --> 00:29:07,810 those early hash functions can be broken. 549 00:29:07,810 --> 00:29:10,060 That is, with enough time and energy, you can probably 550 00:29:10,060 --> 00:29:11,860 figure out what someone's password is. 551 00:29:11,860 --> 00:29:14,650 If you fast forward to the other strings that I showed you 552 00:29:14,650 --> 00:29:18,080 on the screen, the much longer ones that use more bits, so to speak, 553 00:29:18,080 --> 00:29:22,510 then you have this many possible hash values nowadays. 554 00:29:22,510 --> 00:29:25,000 And I actually did look up how to pronounce this, 555 00:29:25,000 --> 00:29:28,300 but based on reading it on my screen, I wasn't actually sure 556 00:29:28,300 --> 00:29:31,915 how to say the word since this is a really big number that my mathematician 557 00:29:31,915 --> 00:29:33,790 colleagues could do a better job pronouncing. 558 00:29:33,790 --> 00:29:36,040 But given how many digits are on the screen, 559 00:29:36,040 --> 00:29:38,350 given how many commas are on the screen here, 560 00:29:38,350 --> 00:29:40,360 this is a really big number such that you 561 00:29:40,360 --> 00:29:45,730 and I probably don't need to worry about an adversary using brute force 562 00:29:45,730 --> 00:29:49,660 and being able to figure out by the end of time 563 00:29:49,660 --> 00:29:52,660 what the corresponding password might be unless there 564 00:29:52,660 --> 00:29:55,510 are other weaknesses in the system. 565 00:29:55,510 --> 00:29:57,370 Now speaking of weaknesses. 566 00:29:57,370 --> 00:30:00,130 Has anyone ever forgotten your password? 567 00:30:00,130 --> 00:30:00,980 Yes, of course.
568 00:30:00,980 --> 00:30:04,630 But have you ever gone to a website or app, clicked that link that says, 569 00:30:04,630 --> 00:30:09,130 Forgot Password, question mark, in hopes of getting an email of some sort 570 00:30:09,130 --> 00:30:10,840 so that you can reset the password? 571 00:30:10,840 --> 00:30:13,840 I mean, odds are, almost everyone here has experienced that. 572 00:30:13,840 --> 00:30:17,830 But has anyone ever clicked on that link, gotten back 573 00:30:17,830 --> 00:30:22,030 an email that actually contains your password 574 00:30:22,030 --> 00:30:24,520 so that you're just immediately reminded what it is? 575 00:30:24,520 --> 00:30:26,110 I'm seeing a few nods of the head. 576 00:30:26,110 --> 00:30:28,520 You can copy-paste it, then, into the website. 577 00:30:28,520 --> 00:30:30,880 Do not use that website anymore. 578 00:30:30,880 --> 00:30:36,250 That is evidence of-- that is a symptom of a website or application not 579 00:30:36,250 --> 00:30:38,630 practicing best practices. 580 00:30:38,630 --> 00:30:39,160 Why? 581 00:30:39,160 --> 00:30:43,420 Well, if it is the case that the website can email you your password, 582 00:30:43,420 --> 00:30:46,930 that means they can see and they know what your password is. 583 00:30:46,930 --> 00:30:49,270 That means this database, this text file we've 584 00:30:49,270 --> 00:30:52,270 been talking about is probably vulnerable to some hacker 585 00:30:52,270 --> 00:30:55,360 eventually getting into it and stealing all of those usernames 586 00:30:55,360 --> 00:30:57,730 and passwords in the clear, no less. 587 00:30:57,730 --> 00:30:59,830 Because recall what these hashes are. 588 00:30:59,830 --> 00:31:02,230 They're generally meant to be irreversible. 
589 00:31:02,230 --> 00:31:04,780 When you take as input apple, banana, and cherry, 590 00:31:04,780 --> 00:31:08,800 the output looks completely different with no obvious relationship to what 591 00:31:08,800 --> 00:31:11,590 those original passwords actually were. 592 00:31:11,590 --> 00:31:14,170 And so if that's what's being stored in the database, 593 00:31:14,170 --> 00:31:18,610 the company who made that website, the person who made that website or app, 594 00:31:18,610 --> 00:31:21,370 they should not be able to reverse that process either, 595 00:31:21,370 --> 00:31:23,470 otherwise surely, the adversary can. 596 00:31:23,470 --> 00:31:27,100 So it is the case, and I've experienced this myself, often 597 00:31:27,100 --> 00:31:30,730 from smaller shops or companies that maybe haven't really 598 00:31:30,730 --> 00:31:33,310 invested a lot of time or care into their website, 599 00:31:33,310 --> 00:31:37,120 if they are able to email you your original password, 600 00:31:37,120 --> 00:31:39,670 it is, by definition, not secure. 601 00:31:39,670 --> 00:31:43,120 And it's certainly not up to today's standards, it's just too easy 602 00:31:43,120 --> 00:31:44,480 for it to be compromised. 603 00:31:44,480 --> 00:31:46,752 So maybe minimally stop using that service 604 00:31:46,752 --> 00:31:49,210 and make sure you're not using that password anywhere else. 605 00:31:49,210 --> 00:31:52,840 Maximally, maybe send them a note explaining your concern 606 00:31:52,840 --> 00:31:55,420 and maybe linking them to some reference online-- 607 00:31:55,420 --> 00:32:01,270 maybe this video-- in which you explain why you have that concern. 608 00:32:01,270 --> 00:32:07,630 Questions, then, on forgetting passwords or hashing or salting? 609 00:32:07,630 --> 00:32:13,190 STUDENT: So as you said, some companies may not be practicing these hashes 610 00:32:13,190 --> 00:32:15,790 and maybe practicing something very bad. 
611 00:32:15,790 --> 00:32:19,870 So if I were, let's say, a company and I-- 612 00:32:19,870 --> 00:32:24,670 because of my practices, I had a leak of passwords and all the data, 613 00:32:24,670 --> 00:32:30,400 do I as a company have any obligations or responsibility for what 614 00:32:30,400 --> 00:32:35,320 happened since I have all the customer's data and all their passwords, 615 00:32:35,320 --> 00:32:38,982 do I have any obligations or responsibilities? 616 00:32:38,982 --> 00:32:41,190 DAVID J. MALAN: It's a really good, a noble question. 617 00:32:41,190 --> 00:32:45,000 The answer to that ethically is probably yes you should, quite simply. 618 00:32:45,000 --> 00:32:48,000 However, the more nuanced answer is that it's probably 619 00:32:48,000 --> 00:32:51,780 going to depend on the industry that you're in, the country that you're in, 620 00:32:51,780 --> 00:32:55,380 any regulatory requirements that your company faces which might 621 00:32:55,380 --> 00:32:58,420 oblige you to report out in that way. 622 00:32:58,420 --> 00:33:02,220 So I would read up on the context that's specific to you yourself. 623 00:33:02,220 --> 00:33:08,100 And I will say, unfortunately, it is not that common in the world, I dare say, 624 00:33:08,100 --> 00:33:11,370 that companies document and detail publicly 625 00:33:11,370 --> 00:33:13,638 when there have been security exploits. 626 00:33:13,638 --> 00:33:15,930 They might announce that something indeed has happened, 627 00:33:15,930 --> 00:33:19,140 but it is rare that companies will go into any amount of detail. 
628 00:33:19,140 --> 00:33:22,800 Now this is understandable because, one, they're already embarrassed, 629 00:33:22,800 --> 00:33:26,520 or if not in legal trouble or financial trouble because that has happened 630 00:33:26,520 --> 00:33:29,370 already, but they probably, typically, don't 631 00:33:29,370 --> 00:33:33,960 want to provide other adversaries-- other future attackers-- with more 632 00:33:33,960 --> 00:33:38,490 information about their systems and the weaknesses that those systems have. 633 00:33:38,490 --> 00:33:42,360 The downside, of course, societally, is that if each of us 634 00:33:42,360 --> 00:33:45,420 is secretly getting attacked in ways we didn't 635 00:33:45,420 --> 00:33:49,350 expect, learning things that would be ideal to share 636 00:33:49,350 --> 00:33:51,160 with others in the world. 637 00:33:51,160 --> 00:33:54,420 This itself is actually a big question in the world of cybersecurity, 638 00:33:54,420 --> 00:33:57,000 just how much and how often to share, especially when 639 00:33:57,000 --> 00:33:59,400 you discover a bug or a mistake in someone's system, 640 00:33:59,400 --> 00:34:02,130 do you tell them privately, do you tell the world publicly? 641 00:34:02,130 --> 00:34:06,540 These are ethical questions that we'll touch on indeed in the coming days 642 00:34:06,540 --> 00:34:07,880 as well. 643 00:34:07,880 --> 00:34:12,568 Allow me to propose that separate from these concerns here, 644 00:34:12,568 --> 00:34:14,610 we can come back to some of those recommendations 645 00:34:14,610 --> 00:34:17,940 that we started the class with from this, the National Institute 646 00:34:17,940 --> 00:34:19,380 for Standards and Technology. 647 00:34:19,380 --> 00:34:22,800 Notice that this was one other quote we did not share last time. 
648 00:34:22,800 --> 00:34:25,230 A recommendation from NIST is that "Verifiers 649 00:34:25,230 --> 00:34:28,110 shall store memorized secrets in the form that 650 00:34:28,110 --> 00:34:30,030 is resistant to offline attacks. 651 00:34:30,030 --> 00:34:33,690 Memorized secrets SHALL be salted and hashed 652 00:34:33,690 --> 00:34:36,960 using a suitable one-way key derivation function. 653 00:34:36,960 --> 00:34:41,199 Their purpose is to make each password guessing trial 654 00:34:41,199 --> 00:34:45,639 by an attacker who has obtained a password hash file expensive, 655 00:34:45,639 --> 00:34:48,400 and therefore, the cost of guessing attack-- 656 00:34:48,400 --> 00:34:51,610 of a guessing attack high or prohibitive." 657 00:34:51,610 --> 00:34:54,040 So when I refer to best practices, I'm really 658 00:34:54,040 --> 00:34:58,030 referring to actual documentation like this, either from the United States, 659 00:34:58,030 --> 00:35:00,370 from other countries, from other companies. 660 00:35:00,370 --> 00:35:03,940 There are indeed these best practices, and among our goals 661 00:35:03,940 --> 00:35:06,550 for this class is to expose you to some of those, 662 00:35:06,550 --> 00:35:10,630 both on the consumer side-- you and me as individual computer users, 663 00:35:10,630 --> 00:35:13,360 but also on the corporate or the academic side 664 00:35:13,360 --> 00:35:15,220 as well as to what you should be doing when 665 00:35:15,220 --> 00:35:19,000 you are in a position of being responsible for someone else's data as 666 00:35:19,000 --> 00:35:19,610 well. 667 00:35:19,610 --> 00:35:22,690 Now as for the actual hash functions to use nowadays, 668 00:35:22,690 --> 00:35:26,140 these are just some of them that are generally recommended nowadays 669 00:35:26,140 --> 00:35:29,170 that can be categorized as SHA-2 and SHA-3. 
670 00:35:29,170 --> 00:35:32,920 These refer to fairly sophisticated mathematical functions that 671 00:35:32,920 --> 00:35:36,910 take as input, typically, a password, or some input more generally, 672 00:35:36,910 --> 00:35:40,060 and then output a hash value thereof. 673 00:35:40,060 --> 00:35:43,060 There are other algorithms, too, that can even 674 00:35:43,060 --> 00:35:49,370 be used to verify the authenticity and integrity of messages as well. 675 00:35:49,370 --> 00:35:53,620 In fact, today, we'll also focus on how we can use primitives like these 676 00:35:53,620 --> 00:35:57,490 to ensure that data was not actually changed in transit when you sent it 677 00:35:57,490 --> 00:36:00,040 over the internet from one person to another. 678 00:36:00,040 --> 00:36:03,280 But ultimately, what we've been focusing on and what you've seen on this list 679 00:36:03,280 --> 00:36:07,330 here are what are generally known as one-way hash functions. 680 00:36:07,330 --> 00:36:09,790 That is, these are mathematical functions, 681 00:36:09,790 --> 00:36:12,370 or, in the context of programming, these are 682 00:36:12,370 --> 00:36:15,760 functions written in code, languages like Python 683 00:36:15,760 --> 00:36:20,020 or otherwise, that take as input a string of arbitrary length. 684 00:36:20,020 --> 00:36:23,590 That is, a password that's this long, maybe this long, maybe this long, 685 00:36:23,590 --> 00:36:25,900 but what's key to these cryptographic functions 686 00:36:25,900 --> 00:36:29,680 is they output a hash value of fixed length 687 00:36:29,680 --> 00:36:33,610 that is always this many bytes or characters or this many bytes 688 00:36:33,610 --> 00:36:34,330 or characters. 
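The fixed-length property is easy to observe with Python's standard `hashlib` module. The sketch below uses SHA-256 (a member of the SHA-2 family) and SHA3-256 (a member of SHA-3); `digest_lengths` is a hypothetical helper name, and the sample inputs are arbitrary.

```python
import hashlib

def digest_lengths(text: str) -> tuple[int, int]:
    # Return the output size in bytes for one SHA-2 and one SHA-3 function.
    data = text.encode("utf-8")
    return len(hashlib.sha256(data).digest()), len(hashlib.sha3_256(data).digest())

for text in ["a", "cherry", "a much longer passphrase " * 10]:
    # Input length varies; both digest lengths stay at 32 bytes.
    print(len(text), digest_lengths(text))
```

Each digest is 32 bytes, or 64 characters when written in hexadecimal, no matter how short or long the input is.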
689 00:36:34,330 --> 00:36:37,670 That is, it doesn't matter how short or how long the password is, 690 00:36:37,670 --> 00:36:42,740 these cryptographic, these one-way hash functions are one-way in the sense 691 00:36:42,740 --> 00:36:46,055 that they take a potentially infinite domain, if you 692 00:36:46,055 --> 00:36:51,500 know this term for mathematics, and condense it into a finite range. 693 00:36:51,500 --> 00:36:55,190 That is, a huge number of values, all possible passwords in the world, 694 00:36:55,190 --> 00:36:58,400 to just a finite list of possible hash values. 695 00:36:58,400 --> 00:37:01,130 It might be a long list of possible hash values, 696 00:37:01,130 --> 00:37:03,530 but indeed, no matter how long a string of text 697 00:37:03,530 --> 00:37:06,020 is, if it's of some fixed length-- 698 00:37:06,020 --> 00:37:08,870 16 characters, 32 characters, something else, 699 00:37:08,870 --> 00:37:11,880 there's only a finite number of those values. 700 00:37:11,880 --> 00:37:13,910 Now there's an implication of this. 701 00:37:13,910 --> 00:37:19,100 When you take a really large input space or domain mathematically 702 00:37:19,100 --> 00:37:24,200 and map it to a smaller finite range, so to speak, 703 00:37:24,200 --> 00:37:28,700 mathematically, it turns out that if you do try to reverse the process, 704 00:37:28,700 --> 00:37:33,750 there will be multiple inputs that yield the same output. 705 00:37:33,750 --> 00:37:34,970 Think about it this way. 706 00:37:34,970 --> 00:37:37,930 If you've got 100 possible passwords in the world, 707 00:37:37,930 --> 00:37:41,350 but you only have 10 possible hash values-- 708 00:37:41,350 --> 00:37:44,790 so 100 passwords, 10 hash values, you have 709 00:37:44,790 --> 00:37:49,320 to figure out how to put all of those passwords into 10 buckets, so to speak. 710 00:37:49,320 --> 00:37:52,870 So surely, some of those passwords are going to be in the same bucket. 
711 00:37:52,870 --> 00:37:54,870 Think about it in terms of the English alphabet. 712 00:37:54,870 --> 00:37:59,790 If we stuck with that original hash function where A was 1, B was 2, 713 00:37:59,790 --> 00:38:05,580 C was 3, presumably Z was 26, there's more than one fruit 714 00:38:05,580 --> 00:38:06,780 that starts with a-- 715 00:38:06,780 --> 00:38:09,040 apple, avocado, and so forth. 716 00:38:09,040 --> 00:38:13,020 So there, too, you are going to have multiple fruits mapping 717 00:38:13,020 --> 00:38:17,880 to the same finite range of values, hash values 1 through 26. 718 00:38:17,880 --> 00:38:21,780 What that means is that if an adversary, or even you, the owner of the system, 719 00:38:21,780 --> 00:38:24,270 look at that hash value and see the number 1, 720 00:38:24,270 --> 00:38:30,390 you don't know if the password was apple or avocado or some other word that 721 00:38:30,390 --> 00:38:34,260 started with A. And so that's what we mean by one-way hash functions. 722 00:38:34,260 --> 00:38:39,660 You cannot reliably reverse the process by any means and know definitively what 723 00:38:39,660 --> 00:38:41,340 the original input is. 724 00:38:41,340 --> 00:38:42,660 Now there is a catch. 725 00:38:42,660 --> 00:38:45,480 That technically means on some systems, it 726 00:38:45,480 --> 00:38:50,070 might be possible to log in with apple or avocado, or more 727 00:38:50,070 --> 00:38:54,960 generally, your actual password and some other seemingly random password that 728 00:38:54,960 --> 00:38:58,740 might make no sense to you, but just because mathematically it 729 00:38:58,740 --> 00:39:03,190 has the same hash value, that password, too, might let you into the system. 
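The first-letter example above can be written out directly to show why such a function cannot be reversed and why two different passwords can open the same account. This is a toy sketch of that deliberately weak function; `first_letter_hash` and `login_ok` are hypothetical names.

```python
def first_letter_hash(word: str) -> int:
    # A is 1, B is 2, ..., Z is 26, based only on the first letter.
    letter = word[0].upper()
    if not "A" <= letter <= "Z":
        raise ValueError("expected a word starting with an English letter")
    return ord(letter) - ord("A") + 1

def login_ok(attempt: str, stored_hash: int) -> bool:
    # The server only has the hash, so any word with the same hash is accepted.
    return first_letter_hash(attempt) == stored_hash

stored = first_letter_hash("apple")

print(stored)                       # 1
print(first_letter_hash("avocado")) # 1
print(login_ok("avocado", stored))  # True, a collision
print(login_ok("banana", stored))   # False
```

Seeing only the stored value 1, neither the adversary nor the system owner can tell whether the original password was apple, avocado, or any other word starting with A. Real cryptographic hash functions have so many possible outputs that finding such a second input is considered impractical.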
730 00:39:03,190 --> 00:39:07,500 But the idea is, especially as we're using really large numbers of bits, 731 00:39:07,500 --> 00:39:11,910 really long hash values, the probability of you or me figuring 732 00:39:11,910 --> 00:39:14,580 out or an adversary even guessing what that other hash 733 00:39:14,580 --> 00:39:17,430 value or what those other inputs-- 734 00:39:17,430 --> 00:39:22,328 passwords might be is just so small that we tend not to worry about it as well. 735 00:39:22,328 --> 00:39:24,120 The algorithms we've looked at on the board 736 00:39:24,120 --> 00:39:27,690 here are also known as cryptographic hash functions, which 737 00:39:27,690 --> 00:39:31,200 means they have utility in the world of cryptography 738 00:39:31,200 --> 00:39:34,920 where the world of cryptography is all about the practice and the study 739 00:39:34,920 --> 00:39:36,900 of securing data. 740 00:39:36,900 --> 00:39:41,400 Securing data while in transit from one point to another or while 741 00:39:41,400 --> 00:39:43,210 at rest on your own system. 742 00:39:43,210 --> 00:39:45,900 Let's go ahead here and take a five-minute break, 743 00:39:45,900 --> 00:39:50,130 and when we come back, we'll explore precisely that world of cryptography 744 00:39:50,130 --> 00:39:52,580 with respect to our data. 745 00:39:52,580 --> 00:39:53,570 All right. 746 00:39:53,570 --> 00:39:54,350 We're back. 747 00:39:54,350 --> 00:39:57,920 And indeed, cryptography is all about the practice and study 748 00:39:57,920 --> 00:40:01,610 of securing our data, particularly when we want to transmit it 749 00:40:01,610 --> 00:40:03,300 from one person to another. 750 00:40:03,300 --> 00:40:07,370 So cryptography can be broken down into a couple of different categories, one 751 00:40:07,370 --> 00:40:08,570 of which are codes. 752 00:40:08,570 --> 00:40:12,860 And codes are not the type of code that you might write in Python or the like. 
753 00:40:12,860 --> 00:40:15,260 It has nothing to do with software, but rather, 754 00:40:15,260 --> 00:40:17,720 a mapping between what we'll call code words 755 00:40:17,720 --> 00:40:21,980 and the actual message or true reading that those words represent. 756 00:40:21,980 --> 00:40:25,530 Here, for instance, is an actual book from over 100 years ago 757 00:40:25,530 --> 00:40:28,850 that was used to map these code words in the left column 758 00:40:28,850 --> 00:40:32,330 to these, indeed, messages or true readings on the right. 759 00:40:32,330 --> 00:40:35,180 The idea is, that if that one party wanted 760 00:40:35,180 --> 00:40:37,640 to send a secure message to another party, 761 00:40:37,640 --> 00:40:39,710 they wouldn't just write it out in plain English. 762 00:40:39,710 --> 00:40:40,210 Why? 763 00:40:40,210 --> 00:40:43,430 Because if that message, written on a piece of paper or parchment, 764 00:40:43,430 --> 00:40:46,460 were intercepted by another human, that other human, 765 00:40:46,460 --> 00:40:48,410 assuming they, too, know English, could just 766 00:40:48,410 --> 00:40:51,500 read the actual message, the so-called plaintext. 767 00:40:51,500 --> 00:40:55,890 In a code, though, you can convert the words 768 00:40:55,890 --> 00:41:00,870 that you want to say to code words that make no sense necessarily 769 00:41:00,870 --> 00:41:03,360 to someone who's intercepted the message in and of itself 770 00:41:03,360 --> 00:41:05,670 unless they, too, have this book. 771 00:41:05,670 --> 00:41:08,430 Now you can imagine this being a fairly time-consuming process 772 00:41:08,430 --> 00:41:12,180 because when the recipient receives that message, unless they've memorized 773 00:41:12,180 --> 00:41:15,120 all of these pages, these code words and the meanings thereof, 774 00:41:15,120 --> 00:41:19,530 they have to do quite a bit of work flipping through their copy of the book 775 00:41:19,530 --> 00:41:21,750 in order to figure out what that message is. 
776 00:41:21,750 --> 00:41:24,150 But the fact that they have a copy of the book, too, 777 00:41:24,150 --> 00:41:28,410 is a potential threat because if one party or another had their code 778 00:41:28,410 --> 00:41:33,450 book stolen, then any of the messages they've sent can now be decoded, 779 00:41:33,450 --> 00:41:36,180 so to speak, by looking them up retrospectively. 780 00:41:36,180 --> 00:41:38,910 And any future messages, if the owners of the book 781 00:41:38,910 --> 00:41:41,760 don't realize that code book has been taken, so, too, 782 00:41:41,760 --> 00:41:43,890 could those messages be translated. 783 00:41:43,890 --> 00:41:46,320 Not to mention the fact, it's fairly cumbersome. 784 00:41:46,320 --> 00:41:48,960 This alone is page 187. 785 00:41:48,960 --> 00:41:51,720 And so that's quite a bit of codes and quite a bit of work 786 00:41:51,720 --> 00:41:54,240 just to achieve this layer of indirection. 787 00:41:54,240 --> 00:41:57,090 But there are some terms of art here that are worth knowing, 788 00:41:57,090 --> 00:41:59,460 and you might actually use in everyday context, 789 00:41:59,460 --> 00:42:01,690 but not necessarily for the same purpose. 790 00:42:01,690 --> 00:42:03,910 So encode, what do we mean by that? 791 00:42:03,910 --> 00:42:06,420 It means taking a plaintext text message, 792 00:42:06,420 --> 00:42:10,230 be it in English or any human language, and taking that as input 793 00:42:10,230 --> 00:42:12,930 and producing as output codetext. 794 00:42:12,930 --> 00:42:15,660 So the codetext might be a short succinct 795 00:42:15,660 --> 00:42:19,020 sequence of words that might actually be English words, 796 00:42:19,020 --> 00:42:22,083 but they're not meant to mean what they normally mean. 797 00:42:22,083 --> 00:42:24,000 They're meant to be looked up in the code book 798 00:42:24,000 --> 00:42:27,160 to figure out what the message is actually trying to say. 
799 00:42:27,160 --> 00:42:29,910 Meanwhile, decode, as you might expect, is the opposite. 800 00:42:29,910 --> 00:42:33,810 You take as input the codetext that you have received as the recipient, 801 00:42:33,810 --> 00:42:38,010 you use that same code book to look up the code words 802 00:42:38,010 --> 00:42:41,100 and figure out what the actual message is in order 803 00:42:41,100 --> 00:42:45,420 to get the original plaintext, be it in English or any other human language 804 00:42:45,420 --> 00:42:47,320 that the code book is designed for. 805 00:42:47,320 --> 00:42:51,780 But there's an alternative to codes, if only because those code books can 806 00:42:51,780 --> 00:42:56,015 get very cumbersome indeed, they can be taken and compromised and the like. 807 00:42:56,015 --> 00:42:58,140 So it's not necessarily the best system in that you 808 00:42:58,140 --> 00:43:01,230 need to physically keep something like that secure, let alone 809 00:43:01,230 --> 00:43:03,210 do so efficiently when converting. 810 00:43:03,210 --> 00:43:05,850 So there are also what we'll call ciphers. 811 00:43:05,850 --> 00:43:09,220 And ciphers are more algorithmic in nature. 812 00:43:09,220 --> 00:43:11,910 So if you have taken a computer science or a programming course, 813 00:43:11,910 --> 00:43:16,320 you already have the predisposition to thinking algorithmically and taking 814 00:43:16,320 --> 00:43:20,040 a big problem and breaking it down into smaller pieces 815 00:43:20,040 --> 00:43:23,520 and then applying some kind of logic, sometimes again and again, 816 00:43:23,520 --> 00:43:25,300 in order to solve some problem. 817 00:43:25,300 --> 00:43:28,170 So ciphers focus on exactly that. 818 00:43:28,170 --> 00:43:30,780 They don't focus on maybe words or phrases. 819 00:43:30,780 --> 00:43:34,620 They might focus on individual letters instead or even bits 820 00:43:34,620 --> 00:43:37,115 if it's in the context nowadays of computers. 
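The code-book lookup described above can be sketched in a few lines of Python. The code words and true readings here are invented for illustration and are not taken from the book on screen; the point is only that a code is a lookup table over whole words or phrases, whereas a cipher is an algorithm applied to letters or bits.

```python
# Hypothetical code book: code word -> true reading.
CODE_BOOK = {
    "WALNUT": "ARRIVING TOMORROW",
    "PEBBLE": "SEND MONEY",
    "LANTERN": "MEETING CANCELLED",
}

# The sender needs the reverse mapping: true reading -> code word.
REVERSE_BOOK = {reading: word for word, reading in CODE_BOOK.items()}


def encode(readings: list[str]) -> str:
    """Turn a list of plaintext phrases into codetext."""
    return " ".join(REVERSE_BOOK[reading] for reading in readings)


def decode(codetext: str) -> list[str]:
    """Look each code word up in the book to recover the plaintext."""
    return [CODE_BOOK[word] for word in codetext.split()]


codetext = encode(["SEND MONEY", "ARRIVING TOMORROW"])
print(codetext)
print(decode(codetext))
```

Anyone holding a copy of `CODE_BOOK` can decode every past and future message, which is the weakness of code books discussed earlier.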
821 00:43:37,115 --> 00:43:39,240 So in the world of ciphers, you might have actually 822 00:43:39,240 --> 00:43:41,140 seen them in popular culture. 823 00:43:41,140 --> 00:43:46,380 So here, for instance, is just one frame from a famous film known as A Christmas 824 00:43:46,380 --> 00:43:47,850 Story, at least here in the US. 825 00:43:47,850 --> 00:43:51,780 It plays like every day all day long on a couple of TV channels 826 00:43:51,780 --> 00:43:55,230 around Christmas time, but this here is Ralphie, 827 00:43:55,230 --> 00:43:59,280 one of the main characters in the movie, and in his hands 828 00:43:59,280 --> 00:44:04,710 here is this secret decoder pin that he tried so hard to get through the mail, 829 00:44:04,710 --> 00:44:09,090 and the secret decoder pin was from little Orphan Annie herself. 830 00:44:09,090 --> 00:44:13,560 And what it does is implement mechanically a cipher, 831 00:44:13,560 --> 00:44:17,620 converting one letter to a number and back. 832 00:44:17,620 --> 00:44:20,610 But the thing twists left and right so that you can actually 833 00:44:20,610 --> 00:44:22,720 figure out what the mapping might be. 834 00:44:22,720 --> 00:44:26,100 So this is more of a cipher because it's operating at a lower level-- 835 00:44:26,100 --> 00:44:30,330 not in entire words or phrases, but one letter at a time. 836 00:44:30,330 --> 00:44:32,940 And it's a repeatable process that Ralphie, in this case, 837 00:44:32,940 --> 00:44:37,110 can apply again and again to all of the letters of the secret message. 838 00:44:37,110 --> 00:44:39,398 In World War II, the Germans, for instance, 839 00:44:39,398 --> 00:44:41,940 had the Enigma Machine that you might have read about or seen 840 00:44:41,940 --> 00:44:45,180 depicted in films, and this was a mechanical implementation 841 00:44:45,180 --> 00:44:47,160 of this same idea of a cipher. 
842 00:44:47,160 --> 00:44:51,538 But instead of using mathematics or gears turning just this way and that, 843 00:44:51,538 --> 00:44:52,705 it was much more mechanical. 844 00:44:52,705 --> 00:44:55,270 It was with rotors and lights and the like, 845 00:44:55,270 --> 00:44:57,730 but it, too, was implementing a cipher and could 846 00:44:57,730 --> 00:45:00,430 be configured with different inputs in order 847 00:45:00,430 --> 00:45:03,670 to influence exactly what the output would be. 848 00:45:03,670 --> 00:45:07,330 But that, too, is a physical device, and we'll focus here for the most part, 849 00:45:07,330 --> 00:45:10,600 though, on things more digital, things that you can ultimately, for instance, 850 00:45:10,600 --> 00:45:15,100 nowadays implement much more readily and much more scalably in software. 851 00:45:15,100 --> 00:45:17,470 But the words we'll use are pretty much the same. 852 00:45:17,470 --> 00:45:21,760 To encipher a message means to take that message in English or any other 853 00:45:21,760 --> 00:45:25,780 language, or so-called plaintext, and convert it, not surprisingly, 854 00:45:25,780 --> 00:45:29,530 to ciphertext as output. Meanwhile, the reverse-- 855 00:45:29,530 --> 00:45:33,280 or rather, an equivalent term here that you might know as well is to encrypt. 856 00:45:33,280 --> 00:45:38,050 Same idea, synonyms for our purposes, plaintext to ciphertext. 857 00:45:38,050 --> 00:45:39,580 To encipher or to encrypt. 858 00:45:39,580 --> 00:45:43,240 Nowadays, encrypt is probably the more common of those terms. 859 00:45:43,240 --> 00:45:46,300 Meanwhile, decipher would be the opposite of that, 860 00:45:46,300 --> 00:45:49,960 to actually take the ciphertext that someone else has sent to you, 861 00:45:49,960 --> 00:45:54,110 run it through an algorithm or cipher, and get back the plaintext.
862 00:45:54,110 --> 00:45:57,200 Meanwhile, decrypt would be a synonym for that phrase, which 863 00:45:57,200 --> 00:46:00,770 refers to exactly the same process of taking ciphertext as input 864 00:46:00,770 --> 00:46:03,300 and outputting plaintext as output. 865 00:46:03,300 --> 00:46:06,320 So how do we configure these ciphers so that you and I 866 00:46:06,320 --> 00:46:10,940 can use the same algorithm but customize them, not only with our own messages, 867 00:46:10,940 --> 00:46:14,930 but also with our own settings so that just because you and I might 868 00:46:14,930 --> 00:46:18,260 want to send the same plaintext doesn't mean that the ciphertext has 869 00:46:18,260 --> 00:46:20,000 to actually be identical? 870 00:46:20,000 --> 00:46:22,460 And indeed, in the world of cryptography, 871 00:46:22,460 --> 00:46:27,350 it's quite recommended that you and I use public and well-documented, 872 00:46:27,350 --> 00:46:30,020 well-tried-and-tested algorithms publicly, 873 00:46:30,020 --> 00:46:35,300 but we do keep one piece of information secret so that our use of that cipher, 874 00:46:35,300 --> 00:46:37,730 that algorithm is specific to us. 875 00:46:37,730 --> 00:46:40,670 And this customization, this configuration 876 00:46:40,670 --> 00:46:42,590 are generally known as keys. 877 00:46:42,590 --> 00:46:47,270 Now keys, much like a physical key to a lock on an actual door to your home, 878 00:46:47,270 --> 00:46:51,320 a key is what unlocks the capabilities of this cipher, 879 00:46:51,320 --> 00:46:55,430 but it's a key that needs to be known and used not only by you, typically, 880 00:46:55,430 --> 00:46:57,420 but also by the recipient. 881 00:46:57,420 --> 00:46:59,930 So that by having copies of the same key, 882 00:46:59,930 --> 00:47:03,470 you can not only encrypt messages or encipher them, 883 00:47:03,470 --> 00:47:07,190 but you can also decrypt or decipher those messages, too. 
884 00:47:07,190 --> 00:47:08,690 Now what are these keys in practice? 885 00:47:08,690 --> 00:47:10,890 They're not physical objects in the virtual world, 886 00:47:10,890 --> 00:47:13,160 but really just really big numbers. 887 00:47:13,160 --> 00:47:16,130 And often, there's some mathematical significance of these numbers, 888 00:47:16,130 --> 00:47:19,070 and sometimes those numbers don't even look like numbers. 889 00:47:19,070 --> 00:47:21,920 They might be presented on your phone or your laptop 890 00:47:21,920 --> 00:47:24,620 or desktop actually as letters of an alphabet 891 00:47:24,620 --> 00:47:26,600 and maybe even with some punctuation, too. 892 00:47:26,600 --> 00:47:29,753 But at the end of the day, they're really just numbers, or, of course, 893 00:47:29,753 --> 00:47:31,670 if you know a bit of computer science already, 894 00:47:31,670 --> 00:47:33,410 they're really just 0's and 1's. 895 00:47:33,410 --> 00:47:36,170 But it's perhaps helpful to think about them metaphorically as 896 00:47:36,170 --> 00:47:38,030 akin to these physical keys. 897 00:47:38,030 --> 00:47:40,400 Now how are these keys actually used? 898 00:47:40,400 --> 00:47:43,370 Well, within the world of cryptography, there 899 00:47:43,370 --> 00:47:45,470 are different types of encryption. 900 00:47:45,470 --> 00:47:49,520 And the first we'll look at is known as secret key cryptography. 901 00:47:49,520 --> 00:47:53,150 The presumption is that the security of your data 902 00:47:53,150 --> 00:47:56,540 relies on the secrecy of some key. 903 00:47:56,540 --> 00:48:00,860 So if A wants to send a message to B, then A and B 904 00:48:00,860 --> 00:48:05,210 must keep secret whatever key they are using to configure 905 00:48:05,210 --> 00:48:06,660 their choice of algorithms. 906 00:48:06,660 --> 00:48:08,310 So what do we mean by that? 
907 00:48:08,310 --> 00:48:10,880 Well, secret key cryptography, specifically 908 00:48:10,880 --> 00:48:13,280 in the context of encryption and scrambling data, 909 00:48:13,280 --> 00:48:16,460 is also known as symmetric key encryption for the reason 910 00:48:16,460 --> 00:48:21,500 that both A and B in this story are going to use the exact same key. 911 00:48:21,500 --> 00:48:24,860 And we'll contrast this in just a bit with asymmetric key encryption, 912 00:48:24,860 --> 00:48:26,910 which solves other problems as well. 913 00:48:26,910 --> 00:48:29,390 So let's consider the process of encryption, 914 00:48:29,390 --> 00:48:32,720 much like the process of hashing, as being this black box. 915 00:48:32,720 --> 00:48:37,610 Somehow or other, this black box is going to encrypt information for me. 916 00:48:37,610 --> 00:48:40,940 Taking as input my plaintext and hopefully outputting as output 917 00:48:40,940 --> 00:48:45,110 my ciphertext that I can actually send over the internet or some other channel 918 00:48:45,110 --> 00:48:47,040 to a recipient as well. 919 00:48:47,040 --> 00:48:50,480 So in the context, then, of secret key encryption, 920 00:48:50,480 --> 00:48:52,490 the picture looks a little something like this. 921 00:48:52,490 --> 00:48:55,430 Not only do you pass as input to the algorithm 922 00:48:55,430 --> 00:48:59,130 your plaintext message in English or any other human language, 923 00:48:59,130 --> 00:49:00,500 you also pass a key. 924 00:49:00,500 --> 00:49:04,700 And for now, just think of that key as a number that you and the other person 925 00:49:04,700 --> 00:49:07,010 have somehow agreed upon in advance. 926 00:49:07,010 --> 00:49:10,280 That algorithm, then, will ultimately output the ciphertext.
927 00:49:10,280 --> 00:49:13,850 And to be clear, the motivation for that key 928 00:49:13,850 --> 00:49:18,380 is to ensure that if I and you and you and you and you are all 929 00:49:18,380 --> 00:49:21,170 using the exact same encryption algorithm, 930 00:49:21,170 --> 00:49:23,750 it's not going to be obvious if and when we're 931 00:49:23,750 --> 00:49:26,450 sending the exact same messages because that, 932 00:49:26,450 --> 00:49:30,710 too, per our discussion of passwords, would leak information. 933 00:49:30,710 --> 00:49:33,410 Maybe you don't care about the information being leaked, 934 00:49:33,410 --> 00:49:37,760 but it's probably not a good thing if-- just because someone else is getting 935 00:49:37,760 --> 00:49:41,660 some message, that makes it more likely that an adversary can 936 00:49:41,660 --> 00:49:46,190 infer what it is you sent because the ciphertext just so happens 937 00:49:46,190 --> 00:49:47,250 to look the same. 938 00:49:47,250 --> 00:49:51,290 We want our ciphertext to be unique to each of our transmissions. 939 00:49:51,290 --> 00:49:54,890 So, let's consider a simple, simple example. 940 00:49:54,890 --> 00:49:59,155 Suppose that the message I want to send is just as short as the capital letter 941 00:49:59,155 --> 00:50:04,300 A, and suppose that the key that I want to use is as simple as the number 1. 942 00:50:04,300 --> 00:50:06,970 These are not good best practices, but we'll 943 00:50:06,970 --> 00:50:08,530 use them for the sake of discussion. 944 00:50:08,530 --> 00:50:11,890 Let me propose that the simplest algorithm I can perhaps think of 945 00:50:11,890 --> 00:50:17,110 is actually one that would take A as input and 1 as input and output B. 946 00:50:17,110 --> 00:50:19,280 And you can perhaps infer where this is going. 947 00:50:19,280 --> 00:50:24,880 If I instead provide B as input and 1 as input for the plaintext and key 948 00:50:24,880 --> 00:50:27,580 respectively, then the output is C.
949 00:50:27,580 --> 00:50:30,520 So believe it or not, in yesteryear, Julius Caesar 950 00:50:30,520 --> 00:50:36,580 was known to use an algorithm like this whereby this algorithm, Caesar Cipher, 951 00:50:36,580 --> 00:50:39,190 is what's generally known as a rotational cipher, 952 00:50:39,190 --> 00:50:42,640 because you're rotating the letters of the English alphabet. 953 00:50:42,640 --> 00:50:46,880 A becomes B, B becomes C. And I bet if we continue this logic, 954 00:50:46,880 --> 00:50:50,240 we can go around from Z becoming A as well. 955 00:50:50,240 --> 00:50:52,850 Now this, of course, is being applied at the moment 956 00:50:52,850 --> 00:50:55,460 to very short messages that are not that useful. 957 00:50:55,460 --> 00:50:58,850 Sending A or B or C is not particularly useful in general, 958 00:50:58,850 --> 00:51:02,630 but it's demonstrating how we can encipher or encrypt 959 00:51:02,630 --> 00:51:05,930 our plaintext into our ciphertext. 960 00:51:05,930 --> 00:51:09,170 However, when someone receives this message, 961 00:51:09,170 --> 00:51:13,610 they need to know not only what algorithm I used to encrypt it-- 962 00:51:13,610 --> 00:51:17,720 in this case, Caesar Cipher or a rotational cipher more generally, 963 00:51:17,720 --> 00:51:20,120 but they also need to know what the key is. 964 00:51:20,120 --> 00:51:22,100 And the key might not be as simple as 1. 965 00:51:22,100 --> 00:51:24,740 Here, for instance, is an example of 13. 966 00:51:24,740 --> 00:51:27,620 If your key is 13 and your plaintext is A, 967 00:51:27,620 --> 00:51:33,507 then your ciphertext should be N, because that is 13 places away from A, 968 00:51:33,507 --> 00:51:38,000 and so now the algorithm seems a little less obvious. 969 00:51:38,000 --> 00:51:41,630 13 is also representative of something that's long been known on the internet 970 00:51:41,630 --> 00:51:46,400 as ROT13 for R-O-T-1-3-- rotate 13 places.
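The rotation just described can be sketched in Python as follows. This is a minimal illustration, assuming the message uses only capital letters A through Z and that any other character (a space, say) is passed through unchanged; the function name `rotate` is made up for this example.

```python
def rotate(plaintext: str, key: int) -> str:
    """Encrypt by shifting each capital letter `key` places, wrapping Z to A."""
    result = []
    for ch in plaintext:
        if "A" <= ch <= "Z":
            # Convert A..Z to 0..25, add the key, and wrap around with % 26.
            shifted = (ord(ch) - ord("A") + key) % 26
            result.append(chr(shifted + ord("A")))
        else:
            # Leave spaces, digits, and punctuation as they are.
            result.append(ch)
    return "".join(result)


print(rotate("A", 1))   # B
print(rotate("B", 1))   # C
print(rotate("Z", 1))   # A
print(rotate("A", 13))  # N
```

The `% 26` is what makes Z wrap around to A, and the key is the only thing that distinguishes one use of this algorithm from another.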
971 00:51:46,400 --> 00:51:49,940 It's a very popular way of scrambling information 972 00:51:49,940 --> 00:51:52,610 but not in a way that you intend to be secure. 973 00:51:52,610 --> 00:51:55,790 Historically, it was often used for like movie spoilers online. 974 00:51:55,790 --> 00:51:59,570 If you want to make something a spoiler before there was CSS and blurring 975 00:51:59,570 --> 00:52:01,670 effects on websites and whatnot, you could just 976 00:52:01,670 --> 00:52:04,610 scramble it so it looks completely encrypted, 977 00:52:04,610 --> 00:52:07,310 but it's very easy for someone else with a click of a button 978 00:52:07,310 --> 00:52:08,930 even to just decrypt it. 979 00:52:08,930 --> 00:52:17,810 However, I would recommend that you not use a key of 26 because why? 980 00:52:17,810 --> 00:52:21,440 Well, at least in English, there's only 26 letters of the alphabet, capital A 981 00:52:21,440 --> 00:52:22,820 through capital Z in this case. 982 00:52:22,820 --> 00:52:26,690 So a key of 26 is going to output for your ciphertext 983 00:52:26,690 --> 00:52:29,600 the exact same thing as your plaintext. 984 00:52:29,600 --> 00:52:35,060 So there's another joke on the internet whereby ROT26 is twice 985 00:52:35,060 --> 00:52:39,780 as secure as ROT13 because 13 times 2 is 26, 986 00:52:39,780 --> 00:52:43,010 and obviously, that's not the case deductively here. 987 00:52:43,010 --> 00:52:47,600 Now of course, this particular algorithm and keys of this small size, 988 00:52:47,600 --> 00:52:49,790 1 through 26, not at all secure. 989 00:52:49,790 --> 00:52:50,300 Why? 990 00:52:50,300 --> 00:52:53,390 Well honestly, I don't even need a computer to crack this cipher. 
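The ROT13 and ROT26 remarks above are easy to check in Python. The sketch below uses a small helper named `rot` (a made-up name) that handles capital letters only, and compares it against the `rot_13` text codec that ships with Python's standard `codecs` module.

```python
import codecs


def rot(text: str, key: int) -> str:
    """Rotate capital letters by `key` places; leave other characters alone."""
    return "".join(
        chr((ord(ch) - ord("A") + key) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text
    )


message = "HELLO"

# ROT13 scrambles the text, and matches Python's built-in rot_13 codec.
print(rot(message, 13))
print(codecs.encode(message, "rot_13"))

# Applying ROT13 twice rotates 26 places in total, which changes nothing.
print(rot(rot(message, 13), 13))

# A key of 26 is the same as no encryption at all.
print(rot(message, 26))
```

Since 13 plus 13 is 26, the same ROT13 operation both scrambles and unscrambles, which is why a single button can do either.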
991 00:52:53,390 --> 00:52:55,880 I can probably take out a piece of paper and pencil 992 00:52:55,880 --> 00:53:00,260 and just try all possible numbers from 1 to 25-- 993 00:53:00,260 --> 00:53:02,450 I don't need to even waste my time with 26-- 994 00:53:02,450 --> 00:53:06,650 and just figure out via brute force what keys someone might have used 995 00:53:06,650 --> 00:53:08,900 to send a message using this algorithm. 996 00:53:08,900 --> 00:53:13,430 Not even on single letters, but maybe it operates on every individual letter 997 00:53:13,430 --> 00:53:14,240 of their message. 998 00:53:14,240 --> 00:53:18,320 Wouldn't take me that long to probably figure this out by brute force by hand. 999 00:53:18,320 --> 00:53:21,650 And with code, my gosh, I could write some Python code probably 1000 00:53:21,650 --> 00:53:23,850 that does it even faster than that. 1001 00:53:23,850 --> 00:53:27,740 So here on the screen is some ciphertext that I created in advance. 1002 00:53:27,740 --> 00:53:30,710 And I'll stipulate that this ciphertext was enciphered 1003 00:53:30,710 --> 00:53:33,710 using that same rotational cipher, but I'm not 1004 00:53:33,710 --> 00:53:36,890 going to tell you just yet what key I actually used. 1005 00:53:36,890 --> 00:53:40,530 It was originally an English message in all capital letters. 1006 00:53:40,530 --> 00:53:43,850 So the task at hand now is to decrypt this, I dare say. 1007 00:53:43,850 --> 00:53:47,810 Whether you are the intended recipient of the message or maybe maliciously, 1008 00:53:47,810 --> 00:53:51,560 you've intercepted my transmission with this message, 1009 00:53:51,560 --> 00:53:54,260 and now you're trying to brute force your way through by trying, 1010 00:53:54,260 --> 00:53:58,430 and by the looks of some heads going down and some scribbling, 1 or 2 or 3. 1011 00:53:58,430 --> 00:54:01,805 I bet we could also brute force our way through this algorithm, but how?
1012 00:54:01,805 --> 00:54:03,980 How does the decrypting process work? 1013 00:54:03,980 --> 00:54:06,440 It's really just the same thing in reverse. 1014 00:54:06,440 --> 00:54:10,340 If this now is our picture and you have ciphertext as your input, 1015 00:54:10,340 --> 00:54:13,340 you should be able to pass the same key as input-- 1016 00:54:13,340 --> 00:54:17,060 1, for instance or 13 or, with no good reason, 1017 00:54:17,060 --> 00:54:20,010 26, and get back out the plaintext. 1018 00:54:20,010 --> 00:54:23,060 But of course, the decryption algorithm is indeed the opposite 1019 00:54:23,060 --> 00:54:25,640 because you don't want to just add one position 1020 00:54:25,640 --> 00:54:32,150 or add two positions or three positions, you want to subtract 1 or 2 or 3 or 13. 1021 00:54:32,150 --> 00:54:34,470 You want to go in reverse, so to speak. 1022 00:54:34,470 --> 00:54:40,160 And so, if I were to pass in B as the ciphertext and 1 as the key, 1023 00:54:40,160 --> 00:54:44,360 well, the plaintext decrypted should, of course, be A. 1024 00:54:44,360 --> 00:54:47,060 And that holds now for all of the other letters of the alphabet, 1025 00:54:47,060 --> 00:54:50,450 assuming I'm reversing this process, in order to decrypt. 1026 00:54:50,450 --> 00:54:54,030 And now, I'll let you glance at the screen here for just a moment 1027 00:54:54,030 --> 00:54:58,140 and see if you yourselves can't figure out 1028 00:54:58,140 --> 00:55:02,410 what this ciphertext is trying to say. 1029 00:55:02,410 --> 00:55:05,410 And if you like the idea of figuring this out, 1030 00:55:05,410 --> 00:55:08,800 if you want to get better at this particular skill, 1031 00:55:08,800 --> 00:55:13,060 you are an aspiring cryptanalyst, I dare say, focusing 1032 00:55:13,060 --> 00:55:14,590 on this world of cryptanalysis.
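Decryption by subtraction, and the brute-force attack mentioned a moment ago, can both be sketched in Python. The sample ciphertext below is an invented example (it is not the one shown on screen), and the names `decrypt` and `brute_force` are made up for illustration. The sketch assumes capital letters only.

```python
def decrypt(ciphertext: str, key: int) -> str:
    """Reverse a rotational cipher by shifting each capital letter back `key` places."""
    return "".join(
        chr((ord(ch) - ord("A") - key) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in ciphertext
    )


def brute_force(ciphertext: str) -> dict[int, str]:
    """Try every key from 1 to 25 and return each candidate plaintext."""
    return {key: decrypt(ciphertext, key) for key in range(1, 26)}


print(decrypt("B", 1))  # A

# An attacker who does not know the key can simply list all 25 candidates
# and pick out the one that reads as English.
sample = "KHOOR ZRUOG"  # hypothetical intercepted message
for key, candidate in brute_force(sample).items():
    print(key, candidate)
```

With only 25 keys worth trying, a person scanning the output can spot the one readable line almost immediately, which is why this cipher offers so little protection.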
1033 00:55:14,590 --> 00:55:18,670 And this, too, itself is a job, I dare say particularly with governments, 1034 00:55:18,670 --> 00:55:23,180 trying to decrypt messages that might very well have been encrypted. 1035 00:55:23,180 --> 00:55:26,590 Now hopefully the world is using more secure algorithms 1036 00:55:26,590 --> 00:55:28,540 than these simple rotational ciphers. 1037 00:55:28,540 --> 00:55:30,010 And what do I mean by secure? 1038 00:55:30,010 --> 00:55:35,260 Hopefully they're using keys that are much bigger than small numbers like 1 1039 00:55:35,260 --> 00:55:36,400 through 25. 1040 00:55:36,400 --> 00:55:41,200 Hopefully they're using much, much, much larger numbers, many more bits, if only 1041 00:55:41,200 --> 00:55:47,200 so that it takes you and me, when we try to apply cryptanalysis to ciphertext, 1042 00:55:47,200 --> 00:55:52,600 it takes us way, way longer than this particular algorithm alone. 1043 00:55:52,600 --> 00:55:55,060 Now I don't want to keep you in suspense, 1044 00:55:55,060 --> 00:55:58,490 but I also don't want to spoil this if you'd like to try your hand at this. 1045 00:55:58,490 --> 00:56:03,380 So go ahead and close your eyes if you don't want to see the answer to this, 1046 00:56:03,380 --> 00:56:05,960 or I suppose you can just look away from your screen. 1047 00:56:05,960 --> 00:56:10,538 But in five seconds, I'll reveal what the plaintext actually is-- 1048 00:56:10,538 --> 00:56:12,830 and some of you, if you've seen that movie I mentioned, 1049 00:56:12,830 --> 00:56:15,360 will know immediately why this is the way it is, 1050 00:56:15,360 --> 00:56:19,310 but otherwise, you might just see this as an advertisement of sorts. 1051 00:56:19,310 --> 00:56:20,150 So here we go. 1052 00:56:20,150 --> 00:56:26,110 Your chance to close your eyes in 5, 4, 3, 2, 1. 
1053 00:56:26,110 --> 00:56:34,090 1054 00:56:34,090 --> 00:56:37,090 From some faces, some of you have seen this movie around the holidays, 1055 00:56:37,090 --> 00:56:40,450 but now, I've taken it off the screen and we'll move on now 1056 00:56:40,450 --> 00:56:41,920 with some actual algorithms. 1057 00:56:41,920 --> 00:56:45,430 If you'd like to come back on replay and actually see what the answer is, 1058 00:56:45,430 --> 00:56:47,450 we'll, of course, leave it on-demand. 1059 00:56:47,450 --> 00:56:49,840 So what are some of the actual algorithms 1060 00:56:49,840 --> 00:56:53,680 used nowadays for encryption that are best practices? 1061 00:56:53,680 --> 00:56:57,370 This rotational cipher that I described earlier, Caesar's simple one, 1062 00:56:57,370 --> 00:56:58,810 is not to be recommended. 1063 00:56:58,810 --> 00:57:01,480 It's wonderful for demonstration sake and discussion's sake, 1064 00:57:01,480 --> 00:57:05,240 but it's not something you should be using in practice unless, for instance, 1065 00:57:05,240 --> 00:57:08,440 you're in, say, middle school trying to send a message on a piece of paper 1066 00:57:08,440 --> 00:57:11,140 through your classroom of classmates and worried 1067 00:57:11,140 --> 00:57:14,050 that the teacher might intercept it and the teacher probably 1068 00:57:14,050 --> 00:57:18,223 doesn't have the instinct to or the care to actually 1069 00:57:18,223 --> 00:57:20,890 brute force their way through it and figure out what the key is. 1070 00:57:20,890 --> 00:57:23,680 But that's the level of security you're getting with something 1071 00:57:23,680 --> 00:57:25,030 like that rotational cipher. 
1072 00:57:25,030 --> 00:57:29,290 But in the real world, with our phones and desktops and laptops today, 1073 00:57:29,290 --> 00:57:33,130 generally used are AES or Triple DES, both of which 1074 00:57:33,130 --> 00:57:36,020 are popular algorithms that have been vetted by the world 1075 00:57:36,020 --> 00:57:41,150 and are very commonly used as secret key encryption ciphers 1076 00:57:41,150 --> 00:57:44,090 or symmetric key encryption ciphers, which, to be clear, 1077 00:57:44,090 --> 00:57:47,030 require that both the sender and the receiver 1078 00:57:47,030 --> 00:57:50,120 know and use the exact same key. 1079 00:57:50,120 --> 00:57:52,250 And for our purposes today, let me just stipulate 1080 00:57:52,250 --> 00:57:55,070 that the mathematics of these two and other algorithms 1081 00:57:55,070 --> 00:57:58,560 are much more sophisticated and documented in textbooks, 1082 00:57:58,560 --> 00:58:02,660 but, therefore, it makes it much harder for the adversary 1083 00:58:02,660 --> 00:58:08,450 to figure out, as by trying 25 different keys, what the actual key in use 1084 00:58:08,450 --> 00:58:10,270 might be. 1085 00:58:10,270 --> 00:58:17,260 Questions now about secret key cryptography or any of the primitives 1086 00:58:17,260 --> 00:58:19,390 we've just discussed? 1087 00:58:19,390 --> 00:58:22,320 STUDENT: So is it possible that if someone hacks the-- like 1088 00:58:22,320 --> 00:58:26,280 gets to know about the hash value-- the hash function of a company that it 1089 00:58:26,280 --> 00:58:29,520 is using, he might be able to use the hash values 1090 00:58:29,520 --> 00:58:34,577 and use-- like find a reverse function and then get the passwords for that? 1091 00:58:34,577 --> 00:58:35,910 DAVID J. MALAN: A good question.
1092 00:58:35,910 --> 00:58:39,830 I wouldn't worry mathematically about someone reversing the hash functions, 1093 00:58:39,830 --> 00:58:43,220 if only because with all of the ones that are in popular use 1094 00:58:43,220 --> 00:58:47,540 today in modern systems, there are a lot of smart mathematicians, computer 1095 00:58:47,540 --> 00:58:52,010 scientists, professionals who have vetted, if not proven mathematically, 1096 00:58:52,010 --> 00:58:54,750 that these things work as expected. 1097 00:58:54,750 --> 00:59:00,380 However, if the passwords that have been hashed are relatively easy to guess, 1098 00:59:00,380 --> 00:59:03,500 or if the adversary just gets lucky with whatever technique 1099 00:59:03,500 --> 00:59:07,430 they are using, it is absolutely possible to find at least a password, 1100 00:59:07,430 --> 00:59:10,880 an input that maps to that hash value, but often 1101 00:59:10,880 --> 00:59:12,960 not without significant effort. 1102 00:59:12,960 --> 00:59:15,920 And so generally, a company does not want to, 1103 00:59:15,920 --> 00:59:19,820 should not try to keep proprietary or secret what 1104 00:59:19,820 --> 00:59:22,970 hash function they're using, what encryption algorithm they're using. 1105 00:59:22,970 --> 00:59:25,580 If anything, I dare say, it should be reassuring 1106 00:59:25,580 --> 00:59:30,140 to the public if and when companies are using best practices and de facto 1107 00:59:30,140 --> 00:59:32,750 standards, all of these algorithms are designed 1108 00:59:32,750 --> 00:59:36,300 to keep secret not the algorithm itself, which literally can be found 1109 00:59:36,300 --> 00:59:40,410 in like university textbooks nowadays and on Wikipedia and beyond, 1110 00:59:40,410 --> 00:59:44,130 but rather, to keep secret the thing that's designed to be secret, 1111 00:59:44,130 --> 00:59:45,360 which is the key.
1112 00:59:45,360 --> 00:59:49,260 And now, if you're using too small of a key like I did originally, 1113 00:59:49,260 --> 00:59:52,000 well, then you're just using the algorithm poorly, perhaps. 1114 00:59:52,000 --> 00:59:54,000 But so long as you're adhering to best practices 1115 00:59:54,000 --> 00:59:57,360 and picking a really big, recommended-sized key, 1116 00:59:57,360 --> 01:00:01,470 then things mathematically should be trustworthy. 1117 01:00:01,470 --> 01:00:05,070 STUDENT: For an attacker, rather than like basically cracking a hash 1118 01:00:05,070 --> 01:00:08,310 or cracking an algorithm, wouldn't it be easier 1119 01:00:08,310 --> 01:00:12,750 to just try and access the basic server database 1120 01:00:12,750 --> 01:00:16,140 and access the hash function like generated code? 1121 01:00:16,140 --> 01:00:20,340 So rather, access how the specific algorithm works. 1122 01:00:20,340 --> 01:00:24,930 That way, they can basically just reverse-engineer it? 1123 01:00:24,930 --> 01:00:27,970 DAVID J. MALAN: Everything you described is possible. 1124 01:00:27,970 --> 01:00:31,470 However, I would push back on this assumption 1125 01:00:31,470 --> 01:00:36,810 that the company should try to keep its hash algorithm secure or hidden. 1126 01:00:36,810 --> 01:00:38,970 You should trust in the mathematics of what 1127 01:00:38,970 --> 01:00:41,710 we're discussing today, both in the context of hashes 1128 01:00:41,710 --> 01:00:43,350 and in the context of encryption. 1129 01:00:43,350 --> 01:00:48,210 And I've pulled back up on the screen here the number of possible hashes 1130 01:00:48,210 --> 01:00:53,470 that exist when using one of the most modern standards for hashing passwords. 1131 01:00:53,470 --> 01:00:55,920 This is such a big number-- 1132 01:00:55,920 --> 01:00:58,600 I dare say, I don't remember how many atoms are in the universe, 1133 01:00:58,600 --> 01:01:01,410 but I'm going to guess it's fewer than this, maybe. 
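To get a feel for how large these search spaces are, here is a rough Python sketch. Everything numeric in it is an assumption made for illustration: the attacker's speed of one trillion guesses per second is hypothetical, and the 128-bit and 256-bit sizes are example lengths rather than the specific figure shown on screen.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Assumption: a hypothetical attacker who can test one trillion values per second.
GUESSES_PER_SECOND = 10**12


def years_to_try_all(possibilities: int) -> float:
    """Years needed to try every possibility at the assumed guessing rate."""
    return possibilities / GUESSES_PER_SECOND / SECONDS_PER_YEAR


# The rotational cipher has only 25 useful keys.
print("rotational cipher:", years_to_try_all(25), "years")

# Each extra bit doubles the number of possible keys or hash values.
for bits in (128, 256):
    print(f"{bits} bits:", f"{years_to_try_all(2**bits):.3e}", "years")
```

Under these assumptions the rotational cipher falls in a tiny fraction of a second, while exhausting a 128-bit space would take on the order of ten billion billion years.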
1134 01:01:01,410 --> 01:01:07,440 The idea is, intuitively, that if the search space of possible hash values 1135 01:01:07,440 --> 01:01:11,430 or the search space of possible keys is so darn big, 1136 01:01:11,430 --> 01:01:14,370 both you and I, not to speak darkly, are going 1137 01:01:14,370 --> 01:01:18,600 to be dead before the attacker actually figures out what 1138 01:01:18,600 --> 01:01:22,960 that password or that hash actually is. 1139 01:01:22,960 --> 01:01:24,840 So that's generally the presumption. 1140 01:01:24,840 --> 01:01:28,620 Most of what we do today in terms of security all boils down 1141 01:01:28,620 --> 01:01:33,840 to probabilities and trying to drive the probability of being exploited way, 1142 01:01:33,840 --> 01:01:40,300 way, way down, even though, if your password is still 00000000, 1143 01:01:40,300 --> 01:01:44,100 it doesn't matter if there are this many or more possibilities if the adversary 1144 01:01:44,100 --> 01:01:45,790 tries that one first. 1145 01:01:45,790 --> 01:01:49,950 So keeping algorithms secret, keeping ciphers secret 1146 01:01:49,950 --> 01:01:52,270 is generally not best practice. 1147 01:01:52,270 --> 01:01:55,710 You should be trusting that the math and the probabilities 1148 01:01:55,710 --> 01:02:00,120 will protect your data if you are using these algorithms correctly. 1149 01:02:00,120 --> 01:02:03,240 And how about one more question before we resume? 1150 01:02:03,240 --> 01:02:07,470 STUDENT: How cipher work with word? 1151 01:02:07,470 --> 01:02:12,190 Not number, like with words, how it work? 1152 01:02:12,190 --> 01:02:22,890 How we can cipher-- or cryptograph like our latest with words, not the number, 1153 01:02:22,890 --> 01:02:25,440 how it can be work? 1154 01:02:25,440 --> 01:02:28,660 DAVID J.
MALAN: OK, so if your key is a word and not a number, 1155 01:02:28,660 --> 01:02:32,160 let me first say that generally when it comes to encryption, 1156 01:02:32,160 --> 01:02:34,470 the keys are not words. 1157 01:02:34,470 --> 01:02:37,800 These are not passwords, they're not meant to be used in quite the same way. 1158 01:02:37,800 --> 01:02:41,830 These keys are generally generated by the computer for you, 1159 01:02:41,830 --> 01:02:45,750 and so as such, they're just random numbers for the most part. 1160 01:02:45,750 --> 01:02:50,910 With that said, even if it is a word like apple, there are ways-- 1161 01:02:50,910 --> 01:02:53,220 and you would learn this in a class like CS50 1162 01:02:53,220 --> 01:02:58,750 itself-- to convert a word to the underlying numeric representation. 1163 01:02:58,750 --> 01:03:00,900 There's a system called ASCII or Unicode. 1164 01:03:00,900 --> 01:03:04,290 So capital A is actually the number 65 in most systems. 1165 01:03:04,290 --> 01:03:06,000 Capital B is the number 66. 1166 01:03:06,000 --> 01:03:07,680 But we can go one level deeper. 1167 01:03:07,680 --> 01:03:11,910 There's actually a pattern of 0's and 1's that represent A's and B's and C's 1168 01:03:11,910 --> 01:03:15,540 and so forth, so we can convert everything in the world of computers 1169 01:03:15,540 --> 01:03:16,820 to numbers. 1170 01:03:16,820 --> 01:03:20,680 And for that, let me encourage you to take CS50x online. 1171 01:03:20,680 --> 01:03:24,340 So that, then, is secret key cryptography 1172 01:03:24,340 --> 01:03:28,300 or symmetric key cryptography, but it doesn't solve all of our problems, 1173 01:03:28,300 --> 01:03:31,360 because I've taken for granted throughout this whole discussion 1174 01:03:31,360 --> 01:03:36,100 that the sender and the receiver have a shared secret between them. 
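As a small illustration of the word-to-number conversion just described, here is a minimal Python sketch. The helper names `word_to_numbers` and `word_to_bits` are made up for this example; the only built-ins relied on are `ord` and `format`.

```python
def word_to_numbers(word: str) -> list[int]:
    # ord() gives the numeric code point of a character (ASCII for plain English letters).
    return [ord(ch) for ch in word]


def word_to_bits(word: str) -> list[str]:
    # Show each character's number as a pattern of eight 0's and 1's.
    return [format(ord(ch), "08b") for ch in word]


print(word_to_numbers("AB"))  # [65, 66]
print(word_to_bits("AB"))     # ['01000001', '01000010']
```

So even a word like apple can be treated as a sequence of numbers before any cipher is applied to it.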
1175 01:03:36,100 --> 01:03:40,090 Whether it's a simple key like 1 or 2 or 13-- 1176 01:03:40,090 --> 01:03:43,330 hopefully not 26-- or hopefully some much bigger value. 1177 01:03:43,330 --> 01:03:46,460 But there's kind of a chicken and the egg problem there, 1178 01:03:46,460 --> 01:03:50,800 so to speak, in English whereby how do you actually establish 1179 01:03:50,800 --> 01:03:57,980 a shared secret between parties A and B if A and B have never talked before, 1180 01:03:57,980 --> 01:03:58,690 in fact? 1181 01:03:58,690 --> 01:04:01,510 So for instance, if you're visiting Amazon.com 1182 01:04:01,510 --> 01:04:05,950 for the first time, a popular e-commerce website, or gmail.com for your email, 1183 01:04:05,950 --> 01:04:08,437 ideally, and you probably know this already 1184 01:04:08,437 --> 01:04:10,270 from just living in the real world nowadays, 1185 01:04:10,270 --> 01:04:15,040 ideally you want that connection to Amazon or Gmail to be encrypted, 1186 01:04:15,040 --> 01:04:16,690 to be scrambled in some way. 1187 01:04:16,690 --> 01:04:17,200 Why? 1188 01:04:17,200 --> 01:04:19,887 Well, you don't want your password being stolen by someone. 1189 01:04:19,887 --> 01:04:22,720 You don't want your credit card number being intercepted by someone. 1190 01:04:22,720 --> 01:04:25,490 You don't want your personal emails being read by other people. 1191 01:04:25,490 --> 01:04:29,380 So it stands to reason that encryption is generally a good thing. 1192 01:04:29,380 --> 01:04:31,300 And you've seen this, perhaps, in the URL bar 1193 01:04:31,300 --> 01:04:36,700 via something called HTTPS where the S literally is meant to mean Secure. 1194 01:04:36,700 --> 01:04:41,020 But odds are, you don't know anyone personally at amazon.com 1195 01:04:41,020 --> 01:04:43,840 and you don't know anyone personally at gmail.com. 
1196 01:04:43,840 --> 01:04:47,890 So what key are you going to use to communicate securely 1197 01:04:47,890 --> 01:04:51,580 with these websites, not to mention new websites that don't even exist today 1198 01:04:51,580 --> 01:04:56,320 but might come online tomorrow, how do you establish a shared secret 1199 01:04:56,320 --> 01:04:57,890 with someone else? 1200 01:04:57,890 --> 01:05:02,500 So that's a fundamental gotcha or caveat with symmetric key 1201 01:05:02,500 --> 01:05:05,350 or secret key encryption, is that it assumes 1202 01:05:05,350 --> 01:05:09,940 that you have a shared secret between you and the other person. 1203 01:05:09,940 --> 01:05:11,950 But the chicken and the egg scenario comes 1204 01:05:11,950 --> 01:05:15,940 in whereby the only way to establish a shared secret 1205 01:05:15,940 --> 01:05:18,610 would be to send it to the other person securely, 1206 01:05:18,610 --> 01:05:21,430 but if you can't communicate securely, you can't even 1207 01:05:21,430 --> 01:05:23,150 send them the secret you want to use. 1208 01:05:23,150 --> 01:05:26,020 So you're caught in this deadlock. 1209 01:05:26,020 --> 01:05:28,900 Thankfully, thanks to math, there are ways 1210 01:05:28,900 --> 01:05:32,200 that we can solve this, too, via not symmetric key cryptography, 1211 01:05:32,200 --> 01:05:37,420 but public key cryptography, otherwise known as asymmetric key cryptography. 1212 01:05:37,420 --> 01:05:40,870 And among the algorithms here might be these, something called Diffie-Hellman, 1213 01:05:40,870 --> 01:05:43,700 MQV, RSA, and others as well. 1214 01:05:43,700 --> 01:05:47,710 And I dare say, on this list, maybe RSA is among the most well-known. 1215 01:05:47,710 --> 01:05:50,800 It's perhaps an acronym you've actually seen in the wild. 1216 01:05:50,800 --> 01:05:53,380 Now what do we mean by public key cryptography, 1217 01:05:53,380 --> 01:05:56,410 or more specifically, public key encryption? 
1218 01:05:56,410 --> 01:05:58,960 Well, in the world of public key encryption, 1219 01:05:58,960 --> 01:06:02,470 or asymmetric key encryption, the asymmetry 1220 01:06:02,470 --> 01:06:07,060 is implying that you actually don't use one key between the two people 1221 01:06:07,060 --> 01:06:10,660 A and B. You actually use two keys. 1222 01:06:10,660 --> 01:06:14,200 In the world of public key encryption, everyone in the world 1223 01:06:14,200 --> 01:06:17,230 has both a public key and a private key. 1224 01:06:17,230 --> 01:06:19,560 And these two are just really big numbers. 1225 01:06:19,560 --> 01:06:23,290 There is a mathematical relationship between these numbers, the public key 1226 01:06:23,290 --> 01:06:25,390 and the private key, but that's a relationship 1227 01:06:25,390 --> 01:06:27,940 that your phone or your laptop or your desktop 1228 01:06:27,940 --> 01:06:30,910 figures out when generating these values for you. 1229 01:06:30,910 --> 01:06:35,770 So unlike our previous discussion of passwords, which you and I as humans do 1230 01:06:35,770 --> 01:06:39,040 choose and memorize or store in our password managers, 1231 01:06:39,040 --> 01:06:42,100 when it comes to keys, these are generally, 1232 01:06:42,100 --> 01:06:45,490 in the world of public key cryptography, generated for you. 1233 01:06:45,490 --> 01:06:48,970 And as the name suggests, the whole purpose of these keys 1234 01:06:48,970 --> 01:06:52,720 is to tell the whole world if you want what your public key is. 1235 01:06:52,720 --> 01:06:54,460 It is not in any way secret. 1236 01:06:54,460 --> 01:06:59,050 You can literally email it out, you can put it in the signature of every email, 1237 01:06:59,050 --> 01:07:01,450 you can post it on your website, on social media. 1238 01:07:01,450 --> 01:07:05,500 The whole point of the public key is to make it, indeed, public. 
1239 01:07:05,500 --> 01:07:10,270 But, suffice it to say, the private key should be kept secret by you, 1240 01:07:10,270 --> 01:07:13,150 private by you on your own device. 1241 01:07:13,150 --> 01:07:15,430 That should never be shared with anyone else. 1242 01:07:15,430 --> 01:07:19,230 But the cool thing about public key cryptography and the mathematics 1243 01:07:19,230 --> 01:07:23,130 underlying it is that if you share your public key 1244 01:07:23,130 --> 01:07:27,540 with someone else on the internet, they can use that public key 1245 01:07:27,540 --> 01:07:30,870 to encrypt a message and then send it to you over email 1246 01:07:30,870 --> 01:07:33,100 or chat or any other technology. 1247 01:07:33,100 --> 01:07:36,180 And if you had to guess, what is the only key 1248 01:07:36,180 --> 01:07:40,170 in the world that can decrypt a message that has 1249 01:07:40,170 --> 01:07:42,880 been encrypted with your public key? 1250 01:07:42,880 --> 01:07:46,470 The only key in the world that can decrypt 1251 01:07:46,470 --> 01:07:51,420 a message that has been encrypted with your public key is your private key. 1252 01:07:51,420 --> 01:07:54,810 That's what the mathematical relationship ultimately does for you. 1253 01:07:54,810 --> 01:07:57,540 So, pictorially here, if this is our algorithm that 1254 01:07:57,540 --> 01:07:59,800 implements this idea of public key encryption, 1255 01:07:59,800 --> 01:08:01,890 let's see what the inputs and outputs should be. 1256 01:08:01,890 --> 01:08:04,860 If the goal is to send a message to you and you 1257 01:08:04,860 --> 01:08:07,890 have shared with the world your public key, whoever is sending you 1258 01:08:07,890 --> 01:08:13,260 this message uses your public key, their plaintext message, and out of that 1259 01:08:13,260 --> 01:08:15,150 comes ciphertext. 1260 01:08:15,150 --> 01:08:18,850 That, then, is how asymmetric key encryption works. 
1261 01:08:18,850 --> 01:08:21,939 Meanwhile, when you receive that message, 1262 01:08:21,939 --> 01:08:26,080 you can use your own private key and the ciphertext you've just 1263 01:08:26,080 --> 01:08:28,720 received to get back the plaintext. 1264 01:08:28,720 --> 01:08:30,880 And this is what we mean by asymmetric. 1265 01:08:30,880 --> 01:08:35,649 Unlike secret key cryptography or symmetric key cryptography where 1266 01:08:35,649 --> 01:08:39,460 you're using the same key back and forth, plus 1 or minus 1 1267 01:08:39,460 --> 01:08:44,170 in the case of the rotational cipher, with asymmetric encryption, 1268 01:08:44,170 --> 01:08:49,990 you are using one key for one process and another key for the decryption 1269 01:08:49,990 --> 01:08:50,689 process. 1270 01:08:50,689 --> 01:08:52,960 So that's what's fundamentally different. 1271 01:08:52,960 --> 01:08:55,870 RSA is one of the most popular algorithms for this. 1272 01:08:55,870 --> 01:08:58,450 The browsers you probably use every day are probably 1273 01:08:58,450 --> 01:09:01,450 using some variant of RSA underneath the hood. 1274 01:09:01,450 --> 01:09:03,910 We won't get into great detail about the mathematics, 1275 01:09:03,910 --> 01:09:06,850 but one of the most important details about RSA 1276 01:09:06,850 --> 01:09:10,210 is that it relies on really big prime numbers. 1277 01:09:10,210 --> 01:09:15,069 In fact, in a nutshell, what happens with RSA is your computer or your phone 1278 01:09:15,069 --> 01:09:18,220 chooses a really big prime number called p. 1279 01:09:18,220 --> 01:09:21,384 It then chooses a really big other prime number called q. 1280 01:09:21,384 --> 01:09:25,479 Then it multiplies them together to get a new value, we'll call it n. 1281 01:09:25,479 --> 01:09:30,220 And it uses that value n in the resulting mathematics 1282 01:09:30,220 --> 01:09:33,850 that the algorithm's authors came up with, dot-dot-dot. 
1283 01:09:33,850 --> 01:09:37,569 The presumption here is that when you take a really big prime number 1284 01:09:37,569 --> 01:09:40,870 and multiply it against a really big other prime number, 1285 01:09:40,870 --> 01:09:45,609 it is really hard to figure out from the product of those numbers 1286 01:09:45,609 --> 01:09:48,819 what the original p and q were. 1287 01:09:48,819 --> 01:09:50,950 And if you're a little hazy on prime numbers, 1288 01:09:50,950 --> 01:09:55,840 it's a number that can be only-- that can only be divided by itself and 1. 1289 01:09:55,840 --> 01:09:59,620 And indeed, we can use those, coming up with two big ones, 1290 01:09:59,620 --> 01:10:03,700 multiply them together in order to get this value n that is subsequently 1291 01:10:03,700 --> 01:10:05,680 used in the rest of the mathematics. 1292 01:10:05,680 --> 01:10:07,450 What are the rest of those mathematics? 1293 01:10:07,450 --> 01:10:08,470 In essence, this. 1294 01:10:08,470 --> 01:10:10,930 And these will be the scariest-looking formulas you perhaps 1295 01:10:10,930 --> 01:10:12,730 see over the course of this class. 1296 01:10:12,730 --> 01:10:18,520 The value n I just described is used to divide values 1297 01:10:18,520 --> 01:10:22,330 ultimately. If you're unfamiliar with mod here, this means to, in this context, 1298 01:10:22,330 --> 01:10:24,350 take the remainder of some value. 1299 01:10:24,350 --> 01:10:25,480 So what are we doing? 1300 01:10:25,480 --> 01:10:29,800 Here is a quick summary of how encryption and decryption work 1301 01:10:29,800 --> 01:10:30,760 with RSA. 1302 01:10:30,760 --> 01:10:35,230 If you have some message m that you want to send to another person 1303 01:10:35,230 --> 01:10:39,130 and you have come up with somehow, via the dot-dot-dot process 1304 01:10:39,130 --> 01:10:45,500 earlier that I alluded to, you've come up with your own public key e there.
1305 01:10:45,500 --> 01:10:47,500 Well then, someone can take their message, 1306 01:10:47,500 --> 01:10:53,140 encrypt it by raising that message to the power of e, the exponent of e, 1307 01:10:53,140 --> 01:10:56,440 and then divide it, divide it, divide it, divide it by n 1308 01:10:56,440 --> 01:11:00,550 and figure out what the remainder is when dividing by n. 1309 01:11:00,550 --> 01:11:03,760 That then gives you a value called c for ciphertext. 1310 01:11:03,760 --> 01:11:08,170 When you then receive that message c, you can use your private key, 1311 01:11:08,170 --> 01:11:12,340 known here as d, and you raise the ciphertext, 1312 01:11:12,340 --> 01:11:16,480 its numeric value, to the power of d-- that is, the exponent in d, and you 1313 01:11:16,480 --> 01:11:19,330 divide, divide, divide by n in order to figure out 1314 01:11:19,330 --> 01:11:22,990 that remainder, which will give you back the original message. 1315 01:11:22,990 --> 01:11:26,510 Now that is a significant oversimplification of what's going on, 1316 01:11:26,510 --> 01:11:28,750 but that's the essence of the algorithm. 1317 01:11:28,750 --> 01:11:32,200 It has to do with picking two very large prime numbers, 1318 01:11:32,200 --> 01:11:35,140 multiplying them together to get that value n, 1319 01:11:35,140 --> 01:11:39,850 and then using n as well as other values that, dot-dot-dot, are generated 1320 01:11:39,850 --> 01:11:46,150 by the algorithm for you, e and d, in order to encrypt and decrypt messages 1321 01:11:46,150 --> 01:11:47,080 ultimately. 1322 01:11:47,080 --> 01:11:49,750 And this is what's generally known as modular arithmetic. 
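To make the RSA summary concrete, here is a toy sketch in Python with deliberately tiny primes. The specific numbers (p = 61, q = 53, e = 17, m = 65) are assumptions chosen for illustration and do not come from the lecture. The step that derives d from e stands in for the "dot-dot-dot" part. Real RSA uses primes hundreds of digits long plus padding, so this is not secure as written.

```python
# Toy RSA with tiny primes. Illustrative only, NOT secure.
p, q = 61, 53                # two (small) primes; real ones are enormous
n = p * q                    # 3233, shared by both keys
phi = (p - 1) * (q - 1)      # 3120, used only while generating the keys

e = 17                       # public exponent (assumed for this example)
d = pow(e, -1, phi)          # private exponent: modular inverse of e (Python 3.8+)


def encrypt(m: int, e: int, n: int) -> int:
    # c = m^e mod n
    return pow(m, e, n)


def decrypt(c: int, d: int, n: int) -> int:
    # m = c^d mod n
    return pow(c, d, n)


m = 65                       # the message, already converted to a number
c = encrypt(m, e, n)         # ciphertext: 2790
print(c, decrypt(c, d, n))   # 2790 65
```

Python's three-argument `pow` takes the remainder at every step, so it never builds the gigantic intermediate power.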
1323 01:11:49,750 --> 01:11:52,000 It involves lots of division and division and division 1324 01:11:52,000 --> 01:11:53,750 in order to come up with these remainders, 1325 01:11:53,750 --> 01:11:59,020 but ultimately, it is a very secure way to asymmetrically share information 1326 01:11:59,020 --> 01:12:02,410 without having to agree on one shared key in advance, 1327 01:12:02,410 --> 01:12:06,130 but rather, using a public and a private key instead. 1328 01:12:06,130 --> 01:12:10,480 Now there are other techniques that come with this world 1329 01:12:10,480 --> 01:12:14,590 of public key cryptography, and another technique is that of key exchange. 1330 01:12:14,590 --> 01:12:18,210 So by contrast, if you do actually want to establish 1331 01:12:18,210 --> 01:12:22,320 some kind of shared secret, there are alternative algorithms 1332 01:12:22,320 --> 01:12:24,850 that different humans have invented over the years. 1333 01:12:24,850 --> 01:12:27,550 So there are alternatives to one algorithm or another, 1334 01:12:27,550 --> 01:12:29,670 and one of these alternatives is actually 1335 01:12:29,670 --> 01:12:33,910 called Diffie-Hellman, named after another pair of authors here. 1336 01:12:33,910 --> 01:12:37,530 So here is the essence of the mathematics for this algorithm, 1337 01:12:37,530 --> 01:12:40,470 the goal of which is indeed key exchange. 1338 01:12:40,470 --> 01:12:45,090 To figure out, using fancy mathematics, how both A and B can come up 1339 01:12:45,090 --> 01:12:49,590 with the same value that they can then use as a shared secret, 1340 01:12:49,590 --> 01:12:52,980 but without anyone who intercepts any of their messages 1341 01:12:52,980 --> 01:12:57,750 being able to figure out what is that shared value, that shared secret. 1342 01:12:57,750 --> 01:12:59,830 So what's the essence of the math here? 1343 01:12:59,830 --> 01:13:02,910 Well, you first pick a value g, which is called a generator. 
1344 01:13:02,910 --> 01:13:04,740 It can be as simple as the number 2. 1345 01:13:04,740 --> 01:13:07,860 And you pick a big prime number, call it p here. 1346 01:13:07,860 --> 01:13:10,170 And those are agreed-upon in advance. 1347 01:13:10,170 --> 01:13:15,090 Meanwhile, person A, say Alice, picks her own private key, lowercase a, 1348 01:13:15,090 --> 01:13:18,600 which is another really big number, and then she does this math: g 1349 01:13:18,600 --> 01:13:20,550 to the power of a mod p. 1350 01:13:20,550 --> 01:13:23,940 And again, mod refers to taking the remainder of some value. 1351 01:13:23,940 --> 01:13:28,890 Meanwhile, B, or Bob, still uses the same g, still uses the same p, 1352 01:13:28,890 --> 01:13:34,500 picks his own private key called lowercase b and raises g to the power of b modulo p, 1353 01:13:34,500 --> 01:13:36,990 and that gives him back this value capital 1354 01:13:36,990 --> 01:13:41,910 B, whereas Alice had capital A. Then, turns out that Alice and Bob can 1355 01:13:41,910 --> 01:13:44,310 send those values across the internet-- 1356 01:13:44,310 --> 01:13:50,010 capital A one way, capital B the other way, and thanks to some fancy modular arithmetic 1357 01:13:50,010 --> 01:13:54,360 here, too, Alice can take Bob's capital B value and raise it 1358 01:13:54,360 --> 01:13:58,680 to the power of her lowercase a value, which effectively gives you 1359 01:13:58,680 --> 01:14:02,010 g to the power of a times b mod p. 1360 01:14:02,010 --> 01:14:05,790 Bob, meanwhile, can take Alice's capital A value that was sent to him, 1361 01:14:05,790 --> 01:14:10,590 raise it to the power of his private key, lowercase b, and then mod p. 1362 01:14:10,590 --> 01:14:13,140 So calculate the remainder with respect to p. 1363 01:14:13,140 --> 01:14:16,530 The end result, and it's totally fine if these mathematics 1364 01:14:16,530 --> 01:14:18,330 are uncomfortable for you or whoo!
1365 01:14:18,330 --> 01:14:22,980 Just know that, thanks to some basic principles of mathematics, 1366 01:14:22,980 --> 01:14:27,930 this results in both Alice and Bob having the exact same value-- 1367 01:14:27,930 --> 01:14:30,270 we'll call it s for shared secret-- 1368 01:14:30,270 --> 01:14:35,310 even though the value never went across the internet in its entirety. 1369 01:14:35,310 --> 01:14:38,850 Alice sent part of it this way, Bob sent part of it this way, 1370 01:14:38,850 --> 01:14:43,200 but because Alice and Bob held on to private values, the little A 1371 01:14:43,200 --> 01:14:46,320 and the little B, they kept that to themselves, they're 1372 01:14:46,320 --> 01:14:49,590 able to do these mathematics that ensure that they both came up 1373 01:14:49,590 --> 01:14:53,850 with the same value even though you or I, if we intercepted 1374 01:14:53,850 --> 01:14:57,240 any one of those messages, we could not figure out what it is. 1375 01:14:57,240 --> 01:14:59,670 And now that they have a shared secret s, 1376 01:14:59,670 --> 01:15:02,880 they can use that using any of those other symmetric 1377 01:15:02,880 --> 01:15:04,710 ciphers we talked about earlier. 1378 01:15:04,710 --> 01:15:09,210 AES I put on the board briefly, triple DES I put on the board briefly. 1379 01:15:09,210 --> 01:15:12,090 Heck, we could even use this in a rotational cipher 1380 01:15:12,090 --> 01:15:15,870 if we really wanted to, but not, indeed, best practice. 1381 01:15:15,870 --> 01:15:19,410 So again, don't worry so much about focusing on the mathematics, 1382 01:15:19,410 --> 01:15:22,710 but if you were to take a higher-level class in theoretical computer science, 1383 01:15:22,710 --> 01:15:26,550 these are intellectual rabbit holes that you could go down to better understand 1384 01:15:26,550 --> 01:15:27,690 how the software works. 
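The key exchange just described can be sketched in a few lines of Python. The numbers here (p = 23, g = 5, and private keys 4 and 3) are tiny values assumed purely for illustration. In practice p is a very large prime and the private keys are large random numbers.

```python
# Toy Diffie-Hellman key exchange. Illustrative only, NOT secure at this size.
p = 23   # public prime, agreed upon in advance
g = 5    # public generator, agreed upon in advance

a = 4    # Alice's private key (never sent)
b = 3    # Bob's private key (never sent)

A = pow(g, a, p)   # Alice computes g^a mod p and sends A to Bob
B = pow(g, b, p)   # Bob computes g^b mod p and sends B to Alice

s_alice = pow(B, a, p)   # Alice: B^a mod p = g^(a*b) mod p
s_bob = pow(A, b, p)     # Bob:   A^b mod p = g^(a*b) mod p

print(A, B, s_alice, s_bob)   # 4 10 18 18
```

Only p, g, A, and B ever cross the network. The shared secret s (18 here) is computed separately on each side and could then be used as the key for a symmetric cipher.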
1385 01:15:27,690 --> 01:15:30,480 And now to my comments earlier about not trying 1386 01:15:30,480 --> 01:15:33,690 to invent your own cryptographic functions, 1387 01:15:33,690 --> 01:15:35,280 this is the kind of reason why. 1388 01:15:35,280 --> 01:15:37,980 This is the degree of sophistication that you and I take 1389 01:15:37,980 --> 01:15:41,520 for granted in our phones, our laptops, and desktops 1390 01:15:41,520 --> 01:15:44,250 that have been vetted by industry and academics alike. 1391 01:15:44,250 --> 01:15:47,550 Generally best practice is to rely on standards 1392 01:15:47,550 --> 01:15:50,040 that have been tried and tested rather than 1393 01:15:50,040 --> 01:15:53,760 try to come up with your own creative cryptosystem, so to speak, 1394 01:15:53,760 --> 01:15:57,780 that may very well have faults that you yourself do not know. 1395 01:15:57,780 --> 01:16:01,170 And the icing on the cake is that this is ultimately, if curious as 1396 01:16:01,170 --> 01:16:04,860 to the underlying mathematics, what value ultimately 1397 01:16:04,860 --> 01:16:10,320 Alice and Bob are both calculating, g to the power of lowercase a times lowercase b, mod p. 1398 01:16:10,320 --> 01:16:13,560 But more on that in a higher-level mathematics course if indeed 1399 01:16:13,560 --> 01:16:14,460 of interest. 1400 01:16:14,460 --> 01:16:17,160 How about one final building block that you 1401 01:16:17,160 --> 01:16:19,960 get from this world of public key cryptography, 1402 01:16:19,960 --> 01:16:23,110 and this is one that's going to be increasingly omnipresent, 1403 01:16:23,110 --> 01:16:25,680 I do think, in our world, especially as we move away 1404 01:16:25,680 --> 01:16:28,950 from very archaic paper-pencil signatures 1405 01:16:28,950 --> 01:16:31,290 that you might write with a pen on a paper, 1406 01:16:31,290 --> 01:16:35,050 and rather, moving to what we'll call digital signatures as well.
1407 01:16:35,050 --> 01:16:38,850 It turns out that once you're comfortable with the idea 1408 01:16:38,850 --> 01:16:42,600 of public key cryptography generally involving a public key 1409 01:16:42,600 --> 01:16:46,050 and a private key, the first of which is literally public, 1410 01:16:46,050 --> 01:16:49,710 you can share it with the world; the second of which is meant to be private, 1411 01:16:49,710 --> 01:16:50,910 kept only to you. 1412 01:16:50,910 --> 01:16:53,880 And if you can take at face value my claim 1413 01:16:53,880 --> 01:16:56,160 that through appropriate mathematics, there's 1414 01:16:56,160 --> 01:16:59,310 a relationship possible between these two numbers, 1415 01:16:59,310 --> 01:17:02,850 that whereas one can encrypt data, the other can decrypt, 1416 01:17:02,850 --> 01:17:06,000 even if you don't care to get into the specifics of the mathematics, 1417 01:17:06,000 --> 01:17:09,750 but you just agree that, OK, that sounds reasonable to me, 1418 01:17:09,750 --> 01:17:15,090 that that math can work, we can now use that building block 1419 01:17:15,090 --> 01:17:18,860 of a public key and a private key to solve other problems as well. 1420 01:17:18,860 --> 01:17:21,620 Not just encrypt messages from point A to point B 1421 01:17:21,620 --> 01:17:25,670 and back, but rather, to sign information, sign documents, 1422 01:17:25,670 --> 01:17:30,150 even, and say, yes, this was signed by David or someone else. 1423 01:17:30,150 --> 01:17:31,520 So how does this work? 1424 01:17:31,520 --> 01:17:33,680 In the world of digital signatures, here's 1425 01:17:33,680 --> 01:17:36,260 a few more acronyms of algorithms that are commonly 1426 01:17:36,260 --> 01:17:39,230 used even though we'll continue to simplify them in our discussion. 1427 01:17:39,230 --> 01:17:43,550 DSA, ECDSA, RSA, and others can be used to give you 1428 01:17:43,550 --> 01:17:48,000 the ability to sign documents or other pieces of information digitally. 
1429 01:17:48,000 --> 01:17:51,230 So what does it mean to sign something digitally? 1430 01:17:51,230 --> 01:17:53,930 It's not at all like this with a unique signature, 1431 01:17:53,930 --> 01:17:56,340 it's all mathematics involved. 1432 01:17:56,340 --> 01:18:00,500 So, here, then, might be our algorithm for digitally signing 1433 01:18:00,500 --> 01:18:02,660 some document or piece of information. 1434 01:18:02,660 --> 01:18:06,260 And I claim that the input to this process is a message. 1435 01:18:06,260 --> 01:18:09,050 A letter that you've written, a contract that you want to sign, 1436 01:18:09,050 --> 01:18:11,855 something that you want to put your digital signature on. 1437 01:18:11,855 --> 01:18:16,000 And the output of this message initially is going to be a hash. 1438 01:18:16,000 --> 01:18:18,810 So we can use any number of hash functions 1439 01:18:18,810 --> 01:18:23,610 we talked about earlier that take as input an arbitrary length 1440 01:18:23,610 --> 01:18:27,780 input, like a message, a document, an essay, a contract, 1441 01:18:27,780 --> 01:18:32,050 and produce as output a fixed length hash value. 1442 01:18:32,050 --> 01:18:34,650 So we've seen that and we've stipulated that is indeed 1443 01:18:34,650 --> 01:18:38,010 possible, similar in spirit to our password discussion earlier. 1444 01:18:38,010 --> 01:18:40,800 You can even do it for larger inputs than passwords. 1445 01:18:40,800 --> 01:18:43,570 You can do it for entire documents as well. 1446 01:18:43,570 --> 01:18:47,400 Once you have that hash, here's how you digitally sign the document. 1447 01:18:47,400 --> 01:18:54,600 You use your private key, you pass that as input, as well as the hash value 1448 01:18:54,600 --> 01:18:58,620 you just computed a moment ago into the digital signature algorithm, 1449 01:18:58,620 --> 01:19:01,993 and the output of that process is a signature. 1450 01:19:01,993 --> 01:19:04,410 So if you think about this intuitively, what are we doing? 
1451 01:19:04,410 --> 01:19:07,020 Well, we're taking an arbitrary-sized document. 1452 01:19:07,020 --> 01:19:09,420 Maybe it's a letter that you've written, maybe it's 1453 01:19:09,420 --> 01:19:12,690 a contract that you've written that you need to sign that might be short 1454 01:19:12,690 --> 01:19:14,250 or it might be really long. 1455 01:19:14,250 --> 01:19:17,440 Here's where the value of cryptographic hash functions come in. 1456 01:19:17,440 --> 01:19:19,830 Recall that a cryptographic hash function, by definition, 1457 01:19:19,830 --> 01:19:25,240 takes an arbitrary-sized input and reduces it to a fixed-sized output. 1458 01:19:25,240 --> 01:19:27,120 So it doesn't matter how big the original 1459 01:19:27,120 --> 01:19:31,200 was, you can distill it into a distinct representation that's shorter. 1460 01:19:31,200 --> 01:19:35,280 So, per this diagram, if you take that hash value 1461 01:19:35,280 --> 01:19:39,330 and you encrypt it with your private key, what we say 1462 01:19:39,330 --> 01:19:41,880 is that the output of that process, which 1463 01:19:41,880 --> 01:19:45,330 is just a really big number or some sequence of weird-looking text, 1464 01:19:45,330 --> 01:19:47,800 is your digital signature. 1465 01:19:47,800 --> 01:19:50,820 Now this is a little weird because what we're doing now 1466 01:19:50,820 --> 01:19:53,280 is the opposite of public key encryption. 1467 01:19:53,280 --> 01:19:56,040 With public key encryption, remember, someone else 1468 01:19:56,040 --> 01:19:59,010 used your public key to encrypt a message to you 1469 01:19:59,010 --> 01:20:02,800 and you used your private key to decrypt it. 1470 01:20:02,800 --> 01:20:06,960 But in the case of digital signatures, the story gets flipped upside-down. 
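The fixed-size property of hash functions mentioned above is easy to see in Python. SHA-256 from the standard `hashlib` module is used here as an assumed example of a cryptographic hash function, since the lecture does not name one at this point. The helper name `digest` is made up for this sketch.

```python
import hashlib


def digest(data: bytes) -> str:
    # SHA-256 always produces 32 bytes, shown here as 64 hexadecimal characters.
    return hashlib.sha256(data).hexdigest()


short_hash = digest(b"hi")                     # a two-byte "document"
long_hash = digest(b"contract text " * 10000)  # a 140,000-byte "document"

print(len(short_hash), len(long_hash))  # 64 64
```

However long the letter or contract is, the value that actually gets signed is the same short length.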
1471 01:20:06,960 --> 01:20:11,100 You use your private key and a hash of your message 1472 01:20:11,100 --> 01:20:15,150 to digitally sign your document and the output of that is a signature-- again, 1473 01:20:15,150 --> 01:20:17,040 a number or some string of text. 1474 01:20:17,040 --> 01:20:20,940 And you send that signature to the recipient saying, this 1475 01:20:20,940 --> 01:20:24,970 is my digital signature, you can verify it now if you so choose. 1476 01:20:24,970 --> 01:20:25,950 And they should. 1477 01:20:25,950 --> 01:20:29,160 So that invites the question, well, how does the recipient 1478 01:20:29,160 --> 01:20:31,200 verify your digital signature? 1479 01:20:31,200 --> 01:20:34,530 How do they know that this weird-looking sequence of characters or numbers 1480 01:20:34,530 --> 01:20:36,570 actually was signed by you? 1481 01:20:36,570 --> 01:20:41,230 Well, recall that you have not only a private key, but a public key as well. 1482 01:20:41,230 --> 01:20:44,640 And that public key is accessible to everyone, including that recipient. 1483 01:20:44,640 --> 01:20:46,830 And so, what happens is this. 1484 01:20:46,830 --> 01:20:51,100 When that recipient gets your document and your digital signature, 1485 01:20:51,100 --> 01:20:55,710 so to speak, they probably want to and should verify the digital signature 1486 01:20:55,710 --> 01:20:59,410 to confirm that, yes, you signed off on that document or contract. 1487 01:20:59,410 --> 01:21:01,300 So what does that box look like? 1488 01:21:01,300 --> 01:21:05,220 Well, they have received not only the document itself, the so-called message, 1489 01:21:05,220 --> 01:21:07,318 they've also received your digital signature. 1490 01:21:07,318 --> 01:21:08,610 So you've sent them two things. 1491 01:21:08,610 --> 01:21:11,770 And the digital signature, you can think of it like a human signature, 1492 01:21:11,770 --> 01:21:14,130 but it's, of course, a big number or a string of text. 
1493 01:21:14,130 --> 01:21:17,290 But they've sent you two things-- the document and that signature. 1494 01:21:17,290 --> 01:21:18,220 So what do you do? 1495 01:21:18,220 --> 01:21:20,640 You take the document you've received and you run it 1496 01:21:20,640 --> 01:21:22,800 through the exact same publicly available hash 1497 01:21:22,800 --> 01:21:24,820 function, because the document might be long, 1498 01:21:24,820 --> 01:21:28,650 so you want to collapse it into a short hash representation 1499 01:21:28,650 --> 01:21:31,180 thereof, just like our use of passwords. 1500 01:21:31,180 --> 01:21:35,010 So that you can just do easily, no private information involved. 1501 01:21:35,010 --> 01:21:36,420 But then what do you do? 1502 01:21:36,420 --> 01:21:42,480 You then take the public key of the person who signed this document, you 1503 01:21:42,480 --> 01:21:45,780 take the signature that they claim is their signature, 1504 01:21:45,780 --> 01:21:50,430 and you decrypt their signature with their public key. 1505 01:21:50,430 --> 01:21:58,780 That should output the exact same hash that you just calculated. 1506 01:21:58,780 --> 01:22:04,333 So to summarize, the message itself the document in this story is public. 1507 01:22:04,333 --> 01:22:07,500 It's not encrypted, it's not something you really worry about being private. 1508 01:22:07,500 --> 01:22:09,480 What you really care about in this story is 1509 01:22:09,480 --> 01:22:11,830 that it was signed by a specific person. 1510 01:22:11,830 --> 01:22:15,130 So if that message, that document is available to both the sender 1511 01:22:15,130 --> 01:22:20,890 and the receiver, both of them do this first process of hashing the message, 1512 01:22:20,890 --> 01:22:24,610 hashing the document just to get some succinct representation thereof. 1513 01:22:24,610 --> 01:22:26,560 So it's not this big, it's this big. 1514 01:22:26,560 --> 01:22:28,580 Makes the math quicker and easier. 
1515 01:22:28,580 --> 01:22:33,670 However, what the recipient does is upon receiving not only that message, which 1516 01:22:33,670 --> 01:22:37,660 they just hashed, but also your claimed digital signature, 1517 01:22:37,660 --> 01:22:42,640 they try to decrypt your signature using your public key. 1518 01:22:42,640 --> 01:22:45,460 And here, too, just as the private key can 1519 01:22:45,460 --> 01:22:48,460 reverse the encryption done by a public key, 1520 01:22:48,460 --> 01:22:53,060 so can the public key reverse the encryption done by a private key. 1521 01:22:53,060 --> 01:22:58,300 So if the recipient mathematically gets the exact same hash 1522 01:22:58,300 --> 01:23:01,660 after decrypting what you sent them, it must be the case 1523 01:23:01,660 --> 01:23:04,870 mathematically that the only person in the world who 1524 01:23:04,870 --> 01:23:07,600 could have signed this document is, in fact, you 1525 01:23:07,600 --> 01:23:09,670 because they have your public key. 1526 01:23:09,670 --> 01:23:11,920 And maybe some third party, some registry, 1527 01:23:11,920 --> 01:23:15,200 some company has said, yes, that is David Malan's public key, 1528 01:23:15,200 --> 01:23:16,410 you can trust that. 1529 01:23:16,410 --> 01:23:20,900 And so, if David Malan's private key has not been compromised, 1530 01:23:20,900 --> 01:23:26,720 you can trust that any signature that you can decrypt with my public key 1531 01:23:26,720 --> 01:23:31,190 must have been encrypted with my private key. 
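The sign-then-verify flow described above can be sketched in Python. This is only a toy under stated assumptions: it uses unpadded "textbook" RSA built from two well-known Mersenne primes, and the names `digest`, `sign`, and `verify` are made up for illustration rather than taken from any library. Real systems rely on vetted cryptographic libraries, padding schemes, and far larger keys.

```python
import hashlib

# Toy key pair (assumption: for illustration only; too small and unpadded for real use).
P = 2**61 - 1                        # a known Mersenne prime, kept secret
Q = 2**89 - 1                        # another known Mersenne prime, kept secret
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message, then reduce the hash into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Signer: transform the hash with the private key."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Recipient: undo the signature with the public key and compare hashes."""
    return pow(signature, E, N) == digest(message)


signature = sign(b"I agree to this contract")
print(verify(b"I agree to this contract", signature))
print(verify(b"I agree to pay double", signature))
```

The first check should succeed because the recipient recovers the same hash they computed themselves; the second should fail because the altered document hashes to a different value.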
1532 01:23:31,190 --> 01:23:34,670 And it takes a while, I think, for these ideas, and certainly the mathematics 1533 01:23:34,670 --> 01:23:36,770 to sink in, but for now, if you just trust 1534 01:23:36,770 --> 01:23:39,770 that there's two big numbers in the world, one public, one private, 1535 01:23:39,770 --> 01:23:43,580 there's a mathematical relationship between them such that one can reverse 1536 01:23:43,580 --> 01:23:46,130 the effects of the other in either direction, 1537 01:23:46,130 --> 01:23:49,790 we humans can use this now not only to secure 1538 01:23:49,790 --> 01:23:53,060 our messages per our discussion of encryption, 1539 01:23:53,060 --> 01:23:56,210 we can also use it to authenticate messages 1540 01:23:56,210 --> 01:24:01,160 and attest, yes, this came from David Malan or did not. 1541 01:24:01,160 --> 01:24:03,980 And unlike a human signature on a piece of paper 1542 01:24:03,980 --> 01:24:07,940 that can obviously just be photographed, duplicated, traced over, 1543 01:24:07,940 --> 01:24:13,070 the secrecy of digital signatures relies on keeping your private key private, 1544 01:24:13,070 --> 01:24:16,280 and that notion does not exist in the world of human signatures, 1545 01:24:16,280 --> 01:24:20,300 and so in that sense, digital signatures are objectively better 1546 01:24:20,300 --> 01:24:24,020 than our old-form human ones. 1547 01:24:24,020 --> 01:24:25,400 Questions now? 1548 01:24:25,400 --> 01:24:29,540 And I know that's a lot, and it's OK if it didn't all go down at once. 1549 01:24:29,540 --> 01:24:34,130 Questions on digital signatures, public key encryption or decryption, 1550 01:24:34,130 --> 01:24:36,080 or anything prior? 1551 01:24:36,080 --> 01:24:39,650 STUDENT: Would these public and private keys be attributed to, what, 1552 01:24:39,650 --> 01:24:41,177 your IP address? 1553 01:24:41,177 --> 01:24:42,510 DAVID J. MALAN: A good question. 1554 01:24:42,510 --> 01:24:43,677 To what are they attributed? 
1555 01:24:43,677 --> 01:24:45,650 Not to your IP address typically. 1556 01:24:45,650 --> 01:24:49,700 They are typically stored in a registry, like a central registry that 1557 01:24:49,700 --> 01:24:53,050 knows that this is Vlad's public key, this is David's public key and so 1558 01:24:53,050 --> 01:24:53,550 forth. 1559 01:24:53,550 --> 01:24:56,630 And it relies on a system of trust and transitivity. 1560 01:24:56,630 --> 01:25:01,940 So if you trust this third party company that is storing all of our public keys, 1561 01:25:01,940 --> 01:25:05,690 then you can trust whoever it is "they" are, in turn, trusting. 1562 01:25:05,690 --> 01:25:07,370 Or it can be more distributed. 1563 01:25:07,370 --> 01:25:09,470 Your public key can literally be distributed 1564 01:25:09,470 --> 01:25:10,825 in the footer of your emails. 1565 01:25:10,825 --> 01:25:12,200 It can be posted on your website. 1566 01:25:12,200 --> 01:25:14,670 It can be on your LinkedIn profile or the like. 1567 01:25:14,670 --> 01:25:17,690 And so long as other people in the world trust 1568 01:25:17,690 --> 01:25:21,200 your emails or your website or LinkedIn, they 1569 01:25:21,200 --> 01:25:23,940 can trust that that is, in fact, your public key. 1570 01:25:23,940 --> 01:25:26,910 So different ways to implement that system of trust. 1571 01:25:26,910 --> 01:25:28,740 Other questions? 1572 01:25:28,740 --> 01:25:33,840 STUDENT: Hashing uses a mathematical function and encryption uses 1573 01:25:33,840 --> 01:25:36,270 a mathematical function plus a key. 1574 01:25:36,270 --> 01:25:42,960 Like the Caesar Cipher basically uses the simple function plus the key. 1575 01:25:42,960 --> 01:25:44,788 Is that analogy correct? 1576 01:25:44,788 --> 01:25:46,330 DAVID J. MALAN: Yes, that is correct. 1577 01:25:46,330 --> 01:25:49,500 And if it helps you-- this is an oversimplification, 1578 01:25:49,500 --> 01:25:54,100 but it's generally helpful, I think, to think of hashing as one-way. 
1579 01:25:54,100 --> 01:26:00,570 So you can only convert a value to a hash value but not the opposite. 1580 01:26:00,570 --> 01:26:04,590 But encryption is like two-way-- 1581 01:26:04,590 --> 01:26:06,660 it's reversible hashing, so to speak. 1582 01:26:06,660 --> 01:26:10,770 The output still looks weird and random, but you can undo the process. 1583 01:26:10,770 --> 01:26:14,460 And one way to think about this is in the world of hashing, 1584 01:26:14,460 --> 01:26:18,510 because I claim that you can take like an infinite domain, 1585 01:26:18,510 --> 01:26:22,200 like any possible message you want to send, and convert it 1586 01:26:22,200 --> 01:26:23,820 to a finite range-- 1587 01:26:23,820 --> 01:26:27,450 for instance, all A-words could have a hash value of 1, 1588 01:26:27,450 --> 01:26:29,940 all B-words could have a hash value of 2. 1589 01:26:29,940 --> 01:26:34,590 That simple example already captures the reality 1590 01:26:34,590 --> 01:26:38,820 that if you only have the hash values 1, 2, 1591 01:26:38,820 --> 01:26:41,550 I have no idea what the original input is. 1592 01:26:41,550 --> 01:26:44,490 And it doesn't matter how hard I try, I'm never going to figure it out 1593 01:26:44,490 --> 01:26:48,930 because it could be apple or avocado or something else that starts with A. 1594 01:26:48,930 --> 01:26:54,030 So hashing in that sense, one-way hashing throws away information such 1595 01:26:54,030 --> 01:26:55,800 that it's not recoverable. 1596 01:26:55,800 --> 01:26:58,780 But encryption does the opposite. 1597 01:26:58,780 --> 01:27:02,430 It would be pretty useless if encryption threw away information 1598 01:27:02,430 --> 01:27:06,360 because the whole point of encryption is to secure messages and information 1599 01:27:06,360 --> 01:27:07,300 we want to send. 1600 01:27:07,300 --> 01:27:13,530 So encryption is reversible; hashing, in general, is not. 
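The contrast between one-way hashing and reversible encryption can be made concrete with a small Python sketch. The function names `toy_hash` and `caesar` are hypothetical, and the hash is the deliberately simplistic first-letter scheme from the discussion above, not a real hash function.

```python
def toy_hash(word: str) -> int:
    """Bucket a word by its first letter: A-words -> 1, B-words -> 2, and so on."""
    return ord(word[0].lower()) - ord("a") + 1


def caesar(text: str, key: int) -> str:
    """Shift each uppercase letter A-Z by `key` places; leave other characters alone."""
    return "".join(
        chr((ord(c) - ord("A") + key) % 26 + ord("A")) if "A" <= c <= "Z" else c
        for c in text
    )


# Hashing throws information away: two different inputs, one output.
print(toy_hash("apple"), toy_hash("avocado"))

# Encryption keeps all the information: the same key undoes it.
ciphertext = caesar("HELLO", 3)
print(ciphertext, caesar(ciphertext, -3))
```

Given only the hash value 1, there is no way to tell apple from avocado, whereas anyone holding the key 3 can turn the ciphertext back into the original message.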
1601 01:27:13,530 --> 01:27:17,880 And, as you know, the key, no pun intended, to encryption 1602 01:27:17,880 --> 01:27:21,900 is necessary so that you can reverse the process in a way that 1603 01:27:21,900 --> 01:27:24,147 remains secret to other people. 1604 01:27:24,147 --> 01:27:26,730 How about one more question, and then we'll take a short break 1605 01:27:26,730 --> 01:27:28,890 and then we'll come back and wrap up. 1606 01:27:28,890 --> 01:27:32,520 STUDENT: Is there any possibility to spoof the signatures? 1607 01:27:32,520 --> 01:27:34,290 DAVID J. MALAN: Short answer, no. 1608 01:27:34,290 --> 01:27:38,280 Like so long as you are using a standard that we believe 1609 01:27:38,280 --> 01:27:43,080 to be correct and not compromised, so long as your private key has not 1610 01:27:43,080 --> 01:27:47,310 been stolen by someone or no one's taken it off of your phone or your computer, 1611 01:27:47,310 --> 01:27:50,280 they should not-- it should not be possible to forge it. 1612 01:27:50,280 --> 01:27:55,500 The probability is so, so, so low, it should be the least of your concerns 1613 01:27:55,500 --> 01:27:57,080 is the idea. 1614 01:27:57,080 --> 01:28:00,640 Now it turns out, there is yet one other application 1615 01:28:00,640 --> 01:28:05,720 of this world of public key cryptography that solves a problem from last time. 1616 01:28:05,720 --> 01:28:09,790 Recall that we ended our first class on a note of emphasizing 1617 01:28:09,790 --> 01:28:14,870 that passwords and password managers can improve our security if used properly, 1618 01:28:14,870 --> 01:28:18,310 but there's another technology that's becoming increasingly available. 1619 01:28:18,310 --> 01:28:21,070 And it's colloquially called passkeys. 1620 01:28:21,070 --> 01:28:23,230 Or more technically, it's an implementation 1621 01:28:23,230 --> 01:28:25,540 of a standard called web authentication. 
1622 01:28:25,540 --> 01:28:27,970 And it turns out that these passkeys, which 1623 01:28:27,970 --> 01:28:31,480 are available on certain platforms and certain websites and evermore 1624 01:28:31,480 --> 01:28:34,870 will be available soon quite shortly, they, too, 1625 01:28:34,870 --> 01:28:38,530 rely on public and private keys as follows. 1626 01:28:38,530 --> 01:28:41,620 And thankfully now, as fancy as the mathematics 1627 01:28:41,620 --> 01:28:44,650 we're alluding to today sound, there really are only two ways 1628 01:28:44,650 --> 01:28:46,570 to use these public and private keys-- 1629 01:28:46,570 --> 01:28:50,930 to either encrypt with one and decrypt with the other or vice versa. 1630 01:28:50,930 --> 01:28:53,170 So we have just a fairly basic building block 1631 01:28:53,170 --> 01:28:55,670 that we can use in one direction or another. 1632 01:28:55,670 --> 01:28:57,650 So how do passkeys work? 1633 01:28:57,650 --> 01:29:00,170 In the near-future, as you will find, when 1634 01:29:00,170 --> 01:29:02,420 you go to certain websites or applications, 1635 01:29:02,420 --> 01:29:07,130 you probably will not be prompted as frequently to type in a username 1636 01:29:07,130 --> 01:29:09,770 and pick a password, which is to say, you 1637 01:29:09,770 --> 01:29:11,870 don't have to generate a hard-to-guess password, 1638 01:29:11,870 --> 01:29:14,037 you don't have to memorize a hard-to-guess password. 1639 01:29:14,037 --> 01:29:17,330 You don't have to even store a hard-to-guess password in a password 1640 01:29:17,330 --> 01:29:21,650 manager because passkeys eliminate passwords. 1641 01:29:21,650 --> 01:29:26,270 It moves us more toward a world of passwordless accounts. 1642 01:29:26,270 --> 01:29:27,890 Now how can that be? 1643 01:29:27,890 --> 01:29:30,650 Because up until now, we've been using usernames and passwords 1644 01:29:30,650 --> 01:29:32,360 to authenticate ourselves. 
1645 01:29:32,360 --> 01:29:35,600 Well, it turns out, we humans have been getting really good at this math, 1646 01:29:35,600 --> 01:29:37,580 even if it doesn't feel like it today, we've 1647 01:29:37,580 --> 01:29:39,620 been getting really good at using mathematics 1648 01:29:39,620 --> 01:29:41,460 to solve these problems as well. 1649 01:29:41,460 --> 01:29:43,220 So imagine the following scenario. 1650 01:29:43,220 --> 01:29:46,550 When you go to a website in the future or app, 1651 01:29:46,550 --> 01:29:49,430 rather than being prompted to create a username and password, 1652 01:29:49,430 --> 01:29:52,220 you'll just be prompted to create a passkey. 1653 01:29:52,220 --> 01:29:56,210 What that means is your laptop or desktop or phone will probably 1654 01:29:56,210 --> 01:29:58,520 prompt you with some form of factor. 1655 01:29:58,520 --> 01:30:02,540 They'll ask you for your fingerprint or they'll ask you for a scan of your face 1656 01:30:02,540 --> 01:30:06,140 or maybe a pin code, a short number that you type in just 1657 01:30:06,140 --> 01:30:08,210 to demonstrate with high probability that you 1658 01:30:08,210 --> 01:30:11,240 are authorized to be using this device and creating this account. 1659 01:30:11,240 --> 01:30:14,840 What then will your device and the website do? 1660 01:30:14,840 --> 01:30:19,280 Your device will generate a public key and a private key 1661 01:30:19,280 --> 01:30:22,670 just for that one website or app. 1662 01:30:22,670 --> 01:30:29,870 Your device will send the public key to that new website, along with your user 1663 01:30:29,870 --> 01:30:32,690 ID or username, some identifying information 1664 01:30:32,690 --> 01:30:35,180 so that they know you're David or someone else. 1665 01:30:35,180 --> 01:30:37,130 But you don't send a password. 1666 01:30:37,130 --> 01:30:41,480 You only send to the website or app your public key. 
1667 01:30:41,480 --> 01:30:43,850 And you keep private, within your browser 1668 01:30:43,850 --> 01:30:47,660 or some other piece of software, your corresponding private key. 1669 01:30:47,660 --> 01:30:52,280 And to be clear, this public-private key pair is used only for this one website. 1670 01:30:52,280 --> 01:30:54,770 You'll do this repeatedly, but automatically 1671 01:30:54,770 --> 01:30:57,810 for every other website in the world in this model. 1672 01:30:57,810 --> 01:31:01,640 So what happens when you're not registering for that website, which 1673 01:31:01,640 --> 01:31:04,700 you've just done, but you want to log into it tomorrow, 1674 01:31:04,700 --> 01:31:06,470 next week, or next year? 1675 01:31:06,470 --> 01:31:08,840 Well, assuming you still have that same device 1676 01:31:08,840 --> 01:31:11,210 or you're using some kind of cloud service 1677 01:31:11,210 --> 01:31:15,380 that synchronizes all of your passkeys, your public and private keys, 1678 01:31:15,380 --> 01:31:16,590 across devices-- 1679 01:31:16,590 --> 01:31:19,220 so you haven't lost these passkeys, here's 1680 01:31:19,220 --> 01:31:22,970 how you would log in to the website tomorrow, next week, or next year. 1681 01:31:22,970 --> 01:31:26,750 The website would send you, when you visit, a challenge, 1682 01:31:26,750 --> 01:31:28,820 and a challenge is like some little message. 1683 01:31:28,820 --> 01:31:31,280 It's like a number or a word or a phrase. 1684 01:31:31,280 --> 01:31:33,890 It's some piece of randomly-generated data 1685 01:31:33,890 --> 01:31:37,070 that the website wants you to digitally sign. 1686 01:31:37,070 --> 01:31:39,200 Well, how do you digitally sign information? 
1687 01:31:39,200 --> 01:31:42,290 I proposed earlier that you can use your private key 1688 01:31:42,290 --> 01:31:47,060 and pass that key and that challenge, which is just a random input given 1689 01:31:47,060 --> 01:31:51,500 to you by the website, into your digital signature algorithm, this black box. 1690 01:31:51,500 --> 01:31:54,380 And the output of that, as before, is your signature. 1691 01:31:54,380 --> 01:31:55,890 And what does your device do? 1692 01:31:55,890 --> 01:32:00,720 It sends that signature for that challenge to the website. 1693 01:32:00,720 --> 01:32:03,187 And if you followed along earlier well enough, 1694 01:32:03,187 --> 01:32:05,270 you might now realize where we're going with this. 1695 01:32:05,270 --> 01:32:11,000 How does the website now verify that that is, in fact, your signature? 1696 01:32:11,000 --> 01:32:15,950 That this did come from David's device and not some adversary online? 1697 01:32:15,950 --> 01:32:19,380 The website, because it stored yesterday, 1698 01:32:19,380 --> 01:32:22,670 last week, last year, your public key, it 1699 01:32:22,670 --> 01:32:26,330 will use your public key to decrypt your signature 1700 01:32:26,330 --> 01:32:31,650 using the same algorithm to get back hopefully the same challenge value. 1701 01:32:31,650 --> 01:32:35,240 And if the output of this verification process 1702 01:32:35,240 --> 01:32:39,650 matches the challenge the website sent you a second before, 1703 01:32:39,650 --> 01:32:42,140 it must be the case mathematically that you 1704 01:32:42,140 --> 01:32:44,870 are, in fact, who you claim to be because it 1705 01:32:44,870 --> 01:32:48,840 was your device that registered for this website a day, a week, 1706 01:32:48,840 --> 01:32:50,520 a year ago as well. 
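The registration and login steps just described can be sketched as a challenge-and-response exchange in Python. Everything here is an illustrative assumption: the `Website` and `Device` classes are hypothetical, the math is toy unpadded RSA, and the key pair is built from two fixed Mersenne primes, whereas a real device would generate a fresh random key pair for each site. This is not the actual web authentication protocol, which involves considerably more.

```python
import hashlib
import secrets

E = 65537  # public exponent shared by this toy scheme


def digest(data: bytes, n: int) -> int:
    """Hash the data and reduce it into the range of the modulus n."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


class Device:
    """Holds the private key; only the public part ever leaves the device."""

    def __init__(self):
        p, q = 2**61 - 1, 2**89 - 1               # assumption: fixed demo primes
        self.public_n = p * q                     # sent to the website
        self._d = pow(E, -1, (p - 1) * (q - 1))   # never sent anywhere

    def sign(self, challenge: bytes) -> int:
        return pow(digest(challenge, self.public_n), self._d, self.public_n)


class Website:
    """Stores public keys only; there is no password to steal."""

    def __init__(self):
        self.public_keys = {}   # username -> public modulus
        self.pending = {}       # username -> outstanding challenge

    def register(self, username: str, public_n: int) -> None:
        self.public_keys[username] = public_n

    def challenge(self, username: str) -> bytes:
        self.pending[username] = secrets.token_bytes(16)   # random data to sign
        return self.pending[username]

    def login(self, username: str, signature: int) -> bool:
        n = self.public_keys[username]
        challenge = self.pending.pop(username)             # each challenge is single-use
        return pow(signature, E, n) == digest(challenge, n)


site, phone = Website(), Device()
site.register("david", phone.public_n)
print(site.login("david", phone.sign(site.challenge("david"))))
```

Because the challenge is random each time, a signature captured during one login would be of no use for the next one.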
1707 01:32:50,520 --> 01:32:53,600 So again, if we trust in the mathematics here 1708 01:32:53,600 --> 01:32:58,150 and we trust that these algorithms allow us to encrypt information and decrypt 1709 01:32:58,150 --> 01:33:01,570 it using a public key and private key, or conversely, 1710 01:33:01,570 --> 01:33:06,640 a private key and public key, we can, with very, very high confidence, 1711 01:33:06,640 --> 01:33:09,820 probabilistically say, yes, this is David Malan, 1712 01:33:09,820 --> 01:33:12,380 I'm going to allow him back into this account. 1713 01:33:12,380 --> 01:33:15,460 So what's the implication of this passwordless world that 1714 01:33:15,460 --> 01:33:18,970 uses passkeys, or web authentication more technically? 1715 01:33:18,970 --> 01:33:21,700 It means that we're getting out of the business, potentially, 1716 01:33:21,700 --> 01:33:24,340 as a society of having to remember dozens 1717 01:33:24,340 --> 01:33:28,360 or hundreds or thousands of different passwords for all of our accounts. 1718 01:33:28,360 --> 01:33:33,640 It does require, though, that we don't lose the device or the devices that 1719 01:33:33,640 --> 01:33:37,480 registered for these websites or apps, but again, increasingly, 1720 01:33:37,480 --> 01:33:41,890 as the world is providing cloud services, whether it's with Apple or Microsoft 1721 01:33:41,890 --> 01:33:45,040 or Google or others, that presumably can synchronize 1722 01:33:45,040 --> 01:33:47,980 your passkeys across devices, and we'll conclude ultimately today, 1723 01:33:47,980 --> 01:33:51,910 by talking about how they can be synchronized securely, even 1724 01:33:51,910 --> 01:33:56,570 without Google and Microsoft and Apple knowing what your own passkeys are, so 1725 01:33:56,570 --> 01:33:59,720 long as they provide us with a certain technical guarantee. 
1726 01:33:59,720 --> 01:34:03,650 So the upside of this is we can move away from passwords, 1727 01:34:03,650 --> 01:34:08,720 and you can even share these passkeys with other people if you so choose. 1728 01:34:08,720 --> 01:34:12,590 The catch is, right now, they're not omnipresently 1729 01:34:12,590 --> 01:34:14,660 available on every website out there. 1730 01:34:14,660 --> 01:34:17,690 It's probably going to take some time for the world to come on board, 1731 01:34:17,690 --> 01:34:20,450 but I do dare say, in the coming weeks, months, and years, 1732 01:34:20,450 --> 01:34:23,550 you will see passkeys increasingly offered to you. 1733 01:34:23,550 --> 01:34:25,550 And so indeed, the next time you visit a website 1734 01:34:25,550 --> 01:34:28,490 that asks you, hey, do you want to register with your fingerprint 1735 01:34:28,490 --> 01:34:30,800 or with your face or with a PIN code? 1736 01:34:30,800 --> 01:34:33,710 And you're never even asked for a password, odds are, 1737 01:34:33,710 --> 01:34:37,650 it's using this passkey technology instead. 1738 01:34:37,650 --> 01:34:40,350 Well, let's go ahead and take one more five-minute break here, 1739 01:34:40,350 --> 01:34:43,070 and when we come back, we'll talk about securing data 1740 01:34:43,070 --> 01:34:47,720 as it's moving back and forth and sitting on our own systems. 1741 01:34:47,720 --> 01:34:49,580 All right, so we are back. 1742 01:34:49,580 --> 01:34:53,000 And allow me to claim that we now have a bunch of ways 1743 01:34:53,000 --> 01:34:57,360 to hash data and also encrypt data and also now, decrypt data. 1744 01:34:57,360 --> 01:34:59,570 So how can we use these building blocks to solve 1745 01:34:59,570 --> 01:35:01,700 some other perhaps familiar problems? 
1746 01:35:01,700 --> 01:35:04,590 Well, there's this notion of encryption in transit, 1747 01:35:04,590 --> 01:35:08,120 which is a fancy way of saying that you and I probably prefer nowadays 1748 01:35:08,120 --> 01:35:11,630 that our data be encrypted whenever it's traveling from point A 1749 01:35:11,630 --> 01:35:15,710 to point B. Whether that point B is Amazon.com, Gmail.com, 1750 01:35:15,710 --> 01:35:18,740 WhatsApp, or any other service that we're communicating with, 1751 01:35:18,740 --> 01:35:23,060 we ideally want no one in between us-- some machine in the middle, so 1752 01:35:23,060 --> 01:35:26,030 to speak, to be able to get at that same data. 1753 01:35:26,030 --> 01:35:28,610 Because in particular, what you should be worried about 1754 01:35:28,610 --> 01:35:32,780 is a scenario like this where if Alice is trying to communicate with Bob, 1755 01:35:32,780 --> 01:35:35,750 you might worry that there's some eavesdropper, so to speak, 1756 01:35:35,750 --> 01:35:38,270 named Eve between Alice and Bob. 1757 01:35:38,270 --> 01:35:40,970 And maybe this is via wires nowadays on the internet. 1758 01:35:40,970 --> 01:35:42,860 Maybe it's somehow wirelessly. 1759 01:35:42,860 --> 01:35:46,940 Maybe Eve actually represents a company that Alice and Bob 1760 01:35:46,940 --> 01:35:51,150 are communicating between, like Gmail or Outlook or the like. 1761 01:35:51,150 --> 01:35:54,980 So encryption in transit, though, is important to distinguish 1762 01:35:54,980 --> 01:35:56,780 from other forms of encryption. 1763 01:35:56,780 --> 01:35:59,660 In particular here, Alice might very well 1764 01:35:59,660 --> 01:36:03,380 have an encrypted connection not to an eavesdropper, per se, but just 1765 01:36:03,380 --> 01:36:05,000 a third party like Gmail. 1766 01:36:05,000 --> 01:36:07,400 So assume that Eve here is Gmail. 
1767 01:36:07,400 --> 01:36:10,640 And meanwhile, Bob, when checking his email account, 1768 01:36:10,640 --> 01:36:14,570 has an encrypted connection to Eve as well, which, in this story now, 1769 01:36:14,570 --> 01:36:15,470 is Gmail. 1770 01:36:15,470 --> 01:36:18,830 So Alice has a secure connection to Gmail and Bob 1771 01:36:18,830 --> 01:36:21,330 has a secure connection to Gmail as well, 1772 01:36:21,330 --> 01:36:26,720 but that does not mean necessarily that Alice has a secure connection to Bob. 1773 01:36:26,720 --> 01:36:31,250 Security does not really work through transitivity, so to speak. 1774 01:36:31,250 --> 01:36:34,490 This might very well mean that the data is only 1775 01:36:34,490 --> 01:36:39,860 encrypted while in transit from A to E and from B to E, 1776 01:36:39,860 --> 01:36:43,310 but that doesn't mean that Eve, or Gmail in this story, 1777 01:36:43,310 --> 01:36:46,040 can't be reading all of Alice's and Bob's emails. 1778 01:36:46,040 --> 01:36:49,850 And indeed, that is technically possible on Google's end. 1779 01:36:49,850 --> 01:36:53,900 They, of course, run all of the servers that your Gmail accounts might be on. 1780 01:36:53,900 --> 01:36:57,110 There's nothing technically probably stopping them 1781 01:36:57,110 --> 01:36:59,030 from reading anything and everything. 1782 01:36:59,030 --> 01:37:00,650 Now hopefully they have policies. 1783 01:37:00,650 --> 01:37:05,330 Hopefully very few humans actually have the privileges or the authorization 1784 01:37:05,330 --> 01:37:07,310 to even do anything close to that. 1785 01:37:07,310 --> 01:37:11,390 But technically speaking, just because Alice has a secure connection to Gmail 1786 01:37:11,390 --> 01:37:13,730 and Bob has a secure connection to Gmail, 1787 01:37:13,730 --> 01:37:16,760 that doesn't mean that their communications will 1788 01:37:16,760 --> 01:37:22,620 be encrypted entirely between A and B. 
And there are lots of examples of this 1789 01:37:22,620 --> 01:37:23,120 as well. 1790 01:37:23,120 --> 01:37:25,610 Zoom, for instance, when it comes to video conferencing, 1791 01:37:25,610 --> 01:37:27,890 you might have an encrypted connection to Zoom, 1792 01:37:27,890 --> 01:37:29,930 I might have an encrypted connection to Zoom. 1793 01:37:29,930 --> 01:37:34,610 That does not necessarily mean that Zoom couldn't be Eve in this story 1794 01:37:34,610 --> 01:37:39,350 listening and watching everything that we're saying while video conferencing 1795 01:37:39,350 --> 01:37:40,050 as well. 1796 01:37:40,050 --> 01:37:44,600 So encryption in transit is good in that it at least keeps random people out 1797 01:37:44,600 --> 01:37:48,380 of the picture because they don't have access to these encrypted channels, 1798 01:37:48,380 --> 01:37:51,560 but if there is this third party, this machine in the middle 1799 01:37:51,560 --> 01:37:55,700 or company in the middle, even they might have access to data that we 1800 01:37:55,700 --> 01:37:58,020 do not want them to have access to. 1801 01:37:58,020 --> 01:38:00,770 So what, then, is a stronger alternative? 1802 01:38:00,770 --> 01:38:06,590 Increasingly possible, increasingly available, and something you as a user 1803 01:38:06,590 --> 01:38:09,590 should be looking for with greater frequency is what 1804 01:38:09,590 --> 01:38:11,940 we would call an end-to-end encryption. 1805 01:38:11,940 --> 01:38:14,780 This is a stronger guarantee whereby you can 1806 01:38:14,780 --> 01:38:19,430 trust that Alice's connection to Bob is, in fact, secure 1807 01:38:19,430 --> 01:38:25,460 even if-- not pictured here, there are 1, 2, 3, 4 machines in the middle, 1808 01:38:25,460 --> 01:38:28,520 companies in the middle, eavesdroppers in the middle. 
1809 01:38:28,520 --> 01:38:32,900 If you use encryption properly end-to-end, 1810 01:38:32,900 --> 01:38:38,750 you can ensure that the only thing Eve or Google or Zoom can see 1811 01:38:38,750 --> 01:38:43,250 is just your ciphertext, the seemingly random strings of text 1812 01:38:43,250 --> 01:38:47,810 or 0's and 1's that represent your encrypted data, but without your key, 1813 01:38:47,810 --> 01:38:51,240 they have no idea what that data actually is. 1814 01:38:51,240 --> 01:38:54,170 So end-to-end encryption isn't necessarily in most 1815 01:38:54,170 --> 01:38:55,320 companies' best interest. 1816 01:38:55,320 --> 01:38:55,820 Why? 1817 01:38:55,820 --> 01:38:59,440 Well, companies like Gmail tend to presumably mine our data, 1818 01:38:59,440 --> 01:39:01,960 whether it's for advertising purposes or otherwise. 1819 01:39:01,960 --> 01:39:05,610 And so it's sometimes in companies' interest to have access to your data 1820 01:39:05,610 --> 01:39:08,880 to keep it secure on their servers, but still 1821 01:39:08,880 --> 01:39:10,990 in a way that they have access to it. 1822 01:39:10,990 --> 01:39:13,750 Now that might be not comfortable for you. 1823 01:39:13,750 --> 01:39:15,330 And so there are alternatives. 1824 01:39:15,330 --> 01:39:18,720 For instance, iMessage for Apple users and WhatsApp 1825 01:39:18,720 --> 01:39:23,370 internationally are known in particular for offering end-to-end encryption 1826 01:39:23,370 --> 01:39:27,300 which, if implemented truthfully and technically correctly, 1827 01:39:27,300 --> 01:39:29,880 should guarantee that even though your messages might 1828 01:39:29,880 --> 01:39:33,720 be going through WhatsApp servers, no employee at WhatsApp 1829 01:39:33,720 --> 01:39:36,300 can actually see your messages because they're encrypted 1830 01:39:36,300 --> 01:39:39,660 all the way from A to B, even though they're 1831 01:39:39,660 --> 01:39:42,060 going through a potential eavesdropper. 
1832 01:39:42,060 --> 01:39:45,570 But that depends on exactly what form of encryption you're using, 1833 01:39:45,570 --> 01:39:47,880 and if it's not end-to-end, it might only 1834 01:39:47,880 --> 01:39:51,690 be encrypted in transit such that Eve, that eavesdropper, 1835 01:39:51,690 --> 01:39:54,040 might indeed have access to the data. 1836 01:39:54,040 --> 01:39:56,850 So as to how you can use end-to-end encryption, 1837 01:39:56,850 --> 01:40:00,570 it's an option that a service must provide to you in this case 1838 01:40:00,570 --> 01:40:02,760 or you must choose services that offer it. 1839 01:40:02,760 --> 01:40:05,370 It's not necessarily something that's always available, 1840 01:40:05,370 --> 01:40:09,720 but it is increasingly available in different software. 1841 01:40:09,720 --> 01:40:12,960 So let's now consider a fairly mundane operation, 1842 01:40:12,960 --> 01:40:16,440 but one that has implications for these same technologies and solutions. 1843 01:40:16,440 --> 01:40:19,500 That is, deleting a file, be it on your Mac or your PC 1844 01:40:19,500 --> 01:40:22,090 or your phone or some other device. 1845 01:40:22,090 --> 01:40:24,390 Now where is data stored in your devices? 1846 01:40:24,390 --> 01:40:26,500 Well generally, it might be in a device like this, 1847 01:40:26,500 --> 01:40:29,220 a large, somewhat older but large hard drive that 1848 01:40:29,220 --> 01:40:31,920 can store lots and lots of files and folders, 1849 01:40:31,920 --> 01:40:34,620 or perhaps something smaller known as a solid state 1850 01:40:34,620 --> 01:40:37,860 drive that might store information entirely digitally 1851 01:40:37,860 --> 01:40:39,495 without any moving parts. 1852 01:40:39,495 --> 01:40:41,370 And even smaller might be something like this 1853 01:40:41,370 --> 01:40:44,820 that you carry around like a USB stick, and they are even smaller nowadays, 1854 01:40:44,820 --> 01:40:48,000 too, that similarly stores some data digitally. 
1855 01:40:48,000 --> 01:40:51,540 Now how do we go about deleting files from a computer or any 1856 01:40:51,540 --> 01:40:52,500 of these devices? 1857 01:40:52,500 --> 01:40:56,370 Well, you typically click it and drag it somewhere, or maybe you right-click it 1858 01:40:56,370 --> 01:40:59,370 or maybe you tap and drag it to some trash or the like. 1859 01:40:59,370 --> 01:41:02,700 There's any number of user interface mechanisms for deleting files, 1860 01:41:02,700 --> 01:41:06,240 but let's consider for our purposes what happens underneath the hood. 1861 01:41:06,240 --> 01:41:09,780 So let me stipulate that your hard drive, your solid state 1862 01:41:09,780 --> 01:41:12,540 drive, your USB stick just contains ultimately 1863 01:41:12,540 --> 01:41:17,760 a whole bunch of 0's and 1's, and those 0's and 1's represent your files 1864 01:41:17,760 --> 01:41:18,910 and folders. 1865 01:41:18,910 --> 01:41:22,620 So when you go about deleting a file, by dragging it 1866 01:41:22,620 --> 01:41:26,220 to the recycle bin on Windows, or dragging it to the Trash 1867 01:41:26,220 --> 01:41:29,520 Can on macOS, what actually happens? 1868 01:41:29,520 --> 01:41:33,240 Well, it turns out, not anything at all, really. 1869 01:41:33,240 --> 01:41:37,590 When you recycle a file on Windows or when you trash a file on macOS, 1870 01:41:37,590 --> 01:41:42,000 it doesn't actually get deleted in the sense that you and I might expect. 1871 01:41:42,000 --> 01:41:43,920 By delete it, I mean it's gone. 1872 01:41:43,920 --> 01:41:45,780 I don't want to be able to find it anywhere. 1873 01:41:45,780 --> 01:41:47,190 OK, wait a minute, though. 1874 01:41:47,190 --> 01:41:49,920 Of course, we all know by now, at least on computers, 1875 01:41:49,920 --> 01:41:53,380 you at least have to empty the Recycle Bin or empty the Trash Can. 1876 01:41:53,380 --> 01:41:55,360 So OK, maybe I missed that step. 
1877 01:41:55,360 --> 01:41:58,140 But even then, contrary to what you might expect, 1878 01:41:58,140 --> 01:42:02,910 emptying the Recycle Bin, emptying the Trash Can also does not generally 1879 01:42:02,910 --> 01:42:04,260 delete the data. 1880 01:42:04,260 --> 01:42:06,510 And here's where I'd, again, emphasize, wait a minute, 1881 01:42:06,510 --> 01:42:10,890 when I delete a file, I want it gone, removed from my computer altogether. 1882 01:42:10,890 --> 01:42:16,050 But what macOS and Windows and operating systems in general tend to do instead, 1883 01:42:16,050 --> 01:42:19,080 when you even empty the Recycle Bin or Trash Can, 1884 01:42:19,080 --> 01:42:23,880 they don't actually get rid of the file, per se, they just forget where it is. 1885 01:42:23,880 --> 01:42:26,100 Somewhere in the computer's memory, there's 1886 01:42:26,100 --> 01:42:29,310 like a spreadsheet of sorts, some kind of database or table 1887 01:42:29,310 --> 01:42:32,520 with at least two columns, one of which has the name of your file 1888 01:42:32,520 --> 01:42:35,400 or the location of your file, the other of which 1889 01:42:35,400 --> 01:42:40,770 has some kind of reference to which 0's and 1's on your actual computer 1890 01:42:40,770 --> 01:42:43,200 implement that specific file. 1891 01:42:43,200 --> 01:42:46,650 Maybe these 0's and 1's are for one file, these 0's and 1's are 1892 01:42:46,650 --> 01:42:48,160 for another file, and so forth. 1893 01:42:48,160 --> 01:42:50,970 So somewhere, your computer is keeping track of what 1894 01:42:50,970 --> 01:42:53,410 is where physically on your computer. 1895 01:42:53,410 --> 01:42:56,160 But when you delete a file by emptying the Trash or Recycle Bin, 1896 01:42:56,160 --> 01:42:58,670 the computer just, eh, forgets where it is. 1897 01:42:58,670 --> 01:43:02,970 And more importantly, it frees up the space so it can be used later. 1898 01:43:02,970 --> 01:43:04,260 So what do I mean by that? 
1899 01:43:04,260 --> 01:43:07,410 Well, suppose I do go ahead and delete a file 1900 01:43:07,410 --> 01:43:09,980 and empty the Recycle Bin or Trash Can, and suppose 1901 01:43:09,980 --> 01:43:15,530 that these yellow 0's and 1's represent the file that I no longer care about. 1902 01:43:15,530 --> 01:43:19,080 Well, what's actually going to happen underneath the hood, so to speak, 1903 01:43:19,080 --> 01:43:19,940 of the computer? 1904 01:43:19,940 --> 01:43:24,020 Well eventually, some of those yellow 0's and 1's might just 1905 01:43:24,020 --> 01:43:26,030 get reused for other files. 1906 01:43:26,030 --> 01:43:29,630 In other words, these 0's and 1's highlighted in yellow 1907 01:43:29,630 --> 01:43:32,390 represent a file that used to be there, but no longer is. 1908 01:43:32,390 --> 01:43:36,320 That is equivalent to saying some other file can now use those same 1909 01:43:36,320 --> 01:43:37,260 0's and 1's. 1910 01:43:37,260 --> 01:43:41,510 And so here's some random 0's and 1's that maybe overwrite some of the file, 1911 01:43:41,510 --> 01:43:42,680 but not all of it. 1912 01:43:42,680 --> 01:43:45,590 Notice, there's still a bunch of yellow 0's and 1's here 1913 01:43:45,590 --> 01:43:48,180 in my depiction of my computer. 1914 01:43:48,180 --> 01:43:53,510 So it turns out that over time, yes, your file will probably 1915 01:43:53,510 --> 01:43:55,100 get actually deleted. 1916 01:43:55,100 --> 01:43:56,270 What do I mean by that? 1917 01:43:56,270 --> 01:44:00,830 Eventually those 0's and 1's will be repurposed, changed from 1 to 0, 1918 01:44:00,830 --> 01:44:05,190 changed from 0 to 1 such that your file, for all intents and purposes, 1919 01:44:05,190 --> 01:44:09,530 is actually gone, because it's been repurposed, that space, altogether.
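To make the "forgets where it is" idea concrete, here is a minimal sketch in Python of a toy disk with a lookup table. It is only an illustration of the behavior described above, not how any real operating system or filesystem is implemented; the names ToyDisk, write, and delete are hypothetical.

```python
# Toy model of a disk: a flat list of blocks plus a table that maps
# each filename to the blocks holding its data.
# ToyDisk and its methods are hypothetical names, for illustration only.

class ToyDisk:
    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks      # the raw "0's and 1's"
        self.table = {}                       # filename -> list of block indexes
        self.free = list(range(num_blocks))   # indexes available for reuse

    def write(self, name, chunks):
        used = []
        for chunk in chunks:
            index = self.free.pop(0)          # take the lowest-numbered free block
            self.blocks[index] = chunk
            used.append(index)
        self.table[name] = used

    def delete(self, name):
        # Ordinary deletion: forget where the file is and mark its blocks
        # as reusable. The block contents themselves are left untouched.
        freed = self.table.pop(name)
        self.free = sorted(self.free + freed)


disk = ToyDisk(4)
disk.write("secret.txt", [b"my pass", b"word is", b" apple"])
disk.delete("secret.txt")
remnants_after_delete = list(disk.blocks)

disk.write("new.txt", [b"hello"])             # reuses only the first freed block
remnants_after_reuse = list(disk.blocks)

print(remnants_after_delete)
print(remnants_after_reuse)
```

After the delete, the table no longer lists secret.txt, yet all three of its blocks are still on the toy disk. After the new file is written, only one of them has been overwritten, which mirrors the leftover yellow 0's and 1's in the depiction.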
1920 01:44:09,530 --> 01:44:12,020 But notice, at least at this point in time, 1921 01:44:12,020 --> 01:44:15,770 and shortly after you delete a file, even if you've created or downloaded 1922 01:44:15,770 --> 01:44:18,740 new files, there might still be parts of your files 1923 01:44:18,740 --> 01:44:23,900 around, which means that sensitive Word document or Excel file or images 1924 01:44:23,900 --> 01:44:27,020 that you had on your computer, there might still be remnants of them, 1925 01:44:27,020 --> 01:44:29,820 just a few lines from any of those. 1926 01:44:29,820 --> 01:44:33,170 So you should realize that deleting a file doesn't really get rid of it 1927 01:44:33,170 --> 01:44:35,480 in the way you might expect or hope. 1928 01:44:35,480 --> 01:44:39,320 To do that, you need to be a little better with practices. 1929 01:44:39,320 --> 01:44:41,180 Now what do I mean by this? 1930 01:44:41,180 --> 01:44:44,750 Secure deletion is another beast altogether. 1931 01:44:44,750 --> 01:44:48,500 And typically when we delete files, they're not deleted securely. 1932 01:44:48,500 --> 01:44:51,740 They're not deleted typically in a way that you would hope. 1933 01:44:51,740 --> 01:44:55,790 So secure deletion does what you might really hope for, get rid of this file 1934 01:44:55,790 --> 01:44:56,400 altogether. 1935 01:44:56,400 --> 01:44:59,210 So if we go back to the original contents of my computer 1936 01:44:59,210 --> 01:45:02,180 with all of these here 0's and 1's, and suppose 1937 01:45:02,180 --> 01:45:05,300 that I want to delete this file here at the top of the screen, 1938 01:45:05,300 --> 01:45:10,188 in an extreme ideal world, those 0's and 1's would just be gone. 1939 01:45:10,188 --> 01:45:11,480 Like that's pretty darn secure. 1940 01:45:11,480 --> 01:45:15,080 Those bits, those 0's and 1's, they don't even exist anymore.
1941 01:45:15,080 --> 01:45:18,890 Now this is probably not the best way to securely delete information 1942 01:45:18,890 --> 01:45:23,090 because if I just got rid of those 0's and 1's somehow, like my hard drive 1943 01:45:23,090 --> 01:45:25,610 is getting like literally smaller and smaller 1944 01:45:25,610 --> 01:45:29,270 in terms of how much stuff I can put on it if I don't have as many bits 1945 01:45:29,270 --> 01:45:30,860 or 0's and 1's available. 1946 01:45:30,860 --> 01:45:32,990 So that's probably not the best long-term solution 1947 01:45:32,990 --> 01:45:34,040 because it's expensive. 1948 01:45:34,040 --> 01:45:36,750 It's like getting rid of some of my capacity. 1949 01:45:36,750 --> 01:45:41,630 So we don't actually do that, but how might we securely delete a file? 1950 01:45:41,630 --> 01:45:46,040 I don't think we want to just wait and hope that those 0's and 1's eventually 1951 01:45:46,040 --> 01:45:49,160 get reused by the system because we might still 1952 01:45:49,160 --> 01:45:52,500 be left with some remnants which might not be ideal. 1953 01:45:52,500 --> 01:45:56,450 So what we can do when securely deleting a file is something like this-- 1954 01:45:56,450 --> 01:46:00,250 change all of the 0's and 1's that we don't care about anymore or want, 1955 01:46:00,250 --> 01:46:01,790 change them all to 0's. 1956 01:46:01,790 --> 01:46:05,900 And this will effectively securely delete the file 1957 01:46:05,900 --> 01:46:09,200 because now the 1's that were previously there 1958 01:46:09,200 --> 01:46:12,620 that represented some piece of information are just completely gone. 1959 01:46:12,620 --> 01:46:15,230 Or equivalently, I could change them all to 1's. 1960 01:46:15,230 --> 01:46:18,080 Or I could even change it to random 0's and 1's. 
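The overwrite idea just described (all 0's, all 1's, or random bits) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a vetted secure-deletion tool: the function name overwrite and the pattern labels are hypothetical, and on real filesystems and solid state drives the new bytes are not guaranteed to land on the same physical cells as the old ones.

```python
import os
import tempfile


def overwrite(path, pattern="zeros"):
    """Replace every byte of the file at `path` before it is deleted.

    `pattern` is a hypothetical label: "zeros", "ones", or "random".
    """
    size = os.path.getsize(path)
    if pattern == "zeros":
        data = b"\x00" * size          # every bit becomes 0
    elif pattern == "ones":
        data = b"\xff" * size          # every bit becomes 1
    else:
        data = os.urandom(size)        # random 0's and 1's
    with open(path, "r+b") as f:       # open in place, without truncating
        f.write(data)
        f.flush()
        os.fsync(f.fileno())           # ask the OS to push the bytes to the device


# Demonstration on a throwaway temporary file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"sensitive data")

overwrite(demo_path, "zeros")
with open(demo_path, "rb") as f:
    after_overwrite = f.read()         # what a later reader of the file would see

os.remove(demo_path)                   # only now forget where the file was
```

The ordering is the point of the sketch: the contents are replaced first, and the ordinary delete happens second, so whatever the system later "forgets" no longer holds the original information.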
1961 01:46:18,080 --> 01:46:20,990 The point is, to securely delete a file, you 1962 01:46:20,990 --> 01:46:26,360 should change all of the 0's and 1's to at least some other pattern 1963 01:46:26,360 --> 01:46:28,640 so that the file is effectively gone. 1964 01:46:28,640 --> 01:46:31,820 Now how can you use this to your benefit? 1965 01:46:31,820 --> 01:46:34,040 Well, some operating systems nowadays support 1966 01:46:34,040 --> 01:46:38,640 what's called full-disk encryption, and this is good for a number of reasons. 1967 01:46:38,640 --> 01:46:41,790 One, if you enable a feature called full-disk encryption, 1968 01:46:41,790 --> 01:46:46,430 which is actually a specific incarnation of an idea known as encryption at rest. 1969 01:46:46,430 --> 01:46:49,910 Encryption in transit refers, of course, to your data going back and forth 1970 01:46:49,910 --> 01:46:52,280 from point A to point B. Encryption at rest 1971 01:46:52,280 --> 01:46:56,240 means it's just sitting there on your device, in your pocket, or on your lap 1972 01:46:56,240 --> 01:47:00,900 or on your desktop, sitting unused, maybe on or off. 1973 01:47:00,900 --> 01:47:04,700 So when it comes to full-disk encryption or encryption at rest, 1974 01:47:04,700 --> 01:47:09,170 you ideally want all of your data somehow encrypted on your Mac, 1975 01:47:09,170 --> 01:47:11,420 on your PC, on your phone. 1976 01:47:11,420 --> 01:47:14,540 And only when you log in with your password or maybe 1977 01:47:14,540 --> 01:47:19,740 your fingerprint or your face should that data be decrypted automatically, 1978 01:47:19,740 --> 01:47:23,240 and this can happen pretty darn fast nowadays with modern hardware, 1979 01:47:23,240 --> 01:47:25,850 should the data be unencrypted so you can actually 1980 01:47:25,850 --> 01:47:28,620 use it and interact with that device. 1981 01:47:28,620 --> 01:47:31,220 So why is this advantageous? 
1982 01:47:31,220 --> 01:47:34,490 Well, one, if your device gets stolen, so long 1983 01:47:34,490 --> 01:47:37,520 as you're not logged into it, so long as it's locked, 1984 01:47:37,520 --> 01:47:41,060 so long as the lid is closed, so long as it's unplugged or any other number 1985 01:47:41,060 --> 01:47:45,920 of scenarios, at least if someone takes your laptop from the table in Starbucks 1986 01:47:45,920 --> 01:47:48,980 or the cafe, well, hopefully, if you have 1987 01:47:48,980 --> 01:47:51,635 a good password or good biometrics, they're 1988 01:47:51,635 --> 01:47:53,510 not going to be able to get any of your data. 1989 01:47:53,510 --> 01:47:56,190 They can maybe delete all of your data and they can 1990 01:47:56,190 --> 01:47:59,790 sell your computer, they can use your computer, but they probably, 1991 01:47:59,790 --> 01:48:03,180 if you're practicing best practices, don't have access 1992 01:48:03,180 --> 01:48:04,660 to the data that's on the system. 1993 01:48:04,660 --> 01:48:05,160 Why? 1994 01:48:05,160 --> 01:48:09,470 Because it's completely encrypted at rest and they don't know your password, 1995 01:48:09,470 --> 01:48:11,970 they don't have your fingerprint, they don't have your face, 1996 01:48:11,970 --> 01:48:14,470 they should not be able to decrypt that data. 1997 01:48:14,470 --> 01:48:17,790 So in other words, if this is my unencrypted data, 1998 01:48:17,790 --> 01:48:20,910 the way I want it and need it when I'm using my computer, 1999 01:48:20,910 --> 01:48:25,590 full-disk encryption, at rest, would change my entire computer 2000 01:48:25,590 --> 01:48:26,610 to look random. 2001 01:48:26,610 --> 01:48:30,630 These are random 0's and 1's now that I generated by using, 2002 01:48:30,630 --> 01:48:34,020 for instance, my password or my fingerprint or my face.
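As a rough illustration of data that looks random at rest and becomes readable again only with the right password, here is a toy sketch in Python. It is emphatically not real full-disk encryption: actual systems rely on vetted ciphers such as AES, and the helper names derive_key, keystream, and xor_with_keystream, the sample password, and the sample disk contents are all assumptions made up for this example.

```python
import hashlib
import os


def derive_key(password, salt):
    # Stretch a password into a 32-byte key (PBKDF2 from the standard library).
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def keystream(key, length):
    # Toy keystream: hash the key together with a running counter.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_with_keystream(data, key):
    # XOR is its own inverse, so the same function encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


salt = os.urandom(16)
disk_contents = b"tax return, photos, passwords.txt"

key = derive_key("correct horse battery staple", salt)
at_rest = xor_with_keystream(disk_contents, key)      # what a thief would see
unlocked = xor_with_keystream(at_rest, key)           # after logging in
wrong_guess = xor_with_keystream(at_rest, derive_key("guess", salt))
```

With the right password the original bytes come back; with a wrong one the result is still noise. That is also why throwing away the key, or never revealing the password, leaves the stored bits effectively unreadable.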
2003 01:48:34,020 --> 01:48:37,350 And this is what your hard drive or your solid state drive 2004 01:48:37,350 --> 01:48:41,910 should look like when the lid is closed, when the power is off. 2005 01:48:41,910 --> 01:48:46,020 When you are logged out of it, it should be random 0's and 1's. 2006 01:48:46,020 --> 01:48:49,080 And the upside of this now is that, again, 2007 01:48:49,080 --> 01:48:53,910 if it's stolen while in this state, there's no data to be used 2008 01:48:53,910 --> 01:48:56,890 by the adversary because it looks like random 0's and 1's. 2009 01:48:56,890 --> 01:49:00,010 Better yet, if you deliberately want to get rid of the device 2010 01:49:00,010 --> 01:49:02,710 because you want to trade it in for resale value, 2011 01:49:02,710 --> 01:49:04,720 because you want to donate it to someone else, 2012 01:49:04,720 --> 01:49:06,880 because you want to sell it to someone online, 2013 01:49:06,880 --> 01:49:09,820 when using full-disk encryption, the upside 2014 01:49:09,820 --> 01:49:14,390 is that so long as you had a really hard-to-guess password, your data is, 2015 01:49:14,390 --> 01:49:17,800 for all intents and purposes, securely deleted already. 2016 01:49:17,800 --> 01:49:21,040 Because unless the new buyer figures out or knows 2017 01:49:21,040 --> 01:49:24,190 your password or has your same fingerprint or has your same face, 2018 01:49:24,190 --> 01:49:26,870 they're not going to be able to access any of your data anyway. 2019 01:49:26,870 --> 01:49:31,300 And this is important nowadays because it turns out, with modern hardware, 2020 01:49:31,300 --> 01:49:36,970 even if you might want to change all of the 0's and 1's to all 0's or all 1's 2021 01:49:36,970 --> 01:49:42,080 or all random data, it turns out that today's hardware can fail over time. 2022 01:49:42,080 --> 01:49:47,860 So even little USB sticks or solid state drives over time can kind of wear out.
2023 01:49:47,860 --> 01:49:49,930 But they're smart enough, thanks to software 2024 01:49:49,930 --> 01:49:53,920 known as firmware inside of it, as soon as the device realizes, wait a minute, 2025 01:49:53,920 --> 01:49:56,770 those bits over there aren't working properly anymore, 2026 01:49:56,770 --> 01:50:02,200 the device might not let you change them to all 0's or all 1's or random 0's 2027 01:50:02,200 --> 01:50:03,260 and 1's anymore. 2028 01:50:03,260 --> 01:50:06,350 It might just leave them as is forever. 2029 01:50:06,350 --> 01:50:09,070 Which is to say, it's even more important to start 2030 01:50:09,070 --> 01:50:11,740 using full-disk encryption, encryption at rest, 2031 01:50:11,740 --> 01:50:14,620 when you first get a device because that way, 2032 01:50:14,620 --> 01:50:18,040 you can trust that even if parts of the device degrade over time, 2033 01:50:18,040 --> 01:50:20,560 all of the data that's there and has been there 2034 01:50:20,560 --> 01:50:25,900 was at least encrypted with one of your passwords or one of your biometrics 2035 01:50:25,900 --> 01:50:26,800 in the past. 2036 01:50:26,800 --> 01:50:30,700 So this is the kind of feature to look for in your Mac, your PC, or your phone 2037 01:50:30,700 --> 01:50:33,580 to ensure that it is somehow enabled. 2038 01:50:33,580 --> 01:50:36,160 Thankfully, once you log back in with your password, 2039 01:50:36,160 --> 01:50:38,860 it goes back to the original data and you can use it.
2040 01:50:38,860 --> 01:50:42,190 Of course, then, an implication of this best practice 2041 01:50:42,190 --> 01:50:45,250 is that if you lose your laptop or your phone 2042 01:50:45,250 --> 01:50:48,820 or your desktop's password, or your fingerprint somehow changed, 2043 01:50:48,820 --> 01:50:51,400 or your face sufficiently changes, you might be locked out 2044 01:50:51,400 --> 01:50:54,310 of all of your data, too, but again, that's 2045 01:50:54,310 --> 01:50:59,480 just another example of this trade-off between usability and security as well. 2046 01:50:59,480 --> 01:51:02,320 Now a downside, an evil side to full-disk encryption 2047 01:51:02,320 --> 01:51:06,200 is ransomware, which is how adversaries are monetizing attacks. 2048 01:51:06,200 --> 01:51:09,460 It's not uncommon nowadays for hackers, for adversaries, 2049 01:51:09,460 --> 01:51:12,160 when they get into a system, whether it's your laptop 2050 01:51:12,160 --> 01:51:16,330 or, for instance, a corporate network, or in some cases, hospital 2051 01:51:16,330 --> 01:51:21,220 systems or a city's own computer networks, to not try to do any damage 2052 01:51:21,220 --> 01:51:24,280 or just do something like spam or cryptocurrency mining, 2053 01:51:24,280 --> 01:51:30,850 but to actually encrypt all of the data on these systems they somehow 2054 01:51:30,850 --> 01:51:32,500 accessed online. 2055 01:51:32,500 --> 01:51:33,220 Why? 2056 01:51:33,220 --> 01:51:36,610 Well, if they encrypt all of the data they can then ask for a ransom 2057 01:51:36,610 --> 01:51:39,670 and say, listen, if you don't give me this many bitcoins, 2058 01:51:39,670 --> 01:51:44,480 I'm not going to give you the key that I used to encrypt your data.
2059 01:51:44,480 --> 01:51:47,590 And if you poke around online, there have been many examples of this, 2060 01:51:47,590 --> 01:51:51,190 unfortunately, where hackers have gotten into systems that were not 2061 01:51:51,190 --> 01:51:55,460 very well-protected, all of the data therein was encrypted, 2062 01:51:55,460 --> 01:51:58,450 and this is an opportunity for the adversaries 2063 01:51:58,450 --> 01:52:02,290 to try to extort, say, financial gain from a situation 2064 01:52:02,290 --> 01:52:07,360 by then only handing you the keys, if ever, once you've actually paid up. 2065 01:52:07,360 --> 01:52:10,000 And there, too, there's the risk, as in any ransom scenario, 2066 01:52:10,000 --> 01:52:14,240 where who even knows if they're going to give you the proper key in the end, 2067 01:52:14,240 --> 01:52:17,800 but this is increasingly a concern for municipalities, for companies, 2068 01:52:17,800 --> 01:52:19,340 for universities, and the like. 2069 01:52:19,340 --> 01:52:22,090 So just as we have some upsides here, there, 2070 01:52:22,090 --> 01:52:24,740 too, is this trade-off in what you can do. 2071 01:52:24,740 --> 01:52:27,820 And lastly, we thought we'd end on a note about the future 2072 01:52:27,820 --> 01:52:29,560 because this is a topic that will come up 2073 01:52:29,560 --> 01:52:33,290 and has come up over time, this topic of quantum computing. 2074 01:52:33,290 --> 01:52:35,800 So for those less familiar, we've been talking a lot 2075 01:52:35,800 --> 01:52:38,290 about bits, 0's and 1's today, and at the end 2076 01:52:38,290 --> 01:52:41,140 of the day that's how today's computer systems are implemented. 2077 01:52:41,140 --> 01:52:44,950 Patterns of 0's and 1's to represent numbers and letters and colors 2078 01:52:44,950 --> 01:52:47,320 and videos and sounds and everything. 2079 01:52:47,320 --> 01:52:50,260 We've been discussing today data more generally. 
2080 01:52:50,260 --> 01:52:56,950 Now typically, in our world now, a bit, a binary digit, can either be a 0 2081 01:52:56,950 --> 01:53:02,020 or it can be a 1, as per the diagram we had on the screen in these examples. 2082 01:53:02,020 --> 01:53:04,090 Either a 0 or a 1. 2083 01:53:04,090 --> 01:53:08,950 In the world of quantum computing, thanks to some very fancy physics 2084 01:53:08,950 --> 01:53:12,670 and quantum mechanics in particular, it is possible, 2085 01:53:12,670 --> 01:53:17,380 it seems, physically, for us to implement the idea of bits a little bit 2086 01:53:17,380 --> 01:53:20,030 differently using quantum techniques. 2087 01:53:20,030 --> 01:53:26,080 And there's this idea of not just a bit, but a quantum bit or qubit whose power 2088 01:53:26,080 --> 01:53:28,900 derives from the reality that physically, you 2089 01:53:28,900 --> 01:53:33,550 can implement a qubit in such a way that it is representing both a 0 2090 01:53:33,550 --> 01:53:37,130 and a 1 at the exact same time. 2091 01:53:37,130 --> 01:53:39,970 So it can be not in just one state, so to speak, 2092 01:53:39,970 --> 01:53:44,000 one condition at once, but two states at once. 2093 01:53:44,000 --> 01:53:47,740 And if you have two qubits, they can be in four states at once. 2094 01:53:47,740 --> 01:53:50,530 If you have three, they can be in eight states at once. 2095 01:53:50,530 --> 01:53:55,270 If you have 32 of them, they can be in 4 billion states at once. 2096 01:53:55,270 --> 01:53:57,270 Now what's the implication of this?
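The doubling just described is ordinary arithmetic: under the simplified picture used here, n qubits correspond to 2 to the power n states. A quick Python check of the figures mentioned (the function name states is just a label for this sketch):

```python
def states(qubits):
    # Each additional qubit doubles the number of states.
    return 2 ** qubits


for n in (1, 2, 3, 32):
    print(n, states(n))
# prints:
# 1 2
# 2 4
# 3 8
# 32 4294967296
```

So "4 billion" for 32 qubits is a rounding of 4,294,967,296.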
2097 01:53:57,270 --> 01:53:59,240 Well, when we talk about cryptography, when 2098 01:53:59,240 --> 01:54:02,870 we talk about hashing, when we talk about just very large numbers 2099 01:54:02,870 --> 01:54:05,900 and trying to figure out via brute force or some other mechanism 2100 01:54:05,900 --> 01:54:12,530 what some input to a function was, if you have exponentially more computing 2101 01:54:12,530 --> 01:54:15,560 capabilities by being able to do not one or two 2102 01:54:15,560 --> 01:54:20,520 things at a time with individual bits, but two or four or eight or 4 2103 01:54:20,520 --> 01:54:23,540 billion things at once, it stands to reason 2104 01:54:23,540 --> 01:54:27,920 that if adversaries have access to quantum computing before you 2105 01:54:27,920 --> 01:54:31,700 and I do, then all of the security you and I now rely on 2106 01:54:31,700 --> 01:54:35,990 and that we've talked about today could suddenly become insecure. 2107 01:54:35,990 --> 01:54:38,120 Because we're trusting right now that it's just 2108 01:54:38,120 --> 01:54:40,340 going to take the adversary a lot, a lot, 2109 01:54:40,340 --> 01:54:42,590 a lot of time, maybe money, maybe resources, 2110 01:54:42,590 --> 01:54:44,870 maybe risk to attack our accounts. 2111 01:54:44,870 --> 01:54:49,170 But if they have exponentially more resources than you and me, 2112 01:54:49,170 --> 01:54:51,830 then our data really is at risk. 2113 01:54:51,830 --> 01:54:56,410 And all of the mathematics we've been trusting needs to be hardened instead. 2114 01:54:56,410 --> 01:55:00,370 Now hopefully you and I will have access to quantum computing at the same time 2115 01:55:00,370 --> 01:55:03,110 as or ideally before all of these adversaries, 2116 01:55:03,110 --> 01:55:06,040 so hopefully our algorithms for securing information 2117 01:55:06,040 --> 01:55:08,990 will continue to evolve along with these technologies.
2118 01:55:08,990 --> 01:55:11,980 So this isn't necessarily something you need to worry about for now. 2119 01:55:11,980 --> 01:55:15,640 Indeed, I think after today, we have more than enough to worry about. 2120 01:55:15,640 --> 01:55:17,510 So for today, that's all. 2121 01:55:17,510 --> 01:55:20,160 We'll see you next time. 2122 01:55:20,160 --> 01:55:22,000