1 00:00:00,000 --> 00:00:00,500 2 00:00:00,500 --> 00:00:03,872 [MUSIC PLAYING] 3 00:00:03,872 --> 00:00:12,110 4 00:00:12,110 --> 00:00:14,130 BRIAN YU: OK, let's get started. 5 00:00:14,130 --> 00:00:16,594 Welcome, everyone, to the final day of CS50 Beyond. 6 00:00:16,594 --> 00:00:19,010 And goal for today is going to be to take a look at things 7 00:00:19,010 --> 00:00:20,150 at a bit of a higher level. 8 00:00:20,150 --> 00:00:22,340 There is going to be less code in today's lecture. 9 00:00:22,340 --> 00:00:24,440 The focus of today is on two main topics-- 10 00:00:24,440 --> 00:00:26,914 security and scalability-- which are both important as you 11 00:00:26,914 --> 00:00:30,080 begin to think about, you're writing all this code for your web application. 12 00:00:30,080 --> 00:00:32,570 You're ready to deploy it so that people can actually use it. 13 00:00:32,570 --> 00:00:35,153 What are the sorts of considerations you need to bear in mind? 14 00:00:35,153 --> 00:00:37,310 What are the security considerations in making 15 00:00:37,310 --> 00:00:42,110 sure that wherever you're hosting the application, you and the application 16 00:00:42,110 --> 00:00:45,920 itself is secure and that your users are secure from potential vulnerabilities 17 00:00:45,920 --> 00:00:47,120 or potential threats? 18 00:00:47,120 --> 00:00:49,235 And also, from a scalability perspective, 19 00:00:49,235 --> 00:00:51,860 we've been designing applications that so far probably only you 20 00:00:51,860 --> 00:00:53,690 or a couple other people have been using. 21 00:00:53,690 --> 00:00:55,815 But what sorts of things do you need to think about 22 00:00:55,815 --> 00:00:59,060 as your applications begin to scale, as more and more people begin to use it, 23 00:00:59,060 --> 00:01:02,150 and you have to begin to think about this idea of multiple people trying 24 00:01:02,150 --> 00:01:04,995 to use the same application at the same time? 
25 00:01:04,995 --> 00:01:07,370 So a number of different considerations come about there. 26 00:01:07,370 --> 00:01:09,190 We'll show a couple of code examples. 27 00:01:09,190 --> 00:01:12,440 But the main idea of this is going to be high level, just thinking abstractly, 28 00:01:12,440 --> 00:01:15,560 sort of trying to design the product, trying to design the project, 29 00:01:15,560 --> 00:01:18,860 trying to figure out how exactly we need to be adjusting our application 30 00:01:18,860 --> 00:01:22,460 to make sure that it's secure and to make sure that it's scalable. 31 00:01:22,460 --> 00:01:24,230 So we'll go ahead and start with security. 32 00:01:24,230 --> 00:01:25,500 And on the topic of security, we're going 33 00:01:25,500 --> 00:01:28,010 to look at a number of different security considerations 34 00:01:28,010 --> 00:01:30,935 as we move all throughout the week, from the beginning of the week 35 00:01:30,935 --> 00:01:33,560 until the end of the week, thinking about the types of security 36 00:01:33,560 --> 00:01:35,370 implications that come about. 37 00:01:35,370 --> 00:01:38,180 And so one of the first things we introduced in the class was Git, 38 00:01:38,180 --> 00:01:39,971 the version control tool that we were using 39 00:01:39,971 --> 00:01:42,374 to keep track of different versions of our code 40 00:01:42,374 --> 00:01:45,290 in order to manage different branches of our code, so on and so forth. 41 00:01:45,290 --> 00:01:48,165 And so a couple of important security considerations to be aware with 42 00:01:48,165 --> 00:01:49,040 regards to Git. 43 00:01:49,040 --> 00:01:51,440 You all probably created GitHub repositories 44 00:01:51,440 --> 00:01:53,990 over the course of this week, maybe for the first time. 45 00:01:53,990 --> 00:01:56,870 And GitHub repositories by default are public. 
46 00:01:56,870 --> 00:02:00,230 And this is in the spirit of the idea of open source software, the idea 47 00:02:00,230 --> 00:02:01,790 that anyone can see the code. 48 00:02:01,790 --> 00:02:03,590 Anyone can contribute to the code. 49 00:02:03,590 --> 00:02:05,150 And that, of course, comes with its trade offs. 50 00:02:05,150 --> 00:02:07,580 On one hand, everyone being able to see the code certainly 51 00:02:07,580 --> 00:02:10,850 means that anyone can help you to find bugs and identify bugs. 52 00:02:10,850 --> 00:02:13,640 But it also means that anyone on the internet can see the code, 53 00:02:13,640 --> 00:02:15,680 look for potential vulnerabilities, and then 54 00:02:15,680 --> 00:02:18,060 potentially take advantage of those vulnerabilities. 55 00:02:18,060 --> 00:02:20,230 So definitely, trade offs, costs, and benefits that 56 00:02:20,230 --> 00:02:21,980 come along with open source software. 57 00:02:21,980 --> 00:02:25,760 And another thing just to be aware of, we mentioned this earlier in the week, 58 00:02:25,760 --> 00:02:30,740 but your Git commit history is going to store the entire history of any 59 00:02:30,740 --> 00:02:33,200 of the commits that you have made, as the name might imply. 60 00:02:33,200 --> 00:02:35,416 And so if you make a commit and you do something 61 00:02:35,416 --> 00:02:38,540 you shouldn't have done, for instance-- you make a commit that accidentally 62 00:02:38,540 --> 00:02:41,509 includes database credentials inside of the commit somewhere 63 00:02:41,509 --> 00:02:43,300 or includes a password inside of the commit 64 00:02:43,300 --> 00:02:45,920 somewhere-- you can later on remove those credentials 65 00:02:45,920 --> 00:02:48,560 and make another commit and remove the credentials. 66 00:02:48,560 --> 00:02:51,320 But the credentials are still there inside of the history. 
67 00:02:51,320 --> 00:02:53,630 If you go back, you could still find the credentials 68 00:02:53,630 --> 00:02:56,060 if you had access to the entire Git repository 69 00:02:56,060 --> 00:02:58,754 and could go back and find that point in Git's history. 70 00:02:58,754 --> 00:03:01,670 So what are the potential solutions if you do something like this, 71 00:03:01,670 --> 00:03:04,640 accidentally expose credentials at some point in the repository 72 00:03:04,640 --> 00:03:06,410 and then remove them? 73 00:03:06,410 --> 00:03:08,090 What could you do? 74 00:03:08,090 --> 00:03:08,662 Yeah? 75 00:03:08,662 --> 00:03:09,840 AUDIENCE: Change the credentials. 76 00:03:09,840 --> 00:03:10,160 BRIAN YU: Certainly. 77 00:03:10,160 --> 00:03:12,993 Changing the credentials, something you should almost definitely do. 78 00:03:12,993 --> 00:03:13,900 Change the password. 79 00:03:13,900 --> 00:03:16,400 It's not enough just to remove them and make another commit. 80 00:03:16,400 --> 00:03:19,310 And there's also something you can do, rewriting history with a tool such as git filter-branch, where 81 00:03:19,310 --> 00:03:23,120 you can effectively purge the commit from the history, sort of overwrite history, 82 00:03:23,120 --> 00:03:25,470 so to speak, in order to replace that, as well. 83 00:03:25,470 --> 00:03:27,800 But even then, if it's been online on GitHub, 84 00:03:27,800 --> 00:03:30,258 who knows who may have been able to access the credentials? 85 00:03:30,258 --> 00:03:33,830 So definitely always a good idea to change those, as well. 86 00:03:33,830 --> 00:03:36,860 On the first day, we also took a look at HTML. 87 00:03:36,860 --> 00:03:38,767 We were designing basic HTML pages. 88 00:03:38,767 --> 00:03:40,850 And there are a number of security vulnerabilities 89 00:03:40,850 --> 00:03:43,490 you could create just with HTML alone.
90 00:03:43,490 --> 00:03:48,320 Perhaps one of the most basic is just the idea that the contents of a link 91 00:03:48,320 --> 00:03:50,529 can differ from where the link takes you. 92 00:03:50,529 --> 00:03:52,820 This is probably a pretty obvious point: you often 93 00:03:52,820 --> 00:03:54,751 have text that links you to a particular page. 94 00:03:54,751 --> 00:03:56,750 But this can often be misleading and is commonly 95 00:03:56,750 --> 00:03:59,150 used in phishing email attacks, for instance, 96 00:03:59,150 --> 00:04:02,300 whereby you have a link that takes you to URL one, 97 00:04:02,300 --> 00:04:06,260 but the text it displays is URL two, which can be misleading, for sure. 98 00:04:06,260 --> 00:04:10,640 Or I can have situations where I could-- 99 00:04:10,640 --> 00:04:15,440 let's go into link.html-- 100 00:04:15,440 --> 00:04:18,230 I have a link that presumably takes me to google.com. 101 00:04:18,230 --> 00:04:20,870 But if I click on google.com, it could take me anywhere else-- 102 00:04:20,870 --> 00:04:22,580 to some other site, for instance. 103 00:04:22,580 --> 00:04:28,660 And the way that it does that is quite simply by just 104 00:04:28,660 --> 00:04:33,310 having a link whose href takes you to one URL, but the contents of that link 105 00:04:33,310 --> 00:04:35,594 are something different or something else entirely. 106 00:04:35,594 --> 00:04:37,510 And so that alone is something to be aware of. 107 00:04:37,510 --> 00:04:40,060 But that problem is compounded when you consider the idea 108 00:04:40,060 --> 00:04:42,580 that even though your server-side code-- application code 109 00:04:42,580 --> 00:04:45,040 you write in Python and Flask, for instance-- 110 00:04:45,040 --> 00:04:48,040 you can keep secret from your users, HTML code is not 111 00:04:48,040 --> 00:04:49,030 kept secret from users. 112 00:04:49,030 --> 00:04:52,051 Any user can see the HTML and do whatever they want with it.
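The mismatch between what a link displays and where it actually points can be checked mechanically. The following is a minimal sketch using only Python's standard library, not something shown in the lecture: the names LinkAuditor, host_of, and misleading_links are made up for illustration, and the rule it applies (flag a link whose visible text looks like a domain that differs from the host in its href) is a deliberately crude assumption that would, for example, also flag www.google.com versus google.com.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAuditor(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Return the lowercase host name of a URL, tolerating a missing scheme."""
    if "//" not in url:
        url = "//" + url
    return (urlparse(url).hostname or "").lower()


def misleading_links(html):
    """Return (text, href) pairs where the text names a different host than the href."""
    auditor = LinkAuditor()
    auditor.feed(html)
    flagged = []
    for href, text in auditor.links:
        looks_like_domain = "." in text and " " not in text
        if looks_like_domain and host_of(text) != host_of(href):
            flagged.append((text, href))
    return flagged


# Hypothetical page: the first link claims to be google.com but points elsewhere.
page = (
    '<a href="https://example.com/login">google.com</a> '
    '<a href="https://google.com">google.com</a>'
)
print(misleading_links(page))
```

Only the first link is reported, because its text names google.com while its href leads to example.com. A link whose text is an ordinary phrase such as "Sign in" is never flagged, which is exactly why the displayed text alone tells you nothing about the destination.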
113 00:04:52,051 --> 00:04:53,800 And so on the first day, you may have been 114 00:04:53,800 --> 00:04:56,650 trying to take a look at an HTML page and try and replicate it 115 00:04:56,650 --> 00:04:59,280 using your own HTML and CSS, for example. 116 00:04:59,280 --> 00:05:01,030 The simplest way to do something like that 117 00:05:01,030 --> 00:05:02,613 would just be to copy the source code. 118 00:05:02,613 --> 00:05:09,790 So I could go to bankofamerica.com, for instance, Control-Click on the page, 119 00:05:09,790 --> 00:05:12,070 view the page source, and all right. 120 00:05:12,070 --> 00:05:15,100 Here's all the HTML on Bank of America's home page. 121 00:05:15,100 --> 00:05:26,720 I could copy that, create a new file, and call it bank.html. 122 00:05:26,720 --> 00:05:29,690 Paste the contents of it in here. 123 00:05:29,690 --> 00:05:32,830 Go ahead and save that. 124 00:05:32,830 --> 00:05:35,072 And now, open up bank.html. 125 00:05:35,072 --> 00:05:38,280 And now, I've got a page that basically looks like Bank of America's website. 126 00:05:38,280 --> 00:05:39,180 And now, I could go in. 127 00:05:39,180 --> 00:05:41,679 I could modify the links, change where Sign In takes you to, 128 00:05:41,679 --> 00:05:43,600 make it take you to somewhere else entirely. 129 00:05:43,600 --> 00:05:45,420 And so these are potential threats, vulnerabilities, 130 00:05:45,420 --> 00:05:48,300 to be aware of on the internet that are quite easy to actually do. 131 00:05:48,300 --> 00:05:51,777 So this is less about when you're designing your own web applications 132 00:05:51,777 --> 00:05:54,360 but, when you're using web applications, the types of security 133 00:05:54,360 --> 00:05:56,420 concerns to definitely be aware of. 134 00:05:56,420 --> 00:06:00,274 135 00:06:00,274 --> 00:06:02,690 So let's keep moving forward in the week-- yeah, question? 136 00:06:02,690 --> 00:06:05,350 AUDIENCE: Can you copy JavaScript source code in the same way? 
137 00:06:05,350 --> 00:06:05,933 BRIAN YU: Yes. 138 00:06:05,933 --> 00:06:08,760 Any JavaScript code that is on the client, you can access 139 00:06:08,760 --> 00:06:09,990 and you can modify. 140 00:06:09,990 --> 00:06:12,300 You can change variables and so on and so forth. 141 00:06:12,300 --> 00:06:15,420 And this is actually a pretty easy thing to do. 142 00:06:15,420 --> 00:06:20,640 So if I go to like, I don't know, The New York Times website, for instance, 143 00:06:20,640 --> 00:06:24,070 and I look at the source code there-- 144 00:06:24,070 --> 00:06:26,670 let me go ahead and inspect the element, and I'll 145 00:06:26,670 --> 00:06:31,040 try and hover over a main headline. 146 00:06:31,040 --> 00:06:32,760 OK. 147 00:06:32,760 --> 00:06:35,147 This is the name of a CSS class. 148 00:06:35,147 --> 00:06:36,480 You could access any JavaScript. 149 00:06:36,480 --> 00:06:39,460 You can also run any JavaScript in the console arbitrarily. 150 00:06:39,460 --> 00:06:45,750 So I could say, all right, document.querySelectorAll-- let's 151 00:06:45,750 --> 00:06:48,510 get everything with that CSS class. 152 00:06:48,510 --> 00:06:51,880 Or maybe it's just the first one, because it's two CSS classes. 153 00:06:51,880 --> 00:06:52,380 All right. 154 00:06:52,380 --> 00:06:53,040 Great. 155 00:06:53,040 --> 00:06:56,790 I'll take the first one, set its innerHTML to be, 156 00:06:56,790 --> 00:07:01,800 like, welcome to CS50 Beyond. 157 00:07:01,800 --> 00:07:05,400 And you can play around with websites in order to mess around, change them. 158 00:07:05,400 --> 00:07:07,890 So all of the JavaScript, the CSS classes, all of that, 159 00:07:07,890 --> 00:07:10,410 is accessible to anyone who is using the page, for example. 160 00:07:10,410 --> 00:07:14,980 161 00:07:14,980 --> 00:07:16,520 Other questions before I go on? 162 00:07:16,520 --> 00:07:17,020 Yeah. 163 00:07:17,020 --> 00:07:19,520 AUDIENCE: Any thoughts on JavaScript obfuscation?
164 00:07:19,520 --> 00:07:22,270 BRIAN YU: JavaScript obfuscation-- certainly something you can do. 165 00:07:22,270 --> 00:07:26,950 So since JavaScript is available to anyone who has access to the web page, 166 00:07:26,950 --> 00:07:29,910 there are programs called JavaScript obfuscators 167 00:07:29,910 --> 00:07:32,320 that basically take plain old looking JavaScript 168 00:07:32,320 --> 00:07:34,840 and convert it into something that's still JavaScript 169 00:07:34,840 --> 00:07:37,480 but that's very difficult for any human to decipher. 170 00:07:37,480 --> 00:07:41,140 An obfuscator changes variable names and does a bunch of tricks in JavaScript 171 00:07:41,140 --> 00:07:46,257 so that the code still executes the exact same way but looks quite obscure. 172 00:07:46,257 --> 00:07:47,590 Definitely something you can do. 173 00:07:47,590 --> 00:07:49,930 Still not totally foolproof, because there are ways 174 00:07:49,930 --> 00:07:53,500 of trying to deobfuscate JavaScript code, at least to some extent. 175 00:07:53,500 --> 00:07:57,780 So it's not perfect, but definitely something that you can do. 176 00:07:57,780 --> 00:08:00,470 Other things? 177 00:08:00,470 --> 00:08:00,970 All right. 178 00:08:00,970 --> 00:08:01,810 Let's take a look at-- 179 00:08:01,810 --> 00:08:03,670 OK, when we were writing Flask applications, 180 00:08:03,670 --> 00:08:05,235 we were writing web servers. 181 00:08:05,235 --> 00:08:08,110 And so one thing that's just good to know from a security perspective 182 00:08:08,110 --> 00:08:11,410 is the difference between HTTP, the Hypertext Transfer Protocol, 183 00:08:11,410 --> 00:08:13,960 and the secure version of it, HTTPS. 184 00:08:13,960 --> 00:08:16,930 And that has to do with the idea that on the internet, 185 00:08:16,930 --> 00:08:19,420 we have computer servers that are trying to communicate 186 00:08:19,420 --> 00:08:22,582 with each other that are trying to send information back and forth.
187 00:08:22,582 --> 00:08:25,540 And when these computers are trying to send information back and forth, 188 00:08:25,540 --> 00:08:27,760 we would like for that to happen securely, 189 00:08:27,760 --> 00:08:31,090 because when one computer is sending information to another computer, 190 00:08:31,090 --> 00:08:34,090 that information is going through a number of different routers. 191 00:08:34,090 --> 00:08:36,790 And at each of those routers, the information could hypothetically 192 00:08:36,790 --> 00:08:38,289 be intercepted. 193 00:08:38,289 --> 00:08:41,890 Someone could try and intercept a packet on its way from computer number 194 00:08:41,890 --> 00:08:43,780 one to computer number two. 195 00:08:43,780 --> 00:08:47,680 So how do we securely try and transfer information from one location 196 00:08:47,680 --> 00:08:48,495 to the other? 197 00:08:48,495 --> 00:08:50,870 And this has to do with the entire field of cryptography, 198 00:08:50,870 --> 00:08:52,390 which is a huge field that we're only going to be 199 00:08:52,390 --> 00:08:54,370 able to barely scratch the surface of. 200 00:08:54,370 --> 00:08:56,740 But the basic idea here is that we would like some way 201 00:08:56,740 --> 00:09:00,670 to encrypt our information, that if I have some plain text that I would like 202 00:09:00,670 --> 00:09:03,400 to send from my computer to someone else's computer, 203 00:09:03,400 --> 00:09:07,510 I would like to encrypt that plain text, send it across in some encrypted way, 204 00:09:07,510 --> 00:09:10,540 such that the person on the other end could decrypt it. 205 00:09:10,540 --> 00:09:13,150 And so this is perhaps a more sophisticated version 206 00:09:13,150 --> 00:09:15,760 of what you might have done in CS50's problem set two 207 00:09:15,760 --> 00:09:18,180 when you were using the Caesar or the Vigenere cipher 208 00:09:18,180 --> 00:09:19,430 in order to encrypt something.
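As a reminder of the encrypt, send, decrypt cycle just described, here is a rough Caesar cipher sketch in Python. It is only an illustration of the idea: the function name caesar and the sample message are invented here, and a Caesar cipher offers no real security.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar(text, key):
    """Shift each letter of text by key positions, leaving other characters alone."""
    shifted = []
    for ch in text.upper():
        if ch in ALPHABET:
            shifted.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            shifted.append(ch)
    return "".join(shifted)


# The sender encrypts with the key; the receiver reverses the shift with the same key.
ciphertext = caesar("HELLO, WORLD", 3)
plaintext = caesar(ciphertext, -3)
print(ciphertext)
print(plaintext)
```

Notice that the same number, 3, is needed on both ends: whoever knows the key can both encrypt and decrypt.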
209 00:09:19,430 --> 00:09:22,870 The ciphers that are used in computing on the internet, for instance, 210 00:09:22,870 --> 00:09:25,390 are just much more secure, for example. 211 00:09:25,390 --> 00:09:27,490 But they follow a similar principle. 212 00:09:27,490 --> 00:09:31,450 And so one form of cryptography is called secret-key cryptography, 213 00:09:31,450 --> 00:09:33,550 where the idea is that if I am a computer up here 214 00:09:33,550 --> 00:09:36,010 and I have some plain text that I want to encrypt, 215 00:09:36,010 --> 00:09:39,050 I also have some key that only I know. 216 00:09:39,050 --> 00:09:41,830 And I can take the plain text, and I can take that key 217 00:09:41,830 --> 00:09:43,780 and run an algorithm on it. 218 00:09:43,780 --> 00:09:47,680 And that generates some ciphertext, some encrypted version of the plain text 219 00:09:47,680 --> 00:09:49,750 that was encrypted using the key. 220 00:09:49,750 --> 00:09:52,460 I can then send that ciphertext along to the other person. 221 00:09:52,460 --> 00:09:56,050 And so long as the other person has both the ciphertext and the key 222 00:09:56,050 --> 00:09:58,660 to encrypt it, they can do the same process 223 00:09:58,660 --> 00:10:01,840 and just decrypt it, generating the plain text from it. 224 00:10:01,840 --> 00:10:04,540 That way, the ciphertext is transferred, not the plain text, 225 00:10:04,540 --> 00:10:07,400 from one side to the other side of this communication. 226 00:10:07,400 --> 00:10:10,810 And so long as both parties in this instance have access to the same key, 227 00:10:10,810 --> 00:10:14,344 they can encrypt and decrypt messages at will. 228 00:10:14,344 --> 00:10:16,510 Why doesn't this quite work on the internet, though? 229 00:10:16,510 --> 00:10:19,464 What is the problem with this model? 230 00:10:19,464 --> 00:10:19,964 Yeah? 
231 00:10:19,964 --> 00:10:22,850 AUDIENCE: If you're sending the key as well as the ciphertext, 232 00:10:22,850 --> 00:10:28,600 then it's just as revealing as sending the plain text. 233 00:10:28,600 --> 00:10:29,350 BRIAN YU: Exactly. 234 00:10:29,350 --> 00:10:32,260 When we transfer the ciphertext across, the other person 235 00:10:32,260 --> 00:10:33,950 also needs access to the key. 236 00:10:33,950 --> 00:10:35,950 We need to transfer the key across the internet, 237 00:10:35,950 --> 00:10:38,180 as well, to give it to the other person. 238 00:10:38,180 --> 00:10:40,990 And so anyone who is intercepting the ciphertext 239 00:10:40,990 --> 00:10:43,960 could also have intercepted the key and therefore could 240 00:10:43,960 --> 00:10:47,110 have decrypted the information and gotten the plain text 241 00:10:47,110 --> 00:10:47,860 as a result of it. 242 00:10:47,860 --> 00:10:50,740 So secret-key cryptography ultimately 243 00:10:50,740 --> 00:10:53,000 doesn't work in the context of the internet 244 00:10:53,000 --> 00:10:55,330 if the key itself has to be 245 00:10:55,330 --> 00:10:56,716 transferred across the internet. 246 00:10:56,716 --> 00:10:58,840 Now, you could try encrypting the key, for example. 247 00:10:58,840 --> 00:11:00,730 But then whatever key you used to encrypt the key, 248 00:11:00,730 --> 00:11:02,650 that also needs to be sent across the internet, 249 00:11:02,650 --> 00:11:05,899 and you end up with this problem where you can never figure out a way 250 00:11:05,899 --> 00:11:09,610 to make sure that information can be transferred securely. 251 00:11:09,610 --> 00:11:12,910 So the solution to this lies in a different idea called public-key 252 00:11:12,910 --> 00:11:17,430 cryptography, where the idea here is that instead of having one key, 253 00:11:17,430 --> 00:11:18,880 we'll have two keys-- 254 00:11:18,880 --> 00:11:21,361 one called a public key, one called a private key.
255 00:11:21,361 --> 00:11:24,610 And the idea here is that a public key is something you can share with anyone. 256 00:11:24,610 --> 00:11:26,290 Doesn't matter who has it. 257 00:11:26,290 --> 00:11:28,974 And a private key is a key that you keep to yourself 258 00:11:28,974 --> 00:11:31,390 that you don't give to anyone, even the person that you're 259 00:11:31,390 --> 00:11:33,580 trying to communicate with. 260 00:11:33,580 --> 00:11:36,820 And because we have two keys, each key is going to serve a different purpose. 261 00:11:36,820 --> 00:11:38,270 They're going to be mathematically related. 262 00:11:38,270 --> 00:11:40,061 And take a theory of computing class if you 263 00:11:40,061 --> 00:11:42,910 want to understand the exact mathematics behind this. 264 00:11:42,910 --> 00:11:48,010 But the basic idea is that the public key can be used to encrypt messages, 265 00:11:48,010 --> 00:11:51,820 and the private key can be used to decrypt messages that 266 00:11:51,820 --> 00:11:54,680 were encrypted using the public key. 267 00:11:54,680 --> 00:11:56,640 And so what does this model look like? 268 00:11:56,640 --> 00:11:59,080 Well, I have some public and private key. 269 00:11:59,080 --> 00:12:01,710 And if I want some other person to send me information, 270 00:12:01,710 --> 00:12:03,439 I will give them my public key. 271 00:12:03,439 --> 00:12:06,480 Just give the other person the public key so that they have access to it. 272 00:12:06,480 --> 00:12:09,570 Remember, the public key is used to encrypt data. 273 00:12:09,570 --> 00:12:12,270 So they can use the public key and encrypt the plain text, 274 00:12:12,270 --> 00:12:13,745 generate some ciphertext. 275 00:12:13,745 --> 00:12:16,620 And then all the other person needs to do is send me that ciphertext. 276 00:12:16,620 --> 00:12:18,990 The ciphertext comes across to me. 
277 00:12:18,990 --> 00:12:20,970 And I now have the private key, the key that I 278 00:12:20,970 --> 00:12:23,040 can use to decrypt the information. 279 00:12:23,040 --> 00:12:24,990 And using the private key and the ciphertext, 280 00:12:24,990 --> 00:12:29,080 I can then decrypt the message and generate the plain text. 281 00:12:29,080 --> 00:12:31,330 So this is the basic idea of public-key cryptography, 282 00:12:31,330 --> 00:12:35,170 this idea that we use a public key to encrypt information and a private key 283 00:12:35,170 --> 00:12:36,580 to decrypt information. 284 00:12:36,580 --> 00:12:39,280 And by separating this out into two different keys, 285 00:12:39,280 --> 00:12:41,590 we can share the public key freely without needing 286 00:12:41,590 --> 00:12:44,770 to worry about the potential for internet traffic 287 00:12:44,770 --> 00:12:47,700 to be intercepted and decrypted, for example. 288 00:12:47,700 --> 00:12:50,140 And so this is the basis on which internet security works. 289 00:12:50,140 --> 00:12:50,743 Yeah? 290 00:12:50,743 --> 00:12:54,580 AUDIENCE: What if someone else intercepts the ciphertext 291 00:12:54,580 --> 00:12:57,496 and they also have a private key? 292 00:12:57,496 --> 00:12:58,872 Would they be able to decrypt it? 293 00:12:58,872 --> 00:13:02,204 BRIAN YU: If someone else intercepts the ciphertext and they have a private key, 294 00:13:02,204 --> 00:13:04,660 they won't be able to decrypt it, because the private key 295 00:13:04,660 --> 00:13:08,710 and the public key are mathematically related in such a way 296 00:13:08,710 --> 00:13:11,740 that if you encrypt something with a public key, 297 00:13:11,740 --> 00:13:14,980 you can only decrypt it with the corresponding private key. 
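To make the public and private key relationship concrete, here is a toy sketch in the style of RSA, written in Python (3.8 or later, for the three-argument pow with a negative exponent). Everything in it is an assumption for illustration: the lecture does not name a specific algorithm, the primes 61, 53, 47, and 59 are far too small to be secure, and real systems add padding and use vetted libraries rather than code like this.

```python
def make_keypair(p, q, e=17):
    """Build a toy key pair from two primes: returns (public_key, private_key)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # private exponent: the modular inverse of e
    return (e, n), (d, n)


def apply_key(key, value):
    """Encrypt or decrypt a number by raising it to the key's exponent modulo n."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


# One person generates both keys together and shares only the public one.
public, private = make_keypair(61, 53)

# Someone else's private key, generated from different primes.
_, other_private = make_keypair(47, 59)

message = 65
ciphertext = apply_key(public, message)          # anyone can do this
recovered = apply_key(private, ciphertext)       # only the key owner can do this
wrong = apply_key(other_private, ciphertext)     # an unrelated private key

print(ciphertext, recovered, wrong)
```

The matching private key recovers the original 65, while the unrelated private key produces some other number, which is the point made in the answer above: only the private key generated alongside a public key can undo what that public key encrypted.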
298 00:13:14,980 --> 00:13:18,730 And so generally speaking, you'll generate both the public 299 00:13:18,730 --> 00:13:22,420 and the private key at the same time, such that only messages encrypted 300 00:13:22,420 --> 00:13:24,190 with one can be decrypted with the other. 301 00:13:24,190 --> 00:13:27,398 So you can't just have some other random private key and decrypt the message. 302 00:13:27,398 --> 00:13:30,153 It can only decrypt messages from the public key. 303 00:13:30,153 --> 00:13:33,400 AUDIENCE: So how did this person get that specific [INAUDIBLE]?? 304 00:13:33,400 --> 00:13:36,450 BRIAN YU: So this person down here generated both the public 305 00:13:36,450 --> 00:13:38,130 and the private key at the same time. 306 00:13:38,130 --> 00:13:40,755 There's just an algorithm that you can use to randomly generate 307 00:13:40,755 --> 00:13:41,880 a public and private key. 308 00:13:41,880 --> 00:13:45,390 You share the public key with anyone you want to be able to send you messages. 309 00:13:45,390 --> 00:13:48,930 That person you share it with can use the public key to encrypt the message. 310 00:13:48,930 --> 00:13:51,600 And then you, the person who generated these keys, 311 00:13:51,600 --> 00:13:55,230 can take the encrypted message, use the private key that you generated, 312 00:13:55,230 --> 00:13:58,331 and get the plain text out of that. 313 00:13:58,331 --> 00:13:58,830 Yeah? 314 00:13:58,830 --> 00:14:01,920 AUDIENCE: How difficult is it to get the private key from the public key? 315 00:14:01,920 --> 00:14:04,070 Is it impossible? 316 00:14:04,070 --> 00:14:07,250 BRIAN YU: How difficult is it to get the private key from the public key? 317 00:14:07,250 --> 00:14:09,410 Long story short, we don't really know. 318 00:14:09,410 --> 00:14:11,460 We think it is very difficult to do. 319 00:14:11,460 --> 00:14:13,400 We think that it would take a very long time. 
320 00:14:13,400 --> 00:14:18,470 If you took a computer and tried to get it to go from the public key 321 00:14:18,470 --> 00:14:22,700 to the private key, we think it would probably take billions, trillions, or more 322 00:14:22,700 --> 00:14:27,350 years, even with a computer operating at top speed trying to do this calculation. 323 00:14:27,350 --> 00:14:31,070 But no one has been able to formally prove that it is difficult. 324 00:14:31,070 --> 00:14:33,650 And so this is a big open question in computing right now. 325 00:14:33,650 --> 00:14:35,570 You can take a theory of computation class 326 00:14:35,570 --> 00:14:37,740 for more information on this sort of thing. 327 00:14:37,740 --> 00:14:40,220 But there are some open unsolved problems in computing, 328 00:14:40,220 --> 00:14:41,940 and this happens to be one of them. 329 00:14:41,940 --> 00:14:42,440 Yeah? 330 00:14:42,440 --> 00:14:46,196 AUDIENCE: Is it based on primes and very large primes, and you 331 00:14:46,196 --> 00:14:47,340 multiply them together? 332 00:14:47,340 --> 00:14:49,800 BRIAN YU: Yes, this is basically the idea of very large prime numbers 333 00:14:49,800 --> 00:14:50,980 that you multiply together. 334 00:14:50,980 --> 00:14:53,370 The long story short of it is it's based on the idea 335 00:14:53,370 --> 00:14:55,890 that there are some mathematical operations that are easy 336 00:14:55,890 --> 00:14:58,860 and some mathematical operations that are believed to be difficult. 337 00:14:58,860 --> 00:15:01,140 And if you take two very big prime numbers, 338 00:15:01,140 --> 00:15:03,549 a computer can multiply those numbers very easily 339 00:15:03,549 --> 00:15:05,840 and calculate what the product of those two numbers is. 340 00:15:05,840 --> 00:15:07,950 It's just a simple multiplication algorithm.
341 00:15:07,950 --> 00:15:11,880 But if you have that result, that big product of two primes, 342 00:15:11,880 --> 00:15:14,070 it's very difficult to factor that number 343 00:15:14,070 --> 00:15:16,980 and figure out which two prime numbers were multiplied together 344 00:15:16,980 --> 00:15:18,690 in order to generate that number. 345 00:15:18,690 --> 00:15:22,650 And nobody has been able to come up with an efficient algorithm for factoring 346 00:15:22,650 --> 00:15:23,250 it. 347 00:15:23,250 --> 00:15:26,250 And so as a result, because we believe factoring numbers to be 348 00:15:26,250 --> 00:15:28,650 a very difficult problem, we use it as the basis 349 00:15:28,650 --> 00:15:33,140 for computing security on the internet. 350 00:15:33,140 --> 00:15:36,260 Brief teaser of theory of computation. 351 00:15:36,260 --> 00:15:39,777 Take any of the 120 series here at Harvard, at least, 352 00:15:39,777 --> 00:15:41,110 for more information about that. 353 00:15:41,110 --> 00:15:44,870 354 00:15:44,870 --> 00:15:45,500 Other things? 355 00:15:45,500 --> 00:15:48,711 356 00:15:48,711 --> 00:15:51,460 Some other security considerations when designing web applications 357 00:15:51,460 --> 00:15:53,660 to be aware of-- we mentioned this before, 358 00:15:53,660 --> 00:15:56,455 but when it comes to storing credentials, 359 00:15:56,455 --> 00:15:58,330 you should generally always store credentials 360 00:15:58,330 --> 00:16:01,240 in environment variables 361 00:16:01,240 --> 00:16:05,060 rather than hard-coding some password inside of your Python code, 362 00:16:05,060 --> 00:16:07,074 whether it's the secret key of your application, 363 00:16:07,074 --> 00:16:08,990 whether it's the credentials to your database, 364 00:16:08,990 --> 00:16:11,710 whether it's some other credential, such as an API key, 365 00:16:11,710 --> 00:16:13,930 for example, that your server uses to access another service.
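In Python, reading a credential from the environment instead of writing it into the source might look roughly like the sketch below. The helper name require_env and the variable name DATABASE_URL are placeholders chosen for this example, not names taken from the lecture's code; os.environ.get itself is part of the standard library.

```python
import os


def require_env(name):
    """Return the value of an environment variable, failing loudly if it is unset."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def load_settings():
    """Collect the credentials the application needs from its environment."""
    return {
        # Hypothetical variable name; use whatever your deployment defines.
        "database_url": require_env("DATABASE_URL"),
    }
```

With this arrangement the repository contains only the name of the variable, never its value. Calling load_settings() when the server starts will either return the credentials or stop with a clear error, which is usually preferable to silently running with a missing secret.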
366 00:16:13,930 --> 00:16:16,750 Usually best not to put that in the code in case someone else 367 00:16:16,750 --> 00:16:18,250 gets access to the code. 368 00:16:18,250 --> 00:16:20,020 Generally best to put it in an environment 369 00:16:20,020 --> 00:16:24,240 variable, a variable that's just stored in the command line environment 370 00:16:24,240 --> 00:16:26,080 where your server's being run from. 371 00:16:26,080 --> 00:16:31,190 And then add code that just pulls the credentials from the environment. 372 00:16:31,190 --> 00:16:34,540 You can use in Python, at least, os.environ.get 373 00:16:34,540 --> 00:16:37,570 to mean get some information from the application's environment. 374 00:16:37,570 --> 00:16:41,200 And this is generally going to be a more secure way of doing the same thing. 375 00:16:41,200 --> 00:16:41,847 Yeah? 376 00:16:41,847 --> 00:16:44,516 AUDIENCE: How do we do that in Heroku if we 377 00:16:44,516 --> 00:16:46,140 want to upload our code to the website? 378 00:16:46,140 --> 00:16:46,895 BRIAN YU: Yeah. 379 00:16:46,895 --> 00:16:50,020 So if you're uploading this to Heroku, if you go to your Heroku application 380 00:16:50,020 --> 00:16:52,330 and go to the Settings panel, there is a section, 381 00:16:52,330 --> 00:16:55,990 I think it's called config vars, that basically just lets you add environment 382 00:16:55,990 --> 00:16:57,644 variables to the Heroku application. 383 00:16:57,644 --> 00:17:00,310 And that will automatically set those environment variables such 384 00:17:00,310 --> 00:17:02,018 that when you run the application, it can 385 00:17:02,018 --> 00:17:03,830 draw from those environment variables. 386 00:17:03,830 --> 00:17:04,420 Yeah? 387 00:17:04,420 --> 00:17:10,120 AUDIENCE: Is it [INAUDIBLE] yesterday, or is that something 388 00:17:10,120 --> 00:17:11,440 you can't have access to? 
389 00:17:11,440 --> 00:17:14,254 Because if you just did [INAUDIBLE] and then the key, 390 00:17:14,254 --> 00:17:17,717 it goes away when you close the terminal, correct? 391 00:17:17,717 --> 00:17:18,300 BRIAN YU: Yes. 392 00:17:18,300 --> 00:17:19,500 So that's true. 393 00:17:19,500 --> 00:17:21,900 So you can certainly, on your own computer, 394 00:17:21,900 --> 00:17:24,611 set aliases or environment variables inside 395 00:17:24,611 --> 00:17:27,569 of your profile that automatically set credentials in a particular way. 396 00:17:27,569 --> 00:17:30,180 The idea is that you never want to be taking those credentials 397 00:17:30,180 --> 00:17:33,360 and committing them to a repository that other people might 398 00:17:33,360 --> 00:17:35,040 be able to see, for instance. 399 00:17:35,040 --> 00:17:37,320 That's where things start to get less secure. 400 00:17:37,320 --> 00:17:41,550 401 00:17:41,550 --> 00:17:42,250 OK. 402 00:17:42,250 --> 00:17:45,730 Moving on in the week to talk about some other security considerations. 403 00:17:45,730 --> 00:17:47,590 We'll talk about SQL, the idea of databases. 404 00:17:47,590 --> 00:17:50,800 And when we introduce databases, there are a lot of security considerations 405 00:17:50,800 --> 00:17:52,030 that come about. 406 00:17:52,030 --> 00:17:53,930 But we'll just touch on a couple of them. 407 00:17:53,930 --> 00:17:56,170 The first is how you store passwords. 408 00:17:56,170 --> 00:17:58,210 So you can imagine that inside of a database, 409 00:17:58,210 --> 00:18:00,670 you might be storing users and passwords together. 410 00:18:00,670 --> 00:18:04,450 And maybe we have a whole users table that has an ID column, 411 00:18:04,450 --> 00:18:07,720 a column for people's usernames, and a column for people's passwords. 412 00:18:07,720 --> 00:18:12,320 And you could imagine just storing passwords inside of the row. 413 00:18:12,320 --> 00:18:14,185 But why is this not particularly secure? 
414 00:18:14,185 --> 00:18:21,430 415 00:18:21,430 --> 00:18:21,930 Yeah? 416 00:18:21,930 --> 00:18:24,294 AUDIENCE: If anyone gets access to the data table, 417 00:18:24,294 --> 00:18:25,960 they can see what all the passwords are. 418 00:18:25,960 --> 00:18:26,710 BRIAN YU: Exactly. 419 00:18:26,710 --> 00:18:29,140 If anyone gets access to the database, they immediately 420 00:18:29,140 --> 00:18:30,666 have access to all of the passwords. 421 00:18:30,666 --> 00:18:33,040 And this is probably not a secure way to go about things, 422 00:18:33,040 --> 00:18:35,440 because you probably hear in the news from time 423 00:18:35,440 --> 00:18:39,610 to time that databases aren't perfectly secure, that every once in a while, 424 00:18:39,610 --> 00:18:43,600 there's some big security vulnerability where someone's able to get access 425 00:18:43,600 --> 00:18:45,370 to passwords inside of a database. 426 00:18:45,370 --> 00:18:47,530 And that becomes a major security concern. 427 00:18:47,530 --> 00:18:50,020 And so one way to try and mitigate this problem 428 00:18:50,020 --> 00:18:53,920 is, instead of storing passwords inside of the database, 429 00:18:53,920 --> 00:18:56,220 store a hashed version of the password. 430 00:18:56,220 --> 00:18:59,710 A hash function, as you might recall from CS50, just takes some input 431 00:18:59,710 --> 00:19:02,590 and returns some deterministic output. 432 00:19:02,590 --> 00:19:05,860 And a hash function can generally take any input password 433 00:19:05,860 --> 00:19:09,430 and turn it into what looks like a whole bunch of random sequences of letters 434 00:19:09,430 --> 00:19:10,256 and numbers. 435 00:19:10,256 --> 00:19:12,130 And the idea here is that it's deterministic. 
436 00:19:12,130 --> 00:19:16,030 The same password will always result in the same hash value 437 00:19:16,030 --> 00:19:19,690 whereby when someone tries to log in, when they type in their password, 438 00:19:19,690 --> 00:19:21,820 rather than just literally compare their password 439 00:19:21,820 --> 00:19:25,120 and say does the password match up with the password in this column, 440 00:19:25,120 --> 00:19:27,670 you can say, all right, let's hash the password first. 441 00:19:27,670 --> 00:19:30,700 And if the hashes match up, then with very high probability, 442 00:19:30,700 --> 00:19:34,600 the user actually signed in to the website with the correct password. 443 00:19:34,600 --> 00:19:36,190 And you can then log the user in. 444 00:19:36,190 --> 00:19:39,100 And now, if someone was able to get access to the database, 445 00:19:39,100 --> 00:19:41,020 they wouldn't get access to all the passwords. 446 00:19:41,020 --> 00:19:43,452 They would only get access to the password hashes. 447 00:19:43,452 --> 00:19:45,160 Now, it's still a security vulnerability, 448 00:19:45,160 --> 00:19:48,850 because someone could, in theory, be able to figure out 449 00:19:48,850 --> 00:19:51,340 information about the password from the password hashes. 450 00:19:51,340 --> 00:19:54,460 But better, certainly, than literally storing the raw text 451 00:19:54,460 --> 00:19:55,891 of the password in the database. 452 00:19:55,891 --> 00:19:56,390 Yeah? 453 00:19:56,390 --> 00:19:59,660 AUDIENCE: Do we know how the hash functions generate that code? 454 00:19:59,660 --> 00:20:00,290 BRIAN YU: Yeah. 455 00:20:00,290 --> 00:20:02,420 The hash functions tend to be deterministic, 456 00:20:02,420 --> 00:20:05,330 and you look up what the hash functions themselves are. 457 00:20:05,330 --> 00:20:07,640 So there are a couple of quite popular hash functions 458 00:20:07,640 --> 00:20:10,437 that are out there that do this sort of thing. 
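One way the store-a-hash-and-compare idea could look in Python is sketched below, using only the standard library. The per-user random salt, the choice of PBKDF2, and the iteration count are additions beyond what the lecture describes, and the function names are hypothetical; with the same salt, the same password still always produces the same digest.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None):
    """Return (salt, digest); store these instead of the raw password."""
    if salt is None:
        salt = os.urandom(16)  # assumption: a fresh random salt per user
    # The iteration count is illustrative, not a recommendation.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_password(password, salt, stored_digest):
    """Hash the login attempt with the stored salt and compare the digests."""
    _, candidate = hash_password(password, salt)
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(candidate, stored_digest)
```

At login time the application would look up the user's salt and digest, call verify_password with what was typed in, and never need the original password.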
459 00:20:10,437 --> 00:20:12,770 But the idea of the hash function is similar to the idea 460 00:20:12,770 --> 00:20:16,430 of public and private keys, that it's very easy to hash something, 461 00:20:16,430 --> 00:20:19,250 and it's very difficult to go in the other direction. 462 00:20:19,250 --> 00:20:21,170 I can easily hash a password and generate 463 00:20:21,170 --> 00:20:22,490 something that looks like this. 464 00:20:22,490 --> 00:20:25,830 But it's a difficult operation to take something that looks like this 465 00:20:25,830 --> 00:20:30,400 and go backwards and figure out what the original password was. 466 00:20:30,400 --> 00:20:33,820 And so that's one of the properties of a good hash function. 467 00:20:33,820 --> 00:20:34,320 Yes? 468 00:20:34,320 --> 00:20:37,920 AUDIENCE: Did you actually hash these, or did you just hit the keyboard? 469 00:20:37,920 --> 00:20:41,280 BRIAN YU: I think these are probably-- 470 00:20:41,280 --> 00:20:43,800 there might be hidden messages here if you look carefully. 471 00:20:43,800 --> 00:20:44,960 But separate issue. 472 00:20:44,960 --> 00:20:47,519 473 00:20:47,519 --> 00:20:48,060 Other things? 474 00:20:48,060 --> 00:20:51,321 475 00:20:51,321 --> 00:20:51,820 OK. 476 00:20:51,820 --> 00:20:57,537 So how is it that data can potentially be leaked as a result of using a database? 477 00:20:57,537 --> 00:21:00,370 Well, there are a number of ways that applications can inadvertently 478 00:21:00,370 --> 00:21:02,150 leak information. 479 00:21:02,150 --> 00:21:03,460 Take a simple example. 480 00:21:03,460 --> 00:21:06,190 Oftentimes, you'll see websites that have a Forgot Your Password 481 00:21:06,190 --> 00:21:10,390 screen where you type in an email address, and you click Reset Password. 482 00:21:10,390 --> 00:21:12,850 And that sends you an email that allows you 483 00:21:12,850 --> 00:21:15,267 to reset your password, for example.
484 00:21:15,267 --> 00:21:17,350 And you imagine that you type in an email address, 485 00:21:17,350 --> 00:21:21,640 and you get, OK, password reset email has been sent. 486 00:21:21,640 --> 00:21:24,220 But maybe some applications work such that if you type 487 00:21:24,220 --> 00:21:27,070 in an email address that doesn't exist, then 488 00:21:27,070 --> 00:21:28,900 you get an error that says, OK, error. 489 00:21:28,900 --> 00:21:31,930 There is no user with that email address. 490 00:21:31,930 --> 00:21:34,120 What data has this application now exposed? 491 00:21:34,120 --> 00:21:36,717 492 00:21:36,717 --> 00:21:39,800 What information can you get just by using this part of a web application, 493 00:21:39,800 --> 00:21:40,570 for instance? 494 00:21:40,570 --> 00:21:41,070 Yeah? 495 00:21:41,070 --> 00:21:45,104 AUDIENCE: You know that that email address is not in the system, 496 00:21:45,104 --> 00:21:47,150 so you know that person is not using that app. 497 00:21:47,150 --> 00:21:48,150 BRIAN YU: Yeah, exactly. 498 00:21:48,150 --> 00:21:50,631 Just using the Forgot Password part of this application, 499 00:21:50,631 --> 00:21:53,130 you can tell exactly who has an account for this application 500 00:21:53,130 --> 00:21:56,740 and who doesn't just by typing email addresses and seeing what comes back. 501 00:21:56,740 --> 00:21:59,190 So there's potential vulnerabilities in terms of data 502 00:21:59,190 --> 00:22:00,577 that gets leaked there, as well. 503 00:22:00,577 --> 00:22:03,660 And there are all sorts of different ways that information can get leaked. 
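A common mitigation for the Forgot Password leak just described is to return the same response whether or not the address is registered. The sketch below assumes a hypothetical in-memory set of emails and a send_email callback; neither comes from the lecture, and it does not address differences in response time between the two cases.

```python
# Hypothetical stand-in for a users table lookup.
REGISTERED_EMAILS = {"alice@example.com"}

GENERIC_MESSAGE = "If an account exists for that address, a reset email has been sent."


def request_password_reset(email, send_email):
    """Send a reset email only to known users, but reply identically either way."""
    if email.strip().lower() in REGISTERED_EMAILS:
        send_email(email)
    # The caller cannot tell from this message whether the account exists.
    return GENERIC_MESSAGE
```

Because the visible message is identical in both branches, typing in addresses no longer reveals who has an account.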
504 00:22:03,660 --> 00:22:05,850 Oftentimes, there's a growing field whereby 505 00:22:05,850 --> 00:22:10,440 you can tell just based on the amount of time it takes for an HTTP request 506 00:22:10,440 --> 00:22:13,590 to come back whether or not-- 507 00:22:13,590 --> 00:22:16,110 you can get information about the data inside of a database 508 00:22:16,110 --> 00:22:19,809 based on that whereby if you make a request that takes a long time, that 509 00:22:19,809 --> 00:22:22,350 can tell you something different than if a request comes back 510 00:22:22,350 --> 00:22:24,930 very quickly, because that might mean fewer database requests 511 00:22:24,930 --> 00:22:27,780 were required in order to make that particular operation work 512 00:22:27,780 --> 00:22:29,340 or any number of different things. 513 00:22:29,340 --> 00:22:33,401 And so there are security vulnerabilities there, as well. 514 00:22:33,401 --> 00:22:33,900 Final one. 515 00:22:33,900 --> 00:22:35,700 I'll briefly mention the SQL injection. 516 00:22:35,700 --> 00:22:36,690 We've already talked about that. 517 00:22:36,690 --> 00:22:38,910 But again, something to be aware of just to make sure 518 00:22:38,910 --> 00:22:40,710 that whenever you're making database queries, 519 00:22:40,710 --> 00:22:42,751 you're protecting yourself against SQL injection, 520 00:22:42,751 --> 00:22:46,560 that you're making sure to either use a library that takes care of this for you 521 00:22:46,560 --> 00:22:48,990 or escape any characters that you might be using that 522 00:22:48,990 --> 00:22:55,155 could ultimately result in vulnerabilities in SQL. 523 00:22:55,155 --> 00:22:56,125 Yeah? 524 00:22:56,125 --> 00:22:57,920 AUDIENCE: How about the websites or tools 525 00:22:57,920 --> 00:23:02,280 like LastPass that store your credentials for other sites? 
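On the SQL injection point above, a minimal sketch of letting the library handle the escaping is a parameterized query. The example uses Python's built-in sqlite3 module with an in-memory database; the users table and the find_user helper are assumptions for illustration only.

```python
import sqlite3


def find_user(conn, username):
    """Look up a user with a placeholder instead of string formatting."""
    # The ? placeholder makes the driver pass the value separately from the
    # SQL text, so quotes inside username cannot change the query's structure.
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cur.fetchall()


# Hypothetical table, built in memory just for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)])
```

An input such as ' OR '1'='1 is then treated as a literal username that matches no rows, rather than as part of the SQL.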
526 00:23:02,280 --> 00:23:06,180 Don't they have to have some way of reversing their own hash on it 527 00:23:06,180 --> 00:23:10,880 in order to give you that credential when you go to another site? 528 00:23:10,880 --> 00:23:13,980 So when it auto fills your username and password, 529 00:23:13,980 --> 00:23:17,889 it has to-- if they're storing a hashed version on their side but filling 530 00:23:17,889 --> 00:23:21,673 in the plain text version in the password field, 531 00:23:21,673 --> 00:23:24,961 how are they able to reverse that in a way that is secure? 532 00:23:24,961 --> 00:23:27,316 They would have to have a table of keys or something 533 00:23:27,316 --> 00:23:30,365 that then is just as vulnerable as leaving the password. 534 00:23:30,365 --> 00:23:30,990 BRIAN YU: Yeah. 535 00:23:30,990 --> 00:23:34,694 So for password manager-type applications, it's a good question. 536 00:23:34,694 --> 00:23:37,860 I think the way most of them do this is that you have a master password that 537 00:23:37,860 --> 00:23:42,480 unlocks the entire database of the passwords that are stored there. 538 00:23:42,480 --> 00:23:44,760 And the idea would be that they're encrypted 539 00:23:44,760 --> 00:23:48,272 using the master password as the key to be the unlocker such 540 00:23:48,272 --> 00:23:49,230 that they're encrypted. 541 00:23:49,230 --> 00:23:51,180 And only by getting the master password correct 542 00:23:51,180 --> 00:23:53,054 can you then decrypt the information and then 543 00:23:53,054 --> 00:23:55,770 access the plain text version of the passwords that are inside. 544 00:23:55,770 --> 00:23:59,520 And so hashing and encryption and decryption are slightly different. 
545 00:23:59,520 --> 00:24:01,320 In the case of encryption and decryption, 546 00:24:01,320 --> 00:24:05,310 you still want to be able to go from the ciphertext back to the plain text, 547 00:24:05,310 --> 00:24:07,470 whereas in the case of the password hashing, 548 00:24:07,470 --> 00:24:11,032 you don't really care about the ability to reverse engineer it to go backwards. 549 00:24:11,032 --> 00:24:14,810 550 00:24:14,810 --> 00:24:15,317 All right. 551 00:24:15,317 --> 00:24:17,150 And finally, on the topic of security, we'll 552 00:24:17,150 --> 00:24:18,710 talk a little bit about JavaScript. 553 00:24:18,710 --> 00:24:21,950 JavaScript opens a whole host of different potential vulnerabilities 554 00:24:21,950 --> 00:24:23,290 from a security standpoint. 555 00:24:23,290 --> 00:24:25,070 But we'll talk about a couple. 556 00:24:25,070 --> 00:24:28,400 The first is this idea called cross-site scripting, 557 00:24:28,400 --> 00:24:32,960 or the idea of taking a script and being effectively able to inject it 558 00:24:32,960 --> 00:24:36,350 into some other site by putting some JavaScript that the web 559 00:24:36,350 --> 00:24:40,440 application didn't intend into the web application itself. 560 00:24:40,440 --> 00:24:44,876 And so here's a very simple web application written in Flask. 561 00:24:44,876 --> 00:24:46,500 And this is the entire web application. 562 00:24:46,500 --> 00:24:49,430 It's got a route, a default route, called / that just returns, "Hello, 563 00:24:49,430 --> 00:24:50,120 world!" 564 00:24:50,120 --> 00:24:52,640 And it's got an error handler that we didn't really see in the class. 565 00:24:52,640 --> 00:24:54,410 But basically, it handles whenever there's 566 00:24:54,410 --> 00:24:58,010 a 404 error, whenever you're trying to access a page that was not found. 
567 00:24:58,010 --> 00:25:02,120 And it just returns, "Not found," followed by request.path, whatever it 568 00:25:02,120 --> 00:25:04,700 is that was the URL that you requested. 569 00:25:04,700 --> 00:25:06,770 And so I could run this application. 570 00:25:06,770 --> 00:25:10,280 I'll go ahead and start up Chrome, and I'll go ahead 571 00:25:10,280 --> 00:25:18,020 and go to the source code for XSS1. 572 00:25:18,020 --> 00:25:20,520 I'll run this application. 573 00:25:20,520 --> 00:25:21,020 Go here. 574 00:25:21,020 --> 00:25:22,430 It says, "Hello, world!" 575 00:25:22,430 --> 00:25:27,350 And if I go to helloworld/foo, for example, some route that doesn't exist, 576 00:25:27,350 --> 00:25:30,770 I get not found, /foo, because that's not a route that's available on this 577 00:25:30,770 --> 00:25:31,550 page. 578 00:25:31,550 --> 00:25:32,870 I go to /bar. 579 00:25:32,870 --> 00:25:34,579 Not found, /bar. 580 00:25:34,579 --> 00:25:35,620 What could go wrong here? 581 00:25:35,620 --> 00:25:38,760 582 00:25:38,760 --> 00:25:41,550 Where's the security vulnerability, again, 583 00:25:41,550 --> 00:25:43,170 thinking in the context of JavaScript? 584 00:25:43,170 --> 00:25:47,620 585 00:25:47,620 --> 00:25:52,420 The page my application is returning is literally just "not found" 586 00:25:52,420 --> 00:25:56,400 followed by whatever was typed into the request path. 587 00:25:56,400 --> 00:26:04,630 And so you could imagine that instead of requesting /foo, 588 00:26:04,630 --> 00:26:09,250 I could instead make a request that looks something like /script 589 00:26:09,250 --> 00:26:13,240 alert('hi') and then /script, for instance, 590 00:26:13,240 --> 00:26:17,860 injecting some JavaScript into the request path whereby if I do that, 591 00:26:17,860 --> 00:26:22,150 I say, OK, /script alert('hi') /script. 592 00:26:22,150 --> 00:26:23,370 Press Return. 593 00:26:23,370 --> 00:26:25,756 And OK, Chrome is being smart about this.
594 00:26:25,756 --> 00:26:27,630 Chrome actually isn't allowing me to do this, 595 00:26:27,630 --> 00:26:30,370 because Chrome has some more advanced features that are basically 596 00:26:30,370 --> 00:26:32,800 saying Chrome detected unusual code on this page 597 00:26:32,800 --> 00:26:36,430 and blocked it to protect your personal information and error blocked 598 00:26:36,430 --> 00:26:37,720 by XSS auditor. 599 00:26:37,720 --> 00:26:38,980 That's cross-site scripting. 600 00:26:38,980 --> 00:26:40,930 So Chrome is automatically auditing for this. 601 00:26:40,930 --> 00:26:42,520 But not all browsers are like that. 602 00:26:42,520 --> 00:26:44,260 And I can, I think-- 603 00:26:44,260 --> 00:26:48,190 let's see if I can disable-- 604 00:26:48,190 --> 00:26:51,320 if I disable cross-site scripting protections, 605 00:26:51,320 --> 00:26:53,680 I think I can get this to-- yeah, OK. 606 00:26:53,680 --> 00:26:55,720 Disabling cross-site scripting protections, 607 00:26:55,720 --> 00:26:58,540 we can still type in the URL and actually get some JavaScript 608 00:26:58,540 --> 00:27:04,080 that the page didn't intend to still run on this particular web page. 609 00:27:04,080 --> 00:27:08,550 And so if someone were to send you a link that took you to this page, 610 00:27:08,550 --> 00:27:11,280 /script alert('hi'), you could get JavaScript to run that you 611 00:27:11,280 --> 00:27:12,180 didn't intend. 612 00:27:12,180 --> 00:27:13,589 And maybe that's not a big deal. 613 00:27:13,589 --> 00:27:15,630 But it could be a bigger deal in a situation that 614 00:27:15,630 --> 00:27:18,690 looks like this, where we have JavaScript 615 00:27:18,690 --> 00:27:23,730 and document.write is a function that just adds something to the page. 616 00:27:23,730 --> 00:27:27,720 And here, we're loading an image, img src, 617 00:27:27,720 --> 00:27:29,970 and the source is some hacker's website. 618 00:27:29,970 --> 00:27:33,540 And then we say, cookie= and then document.cookie.
619 00:27:33,540 --> 00:27:37,050 Document.cookie stores the cookie for this particular page. 620 00:27:37,050 --> 00:27:39,330 And so effectively, what's happening in this script 621 00:27:39,330 --> 00:27:43,110 is that your page, when you load it, is going to make a web 622 00:27:43,110 --> 00:27:45,630 request to the hacker's URL. 623 00:27:45,630 --> 00:27:47,970 And it's going to provide it as an argument whatever 624 00:27:47,970 --> 00:27:51,457 the value of your cookie is, for instance. 625 00:27:51,457 --> 00:27:53,790 And that cookie could be something that you use in order 626 00:27:53,790 --> 00:27:55,890 to log in as the credentials for some website, 627 00:27:55,890 --> 00:27:57,630 like a bank application or whatnot. 628 00:27:57,630 --> 00:27:59,970 And as a result, the hacker now has access 629 00:27:59,970 --> 00:28:02,910 to whatever the value of your cookie is, because they 630 00:28:02,910 --> 00:28:04,680 can look at their list of all the requests 631 00:28:04,680 --> 00:28:06,570 that have been made to the application much in the same way 632 00:28:06,570 --> 00:28:08,361 that you've been able to do in the terminal 633 00:28:08,361 --> 00:28:10,530 to see all the requests for your Flask application. 634 00:28:10,530 --> 00:28:15,030 And they can see that someone requested hacker_url?cookie= this cookie, 635 00:28:15,030 --> 00:28:18,090 and they can then use that cookie to be able to sign in to other sites, 636 00:28:18,090 --> 00:28:18,630 as well. 637 00:28:18,630 --> 00:28:21,480 So most modern browsers, like Chrome, are 638 00:28:21,480 --> 00:28:24,580 pretty good at defending against this sort of thing. 639 00:28:24,580 --> 00:28:28,650 But definitely something that is a potential vulnerability, especially 640 00:28:28,650 --> 00:28:31,380 for older browsers. 641 00:28:31,380 --> 00:28:33,220 Questions about this cross-site scripting? 642 00:28:33,220 --> 00:28:34,502 Yeah? 
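Rather than relying on the browser, the application itself can defend against this by escaping the request path before echoing it back. The sketch below is not the lecture's Flask code; the two function names are hypothetical stand-ins for the body of the 404 handler, and html.escape comes from Python's standard library.

```python
from html import escape


def not_found_page_unsafe(path):
    """Mirrors the vulnerable behavior: the path is echoed back as-is."""
    return "Not found: " + path


def not_found_page_safe(path):
    """Escape <, >, &, and quotes so the path is shown as text, not run as HTML."""
    return "Not found: " + escape(path)
```

With the safe version, a path containing a script tag is displayed literally on the page instead of being executed by the browser.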
643 00:28:34,502 --> 00:28:36,397 AUDIENCE: Are you getting the user's cookie, 644 00:28:36,397 --> 00:28:37,980 or whose cookie are you getting there? 645 00:28:37,980 --> 00:28:39,354 BRIAN YU: Whoever opens the page. 646 00:28:39,354 --> 00:28:42,090 So the user's cookie, potentially on an entirely different site. 647 00:28:42,090 --> 00:28:45,120 The idea is that if your site is vulnerable to cross-site 648 00:28:45,120 --> 00:28:48,270 scripting in this form, then you open up a possibility 649 00:28:48,270 --> 00:28:52,050 where someone could generate a link to your website that 650 00:28:52,050 --> 00:28:56,310 includes some JavaScript injected like this whereby someone else could 651 00:28:56,310 --> 00:28:59,280 steal the cookies of your users on your website. 652 00:28:59,280 --> 00:29:01,310 And they could get the cookies for themselves 653 00:29:01,310 --> 00:29:03,690 and use those cookies to sign into your website 654 00:29:03,690 --> 00:29:06,010 and pretend to be people that they're not, for example. 655 00:29:06,010 --> 00:29:07,760 There's a potential security threat there. 656 00:29:07,760 --> 00:29:10,950 657 00:29:10,950 --> 00:29:14,330 So cross-site scripting is one example of a JavaScript vulnerability. 658 00:29:14,330 --> 00:29:17,780 Another vulnerability is called cross-site request forgery. 659 00:29:17,780 --> 00:29:20,900 Imagine that you have a bank website, for instance, 660 00:29:20,900 --> 00:29:23,390 and that bank gives you a way to transfer money. 661 00:29:23,390 --> 00:29:27,997 And if you go to that URL /transfer and then you provide arguments as to who 662 00:29:27,997 --> 00:29:30,830 you're transferring money to and how much money you're transferring, 663 00:29:30,830 --> 00:29:31,910 you can transfer money. 664 00:29:31,910 --> 00:29:35,000 Might be a web request that allows you to do that. 
665 00:29:35,000 --> 00:29:38,690 Imagine some other website, some website where 666 00:29:38,690 --> 00:29:41,840 hackers are trying to steal money, where they have code that 667 00:29:41,840 --> 00:29:43,430 looks a little something like this. 668 00:29:43,430 --> 00:29:45,480 They have a link that says, "Click Here!" 669 00:29:45,480 --> 00:29:49,820 And when you click on the link, that takes you to yourbank.com/transfer 670 00:29:49,820 --> 00:29:53,090 transferring to a particular person, transferring a particular amount. 671 00:29:53,090 --> 00:29:56,240 And some unsuspecting user on this website could click the button. 672 00:29:56,240 --> 00:29:58,750 And as a result, that takes them to their bank. 673 00:29:58,750 --> 00:30:01,250 And if they happen to be logged into their bank at the time, 674 00:30:01,250 --> 00:30:04,050 that could result in actually making that transfer. 675 00:30:04,050 --> 00:30:06,260 So cross-site request forgery is the idea 676 00:30:06,260 --> 00:30:11,630 that some other site can make a request on your site by, in this case, 677 00:30:11,630 --> 00:30:13,890 linking to it. 678 00:30:13,890 --> 00:30:18,180 This still isn't an amazing threat, because the person actually still needs 679 00:30:18,180 --> 00:30:22,590 to click on the button in order to actually go 680 00:30:22,590 --> 00:30:25,639 to yourbank.com/transfer/whatever. 681 00:30:25,639 --> 00:30:28,680 But you can imagine that a clever hacker might be able to get around this 682 00:30:28,680 --> 00:30:31,380 by doing something like this-- 683 00:30:31,380 --> 00:30:34,807 rendering an image, for example, and saying the source of the image 684 00:30:34,807 --> 00:30:35,640 is going to be this. 685 00:30:35,640 --> 00:30:39,257 And when the browser sees an image tag in the HTML, it is just going to go to that URL 686 00:30:39,257 --> 00:30:40,590 and try and download that image.
687 00:30:40,590 --> 00:30:43,560 It's going to go to the URL, try and fetch that resource. 688 00:30:43,560 --> 00:30:47,280 And here, that resource is yourbank.com/transfer and then 689 00:30:47,280 --> 00:30:48,510 transferring that money. 690 00:30:48,510 --> 00:30:50,730 So the user doesn't even have to click on anything. 691 00:30:50,730 --> 00:30:54,750 And by making a GET request to yourbank.com/transfer, 692 00:30:54,750 --> 00:30:57,780 if yourbank.com isn't implemented particularly securely and just allows 693 00:30:57,780 --> 00:31:02,302 you to go to a URL like this to transfer money, then that could be the result. 694 00:31:02,302 --> 00:31:03,760 So how do you protect against this? 695 00:31:03,760 --> 00:31:08,280 696 00:31:08,280 --> 00:31:10,320 How would you protect against your website 697 00:31:10,320 --> 00:31:11,430 being able to do something like this? 698 00:31:11,430 --> 00:31:12,870 Because your website probably wants some way 699 00:31:12,870 --> 00:31:15,740 of being able to transfer money if you have a bank application, 700 00:31:15,740 --> 00:31:21,090 but you don't want to allow people to make requests like that. 701 00:31:21,090 --> 00:31:22,001 Answer, yeah? 702 00:31:22,001 --> 00:31:22,626 AUDIENCE: Yeah. 703 00:31:22,626 --> 00:31:23,060 It's facetious. 704 00:31:23,060 --> 00:31:24,010 BRIAN YU: Go for it. 705 00:31:24,010 --> 00:31:25,010 AUDIENCE: You get a better bank. 706 00:31:25,010 --> 00:31:25,760 BRIAN YU: Get a better bank. 707 00:31:25,760 --> 00:31:26,570 OK. 708 00:31:26,570 --> 00:31:29,810 Certainly something that would work. 709 00:31:29,810 --> 00:31:31,260 Other thoughts? 710 00:31:31,260 --> 00:31:32,110 Yeah? 711 00:31:32,110 --> 00:31:35,482 AUDIENCE: Change the form request type so it's not literally in your own 712 00:31:35,482 --> 00:31:36,065 [INAUDIBLE]. 713 00:31:36,065 --> 00:31:36,690 BRIAN YU: Yeah. 714 00:31:36,690 --> 00:31:39,231 Change the form request type so that it's not literally here. 
715 00:31:39,231 --> 00:31:41,172 So this right here is a GET request. 716 00:31:41,172 --> 00:31:44,130 You might imagine that instead, it's a form that's submitted by a POST, 717 00:31:44,130 --> 00:31:46,005 like a POST request, a form that you actually 718 00:31:46,005 --> 00:31:50,420 have to submit, click on a Submit button, in order to submit that form. 719 00:31:50,420 --> 00:31:55,630 And so now, you could imagine that someone could still 720 00:31:55,630 --> 00:31:58,510 create a vulnerability by doing something like this. 721 00:31:58,510 --> 00:32:03,130 They have a form whose action is yourbank.com/transfer submitting 722 00:32:03,130 --> 00:32:04,480 by method POST. 723 00:32:04,480 --> 00:32:07,300 And now, they have these inputs that are type hidden, 724 00:32:07,300 --> 00:32:10,440 which are just input fields that don't show up inside of a page. 725 00:32:10,440 --> 00:32:12,190 And they can have hidden input fields that 726 00:32:12,190 --> 00:32:16,120 specify who it's to, what the amount is, and then just some button that says, 727 00:32:16,120 --> 00:32:17,410 "Click Here!" 728 00:32:17,410 --> 00:32:19,420 And if they click here, then unwittingly, 729 00:32:19,420 --> 00:32:21,730 the user could be submitting a form to the bank that's 730 00:32:21,730 --> 00:32:24,270 initiating some transfer. 731 00:32:24,270 --> 00:32:27,900 And in fact, if the hacker is being particularly clever, 732 00:32:27,900 --> 00:32:29,880 you don't even need the user to click anything, 733 00:32:29,880 --> 00:32:32,640 because we can use event listeners to get around this. 734 00:32:32,640 --> 00:32:35,130 I could say body onload-- 735 00:32:35,130 --> 00:32:37,800 in other words, when the body of the page is done loading, 736 00:32:37,800 --> 00:32:39,270 run this JavaScript. 737 00:32:39,270 --> 00:32:42,930 Document.forms returns an array of all the forms in the web document. 738 00:32:42,930 --> 00:32:45,540 Square bracket 0 says get the first form.
739 00:32:45,540 --> 00:32:49,192 And there's a function in JavaScript called .submit that submits a form. 740 00:32:49,192 --> 00:32:51,900 So you can say, all right, get all the forms, get the first form, 741 00:32:51,900 --> 00:32:52,980 and run submit. 742 00:32:52,980 --> 00:32:55,440 And that's going to result in submitting this form, 743 00:32:55,440 --> 00:32:58,320 making a POST request to yourbank.com/transfer, 744 00:32:58,320 --> 00:33:02,704 which results in some amount being transferred. 745 00:33:02,704 --> 00:33:04,620 So this is a potential vulnerability, as well. 746 00:33:04,620 --> 00:33:06,540 If you're writing this bank application, you 747 00:33:06,540 --> 00:33:10,350 don't want to allow code like this to be able to get through your security, 748 00:33:10,350 --> 00:33:13,811 because that opens up a whole host of potential security vulnerabilities. 749 00:33:13,811 --> 00:33:15,810 And in general, the way that people tend to deal 750 00:33:15,810 --> 00:33:20,190 with this is by adding what's called a CSRF token, a Cross-Site Request 751 00:33:20,190 --> 00:33:25,350 Forgery token, basically adding some special, changing value 752 00:33:25,350 --> 00:33:28,800 into their own forms and then, anytime someone submits 753 00:33:28,800 --> 00:33:31,860 the form, checking to make sure the value of that token 754 00:33:31,860 --> 00:33:33,510 is, in fact, a valid token. 755 00:33:33,510 --> 00:33:37,260 And that way, someone couldn't fake it because some other form 756 00:33:37,260 --> 00:33:41,370 on some other hacker's website isn't going to have a valid CSRF 757 00:33:41,370 --> 00:33:44,580 token inside of their form page. 758 00:33:44,580 --> 00:33:49,020 And so larger scale web application frameworks, like Django, 759 00:33:49,020 --> 00:33:52,384 offer easy ways to add CSRF tokens to your forms, as well.
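The CSRF token idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: session is a plain dict standing in for a per-user server-side session, and the helper names are hypothetical; a real framework's built-in CSRF support would normally be used instead.

```python
import hmac
import secrets


def issue_csrf_token(session):
    """Generate a fresh random token, remember it, and return it for the form."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_csrf_token(session, submitted):
    """Accept the form only if it carries the token this session was issued."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # Compare as bytes in constant time.
    return hmac.compare_digest(expected.encode("utf-8"), submitted.encode("utf-8"))
```

The route that renders the form would embed the issued token in a hidden input, and the route that handles the POST would reject the request unless is_valid_csrf_token returns True. A form hosted on another site has no way to know the token, so its submission fails the check.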
760 00:33:52,384 --> 00:33:54,300 But just something to be aware of as you begin 761 00:33:54,300 --> 00:33:56,752 to think about, when you're designing a web application, 762 00:33:56,752 --> 00:33:57,960 how could someone exploit it? 763 00:33:57,960 --> 00:34:00,360 How could someone make requests on behalf of users 764 00:34:00,360 --> 00:34:02,640 that they don't intend to in order to get 765 00:34:02,640 --> 00:34:05,880 some malicious result to come about? 766 00:34:05,880 --> 00:34:09,679 So lots of security things to be thinking about. 767 00:34:09,679 --> 00:34:11,930 Questions about security or any of the security topics 768 00:34:11,930 --> 00:34:13,387 that we've covered or talked about? 769 00:34:13,387 --> 00:34:13,943 Yeah? 770 00:34:13,943 --> 00:34:17,807 AUDIENCE: [INAUDIBLE] the token is generated [INAUDIBLE] event, 771 00:34:17,807 --> 00:34:20,425 or it's a unique token for every user? 772 00:34:20,425 --> 00:34:21,050 BRIAN YU: Yeah. 773 00:34:21,050 --> 00:34:23,300 Imagine that in the case of CS50 Finance, 774 00:34:23,300 --> 00:34:26,239 for instance, that when I click on the Buy page that takes me 775 00:34:26,239 --> 00:34:29,480 to the page where I can buy stocks, my route for buy 776 00:34:29,480 --> 00:34:32,090 is going to basically generate a new token 777 00:34:32,090 --> 00:34:35,090 and insert it into the form that then gets displayed to me. 778 00:34:35,090 --> 00:34:37,489 And then when I submit that form, it gets submitted back 779 00:34:37,489 --> 00:34:38,582 to the same application. 780 00:34:38,582 --> 00:34:40,040 And the application can then check. 781 00:34:40,040 --> 00:34:43,678 Did the token that came back match the token that I inserted into the page? 
782 00:34:43,678 --> 00:34:45,469 And if they do, in fact, match, then that's 783 00:34:45,469 --> 00:34:48,110 a way of sort of verifying that the user was actually 784 00:34:48,110 --> 00:34:51,608 submitting the actual form and not some fake form 785 00:34:51,608 --> 00:34:53,232 that they were tricked into submitting. 786 00:34:53,232 --> 00:34:57,220 787 00:34:57,220 --> 00:34:57,720 All right. 788 00:34:57,720 --> 00:34:59,636 In that case, let's switch gears a little bit, 789 00:34:59,636 --> 00:35:01,320 and let's talk about scalability. 790 00:35:01,320 --> 00:35:02,940 Here again, there's going to be even less code. 791 00:35:02,940 --> 00:35:05,523 And the idea is just going to be, all right, what happens when 792 00:35:05,523 --> 00:35:07,110 we begin to scale our web application? 793 00:35:07,110 --> 00:35:09,750 We've got some web server, and we've got some users 794 00:35:09,750 --> 00:35:13,020 that are using that web server, which we're going to represent as that line. 795 00:35:13,020 --> 00:35:16,246 And so what happens when that server starts 796 00:35:16,246 --> 00:35:18,120 to have more users that are all trying to use 797 00:35:18,120 --> 00:35:19,980 the application at the same time? 798 00:35:19,980 --> 00:35:21,460 What do we do? 799 00:35:21,460 --> 00:35:24,810 Well, the first thing to probably do is figure out how many users 800 00:35:24,810 --> 00:35:26,460 our website can actually support. 801 00:35:26,460 --> 00:35:29,532 How many can it handle before it stops being able to support users? 802 00:35:29,532 --> 00:35:31,740 And so this is where benchmarking is quite important. 803 00:35:31,740 --> 00:35:35,880 Benchmarking is just this process by which we can test and sort of load test 804 00:35:35,880 --> 00:35:40,920 our application to see what we can do to see how many users we could potentially 805 00:35:40,920 --> 00:35:42,626 handle on our server. 
806 00:35:42,626 --> 00:35:45,000 And so what happens if we find out via benchmarking that, 807 00:35:45,000 --> 00:35:49,590 OK, our server can only hold 100 users? 808 00:35:49,590 --> 00:35:53,015 What if we need to support 101 users or 102 users? 809 00:35:53,015 --> 00:35:53,640 What can we do? 810 00:35:53,640 --> 00:35:59,390 811 00:35:59,390 --> 00:36:02,317 One thing we can do is called vertical scaling, where the idea here 812 00:36:02,317 --> 00:36:03,650 is, all right, we have a server. 813 00:36:03,650 --> 00:36:05,960 And that server only supports 100 users. 814 00:36:05,960 --> 00:36:08,510 All right, well, let's just get a bigger server, right? 815 00:36:08,510 --> 00:36:11,630 Let's get a server that supports 200 users or 300 users. 816 00:36:11,630 --> 00:36:14,022 And that's going to be able to better handle that load. 817 00:36:14,022 --> 00:36:15,480 But there's a limit to this, right? 818 00:36:15,480 --> 00:36:19,160 There's a limit to how much you can just increase the size of a server 819 00:36:19,160 --> 00:36:21,800 and increase its ability to handle load. 820 00:36:21,800 --> 00:36:24,565 And so what could you do to be able to handle more users? 821 00:36:24,565 --> 00:36:25,752 AUDIENCE: More servers. 822 00:36:25,752 --> 00:36:26,710 BRIAN YU: More servers. 823 00:36:26,710 --> 00:36:27,210 Great. 824 00:36:27,210 --> 00:36:29,500 And this is an idea called horizontal scaling, where 825 00:36:29,500 --> 00:36:31,360 the idea is that we have some server. 826 00:36:31,360 --> 00:36:33,412 And let's say, instead of having one server, 827 00:36:33,412 --> 00:36:36,370 let's go ahead and have two servers that are running the exact same web 828 00:36:36,370 --> 00:36:37,390 application. 829 00:36:37,390 --> 00:36:41,410 And now, we have two servers that are able to run the application 830 00:36:41,410 --> 00:36:44,020 and handle twice as many people. 831 00:36:44,020 --> 00:36:47,770 What problems come about now, logistically? 
832 00:36:47,770 --> 00:36:51,412 User tries to access our website, and now what? 833 00:36:51,412 --> 00:36:55,160 834 00:36:55,160 --> 00:36:55,680 Yeah? 835 00:36:55,680 --> 00:36:58,263 AUDIENCE: That means you could have a race condition situation 836 00:36:58,263 --> 00:37:01,915 or how the servers communicate to each other [INAUDIBLE]. 837 00:37:01,915 --> 00:37:02,540 BRIAN YU: Yeah. 838 00:37:02,540 --> 00:37:04,070 How do the servers communicate with each other? 839 00:37:04,070 --> 00:37:06,470 Certainly, race conditions become a threat, as well. 840 00:37:06,470 --> 00:37:10,090 And then a fundamental problem is a user comes to the site, 841 00:37:10,090 --> 00:37:12,650 and which server do they go to, right? 842 00:37:12,650 --> 00:37:16,911 We need some way of deciding which server to direct a particular user to. 843 00:37:16,911 --> 00:37:19,910 And so generally, this is solved by adding yet another piece of hardware 844 00:37:19,910 --> 00:37:23,150 into the mix, adding some load balancer in between the user 845 00:37:23,150 --> 00:37:25,964 and the servers whereby a user, when they request the page, 846 00:37:25,964 --> 00:37:28,880 rather than going straight to the server, they go to the load balancer 847 00:37:28,880 --> 00:37:29,715 first. 848 00:37:29,715 --> 00:37:32,090 And from there on, the load balancer can split people up, 849 00:37:32,090 --> 00:37:35,120 say certain people go to this server, certain people go to that server, 850 00:37:35,120 --> 00:37:38,090 and try and decide how it is that people are going to be 851 00:37:38,090 --> 00:37:41,240 divided into the different servers. 852 00:37:41,240 --> 00:37:44,570 And so how could a load balancer decide? 853 00:37:44,570 --> 00:37:47,880 If there are five servers and a user comes along, 854 00:37:47,880 --> 00:37:51,959 how should a load balancer decide which server to send a user to? 855 00:37:51,959 --> 00:37:53,500 There is no one right answer to this.
856 00:37:53,500 --> 00:37:56,041 There are a number of possible options, a number of different 857 00:37:56,041 --> 00:37:57,690 what are called load balancing methods. 858 00:37:57,690 --> 00:37:59,890 But how could you decide where to send a user? 859 00:37:59,890 --> 00:38:01,529 Yeah? 860 00:38:01,529 --> 00:38:04,490 AUDIENCE: The server with the least amount of users currently. 861 00:38:04,490 --> 00:38:04,730 BRIAN YU: Sure. 862 00:38:04,730 --> 00:38:06,610 The server with the fewest users currently, what's often 863 00:38:06,610 --> 00:38:08,900 called the fewest connections load balancing method. 864 00:38:08,900 --> 00:38:11,800 You try and figure out which server has the fewest people on it. 865 00:38:11,800 --> 00:38:14,620 And whichever one has the fewest people on it, send the user there. 866 00:38:14,620 --> 00:38:18,100 Definitely good for trying to make sure that each one has about an equal load, 867 00:38:18,100 --> 00:38:20,037 but potentially computationally expensive. 868 00:38:20,037 --> 00:38:22,620 You're doing a lot of calculation now, so there's a trade off. 869 00:38:22,620 --> 00:38:22,850 Yeah? 870 00:38:22,850 --> 00:38:24,200 AUDIENCE: You could just do it randomly. 871 00:38:24,200 --> 00:38:24,880 BRIAN YU: You could do it randomly. 872 00:38:24,880 --> 00:38:27,250 You could just generate a random number between 1 and 5 873 00:38:27,250 --> 00:38:29,374 and randomly assign someone to a particular server. 874 00:38:29,374 --> 00:38:30,970 Definitely something you could do. 875 00:38:30,970 --> 00:38:32,080 Other things? 876 00:38:32,080 --> 00:38:34,224 Certainly the random approach is quick. 877 00:38:34,224 --> 00:38:36,640 It doesn't involve having to do any calculation across all 878 00:38:36,640 --> 00:38:38,252 the different servers. 
879 00:38:38,252 --> 00:38:40,210 But if you're unlucky, you could end up putting 880 00:38:40,210 --> 00:38:43,600 a lot of people on server number two and not many people on server number eight 881 00:38:43,600 --> 00:38:44,290 or whatnot. 882 00:38:44,290 --> 00:38:45,220 And so what else could we do? 883 00:38:45,220 --> 00:38:45,720 Yeah? 884 00:38:45,720 --> 00:38:49,125 AUDIENCE: Just set up a counter [INAUDIBLE]. 885 00:38:49,125 --> 00:38:49,750 BRIAN YU: Sure. 886 00:38:49,750 --> 00:38:50,625 Some sort of counter. 887 00:38:50,625 --> 00:38:53,260 If you only have two, you just alternate odd, even, odd, even. 888 00:38:53,260 --> 00:38:54,010 Go to this server. 889 00:38:54,010 --> 00:38:54,820 Go to that one. 890 00:38:54,820 --> 00:38:57,153 If you've got eight, you just rotate amongst the eight-- 891 00:38:57,153 --> 00:38:59,045 1, 2, 3, 4, 5, 6, 7, 8 and go back to 1. 892 00:38:59,045 --> 00:39:02,170 And so these are probably three of the most common load balancing methods-- 893 00:39:02,170 --> 00:39:05,336 random choice, whereby you just pick a random server, direct the user there; 894 00:39:05,336 --> 00:39:09,040 round robin, where we do exactly that, just basically go one up until the end 895 00:39:09,040 --> 00:39:12,220 and then go back to server number one; and then fewest connections, whereby 896 00:39:12,220 --> 00:39:14,530 you try and actually calculate which server currently 897 00:39:14,530 --> 00:39:16,810 has the fewest number of people on it and then 898 00:39:16,810 --> 00:39:20,887 try and direct the user to that one with the fewest connections. 899 00:39:20,887 --> 00:39:22,720 There are other methods in addition to these, 900 00:39:22,720 --> 00:39:24,520 but these are perhaps three of the most intuitive 901 00:39:24,520 --> 00:39:26,460 where you can start to see their trade offs.
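The three load balancing methods just listed can be sketched as small Python functions. This is only an illustration under simple assumptions: servers are identified by name, the connection counts are already known to the load balancer, and the function names are hypothetical rather than taken from any real load balancer.

```python
import itertools
import random


def pick_random(servers):
    """Random choice: cheap, but the load may end up uneven."""
    return random.choice(servers)


def make_round_robin(servers):
    """Round robin: hand out servers in order, then wrap around to the first."""
    rotation = itertools.cycle(servers)
    return lambda: next(rotation)


def pick_fewest_connections(connections):
    """Fewest connections: `connections` maps a server name to its current
    number of open connections; return the least loaded server."""
    return min(connections, key=connections.get)
```

The trade-off from the discussion shows up directly: the first two functions need no knowledge of the servers' state, while the third needs an up-to-date count for every server on every request.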
902 00:39:26,460 --> 00:39:28,210 Depending upon the type of user experience 903 00:39:28,210 --> 00:39:30,460 you want, depending on how computationally 904 00:39:30,460 --> 00:39:34,420 expensive certain operations are, you might choose different load balancing 905 00:39:34,420 --> 00:39:36,610 methods. 906 00:39:36,610 --> 00:39:37,750 Yeah? 907 00:39:37,750 --> 00:39:41,470 AUDIENCE: [INAUDIBLE] benchmarking, and what are some common ways to do that? 908 00:39:41,470 --> 00:39:44,304 BRIAN YU: Yeah, there are software tools that can do this. 909 00:39:44,304 --> 00:39:46,970 There are a number of different ones-- the names are escaping me 910 00:39:46,970 --> 00:39:47,690 at the moment-- 911 00:39:47,690 --> 00:39:51,260 where you can basically test on a particular URL 912 00:39:51,260 --> 00:39:54,550 and get a sense for how well it's able to handle that load. 913 00:39:54,550 --> 00:39:59,430 And if you have particular use cases, I can chat with you about that, as well. 914 00:39:59,430 --> 00:40:01,980 So all right, let's imagine we have two servers now. 915 00:40:01,980 --> 00:40:04,860 And every time a user makes an HTTP request 916 00:40:04,860 --> 00:40:06,900 to a server, every time they request a page, 917 00:40:06,900 --> 00:40:09,240 we direct them to one server or the other server using 918 00:40:09,240 --> 00:40:12,330 one of these methods, either by choosing randomly or by round robin 919 00:40:12,330 --> 00:40:15,510 or by figuring out which one currently has the fewest users connected to it 920 00:40:15,510 --> 00:40:17,580 or is handling the fewest connections. 921 00:40:17,580 --> 00:40:18,330 What can go wrong? 922 00:40:18,330 --> 00:40:21,094 923 00:40:21,094 --> 00:40:24,260 Whenever we're dealing with issues of scale, we just try and solve a problem 924 00:40:24,260 --> 00:40:26,134 and figure out what new problems have arisen. 925 00:40:26,134 --> 00:40:33,724 926 00:40:33,724 --> 00:40:34,720 Yeah? 
927 00:40:34,720 --> 00:40:37,400 AUDIENCE: You only have five servers, and now you need six. 928 00:40:37,400 --> 00:40:37,640 BRIAN YU: Yeah. 929 00:40:37,640 --> 00:40:40,010 Certainly, if you only have five servers and suddenly you need six, 930 00:40:40,010 --> 00:40:42,051 that could potentially become a problem, as well. 931 00:40:42,051 --> 00:40:44,180 But let's even assume that we have enough servers. 932 00:40:44,180 --> 00:40:47,600 We have five servers, and every time someone loads a page, 933 00:40:47,600 --> 00:40:51,470 they get sent to a different server based on one of these methods. 934 00:40:51,470 --> 00:40:54,804 What can still go wrong with the user experience? 935 00:40:54,804 --> 00:40:56,470 And in particular, I'll give you a hint. 936 00:40:56,470 --> 00:40:58,740 Let's think about sessions. 937 00:40:58,740 --> 00:40:59,490 What can go wrong? 938 00:40:59,490 --> 00:41:04,290 939 00:41:04,290 --> 00:41:07,040 Remember, sessions were ways of storing information-- in our case, 940 00:41:07,040 --> 00:41:08,702 inside of the server-- 941 00:41:08,702 --> 00:41:10,910 about the user's current interaction with the server. 942 00:41:10,910 --> 00:41:12,574 It stored which user was logged in. 943 00:41:12,574 --> 00:41:14,740 It stored the current state of the tic-tac-toe game. 944 00:41:14,740 --> 00:41:15,940 It stored other information. 945 00:41:15,940 --> 00:41:17,126 Yeah? 946 00:41:17,126 --> 00:41:19,994 AUDIENCE: You have to pick one [INAUDIBLE]. 947 00:41:19,994 --> 00:41:24,691 948 00:41:24,691 --> 00:41:25,690 BRIAN YU: Yeah, exactly.
949 00:41:25,690 --> 00:41:29,689 If I initially load a page and I go to server one and some information 950 00:41:29,689 --> 00:41:32,230 about me is stored in the session, like whether I'm logged in 951 00:41:32,230 --> 00:41:34,790 or the current state of my game or something else, 952 00:41:34,790 --> 00:41:37,000 and then I load another page and it takes 953 00:41:37,000 --> 00:41:40,750 me to server four this time, well, now, that server 954 00:41:40,750 --> 00:41:43,900 doesn't have access to the same session information 955 00:41:43,900 --> 00:41:46,420 that server one had if the information about the session 956 00:41:46,420 --> 00:41:47,654 was stored in the server. 957 00:41:47,654 --> 00:41:49,070 And now, that information is lost. 958 00:41:49,070 --> 00:41:50,986 So I could load a page, and suddenly, now, I'm 959 00:41:50,986 --> 00:41:53,036 logged out of the page for no apparent reason 960 00:41:53,036 --> 00:41:54,910 even though I logged in just a moment ago. 961 00:41:54,910 --> 00:41:56,680 And then I could go to another page, and maybe by chance, 962 00:41:56,680 --> 00:41:59,390 I'm back to server one, and now I'm logged in again. 963 00:41:59,390 --> 00:42:01,730 So strange things can begin to happen. 964 00:42:01,730 --> 00:42:03,640 And so to solve that, what could we do? 965 00:42:03,640 --> 00:42:06,410 966 00:42:06,410 --> 00:42:08,780 How can we make sure that sessions are preserved 967 00:42:08,780 --> 00:42:11,060 when the user is requesting pages? 968 00:42:11,060 --> 00:42:13,860 969 00:42:13,860 --> 00:42:15,270 Again, no one correct answer. 970 00:42:15,270 --> 00:42:16,440 Multiple possibilities here. 971 00:42:16,440 --> 00:42:19,170 972 00:42:19,170 --> 00:42:21,574 How do we solve this problem? 973 00:42:21,574 --> 00:42:22,074 Yeah? 974 00:42:22,074 --> 00:42:25,380 AUDIENCE: Would there be any way to store the session on the load balancer? 975 00:42:25,380 --> 00:42:27,750 BRIAN YU: Store the session on the load balancer.
976 00:42:27,750 --> 00:42:28,650 That's a good idea. 977 00:42:28,650 --> 00:42:30,858 And that will actually get me at the first idea here, 978 00:42:30,858 --> 00:42:33,689 which is this idea of sticky sessions. 979 00:42:33,689 --> 00:42:34,980 And this is slightly different. 980 00:42:34,980 --> 00:42:40,020 Rather than store all the session information in the load balancer, 981 00:42:40,020 --> 00:42:42,750 it just needs to store for this particular user which 982 00:42:42,750 --> 00:42:45,400 server has their session information. 983 00:42:45,400 --> 00:42:47,640 So if I went to server number one initially, 984 00:42:47,640 --> 00:42:51,649 the load balancer will remember me based on my IP address, cookie, or whatever 985 00:42:51,649 --> 00:42:53,940 and say, all right, next time I try and request a page, 986 00:42:53,940 --> 00:42:56,760 let me direct them back to server number one, for instance. 987 00:42:56,760 --> 00:43:00,624 That way, whenever I come back, I'm always going to go to the same place. 988 00:43:00,624 --> 00:43:02,790 There are other ways to solve this problem, as well. 989 00:43:02,790 --> 00:43:04,915 You could store session information in the database 990 00:43:04,915 --> 00:43:06,510 that all the servers have access to. 991 00:43:06,510 --> 00:43:09,210 You could store session information on the client side, whereby 992 00:43:09,210 --> 00:43:11,850 it doesn't matter what server you go to, because all the session information is 993 00:43:11,850 --> 00:43:12,624 inside the client. 994 00:43:12,624 --> 00:43:14,790 So there are a number of ways to solve this problem, 995 00:43:14,790 --> 00:43:18,300 but these generally fall under the heading of session-aware load 996 00:43:18,300 --> 00:43:20,870 balancing. 997 00:43:20,870 --> 00:43:23,750 Someone mentioned the problem of, OK, well, I have five servers, 998 00:43:23,750 --> 00:43:26,720 but what happens when I need six? 
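Going back to the sticky sessions idea described above, a rough Python sketch might look like the following. It assumes the load balancer can identify a returning user by some key (an IP address or a cookie value, as mentioned); the class name `StickyBalancer` and the use of round robin for first-time visitors are assumptions made for illustration.

```python
import itertools


class StickyBalancer:
    """Remembers which server each client was first sent to."""

    def __init__(self, servers):
        self._rotation = itertools.cycle(servers)  # used for new clients only
        self._assigned = {}  # client key -> server name

    def route(self, client_key):
        """Return the server for this client, choosing one on first contact."""
        if client_key not in self._assigned:
            self._assigned[client_key] = next(self._rotation)
        return self._assigned[client_key]
```

Note that the balancer stores only the client-to-server mapping, not the session data itself, which stays on the server that first handled the user.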
999 00:43:26,720 --> 00:43:28,930 To solve this in the world of cloud computing, 1000 00:43:28,930 --> 00:43:31,520 where nowadays most people don't maintain their own hardware 1001 00:43:31,520 --> 00:43:33,680 for their web applications, they just rent out 1002 00:43:33,680 --> 00:43:37,310 hardware on someone else's servers, for instance, on AWS, for instance, 1003 00:43:37,310 --> 00:43:39,830 use Amazon servers-- 1004 00:43:39,830 --> 00:43:44,800 you can take advantage of auto scaling, which automatically will grow or shrink 1005 00:43:44,800 --> 00:43:47,550 the number of servers based upon load, whereby you could initially 1006 00:43:47,550 --> 00:43:48,262 have two servers. 1007 00:43:48,262 --> 00:43:50,220 But if more users come about and you need more, 1008 00:43:50,220 --> 00:43:51,845 we can add a third server into the mix. 1009 00:43:51,845 --> 00:43:53,511 More people come out, we need even more. 1010 00:43:53,511 --> 00:43:54,600 We add a fourth server. 1011 00:43:54,600 --> 00:43:57,200 And auto scaling goes in both directions. 1012 00:43:57,200 --> 00:43:59,790 So if suddenly we find, all right, we had a lot of load 1013 00:43:59,790 --> 00:44:02,340 at this particular peak time of the day but now there are 1014 00:44:02,340 --> 00:44:05,374 fewer users on the site, the auto load balancer can sort of say, 1015 00:44:05,374 --> 00:44:07,290 all right, we don't need four servers anymore. 1016 00:44:07,290 --> 00:44:09,780 Let's go back to three and then later on, if it needs doing, 1017 00:44:09,780 --> 00:44:10,821 go back up to four again. 1018 00:44:10,821 --> 00:44:15,810 And it can automatically, dynamically reconfigure the number of servers 1019 00:44:15,810 --> 00:44:19,050 in order to figure out what the optimal number is 1020 00:44:19,050 --> 00:44:23,170 given the number of users that are currently using the application. 1021 00:44:23,170 --> 00:44:28,450 What happens, though, when one of the servers fails for some reason? 
1022 00:44:28,450 --> 00:44:31,740 The server just dies, for instance. 1023 00:44:31,740 --> 00:44:34,817 The load balancer doesn't necessarily know about that. 1024 00:44:34,817 --> 00:44:37,650 And so if it's still directing people across four different servers, 1025 00:44:37,650 --> 00:44:43,200 it could direct users to that server that is no longer operational. 1026 00:44:43,200 --> 00:44:45,730 Any thoughts on how we might solve that problem? 1027 00:44:45,730 --> 00:44:46,230 Yeah? 1028 00:44:46,230 --> 00:44:49,395 AUDIENCE: Have the load balancer ping the server at determined intervals 1029 00:44:49,395 --> 00:44:50,520 to see if it's still there. 1030 00:44:50,520 --> 00:44:51,900 BRIAN YU: Yeah, some sort of ping to make sure 1031 00:44:51,900 --> 00:44:53,040 that the server is still there. 1032 00:44:53,040 --> 00:44:55,206 And often, one of the easiest ways that this is done 1033 00:44:55,206 --> 00:44:57,780 is via what's called a heartbeat, whereby each of the servers 1034 00:44:57,780 --> 00:45:01,680 gives off a heartbeat every fixed number of seconds or minutes, for instance, 1035 00:45:01,680 --> 00:45:04,920 whereby, say, every 10 seconds the server emits a heartbeat 1036 00:45:04,920 --> 00:45:06,540 that gets sent to the load balancer. 1037 00:45:06,540 --> 00:45:09,660 If ever the load balancer doesn't hear the heartbeat from the server, 1038 00:45:09,660 --> 00:45:12,579 it can know that that server is no longer operational, and it can say, 1039 00:45:12,579 --> 00:45:13,620 all right, you know what? 1040 00:45:13,620 --> 00:45:17,535 Let's stop sending users there and only send users to the other three servers. 1041 00:45:17,535 --> 00:45:20,200 1042 00:45:20,200 --> 00:45:24,740 Questions about that or any of the ideas of how we scale our servers 1043 00:45:24,740 --> 00:45:26,840 to be able to handle load?
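The heartbeat mechanism described above could be tracked on the load balancer side roughly as follows. This is a sketch under stated assumptions: the class name `HeartbeatMonitor` is hypothetical, timestamps are passed in explicitly (a real system might use `time.monotonic()`), and the 30-second timeout, which tolerates a couple of missed 10-second heartbeats, is an arbitrary choice.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat from each server and reports which
    servers are still considered alive."""

    def __init__(self, timeout_seconds=30.0):
        # Assumed policy: a server is dropped after this much silence.
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # server name -> time of last heartbeat

    def record_heartbeat(self, server, now):
        """Called whenever a heartbeat arrives from `server` at time `now`."""
        self._last_seen[server] = now

    def healthy_servers(self, now):
        """Return the servers heard from within the timeout, sorted by name."""
        return sorted(
            server
            for server, seen in self._last_seen.items()
            if now - seen <= self.timeout_seconds
        )
```

The load balancer would then apply its chosen balancing method only to the list returned by `healthy_servers`.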
1044 00:45:26,840 --> 00:45:29,340 We decided, all right, if too many people are on one server, 1045 00:45:29,340 --> 00:45:31,550 we need to split up into two different servers. 1046 00:45:31,550 --> 00:45:33,140 But that introduced a bunch of problems that we 1047 00:45:33,140 --> 00:45:36,098 had to solve-- problems about load balancing, problems about what to do 1048 00:45:36,098 --> 00:45:38,700 about sessions, so on and so forth. 1049 00:45:38,700 --> 00:45:39,820 Yeah? 1050 00:45:39,820 --> 00:45:43,215 AUDIENCE: Do you hear a lot about distributed servers? 1051 00:45:43,215 --> 00:45:45,640 I'm wondering how they [INAUDIBLE]. 1052 00:45:45,640 --> 00:45:48,966 1053 00:45:48,966 --> 00:45:49,590 BRIAN YU: Sure. 1054 00:45:49,590 --> 00:45:52,260 How do servers share data? 1055 00:45:52,260 --> 00:45:54,030 Well, they use databases. 1056 00:45:54,030 --> 00:45:57,930 And of course, as we start to figure out what to do with more and more servers, 1057 00:45:57,930 --> 00:46:00,180 we also need to figure out what to do about databases, 1058 00:46:00,180 --> 00:46:03,720 figure out how to scale databases and make sure that as we scale them, 1059 00:46:03,720 --> 00:46:06,430 the databases are able to handle that load, as well. 1060 00:46:06,430 --> 00:46:08,880 And so in the past, we've had, all right, a load balancer. 1061 00:46:08,880 --> 00:46:10,050 We've got servers. 1062 00:46:10,050 --> 00:46:13,170 And in our model right now, we have a database that both of these servers 1063 00:46:13,170 --> 00:46:15,270 are connected to. 1064 00:46:15,270 --> 00:46:18,990 But of course, the problem is soon going to arise of, all right, 1065 00:46:18,990 --> 00:46:20,910 now we've got a lot of servers that are all 1066 00:46:20,910 --> 00:46:23,250 trying to connect to the same database. 
1067 00:46:23,250 --> 00:46:25,110 And now, we've got yet another single point 1068 00:46:25,110 --> 00:46:27,210 where things could potentially go wrong or where 1069 00:46:27,210 --> 00:46:29,190 we could potentially be overloaded. 1070 00:46:29,190 --> 00:46:31,020 So how do we solve this type of problem? 1071 00:46:31,020 --> 00:46:33,900 One of the most common ways is database partitioning. 1072 00:46:33,900 --> 00:46:36,610 One form of database partitioning you've, in fact, already seen, 1073 00:46:36,610 --> 00:46:39,180 and it's just an extension of what we've been doing with SQL, 1074 00:46:39,180 --> 00:46:41,430 whereby we have this flights table. 1075 00:46:41,430 --> 00:46:45,540 And we could say, all right, rather than store the origin and the origin code, 1076 00:46:45,540 --> 00:46:47,679 let's go ahead and separate what's in one table 1077 00:46:47,679 --> 00:46:48,970 into a couple different tables. 1078 00:46:48,970 --> 00:46:51,780 Let's separate the flights table into a locations table 1079 00:46:51,780 --> 00:46:55,140 where the locations table has a number for each possible location. 1080 00:46:55,140 --> 00:46:57,210 And then the flights table, now, 1081 00:46:57,210 --> 00:47:03,240 only needs to store a single number each for the origin ID and the destination ID. 1082 00:47:03,240 --> 00:47:05,550 We could also separate tables in different ways. 1083 00:47:05,550 --> 00:47:09,450 If we have some general way we could partition 1084 00:47:09,450 --> 00:47:11,700 a table into different parts that are generally 1085 00:47:11,700 --> 00:47:13,980 going to be queried separately, then we can 1086 00:47:13,980 --> 00:47:16,560 do another partition where I could say, all right, 1087 00:47:16,560 --> 00:47:18,330 my flights table is getting big. 1088 00:47:18,330 --> 00:47:19,620 Let's split it up.
1089 00:47:19,620 --> 00:47:23,670 And all right, at my airline, the international departures and arrivals 1090 00:47:23,670 --> 00:47:26,520 are handled separately from the domestic departures and arrivals. 1091 00:47:26,520 --> 00:47:28,647 So no need for those to be in the same table. 1092 00:47:28,647 --> 00:47:30,855 Let me just go ahead and take flights and separate it 1093 00:47:30,855 --> 00:47:33,480 into a domestic flights table and an international flights table, 1094 00:47:33,480 --> 00:47:34,060 for instance. 1095 00:47:34,060 --> 00:47:36,900 One way to just partition things into two different tables that 1096 00:47:36,900 --> 00:47:39,570 could potentially be stored in different places that ultimately 1097 00:47:39,570 --> 00:47:43,680 allows for handling of scale. 1098 00:47:43,680 --> 00:47:45,887 But ultimately, all of these are problems 1099 00:47:45,887 --> 00:47:48,720 that are still going to lead to the fundamental problem of if I only 1100 00:47:48,720 --> 00:47:52,530 have one database and 10 or dozens of servers that are all 1101 00:47:52,530 --> 00:47:54,510 trying to communicate with that same database, 1102 00:47:54,510 --> 00:47:55,884 we're going to run into problems. 1103 00:47:55,884 --> 00:47:58,830 The database can only handle some fixed number of connections. 1104 00:47:58,830 --> 00:48:03,120 And so one solution to this is database replication. 1105 00:48:03,120 --> 00:48:06,600 So all right, how does database replication work? 1106 00:48:06,600 --> 00:48:10,140 Well, probably the simplest form of database replication 1107 00:48:10,140 --> 00:48:13,410 is what's called single primary replication, whereby 1108 00:48:13,410 --> 00:48:16,470 I have one what's called primary database and maybe 1109 00:48:16,470 --> 00:48:18,460 three databases in total, but only one that I'm 1110 00:48:18,460 --> 00:48:20,460 going to consider the primary one. 1111 00:48:20,460 --> 00:48:22,980 And you can read data from any of the databases. 
1112 00:48:22,980 --> 00:48:25,380 You can get data out of any of the three databases, 1113 00:48:25,380 --> 00:48:29,100 whereby if there are three servers and each one wants to read data, 1114 00:48:29,100 --> 00:48:31,620 they can just share among the three databases reading data 1115 00:48:31,620 --> 00:48:33,578 to make sure that we're not overloading any one 1116 00:48:33,578 --> 00:48:35,600 database with too many connections. 1117 00:48:35,600 --> 00:48:39,970 But you can only write data to a single database. 1118 00:48:39,970 --> 00:48:42,220 And by only writing data to a single database, 1119 00:48:42,220 --> 00:48:44,860 that means that anytime this database is updated, 1120 00:48:44,860 --> 00:48:47,276 then this database, our primary database, 1121 00:48:47,276 --> 00:48:49,150 just needs to update the other two databases. 1122 00:48:49,150 --> 00:48:52,000 Say, all right, there's been a change made to the primary database. 1123 00:48:52,000 --> 00:48:54,310 And it's the primary database's responsibility 1124 00:48:54,310 --> 00:48:59,005 to then communicate to the other two databases what those changes are. 1125 00:48:59,005 --> 00:49:00,970 And so that's single-primary replication. 1126 00:49:00,970 --> 00:49:01,470 Yeah? 1127 00:49:01,470 --> 00:49:04,720 AUDIENCE: How is that more efficient than just communicating with all three 1128 00:49:04,720 --> 00:49:05,341 of them? 1129 00:49:05,341 --> 00:49:07,090 Because I think you're sending information 1130 00:49:07,090 --> 00:49:09,460 from the first database to the second and third. 1131 00:49:09,460 --> 00:49:16,160 [INAUDIBLE] information sent that's just rewriting to all three of them. 1132 00:49:16,160 --> 00:49:17,410 BRIAN YU: That's true, though. 1133 00:49:17,410 --> 00:49:19,330 Databases could potentially batch information 1134 00:49:19,330 --> 00:49:21,354 together into transactions and things and groups 1135 00:49:21,354 --> 00:49:23,020 so as to be a little bit more efficient. 
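As a rough model of single-primary replication, the sketch below uses plain Python dictionaries in place of real database servers. Everything here is an assumption made for illustration: the class name `SinglePrimaryCluster`, the round-robin choice of which copy serves a read, and especially the fact that writes are copied to the replicas immediately, whereas real systems often propagate changes with some delay.

```python
import itertools


class SinglePrimaryCluster:
    """One primary that accepts writes, plus replicas that only serve reads."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        # Reads are shared across the primary and every replica in turn.
        self._readers = itertools.cycle([self.primary] + self.replicas)

    def write(self, key, value):
        """All writes go to the primary, which then updates each replica."""
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value  # simplification: synchronous propagation

    def read(self, key):
        """Serve the read from the next database in the rotation."""
        return next(self._readers).get(key)
```

With three databases in total, three consecutive reads are served by three different copies, while every write still passes through the single primary.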
1136 00:49:23,020 --> 00:49:24,400 So certainly ways around that problem. 1137 00:49:24,400 --> 00:49:25,358 But yeah, a good point. 1138 00:49:25,358 --> 00:49:29,400 1139 00:49:29,400 --> 00:49:32,160 Of course, this helps the read problem. 1140 00:49:32,160 --> 00:49:35,220 It makes it easier to be able to read data out of databases. 1141 00:49:35,220 --> 00:49:37,590 But it leaves open a potential vulnerability 1142 00:49:37,590 --> 00:49:40,500 or a potential scalability problem with regard to writing data, 1143 00:49:40,500 --> 00:49:43,410 because there is still only a single database to which I can actually 1144 00:49:43,410 --> 00:49:46,590 write data if that one database is responsible for updating 1145 00:49:46,590 --> 00:49:48,090 all of the other databases. 1146 00:49:48,090 --> 00:49:50,160 And so a more complex version of this is what's 1147 00:49:50,160 --> 00:49:52,200 known as multi-primary replication, where 1148 00:49:52,200 --> 00:49:55,410 the idea is that each database can be read from and written to. 1149 00:49:55,410 --> 00:49:57,900 But now, updates get a lot more complicated. 1150 00:49:57,900 --> 00:50:00,480 All of the databases need to have some notion and some way 1151 00:50:00,480 --> 00:50:02,340 of being able to update each other. 1152 00:50:02,340 --> 00:50:04,170 And there, conflicts begin to arise. 1153 00:50:04,170 --> 00:50:07,680 You can have update conflicts where two different databases 1154 00:50:07,680 --> 00:50:09,129 have updated the same row. 1155 00:50:09,129 --> 00:50:10,920 All right, how do you resolve that problem? 1156 00:50:10,920 --> 00:50:13,500 You can have uniqueness conflicts, whereby 1157 00:50:13,500 --> 00:50:17,220 if you add a row to each of two databases at the same time, maybe 1158 00:50:17,220 --> 00:50:18,570 they get the same ID.
1159 00:50:18,570 --> 00:50:21,390 Maybe this one only has 27 rows, so this database 1160 00:50:21,390 --> 00:50:24,507 adds a new row with ID number 28, and this database does the same thing. 1161 00:50:24,507 --> 00:50:26,340 And now, when they try to update each other, 1162 00:50:26,340 --> 00:50:28,096 we have two rows with the same ID. 1163 00:50:28,096 --> 00:50:29,970 And now, we need some way of resolving those, 1164 00:50:29,970 --> 00:50:31,770 because the IDs are supposed to be unique. 1165 00:50:31,770 --> 00:50:34,620 And so that can create problems, as well. 1166 00:50:34,620 --> 00:50:37,470 And then there are other types of conflicts, too-- delete conflicts, 1167 00:50:37,470 --> 00:50:40,320 whereby one database tries to delete a row at the same time 1168 00:50:40,320 --> 00:50:42,329 another database tries to update a row. 1169 00:50:42,329 --> 00:50:43,120 So which do you do? 1170 00:50:43,120 --> 00:50:43,920 Do you update the row? 1171 00:50:43,920 --> 00:50:44,940 Do you delete the row? 1172 00:50:44,940 --> 00:50:47,356 And so these are all conflicts that when you're setting up 1173 00:50:47,356 --> 00:50:49,410 a multi-primary replication system, you need 1174 00:50:49,410 --> 00:50:52,320 to figure out how you're going to ultimately resolve those conflicts. 1175 00:50:52,320 --> 00:50:54,780 You gain the ability to write to all the databases, 1176 00:50:54,780 --> 00:50:57,550 but new problems arise as you begin to do that. 1177 00:50:57,550 --> 00:50:58,157 Yeah? 1178 00:50:58,157 --> 00:51:01,973 AUDIENCE: So is the information in each database the same? 1179 00:51:01,973 --> 00:51:04,055 Are they [INAUDIBLE] with each other? 1180 00:51:04,055 --> 00:51:04,680 BRIAN YU: Yeah. 
1181 00:51:04,680 --> 00:51:06,690 In this model, the databases in general are 1182 00:51:06,690 --> 00:51:09,454 going to be the same, though they're not always perfectly going 1183 00:51:09,454 --> 00:51:12,120 to be in sync, which is yet another problem, whereby there might 1184 00:51:12,120 --> 00:51:14,640 be some time after I write to this database 1185 00:51:14,640 --> 00:51:18,491 before that data propagates through all of the databases, for instance. 1186 00:51:18,491 --> 00:51:20,989 AUDIENCE: So why not keep it in one? 1187 00:51:20,989 --> 00:51:23,530 BRIAN YU: You could keep all the information in one database. 1188 00:51:23,530 --> 00:51:27,070 But a single database server can only handle so many connections. 1189 00:51:27,070 --> 00:51:30,310 And so you might imagine that having three different servers, three 1190 00:51:30,310 --> 00:51:33,070 different computers that are all able to handle incoming requests, 1191 00:51:33,070 --> 00:51:35,106 just increases the capacity of your application 1192 00:51:35,106 --> 00:51:36,730 to be able to handle that kind of load. 1193 00:51:36,730 --> 00:51:41,690 1194 00:51:41,690 --> 00:51:42,530 All right. 1195 00:51:42,530 --> 00:51:46,820 Questions about databases, database replication, any of the scale problems 1196 00:51:46,820 --> 00:51:49,681 that come about there? 1197 00:51:49,681 --> 00:51:50,180 All right. 1198 00:51:50,180 --> 00:51:53,013 Final thing I'll mention on the topic of scaling that can be helpful 1199 00:51:53,013 --> 00:51:54,302 is just the idea of caching. 1200 00:51:54,302 --> 00:51:56,510 Caching is something we've talked about a lot before. 
1201 00:51:56,510 --> 00:52:00,380 But a general idea could be that in order to try and solve this problem 1202 00:52:00,380 --> 00:52:03,350 of constantly having to request information from the database, 1203 00:52:03,350 --> 00:52:06,650 if we could store data in some other place-- in particular, 1204 00:52:06,650 --> 00:52:07,632 inside of a cache-- 1205 00:52:07,632 --> 00:52:10,340 then we don't need to access the database as often, because we've 1206 00:52:10,340 --> 00:52:12,330 got the information already stored. 1207 00:52:12,330 --> 00:52:14,960 And so one way to do this is via client-side caching. 1208 00:52:14,960 --> 00:52:20,070 And so inside of the HTTP headers, when an HTTP response 1209 00:52:20,070 --> 00:52:22,670 is sending back information to a user, you 1210 00:52:22,670 --> 00:52:26,660 can add an HTTP header called Cache-Control that basically 1211 00:52:26,660 --> 00:52:32,450 says for up to this number of seconds, you can just store information 1212 00:52:32,450 --> 00:52:35,870 about this page and not request it again if you try 1213 00:52:35,870 --> 00:52:37,674 and request the page for a second time. 1214 00:52:37,674 --> 00:52:40,340 And this helps to make sure that if the browser tries to request 1215 00:52:40,340 --> 00:52:41,870 the page again, it doesn't need to. 1216 00:52:41,870 --> 00:52:45,350 It can just use the version that's stored inside of the cache. 1217 00:52:45,350 --> 00:52:50,240 And a more recent development is this idea of an ETag, or an entity tag.
1218 00:52:50,240 --> 00:52:53,990 And the idea here is that if we have some web resource, some document, 1219 00:52:53,990 --> 00:52:57,200 some piece of data from a database that our web application is sending out 1220 00:52:57,200 --> 00:53:01,790 to users, when I send users that resource, that document, 1221 00:53:01,790 --> 00:53:06,230 I'll send that document, and I'll also send an entity tag that 1222 00:53:06,230 --> 00:53:09,260 corresponds to that particular version of the document 1223 00:53:09,260 --> 00:53:10,730 and send them both to the user. 1224 00:53:10,730 --> 00:53:12,240 And imagine this is a big document. 1225 00:53:12,240 --> 00:53:16,880 It's a lot of data, so it's expensive to query and to send to the user. 1226 00:53:16,880 --> 00:53:21,230 The next time the user tries to request this page, what the user can do 1227 00:53:21,230 --> 00:53:25,970 is the user can send the entity tag, the ETag, along with their request. 1228 00:53:25,970 --> 00:53:29,750 I would like to request this resource, and, oh, by the way, 1229 00:53:29,750 --> 00:53:32,840 I already have this version of the entity stored 1230 00:53:32,840 --> 00:53:35,570 locally inside of my computer's cache. 1231 00:53:35,570 --> 00:53:38,130 And if the web application then looks at that ETag and says, 1232 00:53:38,130 --> 00:53:39,171 all right, you know what? 1233 00:53:39,171 --> 00:53:41,300 That's the latest version of the document. 1234 00:53:41,300 --> 00:53:44,120 The web application can just respond-- 1235 00:53:44,120 --> 00:53:47,990 in particular, with an HTTP status code of 304, meaning not modified, 1236 00:53:47,990 --> 00:53:49,190 to just say, you know what? 1237 00:53:49,190 --> 00:53:52,310 This entity tag is the most recent entity tag. 1238 00:53:52,310 --> 00:53:54,650 Don't bother trying to request the document again. 1239 00:53:54,650 --> 00:53:57,800 Just use the version you saved locally in your cache. 
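The exchange just described can be sketched in a few lines of Python. This is a rough illustration rather than how any particular framework does it: the names `make_etag`, `respond`, and `MAX_AGE_SECONDS` are hypothetical, the tag is assumed to be a hash of the document's bytes, and the request's `If-None-Match` header (the standard header a browser uses to send back a stored ETag) is assumed to carry a single tag.

```python
import hashlib

MAX_AGE_SECONDS = 60  # hypothetical freshness window for the Cache-Control header


def make_etag(body: bytes) -> str:
    """Derive a tag from the content, so any change to the body changes the tag."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: dict, body: bytes):
    """Return (status, headers, body) for a request for one resource."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": f"max-age={MAX_AGE_SECONDS}"}
    # The client already holds this exact version: reply 304 with an empty body.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    # First visit, or the document changed: send the whole document again.
    return 200, headers, body


document = b"a large, expensive document"
first_status, first_headers, _ = respond({}, document)
repeat_status, _, repeat_body = respond({"If-None-Match": first_headers["ETag"]}, document)
```

Here the first request gets the full document along with its tag, and the repeat request, which sends the tag back, gets only the short not-modified answer. A real server would also need to handle requests that list several tags, which this sketch leaves out.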
1240 00:53:57,800 --> 00:54:00,200 And if, on the off chance, the document's been updated 1241 00:54:00,200 --> 00:54:03,230 and therefore has a new ETag value, then the web application 1242 00:54:03,230 --> 00:54:07,000 goes through the process of sending that entire document back to the user. 1243 00:54:07,000 --> 00:54:09,335 But by taking advantage of technologies like this, 1244 00:54:09,335 --> 00:54:11,450 this can allow us to make sure that we're not 1245 00:54:11,450 --> 00:54:13,940 making too many requests to the database, 1246 00:54:13,940 --> 00:54:19,950 that we don't make redundant requests if a particular resource hasn't changed. 1247 00:54:19,950 --> 00:54:21,940 So caching can be done on the client side. 1248 00:54:21,940 --> 00:54:24,190 Caching can also be done on the server side, which 1249 00:54:24,190 --> 00:54:26,920 changes our diagram slightly so as to look a little bit more 1250 00:54:26,920 --> 00:54:30,640 like this, whereby now, we've got some more complications here. 1251 00:54:30,640 --> 00:54:32,830 We've got some load balancer that's communicating 1252 00:54:32,830 --> 00:54:34,247 with a bunch of different servers. 1253 00:54:34,247 --> 00:54:36,580 All of those servers have to interact with the database, 1254 00:54:36,580 --> 00:54:39,850 and maybe you've got multiple databases going on here that are each able to do 1255 00:54:39,850 --> 00:54:42,340 reads and writes, either in a single-primary model 1256 00:54:42,340 --> 00:54:43,840 or a multi-primary model. 1257 00:54:43,840 --> 00:54:47,530 And those servers also have access to some cache that makes it easier 1258 00:54:47,530 --> 00:54:51,280 to access data quickly, in a sense, saying, 1259 00:54:51,280 --> 00:54:53,470 if there's some expensive database query, 1260 00:54:53,470 --> 00:54:56,530 don't bother performing the database query again and again and again. 1261 00:54:56,530 --> 00:54:59,050 Take the results of that database query once. 
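The server-side pattern described here, running an expensive query once and reading the saved result afterward, might look roughly like the sketch below. Everything in it is an assumption for illustration: `cached_query`, `CACHE_TTL_SECONDS`, and the in-memory dictionary are hypothetical, and the expiry time is an addition the lecture does not mention, included so that stale results are eventually refreshed.

```python
import time

CACHE_TTL_SECONDS = 30.0  # hypothetical lifetime of a cached result
_cache = {}  # maps a key to (expiry_time, value)


def cached_query(key, run_query, now=time.monotonic):
    """Return the cached result for key, calling run_query only on a miss or after expiry."""
    entry = _cache.get(key)
    if entry is not None and entry[0] > now():
        return entry[1]  # cache hit: skip the database entirely
    value = run_query()  # cache miss: perform the expensive query once
    _cache[key] = (now() + CACHE_TTL_SECONDS, value)
    return value
```

A dictionary like this lives inside a single server process. In the diagram with several servers behind a load balancer, the cache would more likely be a separate service that all of the servers can reach, so that they share one set of saved results.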
1262 00:54:59,050 --> 00:55:00,910 Save it inside of the cache. 1263 00:55:00,910 --> 00:55:03,220 And from then on, the server can just look to the cache 1264 00:55:03,220 --> 00:55:06,830 and get information out of there. 1265 00:55:06,830 --> 00:55:09,680 So lot of security and scalability concerns 1266 00:55:09,680 --> 00:55:12,740 that can potentially come about as you begin web application development. 1267 00:55:12,740 --> 00:55:14,740 And so goal of today was really just to give you 1268 00:55:14,740 --> 00:55:17,212 a sense for the types of concerns to be aware of, 1269 00:55:17,212 --> 00:55:18,920 the types of things to be thinking about, 1270 00:55:18,920 --> 00:55:20,480 and the types of issues that will come about 1271 00:55:20,480 --> 00:55:23,810 if you decide to take a web application and begin to have more and more people 1272 00:55:23,810 --> 00:55:26,090 actually start to use it. 1273 00:55:26,090 --> 00:55:28,730 So questions about that or about any of the other topics 1274 00:55:28,730 --> 00:55:32,190 we've covered this week? 1275 00:55:32,190 --> 00:55:32,690 All right. 1276 00:55:32,690 --> 00:55:36,057 So with the remainder of this morning, between now and about 12:30 or so, 1277 00:55:36,057 --> 00:55:38,390 we'll leave it open to more project time, an opportunity 1278 00:55:38,390 --> 00:55:40,070 to work on any of the projects you've worked on 1279 00:55:40,070 --> 00:55:42,800 so far over the course of this week and also an opportunity to work 1280 00:55:42,800 --> 00:55:44,383 on something new if you would like to. 1281 00:55:44,383 --> 00:55:47,600 I know many of you yesterday decided to start on new projects, projects 1282 00:55:47,600 --> 00:55:49,730 of your own choosing built in React or Flask 1283 00:55:49,730 --> 00:55:51,980 or using JavaScript or any of the other technologies 1284 00:55:51,980 --> 00:55:53,450 we've talked about this week. 
1285 00:55:53,450 --> 00:55:56,750 Before we conclude, though, I do have to say a couple of thank yous, 1286 00:55:56,750 --> 00:55:59,850 first to David for helping to advise the class, to the teaching fellows-- 1287 00:55:59,850 --> 00:56:01,740 Josh and Christian and Athena and Julia-- 1288 00:56:01,740 --> 00:56:03,470 for being excellent in helping to answer questions 1289 00:56:03,470 --> 00:56:06,380 and helping to make sure that the course can run smoothly, to Andrew up 1290 00:56:06,380 --> 00:56:09,020 in the back, who's been taking care of the production side of everything 1291 00:56:09,020 --> 00:56:11,780 over the course of this week, making sure that all the lectures are recorded 1292 00:56:11,780 --> 00:56:14,240 and making sure they're posted online, such that afterwards, you, 1293 00:56:14,240 --> 00:56:15,890 when you're here or when you're not here, 1294 00:56:15,890 --> 00:56:17,370 are able to come online to see them. 1295 00:56:17,370 --> 00:56:19,860 So thank you to everyone for helping to make the course possible. 1296 00:56:19,860 --> 00:56:21,500 Thank you to all of you for coming to the course. 1297 00:56:21,500 --> 00:56:22,333 Hope you enjoyed it. 1298 00:56:22,333 --> 00:56:23,810 Hope you got things out of it. 1299 00:56:23,810 --> 00:56:25,340 We've really only scratched the surface, though, 1300 00:56:25,340 --> 00:56:27,020 of a lot of the topics that we've covered 1301 00:56:27,020 --> 00:56:28,394 over the course of the past week. 1302 00:56:28,394 --> 00:56:32,720 There's a lot more to CSS and HTML and JavaScript and Flask and Python 1303 00:56:32,720 --> 00:56:35,870 and React than we were really able to touch on over the course of the week. 
1304 00:56:35,870 --> 00:56:37,870 It was really meant to be more of an opportunity 1305 00:56:37,870 --> 00:56:40,820 to give you some exposure to some of the fundamentals of these ideas, 1306 00:56:40,820 --> 00:56:43,236 some of the tools and the concepts that you can ultimately 1307 00:56:43,236 --> 00:56:45,930 use them as you begin to design web applications of your own. 1308 00:56:45,930 --> 00:56:47,660 So I do hope that you've learned something from the week but, 1309 00:56:47,660 --> 00:56:50,576 in particular, that you found things that are interesting to you, such 1310 00:56:50,576 --> 00:56:52,880 that you continue to take those ideas and explore them. 1311 00:56:52,880 --> 00:56:55,940 Go beyond just what we've been able to cover over the course of this week 1312 00:56:55,940 --> 00:57:00,290 and explore what else these technologies and these tools and these ideas 1313 00:57:00,290 --> 00:57:01,465 ultimately have to offer. 1314 00:57:01,465 --> 00:57:02,340 So thank you so much. 1315 00:57:02,340 --> 00:57:04,655 We'll stick around until 12:30 to help with project time. 1316 00:57:04,655 --> 00:57:05,155 [APPLAUSE] 1317 00:57:05,155 --> 00:57:08,020 But this was CS50 Beyond. 1318 00:57:08,020 --> 00:57:10,050