[MUSIC PLAYING] BRIAN YU: OK, let's get started. Welcome, everyone, to the final day of CS50 Beyond. And the goal for today is going to be to take a look at things at a bit of a higher level. There is going to be less code in today's lecture. The focus of today is on two main topics-- security and scalability-- which are both important as you begin to think about what comes next: you've written all this code for your web application, and you're ready to deploy it so that people can actually use it. What are the sorts of considerations you need to bear in mind? What are the security considerations in making sure that wherever you're hosting the application, you and the application itself are secure and that your users are secure from potential vulnerabilities or potential threats? And also, from a scalability perspective, we've been designing applications that so far probably only you or a couple other people have been using. But what sorts of things do you need to think about as your applications begin to scale, as more and more people begin to use it, and you have to begin to think about this idea of multiple people trying to use the same application at the same time? So a number of different considerations come about there. We'll show a couple of code examples. But the main idea of this is going to be high level, just thinking abstractly, sort of trying to design the product, trying to design the project, trying to figure out how exactly we need to be adjusting our application to make sure that it's secure and to make sure that it's scalable. So we'll go ahead and start with security. And on the topic of security, we're going to look at a number of different security considerations as we move all throughout the week, from the beginning of the week until the end of the week, thinking about the types of security implications that come about. And so one of the first things we introduced in the class was Git, the version control tool that we were using to keep track of different versions of our code, in order to manage different branches of our code, so on and so forth. And so there are a couple of important security considerations to be aware of with regard to Git. You all probably created GitHub repositories over the course of this week, maybe for the first time. And GitHub repositories by default are public. And this is in the spirit of the idea of open source software, the idea that anyone can see the code. Anyone can contribute to the code. And that, of course, comes with its trade-offs. On one hand, everyone being able to see the code certainly means that anyone can help you to find bugs and identify bugs. But it also means that anyone on the internet can see the code, look for potential vulnerabilities, and then potentially take advantage of those vulnerabilities. So definitely, trade-offs, costs, and benefits that come along with open source software. And another thing just to be aware of, we mentioned this earlier in the week, but your Git commit history is going to store the entire history of any of the commits that you have made, as the name might imply. And so if you make a commit and you do something you shouldn't have done, for instance-- you make a commit that accidentally includes database credentials inside of the commit somewhere or includes a password inside of the commit somewhere-- you can later on make another commit that removes those credentials. But the credentials are still there inside of the history.
If you go back, you could still find the credentials if you had access to the entire Git repository and could go back and find that point in Git's history. So what are the potential solutions for if you do something like this, accidentally expose credentials at some point in the repository and then remove them? What could you do? Yeah? AUDIENCE: Change the credentials. BRIAN YU: Certainly. Changing the credentials, something you should almost definitely do. Change the password. It's not enough just to remove them and make another commit. And there's also something you can do known as a Git purge, where you can effectively purge the commit history, sort of overwrite history, so to speak, in order to replace that, as well. But even that, if it's been online on GitHub, who knows who may have been able to access the credentials? So definitely always a good idea to change those, as well. On the first day, we also took a look at HTML. We were designing basic HTML pages. And there are a number of security vulnerabilities you could create just with HTML alone. Perhaps one of the most basic is just the idea that the contents of a link can differ from where the link takes you to. That's probably a pretty obvious point: you often have text that links you to a particular page. But this can often be misleading and is commonly used in phishing email attacks, for instance, whereby you have a link that takes you to one URL, but it shows you a different URL, which can be misleading, for sure. Or I can have situations where I could-- let's go into link.html-- I have a link that presumably takes me to google.com. But if I click on google.com, it could take me anywhere else-- to some other site, for instance. And the way that it does that is quite simply by just having a link that takes you to a URL, but the displayed contents of that link are something different or something else entirely. And so that alone is something to be aware of. But that problem is compounded when you consider the idea that even though your server-side code-- application code you write in Python and Flask, for instance-- you can keep secret from your users, HTML code is not kept secret from users. Any users can see HTML and do whatever they want with it. And so on the first day, you may have been trying to take a look at an HTML page and try and replicate it using your own HTML and CSS, for example. The simplest way to do something like that would just be to copy the source code. So I could go to bankofamerica.com, for instance, Control-Click on the page, view the page source, and all right. Here's all the HTML on Bank of America's home page. I could copy that, create a new file, and call it bank.html. Paste the contents of it in here. Go ahead and save that. And now, open up bank.html. And now, I've got a page that basically looks like Bank of America's website. And now, I could go in. I could modify the links, change where Sign In takes you to, make it take you to somewhere else entirely. And so these are potential threats, vulnerabilities, to be aware of on the internet that are quite easy to actually do. So this is less about when you're designing your own web applications but, when you're using web applications, the types of security concerns to definitely be aware of. So let's keep moving forward in the week-- yeah, question? AUDIENCE: Can you copy JavaScript source code in the same way? BRIAN YU: Yes. Any JavaScript code that is on the client, you can access and you can modify. You can change variables and so on and so forth.
And this is actually a pretty easy thing to do. So if I go to like, I don't know, The New York Times website, for instance, and I look at the source code there-- let me go ahead and inspect the element, and I'll try and hover over a main headline. OK. This is the name of a CSS class. You could access any JavaScript. You can also run any JavaScript in the console arbitrarily. So I could say, all right, document.querySelectorAll-- let's get everything with that CSS class. Or maybe it's just the first one, because it's two CSS classes. All right. Great. I'll take the first one, set its innerHTML to be, like, welcome to CS50 Beyond. And you can play around with websites in order to mess around, change them. So all of the JavaScript, the CSS classes, all of that, is accessible to anyone who is using the page, for example. Other questions before I go on? Yeah. AUDIENCE: Any thoughts on JavaScript obfuscation? BRIAN YU: JavaScript obfuscation-- certainly something you can do. So since JavaScript is available to anyone who has access to the web page, there are programs called JavaScript obfuscators that basically take plain, ordinary-looking JavaScript and convert it into something that's still JavaScript but that's very difficult for any human to decipher. It changes variable names and does a bunch of tricks in JavaScript so that it still executes the exact same way but looks quite obscure. Definitely something you can do. Still not totally foolproof, because there are ways of trying to deobfuscate JavaScript code, at least to some extent. So it's not perfect, but definitely something that you can do. Other things? All right. Let's take a look at-- OK, when we were writing Flask applications, we were writing web servers. And so one thing that's just good to know from a security perspective is the difference between HTTP, the Hypertext Transfer Protocol, and the secure version of it, HTTPS. And that has to do with the idea that on the internet, we have computers and servers that are trying to communicate with each other, that are trying to send information back and forth. And when these computers are trying to send information back and forth, we would like for that to happen securely, because when one computer is sending information to another computer, that information is going through a number of different routers. And at each of those routers, information could hypothetically be intercepted. Someone could try and intercept a packet on its way from computer number one to computer number two. So how do we securely try and transfer information from one location to the other? And this has to do with the entire field of cryptography, which is a huge field that we're only going to be able to barely scratch the surface of. But the basic idea here is that we would like some way to encrypt our information, that if I have some plain text that I would like to send from my computer to someone else's computer, I would like to encrypt that plain text, send it across in some encrypted way, such that the person on the other end could decrypt it. And so this is perhaps a more sophisticated version of what you might have done in CS50's problem set two when you were using the Caesar or the Vigenere cipher in order to encrypt something. The ciphers that are used in computing on the internet, for instance, are just much more secure. But they follow a similar principle.
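(For reference, here is a minimal sketch of that Caesar-style idea in Python: a single shared secret key, a shift amount, is used both to encrypt and to decrypt. This is a toy for illustration only, not a secure cipher, and it is not code from the lecture.)

```python
# Toy Caesar cipher: one shared secret key (a shift amount) encrypts and decrypts.
def caesar(text, key):
    """Shift each letter in `text` by `key` positions, wrapping around the alphabet."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            result.append(ch)           # leave spaces and punctuation alone
    return "".join(result)

ciphertext = caesar("attack at dawn", 13)   # encrypt with the shared key
plaintext = caesar(ciphertext, -13)         # decrypt by shifting back with the same key
print(ciphertext, "->", plaintext)
```

The important property is that the same shared secret is needed on both ends, which is exactly where the trouble starts on the internet, as described next.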
And so one form of cryptography is called secret-key cryptography, where the idea is that if I am a computer up here and I have some plain text that I want to encrypt, I also have some key that only I know. And I can take the plain text, and I can take that key and run an algorithm on it. And that generates some ciphertext, some encrypted version of the plain text that was encrypted using the key. I can then send that ciphertext along to the other person. And so long as the other person has both the ciphertext and the key that was used to encrypt it, they can run the same process in reverse and decrypt it, generating the plain text from it. That way, the ciphertext is transferred, not the plain text, from one side to the other side of this communication. And so long as both parties in this instance have access to the same key, they can encrypt and decrypt messages at will. Why doesn't this quite work on the internet, though? What is the problem with this model? Yeah? AUDIENCE: If you're sending the key as well as the ciphertext, then it's just as revealing as sending the plain text that you have. BRIAN YU: Exactly. When we transfer the ciphertext across, the other person also needs access to the key. We need to transfer the key across the internet, as well, to give it to the other person. And so anyone who is intercepting the ciphertext could also have intercepted the key and therefore could have decrypted the information and gotten the plain text as a result of it. So this secret-key cryptography, ultimately, it doesn't work in the context of the internet if it needs to be the case that the key is just transferred across the internet. Now, you could try encrypting the key, for example. But then whatever key you used to encrypt the key, that also needs to be sent across the internet, and you end up with this problem where you can never figure out a way to make sure that information can be transferred securely. So the solution to this lies in a different idea called public-key cryptography, where the idea here is that instead of having one key, we'll have two keys-- one called a public key, one called a private key. And the idea here is that a public key is something you can share with anyone. Doesn't matter who has it. And a private key is a key that you keep to yourself that you don't give to anyone, even the person that you're trying to communicate with. And because we have two keys, each key is going to serve a different purpose. They're going to be mathematically related. And take a theory of computing class if you want to understand the exact mathematics behind this. But the basic idea is that the public key can be used to encrypt messages, and the private key can be used to decrypt messages that were encrypted using the public key. And so what does this model look like? Well, I have some public and private key. And if I want some other person to send me information, I will give them my public key. Just give the other person the public key so that they have access to it. Remember, the public key is used to encrypt data. So they can use the public key and encrypt the plain text, generate some ciphertext. And then all the other person needs to do is send me that ciphertext. The ciphertext comes across to me. And I now have the private key, the key that I can use to decrypt the information. And using the private key and the ciphertext, I can then decrypt the message and generate the plain text.
So this is the basic idea of public-key cryptography, this idea that we use a public key to encrypt information and a private key to decrypt information. And by separating this out into two different keys, we can share the public key freely without needing to worry about the potential for internet traffic to be intercepted and decrypted, for example. And so this is the basis on which internet security works. Yeah? AUDIENCE: What if someone else intercepts the ciphertext and they also have a private key? Would they be able to decrypt it? BRIAN YU: If someone else intercepts the ciphertext and they have a private key, they won't be able to decrypt it, because the private key and the public key are mathematically related in such a way that if you encrypt something with a public key, you can only decrypt it with the corresponding private key. And so generally speaking, you'll generate both the public and the private key at the same time, such that only messages encrypted with one can be decrypted with the other. So you can't just have some other random private key and decrypt the message. That key can only decrypt messages encrypted with its corresponding public key. AUDIENCE: So how did this person get that specific [INAUDIBLE]? BRIAN YU: So this person down here generated both the public and the private key at the same time. There's just an algorithm that you can use to randomly generate a public and private key. You share the public key with anyone you want to be able to send you messages. That person you share it with can use the public key to encrypt the message. And then you, the person who generated these keys, can take the encrypted message, use the private key that you generated, and get the plain text out of that. Yeah? AUDIENCE: How difficult is it to get the private key from the public key? Is it impossible? BRIAN YU: How difficult is it to get the private key from the public key? Long story short, we don't really know. We think it is very difficult to do. We think that it would take a very long time. If you took a computer and tried to get it to go from the public key to the private key, we think it would probably take billions or trillions of years, or more, even with a computer operating at top speed trying to do this calculation. But no one has been able to technically prove that it is difficult. And so this is a big open question in computing right now. You can take a theory of computation class for more information on this sort of thing. But there are some open unsolved problems in computing, and this happens to be one of them. Yeah? AUDIENCE: Is it based on primes-- very large primes that you multiply together? BRIAN YU: Yes, this is basically the idea of very large prime numbers that you multiply together. The long story short of it is that it's based on the idea that there are some mathematical operations that are easy and some mathematical operations that are believed to be difficult. And if you take two very big prime numbers, a computer can multiply those numbers very easily and calculate what the product of those two numbers is. It's just a simple multiplication algorithm. But if you have that result, that big product of the two primes, it's very difficult to factor that number and figure out which two prime numbers were multiplied together in order to generate that number. And nobody has been able to come up with an efficient algorithm for factoring it. And so as a result, because we believe factoring numbers to be a very difficult problem, we use it as the basis for computing security on the internet.
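(To make that concrete, here is a toy numeric sketch of the public-key idea with tiny primes, in Python. Real keys use primes that are hundreds of digits long; this is illustration only and assumes Python 3.8+ for the modular-inverse form of pow.)

```python
# Toy RSA-style sketch: multiplying p and q is easy; recovering p and q from n
# (factoring) is what's believed to be hard, and that's what protects the private key.
p, q = 61, 53
n = p * q                      # 3233 -- shared by both keys
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent; the public key is (e, n)
d = pow(e, -1, phi)            # private exponent, a modular inverse (Python 3.8+)

message = 65
ciphertext = pow(message, e, n)    # anyone with the public key can encrypt
recovered = pow(ciphertext, d, n)  # only the private-key holder can decrypt
assert recovered == message
```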
Brief teaser of theory of computation. Take any of the 120 series here at Harvard, at least, for more information about that. Other things? Some other security considerations when designing web applications to be aware of-- we mentioned this before, but when it comes to storing credentials, you should generally always store credentials in environment variables inside of your application rather than have some password inside of your Python code, whether it's the secret key of your application, whether it's the credentials to your database, whether it's some other credential like an API key, for example, for some service that your server is accessing. Usually best not to put that in the code in case someone else gets access to the code. Generally best to put it in an environment variable, a variable that's just stored in the command line environment where your server's being run from. And then add code that just pulls the credentials from the environment. In Python, at least, you can use os.environ.get to get some information from the application's environment. And this is generally going to be a more secure way of doing the same thing. Yeah? AUDIENCE: How do we do that in Heroku if we want to upload our code to the website? BRIAN YU: Yeah. So if you're uploading this to Heroku, if you go to your Heroku application and go to the Settings panel, there is a section, I think it's called config vars, that basically just lets you add environment variables to the Heroku application. And that will automatically set those environment variables such that when you run the application, it can draw from those environment variables. Yeah? AUDIENCE: Is it [INAUDIBLE] yesterday, or is that something you can't have access to? Because if you just did [INAUDIBLE] and then the key, it goes away when you close the terminal, correct? BRIAN YU: Yes. So that's true. So you can certainly, on your own computer, set aliases or environment variables inside of your profile that automatically set credentials in a particular way. The idea is that you never want to be taking those credentials and committing them to a repository that other people might be able to see, for instance. That's where things start to get less secure. OK. Moving on in the week to talk about some other security considerations. We'll talk about SQL, the idea of databases. And when we introduce databases, there are a lot of security considerations that come about. But we'll just touch on a couple of them. The first is how you store passwords. So you can imagine that inside of a database, you might be storing users and passwords together. And maybe we have a whole users table that has an ID column, a column for people's usernames, and a column for people's passwords. And you could imagine just storing passwords inside of the row. But why is this not particularly secure? Yeah? AUDIENCE: If anyone gets access to the data table, they can see what all the passwords are. BRIAN YU: Exactly. If anyone gets access to the database, they immediately have access to all of the passwords. And this is probably not a secure way to go about things, because you probably hear in the news from time to time that databases aren't perfectly secure, that every once in a while, there's some big security vulnerability where someone's able to get access to passwords inside of a database. And that becomes a major security concern. And so one way to try and mitigate this problem is, instead of storing passwords inside of the database, store a hashed version of the password.
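(A quick aside on the environment-variable point from a moment ago: a minimal sketch of what that can look like in a Flask app. The variable names SECRET_KEY and DATABASE_URL are just examples, not anything specified in lecture.)

```python
import os
from flask import Flask

app = Flask(__name__)

# Pull credentials from the environment instead of hard-coding them in the source.
# Locally you might run `export SECRET_KEY=...`; on Heroku you'd add these under
# the "Config Vars" section of the Settings panel.
app.secret_key = os.environ.get("SECRET_KEY")
database_url = os.environ.get("DATABASE_URL")

if app.secret_key is None or database_url is None:
    raise RuntimeError("SECRET_KEY and DATABASE_URL must be set in the environment")
```

With that aside out of the way, back to passwords and why hashing them helps.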
A hash function, as you might recall from CS50, just takes some input and returns some deterministic output. And a hash function can generally take any input password and turn it into what looks like a whole bunch of random letters and numbers. And the idea here is that it's deterministic. The same password will always result in the same hash value, so when someone tries to log in and types in their password, rather than just literally comparing their password and asking, does the password match up with the password in this column, you can say, all right, let's hash the password first. And if the hashes match up, then with very high probability, the user actually signed in to the website with the correct password. And you can then log the user in. And now, if someone was able to get access to the database, they wouldn't get access to all the passwords. They would only get access to the password hashes. Now, it's still a security vulnerability, because someone could, in theory, be able to figure out information about the password from the password hashes. But better, certainly, than literally storing the raw text of the password in the database. Yeah? AUDIENCE: Do we know how the hash functions generate that code? BRIAN YU: Yeah. The hash functions tend to be deterministic, and you can look up what the hash functions themselves are. So there are a couple of quite popular hash functions that are out there that do this sort of thing. But the idea of the hash function is similar to the idea of public and private keys, that it's very easy to hash something, and it's very difficult to go in the other direction. I can easily hash a password and generate something that looks like this. But it's a difficult operation to take something that looks like this and go backwards and figure out what it was that the original password was. And so that's one of the properties of a good hash function. Yes? AUDIENCE: Did you actually hash these, or did you just hit the keyboard? BRIAN YU: I think these are probably-- there might be hidden messages here if you look carefully. But separate issue. Other things? OK. So how is it that data potentially gets leaked as a result of using a database? Well, there are a number of ways that applications can inadvertently leak information. Take a simple example. Oftentimes, you'll see websites that have a Forgot Your Password screen where you type in an email address, and you click Reset Password. And that sends you an email that allows you to reset your password, for example. And you can imagine that you type in an email address, and you get, OK, a password reset email has been sent. But maybe some applications work such that if you type in an email address that doesn't exist, then you get an error that says, OK, error. There is no user with that email address. What data has this application now exposed? What information can you get just by using this part of a web application, for instance? Yeah? AUDIENCE: You know that that email address is not in the system, so you know that person is not using that app. BRIAN YU: Yeah, exactly. Just using the Forgot Password part of this application, you can tell exactly who has an account for this application and who doesn't just by typing in email addresses and seeing what comes back. So there are potential vulnerabilities in terms of data that gets leaked there, as well. And there are all sorts of different ways that information can get leaked.
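(Returning to password hashing for a moment: a minimal sketch using Werkzeug, which ships with Flask. The choice of library is an assumption here, not something specified in lecture.)

```python
from werkzeug.security import generate_password_hash, check_password_hash

# At registration time: store only the hash in the users table, never the raw password.
stored_hash = generate_password_hash("correct horse battery staple")
print(stored_hash)   # something long like "pbkdf2:sha256:...", hard to reverse

# At login time: check the submitted password against the stored hash.
print(check_password_hash(stored_hash, "correct horse battery staple"))  # True
print(check_password_hash(stored_hash, "wrong guess"))                   # False
```

One note: this particular helper also mixes a random salt into each hash, so two users with the same password end up with different stored values, an extra layer on top of the plain deterministic hashing described above. Back to the other ways information can leak.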
Oftentimes-- there's a growing field here-- you can tell things just based on the amount of time it takes for an HTTP request to come back. You can get information about the data inside of a database based on that, whereby if you make a request that takes a long time, that can tell you something different than if a request comes back very quickly, because that might mean fewer database requests were required in order to make that particular operation work, or any number of different things. And so there are security vulnerabilities there, as well. Final one. I'll briefly mention SQL injection. We've already talked about that. But again, something to be aware of just to make sure that whenever you're making database queries, you're protecting yourself against SQL injection, that you're making sure to either use a library that takes care of this for you or escape any characters that you might be using that could ultimately result in vulnerabilities in SQL. Yeah? AUDIENCE: How about the websites or tools like LastPass that store your credentials for other sites? Don't they have to have some way of reversing their own hash on it in order to give you that credential when you go to another site? So when it autofills your username and password, it has to-- if they're storing a hashed version on their side but filling in the plain text version in the password field, how are they able to reverse that in a way that is secure? They would have to have a table of keys or something that then is just as vulnerable as leaving the password. BRIAN YU: Yeah. So for password manager-type applications, it's a good question. I think the way most of them do this is that you have a master password that unlocks the entire database of the passwords that are stored there. And the idea would be that they're encrypted using the master password as the key, such that only by getting the master password correct can you then decrypt the information and access the plain text version of the passwords that are inside. And so hashing and encryption and decryption are slightly different. In the case of encryption and decryption, you still want to be able to go from the ciphertext back to the plain text, whereas in the case of the password hashing, you don't really care about the ability to reverse engineer it to go backwards. All right. And finally, on the topic of security, we'll talk a little bit about JavaScript. JavaScript opens a whole host of different potential vulnerabilities from a security standpoint. But we'll talk about a couple. The first is this idea called cross-site scripting, or the idea of taking a script and being effectively able to inject it into some other site by putting some JavaScript that the web application didn't intend into the web application itself. And so here's a very simple web application written in Flask. And this is the entire web application. It's got a route, a default route, called / that just returns, "Hello, world!" And it's got an error handler that we didn't really see in the class. But basically, it handles whenever there's a 404 error, whenever you're trying to access a page that was not found. And it just returns, "Not found," followed by request.path, whatever it is that was the URL that you requested. And so I could run this application. I'll go ahead and start up Chrome, and I'll go ahead and go to the source code for XSS1. I'll run this application. Go here. It says, "Hello, world!"
And if I go to helloworld/foo, for example, some route that doesn't exist, I get not found, /foo, because that's not a route that's available on this page. I go to /bar. Not found, /bar. What could go wrong here? Where's the security vulnerability, again, thinking in the context of JavaScript? The page my application is returning is literally just "not found" followed by whatever was typed into the request path. And so what I could do is, you could imagine that instead of requesting /foo, I could instead make a request that looks something like /<script>alert('hi')</script>, for instance, injecting some JavaScript into the request path. So if I do that, I type /<script>alert('hi')</script> and press Return. And OK, Chrome is being smart about this. Chrome actually isn't allowing me to do this, because Chrome has some more advanced features that are basically saying, Chrome detected unusual code on this page and blocked it to protect your personal information-- an error saying it was blocked by the XSS auditor. That's cross-site scripting. So Chrome is automatically auditing for this. But not all browsers are like that. And I can, I think-- let's see if I can disable-- if I disable cross-site scripting protections, I think I can get this to-- yeah, OK. Disabling cross-site scripting protections, we can still type in the URL and actually get some JavaScript that the page didn't intend to still run on this particular web page. And so if someone were to send you a link that took you to this page, /<script>alert('hi')</script>, you could get JavaScript to run that you didn't intend. And maybe that's not a big deal. But it could be a bigger deal in a situation that looks like this, where we have JavaScript, and document.write is a function that just adds something to the page. And here, we're loading an image, img src, and the source is some hacker's website. And then we say, cookie= and then document.cookie. Document.cookie stores the cookie for this particular page. And so effectively, what's happening in this script is that your page, when you load it, is going to make a web request to the hacker's URL. And it's going to provide it as an argument whatever the value of your cookie is, for instance. And that cookie could be something that you use in order to log in as the credentials for some website, like a bank application or whatnot. And as a result, the hacker now has access to whatever the value of your cookie is, because they can look at their list of all the requests that have been made to the application much in the same way that you've been able to do in the terminal to see all the requests for your Flask application. And they can see that someone requested hacker_url?cookie= this cookie, and they can then use that cookie to be able to sign in to other sites, as well. So most modern browsers, like Chrome, are pretty good at defending against this sort of thing. But definitely something that is a potential vulnerability, especially for older browsers. Questions about this cross-site scripting? Yeah? AUDIENCE: Are you getting the user's cookie, or whose cookie are you getting there? BRIAN YU: Whoever opens the page. So the user's cookie, potentially on an entirely different site. The idea is that if your site is vulnerable to cross-site scripting in this form, then you open up a possibility where someone could generate a link to your website that includes some JavaScript injected like this whereby someone else could steal the cookies of your users on your website.
And they could get the cookies for themselves and use those cookies to sign into your website and pretend to be people that they're not, for example. There's a potential security threat there. So cross-site scripting is one example of a JavaScript vulnerability. Another vulnerability is called cross-site request forgery. Imagine that you have a bank website, for instance, and that bank gives you a way to transfer money. And if you go to that URL /transfer and then you provide arguments as to who you're transferring money to and how much money you're transferring, you can transfer money. Might be a web request that allows you to do that. Imagine some other website, some website where hackers are trying to steal money, where they have code that looks a little something like this. They have a link that says, "Click Here!" And when you click on the link, that takes you to yourbank.com/transfer, transferring to a particular person, transferring a particular amount. And some unsuspecting user on this website could click the link. And as a result, that takes them to their bank. And if they happen to be logged into their bank at the time, that could result in actually making that transfer. So cross-site request forgery is the idea that some other site can make a request on your site by, in this case, linking to it. This still isn't a huge threat, because the person actually still needs to click on the link in order to actually go to yourbank.com/transfer/whatever. But you can imagine that a clever hacker might be able to get around this by doing something like this-- rendering an image, for example, and saying the source of the image is going to be this. And when the browser sees an image tag in the HTML, it's just going to go to that URL and try and download that image. It's going to go to the URL, try and fetch that resource. And here, that resource is yourbank.com/transfer and then transferring that money. So the user doesn't even have to click on anything. And by making a GET request to yourbank.com/transfer, if yourbank.com isn't implemented particularly securely and just allows you to go to a URL like this to transfer money, then that could be the result. So how do you protect against this? How would you protect against your website being able to do something like this? Because your website probably wants some way of being able to transfer money if you have a bank application, but you don't want to allow people to make requests like that. Answers? Yeah? AUDIENCE: Yeah. It's facetious. BRIAN YU: Go for it. AUDIENCE: You get a better bank. BRIAN YU: Get a better bank. OK. Certainly something that would work. Other thoughts? Yeah? AUDIENCE: Change the form request type so it's not literally in your own [INAUDIBLE]. BRIAN YU: Yeah. Change the form request type so that it's not literally here. So this right here is a GET request. You might imagine that instead, it's a form that's submitted by a POST, like a POST request, a form that you actually have to submit, click on a Submit button, in order to submit that form. And so now, you could imagine that someone could still create a vulnerability by doing something like this. They have a form whose action is yourbank.com/transfer, submitting by the method POST. And now, they have these inputs of type hidden, which are just input fields that don't show up inside of a page. And they can have hidden input fields that specify who it's to, what the amount is, and then just some button that says, "Click Here!"
And if they click here, then unwittingly, the user could be submitting a form to the bank that's initiating some transfer. And in fact, if the hacker is being particularly clever, you don't even need the user to click anything, because we can use event listeners to get around this. I could say body onload-- in other words, when the body of the page is done loading, run this JavaScript. Document.forms returns an array of all the forms in the web document. Square bracket 0 says get the first form. And there's a function in JavaScript called .submit that submits a form. So you can say, all right, get all the forms, get the first form, and run submit. And that's going to result in submitting this form, making a POST request to yourbank.com/transfer, which results in some amount being transferred. So this is a potential vulnerability, as well. If you're writing this bank application, you don't want to allow code like this to be able to get through your security, because that opens up a whole host of potential security vulnerabilities. And in general, the way that people tend to deal with this is by adding what's called a CSRF token, a Cross-Site Request Forgery token, basically inserting into their own forms some special value that changes, and then, anytime someone submits the form, checking to make sure the value of that token is, in fact, a valid token. And that way, someone couldn't fake it, because some other form on some other hacker's website isn't going to have a valid CSRF token inside of their form page. And so larger scale web application frameworks, like Django, offer easy ways to add CSRF tokens to your forms, as well. But just something to be aware of as you begin to think about, when you're designing a web application, how could someone exploit it? How could someone make requests on behalf of users that they don't intend to in order to get some malicious result to come about? So lots of security things to be thinking about. Questions about security or any of the security topics that we've covered or talked about? Yeah? AUDIENCE: [INAUDIBLE] the token is generated [INAUDIBLE] event, or it's a unique token for every user? BRIAN YU: Yeah. Imagine, in the case of CS50 Finance, for instance, that when I click on the Buy page that takes me to the page where I can buy stocks, my route for Buy is going to basically generate a new token and insert it into the form that then gets displayed to me. And then when I submit that form, it gets submitted back to the same application. And the application can then check. Did the token that came back match the token that I inserted into the page? And if they do, in fact, match, then that's a way of sort of verifying that the user was actually submitting the actual form and not some fake form that they were tricked into submitting. All right. In that case, let's switch gears a little bit, and let's talk about scalability. Here again, there's going to be even less code. And the idea is just going to be, all right, what happens when we begin to scale our web application? We've got some web server, and we've got some users that are using that web server, which we're going to represent as that line. And so what happens when that server starts to have more users that are all trying to use the application at the same time? What do we do? Well, the first thing to probably do is figure out how many users our website can actually support. How many can it handle before it stops being able to support users? And so this is where benchmarking is quite important.
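(Before moving on to scalability, here is a rough Flask sketch of the CSRF-token pattern just described. The route, the token generation with the secrets module, and the session usage are illustrative assumptions, not the lecture's code.)

```python
import secrets
from flask import Flask, abort, render_template_string, request, session

app = Flask(__name__)
app.secret_key = "load-a-real-secret-from-the-environment"

@app.route("/transfer", methods=["GET", "POST"])
def transfer():
    if request.method == "GET":
        # Generate a fresh random token, remember it in the session,
        # and embed it in the form as a hidden field.
        session["csrf_token"] = secrets.token_hex(16)
        return render_template_string("""
            <form action="/transfer" method="post">
                <input type="hidden" name="csrf_token" value="{{ token }}">
                To: <input name="to"> Amount: <input name="amount">
                <input type="submit" value="Transfer">
            </form>""", token=session["csrf_token"])

    # On submission, reject the request unless the token matches the one we issued.
    if request.form.get("csrf_token") != session.get("csrf_token"):
        abort(403)
    return "Transfer OK (token verified)"
```

A form on a hacker's site can't include a matching token, because it never sees the value stored in this user's session. Now, back to the question of how many users a server can handle.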
Benchmarking is just this process by which we can test, and sort of load test, our application to see how many users we could potentially handle on our server. And so what happens if we find out via benchmarking that, OK, our server can only handle 100 users? What if we need to support 101 users or 102 users? What can we do? One thing we can do is called vertical scaling, where the idea here is, all right, we have a server. And that server only supports 100 users. All right, well, let's just get a bigger server, right? Let's get a server that supports 200 users or 300 users. And that's going to be able to better handle that load. But there's a limit to this, right? There's a limit to how much you can just increase the size of a server and increase its ability to handle load. And so what could you do to be able to handle more users? AUDIENCE: More servers. BRIAN YU: More servers. Great. And this is an idea called horizontal scaling, where the idea is that we have some server. And let's say, instead of having one server, let's go ahead and have two servers that are running the exact same web application. And now, we have two servers that are able to run the application and handle twice as many people. What problems come about now, logistically? A user tries to access our website, and now what? Yeah? AUDIENCE: That means you could have a race condition situation or how the servers communicate to each other [INAUDIBLE]. BRIAN YU: Yeah. How do the servers communicate with each other? Certainly, race conditions become a threat, as well. And then a fundamental problem is a user comes to the site, and which server do they go to, right? We need some way of deciding which server to direct a particular user to. And so generally, this is solved by adding yet another piece of hardware into the mix, adding some load balancer in between the user and the servers, whereby a user, when they request the page, rather than going straight to the server, they go to the load balancer first. And from there on, the load balancer can split people up, say certain people go to this server, certain people go to that server, and try and decide how it is that people are going to be divided into the different servers. And so how could a load balancer decide? If there are five servers and a user comes along, how should a load balancer decide which server to send a user to? There is no one right answer to this. There are a number of possible options, a number of different what are called load balancing methods. But how could you decide where to send a user? Yeah? AUDIENCE: The server with the least amount of users currently. BRIAN YU: Sure. The server with the fewest users currently, what's often called the fewest connections load balancing method. You try and figure out which server has the fewest people on it. And whichever one has the fewest people on it, send the user there. Definitely good for trying to make sure that each one has about an equal load, but potentially computationally expensive. You're doing a lot of calculation now, so there's a trade-off. Yeah? AUDIENCE: You could just do it randomly. BRIAN YU: You could do it randomly. You could just generate a random number between 1 and 5 and randomly assign someone to a particular server. Definitely something you could do. Other things? Certainly the random approach is quick. It doesn't involve having to do any calculation across all the different servers.
But if you're unlucky, you could end up putting a lot of people on server number two and not many people on server number eight or whatnot. And so what else could we do? Yeah? AUDIENCE: Just set up a counter [INAUDIBLE]. BRIAN YU: Sure. Some sort of counter. If you only have two, you just alternate odd, even, odd, even. Go to this server. Go to that one. If you've got eight, you just rotate amongst the eight-- 1, 2, 3, 4, 5, 6, 7, 8 and go back to 1. And so these are probably three of the most common load balancing methods-- random choice, whereby you just pick a random server and direct the user there; round robin, where we do exactly that, just basically go from server one up until the end and then go back to server number one; and then fewest connections, whereby you try and actually calculate which server currently has the fewest number of people on it and then try and direct the user to that one with the fewest connections. There are other methods in addition to these, but these are perhaps three of the most intuitive, where you can start to see their trade-offs. Depending upon the type of user experience you want, depending on how computationally expensive certain operations are, you might choose different load balancing methods. Yeah? AUDIENCE: [INAUDIBLE] benchmarking, and what are some common ways to do that? BRIAN YU: Yeah, there are software tools that can do this. There are a number of different ones-- the names are escaping me at the moment-- where you can basically run tests against a particular URL and get a sense for how well it's able to handle that load. And if you have particular use cases, I can chat with you about that, as well. So all right, let's imagine we have two servers now. And every time a user makes an HTTP request to a server, every time they request a page, we direct them to one server or the other server using one of these methods, either by choosing randomly or by round robin or by figuring out which one currently has the fewest users connected to it or is handling the fewest connections. What can go wrong? Whenever we're dealing with issues of scale, we just try and solve a problem and figure out what new problems have arisen. Yeah? AUDIENCE: You only have five servers, and now you need six. BRIAN YU: Yeah. Certainly, if you only have five servers and suddenly you need six, that could potentially become a problem, as well. But let's even assume that we have enough servers. We have five servers, and every time someone loads a page, they get sent to a different server based on one of these methods. What can still go wrong with the user experience? And in particular, I'll give you a hint. Let's think about sessions. What can go wrong? Remember, sessions were ways of storing information-- in our case, inside of the server-- about the user's current interaction with the server. It stored which user was logged in. It stored the current state of the tic-tac-toe game. It stored other information. Yeah? AUDIENCE: You have to pick one [INAUDIBLE]. BRIAN YU: Yeah, exactly. If I initially load a page and I go to server one and some information about me is stored in the session, like whether I'm logged in or the current state of my game or something else, and then I load another page and it takes me to server four this time, well, now, that server doesn't have access to the same session information that server one had if the information about the session was stored in the server. And now, that information is lost.
So I could load a page, and suddenly, now, I'm logged out of the page for no apparent reason even though I logged in just a moment ago. And then I could go to another page, and maybe by chance, I'm back to server one, and now I'm logged in again. So strange things can begin to happen. And so to solve that, what could we do? How can we make sure that sessions are preserved when the user is requesting pages? Again, no one correct answer. Multiple possibilities here. How do we solve this problem? Yeah? AUDIENCE: Would there be any way to store the session on the load balancer? BRIAN YU: Store the session on the load balancer. That's a good idea. And that actually gets me to the first idea here, which is this idea of sticky sessions. And this is slightly different. Rather than store all the session information in the load balancer, it just needs to store for this particular user which server has their session information. So if I went to server number one initially, the load balancer will remember me based on my IP address, cookie, or whatever and say, all right, next time I try and request a page, let me direct them back to server number one, for instance. That way, whenever I come back, I'm always going to go to the same place. There are other ways to solve this problem, as well. You could store session information in a database that all the servers have access to. You could store session information on the client side, whereby it doesn't matter what server you go to, because all the session information is inside the client. So there are a number of ways to solve this problem, but these generally fall under the heading of session-aware load balancing. Someone mentioned the problem of, OK, well, I have five servers, but what happens when I need six? To solve this in the world of cloud computing, where nowadays most people don't maintain their own hardware for their web applications-- they just rent hardware on someone else's servers, for instance, on AWS, using Amazon's servers-- you can take advantage of auto scaling, which automatically will grow or shrink the number of servers based upon load, whereby you could initially have two servers. But if more users come about and you need more, we can add a third server into the mix. More people come along, we need even more. We add a fourth server. And auto scaling goes in both directions. So if suddenly we find, all right, we had a lot of load at this particular peak time of the day but now there are fewer users on the site, the auto scaler can sort of say, all right, we don't need four servers anymore. Let's go back to three and then later on, if it needs doing, go back up to four again. And it can automatically, dynamically reconfigure the number of servers in order to figure out what the optimal number is given the number of users that are currently using the application. What happens, though, when one of the servers fails for some reason? The server just dies, for instance. The load balancer doesn't necessarily know about that. And so if it's still directing people across four different servers, it could direct users to that server that is no longer operational. Any thoughts on how we might solve that problem? Yeah? AUDIENCE: Have the load balancer ping the server at determined intervals to see if it's still there. BRIAN YU: Yeah, some sort of ping to make sure that the server is still there.
And often, one of the easiest ways that this is done is via what's called a heartbeat, whereby each of the servers gives off a heartbeat every fixed number of seconds or minutes, for instance. So if every 10 seconds the server sends out its heartbeat, that gets sent to the load balancer. If ever the load balancer doesn't hear the heartbeat from the server, it can know that that server is no longer operational, and it can say, all right, you know what? Let's stop sending users there and only send users to the other three servers. Questions about that or any of the ideas of how we scale our servers to be able to handle load? We decided, all right, if too many people are on one server, we need to split up into two different servers. But that introduced a bunch of problems that we had to solve-- problems about load balancing, problems about what to do about sessions, so on and so forth. Yeah? AUDIENCE: Do you hear a lot about distributed servers? I'm wondering how they [INAUDIBLE]. BRIAN YU: Sure. How do servers share data? Well, they use databases. And of course, as we start to figure out what to do with more and more servers, we also need to figure out what to do about databases, figure out how to scale databases and make sure that as we scale them, the databases are able to handle that load, as well. And so in the past, we've had, all right, a load balancer. We've got servers. And in our model right now, we have a database that both of these servers are connected to. But of course, the problem is soon going to arise of, all right, now we've got a lot of servers that are all trying to connect to the same database. And now, we've got yet another single point where things could potentially go wrong or where we could potentially be overloaded. So how do we solve this type of problem? One of the most common ways is database partitioning. One form of database partitioning you've, in fact, already seen, and it's just an extension of what we've been doing with SQL, whereby we have this flights table. And we could say, all right, rather than store the origin and the origin code, let's go ahead and separate what's in one table into a couple of different tables. Let's separate the flights table out so that there's a locations table that has a number for each possible location. And then the flights table now only needs to store a single number for the origin ID and the destination ID. We could also separate tables in different ways. If we have some general way we could partition a table into different parts that are generally going to be queried separately, then we can do another partition where I could say, all right, my flights table is getting big. Let's split it up. And all right, at my airline, the international departures and arrivals are handled separately from the domestic departures and arrivals. So no need for those to be in the same table. Let me just go ahead and take flights and separate it into a domestic flights table and an international flights table, for instance. That's one way to partition things into two different tables that could potentially be stored in different places, which ultimately allows for handling scale. But ultimately, all of these still lead to the fundamental problem that if I only have one database and 10 or dozens of servers that are all trying to communicate with that same database, we're going to run into problems. The database can only handle some fixed number of connections.
And so one solution to this is database replication. So all right, how does database replication work? Well, probably the simplest form of database replication is what's called single-primary replication, whereby I have one database that's called the primary database-- maybe three databases in total, but only one that I'm going to consider the primary one. And you can read data from any of the databases. You can get data out of any of the three databases, whereby if there are three servers and each one wants to read data, they can just share among the three databases reading data to make sure that we're not overloading any one database with too many connections. But you can only write data to a single database. And by only writing data to a single database, that means that anytime this database is updated, then this database, our primary database, just needs to update the other two databases. Say, all right, there's been a change made to the primary database. And it's the primary database's responsibility to then communicate to the other two databases what those changes are. And so that's single-primary replication. Yeah? AUDIENCE: How is that more efficient than just communicating with all three of them? Because I think you're sending information from the first database to the second and third. [INAUDIBLE] information sent that's just rewriting to all three of them. BRIAN YU: That's true, though. Databases could potentially batch information together into transactions and groups so as to be a little bit more efficient. So there are certainly ways around that problem. But yeah, a good point. Of course, this helps the read problem. It makes it easier to be able to read data out of databases. But it leaves open a potential vulnerability or a potential scalability problem with regard to writing data, because there is still only a single database to which I can actually write data, and that one database is responsible for updating all of the other databases. And so a more complex version of this is what's known as multi-primary replication, where the idea is that each database can be read from and written to. But now, updates get a lot more complicated. All of the databases need to have some notion and some way of being able to update each other. And there, conflicts begin to arise. You can have update conflicts, where two different databases have updated the same row. All right, how do you resolve that problem? You can have uniqueness conflicts, whereby if you add a row to each of two databases at the same time, maybe they get the same ID. Maybe this one only has 27 rows, so this database adds a new row with ID number 28, and this database does the same thing. And now, when they try to update each other, we have two rows with the same ID. And now, we need some way of resolving those, because the IDs are supposed to be unique. And so that can create problems, as well. And then there are other types of conflicts, too-- delete conflicts, whereby one database tries to delete a row at the same time another database tries to update a row. So which do you do? Do you update the row? Do you delete the row? And so these are all conflicts that, when you're setting up a multi-primary replication system, you need to figure out how you're going to ultimately resolve. You gain the ability to write to all the databases, but new problems arise as you begin to do that. Yeah? AUDIENCE: So is the information in each database the same? Are they [INAUDIBLE] with each other? BRIAN YU: Yeah.
In this model, the databases in general are going to be the same, though they're not always perfectly going to be in sync, which is yet another problem, whereby there might be some time after I write to this database before that data propagates through all of the databases, for instance. AUDIENCE: So why not keep it in one? BRIAN YU: You could keep all the information in one database. But a single database server can only handle so many connections. And so you might imagine that having three different servers, three different computers that are all able to handle incoming requests, just increases the capacity of your application to be able to handle that kind of load. All right. Questions about databases, database replication, any of the scale problems that come about there? All right. Final thing I'll mention on the topic of scaling that can be helpful is just the idea of caching. Caching is something we've talked about a lot before. But a general idea could be that in order to try and solve this problem of constantly having to request information from the database, if we could store data in some other place-- in particular, inside of a cache-- then we don't need to access the database as often, because we've got the information already stored. And so one way to do this is via client-side caching. And so inside of the HTTP headers, when an HTTP response is sending back information to a user, you can add an HTTP header called Cache-Control that basically says, for up to this number of seconds, you can just store information about this page and not request it again if you try and request the page for a second time. And this helps to make sure that if the browser tries to request the page again, it doesn't need to. It can just use the version that's stored inside of the cache. And a more recent development is this idea of an ETag, or an entity tag. And the idea here is that if we have some web resource, some document, some piece of data from a database that our web application is sending out to users, when I send users that resource, that document, I'll send that document, and I'll also send an entity tag that corresponds to that particular version of the document-- send them both to the user. And imagine this is a big document. It's a lot of data, so it's expensive to query and to send to the user. The next time the user tries to request this page, what the user can do is send the entity tag, the ETag, along with their request: I would like to request this resource, and, oh, by the way, I already have this version of the entity stored locally inside of my computer's cache. And if the web application then looks at that ETag and says, all right, you know what? That's the latest version of the document. The web application can just respond-- in particular, with an HTTP status code of 304, meaning Not Modified-- to just say, you know what? This entity tag is the most recent entity tag. Don't bother trying to request the document again. Just use the version you saved locally in your cache. And if, on the off chance, the document's been updated and therefore has a new ETag value, then the web application goes through the process of sending that entire document back to the user. But by taking advantage of technologies like this, we can make sure that we're not making too many requests to the database, that we don't make redundant requests if a particular resource hasn't changed. So caching can be done on the client side.
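(Here is a rough sketch of those two headers in a Flask route. The route name, the data, and the choice to hash the response body to produce the ETag are all illustrative assumptions.)

```python
import hashlib
from flask import Flask, make_response, request

app = Flask(__name__)

def load_big_report():
    # Stand-in for an expensive database query or a large generated document.
    return "imagine a large, expensive-to-generate document here"

@app.route("/report")
def report():
    body = load_big_report()
    etag = hashlib.sha256(body.encode()).hexdigest()  # tags this exact version

    # If the client already has this version cached, reply 304 and skip the body.
    if request.headers.get("If-None-Match") == etag:
        return "", 304

    response = make_response(body)
    response.headers["ETag"] = etag
    response.headers["Cache-Control"] = "max-age=60"  # client may reuse it for 60 seconds
    return response
```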
Caching can also be done on the server side, which changes our diagram slightly so as to look a little bit more like this, whereby now, we've got some more complications here. We've got some load balancer that's communicating with a bunch of different servers. All of those servers have to interact with the database, and maybe you've got multiple databases going on here that are each able to do reads and writes, either in a single-primary model or a multi-primary model. And those servers also have access to some cache that makes it easier to access data quickly, in a sense saying, if there's some expensive database query, don't bother performing the database query again and again and again. Take the results of that database query once. Save it inside of the cache. And from then on, the server can just look to the cache and get information out of there. So lots of security and scalability concerns can potentially come about as you begin web application development. And so the goal of today was really just to give you a sense for the types of concerns to be aware of, the types of things to be thinking about, and the types of issues that will come about if you decide to take a web application and begin to have more and more people actually start to use it. So questions about that or about any of the other topics we've covered this week? All right. So with the remainder of this morning, between now and about 12:30 or so, we'll leave it open to more project time, an opportunity to work on any of the projects you've worked on so far over the course of this week and also an opportunity to work on something new if you would like to. I know many of you yesterday decided to start on new projects, projects of your own choosing built in React or Flask or using JavaScript or any of the other technologies we've talked about this week. Before we conclude, though, I do have to say a couple of thank yous, first to David for helping to advise the class, to the teaching fellows-- Josh and Christian and Athena and Julia-- for being excellent in helping to answer questions and helping to make sure that the course can run smoothly, and to Andrew up in the back, who's been taking care of the production side of everything over the course of this week, making sure that all the lectures are recorded and making sure they're posted online, such that afterwards, whether you were here or not, you're able to go online and see them. So thank you to everyone for helping to make the course possible. Thank you to all of you for coming to the course. Hope you enjoyed it. Hope you got things out of it. We've really only scratched the surface, though, of a lot of the topics that we've covered over the course of the past week. There's a lot more to CSS and HTML and JavaScript and Flask and Python and React than we were really able to touch on over the course of the week. It was really meant to be more of an opportunity to give you some exposure to some of the fundamentals of these ideas, some of the tools and the concepts, such that you can ultimately use them as you begin to design web applications of your own. So I do hope that you've learned something from the week but, in particular, that you found things that are interesting to you, such that you continue to take those ideas and explore them. Go beyond just what we've been able to cover over the course of this week and explore what else these technologies and these tools and these ideas ultimately have to offer. So thank you so much.
We'll stick around until 12:30 to help with project time. [APPLAUSE] But this was CS50 Beyond.