1 00:00:00,000 --> 00:00:03,500 [MUSIC PLAYING] 2 00:00:03,500 --> 00:00:17,457 3 00:00:17,457 --> 00:00:19,290 BRIAN YU: All right, welcome back, everyone, 4 00:00:19,290 --> 00:00:21,570 to Web Programming with Python and JavaScript. 5 00:00:21,570 --> 00:00:25,750 And for our final topic, we're going to explore scalability and security. 6 00:00:25,750 --> 00:00:28,470 So far in the class, we've been building web applications. 7 00:00:28,470 --> 00:00:31,635 And we've been building web applications that work on our own computer. 8 00:00:31,635 --> 00:00:33,510 But if we want to take those web applications 9 00:00:33,510 --> 00:00:36,000 and deploy them to the world so people all across the internet 10 00:00:36,000 --> 00:00:37,958 can begin to use them, then we're going to need 11 00:00:37,958 --> 00:00:40,980 to host our web application on some sort of web server-- 12 00:00:40,980 --> 00:00:44,192 some dedicated piece of hardware that is listening for web requests 13 00:00:44,192 --> 00:00:46,650 and responding to them with the response that we would like 14 00:00:46,650 --> 00:00:48,660 for our web application to deliver. 15 00:00:48,660 --> 00:00:51,030 And when we do so, this introduces a whole bunch 16 00:00:51,030 --> 00:00:54,338 of interesting issues surrounding scalability and security. 17 00:00:54,338 --> 00:00:56,130 So we'll take a look at these issues today, 18 00:00:56,130 --> 00:00:59,970 beginning with problems concerning scalability-- what those problems are 19 00:00:59,970 --> 00:01:02,650 and how we might go about addressing them. 20 00:01:02,650 --> 00:01:04,410 So when we deploy our web applications, we 21 00:01:04,410 --> 00:01:06,720 deploy them by putting them onto a web server 22 00:01:06,720 --> 00:01:08,970 that I'm, here, just representing with this rectangle. 
23 00:01:08,970 --> 00:01:12,840 But all the server is is some dedicated computer, some piece of hardware that 24 00:01:12,840 --> 00:01:14,620 is listening for incoming requests. 25 00:01:14,620 --> 00:01:18,750 So we'll draw this line to represent an incoming web request from a user. 26 00:01:18,750 --> 00:01:21,660 The server takes that request and responds to it. 27 00:01:21,660 --> 00:01:23,880 But ultimately, our web application isn't just 28 00:01:23,880 --> 00:01:25,530 going to be servicing one user. 29 00:01:25,530 --> 00:01:28,080 If it becomes popular, it might have many users 30 00:01:28,080 --> 00:01:31,560 that are all trying to connect to that server at the same time. 31 00:01:31,560 --> 00:01:34,790 And as multiple people start to connect to that server at the same time, 32 00:01:34,790 --> 00:01:37,560 here is where we start to deal with issues of scalability. 33 00:01:37,560 --> 00:01:41,040 A single computer or a single server can only service so many users 34 00:01:41,040 --> 00:01:42,273 at any given time. 35 00:01:42,273 --> 00:01:44,190 And so, therefore, we need to think in advance 36 00:01:44,190 --> 00:01:47,640 about how we're going to deal with those issues of scale. 37 00:01:47,640 --> 00:01:49,920 But the first question, before we even get there, 38 00:01:49,920 --> 00:01:52,320 is where these servers actually exist. 39 00:01:52,320 --> 00:01:56,010 And nowadays, there are two main options for where these servers can exist. 40 00:01:56,010 --> 00:02:00,210 These servers can be on the cloud or they can be on premise. 41 00:02:00,210 --> 00:02:02,400 And on-premise servers, you might imagine 42 00:02:02,400 --> 00:02:05,160 is if a company is running their own web application. 43 00:02:05,160 --> 00:02:08,340 On-premise servers are servers that are inside of the company's walls. 44 00:02:08,340 --> 00:02:10,710 The company owns the physical servers, maybe 45 00:02:10,710 --> 00:02:12,840 on some server racks inside of a room. 
46 00:02:12,840 --> 00:02:14,970 And therefore, they have very direct control 47 00:02:14,970 --> 00:02:17,940 over all of the servers-- exactly what kind of servers they are, 48 00:02:17,940 --> 00:02:19,830 exactly what software is running on them. 49 00:02:19,830 --> 00:02:23,280 They can go and physically look at the servers and debug them, if need be, 50 00:02:23,280 --> 00:02:25,830 in order to make sure that any issues are dealt with. 51 00:02:25,830 --> 00:02:28,170 But increasingly, we're starting to move into a world 52 00:02:28,170 --> 00:02:31,170 where cloud computing is becoming increasingly popular. 53 00:02:31,170 --> 00:02:35,190 In cloud computing, rather than have dedicated servers that are on premise, 54 00:02:35,190 --> 00:02:37,290 we have servers that are somewhere in the cloud 55 00:02:37,290 --> 00:02:40,950 where cloud computing companies like Amazon, or Google, or Microsoft 56 00:02:40,950 --> 00:02:42,720 are able to run their own servers. 57 00:02:42,720 --> 00:02:46,860 And we simply use those servers that are provided by those third parties, 58 00:02:46,860 --> 00:02:50,130 whether it's Amazon, or Google, or Microsoft, or someone else. 59 00:02:50,130 --> 00:02:51,330 And there are trade offs. 60 00:02:51,330 --> 00:02:54,950 With cloud computing, we no longer have as direct control over the machines 61 00:02:54,950 --> 00:02:56,700 themselves because they're not on premise. 62 00:02:56,700 --> 00:02:59,190 We can't physically manipulate those computers. 63 00:02:59,190 --> 00:03:01,620 But we have the advantage of not having to worry 64 00:03:01,620 --> 00:03:05,070 about dealing with physical objects that are inside 65 00:03:05,070 --> 00:03:08,280 of the premise of the company whose servers we'd like to run code for. 66 00:03:08,280 --> 00:03:10,770 When it's on the cloud, everything is managed externally 67 00:03:10,770 --> 00:03:14,205 by some other company, and we can simply use the servers that we need to. 
68 00:03:14,205 --> 00:03:16,830 And we'll see that this lends itself to other benefits as well. 69 00:03:16,830 --> 00:03:20,490 As we might need more servers, as we start to get more sophisticated web 70 00:03:20,490 --> 00:03:24,120 applications that need more users, these cloud-computing companies 71 00:03:24,120 --> 00:03:26,220 can allow us to create web applications that 72 00:03:26,220 --> 00:03:29,280 are able to scale across multiple different servers 73 00:03:29,280 --> 00:03:31,910 as we start to get more and more users. 74 00:03:31,910 --> 00:03:35,460 But we'll discuss those issues of scale as we get to them. 75 00:03:35,460 --> 00:03:37,890 The question we need to ask after we have these servers-- 76 00:03:37,890 --> 00:03:40,348 whether they're servers that are on premise or servers that 77 00:03:40,348 --> 00:03:42,240 are operating somewhere in the cloud-- 78 00:03:42,240 --> 00:03:47,328 is, how many users can the server actually service at any given time? 79 00:03:47,328 --> 00:03:48,370 And that's going to vary. 80 00:03:48,370 --> 00:03:51,300 It's going to vary based on the size of the server, the computing 81 00:03:51,300 --> 00:03:52,470 power of the server. 82 00:03:52,470 --> 00:03:56,250 And it's going to be dependent upon how long it takes to process 83 00:03:56,250 --> 00:03:58,110 any particular user's request. 84 00:03:58,110 --> 00:04:00,420 If user requests are quite expensive, it might 85 00:04:00,420 --> 00:04:03,870 mean that there are fewer users that can be serviced at any given time. 86 00:04:03,870 --> 00:04:05,880 And it's for that reason that a helpful tool 87 00:04:05,880 --> 00:04:08,850 is to do some kind of benchmarking, some process of trying 88 00:04:08,850 --> 00:04:12,630 to do some analysis on how many users a server can actually 89 00:04:12,630 --> 00:04:14,730 be handling at any particular time. 
90 00:04:14,730 --> 00:04:16,950 And there are numerous different tools that allow 91 00:04:16,950 --> 00:04:18,779 us to do this kind of benchmarking. 92 00:04:18,779 --> 00:04:22,470 Apache Bench, or otherwise known as AB, is a popular tool 93 00:04:22,470 --> 00:04:24,250 for doing this kind of thing. 94 00:04:24,250 --> 00:04:28,290 But benchmarking is going to be useful so that we know how many users one 95 00:04:28,290 --> 00:04:29,550 particular server can handle. 96 00:04:29,550 --> 00:04:31,290 Maybe it can handle 50 users. 97 00:04:31,290 --> 00:04:32,700 Maybe it can handle 100 users. 98 00:04:32,700 --> 00:04:35,160 Maybe it can handle more at any given time. 99 00:04:35,160 --> 00:04:37,830 But ultimately, it's going to be some finite limit. 100 00:04:37,830 --> 00:04:40,680 Every computer just has some finite amount of resources, 101 00:04:40,680 --> 00:04:42,030 and servers are no exception. 102 00:04:42,030 --> 00:04:45,360 There's going to be some number of users after which the server is not 103 00:04:45,360 --> 00:04:47,020 going to be able to handle it. 104 00:04:47,020 --> 00:04:48,850 So what do we do in that situation? 105 00:04:48,850 --> 00:04:53,130 What do we do if our server can only handle 100 users at any given time, 106 00:04:53,130 --> 00:04:58,020 but 101 users are trying to use our web application at the same time? 107 00:04:58,020 --> 00:04:59,440 Something needs to change. 108 00:04:59,440 --> 00:05:01,740 We need to deal with some sort of scaling 109 00:05:01,740 --> 00:05:04,500 to make sure that our web application can scale. 110 00:05:04,500 --> 00:05:07,770 And there are a couple of different types of scaling that we can try. 111 00:05:07,770 --> 00:05:10,530 One approach is to do what's called vertical scaling, which 112 00:05:10,530 --> 00:05:12,780 might be the simplest way you could imagine scaling. 
113 00:05:12,780 --> 00:05:15,900 If this server is not good enough for handling the number of users 114 00:05:15,900 --> 00:05:18,890 that we need it to handle, well, just get a bigger server. 115 00:05:18,890 --> 00:05:21,260 In vertical scaling, we just take the server 116 00:05:21,260 --> 00:05:23,930 and get a bigger server, a more powerful server, 117 00:05:23,930 --> 00:05:26,480 a server that can handle more users at any given time. 118 00:05:26,480 --> 00:05:27,730 It's going to cost more. 119 00:05:27,730 --> 00:05:29,480 But if we need it to handle more users, we 120 00:05:29,480 --> 00:05:33,110 can just get a bigger server to be able to deal with that problem. 121 00:05:33,110 --> 00:05:34,607 This approach is fairly simple. 122 00:05:34,607 --> 00:05:37,190 It just involves swapping out one server for another, one that 123 00:05:37,190 --> 00:05:39,410 can handle more users concurrently. 124 00:05:39,410 --> 00:05:40,830 But it also has drawbacks. 125 00:05:40,830 --> 00:05:44,330 There is some limit to how big the server can be, to how many users 126 00:05:44,330 --> 00:05:47,390 any one physical server is going to be able to handle because there's 127 00:05:47,390 --> 00:05:50,870 a physical limitation on what is the biggest, fastest, most powerful 128 00:05:50,870 --> 00:05:53,310 server we could possibly get. 129 00:05:53,310 --> 00:05:55,970 So when vertical scaling ends up not being enough, 130 00:05:55,970 --> 00:05:59,720 an alternative-- as you might imagine-- is what's known as horizontal scaling.
131 00:05:59,720 --> 00:06:01,970 And the idea behind horizontal scaling is 132 00:06:01,970 --> 00:06:06,560 that, when one server isn't enough to be able to service all of the users that 133 00:06:06,560 --> 00:06:10,070 might be trying to use a web application at the same time, well, 134 00:06:10,070 --> 00:06:13,010 then we can take the approach of saying, well, rather than just using 135 00:06:13,010 --> 00:06:17,840 one server, let's go ahead and split it up into two different servers. 136 00:06:17,840 --> 00:06:21,420 We now have two servers that are both running the web application. 137 00:06:21,420 --> 00:06:24,980 And now, effectively, we've been able to double the number of users 138 00:06:24,980 --> 00:06:26,600 that this web application can handle. 139 00:06:26,600 --> 00:06:29,690 Rather than just a single server that can service 100 users, 140 00:06:29,690 --> 00:06:33,200 if we have two of them, now we can service 200 users at any given time 141 00:06:33,200 --> 00:06:37,670 if you imagine 100 of them using server A over here and 100 of them 142 00:06:37,670 --> 00:06:40,460 using server B over there. 143 00:06:40,460 --> 00:06:44,220 But this then lends itself to some other questions that we have to answer, 144 00:06:44,220 --> 00:06:47,630 which is, how do these servers get their users in the first place? 145 00:06:47,630 --> 00:06:50,450 When a user requests a web page, how does that user 146 00:06:50,450 --> 00:06:54,140 get directed either to server A or to server B? 147 00:06:54,140 --> 00:06:57,980 It seems that they need some way to make that decision in order to decide 148 00:06:57,980 --> 00:07:00,690 whether to go one direction or another. 149 00:07:00,690 --> 00:07:04,010 And it's for that reason that we might introduce another piece of hardware 150 00:07:04,010 --> 00:07:05,240 into this picture. 151 00:07:05,240 --> 00:07:09,070 And that additional piece of hardware is what we might call a load balancer. 
152 00:07:09,070 --> 00:07:11,510 And a load balancer is just another piece of hardware 153 00:07:11,510 --> 00:07:14,910 that is going to sit in front of these servers, so to speak. 154 00:07:14,910 --> 00:07:17,660 In other words, when a user makes a request to a web page, 155 00:07:17,660 --> 00:07:21,170 rather than immediately getting that request to one of these web servers, 156 00:07:21,170 --> 00:07:25,250 the request is first going to go through this load balancer 157 00:07:25,250 --> 00:07:27,800 where the request first comes into the load balancer. 158 00:07:27,800 --> 00:07:31,160 And the load balancer then decides whether to send that request to server 159 00:07:31,160 --> 00:07:35,330 A or to send that request to server B. And this process 160 00:07:35,330 --> 00:07:38,300 is likely less expensive than actually dealing with and processing 161 00:07:38,300 --> 00:07:39,330 that request. 162 00:07:39,330 --> 00:07:42,440 So the load balancer is effectively just acting as a dispatcher. 163 00:07:42,440 --> 00:07:44,310 It waits for those requests to come in. 164 00:07:44,310 --> 00:07:46,670 And when the requests do come in, the load balancer 165 00:07:46,670 --> 00:07:49,628 directs those requests either to go to one server or to another. 166 00:07:49,628 --> 00:07:52,670 And you might imagine the story where we have more than just two servers. 167 00:07:52,670 --> 00:07:54,260 Maybe we have many servers. 168 00:07:54,260 --> 00:07:56,660 And the load balancer is just going to balance 169 00:07:56,660 --> 00:07:59,030 between all of those different servers. 170 00:07:59,030 --> 00:08:02,570 And this process of deciding which server to send a request to 171 00:08:02,570 --> 00:08:05,840 is known as load balancing, which is what the load balancer is ultimately 172 00:08:05,840 --> 00:08:06,618 doing. 
173 00:08:06,618 --> 00:08:09,410 And there are various different methods that you might use in order 174 00:08:09,410 --> 00:08:11,042 to perform this load balancing. 175 00:08:11,042 --> 00:08:13,250 So you might imagine thinking about this intuitively. 176 00:08:13,250 --> 00:08:16,490 How would the load balancer decide, given some request, 177 00:08:16,490 --> 00:08:19,220 should we send the request to this router, to this server, 178 00:08:19,220 --> 00:08:22,910 or should we send the request to some other server instead? 179 00:08:22,910 --> 00:08:26,120 And there are many different approaches that our load balancer might take. 180 00:08:26,120 --> 00:08:27,440 And here are just a couple. 181 00:08:27,440 --> 00:08:30,230 Random choice might be the simplest of options. 182 00:08:30,230 --> 00:08:34,480 Given a user that shows up and tries to make a request to our web server, 183 00:08:34,480 --> 00:08:36,620 the load balancer first takes a look at the user 184 00:08:36,620 --> 00:08:40,497 and just randomly assigns them to one of the various different servers 185 00:08:40,497 --> 00:08:42,080 that might be processing that request. 186 00:08:42,080 --> 00:08:46,340 If there are 10 different servers, it randomly chooses among those 10 servers 187 00:08:46,340 --> 00:08:50,030 to decide which of them is going to be servicing that request. 188 00:08:50,030 --> 00:08:52,020 This has the advantage of being very simple. 189 00:08:52,020 --> 00:08:53,300 It's just a quick calculation. 190 00:08:53,300 --> 00:08:56,330 The computers can pretty readily generate random numbers. 191 00:08:56,330 --> 00:08:58,310 And based on that random number, the computer 192 00:08:58,310 --> 00:09:02,720 can dispatch the user to one server or to another server. 193 00:09:02,720 --> 00:09:06,620 But it might not be the best option because, if we happen to get unlucky, 194 00:09:06,620 --> 00:09:10,190 we might end up with many more users on one server than another. 
195 00:09:10,190 --> 00:09:12,890 Or we might end up with servers that are entirely 196 00:09:12,890 --> 00:09:15,230 unused if it just so happens that we don't end up 197 00:09:15,230 --> 00:09:17,300 randomly selecting that server. 198 00:09:17,300 --> 00:09:20,780 Now, in practice with many users that are all using this load balancer, all 199 00:09:20,780 --> 00:09:24,260 being dispatched, odds are high that eventually all of them will be used. 200 00:09:24,260 --> 00:09:26,837 But it might not be a totally even distribution. 201 00:09:26,837 --> 00:09:28,670 And so for that reason, another approach you 202 00:09:28,670 --> 00:09:32,570 might take is round-robin approach where the approach is, instead, 203 00:09:32,570 --> 00:09:36,650 for the very first user, go ahead and assign that user to server number one. 204 00:09:36,650 --> 00:09:38,840 For the next user, assign them to server number two. 205 00:09:38,840 --> 00:09:40,760 And maybe, if there are five servers, you say, 206 00:09:40,760 --> 00:09:44,150 the third user goes to server three, user four goes to server four, 207 00:09:44,150 --> 00:09:47,420 user five goes to server five, and then user six 208 00:09:47,420 --> 00:09:49,070 goes back to server number one. 209 00:09:49,070 --> 00:09:51,257 You basically rotate going one through five. 210 00:09:51,257 --> 00:09:53,840 And then, once you've assigned someone to each of the servers, 211 00:09:53,840 --> 00:09:55,760 you go back to the beginning. 212 00:09:55,760 --> 00:09:59,360 This is also a relatively easy thing to implement because you can simply just 213 00:09:59,360 --> 00:10:01,520 keep count somewhere in the load balancer 214 00:10:01,520 --> 00:10:04,730 saying, what was the most recent server that I assigned a user to? 
215 00:10:04,730 --> 00:10:07,550 And the next time a request comes in, go ahead and assign it 216 00:10:07,550 --> 00:10:09,710 to the next server, and the next server after that, 217 00:10:09,710 --> 00:10:12,220 effectively doing a round-robin style approach 218 00:10:12,220 --> 00:10:16,040 where you go through all the servers once before going through the servers 219 00:10:16,040 --> 00:10:17,140 again. 220 00:10:17,140 --> 00:10:19,750 Now, this might seem better than random choice in the sense 221 00:10:19,750 --> 00:10:23,230 that it's going to more equitably decide whether to assign 222 00:10:23,230 --> 00:10:26,710 any particular request to any particular server. 223 00:10:26,710 --> 00:10:29,110 But it also suffers from certain problems. 224 00:10:29,110 --> 00:10:31,510 Round robin might be great, but if some requests 225 00:10:31,510 --> 00:10:34,975 take longer than other requests, we might also get unlucky, 226 00:10:34,975 --> 00:10:36,850 and the requests that are taking longer might 227 00:10:36,850 --> 00:10:40,160 end up all going to one of the servers as opposed to another server. 228 00:10:40,160 --> 00:10:43,310 So there are other approaches that we might want to go to as well-- 229 00:10:43,310 --> 00:10:45,880 for example, something like fewest connections 230 00:10:45,880 --> 00:10:50,430 where the approach there is to say, go ahead, and when a user makes a request, 231 00:10:50,430 --> 00:10:53,050 the load balancer should pick which of the servers 232 00:10:53,050 --> 00:10:57,370 currently has the fewest active connections from other users 233 00:10:57,370 --> 00:11:01,060 and other requests that are currently connected to those servers instead. 
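[Editor's sketch] The three selection strategies described here (random choice, round robin, fewest connections) can be illustrated with a small Python class. This is a minimal sketch, not how any particular load balancer is implemented: the class name `LoadBalancer`, its method names, and the server labels are all hypothetical, and a real load balancer would track connections from actual network traffic rather than from explicit calls.

```python
import itertools
import random


class LoadBalancer:
    """Toy dispatcher illustrating three server-selection strategies."""

    def __init__(self, servers):
        self.servers = list(servers)
        # Round-robin state: an endless cycle over the server list.
        self._cycle = itertools.cycle(self.servers)
        # Active connection count per server, used by fewest connections.
        self.active = {server: 0 for server in self.servers}

    def random_choice(self):
        # Simple and cheap, but the resulting spread may be uneven.
        return random.choice(self.servers)

    def round_robin(self):
        # 1, 2, 3, ..., then back to the first server.
        return next(self._cycle)

    def fewest_connections(self):
        # Has to look at every server's count, so it costs more per request.
        return min(self.servers, key=lambda server: self.active[server])

    def open_connection(self, server):
        self.active[server] += 1

    def close_connection(self, server):
        self.active[server] -= 1


balancer = LoadBalancer(["server1", "server2", "server3"])
rotation = [balancer.round_robin() for _ in range(4)]
```

With three servers, `rotation` visits each server once and then wraps around to the first one again.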
234 00:11:01,060 --> 00:11:04,120 And by choosing the server that happens to have the fewest connections, 235 00:11:04,120 --> 00:11:07,330 you're probably going to do a better job of trying to balance out 236 00:11:07,330 --> 00:11:09,340 between all of the various different requests 237 00:11:09,340 --> 00:11:12,220 that might be happening inside of your web application. 238 00:11:12,220 --> 00:11:15,220 And while this might do a better job, there are trade offs here as well. 239 00:11:15,220 --> 00:11:18,700 It might be more expensive, for example, to compute which of the servers 240 00:11:18,700 --> 00:11:21,310 happens to have the fewest number of connections, 241 00:11:21,310 --> 00:11:24,880 whereas it's much easier just to say, choose a server at random 242 00:11:24,880 --> 00:11:29,740 or to do the round-robin style approach of just 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 243 00:11:29,740 --> 00:11:32,590 again, and again, and again. 244 00:11:32,590 --> 00:11:36,410 But all of these approaches, applied naively, have yet another problem, 245 00:11:36,410 --> 00:11:38,030 which has to do with sessions. 246 00:11:38,030 --> 00:11:40,150 And you'll recall that we used sessions whenever 247 00:11:40,150 --> 00:11:44,110 we wanted to store information about the user's current interaction 248 00:11:44,110 --> 00:11:45,220 with the web application. 249 00:11:45,220 --> 00:11:46,780 When you log into a website-- 250 00:11:46,780 --> 00:11:50,300 you log into your email, or you log into Amazon, for example-- 251 00:11:50,300 --> 00:11:53,740 and then you come back to that website or visit another page on that website-- 252 00:11:53,740 --> 00:11:56,470 make another request, for example-- 253 00:11:56,470 --> 00:11:59,800 it's not the case that you have to sign in yet again, that the website has 254 00:11:59,800 --> 00:12:01,720 totally forgotten who you are.
255 00:12:01,720 --> 00:12:04,450 When I go back to my mail account, or when I go back to Amazon 256 00:12:04,450 --> 00:12:08,205 for a second time, my mail account or Amazon remembers me from the last time 257 00:12:08,205 --> 00:12:08,830 that I visited. 258 00:12:08,830 --> 00:12:13,060 I have some sort of session where it's keeping track of who is logged in, 259 00:12:13,060 --> 00:12:15,670 maybe information about what I've been doing on the page, 260 00:12:15,670 --> 00:12:18,790 and allows me to continue interacting with the web application, 261 00:12:18,790 --> 00:12:21,880 even if I'm making multiple requests. 262 00:12:21,880 --> 00:12:24,310 And this, you might imagine, could be a problem 263 00:12:24,310 --> 00:12:26,440 for this type of load balancing. 264 00:12:26,440 --> 00:12:31,630 If I have multiple different servers, imagine if I try to log into a website. 265 00:12:31,630 --> 00:12:34,990 And the first time I make a request, I'm directed to server number one. 266 00:12:34,990 --> 00:12:37,690 And I'm now logged in on server number one. 267 00:12:37,690 --> 00:12:39,400 But then I make another request. 268 00:12:39,400 --> 00:12:41,162 I'm directed back to the load balancer. 269 00:12:41,162 --> 00:12:43,120 And maybe the load balancer, this time, decides 270 00:12:43,120 --> 00:12:45,310 to send me to server number two. 271 00:12:45,310 --> 00:12:48,190 But if the session is stored in server number one somewhere-- 272 00:12:48,190 --> 00:12:51,010 server number one remembers who I am and what I'm doing-- 273 00:12:51,010 --> 00:12:54,282 then server number two is not going to know who I am. 274 00:12:54,282 --> 00:12:56,740 And therefore, it's not going to remember that I've already 275 00:12:56,740 --> 00:12:58,660 logged into this web application. 276 00:12:58,660 --> 00:13:01,710 And as a result, I might be prompted to log in again. 
277 00:13:01,710 --> 00:13:04,630 And if I go make another request, and I end up on yet another server, 278 00:13:04,630 --> 00:13:07,580 I might be logged out again and have to log in for a third time. 279 00:13:07,580 --> 00:13:11,590 So the problem comes about when our load balancing happens, 280 00:13:11,590 --> 00:13:14,290 but we're not doing so in a session-aware way-- 281 00:13:14,290 --> 00:13:18,310 that our load balancer isn't caring about when a user visits the page 282 00:13:18,310 --> 00:13:22,300 and then visits another page on the same web application again-- 283 00:13:22,300 --> 00:13:25,720 because we want to remember information from the previous time 284 00:13:25,720 --> 00:13:27,475 that the user was here. 285 00:13:27,475 --> 00:13:28,850 So how can we solve this problem? 286 00:13:28,850 --> 00:13:30,820 How can we make sure that, when we do this load 287 00:13:30,820 --> 00:13:33,010 balancing across multiple different servers, 288 00:13:33,010 --> 00:13:34,795 that we do so in a session-aware way? 289 00:13:34,795 --> 00:13:36,670 Well, there are multiple different approaches 290 00:13:36,670 --> 00:13:39,310 to session-aware load balancing. 291 00:13:39,310 --> 00:13:42,610 One approach is this general idea known as sticky sessions 292 00:13:42,610 --> 00:13:46,150 where the idea is that, when I come back to the load balancer, 293 00:13:46,150 --> 00:13:49,940 the load balancer will remember what server I was sent to last time 294 00:13:49,940 --> 00:13:52,210 and send me there yet again. 
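[Editor's sketch] The sticky-session idea can be sketched as a lookup table kept by the load balancer. This is an illustration under assumptions: `StickyLoadBalancer`, `route`, and `client_id` are hypothetical names, the first-visit choice happens to use round robin, and how a client is identified (for instance by a cookie value or an IP address) is left open because the lecture does not specify it.

```python
import itertools


class StickyLoadBalancer:
    """Toy load balancer that sends a returning client to the same server."""

    def __init__(self, servers):
        # Used only for a client's first visit.
        self._cycle = itertools.cycle(list(servers))
        # client_id -> server chosen the first time that client showed up.
        self._assigned = {}

    def route(self, client_id):
        if client_id not in self._assigned:
            self._assigned[client_id] = next(self._cycle)
        # Returning clients skip the usual selection and go where they went before.
        return self._assigned[client_id]


sticky = StickyLoadBalancer(["server1", "server2"])
first_visit = sticky.route("alice")
other_user = sticky.route("bob")
second_visit = sticky.route("alice")
```

Here `second_visit` is the same server as `first_visit`, even though plain round robin would have moved on to a different server by then.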
295 00:13:52,210 --> 00:13:54,670 So for example, if I log into a website once, 296 00:13:54,670 --> 00:13:57,490 and I'm directed to server number two, for example, then 297 00:13:57,490 --> 00:14:00,130 the next time I visit this web application, 298 00:14:00,130 --> 00:14:03,520 even if I should be directed to server three or four according 299 00:14:03,520 --> 00:14:07,600 to random choice or according to fewest connections or any of these other load 300 00:14:07,600 --> 00:14:09,700 balancing methods, the load balancer should 301 00:14:09,700 --> 00:14:12,310 remember that, last time I came to this site, 302 00:14:12,310 --> 00:14:14,240 I got directed to server number two. 303 00:14:14,240 --> 00:14:16,210 And so this time, the load balancer is going 304 00:14:16,210 --> 00:14:18,550 to direct me to server number two yet again. 305 00:14:18,550 --> 00:14:22,000 That way, server number two, which contains information about my session, 306 00:14:22,000 --> 00:14:25,000 is going to see me again and remember who it is that I am. 307 00:14:25,000 --> 00:14:28,180 And it's not going to make me log into the exact same website 308 00:14:28,180 --> 00:14:30,570 for a second time, for example. 309 00:14:30,570 --> 00:14:33,280 And so sticky sessions are one way of dealing with this problem. 310 00:14:33,280 --> 00:14:35,363 But again, with all of these approaches-- and this 311 00:14:35,363 --> 00:14:38,410 will be a recurring theme as we talk about scalability and security-- 312 00:14:38,410 --> 00:14:39,730 there are trade offs here. 313 00:14:39,730 --> 00:14:44,200 A trade off of sticky sessions is that it's possible that one of these servers 314 00:14:44,200 --> 00:14:47,950 is going to end up getting far more load than another if one server happens 315 00:14:47,950 --> 00:14:50,620 to have a lot of users that keep coming back to the website 316 00:14:50,620 --> 00:14:52,390 and keep requesting additional pages.
317 00:14:52,390 --> 00:14:54,940 But other servers might have 318 00:14:54,940 --> 00:14:58,010 had users that decided not to come back, for example. 319 00:14:58,010 --> 00:15:01,390 And so there's a difference in utilization where some of our servers 320 00:15:01,390 --> 00:15:03,880 might be more heavily utilized than other servers, 321 00:15:03,880 --> 00:15:07,580 and we're not doing a very good job of balancing between them. 322 00:15:07,580 --> 00:15:11,980 And so another approach is to store sessions inside of a database 323 00:15:11,980 --> 00:15:15,580 rather than store information about sessions inside of the servers 324 00:15:15,580 --> 00:15:18,730 themselves, where, if I get directed to another server, 325 00:15:18,730 --> 00:15:20,710 that other server doesn't remember who I am, 326 00:15:20,710 --> 00:15:24,310 doesn't remember information about my interaction with this website. 327 00:15:24,310 --> 00:15:27,890 If we instead choose to store sessions inside of a database-- 328 00:15:27,890 --> 00:15:31,210 and, in particular, inside of a database that all of the servers 329 00:15:31,210 --> 00:15:33,100 have the ability to access-- 330 00:15:33,100 --> 00:15:36,400 well, then it doesn't matter which of the servers I get directed to 331 00:15:36,400 --> 00:15:39,370 and which server the load balancer decides to send me to 332 00:15:39,370 --> 00:15:42,310 because, regardless of which server I end up getting sent to, 333 00:15:42,310 --> 00:15:44,235 the session information is in the database. 334 00:15:44,235 --> 00:15:46,360 And each of the servers can connect to the database 335 00:15:46,360 --> 00:15:49,390 to find out who I am, to find out whether I've logged into the site 336 00:15:49,390 --> 00:15:52,660 already, and therefore is able to recognize me. 337 00:15:52,660 --> 00:15:54,670 And so that might be one approach as well. 338 00:15:54,670 --> 00:15:57,702 Another approach is to store sessions on the client side.
339 00:15:57,702 --> 00:15:59,410 We've talked a little bit about this idea 340 00:15:59,410 --> 00:16:03,100 of cookies, where the web application can set a cookie so 341 00:16:03,100 --> 00:16:06,460 that your web browser is able to present that cookie the next time 342 00:16:06,460 --> 00:16:09,020 it makes a request to the same web application. 343 00:16:09,020 --> 00:16:12,430 And inside this cookie, you can store a whole bunch of information, including 344 00:16:12,430 --> 00:16:14,000 information about the session. 345 00:16:14,000 --> 00:16:16,690 You might, inside of a cookie, store information 346 00:16:16,690 --> 00:16:19,340 about what user is currently logged in, for example, 347 00:16:19,340 --> 00:16:21,500 or other session-related information. 348 00:16:21,500 --> 00:16:23,080 But here, too, there are drawbacks. 349 00:16:23,080 --> 00:16:25,750 If you're not careful, someone could manipulate that cookie 350 00:16:25,750 --> 00:16:27,380 and maybe pretend to be someone else. 351 00:16:27,380 --> 00:16:29,230 And so for that reason, you might want to do 352 00:16:29,230 --> 00:16:32,020 some encryption or some kind of signing to make sure 353 00:16:32,020 --> 00:16:35,832 that you can't fake a cookie and pretend to be someone that you're not. 354 00:16:35,832 --> 00:16:37,540 But another concern is that, as you start 355 00:16:37,540 --> 00:16:40,130 to store more and more information inside of these cookies, 356 00:16:40,130 --> 00:16:43,540 these cookies keep getting sent back and forth between the server and the client 357 00:16:43,540 --> 00:16:45,250 every time a request is made. 358 00:16:45,250 --> 00:16:48,040 That can start to get expensive, too-- more and more information 359 00:16:48,040 --> 00:16:52,090 passing back and forth between the client and the server.
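[Editor's sketch] One way to make a client-side session cookie tamper-evident is to append a keyed signature, which is roughly what "signing" refers to here. The sketch below uses Python's standard `hmac` and `hashlib` modules; the helper names `sign` and `verify`, the `value.signature` cookie layout, and the placeholder key are assumptions for illustration only. Note that signing only detects modification. It does not hide the cookie's contents; that would require encryption.

```python
import hashlib
import hmac

# Hypothetical secret, known only to the servers and never sent to the client.
SECRET_KEY = b"replace-with-a-real-secret"


def sign(value: str) -> str:
    """Return the cookie text: the value followed by its HMAC-SHA256 signature."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{digest}"


def verify(cookie: str):
    """Return the original value if the signature matches, otherwise None."""
    value, _, digest = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if hmac.compare_digest(digest.encode(), expected.encode()):
        return value
    return None


cookie = sign("user_id=42")
# A client who edits the value but cannot recompute the signature is rejected.
forged = "user_id=43." + cookie.rpartition(".")[2]
```

Because every server shares the same key, any of them can check the cookie without consulting the server that issued it.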
360 00:16:52,090 --> 00:16:54,580 So lots of possible approaches-- no one approach 361 00:16:54,580 --> 00:16:57,040 that is necessarily the right approach or the best approach 362 00:16:57,040 --> 00:16:58,270 to use in all cases. 363 00:16:58,270 --> 00:17:00,850 But things to be aware of-- things to think about 364 00:17:00,850 --> 00:17:03,520 as we begin to deal with these issues of scale, of making 365 00:17:03,520 --> 00:17:07,270 sure we have multiple servers that are available for usage in case we do 366 00:17:07,270 --> 00:17:07,869 need them. 367 00:17:07,869 --> 00:17:10,930 But also making sure that, when we do so, we don't break the user 368 00:17:10,930 --> 00:17:14,920 experience-- we don't end up in a situation where a user is logged in 369 00:17:14,920 --> 00:17:18,160 but then, suddenly, isn't logged in at all. 370 00:17:18,160 --> 00:17:21,460 And so horizontal scaling gives us this kind of capacity-- 371 00:17:21,460 --> 00:17:24,760 the ability to have multiple different servers, all of which 372 00:17:24,760 --> 00:17:27,880 can be dealing with user requests and responding to those user requests 373 00:17:27,880 --> 00:17:28,890 as well. 374 00:17:28,890 --> 00:17:34,240 But a reasonable question to ask is, how many of those servers do we need? 375 00:17:34,240 --> 00:17:36,850 Now, we can use benchmarking to try to estimate this. 376 00:17:36,850 --> 00:17:40,190 If we have an estimate of how many users are going to be on our website 377 00:17:40,190 --> 00:17:42,430 at any given time, we can benchmark and see 378 00:17:42,430 --> 00:17:46,420 how many users can be handled by a single server and extrapolate, 379 00:17:46,420 --> 00:17:49,330 based on that information, to infer how many servers we 380 00:17:49,330 --> 00:17:52,000 might need in our web application to be able to service 381 00:17:52,000 --> 00:17:53,650 all of these different users.
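[Editor's sketch] The extrapolation described here is a ceiling division: expected concurrent users divided by the per-server capacity found through benchmarking, rounded up. The function name `servers_needed` and its parameters are hypothetical, and the sketch assumes capacity adds up linearly across servers, which real deployments only approximate.

```python
import math


def servers_needed(expected_users: int, users_per_server: int) -> int:
    """Estimate the server count from a benchmarked per-server capacity."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    # Round up: 101 users on 100-user servers still needs a second server.
    return max(1, math.ceil(expected_users / users_per_server))


estimate = servers_needed(expected_users=101, users_per_server=100)
```

This matches the earlier example in the lecture, where a server that handles 100 users is one user short for a crowd of 101.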
382 00:17:53,650 --> 00:17:56,680 But it might be the case that our web application doesn't always 383 00:17:56,680 --> 00:17:58,540 have the same number of users. 384 00:17:58,540 --> 00:18:01,660 Maybe, sometimes, there are going to be far more users than another time. 385 00:18:01,660 --> 00:18:05,140 You might imagine, for example, that in a news organization's website-- 386 00:18:05,140 --> 00:18:07,690 like the web application for a newspaper-- 387 00:18:07,690 --> 00:18:09,720 when there's breaking news, some big story, 388 00:18:09,720 --> 00:18:11,470 there's going to be a lot more people that 389 00:18:11,470 --> 00:18:15,380 are all trying to access the website at the same time than at other times. 390 00:18:15,380 --> 00:18:18,310 So one approach might be, consider the maximum. 391 00:18:18,310 --> 00:18:20,650 What is the most number of users that ever 392 00:18:20,650 --> 00:18:23,620 might be trying to use our web application at any given time? 393 00:18:23,620 --> 00:18:26,830 And choose a number of servers based on that maximum so that, 394 00:18:26,830 --> 00:18:28,960 no matter how high the number of users get, 395 00:18:28,960 --> 00:18:32,800 we will have enough servers to be able to service all of those users. 396 00:18:32,800 --> 00:18:35,560 But that's probably not a great economical choice 397 00:18:35,560 --> 00:18:39,250 if, in the vast majority of cases, there will be far fewer users. 398 00:18:39,250 --> 00:18:42,625 In that case, you're going to have a lot of servers that are underutilized-- 399 00:18:42,625 --> 00:18:45,250 where you don't need that many servers, but you're still paying 400 00:18:45,250 --> 00:18:47,770 for the electricity, for keeping all of them running-- 401 00:18:47,770 --> 00:18:50,740 which might not be an ideal choice either. 
402 00:18:50,740 --> 00:18:52,120 So one solution to this-- 403 00:18:52,120 --> 00:18:54,970 quite popular, especially in this world of cloud computing-- 404 00:18:54,970 --> 00:18:58,660 is the idea of autoscaling where you can have an autoscaler 405 00:18:58,660 --> 00:19:03,460 to say that, you know what, let's start with, for example, two servers. 406 00:19:03,460 --> 00:19:05,470 But if there's enough traffic to the website, 407 00:19:05,470 --> 00:19:07,678 if enough people are making requests to the website-- 408 00:19:07,678 --> 00:19:10,360 maybe it's a peak time where people are using the website-- 409 00:19:10,360 --> 00:19:11,830 go ahead and scale up. 410 00:19:11,830 --> 00:19:15,880 Go ahead and add a third server where now our load balancer can balance 411 00:19:15,880 --> 00:19:18,100 between all three of those servers. 412 00:19:18,100 --> 00:19:20,710 And if even more traffic ends up coming to the website-- 413 00:19:20,710 --> 00:19:24,280 more users are trying to use this application all at the same time-- 414 00:19:24,280 --> 00:19:27,160 well, then we can go ahead and add a fourth server as well. 415 00:19:27,160 --> 00:19:28,660 And we can continue to do that. 416 00:19:28,660 --> 00:19:31,510 Most autoscalers will let you configure, for example, 417 00:19:31,510 --> 00:19:34,480 a minimum number of servers and a maximum number of servers. 418 00:19:34,480 --> 00:19:37,420 And dependent on how many users happen to be using your web 419 00:19:37,420 --> 00:19:40,300 application at any given time, the autoscaler 420 00:19:40,300 --> 00:19:44,410 can scale up or scale down, adding new servers as more users come 421 00:19:44,410 --> 00:19:47,410 to the website, removing servers as fewer users are 422 00:19:47,410 --> 00:19:49,870 using the website as well. 423 00:19:49,870 --> 00:19:52,425 And so this can be a nice solution to this problem of scale 424 00:19:52,425 --> 00:19:55,050 where you don't have to worry about how many servers there are. 
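The scaling rule with a configured minimum and maximum can be sketched as a single function. This is an illustration under assumed names and a deliberately simple policy (users per server). Real autoscalers typically react to metrics such as CPU load or request rate.

```python
import math


def desired_servers(active_users: int, users_per_server: int,
                    min_servers: int, max_servers: int) -> int:
    """Pick a server count for the current load, clamped to [min, max]."""
    needed = math.ceil(active_users / users_per_server)
    return max(min_servers, min(max_servers, needed))
```

With a minimum of 2 and a maximum of 10, light traffic keeps 2 servers running, moderate traffic adds a third, and a very large spike is capped at 10.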
425 00:19:55,050 --> 00:19:57,580 It just autoscales entirely on its own. 426 00:19:57,580 --> 00:19:59,080 Now, there are trade offs here, too. 427 00:19:59,080 --> 00:20:01,250 This auto scaling process might take time. 428 00:20:01,250 --> 00:20:05,260 And if a lot of users all come into your website all at the exact same time, 429 00:20:05,260 --> 00:20:08,350 well, it's going to take some time to be able to add 430 00:20:08,350 --> 00:20:10,630 all of these additional servers to start them up. 431 00:20:10,630 --> 00:20:13,700 And so there might be some trade offs there, too, 432 00:20:13,700 --> 00:20:17,330 where you might not be able to service all of the users immediately. 433 00:20:17,330 --> 00:20:19,380 And another problem worth thinking about is, 434 00:20:19,380 --> 00:20:21,510 as you add more and more of these servers, 435 00:20:21,510 --> 00:20:23,877 you introduce opportunities for failure. 436 00:20:23,877 --> 00:20:25,710 Now, it's better than having a single server 437 00:20:25,710 --> 00:20:29,490 where, if that single server fails, now suddenly the entire web application 438 00:20:29,490 --> 00:20:30,390 doesn't work at all. 439 00:20:30,390 --> 00:20:33,240 That's what we generally call a single point of failure-- 440 00:20:33,240 --> 00:20:37,410 a single place where, if it fails, the entire system is going to be broken. 441 00:20:37,410 --> 00:20:39,720 One advantage of having multiple servers is 442 00:20:39,720 --> 00:20:43,530 that we no longer have a single server that acts as a point of failure. 443 00:20:43,530 --> 00:20:46,140 If one of the servers goes down then, ideally, 444 00:20:46,140 --> 00:20:49,780 our load balancer should be able to know, based on that information, 445 00:20:49,780 --> 00:20:53,370 to no longer send a request to that particular server-- to, 446 00:20:53,370 --> 00:20:58,470 instead, balance the load across the remaining three servers instead. 
447 00:20:58,470 --> 00:21:00,640 Now, there's an interesting question there as well, 448 00:21:00,640 --> 00:21:04,200 which is, how does the load balancer know that this server is 449 00:21:04,200 --> 00:21:05,450 no longer responding? 450 00:21:05,450 --> 00:21:07,200 For some reason, it has some sort of error 451 00:21:07,200 --> 00:21:09,763 that it's not able to process requests appropriately. 452 00:21:09,763 --> 00:21:11,680 Well, there are multiple ways you can do this. 453 00:21:11,680 --> 00:21:15,090 But one of the most common is what's simply known as a heartbeat where, 454 00:21:15,090 --> 00:21:18,240 effectively, every so often, every some number of seconds, 455 00:21:18,240 --> 00:21:20,700 the load balancer pings all of the servers-- 456 00:21:20,700 --> 00:21:23,280 just sends a quick request to all the servers. 457 00:21:23,280 --> 00:21:26,250 And all of the servers are supposed to respond back. 458 00:21:26,250 --> 00:21:29,010 And using that information, the load balancer 459 00:21:29,010 --> 00:21:31,920 knows a little bit about the latency of each of the servers-- 460 00:21:31,920 --> 00:21:34,920 how long it took for the server to respond to the request. 461 00:21:34,920 --> 00:21:37,440 But also, it can get information about whether or not 462 00:21:37,440 --> 00:21:39,450 the server is functioning properly. 463 00:21:39,450 --> 00:21:42,157 If one of the servers doesn't respond to the ping, 464 00:21:42,157 --> 00:21:44,490 well, then the load balancer knows that there's probably 465 00:21:44,490 --> 00:21:47,640 something wrong with the server, that we probably shouldn't be directing 466 00:21:47,640 --> 00:21:50,570 more users to that server at all. 
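The heartbeat idea can be sketched as a small simulation. The `LoadBalancer` class, its method names, and the injected `ping` function are all hypothetical. A real load balancer would send a network request with a timeout, and this sketch leaves out the latency measurement mentioned above.

```python
from typing import Callable, Dict, List


class LoadBalancer:
    def __init__(self, servers: List[str], ping: Callable[[str], bool]):
        self.servers = servers
        self.ping = ping  # hypothetical probe: True if the server answered
        self.healthy: Dict[str, bool] = {server: True for server in servers}
        self._next = 0

    def heartbeat(self) -> None:
        """Ping every server and record whether it responded."""
        for server in self.servers:
            try:
                self.healthy[server] = self.ping(server)
            except Exception:
                self.healthy[server] = False

    def pick_server(self) -> str:
        """Round-robin over the servers, skipping any marked unhealthy."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy servers available")
```

Calling `heartbeat()` every few seconds keeps the health table current, so a server that stops answering is skipped until it responds again.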
467 00:21:50,570 --> 00:21:53,730 And so this can solve for the problem of a single point of failure 468 00:21:53,730 --> 00:21:57,570 by allowing ourselves multiple servers where, if any one of the servers fails, 469 00:21:57,570 --> 00:22:00,450 the load balancer learns about that via heartbeat 470 00:22:00,450 --> 00:22:03,540 and then, based on that information, can begin to redirect traffic 471 00:22:03,540 --> 00:22:05,847 to the other servers instead. 472 00:22:05,847 --> 00:22:08,430 Now, one thing you might notice is that, even in this picture, 473 00:22:08,430 --> 00:22:11,970 now the load balancer appears to be like a single point of failure 474 00:22:11,970 --> 00:22:14,460 where, if the load balancer happens to fail, well, now 475 00:22:14,460 --> 00:22:16,668 nothing is going to work because the load balancer is 476 00:22:16,668 --> 00:22:18,810 the one responsible for directing traffic to all 477 00:22:18,810 --> 00:22:20,190 of the various different servers. 478 00:22:20,190 --> 00:22:23,790 And so even though there is no single server that is a point of failure, 479 00:22:23,790 --> 00:22:27,370 this load balancer also appears to be a single point of failure. 480 00:22:27,370 --> 00:22:28,540 And that's definitely true. 481 00:22:28,540 --> 00:22:31,470 And you might imagine instead having multiple load balancers 482 00:22:31,470 --> 00:22:35,310 where, if one load balancer goes down, another load balancer can swoop in, 483 00:22:35,310 --> 00:22:39,000 acting as a hot spare where it picks up all of the traffic that was originally 484 00:22:39,000 --> 00:22:40,650 going to the first load balancer. 485 00:22:40,650 --> 00:22:44,550 And if it ever goes down, a second one is ready to take its place. 486 00:22:44,550 --> 00:22:47,700 And it might also be doing this kind of heartbeat process-- checking up 487 00:22:47,700 --> 00:22:48,845 on the first load balancer.
488 00:22:48,845 --> 00:22:51,970 And if all goes well, the second load balancer doesn't have to do anything. 489 00:22:51,970 --> 00:22:54,490 But if the first load balancer ever were to fail, 490 00:22:54,490 --> 00:22:56,640 well, then the second load balancer can step in 491 00:22:56,640 --> 00:22:59,700 and begin servicing those requests, directing them to all 492 00:22:59,700 --> 00:23:01,840 of these individual servers as well. 493 00:23:01,840 --> 00:23:02,705 And so there, too-- 494 00:23:02,705 --> 00:23:05,580 another opportunity to think about where the single points of failure 495 00:23:05,580 --> 00:23:09,300 are and thinking about how we might address the single points of failure 496 00:23:09,300 --> 00:23:12,330 in order to make sure that our web applications are scalable. 497 00:23:12,330 --> 00:23:14,820 So that then deals with issues about how we might 498 00:23:14,820 --> 00:23:17,070 go about scaling up these servers. 499 00:23:17,070 --> 00:23:20,340 But ultimately, the servers are not the entirety of the story. 500 00:23:20,340 --> 00:23:22,350 Inside of our applications, we've mostly been 501 00:23:22,350 --> 00:23:25,918 writing web applications that interact and deal with data in some way. 502 00:23:25,918 --> 00:23:28,710 And there are multiple different databases that we've talked about. 503 00:23:28,710 --> 00:23:30,900 SQLite has been the default one that Django 504 00:23:30,900 --> 00:23:34,200 provides to us, which just stores data inside of a file.
505 00:23:34,200 --> 00:23:36,020 But as we begin to grow our applications, 506 00:23:36,020 --> 00:23:39,270 if we want to begin to scale them, it's quite popular and quite common 507 00:23:39,270 --> 00:23:41,530 to put databases entirely somewhere separate-- 508 00:23:41,530 --> 00:23:44,340 to have a separate database server running somewhere else where 509 00:23:44,340 --> 00:23:46,800 the servers are all communicating with that database, 510 00:23:46,800 --> 00:23:50,550 whether we're running MySQL, or Postgres, or some other database system 511 00:23:50,550 --> 00:23:51,750 instead. 512 00:23:51,750 --> 00:23:55,410 And all of the servers then have access to that database. 513 00:23:55,410 --> 00:23:57,990 And so there, too, are considerations that we 514 00:23:57,990 --> 00:24:00,420 need to take into account-- issues of how it is that we 515 00:24:00,420 --> 00:24:03,840 go about scaling up these databases. 516 00:24:03,840 --> 00:24:06,960 In this picture, for example, you might imagine a load balancer 517 00:24:06,960 --> 00:24:08,730 that is communicating with two servers. 518 00:24:08,730 --> 00:24:10,950 But both of those servers, for example, need 519 00:24:10,950 --> 00:24:13,200 to be communicating with this database. 520 00:24:13,200 --> 00:24:16,140 And much like any server can only handle some number of requests, 521 00:24:16,140 --> 00:24:19,380 some number of users at any given time, databases, too, 522 00:24:19,380 --> 00:24:23,280 can only handle some number of requests, some concurrent number of connections 523 00:24:23,280 --> 00:24:24,250 at any given time. 524 00:24:24,250 --> 00:24:26,130 And so we need to begin to think about issues 525 00:24:26,130 --> 00:24:30,120 of how it is that we scale these databases as well in order to be 526 00:24:30,120 --> 00:24:33,330 able to handle more and more users.
527 00:24:33,330 --> 00:24:35,580 Now, one approach, the first thing we might try to do, 528 00:24:35,580 --> 00:24:38,160 is something called database partitioning-- effectively, 529 00:24:38,160 --> 00:24:42,270 splitting up what is a big data set into multiple different parts 530 00:24:42,270 --> 00:24:43,470 to that data set. 531 00:24:43,470 --> 00:24:46,560 And we've already seen some examples of database partitioning. 532 00:24:46,560 --> 00:24:49,890 We've seen one example where-- for example, when we talked about SQL, 533 00:24:49,890 --> 00:24:53,130 we looked at a table of flights where each flight had an origin 534 00:24:53,130 --> 00:24:57,840 city, the origin city's airport code, the destination city, the destination 535 00:24:57,840 --> 00:25:00,120 city's airport code, and some number of minutes, 536 00:25:00,120 --> 00:25:02,850 the duration for that particular flight. 537 00:25:02,850 --> 00:25:05,820 And we decided that storing all of this data in a single table 538 00:25:05,820 --> 00:25:07,590 probably wasn't the best idea. 539 00:25:07,590 --> 00:25:10,170 And instead, we wanted to split that data up 540 00:25:10,170 --> 00:25:13,380 in a type of partitioning where, instead, we said, all right, let's just 541 00:25:13,380 --> 00:25:16,230 have one table that will have all of the airports. 542 00:25:16,230 --> 00:25:20,440 And so each airport gets its own row inside of this airports table. 543 00:25:20,440 --> 00:25:22,640 And we also had another table which was just 544 00:25:22,640 --> 00:25:26,270 the flights table which, rather than storing all of those columns, 545 00:25:26,270 --> 00:25:28,820 just mapped two airports to each other. 
546 00:25:28,820 --> 00:25:32,660 With any given flight, it has an origin ID, meaning which object, 547 00:25:32,660 --> 00:25:36,800 which row in the airports table represents the origin of the flight, 548 00:25:36,800 --> 00:25:39,680 and then which row in the airports table is 549 00:25:39,680 --> 00:25:42,860 going to represent the destination for that flight. 550 00:25:42,860 --> 00:25:45,530 So we took one table and effectively split it up 551 00:25:45,530 --> 00:25:49,940 into multiple tables, each of which ultimately had fewer columns. 552 00:25:49,940 --> 00:25:52,850 And this might be something we call the vertical partitioning 553 00:25:52,850 --> 00:25:56,810 of a database where, instead of just having single big long tables, 554 00:25:56,810 --> 00:25:59,420 we split them up into multiple tables, each 555 00:25:59,420 --> 00:26:01,820 of which has fewer columns that are able to represent 556 00:26:01,820 --> 00:26:03,497 data in a more relational way. 557 00:26:03,497 --> 00:26:05,330 And that's something we've seen before, too. 558 00:26:05,330 --> 00:26:07,460 But in addition to vertical partitioning, 559 00:26:07,460 --> 00:26:11,090 we can also do horizontal partitioning where the idea there 560 00:26:11,090 --> 00:26:13,340 is that we take a table and just split it up 561 00:26:13,340 --> 00:26:17,390 into multiple tables that are all storing effectively the same data, 562 00:26:17,390 --> 00:26:19,380 but split up into different data sets. 563 00:26:19,380 --> 00:26:22,520 So the same type of data, but just in different tables-- 564 00:26:22,520 --> 00:26:25,100 where we might have originally had a flights table, 565 00:26:25,100 --> 00:26:28,490 and instead we split it up into a domestic flights table 566 00:26:28,490 --> 00:26:30,380 and an international flights table. 567 00:26:30,380 --> 00:26:32,870 Each of these tables still has the exact same columns. 568 00:26:32,870 --> 00:26:34,555 They still have a destination column.
569 00:26:34,555 --> 00:26:35,930 They still have an origin column. 570 00:26:35,930 --> 00:26:38,250 They still have a duration column, for example. 571 00:26:38,250 --> 00:26:41,210 But we've just now taken the data that used to be in one table 572 00:26:41,210 --> 00:26:46,040 and split up that data into two or more different tables instead-- 573 00:26:46,040 --> 00:26:49,940 one for all the domestic flights, one for all the international flights. 574 00:26:49,940 --> 00:26:52,370 And the advantage there is that we no longer 575 00:26:52,370 --> 00:26:55,760 need to search through the entirety of the data set if we're just looking 576 00:26:55,760 --> 00:26:57,780 for one domestic flight, for example. 577 00:26:57,780 --> 00:27:00,680 If you know the flight you're looking for is a domestic flight, 578 00:27:00,680 --> 00:27:04,820 well, then it can be more efficient to just search the domestic flights table 579 00:27:04,820 --> 00:27:08,270 and not bother searching through the international flights table. 580 00:27:08,270 --> 00:27:11,300 And so if we're intelligent about how we choose to take a table 581 00:27:11,300 --> 00:27:14,540 and split it up into multiple different tables, the effect of that 582 00:27:14,540 --> 00:27:16,880 is that we can often improve the efficiency 583 00:27:16,880 --> 00:27:19,190 of our searches, the efficiency of our operations, 584 00:27:19,190 --> 00:27:21,830 because we're dealing with multiple smaller tables 585 00:27:21,830 --> 00:27:24,320 where these operations can come faster. 586 00:27:24,320 --> 00:27:27,350 One drawback though is that, as we begin to split data 587 00:27:27,350 --> 00:27:31,250 across multiple different tables, it becomes more expensive if ever we 588 00:27:31,250 --> 00:27:33,980 need to join this data back together and connect 589 00:27:33,980 --> 00:27:36,290 all the domestic and international flights, running 590 00:27:36,290 --> 00:27:37,790 separate queries on each.
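Horizontal partitioning and its trade-off can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table names `flights_domestic` and `flights_international` and the helper functions are assumptions for illustration. The single-partition query touches one table, while the combined query has to visit both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both partitions share exactly the same columns.
for table in ("flights_domestic", "flights_international"):
    conn.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, origin TEXT, destination TEXT, duration INTEGER)"
    )


def add_flight(origin, destination, duration, international):
    """Route each new row to the partition it belongs in."""
    table = "flights_international" if international else "flights_domestic"
    conn.execute(
        f"INSERT INTO {table} (origin, destination, duration) VALUES (?, ?, ?)",
        (origin, destination, duration),
    )


def domestic_flights_from(origin):
    """Cheap case: only the domestic partition is searched."""
    query = ("SELECT origin, destination, duration FROM flights_domestic "
             "WHERE origin = ? ORDER BY destination")
    return conn.execute(query, (origin,)).fetchall()


def all_flights_from(origin):
    """Costlier case: both partitions are queried and the results combined."""
    query = ("SELECT origin, destination, duration FROM flights_domestic "
             "WHERE origin = ? "
             "UNION ALL "
             "SELECT origin, destination, duration FROM flights_international "
             "WHERE origin = ? ORDER BY destination")
    return conn.execute(query, (origin, origin)).fetchall()
```

The more often the application needs `all_flights_from` rather than one partition at a time, the less this particular split pays off.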
591 00:27:37,790 --> 00:27:40,010 And so in that case, we'll want to think about trying 592 00:27:40,010 --> 00:27:42,710 to separate our data in such a way that, generally, we're 593 00:27:42,710 --> 00:27:46,750 only going to need to deal with one table or the other at any given time. 594 00:27:46,750 --> 00:27:49,280 And so domestic and international might be a reasonable way 595 00:27:49,280 --> 00:27:52,970 to split up our flights table because maybe, most of the time, our airport 596 00:27:52,970 --> 00:27:54,860 just cares about searching domestic flights 597 00:27:54,860 --> 00:27:56,630 if we know we're looking for one kind of flight, 598 00:27:56,630 --> 00:27:59,030 or just cares about searching for international flights 599 00:27:59,030 --> 00:28:01,405 if there are different people or different computers that 600 00:28:01,405 --> 00:28:05,090 are going to handle each of those different types of systems. 601 00:28:05,090 --> 00:28:08,630 And so partitioning our database can sometimes help with issues of scale 602 00:28:08,630 --> 00:28:11,480 by making it faster to search through large amounts of data 603 00:28:11,480 --> 00:28:14,480 and being able to represent data a little bit more cleanly. 604 00:28:14,480 --> 00:28:17,840 But it still seems to represent a single point of failure-- 605 00:28:17,840 --> 00:28:22,850 that we have multiple servers now that are all connected to the same database. 606 00:28:22,850 --> 00:28:24,890 And there, again, is a single point of failure. 607 00:28:24,890 --> 00:28:27,353 If the database fails for some reason, well now, 608 00:28:27,353 --> 00:28:29,270 suddenly, none of our web application is going 609 00:28:29,270 --> 00:28:31,940 to work because all of those servers are all 610 00:28:31,940 --> 00:28:35,180 connected to that exact same database. 
611 00:28:35,180 --> 00:28:36,980 And so it's for that reason that we might-- 612 00:28:36,980 --> 00:28:39,230 just as we tried to add more servers in order 613 00:28:39,230 --> 00:28:42,530 to solve the problem of a single point of failure with our servers, 614 00:28:42,530 --> 00:28:45,410 we might also try database replication. 615 00:28:45,410 --> 00:28:48,860 Rather than just have a single database in our web application, 616 00:28:48,860 --> 00:28:50,870 in order to guard against potential failure, 617 00:28:50,870 --> 00:28:54,410 we might replicate our database-- have multiple different databases 618 00:28:54,410 --> 00:28:59,297 and, therefore, reduce the likelihood that our application entirely fails. 619 00:28:59,297 --> 00:29:01,130 And there are a couple of approaches that we 620 00:29:01,130 --> 00:29:03,020 can use for database replication. 621 00:29:03,020 --> 00:29:06,800 Two of the most common are what are known as single-primary replication 622 00:29:06,800 --> 00:29:09,190 and multi-primary replication. 623 00:29:09,190 --> 00:29:11,760 And in single-primary database replication, 624 00:29:11,760 --> 00:29:14,040 we have multiple different databases. 625 00:29:14,040 --> 00:29:17,930 But one of those databases is considered to be the primary database. 626 00:29:17,930 --> 00:29:20,510 And what we mean by a primary database is a database 627 00:29:20,510 --> 00:29:22,310 to which we can both read data-- 628 00:29:22,310 --> 00:29:24,560 meaning select rows from the table-- 629 00:29:24,560 --> 00:29:27,350 but also write data, meaning insert rows, 630 00:29:27,350 --> 00:29:31,200 or update rows, or delete rows to any of those tables. 631 00:29:31,200 --> 00:29:34,070 So in single-primary replication, we have a single database 632 00:29:34,070 --> 00:29:36,260 where we can both read and write. 
633 00:29:36,260 --> 00:29:38,680 And we have some number of other databases-- in this case, 634 00:29:38,680 --> 00:29:40,100 two other databases-- 635 00:29:40,100 --> 00:29:41,900 from which we can only read data. 636 00:29:41,900 --> 00:29:44,220 So we can get data from those databases. 637 00:29:44,220 --> 00:29:48,560 But we can't update, or insert, or delete from those databases. 638 00:29:48,560 --> 00:29:52,490 And now we need some mechanism to make sure that all of these databases 639 00:29:52,490 --> 00:29:53,750 are kept in sync. 640 00:29:53,750 --> 00:29:57,620 And ultimately, what that means is that, any time the database changes, 641 00:29:57,620 --> 00:29:59,660 all of the databases are informed. 642 00:29:59,660 --> 00:30:02,390 Now, the only database that can change is our primary one. 643 00:30:02,390 --> 00:30:04,250 This is the only one that can be written to, 644 00:30:04,250 --> 00:30:06,740 the only one that allows for the data to change. 645 00:30:06,740 --> 00:30:08,180 The others are read only. 646 00:30:08,180 --> 00:30:12,170 So anytime this primary database updates or changes in some way, 647 00:30:12,170 --> 00:30:16,540 it needs to inform the other databases of that update. 648 00:30:16,540 --> 00:30:18,920 And so it informs the other databases of that update. 649 00:30:18,920 --> 00:30:21,230 And now all of the databases are kept in sync 650 00:30:21,230 --> 00:30:23,960 where, if you try and run a query on any of these databases 651 00:30:23,960 --> 00:30:25,910 to select and get some information, you'll 652 00:30:25,910 --> 00:30:30,440 get the same results from all of these various different databases. 653 00:30:30,440 --> 00:30:32,990 Now, the single-primary approach has some drawbacks. 654 00:30:32,990 --> 00:30:36,950 It has the drawback of only one of these databases can be written to. 
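Single-primary replication can be modeled in a few lines, with plain dictionaries standing in for databases. The class and method names are hypothetical, and the replicas are updated synchronously here, which is a simplification. Real systems usually propagate changes with some delay.

```python
import random


class SinglePrimaryCluster:
    def __init__(self, replica_count: int = 2):
        self.primary = {}                                   # readable and writable
        self.replicas = [{} for _ in range(replica_count)]  # read-only copies

    def write(self, key, value):
        """All writes go to the primary, which then informs every replica."""
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        """Reads can be served by any database, since all are kept in sync."""
        database = random.choice([self.primary, *self.replicas])
        return database.get(key)
```

Reads spread across three databases in this model, but every write still lands on the one primary.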
655 00:30:36,950 --> 00:30:38,750 So if you have a lot of users that are all 656 00:30:38,750 --> 00:30:42,550 trying to write data to the database at the exact same time, 657 00:30:42,550 --> 00:30:44,360 well, there might be some issues here where 658 00:30:44,360 --> 00:30:46,370 this one database is going to be carrying 659 00:30:46,370 --> 00:30:49,100 all of that load for all of the people that might be trying 660 00:30:49,100 --> 00:30:51,860 to update and change that database. 661 00:30:51,860 --> 00:30:54,140 And it also has a slightly smaller version 662 00:30:54,140 --> 00:30:57,140 of the same problem of a single point of failure. 663 00:30:57,140 --> 00:31:00,770 There is no longer a single point of failure for reading from that data. 664 00:31:00,770 --> 00:31:03,750 If you want to read from the data, and one of the databases goes out, 665 00:31:03,750 --> 00:31:07,340 you can read data from any of the other databases, and they'll work just fine. 666 00:31:07,340 --> 00:31:10,670 But it does have the drawback that, if this database fails, 667 00:31:10,670 --> 00:31:13,040 if our primary database fails, well, then 668 00:31:13,040 --> 00:31:14,750 we're no longer able to write data. 669 00:31:14,750 --> 00:31:17,150 If we want to update data inside of our database, 670 00:31:17,150 --> 00:31:19,910 this one database is no longer going to be operational. 671 00:31:19,910 --> 00:31:24,673 And none of the other databases are going to allow us to write new changes. 672 00:31:24,673 --> 00:31:27,840 So there are a couple of approaches we can use to try to solve this problem. 673 00:31:27,840 --> 00:31:31,145 One approach though is, instead of having a single-primary database-- 674 00:31:31,145 --> 00:31:33,950 a single database to which we can read and write-- 675 00:31:33,950 --> 00:31:36,610 to use a multi-primary approach. 
676 00:31:36,610 --> 00:31:40,160 And in the multi-primary approach, we have multiple databases, all of which 677 00:31:40,160 --> 00:31:41,810 we can read and write to. 678 00:31:41,810 --> 00:31:44,230 We can select rows from all the databases. 679 00:31:44,230 --> 00:31:48,780 And we can insert, update, and delete rows in all of these databases as well. 680 00:31:48,780 --> 00:31:52,050 But now the synchronization process becomes a little bit trickier. 681 00:31:52,050 --> 00:31:54,050 And here, now, is the trade off-- that now we've 682 00:31:54,050 --> 00:31:55,850 replicated the number of reads and writes 683 00:31:55,850 --> 00:31:59,870 we can do by having many databases to which we can read data and write data. 684 00:31:59,870 --> 00:32:02,870 But anytime any of these databases changes, 685 00:32:02,870 --> 00:32:07,695 every database needs to inform all of the other databases of those updates. 686 00:32:07,695 --> 00:32:10,070 And that's, certainly, going to take some amount of time. 687 00:32:10,070 --> 00:32:13,160 It introduces some complexity into our system as well. 688 00:32:13,160 --> 00:32:16,550 And it also introduces the possibility for conflicts. 689 00:32:16,550 --> 00:32:19,550 You might imagine situations where, if two people are editing 690 00:32:19,550 --> 00:32:21,830 similar data at the same time, you might run 691 00:32:21,830 --> 00:32:24,080 into a number of different types of conflicts. 692 00:32:24,080 --> 00:32:27,560 So one type of conflict, for example, would be an update conflict. 693 00:32:27,560 --> 00:32:30,170 If I tried to edit one row in one database, 694 00:32:30,170 --> 00:32:34,040 and someone else tries to edit the same row in another database, when they sync 695 00:32:34,040 --> 00:32:36,230 up with each other via this update process, 696 00:32:36,230 --> 00:32:38,600 our database system needs some way to decide 697 00:32:38,600 --> 00:32:42,200 how it's going to resolve those various different updates.
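The lecture does not name a specific resolution strategy, so the sketch below shows just one possible policy, often called "last write wins": keep whichever edit has the later timestamp. The `Version` type and function name are hypothetical, and the policy silently discards the losing edit, which may or may not be acceptable for a given application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str        # the row's new contents
    timestamp: float  # when the write was accepted
    node: str         # which database accepted it


def resolve_update_conflict(a: Version, b: Version) -> Version:
    """Last write wins; ties are broken by node name so that every
    database independently picks the same winner."""
    return max(a, b, key=lambda version: (version.timestamp, version.node))
```

Because the rule is deterministic and does not depend on argument order, two databases comparing the same pair of edits arrive at the same row.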
698 00:32:42,200 --> 00:32:44,880 Another conflict might be a uniqueness conflict. 699 00:32:44,880 --> 00:32:46,907 We've seen, in the case of databases in SQL 700 00:32:46,907 --> 00:32:48,740 that, when we're designing our tables, I can 701 00:32:48,740 --> 00:32:51,980 specify that this particular field should be a unique field-- 702 00:32:51,980 --> 00:32:56,030 common one being the ID field, for example, where every single row is 703 00:32:56,030 --> 00:32:58,100 going to have its own unique ID. 704 00:32:58,100 --> 00:33:01,670 Well, what happens if two people try to insert data at the same time 705 00:33:01,670 --> 00:33:03,350 into two different databases? 706 00:33:03,350 --> 00:33:07,610 They're each given a unique ID, but it's the same ID on both of the databases, 707 00:33:07,610 --> 00:33:11,240 because neither database knows that the other database has added a new row yet. 708 00:33:11,240 --> 00:33:14,540 So when they sync back up, we might run into a uniqueness conflict 709 00:33:14,540 --> 00:33:18,290 where two different databases have assigned the same exact ID 710 00:33:18,290 --> 00:33:19,730 to multiple different entries. 711 00:33:19,730 --> 00:33:23,117 So we need some way to be able to resolve those conflicts as well. 712 00:33:23,117 --> 00:33:24,950 And there are many other conflicts you might 713 00:33:24,950 --> 00:33:28,340 imagine trying to deal with-- one example being, for instance, delete 714 00:33:28,340 --> 00:33:31,430 conflicts, where one person tries to delete a row 715 00:33:31,430 --> 00:33:33,710 and another person tries to update that row. 716 00:33:33,710 --> 00:33:35,278 Well, which should take precedence? 717 00:33:35,278 --> 00:33:36,320 Should we update the row? 718 00:33:36,320 --> 00:33:37,610 Should we delete the row?
719 00:33:37,610 --> 00:33:41,450 We need some way to be able to make those decisions because there 720 00:33:41,450 --> 00:33:45,150 is some latency between when a change is made to a database 721 00:33:45,150 --> 00:33:48,600 and when that database is able to communicate with another database. 722 00:33:48,600 --> 00:33:51,290 So these issues of scale, these issues of synchronization 723 00:33:51,290 --> 00:33:53,330 are always going to come up as we start to deal 724 00:33:53,330 --> 00:33:56,970 with programs that are interacting with more and more of this kind of data. 725 00:33:56,970 --> 00:33:59,810 And as a result, we need to design more and more sophisticated 726 00:33:59,810 --> 00:34:04,040 systems that are able to deal with those issues of scale. 727 00:34:04,040 --> 00:34:09,139 Now, ultimately, we'd ideally like to reduce the number of different database 728 00:34:09,139 --> 00:34:10,130 servers that we have. 729 00:34:10,130 --> 00:34:12,692 Every additional database server is going to cost time. 730 00:34:12,692 --> 00:34:13,900 It's going to cost resources. 731 00:34:13,900 --> 00:34:17,060 It costs money in terms of keeping all of these servers running. 732 00:34:17,060 --> 00:34:20,960 And so, ideally, we'd like not to have to talk to this database 733 00:34:20,960 --> 00:34:22,590 if we don't need to. 734 00:34:22,590 --> 00:34:26,360 So you might imagine, for example, a news organization's website, something 735 00:34:26,360 --> 00:34:28,275 like the front page of the New York Times. 736 00:34:28,275 --> 00:34:30,650 If you go to the home page of the New York Times website, 737 00:34:30,650 --> 00:34:33,230 it displays all of the day's headlines with images 738 00:34:33,230 --> 00:34:36,860 and with information about what each of the stories are about, for example. 
739 00:34:36,860 --> 00:34:39,983 And you might imagine that the way they're doing something like this 740 00:34:39,983 --> 00:34:41,900 is that they have some kind of database that's 741 00:34:41,900 --> 00:34:43,670 storing all of these news articles. 742 00:34:43,670 --> 00:34:46,040 And when you visit the front page of the New York Times, 743 00:34:46,040 --> 00:34:48,290 it's going to do some kind of database query-- 744 00:34:48,290 --> 00:34:51,500 selecting all of the recent top headlines, for example-- 745 00:34:51,500 --> 00:34:56,460 and rendering all of that information in an HTML page that you can see. 746 00:34:56,460 --> 00:34:57,930 And that would certainly work. 747 00:34:57,930 --> 00:35:00,440 But if a lot of people are all requesting the front page 748 00:35:00,440 --> 00:35:04,670 at the same time, well, it probably doesn't make all that much sense 749 00:35:04,670 --> 00:35:08,390 if the web application, every time, is making a database query, getting 750 00:35:08,390 --> 00:35:13,040 the latest articles, and then displaying that information to all of the users 751 00:35:13,040 --> 00:35:16,130 because the articles might not be changing all that frequently. 752 00:35:16,130 --> 00:35:18,440 If one person makes a request one second, 753 00:35:18,440 --> 00:35:21,710 and another person makes the same request half a second later, 754 00:35:21,710 --> 00:35:26,150 it probably is not going to be useful to re-request all of the information 755 00:35:26,150 --> 00:35:29,450 from the database, regenerate that template yet again, because it's 756 00:35:29,450 --> 00:35:33,050 an expensive process of requesting data from the database, of generating 757 00:35:33,050 --> 00:35:33,800 that template. 758 00:35:33,800 --> 00:35:36,710 We'd, ideally, like some way of dealing with that problem. 759 00:35:36,710 --> 00:35:40,040 And the way we can deal with that problem is some form of caching. 
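As a rough illustration of the caching idea, the sketch below keeps the result of an expensive operation for a fixed number of seconds and reuses it for every request in that window. The `TimedCache` class and its parameters are hypothetical and not tied to any framework. The `compute` argument stands in for the database query and template rendering.

```python
import time


class TimedCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the behavior is easy to test
        self._entries = {}   # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        """Return the saved value if it is still fresh; otherwise call
        compute(), save the result, and return it."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = compute()
        self._entries[key] = (now, value)
        return value
```

With a 60-second lifetime, two requests half a second apart trigger only one database query. The cost is that readers may see a front page that is up to a minute old.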
760 00:35:40,040 --> 00:35:44,300 And caching refers to a whole bunch of different types of ideas and tools 761 00:35:44,300 --> 00:35:47,660 that we can use at various different places inside of our system. 762 00:35:47,660 --> 00:35:50,390 But in general, when we're talking about caching, 763 00:35:50,390 --> 00:35:54,680 we're talking about storing a saved version of some information in a way 764 00:35:54,680 --> 00:35:58,340 that we can access it more quickly so that we don't need to continue making 765 00:35:58,340 --> 00:36:00,720 requests to a database, for example. 766 00:36:00,720 --> 00:36:02,930 And so there are a number of ways we can do caching. 767 00:36:02,930 --> 00:36:07,010 One way we can do caching is on the client side via client-side caching 768 00:36:07,010 --> 00:36:08,850 where the idea is that your browser-- 769 00:36:08,850 --> 00:36:11,030 whether it's Safari, or Chrome, or something else-- 770 00:36:11,030 --> 00:36:13,700 is able to cache data, store information, 771 00:36:13,700 --> 00:36:17,070 so that the browser doesn't need to re-request the same information 772 00:36:17,070 --> 00:36:19,050 the next time it visits the page. 773 00:36:19,050 --> 00:36:21,680 For example, if you request a page and it loads an image-- 774 00:36:21,680 --> 00:36:23,210 on the page, for example-- 775 00:36:23,210 --> 00:36:25,850 and you reload the page, well, your web browser 776 00:36:25,850 --> 00:36:28,760 might try and make a request again for the exact same image 777 00:36:28,760 --> 00:36:30,020 and then display it to you. 
778 00:36:30,020 --> 00:36:33,500 But an alternative might be that your web browser could just 779 00:36:33,500 --> 00:36:35,960 save a copy of the image inside of a cache 780 00:36:35,960 --> 00:36:40,280 to locally store a version of the image so that, the next time 781 00:36:40,280 --> 00:36:42,860 that the user makes a request to the website, the user 782 00:36:42,860 --> 00:36:45,410 doesn't need to reload that entire image. 783 00:36:45,410 --> 00:36:48,650 And that might be true of entire web pages and web resources-- 784 00:36:48,650 --> 00:36:51,770 that if there is some page that doesn't change very often then, 785 00:36:51,770 --> 00:36:55,850 if the web browser just stores a cached, a saved version of that page, 786 00:36:55,850 --> 00:36:58,340 then the next time the user goes to their web browser, 787 00:36:58,340 --> 00:37:03,020 tries to access that page, rather than re-request to the server and make a new 788 00:37:03,020 --> 00:37:06,440 request that the server needs to respond to, if the browser has that page 789 00:37:06,440 --> 00:37:09,530 cached, the browser can just display the cached-- 790 00:37:09,530 --> 00:37:13,830 saved-- version of the page, saving the need to talk to the server at all. 791 00:37:13,830 --> 00:37:16,970 So this can certainly help to reduce the load on any given server. 792 00:37:16,970 --> 00:37:20,360 If users are caching information inside of the web browser, 793 00:37:20,360 --> 00:37:22,480 it makes the experience faster for the user 794 00:37:22,480 --> 00:37:24,980 because they can see the information immediately rather than 795 00:37:24,980 --> 00:37:28,070 need to make a request and wait for a response to come back. 796 00:37:28,070 --> 00:37:30,140 And it's good for the server because the server 797 00:37:30,140 --> 00:37:33,740 doesn't need to be dealing with as many requests if some of those requests 798 00:37:33,740 --> 00:37:35,160 are getting cached. 
799 00:37:35,160 --> 00:37:37,400 And so one approach to trying to do this is 800 00:37:37,400 --> 00:37:42,290 by adding this inside of the headers of an HTTP response. 801 00:37:42,290 --> 00:37:44,960 When your web server responds to some request, 802 00:37:44,960 --> 00:37:48,770 the web server can include a line like this inside of the response-- 803 00:37:48,770 --> 00:37:53,210 something like Cache-Control: max-age=86400-- 804 00:37:53,210 --> 00:37:56,330 in effect, specifying the number of seconds 805 00:37:56,330 --> 00:37:58,850 that you should cache this resource for. 806 00:37:58,850 --> 00:38:02,510 If I try to access this page 10 seconds later, 807 00:38:02,510 --> 00:38:04,910 well, that's less than 86,400. 808 00:38:04,910 --> 00:38:08,600 So rather than reload and re-request the entire page, 809 00:38:08,600 --> 00:38:11,390 we're just going to use the version of the page that happens 810 00:38:11,390 --> 00:38:13,750 to be cached inside of the web browser. 811 00:38:13,750 --> 00:38:16,250 And so this has several advantages, that we've talked about, 812 00:38:16,250 --> 00:38:19,640 in terms of reducing the amount of time it takes to see the content of the page 813 00:38:19,640 --> 00:38:23,570 because it's already saved and reducing the load on any particular server. 814 00:38:23,570 --> 00:38:25,040 But it also has drawbacks. 815 00:38:25,040 --> 00:38:29,180 If, for example, the resource changes within this amount of time-- 816 00:38:29,180 --> 00:38:32,240 maybe in 60 seconds, the page has changed-- 817 00:38:32,240 --> 00:38:35,120 if I try and load the page again, well, then 818 00:38:35,120 --> 00:38:37,400 if it's loading the cached version of the page, 819 00:38:37,400 --> 00:38:40,400 I might be seeing an outdated version of a web page.
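To make the max-age rule concrete, here is a minimal Python sketch of the decision a browser makes before reusing a cached copy. The function names `parse_max_age` and `is_fresh` are made up for this illustration, and a real browser considers many more directives than this one, so treat it as a simplification rather than how any particular browser works.

```python
import re


def parse_max_age(cache_control):
    """Return the max-age value (in seconds) from a Cache-Control header, or None."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None


def is_fresh(cache_control, age_seconds):
    """Decide whether a cached copy that is age_seconds old can still be reused."""
    max_age = parse_max_age(cache_control)
    # With no max-age present, this sketch plays it safe and re-requests.
    return max_age is not None and age_seconds < max_age


header = "Cache-Control: max-age=86400"
print(is_fresh(header, 10))     # 10 seconds later: reuse the cached copy
print(is_fresh(header, 90000))  # more than a day later: ask the server again
```

The drawback described above shows up here too: if the page changes after 60 seconds, `is_fresh` still answers yes until the full 86,400 seconds have passed.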
820 00:38:40,400 --> 00:38:42,470 I'm seeing an older version of the web page 821 00:38:42,470 --> 00:38:45,320 because my web browser just so happens to have 822 00:38:45,320 --> 00:38:47,570 that particular resource cached. 823 00:38:47,570 --> 00:38:49,610 And this might be true of a web page. 824 00:38:49,610 --> 00:38:53,630 It's especially true of other static resources, things like CSS files 825 00:38:53,630 --> 00:38:54,760 or JavaScript files. 826 00:38:54,760 --> 00:38:58,860 The CSS of a web page probably doesn't change all that often. 827 00:38:58,860 --> 00:39:02,120 And so, as a result, it's pretty natural that your web browser-- 828 00:39:02,120 --> 00:39:05,870 rather than request the exact same CSS files again, and again, and again-- 829 00:39:05,870 --> 00:39:08,650 might just save a copy of those CSS files, 830 00:39:08,650 --> 00:39:12,380 cache them, such that it's able to just reuse the cached version. 831 00:39:12,380 --> 00:39:14,690 But if the website were to update their CSS, 832 00:39:14,690 --> 00:39:16,355 you might not see the latest changes. 833 00:39:16,355 --> 00:39:18,230 And you might have experienced this yourself. 834 00:39:18,230 --> 00:39:21,410 If you're working on your own web applications, when you change your CSS 835 00:39:21,410 --> 00:39:23,270 and refresh the page, you might not always 836 00:39:23,270 --> 00:39:27,900 see those changes reflected if your web browser is caching those results. 837 00:39:27,900 --> 00:39:30,710 And so, in most web browsers, you can do a hard refresh 838 00:39:30,710 --> 00:39:33,740 to say, ignore whatever is in the cache, and actually go out 839 00:39:33,740 --> 00:39:36,030 and make a new request and get some new data. 
840 00:39:36,030 --> 00:39:38,810 But ultimately, if you don't do that, you're 841 00:39:38,810 --> 00:39:42,230 subject to this cache control where the web browser is going to say, 842 00:39:42,230 --> 00:39:44,750 unless this number of seconds has elapsed, 843 00:39:44,750 --> 00:39:48,500 we're going to reuse the existing version of the page. 844 00:39:48,500 --> 00:39:51,590 And so an alternative to this approach-- and this approach certainly works 845 00:39:51,590 --> 00:39:52,670 and is quite popular-- 846 00:39:52,670 --> 00:39:56,950 we can add to this approach by adding what's known as ETag. 847 00:39:56,950 --> 00:40:00,290 An ETag for a resource-- like a CSS file, or an image, 848 00:40:00,290 --> 00:40:01,590 or a JavaScript file-- 849 00:40:01,590 --> 00:40:04,190 is just some unique sequence of characters 850 00:40:04,190 --> 00:40:07,610 that identifies a particular version of a resource, 851 00:40:07,610 --> 00:40:11,300 that identifies a particular version of a CSS file or a JavaScript file, 852 00:40:11,300 --> 00:40:12,930 for example. 853 00:40:12,930 --> 00:40:14,840 And what this allows a program to do-- 854 00:40:14,840 --> 00:40:16,010 like a web browser-- 855 00:40:16,010 --> 00:40:18,230 is that, when a web browser requests a resource-- 856 00:40:18,230 --> 00:40:21,410 makes a request for a CSS file or a JavaScript file-- 857 00:40:21,410 --> 00:40:22,370 they get it back. 858 00:40:22,370 --> 00:40:25,760 And they get its associated ETag value, so I 859 00:40:25,760 --> 00:40:28,310 know that this is the value that is associated 860 00:40:28,310 --> 00:40:31,040 with this version of the CSS file. 861 00:40:31,040 --> 00:40:35,720 And if the web server were ever to change that CSS file, replace it 862 00:40:35,720 --> 00:40:41,820 with a new updated CSS file, the corresponding ETag will also change. 863 00:40:41,820 --> 00:40:43,650 So why is this helpful? 
864 00:40:43,650 --> 00:40:46,730 Well, it means that if I am trying to decide, should I 865 00:40:46,730 --> 00:40:50,070 load a new version of the resource or not, 866 00:40:50,070 --> 00:40:53,510 should I try and make another request to get the latest version of the CSS, 867 00:40:53,510 --> 00:40:55,970 what I can do first is just ask for, what 868 00:40:55,970 --> 00:40:59,660 is the ETag value, the short sequence that can be answered very quickly? 869 00:40:59,660 --> 00:41:02,090 Very quickly, we can just respond and say, 870 00:41:02,090 --> 00:41:05,360 you know what, if the ETag value is the same as what I remembered 871 00:41:05,360 --> 00:41:07,850 from last time, well, then I don't need to get 872 00:41:07,850 --> 00:41:10,340 a whole new version of that resource. 873 00:41:10,340 --> 00:41:13,070 And so this is quite common, too, that a web browser will say, 874 00:41:13,070 --> 00:41:15,110 hey, let me request this resource. 875 00:41:15,110 --> 00:41:19,200 But I already have a version of the resource with this particular ETag. 876 00:41:19,200 --> 00:41:24,110 So if that ETag is still the ETag for the most recent version of a particular 877 00:41:24,110 --> 00:41:26,450 resource-- like a CSS or JavaScript file-- 878 00:41:26,450 --> 00:41:30,650 then no need for the web server to send a new version of that file. 879 00:41:30,650 --> 00:41:33,650 Just go ahead and respond and say, the version you have-- that one 880 00:41:33,650 --> 00:41:34,920 works-- totally fine. 881 00:41:34,920 --> 00:41:38,280 But if there is a new version, well, then the web server can respond with 882 00:41:38,280 --> 00:41:41,130 the new asset-- the new CSS file, for example-- 883 00:41:41,130 --> 00:41:43,430 but also the new ETag value. 884 00:41:43,430 --> 00:41:46,160 So these two approaches can work in concert with each other. 
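Here is one way that exchange might look in Python. In HTTP, the browser sends the ETag it remembers in an If-None-Match request header, and the server can answer with status 304 (Not Modified) and an empty body. How a server computes the ETag is up to the server; hashing the file contents, as `make_etag` does below, is just an assumption for this sketch, and `respond` is a made-up helper rather than part of any framework.

```python
import hashlib


def make_etag(content):
    """Derive a short identifier for this exact version of the content."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def respond(content, if_none_match=None):
    """Return (status, headers, body) for a request for this resource."""
    etag = make_etag(content)
    if if_none_match == etag:
        # The browser already has this version: send no body at all.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, content


css_v1 = b"body { color: black; }"
css_v2 = b"body { color: navy; }"

# First visit: the full file comes back along with its ETag.
status, headers, body = respond(css_v1)
remembered = headers["ETag"]

# Later visit, file unchanged: the server answers 304 with an empty body.
print(respond(css_v1, if_none_match=remembered)[0])

# The file was updated: the server sends the new file and a new ETag.
print(respond(css_v2, if_none_match=remembered)[0])
```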
885 00:41:46,160 --> 00:41:49,220 You can say, go ahead and cache this for some number of seconds 886 00:41:49,220 --> 00:41:51,020 so that, for some number of seconds, you're 887 00:41:51,020 --> 00:41:54,680 not going to ever request a new version of that resource. 888 00:41:54,680 --> 00:41:57,710 But even if you do ask for a new version of the resource 889 00:41:57,710 --> 00:41:59,900 after this number of seconds has elapsed, 890 00:41:59,900 --> 00:42:02,390 if the ETag value hasn't updated, then no 891 00:42:02,390 --> 00:42:06,090 need to redownload a whole new version of a particular file. 892 00:42:06,090 --> 00:42:08,750 You can just reuse the version that happens 893 00:42:08,750 --> 00:42:10,890 to be cached already in the browser. 894 00:42:10,890 --> 00:42:14,270 So caching in the browser can be an incredibly powerful tool 895 00:42:14,270 --> 00:42:17,000 for trying to speed up these requests, for trying to reduce 896 00:42:17,000 --> 00:42:19,070 the load on any particular server. 897 00:42:19,070 --> 00:42:21,290 But the client side is not the only place 898 00:42:21,290 --> 00:42:23,510 where we can begin to do this kind of caching. 899 00:42:23,510 --> 00:42:26,330 We also have the ability to do server-side caching. 900 00:42:26,330 --> 00:42:30,560 And in server-side caching, we're going to introduce to our picture the notion 901 00:42:30,560 --> 00:42:31,940 of a cache-- 902 00:42:31,940 --> 00:42:34,160 that we have these multiple servers that are all 903 00:42:34,160 --> 00:42:35,720 communicating with the database. 904 00:42:35,720 --> 00:42:38,300 But these servers can also communicate with a cache-- 905 00:42:38,300 --> 00:42:41,360 someplace where we've stored information that we 906 00:42:41,360 --> 00:42:46,340 might want to reuse later rather than have to do all of that recalculation. 
907 00:42:46,340 --> 00:42:49,280 And Django, it turns out, has an entire cache framework, 908 00:42:49,280 --> 00:42:51,530 a whole host of features that Django offers 909 00:42:51,530 --> 00:42:54,860 that allow us to leverage this ability to use the cache 910 00:42:54,860 --> 00:42:56,470 to be able to speed up requests. 911 00:42:56,470 --> 00:42:59,150 So there are per-view caches where you can 912 00:42:59,150 --> 00:43:02,720 specify a cache on a particular view to say that, rather than run 913 00:43:02,720 --> 00:43:05,540 through all this Python code every time someone makes 914 00:43:05,540 --> 00:43:09,410 a request to this particular view, instead, 915 00:43:09,410 --> 00:43:14,150 just cache the view so that, for the next 30 seconds or 30 minutes, 916 00:43:14,150 --> 00:43:16,940 the next time someone tries to visit the same view, 917 00:43:16,940 --> 00:43:19,910 go ahead and just reuse the results of the last time 918 00:43:19,910 --> 00:43:21,665 that that view was loaded. 919 00:43:21,665 --> 00:43:23,540 And this can work not just for a single view. 920 00:43:23,540 --> 00:43:25,657 It can work for fragments inside of a template. 921 00:43:25,657 --> 00:43:27,740 Your template might have multiple different parts. 922 00:43:27,740 --> 00:43:31,190 On your web page, you might render the navigation bar, and the sidebar, 923 00:43:31,190 --> 00:43:33,800 and the footer, maybe based on information about today 924 00:43:33,800 --> 00:43:36,050 that might change the next day.
925 00:43:36,050 --> 00:43:38,510 But if you expect that the sidebar of your page 926 00:43:38,510 --> 00:43:41,570 is not going to change very often within the same minute 927 00:43:41,570 --> 00:43:43,820 or within the same hour, well, then you might imagine 928 00:43:43,820 --> 00:43:46,910 caching that part of the template so that, the next time 929 00:43:46,910 --> 00:43:49,160 that Django tries to load that entire template, 930 00:43:49,160 --> 00:43:52,550 it doesn't need to recalculate how to generate the sidebar for your website. 931 00:43:52,550 --> 00:43:56,330 It just knows that we can use the same version of the sidebar 932 00:43:56,330 --> 00:43:59,786 from the last time that we loaded this website instead. 933 00:43:59,786 --> 00:44:03,600 And Django also gives you access to a lower level cache API 934 00:44:03,600 --> 00:44:07,080 where, for any information that you might want to cache and store for use 935 00:44:07,080 --> 00:44:10,140 later, you can save that information inside of the cache. 936 00:44:10,140 --> 00:44:12,180 You make an expensive database query that 937 00:44:12,180 --> 00:44:15,360 takes a couple of milliseconds or a couple of seconds to process. 938 00:44:15,360 --> 00:44:17,760 You can save those results inside of a cache 939 00:44:17,760 --> 00:44:20,550 to make it easier to access that same data if ever you 940 00:44:20,550 --> 00:44:22,930 try to get access to that again. 941 00:44:22,930 --> 00:44:26,430 So caching allows us to be able to deal with these issues of scale 942 00:44:26,430 --> 00:44:29,910 by reducing load on our servers, but also on our databases.
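The idea behind that lower-level cache can be sketched in plain Python. The `SimpleCache` class below is not Django's implementation; it is a small stand-in, loosely modeled on the get-and-set style of Django's low-level cache API, and names like `top_headlines` and `run_query` are hypothetical. The clock is passed in as a parameter only so the expiry behavior is easy to demonstrate.

```python
import time


class SimpleCache:
    """A tiny in-memory cache where each entry expires after a timeout."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def set(self, key, value, timeout):
        """Store value under key for timeout seconds."""
        self._entries[key] = (self._clock() + timeout, value)

    def get(self, key, default=None):
        """Return the stored value, or default if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return default
        return value


def top_headlines(cache, run_query):
    """Return the headlines, running the expensive query only on a cache miss."""
    headlines = cache.get("top_headlines")
    if headlines is None:
        headlines = run_query()
        cache.set("top_headlines", headlines, timeout=30)
    return headlines
```

With this in place, two requests half a second apart share one database query, and the query only runs again once the 30 seconds have elapsed.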
943 00:44:29,910 --> 00:44:33,330 Rather than need to talk to the database every single time we 944 00:44:33,330 --> 00:44:36,750 make a new request for a particular web application, 945 00:44:36,750 --> 00:44:39,060 we can just reuse information that happens 946 00:44:39,060 --> 00:44:42,930 to be in the cache to allow our web applications to become even more 947 00:44:42,930 --> 00:44:44,350 scalable. 948 00:44:44,350 --> 00:44:48,000 So that then was a look at some issues concerning scalability. 949 00:44:48,000 --> 00:44:50,580 And we'll next turn our attention to security-- 950 00:44:50,580 --> 00:44:53,610 trying to make sure that, as we build our web applications, as we deploy 951 00:44:53,610 --> 00:44:56,370 our web applications and more users start to use them, 952 00:44:56,370 --> 00:44:58,290 we want to make sure that they're secure. 953 00:44:58,290 --> 00:45:00,570 And there are a whole bunch of security considerations 954 00:45:00,570 --> 00:45:03,170 to take into account across all of the topics 955 00:45:03,170 --> 00:45:04,650 that we've looked at in the course. 956 00:45:04,650 --> 00:45:06,525 We've looked at a number of different topics. 957 00:45:06,525 --> 00:45:09,400 And with each of them, there are security vulnerabilities. 958 00:45:09,400 --> 00:45:12,720 There are ideas to be mindful of when it comes towards making sure 959 00:45:12,720 --> 00:45:14,580 that our applications are secure. 960 00:45:14,580 --> 00:45:18,420 And we can begin our story, in fact, by talking about Git and version control. 961 00:45:18,420 --> 00:45:20,370 Git is all about trying to make sure we're 962 00:45:20,370 --> 00:45:22,860 able to keep track of different versions of our code. 963 00:45:22,860 --> 00:45:24,780 And one thing that goes hand-in-hand with Git 964 00:45:24,780 --> 00:45:27,480 is this idea of open-source software. 
965 00:45:27,480 --> 00:45:30,930 On websites like GitHub and other services that host Git repositories, 966 00:45:30,930 --> 00:45:33,930 increasingly, a lot of software is becoming open source 967 00:45:33,930 --> 00:45:38,190 where anyone can see and contribute to the source code of an application. 968 00:45:38,190 --> 00:45:40,868 And this is great in the sense that it allows for many people 969 00:45:40,868 --> 00:45:42,660 to be able to collaborate and work together 970 00:45:42,660 --> 00:45:46,590 in order to try to find bugs that might exist inside of a web application. 971 00:45:46,590 --> 00:45:48,810 But it also comes with drawbacks-- drawbacks 972 00:45:48,810 --> 00:45:51,333 where, if there is a bug in the application, 973 00:45:51,333 --> 00:45:54,000 now someone who's looking through the source code of our program 974 00:45:54,000 --> 00:45:56,250 might be able to spot that bug. 975 00:45:56,250 --> 00:45:58,920 Or you might imagine that, because Git keeps 976 00:45:58,920 --> 00:46:01,830 track of different versions of our code every time 977 00:46:01,830 --> 00:46:04,050 we make a commit to our repository, you have 978 00:46:04,050 --> 00:46:07,110 to be very careful when it comes towards credentials or things that 979 00:46:07,110 --> 00:46:08,910 might leak inside of the source code. 980 00:46:08,910 --> 00:46:12,600 You generally never want to put passwords or any secure information 981 00:46:12,600 --> 00:46:15,990 inside of the Git repository because the Git repository could 982 00:46:15,990 --> 00:46:19,000 be shared with other people and might be open to anyone to look at. 
983 00:46:19,000 --> 00:46:22,200 And so those are security considerations to be mindful of there as 984 00:46:22,200 --> 00:46:25,920 well-- that if you make a commit, and accidentally make a commit to your code 985 00:46:25,920 --> 00:46:29,610 where you expose those credentials, you might remove those credentials 986 00:46:29,610 --> 00:46:32,160 and commit again so the latest version of your program 987 00:46:32,160 --> 00:46:34,140 doesn't have those credentials in it. 988 00:46:34,140 --> 00:46:36,540 But someone who has access to the Git repository 989 00:46:36,540 --> 00:46:39,150 has access not just to the latest version of your code, 990 00:46:39,150 --> 00:46:41,110 but to every version of your code. 991 00:46:41,110 --> 00:46:43,650 And that person could, theoretically, go back 992 00:46:43,650 --> 00:46:46,770 through the history of the repository and find the commit 993 00:46:46,770 --> 00:46:51,040 where the credentials were exposed and see those credentials as well. 994 00:46:51,040 --> 00:46:54,270 So while Git is a very powerful tool, it's also one to be mindful of. 995 00:46:54,270 --> 00:46:57,840 Any change you make could potentially get saved inside of a commit-- 996 00:46:57,840 --> 00:47:00,690 could potentially, therefore, be accessed later on. 997 00:47:00,690 --> 00:47:04,380 And so if ever credentials are exposed inside of the repository, 998 00:47:04,380 --> 00:47:07,260 you want to make sure to wipe out all of those previous commits 999 00:47:07,260 --> 00:47:09,690 and not just make some new commit in order 1000 00:47:09,690 --> 00:47:13,740 to try and hide the previous credentials that can be exposed because they can 1001 00:47:13,740 --> 00:47:17,010 still be retrieved if someone goes back through the history 1002 00:47:17,010 --> 00:47:19,300 of any particular repository. 1003 00:47:19,300 --> 00:47:23,025 And so that, then, was a look at some issues that might surround Git.
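One common way to keep credentials out of a repository in the first place is to read them from the environment when the program runs, so the secret never appears in any commit. This is only a sketch of that idea, not something shown in the lecture: `load_secret` and the setting name `DB_PASSWORD` are hypothetical.

```python
import os


def load_secret(name, environ=os.environ):
    """Read a secret from the environment instead of hard-coding it in source."""
    value = environ.get(name)
    if not value:
        # Fail loudly rather than falling back to a default stored in the code.
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# A dictionary stands in for the real environment in this demonstration.
demo_environ = {"DB_PASSWORD": "example-only"}
password = load_secret("DB_PASSWORD", demo_environ)
print(len(password))  # use the secret without ever printing or committing it
```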
1004 00:47:23,025 --> 00:47:24,900 We also talked at the beginning of the course 1005 00:47:24,900 --> 00:47:28,110 about HTML, and about what it is that we can use with HTML, 1006 00:47:28,110 --> 00:47:32,040 and how we can use this language in order to design the structure of a web 1007 00:47:32,040 --> 00:47:36,150 page, in order to decide where all of the paragraphs are going to be, 1008 00:47:36,150 --> 00:47:38,070 what tables are going to be on the page. 1009 00:47:38,070 --> 00:47:40,710 We talked about links and how we can use anchor tags 1010 00:47:40,710 --> 00:47:42,960 to link one page to another page. 1011 00:47:42,960 --> 00:47:47,640 Now, one concern is this type of attack known as a phishing attack with HTML. 1012 00:47:47,640 --> 00:47:49,830 And a phishing attack really just comes down 1013 00:47:49,830 --> 00:47:53,100 to a little bit of HTML that looks like this-- very easy to write, 1014 00:47:53,100 --> 00:47:57,690 where I have an anchor tag that is going to direct the user to URL 1. 1015 00:47:57,690 --> 00:48:01,860 But it looks like it directs the user to URL 2. 1016 00:48:01,860 --> 00:48:03,930 So what might an example of this be? 1017 00:48:03,930 --> 00:48:05,380 All right, so we'll take a look. 1018 00:48:05,380 --> 00:48:09,280 I'll go ahead and open up link.html. 1019 00:48:09,280 --> 00:48:11,770 And in link.html, I have a website that I've written 1020 00:48:11,770 --> 00:48:13,950 that appears to have a link to Google. 1021 00:48:13,950 --> 00:48:16,030 But if I click on that link, I'm suddenly 1022 00:48:16,030 --> 00:48:19,162 directed to this course's website, for example. 1023 00:48:19,162 --> 00:48:20,120 So how did that happen? 1024 00:48:20,120 --> 00:48:20,953 Why did that happen? 1025 00:48:20,953 --> 00:48:22,670 It seems like it's linking to Google.
1026 00:48:22,670 --> 00:48:26,290 Well, if you look at the code, if I go ahead and open up link.html, 1027 00:48:26,290 --> 00:48:31,360 we'll see that here I have an anchor tag that actually links to the course 1028 00:48:31,360 --> 00:48:34,150 website but appears to be linking-- the text 1029 00:48:34,150 --> 00:48:37,900 that the user sees appears that it is linking instead to Google. 1030 00:48:37,900 --> 00:48:41,360 And so this is a very common attack vector, especially in emails, 1031 00:48:41,360 --> 00:48:41,980 for example. 1032 00:48:41,980 --> 00:48:45,040 You might see an email that tells you to click on a particular link. 1033 00:48:45,040 --> 00:48:48,070 But that link takes you to somewhere else entirely instead. 1034 00:48:48,070 --> 00:48:50,380 And as a result, someone might inadvertently 1035 00:48:50,380 --> 00:48:54,010 share their bank account credentials or other sensitive information. 1036 00:48:54,010 --> 00:48:57,220 And so here, too, something to be mindful of as you interact with the web, 1037 00:48:57,220 --> 00:49:00,490 maybe not necessarily on your own website, but in other websites 1038 00:49:00,490 --> 00:49:03,940 that you might interact with, just to be mindful about where links are actually 1039 00:49:03,940 --> 00:49:04,580 taking you. 1040 00:49:04,580 --> 00:49:07,300 And most web browsers, if you hover over a link, 1041 00:49:07,300 --> 00:49:09,400 will show you where that link might actually 1042 00:49:09,400 --> 00:49:12,010 be directing you to because it might be different than what 1043 00:49:12,010 --> 00:49:17,930 the text of that particular anchor tag might appear to link you to instead.
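The same check you do by hovering over a link can be roughed out in code. The sketch below uses Python's built-in `html.parser` to collect each anchor's real destination and its visible text, then flags anchors whose text looks like one site while the `href` points to another. The class and function names are made up for this example, `example.com` is a placeholder for the real destination, and the rule for what counts as looking like a site address is a crude assumption that would miss many real phishing links.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAuditor(HTMLParser):
    """Collect (href, visible text) pairs for every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url_like):
    """Extract the host name from a URL, or from bare text like google.com."""
    if "//" not in url_like:
        url_like = "//" + url_like
    return (urlparse(url_like).hostname or "").lower()


def suspicious_links(html):
    """Return anchors whose visible text names a different host than the href."""
    auditor = LinkAuditor()
    auditor.feed(html)
    flagged = []
    for href, text in auditor.links:
        # Only compare when the visible text itself looks like an address.
        looks_like_address = "." in text and " " not in text
        if looks_like_address and host_of(text) != host_of(href):
            flagged.append((href, text))
    return flagged


page = '<a href="https://example.com/course">https://www.google.com</a>'
print(suspicious_links(page))
```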
1044 00:49:17,930 --> 00:49:21,017 So HTML has all these various different vulnerabilities 1045 00:49:21,017 --> 00:49:24,100 where, because you can just decide what you want the structure of the page 1046 00:49:24,100 --> 00:49:26,710 to be, it leaves open the possibility that someone 1047 00:49:26,710 --> 00:49:29,770 might try to trick you into thinking that you were going to a page 1048 00:49:29,770 --> 00:49:31,420 that you're not actually on. 1049 00:49:31,420 --> 00:49:34,150 And this problem is more widespread because anyone 1050 00:49:34,150 --> 00:49:36,580 can look at the HTML for any page. 1051 00:49:36,580 --> 00:49:38,950 HTML comes back from the server. 1052 00:49:38,950 --> 00:49:42,310 And therefore, the web browser has access to all of that HTML 1053 00:49:42,310 --> 00:49:46,270 and can use that HTML in order to render a page, for example. 1054 00:49:46,270 --> 00:49:49,150 And this leaves open other vulnerabilities, too. 1055 00:49:49,150 --> 00:49:54,760 For example, let me go ahead and go to bankofamerica.com, just 1056 00:49:54,760 --> 00:49:55,900 Bank of America's website. 1057 00:49:55,900 --> 00:49:57,850 You can go to any other website instead. 1058 00:49:57,850 --> 00:50:01,600 If I wanted to create a fake version of Bank of America's website, 1059 00:50:01,600 --> 00:50:03,820 for example, to trick people into thinking 1060 00:50:03,820 --> 00:50:05,740 they're going to Bank of America's website 1061 00:50:05,740 --> 00:50:08,950 when really they're going to my website, well, then what I can do 1062 00:50:08,950 --> 00:50:11,420 is just go ahead and view the source of this page. 1063 00:50:11,420 --> 00:50:13,940 I go ahead and view page source. 1064 00:50:13,940 --> 00:50:17,990 And here is all of the HTML for Bank of America's website. 
1065 00:50:17,990 --> 00:50:21,410 And nothing then stops me from copying all this content, 1066 00:50:21,410 --> 00:50:27,440 going into an HTML file, and creating a new file that I'll just call bank.html. 1067 00:50:27,440 --> 00:50:31,350 And I'll go ahead and paste in the contents of that HTML file, 1068 00:50:31,350 --> 00:50:34,700 so here, then, is all of Bank of America's HTML. 1069 00:50:34,700 --> 00:50:37,190 And now, if I open up bank.html-- 1070 00:50:37,190 --> 00:50:39,920 that HTML file that I have now written, but really 1071 00:50:39,920 --> 00:50:42,320 just copied from Bank of America-- 1072 00:50:42,320 --> 00:50:43,730 I open it up. 1073 00:50:43,730 --> 00:50:47,000 And now here, on my page, is a web page that 1074 00:50:47,000 --> 00:50:48,680 appears to look like Bank of America. 1075 00:50:48,680 --> 00:50:51,170 It's using all of Bank of America's HTML. 1076 00:50:51,170 --> 00:50:56,130 But instead, it is my HTML page and not, actually, Bank of America. 1077 00:50:56,130 --> 00:51:00,350 And so you might imagine combining these to create an even more concerning 1078 00:51:00,350 --> 00:51:03,050 attack vector where, instead of linking to google.com, 1079 00:51:03,050 --> 00:51:06,461 let me try and link to bankofamerica.com. 1080 00:51:06,461 --> 00:51:12,170 But where I'm actually going to link to is bank.html, my version 1081 00:51:12,170 --> 00:51:14,180 of Bank of America's website. 1082 00:51:14,180 --> 00:51:18,170 Now, if I open up link.html, here appears 1083 00:51:18,170 --> 00:51:20,900 to be a link that links me to Bank of America. 1084 00:51:20,900 --> 00:51:23,180 If I click on that link, I get to a page that 1085 00:51:23,180 --> 00:51:25,250 looks like Bank of America's website. 1086 00:51:25,250 --> 00:51:27,260 But it's not Bank of America's website. 1087 00:51:27,260 --> 00:51:30,490 It's my bank.html file that I have written.
1088 00:51:30,490 --> 00:51:33,140 It just so happens to look like Bank of America's website 1089 00:51:33,140 --> 00:51:36,620 because I copied all of that underlying HTML. 1090 00:51:36,620 --> 00:51:39,860 So HTML has the ability to describe the structure of our web page. 1091 00:51:39,860 --> 00:51:43,790 But anytime you're writing this HTML, it's good to be mindful of the fact 1092 00:51:43,790 --> 00:51:48,110 that anyone can copy your HTML, could theoretically pretend to be you. 1093 00:51:48,110 --> 00:51:50,090 These are security vulnerabilities that are 1094 00:51:50,090 --> 00:51:53,240 worth bearing in mind as we start to develop web applications 1095 00:51:53,240 --> 00:51:56,910 and interact with web applications as well. 1096 00:51:56,910 --> 00:52:01,070 So ultimately, we used HTML in the context of designing web applications 1097 00:52:01,070 --> 00:52:02,960 using Django, a framework. 1098 00:52:02,960 --> 00:52:05,690 And how exactly, then, did these web frameworks 1099 00:52:05,690 --> 00:52:10,250 work in terms of creating these web servers that are listening for requests 1100 00:52:10,250 --> 00:52:12,650 and that are responding to those requests? 1101 00:52:12,650 --> 00:52:14,390 Well, ultimately, much of the internet is 1102 00:52:14,390 --> 00:52:17,930 based around this idea of a client communicating with a server or, more 1103 00:52:17,930 --> 00:52:20,420 generally, any one computer communicating 1104 00:52:20,420 --> 00:52:23,810 with another computer using HTTP and, in particular, 1105 00:52:23,810 --> 00:52:28,618 HTTPS, a more secure version of the HTTP protocol. 1106 00:52:28,618 --> 00:52:31,160 And so you imagine that what these protocols are really about 1107 00:52:31,160 --> 00:52:34,200 is how information gets from one person to another 1108 00:52:34,200 --> 00:52:36,110 and what we're storing with that information. 1109 00:52:36,110 --> 00:52:39,680 We have one computer trying to communicate with some other computer.
1110 00:52:39,680 --> 00:52:42,440 And in order to do so, information is generally 1111 00:52:42,440 --> 00:52:45,020 going to flow through these routers. 1112 00:52:45,020 --> 00:52:47,270 You might imagine information going back and forth 1113 00:52:47,270 --> 00:52:49,610 between one computer and another computer, 1114 00:52:49,610 --> 00:52:53,540 going through these intermediate routers along the way. 1115 00:52:53,540 --> 00:52:56,390 And as a result, one thing to be cautious about 1116 00:52:56,390 --> 00:52:58,400 is, how do you know that this information that's 1117 00:52:58,400 --> 00:53:02,390 getting passed back and forth is getting passed back and forth securely? 1118 00:53:02,390 --> 00:53:05,150 Ideally, when I send a message to another computer-- 1119 00:53:05,150 --> 00:53:07,190 I'm sending an email to someone else, I'm 1120 00:53:07,190 --> 00:53:09,800 sending a message, I'm making a request to a website that 1121 00:53:09,800 --> 00:53:13,130 might contain sensitive information, like my bank account, for example-- 1122 00:53:13,130 --> 00:53:17,030 I don't want it so that any intercepting router that is taking my request 1123 00:53:17,030 --> 00:53:18,260 and passing it along-- 1124 00:53:18,260 --> 00:53:21,170 I don't want those routers to be able to look at that request 1125 00:53:21,170 --> 00:53:24,950 and see the contents of my email or the contents of what password 1126 00:53:24,950 --> 00:53:27,620 I happen to be sending across the web or not. 1127 00:53:27,620 --> 00:53:31,005 Ideally, I'd like for this information to be encrypted. 1128 00:53:31,005 --> 00:53:33,380 And so here, we'll talk a little bit about cryptography-- 1129 00:53:33,380 --> 00:53:35,450 this process of trying to make sure that I 1130 00:53:35,450 --> 00:53:37,850 am able to communicate with some other person 1131 00:53:37,850 --> 00:53:42,860 without some eavesdropper in the middle being able to intercept that message. 
1132 00:53:42,860 --> 00:53:45,555 Obviously, if I just take a plain text version 1133 00:53:45,555 --> 00:53:47,930 of the message I'm trying to send and just literally take 1134 00:53:47,930 --> 00:53:51,560 the text of the message I'm trying to send and effectively pass it along 1135 00:53:51,560 --> 00:53:53,660 across the internet, well, then anyone who 1136 00:53:53,660 --> 00:53:57,430 is able to see that message is going to know what the text of that message is. 1137 00:53:57,430 --> 00:53:59,420 And so I want to do some kind of encryption, 1138 00:53:59,420 --> 00:54:02,900 some way of encrypting that message so that someone along the way 1139 00:54:02,900 --> 00:54:06,230 won't be able to do that decryption if a router in the middle 1140 00:54:06,230 --> 00:54:09,408 or someone in the middle is able to intercept that message. 1141 00:54:09,408 --> 00:54:11,450 And so the first approach we'll look at is what's 1142 00:54:11,450 --> 00:54:14,030 known as secret-key cryptography. 1143 00:54:14,030 --> 00:54:19,160 In secret-key cryptography, I have not just the plaintext, but some key, 1144 00:54:19,160 --> 00:54:23,600 some secret piece of information that can be used in order to encrypt 1145 00:54:23,600 --> 00:54:25,550 or decrypt information. 1146 00:54:25,550 --> 00:54:29,600 And so I'll use both the key and the plaintext 1147 00:54:29,600 --> 00:54:33,710 to generate what's known as the ciphertext, the encrypted version 1148 00:54:33,710 --> 00:54:35,690 of the message I'm trying to send. 1149 00:54:35,690 --> 00:54:39,080 And then, instead of sending the plaintext 1150 00:54:39,080 --> 00:54:41,540 across the internet to the other person, I 1151 00:54:41,540 --> 00:54:44,870 might instead want to just send the ciphertext across the internet 1152 00:54:44,870 --> 00:54:48,050 to the other person so that I'm not sending the plain version 1153 00:54:48,050 --> 00:54:49,700 of the message across the internet. 
1154 00:54:49,700 --> 00:54:51,560 So the ciphertext goes across. 1155 00:54:51,560 --> 00:54:54,270 And the other person will also need the key. 1156 00:54:54,270 --> 00:54:57,835 Now, if the other person has both the ciphertext and the key, 1157 00:54:57,835 --> 00:54:59,960 well, then using that information, the other person 1158 00:54:59,960 --> 00:55:02,960 can use the key to decrypt the ciphertext 1159 00:55:02,960 --> 00:55:05,800 and obtain the original plaintext. 1160 00:55:05,800 --> 00:55:10,340 And this key is what we might call a symmetric key encryption and decryption 1161 00:55:10,340 --> 00:55:10,840 key. 1162 00:55:10,840 --> 00:55:13,820 You use the key in order to encrypt messages. 1163 00:55:13,820 --> 00:55:17,600 And you use the same key in order to do the decryption process. 1164 00:55:17,600 --> 00:55:21,050 And as long as both I and the person I'm communicating with both have access 1165 00:55:21,050 --> 00:55:25,760 to that key, well, then we'll be able to encrypt messages and decrypt messages. 1166 00:55:25,760 --> 00:55:28,610 And someone who just has the ciphertext but not the key 1167 00:55:28,610 --> 00:55:33,160 likely won't be able to figure out what that original message was. 1168 00:55:33,160 --> 00:55:36,370 But there's a problem here, especially in the context of the internet. 1169 00:55:36,370 --> 00:55:41,500 And that is that both I and the other person need to have access to this key. 1170 00:55:41,500 --> 00:55:45,320 The key is what I use to do the encryption and the decryption. 1171 00:55:45,320 --> 00:55:48,978 And I can't just send the key across the internet to the other person 1172 00:55:48,978 --> 00:55:51,520 because, if I do that, well, then someone in the middle who's 1173 00:55:51,520 --> 00:55:54,130 intercepting all of my requests could intercept 1174 00:55:54,130 --> 00:55:56,740 both the ciphertext and the key. 
1175 00:55:56,740 --> 00:56:00,670 And therefore, they would be able to decrypt the message because they 1176 00:56:00,670 --> 00:56:03,260 have both the ciphertext and the key. 1177 00:56:03,260 --> 00:56:07,090 Now, if I were able to go to another person in person and exchange 1178 00:56:07,090 --> 00:56:10,390 this secret key in secret, well, then this scheme 1179 00:56:10,390 --> 00:56:12,490 might work, because we both have the key. 1180 00:56:12,490 --> 00:56:16,360 And I didn't share the key publicly with anyone who might intercept the message. 1181 00:56:16,360 --> 00:56:18,970 Only I and the other person had the key. 1182 00:56:18,970 --> 00:56:21,157 But in general, when communicating on the internet, 1183 00:56:21,157 --> 00:56:22,990 you're not communicating with servers you've 1184 00:56:22,990 --> 00:56:25,210 necessarily communicated with before. 1185 00:56:25,210 --> 00:56:27,880 I might be trying to make a request to a new website. 1186 00:56:27,880 --> 00:56:32,770 And we somehow still need to agree on a system where I can encrypt messages 1187 00:56:32,770 --> 00:56:35,110 but only the other person on the other side 1188 00:56:35,110 --> 00:56:38,990 is able to decrypt those messages instead. 1189 00:56:38,990 --> 00:56:42,460 So this kind of cryptography-- probably not great 1190 00:56:42,460 --> 00:56:47,300 for trying to initially try and create a secure connection on the internet. 1191 00:56:47,300 --> 00:56:49,810 And for that reason, a major advancement in cryptography 1192 00:56:49,810 --> 00:56:54,970 that allows for the internet to work is this notion of public-key cryptography. 1193 00:56:54,970 --> 00:56:56,890 In secret-key cryptography, it's important 1194 00:56:56,890 --> 00:57:00,280 that the key is secret because, if the key were known by everyone, well, 1195 00:57:00,280 --> 00:57:03,040 then anyone would be able to decrypt messages. 
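As a minimal sketch of the secret-key idea described above, here is a toy one-time-pad cipher that uses only the Python standard library. The lecture does not name a specific algorithm, so the XOR scheme and the names `xor_bytes`, `plaintext`, and `key` are illustrative assumptions, and this is not something to use for real traffic. What it shows is that the same key both encrypts and decrypts, so anyone who obtains the key can read the message.

```python
import secrets


def xor_bytes(data, key):
    """XOR each byte of data with the matching byte of key (toy one-time pad)."""
    if len(key) != len(data):
        raise ValueError("key must be the same length as the data")
    return bytes(d ^ k for d, k in zip(data, key))


plaintext = b"Hello, world!"
# The shared secret: random bytes that both parties must already hold.
key = secrets.token_bytes(len(plaintext))

ciphertext = xor_bytes(plaintext, key)   # sender encrypts with the key
recovered = xor_bytes(ciphertext, key)   # receiver decrypts with the same key
```

The weakness the lecture points out is visible here: `key` has to reach the other person somehow, and an eavesdropper who sees both `ciphertext` and `key` can run the second call just as easily.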
1196 00:57:03,040 --> 00:57:06,730 In public-key cryptography, we're able to create a secure encryption 1197 00:57:06,730 --> 00:57:09,790 system where the key is allowed to be public, 1198 00:57:09,790 --> 00:57:11,980 or one of the keys, as we'll soon see. 1199 00:57:11,980 --> 00:57:16,030 And the idea here is that we're using two keys instead of just one-- 1200 00:57:16,030 --> 00:57:20,072 that we have both a public key and what's known as a private key. 1201 00:57:20,072 --> 00:57:22,030 The private key-- your private key is something 1202 00:57:22,030 --> 00:57:25,840 you should not share with other people to keep the encryption scheme secure. 1203 00:57:25,840 --> 00:57:30,340 But the public key is one that is OK to share with other people. 1204 00:57:30,340 --> 00:57:34,150 And the distinction between the two is that the public key will be 1205 00:57:34,150 --> 00:57:36,640 used in order to encrypt information. 1206 00:57:36,640 --> 00:57:40,090 And the private key will be used to decrypt information 1207 00:57:40,090 --> 00:57:41,870 that was encrypted by the public key. 1208 00:57:41,870 --> 00:57:44,620 And the public key and the private key are mathematically related. 1209 00:57:44,620 --> 00:57:47,287 And there are a couple of ways that we might imagine doing that. 1210 00:57:47,287 --> 00:57:51,160 But the idea now is that, if I want to communicate with another person, 1211 00:57:51,160 --> 00:57:54,100 that person sends me their public key. 1212 00:57:54,100 --> 00:57:56,890 And it's OK for the public key to travel across the internet. 1213 00:57:56,890 --> 00:58:01,000 Anyone is allowed to see the public key because the public key is only 1214 00:58:01,000 --> 00:58:03,610 used for encrypting that data.
1215 00:58:03,610 --> 00:58:06,610 So I can then take the plaintext and the public key 1216 00:58:06,610 --> 00:58:11,350 and use that to generate the ciphertext, the encrypted version of the message 1217 00:58:11,350 --> 00:58:13,930 that I am trying to send across the internet. 1218 00:58:13,930 --> 00:58:16,960 And then I send the ciphertext to the other person 1219 00:58:16,960 --> 00:58:18,640 with whom I'm trying to communicate. 1220 00:58:18,640 --> 00:58:24,080 And the other person now, using the ciphertext, then uses the private key-- 1221 00:58:24,080 --> 00:58:26,800 the private key that they did not share, and the private key 1222 00:58:26,800 --> 00:58:29,710 that has the ability to decrypt information that 1223 00:58:29,710 --> 00:58:32,600 was encrypted using the public key. 1224 00:58:32,600 --> 00:58:35,800 So using a combination of the ciphertext and the private key, 1225 00:58:35,800 --> 00:58:38,830 the person I'm communicating with can decrypt that information 1226 00:58:38,830 --> 00:58:43,070 and get back whatever the original plaintext of that information 1227 00:58:43,070 --> 00:58:44,360 happened to be. 1228 00:58:44,360 --> 00:58:46,630 And so this, then, is how we can do a lot 1229 00:58:46,630 --> 00:58:48,430 of this communication on the internet. 1230 00:58:48,430 --> 00:58:50,830 By using this public-private key pair, we 1231 00:58:50,830 --> 00:58:53,560 can say, use the public key to do the encrypting, 1232 00:58:53,560 --> 00:58:55,690 use the private key to do the decrypting. 
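The public-key flow described above can be sketched with textbook RSA on tiny numbers. The lecture does not say which algorithm is used, so RSA is an assumption here, and the primes, the names `public_key` and `private_key`, and the helper functions are purely illustrative. Numbers this small are trivially breakable; the point is only to show that what the public key encrypts, the private key decrypts.

```python
# Toy RSA with tiny textbook primes. Insecure by design; illustration only.
p, q = 61, 53
n = p * q                      # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, chosen coprime to phi
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)            # safe to send across the internet
private_key = (d, n)           # never leaves the receiver


def encrypt(message, key):
    """Encrypt an integer message (0 <= message < n) with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, key):
    """Recover the integer message using the private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


ciphertext = encrypt(65, public_key)
recovered = decrypt(ciphertext, private_key)
```

In practice a real library and much larger keys would be used, and, as the lecture goes on to say, the public-key step is often used just to agree on a secret key for the rest of the conversation.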
1233 00:58:55,690 --> 00:58:58,690 And now two computers that have never interacted with each other 1234 00:58:58,690 --> 00:59:00,970 before, without having the opportunity to meet, 1235 00:59:00,970 --> 00:59:04,630 to exchange some secret information, can use a technique like this 1236 00:59:04,630 --> 00:59:07,060 in order to securely communicate with each other-- 1237 00:59:07,060 --> 00:59:10,300 to send a message back and forth without anyone in the middle 1238 00:59:10,300 --> 00:59:15,140 being able to intercept the message and identify what the message is about. 1239 00:59:15,140 --> 00:59:18,310 And once you have this ability, the ability to communicate with another 1240 00:59:18,310 --> 00:59:21,730 secretly, well, then you can imagine agreeing on some secret key 1241 00:59:21,730 --> 00:59:25,780 and then using secret-key encryption to be able to encrypt and decrypt messages 1242 00:59:25,780 --> 00:59:26,470 as well. 1243 00:59:26,470 --> 00:59:28,262 And so that's an approach that you can also 1244 00:59:28,262 --> 00:59:31,460 take when trying to communicate with other people across the internet. 1245 00:59:31,460 --> 00:59:34,950 But this idea of encryption is what allows for HTTPS, 1246 00:59:34,950 --> 00:59:39,190 the secure version of the HTTP protocol, to actually work to make sure that-- 1247 00:59:39,190 --> 00:59:42,690 when you are communicating with your bank's website, for example-- 1248 00:59:42,690 --> 00:59:46,300 that someone along the way won't be able to intercept that information 1249 00:59:46,300 --> 00:59:48,770 and identify what it is that you're communicating about 1250 00:59:48,770 --> 00:59:51,090 and, instead, only has the encrypted version 1251 00:59:51,090 --> 00:59:55,720 of the information and a public key with which they can encrypt information, 1252 00:59:55,720 --> 00:59:57,850 but not a private key that can ultimately 1253 00:59:57,850 --> 01:00:02,150 be used in order to decrypt information as well. 
1254 01:00:02,150 --> 01:00:05,920 And so that then is how we might allow for this kind of secure communication 1255 01:00:05,920 --> 01:00:09,010 on the internet and allow our web applications to be secure. 1256 01:00:09,010 --> 01:00:12,130 But in addition to our web applications just listening for requests 1257 01:00:12,130 --> 01:00:14,180 and then providing some sort of response, 1258 01:00:14,180 --> 01:00:17,560 our web applications were also dealing with data. 1259 01:00:17,560 --> 01:00:19,720 We introduced the idea of SQL data tables 1260 01:00:19,720 --> 01:00:22,240 where we had tables of data with rows and columns 1261 01:00:22,240 --> 01:00:23,950 that are representing information. 1262 01:00:23,950 --> 01:00:26,980 And we've also created web applications in this course where 1263 01:00:26,980 --> 01:00:28,900 we've had applications that have users. 1264 01:00:28,900 --> 01:00:32,940 Users sign in with a user name and a password, for example. 1265 01:00:32,940 --> 01:00:35,450 And so how might we represent that information 1266 01:00:35,450 --> 01:00:37,100 about users and their passwords? 1267 01:00:37,100 --> 01:00:41,070 Well, one way would be just stored inside of a table like this. 1268 01:00:41,070 --> 01:00:42,410 Here's a table of users. 1269 01:00:42,410 --> 01:00:44,210 Every user has an ID. 1270 01:00:44,210 --> 01:00:47,490 They have a user name, and they have a password. 1271 01:00:47,490 --> 01:00:50,750 But this turns out to be an incredibly insecure way 1272 01:00:50,750 --> 01:00:53,090 to store passwords-- to be storing passwords 1273 01:00:53,090 --> 01:00:56,120 in what might be called plaintext, just to literally store 1274 01:00:56,120 --> 01:00:58,040 the passwords inside of a database. 1275 01:00:58,040 --> 01:01:01,910 And we should never do this in practice because of the security vulnerabilities 1276 01:01:01,910 --> 01:01:03,090 associated with it. 
1277 01:01:03,090 --> 01:01:06,680 If ever someone were to get unauthorized access to this database, 1278 01:01:06,680 --> 01:01:10,140 they would be able to see all of the passwords for all of the users. 1279 01:01:10,140 --> 01:01:13,010 So if this database ever leaked for whatever reason, suddenly 1280 01:01:13,010 --> 01:01:14,852 all of these passwords are now known. 1281 01:01:14,852 --> 01:01:16,310 And this kind of thing does happen. 1282 01:01:16,310 --> 01:01:19,460 If companies are not careful about how they represent user names 1283 01:01:19,460 --> 01:01:22,380 and passwords inside of their databases, and if ever there's 1284 01:01:22,380 --> 01:01:27,040 some sort of database leak, suddenly a whole bunch of passwords 1285 01:01:27,040 --> 01:01:29,008 could potentially be compromised. 1286 01:01:29,008 --> 01:01:31,300 And it's for that reason that the recommended approach, 1287 01:01:31,300 --> 01:01:34,060 rather than store an actual password, is to store 1288 01:01:34,060 --> 01:01:38,740 a hashed version of the same password using a hash function where 1289 01:01:38,740 --> 01:01:41,680 a hash function, in this context, is some function that 1290 01:01:41,680 --> 01:01:46,630 takes a password as input and outputs some hash-- 1291 01:01:46,630 --> 01:01:49,540 some sequence of characters and numbers, in this case-- 1292 01:01:49,540 --> 01:01:51,850 that represents that particular password, 1293 01:01:51,850 --> 01:01:53,650 a hashed version of the password. 1294 01:01:53,650 --> 01:01:55,870 But the important thing about this hash function 1295 01:01:55,870 --> 01:01:58,120 is that it's a one-way hash function. 1296 01:01:58,120 --> 01:02:01,750 From the password, you can get to the sequence of letters and numbers.
1297 01:02:01,750 --> 01:02:04,480 But it is very, very difficult to go the other way around 1298 01:02:04,480 --> 01:02:09,490 to use this information to figure out what the original password actually 1299 01:02:09,490 --> 01:02:10,240 was. 1300 01:02:10,240 --> 01:02:12,940 And so what this means is that the companies won't actually 1301 01:02:12,940 --> 01:02:18,550 know what any particular user's password is when a user tries to log in. 1302 01:02:18,550 --> 01:02:21,760 What we'll do is take their password that they're trying to log in with. 1303 01:02:21,760 --> 01:02:25,090 We'll hash it and compare that hash against the hash 1304 01:02:25,090 --> 01:02:27,580 that we've stored in the database. 1305 01:02:27,580 --> 01:02:31,030 If the hashes match up, that means the user probably typed in their password 1306 01:02:31,030 --> 01:02:33,130 correctly and, therefore, we can sign the user in. 1307 01:02:33,130 --> 01:02:35,830 And otherwise, that's a sign that the user did not 1308 01:02:35,830 --> 01:02:38,270 type their password in correctly. 1309 01:02:38,270 --> 01:02:40,330 So this, then, is the reason why companies-- 1310 01:02:40,330 --> 01:02:42,670 if they're obeying these best practices-- usually 1311 01:02:42,670 --> 01:02:44,740 can't tell you what your password actually 1312 01:02:44,740 --> 01:02:46,810 is if you forget your password. 1313 01:02:46,810 --> 01:02:49,930 If you forget your password, the company will let you reset your password. 1314 01:02:49,930 --> 01:02:52,242 They can update the data inside of the table. 1315 01:02:52,242 --> 01:02:53,950 But the company won't be able to tell you 1316 01:02:53,950 --> 01:02:57,760 what your password actually is because the company doesn't know your password. 1317 01:02:57,760 --> 01:03:00,460 The company only knows some hashed version 1318 01:03:00,460 --> 01:03:04,970 of the password, some result of passing that password through a hash function. 
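The hash-then-compare login check described above might look roughly like this, using only Python's standard library. The per-user salt, the PBKDF2 algorithm, the iteration count, and the function names `hash_password` and `verify_password` are assumptions added for the sketch; the lecture only says that a one-way hash is stored and compared. In a Django project the framework's own user model handles this for you.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor; tune for your hardware


def hash_password(password, salt=None):
    """Return (salt, digest) for a password; only these are stored, never the password."""
    if salt is None:
        salt = secrets.token_bytes(16)  # random per-user salt (an assumption, not from the lecture)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(attempt, salt, stored_digest):
    """Hash the login attempt the same way and compare it against the stored hash."""
    _, candidate = hash_password(attempt, salt)
    # compare_digest takes the same time whether or not the hashes match
    return hmac.compare_digest(candidate, stored_digest)


# At sign-up: store salt and digest in the users table.
salt, stored = hash_password("12345")
```

A later call such as `verify_password("12345", salt, stored)` succeeds without the application ever keeping the original password, which is why a site following this practice can reset a password but cannot tell you what it was.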
1319 01:03:04,970 --> 01:03:07,870 And as a result, they're able to know whether you 1320 01:03:07,870 --> 01:03:10,600 logged in successfully or not with the correct credentials 1321 01:03:10,600 --> 01:03:14,000 without actually knowing what your password actually is. 1322 01:03:14,000 --> 01:03:15,940 And so this is another area where you might 1323 01:03:15,940 --> 01:03:19,270 imagine that, if you're not careful about how you're storing this data, 1324 01:03:19,270 --> 01:03:22,360 it could be a security vulnerability inside of your program 1325 01:03:22,360 --> 01:03:26,220 where, if ever that data is leaked, passwords suddenly become known. 1326 01:03:26,220 --> 01:03:29,890 And there are other more subtle ways that web applications could potentially 1327 01:03:29,890 --> 01:03:32,410 leak information that you, as the web developer, 1328 01:03:32,410 --> 01:03:34,330 need to decide if you're OK with or not. 1329 01:03:34,330 --> 01:03:37,570 Imagine a website, for example, where you do have a place where you can say, 1330 01:03:37,570 --> 01:03:39,700 if you forgot your password, you can be sent 1331 01:03:39,700 --> 01:03:43,173 to a place where you can reset your password, for example. 1332 01:03:43,173 --> 01:03:46,090 You might imagine that, if you type in your email address, click Reset 1333 01:03:46,090 --> 01:03:49,270 Password, you might get a message like, all right, password reset email 1334 01:03:49,270 --> 01:03:50,530 has been sent. 1335 01:03:50,530 --> 01:03:54,070 But you might imagine typing in an email address and getting something like, 1336 01:03:54,070 --> 01:03:57,400 error, there is no user with that email address. 1337 01:03:57,400 --> 01:04:00,250 And here, again, is a potential security vulnerability 1338 01:04:00,250 --> 01:04:02,320 in terms of leaked information. 
1339 01:04:02,320 --> 01:04:06,340 This page that just seems to send you an email if you forgot your password is 1340 01:04:06,340 --> 01:04:10,720 now leaking information about which users happened to have accounts 1341 01:04:10,720 --> 01:04:14,140 on your website and which users do not because all someone needs to do 1342 01:04:14,140 --> 01:04:18,100 is type in an email address and find out whether it results in an error or not 1343 01:04:18,100 --> 01:04:22,310 in order to know whether a user happens to have an account on the website 1344 01:04:22,310 --> 01:04:22,810 or not. 1345 01:04:22,810 --> 01:04:24,685 And maybe that's not a big deal if that's not 1346 01:04:24,685 --> 01:04:26,170 something you care about securing. 1347 01:04:26,170 --> 01:04:30,160 But if it's a website where you do care about making sure 1348 01:04:30,160 --> 01:04:32,650 that, if someone has an account or doesn't have an account, 1349 01:04:32,650 --> 01:04:35,350 that information is kept private and secure only to the user, 1350 01:04:35,350 --> 01:04:37,630 unless they want to share it, well, then this type 1351 01:04:37,630 --> 01:04:40,570 of page, this type of interface with the database 1352 01:04:40,570 --> 01:04:43,570 could potentially be leaking that kind of information. 1353 01:04:43,570 --> 01:04:46,120 And information can be leaked in all sorts of different ways. 1354 01:04:46,120 --> 01:04:48,700 You can even leak information just based on the time 1355 01:04:48,700 --> 01:04:52,780 it takes for the database to be able to respond to a particular request. 
1356 01:04:52,780 --> 01:04:55,450 You might imagine, if you make a request about a user, 1357 01:04:55,450 --> 01:04:58,180 and it takes longer to respond, that might tell you 1358 01:04:58,180 --> 01:05:01,150 something about the number of database queries it needs to run 1359 01:05:01,150 --> 01:05:04,210 or the amount of information that's stored about that user as opposed 1360 01:05:04,210 --> 01:05:06,200 to if a request takes less time. 1361 01:05:06,200 --> 01:05:09,850 So even something like how many milliseconds it takes for a web server 1362 01:05:09,850 --> 01:05:13,780 to respond to a request can reveal or leak information 1363 01:05:13,780 --> 01:05:16,720 about the data that is stored inside of the database. 1364 01:05:16,720 --> 01:05:19,750 And there have been examples of researchers who actually try and see 1365 01:05:19,750 --> 01:05:23,702 what information they can get just from looking at these kinds of information. 1366 01:05:23,702 --> 01:05:25,660 It doesn't seem like it would leak information, 1367 01:05:25,660 --> 01:05:29,580 but it might actually reveal information as well. 1368 01:05:29,580 --> 01:05:32,740 Now, another concern when dealing with SQL and databases we've talked about 1369 01:05:32,740 --> 01:05:34,707 is the context of SQL injection-- 1370 01:05:34,707 --> 01:05:36,790 this threat where, if you're not careful about how 1371 01:05:36,790 --> 01:05:40,090 it is that you run your SQL code, you could inadvertently 1372 01:05:40,090 --> 01:05:43,390 end up executing code that you don't mean to be executing. 1373 01:05:43,390 --> 01:05:46,390 Situations like here-- we're in a username and password field. 
1374 01:05:46,390 --> 01:05:48,010 We've seen this example before-- 1375 01:05:48,010 --> 01:05:50,620 where, if a user tries to log in, you might imagine a query 1376 01:05:50,620 --> 01:05:53,200 like this is run selecting from the users table 1377 01:05:53,200 --> 01:05:57,190 where user name equals whatever was typed in as the user name and password 1378 01:05:57,190 --> 01:05:59,800 equals whatever was typed in as the password. 1379 01:05:59,800 --> 01:06:04,200 And we saw how, for a normal user-- someone who types in Harry and 12345 1380 01:06:04,200 --> 01:06:06,970 as their username and password-- 1381 01:06:06,970 --> 01:06:09,380 that this type of query works just fine. 1382 01:06:09,380 --> 01:06:11,890 But if a hacker tries to log into a website 1383 01:06:11,890 --> 01:06:15,520 and maybe includes a double quotation mark and two hyphens, 1384 01:06:15,520 --> 01:06:18,640 for example, where two hyphens mean a comment in SQL, 1385 01:06:18,640 --> 01:06:22,760 and we were to literally substitute these values into our SQL queries, 1386 01:06:22,760 --> 01:06:27,010 well, then you might end up substituting hacker"--, 1387 01:06:27,010 --> 01:06:30,100 creating a comment that ignores the rest of this query, 1388 01:06:30,100 --> 01:06:33,640 effectively ignoring any kind of password checking that we might 1389 01:06:33,640 --> 01:06:35,560 want our web application to be doing. 1390 01:06:35,560 --> 01:06:37,390 So this, too-- another vulnerability that 1391 01:06:37,390 --> 01:06:40,570 comes about whenever we're dealing with executing 1392 01:06:40,570 --> 01:06:42,520 SQL code inside of a database. 1393 01:06:42,520 --> 01:06:44,860 And in order to deal with this, we want to make sure 1394 01:06:44,860 --> 01:06:48,640 that we're escaping any of these potentially dangerous characters that 1395 01:06:48,640 --> 01:06:50,710 might show up inside of our SQL queries.
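To make the injection concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table layout, the sample row, and the function names `login_unsafe` and `login_safe` are assumptions for illustration. The sketch uses single quotes, the standard SQL string delimiter, rather than the double quotation mark mentioned in the lecture, and it stores a plaintext password only to mirror the query being discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, password TEXT)")
# Plaintext password only to mirror the lecture's query; real apps store a hash.
conn.execute("INSERT INTO users (username, password) VALUES (?, ?)", ("harry", "12345"))


def login_unsafe(username, password):
    """Vulnerable: user input is pasted directly into the SQL text."""
    query = (
        "SELECT * FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone() is not None


def login_safe(username, password):
    """Safer: placeholders keep user input as data, never as SQL."""
    query = "SELECT * FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone() is not None


# The trailing -- comments out the password check in the unsafe version.
bypassed = login_unsafe("harry'--", "anything")
blocked = login_safe("harry'--", "anything")
```

With the unsafe function the attacker's input ends the string early and turns the rest of the query into a comment, so the password is never checked. With placeholders the same input is treated as an odd-looking username that matches nobody.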
1396 01:06:50,710 --> 01:06:52,870 And Django's models do this for us. 1397 01:06:52,870 --> 01:06:56,980 When we do these kinds of queries using Django, saying .objects.filter, 1398 01:06:56,980 --> 01:07:00,580 to be able to filter out for only certain versions of a particular model, 1399 01:07:00,580 --> 01:07:04,330 it is going to take care of the process of making sure that it's not subject 1400 01:07:04,330 --> 01:07:06,770 to these kinds of SQL injection attacks. 1401 01:07:06,770 --> 01:07:09,340 But if ever you're writing a web application that is directly 1402 01:07:09,340 --> 01:07:12,070 executing SQL code, which you might imagine doing, 1403 01:07:12,070 --> 01:07:14,080 you do want to be careful about making sure 1404 01:07:14,080 --> 01:07:16,240 that you're not exposing the application to be 1405 01:07:16,240 --> 01:07:20,070 vulnerable to these kinds of threats as well. 1406 01:07:20,070 --> 01:07:21,920 So those, then, are potential threats that come 1407 01:07:21,920 --> 01:07:24,935 about when we're just talking about what's happening on the server. 1408 01:07:24,935 --> 01:07:26,810 But we also can think about what might happen 1409 01:07:26,810 --> 01:07:28,700 when we're interacting with other servers-- 1410 01:07:28,700 --> 01:07:31,380 when we're interacting with APIs, for example. 1411 01:07:31,380 --> 01:07:33,770 So we talked about JavaScript and using JavaScript 1412 01:07:33,770 --> 01:07:37,400 to be able to make additional requests to APIs or to other services that 1413 01:07:37,400 --> 01:07:40,302 are able to return back with certain types of information. 1414 01:07:40,302 --> 01:07:42,260 And with APIs, there are a number of techniques 1415 01:07:42,260 --> 01:07:46,040 that we can use in APIs to allow them to be more scalable, to allow 1416 01:07:46,040 --> 01:07:48,290 them to be more secure.
1417 01:07:48,290 --> 01:07:50,780 One is this notion of rate limiting where 1418 01:07:50,780 --> 01:07:52,940 we might want to make sure that no user is 1419 01:07:52,940 --> 01:07:56,480 able to make more than a certain number of requests to an API 1420 01:07:56,480 --> 01:07:59,000 in any particular amount of time. 1421 01:07:59,000 --> 01:08:01,130 This is in response to a security threat that 1422 01:08:01,130 --> 01:08:03,440 has to do with the scalability of a system, which 1423 01:08:03,440 --> 01:08:06,560 is known as a DOS or Denial of Service Attack where, 1424 01:08:06,560 --> 01:08:09,920 effectively, if you just make a whole bunch of requests to a single server 1425 01:08:09,920 --> 01:08:13,543 over, and over, and over again, you could potentially shut down that system 1426 01:08:13,543 --> 01:08:15,710 because you're making so many requests that it's not 1427 01:08:15,710 --> 01:08:19,050 able to handle that many requests all at the same time. 1428 01:08:19,050 --> 01:08:22,310 And for that reason, because it's so easy to make an API request-- 1429 01:08:22,310 --> 01:08:27,170 you can do so using just a single line of Python or JavaScript, for example-- 1430 01:08:27,170 --> 01:08:29,840 APIs will often institute some kind of rate 1431 01:08:29,840 --> 01:08:32,960 limiting to limit the number of requests you can make so that you're not 1432 01:08:32,960 --> 01:08:35,630 going to overwhelm the server or overwhelm the database that 1433 01:08:35,630 --> 01:08:39,080 needs to be queried in order to respond to those requests. 1434 01:08:39,080 --> 01:08:42,229 And so this kind of limiting might work as well. 1435 01:08:42,229 --> 01:08:45,800 APIs might also want to add some kind of route authentication. 1436 01:08:45,800 --> 01:08:49,527 You might not want everybody to access the same data via an API. 
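The rate limiting described above can be sketched as a sliding-window counter kept per client. This is one of several possible designs, and the class name `RateLimiter`, its parameters, and the idea of keying by a client identifier are assumptions for illustration. A production system would normally keep these counters in shared storage rather than in one process's memory.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                    # injectable so tests can fake time
        self.history = defaultdict(deque)     # client id -> recent request times

    def allow(self, client_id):
        now = self.clock()
        timestamps = self.history[client_id]
        # Forget requests that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False                      # over the limit: reject this request
        timestamps.append(now)
        return True


limiter = RateLimiter(max_requests=100, window_seconds=60)
```

An API view would call `limiter.allow(client_id)` before doing any database work and return an error response when it comes back `False`, so a flood of requests from one client is turned away cheaply.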
1437 01:08:49,527 --> 01:08:51,319 Maybe there's some sort of permission model 1438 01:08:51,319 --> 01:08:54,800 where only certain users are able to access certain pieces of data 1439 01:08:54,800 --> 01:08:55,880 from the API. 1440 01:08:55,880 --> 01:09:00,290 So you might imagine that a user needs to have an API key, for example-- 1441 01:09:00,290 --> 01:09:03,830 effectively, a password that they need to pass around anytime 1442 01:09:03,830 --> 01:09:06,710 they're making an API request to your API 1443 01:09:06,710 --> 01:09:09,140 and that allows you to then be able to look at that key 1444 01:09:09,140 --> 01:09:12,390 and verify that they are who they say they are. 1445 01:09:12,390 --> 01:09:16,010 Now, with those API keys comes other potential security vulnerabilities 1446 01:09:16,010 --> 01:09:17,090 to be mindful of. 1447 01:09:17,090 --> 01:09:21,290 One is that, just as you should never be putting passwords inside of your source 1448 01:09:21,290 --> 01:09:23,899 code-- inside of your Git repository, for example-- 1449 01:09:23,899 --> 01:09:27,290 you likewise generally shouldn't be putting your API keys 1450 01:09:27,290 --> 01:09:31,700 inside of your web applications as well, inside of the source code of those web 1451 01:09:31,700 --> 01:09:34,069 applications, because then anyone who has access 1452 01:09:34,069 --> 01:09:36,020 to the source code for the web application 1453 01:09:36,020 --> 01:09:38,960 can see what your API key is, could then use 1454 01:09:38,960 --> 01:09:42,439 the API key to pretend to be you and, therefore, get access 1455 01:09:42,439 --> 01:09:46,609 to potential API routes that they should not be able to access. 
1456 01:09:46,609 --> 01:09:50,930 One common solution to this is to use what are known as environment variables 1457 01:09:50,930 --> 01:09:55,190 where, effectively, you in your program say that your API key is not 1458 01:09:55,190 --> 01:09:59,220 going to be some predetermined string that is in the text of your program 1459 01:09:59,220 --> 01:10:03,170 but instead is going to be drawn from the environment in which the program is 1460 01:10:03,170 --> 01:10:04,040 being run. 1461 01:10:04,040 --> 01:10:07,430 And then, on the server, when you're running the web application, 1462 01:10:07,430 --> 01:10:11,000 you'll first make sure the server has all of those environment variables set 1463 01:10:11,000 --> 01:10:16,400 correctly so that, rather than have the API key actually in the source 1464 01:10:16,400 --> 01:10:20,570 code of the program, the API key is simply in the environment on the server 1465 01:10:20,570 --> 01:10:22,340 where the web application is running. 1466 01:10:22,340 --> 01:10:25,370 And the server can just draw that information from the environment 1467 01:10:25,370 --> 01:10:29,720 so that it knows what the API key should be without the API key 1468 01:10:29,720 --> 01:10:34,590 actually having to be inside of the web application source code itself. 1469 01:10:34,590 --> 01:10:36,470 And so as we begin to deal with APIs, you 1470 01:10:36,470 --> 01:10:40,070 might notice that many APIs will require you to have an API key. 1471 01:10:40,070 --> 01:10:42,170 And often, it's for these sorts of reasons-- 1472 01:10:42,170 --> 01:10:45,310 to make sure that we're able to authenticate users effectively 1473 01:10:45,310 --> 01:10:48,560 and also to make sure that we're able to limit users to make sure that they're 1474 01:10:48,560 --> 01:10:51,140 not making too many requests to the server 1475 01:10:51,140 --> 01:10:54,170 or to the database at any particular time. 
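Reading an API key from the environment, as described above, can be as short as the following sketch. The variable name `API_KEY` and the helper `get_api_key` are hypothetical; use whatever name your deployment sets.

```python
import os


def get_api_key(name="API_KEY"):
    """Fetch the API key from the environment instead of hard-coding it."""
    key = os.environ.get(name)
    if not key:
        # Fail loudly at startup rather than sending unauthenticated requests.
        raise RuntimeError(f"environment variable {name} is not set")
    return key
```

On the server, the variable is set before the web application starts, so the key never appears in the source code or in the Git repository.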
1476 01:10:54,170 --> 01:10:57,440 But this, then, starts to get us into other potential vulnerabilities-- 1477 01:10:57,440 --> 01:11:00,470 in particular, vulnerabilities concerning JavaScript. 1478 01:11:00,470 --> 01:11:02,600 JavaScript, again, is a programming language 1479 01:11:02,600 --> 01:11:05,840 that we use in order to write code that runs inside of our web browser-- 1480 01:11:05,840 --> 01:11:08,730 a browser like Chrome, or Safari, or something like that. 1481 01:11:08,730 --> 01:11:14,210 And as a result, JavaScript has a lot of power to manipulate things on the page. 1482 01:11:14,210 --> 01:11:16,220 It can simulate the clicking of buttons. 1483 01:11:16,220 --> 01:11:20,120 It can change the content of what happens to be on any particular page. 1484 01:11:20,120 --> 01:11:22,370 And as a result, there are many, many vulnerabilities 1485 01:11:22,370 --> 01:11:26,750 that come about when it comes to thinking about JavaScript. 1486 01:11:26,750 --> 01:11:30,750 And one such vulnerability is this notion of cross-site scripting-- 1487 01:11:30,750 --> 01:11:33,380 that, in general, when on your web application, 1488 01:11:33,380 --> 01:11:37,760 you only want JavaScript to run if you, yourself have written it. 1489 01:11:37,760 --> 01:11:39,830 Cross-site scripting is a potential threat 1490 01:11:39,830 --> 01:11:45,050 where someone else might be able to get JavaScript code to run on your website 1491 01:11:45,050 --> 01:11:48,890 when it's JavaScript code that someone else wrote instead of you, yourself. 1492 01:11:48,890 --> 01:11:51,710 And this is a potential vulnerability because, if someone else can 1493 01:11:51,710 --> 01:11:55,280 write the JavaScript code, they can manipulate the contents of what 1494 01:11:55,280 --> 01:11:56,830 happens to be on your website. 1495 01:11:56,830 --> 01:11:59,300 They can potentially manipulate the user experience 1496 01:11:59,300 --> 01:12:02,260 to get a result that is not, actually, desired. 
1497 01:12:02,260 --> 01:12:06,860 So let's go ahead and take a look at one example of cross-site scripting. 1498 01:12:06,860 --> 01:12:09,770 All right, so I've prepared a web application in advance-- 1499 01:12:09,770 --> 01:12:14,900 it's called security-- inside of which is a single Django app called XSS, 1500 01:12:14,900 --> 01:12:16,590 for Cross-Site Scripting. 1501 01:12:16,590 --> 01:12:19,670 And inside of here, we'll first take a look at the URLs. 1502 01:12:19,670 --> 01:12:24,290 So there's a single URL that just allows us to provide any path. 1503 01:12:24,290 --> 01:12:27,330 And then it's going to load the Index view. 1504 01:12:27,330 --> 01:12:31,910 And on the Index view, we're going to display an HTTP response. 1505 01:12:31,910 --> 01:12:35,210 It says, here was the path that just happened to be requested. 1506 01:12:35,210 --> 01:12:37,910 So you might imagine this is a simplified version of what 1507 01:12:37,910 --> 01:12:41,240 you might see on other websites, for example, where websites might show you 1508 01:12:41,240 --> 01:12:45,170 on any particular page what path you're on in order to get to that page, 1509 01:12:45,170 --> 01:12:49,610 some indication of where you are inside of this web application. 1510 01:12:49,610 --> 01:12:53,150 So I'll go ahead and cd into security and run the server-- 1511 01:12:53,150 --> 01:12:57,640 python manage.py runserver. 1512 01:12:57,640 --> 01:12:59,320 So I am now running the server. 1513 01:12:59,320 --> 01:13:06,420 And now I'll go ahead and go into my web application, /hello, for example. 1514 01:13:06,420 --> 01:13:09,570 And so what I see here is the requested path hello, 1515 01:13:09,570 --> 01:13:11,230 which is what I would expect it to be. 1516 01:13:11,230 --> 01:13:13,960 I can change it to something else, like hi. 1517 01:13:13,960 --> 01:13:15,270 So here's requested path hi. 1518 01:13:15,270 --> 01:13:17,760 Here's hi/2, for example.
1519 01:13:17,760 --> 01:13:20,430 Whatever page I visit, it gives me a page 1520 01:13:20,430 --> 01:13:23,190 that says, requested path, and then whatever 1521 01:13:23,190 --> 01:13:25,770 path I happened to be visiting. 1522 01:13:25,770 --> 01:13:29,520 But watch what happens if I try and visit this URL instead. 1523 01:13:29,520 --> 01:13:39,600 I'm going to visit URL /script alert hi, and then end script. 1524 01:13:39,600 --> 01:13:40,650 So I run it. 1525 01:13:40,650 --> 01:13:44,990 And suddenly, an alert shows up on my page that says, hi. 1526 01:13:44,990 --> 01:13:45,850 And I press OK. 1527 01:13:45,850 --> 01:13:47,790 And it says, all right, requested path. 1528 01:13:47,790 --> 01:13:49,680 That alert was a JavaScript alert. 1529 01:13:49,680 --> 01:13:53,250 It was JavaScript code running on my web application. 1530 01:13:53,250 --> 01:13:56,940 But it was not code that was JavaScript code inside of my web application. 1531 01:13:56,940 --> 01:14:00,150 It was someone else who wrote based on the URL 1532 01:14:00,150 --> 01:14:03,780 to run particular JavaScript on my particular page. 1533 01:14:03,780 --> 01:14:06,120 And so someone linked to my web application 1534 01:14:06,120 --> 01:14:09,000 and passed in this script tag as part of the URL. 1535 01:14:09,000 --> 01:14:12,840 Someone who clicked on that link might have been taken to my web application 1536 01:14:12,840 --> 01:14:17,630 but ultimately had JavaScript run that was created by someone else. 1537 01:14:17,630 --> 01:14:19,980 And that, ultimately, is potentially dangerous. 1538 01:14:19,980 --> 01:14:22,440 It leaves open the possibility that someone else 1539 01:14:22,440 --> 01:14:24,990 could run JavaScript code on my page. 1540 01:14:24,990 --> 01:14:27,300 And it might not just be something like a script. 
1541 01:14:27,300 --> 01:14:29,940 You might imagine someone not just displaying an alert, 1542 01:14:29,940 --> 01:14:33,720 but modifying something inside of the DOM-- changing the contents of the web 1543 01:14:33,720 --> 01:14:36,960 page, making API requests, doing other types of tasks 1544 01:14:36,960 --> 01:14:39,870 that you can do using JavaScript inside of a web browser 1545 01:14:39,870 --> 01:14:44,580 that, ultimately, leave my page open to potential security vulnerabilities. 1546 01:14:44,580 --> 01:14:47,580 And so these are cases where it's important to be mindful of when you're 1547 01:14:47,580 --> 01:14:51,720 designing these pages, if ever there is a possibility that someone could inject 1548 01:14:51,720 --> 01:14:54,630 their own JavaScript into your page somehow, 1549 01:14:54,630 --> 01:14:57,780 you'll want to either detect that or escape it in some way. 1550 01:14:57,780 --> 01:15:02,025 Or take other precautions to make sure that this kind of cross-site scripting 1551 01:15:02,025 --> 01:15:03,150 isn't going to be possible. 1552 01:15:03,150 --> 01:15:06,240 You might imagine that, in a messaging application-- for example, 1553 01:15:06,240 --> 01:15:07,740 if you're messaging back and forth-- 1554 01:15:07,740 --> 01:15:10,282 you don't want it to be the case that, if you message someone 1555 01:15:10,282 --> 01:15:13,260 else some JavaScript code that, when they receive it, 1556 01:15:13,260 --> 01:15:16,380 that code actually ends up running as some JavaScript that 1557 01:15:16,380 --> 01:15:18,210 runs on that particular page. 1558 01:15:18,210 --> 01:15:20,450 You want to be sure to escape that information so 1559 01:15:20,450 --> 01:15:22,830 that they just see the text of the JavaScript code 1560 01:15:22,830 --> 01:15:25,430 but that the code isn't actually executed. 1561 01:15:25,430 --> 01:15:28,140 And this is a similar threat to that threat of SQL injection. 
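Escaping, as mentioned above, means converting characters such as `<` and `>` into HTML entities so the browser displays them as text instead of interpreting them as tags. The following is a minimal sketch using Python's standard library; `render_requested_path` is a hypothetical helper, not the view from the lecture's demo.

```python
from html import escape


def render_requested_path(path: str) -> str:
    """Build the response text with the user-supplied path escaped.

    Hypothetical helper: the path comes straight from the URL, so it is
    treated as untrusted input and escaped before being placed in HTML.
    """
    return f"Requested Path: {escape(path)}"


# A path containing a script tag is rendered as harmless text.
print(render_requested_path("<script>alert('hi')</script>"))
```

In a Django project, rendering the value through a template instead of concatenating it into a response string gives similar protection, since Django templates escape variables by default.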
1562 01:15:28,140 --> 01:15:30,480 It all comes back to the idea of not wanting 1563 01:15:30,480 --> 01:15:33,120 to allow someone else to be able to inject 1564 01:15:33,120 --> 01:15:35,280 their own code into your program. 1565 01:15:35,280 --> 01:15:39,540 You don't want someone else to be able to inject SQL code into the queries you 1566 01:15:39,540 --> 01:15:40,770 run on your database. 1567 01:15:40,770 --> 01:15:44,640 And you don't want someone to be able to inject JavaScript code into your web 1568 01:15:44,640 --> 01:15:49,850 page because that leaves open potential security vulnerabilities as well. 1569 01:15:49,850 --> 01:15:51,882 One type of security vulnerability that Django 1570 01:15:51,882 --> 01:15:54,590 is quite good at defending against is one that we've seen before, 1571 01:15:54,590 --> 01:15:57,470 but we'll explore in more detail how it might work. 1572 01:15:57,470 --> 01:16:00,530 And it's this idea of cross-site request forgery where 1573 01:16:00,530 --> 01:16:05,270 you fake a request to a website when you didn't intend to actually make 1574 01:16:05,270 --> 01:16:07,020 a request to that website. 1575 01:16:07,020 --> 01:16:10,830 So you might imagine that, if your bank, for example, 1576 01:16:10,830 --> 01:16:12,982 had a URL that allowed you to transfer money 1577 01:16:12,982 --> 01:16:14,690 from one person to another person-- we've 1578 01:16:14,690 --> 01:16:16,430 talked about this idea a little bit. 1579 01:16:16,430 --> 01:16:20,480 But imagine now how you could implement this if it really was just a URL. 1580 01:16:20,480 --> 01:16:24,740 You could go to /transfer and say, as get parameters, 1581 01:16:24,740 --> 01:16:26,060 who am I transferring money to? 1582 01:16:26,060 --> 01:16:27,950 And what is the amount that I'm transferring? 
1583 01:16:27,950 --> 01:16:32,120 Then someone else on some other website could, in the body of their page, 1584 01:16:32,120 --> 01:16:35,270 just have a link where that link says, click here. 1585 01:16:35,270 --> 01:16:37,460 And it links to your bank.com, or whatever 1586 01:16:37,460 --> 01:16:41,390 your bank is, transferring money to me in this amount. 1587 01:16:41,390 --> 01:16:44,720 And if some user unknowingly just clicked on that link not knowing 1588 01:16:44,720 --> 01:16:46,640 where it would take them, this website might 1589 01:16:46,640 --> 01:16:49,640 be able to forge a request to the bank-- make 1590 01:16:49,640 --> 01:16:52,070 it seem like the user had gone to the bank 1591 01:16:52,070 --> 01:16:54,350 and tried to initiate some kind of transfer 1592 01:16:54,350 --> 01:16:56,360 and, ultimately, tried to transfer money. 1593 01:16:56,360 --> 01:16:59,330 And it doesn't even necessarily need to be in a link. 1594 01:16:59,330 --> 01:17:03,230 How else might you get some new request to happen inside of the web browser? 1595 01:17:03,230 --> 01:17:05,690 You might imagine-- though it might seem a bit strange-- 1596 01:17:05,690 --> 01:17:08,450 to put this inside of an image. 1597 01:17:08,450 --> 01:17:13,250 Image source, the source of the image, is this particular URL-- 1598 01:17:13,250 --> 01:17:14,493 the bank's transfer page. 1599 01:17:14,493 --> 01:17:16,160 Now, that doesn't really make any sense. 1600 01:17:16,160 --> 01:17:17,840 The transfer page is not an image. 1601 01:17:17,840 --> 01:17:19,340 But it doesn't matter. 1602 01:17:19,340 --> 01:17:24,590 All an image tag is going to do is try to make a request to this source URL 1603 01:17:24,590 --> 01:17:28,527 to get that image and then try to display it in the user's web browser. 1604 01:17:28,527 --> 01:17:31,610 But the first part is what's important-- the fact that this source ends up 1605 01:17:31,610 --> 01:17:33,650 being requested by the web browser. 
1606 01:17:33,650 --> 01:17:36,380 Without the user having to click on or do anything, 1607 01:17:36,380 --> 01:17:40,850 they might try and request from your bank.com/transfer this particular 1608 01:17:40,850 --> 01:17:45,500 request, which might initiate some sort of bank transfer without the user even 1609 01:17:45,500 --> 01:17:46,580 realizing it. 1610 01:17:46,580 --> 01:17:49,160 And it's for that reason that we generally suggest that, 1611 01:17:49,160 --> 01:17:54,560 anytime you're creating a website that is going to allow for the manipulation 1612 01:17:54,560 --> 01:17:57,500 of some kind of state-- that allows for some change to happen, 1613 01:17:57,500 --> 01:17:59,210 something like transferring money-- 1614 01:17:59,210 --> 01:18:02,450 you don't want that to be a GET request, something that you could just 1615 01:18:02,450 --> 01:18:06,515 load in an image or load by clicking on a link that takes you to another page. 1616 01:18:06,515 --> 01:18:08,390 You don't want that to happen because then it 1617 01:18:08,390 --> 01:18:12,350 makes it very easy for someone else to fake a request to your page 1618 01:18:12,350 --> 01:18:16,790 by just creating an image or linking to, somehow, a website, 1619 01:18:16,790 --> 01:18:20,005 transferring funds from one user to another. 1620 01:18:20,005 --> 01:18:22,130 So a solution to this-- and we've talked about it-- 1621 01:18:22,130 --> 01:18:24,920 is that, generally, we only want POST requests 1622 01:18:24,920 --> 01:18:27,860 to be able to manipulate something inside of the database, 1623 01:18:27,860 --> 01:18:32,330 to be able to actually initiate a transfer from one user to another user. 1624 01:18:32,330 --> 01:18:35,210 But even then, this is not perfectly secure. 1625 01:18:35,210 --> 01:18:38,660 You could still be tricked into submitting a POST request.
1626 01:18:38,660 --> 01:18:42,320 Imagine an adversarial website that had a form like this-- 1627 01:18:42,320 --> 01:18:47,120 a form whose action was your bank.com/transfer and whose method was 1628 01:18:47,120 --> 01:18:48,200 post. 1629 01:18:48,200 --> 01:18:52,370 And now here-- two input fields whose type is hidden, meaning you 1630 01:18:52,370 --> 01:18:55,040 won't actually be able to see those input fields when 1631 01:18:55,040 --> 01:18:56,420 the user is looking at the page. 1632 01:18:56,420 --> 01:18:59,090 They'd only know about it if they inspected the source 1633 01:18:59,090 --> 01:19:03,120 code of this particular HTML page. 1634 01:19:03,120 --> 01:19:05,550 Here, there's a hidden input whose name is to, 1635 01:19:05,550 --> 01:19:07,840 meaning the person I'd like to transfer money to. 1636 01:19:07,840 --> 01:19:10,470 Here is the amount, the value that I would like to transfer. 1637 01:19:10,470 --> 01:19:14,153 And all the user is going to see is a button that says, click here. 1638 01:19:14,153 --> 01:19:17,320 They're not going to see either of the input fields, because they're hidden. 1639 01:19:17,320 --> 01:19:19,740 But if they do click the Click Here button, well, then 1640 01:19:19,740 --> 01:19:22,950 suddenly they're going to be submitting a post request to the bank 1641 01:19:22,950 --> 01:19:25,525 and initiating some transfer when they didn't intend to. 1642 01:19:25,525 --> 01:19:28,650 Now, maybe this seems like, oh, it's not a big deal, because the user still 1643 01:19:28,650 --> 01:19:29,850 needs to click a button. 1644 01:19:29,850 --> 01:19:31,767 And the user shouldn't be clicking on a button 1645 01:19:31,767 --> 01:19:33,990 if they don't know what the button is going to do. 
1646 01:19:33,990 --> 01:19:38,280 Well, for one, it's probably reasonable to imagine that an adversary might 1647 01:19:38,280 --> 01:19:41,010 embed this button inside of a page where it looks totally 1648 01:19:41,010 --> 01:19:42,820 safe to be able to click on a button. 1649 01:19:42,820 --> 01:19:45,960 But moreover, the user doesn't even need to click on it in order 1650 01:19:45,960 --> 01:19:47,010 to submit the form. 1651 01:19:47,010 --> 01:19:49,170 We can just add a little bit of JavaScript. 1652 01:19:49,170 --> 01:19:52,710 You might imagine that an adversary could do something like this. 1653 01:19:52,710 --> 01:19:55,560 Add an onload attribute to the body that says, 1654 01:19:55,560 --> 01:19:59,250 when the body of the page is done loading, go to document.forms-- 1655 01:19:59,250 --> 01:20:01,680 meaning all of the forms for this web page. 1656 01:20:01,680 --> 01:20:04,590 Get the first one, and submit it. 1657 01:20:04,590 --> 01:20:06,320 Submit the form. 1658 01:20:06,320 --> 01:20:09,450 And what that's going to do is, even without the user doing anything-- 1659 01:20:09,450 --> 01:20:12,330 even without the user clicking on the Click Here button-- 1660 01:20:12,330 --> 01:20:15,420 as soon as this page is loaded, this form is going to submit, 1661 01:20:15,420 --> 01:20:19,050 submitting a POST request to the bank, and attempting to transfer funds 1662 01:20:19,050 --> 01:20:21,120 from one user to another user. 1663 01:20:21,120 --> 01:20:23,760 And so this is what we might call a cross-site request 1664 01:20:23,760 --> 01:20:29,220 forgery where some adversarial website has forged a request to our website. 1665 01:20:29,220 --> 01:20:32,870 And ideally, we wouldn't like for that to be able to happen. 1666 01:20:32,870 --> 01:20:35,030 So how do we guard against this?
1667 01:20:35,030 --> 01:20:39,780 Well, what Django allows us to do and a very common approach is to add a CSRF 1668 01:20:39,780 --> 01:20:42,390 token-- a Cross-Site Request Forgery token-- 1669 01:20:42,390 --> 01:20:46,320 that is going to be regenerated for every session 1670 01:20:46,320 --> 01:20:48,740 such that, only if that token is present, 1671 01:20:48,740 --> 01:20:51,610 will the transfer be able to go through. 1672 01:20:51,610 --> 01:20:57,360 So on our website, we can include the CSRF token inside of this HTML form 1673 01:20:57,360 --> 01:21:00,510 and, as a result, make sure that we're able to transfer money only 1674 01:21:00,510 --> 01:21:02,650 when the CSRF token is present. 1675 01:21:02,650 --> 01:21:05,220 But if some other website tries to forge a request, 1676 01:21:05,220 --> 01:21:07,710 they won't know what the CSRF token should be 1677 01:21:07,710 --> 01:21:09,840 because it changes for every session. 1678 01:21:09,840 --> 01:21:14,730 And therefore, they won't be able to actually forge a request from one user 1679 01:21:14,730 --> 01:21:16,510 to another. 1680 01:21:16,510 --> 01:21:19,590 So all across the various different tools and technologies 1681 01:21:19,590 --> 01:21:20,340 we've been using-- 1682 01:21:20,340 --> 01:21:25,710 Python, HTTP, Django, HTML in terms of creating these web 1683 01:21:25,710 --> 01:21:27,990 applications using JavaScript, and the APIs 1684 01:21:27,990 --> 01:21:29,460 that we might be interacting with-- 1685 01:21:29,460 --> 01:21:31,710 there are security considerations all throughout. 1686 01:21:31,710 --> 01:21:33,623 We've only touched on a couple of them here. 
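The CSRF-token idea described above can be sketched with Python's standard library. This is a simplified illustration of the general technique and not Django's actual implementation; the function names `new_csrf_token` and `is_valid_csrf` are assumptions made for the example.

```python
import hmac
import secrets


def new_csrf_token() -> str:
    """Generate an unpredictable token to store in the user's session
    and embed as a hidden field in forms served by our own site."""
    return secrets.token_urlsafe(32)


def is_valid_csrf(session_token: str, submitted_token: str) -> bool:
    """Accept a state-changing request only if the submitted token
    matches the one stored in the session."""
    if not session_token or not submitted_token:
        return False
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(
        session_token.encode("utf-8"), submitted_token.encode("utf-8")
    )
```

A forged form on another site cannot read the token stored for the user's session, so the value it submits (if any) fails the check. In Django itself, this is handled by the built-in CSRF middleware together with the `{% csrf_token %}` template tag placed inside the form.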
1687 01:21:33,623 --> 01:21:36,540 But it just goes to show how it's important to be mindful as you think 1688 01:21:36,540 --> 01:21:39,790 about the practice of web programming, thinking about what you're going to add 1689 01:21:39,790 --> 01:21:42,960 to your web applications and what features your web application supports, 1690 01:21:42,960 --> 01:21:46,260 to think about what the potential vulnerabilities there are as well-- 1691 01:21:46,260 --> 01:21:49,920 how someone might exploit your web application in order to do something 1692 01:21:49,920 --> 01:21:51,690 with it that they probably shouldn't. 1693 01:21:51,690 --> 01:21:54,450 And as you take your web applications from applications 1694 01:21:54,450 --> 01:21:57,015 that are just running on your own local computer 1695 01:21:57,015 --> 01:21:59,940 to applications that are running in some web server 1696 01:21:59,940 --> 01:22:02,130 that many people are starting to use, these 1697 01:22:02,130 --> 01:22:04,420 are the types of questions to start to be asking. 1698 01:22:04,420 --> 01:22:07,740 How can you make sure that your web application is scalable? 1699 01:22:07,740 --> 01:22:11,740 How can you make sure that your web application is secure? 1700 01:22:11,740 --> 01:22:15,392 So now that we've explored that-- a lot of web programming-- what comes next? 1701 01:22:15,392 --> 01:22:17,850 In this course, we've explored a number of different tools, 1702 01:22:17,850 --> 01:22:19,470 and technologies, and languages. 1703 01:22:19,470 --> 01:22:21,540 But there are many other web frameworks and ways 1704 01:22:21,540 --> 01:22:23,850 you can build web applications as well. 1705 01:22:23,850 --> 01:22:26,220 We spent most of our time looking at the Django web 1706 01:22:26,220 --> 01:22:27,580 framework, written in Python. 1707 01:22:27,580 --> 01:22:29,430 But you can use other programming languages 1708 01:22:29,430 --> 01:22:31,560 to build web applications as well. 
1709 01:22:31,560 --> 01:22:34,980 Express.js, for example, is a very popular JavaScript framework 1710 01:22:34,980 --> 01:22:36,480 for building web applications. 1711 01:22:36,480 --> 01:22:41,390 Ruby on Rails is a popular server-side web framework built using Ruby. 1712 01:22:41,390 --> 01:22:43,020 And there are many others as well. 1713 01:22:43,020 --> 01:22:44,730 And there are also client-side frameworks 1714 01:22:44,730 --> 01:22:48,540 used primarily with JavaScript to be able to build user interfaces. 1715 01:22:48,540 --> 01:22:51,750 We've seen a little bit of React to build dynamic and interactive user 1716 01:22:51,750 --> 01:22:52,620 interfaces. 1717 01:22:52,620 --> 01:22:56,490 Other popular client-side frameworks include Angular JS, and Vue.js, 1718 01:22:56,490 --> 01:22:58,343 and a number of others as well. 1719 01:22:58,343 --> 01:23:00,510 And then, once you've built these web applications-- 1720 01:23:00,510 --> 01:23:03,600 using any of these server-side frameworks and client-side frameworks-- 1721 01:23:03,600 --> 01:23:06,360 then you might imagine wanting to take these applications 1722 01:23:06,360 --> 01:23:07,645 and deploy them to the web. 1723 01:23:07,645 --> 01:23:10,020 And to do that, there are a number of ways we can do this 1724 01:23:10,020 --> 01:23:13,950 as well-- a number of different services including Amazon Web Services, AWS, 1725 01:23:13,950 --> 01:23:17,730 Google Cloud, and Microsoft Azure that can be used in order to deploy 1726 01:23:17,730 --> 01:23:19,530 these web applications. 1727 01:23:19,530 --> 01:23:22,320 Heroku is a service that uses AWS and tries 1728 01:23:22,320 --> 01:23:26,100 to simplify the process of making it easier to deploy your web applications.
1729 01:23:26,100 --> 01:23:29,340 And if your web application is really just static-- it's just HTML, 1730 01:23:29,340 --> 01:23:33,300 and CSS, and JavaScript-- well, then you can use something like GitHub Pages 1731 01:23:33,300 --> 01:23:37,945 to be able to host a web application for free on GitHub's own servers instead. 1732 01:23:37,945 --> 01:23:41,070 And there are many other ways you can imagine deploying web applications as 1733 01:23:41,070 --> 01:23:43,395 well-- different services that you can use in order 1734 01:23:43,395 --> 01:23:46,020 to take the web applications that you have been building or web 1735 01:23:46,020 --> 01:23:47,940 applications you might build in the future 1736 01:23:47,940 --> 01:23:52,870 and make them available on the internet for others to be able to use as well. 1737 01:23:52,870 --> 01:23:56,550 So as we look back on the various topics within web programming we've explored, 1738 01:23:56,550 --> 01:23:58,690 we've seen a lot of tools and technologies 1739 01:23:58,690 --> 01:24:02,760 we can use that we can leverage in order to build interesting web applications. 1740 01:24:02,760 --> 01:24:06,930 We started by taking a closer look at HTML and CSS, 1741 01:24:06,930 --> 01:24:10,080 diving into how we can use that to describe the structure of our page, 1742 01:24:10,080 --> 01:24:12,210 and then taking advantage of tools like Sass 1743 01:24:12,210 --> 01:24:15,570 that allow us to generate CSS that allows for much more 1744 01:24:15,570 --> 01:24:18,270 complex styling for our website that would have been much more 1745 01:24:18,270 --> 01:24:21,090 difficult to do with just CSS alone.
1746 01:24:21,090 --> 01:24:24,240 As we started to build larger web applications, we took a look at Git-- 1747 01:24:24,240 --> 01:24:26,610 version control tools that we can use in order 1748 01:24:26,610 --> 01:24:29,370 to make sure that we keep track of versions and changes we 1749 01:24:29,370 --> 01:24:33,240 make to our code, allowing multiple people to collaborate on a project 1750 01:24:33,240 --> 01:24:34,547 simultaneously. 1751 01:24:34,547 --> 01:24:37,380 We then took a look at Python, looking at various different features 1752 01:24:37,380 --> 01:24:40,697 that the language offered-- functions, and conditions, and loops, 1753 01:24:40,697 --> 01:24:42,780 as we've seen in many other programming languages. 1754 01:24:42,780 --> 01:24:45,210 But also object-oriented programming-- the ability 1755 01:24:45,210 --> 01:24:47,700 to represent objects, and methods, and functions 1756 01:24:47,700 --> 01:24:49,950 that operate on those particular objects, which 1757 01:24:49,950 --> 01:24:53,940 prove especially powerful in the context of dealing with data inside of our web 1758 01:24:53,940 --> 01:24:55,380 applications. 1759 01:24:55,380 --> 01:24:58,500 Django was the example of a web framework written in Python 1760 01:24:58,500 --> 01:25:00,510 that we used to very quickly be able to start up 1761 01:25:00,510 --> 01:25:04,500 a web application, that's able to listen for requests, and make responses. 1762 01:25:04,500 --> 01:25:06,600 Django has a whole lot of features built in that 1763 01:25:06,600 --> 01:25:10,072 really make it easy to get started with building a web application. 1764 01:25:10,072 --> 01:25:12,030 And in particular, it makes it easy for writing 1765 01:25:12,030 --> 01:25:14,260 web applications that deal with data. 1766 01:25:14,260 --> 01:25:16,860 So Django allows us the ability to build models 1767 01:25:16,860 --> 01:25:20,760 that interact with SQL without us having to actually write any SQL code. 
1768 01:25:20,760 --> 01:25:25,320 Django can generate the SQL for us just using these models and migrations that 1769 01:25:25,320 --> 01:25:29,020 allow us to continually apply changes that we make to our database. 1770 01:25:29,020 --> 01:25:33,330 As we add new tables, add and modify existing fields on those tables, 1771 01:25:33,330 --> 01:25:36,065 Django can take care of all of that. 1772 01:25:36,065 --> 01:25:38,190 After that, as you'll recall, we took our attention 1773 01:25:38,190 --> 01:25:40,440 towards the second of the main programming languages 1774 01:25:40,440 --> 01:25:44,950 in the course, JavaScript, which has a lot of uses and is very, very popular. 1775 01:25:44,950 --> 01:25:46,920 But we primarily use it on the client side 1776 01:25:46,920 --> 01:25:50,460 to be able to build interesting user interfaces-- using JavaScript 1777 01:25:50,460 --> 01:25:52,680 to manipulate the DOM, the structure of the page, 1778 01:25:52,680 --> 01:25:54,930 to change what it is the user sees. 1779 01:25:54,930 --> 01:25:56,850 And also to add event handling-- so that when 1780 01:25:56,850 --> 01:25:59,880 the user clicks on a button, when the user hovers over something, when 1781 01:25:59,880 --> 01:26:02,550 the user interacts with the page in some sort of way, 1782 01:26:02,550 --> 01:26:04,590 our code is able to respond to it. 1783 01:26:04,590 --> 01:26:09,540 And we saw React, a client-side framework that uses JavaScript in order 1784 01:26:09,540 --> 01:26:13,470 to allow us to create really interesting and interactive user interfaces 1785 01:26:13,470 --> 01:26:15,893 with not all that much code at all. 
1786 01:26:15,893 --> 01:26:18,060 And then, finally, in these last couple of lectures, 1787 01:26:18,060 --> 01:26:21,350 we've been looking at some best practices-- how we can design tests, 1788 01:26:21,350 --> 01:26:23,520 tests that test the server, but also the client 1789 01:26:23,520 --> 01:26:25,800 to make sure that our code is working appropriately, 1790 01:26:25,800 --> 01:26:28,860 and also some industry practices like continuous integration 1791 01:26:28,860 --> 01:26:31,140 and continuous delivery that just help to make sure 1792 01:26:31,140 --> 01:26:34,740 that, as we make changes to our code, we're able to deploy and deliver them 1793 01:26:34,740 --> 01:26:37,050 rapidly and effectively and make sure that we're 1794 01:26:37,050 --> 01:26:39,630 able to make incremental changes to our code base 1795 01:26:39,630 --> 01:26:42,460 rather than need to wait on longer release cycles. 1796 01:26:42,460 --> 01:26:44,520 And then finally, today, we've been talking 1797 01:26:44,520 --> 01:26:47,820 about issues about scalability and security, especially important 1798 01:26:47,820 --> 01:26:50,880 as we begin to take our applications and move them to the web. 1799 01:26:50,880 --> 01:26:53,562 We want to make sure that these applications are scalable, 1800 01:26:53,562 --> 01:26:55,770 that they're able to handle multiple different users, 1801 01:26:55,770 --> 01:26:57,720 and also to make sure that they're secure-- 1802 01:26:57,720 --> 01:27:01,050 that we're not exposing ourselves to potential vulnerabilities like someone 1803 01:27:01,050 --> 01:27:05,370 who might inject SQL or inject JavaScript code into our pages 1804 01:27:05,370 --> 01:27:08,730 or who might try to access some data that they're not supposed to access. 1805 01:27:08,730 --> 01:27:12,420 We want to make sure that, when we go about designing these web applications, 1806 01:27:12,420 --> 01:27:17,330 we're able to do so in a scalable and, ultimately, in a secure way.
1807 01:27:17,330 --> 01:27:19,080 So hopefully, you enjoyed this exploration 1808 01:27:19,080 --> 01:27:21,747 into the world of web programming with Python and JavaScript. 1809 01:27:21,747 --> 01:27:23,580 Best of luck with the web programs that you, 1810 01:27:23,580 --> 01:27:26,130 yourself might build with the tools we've seen here today, 1811 01:27:26,130 --> 01:27:29,310 and also other tools that are inspired by or use similar tools 1812 01:27:29,310 --> 01:27:32,130 and techniques and ideas as the things that we've ultimately 1813 01:27:32,130 --> 01:27:32,880 talked about here. 1814 01:27:32,880 --> 01:27:35,672 A big thanks to the course's teaching staff and the production team 1815 01:27:35,672 --> 01:27:37,255 for making this entire class possible. 1816 01:27:37,255 --> 01:27:39,130 I look forward to seeing the web applications 1817 01:27:39,130 --> 01:27:40,620 that you might go on to create. 1818 01:27:40,620 --> 01:27:45,110 This was Web Programming with Python and JavaScript.