1 00:00:00,000 --> 00:00:02,500 [MUSIC PLAYING] 2 00:00:02,500 --> 00:00:49,539 3 00:00:49,539 --> 00:00:52,580 [MUSIC - FERGIE & Q TIP & GOONROCK, "A LITTLE PARTY NEVER KILLED NOBODY"] 4 00:00:52,580 --> 00:00:54,746 FERGIE & Q-TIP & GOONROCK: (SINGING) Just one night, 5 00:00:54,746 --> 00:00:57,837 all we got, just one nigh, all we got, just one night, all we 6 00:00:57,837 --> 00:00:59,538 got, just one night, all we got. 7 00:00:59,538 --> 00:01:03,340 A little party never killed nobody. 8 00:01:03,340 --> 00:01:05,972 So we going to dance until we drop, drop, drop. 9 00:01:05,972 --> 00:01:10,346 Mm, a little party never killed nobody. 10 00:01:10,346 --> 00:01:13,535 Right here, right now, is all we got. skeeten-bod-op-bop. 11 00:01:13,535 --> 00:01:16,020 [MUSIC - FERGIE & Q TIP & 12 00:01:16,020 --> 00:01:28,776 GOONROCK, "A LITTLE PARTY NEVER KILLED NOBODY"] 13 00:01:28,776 --> 00:01:32,145 (SINGING) A little party never killed nobody. 14 00:01:32,145 --> 00:01:34,000 So we're going to dance until we drop. 15 00:01:34,000 --> 00:01:35,500 Let's go. 16 00:01:35,500 --> 00:01:39,680 A little party never killed nobody. 17 00:01:39,680 --> 00:01:42,312 Right here, right now is all we got. 18 00:01:42,312 --> 00:01:46,780 Oh, a little party never killed nobody. 19 00:01:46,780 --> 00:01:50,591 So we're going to dance until we drop. 20 00:01:50,591 --> 00:01:53,573 A little party never killed nobody. 21 00:01:53,573 --> 00:01:57,549 22 00:01:57,549 --> 00:02:00,034 [MUSIC - FERGIE & Q TIP & 23 00:02:00,034 --> 00:02:17,926 GOONROCK, "A LITTLE PARTY NEVER KILLED NOBODY"] 24 00:02:17,926 --> 00:02:20,411 MUPPET: [SNORING] 25 00:02:20,411 --> 00:02:23,396 26 00:02:23,396 --> 00:02:24,850 DAVID J. MALAN: All right. 27 00:02:24,850 --> 00:02:27,940 This is CS50, and this is lecture 8. 28 00:02:27,940 --> 00:02:31,150 And that was a look at the so-called CS50 hackathon, this annual tradition 29 00:02:31,150 --> 00:02:35,440 with which we nearly end the semester that will start around 7:00 PM here 30 00:02:35,440 --> 00:02:39,820 in Cambridge, will then end around 7:00 AM the next day in Cambridge. 31 00:02:39,820 --> 00:02:42,460 And punctuating the evening will be quite a bit of work 32 00:02:42,460 --> 00:02:44,680 on final projects, which is the overarching goal of the evening 33 00:02:44,680 --> 00:02:46,510 to give students, both an academic and a very 34 00:02:46,510 --> 00:02:48,730 social and collaborative environment in which to tackle 35 00:02:48,730 --> 00:02:50,410 their final achievements in the course. 36 00:02:50,410 --> 00:02:53,830 But also, as you gleaned there will be a bit of distraction, a bit of food 37 00:02:53,830 --> 00:02:58,210 from Felipe's, typically, around 9:00 PM, a little bit of Domino's Pizza 38 00:02:58,210 --> 00:02:59,200 around 1:00 AM. 39 00:02:59,200 --> 00:03:01,910 And for those still standing as in the muppet there, 40 00:03:01,910 --> 00:03:05,650 we'll treat you to breakfast at IHOP if you still have the energy. 41 00:03:05,650 --> 00:03:08,950 Thereafter, we'll be the so-called CS50 Fair, the climax of everything 42 00:03:08,950 --> 00:03:10,000 that you've done in CS50. 43 00:03:10,000 --> 00:03:11,890 And more on that in the weeks to come. 44 00:03:11,890 --> 00:03:15,199 So without further ado, where have we been these past several weeks? 45 00:03:15,199 --> 00:03:17,240 Well, recall that, over the past couple of weeks, 46 00:03:17,240 --> 00:03:19,000 we've been introducing web programming. 47 00:03:19,000 --> 00:03:22,221 And most recently, did we try to tie together a lot of the topics 48 00:03:22,221 --> 00:03:23,220 from the past few weeks. 49 00:03:23,220 --> 00:03:27,439 HTML and CSS and JavaScript, and then Python, and then another framework, 50 00:03:27,439 --> 00:03:29,230 another piece of software called Flask that 51 00:03:29,230 --> 00:03:32,860 just made it easier and more possible to build web-based applications. 52 00:03:32,860 --> 00:03:35,290 And the simplest example of that last week, recall, 53 00:03:35,290 --> 00:03:36,850 looked a little something like this. 54 00:03:36,850 --> 00:03:40,200 And this past week with the problems on similarities and survey, 55 00:03:40,200 --> 00:03:42,940 have you been building your own web-based applications. 56 00:03:42,940 --> 00:03:45,100 But they haven't quite had all of the pieces 57 00:03:45,100 --> 00:03:47,860 that you might want to assemble into a web-based app. 58 00:03:47,860 --> 00:03:50,061 And so today, we'll fill in those final blanks. 59 00:03:50,061 --> 00:03:52,810 But recall that we've been trying to frame these applications, not 60 00:03:52,810 --> 00:03:55,540 as one-offs or just tools that you built, but as part 61 00:03:55,540 --> 00:03:57,337 of a common paradigm or a methodology. 62 00:03:57,337 --> 00:03:59,920 And indeed, when you're learning computer science and software 63 00:03:59,920 --> 00:04:03,040 engineering, you start to notice patterns in the software, 64 00:04:03,040 --> 00:04:04,570 in the code that you're writing. 65 00:04:04,570 --> 00:04:07,361 And humans tend to adopt these patterns because they save you time. 66 00:04:07,361 --> 00:04:09,890 And then everyone can speak the same language, so to speak. 67 00:04:09,890 --> 00:04:14,305 So MVC was the acronym we introduced last week, Model View Controller. 68 00:04:14,305 --> 00:04:16,180 And that just speaks to this paradigm whereby 69 00:04:16,180 --> 00:04:19,720 you organized certain type of code in one file, certain types of other code 70 00:04:19,720 --> 00:04:20,600 and another file. 71 00:04:20,600 --> 00:04:22,840 So your Python code goes in application.py. 72 00:04:22,840 --> 00:04:27,290 Your HTML goes in your HTML files, your CSS in your CSS files, and so forth. 73 00:04:27,290 --> 00:04:30,550 But what we didn't have last week was this thing here, really. 74 00:04:30,550 --> 00:04:32,860 Model generally refers to your data. 75 00:04:32,860 --> 00:04:35,650 And while survey, your most recent problem, 76 00:04:35,650 --> 00:04:38,800 did have data, a CSV file that's arguably a model, 77 00:04:38,800 --> 00:04:40,420 CSV is not terribly expressive. 78 00:04:40,420 --> 00:04:45,321 You can pretty much just write rows to a text file, not unlike a spreadsheet. 79 00:04:45,321 --> 00:04:46,570 But you can't really query it. 80 00:04:46,570 --> 00:04:48,636 You can't easily insert or delete. 81 00:04:48,636 --> 00:04:51,010 You pretty much would have to reconstruct the whole file. 82 00:04:51,010 --> 00:04:53,440 And indeed, that's exactly what I did back in the day. 83 00:04:53,440 --> 00:04:57,700 The very first web-based application I wrote back in sophomore or junior year 84 00:04:57,700 --> 00:04:59,530 was the freshman intramural website. 85 00:04:59,530 --> 00:05:04,120 And all I had available to me, both technologically and conceptually, 86 00:05:04,120 --> 00:05:05,044 we're CSV files. 87 00:05:05,044 --> 00:05:07,960 I had no idea what databases were, didn't realize how much easier they 88 00:05:07,960 --> 00:05:08,920 could make my life. 89 00:05:08,920 --> 00:05:12,010 So I stored all of the data behind these links here just 90 00:05:12,010 --> 00:05:13,690 in very simple text files. 91 00:05:13,690 --> 00:05:16,270 But today, will give you so much more power. 92 00:05:16,270 --> 00:05:19,060 And especially, if you're coming into the class with an interest 93 00:05:19,060 --> 00:05:22,810 in applying CS to other fields in medicine or the arts 94 00:05:22,810 --> 00:05:25,270 or any world in which there's data, particularly, 95 00:05:25,270 --> 00:05:27,640 in STEM and data science and the like, realize 96 00:05:27,640 --> 00:05:32,590 that SQL has really become this incredibly powerful language 97 00:05:32,590 --> 00:05:35,680 with which to solve problems in those and so many domains. 98 00:05:35,680 --> 00:05:38,050 And ultimately, what your build, this coming week-- 99 00:05:38,050 --> 00:05:41,380 your very last CS50 problem set-- 100 00:05:41,380 --> 00:05:44,080 you have now peaked just about-- will be CD50 101 00:05:44,080 --> 00:05:46,990 Finance, a web-based application by which you can buy, 102 00:05:46,990 --> 00:05:49,279 or "buy" and sell stocks. 103 00:05:49,279 --> 00:05:51,820 And so what you're going to do this coming week is write code 104 00:05:51,820 --> 00:05:54,970 that implements a web application that resembles this whereby 105 00:05:54,970 --> 00:05:58,270 your users and yourself will be able to register for an account, 106 00:05:58,270 --> 00:05:59,620 log into that account. 107 00:05:59,620 --> 00:06:03,370 You'll be given for free, thanks to us, 10,000 virtual dollars. 108 00:06:03,370 --> 00:06:06,790 And then you can go ahead and buy and sell stocks via their symbol 109 00:06:06,790 --> 00:06:08,030 using this application. 110 00:06:08,030 --> 00:06:10,620 So for instance, if I go ahead and register myself here. 111 00:06:10,620 --> 00:06:14,425 Let's say, username Malan, password 12345-- 112 00:06:14,425 --> 00:06:15,850 shouldn't have said that. 113 00:06:15,850 --> 00:06:17,530 Go ahead and register here. 114 00:06:17,530 --> 00:06:20,805 You'll see that, by default, I get this free $10,000 in virtual cash. 115 00:06:20,805 --> 00:06:22,930 You know what, I'm going to go ahead and buy myself 116 00:06:22,930 --> 00:06:26,500 a share of maybe Netflix, whose symbol is NFLX. 117 00:06:26,500 --> 00:06:28,570 I'm going to go ahead and try to buy one share. 118 00:06:28,570 --> 00:06:29,230 Voila. 119 00:06:29,230 --> 00:06:34,240 And now you'll see that, not only do I have less cash left, about $9,600, 120 00:06:34,240 --> 00:06:35,710 I now own a share of stock. 121 00:06:35,710 --> 00:06:38,470 And if we reload this throughout the lecture, frankly, 122 00:06:38,470 --> 00:06:40,630 the price might go a little up or a little down 123 00:06:40,630 --> 00:06:42,740 because this web application you'll build 124 00:06:42,740 --> 00:06:46,750 will also talk to an API, an Application Programming Interface-- 125 00:06:46,750 --> 00:06:50,020 so some third-party website that provides you with stock data. 126 00:06:50,020 --> 00:06:52,060 So you ask it for the current price of a stock. 127 00:06:52,060 --> 00:06:54,640 It gives you an answer in JSON format, if you 128 00:06:54,640 --> 00:06:56,260 recall our short discussion of that. 129 00:06:56,260 --> 00:06:58,799 And you'll integrate it, ultimately, into this interface. 130 00:06:58,799 --> 00:07:01,090 So this will be the culmination of so many of the ideas 131 00:07:01,090 --> 00:07:03,381 and the building blocks with which you've experimented. 132 00:07:03,381 --> 00:07:06,020 But we need a few final pieces today. 133 00:07:06,020 --> 00:07:10,120 So first and foremost, when I registered for this site, I created an account. 134 00:07:10,120 --> 00:07:12,200 And indeed, I can go ahead and log out now. 135 00:07:12,200 --> 00:07:14,320 And if I go ahead and log back in-- 136 00:07:14,320 --> 00:07:19,870 if no one's stolen my password yet, I can see my account and only my account. 137 00:07:19,870 --> 00:07:22,390 But we haven't yet seen any mechanism in code 138 00:07:22,390 --> 00:07:25,390 whereby you can implement this notion of logging in, right? 139 00:07:25,390 --> 00:07:30,370 What is the underlying logic via which a website knows that you are logged in 140 00:07:30,370 --> 00:07:32,409 or you're not logged in, right? 141 00:07:32,409 --> 00:07:34,700 Clearly, when I went to this website just a moment ago, 142 00:07:34,700 --> 00:07:38,320 it prompted me to log in because I was not because I had logged out. 143 00:07:38,320 --> 00:07:39,620 But how did it know that? 144 00:07:39,620 --> 00:07:43,750 And now that I am logged in, how does it know that I can click on "quote" 145 00:07:43,750 --> 00:07:47,710 to get just the stock quote or "buy" just to buy a stock or "sell." 146 00:07:47,710 --> 00:07:51,880 Why does it not prompt me every time I click a link to log back 147 00:07:51,880 --> 00:07:54,580 in with my username or password? 148 00:07:54,580 --> 00:07:55,360 What do you think? 149 00:07:55,360 --> 00:07:57,730 AUDIENCE: Isn't cookies little files that 150 00:07:57,730 --> 00:08:01,570 saves the data inside your browser's cache that let it know what's going on. 151 00:08:01,570 --> 00:08:02,110 DAVID J. MALAN: Yeah, cookies-- 152 00:08:02,110 --> 00:08:04,520 little files that get saved by websites on your computer. 153 00:08:04,520 --> 00:08:07,228 Now odds are, most everyone here has probably heard in some form, 154 00:08:07,228 --> 00:08:08,680 cookies-- bad. 155 00:08:08,680 --> 00:08:11,031 Maybe bad, privacy-invading, yes? 156 00:08:11,031 --> 00:08:11,530 OK. 157 00:08:11,530 --> 00:08:12,940 So true. 158 00:08:12,940 --> 00:08:15,820 But most of the web would not work without them. 159 00:08:15,820 --> 00:08:20,170 So someone else, what do you understand a cookie to be? 160 00:08:20,170 --> 00:08:21,040 What is a cookie? 161 00:08:21,040 --> 00:08:22,940 AUDIENCE: Basically saves your information 162 00:08:22,940 --> 00:08:24,782 so you don't have to put it in every time. 163 00:08:24,782 --> 00:08:25,657 DAVID J. MALAN: Yeah. 164 00:08:25,657 --> 00:08:28,550 It saves your information so you don't have to put it in every time. 165 00:08:28,550 --> 00:08:31,270 So in simplest form, it's exactly a combination of those answers. 166 00:08:31,270 --> 00:08:33,280 Whereby, when I log into a website, you could 167 00:08:33,280 --> 00:08:37,000 imagine that website just remembering my username and my password 168 00:08:37,000 --> 00:08:40,270 by just saving them in a little file on my Mac or PC 169 00:08:40,270 --> 00:08:41,980 so that the next time I visit a web page, 170 00:08:41,980 --> 00:08:44,974 it just automatically sends them for me, so that the human doesn't 171 00:08:44,974 --> 00:08:46,390 have to type them again and again. 172 00:08:46,390 --> 00:08:47,890 Now let's find fault with that. 173 00:08:47,890 --> 00:08:53,967 Feels like that would work, but what's a downside of that naive implementation? 174 00:08:53,967 --> 00:08:55,300 What do you think instinctively? 175 00:08:55,300 --> 00:08:56,810 AUDIENCE: The data is just out there? 176 00:08:56,810 --> 00:08:58,893 DAVID J. MALAN: The data is just out there, right? 177 00:08:58,893 --> 00:09:03,542 I have people in the office or at home or in a library 178 00:09:03,542 --> 00:09:06,250 that I might not want to just be able to walk up to that computer 179 00:09:06,250 --> 00:09:08,937 and just see this little text file planted on my computer. 180 00:09:08,937 --> 00:09:11,770 Because then they could log in to my account by just digging around. 181 00:09:11,770 --> 00:09:13,136 So it feels a little invasive. 182 00:09:13,136 --> 00:09:15,010 Indeed, that's one of the threats of cookies. 183 00:09:15,010 --> 00:09:17,950 So it turns out cookies or websites tend not to do that, at least, 184 00:09:17,950 --> 00:09:19,480 if they're implemented well. 185 00:09:19,480 --> 00:09:24,070 They instead just plant a really big random number on your hard drive 186 00:09:24,070 --> 00:09:26,170 or in your computer's ram or memory. 187 00:09:26,170 --> 00:09:30,100 So 1234567-- like, some really big number maybe with letters and numbers 188 00:09:30,100 --> 00:09:33,550 maybe even some punctuation that uniquely identifies me. 189 00:09:33,550 --> 00:09:37,330 And thereafter, any time I click a link on that same website, 190 00:09:37,330 --> 00:09:39,707 after logging in, my browser, thanks to HTTP-- 191 00:09:39,707 --> 00:09:42,790 the language that browsers and servers speak that we discussed a few weeks 192 00:09:42,790 --> 00:09:43,289 back-- 193 00:09:43,289 --> 00:09:46,420 just magically sends that same big random number 194 00:09:46,420 --> 00:09:48,490 to the server again and again. 195 00:09:48,490 --> 00:09:52,300 And so long as the server remembers that that big random number corresponds 196 00:09:52,300 --> 00:09:56,590 to user name Malan, it can figure out whose account to show. 197 00:09:56,590 --> 00:09:58,130 And why is it a big random number? 198 00:09:58,130 --> 00:10:03,160 Why is it not just my password or my username? 199 00:10:03,160 --> 00:10:05,440 Why am I proposing that it be a big random number? 200 00:10:05,440 --> 00:10:09,660 AUDIENCE: Sometimes people can guess it by just using a random generator. 201 00:10:09,660 --> 00:10:10,660 DAVID J. MALAN: Exactly. 202 00:10:10,660 --> 00:10:14,050 In computer science, randomness is this incredibly powerful ingredient. 203 00:10:14,050 --> 00:10:16,780 If you pick a big enough word or a big enough string 204 00:10:16,780 --> 00:10:20,914 or sentence, the probability that some random adversary, or bad person 205 00:10:20,914 --> 00:10:23,080 on the internet, is going to be able to guess or try 206 00:10:23,080 --> 00:10:26,944 to guess that value is just so low, it's just not realistically 207 00:10:26,944 --> 00:10:29,110 ever going to happen in your lifetime statistically. 208 00:10:29,110 --> 00:10:31,000 And so random this gives us that capability. 209 00:10:31,000 --> 00:10:33,310 And so you can think about this metaphorically in the real world 210 00:10:33,310 --> 00:10:34,450 as being like a hand stamp. 211 00:10:34,450 --> 00:10:38,020 If you've ever gone to an amusement park or a bar 212 00:10:38,020 --> 00:10:40,270 or a club where you have to show your ID or you 213 00:10:40,270 --> 00:10:42,820 have to pay for a ticket on the way in, sometimes they'll 214 00:10:42,820 --> 00:10:45,881 stamp you with either visible or invisible ink. 215 00:10:45,881 --> 00:10:48,130 And that's largely for efficiency so that, thereafter, 216 00:10:48,130 --> 00:10:50,421 when you're in the amusement park, you can come and go. 217 00:10:50,421 --> 00:10:52,169 And you don't have to repay or reprocess. 218 00:10:52,169 --> 00:10:55,210 If you're in the bar or the club, you don't have to keep showing your ID. 219 00:10:55,210 --> 00:10:57,820 They can check once, and then more efficiently let you come and go 220 00:10:57,820 --> 00:10:58,480 as you please. 221 00:10:58,480 --> 00:11:01,021 Because you're just presenting your hand or this virtual hand 222 00:11:01,021 --> 00:11:04,900 stamp to the bouncer or to the gatekeeper at those places. 223 00:11:04,900 --> 00:11:06,600 So cookies are exactly like that. 224 00:11:06,600 --> 00:11:09,850 Unbeknownst to us all this time, anytime you visit a website into which you've 225 00:11:09,850 --> 00:11:13,030 logged in, your browser is secretly but usefully 226 00:11:13,030 --> 00:11:17,350 presenting a hand stamp to that server to remind it who you are. 227 00:11:17,350 --> 00:11:21,560 Or rather, not really who you are, but of that big random number 228 00:11:21,560 --> 00:11:25,690 so that, if the server remembers who that number belongs to, 229 00:11:25,690 --> 00:11:28,750 it can figure out whose account to then show. 230 00:11:28,750 --> 00:11:32,770 So put more concretely, if I actually pull up some of the HTTP examples 231 00:11:32,770 --> 00:11:35,390 that we looked at in the past, let's consider this in context. 232 00:11:35,390 --> 00:11:37,973 So almost everyone here has probably used Gmail at some point. 233 00:11:37,973 --> 00:11:40,160 And you log in generally via page like this. 234 00:11:40,160 --> 00:11:42,670 So it might be infrequent because you're not often 235 00:11:42,670 --> 00:11:46,210 prompted to log into email because of-- surprise, surprise-- cookies 236 00:11:46,210 --> 00:11:47,690 from Google being on your computer. 237 00:11:47,690 --> 00:11:49,190 But let's see where those come from. 238 00:11:49,190 --> 00:11:54,290 So when you request Google or gmail.com, you might send, in a virtual envelope, 239 00:11:54,290 --> 00:11:57,150 so to speak, from your Mac or PC to the server, a message like this. 240 00:11:57,150 --> 00:11:58,900 Families who've not seen this before, this 241 00:11:58,900 --> 00:12:01,900 is what your browser is actually sending to a server in order 242 00:12:01,900 --> 00:12:03,817 to request a home page of gmail.com. 243 00:12:03,817 --> 00:12:05,650 Now, I'm going to simplify this a little bit 244 00:12:05,650 --> 00:12:07,233 because there's a bunch of redirects-- 245 00:12:07,233 --> 00:12:11,380 HTTP 302s and 301's involved that aren't that interesting today. 246 00:12:11,380 --> 00:12:14,170 So let's just assume that Gmail responds immediately. 247 00:12:14,170 --> 00:12:17,620 Typically, Gmail would respond with this, saying 200 OK. 248 00:12:17,620 --> 00:12:18,760 Here's the login page. 249 00:12:18,760 --> 00:12:22,120 And here's a web page in text/html format. 250 00:12:22,120 --> 00:12:24,940 But once you've logged in, what your Gmail actually does 251 00:12:24,940 --> 00:12:26,530 is it also sends this. 252 00:12:26,530 --> 00:12:31,030 Recall that we call these things HTTP headers-- 253 00:12:31,030 --> 00:12:35,590 key value pairs separated by colons that are semi-secretly sent from browser 254 00:12:35,590 --> 00:12:37,630 to server and from server to browser. 255 00:12:37,630 --> 00:12:41,390 Now we, more sophisticated developer types can see this stuff, right? 256 00:12:41,390 --> 00:12:44,050 You can open up Chrome's network tab and start poking around. 257 00:12:44,050 --> 00:12:45,790 And it's not secret per se. 258 00:12:45,790 --> 00:12:47,800 It's just most people don't know it's there. 259 00:12:47,800 --> 00:12:50,050 And what Google and other companies are doing is 260 00:12:50,050 --> 00:12:54,820 they literally send a header called set-cookie, the value of which 261 00:12:54,820 --> 00:12:59,170 is that big random value that the server has decided for you. 262 00:12:59,170 --> 00:13:03,070 Your browser, assuming that it speaks HTTP properly, 263 00:13:03,070 --> 00:13:06,340 should then save that value in RAM, your computer's memory, 264 00:13:06,340 --> 00:13:07,510 or on your hard drive. 265 00:13:07,510 --> 00:13:12,070 And then, every other click you make on gmail.com should send, 266 00:13:12,070 --> 00:13:16,510 not just headers like this, but it should send the opposite header-- 267 00:13:16,510 --> 00:13:18,940 just cookie, not set-cookie, but cookie, which 268 00:13:18,940 --> 00:13:20,920 is the presentation of that hand stamp. 269 00:13:20,920 --> 00:13:24,460 And so every time you click a page on Gmail, or Facebook, 270 00:13:24,460 --> 00:13:26,800 or almost any website into which you've logged in. 271 00:13:26,800 --> 00:13:29,090 Those cookies have been planted on your computer. 272 00:13:29,090 --> 00:13:31,510 Now recently, in both Europe and in the US, 273 00:13:31,510 --> 00:13:34,360 have laws been passed that are increasingly putting pressure 274 00:13:34,360 --> 00:13:37,870 on companies that operate internationally to present you 275 00:13:37,870 --> 00:13:40,080 with cookie-based information, right? 276 00:13:40,080 --> 00:13:41,830 You get these little pop-ups increasingly. 277 00:13:41,830 --> 00:13:44,830 And you've been getting them for years in Europe saying, 278 00:13:44,830 --> 00:13:47,110 can we plant cookies on your computer? 279 00:13:47,110 --> 00:13:49,510 So given all of this, what might happen if you say 280 00:13:49,510 --> 00:13:51,370 no, no cookies on my computer? 281 00:13:51,370 --> 00:13:53,006 AUDIENCE: You have to keep logging in. 282 00:13:53,006 --> 00:13:55,630 DAVID J. MALAN: You're going to have to keep logging in, right? 283 00:13:55,630 --> 00:13:59,560 If you take away this fundamental HTTP feature, much of the web 284 00:13:59,560 --> 00:14:03,160 breaks, or the user experience deteriorates significantly. 285 00:14:03,160 --> 00:14:06,710 Now the flip side is cookies can be used, not only for good, so to speak, 286 00:14:06,710 --> 00:14:07,460 but also for evil. 287 00:14:07,460 --> 00:14:10,293 It turns out that there are a lot of ads on the internet, of course. 288 00:14:10,293 --> 00:14:13,300 And that's what drives a lot of the revenue that makes it all possible. 289 00:14:13,300 --> 00:14:17,800 Those ads typically come from image tags in your HTML or script tags 290 00:14:17,800 --> 00:14:20,200 or some of the HTML tags we have seen in the class. 291 00:14:20,200 --> 00:14:23,560 But they go to third-party servers, somewhereelse.com. 292 00:14:23,560 --> 00:14:27,250 And the problem arises with cookies when those third parties are allowed 293 00:14:27,250 --> 00:14:29,770 to plant cookies on your computer. 294 00:14:29,770 --> 00:14:32,347 Because if you go to Google, and you go to Facebook, 295 00:14:32,347 --> 00:14:34,180 and you go to Twitter-- bunches of websites. 296 00:14:34,180 --> 00:14:37,540 Suppose they all have some middlemen advertising service advertising 297 00:14:37,540 --> 00:14:38,860 on each of these websites. 298 00:14:38,860 --> 00:14:41,560 That middlemen, so to speak, because they 299 00:14:41,560 --> 00:14:46,990 have their ads and, therefore, their HTTP headers on this site, 300 00:14:46,990 --> 00:14:49,990 and this site, and this site, there are big third parties 301 00:14:49,990 --> 00:14:52,900 out there-- lots of them advertising networks-- that know everywhere 302 00:14:52,900 --> 00:14:57,100 little old you is going because they see your cookie appearing 303 00:14:57,100 --> 00:14:58,960 from multiple different other sites. 304 00:14:58,960 --> 00:15:03,009 And so here's where cookies become, not a computer science, engineering feat, 305 00:15:03,009 --> 00:15:05,050 but really a threat to one's privacy because they 306 00:15:05,050 --> 00:15:08,080 can be used so easily for tracking. 307 00:15:08,080 --> 00:15:11,830 And frankly, AT&T and Verizon, as an aside, got into trouble-- 308 00:15:11,830 --> 00:15:13,420 not enough people know this-- 309 00:15:13,420 --> 00:15:18,550 some years back when they started injecting, forcibly, additional HTTP 310 00:15:18,550 --> 00:15:22,000 headers similar in spirit to this to all of people's cell phone traffic-- 311 00:15:22,000 --> 00:15:25,180 so not things you could even opt out of initially. 312 00:15:25,180 --> 00:15:28,060 Because this was a way for advertisers and for themselves 313 00:15:28,060 --> 00:15:29,360 to be able to track users. 314 00:15:29,360 --> 00:15:33,220 So these HTTP headers on which cookies are based 315 00:15:33,220 --> 00:15:35,965 are very powerful, but also very invasive. 316 00:15:35,965 --> 00:15:38,590 And we're only now starting to see, societally and politically, 317 00:15:38,590 --> 00:15:42,340 pushback on this very simple mechanism that, hopefully, we as CS types 318 00:15:42,340 --> 00:15:46,640 just understand the mechanics of and, therefore, now the implications. 319 00:15:46,640 --> 00:15:48,820 So let's see this in context, for instance. 320 00:15:48,820 --> 00:15:53,740 Let me go ahead and open up a relatively small example in CS50 IDE. 321 00:15:53,740 --> 00:15:56,800 For those unfamiliar, CS50 IDE is a web-based application 322 00:15:56,800 --> 00:15:59,110 via which you can write programs in the cloud, 323 00:15:59,110 --> 00:16:01,210 just using any browser on a Mac or PC. 324 00:16:01,210 --> 00:16:03,460 I'm going to go ahead into my account here. 325 00:16:03,460 --> 00:16:05,920 And I'm going to go into Store. 326 00:16:05,920 --> 00:16:08,980 And I'm going to go into the Templates directory 327 00:16:08,980 --> 00:16:11,050 and show that this example here has a few files. 328 00:16:11,050 --> 00:16:13,050 This will look familiar to students in the room. 329 00:16:13,050 --> 00:16:15,320 Application.py is the web-based application. 330 00:16:15,320 --> 00:16:17,900 And then we have some template files, so to speak. 331 00:16:17,900 --> 00:16:21,130 In my Terminal window here, I'm going to go ahead and, with just my keyboard, 332 00:16:21,130 --> 00:16:25,400 go into this directory called Store in source 8, which is available online. 333 00:16:25,400 --> 00:16:28,960 And I'm going to go ahead and do a flask run, which, for those unfamiliar, 334 00:16:28,960 --> 00:16:31,420 is the command via which you can start a web server 335 00:16:31,420 --> 00:16:34,190 and start running a web-based application. 336 00:16:34,190 --> 00:16:39,100 So now that I have that, I'm going to go ahead and visit exactly this URL here. 337 00:16:39,100 --> 00:16:43,804 And we'll see a relatively simple and super ugly web store. 338 00:16:43,804 --> 00:16:45,220 Let me go ahead and zoom in a bit. 339 00:16:45,220 --> 00:16:48,460 And this web store allows me to buy three things-- foos, bars, 340 00:16:48,460 --> 00:16:50,170 and bazes, whatever those are. 341 00:16:50,170 --> 00:16:53,230 And this is a very simple e-commerce-like site 342 00:16:53,230 --> 00:16:56,810 where I just have to type the quantity of foos, bars, and bazes that I want. 343 00:16:56,810 --> 00:16:58,610 And then I can go ahead and buy them. 344 00:16:58,610 --> 00:17:02,800 So I'm going to go ahead and say, give me one foo, how about zero bars, 345 00:17:02,800 --> 00:17:04,599 and these two bazes. 346 00:17:04,599 --> 00:17:06,700 And I'm going to go ahead and click Purchase. 347 00:17:06,700 --> 00:17:08,619 And now you'll see my shopping cart. 348 00:17:08,619 --> 00:17:10,540 But if I continue shopping, you'll see that it 349 00:17:10,540 --> 00:17:13,640 resets, just like if you keep browsing Amazon or whatever other website. 350 00:17:13,640 --> 00:17:15,710 But if you want to check shopping cart again, 351 00:17:15,710 --> 00:17:17,765 notice it's remembered what I'm looking for. 352 00:17:17,765 --> 00:17:18,890 And in fact, you know what? 353 00:17:18,890 --> 00:17:20,250 I'm going to go ahead and close the window. 354 00:17:20,250 --> 00:17:20,869 Oops. 355 00:17:20,869 --> 00:17:22,520 I actually lost the website. 356 00:17:22,520 --> 00:17:23,599 But you know what? 357 00:17:23,599 --> 00:17:27,349 If I go back to that URL and reopen, I see the storefront. 358 00:17:27,349 --> 00:17:30,500 But if I view my shopping cart, it has remembered my state. 359 00:17:30,500 --> 00:17:32,550 So notice the power now of cookies. 360 00:17:32,550 --> 00:17:35,390 It's not just to remember with a hand stamp who you are. 361 00:17:35,390 --> 00:17:39,080 But now you can remember anything about that user that you want, right? 362 00:17:39,080 --> 00:17:43,100 A shopping cart on Amazon or any website is the best example of this 363 00:17:43,100 --> 00:17:46,640 because it would be horrible, horrible, horrible for a User Experience or UX 364 00:17:46,640 --> 00:17:48,530 if every time you click the darn link, you 365 00:17:48,530 --> 00:17:52,010 lost the contents of your shopping cart because the website forgot who you are. 366 00:17:52,010 --> 00:17:54,410 So this is a compelling feature to remember. 367 00:17:54,410 --> 00:17:57,020 And it's cookies that implement this feature too. 368 00:17:57,020 --> 00:18:00,440 Because more generally, what's going on with cookies is this. 369 00:18:00,440 --> 00:18:03,920 When you set a cookie using the set-cookie header that 370 00:18:03,920 --> 00:18:07,550 looked like this, there is a key value pair 371 00:18:07,550 --> 00:18:11,240 to the right that we might call session, and that has the value. 372 00:18:11,240 --> 00:18:12,710 Value is the hand stamp. 373 00:18:12,710 --> 00:18:16,410 Session is just a term of art that refers to this abstraction 374 00:18:16,410 --> 00:18:18,410 that you can think of really as a shopping cart. 375 00:18:18,410 --> 00:18:21,660 But it doesn't have to have anything to do with actual shopping or e-commerce. 376 00:18:21,660 --> 00:18:24,830 It's just a container in which you can store stuff. 377 00:18:24,830 --> 00:18:29,450 So this is telling the website that my session value, my hand stamp 378 00:18:29,450 --> 00:18:32,840 is 12345-- and big really random number. 379 00:18:32,840 --> 00:18:35,060 The website can then say, you know what? 380 00:18:35,060 --> 00:18:37,730 I'm going to store a container for you. 381 00:18:37,730 --> 00:18:40,490 Let's call it a Python dictionary, or dict for you, 382 00:18:40,490 --> 00:18:42,810 inside of which I can put anything I want. 383 00:18:42,810 --> 00:18:46,040 And so in fact, when you go to a shopping cart like this, 384 00:18:46,040 --> 00:18:48,290 what is the server actually doing? 385 00:18:48,290 --> 00:18:52,652 Well, upon seeing your hand stamp, and realizing, oh, you are user 1234567. 386 00:18:52,652 --> 00:18:55,010 Oh, that's username Malan. 387 00:18:55,010 --> 00:18:57,830 Let me go ahead and grab the dictionary, the Python dict 388 00:18:57,830 --> 00:19:00,830 from memory, or even from a CSV file if you want, 389 00:19:00,830 --> 00:19:03,470 and show you the contents of your shopping cart. 390 00:19:03,470 --> 00:19:06,710 Meanwhile, someone else, Brian visits the same exact website 391 00:19:06,710 --> 00:19:08,030 and logs in with his account. 392 00:19:08,030 --> 00:19:10,550 He's going to present a different hand stamp, presumably. 393 00:19:10,550 --> 00:19:13,460 And so the website can look for a different dictionary 394 00:19:13,460 --> 00:19:16,430 and show Brian the different contents of his shopping cart. 395 00:19:16,430 --> 00:19:18,390 And the same for everyone in this room. 396 00:19:18,390 --> 00:19:19,850 So this simple mechanism-- 397 00:19:19,850 --> 00:19:21,180 I mean, consider the power. 398 00:19:21,180 --> 00:19:24,770 This is just a stupid text value key:value. 399 00:19:24,770 --> 00:19:28,730 And from all of that does all of the web's capabilities now come. 400 00:19:28,730 --> 00:19:31,070 And we won't look in too much detail into the code here, 401 00:19:31,070 --> 00:19:32,990 but let me show one snippet of it. 402 00:19:32,990 --> 00:19:37,850 If I go into application.py, the magic that makes this possible in flask 403 00:19:37,850 --> 00:19:40,362 is just to import this additional feature. 404 00:19:40,362 --> 00:19:42,320 You'll recall, from the past couple of problems 405 00:19:42,320 --> 00:19:45,110 you've worked on, you import from flask a few things-- 406 00:19:45,110 --> 00:19:47,740 render template, redirect, request. 407 00:19:47,740 --> 00:19:49,760 Well, if you Add session to that list, which 408 00:19:49,760 --> 00:19:53,750 is the code abstraction of this hand-stamp idea, what you can now do 409 00:19:53,750 --> 00:19:54,450 is this. 410 00:19:54,450 --> 00:19:56,600 Let me scroll down to the juicy part here. 411 00:19:56,600 --> 00:19:59,390 And notice that you can do lines of code now, 412 00:19:59,390 --> 00:20:03,830 and for the next CS50 finance problem, lines of code like this. 413 00:20:03,830 --> 00:20:07,220 Session is just a Python dictionary, or dict. 414 00:20:07,220 --> 00:20:09,530 You can index into it using any word-- 415 00:20:09,530 --> 00:20:13,670 foo, or bar, or baz, just like you could with dictionaries more generally. 416 00:20:13,670 --> 00:20:15,927 And you can store in it anything you want. 417 00:20:15,927 --> 00:20:17,510 In my case, I want to store a number-- 418 00:20:17,510 --> 00:20:20,540 0 foos, or 1 foo, or 2 foos or whatever. 419 00:20:20,540 --> 00:20:24,650 So I can simply convert to an int, the user's request forms 420 00:20:24,650 --> 00:20:25,609 item, whatever that is. 421 00:20:25,609 --> 00:20:28,316 And let me wave my hand at some of the code above because it just 422 00:20:28,316 --> 00:20:29,510 sets the whole thing up. 423 00:20:29,510 --> 00:20:31,670 But the new line of code, the new feature 424 00:20:31,670 --> 00:20:35,180 that now will empower us to build something like a stock trading website 425 00:20:35,180 --> 00:20:37,670 is quite simply this line here. 426 00:20:37,670 --> 00:20:40,640 Because what you can also do to remember that a user is logged in, 427 00:20:40,640 --> 00:20:43,370 just go ahead and store in this so-called session, 428 00:20:43,370 --> 00:20:45,027 a value of, like, true. 429 00:20:45,027 --> 00:20:48,110 And if the value true is there, you can infer that the human is logged in. 430 00:20:48,110 --> 00:20:52,670 And if there's no such value in the dictionary, they are not logged in. 431 00:20:52,670 --> 00:20:55,880 And so we'll hold your hand a bit more in the next problem set 432 00:20:55,880 --> 00:20:57,710 with this introduction of CS50 Finance. 433 00:20:57,710 --> 00:20:59,540 And we'll write some of the code that handles the login, 434 00:20:59,540 --> 00:21:01,490 so you can see by example how to do this. 435 00:21:01,490 --> 00:21:03,530 But it'll be up to you thereafter to start 436 00:21:03,530 --> 00:21:07,100 remembering what stocks a user actually has 437 00:21:07,100 --> 00:21:11,120 using sessions to retain the fact that they've logged in already. 438 00:21:11,120 --> 00:21:11,620 All right. 439 00:21:11,620 --> 00:21:13,550 So that was a lot all at once. 440 00:21:13,550 --> 00:21:17,360 But any questions on cookies and the feature they provide, 441 00:21:17,360 --> 00:21:19,916 these things called sessions? 442 00:21:19,916 --> 00:21:20,730 Anything at a all? 443 00:21:20,730 --> 00:21:22,366 Yeah? 444 00:21:22,366 --> 00:21:25,467 AUDIENCE: Is the session saved on the user side or the server? 445 00:21:25,467 --> 00:21:28,300 DAVID J. MALAN: Is the session saved on the user side or the server? 446 00:21:28,300 --> 00:21:29,390 Really good question. 447 00:21:29,390 --> 00:21:31,720 And the answer can be it depends. 448 00:21:31,720 --> 00:21:35,600 You could store on the user's client, on their browser. 449 00:21:35,600 --> 00:21:41,107 And as the gentleman over here proposed, you could theoretically store literally 450 00:21:41,107 --> 00:21:43,690 their user name and password-- maybe I proposed that earlier-- 451 00:21:43,690 --> 00:21:44,439 on their computer. 452 00:21:44,439 --> 00:21:47,200 And that's bad for the reasons that we surmised earlier. 453 00:21:47,200 --> 00:21:50,320 But you could also store the contents of their shopping cart, foos, bars, 454 00:21:50,320 --> 00:21:50,950 and bazes. 455 00:21:50,950 --> 00:21:53,320 That's not quite as invasive as storing their password. 456 00:21:53,320 --> 00:21:56,028 But if you're buying things you don't really want people knowing, 457 00:21:56,028 --> 00:21:56,980 that is then invasive. 458 00:21:56,980 --> 00:21:58,063 So maybe we can do better. 459 00:21:58,063 --> 00:22:00,430 And better is often, store it on the server. 460 00:22:00,430 --> 00:22:02,230 So a well-designed website will typically 461 00:22:02,230 --> 00:22:07,270 store only this big random value, the hand stamp, on the Mac or PC. 462 00:22:07,270 --> 00:22:10,030 And then all of the interesting and maybe sensitive stuff 463 00:22:10,030 --> 00:22:14,230 is stored in a database or CSV file or just the server's RAM or memory 464 00:22:14,230 --> 00:22:15,760 like in a global variable. 465 00:22:15,760 --> 00:22:16,910 Good question. 466 00:22:16,910 --> 00:22:19,520 Other questions? 467 00:22:19,520 --> 00:22:21,145 Other questions? 468 00:22:21,145 --> 00:22:22,390 No. 469 00:22:22,390 --> 00:22:23,150 All right. 470 00:22:23,150 --> 00:22:25,670 So that's one problem solved. 471 00:22:25,670 --> 00:22:28,280 We know now we can implement login forms. 472 00:22:28,280 --> 00:22:30,930 And we know that we can remember that people are logged in. 473 00:22:30,930 --> 00:22:33,210 So let's just stipulate that is now possible. 474 00:22:33,210 --> 00:22:36,350 But over the past couple of weeks, it's not been incredibly powerful 475 00:22:36,350 --> 00:22:41,120 to only have access to things like CSV files, comma separated values, which 476 00:22:41,120 --> 00:22:43,910 create the illusion of Excel and Google Spreadsheets 477 00:22:43,910 --> 00:22:46,220 and Apple Numbers like columns and rows. 478 00:22:46,220 --> 00:22:46,790 Why? 479 00:22:46,790 --> 00:22:49,331 Well, it's pretty much a linear search for everything, right? 480 00:22:49,331 --> 00:22:52,370 A CSV file is just rows and rows and rows and rows of data. 481 00:22:52,370 --> 00:22:54,500 And if you want to search for anything in that 482 00:22:54,500 --> 00:22:57,540 file, like you might have wanted to for your survey implementation, 483 00:22:57,540 --> 00:22:58,460 how do you find it? 484 00:22:58,460 --> 00:23:00,470 Well, you open the file with Open. 485 00:23:00,470 --> 00:23:04,040 You maybe use a for loop and iterate over every single line 486 00:23:04,040 --> 00:23:06,440 looking for some value, and then you close the file. 487 00:23:06,440 --> 00:23:07,489 That is big O of n. 488 00:23:07,489 --> 00:23:10,280 And in the worst case, the thing you care about is at the very end. 489 00:23:10,280 --> 00:23:11,810 And it's not terribly efficient. 490 00:23:11,810 --> 00:23:14,180 Now, you can append to files pretty efficiently. 491 00:23:14,180 --> 00:23:17,390 Recall that when we opened files with quote unquote, "A" for Append mode. 492 00:23:17,390 --> 00:23:20,180 We did this with a brief example, instead of "W" for Write. 493 00:23:20,180 --> 00:23:23,630 You get the operating system's help and add rose to the file 494 00:23:23,630 --> 00:23:25,610 at the very bottom, which is more efficient. 495 00:23:25,610 --> 00:23:28,460 But you can't insert things in the middle very efficiently. 496 00:23:28,460 --> 00:23:30,200 You can't delete things very easily. 497 00:23:30,200 --> 00:23:33,650 You would have to literally, for those kinds of scenarios, open the file, 498 00:23:33,650 --> 00:23:36,590 read the whole darn thing, then write out parts of it 499 00:23:36,590 --> 00:23:38,870 or add to it as you're writing out. 500 00:23:38,870 --> 00:23:41,420 And so humans years ago realized, well, this is stupid. 501 00:23:41,420 --> 00:23:43,460 All of us humans in the world are constantly 502 00:23:43,460 --> 00:23:46,610 writing code to open files, change files, save files. 503 00:23:46,610 --> 00:23:50,502 Why don't a few of us do an even better job of implementing that feature, 504 00:23:50,502 --> 00:23:51,710 then share it with the world? 505 00:23:51,710 --> 00:23:53,930 And thus were born, effectively, databases. 506 00:23:53,930 --> 00:23:56,900 And these days, there are so many different types of databases-- 507 00:23:56,900 --> 00:23:59,900 you might be familiar with tools, commercial software like Oracle, 508 00:23:59,900 --> 00:24:02,240 or SQL Server, or Microsoft Access. 509 00:24:02,240 --> 00:24:05,420 And in the open source world, there's Postgres and MySQL, 510 00:24:05,420 --> 00:24:07,040 and SQLite and others. 511 00:24:07,040 --> 00:24:09,710 And many of them, as even those names imply, 512 00:24:09,710 --> 00:24:13,310 use a special language called SQL, Structured Query Language, which 513 00:24:13,310 --> 00:24:15,020 is the very last language we'll introduce 514 00:24:15,020 --> 00:24:17,960 you to in CS50 unless you go off on your own with final projects 515 00:24:17,960 --> 00:24:19,160 to pick up something more. 516 00:24:19,160 --> 00:24:22,820 But with SQL, you have the ability to select data from a database, 517 00:24:22,820 --> 00:24:25,070 to insert it, to delete it, to updated it. 518 00:24:25,070 --> 00:24:28,220 All of the things that you could absolutely do with Python and CSVs, 519 00:24:28,220 --> 00:24:32,790 it would just be so darn tedious to write those lines of code yourself. 520 00:24:32,790 --> 00:24:34,490 So what's the right mental model? 521 00:24:34,490 --> 00:24:37,577 Here is, I claim, essentially a database. 522 00:24:37,577 --> 00:24:39,410 Like most everyone in this room has probably 523 00:24:39,410 --> 00:24:41,618 used Google Spreadsheets, and if you've not, probably 524 00:24:41,618 --> 00:24:43,890 Excel or maybe Apple Numbers. 525 00:24:43,890 --> 00:24:45,650 So these are spreadsheets. 526 00:24:45,650 --> 00:24:49,320 And they're essentially what you could call a relational database. 527 00:24:49,320 --> 00:24:51,320 Relational-- implying that there's relationships 528 00:24:51,320 --> 00:24:54,290 among the various data in the rows and the columns decided by you 529 00:24:54,290 --> 00:24:55,310 or whoever made it. 530 00:24:55,310 --> 00:24:57,050 And this is a spreadsheet of course. 531 00:24:57,050 --> 00:25:00,980 And it has rows where your actual data goes. 532 00:25:00,980 --> 00:25:04,760 And columns-- and what's noteworthy about the columns? 533 00:25:04,760 --> 00:25:06,950 It's just conventional to do what with the columns, 534 00:25:06,950 --> 00:25:08,210 especially in that first row? 535 00:25:08,210 --> 00:25:09,182 AUDIENCE: Headers. 536 00:25:09,182 --> 00:25:10,960 DAVID J. MALAN: Yeah, put some header values, right? 537 00:25:10,960 --> 00:25:12,820 You could certainly put them in the left. 538 00:25:12,820 --> 00:25:14,890 It's just humans prefer, it seems, to read top 539 00:25:14,890 --> 00:25:17,200 to bottom instead of left to all the way to the right. 540 00:25:17,200 --> 00:25:20,170 So we just adopted a conventioneers ago that your columns represent 541 00:25:20,170 --> 00:25:24,580 different types of data, and the rows represent different values 542 00:25:24,580 --> 00:25:27,710 for those fields or for those columns, so to speak. 543 00:25:27,710 --> 00:25:30,100 So if you wanted to store a spreadsheet of values 544 00:25:30,100 --> 00:25:33,400 for a bunch of students in a class or in a university, 545 00:25:33,400 --> 00:25:37,120 one column might be ID number, like their Harvard ID or the Yale ID. 546 00:25:37,120 --> 00:25:40,900 Another column might be their name, their email address, phone number, age, 547 00:25:40,900 --> 00:25:41,470 and so forth. 548 00:25:41,470 --> 00:25:42,970 And you could just lay that all out. 549 00:25:42,970 --> 00:25:46,420 You can make it pretty and boldface the top and organize things and sort it. 550 00:25:46,420 --> 00:25:49,179 But at some point, this isn't quite the right tool. 551 00:25:49,179 --> 00:25:51,220 And in fact, I ran into this in grad school years 552 00:25:51,220 --> 00:25:53,262 ago where I was analyzing large sets of data. 553 00:25:53,262 --> 00:25:55,720 And it was just convenient to double-click on the CSV file, 554 00:25:55,720 --> 00:25:58,570 open it in Excel, and just manipulate it and answer the data 555 00:25:58,570 --> 00:25:59,820 questions that I had. 556 00:25:59,820 --> 00:26:06,370 But back in the day, Excel actually had a limit of 65,536 rows. 557 00:26:06,370 --> 00:26:08,020 Sounds like a lot-- 558 00:26:08,020 --> 00:26:12,130 and that's 2 to the 16th power, if you call the reference there. 559 00:26:12,130 --> 00:26:15,450 And so humans at Microsoft decided years ago, that's enough. 560 00:26:15,450 --> 00:26:16,720 And frankly, it kind of was. 561 00:26:16,720 --> 00:26:18,820 Because even in the thousands of rows, like, 562 00:26:18,820 --> 00:26:22,960 my Mac just became unusable because it was just too much memory being used. 563 00:26:22,960 --> 00:26:25,590 So it was just an impractical limit anyway. 564 00:26:25,590 --> 00:26:27,340 But at some point, you might want to store 565 00:26:27,340 --> 00:26:33,670 65,537 rows of data or hundreds of thousands of rows or millions of rows 566 00:26:33,670 --> 00:26:34,420 or even more. 567 00:26:34,420 --> 00:26:36,711 I mean, there are so many websites, Google and Facebook 568 00:26:36,711 --> 00:26:38,350 and others that surely do this already. 569 00:26:38,350 --> 00:26:41,060 You can't just use a program on your Mac or PC anymore. 570 00:26:41,060 --> 00:26:42,700 You need to use fancier software. 571 00:26:42,700 --> 00:26:46,150 But that fancier software often still works the same way. 572 00:26:46,150 --> 00:26:49,780 You have one main file called a spreadsheet in our human world. 573 00:26:49,780 --> 00:26:52,932 But in the server world, you might call it a database instead. 574 00:26:52,932 --> 00:26:54,640 And whereas, in our human world, we might 575 00:26:54,640 --> 00:26:58,150 have things called Sheets or individual spreadsheets within the bigger file, 576 00:26:58,150 --> 00:27:00,740 in a database, you're going to have things called tables. 577 00:27:00,740 --> 00:27:03,580 But they're the exact same thing with rows and columns. 578 00:27:03,580 --> 00:27:05,620 And so when we want to actually store data, 579 00:27:05,620 --> 00:27:08,289 we can actually store it in exactly those ways. 580 00:27:08,289 --> 00:27:09,580 So let me go ahead and do this. 581 00:27:09,580 --> 00:27:13,060 Let me go ahead and open up Google Spreadsheets just 582 00:27:13,060 --> 00:27:15,450 as representative of a database. 583 00:27:15,450 --> 00:27:18,520 And let me go ahead and create a new file here. 584 00:27:18,520 --> 00:27:23,740 So New Spreadsheet-- and just so that we can represent things like students. 585 00:27:23,740 --> 00:27:25,630 I'm going to call this spreadsheet Students. 586 00:27:25,630 --> 00:27:29,381 And over here I'm going to put their ID number and their name, and what 587 00:27:29,381 --> 00:27:31,630 are some other fields I rattled off that come to mind? 588 00:27:31,630 --> 00:27:32,424 AUDIENCE: Email. 589 00:27:32,424 --> 00:27:33,340 DAVID J. MALAN: Email. 590 00:27:33,340 --> 00:27:34,240 AUDIENCE: Dorm. 591 00:27:34,240 --> 00:27:34,880 DAVID J. MALAN: Say again. 592 00:27:34,880 --> 00:27:35,620 AUDIENCE: Dorm. 593 00:27:35,620 --> 00:27:35,990 DAVID J. MALAN: Dorm. 594 00:27:35,990 --> 00:27:36,842 Good. 595 00:27:36,842 --> 00:27:38,810 AUDIENCE: Sports and stuff they want to do. 596 00:27:38,810 --> 00:27:39,601 DAVID J. MALAN: OK. 597 00:27:39,601 --> 00:27:41,250 So sports they want to do, sure. 598 00:27:41,250 --> 00:27:44,040 That's fine-- a little longer. 599 00:27:44,040 --> 00:27:45,202 What else? 600 00:27:45,202 --> 00:27:46,620 AUDIENCE: Graduation year. 601 00:27:46,620 --> 00:27:48,600 DAVID J. MALAN: Graduation year, age, OK, good. 602 00:27:48,600 --> 00:27:52,805 So graduation year, age, concentration-- 603 00:27:52,805 --> 00:27:55,680 OK, and we can probably keep going, and it just gets wider and wider. 604 00:27:55,680 --> 00:27:59,400 And these are my columns, if you will, or the fields in the database. 605 00:27:59,400 --> 00:28:03,030 And frankly, I could name this if I really want to be tidy here, 606 00:28:03,030 --> 00:28:04,440 and I could call this Students. 607 00:28:04,440 --> 00:28:04,800 And you know what? 608 00:28:04,800 --> 00:28:07,260 Maybe I should call this not students but university. 609 00:28:07,260 --> 00:28:10,740 Because I might want to have another tab here that we'll preemptively 610 00:28:10,740 --> 00:28:12,540 name to, say, faculty. 611 00:28:12,540 --> 00:28:15,419 And faculty probably have ID numbers, and they have names, 612 00:28:15,419 --> 00:28:17,460 and they probably have departments, for instance. 613 00:28:17,460 --> 00:28:20,590 And they also probably have emails and so forth. 614 00:28:20,590 --> 00:28:22,510 So phone number is another one. 615 00:28:22,510 --> 00:28:25,200 But I might want to kind of cluster these different types 616 00:28:25,200 --> 00:28:27,286 of data in different spreadsheets because they're 617 00:28:27,286 --> 00:28:28,410 kind of apples and oranges. 618 00:28:28,410 --> 00:28:31,680 There might be some overlap, but they're distinct inside different sheets. 619 00:28:31,680 --> 00:28:34,089 In the database world, you do the exact same thing, 620 00:28:34,089 --> 00:28:35,880 except you call the whole thing a database, 621 00:28:35,880 --> 00:28:38,670 and you call these things tables to be clear. 622 00:28:38,670 --> 00:28:41,850 But more importantly, with a real database, so to speak, 623 00:28:41,850 --> 00:28:45,120 you have to be a little more proactive about telling the database what 624 00:28:45,120 --> 00:28:46,680 types of data you want to store. 625 00:28:46,680 --> 00:28:49,138 Now, those of you who've used Google Spreadsheets and Apple 626 00:28:49,138 --> 00:28:52,620 numbers and Excel know that you can go to the format menu or the data menu 627 00:28:52,620 --> 00:28:55,380 or whatever, and you can show things to different decimal places. 628 00:28:55,380 --> 00:28:56,970 You can turn things to percents. 629 00:28:56,970 --> 00:28:58,410 And you can format the data. 630 00:28:58,410 --> 00:29:01,170 But that doesn't tend to have any functional impact on your data. 631 00:29:01,170 --> 00:29:03,940 It's just an aesthetic detail, like, how long or short do 632 00:29:03,940 --> 00:29:07,690 you want to show the numbers to be in your columns and cells. 633 00:29:07,690 --> 00:29:10,830 But with databases, you can actually provide juicy hints 634 00:29:10,830 --> 00:29:15,960 to the database that will help that database store and find data for you 635 00:29:15,960 --> 00:29:17,670 more efficiently. 636 00:29:17,670 --> 00:29:20,580 For instance, there's this laundry list of data types 637 00:29:20,580 --> 00:29:22,440 to which we have access in SQL. 638 00:29:22,440 --> 00:29:26,220 So SQL is the language with which we're going to be able to query a database. 639 00:29:26,220 --> 00:29:30,150 Frankly, in something like Google Spreadsheets and Microsoft Excel, 640 00:29:30,150 --> 00:29:31,080 how do you find data? 641 00:29:31,080 --> 00:29:34,680 We'll usually hit Command-F or Control-F, and you search for it. 642 00:29:34,680 --> 00:29:35,910 That's not very programmatic. 643 00:29:35,910 --> 00:29:37,230 You can do it with macros and such. 644 00:29:37,230 --> 00:29:38,396 And we'll do this with code. 645 00:29:38,396 --> 00:29:41,160 But with SQL, you're going to have to make a few decisions first. 646 00:29:41,160 --> 00:29:45,229 SQL supports a few different data types, at least, in one of its forms. 647 00:29:45,229 --> 00:29:47,520 So it turns out there's a lot of competition out there. 648 00:29:47,520 --> 00:29:49,890 SQL is a general language. 649 00:29:49,890 --> 00:29:52,260 But different companies and different nonprofit groups 650 00:29:52,260 --> 00:29:54,090 have come up with their own dialects. 651 00:29:54,090 --> 00:29:56,340 So much like we humans might speak-- 652 00:29:56,340 --> 00:29:57,757 bunches of us might speak Spanish. 653 00:29:57,757 --> 00:30:00,381 Well, there are different variations of Spanish, not to mention 654 00:30:00,381 --> 00:30:01,830 many, many, many other languages. 655 00:30:01,830 --> 00:30:03,621 And so similarly, in the programming world, 656 00:30:03,621 --> 00:30:06,038 as there's sort of a common subset of words in a language. 657 00:30:06,038 --> 00:30:08,079 But then different companies and different people 658 00:30:08,079 --> 00:30:11,220 might have added and subtracted their own features just because they think 659 00:30:11,220 --> 00:30:13,170 that's better for their use cases. 660 00:30:13,170 --> 00:30:19,080 So in SQLite, which is the database technology we will use in CS50 IDE, 661 00:30:19,080 --> 00:30:21,930 and Android uses this, and it's very popular in other contexts 662 00:30:21,930 --> 00:30:23,520 too, IOS as well-- 663 00:30:23,520 --> 00:30:25,200 they only have a few data types. 664 00:30:25,200 --> 00:30:29,100 Blob, of all things, integer, numeric, real, and text. 665 00:30:29,100 --> 00:30:30,100 Now what does that mean? 666 00:30:30,100 --> 00:30:33,058 Well, it turns out, it means different things with different databases. 667 00:30:33,058 --> 00:30:36,080 In another popular database, it's called Postgres or PostgreSQL. 668 00:30:36,080 --> 00:30:38,782 This is a very popular one for high-performing websites. 669 00:30:38,782 --> 00:30:40,740 So if you're trying to build a business, you're 670 00:30:40,740 --> 00:30:43,573 trying to make an application used by everyone on campus, generally, 671 00:30:43,573 --> 00:30:46,800 you would use a fancier technology than SQLite, which is, by definition, 672 00:30:46,800 --> 00:30:47,580 lightweight. 673 00:30:47,580 --> 00:30:49,230 And you'd do something like Postgres. 674 00:30:49,230 --> 00:30:52,650 And so we'll introduce you to both of these dialects along the way 675 00:30:52,650 --> 00:30:55,440 because each of them allows us to do different things 676 00:30:55,440 --> 00:30:57,340 with a different degree of precision. 677 00:30:57,340 --> 00:31:01,290 So in SQL, let's claim, even though it's a simplification, 678 00:31:01,290 --> 00:31:02,910 there are at least these data types. 679 00:31:02,910 --> 00:31:06,090 When deciding what type of data to store in your database table, 680 00:31:06,090 --> 00:31:09,655 you have to decide, not just what it's called, like ID, and name, and email, 681 00:31:09,655 --> 00:31:10,155 and dorm. 682 00:31:10,155 --> 00:31:12,474 You have to decide what data type it is. 683 00:31:12,474 --> 00:31:13,890 So we've kind of come full circle. 684 00:31:13,890 --> 00:31:17,124 Because the last time we did this was in what language? 685 00:31:17,124 --> 00:31:17,930 C. 686 00:31:17,930 --> 00:31:20,710 So in SQL too you do this, but it's not to be nitpicky. 687 00:31:20,710 --> 00:31:22,490 It's to actually improve performance. 688 00:31:22,490 --> 00:31:25,180 So this is to help you get at that data more efficiently. 689 00:31:25,180 --> 00:31:28,280 Because the more the server, the database, knows about your data, 690 00:31:28,280 --> 00:31:30,460 the faster it can find it essentially. 691 00:31:30,460 --> 00:31:32,680 So in the world of databases, you generally 692 00:31:32,680 --> 00:31:35,500 have to decide, not just to store an integer, but either 693 00:31:35,500 --> 00:31:38,290 a small int of integer, or a big int. 694 00:31:38,290 --> 00:31:42,700 And generally speaking, at least in most databases, this might be 2 bytes. 695 00:31:42,700 --> 00:31:43,870 This might be 4 bytes. 696 00:31:43,870 --> 00:31:45,220 This might be 8 bytes. 697 00:31:45,220 --> 00:31:47,020 So let's maybe impress the families here. 698 00:31:47,020 --> 00:31:50,110 If you have a 32-bit integer, or 4 bytes, 699 00:31:50,110 --> 00:31:54,600 what's the biggest integer you can store in your database? 700 00:31:54,600 --> 00:31:56,960 (WHISPERS) 4 billion. 701 00:31:56,960 --> 00:31:58,130 Someone please, impress. 702 00:31:58,130 --> 00:31:59,200 AUDIENCE: 4 billion. 703 00:31:59,200 --> 00:32:01,200 DAVID J. MALAN: 4 billion is exactly correct. 704 00:32:01,200 --> 00:32:04,430 And it might be 4 billion positive values, or maybe it's only 2 billion 705 00:32:04,430 --> 00:32:08,220 if you also want to have 2 billion or so negative numbers as well. 706 00:32:08,220 --> 00:32:13,520 Big int uses 64-bit, which is just massively, massively bigger. 707 00:32:13,520 --> 00:32:15,420 And tiny int or small int rather-- 708 00:32:15,420 --> 00:32:18,610 there's also tiny int in some dialects-- small int is just 2 bytes, 709 00:32:18,610 --> 00:32:24,700 and it actually counts up as high as 65,535, as I alluded to earlier. 710 00:32:24,700 --> 00:32:25,980 So you just need to decide. 711 00:32:25,980 --> 00:32:29,340 Now, why would you choose one or the other, when you could just use big int 712 00:32:29,340 --> 00:32:31,570 and store small numbers and big numbers alike? 713 00:32:31,570 --> 00:32:35,190 Well, why might you just intuitively not want to make all of your integer values 714 00:32:35,190 --> 00:32:37,050 in a database big ints. 715 00:32:37,050 --> 00:32:39,660 It's just simple-- give me as much headroom as I want. 716 00:32:39,660 --> 00:32:41,420 AUDIENCE: You would be using memory. 717 00:32:41,420 --> 00:32:41,550 DAVID J. MALAN: Yeah. 718 00:32:41,550 --> 00:32:43,590 You're just using memory unnecessarily, right? 719 00:32:43,590 --> 00:32:45,650 Why use more memory than you need to? 720 00:32:45,650 --> 00:32:48,220 Now, arguably-- god, memory is so cheap these days. 721 00:32:48,220 --> 00:32:49,930 Computers are getting bigger and faster. 722 00:32:49,930 --> 00:32:52,000 So in some contexts, it's not a big deal. 723 00:32:52,000 --> 00:32:56,130 But if you have lots of data, and every one of your rows in a database 724 00:32:56,130 --> 00:32:59,950 is 2 bytes bigger than needs to be, and you have 1 million rows, 725 00:32:59,950 --> 00:33:03,910 that's already huge numbers of bytes that you're just wasting and throwing 726 00:33:03,910 --> 00:33:04,410 away. 727 00:33:04,410 --> 00:33:07,284 So it behooves you to actually give some thought to these data types. 728 00:33:07,284 --> 00:33:09,000 Of course, you don't want to go too small 729 00:33:09,000 --> 00:33:11,940 because, if you use a small int or maybe even an integer, 730 00:33:11,940 --> 00:33:17,160 but business is so darn good that you have over 4 billion users or 4 731 00:33:17,160 --> 00:33:21,330 billion rows or 4 billion foos and bars and bazes that have been sold, 732 00:33:21,330 --> 00:33:24,520 which absolutely happens to the Googles and Facebooks of the world, 733 00:33:24,520 --> 00:33:30,450 you better start using 64 bits to represent unique values instead of just 734 00:33:30,450 --> 00:33:30,990 32. 735 00:33:30,990 --> 00:33:32,280 Because eventually, you'll run out. 736 00:33:32,280 --> 00:33:32,946 Well, what else? 737 00:33:32,946 --> 00:33:36,757 Besides integer values in SQL, you might want to use floating point values. 738 00:33:36,757 --> 00:33:39,090 Those are real numbers that have decimal points in them. 739 00:33:39,090 --> 00:33:41,280 This is quite like C. They're called a little something different. 740 00:33:41,280 --> 00:33:43,750 And annoyingly, it uses two words in this case. 741 00:33:43,750 --> 00:33:46,920 But a real number is a 32-bit value that is a floating point 742 00:33:46,920 --> 00:33:49,430 value, like a float in C, and double precision 743 00:33:49,430 --> 00:33:51,540 is like a double in C, which is 64-bits. 744 00:33:51,540 --> 00:33:54,480 And that just lets you get even finer-grained control 745 00:33:54,480 --> 00:33:56,220 over the decimal point as we've seen. 746 00:33:56,220 --> 00:33:59,130 Because of course, you can't represent numbers precisely 747 00:33:59,130 --> 00:34:02,640 because of imprecision, as we've seen in both Python and C. 748 00:34:02,640 --> 00:34:05,970 Numeric is kind of a catch-all for a bunch of number-related things 749 00:34:05,970 --> 00:34:08,460 that aren't quite integers or reals necessarily. 750 00:34:08,460 --> 00:34:11,429 Boolean, a true or false value, dates if you 751 00:34:11,429 --> 00:34:14,610 want to literally store, like, year, year, year, year, dash, month, month, 752 00:34:14,610 --> 00:34:17,634 dash, day, day in a standard computer format, 753 00:34:17,634 --> 00:34:19,800 even though human cultures vary how they write that. 754 00:34:19,800 --> 00:34:23,190 Datetime if you want have a date and a time right after it-- 755 00:34:23,190 --> 00:34:26,070 numeric is actually a solution to a problem. 756 00:34:26,070 --> 00:34:29,010 Turns out that, finally, after all this time, 757 00:34:29,010 --> 00:34:33,210 we have a built-in mechanism for storing numbers precisely. 758 00:34:33,210 --> 00:34:35,969 And this is perfectly timed because, in CS50 Finance, when 759 00:34:35,969 --> 00:34:39,239 you're dealing with money, it would kind of be nice if you don't accidentally 760 00:34:39,239 --> 00:34:43,170 round off how much money someone has either slightly lower or slightly 761 00:34:43,170 --> 00:34:43,949 higher. 762 00:34:43,949 --> 00:34:47,639 Those of you may be families especially because, from a few movies back 763 00:34:47,639 --> 00:34:50,460 in the day, if you ever saw Superman III, 764 00:34:50,460 --> 00:34:53,730 this is how some money was made by shaving fractions of pennies 765 00:34:53,730 --> 00:34:56,760 off of computer systems and, more moderately, Office Space. 766 00:34:56,760 --> 00:35:00,960 Office Space did this as well-- big scam trying to round off these pennies. 767 00:35:00,960 --> 00:35:04,650 Well, if you use the right technology and language, with SQL, 768 00:35:04,650 --> 00:35:08,460 you can actually specify a numeric data type, which is like a floating point 769 00:35:08,460 --> 00:35:11,940 value, but you specify, essentially, the total number of digits you 770 00:35:11,940 --> 00:35:15,790 want it to handle and how many of them should be after the decimal point. 771 00:35:15,790 --> 00:35:17,850 So it's common in stock markets to actually use 772 00:35:17,850 --> 00:35:19,480 four decimal points of precision. 773 00:35:19,480 --> 00:35:22,260 So not just cents to the tenths and the hundredths place, 774 00:35:22,260 --> 00:35:24,570 but to the thousandths and ten-thousandths place, 775 00:35:24,570 --> 00:35:28,126 you can now specify that with a value of four for precision. 776 00:35:28,126 --> 00:35:30,000 And you don't have to worry about those round 777 00:35:30,000 --> 00:35:32,370 off errors we talked about before. 778 00:35:32,370 --> 00:35:34,230 Time and time stamps-- just how much time 779 00:35:34,230 --> 00:35:38,130 has passed since the date and date and time are also available to you in SQL. 780 00:35:38,130 --> 00:35:40,920 And then lastly here, text values-- 781 00:35:40,920 --> 00:35:43,260 turns out you have a choice of values for when 782 00:35:43,260 --> 00:35:46,690 it comes to deciding how many characters do you use to store your data. 783 00:35:46,690 --> 00:35:49,780 You can use char, which just like in C with an array, 784 00:35:49,780 --> 00:35:53,170 you have to decide in advance for a string how many bytes to use. 785 00:35:53,170 --> 00:35:55,380 And if you say char 8, you are getting 8 bytes. 786 00:35:55,380 --> 00:35:59,160 And if the name is Malan, M-A-L-A-N, you are wasting 3, 787 00:35:59,160 --> 00:36:01,940 or if we null terminate it, 2 of those bytes. 788 00:36:01,940 --> 00:36:05,490 But SQL introduces a solution that Python kind of gave us too. 789 00:36:05,490 --> 00:36:09,300 Varchar as the name implies-- variable number of chars 790 00:36:09,300 --> 00:36:11,410 allows you to specify just an upper bound. 791 00:36:11,410 --> 00:36:14,550 So if you know that one of your columns is going to be called Name. 792 00:36:14,550 --> 00:36:17,490 And a person's name might be M-A-L-A-N-- like five letters, 793 00:36:17,490 --> 00:36:21,060 or maybe certainly more for people with longer last names are first names, 794 00:36:21,060 --> 00:36:23,940 you can say varchar 100. 795 00:36:23,940 --> 00:36:26,170 And that's the upper bound on how long it is. 796 00:36:26,170 --> 00:36:27,570 But the database will be smart. 797 00:36:27,570 --> 00:36:30,240 If you only need five characters or 5 bytes, 798 00:36:30,240 --> 00:36:32,790 it won't waste 95 others for you. 799 00:36:32,790 --> 00:36:35,080 It will optimize that for you. 800 00:36:35,080 --> 00:36:38,220 And then lastly text, if you want to store tens of thousands, 801 00:36:38,220 --> 00:36:41,610 like paragraphs or pages of text, that's even bigger than varchar 802 00:36:41,610 --> 00:36:44,310 where you don't necessarily want a well-defined upper bound. 803 00:36:44,310 --> 00:36:46,350 You just want a good chunk of memory. 804 00:36:46,350 --> 00:36:47,860 But there are implications here. 805 00:36:47,860 --> 00:36:52,380 And just quickly, char and varchar seem not even complimentary. 806 00:36:52,380 --> 00:36:54,185 Just char seems unnecessary. 807 00:36:54,185 --> 00:36:56,210 Why paint yourself into a corner and say give me 808 00:36:56,210 --> 00:36:58,850 only 8 bytes when you could just say, eh, give me 809 00:36:58,850 --> 00:37:02,550 up to 100 bites, and you, the database server, 810 00:37:02,550 --> 00:37:05,330 figure out how many to actually use? 811 00:37:05,330 --> 00:37:07,040 Why do these two data types exist? 812 00:37:07,040 --> 00:37:07,190 Yeah? 813 00:37:07,190 --> 00:37:07,660 AUDIENCE: Time. 814 00:37:07,660 --> 00:37:08,640 DAVID J. MALAN: Time. 815 00:37:08,640 --> 00:37:10,796 AUDIENCE: It would be longer to run. 816 00:37:10,796 --> 00:37:11,670 DAVID J. MALAN: Yeah. 817 00:37:11,670 --> 00:37:12,860 So time is spot on. 818 00:37:12,860 --> 00:37:16,370 It might actually take longer to use one data type or the other. 819 00:37:16,370 --> 00:37:17,900 How could that possibly be? 820 00:37:17,900 --> 00:37:21,020 Well, if you've got a value like char where you specify 821 00:37:21,020 --> 00:37:23,660 a precise number of characters like 8. 822 00:37:23,660 --> 00:37:26,840 If you think about the columns in Google Spreadsheets and Excel, 823 00:37:26,840 --> 00:37:30,930 every one of those columns, those cells is exactly the same width, right? 824 00:37:30,930 --> 00:37:33,590 It's 8 bytes, 8 bytes, 8 bytes, 8 bytes. 825 00:37:33,590 --> 00:37:37,520 And as soon as you have perfectly constant offsets-- 826 00:37:37,520 --> 00:37:42,770 this is byte 0, this is 8, this is 16, this is 24, this is 32, 827 00:37:42,770 --> 00:37:45,050 and so forth, you reintroduce the capability 828 00:37:45,050 --> 00:37:46,800 that in C we had with arrays. 829 00:37:46,800 --> 00:37:50,400 You have random access doing simple, simple arithmetic-- multiplication, 830 00:37:50,400 --> 00:37:54,680 addition, and subtraction, you can just jump to any element in a column 831 00:37:54,680 --> 00:37:58,190 if it's using the same number of bytes for every cell. 832 00:37:58,190 --> 00:38:04,329 But with varchar, if you need to have small narrow cells and wide cells, 833 00:38:04,329 --> 00:38:06,620 you're going to have what's called a ragged array where 834 00:38:06,620 --> 00:38:08,670 one side of it, the left-hand side, so to speak, 835 00:38:08,670 --> 00:38:10,295 is going to be very straight and rigid. 836 00:38:10,295 --> 00:38:14,480 But the other side my kind of vary based on how long the word is in that column. 837 00:38:14,480 --> 00:38:17,870 And to your point about time, it's going to take the computer more time 838 00:38:17,870 --> 00:38:19,730 to search a varchar field. 839 00:38:19,730 --> 00:38:22,070 Because it can't just jump to the third element. 840 00:38:22,070 --> 00:38:24,005 It has to search all of them, potentially. 841 00:38:24,005 --> 00:38:26,630 But thankfully, at the end of the day, databases don't do that. 842 00:38:26,630 --> 00:38:28,421 They don't just devolve into linear search. 843 00:38:28,421 --> 00:38:31,520 CSVs devolve, for our purposes, into linear search. 844 00:38:31,520 --> 00:38:35,420 The magic and the intellectual property and the secret sauce, so to speak, 845 00:38:35,420 --> 00:38:36,620 that goes into databases-- 846 00:38:36,620 --> 00:38:40,520 Oracle SQL Server, Microsoft Access, Postgres, MySQL, and SQLite 847 00:38:40,520 --> 00:38:45,020 is that really smart computer scientists have decided to implement algorithms 848 00:38:45,020 --> 00:38:47,660 and fancy data structures, often tree-based-- 849 00:38:47,660 --> 00:38:50,060 if you recall our discussion of binary trees 850 00:38:50,060 --> 00:38:54,590 or tries on a hash tables and those fancier data structures-- they 851 00:38:54,590 --> 00:38:57,170 had baked those kinds of data structures and designs 852 00:38:57,170 --> 00:38:58,850 into their database software. 853 00:38:58,850 --> 00:39:01,680 They've made it freely available or commercially available. 854 00:39:01,680 --> 00:39:04,610 And so we now just have an abstraction called a database 855 00:39:04,610 --> 00:39:07,670 into which we can store data and just read it, update it, write it, 856 00:39:07,670 --> 00:39:10,430 and change it quickly without having to implement all 857 00:39:10,430 --> 00:39:12,899 of that low-level plumbing ourselves. 858 00:39:12,899 --> 00:39:15,440 And so that's ultimately what a database is going to give us. 859 00:39:15,440 --> 00:39:18,690 But we have to actually have access to a database to do that. 860 00:39:18,690 --> 00:39:21,260 And we need to actually decide first to help 861 00:39:21,260 --> 00:39:27,350 the database by telling it what to store where and how to store it as a type. 862 00:39:27,350 --> 00:39:28,850 So why don't we do this? 863 00:39:28,850 --> 00:39:33,060 Let me go ahead and do exactly that with a couple of examples. 864 00:39:33,060 --> 00:39:36,220 Let me go ahead and open up CS50 IDE again, 865 00:39:36,220 --> 00:39:38,210 our web-based programming environment. 866 00:39:38,210 --> 00:39:42,152 And let me go into our Terminal window. 867 00:39:42,152 --> 00:39:45,110 So for those unfamiliar, a terminal window is just this black and white 868 00:39:45,110 --> 00:39:47,722 or this blue and white window in which you can type commands. 869 00:39:47,722 --> 00:39:50,180 And rather than point and click and double click on things, 870 00:39:50,180 --> 00:39:52,179 you can only do things textually in this window. 871 00:39:52,179 --> 00:39:54,710 And it turns out that, in the world of SQLite, 872 00:39:54,710 --> 00:39:58,850 we can actually use a command called SQLite3, 873 00:39:58,850 --> 00:40:01,050 which allows us to create a database. 874 00:40:01,050 --> 00:40:03,710 So in our human world, for CSVs, you've seen how, in code, you 875 00:40:03,710 --> 00:40:05,084 can just create rows and columns. 876 00:40:05,084 --> 00:40:07,286 You can literally type it in a text box. 877 00:40:07,286 --> 00:40:12,330 SQLite is a database technology that uses binary format, 0s and 1s. 878 00:40:12,330 --> 00:40:15,140 And so you're going to have to actually store your data differently 879 00:40:15,140 --> 00:40:16,140 using a program. 880 00:40:16,140 --> 00:40:22,070 So if I want to make a file called, for instance, froshims.db, 881 00:40:22,070 --> 00:40:25,070 I can literally say SQLite three froshims.db, 882 00:40:25,070 --> 00:40:26,450 and then I'm in this program. 883 00:40:26,450 --> 00:40:29,872 This is a text-based program via which I can create my rows and columns. 884 00:40:29,872 --> 00:40:32,580 And it, by designer, is going to be a little arcane for a moment. 885 00:40:32,580 --> 00:40:34,080 It's going to look a little cryptic. 886 00:40:34,080 --> 00:40:37,140 But if I want to go ahead and create a table, like a spreadsheet in here, 887 00:40:37,140 --> 00:40:39,350 I'm going to literally type CREATE TABLE. 888 00:40:39,350 --> 00:40:41,720 I'm going to call this registrants just like last week 889 00:40:41,720 --> 00:40:43,490 when we were looking at froshims examples. 890 00:40:43,490 --> 00:40:45,239 And the freshman intramural sports program 891 00:40:45,239 --> 00:40:47,210 has students registering for sports. 892 00:40:47,210 --> 00:40:49,190 So I'm going to have a registrants table here. 893 00:40:49,190 --> 00:40:51,770 I'm going to go ahead now and specify that I 894 00:40:51,770 --> 00:40:56,930 want a few columns for this database table, one of which I want to call ID, 895 00:40:56,930 --> 00:40:59,990 one of which I want to call name, and one of which I want to call dorm. 896 00:40:59,990 --> 00:41:01,386 So I'm going to keep-- whoops. 897 00:41:01,386 --> 00:41:02,510 I didn't mean to hit Enter. 898 00:41:02,510 --> 00:41:03,770 Let me just clean this up. 899 00:41:03,770 --> 00:41:07,130 So now I have three columns, ID, name, and dorm. 900 00:41:07,130 --> 00:41:09,170 And if I were to hit Enter, theoretically, it 901 00:41:09,170 --> 00:41:10,350 should create a database. 902 00:41:10,350 --> 00:41:13,058 It's not going to be graphical like Excel or Google Spreadsheets, 903 00:41:13,058 --> 00:41:16,850 but it will exist in my IDE in a file called froshims.db. 904 00:41:16,850 --> 00:41:18,170 But I can't just hit Enter yet. 905 00:41:18,170 --> 00:41:20,247 Because I have to tell the database a bit more. 906 00:41:20,247 --> 00:41:21,830 I need to tell it that, you know what? 907 00:41:21,830 --> 00:41:24,560 My ID value should probably be an integer. 908 00:41:24,560 --> 00:41:25,310 And you know what? 909 00:41:25,310 --> 00:41:31,280 My name should probably be char, or varchar? 910 00:41:31,280 --> 00:41:33,921 What should the name of a student be stored as? 911 00:41:33,921 --> 00:41:34,670 AUDIENCE: varchar. 912 00:41:34,670 --> 00:41:36,050 DAVID J. MALAN: Varchar, OK? 913 00:41:36,050 --> 00:41:38,050 So varchar because they're going to vary, right? 914 00:41:38,050 --> 00:41:39,400 If you can think of bunches of your friends, 915 00:41:39,400 --> 00:41:41,185 they might have short names or long names. 916 00:41:41,185 --> 00:41:42,310 OK, what's the upper bound? 917 00:41:42,310 --> 00:41:46,810 What's the length of your friends' longest name? 918 00:41:46,810 --> 00:41:47,630 AUDIENCE: 50. 919 00:41:47,630 --> 00:41:48,510 DAVID J. MALAN: 16? 920 00:41:48,510 --> 00:41:49,390 All right. 921 00:41:49,390 --> 00:41:54,218 Is anyone in here having a name, first name and last, that's longer than 16? 922 00:41:54,218 --> 00:41:55,542 AUDIENCE: That's 50, sorry. 923 00:41:55,542 --> 00:41:56,500 DAVID J. MALAN: Oh, 50. 924 00:41:56,500 --> 00:41:56,999 Sorry. 925 00:41:56,999 --> 00:41:58,355 I heard 16, I thought. 926 00:41:58,355 --> 00:42:00,730 Because you're not going be able to register for froshims 927 00:42:00,730 --> 00:42:02,479 because the database is going to truncate. 928 00:42:02,479 --> 00:42:05,830 And as an aside, if you've ever gone to a website and tried typing into a form 929 00:42:05,830 --> 00:42:08,740 field and either your keyboard stops working 930 00:42:08,740 --> 00:42:12,100 or you do type a long word or paragraph in, and then you hit Enter, 931 00:42:12,100 --> 00:42:15,550 and it's some of it's gone, well, that's because the database likely 932 00:42:15,550 --> 00:42:17,110 can only store so many bytes. 933 00:42:17,110 --> 00:42:20,014 And frankly, this often happens on customer service forums, right? 934 00:42:20,014 --> 00:42:22,930 They don't really want you being too verbose with the customer support 935 00:42:22,930 --> 00:42:23,430 staff. 936 00:42:23,430 --> 00:42:26,020 They'll cap the length of the field into which you're typing. 937 00:42:26,020 --> 00:42:27,860 And they're also doing that on the database, 938 00:42:27,860 --> 00:42:29,710 but also probably for practical human reasons, 939 00:42:29,710 --> 00:42:32,418 they don't want to read a big complaint that's this long as well. 940 00:42:32,418 --> 00:42:36,208 So 50-- anyone have a name longer than 50 characters? 941 00:42:36,208 --> 00:42:37,939 AUDIENCE: [INAUDIBLE] 942 00:42:37,939 --> 00:42:39,480 DAVID J. MALAN: Yeah, not many names. 943 00:42:39,480 --> 00:42:40,340 But you know what? 944 00:42:40,340 --> 00:42:42,230 We have technology to answer this. 945 00:42:42,230 --> 00:42:46,110 Longest name in world-- 946 00:42:46,110 --> 00:42:46,940 and it's this guy. 947 00:42:46,940 --> 00:42:49,400 Guinness Book of World Records is probably right. 948 00:42:49,400 --> 00:42:56,210 So his name is Barnaby Marmaduke Aloysius Benjy Cobweb Dartagnan Egbert 949 00:42:56,210 --> 00:43:02,420 Felix Gaspar Humbert Ignatius Jayden Kasper Leroy Maximilian Neddy Obiajulu 950 00:43:02,420 --> 00:43:08,420 Pepin Quilliam Rosenkranz Sexton Teddy Upwood Vivatma Wayland Xylon Yardley 951 00:43:08,420 --> 00:43:10,400 Zachary Usansky. 952 00:43:10,400 --> 00:43:13,720 But he also goes by Nick, apparently. 953 00:43:13,720 --> 00:43:14,930 But OK. 954 00:43:14,930 --> 00:43:16,550 Let me go ahead and highlight this. 955 00:43:16,550 --> 00:43:18,530 And I could count this manually. 956 00:43:18,530 --> 00:43:22,112 But I'm sure someone has made a website to count characters in a string. 957 00:43:22,112 --> 00:43:23,320 There we go, lettercount.com. 958 00:43:23,320 --> 00:43:25,340 OK. 959 00:43:25,340 --> 00:43:28,760 Let's go ahead and paste that in, count the characters, 225-- 960 00:43:28,760 --> 00:43:32,780 so nick will not be registering unless we don't support just 16 or 50. 961 00:43:32,780 --> 00:43:34,969 Looks like we need at least 225. 962 00:43:34,969 --> 00:43:37,010 So at some point, you do have to make a decision. 963 00:43:37,010 --> 00:43:38,930 And honestly, it's not always obvious. 964 00:43:38,930 --> 00:43:42,370 You do have to specify that varchar is going to be-- eh, it could be 16, 965 00:43:42,370 --> 00:43:44,510 could be 50, could be 225. 966 00:43:44,510 --> 00:43:46,010 But you know what? if he has kids-- 967 00:43:46,010 --> 00:43:50,240 so let's just round up a bit and call it varchar 255, honestly, 968 00:43:50,240 --> 00:43:52,050 only because it was a convention. 969 00:43:52,050 --> 00:43:54,860 So back in the day, 255 tended to be the max length 970 00:43:54,860 --> 00:43:56,510 for this field in older databases. 971 00:43:56,510 --> 00:43:59,786 And that at least feels like it's probably enough wiggle room for names. 972 00:43:59,786 --> 00:44:01,910 But you have to make a judgment call at some point. 973 00:44:01,910 --> 00:44:05,189 We could ignore the problem say 1,000 characters. 974 00:44:05,189 --> 00:44:06,980 But if that's never really going to happen, 975 00:44:06,980 --> 00:44:09,438 and your potentially wasting space, you probably shouldn't. 976 00:44:09,438 --> 00:44:12,920 Because even though it's an upper bound, you're sacrificing something like time, 977 00:44:12,920 --> 00:44:15,440 potentially, to search that field if you're just 978 00:44:15,440 --> 00:44:19,130 telling the database that it's going to be bigger than it ever actually 979 00:44:19,130 --> 00:44:19,670 will be. 980 00:44:19,670 --> 00:44:22,430 It needs to be more finely hinted. 981 00:44:22,430 --> 00:44:23,360 So what about dorm? 982 00:44:23,360 --> 00:44:25,050 Oh, god, now we have to do this again. 983 00:44:25,050 --> 00:44:27,260 How do we do it with dorm? 984 00:44:27,260 --> 00:44:30,260 What's the longest length of a dorm or a house on campus? 985 00:44:30,260 --> 00:44:32,590 AUDIENCE: [INAUDIBLE] 986 00:44:32,590 --> 00:44:36,720 DAVID J. MALAN: I'm, like, OK [INAUDIBLE],, house maybe. 987 00:44:36,720 --> 00:44:40,450 But honestly, at some point, it doesn't matter too much. 988 00:44:40,450 --> 00:44:42,002 But being consistent is what matters. 989 00:44:42,002 --> 00:44:42,960 It's a matter of style. 990 00:44:42,960 --> 00:44:44,100 It's a matter of design. 991 00:44:44,100 --> 00:44:46,950 At some point, you don't have to nitpick every little value. 992 00:44:46,950 --> 00:44:49,033 But you should probably pick some reasonable value 993 00:44:49,033 --> 00:44:52,330 that you could justify to a roommate or to a colleague or to a teaching fellow. 994 00:44:52,330 --> 00:44:55,410 And then say this feels like enough without it being actually excessive. 995 00:44:55,410 --> 00:44:58,470 And there's one last thing I'm going to do here is just 996 00:44:58,470 --> 00:45:00,120 end my thought with a semicolon. 997 00:45:00,120 --> 00:45:01,640 I apologize, they're back. 998 00:45:01,640 --> 00:45:05,244 But at the end of my start here, if I haven't made any typographical errors, 999 00:45:05,244 --> 00:45:06,660 and I'm going to go and hit Enter. 1000 00:45:06,660 --> 00:45:08,100 Nothing seems to happen. 1001 00:45:08,100 --> 00:45:11,550 And ironically, in CS50, in programming, nothing happening 1002 00:45:11,550 --> 00:45:14,640 is usually a good thing because it means no errors have happened. 1003 00:45:14,640 --> 00:45:17,130 If, in SQLite, I type schema, I can actually 1004 00:45:17,130 --> 00:45:20,880 just see a regurgitation of the table I just created. 1005 00:45:20,880 --> 00:45:23,940 But more importantly, I can start to insert data into this. 1006 00:45:23,940 --> 00:45:25,950 Now unfortunately, it's not yet graphical. 1007 00:45:25,950 --> 00:45:29,370 But if I want to insert data into this, I can do this. 1008 00:45:29,370 --> 00:45:33,090 INSERT INTO registrants-- well, what do you want to insert? 1009 00:45:33,090 --> 00:45:36,390 I want to go ahead and insert an ID, a name, and a dorm. 1010 00:45:36,390 --> 00:45:37,530 Well, with what values? 1011 00:45:37,530 --> 00:45:38,940 Well, with these values-- 1012 00:45:38,940 --> 00:45:41,250 the ID will be-- the first registrant will be 1. 1013 00:45:41,250 --> 00:45:43,260 The first one to register will be Brian. 1014 00:45:43,260 --> 00:45:44,952 And his dorm, Brian where did you live? 1015 00:45:44,952 --> 00:45:46,560 BRIAN: Out on Pennypacker. 1016 00:45:46,560 --> 00:45:48,290 DAVID J. MALAN: Pennypacker. 1017 00:45:48,290 --> 00:45:48,840 So it's good. 1018 00:45:48,840 --> 00:45:51,240 That might be close to four timer too lengthwise. 1019 00:45:51,240 --> 00:45:53,160 So I'm going to go ahead and do this. 1020 00:45:53,160 --> 00:45:55,680 Here is an example of a SQL statement. 1021 00:45:55,680 --> 00:45:58,380 CREATE TABLE is one verb or expression you can use. 1022 00:45:58,380 --> 00:45:59,910 INSERT INTO is another. 1023 00:45:59,910 --> 00:46:03,930 You specify the name of the table, the fields of the table, and then values. 1024 00:46:03,930 --> 00:46:07,680 And now I'm capitalizing just to make clear what are SQL commands 1025 00:46:07,680 --> 00:46:11,175 and what are actually just words I, the human developer, chose. 1026 00:46:11,175 --> 00:46:12,300 But it's just a convention. 1027 00:46:12,300 --> 00:46:15,510 These uppercase words could probably be lower case in most contexts too. 1028 00:46:15,510 --> 00:46:18,000 But it helps things, I think visually, to distinguish. 1029 00:46:18,000 --> 00:46:19,109 Now I go ahead and enter. 1030 00:46:19,109 --> 00:46:20,150 Nothing seemed to happen. 1031 00:46:20,150 --> 00:46:21,400 That's probably a good thing. 1032 00:46:21,400 --> 00:46:23,070 Let's go ahead and register one other person. 1033 00:46:23,070 --> 00:46:24,480 I'll be the second registrant. 1034 00:46:24,480 --> 00:46:27,450 So maybe David from Matthews. 1035 00:46:27,450 --> 00:46:29,670 And so Matthews here, Enter. 1036 00:46:29,670 --> 00:46:32,940 And now, if I want to select all of the students in the database, 1037 00:46:32,940 --> 00:46:35,460 I can go ahead and say SELECT FROM-- 1038 00:46:35,460 --> 00:46:36,390 or you know what? 1039 00:46:36,390 --> 00:46:39,900 Let's select everything as denoted in many languages by star, 1040 00:46:39,900 --> 00:46:44,010 from registrants semicolon enter, and there we have it. 1041 00:46:44,010 --> 00:46:46,780 It's kind of a tiny super simple, lightweight database. 1042 00:46:46,780 --> 00:46:49,380 But there are my rows and columns much like Excel 1043 00:46:49,380 --> 00:46:51,690 and Google Spreadsheets would lay them out for me. 1044 00:46:51,690 --> 00:46:53,350 But it gets better than this. 1045 00:46:53,350 --> 00:46:57,390 Suppose I want to search this database for all of the students who 1046 00:46:57,390 --> 00:46:58,744 registered for Matthews. 1047 00:46:58,744 --> 00:47:00,660 And suppose that time passes and more students 1048 00:47:00,660 --> 00:47:02,370 actually register for forshims. 1049 00:47:02,370 --> 00:47:03,930 I can actually filter this data. 1050 00:47:03,930 --> 00:47:09,510 I can do something like SELECT star FROM registrants WHERE 1051 00:47:09,510 --> 00:47:12,810 dorm equals quote unquote, "Matthews." 1052 00:47:12,810 --> 00:47:16,610 And so I can filter it, hit Enter, and now I get back just one row. 1053 00:47:16,610 --> 00:47:18,360 And if your should mind starts to wander-- 1054 00:47:18,360 --> 00:47:21,129 wow, if I could introduce Python or JavaScript into this, 1055 00:47:21,129 --> 00:47:22,170 you know what I could do? 1056 00:47:22,170 --> 00:47:25,650 I could probably get back, not this and this ASCII-based table, this text 1057 00:47:25,650 --> 00:47:26,250 table. 1058 00:47:26,250 --> 00:47:28,966 Maybe I could get back an actual list of rows 1059 00:47:28,966 --> 00:47:31,090 so that I can actually do something with that data. 1060 00:47:31,090 --> 00:47:32,470 And that's, indeed, where we're going with this. 1061 00:47:32,470 --> 00:47:34,950 So if I want to select someone else, I could do hmm-- 1062 00:47:34,950 --> 00:47:39,600 maybe SELECT just the NAME from registrants 1063 00:47:39,600 --> 00:47:41,850 where dorm equals "Mathews." 1064 00:47:41,850 --> 00:47:43,980 If I only care about knowing who registered, 1065 00:47:43,980 --> 00:47:46,420 I could do that and whittle it down to even less data. 1066 00:47:46,420 --> 00:47:49,230 So already, in just these few commands, I 1067 00:47:49,230 --> 00:47:52,140 can express so much more functionality than you could with a CSV. 1068 00:47:52,140 --> 00:47:54,510 To do this and CSV, you would have had to write all 1069 00:47:54,510 --> 00:47:56,700 of these lines of Python code yourself. 1070 00:47:56,700 --> 00:47:57,690 What if Brian moved? 1071 00:47:57,690 --> 00:47:59,610 So Brian really didn't like the Union dorm, 1072 00:47:59,610 --> 00:48:02,280 so we're going to go ahead and UPDATE registrants 1073 00:48:02,280 --> 00:48:05,225 and SET dorm equal to-- where do want to move to? 1074 00:48:05,225 --> 00:48:05,850 BRIAN: Canaday. 1075 00:48:05,850 --> 00:48:09,210 DAVID J. MALAN: "Canaday" WHERE-- 1076 00:48:09,210 --> 00:48:10,835 I could do this a few ways. 1077 00:48:10,835 --> 00:48:11,710 What's your instinct? 1078 00:48:11,710 --> 00:48:14,160 How could I identify Brian and only Brian so-- 1079 00:48:14,160 --> 00:48:16,890 I don't want to move to Canaday. 1080 00:48:16,890 --> 00:48:18,120 How do we move just Brian? 1081 00:48:18,120 --> 00:48:19,495 AUDIENCE: [INAUDIBLE] 1082 00:48:19,495 --> 00:48:20,370 DAVID J. MALAN: Yeah. 1083 00:48:20,370 --> 00:48:24,930 Maybe we could WHERE name equals quote unquote "Brian" and Enter. 1084 00:48:24,930 --> 00:48:27,540 I like this, but someone find fault with this if you could. 1085 00:48:27,540 --> 00:48:28,691 AUDIENCE: Two Brians. 1086 00:48:28,691 --> 00:48:31,440 DAVID J. MALAN: If there are two Brians, we're moving both of them 1087 00:48:31,440 --> 00:48:32,356 are all three of them. 1088 00:48:32,356 --> 00:48:35,120 So maybe better, honestly, would be to say, mm, 1089 00:48:35,120 --> 00:48:37,680 let's just say where the ID equals 1. 1090 00:48:37,680 --> 00:48:40,819 Now, of course, I need to know Brian's Harvard ID number or whatnot. 1091 00:48:40,819 --> 00:48:42,360 But this is going to be more precise. 1092 00:48:42,360 --> 00:48:45,690 Because the supposition here is that the leftmost column or ID 1093 00:48:45,690 --> 00:48:50,520 field, by human convention, should be unique so that it uniquely identifies 1094 00:48:50,520 --> 00:48:52,710 the room so we can have as many Brians as we want, 1095 00:48:52,710 --> 00:48:54,210 but we're not going to confuse them. 1096 00:48:54,210 --> 00:48:56,950 If I go ahead and hit Enter now, nothing seems to happen. 1097 00:48:56,950 --> 00:49:02,280 But if I select star from registrants again, and hit Enter, notice now, 1098 00:49:02,280 --> 00:49:04,584 Brian has indeed been moved to Canaday. 1099 00:49:04,584 --> 00:49:06,000 So there are so many other things. 1100 00:49:06,000 --> 00:49:06,875 Brian, you know what? 1101 00:49:06,875 --> 00:49:09,060 You're not very good at sports the team concluded. 1102 00:49:09,060 --> 00:49:13,890 So delete from registrants WHERE ID equals 1. 1103 00:49:13,890 --> 00:49:18,810 And now if we select star from registrants, we have just me left. 1104 00:49:18,810 --> 00:49:20,130 But Brian is gone as well. 1105 00:49:20,130 --> 00:49:23,800 So we have the ability to INSERT, to SELECT, to UPDATE, and DELETE, 1106 00:49:23,800 --> 00:49:25,920 and CREATE, all the while filtering. 1107 00:49:25,920 --> 00:49:28,710 And we've only just scratched the surface here of what's possible. 1108 00:49:28,710 --> 00:49:31,670 Because it turns out, we can store so much data in these databases. 1109 00:49:31,670 --> 00:49:33,539 All of our students, all of our faculty-- 1110 00:49:33,539 --> 00:49:35,330 if you're a company, you can start to store 1111 00:49:35,330 --> 00:49:37,970 all of the products in your database, all of the orders, 1112 00:49:37,970 --> 00:49:39,050 all of your customers. 1113 00:49:39,050 --> 00:49:42,216 But as soon as you start going down that road, well, gee, what's a customer? 1114 00:49:42,216 --> 00:49:46,400 Customers have names and ID numbers and maybe email addresses 1115 00:49:46,400 --> 00:49:47,390 and postal addresses. 1116 00:49:47,390 --> 00:49:48,860 Those are going to get messy-- phone numbers, 1117 00:49:48,860 --> 00:49:50,600 which are kind of like integers, but not quite. 1118 00:49:50,600 --> 00:49:53,030 So there are so many questions we still need to answer. 1119 00:49:53,030 --> 00:49:55,940 But feels like it's time for some fruit and some muffins. 1120 00:49:55,940 --> 00:49:58,390 So why don't we go ahead and take a five-minute break, turn on some music. 1121 00:49:58,390 --> 00:50:00,014 If parents need to depart, that's fine. 1122 00:50:00,014 --> 00:50:02,540 And we'll see you back here in five or so minutes. 1123 00:50:02,540 --> 00:50:03,320 All right. 1124 00:50:03,320 --> 00:50:03,950 So we're back. 1125 00:50:03,950 --> 00:50:06,710 And where we left off was exactly here using 1126 00:50:06,710 --> 00:50:10,880 SQLite3, which is the version 3 of the SQLite command, which 1127 00:50:10,880 --> 00:50:14,360 is just a text-based interface to the technology 1128 00:50:14,360 --> 00:50:16,280 that we should think of as SQLite. 1129 00:50:16,280 --> 00:50:18,980 SQLite is interesting and it's lightweight in the sense 1130 00:50:18,980 --> 00:50:22,010 that it's not a server, it's not fancy software that you have to run. 1131 00:50:22,010 --> 00:50:25,850 It literally stores all of your data in a file, but that, by convention, 1132 00:50:25,850 --> 00:50:28,970 ends in .db or maybe even .sqlite. 1133 00:50:28,970 --> 00:50:31,570 But it's just a binary file, 0s and 1s. 1134 00:50:31,570 --> 00:50:34,320 It's not text that you can open with Excel or something like that. 1135 00:50:34,320 --> 00:50:35,510 So it's not a CSV. 1136 00:50:35,510 --> 00:50:37,110 So it's stored on file. 1137 00:50:37,110 --> 00:50:40,520 This means that you don't need particular experience 1138 00:50:40,520 --> 00:50:41,990 with setting up a database server. 1139 00:50:41,990 --> 00:50:45,590 You don't need memory and software to actually run on the computer. 1140 00:50:45,590 --> 00:50:47,990 You can store everything locally, but you do pay a price. 1141 00:50:47,990 --> 00:50:51,440 Because recall from our past discussions, disks are slow. 1142 00:50:51,440 --> 00:50:53,390 And if you're storing your data on the disk, 1143 00:50:53,390 --> 00:50:56,240 it's not going to be nearly as fast as storing it only in RAM. 1144 00:50:56,240 --> 00:50:59,180 And that's why these other technologies like Postgres and MySQL, 1145 00:50:59,180 --> 00:51:03,080 and Microsoft Access, and SQL Server, and Oracle exist. 1146 00:51:03,080 --> 00:51:07,040 Those are fancier products where someone literally does double-click an icon 1147 00:51:07,040 --> 00:51:09,260 or run a command that runs a program that 1148 00:51:09,260 --> 00:51:13,400 stays running in the computer's memory and gives you even better performance. 1149 00:51:13,400 --> 00:51:15,650 But for our purposes, pedagogically SQLite is handy. 1150 00:51:15,650 --> 00:51:18,320 But we're still going to ask the questions about data types 1151 00:51:18,320 --> 00:51:22,610 because SQLite does support fancier data types like those in Postgres 1152 00:51:22,610 --> 00:51:26,960 and SQL Server, MySQL, and Oracle, and those are the lower-case ones 1153 00:51:26,960 --> 00:51:28,160 we introduced a bit ago. 1154 00:51:28,160 --> 00:51:30,290 But honestly, this is going to get tedious quickly. 1155 00:51:30,290 --> 00:51:34,520 Even I rarely remember exactly the right syntax when creating tables, 1156 00:51:34,520 --> 00:51:36,230 the order in which everything has to go. 1157 00:51:36,230 --> 00:51:39,117 So I tend to use a Graphical User Interface, or GUI myself. 1158 00:51:39,117 --> 00:51:40,700 And there are lots of tools out there. 1159 00:51:40,700 --> 00:51:43,150 But we've built one that's free and open source the CS50 1160 00:51:43,150 --> 00:51:47,670 IDE that just makes it a little easier to edit your SQL tables. 1161 00:51:47,670 --> 00:51:52,430 So I'm going to go ahead and Exit out of SQLite. 1162 00:51:52,430 --> 00:51:54,282 And I'm going to go ahead over here. 1163 00:51:54,282 --> 00:51:56,240 And you'll notice that we have all of the files 1164 00:51:56,240 --> 00:51:59,870 from today including, most recently, froshims.db. 1165 00:51:59,870 --> 00:52:02,810 That is the file I created with SQLite3. 1166 00:52:02,810 --> 00:52:05,750 But if you double-click it when using CS50 IDE, 1167 00:52:05,750 --> 00:52:08,930 it's actually going to open a program that's called phpLiteAdmin. 1168 00:52:08,930 --> 00:52:11,420 It happens to be written in another language called PHP, 1169 00:52:11,420 --> 00:52:14,750 has nothing to do with databases, just the name of the product here. 1170 00:52:14,750 --> 00:52:18,620 But it's a tool that, using pretty simple HTML tables, 1171 00:52:18,620 --> 00:52:22,842 just gives us a graphical user interface over the exact same functionality. 1172 00:52:22,842 --> 00:52:24,800 And personally, I just find this easier to use. 1173 00:52:24,800 --> 00:52:26,550 And pedagogically, it's going to be better 1174 00:52:26,550 --> 00:52:30,480 because it's going to show us the available data types for our table. 1175 00:52:30,480 --> 00:52:33,620 So for instance, notice here-- there's a lot going on the screen. 1176 00:52:33,620 --> 00:52:36,560 But a lot of this is just uninteresting details. 1177 00:52:36,560 --> 00:52:39,830 But notice here, I see a table called registrants. 1178 00:52:39,830 --> 00:52:41,150 And then I can browse it. 1179 00:52:41,150 --> 00:52:42,320 I can see its structure. 1180 00:52:42,320 --> 00:52:45,040 I can execute manual SQL by typing it in. 1181 00:52:45,040 --> 00:52:46,040 I can search it, insert. 1182 00:52:46,040 --> 00:52:47,514 I can do bunches of things to it. 1183 00:52:47,514 --> 00:52:50,180 And that's why this graphical user interface is just convenient. 1184 00:52:50,180 --> 00:52:51,888 Let me go ahead and click on registrants. 1185 00:52:51,888 --> 00:52:57,050 And by default, you'll see these tabs now, Browse, Structure, SQL, and then 1186 00:52:57,050 --> 00:52:58,040 a bunch of others. 1187 00:52:58,040 --> 00:52:59,360 And notice over here. 1188 00:52:59,360 --> 00:53:02,630 You'll see the one row and the one registrant who actually remains. 1189 00:53:02,630 --> 00:53:07,910 Because when we last left off, we removed Brian forcibly from the team. 1190 00:53:07,910 --> 00:53:10,220 So suppose I want to go ahead and add more rows just 1191 00:53:10,220 --> 00:53:11,360 for the sake of discussion. 1192 00:53:11,360 --> 00:53:12,680 I can just do it manually here. 1193 00:53:12,680 --> 00:53:16,400 I can go in here and I can say, let's say the third player 1194 00:53:16,400 --> 00:53:19,670 is going to be Veronica. 1195 00:53:19,670 --> 00:53:22,860 I think she was also in Matthews so I can just type that in here. 1196 00:53:22,860 --> 00:53:24,680 And I can just go ahead and click Insert. 1197 00:53:24,680 --> 00:53:28,310 But what's nice about phpLiteAdmin is that it will not only 1198 00:53:28,310 --> 00:53:31,130 insert the rows for you, it will show you 1199 00:53:31,130 --> 00:53:34,590 the SQLite code with which you could have done it yourself. 1200 00:53:34,590 --> 00:53:37,550 So it's a nice visual reinforcement of that exact same command. 1201 00:53:37,550 --> 00:53:40,130 And you'll notice they use double quotes instead of my single quotes before. 1202 00:53:40,130 --> 00:53:42,213 They're sometimes interchangeable, but not always. 1203 00:53:42,213 --> 00:53:44,780 So it's a wonderful way of just learning how you can actually 1204 00:53:44,780 --> 00:53:48,110 do this with the right SQL code so you're not oversimplifying it 1205 00:53:48,110 --> 00:53:48,860 with the GUI. 1206 00:53:48,860 --> 00:53:49,280 But you know what? 1207 00:53:49,280 --> 00:53:49,970 Let's do this. 1208 00:53:49,970 --> 00:53:52,970 I'm going to go ahead and start over because I 1209 00:53:52,970 --> 00:53:54,660 want to make some better decisions. 1210 00:53:54,660 --> 00:53:57,409 I'm going to go ahead and, literally, right-click or Control-click 1211 00:53:57,409 --> 00:53:59,030 this, delete froshims.db. 1212 00:53:59,030 --> 00:54:01,230 And let's actually start this from scratch. 1213 00:54:01,230 --> 00:54:03,850 Now previously when I did this, I could have, 1214 00:54:03,850 --> 00:54:08,900 in my Terminal window, gone ahead and done SQLite3 and then, what was it? 1215 00:54:08,900 --> 00:54:11,510 Froshims.db and created the file. 1216 00:54:11,510 --> 00:54:13,760 For now, I'm going to actually just go ahead and touch 1217 00:54:13,760 --> 00:54:16,280 a file called that name, which is a simple command that 1218 00:54:16,280 --> 00:54:18,680 just literally creates it, but puts nothing in it, 1219 00:54:18,680 --> 00:54:20,109 just so I have an empty file. 1220 00:54:20,109 --> 00:54:21,900 And you'll see that it just popped up again 1221 00:54:21,900 --> 00:54:23,600 on the left-hand side, which is handy. 1222 00:54:23,600 --> 00:54:27,110 And now I can double-click this version of it, which has nothing in it yet. 1223 00:54:27,110 --> 00:54:28,460 And I'm back to phpLiteAdmin. 1224 00:54:28,460 --> 00:54:30,800 But notice no table in database. 1225 00:54:30,800 --> 00:54:33,230 So let's start to ask some of the harder questions 1226 00:54:33,230 --> 00:54:36,860 as to actually how to create data in a database. 1227 00:54:36,860 --> 00:54:40,490 So let me go ahead and go to this field here-- create 1228 00:54:40,490 --> 00:54:42,680 new table on database froshims. 1229 00:54:42,680 --> 00:54:44,570 I'll go ahead and call it registrants again. 1230 00:54:44,570 --> 00:54:46,640 But how many fields? 1231 00:54:46,640 --> 00:54:50,990 So let's go with ID, name, and dorm, what else? 1232 00:54:50,990 --> 00:54:53,080 Email I heard earlier. 1233 00:54:53,080 --> 00:54:54,512 Age I heard earlier. 1234 00:54:54,512 --> 00:54:55,970 AUDIENCE: Sports you want to be in. 1235 00:54:55,970 --> 00:54:56,245 DAVID J. MALAN: Sorry. 1236 00:54:56,245 --> 00:54:57,770 So sports you want-- excuse me-- 1237 00:54:57,770 --> 00:54:59,082 sports you want to be in. 1238 00:54:59,082 --> 00:55:01,040 OK. six fields, six fields, let's go with that. 1239 00:55:01,040 --> 00:55:01,910 So it's, actually, you know what? 1240 00:55:01,910 --> 00:55:02,451 Seven fields. 1241 00:55:02,451 --> 00:55:03,930 I want phone numbers too this time. 1242 00:55:03,930 --> 00:55:05,390 So let me go ahead and click Go. 1243 00:55:05,390 --> 00:55:08,737 And now you'll see just a GUI way of prompting you for all the answers 1244 00:55:08,737 --> 00:55:10,070 to the same questions as before. 1245 00:55:10,070 --> 00:55:10,940 And maybe this is clearer. 1246 00:55:10,940 --> 00:55:11,690 Maybe it's not. 1247 00:55:11,690 --> 00:55:14,270 But just no alternative to the memorizing 1248 00:55:14,270 --> 00:55:15,920 exactly what the commands need to be. 1249 00:55:15,920 --> 00:55:19,580 So top to bottom, here are all of the fields I need to decide on. 1250 00:55:19,580 --> 00:55:22,537 Ironically, the fields is just a synonym for columns. 1251 00:55:22,537 --> 00:55:24,620 And yet my columns are currently laid out in rows, 1252 00:55:24,620 --> 00:55:26,580 but that's just a UI issue. 1253 00:55:26,580 --> 00:55:28,130 So let me go ahead and decide. 1254 00:55:28,130 --> 00:55:32,360 By convention, my first field is almost always ID or probably should be. 1255 00:55:32,360 --> 00:55:35,290 And we'll see why this is powerful and just a little bit. 1256 00:55:35,290 --> 00:55:38,341 The data type for that, by convention, should be integer, 1257 00:55:38,341 --> 00:55:40,340 unless you've got a lot of data like a Facebook, 1258 00:55:40,340 --> 00:55:42,180 and then big int might make more sense. 1259 00:55:42,180 --> 00:55:43,970 But notice this Dropdown actually gives us 1260 00:55:43,970 --> 00:55:47,070 a nice menu of options just as before. 1261 00:55:47,070 --> 00:55:50,060 And you'll see in gray text, the category, 1262 00:55:50,060 --> 00:55:53,690 or in fancy terms, the affinity of these various types in SQLite 1263 00:55:53,690 --> 00:55:55,940 and in lower case black words here, you'll 1264 00:55:55,940 --> 00:56:00,165 see the actual data type supported by big popular databases like Postgres 1265 00:56:00,165 --> 00:56:02,290 that you might want to use for your final projects. 1266 00:56:02,290 --> 00:56:03,998 In fact, we're introducing these with eye 1267 00:56:03,998 --> 00:56:07,492 toward your using these for final projects in the cloud, not in CS50 IDE, 1268 00:56:07,492 --> 00:56:09,200 but actually getting your own domain name 1269 00:56:09,200 --> 00:56:11,939 and putting your website, if you do a web app, out there. 1270 00:56:11,939 --> 00:56:13,730 So here, we have all of my available types. 1271 00:56:13,730 --> 00:56:17,700 And under Integer, I'm going to go ahead and literally choose Integer for my ID. 1272 00:56:17,700 --> 00:56:18,200 All right. 1273 00:56:18,200 --> 00:56:19,707 Next, go ahead and Zoom Out. 1274 00:56:19,707 --> 00:56:21,540 Let me go ahead and choose a student's name. 1275 00:56:21,540 --> 00:56:24,755 Before did we do did we decide on char or varchar? 1276 00:56:24,755 --> 00:56:25,645 AUDIENCE: varchar. 1277 00:56:25,645 --> 00:56:26,840 DAVID J. MALAN: OK, varchar, and what size? 1278 00:56:26,840 --> 00:56:27,672 AUDIENCE: 255. 1279 00:56:27,672 --> 00:56:28,700 DAVID J. MALAN: 255. 1280 00:56:28,700 --> 00:56:31,230 So the user interface here just allows me to type it in. 1281 00:56:31,230 --> 00:56:34,220 So the syntax is a little different because it's a GUI, but 255. 1282 00:56:34,220 --> 00:56:35,290 But you know what? 1283 00:56:35,290 --> 00:56:37,100 The last field was dorm. 1284 00:56:37,100 --> 00:56:39,020 That too, I think we said varchar. 1285 00:56:39,020 --> 00:56:40,130 So let me choose that. 1286 00:56:40,130 --> 00:56:42,300 And 255, though, this one's a little more debatable. 1287 00:56:42,300 --> 00:56:45,260 I'm not sure what the right number is, so in the absence of clarity, 1288 00:56:45,260 --> 00:56:48,690 I'm just going to standardize on some same value without being too wasteful. 1289 00:56:48,690 --> 00:56:51,440 But notice there are a few questions here that we haven't come to. 1290 00:56:51,440 --> 00:56:54,074 But our perfect segue earlier hinted at this. 1291 00:56:54,074 --> 00:56:56,240 It turns out, you need to make a few other decisions 1292 00:56:56,240 --> 00:56:57,920 when designing a database. 1293 00:56:57,920 --> 00:57:01,070 If you know in advance that one of your fields 1294 00:57:01,070 --> 00:57:04,350 is the primary piece of data to uniquely identify users, 1295 00:57:04,350 --> 00:57:06,350 that's what's going to be called a primary key-- 1296 00:57:06,350 --> 00:57:11,300 the column or fields that, guaranteed, is going to identify users uniquely. 1297 00:57:11,300 --> 00:57:14,720 So if you've got two Brians, each of them is going to have its own ID. 1298 00:57:14,720 --> 00:57:18,860 That therefore, is your primary key, not the name field Brian. 1299 00:57:18,860 --> 00:57:22,130 So I'm going to tell the database, this is the primary key. 1300 00:57:22,130 --> 00:57:24,590 And it's going to help me keep track of that uniqueness. 1301 00:57:24,590 --> 00:57:26,750 Moreover, this is a fancy feature. 1302 00:57:26,750 --> 00:57:29,150 Before, I was manually and very arbitrarily 1303 00:57:29,150 --> 00:57:32,720 saying Brian will be number 1, I'll be number 2, Veronica will be number 3. 1304 00:57:32,720 --> 00:57:33,482 That's tedious. 1305 00:57:33,482 --> 00:57:35,940 Like, computer should be able to solve that problem for me. 1306 00:57:35,940 --> 00:57:38,210 I don't want to think about who is idea number what. 1307 00:57:38,210 --> 00:57:39,930 You can auto-increment the field. 1308 00:57:39,930 --> 00:57:43,130 So if I actually check this box, SQL for me 1309 00:57:43,130 --> 00:57:46,520 will just plus-plus, plus-plus the idea field every time 1310 00:57:46,520 --> 00:57:48,230 I insert a new name and a new dorm. 1311 00:57:48,230 --> 00:57:50,930 I don't have to even bother specifying an ID anymore. 1312 00:57:50,930 --> 00:57:53,180 Now there's another column here, not null, 1313 00:57:53,180 --> 00:57:55,790 where you can specify this column should never be null. 1314 00:57:55,790 --> 00:57:59,090 And this is important because, if you're building a website that has important 1315 00:57:59,090 --> 00:58:01,010 data that you must have from the users-- like, 1316 00:58:01,010 --> 00:58:04,970 your app won't work without the user's username or their password 1317 00:58:04,970 --> 00:58:06,180 or their email address-- 1318 00:58:06,180 --> 00:58:07,460 you can say not null. 1319 00:58:07,460 --> 00:58:10,520 And your database will ensure that you can't even 1320 00:58:10,520 --> 00:58:15,200 insert a row into this database unless you give it a value for that field. 1321 00:58:15,200 --> 00:58:17,140 So it helps you protect you against yourself. 1322 00:58:17,140 --> 00:58:18,890 Because you could certainly implement that 1323 00:58:18,890 --> 00:58:20,930 logically in Python or any language. 1324 00:58:20,930 --> 00:58:24,950 But the database is a final gauntlet as well. 1325 00:58:24,950 --> 00:58:27,230 A default value doesn't really make sense here. 1326 00:58:27,230 --> 00:58:30,170 But for certain types of fields, you can say, database, 1327 00:58:30,170 --> 00:58:33,702 insert the current time or the current date for me? 1328 00:58:33,702 --> 00:58:34,910 Now, why might you want that? 1329 00:58:34,910 --> 00:58:39,004 Why date and time by default? 1330 00:58:39,004 --> 00:58:40,045 Why might that be useful? 1331 00:58:40,045 --> 00:58:41,671 AUDIENCE: When the account was created? 1332 00:58:41,671 --> 00:58:44,253 DAVID J. MALAN: When the account was created, when they bought 1333 00:58:44,253 --> 00:58:46,720 a foo, when they shipped a bar-- any number of reasons. 1334 00:58:46,720 --> 00:58:48,550 You might just want to know, what is the time right now? 1335 00:58:48,550 --> 00:58:49,750 But you don't have to write code for that. 1336 00:58:49,750 --> 00:58:51,750 The database can answer those questions for you. 1337 00:58:51,750 --> 00:58:55,360 So just so much more functionality than we got, of course, with CSVs alone. 1338 00:58:55,360 --> 00:58:57,760 So name, should it be a primary key? 1339 00:58:57,760 --> 00:58:58,386 No. 1340 00:58:58,386 --> 00:59:00,010 Otherwise, we couldn't have two Brians. 1341 00:59:00,010 --> 00:59:02,620 And generally, your primary key will be one field, though, 1342 00:59:02,620 --> 00:59:05,329 theoretically you could make joint columns if you wanted. 1343 00:59:05,329 --> 00:59:06,870 But generally, it'll be a single one. 1344 00:59:06,870 --> 00:59:10,140 Should we auto-increment Brian-- so it's, like, well-- 1345 00:59:10,140 --> 00:59:14,030 Brian28, or Brian2, Brian3, and so forth? 1346 00:59:14,030 --> 00:59:14,530 No. 1347 00:59:14,530 --> 00:59:15,613 Doesn't really make sense. 1348 00:59:15,613 --> 00:59:16,570 Not null? 1349 00:59:16,570 --> 00:59:17,290 Probably. 1350 00:59:17,290 --> 00:59:19,894 I want all of the freshmen's names so that we 1351 00:59:19,894 --> 00:59:21,310 know who is signing up for sports. 1352 00:59:21,310 --> 00:59:22,200 And dorm? 1353 00:59:22,200 --> 00:59:22,960 Yeah, not null. 1354 00:59:22,960 --> 00:59:24,700 But oh, corner case-- 1355 00:59:24,700 --> 00:59:27,589 can anyone think of a corner case where dorm maybe 1356 00:59:27,589 --> 00:59:28,880 should kind of sort of be null? 1357 00:59:28,880 --> 00:59:29,592 AUDIENCE: They're off-campus. 1358 00:59:29,592 --> 00:59:31,330 DAVID J. MALAN: They commute, they're off-campus-- 1359 00:59:31,330 --> 00:59:32,440 it's not many students. 1360 00:59:32,440 --> 00:59:35,570 But if you have 1%, 5% of students living off-campus, 1361 00:59:35,570 --> 00:59:37,150 this is a design question now. 1362 00:59:37,150 --> 00:59:39,749 And all of us have probably visited some website 1363 00:59:39,749 --> 00:59:42,040 where you just can't fill out the form in the right way 1364 00:59:42,040 --> 00:59:44,707 because you don't fit their mold or their expectations. 1365 00:59:44,707 --> 00:59:46,790 And that's just because of a poor design decision. 1366 00:59:46,790 --> 00:59:49,077 So let's allow it to be null just in case. 1367 00:59:49,077 --> 00:59:49,660 Now what else? 1368 00:59:49,660 --> 00:59:52,600 We said a phone number was when I proposed. 1369 00:59:52,600 --> 00:59:55,760 Gosh, there's no phone number type. 1370 00:59:55,760 --> 00:59:58,800 So what do you want to go with? 1371 00:59:58,800 --> 01:00:00,122 AUDIENCE: [INAUDIBLE] 1372 01:00:00,122 --> 01:00:01,205 DAVID J. MALAN: I'm sorry? 1373 01:00:01,205 --> 01:00:02,871 AUDIENCE: [INAUDIBLE] varchar. 1374 01:00:02,871 --> 01:00:03,870 DAVID J. MALAN: Varchar. 1375 01:00:03,870 --> 01:00:04,369 OK. 1376 01:00:04,369 --> 01:00:08,340 So we can pick varchar and maybe use 10 or so for 10 digits in the US, 1377 01:00:08,340 --> 01:00:12,910 at least, though, maybe, like, 12 with the dashes, or 13 with the parentheses, 1378 01:00:12,910 --> 01:00:13,410 or-- 1379 01:00:13,410 --> 01:00:14,150 AUDIENCE: [INAUDIBLE] 1380 01:00:14,150 --> 01:00:14,650 DAVID J. MALAN: What's that? 1381 01:00:14,650 --> 01:00:15,906 AUDIENCE: [INAUDIBLE] 1382 01:00:15,906 --> 01:00:17,280 DAVID J. MALAN: I hear murmuring. 1383 01:00:17,280 --> 01:00:17,840 Sorry. 1384 01:00:17,840 --> 01:00:18,900 AUDIENCE: A small int. 1385 01:00:18,900 --> 01:00:20,880 DAVID J. MALAN: A small int and just treated 1386 01:00:20,880 --> 01:00:22,770 as a number, maybe that could work too. 1387 01:00:22,770 --> 01:00:24,000 Another alternative? 1388 01:00:24,000 --> 01:00:24,660 Anything else? 1389 01:00:24,660 --> 01:00:27,750 AUDIENCE: [INAUDIBLE] precision. 1390 01:00:27,750 --> 01:00:29,910 DAVID J. MALAN: Precision-- so a specific number. 1391 01:00:29,910 --> 01:00:34,015 AUDIENCE: [INAUDIBLE] make sure you have the correct number. 1392 01:00:34,015 --> 01:00:34,890 DAVID J. MALAN: Good. 1393 01:00:34,890 --> 01:00:35,970 So not a bad instinct. 1394 01:00:35,970 --> 01:00:39,314 But it turns out with the numeric data type, where you specify scale 1395 01:00:39,314 --> 01:00:41,730 as it's called-- the total number of digits in precision-- 1396 01:00:41,730 --> 01:00:44,820 that's generally meant for floating point values-- so 1397 01:00:44,820 --> 01:00:46,440 real numbers with decimal points. 1398 01:00:46,440 --> 01:00:49,650 Because even though you're specifying a max limit, 1399 01:00:49,650 --> 01:00:51,470 you don't require that many digits. 1400 01:00:51,470 --> 01:00:52,201 It's just a max. 1401 01:00:52,201 --> 01:00:52,700 Yeah? 1402 01:00:52,700 --> 01:00:56,510 AUDIENCE: I would think that giving it-- 1403 01:00:56,510 --> 01:00:59,960 doing some sort of chars would be a bad design because then you 1404 01:00:59,960 --> 01:01:03,250 can enter nonnumerical values. 1405 01:01:03,250 --> 01:01:04,490 DAVID J. MALAN: Yeah. 1406 01:01:04,490 --> 01:01:07,280 We can kind of find fault, I think, both of these ideas, though both of them 1407 01:01:07,280 --> 01:01:07,904 are reasonable. 1408 01:01:07,904 --> 01:01:10,220 If you allow for char or varchar, I could 1409 01:01:10,220 --> 01:01:13,639 type in like foo or bar or baz and not a number. 1410 01:01:13,639 --> 01:01:14,930 So maybe we should go with int. 1411 01:01:14,930 --> 01:01:16,970 But can someone think of a counter-example to why 1412 01:01:16,970 --> 01:01:17,990 you shouldn't use integer? 1413 01:01:17,990 --> 01:01:20,281 AUDIENCE: It's got a variable length to it, doesn't it? 1414 01:01:20,281 --> 01:01:21,927 So you'd get whatever number-- 1415 01:01:21,927 --> 01:01:23,260 DAVID J. MALAN: Variable length. 1416 01:01:23,260 --> 01:01:25,989 But if we actually do the math, maybe 65, 1417 01:01:25,989 --> 01:01:27,530 maybe-- there are enough bytes there. 1418 01:01:27,530 --> 01:01:30,405 We could use a big int, and that gives us a really long phone number. 1419 01:01:30,405 --> 01:01:32,130 So there's probably a reasonable max. 1420 01:01:32,130 --> 01:01:32,630 Yeah? 1421 01:01:32,630 --> 01:01:38,034 AUDIENCE: Let's say [INAUDIBLE] in between [INAUDIBLE] integer. 1422 01:01:38,034 --> 01:01:39,700 DAVID J. MALAN: Hyphens and parentheses. 1423 01:01:39,700 --> 01:01:41,658 We have to decide, do we want to support those? 1424 01:01:41,658 --> 01:01:45,290 And honestly, all of us are perhaps a little US-centric right now, most of us 1425 01:01:45,290 --> 01:01:45,790 here. 1426 01:01:45,790 --> 01:01:49,150 But when you make a local call in some zones, like, you type 0 first. 1427 01:01:49,150 --> 01:01:51,080 And some human might type 0 as their number. 1428 01:01:51,080 --> 01:01:53,410 But what's going to happen if you type 0 into an integer field? 1429 01:01:53,410 --> 01:01:54,640 AUDIENCE: You're going to ignore it. 1430 01:01:54,640 --> 01:01:56,120 DAVID J. MALAN: You're going to ignore it. 1431 01:01:56,120 --> 01:01:57,370 So now there's that corner case. 1432 01:01:57,370 --> 01:01:57,940 So dammit. 1433 01:01:57,940 --> 01:02:00,314 Like, there's no way to solve this problem it would seem. 1434 01:02:00,314 --> 01:02:01,630 So what's best? 1435 01:02:01,630 --> 01:02:04,150 We have to make a compromise and just accept 1436 01:02:04,150 --> 01:02:06,220 that we have to solve this with code. 1437 01:02:06,220 --> 01:02:07,810 AUDIENCE: Varchar. 1438 01:02:07,810 --> 01:02:08,810 DAVID J. MALAN: Varchar? 1439 01:02:08,810 --> 01:02:09,309 All right. 1440 01:02:09,309 --> 01:02:10,740 So varchars or char. 1441 01:02:10,740 --> 01:02:11,240 All right. 1442 01:02:11,240 --> 01:02:13,220 So maybe let's simplify the problem. 1443 01:02:13,220 --> 01:02:16,190 No one from outside the US can take freshman intramural sports. 1444 01:02:16,190 --> 01:02:17,330 That simplifies the world. 1445 01:02:17,330 --> 01:02:20,952 Because if we only support US phone numbers, now we can say 10 digits. 1446 01:02:20,952 --> 01:02:22,160 And if we say, you know what? 1447 01:02:22,160 --> 01:02:24,243 I don't care about the hyphens or the parentheses. 1448 01:02:24,243 --> 01:02:27,830 I can use code, JavaScript or Python, to throw away the syntax. 1449 01:02:27,830 --> 01:02:29,810 And I can just store 10 digits. 1450 01:02:29,810 --> 01:02:32,060 Maybe char 10 is sufficient-- 1451 01:02:32,060 --> 01:02:35,680 3 for the area code, then the rest of the number, the all seven digits 1452 01:02:35,680 --> 01:02:36,180 thereof. 1453 01:02:36,180 --> 01:02:37,805 But you could find fault with this too. 1454 01:02:37,805 --> 01:02:41,510 And we're really alienating that international population on campus. 1455 01:02:41,510 --> 01:02:43,812 But again, these are just non-obvious design decisions. 1456 01:02:43,812 --> 01:02:45,770 And so here we are, at the end of the semester. 1457 01:02:45,770 --> 01:02:47,060 We don't always have good answers. 1458 01:02:47,060 --> 01:02:48,680 And reasonable people will disagree. 1459 01:02:48,680 --> 01:02:52,490 But let me simplify our assumptions and just do US numbers, 10 digits, 1460 01:02:52,490 --> 01:02:56,270 and trust that I will use code in Python or some other language 1461 01:02:56,270 --> 01:02:59,400 to throw away the syntax, the punctuation and whatnot 1462 01:02:59,400 --> 01:03:01,190 of parentheses and hyphens. 1463 01:03:01,190 --> 01:03:04,234 And I'll make sure the human hasn't typed in any letters of the alphabet. 1464 01:03:04,234 --> 01:03:05,150 I can do that in code. 1465 01:03:05,150 --> 01:03:10,410 And you know we can do that even in JavaScript when a human submits a form. 1466 01:03:10,410 --> 01:03:12,860 We'll leave for the end, sports, what was it sports-- 1467 01:03:12,860 --> 01:03:15,485 AUDIENCE: Sports they want to do. 1468 01:03:15,485 --> 01:03:17,360 DAVID J. MALAN: Sports they might want to do. 1469 01:03:17,360 --> 01:03:20,240 So this is a good example of, you shouldn't really have spaces 1470 01:03:20,240 --> 01:03:25,010 in your field name, so the convention would be sports_they_ really or might 1471 01:03:25,010 --> 01:03:28,301 or let's just call it sports in this case. 1472 01:03:28,301 --> 01:03:29,300 We'll come back to that. 1473 01:03:29,300 --> 01:03:31,660 I think there were two other ideas we had. 1474 01:03:31,660 --> 01:03:32,620 Phone number. 1475 01:03:32,620 --> 01:03:33,460 AUDIENCE: Email. 1476 01:03:33,460 --> 01:03:33,980 DAVID J. MALAN: Email. 1477 01:03:33,980 --> 01:03:34,330 OK. 1478 01:03:34,330 --> 01:03:35,090 Email is a good one. 1479 01:03:35,090 --> 01:03:36,089 What should that one be? 1480 01:03:36,089 --> 01:03:38,480 There's no email type, unfortunately, even though there 1481 01:03:38,480 --> 01:03:41,490 is an HTML, an input type for email. 1482 01:03:41,490 --> 01:03:42,240 AUDIENCE: Varchar. 1483 01:03:42,240 --> 01:03:42,710 DAVID J. MALAN: What's that? 1484 01:03:42,710 --> 01:03:43,790 AUDIENCE: Varchar. 1485 01:03:43,790 --> 01:03:46,540 DAVID J. MALAN: Yeah, I feel like we probably need a varchar here. 1486 01:03:46,540 --> 01:03:50,480 But here, a little non-obvious, what is the longest email address in the world? 1487 01:03:50,480 --> 01:03:52,550 Maybe it's Nick's? 1488 01:03:52,550 --> 01:03:53,835 So I don't know. 1489 01:03:53,835 --> 01:03:56,210 But let's pick a reasonable upper bound that we can maybe 1490 01:03:56,210 --> 01:03:58,230 be comfortable with as a group. 1491 01:03:58,230 --> 01:04:00,380 And let's see-- anything else here? 1492 01:04:00,380 --> 01:04:00,880 No? 1493 01:04:00,880 --> 01:04:01,220 OK. 1494 01:04:01,220 --> 01:04:02,428 And was there one more field? 1495 01:04:02,428 --> 01:04:03,740 AUDIENCE: Graduation year. 1496 01:04:03,740 --> 01:04:04,650 DAVID J. MALAN: Oh, grad-- oh, age. 1497 01:04:04,650 --> 01:04:05,400 Let's go with age. 1498 01:04:05,400 --> 01:04:08,220 So age-- finally, something simple. 1499 01:04:08,220 --> 01:04:09,627 What you want this to be? 1500 01:04:09,627 --> 01:04:10,460 AUDIENCE: Small int. 1501 01:04:10,460 --> 01:04:13,070 DAVID J. MALAN: Small int, right. 1502 01:04:13,070 --> 01:04:18,584 We will not support people older than 65,535 years old. 1503 01:04:18,584 --> 01:04:20,625 Someone want to find fault with this idea though? 1504 01:04:20,625 --> 01:04:25,750 1505 01:04:25,750 --> 01:04:29,050 I would argue, there's no one right answer to any of these. 1506 01:04:29,050 --> 01:04:30,850 AUDIENCE: You need month. 1507 01:04:30,850 --> 01:04:32,230 DAVID J. MALAN: Month-- 1508 01:04:32,230 --> 01:04:33,680 oh, it depends. 1509 01:04:33,680 --> 01:04:34,397 Do we want month? 1510 01:04:34,397 --> 01:04:35,022 AUDIENCE: Yeah. 1511 01:04:35,022 --> 01:04:36,950 We might need [INAUDIBLE]. 1512 01:04:36,950 --> 01:04:41,780 AUDIENCE: --then you need the date, the year. 1513 01:04:41,780 --> 01:04:44,080 DAVID J. MALAN: So you're assuming we want birth date. 1514 01:04:44,080 --> 01:04:45,520 I think I've called it age. 1515 01:04:45,520 --> 01:04:46,920 So maybe that's the problem. 1516 01:04:46,920 --> 01:04:49,090 Like, if it's age, small int's fine. 1517 01:04:49,090 --> 01:04:51,146 Like, you can be 0 years old or 65,000 years old. 1518 01:04:51,146 --> 01:04:52,020 We have a good range. 1519 01:04:52,020 --> 01:04:54,284 AUDIENCE: All about the age, date of birth. 1520 01:04:54,284 --> 01:04:56,200 DAVID J. MALAN: If it's age, I think we're OK. 1521 01:04:56,200 --> 01:04:59,890 But I think you allude to a good point, which is, why would we maybe want 1522 01:04:59,890 --> 01:05:02,516 to store birth date and not age? 1523 01:05:02,516 --> 01:05:03,391 AUDIENCE: [INAUDIBLE] 1524 01:05:03,391 --> 01:05:04,265 DAVID J. MALAN: Yeah. 1525 01:05:04,265 --> 01:05:06,460 The damn thing's always changing, otherwise, right? 1526 01:05:06,460 --> 01:05:09,190 Like, I'm going to have to update my database tomorrow and then 1527 01:05:09,190 --> 01:05:12,250 the next day, let alone every hour or every minute just 1528 01:05:12,250 --> 01:05:14,200 because my users' ages are changing. 1529 01:05:14,200 --> 01:05:15,580 Like, that seems silly. 1530 01:05:15,580 --> 01:05:17,530 Let me, instead, fix a value-- 1531 01:05:17,530 --> 01:05:20,470 so do something like birth date. 1532 01:05:20,470 --> 01:05:23,020 Birth date-- maybe specify not an int. 1533 01:05:23,020 --> 01:05:25,780 But let's actually use the date field. 1534 01:05:25,780 --> 01:05:28,210 We could store time if we really care what time they 1535 01:05:28,210 --> 01:05:29,910 were born on a certain day. 1536 01:05:29,910 --> 01:05:34,300 But here, I can say date is going to be a little better because, now I 1537 01:05:34,300 --> 01:05:37,480 know in Python, JavaScript, even C, I can do a little bit of math. 1538 01:05:37,480 --> 01:05:40,520 And if I know they were born on such and such a day and month and year, 1539 01:05:40,520 --> 01:05:43,810 well, I'll just subtract that from the current day, month, and year 1540 01:05:43,810 --> 01:05:47,020 and figure out how many days or years old they are-- so a better design 1541 01:05:47,020 --> 01:05:48,620 decision there perhaps. 1542 01:05:48,620 --> 01:05:50,590 But we do have to standardize the format. 1543 01:05:50,590 --> 01:05:53,080 We can't just allow people from the US and Europe and Asia 1544 01:05:53,080 --> 01:05:54,780 to all kind of choose their own formats. 1545 01:05:54,780 --> 01:05:57,880 SQL standardizes this-- year, year, year, year dash 1546 01:05:57,880 --> 01:05:59,920 month, month dash day, day. 1547 01:05:59,920 --> 01:06:02,480 And that's the value of having these data types again. 1548 01:06:02,480 --> 01:06:02,980 All right. 1549 01:06:02,980 --> 01:06:05,277 So how about sports-- the last one? 1550 01:06:05,277 --> 01:06:06,770 AUDIENCE: Varchar. 1551 01:06:06,770 --> 01:06:08,530 DAVID J. MALAN: Varchar, all right. 1552 01:06:08,530 --> 01:06:12,242 What's the longest number of words in the sport? 1553 01:06:12,242 --> 01:06:14,950 AUDIENCE: I didn't say might want to do, so I could get up there, 1554 01:06:14,950 --> 01:06:17,780 but 255 probably makes the most sense. 1555 01:06:17,780 --> 01:06:20,610 DAVID J. MALAN: Maybe unless they're very athletic. 1556 01:06:20,610 --> 01:06:21,370 Yeah. 1557 01:06:21,370 --> 01:06:22,510 Here too I don't know. 1558 01:06:22,510 --> 01:06:24,910 But just for the sake of opening up possibilities, 1559 01:06:24,910 --> 01:06:27,797 when you think the human might be a little expository 1560 01:06:27,797 --> 01:06:30,130 and actually write a paragraph of all the sports they're 1561 01:06:30,130 --> 01:06:33,815 involved in or whatnot or even bigger than a paragraph, text is even bigger. 1562 01:06:33,815 --> 01:06:35,440 Sports, this probably isn't compelling. 1563 01:06:35,440 --> 01:06:36,790 And I'll change it back to varchar. 1564 01:06:36,790 --> 01:06:38,830 But if someone is typing in their college essay 1565 01:06:38,830 --> 01:06:41,640 into the Common Application, or if you're 1566 01:06:41,640 --> 01:06:44,890 asking people to paste their resumes, or the like, you might want to use text. 1567 01:06:44,890 --> 01:06:47,050 Because I have no idea how many words someone's going to have. 1568 01:06:47,050 --> 01:06:49,030 Text allows you to store even more data. 1569 01:06:49,030 --> 01:06:50,710 But it stores it a little differently. 1570 01:06:50,710 --> 01:06:53,770 It tends to store it not in the column, but using pointers. 1571 01:06:53,770 --> 01:06:56,380 If you recall from a few weeks back, it uses the equivalent 1572 01:06:56,380 --> 01:06:59,710 of that to store it over there, which takes a little more time to get to. 1573 01:06:59,710 --> 01:07:01,420 So again, there's just a trade-off here. 1574 01:07:01,420 --> 01:07:04,357 So we could do varchar 255-- 1575 01:07:04,357 --> 01:07:05,440 makes me a little nervous. 1576 01:07:05,440 --> 01:07:08,130 So I'm going to go with another common value. 1577 01:07:08,130 --> 01:07:11,455 1024, it's a power of 2, but there's no one right answer here. 1578 01:07:11,455 --> 01:07:14,080 But these are the non-obvious design decisions we have to make. 1579 01:07:14,080 --> 01:07:16,650 If I didn't make any mistakes here, I'm going 1580 01:07:16,650 --> 01:07:18,820 to go ahead and click Create, and whoo-- 1581 01:07:18,820 --> 01:07:20,360 table has been created. 1582 01:07:20,360 --> 01:07:23,660 You can even see now all of the data that's been created there. 1583 01:07:23,660 --> 01:07:28,787 And voila, if I go back to froshims, and I go back to the registrants table, 1584 01:07:28,787 --> 01:07:31,120 not only can I browse it-- there's nothing in there yet. 1585 01:07:31,120 --> 01:07:34,450 I can look at the structure and actually see all of those same values 1586 01:07:34,450 --> 01:07:36,070 and edit some of them like renaming. 1587 01:07:36,070 --> 01:07:37,900 But you can't completely mutilate it. 1588 01:07:37,900 --> 01:07:40,750 You might have to start over if you make too many changes. 1589 01:07:40,750 --> 01:07:41,500 All right. 1590 01:07:41,500 --> 01:07:42,790 Any questions then about this? 1591 01:07:42,790 --> 01:07:43,290 Yeah? 1592 01:07:43,290 --> 01:07:46,706 AUDIENCE: Why do you use 1024 instead of 1023? 1593 01:07:46,706 --> 01:07:49,060 DAVID J. MALAN: Oh, why do I use 1024 instead of 1023. 1594 01:07:49,060 --> 01:07:52,990 Typically, when you choose a ma-- oh, convention. 1595 01:07:52,990 --> 01:07:54,330 I can't justify this. 1596 01:07:54,330 --> 01:07:58,060 255 was the max because I think one of the bites was reserved for some value, 1597 01:07:58,060 --> 01:07:59,310 historically. 1598 01:07:59,310 --> 01:08:00,700 But that's not really the case. 1599 01:08:00,700 --> 01:08:03,580 That bound has been lifted, and no good reason. 1600 01:08:03,580 --> 01:08:07,130 AUDIENCE: You could have used 255. 1601 01:08:07,130 --> 01:08:08,420 DAVID J. MALAN: Nowadays, yes. 1602 01:08:08,420 --> 01:08:11,134 Years ago, 255 was an actual limit. 1603 01:08:11,134 --> 01:08:12,050 And so it got adopted. 1604 01:08:12,050 --> 01:08:12,800 Now, I don't know. 1605 01:08:12,800 --> 01:08:14,300 I just pick powers of 2 often. 1606 01:08:14,300 --> 01:08:18,770 And then at my next go-to would be 2048, 4096, and so forth, just because. 1607 01:08:18,770 --> 01:08:19,660 Yeah? 1608 01:08:19,660 --> 01:08:22,385 AUDIENCE: What's the difference between varchar and char? 1609 01:08:22,385 --> 01:08:22,830 DAVID J. MALAN: Varchar-- 1610 01:08:22,830 --> 01:08:24,856 What's the difference between varchar and char? 1611 01:08:24,856 --> 01:08:30,319 Char uses a fixed number of bytes no matter how many of them you are using, 1612 01:08:30,319 --> 01:08:33,020 the advantage of which is your columns, conceptually, 1613 01:08:33,020 --> 01:08:36,529 are perfectly straight on both the left edge and the right edge, which 1614 01:08:36,529 --> 01:08:39,890 means you have random access because every cell is some fixed 1615 01:08:39,890 --> 01:08:41,689 number of bytes from the rest. 1616 01:08:41,689 --> 01:08:44,729 Varchar user a ragged array, as it's called, 1617 01:08:44,729 --> 01:08:48,540 where one side, the right-hand side is shorter or longer in different cells. 1618 01:08:48,540 --> 01:08:51,050 So there's only a maximum length on each of those cells. 1619 01:08:51,050 --> 01:08:53,029 But searching it can be slower as a result 1620 01:08:53,029 --> 01:08:55,100 because you can't just jump to cell to cell. 1621 01:08:55,100 --> 01:08:57,765 You have to follow the lengths of those things. 1622 01:08:57,765 --> 01:08:58,640 That's the trade-off. 1623 01:08:58,640 --> 01:08:59,242 Yeah? 1624 01:08:59,242 --> 01:09:02,658 AUDIENCE: If you [INAUDIBLE] education, how do you add it? 1625 01:09:02,658 --> 01:09:05,930 DAVID J. MALAN: Oh, if I wanted to add an education field now and modify 1626 01:09:05,930 --> 01:09:06,740 the table-- 1627 01:09:06,740 --> 01:09:09,979 if you realize too late, oh, darn, like, I need to actually add something 1628 01:09:09,979 --> 01:09:13,710 to this, in the GUI tool, we can add 1 fields to the end of the table, 1629 01:09:13,710 --> 01:09:14,210 literally. 1630 01:09:14,210 --> 01:09:15,210 So let's do that. 1631 01:09:15,210 --> 01:09:16,460 Let me go ahead and click Go. 1632 01:09:16,460 --> 01:09:19,069 I'll be prompted with a similar form, but smaller. 1633 01:09:19,069 --> 01:09:21,689 I can go ahead and type in something like education. 1634 01:09:21,689 --> 01:09:24,680 Let me propose this is varchar, maybe 255, 1635 01:09:24,680 --> 01:09:26,300 though we could have that debate too. 1636 01:09:26,300 --> 01:09:29,180 I'm going to go ahead and say add fields. 1637 01:09:29,180 --> 01:09:32,162 And now notice that the table has been altered successfully. 1638 01:09:32,162 --> 01:09:34,370 It actually, for whatever reason, it's not showing me 1639 01:09:34,370 --> 01:09:35,911 the code for that particular command. 1640 01:09:35,911 --> 01:09:39,104 But there is literally an ALTER command in SQL 1641 01:09:39,104 --> 01:09:40,520 that would allow you to change it. 1642 01:09:40,520 --> 01:09:42,740 And if I go back to the structure now, you'll 1643 01:09:42,740 --> 01:09:46,590 see that I have another column called education shown ironically 1644 01:09:46,590 --> 01:09:47,330 here as a row. 1645 01:09:47,330 --> 01:09:47,913 Yeah? 1646 01:09:47,913 --> 01:09:50,520 AUDIENCE: Does the order of the columns matter? 1647 01:09:50,520 --> 01:09:51,020 DAVID J. MALAN: Good question. 1648 01:09:51,020 --> 01:09:52,609 Does the order of the columns matter? 1649 01:09:52,609 --> 01:09:54,200 Fundamentally, no. 1650 01:09:54,200 --> 01:09:56,780 By convention, you would typically put the ID first. 1651 01:09:56,780 --> 01:10:00,110 And then I, personally, by design, put the most important fields 1652 01:10:00,110 --> 01:10:02,000 next like name feels like the right choice, 1653 01:10:02,000 --> 01:10:04,250 maybe email feels like the right choice, though, I clearly thought 1654 01:10:04,250 --> 01:10:05,870 of it a little too late this time. 1655 01:10:05,870 --> 01:10:09,890 It's not easy in SQLite to reorder things, but in other databases you can. 1656 01:10:09,890 --> 01:10:12,200 So there, it's more of a human convention. 1657 01:10:12,200 --> 01:10:12,700 Yeah? 1658 01:10:12,700 --> 01:10:16,082 AUDIENCE: Is it convention to have one single primary key, 1659 01:10:16,082 --> 01:10:17,210 or can you have multiple? 1660 01:10:17,210 --> 01:10:17,720 DAVID J. MALAN: Good question. 1661 01:10:17,720 --> 01:10:20,886 Is it a convention to have one single primary key, or can you have multiple? 1662 01:10:20,886 --> 01:10:22,910 By definition, you can only have one. 1663 01:10:22,910 --> 01:10:26,670 But that primary key can span multiple columns. 1664 01:10:26,670 --> 01:10:29,480 So we haven't seen a use case for this yet. 1665 01:10:29,480 --> 01:10:31,940 But there are scenarios in which you would want to say, 1666 01:10:31,940 --> 01:10:35,580 I want to guarantee that these two columns together are unique, 1667 01:10:35,580 --> 01:10:37,520 but not each individual one unique. 1668 01:10:37,520 --> 01:10:39,830 But we won't encounter that just yet. 1669 01:10:39,830 --> 01:10:41,330 Other questions, yeah? 1670 01:10:41,330 --> 01:10:43,250 AUDIENCE: What if you had added education 1671 01:10:43,250 --> 01:10:46,459 after you had started the database, and that was a not null field? 1672 01:10:46,459 --> 01:10:48,000 DAVID J. MALAN: Really good question. 1673 01:10:48,000 --> 01:10:52,880 What if you had added education after you already had real data in there, 1674 01:10:52,880 --> 01:10:55,040 but you specified not null, which is problematic. 1675 01:10:55,040 --> 01:10:57,860 Because what is the educational backgrounds of the previous people? 1676 01:10:57,860 --> 01:11:00,276 Typically, what the database would do is either reject it, 1677 01:11:00,276 --> 01:11:02,660 or it would just put the "empty string," quote unquote. 1678 01:11:02,660 --> 01:11:05,368 So it's not technically null, but there's nothing actually there. 1679 01:11:05,368 --> 01:11:07,250 It's just a string of length 0. 1680 01:11:07,250 --> 01:11:08,610 Really good question. 1681 01:11:08,610 --> 01:11:09,110 All right. 1682 01:11:09,110 --> 01:11:12,080 So what can we now do that's a little more powerful about this? 1683 01:11:12,080 --> 01:11:15,032 Well, let me go ahead and quickly insert some data here. 1684 01:11:15,032 --> 01:11:16,490 I'm going to keep most of it blank. 1685 01:11:16,490 --> 01:11:16,970 But you know what? 1686 01:11:16,970 --> 01:11:18,511 I'm not even going to bother with ID. 1687 01:11:18,511 --> 01:11:20,520 Brian you're back on the team. 1688 01:11:20,520 --> 01:11:22,820 Let's go ahead and insert Brian. 1689 01:11:22,820 --> 01:11:25,040 Let's go ahead now into registrants again. 1690 01:11:25,040 --> 01:11:28,250 Let's go ahead and add Veronica again. 1691 01:11:28,250 --> 01:11:30,380 So I'm just inserting a few rows manually. 1692 01:11:30,380 --> 01:11:33,250 And again, notice it's executing all of this for me 1693 01:11:33,250 --> 01:11:35,000 without me having to bother typing it out. 1694 01:11:35,000 --> 01:11:36,200 But I absolutely could. 1695 01:11:36,200 --> 01:11:38,450 In fact, just for good measure, let's do one manually. 1696 01:11:38,450 --> 01:11:41,780 If I click the SQL tab, notice that I get a default suggestion here. 1697 01:11:41,780 --> 01:11:44,210 That is the syntax with which you can select everything. 1698 01:11:44,210 --> 01:11:47,810 Or I can just type INSERT into registrants. 1699 01:11:47,810 --> 01:11:49,680 But now, if I only want to insert a name, 1700 01:11:49,680 --> 01:11:51,440 I don't have to do all of the columns. 1701 01:11:51,440 --> 01:11:54,650 I can just say go ahead and insert here, Erin, for instance, 1702 01:11:54,650 --> 01:11:57,650 semicolon, zoom out and click Go. 1703 01:11:57,650 --> 01:11:58,970 That seemed to work. 1704 01:11:58,970 --> 01:12:01,980 If I go back to browse now, Erin is in there as well. 1705 01:12:01,980 --> 01:12:03,230 But you'll see the difference. 1706 01:12:03,230 --> 01:12:06,230 The query that's being generated automatically by the GUI 1707 01:12:06,230 --> 01:12:11,070 was lazily just inserting quote unquote, the so-called empty string of length 0. 1708 01:12:11,070 --> 01:12:13,880 I, by omitting even mention of those columns, 1709 01:12:13,880 --> 01:12:15,470 was deliberately inserting null. 1710 01:12:15,470 --> 01:12:17,761 So frankly, my database is getting a little messy here. 1711 01:12:17,761 --> 01:12:21,385 So you generally don't use phpLiteAdmin or a GUI to insert data. 1712 01:12:21,385 --> 01:12:23,510 You might use it to conveniently create your tables 1713 01:12:23,510 --> 01:12:25,252 and get your application ready. 1714 01:12:25,252 --> 01:12:27,210 But then you're going to write code ultimately. 1715 01:12:27,210 --> 01:12:28,490 And that's the direction we're going. 1716 01:12:28,490 --> 01:12:30,200 And I'm going to go ahead and insert one more person. 1717 01:12:30,200 --> 01:12:32,900 Oh, I forgot I'm not on the team at all because we started over. 1718 01:12:32,900 --> 01:12:36,020 So let me put myself back on the team, David. 1719 01:12:36,020 --> 01:12:38,780 And let me go ahead and click INSERT, go back to registrants, 1720 01:12:38,780 --> 01:12:40,070 and now you'll see there are four of us. 1721 01:12:40,070 --> 01:12:42,770 My ID changed because they've been inserted in different orders. 1722 01:12:42,770 --> 01:12:46,727 But notice all of the auto incrementing has been happening magically for me. 1723 01:12:46,727 --> 01:12:49,310 And that's useful because I don't have to even think about it. 1724 01:12:49,310 --> 01:12:51,360 And who cares what my ID is. 1725 01:12:51,360 --> 01:12:55,470 I just need to have, in many cases, in a database, a unique ID. 1726 01:12:55,470 --> 01:12:58,850 So now let's actually write a little bit of code, right? 1727 01:12:58,850 --> 01:13:00,807 Thus far we haven't done anything useful. 1728 01:13:00,807 --> 01:13:03,140 We've shown you this black and white window in which you 1729 01:13:03,140 --> 01:13:05,330 can select in certain update data. 1730 01:13:05,330 --> 01:13:08,240 But that doesn't really solve any problems we know about yet. 1731 01:13:08,240 --> 01:13:10,000 We have this graphical web-based interface 1732 01:13:10,000 --> 01:13:13,217 via which you can create tables and add data, but who cares? 1733 01:13:13,217 --> 01:13:14,800 We're trying to solve actual problems. 1734 01:13:14,800 --> 01:13:16,591 And the problems of late have been to build 1735 01:13:16,591 --> 01:13:19,450 software that would solve any number of human problems 1736 01:13:19,450 --> 01:13:21,580 like serving users and showing the results 1737 01:13:21,580 --> 01:13:23,170 or finding similarities in documents. 1738 01:13:23,170 --> 01:13:26,230 So suppose the problem at hand now is to actually build 1739 01:13:26,230 --> 01:13:30,220 something like the froshims website and let students register and then see 1740 01:13:30,220 --> 01:13:31,190 who is registered. 1741 01:13:31,190 --> 01:13:34,930 Well, back in my day, I fairly lazily, for lack of technical know-how, 1742 01:13:34,930 --> 01:13:38,830 just emailed the registrations to the proctor or the RA who was 1743 01:13:38,830 --> 01:13:40,840 managing the intramural sports program. 1744 01:13:40,840 --> 01:13:43,750 But I-- they later started putting it in CSV files. 1745 01:13:43,750 --> 01:13:48,460 Suppose now, version 3, 20 years later, I want to store in an actual database. 1746 01:13:48,460 --> 01:13:51,310 How can I actually do that and then see the results? 1747 01:13:51,310 --> 01:13:54,820 Well, let me go into the IDE again and open up, 1748 01:13:54,820 --> 01:14:00,640 for instance, a New File that I'll go ahead and call lecture.py. 1749 01:14:00,640 --> 01:14:04,240 And suppose I just want to write a simple Python program via which 1750 01:14:04,240 --> 01:14:06,340 to select data from a database. 1751 01:14:06,340 --> 01:14:08,240 So it turns out I can do a few things here. 1752 01:14:08,240 --> 01:14:14,110 First of all, let me go ahead and do, let's say, from CS50, 1753 01:14:14,110 --> 01:14:17,470 I previously have done things like import get_string and get_int 1754 01:14:17,470 --> 01:14:18,370 and so forth. 1755 01:14:18,370 --> 01:14:22,390 It turns out that the CS50 library for Python also supports SQL. 1756 01:14:22,390 --> 01:14:24,430 And it's going to give us a function called 1757 01:14:24,430 --> 01:14:29,960 EXECUTE that will let you execute any SQL command, but in Python code. 1758 01:14:29,960 --> 01:14:33,370 So instead of pulling up SQLite3 via my own hands 1759 01:14:33,370 --> 01:14:36,320 or going to a graphical user interface phpLiteAdmin, 1760 01:14:36,320 --> 01:14:40,300 I can write code that talks directly to froshims.db, 1761 01:14:40,300 --> 01:14:44,540 and eliminate all of those tools altogether, and just now write code. 1762 01:14:44,540 --> 01:14:45,400 So how do I do this? 1763 01:14:45,400 --> 01:14:47,691 I'm going to declare a variable called db for Database. 1764 01:14:47,691 --> 01:14:49,430 So I could call it anything I want. 1765 01:14:49,430 --> 01:14:51,670 And I'm going to go ahead and call this SQL function. 1766 01:14:51,670 --> 01:14:55,000 And I'm going to pass in a somewhat funky looking string as an argument, 1767 01:14:55,000 --> 01:14:57,100 but it's a standard convention, to say what 1768 01:14:57,100 --> 01:14:59,120 database technology do you want to use? 1769 01:14:59,120 --> 01:15:02,620 Then you do colon, slash, slash, slash-- so it's three slashes, not 1770 01:15:02,620 --> 01:15:04,330 the usual two in a URL. 1771 01:15:04,330 --> 01:15:09,940 And I'm going to specify froshims.db. 1772 01:15:09,940 --> 01:15:12,250 This now will give me a Python variable called 1773 01:15:12,250 --> 01:15:16,570 database that is kind of like a portal, if you will, into that database 1774 01:15:16,570 --> 01:15:20,650 file that I can send SELECTs, and INSERTs, and DELETEs, and UPDATEs to. 1775 01:15:20,650 --> 01:15:21,670 How do I do this? 1776 01:15:21,670 --> 01:15:23,980 Well, if at the end of the day, I want to execute 1777 01:15:23,980 --> 01:15:30,050 the equivalent of SELECT star from registrants, how do I do that? 1778 01:15:30,050 --> 01:15:32,020 Well, I'm just in a text editor, right? 1779 01:15:32,020 --> 01:15:33,190 This is CS50 IDE. 1780 01:15:33,190 --> 01:15:34,390 I'm just typing text. 1781 01:15:34,390 --> 01:15:38,200 Moreover, I'm typing text in a Python file, and this is not Python. 1782 01:15:38,200 --> 01:15:42,280 And indeed, the IDE has this little red x saying, mm-mm, can't do this. 1783 01:15:42,280 --> 01:15:47,290 But I could pass SQL code as an input to a Python function 1784 01:15:47,290 --> 01:15:49,360 and let that function talk to the database. 1785 01:15:49,360 --> 01:15:52,360 And indeed, that's what we're going to get here from CS50's library. 1786 01:15:52,360 --> 01:15:53,350 I'm going to go ahead and do this. 1787 01:15:53,350 --> 01:15:54,010 You know what? 1788 01:15:54,010 --> 01:16:00,070 Access the database and EXECUTE the following SQL code, 1789 01:16:00,070 --> 01:16:03,770 quote unquote, "that" close parentheses. 1790 01:16:03,770 --> 01:16:07,705 Now, what is select return by convention? 1791 01:16:07,705 --> 01:16:09,200 What should it return? 1792 01:16:09,200 --> 01:16:14,020 Well, in SQLite3, we just saw a pretty text-based table 1793 01:16:14,020 --> 01:16:17,170 with lines and slashes that looked like a table, but was just text. 1794 01:16:17,170 --> 01:16:20,939 phpLiteAdmin, we actually saw HTML tables when I browsed the database. 1795 01:16:20,939 --> 01:16:23,230 And I proposed verbally, a bit ago that, you know what? 1796 01:16:23,230 --> 01:16:26,620 If I were to get back all of this data in code, what data type 1797 01:16:26,620 --> 01:16:27,696 would I like it to me as? 1798 01:16:27,696 --> 01:16:30,818 1799 01:16:30,818 --> 01:16:31,990 Rows. 1800 01:16:31,990 --> 01:16:33,610 I want rows from a table. 1801 01:16:33,610 --> 01:16:35,560 Show me all the students who've registered. 1802 01:16:35,560 --> 01:16:37,600 What data structure in Python seems apt? 1803 01:16:37,600 --> 01:16:38,286 AUDIENCE: List. 1804 01:16:38,286 --> 01:16:39,994 DAVID J. MALAN: Yeah, just a list, right? 1805 01:16:39,994 --> 01:16:42,070 A list that's ordered from first row to last row. 1806 01:16:42,070 --> 01:16:45,412 So we'll call that a list or an array, back in the day of C. So you know what? 1807 01:16:45,412 --> 01:16:46,870 I'm going to assume that's correct. 1808 01:16:46,870 --> 01:16:49,495 And if I read the documentation, I would see that it's correct. 1809 01:16:49,495 --> 01:16:53,290 CS50's EXECUTE function, if you select, returns to you a list of rows. 1810 01:16:53,290 --> 01:16:55,520 It might have 0 rows if there are no matches. 1811 01:16:55,520 --> 01:16:58,500 But it might have 1,000 rows if there are lots of matches. 1812 01:16:58,500 --> 01:17:01,610 I'm going to store those results, wherever they are, in my rows array. 1813 01:17:01,610 --> 01:17:05,680 Now, suppose I want to print out who has registered in my database 1814 01:17:05,680 --> 01:17:07,929 from whatever froshims website exists. 1815 01:17:07,929 --> 01:17:09,970 I'm assuming students have registered on the web. 1816 01:17:09,970 --> 01:17:12,700 Now I'm just the proctor or the RA who's actually now 1817 01:17:12,700 --> 01:17:15,110 trying to manipulate the data and do something with it. 1818 01:17:15,110 --> 01:17:15,940 So what can I do? 1819 01:17:15,940 --> 01:17:18,700 Well, for row in rows, what do I want to do? 1820 01:17:18,700 --> 01:17:23,290 Let me go ahead and just print out that so-and-so registered. 1821 01:17:23,290 --> 01:17:26,150 So so-and-so registered. 1822 01:17:26,150 --> 01:17:27,820 Well, how do I plug in so-and-so? 1823 01:17:27,820 --> 01:17:29,710 Well, there are a few ways to do this. 1824 01:17:29,710 --> 01:17:30,830 And let's see. 1825 01:17:30,830 --> 01:17:34,600 First of all, I could use my placeholder syntax for print. 1826 01:17:34,600 --> 01:17:36,340 And then, I want to print out the row. 1827 01:17:36,340 --> 01:17:38,880 But what do I want from that row? 1828 01:17:38,880 --> 01:17:42,483 What columns are in any row in this database? 1829 01:17:42,483 --> 01:17:44,850 AUDIENCE: [INAUDIBLE] 1830 01:17:44,850 --> 01:17:47,481 DAVID J. MALAN: ID and name and dorm and phone and sports. 1831 01:17:47,481 --> 01:17:50,730 Well, it turns out, those are going to be handed to you as Python dictionaries 1832 01:17:50,730 --> 01:17:52,090 or dict structures. 1833 01:17:52,090 --> 01:17:54,780 So I can just say row quote unquote, 'name' here. 1834 01:17:54,780 --> 01:17:58,390 And I'll use single quotes just to make more clear what's going on here. 1835 01:17:58,390 --> 01:17:59,940 And then I need to make one fix. 1836 01:17:59,940 --> 01:18:00,660 How do I-- 1837 01:18:00,660 --> 01:18:01,500 AUDIENCE: F. 1838 01:18:01,500 --> 01:18:03,150 DAVID J. MALAN: --F for Format string. 1839 01:18:03,150 --> 01:18:04,380 So it looks a little cryptic. 1840 01:18:04,380 --> 01:18:05,880 But this is just Python stuff now. 1841 01:18:05,880 --> 01:18:07,290 The only thing that's new is SQL. 1842 01:18:07,290 --> 01:18:10,520 But if we stipulate that SQL, when using this execute function 1843 01:18:10,520 --> 01:18:14,340 is just going to hand you all a list of rows, each of which 1844 01:18:14,340 --> 01:18:17,190 is a dictionary so that you can get at this column or this column-- 1845 01:18:17,190 --> 01:18:18,690 ID, or name, or dorm-- 1846 01:18:18,690 --> 01:18:21,480 this would seem to be now a nice convergence of this week-- 1847 01:18:21,480 --> 01:18:23,280 now, with the past couple of weeks. 1848 01:18:23,280 --> 01:18:24,750 So let me go ahead and save this. 1849 01:18:24,750 --> 01:18:28,110 Let me go ahead and View my Console, so I have a Terminal window. 1850 01:18:28,110 --> 01:18:31,392 And let me go ahead and run Python of lecture.py. 1851 01:18:31,392 --> 01:18:33,930 And in just a moment, if I cross my fingers, 1852 01:18:33,930 --> 01:18:37,026 I should hopefully see who has registered. 1853 01:18:37,026 --> 01:18:38,340 Amazing. 1854 01:18:38,340 --> 01:18:39,600 I've seen who's registered. 1855 01:18:39,600 --> 01:18:42,630 Now there's one line of output that I didn't expect, which is just this one. 1856 01:18:42,630 --> 01:18:44,838 This is the library just being pedagogically helpful. 1857 01:18:44,838 --> 01:18:47,940 It's showing me every command that I sent to the database. 1858 01:18:47,940 --> 01:18:50,190 But you'll see that so-and-so has registered. 1859 01:18:50,190 --> 01:18:51,490 So this is kind of interesting. 1860 01:18:51,490 --> 01:18:53,031 It's kind of a stupid program, right? 1861 01:18:53,031 --> 01:18:55,410 Because most proctors aren't going to be hacking 1862 01:18:55,410 --> 01:18:58,039 froshims by using a terminal window and running Python scripts. 1863 01:18:58,039 --> 01:19:00,330 They're probably going to want to do this by a web page 1864 01:19:00,330 --> 01:19:01,920 and actually see who is registered. 1865 01:19:01,920 --> 01:19:04,770 But if we have the ability in Python code to do this, like, 1866 01:19:04,770 --> 01:19:08,910 iteration, what could I do instead of just printing to the screen? 1867 01:19:08,910 --> 01:19:12,630 What could I print out per last week and per the past problems set? 1868 01:19:12,630 --> 01:19:14,070 I could print out HTML, right? 1869 01:19:14,070 --> 01:19:16,069 Like, each of the students who register, kind of 1870 01:19:16,069 --> 01:19:18,240 feels like an opportunity for an unordered list 1871 01:19:18,240 --> 01:19:20,610 or an ordered list or a table or whatever. 1872 01:19:20,610 --> 01:19:22,080 You can now generate HTML. 1873 01:19:22,080 --> 01:19:23,290 So let me do this. 1874 01:19:23,290 --> 01:19:27,270 Let me actually go into an example I've made in advance. 1875 01:19:27,270 --> 01:19:30,330 What if I went ahead and opened up layout.html. 1876 01:19:30,330 --> 01:19:33,210 Here is a simple layout for a web application using Flask. 1877 01:19:33,210 --> 01:19:35,430 For those unfamiliar, this is mostly HTML 1878 01:19:35,430 --> 01:19:39,750 plus a technology called Jinja, which is a web-based technology for generating 1879 01:19:39,750 --> 01:19:40,980 websites dynamically. 1880 01:19:40,980 --> 01:19:43,770 The body of this page is clearly what's of interest. 1881 01:19:43,770 --> 01:19:44,550 And you know what? 1882 01:19:44,550 --> 01:19:47,320 I bet I could do some logic right in there. 1883 01:19:47,320 --> 01:19:48,780 So let me go ahead and do this. 1884 01:19:48,780 --> 01:19:52,960 Let me go ahead and Create, let's say, a New File. 1885 01:19:52,960 --> 01:19:55,871 Let's call this application.py. 1886 01:19:55,871 --> 01:19:57,870 I'm going to go ahead and, just for time's sake, 1887 01:19:57,870 --> 01:20:01,290 do a little bit of copy-paste to save myself some keystrokes. 1888 01:20:01,290 --> 01:20:03,510 So here's a very simple web app. 1889 01:20:03,510 --> 01:20:05,100 And this is going to be my to-do. 1890 01:20:05,100 --> 01:20:08,970 So if I go in here to templates, let me open up index.html-- 1891 01:20:08,970 --> 01:20:11,190 and let me go head into here. 1892 01:20:11,190 --> 01:20:12,760 So long story short-- 1893 01:20:12,760 --> 01:20:14,800 here's where we're going with this. 1894 01:20:14,800 --> 01:20:19,150 What if I instead generate an unordered list using code from last week, 1895 01:20:19,150 --> 01:20:21,930 but I use my for loop here inside of my web app 1896 01:20:21,930 --> 01:20:25,430 instead of actually just with a simple lecture.py file. 1897 01:20:25,430 --> 01:20:27,690 Well, recall that I could do something like this. 1898 01:20:27,690 --> 01:20:29,340 I can have a Jinja loop. 1899 01:20:29,340 --> 01:20:32,640 So I could say something like for row in rows. 1900 01:20:32,640 --> 01:20:35,790 And then, down here, I can preemptively say something 1901 01:20:35,790 --> 01:20:40,140 like andfor, which is our weird syntax from last week. 1902 01:20:40,140 --> 01:20:42,570 And then, in here, I can just do a list item. 1903 01:20:42,570 --> 01:20:47,260 And then, if I want to show who registered, what do I type here? 1904 01:20:47,260 --> 01:20:51,805 Something in between these curly braces if each row represents a registrant? 1905 01:20:51,805 --> 01:20:53,200 AUDIENCE: Row. 1906 01:20:53,200 --> 01:20:55,904 DAVID J. MALAN: Row name registered. 1907 01:20:55,904 --> 01:20:57,320 I can just do something like this. 1908 01:20:57,320 --> 01:21:00,130 So same idea, I'm just wrapping it with a little bit of HTML. 1909 01:21:00,130 --> 01:21:02,819 Now let me go to my application.py file because I'm 1910 01:21:02,819 --> 01:21:04,360 going to have to fill in some blanks. 1911 01:21:04,360 --> 01:21:05,970 And let me see, how can I do this? 1912 01:21:05,970 --> 01:21:08,860 Well, ultimately, I want to return the result of rendering 1913 01:21:08,860 --> 01:21:11,350 a template called index.html. 1914 01:21:11,350 --> 01:21:13,710 For families unfamiliar, this is just a line of code 1915 01:21:13,710 --> 01:21:16,180 that says go show that file to the user. 1916 01:21:16,180 --> 01:21:18,760 But I don't want to just show them the file as-is. 1917 01:21:18,760 --> 01:21:20,290 But let's at least get this set up. 1918 01:21:20,290 --> 01:21:25,340 I'm worried that it might not work yet because I need to get the actual data. 1919 01:21:25,340 --> 01:21:27,510 So how can I get all of the rows for my registrants? 1920 01:21:27,510 --> 01:21:31,390 Well, I can do rows gets db.execute. 1921 01:21:31,390 --> 01:21:36,470 And I can go ahead and select star from registrants and store that in there. 1922 01:21:36,470 --> 01:21:39,989 This file is called froshims.db that I created earlier. 1923 01:21:39,989 --> 01:21:42,280 And everything else is just Flask stuff from last week. 1924 01:21:42,280 --> 01:21:43,190 Nothing else is new. 1925 01:21:43,190 --> 01:21:45,880 The only thing that's new is this line here, 1926 01:21:45,880 --> 01:21:49,690 this line here, and now this line here where 1927 01:21:49,690 --> 01:21:53,200 I'm using SQL inside of a Python call by passing 1928 01:21:53,200 --> 01:21:55,450 it is an argument to a function called EXECUTE. 1929 01:21:55,450 --> 01:21:58,310 How do I pass the rows to index.html? 1930 01:21:58,310 --> 01:21:59,290 AUDIENCE: [INAUDIBLE] 1931 01:21:59,290 --> 01:22:02,070 DAVID J. MALAN: Yeah, like rows equals rows is the convention we've adopted. 1932 01:22:02,070 --> 01:22:03,900 You call it anything you want, x equals y, 1933 01:22:03,900 --> 01:22:05,650 but this is a little more straightforward. 1934 01:22:05,650 --> 01:22:08,858 So this is saying, hey, database, get me all of the rows from my registrants, 1935 01:22:08,858 --> 01:22:12,390 and then render the template index.html, and pass in these rows. 1936 01:22:12,390 --> 01:22:19,130 And now, if I hold my breath and run flask run, no syntax errors. 1937 01:22:19,130 --> 01:22:21,590 If I go ahead and visit this here and open-- 1938 01:22:21,590 --> 01:22:25,490 dammit-- the tab, I see an internal server error. 1939 01:22:25,490 --> 01:22:28,080 So teachable moment, families. 1940 01:22:28,080 --> 01:22:31,590 Let's go back into that browser window here and see what happened. 1941 01:22:31,590 --> 01:22:32,090 OK. 1942 01:22:32,090 --> 01:22:35,000 Template syntax error-- so pretty stupid mistake. 1943 01:22:35,000 --> 01:22:38,726 It looks like I expected a square bracket instead of a curly brace. 1944 01:22:38,726 --> 01:22:39,350 That's fixable. 1945 01:22:39,350 --> 01:22:41,190 Let me go into index.html. 1946 01:22:41,190 --> 01:22:43,070 And oh, I didn't finish my thought. 1947 01:22:43,070 --> 01:22:46,130 So that's some of the frustrations of programming for those 1948 01:22:46,130 --> 01:22:47,760 who are seeing this for the first time. 1949 01:22:47,760 --> 01:22:48,950 Let me save that. 1950 01:22:48,950 --> 01:22:53,000 Let me go back to the browser here, and we'll just reload. 1951 01:22:53,000 --> 01:22:56,600 And oh my god, voila, now I have a web page 1952 01:22:56,600 --> 01:22:58,640 via which you can see who has registered. 1953 01:22:58,640 --> 01:22:59,390 But you know what? 1954 01:22:59,390 --> 01:23:00,710 We can make this more powerful. 1955 01:23:00,710 --> 01:23:03,540 Recall that we've been playing with HTTP for some time. 1956 01:23:03,540 --> 01:23:05,630 And if this is the URL I'm accessing, recall 1957 01:23:05,630 --> 01:23:08,580 that we played around with reimplementing search functionality. 1958 01:23:08,580 --> 01:23:11,990 Well, what if I want to support search such that I can just visit q 1959 01:23:11,990 --> 01:23:15,650 equals and then search for people named, say, Brian 1960 01:23:15,650 --> 01:23:17,540 and see how many Brians are registered. 1961 01:23:17,540 --> 01:23:19,500 Could we add support for something like this? 1962 01:23:19,500 --> 01:23:20,030 Well, maybe. 1963 01:23:20,030 --> 01:23:23,810 Let me go back into the IDE, into application.py. 1964 01:23:23,810 --> 01:23:26,180 And let me go ahead and say something like this. 1965 01:23:26,180 --> 01:23:32,420 q equals request.args get q to see if anything is actually there. 1966 01:23:32,420 --> 01:23:35,040 And then let me go ahead and do this. 1967 01:23:35,040 --> 01:23:39,200 SELECT star FROM registrants WHERE-- 1968 01:23:39,200 --> 01:23:41,360 let me see-- q where what? 1969 01:23:41,360 --> 01:23:43,910 NAME equals q. 1970 01:23:43,910 --> 01:23:45,020 But I need a placeholder. 1971 01:23:45,020 --> 01:23:46,560 So maybe I should do this. 1972 01:23:46,560 --> 01:23:48,630 And as soon as I do this, I need a what? 1973 01:23:48,630 --> 01:23:50,240 F for a format string. 1974 01:23:50,240 --> 01:23:53,780 So I could just create, on the fly, a SQL command 1975 01:23:53,780 --> 01:23:56,270 that plugs in the value of q between those curly 1976 01:23:56,270 --> 01:23:59,540 braces to express the logic of select all 1977 01:23:59,540 --> 01:24:03,200 of the registrants whose names equal Brian or Veronica 1978 01:24:03,200 --> 01:24:04,910 or whoever's name I typed in. 1979 01:24:04,910 --> 01:24:06,960 Let me go back to the browser here. 1980 01:24:06,960 --> 01:24:10,190 Let me go over and do something like this now. 1981 01:24:10,190 --> 01:24:13,850 Question mark q equals Brian-- cross my fingers as before. 1982 01:24:13,850 --> 01:24:14,480 Dammit. 1983 01:24:14,480 --> 01:24:15,090 OK. 1984 01:24:15,090 --> 01:24:18,320 And what did I do wrong here? 1985 01:24:18,320 --> 01:24:22,080 1986 01:24:22,080 --> 01:24:24,210 What did I do wrong here? 1987 01:24:24,210 --> 01:24:26,280 This is subtle. 1988 01:24:26,280 --> 01:24:29,922 And we're seeing it for the first time. 1989 01:24:29,922 --> 01:24:32,694 It thinks there's a column called Brian. 1990 01:24:32,694 --> 01:24:33,860 But why would it think that? 1991 01:24:33,860 --> 01:24:36,650 Well, what I've effectively done is sent in this. 1992 01:24:36,650 --> 01:24:38,960 Brian is not a keyword in SQL. 1993 01:24:38,960 --> 01:24:42,650 And because it's an actual string that I'm comparing against, 1994 01:24:42,650 --> 01:24:45,110 what I really need to be doing is this. 1995 01:24:45,110 --> 01:24:48,980 Otherwise, SQLite is going to think it's like the name of a column or something 1996 01:24:48,980 --> 01:24:51,620 I pre-created so we have to fix this. 1997 01:24:51,620 --> 01:24:52,440 But that's OK. 1998 01:24:52,440 --> 01:24:53,780 I can put the quotes there. 1999 01:24:53,780 --> 01:24:57,320 But I should probably put the q there in quotes. 2000 01:24:57,320 --> 01:25:01,590 Let's save this, go back to the browser, reload. 2001 01:25:01,590 --> 01:25:02,330 And there we go. 2002 01:25:02,330 --> 01:25:04,546 Now we have functionality for Brian. 2003 01:25:04,546 --> 01:25:06,920 And so with this basic building block, what have we done? 2004 01:25:06,920 --> 01:25:09,470 Well, in SQL, we have several commands at our disposal-- 2005 01:25:09,470 --> 01:25:12,560 creating a table, which frankly gets tedious by typing it out. 2006 01:25:12,560 --> 01:25:15,980 I myself tend to use and recommend phpLiteAdmin just to create your table 2007 01:25:15,980 --> 01:25:16,820 and get it going. 2008 01:25:16,820 --> 01:25:21,380 But then you can certainly manually, with SQLite3 or phpLiteAdmin INSERT, 2009 01:25:21,380 --> 01:25:24,440 or UPDATE, or DELETE, or SELECT information 2010 01:25:24,440 --> 01:25:25,925 once it's actually in the database. 2011 01:25:25,925 --> 01:25:27,050 And that's pretty powerful. 2012 01:25:27,050 --> 01:25:31,070 But once you do that, you can now use that same new syntax, that new language 2013 01:25:31,070 --> 01:25:35,270 SQL, passing it in as an input with a "string" to our EXECUTE function, 2014 01:25:35,270 --> 01:25:39,080 and now start pulling any data you want from your database. 2015 01:25:39,080 --> 01:25:41,629 Last week, with CSV files, if you wanted to do this, 2016 01:25:41,629 --> 01:25:44,420 you'd have to open the CSV file, use a for loop to iterate over it, 2017 01:25:44,420 --> 01:25:47,490 look over every column and row for your data, then pass it in. 2018 01:25:47,490 --> 01:25:48,205 And that's fine. 2019 01:25:48,205 --> 01:25:48,830 That's correct. 2020 01:25:48,830 --> 01:25:49,910 That's not bad. 2021 01:25:49,910 --> 01:25:50,639 But it's tedious. 2022 01:25:50,639 --> 01:25:52,680 And you're reinventing the wheel again and again. 2023 01:25:52,680 --> 01:25:56,000 And there's no filtration built in as there is to SQL itself. 2024 01:25:56,000 --> 01:25:58,730 So you now have a more sophisticated tool 2025 01:25:58,730 --> 01:26:04,470 in your toolkit so to speak with which to solve that same kind of problem. 2026 01:26:04,470 --> 01:26:08,432 Any questions then on this technique? 2027 01:26:08,432 --> 01:26:09,400 All right. 2028 01:26:09,400 --> 01:26:13,897 Well, let's look at a bigger database and see where we can go with this? 2029 01:26:13,897 --> 01:26:15,730 So if you go on the course's website, you'll 2030 01:26:15,730 --> 01:26:19,464 see a larger database that's actually available in multiple formats, SQLite, 2031 01:26:19,464 --> 01:26:21,880 which we'll see in a moment, but also Google Spreadsheets. 2032 01:26:21,880 --> 01:26:23,921 Because frankly, it's a lot more pleasant to look 2033 01:26:23,921 --> 01:26:26,560 at your rows and columns in a GUI than it is, necessarily, 2034 01:26:26,560 --> 01:26:28,430 with the file itself. 2035 01:26:28,430 --> 01:26:31,330 So this happens to be a free and open source sample database. 2036 01:26:31,330 --> 01:26:35,260 Like, some guy, years ago, took his actual iTunes database-- 2037 01:26:35,260 --> 01:26:37,750 all the music he'd ever bought, he wrote a program 2038 01:26:37,750 --> 01:26:40,450 to like analyze Apple's file format in iTunes 2039 01:26:40,450 --> 01:26:43,090 and extract all of the data that seemed to be stored about him, 2040 01:26:43,090 --> 01:26:46,240 I think was the story, and just made it publicly available as a sample 2041 01:26:46,240 --> 01:26:49,930 database for students and teachers to just use to manipulate data. 2042 01:26:49,930 --> 01:26:52,060 But what's interesting is that this database 2043 01:26:52,060 --> 01:26:55,870 demonstrates some other principles that we really haven't touched on. 2044 01:26:55,870 --> 01:27:01,610 For instance, if I were to store, again and again, 2045 01:27:01,610 --> 01:27:04,810 all of these students who are registering for froshims, 2046 01:27:04,810 --> 01:27:07,840 what do you start to see in certain fields? 2047 01:27:07,840 --> 01:27:11,080 Well, I was a little lazy, and I didn't bother typing in everyone's dorm. 2048 01:27:11,080 --> 01:27:14,049 But suppose that hundreds of students have registered for froshims. 2049 01:27:14,049 --> 01:27:16,840 A lot of them are going to be from Matthews, some from Pennypacker, 2050 01:27:16,840 --> 01:27:20,890 some from Candaday, some from Weld, and bunches of other buildings on campus. 2051 01:27:20,890 --> 01:27:22,810 It starts to get a little ridiculous when 2052 01:27:22,810 --> 01:27:25,450 you see Matthews, Matthews, Matthews, Matthews, 2053 01:27:25,450 --> 01:27:27,520 Matthews-- like, 100 or more times. 2054 01:27:27,520 --> 01:27:30,670 If there are 1,600 freshmen, there are a lot of kids in Matthews. 2055 01:27:30,670 --> 01:27:35,289 That's a lot of bytes to store M-A-T-T-H-E-W-- 2056 01:27:35,289 --> 01:27:35,830 I don't know. 2057 01:27:35,830 --> 01:27:36,600 It's not important. 2058 01:27:36,600 --> 01:27:37,933 Doesn't matter how it's spelled. 2059 01:27:37,933 --> 01:27:41,380 That's a lot of bytes to actually store in your database again and again 2060 01:27:41,380 --> 01:27:42,127 and again. 2061 01:27:42,127 --> 01:27:43,960 It feels like there should be an opportunity 2062 01:27:43,960 --> 01:27:45,610 to factor out the commonalities. 2063 01:27:45,610 --> 01:27:48,610 And what humans do with databases is, once they recognize a recurring 2064 01:27:48,610 --> 01:27:52,390 pattern of data, same darn strings again and again and again, you know what? 2065 01:27:52,390 --> 01:27:54,220 Rather than use-- now it matters-- 2066 01:27:54,220 --> 01:27:58,120 M-A-T-T-H-E-W-S-- which is 8 bytes. 2067 01:27:58,120 --> 01:28:05,860 Or P-E-N-N-Y-P-A-C-K-E-R, which is 11, then we have a lot of bytes being used 2068 01:28:05,860 --> 01:28:08,110 again and again and again to store all of these dorms. 2069 01:28:08,110 --> 01:28:08,693 You know what? 2070 01:28:08,693 --> 01:28:10,690 What's better than 11 bytes or 8 bytes? 2071 01:28:10,690 --> 01:28:13,870 Let's just use an int, or let's even use a small int-- 2072 01:28:13,870 --> 01:28:16,850 2 bytes or 4 bytes to represent dorms. 2073 01:28:16,850 --> 01:28:20,320 So instead of storing Matthews, let's just store the number 10. 2074 01:28:20,320 --> 01:28:23,800 And instead of Penny Packer, lets just store the number 11, thereby, 2075 01:28:23,800 --> 01:28:27,190 using some bytes, but fewer and, therefore, 2076 01:28:27,190 --> 01:28:29,030 saving bytes in the long run. 2077 01:28:29,030 --> 01:28:32,680 And so with this database demonstrates is exactly that principle. 2078 01:28:32,680 --> 01:28:36,880 Certainly, when it comes to music, where artists have multiple albums 2079 01:28:36,880 --> 01:28:39,280 and artists have multiple songs, it's probably 2080 01:28:39,280 --> 01:28:42,430 a little silly in a musical database to store the name of the album 2081 01:28:42,430 --> 01:28:44,740 again and again and again and again for all 10 or 12 2082 01:28:44,740 --> 01:28:47,530 or 20 tracks or songs on that particular album. 2083 01:28:47,530 --> 01:28:49,450 So what this person did was this. 2084 01:28:49,450 --> 01:28:51,480 Notice here, we have a whole bunch of sheets. 2085 01:28:51,480 --> 01:28:53,890 Or in database-speak, these would be tables. 2086 01:28:53,890 --> 01:28:56,050 And notice that these tables have columns. 2087 01:28:56,050 --> 01:28:58,060 And notice that these columns are album ID, 2088 01:28:58,060 --> 01:29:01,240 in the album table, album ID and title. 2089 01:29:01,240 --> 01:29:06,070 But notice what he did very cleverly with this field, artist ID. 2090 01:29:06,070 --> 01:29:09,550 Artists or singers have lots of songs to their name, eventually. 2091 01:29:09,550 --> 01:29:11,560 And so he's assigned each of them unique value, 2092 01:29:11,560 --> 01:29:13,960 or Apple did, in iTunes underneath the hood. 2093 01:29:13,960 --> 01:29:17,260 So how do I know what this artist's name is? 2094 01:29:17,260 --> 01:29:18,910 How would you figure this out? 2095 01:29:18,910 --> 01:29:23,120 It's not that interesting to us humans do know, ooh, artist ID number 2. 2096 01:29:23,120 --> 01:29:24,490 This is just-- what's that? 2097 01:29:24,490 --> 01:29:26,010 AUDIENCE: You need another table with artists. 2098 01:29:26,010 --> 01:29:26,890 DAVID J. MALAN: Yeah. 2099 01:29:26,890 --> 01:29:29,270 We need another table with artists, which is right over here. 2100 01:29:29,270 --> 01:29:30,686 So let me go ahead and look there. 2101 01:29:30,686 --> 01:29:34,120 So if I want to see-- let's see, "Let there be rock," artist number 1. 2102 01:29:34,120 --> 01:29:35,650 Let's go to the artists table. 2103 01:29:35,650 --> 01:29:39,400 And turns out, AC/DC, the band is who created that. 2104 01:29:39,400 --> 01:29:43,294 Now we've added a step here, which maybe is costing us a little bit of time. 2105 01:29:43,294 --> 01:29:46,210 But it's going to save a space in the long run if I'm not storing long 2106 01:29:46,210 --> 01:29:48,730 artist's names-- although, AC/DC isn't terribly long-- 2107 01:29:48,730 --> 01:29:49,960 again and again and again. 2108 01:29:49,960 --> 01:29:51,042 Now, which is better? 2109 01:29:51,042 --> 01:29:52,000 Well, it's a trade-off. 2110 01:29:52,000 --> 01:29:55,360 Are you more comfortable wasting space and storing everything together? 2111 01:29:55,360 --> 01:29:59,230 Or do you prefer to save space and just spend a little more time 2112 01:29:59,230 --> 01:30:01,010 joining the data back together? 2113 01:30:01,010 --> 01:30:03,100 But it's going to be really annoying if, now, 2114 01:30:03,100 --> 01:30:06,040 if I want to make a website that shows me the names of the songs 2115 01:30:06,040 --> 01:30:08,509 that I have in a database and the artists for them, 2116 01:30:08,509 --> 01:30:11,800 let alone the albums, and more of that, the titles of the tracks, and so forth. 2117 01:30:11,800 --> 01:30:13,750 It feels like that's three queries, right? 2118 01:30:13,750 --> 01:30:18,140 Like SELECT the album, SELECT the artist, SELECT the titles-- 2119 01:30:18,140 --> 01:30:18,640 but no. 2120 01:30:18,640 --> 01:30:21,670 With SQL, you can collapse that altogether. 2121 01:30:21,670 --> 01:30:24,110 Because notice, in this table here, artist, 2122 01:30:24,110 --> 01:30:27,506 there is a column called Artist ID that's numbers. 2123 01:30:27,506 --> 01:30:29,380 And notice, if you kind of picture this, it's 2124 01:30:29,380 --> 01:30:32,200 like finger tips here-- let's propose metaphorically-- 2125 01:30:32,200 --> 01:30:33,970 represent the artist ID. 2126 01:30:33,970 --> 01:30:38,200 If I go into album now, notice that we have album ID and title, 2127 01:30:38,200 --> 01:30:40,630 but we also have artist ID. 2128 01:30:40,630 --> 01:30:47,590 And so if you imagine these two tables sharing this common column, what if we 2129 01:30:47,590 --> 01:30:50,410 kind of stitch them together like this, lining up one 2130 01:30:50,410 --> 01:30:53,050 on the left, the other on the right, thereby reconstructing 2131 01:30:53,050 --> 01:30:55,600 all of the information and duplicating it as 2132 01:30:55,600 --> 01:30:59,340 needed so that I get back just the album and the title and the artist. 2133 01:30:59,340 --> 01:31:00,910 Well, how can express that? 2134 01:31:00,910 --> 01:31:05,680 Well, let me go ahead into CS50 IDE where I have a copy of this file. 2135 01:31:05,680 --> 01:31:12,420 Let me close all of my tabs from earlier go into this file called lecture.db. 2136 01:31:12,420 --> 01:31:18,680 And in lecture.db, in phpLiteAdmin, we'll see all of those same tables. 2137 01:31:18,680 --> 01:31:20,960 And I literally just imported it into a SQLite. 2138 01:31:20,960 --> 01:31:22,460 You'll see all of these same tables. 2139 01:31:22,460 --> 01:31:23,919 We can browse album just as before. 2140 01:31:23,919 --> 01:31:26,126 And we just see a different format for the same data. 2141 01:31:26,126 --> 01:31:28,160 It's the same data from the Google Spreadsheet, 2142 01:31:28,160 --> 01:31:29,760 which is just more user-friendly. 2143 01:31:29,760 --> 01:31:31,650 And let me go ahead and do this. 2144 01:31:31,650 --> 01:31:35,960 I could, of course, SELECT star from album 2145 01:31:35,960 --> 01:31:43,250 Where our artist ID equals 1 to get back all of AC/DC's albums. 2146 01:31:43,250 --> 01:31:44,500 And indeed, here, I have two. 2147 01:31:44,500 --> 01:31:48,107 They have For those about to rock, We salute you, and Let there be rock. 2148 01:31:48,107 --> 01:31:48,690 They have two. 2149 01:31:48,690 --> 01:31:52,850 But notice, the rows I got back contain only what information? 2150 01:31:52,850 --> 01:31:56,330 Album ID title, and artist ID. 2151 01:31:56,330 --> 01:31:59,600 I just know, as a human, that oh, these are AC/DC's albums. 2152 01:31:59,600 --> 01:32:03,080 But what if I want to know, well, OK, I see that artist ID is 1. 2153 01:32:03,080 --> 01:32:05,870 So all right, well, let me open another tab here. 2154 01:32:05,870 --> 01:32:12,570 And now let me SELECT star from artist WHERE artist ID equals 1. 2155 01:32:12,570 --> 01:32:16,010 And so if I want to learn something about that artist-- let me go ahead 2156 01:32:16,010 --> 01:32:19,767 and Zoom Out, click Go, and OK-- now I get AC/DC. 2157 01:32:19,767 --> 01:32:20,600 Well, this is great. 2158 01:32:20,600 --> 01:32:23,760 Now I have to results, two sets of rows. 2159 01:32:23,760 --> 01:32:24,560 This is stupid. 2160 01:32:24,560 --> 01:32:28,160 Now I'm just creating work for myself by having two return values. 2161 01:32:28,160 --> 01:32:29,660 I could call db EXECUTE twice. 2162 01:32:29,660 --> 01:32:31,400 But there's a better way. 2163 01:32:31,400 --> 01:32:35,479 It turns out, SQL allows you to join tables just using SQL itself. 2164 01:32:35,479 --> 01:32:37,020 So I'm going to go ahead and do this. 2165 01:32:37,020 --> 01:32:40,910 I'm going to go ahead and SELECT star FROM album, 2166 01:32:40,910 --> 01:32:47,770 but also FROM artist WHERE Album.Artistid-- 2167 01:32:47,770 --> 01:32:49,160 let me scroll to the right-- 2168 01:32:49,160 --> 01:32:53,030 equals Artist.Artistid. 2169 01:32:53,030 --> 01:32:56,760 So notice I'm saying select everything from two tables, 2170 01:32:56,760 --> 01:33:00,830 but only do so where the album tables, artist ID 2171 01:33:00,830 --> 01:33:05,902 column has the same value as the artist tables artist ID column. 2172 01:33:05,902 --> 01:33:08,360 That's kind of the stitching, metaphorically, of my fingers 2173 01:33:08,360 --> 01:33:10,730 together, looking for that common column. 2174 01:33:10,730 --> 01:33:15,750 If I go ahead and click Go, wow, look at what I've just constructed. 2175 01:33:15,750 --> 01:33:19,610 It's a lot of information, but I have album ID and title, 2176 01:33:19,610 --> 01:33:24,320 I have artists ID still, but I have the name of that artist altogether. 2177 01:33:24,320 --> 01:33:28,270 So if you now let your mind wander back to the Python code, 2178 01:33:28,270 --> 01:33:30,860 oh, I could now get a whole bunch of rows containing 2179 01:33:30,860 --> 01:33:32,570 everything I care about all at once. 2180 01:33:32,570 --> 01:33:34,220 I don't need two select queries. 2181 01:33:34,220 --> 01:33:36,170 I can join these tables in this way. 2182 01:33:36,170 --> 01:33:37,700 And I use join very deliberately. 2183 01:33:37,700 --> 01:33:40,770 It turns out that there's another way to express this same thing. 2184 01:33:40,770 --> 01:33:44,390 Instead of using that comma syntax I did, you might see as well this, 2185 01:33:44,390 --> 01:33:56,075 Select star From Artist JOIN Album ON Artist.Artistid equals-- 2186 01:33:56,075 --> 01:33:57,260 let me scroll over-- 2187 01:33:57,260 --> 01:33:59,477 Album.Artistid. 2188 01:33:59,477 --> 01:34:01,310 This is going to have the exact same effect, 2189 01:34:01,310 --> 01:34:04,268 but you might just find that it reads a little more intuitively to you. 2190 01:34:04,268 --> 01:34:07,752 Select everything from the result of joining these two tables. 2191 01:34:07,752 --> 01:34:08,960 How do you want to join them? 2192 01:34:08,960 --> 01:34:12,606 Well, join them on this equaling that-- 2193 01:34:12,606 --> 01:34:14,480 just another way of expressing the same idea. 2194 01:34:14,480 --> 01:34:17,180 And if I click Go, I get back the same information. 2195 01:34:17,180 --> 01:34:20,420 So ultimately, with JOINs do we have the ability to reassemble data. 2196 01:34:20,420 --> 01:34:24,290 So on the one hand, it's just good practice to normalize your database. 2197 01:34:24,290 --> 01:34:27,560 Identify columns that have lots and lots and lots of redundancy, and only 2198 01:34:27,560 --> 01:34:29,510 store that information once. 2199 01:34:29,510 --> 01:34:31,790 For instance, CS50 Finance, if you're supporting 2200 01:34:31,790 --> 01:34:36,170 many different users, every time Malan or Brian or Veronica buys a stock, 2201 01:34:36,170 --> 01:34:38,660 feels like it would be a little silly to store Malan 2202 01:34:38,660 --> 01:34:42,830 or Brian or Veronica along with Netflix, the symbol, and the number of shares 2203 01:34:42,830 --> 01:34:43,790 one of us bought. 2204 01:34:43,790 --> 01:34:45,640 Because Malan, Malan, Malan, Malan is going 2205 01:34:45,640 --> 01:34:46,750 to appear all throughout the database. 2206 01:34:46,750 --> 01:34:49,716 And what if I change my username or my name or someone gets married 2207 01:34:49,716 --> 01:34:50,840 and, therefore, it changes? 2208 01:34:50,840 --> 01:34:53,006 Like, why do you create that messiness for yourself? 2209 01:34:53,006 --> 01:34:57,860 Instead, give Brian and Veronica and me and everyone else a unique ID. 2210 01:34:57,860 --> 01:35:03,350 And when they buy something, just store their user ID or customer ID 2211 01:35:03,350 --> 01:35:07,280 or however you want to think about it, just like with album ID and artist ID. 2212 01:35:07,280 --> 01:35:10,370 And so normalizing a database is all about finding those commonalities 2213 01:35:10,370 --> 01:35:12,470 and moving the data into its own table. 2214 01:35:12,470 --> 01:35:14,640 And if you care about rejoining it, just use 2215 01:35:14,640 --> 01:35:18,300 SQL to reconstruct that view of the data, so to speak. 2216 01:35:18,300 --> 01:35:21,020 So what else can we do here as well? 2217 01:35:21,020 --> 01:35:24,200 It turns out that there is in SQL, not just primary keys, 2218 01:35:24,200 --> 01:35:27,320 but there are unique constraints in some databases where you can specify, 2219 01:35:27,320 --> 01:35:29,940 this isn't my primary key, but I want it to be unique. 2220 01:35:29,940 --> 01:35:32,220 You can specify that something should be indexed. 2221 01:35:32,220 --> 01:35:35,510 So it turns out that, if you just know there's a field in your database 2222 01:35:35,510 --> 01:35:38,180 that you want to be able to search on very efficiently, 2223 01:35:38,180 --> 01:35:39,776 you can index it in advance. 2224 01:35:39,776 --> 01:35:42,650 And you'll see or be able to do this if you'd like for final projects 2225 01:35:42,650 --> 01:35:44,970 or even for the next problem set if you'd like. 2226 01:35:44,970 --> 01:35:47,660 But what this enables are queries like this. 2227 01:35:47,660 --> 01:35:50,900 If I want to go ahead and search for, for instance-- 2228 01:35:50,900 --> 01:35:53,560 2229 01:35:53,560 --> 01:35:55,490 what would be a good example? 2230 01:35:55,490 --> 01:35:56,090 Rock. 2231 01:35:56,090 --> 01:35:57,560 I'm interested in rock. 2232 01:35:57,560 --> 01:36:00,740 So if I want to go into My SQL tab here. 2233 01:36:00,740 --> 01:36:02,150 I could say something like this. 2234 01:36:02,150 --> 01:36:07,460 SELECT star FROM Album WHERE Name not equals, 2235 01:36:07,460 --> 01:36:11,510 but where name is LIKE and then I'm going to say 'Rock.' 2236 01:36:11,510 --> 01:36:14,550 But if I want any number of characters to come before that word, 2237 01:36:14,550 --> 01:36:15,724 I can use a percent sign. 2238 01:36:15,724 --> 01:36:18,390 And if any number of characters after, I can use a percent sign. 2239 01:36:18,390 --> 01:36:19,431 These are like wildcards. 2240 01:36:19,431 --> 01:36:21,240 In most languages, you would use star. 2241 01:36:21,240 --> 01:36:23,280 In SQL, you use percent signs. 2242 01:36:23,280 --> 01:36:24,640 But it means the same thing. 2243 01:36:24,640 --> 01:36:27,840 And if I go ahead and say go, now I get back-- 2244 01:36:27,840 --> 01:36:32,340 oh, I get the got wrong lecture, album, oh, title I think is what I wanted. 2245 01:36:32,340 --> 01:36:33,720 Let me try that again, sorry-- 2246 01:36:33,720 --> 01:36:36,780 WHERE Title LIKE 'Rock'-- 2247 01:36:36,780 --> 01:36:38,250 let me go ahead and click Go. 2248 01:36:38,250 --> 01:36:41,970 And voila, here are all of the albums in the database that 2249 01:36:41,970 --> 01:36:43,430 have the word rock in them. 2250 01:36:43,430 --> 01:36:48,120 Now, as an aside, this table has a lot of more albums in it. 2251 01:36:48,120 --> 01:36:49,800 And frankly, it's small enough though. 2252 01:36:49,800 --> 01:36:51,930 It has hundreds of rows, maybe a few thousand rows. 2253 01:36:51,930 --> 01:36:55,410 None of us humans are really going to notice how slow linear search is. 2254 01:36:55,410 --> 01:36:58,050 But if you start having thousands of rows, tens of thousands 2255 01:36:58,050 --> 01:37:01,050 of rows, millions of rows, not having an index 2256 01:37:01,050 --> 01:37:04,230 means that searching for something like rock is going to start at the top 2257 01:37:04,230 --> 01:37:07,770 and search every darn field all the way to the bottom, big O of n. 2258 01:37:07,770 --> 01:37:09,720 If you instead tell the database, I know I'm 2259 01:37:09,720 --> 01:37:13,200 going to be searching on this column a lot, please index it for me, 2260 01:37:13,200 --> 01:37:14,450 here comes the secret sauce. 2261 01:37:14,450 --> 01:37:17,700 SQLite, Oracle, Microsoft Access, and so forth, they 2262 01:37:17,700 --> 01:37:19,650 will, using their own intellectual property, 2263 01:37:19,650 --> 01:37:22,830 build up some fancy data structures-- trees, or hash tables, 2264 01:37:22,830 --> 01:37:30,390 or whatever in memory, store the data for you invisibly in that format 2265 01:37:30,390 --> 01:37:33,840 so that, when you do ask for a question like, show me all the albums like rock, 2266 01:37:33,840 --> 01:37:37,800 they can answer you in much faster time than linear. 2267 01:37:37,800 --> 01:37:41,020 And that too is what you get with SQL that you don't get with CSVs. 2268 01:37:41,020 --> 01:37:43,717 CSVs are, by nature, only linear. 2269 01:37:43,717 --> 01:37:44,550 So we can do better. 2270 01:37:44,550 --> 01:37:46,980 But you, the programmer, have to help the database 2271 01:37:46,980 --> 01:37:49,540 and actually give it those hints, not just the types, 2272 01:37:49,540 --> 01:37:50,600 but also hints like this. 2273 01:37:50,600 --> 01:37:52,410 And as an aside, there's also the notion of foreign keys 2274 01:37:52,410 --> 01:37:55,260 where, if you really want to lock things down, you can specify that, 2275 01:37:55,260 --> 01:37:58,570 if you ever see in album ID in another table, 2276 01:37:58,570 --> 01:38:00,930 if it's a primary key in the album table, 2277 01:38:00,930 --> 01:38:04,080 by definition, in the other table, it's going to be called a foreign key. 2278 01:38:04,080 --> 01:38:06,120 Because it doesn't really belong there, but it's 2279 01:38:06,120 --> 01:38:08,370 referencing a column elsewhere. 2280 01:38:08,370 --> 01:38:10,519 So there's a lot more technology and vocabulary. 2281 01:38:10,519 --> 01:38:12,060 And you're welcome to dive in deeper. 2282 01:38:12,060 --> 01:38:13,890 And odds are, many of you will for final projects, 2283 01:38:13,890 --> 01:38:16,410 by nature of wanting certain features, among them, even 2284 01:38:16,410 --> 01:38:18,840 the ones we've seen like auto incrementing and not null. 2285 01:38:18,840 --> 01:38:21,000 As an aside too, SQL even has functions. 2286 01:38:21,000 --> 01:38:22,830 And for data scientists and statisticians 2287 01:38:22,830 --> 01:38:26,100 it's super useful to be able to just do math and summaries of data right 2288 01:38:26,100 --> 01:38:29,760 within SQL without ever writing Python code or R or anything else. 2289 01:38:29,760 --> 01:38:32,460 Built into a SQLite and other databases are 2290 01:38:32,460 --> 01:38:34,860 functions like this for average, counting things, getting 2291 01:38:34,860 --> 01:38:36,690 the min, max, sum, and so forth-- 2292 01:38:36,690 --> 01:38:39,640 all of that you get for free with a lot of databases. 2293 01:38:39,640 --> 01:38:43,000 All it takes in the context of Python is a line like this. 2294 01:38:43,000 --> 01:38:47,220 But, but, but, but, but there are some problems. 2295 01:38:47,220 --> 01:38:51,030 And let's end by taking a look at two fundamental problems and threats that 2296 01:38:51,030 --> 01:38:52,410 are too often underappreciated. 2297 01:38:52,410 --> 01:38:56,940 And in fact, we have to fix a very serious vulnerability that I introduced 2298 01:38:56,940 --> 01:38:58,740 into my very own code earlier. 2299 01:38:58,740 --> 01:39:01,050 But first, the so-called race condition. 2300 01:39:01,050 --> 01:39:04,650 In survey-- or rather, let's see-- 2301 01:39:04,650 --> 01:39:08,910 suppose that we think back at the very start of the semester, most of you 2302 01:39:08,910 --> 01:39:11,820 signed up for a GitHub account for the very first time. 2303 01:39:11,820 --> 01:39:13,470 And you went to github.com/signup. 2304 01:39:13,470 --> 01:39:15,420 For those unfamiliar, GitHub is a website 2305 01:39:15,420 --> 01:39:17,550 where you can save and store programming code that you've written 2306 01:39:17,550 --> 01:39:19,174 and want to collaborate with others on. 2307 01:39:19,174 --> 01:39:20,470 And you chose a username. 2308 01:39:20,470 --> 01:39:24,630 And let me go ahead and try choosing a username like, say, jharvard 2309 01:39:24,630 --> 01:39:25,890 for John Harvard. 2310 01:39:25,890 --> 01:39:28,830 Notice that the website immediately said the user name is taken. 2311 01:39:28,830 --> 01:39:30,120 All right, that's useful. 2312 01:39:30,120 --> 01:39:33,119 And you can probably guess how this is done-- maybe a little JavaScript, 2313 01:39:33,119 --> 01:39:35,580 using AJAX, talking to the server, getting the response, 2314 01:39:35,580 --> 01:39:37,740 changing the HTML or the CSS or whatever. 2315 01:39:37,740 --> 01:39:39,960 Might take some time to wire all that together. 2316 01:39:39,960 --> 01:39:42,070 But that's probably what's going on. 2317 01:39:42,070 --> 01:39:49,140 So let me try a really long random username that is not taken. 2318 01:39:49,140 --> 01:39:51,240 Hey, it's available. 2319 01:39:51,240 --> 01:39:54,290 But probably is not a good thing that I'm streaming this on the internet. 2320 01:39:54,290 --> 01:39:57,290 Because if I wait long enough, I bet someone could, for playful reasons, 2321 01:39:57,290 --> 01:39:59,940 just sign up for this, let alone anyone in this room. 2322 01:39:59,940 --> 01:40:02,650 But you've just told me it's available. 2323 01:40:02,650 --> 01:40:03,150 So good. 2324 01:40:03,150 --> 01:40:03,985 I'm really excited. 2325 01:40:03,985 --> 01:40:04,860 I've got my username. 2326 01:40:04,860 --> 01:40:07,220 Let me go ahead and type in my email address, 2327 01:40:07,220 --> 01:40:11,100 malan@harvard.edu, my password, 12345-- take a few moments there. 2328 01:40:11,100 --> 01:40:12,540 Verify my account and so forth. 2329 01:40:12,540 --> 01:40:13,890 And I click Submit. 2330 01:40:13,890 --> 01:40:18,690 Suppose that I'm told, momentarily, sorry, that username has been taken. 2331 01:40:18,690 --> 01:40:20,700 Could that happen? 2332 01:40:20,700 --> 01:40:23,400 Yeah, if any of you were trying to mess with me right now, 2333 01:40:23,400 --> 01:40:26,430 you would have signed up for that username and beaten me to the punch 2334 01:40:26,430 --> 01:40:29,040 so that when I hit Join, I get an error. 2335 01:40:29,040 --> 01:40:33,006 That's the definition of a race condition where two people or two users 2336 01:40:33,006 --> 01:40:34,380 or two computers or two threads-- 2337 01:40:34,380 --> 01:40:37,140 if we really roll back to our discussion of threads in Scratch-- 2338 01:40:37,140 --> 01:40:40,950 are trying to do the same thing at roughly the same time. 2339 01:40:40,950 --> 01:40:44,070 And if those two things, threads or humans, 2340 01:40:44,070 --> 01:40:47,040 check the state of a variable, which is a fancy way of saying 2341 01:40:47,040 --> 01:40:50,320 is the username available, they both get back answers. 2342 01:40:50,320 --> 01:40:52,480 But then some number of split seconds later, 2343 01:40:52,480 --> 01:40:56,490 then they make a decision based on that information, there is a window of time, 2344 01:40:56,490 --> 01:40:58,800 either split seconds or even seconds or minutes, 2345 01:40:58,800 --> 01:41:01,330 where the state of that variable could, of course, change. 2346 01:41:01,330 --> 01:41:03,454 So if you two, literally, right now on your laptop, 2347 01:41:03,454 --> 01:41:07,460 typed that very long username, all of us would probably be told, green light, 2348 01:41:07,460 --> 01:41:08,730 it's available. 2349 01:41:08,730 --> 01:41:11,010 But only one of us is actually going to get it. 2350 01:41:11,010 --> 01:41:12,980 And that's because of a race condition. 2351 01:41:12,980 --> 01:41:16,130 Literally, all of us might be racing to sign up for that value. 2352 01:41:16,130 --> 01:41:19,340 And it's when state can change in between things happening. 2353 01:41:19,340 --> 01:41:21,620 This is a bad thing because it makes your data 2354 01:41:21,620 --> 01:41:25,700 vulnerable to changes by someone you don't necessarily intend. 2355 01:41:25,700 --> 01:41:28,070 Or if the database isn't smart, you might 2356 01:41:28,070 --> 01:41:29,799 be able to do especially bad things. 2357 01:41:29,799 --> 01:41:31,340 ATMs are a canonical example of this. 2358 01:41:31,340 --> 01:41:35,849 If you had a malicious adversary trying to log into two bank accounts at once 2359 01:41:35,849 --> 01:41:39,140 or two physical machines at once, either with two cards or two accounts and two 2360 01:41:39,140 --> 01:41:42,080 laptops, you could imagine both of them trying to deduct, like, 2361 01:41:42,080 --> 01:41:45,080 $100 from the same account instantly. 2362 01:41:45,080 --> 01:41:48,200 Because imagine a poorly-implemented bank website. 2363 01:41:48,200 --> 01:41:51,380 It checks the account balance of the user logged in. 2364 01:41:51,380 --> 01:41:52,520 Do you have $100? 2365 01:41:52,520 --> 01:41:56,360 If the answer is yes, maybe both websites are going to say yes, 2366 01:41:56,360 --> 01:41:57,620 you may deduct $100. 2367 01:41:57,620 --> 01:42:00,141 You hit enter and voila, you deduct $100. 2368 01:42:00,141 --> 01:42:03,140 The user gets it somehow because it's transferred to some other account. 2369 01:42:03,140 --> 01:42:06,660 But the bank thinks it only did that once, deducts $100, 2370 01:42:06,660 --> 01:42:10,850 but you've just walked away with $200 because you made a decision based 2371 01:42:10,850 --> 01:42:13,400 on the same answer in two different threads 2372 01:42:13,400 --> 01:42:16,050 or two different programs or two different computers. 2373 01:42:16,050 --> 01:42:18,590 So long story short, this can happen even in the real world. 2374 01:42:18,590 --> 01:42:22,280 An example I was taught by my advisor years ago was this. 2375 01:42:22,280 --> 01:42:25,460 Suppose you and your roommates have a little dorm fridge. 2376 01:42:25,460 --> 01:42:29,190 And you're in the habit, of course, of drinking a lot of milk. 2377 01:42:29,190 --> 01:42:32,090 And so the fridge has run out of milk. 2378 01:42:32,090 --> 01:42:34,280 And you come home, the first roommate after classes. 2379 01:42:34,280 --> 01:42:36,424 And you realize, oh, I really need a drink of milk. 2380 01:42:36,424 --> 01:42:37,590 And so you check the fridge. 2381 01:42:37,590 --> 01:42:38,320 There's nothing there. 2382 01:42:38,320 --> 01:42:41,090 So you close the fridge, and you walk into the square, go to CVS, 2383 01:42:41,090 --> 01:42:43,100 and get in line to buy some milk. 2384 01:42:43,100 --> 01:42:45,230 Meanwhile, your roommate comes home. 2385 01:42:45,230 --> 01:42:47,237 Also, they really need a drink of milk. 2386 01:42:47,237 --> 01:42:49,070 And so they check the state of the variable. 2387 01:42:49,070 --> 01:42:50,510 Argh, no milk. 2388 01:42:50,510 --> 01:42:54,230 Close the fridge, and then walk to like Tommy's Convenience or some other place 2389 01:42:54,230 --> 01:42:56,150 nearby and get in line for some milk. 2390 01:42:56,150 --> 01:42:58,730 You of course, then both get home eventually. 2391 01:42:58,730 --> 01:43:00,140 And what happens now? 2392 01:43:00,140 --> 01:43:01,910 Dammit, now you have twice as much milk. 2393 01:43:01,910 --> 01:43:04,220 And milk goes bad quickly. 2394 01:43:04,220 --> 01:43:06,095 So now this is a problem, a very bad problem. 2395 01:43:06,095 --> 01:43:08,900 You have twice as much milk as you could possibly drink. 2396 01:43:08,900 --> 01:43:11,764 But what's the origin of that problem fundamentally? 2397 01:43:11,764 --> 01:43:13,180 AUDIENCE: You're out of something. 2398 01:43:13,180 --> 01:43:14,345 DAVID J. MALAN: You're out of something, but-- 2399 01:43:14,345 --> 01:43:15,230 AUDIENCE: You need it. 2400 01:43:15,230 --> 01:43:16,070 DAVID J. MALAN: --you need it. 2401 01:43:16,070 --> 01:43:17,705 But why did I end up with two? 2402 01:43:17,705 --> 01:43:18,825 AUDIENCE: There's no flag. 2403 01:43:18,825 --> 01:43:20,450 DAVID J. MALAN: There's no flag, right? 2404 01:43:20,450 --> 01:43:21,474 There's no indication. 2405 01:43:21,474 --> 01:43:22,640 There's no sharing of state. 2406 01:43:22,640 --> 01:43:24,500 You both inspected the value of the variable, 2407 01:43:24,500 --> 01:43:26,000 made a decision independently on it. 2408 01:43:26,000 --> 01:43:28,850 But the state of that variable changed on one of you. 2409 01:43:28,850 --> 01:43:31,760 Because when one of you came home, the later person, damn, like, 2410 01:43:31,760 --> 01:43:33,710 the milk has already been refilled. 2411 01:43:33,710 --> 01:43:34,880 So how do you solve this? 2412 01:43:34,880 --> 01:43:37,227 In the real world, how could you avoid this problem? 2413 01:43:37,227 --> 01:43:39,310 You just, one, never do errands for your roommate. 2414 01:43:39,310 --> 01:43:42,787 AUDIENCE: The magnet you put on the refrigerator says get this. 2415 01:43:42,787 --> 01:43:43,870 DAVID J. MALAN: Mag-- yes. 2416 01:43:43,870 --> 01:43:45,615 A shopping list, right-- gone for milk-- 2417 01:43:45,615 --> 01:43:46,490 Arrested Development. 2418 01:43:46,490 --> 01:43:47,990 Always leave a note, right? 2419 01:43:47,990 --> 01:43:50,240 You could convey that information. 2420 01:43:50,240 --> 01:43:53,090 You could more dramatically lock the refrigerator, right? 2421 01:43:53,090 --> 01:43:55,760 Padlock the thing, and so your roommate can't 2422 01:43:55,760 --> 01:43:59,540 inspect the state of the refrigerator while you are gone, therefore, 2423 01:43:59,540 --> 01:44:01,040 not making us vulnerable to this. 2424 01:44:01,040 --> 01:44:02,915 And I use the word lock deliberately because, 2425 01:44:02,915 --> 01:44:04,670 in databases, that's how they solve this. 2426 01:44:04,670 --> 01:44:07,070 There is a feature in databases called locks. 2427 01:44:07,070 --> 01:44:09,590 Or fancier versions of this are called transactions, 2428 01:44:09,590 --> 01:44:12,540 whereby, you can guarantee something called atomicity, 2429 01:44:12,540 --> 01:44:16,910 where atomicity means you can do multiple things back to back to back 2430 01:44:16,910 --> 01:44:18,470 without getting interrupted. 2431 01:44:18,470 --> 01:44:22,580 So in the case of a bank, it is possible, with SQL, using slightly 2432 01:44:22,580 --> 01:44:25,807 fancier syntax that we won't dive into today to solve this problem by saying, 2433 01:44:25,807 --> 01:44:26,390 you know what? 2434 01:44:26,390 --> 01:44:28,310 Begin the following transaction. 2435 01:44:28,310 --> 01:44:31,550 Check the state of the bank account, deduct this amount of money, 2436 01:44:31,550 --> 01:44:33,080 and now commit the results. 2437 01:44:33,080 --> 01:44:36,380 And while I'm doing that, lock everyone else out. 2438 01:44:36,380 --> 01:44:39,650 Don't let any other customer or any other user 2439 01:44:39,650 --> 01:44:43,700 do exactly that information that touches the same data until I am done. 2440 01:44:43,700 --> 01:44:45,727 Long story short-- you pay a price, perhaps. 2441 01:44:45,727 --> 01:44:48,560 You're literally preventing your roommate from accessing the fridge, 2442 01:44:48,560 --> 01:44:49,393 and that's annoying. 2443 01:44:49,393 --> 01:44:52,010 Or you're preventing other customers from doing transactions. 2444 01:44:52,010 --> 01:44:55,250 So hopefully the computer is fast at this, and your fast at shopping. 2445 01:44:55,250 --> 01:44:58,370 But you've at least ensured that you have atomicity. 2446 01:44:58,370 --> 01:45:01,970 No operation can get inserted into your sequence of operations 2447 01:45:01,970 --> 01:45:05,820 as by your roommate or some other computer or thread. 2448 01:45:05,820 --> 01:45:08,810 So that's a problem with databases that we're only going to skirt over. 2449 01:45:08,810 --> 01:45:11,157 And GitHub might solve this, how? 2450 01:45:11,157 --> 01:45:12,740 Well, by just not caring, potentially. 2451 01:45:12,740 --> 01:45:14,630 I don't know what's going to happen if multiple of us try. 2452 01:45:14,630 --> 01:45:18,440 I'm guessing they will just give n minus 1 of us an error message saying, sorry, 2453 01:45:18,440 --> 01:45:20,012 that username is no longer available. 2454 01:45:20,012 --> 01:45:20,720 Think about this. 2455 01:45:20,720 --> 01:45:22,220 If you've ever bought airline tickets, this 2456 01:45:22,220 --> 01:45:23,790 is a solved problem in that industry. 2457 01:45:23,790 --> 01:45:25,040 That would be really annoying. 2458 01:45:25,040 --> 01:45:27,620 If you just spent an hour of stressful price-hunting 2459 01:45:27,620 --> 01:45:29,937 for a good airplane ticket, you start checking out 2460 01:45:29,937 --> 01:45:31,520 after adding it to your shopping cart. 2461 01:45:31,520 --> 01:45:34,561 And five minutes later, after your name and email address and credit card 2462 01:45:34,561 --> 01:45:36,140 number, the ticket is gone. 2463 01:45:36,140 --> 01:45:37,400 So what do airlines do? 2464 01:45:37,400 --> 01:45:39,270 They often give you a five-minute window. 2465 01:45:39,270 --> 01:45:41,390 And some of the fancier websites show you the clock saying, 2466 01:45:41,390 --> 01:45:43,340 we guarantee this for the next five minutes. 2467 01:45:43,340 --> 01:45:46,970 Hotels might do this too where they locked the refrigerator for you 2468 01:45:46,970 --> 01:45:49,280 by somehow altering the database to say, mm-mm. 2469 01:45:49,280 --> 01:45:52,820 No one else can buy this ticket or this room for the next five minutes, 2470 01:45:52,820 --> 01:45:54,890 much like the note or the padlock. 2471 01:45:54,890 --> 01:45:57,980 So those kinds of things are all around us. 2472 01:45:57,980 --> 01:46:02,870 But let's look at one final example that's the worst threat of all is this. 2473 01:46:02,870 --> 01:46:06,320 Previously, I allowed myself to search by name. 2474 01:46:06,320 --> 01:46:10,164 So q equals Brian or q equals David or Veronica or the like. 2475 01:46:10,164 --> 01:46:11,830 And what did I do with that information? 2476 01:46:11,830 --> 01:46:14,280 Well, if we go back into the IDE and actually look 2477 01:46:14,280 --> 01:46:20,250 at that file in application.py, I simply formatted it using an F string 2478 01:46:20,250 --> 01:46:22,630 inside of this SQL string. 2479 01:46:22,630 --> 01:46:25,680 But what if my users were a little bit malicious? 2480 01:46:25,680 --> 01:46:28,660 And suppose that someone doesn't want to just search for Brian. 2481 01:46:28,660 --> 01:46:29,410 But you know what? 2482 01:46:29,410 --> 01:46:37,350 Suppose they do something like, my query is DELETE FROM registrants WHERE-- 2483 01:46:37,350 --> 01:46:40,560 sorry Brian-- NAME equals Brian-- 2484 01:46:40,560 --> 01:46:42,130 something like this. 2485 01:46:42,130 --> 01:46:46,380 Now, this is not valid at the moment, because this string, 2486 01:46:46,380 --> 01:46:48,650 while I'm certainly allowed to type it in, 2487 01:46:48,650 --> 01:46:53,077 is going to get plugged into my code, but in the wrong place logically. 2488 01:46:53,077 --> 01:46:56,160 Like, I'm going to look for someone's name called "DELETE FROM registrants 2489 01:46:56,160 --> 01:46:58,326 WHERE name equals Brian," which is just nonsensical. 2490 01:46:58,326 --> 01:47:00,240 It will return 0 results. 2491 01:47:00,240 --> 01:47:04,210 But what if I do something like this where I say, 2492 01:47:04,210 --> 01:47:08,790 Brian or DELETE from registrants where I finish 2493 01:47:08,790 --> 01:47:12,660 the thought that the programmer had and then start my own new thought. 2494 01:47:12,660 --> 01:47:16,020 Or another way of doing this is to use special syntax semicolon, something 2495 01:47:16,020 --> 01:47:16,890 like this. 2496 01:47:16,890 --> 01:47:21,300 Long story short, I could contrive a human malicious input that 2497 01:47:21,300 --> 01:47:24,540 finishes the programmer's thought and returns zero rows, but by the way, 2498 01:47:24,540 --> 01:47:28,710 also sneaks one additional rogue query into the database. 2499 01:47:28,710 --> 01:47:31,410 This is what's known as a SQL injection attack. 2500 01:47:31,410 --> 01:47:37,350 And if you naively and very, very, very badly and incorrectly write code like I 2501 01:47:37,350 --> 01:47:39,630 did-- don't ever do this-- 2502 01:47:39,630 --> 01:47:42,180 you will be vulnerable to exactly this attack 2503 01:47:42,180 --> 01:47:46,644 because you are blindly plugging in the user's input to a string 2504 01:47:46,644 --> 01:47:48,310 that you are then passing to a database. 2505 01:47:48,310 --> 01:47:52,410 This is a fundamental flaw in lots of applications, lots of languages 2506 01:47:52,410 --> 01:47:54,960 where you have to distrust your users. 2507 01:47:54,960 --> 01:47:57,300 It doesn't matter if it's for just students on campus 2508 01:47:57,300 --> 01:47:58,883 or it's just for you and your friends. 2509 01:47:58,883 --> 01:48:01,419 Never ever, ever trust users' input because either someone's 2510 01:48:01,419 --> 01:48:03,960 going to mistype something and something is going to go awry, 2511 01:48:03,960 --> 01:48:06,180 or you're going to have a bad apple trying 2512 01:48:06,180 --> 01:48:08,310 to hack into your website or your application 2513 01:48:08,310 --> 01:48:10,230 by trying these kinds of commands. 2514 01:48:10,230 --> 01:48:13,020 And you have to always write code defensively. 2515 01:48:13,020 --> 01:48:13,980 So how to do this? 2516 01:48:13,980 --> 01:48:15,330 There are a bunch of ways. 2517 01:48:15,330 --> 01:48:17,730 But it turns out that, what's dangerous about something 2518 01:48:17,730 --> 01:48:21,370 like I just typed in is that it's the semicolon, for instance, 2519 01:48:21,370 --> 01:48:24,040 and that's the quote marks over "elsewhere." 2520 01:48:24,040 --> 01:48:28,486 So the safest thing to do is, no matter what the user types in, escape things. 2521 01:48:28,486 --> 01:48:29,610 You can use special syntax. 2522 01:48:29,610 --> 01:48:32,940 We saw this in C-- generally, putting a backslash in front of something means, 2523 01:48:32,940 --> 01:48:35,280 don't let it have its default behavior. 2524 01:48:35,280 --> 01:48:36,780 Instead, treat it specially. 2525 01:48:36,780 --> 01:48:38,640 So you could use special code in Python that 2526 01:48:38,640 --> 01:48:41,640 just says remove any bad characters, or replace things. 2527 01:48:41,640 --> 01:48:44,610 Frankly, you've probably been to a website where you've been told, 2528 01:48:44,610 --> 01:48:47,160 sorry, you can't use that character in your password. 2529 01:48:47,160 --> 01:48:49,260 Or sorry, you can't use that in your username. 2530 01:48:49,260 --> 01:48:49,980 That's just dumb. 2531 01:48:49,980 --> 01:48:51,840 Like, that is the lazy approach to this. 2532 01:48:51,840 --> 01:48:54,060 There is no reason to prevent users from typing 2533 01:48:54,060 --> 01:48:57,750 any characters into their keyboard for their password and maybe even 2534 01:48:57,750 --> 01:48:58,620 their username. 2535 01:48:58,620 --> 01:49:01,203 That's kind of a lazy way of defending against this by saying, 2536 01:49:01,203 --> 01:49:03,480 mm-mm, I don't trust any percent sign, any semicolons, 2537 01:49:03,480 --> 01:49:05,370 any dashes, any apostrophes. 2538 01:49:05,370 --> 01:49:07,172 Rather, just escape things. 2539 01:49:07,172 --> 01:49:08,880 But it's silly for all of us in this room 2540 01:49:08,880 --> 01:49:13,320 to write our own code for scaping users' input or scrubbing it, as it's called, 2541 01:49:13,320 --> 01:49:16,200 or sanitizing it, as it's called-- same things. 2542 01:49:16,200 --> 01:49:17,850 Why don't we just use a library? 2543 01:49:17,850 --> 01:49:19,560 Now, there are many libraries out there. 2544 01:49:19,560 --> 01:49:21,601 The one that we're using at the moment is CS50's. 2545 01:49:21,601 --> 01:49:24,570 And the EXECUTE function does this for us. 2546 01:49:24,570 --> 01:49:28,110 Instead of using F strings, which you should not use like this. 2547 01:49:28,110 --> 01:49:29,970 You should instead do this. 2548 01:49:29,970 --> 01:49:33,060 If you want to plug in a placeholder value to a SQL query, 2549 01:49:33,060 --> 01:49:35,970 you literally use a standard convention, that we have adopted too, 2550 01:49:35,970 --> 01:49:39,222 where you just put in a variable's name, but with a colon in front of it. 2551 01:49:39,222 --> 01:49:40,180 And it can be anything. 2552 01:49:40,180 --> 01:49:40,710 It can be q. 2553 01:49:40,710 --> 01:49:41,310 It can be x. 2554 01:49:41,310 --> 01:49:42,210 It doesn't matter. 2555 01:49:42,210 --> 01:49:44,310 But you want to just plug in some value there. 2556 01:49:44,310 --> 01:49:46,560 So I'm going to call it name, by convention. 2557 01:49:46,560 --> 01:49:49,440 Then you close your quote and finish your thought. 2558 01:49:49,440 --> 01:49:55,110 And then you go ahead and pass in the actual value, name equals q. 2559 01:49:55,110 --> 01:49:58,590 And now you have constructed, dynamically, 2560 01:49:58,590 --> 01:50:01,590 a SQL string with a place holder that is not 2561 01:50:01,590 --> 01:50:03,870 Python's own curly brace placeholder. 2562 01:50:03,870 --> 01:50:07,800 This is a special SQL convention where you say plug in value here. 2563 01:50:07,800 --> 01:50:08,600 What value? 2564 01:50:08,600 --> 01:50:13,680 We'll plug in this names value, q, whatever the human has typed in. 2565 01:50:13,680 --> 01:50:16,350 And what are execute function will do for you 2566 01:50:16,350 --> 01:50:19,110 is all of the fancy backslashing and all of the escaping 2567 01:50:19,110 --> 01:50:21,432 and will protect you from the user's data. 2568 01:50:21,432 --> 01:50:22,890 And this is how truly simple it is. 2569 01:50:22,890 --> 01:50:24,570 It doesn't have to be CS50's library. 2570 01:50:24,570 --> 01:50:28,920 This is ever so common in all languages, but too few people know about it 2571 01:50:28,920 --> 01:50:29,674 and use it. 2572 01:50:29,674 --> 01:50:32,340 And so half the time you read about some database getting hacked 2573 01:50:32,340 --> 01:50:34,350 or your data getting stolen, it is because 2574 01:50:34,350 --> 01:50:37,170 of a stupid oversight like that. 2575 01:50:37,170 --> 01:50:40,020 So just use libraries and escape users' input. 2576 01:50:40,020 --> 01:50:41,610 We can see this now more concretely. 2577 01:50:41,610 --> 01:50:43,980 All of the undergrads in the room have surely 2578 01:50:43,980 --> 01:50:46,684 logged into, either Yale's website or Harvard's website, 2579 01:50:46,684 --> 01:50:48,600 which looks a little something like this here. 2580 01:50:48,600 --> 01:50:50,470 You're prompted for your login name and your password, 2581 01:50:50,470 --> 01:50:52,050 or your Harvard key or the like. 2582 01:50:52,050 --> 01:50:54,750 Well, how does this take effect in real terms? 2583 01:50:54,750 --> 01:50:56,520 If I were to type in my email address-- 2584 01:50:56,520 --> 01:50:59,920 but then weird syntax like this-- let's look at an example. 2585 01:50:59,920 --> 01:51:03,390 Quote "or" quote unquote "1" equals quote "1." 2586 01:51:03,390 --> 01:51:04,800 Notice it's not balanced. 2587 01:51:04,800 --> 01:51:07,400 It's missing a quote over here, missing a quote over here. 2588 01:51:07,400 --> 01:51:10,920 Because the presumption is that maybe Harvard is vulnerable to this. 2589 01:51:10,920 --> 01:51:12,240 I don't think they are. 2590 01:51:12,240 --> 01:51:15,200 But suppose that the code running Harvard key 2591 01:51:15,200 --> 01:51:18,660 and Harvard's login page looks a little something like this. 2592 01:51:18,660 --> 01:51:19,440 This is bad. 2593 01:51:19,440 --> 01:51:22,160 This is dangerous because they're just using f strings or format 2594 01:51:22,160 --> 01:51:24,826 strings, which are just going to blindly plug anything in there. 2595 01:51:24,826 --> 01:51:28,140 2596 01:51:28,140 --> 01:51:31,192 And so if you let the human type in something cryptic like that, 2597 01:51:31,192 --> 01:51:32,650 notice what has happened logically. 2598 01:51:32,650 --> 01:51:36,470 Where username equals me at example your email provider.com and password 2599 01:51:36,470 --> 01:51:41,330 equals quote unquote, so nothing, or 1 equals 1. 2600 01:51:41,330 --> 01:51:43,070 And why 1 equals 1? 2601 01:51:43,070 --> 01:51:46,460 If I go back, notice that there's a quote here and a quote here. 2602 01:51:46,460 --> 01:51:49,910 And the reason that I didn't finish my second quote here or my second quote 2603 01:51:49,910 --> 01:51:52,670 here is because I'm assuming, as a bad guy, 2604 01:51:52,670 --> 01:51:55,550 I think Harvard is just going to blindly plug my input 2605 01:51:55,550 --> 01:51:57,810 into a single quotes of their own. 2606 01:51:57,810 --> 01:51:59,990 Therefore, I can finish their thought nonsensically. 2607 01:51:59,990 --> 01:52:01,520 But notice, logically, what happens. 2608 01:52:01,520 --> 01:52:05,480 Select all users from the database where the user name is me at example email 2609 01:52:05,480 --> 01:52:07,340 provider and the password is nothing. 2610 01:52:07,340 --> 01:52:09,890 Or 1 equals 1. 2611 01:52:09,890 --> 01:52:12,170 Well, when does 1 equal 1? 2612 01:52:12,170 --> 01:52:13,680 Like, always. 2613 01:52:13,680 --> 01:52:16,310 So this will always return users from the database, 2614 01:52:16,310 --> 01:52:20,840 and presumably, therefore, let me log in as one of those users-- 2615 01:52:20,840 --> 01:52:23,480 so incredibly simple to defend against this. 2616 01:52:23,480 --> 01:52:28,130 Just use placeholder syntax and distrust and sanitize users' input. 2617 01:52:28,130 --> 01:52:31,670 The syntax in SQL and the CS50 library is quite simply with that colon. 2618 01:52:31,670 --> 01:52:34,220 But in other libraries, it might be quite the same. 2619 01:52:34,220 --> 01:52:37,130 So now you are all, families and students alike, 2620 01:52:37,130 --> 01:52:39,800 inaugurated into the small class of folks 2621 01:52:39,800 --> 01:52:42,069 in the world who understand particularly geeky humor. 2622 01:52:42,069 --> 01:52:45,110 You might notice this meme that's gone around the internet for many years 2623 01:52:45,110 --> 01:52:48,710 now where someone either maliciously or humorously 2624 01:52:48,710 --> 01:52:53,770 decided to paint this over their license platelets. 2625 01:52:53,770 --> 01:52:55,351 Let's enhance. 2626 01:52:55,351 --> 01:52:57,806 Why would someone do this? 2627 01:52:57,806 --> 01:53:00,452 2628 01:53:00,452 --> 01:53:01,243 AUDIENCE: Scanners. 2629 01:53:01,243 --> 01:53:02,750 DAVID J. MALAN: Scanners, yeah. 2630 01:53:02,750 --> 01:53:04,967 Tollbooths are going away, at least in the US. 2631 01:53:04,967 --> 01:53:08,300 And they instead have cameras or readers that are scanning the front of your car 2632 01:53:08,300 --> 01:53:12,110 and trying to optically do OCR, Optical Character Recognition, 2633 01:53:12,110 --> 01:53:13,340 on your license plate. 2634 01:53:13,340 --> 01:53:15,911 And the presumption here is, maybe in some municipality, 2635 01:53:15,911 --> 01:53:18,410 there's some badly written code where they just blindly plug 2636 01:53:18,410 --> 01:53:20,910 your license plate into their code. 2637 01:53:20,910 --> 01:53:23,977 And hopefully you finish the thought where ZU 0666-- 2638 01:53:23,977 --> 01:53:26,060 whatever that is, it's part of the license plate-- 2639 01:53:26,060 --> 01:53:30,425 but "semicolon drop database table dot dot dot." 2640 01:53:30,425 --> 01:53:33,050 And we didn't even look at that because DROP is pretty extreme. 2641 01:53:33,050 --> 01:53:35,870 It literally deletes a database itself. 2642 01:53:35,870 --> 01:53:41,960 But this is a nice way of getting off the hook from a total price. 2643 01:53:41,960 --> 01:53:45,470 And most canonical perhaps, XKCD is a very popular cartoon strip. 2644 01:53:45,470 --> 01:53:47,300 It's particularly geek-oriented. 2645 01:53:47,300 --> 01:53:53,057 And you'll perhaps understand this joke now as well among CS circles. 2646 01:53:53,057 --> 01:53:55,298 [LAUGHING] 2647 01:53:55,298 --> 01:54:04,516 2648 01:54:04,516 --> 01:54:06,656 I can hear the laughter making its way through. 2649 01:54:06,656 --> 01:54:08,780 So from here on out, if you take nothing else away, 2650 01:54:08,780 --> 01:54:11,840 remember little Bobby Tables with pset. 2651 01:54:11,840 --> 01:54:14,112 Our final, will you actually implement CS50 Finance 2652 01:54:14,112 --> 01:54:15,320 and coalesce all these ideas. 2653 01:54:15,320 --> 01:54:17,528 Thank you so much to all of our families for joining. 2654 01:54:17,528 --> 01:54:18,830 And we will see you next time. 2655 01:54:18,830 --> 01:54:21,580 [APPLAUSE]