WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.988 [MUSIC PLAYING] 00:01:12.582 --> 00:01:13.540 DAVID MALAN: All right. 00:01:13.540 --> 00:01:17.150 This is CS50 and this is already week seven. 00:01:17.150 --> 00:01:20.240 And this is the week where we'll continue where we left off 00:01:20.240 --> 00:01:22.430 with Python, introducing you to a bit more syntax 00:01:22.430 --> 00:01:25.580 and capabilities of the language so you can solve interesting problems. 00:01:25.580 --> 00:01:27.580 But a lot of those problems increasingly are now 00:01:27.580 --> 00:01:29.180 going to involve data in some form. 00:01:29.180 --> 00:01:31.250 After all, if you think of most any website 00:01:31.250 --> 00:01:36.320 or mobile app or process nowadays that involves solving problems, 00:01:36.320 --> 00:01:39.530 it almost always involves some amount of data and often data at scale. 00:01:39.530 --> 00:01:40.680 Lots and lots of data. 00:01:40.680 --> 00:01:42.430 And so what we're going to see first today 00:01:42.430 --> 00:01:45.710 is that, yes, you can use Python to solve all the problems past that we've 00:01:45.710 --> 00:01:48.327 seen and also some data specific ones, but sometimes it's 00:01:48.327 --> 00:01:49.410 just going to be annoying. 00:01:49.410 --> 00:01:50.420 It's going to be a little painful. 00:01:50.420 --> 00:01:54.060 It's just going to be more work than you might like to just get to some answer. 00:01:54.060 --> 00:01:56.870 And so today we'll too introduce you to a new language 00:01:56.870 --> 00:01:59.240 called SQL, Structured Query Language. 00:01:59.240 --> 00:02:02.780 And this is a language that rest assured is actually much smaller, 00:02:02.780 --> 00:02:05.240 relatively speaking, than C and Python. 00:02:05.240 --> 00:02:07.860 It sort of does less, but it doesn't really well. 00:02:07.860 --> 00:02:11.240 And it's a language for querying databases, storing data in it, 00:02:11.240 --> 00:02:13.950 updating it, inserting it, deleting it, and so much more. 00:02:13.950 --> 00:02:16.160 And it's the kind of technology that's used nowadays 00:02:16.160 --> 00:02:19.700 in, indeed, web apps and mobile apps, data science, analytics, 00:02:19.700 --> 00:02:20.940 and so much more. 00:02:20.940 --> 00:02:23.600 It's really good at storing lots and lots of data. 00:02:23.600 --> 00:02:25.430 Now, this is yet another language. 00:02:25.430 --> 00:02:27.800 And believe it or not, next week we'll introduce you 00:02:27.800 --> 00:02:31.880 to three more languages, HTML and CSS, which are not technically 00:02:31.880 --> 00:02:32.900 programming languages. 00:02:32.900 --> 00:02:35.630 They're all about aesthetics and markup of information. 00:02:35.630 --> 00:02:38.660 But also JavaScript, which is, in fact, a programming language. 00:02:38.660 --> 00:02:41.630 But the goals here in CS50 really are going 00:02:41.630 --> 00:02:44.000 to be to empower you to program more generally. 00:02:44.000 --> 00:02:47.000 And indeed, when you're out there in the real world some years from now, 00:02:47.000 --> 00:02:50.330 invariably there's going to be some new other popular language out there. 00:02:50.330 --> 00:02:53.880 And hopefully in this week and next week and beyond, among the goals 00:02:53.880 --> 00:02:56.630 is not just to teach you these languages specifically but again, 00:02:56.630 --> 00:02:59.330 how to teach yourself the future languages that we've not even 00:02:59.330 --> 00:03:00.480 heard about just yet. 00:03:00.480 --> 00:03:04.220 So with that said, let's begin with a survey of sorts. 00:03:04.220 --> 00:03:10.190 If you go to this URL on your phone or laptop, cs50.ly/favorites, 00:03:10.190 --> 00:03:14.390 a very simple Google Form awaits you that's just going to ask you a couple 00:03:14.390 --> 00:03:16.530 of multiple choice questions. 00:03:16.530 --> 00:03:24.800 So go to cs50.ly/favorites, and that should lead you to a Google Form that 00:03:24.800 --> 00:03:27.590 looks a little something like this asking you first, 00:03:27.590 --> 00:03:31.700 as of now in week seven, what is your favorite language among those options 00:03:31.700 --> 00:03:32.250 here. 00:03:32.250 --> 00:03:34.220 And then further down, one more question. 00:03:34.220 --> 00:03:39.170 If you think back on problem sets 0 through 6, what was, if any, 00:03:39.170 --> 00:03:42.080 your favorite problem set problem? 00:03:42.080 --> 00:03:45.110 Be it in Scratch or C or Python. 00:03:45.110 --> 00:03:48.650 So answer those two questions. 00:03:48.650 --> 00:03:52.820 And in a moment, I'll flip over to my screen here where you'll see, 00:03:52.820 --> 00:03:56.570 and anyone who's used Google Forms knows, the spreadsheet that's 00:03:56.570 --> 00:03:58.130 collecting now this data. 00:03:58.130 --> 00:04:01.060 Microsoft Office 365 can do the same if you use one of those forms. 00:04:01.060 --> 00:04:02.810 And what you see here now is a spreadsheet 00:04:02.810 --> 00:04:06.770 in Google Sheets enumerating all of the audience's questions. 00:04:06.770 --> 00:04:09.980 Language is in column B, problem is in column C, 00:04:09.980 --> 00:04:13.190 and each row represents one student who has responded. 00:04:13.190 --> 00:04:17.360 A few of you were super eager for class today at 8:33 AM Eastern time. 00:04:17.360 --> 00:04:19.040 10:32, 11:10. 00:04:19.040 --> 00:04:21.779 OK, so now we're getting into the actual class time here. 00:04:21.779 --> 00:04:24.830 And if I scroll down, we'll probably see a few dozen, 00:04:24.830 --> 00:04:26.900 a couple of hundred answers by now. 00:04:26.900 --> 00:04:29.225 And yeah, so we're getting a whole lot of answers here. 00:04:29.225 --> 00:04:31.100 And I'm seeing some patterns emerge, but it's 00:04:31.100 --> 00:04:34.190 not necessarily obvious to the human eyes what those patterns are. 00:04:34.190 --> 00:04:36.320 Now of course, you can use Google Spreadsheets. 00:04:36.320 --> 00:04:40.143 You can highlight the data and you can create charts magically out of it. 00:04:40.143 --> 00:04:42.560 But you can only do what Google lets you do with the data. 00:04:42.560 --> 00:04:45.350 And same thing for Microsoft Excel or Apple Numbers. 00:04:45.350 --> 00:04:48.230 But wouldn't it be nice to just be able to manipulate the raw data, 00:04:48.230 --> 00:04:51.740 relatively simple though it is, to just answer questions about the data? 00:04:51.740 --> 00:04:54.260 Maybe long term create your own charts, customize it 00:04:54.260 --> 00:04:56.240 just the way you want rather than beholden 00:04:56.240 --> 00:04:59.250 to software that's off the shelf like this. 00:04:59.250 --> 00:05:02.130 Well, how could we go about doing this? 00:05:02.130 --> 00:05:05.330 Well, let me propose that we treat this data set now 00:05:05.330 --> 00:05:08.790 as what we're going to call for now a flat file database. 00:05:08.790 --> 00:05:10.850 We'll see today that there's fancier databases, 00:05:10.850 --> 00:05:15.200 but the simplest database in the world is really just like a .csv file. 00:05:15.200 --> 00:05:18.500 And we saw that a couple of weeks ago in C. We wrote a bit of C code 00:05:18.500 --> 00:05:23.660 that used fprintf to write data to a file using commas as the separator. 00:05:23.660 --> 00:05:25.910 We didn't really do much more with CSVs at the time 00:05:25.910 --> 00:05:28.610 though, because it's really annoying, painful, time consuming, 00:05:28.610 --> 00:05:32.840 not fun to use C for something like that because of malloc and memory and all 00:05:32.840 --> 00:05:33.413 that stuff. 00:05:33.413 --> 00:05:35.330 But with Python, it's going to be much easier. 00:05:35.330 --> 00:05:39.530 And so any time you have access to some data set where you can just download it 00:05:39.530 --> 00:05:42.140 to your own Mac or PC or your cloud environment, 00:05:42.140 --> 00:05:46.460 it's sort of a candidate for now writing code to do something with the data. 00:05:46.460 --> 00:05:48.120 Maybe analyze it right away. 00:05:48.120 --> 00:05:50.150 If it's been human imported manually, maybe you 00:05:50.150 --> 00:05:52.550 have to clean it up by doing a lot of find and replace 00:05:52.550 --> 00:05:54.852 but not with your keyboard but rather with code. 00:05:54.852 --> 00:05:56.310 And so let me go ahead and do this. 00:05:56.310 --> 00:05:59.780 Let me go back to my Google Sheet here that has 00:05:59.780 --> 00:06:02.010 all of the data that's come in now. 00:06:02.010 --> 00:06:08.430 And let me go ahead and download this via the File menu here. 00:06:08.430 --> 00:06:09.590 And let's see. 00:06:09.590 --> 00:06:10.310 Download. 00:06:10.310 --> 00:06:12.170 And you can see a whole bunch of options. 00:06:12.170 --> 00:06:14.210 Most formats might be familiar, but today we'll 00:06:14.210 --> 00:06:17.602 focus just on this one, Comma Separated Values or CSV. 00:06:17.602 --> 00:06:20.810 That's going to go ahead and download it on my Mac here into my own Downloads 00:06:20.810 --> 00:06:21.140 folder. 00:06:21.140 --> 00:06:22.890 And now I'm going to go ahead and do this. 00:06:22.890 --> 00:06:25.140 Let me go ahead and pull up VS Code in the cloud here. 00:06:25.140 --> 00:06:28.140 And if you've never done this before, there's a couple of ways to do it. 00:06:28.140 --> 00:06:31.320 But the simplest way to upload a file to your code space, so to speak, 00:06:31.320 --> 00:06:32.990 is just a sort of drag and drop. 00:06:32.990 --> 00:06:36.020 That's going to magically upload it to the server there. 00:06:36.020 --> 00:06:39.688 And we'll see that, one, it has a very long file name, which I'm actually 00:06:39.688 --> 00:06:41.480 going to clean this up because that's going 00:06:41.480 --> 00:06:43.880 to be very tedious to type in my code. 00:06:43.880 --> 00:06:46.927 So I could either right click, of course, up here, 00:06:46.927 --> 00:06:48.510 but I'm going to use my Linux command. 00:06:48.510 --> 00:06:52.550 So let's move this file called CSV 50 2022 something or other and let's 00:06:52.550 --> 00:06:54.740 just name it more simply favorites.csv. 00:06:54.740 --> 00:06:58.370 So all lowercase, no spaces, sort of good basics. 00:06:58.370 --> 00:07:02.960 And let me go ahead now and open up this file with code favorites.csv. 00:07:02.960 --> 00:07:06.500 I'll close my File Explorer and we'll see exactly the same data 00:07:06.500 --> 00:07:09.890 as before but not quite as pretty as Google Sheets makes it be. 00:07:09.890 --> 00:07:13.640 Rather we see here that I still have three columns, timestamp, language, 00:07:13.640 --> 00:07:17.013 problem, and then all of the values down below, including the timestamps 00:07:17.013 --> 00:07:18.180 and the answers they're for. 00:07:18.180 --> 00:07:20.360 But it doesn't have proper columns. 00:07:20.360 --> 00:07:22.250 It just has commas separating them. 00:07:22.250 --> 00:07:24.860 Now, we could very easily write Python code 00:07:24.860 --> 00:07:28.820 just like we wrote C code to manipulate files like this 00:07:28.820 --> 00:07:30.140 either to write or read. 00:07:30.140 --> 00:07:32.450 But instead let's do something that's a little more 00:07:32.450 --> 00:07:34.640 pleasant, which is indeed in the form of Python. 00:07:34.640 --> 00:07:38.300 So Python actually comes with native support for CSVs. 00:07:38.300 --> 00:07:41.200 It has indeed a package called CSV that just 00:07:41.200 --> 00:07:43.700 lets you read and write and do a whole bunch of useful stuff 00:07:43.700 --> 00:07:45.007 when it comes to CSV files. 00:07:45.007 --> 00:07:47.090 So let's go ahead and do something with this file. 00:07:47.090 --> 00:07:49.250 Let me go back here to VS Code. 00:07:49.250 --> 00:07:51.860 I'm going to close favorites.csv for now. 00:07:51.860 --> 00:07:55.430 But just remember in your mind that timestamp was the first column, 00:07:55.430 --> 00:07:58.640 language was the second column, and problem was the third. 00:07:58.640 --> 00:08:01.790 And notice because we're using commas, they don't again line up perfectly, 00:08:01.790 --> 00:08:02.832 but that's not a problem. 00:08:02.832 --> 00:08:05.270 There are two commas in every line presumably. 00:08:05.270 --> 00:08:09.892 And I'm going to go ahead and now create a file called how about favorites.py so 00:08:09.892 --> 00:08:12.350 that I can start writing some code to manipulate this data. 00:08:12.350 --> 00:08:13.600 And let's do something simple. 00:08:13.600 --> 00:08:16.910 Let's just write a simple program in Python that opens this file, 00:08:16.910 --> 00:08:19.997 reads it, and print something out just as a safety check 00:08:19.997 --> 00:08:22.830 that I know what I'm doing, even though it's not going to be useful. 00:08:22.830 --> 00:08:26.780 So in Python, if you want CSV support, you import CSV. 00:08:26.780 --> 00:08:30.350 And that gives you access to all the magical capabilities thereof. 00:08:30.350 --> 00:08:32.450 Let me now go ahead and use this technique 00:08:32.450 --> 00:08:35.960 to open a file in Python, which is similar in C. 00:08:35.960 --> 00:08:37.789 But with Python, we're going to do this. 00:08:37.789 --> 00:08:39.020 The keyword with. 00:08:39.020 --> 00:08:42.049 I'm going to open a file called favorites.csv, 00:08:42.049 --> 00:08:43.880 which was the shorter name I gave it. 00:08:43.880 --> 00:08:46.280 This is optional, but just for explicitness, I'm 00:08:46.280 --> 00:08:50.090 going to open it in read mode explicitly, just like f open 00:08:50.090 --> 00:08:51.720 took a second argument as well. 00:08:51.720 --> 00:08:54.500 And I'm going to name this file once open quite simply file, 00:08:54.500 --> 00:08:56.240 though I could call it anything I want. 00:08:56.240 --> 00:08:59.060 And now it's just an open file. 00:08:59.060 --> 00:09:02.870 So far as Python knows at this moment, it's just text, or better yet, 00:09:02.870 --> 00:09:04.250 it's just zeros and ones. 00:09:04.250 --> 00:09:08.360 If you want this Python package called CSV to actually do 00:09:08.360 --> 00:09:12.500 something useful with it, you have to load this file now into the library. 00:09:12.500 --> 00:09:16.165 And the simplest way to do this is to give myself a variable called reader 00:09:16.165 --> 00:09:17.540 because I want to read this file. 00:09:17.540 --> 00:09:19.430 Though this too I could call anything else. 00:09:19.430 --> 00:09:22.610 I'm going to then set that equal to the return value of a function called 00:09:22.610 --> 00:09:24.710 csv.reader. 00:09:24.710 --> 00:09:28.220 And I pass to that per the documentation the open file. 00:09:28.220 --> 00:09:29.780 So step one, I open the file. 00:09:29.780 --> 00:09:32.030 And this just gives me access to the bytes therein. 00:09:32.030 --> 00:09:36.530 Step two now with csv.reader tells the Python package 00:09:36.530 --> 00:09:40.940 called CSV to do something useful with it and start analyzing the commas 00:09:40.940 --> 00:09:43.770 and allow me to parse it further. 00:09:43.770 --> 00:09:45.180 So let's go ahead and do this. 00:09:45.180 --> 00:09:49.730 Let me go ahead now and within this loop let's say this. 00:09:49.730 --> 00:09:52.340 Sorry, within this open file, let's do this. 00:09:52.340 --> 00:09:58.040 For every row, if you will, or line in the file, a.k.a. 00:09:58.040 --> 00:10:04.910 reader, let's go ahead and print out just how about row bracket 1. 00:10:04.910 --> 00:10:06.198 Now what's going on here? 00:10:06.198 --> 00:10:08.990 Well, it turns out if you read the documentation for the CSV reader 00:10:08.990 --> 00:10:13.610 function, what it hands you back is essentially this special object, so 00:10:13.610 --> 00:10:16.850 to speak, that allows you to treat it as though it's just 00:10:16.850 --> 00:10:20.660 a really big list of lines from the file, a.k.a. reader. 00:10:20.660 --> 00:10:25.310 So by saying for row in reader, this is a way more succinct way 00:10:25.310 --> 00:10:28.763 of saying give me the first line in the file plus plus, 00:10:28.763 --> 00:10:30.680 give me the second line in the file plus plus, 00:10:30.680 --> 00:10:33.597 and so forth that we would have done what much more mechanically in C. 00:10:33.597 --> 00:10:37.350 This is just much more Pythonic and English friendly, if you will. 00:10:37.350 --> 00:10:42.225 So in every iteration of this loop, row is going to contain all of the data 00:10:42.225 --> 00:10:43.100 from the current row. 00:10:43.100 --> 00:10:45.590 But better yet, what the reader function does 00:10:45.590 --> 00:10:50.240 for me is it hands me each row not just as a big string or STR 00:10:50.240 --> 00:10:51.590 of text in Python. 00:10:51.590 --> 00:10:56.590 It gives me what apparently based on the syntax on line six. 00:10:56.590 --> 00:10:57.460 Any instinct? 00:10:57.460 --> 00:10:58.030 Yeah. 00:10:58.030 --> 00:10:58.740 AUDIENCE: A list. 00:10:58.740 --> 00:11:00.698 DAVID MALAN: It's giving me back indeed a list. 00:11:00.698 --> 00:11:02.880 And I presume the visual clue for you was the fact 00:11:02.880 --> 00:11:04.463 that we're using square brackets here. 00:11:04.463 --> 00:11:07.780 And indeed, row bracket 1 is going to be not the first 00:11:07.780 --> 00:11:10.060 but the second element in that list. 00:11:10.060 --> 00:11:11.280 And so just take a guess. 00:11:11.280 --> 00:11:14.400 When I run this code in a moment, what's going to get printed? 00:11:14.400 --> 00:11:16.950 The timestamp, the language, or the problem? 00:11:20.892 --> 00:11:21.392 Yeah? 00:11:21.392 --> 00:11:22.350 AUDIENCE: The language. 00:11:22.350 --> 00:11:25.260 DAVID MALAN: The language because it's the second column that is 00:11:25.260 --> 00:11:27.848 in the file delimited by those commas. 00:11:27.848 --> 00:11:29.140 So let me go ahead and do this. 00:11:29.140 --> 00:11:30.630 Let me clear my terminal down here. 00:11:30.630 --> 00:11:33.570 Let me run Python of favorites.py and Enter. 00:11:33.570 --> 00:11:34.710 And there's everything. 00:11:34.710 --> 00:11:35.490 It was super fast. 00:11:35.490 --> 00:11:37.090 But there's a really long list here. 00:11:37.090 --> 00:11:41.130 And in fact, if I increase the size of my terminal and start scrolling up, 00:11:41.130 --> 00:11:43.090 you'll just see all of the raw data. 00:11:43.090 --> 00:11:44.490 Now, this isn't that useful yet. 00:11:44.490 --> 00:11:46.032 I could have just glanced at the CSV. 00:11:46.032 --> 00:11:49.320 But clearly now I have the ability to open the file, 00:11:49.320 --> 00:11:52.830 parse it, so to speak, that is break it up into its constituent parts, 00:11:52.830 --> 00:11:56.380 and do something with specific parts therein. 00:11:56.380 --> 00:11:56.880 All right. 00:11:56.880 --> 00:11:59.640 So if I want to do this a little more pleasantly though, 00:11:59.640 --> 00:12:02.460 let me at least make this semantically a little cleaner. 00:12:02.460 --> 00:12:05.650 And you know what, just for clarity, let me just give myself a variable. 00:12:05.650 --> 00:12:09.000 It's not strictly necessary, but I know that this is 00:12:09.000 --> 00:12:12.958 the favorite, for instance, language. 00:12:12.958 --> 00:12:14.250 So let's just call it favorite. 00:12:14.250 --> 00:12:16.410 Set it equal to row bracket 1. 00:12:16.410 --> 00:12:18.600 And now just to be more explicit in my code, 00:12:18.600 --> 00:12:21.030 even though again, we don't need the variable per se, 00:12:21.030 --> 00:12:22.840 this code is, of course, going to do the same thing. 00:12:22.840 --> 00:12:25.132 It's just using an additional variable called favorite. 00:12:25.132 --> 00:12:28.450 If I go down here, scroll up, run the program again, 00:12:28.450 --> 00:12:30.280 I get back to the exact same data. 00:12:30.280 --> 00:12:32.280 But this is a stepping stone to something 00:12:32.280 --> 00:12:34.800 that's even more powerful about Python support 00:12:34.800 --> 00:12:38.610 for CSV files is that you don't have to just treat the return 00:12:38.610 --> 00:12:43.500 value as a list with 0 and 1 and 2. 00:12:43.500 --> 00:12:48.540 So just thinking intuitively here, why is this maybe not the best design 00:12:48.540 --> 00:12:51.960 to hand you, the programmer, back the data in a list 00:12:51.960 --> 00:12:55.640 that's numerically indexed with 0, 1, 2? 00:12:55.640 --> 00:12:59.760 It clearly works, but critique this. 00:12:59.760 --> 00:13:00.890 What could go wrong? 00:13:00.890 --> 00:13:03.282 What's a little poorly designed? 00:13:03.282 --> 00:13:04.169 Yeah? 00:13:04.169 --> 00:13:06.586 AUDIENCE: You have to always remember what the things are, 00:13:06.586 --> 00:13:10.945 what the order is [INAUDIBLE] 00:13:10.945 --> 00:13:11.820 DAVID MALAN: Exactly. 00:13:11.820 --> 00:13:14.880 So yeah, so it's up to you to repeat-- 00:13:14.880 --> 00:13:18.360 it's up to you to remember what column the data is actually in. 00:13:18.360 --> 00:13:22.020 And God forbid you're collaborating with someone else on the spreadsheet. 00:13:22.020 --> 00:13:24.803 If you've used Google Spreadsheets, you can move columns around 00:13:24.803 --> 00:13:27.220 maybe just because you want to visually reorganize things. 00:13:27.220 --> 00:13:30.150 And if you do this and then someone else downloads that same data, 00:13:30.150 --> 00:13:31.840 all of their code is going to break. 00:13:31.840 --> 00:13:33.360 So that's just really bad design. 00:13:33.360 --> 00:13:36.390 It's fragile just because you're sort of on the honor system 00:13:36.390 --> 00:13:38.588 that one means the data that you want. 00:13:38.588 --> 00:13:41.130 So wouldn't it be nice if it could be a little more explicit? 00:13:41.130 --> 00:13:47.130 Well, recall that the very first line in this file is actually this. 00:13:47.130 --> 00:13:50.730 And I paused the output this time so that we can see more optionally. 00:13:50.730 --> 00:13:51.960 I just reran favorites.py. 00:13:51.960 --> 00:13:55.050 And notice one of these things is not like the other. 00:13:55.050 --> 00:13:59.070 Every output was either scratch or C or Python except for this first one. 00:13:59.070 --> 00:14:00.855 Why am I seeing the word language here? 00:14:03.610 --> 00:14:05.860 Where did language come from? 00:14:05.860 --> 00:14:08.230 You didn't have the ability to manually input. 00:14:08.230 --> 00:14:09.372 No. 00:14:09.372 --> 00:14:10.330 Where did it come from? 00:14:10.330 --> 00:14:11.030 Yeah. 00:14:11.030 --> 00:14:11.850 AUDIENCE: That would be the header. 00:14:11.850 --> 00:14:12.630 DAVID MALAN: Yeah, the header. 00:14:12.630 --> 00:14:15.300 The very first row in the file, which by human convention 00:14:15.300 --> 00:14:18.210 generally just defines what the columns represent so that there's 00:14:18.210 --> 00:14:20.470 some human useful information there. 00:14:20.470 --> 00:14:24.400 Now, that's not really intended to be part of my output at the moment, 00:14:24.400 --> 00:14:25.710 so there is a way to skip this. 00:14:25.710 --> 00:14:29.380 If you want to skip the first row, you can actually do something like this. 00:14:29.380 --> 00:14:32.560 You can say next row, and that will just ignore that row. 00:14:32.560 --> 00:14:35.520 So then I'm starting really with the every row thereafter. 00:14:35.520 --> 00:14:37.995 But there's a better way to handle this than that. 00:14:37.995 --> 00:14:40.620 That will get rid of the row in the output, but let me go ahead 00:14:40.620 --> 00:14:42.888 and use a different feature of the CSV package 00:14:42.888 --> 00:14:45.430 that's just going to make this a little cleaner all together. 00:14:45.430 --> 00:14:47.140 So let me clear my terminal window here. 00:14:47.140 --> 00:14:49.830 Let me undo this next thing that I just added. 00:14:49.830 --> 00:14:54.690 And instead of using a reader, let me go ahead and use a dictionary reader, 00:14:54.690 --> 00:15:00.720 abbreviated dict reader, that's going to now return me the equivalent of all 00:15:00.720 --> 00:15:02.260 of the rows one at a time. 00:15:02.260 --> 00:15:04.720 So I can still call it reader just as before. 00:15:04.720 --> 00:15:08.340 But as the name implies, what this reader is going to return 00:15:08.340 --> 00:15:11.560 is not a list after list after list but a dictionary, 00:15:11.560 --> 00:15:13.140 a dictionary, a dictionary. 00:15:13.140 --> 00:15:16.210 And remember, a dictionary is just a collection of key value pairs. 00:15:16.210 --> 00:15:17.640 So what does that mean? 00:15:17.640 --> 00:15:18.390 What are the keys? 00:15:18.390 --> 00:15:19.480 What are the values? 00:15:19.480 --> 00:15:24.210 Well, now that I'm using a dictionary reader, I can actually do this. 00:15:24.210 --> 00:15:27.270 Instead of on the honor system remembering that I want column one, 00:15:27.270 --> 00:15:30.630 I can treat row now not as a list but as a dictionary. 00:15:30.630 --> 00:15:34.410 And that means I can go in here and say quote unquote "language." 00:15:34.410 --> 00:15:36.150 And we saw that back in week six. 00:15:36.150 --> 00:15:38.820 Python allows you to index into dictionaries 00:15:38.820 --> 00:15:42.480 using square bracket notation in strings or STRs on the inside, 00:15:42.480 --> 00:15:44.790 just like lists allow for numbers. 00:15:44.790 --> 00:15:48.180 But this now I think is going to be a little more robust. 00:15:48.180 --> 00:15:52.830 If I run this again, Python of favorites.py, of that worked out fine. 00:15:52.830 --> 00:15:56.580 And let me pause the output too by using this program called more. 00:15:56.580 --> 00:15:58.600 Now I don't even see the header. 00:15:58.600 --> 00:16:03.660 So now whoever works with Python wrote the code for this package 00:16:03.660 --> 00:16:06.600 to just analyze that first line of code, use the header 00:16:06.600 --> 00:16:10.230 as you just called it as the keys, and then every time you 00:16:10.230 --> 00:16:13.380 iterate through this loop, it updates the values, the values, the values, 00:16:13.380 --> 00:16:15.810 but the keys stay the same. 00:16:15.810 --> 00:16:20.320 Any questions then on this technique? 00:16:20.320 --> 00:16:24.206 Suffice it to say this would be painful in C. Yes? 00:16:24.206 --> 00:16:28.670 AUDIENCE: [INAUDIBLE] 00:16:33.395 --> 00:16:34.270 DAVID MALAN: Exactly. 00:16:34.270 --> 00:16:39.130 So the keys are always going to be quote unquote "timestamp, language, 00:16:39.130 --> 00:16:40.130 and problem." 00:16:40.130 --> 00:16:43.330 But on each iteration of this loop here, the row 00:16:43.330 --> 00:16:46.600 is going to contain a different row of values, different row of values, 00:16:46.600 --> 00:16:47.600 different row of values. 00:16:47.600 --> 00:16:50.830 So you're going to get back one dictionary for every student who 00:16:50.830 --> 00:16:54.140 submitted the Google Form, if you will, while iterating through it there. 00:16:54.140 --> 00:16:54.640 All right. 00:16:54.640 --> 00:17:01.270 So once we have this ability here, why don't we go ahead and transition to 00:17:01.270 --> 00:17:03.850 how about not just using that dictionary reader, which 00:17:03.850 --> 00:17:05.890 it makes the code a little more robust. 00:17:05.890 --> 00:17:08.599 Because now if you move the columns around, no big deal. 00:17:08.599 --> 00:17:11.170 It doesn't matter if the numeric indices change. 00:17:11.170 --> 00:17:13.660 You can still use those keywords instead. 00:17:13.660 --> 00:17:15.460 But let's actually analyze the data now. 00:17:15.460 --> 00:17:18.619 I'm just spitting it out, which is not solving any problems for anyone. 00:17:18.619 --> 00:17:22.750 So let's go ahead and count the popularity of Scratch, C, and Python 00:17:22.750 --> 00:17:25.790 and see what everyone's been thinking here. 00:17:25.790 --> 00:17:26.290 All right. 00:17:26.290 --> 00:17:27.470 So how might I do this? 00:17:27.470 --> 00:17:29.890 Well, let me go ahead and do this up here. 00:17:29.890 --> 00:17:33.880 Before I start iterating, let me give myself, let's say, three variables. 00:17:33.880 --> 00:17:37.120 And to keep things simple, I'll say one variable called Scratch. 00:17:37.120 --> 00:17:40.270 Set it equal to 0 for 0 students so far. 00:17:40.270 --> 00:17:44.200 C is going to equal 0 and Python is going to equal 0. 00:17:44.200 --> 00:17:46.157 There's a slightly prettier way of doing this, 00:17:46.157 --> 00:17:49.240 just because this is like three lines of code to do something very simple. 00:17:49.240 --> 00:17:53.080 You could alternatively in Python but not C 00:17:53.080 --> 00:17:58.720 do Scratch comma C comma Python equals 0 comma 0. 00:17:58.720 --> 00:18:02.168 So kind of slightly more elegant just to fit it all into one line. 00:18:02.168 --> 00:18:04.210 But now let's just do something more interesting. 00:18:04.210 --> 00:18:06.160 On line seven, I'm still going to figure out 00:18:06.160 --> 00:18:08.320 what the current favorite language is. 00:18:08.320 --> 00:18:11.740 And now I'm just going to do some conditional checks. 00:18:11.740 --> 00:18:16.610 How about if that favorite equals equals quote unquote "Scratch," 00:18:16.610 --> 00:18:19.990 Let's go ahead and increment Scratch by 1. 00:18:19.990 --> 00:18:23.560 We can't do plus plus in Python, but we can do plus equals 1. 00:18:23.560 --> 00:18:30.700 How about elif if favorite equals equals C, then let's do C plus equals 1. 00:18:30.700 --> 00:18:32.063 We could do else. 00:18:32.063 --> 00:18:33.730 This is actually a good design question. 00:18:33.730 --> 00:18:34.840 Should I do else? 00:18:34.840 --> 00:18:36.880 Should I do elif? 00:18:36.880 --> 00:18:39.250 Any instincts here? 00:18:39.250 --> 00:18:40.824 Yeah. 00:18:40.824 --> 00:18:43.259 AUDIENCE: [INAUDIBLE] 00:18:46.322 --> 00:18:48.030 DAVID MALAN: Yeah, really good instincts. 00:18:48.030 --> 00:18:50.370 Just in case someone goes and adds another language 00:18:50.370 --> 00:18:51.960 to the form next week because we're obviously 00:18:51.960 --> 00:18:53.370 going to introduce another language today, 00:18:53.370 --> 00:18:55.620 you don't want your code to now artificially inflate 00:18:55.620 --> 00:18:58.380 the scores for Python just because you're 00:18:58.380 --> 00:19:00.340 conflating multiple languages together. 00:19:00.340 --> 00:19:03.810 So the more defensive sort of better way to write this code, I agree, 00:19:03.810 --> 00:19:07.680 would be elif favorite equals equals Python. 00:19:07.680 --> 00:19:10.590 Then let's go ahead and increment Python plus equals 1. 00:19:10.590 --> 00:19:12.840 And if there's a new language next week, we're 00:19:12.840 --> 00:19:14.260 obviously going to have to update the code, 00:19:14.260 --> 00:19:15.630 but at least we're not miscounting. 00:19:15.630 --> 00:19:17.130 We're just missing the new language. 00:19:17.130 --> 00:19:19.210 So I think that's slightly more robust. 00:19:19.210 --> 00:19:22.290 All right, now at the very bottom of this program and outside of the loop 00:19:22.290 --> 00:19:25.680 when I'm all done counting, let me go ahead and print out 00:19:25.680 --> 00:19:27.060 using some f strings. 00:19:27.060 --> 00:19:31.450 How about the total number of people whose favorite is Scratch? 00:19:31.450 --> 00:19:34.890 So this is just week six f string syntax. 00:19:34.890 --> 00:19:38.940 Let me go ahead and print out another f string for C. And I'm, of course, 00:19:38.940 --> 00:19:41.880 putting the variables in curly braces, all lowercase, 00:19:41.880 --> 00:19:45.060 but the English words I'm doing capitalization for. 00:19:45.060 --> 00:19:49.470 Let's do a final one with f Python colon and then in curly braces 00:19:49.470 --> 00:19:51.000 Python close quote. 00:19:51.000 --> 00:19:52.105 And I think I'm done. 00:19:52.105 --> 00:19:53.980 So let me just hide my terminal for a second. 00:19:53.980 --> 00:19:55.290 Here's the total program. 00:19:55.290 --> 00:19:56.730 Same stuff as before. 00:19:56.730 --> 00:19:58.980 Open favorites.csv. 00:19:58.980 --> 00:20:02.880 Open it further with the dictionary reader to do that processing for us. 00:20:02.880 --> 00:20:06.330 Initialize three variables to 0 just so we have something to count with. 00:20:06.330 --> 00:20:09.300 And then iterate over the file row by row. 00:20:09.300 --> 00:20:12.120 And this is just some sort of week one style conditional logic, 00:20:12.120 --> 00:20:14.410 albeit in Python, counting things. 00:20:14.410 --> 00:20:14.910 All right. 00:20:14.910 --> 00:20:17.080 So how can we now execute this? 00:20:17.080 --> 00:20:18.450 Let me go back to my terminal. 00:20:18.450 --> 00:20:20.610 Python of favorites.py. 00:20:20.610 --> 00:20:22.050 And here we go. 00:20:22.050 --> 00:20:26.400 As of today, everyone who's reporting in live via the Google Form, 00:20:26.400 --> 00:20:30.190 their favorite languages are-- 00:20:30.190 --> 00:20:31.330 interesting. 00:20:31.330 --> 00:20:34.510 That's pretty interesting too after just one week of Python no less. 00:20:34.510 --> 00:20:36.310 But Scratch is a healthy contender there. 00:20:36.310 --> 00:20:38.210 A lot of C. So a pretty good mix here. 00:20:38.210 --> 00:20:43.127 So is this going to be the best way to write this program long term? 00:20:43.127 --> 00:20:45.460 Well, as you noted, if there's a new language next week, 00:20:45.460 --> 00:20:47.830 this week we're going to have to constantly update this. 00:20:47.830 --> 00:20:51.610 And here's where you should let your mind wander to the future. 00:20:51.610 --> 00:20:55.270 If we have a fourth language, fifth language, sixth, seventh, eighth, 00:20:55.270 --> 00:20:59.410 which aspects here might kind of have some code smell to it? 00:20:59.410 --> 00:21:04.310 This probably isn't the best design to set us up for the future. 00:21:04.310 --> 00:21:07.480 What might be better than this? 00:21:07.480 --> 00:21:08.417 Yeah. 00:21:08.417 --> 00:21:09.880 AUDIENCE: We need to add a language to line five. 00:21:09.880 --> 00:21:10.330 DAVID MALAN: Yeah. 00:21:10.330 --> 00:21:12.288 We have to keep adding a language to line five. 00:21:12.288 --> 00:21:13.340 And OK, not a big deal. 00:21:13.340 --> 00:21:16.690 We could add SQL today and maybe JavaScript next week. 00:21:16.690 --> 00:21:19.510 But any time a line of code, a line of logic, 00:21:19.510 --> 00:21:21.450 it's just going to grow out of control. 00:21:21.450 --> 00:21:23.950 We've had this chat a couple of times with different syntax. 00:21:23.950 --> 00:21:25.700 There's probably a better way than that. 00:21:25.700 --> 00:21:26.440 So let's do that. 00:21:26.440 --> 00:21:30.550 Instead of using these individual variables, we could maybe use a list, 00:21:30.550 --> 00:21:35.560 but a list would be a little confusing because what does bracket 0 mean? 00:21:35.560 --> 00:21:37.090 What is bracket 1, bracket 2? 00:21:37.090 --> 00:21:40.930 But a dictionary, recall, is this Swiss army knife of data structures 00:21:40.930 --> 00:21:44.630 whereby you can associate anything with anything else, keys with values. 00:21:44.630 --> 00:21:48.370 So I dare say a cleaner way to solve this problem that sets us up 00:21:48.370 --> 00:21:52.570 for less work or confusion later would be to create a new variable called 00:21:52.570 --> 00:21:55.000 counts, if that's what we're doing, counting things up, 00:21:55.000 --> 00:21:58.340 and just set it equal to an empty dictionary. 00:21:58.340 --> 00:22:01.675 And you can literally say dict with the open parenthesis, close parenthesis 00:22:01.675 --> 00:22:05.020 and nothing or the more Pythonic just use open 00:22:05.020 --> 00:22:07.000 and close curly braces with nothing inside. 00:22:07.000 --> 00:22:11.320 That gives me an empty dictionary just like square brackets gives me a list. 00:22:11.320 --> 00:22:13.990 Now, my logic down here has to change a little bit. 00:22:13.990 --> 00:22:19.150 But what's nice is I don't need one conditional for every language. 00:22:19.150 --> 00:22:22.240 Because again, if we have a fourth, a fifth, a sixth, that chunk of code 00:22:22.240 --> 00:22:24.580 is also going to grow a bit out of control too. 00:22:24.580 --> 00:22:26.570 So I can get rid of this here. 00:22:26.570 --> 00:22:29.180 And what I think I'm going to do is say this. 00:22:29.180 --> 00:22:32.380 Whatever the current favorite is from the current row in the file, 00:22:32.380 --> 00:22:38.420 why don't we go into our counts variable at that key? 00:22:38.420 --> 00:22:40.303 And again, favorite is a variable. 00:22:40.303 --> 00:22:41.720 It's not quote unquote "favorite." 00:22:41.720 --> 00:22:44.560 It's going to be Scratch or C or Python. 00:22:44.560 --> 00:22:47.350 And then why don't we go ahead and just increment whatever 00:22:47.350 --> 00:22:51.740 the value of that count is at that key? 00:22:51.740 --> 00:22:53.500 Now, this is technically buggy. 00:22:53.500 --> 00:22:57.140 We're really close, but there is a bug. 00:22:57.140 --> 00:23:00.350 Does anyone want to conjecture what the bug is? 00:23:00.350 --> 00:23:00.890 Yeah. 00:23:00.890 --> 00:23:04.215 AUDIENCE: [INAUDIBLE] 00:23:08.490 --> 00:23:11.640 DAVID MALAN: A good question that answers my question nonetheless. 00:23:11.640 --> 00:23:12.685 So no. 00:23:12.685 --> 00:23:14.310 The magic you describe will not happen. 00:23:14.310 --> 00:23:17.490 And to repeat the hypothesis, will this automatically 00:23:17.490 --> 00:23:22.590 create a key for every language that we try 00:23:22.590 --> 00:23:24.090 plugging into those square brackets? 00:23:24.090 --> 00:23:25.140 Short answer no. 00:23:25.140 --> 00:23:27.750 Odds are this is going to create a key error, one of those 00:23:27.750 --> 00:23:29.640 traceback error messages that you've probably 00:23:29.640 --> 00:23:31.890 seen by now either in class or in problem sets 00:23:31.890 --> 00:23:37.410 whereby if Scratch hasn't appeared in the dictionary before or C or Python, 00:23:37.410 --> 00:23:40.330 then the dictionary has no clue what you're talking about. 00:23:40.330 --> 00:23:43.200 So I think we actually still need some conditional logic but not 00:23:43.200 --> 00:23:45.900 that's going to grow longer and longer with each language. 00:23:45.900 --> 00:23:47.860 What I think we probably want to do is this. 00:23:47.860 --> 00:23:52.530 If the current favorite is in the count dictionary, 00:23:52.530 --> 00:23:56.730 and this is the Pythonic way of just saying is this key in this dictionary, 00:23:56.730 --> 00:24:02.160 then go ahead and safely do counts favorite plus equals 1. 00:24:02.160 --> 00:24:05.250 Else, to your conjecture now, else what I want to do. 00:24:05.250 --> 00:24:10.330 Counts favorites equals, yeah, 1. 00:24:10.330 --> 00:24:14.380 So initialize a brand new key to a brand new value of 1 00:24:14.380 --> 00:24:16.630 because I'm obviously just seeing this language. 00:24:16.630 --> 00:24:19.190 Otherwise increment again and again. 00:24:19.190 --> 00:24:22.390 And now down here I just need to tweak my syntax a little bit. 00:24:22.390 --> 00:24:26.540 I don't need to print out all of these things one at a time manually. 00:24:26.540 --> 00:24:30.470 I can actually get away I think with another loop at the very bottom here. 00:24:30.470 --> 00:24:31.480 So how about I do this? 00:24:31.480 --> 00:24:34.790 For each favorite in those counts, and this is, 00:24:34.790 --> 00:24:39.340 again, the Pythonic way to iterate over all of the keys in a dictionary, 00:24:39.340 --> 00:24:45.220 go ahead and print out using an f string whatever the current favorite is, 00:24:45.220 --> 00:24:52.180 Scratch or C or Python, and then a colon and then figure out what its count is. 00:24:52.180 --> 00:24:54.610 And you can do that by going into the counts dictionary, 00:24:54.610 --> 00:24:58.910 looking at the favorite key, and get back its value. 00:24:58.910 --> 00:25:00.190 So I close my curly braces. 00:25:00.190 --> 00:25:01.060 I close my quotes. 00:25:01.060 --> 00:25:04.690 And even though this looks ugly at the moment, now this is much more dynamic. 00:25:04.690 --> 00:25:07.690 Because if we go and add SQL to the CSV file tomorrow 00:25:07.690 --> 00:25:10.720 or we add JavaScript next week, this will just work. 00:25:10.720 --> 00:25:12.760 It will keep working now automatically. 00:25:12.760 --> 00:25:15.140 All I change is the Google Form, not my actual code. 00:25:15.140 --> 00:25:15.640 All right. 00:25:15.640 --> 00:25:17.620 Let's try Python of favorites.py. 00:25:17.620 --> 00:25:19.150 Cross my fingers as always. 00:25:19.150 --> 00:25:23.020 And there now is the data as of now. 00:25:23.020 --> 00:25:25.270 Questions on this code here? 00:25:25.270 --> 00:25:26.195 Yeah. 00:25:26.195 --> 00:25:28.093 AUDIENCE: [INAUDIBLE] 00:25:28.093 --> 00:25:29.510 DAVID MALAN: Really good question. 00:25:29.510 --> 00:25:33.040 What if you wanted to print it in a particular order? 00:25:33.040 --> 00:25:35.530 Well, I could give you a couple of solutions. 00:25:35.530 --> 00:25:37.390 If you want to print it out in-- 00:25:37.390 --> 00:25:40.850 it's already coincidentally in alphabetical order. 00:25:40.850 --> 00:25:43.630 So you got that for free although that's just by chance here. 00:25:43.630 --> 00:25:45.010 But there is a way to do this. 00:25:45.010 --> 00:25:47.320 And let me propose that we go down here to my loop. 00:25:47.320 --> 00:25:50.320 And I explicitly use a function you might not have seen in Python yet, 00:25:50.320 --> 00:25:52.240 but it's literally called sorted, which is 00:25:52.240 --> 00:25:54.970 going to take either a list or, in this case, a dictionary 00:25:54.970 --> 00:25:58.270 and by default sort it by key alphabetically. 00:25:58.270 --> 00:26:00.907 Now, if my intuition is correct, this is not 00:26:00.907 --> 00:26:03.490 going to change the output, because it's already alphabetical. 00:26:03.490 --> 00:26:05.980 But if you read the documentation for the sorted function, 00:26:05.980 --> 00:26:08.800 it takes multiple parameters potentially, some of which 00:26:08.800 --> 00:26:10.000 are named parameters. 00:26:10.000 --> 00:26:11.870 And so you can actually do this. 00:26:11.870 --> 00:26:14.200 If you want to the counts but you want to reverse 00:26:14.200 --> 00:26:18.018 the order for whatever reason here so that it's reverse alphabetical order. 00:26:18.018 --> 00:26:19.810 Now let me go ahead and rerun this and I'll 00:26:19.810 --> 00:26:21.470 keep the previous output on the screen. 00:26:21.470 --> 00:26:27.000 Enter and now it's backwards alphabetically, if you will. 00:26:27.000 --> 00:26:29.130 Other questions on this here? 00:26:31.770 --> 00:26:33.830 No? 00:26:33.830 --> 00:26:39.930 How about then we transition to changing sorting by value. 00:26:39.930 --> 00:26:42.770 And this is going to escalate a little quickly briefly 00:26:42.770 --> 00:26:44.540 but then we'll tone it down again. 00:26:44.540 --> 00:26:47.600 Notice that right now this is indeed sorting by key. 00:26:47.600 --> 00:26:49.400 What if, especially if I have lots of data, 00:26:49.400 --> 00:26:52.670 it'd be nice to make a top 10 list or, in this case, a top three list 00:26:52.670 --> 00:26:58.890 and actually see in order of the counts, the values what these popular ones are. 00:26:58.890 --> 00:27:00.620 So it's not C, Python, Scratch. 00:27:00.620 --> 00:27:03.770 It should ideally be Python, then C, then 00:27:03.770 --> 00:27:06.710 Scratch because of the values and the magnitude thereof. 00:27:06.710 --> 00:27:08.010 So how can I do this? 00:27:08.010 --> 00:27:12.170 Well, it turns out there is another key, another parameter that you 00:27:12.170 --> 00:27:16.550 can pass to the sorted function that is typically 00:27:16.550 --> 00:27:17.925 implemented as a function itself. 00:27:17.925 --> 00:27:19.633 And so I'm going to go ahead and do this. 00:27:19.633 --> 00:27:21.950 I'm going to temporarily define a function called 00:27:21.950 --> 00:27:24.230 get value just to make my life easier. 00:27:24.230 --> 00:27:29.390 And this get value function is going to take I'll say a language parameter. 00:27:29.390 --> 00:27:35.400 And then all I'm going to do is return whatever the count is of that language. 00:27:35.400 --> 00:27:37.850 So out of context, this is just a super simple function 00:27:37.850 --> 00:27:41.252 that you hand it a language like Scratch or C or Python, 00:27:41.252 --> 00:27:43.460 it's just going to tell you what the count is thereof 00:27:43.460 --> 00:27:46.400 in that dictionary called counts. 00:27:46.400 --> 00:27:51.440 But what I can do now down here in my newly introduced call to sort it is I 00:27:51.440 --> 00:27:54.380 can tell it what to use as its key. 00:27:54.380 --> 00:27:57.170 Instead of using literally the key, Scratch, C, 00:27:57.170 --> 00:28:01.040 Python, I can sort of override that behavior and say, you know what? 00:28:01.040 --> 00:28:05.570 To figure out what to sort by, go ahead and call this function 00:28:05.570 --> 00:28:07.890 called get value. 00:28:07.890 --> 00:28:11.270 Notice that I have not put parentheses after get value 00:28:11.270 --> 00:28:14.420 because I don't want to call get value right then and there. 00:28:14.420 --> 00:28:20.737 I want to pass the get value function as itself in argument to the sorted 00:28:20.737 --> 00:28:23.570 function so that the sorted function written years ago by the people 00:28:23.570 --> 00:28:27.320 at Python can call my version of get value again and again 00:28:27.320 --> 00:28:30.300 and again when they try to sort this actual data. 00:28:30.300 --> 00:28:36.350 So now if I add that and I leave reverse equals true, let's see what happens. 00:28:36.350 --> 00:28:38.240 Python of favorites.py. 00:28:38.240 --> 00:28:38.810 Enter. 00:28:38.810 --> 00:28:42.410 And now I get my top 10, or in this case, top three list. 00:28:42.410 --> 00:28:45.290 And if I had more sophisticated data with more columns 00:28:45.290 --> 00:28:47.150 all together that I actually care about, I 00:28:47.150 --> 00:28:50.758 could even sort this more powerfully as well. 00:28:50.758 --> 00:28:52.550 But let me clean this up a little bit, just 00:28:52.550 --> 00:28:54.592 so you've seen it, even though we won't use these 00:28:54.592 --> 00:28:59.010 that often in CS50 until the end of the class will they come up again. 00:28:59.010 --> 00:29:00.740 Technically this is a little bit-- 00:29:00.740 --> 00:29:03.620 this isn't necessarily the best design to spend all this time 00:29:03.620 --> 00:29:06.680 implementing a function and then only use it in one place. 00:29:06.680 --> 00:29:09.430 In general, we've argued that you don't necessarily 00:29:09.430 --> 00:29:11.930 need a variable if you're only going to use it in one place. 00:29:11.930 --> 00:29:13.430 You don't really need a function if you're only 00:29:13.430 --> 00:29:14.638 going to use it in one place. 00:29:14.638 --> 00:29:17.160 And here we kind of have a good candidate for that. 00:29:17.160 --> 00:29:19.400 And so it turns out in Python if you don't 00:29:19.400 --> 00:29:23.240 want to bother creating a function just to use it once, 00:29:23.240 --> 00:29:26.810 you can create what's called an anonymous function, a.k.a. 00:29:26.810 --> 00:29:29.900 a lambda function, like the lambda symbol familiar. 00:29:29.900 --> 00:29:33.380 And a lambda function, the syntax is a little strange looking, 00:29:33.380 --> 00:29:34.490 but you say this. 00:29:34.490 --> 00:29:36.080 You literally say lambda. 00:29:36.080 --> 00:29:38.660 You literally then say the name of the argument 00:29:38.660 --> 00:29:42.470 that you want this anonymous function with no name to take. 00:29:42.470 --> 00:29:44.430 Then you have a colon. 00:29:44.430 --> 00:29:47.930 And then quite simply, you write what you want the return 00:29:47.930 --> 00:29:49.850 value of this function to be. 00:29:49.850 --> 00:29:51.800 You don't even say return literally. 00:29:51.800 --> 00:29:53.720 These lambda functions are meant to be used 00:29:53.720 --> 00:30:00.680 super tersely so that you can in one line express something like this. 00:30:00.680 --> 00:30:03.720 And I admit this looks more cryptic, I think, than the previous version. 00:30:03.720 --> 00:30:06.512 But as you get more comfortable with Python or other languages that 00:30:06.512 --> 00:30:10.370 support this feature, it allows you to not bother with lines of code like that 00:30:10.370 --> 00:30:12.690 and just tighten up your code a little bit. 00:30:12.690 --> 00:30:16.670 So this line here, lambda language colon counts language, 00:30:16.670 --> 00:30:18.927 is the one line version of this. 00:30:18.927 --> 00:30:21.260 And you don't even need to bother picking a name for it. 00:30:21.260 --> 00:30:23.840 Lambda tells Python I didn't waste any time 00:30:23.840 --> 00:30:26.440 thinking of a name for this function. 00:30:26.440 --> 00:30:35.310 So questions then on this technique of using Python to analyze data like this? 00:30:35.310 --> 00:30:37.900 Any questions? 00:30:37.900 --> 00:30:39.860 We're almost done with Python. 00:30:39.860 --> 00:30:42.070 Questions? 00:30:42.070 --> 00:30:42.640 No? 00:30:42.640 --> 00:30:43.390 OK. 00:30:43.390 --> 00:30:45.790 So why don't we make things a little more interesting? 00:30:45.790 --> 00:30:48.620 Because we had a much juicier data set with the problems 00:30:48.620 --> 00:30:50.620 that we've assigned over the past several weeks. 00:30:50.620 --> 00:30:54.970 Why don't we go ahead and, quite simply, I think 00:30:54.970 --> 00:30:57.290 we wrote pretty darn good code here. 00:30:57.290 --> 00:31:02.590 So I think we can pretty much just change a bit of it to say, 00:31:02.590 --> 00:31:05.720 let's see, if I don't want language, I want problem. 00:31:05.720 --> 00:31:10.510 And if I want to sort by not language but problem, I think that's it. 00:31:10.510 --> 00:31:12.580 I think if I didn't overlook something here 00:31:12.580 --> 00:31:16.210 just by changing what column I'm reading the data from and then 00:31:16.210 --> 00:31:18.550 just to be consistent renaming my variables 00:31:18.550 --> 00:31:21.310 just so I know what I'm looking at, what will this program now 00:31:21.310 --> 00:31:25.470 do after those minor changes? 00:31:25.470 --> 00:31:26.970 What will I see when I run this? 00:31:29.693 --> 00:31:31.860 What would be the first thing I see when I run this? 00:31:34.950 --> 00:31:35.970 Tough crowd today. 00:31:35.970 --> 00:31:36.720 Yes. 00:31:36.720 --> 00:31:37.637 AUDIENCE: The problem. 00:31:37.637 --> 00:31:39.110 DAVID MALAN: Yeah. 00:31:39.110 --> 00:31:40.330 The top problem. 00:31:40.330 --> 00:31:42.627 So the most popular problem, which I'm a little worried 00:31:42.627 --> 00:31:45.210 it might be hello or just Scratch, but let's go ahead and see. 00:31:45.210 --> 00:31:47.700 So let me go ahead and open my terminal window. 00:31:47.700 --> 00:31:50.670 I'll even maximize my terminal window so we can see a lot. 00:31:50.670 --> 00:31:54.167 Let me go ahead and run Python of favorites.py. 00:31:54.167 --> 00:31:57.000 I'm going to go ahead now and cross my fingers that I didn't mess up 00:31:57.000 --> 00:31:57.660 and hit Enter. 00:31:57.660 --> 00:31:59.550 And OK, great. 00:31:59.550 --> 00:32:00.480 We peaked early. 00:32:00.480 --> 00:32:03.840 So Scratch was the most popular program according 00:32:03.840 --> 00:32:06.460 to the data at the time I downloaded it. 00:32:06.460 --> 00:32:08.520 I'm sure other votes have come in since. 00:32:08.520 --> 00:32:13.080 Filter in week four was tied then with Tideman as well. 00:32:13.080 --> 00:32:16.480 Mario is a close third there and so forth. 00:32:16.480 --> 00:32:19.540 So this is helpful for us on staff that not so much love 00:32:19.540 --> 00:32:21.040 down here at the bottom of the list. 00:32:21.040 --> 00:32:23.400 So it was a bunch of code to write, but now 00:32:23.400 --> 00:32:26.250 that we've written it in this very versatile dynamic way, 00:32:26.250 --> 00:32:29.280 it's pretty good for just like crunching data and doing some analytics. 00:32:29.280 --> 00:32:33.480 But it's still a decent number of lines to have had to write manually. 00:32:33.480 --> 00:32:41.820 And this is where sometimes it isn't necessarily the right tool for the job, 00:32:41.820 --> 00:32:44.520 but rather a candidate for using some other language altogether, 00:32:44.520 --> 00:32:47.428 especially when it's not just a one time program that you run 00:32:47.428 --> 00:32:48.720 and you want to see the answer. 00:32:48.720 --> 00:32:53.130 What if you want to take input from the user and answer questions dynamically 00:32:53.130 --> 00:32:55.050 like a mobile app would, like a website would, 00:32:55.050 --> 00:32:58.740 like Microsoft Excel or Apple Numbers or Google Sheets would for you. 00:32:58.740 --> 00:33:02.430 Well, let's make one final change for now to this version of the program 00:33:02.430 --> 00:33:05.020 and actually take in some user input. 00:33:05.020 --> 00:33:08.170 So besides just loading all of the data into memory, 00:33:08.170 --> 00:33:13.290 let's go ahead and down below here not just print out the top 10 00:33:13.290 --> 00:33:16.660 list, if you will, but prompt the user for their favorites. 00:33:16.660 --> 00:33:18.660 I'm going to use Python's input function and I'm 00:33:18.660 --> 00:33:21.840 going to prompt them with "favorite," quote unquote, like tell me what 00:33:21.840 --> 00:33:23.370 your favorite problem-- 00:33:23.370 --> 00:33:27.660 what problem rather you are interested in. 00:33:27.660 --> 00:33:33.267 And now let me go ahead and say if that favorite is in the count's variable, 00:33:33.267 --> 00:33:35.850 so you didn't type in something random that we didn't actually 00:33:35.850 --> 00:33:37.890 assign as a problem, then let me go ahead 00:33:37.890 --> 00:33:44.130 and print with a format string whatever that favorite is of yours and show you 00:33:44.130 --> 00:33:47.040 the actual popularity thereof by indexing 00:33:47.040 --> 00:33:51.370 into counts using that favorite as the key and printing this. 00:33:51.370 --> 00:33:52.620 So now it's a dynamic program. 00:33:52.620 --> 00:33:55.170 It doesn't dump all of the data and all of the summations. 00:33:55.170 --> 00:33:59.100 Rather it's going to allow me to see what my choice of favorite is. 00:33:59.100 --> 00:34:03.940 And I'm going to go ahead and say, let's see, I'm a fan of Mario here. 00:34:03.940 --> 00:34:04.950 So Enter. 00:34:04.950 --> 00:34:08.880 And indeed, we see the same value we saw a moment ago but just for Mario. 00:34:08.880 --> 00:34:11.880 But the point now is that, one, all of this is possible. 00:34:11.880 --> 00:34:15.630 Two, it's way easier and more pleasant than this would have been in C. 00:34:15.630 --> 00:34:17.280 This is still only 15 lines of code. 00:34:17.280 --> 00:34:19.238 And in C, again, there's the memory management. 00:34:19.238 --> 00:34:21.947 There's the iterating over the strings trying to find the commas. 00:34:21.947 --> 00:34:23.190 There's just a lot more work. 00:34:23.190 --> 00:34:26.489 But honestly even when you just want to answer a question like this 00:34:26.489 --> 00:34:29.067 in Excel and Apple Numbers, Google Sheets, generally, you 00:34:29.067 --> 00:34:30.150 can just highlight things. 00:34:30.150 --> 00:34:34.500 You can click a button and boom, you get your answer for summation or max or min 00:34:34.500 --> 00:34:35.909 or any of those sort of basics. 00:34:35.909 --> 00:34:39.270 Wouldn't it be nice if we weren't taking a step backwards as programmers 00:34:39.270 --> 00:34:42.880 and being more powerful and yet we now have to do more of the work? 00:34:42.880 --> 00:34:46.230 So maybe sometimes Python's not or any language 00:34:46.230 --> 00:34:48.370 is not the best tool for the job. 00:34:48.370 --> 00:34:51.960 And that's going to now allow us to introduce more generally something 00:34:51.960 --> 00:34:53.670 called a relational database. 00:34:53.670 --> 00:34:57.137 Graduating from mere flat file databases like text files 00:34:57.137 --> 00:34:58.845 or binary files in which all of your data 00:34:58.845 --> 00:35:00.570 is stored to something more proper. 00:35:00.570 --> 00:35:02.566 But first, questions. 00:35:02.566 --> 00:35:08.518 AUDIENCE: [INAUDIBLE] 00:35:12.943 --> 00:35:14.360 DAVID MALAN: Really good question. 00:35:14.360 --> 00:35:17.670 To reiterate, if I were-- is this case sensitive? 00:35:17.670 --> 00:35:21.470 So if I were to type in Mario in all lowercase and hit Enter, 00:35:21.470 --> 00:35:23.450 I actually get no such response. 00:35:23.450 --> 00:35:26.720 Now that might be acceptable, because the problem technically is a capital M. 00:35:26.720 --> 00:35:30.840 But that's a little ridiculous to be that pedantic about the input. 00:35:30.840 --> 00:35:34.250 So how could we solve this? 00:35:34.250 --> 00:35:37.010 Any tips for how we can make this a little more robust? 00:35:37.010 --> 00:35:38.378 Yeah. 00:35:38.378 --> 00:35:42.290 AUDIENCE: [INAUDIBLE] 00:35:45.848 --> 00:35:46.765 DAVID MALAN: OK, yeah. 00:35:49.400 --> 00:35:49.900 Yeah. 00:35:49.900 --> 00:35:51.700 So we could use a few different functions, one of which 00:35:51.700 --> 00:35:54.970 is called title, which will change it to title case where it capitalizes, 00:35:54.970 --> 00:35:58.390 like in most English sentences, the first letter of that sentence. 00:35:58.390 --> 00:35:59.740 We could use capitalize. 00:35:59.740 --> 00:36:00.940 We could use upper. 00:36:00.940 --> 00:36:01.990 We could use lower. 00:36:01.990 --> 00:36:06.435 But indeed, we could just decide how we want to standardize the capitalization. 00:36:06.435 --> 00:36:08.810 Either uppercase, lowercase, or some combination thereof. 00:36:08.810 --> 00:36:11.770 And just make sure that you change the counts themselves. 00:36:11.770 --> 00:36:13.930 Make sure that you do the same to favorite 00:36:13.930 --> 00:36:16.420 and make sure that maybe you keep a backup of the data 00:36:16.420 --> 00:36:19.780 if you want to show the original version that came from the CSV 00:36:19.780 --> 00:36:23.205 without presuming to just capitalize everything for the user. 00:36:23.205 --> 00:36:25.330 But indeed, that would be the most common scenario. 00:36:25.330 --> 00:36:28.540 You just make things case insensitive when doing those matches. 00:36:28.540 --> 00:36:36.910 Other questions now on Python before we leave it behind for the coming week? 00:36:36.910 --> 00:36:39.880 All right, well then let's introduce these relational databases. 00:36:39.880 --> 00:36:43.060 So relational database is what every-- 00:36:43.060 --> 00:36:45.572 it's a super popular way of storing lots of data. 00:36:45.572 --> 00:36:48.280 This is what the Twitters of the world, the Googles of the world, 00:36:48.280 --> 00:36:52.390 the Metas of the world use to store some of their data at scale. 00:36:52.390 --> 00:36:55.492 There are alternatives to relational databases. 00:36:55.492 --> 00:36:57.700 Indeed, today we'll talk about a language called SQL. 00:36:57.700 --> 00:37:00.408 There's also a movement, if you will, or an alternative generally 00:37:00.408 --> 00:37:02.330 called NoSQL, which is just the opposite. 00:37:02.330 --> 00:37:03.280 You don't use SQL. 00:37:03.280 --> 00:37:06.020 There are things called object oriented databases and the like. 00:37:06.020 --> 00:37:11.110 But if you've ever heard of MySQL or PostgreSQL or Microsoft SQL 00:37:11.110 --> 00:37:15.580 Server or Oracle or MariaDB or a bunch of other products, 00:37:15.580 --> 00:37:18.850 both free and commercial, this is what they're talking about. 00:37:18.850 --> 00:37:21.560 Databases that are designed to store lots of data. 00:37:21.560 --> 00:37:23.710 And what's nice about relational databases 00:37:23.710 --> 00:37:26.530 is that they're really similar to the spreadsheets with which you 00:37:26.530 --> 00:37:29.800 were presumably familiar long before today's class. 00:37:29.800 --> 00:37:32.530 So a relational database is going to store, 00:37:32.530 --> 00:37:35.452 as you'll see, all of the data in rows and columns. 00:37:35.452 --> 00:37:37.910 Now, the terminology will thereafter be a little different. 00:37:37.910 --> 00:37:40.570 Instead of having sheets, you're going to have tables. 00:37:40.570 --> 00:37:43.240 But those tables are still going to have rows and columns. 00:37:43.240 --> 00:37:46.330 And you're going to have even more control over the performance 00:37:46.330 --> 00:37:51.970 of your data when you start to access it using this Structured Query Language 00:37:51.970 --> 00:37:53.050 or SQL. 00:37:53.050 --> 00:37:56.440 This is a language you can use for web apps, mobile apps. 00:37:56.440 --> 00:37:58.990 A lot of analysts would sit down at their Mac or PC 00:37:58.990 --> 00:38:01.810 and actually ask questions of data to get back the answer. 00:38:01.810 --> 00:38:05.050 And wonderfully, even though there will be some new syntax today, 00:38:05.050 --> 00:38:07.570 SQL really just does four basic things. 00:38:07.570 --> 00:38:10.030 CRUD is the sort of crude acronym here. 00:38:10.030 --> 00:38:14.470 CRUD is a way of remembering that a relational database supports 00:38:14.470 --> 00:38:20.390 ultimately creating data, reading data, updating data, and deleting data. 00:38:20.390 --> 00:38:22.240 So even if you're feeling like, wow, this 00:38:22.240 --> 00:38:25.510 is a lot of new syntax, which it isn't relative to our past languages, 00:38:25.510 --> 00:38:29.650 the only things you're doing really are creating data, reading data, updating, 00:38:29.650 --> 00:38:30.910 and deleting the same. 00:38:30.910 --> 00:38:36.940 Now a little confusingly in SQL, the corresponding functions or commands 00:38:36.940 --> 00:38:41.150 that exist that map to CRUD are actually this. 00:38:41.150 --> 00:38:44.290 So it's still create, but there's another one called insert. 00:38:44.290 --> 00:38:48.250 It's not read, which is more of the computer scientist way of saying it, 00:38:48.250 --> 00:38:50.380 but select, which is a little more explicit. 00:38:50.380 --> 00:38:52.030 Like select data you care about. 00:38:52.030 --> 00:38:53.320 Update is still update. 00:38:53.320 --> 00:38:54.310 Delete is still delete. 00:38:54.310 --> 00:38:57.910 But there's another command called drop, which lets you drop, that is delete, 00:38:57.910 --> 00:38:59.930 entire tables as well. 00:38:59.930 --> 00:39:03.280 So you can create tables using syntax that's 00:39:03.280 --> 00:39:05.480 generally going to look like this. 00:39:05.480 --> 00:39:06.910 You'll say create table. 00:39:06.910 --> 00:39:10.210 You'll give the name of the table, which you can call most 00:39:10.210 --> 00:39:14.260 anything you want, but generally all lowercase, no spaces is best. 00:39:14.260 --> 00:39:18.550 Then in parentheses, you can specify a comma separated list of the columns 00:39:18.550 --> 00:39:20.660 that you might want in this table. 00:39:20.660 --> 00:39:24.070 So this is the code equivalent in the SQL language 00:39:24.070 --> 00:39:27.220 of manually opening Google Sheets or Excel or Numbers 00:39:27.220 --> 00:39:30.490 and clicking in the top left cell and like typing timestamp 00:39:30.490 --> 00:39:34.180 and then in the next typing language and then the third typing problem. 00:39:34.180 --> 00:39:37.750 This is the way to define what your headers are, if you will, 00:39:37.750 --> 00:39:38.500 in a spreadsheet. 00:39:38.500 --> 00:39:39.790 But now it's called a table. 00:39:39.790 --> 00:39:42.100 Now, we won't use this command manually first. 00:39:42.100 --> 00:39:44.050 Let's do something a little simpler. 00:39:44.050 --> 00:39:47.590 We're going to start off by just importing this data ourselves. 00:39:47.590 --> 00:39:49.340 And I'm going to go ahead and do this. 00:39:49.340 --> 00:39:51.160 Let me go back to VS Code here. 00:39:51.160 --> 00:39:54.040 I'm going to leave behind favorites.py for now, 00:39:54.040 --> 00:39:57.100 because now we're going to transition to this other language called SQL. 00:39:57.100 --> 00:40:02.680 And to do this, I am going to create a new database file. 00:40:02.680 --> 00:40:06.790 And I'm going to do so using a command called sqlite3, 00:40:06.790 --> 00:40:09.160 which is just the third version thereof, and I'm 00:40:09.160 --> 00:40:12.110 going to give the database a name of favorites.db. 00:40:12.110 --> 00:40:14.860 There's different conventions, but this is one of the most common. 00:40:14.860 --> 00:40:18.610 When I hit Enter, this is going to create for me a new empty database just 00:40:18.610 --> 00:40:23.080 like opening an untitled spreadsheet in Excel, Google Sheets, or Apple Numbers. 00:40:23.080 --> 00:40:25.720 I'm being prompted do I want to create favorites.db. 00:40:25.720 --> 00:40:26.890 I'll hit Y for Yes. 00:40:26.890 --> 00:40:28.562 OK, we're up and running. 00:40:28.562 --> 00:40:30.520 Now, you're going to notice a different prompt. 00:40:30.520 --> 00:40:34.360 I'm not in my Linux prompt per se, which is always the dollar sign. 00:40:34.360 --> 00:40:37.090 I'm now inside of the program called SQLite. 00:40:37.090 --> 00:40:41.500 And we're going to use SQLite, SQLite3, as just an interactive way for now 00:40:41.500 --> 00:40:43.180 of playing with SQL code. 00:40:43.180 --> 00:40:48.550 At the end of today, we'll show you how you can use SQL in Python code 00:40:48.550 --> 00:40:51.050 so that you still write Python code to do whatever you want, 00:40:51.050 --> 00:40:53.170 but you can talk to databases using Python. 00:40:53.170 --> 00:40:55.630 And this is exactly how web apps, mobile apps work. 00:40:55.630 --> 00:40:59.140 For instance, on iOS, on an iPhone, an iPad, or the like, 00:40:59.140 --> 00:41:03.010 if you want to store data, it's very often stored in a SQL database, 00:41:03.010 --> 00:41:04.580 as we're about to do. 00:41:04.580 --> 00:41:07.398 But you might use a language called SWIFT or Objective C. 00:41:07.398 --> 00:41:09.190 And the same exists in the world of Android 00:41:09.190 --> 00:41:12.830 using Java or Kotlin or something else to query the database. 00:41:12.830 --> 00:41:16.270 So we're going to see SQL in isolation for now like an analyst might just 00:41:16.270 --> 00:41:19.880 use with their Mac or PC, but we're going to tie it together by day's end. 00:41:19.880 --> 00:41:28.180 So at this terminal SQLite, let me go ahead and execute this command first. 00:41:28.180 --> 00:41:30.200 I'm going to first put SQLite into CSV mode, 00:41:30.200 --> 00:41:32.200 because I'm going to cut some corners initially. 00:41:32.200 --> 00:41:34.600 And I'm just going to automatically import 00:41:34.600 --> 00:41:38.560 all of the data that was submitted via the Google Form, which I exported 00:41:38.560 --> 00:41:41.410 as a CSV and uploaded to my code space. 00:41:41.410 --> 00:41:45.790 And I'm just going to automatically say turn this CSV file into a SQL 00:41:45.790 --> 00:41:48.040 database for me just so I don't have to figure out 00:41:48.040 --> 00:41:49.760 what those create table commands are. 00:41:49.760 --> 00:41:53.200 So to do this, I'm going to say mode csv so 00:41:53.200 --> 00:41:57.850 that SQLite knows that this is the command, knows that this is a CSV file. 00:41:57.850 --> 00:41:59.290 It's literally .mode. 00:41:59.290 --> 00:42:01.480 So the dot comes before the keyword there. 00:42:01.480 --> 00:42:05.770 And now I'm going to say .import and then the name of the file I want 00:42:05.770 --> 00:42:08.020 to import, which is favorites.csv. 00:42:08.020 --> 00:42:12.635 And now the name of the table that I want to create with that data. 00:42:12.635 --> 00:42:15.010 And just for consistency, I'm going to call it favorites. 00:42:15.010 --> 00:42:17.270 I could change these things to be anything I want, 00:42:17.270 --> 00:42:18.530 but I'm going to do that. 00:42:18.530 --> 00:42:20.860 And voila, nothing seems to have happened. 00:42:20.860 --> 00:42:24.503 But just like in C and in Python and Linux when nothing seems to happen, 00:42:24.503 --> 00:42:25.670 that's usually a good thing. 00:42:25.670 --> 00:42:27.590 It means I didn't mess up. 00:42:27.590 --> 00:42:31.120 So if I want to see what just happened, there's this other command. 00:42:31.120 --> 00:42:33.430 And these commands that start with dots, these 00:42:33.430 --> 00:42:37.690 are SQLite specific, which is indeed a lightweight version of SQL. 00:42:37.690 --> 00:42:39.010 They're not SQL, per se. 00:42:39.010 --> 00:42:41.260 So if you're using Oracle or something else like that, 00:42:41.260 --> 00:42:43.135 you're not going to use these exact commands. 00:42:43.135 --> 00:42:45.160 You'll see the ones we use in just a moment. 00:42:45.160 --> 00:42:46.660 And here's the first. 00:42:46.660 --> 00:42:51.230 When I type .schema, the schema of a database is the design of the database. 00:42:51.230 --> 00:42:52.120 What are the tables? 00:42:52.120 --> 00:42:53.900 What are the columns and all of that? 00:42:53.900 --> 00:42:58.390 So when I type .schema, this actually in this case shows me the create table 00:42:58.390 --> 00:43:03.700 command that was automatically drawn for me by just doing this import line. 00:43:03.700 --> 00:43:07.450 Once I get more comfortable with SQL, I could literally type this out myself 00:43:07.450 --> 00:43:09.560 or use some program to generate that as well. 00:43:09.560 --> 00:43:11.290 But what it's creating for me is this. 00:43:11.290 --> 00:43:15.580 Create table if it doesn't exist, even though it's more terse than that. 00:43:15.580 --> 00:43:18.130 I want to create a table called favorites. 00:43:18.130 --> 00:43:20.590 And then the columns for that table are going 00:43:20.590 --> 00:43:24.880 to be timestamp, which is going to be text, comma, language, 00:43:24.880 --> 00:43:27.248 which is also going to be text, comma, problem, 00:43:27.248 --> 00:43:28.540 which is also going to be text. 00:43:28.540 --> 00:43:34.330 That was just inferred very trivially by the .import command to just figure out 00:43:34.330 --> 00:43:38.680 that, yes, just give me a three column database table based on the Google 00:43:38.680 --> 00:43:40.390 Form. 00:43:40.390 --> 00:43:41.740 Questions on this? 00:43:41.740 --> 00:43:44.140 These are commands you run once to get up and running. 00:43:44.140 --> 00:43:47.260 You don't run these commands frequently, but we have them 00:43:47.260 --> 00:43:49.280 on the slide just for reference. 00:43:49.280 --> 00:43:49.780 All right. 00:43:49.780 --> 00:43:52.010 So now let's do something a little more interesting. 00:43:52.010 --> 00:43:55.330 I'm going to clear my SQLite terminal here, but I'm still in SQLite. 00:43:55.330 --> 00:43:58.570 I'm going to now use some of my first SQL commands, which 00:43:58.570 --> 00:44:02.440 recall were among them select. 00:44:02.440 --> 00:44:05.522 So CRUD, C-R-U-D. The R was select. 00:44:05.522 --> 00:44:07.480 This is maybe the most common, the most useful, 00:44:07.480 --> 00:44:10.330 the most powerful thing to use with a SQL database 00:44:10.330 --> 00:44:13.000 selecting data to answer questions akin to the ones we 00:44:13.000 --> 00:44:14.980 were trying to answer with Python. 00:44:14.980 --> 00:44:16.270 This is the general syntax. 00:44:16.270 --> 00:44:19.900 Any time you want to select data from a SQL database, you literally say select. 00:44:19.900 --> 00:44:24.730 You then specify the column or columns that you want to select data from. 00:44:24.730 --> 00:44:28.450 You literally write the word from and then you specify the name of the table. 00:44:28.450 --> 00:44:32.030 You want to get that data from semicolon, in this case. 00:44:32.030 --> 00:44:35.200 Everything that's in capitals here is a SQL keyword. 00:44:35.200 --> 00:44:37.750 Strictly speaking, you don't have to capitalize things, 00:44:37.750 --> 00:44:40.150 but we would encourage you to do so stylistically. 00:44:40.150 --> 00:44:43.030 And especially as you're learning and even as you're writing it, 00:44:43.030 --> 00:44:47.530 it just helps to distinguish SQL from words you chose, like the names 00:44:47.530 --> 00:44:49.460 of the columns and the data therein. 00:44:49.460 --> 00:44:52.700 So do adopt early on this convention. 00:44:52.700 --> 00:44:56.240 So let me go back now to my code space here. 00:44:56.240 --> 00:44:59.012 I'm running my terminal window with SQLite3 inside of it. 00:44:59.012 --> 00:45:00.970 Suppose that I just want to get all of the data 00:45:00.970 --> 00:45:04.720 from the favorites table, which was automatically imported. 00:45:04.720 --> 00:45:05.480 Let's do this. 00:45:05.480 --> 00:45:06.220 Select. 00:45:06.220 --> 00:45:07.630 I want everything. 00:45:07.630 --> 00:45:13.090 Well, I can do timestamp comma language comma problem. 00:45:13.090 --> 00:45:13.840 But you know what? 00:45:13.840 --> 00:45:15.850 Here's a convenience already. 00:45:15.850 --> 00:45:19.120 If you want everything, there's what's called a wild card character 00:45:19.120 --> 00:45:21.460 in SQL, which is just a star, an asterisk, which 00:45:21.460 --> 00:45:25.120 means give me every column without my knowing even what they're called. 00:45:25.120 --> 00:45:28.400 Let me go ahead now and say from favorites semicolon. 00:45:28.400 --> 00:45:33.010 And this is the SQL way of opening the database, iterating over 00:45:33.010 --> 00:45:36.400 every row therein, printing out every row therein, done. 00:45:36.400 --> 00:45:39.640 So those three steps, which was like nine lines of Python code 00:45:39.640 --> 00:45:42.700 give or take earlier, is now one line of SQL. 00:45:42.700 --> 00:45:43.990 I hit Enter. 00:45:43.990 --> 00:45:46.370 There is all of the data. 00:45:46.370 --> 00:45:48.130 So I see now all of the data. 00:45:48.130 --> 00:45:50.230 Just output it as a CSV here. 00:45:50.230 --> 00:45:51.940 But it's not the CSV file. 00:45:51.940 --> 00:45:53.210 It's now actually the table. 00:45:53.210 --> 00:45:54.580 And in fact, just for good measure, let me 00:45:54.580 --> 00:45:58.000 do this, because you'll see the behavior a little differently the next time we 00:45:58.000 --> 00:45:58.690 open the file. 00:45:58.690 --> 00:46:00.730 I've just exited out of SQLite3. 00:46:00.730 --> 00:46:03.490 I'm going to rerun it, but I'm not going to reimport the data 00:46:03.490 --> 00:46:05.920 or do anything like that, because my file now exists. 00:46:05.920 --> 00:46:07.790 In fact, let me take one step back. 00:46:07.790 --> 00:46:11.830 If I type ls at my Linux prompt, there's my favorites.py from before. 00:46:11.830 --> 00:46:13.870 There's my favorites.csv from before. 00:46:13.870 --> 00:46:18.580 And here's a third file that I did create a moment ago when I first 00:46:18.580 --> 00:46:19.690 ran SQLite3. 00:46:19.690 --> 00:46:21.130 So the data is persistent. 00:46:21.130 --> 00:46:23.050 It's not using RAM or memory. 00:46:23.050 --> 00:46:24.950 Anything I do now is saved there. 00:46:24.950 --> 00:46:27.653 So let's go ahead and rerun SQLite3 with the same file. 00:46:27.653 --> 00:46:28.570 But I'm not going to-- 00:46:28.570 --> 00:46:31.130 I don't have to reimport everything, because the file already exists. 00:46:31.130 --> 00:46:32.800 Let me now do that same thing again. 00:46:32.800 --> 00:46:36.460 Select star from favorites to get all of the data. 00:46:36.460 --> 00:46:39.990 And what you'll see now is the same data, but it's a little prettier now. 00:46:39.990 --> 00:46:44.400 Because I reran it, I effectively disabled CSV mode this time. 00:46:44.400 --> 00:46:49.590 And what I'm now seeing is the entire contents of this database 00:46:49.590 --> 00:46:51.420 table called favorites. 00:46:51.420 --> 00:46:53.370 Now, there's nothing new here, but you're just 00:46:53.370 --> 00:46:55.980 seeing now like an ASCII or Unicode version of all 00:46:55.980 --> 00:46:57.988 of the same data from that database. 00:46:57.988 --> 00:47:00.030 Well, suppose I want to get a subset of the data. 00:47:00.030 --> 00:47:01.238 Well, let me clear my screen. 00:47:01.238 --> 00:47:04.530 And just like in Linux, I can Control L just to clean things up aesthetically. 00:47:04.530 --> 00:47:06.580 Suppose I want to get just the languages. 00:47:06.580 --> 00:47:09.540 So I could do select language from favorites. 00:47:09.540 --> 00:47:12.930 And this will now select not all three columns, a.k.a. 00:47:12.930 --> 00:47:17.280 star, this will only select the language column and all of the data therein. 00:47:17.280 --> 00:47:19.020 If I hit Enter, voila. 00:47:19.020 --> 00:47:21.150 Now I just see those there. 00:47:21.150 --> 00:47:22.990 No timestamps, no problems. 00:47:22.990 --> 00:47:25.470 It's just a slice of the table, if you will. 00:47:25.470 --> 00:47:30.010 All right, not that interesting still because it's just a big column of data. 00:47:30.010 --> 00:47:31.900 But now things get more interesting. 00:47:31.900 --> 00:47:35.850 It turns out in SQL that there are functions that come with this language, 00:47:35.850 --> 00:47:37.920 just like C, just like Python. 00:47:37.920 --> 00:47:41.790 In SQL, some of the more useful ones, some of the simpler ones, 00:47:41.790 --> 00:47:42.630 are these here. 00:47:42.630 --> 00:47:44.880 Average, count, distinct, lower, max, min, 00:47:44.880 --> 00:47:48.150 upper, which pretty much do what they say. 00:47:48.150 --> 00:47:50.533 And count is a particularly useful one. 00:47:50.533 --> 00:47:51.450 Let's start with that. 00:47:51.450 --> 00:47:55.740 It's a reasonable question to be asked how many people submitted the Google 00:47:55.740 --> 00:47:58.523 Form by the time I actually downloaded the CSV. 00:47:58.523 --> 00:48:00.190 Well, why don't we go ahead and do this? 00:48:00.190 --> 00:48:02.850 Let me go back to VS Code here in my terminal window. 00:48:02.850 --> 00:48:06.900 Let me select not star but the count of star. 00:48:06.900 --> 00:48:10.020 So give me the count of the rows that are being 00:48:10.020 --> 00:48:12.660 returned from the database called-- 00:48:12.660 --> 00:48:14.190 the database table called favorites. 00:48:14.190 --> 00:48:16.648 Now when I hit Enter, I'm not going to get all of the data. 00:48:16.648 --> 00:48:18.630 I'm just going to get simply a number. 00:48:18.630 --> 00:48:21.480 430 rows came back. 00:48:21.480 --> 00:48:22.740 So that's pretty good. 00:48:22.740 --> 00:48:24.580 I now know how much data is in there. 00:48:24.580 --> 00:48:26.350 Well, what languages were in there? 00:48:26.350 --> 00:48:30.520 Well, I could do select language from favorites just as before, 00:48:30.520 --> 00:48:33.270 but that's not that useful, especially if I'm inheriting the data. 00:48:33.270 --> 00:48:35.770 Like I'm the analyst who's been handed a data set by my boss 00:48:35.770 --> 00:48:37.440 and they want me to crunch some numbers. 00:48:37.440 --> 00:48:39.810 OK, I could load this into Excel. 00:48:39.810 --> 00:48:40.770 I could sort it. 00:48:40.770 --> 00:48:44.400 But you can use SQL now to answer pretty basic questions too. 00:48:44.400 --> 00:48:48.540 If you want to select the distinct languages in the data set because you 00:48:48.540 --> 00:48:51.000 weren't privy to the Google Form, let me go ahead 00:48:51.000 --> 00:48:56.820 and select only the distinct languages from the favorites table. 00:48:56.820 --> 00:49:00.330 And now I hit Enter and I get back a much more succinct answer. 00:49:00.330 --> 00:49:02.220 Just the three languages in question. 00:49:02.220 --> 00:49:04.963 Not really that useful since I created the Google Form, 00:49:04.963 --> 00:49:07.380 but certainly if you're inheriting data from someone else, 00:49:07.380 --> 00:49:09.390 you've just downloaded a data set, at least now 00:49:09.390 --> 00:49:12.190 I'm arguably wrapping my mind around what's going on. 00:49:12.190 --> 00:49:15.178 Now, this is not necessary for such a small data set, 00:49:15.178 --> 00:49:16.470 but I can combine these things. 00:49:16.470 --> 00:49:21.180 Select the count of the distinct languages in this data set 00:49:21.180 --> 00:49:22.140 called favorites. 00:49:22.140 --> 00:49:25.220 And now I should get back what answer? 00:49:25.220 --> 00:49:27.378 So hopefully indeed an answer called three. 00:49:27.378 --> 00:49:29.420 And what you're getting back notice aesthetically 00:49:29.420 --> 00:49:32.180 too is like a mini temporary table. 00:49:32.180 --> 00:49:36.890 When I asked for just the distinct languages, what SQL hands 00:49:36.890 --> 00:49:39.410 me back is this temporary table in memory 00:49:39.410 --> 00:49:43.340 that has one column called language and then three rows. 00:49:43.340 --> 00:49:45.210 Now, this is not saved anywhere. 00:49:45.210 --> 00:49:47.450 It's just executed ephemerally like this. 00:49:47.450 --> 00:49:49.580 But that's why it's depicted in this way. 00:49:49.580 --> 00:49:52.610 What you're getting is subsets of your data, smaller 00:49:52.610 --> 00:49:54.410 tables containing some of your data. 00:49:54.410 --> 00:49:55.550 And same thing down here. 00:49:55.550 --> 00:49:58.400 This is a crazy long column name. 00:49:58.400 --> 00:50:00.480 You can rename it if you really want. 00:50:00.480 --> 00:50:02.900 But that's all we're seeing there. 00:50:02.900 --> 00:50:07.100 And in fact, if that's a little ugly, we can actually alias these things. 00:50:07.100 --> 00:50:12.090 N is a common name for a variable, a number in any programming language. 00:50:12.090 --> 00:50:15.200 So I can actually alias this to be a column called n. 00:50:15.200 --> 00:50:16.160 Hit Enter. 00:50:16.160 --> 00:50:19.490 And now I'm getting a tiny, tiny table whose column is called 00:50:19.490 --> 00:50:23.280 n that just has the one value there. 00:50:23.280 --> 00:50:28.500 All right, questions on these application of these functions here? 00:50:31.950 --> 00:50:33.152 Questions, yeah. 00:50:33.152 --> 00:50:35.000 AUDIENCE: [INAUDIBLE] 00:50:35.000 --> 00:50:36.400 DAVID MALAN: Say a little louder. 00:50:36.400 --> 00:50:37.280 AUDIENCE: A-S. 00:50:37.280 --> 00:50:38.475 DAVID MALAN: Oh, A-S. As. 00:50:38.475 --> 00:50:39.350 Literally in English. 00:50:39.350 --> 00:50:42.650 So rename this column as this. 00:50:42.650 --> 00:50:45.317 Technically it creates an alias for the column. 00:50:45.317 --> 00:50:45.900 So that's all. 00:50:45.900 --> 00:50:46.818 Yeah? 00:50:46.818 --> 00:50:49.955 AUDIENCE: [INAUDIBLE] 00:50:49.955 --> 00:50:50.830 DAVID MALAN: Exactly. 00:50:50.830 --> 00:50:53.530 Distinct will operate on whatever you hand it in parentheses 00:50:53.530 --> 00:50:57.460 and get rid of all of the duplicates, giving you back just the uniques. 00:50:57.460 --> 00:50:58.360 Correct. 00:50:58.360 --> 00:50:59.780 Other questions here? 00:50:59.780 --> 00:51:00.524 Yeah. 00:51:00.524 --> 00:51:02.944 AUDIENCE: [INAUDIBLE] 00:51:06.935 --> 00:51:08.060 DAVID MALAN: Good question. 00:51:08.060 --> 00:51:10.355 When you define an alias like n, which I just did, 00:51:10.355 --> 00:51:12.230 does it become like a variable you can reuse? 00:51:12.230 --> 00:51:16.580 Short answer, no in this case, but you can reuse it within your same query. 00:51:16.580 --> 00:51:20.630 Even though these queries are getting a little longer, admittedly, 00:51:20.630 --> 00:51:23.270 statements that they are, you can actually 00:51:23.270 --> 00:51:26.090 reuse n in an even longer query. 00:51:26.090 --> 00:51:27.440 So later in your query. 00:51:27.440 --> 00:51:30.380 And we'll see a few that are going to start to grow in length. 00:51:30.380 --> 00:51:33.050 So it's a nice way of nicknaming things just to be 00:51:33.050 --> 00:51:35.190 a little more terse in your query. 00:51:35.190 --> 00:51:38.480 So we can transition to some of these more sophisticated queries 00:51:38.480 --> 00:51:41.360 because it turns out there are some other techniques we 00:51:41.360 --> 00:51:42.630 can introduce as well. 00:51:42.630 --> 00:51:44.660 Here are some other keywords in SQL. 00:51:44.660 --> 00:51:47.005 And again, even though this is another list of things, 00:51:47.005 --> 00:51:49.130 there's only four things fundamentally we're doing. 00:51:49.130 --> 00:51:51.600 Creating, reading, updating, and deleting data. 00:51:51.600 --> 00:51:55.580 These are just allowing us to fine tune how we do it exactly. 00:51:55.580 --> 00:51:59.210 So where is going to allow us to filter data, as we'll do in just a moment. 00:51:59.210 --> 00:52:03.440 Like select data where this conditional is true. 00:52:03.440 --> 00:52:06.260 Like is going to be an alternative to an equal sign. 00:52:06.260 --> 00:52:11.210 So instead of looking for exactly Scratch or exactly Python or exactly C, 00:52:11.210 --> 00:52:13.790 you can look for something like dot dot dot 00:52:13.790 --> 00:52:16.770 and it can be a little bit of a fuzzier match, if you will, 00:52:16.770 --> 00:52:18.020 with other characters as well. 00:52:18.020 --> 00:52:19.790 Order by is going to deal with sorting. 00:52:19.790 --> 00:52:23.360 Limit is going to just let me limit the total number of rows that come back 00:52:23.360 --> 00:52:26.330 to 1 or 10 or finite if I don't want to see 00:52:26.330 --> 00:52:28.375 all 400 plus rows all at once, because I'm just 00:52:28.375 --> 00:52:29.750 trying to wrap my mind around it. 00:52:29.750 --> 00:52:32.207 And group by is best shown by example. 00:52:32.207 --> 00:52:34.290 So let's play with just a couple of these as well. 00:52:34.290 --> 00:52:36.080 Let me go back to VS Code here. 00:52:36.080 --> 00:52:37.170 I'll clear my screen. 00:52:37.170 --> 00:52:39.530 I'm still in the same SQLite instance. 00:52:39.530 --> 00:52:44.040 And let's count how many of you liked C without writing Python code as before. 00:52:44.040 --> 00:52:47.960 So let me go ahead and select the count of the rows 00:52:47.960 --> 00:52:54.530 from favorites where the language in each row equals C. 00:52:54.530 --> 00:52:58.100 And the convention in SQLite is to use single quotes any time 00:52:58.100 --> 00:53:02.810 you're surrounding a string that's meant to represent a literal piece of text 00:53:02.810 --> 00:53:06.720 as opposed to C, which was double quotes, or Python, which was either. 00:53:06.720 --> 00:53:11.090 So this is selecting the count of rows from favorites table where 00:53:11.090 --> 00:53:13.730 the language in question is C. Enter. 00:53:13.730 --> 00:53:15.410 And this gives me 98. 00:53:15.410 --> 00:53:19.190 Notice, though, if I omit that predicate like we did before, 00:53:19.190 --> 00:53:23.850 you'll get back the total number of rows that were in the table. 00:53:23.850 --> 00:53:26.030 So where is what's called a predicate that just 00:53:26.030 --> 00:53:30.020 allows me to filter things just like an if condition or the like in a language 00:53:30.020 --> 00:53:31.130 that we've seen before. 00:53:31.130 --> 00:53:34.670 You can be a little more specific like how many people really liked 00:53:34.670 --> 00:53:38.870 C and the Mario problem specifically? 00:53:38.870 --> 00:53:40.440 Well, let's do this. 00:53:40.440 --> 00:53:45.830 Let's go ahead and do select the number of rows from the favorites table 00:53:45.830 --> 00:53:49.640 where the language is C and. 00:53:49.640 --> 00:53:52.880 So it's still literally the word ands and or, just 00:53:52.880 --> 00:53:57.830 like in Python but not like in C. And problem equals Mario. 00:53:57.830 --> 00:54:01.520 So let's see if there's any fans of both C and the Mario problem. 00:54:01.520 --> 00:54:05.638 And three of us really like those two things together in this case. 00:54:05.638 --> 00:54:06.930 All right, what else can we do? 00:54:06.930 --> 00:54:10.280 Well, more compelling might be to see, kind of like in Python, 00:54:10.280 --> 00:54:13.580 for each language, what was the popularity thereof? 00:54:13.580 --> 00:54:15.560 And at the moment, we don't really have a way 00:54:15.560 --> 00:54:17.960 of doing that except in Python where we had the loop 00:54:17.960 --> 00:54:20.710 and we had those variables and the dictionary that did all of that 00:54:20.710 --> 00:54:21.410 counting for us. 00:54:21.410 --> 00:54:24.710 Totally doable but tedious, especially if your job is to analyze data. 00:54:24.710 --> 00:54:28.190 My God, even writing 15 lines of code to answer simple questions 00:54:28.190 --> 00:54:29.390 is kind of ridiculous. 00:54:29.390 --> 00:54:31.560 SQL can do better for us. 00:54:31.560 --> 00:54:33.630 So let me go ahead and do this. 00:54:33.630 --> 00:54:38.540 Let me go ahead and select every language and the count 00:54:38.540 --> 00:54:45.650 thereof from the favorites table but this time group by language. 00:54:45.650 --> 00:54:48.440 So this was another one of the keywords that we 00:54:48.440 --> 00:54:52.460 can use in this abbreviated list of extra features of SQL. 00:54:52.460 --> 00:54:55.760 And this one takes a moment to wrap your mind around, 00:54:55.760 --> 00:54:59.510 but this is going to give me a two column temporary table where 00:54:59.510 --> 00:55:02.030 the first column is a language and the second column 00:55:02.030 --> 00:55:05.330 is the count thereof from this data set. 00:55:05.330 --> 00:55:10.010 And group by language just means that only show me Scratch once, 00:55:10.010 --> 00:55:12.350 only show me C once, only show me Python once. 00:55:12.350 --> 00:55:16.130 That is group all of the identical values together, 00:55:16.130 --> 00:55:19.260 but keep track of how many of them there are. 00:55:19.260 --> 00:55:27.200 And so now if I go over to SQLite and I hit Enter, now I have in SQL version 00:55:27.200 --> 00:55:29.720 the exact same output that I had from Python 00:55:29.720 --> 00:55:32.120 that took me, what, 15 plus lines before. 00:55:32.120 --> 00:55:35.810 Now we're down to just one because SQL, Structured Query Language, 00:55:35.810 --> 00:55:39.830 is all about constructing queries like this to answer questions and get back 00:55:39.830 --> 00:55:41.152 answers quickly. 00:55:41.152 --> 00:55:42.860 If we want to clean this up a little bit, 00:55:42.860 --> 00:55:44.960 you asked earlier about sorting order. 00:55:44.960 --> 00:55:46.200 Well, we can do that too. 00:55:46.200 --> 00:55:48.930 There's another key phrase we can use here. 00:55:48.930 --> 00:55:55.080 We can order by the count of those rows and then run that query here. 00:55:55.080 --> 00:55:57.480 So now unfortunately they're from smallest to biggest, 00:55:57.480 --> 00:55:58.675 but we can reverse that. 00:55:58.675 --> 00:56:00.800 It turns out, and my query's starting to wrap here. 00:56:00.800 --> 00:56:02.360 I'll zoom out for a moment. 00:56:02.360 --> 00:56:07.050 If you want to order by count, the default is in ascending order, 00:56:07.050 --> 00:56:10.980 abbreviated A-S-C. If you want to reverse the sort in SQL, 00:56:10.980 --> 00:56:13.920 instead of using reverse equals true like we did in Python, 00:56:13.920 --> 00:56:16.380 you say D-E-S-C for descending order. 00:56:16.380 --> 00:56:19.557 And now we get almost the same output but flipped in reverse. 00:56:19.557 --> 00:56:22.140 So it's just a lot faster to answer questions once, of course, 00:56:22.140 --> 00:56:24.660 you get some muscle memory and some comfort with it. 00:56:24.660 --> 00:56:26.280 Well, what else can I do? 00:56:26.280 --> 00:56:28.890 What if I just care about the most popular language? 00:56:28.890 --> 00:56:31.140 I don't care about the second place or the third place 00:56:31.140 --> 00:56:32.460 languages or anything else. 00:56:32.460 --> 00:56:34.830 Well, let me add one more clause here. 00:56:34.830 --> 00:56:36.750 Limit the answer to one. 00:56:36.750 --> 00:56:39.330 And no matter how many rows should come back now, 00:56:39.330 --> 00:56:41.880 I just get the number one language as of the data 00:56:41.880 --> 00:56:47.870 set we collected with 270 votes for it. 00:56:47.870 --> 00:56:52.060 Questions on this? 00:56:52.060 --> 00:56:55.260 Any questions here? 00:56:55.260 --> 00:56:55.800 No? 00:56:55.800 --> 00:56:58.620 Well, what if we're starting to introduce SQL 00:56:58.620 --> 00:57:02.250 and it was kind of too late to make it into the Google Form? 00:57:02.250 --> 00:57:04.260 So it turns out there's syntax for this too. 00:57:04.260 --> 00:57:05.670 You can create data, of course. 00:57:05.670 --> 00:57:09.240 Not just the tables, but the data therein and here's the typical syntax 00:57:09.240 --> 00:57:11.640 for inserting data into a SQL database. 00:57:11.640 --> 00:57:14.850 You literally say insert into the name of the table. 00:57:14.850 --> 00:57:19.020 And then in parentheses, you specify one or more columns for which you 00:57:19.020 --> 00:57:20.940 have values that you want to insert. 00:57:20.940 --> 00:57:22.950 This is to say you don't have to give values 00:57:22.950 --> 00:57:25.650 for every column in the given row. 00:57:25.650 --> 00:57:27.990 If you only have answers to some of those questions, 00:57:27.990 --> 00:57:29.940 you can enumerate them here like this. 00:57:29.940 --> 00:57:32.010 But the values you insert are going to be these. 00:57:32.010 --> 00:57:35.500 So you literally say after the close parenthesis values. 00:57:35.500 --> 00:57:38.430 And then in a second set of parentheses with a same length 00:57:38.430 --> 00:57:42.370 comma separated list, you specify what values do you want to insert. 00:57:42.370 --> 00:57:43.710 So it's a little verbose. 00:57:43.710 --> 00:57:47.520 And frankly longer term, you're going to use Python code to automatically do 00:57:47.520 --> 00:57:50.010 these kinds of insertions, but let's go ahead and try this. 00:57:50.010 --> 00:57:57.990 Right now if I do select distinct language from favorites, again, 00:57:57.990 --> 00:57:58.833 we see this. 00:57:58.833 --> 00:58:00.000 Just these three candidates. 00:58:00.000 --> 00:58:01.870 But we've now taught you a bit of SQL. 00:58:01.870 --> 00:58:09.120 So let's do insert into favorites the column called language. 00:58:09.120 --> 00:58:10.620 And you know what? 00:58:10.620 --> 00:58:12.150 I'm going to give a problem here. 00:58:12.150 --> 00:58:14.430 The values for which, and let me zoom back out, 00:58:14.430 --> 00:58:17.910 are going to be quote unquote "SQL" and quote, unquote "fiftyville." 00:58:17.910 --> 00:58:20.130 You'll soon see what that's all about. 00:58:20.130 --> 00:58:21.240 Semicolon. 00:58:21.240 --> 00:58:24.580 Nothing seems to happen, but that's usually a good thing. 00:58:24.580 --> 00:58:26.580 And now if I scroll back up in my queries, 00:58:26.580 --> 00:58:30.990 in SQLite3 you can scroll back and forth in time to avoid retyping things, 00:58:30.990 --> 00:58:35.400 now I should see indeed four candidate languages here. 00:58:35.400 --> 00:58:39.150 Now, suppose that you were never really a fan of C and maybe you programmed 00:58:39.150 --> 00:58:42.990 a little bit in high school or in the real world and you liked C++. 00:58:42.990 --> 00:58:50.010 Well, there's a whole lot of answers for C. So select star from favorites 00:58:50.010 --> 00:58:53.430 where language equals quote unquote "C." 00:58:53.430 --> 00:58:58.020 So here's everyone who submitted the answer for C. Let's presume that, no, 00:58:58.020 --> 00:58:59.850 they didn't really want C, they wanted C++, 00:58:59.850 --> 00:59:01.767 which is not a language we teach in the class. 00:59:01.767 --> 00:59:03.910 But I could also now do this. 00:59:03.910 --> 00:59:08.670 You can use the update command to set a column or columns to different values 00:59:08.670 --> 00:59:10.210 where some condition is met. 00:59:10.210 --> 00:59:15.360 So if I do update table name set column name equal to some value 00:59:15.360 --> 00:59:19.330 filtering it perhaps by where some condition is true. 00:59:19.330 --> 00:59:22.890 So suppose I've changed my mind, or you know what, 00:59:22.890 --> 00:59:31.320 let's go ahead and do update favorites set language equal to maybe C++ where 00:59:31.320 --> 00:59:33.870 language equals C. 00:59:33.870 --> 00:59:35.910 Now, this is destructive, so you generally 00:59:35.910 --> 00:59:38.368 don't want to do this unless you have a backup of your data 00:59:38.368 --> 00:59:40.170 too, overriding what people's answers are. 00:59:40.170 --> 00:59:42.753 This seems to have been successful, because no error messages. 00:59:42.753 --> 00:59:45.270 And if I rerun the previous select that gives me 00:59:45.270 --> 00:59:49.410 all of the favorites where language equals C, now indeed I get none. 00:59:49.410 --> 00:59:53.127 But if I search for C++, now I get a lot. 00:59:53.127 --> 00:59:55.710 And if I get rid of that where clause altogether and just look 00:59:55.710 --> 01:00:00.330 at the contents of my database, now you see that indeed C++ is comingled with 01:00:00.330 --> 01:00:01.140 all the other data. 01:00:01.140 --> 01:00:04.270 This is not what you all intended, of course, so I can undo this. 01:00:04.270 --> 01:00:06.150 Let me go ahead and undo what I just did. 01:00:06.150 --> 01:00:12.570 Let me set my favorite language to C where language equals C++. 01:00:12.570 --> 01:00:14.460 But the predicate is important. 01:00:14.460 --> 01:00:16.230 This I'm not going to do. 01:00:16.230 --> 01:00:20.190 What if I accidentally omitted this predicate, the where clause? 01:00:20.190 --> 01:00:22.545 How would that screw things up might you think? 01:00:26.160 --> 01:00:26.970 Yeah, in the back. 01:00:26.970 --> 01:00:29.110 AUDIENCE: [INAUDIBLE] 01:00:29.110 --> 01:00:32.590 DAVID MALAN: It would set every row's language to indeed C. 01:00:32.590 --> 01:00:34.060 And this is dangerous. 01:00:34.060 --> 01:00:37.720 And if you start googling around for SQL mistakes or the like, 01:00:37.720 --> 01:00:41.590 people in the real world have accidentally run commands like this. 01:00:41.590 --> 01:00:43.510 And without naming names, a former member 01:00:43.510 --> 01:00:47.200 of our teaching staff at one point accidentally ran a command like this 01:00:47.200 --> 01:00:51.610 and changed every student's name in our database to Bobby I think it was. 01:00:51.610 --> 01:00:55.120 The same name for every row because they simply forgot a predicate. 01:00:55.120 --> 01:00:58.240 So here too there's dangers in code, and you 01:00:58.240 --> 01:01:00.730 should adopt the habit quite quickly of always, 01:01:00.730 --> 01:01:03.730 one, backing up your data like with CP, for instance, 01:01:03.730 --> 01:01:09.010 in Linux or any other technique or just making sure before you hit Enter that, 01:01:09.010 --> 01:01:11.322 yes, this is indeed the query I want to execute. 01:01:11.322 --> 01:01:13.030 And generally speaking in the real world, 01:01:13.030 --> 01:01:14.890 there should be process controls in place. 01:01:14.890 --> 01:01:17.770 Like the intern should not have access to the production 01:01:17.770 --> 01:01:19.570 database, the live database, and the like. 01:01:19.570 --> 01:01:21.860 But you have a lot of power now with these queries. 01:01:21.860 --> 01:01:25.520 So just be all the more careful, because very easily can you do bad things. 01:01:25.520 --> 01:01:26.660 So let me undo this. 01:01:26.660 --> 01:01:30.790 Where language equals quote unquote "C++." 01:01:30.790 --> 01:01:31.840 And I'll zoom back out. 01:01:31.840 --> 01:01:32.470 Enter. 01:01:32.470 --> 01:01:34.480 And now I think we're back in business. 01:01:34.480 --> 01:01:36.100 C is among the answers. 01:01:36.100 --> 01:01:37.520 Yeah? 01:01:37.520 --> 01:01:40.340 AUDIENCE: [INAUDIBLE] 01:01:42.293 --> 01:01:44.460 DAVID MALAN: It's essentially doing what at the end? 01:01:44.460 --> 01:01:45.270 AUDIENCE: [INAUDIBLE] 01:01:45.270 --> 01:01:46.680 DAVID MALAN: It's essentially find and replace. 01:01:46.680 --> 01:01:47.180 Yes. 01:01:47.180 --> 01:01:51.060 In layperson's terms, this is find and replace implemented with SQL. 01:01:51.060 --> 01:01:55.020 And in fact, the authors of Microsoft Word or Google Docs 01:01:55.020 --> 01:01:58.290 might very well be using language like this SQL 01:01:58.290 --> 01:02:01.710 when you go to the nice graphical user friendly find and replace box. 01:02:01.710 --> 01:02:05.010 This may very well be what they're doing underneath the hood or, of course, 01:02:05.010 --> 01:02:07.770 they could be using some other language altogether. 01:02:07.770 --> 01:02:10.530 There's one last syntax that's worth knowing, 01:02:10.530 --> 01:02:14.280 delete, which for better or for worse is even more destructive whereby it 01:02:14.280 --> 01:02:16.590 allows you to delete rows from tables. 01:02:16.590 --> 01:02:20.130 It's distinct from drop, which lets you delete tables themselves. 01:02:20.130 --> 01:02:21.630 This focuses on rows. 01:02:21.630 --> 01:02:26.250 So suppose that you really, really didn't like, 01:02:26.250 --> 01:02:30.000 let's say, Tideman was a little challenging if you 01:02:30.000 --> 01:02:31.840 tackled that more comfortable problem. 01:02:31.840 --> 01:02:34.800 So if you really don't want to even think about Tideman anymore, 01:02:34.800 --> 01:02:40.080 so why don't we do delete from favorites where problem equals, and I 01:02:40.080 --> 01:02:42.240 won't execute it for real, Tideman. 01:02:42.240 --> 01:02:45.360 This would have the effect of deleting every row, including 01:02:45.360 --> 01:02:50.010 the language therein and the timestamp, where the student answered Tideman. 01:02:50.010 --> 01:02:53.910 Worse than this would be this. 01:02:53.910 --> 01:02:55.125 Why might this be bad? 01:02:58.560 --> 01:03:00.930 OK, chuckling because there's no predicate. 01:03:00.930 --> 01:03:04.960 There's no filter, which means literally this would delete all of the data. 01:03:04.960 --> 01:03:07.740 So again, with great power here comes great responsibility. 01:03:07.740 --> 01:03:12.390 Now, this has just been a data set of 430 rows by us dynamically created. 01:03:12.390 --> 01:03:15.240 There's, of course, some really juicy data sets in the real world. 01:03:15.240 --> 01:03:17.990 And one website you might have heard or an app you might have used 01:03:17.990 --> 01:03:20.460 is IMDb, the Internet Movie Database, which wonderfully 01:03:20.460 --> 01:03:24.330 makes some of their data available for download as CSV files 01:03:24.330 --> 01:03:27.420 or technically TSV files, Tab Separated Values. 01:03:27.420 --> 01:03:31.830 But what we did in advance of class was download some of that data for both TV 01:03:31.830 --> 01:03:34.170 shows in the real world and movies in the real world. 01:03:34.170 --> 01:03:35.962 And what's wonderful about this data set is 01:03:35.962 --> 01:03:39.630 it's not just dozens or hundreds or even thousands of lines. 01:03:39.630 --> 01:03:44.190 There are millions of rows of juicy data, TV shows and movies 01:03:44.190 --> 01:03:47.190 with which most folks are probably familiar, at least with a subset. 01:03:47.190 --> 01:03:50.580 And we'll see in just a little bit that this data 01:03:50.580 --> 01:03:54.600 comes in the form of now six different tables that we've given you. 01:03:54.600 --> 01:03:56.610 And the tables in question for today are going 01:03:56.610 --> 01:04:00.735 to be the people in the TV business, the stars therein, the shows that people 01:04:00.735 --> 01:04:01.860 are producing and the like. 01:04:01.860 --> 01:04:04.860 This is a picture we'll revisit to enable 01:04:04.860 --> 01:04:07.110 you to wrap your minds around what the actual data is. 01:04:07.110 --> 01:04:09.948 This feels like a good opportunity though for a snack. 01:04:09.948 --> 01:04:12.990 In fact, in just a moment, we have a whole lot of Rice Krispie treats out 01:04:12.990 --> 01:04:13.690 in the lobby. 01:04:13.690 --> 01:04:18.630 But if folks could perhaps acknowledge this mini wedding cake here. 01:04:18.630 --> 01:04:21.640 CS50 zone Carter Zenke is getting married this week. 01:04:21.640 --> 01:04:28.880 So congratulations to Carter as well. 01:04:28.880 --> 01:04:30.550 Congrats. 01:04:30.550 --> 01:04:31.940 All right. 01:04:31.940 --> 01:04:34.440 [APPLAUSE] 01:04:38.115 --> 01:04:40.952 There's only one piece of cake in that box but a lot of Rice Krispie 01:04:40.952 --> 01:04:41.910 treats in the transept. 01:04:41.910 --> 01:04:45.420 Let's take 10 minutes and we'll be back with Internet Movie Database in 10. 01:04:45.420 --> 01:04:46.470 All right. 01:04:46.470 --> 01:04:47.320 We are back. 01:04:47.320 --> 01:04:51.000 So if you've never been, you can actually go to imdb.com right now 01:04:51.000 --> 01:04:53.020 and play around or download the mobile app. 01:04:53.020 --> 01:04:56.580 And it's just big database of a lot of TV shows and movies 01:04:56.580 --> 01:04:58.090 and actors and the like. 01:04:58.090 --> 01:05:00.828 But what indeed is nice is you can download some of that data. 01:05:00.828 --> 01:05:02.370 And that's what I've done in advance. 01:05:02.370 --> 01:05:05.160 And what we've done is we wrote some Python code 01:05:05.160 --> 01:05:09.540 to convert some of the flat file databases that they let you download 01:05:09.540 --> 01:05:13.890 and we converted it into a SQL database with six tables. 01:05:13.890 --> 01:05:17.940 So not just one but six that ultimately are these here. 01:05:17.940 --> 01:05:21.090 And let me just help you wrap your minds around what this picture is, which 01:05:21.090 --> 01:05:25.410 is an entity relationship diagram, which is just to say each of these boxes 01:05:25.410 --> 01:05:27.090 on the screen represents a table. 01:05:27.090 --> 01:05:31.230 And each of the arrows or edges represents some kind of relationship 01:05:31.230 --> 01:05:32.100 across the tables. 01:05:32.100 --> 01:05:34.200 Because up until now, the only data we had 01:05:34.200 --> 01:05:36.270 were those three columns in the favorites table. 01:05:36.270 --> 01:05:39.130 But what gets really useful about SQL databases, 01:05:39.130 --> 01:05:41.700 just like a Google Spreadsheet or an Excel file, 01:05:41.700 --> 01:05:45.300 is you can have multiple sheets or in a database multiple tables. 01:05:45.300 --> 01:05:50.010 And so what we're about to see is that in the IMDb database for TV shows, 01:05:50.010 --> 01:05:53.820 there's going to be a dedicated table for all the people in the TV business. 01:05:53.820 --> 01:05:57.210 There's going to be a dedicated table for all of the TV shows 01:05:57.210 --> 01:05:59.130 that are in their database as of right now. 01:05:59.130 --> 01:06:03.180 There's going to be a dedicated table for writers in that industry, 01:06:03.180 --> 01:06:08.580 for the ratings of shows, for the genres to which shows belong, 01:06:08.580 --> 01:06:09.700 comedy and the like. 01:06:09.700 --> 01:06:12.310 And then lastly, there's going to be this table, 01:06:12.310 --> 01:06:17.700 which somehow associates people with the TV shows that they star in 01:06:17.700 --> 01:06:19.030 and vice versa. 01:06:19.030 --> 01:06:22.290 And so let's consider first what this looks like in code. 01:06:22.290 --> 01:06:25.110 And we'll see that it's going to overwhelm intentionally at first, 01:06:25.110 --> 01:06:26.200 but I'm going to do this. 01:06:26.200 --> 01:06:27.690 I'm going to go back to my terminal window. 01:06:27.690 --> 01:06:30.240 And during the break, I downloaded from the course's website 01:06:30.240 --> 01:06:33.960 a file called shows.db, which we made in advance for you. 01:06:33.960 --> 01:06:37.440 And if I type ls, I'll see all of my favorites files from before. 01:06:37.440 --> 01:06:40.020 The CSV, the DB, and the Python file. 01:06:40.020 --> 01:06:42.390 But now they're shows.db. 01:06:42.390 --> 01:06:45.540 So I'm going to go ahead in my full screen terminal window here. 01:06:45.540 --> 01:06:47.850 I'm not using actual tabs or code files. 01:06:47.850 --> 01:06:52.380 Now I'm going to run sqlite3 on the file called shows.db. 01:06:52.380 --> 01:06:54.900 And I'm just going to see this version information here. 01:06:54.900 --> 01:06:57.120 Let me clear my screen and run the one command 01:06:57.120 --> 01:07:01.170 I ran earlier to show us the schema of the favorites database. 01:07:01.170 --> 01:07:03.947 Now we'll see the schema for the shows database. 01:07:03.947 --> 01:07:06.030 And there's a lot going on here, but let me scroll 01:07:06.030 --> 01:07:08.070 back up to the very top, the beginning. 01:07:08.070 --> 01:07:10.570 And we see this here. 01:07:10.570 --> 01:07:15.690 So when I run .schema, we see a dump, really, 01:07:15.690 --> 01:07:20.310 of all of the SQL create table commands that were run in order to create this 01:07:20.310 --> 01:07:21.330 database for you. 01:07:21.330 --> 01:07:24.420 And one of those tables is called genres and another people, 01:07:24.420 --> 01:07:27.370 ratings, shows, stars, and so forth. 01:07:27.370 --> 01:07:29.340 And the columns therein, even though it's 01:07:29.340 --> 01:07:32.670 formatted a little more prettily than the automatically generated create 01:07:32.670 --> 01:07:37.680 table statement for favorites whereby we have one column per line of output 01:07:37.680 --> 01:07:41.520 here in the, for instance, people table, there's 01:07:41.520 --> 01:07:45.900 going to be an ID column, like a unique identifier like a Harvard ID, a Yale 01:07:45.900 --> 01:07:50.730 ID or the like, a name column, a birth year, and then some other stuff. 01:07:50.730 --> 01:07:53.760 If I scroll down to shows, every show in the world 01:07:53.760 --> 01:07:58.410 is going to have a unique ID as well, a title of course, the year in which it 01:07:58.410 --> 01:08:00.390 debuted, and the total number of episodes 01:08:00.390 --> 01:08:02.370 as of the time we downloaded the data. 01:08:02.370 --> 01:08:04.080 And then what else is there? 01:08:04.080 --> 01:08:08.170 Some of these are a little less obvious like ratings here. 01:08:08.170 --> 01:08:10.830 So ratings don't have an ID column, but they 01:08:10.830 --> 01:08:15.038 have a show ID column and a rating like on a five point scale or a 10 point 01:08:15.038 --> 01:08:16.830 scale or the like and then the total number 01:08:16.830 --> 01:08:19.319 of votes that were collected to contribute to that rating. 01:08:19.319 --> 01:08:23.680 IMDb allows people to upvote and downvote shows and movies and the like. 01:08:23.680 --> 01:08:26.380 And then similarly is genre structured. 01:08:26.380 --> 01:08:29.290 There's a show ID and then there's a genre, 01:08:29.290 --> 01:08:33.370 which is going to be an English word like comedy or drama or something else. 01:08:33.370 --> 01:08:34.200 And then what else? 01:08:34.200 --> 01:08:38.609 Let's go a little further at the bottom here for stars and writers. 01:08:38.609 --> 01:08:41.729 If we go to the very bottom here, stars and writers 01:08:41.729 --> 01:08:43.470 are similarly structured too. 01:08:43.470 --> 01:08:46.890 They have a show ID and a person ID. 01:08:46.890 --> 01:08:48.540 So show and person. 01:08:48.540 --> 01:08:52.470 And then this writers table has a show ID and a person ID. 01:08:52.470 --> 01:08:56.340 And there's a whole lot of other words that we'll come to in just a moment. 01:08:56.340 --> 01:08:58.330 But what is this code hinting at? 01:08:58.330 --> 01:09:01.000 Well, if I go back to the picture from earlier here, 01:09:01.000 --> 01:09:05.160 you'll see that this picture captures the relationships 01:09:05.160 --> 01:09:07.020 among these various tables. 01:09:07.020 --> 01:09:11.670 So for instance, if we focus on shows for just a moment, a show, 01:09:11.670 --> 01:09:15.359 again, has a unique ID, a title, a year in which it debuted, 01:09:15.359 --> 01:09:16.859 and a total number of episodes. 01:09:16.859 --> 01:09:20.550 If you want to figure out what genre or genres 01:09:20.550 --> 01:09:24.300 a show belongs to, because some shows are just comedies, 01:09:24.300 --> 01:09:29.010 some shows are just dramas, but some shows are arguably comedies and dramas 01:09:29.010 --> 01:09:30.700 depending on the episode or the like. 01:09:30.700 --> 01:09:34.740 So you can imagine wanting to associate two or three or even more genres 01:09:34.740 --> 01:09:35.640 with a show. 01:09:35.640 --> 01:09:39.830 This line here in this second table allows us to do that. 01:09:39.830 --> 01:09:46.310 Every row in the genres table we'll see has two items, a show ID 01:09:46.310 --> 01:09:49.250 which relates to the ID of a show. 01:09:49.250 --> 01:09:54.350 And that's why these lines literally line up with that specific column name. 01:09:54.350 --> 01:09:57.050 And a genre, which is going to be like, quote unquote, "comedy," 01:09:57.050 --> 01:09:59.250 quote unquote "drama," or something else. 01:09:59.250 --> 01:10:02.180 Now with that said, design question. 01:10:02.180 --> 01:10:07.250 Why have we deliberately not just gotten rid of this genres table 01:10:07.250 --> 01:10:12.950 and made our lives simpler by just adding a genre column to this show's 01:10:12.950 --> 01:10:13.640 table? 01:10:13.640 --> 01:10:17.150 And again, a table is just like a sheet with rows and columns. 01:10:17.150 --> 01:10:21.140 At the moment, shows only have four columns, ID, title, year, episodes. 01:10:21.140 --> 01:10:28.980 Why not just add a fifth column called genre and put the show's genre there? 01:10:28.980 --> 01:10:32.200 Any intuition here? 01:10:32.200 --> 01:10:34.960 Why not just keep things simple? 01:10:34.960 --> 01:10:36.286 Yeah, in back. 01:10:36.286 --> 01:10:42.035 AUDIENCE: [INAUDIBLE] 01:10:42.035 --> 01:10:42.910 DAVID MALAN: Exactly. 01:10:42.910 --> 01:10:46.670 If you add a fifth column here and call it genre, 01:10:46.670 --> 01:10:49.900 then you have to pick a genre specifically. 01:10:49.900 --> 01:10:55.840 You have to put in that cell presumably comedy or drama or musical or something 01:10:55.840 --> 01:10:56.480 else. 01:10:56.480 --> 01:10:58.270 Now, you could write multiple words in the cell, 01:10:58.270 --> 01:11:00.645 but generally speaking, that would be sloppy, bad design. 01:11:00.645 --> 01:11:02.492 Like every cell just like in a spreadsheet 01:11:02.492 --> 01:11:03.700 should really have one value. 01:11:03.700 --> 01:11:07.270 It might have multiple words, but it shouldn't be a weirdly comma separated 01:11:07.270 --> 01:11:08.530 list of multiple things. 01:11:08.530 --> 01:11:10.655 It should just be in a different cell in that case. 01:11:10.655 --> 01:11:14.320 So if you instead were to design this with just a single column called genre, 01:11:14.320 --> 01:11:16.510 you're imposing what a computer scientist 01:11:16.510 --> 01:11:19.480 would call a one to one relationship. 01:11:19.480 --> 01:11:21.580 Every show has one genre. 01:11:21.580 --> 01:11:23.800 And that's not necessarily a good thing. 01:11:23.800 --> 01:11:26.170 Or strictly speaking, it would be a many to one, 01:11:26.170 --> 01:11:28.510 because the same genre could belong to multiple shows, 01:11:28.510 --> 01:11:31.760 but each show could only have one genre in that case. 01:11:31.760 --> 01:11:33.760 What a relational database allows you to do, 01:11:33.760 --> 01:11:35.890 and relational is indeed the operative word, 01:11:35.890 --> 01:11:39.100 it allows you to factor out some of your information 01:11:39.100 --> 01:11:44.590 and then have maybe one show here in one row but then in this genres table, 01:11:44.590 --> 01:11:47.320 you could have one row for that one show genre, 01:11:47.320 --> 01:11:52.450 or you could have two rows in the genres table for comedy and for drama. 01:11:52.450 --> 01:11:55.490 Or if it has a third genre, you could just add another row here. 01:11:55.490 --> 01:11:59.380 So you still have one row for the show itself 01:11:59.380 --> 01:12:02.890 with all the juiciest details but a variable number of rows 01:12:02.890 --> 01:12:05.950 by having this relationship with another table. 01:12:05.950 --> 01:12:10.000 Meanwhile, ratings work the same way, at least in this case. 01:12:10.000 --> 01:12:12.882 A show has ID, title, year, and episodes. 01:12:12.882 --> 01:12:14.590 But if you want to figure out its rating, 01:12:14.590 --> 01:12:17.710 you have to follow the arrow here, so to speak, and look up 01:12:17.710 --> 01:12:20.020 the corresponding show ID in this table. 01:12:20.020 --> 01:12:23.750 Find the rating of that show and the total number of ratings. 01:12:23.750 --> 01:12:26.470 So that's been factored out too, for better or for worse. 01:12:26.470 --> 01:12:28.600 Now let's consider people. 01:12:28.600 --> 01:12:32.050 People have just three columns, ID, name, and birth. 01:12:32.050 --> 01:12:35.350 But there's no mention of the TV show in which people have starred 01:12:35.350 --> 01:12:38.140 or the TV shows that a person has written. 01:12:38.140 --> 01:12:39.200 Well, why is that? 01:12:39.200 --> 01:12:42.730 Well, if you just had a fourth column here called show, 01:12:42.730 --> 01:12:45.560 well, you would have to decide what show is that person in. 01:12:45.560 --> 01:12:47.560 And no one could ever act again in another show, 01:12:47.560 --> 01:12:49.310 because there's no room to store the data. 01:12:49.310 --> 01:12:53.560 But if someone, of course, a popular actor can star in multiple shows, 01:12:53.560 --> 01:12:57.790 well, we could have one ID for that person, one name, one birth year, 01:12:57.790 --> 01:12:58.360 obviously. 01:12:58.360 --> 01:13:02.230 Like there's only one Steve Carell as an actor in the world of people. 01:13:02.230 --> 01:13:06.610 But Steve Carell in this example could have his person ID, 01:13:06.610 --> 01:13:09.220 whatever his Harvard ID equivalent, Yale ID equivalent is, 01:13:09.220 --> 01:13:12.130 appear in multiple rows in this table so that it 01:13:12.130 --> 01:13:14.470 can be associated with multiple shows. 01:13:14.470 --> 01:13:18.830 And this allows you to create what's called a one to many relationship, 01:13:18.830 --> 01:13:20.500 or technically it's bidirectional. 01:13:20.500 --> 01:13:22.450 It's a many to many relationship. 01:13:22.450 --> 01:13:23.120 Why? 01:13:23.120 --> 01:13:26.470 Well, one show can certainly have multiple people in it 01:13:26.470 --> 01:13:29.090 and multiple people writing for it, just in the real world. 01:13:29.090 --> 01:13:33.910 But conversely, one person could certainly act in multiple shows 01:13:33.910 --> 01:13:35.330 or write multiple shows. 01:13:35.330 --> 01:13:38.380 So this is what you get with relational databases. 01:13:38.380 --> 01:13:41.440 You put your sort of canonical data for people in one place, 01:13:41.440 --> 01:13:46.150 for shows in another place, and then you use these additional tables 01:13:46.150 --> 01:13:48.993 to relate one thing to another. 01:13:48.993 --> 01:13:50.410 So we won't dwell on the pictures. 01:13:50.410 --> 01:13:53.350 That's just if you sort of can wrap your mind around the data 01:13:53.350 --> 01:13:55.870 set better that way, that's one way of thinking about it. 01:13:55.870 --> 01:14:00.177 But recall that the code we just saw for the schema, again, escalated quickly. 01:14:00.177 --> 01:14:02.260 There's a lot of keywords I haven't mentioned yet. 01:14:02.260 --> 01:14:04.360 But some of these are perhaps familiar. 01:14:04.360 --> 01:14:06.130 They're capitalized differently here. 01:14:06.130 --> 01:14:08.140 But integer is on the list here. 01:14:08.140 --> 01:14:10.790 Null is on the list, albeit technically not null. 01:14:10.790 --> 01:14:12.760 So let's tease apart some of these keywords 01:14:12.760 --> 01:14:15.218 and consider what they're actually doing for your database, 01:14:15.218 --> 01:14:18.640 because now we're exploring features that do not exist 01:14:18.640 --> 01:14:20.980 in the world of spreadsheets alone. 01:14:20.980 --> 01:14:24.190 So it turns out in a SQL database, specifically 01:14:24.190 --> 01:14:27.265 SQLite which is the version of SQL we use in CS50 01:14:27.265 --> 01:14:30.910 and which is commonly used for things like mobile applications nowadays. 01:14:30.910 --> 01:14:32.560 It's like a lightweight version of SQL. 01:14:32.560 --> 01:14:35.590 It's when you aren't trying to run Twitter and have billions 01:14:35.590 --> 01:14:37.150 and billions of rows necessarily. 01:14:37.150 --> 01:14:40.730 You've got hundreds, thousands, tens of thousands, maybe even a few million, 01:14:40.730 --> 01:14:44.590 but not crazy numbers, crazy amounts of data. 01:14:44.590 --> 01:14:49.250 In the world of SQLite specifically, there's these five data types. 01:14:49.250 --> 01:14:52.390 So just like in C, we had int and char and the like. 01:14:52.390 --> 01:14:54.190 In SQL, we have these. 01:14:54.190 --> 01:14:57.447 Blob, which is kind of funny, but it just means binary large objects. 01:14:57.447 --> 01:14:58.780 So it's like a binary data type. 01:14:58.780 --> 01:15:02.830 Zeros and ones that aren't necessarily fitting into the other categories. 01:15:02.830 --> 01:15:05.290 Integer, which of course, is an integer as we know it. 01:15:05.290 --> 01:15:09.940 Numeric, which is kind of a catchall for numbers that are formatted specially. 01:15:09.940 --> 01:15:14.710 So like a date would be like year, year, year, year, dash month, month, 01:15:14.710 --> 01:15:16.390 dash day, day. 01:15:16.390 --> 01:15:18.015 And this is actually a wonderful thing. 01:15:18.015 --> 01:15:19.723 Depending on the country you're from, you 01:15:19.723 --> 01:15:22.720 might think your date system in your country is great or it's horrible. 01:15:22.720 --> 01:15:27.040 The US system is horrible because we have month, day, and then 01:15:27.040 --> 01:15:29.350 year, which is impossible to sort. 01:15:29.350 --> 01:15:31.930 It is the wrong way objectively to store data. 01:15:31.930 --> 01:15:33.803 And yet here we are using this at scale. 01:15:33.803 --> 01:15:35.470 Other countries have gotten this better. 01:15:35.470 --> 01:15:38.383 Numeric in SQL itself standardizes that stuff. 01:15:38.383 --> 01:15:40.300 So it doesn't matter what country you're from. 01:15:40.300 --> 01:15:43.200 You're storing your data in this particular way for instance. 01:15:43.200 --> 01:15:46.560 Times are standardized and other types of numeric data as well. 01:15:46.560 --> 01:15:48.630 Real is synonymous with flow. 01:15:48.630 --> 01:15:52.410 So something with a decimal point and some number of digits thereafter. 01:15:52.410 --> 01:15:55.530 And then text is just for strings and the like. 01:15:55.530 --> 01:16:00.240 With other even fancier databases like MySQL, PostgreSQL, Oracle, 01:16:00.240 --> 01:16:02.370 and other products you might have heard of, 01:16:02.370 --> 01:16:04.170 there's even more data types where you have 01:16:04.170 --> 01:16:05.712 to make even finer grained decisions. 01:16:05.712 --> 01:16:07.950 But for SQLite, it's indeed pretty lightweight 01:16:07.950 --> 01:16:12.120 and you or we just have to decide the data types for each column in a table. 01:16:12.120 --> 01:16:15.000 But there's these additional constraints in the world of SQL. 01:16:15.000 --> 01:16:21.400 You can additionally say that cells in this column may or may not be null. 01:16:21.400 --> 01:16:24.660 So if you want to protect yourself from yourself so you don't screw up 01:16:24.660 --> 01:16:26.880 and insert a null, that is a blank value, 01:16:26.880 --> 01:16:31.090 you can explicitly design a table to have a column that cannot be null. 01:16:31.090 --> 01:16:33.090 And so in fact, someone came up during the break 01:16:33.090 --> 01:16:36.450 to ask me about my having manually inserted SQL, 01:16:36.450 --> 01:16:38.923 quote unquote "SQL," into our favorites database. 01:16:38.923 --> 01:16:40.590 You might recall that I kind of cheated. 01:16:40.590 --> 01:16:44.340 I just inserted "SQL" quote unquote and "fiftyville," 01:16:44.340 --> 01:16:46.050 the name of a new problem, quote unquote. 01:16:46.050 --> 01:16:49.170 But what did I not insert into the database? 01:16:49.170 --> 01:16:50.100 A timestamp. 01:16:50.100 --> 01:16:50.880 And I could have. 01:16:50.880 --> 01:16:54.240 I could have put the current day and time a few minutes ago, but I didn't. 01:16:54.240 --> 01:16:58.590 And that's fine if it's acceptable to you and the product you're building. 01:16:58.590 --> 01:17:00.300 But I could have prevented that. 01:17:00.300 --> 01:17:05.100 If we had defined the table to have a timestamp column that isn't just text 01:17:05.100 --> 01:17:08.490 but it's text that's not null, SQL would have complained 01:17:08.490 --> 01:17:10.650 and would not have let me complete that insertion. 01:17:10.650 --> 01:17:13.567 So there's these kinds of built in defenses that you don't necessarily 01:17:13.567 --> 01:17:15.180 get with a spreadsheet alone. 01:17:15.180 --> 01:17:17.170 And unique means exactly that. 01:17:17.170 --> 01:17:20.190 If you want to make sure that every row in that column 01:17:20.190 --> 01:17:24.480 is unique, maybe for email addresses or in the US Social Security 01:17:24.480 --> 01:17:26.190 numbers or anything that you want to make 01:17:26.190 --> 01:17:30.330 sure you don't have two versions of, you can specify that the column is unique. 01:17:30.330 --> 01:17:32.170 And there's other such constraints as well. 01:17:32.170 --> 01:17:34.140 But again, this is just a list of features 01:17:34.140 --> 01:17:37.170 that you get from a proper relational database. 01:17:37.170 --> 01:17:39.990 But perhaps the most intellectually interesting one 01:17:39.990 --> 01:17:42.420 and the most powerful one is what's called 01:17:42.420 --> 01:17:45.000 here a primary key and a foreign key. 01:17:45.000 --> 01:17:47.640 And let me go back now to this output. 01:17:47.640 --> 01:17:51.970 If we look at shows, you'll see that a show, again, 01:17:51.970 --> 01:17:55.320 has an ID, a title, a year, and a number of episodes. 01:17:55.320 --> 01:17:57.120 And now the data types might make sense. 01:17:57.120 --> 01:17:59.820 The ID it turns out, just like a Harvard ID, a Yale ID, 01:17:59.820 --> 01:18:01.150 is going to be an integer. 01:18:01.150 --> 01:18:02.220 So a simple number. 01:18:02.220 --> 01:18:04.680 The title, of course, is going to be text but not null. 01:18:04.680 --> 01:18:07.470 It would be weird if a TV show had no name. 01:18:07.470 --> 01:18:08.615 That can't be. 01:18:08.615 --> 01:18:10.990 The whole world would break or your TV Guide and whatnot. 01:18:10.990 --> 01:18:13.590 So that makes sense there to say not null. 01:18:13.590 --> 01:18:14.670 Year is numeric. 01:18:14.670 --> 01:18:15.990 So it's a standardized form. 01:18:15.990 --> 01:18:19.140 Episodes is an integer, like how many episodes have been produced. 01:18:19.140 --> 01:18:20.790 And then lastly, notice this. 01:18:20.790 --> 01:18:24.090 The primary key of the show's table is apparently 01:18:24.090 --> 01:18:27.880 the column called ID mentioned a few lines earlier. 01:18:27.880 --> 01:18:32.400 This just means that the database will use the ID 01:18:32.400 --> 01:18:34.740 column as the unique identifier. 01:18:34.740 --> 01:18:37.770 So it's similar to the unique keyword, but primary key just 01:18:37.770 --> 01:18:40.260 means the database is going to treat it as special too 01:18:40.260 --> 01:18:43.380 and make sure that it is uniquely identifying your data. 01:18:43.380 --> 01:18:45.300 But what's interesting is this. 01:18:45.300 --> 01:18:48.480 Notice if I scroll back up to people, people 01:18:48.480 --> 01:18:51.480 were sort of similarly structured but with different attributes. 01:18:51.480 --> 01:18:56.310 Like up here we had a person has an ID, a name, a birth 01:18:56.310 --> 01:18:58.800 year, and a primary key of ID. 01:18:58.800 --> 01:19:01.080 So a ID is, again, integer. 01:19:01.080 --> 01:19:03.600 Name is text but not null, because it'd be 01:19:03.600 --> 01:19:06.420 weird to have a human with absolutely no name textually. 01:19:06.420 --> 01:19:07.860 Birth is going to be numeric. 01:19:07.860 --> 01:19:11.590 But the primary key of people is ID as well. 01:19:11.590 --> 01:19:15.150 So those are the unique columns that the database will just treat special. 01:19:15.150 --> 01:19:15.810 Why? 01:19:15.810 --> 01:19:17.310 Well, we just looked at shows. 01:19:17.310 --> 01:19:18.900 We just looked at people. 01:19:18.900 --> 01:19:22.380 Let's focus now on this one down here, stars. 01:19:22.380 --> 01:19:25.260 How do you determine who stars in a TV show? 01:19:25.260 --> 01:19:26.370 Well, we had two columns. 01:19:26.370 --> 01:19:28.320 The show ID and the person ID. 01:19:28.320 --> 01:19:31.080 This is the incarnation of a many to many relationship. 01:19:31.080 --> 01:19:32.550 One person could be in many shows. 01:19:32.550 --> 01:19:35.710 One show could certainly have many people in it or writing for it. 01:19:35.710 --> 01:19:37.250 But notice this. 01:19:37.250 --> 01:19:41.220 Within this table of two columns, show ID and person ID, 01:19:41.220 --> 01:19:44.070 there's what's going to be called a foreign key called 01:19:44.070 --> 01:19:49.350 show ID that references the show's table's ID column 01:19:49.350 --> 01:19:51.367 and then another foreign key called person ID, 01:19:51.367 --> 01:19:53.700 though I could call these things in parentheses anything 01:19:53.700 --> 01:19:57.495 I want, that references the people table's ID column. 01:19:57.495 --> 01:20:00.120 Now, you're not going to often have to type commands like this. 01:20:00.120 --> 01:20:02.700 Again, you set the database up once in the beginning 01:20:02.700 --> 01:20:04.590 typically, maybe with some help from a TF, 01:20:04.590 --> 01:20:06.210 maybe with help of Google or the like. 01:20:06.210 --> 01:20:09.203 But once your database is designed, it's back to the CRUD. 01:20:09.203 --> 01:20:12.120 Create, read, update, delete, the selects, the inserts, the deletions, 01:20:12.120 --> 01:20:12.690 and the like. 01:20:12.690 --> 01:20:14.310 But what's this implying? 01:20:14.310 --> 01:20:17.520 These keywords like primary key and foreign key 01:20:17.520 --> 01:20:23.100 are what are doing in code what this picture was painting a moment ago. 01:20:23.100 --> 01:20:27.570 These lines here are drawn literally to line up with the corresponding things. 01:20:27.570 --> 01:20:30.660 People's ID lines up with person ID. 01:20:30.660 --> 01:20:34.410 Show's ID lines up with show ID. 01:20:34.410 --> 01:20:37.950 And so you're just seeing graphical version, code version, graphical, 01:20:37.950 --> 01:20:40.860 code that creates these relationships. 01:20:40.860 --> 01:20:45.190 Now, given that, let's actually see what these things look like. 01:20:45.190 --> 01:20:47.160 So let me go back to VS Code here. 01:20:47.160 --> 01:20:48.400 Let me clear my screen. 01:20:48.400 --> 01:20:51.000 I'm still within SQLite with shows.db. 01:20:51.000 --> 01:20:53.480 Let me go ahead and do what I do with any new database. 01:20:53.480 --> 01:20:55.230 If I ever download something or I'm trying 01:20:55.230 --> 01:20:57.600 to wrap my mind around a problem, usually it 01:20:57.600 --> 01:21:01.230 doesn't come with a pretty picture or a three hour lecture 01:21:01.230 --> 01:21:02.670 to explain what the data set is. 01:21:02.670 --> 01:21:06.190 Rather you just have the data set in your own knowledge of SQL. 01:21:06.190 --> 01:21:07.440 So let me play around. 01:21:07.440 --> 01:21:09.608 So .schema shows me all of the tables. 01:21:09.608 --> 01:21:10.900 That might be a starting point. 01:21:10.900 --> 01:21:11.900 OK, this is interesting. 01:21:11.900 --> 01:21:13.560 I know what people are. 01:21:13.560 --> 01:21:16.530 Let's go ahead and show me all the people. 01:21:16.530 --> 01:21:18.595 So select star from people. 01:21:18.595 --> 01:21:20.970 I'm just trying to wrap my mind around what this data set 01:21:20.970 --> 01:21:24.330 looks like in a more user friendly way. 01:21:24.330 --> 01:21:26.607 That's already a lot of people. 01:21:26.607 --> 01:21:28.440 As you see the years flying by, there's been 01:21:28.440 --> 01:21:31.170 a lot of people in the TV business. 01:21:31.170 --> 01:21:33.540 So this was maybe not the best query to run. 01:21:33.540 --> 01:21:39.300 But this is indicative of just how large this data set is from IMDb. 01:21:39.300 --> 01:21:42.960 When in doubt and whenever you lose control over your computer, 01:21:42.960 --> 01:21:44.940 Control C is your friend to interrupt. 01:21:44.940 --> 01:21:46.860 What would have been better, because I don't 01:21:46.860 --> 01:21:49.980 think I need to know all of the million people in the world, 01:21:49.980 --> 01:21:52.470 I could do limit me to 10 people. 01:21:52.470 --> 01:21:55.860 And that's enough now to get a sense of Fred Astaire 01:21:55.860 --> 01:21:58.920 has an ID of one, the first person ever. 01:21:58.920 --> 01:22:00.660 Birth year of 1899. 01:22:00.660 --> 01:22:04.740 Lauren Bacall and all of these other people from yesteryear. 01:22:04.740 --> 01:22:07.410 You see that they are the first 10 people in the database. 01:22:07.410 --> 01:22:09.160 So there's an example of some of the data. 01:22:09.160 --> 01:22:12.540 Now if I want to wrap my mind around what a show is, I know it technically. 01:22:12.540 --> 01:22:13.800 I know it from the picture. 01:22:13.800 --> 01:22:15.400 But let's just look at some raw data. 01:22:15.400 --> 01:22:18.210 So instead of saying select star from people, let me go ahead 01:22:18.210 --> 01:22:21.480 and select star from shows limit 10. 01:22:21.480 --> 01:22:26.250 And OK, I've only heard of or seen a couple of these, 01:22:26.250 --> 01:22:27.970 but these are older shows at that. 01:22:27.970 --> 01:22:31.530 But I see that every show has an ID, a title, a year in which it debuted, 01:22:31.530 --> 01:22:32.910 and a number of episodes. 01:22:32.910 --> 01:22:35.340 But perhaps most opaque is going to be this. 01:22:35.340 --> 01:22:41.520 Select star from stars where this is the table that associates people 01:22:41.520 --> 01:22:42.330 with shows. 01:22:42.330 --> 01:22:46.950 Am I going to see any names or show titles here? 01:22:46.950 --> 01:22:49.127 Not according to the definition we saw earlier. 01:22:49.127 --> 01:22:50.460 Oh, I should have done my limit. 01:22:50.460 --> 01:22:51.660 Let me interrupt that. 01:22:51.660 --> 01:22:52.620 Let me do that again. 01:22:52.620 --> 01:22:54.180 Limit 10. 01:22:54.180 --> 01:22:55.050 No. 01:22:55.050 --> 01:22:58.380 And this is where now you're definitely in the programmer world, 01:22:58.380 --> 01:23:01.780 because this would be the most annoying spreadsheet to use on your Mac or PC 01:23:01.780 --> 01:23:02.280 ever. 01:23:02.280 --> 01:23:04.528 If you just had a sheet with all of these numbers 01:23:04.528 --> 01:23:06.570 that associates one thing with the other, my God, 01:23:06.570 --> 01:23:08.910 how do you figure out who this is or what this is? 01:23:08.910 --> 01:23:12.780 You have to manually Control F or Command F looking for the data. 01:23:12.780 --> 01:23:14.310 But a database doesn't care. 01:23:14.310 --> 01:23:18.160 Once you know SQL, you can stitch these things back together. 01:23:18.160 --> 01:23:21.360 So what you're seeing here are foreign keys. 01:23:21.360 --> 01:23:22.150 Foreign keys. 01:23:22.150 --> 01:23:22.650 Why? 01:23:22.650 --> 01:23:27.600 Because show ID corresponds to the same numbers from that other table 01:23:27.600 --> 01:23:30.990 called shows that has a proper primary key called ID. 01:23:30.990 --> 01:23:33.360 Person ID is a foreign key in this context, 01:23:33.360 --> 01:23:38.620 because it refers to numbers that belong to really the people table and its ID 01:23:38.620 --> 01:23:39.120 column. 01:23:39.120 --> 01:23:41.037 So this is just a way of somehow linking them. 01:23:41.037 --> 01:23:44.100 And so if you think of I always think of this in my mind's eye as this. 01:23:44.100 --> 01:23:47.760 If this is the people table, this is the shows table, 01:23:47.760 --> 01:23:51.690 and there's this middle table in between, the stars table. 01:23:51.690 --> 01:23:56.490 There's some way of stitching those two together by lining up the IDs of one 01:23:56.490 --> 01:23:58.870 with the other and getting back some more data. 01:23:58.870 --> 01:24:01.690 So let's actually play with some of this data. 01:24:01.690 --> 01:24:05.148 How about we start where we emphasized earlier, genres. 01:24:05.148 --> 01:24:08.440 So let me go ahead and take a quick look at all of the genres in this database. 01:24:08.440 --> 01:24:10.500 So select star from genres. 01:24:10.500 --> 01:24:13.030 Star is usually going to be a little overwhelming, 01:24:13.030 --> 01:24:15.240 but it just gives me a sense of what the data is. 01:24:15.240 --> 01:24:17.850 But let's actually look at-- 01:24:17.850 --> 01:24:19.440 let's go look at all of them there. 01:24:19.440 --> 01:24:20.850 OK, that's a lot. 01:24:20.850 --> 01:24:24.210 These are all official genres from IMDb. 01:24:24.210 --> 01:24:27.030 OK, it wasn't terribly long. 01:24:27.030 --> 01:24:28.350 Let me filter that down. 01:24:28.350 --> 01:24:34.800 So from genres where genre equals Comedy, capital C just based 01:24:34.800 --> 01:24:36.000 on the data I'm seeing. 01:24:36.000 --> 01:24:38.220 OK, so what am I seeing now? 01:24:38.220 --> 01:24:40.838 And in fact, let me limit this arbitrarily to 10, 01:24:40.838 --> 01:24:42.630 though I could limit it to anything I want. 01:24:42.630 --> 01:24:45.060 Here are 10 comedies. 01:24:45.060 --> 01:24:46.080 What are they? 01:24:46.080 --> 01:24:47.490 Well, who the heck knows? 01:24:47.490 --> 01:24:49.590 All I know are the 10 show IDs. 01:24:49.590 --> 01:24:51.180 Now, I could do something like this. 01:24:51.180 --> 01:24:54.037 As we've seen before with SQL, I could do, all right, well 01:24:54.037 --> 01:24:55.620 let's figure out what this show ID is. 01:24:55.620 --> 01:25:03.750 Select star from shows where the ID of the show I'm looking for equals what? 01:25:03.750 --> 01:25:06.910 62614 semicolon. 01:25:06.910 --> 01:25:11.200 So I could manually look it up by cross referencing the other table. 01:25:11.200 --> 01:25:14.860 So that was the show in question there, the first comedy in the data set. 01:25:14.860 --> 01:25:16.200 Let me look up the second one. 01:25:16.200 --> 01:25:20.610 So instead of that, let's do 63881 Enter. 01:25:20.610 --> 01:25:21.150 OK. 01:25:21.150 --> 01:25:22.410 So that's that show. 01:25:22.410 --> 01:25:23.640 And let's do one more. 01:25:23.640 --> 01:25:27.210 And suffice it to say, this is just getting tedious and vulnerable 01:25:27.210 --> 01:25:29.520 to mistakes quickly. 01:25:29.520 --> 01:25:31.710 This surely can't be the way to do this. 01:25:31.710 --> 01:25:35.280 And indeed, SQL is going to let us do this a little more powerfully instead. 01:25:35.280 --> 01:25:36.790 Let's do this. 01:25:36.790 --> 01:25:39.210 Instead of getting this table temporarily 01:25:39.210 --> 01:25:42.640 with all these show IDs and all these genres, let's refine the query. 01:25:42.640 --> 01:25:49.920 So let's just select the show ID from the genres table where the genre equals 01:25:49.920 --> 01:25:51.810 quote unquote "comedy." 01:25:51.810 --> 01:25:56.670 Now I have a big list of show IDs, all of which are comedy. 01:25:56.670 --> 01:25:57.205 How many? 01:25:57.205 --> 01:25:58.830 Well, I can combine ideas from earlier. 01:25:58.830 --> 01:26:03.582 I can just count all of those show IDs or star if I want to just do that too. 01:26:03.582 --> 01:26:05.040 But I can count all those show IDs. 01:26:05.040 --> 01:26:09.880 48,706 comedies on IMDb's database for TV shows. 01:26:09.880 --> 01:26:11.140 So feels like a lot. 01:26:11.140 --> 01:26:14.010 But how can I now use that information and get back 01:26:14.010 --> 01:26:18.100 the titles of comedies in the database without doing it manually? 01:26:18.100 --> 01:26:19.720 Well, let's do this. 01:26:19.720 --> 01:26:21.270 I have a moment ago this query. 01:26:21.270 --> 01:26:25.710 Select the show ID from genres where the current genre 01:26:25.710 --> 01:26:29.295 is quote unquote "comedy." 01:26:29.295 --> 01:26:31.920 What if I kind of nest these queries, kind of like grade school 01:26:31.920 --> 01:26:33.030 math in parentheses? 01:26:33.030 --> 01:26:35.940 What if I combine this whole thing in parentheses? 01:26:35.940 --> 01:26:38.040 And now let me select what I really want. 01:26:38.040 --> 01:26:43.950 Let me go ahead and select how about the title of all 01:26:43.950 --> 01:26:52.420 shows where the idea of the show is in this list of show IDs. 01:26:52.420 --> 01:26:59.440 So if you agree that the shows table has an ID column, which is otherwise 01:26:59.440 --> 01:27:02.170 known as its primary key, the unique ID that identifies it, just 01:27:02.170 --> 01:27:06.490 like our Harvard IDs, our Yale IDs, and you agree that per a moment ago 01:27:06.490 --> 01:27:09.580 this shorter query will give me back just the show 01:27:09.580 --> 01:27:13.840 IDs of all of the comedies in the database, 01:27:13.840 --> 01:27:17.710 you can actually combine or nest these queries together. 01:27:17.710 --> 01:27:19.843 It's going to respect SQLite order of operations 01:27:19.843 --> 01:27:21.760 with parentheses, just like grade school math. 01:27:21.760 --> 01:27:24.070 So the thing in parentheses will be executed first. 01:27:24.070 --> 01:27:27.340 That gives it back a list of IDs, like 48,000 IDs. 01:27:27.340 --> 01:27:30.160 And then this query, the outer query, is going 01:27:30.160 --> 01:27:33.790 to get the title from all of the shows where the ID of the show 01:27:33.790 --> 01:27:36.280 is in that big list of 48,000. 01:27:36.280 --> 01:27:39.233 So if I now execute these together, I think 01:27:39.233 --> 01:27:42.400 the list is still going to be a little long, but let me execute it together. 01:27:42.400 --> 01:27:45.880 Now I see this long list of outputs. 01:27:45.880 --> 01:27:46.900 A little overwhelming. 01:27:46.900 --> 01:27:50.650 Let's go ahead and maybe limit it to just 10 01:27:50.650 --> 01:27:52.460 as before for discussion's sake. 01:27:52.460 --> 01:27:56.110 And now I see 10 comedies ordered arbitrarily 01:27:56.110 --> 01:27:58.330 from however they're in the database that happen 01:27:58.330 --> 01:28:01.420 to indeed have comedy as their genre. 01:28:01.420 --> 01:28:04.310 If I want to do this a little more cleanly, I could do this. 01:28:04.310 --> 01:28:05.260 Let's see. 01:28:05.260 --> 01:28:09.640 Why don't I order by title ascending order, which is alphabetically, 01:28:09.640 --> 01:28:12.190 or the default is also an ascending. 01:28:12.190 --> 01:28:13.240 Limit 10. 01:28:13.240 --> 01:28:18.250 Now I see the top 10, I mean, weirdly named things with hash symbols 01:28:18.250 --> 01:28:22.360 presumably to get their titles up to the beginning or maybe these are hashtags. 01:28:22.360 --> 01:28:27.550 Here now we have alphabetically the first 10 shows that are comedies. 01:28:27.550 --> 01:28:31.705 Any questions on these kinds of queries? 01:28:31.705 --> 01:28:33.580 It's kind of a lot, but at the same time it's 01:28:33.580 --> 01:28:37.540 just like composing the smaller ideas from before into slightly more 01:28:37.540 --> 01:28:39.820 useful queries. 01:28:39.820 --> 01:28:40.360 Yeah. 01:28:40.360 --> 01:28:43.348 AUDIENCE: [INAUDIBLE] 01:28:46.343 --> 01:28:48.760 DAVID MALAN: Do foreign keys have to set the relationship? 01:28:48.760 --> 01:28:51.850 When you create the table, the programmer or the database 01:28:51.850 --> 01:28:54.280 administrator would create that relationship 01:28:54.280 --> 01:28:57.700 by using those keywords primary key and foreign key that 01:28:57.700 --> 01:29:01.330 teaches the database what is related to what per the picture. 01:29:01.330 --> 01:29:02.530 So you do that once. 01:29:02.530 --> 01:29:07.390 And now I being the sort of programmer who's familiar with the database, 01:29:07.390 --> 01:29:13.750 I am just using these foreign keys in a manner consistent with their design. 01:29:13.750 --> 01:29:15.680 And this is where it's useful at some point, 01:29:15.680 --> 01:29:18.940 even if no one hands you a picture, to make sure you understand the database, 01:29:18.940 --> 01:29:22.180 because that's going to inform literally what you type in SQL 01:29:22.180 --> 01:29:23.740 to get the data you care about. 01:29:23.740 --> 01:29:26.060 Well, let's do something a little more precise. 01:29:26.060 --> 01:29:27.860 How about-- very reasonable question. 01:29:27.860 --> 01:29:30.970 And honestly, this is exactly what imdb.com and the app or for. 01:29:30.970 --> 01:29:34.600 What if you want to find all of the shows that Steve Carell is in? 01:29:34.600 --> 01:29:36.160 Kind of a reasonable query. 01:29:36.160 --> 01:29:40.150 Literally something someone might type into Google or more specifically IMDb. 01:29:40.150 --> 01:29:42.130 It's not really obvious at first glance how 01:29:42.130 --> 01:29:44.795 to do that, though, because from my database, 01:29:44.795 --> 01:29:47.170 if these are my six tables, well, I can pretty easily get 01:29:47.170 --> 01:29:48.460 Steve Carell from here. 01:29:48.460 --> 01:29:51.105 But I can really only get his ID number, whatever that is, 01:29:51.105 --> 01:29:53.230 his name, which I know already, and his birth year. 01:29:53.230 --> 01:29:56.200 OK, interesting but has nothing to do with the shows that he's in. 01:29:56.200 --> 01:30:00.850 I can look at shows over here, but there's no mention of Steve Carell 01:30:00.850 --> 01:30:02.710 because there's no person ID here. 01:30:02.710 --> 01:30:04.660 Where is that relationship implemented? 01:30:04.660 --> 01:30:06.590 Well, it's implemented down here. 01:30:06.590 --> 01:30:07.910 So how do we do this? 01:30:07.910 --> 01:30:10.210 Well, here's the perfect example of a lesson 01:30:10.210 --> 01:30:13.450 we've been trying to emphasize for weeks of taking these baby steps. 01:30:13.450 --> 01:30:17.750 Break larger problems down into smaller ones and let's do something like this. 01:30:17.750 --> 01:30:20.740 Let's just get everything I know about Steve Carell from the database. 01:30:20.740 --> 01:30:24.190 Let's select star from people where the name of the person 01:30:24.190 --> 01:30:26.890 is quote unquote "Steve Carell." 01:30:26.890 --> 01:30:28.630 I just want to see what data we've got. 01:30:28.630 --> 01:30:30.640 And here's what we have. 01:30:30.640 --> 01:30:33.130 There's only one Steve Carell born in 1962 01:30:33.130 --> 01:30:37.150 and his unique ID is 136797 according to IMDb. 01:30:37.150 --> 01:30:40.930 This isn't some global actor identifier, per se. 01:30:40.930 --> 01:30:45.010 All right, well how do I get now all of the shows that Steve Carell is in? 01:30:45.010 --> 01:30:46.130 Well, I could do this. 01:30:46.130 --> 01:30:51.190 Select star from stars, not to confuse the two. 01:30:51.190 --> 01:30:53.500 One's the symbol, one's the table name. 01:30:53.500 --> 01:30:58.690 Where person ID equals 136797. 01:30:58.690 --> 01:31:03.070 So I think this will now give me everything from the stars table 01:31:03.070 --> 01:31:04.943 that relates to Steve Carell. 01:31:04.943 --> 01:31:07.360 And you'll see person ID is the same because I'm literally 01:31:07.360 --> 01:31:08.693 searching for just Steve Carell. 01:31:08.693 --> 01:31:12.200 But there are like 20 or so shows that he's been in. 01:31:12.200 --> 01:31:14.450 All right, well here's where things would get tedious. 01:31:14.450 --> 01:31:15.340 What are those shows? 01:31:15.340 --> 01:31:21.460 Well, I could do select title from shows where the ID of the show equals. 01:31:21.460 --> 01:31:25.670 And here's whenever you copy paste, you're probably doing something wrong. 01:31:25.670 --> 01:31:27.890 OK, he was in The Dana Carvey Show. 01:31:27.890 --> 01:31:28.715 Familiar with that. 01:31:28.715 --> 01:31:29.590 Let's do another one. 01:31:29.590 --> 01:31:31.460 We'll copy paste this. 01:31:31.460 --> 01:31:34.090 Where ID equals this. 01:31:34.090 --> 01:31:34.890 Over The Top. 01:31:34.890 --> 01:31:35.390 Another. 01:31:35.390 --> 01:31:37.000 And if we keep digging, we'll probably find The Office. 01:31:37.000 --> 01:31:40.240 But my God, that's going to take forever to do 20 queries manually. 01:31:40.240 --> 01:31:41.252 It's not very dynamic. 01:31:41.252 --> 01:31:43.960 But what if we just nest these queries a little more dynamically? 01:31:43.960 --> 01:31:45.830 So let me start from the beginning again. 01:31:45.830 --> 01:31:48.890 What if we go ahead and select everything 01:31:48.890 --> 01:31:56.410 we know about people whose name equals Steve Carell. 01:31:56.410 --> 01:31:58.220 That gave us earlier this data. 01:31:58.220 --> 01:31:59.470 I don't need all of that data. 01:31:59.470 --> 01:32:00.100 I know his name. 01:32:00.100 --> 01:32:01.517 I don't care about his birth year. 01:32:01.517 --> 01:32:05.140 So let's change this to just be give me the ID of Steve Carell. 01:32:05.140 --> 01:32:08.390 And that gives me back now this smaller temporary data set. 01:32:08.390 --> 01:32:08.890 All right. 01:32:08.890 --> 01:32:12.195 Can I now use this inside of another query? 01:32:12.195 --> 01:32:14.320 Well, let me wrap the whole thing with parentheses. 01:32:14.320 --> 01:32:18.850 And now let me say select star from the stars table 01:32:18.850 --> 01:32:22.477 where the person ID equals this. 01:32:22.477 --> 01:32:24.310 So I'm deliberately not using in because I'm 01:32:24.310 --> 01:32:26.780 assuming there's indeed only one Steve Carell in the world. 01:32:26.780 --> 01:32:28.480 So I'm not getting back a list of Steve Carells. 01:32:28.480 --> 01:32:30.640 I'm getting back the one and only in this case. 01:32:30.640 --> 01:32:31.810 So equal is fine. 01:32:31.810 --> 01:32:33.310 In is when you have multiple. 01:32:33.310 --> 01:32:34.840 Equal is when you have one. 01:32:34.840 --> 01:32:36.375 Let me go ahead and hit Enter now. 01:32:36.375 --> 01:32:37.750 OK, that's more data than I need. 01:32:37.750 --> 01:32:40.430 I don't need like 20 copies of Steve Carell's person ID. 01:32:40.430 --> 01:32:41.380 So let me hit up. 01:32:41.380 --> 01:32:46.420 Let me go back and let me just get show ID from Steve Carell. 01:32:46.420 --> 01:32:52.070 And now I have a list of just the 20 or so show IDs that he has been in. 01:32:52.070 --> 01:32:52.570 All right. 01:32:52.570 --> 01:32:53.597 How can I now use this? 01:32:53.597 --> 01:32:54.430 Well, let me hit up. 01:32:54.430 --> 01:32:57.040 Let me put the whole thing in parentheses. 01:32:57.040 --> 01:32:58.960 And now let me select what I really want. 01:32:58.960 --> 01:33:03.340 Select title from shows where. 01:33:03.340 --> 01:33:05.020 And here's the final flourish. 01:33:05.020 --> 01:33:11.350 The shows table has an ID, has a title, has a year, and has an episode. 01:33:11.350 --> 01:33:21.610 And what I really want, though, is to check which shows have ID that is what? 01:33:21.610 --> 01:33:24.510 Anyone want to finish the thought? 01:33:24.510 --> 01:33:25.860 I just want to-- yeah. 01:33:25.860 --> 01:33:28.435 AUDIENCE: [INAUDIBLE] 01:33:28.435 --> 01:33:29.310 DAVID MALAN: Exactly. 01:33:29.310 --> 01:33:30.570 ID in this. 01:33:30.570 --> 01:33:31.690 And this is getting ugly. 01:33:31.690 --> 01:33:33.810 And when you actually write your queries in a text file, 01:33:33.810 --> 01:33:35.730 you can format them nicely and indent them. 01:33:35.730 --> 01:33:36.772 My font is just getting-- 01:33:36.772 --> 01:33:38.938 I don't want to make it too small to fit everything. 01:33:38.938 --> 01:33:40.230 But now we have three queries. 01:33:40.230 --> 01:33:43.380 One is in doubly nested parentheses, then there's the middle one, 01:33:43.380 --> 01:33:44.550 then there's the outer one. 01:33:44.550 --> 01:33:47.310 So this last query is going to get me the title from shows 01:33:47.310 --> 01:33:50.460 where the ID of the show is in this big list of 20 01:33:50.460 --> 01:33:52.410 or so show IDs that Steve Carell is in. 01:33:52.410 --> 01:33:54.812 And I knew that because I looked up his name here. 01:33:54.812 --> 01:33:57.270 And notice what I did not do this time is I didn't manually 01:33:57.270 --> 01:33:58.620 hardcode his ID number. 01:33:58.620 --> 01:33:59.430 There's no need. 01:33:59.430 --> 01:34:02.508 That would be kind of a bad way to implement a website if you're 01:34:02.508 --> 01:34:04.050 using a database underneath the hood. 01:34:04.050 --> 01:34:07.470 You want the IMDb for real to search for whatever 01:34:07.470 --> 01:34:11.010 the human typed in and no one's going to know Steve Carell's person ID 01:34:11.010 --> 01:34:12.070 or anything else. 01:34:12.070 --> 01:34:14.580 So here we've done this all dynamically. 01:34:14.580 --> 01:34:18.690 And now if I hit Enter, I think I get all of his shows. 01:34:18.690 --> 01:34:21.840 Let's go ahead and order this by title just to make it tidy. 01:34:21.840 --> 01:34:24.810 And you probably will see at least one or more shows that. 01:34:24.810 --> 01:34:29.520 And probably the most popular is, dot dot dot, The Office. 01:34:29.520 --> 01:34:32.280 So this is literally the kind of query that's 01:34:32.280 --> 01:34:37.200 being executed underneath the hood when you go to websites or apps like IMDb. 01:34:37.200 --> 01:34:42.030 Your textual query is probably being plugged into a longer SQL query 01:34:42.030 --> 01:34:44.850 like this where some programmer at IMDb probably 01:34:44.850 --> 01:34:48.420 wrote this whole query in advance weeks, months, years ago 01:34:48.420 --> 01:34:52.470 and they're just somehow plugging in the value that you the human 01:34:52.470 --> 01:34:55.550 typed into the search box or the like. 01:34:55.550 --> 01:35:03.010 Questions now on finding this data or any other? 01:35:03.010 --> 01:35:03.890 No? 01:35:03.890 --> 01:35:04.390 OK. 01:35:04.390 --> 01:35:06.980 So where else could we go with this? 01:35:06.980 --> 01:35:12.700 Well, let's consider how else we might combine data. 01:35:12.700 --> 01:35:16.060 Suppose that the next question actually perhaps appropriately 01:35:16.060 --> 01:35:19.780 would be focusing in on not just people and shows and these stars, 01:35:19.780 --> 01:35:23.680 but how do we gather more information about the shows themselves, 01:35:23.680 --> 01:35:25.430 like the genres, the ratings, or the like. 01:35:25.430 --> 01:35:28.030 So indeed, let's focus on just these two tables here. 01:35:28.030 --> 01:35:31.640 Recall that every show has an ID, a title, a year, and episodes. 01:35:31.640 --> 01:35:34.660 But it also might have one or more relationships with rows 01:35:34.660 --> 01:35:36.280 and this other table called genres. 01:35:36.280 --> 01:35:39.250 And this is so that a show can be a comedy, can be a drama, 01:35:39.250 --> 01:35:40.960 can be any number of other things. 01:35:40.960 --> 01:35:42.400 One row per. 01:35:42.400 --> 01:35:45.040 So you would see the same show ID again and again 01:35:45.040 --> 01:35:48.760 and again with a different genre written in English 01:35:48.760 --> 01:35:51.070 like comedy, drama, or the like. 01:35:51.070 --> 01:35:53.890 Well, how do I kind of reconstitute that data? 01:35:53.890 --> 01:35:56.710 Well, turns out there's a few different ways to do this. 01:35:56.710 --> 01:36:00.610 And let me propose that we introduce this keyword here, join. 01:36:00.610 --> 01:36:05.525 And this is really the most powerful of the keywords in SQL itself. 01:36:05.525 --> 01:36:06.650 It doesn't have to be used. 01:36:06.650 --> 01:36:09.310 We've seen with nested queries that you can still 01:36:09.310 --> 01:36:13.490 select data across multiple tables, but here is another way. 01:36:13.490 --> 01:36:14.660 So let me do this. 01:36:14.660 --> 01:36:17.710 Let me go back to my SQLite database. 01:36:17.710 --> 01:36:23.680 And let me select sort of in one breath exactly the data I want. 01:36:23.680 --> 01:36:26.980 Select star from shows. 01:36:26.980 --> 01:36:30.500 And let's just limit this initially to 10 to see what it looks like. 01:36:30.500 --> 01:36:31.000 All right. 01:36:31.000 --> 01:36:32.250 That's, again, the shows data. 01:36:32.250 --> 01:36:34.690 Select star from genres. 01:36:34.690 --> 01:36:37.760 Let's limit that to 10 too, just to wrap our minds around it. 01:36:37.760 --> 01:36:39.730 And now this is not that useful. 01:36:39.730 --> 01:36:43.510 However, the data in the leftmost column here 01:36:43.510 --> 01:36:45.490 is the primary key in the shows table. 01:36:45.490 --> 01:36:46.870 These are just unique IDs. 01:36:46.870 --> 01:36:51.530 The data here in the genres table, recall, show ID is the foreign key. 01:36:51.530 --> 01:36:54.310 So it's the same numbers but just copied into another table 01:36:54.310 --> 01:36:56.650 so that we can have this relationship across them. 01:36:56.650 --> 01:37:00.070 How do I kind of line up these numbers with these numbers 01:37:00.070 --> 01:37:06.430 to get back a wider table that has title and year and episodes and genre and, 01:37:06.430 --> 01:37:08.950 heck, ratings and all of that too if we want? 01:37:08.950 --> 01:37:11.230 Well, you can join these tables by just telling 01:37:11.230 --> 01:37:12.890 the database what to join on what. 01:37:12.890 --> 01:37:13.940 So let me do this. 01:37:13.940 --> 01:37:17.110 Select star from shows. 01:37:17.110 --> 01:37:21.760 Join that table though on the genres table. 01:37:21.760 --> 01:37:24.160 Well, how do you want to join those two tables? 01:37:24.160 --> 01:37:26.860 And again, the two tables from the picture looked like this. 01:37:26.860 --> 01:37:30.400 How do you tell SQL programmatically to put one of them 01:37:30.400 --> 01:37:32.420 right next to the other, line up all of the ID 01:37:32.420 --> 01:37:34.820 so that you just get one larger data set? 01:37:34.820 --> 01:37:39.160 Well, we can use indeed this syntax called join. 01:37:39.160 --> 01:37:41.140 So back to VS Code here. 01:37:41.140 --> 01:37:44.380 And let me join these two tables. 01:37:44.380 --> 01:37:45.970 Sorry, typo here. 01:37:45.970 --> 01:37:51.940 Join genres on the shows table's ID column, a.k.a. 01:37:51.940 --> 01:37:58.240 its primary key, equaling the genres table's show ID column, a.k.a. 01:37:58.240 --> 01:37:59.210 the foreign key. 01:37:59.210 --> 01:38:01.090 So in other words, it looks a little cryptic, 01:38:01.090 --> 01:38:05.080 but I'm just telling SQL how to line up these two tables and what column 01:38:05.080 --> 01:38:08.230 to match with the other so that the numbers line up 01:38:08.230 --> 01:38:10.610 and I get essentially a wider table. 01:38:10.610 --> 01:38:14.930 Let me go ahead and hit semicolon and Enter. 01:38:14.930 --> 01:38:17.330 And this is now going to give me a lot of data. 01:38:17.330 --> 01:38:18.600 We might have to interrupt it. 01:38:18.600 --> 01:38:22.650 But notice even at a glance, we're getting the ID, the title, the year, 01:38:22.650 --> 01:38:25.130 the number of episodes, the ID again redundantly, 01:38:25.130 --> 01:38:27.140 but that's to be expected if I'm joining them, 01:38:27.140 --> 01:38:29.360 and the genre all the way on the right. 01:38:29.360 --> 01:38:31.740 Let me hit Control C to interrupt. 01:38:31.740 --> 01:38:33.890 Let me just limit this to The Office. 01:38:33.890 --> 01:38:36.500 So where title equals quote unquote "The Office" 01:38:36.500 --> 01:38:39.680 so we can focus on just one sample data. 01:38:39.680 --> 01:38:43.490 And here, fun fact, there's been more than one Office. 01:38:43.490 --> 01:38:46.100 The one that you all probably like is this one 01:38:46.100 --> 01:38:49.250 that started in 2005 with 188 episodes. 01:38:49.250 --> 01:38:53.060 Its ID in the shows table is 386676. 01:38:53.060 --> 01:38:54.710 That's confirmed over here too. 01:38:54.710 --> 01:38:56.850 So again, we've just joined the two tables. 01:38:56.850 --> 01:38:57.350 How? 01:38:57.350 --> 01:38:58.730 By lining up those fields. 01:38:58.730 --> 01:39:01.490 But now that we can see that almost all of The Offices 01:39:01.490 --> 01:39:04.700 produced over the decades are comedies except for this one. 01:39:04.700 --> 01:39:07.460 There was a version of The Office produced in 2001 01:39:07.460 --> 01:39:10.160 that was considered more of a drama. 01:39:10.160 --> 01:39:11.940 Unsure if it's related to the other. 01:39:11.940 --> 01:39:13.640 How can we link in other data? 01:39:13.640 --> 01:39:16.920 Well, let's go ahead and link in ratings too or instead. 01:39:16.920 --> 01:39:21.290 So instead of joining this with genres, let me go ahead and rewind here 01:39:21.290 --> 01:39:28.550 and join shows on ratings on shows.id equals ratings.show_ID. 01:39:28.550 --> 01:39:32.630 And let's limit it to The Office too for discussion's sake where title equals 01:39:32.630 --> 01:39:35.820 quote unquote "The Office" semicolon. 01:39:35.820 --> 01:39:39.650 And now you can see that among the various Offices, 01:39:39.650 --> 01:39:43.100 it looks like the one that most of us probably know and love 01:39:43.100 --> 01:39:49.070 is the highest rated also with a 9.0 with like 585,000 people having 01:39:49.070 --> 01:39:52.280 cast votes for whereas this other shows seem to have been less popular. 01:39:52.280 --> 01:39:56.370 And perhaps that's why indeed you see fewer episodes for them as well. 01:39:56.370 --> 01:39:58.680 So even though we've put the data in multiple places, 01:39:58.680 --> 01:40:02.670 you can still kind of reconstitute it by lining things up in this way 01:40:02.670 --> 01:40:04.970 and rejoining the tables. 01:40:04.970 --> 01:40:09.140 Questions now on this? 01:40:09.140 --> 01:40:16.540 This is the heart of what SQL does and what relational databases do for you. 01:40:16.540 --> 01:40:18.100 Questions? 01:40:18.100 --> 01:40:18.610 All right. 01:40:18.610 --> 01:40:20.470 A few final features. 01:40:20.470 --> 01:40:22.660 There's not all that much that-- 01:40:22.660 --> 01:40:24.320 SQL takes practice like anything else. 01:40:24.320 --> 01:40:26.237 But in terms of syntax and capabilities, let's 01:40:26.237 --> 01:40:30.610 just introduce you to a couple of final features here and problems that arise 01:40:30.610 --> 01:40:32.210 and how we might solve them. 01:40:32.210 --> 01:40:34.640 Let's do this as well. 01:40:34.640 --> 01:40:36.980 So let me go back into VS Code here. 01:40:36.980 --> 01:40:40.900 And let's just find out Steve Carell's information again. 01:40:40.900 --> 01:40:43.180 Last time we did it with this nested query 01:40:43.180 --> 01:40:47.503 by getting his ID and then the show IDs and then the titles for those show IDs. 01:40:47.503 --> 01:40:49.420 With join, you can do it a little differently. 01:40:49.420 --> 01:40:50.950 And any of these ways are fine. 01:40:50.950 --> 01:40:53.740 One might become easier to mentally than another. 01:40:53.740 --> 01:40:57.040 Let's go ahead and select the titles from what. 01:40:57.040 --> 01:41:02.095 Let's select the title from the people table. 01:41:02.095 --> 01:41:03.220 And I'm going to hit Enter. 01:41:03.220 --> 01:41:05.260 And when you're using SQLite3 interactively, 01:41:05.260 --> 01:41:09.400 if you ever find yourself with a prompt that says dot dot dot angle bracket, 01:41:09.400 --> 01:41:12.067 it means you're continuing your thought onto the next line. 01:41:12.067 --> 01:41:13.900 If you didn't intend that, you can sometimes 01:41:13.900 --> 01:41:16.510 hit semicolon to just end the thought and hit 01:41:16.510 --> 01:41:18.460 Enter even if it triggers an error. 01:41:18.460 --> 01:41:21.430 But this is one way of formatting my queries now a little more nicely. 01:41:21.430 --> 01:41:24.597 I'm just going to add some white space so that it's a little easier to read. 01:41:24.597 --> 01:41:25.720 What do I want to select? 01:41:25.720 --> 01:41:28.000 Well, I want to select the title of shows 01:41:28.000 --> 01:41:35.320 from the people table joined with the stars table on the people table's ID 01:41:35.320 --> 01:41:40.520 column equaling the stars table's person ID column. 01:41:40.520 --> 01:41:44.230 So in other words, if you think back to what people are and what stars are, 01:41:44.230 --> 01:41:48.250 one has an ID, one has a person ID, I'm just now connecting those two tables. 01:41:48.250 --> 01:41:49.450 I'm joining those two. 01:41:49.450 --> 01:41:53.350 But I want to do this as well with another table. 01:41:53.350 --> 01:41:54.938 Let me additionally join in. 01:41:54.938 --> 01:41:56.980 So now I only have two hands, but now I'm putting 01:41:56.980 --> 01:41:59.320 a third table joined in together here. 01:41:59.320 --> 01:42:05.960 Join shows on stars.show_id equals shows.id. 01:42:05.960 --> 01:42:08.500 So this is now linking three tables together. 01:42:08.500 --> 01:42:12.100 But I only care about this for one person, so where the name of the person 01:42:12.100 --> 01:42:14.650 equals quote unquote "Steve Carell." 01:42:14.650 --> 01:42:18.850 So more cryptic, to be sure, but what we're doing with this query 01:42:18.850 --> 01:42:21.370 is just taking all three tables that we care about 01:42:21.370 --> 01:42:23.440 and we're joining them all together at once 01:42:23.440 --> 01:42:27.250 using this new join syntax literally telling the database what 01:42:27.250 --> 01:42:29.020 columns to line up with what. 01:42:29.020 --> 01:42:32.350 And then we filter at the very end just like before to get back, 01:42:32.350 --> 01:42:35.650 if I hit Enter, the answer we want, which in this case 01:42:35.650 --> 01:42:40.030 is a little slower at the moment, but that same list of 20 or so 01:42:40.030 --> 01:42:41.350 shows that he's been in. 01:42:41.350 --> 01:42:42.920 There's one other way to do this. 01:42:42.920 --> 01:42:45.250 And again, these are all in the slides online. 01:42:45.250 --> 01:42:47.890 So you can repeat them without having to jot down everything and we'll put them 01:42:47.890 --> 01:42:48.820 in the notes too. 01:42:48.820 --> 01:42:50.450 But there's another way to do this. 01:42:50.450 --> 01:42:52.780 I could also use an implicit join. 01:42:52.780 --> 01:42:55.030 So that was an explicit join because I literally typed 01:42:55.030 --> 01:42:57.220 the word join multiple times at that. 01:42:57.220 --> 01:43:00.880 But let me go ahead and select the title from these three tables. 01:43:00.880 --> 01:43:02.628 People, stars, and shows. 01:43:02.628 --> 01:43:04.420 And this might just be nicer because if you 01:43:04.420 --> 01:43:06.337 know what tables you want to select data from, 01:43:06.337 --> 01:43:08.980 just enumerate them separated by commas, which you might prefer 01:43:08.980 --> 01:43:18.970 in your mind, where the people ID equals the stars person ID and the stars 01:43:18.970 --> 01:43:25.450 show ID equals the shows ID and the name of the person equals Steve Carell. 01:43:25.450 --> 01:43:26.788 So this is an implicit join. 01:43:26.788 --> 01:43:29.830 And honestly, I constantly reference my notes for some of this stuff too. 01:43:29.830 --> 01:43:31.720 It's not the kind of thing that's going to come like this to you 01:43:31.720 --> 01:43:32.740 after just one day. 01:43:32.740 --> 01:43:35.560 But it's just a different way of expressing the same thing. 01:43:35.560 --> 01:43:37.720 I want to select data from three different tables. 01:43:37.720 --> 01:43:43.120 And hey SQL, here is how I want you to line those tables up so that I can 01:43:43.120 --> 01:43:45.640 get like related data for Steve Carell. 01:43:45.640 --> 01:43:49.450 And this now will achieve the same results ultimately. 01:43:49.450 --> 01:43:50.260 Let me hit Enter. 01:43:54.040 --> 01:43:54.920 And there we go. 01:43:54.920 --> 01:43:55.750 So a little slower. 01:43:55.750 --> 01:43:57.850 And performance might vary based on computer, 01:43:57.850 --> 01:44:01.270 based on implementation of SQL, but I think I still have the same answers. 01:44:01.270 --> 01:44:04.790 Now suppose, as I often do, and I had to look it up again last time, 01:44:04.790 --> 01:44:08.350 suppose you forget how to spell Steve Carell's name. 01:44:08.350 --> 01:44:10.240 Is it two R's, two L's, or the like? 01:44:10.240 --> 01:44:12.903 Well, I could also do something like this. 01:44:12.903 --> 01:44:14.320 Well, let's just keep this simple. 01:44:14.320 --> 01:44:17.103 Select star from people where name equals. 01:44:17.103 --> 01:44:20.020 I've been deliberately getting it right so as to not embarrass myself. 01:44:20.020 --> 01:44:22.000 That's the Steve Carell I keep querying. 01:44:22.000 --> 01:44:25.960 If you forget, well you could try searching for just Steves, 01:44:25.960 --> 01:44:28.450 but interestingly, there's a bunch of Steves. 01:44:28.450 --> 01:44:30.070 We don't know when they were born. 01:44:30.070 --> 01:44:35.360 But that's probably not the Steve Carell we want if we don't have his last name. 01:44:35.360 --> 01:44:38.770 So I could alternatively do, well it's Steve and then it starts with a C, 01:44:38.770 --> 01:44:39.460 I think. 01:44:39.460 --> 01:44:42.700 Well, it turns out there's another wild card you can use in SQL. 01:44:42.700 --> 01:44:46.060 We used the asterisk to select all of the columns. 01:44:46.060 --> 01:44:50.800 You can in quotes use a percent sign to say C something. 01:44:50.800 --> 01:44:54.370 So there's 0 or more characters after the letter C. 01:44:54.370 --> 01:44:58.630 And now this doesn't work because now I would be literally looking 01:44:58.630 --> 01:45:01.690 for Steve space C something. 01:45:01.690 --> 01:45:04.910 But recall earlier I mentioned that one other keyword, 01:45:04.910 --> 01:45:06.868 which is for fuzzier matching, so to speak, 01:45:06.868 --> 01:45:09.160 where it's not exactly what you're looking for but it's 01:45:09.160 --> 01:45:10.630 like what you're looking for. 01:45:10.630 --> 01:45:14.350 If you instead say where his name is like Steve space C something, 01:45:14.350 --> 01:45:17.060 now we'll get back a whole bunch of Steves. 01:45:17.060 --> 01:45:19.810 But I think now I could probably find the one I'm actually looking 01:45:19.810 --> 01:45:20.920 for if I don't remember his name. 01:45:20.920 --> 01:45:22.510 You can use multiple percent signs. 01:45:22.510 --> 01:45:25.260 If you forget what his first name is, you could reverse the order. 01:45:25.260 --> 01:45:29.940 But that too is a very powerful SQL feature at that. 01:45:29.940 --> 01:45:32.670 Questions on these queries here? 01:45:32.670 --> 01:45:33.708 Yeah. 01:45:33.708 --> 01:45:35.448 AUDIENCE: [INAUDIBLE] 01:45:35.448 --> 01:45:36.240 DAVID MALAN: Sorry? 01:45:36.240 --> 01:45:38.730 AUDIENCE: [INAUDIBLE] 01:45:38.730 --> 01:45:40.003 DAVID MALAN: What about it? 01:45:40.003 --> 01:45:40.990 AUDIENCE: [INAUDIBLE] 01:45:40.990 --> 01:45:42.400 DAVID MALAN: Oh yeah, sure. 01:45:42.400 --> 01:45:44.500 So the query I used here. 01:45:44.500 --> 01:45:49.810 There's a lot of Steves whose last name starts with C. Oops, too far. 01:45:49.810 --> 01:45:52.760 The last query I executed was this one here. 01:45:52.760 --> 01:45:57.110 So where the name is like quote unquote "Steve C%." 01:45:57.110 --> 01:45:59.950 So that's just another tool for your toolkit here. 01:45:59.950 --> 01:46:02.830 But you'll perhaps have notice that those two-- 01:46:02.830 --> 01:46:05.380 prior to that query, the joins I did were sort of slow. 01:46:05.380 --> 01:46:07.480 And honestly, this database isn't even that big. 01:46:07.480 --> 01:46:10.120 Like yes, it has tens of thousands of rows in it. 01:46:10.120 --> 01:46:13.630 But in the real world and most of the apps you and I use a lot every day 01:46:13.630 --> 01:46:17.080 or websites, there's millions, even billions of rows of data. 01:46:17.080 --> 01:46:22.210 And if I had to wait on my computer here or my code space a second or two 01:46:22.210 --> 01:46:25.990 to get the data, that's not going to work for millions of users or customers 01:46:25.990 --> 01:46:26.680 certainly. 01:46:26.680 --> 01:46:29.020 So how can we actually improve things? 01:46:29.020 --> 01:46:34.330 Well, it turns out another upside of a proper relational database 01:46:34.330 --> 01:46:38.830 is that it's not just a spreadsheet where the onus is on you to find 01:46:38.830 --> 01:46:40.120 the data you're looking for. 01:46:40.120 --> 01:46:43.780 You can also tell the database to index the data for you. 01:46:43.780 --> 01:46:48.670 An index is an efficient cheat sheet for finding data fast. 01:46:48.670 --> 01:46:52.908 Like books in the real world often have indices at the end of the book 01:46:52.908 --> 01:46:54.700 where you can look things up alphabetically 01:46:54.700 --> 01:46:57.790 and then you can cross reference it for the pages that topic appears on. 01:46:57.790 --> 01:46:59.110 Same idea in a database. 01:46:59.110 --> 01:47:03.190 If you tell the database in advance that you want to search on a certain column 01:47:03.190 --> 01:47:06.700 frequently, you can tell it to build a fancy index that will just 01:47:06.700 --> 01:47:08.770 allow you to search that column faster. 01:47:08.770 --> 01:47:11.560 By default, these columns are going to be searched 01:47:11.560 --> 01:47:13.197 most likely by a linear search. 01:47:13.197 --> 01:47:15.280 Not even binary search, because the data might not 01:47:15.280 --> 01:47:17.390 be sorted because it came in any order. 01:47:17.390 --> 01:47:20.170 But if you create an index, you're probably 01:47:20.170 --> 01:47:22.780 going to get something closer to logarithmic than linear, 01:47:22.780 --> 01:47:24.892 and that's going to be a big plus overall. 01:47:24.892 --> 01:47:26.350 So let me do something simple here. 01:47:26.350 --> 01:47:30.670 First let me turn on a SQLite specific feature that just is going to time all 01:47:30.670 --> 01:47:33.190 of my queries by writing .timer on. 01:47:33.190 --> 01:47:37.515 I just want to keep track of how long each of these commands takes. 01:47:37.515 --> 01:47:40.390 This one is not a slow command, so this is just going to be relative. 01:47:40.390 --> 01:47:43.300 But let's just select everything from the shows table 01:47:43.300 --> 01:47:45.970 where the title thereof is The Office. 01:47:45.970 --> 01:47:48.940 Let's see how long this relatively simple query takes. 01:47:48.940 --> 01:47:50.980 All right, not very long at all. 01:47:50.980 --> 01:47:54.850 In real terms less than a second, 0.035 seconds. 01:47:54.850 --> 01:47:57.070 So not slow by any means. 01:47:57.070 --> 01:48:00.160 But if you've got hundreds, thousands, millions of users, 01:48:00.160 --> 01:48:03.040 every one of those milliseconds could very well add up. 01:48:03.040 --> 01:48:04.640 So can we do better? 01:48:04.640 --> 01:48:06.440 Well, we can if I do this. 01:48:06.440 --> 01:48:11.740 If I use syntax like this once in the beginning of the design of my database, 01:48:11.740 --> 01:48:17.980 I create not a table but an index with some name on a specific table on one 01:48:17.980 --> 01:48:18.820 or more columns. 01:48:18.820 --> 01:48:21.370 I can give a clue, a hint to the database in advance 01:48:21.370 --> 01:48:26.260 saying please optimize with some secret sauce searching or selecting 01:48:26.260 --> 01:48:29.740 on this column in this table so that my searches are faster. 01:48:29.740 --> 01:48:30.860 So let me do this. 01:48:30.860 --> 01:48:32.590 Let me go back to VS Code here. 01:48:32.590 --> 01:48:36.460 Let me create an index called how about title index. 01:48:36.460 --> 01:48:40.240 I could call it anything I want, but I want to search faster on titles. 01:48:40.240 --> 01:48:45.940 So I'm going to call this a title index where rather title index on the table 01:48:45.940 --> 01:48:46.900 called shows. 01:48:46.900 --> 01:48:49.150 And then in parentheses is the syntax. 01:48:49.150 --> 01:48:50.620 The column called title. 01:48:50.620 --> 01:48:53.350 So again, I've just borrowed this canonical syntax 01:48:53.350 --> 01:48:57.590 and I've just translated it into something that's TV show specific. 01:48:57.590 --> 01:48:58.090 All right. 01:48:58.090 --> 01:49:00.040 What is this going to do for me? 01:49:00.040 --> 01:49:03.700 Once I hit Enter, this is going to create in the computer's memory, 01:49:03.700 --> 01:49:05.860 the database's memory something called a B-tree. 01:49:05.860 --> 01:49:07.330 It's not a binary tree. 01:49:07.330 --> 01:49:11.627 A B-tree is actually a potentially more efficient data structure 01:49:11.627 --> 01:49:13.960 that we didn't talk about a few weeks back in week five, 01:49:13.960 --> 01:49:15.710 but it looks a little something like this, 01:49:15.710 --> 01:49:17.680 which looks similar to a binary tree. 01:49:17.680 --> 01:49:21.847 But does anyone notice what makes this not a binary tree? 01:49:21.847 --> 01:49:26.405 AUDIENCE: [INAUDIBLE] 01:49:26.405 --> 01:49:27.280 DAVID MALAN: Exactly. 01:49:27.280 --> 01:49:31.570 Binary tree, bi implying two, has no more than two children per node, 01:49:31.570 --> 01:49:33.580 but here's a perfect example, one, two, three. 01:49:33.580 --> 01:49:35.990 And there could be four children, five children or more. 01:49:35.990 --> 01:49:38.530 But the effect of that, if you have a very wide tree, 01:49:38.530 --> 01:49:40.780 the upside is that it's very short. 01:49:40.780 --> 01:49:44.560 It pulls the data higher up closer to the node, to the root node. 01:49:44.560 --> 01:49:48.080 And recall that the root node is where we began our searches in the past, 01:49:48.080 --> 01:49:51.620 whether it was a BST, a Binary Search Tree, even a tri or other data 01:49:51.620 --> 01:49:52.120 structures. 01:49:52.120 --> 01:49:53.380 We always began at the top. 01:49:53.380 --> 01:49:55.780 So the higher up you can pull the data, even 01:49:55.780 --> 01:49:57.617 if it makes the data structure very wide, 01:49:57.617 --> 01:50:00.700 you're going to be able to do boom, boom, boom, look up queries or look up 01:50:00.700 --> 01:50:03.340 data probably much faster certainly than if it's just 01:50:03.340 --> 01:50:06.590 a very long list like a column by default. 01:50:06.590 --> 01:50:09.970 So with that said, let me go back to VS Code. 01:50:09.970 --> 01:50:11.380 I didn't create the index yet. 01:50:11.380 --> 01:50:13.277 Let me go ahead and hit Enter and create it. 01:50:13.277 --> 01:50:14.860 All right, it took a minute, a moment. 01:50:14.860 --> 01:50:17.920 It took like half a second, which obviously is not that slow. 01:50:17.920 --> 01:50:20.380 But with more data, that could have been even slower. 01:50:20.380 --> 01:50:22.600 But it's a one time operation as of now. 01:50:22.600 --> 01:50:27.910 And now let me hit up and let me select the same data from shows 01:50:27.910 --> 01:50:29.350 where title equals The Office. 01:50:29.350 --> 01:50:34.360 Last time just a moment ago it took 0.035 seconds. 01:50:34.360 --> 01:50:38.920 Not slow but also that's going to add up if I have lots of users of IMDb. 01:50:38.920 --> 01:50:42.040 Let's go ahead now and execute the same query again. 01:50:42.040 --> 01:50:44.700 How long did that take? 01:50:44.700 --> 01:50:47.010 0.001 seconds now. 01:50:47.010 --> 01:50:48.720 I mean, practically nothing. 01:50:48.720 --> 01:50:50.970 And so that's the sort of opportunity now. 01:50:50.970 --> 01:50:53.470 When you've got lots of data and you want to really speed up 01:50:53.470 --> 01:50:56.130 these searches, these indexes, these indices that just create 01:50:56.130 --> 01:50:59.130 for you these magical data structures in the databases memory, 01:50:59.130 --> 01:51:02.430 it allows you to search on columns that you are pretty sure you 01:51:02.430 --> 01:51:04.150 want to search on more effectively. 01:51:04.150 --> 01:51:07.380 Now, by contrast, if you've ever used Google or Bing or some search 01:51:07.380 --> 01:51:10.590 engine that has advanced search, some of those text boxes 01:51:10.590 --> 01:51:13.470 that you can search more precisely in might very well be slower. 01:51:13.470 --> 01:51:14.190 Why? 01:51:14.190 --> 01:51:16.950 Well, probably you don't want to go crazy and just index 01:51:16.950 --> 01:51:19.380 every column on every table. 01:51:19.380 --> 01:51:21.020 Why? 01:51:21.020 --> 01:51:25.070 What might be the intuition? 01:51:25.070 --> 01:51:28.490 If logically indexes speed things up, why not index everything? 01:51:28.490 --> 01:51:30.710 There's always going to be a trade off here. 01:51:30.710 --> 01:51:33.040 What might that be? 01:51:33.040 --> 01:51:34.000 Yeah. 01:51:34.000 --> 01:51:34.875 AUDIENCE: [INAUDIBLE] 01:51:34.875 --> 01:51:37.167 DAVID MALAN: Yeah, it's going to take a lot of storage. 01:51:37.167 --> 01:51:38.960 This is just a slide on the screen. 01:51:38.960 --> 01:51:41.020 But this has to go somewhere. 01:51:41.020 --> 01:51:43.360 This needs space in the computer's memory 01:51:43.360 --> 01:51:44.740 or on the hard drive or the like. 01:51:44.740 --> 01:51:48.777 And that's fine if you have unlimited space, but odds are you don't. 01:51:48.777 --> 01:51:51.110 And that's going to get expensive for different reasons. 01:51:51.110 --> 01:51:55.810 So maybe you only want to index certain columns and certain tables 01:51:55.810 --> 01:51:56.860 and not all of them. 01:51:56.860 --> 01:51:57.777 Because you know what? 01:51:57.777 --> 01:51:59.590 What even if a user really wants to search 01:51:59.590 --> 01:52:05.020 maybe via advanced search on some other column or table altogether, fine. 01:52:05.020 --> 01:52:07.240 If once in a while a query is slow, we're 01:52:07.240 --> 01:52:11.110 probably getting the bigger bang for our buck by optimizing the common cases, 01:52:11.110 --> 01:52:16.030 the more popular queries that people actually care about too. 01:52:16.030 --> 01:52:16.640 All right. 01:52:16.640 --> 01:52:22.450 So let's come full circle and bring this now back to how we actually 01:52:22.450 --> 01:52:25.030 began, which was with some Python code. 01:52:25.030 --> 01:52:27.550 So it turns out these are not either or decisions. 01:52:27.550 --> 01:52:29.740 It turns out in the real world, developers 01:52:29.740 --> 01:52:33.320 are constantly using one, two, three languages at once. 01:52:33.320 --> 01:52:36.940 And in fact, next week I rattled off HTML, CSS, and JavaScript, one of which 01:52:36.940 --> 01:52:39.310 is a proper programming language, but those languages 01:52:39.310 --> 01:52:40.510 are often used together. 01:52:40.510 --> 01:52:46.300 Totally normal and common to use Python and SQL or Java and SQL 01:52:46.300 --> 01:52:49.300 or SWIFT and SQL or any number of different combinations 01:52:49.300 --> 01:52:50.710 with a database language. 01:52:50.710 --> 01:52:53.920 You might use your preferred programming language, Java, Python, 01:52:53.920 --> 01:53:00.760 C++ to create the user interface and the logic that implements the program 01:53:00.760 --> 01:53:01.330 itself. 01:53:01.330 --> 01:53:04.092 But for your data, SQL's a really good candidate. 01:53:04.092 --> 01:53:07.300 And indeed, we've seen already that SQL can just speed up certain operations. 01:53:07.300 --> 01:53:08.140 You can change. 01:53:08.140 --> 01:53:11.200 You can collapse 15 lines of code into just one 01:53:11.200 --> 01:53:13.430 and you can use these things together. 01:53:13.430 --> 01:53:14.740 So let me come back to-- 01:53:14.740 --> 01:53:16.720 I'm going to quit out of SQLite. 01:53:16.720 --> 01:53:18.580 I'm going to minimize my terminal window. 01:53:18.580 --> 01:53:22.030 And here's where we left off before with favorites.py. 01:53:22.030 --> 01:53:27.430 With favorites.py, everything was being stored in favorites.csv. 01:53:27.430 --> 01:53:32.260 And recall that we eventually imported that CSV file into favorites.db 01:53:32.260 --> 01:53:35.920 automatically with .import just so we could start playing around with SQL. 01:53:35.920 --> 01:53:37.990 But we can now tie these two together. 01:53:37.990 --> 01:53:40.690 And a way to do that is as follows. 01:53:40.690 --> 01:53:42.700 CS50 has a library for Python. 01:53:42.700 --> 01:53:47.350 You might recall having available get string, get int, get float. 01:53:47.350 --> 01:53:49.360 You don't strictly need to use them in Python 01:53:49.360 --> 01:53:51.970 because it's much easier to just use the input function 01:53:51.970 --> 01:53:55.570 and then try, accept, and convert things to int or float or the like. 01:53:55.570 --> 01:54:00.520 But it's a lot more work to use SQL in Python without a third party library. 01:54:00.520 --> 01:54:03.370 A lot of the commercial options or popular open source options 01:54:03.370 --> 01:54:05.200 are actually just complicated to use. 01:54:05.200 --> 01:54:09.790 So CS50 does have a very useful function inside of its library for Python 01:54:09.790 --> 01:54:12.640 that you should use and must use for the problem set that 01:54:12.640 --> 01:54:18.100 just makes it easy to execute Python, execute SQL inside of your Python code. 01:54:18.100 --> 01:54:22.360 But it's built on top of a very popular open source alternative. 01:54:22.360 --> 01:54:24.350 So you can use that too in the real world. 01:54:24.350 --> 01:54:26.800 So the documentation for that is at this URL here, 01:54:26.800 --> 01:54:30.940 but I'll show you what we need to know here by focusing back on favorites.py. 01:54:30.940 --> 01:54:35.540 So what I'm going to do here is follows is this. 01:54:35.540 --> 01:54:42.550 Let me delete everything from favorites.py except for let's say this. 01:54:42.550 --> 01:54:45.970 From CS50 import SQL in all caps. 01:54:45.970 --> 01:54:49.060 So that's importing a SQL feature from CS50's library 01:54:49.060 --> 01:54:53.200 that's going to allow me to open a DB file in code. 01:54:53.200 --> 01:54:54.380 How do I do that? 01:54:54.380 --> 01:54:56.650 Well, let me create a variable called DB for database, 01:54:56.650 --> 01:54:58.275 though I could call it anything I want. 01:54:58.275 --> 01:55:02.470 Let me call this SQL function and pass in using special syntax that's 01:55:02.470 --> 01:55:03.640 not CS50 specific. 01:55:03.640 --> 01:55:09.640 It's an industry thing. sqlite:///. 01:55:09.640 --> 01:55:14.640 Unlike every other URL you type, this one literally has three in this context 01:55:14.640 --> 01:55:15.140 here. 01:55:15.140 --> 01:55:18.340 And then the name of the database, which in this case is favorites.db. 01:55:18.340 --> 01:55:22.450 So this is just a way of telling this SQL library that we wrote 01:55:22.450 --> 01:55:27.640 but that works exactly like third party alternatives open favorites.db using 01:55:27.640 --> 01:55:30.380 the SQLite technology, if you will. 01:55:30.380 --> 01:55:30.880 All right. 01:55:30.880 --> 01:55:32.338 Let's just ask the user a question. 01:55:32.338 --> 01:55:36.057 Give me your favorite problem. 01:55:36.057 --> 01:55:38.140 So we're going to use input instead of get string, 01:55:38.140 --> 01:55:41.450 but we could use get string, but they're pretty much the same for our purposes. 01:55:41.450 --> 01:55:43.150 Let's ask the user for their favorite. 01:55:43.150 --> 01:55:49.960 And now in Python code, let us select from favorites.db 01:55:49.960 --> 01:55:55.450 all of the rows where students specify that problem as their favorite. 01:55:55.450 --> 01:55:57.610 So in SQL alone, it would be this. 01:55:57.610 --> 01:56:04.210 Select star from favorites where problem equals 01:56:04.210 --> 01:56:08.170 and I'll do, well, whatever my favorite's going to be. 01:56:08.170 --> 01:56:10.288 Like problem equals Mario, for instance. 01:56:10.288 --> 01:56:13.330 So if I were just using SQL, I would literally write something like that. 01:56:13.330 --> 01:56:15.130 But I'm in a .py file now. 01:56:15.130 --> 01:56:16.870 I have to use Python syntax. 01:56:16.870 --> 01:56:18.550 But Python supports strings. 01:56:18.550 --> 01:56:20.090 SQL is just text. 01:56:20.090 --> 01:56:20.990 It's just a string. 01:56:20.990 --> 01:56:24.220 So I could certainly just put my SQL code in a string 01:56:24.220 --> 01:56:26.800 perhaps and then pass it to a Python function. 01:56:26.800 --> 01:56:28.780 And here's the bridge between the two. 01:56:28.780 --> 01:56:31.960 If you just treat SQL as any old text, we can put it in a string 01:56:31.960 --> 01:56:32.840 and execute it. 01:56:32.840 --> 01:56:34.837 So let me actually do this. 01:56:34.837 --> 01:56:36.670 Let me go ahead and create a variable called 01:56:36.670 --> 01:56:40.130 rows, which is eventually going to contain all the rows from the database. 01:56:40.130 --> 01:56:45.790 Let me go ahead and select db.execute. 01:56:45.790 --> 01:56:49.030 This is the one function you need to know about inside of CS50's library, 01:56:49.030 --> 01:56:51.310 and it literally executes a SQL statement. 01:56:51.310 --> 01:56:56.140 And then in quotes, you pass it literally what you want to execute. 01:56:56.140 --> 01:56:58.990 And let me go ahead and close the parenthesis at the end there. 01:56:58.990 --> 01:57:00.740 And now let me just try this. 01:57:00.740 --> 01:57:05.260 So for row in rows, let's iterate over all of the rows, let me go ahead 01:57:05.260 --> 01:57:12.370 and print out how about row, quote unquote. 01:57:12.370 --> 01:57:15.700 And what do I want here? 01:57:15.700 --> 01:57:20.020 Let's print out the timestamp of that person for kicks. 01:57:20.020 --> 01:57:22.120 All right, let me open my terminal window. 01:57:22.120 --> 01:57:23.890 Python of favorites.py. 01:57:23.890 --> 01:57:27.490 Crossing my fingers here for sure. 01:57:27.490 --> 01:57:27.990 Enter. 01:57:31.200 --> 01:57:31.770 There we go. 01:57:31.770 --> 01:57:32.280 Favorites. 01:57:32.280 --> 01:57:34.500 I'll type in Mario. 01:57:34.500 --> 01:57:35.070 OK. 01:57:35.070 --> 01:57:36.240 So I got back-- 01:57:36.240 --> 01:57:37.980 it's not very interesting, but I got back 01:57:37.980 --> 01:57:40.830 all of the timestamps of students who typed in Mario that we 01:57:40.830 --> 01:57:42.330 imported into this database. 01:57:42.330 --> 01:57:45.360 Well, what I really care about is how popular Mario is. 01:57:45.360 --> 01:57:46.830 So let me change this a little bit. 01:57:46.830 --> 01:57:50.490 Let me change this to count the number of rows. 01:57:50.490 --> 01:57:52.080 And let me keep it simple. 01:57:52.080 --> 01:57:56.260 Let me give an alias like I proposed earlier like as n, where n is a number. 01:57:56.260 --> 01:58:00.090 So that now down here, I can actually just do this. 01:58:00.090 --> 01:58:02.230 Print out the value of n. 01:58:02.230 --> 01:58:02.730 All right. 01:58:02.730 --> 01:58:04.450 Let me go back to my terminal window. 01:58:04.450 --> 01:58:05.910 Run Python to favorites.py. 01:58:05.910 --> 01:58:07.140 Let me type in Mario. 01:58:07.140 --> 01:58:08.070 Enter. 01:58:08.070 --> 01:58:09.660 OK, 39. 01:58:09.660 --> 01:58:11.790 Now, technically I'm cheating. 01:58:11.790 --> 01:58:14.430 Honestly if I'm executing select count, we've 01:58:14.430 --> 01:58:17.752 seen before it only ever returns one row, not multiple. 01:58:17.752 --> 01:58:20.460 So there's really nothing to iterate over, but it's working fine. 01:58:20.460 --> 01:58:22.560 It's just iterating once, but I'm getting lucky. 01:58:22.560 --> 01:58:26.100 So technically what I should probably just do is this. 01:58:26.100 --> 01:58:29.220 I should probably give myself a variable called row, 01:58:29.220 --> 01:58:34.770 set it equal to the very first row and only row that came back, 01:58:34.770 --> 01:58:39.510 and now print out that rows and column. 01:58:39.510 --> 01:58:40.890 Let me rerun the program. 01:58:40.890 --> 01:58:42.130 I'll type in Mario again. 01:58:42.130 --> 01:58:42.660 Enter. 01:58:42.660 --> 01:58:44.980 And I still see 39. 01:58:44.980 --> 01:58:48.328 So of course, I don't strictly need to do this. 01:58:48.328 --> 01:58:49.620 I don't really need a variable. 01:58:49.620 --> 01:58:52.590 I can do rows bracket 0 instead. 01:58:52.590 --> 01:58:54.840 But let me focus on what this library is now doing. 01:58:54.840 --> 01:59:01.620 So per the documentation, what the CS50 execute function always does for you is 01:59:01.620 --> 01:59:05.620 it returns a list of dictionaries. 01:59:05.620 --> 01:59:10.558 So if your query returns nothing, like no matches, you get back an empty list. 01:59:10.558 --> 01:59:12.600 Like open bracket, closed bracket, nothing in it. 01:59:12.600 --> 01:59:14.520 Any loop is not going to execute anything useful, 01:59:14.520 --> 01:59:15.770 because there's nothing in it. 01:59:15.770 --> 01:59:18.660 If, though, you get back one row, you're going 01:59:18.660 --> 01:59:22.740 to get back a list of size one inside of which is a single dictionary. 01:59:22.740 --> 01:59:24.930 That dictionary is going to have keys that 01:59:24.930 --> 01:59:29.200 correspond to whatever you selected, be it the columns or the count. 01:59:29.200 --> 01:59:32.880 So when I selected star before, I would have gotten all of the columns. 01:59:32.880 --> 01:59:34.950 That's how I was able to access timestamp. 01:59:34.950 --> 01:59:37.230 Here I'm just selecting count and I don't 01:59:37.230 --> 01:59:38.910 want to have to type this down here. 01:59:38.910 --> 01:59:40.230 That would just look kind of atrocious. 01:59:40.230 --> 01:59:42.355 It would work, but it would look weird to just keep 01:59:42.355 --> 01:59:44.640 retyping count paren star close paren. 01:59:44.640 --> 01:59:49.530 So I just created an alias called n just to make my life easier or cleaner down 01:59:49.530 --> 01:59:50.380 here. 01:59:50.380 --> 01:59:55.680 So to be clear, the CS50 execute function returns a list of dictionaries 01:59:55.680 --> 01:59:57.360 when you're using select. 01:59:57.360 --> 02:00:01.500 And that is how I can now get back the first and only row 02:00:01.500 --> 02:00:04.860 and then print out that row's end value. 02:00:04.860 --> 02:00:07.380 It is identical to-- 02:00:07.380 --> 02:00:08.650 let me do this. 02:00:08.650 --> 02:00:11.610 Let me highlight this whole line of text. 02:00:11.610 --> 02:00:15.120 Let me in my terminal window run SQLite3 of favorites.db 02:00:15.120 --> 02:00:16.470 like we did before break. 02:00:16.470 --> 02:00:19.170 Let me just copy paste this query. 02:00:19.170 --> 02:00:20.040 Enter. 02:00:20.040 --> 02:00:24.550 That's the table I got back earlier when we played with SQL manually. 02:00:24.550 --> 02:00:28.170 And so when I get back this table, here's the key, here's the value, 02:00:28.170 --> 02:00:31.860 and I only have one row, which is why I'm just blindly indexing 02:00:31.860 --> 02:00:34.080 into rows bracket 0, because I know there's always 02:00:34.080 --> 02:00:35.247 going to be an answer there. 02:00:35.247 --> 02:00:36.690 It's going to be 0 or 1 or more. 02:00:36.690 --> 02:00:40.360 But I know now it's going to be called n because of this here. 02:00:40.360 --> 02:00:41.580 So what have I just done? 02:00:41.580 --> 02:00:43.113 Well, this is SQL down here. 02:00:43.113 --> 02:00:45.780 And this is just me being like a data scientist asking questions 02:00:45.780 --> 02:00:48.640 about my data just using black and white SQL queries. 02:00:48.640 --> 02:00:52.410 This is me now being a Python programmer who wants to talk to a SQL database 02:00:52.410 --> 02:00:53.700 using Python. 02:00:53.700 --> 02:00:56.820 And the bridge we're using happens to be the CS50 library. 02:00:56.820 --> 02:00:59.940 But again, there's third party free libraries you can also use as well. 02:00:59.940 --> 02:01:01.710 Ours is just very simple. 02:01:01.710 --> 02:01:03.660 And indeed, the documentation will explain 02:01:03.660 --> 02:01:07.113 how execute behaves a little differently for inserts, updates, and deletes. 02:01:07.113 --> 02:01:09.780 You don't get back a list because you're not selecting anything, 02:01:09.780 --> 02:01:13.050 but you do get back some return values. 02:01:13.050 --> 02:01:15.990 Questions on this? 02:01:15.990 --> 02:01:18.030 That's the last of our Python code. 02:01:18.030 --> 02:01:23.320 That ties everything together in spirit. 02:01:23.320 --> 02:01:24.768 Yeah? 02:01:24.768 --> 02:01:29.460 AUDIENCE: [INAUDIBLE] 02:01:29.460 --> 02:01:31.810 DAVID MALAN: This one here? 02:01:31.810 --> 02:01:32.320 Yes. 02:01:32.320 --> 02:01:37.558 So db.execute by definition returns a list of rows. 02:01:37.558 --> 02:01:40.600 And each of those rows happens to be a dictionary because its convenient. 02:01:40.600 --> 02:01:42.100 Key value pairs. 02:01:42.100 --> 02:01:45.040 If I'm selecting the count of rows, I just 02:01:45.040 --> 02:01:48.250 know from having learned SQL an hour ago that this is always 02:01:48.250 --> 02:01:52.540 going to give me a single row whose column in this case is called n. 02:01:52.540 --> 02:01:56.470 So if I know it's a single row, I can just blindly, just like in C, 02:01:56.470 --> 02:02:00.160 go into that list or an array in C and go to the first location 02:02:00.160 --> 02:02:02.590 and then treat that as the single row. 02:02:02.590 --> 02:02:04.330 What you don't want to do is this. 02:02:04.330 --> 02:02:08.500 Even if you the human know the query returns one row, 02:02:08.500 --> 02:02:10.780 you can't just magically change the variable name 02:02:10.780 --> 02:02:13.270 to be singular and expect to have only one value. 02:02:13.270 --> 02:02:14.960 You will always have a list. 02:02:14.960 --> 02:02:18.190 So even if there is only one value in it, it's up to you to do something 02:02:18.190 --> 02:02:19.780 like this to get at it. 02:02:19.780 --> 02:02:24.070 Or if you prefer more succinctness, you can do rows bracket I bracket n. 02:02:24.070 --> 02:02:27.220 That'll achieve the same thing without a variable. 02:02:27.220 --> 02:02:29.095 Yeah? 02:02:29.095 --> 02:02:33.970 AUDIENCE: [INAUDIBLE] 02:02:33.970 --> 02:02:34.720 DAVID MALAN: Good. 02:02:34.720 --> 02:02:37.150 So I have been misleading this whole time 02:02:37.150 --> 02:02:40.690 and cheating because this is only ever going to return Mario. 02:02:40.690 --> 02:02:44.530 I'm ignoring the favorite that the human typed in here on line five. 02:02:44.530 --> 02:02:45.770 So let me fix that. 02:02:45.770 --> 02:02:49.000 And that's going to lead us to some of the problems that arise ultimately 02:02:49.000 --> 02:02:49.780 with SQL. 02:02:49.780 --> 02:02:53.320 The right way to solve that problem-- let me get rid of my terminal window 02:02:53.320 --> 02:02:53.890 here. 02:02:53.890 --> 02:02:56.170 The right way to solve this problem is not 02:02:56.170 --> 02:02:59.170 to use an fstring like we did in Python generally, 02:02:59.170 --> 02:03:02.680 because SQL queries, as we'll see in a moment, can be dangerous. 02:03:02.680 --> 02:03:06.940 When you want to plug in users' data into a query 02:03:06.940 --> 02:03:10.540 that you've written most of in advance, you should, you 02:03:10.540 --> 02:03:15.640 must, you had better use a placeholder, namely a question mark in this case. 02:03:15.640 --> 02:03:18.340 This is somewhat specific to CS50's library, 02:03:18.340 --> 02:03:21.910 but we just borrowed the convention that every other library uses too. 02:03:21.910 --> 02:03:25.660 In the world of SQL, single question marks are used as placeholders. 02:03:25.660 --> 02:03:28.190 And the way you do this is as follows. 02:03:28.190 --> 02:03:32.020 If you want to plug-in a value for that question mark, 02:03:32.020 --> 02:03:37.180 just like in printf in C, you specify as a second or a third or fourth argument 02:03:37.180 --> 02:03:39.320 all of the values you want plugged into this. 02:03:39.320 --> 02:03:42.280 So in C weeks ago, we were using %s. 02:03:42.280 --> 02:03:43.810 Same exact idea. 02:03:43.810 --> 02:03:46.810 In SQL it's a question mark that you use instead. 02:03:46.810 --> 02:03:49.480 This now, if I open back my terminal window 02:03:49.480 --> 02:03:55.120 and I run Python of favorites.py, type in Mario, I should still get 39. 02:03:55.120 --> 02:03:58.750 But now I can also type in Scratch perhaps and get 44 02:03:58.750 --> 02:04:00.520 for that very first piece at 0. 02:04:00.520 --> 02:04:02.900 And that one is even more popular here. 02:04:02.900 --> 02:04:04.150 So this now is correct. 02:04:04.150 --> 02:04:09.560 It would work to use an fstring here and then plug in a value like favorite 02:04:09.560 --> 02:04:10.060 here. 02:04:10.060 --> 02:04:11.977 But you'll see in just a moment don't do that. 02:04:11.977 --> 02:04:15.640 You will expose yourself to potential hack or attacks 02:04:15.640 --> 02:04:17.910 by trusting the user's input. 02:04:17.910 --> 02:04:19.660 And so in fact, let's transition from that 02:04:19.660 --> 02:04:21.610 to exactly some of these kinds of challenges, 02:04:21.610 --> 02:04:24.020 namely two before we wrap up. 02:04:24.020 --> 02:04:27.820 So in the world of SQL, especially when it's used at scale with the Twitters 02:04:27.820 --> 02:04:30.610 and the Googles of the world, a lot of data 02:04:30.610 --> 02:04:32.610 is probably coming into the database all at once 02:04:32.610 --> 02:04:34.610 because multiple people are opening their phones 02:04:34.610 --> 02:04:36.070 at the same time around the world. 02:04:36.070 --> 02:04:37.690 They're clicking on the same links roughly 02:04:37.690 --> 02:04:39.107 at the same time around the world. 02:04:39.107 --> 02:04:42.235 When you have thousands of people all using your site at once, 02:04:42.235 --> 02:04:44.110 order of operations is going to be important. 02:04:44.110 --> 02:04:47.620 But unfortunately in SQL and in other contexts of computing, 02:04:47.620 --> 02:04:50.150 there's this risk of what's known as a race condition. 02:04:50.150 --> 02:04:53.710 So for instance, has anyone ever seen or liked this? 02:04:53.710 --> 02:04:55.780 This is the world record egg. 02:04:55.780 --> 02:04:58.467 Or it's this thing that was very popular a while back. 02:04:58.467 --> 02:04:59.800 It's still kind of going strong. 02:04:59.800 --> 02:05:03.340 But if you go to the Instagram profile for World Record Egg, 02:05:03.340 --> 02:05:06.610 the goal was to make the most liked Instagram post ever. 02:05:06.610 --> 02:05:08.000 And they did pretty well. 02:05:08.000 --> 02:05:09.170 It's just this. 02:05:09.170 --> 02:05:10.790 It's just a picture of an egg. 02:05:10.790 --> 02:05:12.570 Now, at the height of the popularity, like 02:05:12.570 --> 02:05:15.070 there might have been hundreds, thousands, tens of thousands 02:05:15.070 --> 02:05:18.520 of people clicking pretty much at the same time on this egg. 02:05:18.520 --> 02:05:20.830 So it actually creates a potential problem 02:05:20.830 --> 02:05:23.470 with the integrity of Instagram's data. 02:05:23.470 --> 02:05:24.010 Why? 02:05:24.010 --> 02:05:26.302 Well, if you have all these requests coming in at once, 02:05:26.302 --> 02:05:28.250 how do you possibly keep track of all of them 02:05:28.250 --> 02:05:31.850 and update your counter in a way that can keep up with all of that traffic? 02:05:31.850 --> 02:05:32.350 Why? 02:05:32.350 --> 02:05:35.680 Well, let's just hypothesize what Meta, formerly Facebook, 02:05:35.680 --> 02:05:38.860 was doing underneath the hood with Instagram if this were their code. 02:05:38.860 --> 02:05:41.890 So suppose for the sake of discussion that Instagram servers 02:05:41.890 --> 02:05:44.617 are using a mix of Python and SQL. 02:05:44.617 --> 02:05:47.200 Probably not using the CS50 library, but they could absolutely 02:05:47.200 --> 02:05:50.680 be using those two languages or two others together. 02:05:50.680 --> 02:05:55.270 Suppose they do this in order to update the number of likes for that post. 02:05:55.270 --> 02:06:00.160 They first execute a SQL query like select the current number of likes 02:06:00.160 --> 02:06:03.580 from a table called posts where the idea of the post 02:06:03.580 --> 02:06:08.530 equals whatever the unique identifier is for that specific egg in the table. 02:06:08.530 --> 02:06:12.460 And then they store the result in this rows variable, just like I did. 02:06:12.460 --> 02:06:14.830 And then they do this. 02:06:14.830 --> 02:06:16.420 They create a variable called likes. 02:06:16.420 --> 02:06:18.700 They set it equal to rows bracket 0. 02:06:18.700 --> 02:06:21.280 So the very first row in the result set. 02:06:21.280 --> 02:06:22.450 And they get the likes key. 02:06:22.450 --> 02:06:25.030 So this is literally what I just did with the count. 02:06:25.030 --> 02:06:27.880 Let me hypothesize that Instagram does something similar 02:06:27.880 --> 02:06:29.330 with the total number of likes. 02:06:29.330 --> 02:06:30.370 Why are they doing this? 02:06:30.370 --> 02:06:34.600 Because they then want to execute a third line of code that executes 02:06:34.600 --> 02:06:36.430 update the posts table. 02:06:36.430 --> 02:06:41.920 Set the new number of likes equal to something where the idea of the post 02:06:41.920 --> 02:06:43.330 equals this other thing. 02:06:43.330 --> 02:06:46.870 Now, notice just like in printf there's the comma separated list of values. 02:06:46.870 --> 02:06:50.560 They want to update the current number of likes from the current value 02:06:50.560 --> 02:06:51.970 to the current value plus 1. 02:06:51.970 --> 02:06:53.350 So it's likes plus 1. 02:06:53.350 --> 02:06:55.640 And then we plug in the ID for this. 02:06:55.640 --> 02:06:57.760 So suppose this is what Instagram is doing. 02:06:57.760 --> 02:07:02.860 Unfortunately, whenever you execute multiple lines of code independently 02:07:02.860 --> 02:07:06.260 and you're so popular like Instagram that you have thousands, 02:07:06.260 --> 02:07:08.530 hundreds of thousands of servers potentially, 02:07:08.530 --> 02:07:11.470 it is quite possible that if you and I and everyone else in the room 02:07:11.470 --> 02:07:13.630 clicks that egg at the same time, it's not 02:07:13.630 --> 02:07:17.093 going to be the case statistically that three lines of code are executed for me 02:07:17.093 --> 02:07:19.510 and then three lines for you and then three lines for you. 02:07:19.510 --> 02:07:21.490 They're probably going to get interspersed. 02:07:21.490 --> 02:07:24.430 This gets executed for me and then this gets executed for you 02:07:24.430 --> 02:07:27.520 and then they get back to doing work for me and so forth just to kind 02:07:27.520 --> 02:07:30.820 of multitask, just like a human might, but at a super speed here. 02:07:30.820 --> 02:07:33.590 The problem, though, is if these lines of code 02:07:33.590 --> 02:07:35.838 get interrupted, what could go wrong? 02:07:35.838 --> 02:07:38.630 Well, suppose that Carter and I both click the egg at the same time 02:07:38.630 --> 02:07:41.780 and suppose the current number of likes back in the day is 100. 02:07:41.780 --> 02:07:45.150 That stores in this variable the value 100. 02:07:45.150 --> 02:07:49.010 But if we click so close in time, we might get back the same answer 02:07:49.010 --> 02:07:50.300 to this select query. 02:07:50.300 --> 02:07:54.380 As of that moment in time when David and Carter clicked, it had 100 likes. 02:07:54.380 --> 02:07:58.220 But then this last line of code is executed for me and then maybe Carter. 02:07:58.220 --> 02:08:03.380 Because that answer, the state of the database, was stored in this variable, 02:08:03.380 --> 02:08:08.870 then both Carter and I will result in this line of code 02:08:08.870 --> 02:08:10.610 being executed with the same value. 02:08:10.610 --> 02:08:17.480 Update the post table setting the likes equal to 101 for that post's ID. 02:08:17.480 --> 02:08:18.080 Why? 02:08:18.080 --> 02:08:21.950 Because again, if each of these lines of code running on different servers 02:08:21.950 --> 02:08:25.213 are checking the value of the current number of likes 02:08:25.213 --> 02:08:28.130 but then getting interrupted because Carter clicked the darn thing too 02:08:28.130 --> 02:08:30.350 and then resuming their work on my behalf, 02:08:30.350 --> 02:08:33.890 we might have a race condition where the code is sort of racing to finish 02:08:33.890 --> 02:08:36.320 but getting interrupted by other users' clicks. 02:08:36.320 --> 02:08:38.720 And the problem with that is that if you are 02:08:38.720 --> 02:08:42.710 inspecting the value of some variable, or in this case a database cell, 02:08:42.710 --> 02:08:45.650 and making a decision based on it, like how to update it, 02:08:45.650 --> 02:08:46.995 you might now lose data. 02:08:46.995 --> 02:08:50.120 And Instagram is probably not good for advertising if they're losing likes. 02:08:50.120 --> 02:08:54.620 And so that's probably a problem not to retain the value 102 02:08:54.620 --> 02:08:57.230 and instead insert the number 101 twice. 02:08:57.230 --> 02:09:00.500 It's actually similar in spirit to a story that 02:09:00.500 --> 02:09:05.960 was told in a databases course I took myself years ago whereby-- 02:09:05.960 --> 02:09:08.900 it's somewhat analogous to kind of a contrived scenario 02:09:08.900 --> 02:09:10.070 involving a refrigerator. 02:09:10.070 --> 02:09:12.380 And this is the closest thing to a refrigerator we could get on stage. 02:09:12.380 --> 02:09:14.420 But imagine you've got one of these little dorm fridges 02:09:14.420 --> 02:09:16.040 in your dorm too and your roommate. 02:09:16.040 --> 02:09:19.102 And maybe both of you, as the story was told to me, really like milk. 02:09:19.102 --> 02:09:21.560 And one of you is at class, but the other of you comes home 02:09:21.560 --> 02:09:25.500 and you open your dorm fridge and you're like, oh darn it, we're out of milk. 02:09:25.500 --> 02:09:27.020 And so you close the fridge. 02:09:27.020 --> 02:09:29.720 You walk across the street to CVS or some other store 02:09:29.720 --> 02:09:31.490 and you get in line to buy some milk. 02:09:31.490 --> 02:09:33.020 Meanwhile, your roommate gets out of class. 02:09:33.020 --> 02:09:34.160 They come back to your dorm room. 02:09:34.160 --> 02:09:35.702 They're really thirsty for some milk. 02:09:35.702 --> 02:09:37.010 They open up the fridge. 02:09:37.010 --> 02:09:38.510 They say, oh, we're out of milk. 02:09:38.510 --> 02:09:41.630 And then they take a different route perhaps to CVS 02:09:41.630 --> 02:09:44.300 or some other store nearby, get in line to buy some milk. 02:09:44.300 --> 02:09:48.260 Fast forward some amount of time in this very contrived story and what happens? 02:09:48.260 --> 02:09:50.570 Oh damn it, we now ended up with two gallons of milk 02:09:50.570 --> 02:09:54.480 and there's no way we can fit gallons of milk in there, let alone two of them. 02:09:54.480 --> 02:09:56.000 So that's a problem. 02:09:56.000 --> 02:10:00.000 But what's the relationship to this here? 02:10:00.000 --> 02:10:02.693 Well both of us, yeah, did what? 02:10:02.693 --> 02:10:07.523 AUDIENCE: [INAUDIBLE] 02:10:16.685 --> 02:10:17.560 DAVID MALAN: Exactly. 02:10:17.560 --> 02:10:19.925 AUDIENCE: [INAUDIBLE] 02:10:22.965 --> 02:10:23.840 DAVID MALAN: Exactly. 02:10:23.840 --> 02:10:26.480 So to summarize, both of us had a very similar thought process, 02:10:26.480 --> 02:10:28.820 made a similar decision based on the same information, 02:10:28.820 --> 02:10:31.760 not realizing that the information, the fridge, 02:10:31.760 --> 02:10:34.600 was in the process of being updated. 02:10:34.600 --> 02:10:37.100 And of course, in the Instagram world, it happens like this. 02:10:37.100 --> 02:10:39.300 In the fridge world, it might take a few minutes. 02:10:39.300 --> 02:10:41.840 But the problem is ultimately the result of our 02:10:41.840 --> 02:10:46.070 having made a decision about the state of the world and the state of the world 02:10:46.070 --> 02:10:48.600 was in the middle of being updated. 02:10:48.600 --> 02:10:50.333 The queries got comingled with others. 02:10:50.333 --> 02:10:53.000 Or, in this case, someone was already on their way to the store. 02:10:53.000 --> 02:10:54.960 So what's the solution in the real world? 02:10:54.960 --> 02:10:59.000 Well, you could very simply take a post it note and put like gone for milk 02:10:59.000 --> 02:11:02.060 so as to communicate to your roommate that they should not 02:11:02.060 --> 02:11:04.620 inspect the value of that variable and make a decision on it. 02:11:04.620 --> 02:11:05.120 Why? 02:11:05.120 --> 02:11:08.958 Because it's not yet consistent with the outcome that's about to happen. 02:11:08.958 --> 02:11:11.000 You could be more dramatic and you could actually 02:11:11.000 --> 02:11:13.580 lock the fridge somehow, put a padlock around it 02:11:13.580 --> 02:11:15.470 or the like so they can't even get in there. 02:11:15.470 --> 02:11:18.510 And that would achieve the same effect too. 02:11:18.510 --> 02:11:22.400 And that is actually pretty much the solution to this problem in code too. 02:11:22.400 --> 02:11:23.370 It's not safe. 02:11:23.370 --> 02:11:27.110 It's not sufficient to only execute three lines of code like this. 02:11:27.110 --> 02:11:32.000 Rather, what you probably want to do is use additional SQL keywords 02:11:32.000 --> 02:11:35.270 that we won't spend much time on in the class itself, but these. 02:11:35.270 --> 02:11:37.160 There are solutions to this problem. 02:11:37.160 --> 02:11:39.320 You can begin what's called a transaction 02:11:39.320 --> 02:11:43.460 and you can more explicitly commit to making a decision, 02:11:43.460 --> 02:11:46.100 like updating the database to 101 or 102. 02:11:46.100 --> 02:11:50.120 Or if you realize, wait a minute, Carter's query is interrupting mine. 02:11:50.120 --> 02:11:53.540 Let me roll back to the previous state and just rewind. 02:11:53.540 --> 02:11:54.290 Let me undo. 02:11:54.290 --> 02:11:55.840 Control Z, if you will. 02:11:55.840 --> 02:11:58.340 There's also another keyword that's not so much used anymore 02:11:58.340 --> 02:11:59.600 in SQL which is locking. 02:11:59.600 --> 02:12:01.170 You could literally back in the day. 02:12:01.170 --> 02:12:05.690 Lock the entire database table, preventing anyone from updating it 02:12:05.690 --> 02:12:09.320 or making changes or even reading it while someone else was accessing it. 02:12:09.320 --> 02:12:12.320 That was a very heavy handed solution because it slowed everything down. 02:12:12.320 --> 02:12:15.710 But in short, transactions are now a feature of SQL 02:12:15.710 --> 02:12:17.840 that you won't necessarily need to use yourselves 02:12:17.840 --> 02:12:22.490 that do solve this problem by doing the equivalent of saying while David's 02:12:22.490 --> 02:12:26.360 like counter is in the process of being updated, keep Carter at bay, 02:12:26.360 --> 02:12:29.510 ideally briefly, and then let his data go through too. 02:12:29.510 --> 02:12:32.240 It's equivalent too to putting a note or a lock on the fridge. 02:12:32.240 --> 02:12:33.920 And indeed, I mean lock literally. 02:12:33.920 --> 02:12:36.380 They were once upon a time called and still 02:12:36.380 --> 02:12:41.480 are in some contexts called locks on databases too. 02:12:41.480 --> 02:12:44.730 And the code for which you might do this is almost the same. 02:12:44.730 --> 02:12:51.860 You simply wrap the three queries with a transaction statement and a commit. 02:12:51.860 --> 02:12:57.800 And the term of art here is that this makes your statements atomic. 02:12:57.800 --> 02:13:01.230 So atomic means they're either all executed or not at all. 02:13:01.230 --> 02:13:04.340 That is they're all very tightly coupled together without interruption. 02:13:04.340 --> 02:13:08.480 Transactions solves that problem and avoid having two gallons of milk. 02:13:08.480 --> 02:13:11.360 And the last problem that arises that is tragically 02:13:11.360 --> 02:13:14.510 so darn common in the real world today is 02:13:14.510 --> 02:13:16.173 what's called a SQL injection attack. 02:13:16.173 --> 02:13:18.590 And it's what I alluded to earlier with the question mark. 02:13:18.590 --> 02:13:20.298 So suppose you're in the habit of logging 02:13:20.298 --> 02:13:23.240 into Yale websites with your net ID or password 02:13:23.240 --> 02:13:25.950 or at Harvard, your Harvard key and password as well. 02:13:25.950 --> 02:13:28.370 Suppose for the sake of discussion that the people that 02:13:28.370 --> 02:13:32.360 implemented Harvard key log in allow you to type in your email address, 02:13:32.360 --> 02:13:33.680 of course, and your password. 02:13:33.680 --> 02:13:37.160 But suppose that they are using SQL underneath the hood 02:13:37.160 --> 02:13:39.770 to check your username and password to make sure 02:13:39.770 --> 02:13:43.400 that you are David Malan or Carter Zenke or whoever you claim to be. 02:13:43.400 --> 02:13:48.770 I haven't shown you the syntax yet, but it turns out that in SQL, -- 02:13:48.770 --> 02:13:51.410 is a special way of indicating a comment. 02:13:51.410 --> 02:13:53.100 It means ignore everything to the right. 02:13:53.100 --> 02:13:57.080 So it's just like // in C or the hash symbol in Python. 02:13:57.080 --> 02:13:59.430 -- just means ignore everything to the right. 02:13:59.430 --> 02:14:01.440 And we've, of course, seen single quotes. 02:14:01.440 --> 02:14:04.730 So one way to wage a SQL injection attack 02:14:04.730 --> 02:14:08.870 is to try to inject malicious SQL code into someone else's database 02:14:08.870 --> 02:14:10.220 without them realizing it. 02:14:10.220 --> 02:14:11.220 How do you do this? 02:14:11.220 --> 02:14:15.492 Well, suppose I log in as malan@harvard.edu single quote dash, 02:14:15.492 --> 02:14:16.316 dash. 02:14:16.316 --> 02:14:20.510 I'm not double quoting anything clearly and there's nothing to the right 02:14:20.510 --> 02:14:21.770 of the -- 02:14:21.770 --> 02:14:23.150 anyway. 02:14:23.150 --> 02:14:25.500 But this imbalance is going to be useful. 02:14:25.500 --> 02:14:26.000 Why? 02:14:26.000 --> 02:14:29.630 Because if I'm a hacker and I'm presuming someone at Harvard probably 02:14:29.630 --> 02:14:33.680 is using single quotes to wrap the user's email address 02:14:33.680 --> 02:14:38.360 and wrap the user's password, what if I try to complete their thought for them 02:14:38.360 --> 02:14:40.370 and close one of those quotes for them? 02:14:40.370 --> 02:14:41.850 What might happen? 02:14:41.850 --> 02:14:43.220 Well, we could do this. 02:14:43.220 --> 02:14:46.250 Here for instance, let me hypothesize is the code that Harvard wrote, 02:14:46.250 --> 02:14:48.050 hopefully not, underneath the hood. 02:14:48.050 --> 02:14:51.500 So they're using CS50's library in Python and they're using SQL inside. 02:14:51.500 --> 02:14:54.350 Suppose that they have a query like this. 02:14:54.350 --> 02:15:00.380 Select star from users where username equals question mark and password 02:15:00.380 --> 02:15:01.315 equals question mark. 02:15:01.315 --> 02:15:04.190 And then suppose they just plug in whatever username and password was 02:15:04.190 --> 02:15:05.070 typed in. 02:15:05.070 --> 02:15:08.090 And then if they get back some number of rows dot dot dot, 02:15:08.090 --> 02:15:09.410 they assume I am David. 02:15:09.410 --> 02:15:12.200 They assume Carter is Carter if both the username and password are 02:15:12.200 --> 02:15:12.890 in the database. 02:15:12.890 --> 02:15:14.480 Just end of story there. 02:15:14.480 --> 02:15:15.680 This is good. 02:15:15.680 --> 02:15:18.750 This has the question mark placeholder, as we discussed earlier. 02:15:18.750 --> 02:15:21.140 But what if you don't quite remember that? 02:15:21.140 --> 02:15:24.670 You don't quite take that to heart and you use your more familiar last week 02:15:24.670 --> 02:15:28.690 fstrings whereby we use these curly braces to plug in values. 02:15:28.690 --> 02:15:30.280 What if you do this instead? 02:15:30.280 --> 02:15:31.900 So it's almost the same idea. 02:15:31.900 --> 02:15:34.950 It's still db execute but now it's select star from users 02:15:34.950 --> 02:15:36.120 where username equals. 02:15:36.120 --> 02:15:39.570 And now notice I'm doing the single quotes, which is required by SQL, 02:15:39.570 --> 02:15:41.850 but I'm using fstrings with the curly braces. 02:15:41.850 --> 02:15:46.530 And the password equals single quote password and then close single quote. 02:15:46.530 --> 02:15:50.910 The problem is if you're just blindly pasting effectively 02:15:50.910 --> 02:15:55.530 the user's input into that web form into the username field and the password 02:15:55.530 --> 02:15:59.890 field, there's nothing stopping a malicious user, student, faculty, 02:15:59.890 --> 02:16:03.090 staff from including a single quote in their name. 02:16:03.090 --> 02:16:07.470 Or maybe even benevolently if their name happens 02:16:07.470 --> 02:16:10.440 to have a single quote, as some last names in particular do. 02:16:10.440 --> 02:16:12.230 So this is very fragile. 02:16:12.230 --> 02:16:12.730 Why? 02:16:12.730 --> 02:16:15.090 Well, suppose that if we plug in my malicious, 02:16:15.090 --> 02:16:18.030 malan@harvard.edu single quote -- 02:16:18.030 --> 02:16:20.430 notice what happens to username here. 02:16:20.430 --> 02:16:23.940 The username variable inside of the curly quotes 02:16:23.940 --> 02:16:26.340 will get replaced with this. 02:16:26.340 --> 02:16:29.880 And notice single quote, which the Harvard programmer wrote, 02:16:29.880 --> 02:16:35.969 malan@harvard.edu single quote which I wrote -- 02:16:35.969 --> 02:16:40.170 which I wrote single quote which Harvard wrote and whatever else 02:16:40.170 --> 02:16:41.160 they want after that. 02:16:41.160 --> 02:16:43.524 What's the implication, though, of the dash, dash? 02:16:46.722 --> 02:16:48.680 Everything to the right is going to be ignored. 02:16:48.680 --> 02:16:51.730 So the password is never even checked in this scenario. 02:16:51.730 --> 02:16:55.855 I'm tricking the server into ignoring everything after the -- 02:16:55.855 --> 02:17:00.650 but I have constructed very cleverly, very maliciously a syntactically valid 02:17:00.650 --> 02:17:01.150 query. 02:17:01.150 --> 02:17:01.650 Why? 02:17:01.650 --> 02:17:03.580 Because I provided the single quote that's 02:17:03.580 --> 02:17:06.670 going to finish the thought of that first single quote. 02:17:06.670 --> 02:17:09.219 And now I would only know how to do this if I saw the code 02:17:09.219 --> 02:17:12.209 or if I just randomly try putting apostrophes into web forms 02:17:12.209 --> 02:17:13.209 and see if things break. 02:17:13.209 --> 02:17:15.215 That's often how adversaries attack systems. 02:17:15.215 --> 02:17:17.590 They type in potentially dangerous characters, hit Enter. 02:17:17.590 --> 02:17:20.510 If something breaks, they're not necessarily into the system, 02:17:20.510 --> 02:17:22.757 but they know that there might be a vulnerability. 02:17:22.757 --> 02:17:25.340 And then they start trying more methodically things like this. 02:17:25.340 --> 02:17:27.850 So this then is going to be bad, because it effectively 02:17:27.850 --> 02:17:29.500 grays out the rest of the query. 02:17:29.500 --> 02:17:32.889 And this query is surely going to return some rows 02:17:32.889 --> 02:17:34.570 without even knowing my password. 02:17:34.570 --> 02:17:37.450 And so this logic here dot dot dot means, well, 02:17:37.450 --> 02:17:39.850 if a data came back from this query, Harvard 02:17:39.850 --> 02:17:43.480 is presumably going to assume that Malan logged in. 02:17:43.480 --> 02:17:47.840 Show him his account or whatever is being protected here. 02:17:47.840 --> 02:17:50.559 So in short, using fstrings bad. 02:17:50.559 --> 02:17:54.280 Using any equivalent like %s in C, bad. 02:17:54.280 --> 02:17:56.559 When it comes to SQL, using question marks 02:17:56.559 --> 02:18:00.280 or whatever a third party library like CS50 prescribes 02:18:00.280 --> 02:18:02.110 is the way to solve this. 02:18:02.110 --> 02:18:02.620 Why? 02:18:02.620 --> 02:18:07.540 Because libraries like ours are designed to at least be smart and be paranoid. 02:18:07.540 --> 02:18:09.740 And what we will do is this. 02:18:09.740 --> 02:18:12.580 When you use the question marks and the values are plugged in, 02:18:12.580 --> 02:18:16.660 we will escape any potentially dangerous characters 02:18:16.660 --> 02:18:18.520 inside of those placeholders. 02:18:18.520 --> 02:18:21.580 And so effectively, the single quote will no longer 02:18:21.580 --> 02:18:23.469 be considered a grammatical single quote. 02:18:23.469 --> 02:18:26.870 It will just be literally a character in the username or password. 02:18:26.870 --> 02:18:30.040 So the library takes care of this for you 02:18:30.040 --> 02:18:33.500 because you're plugging in the username and password as separate arguments. 02:18:33.500 --> 02:18:37.360 And then we or the third party you're using actually sanitize. 02:18:37.360 --> 02:18:41.510 That is clean up the data and prevent those bad characters. 02:18:41.510 --> 02:18:44.650 Now, this is kind of an internet meme that went around for a while. 02:18:44.650 --> 02:18:47.320 If you've ever driven a car or been in a car 02:18:47.320 --> 02:18:49.930 where there's the automatic readers for tolls. 02:18:49.930 --> 02:18:52.969 This person thought it might be funny to try doing something like this. 02:18:52.969 --> 02:18:54.309 What are they presumably doing? 02:18:54.309 --> 02:18:57.520 The presumption here is, whether or not it worked is unclear, 02:18:57.520 --> 02:19:01.570 is that here's the end of actual license plate number, 02:19:01.570 --> 02:19:03.903 but here's an interesting single quote and a semicolon. 02:19:03.903 --> 02:19:06.070 That's especially bad because it means you can maybe 02:19:06.070 --> 02:19:08.059 execute a second query on the database. 02:19:08.059 --> 02:19:11.170 This is someone having fun trying to drop the entire database 02:19:11.170 --> 02:19:16.090 table for whatever municipality is scanning through cameras their license 02:19:16.090 --> 02:19:16.750 plate code. 02:19:16.750 --> 02:19:19.150 And I would be remiss if we didn't end on this note. 02:19:19.150 --> 02:19:22.240 At least in computer science circles, there 02:19:22.240 --> 02:19:26.920 is someone named, no relation to the TF name we put in the database earlier, 02:19:26.920 --> 02:19:31.660 little Bobby Tables, which ends with this XKCD comic. 02:19:31.660 --> 02:19:36.340 And if you chuckle, if you laugh, you're now legit SQL programmers. 02:19:39.049 --> 02:19:42.030 Nice, nice. 02:19:42.030 --> 02:19:44.740 Every CS student out there knows about little Bobby Table. 02:19:44.740 --> 02:19:47.657 So if you name drop little Bobby Tables now, you're in. 02:19:47.657 --> 02:19:49.240 All right, that's it though for today. 02:19:49.240 --> 02:19:51.780 We will see you next time. 02:19:51.780 --> 02:19:55.130 [MUSIC PLAYING]