WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.968 [MUSIC PLAYING] 00:00:03.968 --> 00:01:18.020 [MUSIC PLAYING] 00:01:18.020 --> 00:01:20.030 DAVID J. MALAN: This is CS50. 00:01:20.030 --> 00:01:23.360 And this is week 7, the week, here, of Halloween. 00:01:23.360 --> 00:01:26.510 Indeed, special thanks to CS50's own Valerie and her mom 00:01:26.510 --> 00:01:30.590 for having created this very festive scenery, and all past ones as well. 00:01:30.590 --> 00:01:33.330 Today, we pick up where we left off last time, 00:01:33.330 --> 00:01:35.180 which, recall, we introduced Python. 00:01:35.180 --> 00:01:38.240 And that was our big transition from C, where suddenly things 00:01:38.240 --> 00:01:40.850 started to look new again, probably, syntactically. 00:01:40.850 --> 00:01:44.390 But also, probably things hopefully started to feel easier. 00:01:44.390 --> 00:01:48.080 Well, with that said, problem set 6 certainly added some challenges, 00:01:48.080 --> 00:01:49.380 and you did some new things. 00:01:49.380 --> 00:01:53.090 But hopefully you've begun to appreciate that with Python, just a lot more stuff 00:01:53.090 --> 00:01:54.290 is easier to do. 00:01:54.290 --> 00:01:56.960 You get more out of the box with the language itself. 00:01:56.960 --> 00:01:59.450 And that's going to be so useful over the coming weeks 00:01:59.450 --> 00:02:03.980 as we transition further to introducing something called databases today, 00:02:03.980 --> 00:02:06.270 web programming next week and the week after. 00:02:06.270 --> 00:02:08.900 So that by term's end, and perhaps even for your final project, 00:02:08.900 --> 00:02:11.870 you really are building something from scratch 00:02:11.870 --> 00:02:15.415 using all of these various tools somehow together. 00:02:15.415 --> 00:02:17.540 So before we do that, though, today, let's consider 00:02:17.540 --> 00:02:22.610 what we weren't really able to do last week, which was actually 00:02:22.610 --> 00:02:25.490 create and store data ourselves. 00:02:25.490 --> 00:02:30.710 In Python, we've played around with the CSV, comma-separated values library. 00:02:30.710 --> 00:02:33.810 And you've been able to read in CSVs from disk, 00:02:33.810 --> 00:02:37.880 so to speak, that is, from files in your programming environments. 00:02:37.880 --> 00:02:41.455 But we haven't necessarily started saving data, persisting data ourselves. 00:02:41.455 --> 00:02:44.330 And that's a huge limitation, because pretty much all of the examples 00:02:44.330 --> 00:02:46.288 we've done thus far with a couple of exceptions 00:02:46.288 --> 00:02:49.550 have involved my providing input at the keyboard or even vocally. 00:02:49.550 --> 00:02:51.508 But then nothing happens to it. 00:02:51.508 --> 00:02:53.300 It disappears the moment the program quits, 00:02:53.300 --> 00:02:55.410 because it was only being stored in memory. 00:02:55.410 --> 00:02:59.090 But today, we'll start to focus all the more on storing things on disk, 00:02:59.090 --> 00:03:02.100 that is, storing things in files and folders 00:03:02.100 --> 00:03:04.940 so that you can actually write programs that remember 00:03:04.940 --> 00:03:06.620 what it is the human did last time. 00:03:06.620 --> 00:03:09.320 And ultimately, you can actually make mobile or web apps 00:03:09.320 --> 00:03:12.050 that actually begin to grow, and grow, and grow their data sets, 00:03:12.050 --> 00:03:15.650 as might happen if you get more and more users, for instance, on a website. 00:03:15.650 --> 00:03:19.400 To play, then, with this new capability of being able to write files, 00:03:19.400 --> 00:03:22.217 let's go ahead and just collect some data. 00:03:22.217 --> 00:03:24.050 In fact, those of you here in person, if you 00:03:24.050 --> 00:03:26.960 want to pull up this URL on your phone or laptop, 00:03:26.960 --> 00:03:29.060 that's going to lead you to a Google Form. 00:03:29.060 --> 00:03:34.130 And that Google Form is going to ask you in just a moment for really 00:03:34.130 --> 00:03:35.320 just your favorite TV show. 00:03:35.320 --> 00:03:37.070 And it's going to ask you to categorize it 00:03:37.070 --> 00:03:41.270 according to a genre, like comedy, or drama, or action, or musical, 00:03:41.270 --> 00:03:42.460 or something like that. 00:03:42.460 --> 00:03:44.210 And this is useful, because if you've ever 00:03:44.210 --> 00:03:47.520 used a Google Form before, or Microsoft's equivalent with Office 365, 00:03:47.520 --> 00:03:50.270 it's a really useful mechanism at just collecting data from users, 00:03:50.270 --> 00:03:54.000 and then ultimately, putting it into a spreadsheet form. 00:03:54.000 --> 00:03:58.160 So this is a screenshot of the form that those of you 00:03:58.160 --> 00:04:01.080 here in person or tuning in on Zoom are currently filling out. 00:04:01.080 --> 00:04:02.420 It's asking only two questions. 00:04:02.420 --> 00:04:04.550 What's the title of your favorite TV show? 00:04:04.550 --> 00:04:09.470 And what are one or more genres into which your TV show falls? 00:04:09.470 --> 00:04:12.860 And I'll go ahead and pivot now to the view 00:04:12.860 --> 00:04:15.667 that I'll be able to see as the person who created this form, which 00:04:15.667 --> 00:04:17.209 is quite simply a Google spreadsheet. 00:04:17.209 --> 00:04:19.709 Google Forms has this nice feature, if you've never noticed, 00:04:19.709 --> 00:04:22.650 that allows you to export your data to a Google Spreadsheet. 00:04:22.650 --> 00:04:25.130 And then from there, we can actually grab the file 00:04:25.130 --> 00:04:27.500 and download it to my own Mac or your own PC 00:04:27.500 --> 00:04:30.320 so that we can actually play around with the data that's come in. 00:04:30.320 --> 00:04:32.570 So in fact, let me go ahead and slide over 00:04:32.570 --> 00:04:36.540 to this, the live Google Spreadsheet. 00:04:36.540 --> 00:04:40.490 And you'll see, probably, a whole bunch of familiar TV shows here, 00:04:40.490 --> 00:04:41.360 all coming in. 00:04:41.360 --> 00:04:43.820 And if we keep scrolling, and scrolling, and scrolling-- 00:04:43.820 --> 00:04:45.290 only 46, 47. 00:04:45.290 --> 00:04:47.430 There we go, up to 50 plus already. 00:04:47.430 --> 00:04:50.420 If you need that URL again here, if you're just tuning in, 00:04:50.420 --> 00:04:52.850 you can go to this URL here. 00:04:52.850 --> 00:04:55.760 And in just a moment, we'll have a bunch of data 00:04:55.760 --> 00:04:59.450 with which we can start to experiment. 00:04:59.450 --> 00:05:01.070 I'll give you a moment or so there. 00:05:06.330 --> 00:05:08.418 All right. 00:05:08.418 --> 00:05:09.960 Let me hang in there a little longer. 00:05:09.960 --> 00:05:11.418 OK, we've got over 100 submissions. 00:05:11.418 --> 00:05:12.390 Good. 00:05:12.390 --> 00:05:15.270 Good, even more coming in now. 00:05:15.270 --> 00:05:16.890 And we can see them coming in live. 00:05:16.890 --> 00:05:18.750 Here, let me switch back to the spreadsheet. 00:05:18.750 --> 00:05:21.090 The list is growing, and growing, and growing. 00:05:21.090 --> 00:05:22.900 And in just a moment-- 00:05:22.900 --> 00:05:26.490 let me give Carter a moment to help me export it in real time. 00:05:26.490 --> 00:05:29.640 Carter, just give me a heads up when it's reasonable for me 00:05:29.640 --> 00:05:32.880 to download this file. 00:05:32.880 --> 00:05:35.140 All right, and I'll begin to do this very slowly. 00:05:35.140 --> 00:05:37.720 So I'm going to go up to the File menu, if you've never done this before. 00:05:37.720 --> 00:05:40.110 Download-- you can download a whole bunch of formats, one in Excel. 00:05:40.110 --> 00:05:42.060 But more simply, and the one we'll start to play with here, 00:05:42.060 --> 00:05:43.630 is comma-separated values. 00:05:43.630 --> 00:05:47.010 So CSV files we used this past week, why are they useful? 00:05:47.010 --> 00:05:50.190 Now that you've played with them or used them in past real world, 00:05:50.190 --> 00:05:55.260 what's the utility of a CSV file versus something like Excel, for instance? 00:05:55.260 --> 00:05:59.010 Why CSV in the first place? 00:05:59.010 --> 00:06:00.030 Any instincts? 00:06:00.030 --> 00:06:00.858 Yeah? 00:06:00.858 --> 00:06:02.730 AUDIENCE: Because it's just a text file? 00:06:02.730 --> 00:06:04.605 DAVID J. MALAN: OK, so storage is compelling. 00:06:04.605 --> 00:06:07.710 A simple text file with ASCII or Unicode text is probably pretty small. 00:06:07.710 --> 00:06:08.290 I like that. 00:06:08.290 --> 00:06:10.101 Other thoughts? 00:06:10.101 --> 00:06:11.184 AUDIENCE: Structure of it? 00:06:11.184 --> 00:06:12.518 DAVID J. MALAN: Yeah, well said. 00:06:12.518 --> 00:06:14.760 It's just a simple text format, but using conventions 00:06:14.760 --> 00:06:18.330 like commas you can represent the idea of columns using new lines, 00:06:18.330 --> 00:06:20.430 backslash ends invisibly at the end of your lines, 00:06:20.430 --> 00:06:22.000 you can create the idea of rows. 00:06:22.000 --> 00:06:24.000 So it's a very simple way of implementing what 00:06:24.000 --> 00:06:26.610 we might call a flat-file database. 00:06:26.610 --> 00:06:28.860 It's a way of storing data in a flat, that is, 00:06:28.860 --> 00:06:32.310 very simple file that's just pure ASCII or Unicode text. 00:06:32.310 --> 00:06:35.340 And more compellingly, I dare say, is that with a CSV file, 00:06:35.340 --> 00:06:36.862 it's completely portable. 00:06:36.862 --> 00:06:38.820 Something is portable in the world of computing 00:06:38.820 --> 00:06:41.910 if it means you can use it on a Mac or a PC running this operating system, 00:06:41.910 --> 00:06:42.750 or this other one. 00:06:42.750 --> 00:06:45.480 And portability is nice because if I were to download an Excel file, 00:06:45.480 --> 00:06:47.760 there'd be a whole bunch of people in this room and online 00:06:47.760 --> 00:06:50.190 who couldn't download it because they haven't bought Microsoft Excel 00:06:50.190 --> 00:06:50.940 or installed it. 00:06:50.940 --> 00:06:55.680 Or if they have a Mac, or if it's a .numbers file in the Mac world, 00:06:55.680 --> 00:06:57.610 a PC user might not be able to download it. 00:06:57.610 --> 00:06:59.800 So a CSV is indeed very portable. 00:06:59.800 --> 00:07:02.940 So I'm going to go ahead and download, quite simply, the CSV 00:07:02.940 --> 00:07:04.440 version of this file. 00:07:04.440 --> 00:07:06.960 That's going to put it onto my own Mac's Downloads folder. 00:07:06.960 --> 00:07:11.460 And let me go ahead here, and in just a moment, let me just simplify the name. 00:07:11.460 --> 00:07:14.910 Because it actually downloads it at a pretty large name. 00:07:14.910 --> 00:07:18.120 And give me just one moment here, and you'll see that, indeed, 00:07:18.120 --> 00:07:21.210 on my Mac I have a file called favorites.csv. 00:07:21.210 --> 00:07:22.750 I shortened the name real quick. 00:07:22.750 --> 00:07:28.680 And now what I'm going to do is go over to VS Code, and in VS Code, 00:07:28.680 --> 00:07:30.480 I'm going to open my File Explorer. 00:07:30.480 --> 00:07:34.620 And if I minimize my window here for a moment, a handy feature of VS Code 00:07:34.620 --> 00:07:37.980 is that you can just drag and drop a file, for instance, into your Explorer. 00:07:37.980 --> 00:07:40.660 And voila, it's going to automatically upload it for you. 00:07:40.660 --> 00:07:43.260 So let me go ahead and full screen here, close my Explorer, 00:07:43.260 --> 00:07:45.120 temporarily close my Terminal window. 00:07:45.120 --> 00:07:48.720 And you'll see here a CSV file, favorites.csv. 00:07:48.720 --> 00:07:51.390 And the first row, by convention, has whatever 00:07:51.390 --> 00:07:54.720 the columns were in Google Spreadsheets, or Office 365, 00:07:54.720 --> 00:07:58.620 in Excel online, timestamp, comma, title, comma, genres. 00:07:58.620 --> 00:08:00.390 Then, we have timestamps, which indicates 00:08:00.390 --> 00:08:01.722 when people started submitting. 00:08:01.722 --> 00:08:03.930 Looks like a couple of people were super eager to get 00:08:03.930 --> 00:08:05.430 started an hour or two ago. 00:08:05.430 --> 00:08:09.340 And then, you have the title next, after a comma. 00:08:09.340 --> 00:08:12.150 But there's kind of a curiosity after that. 00:08:12.150 --> 00:08:15.510 Sometimes I see the genre like comedy, comedy, comedy, 00:08:15.510 --> 00:08:19.870 but sometimes it's like crime, comma, drama, or action, comma, crime, comma, 00:08:19.870 --> 00:08:20.550 drama. 00:08:20.550 --> 00:08:22.470 And those things are quoted. 00:08:22.470 --> 00:08:24.180 And yet, I didn't do any quotes. 00:08:24.180 --> 00:08:25.800 You probably didn't type any quotes. 00:08:25.800 --> 00:08:30.180 Where are those quotes coming from in this CSV file? 00:08:30.180 --> 00:08:31.650 Why are they there if we infer? 00:08:31.650 --> 00:08:32.485 Yeah? 00:08:32.485 --> 00:08:35.340 AUDIENCE: [INAUDIBLE] 00:08:35.340 --> 00:08:38.308 DAVID J. MALAN: Yeah, so you have a corner case, if you will. 00:08:38.308 --> 00:08:40.350 Because if you're using commas, as you described, 00:08:40.350 --> 00:08:44.280 to separate your data into what are effectively columns, well, 00:08:44.280 --> 00:08:47.010 you've painted yourself into a corner if your actual data 00:08:47.010 --> 00:08:48.578 has commas in it itself. 00:08:48.578 --> 00:08:51.120 So what Google has done, what Microsoft does, what Apple does 00:08:51.120 --> 00:08:54.330 is, they quote any strings of text that themselves 00:08:54.330 --> 00:08:58.560 have commas so that these are now English grammatical commas, 00:08:58.560 --> 00:09:00.730 not CSV specific commas. 00:09:00.730 --> 00:09:03.400 So it's a way of escaping your data, if you will. 00:09:03.400 --> 00:09:06.070 And escaping just means to call out a symbol in a special way 00:09:06.070 --> 00:09:08.637 so it's not misinterpreted as something else. 00:09:08.637 --> 00:09:10.470 All right, so this is all to say that we now 00:09:10.470 --> 00:09:13.980 have all of this data with which we can play in the form of what we'll 00:09:13.980 --> 00:09:15.840 start calling a flat-file database. 00:09:15.840 --> 00:09:19.230 So suppose I wanted to now start manipulating this data, 00:09:19.230 --> 00:09:22.110 and I want to store it ultimately, indeed, in this CSV format. 00:09:22.110 --> 00:09:24.540 How can I actually start to read this data, 00:09:24.540 --> 00:09:26.950 maybe clean it up, maybe do some analytics on it 00:09:26.950 --> 00:09:30.570 and actually figure out, what's the most popular show among those who submitted 00:09:30.570 --> 00:09:32.190 here over the past few minutes? 00:09:32.190 --> 00:09:34.270 Well, let me go ahead and close this. 00:09:34.270 --> 00:09:38.970 Let me go ahead, then, and open up, for instance, just my Terminal window. 00:09:38.970 --> 00:09:41.970 And let's code up a file called favorites.py. 00:09:41.970 --> 00:09:46.110 And let's go ahead and iteratively start simple by just opening up this file 00:09:46.110 --> 00:09:47.830 and printing out what's inside of it. 00:09:47.830 --> 00:09:51.330 So you might recall that we can do this by doing something like import CSV 00:09:51.330 --> 00:09:54.990 to give myself some CSV reading functionality. 00:09:54.990 --> 00:09:59.610 Then, I can go ahead and do something like with open, the name of the file 00:09:59.610 --> 00:10:02.280 that I want to open in read mode. 00:10:02.280 --> 00:10:03.990 Quote, unquote, "r" means to read it. 00:10:03.990 --> 00:10:06.600 And then, I can say as file, or whatever other name 00:10:06.600 --> 00:10:09.670 for a variable to say that I want to open this file, 00:10:09.670 --> 00:10:12.480 and essentially store some kind of reference to it in that variable 00:10:12.480 --> 00:10:13.600 called file. 00:10:13.600 --> 00:10:16.920 Then, I can give myself a reader, and I can say csv.reader, 00:10:16.920 --> 00:10:18.397 passing in that file as input. 00:10:18.397 --> 00:10:19.980 And this is the magic of that library. 00:10:19.980 --> 00:10:23.190 It deals with the process of opening it, reading it, and giving you 00:10:23.190 --> 00:10:26.430 back something that you can just iterate over, like with a for loop 00:10:26.430 --> 00:10:30.510 I do want to skip the first row, and recall that I can do this. 00:10:30.510 --> 00:10:33.665 Next, reader, is this little trick that just says, ignore the first row. 00:10:33.665 --> 00:10:35.040 Because the first one is special. 00:10:35.040 --> 00:10:37.410 It said timestamp, title, genres. 00:10:37.410 --> 00:10:39.400 That's not your data, that was mine. 00:10:39.400 --> 00:10:41.880 But this means now that I've skipped that first row. 00:10:41.880 --> 00:10:44.700 Everything hereafter is going to be the title of a show 00:10:44.700 --> 00:10:46.260 that you all like, so let me do this. 00:10:46.260 --> 00:10:51.090 For row in the reader, let's go ahead and print out the title 00:10:51.090 --> 00:10:52.860 of the show each of you typed in. 00:10:52.860 --> 00:10:57.000 How do I get at the title of the show each of you typed in? 00:10:57.000 --> 00:10:58.740 It's somewhere inside of row. 00:10:58.740 --> 00:11:00.790 Row recalls a list. 00:11:00.790 --> 00:11:02.910 So what do I want to type next in order to get 00:11:02.910 --> 00:11:08.920 at the title of the current row just as a quick check here? 00:11:08.920 --> 00:11:11.350 What do I want to type to get at the title of the row, 00:11:11.350 --> 00:11:15.520 keeping in mind, again, that it was timestamp, title, genres? 00:11:15.520 --> 00:11:16.020 Yeah? 00:11:16.020 --> 00:11:16.895 AUDIENCE: [INAUDIBLE] 00:11:16.895 --> 00:11:18.910 DAVID J. MALAN: So row bracket 1 would give me 00:11:18.910 --> 00:11:22.640 the second column, 0 index, that is, the one in the middle with the title. 00:11:22.640 --> 00:11:24.520 So this program isn't that interesting yet, 00:11:24.520 --> 00:11:27.370 but it's a quick and dirty way to figure out, all right, what's my data 00:11:27.370 --> 00:11:27.820 look like? 00:11:27.820 --> 00:11:29.987 Let me actually just do a little bit of a check here 00:11:29.987 --> 00:11:32.380 and see if it contains the data I think it does. 00:11:32.380 --> 00:11:34.490 Let me maximize my Terminal window here. 00:11:34.490 --> 00:11:37.810 Let me run Python of favorites.py, hitting Enter. 00:11:37.810 --> 00:11:42.520 And you'll see now a purely textual list of all of the shows 00:11:42.520 --> 00:11:44.560 you all seem to like here. 00:11:44.560 --> 00:11:47.080 But what's noteworthy about it? 00:11:47.080 --> 00:11:49.960 Specific shows aside, judgment aside as to people's 00:11:49.960 --> 00:11:54.640 TV tastes, what's interesting or noteworthy about the data that 00:11:54.640 --> 00:11:58.090 might create some problems for us if we start to analyze this data, 00:11:58.090 --> 00:11:59.740 and figure out what's the most popular? 00:11:59.740 --> 00:12:02.590 How many people like this or that? 00:12:02.590 --> 00:12:03.340 What do you think? 00:12:03.340 --> 00:12:04.151 Yeah? 00:12:04.151 --> 00:12:07.458 AUDIENCE: User errors [INAUDIBLE]. 00:12:07.458 --> 00:12:09.500 DAVID J. MALAN: Yeah, there might be user errors, 00:12:09.500 --> 00:12:12.800 or just stylistic differences that give the appearance that one show 00:12:12.800 --> 00:12:14.400 is different from the other. 00:12:14.400 --> 00:12:16.970 For instance, here. 00:12:16.970 --> 00:12:20.250 Let's see if I can see an example on the screen here. 00:12:20.250 --> 00:12:24.340 Yeah, so friends here is an all lowercase, Friends here is capitalized. 00:12:24.340 --> 00:12:25.140 No big deal. 00:12:25.140 --> 00:12:26.370 We can sort of mitigate that. 00:12:26.370 --> 00:12:29.480 But this is just a tiny example of where data in the real world 00:12:29.480 --> 00:12:30.758 can get messy fast. 00:12:30.758 --> 00:12:32.300 And that probably wasn't even a typo. 00:12:32.300 --> 00:12:37.587 It was just someone not caring as much to capitalize it, and that's fine. 00:12:37.587 --> 00:12:39.920 Your users are going to type what they're going to type. 00:12:39.920 --> 00:12:43.580 So let's see if we can't now begin to get at more specific data, 00:12:43.580 --> 00:12:45.500 and maybe even clean some of this data up. 00:12:45.500 --> 00:12:50.570 Let me go back into my file called favorites.py here, 00:12:50.570 --> 00:12:54.350 and let's actually do something a little more user friendly for me. 00:12:54.350 --> 00:12:58.070 Instead of a reader, recall that there was this dictionary reader that's 00:12:58.070 --> 00:12:59.750 just a little more user friendly. 00:12:59.750 --> 00:13:04.670 And it means I can type in dictionary reader here, passing in the same file. 00:13:04.670 --> 00:13:10.670 But now, when I iterate over this reader variable, what is each row? 00:13:10.670 --> 00:13:14.060 When using a DictReader instead of a reader, recall, and this 00:13:14.060 --> 00:13:17.690 is just a peculiarity of the CSV library, 00:13:17.690 --> 00:13:22.530 this gives me back, not a list of cells, but what instead, 00:13:22.530 --> 00:13:24.950 which is marginally more user friendly for me? 00:13:24.950 --> 00:13:25.460 Yeah? 00:13:25.460 --> 00:13:27.585 AUDIENCE: [INAUDIBLE] 00:13:27.585 --> 00:13:28.460 DAVID J. MALAN: Yeah. 00:13:28.460 --> 00:13:30.860 I can now use open bracket, quotes, and the title. 00:13:30.860 --> 00:13:33.770 Because what's coming back now is a dict object, that is, 00:13:33.770 --> 00:13:37.010 a dictionary which has keys and values. 00:13:37.010 --> 00:13:39.060 The keys of which are the column headings. 00:13:39.060 --> 00:13:41.400 The values of which are the data I actually care about. 00:13:41.400 --> 00:13:44.030 So this is just marginally better because, one, it's 00:13:44.030 --> 00:13:47.330 just way more obvious to me, the author of this code, what it is I'm 00:13:47.330 --> 00:13:48.050 getting at. 00:13:48.050 --> 00:13:50.310 I don't remember what column the title was. 00:13:50.310 --> 00:13:50.810 Was it 0? 00:13:50.810 --> 00:13:51.320 Was it 1? 00:13:51.320 --> 00:13:51.890 Was it 2? 00:13:51.890 --> 00:13:53.450 That's something you're going to forget over time. 00:13:53.450 --> 00:13:56.510 And God forbid someone changes the data by just dragging and dropping 00:13:56.510 --> 00:13:59.390 the columns in Excel, or Apple Numbers, or Google Spreadsheets. 00:13:59.390 --> 00:14:01.978 That's going to break all of your numeric indices. 00:14:01.978 --> 00:14:03.770 And so a dictionary reader is arguably just 00:14:03.770 --> 00:14:07.310 better design because it's more robust against changes 00:14:07.310 --> 00:14:08.960 and potential errors like that. 00:14:08.960 --> 00:14:12.560 Now the effect of this change isn't going to be really any different. 00:14:12.560 --> 00:14:16.820 If I run Python of favorites.py, voila, I get all of the same results. 00:14:16.820 --> 00:14:21.290 But I've now not made any assumptions as to where each of the columns 00:14:21.290 --> 00:14:22.920 actually is numerically. 00:14:22.920 --> 00:14:23.420 All right. 00:14:23.420 --> 00:14:26.330 Well, let's go ahead and now filter out some duplicates. 00:14:26.330 --> 00:14:29.960 Because there's a lot of commonality among some of the shows here, so let's 00:14:29.960 --> 00:14:32.660 see if we can't filter out duplicates. 00:14:32.660 --> 00:14:38.900 If I'm reading a CSV file top to bottom, what intuitively might be the logic 00:14:38.900 --> 00:14:41.650 I want to implement to filter out duplicates? 00:14:41.650 --> 00:14:44.900 It's not going to be quite as simple as a simple function that does it for me. 00:14:44.900 --> 00:14:47.300 I'm going to have to build this. 00:14:47.300 --> 00:14:50.630 But logically, if you're reading a file from top to bottom, 00:14:50.630 --> 00:14:54.860 how might you go about, in Python or just any context, 00:14:54.860 --> 00:14:58.040 getting rid of duplicate values? 00:14:58.040 --> 00:14:59.862 Yeah, what do you think? 00:14:59.862 --> 00:15:05.735 AUDIENCE: [INAUDIBLE] 00:15:05.735 --> 00:15:06.610 DAVID J. MALAN: Sure. 00:15:06.610 --> 00:15:09.730 I could use a list and I could add each title to the list, 00:15:09.730 --> 00:15:13.070 but first check if I put this into the list before. 00:15:13.070 --> 00:15:15.140 So let's try a little something like that. 00:15:15.140 --> 00:15:18.340 Let me go ahead and create a variable at the top of my program here. 00:15:18.340 --> 00:15:21.340 I'll call it titles, for instance, initialize to an empty list, 00:15:21.340 --> 00:15:23.050 open bracket, close bracket. 00:15:23.050 --> 00:15:27.860 And then, inside of my loop here, instead of printing it out, 00:15:27.860 --> 00:15:29.630 let's start to make a decision. 00:15:29.630 --> 00:15:39.332 So if the current row's title is in the titles list 00:15:39.332 --> 00:15:40.540 I don't want to put it there. 00:15:40.540 --> 00:15:43.582 And actually, let me invert the logic so I'm doing something proactively. 00:15:43.582 --> 00:15:48.190 So if it's not the case that row bracket title is in titles, 00:15:48.190 --> 00:15:56.200 then, go ahead and do something like titles.append the current row's title. 00:15:56.200 --> 00:15:59.170 And recall that we saw .append a week or so ago, 00:15:59.170 --> 00:16:01.700 where it just allows you to append to the current list. 00:16:01.700 --> 00:16:04.810 And then, what can I do at the very end, after I'm all 00:16:04.810 --> 00:16:06.460 done reading the whole file? 00:16:06.460 --> 00:16:09.820 Why don't I go ahead and say, for title in titles, 00:16:09.820 --> 00:16:12.730 go ahead and print out the current title? 00:16:12.730 --> 00:16:16.700 So it's two loops now, and we can come back to the quality of that design. 00:16:16.700 --> 00:16:19.373 But let me go ahead here and rerun Python of favorites.py. 00:16:19.373 --> 00:16:22.540 Let me increase the size of my Terminal window so we can focus just on this, 00:16:22.540 --> 00:16:23.770 and hit Enter. 00:16:23.770 --> 00:16:28.120 And now, I'm just skimming. 00:16:28.120 --> 00:16:31.090 I don't think I'm seeing duplicates, although I 00:16:31.090 --> 00:16:34.000 am seeing some near duplicates. 00:16:34.000 --> 00:16:36.760 For instance, there's Friends again. 00:16:36.760 --> 00:16:40.060 And if we keep going, and going, and going, and going, 00:16:40.060 --> 00:16:41.260 there's Friends again. 00:16:41.260 --> 00:16:47.070 Oh, interesting, so that's curious that I seem to have multiple Friends, 00:16:47.070 --> 00:16:48.630 and I have this one here, too. 00:16:48.630 --> 00:16:51.120 So how might we clean this up further? 00:16:51.120 --> 00:16:53.393 I like your instincts, and it's a step closer to it. 00:16:53.393 --> 00:16:55.560 What are we going to have to do to really filter out 00:16:55.560 --> 00:16:57.360 those near duplicates? 00:16:57.360 --> 00:16:59.146 Any thoughts? 00:16:59.146 --> 00:17:04.045 AUDIENCE: You could set everything to lower [INAUDIBLE].. 00:17:04.045 --> 00:17:04.920 DAVID J. MALAN: Yeah. 00:17:04.920 --> 00:17:06.670 What are the common mistakes to summarize? 00:17:06.670 --> 00:17:09.405 We could ignore the capitalization altogether and maybe 00:17:09.405 --> 00:17:12.030 just force everything to lowercase, or everything to uppercase. 00:17:12.030 --> 00:17:13.920 Doesn't matter which, but let's just be consistent. 00:17:13.920 --> 00:17:16.795 And for those of you who might have accidentally or instinctively hit 00:17:16.795 --> 00:17:19.650 the spacebar at the beginning of your input or even at the end, 00:17:19.650 --> 00:17:21.180 we can strip that off, too. 00:17:21.180 --> 00:17:24.490 Stripping whitespace is a common thing just to clean up user input. 00:17:24.490 --> 00:17:27.869 So let me go back into my code here, and let me go ahead 00:17:27.869 --> 00:17:30.540 and tweak the title a little bit. 00:17:30.540 --> 00:17:33.360 Let me say that the current title inside of this loop 00:17:33.360 --> 00:17:36.300 is not going to be just the current row's title. 00:17:36.300 --> 00:17:40.440 But let me go ahead and strip off, from the left and the right implicitly, 00:17:40.440 --> 00:17:41.260 any whitespace. 00:17:41.260 --> 00:17:44.260 If you read the documentation for the strip function, it does just that. 00:17:44.260 --> 00:17:47.230 It gets rid of whitespace to the left, whitespace to the right. 00:17:47.230 --> 00:17:50.310 And then, if I want to force everything to maybe uppercase, 00:17:50.310 --> 00:17:52.710 I can just uppercase the entire string. 00:17:52.710 --> 00:17:56.310 And remember, what's handy about Python is you can chain some of these function 00:17:56.310 --> 00:17:59.550 calls together by just using dots again and again. 00:17:59.550 --> 00:18:01.500 And that just takes whatever just happened, 00:18:01.500 --> 00:18:04.320 like the whitespace got stripped off, then, it additionally 00:18:04.320 --> 00:18:06.280 uppercases the whole thing as well. 00:18:06.280 --> 00:18:10.980 So now, I'm going to just check whether this specific title is in titles. 00:18:10.980 --> 00:18:14.760 And if not, I'm going to go ahead and append that title, 00:18:14.760 --> 00:18:17.200 massaged into this different format, if you will. 00:18:17.200 --> 00:18:19.350 So I'm throwing away some information. 00:18:19.350 --> 00:18:23.760 I'm sacrificing all of the nuances of your grammar and input 00:18:23.760 --> 00:18:25.020 to the form itself. 00:18:25.020 --> 00:18:27.480 But at least I'm trying to canonicalize size, that is, 00:18:27.480 --> 00:18:29.950 standardize what the data actually looks like. 00:18:29.950 --> 00:18:33.810 So let me go ahead and run Python of favorites.py again and hit Enter. 00:18:33.810 --> 00:18:35.412 Oh, and this is just user error. 00:18:35.412 --> 00:18:36.870 Maybe you haven't seen this before. 00:18:36.870 --> 00:18:40.870 This just looks like a mistake on my part. 00:18:40.870 --> 00:18:43.200 I meant to say not even uppercase. 00:18:43.200 --> 00:18:44.202 That's completely wrong. 00:18:44.202 --> 00:18:46.410 The function is called upper, now that I think of it. 00:18:46.410 --> 00:18:46.830 All right. 00:18:46.830 --> 00:18:49.330 Let's go and increase the size of the Terminal window again. 00:18:49.330 --> 00:18:50.820 Run Python of favorites.py. 00:18:50.820 --> 00:18:55.530 And now, it's a little more overwhelming to look at because it's not sorted yet 00:18:55.530 --> 00:18:57.300 and it's all capitalized. 00:18:57.300 --> 00:19:03.300 But I don't think I'm seeing multiple Friends, so to speak. 00:19:03.300 --> 00:19:06.455 There's one Friends up here and that's it. 00:19:06.455 --> 00:19:07.830 I'm back up at my prompt already. 00:19:07.830 --> 00:19:10.270 So we seem now to be filtering out duplicates. 00:19:10.270 --> 00:19:13.582 Now, before we dive in further and clean this up further than this, 00:19:13.582 --> 00:19:14.790 what else could we have done? 00:19:14.790 --> 00:19:17.100 Well, it turns out that in Python 2 you often 00:19:17.100 --> 00:19:19.350 do get a lot of functionality built into the language. 00:19:19.350 --> 00:19:22.313 And I'm kind of implementing myself the idea of a set. 00:19:22.313 --> 00:19:24.480 If you think back to mathematics, a set is typically 00:19:24.480 --> 00:19:28.470 something with a bunch of values that has duplicates filtered out. 00:19:28.470 --> 00:19:31.290 Recall that Python already has this for us. 00:19:31.290 --> 00:19:34.200 And we saw it really briefly when I whipped up the dictionary 00:19:34.200 --> 00:19:36.190 implementation a couple of weeks back. 00:19:36.190 --> 00:19:40.980 So I could actually define my titles to be a set instead of a list, 00:19:40.980 --> 00:19:46.080 and this would just modestly allow me to refine my code here, such 00:19:46.080 --> 00:19:48.870 that I don't have to bother checking for duplicates anyway. 00:19:48.870 --> 00:19:52.710 I can instead just say something like, titles.add 00:19:52.710 --> 00:19:55.807 the current title, like this. 00:19:55.807 --> 00:19:58.890 Marginally better design if you know that a set exists because you're just 00:19:58.890 --> 00:20:00.753 getting more functionality out of this. 00:20:00.753 --> 00:20:02.670 All right, so let's clean the data up further. 00:20:02.670 --> 00:20:06.150 We've now gone ahead and fixed the problem of case sensitivity. 00:20:06.150 --> 00:20:08.940 We threw away whitespace in case someone had hit the spacebar 00:20:08.940 --> 00:20:09.990 with some of the input. 00:20:09.990 --> 00:20:13.960 Let's go ahead now and sort these things by the titles themselves. 00:20:13.960 --> 00:20:17.280 So instead of just printing out the titles in the same order 00:20:17.280 --> 00:20:22.142 you all inputted them, but filtering out duplicates as we go, let me go ahead 00:20:22.142 --> 00:20:24.600 and use another function in Python you might not have seen, 00:20:24.600 --> 00:20:26.940 which is literally called sorted, and will 00:20:26.940 --> 00:20:31.900 take care of the process of actually sorting titles for you. 00:20:31.900 --> 00:20:34.440 Let me go ahead and increase the font size of my Terminal, 00:20:34.440 --> 00:20:36.570 run Python of favorites.py, and hit Enter. 00:20:36.570 --> 00:20:40.410 And now you can really see how many of these shows start with the word "the" 00:20:40.410 --> 00:20:41.220 or do not. 00:20:41.220 --> 00:20:43.590 Now it's a little easier to wrap our minds around, 00:20:43.590 --> 00:20:46.650 just because it's at least sorted alphabetically. 00:20:46.650 --> 00:20:50.220 But now you can really see some of the differences in people's inputs. 00:20:50.220 --> 00:20:51.850 So far, so good. 00:20:51.850 --> 00:20:56.400 But a few of you decided to stylize Avatar in three different ways here. 00:20:56.400 --> 00:20:59.520 Brooklyn 99 is a couple of different ways here. 00:20:59.520 --> 00:21:02.700 And I think if we keep going we'll see further and further variances that we 00:21:02.700 --> 00:21:06.460 did not fix by focusing on whitespace and capitalization alone. 00:21:06.460 --> 00:21:09.870 So already here, this is only, what, 100 plus, 200 rows. 00:21:09.870 --> 00:21:12.795 Already real-world data starts to get messy quickly, 00:21:12.795 --> 00:21:14.670 and that might not bode well when we actually 00:21:14.670 --> 00:21:16.650 want to keep around real data from real users. 00:21:16.650 --> 00:21:19.025 You can imagine an actual website or a mobile application 00:21:19.025 --> 00:21:21.660 dealing with this kind of thing on scale. 00:21:21.660 --> 00:21:23.080 Well, let's go ahead and do this. 00:21:23.080 --> 00:21:26.370 Let's actually figure out the popularity of these various shows 00:21:26.370 --> 00:21:31.680 by now iterating over my data, and keeping track of how many of you 00:21:31.680 --> 00:21:33.000 inputted a given title. 00:21:33.000 --> 00:21:37.890 We're going to ignore the problems like Brooklyn 99 and the Avatar. 00:21:37.890 --> 00:21:42.390 Sorry, yeah, Avatar, where there was things 00:21:42.390 --> 00:21:47.070 that were different beyond just whitespace and capitalization. 00:21:47.070 --> 00:21:48.840 But let's go ahead and keep track of, now, 00:21:48.840 --> 00:21:52.290 how many of you inputted each of these titles. 00:21:52.290 --> 00:21:53.448 So how can I do this? 00:21:53.448 --> 00:21:55.740 I'm still going to take this approach of iterating over 00:21:55.740 --> 00:21:58.110 the CSV file from top to bottom. 00:21:58.110 --> 00:22:00.360 We've used a couple of data structures thus far, 00:22:00.360 --> 00:22:04.260 a list to keep track of titles, or a set to keep track of titles. 00:22:04.260 --> 00:22:07.350 But what if I now want to keep around a little more information? 00:22:07.350 --> 00:22:13.110 For each title, I want to keep around how many times I've seen it before. 00:22:13.110 --> 00:22:14.130 I'm not doing that yet. 00:22:14.130 --> 00:22:17.880 I'm throwing away the total number of times I see these shows. 00:22:17.880 --> 00:22:20.520 How could I start to keep that around? 00:22:20.520 --> 00:22:21.845 AUDIENCE: Use a dictionary. 00:22:21.845 --> 00:22:23.970 DAVID J. MALAN: We could use a dictionary, and how? 00:22:23.970 --> 00:22:25.260 Elaborate on that. 00:22:25.260 --> 00:22:26.135 AUDIENCE: [INAUDIBLE] 00:22:26.135 --> 00:22:28.093 DAVID J. MALAN: Perfect, really good instincts. 00:22:28.093 --> 00:22:29.850 Using a dictionary, insofar as it lets us 00:22:29.850 --> 00:22:33.300 store keys and values, that is, associate something with something 00:22:33.300 --> 00:22:33.850 else. 00:22:33.850 --> 00:22:36.480 This is why a dictionary or hash tables more 00:22:36.480 --> 00:22:39.630 generally are such a useful, practical data structure. 00:22:39.630 --> 00:22:43.120 Because they just let you remember stuff in some kind of structured way. 00:22:43.120 --> 00:22:45.870 So if the keys are going to be the titles I've seen, 00:22:45.870 --> 00:22:49.860 the values could be the number of times I've seen each of those titles. 00:22:49.860 --> 00:22:54.308 And so it's kind of like just having a two-column table on paper. 00:22:54.308 --> 00:22:56.850 For instance, if I were going to do this on a piece of paper, 00:22:56.850 --> 00:22:59.340 I might just have two columns here, where 00:22:59.340 --> 00:23:04.590 maybe this is the title that I've seen, and this is the count over here. 00:23:04.590 --> 00:23:08.250 This is, in effect, a dictionary in Python. 00:23:08.250 --> 00:23:11.490 It's two columns, keys on the left, values on the right. 00:23:11.490 --> 00:23:13.620 And this, if I can implement in code, will actually 00:23:13.620 --> 00:23:17.580 allow me to store this data, and then maybe do some simple arithmetic 00:23:17.580 --> 00:23:19.450 to figure out which is the most popular. 00:23:19.450 --> 00:23:20.380 So let's do this. 00:23:20.380 --> 00:23:24.240 Let me go ahead and change my titles to not be a list, not be a set. 00:23:24.240 --> 00:23:29.160 Let's have it be a dictionary instead, either doing this, or more succinctly, 00:23:29.160 --> 00:23:33.600 two curly braces that are empty gives me an empty dictionary automatically. 00:23:33.600 --> 00:23:34.860 What do I now want to do? 00:23:34.860 --> 00:23:37.440 I think most of my code can stay the same. 00:23:37.440 --> 00:23:40.860 But down here, I don't want to just blindly add titles 00:23:40.860 --> 00:23:42.180 to the data structure. 00:23:42.180 --> 00:23:45.060 I somehow need to keep track of the count. 00:23:45.060 --> 00:23:48.690 And unfortunately, if I just do this-- let's do titles, 00:23:48.690 --> 00:23:53.580 bracket, title, plus equals 1. 00:23:53.580 --> 00:23:56.130 This is a reasonable first attempt at this. 00:23:56.130 --> 00:23:57.420 Because what am I doing? 00:23:57.420 --> 00:24:02.730 If titles is a dictionary and I want to look up the current title therein, 00:24:02.730 --> 00:24:05.520 the syntax for that, like before, is titles, bracket, and then 00:24:05.520 --> 00:24:08.880 the key you want to use to index into the dictionary. 00:24:08.880 --> 00:24:12.000 It's not a number in this case, it's an actual word, a title. 00:24:12.000 --> 00:24:14.220 And you're just going to increment it by one, 00:24:14.220 --> 00:24:17.010 and then eventually I'll come back and finish my second loop 00:24:17.010 --> 00:24:19.710 and do things in terms of the order. 00:24:19.710 --> 00:24:23.580 But for now, let's just keep track of the total counts. 00:24:23.580 --> 00:24:25.660 Let me go ahead and increase my Terminal window. 00:24:25.660 --> 00:24:29.590 Let me do Python of favorites.py and hit Enter. 00:24:29.590 --> 00:24:30.090 Huh. 00:24:30.090 --> 00:24:34.140 How I Met Your Mother is giving me a key error. 00:24:34.140 --> 00:24:36.750 What does that mean? 00:24:36.750 --> 00:24:39.120 And why am I seeing this? 00:24:39.120 --> 00:24:42.330 And in fact, just to give a little bit of a breadcrumb here, 00:24:42.330 --> 00:24:44.140 let me zoom out here. 00:24:44.140 --> 00:24:46.920 Let me open up the CSV file again real quickly. 00:24:46.920 --> 00:24:50.460 And wow, we didn't even get past the second row in the file 00:24:50.460 --> 00:24:52.210 or the first show in the file. 00:24:52.210 --> 00:24:54.840 Notice that How I Met Your Mother, somewhat lowercased, 00:24:54.840 --> 00:24:57.562 is the very first show in therein. 00:24:57.562 --> 00:24:59.520 What's your instinct for why this is happening? 00:24:59.520 --> 00:25:01.750 AUDIENCE: You don't have a starting point. 00:25:01.750 --> 00:25:03.667 DAVID J. MALAN: I don't have a starting point. 00:25:03.667 --> 00:25:04.920 I'm adding one to what? 00:25:04.920 --> 00:25:09.840 I'm blindly indexing into the dictionary using a key, How I Met Your Mother, 00:25:09.840 --> 00:25:11.880 that doesn't yet exist in the dictionary. 00:25:11.880 --> 00:25:14.610 And so Python throws what's called a key error 00:25:14.610 --> 00:25:17.440 because the key you're trying to use just doesn't exist yet. 00:25:17.440 --> 00:25:21.000 So logically, how could we fix this? 00:25:21.000 --> 00:25:21.690 We're close. 00:25:21.690 --> 00:25:25.110 We got half of the problem solved, but I'm not handling the obvious, now, 00:25:25.110 --> 00:25:26.850 case of nothing being there. 00:25:26.850 --> 00:25:27.595 Yeah? 00:25:27.595 --> 00:25:28.930 AUDIENCE: Creating a counter. 00:25:28.930 --> 00:25:29.620 DAVID J. MALAN: Creating a-- 00:25:29.620 --> 00:25:30.370 AUDIENCE: Counter. 00:25:30.370 --> 00:25:32.240 DAVID J. MALAN: Creating the counter itself. 00:25:32.240 --> 00:25:34.190 So maybe I could do something like this. 00:25:34.190 --> 00:25:38.380 Let me close my Terminal window and let me ask a question first. 00:25:38.380 --> 00:25:44.980 If the current title is in the dictionary already, if title in titles, 00:25:44.980 --> 00:25:47.440 that's going to give me a true-false answer it turns out. 00:25:47.440 --> 00:25:52.240 Then, I can safely say, titles, bracket, title, plus equals 1. 00:25:52.240 --> 00:25:56.740 And recall, this is just shorthand notation for the same thing as in C, 00:25:56.740 --> 00:25:58.300 title plus 1. 00:25:58.300 --> 00:25:59.360 Whoops, typo. 00:25:59.360 --> 00:26:00.168 Don't do that. 00:26:00.168 --> 00:26:02.710 That's the same thing as this but it's a little more succinct 00:26:02.710 --> 00:26:04.810 just to say plus equals 1. 00:26:04.810 --> 00:26:09.490 Else, if it's logically not the case that the current title is in the titles 00:26:09.490 --> 00:26:13.610 dictionary, then I probably want to say titles, bracket, title equals? 00:26:13.610 --> 00:26:15.315 Feel free to just shout it out. 00:26:15.315 --> 00:26:15.940 AUDIENCE: Zero. 00:26:15.940 --> 00:26:16.815 DAVID J. MALAN: Zero. 00:26:16.815 --> 00:26:21.320 I just have to put some value there so that the key itself is also there. 00:26:21.320 --> 00:26:21.820 All right. 00:26:21.820 --> 00:26:24.370 So now that I've got this going on, let me go ahead 00:26:24.370 --> 00:26:26.620 and undo my sorting temporarily. 00:26:26.620 --> 00:26:29.650 And now let me go ahead and do this. 00:26:29.650 --> 00:26:33.460 I can, as a quick check, let me go ahead and just run the code 00:26:33.460 --> 00:26:35.050 as is, Python of favorites.py. 00:26:35.050 --> 00:26:37.030 I'm back in business. 00:26:37.030 --> 00:26:39.790 It's printing correctly, no key errors, but it's not sorted. 00:26:39.790 --> 00:26:41.620 And I'm not seeing any of the counts. 00:26:41.620 --> 00:26:43.780 Let me just quickly add the counts, and there's 00:26:43.780 --> 00:26:45.530 a couple of ways I could do this. 00:26:45.530 --> 00:26:52.900 I could, say, print out the title, and then, maybe, let's do something like-- 00:26:52.900 --> 00:26:57.220 how about just, comma, titles, bracket, title? 00:26:57.220 --> 00:26:59.020 So I'm going to print two things at once, 00:26:59.020 --> 00:27:01.600 both the current title in the dictionary, 00:27:01.600 --> 00:27:04.300 and whatever its value is by indexing into it. 00:27:04.300 --> 00:27:06.140 Let me increase my Terminal window. 00:27:06.140 --> 00:27:10.420 Let me run Python of favorites.py, Enter, and OK. 00:27:10.420 --> 00:27:11.908 Huh. 00:27:11.908 --> 00:27:14.080 Huh. 00:27:14.080 --> 00:27:17.560 None of you said a whole lot of TV shows, it seems. 00:27:17.560 --> 00:27:21.690 What's the logical error here? 00:27:21.690 --> 00:27:25.460 What did I do wrong if I look back at my code here? 00:27:25.460 --> 00:27:25.960 Yeah? 00:27:25.960 --> 00:27:27.600 Why so many 0s? 00:27:27.600 --> 00:27:29.490 AUDIENCE: [INAUDIBLE] 00:27:29.490 --> 00:27:30.490 DAVID J. MALAN: Exactly. 00:27:30.490 --> 00:27:35.140 To summarize, I initialized the count to 0 the first time I saw it, 00:27:35.140 --> 00:27:38.500 but I should have initialized it at least to 1 because I just saw it. 00:27:38.500 --> 00:27:40.220 Or I should change my code a bit. 00:27:40.220 --> 00:27:42.880 So for instance, if I go back in here, the simplest fix 00:27:42.880 --> 00:27:46.390 is probably to initialize to 1, because on this iteration of the loop, 00:27:46.390 --> 00:27:49.210 obviously, I'm seeing this title for the very first time. 00:27:49.210 --> 00:27:51.580 Or I could change my logic a little bit. 00:27:51.580 --> 00:27:53.470 I could do something like this instead. 00:27:53.470 --> 00:27:58.840 If the current title is not in titles, then I could initialize it to 0. 00:27:58.840 --> 00:28:02.860 And then I could get rid of the else, and now blindly index 00:28:02.860 --> 00:28:04.900 into the titles dictionary. 00:28:04.900 --> 00:28:09.100 Because now, on line 11, I can trust that lines 9 and 10 00:28:09.100 --> 00:28:12.040 took care of the initialization for me if need be. 00:28:12.040 --> 00:28:13.028 Which one is better? 00:28:13.028 --> 00:28:13.570 I don't know. 00:28:13.570 --> 00:28:17.150 This one's a little nicer, maybe because it's one line fewer. 00:28:17.150 --> 00:28:20.470 But I think both approaches are perfectly reasonable and well-designed. 00:28:20.470 --> 00:28:22.420 But the key thing, no pun intended, is that we 00:28:22.420 --> 00:28:27.100 have to make sure the key exists before we presume to actually incrue. 00:28:27.100 --> 00:28:28.810 Oh, this is wrong. 00:28:28.810 --> 00:28:30.370 This is incorrect code. 00:28:30.370 --> 00:28:33.040 What did I do wrong? 00:28:33.040 --> 00:28:33.790 OK, yes. 00:28:33.790 --> 00:28:34.400 There we go. 00:28:34.400 --> 00:28:37.690 So otherwise, everyone would have liked this show once, and no matter 00:28:37.690 --> 00:28:39.190 how many people said the same thing. 00:28:39.190 --> 00:28:41.237 Now the code is as it should be. 00:28:41.237 --> 00:28:43.570 So let me go ahead and open up my Terminal window again. 00:28:43.570 --> 00:28:47.710 Let me run Python of favorites.py, and now we see more reasonable counts. 00:28:47.710 --> 00:28:49.150 Some shows weren't that popular. 00:28:49.150 --> 00:28:50.830 There's just 1s and maybe 2s. 00:28:50.830 --> 00:28:56.570 But I bet if we sort these things we can start to see a little more detail. 00:28:56.570 --> 00:28:57.890 So how else can we do this? 00:28:57.890 --> 00:29:04.520 Well, turns out, when dealing with a dictionary like this-- 00:29:04.520 --> 00:29:07.160 let's go ahead and just sort the titles themselves. 00:29:07.160 --> 00:29:12.130 So let's reintroduce the sorted function as I did before, but no other changes. 00:29:12.130 --> 00:29:14.735 Let me go ahead now and run Python of favorites.py. 00:29:14.735 --> 00:29:17.110 Now it's just a little easier to wrap your mind around it 00:29:17.110 --> 00:29:18.568 because at least it's alphabetical. 00:29:18.568 --> 00:29:22.600 But it's not sorted by value, it's sorted by key. 00:29:22.600 --> 00:29:26.170 But sure enough, if we scroll down, there's something down here, 00:29:26.170 --> 00:29:29.202 for instance, like, let's see, The Office. 00:29:29.202 --> 00:29:30.910 That's definitely going to be a contender 00:29:30.910 --> 00:29:32.860 for most popular, 15 responses. 00:29:32.860 --> 00:29:35.860 But let's see what's actually going to bubble up to the top. 00:29:35.860 --> 00:29:40.870 Unfortunately, the sorted function only sorts dictionaries by keys 00:29:40.870 --> 00:29:44.410 by default, not by values. 00:29:44.410 --> 00:29:47.500 But it turns out, in Python, if you read the documentation 00:29:47.500 --> 00:29:49.510 for the sorted function, you can actually 00:29:49.510 --> 00:29:54.580 pass in other arguments that tell it how to sort things. 00:29:54.580 --> 00:29:57.430 For instance, if I want to do things in reverse order, 00:29:57.430 --> 00:30:02.140 I can add a second parameter to the sorted function called reverse. 00:30:02.140 --> 00:30:03.310 And it's a named parameter. 00:30:03.310 --> 00:30:05.620 You literally say, reverse equals true, so 00:30:05.620 --> 00:30:08.830 that the position of it in the comma-separated list doesn't matter. 00:30:08.830 --> 00:30:11.762 If I now rerun this after increasing my Terminal window, 00:30:11.762 --> 00:30:13.720 you'll see now that it's in the opposite order. 00:30:13.720 --> 00:30:16.330 Now adventure and Anne with an E is at the bottom 00:30:16.330 --> 00:30:18.410 of the output instead of the top. 00:30:18.410 --> 00:30:26.845 How can I tell it to sort by values instead of by key? 00:30:26.845 --> 00:30:28.220 Well, let's go ahead and do this. 00:30:28.220 --> 00:30:30.940 Let me go ahead and define a function. 00:30:30.940 --> 00:30:33.070 I'm just going to call it f to keep things simple. 00:30:33.070 --> 00:30:36.190 And this f function is going to take a title as input. 00:30:36.190 --> 00:30:41.140 And given a given title, it's going to return the value of that title. 00:30:41.140 --> 00:30:44.560 So actually, maybe a better name for this would be get value, 00:30:44.560 --> 00:30:46.820 and/or we could come up with something else as well. 00:30:46.820 --> 00:30:49.300 The purpose of the get value function, to be clear, 00:30:49.300 --> 00:30:54.200 is to take it as input a title and then return the corresponding value. 00:30:54.200 --> 00:30:55.460 Why is this useful? 00:30:55.460 --> 00:30:57.760 Well, it turns out that the sorted function in Python, 00:30:57.760 --> 00:31:01.870 according to its documentation, also takes a key parameter, 00:31:01.870 --> 00:31:05.860 where you can pass in, crazy enough, the name of a function 00:31:05.860 --> 00:31:11.650 that it will use in order to determine what it should sort by, by the key, 00:31:11.650 --> 00:31:16.550 or by the value, or in other cases, even other types of data as well. 00:31:16.550 --> 00:31:19.390 So there's a curiosity here, though, that's very deliberate. 00:31:19.390 --> 00:31:21.550 Key is the name of the parameter, just like reverse 00:31:21.550 --> 00:31:23.092 was the name of this other parameter. 00:31:23.092 --> 00:31:26.110 The value of it, though, is not a function call. 00:31:26.110 --> 00:31:27.580 It's a function name. 00:31:27.580 --> 00:31:30.520 Notice I am not doing this, no parentheses. 00:31:30.520 --> 00:31:35.400 I'm instead passing in get value, the function I wrote, by its name. 00:31:35.400 --> 00:31:37.900 And this is a feature of Python and certain other languages. 00:31:37.900 --> 00:31:41.110 Just like variables, you can actually pass whole functions 00:31:41.110 --> 00:31:45.560 around so that they can be called for you later on by someone else. 00:31:45.560 --> 00:31:48.880 So what this means is that the sorted function written by Python, 00:31:48.880 --> 00:31:51.410 they didn't know what you're going to want to sort by today. 00:31:51.410 --> 00:31:55.810 But if you provide them with a function called get value, or anything else, now 00:31:55.810 --> 00:31:58.000 their sorted function will use that function 00:31:58.000 --> 00:32:01.840 to determine, OK, if you don't want to sort by the key of the dictionary, what 00:32:01.840 --> 00:32:02.950 do you want to sort by? 00:32:02.950 --> 00:32:05.800 This is going to tell it to sort by the value 00:32:05.800 --> 00:32:08.750 by returning the specific value we care about. 00:32:08.750 --> 00:32:12.580 So let me go ahead now and rerun this after increasing my Terminal, Python 00:32:12.580 --> 00:32:14.620 of favorites.py, Enter. 00:32:14.620 --> 00:32:17.110 Here we have now an example of all of the titles you all 00:32:17.110 --> 00:32:22.180 typed in, albeit forced to uppercase and with any whitespace thrown out. 00:32:22.180 --> 00:32:24.733 And now, The Office is an easy win over Friends, 00:32:24.733 --> 00:32:27.400 versus Community, versus Game of Thrones, Breaking Bad, and then 00:32:27.400 --> 00:32:29.740 a lot of variants thereafter. 00:32:29.740 --> 00:32:31.680 So there's a lot of steps to go through. 00:32:31.680 --> 00:32:33.555 This isn't that bad once you've done it once, 00:32:33.555 --> 00:32:35.472 and you know what these functions are, and you 00:32:35.472 --> 00:32:36.870 know that these parameters exist. 00:32:36.870 --> 00:32:38.010 But it's a lot of work. 00:32:38.010 --> 00:32:42.090 That's 17 lines of code just to analyze a CSV file 00:32:42.090 --> 00:32:45.420 that you all created by way of those Google Form submissions. 00:32:45.420 --> 00:32:48.360 But it took me a lot of work just to get simple answers out of it. 00:32:48.360 --> 00:32:50.277 And indeed, that's going to be among the goals 00:32:50.277 --> 00:32:52.920 for today, ultimately, is, how can we just make this easier? 00:32:52.920 --> 00:32:54.795 It's one thing to learn new things in Python, 00:32:54.795 --> 00:32:57.300 but if we can avoid writing code, or this much code, 00:32:57.300 --> 00:32:58.840 that's going to be a good thing. 00:32:58.840 --> 00:33:01.020 And so one other technique we can introduce here 00:33:01.020 --> 00:33:03.570 that does allow us to write a little less code 00:33:03.570 --> 00:33:05.730 is, we can actually get rid of this function. 00:33:05.730 --> 00:33:09.240 It turns out, in Python, if you just need to make a function 00:33:09.240 --> 00:33:11.940 but it's going to be used and then essentially thrown away, 00:33:11.940 --> 00:33:14.790 it's not something you're going to be reusing in multiple places-- 00:33:14.790 --> 00:33:17.550 it's not like a library function that you want to keep around-- 00:33:17.550 --> 00:33:19.680 you can actually just do this. 00:33:19.680 --> 00:33:23.430 You can change the value of this key parameter 00:33:23.430 --> 00:33:25.950 to be what's called a lambda function, which 00:33:25.950 --> 00:33:29.040 is a fancy way of saying a function that technically has no name. 00:33:29.040 --> 00:33:30.870 It's an anonymous function. 00:33:30.870 --> 00:33:32.400 Why does it have no name? 00:33:32.400 --> 00:33:35.530 Well, it's kind of stupid that I invented this name on line 13. 00:33:35.530 --> 00:33:38.670 I used it on line 16, and then I never again used it. 00:33:38.670 --> 00:33:42.520 If there's only being used in one place, why bother giving it a name at all? 00:33:42.520 --> 00:33:45.000 So if you instead, in Python, say lambda, 00:33:45.000 --> 00:33:47.850 and then type out the name of the parameter 00:33:47.850 --> 00:33:50.130 you want this anonymous function to take, 00:33:50.130 --> 00:33:54.460 you can then say, go ahead and return this value. 00:33:54.460 --> 00:33:57.030 Now let's notice the inconsistencies here. 00:33:57.030 --> 00:33:59.850 When you use this special lambda keyword that says, hey Python, 00:33:59.850 --> 00:34:02.850 give me an anonymous function, a function with no name, 00:34:02.850 --> 00:34:06.610 it then says, Python, this anonymous function will take one parameter. 00:34:06.610 --> 00:34:08.699 Notice there's no parentheses. 00:34:08.699 --> 00:34:10.500 And that's deliberate, if confusing. 00:34:10.500 --> 00:34:12.909 It just tightens things up a little bit. 00:34:12.909 --> 00:34:16.770 Notice that there's no return keyword, which similarly tightens things 00:34:16.770 --> 00:34:18.780 up a bit, albeit inconsistently. 00:34:18.780 --> 00:34:22.650 But this line of code I've just highlighted 00:34:22.650 --> 00:34:26.429 is actually identical in functionality to this. 00:34:26.429 --> 00:34:28.320 But it throws away the word [INAUDIBLE]. 00:34:28.320 --> 00:34:29.850 It throws away the word get value. 00:34:29.850 --> 00:34:33.510 It throws away the parentheses, and it throws away the return keyword just 00:34:33.510 --> 00:34:34.679 to tighten things up. 00:34:34.679 --> 00:34:37.290 And it's well suited for a problem like this 00:34:37.290 --> 00:34:39.960 where I just want to pass in a tiny little function that 00:34:39.960 --> 00:34:41.090 does something useful. 00:34:41.090 --> 00:34:42.840 But it's not something I'm going to reuse. 00:34:42.840 --> 00:34:45.100 It doesn't need multiple lines to take up space. 00:34:45.100 --> 00:34:47.159 It's just a nice, elegant one liner. 00:34:47.159 --> 00:34:48.840 That's all a lambda function does. 00:34:48.840 --> 00:34:51.909 It allows you to create an anonymous function right then and there. 00:34:51.909 --> 00:34:57.210 And then the function you're passing it to, like sorted, will use it as before. 00:34:57.210 --> 00:35:00.810 Indeed, if I run Python of favorites.py after growing my Terminal window, 00:35:00.810 --> 00:35:02.800 the result is exactly the same. 00:35:02.800 --> 00:35:06.360 And we see at the bottom here all of those small results. 00:35:06.360 --> 00:35:10.810 Are any questions, then, on this syntax, on these ideas? 00:35:10.810 --> 00:35:13.770 The goal here has been to write a Python program that just starts 00:35:13.770 --> 00:35:18.860 to analyze or clean up data like this. 00:35:18.860 --> 00:35:19.765 Yeah? 00:35:19.765 --> 00:35:23.273 AUDIENCE: [INAUDIBLE] 00:35:23.273 --> 00:35:26.440 DAVID J. MALAN: Could you use the lambda if it's just returning immediately? 00:35:26.440 --> 00:35:29.260 It's really meant for one line of code, generally. 00:35:29.260 --> 00:35:31.420 So you don't use the return keyword. 00:35:31.420 --> 00:35:33.880 You just say what it is you want to return. 00:35:33.880 --> 00:35:37.680 AUDIENCE: [INAUDIBLE] 00:35:37.680 --> 00:35:38.930 DAVID J. MALAN: Good question. 00:35:38.930 --> 00:35:40.960 Could you do more in that one line if it's 00:35:40.960 --> 00:35:42.670 got to be a more involved algorithm? 00:35:42.670 --> 00:35:45.820 Yes, but you would just ultimately return the value in question. 00:35:45.820 --> 00:35:47.947 In short, if it's getting at all sophisticated 00:35:47.947 --> 00:35:49.780 you don't use the lambda function in Python. 00:35:49.780 --> 00:35:52.510 You go ahead and actually just define a name for it, 00:35:52.510 --> 00:35:54.233 even if it's a one-off name. 00:35:54.233 --> 00:35:56.650 JavaScript, another language we'll look at in a few weeks, 00:35:56.650 --> 00:35:59.590 makes heavier use, I dare say, of lambda functions. 00:35:59.590 --> 00:36:02.030 And those can actually be multiple, multiple lines, 00:36:02.030 --> 00:36:05.520 but Python does not support that instinct. 00:36:05.520 --> 00:36:06.020 All right. 00:36:06.020 --> 00:36:07.728 So let's go ahead and do one other thing. 00:36:07.728 --> 00:36:10.340 Office was clearly popping out of the code here quite a bit. 00:36:10.340 --> 00:36:12.760 Let's go ahead and write a slightly different program 00:36:12.760 --> 00:36:15.400 that maybe just focuses on The Office for the moment, 00:36:15.400 --> 00:36:16.900 just focuses on The Office. 00:36:16.900 --> 00:36:21.250 So let me go ahead and throw most of this code away, up until this point 00:36:21.250 --> 00:36:23.080 when I'm inside of my inner loop. 00:36:23.080 --> 00:36:26.050 And let me go ahead, and I don't even want the global variable here. 00:36:26.050 --> 00:36:28.240 All I want to do is focus on the current title. 00:36:28.240 --> 00:36:30.730 How could I detect if someone likes The Office? 00:36:30.730 --> 00:36:33.790 Well, I could say something like-- 00:36:33.790 --> 00:36:34.670 how about this? 00:36:34.670 --> 00:36:36.310 So counter equals 0. 00:36:36.310 --> 00:36:38.350 We'll just focus on The Office. 00:36:38.350 --> 00:36:43.930 If title equals, equals The Office, I could then go ahead and say, 00:36:43.930 --> 00:36:47.410 counter plus equals 1. 00:36:47.410 --> 00:36:48.400 I don't need a key. 00:36:48.400 --> 00:36:49.858 There's no dictionary involved now. 00:36:49.858 --> 00:36:51.880 It's just a simple integer variable. 00:36:51.880 --> 00:36:55.750 And then, down here I'll say something like, 00:36:55.750 --> 00:37:00.970 number of people who like The Office is, whatever this value is. 00:37:00.970 --> 00:37:03.850 And I'll put in counter in curly braces, and then I'll 00:37:03.850 --> 00:37:05.783 turn this whole thing into an F string. 00:37:05.783 --> 00:37:07.450 All right, let me go ahead and run this. 00:37:07.450 --> 00:37:10.100 Python of favorites.py, Enter. 00:37:10.100 --> 00:37:12.610 Number of people who like The Office is 15. 00:37:12.610 --> 00:37:13.990 All right, so that's great. 00:37:13.990 --> 00:37:17.530 But let's go ahead now and deliberately muddy the data a bit. 00:37:17.530 --> 00:37:21.160 All of you were very nice in that you typed in The Office. 00:37:21.160 --> 00:37:23.230 But you can imagine someone just typing Office, 00:37:23.230 --> 00:37:25.692 for instance, maybe there, maybe there. 00:37:25.692 --> 00:37:28.150 And many people might just write Office, you could imagine. 00:37:28.150 --> 00:37:30.400 Didn't happen here, but suppose it did, and probably 00:37:30.400 --> 00:37:33.290 would have if we had even more and more submissions over time. 00:37:33.290 --> 00:37:37.000 Now let's go ahead and rerun this program, no changes to the code. 00:37:37.000 --> 00:37:39.130 Now only 13 people like The Office. 00:37:39.130 --> 00:37:40.150 So let's fix this. 00:37:40.150 --> 00:37:45.790 The data is now as I mutated it to have a couple Offices, and many The Offices. 00:37:45.790 --> 00:37:51.050 How could I change my Python code to now count both of those situations? 00:37:51.050 --> 00:37:55.390 What could I change up here in order to improve this situation? 00:37:55.390 --> 00:37:57.910 Any thoughts? 00:37:57.910 --> 00:37:58.750 Yeah? 00:37:58.750 --> 00:38:02.300 AUDIENCE: You write the title [INAUDIBLE].. 00:38:02.300 --> 00:38:05.050 DAVID J. MALAN: Yeah, so I could just ask two questions like that. 00:38:05.050 --> 00:38:09.370 If title equals The Office, or title equals, equals just 00:38:09.370 --> 00:38:10.438 Office, for instance. 00:38:10.438 --> 00:38:12.730 And I'm still don't have to worry about capitalization. 00:38:12.730 --> 00:38:15.813 I don't have to worry about spaces because I at least threw that all away. 00:38:15.813 --> 00:38:18.200 Now I can go ahead and rerun this code. 00:38:18.200 --> 00:38:19.990 Let me go run it a third time. 00:38:19.990 --> 00:38:22.460 OK, so we're back up to 15. 00:38:22.460 --> 00:38:25.210 So I like that. 00:38:25.210 --> 00:38:28.660 You could imagine this not scaling very well. 00:38:28.660 --> 00:38:31.763 Avatar had three different permutations, and there were some others 00:38:31.763 --> 00:38:34.180 if we dug deeper that there might have been more variants. 00:38:34.180 --> 00:38:36.430 Could we do something a little more general purpose? 00:38:36.430 --> 00:38:38.230 Well, we could do something like this. 00:38:38.230 --> 00:38:41.883 If Office in the title-- 00:38:41.883 --> 00:38:44.050 this is kind of a cool thing you can do with Python. 00:38:44.050 --> 00:38:46.930 It's very English-like, just ask the question, albeit tersely. 00:38:46.930 --> 00:38:50.660 This, interesting, just got me into trouble. 00:38:50.660 --> 00:38:53.140 Now, all of a sudden, we're up to 16. 00:38:53.140 --> 00:38:56.252 Does anyone know what the other one is? 00:38:56.252 --> 00:38:58.098 AUDIENCE: Someone put V Office. 00:38:58.098 --> 00:38:59.265 DAVID J. MALAN: What Office? 00:38:59.265 --> 00:39:02.120 AUDIENCE: Someone entered a V Office, [INAUDIBLE].. 00:39:04.517 --> 00:39:05.850 DAVID J. MALAN: Oh, interesting. 00:39:05.850 --> 00:39:07.930 Yes, so they hit The. 00:39:07.930 --> 00:39:08.430 OK. 00:39:08.430 --> 00:39:11.568 [APPLAUSE] 00:39:11.568 --> 00:39:12.360 DAVID J. MALAN: OK. 00:39:12.360 --> 00:39:13.800 Someone did that, sure. 00:39:13.800 --> 00:39:17.130 So The V Office. 00:39:17.130 --> 00:39:19.470 OK, this one's actually going to be hard to correct for. 00:39:19.470 --> 00:39:21.300 I can't really think of a general-- 00:39:21.300 --> 00:39:25.860 well, this is actually a good example of data gets messy fast. 00:39:25.860 --> 00:39:28.080 And you could imagine doing something where, OK, we 00:39:28.080 --> 00:39:32.920 could have like 26 conditions if someone said The A Office, or The B Office, 00:39:32.920 --> 00:39:33.420 right? 00:39:33.420 --> 00:39:34.650 You could imagine doing that. 00:39:34.650 --> 00:39:37.400 But then there's surely going to be other typos that are possible. 00:39:37.400 --> 00:39:39.610 So that's actually a hard one to fix. 00:39:39.610 --> 00:39:44.730 But it turns out we got lucky and now this is actually the accurate count. 00:39:44.730 --> 00:39:46.660 But the data is itself messy. 00:39:46.660 --> 00:39:49.950 Let me show another way that just adds another tool to our toolkit. 00:39:49.950 --> 00:39:54.750 It turns out that there's this feature in many programming languages, Python 00:39:54.750 --> 00:39:57.040 among them, called regular expressions. 00:39:57.040 --> 00:39:59.040 And this is actually a really powerful technique 00:39:59.040 --> 00:40:00.873 that we'll just scratch the surface of here. 00:40:00.873 --> 00:40:04.290 But it's going to be really useful, actually, maybe toward final projects, 00:40:04.290 --> 00:40:08.880 in web programming, any time you want to clean up data or validate data. 00:40:08.880 --> 00:40:11.910 And actually, just to make this clear, give me a moment 00:40:11.910 --> 00:40:14.160 before I switch screens here. 00:40:14.160 --> 00:40:18.450 And let me open up a Google Form from scratch. 00:40:18.450 --> 00:40:22.230 Give me just a moment to create something real quick. 00:40:22.230 --> 00:40:25.260 If you've never noticed this before when creating a Google Form, 00:40:25.260 --> 00:40:28.410 you can do a question. 00:40:28.410 --> 00:40:30.360 And if you want the user to type in something 00:40:30.360 --> 00:40:33.033 very specific as a short text answer like this, 00:40:33.033 --> 00:40:35.700 you might know that there's toggles like this in Google's world, 00:40:35.700 --> 00:40:36.930 like you can require it. 00:40:36.930 --> 00:40:39.270 Or you can do response validation. 00:40:39.270 --> 00:40:41.670 You could say, what's your email? 00:40:41.670 --> 00:40:47.250 And then you could say something like, text is an email. 00:40:47.250 --> 00:40:52.530 So here's an example in Google Forms how you can validate users' input. 00:40:52.530 --> 00:40:57.150 But a feature most of you have probably never noticed, or cared about, or used, 00:40:57.150 --> 00:40:59.490 is this thing called a regular expression, where 00:40:59.490 --> 00:41:01.440 you can actually define a pattern. 00:41:01.440 --> 00:41:04.830 And I could actually reimplement that same idea by doing something like this. 00:41:04.830 --> 00:41:11.070 I can say, let the user type in anything represented by .star, then an at sign, 00:41:11.070 --> 00:41:16.600 then something else, then a literal period, then, for instance, 00:41:16.600 --> 00:41:17.680 something else. 00:41:17.680 --> 00:41:19.950 So it's very cryptic, admittedly, at first glance. 00:41:19.950 --> 00:41:23.430 But this means any character 0 more times. 00:41:23.430 --> 00:41:26.160 This means any character 0 more times. 00:41:26.160 --> 00:41:28.725 This means a literal period, because apparently 00:41:28.725 --> 00:41:32.160 dot means any character in the context of these patterns. 00:41:32.160 --> 00:41:36.070 Then this thing means any character 0 more times. 00:41:36.070 --> 00:41:38.670 So I should actually be a little more nitpicky. 00:41:38.670 --> 00:41:41.530 You don't want 0 or more times, you want 1 or more times. 00:41:41.530 --> 00:41:45.340 So this with the plus means any character 1 or more time. 00:41:45.340 --> 00:41:47.020 So there has to be something there. 00:41:47.020 --> 00:41:51.630 And I think I want the same thing here 1 or more times, 1 or more times. 00:41:51.630 --> 00:41:56.160 Or heck, if I want to restrict this form in some sense to edu addresses, 00:41:56.160 --> 00:41:59.520 I could change that last thing to literally .edu. 00:41:59.520 --> 00:42:01.570 And so long story short, even though this looks, 00:42:01.570 --> 00:42:06.420 I'm sure, pretty cryptic, there's this mini language built into Python, 00:42:06.420 --> 00:42:09.900 and JavaScript, and Java, and other languages that allows you to express 00:42:09.900 --> 00:42:12.430 patterns in a standardized way. 00:42:12.430 --> 00:42:15.930 And this pattern is actually something we can implement in code, too. 00:42:15.930 --> 00:42:18.150 And let me switch back to Python for a second just 00:42:18.150 --> 00:42:19.920 to do the same kind of idea. 00:42:19.920 --> 00:42:22.950 Let me toggle back to my code here. 00:42:22.950 --> 00:42:26.730 Let me put up, for instance, a summary of what it is you can do. 00:42:26.730 --> 00:42:33.030 And here's just a quick summary of some of the available symbols. 00:42:33.030 --> 00:42:39.330 A period may represent any character. .star or .asterisks means 0 or more 00:42:39.330 --> 00:42:39.970 characters. 00:42:39.970 --> 00:42:43.020 So the dot means anything, so it can be A or nothing. 00:42:43.020 --> 00:42:44.160 It can be B or nothing. 00:42:44.160 --> 00:42:49.530 It can be A, B, A, B, C. It can be any combination of 0 or more characters. 00:42:49.530 --> 00:42:52.860 Change that to a plus and you now express one or more characters. 00:42:52.860 --> 00:42:55.860 Question mark means something is optional. 00:42:55.860 --> 00:42:59.370 Caret symbol means start matching at the beginning of the user's input. 00:42:59.370 --> 00:43:05.100 Dollar sign means stop matching at the end of the user's input. 00:43:05.100 --> 00:43:07.210 So we won't play with all of these just now. 00:43:07.210 --> 00:43:11.470 But let me go over here and actually tackle this Office problem. 00:43:11.470 --> 00:43:15.450 Let me go ahead and import a new library called the regular expression library, 00:43:15.450 --> 00:43:17.280 import re. 00:43:17.280 --> 00:43:20.560 And then, down here, let me say this. 00:43:20.560 --> 00:43:25.080 If re.search, this pattern. 00:43:25.080 --> 00:43:30.210 Let's just search for Office, quote, unquote, in the current title. 00:43:30.210 --> 00:43:32.730 Then we're going to go ahead and increase the counter. 00:43:32.730 --> 00:43:35.040 So it turns out that the regular expression library 00:43:35.040 --> 00:43:39.090 has a function called search that takes as its first argument a pattern, 00:43:39.090 --> 00:43:41.970 and then, as its second argument the string you 00:43:41.970 --> 00:43:44.160 want to analyze for that pattern. 00:43:44.160 --> 00:43:47.880 So it's sort of looking for a needle in this haystack, from left to right. 00:43:47.880 --> 00:43:52.080 Let me go ahead now and run this version of the program, Enter. 00:43:52.080 --> 00:43:56.250 And now I screwed up because I forgot my colon, but that's old stuff. 00:43:56.250 --> 00:43:57.322 Enter. 00:43:57.322 --> 00:43:59.150 Huh. 00:43:59.150 --> 00:44:01.800 Number of people who like The Office is now 0. 00:44:01.800 --> 00:44:02.970 So this seems like a big-- 00:44:02.970 --> 00:44:05.640 thank you-- big step backwards. 00:44:05.640 --> 00:44:08.010 What did I do wrong? 00:44:08.010 --> 00:44:08.730 Yeah? 00:44:08.730 --> 00:44:10.410 AUDIENCE: [INAUDIBLE] 00:44:10.410 --> 00:44:11.610 DAVID J. MALAN: Yeah. 00:44:11.610 --> 00:44:14.595 I forced all my input to uppercase, so I probably need to do this. 00:44:14.595 --> 00:44:16.470 So we'll come back to other approaches there. 00:44:16.470 --> 00:44:17.430 Let me rerun it now. 00:44:17.430 --> 00:44:19.800 OK, now we're back up to 16. 00:44:19.800 --> 00:44:22.200 But I could even, let's say-- 00:44:22.200 --> 00:44:25.110 I could tolerate just The Office. 00:44:25.110 --> 00:44:30.120 How about this, or how about something like, or The Office? 00:44:30.120 --> 00:44:32.373 Let me do this instead. 00:44:32.373 --> 00:44:34.290 And let me use these other special characters. 00:44:34.290 --> 00:44:37.380 This caret sign means the beginning of the string. 00:44:37.380 --> 00:44:40.740 This dollar sign weirdly represents the end of the string. 00:44:40.740 --> 00:44:44.490 I'm adding in some parentheses just like in math, just to add another symbol 00:44:44.490 --> 00:44:46.470 here, the or symbol here. 00:44:46.470 --> 00:44:50.580 And this is saying start matching at the beginning of the user string. 00:44:50.580 --> 00:44:54.660 Check if the beginning of the string is Office, or the beginning of the string 00:44:54.660 --> 00:44:55.920 is The Office. 00:44:55.920 --> 00:44:58.540 And then, you better be at the end of the string. 00:44:58.540 --> 00:45:01.650 So they can't keep typing words before or after that input. 00:45:01.650 --> 00:45:03.690 Let me go ahead and rerun the program. 00:45:03.690 --> 00:45:07.500 And now we're down to 15, which used to be our correct answer, 00:45:07.500 --> 00:45:10.770 but then we noticed The V Office. 00:45:10.770 --> 00:45:12.680 How can we deal with that? 00:45:12.680 --> 00:45:16.200 It's going to be messier to deal with that. 00:45:16.200 --> 00:45:21.555 How about if I tolerate any character represented by dot 00:45:21.555 --> 00:45:23.640 in between The and Office? 00:45:23.640 --> 00:45:28.110 Now if I rerun it, now I really have this expressive capability. 00:45:28.110 --> 00:45:32.340 So this is only to say, there are so many ways in languages, in general, 00:45:32.340 --> 00:45:33.390 to solve problems. 00:45:33.390 --> 00:45:35.950 And some of these tools are more sophisticated than others. 00:45:35.950 --> 00:45:38.957 This is one that you've actually probably glanced at but never used 00:45:38.957 --> 00:45:41.040 in the context of Google Forms for years if you're 00:45:41.040 --> 00:45:43.770 in the habit of creating these for student groups or other activities. 00:45:43.770 --> 00:45:45.840 But it's now something you can start to leverage. 00:45:45.840 --> 00:45:49.440 And we're just scratching the surface of what's actually possible with this. 00:45:49.440 --> 00:45:53.640 But let's now do one final example just using some Python code here. 00:45:53.640 --> 00:45:55.530 And let's actually write a program that's 00:45:55.530 --> 00:46:00.040 a little more general purpose that allows me to search for any given title 00:46:00.040 --> 00:46:01.770 and figure out its popularity. 00:46:01.770 --> 00:46:04.410 So let me go ahead and simplify this. 00:46:04.410 --> 00:46:06.630 Let's get rid of our regular expressions. 00:46:06.630 --> 00:46:09.940 Let's go ahead and continue capitalizing the title. 00:46:09.940 --> 00:46:11.580 And let's go ahead to-- 00:46:11.580 --> 00:46:16.020 at the beginning of this program, and first ask the user for the title 00:46:16.020 --> 00:46:17.380 they want to search for. 00:46:17.380 --> 00:46:20.320 So title equals, let's ask the user for input, 00:46:20.320 --> 00:46:23.490 which is essentially the same thing as our CS50 get_string function. 00:46:23.490 --> 00:46:24.960 Ask them for the title. 00:46:24.960 --> 00:46:28.170 And then whatever they type in, let's go ahead and strip whitespace 00:46:28.170 --> 00:46:30.900 and uppercase the thing again. 00:46:30.900 --> 00:46:35.820 And now, inside of my loop, I could say something like this. 00:46:35.820 --> 00:46:42.660 If the current row's title after stripping whitespace and forcing 00:46:42.660 --> 00:46:46.920 it to uppercase, too, equals the user's title, then, go ahead 00:46:46.920 --> 00:46:49.440 and maybe increment a counter. 00:46:49.440 --> 00:46:51.160 So I still need that counter back. 00:46:51.160 --> 00:46:56.610 So let me go ahead and define this maybe in here, counter equals 0. 00:46:56.610 --> 00:46:58.720 And then, at the very end of this program, 00:46:58.720 --> 00:47:01.050 let me go ahead and print out just the popularity 00:47:01.050 --> 00:47:03.040 of whatever the human typed in. 00:47:03.040 --> 00:47:06.030 So again, the only difference is I'm asking the human for some input 00:47:06.030 --> 00:47:06.720 this time. 00:47:06.720 --> 00:47:09.150 I'm initializing my counter to 0, then I'm 00:47:09.150 --> 00:47:12.660 searching for their title in the CSV file 00:47:12.660 --> 00:47:15.810 by doing the same massaging of the data by forcing it to uppercase 00:47:15.810 --> 00:47:18.570 and getting rid of the whitespace. 00:47:18.570 --> 00:47:21.780 So now, when I run Python of favorites.py, Enter, 00:47:21.780 --> 00:47:30.030 I could type in the office all lowercase even, and now we're down to 13. 00:47:30.030 --> 00:47:34.940 13, why? 00:47:34.940 --> 00:47:36.700 Oh, that's correct. 00:47:36.700 --> 00:47:40.640 Because I'm the one that went in and removed those The keywords a bit ago. 00:47:40.640 --> 00:47:43.030 If we fixed those, we would be back up to 15. 00:47:43.030 --> 00:47:47.650 If we added support for The V Office, we would be up to 16 as well. 00:47:47.650 --> 00:47:50.350 All right, any questions then on these various manipulations? 00:47:50.350 --> 00:47:52.183 And if you're feeling like, oh, my god, this 00:47:52.183 --> 00:47:55.100 is so much Python code just to do simple things, that's the point. 00:47:55.100 --> 00:47:57.160 And indeed, even though it's a powerful language 00:47:57.160 --> 00:48:00.670 and can solve these kinds of problems, we had to write almost 20 lines of code 00:48:00.670 --> 00:48:03.470 just to ask a single question like this. 00:48:03.470 --> 00:48:07.120 But any questions on how we did this, or on any of these building 00:48:07.120 --> 00:48:10.750 blocks along the way? 00:48:10.750 --> 00:48:11.890 Anything here? 00:48:11.890 --> 00:48:12.390 No? 00:48:12.390 --> 00:48:12.600 All right. 00:48:12.600 --> 00:48:13.300 That was a lot. 00:48:13.300 --> 00:48:14.800 Let's take a five-minute break here. 00:48:14.800 --> 00:48:17.230 When we come back, we'll do it better. 00:48:17.230 --> 00:48:18.480 So we are back. 00:48:18.480 --> 00:48:20.610 And the rest of today is ultimately about, how 00:48:20.610 --> 00:48:24.840 can we store, and manipulate, and change, and retrieve data 00:48:24.840 --> 00:48:28.090 more efficiently than we might by just writing raw code? 00:48:28.090 --> 00:48:31.440 This isn't to say that you shouldn't use Python to do the kinds of things 00:48:31.440 --> 00:48:32.280 that we just did. 00:48:32.280 --> 00:48:37.020 And in fact, it might be super common if you're getting a lot of messy input 00:48:37.020 --> 00:48:39.073 from users that you might want to clean it up. 00:48:39.073 --> 00:48:42.240 And maybe the best way to do that is to write a program so that step-by-step 00:48:42.240 --> 00:48:44.370 you can make all of the requisite changes and fixes 00:48:44.370 --> 00:48:47.523 like we did with The Office, for instance, again and again, 00:48:47.523 --> 00:48:50.190 and reuse that code, especially if more and more submissions are 00:48:50.190 --> 00:48:51.150 coming through. 00:48:51.150 --> 00:48:53.550 But another theme of today, ultimately, is 00:48:53.550 --> 00:48:57.638 that sometimes there are different, if not better tools for the same job. 00:48:57.638 --> 00:48:59.430 And in fact, now at this point in the term, 00:48:59.430 --> 00:49:02.310 as we begin to introduce not just Python, but in a moment 00:49:02.310 --> 00:49:06.120 a language called SQL, and next week, a language called JavaScript, 00:49:06.120 --> 00:49:09.150 and the week after that, synthesizing a whole lot of these languages 00:49:09.150 --> 00:49:12.420 together is to just kind of paint a picture of how 00:49:12.420 --> 00:49:15.900 you might decide what the trade-offs are between using this tool, or this tool, 00:49:15.900 --> 00:49:16.830 or this other tool. 00:49:16.830 --> 00:49:19.770 Because undoubtedly you can solve problems moving forward 00:49:19.770 --> 00:49:22.660 in many different ways with many different tools. 00:49:22.660 --> 00:49:25.020 So let's give you another tool, one with which 00:49:25.020 --> 00:49:28.170 you can implement a proper relational database. 00:49:28.170 --> 00:49:31.050 What we just saw in the form of CSV files 00:49:31.050 --> 00:49:33.810 are what we might call flat-file databases. 00:49:33.810 --> 00:49:37.500 Again, just a very simple file, flat in that there's no hierarchy to it. 00:49:37.500 --> 00:49:39.270 It's just rows and columns. 00:49:39.270 --> 00:49:44.280 And that is all ultimately storing ASCII or Unicode text. 00:49:44.280 --> 00:49:47.400 A relational database, though, is something that's actually 00:49:47.400 --> 00:49:50.850 closer to a proper spreadsheet program. 00:49:50.850 --> 00:49:53.440 A CSV is an individual sheet, if you will, 00:49:53.440 --> 00:49:55.260 from a spreadsheet when you export it. 00:49:55.260 --> 00:49:57.460 If you had multiple sheets in a spreadsheet, 00:49:57.460 --> 00:49:59.595 you would have to export multiple CSVs. 00:49:59.595 --> 00:50:01.470 And that gets annoying quickly in code if you 00:50:01.470 --> 00:50:03.990 have to open up this CSV, this CSV, all of which 00:50:03.990 --> 00:50:07.080 represent different sheets or tabs in a proper spreadsheet. 00:50:07.080 --> 00:50:11.520 A relational database is more like a spreadsheet program 00:50:11.520 --> 00:50:14.640 that you, a programmer, now can interact with. 00:50:14.640 --> 00:50:16.140 You can write data to it. 00:50:16.140 --> 00:50:19.680 You can read data from it, and you can have multiple sheets, a.k.a., 00:50:19.680 --> 00:50:22.150 tables storing all of your data. 00:50:22.150 --> 00:50:24.240 So whereas Excel and numbers in Google spreadsheet 00:50:24.240 --> 00:50:27.090 are meant to be reused really by humans with their mouse and their keyboard, 00:50:27.090 --> 00:50:29.850 clicking, and pointing, and manipulating things graphically, 00:50:29.850 --> 00:50:32.160 a relational database using a language called 00:50:32.160 --> 00:50:37.320 SQL is one in which the programmer has similar capabilities, 00:50:37.320 --> 00:50:39.000 but doing so in code. 00:50:39.000 --> 00:50:42.720 Specifically, using a language called SQL, and at a scale 00:50:42.720 --> 00:50:45.670 that's much grander than spreadsheets alone. 00:50:45.670 --> 00:50:48.420 In fact, if you try on your Mac or PC to open a spreadsheet that's 00:50:48.420 --> 00:50:51.090 got tens of thousands of rows, it'll probably 00:50:51.090 --> 00:50:54.780 work fine, hundreds of thousands of rows, millions of rows, no way. 00:50:54.780 --> 00:50:57.000 At some point your Mac or PC is going to struggle 00:50:57.000 --> 00:50:59.130 to open particularly large data sets. 00:50:59.130 --> 00:51:01.620 And that, too, is where proper databases come 00:51:01.620 --> 00:51:04.140 into play and proper languages for databases come 00:51:04.140 --> 00:51:06.120 into play, when it's all about scale. 00:51:06.120 --> 00:51:09.390 And indeed, most any mobile app or web app today that you or someone else 00:51:09.390 --> 00:51:13.420 might write should probably plan on lots of data if it's successful. 00:51:13.420 --> 00:51:15.730 So we need the right tools for that problem. 00:51:15.730 --> 00:51:19.110 So fortunately, even though we're about to learn yet another language, 00:51:19.110 --> 00:51:24.360 it only does four things fundamentally, known by this silly acronym, CRUD. 00:51:24.360 --> 00:51:27.870 SQL, this language for databases, supports the ability 00:51:27.870 --> 00:51:32.400 to create data, read data, update data, and delete data. 00:51:32.400 --> 00:51:33.420 That's it. 00:51:33.420 --> 00:51:36.690 There's a few more keywords that exist in this language called SQL 00:51:36.690 --> 00:51:37.750 that we'll soon see. 00:51:37.750 --> 00:51:39.458 But at the end of the day, even if you're 00:51:39.458 --> 00:51:42.180 starting to feel like this is a lot very quickly, 00:51:42.180 --> 00:51:44.940 it all boils down to these four basic operations. 00:51:44.940 --> 00:51:47.640 And the four commands in SQL, if you will, 00:51:47.640 --> 00:51:51.780 functions in a sense that implement those four ideas happen to be these. 00:51:51.780 --> 00:51:54.270 They're almost the same but with some slight variance. 00:51:54.270 --> 00:51:59.280 The ability to create or insert data is the C. The ability to select data 00:51:59.280 --> 00:52:01.080 is the R, or read. 00:52:01.080 --> 00:52:02.130 Update is the same. 00:52:02.130 --> 00:52:05.048 Delete is the same, but drop is also a keyword as well. 00:52:05.048 --> 00:52:06.840 So we'll see these and a few other keywords 00:52:06.840 --> 00:52:10.410 in SQL that, at the end of the day, just allow you to create, read, and update 00:52:10.410 --> 00:52:14.310 data using verbs, if you will, like these. 00:52:14.310 --> 00:52:18.030 So to do that, what's the syntax going to be? 00:52:18.030 --> 00:52:20.290 Well, we won't get into the weeds too quickly on this. 00:52:20.290 --> 00:52:22.650 But here's a representative syntax of how 00:52:22.650 --> 00:52:25.710 you can create using this language called SQL, in your very 00:52:25.710 --> 00:52:28.020 own database, a brand new table. 00:52:28.020 --> 00:52:30.910 This is so easy in Excel, and Google Spreadsheets, and Apple Numbers. 00:52:30.910 --> 00:52:32.910 You want a new sheet, you click the plus button. 00:52:32.910 --> 00:52:33.690 You get a new tab. 00:52:33.690 --> 00:52:35.490 You give it a name, and boom, you're done. 00:52:35.490 --> 00:52:40.050 In the world of programming, though, if you want to create the analogue of that 00:52:40.050 --> 00:52:43.440 spreadsheet in the computer's memory, you create something called a table, 00:52:43.440 --> 00:52:48.420 like a sheet, that has a name, and then in parentheses has one or more columns. 00:52:48.420 --> 00:52:51.990 But unlike Google Spreadsheets, and Apple Numbers, and Excel, 00:52:51.990 --> 00:52:55.073 you have to decide as the programmer what types of data 00:52:55.073 --> 00:52:57.240 you're going to be storing in each of these columns. 00:52:57.240 --> 00:52:59.430 Now even though Excel, and Google Spreadsheets, 00:52:59.430 --> 00:53:03.310 and Numbers does allow you to format or present data in different ways, 00:53:03.310 --> 00:53:07.680 it's not strongly typed data like it is, for instance, when we were using C. 00:53:07.680 --> 00:53:10.200 And heck, even in Python there's underlying data types. 00:53:10.200 --> 00:53:12.158 Even if you don't have to type them explicitly, 00:53:12.158 --> 00:53:14.900 databases are going to want to know, are you storing integers? 00:53:14.900 --> 00:53:16.640 Are you storing real numbers or floats? 00:53:16.640 --> 00:53:17.630 Are you storing text? 00:53:17.630 --> 00:53:18.140 Why? 00:53:18.140 --> 00:53:20.960 Because especially as your data scales, the more hints 00:53:20.960 --> 00:53:24.410 you give the database about your data, the more performance it can be, 00:53:24.410 --> 00:53:27.500 the faster it can help you get at and store that data. 00:53:27.500 --> 00:53:29.302 So types are about to be important again, 00:53:29.302 --> 00:53:31.760 but there's not going to be that many of them, fortunately. 00:53:31.760 --> 00:53:34.640 Now how can I go about converting, for instance, some real data, 00:53:34.640 --> 00:53:37.490 like that from you, my favorites.csv file, 00:53:37.490 --> 00:53:39.440 into a proper relational database? 00:53:39.440 --> 00:53:42.650 Well, it turns out that using SQL I can do this 00:53:42.650 --> 00:53:45.260 in VS Code on my own Mac, or PC, or in the cloud 00:53:45.260 --> 00:53:48.455 here by just importing the CSV into a database. 00:53:48.455 --> 00:53:50.330 We'll see eventually how to do this manually. 00:53:50.330 --> 00:53:52.622 For now, I'm going to use more of an automated process. 00:53:52.622 --> 00:53:54.680 So let me go over to VS Code here. 00:53:54.680 --> 00:53:57.170 Let me type ls to see where we left off before. 00:53:57.170 --> 00:54:01.005 I had two files favorites.csv, which I downloaded from Google Spreadsheets. 00:54:01.005 --> 00:54:02.630 Recall that I made a couple of changes. 00:54:02.630 --> 00:54:06.050 We deleted a couple of Thes from the file for The Office. 00:54:06.050 --> 00:54:08.600 But this is the same file as before, and then we 00:54:08.600 --> 00:54:11.210 have favorites.py, which we'll set aside for now. 00:54:11.210 --> 00:54:14.870 I'm going to go ahead now and run a command SQLite3. 00:54:14.870 --> 00:54:18.020 So in the world of relational databases, there's 00:54:18.020 --> 00:54:23.030 many different products out there, many different software that 00:54:23.030 --> 00:54:25.370 implements the SQL language. 00:54:25.370 --> 00:54:26.600 Microsoft has their own. 00:54:26.600 --> 00:54:30.080 There's something called MySQL that's been very popular for years. 00:54:30.080 --> 00:54:32.120 Facebook, for instance, used it early on. 00:54:32.120 --> 00:54:35.030 PostgreSQL, Microsoft Access Server, Oracle, 00:54:35.030 --> 00:54:36.958 and maybe a whole bunch of other product names 00:54:36.958 --> 00:54:38.750 you might have encountered over time, which 00:54:38.750 --> 00:54:42.980 is to say there's many different types of tools, and servers, 00:54:42.980 --> 00:54:44.990 and software in which you can use SQL. 00:54:44.990 --> 00:54:47.780 We're going to use a very lightweight version of the SQL language 00:54:47.780 --> 00:54:49.370 today called SQLite. 00:54:49.370 --> 00:54:51.680 This is the version of SQL that's generally 00:54:51.680 --> 00:54:54.020 used on iPhones and Android devices these days. 00:54:54.020 --> 00:54:56.930 If you download an app that stores data like your own contacts, 00:54:56.930 --> 00:54:59.000 typically is stored using SQLite. 00:54:59.000 --> 00:55:02.710 Because it's fairly lightweight, but you can still store hundreds, 00:55:02.710 --> 00:55:05.810 thousands, even tens of thousands of pieces of data 00:55:05.810 --> 00:55:07.970 even using this lightweight version thereof. 00:55:07.970 --> 00:55:10.790 SQLite3 is like version 3 of this tool. 00:55:10.790 --> 00:55:16.340 We're going to go ahead and run SQLite3 with a file called favorites.db. 00:55:16.340 --> 00:55:20.120 It's conventional in the world of SQLite to name your file something.db. 00:55:20.120 --> 00:55:22.490 I'm going to create a database called favorites.db. 00:55:22.490 --> 00:55:27.010 Once I'm inside of the program, now I'm going to go ahead and enter CSV Mode. 00:55:27.010 --> 00:55:28.760 Again, not something you have to memorize, 00:55:28.760 --> 00:55:30.468 just something you can look up as needed. 00:55:30.468 --> 00:55:34.100 And then, I'm going to import favorites.csv 00:55:34.100 --> 00:55:40.260 into a table, that is, a sheet, if you will, called favorites as well. 00:55:40.260 --> 00:55:44.030 Now I'm going to hit Enter and I'm going to go ahead and exit the program 00:55:44.030 --> 00:55:45.650 altogether and type ls. 00:55:45.650 --> 00:55:48.170 Now I have three files in my current directory-- 00:55:48.170 --> 00:55:52.130 the CSV file, the Python file from before, and now favorites.db. 00:55:52.130 --> 00:55:56.180 But if I did this right, all of the data you all typed into the CSV file 00:55:56.180 --> 00:55:59.840 has now been loaded into a proper database where I can now use 00:55:59.840 --> 00:56:03.180 this SQL language to access it instead. 00:56:03.180 --> 00:56:07.730 So let's go ahead again and run SQLite3 of favorites.db, which now exists. 00:56:07.730 --> 00:56:10.640 And now, at the SQLite prompt I can start 00:56:10.640 --> 00:56:13.160 to play around and see what this data is. 00:56:13.160 --> 00:56:16.610 For instance, I can look, by typing .schema, 00:56:16.610 --> 00:56:19.362 at what the schema is of my data, what's the design. 00:56:19.362 --> 00:56:22.070 Now no thought was put into the design of this data at the moment 00:56:22.070 --> 00:56:23.900 because I automated the whole process. 00:56:23.900 --> 00:56:26.750 Once we start creating our own databases we'll 00:56:26.750 --> 00:56:29.750 give more thought to the data types and the columns that we have. 00:56:29.750 --> 00:56:34.220 But we can see what SQLite presumed I wanted just 00:56:34.220 --> 00:56:36.530 by importing the data by default. 00:56:36.530 --> 00:56:41.120 What the import command did for me a moment ago is essentially the syntax. 00:56:41.120 --> 00:56:44.510 It automated the process of creating a table, if it doesn't exist, 00:56:44.510 --> 00:56:45.770 called favorites. 00:56:45.770 --> 00:56:48.980 And then notice, in parentheses it gave me three columns-- 00:56:48.980 --> 00:56:53.360 timestamp, title, and genres, which were inferred, obviously, from the CSV. 00:56:53.360 --> 00:56:56.000 All three of which have been decreed to be text. 00:56:56.000 --> 00:56:59.180 Again, once we're more comfortable we'll create our own tables, 00:56:59.180 --> 00:57:01.010 choose our own types and column names. 00:57:01.010 --> 00:57:03.350 But for now, I just automated the whole process just 00:57:03.350 --> 00:57:08.120 to get us started by using this built-in import command as well. 00:57:08.120 --> 00:57:08.810 All right. 00:57:08.810 --> 00:57:11.630 So what now can I begin to do? 00:57:11.630 --> 00:57:16.910 Well, if I wanted to, for instance, start playing around with data therein, 00:57:16.910 --> 00:57:19.595 I might execute a couple of different commands. 00:57:23.000 --> 00:57:28.421 Let me find the right one here-- one of which would be select. 00:57:28.421 --> 00:57:31.610 Select being one of our most versatile tools 00:57:31.610 --> 00:57:33.180 to select data from this database. 00:57:33.180 --> 00:57:35.720 So if I have these three columns here-- timestamp, 00:57:35.720 --> 00:57:39.020 title, and genres, suppose I want to select all of the titles. 00:57:39.020 --> 00:57:43.790 Doing that earlier in Python required importing the CSV library, 00:57:43.790 --> 00:57:48.740 opening the file, creating a reader or a DictReader, iterating over every row, 00:57:48.740 --> 00:57:51.500 adding every title to a dictionary or just printing it out, 00:57:51.500 --> 00:57:52.310 and dot, dot, dot. 00:57:52.310 --> 00:57:55.170 There was a dozen or so lines of code when we first began. 00:57:55.170 --> 00:57:56.840 Now, how about this? 00:57:56.840 --> 00:58:01.220 Select title from favorites, semicolon, done. 00:58:01.220 --> 00:58:05.570 So now, with this particular language, the output is very textual 00:58:05.570 --> 00:58:08.930 and it's simulating what it looks like if it were more graphical by creating 00:58:08.930 --> 00:58:10.870 this table, so to speak. 00:58:10.870 --> 00:58:14.080 Select title from favorites is a distillation 00:58:14.080 --> 00:58:17.530 in a different language called SQL of all the lines of code 00:58:17.530 --> 00:58:20.740 I wrote early on when we first started playing with favorites.py. 00:58:20.740 --> 00:58:25.540 SQL is therefore optimized for reading, and creating, and updating, 00:58:25.540 --> 00:58:27.500 and ultimately, deleting data. 00:58:27.500 --> 00:58:30.700 So here's perhaps a better tool for the job once you have the data. 00:58:30.700 --> 00:58:34.030 Tossing it into a more powerful, versatile format 00:58:34.030 --> 00:58:37.228 might allow you now to get more work done more quickly 00:58:37.228 --> 00:58:38.770 without having to reinvent the wheel. 00:58:38.770 --> 00:58:41.510 Someone else has figured out how to select data like this. 00:58:41.510 --> 00:58:43.760 What more can I do here? 00:58:43.760 --> 00:58:47.050 Well, let me go ahead and pull up, in a moment, just a little bit 00:58:47.050 --> 00:58:49.390 of a cheat sheet here. 00:58:49.390 --> 00:58:53.840 Give me one second to find this. 00:58:53.840 --> 00:58:58.090 So suppose I want to now select data a little more powerfully. 00:58:58.090 --> 00:59:00.220 So here's what I just did in a canonical way. 00:59:00.220 --> 00:59:01.720 So select typically works like this. 00:59:01.720 --> 00:59:05.860 You select columns from a specific table, semicolon. 00:59:05.860 --> 00:59:08.260 Unfortunately, stupid semicolons are back. 00:59:08.260 --> 00:59:12.710 Select columns from table then, is the generic form of what I just did. 00:59:12.710 --> 00:59:17.122 More specifically, I selected one column called title from favorites. 00:59:17.122 --> 00:59:18.580 Favorites is the name of the table. 00:59:18.580 --> 00:59:19.690 Semicolon ends my thought. 00:59:19.690 --> 00:59:23.440 Suppose I wanted to get two things, like the genres that each of you inputted. 00:59:23.440 --> 00:59:28.420 I could instead do select title, comma, genres from favorites, 00:59:28.420 --> 00:59:30.220 and then, a semicolon, and Enter. 00:59:30.220 --> 00:59:32.110 It's going to look a little ugly on my screen 00:59:32.110 --> 00:59:33.670 because some of these titles and-- 00:59:33.670 --> 00:59:37.300 OK, one of you really went all out with Community. 00:59:37.300 --> 00:59:40.660 You can see that it's just wrapping in an ugly way, 00:59:40.660 --> 00:59:43.300 but it's just now showing me two columns. 00:59:43.300 --> 00:59:46.840 If we scroll up to the very top again, the left most of one, 00:59:46.840 --> 00:59:48.280 Black Mirror went all out, too. 00:59:48.280 --> 00:59:49.150 Thank you. 00:59:49.150 --> 00:59:52.000 And now, OK, we're going to have to clean some of these up. 00:59:52.000 --> 00:59:54.115 Game of Thrones, good comedy, yes. 00:59:57.550 --> 00:59:59.480 Keep going, keep going, keep going. 00:59:59.480 --> 01:00:02.870 So now we've selected two of the columns that we care about. 01:00:02.870 --> 01:00:03.370 There it is. 01:00:03.370 --> 01:00:06.380 OK, so it's crazy wide because of all of those genres. 01:00:06.380 --> 01:00:09.135 But it allows me to select exactly the data I want. 01:00:09.135 --> 01:00:12.010 Let's go back to the titles, though, and perhaps start playing around 01:00:12.010 --> 01:00:13.750 with some modifiers here. 01:00:13.750 --> 01:00:18.310 For instance, it turns out, using SQL there's a lot of functionality 01:00:18.310 --> 01:00:19.900 built into the language. 01:00:19.900 --> 01:00:22.843 You've got a lot of functions, similar to Excel or Google Spreadsheets 01:00:22.843 --> 01:00:24.010 where you can have formulas. 01:00:24.010 --> 01:00:26.320 SQL provides you with some of the same heuristics that 01:00:26.320 --> 01:00:30.350 allow you to apply operations like these on entire columns. 01:00:30.350 --> 01:00:32.920 For instance, you can take averages, count the total, 01:00:32.920 --> 01:00:36.010 get the distinct values, force things to lowercase, uppercase, min, 01:00:36.010 --> 01:00:37.220 and max, and so forth. 01:00:37.220 --> 01:00:39.610 So let's try distinct, for instance. 01:00:39.610 --> 01:00:43.450 Let me go back to my Terminal, and let's say, select, 01:00:43.450 --> 01:00:48.760 how about the distinct titles from the favorites table? 01:00:48.760 --> 01:00:49.513 Enter. 01:00:49.513 --> 01:00:51.430 I didn't bother selecting the genres because I 01:00:51.430 --> 01:00:52.763 want it to be a little prettier. 01:00:52.763 --> 01:00:58.090 And you can see here that we have just the distinct titles, 01:00:58.090 --> 01:01:00.547 except for issues of formatting. 01:01:00.547 --> 01:01:02.380 So whitespace is going to be an issue again. 01:01:02.380 --> 01:01:04.213 Capitalization is going to be a thing again. 01:01:04.213 --> 01:01:05.230 So there's a trade-off. 01:01:05.230 --> 01:01:09.280 One of the things I was doing in Python was forcing everything to uppercase 01:01:09.280 --> 01:01:10.930 and then getting rid of whitespace. 01:01:10.930 --> 01:01:12.400 But we could combine some of these. 01:01:12.400 --> 01:01:15.340 I could do something like force every title to uppercase, 01:01:15.340 --> 01:01:16.507 then get the distinct value. 01:01:16.507 --> 01:01:19.382 And that's actually going to get rid of some of those values as well. 01:01:19.382 --> 01:01:21.860 And again, I did it all in one simple line that was fast. 01:01:21.860 --> 01:01:24.027 So let me pull up at the bottom of the screen again. 01:01:24.027 --> 01:01:28.090 I selected distinct upper titles from favorites, 01:01:28.090 --> 01:01:31.070 and that did everything for me at once in just one breath. 01:01:31.070 --> 01:01:33.640 Suppose I want to get the total number of counts of titles. 01:01:33.640 --> 01:01:40.150 How about select count of all of those titles from favorites? 01:01:40.150 --> 01:01:43.990 Semicolon, Enter, and now you get back a mini table 01:01:43.990 --> 01:01:47.960 that contains just your answer, 158 in this case. 01:01:47.960 --> 01:01:50.560 So that's the total number of, not distinct, 01:01:50.560 --> 01:01:52.690 but total titles that we had in the file. 01:01:52.690 --> 01:01:56.560 And we could continue to manipulate the data further using, again, 01:01:56.560 --> 01:01:58.450 functions like these here. 01:01:58.450 --> 01:02:01.550 But there's also additional filtration we can do. 01:02:01.550 --> 01:02:07.010 We can also qualify our selections by saying where some condition is true. 01:02:07.010 --> 01:02:10.630 So just as in Scratch, and C, and Python, you have Boolean expressions, 01:02:10.630 --> 01:02:16.270 you can have the same in SQL as well, where I can filter my data where 01:02:16.270 --> 01:02:19.390 something is true or false. 01:02:19.390 --> 01:02:21.340 Like allows me to do approximations. 01:02:21.340 --> 01:02:23.500 If I want to get something that's like The Office 01:02:23.500 --> 01:02:26.320 but not necessarily T-H-E, space, Office, 01:02:26.320 --> 01:02:29.440 I could do pattern matching using like here. 01:02:29.440 --> 01:02:33.170 Order by, limit, and grouped by are other commands I can execute, too. 01:02:33.170 --> 01:02:35.980 So let me go back and do a couple of these here. 01:02:35.980 --> 01:02:42.610 How about, let me just get, oh, I don't know, all of the titles from favorites 01:02:42.610 --> 01:02:44.848 but limit it to 10 results. 01:02:44.848 --> 01:02:47.890 That might be one thing that's helpful to see if you just care about some 01:02:47.890 --> 01:02:50.110 of the data at the top there instead. 01:02:50.110 --> 01:02:56.530 How about, select all of the titles from favorites, where the title itself 01:02:56.530 --> 01:02:59.710 is like, quote, unquote, "Office?" 01:02:59.710 --> 01:03:02.740 And this will give me only two answers. 01:03:02.740 --> 01:03:07.180 Those are the two rows, recall, that I mutated by getting rid of the word The. 01:03:07.180 --> 01:03:11.880 Notice that like allows me too tolerate uppercase and lowercase. 01:03:11.880 --> 01:03:14.880 Because if I instead just use the equal sign, 01:03:14.880 --> 01:03:21.060 and in SQL a single equal sign does, in fact, mean equality. 01:03:21.060 --> 01:03:23.370 For comparison's sake, it's not doing assignment. 01:03:23.370 --> 01:03:26.070 This is not how you assign data in SQL. 01:03:26.070 --> 01:03:27.940 I got back no answers there. 01:03:27.940 --> 01:03:31.620 So indeed, the equal sign is giving me literal answers 01:03:31.620 --> 01:03:33.990 that searches just for what I typed in. 01:03:33.990 --> 01:03:35.370 How could I get all of these? 01:03:35.370 --> 01:03:39.090 Well, similar in spirit to regular expressions but not quite as powerful 01:03:39.090 --> 01:03:41.620 in SQL, I could do something like this. 01:03:41.620 --> 01:03:45.330 I can select the title from favorites where the title is like, quote, 01:03:45.330 --> 01:03:46.710 unquote, "Office." 01:03:46.710 --> 01:03:52.450 But I can add, a bit weirdly, percent signs to the left and the right. 01:03:52.450 --> 01:03:57.930 So the language SQL supports the same notion of pattern matching 01:03:57.930 --> 01:03:59.547 but much more limited out of the box. 01:03:59.547 --> 01:04:01.380 If we want more powerful regular expressions 01:04:01.380 --> 01:04:03.420 we probably do want to use Python instead. 01:04:03.420 --> 01:04:06.720 But the percent sign here means 0 or more characters 01:04:06.720 --> 01:04:09.340 on the left, 0 or more characters on the right. 01:04:09.340 --> 01:04:14.460 So this will just grab any title that contains O-F-F-I-C-E in it in that 01:04:14.460 --> 01:04:15.100 order. 01:04:15.100 --> 01:04:19.290 And now I get all 16, it would seem, of those results, again. 01:04:19.290 --> 01:04:20.490 How do I know it's 16? 01:04:20.490 --> 01:04:23.160 Well, I can just get the count of those titles 01:04:23.160 --> 01:04:26.140 and get back that answer instead as well. 01:04:26.140 --> 01:04:29.640 So again, it takes some getting used to, the vocabulary 01:04:29.640 --> 01:04:30.973 and the syntax that you can use. 01:04:30.973 --> 01:04:32.682 There's these building blocks and others. 01:04:32.682 --> 01:04:35.500 But SQL is really designed, again, for creating, reading, updating, 01:04:35.500 --> 01:04:36.480 and deleting data. 01:04:36.480 --> 01:04:41.080 For instance, I've never really been a fan of Friends, for instance. 01:04:41.080 --> 01:04:46.860 So right now if I do select, how about title from favorites 01:04:46.860 --> 01:04:53.340 where title like, quote, unquote, Friends with the percent signs? 01:04:53.340 --> 01:04:55.330 We can see that there's a whole bunch of them. 01:04:55.330 --> 01:04:56.520 That's how many exactly. 01:04:56.520 --> 01:04:57.960 Let's just do a quick count. 01:04:57.960 --> 01:04:59.890 So that's nine of them. 01:04:59.890 --> 01:05:03.300 Well, delete from favorites. 01:05:03.300 --> 01:05:10.500 OK, you and me, delete from favorites, where title like Friends, Enter. 01:05:10.500 --> 01:05:14.280 Nothing seems to happen, but bye-bye Friends. 01:05:14.280 --> 01:05:15.060 [APPLAUSE] 01:05:15.060 --> 01:05:16.143 DAVID J. MALAN: Thank you. 01:05:19.110 --> 01:05:21.390 So now we've actually changed the data. 01:05:21.390 --> 01:05:25.110 And this is what's compelling about a proper database. 01:05:25.110 --> 01:05:29.320 Yes, you could technically write Python code that not only reads the CSV file, 01:05:29.320 --> 01:05:30.540 but also writes it. 01:05:30.540 --> 01:05:32.820 You can change using quote, unquote, "A" for append, 01:05:32.820 --> 01:05:35.910 or quote, unquote, "W" for write, instead of quote, unquote, 01:05:35.910 --> 01:05:37.263 "R" for read alone. 01:05:37.263 --> 01:05:39.930 But it's definitely a little more involved to do that in Python. 01:05:39.930 --> 01:05:42.250 But with SQL, you can update the data in real time. 01:05:42.250 --> 01:05:45.750 And if I were actually running a web application here or a database 01:05:45.750 --> 01:05:47.880 for a mobile app, that change, theoretically, 01:05:47.880 --> 01:05:50.160 would be reflected everywhere on your own devices 01:05:50.160 --> 01:05:52.210 if you're somehow talking to this application. 01:05:52.210 --> 01:05:53.995 So that's the direction we're headed. 01:05:53.995 --> 01:05:55.620 This other thing has been bothering me. 01:05:55.620 --> 01:06:02.640 So select, how about title from favorites, where title equals, 01:06:02.640 --> 01:06:03.180 what was it? 01:06:03.180 --> 01:06:06.600 The V Office, was it? 01:06:06.600 --> 01:06:07.680 Yeah, it was that one. 01:06:07.680 --> 01:06:12.150 How about we update favorites by setting title 01:06:12.150 --> 01:06:19.080 equal to The Office, where title equals quote, unquote, "The V Office" 01:06:19.080 --> 01:06:20.470 semicolon? 01:06:20.470 --> 01:06:22.650 And now, if I select the same thing again 01:06:22.650 --> 01:06:24.810 I can go up and down with my arrow keys quickly. 01:06:24.810 --> 01:06:27.090 Now there is no The V Office. 01:06:27.090 --> 01:06:29.200 We've actually changed that value. 01:06:29.200 --> 01:06:30.300 How about genres? 01:06:30.300 --> 01:06:35.880 Select genres from favorites, where the title is title 01:06:35.880 --> 01:06:39.450 equals Game of Thrones, semicolon. 01:06:39.450 --> 01:06:43.000 These were kind of long, and I don't really agree with all of that. 01:06:43.000 --> 01:06:49.560 So how about we update favorites, set genres equal to, sure, 01:06:49.560 --> 01:06:52.650 action, adventure, sure, drama? 01:06:52.650 --> 01:06:54.390 OK, so it's a decent list. 01:06:54.390 --> 01:06:56.700 Fantasy, sure, thriller, war. 01:06:56.700 --> 01:07:01.050 OK, anything really but comedy, I would say. 01:07:01.050 --> 01:07:03.160 Let's go ahead and hit Enter now. 01:07:03.160 --> 01:07:07.800 And now, if I select genres again, same query, now we've canonicalized that. 01:07:07.800 --> 01:07:09.160 We've thrown data away. 01:07:09.160 --> 01:07:11.250 So whether or not that is right is probably 01:07:11.250 --> 01:07:13.020 a bit subjective and argumentative. 01:07:13.020 --> 01:07:16.920 But I have at least cleaned up my data, which is, again, the U in CRUD. 01:07:16.920 --> 01:07:20.670 Create, read, update, delete, you can do it that easily. 01:07:20.670 --> 01:07:22.290 Beware using delete. 01:07:22.290 --> 01:07:26.430 Beware worse using drop, whereby you can drop an entire table. 01:07:26.430 --> 01:07:29.310 But via these kinds of commands, can we actually now 01:07:29.310 --> 01:07:33.390 manipulate our data much more rapidly and with single thoughts. 01:07:33.390 --> 01:07:36.390 And in fact, if you're an aspiring statistician, or data scientist, 01:07:36.390 --> 01:07:40.320 or analyst in the real world, SQL is such a commonly used language 01:07:40.320 --> 01:07:43.573 because it allows you to really dive into data quickly, and ask 01:07:43.573 --> 01:07:45.990 questions of the data, and get back answers quite quickly. 01:07:45.990 --> 01:07:47.530 And this is a simple data set. 01:07:47.530 --> 01:07:51.840 You can do this with much larger data sets as we soon will, too. 01:07:51.840 --> 01:07:55.050 Or any questions on what we've seen of SQL thus far? 01:07:55.050 --> 01:07:57.170 Only scratched the surface, but again, it 01:07:57.170 --> 01:08:03.100 boils down to creating, reading, updating, and deleting data. 01:08:03.100 --> 01:08:04.910 Questions here? 01:08:04.910 --> 01:08:05.410 All right. 01:08:05.410 --> 01:08:07.820 Well, let's consider the design of this data. 01:08:07.820 --> 01:08:11.900 Recall that if I do .schema, that shows me the design of my table, 01:08:11.900 --> 01:08:13.990 the so-called schema of my data. 01:08:13.990 --> 01:08:15.010 This is OK. 01:08:15.010 --> 01:08:17.649 It gets the job done, and frankly, everything the user typed in 01:08:17.649 --> 01:08:21.609 was arguably text, including the timestamp, which is the date and time. 01:08:21.609 --> 01:08:24.040 But so the data set itself is somewhat simple. 01:08:24.040 --> 01:08:29.569 But if we look at the data set itself, especially genres, let's do this. 01:08:29.569 --> 01:08:31.960 Select genres from favorites. 01:08:31.960 --> 01:08:34.540 And let me point out one other thing stylistically, too. 01:08:34.540 --> 01:08:39.050 I am very deliberately capitalizing all of the special SQL keywords, 01:08:39.050 --> 01:08:42.700 and I'm lowercasing all of the column names and the table names. 01:08:42.700 --> 01:08:45.760 This is a convention, and honestly, it just helps you read, 01:08:45.760 --> 01:08:49.210 I think, the code when you're co-mingling your names for columns 01:08:49.210 --> 01:08:52.600 and tables with proper SQL keywords. 01:08:52.600 --> 01:08:58.029 But I could just as easily do select genres from favorites, 01:08:58.029 --> 01:09:01.149 but again, the SQL specific keywords don't quite jump out as much. 01:09:01.149 --> 01:09:04.240 So stylistically, we would recommend this, selecting genres 01:09:04.240 --> 01:09:06.250 from favorites, semicolon. 01:09:06.250 --> 01:09:09.370 So here is where-- 01:09:09.370 --> 01:09:11.560 oh. 01:09:11.560 --> 01:09:12.850 OK, that was not intended. 01:09:12.850 --> 01:09:15.220 I accidentally made every show, including 01:09:15.220 --> 01:09:20.410 The Office about action, adventure, drama, fantasy, thriller, and war. 01:09:20.410 --> 01:09:24.460 How did I do that accidentally? 01:09:24.460 --> 01:09:25.600 What did I do wrong? 01:09:25.600 --> 01:09:30.234 AUDIENCE: [INAUDIBLE] 01:09:30.234 --> 01:09:31.109 DAVID J. MALAN: Yeah. 01:09:31.109 --> 01:09:32.460 So beware, this is funny. 01:09:32.460 --> 01:09:34.359 I think I did say beware around this time. 01:09:34.359 --> 01:09:37.810 So the SQL database took me-- literally, I updated favorites, 01:09:37.810 --> 01:09:41.010 setting genres equal to that, semicolon, end of thought. 01:09:41.010 --> 01:09:43.500 I really wanted to say where title equals, 01:09:43.500 --> 01:09:46.020 quote, unquote, "Game of Thrones." 01:09:46.020 --> 01:09:49.080 Unfortunately, there isn't an undo command or time machine 01:09:49.080 --> 01:09:51.930 with a SQL database, so the best we can do here 01:09:51.930 --> 01:09:56.250 is, let's actually get rid of favorites.db. 01:09:56.250 --> 01:10:02.340 Let's run SQLite of favorites.db again, which now will be recreated. 01:10:02.340 --> 01:10:04.440 Let me change myself into CSV mode. 01:10:04.440 --> 01:10:09.930 Let me import, into my favorites table, the CSV file. 01:10:09.930 --> 01:10:14.250 And now, Friends is back, for better or for worse, 01:10:14.250 --> 01:10:15.510 but so are all of our genres. 01:10:18.090 --> 01:10:21.330 If I now reload the file and do select, star, from-- 01:10:21.330 --> 01:10:21.840 sorry. 01:10:21.840 --> 01:10:25.838 Select genres from favorites, that was the result I was getting. 01:10:25.838 --> 01:10:28.630 It's much messier, but that's because some of these are quite long. 01:10:28.630 --> 01:10:30.297 But now we're back to the original data. 01:10:30.297 --> 01:10:32.700 Lesson here, be sure to back up your work. 01:10:32.700 --> 01:10:33.240 All right. 01:10:33.240 --> 01:10:36.850 So what more can we now do with this data? 01:10:36.850 --> 01:10:40.480 Well, I don't love the design of the genres table for a couple of reasons. 01:10:40.480 --> 01:10:43.200 One, we didn't have any sort of validation, 01:10:43.200 --> 01:10:45.210 but user input is going to be messy. 01:10:45.210 --> 01:10:47.790 There's just a lot of redundancy in here. 01:10:47.790 --> 01:10:49.720 Let's go ahead and do this. 01:10:49.720 --> 01:10:51.760 Let me select all the comedies you all typed in. 01:10:51.760 --> 01:10:57.960 So select title from favorites, where genres equals, 01:10:57.960 --> 01:11:00.120 quote, unquote, "comedy." 01:11:00.120 --> 01:11:05.730 OK, so there's all of the shows that are explicitly comedies. 01:11:05.730 --> 01:11:08.940 But I think there might actually be others. 01:11:08.940 --> 01:11:11.350 Let me scroll back up here. 01:11:11.350 --> 01:11:12.180 Comedy, drama. 01:11:12.180 --> 01:11:14.040 What was a comedy and a drama? 01:11:14.040 --> 01:11:19.140 How about let's search for the-- oops, let me copy paste comedy, comma, drama. 01:11:19.140 --> 01:11:23.700 OK, so The Office, in this case, was considered comedy and drama, Billions, 01:11:23.700 --> 01:11:27.010 It's Always Sunny in Philadelphia, and Gilmore Girls as well. 01:11:27.010 --> 01:11:31.450 But notice that I get many more when I just search for comedy. 01:11:31.450 --> 01:11:36.450 So the catch here is that, because I have all of these genres implemented 01:11:36.450 --> 01:11:39.030 the way Google did, as a comma-separated list, 01:11:39.030 --> 01:11:43.590 it's actually really hard and messy to get at any show, all of the shows 01:11:43.590 --> 01:11:46.830 that are somewhere described as comedy. 01:11:46.830 --> 01:11:49.680 Because if I search for quote, unquote, "comedy," the only answers 01:11:49.680 --> 01:11:53.610 I'm going to get are this one, whatever that show is, this one, whatever 01:11:53.610 --> 01:11:55.240 that show is, this one. 01:11:55.240 --> 01:11:56.670 But I'm not going to get this one. 01:11:56.670 --> 01:11:58.170 I'm not going to get this one. 01:11:58.170 --> 01:11:59.070 Why? 01:11:59.070 --> 01:12:02.790 If I'm searching for, where genres equals, quote, unquote, "comedy," 01:12:02.790 --> 01:12:04.350 why am I missing those other shows? 01:12:07.202 --> 01:12:07.910 Why am I missing? 01:12:07.910 --> 01:12:08.595 Yeah? 01:12:08.595 --> 01:12:11.020 AUDIENCE: [INAUDIBLE] 01:12:11.020 --> 01:12:12.020 DAVID J. MALAN: Exactly. 01:12:12.020 --> 01:12:14.390 It's not just a comedy, it's a comedy and a drama, 01:12:14.390 --> 01:12:16.680 and a comedy or a news show, and so forth. 01:12:16.680 --> 01:12:20.510 So I have to search for these commas, so this gets messy quickly, right? 01:12:20.510 --> 01:12:22.560 Let me copy this so I can do this. 01:12:22.560 --> 01:12:26.150 Let me search for where genres equals comedy. 01:12:26.150 --> 01:12:33.230 How about, or genres equals comedy, drama, or genres 01:12:33.230 --> 01:12:36.650 equals this whole thing, comedy, news, talk show? 01:12:36.650 --> 01:12:38.370 I'm going to get more and more results. 01:12:38.370 --> 01:12:39.900 But that's not going to scale well. 01:12:39.900 --> 01:12:42.810 What could I do instead of enumerating with ors 01:12:42.810 --> 01:12:45.800 all of the different permutations of genres, do you think? 01:12:45.800 --> 01:12:49.655 AUDIENCE: [INAUDIBLE] 01:12:49.655 --> 01:12:50.530 DAVID J. MALAN: Yeah. 01:12:50.530 --> 01:12:54.430 So I could use the keyword is, similar in Python to the word in. 01:12:54.430 --> 01:12:56.980 I could use the like keyword so that so long 01:12:56.980 --> 01:13:02.080 as the genres is like comedy somewhere in there, 01:13:02.080 --> 01:13:05.900 that's going to give me all of them, so long as the word comedy is in there. 01:13:05.900 --> 01:13:09.265 But let me go ahead and just open the form from earlier. 01:13:12.250 --> 01:13:15.073 Let me see if I can open this real quick before I toggle over. 01:13:15.073 --> 01:13:16.990 If we look back at the form, recall that there 01:13:16.990 --> 01:13:22.630 were all of those radio buttons asking for the specific genres 01:13:22.630 --> 01:13:24.530 into which something fell. 01:13:24.530 --> 01:13:29.710 And if I open this, let me full screen here and now open the original form. 01:13:29.710 --> 01:13:32.920 You'll see all of the genres here, none of which 01:13:32.920 --> 01:13:38.680 are that worrisome except for a corner case is jumping out at me. 01:13:38.680 --> 01:13:43.330 Where might the like keyword alone get me into trouble? 01:13:43.330 --> 01:13:44.380 It's not with comedy. 01:13:44.380 --> 01:13:45.880 I'm OK with comedy. 01:13:45.880 --> 01:13:47.680 AUDIENCE: Music and musical? 01:13:47.680 --> 01:13:51.190 DAVID J. MALAN: Yeah, music and musical are deliberately on the list here. 01:13:51.190 --> 01:13:53.660 Because, one, they're separate genres. 01:13:53.660 --> 01:13:56.310 But if I just search for something that's like music, 01:13:56.310 --> 01:13:58.810 I'm going to accidentally suck in all of the musicals, which 01:13:58.810 --> 01:13:59.950 might not be what I intend. 01:13:59.950 --> 01:14:03.310 If music is a music video or whatever, and musical is actually 01:14:03.310 --> 01:14:06.620 a different type of show, I don't want to just do that. 01:14:06.620 --> 01:14:08.110 So it seems just very messy. 01:14:08.110 --> 01:14:11.682 I could probably hack something together with-- maybe add some commas in there, 01:14:11.682 --> 01:14:12.640 or something like this. 01:14:12.640 --> 01:14:15.520 But this is just not a good design for the data. 01:14:15.520 --> 01:14:17.890 Google has done it this way because it's just 01:14:17.890 --> 01:14:22.000 simple to actually keep the user's data all in a single column, 01:14:22.000 --> 01:14:24.520 and just as they did, separate it by commas. 01:14:24.520 --> 01:14:28.720 But this is a real messy way to use CSV is 01:14:28.720 --> 01:14:32.830 by putting comma-separated values in your comma-separated values. 01:14:32.830 --> 01:14:35.350 Arguably, the folks at Google probably just did this 01:14:35.350 --> 01:14:36.520 because it's just simpler. 01:14:36.520 --> 01:14:38.645 And they didn't want to give people multiple sheets 01:14:38.645 --> 01:14:42.220 or complicate things using some other weirder character than commas alone. 01:14:42.220 --> 01:14:44.450 But I bet there's a better way for us to do this. 01:14:44.450 --> 01:14:45.860 And let me go ahead and do this. 01:14:45.860 --> 01:14:47.978 Let me go back into my code here. 01:14:47.978 --> 01:14:50.020 And in just a moment, I'm going to grab a program 01:14:50.020 --> 01:14:54.250 that I wrote in advance that's going to use Python to open up the CSV file, 01:14:54.250 --> 01:14:59.110 iterate over all of the rows, and load the data into two tables this time, 01:14:59.110 --> 01:15:02.630 two tables, one called shows, and one called genres, 01:15:02.630 --> 01:15:05.140 so as to actually separate these two things out. 01:15:05.140 --> 01:15:07.730 Give me just a moment to grab the code. 01:15:07.730 --> 01:15:10.720 And when I run this, I'll only have to run it once. 01:15:10.720 --> 01:15:13.090 Let me go ahead and run Python in a moment, 01:15:13.090 --> 01:15:15.940 and I'll reveal the results in a sec. 01:15:15.940 --> 01:15:18.790 This is going to be version 8 of the code online. 01:15:18.790 --> 01:15:22.480 When I do this, let me go ahead and open up this file. 01:15:22.480 --> 01:15:26.260 Give me a second to move it into this directory. 01:15:26.260 --> 01:15:28.570 Version 8, OK. 01:15:28.570 --> 01:15:31.515 So here we have version 8 of this that's available online 01:15:31.515 --> 01:15:32.890 that's going to do the following. 01:15:32.890 --> 01:15:34.848 And I'll gloss over some of the details just so 01:15:34.848 --> 01:15:38.740 that we don't get stuck in the weeds of some of this code. 01:15:38.740 --> 01:15:41.380 I'm going to be using, at the top of this program, 01:15:41.380 --> 01:15:45.820 as we'll soon see, a CS50 library, not for the sake of get_string, 01:15:45.820 --> 01:15:48.730 or get_int, or get_float, but because there's some built-in SQL 01:15:48.730 --> 01:15:51.880 functionality that we didn't discuss a couple of weeks back with the CS50 01:15:51.880 --> 01:15:52.750 library itself. 01:15:52.750 --> 01:15:56.680 But inside of the CS50 library we'll see there is a special function called 01:15:56.680 --> 01:16:01.330 SQL that gives you the ability using this weird URL-like looking thing, 01:16:01.330 --> 01:16:06.040 technically called a URI, that allows me to open a file called favorites.db. 01:16:06.040 --> 01:16:08.380 And long story short, all of the subsequent code 01:16:08.380 --> 01:16:12.580 is going to iterate over this favorites.csv file that we downloaded. 01:16:12.580 --> 01:16:16.240 And it's going to import it into the SQLite database, 01:16:16.240 --> 01:16:19.430 but it's going to use two tables instead of just one. 01:16:19.430 --> 01:16:21.640 So give me just a moment to run this, and then I'll 01:16:21.640 --> 01:16:23.600 reveal the actual results. 01:16:23.600 --> 01:16:26.270 This is going to be run on favorites.csv. 01:16:31.510 --> 01:16:35.485 And taking a look here, give me just a moment. 01:16:39.310 --> 01:16:43.350 Oh, give me a sec. 01:16:43.350 --> 01:16:45.270 Come on. 01:16:45.270 --> 01:16:46.080 Come on. 01:16:46.080 --> 01:16:49.020 This program should not be taking this long. 01:16:49.020 --> 01:16:50.460 Sorry. 01:16:50.460 --> 01:16:51.840 Let's open this real fast. 01:16:54.750 --> 01:16:55.860 Whoops, not that file. 01:16:59.000 --> 01:16:59.500 OK. 01:16:59.500 --> 01:17:02.350 Let me just skim this code real quick to see where we've gone wrong. 01:17:02.350 --> 01:17:04.840 [INAUDIBLE] reader. 01:17:04.840 --> 01:17:10.390 Reader, title, show ID in certain two shows. 01:17:10.390 --> 01:17:15.370 [INAUDIBLE] genres split, DB execute. 01:17:15.370 --> 01:17:15.880 All right. 01:17:15.880 --> 01:17:17.410 This is me debugging in real time. 01:17:17.410 --> 01:17:22.843 All those times we encourage you to use print, this is me actually using print. 01:17:22.843 --> 01:17:24.760 We'll see how quickly I can recover from this. 01:17:24.760 --> 01:17:26.560 Python of favorites version 8. 01:17:29.100 --> 01:17:32.070 OK, so here's me debugging in real time. 01:17:32.070 --> 01:17:32.882 It's printing it. 01:17:32.882 --> 01:17:34.590 Oh, maybe I just didn't wait long enough. 01:17:34.590 --> 01:17:35.940 OK, so here we go. 01:17:35.940 --> 01:17:39.900 What I'm doing is printing out the dictionary that represents 01:17:39.900 --> 01:17:41.542 each row that you all typed in. 01:17:41.542 --> 01:17:43.000 And we're actually making progress. 01:17:43.000 --> 01:17:44.220 All right. 01:17:44.220 --> 01:17:46.510 I was too impatient and didn't wait long enough. 01:17:46.510 --> 01:17:47.470 So in a moment-- 01:17:47.470 --> 01:17:47.970 there we go. 01:17:47.970 --> 01:17:50.370 All right, so all we have to do sometimes is wait. 01:17:50.370 --> 01:17:54.420 Let me go ahead now and open this file using SQLite3. 01:17:54.420 --> 01:17:58.080 So in SQLite3 I now have a different version of favorites.db. 01:17:58.080 --> 01:17:59.970 I named it number 8 for consistency. 01:17:59.970 --> 01:18:03.280 Once I've run the program I can do .schema to look inside of it. 01:18:03.280 --> 01:18:07.200 And here's what the two tables in this database are going to look like. 01:18:07.200 --> 01:18:10.920 I've created a table called shows, this time to represent all of the TV shows 01:18:10.920 --> 01:18:14.410 that are favorites, that has two columns. 01:18:14.410 --> 01:18:16.812 One is called ID, one is called Title. 01:18:16.812 --> 01:18:19.020 But now I'm going to start taking out for a spin some 01:18:19.020 --> 01:18:20.640 of the other features of SQL. 01:18:20.640 --> 01:18:24.900 And besides there being text, it turns out there's a data type called integer. 01:18:24.900 --> 01:18:26.910 Besides there being a data type called text, 01:18:26.910 --> 01:18:30.150 there's also a special key phrase that you can specify 01:18:30.150 --> 01:18:31.830 that the title can never be null. 01:18:31.830 --> 01:18:35.160 Think back to our use of null in C. Think back 01:18:35.160 --> 01:18:37.260 to the keyword none in Python. 01:18:37.260 --> 01:18:40.830 This is a database constraint that allows you to ensure that none of you 01:18:40.830 --> 01:18:42.630 can't have of favorite TV show. 01:18:42.630 --> 01:18:46.320 If you submit the form, you have to have typed in a title for it 01:18:46.320 --> 01:18:48.250 to end up in our database here. 01:18:48.250 --> 01:18:50.940 And you'll notice one other new feature. 01:18:50.940 --> 01:18:53.460 It turns out, on this table I'm defining what's 01:18:53.460 --> 01:18:56.970 called a primary key, specifically to be the ID column. 01:18:56.970 --> 01:18:58.600 More on that in just a moment. 01:18:58.600 --> 01:19:02.700 Meanwhile, the second table my code has created for me, as we'll soon see, 01:19:02.700 --> 01:19:08.220 gives me a column called show ID, and then, a genre, 01:19:08.220 --> 01:19:11.140 the value of which is text that can also not be null. 01:19:11.140 --> 01:19:12.760 And then more on this in a moment. 01:19:12.760 --> 01:19:15.840 This table has what we're going to call a foreign key, 01:19:15.840 --> 01:19:20.550 specifically the show ID column that references shows ID. 01:19:20.550 --> 01:19:23.250 So before we get into the weeds of this, this 01:19:23.250 --> 01:19:27.120 is now a way of creating the relation in relational database. 01:19:27.120 --> 01:19:31.140 If I have two tables now, not just one, they can somehow 01:19:31.140 --> 01:19:34.720 be linked together by a common column. 01:19:34.720 --> 01:19:37.860 In other words, the shows column-- 01:19:37.860 --> 01:19:41.340 shows table is going to give me a table with two columns-- 01:19:41.340 --> 01:19:43.080 an ID and a title. 01:19:43.080 --> 01:19:46.800 Every title you gave me, I'm going to assign a unique value. 01:19:46.800 --> 01:19:51.600 The genre's table, meanwhile, is going to associate individual genres 01:19:51.600 --> 01:19:54.480 singular with that same idea. 01:19:54.480 --> 01:20:01.150 And the result of this, to pop back to the Terminal here, is, let's do this. 01:20:01.150 --> 01:20:05.460 Select star from shows of this new database, 01:20:05.460 --> 01:20:09.030 and you'll see that I've given, indeed, all of the shows you all typed 01:20:09.030 --> 01:20:10.663 in unique identifiers. 01:20:10.663 --> 01:20:13.830 I didn't filter out duplicates or do anything beyond just forcing everything 01:20:13.830 --> 01:20:14.430 to uppercase. 01:20:14.430 --> 01:20:16.930 So there's going to be some duplicates here because I didn't 01:20:16.930 --> 01:20:18.840 want to get rid of anyone's data. 01:20:18.840 --> 01:20:21.960 But you'll see that, indeed, I've given everyone 01:20:21.960 --> 01:20:24.420 a unique identifier, from the very first person who 01:20:24.420 --> 01:20:28.650 typed How I Met Your Mother, all the way down to input number 158. 01:20:28.650 --> 01:20:35.760 Meanwhile, if I do select star from genres, which is now a table, not just 01:20:35.760 --> 01:20:38.280 a column in the original data, now you'll 01:20:38.280 --> 01:20:43.195 see a much better design for this data. 01:20:43.195 --> 01:20:44.320 Notice what I've done here. 01:20:44.320 --> 01:20:47.370 Let me go all the way to the top and you'll see two columns, one of which 01:20:47.370 --> 01:20:51.508 is called show ID, the other of which is called genre. 01:20:51.508 --> 01:20:53.550 And again, I wrote some code to do this because I 01:20:53.550 --> 01:20:56.675 had to take Google's messy output where everything was separated by commas. 01:20:56.675 --> 01:21:00.330 I had to tear away the commas and then put each genre into this table 01:21:00.330 --> 01:21:01.770 by itself. 01:21:01.770 --> 01:21:04.200 Even though we haven't introduced the syntax via which 01:21:04.200 --> 01:21:07.230 we can reconstitute the data and reassociate 01:21:07.230 --> 01:21:10.530 your genres with your titles, why, at a glance, 01:21:10.530 --> 01:21:13.140 might this be a better design now? 01:21:13.140 --> 01:21:16.740 Even though I've doubled the number of tables from one to two, 01:21:16.740 --> 01:21:21.540 why is this probably on the direction toward a better design? 01:21:21.540 --> 01:21:23.130 What might your instincts be? 01:21:26.430 --> 01:21:27.720 Why is this cleaner? 01:21:27.720 --> 01:21:31.230 Again, first time with SQL, why is it better, perhaps, 01:21:31.230 --> 01:21:33.780 that we've done this with our genre's table? 01:21:33.780 --> 01:21:35.650 Can I come to you? 01:21:35.650 --> 01:21:36.700 Why might this be better? 01:21:36.700 --> 01:21:37.200 Yeah. 01:21:37.200 --> 01:21:41.360 Oh, just because we had the conversation before about the commas. 01:21:41.360 --> 01:21:42.470 AUDIENCE: [INAUDIBLE] 01:21:42.470 --> 01:21:42.890 DAVID J. MALAN: Exactly. 01:21:42.890 --> 01:21:43.848 It's as simple as that. 01:21:43.848 --> 01:21:49.130 We've cleaned up the data by giving every genre, every word in the genres 01:21:49.130 --> 01:21:51.260 column in the original Google Spreadsheet 01:21:51.260 --> 01:21:54.030 its own cell in this table, if you will. 01:21:54.030 --> 01:21:56.930 And now notice show ID might appear multiple times. 01:21:56.930 --> 01:22:01.530 Whoever typed in How I Met Your Mother, they only associated one genre with it. 01:22:01.530 --> 01:22:04.460 And so we see that show ID 1 is a comedy. 01:22:04.460 --> 01:22:05.645 But whoever typed in-- 01:22:05.645 --> 01:22:07.520 I forget the name of the second show offhand. 01:22:07.520 --> 01:22:11.960 But that person, whoever was assigned show ID 2 checked off 01:22:11.960 --> 01:22:13.850 a whole bunch of the genre's boxes. 01:22:13.850 --> 01:22:17.540 That happened again with show ID 3, 4. 01:22:17.540 --> 01:22:20.940 Persons 5, 6, 7 only checked one box. 01:22:20.940 --> 01:22:25.040 And so you can see now that we've associated the data with what we 01:22:25.040 --> 01:22:27.690 might call a one-to-many relationship. 01:22:27.690 --> 01:22:33.050 A one-to-many relationship, whereby for every one show in the show's table, 01:22:33.050 --> 01:22:36.800 it can now have many genres associated with it, each of which 01:22:36.800 --> 01:22:41.250 is represented by a separate row here. 01:22:41.250 --> 01:22:44.960 So again, if I go ahead and select star from shows-- 01:22:44.960 --> 01:22:48.740 let's limit it to the first 10 just to focus on a subset of the data. 01:22:48.740 --> 01:22:51.770 How I Met Your Mother, The Sopranos was the second input there. 01:22:51.770 --> 01:22:55.100 It would seem that now that I've created the data in this way, 01:22:55.100 --> 01:22:59.925 I could ideally somehow search the data, but a little more correctly. 01:22:59.925 --> 01:23:01.550 I don't have to worry about the commas. 01:23:01.550 --> 01:23:03.925 I don't have to worry about the hackish approach of music 01:23:03.925 --> 01:23:05.630 being a substring of musical. 01:23:05.630 --> 01:23:08.310 But how can I actually get back at this data? 01:23:08.310 --> 01:23:09.750 Well, let's go ahead and do this. 01:23:09.750 --> 01:23:13.700 Suppose I did want to get back maybe all of the comedies. 01:23:13.700 --> 01:23:17.120 All of the comedies, no matter whether the person checked just the comedy 01:23:17.120 --> 01:23:19.550 box or multiple boxes instead. 01:23:19.550 --> 01:23:23.330 How now, given that I have two tables, could I 01:23:23.330 --> 01:23:27.740 go about selecting only the titles of comedies? 01:23:27.740 --> 01:23:29.780 I've actually made the problem a little harder, 01:23:29.780 --> 01:23:32.720 but again, SQL is going to give me a solution for this. 01:23:32.720 --> 01:23:35.030 The problem is that if I want to search for comedies, 01:23:35.030 --> 01:23:37.820 I have to check the genres table first. 01:23:37.820 --> 01:23:39.650 And then what's that going to give me? 01:23:39.650 --> 01:23:43.580 If I search the genres table for comedies, 01:23:43.580 --> 01:23:46.610 what's that going to give me back potentially? 01:23:46.610 --> 01:23:47.360 Yeah? 01:23:47.360 --> 01:23:48.110 AUDIENCE: Show ID. 01:23:48.110 --> 01:23:49.110 DAVID J. MALAN: Maybe show ID. 01:23:49.110 --> 01:23:49.902 So let me try that. 01:23:49.902 --> 01:23:55.820 Let me do select show ID from genres, where the genre in a given row 01:23:55.820 --> 01:23:57.320 equals quote, unquote, "comedy." 01:23:57.320 --> 01:24:00.470 No commas, no like, no percent signs. 01:24:00.470 --> 01:24:04.940 Because literally, that column now is singular words, like comedy, or drama, 01:24:04.940 --> 01:24:05.720 or the like. 01:24:05.720 --> 01:24:07.670 Let me go ahead and hit Enter here. 01:24:07.670 --> 01:24:10.700 OK, so I got back a whole bunch of ID numbers. 01:24:10.700 --> 01:24:13.100 Now this could very quickly get annoying. 01:24:13.100 --> 01:24:17.930 It looks like show ID 1, 2, 4, 5, 6, 7, 9, and so forth, are all comedies. 01:24:17.930 --> 01:24:23.690 So I could do something really crazy like, select title from shows, 01:24:23.690 --> 01:24:29.390 where ID equals 1, or ID equals 2. 01:24:29.390 --> 01:24:32.360 This is not going to scale very well, but this 01:24:32.360 --> 01:24:34.610 is why SQL is especially powerful. 01:24:34.610 --> 01:24:39.420 You can actually compose one SQL question from multiple ones. 01:24:39.420 --> 01:24:40.440 So let's do this. 01:24:40.440 --> 01:24:44.480 Why don't I select the title where the ID of the show 01:24:44.480 --> 01:24:48.050 is in the following list of IDs? 01:24:48.050 --> 01:24:54.710 Select show ID from genres, where the specific genre is, quote, unquote, 01:24:54.710 --> 01:24:55.640 "comedy." 01:24:55.640 --> 01:24:58.550 So I've got two SQL queries. 01:24:58.550 --> 01:25:01.850 One is deliberately nested inside of parentheses. 01:25:01.850 --> 01:25:04.760 That's going to give me back that whole list of show IDs. 01:25:04.760 --> 01:25:07.070 But that's exactly what I want to then look up 01:25:07.070 --> 01:25:11.000 the titles for by selecting title from shows where the ID of the show 01:25:11.000 --> 01:25:13.380 is in that big, tall list. 01:25:13.380 --> 01:25:18.710 And so now if I hit Enter, I get back only those shows 01:25:18.710 --> 01:25:22.520 that were somehow flagged as comedy, whether you in the audience 01:25:22.520 --> 01:25:26.570 checked one box for comedy, two boxes, or all of the boxes. 01:25:26.570 --> 01:25:28.928 Somehow we teased out comedy, again, just 01:25:28.928 --> 01:25:31.220 by using that Python script, which loaded this data not 01:25:31.220 --> 01:25:33.800 into one big table, but instead, two. 01:25:33.800 --> 01:25:36.420 And if we want to clean this up, let's do a couple of things. 01:25:36.420 --> 01:25:40.580 Let's, outside of the parentheses, do order by title. 01:25:40.580 --> 01:25:43.550 This is a way of sorting the data in SQL very easily. 01:25:43.550 --> 01:25:48.140 Now we have a whole list of the same titles that are now sorted. 01:25:48.140 --> 01:25:52.640 And what was the keyword with which I could filter out duplicates? 01:25:52.640 --> 01:25:53.450 Yeah, distinct. 01:25:53.450 --> 01:25:54.260 So let's try this. 01:25:54.260 --> 01:25:59.600 Same query, but let's select only the distinct titles from that whole query. 01:25:59.600 --> 01:26:01.770 And notice, I've very deliberately done it this way. 01:26:01.770 --> 01:26:03.520 And to this day, any time I'm using SQL, I 01:26:03.520 --> 01:26:06.187 don't just start at the beginning and type out my whole thought, 01:26:06.187 --> 01:26:07.940 and just get it right on the first try. 01:26:07.940 --> 01:26:10.610 I very commonly start with the subquery, if you 01:26:10.610 --> 01:26:12.800 will, the thing in parentheses, just to get myself 01:26:12.800 --> 01:26:14.600 one step toward what I care about. 01:26:14.600 --> 01:26:15.332 Then I add to it. 01:26:15.332 --> 01:26:16.040 Then I add to it. 01:26:16.040 --> 01:26:18.590 Then I add to it, just like we've encouraged in Python and C, 01:26:18.590 --> 01:26:22.370 taking baby steps in order to get to the answer you actually care about, 01:26:22.370 --> 01:26:23.480 like this one now. 01:26:23.480 --> 01:26:26.060 And other than this mistake, which we didn't 01:26:26.060 --> 01:26:30.350 fix because I re-imported the data after accidentally changing everyone's genre, 01:26:30.350 --> 01:26:35.190 we now have an alphabetized list of all of the same data. 01:26:35.190 --> 01:26:40.670 But now it's better designed, because we have it split across these two tables. 01:26:40.670 --> 01:26:41.630 Oh, thank you. 01:26:41.630 --> 01:26:44.720 OK, just thanks. 01:26:44.720 --> 01:26:48.950 What questions do we have, if any here? 01:26:48.950 --> 01:26:50.480 Questions on this approach? 01:26:53.040 --> 01:26:55.316 Yeah? 01:26:55.316 --> 01:27:00.280 AUDIENCE: [INAUDIBLE] 01:27:00.280 --> 01:27:02.280 DAVID J. MALAN: Oh, now that we have a database, 01:27:02.280 --> 01:27:04.640 how do we transfer it to a CSV? 01:27:04.640 --> 01:27:06.060 There are ways to do that. 01:27:06.060 --> 01:27:08.570 And in fact, there's a command within SQLite 01:27:08.570 --> 01:27:11.590 that allows you to export your data back to a CSV file. 01:27:11.590 --> 01:27:13.340 If you want to email it to someone and you 01:27:13.340 --> 01:27:16.430 want them to be able to open it in Excel, or Google Spreadsheets, or Apple 01:27:16.430 --> 01:27:19.010 Numbers, or the like, you can go in the other direction. 01:27:19.010 --> 01:27:21.890 Generally though, once you're in the world of SQL 01:27:21.890 --> 01:27:24.620 you're probably storing your data there long term. 01:27:24.620 --> 01:27:27.590 And you're probably updating it, maybe deleting it, adding to it, 01:27:27.590 --> 01:27:28.273 and so forth. 01:27:28.273 --> 01:27:30.440 For instance, the one command I did not show earlier 01:27:30.440 --> 01:27:33.320 is, suppose someone forgot a show. 01:27:33.320 --> 01:27:35.640 Let's see, did I see this in the output? 01:27:35.640 --> 01:27:37.580 All right, so Curb Your Enthusiasm. 01:27:37.580 --> 01:27:38.540 Saw that last night. 01:27:38.540 --> 01:27:39.530 It was just, yeah. 01:27:39.530 --> 01:27:41.160 Did anyone see it last night? 01:27:41.160 --> 01:27:41.660 No? 01:27:41.660 --> 01:27:44.743 All right, well, just the one person that checked that box, so you and me. 01:27:44.743 --> 01:27:46.740 What's another show that didn't make the list? 01:27:46.740 --> 01:27:48.217 How about Seinfeld? 01:27:48.217 --> 01:27:49.550 It's now on Netflix, apparently. 01:27:49.550 --> 01:27:55.220 So insert into shows. 01:27:55.220 --> 01:27:56.390 What do we want to insert? 01:27:56.390 --> 01:27:59.910 Well, we want to insert maybe an ID and a title. 01:27:59.910 --> 01:28:02.150 But I don't actually care what the ID is, 01:28:02.150 --> 01:28:03.650 so I'm just going to insert a title. 01:28:03.650 --> 01:28:05.780 And the value I'm going to give to that title 01:28:05.780 --> 01:28:08.900 is going to be, quote, unquote, "Seinfeld." 01:28:08.900 --> 01:28:11.810 And then, let me go ahead and hit semicolon. 01:28:11.810 --> 01:28:14.360 Nothing seems to happen, but let me rerun the big query 01:28:14.360 --> 01:28:16.250 from before looking for comedies. 01:28:16.250 --> 01:28:19.850 And unfortunately, Seinfeld has not yet been flagged as a comedy, 01:28:19.850 --> 01:28:21.710 so let's get this right, too. 01:28:21.710 --> 01:28:25.370 What intuitively I'm going to have to do to associate, now, 01:28:25.370 --> 01:28:27.980 Seinfeld with my comedies? 01:28:27.980 --> 01:28:30.140 I just inserted into the show's table. 01:28:30.140 --> 01:28:33.890 What more needs to happen before we can flag Seinfeld as a comedy? 01:28:33.890 --> 01:28:35.369 AUDIENCE: [INAUDIBLE] 01:28:35.369 --> 01:28:36.452 DAVID J. MALAN: Say again? 01:28:36.452 --> 01:28:37.635 AUDIENCE: [INAUDIBLE] 01:28:37.635 --> 01:28:38.510 DAVID J. MALAN: Yeah. 01:28:38.510 --> 01:28:42.950 So I need to insert into the genres table two things now, 01:28:42.950 --> 01:28:48.180 a show ID, like this, and then, the name of the genre, 01:28:48.180 --> 01:28:49.340 which presumably is comedy. 01:28:49.340 --> 01:28:50.810 What values do I want to insert? 01:28:50.810 --> 01:28:52.795 Well, the show ID, I better grab that. 01:28:52.795 --> 01:28:54.170 Oh, I don't even know what it is. 01:28:54.170 --> 01:28:55.770 I'm going to have to figure out what that is. 01:28:55.770 --> 01:28:57.660 So I could do this in a couple of ways. 01:28:57.660 --> 01:28:59.150 Let me do a one-time thing. 01:28:59.150 --> 01:29:02.924 Select star from shows, where title equals, 01:29:02.924 --> 01:29:06.770 quote, unquote, "Seinfeld" semicolon 159. 01:29:06.770 --> 01:29:11.780 So now I could do, insert into genres a show ID 01:29:11.780 --> 01:29:20.510 and a genre name, the values 159, and, quote, unquote, "comedy" semicolon, 01:29:20.510 --> 01:29:21.260 Enter. 01:29:21.260 --> 01:29:24.710 And now, if I scroll back in my history and execute that really big query 01:29:24.710 --> 01:29:26.690 again, looking for all distinct comedies, 01:29:26.690 --> 01:29:29.100 now Seinfeld has made the list. 01:29:29.100 --> 01:29:32.637 But I did this manually so I didn't actually capitalize it. 01:29:32.637 --> 01:29:33.470 Let's clean that up. 01:29:33.470 --> 01:29:35.420 Let's do update. 01:29:35.420 --> 01:29:37.700 Let's do update my shows. 01:29:37.700 --> 01:29:43.440 Set title equals to Seinfeld semicolon. 01:29:43.440 --> 01:29:43.940 No? 01:29:43.940 --> 01:29:47.780 OK, thank you, where title equals, quote, unquote, "Seinfeld." 01:29:47.780 --> 01:29:49.520 Let's not make that mistake again. 01:29:49.520 --> 01:29:50.330 Enter. 01:29:50.330 --> 01:29:53.270 And now, if I execute that really big query, now Seinfeld is, 01:29:53.270 --> 01:29:56.480 indeed, considered a comedy. 01:29:56.480 --> 01:29:57.960 So where are we going with this? 01:29:57.960 --> 01:30:00.240 Well, thus far we've been doing all of this pretty manually. 01:30:00.240 --> 01:30:02.780 And this is absolutely what an analyst, a data scientist type person 01:30:02.780 --> 01:30:05.197 might do if just manipulating a pretty large data set just 01:30:05.197 --> 01:30:08.010 to get at interesting answers that might be across one, 01:30:08.010 --> 01:30:09.440 two, or even many more tables. 01:30:09.440 --> 01:30:12.440 Eventually, in a few weeks, we're going to start to automate all of this 01:30:12.440 --> 01:30:16.610 by writing code in Python that generates SQL to do this. 01:30:16.610 --> 01:30:19.280 If you go to most any website on the internet today, 01:30:19.280 --> 01:30:23.420 and you, for instance, log in, odds are you're typing a username and password, 01:30:23.420 --> 01:30:24.260 clicking Submit. 01:30:24.260 --> 01:30:25.430 What's then happening? 01:30:25.430 --> 01:30:28.010 Well, the website might not be implemented in Python 01:30:28.010 --> 01:30:31.610 but it's probably implemented in some language, Python, JavaScript, Java, 01:30:31.610 --> 01:30:33.110 Ruby, something else. 01:30:33.110 --> 01:30:38.330 And that language is probably using something like a relational database 01:30:38.330 --> 01:30:41.930 to use SQL to get your username, get your password, 01:30:41.930 --> 01:30:43.997 and compare the two against what you've typed in. 01:30:43.997 --> 01:30:46.580 And actually, it's hopefully not getting your actual password, 01:30:46.580 --> 01:30:48.163 but something called the hash thereof. 01:30:48.163 --> 01:30:50.360 But there's probably a database involved doing that. 01:30:50.360 --> 01:30:53.600 When you buy something on Amazon.com and you click Check Out, 01:30:53.600 --> 01:30:56.720 odds are there's some code on Amazon's server 01:30:56.720 --> 01:30:59.870 that's looking at what it is you added to your shopping cart, 01:30:59.870 --> 01:31:03.690 and then maybe using a for loop of some sort, in Python or another language. 01:31:03.690 --> 01:31:08.180 It's doing a whole bunch of SQL inserts to store in their database what 01:31:08.180 --> 01:31:09.290 it is you bought. 01:31:09.290 --> 01:31:12.260 There's other types of databases, too, but SQL databases, 01:31:12.260 --> 01:31:14.647 or relational databases are quite popular. 01:31:14.647 --> 01:31:16.730 So let's go ahead and write one other program here 01:31:16.730 --> 01:31:21.320 in Python that now merges these two languages together, whereby 01:31:21.320 --> 01:31:24.740 I'm going to use SQL inside of a Python program 01:31:24.740 --> 01:31:28.430 so I can implement my logic of my program in Python, 01:31:28.430 --> 01:31:30.050 step-by-step, line-by-line. 01:31:30.050 --> 01:31:34.260 But when I want to get at some data I can actually talk to a SQL database. 01:31:34.260 --> 01:31:37.190 So let me go ahead and open favorites.py. 01:31:39.740 --> 01:31:45.200 And let me go ahead and throw away some of what we did earlier and really 01:31:45.200 --> 01:31:47.690 just now add a SQL to the mix. 01:31:47.690 --> 01:31:51.440 From the CS50 library, let's import the SQL function. 01:31:51.440 --> 01:31:54.260 This will be useful to use because most third-party libraries 01:31:54.260 --> 01:31:57.390 that deal with SQL and Python are more complicated than they need to be. 01:31:57.390 --> 01:32:00.390 So I think you'll find this library easier to use. 01:32:00.390 --> 01:32:01.780 Let's then do the following. 01:32:01.780 --> 01:32:03.900 Create a variable called db for database. 01:32:03.900 --> 01:32:05.400 But I could call it anything I want. 01:32:05.400 --> 01:32:09.090 Let's use that you URI, which is a fancy way of saying something 01:32:09.090 --> 01:32:17.790 that looks like a URL, but that actually opens up a database on disk, that is, 01:32:17.790 --> 01:32:19.110 in the current folder. 01:32:19.110 --> 01:32:22.620 Let's now ask the user for a title by prompting them for a, quote, unquote, 01:32:22.620 --> 01:32:23.820 "title" like this. 01:32:23.820 --> 01:32:27.960 And let's strip off any whitespace just so that the data is not messy. 01:32:27.960 --> 01:32:30.730 And then, let's go ahead and do this. 01:32:30.730 --> 01:32:32.220 And this is the new logic. 01:32:32.220 --> 01:32:36.570 I'm going to go ahead now and write a line of code that uses Python 01:32:36.570 --> 01:32:39.840 to talk to the original favorites.db. 01:32:39.840 --> 01:32:43.980 So again, I'm not using the two-table database, which is in favorites8.db. 01:32:43.980 --> 01:32:47.320 I'm using the original that we imported from your own data, 01:32:47.320 --> 01:32:49.090 and I'm going to do the following. 01:32:49.090 --> 01:32:54.150 I'm going to use db.execute to execute a SQL command inside of Python. 01:32:54.150 --> 01:33:02.700 I'm going to select the count of shows from the favorites 01:33:02.700 --> 01:33:09.960 table, where the title the user typed in is like this question mark. 01:33:09.960 --> 01:33:11.910 And why I'm doing that is as follows. 01:33:11.910 --> 01:33:15.600 Just like in C, when we had percent S, in SQL for now, 01:33:15.600 --> 01:33:17.490 the analogue is going to be a question mark. 01:33:17.490 --> 01:33:18.990 So same idea, different syntax. 01:33:18.990 --> 01:33:21.180 Instead of percent S, it's just a question mark. 01:33:21.180 --> 01:33:26.280 And using a comma outside of this first string, using CS50's execute 01:33:26.280 --> 01:33:29.340 function I can pass in a SQL string, a command, 01:33:29.340 --> 01:33:33.835 then any arguments I want to plug into the question marks therein. 01:33:33.835 --> 01:33:36.210 So the goal at hand is to actually write a program that's 01:33:36.210 --> 01:33:42.420 going to search favorites.csv, a.k.a., favorites.db for the total number 01:33:42.420 --> 01:33:45.300 of people that liked a particular show. 01:33:45.300 --> 01:33:49.050 So this is going to select the count of people from the favorites table 01:33:49.050 --> 01:33:53.400 where the title they typed in is like whatever the user has just now typed 01:33:53.400 --> 01:33:53.970 in. 01:33:53.970 --> 01:33:56.730 This db execute function returns a list. 01:33:56.730 --> 01:33:57.813 It returns a list of rows. 01:33:57.813 --> 01:34:00.188 And you would only know that by my telling you or reading 01:34:00.188 --> 01:34:01.290 the documentation. 01:34:01.290 --> 01:34:04.140 And therefore, if I want to get back to the total count, 01:34:04.140 --> 01:34:08.940 I'm going to go ahead and grab the first row from those rows. 01:34:08.940 --> 01:34:11.220 Because it's only going to give me back the count. 01:34:11.220 --> 01:34:16.440 And then I'm going to go ahead and print out that row's first value. 01:34:16.440 --> 01:34:17.940 But it's going to be a little weird. 01:34:17.940 --> 01:34:21.420 Technically the column is going to be called "count" star, quote, unquote, 01:34:21.420 --> 01:34:22.420 which is a little weird. 01:34:22.420 --> 01:34:24.180 Let me add one more feature to the mix. 01:34:24.180 --> 01:34:26.280 You can actually give nicknames to columns 01:34:26.280 --> 01:34:30.090 that are coming back, especially if they are the result of functions like this. 01:34:30.090 --> 01:34:34.170 I can just call that column counter, in all lowercase. 01:34:34.170 --> 01:34:41.620 That means I can now say get back the counter key inside of this dictionary. 01:34:41.620 --> 01:34:43.360 So just to recap, what have we done? 01:34:43.360 --> 01:34:46.010 We've imported the CS50 library SQL function. 01:34:46.010 --> 01:34:49.080 We've, with this line of code, opened the favorites.db 01:34:49.080 --> 01:34:54.810 file that you and I created earlier by importing your CSV into SQLite. 01:34:54.810 --> 01:34:58.140 I'm now just asking the user for a title they want to search for. 01:34:58.140 --> 01:35:02.070 I'm now executing this SQL query on that database, 01:35:02.070 --> 01:35:05.250 plugging in whatever the human typed in as their title 01:35:05.250 --> 01:35:07.170 in order to get back a total count. 01:35:07.170 --> 01:35:10.950 And I'm giving the count a nickname, an alias of counter, 01:35:10.950 --> 01:35:13.860 just so it's more self-explanatory. 01:35:13.860 --> 01:35:18.330 This function, db execute, no matter what, always returns a list of rows, 01:35:18.330 --> 01:35:20.470 even if there's only one row inside of it. 01:35:20.470 --> 01:35:23.310 So this line of code just gives me the first and only row. 01:35:23.310 --> 01:35:27.960 And then, this goes inside of that row, which it turns out is a dictionary, 01:35:27.960 --> 01:35:33.730 and gives me the key counter and the value it corresponds to. 01:35:33.730 --> 01:35:35.340 So what, to be clear, is this doing? 01:35:35.340 --> 01:35:38.220 Let's go ahead and run this manually in my Terminal window first. 01:35:38.220 --> 01:35:41.970 Let me run SQLite3 on favorites-- 01:35:41.970 --> 01:35:43.380 Well, let's do this. 01:35:43.380 --> 01:35:47.410 On favorites.db, let me import the data again. 01:35:47.410 --> 01:35:54.910 So mode csv.import in from favorites.csv into a favorites table. 01:35:54.910 --> 01:35:57.330 So I've just recreated the same data set that you all 01:35:57.330 --> 01:35:59.782 gave me earlier in favorites.db. 01:35:59.782 --> 01:36:02.490 If I were to do this manually, let's search for The Office again. 01:36:02.490 --> 01:36:09.210 Select, count star from favorites, where title like, and let's 01:36:09.210 --> 01:36:12.270 just manually type it in for now, The Office. 01:36:12.270 --> 01:36:15.330 We'll search for the one with the word The, semicolon. 01:36:15.330 --> 01:36:16.560 I get back 12. 01:36:16.560 --> 01:36:18.780 But technically, notice what I get back. 01:36:18.780 --> 01:36:25.080 I technically get back a miniature table containing one column and one row. 01:36:25.080 --> 01:36:27.090 What if I want to rename that column? 01:36:27.090 --> 01:36:28.840 That's where the as keyword comes in. 01:36:28.840 --> 01:36:31.080 So select count star as counter. 01:36:31.080 --> 01:36:32.910 Notice what happens, Enter. 01:36:32.910 --> 01:36:35.883 I just get back-- same simple table, but I've 01:36:35.883 --> 01:36:38.550 renamed the column to be counter just because it's a little more 01:36:38.550 --> 01:36:40.410 self-explanatory as to what it is. 01:36:40.410 --> 01:36:43.200 So what am I doing with this line of code? 01:36:43.200 --> 01:36:47.430 This line of code is returning to me that miniature temporary table 01:36:47.430 --> 01:36:50.820 in the form of a list of dictionaries. 01:36:50.820 --> 01:36:55.620 The list contains one row, as we'll see, and it 01:36:55.620 --> 01:37:00.800 contains one column, as we'll see, the key for which is counter. 01:37:00.800 --> 01:37:02.540 So let's now run the code itself. 01:37:02.540 --> 01:37:07.280 I'm going to get out of SQLite3 and I'm going to run Python of favorites.py. 01:37:07.280 --> 01:37:08.120 Enter. 01:37:08.120 --> 01:37:09.540 I'm being prompted for a title. 01:37:09.540 --> 01:37:13.790 I'm going to type in The Office and cross my fingers, and there's that 12. 01:37:13.790 --> 01:37:14.810 Why is it 12? 01:37:14.810 --> 01:37:17.450 Well, there's a typo again because I re-imported the CSV. 01:37:17.450 --> 01:37:21.270 I had deleted two of the Thes, so we're back at the original data set. 01:37:21.270 --> 01:37:25.760 So there's 12 total that have, quote, unquote, "The Office" 01:37:25.760 --> 01:37:27.780 in the title like that. 01:37:27.780 --> 01:37:28.860 So what have we done? 01:37:28.860 --> 01:37:31.820 We've combined some Python with some SQL, 01:37:31.820 --> 01:37:34.790 but we've relegated all of the complexity of searching for something, 01:37:34.790 --> 01:37:36.800 the selecting of something, gotten rid of all 01:37:36.800 --> 01:37:39.560 of the with keyword, the open keyword, the for loop, 01:37:39.560 --> 01:37:41.600 the reader the DictReader, and all of that. 01:37:41.600 --> 01:37:46.460 And it's just one line of SQL now, using the best of both worlds. 01:37:46.460 --> 01:37:51.720 All right, any questions on what we've just done here or how any of this 01:37:51.720 --> 01:37:52.220 works? 01:37:54.870 --> 01:37:56.210 Any questions here? 01:37:56.210 --> 01:37:57.310 Yeah? 01:37:57.310 --> 01:38:01.070 AUDIENCE: [INAUDIBLE] 01:38:01.070 --> 01:38:04.413 DAVID J. MALAN: When does this function return more than one row? 01:38:04.413 --> 01:38:05.330 Was that the question? 01:38:05.330 --> 01:38:05.760 AUDIENCE: Yeah. 01:38:05.760 --> 01:38:06.030 DAVID J. MALAN: Yeah. 01:38:06.030 --> 01:38:08.240 So let's do that by changing the problem at hand. 01:38:08.240 --> 01:38:10.970 This program was designed just to select the total count. 01:38:10.970 --> 01:38:16.070 Let's go ahead and select, for instance, all 01:38:16.070 --> 01:38:20.705 of the ways you all typed in The Office by selecting the title this time. 01:38:24.110 --> 01:38:28.470 If I do this in SQLite3, let me go ahead and do this again 01:38:28.470 --> 01:38:29.970 after increasing my Terminal window. 01:38:29.970 --> 01:38:30.920 Let's do it manually. 01:38:30.920 --> 01:38:35.570 Select title from favorites, where the title is like, 01:38:35.570 --> 01:38:38.835 quote, unquote, "The Office," semicolon. 01:38:38.835 --> 01:38:41.960 I get back all of these different rows, and we didn't even notice this one. 01:38:41.960 --> 01:38:43.910 There's actually another little typo in there 01:38:43.910 --> 01:38:47.630 with some capitalization of the E, and the C, and the E. That 01:38:47.630 --> 01:38:50.840 would be an example of a query that gives me back therefore 01:38:50.840 --> 01:38:52.080 for multiple rows. 01:38:52.080 --> 01:38:53.990 So let's now change my Python program. 01:38:53.990 --> 01:38:59.540 If I now, in my Python program, do this, I get back a whole bunch of rows 01:38:59.540 --> 01:39:01.110 containing all of those titles. 01:39:01.110 --> 01:39:06.350 I can now do, for row in rows, I can print out the current row's title, 01:39:06.350 --> 01:39:09.390 and now manipulate all of those things together. 01:39:09.390 --> 01:39:10.760 Let me keep both on the screen. 01:39:10.760 --> 01:39:12.320 Let me run Python of favorites.py. 01:39:12.320 --> 01:39:16.320 And that for loop now should iterate, what, 10 or more times, 01:39:16.320 --> 01:39:18.020 once for each of those titles. 01:39:18.020 --> 01:39:22.010 And indeed, if I type in The Office again, Enter. 01:39:22.010 --> 01:39:24.890 Whoops. 01:39:24.890 --> 01:39:25.670 Row title. 01:39:25.670 --> 01:39:26.750 What did I do wrong? 01:39:26.750 --> 01:39:30.052 Oh, I should not be renaming title to counter this time. 01:39:30.052 --> 01:39:31.760 So that's just a dumb mistake on my part. 01:39:31.760 --> 01:39:33.410 Let me rerun it again. 01:39:33.410 --> 01:39:36.230 And now I should see after typing in The Office, 01:39:36.230 --> 01:39:38.420 Enter, a whole bunch of The Offices. 01:39:38.420 --> 01:39:40.520 And because I'm using like, even the missed 01:39:40.520 --> 01:39:43.550 capitalizations are coming through, because like is case insensitive. 01:39:43.550 --> 01:39:45.890 It doesn't matter if it's uppercase or lowercase. 01:39:45.890 --> 01:39:50.300 Whereas had I used the equal sign I would get back only the same ones 01:39:50.300 --> 01:39:51.935 capitalized correctly. 01:39:51.935 --> 01:39:55.250 All right, any questions on this next? 01:39:55.250 --> 01:39:59.700 All right, so let's transition to a larger, juicier data 01:39:59.700 --> 01:40:01.610 set, and consider some of the issues that 01:40:01.610 --> 01:40:05.870 arise when actually now using SQL and skating toward a world in which we're 01:40:05.870 --> 01:40:08.970 using SQL for mobile apps, web apps, and generally speaking, 01:40:08.970 --> 01:40:10.800 very large data sets. 01:40:10.800 --> 01:40:14.090 So let's start with a larger data set just like that. 01:40:14.090 --> 01:40:19.800 Give me just a moment to switch screens over to what we have for you today, 01:40:19.800 --> 01:40:22.970 which is an actual relational database that we've created out 01:40:22.970 --> 01:40:26.540 of a real-world data set from IMDb. 01:40:26.540 --> 01:40:29.210 So InternetMovieDatabase.com is a website 01:40:29.210 --> 01:40:31.790 where you can search for TV shows, and movies, and actors, 01:40:31.790 --> 01:40:34.880 and so forth, all using their database behind the scenes. 01:40:34.880 --> 01:40:39.530 IMDb wonderfully makes their data set available as not CSV files, 01:40:39.530 --> 01:40:42.960 but TSV files, tab-separated values. 01:40:42.960 --> 01:40:46.460 And so what we did is, before class we downloaded those TSV files. 01:40:46.460 --> 01:40:50.300 We wrote a Python program similar to my favorites8.py file 01:40:50.300 --> 01:40:53.750 earlier that read in all of those TSV files, 01:40:53.750 --> 01:40:58.820 created some SQL tables in an IMDb database 01:40:58.820 --> 01:41:03.270 for you in SQLite that has multiple tables and multiple columns. 01:41:03.270 --> 01:41:07.190 So let's go and wrap our minds around what's actually in this data set. 01:41:07.190 --> 01:41:10.940 Let me go back to VS Code here, and in just a moment, 01:41:10.940 --> 01:41:15.260 I'm going to go ahead and copy the file, which we've named shows.db. 01:41:15.260 --> 01:41:20.510 And I'm going to go ahead and increase my Terminal and do SQLite3 of shows.db. 01:41:20.510 --> 01:41:23.570 Whenever playing around with a SQLite database for the first time, 01:41:23.570 --> 01:41:26.480 typing .schema is perhaps a good place to start to give you a sense 01:41:26.480 --> 01:41:27.410 of what's in there. 01:41:27.410 --> 01:41:29.120 And things just escalated quickly. 01:41:29.120 --> 01:41:31.640 There's a lot in this data set, because, indeed, there's 01:41:31.640 --> 01:41:35.750 going to be tens of hundreds of thousands of rows in this data set, 01:41:35.750 --> 01:41:39.230 and also problem set 7, where we'll look at the movie side of things 01:41:39.230 --> 01:41:40.920 and not just the TV shows. 01:41:40.920 --> 01:41:43.940 So what is the schema that we have created for you 01:41:43.940 --> 01:41:47.150 from IMDb's actual real-world data? 01:41:47.150 --> 01:41:48.950 One, there's a table called shows. 01:41:48.950 --> 01:41:51.950 And notice we've just added whitespace by hitting Enter a bunch of times 01:41:51.950 --> 01:41:54.050 to make it a little more stylistically readable. 01:41:54.050 --> 01:41:57.980 The shows table has an ID column, a title column, a year, 01:41:57.980 --> 01:42:00.740 and the total number of episodes for a given show. 01:42:00.740 --> 01:42:05.750 And the types of those columns are integer, text, numeric, and integer. 01:42:05.750 --> 01:42:08.090 So it turns out there's actually a few different data 01:42:08.090 --> 01:42:13.850 types that are worth being aware of when it comes to creating tables themselves. 01:42:13.850 --> 01:42:18.170 In fact, in SQLite there's five data types, and only five, 01:42:18.170 --> 01:42:21.650 fortunately, one of which is, indeed, integer, negative or positive, 01:42:21.650 --> 01:42:25.010 numeric, which is kind of a catchall for dates and times, 01:42:25.010 --> 01:42:27.320 things that are numeric but are not just integers, 01:42:27.320 --> 01:42:29.510 and not just real numbers, for instance. 01:42:29.510 --> 01:42:33.020 Real number is what we've generally thought of as float up until now. 01:42:33.020 --> 01:42:35.058 Text, of course, is just text, but notice 01:42:35.058 --> 01:42:37.100 that you don't have to worry about how big it is. 01:42:37.100 --> 01:42:39.110 Like in Python, it will size to fit. 01:42:39.110 --> 01:42:41.840 And then there's BLOB, which is binary large object, which 01:42:41.840 --> 01:42:45.300 is for just raw 0s and 1s, like for files or things like that. 01:42:45.300 --> 01:42:47.570 But we'll generally use the other four of these. 01:42:47.570 --> 01:42:50.960 And so, indeed, when we imported this data for you 01:42:50.960 --> 01:42:56.270 we decided that every show would be given an ID, which is just an integer. 01:42:56.270 --> 01:42:59.460 Every show has, of course, a title, which should not be null. 01:42:59.460 --> 01:43:01.320 Otherwise, why is it in the database? 01:43:01.320 --> 01:43:04.830 Every show has a year, which is numeric according 01:43:04.830 --> 01:43:06.180 to that definition a moment ago. 01:43:06.180 --> 01:43:09.540 And the total number of episodes for a show is going to be an integer. 01:43:09.540 --> 01:43:13.110 What now is with these primary keys that we mentioned earlier, too? 01:43:13.110 --> 01:43:18.090 A primary key is the column that uniquely identifies all of the data. 01:43:18.090 --> 01:43:21.120 In our case, with the favorites, I automatically 01:43:21.120 --> 01:43:24.750 gave each of your submissions a unique ID so that even if two or more of you 01:43:24.750 --> 01:43:27.360 typed in The Office, your submission still 01:43:27.360 --> 01:43:32.310 had a unique identifier, a number that allowed me to then correlate it 01:43:32.310 --> 01:43:36.330 with your genres, just as we saw a moment ago. 01:43:36.330 --> 01:43:39.280 In this version of IMDb, there's also genres. 01:43:39.280 --> 01:43:42.120 But they don't come from us, they come from IMDb.com. 01:43:42.120 --> 01:43:46.320 And so a genre has a show ID, and a genre just like our database. 01:43:46.320 --> 01:43:49.890 But these are real-world genres with a bit more filtration. 01:43:49.890 --> 01:43:54.520 Notice, though, just like my version, there's a foreign key. 01:43:54.520 --> 01:43:59.970 A foreign key is the appearance of another table's primary key 01:43:59.970 --> 01:44:02.050 in its own table. 01:44:02.050 --> 01:44:05.100 So when you have a table like genres, which is somehow 01:44:05.100 --> 01:44:10.920 cross referencing the original shows table, if shows have a primary key 01:44:10.920 --> 01:44:15.270 called ID, and those same numbers appear in the genres table 01:44:15.270 --> 01:44:20.250 under the column called show ID, by definition, show ID is a foreign key. 01:44:20.250 --> 01:44:22.650 It's the same numbers but it's foreign in the sense 01:44:22.650 --> 01:44:25.260 that the number is being used in this table, 01:44:25.260 --> 01:44:29.130 even though it's officially defined primarily in this other table. 01:44:29.130 --> 01:44:31.770 This is what we mean by relational databases. 01:44:31.770 --> 01:44:37.170 You have multiple tables with some column in common, numbers typically. 01:44:37.170 --> 01:44:40.860 And those numbers allow you to line the two tables up in such a way 01:44:40.860 --> 01:44:44.040 that you can reconnect the shows with their genres, 01:44:44.040 --> 01:44:46.800 just like we did with our smaller data set a moment ago. 01:44:46.800 --> 01:44:49.050 This logic is extended further. 01:44:49.050 --> 01:44:53.070 Notice that the IMDb database we've created for you has a stars table, 01:44:53.070 --> 01:44:56.730 like TV show stars, the actors therein. 01:44:56.730 --> 01:45:00.210 And that table, interestingly, has no mention of people 01:45:00.210 --> 01:45:02.220 and no mention of shows, per se. 01:45:02.220 --> 01:45:05.730 It only has a column called show ID, which is an integer, 01:45:05.730 --> 01:45:08.220 and a person ID, which is an integer. 01:45:08.220 --> 01:45:14.320 Meanwhile, if we scrolled down to the bottom, 01:45:14.320 --> 01:45:17.490 you will see a table called people. 01:45:17.490 --> 01:45:23.010 And we have decided in IMDb's world that every person in the TV show world 01:45:23.010 --> 01:45:27.510 will have a unique identifier that's a number, a name that's text, a birth 01:45:27.510 --> 01:45:31.500 date, which is numeric, and then, again, specifying that ID 01:45:31.500 --> 01:45:35.350 is going to be their primary key. 01:45:35.350 --> 01:45:36.940 So what's going on here? 01:45:36.940 --> 01:45:42.640 Well, it turns out that TV stars and writers are both types of people. 01:45:42.640 --> 01:45:47.730 So using this relational database, notice the road we're going down. 01:45:47.730 --> 01:45:49.770 We're factoring out commonalities. 01:45:49.770 --> 01:45:52.570 And if a person can be different things in life, 01:45:52.570 --> 01:45:55.260 well, we're defining them first and foremost as people. 01:45:55.260 --> 01:45:58.150 And then, notice these two tables are almost the same. 01:45:58.150 --> 01:46:00.660 The stars table has a show ID, which is a number, 01:46:00.660 --> 01:46:02.670 and a person ID, which is a number, which 01:46:02.670 --> 01:46:10.710 allows us via this middleman table, if you will, to link people with TV shows. 01:46:10.710 --> 01:46:16.080 Similarly, the writers table allows us to connect shows with people, too, 01:46:16.080 --> 01:46:18.220 by just recording those numbers. 01:46:18.220 --> 01:46:20.980 So if we go into this data set, let's do the following. 01:46:20.980 --> 01:46:24.340 Let's do select star from people semicolon. 01:46:24.340 --> 01:46:27.030 So a huge amount of data is coming back. 01:46:27.030 --> 01:46:31.480 This is hundreds of thousands of rows now based on the ID numbers alone. 01:46:31.480 --> 01:46:34.320 So this is real-world data now flying across the screen. 01:46:34.320 --> 01:46:38.160 There's a lot of people in the TV show business, not just actors and writers, 01:46:38.160 --> 01:46:39.780 but others as well. 01:46:39.780 --> 01:46:40.680 It's still going. 01:46:40.680 --> 01:46:41.977 There's a lot of data there. 01:46:41.977 --> 01:46:44.310 So my god, if you had to do anything manual in this data 01:46:44.310 --> 01:46:46.660 set it's probably not going to work out very well. 01:46:46.660 --> 01:46:49.590 And actually, we're up to, what, a million people in this data 01:46:49.590 --> 01:46:51.990 set, plus, which would mean this probably isn't even 01:46:51.990 --> 01:46:55.590 going to open very well in Excel, or Google Spreadsheets, or Apple Numbers. 01:46:55.590 --> 01:46:57.700 SQL probably is the better approach here. 01:46:57.700 --> 01:47:00.360 Let's search for someone specific, like select star 01:47:00.360 --> 01:47:06.060 from people, where name equals Steve Carell, for instance, sticking 01:47:06.060 --> 01:47:06.702 with comedies. 01:47:06.702 --> 01:47:08.160 All right, so there's Steve Carell. 01:47:08.160 --> 01:47:14.100 He is person number 136,797, born in 1962. 01:47:14.100 --> 01:47:16.540 And that's as much data as we have on Steve Carell here. 01:47:16.540 --> 01:47:19.210 How do we figure out what shows, for instance, he's in? 01:47:19.210 --> 01:47:23.500 Well, let's see, select star from shows, semicolon. 01:47:23.500 --> 01:47:27.180 There's a crazy number of shows out there in the IMDb database. 01:47:27.180 --> 01:47:30.150 And you can see it here again flying across the screen. 01:47:30.150 --> 01:47:33.630 Feels like we're going to have to employ some techniques in order 01:47:33.630 --> 01:47:37.060 to get at all of Steve Carell's shows. 01:47:37.060 --> 01:47:39.220 So how are we going to do that? 01:47:39.220 --> 01:47:41.730 Well, god, this is a lot of data here. 01:47:41.730 --> 01:47:45.120 And in fact, yeah, we have, what, 15 million 01:47:45.120 --> 01:47:47.280 shows plus in this data set, too. 01:47:47.280 --> 01:47:50.340 So doing things efficiently is now going to start to matter. 01:47:50.340 --> 01:47:51.790 So let's actually do this. 01:47:51.790 --> 01:47:53.460 Let me select a specific show. 01:47:53.460 --> 01:47:58.590 Select star from shows where title equals, quote, unquote, "The Office." 01:47:58.590 --> 01:48:00.920 And there presumably shouldn't be typos in this data 01:48:00.920 --> 01:48:03.470 because it comes from the real website IMDb.com. 01:48:03.470 --> 01:48:05.000 Let's get back to show. 01:48:05.000 --> 01:48:08.210 Turns out there's been a lot of The Offices out in the world. 01:48:08.210 --> 01:48:12.170 The one that started in 2005 is the one that we want, 01:48:12.170 --> 01:48:14.990 presumably the most popular with 188 episodes. 01:48:14.990 --> 01:48:16.220 How can we get just that? 01:48:16.220 --> 01:48:21.650 Maybe we could do and year equals, how about 2005? 01:48:21.650 --> 01:48:25.250 All right, so now we've got back just the ID of The Office 01:48:25.250 --> 01:48:26.210 that we care about. 01:48:26.210 --> 01:48:27.230 And let's do this, too. 01:48:27.230 --> 01:48:30.110 Let me turn on a timer within SQLite just 01:48:30.110 --> 01:48:31.790 to get a sense of running time now. 01:48:31.790 --> 01:48:33.110 Let me do that again. 01:48:33.110 --> 01:48:36.050 Select star from shows, where title equals The Office, 01:48:36.050 --> 01:48:37.808 and year equals 2005. 01:48:37.808 --> 01:48:38.850 And let's keep it simple. 01:48:38.850 --> 01:48:40.370 Let's just do titles for now. 01:48:40.370 --> 01:48:41.450 Enter. 01:48:41.450 --> 01:48:43.460 All right, so not terribly long. 01:48:43.460 --> 01:48:46.990 It found it pretty fast, but it looks like it took how much real time? 01:48:46.990 --> 01:48:50.010 0.02 seconds, not bad for just a title. 01:48:50.010 --> 01:48:53.210 But just to plant a seed, it turns out that we can probably 01:48:53.210 --> 01:48:54.200 speed even this up. 01:48:54.200 --> 01:48:54.930 Let me do this. 01:48:54.930 --> 01:48:58.220 Let me create something called an index, which is another use of the C 01:48:58.220 --> 01:49:00.170 in CRUD for creating something. 01:49:00.170 --> 01:49:02.810 And I'm going to call this title index. 01:49:02.810 --> 01:49:07.340 And I'm going to create it on the shows table, 01:49:07.340 --> 01:49:09.273 specifically on the title column. 01:49:09.273 --> 01:49:11.690 And we'll see in a moment what this is going to do for me. 01:49:11.690 --> 01:49:12.920 Enter. 01:49:12.920 --> 01:49:17.130 Took a moment, like 0.349 seconds, to create something called an index. 01:49:17.130 --> 01:49:21.590 But now watch, if I select star from shows searching for The Office again, 01:49:21.590 --> 01:49:24.230 previously it took me 0.021 seconds. 01:49:24.230 --> 01:49:27.680 Not bad, but now, wow. 01:49:27.680 --> 01:49:31.190 Literally no time at all, or so low that it wasn't really measurable. 01:49:31.190 --> 01:49:33.470 And I'll do it again just to get a sense of things. 01:49:33.470 --> 01:49:34.860 Still quite low. 01:49:34.860 --> 01:49:40.020 Now even though 0.021 seconds, not crazy long, imagine now having a lot of data, 01:49:40.020 --> 01:49:42.560 a lot of users running a real website or real mobile app. 01:49:42.560 --> 01:49:45.980 Every millisecond we can start to shave off is going to be compelling. 01:49:45.980 --> 01:49:47.810 So what is it we just did? 01:49:47.810 --> 01:49:51.830 Well, we actually just created something called an index. 01:49:51.830 --> 01:49:53.960 And this is a nice way to tie in, now, some 01:49:53.960 --> 01:49:56.420 of our week 5 discussion of data structures, 01:49:56.420 --> 01:49:58.250 and our week 3 discussion of running times. 01:49:58.250 --> 01:50:01.280 An index in a database is some kind of fancy data 01:50:01.280 --> 01:50:06.110 structure that allows the database to do better than linear search. 01:50:06.110 --> 01:50:10.580 Literally, as you just saw, these tables are crazy long or tall right now, 01:50:10.580 --> 01:50:12.150 very linear, that is. 01:50:12.150 --> 01:50:14.300 And so when I first searched for The Office, 01:50:14.300 --> 01:50:17.780 it was literally doing linear search, top to bottom, looking at as many as, 01:50:17.780 --> 01:50:19.400 what, a million plus rows. 01:50:19.400 --> 01:50:20.870 That's relatively slow. 01:50:20.870 --> 01:50:23.330 It's not that slow, 0.021 seconds. 01:50:23.330 --> 01:50:27.020 But that's relatively slow just theoretically, algorithmically, 01:50:27.020 --> 01:50:28.580 doing anything linearly. 01:50:28.580 --> 01:50:32.090 But if you instead create an index using syntax 01:50:32.090 --> 01:50:37.880 like this, which I just did, creating an index on the title column of the show's 01:50:37.880 --> 01:50:41.220 table, that's like giving the database a clue in advance saying, 01:50:41.220 --> 01:50:44.660 hey, I know I'm going to search on this column in this table a lot. 01:50:44.660 --> 01:50:47.430 Do something with data structures to speed things up. 01:50:47.430 --> 01:50:50.030 And so if you think back to our discussion of data structures, 01:50:50.030 --> 01:50:51.720 maybe it's using a tree. 01:50:51.720 --> 01:50:56.060 Maybe it's using a trie or a hash table, some fancier two-dimensional data 01:50:56.060 --> 01:50:59.930 structure is generally going to lift the data up creating right maybe a tree 01:50:59.930 --> 01:51:00.810 structure. 01:51:00.810 --> 01:51:03.620 So it's just much faster to find data, especially 01:51:03.620 --> 01:51:06.500 if it's sorting it now based on title, and not 01:51:06.500 --> 01:51:08.030 just storing it in one long list. 01:51:08.030 --> 01:51:10.220 And in fact, in the world of relational databases, 01:51:10.220 --> 01:51:12.560 the type of structure that's often used in a database 01:51:12.560 --> 01:51:13.920 is something called a B-tree. 01:51:13.920 --> 01:51:15.170 It's not a binary tree. 01:51:15.170 --> 01:51:19.040 Different use of the letter B, but it looks a little something like the trees 01:51:19.040 --> 01:51:19.820 we've seen. 01:51:19.820 --> 01:51:21.680 It's not binary because some of the nodes 01:51:21.680 --> 01:51:24.890 might have more than two children or fewer, 01:51:24.890 --> 01:51:28.190 but it's a very wide but relatively shallow tree. 01:51:28.190 --> 01:51:29.990 It's not very tall. 01:51:29.990 --> 01:51:33.770 And the upside of that is that if your data is stored in this tree, 01:51:33.770 --> 01:51:36.060 the database can find it more quickly. 01:51:36.060 --> 01:51:41.270 And the reason it took half a second, a third of a second to build the index 01:51:41.270 --> 01:51:45.260 is because SQLite needed to take some non-zero amount of time 01:51:45.260 --> 01:51:47.630 to just build up this tree in memory. 01:51:47.630 --> 01:51:51.900 And it has algorithms for doing so based on alphabetization or other techniques. 01:51:51.900 --> 01:51:55.340 But you spend a bit of time up front, a third of a second. 01:51:55.340 --> 01:51:57.470 And then thereafter, wow. 01:51:57.470 --> 01:52:00.470 Every subsequent query, if I keep doing it again and again, 01:52:00.470 --> 01:52:04.040 is going to be crazy low, 0.000, maybe 0.001. 01:52:04.040 --> 01:52:08.240 But an order of magnitude, a factor of 10 or 100 faster than it 01:52:08.240 --> 01:52:10.790 previously was earlier. 01:52:10.790 --> 01:52:14.360 So we have these indexes which allow us to get at data faster. 01:52:14.360 --> 01:52:17.330 But what if we want to actually get data that's 01:52:17.330 --> 01:52:19.370 now across these multiple tables? 01:52:19.370 --> 01:52:20.240 How can we do that? 01:52:20.240 --> 01:52:23.060 And how might these indices or indexes help further? 01:52:23.060 --> 01:52:26.900 Well, it turns out there is a way that we've seen already 01:52:26.900 --> 01:52:29.510 indirectly to join two tables together. 01:52:29.510 --> 01:52:33.410 Previously, when I selected the ID of The Office, 01:52:33.410 --> 01:52:37.740 and then I searched for it in the other table using select in a nested query, 01:52:37.740 --> 01:52:40.410 I was joining two tables together. 01:52:40.410 --> 01:52:42.900 And it turns out there's a couple of ways to do this. 01:52:42.900 --> 01:52:46.550 Let's go ahead now and, for instance, find all of Steve Carell's TV shows. 01:52:46.550 --> 01:52:48.680 Not just The Office but all of them, too. 01:52:48.680 --> 01:52:56.315 Unfortunately, if we look at our schema, shows up here have no mention of TV-- 01:52:56.315 --> 01:53:01.860 oh, shows over here has no mention of the TV stars in them. 01:53:01.860 --> 01:53:05.130 And people have no mention of shows. 01:53:05.130 --> 01:53:09.360 We somehow need to use this table here to connect the two. 01:53:09.360 --> 01:53:14.820 And this is called a join table, in the sense that using two integer columns-- 01:53:14.820 --> 01:53:17.790 it joins the two tables together logically. 01:53:17.790 --> 01:53:21.750 And so if you're savvy enough with SQL, you can do what I did with my hands 01:53:21.750 --> 01:53:26.010 earlier and like recombine tables by using these common IDs, 01:53:26.010 --> 01:53:27.460 these integers together. 01:53:27.460 --> 01:53:28.330 So let me do this. 01:53:28.330 --> 01:53:32.730 Let me go ahead and figure out, step-by-step, Steve Carell's shows. 01:53:32.730 --> 01:53:33.970 So how am I going to do this? 01:53:33.970 --> 01:53:39.120 Well, if I select star from people, where name equals Steve Carell, 01:53:39.120 --> 01:53:40.840 fortunately, there's only one of them. 01:53:40.840 --> 01:53:46.800 So this gives me back his name, his ID, and his birth year. 01:53:46.800 --> 01:53:48.960 But it's really only his ID that I care about. 01:53:48.960 --> 01:53:49.680 Why? 01:53:49.680 --> 01:53:55.500 Because in order to get back his shows, I need to link person ID with show ID. 01:53:55.500 --> 01:53:57.640 So I need to know his ID number. 01:53:57.640 --> 01:53:59.590 So what could I do with this? 01:53:59.590 --> 01:54:04.230 Well, remember the schema and the stars table. 01:54:04.230 --> 01:54:07.830 I've just gotten, from the people table, Steve Carell's ID. 01:54:07.830 --> 01:54:13.530 I bet by transitivity I could now use his person ID, his ID, 01:54:13.530 --> 01:54:15.900 to get back all of his show IDs. 01:54:15.900 --> 01:54:19.170 And then once I've got all of his show IDs, I can take it one step further 01:54:19.170 --> 01:54:21.330 and get back all of his shows' titles. 01:54:21.330 --> 01:54:25.440 So the answer is actually English words and not just random, seemingly, 01:54:25.440 --> 01:54:26.290 integers. 01:54:26.290 --> 01:54:27.640 So let me go ahead and do this. 01:54:27.640 --> 01:54:31.740 Let me, again, get Steve Carell's ID number, but not star. 01:54:31.740 --> 01:54:33.060 Star represents everything. 01:54:33.060 --> 01:54:35.610 It's a wildcard character in SQL. 01:54:35.610 --> 01:54:38.310 Let me just select the ID of Steve Carell. 01:54:38.310 --> 01:54:41.640 And that gives me back 136,797. 01:54:41.640 --> 01:54:43.440 And it's only giving me back one value. 01:54:43.440 --> 01:54:46.710 The thing called ID is just the column heading up above. 01:54:46.710 --> 01:54:51.360 Now, suppose I want to select all of the show IDs 01:54:51.360 --> 01:54:53.460 that Steve Carell is affiliated with. 01:54:53.460 --> 01:55:00.570 Let me select Show ID from stars, where the person ID in stars 01:55:00.570 --> 01:55:03.480 happens to equal Steve Carell's ID. 01:55:03.480 --> 01:55:07.320 So again, I'm building up my answer in reverse and taking these baby steps. 01:55:07.320 --> 01:55:11.430 On the right, in parentheses, I'm getting Steve Carell's ID. 01:55:11.430 --> 01:55:15.330 On the left, I am now selecting all of the show IDs 01:55:15.330 --> 01:55:19.410 that have some connection with that person ID in the stars table. 01:55:19.410 --> 01:55:21.850 This answer, too, is not going to be that illuminating. 01:55:21.850 --> 01:55:25.420 It's just a whole bunch of integers that have no meaning to me as a human. 01:55:25.420 --> 01:55:27.770 But let's take this one step further. 01:55:27.770 --> 01:55:29.520 And even though my code is getting long, I 01:55:29.520 --> 01:55:31.920 could hit Enter and format it nicely, especially 01:55:31.920 --> 01:55:33.660 if I were doing this in a code file. 01:55:33.660 --> 01:55:35.580 But I'm just doing it interactively for now. 01:55:35.580 --> 01:55:39.420 Let's now select all of the titles from the shows table, 01:55:39.420 --> 01:55:48.100 where the ID of the show is in this following previous query. 01:55:48.100 --> 01:55:49.740 So again, the query is getting long. 01:55:49.740 --> 01:55:52.200 But notice, it's the third and last step. 01:55:52.200 --> 01:55:55.950 Select title from the shows table, where the ID of the show 01:55:55.950 --> 01:55:58.590 is in the list of all of the show IDs that 01:55:58.590 --> 01:56:02.040 came back from the stars table searching for Steve Carell's person ID. 01:56:02.040 --> 01:56:03.480 How did we get that person ID? 01:56:03.480 --> 01:56:04.680 Let me scroll to the end. 01:56:04.680 --> 01:56:11.160 Well, I selected, in my innermost parentheses, Steve Carell's own ID. 01:56:11.160 --> 01:56:13.440 So now, when I hit Enter, voila. 01:56:13.440 --> 01:56:16.525 I get all of Steve Carell's TV shows up until now. 01:56:16.525 --> 01:56:19.650 And if I want to tidy this up further, I can use the same tricks as before. 01:56:19.650 --> 01:56:22.180 Order by title, semicolon. 01:56:22.180 --> 01:56:25.540 Now I've got it all alphabetized as before. 01:56:25.540 --> 01:56:28.080 So again, with SQL comes the ability to search-- 01:56:28.080 --> 01:56:31.290 I mean, look how quickly we do this, 0.094 seconds 01:56:31.290 --> 01:56:34.650 to search across three different tables to get back this answer. 01:56:34.650 --> 01:56:38.820 But my data is now all neatly designed in individual tables, 01:56:38.820 --> 01:56:42.000 which is going to be important now that the data set is so large. 01:56:42.000 --> 01:56:44.340 But let me take this one step further. 01:56:44.340 --> 01:56:46.930 Let me go ahead and do this. 01:56:46.930 --> 01:56:51.580 Let me go ahead and point out that with this query, 01:56:51.580 --> 01:56:54.870 notice that I'm searching on-- 01:56:54.870 --> 01:56:58.710 let's say I'm searching on a person ID here. 01:56:58.710 --> 01:57:02.410 And at the end here, I'm searching on a name column here. 01:57:02.410 --> 01:57:05.230 So let me actually go ahead and do this. 01:57:05.230 --> 01:57:09.510 Let me go ahead and see if we can't speed this up. 01:57:09.510 --> 01:57:13.090 This query at the moment takes 0.092 seconds. 01:57:13.090 --> 01:57:15.930 Let's see if we can't speed this up further by just quickly creating 01:57:15.930 --> 01:57:18.930 a few more of those B-trees in the databases memory. 01:57:18.930 --> 01:57:24.240 Create an index called person index, and I'm going to do this on the stars table 01:57:24.240 --> 01:57:25.950 on the person ID column. 01:57:25.950 --> 01:57:26.850 Enter. 01:57:26.850 --> 01:57:28.488 It's taking a moment, taking a moment. 01:57:28.488 --> 01:57:30.780 That's almost a full second because that's a big table. 01:57:30.780 --> 01:57:35.050 Let's create another index called show index on the stars table. 01:57:35.050 --> 01:57:35.550 Why? 01:57:35.550 --> 01:57:37.950 Because I want to search by the show ID also. 01:57:37.950 --> 01:57:39.780 That was part of my big query. 01:57:39.780 --> 01:57:40.660 Takes a moment. 01:57:40.660 --> 01:57:43.810 OK, just more than about 2/3 of a second. 01:57:43.810 --> 01:57:46.440 Now let's create one last one, another index called name index, 01:57:46.440 --> 01:57:49.160 but I could call these things anything I want, on the people table. 01:57:49.160 --> 01:57:49.560 Why? 01:57:49.560 --> 01:57:51.550 Because I'm also searching on the name column. 01:57:51.550 --> 01:57:53.730 So in short, I'm creating indexes on each 01:57:53.730 --> 01:57:57.450 of the columns that are somehow involved in my search query, 01:57:57.450 --> 01:57:59.680 going from one table to the other. 01:57:59.680 --> 01:58:07.590 Now let's go back to the previous query, which, recall, took-- 01:58:07.590 --> 01:58:10.270 I think I erased it, 0.091. 01:58:10.270 --> 01:58:10.770 All right. 01:58:10.770 --> 01:58:12.645 Well, it was roughly this order of magnitude. 01:58:12.645 --> 01:58:14.140 We're not seeing the data now. 01:58:14.140 --> 01:58:17.160 But let me go ahead and run my original big query once. 01:58:17.160 --> 01:58:20.380 And boom, we're down to almost nothing. 01:58:20.380 --> 01:58:22.860 So again, creating these indexes in memory 01:58:22.860 --> 01:58:27.640 has the effect of rapidly speeding up our computation time. 01:58:27.640 --> 01:58:31.140 Now if you've ever used, for instance, the my.harvard course shopping tool 01:58:31.140 --> 01:58:35.130 here on campus, or Yale's analogue, you might wonder, why is the thing so slow? 01:58:35.130 --> 01:58:39.330 This could be one of the reasons why large data sets with thousands of rows, 01:58:39.330 --> 01:58:41.670 thousands of courses tend to be slow, if, 01:58:41.670 --> 01:58:44.880 and I'm only conjecturing, if the database isn't properly indexed. 01:58:44.880 --> 01:58:46.890 If you're building your own web application 01:58:46.890 --> 01:58:49.170 and you're finding that users are waiting and waiting, 01:58:49.170 --> 01:58:52.290 and things are spinning and spinning, what might be among the problems? 01:58:52.290 --> 01:58:55.770 Well, it could absolutely just be bad algorithms and bad code that you wrote. 01:58:55.770 --> 01:58:58.270 Or it might be that you haven't thought about, well, 01:58:58.270 --> 01:59:01.770 what column should be optimized for searches and filtration 01:59:01.770 --> 01:59:05.705 like I've done here in order to speed up subsequent queries? 01:59:05.705 --> 01:59:07.830 Again, from the outside in, we can only conjecture. 01:59:07.830 --> 01:59:11.580 But ultimately, this is just one of the things that 01:59:11.580 --> 01:59:14.308 explains performance problems as well. 01:59:14.308 --> 01:59:17.100 All right, let's point out just a couple of final syntactic things, 01:59:17.100 --> 01:59:19.770 and then we'll consider, bigger picture, some problems 01:59:19.770 --> 01:59:22.110 that might arise in this world. 01:59:22.110 --> 01:59:26.880 If these nested, nested queries start to get a little much, 01:59:26.880 --> 01:59:29.160 there are other ways, just so you've seen it, 01:59:29.160 --> 01:59:31.920 that you can execute similar logic in SQL. 01:59:31.920 --> 01:59:34.410 For instance, if I know in advance that I 01:59:34.410 --> 01:59:39.390 want to connect Steve Carell to his show IDs and to their titles, 01:59:39.390 --> 01:59:41.190 we can do something more like this. 01:59:41.190 --> 01:59:52.050 Select title from the people table, joined with the stars table on people 01:59:52.050 --> 01:59:54.900 ID equals stars.personID. 01:59:54.900 --> 01:59:55.860 So what am I doing? 01:59:55.860 --> 01:59:56.490 New syntax. 01:59:56.490 --> 01:59:59.782 And again, this is not something you'll have to memorize or ingrain right away. 01:59:59.782 --> 02:00:04.080 But just so you've seen other approaches, select title from people 02:00:04.080 --> 02:00:05.160 join stars. 02:00:05.160 --> 02:00:09.690 This is an explicit way to say, take the people table in one hand, the stars 02:00:09.690 --> 02:00:12.030 table in the other hand, and somehow join them 02:00:12.030 --> 02:00:13.990 as I keep doing with my fingertips here. 02:00:13.990 --> 02:00:15.610 How to join them? 02:00:15.610 --> 02:00:20.370 Join them so that the people, the ID column in the people table lines up 02:00:20.370 --> 02:00:23.260 with the person ID in the stars table. 02:00:23.260 --> 02:00:25.260 But that's not quite everything. 02:00:25.260 --> 02:00:28.740 I could also say, join further on the shows table, 02:00:28.740 --> 02:00:35.290 where the stars show ID equals the shows ID column. 02:00:35.290 --> 02:00:36.540 So what am I doing here? 02:00:36.540 --> 02:00:45.990 That's saying, go further and join the stars table with the show's table, 02:00:45.990 --> 02:00:48.990 joining the show ID column with the ID column. 02:00:48.990 --> 02:00:51.750 Again, this starts to get a little messy to think about. 02:00:51.750 --> 02:00:55.830 But now I can just say, where name equals, quote, unquote, "Steve Carell." 02:00:55.830 --> 02:00:59.070 I can do in one query what previously took me three nested queries 02:00:59.070 --> 02:01:00.600 and get back the same answers. 02:01:00.600 --> 02:01:05.380 And I can still add in my order by title to get back the result. 02:01:05.380 --> 02:01:09.780 And if I do this a little more neatly, let me type this out a little 02:01:09.780 --> 02:01:10.750 differently. 02:01:10.750 --> 02:01:16.620 Let me type this out by adding a new line-- ah, I can't do that here. 02:01:16.620 --> 02:01:18.120 I'm going to leave it alone for now. 02:01:18.120 --> 02:01:20.700 We can type it on multiple lines in other contexts. 02:01:20.700 --> 02:01:23.700 And let me do one last thing. 02:01:23.700 --> 02:01:25.012 Do I want to show that? 02:01:25.012 --> 02:01:26.970 I'm going to show it, but this is not something 02:01:26.970 --> 02:01:28.440 you should ingrain just yet either. 02:01:28.440 --> 02:01:31.620 Select title from people, stars, and shows. 02:01:31.620 --> 02:01:34.860 If you know in advance that you want to do something with all three tables, 02:01:34.860 --> 02:01:38.130 you can just enumerate them, one table name after the other. 02:01:38.130 --> 02:01:43.497 And then you can say where people.ID equals stars.personID. 02:01:43.497 --> 02:01:45.330 And now I'm hitting Enter so that it formats 02:01:45.330 --> 02:01:47.070 a little more readably on my screen. 02:01:47.070 --> 02:01:55.140 And stars.showID equals shows.ID, and lastly, name equals Steve Carell. 02:01:55.140 --> 02:01:59.730 In short, you specify that you want to select data from all three 02:01:59.730 --> 02:02:00.690 of these tables. 02:02:00.690 --> 02:02:06.450 And then you tell the database how to combine foreign keys with primary keys, 02:02:06.450 --> 02:02:09.690 that is, the columns that have those integers in common. 02:02:09.690 --> 02:02:12.720 If I hit Enter now, I get the same exact results, ever 02:02:12.720 --> 02:02:16.110 more so if I also add in an order by title. 02:02:16.110 --> 02:02:17.770 Oops. 02:02:17.770 --> 02:02:18.270 All right. 02:02:18.270 --> 02:02:20.190 That's why I didn't want to do this earlier. 02:02:20.190 --> 02:02:23.190 I have to go back through my history multiple times to actually get back 02:02:23.190 --> 02:02:24.640 the multi-line query this time. 02:02:24.640 --> 02:02:25.140 All right. 02:02:25.140 --> 02:02:27.280 That was a lot all at once. 02:02:27.280 --> 02:02:31.365 But this is only to say that, even as we make the design of the data 02:02:31.365 --> 02:02:33.990 more sophisticated, and we put some of it over here, some of it 02:02:33.990 --> 02:02:38.130 over here, some of it over here so as to avoid duplication of data, weird hacks 02:02:38.130 --> 02:02:41.910 like putting commas in the data, we can still get back all of the answers 02:02:41.910 --> 02:02:44.280 that we might want across these several tables. 02:02:44.280 --> 02:02:48.580 And using indexes, we can significantly speed up these processes 02:02:48.580 --> 02:02:52.140 so as to handle 10 times as many, a 100 times as many users 02:02:52.140 --> 02:02:53.670 on the same actual database. 02:02:53.670 --> 02:02:55.020 There is going to be a downside. 02:02:55.020 --> 02:02:57.540 And thinking back to our discussion of algorithms and data 02:02:57.540 --> 02:03:02.120 structures in past weeks, what might be a downside of creating these indexes? 02:03:02.120 --> 02:03:06.110 Because as of now, I created four separate indexes on the name column, 02:03:06.110 --> 02:03:09.560 the title column, and some other columns, too. 02:03:09.560 --> 02:03:11.930 Why wouldn't I just go ahead and index everything 02:03:11.930 --> 02:03:14.390 if it's clearly speeding things up? 02:03:14.390 --> 02:03:15.770 Memory, so space. 02:03:15.770 --> 02:03:18.890 Any time you're starting to benefit time wise in computer science, 02:03:18.890 --> 02:03:21.950 odds are you're sacrificing space, or vice versa. 02:03:21.950 --> 02:03:25.400 And probably indexing absolutely everything is a little dumb 02:03:25.400 --> 02:03:29.430 because you're going to waste way more space than you might actually need. 02:03:29.430 --> 02:03:31.610 So figuring out where the right inflection point is 02:03:31.610 --> 02:03:36.410 is part of the process of designing and just getting better at these things. 02:03:36.410 --> 02:03:40.910 Now unfortunately, a whole lot of things can go wrong in this world, 02:03:40.910 --> 02:03:44.870 and they continue to in the real world with people using SQL databases. 02:03:44.870 --> 02:03:46.760 And in fact, here on out, if you're reading 02:03:46.760 --> 02:03:51.530 something technical about SQL databases, and websites being hacked in some form, 02:03:51.530 --> 02:03:54.660 and passwords leaking out, unfortunately, all too often 02:03:54.660 --> 02:03:57.505 it is because of what are called SQL injection attacks. 02:03:57.505 --> 02:03:59.630 And just to give you a sense now to counterbalance, 02:03:59.630 --> 02:04:01.588 maybe [INAUDIBLE] enthusiasm for like, oh, that 02:04:01.588 --> 02:04:03.620 was neat how we can do things so quickly. 02:04:03.620 --> 02:04:06.680 With great power comes responsibility in this world, too. 02:04:06.680 --> 02:04:09.320 And so many people introduce bugs into their code 02:04:09.320 --> 02:04:17.160 by not quite appreciating how it is the data is getting into your application. 02:04:17.160 --> 02:04:18.390 So what do I mean by that? 02:04:18.390 --> 02:04:21.200 Here, for instance, is a typical login screen for Yale. 02:04:21.200 --> 02:04:23.780 And here's the analogue for Harvard where you're prompted, 02:04:23.780 --> 02:04:26.450 every day probably, for your username and your password, 02:04:26.450 --> 02:04:28.460 your email address and your password here. 02:04:28.460 --> 02:04:32.420 Suppose, though, that behind this login page, 02:04:32.420 --> 02:04:35.030 whether Harvard's or Yale's, there's some website. 02:04:35.030 --> 02:04:38.270 And that website is using SQL underneath the hood 02:04:38.270 --> 02:04:40.700 to store all of the Harvard or Yale people's 02:04:40.700 --> 02:04:43.940 usernames, passwords, ID numbers, courses, transcripts, 02:04:43.940 --> 02:04:44.970 all of that stuff. 02:04:44.970 --> 02:04:47.480 So there's a SQL database underneath the website. 02:04:47.480 --> 02:04:50.360 Well, what might go wrong with this process? 02:04:50.360 --> 02:04:52.850 Unfortunately, there's some special syntax in SQL 02:04:52.850 --> 02:04:54.530 just like there is in C and Python. 02:04:54.530 --> 02:04:56.960 For instance, there are comments in SQL, too. 02:04:56.960 --> 02:05:00.680 If you do two hyphens, dash, dash, that's a comment in SQL. 02:05:00.680 --> 02:05:06.170 And if you, the programmer, aren't sufficiently distrustful of your users, 02:05:06.170 --> 02:05:09.560 such that you defend against potentially adversarial attacks, 02:05:09.560 --> 02:05:11.160 you might do something like this. 02:05:11.160 --> 02:05:16.070 Suppose that I somewhat maliciously or curiously log in 02:05:16.070 --> 02:05:19.130 by typing my username, Malan@harvard.edu, and then maybe 02:05:19.130 --> 02:05:20.990 a single quote and a dash, dash. 02:05:20.990 --> 02:05:21.680 Why? 02:05:21.680 --> 02:05:24.860 Because I'm trying to suss out if there is a vulnerability here 02:05:24.860 --> 02:05:26.420 to a SQL injection attack. 02:05:26.420 --> 02:05:27.910 Do not do this in general. 02:05:27.910 --> 02:05:31.160 But if I were the owner of the website trying to see if I've made any mistake, 02:05:31.160 --> 02:05:35.300 I might try using potentially dangerous characters in my input. 02:05:35.300 --> 02:05:36.290 Dangerous how? 02:05:36.290 --> 02:05:40.340 Because single quote is used for quoting things in SQL, as we've seen-- 02:05:40.340 --> 02:05:41.810 single quotes or double quotes. 02:05:41.810 --> 02:05:44.930 Dash, dash, I claim now, is used for commenting. 02:05:44.930 --> 02:05:47.960 But let's now imagine what the code underneath the hood 02:05:47.960 --> 02:05:52.160 might be for something like Yale's login or Harvard's login. 02:05:52.160 --> 02:05:54.600 What if it's code that looks like this? 02:05:54.600 --> 02:05:56.540 So let me read it from left to right. 02:05:56.540 --> 02:06:00.710 Suppose that they are using something like CS50's own execute function, 02:06:00.710 --> 02:06:03.230 and they've got some SQL typed into the website that 02:06:03.230 --> 02:06:07.160 says select star from users, where username equals this, 02:06:07.160 --> 02:06:09.050 and password equals that. 02:06:09.050 --> 02:06:12.510 And they're plugging in username and password. 02:06:12.510 --> 02:06:13.607 So what am I doing here? 02:06:13.607 --> 02:06:16.190 Well, when the user types their username password, hits Enter, 02:06:16.190 --> 02:06:18.920 I probably want to select that user from my database 02:06:18.920 --> 02:06:21.020 to see if the username and passwords match. 02:06:21.020 --> 02:06:23.720 So the underlying SQL might be, select star 02:06:23.720 --> 02:06:25.790 from users, where username equals question mark, 02:06:25.790 --> 02:06:27.207 and password equals question mark. 02:06:27.207 --> 02:06:28.100 Users is the table. 02:06:28.100 --> 02:06:29.510 One column is username. 02:06:29.510 --> 02:06:31.400 One column is password. 02:06:31.400 --> 02:06:32.000 All right. 02:06:32.000 --> 02:06:37.430 And if we get back one row, presumably Malan@harvard.edu 02:06:37.430 --> 02:06:38.970 exists with that password. 02:06:38.970 --> 02:06:41.190 We should let him proceed from there on out. 02:06:41.190 --> 02:06:45.140 So that's some pseudo code, if you will, for this scenario. 02:06:45.140 --> 02:06:49.580 What if, though, this code is not as well written as it currently 02:06:49.580 --> 02:06:51.500 is, and isn't using question marks? 02:06:51.500 --> 02:06:54.757 So the question mark syntax is a fairly common SQL thing, 02:06:54.757 --> 02:06:56.840 where the question marks are used as placeholders, 02:06:56.840 --> 02:06:59.390 just like in printf, percent S was. 02:06:59.390 --> 02:07:02.900 But this function, db.execute from CS50's library 02:07:02.900 --> 02:07:05.420 and third-party libraries as well, is also 02:07:05.420 --> 02:07:07.790 doing some good stuff with these question marks, 02:07:07.790 --> 02:07:09.830 and defending against the following attack. 02:07:09.830 --> 02:07:12.920 Suppose that you were not using a third-party library like ours 02:07:12.920 --> 02:07:16.490 and you were just manually constructing your SQL queries like this. 02:07:16.490 --> 02:07:19.940 You were to do something like this instead using an f-string in Python. 02:07:19.940 --> 02:07:21.800 You're comfortable with format strings now. 02:07:21.800 --> 02:07:24.883 You've gotten into the habit of using curly braces and plugging in values. 02:07:24.883 --> 02:07:27.230 Suppose that you, the aspiring programmer, 02:07:27.230 --> 02:07:29.660 is just using techniques that you've been taught. 02:07:29.660 --> 02:07:32.660 So you have an f-string with select star from users, 02:07:32.660 --> 02:07:36.470 where username equals, quote, unquote, "username" in curly braces. 02:07:36.470 --> 02:07:41.270 And password equals, quote, unquote, "password" in curly braces. 02:07:41.270 --> 02:07:44.270 As of what, two weeks ago, this was perfectly 02:07:44.270 --> 02:07:49.460 legitimate technique in Python to plug in values into a string. 02:07:49.460 --> 02:07:53.630 But notice if you are using single quotes yourself 02:07:53.630 --> 02:07:58.750 and the user has typed in single quotes to their input, what 02:07:58.750 --> 02:07:59.920 could go wrong here? 02:07:59.920 --> 02:08:04.510 Where are we going with this if you're just blindly plugging user input 02:08:04.510 --> 02:08:08.350 into your own prepared string of text? 02:08:08.350 --> 02:08:09.230 Yeah? 02:08:09.230 --> 02:08:14.725 AUDIENCE: [INAUDIBLE] 02:08:14.725 --> 02:08:15.600 DAVID J. MALAN: Yeah. 02:08:15.600 --> 02:08:20.670 Worst case, they could insert what is actually SQL code into your database 02:08:20.670 --> 02:08:21.810 as follows. 02:08:21.810 --> 02:08:25.140 Generally speaking, if you're using special syntax like single quotes 02:08:25.140 --> 02:08:27.000 to surround the user's input, you'd better 02:08:27.000 --> 02:08:29.250 hope that they don't have an apostrophe in their name. 02:08:29.250 --> 02:08:31.875 Or you better hope that they don't type a single quote as well. 02:08:31.875 --> 02:08:36.025 Because what if their single quote finishes your single quote instead, 02:08:36.025 --> 02:08:37.900 and then the rest of this is somehow ignored? 02:08:37.900 --> 02:08:39.330 Well, let's consider how this might happen. 02:08:39.330 --> 02:08:40.510 Let me go ahead in here. 02:08:40.510 --> 02:08:42.840 This got a little blurry here, but let me 02:08:42.840 --> 02:08:44.850 plug in here-- wow, that looks awful. 02:08:44.850 --> 02:08:47.940 Let me fix the red. 02:08:47.940 --> 02:08:50.430 Just change this to white so it's more readable. 02:08:50.430 --> 02:08:56.730 What happens if the user does this instead? 02:08:56.730 --> 02:08:59.310 They type in, like I did into the screenshot, 02:08:59.310 --> 02:09:02.760 'Malan@harvard.edu,' single quote, dash, dash. 02:09:02.760 --> 02:09:05.070 What has just happened logically, even though we've 02:09:05.070 --> 02:09:06.900 only just begun with SQL today? 02:09:06.900 --> 02:09:12.150 Well, select star from users, where username equals Malan@harvard.edu, end 02:09:12.150 --> 02:09:13.320 quote. 02:09:13.320 --> 02:09:17.340 What's bad about the rest of this? 02:09:17.340 --> 02:09:19.752 Dash, dash, I claim, means a comment, which 02:09:19.752 --> 02:09:22.210 means my color coding is going to be a little blurry again. 02:09:22.210 --> 02:09:25.140 But everything after the dash, dash is just ignored. 02:09:25.140 --> 02:09:27.440 The logic, then, of the SQL query, then, is 02:09:27.440 --> 02:09:30.760 to just say, select Malan@harvard.edu from the database, 02:09:30.760 --> 02:09:33.490 not even checking the password anymore. 02:09:33.490 --> 02:09:36.180 Therefore, you will get back at least one row. 02:09:36.180 --> 02:09:41.190 So length of rows will equal 1, and so presumably the rest of the pseudo code 02:09:41.190 --> 02:09:44.190 logs the user in, gives them access to my my.harvard account, 02:09:44.190 --> 02:09:45.150 or whatever it is. 02:09:45.150 --> 02:09:50.640 And they've pretended to be me simply by using a single quote and a dash, 02:09:50.640 --> 02:09:52.445 dash in the username field. 02:09:52.445 --> 02:09:54.570 Again, please don't go start doing this later today 02:09:54.570 --> 02:09:56.140 on Harvard, Yale, or other websites. 02:09:56.140 --> 02:09:57.670 But it could be as simple as that. 02:09:57.670 --> 02:09:58.170 Why? 02:09:58.170 --> 02:10:00.030 Because the programmer practiced what they 02:10:00.030 --> 02:10:04.110 were taught, which was just to use curly braces to plug in, 02:10:04.110 --> 02:10:05.560 in f-strings, values. 02:10:05.560 --> 02:10:08.590 But if you don't understand how the user's input is going to be used, 02:10:08.590 --> 02:10:12.255 and if you don't distrust your users fundamentally, for every good person 02:10:12.255 --> 02:10:14.130 out there there's going to be, unfortunately, 02:10:14.130 --> 02:10:19.380 some adversary who just wants to try to find fault in your code or hack 02:10:19.380 --> 02:10:20.490 into your data set. 02:10:20.490 --> 02:10:22.980 This is what's known as a SQL injection attack, 02:10:22.980 --> 02:10:27.000 because the user can type something that happens to be or look like SQL, 02:10:27.000 --> 02:10:31.110 and trick your database into doing something it didn't intend to, 02:10:31.110 --> 02:10:34.830 like, for instance, logging the user in. 02:10:34.830 --> 02:10:36.880 Worst case, they could even do something else. 02:10:36.880 --> 02:10:41.490 Maybe the user types a semicolon, then the word drop, or the word update. 02:10:41.490 --> 02:10:45.180 You could imagine doing semicolon update table grades, where 02:10:45.180 --> 02:10:49.140 name equals Malan, and set the grade equal to A instead of B, 02:10:49.140 --> 02:10:50.670 or something like that. 02:10:50.670 --> 02:10:53.550 The ability to inject SQL into the database 02:10:53.550 --> 02:10:56.820 means you can do anything you want with the data set, either constructively, 02:10:56.820 --> 02:11:00.000 or worse, destructively. 02:11:00.000 --> 02:11:02.880 And now, just a quick, little cartoon that should now make sense. 02:11:09.410 --> 02:11:13.020 OK, to, like, one of us, two of us. 02:11:13.020 --> 02:11:14.400 Awkwardly somewhat funny. 02:11:14.400 --> 02:11:16.560 All right, so let's move on to one last condition. 02:11:16.560 --> 02:11:19.130 There's one other problem that can go awry here. 02:11:19.130 --> 02:11:20.380 Oh, and I should explain this. 02:11:20.380 --> 02:11:25.500 So this is an allusion to the son, Robert, having typed in semicolon. 02:11:25.500 --> 02:11:28.810 The word drop, table, students, and doing some of the same technique. 02:11:28.810 --> 02:11:31.440 This is humor that only CS people would understand 02:11:31.440 --> 02:11:35.040 because it's the mom realizing, oh, her son's doing a SQL injection 02:11:35.040 --> 02:11:36.308 attack onto the database. 02:11:36.308 --> 02:11:39.600 Less funny when you explain it, but once you notice the syntax, that's all this 02:11:39.600 --> 02:11:40.960 is an allusion to. 02:11:40.960 --> 02:11:41.460 All right. 02:11:41.460 --> 02:11:44.850 So one final threat, now that you are graduating 02:11:44.850 --> 02:11:49.320 to the world of proper databases and away from CSV files alone. 02:11:49.320 --> 02:11:52.170 Things can go wrong when using databases, 02:11:52.170 --> 02:11:55.838 and honestly, even using CSV files if you have multiple users. 02:11:55.838 --> 02:11:57.630 And thus far, you and I have had the luxury 02:11:57.630 --> 02:12:00.547 in almost every program we've written that it's just me using my code. 02:12:00.547 --> 02:12:01.830 It's just you using your code. 02:12:01.830 --> 02:12:04.770 And even if your teaching fellow or TA is using it, probably 02:12:04.770 --> 02:12:06.060 not at the same time. 02:12:06.060 --> 02:12:10.770 But the world gets interesting if you start putting your code on phones, 02:12:10.770 --> 02:12:14.730 on websites, such that now you might have two users literally trying 02:12:14.730 --> 02:12:17.430 to log in at the same time, literally clicking a button 02:12:17.430 --> 02:12:19.410 at the same, or nearly the same time. 02:12:19.410 --> 02:12:22.380 What happens, then, if a computer is trying 02:12:22.380 --> 02:12:25.290 to handle requests from two different people at once, 02:12:25.290 --> 02:12:27.480 as might happen all the time on a website? 02:12:27.480 --> 02:12:29.610 You might get what are called race conditions. 02:12:29.610 --> 02:12:33.060 And this is a problem in computing in general, not just with SQL, not just 02:12:33.060 --> 02:12:36.960 with Python, really just any time you have shared data, 02:12:36.960 --> 02:12:39.150 like a database, as follows. 02:12:39.150 --> 02:12:43.620 This apparently is one of the most liked Instagram posts ever. 02:12:43.620 --> 02:12:46.110 It is literally just a picture of an egg. 02:12:46.110 --> 02:12:47.645 Has anyone clicked on this egg? 02:12:47.645 --> 02:12:48.270 Like, a couple? 02:12:48.270 --> 02:12:48.780 Oh, OK. 02:12:48.780 --> 02:12:49.280 Wow. 02:12:49.280 --> 02:12:50.220 All right, so yes. 02:12:50.220 --> 02:12:53.880 So go search for this photo if you'd like to add to the likes on Instagram. 02:12:53.880 --> 02:12:56.110 The account is world_record_egg. 02:12:56.110 --> 02:12:59.040 This is just a screenshot of Instagram of that picture of an egg. 02:12:59.040 --> 02:13:00.540 If you're in the habit of using Instagram, 02:13:00.540 --> 02:13:03.498 or like any social media site, there's some equivalent of a like button 02:13:03.498 --> 02:13:04.920 or a heart button these days. 02:13:04.920 --> 02:13:06.900 And that's actually a really hard problem. 02:13:06.900 --> 02:13:10.050 Such a simple idea to count the number of likes something 02:13:10.050 --> 02:13:12.990 has, but that means someone has to click on it. 02:13:12.990 --> 02:13:14.910 Your code has to detect the click. 02:13:14.910 --> 02:13:17.860 Your code has to update the database, and then do it again and again, 02:13:17.860 --> 02:13:22.740 even if multiple people are perhaps right now clicking on that same egg. 02:13:22.740 --> 02:13:28.350 And unfortunately, bad things can happen if two people try to do something 02:13:28.350 --> 02:13:30.540 at the same time on a computer. 02:13:30.540 --> 02:13:31.690 How might this happen? 02:13:31.690 --> 02:13:35.670 So here's some more code, half pseudocode, half Python code here, 02:13:35.670 --> 02:13:36.580 as follows. 02:13:36.580 --> 02:13:40.230 Suppose that what happens when you, literally, right now, maybe click 02:13:40.230 --> 02:13:43.470 on the like button on the Instagram post. 02:13:43.470 --> 02:13:47.760 Suppose that code, like the following, is executed on Facebook servers. 02:13:47.760 --> 02:13:54.190 db.execute of select likes from posts where ID equals question mark. 02:13:54.190 --> 02:13:54.690 All right. 02:13:54.690 --> 02:13:57.090 So what am I assuming here? 02:13:57.090 --> 02:13:59.280 I'm assuming that that photograph has a unique ID. 02:13:59.280 --> 02:14:02.670 It's some big integer, whatever it was, randomly assigned. 02:14:02.670 --> 02:14:05.130 I'm assuming that when you click on the heart 02:14:05.130 --> 02:14:08.160 the unique ID is somehow sent to Instagram servers 02:14:08.160 --> 02:14:10.740 so that their code can call it ID. 02:14:10.740 --> 02:14:13.830 And I'm assuming that Instagram is using its SQL database 02:14:13.830 --> 02:14:17.790 and selecting, from a posts table, the current number of likes 02:14:17.790 --> 02:14:21.160 of that egg for that given ID number. 02:14:21.160 --> 02:14:21.660 Why? 02:14:21.660 --> 02:14:24.952 Because I need to know how many likes it already has if I want to add one to it 02:14:24.952 --> 02:14:26.190 and then update the database. 02:14:26.190 --> 02:14:29.710 I need to select the data, then I need to update the data here. 02:14:29.710 --> 02:14:30.210 All right. 02:14:30.210 --> 02:14:33.780 So in some Python code here, let's store, in a variable called 02:14:33.780 --> 02:14:37.950 likes, whatever comes back in the first row from the likes column. 02:14:37.950 --> 02:14:40.660 Again, this is new syntax specific to our library, 02:14:40.660 --> 02:14:43.830 but a common way of getting back first row and the column called 02:14:43.830 --> 02:14:44.800 likes therein. 02:14:44.800 --> 02:14:46.920 So at this point in the story, likes is storing 02:14:46.920 --> 02:14:49.462 the total number of likes, in the millions or whatever it is, 02:14:49.462 --> 02:14:50.610 of that particular egg. 02:14:50.610 --> 02:14:51.900 Then I do this. 02:14:51.900 --> 02:14:56.400 Execute update posts, set the number of likes 02:14:56.400 --> 02:15:00.607 equal to this value, where the ID of the post equals this value. 02:15:00.607 --> 02:15:02.190 What do I want to update the likes to? 02:15:02.190 --> 02:15:06.540 Whatever likes currently is plus 1, and then plugging in the ID. 02:15:06.540 --> 02:15:08.340 So a simple idea, right? 02:15:08.340 --> 02:15:11.730 I'm checking the value of the likes, and maybe it's 10. 02:15:11.730 --> 02:15:15.030 I'm changing 10 to 11 and then updating the table. 02:15:15.030 --> 02:15:18.390 But a problem can arise if two people have 02:15:18.390 --> 02:15:23.010 clicked on that egg at roughly the same time, or literally, the same time. 02:15:23.010 --> 02:15:24.340 Why is that? 02:15:24.340 --> 02:15:26.850 Well, in the world of databases and servers, 02:15:26.850 --> 02:15:31.120 and the Instagrams of the world have thousands of physical servers nowadays. 02:15:31.120 --> 02:15:35.430 So they can support millions, billions even, of users nowadays. 02:15:35.430 --> 02:15:36.910 What can go wrong? 02:15:36.910 --> 02:15:40.530 Well, typically code like this is not what we'll call atomic. 02:15:40.530 --> 02:15:44.430 To be atomic means that it all executes together or not at all. 02:15:44.430 --> 02:15:49.500 Rather, code typically is executed, as you might imagine, line by line. 02:15:49.500 --> 02:15:53.110 And if your code is running on a server that multiple people have access to, 02:15:53.110 --> 02:15:55.530 which is absolutely the case for an app like Instagram, 02:15:55.530 --> 02:15:58.560 if you and I click on the heart at roughly the same time, 02:15:58.560 --> 02:16:02.160 for efficiency, the computer, the server, owned by Instagram, 02:16:02.160 --> 02:16:04.500 might execute this line of code for me. 02:16:04.500 --> 02:16:06.537 Then it might execute this line of code for you. 02:16:06.537 --> 02:16:09.120 Then this line of code for me, then this line of code for you, 02:16:09.120 --> 02:16:11.703 then this line of code for me, then this line of code for you. 02:16:11.703 --> 02:16:16.550 That is to say, our queries might get intermingled chronologically. 02:16:16.550 --> 02:16:19.370 Because it'd be a little obnoxious if, when you're using Instagram, 02:16:19.370 --> 02:16:22.040 I'm blocked out while you're interacting with the site. 02:16:22.040 --> 02:16:24.778 It'd be a lot nicer for efficiency and fairness if somehow they 02:16:24.778 --> 02:16:27.320 do a little bit of work for me, a little bit of work for you, 02:16:27.320 --> 02:16:30.540 and back and forth, and back and forth, equitably on the server. 02:16:30.540 --> 02:16:33.320 So that's what typically happens by default. These lines of code 02:16:33.320 --> 02:16:35.420 get executed independently. 02:16:35.420 --> 02:16:39.770 And they can happen in alternating order with other users. 02:16:39.770 --> 02:16:41.809 You can get them combined like this. 02:16:41.809 --> 02:16:45.840 Same order top to bottom, but other things might happen in between. 02:16:45.840 --> 02:16:50.000 So suppose that the number of likes at the very beginning was 10. 02:16:50.000 --> 02:16:54.080 And suppose that Carter and I both click on that egg at roughly the same time. 02:16:54.080 --> 02:16:56.420 And suppose this line of code gets executed for me, 02:16:56.420 --> 02:16:59.870 and that gives me a value in likes, ultimately, of 10. 02:16:59.870 --> 02:17:03.290 Suppose, then, that the computer takes a break from dealing with my request, 02:17:03.290 --> 02:17:05.690 does the same code for Carter, and gets back 02:17:05.690 --> 02:17:07.910 what value for the current number of likes? 02:17:07.910 --> 02:17:09.170 Also 10 for Carter. 02:17:09.170 --> 02:17:11.209 Because mine has not been recorded yet. 02:17:11.209 --> 02:17:13.910 At this point in the story, somewhere in the computer's memory 02:17:13.910 --> 02:17:16.219 there's a likes variable for me, storing 10. 02:17:16.219 --> 02:17:19.400 There's a likes variable storing 10 for Carter. 02:17:19.400 --> 02:17:21.290 Then this line of code executes for me. 02:17:21.290 --> 02:17:25.219 It updates the database to be likes plus 1, which stores 11 in the database. 02:17:25.219 --> 02:17:30.440 Then Carter's code is executed, updating the same row in the database 02:17:30.440 --> 02:17:34.160 to 11, unfortunately. 02:17:34.160 --> 02:17:38.209 Because his value of likes happened to be the same value of mine. 02:17:38.209 --> 02:17:41.540 And so the metaphor here, that if we had a refrigerator on stage we would 02:17:41.540 --> 02:17:44.790 actually act out, is something that was taught to me years ago in an operating 02:17:44.790 --> 02:17:50.299 systems class, whereby the most similar analogue in the real world would be 02:17:50.299 --> 02:17:52.370 if you've got a mini fridge in your dorm room. 02:17:52.370 --> 02:17:58.440 And one of you and your roommates comes home, opens the fridge, and realizes, 02:17:58.440 --> 02:18:01.160 oh, we're out of milk, was how the story went in my day. 02:18:01.160 --> 02:18:05.184 So you close the refrigerator, and you walk across the street, go to CVS, 02:18:05.184 --> 02:18:06.559 and get in line to buy some milk. 02:18:06.559 --> 02:18:08.309 Meanwhile, your roommate comes home. 02:18:08.309 --> 02:18:12.924 They, too, inspect the state of your refrigerator, a.k.a., a variable, 02:18:12.924 --> 02:18:15.049 open the door, and realizes, oh, we're out of milk. 02:18:15.049 --> 02:18:15.980 I'll go get more milk. 02:18:15.980 --> 02:18:18.020 Close the fridge, go across the street, and head 02:18:18.020 --> 02:18:20.059 to maybe a different store, or the line is long enough 02:18:20.059 --> 02:18:21.851 that you don't see each other at the store. 02:18:21.851 --> 02:18:25.700 So long story short, you both eventually get home, open the door, and damn it, 02:18:25.700 --> 02:18:27.559 now there's milk from your other roommate 02:18:27.559 --> 02:18:30.680 there because you both made a decision on this 02:18:30.680 --> 02:18:35.809 based on the state of a variable that you independently examined. 02:18:35.809 --> 02:18:37.700 And you didn't somehow communicate. 02:18:37.700 --> 02:18:40.610 Now in the real world, this is absolutely solvable. 02:18:40.610 --> 02:18:44.120 How would you fix this or avoid this problem in the real world? 02:18:44.120 --> 02:18:45.770 Literally, own roommate, own fridge. 02:18:45.770 --> 02:18:47.750 AUDIENCE: Text your roommate [INAUDIBLE].. 02:18:47.750 --> 02:18:48.750 DAVID J. MALAN: Perfect. 02:18:48.750 --> 02:18:50.482 Let them know, so somehow communicate. 02:18:50.482 --> 02:18:52.940 And in fact, the terminology here would be multiple threads 02:18:52.940 --> 02:18:55.459 can somehow intercommunicate by having shared state, 02:18:55.459 --> 02:18:57.230 like the iMessage thread on your phone. 02:18:57.230 --> 02:18:58.190 You could leave a note. 02:18:58.190 --> 02:19:01.490 You could, more dramatically, lock the refrigerator somehow, 02:19:01.490 --> 02:19:05.570 thereby making the milk purchasing process atomic. 02:19:05.570 --> 02:19:08.389 The fundamental problem is that for efficiency, again, 02:19:08.389 --> 02:19:11.690 computers tend to intermingle logic that needs 02:19:11.690 --> 02:19:15.980 to happen when it's happening across multiple users just for fairness' sake, 02:19:15.980 --> 02:19:17.180 for scheduling sake. 02:19:17.180 --> 02:19:19.820 You need to make sure that all three of these lines of code 02:19:19.820 --> 02:19:23.269 execute for me, and then for Carter, and then for you 02:19:23.269 --> 02:19:25.709 if you want to ensure that this count is correct. 02:19:25.709 --> 02:19:29.219 And for years, when social media was first getting off the ground, 02:19:29.219 --> 02:19:31.100 this was a super hard problem. 02:19:31.100 --> 02:19:33.920 Twitter used to go down all of the time, and tweets, 02:19:33.920 --> 02:19:36.260 and retweets were a thing that were similarly happening 02:19:36.260 --> 02:19:37.490 with a very high frequency. 02:19:37.490 --> 02:19:39.020 These are hard problems to solve. 02:19:39.020 --> 02:19:40.690 And thankfully, there are solutions. 02:19:40.690 --> 02:19:43.440 And we won't get into the weeds of how you might use these things, 02:19:43.440 --> 02:19:46.100 but know that there are solutions in the form of things 02:19:46.100 --> 02:19:49.610 called locks, which I use that word deliberately with the fridge. 02:19:49.610 --> 02:19:53.570 Software locks can allow you to protect a variable so no one else can 02:19:53.570 --> 02:19:55.550 look at it until you're done with it. 02:19:55.550 --> 02:19:57.770 There are things called transactions, which 02:19:57.770 --> 02:20:01.100 allow you to do the equivalent of sending a message to, or really locking 02:20:01.100 --> 02:20:04.130 out your roommate from accessing that same variable, too, 02:20:04.130 --> 02:20:06.920 but for slightly less amount of time. 02:20:06.920 --> 02:20:08.880 There are solutions to these problems. 02:20:08.880 --> 02:20:12.650 So for instance, in Python, the same code now in green 02:20:12.650 --> 02:20:14.360 might look a little something like this. 02:20:14.360 --> 02:20:17.150 When you know that something has to happen all at once, 02:20:17.150 --> 02:20:21.020 altogether, you first begin a transaction, and you do your thing, 02:20:21.020 --> 02:20:23.707 and then you commit the transaction at the very end. 02:20:23.707 --> 02:20:25.790 Here, too, though, there's going to be a downside. 02:20:25.790 --> 02:20:29.902 Typically, the more you use transactions in this way, 02:20:29.902 --> 02:20:31.610 potentially the higher the probability is 02:20:31.610 --> 02:20:35.190 that you're going to box someone out or make Carter's request a little slower. 02:20:35.190 --> 02:20:35.690 Why? 02:20:35.690 --> 02:20:37.482 Because we can't interact at the same time. 02:20:37.482 --> 02:20:39.920 Or you might make his request fail if he tries to update 02:20:39.920 --> 02:20:41.870 something that's already been updated. 02:20:41.870 --> 02:20:44.720 So you generally want to have as few lines of code 02:20:44.720 --> 02:20:47.840 together in between these transactions so that you get in and you get out. 02:20:47.840 --> 02:20:50.780 And you go to CVS and you get back really fast so as to not 02:20:50.780 --> 02:20:52.550 cause these kind of performance things. 02:20:52.550 --> 02:20:55.070 So things indeed escalated quickly today. 02:20:55.070 --> 02:20:58.272 The original goal was just to solve problems using a different language 02:20:58.272 --> 02:20:59.480 more effectively than Python. 02:20:59.480 --> 02:21:01.772 But as soon as you have these more powerful techniques, 02:21:01.772 --> 02:21:03.525 a whole new set of problems arises. 02:21:03.525 --> 02:21:05.150 Takes practice to get comfortable with. 02:21:05.150 --> 02:21:09.170 But ultimately, this is all leading us toward the introduction next week 02:21:09.170 --> 02:21:12.260 of web programming with HTML, CSS, and some JavaScript. 02:21:12.260 --> 02:21:15.090 The week after, bringing Python and SQL back into the mix. 02:21:15.090 --> 02:21:16.940 So that by term's end, we've really now used 02:21:16.940 --> 02:21:19.760 all of these different languages for what they're best at. 02:21:19.760 --> 02:21:22.843 And over the next few weeks, the goal is to make sure you're understanding 02:21:22.843 --> 02:21:25.932 and comfortable with what each of these things is good and bad for. 02:21:25.932 --> 02:21:27.140 Let's go ahead and wrap here. 02:21:27.140 --> 02:21:28.473 I'll stick around for questions. 02:21:28.473 --> 02:21:30.310 We'll see you next time. 02:21:30.310 --> 02:22:04.344 [MUSIC PLAYING]