WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.493 [MUSIC PLAYING] 00:00:49.357 --> 00:00:50.440 DAVID J. MALAN: All right. 00:00:50.440 --> 00:00:53.260 This is CS50, and this is week 7. 00:00:53.260 --> 00:00:56.207 And today's focus is going to be entirely on data-- 00:00:56.207 --> 00:00:58.540 the process of collecting it, the process of storing it, 00:00:58.540 --> 00:01:00.610 the process of searching it, and so much more. 00:01:00.610 --> 00:01:03.280 You'll recall that last week we started off by playing around 00:01:03.280 --> 00:01:04.750 with the relatively small data set. 00:01:04.750 --> 00:01:08.630 We asked everyone for what their preferred house at Hogwarts might be. 00:01:08.630 --> 00:01:12.400 And then we proceeded to analyze that data a little bit using some Python 00:01:12.400 --> 00:01:15.730 and counting up how many people wanted Gryffindor or Slytherin or the others, 00:01:15.730 --> 00:01:16.423 as well. 00:01:16.423 --> 00:01:19.090 And we ultimately did that by using a Google form to collect it. 00:01:19.090 --> 00:01:21.923 And we stored all of the data in a Google spreadsheet, which we then 00:01:21.923 --> 00:01:24.350 exported, of course, as a CSV file. 00:01:24.350 --> 00:01:26.740 So this week, we thought we'd collect a little more data 00:01:26.740 --> 00:01:28.780 and see what kinds of problems arise when 00:01:28.780 --> 00:01:32.350 we start using only a spreadsheet or, in turn, a CSV file 00:01:32.350 --> 00:01:34.220 to store the data that we care about. 00:01:34.220 --> 00:01:37.930 So in fact, if you could go ahead and go to this URL here that you see, 00:01:37.930 --> 00:01:41.230 you should see another Google form, this one asking you 00:01:41.230 --> 00:01:42.730 some different questions. 00:01:42.730 --> 00:01:46.540 All of us probably have some preferred TV shows, now more than ever, perhaps. 
00:01:46.540 --> 00:01:49.150 And what we'd like to do is ask everyone to input 00:01:49.150 --> 00:01:53.590 into that form their favorite TV show followed by the genre 00:01:53.590 --> 00:01:58.190 or genres into which that particular TV show falls. 00:01:58.190 --> 00:02:00.680 So go ahead and take a moment to do that. 00:02:00.680 --> 00:02:03.820 And if you're unable to follow along at home, what folks are looking at 00:02:03.820 --> 00:02:07.360 is a form quite like this one here, whereby we're just asking them 00:02:07.360 --> 00:02:11.350 for the title of their preferred TV show and the genre 00:02:11.350 --> 00:02:15.770 or genres of that specific TV show. 00:02:15.770 --> 00:02:16.270 All right. 00:02:16.270 --> 00:02:19.270 So let's go ahead and start to look at some of this data that's come in. 00:02:19.270 --> 00:02:23.043 Here is the resulting Google spreadsheet that Google Forms has created for us. 00:02:23.043 --> 00:02:25.960 And you'll notice that by default, Google Forms, this particular tool, 00:02:25.960 --> 00:02:28.180 has three different columns, at least for this form. 00:02:28.180 --> 00:02:30.070 One is a timestamp, and Google automatically 00:02:30.070 --> 00:02:33.340 gives us that based on what day and time everyone was buzzing in 00:02:33.340 --> 00:02:34.390 with the responses. 00:02:34.390 --> 00:02:38.770 Then they have a header row beyond that for title and genres. 00:02:38.770 --> 00:02:42.100 I've manually boldfaced it in advance just to make it stand out. 00:02:42.100 --> 00:02:45.580 But you'll notice that the headings here, Title and Genres, 00:02:45.580 --> 00:02:48.790 perfectly match the questions that we asked in the Google form. 00:02:48.790 --> 00:02:53.020 That allows us to therefore line up your responses with our questions.
00:02:53.020 --> 00:02:56.410 And you can see here Punisher was the first favorite TV 00:02:56.410 --> 00:03:00.040 show to be inputted followed by The Office, Breaking Bad, New Girl, Archer, 00:03:00.040 --> 00:03:02.270 another Office, and so forth. 00:03:02.270 --> 00:03:04.660 And in the third column, under Genres, you'll 00:03:04.660 --> 00:03:06.520 see that there's something curious here. 00:03:06.520 --> 00:03:08.230 While some of the cells-- 00:03:08.230 --> 00:03:10.330 that is, the little boxes of text-- 00:03:10.330 --> 00:03:12.970 have just single words like "comedy" or "drama," 00:03:12.970 --> 00:03:15.550 you'll notice that some of them have a comma-separated list. 00:03:15.550 --> 00:03:19.150 And that comma-separated list is because some of you checked, as you could, 00:03:19.150 --> 00:03:24.730 multiple check boxes to indicate that Breaking Bad is a crime genre 00:03:24.730 --> 00:03:26.830 drama and also thriller. 00:03:26.830 --> 00:03:31.240 And so the way Google Forms handles this is a bit sleazily in the sense 00:03:31.240 --> 00:03:35.350 that they just drop all of those values as a comma-separated list 00:03:35.350 --> 00:03:37.853 inside of the spreadsheet itself. 00:03:37.853 --> 00:03:40.270 And that's potentially a problem if we ultimately download 00:03:40.270 --> 00:03:43.570 this as a CSV file, comma-separated values, 00:03:43.570 --> 00:03:47.500 because now you have commas in between the commas. 00:03:47.500 --> 00:03:50.480 Fortunately, there's a solution to that that we'll ultimately see. 00:03:50.480 --> 00:03:52.160 So we've got a good amount of data here. 00:03:52.160 --> 00:03:55.300 In fact, if I keep scrolling down, we'll see a few hundred responses now. 00:03:55.300 --> 00:03:58.120 And it would be nice to analyze this data in some way 00:03:58.120 --> 00:04:02.410 and figure out what the most popular TV show is, maybe search for new shows 00:04:02.410 --> 00:04:04.143 I might like via their genre. 
00:04:04.143 --> 00:04:06.310 So you can imagine some number of queries that could 00:04:06.310 --> 00:04:08.620 be answered by way of this data set. 00:04:08.620 --> 00:04:12.340 But let's first consider the limitations of leaving this data 00:04:12.340 --> 00:04:14.980 in just a spreadsheet like this. 00:04:14.980 --> 00:04:17.440 All of us are probably in the habit of using occasionally 00:04:17.440 --> 00:04:22.490 Google Spreadsheets, Apple Numbers, Microsoft Excel, or some other tool. 00:04:22.490 --> 00:04:27.220 So let's consider what spreadsheets are good at and what they are bad at. 00:04:27.220 --> 00:04:30.370 Would anyone like to volunteer an answer to the first of those? 00:04:30.370 --> 00:04:34.030 What is a spreadsheet good at or good for? 00:04:34.030 --> 00:04:35.080 Yeah, Andrew? 00:04:35.080 --> 00:04:36.760 What's your thinking on spreadsheets? 00:04:36.760 --> 00:04:39.797 AUDIENCE: [INAUDIBLE] 00:04:39.797 --> 00:04:41.880 DAVID J. MALAN: OK, very good for quickly sorting. 00:04:41.880 --> 00:04:42.300 I like that. 00:04:42.300 --> 00:04:44.758 I could click on the top of the Title column, for instance, 00:04:44.758 --> 00:04:48.450 and immediately sort all of those titles by alphabetically. 00:04:48.450 --> 00:04:49.140 I like that. 00:04:49.140 --> 00:04:53.370 Other reasons to use a spreadsheet-- what problems do they solve? 00:04:53.370 --> 00:04:55.050 What are they good at? 00:04:55.050 --> 00:04:56.670 Other thoughts on spreadsheets. 00:04:56.670 --> 00:04:58.530 Yeah, how about Peter? 00:04:58.530 --> 00:05:01.947 AUDIENCE: Storing large amounts of data that you can later analyze. 00:05:01.947 --> 00:05:03.780 DAVID J. MALAN: OK, so storing large amounts 00:05:03.780 --> 00:05:05.700 of data that you can later analyze. 00:05:05.700 --> 00:05:09.210 It's kind of a nice model for storing lots of rows of data, so to speak. 00:05:09.210 --> 00:05:11.310 I will say that there actually is a limit. 
00:05:11.310 --> 00:05:13.890 And in fact, back in the day, I learned what this limit is. 00:05:13.890 --> 00:05:16.515 Long story short, in graduate school, I was using a spreadsheet 00:05:16.515 --> 00:05:17.940 to analyze some research data. 00:05:17.940 --> 00:05:23.370 And at one point, I had more data than Excel supported rows for. 00:05:23.370 --> 00:05:28.110 Specifically, I had some 65,536 rows, which 00:05:28.110 --> 00:05:30.360 was too many at that point for Excel at the time, 00:05:30.360 --> 00:05:33.870 because, long story short, if you recall from a spreadsheet program 00:05:33.870 --> 00:05:37.328 like Google Spreadsheets, every row is numbered from 1 on up. 00:05:37.328 --> 00:05:39.120 Well, unfortunately, at the time, Microsoft 00:05:39.120 --> 00:05:43.170 had used a 16-bit integer, 16 bits or 2 bytes, 00:05:43.170 --> 00:05:45.150 to represent each of those numbers. 00:05:45.150 --> 00:05:49.320 And it turns out the 2 to the 16th power is roughly 65,000. 00:05:49.320 --> 00:05:52.000 So at that point, I maxed out the total number of rows. 00:05:52.000 --> 00:05:54.828 Now, to Peter's point, they've increased that in recent years. 00:05:54.828 --> 00:05:56.620 And you can actually store a lot more data. 00:05:56.620 --> 00:05:58.870 So spreadsheets are indeed good at that. 00:05:58.870 --> 00:06:02.580 But they're not necessarily good at everything, because at some point, 00:06:02.580 --> 00:06:05.310 you're going to have more data potentially in a spreadsheet 00:06:05.310 --> 00:06:07.860 than your Mac or PC can handle. 00:06:07.860 --> 00:06:10.860 In fact, if you're actually trying to build an application, whether it's 00:06:10.860 --> 00:06:14.490 Twitter, Instagram, or Facebook or anything of that scale, 00:06:14.490 --> 00:06:17.490 those companies are certainly not storing their data, suffice it to say, 00:06:17.490 --> 00:06:20.700 in a spreadsheet, because there would just be way too much data to use. 
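The row limit mentioned above follows directly from the width of the integer: a 16-bit counter can only represent 2 to the 16th distinct values. A quick sanity check of that arithmetic in Python:

```python
# A 16-bit integer can take on 2**16 distinct values, which is why
# older versions of Excel topped out at 65,536 rows.
MAX_ROWS = 2 ** 16
print(MAX_ROWS)  # 65536
```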
00:06:20.700 --> 00:06:23.050 And no one could literally open it on their computer. 00:06:23.050 --> 00:06:25.830 So we'll need a solution to that problem of scale. 00:06:25.830 --> 00:06:29.950 But I don't think we need to throw out what works well about spreadsheets. 00:06:29.950 --> 00:06:33.510 So you can store indeed a lot of data in row form. 00:06:33.510 --> 00:06:36.930 But it would seem that you can also store a lot of data in column form. 00:06:36.930 --> 00:06:39.567 And even though I'm only showing columns A, B, and C, 00:06:39.567 --> 00:06:41.400 of course, you've probably used spreadsheets 00:06:41.400 --> 00:06:42.570 where you add more columns-- 00:06:42.570 --> 00:06:44.680 D, E, F, and so forth. 00:06:44.680 --> 00:06:48.540 So what's the right mental model for how to think about rows 00:06:48.540 --> 00:06:51.540 versus columns in a spreadsheet? 00:06:51.540 --> 00:06:57.840 I feel like we probably use them in a somewhat different way conceptually. 00:06:57.840 --> 00:07:00.550 We might think about them a little differently. 00:07:00.550 --> 00:07:04.440 What's the difference between rows and columns in a spreadsheet? 00:07:04.440 --> 00:07:06.570 Sofia. 00:07:06.570 --> 00:07:07.890 AUDIENCE: Adding more entries. 00:07:07.890 --> 00:07:09.420 Adding more data is-- 00:07:09.420 --> 00:07:12.720 those are within the rows, but then the actual attributes or characteristics 00:07:12.720 --> 00:07:14.240 of the data should be in columns. 00:07:14.240 --> 00:07:15.240 DAVID J. MALAN: Exactly. 00:07:15.240 --> 00:07:17.220 When you add more data to the spreadsheet, 00:07:17.220 --> 00:07:19.620 you should really be adding to the bottom of it, 00:07:19.620 --> 00:07:21.310 adding more and more rows. 00:07:21.310 --> 00:07:24.300 So these things sort of grow vertically, even though of course that's 00:07:24.300 --> 00:07:25.920 just a human's perception of it. 00:07:25.920 --> 00:07:28.740 They grow from top to bottom by adding more and more rows. 
00:07:28.740 --> 00:07:31.560 But to Sofia's point, your columns represent 00:07:31.560 --> 00:07:37.920 what we might call attributes or fields or any other such characteristic that 00:07:37.920 --> 00:07:40.030 is a type of data that you're storing. 00:07:40.030 --> 00:07:42.930 So in this case of our form, Timestamp is the first column. 00:07:42.930 --> 00:07:44.460 Title is the second column. 00:07:44.460 --> 00:07:45.930 Genres is the third column. 00:07:45.930 --> 00:07:49.980 And those columns can indeed be thought of as fields or attributes, properties 00:07:49.980 --> 00:07:50.697 of your data. 00:07:50.697 --> 00:07:54.030 And those are properties that you should really decide on in advance when you're 00:07:54.030 --> 00:07:56.970 first creating the form, in our case, or when you're manually creating 00:07:56.970 --> 00:07:59.430 the spreadsheet in another case. 00:07:59.430 --> 00:08:01.320 You should not really be in the habit, when 00:08:01.320 --> 00:08:05.430 using spreadsheets, of adding data from left 00:08:05.430 --> 00:08:08.370 to right, adding more and more columns, unless you 00:08:08.370 --> 00:08:11.740 decide to collect more types of data. 00:08:11.740 --> 00:08:15.873 So just because someone adds a new favorite TV show to your data set, 00:08:15.873 --> 00:08:18.540 you shouldn't be adding that from left to right in a new column. 00:08:18.540 --> 00:08:21.040 You should indeed be adding it from top to bottom. 00:08:21.040 --> 00:08:24.780 But suppose that we actually decided to collect more information from everyone. 00:08:24.780 --> 00:08:28.650 Maybe that form had instead asked you for your name or your email address 00:08:28.650 --> 00:08:30.120 or any other questions. 00:08:30.120 --> 00:08:34.480 Those properties or attributes or fields would belong as new columns. 00:08:34.480 --> 00:08:38.309 So this is to say we generally decide on the layout of our data, 00:08:38.309 --> 00:08:41.130 the schema of our data, in advance. 
00:08:41.130 --> 00:08:45.420 And then from there on out, we proceed to add, add, add more rows, not 00:08:45.420 --> 00:08:47.670 columns, unless we change our mind and need to change 00:08:47.670 --> 00:08:50.230 the schema of our particular data. 00:08:50.230 --> 00:08:53.850 So it turns out that spreadsheets are indeed wonderfully useful, 00:08:53.850 --> 00:08:56.700 to Peter's point, for large or reasonably large 00:08:56.700 --> 00:08:59.610 data sets that we might collect. 00:08:59.610 --> 00:09:04.500 And we can, of course, per last week, export those data sets as CSV files. 00:09:04.500 --> 00:09:07.290 And so we can go from a spreadsheet to a simple text 00:09:07.290 --> 00:09:11.370 file stored in ASCII or Unicode, more generally, on your own hard drive 00:09:11.370 --> 00:09:12.660 or somewhere in the cloud. 00:09:12.660 --> 00:09:16.350 And you can actually think of that file, that .CSV file, 00:09:16.350 --> 00:09:19.540 as what we might call a flat-file database. 00:09:19.540 --> 00:09:22.980 A database is, generally speaking, a file that stores data. 00:09:22.980 --> 00:09:25.997 Or it's a program that stores data for you. 00:09:25.997 --> 00:09:29.080 And all of us have probably thought about or used databases in some sense. 00:09:29.080 --> 00:09:31.560 You're probably familiar with the fact that all 00:09:31.560 --> 00:09:35.310 of those same big websites, Google and Twitter and Facebook and others, use 00:09:35.310 --> 00:09:37.017 databases to store our data. 00:09:37.017 --> 00:09:38.850 Well, those databases are either just really 00:09:38.850 --> 00:09:42.120 big files containing lots of data or special programs 00:09:42.120 --> 00:09:44.130 that are storing our data for us. 00:09:44.130 --> 00:09:46.350 And a flat file is just referring to the fact 00:09:46.350 --> 00:09:48.580 that it really is a very simple design. 
00:09:48.580 --> 00:09:51.510 In fact, years ago, decades ago, humans decided 00:09:51.510 --> 00:09:54.780 when storing data in simple text files that if you 00:09:54.780 --> 00:09:57.540 want to store different types of data, like, to Sofia's point, 00:09:57.540 --> 00:10:00.340 different properties or attributes, well, let's keep it simple. 00:10:00.340 --> 00:10:03.780 Let's just separate those columns with commas 00:10:03.780 --> 00:10:06.450 in our flat-file database, a.k.a. 00:10:06.450 --> 00:10:07.118 a CSV. 00:10:07.118 --> 00:10:08.160 You can use other things. 00:10:08.160 --> 00:10:09.430 You can use tabs. 00:10:09.430 --> 00:10:12.570 There's things called TSVs, for Tab-Separated Values. 00:10:12.570 --> 00:10:14.760 And frankly, you can use anything you want. 00:10:14.760 --> 00:10:16.050 But there is a corner case. 00:10:16.050 --> 00:10:17.980 And we've already seen a preview of it. 00:10:17.980 --> 00:10:21.190 What if your actual data has a comma in it? 00:10:21.190 --> 00:10:23.820 What if the title of your favorite TV show has a comma? 00:10:23.820 --> 00:10:27.660 What if Google is presuming to store genres as a comma-separated list? 00:10:27.660 --> 00:10:32.340 Bad things can happen if using a CSV as your flat-file database. 00:10:32.340 --> 00:10:33.760 But there are solutions to that. 00:10:33.760 --> 00:10:35.580 And in fact, what the world typically does 00:10:35.580 --> 00:10:39.640 is whenever you have commas inside of your CSV file, 00:10:39.640 --> 00:10:42.300 you just make sure that the whole string is double 00:10:42.300 --> 00:10:44.460 quoted on the far left and far right. 00:10:44.460 --> 00:10:46.860 And anything inside of double quotes is not 00:10:46.860 --> 00:10:50.790 mistaken thereafter as delineating a column 00:10:50.790 --> 00:10:53.220 as the other commas in the file might. 00:10:53.220 --> 00:10:55.590 So that's all that's meant by a flat-file database. 
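Python's csv module applies exactly this quoting convention for you. A minimal sketch, with a made-up show and genre list for illustration, writing a row whose genre field itself contains commas and then reading it back:

```python
import csv
import io

# Write one row whose second field itself contains commas.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Breaking Bad", "Crime, Drama, Thriller"])

# The genre field gets wrapped in double quotes, so its commas
# aren't mistaken for column delimiters.
line = buffer.getvalue().strip()
print(line)  # Breaking Bad,"Crime, Drama, Thriller"

# Reading it back splits only on the one unquoted comma.
row = next(csv.reader(io.StringIO(line)))
print(row)   # ['Breaking Bad', 'Crime, Drama, Thriller']
```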
00:10:55.590 --> 00:10:58.860 And CSV is perhaps one of the most common, the most common, formats 00:10:58.860 --> 00:11:01.240 thereof, if only because all of these programs, 00:11:01.240 --> 00:11:03.420 like Google Spreadsheets and Excel and Numbers, 00:11:03.420 --> 00:11:07.137 allow you to save your files as CSVs. 00:11:07.137 --> 00:11:08.970 Now, long story short, those of you who have 00:11:08.970 --> 00:11:12.570 used fancier features of spreadsheets like built-in functions and formulas 00:11:12.570 --> 00:11:14.850 and those kinds of things, those are built in 00:11:14.850 --> 00:11:19.120 and proprietary to Google Spreadsheets and Excel and Numbers. 00:11:19.120 --> 00:11:24.900 You cannot use formulas in a CSV file or a TSV file or in a flat-file database, 00:11:24.900 --> 00:11:25.870 more generally. 00:11:25.870 --> 00:11:27.990 You can only store static-- 00:11:27.990 --> 00:11:30.090 that is, unchanging-- values. 00:11:30.090 --> 00:11:33.490 So when you export the data, what you see is what you get. 00:11:33.490 --> 00:11:35.242 And that's why people use fancier programs 00:11:35.242 --> 00:11:37.200 like Excel and Numbers and Google Spreadsheets, 00:11:37.200 --> 00:11:38.658 because you get more functionality. 00:11:38.658 --> 00:11:41.100 But if you want to export the data, you can only 00:11:41.100 --> 00:11:44.190 get indeed the raw textual data out of it. 00:11:44.190 --> 00:11:45.690 But I daresay that's going to be OK. 00:11:45.690 --> 00:11:47.398 In fact, Brian, do you mind if I go ahead 00:11:47.398 --> 00:11:50.160 and download this spreadsheet as a CSV file now? 00:11:50.160 --> 00:11:51.510 BRIAN YU: Yep, go ahead. 00:11:51.510 --> 00:11:51.810 DAVID J. MALAN: All right. 00:11:51.810 --> 00:11:54.890 I'm going to go ahead in Google Spreadsheets and go to File, Download. 00:11:54.890 --> 00:11:56.640 And you can see a whole bunch of options-- 00:11:56.640 --> 00:12:01.850 PDF, Web Page, Comma-Separated Values, which is the one I want. 
00:12:01.850 --> 00:12:04.320 So I'm going to indeed go ahead and choose CSV 00:12:04.320 --> 00:12:06.510 from this dropdown in spreadsheets. 00:12:06.510 --> 00:12:08.410 That, of course, downloaded that file for me. 00:12:08.410 --> 00:12:11.077 And now I'm going to go ahead and go into our familiar CS50 IDE. 00:12:11.077 --> 00:12:14.600 You'll recall that last week I was able to upload a file into the IDE. 00:12:14.600 --> 00:12:17.350 And I'm going to go ahead and do the same here this week, as well. 00:12:17.350 --> 00:12:20.730 I'm going to go ahead and grab my file, which ended up in my Downloads 00:12:20.730 --> 00:12:22.830 folder on my particular computer here. 00:12:22.830 --> 00:12:27.840 And I'm going to go ahead and drag and drop this into the IDE 00:12:27.840 --> 00:12:31.790 such that it ends up in my home directory, so to speak. 00:12:31.790 --> 00:12:34.410 So now I have this file, Favorite TV Shows Forms. 00:12:34.410 --> 00:12:36.750 And in fact, if I double click this within the IDE, 00:12:36.750 --> 00:12:38.880 you'll see familiar data now. 00:12:38.880 --> 00:12:42.950 Timestamp comma title comma genres is our header row 00:12:42.950 --> 00:12:46.830 that contains the names of the properties or attributes in this file. 00:12:46.830 --> 00:12:51.390 Then we've got our timestamps comma favorite title comma and then 00:12:51.390 --> 00:12:53.310 a comma-separated list of genres. 00:12:53.310 --> 00:12:56.100 And here indeed, notice that Google took care 00:12:56.100 --> 00:13:00.030 to use double quotes around any values that themselves had commas. 00:13:00.030 --> 00:13:02.130 So it's a relatively simple file format. 00:13:02.130 --> 00:13:04.560 And I could certainly just kind of skim through this, 00:13:04.560 --> 00:13:07.920 figuring out who likes The Office, who likes Breaking Bad, or other shows. 
00:13:07.920 --> 00:13:11.040 But per last week, we now have a pretty useful programming language 00:13:11.040 --> 00:13:14.220 at our disposal, Python, that could allow us to start manipulating 00:13:14.220 --> 00:13:16.860 and analyzing this data more readily. 00:13:16.860 --> 00:13:20.100 And here to my point last week about using the right tool for the job, 00:13:20.100 --> 00:13:24.860 you could absolutely do everything we're about to do in all weeks prior of CS50. 00:13:24.860 --> 00:13:27.720 We could have used C for what we're about to do. 00:13:27.720 --> 00:13:31.350 But as you can probably glean, C tends to be painful for certain things, 00:13:31.350 --> 00:13:34.290 like anything involving string manipulation, 00:13:34.290 --> 00:13:36.660 changing strings, analyzing strings. 00:13:36.660 --> 00:13:38.290 It's just a real pain, right? 00:13:38.290 --> 00:13:42.330 God forbid you had to take this CSV file and load it all into memory, not 00:13:42.330 --> 00:13:43.470 unlike your spell checker. 00:13:43.470 --> 00:13:46.950 You would have to be using malloc all over the place or realloc or the like. 00:13:46.950 --> 00:13:50.640 There's just a lot of heavy lifting involved in just analyzing a text file. 00:13:50.640 --> 00:13:53.760 So Python does all of that for us by just giving us 00:13:53.760 --> 00:13:56.130 more functions at our disposal with which 00:13:56.130 --> 00:13:59.470 to start analyzing and opening data. 00:13:59.470 --> 00:14:01.570 So let me go ahead and close this file. 00:14:01.570 --> 00:14:05.082 And let me go ahead and create a new one called favorites.py, 00:14:05.082 --> 00:14:07.290 wherein I'm going to start playing with this data set 00:14:07.290 --> 00:14:09.900 and see if we can't start answering some questions about it. 
00:14:09.900 --> 00:14:12.570 And frankly, to this day, 20-plus years after learning how 00:14:12.570 --> 00:14:14.670 to program for the first time, I myself am 00:14:14.670 --> 00:14:18.000 very much in the habit when writing a new program of just starting simple 00:14:18.000 --> 00:14:22.320 and not solving the problem I ultimately want to but something simpler just 00:14:22.320 --> 00:14:24.270 as a sort of proof of concept to make sure 00:14:24.270 --> 00:14:26.510 I have the right plumbing in place. 00:14:26.510 --> 00:14:27.510 So by that, I mean this. 00:14:27.510 --> 00:14:32.550 Let's go ahead and write a quick program that simply opens up this file, the CSV 00:14:32.550 --> 00:14:37.120 file, iterates over it top to bottom, and just prints out each of the titles, 00:14:37.120 --> 00:14:39.430 just as a quick sanity check that I know what I'm doing 00:14:39.430 --> 00:14:41.460 and I have access to the data therein. 00:14:41.460 --> 00:14:43.740 So let me go ahead and import CSV. 00:14:43.740 --> 00:14:45.840 And then I can do this in a few different ways. 00:14:45.840 --> 00:14:48.030 But by now, you've probably seen or remembered 00:14:48.030 --> 00:14:50.490 my using something like the open command and the 00:14:50.490 --> 00:14:55.260 with keyword to open and eventually automatically close this file for me. 00:14:55.260 --> 00:14:59.710 This file is called Favorite TV Shows - Form Responses 1.csv. 00:15:02.400 --> 00:15:04.560 And I'm going to open this up in read mode. 00:15:04.560 --> 00:15:07.000 Strictly speaking, the r is not required. 00:15:07.000 --> 00:15:09.330 You might see examples online not including it. 00:15:09.330 --> 00:15:13.140 That's because read is the default. But for parity with C and fopen, 00:15:13.140 --> 00:15:15.900 I'm going to be explicit and actually do "r." 00:15:15.900 --> 00:15:18.670 And I'm going to go ahead and give this a variable name of file. 
00:15:18.670 --> 00:15:23.820 So this line 3 here has the effect of opening that CSV file in read-only mode 00:15:23.820 --> 00:15:27.532 and creating a variable called file via which I can reference it. 00:15:27.532 --> 00:15:30.240 Now I'm going to go ahead and use some of that CSV functionality. 00:15:30.240 --> 00:15:32.790 I'm going to give myself what we keep calling a reader, which 00:15:32.790 --> 00:15:34.650 I could call it xyz, anything else. 00:15:34.650 --> 00:15:37.740 But "reader" kind of describes what this variable is going to do. 00:15:37.740 --> 00:15:42.930 And it's going to be the return value of calling csv.reader on that file. 00:15:42.930 --> 00:15:46.740 And so essentially, the CSV library, per last week, 00:15:46.740 --> 00:15:48.360 has a lot of fancy features built in. 00:15:48.360 --> 00:15:52.470 And all it needs as input is an already opened text file. 00:15:52.470 --> 00:15:55.120 And then it will then wrap that file, so to speak, 00:15:55.120 --> 00:15:57.270 with a whole bunch of more useful functionality, 00:15:57.270 --> 00:16:01.750 like the ability to read it column and row at a time. 00:16:01.750 --> 00:16:02.250 All right. 00:16:02.250 --> 00:16:05.170 Now I'm going to go ahead and, you know what, just for now, 00:16:05.170 --> 00:16:08.190 I'm going to skip the first row. 00:16:08.190 --> 00:16:11.310 I'm going to skip the first row, because the first row has my headings-- 00:16:11.310 --> 00:16:13.530 Timestamp, Title, and Genres. 00:16:13.530 --> 00:16:17.692 And I know what my columns are, so I'm just going to ignore that line for now. 00:16:17.692 --> 00:16:18.900 And now I'm going to do this. 00:16:18.900 --> 00:16:24.570 For row in reader, let me go ahead and print out, quite simply, row. 00:16:24.570 --> 00:16:28.890 And I only want title, so I think if it's three columns from left to right, 00:16:28.890 --> 00:16:30.330 it's 0, 1, 2. 
00:16:30.330 --> 00:16:33.480 So I want to print out column bracket 1, which 00:16:33.480 --> 00:16:35.680 is going to be the second column zero indexed. 00:16:35.680 --> 00:16:36.180 All right. 00:16:36.180 --> 00:16:39.240 Let me go ahead and save that, go down to my terminal window, 00:16:39.240 --> 00:16:42.850 and run python of favorites.py and cross my fingers. 00:16:42.850 --> 00:16:43.500 OK. 00:16:43.500 --> 00:16:44.920 Voila. 00:16:44.920 --> 00:16:46.740 It flew by super fast. 00:16:46.740 --> 00:16:49.530 But it looks like, indeed, these are all of the TV 00:16:49.530 --> 00:16:51.150 shows that folks have inputted. 00:16:51.150 --> 00:16:53.370 Indeed, there's a few hundred if I keep scrolling up. 00:16:53.370 --> 00:16:55.740 So it looks like my program is working. 00:16:55.740 --> 00:16:57.850 But let's improve it just a little bit. 00:16:57.850 --> 00:17:02.490 It turns out that using the csv.reader isn't necessarily 00:17:02.490 --> 00:17:04.050 the best approach in Python. 00:17:04.050 --> 00:17:07.589 Many of you have already discovered a DictReader, a dictionary reader, 00:17:07.589 --> 00:17:10.740 which is nice, because then you don't have to know or keep double checking 00:17:10.740 --> 00:17:13.230 what number column your data is in. 00:17:13.230 --> 00:17:17.520 You can instead refer it to by the header itself, so by "title" 00:17:17.520 --> 00:17:18.660 or by "genres." 00:17:18.660 --> 00:17:21.052 This is also good, because if you or maybe a colleague 00:17:21.052 --> 00:17:23.010 are sort of messing around with the spreadsheet 00:17:23.010 --> 00:17:26.339 and they rearrange the columns by dragging them left or right, 00:17:26.339 --> 00:17:30.120 any numbers you have used in your code, 0, 1, 2 on up, 00:17:30.120 --> 00:17:34.390 could suddenly be incorrect if your colleague has reordered those columns. 
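The first version of favorites.py just walked through can be sketched as follows. To keep the sketch self-contained, a couple of made-up sample rows stand in for the real spreadsheet of responses:

```python
import csv

# For illustration, create a tiny stand-in for the downloaded file
# (the real one has a few hundred rows).
with open("Favorite TV Shows - Form Responses 1.csv", "w") as file:
    file.write("Timestamp,Title,Genres\n")
    file.write("10/19/2020,The Office,Comedy\n")
    file.write('10/19/2020,Breaking Bad,"Crime, Drama, Thriller"\n')

# favorites.py, version 1: open the file in read mode, skip the
# header row, and print the second (zero-indexed: 1) column.
with open("Favorite TV Shows - Form Responses 1.csv", "r") as file:
    reader = csv.reader(file)
    next(reader)        # skip the header row (Timestamp, Title, Genres)
    for row in reader:
        print(row[1])   # column 1 is the show's title
```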
00:17:34.390 --> 00:17:37.590 So using a dictionary reader tends to be a little more robust, because it 00:17:37.590 --> 00:17:40.480 uses the titles, not the mere numbers. 00:17:40.480 --> 00:17:43.230 It's still fallible if someone, yourself or someone else, 00:17:43.230 --> 00:17:47.978 changes the values in that very first row and renames titles or genres. 00:17:47.978 --> 00:17:49.270 Then things are going to break. 00:17:49.270 --> 00:17:51.270 But at that point, we kind of have to blame you 00:17:51.270 --> 00:17:53.730 for not having kept track of your code versus your data. 00:17:53.730 --> 00:17:55.020 But still a risk. 00:17:55.020 --> 00:17:58.445 So I'm going to change this to dictionary reader or DictReader here. 00:17:58.445 --> 00:18:00.570 And pretty much the rest of my code can be the same 00:18:00.570 --> 00:18:02.970 except I don't need this hack here on line 5. 00:18:02.970 --> 00:18:06.750 I don't need to just skip over to the next row from the get-go, 00:18:06.750 --> 00:18:10.890 because I now want the dictionary reader to handle the process of reading 00:18:10.890 --> 00:18:11.985 that first row for me. 00:18:11.985 --> 00:18:13.860 But otherwise, everything else stays the same 00:18:13.860 --> 00:18:15.693 except for this last line, where now I think 00:18:15.693 --> 00:18:21.300 I can now use row as a dictionary, not as a list per se, 00:18:21.300 --> 00:18:24.880 and print out specifically the title from each given row. 00:18:24.880 --> 00:18:27.690 So let me go ahead and run python of favorites.py again. 00:18:27.690 --> 00:18:31.480 And voila, it looks like I got the same result, several hundred of them. 00:18:31.480 --> 00:18:34.260 But let me stipulate that it's doing the same thing if we actually 00:18:34.260 --> 00:18:36.490 compared both of those side-by-side. 00:18:36.490 --> 00:18:36.990 All right. 
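The DictReader version of favorites.py can be sketched the same way, again with made-up sample rows standing in for the real responses. Note that the keys must match the header row's spellings exactly:

```python
import csv

# Illustrative stand-in for the downloaded spreadsheet.
with open("Favorite TV Shows - Form Responses 1.csv", "w") as file:
    file.write("Timestamp,Title,Genres\n")
    file.write("10/19/2020,The Office,Comedy\n")
    file.write('10/19/2020,Breaking Bad,"Crime, Drama, Thriller"\n')

# favorites.py, version 2: DictReader consumes the header row itself,
# so there's no next(reader) hack, and columns are referred to by name.
with open("Favorite TV Shows - Form Responses 1.csv", "r") as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row["Title"])   # key spelling comes from the header row
```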
00:18:36.990 --> 00:18:39.180 Before I forge ahead now to actually augment this 00:18:39.180 --> 00:18:44.610 with new functionality, any questions or confusion on this Python script 00:18:44.610 --> 00:18:49.530 we just wrote to open a file, wrap it with a reader or DictReader, 00:18:49.530 --> 00:18:54.510 and then iterate over the rows one at a time, printing the titles? 00:18:54.510 --> 00:18:56.510 Any questions, confusion on syntax at all? 00:18:56.510 --> 00:18:57.010 It's OK. 00:18:57.010 --> 00:18:59.370 We've only known or seen Python for a week. 00:18:59.370 --> 00:19:01.380 It's fine if it's still quite new. 00:19:01.380 --> 00:19:04.115 Anything, Brian, we should address? 00:19:04.115 --> 00:19:04.740 BRIAN YU: Yeah. 00:19:04.740 --> 00:19:08.800 So why is it that you don't need to close the file using the syntax 00:19:08.800 --> 00:19:10.190 that you're using right here? 00:19:10.190 --> 00:19:11.732 DAVID J. MALAN: Really good question. 00:19:11.732 --> 00:19:15.250 Last week, I more pedantically used open on its own. 00:19:15.250 --> 00:19:19.210 And then I later used a close function that was associated with the file 00:19:19.210 --> 00:19:20.470 that I had just opened. 00:19:20.470 --> 00:19:23.800 Now, the more Pythonic way to do things, if you will, 00:19:23.800 --> 00:19:27.370 is actually to use this with keyword, which didn't exist in C. 00:19:27.370 --> 00:19:29.830 And it just tends to be a useful feature in Python 00:19:29.830 --> 00:19:35.470 whereby if you say with open, dot dot dot, it will open the file for you. 00:19:35.470 --> 00:19:39.280 Then it will remain open so long as your code is indented inside 00:19:39.280 --> 00:19:41.410 of that with keywords block. 00:19:41.410 --> 00:19:43.780 And as soon as you get to the end of your program, 00:19:43.780 --> 00:19:45.732 it will automatically be closed for you. 
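The two styles just contrasted, manual close versus the with keyword, might be sketched like this (the filename is just a throwaway for illustration):

```python
# Style 1: open and close manually, as last week.
file = open("data.csv", "w")
file.write("hello, world\n")
file.close()             # easy to forget!

# Style 2: "with" closes the file automatically when the
# indented block ends.
with open("data.csv", "r") as file:
    print(file.closed)   # False: still open inside the block

print(file.closed)       # True: closed automatically on exit
```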
00:19:45.732 --> 00:19:48.190 So this is one of these features where Python in some sense 00:19:48.190 --> 00:19:50.770 is trying to protect us from ourselves. 00:19:50.770 --> 00:19:52.900 It's probably pretty common for humans, myself 00:19:52.900 --> 00:19:55.000 included, to forget to close your file. 00:19:55.000 --> 00:19:57.580 That can create problems with saving things permanently. 00:19:57.580 --> 00:19:59.990 It can create memory leaks, as we know from C. 00:19:59.990 --> 00:20:02.740 So the with keyword just assumes that I'm not going to be an idiot 00:20:02.740 --> 00:20:04.150 and forget to close the file. 00:20:04.150 --> 00:20:08.050 Python is going to do it for me automatically. 00:20:08.050 --> 00:20:10.870 Other questions or confusions, Brian? 00:20:10.870 --> 00:20:13.690 BRIAN YU: How does DictReader know that Title 00:20:13.690 --> 00:20:16.270 is the name of the key inside of the dictionary? 00:20:16.270 --> 00:20:18.020 DAVID J. MALAN: Really good question, too. 00:20:18.020 --> 00:20:22.090 So it is designed by the authors of the Python language 00:20:22.090 --> 00:20:25.450 to look at the very first row in the file, 00:20:25.450 --> 00:20:29.380 split it on the commas in that very first row, 00:20:29.380 --> 00:20:34.090 and just assume that the first word or phrase before the first comma 00:20:34.090 --> 00:20:37.270 is the name of the first column, that the second word 00:20:37.270 --> 00:20:42.470 or phrase after the first comma is the name of the second column, 00:20:42.470 --> 00:20:43.310 and so forth. 00:20:43.310 --> 00:20:47.500 So a DictReader just presumes, as is the convention with CSVs, 00:20:47.500 --> 00:20:51.280 that your first row is going to contain the headings that you 00:20:51.280 --> 00:20:53.290 want to use to refer to those columns. 
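[A minimal sketch of the auto-closing behavior just described, using a throwaway temporary file rather than the real favorites.csv:]

```python
import os
import tempfile

# A scratch file to open and close; the path is throwaway.
path = os.path.join(tempfile.mkdtemp(), "favorites.csv")

with open(path, "w") as file:
    file.write("Title,Genres\n")

# The indented block has ended, so Python has already closed the file.
print(file.closed)  # True
```

[Forgetting a matching close call is impossible here: the close happens as soon as control leaves the with block, even if an exception is raised inside it.]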
00:20:53.290 --> 00:20:56.860 If your CSV happens not to have such a heading whereby it just 00:20:56.860 --> 00:20:59.050 jumps right in on the first row to real data, 00:20:59.050 --> 00:21:02.140 then you're not going to be able to use a DictReader correctly, at least 00:21:02.140 --> 00:21:04.670 not without some manual configuration. 00:21:04.670 --> 00:21:05.170 All right. 00:21:05.170 --> 00:21:06.840 So let's go ahead and-- 00:21:06.840 --> 00:21:08.590 now I feel like there's a whole mess here. 00:21:08.590 --> 00:21:10.562 And some of these shows are pretty popular. 00:21:10.562 --> 00:21:13.270 And as I'm glancing over this, I definitely see some duplication. 00:21:13.270 --> 00:21:15.010 A whole bunch of you like The Office. 00:21:15.010 --> 00:21:17.530 A whole bunch of you like Breaking Bad, Game of Thrones, 00:21:17.530 --> 00:21:19.280 and a whole bunch of other shows, as well. 00:21:19.280 --> 00:21:21.250 So it would be nicer, I think, if we kind of 00:21:21.250 --> 00:21:25.480 narrow the scope of our look at this data by just looking at unique values. 00:21:26.807 --> 00:21:29.140 So rather than just iterate over the file top to bottom, 00:21:29.140 --> 00:21:31.690 printing out one title after another, why 00:21:31.690 --> 00:21:34.330 don't we go ahead and sort of accumulate all of this data 00:21:34.330 --> 00:21:38.800 in some kind of data structure so that we can throw away duplicate values 00:21:38.800 --> 00:21:42.910 and then only print out the unique titles that we've accumulated? 00:21:42.910 --> 00:21:44.630 So I bet we can do this in a few ways. 00:21:44.630 --> 00:21:47.650 But if we think back to last week's demonstration of our dictionary, 00:21:47.650 --> 00:21:50.388 you'll recall that I used what was called a set. 
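[For the headerless case just mentioned, the "manual configuration" can be csv.DictReader's fieldnames parameter, which supplies the column names the file lacks. The data and names below are hypothetical:]

```python
import csv
import io

# No header row here: the real data starts immediately.
data = "The Office,Comedy\nBreaking Bad,Drama"

# fieldnames tells DictReader what to call each column.
reader = csv.DictReader(io.StringIO(data), fieldnames=["Title", "Genres"])
rows = list(reader)
print(rows[0]["Title"], rows[1]["Genres"])
```

[With fieldnames supplied, DictReader treats every row, including the first, as data rather than consuming the first row as headings.]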
00:21:50.388 --> 00:21:52.930 And I'm going to go ahead and create a variable called titles 00:21:52.930 --> 00:21:55.180 and set it equal to something called set. 00:21:55.180 --> 00:21:57.310 And a set is just a collection of values. 00:21:57.310 --> 00:21:58.540 It's kind of like a list. 00:21:58.540 --> 00:22:00.580 But it eliminates duplicates for me. 00:22:00.580 --> 00:22:02.860 And that would seem to be exactly the characteristic 00:22:02.860 --> 00:22:04.870 that I want for this program. 00:22:04.870 --> 00:22:08.560 Now, instead of printing each title, which is now premature 00:22:08.560 --> 00:22:10.480 if I want to first filter out duplicates, 00:22:10.480 --> 00:22:11.990 I'm going to go ahead and do this. 00:22:11.990 --> 00:22:17.290 I'm going to go ahead and add to the titles set using the add function 00:22:17.290 --> 00:22:19.570 the current row's title. 00:22:19.570 --> 00:22:21.250 So again, I'm not printing it now. 00:22:21.250 --> 00:22:25.510 I'm instead adding to the title set that particular title. 00:22:25.510 --> 00:22:27.190 And if it's there already, no big deal. 00:22:27.190 --> 00:22:29.260 The set data structure in Python is going 00:22:29.260 --> 00:22:31.000 to throw away the duplicates for me. 00:22:31.000 --> 00:22:33.310 And it's only going to go ahead and keep the uniques. 00:22:33.310 --> 00:22:37.330 Now, at the bottom of my file, I need to do a little more work, admittedly. 00:22:37.330 --> 00:22:40.990 Now I have to iterate over the set to print out only those unique titles. 00:22:40.990 --> 00:22:41.740 So let me do this. 00:22:41.740 --> 00:22:46.555 For title in titles, go ahead and print out title. 00:22:46.555 --> 00:22:49.180 And this is where Python just gets really user-friendly, right? 00:22:49.180 --> 00:22:53.050 You don't have to do int i get 0, i less than n, or whatever. 00:22:53.050 --> 00:22:55.540 You can just say for title in titles. 
00:22:55.540 --> 00:22:59.200 And if the titles variable is the type of data structure 00:22:59.200 --> 00:23:04.690 that you can iterate over, which it will be if it's a list or if it's a set 00:23:04.690 --> 00:23:06.940 or even if it's a dictionary, another data structure 00:23:06.940 --> 00:23:11.180 we saw last week in Python, the for loop in Python will just know what to do. 00:23:11.180 --> 00:23:15.880 This will loop over all of the titles in the titles set. 00:23:15.880 --> 00:23:18.700 So let me go ahead and save this file and go ahead now 00:23:18.700 --> 00:23:20.920 and run python of favorites.py. 00:23:20.920 --> 00:23:25.240 And it looks like, yeah, the list is different in some way. 00:23:25.240 --> 00:23:29.463 But I'm seeing fewer results as I scroll up, definitely fewer than before, 00:23:29.463 --> 00:23:31.630 because my scrollbar didn't jump nearly as far down. 00:23:31.630 --> 00:23:33.260 But honestly, this is kind of a mess. 00:23:33.260 --> 00:23:34.660 Let's go ahead and sort this. 00:23:34.660 --> 00:23:37.502 Now, in C, it would have been kind of a pain to sort things. 00:23:37.502 --> 00:23:39.460 We'd have to whip out the pseudocode, probably, 00:23:39.460 --> 00:23:41.460 for bubble sort, selection sort, or, god forbid, 00:23:41.460 --> 00:23:43.270 merge sort and then implement it ourselves. 00:23:43.270 --> 00:23:47.210 But no, with Python comes, really, the proverbial kitchen sink of functions. 00:23:47.210 --> 00:23:49.510 So if you want to sort this set, you know what? 00:23:49.510 --> 00:23:50.950 Just say you want it sorted. 00:23:50.950 --> 00:23:53.890 There is a function in Python called sorted 00:23:53.890 --> 00:23:57.257 that will use one of those better algorithms-- maybe it's merge sort. 00:23:57.257 --> 00:23:58.840 Maybe it's something called quicksort. 00:23:58.840 --> 00:24:00.350 Maybe it's something else altogether. 00:24:00.350 --> 00:24:02.380 It's not going to use a big O of n squared sort. 
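[The set-based deduplication described above, sketched with a few hypothetical rows in place of the real CSV:]

```python
import csv
import io

# Hypothetical stand-in for favorites.csv, with a deliberate duplicate.
data = "Title\nThe Office\nBreaking Bad\nThe Office"

titles = set()
for row in csv.DictReader(io.StringIO(data)):
    titles.add(row["Title"])  # a set silently discards duplicate values

for title in titles:  # a set's iteration order is unspecified
    print(title)
```

[Three rows go in, but only two unique titles come out; adding an element that's already present is a no-op, so no explicit duplicate check is needed.]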
00:24:02.380 --> 00:24:06.940 Someone at Python probably has spent the time implementing a better sort for us. 00:24:06.940 --> 00:24:08.817 But it will go ahead and sort the set for me. 00:24:08.817 --> 00:24:10.400 Now let me go ahead and do this again. 00:24:10.400 --> 00:24:13.360 Let me increase the size of my terminal window and rerun python 00:24:13.360 --> 00:24:15.070 of favorites.py. 00:24:15.070 --> 00:24:15.640 OK. 00:24:15.640 --> 00:24:19.330 And now we have an interesting assortment 00:24:19.330 --> 00:24:22.570 of shows that's easier for me to wrap my mind around, 00:24:22.570 --> 00:24:25.743 because I have it now sorted here. 00:24:25.743 --> 00:24:28.660 And indeed, if I scroll all the way up, we should see all of the shows 00:24:28.660 --> 00:24:32.257 beginning with numbers or a period, which 00:24:32.257 --> 00:24:34.090 might have just been someone playing around, 00:24:34.090 --> 00:24:36.290 followed by the A words, the B words, and so forth. 00:24:36.290 --> 00:24:38.707 So now it's a little easier to wrap our minds around this. 00:24:38.707 --> 00:24:39.760 But something's up. 00:24:39.760 --> 00:24:44.110 I feel like a lot of you like Avatar: The Last Airbender. 00:24:44.110 --> 00:24:47.980 And yet I'm seeing it, indeed, four different times. 00:24:47.980 --> 00:24:49.720 But I thought we were filtering this down 00:24:49.720 --> 00:24:53.350 to uniques by using that set structure. 00:24:53.350 --> 00:24:54.340 So what's going on? 00:24:54.340 --> 00:24:56.200 And in fact, if I keep scrolling, I'm pretty 00:24:56.200 --> 00:24:59.080 sure I saw more duplicates in here. 00:24:59.080 --> 00:25:01.840 BoJack Horseman, Breaking Bad, Breaking Bad, 00:25:01.840 --> 00:25:07.810 Brooklyn Nine-Nine, Brooklyn Nine-Nine, CS50 in several different flavors. 00:25:07.810 --> 00:25:10.210 And yes, it keeps going. 00:25:10.210 --> 00:25:11.110 Friends. 00:25:11.110 --> 00:25:13.050 So I see a lot of duplicate values. 
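[The sorted call works roughly like this: it returns a new, alphabetized list and leaves the set itself untouched:]

```python
titles = {"The Office", "Breaking Bad", "Avatar"}

for title in sorted(titles):  # sorted() hands back a new list, in order
    print(title)
```

[In CPython, sorted uses Timsort, a hybrid of merge sort and insertion sort, so it indeed runs in O(n log n), not the O(n²) of bubble or selection sort.]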
00:25:13.050 --> 00:25:14.980 So what's going on? 00:25:14.980 --> 00:25:17.960 Yeah, [? Gadana? ?] 00:25:17.960 --> 00:25:22.480 AUDIENCE: Yeah, so your current sort is case insensitive-- sorry, 00:25:22.480 --> 00:25:26.680 is case sensitive, meaning that if someone spells avatar with capital 00:25:26.680 --> 00:25:30.800 A's in some places, then it's going to be a different result each time. 00:25:30.800 --> 00:25:32.050 DAVID J. MALAN: Yeah, exactly. 00:25:32.050 --> 00:25:35.530 Some of you weren't quite diligent when it came to capitalization. 00:25:35.530 --> 00:25:38.048 And so in fact, the reality is, as [? Gadana ?] notes, 00:25:38.048 --> 00:25:39.840 that there's differences in capitalization. 00:25:39.840 --> 00:25:41.090 Now, we've addressed this before. 00:25:41.090 --> 00:25:43.230 In fact, when you implemented your spell checker, 00:25:43.230 --> 00:25:44.980 you had to deal with this already when you 00:25:44.980 --> 00:25:46.780 were spell checking an arbitrary text. 00:25:46.780 --> 00:25:48.160 Some words might be capitalized. 00:25:48.160 --> 00:25:50.300 Some might be all lowercase, all uppercase. 00:25:50.300 --> 00:25:52.810 And you wanted to tolerate different casings. 00:25:52.810 --> 00:25:55.840 And so we probably solved this by just forcing everything 00:25:55.840 --> 00:25:58.540 to uppercase or everything to lowercase and doing things, 00:25:58.540 --> 00:26:00.500 therefore, case insensitively. 00:26:00.500 --> 00:26:01.750 So give me just a moment here. 00:26:01.750 --> 00:26:06.007 And I'm going to go ahead and make a quick change to my form here. 00:26:06.007 --> 00:26:07.840 Let's go ahead and change this in such a way 00:26:07.840 --> 00:26:11.020 that we actually force everything to uppercase or lowercase. 00:26:11.020 --> 00:26:13.760 Doesn't really matter which, but we need to canonicalize things, 00:26:13.760 --> 00:26:14.860 so to speak, in some way. 
00:26:14.860 --> 00:26:18.790 And to canonicalize things just means to format all of your data 00:26:18.790 --> 00:26:20.020 in some standard way. 00:26:20.020 --> 00:26:22.390 So to [? Gadana's ?] point, let's just standardize 00:26:22.390 --> 00:26:23.920 the capitalization of things. 00:26:23.920 --> 00:26:25.600 Maybe all uppercase, all lowercase. 00:26:25.600 --> 00:26:27.260 We just need to make a judgment call. 00:26:27.260 --> 00:26:29.427 So I'm going to go ahead and make a few tweaks here. 00:26:29.427 --> 00:26:30.670 I'm still going to use a set. 00:26:30.670 --> 00:26:33.190 I'm still going to read the CSV as before. 00:26:33.190 --> 00:26:37.270 But instead of just adding the title with row bracket title, 00:26:37.270 --> 00:26:40.180 I'm going to go ahead and force it to uppercase, just 00:26:40.180 --> 00:26:42.850 arbitrarily, just for the sake of uniformity. 00:26:42.850 --> 00:26:45.610 And then let's go ahead and check what exactly has happened here. 00:26:45.610 --> 00:26:47.050 I'm not going to change anything else. 00:26:47.050 --> 00:26:49.383 But let me go ahead and increase the size of my terminal 00:26:49.383 --> 00:26:52.600 window, rerun python of favorites.py. 00:26:52.600 --> 00:26:53.517 And voila. 00:26:53.517 --> 00:26:55.600 It's a little harder to read, just because I'm not 00:26:55.600 --> 00:26:56.770 used to reading all caps. 00:26:56.770 --> 00:26:58.687 Kind of looks like we're yelling at ourselves. 00:26:58.687 --> 00:27:01.600 But I don't see-- wait a minute. 00:27:01.600 --> 00:27:05.560 I still see The Office over here twice. 00:27:05.560 --> 00:27:11.920 If I keep scrolling here, so far, I see Stranger Things and Strainger Things. 00:27:11.920 --> 00:27:13.570 That just looks like a typo. 00:27:13.570 --> 00:27:15.680 I see two Sherlocks, though. 00:27:15.680 --> 00:27:17.380 This is a little suspicious. 00:27:17.380 --> 00:27:21.730 So [? Gadana, ?] you and I don't seem to have solved things fully. 
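[Forcing everything to one case before adding it to the set, as just described, canonicalizes away differences in capitalization. The responses below are hypothetical:]

```python
# Three spellings of the same show, plus one other response.
responses = ["The Office", "the office", "THE OFFICE", "Breaking Bad"]

titles = set()
for response in responses:
    titles.add(response.upper())  # canonicalize case before comparing

print(sorted(titles))
```

[Uppercase versus lowercase is a judgment call; what matters is that every response passes through the same transformation, so equal shows compare equal.]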
00:27:21.730 --> 00:27:24.220 And this one's a little more subtle. 00:27:24.220 --> 00:27:30.970 What more should I perhaps do to my data to ensure we get duplicates removed? 00:27:30.970 --> 00:27:32.200 Olivia? 00:27:32.200 --> 00:27:34.407 AUDIENCE: Maybe trim around the edges. 00:27:34.407 --> 00:27:35.990 DAVID J. MALAN: Trim around the edges. 00:27:35.990 --> 00:27:37.480 I like the sound of that, but what do you mean? 00:27:37.480 --> 00:27:38.320 What does that do? 00:27:38.320 --> 00:27:40.862 AUDIENCE: Oh, like, trim off the extra spaces in case someone 00:27:40.862 --> 00:27:42.940 put a space before or after the words. 00:27:42.940 --> 00:27:44.230 DAVID J. MALAN: Yeah, exactly. 00:27:44.230 --> 00:27:47.140 It's pretty common for humans, intentionally or accidentally, 00:27:47.140 --> 00:27:48.920 to hit the Space bar where they shouldn't. 00:27:48.920 --> 00:27:51.790 And in fact, I'm kind of inferring that I bet one or more 00:27:51.790 --> 00:27:55.143 of you accidentally typed Sherlock, space, and then decided, 00:27:55.143 --> 00:27:55.810 nope, that's it. 00:27:55.810 --> 00:27:57.040 I'm not typing anything else. 00:27:57.040 --> 00:28:00.740 But that space, even though we can't quite see it obviously, is there. 00:28:00.740 --> 00:28:04.060 And when we do a string comparison or when the set data structure does that, 00:28:04.060 --> 00:28:08.110 it's actually going to be noticed when doing those comparisons. 00:28:08.110 --> 00:28:10.092 And therefore they're not going to be the same. 00:28:10.092 --> 00:28:11.800 So I can do this in a few different ways. 00:28:11.800 --> 00:28:15.080 But it turns out, in Python, you can chain functions together, 00:28:15.080 --> 00:28:17.390 which is also, too, kind of a fancy feature. 00:28:17.390 --> 00:28:18.650 Notice what I'm doing here. 00:28:18.650 --> 00:28:21.070 I'm still accessing the titles set. 00:28:21.070 --> 00:28:23.710 I'm adding the following value to it. 
00:28:23.710 --> 00:28:27.610 I'm adding the value row bracket title, but not quite. 00:28:27.610 --> 00:28:30.880 That is a string or an str, in Python speak. 00:28:30.880 --> 00:28:33.130 I'm going to go ahead and strip it, which 00:28:33.130 --> 00:28:36.340 means if we look up the documentation for this function, to Olivia's point, 00:28:36.340 --> 00:28:39.130 it's going to strip off or trim all of the white space 00:28:39.130 --> 00:28:41.200 to the left, all of the white space to the right, 00:28:41.200 --> 00:28:43.840 whether that's the Space bar or the Enter key 00:28:43.840 --> 00:28:46.870 or the Tab character or a few other things, as well. 00:28:46.870 --> 00:28:50.200 It's just going to get rid of leading and trailing white space. 00:28:50.200 --> 00:28:53.650 And then whatever's left over, I'm going to go ahead and force everything 00:28:53.650 --> 00:28:56.810 to uppercase in the spirit of [? Gadana's ?] suggestion, too. 00:28:56.810 --> 00:29:00.470 So we're sort of combining two good ideas now to really massage the data, 00:29:00.470 --> 00:29:02.470 if you will, into a cleaner format. 00:29:02.470 --> 00:29:04.780 And this is such a real-world reality. 00:29:04.780 --> 00:29:09.588 Humans, you and I, cannot be trusted to input data the way we are supposed to. 00:29:09.588 --> 00:29:11.380 Sometimes it's all lowercase, because we're 00:29:11.380 --> 00:29:13.463 being a little lazy or a little social media-like, 00:29:13.463 --> 00:29:16.120 even if we're checking out from Amazon and trying 00:29:16.120 --> 00:29:18.310 to input a valid postal address. 00:29:18.310 --> 00:29:22.127 Sometimes it's all capitals, because I can think of a few people in my life 00:29:22.127 --> 00:29:24.460 who don't quite understand the Caps Lock thing just yet. 00:29:24.460 --> 00:29:26.710 And so things might be all capitalized instead. 
00:29:26.710 --> 00:29:30.580 This is not good for computer systems that require precision, 00:29:30.580 --> 00:29:32.440 to our emphasis in week 0. 00:29:32.440 --> 00:29:35.140 And so massaging data means cleaning it up, 00:29:35.140 --> 00:29:38.500 doing some mutations that don't really change the meaning of the data 00:29:38.500 --> 00:29:41.740 but canonicalize it, standardize it, so that you're 00:29:41.740 --> 00:29:44.950 comparing apples and apples, so to speak, not apples and oranges. 00:29:44.950 --> 00:29:47.950 Well, let me go ahead and run this again in my bigger terminal 00:29:47.950 --> 00:29:50.140 window, python of favorites.py. 00:29:50.140 --> 00:29:50.710 Voila. 00:29:50.710 --> 00:29:55.220 In scrolling up, up, up, I think we're in a better place. 00:29:55.220 --> 00:29:57.520 I only see one Office now. 00:29:57.520 --> 00:30:01.510 And if I keep scrolling up and up and up, I'm seeing typos still, 00:30:01.510 --> 00:30:03.910 but nothing related to white space. 00:30:03.910 --> 00:30:08.340 And I think we have a much cleaner unique list of titles at this point. 00:30:08.340 --> 00:30:10.800 Of course, if we scroll up, I would have to be 00:30:10.800 --> 00:30:14.970 a lot more clever if I want to detect things like typographical errors. 00:30:14.970 --> 00:30:19.870 It looks like one of you was very diligent about putting F.R.I. 00:30:19.870 --> 00:30:22.970 and so forth but then got bored at the end and left off the last period. 00:30:22.970 --> 00:30:25.470 But that's going to happen when you're taking in user input. 00:30:25.470 --> 00:30:28.140 We've, of course, got all these variants of CS50. 
00:30:28.140 --> 00:30:30.570 That's going to be a mess to clean up, because now you 00:30:30.570 --> 00:30:35.130 can imagine having to add a whole bunch of if conditions and elses and else ifs 00:30:35.130 --> 00:30:38.160 to clean all of that up if we do want to canonicalize 00:30:38.160 --> 00:30:41.920 all different flavors of CS50 as, quote unquote, "CS50." 00:30:41.920 --> 00:30:43.890 So this is a very slippery slope. 00:30:43.890 --> 00:30:47.010 You and I could start writing a huge amount of code just to clean this up. 00:30:47.010 --> 00:30:50.940 But that's the reality when dealing with real-world data. 00:30:50.940 --> 00:30:55.140 Well, let's go ahead now and improve this program further, 00:30:55.140 --> 00:30:57.810 do something a little fancier, because I now 00:30:57.810 --> 00:31:00.090 can trust that my data has been canonicalized 00:31:00.090 --> 00:31:03.900 except for the actual typos or the weird variants of CS50 and the like. 00:31:03.900 --> 00:31:07.470 Let's go ahead and figure out what's the most popular favorite TV 00:31:07.470 --> 00:31:10.510 show among the audience here. 00:31:10.510 --> 00:31:12.300 So I'm going to start where I have before, 00:31:12.300 --> 00:31:14.133 with my current code, because I think I have 00:31:14.133 --> 00:31:16.143 most of the building blocks in place. 00:31:16.143 --> 00:31:18.810 I'm going to go ahead and clean up my code a little bit in here. 00:31:18.810 --> 00:31:22.050 I'm going to go ahead and give myself a separate variable now called title 00:31:22.050 --> 00:31:26.040 just so that I can think about things in a little more orderly fashion. 00:31:26.040 --> 00:31:29.200 But I'm not going to start adding things to this set anymore. 00:31:29.200 --> 00:31:32.220 In fact, a set, I don't think, is really going 00:31:32.220 --> 00:31:35.880 to be sufficient to keep track of the popularity of TV shows, 00:31:35.880 --> 00:31:38.820 because by definition, the set is throwing away duplicates. 
00:31:38.820 --> 00:31:40.680 But the goal now is kind of the opposite. 00:31:40.680 --> 00:31:45.240 I want to know which are the duplicates so that I can tell you 00:31:45.240 --> 00:31:46.860 that this many people like The Office. 00:31:46.860 --> 00:31:50.530 This many people like Breaking Bad and the like. 00:31:50.530 --> 00:31:56.010 So what tools do we have in Python's toolkit via which we could accumulate 00:31:56.010 --> 00:31:59.320 or figure out that information? 00:31:59.320 --> 00:32:02.740 Any thoughts on what data structure might help us here 00:32:02.740 --> 00:32:07.870 if we want to figure out show, popularity, show, popularity? 00:32:07.870 --> 00:32:11.950 And by popularity, I just mean the frequency of it in the CSV file. 00:32:11.950 --> 00:32:13.720 Santiago? 00:32:13.720 --> 00:32:17.110 AUDIENCE: I guess one option could be to use dictionaries 00:32:17.110 --> 00:32:20.410 so that you can have The Office, I don't know, 00:32:20.410 --> 00:32:23.110 20 votes, and then Game of Thrones, another one, 00:32:23.110 --> 00:32:27.023 so that a dictionary could really help you visualize that. 00:32:27.023 --> 00:32:28.690 DAVID J. MALAN: Yeah, perfect instincts. 00:32:28.690 --> 00:32:31.660 Recall that a dictionary, at the end of the day, no matter how 00:32:31.660 --> 00:32:34.450 sophisticated it's implemented underneath the hood, 00:32:34.450 --> 00:32:35.680 like your spell checker-- 00:32:35.680 --> 00:32:38.240 It's just a collection of key value pairs. 00:32:38.240 --> 00:32:42.790 And indeed, it's maybe one of the most useful data structures in any language, 00:32:42.790 --> 00:32:45.820 because this ability to associate one piece of data with another 00:32:45.820 --> 00:32:49.150 is just a very general purpose solution to problems. 
00:32:49.150 --> 00:32:51.730 And indeed, to Santiago's point, if the problem at hand 00:32:51.730 --> 00:32:53.650 is to figure out the popularity of shows, 00:32:53.650 --> 00:32:58.510 well, let's make the keys the titles of our shows and the frequencies thereof-- 00:32:58.510 --> 00:32:59.830 the votes, so to speak-- 00:32:59.830 --> 00:33:01.810 the values of those keys. 00:33:01.810 --> 00:33:06.450 We're going to map title to votes, title to vote, title to vote, and so forth. 00:33:06.450 --> 00:33:08.145 So a dictionary is exactly that. 00:33:08.145 --> 00:33:09.520 So let me go ahead and scroll up. 00:33:09.520 --> 00:33:10.978 And I can make a little tweak here. 00:33:10.978 --> 00:33:14.260 Instead of a set, I can instead say dict and give myself 00:33:14.260 --> 00:33:15.598 just an empty dictionary. 00:33:15.598 --> 00:33:18.640 There's actually shorthand notation for that that's a little more common. 00:33:18.640 --> 00:33:20.830 So you use two empty curly braces. 00:33:20.830 --> 00:33:22.810 That just means the exact same thing. 00:33:22.810 --> 00:33:25.270 Give me a dictionary that's initially empty. 00:33:25.270 --> 00:33:27.400 There's no fancy shortcut for a set. 00:33:27.400 --> 00:33:30.370 You have to literally type out S-E-T, open paren and closed paren. 00:33:30.370 --> 00:33:34.220 But dictionaries are so common, so popular, so powerful, 00:33:34.220 --> 00:33:38.350 they have this little syntactic shortcut of just two curly braces, 00:33:38.350 --> 00:33:39.560 open and closed. 00:33:39.560 --> 00:33:42.700 So now that I have that, let me go ahead and do this. 00:33:42.700 --> 00:33:45.580 Inside of my for loop, instead of printing 00:33:45.580 --> 00:33:48.880 the title, which I don't want to do, and instead of adding it to the set, 00:33:48.880 --> 00:33:50.770 I now want to add it to the dictionary. 00:33:50.770 --> 00:33:51.860 So how do I do that? 
00:33:51.860 --> 00:33:55.480 Well, if my dictionary is called titles, I think I can essentially do something 00:33:55.480 --> 00:34:02.710 like this, titles bracket title = or maybe += 1. 00:34:02.710 --> 00:34:07.120 Maybe I can kind of use the dictionary as just a little cheat sheet 00:34:07.120 --> 00:34:12.050 of counts, numbers, that start at 0 and then just add 1, at 2, add 3. 00:34:12.050 --> 00:34:17.860 So every time I see The Office, The Office, The Office, do += 1, += 1. 00:34:17.860 --> 00:34:20.199 We can't do ++, because that's not a thing in Python. 00:34:20.199 --> 00:34:24.580 It only exists in C. But this would seem to go into the dictionary called 00:34:24.580 --> 00:34:29.260 titles, look up the key that matches this specific title, 00:34:29.260 --> 00:34:34.340 and then increment whatever value is there by 1. 00:34:34.340 --> 00:34:37.380 But I'm going to go ahead and run this a little naively here. 00:34:37.380 --> 00:34:40.280 Let me go ahead and run python of favorites.py. 00:34:40.280 --> 00:34:43.400 And wow, it broke already on line 9. 00:34:43.400 --> 00:34:47.389 So sort of an apt choice of show to begin with, 00:34:47.389 --> 00:34:49.530 we have a key error with Punisher. 00:34:49.530 --> 00:34:50.917 So Punisher is bad. 00:34:50.917 --> 00:34:52.250 Something bad has just happened. 00:34:52.250 --> 00:34:53.250 But what does that mean? 00:34:53.250 --> 00:34:55.429 A key error is referring to the fact that I 00:34:55.429 --> 00:34:59.407 tried to access an invalid key in a dictionary. 00:34:59.407 --> 00:35:01.490 This is saying that literally in this line of code 00:35:01.490 --> 00:35:04.610 here, even though titles is a dictionary and even 00:35:04.610 --> 00:35:07.130 though the value of title, singular, is, quote 00:35:07.130 --> 00:35:09.560 unquote, "PUNISHER," I'm getting a key error, 00:35:09.560 --> 00:35:13.230 because that title does not yet exist. 
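[The KeyError just described can be reproduced in isolation: the += has to read the key's current value before it can add to it, and that read is what fails:]

```python
titles = {}  # an empty dictionary: no keys at all yet

try:
    titles["PUNISHER"] += 1  # reads titles["PUNISHER"] first, which fails
except KeyError as error:
    print("KeyError:", error)
```

[Assigning a brand-new key, as in titles["PUNISHER"] = 1, is always fine; it's only reading a key that doesn't exist yet that raises a KeyError.]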
00:35:13.230 --> 00:35:17.060 So even if you're not sure of the Python syntax for fixing this problem, 00:35:17.060 --> 00:35:21.530 what's the intuitive solution here? 00:35:21.530 --> 00:35:25.610 I cannot increment the frequency of the Punisher, 00:35:25.610 --> 00:35:28.130 because Punisher is not in the dictionary. 00:35:28.130 --> 00:35:29.986 It almost feels like a catch-22. 00:35:29.986 --> 00:35:31.890 [? Greg? ?] 00:35:31.890 --> 00:35:35.900 AUDIENCE: I think that you need, first of all, to create a for loop 00:35:35.900 --> 00:35:40.520 and maybe assign a value to everything in the dictionary. 00:35:40.520 --> 00:35:43.683 For example, a value 0, and then add 1. 00:35:43.683 --> 00:35:45.350 DAVID J. MALAN: Yeah, so good instincts. 00:35:45.350 --> 00:35:46.730 And here, I can use another metaphor. 00:35:46.730 --> 00:35:49.147 I worry we might have a chicken and the egg problem there, 00:35:49.147 --> 00:35:51.470 because I don't think I can go to the top of my code, 00:35:51.470 --> 00:35:56.420 add a loop that initializes all of the values in the dictionary to 0, 00:35:56.420 --> 00:36:01.130 because I would need to know all of the names of the shows at that point. 00:36:01.130 --> 00:36:02.180 Now, that's fine. 00:36:02.180 --> 00:36:05.330 I think I could take you maybe more literally, [? Greg, ?] 00:36:05.330 --> 00:36:09.630 and open up the CSV file, iterate over it top to bottom, 00:36:09.630 --> 00:36:12.920 and, any time I see a title, just initialize it 00:36:12.920 --> 00:36:16.220 in the dictionary as having a value of 0, 0, 0. 00:36:16.220 --> 00:36:20.280 Then have another for loop, maybe reopen the file, and do the same. 00:36:20.280 --> 00:36:21.380 And that would work. 00:36:21.380 --> 00:36:23.540 But it's arguably not very efficient. 00:36:23.540 --> 00:36:26.330 It is asymptotically, in terms of big O. But that would 00:36:26.330 --> 00:36:28.220 seem to be doing twice as much work. 
00:36:28.220 --> 00:36:31.820 Iterate over the file once just to initialize everything to 0. 00:36:31.820 --> 00:36:35.330 Then iterate over the file a second time just to increment the counts. 00:36:35.330 --> 00:36:38.360 I think we can do things a little more efficiently. 00:36:38.360 --> 00:36:41.090 I think we can achieve not only correctness but better design. 00:36:41.090 --> 00:36:45.560 Any thoughts on how we can still solve this problem without having 00:36:45.560 --> 00:36:48.290 to iterate over the whole thing twice? 00:36:48.290 --> 00:36:50.360 Yeah, [? Semowit? ?] 00:36:50.360 --> 00:36:53.450 AUDIENCE: I think we can add in an if statement 00:36:53.450 --> 00:36:55.970 to check if that key is in the dictionary. 00:36:55.970 --> 00:36:59.865 And if it's not, then add it and then go ahead and increment the value after. 00:36:59.865 --> 00:37:00.740 DAVID J. MALAN: Nice. 00:37:00.740 --> 00:37:02.460 And we can do exactly that. 00:37:02.460 --> 00:37:04.310 So let's just apply that intuition. 00:37:04.310 --> 00:37:08.583 If the problem is that I'm trying to access a key that does not yet exist, 00:37:08.583 --> 00:37:10.500 well, let's just be a little smarter about it. 00:37:10.500 --> 00:37:14.090 And to [? Semowit's ?] point, let's check whether the key exists. 00:37:14.090 --> 00:37:16.020 And if it does, then increment it. 00:37:16.020 --> 00:37:19.340 But if it does not, then and only then, [? to Greg's ?] advice, 00:37:19.340 --> 00:37:20.730 initialize it to 0. 00:37:20.730 --> 00:37:21.570 So let me do that. 00:37:21.570 --> 00:37:24.980 Let me go ahead and say if title in titles, 00:37:24.980 --> 00:37:28.520 which is the very Pythonic, beautiful way of asking a question 00:37:28.520 --> 00:37:30.500 like that, way cleaner than in C-- 00:37:30.500 --> 00:37:35.390 let me go ahead, then, and say exactly the line from before. 
00:37:35.390 --> 00:37:40.280 Else, though, if that title is not yet in the dictionary called titles, 00:37:40.280 --> 00:37:41.720 well, that's OK, too. 00:37:41.720 --> 00:37:47.150 I can go ahead and say titles bracket title = 0. 00:37:47.150 --> 00:37:51.950 So the difference here is that I can certainly index 00:37:51.950 --> 00:37:57.740 into a dictionary using a key that doesn't exist if I plan at that moment 00:37:57.740 --> 00:37:58.730 to give it a value. 00:37:58.730 --> 00:38:02.030 That's OK, and that has always been OK since last week. 00:38:02.030 --> 00:38:07.490 But, however, if I want to go ahead and increment the value that's there, 00:38:07.490 --> 00:38:11.630 I'm going to go ahead and do that in this separate line. 00:38:11.630 --> 00:38:13.850 But I did introduce a bug. 00:38:13.850 --> 00:38:15.770 I did introduce a bug here. 00:38:15.770 --> 00:38:19.220 I think I need to go one step further logically. 00:38:19.220 --> 00:38:24.480 I don't think I want to initialize this to 0 per se. 00:38:24.480 --> 00:38:29.090 Does anyone see a subtle bug in my logic here? 00:38:29.090 --> 00:38:32.570 If the title is already in the dictionary, I'm incrementing it by 1. 00:38:32.570 --> 00:38:37.000 Otherwise, I'm initializing it to 0. 00:38:37.000 --> 00:38:38.500 Any subtle catches here? 00:38:38.500 --> 00:38:40.690 Yeah, Olivia, what do you see? 00:38:40.690 --> 00:38:44.700 AUDIENCE: I think you should initialize it to 1, since it's the first instance. 00:38:44.700 --> 00:38:45.700 DAVID J. MALAN: Exactly. 00:38:45.700 --> 00:38:46.870 I should initialize it to 1. 00:38:46.870 --> 00:38:50.137 Otherwise, I'm accidentally overlooking this particular title, 00:38:50.137 --> 00:38:51.970 and I'm going to go ahead and undercount it. 00:38:51.970 --> 00:38:54.100 So I can fix this either by doing this. 00:38:54.100 --> 00:38:57.820 Or frankly, if you prefer, I don't technically need to use an if else. 
00:38:57.820 --> 00:39:00.730 I can use just an if by doing something like this instead. 00:39:00.730 --> 00:39:04.900 I could say if title not in titles, then I could go ahead 00:39:04.900 --> 00:39:07.420 and say titles bracket title gets 0. 00:39:07.420 --> 00:39:12.270 And then after that, I can blindly, so to speak, just do this. 00:39:12.270 --> 00:39:13.480 So which one is better? 00:39:13.480 --> 00:39:15.460 I think the second one is maybe a little better 00:39:15.460 --> 00:39:17.420 in that I'm saving one line of code. 00:39:17.420 --> 00:39:19.270 But it's ensuring with that if condition, 00:39:19.270 --> 00:39:24.190 to [? Semowit's ?] advice, that I'm not indexing into the titles dictionary 00:39:24.190 --> 00:39:27.050 until I'm sure that the title is in there. 00:39:27.050 --> 00:39:31.570 So let me go ahead and run this now, python of favorites.py, Enter. 00:39:31.570 --> 00:39:34.090 And OK, it didn't crash, so that's good. 00:39:34.090 --> 00:39:36.400 But I'm not yet seeing any useful information. 00:39:36.400 --> 00:39:38.740 But I now have access to a bit more. 00:39:38.740 --> 00:39:41.680 Let me scroll down now to the bottom of this program, where 00:39:41.680 --> 00:39:43.360 I have now this loop. 00:39:43.360 --> 00:39:45.460 Let me go ahead and print out not just the title 00:39:45.460 --> 00:39:49.870 but the value of that key in the dictionary by just indexing 00:39:49.870 --> 00:39:50.500 into it here. 00:39:50.500 --> 00:39:52.030 And you might not have seen this syntax before. 00:39:52.030 --> 00:39:54.758 But with print, you can actually pass in multiple arguments. 00:39:54.758 --> 00:39:57.550 And by default, print will just separate them with a space for you. 00:39:57.550 --> 00:39:59.860 You can override that behavior and separate them with anything. 00:39:59.860 --> 00:40:02.777 But this is just meant to be a quick and dirty program that prints out 00:40:02.777 --> 00:40:04.820 titles and now the popularity thereof. 
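The single-pass counting pattern just described might look like this in isolation. As a sketch, the CSV is simulated in memory (the file name and the Title column are from the lecture's form; the rows here are hypothetical) so the example is self-contained:

```python
import csv
import io

# In-memory stand-in for "Favorite TV Shows - Form Responses 1.csv",
# with a few hypothetical responses
data = io.StringIO(
    "Timestamp,Title,Genres\n"
    "t1,The Office,Comedy\n"
    "t2,Friends,Comedy\n"
    "t3,The Office,Comedy\n"
)

titles = {}
for row in csv.DictReader(data):
    title = row["Title"]
    # Initialize the count only the first time we see this title...
    if title not in titles:
        titles[title] = 0
    # ...then increment unconditionally -- one pass over the file, no if-else
    titles[title] += 1

# print takes multiple arguments and separates them with a space by default
for title in titles:
    print(title, titles[title])
```

Running this prints each title next to its count, e.g. `The Office 2`.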
00:40:04.820 --> 00:40:07.210 So let me run this again, python of favorites.py. 00:40:07.210 --> 00:40:08.470 And voila. 00:40:08.470 --> 00:40:12.040 It's kind of all over the place. 00:40:12.040 --> 00:40:14.800 Office, super popular with 26 votes there. 00:40:14.800 --> 00:40:18.220 A lot of single votes here. 00:40:18.220 --> 00:40:19.810 Big Bang Theory has nine. 00:40:19.810 --> 00:40:21.340 You know, this is all nice and good. 00:40:21.340 --> 00:40:24.548 But I feel like this is going to take me forever to wrap my mind around which 00:40:24.548 --> 00:40:25.990 are the most popular shows. 00:40:25.990 --> 00:40:27.560 So of course, how would we do this? 00:40:27.560 --> 00:40:30.250 Well, to the point made earlier, with spreadsheets, my god, 00:40:30.250 --> 00:40:33.100 in Microsoft Excel or Google Spreadsheets or Apple Numbers, 00:40:33.100 --> 00:40:35.590 you just click the column heading and boom, sorted. 00:40:35.590 --> 00:40:38.450 We seem to have lost that capability unless we now do it in code. 00:40:38.450 --> 00:40:40.450 So let me do that for us. 00:40:40.450 --> 00:40:42.550 Let me go ahead and go back to my code. 00:40:42.550 --> 00:40:48.520 And it looks like sorted, even though it does work on dictionaries, 00:40:48.520 --> 00:40:52.340 is actually sorting by key, not by value. 00:40:52.340 --> 00:40:55.030 And here's where our Python programming techniques need 00:40:55.030 --> 00:40:56.530 to get a little more sophisticated. 00:40:56.530 --> 00:40:58.572 And we want to introduce another feature here now 00:40:58.572 --> 00:41:01.660 of Python which is going to solve this problem specifically 00:41:01.660 --> 00:41:03.680 but in a pretty general way. 00:41:03.680 --> 00:41:06.310 So if we read the documentation for sorted, 00:41:06.310 --> 00:41:11.320 the sorted function indeed sorts sets by the values therein. 00:41:11.320 --> 00:41:13.840 It sorts lists by the values therein. 
00:41:13.840 --> 00:41:17.140 It sorts dictionaries by the keys therein, 00:41:17.140 --> 00:41:20.960 because dictionaries have two pieces of information for every element. 00:41:20.960 --> 00:41:23.450 It has a key and a value, not just a value. 00:41:23.450 --> 00:41:25.390 So by default, sorted sorts by key. 00:41:25.390 --> 00:41:28.150 So we somehow have to override that behavior. 00:41:28.150 --> 00:41:29.390 So how can we do this? 00:41:29.390 --> 00:41:31.840 Well, it turns out that the sorted function 00:41:31.840 --> 00:41:35.890 takes another optional argument literally called key. 00:41:35.890 --> 00:41:41.570 And the key argument takes as its value the name of a function. 00:41:41.570 --> 00:41:43.690 And this is where things get really interesting, 00:41:43.690 --> 00:41:45.370 if not confusing, really quickly. 00:41:45.370 --> 00:41:50.620 It turns out, in Python, you can pass around functions as arguments 00:41:50.620 --> 00:41:51.790 by way of their name. 00:41:51.790 --> 00:41:56.080 And technically, you can do this in C. It's a lot more syntactically involved. 00:41:56.080 --> 00:41:57.777 But in Python, it's very common. 00:41:57.777 --> 00:41:59.110 In JavaScript, it's very common. 00:41:59.110 --> 00:42:01.930 In a lot of languages, it's very common to think of functions 00:42:01.930 --> 00:42:06.040 as first-class objects, which is a fancy way of saying you can pass them around 00:42:06.040 --> 00:42:08.440 just like they are variables themselves. 00:42:08.440 --> 00:42:09.730 We're not calling them yet. 00:42:09.730 --> 00:42:11.720 But you can pass them around by their name. 00:42:11.720 --> 00:42:13.310 So what do I mean by this? 00:42:13.310 --> 00:42:18.940 Well, I need a function now to sort my dictionary by its value. 
00:42:18.940 --> 00:42:22.900 And only I know how to do this, perhaps, so let me go ahead and give myself 00:42:22.900 --> 00:42:25.990 a generic function name just for the moment called f-- f for function, 00:42:25.990 --> 00:42:26.907 kind of like in math-- 00:42:26.907 --> 00:42:28.907 because we're going to get rid of it eventually. 00:42:28.907 --> 00:42:31.120 But let me go ahead and temporarily define a function 00:42:31.120 --> 00:42:34.150 called f that takes as input a title. 00:42:34.150 --> 00:42:39.140 And then it returns for me the value corresponding to that key. 00:42:39.140 --> 00:42:43.060 So I'm going to go ahead and return titles bracket title. 00:42:43.060 --> 00:42:47.480 So here, we have a function whose purpose in life is super simple. 00:42:47.480 --> 00:42:48.700 You give it a title. 00:42:48.700 --> 00:42:52.990 It gives you the count thereof, the frequency, the popularity thereof, 00:42:52.990 --> 00:42:56.080 by just looking it up in that global dictionary. 00:42:56.080 --> 00:42:59.830 So it's super simple, but that's its only purpose in life. 00:42:59.830 --> 00:43:03.400 But now, according to the documentation for sorted, 00:43:03.400 --> 00:43:06.730 what it's now going to do, because I'm passing in a second argument called 00:43:06.730 --> 00:43:12.250 key, the sorted function, rather than just presume you want everything sorted 00:43:12.250 --> 00:43:15.280 alphabetically by key, it's instead going 00:43:15.280 --> 00:43:22.420 to call that function f on every one of the elements in your dictionary. 00:43:22.420 --> 00:43:25.960 And depending on your answer, the return value 00:43:25.960 --> 00:43:30.760 you give with that f function, that will be used instead 00:43:30.760 --> 00:43:34.060 to determine the actual ordering. 00:43:34.060 --> 00:43:36.900 So by default, sorted just looks at key. 
00:43:36.900 --> 00:43:39.750 What I'm effectively doing with this f function 00:43:39.750 --> 00:43:44.160 is instead returning the value corresponding to every key. 00:43:44.160 --> 00:43:48.360 And so the logical implication of this, even though the syntax is a little new, 00:43:48.360 --> 00:43:51.690 is that this dictionary of titles will now 00:43:51.690 --> 00:43:55.140 be sorted by value instead of by key. 00:43:55.140 --> 00:43:57.460 Because again, by default, it sorts by key. 00:43:57.460 --> 00:44:01.740 But if I define my own key function and override that behavior 00:44:01.740 --> 00:44:05.370 to return the corresponding value, it's the values, the numbers, 00:44:05.370 --> 00:44:08.750 the counts that will actually be used to sort this thing. 00:44:08.750 --> 00:44:09.250 All right. 00:44:09.250 --> 00:44:11.333 Let's go ahead and see if that's true in practice. 00:44:11.333 --> 00:44:13.440 Let me go ahead and rerun python of favorites.py. 00:44:13.440 --> 00:44:14.790 I should see all the titles. 00:44:14.790 --> 00:44:17.520 And voila, conveniently, the most popular show 00:44:17.520 --> 00:44:22.170 seems to be Game of Thrones with 33 votes, followed by Friends with 27, 00:44:22.170 --> 00:44:25.000 followed by The Office with 26, and so forth. 00:44:25.000 --> 00:44:27.060 But of course, the list is kind of backwards. 00:44:27.060 --> 00:44:29.680 I mean, it's convenient that I can see it at the bottom of my screen. 00:44:29.680 --> 00:44:32.530 But really, if we're making a list, it should really be at the top. 00:44:32.530 --> 00:44:34.170 So how can we override that behavior? 00:44:34.170 --> 00:44:36.840 Turns out the sorted function, if you read its documentation, 00:44:36.840 --> 00:44:41.020 also takes another optional parameter called reverse. 
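The key-function idea might be sketched like this, using the vote totals mentioned in the lecture (33, 27, 26) as a small hypothetical dictionary:

```python
# Hypothetical vote counts standing in for the full survey results
titles = {"Friends": 27, "The Office": 26, "Game of Thrones": 33}

# A tiny function whose only purpose is to map a key to its value
def f(title):
    return titles[title]

# By default, sorted(titles) would order the keys alphabetically.
# Passing key=f (the function's name, not a call) tells sorted to
# invoke f on each key and order by whatever f returns -- the counts.
for title in sorted(titles, key=f):
    print(title, titles[title])
```

This prints the shows in ascending order of popularity: The Office, then Friends, then Game of Thrones.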
00:44:41.020 --> 00:44:43.590 And if you set reverse equal to True, capital 00:44:43.590 --> 00:44:48.120 T in Python, that's going to go ahead and give us now 00:44:48.120 --> 00:44:50.190 the reverse order of that same sort. 00:44:50.190 --> 00:44:53.790 So let me go ahead and maximize my terminal window, rerun it again. 00:44:53.790 --> 00:44:57.480 And voila, if I scroll back up to the top, it's not alphabetically sorted. 00:44:57.480 --> 00:44:59.970 But if I keep going, keep going, keep going, keep going, 00:44:59.970 --> 00:45:01.262 the numbers are getting bigger. 00:45:01.262 --> 00:45:06.770 And voila, now Game of Thrones with 33 is all the way at the top. 00:45:06.770 --> 00:45:08.490 All right, so pretty cool. 00:45:08.490 --> 00:45:11.360 And again, the new functionality here in Python, at least, 00:45:11.360 --> 00:45:15.110 is that we can actually pass in functions to functions 00:45:15.110 --> 00:45:19.380 and leave it to the latter to call the former. 00:45:19.380 --> 00:45:21.180 So that's complicated just to say. 00:45:21.180 --> 00:45:26.400 But any questions or confusion now on how we are using dictionaries 00:45:26.400 --> 00:45:34.290 and how we are sorting things in this reverse, value-based way? 00:45:34.290 --> 00:45:35.450 Any questions or confusion? 00:45:35.450 --> 00:45:39.000 Anything in the chat or verbally, Brian? 00:45:39.000 --> 00:45:41.470 BRIAN YU: Looks like all questions are answered here. 00:45:41.470 --> 00:45:42.320 DAVID J. MALAN: OK. 00:45:42.320 --> 00:45:44.780 Then in that case, let me point out a common mistake. 00:45:44.780 --> 00:45:50.000 Notice that even though f is a function, notice that I did not call it there. 
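Adding the optional reverse argument, the same hypothetical-data sketch would flip the ordering so the biggest counts come first:

```python
# Same hypothetical counts as before
titles = {"Friends": 27, "The Office": 26, "Game of Thrones": 33}

def f(title):
    return titles[title]

# reverse=True (capital T) flips the ascending sort into a descending one
for title in sorted(titles, key=f, reverse=True):
    print(title, titles[title])
```

Now Game of Thrones, with 33, comes out on top.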
00:45:50.000 --> 00:45:53.630 That would be incorrect, the reason being we deliberately 00:45:53.630 --> 00:45:58.410 want to pass the function f into the sorted function 00:45:58.410 --> 00:46:03.810 so that the sorted function can take it upon itself to call f again and again 00:46:03.810 --> 00:46:04.310 and again. 00:46:04.310 --> 00:46:07.227 We don't want to just call it once by using the parentheses ourselves. 00:46:07.227 --> 00:46:11.030 We want to just pass it in by name so that the sorted function, which comes 00:46:11.030 --> 00:46:14.630 with Python, can instead do it for us. 00:46:14.630 --> 00:46:17.060 Santiago, did you have a question? 00:46:17.060 --> 00:46:18.710 AUDIENCE: Yes, I was going to ask. 00:46:18.710 --> 00:46:21.425 Why didn't we put f of title? 00:46:24.350 --> 00:46:26.960 I was going to ask that question specifically. 00:46:26.960 --> 00:46:29.005 DAVID J. MALAN: Oh, with the parentheses? 00:46:29.005 --> 00:46:29.630 AUDIENCE: Yeah. 00:46:29.630 --> 00:46:30.963 DAVID J. MALAN: Oh, OK, perfect. 00:46:30.963 --> 00:46:34.010 So because that would call the function once and only once. 00:46:34.010 --> 00:46:37.013 We want sorted to be able to call it again and again. 00:46:37.013 --> 00:46:38.930 Now, here's actually an example, as we've seen 00:46:38.930 --> 00:46:40.760 in the past, of a correct solution. 00:46:40.760 --> 00:46:45.170 This is behaving as I intend, a list of sorted titles from top to bottom 00:46:45.170 --> 00:46:47.450 in order of popularity. 00:46:47.450 --> 00:46:49.820 But it's a little poorly designed, because I'm 00:46:49.820 --> 00:46:53.390 defining this function f, whose name in the first place is kind of lame. 00:46:53.390 --> 00:46:56.660 But I'm defining a function only to use it in one place. 00:46:56.660 --> 00:47:00.860 And my god, the function is so tiny, it just feels like a waste of keystrokes 00:47:00.860 --> 00:47:03.740 to have defined a new function just to then pass it in. 
00:47:03.740 --> 00:47:08.150 So it turns out, in Python, if you have a very short function whose 00:47:08.150 --> 00:47:13.130 purpose in life is meant to be to solve a local problem just once and that's it 00:47:13.130 --> 00:47:16.880 and it's short enough that you're pretty sure you can fit it on one line of code 00:47:16.880 --> 00:47:21.080 without things wrapping and starting to get ugly stylistically, it turns out 00:47:21.080 --> 00:47:23.270 you can actually do this instead. 00:47:23.270 --> 00:47:26.820 You can copy the code that you had in mind like this. 00:47:26.820 --> 00:47:30.680 And instead of actually defining f as a function name, 00:47:30.680 --> 00:47:34.070 you can actually use a special keyword in Python called lambda. 00:47:34.070 --> 00:47:37.760 You can specify the name of an argument for your function as before. 00:47:37.760 --> 00:47:41.690 And then you can simply specify the return value, thereafter 00:47:41.690 --> 00:47:44.940 deleting the function itself. 00:47:44.940 --> 00:47:49.640 So to be clear, key is still an argument to the sorted function. 00:47:49.640 --> 00:47:54.120 It expects as its value typically the name of a function. 00:47:54.120 --> 00:47:57.590 But if you've decided that, eh, this seems like a waste of effort 00:47:57.590 --> 00:47:59.870 to define a function, then pass the function in, 00:47:59.870 --> 00:48:02.840 especially when it's so short, you can do it in a one liner. 00:48:02.840 --> 00:48:05.990 A lambda function is an anonymous function. 00:48:05.990 --> 00:48:09.230 Lambda literally says, Python, give me a function. 00:48:09.230 --> 00:48:11.147 I don't care about its name. 00:48:11.147 --> 00:48:13.230 Therefore, you don't have to choose a name for it. 00:48:13.230 --> 00:48:17.940 But it does care still about its arguments and its return value. 00:48:17.940 --> 00:48:23.147 So it's still up to you to provide zero or more arguments and a return value. 
00:48:23.147 --> 00:48:24.230 And notice I've done that. 00:48:24.230 --> 00:48:28.010 I've specified the keyword lambda followed by the name of the argument 00:48:28.010 --> 00:48:31.490 I want this anonymous, nameless function to accept. 00:48:31.490 --> 00:48:33.890 And then I'm specifying the return value. 00:48:33.890 --> 00:48:37.940 And with lambda functions, you do not need to specify return. 00:48:37.940 --> 00:48:41.000 Whatever you write after the colon is literally 00:48:41.000 --> 00:48:43.050 what will be returned automatically. 00:48:43.050 --> 00:48:45.320 So again, this is a very Pythonic thing to do. 00:48:45.320 --> 00:48:49.250 It's kind of a very clever one liner, even though it's a little cryptic 00:48:49.250 --> 00:48:50.757 to see for the very first time. 00:48:50.757 --> 00:48:53.840 But it allows you to condense your thoughts into a succinct statement that 00:48:53.840 --> 00:48:57.470 gets the job done so you don't have to start defining more and more functions 00:48:57.470 --> 00:49:02.490 that you or someone else then need to keep track of. 00:49:02.490 --> 00:49:02.990 All right. 00:49:02.990 --> 00:49:05.300 Any questions, then, on this? 00:49:05.300 --> 00:49:10.580 And I am pretty sure this is as complex or sophisticated as our Python code 00:49:10.580 --> 00:49:13.290 today will get. 00:49:13.290 --> 00:49:16.020 Yeah, over to Sophia. 00:49:16.020 --> 00:49:19.380 AUDIENCE: I was wondering why "lambda" is used specifically 00:49:19.380 --> 00:49:21.217 rather than some other keyword. 00:49:21.217 --> 00:49:23.550 DAVID J. MALAN: Yeah, so there's a long history in this. 00:49:23.550 --> 00:49:27.060 And if, in fact, you take a course on functional programming-- at Harvard, 00:49:27.060 --> 00:49:28.830 it's called CS51-- 00:49:28.830 --> 00:49:32.280 there's a whole etymology behind keywords like this. 00:49:32.280 --> 00:49:34.360 Let me defer that one for another time. 
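Put together, the lambda version of the same sort (again with hypothetical counts) collapses the throwaway function into the sorted call itself:

```python
# Hypothetical counts, as in the examples above
titles = {"Friends": 27, "The Office": 26, "Game of Thrones": 33}

# The lambda is an anonymous function: one argument (title), and whatever
# follows the colon is implicitly returned -- no def, no name, no "return"
for title in sorted(titles, key=lambda title: titles[title], reverse=True):
    print(title, titles[title])
```

Same behavior as before, but without a separately defined f to keep track of.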
00:49:34.360 --> 00:49:37.440 But indeed, not only in Python but in other languages, 00:49:37.440 --> 00:49:41.290 as well, these things have come to exist called lambda functions. 00:49:41.290 --> 00:49:44.230 So they're actually quite commonplace in other languages, as well. 00:49:44.230 --> 00:49:48.580 And so Python just adopted the term of art. 00:49:48.580 --> 00:49:52.060 Mathematically, lambda is often used as a symbol for functions. 00:49:52.060 --> 00:49:55.980 And so they borrowed that same idea in the world of programming. 00:49:55.980 --> 00:49:56.550 All right. 00:49:56.550 --> 00:50:00.840 So seeing no other questions, let's go ahead and solve a related problem still 00:50:00.840 --> 00:50:03.510 with some Python but that's going to push up 00:50:03.510 --> 00:50:09.150 against the limits of efficiency when it comes to storing our data in CSV files. 00:50:09.150 --> 00:50:13.113 Let me go ahead and start fresh in this file, Favorites.py. 00:50:13.113 --> 00:50:15.030 All of the code I've written thus far, though, 00:50:15.030 --> 00:50:16.905 is on the course's website in advance, so you 00:50:16.905 --> 00:50:18.570 can see the incremental improvement. 00:50:18.570 --> 00:50:21.280 I'm going to go ahead and, again, import csv at the top. 00:50:21.280 --> 00:50:24.690 And now let's write a program this time that doesn't just 00:50:24.690 --> 00:50:27.990 automatically open up the CSV and analyze it looking 00:50:27.990 --> 00:50:31.020 for the total popularity of shows. 00:50:31.020 --> 00:50:35.430 Let's search for a specific show in the CSV and then 00:50:35.430 --> 00:50:39.082 go ahead and output the popularity thereof. 00:50:39.082 --> 00:50:41.040 And I can do this in a bunch of different ways. 00:50:41.040 --> 00:50:43.415 But I'm going to try to make this as concise as possible. 00:50:43.415 --> 00:50:46.800 I'm first going to ask the user to input a title. 00:50:46.800 --> 00:50:49.170 I could use CS50's get_string function. 
00:50:49.170 --> 00:50:52.330 But recall that it's pretty much the same as Python's input function, 00:50:52.330 --> 00:50:55.740 so I'm going to use Python's input function today. 00:50:55.740 --> 00:50:57.780 And then I'm going to go ahead and, as before, 00:50:57.780 --> 00:51:01.350 open up that same CSV called Favorite TV Shows - 00:51:01.350 --> 00:51:08.010 Form Responses 1.csv in read-only mode as a variable called file. 00:51:08.010 --> 00:51:11.160 I'm then going to give myself a reader, and I'll use a DictReader again 00:51:11.160 --> 00:51:14.520 so I don't have to worry about knowing which columns things are in, 00:51:14.520 --> 00:51:16.080 passing in file. 00:51:16.080 --> 00:51:17.340 And then let's see. 00:51:17.340 --> 00:51:20.310 If I only care about one title, I can keep this program simpler. 00:51:20.310 --> 00:51:23.340 I don't need to figure out the popularity of every show. 00:51:23.340 --> 00:51:26.880 I just need to figure out the popularity of one show, the title 00:51:26.880 --> 00:51:28.510 that the human has typed in. 00:51:28.510 --> 00:51:32.160 So I'm going to go ahead and give myself a very simple int called counter 00:51:32.160 --> 00:51:33.480 and set it equal to 0. 00:51:33.480 --> 00:51:34.950 I don't need a whole dictionary. 00:51:34.950 --> 00:51:36.930 Just one variable suffices now. 00:51:36.930 --> 00:51:42.300 And I'm going to go ahead and iterate over the rows in the reader, as before. 00:51:42.300 --> 00:51:48.600 And then I'm going to say if the current row's title == the title the human 00:51:48.600 --> 00:51:51.930 typed in, let's go ahead and increment counter by 1. 00:51:51.930 --> 00:51:54.510 And it's already initialized, because I did that on line 7. 00:51:54.510 --> 00:51:55.500 So I think I'm good. 00:51:55.500 --> 00:51:57.450 And then at the end of this program, let's 00:51:57.450 --> 00:51:59.940 very simply print out the value of counter. 
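The program as assembled so far might be sketched like this. To keep it self-contained and runnable, the input() call and the CSV file are simulated with fixed, hypothetical data:

```python
import csv
import io

# Simulated responses standing in for the Google Form's CSV export
data = io.StringIO(
    "Timestamp,Title,Genres\n"
    "t1,The Office,Comedy\n"
    "t2,Friends,Comedy\n"
    "t3,The Office,Comedy\n"
)

# Stands in for title = input(...) in the lecture's version
title = "The Office"

# One plain int suffices now -- no dictionary needed for a single show
counter = 0
for row in csv.DictReader(data):
    if row["Title"] == title:
        counter += 1

print(counter)  # 2
```

Note that this naive version, like the lecture's first attempt, matches only exact spellings of the title.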
00:51:59.940 --> 00:52:04.920 So the purpose of this program is to prompt the user for a title of a show 00:52:04.920 --> 00:52:08.220 and then just report the popularity thereof 00:52:08.220 --> 00:52:11.040 by counting the number of instances of it in the file. 00:52:11.040 --> 00:52:14.520 So let me go ahead and run this with python of favorites.py. 00:52:14.520 --> 00:52:15.450 Enter. 00:52:15.450 --> 00:52:21.440 Let me go ahead and type in "The Office," Enter, and 19. 00:52:21.440 --> 00:52:23.710 Now, I don't remember exactly what the number was. 00:52:23.710 --> 00:52:26.620 But I remember The Office was more popular than that. 00:52:26.620 --> 00:52:29.840 I'm pretty sure it was not 19. 00:52:29.840 --> 00:52:35.645 Any intuition as to why this program is buggy or so it would seem? 00:52:35.645 --> 00:52:37.520 BRIAN YU: A few people in the chat are saying 00:52:37.520 --> 00:52:40.635 you need to remember to deal with capitalization and white space again. 00:52:40.635 --> 00:52:41.510 DAVID J. MALAN: Yeah. 00:52:41.510 --> 00:52:44.790 So we need to practice those same lessons learned from before. 00:52:44.790 --> 00:52:48.830 So I should really canonicalize the input that the human, I, just typed in 00:52:48.830 --> 00:52:51.980 and also the input that's coming from the CSV file. 00:52:51.980 --> 00:52:53.990 Perhaps the simplest way to do this is, up here, 00:52:53.990 --> 00:52:57.200 to first strip off leading and trailing white space in case I get a little 00:52:57.200 --> 00:52:59.670 sloppy and hit the Space bar where I shouldn't. 00:52:59.670 --> 00:53:02.612 And then let's go ahead and force it to uppercase just because. 00:53:02.612 --> 00:53:04.320 It doesn't matter if it's upper or lower, 00:53:04.320 --> 00:53:06.420 but at least we'll standardize things that way. 00:53:06.420 --> 00:53:10.130 And then when I do this, look at the current rows title. 00:53:10.130 --> 00:53:12.170 I think I really need to do the same thing. 
00:53:12.170 --> 00:53:15.140 If I'm going to canonicalize one, I need to canonicalize the other. 00:53:15.140 --> 00:53:19.790 And now compare the all-caps, white-space-stripped versions 00:53:19.790 --> 00:53:20.730 of both strings. 00:53:20.730 --> 00:53:21.920 So now let me rerun it. 00:53:21.920 --> 00:53:24.020 Now I'm going to type in "The Office," Enter. 00:53:24.020 --> 00:53:24.770 And voila. 00:53:24.770 --> 00:53:28.130 Now I'm at 26, which I think is where we were at before. 00:53:28.130 --> 00:53:30.890 And in fact, now I, the user, can be a little sloppy. 00:53:30.890 --> 00:53:32.450 I can say "the office." 00:53:32.450 --> 00:53:35.780 I can run it again and say "the office" and then, for whatever reason, 00:53:35.780 --> 00:53:37.460 hit the Space bar a lot, Enter. 00:53:37.460 --> 00:53:38.570 It's still going to work. 00:53:38.570 --> 00:53:41.810 And indeed, though we seem to be belaboring the pedantic here 00:53:41.810 --> 00:53:44.683 with trimming off white space and so forth, just think. 00:53:44.683 --> 00:53:46.850 In a relatively small audience here, how many of you 00:53:46.850 --> 00:53:50.150 accidentally hit the Space bar or capitalized things differently? 00:53:50.150 --> 00:53:52.340 This happens massively at scale. 00:53:52.340 --> 00:53:55.040 And you can imagine this being important when you're tagging 00:53:55.040 --> 00:53:56.780 friends in some social media account. 00:53:56.780 --> 00:53:58.940 You're doing @Brian or the like. 00:53:58.940 --> 00:54:02.510 You don't want to have to require the user to type @, capital B, 00:54:02.510 --> 00:54:05.100 lowercase r-i-a-n, and so forth. 00:54:05.100 --> 00:54:07.880 So tolerating disparate, messy user input 00:54:07.880 --> 00:54:11.990 is such a common problem to solve, including 00:54:11.990 --> 00:54:14.740 in today's apps that we all use. 00:54:14.740 --> 00:54:15.280 All right. 00:54:15.280 --> 00:54:21.000 Any questions, then, on this program, which I think is correct? 
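The canonicalized comparison might look like this in isolation, again with hypothetical in-memory data (including deliberately sloppy spacing and capitalization) in place of the real form responses:

```python
import csv
import io

data = io.StringIO(
    "Timestamp,Title,Genres\n"
    "t1,The Office,Comedy\n"
    "t2, the office ,Comedy\n"
    "t3,THE OFFICE,Comedy\n"
)

# Canonicalize the user's input: strip stray whitespace, force uppercase
# (the literal string stands in for input() in the lecture's version)
title = "  the Office ".strip().upper()

counter = 0
for row in csv.DictReader(data):
    # Canonicalize each stored title the same way before comparing
    if row["Title"].strip().upper() == title:
        counter += 1

print(counter)  # 3
```

All three messy variants now count as the same show.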
00:54:21.000 --> 00:54:22.890 Then let me ask a question of you. 00:54:22.890 --> 00:54:27.080 In what sense is this program poorly designed? 00:54:27.080 --> 00:54:30.860 In what sense is this program poorly designed? 00:54:30.860 --> 00:54:33.480 This is more subtle. 00:54:33.480 --> 00:54:38.040 But think about the running time of this program in terms of big O. 00:54:38.040 --> 00:54:45.030 What is the running time of this program if the CSV file has n different shows 00:54:45.030 --> 00:54:47.610 in it or n different submissions? 00:54:47.610 --> 00:54:50.310 So n is the variable in question. 00:54:50.310 --> 00:54:53.122 Yeah, what's the running time, Andrew? 00:54:53.122 --> 00:54:55.510 AUDIENCE: [INAUDIBLE] 00:54:55.510 --> 00:54:58.260 DAVID J. MALAN: Yeah, it's big O of n, because I'm literally using 00:54:58.260 --> 00:55:00.400 linear search by way of the for loop. 00:55:00.400 --> 00:55:04.000 That's how a for loop works in Python, just like in C. Starts at the beginning 00:55:04.000 --> 00:55:06.080 and potentially goes all the way till the end. 00:55:06.080 --> 00:55:08.730 And so I'm using implicitly linear search, 00:55:08.730 --> 00:55:11.910 because I'm not using any fancy data structures, no sets, no dictionaries. 00:55:11.910 --> 00:55:14.140 I'm just looping from top to bottom. 00:55:14.140 --> 00:55:18.390 So you can imagine that if we surveyed not just all of the students here 00:55:18.390 --> 00:55:21.390 in class but maybe everyone on campus or everyone in the world-- 00:55:21.390 --> 00:55:24.240 maybe we're Internet Movie Database, IMDb. 00:55:24.240 --> 00:55:28.710 There could be a huge number of votes and a huge number of shows. 
00:55:28.710 --> 00:55:32.460 And so writing a program, whether it's in a terminal window like mine 00:55:32.460 --> 00:55:36.480 or maybe on a mobile device or maybe on a webpage for your laptop or desktop, 00:55:36.480 --> 00:55:40.800 it's probably not the best design to constantly loop 00:55:40.800 --> 00:55:44.190 over all of the shows in your database from top 00:55:44.190 --> 00:55:47.250 to bottom just to answer a single question. 00:55:47.250 --> 00:55:51.502 It would be much nicer to do things in log of n time or in constant time. 00:55:51.502 --> 00:55:54.210 And thankfully, over the past few weeks, both in C and in Python, 00:55:54.210 --> 00:55:57.390 we have seen smarter ways to do this. 00:55:57.390 --> 00:56:00.420 But I'm not practicing what I've preached here. 00:56:00.420 --> 00:56:05.520 And in fact, at some point, this notion of a flat-file database 00:56:05.520 --> 00:56:07.440 starts to get too primitive for us. 00:56:07.440 --> 00:56:11.670 Flat-file databases, like CSV files, are wonderfully useful 00:56:11.670 --> 00:56:13.590 when you just want to do something quickly 00:56:13.590 --> 00:56:16.410 or when you want to download data from some third party, 00:56:16.410 --> 00:56:18.518 like Google, in a standard, portable way. 00:56:18.518 --> 00:56:21.810 "Portable" means that it can be used by different people and different systems. 00:56:21.810 --> 00:56:23.760 CSV is about as simple as it gets, because you 00:56:23.760 --> 00:56:26.250 don't need to own Microsoft Excel or Apple 00:56:26.250 --> 00:56:28.150 Numbers or any particular product. 00:56:28.150 --> 00:56:30.750 It's just a text file, so you can use any text editing 00:56:30.750 --> 00:56:34.140 program or any programming language to access it. 
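One of the "smarter ways" alluded to here, sketched with hypothetical data (and not necessarily what the lecture turns to next), is to pay the O(n) scan once, building a dictionary of counts, after which each query is an average-case O(1) lookup rather than another pass over the whole file:

```python
import csv
import io

data = io.StringIO(
    "Timestamp,Title,Genres\n"
    "t1,The Office,Comedy\n"
    "t2,Friends,Comedy\n"
    "t3,The Office,Comedy\n"
)

# Pay the linear cost once, up front...
counts = {}
for row in csv.DictReader(data):
    key = row["Title"].strip().upper()
    counts[key] = counts.get(key, 0) + 1

# ...then every subsequent query is a constant-time dictionary lookup
# on average, instead of another O(n) scan of the CSV
print(counts.get("THE OFFICE", 0))  # 2
print(counts.get("FRIENDS", 0))     # 1
```

The trade-off is memory for speed, which is exactly the kind of bookkeeping a real database does for us.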
00:56:34.140 --> 00:56:38.730 But flat-file databases aren't necessarily the best structure 00:56:38.730 --> 00:56:42.750 to use ultimately for larger data sets, because they don't really 00:56:42.750 --> 00:56:44.790 lend themselves to more efficient queries. 00:56:44.790 --> 00:56:48.280 So CSV files, pretty much at best, you have to search top to bottom, 00:56:48.280 --> 00:56:49.140 left to right. 00:56:49.140 --> 00:56:53.070 But it turns out that there are better databases out there generally known 00:56:53.070 --> 00:56:57.960 as relational databases that, instead of being files in which you store data, 00:56:57.960 --> 00:57:01.230 they are instead programs in which you store data. 00:57:01.230 --> 00:57:04.830 Now, to be fair, those programs use a lot of RAM, memory, 00:57:04.830 --> 00:57:06.390 where they actually store your data. 00:57:06.390 --> 00:57:08.670 And they do certainly persist your data. 00:57:08.670 --> 00:57:12.600 They keep it long term by storing your data also in files. 00:57:12.600 --> 00:57:16.110 But between you and your data, there is this running program. 00:57:16.110 --> 00:57:20.070 And if you've ever heard of Oracle or MySQL or PostgreSQL or SQL Server 00:57:20.070 --> 00:57:23.880 or Microsoft Access or bunches of other popular products, 00:57:23.880 --> 00:57:26.520 both commercial and free and open source alike, 00:57:26.520 --> 00:57:30.720 relational databases are so similar in spirit to spreadsheets. 00:57:30.720 --> 00:57:33.690 But they are implemented in software. 00:57:33.690 --> 00:57:35.490 And they give us more and more features. 00:57:35.490 --> 00:57:37.360 And they use more and more data structures 00:57:37.360 --> 00:57:42.550 so that we can search for data, insert data, delete data, update data much, 00:57:42.550 --> 00:57:47.350 much more efficiently than we could if just using something like a CSV file. 00:57:47.350 --> 00:57:49.600 So let's go ahead and take our five-minute break here. 
00:57:49.600 --> 00:57:52.558 And when we come back, we'll look at relational databases and, in turn, 00:57:52.558 --> 00:57:54.400 a language called SQL. 00:57:54.400 --> 00:57:55.170 All right. 00:57:55.170 --> 00:57:56.190 So we are back. 00:57:56.190 --> 00:57:58.890 And the goal at hand now is to transition 00:57:58.890 --> 00:58:02.340 from these fairly simplistic flat-file databases 00:58:02.340 --> 00:58:04.260 to a more proper relational database. 00:58:04.260 --> 00:58:07.500 And relational databases are indeed what power so many 00:58:07.500 --> 00:58:10.470 of today's mobile applications, web applications, and the like. 00:58:10.470 --> 00:58:13.170 Now we're beginning to transition to real-world software 00:58:13.170 --> 00:58:16.210 with real-world languages, at that. 00:58:16.210 --> 00:58:20.610 And so now, let me introduce what we're going to call SQLite. 00:58:20.610 --> 00:58:23.040 So it turns out that a relational database 00:58:23.040 --> 00:58:27.660 is a database that stores all of the data still in rows and columns. 00:58:27.660 --> 00:58:30.960 But it doesn't do so using spreadsheets or sheets. 00:58:30.960 --> 00:58:33.930 It instead does so using what we're going to call tables. 00:58:33.930 --> 00:58:36.120 So it's pretty much the same idea. 00:58:36.120 --> 00:58:39.120 But with tables, we get some additional functionality. 00:58:39.120 --> 00:58:41.130 With those tables, we'll have the ability 00:58:41.130 --> 00:58:46.480 to search for data, update data, delete data, insert new data, and the like. 00:58:46.480 --> 00:58:49.188 And these are things that we absolutely can do with spreadsheets. 00:58:49.188 --> 00:58:52.105 But in the world of spreadsheets, if you want to search for something, 00:58:52.105 --> 00:58:54.840 it's you, the human, doing it by manually clicking and scrolling, 00:58:54.840 --> 00:58:55.340 typically. 
00:58:55.340 --> 00:58:57.390 If you want to insert data, it's you, the human, 00:58:57.390 --> 00:58:59.362 typing it in manually after adding a new row. 00:58:59.362 --> 00:59:01.320 If you want to delete something, it's you right 00:59:01.320 --> 00:59:04.350 clicking or Control-clicking and deleting a whole row 00:59:04.350 --> 00:59:06.750 or updating the individual cells they're in. 00:59:06.750 --> 00:59:11.910 With SQL, Structured Query Language, we have a new programming language 00:59:11.910 --> 00:59:15.360 that is very often used in conjunction with other programming languages. 00:59:15.360 --> 00:59:18.930 And so today, we'll see SQL used on its own initially. 00:59:18.930 --> 00:59:21.990 But we'll also see it in the context of a Python program. 00:59:21.990 --> 00:59:28.170 So a language like Python can itself use SQL to do more powerful things 00:59:28.170 --> 00:59:30.660 than Python alone could do. 00:59:30.660 --> 00:59:34.440 So with that said, SQLite is like a light version of SQL. 00:59:34.440 --> 00:59:35.940 It's a more user-friendly version. 00:59:35.940 --> 00:59:36.780 It's more portable. 00:59:36.780 --> 00:59:40.260 It can be used on Macs and PCs and phones and laptops and desktops 00:59:40.260 --> 00:59:40.950 and servers. 00:59:40.950 --> 00:59:42.120 But it's incredibly common. 00:59:42.120 --> 00:59:45.810 In fact, in your iPhone and your Android phone, many of the applications 00:59:45.810 --> 00:59:50.250 you are running today on your own device are using SQLite underneath the hood. 00:59:50.250 --> 00:59:52.290 So it isn't a toy language per se. 00:59:52.290 --> 00:59:55.150 It's instead a relatively simple implementation 00:59:55.150 --> 00:59:56.920 of a language generally known as SQL. 00:59:56.920 --> 01:00:00.680 But long story short, there's other implementations of relational databases 01:00:00.680 --> 01:00:01.180 out there. 
01:00:01.180 --> 01:00:02.972 And I rattled off several of them already-- 01:00:02.972 --> 01:00:05.620 Oracle and MySQL and PostgreSQL and the like. 01:00:05.620 --> 01:00:09.910 Those all have slightly different flavors or dialects of SQL. 01:00:09.910 --> 01:00:14.860 So SQL is a fairly standard language for interacting with databases. 01:00:14.860 --> 01:00:16.960 But different companies, different communities 01:00:16.960 --> 01:00:20.200 have kind of added or subtracted their own preferred features. 01:00:20.200 --> 01:00:24.340 And so the syntax you use is generally constant across all platforms. 01:00:24.340 --> 01:00:27.458 But we will standardize for our purposes on SQLite. 01:00:27.458 --> 01:00:29.500 And indeed, this is what you would use these days 01:00:29.500 --> 01:00:31.820 in the world of mobile applications. 01:00:31.820 --> 01:00:33.710 So it's very much germane there. 01:00:33.710 --> 01:00:39.760 So with SQLite, we're going to have ultimately the ability to query data 01:00:39.760 --> 01:00:41.690 and update data, delete data, and the like. 01:00:41.690 --> 01:00:44.080 But to do so, we actually need a program with which 01:00:44.080 --> 01:00:46.220 to interact with our database. 01:00:46.220 --> 01:00:50.320 So the way SQLite works is that it stores all of your data 01:00:50.320 --> 01:00:51.920 still in a file. 01:00:51.920 --> 01:00:53.500 But it's a binary file now. 01:00:53.500 --> 01:00:55.690 That is, it's a file containing 0's and 1's. 01:00:55.690 --> 01:00:57.550 And those 0's and 1's might represent text. 01:00:57.550 --> 01:00:58.810 They might represent numbers. 01:00:58.810 --> 01:01:01.570 But it's a more compact, efficient representation 01:01:01.570 --> 01:01:05.290 than a mere CSV file would be using ASCII or Unicode. 01:01:05.290 --> 01:01:06.670 So that's the first difference. 
01:01:06.670 --> 01:01:10.690 SQLite uses a single file, a binary file, 01:01:10.690 --> 01:01:14.470 to store all of your data and represent it inside of that file by way of all 01:01:14.470 --> 01:01:18.110 of those 0's and 1's or the tables to which I alluded before, 01:01:18.110 --> 01:01:22.180 which are the analogue in the database world of sheets or spreadsheets 01:01:22.180 --> 01:01:23.690 in the spreadsheet world. 01:01:23.690 --> 01:01:28.940 So to interact with that binary file wherein all of your data is stored, 01:01:28.940 --> 01:01:31.283 we need some kind of user-facing program. 01:01:31.283 --> 01:01:32.950 And there's many different tools to use. 01:01:32.950 --> 01:01:36.970 But the standard one that comes with SQLite 01:01:36.970 --> 01:01:40.750 is called sqlite3, essentially version 3 of the tool. 01:01:40.750 --> 01:01:44.050 This is a command line tool similar in spirit to any of the commands 01:01:44.050 --> 01:01:46.000 you've run in a terminal window thus far that 01:01:46.000 --> 01:01:50.395 allows you to open up that binary file and interact with all of your tables. 01:01:50.395 --> 01:01:53.020 Now, here again, we kind of have a chicken and the egg problem. 01:01:53.020 --> 01:01:56.290 If I want to use a database but I don't yet have a database 01:01:56.290 --> 01:01:58.570 and yet I want to select data from my database, 01:01:58.570 --> 01:02:00.040 how do I actually load things in? 01:02:00.040 --> 01:02:04.130 Well, you can load data into a SQLite database in at least two ways. 01:02:04.130 --> 01:02:06.490 One, which I'll do in a moment, you can just 01:02:06.490 --> 01:02:10.480 import an existing flat-file database, like a CSV. 01:02:10.480 --> 01:02:15.640 And what you do is you save the CSV on your Mac or PC or in your CS50 IDE. 01:02:15.640 --> 01:02:18.100 You run a special command with sqlite3. 01:02:18.100 --> 01:02:21.430 And it will just load the CSV into memory. 
01:02:21.430 --> 01:02:23.620 It will figure out where all of the commas are. 01:02:23.620 --> 01:02:28.510 And it will construct inside of that binary file the corresponding rows 01:02:28.510 --> 01:02:31.360 and columns using the appropriate 0's and 1's 01:02:31.360 --> 01:02:32.810 to store all of that information. 01:02:32.810 --> 01:02:35.410 So it just imports it for you automatically. 01:02:35.410 --> 01:02:39.310 Approach 2 would be to actually write code in a language like Python 01:02:39.310 --> 01:02:44.290 or any other that actually manually inserts all of the data 01:02:44.290 --> 01:02:45.155 into your database. 01:02:45.155 --> 01:02:46.280 And we'll do that, as well. 01:02:46.280 --> 01:02:47.290 But let's start simple. 01:02:47.290 --> 01:02:51.070 Let me go ahead and run, for instance, sqlite3. 01:02:51.070 --> 01:02:54.550 And this is preinstalled on CS50 IDE, and it's not that hard to get it up 01:02:54.550 --> 01:02:56.570 and running on a Mac and PC, as well. 01:02:56.570 --> 01:02:59.860 I'm going to go ahead and run sqlite3 in my terminal window here. 01:02:59.860 --> 01:03:00.610 And voila. 01:03:00.610 --> 01:03:03.430 You just see some very simple output. 01:03:03.430 --> 01:03:07.023 It's telling me to type .help if I want to see some usage hints. 01:03:07.023 --> 01:03:09.190 But I know most of the commands, and we'll generally 01:03:09.190 --> 01:03:11.232 give you all of the commands that you might need. 01:03:11.232 --> 01:03:15.760 In fact, one of the commands that we can use is .mode, and another is .import. 01:03:15.760 --> 01:03:18.100 So generally, you won't use these that frequently. 01:03:18.100 --> 01:03:21.670 You'll only use them when creating a database for the first time when 01:03:21.670 --> 01:03:25.002 you are creating that database from an existing CSV file. 01:03:25.002 --> 01:03:26.710 And indeed, that's my goal at the moment. 
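That second, programmatic approach might look something like the following sketch in Python, using the standard csv and sqlite3 modules. The favorites.csv filename and its columns are assumptions based on the form described earlier, and the rows here are made up; this is not the exact code from lecture.

```python
import csv
import os
import sqlite3
import tempfile

# Write a tiny stand-in for the real favorites.csv (hypothetical data).
csv_path = os.path.join(tempfile.mkdtemp(), "favorites.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Timestamp", "title", "genres"])
    writer.writerow(["10/26/2020 9:00:01", "The Office", "Comedy"])
    writer.writerow(["10/26/2020 9:00:02", "Friends", "Comedy, Romance"])

# Approach 2: create the table ourselves and INSERT each CSV row manually.
db = sqlite3.connect(":memory:")  # a filename here would persist the data
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")
with open(csv_path, newline="") as f:
    for row in csv.DictReader(f):  # DictReader keys come from the header row
        db.execute(
            "INSERT INTO shows (Timestamp, title, genres) VALUES (?, ?, ?)",
            (row["Timestamp"], row["title"], row["genres"]),
        )
db.commit()

count = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
```

The ? placeholders let the sqlite3 module fill in the values safely, rather than pasting strings into the query by hand.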
01:03:26.710 --> 01:03:30.610 Let me take our CSV file containing all of your favorite TV shows 01:03:30.610 --> 01:03:35.650 and load it into SQLite in a proper relational database 01:03:35.650 --> 01:03:39.460 so that we can do better than, for instance, big O of n 01:03:39.460 --> 01:03:42.730 when it comes to searching that data and doing anything else on it. 01:03:42.730 --> 01:03:44.960 So to do this, I have to execute two commands. 01:03:44.960 --> 01:03:48.280 One, I need to put SQLite into CSV mode. 01:03:48.280 --> 01:03:51.010 And that's just to distinguish it from other flat-file formats, 01:03:51.010 --> 01:03:53.890 like TSV for tabs or some other format. 01:03:53.890 --> 01:03:56.230 And now I'm going to go ahead and run .import. 01:03:56.230 --> 01:03:59.920 Then I have to specify the name of the file to import, which is the CSV. 01:03:59.920 --> 01:04:03.490 And I'm going to go ahead and call my table shows. 01:04:03.490 --> 01:04:08.500 So .import takes two arguments, the name of the file that you want to import 01:04:08.500 --> 01:04:12.430 and the name of the table that you want to create out of that file. 01:04:12.430 --> 01:04:14.680 And again, tables have rows and columns. 01:04:14.680 --> 01:04:18.280 And the commas in the file are going to delineate 01:04:18.280 --> 01:04:20.290 where those columns begin and end. 01:04:20.290 --> 01:04:21.790 I'm going to go ahead and hit Enter. 01:04:21.790 --> 01:04:24.670 It looks like it flew by pretty fast. 01:04:24.670 --> 01:04:26.560 Nothing seems to have happened. 01:04:26.560 --> 01:04:30.940 But I think that's OK, because now we're going to go ahead and have the ability 01:04:30.940 --> 01:04:32.710 to actually manipulate that data. 01:04:32.710 --> 01:04:34.630 But how do we manipulate the data? 01:04:34.630 --> 01:04:36.070 We need a new language. 
01:04:36.070 --> 01:04:42.280 SQL, Structured Query Language, is the language used by SQLite and Oracle 01:04:42.280 --> 01:04:45.220 and MySQL and PostgreSQL and bunches of other products 01:04:45.220 --> 01:04:48.040 whose names you don't need to know or remember any time soon. 01:04:48.040 --> 01:04:53.260 But SQL is the language we'll use to query the database for information 01:04:53.260 --> 01:04:54.620 and do something with it. 01:04:54.620 --> 01:04:57.920 Generally speaking, a relational database and, in turn, 01:04:57.920 --> 01:05:02.480 SQL, which is a language via which you can interact with relational databases, 01:05:02.480 --> 01:05:04.910 support four fundamental operations. 01:05:04.910 --> 01:05:08.090 And there's sort of a crude acronym, pun intended, 01:05:08.090 --> 01:05:11.960 that is just helpful for remembering what those fundamental operations are 01:05:11.960 --> 01:05:13.190 with relational databases. 01:05:13.190 --> 01:05:19.220 CRUD stands for Create, Read, Update, and Delete. 01:05:19.220 --> 01:05:21.800 And indeed, the acronym is CRUD, C-R-U-D. 01:05:21.800 --> 01:05:25.040 So it helps you remember that the four basic operations supported by any 01:05:25.040 --> 01:05:28.590 relational database are create, read, update, delete. 01:05:28.590 --> 01:05:30.710 "Create" means to create or add new data. 01:05:30.710 --> 01:05:34.550 "Read" means to access and load new data into memory. 01:05:34.550 --> 01:05:36.710 We've seen read before with opening files. 01:05:36.710 --> 01:05:39.140 "Update" and "delete" mean exactly that, as well, 01:05:39.140 --> 01:05:41.450 if you want to manipulate the data in your data set. 01:05:41.450 --> 01:05:44.530 Now, those are generic terms for any relational database. 01:05:44.530 --> 01:05:48.200 Those are the four properties typically supported by any relational database. 
01:05:48.200 --> 01:05:53.490 In the world of SQL, there are some very specific commands or functions, 01:05:53.490 --> 01:05:58.550 if you will, that implement those four functionalities. 01:05:58.550 --> 01:06:00.980 They are create and insert-- 01:06:00.980 --> 01:06:03.620 both achieve the same thing as create more generally. 01:06:03.620 --> 01:06:07.850 The keyword "select" is what's used to read data from a database. 01:06:07.850 --> 01:06:09.460 Update and delete are the same. 01:06:09.460 --> 01:06:11.210 So it's kind of an annoying inconsistency. 01:06:11.210 --> 01:06:14.803 The acronym or the term of art is CRUD, Create, Read, Update, Delete. 01:06:14.803 --> 01:06:16.970 But in the world of SQL, the authors of the language 01:06:16.970 --> 01:06:20.030 decided to implement those four ideas by way 01:06:20.030 --> 01:06:24.760 of these five keywords or functions or commands, if you will, in the language 01:06:24.760 --> 01:06:25.260 SQL. 01:06:25.260 --> 01:06:28.800 So what you are looking at are five of the keywords 01:06:28.800 --> 01:06:32.990 that you can use in this new language called SQL to actually do something 01:06:32.990 --> 01:06:33.980 with your database. 01:06:33.980 --> 01:06:35.070 Now, what does that mean? 01:06:35.070 --> 01:06:37.190 Well, suppose that you wanted to manually create 01:06:37.190 --> 01:06:38.960 a database for the very first time. 01:06:38.960 --> 01:06:39.585 What do you do? 01:06:39.585 --> 01:06:42.752 Well, back in the world of spreadsheets, it's pretty straightforward, right? 01:06:42.752 --> 01:06:44.300 You'd open up Google Spreadsheets. 01:06:44.300 --> 01:06:46.370 You go to File, New or whatever. 01:06:46.370 --> 01:06:48.350 And then you just, voila, get a new spreadsheet 01:06:48.350 --> 01:06:51.170 into which you can start creating rows and columns and the like. 
01:06:51.170 --> 01:06:53.840 In Microsoft Excel, Apple Numbers, same thing-- 01:06:53.840 --> 01:06:57.840 File menu, New Spreadsheet or whatever, and boom, you have a new spreadsheet. 01:06:57.840 --> 01:07:00.860 Now, in the world of SQL, SQL databases are generally 01:07:00.860 --> 01:07:02.840 meant to be interacted with via code. 01:07:02.840 --> 01:07:05.930 However, there are Graphical User Interfaces, GUIs, by which 01:07:05.930 --> 01:07:07.430 you can interact with them, as well. 01:07:07.430 --> 01:07:11.600 But we're going to use code today to do so and programs at a command line. 01:07:11.600 --> 01:07:17.120 It turns out that you can create tables programmatically 01:07:17.120 --> 01:07:19.530 by running a command like this. 01:07:19.530 --> 01:07:24.320 So if you literally type out syntax along the lines of CREATE TABLE, then 01:07:24.320 --> 01:07:27.230 the name of your table, indicated here in lowercase, 01:07:27.230 --> 01:07:31.490 then a parenthesis, then the name of your column that you want to create 01:07:31.490 --> 01:07:36.190 and the type of that column, a la C, and then comma, dot, 01:07:36.190 --> 01:07:39.050 dot, dot, some more columns, this is generally 01:07:39.050 --> 01:07:43.350 speaking the syntax you'll use to create in this language called 01:07:43.350 --> 01:07:44.802 SQL a new table. 01:07:44.802 --> 01:07:46.010 Now, this is in the abstract. 01:07:46.010 --> 01:07:49.077 Again, table in lowercase is meant to represent the name 01:07:49.077 --> 01:07:50.660 you want to give to your actual table. 01:07:50.660 --> 01:07:52.580 column in lowercase is meant to be the name 01:07:52.580 --> 01:07:54.080 you want to give to your own column. 01:07:54.080 --> 01:07:54.788 Maybe it's Title. 01:07:54.788 --> 01:07:55.567 Maybe it's Genres. 01:07:55.567 --> 01:07:57.400 And dot, dot, dot just means, of course, you 01:07:57.400 --> 01:07:59.100 can have even more columns than that. 
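As a concrete sketch of that CREATE TABLE syntax, and of all five keywords alongside it, here is a hypothetical session using Python's built-in sqlite3 module against an in-memory database; the table and rows are made up for illustration, not the class's actual data.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Create: CREATE TABLE defines the table; INSERT adds rows to it.
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.execute("INSERT INTO shows (title, genres) VALUES ('The Office', 'Comedy')")
db.execute("INSERT INTO shows (title, genres) VALUES ('Friends', 'Comedy')")

# Read: SELECT gets data back out of the table.
titles = [row[0] for row in db.execute("SELECT title FROM shows")]

# Update: change the rows matching some condition.
db.execute("UPDATE shows SET genres = 'Comedy, Romance' WHERE title = 'Friends'")

# Delete: remove the rows matching some condition.
db.execute("DELETE FROM shows WHERE title = 'The Office'")

remaining = db.execute("SELECT title, genres FROM shows").fetchall()
```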
01:07:59.100 --> 01:08:02.990 But literally in a moment, if I were to type in this kind of command 01:08:02.990 --> 01:08:06.500 into the terminal window after running the sqlite3 program, 01:08:06.500 --> 01:08:09.980 I could start creating one or more tables for myself. 01:08:09.980 --> 01:08:12.920 And in fact, that's what already happened for me. 01:08:12.920 --> 01:08:15.560 This .import command, which is not part of SQL-- 01:08:15.560 --> 01:08:19.579 this is the equivalent of a Menu option in Excel or Google Spreadsheets. 01:08:19.579 --> 01:08:22.729 .import just automates a certain process for me. 01:08:22.729 --> 01:08:24.319 And what it did for me is this. 01:08:24.319 --> 01:08:28.609 If I type now .schema, which is another SQLite-specific command-- 01:08:28.609 --> 01:08:32.479 anything that starts with a . is specific only to sqlite3, 01:08:32.479 --> 01:08:34.250 this terminal window program. 01:08:34.250 --> 01:08:36.830 Notice what's outputted is this. 01:08:36.830 --> 01:08:44.420 Running .import automatically created for me a table in my database 01:08:44.420 --> 01:08:46.010 called shows. 01:08:46.010 --> 01:08:47.540 And it gave it three columns-- 01:08:47.540 --> 01:08:50.060 Timestamp, title, and genres. 01:08:50.060 --> 01:08:52.529 Where did those column names come from? 01:08:52.529 --> 01:08:55.220 Well, they came from the very first line in the CSV. 01:08:55.220 --> 01:08:58.850 And they all looked like text, so the type of those values 01:08:58.850 --> 01:09:02.490 was just assumed to be text, text, text. 01:09:02.490 --> 01:09:05.090 Now, to be clear, I could have manually typed this out, 01:09:05.090 --> 01:09:08.359 creating these three columns in a new table called shows for me. 01:09:08.359 --> 01:09:11.870 But again, the .import command just automated that from a CSV. 01:09:11.870 --> 01:09:17.370 But the SQL is what we see here, CREATE TABLE shows and so forth. 
01:09:17.370 --> 01:09:22.609 So that is to say now, in this database, there is a file-- 01:09:22.609 --> 01:09:27.500 or rather, there is a table called shows inside 01:09:27.500 --> 01:09:29.630 of which is all of the data from that CSV. 01:09:29.630 --> 01:09:31.580 How do I actually get at that data? 01:09:31.580 --> 01:09:33.830 Well, it turns out there are other commands, as well. 01:09:33.830 --> 01:09:37.430 Not just CREATE, but also SELECT, it turns out. 01:09:37.430 --> 01:09:40.850 SELECT is the equivalent of read, getting data from the database. 01:09:40.850 --> 01:09:42.590 And this one is pretty powerful. 01:09:42.590 --> 01:09:45.950 And the reason that so many data scientists and statisticians 01:09:45.950 --> 01:09:48.290 use and like using languages like SQL-- 01:09:48.290 --> 01:09:51.620 they make it relatively easy to just get data and filter that data 01:09:51.620 --> 01:09:55.380 and analyze that data using new syntax for us today, 01:09:55.380 --> 01:09:58.940 but relatively simple syntax relative to other things we've seen. 01:09:58.940 --> 01:10:03.590 The SELECT command in SQL lets you select one or more columns 01:10:03.590 --> 01:10:06.710 from your table by the given name. 01:10:06.710 --> 01:10:10.040 So we'll see this now in just a moment here. 01:10:10.040 --> 01:10:11.460 How might I go about doing this? 01:10:11.460 --> 01:10:15.170 Well, let me go ahead and now, at my prompt after just clearing the window 01:10:15.170 --> 01:10:17.340 to keep things neat, let me try this out. 01:10:17.340 --> 01:10:26.090 Let me go ahead and SELECT, let's say, title FROM shows;. 01:10:26.090 --> 01:10:27.290 So why am I doing this? 
01:10:27.290 --> 01:10:29.800 Well, again, the conventional format for the SELECT command 01:10:29.800 --> 01:10:33.400 is to say SELECT, then the name of one or more columns, then 01:10:33.400 --> 01:10:37.240 literally the preposition FROM, and then the name of the table from which you 01:10:37.240 --> 01:10:38.840 want to select that data. 01:10:38.840 --> 01:10:43.390 So if my table is called shows and the column is called title, 01:10:43.390 --> 01:10:46.930 it stands to reason that SELECT title FROM shows should give me 01:10:46.930 --> 01:10:48.100 back the data I want. 01:10:48.100 --> 01:10:50.080 Now, notice a couple of stylistic choices 01:10:50.080 --> 01:10:52.630 that aren't strictly required but are good style. 01:10:52.630 --> 01:10:56.470 Conventionally, I would capitalize any SQL keywords, 01:10:56.470 --> 01:10:59.470 including SELECT and FROM in this case, and then 01:10:59.470 --> 01:11:03.610 lowercase anything that's a column name or a table name, 01:11:03.610 --> 01:11:07.023 assuming you created those columns and tables in, in fact, lowercase. 01:11:07.023 --> 01:11:08.690 There's different conventions out there. 01:11:08.690 --> 01:11:09.815 Some people will uppercase. 01:11:09.815 --> 01:11:12.950 Some people will use something called camel case or snake case or the like. 01:11:12.950 --> 01:11:15.220 But generally speaking, I would encourage all caps 01:11:15.220 --> 01:11:19.180 for SQL syntax and lowercase for the column and table names. 01:11:19.180 --> 01:11:21.190 I'm going to go ahead now and hit Enter. 01:11:21.190 --> 01:11:22.060 And voila. 01:11:22.060 --> 01:11:26.950 We see rapidly a whole list of values outputted from the database. 01:11:26.950 --> 01:11:30.790 And if you think way back, you might recognize that this actually 01:11:30.790 --> 01:11:35.200 happens to be the same order as before, because the CSV 01:11:35.200 --> 01:11:39.010 file was loaded top to bottom into this same database table. 
01:11:39.010 --> 01:11:42.370 And so what we're seeing, in fact, is all of that same data, duplicates 01:11:42.370 --> 01:11:46.030 and miscapitalizations and weird spacing and all. 01:11:46.030 --> 01:11:48.790 But suppose I want to see all of the data from the CSV. 01:11:48.790 --> 01:11:51.160 Well, it turns out you can select multiple columns. 01:11:51.160 --> 01:11:54.478 You can select not only title, but maybe timestamp was of interest. 01:11:54.478 --> 01:11:56.770 And this one admittedly was capitalized, because that's 01:11:56.770 --> 01:11:58.360 what it was in the spreadsheet. 01:11:58.360 --> 01:12:00.290 That was not something I chose manually. 01:12:00.290 --> 01:12:02.800 So if I just use a comma-separated list of column names, 01:12:02.800 --> 01:12:04.090 notice what I can do now. 01:12:04.090 --> 01:12:07.790 It's a little hard to see for us humans, because there's a lot going on now. 01:12:07.790 --> 01:12:10.120 But notice that in double quotes on the left, 01:12:10.120 --> 01:12:14.170 there are all of the timestamps, which represent the time at which you all 01:12:14.170 --> 01:12:15.490 submitted your favorite shows. 01:12:15.490 --> 01:12:19.390 And on the right of the comma, there's another quoted string 01:12:19.390 --> 01:12:22.210 that is the title of the show that you liked, although SQLite 01:12:22.210 --> 01:12:27.070 omits the quotes if it's just a single word, like Friends, just by convention. 01:12:27.070 --> 01:12:29.290 In fact, if I want to get all of the columns, 01:12:29.290 --> 01:12:31.510 turns out there's some shorthand syntax for that. 01:12:31.510 --> 01:12:34.270 * is the so-called wild card operator. 01:12:34.270 --> 01:12:37.780 And it will get me all of the columns from left to right in my table. 01:12:37.780 --> 01:12:38.500 And voila. 01:12:38.500 --> 01:12:44.180 Now I see all of the data, including all of the genres, as well. 
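That same progression, one column, then two, then the * wild card for every column, can be sketched against a toy shows table. The rows here are made up for illustration; the lecture runs these queries in the sqlite3 command-line tool, but Python's sqlite3 module accepts the same SQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")
db.executemany("INSERT INTO shows VALUES (?, ?, ?)",
               [("10/26/2020 9:00:01", "Friends", "Comedy"),
                ("10/26/2020 9:00:02", "The Office", "Comedy")])

one = db.execute("SELECT title FROM shows").fetchall()             # one column
two = db.execute("SELECT Timestamp, title FROM shows").fetchall()  # two columns
every = db.execute("SELECT * FROM shows").fetchall()               # all columns
```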
01:12:44.180 --> 01:12:49.090 So now I effectively have three columns being outputted all at once here. 01:12:49.090 --> 01:12:51.520 Well, this is not that useful thus far. 01:12:51.520 --> 01:12:53.770 In fact, all I've been doing is really just outputting 01:12:53.770 --> 01:12:55.060 the contents of the CSV. 01:12:55.060 --> 01:12:58.990 But SQL's powerful because it comes with other features right out of the box, 01:12:58.990 --> 01:13:02.830 somewhat similar in spirit to functions that are built into Google Spreadsheets 01:13:02.830 --> 01:13:03.550 and Excel. 01:13:03.550 --> 01:13:06.110 But now we can use them ultimately in our own code. 01:13:06.110 --> 01:13:09.460 So functions like AVG, COUNT, DISTINCT, LOWER, MAX, MIN, 01:13:09.460 --> 01:13:13.540 and UPPER and bunches more, these are all functions built into SQL 01:13:13.540 --> 01:13:19.370 that you can use as part of your query to alter the data as it's coming back 01:13:19.370 --> 01:13:21.370 from the database-- not permanently, but as it's 01:13:21.370 --> 01:13:25.040 coming back to you-- so that it's in a format you actually care about. 01:13:25.040 --> 01:13:26.870 So for instance, one of my goals earlier 01:13:26.870 --> 01:13:29.680 was to get back just the distinct, the unique titles. 01:13:29.680 --> 01:13:32.620 And we had to write all that annoying code using a set 01:13:32.620 --> 01:13:35.560 and then add things to the set and then loop over it again, right? 01:13:35.560 --> 01:13:37.180 That was not a huge amount of code. 01:13:37.180 --> 01:13:40.840 But it definitely took us, what, 5, 10 minutes to get the job done at least. 01:13:40.840 --> 01:13:43.780 In SQL, you can do all of that in one breath. 01:13:43.780 --> 01:13:45.650 I'm going to go ahead now and do this. 01:13:45.650 --> 01:13:49.690 SELECT not just title FROM shows. 01:13:49.690 --> 01:13:54.370 Let me go ahead and SELECT DISTINCT title FROM shows. 
01:13:54.370 --> 01:13:57.640 So DISTINCT, again, is an available function in SQL 01:13:57.640 --> 01:13:58.900 that does what the name says. 01:13:58.900 --> 01:14:00.650 It's going to filter out all of the titles 01:14:00.650 --> 01:14:02.450 to just give me the distinct ones back. 01:14:02.450 --> 01:14:08.740 So if I hit Enter now, you'll see a similarly messy list but including-- 01:14:08.740 --> 01:14:10.810 "no idea," someone that doesn't watch TV-- 01:14:10.810 --> 01:14:14.230 including an unsorted list of those titles. 01:14:14.230 --> 01:14:18.130 So I think we can probably start to clean this thing up as we did before. 01:14:18.130 --> 01:14:20.950 Let me go ahead and now SELECT not just DISTINCT, 01:14:20.950 --> 01:14:23.660 but let me go ahead and uppercase everything as well. 01:14:23.660 --> 01:14:25.970 And I can use UPPER as another function. 01:14:25.970 --> 01:14:27.580 And notice I'm just nesting things. 01:14:27.580 --> 01:14:30.247 The output of one function, as we've seen in many languages now, 01:14:30.247 --> 01:14:31.450 can be the input to another. 01:14:31.450 --> 01:14:32.830 Let me hit Enter now. 01:14:32.830 --> 01:14:36.610 And now it's getting a little more canonicalized, so to speak, 01:14:36.610 --> 01:14:39.190 because I'm using capitalization for everything. 01:14:39.190 --> 01:14:43.690 But it would seem that things still aren't really sorted. 01:14:43.690 --> 01:14:46.070 It's just the same order in which you inputted them 01:14:46.070 --> 01:14:48.370 but without duplicates this time. 01:14:48.370 --> 01:14:51.700 So it turns out that SQL has other syntax 01:14:51.700 --> 01:14:55.580 that we can use to make our queries more precise and more powerful. 
01:14:55.580 --> 01:14:57.640 So in addition to these kinds of functions 01:14:57.640 --> 01:15:00.340 that you can use to alter the data that's being shown to you 01:15:00.340 --> 01:15:04.570 and coming back, you can also use these kinds of clauses or syntax 01:15:04.570 --> 01:15:05.800 in SQL queries. 01:15:05.800 --> 01:15:09.130 You can say WHERE, which is the equivalent of a condition. 01:15:09.130 --> 01:15:13.600 You can say select all of this data where something is true or false. 01:15:13.600 --> 01:15:17.440 You can say LIKE, where you can say give me data that isn't exactly this 01:15:17.440 --> 01:15:18.520 but is like this. 01:15:18.520 --> 01:15:20.660 You can order the data by some column. 01:15:20.660 --> 01:15:23.210 You can limit the number of rows that come back. 01:15:23.210 --> 01:15:26.850 And you can group identical values together in some way. 01:15:26.850 --> 01:15:28.640 So let's see a few examples of this. 01:15:28.640 --> 01:15:32.055 Let me go back here and play around now with-- 01:15:32.055 --> 01:15:32.930 how about The Office? 01:15:32.930 --> 01:15:34.513 That was the one we looked at earlier. 01:15:34.513 --> 01:15:42.260 So let me go ahead and SELECT title FROM shows WHERE title = "The Office";. 01:15:42.260 --> 01:15:48.200 So I've added this WHERE predicate, so to speak, WHERE title = "The Office." 01:15:48.200 --> 01:15:49.190 So SQL's nice. 01:15:49.190 --> 01:15:52.670 Similar in spirit to Python, it's more user friendly, perhaps, 01:15:52.670 --> 01:15:55.910 than C where everything kind of sort of reads like an English sentence, 01:15:55.910 --> 01:15:58.230 even though it's a little more precise. 01:15:58.230 --> 01:15:59.880 And it's a little more succinct. 01:15:59.880 --> 01:16:01.130 Let me go ahead and hit Enter. 01:16:01.130 --> 01:16:02.120 And voila. 01:16:02.120 --> 01:16:05.850 That's how many of you inputted The Office. 01:16:05.850 --> 01:16:08.520 But notice it's not everyone, is it? 
01:16:08.520 --> 01:16:10.050 We're missing some still. 01:16:10.050 --> 01:16:14.070 It seems that I got back only those of you who typed in literally 01:16:14.070 --> 01:16:16.710 "The Office," capital T, capital O. 01:16:16.710 --> 01:16:19.200 So what if I want to be a little more resilient than that? 01:16:19.200 --> 01:16:23.280 Well, let me get back any rows where you all typed in "office." 01:16:23.280 --> 01:16:26.820 Maybe you omitted the article "the." 01:16:26.820 --> 01:16:30.390 So let me go ahead and say not title = "Office." 01:16:30.390 --> 01:16:33.780 but let me go ahead and say where the title is like "Office." 01:16:33.780 --> 01:16:35.490 But I don't want it to just be "office." 01:16:35.490 --> 01:16:39.120 I want to allow for maybe some stuff at the beginning, maybe some stuff 01:16:39.120 --> 01:16:39.673 at the end. 01:16:39.673 --> 01:16:42.090 And even though that seems like a bit of an inconsistency, 01:16:42.090 --> 01:16:46.950 in the context of using LIKE, there's another wild card character. 01:16:46.950 --> 01:16:51.390 The percent sign represents zero or more characters to the left. 01:16:51.390 --> 01:16:55.410 And this percent sign represents zero or more characters to the right. 01:16:55.410 --> 01:16:59.940 So it's kind of this catchall that will now find me all titles that somewhere 01:16:59.940 --> 01:17:02.980 have O-F-F-I-C-E inside of them. 01:17:02.980 --> 01:17:04.778 And it turns out LIKE is case insensitive, 01:17:04.778 --> 01:17:07.320 so I don't even need to worry about capitalization with LIKE. 01:17:07.320 --> 01:17:08.610 Now let me hit Enter. 01:17:08.610 --> 01:17:09.450 And voila. 01:17:09.450 --> 01:17:10.890 Now I get back more answers. 01:17:10.890 --> 01:17:12.780 And you can really see the messiness now. 01:17:12.780 --> 01:17:15.900 Notice up here one of you used lowercase. 01:17:15.900 --> 01:17:18.450 That tends to be common when typing things in quickly. 
01:17:18.450 --> 01:17:21.270 One of you did it lowercase here and then also gave 01:17:21.270 --> 01:17:23.160 us an extra white space at the end. 01:17:23.160 --> 01:17:24.900 One of you just typed in "office." 01:17:24.900 --> 01:17:27.540 One of you typed in "the office" again with a space at the end. 01:17:27.540 --> 01:17:29.200 And so there's a lot of variation here. 01:17:29.200 --> 01:17:31.560 And that's why, when we forced everything to uppercase 01:17:31.560 --> 01:17:34.650 and we started trimming things, we were able to get rid 01:17:34.650 --> 01:17:37.440 of a lot of those redundancies. 01:17:37.440 --> 01:17:40.290 Well, in fact, let's go ahead and order this now. 01:17:40.290 --> 01:17:44.040 So let me go back to selecting the distinct uppercase title, 01:17:44.040 --> 01:17:51.060 so SELECT DISTINCT UPPER of title FROM shows. 01:17:51.060 --> 01:17:56.220 And let me now ORDER BY, which is a new clause, the uppercased version 01:17:56.220 --> 01:17:57.868 of title. 01:17:57.868 --> 01:17:59.910 So now notice there's a few things going on here. 01:17:59.910 --> 01:18:01.710 But I'm just building up more complicated queries 01:18:01.710 --> 01:18:04.260 similar to Scratch, where we just started throwing more and more puzzle 01:18:04.260 --> 01:18:05.340 pieces at a problem. 01:18:05.340 --> 01:18:10.530 I'm selecting all of the distinct uppercase titles from the shows table. 01:18:10.530 --> 01:18:13.050 But I'm going to order the results this time 01:18:13.050 --> 01:18:15.780 by the uppercased version of title. 01:18:15.780 --> 01:18:17.550 So everything is going to be uppercased. 01:18:17.550 --> 01:18:20.460 And then it's going to be sorted A through Z. Hit Enter now, 01:18:20.460 --> 01:18:23.160 and now things are a little easier to make sense of. 01:18:23.160 --> 01:18:26.970 Notice the quotes are there only when there are multiple words in a title. 01:18:26.970 --> 01:18:29.400 Otherwise, sqlite3 doesn't bother showing us. 
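The refinements just shown, WHERE with =, LIKE with % wild cards, and ORDER BY, behave like this in a sketch with made-up rows (including a stray trailing space, as in the class's data):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany("INSERT INTO shows VALUES (?)",
               [("The Office",), ("the office ",), ("Office",), ("Friends",)])

# WHERE with = matches only the exact capitalization and spacing.
exact = db.execute(
    "SELECT title FROM shows WHERE title = 'The Office'").fetchall()

# LIKE with % wild cards matches zero or more characters on either side,
# and LIKE is case-insensitive (for ASCII characters) in SQLite.
fuzzy = db.execute(
    "SELECT title FROM shows WHERE title LIKE '%office%'").fetchall()

# ORDER BY sorts the results; combined with DISTINCT and UPPER, the output
# is a deduplicated, alphabetized list of canonicalized titles.
ordered = db.execute(
    "SELECT DISTINCT UPPER(title) FROM shows ORDER BY UPPER(title)").fetchall()
```

Note that the trailing space still survives UPPER and ORDER BY, which is why the lecture keeps mentioning trimming as a separate cleanup step.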
01:18:29.400 --> 01:18:32.190 But notice here's all the "the" shows. 01:18:32.190 --> 01:18:36.270 And if we keep scrolling up, the P's, the N's, the M's, the L's, and so 01:18:36.270 --> 01:18:41.190 forth-- it's indeed alphabetized thanks to using ORDER BY. 01:18:41.190 --> 01:18:41.950 All right. 01:18:41.950 --> 01:18:45.540 Well, let's start to solve more similar problems now in SQL 01:18:45.540 --> 01:18:49.830 by writing way less code than we did a bit ago in Python. 01:18:49.830 --> 01:18:54.780 Suppose I want to actually figure out the counts of these most popular shows. 01:18:54.780 --> 01:18:58.050 So I want to combine all of the identical shows 01:18:58.050 --> 01:19:00.510 and figure out all of the corresponding counts. 01:19:00.510 --> 01:19:02.330 Well, let me go ahead and try this. 01:19:02.330 --> 01:19:07.932 Let me go ahead and SELECT again the uppercased version of title. 01:19:07.932 --> 01:19:10.140 But I'm not going to do DISTINCT this time, because I 01:19:10.140 --> 01:19:11.830 want to do that a little differently. 01:19:11.830 --> 01:19:13.650 I'm going to SELECT the uppercased version 01:19:13.650 --> 01:19:16.510 of title, the COUNT of those titles-- 01:19:16.510 --> 01:19:19.320 so the number of times a given title appears, so COUNT 01:19:19.320 --> 01:19:20.610 is a new keyword now-- 01:19:20.610 --> 01:19:22.080 FROM shows. 01:19:22.080 --> 01:19:25.860 But now how do I figure out what the count is? 01:19:25.860 --> 01:19:29.700 Well, if you think about this table as having a lot of titles-- 01:19:29.700 --> 01:19:31.930 title, title, title, title, title-- 01:19:31.930 --> 01:19:35.970 it would be nice to kind of group the identical titles together 01:19:35.970 --> 01:19:42.460 and then actually count how many such titles we grouped together. 01:19:42.460 --> 01:19:47.710 And the syntax for that is literally to say GROUP BY UPPER(title);. 
01:19:47.710 --> 01:19:51.130 This tells SQL to group all of the uppercased titles 01:19:51.130 --> 01:19:53.860 together, kind of collapse multiple rows into one, 01:19:53.860 --> 01:19:58.990 but keep track of the count of titles after that collapse. 01:19:58.990 --> 01:20:01.810 Let me go ahead now and hit Enter. 01:20:01.810 --> 01:20:05.980 And you'll see, very similar to one of the earlier Python programs we wrote, 01:20:05.980 --> 01:20:10.040 all of the titles on the left followed by a comma, followed by the count. 01:20:10.040 --> 01:20:11.920 So one of you really likes Tom and Jerry. 01:20:11.920 --> 01:20:14.470 One of you really likes Top Gear. 01:20:14.470 --> 01:20:17.140 If I scroll up, though, two of you really liked The Wire. 01:20:17.140 --> 01:20:19.930 23 of you here like The Office, although we still 01:20:19.930 --> 01:20:22.010 haven't dealt with the whitespace issue here. 01:20:22.010 --> 01:20:25.180 So we could still combine that further by trimming whitespace if we want. 01:20:25.180 --> 01:20:27.040 But now we're getting these kinds of counts. 01:20:27.040 --> 01:20:32.510 Well, how can I go ahead and order this, as we did before? 01:20:32.510 --> 01:20:39.820 Let me go ahead here and add ORDER BY COUNT of title 01:20:39.820 --> 01:20:42.010 and then hit semicolon now. 01:20:42.010 --> 01:20:45.310 And now notice, just as in Python, everything 01:20:45.310 --> 01:20:47.800 is from smallest to largest initially, with Game of Thrones 01:20:47.800 --> 01:20:49.180 here down on the bottom. 01:20:49.180 --> 01:20:50.360 How can I fix this? 01:20:50.360 --> 01:20:53.890 Well, it turns out you can order things in descending order, 01:20:53.890 --> 01:20:58.510 D-E-S-C for short instead of A-S-C, which is the default for ascending-- 01:20:58.510 --> 01:21:02.110 so if I do it in descending order, now I'd have to scroll all the way back up 01:21:02.110 --> 01:21:07.480 to the A's, the very top, to see where the lines begin. 
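The GROUP BY, COUNT, and descending ORDER BY steps just walked through can be sketched together, with a few made-up rows standing in for the survey data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT)")
conn.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office",), ("THE OFFICE",),
     ("The Wire",), ("The Wire",), ("Top Gear",)],
)

# GROUP BY collapses identical (uppercased) titles into one row each,
# COUNT reports how many rows were collapsed into each group, and
# DESC sorts the most popular titles first.
rows = conn.execute(
    "SELECT UPPER(title), COUNT(title) FROM shows "
    "GROUP BY UPPER(title) ORDER BY COUNT(title) DESC"
).fetchall()
print(rows)  # [('THE OFFICE', 3), ('THE WIRE', 2), ('TOP GEAR', 1)]
```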
01:21:07.480 --> 01:21:09.420 Whoops. 01:21:09.420 --> 01:21:13.020 If I scroll all the way back up to the top, we'll see where all of the A words 01:21:13.020 --> 01:21:14.610 begin up here. 01:21:14.610 --> 01:21:17.252 And now if I want to-- 01:21:17.252 --> 01:21:18.210 whoops, whoops, whoops. 01:21:18.210 --> 01:21:20.190 Did I do that right? 01:21:20.190 --> 01:21:20.690 Sorry. 01:21:20.690 --> 01:21:21.950 I don't want to-- 01:21:21.950 --> 01:21:23.827 there we go, ORDER BY COUNT descending. 01:21:23.827 --> 01:21:26.660 Now let me go ahead and-- this is just a little too unwieldy to see. 01:21:26.660 --> 01:21:29.035 Let me just limit myself to the top 10 and keep it simple 01:21:29.035 --> 01:21:30.920 and only look at the top 10 values here. 01:21:30.920 --> 01:21:31.730 Voila. 01:21:31.730 --> 01:21:36.585 Now I have Game of Thrones at 33, Friends at 26, The Office at 23-- 01:21:36.585 --> 01:21:38.210 though I think I'm still missing a few. 01:21:38.210 --> 01:21:41.660 Brian, do you recall the SQL function for trimming leading and trailing 01:21:41.660 --> 01:21:43.410 white space? 01:21:43.410 --> 01:21:44.785 BRIAN YU: I think it's just TRIM. 01:21:44.785 --> 01:21:45.660 DAVID J. MALAN: TRIM? 01:21:45.660 --> 01:21:46.340 OK. 01:21:46.340 --> 01:21:47.577 I myself did not remember. 01:21:47.577 --> 01:21:49.160 So when in doubt, google or ask Brian. 01:21:49.160 --> 01:21:51.000 So let me go ahead and fix this. 01:21:51.000 --> 01:21:55.670 Let me go ahead and SELECT uppercase of trimming the title first. 01:21:55.670 --> 01:22:00.840 And then I'm going to GROUP BY trimming and then uppercasing it there. 01:22:00.840 --> 01:22:02.372 And now Enter, and voila. 01:22:02.372 --> 01:22:03.080 Thank you, Brian. 01:22:03.080 --> 01:22:07.020 So now we're up to our 26 Offices here. 01:22:07.020 --> 01:22:09.110 So in short, it took us a little while to get 01:22:09.110 --> 01:22:10.880 to this point in the story in SQL. 
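The TRIM fix that Brian suggests can be sketched as follows; the three messy variants below are invented, but mirror the kind of whitespace the form produced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT)")
conn.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office ",), (" THE OFFICE",)],
)

# TRIM strips leading and trailing whitespace before uppercasing,
# so all three messy variants now fall into a single group.
rows = conn.execute(
    "SELECT UPPER(TRIM(title)), COUNT(*) FROM shows GROUP BY UPPER(TRIM(title))"
).fetchall()
print(rows)  # [('THE OFFICE', 3)]
```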
01:22:10.880 --> 01:22:12.020 But notice what we've done. 01:22:12.020 --> 01:22:14.210 We've taken a program that took us a few minutes 01:22:14.210 --> 01:22:16.790 and certainly a dozen or more lines of code. 01:22:16.790 --> 01:22:20.300 And we've distilled it into something that, yes, is a new language 01:22:20.300 --> 01:22:22.310 but is just kind of a one liner. 01:22:22.310 --> 01:22:24.888 And once you get comfortable with a language like SQL, 01:22:24.888 --> 01:22:27.680 especially if you're not even a computer scientist but maybe a data 01:22:27.680 --> 01:22:31.130 scientist or an analyst of some sort who spends a lot of their day looking 01:22:31.130 --> 01:22:33.470 at financial information or medical information 01:22:33.470 --> 01:22:37.070 or really any data set that can be loaded into rows and columns, 01:22:37.070 --> 01:22:41.150 once you start to speak and read SQL as a human, 01:22:41.150 --> 01:22:44.990 you can start to express some pretty powerful queries relatively succinctly 01:22:44.990 --> 01:22:47.390 and, boom, get back your answer. 01:22:47.390 --> 01:22:50.000 And by using a command line program, like sqlite3, 01:22:50.000 --> 01:22:53.540 you can immediately see the results there, albeit as very simplistic text. 01:22:53.540 --> 01:22:56.690 But as mentioned, too, there's also some graphical programs 01:22:56.690 --> 01:23:00.117 out there, free and commercial, that also support SQL, where you can still 01:23:00.117 --> 01:23:00.950 type these commands. 01:23:00.950 --> 01:23:03.770 And then it will show it to you in a more user-friendly way, much 01:23:03.770 --> 01:23:07.790 like Windows or macOS would by default. 01:23:07.790 --> 01:23:16.058 So any questions now on the syntax or capabilities of SELECT statements? 01:23:16.058 --> 01:23:17.350 BRIAN YU: One question came in. 01:23:17.350 --> 01:23:20.450 Where is the file with this data actually being stored? 01:23:20.450 --> 01:23:21.700 DAVID J. MALAN: Good question. 
01:23:21.700 --> 01:23:24.030 Where is the file actually being stored? 01:23:24.030 --> 01:23:27.460 So before quitting, I can actually save this file as anything 01:23:27.460 --> 01:23:30.043 I want. The file extension would typically be .db. 01:23:30.043 --> 01:23:31.960 And in fact, Brian, do you mind just checking? 01:23:31.960 --> 01:23:34.930 What's the syntax for writing the file manually with dot something? 01:23:34.930 --> 01:23:36.910 It would be under .help, I think. 01:23:36.910 --> 01:23:39.550 BRIAN YU: I think it's .save followed by the name of the file. 01:23:39.550 --> 01:23:43.240 DAVID J. MALAN: .save, so I'll call this shows.db, Enter. 01:23:43.240 --> 01:23:46.600 If I now go ahead and open up another terminal window and type 01:23:46.600 --> 01:23:49.990 our old friend ls, you'll see that now I have a CSV file. 01:23:49.990 --> 01:23:51.760 I have my Python file from before. 01:23:51.760 --> 01:23:54.790 And I have a new file called shows.db, which I've created. 01:23:54.790 --> 01:24:00.910 That is the binary file that contains the table that I've loaded dynamically 01:24:00.910 --> 01:24:04.700 in from that CSV file. 01:24:04.700 --> 01:24:08.810 Any other questions on SELECT queries or what we can do with them? 01:24:08.810 --> 01:24:12.620 BRIAN YU: Yeah, a few people are asking about what the runtime of this is. 01:24:12.620 --> 01:24:14.430 DAVID J. MALAN: Yeah, really good question. 01:24:14.430 --> 01:24:15.170 What is the runtime? 01:24:15.170 --> 01:24:18.253 I'm going to come back to that question in just a little bit if that's OK. 01:24:18.253 --> 01:24:20.960 Right now, it's admittedly big O of n. 01:24:20.960 --> 01:24:23.390 I've not actually done anything better than we did 01:24:23.390 --> 01:24:26.090 with our CSV file or our Python code. 01:24:26.090 --> 01:24:28.040 Right now, it's still big O of n by default. 
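The same idea of persisting a database to a single binary file can be sketched from Python, where connecting with a filename (rather than ":memory:") writes to disk much as .save does in the sqlite3 shell; the path and sample row here are made up:

```python
import os
import sqlite3
import tempfile

# Connecting with a filename stores the database in a single binary
# file on disk, rather than keeping it only in memory.
path = os.path.join(tempfile.mkdtemp(), "shows.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE shows (title TEXT)")
conn.execute("INSERT INTO shows (title) VALUES ('The Muppet Show')")
conn.commit()
conn.close()

# Reopening the same file shows the data survived the round trip.
conn = sqlite3.connect(path)
rows = conn.execute("SELECT title FROM shows").fetchall()
print(rows)  # [('The Muppet Show',)]
```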
01:24:28.040 --> 01:24:30.230 But there's going to be a better answer to that 01:24:30.230 --> 01:24:33.030 that's going to make it something much more logarithmic. 01:24:33.030 --> 01:24:36.687 So let me come back to that feature when it's time to enable it. 01:24:36.687 --> 01:24:39.020 But in fact, let's start to take some steps toward that. 01:24:39.020 --> 01:24:40.812 Because it turns out, when loading in data, 01:24:40.812 --> 01:24:42.853 we're not always going to have the luxury of just 01:24:42.853 --> 01:24:44.900 having one big file in CSV format that we import, 01:24:44.900 --> 01:24:46.070 and we go about our business. 01:24:46.070 --> 01:24:47.780 We're going to have to decide in advance how 01:24:47.780 --> 01:24:50.210 we want to store the data and what data we want to store 01:24:50.210 --> 01:24:53.120 and what the relationships might be across not one 01:24:53.120 --> 01:24:55.278 single table, but multiple tables. 01:24:55.278 --> 01:24:57.320 So let me go ahead and run one other command here 01:24:57.320 --> 01:25:00.170 that actually introduces the first of a few problems. 01:25:00.170 --> 01:25:03.830 Let me go ahead and SELECT title FROM shows 01:25:03.830 --> 01:25:07.160 WHERE genres equals, for instance, "Comedy." 01:25:07.160 --> 01:25:08.570 That was one of the genres. 01:25:08.570 --> 01:25:11.690 And notice that we get back a whole bunch of results. 01:25:11.690 --> 01:25:14.300 But I bet I'm missing some. 01:25:14.300 --> 01:25:16.470 I'm skimming through this pretty quickly. 01:25:16.470 --> 01:25:19.880 But I bet I'm missing some, because if I check if genres 01:25:19.880 --> 01:25:21.872 = "Comedy," what am I omitting? 01:25:21.872 --> 01:25:24.830 Well, those of you who checked multiple boxes might have said something 01:25:24.830 --> 01:25:28.310 is a comedy and a drama or comedy and romance 01:25:28.310 --> 01:25:30.800 or maybe a couple of other permutations of genres. 
01:25:30.800 --> 01:25:34.070 If I'm searching for equality here, = "Comedy," 01:25:34.070 --> 01:25:37.880 I'm only going to get those favorites from you where you only said, 01:25:37.880 --> 01:25:40.250 my favorite TV show is a comedy. 01:25:40.250 --> 01:25:48.113 But what if we want to do something like LIKE comedy instead? 01:25:48.113 --> 01:25:50.030 And we could say something like, well, so long 01:25:50.030 --> 01:25:54.290 as the word "comedy" is in there, then we should get back even more results. 01:25:54.290 --> 01:25:57.480 And let me stipulate that, indeed, I now have a longer list of results. 01:25:57.480 --> 01:26:01.010 Now we have all shows where you checked at least the Comedy box. 01:26:01.010 --> 01:26:03.770 But unfortunately, this starts to get a little sloppy, 01:26:03.770 --> 01:26:06.410 because recall what the Genres column looks like. 01:26:06.410 --> 01:26:07.730 SELECT. 01:26:07.730 --> 01:26:11.150 Let me SELECT genres FROM shows;. 01:26:11.150 --> 01:26:16.010 Notice that all of the genres that we loaded into this table from the CSV 01:26:16.010 --> 01:26:20.030 file are a comma-separated list of genres. 01:26:20.030 --> 01:26:22.070 That's just the way Google Forms did it. 01:26:22.070 --> 01:26:24.320 And that's fine for CSV purposes. 01:26:24.320 --> 01:26:28.310 That's kind of fine for SQL purposes, but this is kind of messy. 01:26:28.310 --> 01:26:31.700 Generally speaking, storing comma-separated lists 01:26:31.700 --> 01:26:35.840 of values in a SQL database is not what you should be doing. 01:26:35.840 --> 01:26:41.030 The whole point of using a SQL database is to move away from commas and CSVs 01:26:41.030 --> 01:26:42.860 and to actually store things more cleanly. 01:26:42.860 --> 01:26:45.920 Because in fact, let me propose a problem. 
01:26:45.920 --> 01:26:50.540 Suppose I want to search not for comedy but maybe also 01:26:50.540 --> 01:26:55.520 music, like this, thereby allowing me to find any shows where 01:26:55.520 --> 01:26:59.990 the word "music" is somewhere in the comma-separated list. 01:26:59.990 --> 01:27:01.940 There's a subtle bug here. 01:27:01.940 --> 01:27:05.690 And you might have to think back to where we began, the form 01:27:05.690 --> 01:27:07.910 that you pulled up. 01:27:07.910 --> 01:27:09.860 I can't show the whole thing here, but we 01:27:09.860 --> 01:27:14.060 started with action, adventure, animation, biography, dot, dot, dot, 01:27:14.060 --> 01:27:15.620 music. 01:27:15.620 --> 01:27:18.500 Musical was also there, which is distinct. 01:27:18.500 --> 01:27:22.700 A music video versus a musical are two different types of genres. 01:27:22.700 --> 01:27:25.250 But notice my query at the moment. 01:27:25.250 --> 01:27:26.930 What's problematic with this? 01:27:26.930 --> 01:27:31.070 At the moment, we would seem to have a bug whereby this query will select 01:27:31.070 --> 01:27:34.370 not only "music," but also "musical." 01:27:34.370 --> 01:27:36.620 And so this is just where things are getting messy. 01:27:36.620 --> 01:27:37.400 Now, yeah, you know what? 01:27:37.400 --> 01:27:38.810 We could kind of clean this up. 01:27:38.810 --> 01:27:43.790 Maybe we could put a comma here so that it can't just be music something. 01:27:43.790 --> 01:27:45.410 It has to be music comma. 01:27:45.410 --> 01:27:47.840 But what if music is the last box that you checked? 01:27:47.840 --> 01:27:49.310 Well, then it's music nothing. 01:27:49.310 --> 01:27:50.210 There is no comma. 01:27:50.210 --> 01:27:52.262 So now I need to OR things together. 01:27:52.262 --> 01:27:54.470 So maybe I have to do something like WHERE genres LIKE "%Music,%" 01:27:54.470 --> 01:28:00.800 like this OR genres LIKE "%Music" like this. 01:28:00.800 --> 01:28:02.750 But honestly, this is just getting messy. 
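The music-versus-musical bug and the comma hack just described can be sketched side by side; the three shows and their genre strings below are invented examples, not the survey's real rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
conn.executemany(
    "INSERT INTO shows (title, genres) VALUES (?, ?)",
    [("Glee", "Comedy, Music"),
     ("Smash", "Drama, Musical"),
     ("Treme", "Drama, Music")],
)

# The naive pattern also matches "Musical", since "Musical" contains
# the substring "Music".
buggy = conn.execute(
    "SELECT title FROM shows WHERE genres LIKE '%Music%'"
).fetchall()
print(len(buggy))  # 3, including the Musical row we did not want

# The hack: "Music" must be followed by a comma OR end the string.
hacky = conn.execute(
    "SELECT title FROM shows "
    "WHERE genres LIKE '%Music,%' OR genres LIKE '%Music'"
).fetchall()
print(len(hacky))  # 2
```

Even the "fixed" version remains fragile, which is exactly the argument against comma-separated lists in a column.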
01:28:02.750 --> 01:28:04.040 This is poorly designed. 01:28:04.040 --> 01:28:07.220 If you're just storing your data as a comma-separated list of values inside 01:28:07.220 --> 01:28:11.010 of a column and you have to resort to this kind of hack to figure out, 01:28:11.010 --> 01:28:13.130 well, maybe it's over here or here or here, 01:28:13.130 --> 01:28:16.640 and thinking about all the permutations of syntax, you're doing it wrong. 01:28:16.640 --> 01:28:20.130 You're not using a SQL database to its fullest potential. 01:28:20.130 --> 01:28:22.490 So how do we go about designing this thing better 01:28:22.490 --> 01:28:26.690 and actually load this CSV into a database a little more cleanly? 01:28:26.690 --> 01:28:31.820 In short, how do we get rid of the stupid commas in the Genres column 01:28:31.820 --> 01:28:36.740 and instead put one word, "comedy" or "music" or "musical," 01:28:36.740 --> 01:28:38.930 in each of those cells, so to speak? 01:28:38.930 --> 01:28:40.250 Not two, not three-- 01:28:40.250 --> 01:28:43.820 one only, without throwing away some of those genres. 01:28:43.820 --> 01:28:46.730 Well, let me introduce a few building blocks that will get us there. 01:28:46.730 --> 01:28:48.680 It turns out, when creating your own tables 01:28:48.680 --> 01:28:51.260 and loading data into a database on your own, 01:28:51.260 --> 01:28:53.375 we're going to need more than just SELECT. 01:28:53.375 --> 01:28:55.220 SELECT, of course, is just for reading. 01:28:55.220 --> 01:28:59.330 But if we're going to do this better and not just use sqlite3's built-in 01:28:59.330 --> 01:29:04.880 .import command, but instead we're going to write some code to load all 01:29:04.880 --> 01:29:07.580 of our data into maybe two tables-- 01:29:07.580 --> 01:29:10.100 one for the titles, one for the genres-- 01:29:10.100 --> 01:29:15.680 we're going to need a little more expressiveness when it comes to SQL. 
01:29:15.680 --> 01:29:17.990 And so for that, we're going to need, one, the ability 01:29:17.990 --> 01:29:19.113 to create our own tables. 01:29:19.113 --> 01:29:20.780 And we've seen a glimpse of this before. 01:29:20.780 --> 01:29:23.280 But we're also going to need to see another piece of syntax, 01:29:23.280 --> 01:29:24.500 as well, so inserting. 01:29:24.500 --> 01:29:29.060 Inserting is another command that you can execute on a SQL database 01:29:29.060 --> 01:29:32.720 in order to actually add data to a database, which is great. 01:29:32.720 --> 01:29:38.630 Because if I want to ultimately iterate over that same CSV but, this time, 01:29:38.630 --> 01:29:43.075 manually add all of the rows to the database myself, 01:29:43.075 --> 01:29:45.200 well, then I'm going to need some way of inserting. 01:29:45.200 --> 01:29:46.850 And the syntax for that is as follows. 01:29:46.850 --> 01:29:50.720 INSERT INTO the name of the table, the column or columns 01:29:50.720 --> 01:29:54.890 that you want to insert values into, then literally the word VALUES, 01:29:54.890 --> 01:29:58.787 and then literally in parentheses again, the actual list of values. 01:29:58.787 --> 01:30:01.370 So it's a little abstract when we see it in this generic form. 01:30:01.370 --> 01:30:06.480 But we'll see this more explicitly in just a moment here, as well. 01:30:06.480 --> 01:30:09.483 So when it comes to inserting something into a database, 01:30:09.483 --> 01:30:10.650 let's go ahead and try this. 01:30:10.650 --> 01:30:13.100 So suppose that-- let's see. 01:30:13.100 --> 01:30:15.080 What's a show that-- 01:30:15.080 --> 01:30:15.985 The Muppet Show. 01:30:15.985 --> 01:30:17.360 I grew up loving The Muppet Show. 01:30:17.360 --> 01:30:18.650 It was out in, like, the '70s. 01:30:18.650 --> 01:30:21.680 And I don't think it was on the list, but I can check this for sure. 
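The generic INSERT INTO syntax just laid out can be made concrete with a small sketch; the table and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT, genres TEXT)")

# INSERT INTO table (columns) VALUES (values) -- written here with ?
# placeholders, which is a safer habit than pasting values into the
# query string itself.
conn.execute(
    "INSERT INTO shows (title, genres) VALUES (?, ?)",
    ("The Muppet Show", "Comedy, Musical"),
)

rows = conn.execute("SELECT * FROM shows").fetchall()
print(rows)  # [('The Muppet Show', 'Comedy, Musical')]
```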
01:30:21.680 --> 01:30:28.100 So SELECT * FROM shows WHERE title LIKE-- 01:30:28.100 --> 01:30:30.950 let's just search for "muppets" with a wild card. 01:30:30.950 --> 01:30:32.500 And I'm guessing no one put it there. 01:30:32.500 --> 01:30:33.000 Good. 01:30:33.000 --> 01:30:34.320 So it's a missed opportunity. 01:30:34.320 --> 01:30:35.570 I forgot to fill out the form. 01:30:35.570 --> 01:30:37.820 I could go back and fill out the form and re-import the CSV, 01:30:37.820 --> 01:30:39.487 but let's go ahead and do this manually. 01:30:39.487 --> 01:30:44.420 So let me go ahead and INSERT INTO shows what columns? 01:30:44.420 --> 01:30:50.360 title and genres, and I guess I could do a Timestamp just for kicks. 01:30:50.360 --> 01:30:52.220 And then I'm going to insert what values? 01:30:52.220 --> 01:30:55.430 The values will be, well, I don't know, whatever time it is now. 01:30:55.430 --> 01:30:58.460 So I'm going to cheat there rather than look up the date and the time. 01:30:58.460 --> 01:31:01.430 The title will be "The Muppet Show." 01:31:01.430 --> 01:31:05.100 And the genres will be-- it was kind of a comedy. 01:31:05.100 --> 01:31:06.290 It was kind of a musical. 01:31:06.290 --> 01:31:08.360 So we'll kind of leave it at that. 01:31:08.360 --> 01:31:09.350 Semicolon. 01:31:09.350 --> 01:31:11.870 So again, this follows the standard syntax here 01:31:11.870 --> 01:31:14.030 of specifying the table you want to insert into, 01:31:14.030 --> 01:31:16.910 the columns you want to insert into, and the values 01:31:16.910 --> 01:31:18.467 you want to put into those columns. 01:31:18.467 --> 01:31:20.300 And I'm going to go ahead and hit Enter now. 01:31:20.300 --> 01:31:22.250 Nothing seems to have happened. 01:31:22.250 --> 01:31:28.070 But if I now select that same query-- 01:31:28.070 --> 01:31:32.630 oh, OK, it's still nothing, because I made a subtle mistake. 01:31:32.630 --> 01:31:34.700 I'm not searching for "Muppets," plural. 
01:31:34.700 --> 01:31:37.250 I'm searching for "Muppet," singular, The Muppet Show. 01:31:37.250 --> 01:31:38.000 Voila. 01:31:38.000 --> 01:31:40.790 Now you see my row in this database. 01:31:40.790 --> 01:31:42.680 And so INSERT would give us the ability now 01:31:42.680 --> 01:31:44.570 to insert new rows into the database. 01:31:44.570 --> 01:31:48.410 Suppose you want to update something. 01:31:48.410 --> 01:31:51.540 You know, some of the Muppet Shows were actually pretty dramatic. 01:31:51.540 --> 01:31:52.710 So how might we do that? 01:31:52.710 --> 01:31:56.960 Well, I can say UPDATE shows SET-- 01:31:56.960 --> 01:32:04.250 let's see-- genres = "Comedy, Drama, Musical" WHERE 01:32:04.250 --> 01:32:07.910 title = "The Muppet Show." 01:32:07.910 --> 01:32:10.890 So again, I'll pull up the canonical syntax for this in a bit. 01:32:10.890 --> 01:32:14.120 But for now, just a little teaser, you can update things pretty simply. 01:32:14.120 --> 01:32:16.662 And even though it takes a little getting used to the syntax, 01:32:16.662 --> 01:32:17.960 it kind of does what it says. 01:32:17.960 --> 01:32:23.250 UPDATE shows SET genres = this WHERE title = that. 01:32:23.250 --> 01:32:24.650 And now I can go ahead and Enter. 01:32:24.650 --> 01:32:27.290 If I go ahead and select the same thing, just like in a terminal window, 01:32:27.290 --> 01:32:28.250 you can go up and down. 01:32:28.250 --> 01:32:29.600 That's how I'm typing so quickly. 01:32:29.600 --> 01:32:31.600 I'm just going up and down to previous commands. 01:32:31.600 --> 01:32:32.120 Voila. 01:32:32.120 --> 01:32:36.830 Now I see that the Muppet Show is a comedy, a drama, and a musical. 01:32:36.830 --> 01:32:40.070 Well, I take issue, though, with one of the more popular shows that 01:32:40.070 --> 01:32:40.970 was in the list. 01:32:40.970 --> 01:32:44.637 A whole bunch of you liked, let's say, Friends, 01:32:44.637 --> 01:32:46.220 which I've never really been a fan of. 
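The UPDATE just demonstrated can be sketched the same way, with a single made-up row standing in for the database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
conn.execute("INSERT INTO shows VALUES ('The Muppet Show', 'Comedy, Musical')")

# UPDATE table SET column = value WHERE condition;
conn.execute(
    "UPDATE shows SET genres = 'Comedy, Drama, Musical' "
    "WHERE title = 'The Muppet Show'"
)

rows = conn.execute("SELECT genres FROM shows").fetchall()
print(rows)  # [('Comedy, Drama, Musical',)]
```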
01:32:46.220 --> 01:32:53.828 And let me go ahead and SELECT title FROM shows WHERE title = "Friends." 01:32:53.828 --> 01:32:56.120 And maybe I should be a little more rigorous than that. 01:32:56.120 --> 01:32:59.150 I should say title LIKE "Friends" just in case 01:32:59.150 --> 01:33:00.650 there were different capitalizations. 01:33:00.650 --> 01:33:01.460 Enter. 01:33:01.460 --> 01:33:03.148 A lot of you really liked Friends. 01:33:03.148 --> 01:33:04.190 In fact, how many of you? 01:33:04.190 --> 01:33:05.610 Well, recall that I can do this. 01:33:05.610 --> 01:33:08.900 I can say COUNT, and I can let SQL do the count for me. 01:33:08.900 --> 01:33:10.575 26 of you, I disagree with strongly. 01:33:10.575 --> 01:33:12.950 And there's a couple of you that even added all the dots, 01:33:12.950 --> 01:33:14.240 but we'll deal with you later. 01:33:14.240 --> 01:33:16.100 So suppose I do take issue with this. 01:33:16.100 --> 01:33:22.970 Well, DELETE FROM shows WHERE title = "Friends"-- 01:33:22.970 --> 01:33:24.390 actually, title LIKE "Friends." 01:33:24.390 --> 01:33:25.220 Let's get them all. 01:33:25.220 --> 01:33:26.090 Enter. 01:33:26.090 --> 01:33:29.450 And now if we SELECT this again, I'm sorry. 01:33:29.450 --> 01:33:30.870 Friends has been canceled. 01:33:30.870 --> 01:33:34.910 So you can again execute these fundamental commands of CRUD, 01:33:34.910 --> 01:33:38.630 Create, Read, Update, and Delete, by using CREATE or INSERT, 01:33:38.630 --> 01:33:41.540 by using SELECT, by using UPDATE literally 01:33:41.540 --> 01:33:43.380 and DELETE literally, as well. 01:33:43.380 --> 01:33:44.580 And that's about it. 
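The DELETE just shown, rounding out CRUD, can be sketched with a few invented rows; note how LIKE (rather than =) catches both capitalizations at once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (title TEXT)")
conn.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("Friends",), ("friends",), ("The Office",)],
)

# DELETE FROM table WHERE condition; with no wildcards, LIKE is a
# case-insensitive exact match in SQLite.
conn.execute("DELETE FROM shows WHERE title LIKE 'Friends'")

remaining = conn.execute("SELECT title FROM shows").fetchall()
print(remaining)  # [('The Office',)]
```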
01:33:44.580 --> 01:33:46.580 Even though this was a lot quickly, there really 01:33:46.580 --> 01:33:49.040 are just those four fundamental operations in SQL 01:33:49.040 --> 01:33:53.090 plus some of these add-on features, like these additional functions like COUNT 01:33:53.090 --> 01:33:57.420 that you can use and also some of these keywords like WHERE and the like. 01:33:57.420 --> 01:33:59.810 Well, let me propose that we now do better. 01:33:59.810 --> 01:34:04.580 If we have the ability to select data and create tables and insert data, 01:34:04.580 --> 01:34:11.270 let's go ahead and write our own Python script that uses SQL, as in a loop, 01:34:11.270 --> 01:34:16.130 to read over my CSV file and to insert, insert, insert, insert each of the rows 01:34:16.130 --> 01:34:16.700 manually. 01:34:16.700 --> 01:34:18.408 Because honestly, it will take me forever 01:34:18.408 --> 01:34:22.220 to manually type out hundreds of SQL queries to import all of your rows 01:34:22.220 --> 01:34:23.390 into a new database. 01:34:23.390 --> 01:34:25.520 I want to write a program that does this instead. 01:34:25.520 --> 01:34:29.430 And I'm going to propose that we design it in the following way. 01:34:29.430 --> 01:34:32.720 I'm going to have two tables this time, represented here 01:34:32.720 --> 01:34:34.190 with this artist's rendition. 01:34:34.190 --> 01:34:36.020 One is going to be called shows. 01:34:36.020 --> 01:34:38.060 One is going to be called genres. 01:34:38.060 --> 01:34:44.270 And this is a fundamental principle of designing relational databases, 01:34:44.270 --> 01:34:49.700 to figure out the relationships among data and to normalize your data. 01:34:49.700 --> 01:34:53.480 To normalize your data means to eliminate redundancies. 
01:34:53.480 --> 01:34:58.520 To normalize your data means to eliminate mentions of the same words 01:34:58.520 --> 01:35:02.320 again and again and have just single sources of truth for your data, 01:35:02.320 --> 01:35:02.820 so to speak. 01:35:02.820 --> 01:35:04.140 So what do I mean by that? 01:35:04.140 --> 01:35:07.520 I'm going to propose that we instead create a simpler table called 01:35:07.520 --> 01:35:10.320 shows that has just two columns. 01:35:10.320 --> 01:35:13.098 One is going to be called id, which is new. 01:35:13.098 --> 01:35:15.140 The other is going to be called title, as before. 01:35:15.140 --> 01:35:16.940 Honestly, I don't care about timestamps, so we're just 01:35:16.940 --> 01:35:19.730 going to throw that value away, which is another upside of writing 01:35:19.730 --> 01:35:20.420 our own program. 01:35:20.420 --> 01:35:23.030 We can add or remove any data we want. 01:35:23.030 --> 01:35:25.850 For id, I'm introducing this, which is going 01:35:25.850 --> 01:35:28.490 to be a unique identifier, literally a simple integer-- 01:35:28.490 --> 01:35:31.190 1, 2, 3, all the way up to a billion or 2 billion, 01:35:31.190 --> 01:35:33.080 however many favorites we have. 01:35:33.080 --> 01:35:35.690 I'm just going to let this auto increment as we go. 01:35:35.690 --> 01:35:36.710 Why? 01:35:36.710 --> 01:35:42.530 I propose that we move to another table all of the genres and that, 01:35:42.530 --> 01:35:48.350 instead of having one or two or three or five genres in one column 01:35:48.350 --> 01:35:51.860 as a stupid comma-separated list-- which is stupid only in the sense 01:35:51.860 --> 01:35:53.180 that it's just messy, right? 01:35:53.180 --> 01:35:55.040 It means that I have to run stupid commands 01:35:55.040 --> 01:35:57.332 where I'm checking for the comma here, the comma there. 01:35:57.332 --> 01:35:58.850 It's very hackish, so to speak. 01:35:58.850 --> 01:36:00.080 Bad design. 
01:36:00.080 --> 01:36:03.770 Instead of doing that, I'm going to create another table that 01:36:03.770 --> 01:36:05.580 also has two columns. 01:36:05.580 --> 01:36:09.320 One is going to be called show_id, and the other is going to be called genre. 01:36:09.320 --> 01:36:12.830 And genre here is just going to be a single word now. 01:36:12.830 --> 01:36:16.340 That column will contain single words for genres, 01:36:16.340 --> 01:36:19.400 like "comedy" or "music" or "musical." 01:36:19.400 --> 01:36:23.570 But we're going to associate all of those genres 01:36:23.570 --> 01:36:27.470 with the original show to which they belong, per your Google form 01:36:27.470 --> 01:36:31.500 submissions, by using this show_id here. 01:36:31.500 --> 01:36:33.290 So what does this mean in particular? 01:36:33.290 --> 01:36:37.370 By adding to our first table, shows, this unique identifier-- 01:36:37.370 --> 01:36:39.080 1, 2, 3, 4, 5, 6-- 01:36:39.080 --> 01:36:44.630 I can now refer to that same show in a very efficient way using 01:36:44.630 --> 01:36:46.940 a very simple number instead of redundantly 01:36:46.940 --> 01:36:49.730 having The Office, The Office, The Office again and again. 01:36:49.730 --> 01:36:52.280 I can refer to it by just one canonical number, which 01:36:52.280 --> 01:36:54.980 is only going to be 4 bytes or 32 bits. 01:36:54.980 --> 01:36:56.330 Pretty efficient. 01:36:56.330 --> 01:37:00.920 But I can still associate that show with one genre or two or three 01:37:00.920 --> 01:37:03.210 or more or even none. 01:37:03.210 --> 01:37:07.610 So in this way, every row in our current table 01:37:07.610 --> 01:37:12.860 is going to become one or more rows in our new pair of tables. 01:37:12.860 --> 01:37:15.560 We're factoring out the genres so that we 01:37:15.560 --> 01:37:20.270 can add multiple rows for every show, potentially, but still 01:37:20.270 --> 01:37:25.050 remap those genres back to the original show itself. 
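The two-table design just proposed, one row per show and one row per (show, genre) pair linked by a show_id, can be sketched as follows; the single show and its genres are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT);
    CREATE TABLE genres (show_id INTEGER, genre TEXT);
""")

# One row per show, and one row per (show, genre) pair -- no more
# comma-separated lists. The auto-incrementing id links the tables.
cursor = conn.execute("INSERT INTO shows (title) VALUES ('The Muppet Show')")
show_id = cursor.lastrowid
conn.executemany(
    "INSERT INTO genres (show_id, genre) VALUES (?, ?)",
    [(show_id, "Comedy"), (show_id, "Musical")],
)

# Map the genres back to the original show via its id.
rows = conn.execute(
    "SELECT genre FROM genres WHERE show_id = "
    "(SELECT id FROM shows WHERE title = 'The Muppet Show') ORDER BY genre"
).fetchall()
print(rows)  # [('Comedy',), ('Musical',)]
```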
01:37:25.050 --> 01:37:27.890 So what are some of the buzzwords here? 01:37:27.890 --> 01:37:31.070 What's some of the language to be familiar with? 01:37:31.070 --> 01:37:35.090 Well, we need to know what kinds of types are at our disposal here. 01:37:35.090 --> 01:37:37.250 So for that, let me propose this. 01:37:37.250 --> 01:37:41.300 Let me propose that we have this list here. 01:37:41.300 --> 01:37:44.590 It turns out, in SQLite, there are five main data types. 01:37:44.590 --> 01:37:46.340 And that's a bit of an oversimplification, 01:37:46.340 --> 01:37:49.430 but there's five main data types, some of which look familiar, 01:37:49.430 --> 01:37:51.410 a couple of which are a little weird. 01:37:51.410 --> 01:37:53.810 INTEGER is a thing. 01:37:53.810 --> 01:37:55.910 REAL is the same thing as float. 01:37:55.910 --> 01:38:00.080 So an integer might be a 32-bit or 4-byte value, like 1, 2, 3, or 4, 01:38:00.080 --> 01:38:01.130 positive or negative. 01:38:01.130 --> 01:38:03.213 A real number's going to have a decimal point in it, 01:38:03.213 --> 01:38:05.570 a floating point value, probably 32 bits by default. 01:38:05.570 --> 01:38:08.240 But those kinds of things, the sizes of these types, 01:38:08.240 --> 01:38:10.430 vary by system, just like they technically 01:38:10.430 --> 01:38:13.760 did in C. So do they vary by system in the world of SQL. 01:38:13.760 --> 01:38:16.010 But generally speaking, these are good rules of thumb. 01:38:16.010 --> 01:38:16.970 TEXT is just that. 01:38:16.970 --> 01:38:19.820 It's sort of the equivalent of a string of some length. 01:38:19.820 --> 01:38:22.362 But then in SQLite, it turns out there's two other data 01:38:22.362 --> 01:38:23.570 types we've not seen before-- 01:38:23.570 --> 01:38:25.010 NUMERIC and BLOB. 01:38:25.010 --> 01:38:26.750 But more on those in just a little bit. 01:38:26.750 --> 01:38:28.370 BLOB is Binary Large Object. 
01:38:28.370 --> 01:38:30.860 It means you can store 0's and 1's in your database. 01:38:30.860 --> 01:38:34.670 NUMERIC is going to be something that's number-like but isn't a number per se. 01:38:34.670 --> 01:38:38.360 It's like a year or a time, something that has numbers, but isn't 01:38:38.360 --> 01:38:40.730 just a simple integer at that. 01:38:40.730 --> 01:38:44.210 And let me propose, too, that SQLite is going to allow us to specify, too, 01:38:44.210 --> 01:38:49.520 when we create our own columns manually by executing the SQL code ourselves, 01:38:49.520 --> 01:38:52.430 we can specify that a column cannot be null. 01:38:52.430 --> 01:38:53.840 Thus far, we've ignored this. 01:38:53.840 --> 01:38:56.090 But some of you might have taken the fifth 01:38:56.090 --> 01:38:58.850 and just not given us the title of a show or a genre. 01:38:58.850 --> 01:39:01.020 Your answers might be blank. 01:39:01.020 --> 01:39:03.020 Some of you, maybe in registering for a website, 01:39:03.020 --> 01:39:06.170 don't want to provide information like where you live or your phone number. 01:39:06.170 --> 01:39:10.190 So a database in general sometimes does want to support null values. 01:39:10.190 --> 01:39:12.290 But you might want to say that it can't be null. 01:39:12.290 --> 01:39:14.390 A website probably needs your email address, 01:39:14.390 --> 01:39:18.570 needs your password and a few other fields, but not everything. 01:39:18.570 --> 01:39:22.250 And there's another keyword in SQL, just so you've seen it, called UNIQUE, where 01:39:22.250 --> 01:39:25.460 you can additionally say that whatever values are in this column 01:39:25.460 --> 01:39:26.520 must be unique. 01:39:26.520 --> 01:39:28.670 So a website might also use that. 
01:39:28.670 --> 01:39:31.910 If you want to make sure that the same email address can't register 01:39:31.910 --> 01:39:33.830 for your website multiple times, you just 01:39:33.830 --> 01:39:36.020 specify that the email column is unique. 01:39:36.020 --> 01:39:40.370 That way, you can't put multiple people in with identical email addresses. 01:39:40.370 --> 01:39:44.060 So long story short, this is just more of the tools in our SQL toolkit, 01:39:44.060 --> 01:39:46.280 because we'll see some of these now indirectly. 01:39:46.280 --> 01:39:49.670 And the last piece of jargon we need before designing our own tables 01:39:49.670 --> 01:39:51.150 is going to be this. 01:39:51.150 --> 01:39:54.110 It turns out that, in SQL, there's this notion 01:39:54.110 --> 01:39:56.270 of primary keys and foreign keys. 01:39:56.270 --> 01:39:59.390 And we've not seen this in spreadsheets. 01:39:59.390 --> 01:40:02.150 Unless you've been working in the real world for some years 01:40:02.150 --> 01:40:04.400 and you have fairly fancy spreadsheets in front of you 01:40:04.400 --> 01:40:06.380 as an analyst or financial person or the like, 01:40:06.380 --> 01:40:11.750 odds are you've not seen keys or unique identifiers in quite the same way. 01:40:11.750 --> 01:40:13.170 But they're relatively simple. 01:40:13.170 --> 01:40:17.390 In fact, let me go back to our picture before and propose 01:40:17.390 --> 01:40:21.230 that when you have two tables like this and you 01:40:21.230 --> 01:40:25.790 want to use a simple integer to uniquely identify all of the rows in one 01:40:25.790 --> 01:40:28.395 of the tables, that's called technically an ID. 01:40:28.395 --> 01:40:30.020 That's what I'll call it by convention. 01:40:30.020 --> 01:40:33.770 You could call it anything you want, but ID just means it's a unique identifier. 01:40:33.770 --> 01:40:37.470 But semantically, this ID is what's called a primary key. 
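The NOT NULL and UNIQUE constraints just described can be seen in action with a short sketch, again using Python's standard sqlite3 module; the users table, its columns, and the email addresses here are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE users (
        email TEXT NOT NULL UNIQUE,  -- required, and no duplicates allowed
        phone TEXT                   -- optional: NULL is fine here
    )
""")
db.execute("INSERT INTO users (email, phone) VALUES (?, ?)",
           ("alice@example.com", None))

# Registering the same email twice violates UNIQUE.
try:
    db.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

# Omitting the email entirely violates NOT NULL.
try:
    db.execute("INSERT INTO users (phone) VALUES (?)", ("555-1234",))
    null_allowed = True
except sqlite3.IntegrityError:
    null_allowed = False
```

The database itself rejects both bad inserts, so the application never has to police duplicates or missing required fields by hand.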
01:40:37.470 --> 01:40:43.940 A primary key is the column in a table that uniquely identifies every row. 01:40:43.940 --> 01:40:46.820 This means you can have multiple versions of The Office 01:40:46.820 --> 01:40:48.860 in that title field. 01:40:48.860 --> 01:40:52.490 But each of those rows is going to have its own number uniquely, potentially. 01:40:52.490 --> 01:40:56.630 So primary key uniquely identifies each row. 01:40:56.630 --> 01:41:01.550 In another table, like genres, which I'm proposing we create in just a moment, 01:41:01.550 --> 01:41:06.770 it turns out that you're welcome to refer back to another table 01:41:06.770 --> 01:41:09.260 by way of that unique identifier. 01:41:09.260 --> 01:41:13.710 But when it's in this context, that ID is called a foreign key. 01:41:13.710 --> 01:41:16.130 So even though I've called it show_id here, 01:41:16.130 --> 01:41:18.470 that's just a convention in a lot of SQL databases 01:41:18.470 --> 01:41:23.030 to imply that this is technically a column called ID in a table 01:41:23.030 --> 01:41:26.760 called show or shows, plural in this case. 01:41:26.760 --> 01:41:29.900 So if there's a number 1 here, and suppose 01:41:29.900 --> 01:41:34.190 that The Office has a unique ID of 1, we would 01:41:34.190 --> 01:41:38.420 have a row in this table called id is 1, title is The Office. 01:41:38.420 --> 01:41:43.730 The Office might be in the comedy category, the drama category, 01:41:43.730 --> 01:41:46.400 the romance category, so multiple ones. 01:41:46.400 --> 01:41:51.050 Therefore, in the genres table, we want to output three rows, 01:41:51.050 --> 01:41:56.150 the number 1, 1, 1 in each of those rows but the words "comedy," 01:41:56.150 --> 01:42:00.450 "drama," "romance" in each of those rows respectively. 01:42:00.450 --> 01:42:03.620 So again, the goal here is to just design our database better, not 01:42:03.620 --> 01:42:08.120 have these stupid comma-separated lists of values inside of a single column. 
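That primary-key/foreign-key relationship can be sketched concretely with the lecture's own example: The Office with id 1, mapped to three genres. This uses sqlite3 as a stand-in for the tools shown on screen.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# shows.id is the primary key; genres.show_id is a foreign key referencing it.
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")
db.execute("""
    CREATE TABLE genres (
        show_id INTEGER,
        genre TEXT,
        FOREIGN KEY(show_id) REFERENCES shows(id)
    )
""")

# One row in shows, three rows in genres, all linked by the number 1.
db.execute("INSERT INTO shows (id, title) VALUES (1, 'The Office')")
for genre in ["Comedy", "Drama", "Romance"]:
    db.execute("INSERT INTO genres (show_id, genre) VALUES (1, ?)", (genre,))

rows = [g for (g,) in db.execute(
    "SELECT genre FROM genres WHERE show_id = 1 ORDER BY genre")]
```

The title "The Office" is stored exactly once; the cheap 4-byte integer 1 does the work of relating it to any number of genre rows.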
01:42:08.120 --> 01:42:12.980 We want to kind of blow that up, explode it, into individual rows. 01:42:12.980 --> 01:42:15.710 You might think, well, why don't we just use multiple columns? 01:42:15.710 --> 01:42:18.560 But again, per our principle from spreadsheets, 01:42:18.560 --> 01:42:21.650 you should not be in the habit of adding more and more columns when 01:42:21.650 --> 01:42:25.190 the data is all the same, like genre, genre, genre, right? 01:42:25.190 --> 01:42:27.410 The stupid way to do this in the spreadsheet world 01:42:27.410 --> 01:42:29.660 would be to have one column called Genre 1, 01:42:29.660 --> 01:42:34.100 another column called Genre 2, another column called Genre 3, Genre 4. 01:42:34.100 --> 01:42:37.340 And you can imagine just how stupid and inefficient this is. 01:42:37.340 --> 01:42:41.510 A lot of those columns are going to be empty for shows with very few genres. 01:42:41.510 --> 01:42:43.770 And it's just kind of messy at that point. 01:42:43.770 --> 01:42:47.030 So better, in the world of relational databases, 01:42:47.030 --> 01:42:51.350 to have something like a second table, where you have multiple rows that 01:42:51.350 --> 01:42:55.700 somehow link back to that primary key by way of what we're calling, 01:42:55.700 --> 01:42:58.440 conceptually, a foreign key. 01:42:58.440 --> 01:42:59.090 All right. 01:42:59.090 --> 01:43:01.640 So let's go ahead now and try to write this code. 01:43:01.640 --> 01:43:03.710 Let me go back to my IDE. 01:43:03.710 --> 01:43:07.850 Let me quit out of SQLite now. 01:43:07.850 --> 01:43:10.640 And let me just move away. 01:43:10.640 --> 01:43:15.402 I'm going to move this away, my file, for just a moment 01:43:15.402 --> 01:43:17.360 so that we're only left with our original data. 
01:43:17.360 --> 01:43:21.680 Let's go about implementing a final version of my Python file that does 01:43:21.680 --> 01:43:23.540 this-- creates two tables-- 01:43:23.540 --> 01:43:26.270 one called shows, one called genres-- 01:43:26.270 --> 01:43:30.200 and then, two, in a for loop, iterates over that CSV 01:43:30.200 --> 01:43:34.490 and inserts some data into the shows and other data into the genres. 01:43:34.490 --> 01:43:36.350 How can we do this programmatically? 01:43:36.350 --> 01:43:38.720 Well, there's a final piece of the puzzle that we need. 01:43:38.720 --> 01:43:41.912 We need some way of bridging the world of Python and SQL. 01:43:41.912 --> 01:43:44.120 And here, we do need a library, because it would just 01:43:44.120 --> 01:43:46.700 be way too painful to do without a library. 01:43:46.700 --> 01:43:47.540 It can be CS50. 01:43:47.540 --> 01:43:50.730 CS50, as we'll see, makes this very simple. 01:43:50.730 --> 01:43:53.480 There are other third-party commercial and open-source libraries 01:43:53.480 --> 01:43:56.522 that you can also use in the real world, as well, that do the same thing. 01:43:56.522 --> 01:43:58.670 But the syntax is a little less friendly, 01:43:58.670 --> 01:44:01.880 so we'll start by using the CS50 library, which in Python, recall, 01:44:01.880 --> 01:44:04.330 has functions like get_string and get_int and get_float. 01:44:04.330 --> 01:44:10.430 But today, it also has support, it turns out, for SQL capabilities, as well. 01:44:10.430 --> 01:44:12.760 So I'm going to go back to my Favorites file. 01:44:12.760 --> 01:44:15.970 And I'm going to import not only CSV, but I'm also 01:44:15.970 --> 01:44:21.310 going to import from the CS50 library a feature called SQL. 
01:44:21.310 --> 01:44:25.930 So we have a variable, if you will, inside of the CS50 library 01:44:25.930 --> 01:44:28.600 or, rather, a function inside of the CS50 library 01:44:28.600 --> 01:44:31.870 called SQL that, if I call it, will allow me 01:44:31.870 --> 01:44:35.270 to load a SQLite database into memory. 01:44:35.270 --> 01:44:36.290 So how do I do this? 01:44:36.290 --> 01:44:38.790 Well, let me go ahead and add a couple of new lines of code. 01:44:38.790 --> 01:44:45.340 Let me go ahead and open up a file called shows.db, 01:44:45.340 --> 01:44:47.055 but this time in write mode. 01:44:47.055 --> 01:44:49.180 And then just for kicks-- just for now, rather, I'm 01:44:49.180 --> 01:44:50.930 going to go ahead and close it right away. 01:44:50.930 --> 01:44:54.260 This is a Pythonic way of creating an empty file. 01:44:54.260 --> 01:44:58.210 It's kind of stupid looking, but by opening a file called shows.db 01:44:58.210 --> 01:45:00.790 in write mode and then immediately closing it, 01:45:00.790 --> 01:45:03.670 it has the effect of creating the file, closing the file. 01:45:03.670 --> 01:45:06.310 So I now have an empty file with which to interact. 01:45:06.310 --> 01:45:09.100 I could also do this, as an aside, by doing this-- 01:45:09.100 --> 01:45:10.810 touch shows.db. 01:45:10.810 --> 01:45:14.470 touch kind of a strange command, but in a terminal window, 01:45:14.470 --> 01:45:17.870 it means to create a file if it doesn't exist. 01:45:17.870 --> 01:45:19.450 So we could also do that instead. 01:45:19.450 --> 01:45:22.420 But that would be independent of Python. 01:45:22.420 --> 01:45:24.790 So once I've created this file, let me go ahead 01:45:24.790 --> 01:45:28.720 and open the file now as a SQLite database. 01:45:28.720 --> 01:45:31.600 I'm going to declare a variable called db for database. 01:45:31.600 --> 01:45:34.930 I'm going to use the SQL function from CS50's library. 
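The steps so far, creating the empty shows.db file and then opening it as a database, look roughly like this. With the CS50 library the last step would be `db = SQL("sqlite:///shows.db")`; this sketch substitutes the standard library's sqlite3 module, and writes into a temporary directory so it doesn't litter real files.

```python
import os
import sqlite3
import tempfile

# Hypothetical location for shows.db, kept in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "shows.db")

# The Pythonic "create an empty file" idiom from the lecture:
# open in write mode, then immediately close.
open(path, "w").close()
created_empty = os.path.exists(path) and os.path.getsize(path) == 0

# Now open that empty file as a SQLite database and create a first table.
db = sqlite3.connect(path)
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")
db.commit()
tables = [name for (name,) in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
```

The same effect could come from `touch shows.db` in a terminal; the point is only that an empty file exists before the library treats it as a database.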
01:45:34.930 --> 01:45:38.170 And I'm going to open, via a somewhat cryptic string, this-- 01:45:38.170 --> 01:45:43.600 sqlite:///shows.db. 01:45:43.600 --> 01:45:48.740 Now, it looks like a URL, http://, but it's SQLite instead. 01:45:48.740 --> 01:45:52.300 And there are three slashes instead of the usual two. 01:45:52.300 --> 01:45:54.430 But this line of code, line 6, has the result 01:45:54.430 --> 01:45:57.820 of opening now that otherwise empty file with nothing 01:45:57.820 --> 01:46:04.040 in it yet as being a SQLite database using CS50's library. 01:46:04.040 --> 01:46:05.330 Why did I do that? 01:46:05.330 --> 01:46:09.020 Well, I did that because I now want to create my first table. 01:46:09.020 --> 01:46:12.140 Let me go ahead and execute, db.execute. 01:46:12.140 --> 01:46:16.330 So there's a function called execute inside of the CS50 SQL library. 01:46:16.330 --> 01:46:17.980 And I'm going to go ahead and run this. 01:46:17.980 --> 01:46:23.770 CREATE TABLE called shows, the columns of which 01:46:23.770 --> 01:46:27.430 are an id, which is going to be an integer, a title, which 01:46:27.430 --> 01:46:33.380 is going to be text, the primary key in which is going to be the id column. 01:46:33.380 --> 01:46:34.870 So this is a bit cryptic. 01:46:34.870 --> 01:46:36.520 But let's see what's happening. 01:46:36.520 --> 01:46:41.950 I seem to now, in line 8, be combining Python with SQL. 01:46:41.950 --> 01:46:46.000 And this is where now programming gets really powerful, fancy, cool, 01:46:46.000 --> 01:46:48.250 difficult, however you want to perceive it. 01:46:48.250 --> 01:46:50.680 I can actually use one language inside of another. 01:46:50.680 --> 01:46:51.250 How? 01:46:51.250 --> 01:46:53.420 Well, SQL is just a bunch of textual commands. 01:46:53.420 --> 01:46:55.420 Up until now, I've been typing them out manually 01:46:55.420 --> 01:46:57.430 in this program called SQLite3. 
01:46:57.430 --> 01:47:00.010 There's nothing stopping me, though, from storing 01:47:00.010 --> 01:47:02.830 those same commands in Python strings and then 01:47:02.830 --> 01:47:05.890 passing them to a database using code. 01:47:05.890 --> 01:47:08.230 The code I'm using is a function called execute. 01:47:08.230 --> 01:47:10.990 And its purpose in life, and CS50 staff wrote this, 01:47:10.990 --> 01:47:18.950 is to pass the argument from your Python code into the database for execution. 01:47:18.950 --> 01:47:22.510 So it's like the programmatic way of just typing things manually 01:47:22.510 --> 01:47:25.160 at the SQLite prompt a few minutes ago. 01:47:25.160 --> 01:47:27.880 So that's going to go ahead and create my table called 01:47:27.880 --> 01:47:30.610 shows, in which I'm going to store all of those unique IDs 01:47:30.610 --> 01:47:32.290 and also the titles. 01:47:32.290 --> 01:47:33.670 And then let me do this again. 01:47:33.670 --> 01:47:39.040 db.execute CREATE TABLE genres, and that's 01:47:39.040 --> 01:47:43.670 going to have a column called show_id, which is an integer also, genre, 01:47:43.670 --> 01:47:45.340 which is text. 01:47:45.340 --> 01:47:48.130 And lastly, it's going to have a foreign key-- 01:47:48.130 --> 01:47:51.190 it's going to wrap a little long here-- 01:47:51.190 --> 01:47:56.563 on show_id, which references the shows table id. 01:47:56.563 --> 01:47:57.730 All right, so this is a lot. 01:47:57.730 --> 01:47:59.860 So let's just recap left to right. 01:47:59.860 --> 01:48:03.730 db.execute is my Python function that executes any SQL I want. 01:48:03.730 --> 01:48:06.460 CREATE TABLE genres creates a table called genres. 01:48:06.460 --> 01:48:10.060 The columns in that table will be something called show_id, 01:48:10.060 --> 01:48:13.630 which is an integer, and genre, which is a text field. 01:48:13.630 --> 01:48:17.050 But it's going to be one genre at a time, not multiple. 
01:48:17.050 --> 01:48:20.170 And then here, I'm specifying a foreign key 01:48:20.170 --> 01:48:24.280 will be the show_id column, which happens to refer back 01:48:24.280 --> 01:48:28.180 to the shows table's IDs column. 01:48:28.180 --> 01:48:31.480 It's a little cryptic, but all this is doing is implementing for us 01:48:31.480 --> 01:48:33.470 the equivalent of this picture here. 01:48:33.470 --> 01:48:35.770 I could have manually typed both of these SQL 01:48:35.770 --> 01:48:37.690 commands at that blinking prompt. 01:48:37.690 --> 01:48:39.850 But again, no, I want to write a program now 01:48:39.850 --> 01:48:43.720 in Python that creates the tables for me and now, more interestingly, 01:48:43.720 --> 01:48:47.583 loads the data into that database. 01:48:47.583 --> 01:48:49.000 So let's go ahead and do this now. 01:48:49.000 --> 01:48:51.100 I'm not going to select a title from the user, 01:48:51.100 --> 01:48:52.660 because I want to import everything. 01:48:52.660 --> 01:48:54.993 I'm not going to use any counting or anything like that. 01:48:54.993 --> 01:48:57.700 So let's go ahead and just go inside of my loop as before. 01:48:57.700 --> 01:49:02.240 And this time, let's go ahead and, for row in reader, 01:49:02.240 --> 01:49:05.110 let's go ahead and get the current title, as we've always done. 01:49:05.110 --> 01:49:08.640 But let's also, as always, go ahead and strip it of white space 01:49:08.640 --> 01:49:11.700 and capitalize it, just to canonicalize it. 01:49:11.700 --> 01:49:15.960 And now I'm going to go ahead and execute db.execute, quote unquote, 01:49:15.960 --> 01:49:24.707 INSERT INTO shows the title column, the value of "title." 01:49:24.707 --> 01:49:26.040 So I want to put the title here. 01:49:26.040 --> 01:49:31.690 It turns out that SQL libraries like ours support one final piece of syntax, 01:49:31.690 --> 01:49:32.850 which is a placeholder. 01:49:32.850 --> 01:49:34.800 In C, we use %s. 
01:49:34.800 --> 01:49:37.950 In Python, we just use curly braces and put the word right there. 01:49:37.950 --> 01:49:41.520 In SQL, we have a third approach to the same problem-- just syntactically 01:49:41.520 --> 01:49:43.590 different, but conceptually the same. 01:49:43.590 --> 01:49:46.560 You put a question mark where you want to put a placeholder. 01:49:46.560 --> 01:49:50.670 And then outside of this string, I'm going to actually type in the value 01:49:50.670 --> 01:49:53.070 that I want to plug into that question mark. 01:49:53.070 --> 01:49:55.590 So this is so similar to printf in week 1. 01:49:55.590 --> 01:50:00.180 But instead of %s, it's a question mark now and then a comma-separated list 01:50:00.180 --> 01:50:03.120 of the arguments you want to plug in for those placeholders. 01:50:03.120 --> 01:50:08.820 So now this line of code 16 has just inserted all of those values 01:50:08.820 --> 01:50:09.670 into my database. 01:50:09.670 --> 01:50:10.440 And let's go ahead and run this. 01:50:10.440 --> 01:50:12.970 Before I go any further, let me go ahead and do this. 01:50:12.970 --> 01:50:15.960 I'm going to go ahead now and run python of favorites.py 01:50:15.960 --> 01:50:18.030 and cross my fingers, as always. 01:50:18.030 --> 01:50:20.010 It's taking a moment, taking a moment. 01:50:20.010 --> 01:50:23.340 That's because there's a decent-sized file there. 01:50:23.340 --> 01:50:25.650 Or I screwed up. 01:50:25.650 --> 01:50:27.930 This is taking too long. 01:50:27.930 --> 01:50:28.950 Oh, OK. 01:50:28.950 --> 01:50:30.960 I should have just been more patient. 01:50:30.960 --> 01:50:31.560 All right. 01:50:31.560 --> 01:50:33.970 So it just seems my connection's a little slow. 01:50:33.970 --> 01:50:38.717 So as I expected, everything is 100% correct, and it's working fine. 01:50:38.717 --> 01:50:40.800 So now let's go ahead and see what I actually did. 01:50:40.800 --> 01:50:44.970 If I type ls, notice that I have a file called shows.db. 
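The question-mark placeholder just described can be sketched like so, again with sqlite3 standing in for the CS50 library's db.execute; the messy input string is invented to show the strip-and-uppercase canonicalization too.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

# Canonicalize the title as in the lecture: strip whitespace, uppercase.
title = "  the office ".strip().upper()

# The ? is SQL's placeholder, akin to %s in printf: the value is supplied
# separately, and the library plugs it in (and escapes it) safely.
db.execute("INSERT INTO shows (title) VALUES (?)", (title,))
stored = db.execute("SELECT title FROM shows").fetchone()[0]
```

Passing values this way, rather than gluing them into the SQL string yourself, is also what protects against SQL injection later in the course.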
01:50:44.970 --> 01:50:48.180 This is brand new, because my Python program created it this time. 01:50:48.180 --> 01:50:51.060 Let's go ahead and run sqlite3 of shows.db 01:50:51.060 --> 01:50:53.080 just so I can now see what's inside of it. 01:50:53.080 --> 01:50:57.090 Notice that I can do .schema just to see what tables exist. 01:50:57.090 --> 01:51:00.660 And indeed, the two tables that I created in my Python code 01:51:00.660 --> 01:51:01.920 seem to exist. 01:51:01.920 --> 01:51:04.020 But notice that there's-- 01:51:04.020 --> 01:51:08.730 if I do SELECT * FROM shows, let's see all the data. 01:51:08.730 --> 01:51:09.750 Voila. 01:51:09.750 --> 01:51:13.170 There is a table that's been programmatically created. 01:51:13.170 --> 01:51:16.350 And it has, notice this time, no timestamps, no genres. 01:51:16.350 --> 01:51:20.730 But it has an ID on the left and the title on the right. 01:51:20.730 --> 01:51:25.350 And amazingly, all of the IDs are monotonically increasing from 1 01:51:25.350 --> 01:51:27.390 on up to 513, in this case. 01:51:27.390 --> 01:51:28.300 Why is that? 01:51:28.300 --> 01:51:30.600 Well, one of the features you get in a SQL database 01:51:30.600 --> 01:51:34.410 is if you define a column as being a primary key in SQLite, 01:51:34.410 --> 01:51:36.480 it's going to be auto incremented for you. 01:51:36.480 --> 01:51:41.970 Recall that nowhere in my code did I even have a line, an integer, 01:51:41.970 --> 01:51:43.830 inputting 1, then 2, then 3. 01:51:43.830 --> 01:51:45.310 I could absolutely do that. 01:51:45.310 --> 01:51:47.730 I could have done something like this-- counter-- 01:51:47.730 --> 01:51:51.660 rather, I could have done something like this-- counter = 1. 01:51:51.660 --> 01:51:56.280 And then down here, I could have said id, title, give myself 01:51:56.280 --> 01:51:59.122 two placeholders, and then pass in the counter each time. 
01:51:59.122 --> 01:52:01.830 I could have implemented this myself and then, on each iteration, 01:52:01.830 --> 01:52:03.960 done counter += 1. 01:52:03.960 --> 01:52:06.330 But with SQL databases, as we've seen, you 01:52:06.330 --> 01:52:08.310 get a lot more functionality built in. 01:52:08.310 --> 01:52:11.130 I don't have to do any of that, because if I've 01:52:11.130 --> 01:52:16.710 declared that ID as being a primary key, SQLite is going to insert it for me 01:52:16.710 --> 01:52:19.870 and increment it also for me, as well. 01:52:19.870 --> 01:52:20.400 All right. 01:52:20.400 --> 01:52:24.510 So if I go back to SQLite, though, notice that I do have IDs and titles. 01:52:24.510 --> 01:52:28.860 But if I SELECT * FROM genres, there's of course nothing there yet. 01:52:28.860 --> 01:52:32.250 So how now do I get all of the genres for each of these shows in? 01:52:32.250 --> 01:52:33.910 I need to finish my script. 01:52:33.910 --> 01:52:38.970 So inside of this same loop, I have not only the title in my current row, 01:52:38.970 --> 01:52:42.570 but I also have genres in the current row. 01:52:42.570 --> 01:52:45.570 But the genres are separated by commas. 01:52:45.570 --> 01:52:47.880 Recall that in the CSV, next to every title, 01:52:47.880 --> 01:52:51.450 there's a comma-separated list of genres. 01:52:51.450 --> 01:52:53.460 How do I get at each genre individually? 01:52:53.460 --> 01:52:59.190 Well, I'd like to be able to say for genre in row bracket genres. 01:52:59.190 --> 01:53:02.520 But this is not going to work, because that's not going 01:53:02.520 --> 01:53:05.310 to be split up based on those commas. 01:53:05.310 --> 01:53:07.190 That's literally just going to iterate over, 01:53:07.190 --> 01:53:10.860 in fact, all of the characters in that string, as we saw last week. 
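That auto-incrementing behavior is easy to confirm: insert titles only, never an id, and SQLite numbers the rows itself. A minimal sketch with sqlite3 and three invented titles:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

# Insert titles only; no counter variable, no id supplied by us.
for t in ["The Office", "Game of Thrones", "Breaking Bad"]:
    db.execute("INSERT INTO shows (title) VALUES (?)", (t,))

ids = [i for (i,) in db.execute("SELECT id FROM shows ORDER BY id")]
```

Because id is an INTEGER primary key, SQLite assigns 1, 2, 3, ... automatically, which is exactly the counter logic the lecture says we no longer have to write.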
01:53:10.860 --> 01:53:13.950 But it turns out that strings in Python have a fancy split 01:53:13.950 --> 01:53:19.300 function, whereby I can split on a comma followed by a space. 01:53:19.300 --> 01:53:21.930 And what this function will do for me in Python is 01:53:21.930 --> 01:53:26.130 take a comma separated list of genres and explode it, so to speak, 01:53:26.130 --> 01:53:31.800 split it on every comma, space into a Python list 01:53:31.800 --> 01:53:36.570 containing genre after genre in an actual Python list 01:53:36.570 --> 01:53:37.990 a la square brackets. 01:53:37.990 --> 01:53:42.360 So now I can iterate over that list of individual genres. 01:53:42.360 --> 01:53:49.470 And inside of here, I can do db.execute INSERT INTO genres show_id, genre, 01:53:49.470 --> 01:53:53.130 the values, question mark, question mark. 01:53:53.130 --> 01:53:56.100 But huh, there's a problem. 01:53:56.100 --> 01:53:59.970 I can definitely plug in the current genre, which is this. 01:53:59.970 --> 01:54:02.970 But I need to put something here still. 01:54:02.970 --> 01:54:07.560 For that first question mark, I need a value for the show_id. 01:54:07.560 --> 01:54:11.130 How do I know what the ID is of the current TV show? 01:54:11.130 --> 01:54:13.650 Well, it turns out the library can help you with this. 01:54:13.650 --> 01:54:18.970 When you insert new rows into a table that has a primary key, 01:54:18.970 --> 01:54:23.400 it turns out that most libraries will return you that value in some way. 01:54:23.400 --> 01:54:26.520 And if I go back to line 15 and I actually 01:54:26.520 --> 01:54:31.470 store the return value of db.execute after using INSERT, 01:54:31.470 --> 01:54:34.500 the library will tell me what was the integer that 01:54:34.500 --> 01:54:36.390 was just used for this given show. 01:54:36.390 --> 01:54:37.650 Maybe it's 1, 2, 3. 01:54:37.650 --> 01:54:39.940 I don't have to know or care as the programmer. 
01:54:39.940 --> 01:54:42.570 But the return value, I can store in a variable. 01:54:42.570 --> 01:54:47.520 And then down here, I can literally put that same ID so that now, 01:54:47.520 --> 01:54:51.600 if I am inputting The Office, whose ID is 1, into the shows table 01:54:51.600 --> 01:54:54.720 and its genres are comedy, drama, romance, 01:54:54.720 --> 01:54:57.990 I can now inside of this for loop, this nested for loop, 01:54:57.990 --> 01:55:03.240 insert 1 followed by "comedy," 1 followed by "drama," 1 followed 01:55:03.240 --> 01:55:07.330 by "romance," three rows all at once. 01:55:07.330 --> 01:55:11.980 And so now let's go back down here into my terminal window. 01:55:11.980 --> 01:55:15.660 Let me remove the old shows.db with rm, just to start fresh. 01:55:15.660 --> 01:55:19.920 Let me go ahead and rerun python of favorites.py. 01:55:19.920 --> 01:55:23.733 I'll be more patient this time, because cloud's being a little slow. 01:55:23.733 --> 01:55:24.900 So it's doing some thinking. 01:55:24.900 --> 01:55:27.030 And in fact, there's more work being done now. 01:55:27.030 --> 01:55:29.340 At this point in the story, my program is presumably 01:55:29.340 --> 01:55:33.060 iterating over all of the rows in the CSV. 01:55:33.060 --> 01:55:37.170 And it's inserting into the shows table one at a time, 01:55:37.170 --> 01:55:43.380 and then it's inserting one or more genres into the genres table. 01:55:43.380 --> 01:55:44.250 It's a little slow. 01:55:44.250 --> 01:55:47.370 If we were on a faster system or if I were doing it on my own Mac or PC, 01:55:47.370 --> 01:55:49.480 it would probably go down more quickly. 01:55:49.480 --> 01:55:52.740 But you can see here an example of why I use the .import command in the first 01:55:52.740 --> 01:55:53.130 place. 01:55:53.130 --> 01:55:54.670 That automated some of this process. 01:55:54.670 --> 01:55:58.440 But unfortunately, it didn't allow me to change the format of my data. 
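Putting the pieces together, the import loop looks roughly like the sketch below. The CSV rows are hard-coded stand-ins for csv.DictReader's output, and where CS50's db.execute returns the new primary key directly, sqlite3's analogue is the cursor's lastrowid attribute.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")
db.execute("""CREATE TABLE genres (show_id INTEGER, genre TEXT,
              FOREIGN KEY(show_id) REFERENCES shows(id))""")

# Stand-in for the CSV rows; the lecture reads these with csv.DictReader.
rows = [
    {"title": "The Office", "genres": "Comedy, Drama, Romance"},
    {"title": "Game of Thrones", "genres": "Action, Adventure"},
]

for row in rows:
    title = row["title"].strip().upper()
    # sqlite3 exposes the freshly assigned primary key as lastrowid.
    show_id = db.execute("INSERT INTO shows (title) VALUES (?)",
                         (title,)).lastrowid
    # split(", ") explodes "Comedy, Drama, Romance" into single genres.
    for genre in row["genres"].split(", "):
        db.execute("INSERT INTO genres (show_id, genre) VALUES (?, ?)",
                   (show_id, genre))

office_genres = [g for (g,) in db.execute(
    "SELECT genre FROM genres WHERE show_id = 1 ORDER BY genre")]
```

Each CSV row thus becomes one row in shows plus one or more rows in genres, all tied together by the auto-assigned id.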
01:55:58.440 --> 01:56:01.530 But the key point to make here is that even though this 01:56:01.530 --> 01:56:05.490 is taking a little bit of time to insert these hundreds of rows all at once, 01:56:05.490 --> 01:56:07.260 I'm only going to have to do this once. 01:56:07.260 --> 01:56:10.840 And what was asked a bit ago was the performance of this. 01:56:10.840 --> 01:56:15.390 It turns out that now that we have full control over the SQL database, 01:56:15.390 --> 01:56:20.640 it turns out we're going to have the ability to actually improve 01:56:20.640 --> 01:56:22.230 the performance thereof. 01:56:22.230 --> 01:56:24.000 Oh, OK. 01:56:24.000 --> 01:56:25.830 As expected, it finished right on time. 01:56:25.830 --> 01:56:29.970 And let me go ahead now and run sqlite3 on shows.db. 01:56:29.970 --> 01:56:32.670 All right, so now I'm back in my raw SQL environment. 01:56:32.670 --> 01:56:36.180 If I do SELECT * FROM shows, which I did before, 01:56:36.180 --> 01:56:37.650 we'll see all of this as before. 01:56:37.650 --> 01:56:42.090 If I SELECT * FROM shows WHERE title = "THE OFFICE," 01:56:42.090 --> 01:56:45.103 I'll see the actual unique IDs of all of those. 01:56:45.103 --> 01:56:46.770 We didn't bother eliminating duplicates. 01:56:46.770 --> 01:56:50.610 We just kept everything as is, but we gave everything a unique ID. 01:56:50.610 --> 01:56:57.520 But if I now do SELECT * FROM genres, we'll see all of the values there. 01:56:57.520 --> 01:56:59.070 And notice the key detail. 01:56:59.070 --> 01:57:03.360 There is only one genre per row here. 01:57:03.360 --> 01:57:06.480 And so we can ultimately line those up with our titles. 01:57:06.480 --> 01:57:10.250 And our titles here, we had all of these here. 01:57:10.250 --> 01:57:12.538 Something's wrong. 01:57:12.538 --> 01:57:13.580 I want to get this right. 01:57:13.580 --> 01:57:15.940 Let's go ahead and take our second and final five-minute break here. 
01:57:15.940 --> 01:57:18.280 And we'll come back, and I will explain what's going on. 01:57:18.280 --> 01:57:20.170 All right, we are back. 01:57:20.170 --> 01:57:23.710 And just before we broke up, my own self-doubt was starting to creep in. 01:57:23.710 --> 01:57:26.830 But I'm happy to say, with no fancy magic behind the scenes, 01:57:26.830 --> 01:57:28.430 everything was actually working fine. 01:57:28.430 --> 01:57:30.263 I was just doubting the correctness of this. 01:57:30.263 --> 01:57:33.460 If I do SELECT * FROM shows, I indeed get back 01:57:33.460 --> 01:57:37.540 two columns, one with the unique ID, the so-called primary key, followed 01:57:37.540 --> 01:57:40.280 by the title of each of those shows. 01:57:40.280 --> 01:57:46.120 And if I similarly search for * FROM genres, I get single genres at a time. 01:57:46.120 --> 01:57:49.600 But on the left-hand side are not primary keys per se 01:57:49.600 --> 01:57:52.450 but now those same numbers here in this context called 01:57:52.450 --> 01:57:55.160 foreign keys that map one to the other. 01:57:55.160 --> 01:58:01.260 So for instance, whatever show 512 is had five different genres associated 01:58:01.260 --> 01:58:01.760 with it. 01:58:01.760 --> 01:58:05.320 And in fact, if I go back a moment to shows, it looks like Game of Thrones 01:58:05.320 --> 01:58:10.420 was decided by one of you as belonging in thriller, history, adventure, 01:58:10.420 --> 01:58:14.660 action, and war, as well, those five. 01:58:14.660 --> 01:58:17.320 So now this is what's meant by relational database. 01:58:17.320 --> 01:58:21.430 You have this relation or relationship across multiple tables 01:58:21.430 --> 01:58:25.050 that link some data in one to some other data in the like. 
01:58:25.050 --> 01:58:27.550 The catch, though, is that it would seem a little harder now 01:58:27.550 --> 01:58:30.910 to answer questions, because now I have to kind of query two tables 01:58:30.910 --> 01:58:34.450 or execute two separate queries and then combine the data. 01:58:34.450 --> 01:58:36.130 But that's not actually the case. 01:58:36.130 --> 01:58:39.100 Suppose that I want to answer the question of, 01:58:39.100 --> 01:58:42.760 what are all of the musicals among your favorite TV shows? 01:58:42.760 --> 01:58:46.490 I can't select just the shows, because there's no genres in there anymore. 01:58:46.490 --> 01:58:48.730 But I also can't select just the genres table, 01:58:48.730 --> 01:58:50.900 because there's no titles in there. 01:58:50.900 --> 01:58:55.060 But there is a value that's bridging one and the other, that foreign key 01:58:55.060 --> 01:58:56.980 to primary key relationship. 01:58:56.980 --> 01:58:59.170 So you know what I can do off the top of my head? 01:58:59.170 --> 01:59:03.790 I'm pretty sure I can select all of the show_ids from the genres table 01:59:03.790 --> 01:59:07.072 where a specific genre = "Musical." 01:59:07.072 --> 01:59:09.280 And I don't have to worry about commas or spaces now, 01:59:09.280 --> 01:59:13.210 because again, in this new version that I have designed programmatically 01:59:13.210 --> 01:59:16.990 with code, musical and every other genre is just a single word. 01:59:16.990 --> 01:59:21.220 If I hit Enter, all of these show_ids were decided 01:59:21.220 --> 01:59:23.930 by you all as belonging to musicals. 01:59:23.930 --> 01:59:25.930 But now this is not interesting, and I certainly 01:59:25.930 --> 01:59:28.360 don't want to execute 10 or so queries manually 01:59:28.360 --> 01:59:30.400 to look up every one of those IDs. 01:59:30.400 --> 01:59:32.680 But notice what we can do in SQL, as well. 01:59:32.680 --> 01:59:33.880 I can nest queries. 
01:59:33.880 --> 01:59:36.940 Let me put this whole query in parentheses for just a moment 01:59:36.940 --> 01:59:39.070 and then prepend to it the following. 01:59:39.070 --> 01:59:46.930 SELECT title FROM shows WHERE the primary key, id, is in this subquery. 01:59:46.930 --> 01:59:50.650 So you can have nested queries, similar in spirit to Python and C 01:59:50.650 --> 01:59:52.510 when you have nested for loops. 01:59:52.510 --> 01:59:55.690 In this case, just like in grade school math, whatever is in the parentheses 01:59:55.690 --> 01:59:57.160 will be executed first. 01:59:57.160 --> 02:00:02.140 Then the outer query will be executed using the results of that inner query. 02:00:02.140 --> 02:00:07.000 So if I select the title from shows where the ID is in that list of IDs, 02:00:07.000 --> 02:00:07.660 voila. 02:00:07.660 --> 02:00:11.560 It seems that, somewhat amusingly, several of you 02:00:11.560 --> 02:00:15.280 think that Breaking Bad, Supernatural, Glee, Sherlock, How I Met Your Mother, 02:00:15.280 --> 02:00:18.190 Hawaii Five-0, Twin Peaks, The Lawyer, and My Brother, My Brother 02:00:18.190 --> 02:00:19.900 and Me are all musicals. 02:00:19.900 --> 02:00:22.630 I take exception to a few of those, but so be it. 02:00:22.630 --> 02:00:24.850 You checked the box for musical for those shows. 02:00:24.850 --> 02:00:29.260 So even though we've designed things better in the sense 02:00:29.260 --> 02:00:33.010 that we've normalized our database by factoring out commonalities 02:00:33.010 --> 02:00:35.050 or, rather, we've cleaned up the data, there's 02:00:35.050 --> 02:00:37.150 still admittedly some redundancy. 02:00:39.370 --> 02:00:44.410 But I at least now have the data in clean fashion 02:00:44.410 --> 02:00:47.800 so that every column has just a single value in it and not 02:00:47.800 --> 02:00:49.870 some contrived comma-separated list.
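The nested query just described can be sketched with Python's sqlite3 module and a tiny made-up data set (these titles and ids are invented for illustration, not the class's actual responses):

```python
import sqlite3

# Two tables in the cleaned-up design: shows has a primary key,
# and genres refers back to it via the show_id foreign key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")
db.executemany("INSERT INTO shows VALUES (?, ?)",
               [(1, "Glee"), (2, "Sherlock"), (3, "Twin Peaks")])
db.executemany("INSERT INTO genres VALUES (?, ?)",
               [(1, "Musical"), (1, "Comedy"), (2, "Crime"), (3, "Musical")])

# The parenthesized inner query runs first and yields show_ids;
# the outer query then maps those ids back to titles.
rows = db.execute(
    "SELECT title FROM shows WHERE id IN"
    " (SELECT show_id FROM genres WHERE genre = 'Musical')"
).fetchall()
print(rows)  # [('Glee',), ('Twin Peaks',)]
```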
02:00:49.870 --> 02:00:51.725 Suppose I want to find out all of the genres 02:00:51.725 --> 02:00:53.350 that you all thought The Office was in. 02:00:53.350 --> 02:00:55.480 So let's ask kind of the opposite question. 02:00:55.480 --> 02:00:56.840 Well, how might I do that? 02:00:56.840 --> 02:01:00.430 Well, to figure out The Office, I'm going to first need to SELECT the id 02:01:00.430 --> 02:01:06.400 FROM shows WHERE title = "THE OFFICE," because a whole bunch of you 02:01:06.400 --> 02:01:07.330 typed in The Office. 02:01:07.330 --> 02:01:09.902 And we gave each of your answers a unique identifier 02:01:09.902 --> 02:01:11.110 so we could keep track of it. 02:01:11.110 --> 02:01:12.500 And there's all of those numbers. 02:01:12.500 --> 02:01:14.260 Now, this is, like, dozens of responses. 02:01:14.260 --> 02:01:16.540 I certainly don't want to execute that many queries. 02:01:16.540 --> 02:01:18.850 But I think a subquery will help us out again. 02:01:18.850 --> 02:01:21.470 Let me put parentheses around this whole thing. 02:01:21.470 --> 02:01:27.910 And now let me say SELECT DISTINCT genre FROM genres WHERE 02:01:27.910 --> 02:01:32.380 the show_id in the genres table is in that query. 02:01:32.380 --> 02:01:37.400 And just for kicks, let me go ahead and ORDER BY genre. 02:01:37.400 --> 02:01:38.930 So let me go ahead and execute this. 02:01:38.930 --> 02:01:42.490 And, OK, somewhat amusingly, those of you who inputted The Office 02:01:42.490 --> 02:01:46.960 checked boxes for animation, comedy, documentary, drama, family, horror, 02:01:46.960 --> 02:01:49.000 reality-TV, romance, and sci-fi. 02:01:49.000 --> 02:01:51.020 I take exception to a few of those, too. 02:01:51.020 --> 02:01:53.720 But this is what happens when you accept user input. 
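The reverse lookup, from one title to all of its genres, can be sketched the same way, again with invented rows standing in for the dozens of real responses:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")
# Several people typed in the same title, so it appears under several ids.
db.executemany("INSERT INTO shows VALUES (?, ?)",
               [(1, "The Office"), (2, "The Office"), (3, "Glee")])
db.executemany("INSERT INTO genres VALUES (?, ?)",
               [(1, "Comedy"), (2, "Comedy"), (2, "Drama"), (3, "Musical")])

# DISTINCT collapses the duplicate Comedy rows; ORDER BY sorts the output.
rows = db.execute(
    "SELECT DISTINCT genre FROM genres WHERE show_id IN"
    " (SELECT id FROM shows WHERE title = 'The Office')"
    " ORDER BY genre"
).fetchall()
print(rows)  # [('Comedy',), ('Drama',)]
```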
02:01:53.720 --> 02:01:57.293 So here again, we have with this SQL language 02:01:57.293 --> 02:01:59.710 the ability to express fairly succinctly, even though it's 02:01:59.710 --> 02:02:03.670 a lot of new features today all at once, what would otherwise take me 02:02:03.670 --> 02:02:06.593 a dozen or two lines in Python code to implement 02:02:06.593 --> 02:02:09.010 and god knows how many lines of code and how many hours it 02:02:09.010 --> 02:02:13.060 would take me to implement something like this in C. Now, admittedly, 02:02:13.060 --> 02:02:15.220 we could do better than this design. 02:02:15.220 --> 02:02:18.550 This table or this picture represents what we have now. 02:02:18.550 --> 02:02:22.360 But you'll notice a lot of redundancy implicit in the genres table. 02:02:22.360 --> 02:02:25.510 Any time you check the comedy box, I have a row now 02:02:25.510 --> 02:02:27.940 that says comedy, comedy, comedy, comedy. 02:02:27.940 --> 02:02:31.930 And the show_id differs, but I have the word "comedy" again and again. 02:02:31.930 --> 02:02:35.740 And now, that tends to be frowned upon in the world of relational databases, 02:02:35.740 --> 02:02:39.370 because if you have a genre called comedy or one 02:02:39.370 --> 02:02:42.430 called musical or anything else, you should ideally just 02:02:42.430 --> 02:02:43.970 have that living in one place. 02:02:43.970 --> 02:02:47.980 And so if we really wanted to be particular and really, truly 02:02:47.980 --> 02:02:51.100 normalize this database, which is an academic term referring 02:02:51.100 --> 02:02:55.530 to removing all such redundancies, we could actually do it like this. 02:02:55.530 --> 02:02:59.480 We could have a shows table still with an id and title, no difference there. 02:02:59.480 --> 02:03:03.890 But we could have a genres table with two columns, id and name. 02:03:03.890 --> 02:03:05.020 Now, this is its own id. 02:03:05.020 --> 02:03:06.910 It has no connection with the show_id. 
02:03:06.910 --> 02:03:10.660 It's just its own unique identifier, a primary key here now, 02:03:10.660 --> 02:03:12.320 and the name of that genre. 02:03:12.320 --> 02:03:14.350 So you would have one row in the genres table 02:03:14.350 --> 02:03:17.690 for comedy, for drama, music, musical, and everything else. 02:03:17.690 --> 02:03:19.870 And then you would use a third table, which 02:03:19.870 --> 02:03:23.920 is colloquially called a join table, which I'll draw here in the middle. 02:03:23.920 --> 02:03:25.960 And you can call it anything you want, but we've 02:03:25.960 --> 02:03:29.920 called it shows_genres to make clear that this table implements 02:03:29.920 --> 02:03:33.400 a relationship between those two tables. 02:03:33.400 --> 02:03:36.910 And notice that in this table is really no juicy data. 02:03:36.910 --> 02:03:38.800 It's just foreign keys-- 02:03:38.800 --> 02:03:41.380 show_id, genre_id. 02:03:41.380 --> 02:03:43.930 And by having this third table, we can now 02:03:43.930 --> 02:03:47.890 make sure that the word "comedy" only appears in one row anywhere. 02:03:47.890 --> 02:03:50.860 The word "musical" only appears in one row anywhere. 02:03:50.860 --> 02:03:55.450 But we use these more efficient integers called show_id and genre_id, 02:03:55.450 --> 02:04:00.850 which respectively point to those primary keys and their primary tables 02:04:00.850 --> 02:04:02.072 to link those two together. 02:04:02.072 --> 02:04:04.780 And this is an example of what's called in the world of databases 02:04:04.780 --> 02:04:06.790 a many-to-many relationship. 02:04:06.790 --> 02:04:09.610 One show can have many genres. 02:04:09.610 --> 02:04:12.730 One genre can belong to many shows. 02:04:12.730 --> 02:04:14.530 And so by having this third table, you can 02:04:14.530 --> 02:04:16.730 have that many-to-many relationship. 
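The fully normalized, three-table design might be sketched like this; the table and column names follow the lecture's diagram, but the rows themselves are made up:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT);
    -- The join table holds no juicy data, just pairs of foreign keys.
    CREATE TABLE shows_genres (
        show_id INTEGER REFERENCES shows(id),
        genre_id INTEGER REFERENCES genres(id)
    );
""")
db.execute("INSERT INTO shows VALUES (1, 'The Office')")
db.execute("INSERT INTO genres VALUES (7, 'Comedy')")
db.execute("INSERT INTO shows_genres VALUES (1, 7)")

# 'Comedy' now lives in exactly one row, no matter how many shows use it.
rows = db.execute(
    "SELECT name FROM genres WHERE id IN"
    " (SELECT genre_id FROM shows_genres WHERE show_id = 1)"
).fetchall()
print(rows)  # [('Comedy',)]
```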
02:04:16.730 --> 02:04:19.570 And again, the third table now allows us to truly normalize 02:04:19.570 --> 02:04:23.920 our data set by getting rid of all of the duplicate comedy, comedy, comedy. 02:04:23.920 --> 02:04:25.420 Why is this important? 02:04:25.420 --> 02:04:27.310 Probably not a huge deal for genres. 02:04:27.310 --> 02:04:30.910 But imagine with my current design if I made a spelling mistake, 02:04:30.910 --> 02:04:32.440 and I misnamed comedy. 02:04:32.440 --> 02:04:36.190 I would now have to change every row with the word comedy again and again. 02:04:36.190 --> 02:04:39.580 Or if maybe you change the genres of the shows, 02:04:39.580 --> 02:04:41.840 you would have to change it in multiple places. 02:04:41.840 --> 02:04:44.260 But with this other approach with three tables, 02:04:44.260 --> 02:04:46.450 you can argue that now you only have to change 02:04:46.450 --> 02:04:49.750 the name of a genre in one place, not all over the place. 02:04:49.750 --> 02:04:52.420 And in general, in C and now in Python and now 02:04:52.420 --> 02:04:57.040 in SQL, it has been a good thing not to copy-paste identical values 02:04:57.040 --> 02:05:00.060 all over the place. 02:05:00.060 --> 02:05:00.720 All right. 02:05:00.720 --> 02:05:04.410 So with that said, what other tools do we have at our disposal? 02:05:04.410 --> 02:05:09.180 Well, it turns out that there are other data types out there in the real world 02:05:09.180 --> 02:05:11.280 using SQL besides just these five-- 02:05:11.280 --> 02:05:13.620 BLOB, INTEGER, NUMERIC, REAL, and TEXT. 02:05:13.620 --> 02:05:15.870 BLOB, again, is for binary stuff, generally not 02:05:15.870 --> 02:05:18.840 used except for more specialized applications, let's say.
02:05:18.840 --> 02:05:21.270 INTEGER, which is an int, typically 32 bits; 02:05:21.270 --> 02:05:23.700 NUMERIC, which is something like a date or a year 02:05:23.700 --> 02:05:26.220 or time or something like that; REAL numbers, which 02:05:26.220 --> 02:05:30.180 are floating point values; and TEXT, which are things like strings. 02:05:30.180 --> 02:05:34.110 But if you graduate ultimately from SQLite on phones 02:05:34.110 --> 02:05:38.522 and on Macs and PCs to actual servers that run Oracle, MySQL, 02:05:38.522 --> 02:05:40.230 and PostgreSQL if you're actually running 02:05:40.230 --> 02:05:42.540 your own internet-style business, well, it 02:05:42.540 --> 02:05:47.310 turns out that more sophisticated, even more powerful 02:05:47.310 --> 02:05:50.620 databases come with other subtypes, if you will. 02:05:50.620 --> 02:05:54.270 So besides INTEGER, you can specify smallint for small numbers, 02:05:54.270 --> 02:05:57.690 maybe using just a few bits instead of 32-- 02:05:57.690 --> 02:06:01.800 INTEGER or bigint, which uses 64 bits instead of 32. 02:06:01.800 --> 02:06:05.130 The Facebooks, the Twitters of the world need to use bigint a lot, 02:06:05.130 --> 02:06:06.720 because they have so much data. 02:06:06.720 --> 02:06:09.330 You and I can get away with simple integers, because we're not 02:06:09.330 --> 02:06:12.450 going to have more than 4 billion favorite TV shows in a class, 02:06:12.450 --> 02:06:13.290 certainly. 02:06:13.290 --> 02:06:17.040 Something like REAL, you can have 32-bit real numbers or, a little weirdly 02:06:17.040 --> 02:06:22.470 named, double precision, which is like a double was in C, using 64 bits instead 02:06:22.470 --> 02:06:23.640 for more precision. 02:06:23.640 --> 02:06:25.230 NUMERIC is kind of this catchall. 02:06:25.230 --> 02:06:29.160 You can have not only dates and date times but things like Boolean values. 
02:06:29.160 --> 02:06:31.200 You can specify the total number of digits 02:06:31.200 --> 02:06:34.180 to store using this numeric scale and precision. 02:06:34.180 --> 02:06:37.440 So it relates to numbers that aren't quite just integers. 02:06:37.440 --> 02:06:39.720 And then you also have categories of TEXT-- 02:06:39.720 --> 02:06:42.570 char followed by a number, which specifies 02:06:42.570 --> 02:06:47.010 that every value in the column will have the same number of characters. 02:06:47.010 --> 02:06:50.190 That's helpful for things where you know the length in advance, like in the US. 02:06:50.190 --> 02:06:54.030 All states, all 50 states, have two-character codes, 02:06:54.030 --> 02:06:57.450 like MA for Massachusetts, CA for California. 02:06:57.450 --> 02:07:00.150 char(2) would be appropriate there, because you 02:07:00.150 --> 02:07:03.000 know every value in the column is going to have two characters. 02:07:03.000 --> 02:07:05.250 When you don't know, though, you can use varchar. 02:07:05.250 --> 02:07:08.400 And varchar specifies a maximum number of characters. 02:07:08.400 --> 02:07:12.060 And so you might specify varchar of, like, 32. 02:07:12.060 --> 02:07:15.600 No one would then be able to type in a name that's longer than 32 characters. 02:07:15.600 --> 02:07:18.900 Or varchar(200) if you want to allow for something even bigger. 02:07:18.900 --> 02:07:21.690 But this is germane to our real-world experience with the web. 02:07:21.690 --> 02:07:24.540 If you've ever gone to a website, started filling out a form, 02:07:24.540 --> 02:07:26.850 and all of a sudden you can't type any more characters 02:07:26.850 --> 02:07:28.440 because your response is too long-- 02:07:28.440 --> 02:07:29.462 why is that? 02:07:29.462 --> 02:07:31.170 Well, one, the programmers just might not 02:07:31.170 --> 02:07:33.810 want you to keep expressing yourself in more detail, especially 02:07:33.810 --> 02:07:36.330 if it's a complaint form on a customer service site.
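The state-code example might look like this sketch. Note that SQLite itself accepts but does not enforce these declared lengths (it stores both columns as TEXT); the enforcement described here is what server databases like MySQL and PostgreSQL do:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# char(2): every code is exactly two characters; varchar(32): at most 32.
# (SQLite ignores the lengths; MySQL/PostgreSQL would enforce them.)
db.execute("CREATE TABLE states (code CHAR(2), name VARCHAR(32))")
db.executemany("INSERT INTO states VALUES (?, ?)",
               [("MA", "Massachusetts"), ("CA", "California")])

row = db.execute("SELECT name FROM states WHERE code = 'MA'").fetchone()
print(row)  # ('Massachusetts',)
```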
02:07:36.330 --> 02:07:40.620 But pragmatically, it's probably because their database was designed 02:07:40.620 --> 02:07:42.570 to store a finite number of characters. 02:07:42.570 --> 02:07:44.025 And you have hit that threshold. 02:07:44.025 --> 02:07:45.900 And you certainly don't want to have a buffer 02:07:45.900 --> 02:07:50.190 overflow, like in C. So the database will enforce a maximum length n. 02:07:50.190 --> 02:07:52.830 And then text is for even bigger chunks of text. 02:07:52.830 --> 02:07:54.930 If you're letting people copy-paste their resumes 02:07:54.930 --> 02:07:59.680 or whole documents or even larger sets of text, you might use text instead. 02:07:59.680 --> 02:08:03.510 So let's then consider a real-world data set. 02:08:03.510 --> 02:08:07.320 Things get really interesting, and all of these very academic ideas 02:08:07.320 --> 02:08:09.600 and recommendations really come into play 02:08:09.600 --> 02:08:14.830 when we don't have hundreds of favorites but when we have thousands instead. 02:08:14.830 --> 02:08:19.180 And so what I'm going to go ahead and do here is download a file here, 02:08:19.180 --> 02:08:25.120 which is a SQLite version of the IMDb, Internet Movie Database, 02:08:25.120 --> 02:08:27.120 that some of you might have used in website form 02:08:27.120 --> 02:08:30.330 in order to look up movies and ratings thereof and the like. 02:08:30.330 --> 02:08:32.100 And what we've done in advance is we wrote 02:08:32.100 --> 02:08:38.490 a script that downloaded all of that information in advance as TSV files. 02:08:38.490 --> 02:08:42.600 It turns out that they, Internet Movie Database, make all of their data 02:08:42.600 --> 02:08:46.650 available as TSV files, Tab-Separated Values. 02:08:46.650 --> 02:08:54.010 And we went ahead and imported it, with a script, into a database called shows.db as follows.
02:08:54.010 --> 02:08:55.800 So I'm going to go ahead in just a moment 02:08:55.800 --> 02:08:59.520 and open up shows.db, which is not the version I created earlier 02:08:59.520 --> 02:09:00.990 based on your favorites. 02:09:00.990 --> 02:09:02.820 This is now the version that we, the staff, 02:09:02.820 --> 02:09:06.630 created in advance by downloading hundreds of thousands 02:09:06.630 --> 02:09:10.950 of movies and TV shows and actors and directors from IMDb.com 02:09:10.950 --> 02:09:15.660 under their license and then imported into a SQLite database. 02:09:15.660 --> 02:09:17.130 So how can I see what's in here? 02:09:17.130 --> 02:09:19.530 Well, let me go ahead and type .schema, recall. 02:09:19.530 --> 02:09:22.545 And you'll see a whole bunch of data therein. 02:09:22.545 --> 02:09:25.690 And in fact, in pictorial form, it actually looks like this. 02:09:25.690 --> 02:09:28.140 Here is a picture that just gives you the lay of the land. 02:09:28.140 --> 02:09:30.330 There's going to be a people table that has 02:09:30.330 --> 02:09:33.903 an ID for every person, a name, and their birth year. 02:09:33.903 --> 02:09:36.570 There's going to be a shows table, just like we've been talking, 02:09:36.570 --> 02:09:41.100 which is IDs, titles of shows-- also, though, the year that the show debuted 02:09:41.100 --> 02:09:43.380 and the number of episodes that the show had. 02:09:43.380 --> 02:09:46.410 Then there's going to be genres, similar in design to before. 02:09:46.410 --> 02:09:49.620 So we didn't go all out and factor it out into a third table. 02:09:49.620 --> 02:09:52.830 We just have some duplication here, admittedly, in genres. 02:09:52.830 --> 02:09:54.240 But then there's a ratings table. 02:09:54.240 --> 02:09:57.240 And here's where you can see where relational databases get interesting. 
02:09:57.240 --> 02:10:01.000 You can have a ratings table storing ratings, like 1 to 5, 02:10:01.000 --> 02:10:05.080 but also associate those ratings with a show by way of its show_id. 02:10:05.080 --> 02:10:08.440 And then you can keep track of the number of votes that that show got. 02:10:08.440 --> 02:10:10.910 Writers, notice, is a separate table. 02:10:10.910 --> 02:10:12.560 And notice this is kind of cool. 02:10:12.560 --> 02:10:19.060 This table, per the arrows, relates to the shows table and the people table, 02:10:19.060 --> 02:10:20.770 because this is a join table. 02:10:20.770 --> 02:10:24.040 A foreign key of show_id and a foreign key of person_id 02:10:24.040 --> 02:10:28.250 refer to the shows table and the people table respectively 02:10:28.250 --> 02:10:32.710 so that a human person can be a writer for multiple shows 02:10:32.710 --> 02:10:36.560 and one show can have multiple writers, another many-to-many relationship. 02:10:36.560 --> 02:10:39.310 And then lastly, stars, the actors in a show. 02:10:39.310 --> 02:10:41.050 Notice that this, too, is a join table. 02:10:41.050 --> 02:10:43.540 It's only got two foreign keys, a show_id 02:10:43.540 --> 02:10:47.447 and a person_id that are referring back to those tables respectively. 02:10:47.447 --> 02:10:50.030 And here's where a relational database really makes sense. 02:10:50.030 --> 02:10:52.930 It would be pretty stupid and bad design if you 02:10:52.930 --> 02:10:57.520 had names of all of the directors and names of all of the writers 02:10:57.520 --> 02:11:01.840 and names of all of the stars of these shows in separate tables in duplicate, 02:11:01.840 --> 02:11:04.330 like Steve Carell, Steve Carell, Steve Carell. 02:11:04.330 --> 02:11:06.670 All of those actors and directors and writers 02:11:06.670 --> 02:11:11.450 and every other role in the business are just people at the end of the day.
02:11:11.450 --> 02:11:13.630 So in a relational database, the advice would 02:11:13.630 --> 02:11:16.570 be to put all of those people in a people table 02:11:16.570 --> 02:11:21.040 and then use primary and foreign keys to refer to, to relate them to, 02:11:21.040 --> 02:11:24.010 these other types of tables. 02:11:24.010 --> 02:11:26.720 The catch is, though, that when we do this, 02:11:26.720 --> 02:11:31.280 it turns out that things can be slow when we have lots of data. 02:11:31.280 --> 02:11:33.250 So for instance, let me go into this. 02:11:33.250 --> 02:11:37.210 Let me go ahead and SELECT * FROM shows;. 02:11:37.210 --> 02:11:38.343 That's a lot of data. 02:11:38.343 --> 02:11:41.260 It's pretty fast on my Mac, and I switched from the IDE to my Mac just 02:11:41.260 --> 02:11:43.270 to save time, because it's a little faster doing things 02:11:43.270 --> 02:11:44.800 locally instead of in the cloud. 02:11:44.800 --> 02:11:48.460 Let me go ahead and count the number of shows in this IMDb database 02:11:48.460 --> 02:11:49.720 by using COUNT. 02:11:49.720 --> 02:11:53.390 153,331 TV shows. 02:11:53.390 --> 02:11:54.250 So that's a lot. 02:11:54.250 --> 02:11:59.110 How about the count of people from the people table? 02:11:59.110 --> 02:12:06.290 457,886 people who might be stars or writers or some other role, as well. 02:12:06.290 --> 02:12:07.765 So this is a sizable data set. 02:12:07.765 --> 02:12:09.890 So let me go ahead and do something simple, though. 02:12:09.890 --> 02:12:14.560 Let me go ahead and SELECT * FROM shows WHERE title = "The Office." 02:12:14.560 --> 02:12:17.900 And this time, I don't have to worry about weird capitalization or spacing. 02:12:17.900 --> 02:12:18.850 This is IMDb. 02:12:18.850 --> 02:12:21.727 This is clean data from an authoritative source. 02:12:21.727 --> 02:12:24.310 Notice that there's actually different versions of The Office. 02:12:24.310 --> 02:12:26.590 You probably know the UK one and the US one. 
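Those COUNT queries can be sketched in miniature; the real shows.db has 153,331 shows, while this toy table has just three invented rows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany("INSERT INTO shows (title) VALUES (?)",
               [("The Office",), ("The Office",), ("Glee",)])

# COUNT(*) collapses the result into a single number of rows.
n = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
print(n)  # 3

# WHERE filters the rows first; COUNT then counts what remains.
n_office = db.execute(
    "SELECT COUNT(*) FROM shows WHERE title = 'The Office'"
).fetchone()[0]
print(n_office)  # 2
```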
02:12:26.590 --> 02:12:30.520 There's other shows that are unrelated to that particular type of show. 02:12:30.520 --> 02:12:34.540 But each of them is distinguished, notice, by the year here. 02:12:34.540 --> 02:12:37.280 All right, so that's kind of a lot. 02:12:37.280 --> 02:12:38.680 And let's do this again. 02:12:38.680 --> 02:12:40.930 Let me go ahead and turn on a feature temporarily just 02:12:40.930 --> 02:12:44.000 to time this query by turning on a timer in this program. 02:12:44.000 --> 02:12:45.370 And let me run it again. 02:12:45.370 --> 02:12:51.970 It looks like it took 0.012 seconds of real time to do that search. 02:12:51.970 --> 02:12:52.780 That's pretty fast. 02:12:52.780 --> 02:12:55.180 I barely noticed, certainly because it's so fast. 02:12:55.180 --> 02:12:56.710 But let me go ahead and do this. 02:12:56.710 --> 02:13:01.510 Let me go ahead and create an index called title_index on the table 02:13:01.510 --> 02:13:04.360 called shows on its title column. 02:13:04.360 --> 02:13:05.470 Well, what am I doing? 02:13:05.470 --> 02:13:08.680 Well, to answer the question finally from before about performance, 02:13:08.680 --> 02:13:11.340 by default, everything we've been doing is indeed big O of n. 02:13:11.340 --> 02:13:13.090 It's just being linearly searched from top 02:13:13.090 --> 02:13:16.630 to bottom, which seems to call into question the whole purpose of SQL if we 02:13:16.630 --> 02:13:18.850 were doing no better than with CSVs. 02:13:18.850 --> 02:13:22.480 But an index is a clue to the database to load 02:13:22.480 --> 02:13:25.960 the data more efficiently in such a way that you get logarithmic time. 02:13:25.960 --> 02:13:30.520 An index is a fancy data structure that the SQLite database or the Oracle 02:13:30.520 --> 02:13:33.520 database or the MySQL database, whatever product you're using, 02:13:33.520 --> 02:13:35.680 builds up for you in memory. 
02:13:35.680 --> 02:13:38.560 And then it does something using syntax like this 02:13:38.560 --> 02:13:42.340 that builds in memory generally something known as a B-tree. 02:13:42.340 --> 02:13:44.178 We've talked a bit about trees in the class. 02:13:44.178 --> 02:13:46.720 We talked about binary search trees, things that kind of look 02:13:46.720 --> 02:13:47.920 like family trees. 02:13:47.920 --> 02:13:50.620 A B-tree is essentially a family tree that's 02:13:50.620 --> 02:13:53.020 just very wide and not that tall. 02:13:53.020 --> 02:13:56.500 It's a data structure similar in spirit to what we looked at in C. 02:13:56.500 --> 02:13:59.830 But it tries to keep all of the leaf nodes, all of the children 02:13:59.830 --> 02:14:02.230 or grandchildren or great-grandchildren, so to speak, 02:14:02.230 --> 02:14:04.390 as close to the root as possible. 02:14:04.390 --> 02:14:08.320 And the algorithm it uses for that tends to be proprietary or documented 02:14:08.320 --> 02:14:09.970 based on the system you're using. 02:14:09.970 --> 02:14:12.100 But it doesn't store things in a list. 02:14:12.100 --> 02:14:17.620 It does not store things top to bottom, like the tables we view them as. 02:14:17.620 --> 02:14:21.640 Underneath the hood, those tables that look like very tall structures 02:14:21.640 --> 02:14:23.770 are actually, underneath the hood, implemented 02:14:23.770 --> 02:14:25.820 with fancier things called trees. 02:14:25.820 --> 02:14:29.710 And if we create those trees by creating what are properly called indexes 02:14:29.710 --> 02:14:34.660 like this, it might take us a moment, like 0.098 seconds, to create an index. 02:14:34.660 --> 02:14:36.220 But now notice what happens. 02:14:36.220 --> 02:14:40.210 Previously, when I searched the titles for The Office, using linear search, 02:14:40.210 --> 02:14:43.180 it took 0.012 seconds. 
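One way to see the before-and-after of CREATE INDEX, without a stopwatch, is SQLite's EXPLAIN QUERY PLAN, which reports whether a query will scan the whole table linearly or search the index's B-tree; the exact plan wording varies by SQLite version, and the table here is synthetic:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany("INSERT INTO shows (title) VALUES (?)",
               [(f"Show {i}",) for i in range(10000)])

query = "SELECT * FROM shows WHERE title = 'Show 42'"

# Without an index: a linear scan, big O of n.
plan_before = db.execute("EXPLAIN QUERY PLAN " + query).fetchone()
print(plan_before[-1])  # e.g. "SCAN shows"

db.execute("CREATE INDEX title_index ON shows (title)")

# With the index: a logarithmic search of the B-tree instead.
plan_after = db.execute("EXPLAIN QUERY PLAN " + query).fetchone()
print(plan_after[-1])  # e.g. "SEARCH shows USING INDEX title_index (title=?)"
```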
02:14:43.180 --> 02:14:46.750 If I do the same query again after having created the index 02:14:46.750 --> 02:14:50.920 and having told SQLite, build me this fancy tree in memory, voila. 02:14:50.920 --> 02:14:55.450 0.001 seconds, so orders of magnitude faster. 02:14:55.450 --> 02:14:57.550 Now, both are fast to us humans, certainly. 02:14:57.550 --> 02:15:01.040 But imagine the data set being even bigger, the query being even bigger. 02:15:01.040 --> 02:15:05.900 These indexes can get even larger than that. 02:15:05.900 --> 02:15:07.970 Rather, the queries can take longer than that 02:15:07.970 --> 02:15:11.130 and therefore take even more time than that. 02:15:11.130 --> 02:15:13.940 But unfortunately, if I've got all of my data all over the place, 02:15:13.940 --> 02:15:16.970 as in a diagram like this, my god. 02:15:16.970 --> 02:15:18.770 How do I actually get useful work done? 02:15:18.770 --> 02:15:21.590 How do I get back the people in a movie and the writers 02:15:21.590 --> 02:15:24.260 and the stars and the ratings if it's all over the place? 02:15:24.260 --> 02:15:26.840 I would seem to have created such a mess and that I now 02:15:26.840 --> 02:15:28.910 need to execute all of these queries. 02:15:28.910 --> 02:15:32.000 But notice it doesn't have to be that complicated. 02:15:32.000 --> 02:15:35.660 It turns out that there's another keyword in SQL, really the last 02:15:35.660 --> 02:15:38.150 that we'll look at here, called JOIN. 02:15:38.150 --> 02:15:41.480 The JOIN keyword, which you can use implicitly or explicitly, 02:15:41.480 --> 02:15:45.470 allows you to just join tables together and sort of reconstitute 02:15:45.470 --> 02:15:47.760 a bigger, more user friendly table. 02:15:47.760 --> 02:15:51.020 So for instance, suppose I want to get all of Steve Carell's TV shows, 02:15:51.020 --> 02:15:52.250 not just The Office. 
02:15:52.250 --> 02:15:55.880 Well, recall that I can select Steve's ID from the people 02:15:55.880 --> 02:15:59.390 table WHERE name = "Steve Carell." 02:15:59.390 --> 02:16:02.780 So again, he has a different ID in this table, because this is from IMDb. 02:16:02.780 --> 02:16:04.400 But there's his ID. 02:16:04.400 --> 02:16:07.260 And let me go ahead and turn the timer off for now. 02:16:07.260 --> 02:16:07.760 All right. 02:16:07.760 --> 02:16:11.510 So there is his ID, 126797. 02:16:11.510 --> 02:16:14.780 I could copy paste that into my code, but that's not necessary 02:16:14.780 --> 02:16:16.490 thanks to these nested queries. 02:16:16.490 --> 02:16:18.660 I can do something like this. 02:16:18.660 --> 02:16:23.720 Let me go ahead and now select all of the show_ids from the stars table 02:16:23.720 --> 02:16:29.790 where person_id from that table is equal to this result. 02:16:29.790 --> 02:16:33.240 So there's that join table, stars, that links people and shows. 02:16:33.240 --> 02:16:35.370 So let me go ahead and execute that. 02:16:35.370 --> 02:16:35.870 All right. 02:16:35.870 --> 02:16:39.559 So there's all of the show_ids of Steve Carell's TV shows. 02:16:39.559 --> 02:16:40.379 That's a lot. 02:16:40.379 --> 02:16:42.139 And it's very nonobvious what they are. 02:16:42.139 --> 02:16:45.680 So let me do another nested query by putting all of that in parentheses 02:16:45.680 --> 02:16:51.530 and now SELECT title FROM shows WHERE the ID of the show 02:16:51.530 --> 02:16:55.820 is in this big, long list of show_ids. 02:16:55.820 --> 02:17:00.260 And there are all of the shows that he's in, including The Dana Carvey Show 02:17:00.260 --> 02:17:04.430 back when, The Office up at the top, and then, most recently, 02:17:04.430 --> 02:17:07.142 shows like The Morning Show on Apple TV. 02:17:07.142 --> 02:17:09.350 All right, so that's pretty cool that we can actually 02:17:09.350 --> 02:17:11.129 reconstitute the data like that. 
02:17:11.129 --> 02:17:13.889 But it turns out there's different ways of doing that, as well. 02:17:13.889 --> 02:17:15.950 And you'll see more of this in the coming weeks 02:17:15.950 --> 02:17:18.150 and in the problem sets and labs and the like. 02:17:18.150 --> 02:17:19.879 But it turns out we can do other things, as well. 02:17:19.879 --> 02:17:21.962 And let me just show this syntax even though it'll 02:17:21.962 --> 02:17:23.670 look a little cryptic at first glance. 02:17:23.670 --> 02:17:26.299 You can also use that JOIN keyword as follows. 02:17:26.299 --> 02:17:33.350 I can select the title from the people table joined with the stars table 02:17:33.350 --> 02:17:39.959 on the people.id column equaling the stars.person_id column. 02:17:39.959 --> 02:17:42.799 So in other words, I can select a title from the result 02:17:42.799 --> 02:17:46.940 of joining people and stars, like this, on the id column in one 02:17:46.940 --> 02:17:49.129 and the person_id column in the other. 02:17:49.129 --> 02:17:58.879 And I can join in the shows table on the stars.show_id equaling the shows.id. 02:17:58.879 --> 02:18:03.799 So again, now I'm joining the primary and foreign keys on these two tables 02:18:03.799 --> 02:18:07.700 where the name equals "Steve Carell." 02:18:07.700 --> 02:18:10.070 So this is the most cryptic thing we've seen yet. 02:18:10.070 --> 02:18:12.530 But it just means take this table and join it with this one 02:18:12.530 --> 02:18:16.580 and then join it with this one and filter all of the resulting joined rows 02:18:16.580 --> 02:18:18.530 by a name of Steve Carell. 02:18:18.530 --> 02:18:19.520 And voila. 02:18:19.520 --> 02:18:22.469 There we have all of those answers, as well. 02:18:22.469 --> 02:18:25.129 And there's other ways of doing this, too. 02:18:25.129 --> 02:18:27.809 I'll leave unsaid now some of the syntax for that. 02:18:27.809 --> 02:18:29.480 But that felt a little slow. 
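The explicit JOIN can be sketched on a tiny made-up subset of the IMDb-style schema; the ids here are invented, not the real IMDb identifiers:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
""")
db.execute("INSERT INTO people VALUES (1, 'Steve Carell')")
db.executemany("INSERT INTO shows VALUES (?, ?)",
               [(10, "The Office"), (11, "The Morning Show")])
db.executemany("INSERT INTO stars VALUES (?, ?)", [(10, 1), (11, 1)])

# Join people to stars on the person's id, then stars to shows on the
# show's id, and only then filter the combined rows by name.
rows = db.execute("""
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = 'Steve Carell'
""").fetchall()
print(rows)  # both of his shows
```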
02:18:29.480 --> 02:18:32.090 And in fact, let me go ahead and turn my timer back on. 02:18:32.090 --> 02:18:34.610 Let me re-execute this last query. 02:18:34.610 --> 02:18:40.879 SELECT title FROM people joining on stars, joining on shows 02:18:40.879 --> 02:18:42.650 WHERE name = "Steve Carell." 02:18:42.650 --> 02:18:44.700 That took over half a second. 02:18:44.700 --> 02:18:47.480 So that was actually admittedly kind of slow. 02:18:47.480 --> 02:18:50.209 But again, indexes come to the rescue if, again, we 02:18:50.209 --> 02:18:52.610 don't allow linear search to dominate. 02:18:52.610 --> 02:18:54.889 But let me go ahead and create a few indexes. 02:18:54.889 --> 02:19:01.940 Create an index called person_index on the stars table, the person_id column. 02:19:01.940 --> 02:19:02.570 Why? 02:19:02.570 --> 02:19:05.600 Well, my query a moment ago used the person_id column. 02:19:05.600 --> 02:19:06.510 It filtered on it. 02:19:06.510 --> 02:19:08.000 So that might be a bottleneck. 02:19:08.000 --> 02:19:12.290 I'm going to go ahead and create another index called show_index 02:19:12.290 --> 02:19:14.870 on the stars table on show_id. 02:19:14.870 --> 02:19:18.290 Similarly, a moment ago, my query used the show_id column. 02:19:18.290 --> 02:19:21.743 And so that, too, might have been a bottleneck linearly, top to bottom. 02:19:21.743 --> 02:19:22.910 So let me create that index. 02:19:22.910 --> 02:19:25.368 And then lastly, let me create an index called name_index-- 02:19:25.368 --> 02:19:28.459 and this is perhaps the most obvious, similar to the show titles before-- 02:19:28.459 --> 02:19:31.549 on the people table on the name column. 02:19:31.549 --> 02:19:32.930 And that, too, took a moment. 02:19:32.930 --> 02:19:35.330 Now, in total, this took almost a full second. 02:19:35.330 --> 02:19:37.850 But these indexes only get created once. 02:19:37.850 --> 02:19:40.070 They get maintained automatically over time.
02:19:40.070 --> 02:19:42.080 But you don't incur this with every query. 02:19:42.080 --> 02:19:44.389 Now let me do my SELECT again. 02:19:44.389 --> 02:19:48.800 Let me SELECT title FROM people joining the stars table, 02:19:48.800 --> 02:19:52.730 joining the shows table WHERE name = "Steve Carell." 02:19:52.730 --> 02:19:53.690 Boom. 02:19:53.690 --> 02:19:56.630 0.001 seconds. 02:19:56.630 --> 02:20:00.930 That was orders of magnitude faster than the more than half a second 02:20:00.930 --> 02:20:02.620 it took us a little bit ago. 02:20:02.620 --> 02:20:05.860 So here, too, you see the power of a relational database. 02:20:05.860 --> 02:20:08.912 So even though we've created some problems for ourselves over time, 02:20:08.912 --> 02:20:12.120 we've solved them ultimately-- granted, with some more sophisticated features 02:20:12.120 --> 02:20:13.320 and additional syntax. 02:20:13.320 --> 02:20:15.990 But a relational database is indeed why you use them 02:20:15.990 --> 02:20:19.470 in the real world for the Twitters, the Instagrams, the Facebooks, the Googles, 02:20:19.470 --> 02:20:22.590 because they can store data so efficiently 02:20:22.590 --> 02:20:25.960 without redundancy, because you can normalize them and factor everything 02:20:25.960 --> 02:20:26.460 out. 02:20:26.460 --> 02:20:28.740 But they can still maintain the relations 02:20:28.740 --> 02:20:30.570 that you might have seen in a spreadsheet 02:20:30.570 --> 02:20:32.940 but using something closer to logarithmic time thanks 02:20:32.940 --> 02:20:34.770 to those tree structures. 02:20:34.770 --> 02:20:35.910 But there are problems. 02:20:35.910 --> 02:20:38.880 And what we wanted to do is end today on two primary problems 02:20:38.880 --> 02:20:42.570 that are introduced with SQL, because they are just unfortunately 02:20:42.570 --> 02:20:43.920 so commonly done. 02:20:43.920 --> 02:20:45.462 Notice this here.
02:20:45.462 --> 02:20:47.670 There is something generally known as a SQL injection 02:20:47.670 --> 02:20:51.330 attack, which you are vulnerable to in any application 02:20:51.330 --> 02:20:52.830 where you're taking user input. 02:20:52.830 --> 02:20:55.800 That hasn't been an issue for my favorites.py file, 02:20:55.800 --> 02:20:58.260 where I only took input from a CSV. 02:20:58.260 --> 02:21:00.510 But if one of you were malicious, what if one of you 02:21:00.510 --> 02:21:03.750 had maliciously typed in the word "delete" or "update" 02:21:03.750 --> 02:21:06.180 or something else as the title of your show 02:21:06.180 --> 02:21:11.040 and I accidentally plugged it into my own Python code when executing a query? 02:21:11.040 --> 02:21:14.940 You could potentially inject SQL into my own code. 02:21:14.940 --> 02:21:15.750 How might that be? 02:21:15.750 --> 02:21:18.960 Well, if logging in via Yale, you'll typically see a form like this. 02:21:18.960 --> 02:21:21.850 Or logging in via Harvard to something, you'll see a form like this. 02:21:21.850 --> 02:21:23.767 Here's an example that I'm pretty sure neither 02:21:23.767 --> 02:21:25.710 Harvard nor Yale are vulnerable to. 02:21:25.710 --> 02:21:28.590 Suppose I type in my email address to this login form 02:21:28.590 --> 02:21:32.350 as malan@harvard.edu'--. 02:21:32.350 --> 02:21:34.890 It turns out, in SQL, -- 02:21:34.890 --> 02:21:38.250 is the symbol for commenting if you want to comment something out. 02:21:38.250 --> 02:21:40.428 It turns out that the single quote is used 02:21:40.428 --> 02:21:43.470 when you want to search for something like Steve Carell or, in this case, 02:21:43.470 --> 02:21:44.930 malan@harvard.edu. 02:21:44.930 --> 02:21:45.930 It can be double quotes. 02:21:45.930 --> 02:21:47.040 It can be single quotes. 02:21:47.040 --> 02:21:50.040 In this case, I'm using single quotes here. 02:21:50.040 --> 02:21:53.400 But let's consider some sample code, if you will, in Python. 
02:21:53.400 --> 02:21:56.910 Here's a line of code that I propose might exist in the backend 02:21:56.910 --> 02:22:00.180 for Harvard's authentication or Yale's or anyone else's. 02:22:00.180 --> 02:22:04.890 Maybe someone wrote some Python code like this using SELECT * FROM users 02:22:04.890 --> 02:22:06.870 WHERE username = question? 02:22:06.870 --> 02:22:10.770 AND password = question?, and they plugged in username and password. 02:22:10.770 --> 02:22:13.770 Whatever the user typed into that web form a moment ago gets 02:22:13.770 --> 02:22:16.270 plugged in here to these question marks. 02:22:16.270 --> 02:22:17.290 This is good. 02:22:17.290 --> 02:22:20.980 This is good code, because you're using the SQL question marks. 02:22:20.980 --> 02:22:24.315 So if you literally just do what we preach today and use these question 02:22:24.315 --> 02:22:27.870 mark placeholders, you are safe from SQL injection attacks. 02:22:27.870 --> 02:22:29.760 Unfortunately, there are too many developers 02:22:29.760 --> 02:22:34.950 in the world that don't practice this or don't realize this or do forget this. 02:22:34.950 --> 02:22:38.850 If you instead resort to Python approaches like this, 02:22:38.850 --> 02:22:42.910 where you use an f-string instead, which might be your instincts after last 02:22:42.910 --> 02:22:45.660 week, because they're wonderfully convenient with the curly braces 02:22:45.660 --> 02:22:46.290 and all-- 02:22:46.290 --> 02:22:50.370 suppose that you literally plug in username and password 02:22:50.370 --> 02:22:53.430 not with the question mark placeholders but just literally 02:22:53.430 --> 02:22:55.260 in between those curly braces. 02:22:55.260 --> 02:22:58.210 Watch what happens if my username, malan@harvard.edu, 02:22:58.210 --> 02:23:03.120 was actually typed in by me maliciously as malan@harvard.edu'--. 
02:23:05.691 --> 02:23:09.030 That would have the effect of tricking this Python 02:23:09.030 --> 02:23:11.610 code into doing essentially this. 02:23:11.610 --> 02:23:13.590 Let me do a find and replace. 02:23:13.590 --> 02:23:22.881 It would trick Python into executing username = 'malan@harvard.edu'--' 02:23:22.881 --> 02:23:24.660 and then other stuff. 02:23:24.660 --> 02:23:27.480 Unfortunately, the -- again means comment, 02:23:27.480 --> 02:23:33.390 which means you could maybe trick a server into ignoring the whole password 02:23:33.390 --> 02:23:35.190 part of this SQL query. 02:23:35.190 --> 02:23:37.530 And if the SQL query's purpose in life is to check, 02:23:37.530 --> 02:23:42.210 is this username and password valid, so that you can decide to log the user in 02:23:42.210 --> 02:23:44.880 or to say, no, you're not authorized, well, 02:23:44.880 --> 02:23:48.390 by essentially commenting out everything related to password, 02:23:48.390 --> 02:23:49.710 notice what I've done. 02:23:49.710 --> 02:23:55.620 I've just now theoretically logged myself in as malan@harvard.edu without 02:23:55.620 --> 02:24:00.030 even knowing or inputting a password, because I injected SQL syntax, 02:24:00.030 --> 02:24:04.620 the quote and the --, into my query, tricking the server into just ignoring 02:24:04.620 --> 02:24:06.870 the password equality check. 02:24:06.870 --> 02:24:11.250 And so it turns out that db.execute, when you execute an INSERT, 02:24:11.250 --> 02:24:15.240 it returns to you, as we said, the ID of the newly inserted row. 02:24:15.240 --> 02:24:20.370 When you use db.execute to select rows from a database table, 02:24:20.370 --> 02:24:25.360 it returns to you a list of rows, each of which is a dictionary. 02:24:25.360 --> 02:24:28.110 So this is now pseudocode down here with my comment. 02:24:28.110 --> 02:24:31.140 But if you get back one row, that would seem 02:24:31.140 --> 02:24:34.470 to imply that there is a user named malan@harvard.edu.
02:24:34.470 --> 02:24:37.830 Don't know what his password is, because whoever this person is maliciously 02:24:37.830 --> 02:24:40.860 tricked the server into ignoring that syntax. 02:24:40.860 --> 02:24:43.890 So SQL injection attacks are unfortunately 02:24:43.890 --> 02:24:46.570 one of the most common attacks against SQL databases. 02:24:46.570 --> 02:24:51.090 They are completely preventable if you simply use placeholders and use 02:24:51.090 --> 02:24:53.940 libraries, whether it's CS50's or other third-party libraries 02:24:53.940 --> 02:24:55.440 that you may use down the road. 02:24:55.440 --> 02:24:58.530 A common meme on the internet is this picture here. 02:24:58.530 --> 02:25:00.810 If we zoom in on this person's license plate 02:25:00.810 --> 02:25:02.970 or where the license plate should be, this 02:25:02.970 --> 02:25:05.940 is an example of someone theoretically trying 02:25:05.940 --> 02:25:10.350 to trick some camera on the highway into dropping the whole database. 02:25:10.350 --> 02:25:13.710 DROP is another keyword in SQL that deletes a database table. 02:25:13.710 --> 02:25:15.810 And this person was either intentionally or just 02:25:15.810 --> 02:25:19.980 humorously trying to trick it into executing SQL 02:25:19.980 --> 02:25:21.760 by using syntax like this. 02:25:21.760 --> 02:25:26.070 So characters like single quotes, --, semicolons are all potentially 02:25:26.070 --> 02:25:29.190 dangerous characters in SQL if they're passed through unchanged 02:25:29.190 --> 02:25:30.120 to the database. 02:25:30.120 --> 02:25:34.140 A very popular xkcd comic-- let me give you a moment to just read this-- 02:25:34.140 --> 02:25:40.080 is another well-known meme of sorts now in computer science. 02:25:40.080 --> 02:25:43.880 If you'd like to, read this one on your own. 02:25:43.880 --> 02:25:51.170 But henceforth, you are now in the family of educated learners who 02:25:51.170 --> 02:25:54.410 know who Little Bobby Tables is.
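The attack and its fix can both be demonstrated in a few lines of sqlite3; the users table, its one row, and the password value below are made up for illustration. The unsafe f-string splices the attacker's quote and -- straight into the SQL, while the ? placeholder treats the same input as plain data.

```python
import sqlite3

# Hypothetical users table with one made-up account.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'secret')")

username = "malan@harvard.edu'--"   # the malicious input from the form
password = "wrong"                  # attacker does not know the real one

# UNSAFE: the f-string builds
#   ... WHERE username = 'malan@harvard.edu'--' AND password = 'wrong'
# and the -- comments out the entire password check.
unsafe = db.execute(
    f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
).fetchall()
print(len(unsafe))  # 1 row back: "logged in" without any password

# SAFE: with ? placeholders, the quote and -- have no special meaning,
# so the login correctly fails.
safe = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()
print(len(safe))  # 0 rows back
```

This is the whole lesson in miniature: the placeholder version is no harder to write, and it is immune to Little Bobby Tables.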
02:25:54.410 --> 02:25:56.240 Unfortunately, it's dead silence in here, 02:25:56.240 --> 02:25:58.340 so I can't tell if anyone is actually laughing at this joke. 02:25:58.340 --> 02:26:00.110 But anyhow, this is a very well-known meme. 02:26:00.110 --> 02:26:02.690 So if you're a computer scientist who knows SQL, you know this one. 02:26:02.690 --> 02:26:05.565 And there's one last problem we'd like to introduce if you don't mind 02:26:05.565 --> 02:26:07.250 just a couple of final moments here. 02:26:07.250 --> 02:26:09.530 And that is a fundamental problem in computing 02:26:09.530 --> 02:26:11.690 called race conditions, which for the first time 02:26:11.690 --> 02:26:14.300 is now manifest in our discussion of SQL. 02:26:14.300 --> 02:26:18.230 It turns out that SQL and SQL databases are very often used, again, 02:26:18.230 --> 02:26:21.380 in the real world for very high-performing applications. 02:26:21.380 --> 02:26:24.320 And by that, I mean, again, the Googles, the Facebooks, the Twitters 02:26:24.320 --> 02:26:28.490 of the world where lots and lots of data is coming into servers all at once. 02:26:28.490 --> 02:26:30.200 And case in point, some of you might have 02:26:30.200 --> 02:26:33.320 clicked Like on this egg some time ago. 02:26:33.320 --> 02:26:35.690 This is the most-liked Instagram post ever. 02:26:35.690 --> 02:26:39.710 As of last night, it was up to 50-plus million likes. 02:26:39.710 --> 02:26:42.620 Well eclipsed Kim Kardashian's previous post, 02:26:42.620 --> 02:26:44.690 which is still at 18 million or so. 02:26:44.690 --> 02:26:47.780 This is to say this is a hard problem to solve, 02:26:47.780 --> 02:26:51.800 this notion of likes coming in at such an incredible rate. 02:26:51.800 --> 02:26:55.310 Because suppose that, long story short, Instagram actually 02:26:55.310 --> 02:26:57.290 has a server with a SQL database. 
02:26:57.290 --> 02:27:01.490 And they have code in Python or C++ or whatever language that's talking 02:27:01.490 --> 02:27:02.660 to that database. 02:27:02.660 --> 02:27:04.910 And suppose that they have code that's trying 02:27:04.910 --> 02:27:06.680 to increment the total number of likes. 02:27:06.680 --> 02:27:08.240 Well, how might this work logically? 02:27:08.240 --> 02:27:11.660 Well, in order to increment the number of likes that a picture like this egg 02:27:11.660 --> 02:27:14.060 has, you might first select from the database 02:27:14.060 --> 02:27:18.260 the current number of likes for the ID of that egg photograph. 02:27:18.260 --> 02:27:19.790 Then you might add 1 to it. 02:27:19.790 --> 02:27:21.797 Then you might update the database. 02:27:21.797 --> 02:27:24.630 And I didn't use it before, but just like there's INSERT and DELETE, 02:27:24.630 --> 02:27:26.010 there's UPDATE, as well. 02:27:26.010 --> 02:27:29.600 So you might update the database with the new count plus 1. 02:27:29.600 --> 02:27:31.970 So the code for that might look a little something 02:27:31.970 --> 02:27:35.600 like this, three lines of code using CS50's library here, 02:27:35.600 --> 02:27:40.010 where you execute SELECT likes FROM posts WHERE id = question?, 02:27:40.010 --> 02:27:42.890 where id is the unique identifier for that egg. 02:27:42.890 --> 02:27:45.740 And then I'm storing the result in a rows variable, 02:27:45.740 --> 02:27:48.950 which, again, I claim is a list of rows. 02:27:48.950 --> 02:27:52.130 I'm going to go into the first row, so that's rows bracket 0. 02:27:52.130 --> 02:27:55.070 And I'm going to go into the likes column to get the actual number. 02:27:55.070 --> 02:27:57.140 And that number, I'm going to store in a variable called likes. 02:27:57.140 --> 02:27:58.880 So this is going to be, like, 50,000,000, 02:27:58.880 --> 02:28:01.100 and I want it to go to 50,000,001. 02:28:01.100 --> 02:28:02.370 So how do I do that? 
02:28:02.370 --> 02:28:08.780 Well, I execute on the database UPDATE posts SET likes = ?. 02:28:08.780 --> 02:28:10.980 And then I just plug in likes + 1. 02:28:10.980 --> 02:28:15.020 The problem, though, with the Instagrams and Googles and Twitters of the world 02:28:15.020 --> 02:28:16.790 is that they don't just have one server. 02:28:16.790 --> 02:28:18.710 They have many thousands of servers. 02:28:18.710 --> 02:28:22.580 And all of those servers might in parallel be receiving clicks from you 02:28:22.580 --> 02:28:23.960 and I on the internet. 02:28:23.960 --> 02:28:28.310 And those clicks translate into this code getting executed, executed, 02:28:28.310 --> 02:28:28.970 executed. 02:28:28.970 --> 02:28:32.930 And the problem is that when you have three lines of code and suppose Brian 02:28:32.930 --> 02:28:35.420 and I click on that egg at roughly the same time, 02:28:35.420 --> 02:28:40.010 my three lines might not get executed before his three lines or vice versa. 02:28:40.010 --> 02:28:42.650 They might get commingled chronologically. 02:28:42.650 --> 02:28:46.130 My first line might get executed, then Brian's first line might get executed. 02:28:46.130 --> 02:28:48.750 My second line might get executed, Brian's second line. 02:28:48.750 --> 02:28:50.960 So they might get interspersed on different servers 02:28:50.960 --> 02:28:53.900 or just temporally in time, chronologically. 02:28:53.900 --> 02:28:56.690 That's problematic, because suppose Brian and I click 02:28:56.690 --> 02:28:58.580 on that egg roughly at the same time. 02:28:58.580 --> 02:29:01.010 And we get back the same answer to the SELECT query. 02:29:01.010 --> 02:29:03.290 50 million is the current count. 02:29:03.290 --> 02:29:06.620 Then our next lines of code execute on the servers we happen to be on, 02:29:06.620 --> 02:29:09.260 which adds 1 to the likes. 
02:29:09.260 --> 02:29:14.780 The server might accidentally end up updating the row for the egg 02:29:14.780 --> 02:29:20.960 with 50,000,001 both times, because the fundamental problem is 02:29:20.960 --> 02:29:24.890 if my code executes while Brian's code executes, 02:29:24.890 --> 02:29:29.480 we are both checking the value of a variable at essentially the same time. 02:29:29.480 --> 02:29:32.090 And we are both then making a conclusion-- 02:29:32.090 --> 02:29:35.190 oh, the current likes are 50 million. 02:29:35.190 --> 02:29:36.470 We are then making a decision. 02:29:36.470 --> 02:29:38.310 Let's add 1 to 50 million. 02:29:38.310 --> 02:29:41.600 We are then updating the value with 50,000,001. 02:29:41.600 --> 02:29:46.640 The problem is, though, that, really, if Brian's code or the server he happens 02:29:46.640 --> 02:29:50.780 to be connected to on Instagram happens to have selected the number of likes 02:29:50.780 --> 02:29:53.900 first, he should be allowed to finish the code that's 02:29:53.900 --> 02:29:57.950 being executed so that when I select it, I see 50,000,001, 02:29:57.950 --> 02:30:02.270 and I add 1 to that so the new count is 50,000,002. 02:30:02.270 --> 02:30:04.070 This is what's known as a race condition. 02:30:04.070 --> 02:30:06.980 When you write code in a multiserver-- 02:30:06.980 --> 02:30:11.120 more fancily known as a multithreaded environment-- lines of code 02:30:11.120 --> 02:30:16.160 chronologically can get commingled on different servers at any given time. 02:30:16.160 --> 02:30:18.200 The problem fundamentally derives from the fact 02:30:18.200 --> 02:30:22.430 that if Brian's server is in the middle of checking the state of a variable, 02:30:22.430 --> 02:30:23.840 I should be locked out. 02:30:23.840 --> 02:30:26.870 I should not be allowed to click on that button at the same time, 02:30:26.870 --> 02:30:30.590 or my code should not be allowed to execute logically. 
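The lost update described above can be simulated in a single script by interleaving the two servers' steps by hand; the posts table and its like count are made up, mirroring the lecture's slide, and no real concurrency is needed to show the bug, only the bad ordering.

```python
import sqlite3

# Hypothetical posts table holding the egg's like count.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 50000000)")

# Step 1 on both servers: each SELECTs the current count,
# so both see the same stale value of 50,000,000.
mine = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]
brians = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]

# Steps 2 and 3 on both servers: each adds 1 to its stale copy
# and writes the result back.
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (mine + 1, 1))
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (brians + 1, 1))

# Two clicks, but only one like recorded: 50,000,001, not 50,000,002.
print(db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0])
```

The second UPDATE simply overwrites the first with the same number, which is exactly the race condition at issue.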
02:30:30.590 --> 02:30:33.050 So there is a solution when you have to write code 02:30:33.050 --> 02:30:36.500 like this, as is common for Twitter and Instagram and Facebook and the like, 02:30:36.500 --> 02:30:38.420 to use what are called transactions. 02:30:38.420 --> 02:30:41.815 Transactions add a few new pieces of syntax that we won't dwell on today 02:30:41.815 --> 02:30:43.690 and you don't need to use in the coming days. 02:30:43.690 --> 02:30:46.180 But they do solve a fundamentally hard problem. 02:30:46.180 --> 02:30:50.500 Transactions essentially allow you to lock a table or, really, 02:30:50.500 --> 02:30:54.885 a row in the table so that if Brian's click on that egg 02:30:54.885 --> 02:30:57.760 results in some code executing that's in the process of checking what 02:30:57.760 --> 02:31:02.770 is the total like count, my click on the egg will not get handled by the server 02:31:02.770 --> 02:31:05.630 until his code is done executing. 02:31:05.630 --> 02:31:08.470 So in green here, I've proposed the way you should do this. 02:31:08.470 --> 02:31:12.968 You shouldn't just execute the middle three lines, "you" being Instagram, 02:31:12.968 --> 02:31:13.510 in this case. 02:31:13.510 --> 02:31:17.200 Instagram should execute BEGIN TRANSACTION first, then 02:31:17.200 --> 02:31:19.300 COMMIT the transaction at the end. 02:31:19.300 --> 02:31:22.780 And the design of transactions is that all of the lines in between 02:31:22.780 --> 02:31:26.320 will either succeed altogether or fail altogether. 02:31:26.320 --> 02:31:28.180 The database won't get into this funky state 02:31:28.180 --> 02:31:32.320 where we start losing track of likes on eggs. 02:31:32.320 --> 02:31:34.660 And though this has not been an issue in recent years, 02:31:34.660 --> 02:31:36.952 back in the day when Twitter was first getting started, 02:31:36.952 --> 02:31:40.232 Twitter was super popular and super offline a lot of the time.
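A minimal sketch of the BEGIN TRANSACTION and COMMIT pattern described above, again with a made-up posts table in sqlite3; passing isolation_level=None turns off the Python module's implicit transaction handling so we can issue BEGIN and COMMIT ourselves, as on the lecture's slide.

```python
import sqlite3

# isolation_level=None: we manage transactions explicitly.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 50000000)")

def add_like(db, post_id):
    # Everything between BEGIN and COMMIT succeeds or fails as one unit,
    # and another client's conflicting writes wait until we're done.
    db.execute("BEGIN TRANSACTION")
    likes = db.execute(
        "SELECT likes FROM posts WHERE id = ?", (post_id,)
    ).fetchone()[0]
    db.execute("UPDATE posts SET likes = ? WHERE id = ?", (likes + 1, post_id))
    db.execute("COMMIT")

add_like(db, 1)  # my click
add_like(db, 1)  # Brian's click
print(db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0])  # 50000002
```

In this particular case SQL offers an even simpler fix, the single atomic statement `UPDATE posts SET likes = likes + 1 WHERE id = ?`, but the transaction pattern generalizes to any multi-statement read-modify-write.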
02:31:40.232 --> 02:31:42.190 There was this thing called a Fail Whale, which 02:31:42.190 --> 02:31:44.037 is the picture they showed on their website 02:31:44.037 --> 02:31:46.120 when they were getting too much traffic to handle. 02:31:46.120 --> 02:31:49.540 That was because when people are liking and tweeting and retweeting things, 02:31:49.540 --> 02:31:51.520 it's a huge amount of data coming in. 02:31:51.520 --> 02:31:54.500 And it turns out it's very hard to solve these problems. 02:31:54.500 --> 02:31:58.450 But locking the database table or the rows with these transactions 02:31:58.450 --> 02:32:00.490 is one way fundamentally to solve this. 02:32:00.490 --> 02:32:03.160 And in our final extra time today, we thought 02:32:03.160 --> 02:32:05.080 we would play this out in the same example 02:32:05.080 --> 02:32:07.510 that I was taught transactions in some years ago. 02:32:07.510 --> 02:32:10.750 Suppose that the scenario at hand is that you and your roommates 02:32:10.750 --> 02:32:12.370 have a nice dorm fridge. 02:32:12.370 --> 02:32:15.100 And you're all in the habit of drinking lots of milk, 02:32:15.100 --> 02:32:17.050 and you want to be able to drink some milk. 02:32:17.050 --> 02:32:19.420 But you go to the fridge, like I'm about to here. 02:32:19.420 --> 02:32:22.210 And you realize, uh-oh, we're out of milk. 02:32:22.210 --> 02:32:25.570 And so now I am inspecting the state of this refrigerator, which 02:32:25.570 --> 02:32:27.970 is quite old but also quite empty. 02:32:27.970 --> 02:32:30.160 And the state of this variable, being empty, 02:32:30.160 --> 02:32:33.620 tells me that I should go to CVS and buy some more milk. 02:32:33.620 --> 02:32:35.080 So what do I then do? 02:32:35.080 --> 02:32:37.150 I'm presumably going to close the fridge, 02:32:37.150 --> 02:32:40.600 and I'm going to go and leave and go head to CVS. 
02:32:40.600 --> 02:32:43.510 Unfortunately, the same problem arises that we'll act out here 02:32:43.510 --> 02:32:46.150 in our final 60 or so seconds together, whereby 02:32:46.150 --> 02:32:49.660 if Brian now, my roommate in this story, also wants some milk, 02:32:49.660 --> 02:32:52.060 he comes by when I'm already headed to the store, 02:32:52.060 --> 02:32:55.310 inspects the state of the fridge, and realizes, oh, we're out of milk. 02:32:55.310 --> 02:32:57.650 So he nicely will go restock, as well. 02:32:57.650 --> 02:32:59.620 So let's see how this plays out, and we'll 02:32:59.620 --> 02:33:03.590 see if there isn't a similar, analogous solution. 02:33:03.590 --> 02:33:05.620 So I've checked the state of the variable. 02:33:05.620 --> 02:33:06.920 We're indeed out of milk. 02:33:06.920 --> 02:33:08.030 I'll be right back. 02:33:08.030 --> 02:33:09.085 Just going to go to CVS. 02:33:26.336 --> 02:33:29.829 [MUSIC PLAYING] 02:34:44.240 --> 02:34:45.020 All right. 02:34:45.020 --> 02:34:46.550 I am now back from the store. 02:34:46.550 --> 02:34:47.870 I've picked up some milk. 02:34:47.870 --> 02:34:50.090 Going to go ahead and put it into the fridge and-- 02:34:50.090 --> 02:34:51.710 oh, how did this happen? 02:34:51.710 --> 02:34:53.570 Now there's multiple jugs of milk. 02:34:53.570 --> 02:34:55.490 And of course, milk does not last that long. 02:34:55.490 --> 02:34:57.282 And Brian and I don't drink that much milk. 02:34:57.282 --> 02:34:58.970 So this is a really serious problem. 02:34:58.970 --> 02:35:03.150 We've sort of tried to update the value of this variable at the same time. 02:35:03.150 --> 02:35:05.030 So how do we go about fixing this? 02:35:05.030 --> 02:35:07.320 What's the actual solution here? 02:35:07.320 --> 02:35:09.920 Well, I dare say that we can draw some inspiration 02:35:09.920 --> 02:35:13.787 from the world of transactions and the world of databases. 
02:35:13.787 --> 02:35:15.620 And perhaps create a visual for here that we 02:35:15.620 --> 02:35:18.270 hope you never forget if you take nothing away from today. 02:35:18.270 --> 02:35:21.327 Let's go ahead and act this out one last time where, this time, 02:35:21.327 --> 02:35:22.910 I'm going to be a little more extreme. 02:35:22.910 --> 02:35:24.290 I go ahead and open the fridge. 02:35:24.290 --> 02:35:25.940 I realize, oh, we're out of milk. 02:35:25.940 --> 02:35:27.380 I'm going to go to the store. 02:35:27.380 --> 02:35:29.450 I do not want to allow for this situation 02:35:29.450 --> 02:35:32.490 where Brian accidentally checks the fridge, as well. 02:35:32.490 --> 02:35:37.670 So I am going to lock the refrigerator instead. 02:35:37.670 --> 02:35:41.390 Let me go ahead and drape this through here. 02:35:43.940 --> 02:35:49.050 A little extreme, but I think so long as he can't get into the fridge, 02:35:49.050 --> 02:35:52.790 this shouldn't be a problem. 02:35:52.790 --> 02:35:56.060 Let me go ahead now and just attach the lock here. 02:35:56.060 --> 02:35:57.170 Almost got it. 02:35:57.170 --> 02:35:58.340 Come on. 02:35:58.340 --> 02:35:59.570 All right. 02:35:59.570 --> 02:36:01.993 Now the fridge is locked. 02:36:01.993 --> 02:36:03.410 Now I'm going to go get some milk. 02:36:17.210 --> 02:36:18.210 BRIAN YU: [SIGHS] 02:36:18.710 --> 02:36:22.060 [MUSIC PLAYING]