WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.994 [MUSIC PLAYING] 00:01:22.792 --> 00:01:23.750 DAVID MALAN: All right. 00:01:23.750 --> 00:01:28.230 So this is CS50, and this day we continue our look at Python, 00:01:28.230 --> 00:01:30.330 but also introduce another language. 00:01:30.330 --> 00:01:32.700 This one is called SQL, or Structured Query Language, 00:01:32.700 --> 00:01:35.492 and it's going to allow us to solve some different problems as well 00:01:35.492 --> 00:01:37.260 as some old problems better. 00:01:37.260 --> 00:01:40.330 But to do this, let's try to motivate some of the problems initially. 00:01:40.330 --> 00:01:42.390 So this, of course, is just an example of a spreadsheet. 00:01:42.390 --> 00:01:43.920 Odds are you use these pretty frequently, 00:01:43.920 --> 00:01:46.253 whether it's in the form of Google Spreadsheets or Excel 00:01:46.253 --> 00:01:47.700 or Apple Numbers and the like. 00:01:47.700 --> 00:01:50.700 And this is actually the simplest form of what we might call a database, 00:01:50.700 --> 00:01:54.000 just some kind of application that stores lots of data. 00:01:54.000 --> 00:01:56.910 And odds are if you've ever taken registrations for a student group 00:01:56.910 --> 00:01:59.580 or had people sign up or fill out a form or a survey, 00:01:59.580 --> 00:02:03.270 odds are you've used Google Forms to get your data into a Google Spreadsheet. 00:02:03.270 --> 00:02:05.370 And so we thought we would begin there today, 00:02:05.370 --> 00:02:08.370 a representative problem whereby you might want to collect a lot of data 00:02:08.370 --> 00:02:10.660 from users and then do something with it, 00:02:10.660 --> 00:02:14.670 but ideally more powerfully than you could do as a human on your own. 00:02:14.670 --> 00:02:18.600 So we've put into place a Google Form that quite simply 00:02:18.600 --> 00:02:20.040 asks a couple of questions today. 
00:02:20.040 --> 00:02:21.423 What are your favorite TV shows? 00:02:21.423 --> 00:02:24.090 And we ask you for the title of your favorite show and the genre 00:02:24.090 --> 00:02:26.340 or genres into which that show falls. 00:02:26.340 --> 00:02:29.490 So if you wouldn't mind, go ahead and open up this URL here 00:02:29.490 --> 00:02:31.192 on your laptop or phone. 00:02:31.192 --> 00:02:32.650 If you don't have one, that's fine. 00:02:32.650 --> 00:02:35.790 You can use the person's next to you, if I may. 00:02:35.790 --> 00:02:41.970 And take just a moment to go to that URL there and answer those two questions. 00:02:41.970 --> 00:02:46.890 What is the title of your favorite TV show, and into what genre or genres 00:02:46.890 --> 00:02:48.060 does it fall? 00:02:51.310 --> 00:02:52.720 Favorite TV shows. 00:02:52.720 --> 00:02:59.260 And when you go to that URL, you'll find yourself at, again, this form here, 00:02:59.260 --> 00:03:03.230 asking for your title and for your favorite show. 00:03:03.230 --> 00:03:06.040 So by the look of my own screen here we've 00:03:06.040 --> 00:03:09.640 got some 62 responses, 74 responses coming in already, 00:03:09.640 --> 00:03:10.780 so we can track this live. 00:03:10.780 --> 00:03:12.710 So 111 responses so far. 00:03:12.710 --> 00:03:14.830 That's a lot of favorite TV shows. 00:03:14.830 --> 00:03:16.090 This number's-- keep going up. 00:03:16.090 --> 00:03:20.890 169, up to, next, 191. 00:03:20.890 --> 00:03:22.480 Keep them coming. 00:03:22.480 --> 00:03:23.520 One per person is fine. 00:03:23.520 --> 00:03:25.103 It will give us more than enough data. 00:03:25.103 --> 00:03:26.860 246. 00:03:26.860 --> 00:03:29.300 We'll give it another few seconds, but suffice it to say, 00:03:29.300 --> 00:03:31.190 this is kind of a lot of data. 
00:03:31.190 --> 00:03:32.920 And if the next step in this process were 00:03:32.920 --> 00:03:35.830 for me to analyze the data because I'm curious to know 00:03:35.830 --> 00:03:38.050 what's the most popular show on campus or if I wanted 00:03:38.050 --> 00:03:40.330 to do some other analytical work on this, 00:03:40.330 --> 00:03:42.400 it'll be useful to have all of this data, 00:03:42.400 --> 00:03:44.740 not in Google Form's proprietary interface, 00:03:44.740 --> 00:03:48.520 but in Google Spreadsheet form with rows and columns, a format you're probably 00:03:48.520 --> 00:03:49.420 pretty familiar with. 00:03:49.420 --> 00:03:51.670 So we'll let this counter keep going up and up and up, 00:03:51.670 --> 00:03:53.840 and I'm going to go ahead and click on responses. 00:03:53.840 --> 00:03:55.990 And if we really wanted to, we could look through-- 00:03:55.990 --> 00:03:59.950 The Office is up there, followed by Game of Thrones, Friends, OK. 00:03:59.950 --> 00:04:01.500 Perhaps pretty predictable here. 00:04:01.500 --> 00:04:03.250 Let me go ahead and click this icon, which 00:04:03.250 --> 00:04:06.100 is going to open the Google Spreadsheet version of this. 00:04:06.100 --> 00:04:08.590 And you'll see that there's three columns by default. 00:04:08.590 --> 00:04:12.230 [LAUGHTER] 00:04:12.230 --> 00:04:13.440 Oh, that's funny. 00:04:16.386 --> 00:04:18.341 [APPLAUSE] 00:04:18.341 --> 00:04:18.841 Thank you. 00:04:18.841 --> 00:04:21.787 Thank you. 00:04:21.787 --> 00:04:23.616 OK. 00:04:23.616 --> 00:04:24.530 OK. 00:04:24.530 --> 00:04:25.310 Well-played. 00:04:25.310 --> 00:04:28.790 Now-- so you'll see that by default we got three columns. 00:04:28.790 --> 00:04:31.370 Timestamp, Google Forms just does automatically 00:04:31.370 --> 00:04:33.230 to timestamp the submission of the form. 00:04:33.230 --> 00:04:34.910 Title comes from the first question. 00:04:34.910 --> 00:04:36.617 Genres comes from the second question. 
00:04:36.617 --> 00:04:38.450 And if we kept scrolling and scrolling, odds 00:04:38.450 --> 00:04:40.010 are we're seeing more and more rows. 00:04:40.010 --> 00:04:43.040 Let's propose that we have enough of these rows already, 00:04:43.040 --> 00:04:44.960 and how might we go about analyzing the data? 00:04:44.960 --> 00:04:48.080 Well, if you're pretty expert with Google Spreadsheets or Excel 00:04:48.080 --> 00:04:50.570 or Numbers, maybe you could use some functions that 00:04:50.570 --> 00:04:54.197 are built into these programs or you could do some Command-F or Control-F 00:04:54.197 --> 00:04:56.030 and search for the favorite show that you're 00:04:56.030 --> 00:04:57.890 looking for to try to run some numbers. 00:04:57.890 --> 00:04:59.940 But we have programming abilities nowadays. 00:04:59.940 --> 00:05:02.480 And if you haven't used this feature already, 00:05:02.480 --> 00:05:04.710 if you go to Google Spreadsheet's file menu, 00:05:04.710 --> 00:05:06.960 you can go to download, and you can download your data 00:05:06.960 --> 00:05:07.970 into a bunch of formats. 00:05:07.970 --> 00:05:09.762 And the most useful one I'd propose for now 00:05:09.762 --> 00:05:12.770 is going to be this one, Comma-Separated Values or CSV. 00:05:12.770 --> 00:05:14.100 And we've seen this before. 00:05:14.100 --> 00:05:15.800 In fact, you've generated this before. 00:05:15.800 --> 00:05:18.920 So if I go ahead and do that, it pretty quickly downloaded a CSV file 00:05:18.920 --> 00:05:22.700 that, if I really wanted to, I could open in Excel or Apple Numbers or the like. 00:05:22.700 --> 00:05:26.460 But let's do something instead with that, and go ahead and put it into, 00:05:26.460 --> 00:05:30.870 for instance, my IDE so that I can actually do something with it. 00:05:30.870 --> 00:05:33.480 I'm going to grab my downloaded file here. 00:05:33.480 --> 00:05:37.130 And I'm going to go ahead and drag that into the IDE. 
00:05:37.130 --> 00:05:39.680 And so now you'll see a whole bunch of files here. 00:05:39.680 --> 00:05:40.850 Let me put-- whoops. 00:05:40.850 --> 00:05:43.102 Let me put this right in the root of my folder. 00:05:43.102 --> 00:05:44.310 And we'll see this file here. 00:05:44.310 --> 00:05:46.910 And if I double-click on it, you'll see exactly the same data, 00:05:46.910 --> 00:05:48.185 albeit in CSV format. 00:05:48.185 --> 00:05:50.810 So it looks the same, but we have commas separating the values. 00:05:50.810 --> 00:05:53.580 And Google Spreadsheets did that for us automatically. 00:05:53.580 --> 00:05:55.910 So let's now write a program that analyzes this data 00:05:55.910 --> 00:05:59.360 and see if we can't make sense of what some of the most popular values 00:05:59.360 --> 00:06:00.570 are in there. 00:06:00.570 --> 00:06:03.510 Let me go ahead and close this, open a new file, 00:06:03.510 --> 00:06:07.150 and I'm going to go ahead and call it, for instance, favorites.py, 00:06:07.150 --> 00:06:09.770 since the goal is to figure out everyone's favorites here. 00:06:09.770 --> 00:06:12.920 And let me go ahead and use some syntax from last week 00:06:12.920 --> 00:06:15.230 that you might be familiar with now from P set 6. 00:06:15.230 --> 00:06:17.390 Let me go ahead and import CSV, the module, 00:06:17.390 --> 00:06:20.600 so that I have access to that library of functionality. 00:06:20.600 --> 00:06:22.238 Then let me do this with Open. 00:06:22.238 --> 00:06:25.280 And then I'm going to go ahead and open this file, and I could rename it, 00:06:25.280 --> 00:06:28.220 but you know, I can just grab the whole file name this way, too. 00:06:28.220 --> 00:06:32.360 It's a little excessive, but I'll go ahead and paste that in in read mode. 00:06:32.360 --> 00:06:34.530 And that gives me a file. 00:06:34.530 --> 00:06:36.990 Let me go ahead and zoom out just a little bit there. 
00:06:36.990 --> 00:06:40.430 And once I have this file, I can throw it in to a dict reader, 00:06:40.430 --> 00:06:42.620 for instance, to make it easier to read the CSV. 00:06:42.620 --> 00:06:45.080 So reader gets CSV.DictReader. 00:06:45.080 --> 00:06:47.000 Pass in that file. 00:06:47.000 --> 00:06:48.890 And then after this, I have the ability, now, 00:06:48.890 --> 00:06:54.170 to iterate over each row like this, much like you probably did for DNA 00:06:54.170 --> 00:06:56.400 or one or more other problems, as well. 00:06:56.400 --> 00:06:59.390 Now, just as a sanity check, let me just go ahead and print out 00:06:59.390 --> 00:07:00.870 each of the row's titles. 00:07:00.870 --> 00:07:02.630 So I'm not going to see anything new here, 00:07:02.630 --> 00:07:05.720 but this will be a little sanity check that I'm writing correct code. 00:07:05.720 --> 00:07:09.930 So run Python of favorites.py, and voila, fwoom, there's all the data. 00:07:09.930 --> 00:07:13.270 So The Office is both at the beginning and at the end here, it seems. 00:07:13.270 --> 00:07:16.520 All right, so that's all fine and good, a nice stepping stone, but let's start 00:07:16.520 --> 00:07:19.580 to analyze this data, this time using some code. 00:07:19.580 --> 00:07:22.850 So let me go ahead and say this. 00:07:22.850 --> 00:07:25.580 Suppose I wanted to keep track and determine 00:07:25.580 --> 00:07:27.530 what the most popular show is. 00:07:27.530 --> 00:07:30.480 So I've got this big CSV file with title, title, title, title. 00:07:30.480 --> 00:07:32.480 Invariably, there's going to be some duplicates. 00:07:32.480 --> 00:07:35.690 A bunch of you probably like The Office or some other show, 00:07:35.690 --> 00:07:38.180 and so it'd be nice to kind of aggregate that data. 
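The loop just described can be sketched as follows. This is a minimal sketch, not the lecture's exact file: a tiny hand-written favorites.csv stands in for the downloaded export, and its column names (title, genres) are assumed to match the form's questions.

```python
import csv

# Stand-in for the downloaded Google Forms export (file name and rows
# here are assumptions, not the actual lecture data).
with open("favorites.csv", "w") as file:
    file.write("Timestamp,title,genres\n")
    file.write("10/21/2019,The Office,Comedy\n")
    file.write("10/21/2019,Friends,Comedy\n")

# Open the file in read mode, wrap it in a DictReader, and collect each
# row's title, then print them as a sanity check.
with open("favorites.csv", "r") as file:
    reader = csv.DictReader(file)
    titles = [row["title"] for row in reader]

for title in titles:
    print(title)
```

DictReader uses the file's first row as field names, which is why each row can be indexed by the column header rather than by position.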
00:07:38.180 --> 00:07:43.080 What might be a useful data structure in Python whereby for each row 00:07:43.080 --> 00:07:45.740 we can use this data structure to just kind of do 00:07:45.740 --> 00:07:49.490 the equivalent of like 1, 2, 3, and count each 00:07:49.490 --> 00:07:51.720 of the occurrences of those titles? 00:07:51.720 --> 00:07:56.030 Any-- OK, I heard a hash table, but more properly in Python? 00:07:56.030 --> 00:07:56.950 A dictionary. 00:07:56.950 --> 00:07:59.410 So a dictionary is kind of the Swiss army 00:07:59.410 --> 00:08:02.140 knife of data structures for a lot of people in the sense 00:08:02.140 --> 00:08:05.590 that it allows you to store keys and values, 00:08:05.590 --> 00:08:08.900 and those keys can be strings, and those values can be anything you want, 00:08:08.900 --> 00:08:11.763 including, for instance, in this case, numbers, like a counter, 00:08:11.763 --> 00:08:13.930 it's a really versatile data structure when you just 00:08:13.930 --> 00:08:17.000 want to throw some data into a data structure and do something with it. 00:08:17.000 --> 00:08:20.110 So let me go ahead and give myself a dictionary at the top of the file. 00:08:20.110 --> 00:08:23.110 I can do this in a couple of ways, and if my goal is to store the counts 00:08:23.110 --> 00:08:25.420 of each of these titles, I could call it counts, 00:08:25.420 --> 00:08:30.130 and I could set it equal to dict(), which is just a function that hands me 00:08:30.130 --> 00:08:31.450 an empty dictionary. 00:08:31.450 --> 00:08:33.730 Or, recall, I can use this syntax, which is 00:08:33.730 --> 00:08:35.919 equivalent to just shorthand notation. 00:08:35.919 --> 00:08:40.210 But what I want to do now inside of my loop, every time I 00:08:40.210 --> 00:08:44.210 iterate through this file, I want to check if my-- 00:08:44.210 --> 00:08:47.630 I've seen this title before, I want to add 1 to my counter. 
00:08:47.630 --> 00:08:52.808 But if I've never seen this title before, what do I want to do instead? 00:08:52.808 --> 00:08:53.308 Yeah? 00:08:53.308 --> 00:08:55.117 AUDIENCE: [INAUDIBLE] 00:08:55.117 --> 00:08:56.950 DAVID MALAN: Add the title to the dictionary 00:08:56.950 --> 00:09:00.110 as a key with the value of what, the first time I see it? 00:09:00.110 --> 00:09:01.970 Well, the first time I see it. 00:09:01.970 --> 00:09:02.470 AUDIENCE: 1. 00:09:02.470 --> 00:09:03.940 DAVID MALAN: So we can just initialize it to 1. 00:09:03.940 --> 00:09:05.290 So there's two scenarios here. 00:09:05.290 --> 00:09:08.500 Either we've seen this title before and we want to do plus 1, 00:09:08.500 --> 00:09:11.830 or we've not seen the title before and we just want to set it equal to 1. 00:09:11.830 --> 00:09:13.630 So how might we express that logic? 00:09:13.630 --> 00:09:16.190 Well, let me go ahead and say, first of all, 00:09:16.190 --> 00:09:18.280 let's get the title into a variable of its own, 00:09:18.280 --> 00:09:20.740 just to shorten our syntax a little bit. 00:09:20.740 --> 00:09:25.040 And then I can say something like this, if title in counts-- 00:09:25.040 --> 00:09:26.300 well, what does that mean? 00:09:26.300 --> 00:09:29.140 That means that inside of my counts dictionary, 00:09:29.140 --> 00:09:34.300 I can index into it at that location, and I can increment whatever is there 00:09:34.300 --> 00:09:38.830 by 1, assuming on some previous iteration, 00:09:38.830 --> 00:09:44.738 else I went ahead and did counts of title equals 1. 00:09:44.738 --> 00:09:46.030 So these are the two scenarios. 00:09:46.030 --> 00:09:48.940 If I've seen this title before, go ahead and add 1 00:09:48.940 --> 00:09:50.560 to its value in the dictionary. 00:09:50.560 --> 00:09:53.960 Otherwise, just default to I have seen it one time. 
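The two scenarios just described can be sketched like this; the list of titles is a stand-in for the rows read from the CSV (in the actual program, title would come from row["title"]):

```python
# Count occurrences of each title in a dictionary: add 1 if we've seen
# the title before, otherwise initialize its count to 1.
counts = {}
for title in ["The Office", "Friends", "The Office", "The Office"]:
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
```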
00:09:53.960 --> 00:09:56.710 So some of the shows are going to stay stuck at one if they're not 00:09:56.710 --> 00:09:58.502 very popular among the audience, or they're 00:09:58.502 --> 00:10:00.703 going to keep getting plus 1, plus 1, plus 1. 00:10:00.703 --> 00:10:02.620 So what I often do when writing code like this 00:10:02.620 --> 00:10:05.450 is, as I'm sort of finding my way, after I'm done with this, 00:10:05.450 --> 00:10:07.075 I'm not going to write the whole thing. 00:10:07.075 --> 00:10:10.070 Let me go ahead and just, at the end of this, print counts, right? 00:10:10.070 --> 00:10:11.570 I'm just curious to see what it is. 00:10:11.570 --> 00:10:13.570 I'm just going to go ahead and add print counts. 00:10:13.570 --> 00:10:16.060 Let me go ahead and run now Python of favorites. 00:10:16.060 --> 00:10:19.690 And voila, you see a huge, messy data structure. 00:10:19.690 --> 00:10:21.520 But let's go to the very top of it. 00:10:21.520 --> 00:10:25.210 Notice that it starts with curly braces, indicating it's, indeed, a dictionary. 00:10:25.210 --> 00:10:28.480 It starts with a quoted value, which is one of the titles in it, 00:10:28.480 --> 00:10:31.450 and it looks like two of you said Dynasty is your favorite show. 00:10:31.450 --> 00:10:33.790 25 of you said The Office is your favorite show. 00:10:33.790 --> 00:10:36.340 One of you said Blindspot is your favorite show. 00:10:36.340 --> 00:10:37.210 One of you, 24. 00:10:37.210 --> 00:10:39.060 Friends, 19, and so forth. 00:10:39.060 --> 00:10:40.810 So the data is kind of all over the place, 00:10:40.810 --> 00:10:43.570 because it's based on the order in which you all input it. 00:10:43.570 --> 00:10:45.487 So let's just clean this up a little bit. 00:10:45.487 --> 00:10:47.570 And let me go ahead and print this out as follows. 00:10:47.570 --> 00:10:49.660 Instead of just printing a big, messy dictionary, 00:10:49.660 --> 00:10:52.990 why don't I go ahead and iterate over the dictionary? 
00:10:52.990 --> 00:10:55.220 And we didn't see this necessarily last week, 00:10:55.220 --> 00:10:57.053 but you can actually do something like this. 00:10:57.053 --> 00:11:01.480 If your dictionary's storing titles as keys and counts as values, 00:11:01.480 --> 00:11:06.640 we can do something fancy like for title comma count in counts, 00:11:06.640 --> 00:11:10.090 but if you want to iterate over not just the keys but all of the items therein, 00:11:10.090 --> 00:11:12.250 there's this other function in Python associated 00:11:12.250 --> 00:11:15.370 with a dictionary that will hand you back two things at once, 00:11:15.370 --> 00:11:16.763 key and value together. 00:11:16.763 --> 00:11:18.430 And you can call them anything you want. 00:11:18.430 --> 00:11:21.470 I'm calling them title comma count, respectively. 00:11:21.470 --> 00:11:25.300 Now, let me go ahead and print out, for instance, something like the title, 00:11:25.300 --> 00:11:28.340 and then let me go ahead and just print out the count right next to it. 00:11:28.340 --> 00:11:32.540 So now, if I save this and run favorites.py, OK, it's prettier, 00:11:32.540 --> 00:11:35.530 but it's still a little messy in that we have-- 00:11:35.530 --> 00:11:38.300 the numbers are just right there after the name. 00:11:38.300 --> 00:11:39.890 So let's go ahead and tidy this up. 00:11:39.890 --> 00:11:41.890 Recall that when you print two things with print 00:11:41.890 --> 00:11:44.170 separating them with a comma, what character 00:11:44.170 --> 00:11:47.380 did you get automatically in between them? 00:11:47.380 --> 00:11:48.560 Yeah, just a single space. 00:11:48.560 --> 00:11:50.630 We saw that at the very beginning of last week. 00:11:50.630 --> 00:11:53.740 But we can override that separator, otherwise known as sep, 00:11:53.740 --> 00:11:56.950 which has a default value of quote-unquote with a space. 
00:11:56.950 --> 00:12:00.293 But suppose I just want to give some visual delineation here, 00:12:00.293 --> 00:12:03.460 maybe like a vertical bar to kind of sort of make it look like a spreadsheet 00:12:03.460 --> 00:12:04.000 again. 00:12:04.000 --> 00:12:05.590 Let me go ahead and run this. 00:12:05.590 --> 00:12:07.600 And now it's not super, super pretty, but you 00:12:07.600 --> 00:12:12.700 can see title followed by a vertical bar followed by the number 1. 00:12:12.700 --> 00:12:16.243 So this is all fine and good, but it'd be nice, for instance, 00:12:16.243 --> 00:12:18.160 if we could sort these, because I don't really 00:12:18.160 --> 00:12:22.000 want to sift through this visually, looking for all of the possible values. 00:12:22.000 --> 00:12:25.660 So you know, the easiest way to sort these things might be like this. 00:12:25.660 --> 00:12:28.960 Instead of passing in counts of items, there's 00:12:28.960 --> 00:12:31.930 another function in Python where you can literally just sort something 00:12:31.930 --> 00:12:32.590 with sorted. 00:12:32.590 --> 00:12:34.850 You don't have to worry about bubble sort, selection sort, insertion sort, 00:12:34.850 --> 00:12:36.160 merge sort, anything. 00:12:36.160 --> 00:12:37.867 You just sort it, and Python 00:12:37.867 --> 00:12:39.700 will use one of the fastest implementations 00:12:39.700 --> 00:12:41.720 of sorting available to it. 00:12:41.720 --> 00:12:44.020 Now, if I rerun this, we'll see that-- 00:12:44.020 --> 00:12:45.220 [CHUCKLES] OK. 00:12:45.220 --> 00:12:47.350 We'll see that we have all of these values 00:12:47.350 --> 00:12:51.170 here starting with 24, 9009, Adventure Time. 00:12:51.170 --> 00:12:53.420 But that's not what I meant. 00:12:53.420 --> 00:12:57.160 I wanted to sort not by title, but by what? 00:12:57.160 --> 00:12:58.250 Count. 00:12:58.250 --> 00:13:00.550 So this gets a little less obvious. 
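A short sketch of that items() loop and the sep override, with a couple of stand-in counts:

```python
counts = {"The Office": 25, "Friends": 19}  # stand-in counts

# items() hands back each key-value pair together; print() separates its
# arguments with a single space unless sep is overridden, here with a bar.
for title, count in counts.items():
    print(title, count, sep=" | ")

# The same formatting, collected as strings for inspection.
lines = [f"{title} | {count}" for title, count in counts.items()]
```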
00:13:00.550 --> 00:13:04.390 And it turns out Python does this in a couple of possible ways. 00:13:04.390 --> 00:13:06.640 But let me go ahead and introduce this one first. 00:13:06.640 --> 00:13:11.170 It turns out that this sorted function takes an optional argument that 00:13:11.170 --> 00:13:12.910 happens to be called key. 00:13:12.910 --> 00:13:18.280 And key simply specifies, how do you want to sort 00:13:18.280 --> 00:13:20.620 the dictionaries that you're being-- 00:13:20.620 --> 00:13:23.680 or how do you want to sort the key value pairs that are being passed? 00:13:23.680 --> 00:13:26.540 By default, it literally uses the key, so title. 00:13:26.540 --> 00:13:29.960 But if you want to use the value, you know what you can do, is this. 00:13:29.960 --> 00:13:33.420 Let me go ahead, and up here, even though this is a little weird, 00:13:33.420 --> 00:13:35.920 and let me go ahead and say the following. 00:13:35.920 --> 00:13:37.960 Def function item. 00:13:37.960 --> 00:13:38.617 So def f. 00:13:38.617 --> 00:13:40.450 I'm just giving myself a pretty generic name 00:13:40.450 --> 00:13:43.075 for a function, because I'm going to get rid of it in a moment. 00:13:43.075 --> 00:13:46.780 I'm going to go ahead and return item bracket 1. 00:13:46.780 --> 00:13:48.080 Now, why is this? 00:13:48.080 --> 00:13:49.780 Take a look at this line here. 00:13:49.780 --> 00:13:53.500 And even though sorted is new today, as is this counts.items(), 00:13:53.500 --> 00:13:58.490 all items does is return key comma value, and sorted sorts then by key, 00:13:58.490 --> 00:14:02.330 by default, handing them back as title comma count, or again, 00:14:02.330 --> 00:14:03.930 I could call these anything I want. 00:14:03.930 --> 00:14:07.700 But I'm calling them, more semantically usefully, title comma counts. 00:14:07.700 --> 00:14:09.890 So that's giving me back two things. 
00:14:09.890 --> 00:14:13.370 Well, just like an array in C or a list in Python, 00:14:13.370 --> 00:14:16.850 if you're getting two things at once, the first of those things, 00:14:16.850 --> 00:14:19.790 conventionally, is going to be accessible in bracket zero. 00:14:19.790 --> 00:14:22.190 The second thing is going to be accessible in bracket 1. 00:14:22.190 --> 00:14:27.350 So this super, super simple function called f takes, as input, an item. 00:14:27.350 --> 00:14:30.050 And if that item has two values, a key and a value, 00:14:30.050 --> 00:14:32.270 return item one is just literally going to return 00:14:32.270 --> 00:14:33.710 the second thing, whatever it is. 00:14:33.710 --> 00:14:36.530 In this case, it's going to return, of course, the value. 00:14:36.530 --> 00:14:38.160 Now, why is this useful? 00:14:38.160 --> 00:14:43.260 Well, because the sorted function takes an additional argument called key, 00:14:43.260 --> 00:14:46.040 you can actually pass it the name of a function 00:14:46.040 --> 00:14:51.620 to use in order to determine how to sort the previous items. 00:14:51.620 --> 00:14:53.750 So this is an example of a really powerful feature 00:14:53.750 --> 00:14:57.860 of Python and other higher-level languages, in which case 00:14:57.860 --> 00:15:01.670 you can pass in, as arguments, the names of functions. 00:15:01.670 --> 00:15:02.720 I'm not calling f. 00:15:02.720 --> 00:15:04.340 I'm just passing in its name. 00:15:04.340 --> 00:15:07.520 If I were calling it, I'd have open paren and a closed paren, 00:15:07.520 --> 00:15:09.620 but I'm just passing it in by name. 00:15:09.620 --> 00:15:13.050 And now, this is going to, by default, sort things from small to large. 00:15:13.050 --> 00:15:15.050 Turns out if you read the documentation, there's 00:15:15.050 --> 00:15:19.430 another argument to sorted called reverse that will invert it so that now 00:15:19.430 --> 00:15:21.780 you'll get largest to smallest. 
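Putting those pieces together, a sketch of sorting by value with the named key function just described (the counts themselves are stand-ins):

```python
counts = {"Adventure Time": 1, "The Office": 25, "Friends": 19}  # stand-ins

# Each item is a (key, value) pair, so item[1] is the count; passing f
# by name (no parentheses) tells sorted() to order the items by that
# value, and reverse=True flips the order to largest-to-smallest.
def f(item):
    return item[1]

sorted_items = sorted(counts.items(), key=f, reverse=True)
```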
00:15:21.780 --> 00:15:23.160 So now let's try this. 00:15:23.160 --> 00:15:25.460 Let me go ahead and rerun Python of favorites.py 00:15:25.460 --> 00:15:29.030 enter, and now, if I scroll up to the top, 00:15:29.030 --> 00:15:33.680 we should see The Office is indeed the most popular show, apparently 00:15:33.680 --> 00:15:35.660 consistent with Netflix's data these days, too, 00:15:35.660 --> 00:15:38.702 followed by Friends, Game of Thrones, Breaking Bad, Rick and Morty, Black 00:15:38.702 --> 00:15:39.932 Mirror, and so forth. 00:15:39.932 --> 00:15:42.890 And the least popular show among you, although there looks like there's 00:15:42.890 --> 00:15:44.735 a lot of ties, Demon Slayer. 00:15:44.735 --> 00:15:47.890 [CHUCKLES] Or Game of Thrones, but GOT. 00:15:47.890 --> 00:15:49.880 So interestingly, there are some-- 00:15:49.880 --> 00:15:51.420 [LAUGHTER] 00:15:51.420 --> 00:15:52.320 There's-- OK. 00:15:54.930 --> 00:15:55.520 OK. 00:15:55.520 --> 00:15:57.950 So here, we've gone and sorted these values. 00:15:57.950 --> 00:16:00.680 Now, let me go ahead and clean this up a little bit. 00:16:00.680 --> 00:16:05.510 Much like in C where it's kind of silly to define a variable 00:16:05.510 --> 00:16:08.270 and then use it only in one place, similarly in Python, 00:16:08.270 --> 00:16:11.660 if you're going to define a function, but then only use it in one place, 00:16:11.660 --> 00:16:12.920 you don't actually need to. 00:16:12.920 --> 00:16:17.248 There's a fancy one-line equivalent where you can actually say this. 00:16:17.248 --> 00:16:19.790 And you won't see this too often, but it's a handy thing just 00:16:19.790 --> 00:16:21.170 to have seen at least once. 00:16:21.170 --> 00:16:25.850 You can define a function called a lambda function that takes in as input 00:16:25.850 --> 00:16:30.240 any argument like this, and then you can literally just return that value. 
00:16:30.240 --> 00:16:34.175 So if I go ahead and get rid of this function f and replace it with this, 00:16:34.175 --> 00:16:37.730 the syntax is super weird-looking at first glance, 00:16:37.730 --> 00:16:40.520 but lambda is the term of art in computer 00:16:40.520 --> 00:16:42.037 science for an anonymous function. 00:16:42.037 --> 00:16:43.370 I don't care that it's called f. 00:16:43.370 --> 00:16:44.720 I'm never going to use it again. 00:16:44.720 --> 00:16:47.360 So lambda just means give me a function. 00:16:47.360 --> 00:16:51.080 Item means that is the input to this function, the argument, and the return 00:16:51.080 --> 00:16:54.080 value, or output, is going to be item bracket 1. 00:16:54.080 --> 00:16:55.580 So it's a little arcane. 00:16:55.580 --> 00:16:58.935 Odds are you'll have to double check the documentation for using this yourself 00:16:58.935 --> 00:17:00.560 in the future, but it's the same thing. 00:17:00.560 --> 00:17:02.720 And indeed, if I rerun the program, I'm going 00:17:02.720 --> 00:17:08.030 to get back the exact same results, this time with the items still sorted 00:17:08.030 --> 00:17:09.589 like that. 00:17:09.589 --> 00:17:14.569 All right, any questions on the code here, the features of the syntax? 00:17:14.569 --> 00:17:16.160 Anything at all? 00:17:16.160 --> 00:17:17.261 Yeah. 00:17:17.261 --> 00:17:20.303 AUDIENCE: [INAUDIBLE] 00:17:20.303 --> 00:17:22.970 DAVID MALAN: Yes, if you're writing a lambda function like this, 00:17:22.970 --> 00:17:24.770 you are indeed limited to one line. 00:17:24.770 --> 00:17:27.187 That was OK, because f, even though it was two lines, only 00:17:27.187 --> 00:17:28.550 had one line of actual content. 00:17:28.550 --> 00:17:31.900 But yes, it has to be one line, indeed. 00:17:31.900 --> 00:17:32.400 All right. 00:17:32.400 --> 00:17:36.360 So the GOT thing opens up some interesting messiness, right? 
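The lambda version of the same sort, equivalent to the named function f from the lecture (stand-in counts again):

```python
counts = {"Adventure Time": 1, "The Office": 25, "Friends": 19}  # stand-ins

# An anonymous (lambda) function in place of the named f: it takes one
# argument, item, and returns item[1], the count.
sorted_items = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```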
00:17:36.360 --> 00:17:39.990 Because GOT presumably denoted Game of Thrones, and yet most of you 00:17:39.990 --> 00:17:42.235 wrote Game of Thrones, capitalized properly, 00:17:42.235 --> 00:17:44.610 and yet, it'd be nice if we could have some equivalences. 00:17:44.610 --> 00:17:48.300 And honestly, I'm guessing there's a bunch of capitalization differences, 00:17:48.300 --> 00:17:48.960 too. 00:17:48.960 --> 00:17:50.970 So you know what I might be inclined to do? 00:17:50.970 --> 00:17:54.390 What might be the easiest way for us to standardize 00:17:54.390 --> 00:17:58.710 what we're calling these shows' titles that would be a little more 00:17:58.710 --> 00:18:01.270 resilient against some of you typing with capitals, 00:18:01.270 --> 00:18:03.300 some of you typing with lower case? 00:18:03.300 --> 00:18:06.130 Let's ignore the GOT problem, though. 00:18:06.130 --> 00:18:09.022 Yeah, so make them all lowercase, or heck, make them all upper case. 00:18:09.022 --> 00:18:10.980 It doesn't matter so long as you're consistent. 00:18:10.980 --> 00:18:11.688 So you know what? 00:18:11.688 --> 00:18:13.740 Before I get title from the row, let me go ahead 00:18:13.740 --> 00:18:18.450 and force it all to lower case, and then let me go ahead and-- 00:18:18.450 --> 00:18:24.160 in my-- go ahead and force it to lower case there. 00:18:24.160 --> 00:18:26.070 So no matter what the input is, I'm going 00:18:26.070 --> 00:18:29.790 to go ahead and force it on entry to lower case. 00:18:29.790 --> 00:18:33.490 Now, if I do this and scroll back up, looks like our counts went up a little, 00:18:33.490 --> 00:18:33.990 right? 00:18:33.990 --> 00:18:35.580 The Office is now at 27. 00:18:35.580 --> 00:18:37.080 Game of Thrones is now at 21. 00:18:37.080 --> 00:18:38.290 Friends is at 20. 00:18:38.290 --> 00:18:41.220 So it looks like some of you, kind of, reasonably if sloppily, 00:18:41.220 --> 00:18:44.272 didn't capitalize it exactly as the show officially is called. 
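Forcing titles to lowercase on entry, sketched here with stand-in rows, collapses capitalization variants onto a single dictionary key:

```python
# Normalize each title to lowercase before counting so that "The Office",
# "the office", and "THE OFFICE" all land on the same key.
counts = {}
for title in ["The Office", "the office", "THE OFFICE", "Friends"]:
    title = title.lower()
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
```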
00:18:44.272 --> 00:18:45.480 But that should be OK, right? 00:18:45.480 --> 00:18:47.700 This is all about UX or user experience. 00:18:47.700 --> 00:18:51.930 This should be one stepping stone toward making your program much, much more 00:18:51.930 --> 00:18:54.780 resilient against user input. 00:18:54.780 --> 00:18:55.530 But you know what? 00:18:55.530 --> 00:18:57.240 What if there were a better way to do this, right? 00:18:57.240 --> 00:18:59.490 Instead of writing code, this feels pretty tedious. 00:18:59.490 --> 00:19:01.948 Could we not just use a different tool, different language, 00:19:01.948 --> 00:19:03.160 and achieve these same goals? 00:19:03.160 --> 00:19:05.395 Today's answer is indeed going to be yes, with SQL. 00:19:05.395 --> 00:19:06.270 But first a question. 00:19:06.270 --> 00:19:06.810 AUDIENCE: [INAUDIBLE] 00:19:06.810 --> 00:19:07.160 DAVID MALAN: No? 00:19:07.160 --> 00:19:07.920 No question. 00:19:07.920 --> 00:19:11.520 All right, so then the segue is exactly this. 00:19:11.520 --> 00:19:14.940 I'm going to go ahead now and close favorites.py. 00:19:14.940 --> 00:19:18.010 And I'm just going to maximize the size of my terminal window here. 00:19:18.010 --> 00:19:21.840 And I'm going to go ahead and run a program called SQLite3. 00:19:21.840 --> 00:19:25.590 SQLite3 is a command line program that you can install on your Mac or PC-- 00:19:25.590 --> 00:19:27.240 it's already on CS50 IDE-- 00:19:27.240 --> 00:19:31.470 that allows you to use another language called SQL interactively. 00:19:31.470 --> 00:19:34.170 And it's got this really cool feature whereby 00:19:34.170 --> 00:19:40.230 if I have this file called, very verbosely, CS50 2019 Lecture 7, 00:19:40.230 --> 00:19:43.020 and so forth, I'm going to copy that after highlighting it. 00:19:43.020 --> 00:19:47.730 I'm going to run SQLite3, and I'm going to put SQLite3 into .mode CSV. 
00:19:47.730 --> 00:19:51.180 It's a weird syntax, .mode CSV, but you don't have to use these commands very 00:19:51.180 --> 00:19:51.840 often. 00:19:51.840 --> 00:19:54.840 Then I'm going to use one other command here, which is called import. 00:19:54.840 --> 00:19:58.450 And I'm going to go ahead and import this into a database. 00:19:58.450 --> 00:20:00.175 But what database, exactly? 00:20:00.175 --> 00:20:03.050 Well, let me go ahead and more specifically-- let me exit out of this 00:20:03.050 --> 00:20:03.550 here. 00:20:03.550 --> 00:20:04.170 Whoops. 00:20:04.170 --> 00:20:08.850 Let me exit out of this here and run SQLite3 in the name of a database. 00:20:08.850 --> 00:20:10.830 I'm dealing with favorite shows, so maybe I'll 00:20:10.830 --> 00:20:14.040 call this favorites.db, DB denoting database. 00:20:14.040 --> 00:20:17.730 I'm going to now go ahead and do .mode CSV. 00:20:17.730 --> 00:20:20.550 I'm going to now do .import, quoting the name of the file, 00:20:20.550 --> 00:20:23.550 and quotes are important, because it's got some ugly spaces in it, 00:20:23.550 --> 00:20:26.940 and I'm going to import this into a table, kind of like a spreadsheet, 00:20:26.940 --> 00:20:29.700 that I'm just going to call in all lowercase favorites. 00:20:29.700 --> 00:20:30.750 And hit Enter. 00:20:30.750 --> 00:20:32.410 Nothing seems to happen now. 00:20:32.410 --> 00:20:35.650 But if I go back to my terminal window and type ls, 00:20:35.650 --> 00:20:39.780 notice that in addition to my CSV file and favorites.py and src7, 00:20:39.780 --> 00:20:41.970 which I downloaded in advance with today's code, 00:20:41.970 --> 00:20:44.820 I now have this file called favorites.db. 00:20:44.820 --> 00:20:50.140 And if I rerun SQLite, it turns out when I pass in the name of that database, 00:20:50.140 --> 00:20:52.920 now I can ask the same kinds of questions as before, 00:20:52.920 --> 00:20:55.590 but I can do it with a different syntax. 
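What the .mode csv and .import commands do at the sqlite3 prompt can be approximated with Python's built-in sqlite3 module. This is only an illustrative sketch of the same idea; the lecture itself used the command-line program, and the table and column names below are assumptions.

```python
# Programmatic equivalent of sqlite3's ".mode csv" and ".import":
# read a CSV file and load its rows into a table called "favorites".
import csv
import sqlite3

def import_csv(csv_path, db_path):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS favorites (title TEXT, genres TEXT)")
    with open(csv_path, newline="") as file:
        reader = csv.DictReader(file)
        # Parameterized INSERT, one row per line of the CSV.
        db.executemany(
            "INSERT INTO favorites (title, genres) VALUES (?, ?)",
            ((row["title"], row["genres"]) for row in reader),
        )
    db.commit()
    return db
```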
00:20:55.590 --> 00:20:59.670 I can go ahead and select all the titles from my favorites 00:20:59.670 --> 00:21:04.350 by typing this command, SELECT title FROM favorites; Enter, and voila, 00:21:04.350 --> 00:21:06.010 we get all of the titles there. 00:21:06.010 --> 00:21:08.250 Suppose I want to sort those titles as before, 00:21:08.250 --> 00:21:14.230 or I can SELECT title FROM favorites, ORDER BY title; ending with-- 00:21:14.230 --> 00:21:14.730 oh, sorry. 00:21:14.730 --> 00:21:15.870 Semicolons are back. 00:21:15.870 --> 00:21:18.300 Then I'm going to go ahead and hit here. 00:21:18.300 --> 00:21:20.730 And you'll see that now it's alphabetized, 00:21:20.730 --> 00:21:23.318 and there's all those Offices by first letter. 00:21:23.318 --> 00:21:24.360 Now, there are some bugs. 00:21:24.360 --> 00:21:26.500 If you scroll down, you'll see lowercase letters as well, 00:21:26.500 --> 00:21:28.620 so we'd have to deal with that situation, too. 00:21:28.620 --> 00:21:32.820 But this seems to be a user-friendly way of just selecting data 00:21:32.820 --> 00:21:33.570 that I care about. 00:21:33.570 --> 00:21:34.240 And watch this. 00:21:34.240 --> 00:21:36.010 Let me try to group things together. 00:21:36.010 --> 00:21:38.430 I'm going to go ahead and select title, and I'm 00:21:38.430 --> 00:21:41.880 going to go ahead and say count the number of titles 00:21:41.880 --> 00:21:45.000 from this table called favorites. 00:21:45.000 --> 00:21:48.660 But I want to group those things by title. 00:21:48.660 --> 00:21:51.820 Now, we'll see in a moment what this means, but the effect is what's cool. 00:21:51.820 --> 00:21:56.370 Voila, I now have output much like my Python program. 00:21:56.370 --> 00:22:00.028 It's not really as orderly as I would like, but notice 00:22:00.028 --> 00:22:01.320 what it's done at the very top. 00:22:01.320 --> 00:22:06.330 The very first line of output, if I keep scrolling is this here. 
00:22:06.330 --> 00:22:08.670 On the left is my title, then a vertical bar. 00:22:08.670 --> 00:22:13.120 On the right is literally count of title, how many titles match that. 00:22:13.120 --> 00:22:14.800 So let me go ahead and do this. 00:22:14.800 --> 00:22:18.240 Let me go ahead and first of all limit this to the top 10 titles, 00:22:18.240 --> 00:22:20.910 so I can just deal with this and see all this at once. 00:22:20.910 --> 00:22:23.460 So here we have some alphabetized list of titles. 00:22:23.460 --> 00:22:27.490 But now, suppose I want to go ahead and rename this thing. 00:22:27.490 --> 00:22:28.930 Let me go ahead and say that. 00:22:28.930 --> 00:22:30.510 Let me go ahead and say, as-- 00:22:30.510 --> 00:22:33.450 COUNT AS n. 00:22:33.450 --> 00:22:37.290 Now notice that the title of this output is title, and then n. 00:22:37.290 --> 00:22:41.650 But the reason I did that was so that I can do this. 00:22:41.650 --> 00:22:47.010 I'm going to now do ORDER BY n, descending order, 00:22:47.010 --> 00:22:49.370 and then limit to the top 10. 00:22:49.370 --> 00:22:51.120 Enter, and voila. 00:22:51.120 --> 00:22:53.460 Now, using a completely different language, 00:22:53.460 --> 00:22:56.460 I think I've gotten the same results, except for the lower casing issue, 00:22:56.460 --> 00:22:58.085 which we can come back to another time. 00:22:58.085 --> 00:23:00.210 But The Office is here as the number one hit. 00:23:00.210 --> 00:23:01.350 Friends then, 19. 00:23:01.350 --> 00:23:03.820 Game of Thrones, 18, and so forth. 00:23:03.820 --> 00:23:06.840 So what's then the motivation in part for this new language called SQL? 00:23:06.840 --> 00:23:08.550 And you've seen just snippets of it here. 
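The query built up in these steps, counting each title, aliasing the count as n, ordering descending, and limiting the output, can be run end-to-end like this (a sketch against a tiny in-memory table, just to show the shape of the result):

```python
# SELECT title, COUNT(title) AS n FROM favorites
# GROUP BY title ORDER BY n DESC LIMIT 10, as built up above.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("The Office",), ("The Office",), ("Friends",)])
rows = db.execute(
    "SELECT title, COUNT(title) AS n FROM favorites "
    "GROUP BY title ORDER BY n DESC LIMIT 10"
).fetchall()
# rows is now [('The Office', 2), ('Friends', 1)]
```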
00:23:08.550 --> 00:23:10.850 It allows us to solve the same kind of problems 00:23:10.850 --> 00:23:13.350 much more easily, because I don't have to sit down and write 00:23:13.350 --> 00:23:16.590 a whole darn Python program, faster though it is than 00:23:16.590 --> 00:23:20.933 C. I don't have to write custom code just to answer questions about my data. 00:23:20.933 --> 00:23:23.100 And in fact, if you pursue the web track in a couple 00:23:23.100 --> 00:23:25.710 of weeks' time or the mobile track, both of those 00:23:25.710 --> 00:23:29.130 are going to offer support for SQL so that if you want to store data 00:23:29.130 --> 00:23:32.880 on users or interactive information on an app or a website application, 00:23:32.880 --> 00:23:37.050 SQL is going to be ultimately where you can put that data and access it later, 00:23:37.050 --> 00:23:40.507 thereby writing even less code than today. 00:23:40.507 --> 00:23:43.590 So let's go ahead and take a look at some of the features of these things. 00:23:43.590 --> 00:23:45.507 What we've really just done is introduce this, 00:23:45.507 --> 00:23:47.580 something called a relational database. 00:23:47.580 --> 00:23:50.940 The goal at hand is not to rely on CSV files anymore. 00:23:50.940 --> 00:23:54.420 CSVs are like the weakest form of databases you can use. 00:23:54.420 --> 00:23:54.960 Why? 00:23:54.960 --> 00:23:59.162 Well, every time you want to look in a CSV file for data, you have to open it. 00:23:59.162 --> 00:24:00.120 You have to read it in. 00:24:00.120 --> 00:24:02.590 You have to iterate over it, line by line by line, 00:24:02.590 --> 00:24:05.160 looking for the data you care about, and then you're done. 00:24:05.160 --> 00:24:08.942 That's going to be super slow when you have a lot of data in a CSV file. 
00:24:08.942 --> 00:24:10.650 And in fact, in a little bit today, we'll 00:24:10.650 --> 00:24:14.190 download a really big database in text form 00:24:14.190 --> 00:24:19.230 that we're going to then see 500-plus megabytes later is much better handled 00:24:19.230 --> 00:24:21.420 by what are called relational databases, pieces 00:24:21.420 --> 00:24:24.990 of software you can run on your Mac or PC or servers in a company. 00:24:24.990 --> 00:24:28.950 All they do is store data and provide you with faster access to it. 00:24:28.950 --> 00:24:32.010 So SQLite3 is just a command line program 00:24:32.010 --> 00:24:34.012 via which we can interact with that data. 00:24:34.012 --> 00:24:35.220 It's not terribly compelling. 00:24:35.220 --> 00:24:37.200 It's, again, a black and white blinking prompt, 00:24:37.200 --> 00:24:40.050 but we'll see a graphical user interface, or GUI, in a bit, 00:24:40.050 --> 00:24:43.200 too, that'll allow us to navigate that same data. 00:24:43.200 --> 00:24:45.570 Here's how we went ahead and imported the data. 00:24:45.570 --> 00:24:47.550 And .schema is the last of the commands. 00:24:47.550 --> 00:24:49.870 In fact, let me go ahead and type this at the prompt. 00:24:49.870 --> 00:24:52.440 Not .mode or .import but .schema. 00:24:52.440 --> 00:24:55.860 And you'll see what automatically happened for me is something 00:24:55.860 --> 00:24:58.590 like this, CREATE TABLE favorites. 00:24:58.590 --> 00:25:02.010 This all happened automatically when I imported my data, but in a bit, 00:25:02.010 --> 00:25:04.740 we'll be doing this more manually ourselves. 00:25:04.740 --> 00:25:08.580 So where does this then bring us in terms of features? 
00:25:08.580 --> 00:25:11.280 Well, in the world of storing data, whether it's 00:25:11.280 --> 00:25:13.650 storing users who've registered for your websites 00:25:13.650 --> 00:25:16.350 or the number of likes a post on Instagram has gotten 00:25:16.350 --> 00:25:19.950 or any form of data that you might want to read or write, 00:25:19.950 --> 00:25:22.410 there are four fundamental operations in the world. 00:25:22.410 --> 00:25:25.180 CRUD is the sort of crude way of remembering this. 00:25:25.180 --> 00:25:28.260 These stand for Create, Read, Update, Delete. 00:25:28.260 --> 00:25:30.870 And I dare say that with these four operations 00:25:30.870 --> 00:25:34.030 you can do anything you want when it comes to data. 00:25:34.030 --> 00:25:38.010 Creating that data, reading that data, updating it, or deleting it. 00:25:38.010 --> 00:25:42.450 Now, in the world of SQL, or Structured Query Language, or S-Q-L, 00:25:42.450 --> 00:25:44.490 this is just another programming language. 00:25:44.490 --> 00:25:47.460 It tends to be used only in the context of databases, 00:25:47.460 --> 00:25:51.810 but you can use it to solve problems not unlike Python, but in the context 00:25:51.810 --> 00:25:54.090 of the data you care about. 00:25:54.090 --> 00:25:56.370 So with SQL, these four CRUD operations actually 00:25:56.370 --> 00:25:58.530 have different keywords mapped to them. 00:25:58.530 --> 00:26:01.650 They happen to be these, INSERT, SELECT, UPDATE, DELETE. 00:26:01.650 --> 00:26:06.360 That is to say, unlike Python, which has hundreds of functions available to you, 00:26:06.360 --> 00:26:11.940 today is kind of neat insofar as SQL has maybe a dozen or so total functions 00:26:11.940 --> 00:26:14.700 that you can use, a few dozen functions total that you can use. 00:26:14.700 --> 00:26:18.790 Much more manageable, and these are the four that we'll use most frequently. 00:26:18.790 --> 00:26:21.330 So beyond this, how do you go about creating a table? 
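The four CRUD operations and their SQL keywords can be seen back to back in one short sketch, here run in order against a throwaway in-memory table (table and values are illustrative, not from the lecture's database):

```python
# Create, Read, Update, Delete mapped to INSERT, SELECT, UPDATE, DELETE.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.execute("INSERT INTO favorites (title) VALUES ('Veep')")             # Create
titles = db.execute("SELECT title FROM favorites").fetchall()           # Read
db.execute("UPDATE favorites SET title = 'VEEP' WHERE title = 'Veep'")  # Update
db.execute("DELETE FROM favorites WHERE title = 'VEEP'")                # Delete
```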
00:26:21.330 --> 00:26:22.420 Well, what does that mean? 00:26:22.420 --> 00:26:24.253 Well, a relational database is literally that. 00:26:24.253 --> 00:26:27.660 It's like a fancier version of Google Spreadsheets or Microsoft Excel 00:26:27.660 --> 00:26:30.120 or Apple Numbers that allows you to store all of your data 00:26:30.120 --> 00:26:31.500 in rows and columns. 00:26:31.500 --> 00:26:35.790 But it gives you these four functions and more 00:26:35.790 --> 00:26:40.140 via which to select data you care about, or delete or update or insert data 00:26:40.140 --> 00:26:41.820 that you care about as well. 00:26:41.820 --> 00:26:45.390 So we'll see in a bit that this is going to be the syntax via which in SQL you 00:26:45.390 --> 00:26:46.820 can create a new table. 00:26:46.820 --> 00:26:49.500 In Google Spreadsheets you would literally go to the plus icon 00:26:49.500 --> 00:26:50.640 and say New Sheet. 00:26:50.640 --> 00:26:54.000 In SQL, you would express this more programmatically with a line of code 00:26:54.000 --> 00:26:54.795 like this. 00:26:54.795 --> 00:26:57.420 But you're going to have to make a few decisions along the way, 00:26:57.420 --> 00:27:00.280 and that's because SQL has its own data types. 00:27:00.280 --> 00:27:03.300 Now, fortunately, they're pretty reminiscent of C and Python. 00:27:03.300 --> 00:27:06.302 But there's five main ones here. 00:27:06.302 --> 00:27:08.260 Let's just walk through a few examples thereof. 00:27:08.260 --> 00:27:09.390 So first, INTEGER. 00:27:09.390 --> 00:27:12.930 If you know in advance that you want to store a whole bunch of information 00:27:12.930 --> 00:27:19.440 in a database like someone's age or the number of likes a post has 00:27:19.440 --> 00:27:21.660 or anything that lends itself to an integer, 00:27:21.660 --> 00:27:24.540 SQL supports an integer data type. 00:27:24.540 --> 00:27:28.120 But in some versions of SQL, you actually have a decision to make. 
00:27:28.120 --> 00:27:31.470 Do you want an integer, a smallint, or a bigint? 00:27:31.470 --> 00:27:34.710 And these speak to, as in C, the size, the number 00:27:34.710 --> 00:27:36.180 of bits that are used to store it. 00:27:36.180 --> 00:27:37.530 An integer is 32 bits. 00:27:37.530 --> 00:27:39.090 A bigint is 64 bits. 00:27:39.090 --> 00:27:41.690 A smallint is fewer than both those. 00:27:41.690 --> 00:27:44.720 Now, we're using, for class, something called SQLite. 00:27:44.720 --> 00:27:47.363 It's a free and open source, very user friendly version of SQL. 00:27:47.363 --> 00:27:49.280 But there's others you might have heard about. 00:27:49.280 --> 00:27:51.230 So if you've ever heard of the company Oracle, 00:27:51.230 --> 00:27:53.540 their bread and butter is a database program 00:27:53.540 --> 00:27:56.210 called Oracle that is a SQL database that 00:27:56.210 --> 00:27:58.730 allows companies to store huge amounts of data 00:27:58.730 --> 00:28:01.490 and select and create and update and delete data 00:28:01.490 --> 00:28:03.180 using software they've written. 00:28:03.180 --> 00:28:07.940 MySQL, Postgres, MariaDB, Microsoft Access Server, 00:28:07.940 --> 00:28:12.050 SQL Server, and bunches of others, are all examples of SQL servers. 00:28:12.050 --> 00:28:15.162 And they'll each support data types, typically, like this. 00:28:15.162 --> 00:28:17.120 We happen to be showing you ones from Postgres, 00:28:17.120 --> 00:28:20.390 which is a super popular open source one that you can use on the internet, 00:28:20.390 --> 00:28:22.010 for instance for final projects. 00:28:22.010 --> 00:28:25.657 Real numbers, just like in Python or C, are similar in spirit to floats. 00:28:25.657 --> 00:28:27.240 However, you have a couple of options. 00:28:27.240 --> 00:28:29.330 One is a real, which uses 32 bits. 00:28:29.330 --> 00:28:31.040 Another is called double precision. 
00:28:31.040 --> 00:28:34.400 It's actually two words, but it gives you 64 bits as well, 00:28:34.400 --> 00:28:36.770 to give you more precision numerically. 00:28:36.770 --> 00:28:39.950 Then you've got this catch-all known as numeric in SQL, 00:28:39.950 --> 00:28:43.490 whereby if you have a number that's number-like, essentially, 00:28:43.490 --> 00:28:46.460 but isn't just an integer, categories-- 00:28:46.460 --> 00:28:50.270 this category encompasses things like Boolean values, dates, 00:28:50.270 --> 00:28:53.240 which have a predefined format like YYYY, 00:28:53.240 --> 00:28:56.390 four digits typically representing a year. 00:28:56.390 --> 00:28:59.660 That would be considered numeric here, as would be a date and a date time, 00:28:59.660 --> 00:29:04.950 which has something like 00:00:00 for hours, minutes, seconds. 00:29:04.950 --> 00:29:06.780 And then you have other values as well. 00:29:06.780 --> 00:29:09.590 What's nice about this numeric type specifically 00:29:09.590 --> 00:29:14.570 is you can actually solve the floating point problem in both Python 00:29:14.570 --> 00:29:18.410 and C. Recall that floats or even doubles in C 00:29:18.410 --> 00:29:22.370 ultimately have finite precision, which is bad in terms of values 00:29:22.370 --> 00:29:24.140 not quite adding up as you intend. 00:29:24.140 --> 00:29:26.300 But with numeric data types and databases, 00:29:26.300 --> 00:29:30.380 you can specify exactly how many digits you want before the decimal point 00:29:30.380 --> 00:29:31.830 and after, maximally. 00:29:31.830 --> 00:29:35.060 So if you're dealing with financial information or scientific information, 00:29:35.060 --> 00:29:39.020 you can be super, super precise thanks to this data type in the database. 00:29:39.020 --> 00:29:43.160 So finally, a problem that we've solved, versus those other languages. 00:29:43.160 --> 00:29:43.910 Text. 
00:29:43.910 --> 00:29:46.050 When it comes to storing data in a database, 00:29:46.050 --> 00:29:48.470 you can specify that your columns can either 00:29:48.470 --> 00:29:52.220 be chars, which isn't an individual character as in C. 00:29:52.220 --> 00:29:55.730 You literally specify n, the number of characters 00:29:55.730 --> 00:29:58.370 that every cell in that column will be. 00:29:58.370 --> 00:29:59.823 You never do this in Excel. 00:29:59.823 --> 00:30:02.240 You never do this in Google Spreadsheets or Apple Numbers. 00:30:02.240 --> 00:30:06.040 In a database, though, you can tell the database ever more precisely, 00:30:06.040 --> 00:30:10.790 I want to store two characters in every cell in some column. 00:30:10.790 --> 00:30:11.660 Why? 00:30:11.660 --> 00:30:13.310 Well, maybe like US state codes. 00:30:13.310 --> 00:30:16.640 If you have MA for Massachusetts or CA for California, 00:30:16.640 --> 00:30:20.210 it might be nice to tell the database that you can store, minimally 00:30:20.210 --> 00:30:23.420 and maximally, two characters for every value I'm going to give you. 00:30:23.420 --> 00:30:24.723 Varchar is a little different. 00:30:24.723 --> 00:30:26.390 This is a variable number of characters. 00:30:26.390 --> 00:30:30.290 If your column might have a few characters or a lot of characters based 00:30:30.290 --> 00:30:33.470 on what some human types in, you can give an upper bound 00:30:33.470 --> 00:30:36.710 n on how many characters will be stored in every cell. 
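A table declared with the column types just discussed might look like the sketch below. One caveat worth hedging: SQLite, used here, accepts CHAR(n), VARCHAR(n), and NUMERIC(p, s) syntax but maps them to type affinities and does not actually enforce the lengths; databases like Postgres do. The table and column names are illustrative assumptions.

```python
# CREATE TABLE with the data types discussed above, run via SQLite.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        state CHAR(2),          -- fixed two characters, e.g. MA, CA
        name VARCHAR(64),       -- variable length, upper bound of 64
        balance NUMERIC(10, 2)  -- up to 10 digits, 2 after the decimal
    )
""")
```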
00:30:36.710 --> 00:30:40.490 So if you don't know in advance how long a user's email address is going to be 00:30:40.490 --> 00:30:43.100 or how long my name or your name is going to be in advance, 00:30:43.100 --> 00:30:47.370 you can specify varchar and then pick some upper bound, some value that, ugh, 00:30:47.370 --> 00:30:50.600 no one's going to have more than 20 characters in their name or 200 00:30:50.600 --> 00:30:55.160 characters in their name, whatever line you decide to draw in the sand. 00:30:55.160 --> 00:30:57.710 As an aside, does anyone want to conjecture 00:30:57.710 --> 00:31:00.940 what a good upper bound is for the number of characters 00:31:00.940 --> 00:31:04.085 in a human name that might register for a website you're making? 00:31:04.085 --> 00:31:05.480 AUDIENCE: [INAUDIBLE] 00:31:05.480 --> 00:31:06.390 DAVID MALAN: 25? 00:31:06.390 --> 00:31:06.890 Pretty good. 00:31:06.890 --> 00:31:07.990 AUDIENCE: [INAUDIBLE] 00:31:07.990 --> 00:31:09.032 DAVID MALAN: 30, I heard. 00:31:09.032 --> 00:31:09.980 AUDIENCE: [INAUDIBLE] 00:31:09.980 --> 00:31:11.615 DAVID MALAN: 32, OK. 00:31:11.615 --> 00:31:12.490 AUDIENCE: [INAUDIBLE] 00:31:12.490 --> 00:31:14.330 DAVID MALAN: 45? 00:31:14.330 --> 00:31:15.740 Higher. 00:31:15.740 --> 00:31:16.720 60? 00:31:16.720 --> 00:31:17.420 AUDIENCE: 64. 00:31:17.420 --> 00:31:18.170 DAVID MALAN: 64? 00:31:18.170 --> 00:31:20.790 This is the kind of thing that Google, I dare say, is good at. 00:31:20.790 --> 00:31:23.570 So let me go ahead and open up Google real fast. 00:31:23.570 --> 00:31:26.570 Suppose the goal at hand is the store, in a database 00:31:26.570 --> 00:31:32.060 table, that is the equivalent of a spreadsheets column, the longest 00:31:32.060 --> 00:31:34.811 name in world. 00:31:34.811 --> 00:31:35.950 All right. 00:31:35.950 --> 00:31:43.180 Looks like we had 988 characters in a fellow's name, Hubert. 
00:31:43.180 --> 00:31:46.430 And I won't even try to pronounce his last name, but there it is. 00:31:46.430 --> 00:31:48.470 So if you want to fill-- 00:31:48.470 --> 00:31:50.795 if you want to be able to fit everyone's name-- 00:31:50.795 --> 00:31:52.900 oh, and note, senior. 00:31:52.900 --> 00:31:54.760 [LAUGHTER] 00:31:54.760 --> 00:31:56.870 You might need even more characters than that. 00:31:56.870 --> 00:31:59.950 So this is a rabbit hole we won't really go down today, 00:31:59.950 --> 00:32:02.530 but suffice it to say that the answers to these questions 00:32:02.530 --> 00:32:03.550 are often not obvious. 00:32:03.550 --> 00:32:06.340 And if you think about websites you've visited in the real world, 00:32:06.340 --> 00:32:10.930 whether on campus or off, odds are, you have tried typing in some input 00:32:10.930 --> 00:32:14.660 to some web-based form or even some mobile application, 00:32:14.660 --> 00:32:17.200 where all of a sudden your keyboard stops working, right? 00:32:17.200 --> 00:32:19.000 They won't tolerate any more characters. 00:32:19.000 --> 00:32:21.490 Maybe it's the Common Application and the length of your essay. 00:32:21.490 --> 00:32:23.680 Maybe it's the length that your resume can be in a field. 00:32:23.680 --> 00:32:26.440 Maybe it's the length of your name when registering for a website. 00:32:26.440 --> 00:32:27.670 Well, why is that? 00:32:27.670 --> 00:32:32.860 Those applications are probably using a database, probably using SQL, 00:32:32.860 --> 00:32:37.210 and they had to decide, in advance, what is the maximum length of an input 00:32:37.210 --> 00:32:39.190 that we're going to tolerate from a human? 00:32:39.190 --> 00:32:42.220 And then the computer ultimately enforces that. 00:32:42.220 --> 00:32:42.820 Why? 00:32:42.820 --> 00:32:46.900 Well that would be a very clever attack, for some bad actor out there 00:32:46.900 --> 00:32:48.310 to exploit. 
00:32:48.310 --> 00:32:50.998 If you had no bounds on how big the input could 00:32:50.998 --> 00:32:53.290 be that a human could type in, they could-- much like I 00:32:53.290 --> 00:32:55.123 need to try the other day with Emma's name-- 00:32:55.123 --> 00:32:57.760 just keep pasting pasting, pasting, pasting a massive input, 00:32:57.760 --> 00:33:01.750 hoping to overwhelm your computer's memory, maybe make your server crash, 00:33:01.750 --> 00:33:04.370 and therefore deny service to other people. 00:33:04.370 --> 00:33:06.670 So there are reasons for these kinds of defenses. 00:33:06.670 --> 00:33:08.140 Lastly is BLOB. 00:33:08.140 --> 00:33:10.153 At the very top, Binary Large Object. 00:33:10.153 --> 00:33:12.070 Don't need it typically that often, but if you 00:33:12.070 --> 00:33:16.270 want to store binary data, raw zeros and ones that represent files or the like, 00:33:16.270 --> 00:33:18.400 you can store it in a BLOB format as well. 00:33:18.400 --> 00:33:20.650 Now, just to give you a whirlwind tour of other syntax 00:33:20.650 --> 00:33:23.020 before we start using it, you can insert data 00:33:23.020 --> 00:33:25.570 into a database using syntax quite like this. 00:33:25.570 --> 00:33:29.590 You can select data from a database using syntax like I already did. 00:33:29.590 --> 00:33:32.180 You can use functions as well. 00:33:32.180 --> 00:33:35.230 In fact, in just a bit, when we load more data into the database, 00:33:35.230 --> 00:33:38.800 I could actually select all of the distinct names from a database-- 00:33:38.800 --> 00:33:40.870 and in fact, I could have done this a moment ago. 00:33:40.870 --> 00:33:45.430 Rather than SELECT title FROM favorites, as I 00:33:45.430 --> 00:33:49.360 did earlier-- that's going to give me all of the titles in that database. 
00:33:49.360 --> 00:33:54.310 I could instead do something like SELECT DISTINCT title FROM favorites, 00:33:54.310 --> 00:33:56.680 and that's going to filter out all of the duplicates 00:33:56.680 --> 00:34:00.268 without me having to write a single line of code to do that. 00:34:00.268 --> 00:34:01.810 You can also count, as we did before. 00:34:01.810 --> 00:34:03.790 You can compute averages, mins, or maxes, which 00:34:03.790 --> 00:34:06.220 is really useful for analytical work that you might want 00:34:06.220 --> 00:34:07.909 to do as a data scientist or the like. 00:34:07.909 --> 00:34:10.659 And there's a whole bunch of other functions available to you as well. 00:34:10.659 --> 00:34:12.010 But there's some more powerful features. 00:34:12.010 --> 00:34:14.230 And we'll begin to use these in just a little bit. 00:34:14.230 --> 00:34:15.880 We used WHERE before. 00:34:15.880 --> 00:34:17.199 I used LIMIT before. 00:34:17.199 --> 00:34:18.460 I used GROUP BY before. 00:34:18.460 --> 00:34:21.280 And we'll see a bunch of examples that reinforce all of these. 00:34:21.280 --> 00:34:23.920 But there's keywords like LIKE, whereby if you 00:34:23.920 --> 00:34:28.540 want to search for everyone's input who mentions The Office, you know what? 00:34:28.540 --> 00:34:31.989 I bet we could do something like this instead. 00:34:31.989 --> 00:34:36.280 I know that I can SELECT star for-- oh, let's go ahead 00:34:36.280 --> 00:34:41.920 and SELECT title FROM favorites WHERE, and this is kind of nonsensical, 00:34:41.920 --> 00:34:45.820 but title equals "The Office," quote unquote. 00:34:45.820 --> 00:34:48.969 And I get back everyone who typed in literally The Office. 00:34:48.969 --> 00:34:55.090 However, if I instead did something like this, "the office" in lower case, 00:34:55.090 --> 00:34:58.780 I'm going to get one of you who typed in "the office" in lowercase. 00:34:58.780 --> 00:35:01.240 And the thing here is just the title of this output. 
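SELECT DISTINCT and the aggregate functions mentioned here can be sketched side by side against a tiny illustrative table:

```python
# DISTINCT filters out duplicate rows; COUNT is one of SQL's aggregates.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("Friends",), ("Friends",), ("Veep",)])
distinct = db.execute("SELECT DISTINCT title FROM favorites").fetchall()
total = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]
# distinct has one row per unique title; total counts every row.
```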
00:35:01.240 --> 00:35:03.940 But what if I want to do anything like The Office? 00:35:03.940 --> 00:35:10.780 I could say something like LIKE "%office%", and it turns out in SQL, 00:35:10.780 --> 00:35:12.880 this new language, percent is a placeholder. 00:35:12.880 --> 00:35:16.120 It's a wild card that just says, eh, anything can go before, 00:35:16.120 --> 00:35:18.190 and, eh, anything can go after. 00:35:18.190 --> 00:35:21.340 So now I'm searching for any titles that have the word The Office. 00:35:21.340 --> 00:35:22.810 We picked up another entry here. 00:35:22.810 --> 00:35:26.153 So one of you said Office, without the word "the," also pretty reasonable. 00:35:26.153 --> 00:35:28.570 Some of the uppercase and lowercase is all over the place. 00:35:28.570 --> 00:35:30.010 Now we've caught those. 00:35:30.010 --> 00:35:33.100 And if you want to count the total number of offices now, 00:35:33.100 --> 00:35:34.810 now I can do something like this. 00:35:34.810 --> 00:35:38.020 COUNT title-- so I can combine all of these building 00:35:38.020 --> 00:35:40.720 blocks, not unlike Scratch, and get back the total number 00:35:40.720 --> 00:35:45.610 of offices, which looks now to be 33 when we tolerate a lot more 00:35:45.610 --> 00:35:48.880 variability in our users' input. 00:35:48.880 --> 00:35:50.680 Most powerfully of all, we'll see something 00:35:50.680 --> 00:35:53.020 like JOIN in just a little bit, and that'll 00:35:53.020 --> 00:35:56.600 be all toward the end of using databases better with better design. 00:35:56.600 --> 00:36:00.010 But with SELECTs, we, of course, don't just want to select all of the titles. 00:36:00.010 --> 00:36:02.360 We might select where some condition is true. 00:36:02.360 --> 00:36:04.430 So conditions, just like in Python and C, 00:36:04.430 --> 00:36:08.323 exist in SQL, but at the end of our queries, as we've seen. 
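The LIKE wildcard search described here can be sketched as follows; note that in SQLite specifically, LIKE is case-insensitive for ASCII characters by default, which is why the differently capitalized variants all match (the sample titles are illustrative):

```python
# The % wildcards match anything before and after "office".
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("The Office",), ("the office",), ("Office",), ("Veep",)])
matches = db.execute(
    "SELECT title FROM favorites WHERE title LIKE '%office%'"
).fetchall()
# All three Office variants match; Veep does not.
```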
00:36:08.323 --> 00:36:10.240 If you want to update something, for instance, 00:36:10.240 --> 00:36:12.550 you can update data in a database by saying 00:36:12.550 --> 00:36:16.300 UPDATE table name SET the column name equal to some value, 00:36:16.300 --> 00:36:18.200 WHERE condition is true. 00:36:18.200 --> 00:36:19.820 So how might I use this? 00:36:19.820 --> 00:36:23.640 Well, let's again look at the data where it's like The Office. 00:36:23.640 --> 00:36:24.390 And you know what? 00:36:24.390 --> 00:36:25.807 I'm going to go ahead and do this. 00:36:25.807 --> 00:36:28.307 If I want to clean all of this data up-- so data cleaning is 00:36:28.307 --> 00:36:30.140 a very common approach, whether you're doing 00:36:30.140 --> 00:36:32.890 research or analytical work, often when you have messy data, 00:36:32.890 --> 00:36:34.770 the first pass is honestly grunt work. 00:36:34.770 --> 00:36:37.350 Like, writing some code, or god forbid, manually copying 00:36:37.350 --> 00:36:40.330 and pasting in an Excel file, just to clean up your data. 00:36:40.330 --> 00:36:42.630 You can do this a lot more powerfully in SQL. 00:36:42.630 --> 00:36:48.450 So for instance, let me go ahead and say UPDATE my favorites SET 00:36:48.450 --> 00:36:52.500 title equal to "The Office," the canonical form, capital 00:36:52.500 --> 00:36:56.400 T, capital O, WHERE title LIKE-- 00:36:56.400 --> 00:36:59.850 and now let me do this wildcard search, knowing 00:36:59.850 --> 00:37:02.940 that that's going to slurp in all 30-plus of those rows. 00:37:02.940 --> 00:37:06.330 But now, when I hit Enter, nothing seems to happen. 00:37:06.330 --> 00:37:10.050 But if I then SELECT title again WHERE LIKE "%office%", 00:37:10.050 --> 00:37:13.590 now they all look the same, because I've updated my data. 00:37:13.590 --> 00:37:17.010 And so in fact, now I could go back to The Office, 00:37:17.010 --> 00:37:21.520 because I've canonicalized my data, and everything now looks the same. 
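The data-cleaning UPDATE described here, canonicalizing every Office-like row to one spelling, can be sketched like so (sample rows are illustrative):

```python
# UPDATE table SET column = value WHERE condition, used to clean data.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("the office",), ("Office",), ("THE OFFICE",)])
# Canonicalize every Office-like row to the official spelling.
db.execute(
    "UPDATE favorites SET title = 'The Office' WHERE title LIKE '%office%'"
)
rows = db.execute("SELECT DISTINCT title FROM favorites").fetchall()
# rows is now [('The Office',)]
```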
00:37:21.520 --> 00:37:23.760 So with SQL, you can do that as well. 00:37:23.760 --> 00:37:28.620 Previously, let's go ahead and SELECT a title FROM favorites. 00:37:28.620 --> 00:37:32.880 And let's just limit this to 10 of them this time, so they don't all fly by. 00:37:32.880 --> 00:37:34.980 Suppose-- VEEP is a good show. 00:37:34.980 --> 00:37:37.380 Maybe I'm not a fan of, say-- 00:37:37.380 --> 00:37:40.770 let's give ourselves 20 here, see what comes up next. 00:37:40.770 --> 00:37:44.370 OK, I actually don't like Friends, even though everyone in the world seems to. 00:37:44.370 --> 00:37:45.420 No problem. 00:37:45.420 --> 00:37:46.665 DELETE FROM favorites-- 00:37:46.665 --> 00:37:47.700 [LAUGHTER] 00:37:47.700 --> 00:37:50.790 --WHERE title = "Friends";. 00:37:50.790 --> 00:37:53.070 Now, no more Friends. 00:37:53.070 --> 00:37:54.885 And so you can delete your data as well. 00:37:54.885 --> 00:37:56.010 [LAUGHTER] 00:37:56.010 --> 00:38:00.390 You can delete your data as well by using a query of this form. 00:38:00.390 --> 00:38:04.020 This one is perhaps the most dangerous of all. 00:38:04.020 --> 00:38:07.860 Consider what might happen if you're a little sloppy, a little hasty, 00:38:07.860 --> 00:38:09.720 and omit the WHERE condition. 00:38:09.720 --> 00:38:14.070 What happens if you go ahead and DELETE FROM favorites;? 00:38:14.070 --> 00:38:15.143 Everything goes away. 00:38:15.143 --> 00:38:18.060 And slightly worse than that, though we won't have occasion to use it, 00:38:18.060 --> 00:38:20.352 there's another function called DROP, where if you just 00:38:20.352 --> 00:38:23.970 want to delete the table altogether, delete all of your rows and columns, 00:38:23.970 --> 00:38:28.630 you can say DROP TABLE favorites;, and that will delete all of the data 00:38:28.630 --> 00:38:29.130 as well. 
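The difference between a targeted DELETE and the dangerous WHERE-less form described here can be sketched in a few lines (sample rows are illustrative):

```python
# DELETE with a WHERE clause removes matching rows; without one,
# it removes *every* row in the table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("Friends",), ("Veep",)])
db.execute("DELETE FROM favorites WHERE title = 'Friends'")  # targeted
remaining = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]  # 1
db.execute("DELETE FROM favorites")  # no WHERE: everything goes away
empty = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]      # 0
```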
00:38:29.130 --> 00:38:32.700 And we'll come back to that, because the mere fact that you can scarily 00:38:32.700 --> 00:38:35.370 delete lots of data at once makes you super 00:38:35.370 --> 00:38:38.580 vulnerable to bad actors on the internet or in the office who 00:38:38.580 --> 00:38:41.770 want to potentially wreak havoc on your data. 00:38:41.770 --> 00:38:46.210 So we'll come back today, talking about matters of security, as well. 00:38:46.210 --> 00:38:49.920 So any questions before, now, we take things up a notch 00:38:49.920 --> 00:38:55.270 and look at a much more massive data set than our hundreds of favorites here? 00:38:55.270 --> 00:38:56.430 Is that a question there? 00:38:56.430 --> 00:38:58.586 Yeah. 00:38:58.586 --> 00:39:01.965 AUDIENCE: [INAUDIBLE] 00:39:01.965 --> 00:39:04.090 DAVID MALAN: Did not look like Friends got removed. 00:39:04.090 --> 00:39:05.050 Oh, interesting. 00:39:05.050 --> 00:39:06.730 You are right. 00:39:06.730 --> 00:39:09.700 Let's try a little harder. 00:39:09.700 --> 00:39:12.010 LIKE "%friends%". 00:39:15.200 --> 00:39:16.370 Better? 00:39:16.370 --> 00:39:17.190 Now they're gone. 00:39:17.190 --> 00:39:19.607 So I'm guessing-- I would have to look closer at the data, 00:39:19.607 --> 00:39:21.442 but I'm guessing one or more people typed it 00:39:21.442 --> 00:39:23.150 with a slightly different capitalization, 00:39:23.150 --> 00:39:25.760 or maybe they hit a spacebar character or something, where 00:39:25.760 --> 00:39:29.040 I wasn't doing a precise match, thereby making 00:39:29.040 --> 00:39:32.730 the LIKE operator a more robust fix. 00:39:32.730 --> 00:39:35.280 However, let's consider for just a moment. 00:39:35.280 --> 00:39:36.860 What might the gotcha here be? 00:39:36.860 --> 00:39:39.170 Like, this feels like a nice quick and dirty fix, 00:39:39.170 --> 00:39:42.118 but it really is quick and dirty in what sense? 
00:39:42.118 --> 00:39:43.910 AUDIENCE: Like, if you have another TV show 00:39:43.910 --> 00:39:47.543 that has the word friend in it, but not Friends, [INAUDIBLE].. 00:39:47.543 --> 00:39:49.460 DAVID MALAN: Yeah, there might be another show 00:39:49.460 --> 00:39:50.480 with the word "friends" in it. 00:39:50.480 --> 00:39:51.320 Quite reasonable. 00:39:51.320 --> 00:39:54.380 And I just blew those away as well from my database. 00:39:54.380 --> 00:39:56.480 And in fact, we can probably see this for real. 00:39:56.480 --> 00:39:59.540 If I go ahead and look at, for instance-- 00:39:59.540 --> 00:40:05.360 let me go into a browser here, IMDB.com, super popular website for just getting 00:40:05.360 --> 00:40:07.550 movie reviews or information about movies. 00:40:07.550 --> 00:40:09.330 Let me go ahead and search for friends. 00:40:09.330 --> 00:40:12.830 And yes, the most popular hit seems to be the 1994 version of Friends, 00:40:12.830 --> 00:40:14.740 but there's Friends with Benefits. 00:40:14.740 --> 00:40:17.573 There's Just Friends, We Are Your Friends, My Best Friend's Wedding, 00:40:17.573 --> 00:40:18.990 Fisherman's Friends, and so forth. 00:40:18.990 --> 00:40:21.020 Some of those movies, some of those TV shows. 00:40:21.020 --> 00:40:24.950 And had we poked a little further among our own data of hundreds of rows, maybe 00:40:24.950 --> 00:40:27.980 some of you did have some of those shows as your favorites as well. 00:40:27.980 --> 00:40:30.290 So beware that, too. 00:40:30.290 --> 00:40:32.990 So in preparing for today, we wanted to play 00:40:32.990 --> 00:40:36.350 with an even juicier data set than one we could generate for ourselves. 00:40:36.350 --> 00:40:39.920 And wonderfully, IMDB makes a lot of their database 00:40:39.920 --> 00:40:44.900 available for download as TSV files, Tab-Separated Values. 00:40:44.900 --> 00:40:48.210 It's really not fundamentally different from Comma Separated Values.
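A sketch of that gotcha, again with sqlite3 and made-up rows: SQLite's LIKE is case-insensitive for ASCII by default, and the % wildcards match any run of characters, so the broader pattern catches the stray capitalization and whitespace -- but also an unrelated title that merely contains the word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE favorites (title TEXT)")
conn.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("friends ",), ("FRIENDS",), ("Friends with Benefits",), ("The Crown",)],
)

# An exact match misses every variant with odd capitalization or spaces.
conn.execute("DELETE FROM favorites WHERE title = 'Friends'")
after_exact = conn.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]

# The wildcard pattern removes them all -- including the unrelated title.
conn.execute("DELETE FROM favorites WHERE title LIKE '%friends%'")
left = [row[0] for row in conn.execute("SELECT title FROM favorites")]
```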
00:40:48.210 --> 00:40:51.200 One human at IMDB just decided to use tabs to separate 00:40:51.200 --> 00:40:52.763 their values instead of commas. 00:40:52.763 --> 00:40:54.680 And if you read through this page here, you'll 00:40:54.680 --> 00:40:58.520 see that they let you download a bunch of different files, one of which 00:40:58.520 --> 00:41:02.360 we'll focus on, initially, specifically title.basics.tsv.gz. 00:41:05.105 --> 00:41:06.980 So that's a mouthful, but this is just saying 00:41:06.980 --> 00:41:09.920 this is basic information about titles in their database. 00:41:09.920 --> 00:41:12.785 Gz means it's G-zipped, which is like Zip. 00:41:12.785 --> 00:41:14.660 If you're familiar, on a Mac or PC, it's just 00:41:14.660 --> 00:41:19.850 another way of compressing information, but you can unzip that on a computer 00:41:19.850 --> 00:41:20.390 as well. 00:41:20.390 --> 00:41:22.790 Dot TSV means Tab-Separated Values. 00:41:22.790 --> 00:41:24.860 Now, we read in advance the information here, 00:41:24.860 --> 00:41:26.490 but let me summarize it as follows. 00:41:26.490 --> 00:41:28.972 There's some juicy information in that file. 00:41:28.972 --> 00:41:31.430 And I'm going to go ahead and download this file over here. 00:41:31.430 --> 00:41:33.710 If I click a link on IMDB's website, I'll 00:41:33.710 --> 00:41:37.607 see a file called title.basics.tsv.gz. 00:41:37.607 --> 00:41:40.190 I'm going to go ahead and let that download in the background. 00:41:40.190 --> 00:41:44.600 It looks like it is 100-plus megabytes. 00:41:44.600 --> 00:41:47.570 If I click on it on my Mac, it's going to decompress it. 00:41:47.570 --> 00:41:53.210 And you'll see here, this apparently is 532 megabytes once it's decompressed. 00:41:53.210 --> 00:41:56.840 So that's way bigger than the CSV file I downloaded from Google 00:41:56.840 --> 00:42:01.110 a few minutes ago, which is 24 kilobytes, 24,000 bytes. 00:42:01.110 --> 00:42:03.560 This is 532 million bytes.
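As an aside, the decompression step can also be done in Python with the standard gzip module; the two-line tab-separated sample here is a made-up miniature of title.basics.tsv:

```python
import gzip

# Compress a tiny tab-separated sample, then decompress it, mirroring
# what double-clicking the .gz file (or running gunzip) does.
raw = "tconst\ttitleType\ntt4786824\ttvSeries\n".encode()
packed = gzip.compress(raw)
text = gzip.decompress(packed).decode()
```

For the real file, gzip.open("title.basics.tsv.gz", "rt") would stream the contents without ever writing the decompressed 532 megabytes to disk.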
00:42:03.560 --> 00:42:07.040 So it's orders of magnitude bigger, which means our design decisions today 00:42:07.040 --> 00:42:08.210 are really going to matter. 00:42:08.210 --> 00:42:10.500 Now, if we take a look inside of this file, 00:42:10.500 --> 00:42:12.260 we'll see a bunch of interesting fields. 00:42:12.260 --> 00:42:15.920 And I've essentially summarized this from their own online documentation. 00:42:15.920 --> 00:42:18.980 There's these five fields that are of interest right now, 00:42:18.980 --> 00:42:20.630 and we'll tease these apart now. 00:42:20.630 --> 00:42:22.550 So tconst, I didn't recognize that. 00:42:22.550 --> 00:42:26.510 It's an IMDB-specific term, it seems, but it means some kind of title 00:42:26.510 --> 00:42:27.200 constant. 00:42:27.200 --> 00:42:30.480 It's a unique identifier for every movie in the database. 00:42:30.480 --> 00:42:33.360 So for instance, one of the movies in that TSV file, 00:42:33.360 --> 00:42:39.040 which I'll open up in a moment, has the unique value tt4786824. 00:42:39.040 --> 00:42:39.830 Why? 00:42:39.830 --> 00:42:40.970 Just because. 00:42:40.970 --> 00:42:44.750 But it turns out we can actually see what this is. 00:42:44.750 --> 00:42:57.160 Let me actually grab that value and go to IMDB.com/title/tt4786824. 00:42:57.160 --> 00:42:58.160 Enter. 00:42:58.160 --> 00:42:59.870 These actually do have significance. 00:42:59.870 --> 00:43:01.760 That is the unique identifier for The Crown, 00:43:01.760 --> 00:43:04.400 popular hit TV show that you can now watch on Netflix. 00:43:04.400 --> 00:43:08.660 So if you go back to the actual field-- 00:43:08.660 --> 00:43:09.960 that's how I spent the summer. 00:43:09.960 --> 00:43:12.543 So if you go back to the fields here, we have not only tconst, 00:43:12.543 --> 00:43:15.440 which uniquely identifies the film, we also have title type.
00:43:15.440 --> 00:43:17.990 So it turns out in this 500 megabyte file, 00:43:17.990 --> 00:43:22.010 there's huge amounts of information on movies, TV series, documentaries, 00:43:22.010 --> 00:43:25.280 shorts, way more data than we could possibly play with in one day. 00:43:25.280 --> 00:43:29.270 So we'll focus just on TV series, like this, the show The Crown. 00:43:29.270 --> 00:43:31.910 Primary title is the field in the file that 00:43:31.910 --> 00:43:33.660 gives the primary title of the show. 00:43:33.660 --> 00:43:35.660 Sometimes shows apparently have different titles 00:43:35.660 --> 00:43:38.320 in different geographies or communities, so the primary title 00:43:38.320 --> 00:43:40.070 is the one that most people know it by. 00:43:40.070 --> 00:43:43.670 Start year is the year for a TV show that-- in which that TV show began. 00:43:43.670 --> 00:43:46.250 The Crown came out first in 2016. 00:43:46.250 --> 00:43:51.050 And genres, a little weirdly, is a comma-separated list 00:43:51.050 --> 00:43:53.690 of genres into which that show falls. 00:43:53.690 --> 00:43:57.320 And I say it's a little weird, because we do have a TSV file, 00:43:57.320 --> 00:43:59.580 but they're using commas inside. 00:43:59.580 --> 00:44:03.590 So this is probably among the motivations for having used tabs in-- 00:44:03.590 --> 00:44:05.660 to separate all of the columns, because they 00:44:05.660 --> 00:44:08.060 want to use commas for some other purpose here. 00:44:08.060 --> 00:44:09.570 But it's a little messy. 00:44:09.570 --> 00:44:13.160 This is actually not necessarily the best design decision here. 00:44:13.160 --> 00:44:17.090 So what can we go ahead and do with all of this data here? 00:44:17.090 --> 00:44:20.140 So let's go ahead and start to grab some of this data 00:44:20.140 --> 00:44:21.640 and write some scripts involving it. 00:44:21.640 --> 00:44:23.260 But let me issue a disclaimer.
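Since the genres cell is itself a comma-separated list, any consumer of the file has to split it apart; a minimal sketch, with The Crown's row hard-coded here as an assumption:

```python
# One parsed row from the TSV; genres packs several values into one cell.
row = {"primaryTitle": "The Crown", "genres": "Biography,Drama,History"}

# Splitting on commas recovers the individual genres.
genres = row["genres"].split(",")
```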
00:44:23.260 --> 00:44:26.000 It turns out when you have a lot of data in a database, 00:44:26.000 --> 00:44:30.247 it can take quite a long time to analyze it or parse it. 00:44:30.247 --> 00:44:32.830 And so if you ever watch certain baking shows like Julia Child 00:44:32.830 --> 00:44:35.620 from down the road here in Cambridge, very often there 00:44:35.620 --> 00:44:39.580 would be some TV magic whereby someone mixes the cake mix 00:44:39.580 --> 00:44:42.723 and puts it into the oven, and then two minutes later, voila, it comes out. 00:44:42.723 --> 00:44:44.890 That's because some of the scripts we'll write today 00:44:44.890 --> 00:44:47.410 might actually take many minutes to run, but we'll 00:44:47.410 --> 00:44:49.118 go ahead and run them in the background, 00:44:49.118 --> 00:44:51.760 and I'll use the outputs that I've created here in advance. 00:44:51.760 --> 00:44:55.600 So how do we go about beginning to navigate this data set? 00:44:55.600 --> 00:44:59.830 Well, first of all, let me go ahead and just open this file. 00:44:59.830 --> 00:45:02.230 And I'm going to use my own terminal window here. 00:45:02.230 --> 00:45:06.790 This file was called title.basics.tsv. 00:45:06.790 --> 00:45:09.580 Notice that even in my program here, it took a moment to open, 00:45:09.580 --> 00:45:11.060 but this is what it looks like. 00:45:11.060 --> 00:45:14.380 It's a little cryptic at first glance, but notice the first row 00:45:14.380 --> 00:45:16.030 is the headers of the file. 00:45:16.030 --> 00:45:19.900 So we've got tconst and title type and primary title and so forth. 00:45:19.900 --> 00:45:24.160 Then every row thereafter seems to start with one of those unique identifiers, 00:45:24.160 --> 00:45:27.250 and they seem to be monotonically increasing, 1, 2, 3, 4, 00:45:27.250 --> 00:45:29.830 but they start with tt then some number of digits. 00:45:29.830 --> 00:45:32.972 Then you see that these are a bunch of shorts, so short films or the like.
00:45:32.972 --> 00:45:35.180 And if we kept scrolling, we'd see some other values. 00:45:35.180 --> 00:45:36.825 Here's the first movie in the database. 00:45:36.825 --> 00:45:39.700 And if we keep scrolling and scrolling and scrolling, we'll see more. 00:45:39.700 --> 00:45:43.150 So if I actually use my text editor here and search for literally the word 00:45:43.150 --> 00:45:45.040 The Crown, it's going to take a little while 00:45:45.040 --> 00:45:48.100 to get to it, because there's a bunch of shows apparently called 00:45:48.100 --> 00:45:50.470 The Crown or Crowning or something. 00:45:50.470 --> 00:45:52.060 But you know what'll be faster? 00:45:52.060 --> 00:45:53.860 Let me go ahead and search for-- 00:45:53.860 --> 00:45:55.090 what was it? 00:45:55.090 --> 00:46:03.020 If I go back to tt4786824, Enter, it takes a moment 00:46:03.020 --> 00:46:05.770 to find this because it's searching through hundreds of megabytes, 00:46:05.770 --> 00:46:10.493 but voila, down here is this tconst, TV series type. 00:46:10.493 --> 00:46:11.410 It's called The Crown. 00:46:11.410 --> 00:46:15.920 In another geography it's also called The Crown, 2016, and so forth. 00:46:15.920 --> 00:46:17.920 So that's a huge amount of data. 00:46:17.920 --> 00:46:19.180 In fact, how much data? 00:46:19.180 --> 00:46:22.310 Well, if I actually analyze this with a command on my computer, 00:46:22.310 --> 00:46:29.320 there are 6,267,469 shows in the world, be it TV 00:46:29.320 --> 00:46:31.330 or movies or documentaries or the like. 00:46:31.330 --> 00:46:32.830 That's a huge data set. 00:46:32.830 --> 00:46:36.460 So suppose I want to focus really just on a subset of that data. 00:46:36.460 --> 00:46:39.970 What can I go about using instead? 00:46:39.970 --> 00:46:45.040 Well, let me go ahead and open up, for instance, the same file, 00:46:45.040 --> 00:46:47.685 but let me try putting it in CS50 IDE.
00:46:47.685 --> 00:46:50.560 I'm going to go ahead and download this, as you might have downloaded 00:46:50.560 --> 00:46:52.660 past problem sets, with wget. 00:46:52.660 --> 00:46:54.867 That's going to give me a pretty big file. 00:46:54.867 --> 00:46:56.950 There's a command now that you've not had occasion 00:46:56.950 --> 00:47:02.640 to use in the IDE called gunzip, which will unzip a file that starts with-- 00:47:02.640 --> 00:47:04.480 that ends with .gz. 00:47:04.480 --> 00:47:09.060 That's going to take a moment to unzip the whole file. 00:47:09.060 --> 00:47:11.310 A moment, a moment, a moment, OK. 00:47:11.310 --> 00:47:12.970 [CHUCKLES] And no space left on disk. 00:47:12.970 --> 00:47:14.880 But that's OK, because it turns out what I'm also 00:47:14.880 --> 00:47:17.260 going to start doing today is using my own Mac a bit more. 00:47:17.260 --> 00:47:19.260 And you'll recall last week, I did that a little bit, 00:47:19.260 --> 00:47:22.218 because when I wanted to play with speech recognition and my microphone, 00:47:22.218 --> 00:47:25.890 it's a lot easier to have Python running on my Mac or your PC 00:47:25.890 --> 00:47:28.200 if I want to use hardware built into my computer. 00:47:28.200 --> 00:47:31.680 Plus my MacBook is a lot faster, for instance, than a single cloud 00:47:31.680 --> 00:47:33.270 account on a shared server. 00:47:33.270 --> 00:47:37.770 So I'll go ahead and write some of this code here on my own machine instead. 00:47:37.770 --> 00:47:40.930 So let me go ahead and open a file called import.py. 00:47:40.930 --> 00:47:44.700 And I'm going to go ahead and import the CSV module, as always, 00:47:44.700 --> 00:47:49.490 and I'm going to go ahead and open this file, which is called title.basics.tsv 00:47:49.490 --> 00:47:53.425 in read-only mode, and I'm going to call this variable titles. 00:47:53.425 --> 00:47:54.550 What am I next going to do?
00:47:54.550 --> 00:47:58.230 Let me give myself a reader using csv.DictReader, 00:47:58.230 --> 00:48:01.290 as before, reading in those titles. 00:48:01.290 --> 00:48:04.800 But csv.DictReader feels like the wrong reader. 00:48:04.800 --> 00:48:07.040 Why? 00:48:07.040 --> 00:48:09.360 Just to be clear. 00:48:09.360 --> 00:48:11.445 Yeah, I mean, maybe I want TSV. 00:48:11.445 --> 00:48:14.290 But it turns out a TSV module doesn't exist. 00:48:14.290 --> 00:48:17.310 So even though I might be inclined, for instance, to change this 00:48:17.310 --> 00:48:21.570 to a T and this to a T, the CSV module does enough for me, 00:48:21.570 --> 00:48:24.540 but I need to tell it that I want to use a different delimiter. 00:48:24.540 --> 00:48:26.723 Instead of the default, which looks like this, 00:48:26.723 --> 00:48:28.890 I can actually override that and say, you know what? 00:48:28.890 --> 00:48:29.700 Use a tab. 00:48:29.700 --> 00:48:33.690 And just like backslash n in C and in Python, it's a new line. 00:48:33.690 --> 00:48:37.830 Backslash t in both languages is a tab character. 00:48:37.830 --> 00:48:41.100 All right, so once I've done this, let me go ahead and open up, now-- 00:48:41.100 --> 00:48:43.920 let me go ahead and open up shows0.csv. 00:48:43.920 --> 00:48:47.760 My goal in life now is to make this file a lot more manageable for myself. 00:48:47.760 --> 00:48:53.070 I want to take a 500-megabyte file and extract only the TV shows therein. 00:48:53.070 --> 00:48:56.190 Moreover, how about only the TV shows from 1970 onward? 00:48:56.190 --> 00:48:59.080 We won't go even further back than that. 00:48:59.080 --> 00:49:04.320 So let me go ahead and open up, in write mode, a file called shows0.csv. 00:49:04.320 --> 00:49:06.278 And I'm just going to call that variable shows. 00:49:06.278 --> 00:49:08.445 Then I'm going to go ahead and give myself a writer. 00:49:08.445 --> 00:49:10.290 And you might recall using this in the past.
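The delimiter override just described can be sketched like so, with a two-line in-memory stand-in for the real half-gigabyte file:

```python
import csv
import io

# Two tab-separated lines standing in for title.basics.tsv.
tsv = "tconst\ttitleType\tprimaryTitle\ntt4786824\ttvSeries\tThe Crown\n"

# DictReader handles tabs fine once told the delimiter is "\t", not ",".
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
rows = list(reader)
```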
00:49:10.290 --> 00:49:13.082 This is simply going to give me a variable called writer, via which 00:49:13.082 --> 00:49:14.400 I can write to a new file. 00:49:14.400 --> 00:49:18.750 Because again, the goal is to read this file and write to this file 00:49:18.750 --> 00:49:21.090 a subset of the data therein. 00:49:21.090 --> 00:49:23.160 So let me go ahead and write one row first. 00:49:23.160 --> 00:49:28.180 Write a row, passing in a list of values, specifically tconst, 00:49:28.180 --> 00:49:29.320 which is the title-- 00:49:29.320 --> 00:49:31.860 which is the ID field; primaryTitle, which 00:49:31.860 --> 00:49:35.910 is the title field; startYear, which is the year field; and genres, 00:49:35.910 --> 00:49:38.620 which was one more field that I mentioned earlier. 00:49:38.620 --> 00:49:43.248 So my goal is to export only those four columns that I care about for now. 00:49:43.248 --> 00:49:44.790 So I'm going to go ahead and do this. 00:49:44.790 --> 00:49:49.590 For each row in my reader, I'm going to go ahead and say if that 00:49:49.590 --> 00:49:56.340 row's titleType == tvSeries, which, recall, 00:49:56.340 --> 00:50:00.720 The Crown was an example of, then I'm going to go ahead and write 00:50:00.720 --> 00:50:01.710 to the writer-- 00:50:01.710 --> 00:50:05.940 whoops-- writer.writerow, a list containing what? 00:50:05.940 --> 00:50:13.230 The row's tconst value, the row's primary title value, the row-- 00:50:13.230 --> 00:50:21.214 whoops-- the row's start year, and lastly, the row's-- 00:50:21.214 --> 00:50:24.310 [CHUCKLES] the row's genres. 00:50:24.310 --> 00:50:26.560 So what am I doing here, just to be clear? 00:50:26.560 --> 00:50:30.760 So what I have done is I've first written out-- once and only 00:50:30.760 --> 00:50:33.070 once-- literally these values, because I want 00:50:33.070 --> 00:50:35.200 headers in the first line of my file.
00:50:35.200 --> 00:50:37.750 After that, for each row in the reader, which is currently 00:50:37.750 --> 00:50:40.900 iterating over this file in read-only mode, 00:50:40.900 --> 00:50:44.200 I want to print out the current row's tconst, current row's primary title, 00:50:44.200 --> 00:50:47.170 the current row's start year, and the current row's genres. 00:50:47.170 --> 00:50:52.450 But notice, I'm ignoring movies and shorts and documentaries and bunches 00:50:52.450 --> 00:50:54.080 of other values as well. 00:50:54.080 --> 00:50:54.830 And you know what? 00:50:54.830 --> 00:50:57.310 Just for good measure, let's shrink this a little bit. 00:50:57.310 --> 00:51:02.320 And row, how about, isAdult == "0", for today's purposes? 00:51:02.320 --> 00:51:03.970 So that'll filter the list further. 00:51:03.970 --> 00:51:06.160 If you'd like to flip that later, that's fine. 00:51:06.160 --> 00:51:11.920 So let me go ahead and do Python 3, for version 3 on my Mac-- actually, 00:51:11.920 --> 00:51:17.632 no, let's do Python of import.py, all right? 00:51:17.632 --> 00:51:19.090 I'm going to cross my fingers here. 00:51:19.090 --> 00:51:22.300 Hopefully the file is working, working, working. 00:51:22.300 --> 00:51:24.010 But it's taking a decent amount of time. 00:51:24.010 --> 00:51:29.290 Like, this is how much time it takes, apparently, to process over 00:51:29.290 --> 00:51:31.840 millions of rows of data. 00:51:31.840 --> 00:51:33.700 Still running. 00:51:33.700 --> 00:51:34.960 Still running. 00:51:34.960 --> 00:51:37.465 But the goal, again, is to shrink the amount 00:51:37.465 --> 00:51:40.090 of data I have to ultimately care about so that we can actually 00:51:40.090 --> 00:51:41.980 search it much more effectively. 00:51:41.980 --> 00:51:43.510 So, OK, it actually finished. 00:51:43.510 --> 00:51:46.430 So let me go ahead and open up shows0.csv. 00:51:46.430 --> 00:51:49.840 Notice now, in my text editor, I've got a lot less data.
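Put together, the import script being typed here amounts to something like the following sketch -- with tiny in-memory stand-ins for title.basics.tsv and shows0.csv rather than the actual files:

```python
import csv
import io

# In-memory stand-ins for title.basics.tsv (input) and shows0.csv (output).
tsv = ("tconst\ttitleType\tprimaryTitle\tisAdult\tstartYear\tgenres\n"
       "tt0000001\tshort\tCarmencita\t0\t1894\tDocumentary,Short\n"
       "tt4786824\ttvSeries\tThe Crown\t0\t2016\tBiography,Drama,History\n")
out = io.StringIO()

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
writer = csv.writer(out)

# Header row first, then only the non-adult TV series.
writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])
for row in reader:
    if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
        writer.writerow([row["tconst"], row["primaryTitle"],
                         row["startYear"], row["genres"]])

lines = out.getvalue().splitlines()
```

Note that csv.writer automatically quotes the genres cell, since it contains commas of its own.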
00:51:49.840 --> 00:51:51.820 I've thrown away everything I don't care about, 00:51:51.820 --> 00:51:54.970 but I've been left with tconst, primary title, start year, and genres, 00:51:54.970 --> 00:51:59.052 and everything herein is now consistent with that filtration. 00:51:59.052 --> 00:52:00.760 But I haven't filtered everything I said. 00:52:00.760 --> 00:52:03.380 What did I say I wanted to get rid of earlier? 00:52:06.590 --> 00:52:08.090 Yeah, the shows before 1970. 00:52:08.090 --> 00:52:11.550 And clearly some of these are coming from the 1940s and so forth. 00:52:11.550 --> 00:52:14.480 So let's go ahead and get rid of those, but see how. 00:52:14.480 --> 00:52:18.597 This is the CSV file that we just opened, but in Google Spreadsheet form. 00:52:18.597 --> 00:52:20.930 So I literally just imported it into Google Spreadsheets 00:52:20.930 --> 00:52:21.950 so that we could see it. 00:52:21.950 --> 00:52:25.640 Literally the same data as before, and there are those 1940s movies. 00:52:25.640 --> 00:52:28.490 But there's something curious that I wanted to be mindful of. 00:52:28.490 --> 00:52:34.290 If I scroll down in the start years and keep going and going and going, 00:52:34.290 --> 00:52:35.400 huh, those seem OK. 00:52:35.400 --> 00:52:36.650 Those are increasing in order. 00:52:36.650 --> 00:52:37.560 But let me try this. 00:52:37.560 --> 00:52:40.880 Let me just poke around my data, sorting in reverse order. 00:52:40.880 --> 00:52:44.480 It's going to take a while, because even this is a decent number of TV shows. 00:52:44.480 --> 00:52:47.000 Notice this weirdness. 00:52:47.000 --> 00:52:50.630 At the top of start year now, once I've reverse sorted them, 00:52:50.630 --> 00:52:53.450 there's a whole bunch of backslash capital N's. 00:52:53.450 --> 00:52:57.050 Now, this has nothing to do with C, and nothing to do with Python. 00:52:57.050 --> 00:52:59.900 It has everything to do with the documentation.
00:52:59.900 --> 00:53:05.930 If you read IMDB's data, as I only did carefully eventually, you'll see this. 00:53:05.930 --> 00:53:10.370 A backslash and capital N is used to denote that a particular field is 00:53:10.370 --> 00:53:12.592 missing or null for that title name. 00:53:12.592 --> 00:53:14.300 Now, this is important, because if I want 00:53:14.300 --> 00:53:17.090 to filter down to shows from 1970 onward, 00:53:17.090 --> 00:53:18.670 I need to be resilient against that. 00:53:18.670 --> 00:53:20.210 So let me go ahead and do this. 00:53:20.210 --> 00:53:27.590 So if the current row's startYear does not equal backslash capital N, 00:53:27.590 --> 00:53:31.850 then I'm going to go ahead and check that it's a TV series 00:53:31.850 --> 00:53:34.160 and that it's not an adult show. 00:53:34.160 --> 00:53:35.150 So that would help. 00:53:35.150 --> 00:53:38.110 But furthermore, how can I check this here? 00:53:38.110 --> 00:53:39.530 Let me do year. 00:53:39.530 --> 00:53:43.500 And how can I convert row startYear to an integer? 00:53:43.500 --> 00:53:45.500 Well, everything in a spreadsheet, by definition 00:53:45.500 --> 00:53:50.510 of it having been in a spreadsheet, or a TSV file or a CSV file, is text. 00:53:50.510 --> 00:53:52.940 But start year looks like years, so what Python 00:53:52.940 --> 00:53:55.430 function can I use to actually convert text that 00:53:55.430 --> 00:53:58.840 resembles a number to an actual number? 00:53:58.840 --> 00:53:59.340 Yeah. 00:53:59.340 --> 00:54:01.240 So we can do something like this. 00:54:01.240 --> 00:54:03.840 So I can convert year to an int, and now I 00:54:03.840 --> 00:54:08.530 can say, if year greater than or equal to 1970, 00:54:08.530 --> 00:54:12.348 now I'm going to go ahead and do those lines instead.
00:54:12.348 --> 00:54:14.640 Now, there's an opportunity, surely, for better design, 00:54:14.640 --> 00:54:16.530 because once your code starts doing this, 00:54:16.530 --> 00:54:18.360 you've done something suboptimally, right? 00:54:18.360 --> 00:54:21.112 This is not going to end well if all of my code starts wrapping. 00:54:21.112 --> 00:54:23.820 So I could clean up the logic in a little bit, but let's go ahead 00:54:23.820 --> 00:54:25.750 and run this just one more time. 00:54:25.750 --> 00:54:30.450 This time, changing this to shows1.csv so we can see slightly different 00:54:30.450 --> 00:54:30.960 outputs. 00:54:30.960 --> 00:54:33.546 Let me go ahead and run Python import.py. 00:54:33.546 --> 00:54:37.740 Huh, syntax error, Unicode error, codec-- 00:54:37.740 --> 00:54:39.000 that's a weird one. 00:54:39.000 --> 00:54:43.810 But this is because backslashes, recall, in C and Python, have special meaning. 00:54:43.810 --> 00:54:47.070 So when you do something like this, backslash capital N, 00:54:47.070 --> 00:54:49.860 even though it's not a lowercase n, backslash n, recall, 00:54:49.860 --> 00:54:52.110 is the escape character in C and Python. 00:54:52.110 --> 00:54:55.230 So this is like telling Python, this is a special character. 00:54:55.230 --> 00:54:56.092 But it's not. 00:54:56.092 --> 00:54:58.050 And we've never really had occasion to do this, 00:54:58.050 --> 00:55:00.240 but how would do you think we could output 00:55:00.240 --> 00:55:05.620 a literal backslash before a capital N? 00:55:05.620 --> 00:55:06.120 Yeah. 00:55:06.120 --> 00:55:08.610 So it turns out the solution to this problem, usually, 00:55:08.610 --> 00:55:11.730 no matter the language, is that if you want a literal character, not 00:55:11.730 --> 00:55:15.070 an escape character, you literally put another one of it before. 
00:55:15.070 --> 00:55:17.260 So even though this looks a little funky now, 00:55:17.260 --> 00:55:22.420 this backslash backslash capital N literally will mean backslash N. 00:55:22.420 --> 00:55:22.920 All right. 00:55:22.920 --> 00:55:26.360 So now, let me go ahead and run this on import.py. 00:55:26.360 --> 00:55:29.790 This time, I'm hopefully going to actually generate 00:55:29.790 --> 00:55:36.120 a new file called shows1.csv that has even less data that 00:55:36.120 --> 00:55:42.400 actually is going to contain my shows, but only a subset of them. 00:55:42.400 --> 00:55:45.000 And let's go ahead and pull the cake out of the oven this way. 00:55:45.000 --> 00:55:47.050 This is what I get now this time. 00:55:47.050 --> 00:55:50.058 So if I actually load the CSV, shows1.csv, into Google Spreadsheet, 00:55:50.058 --> 00:55:53.100 just because it's easier to look at than the black-and-white window, 00:55:53.100 --> 00:55:57.420 now you can see that I apparently am only getting shows 1970 and onward. 00:55:57.420 --> 00:55:59.940 And indeed, if I sorted them, I would see no backslash N's. 00:55:59.940 --> 00:56:03.960 I would have thrown away everything that doesn't meet those criteria. 00:56:03.960 --> 00:56:07.110 Well, let me go ahead and do one last thing here. 00:56:07.110 --> 00:56:10.020 I'm going to go ahead and make one more change. 00:56:10.020 --> 00:56:12.330 And first, let's improve the design here. 00:56:12.330 --> 00:56:16.710 This indentation is the result of my asking questions again and again 00:56:16.710 --> 00:56:20.410 and again and indenting if and only if those things are true. 00:56:20.410 --> 00:56:22.950 But notice, you can start to flip your logic here, right? 00:56:22.950 --> 00:56:26.790 Instead of saying if the start year does not equal backslash N, 00:56:26.790 --> 00:56:30.210 what if I just do this and say continue?
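The error and its fix come down to how Python string literals treat backslashes: a backslash followed by capital N begins a named Unicode escape (as in "\N{BULLET}"), so an unescaped one in source code is rejected at compile time, while doubling it yields the literal two characters IMDB writes for a missing field. A small demonstration:

```python
# IMDB's null marker is the two characters: backslash, capital N.
null_marker = "\\N"

# An unescaped backslash-N in source is rejected when Python compiles it,
# which is the Unicode/codec error seen above. (chr(92) is a backslash,
# used here so this snippet itself stays valid Python.)
try:
    compile('x = "' + chr(92) + 'N"', "<string>", "exec")
    raised = False
except SyntaxError:
    raised = True
```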
00:56:30.210 --> 00:56:32.010 I can then unindent this-- 00:56:32.010 --> 00:56:35.280 because if you've not used it before in Python and in C, 00:56:35.280 --> 00:56:39.880 if you say continue inside of a loop, it's not going to continue down there, 00:56:39.880 --> 00:56:42.690 it's going to [WHOOSH] continue to the start of the loop again. 00:56:42.690 --> 00:56:45.840 So via this logic, we can actually keep wrapping around again and again. 00:56:45.840 --> 00:56:50.460 And here, too, we could say, if year less than 1970, I can go ahead 00:56:50.460 --> 00:56:54.340 and say continue, which would then allow me to unindent this as well. 00:56:54.340 --> 00:56:57.510 So there are solutions, design-wise, to actually avoiding 00:56:57.510 --> 00:56:59.230 that infinite indentation. 00:56:59.230 --> 00:56:59.730 All right. 00:56:59.730 --> 00:57:02.130 Let's go ahead and do one last version. 00:57:02.130 --> 00:57:04.710 Then I'm going to go ahead and pull one out of the oven-- 00:57:04.710 --> 00:57:05.640 premade. 00:57:05.640 --> 00:57:08.680 So the last thing I didn't load before was this. 00:57:08.680 --> 00:57:12.780 Suppose that I want to load into the CSV file all of the genres 00:57:12.780 --> 00:57:14.820 associated with the show. 00:57:14.820 --> 00:57:17.940 It looks like all of these shows have one or more genres, just like you 00:57:17.940 --> 00:57:20.370 were asked for your favorite shows. 00:57:20.370 --> 00:57:24.780 And so now, we have a CSV file with tconst, primary titles, start year, 00:57:24.780 --> 00:57:29.280 and genres, where genres itself is a comma-separated list. 00:57:29.280 --> 00:57:31.290 But there's a fundamental problem here. 00:57:31.290 --> 00:57:35.040 Even though I have all of this data here, the best I can do 00:57:35.040 --> 00:57:36.460 is a program like this. 00:57:36.460 --> 00:57:42.190 Let me go ahead and search for, for instance, the following. 00:57:42.190 --> 00:57:44.765 Let me go ahead and grab a file real fast.
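The flipped logic described above -- guard clauses with continue instead of ever-deeper indentation -- can be sketched against a few made-up rows:

```python
# Guard clauses with continue keep the loop body flat instead of
# nesting if inside if inside if.
rows = [
    {"titleType": "movie", "startYear": "1980"},
    {"titleType": "tvSeries", "startYear": "\\N"},
    {"titleType": "tvSeries", "startYear": "1940"},
    {"titleType": "tvSeries", "startYear": "2016"},
]

kept = []
for row in rows:
    if row["titleType"] != "tvSeries":
        continue  # back to the top of the loop, skipping this row
    if row["startYear"] == "\\N":
        continue  # missing year, per IMDB's null marker
    if int(row["startYear"]) < 1970:
        continue  # too old for today's purposes
    kept.append(row["startYear"])
```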
00:57:48.410 --> 00:57:53.300 Let me go ahead and grab a copy of shows2.csv 00:57:53.300 --> 00:57:55.970 and write one final program here. 00:57:55.970 --> 00:57:59.712 If I want to go ahead now and search this very large, still, data set, 00:57:59.712 --> 00:58:02.780 in shows2.csv, well, let me go ahead and do this. 00:58:02.780 --> 00:58:06.528 Import CSV. 00:58:06.528 --> 00:58:08.570 Let me go ahead now and ask the user for a title. 00:58:08.570 --> 00:58:11.120 I could use CS50's get string, but there's really no need for that 00:58:11.120 --> 00:58:14.162 anymore now that we have the title function-- the input function, recall. 00:58:14.162 --> 00:58:15.310 So I'll just use that. 00:58:15.310 --> 00:58:20.900 And then I'm going to go ahead and open up shows2.csv in read-only mode. 00:58:20.900 --> 00:58:23.270 And I'm going to call that my file. 00:58:23.270 --> 00:58:27.470 Then I'm going to go ahead and give myself a reader from csv.DictReader, 00:58:27.470 --> 00:58:29.330 passing in that file. 00:58:29.330 --> 00:58:33.422 And now I'm going to go ahead and, for row in reader, do the following. 00:58:33.422 --> 00:58:35.630 The goal now is to write a Python program that allows 00:58:35.630 --> 00:58:37.620 me to search only those TV shows. 00:58:37.620 --> 00:58:44.420 So I could say something like this, if title == row "primaryTitle", 00:58:44.420 --> 00:58:50.060 then I can go ahead, for instance, and print out row "primaryTitle," 00:58:50.060 --> 00:58:53.040 started in row "startYear." 00:58:53.040 --> 00:58:55.130 So what is the goal of this program? 00:58:55.130 --> 00:58:56.887 It's going to ask the user for input. 00:58:56.887 --> 00:58:58.970 It's going to open this big CSV that I've created. 00:58:58.970 --> 00:59:01.520 But that's still smaller than the 50-megabyte version. 
00:59:01.520 --> 00:59:04.670 It's going to iterate over every row in that file via DictReader, 00:59:04.670 --> 00:59:09.800 checking if the title the human typed in equals the current row's primary title. 00:59:09.800 --> 00:59:13.190 And if so, it's going to print the title and year of that show. 00:59:13.190 --> 00:59:17.750 So if I go ahead and run Python of search.py, 00:59:17.750 --> 00:59:22.760 typing in something like The Crown, Enter, voila, I get that answer. 00:59:22.760 --> 00:59:26.710 If I go ahead and do The Office, Enter, there's a bunch of Offices. 00:59:26.710 --> 00:59:29.210 And in fact, if you haven't seen more than the American one, 00:59:29.210 --> 00:59:31.620 there's the UK one, and apparently several others. 00:59:31.620 --> 00:59:33.410 And we can actually corroborate this now. 00:59:33.410 --> 00:59:36.590 If you go to IMDB, where all of this data originally came from, 00:59:36.590 --> 00:59:40.730 and type The Office, there are all of The Offices that actually line up 00:59:40.730 --> 00:59:42.470 with our very own data set. 00:59:42.470 --> 00:59:44.690 It's going to be so damn tedious every time 00:59:44.690 --> 00:59:46.610 you want to search for data or update data 00:59:46.610 --> 00:59:49.880 or insert new data to write a Python program to do it, 00:59:49.880 --> 00:59:51.822 so we need a few more features of SQL. 00:59:51.822 --> 00:59:53.780 But I think first, we need some Halloween candy 00:59:53.780 --> 00:59:55.790 for our five-minute break out back. 00:59:55.790 --> 00:59:57.560 All right, we are back. 00:59:57.560 --> 01:00:00.680 So we have a whole bunch more data now, because we've downloaded 01:00:00.680 --> 01:00:03.470 that really big TSV file from IMDB.
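search.py, as described, is a linear scan: every row is visited top to bottom. A self-contained sketch with a two-row stand-in for shows2.csv and the user's input hard-coded:

```python
import csv
import io

# Miniature shows2.csv; the quoted genres cell keeps its commas intact.
data = ("tconst,primaryTitle,startYear,genres\n"
        "tt0386676,The Office,2005,Comedy\n"
        'tt0290978,The Office,2001,"Comedy,Drama"\n')

title = "The Office"  # stands in for input("Title: ")

# Linear search: compare the title against every row, one by one.
matches = []
for row in csv.DictReader(io.StringIO(data)):
    if title == row["primaryTitle"]:
        matches.append((row["primaryTitle"], row["startYear"]))
```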
01:00:03.470 --> 01:00:07.700 I've simplified it into some CSV files, but that really gets me half of the way 01:00:07.700 --> 01:00:10.910 there, because now if I want to search the data, as with search.py, 01:00:10.910 --> 01:00:14.540 I still have to look over thousands of rows of TV shows, 01:00:14.540 --> 01:00:17.990 and it's only going to be linear search by nature of just how files are read, 01:00:17.990 --> 01:00:19.700 top to bottom, left to right. 01:00:19.700 --> 01:00:25.130 But it turns out in Python that you can actually write SQL code itself. 01:00:25.130 --> 01:00:28.640 And CS50 has a Python library that has not only get string and get int 01:00:28.640 --> 01:00:33.200 and so forth, but it also has a SQL function built in that allows you 01:00:33.200 --> 01:00:38.000 to connect to, so to speak, a file that ends in something like .db, 01:00:38.000 --> 01:00:41.960 which is to say, in a moment, we'll start to write some Python code now 01:00:41.960 --> 01:00:47.750 toward an end of loading a really large data set like IMDB's into a proper SQL 01:00:47.750 --> 01:00:51.500 database, thereby allowing us hereafter to use all of the power 01:00:51.500 --> 01:00:54.650 and expressiveness of SQL, and more examples of that in just a bit like 01:00:54.650 --> 01:00:58.160 SELECT, UPDATE, DELETE, and INSERT, without having to write a whole bunch 01:00:58.160 --> 01:00:59.120 of Python code. 01:00:59.120 --> 01:01:03.860 So to be clear, instead of using Python to search and manipulate our data, 01:01:03.860 --> 01:01:08.120 we're going to write a script, a program in Python, whose sole purpose in life 01:01:08.120 --> 01:01:10.283 is to get data from one format into another. 
01:01:10.283 --> 01:01:12.950 And you can imagine this being generally useful, whether it's 01:01:12.950 --> 01:01:15.260 a Google Spreadsheet you've downloaded or a large data 01:01:15.260 --> 01:01:17.750 set you found on the internet that you want to use for a final project 01:01:17.750 --> 01:01:18.830 or for some other class. 01:01:18.830 --> 01:01:22.640 Python can be a really powerful way of taking one data source as input 01:01:22.640 --> 01:01:26.330 and producing as its output now, as of today, SQL instead. 01:01:26.330 --> 01:01:31.670 So let's go ahead and iterate one final time over our title.basics.tsv. 01:01:31.670 --> 01:01:34.220 But this time, not just save it into a CSV file. 01:01:34.220 --> 01:01:39.740 Let's put it into a proper SQL database on my own Mac or your PC. 01:01:39.740 --> 01:01:41.510 So let me go ahead and do this. 01:01:41.510 --> 01:01:47.150 First, let me go ahead and say, just like on the slide, db = cs50.SQL, 01:01:47.150 --> 01:01:53.567 and then quote unquote, "sqlite:///--" so the third slash is not a typo. 01:01:53.567 --> 01:01:54.650 It should indeed be there. 01:01:54.650 --> 01:01:57.260 And I'm going to say shows3.db, just because this 01:01:57.260 --> 01:01:59.690 is version 3 now of my import script. 01:01:59.690 --> 01:02:03.680 I'm going to go ahead, just as last week, and now import CSV-- 01:02:03.680 --> 01:02:06.240 CS50's library as well. 01:02:06.240 --> 01:02:11.330 But for this to work, the file shows3.db needs to exist first. 01:02:11.330 --> 01:02:15.260 And there's a couple of ways on a Mac or a PC or a Linux computer, typically, 01:02:15.260 --> 01:02:18.440 to create an empty file that's ready to receive data. 01:02:18.440 --> 01:02:20.990 You can literally use the command touch, which 01:02:20.990 --> 01:02:25.160 just will create an empty file by whatever name you type at the prompt. 01:02:25.160 --> 01:02:26.900 Or we can do this programmatically. 
01:02:26.900 --> 01:02:28.310 And I'm going to do it programmatically, because I 01:02:28.310 --> 01:02:30.352 bet I'm going to screw up one or more times here, 01:02:30.352 --> 01:02:34.220 and it's going to be useful to let my Python program create and recreate 01:02:34.220 --> 01:02:37.250 the database again and again and again until I get it right. 01:02:37.250 --> 01:02:42.920 So let me go ahead and open a file called shows3.db in write mode. 01:02:42.920 --> 01:02:45.860 And recall from Python and C, using fopen, 01:02:45.860 --> 01:02:50.180 anytime you open a file in write mode, it will overwrite any file that exists 01:02:50.180 --> 01:02:52.340 or create any file that doesn't. 01:02:52.340 --> 01:02:53.780 That's all I needed to do. 01:02:53.780 --> 01:02:57.830 So in Python, too, recall that we were able to use this dot notation. 01:02:57.830 --> 01:02:59.780 And it turns out here, when you open a file, 01:02:59.780 --> 01:03:02.900 if you want to immediately close it, because your only goal was to create 01:03:02.900 --> 01:03:06.930 it, you can just do .close on the very thing you just opened. 01:03:06.930 --> 01:03:09.570 That is equivalent, just to be clear, to doing 01:03:09.570 --> 01:03:14.790 something a little more pedantic like this, file = open, and then file.close. 01:03:14.790 --> 01:03:18.070 But we can collapse this into one slick one-liner, so to speak, 01:03:18.070 --> 01:03:21.840 by just doing instead what I did a moment ago. 01:03:21.840 --> 01:03:27.330 All that does is create an empty shows3.db file. 01:03:27.330 --> 01:03:32.385 Now, open that file for SQLite. 01:03:32.385 --> 01:03:34.260 And again, SQLite is the light version of SQL 01:03:34.260 --> 01:03:36.060 that anyone can use on their own Mac or PC. 01:03:36.060 --> 01:03:39.220 You don't need a special server to get up and running with it. 01:03:39.220 --> 01:03:42.990 So now let me go ahead and open up title.basics.tsv. 
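The file-creation one-liner just described can be sketched as follows. Note that the last two lines swap in Python's built-in sqlite3 module for CS50's library, since db = cs50.SQL("sqlite:///shows3.db") plays the analogous role in the lecture itself:

```python
import sqlite3

# Opening a file in write mode overwrites any file that exists or creates
# one that doesn't, just like fopen(..., "w") in C; calling .close() on the
# very thing open() returns collapses create-then-close into one slick line
open("shows3.db", "w").close()

# The more pedantic, two-step equivalent:
file = open("shows3.db", "w")
file.close()

# With the empty file in place, a program can now connect to it
# (the lecture uses CS50's library here; this is the stdlib equivalent)
db = sqlite3.connect("shows3.db")
db.close()
```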
01:03:42.990 --> 01:03:48.060 And then here, let me go ahead and create myself a DictReader so 01:03:48.060 --> 01:03:50.910 that I can iterate over the lines in that TSV file. 01:03:50.910 --> 01:03:55.180 And now, let me go ahead, and for row in reader, do the following. 01:03:55.180 --> 01:03:57.390 I first want to filter out stuff just as before. 01:03:57.390 --> 01:04:04.200 So I'm going to say if row bracket "titleType" == "tvSeries" 01:04:04.200 --> 01:04:10.830 and row "isAdult" == quote unquote "0", then I'm going to go ahead and check 01:04:10.830 --> 01:04:11.720 one other thing. 01:04:11.720 --> 01:04:15.465 I'm going to go ahead and give myself a start year variable, similar to before, 01:04:15.465 --> 01:04:17.130 although I called it year earlier. 01:04:17.130 --> 01:04:19.980 Then let me go ahead and do row "startYear", 01:04:19.980 --> 01:04:22.050 just so I can cast that to an int. 01:04:22.050 --> 01:04:27.240 But I only want to do that if row "startYear" does not 01:04:27.240 --> 01:04:31.620 equal that special backslash N that IMDB told me to watch out for. 01:04:31.620 --> 01:04:34.600 So I only want to do that if it's not that. 01:04:34.600 --> 01:04:38.670 And then if startYear is greater than or equal to 1970, 01:04:38.670 --> 01:04:41.080 let's go ahead and do the following. 01:04:41.080 --> 01:04:46.680 Let's go ahead and do genres, gets row "genres". 01:04:46.680 --> 01:04:50.073 Let's go ahead and get tconst, gets row "tconst," 01:04:50.073 --> 01:04:52.740 just so I can put these in some slightly shorter variable names, 01:04:52.740 --> 01:04:54.360 just to keep myself sane. 01:04:54.360 --> 01:04:58.770 primaryTitle is going to be from row "primaryTitle." 01:04:58.770 --> 01:05:00.720 and then let me go ahead and give myself-- 01:05:00.720 --> 01:05:04.150 we already have startYear, so those are the only other three fields I need. 
01:05:04.150 --> 01:05:08.460 So now, I want to go ahead and insert this row from my TSV 01:05:08.460 --> 01:05:10.002 into a SQLite database. 01:05:10.002 --> 01:05:12.960 And the operative word that we saw earlier but haven't used yet 01:05:12.960 --> 01:05:13.710 is INSERT. 01:05:13.710 --> 01:05:15.390 We did use SELECT. 01:05:15.390 --> 01:05:16.500 We did use UPDATE. 01:05:16.500 --> 01:05:17.460 We did use DELETE. 01:05:17.460 --> 01:05:18.850 We haven't used INSERT yet. 01:05:18.850 --> 01:05:20.350 So I'm going to do that in a moment. 01:05:20.350 --> 01:05:23.200 But first, I need my database to actually exist, 01:05:23.200 --> 01:05:25.330 so I need to create an actual table. 01:05:25.330 --> 01:05:28.860 So I'm going to go up here first and do this, db is a reference now, 01:05:28.860 --> 01:05:30.995 a variable representing my database, and I'm 01:05:30.995 --> 01:05:33.120 going to call the only function inside of it that's 01:05:33.120 --> 01:05:36.000 useful for our purposes, called execute. 01:05:36.000 --> 01:05:39.240 What I can now do is execute any SQL I want. 01:05:39.240 --> 01:05:41.280 So what do I want to load into this database? 01:05:41.280 --> 01:05:45.930 I think I want to load in the tconst, the primaryTitle, the startYear, 01:05:45.930 --> 01:05:51.690 and the genres, just like we had earlier from title.basics.tsv. 01:05:51.690 --> 01:05:55.140 I want to load rows that represent this kind of data, all right? 01:05:55.140 --> 01:05:56.620 So how am I going to do this? 01:05:56.620 --> 01:05:58.590 Well, let me go ahead and create a table. 01:05:58.590 --> 01:06:02.880 I'm going to call it shows, because that seems nice and conceptually consistent. 01:06:02.880 --> 01:06:06.510 I'm going to go ahead and create a list of columns now. 01:06:06.510 --> 01:06:08.400 tconst is going to be one column. 01:06:08.400 --> 01:06:10.290 primaryTitle is going to be another. 01:06:10.290 --> 01:06:12.000 startYear is going to be another. 
01:06:12.000 --> 01:06:14.610 And genres is going to be the last. 01:06:14.610 --> 01:06:20.310 I can literally, that is to say, write SQL inside of a string 01:06:20.310 --> 01:06:23.700 that I pass to a Python function called db.execute. 01:06:23.700 --> 01:06:27.300 And because db.execute-- or rather, because db 01:06:27.300 --> 01:06:34.380 was configured with shows3.db, when I execute this string in Python, 01:06:34.380 --> 01:06:38.100 it's going to get executed on that database file, shows3.db. 01:06:38.100 --> 01:06:40.573 So it's a nice way of bridging these two worlds. 01:06:40.573 --> 01:06:42.990 So I'm going to have to be a little more specific, though. 01:06:42.990 --> 01:06:45.707 Recall that SQL has a bunch of types. 01:06:45.707 --> 01:06:47.040 And I'm going to keep it simple. 01:06:47.040 --> 01:06:50.910 I'm going to go ahead and say that the type of that tconst value is text. 01:06:50.910 --> 01:06:53.220 The type of the primaryTitle is text. 01:06:53.220 --> 01:06:56.400 The type of startYear is going to be numeric, kind of a catch 01:06:56.400 --> 01:06:58.020 all for dates and date times. 01:06:58.020 --> 01:07:00.390 And then genres is going to be text as well. 01:07:00.390 --> 01:07:01.770 So the syntax is a little funky. 01:07:01.770 --> 01:07:04.860 You actually specify the name of the column and then the type, 01:07:04.860 --> 01:07:08.408 as opposed to the opposite, which we did in C. But that's the way SQL is. 01:07:08.408 --> 01:07:10.200 So I'm going to go ahead and save that now. 01:07:10.200 --> 01:07:14.180 And just to comment this, this is going to create 01:07:14.180 --> 01:07:21.590 a table called shows in database file called shows3.db, 01:07:21.590 --> 01:07:23.100 just to be super explicit. 01:07:23.100 --> 01:07:24.780 So what am I going to do down here? 01:07:24.780 --> 01:07:27.450 It looks like I have the ability with CS50's library 01:07:27.450 --> 01:07:29.320 to execute any SQL I want. 
01:07:29.320 --> 01:07:34.230 So let me go ahead and insert into shows the following values, 01:07:34.230 --> 01:07:39.570 a tconst, a primaryTitle, a startYear, and a genre-- 01:07:39.570 --> 01:07:40.920 and genres. 01:07:40.920 --> 01:07:41.940 What values? 01:07:41.940 --> 01:07:43.443 I want to insert these values. 01:07:43.443 --> 01:07:45.360 Now, I don't know in advance, so I'm literally 01:07:45.360 --> 01:07:47.380 going to put some question marks here. 01:07:47.380 --> 01:07:51.630 And it turns out in SQL, this is valid syntax for the library we're using. 01:07:51.630 --> 01:07:54.000 This is the INSERT query. 01:07:54.000 --> 01:07:57.750 INSERT INTO the table name a parenthesized list 01:07:57.750 --> 01:08:00.270 of the columns you want to insert data into. 01:08:00.270 --> 01:08:03.078 Then, a set of values in separate parentheses. 01:08:03.078 --> 01:08:05.370 And for now, I'm using question marks for placeholders, 01:08:05.370 --> 01:08:07.248 for reasons we'll come back to. 01:08:07.248 --> 01:08:09.540 But I'm going to go ahead and plug the following values 01:08:09.540 --> 01:08:18.479 into those placeholders, tconst, primaryTitle, startYear, and genres. 01:08:18.479 --> 01:08:20.819 And what the db.execute function is going 01:08:20.819 --> 01:08:24.399 to do for me automatically is it's going to look at this SQL query. 01:08:24.399 --> 01:08:27.630 Notice that, oh, it's got four question marks in it, or placeholders. 01:08:27.630 --> 01:08:32.850 Those, in SQL, are like the %s in C or like the curly braces 01:08:32.850 --> 01:08:34.319 in Python f-strings. 01:08:34.319 --> 01:08:38.130 So this says, give me 1, 2, 3, 4 placeholders and plug in, ultimately, 01:08:38.130 --> 01:08:41.220 the following four values, tconst, which is just a variable; 01:08:41.220 --> 01:08:44.040 primaryTitle, which is the same; startYear, which is the same; 01:08:44.040 --> 01:08:45.968 and genres, which is the same. 
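The filtering and INSERT logic just walked through might be sketched like this. This uses the standard library's sqlite3, whose execute method also accepts ? placeholders, in place of CS50's db.execute, and the import_shows wrapper is an addition here so the loop is self-contained; the column names, the \N check, and the 1970 cutoff come from the lecture:

```python
import sqlite3

def import_shows(rows, db):
    """Create the shows table and INSERT every qualifying TSV row into it."""
    db.execute("CREATE TABLE shows (tconst TEXT, primaryTitle TEXT, "
               "startYear NUMERIC, genres TEXT)")
    for row in rows:
        # Filter: TV series only, non-adult only
        if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
            # \N is IMDb's marker for a missing value, so skip those
            if row["startYear"] != "\\N":
                startYear = int(row["startYear"])
                if startYear >= 1970:
                    # The ? placeholders are to SQL what %s is to C's printf
                    # or {} is to a Python f-string
                    db.execute("INSERT INTO shows (tconst, primaryTitle, "
                               "startYear, genres) VALUES (?, ?, ?, ?)",
                               (row["tconst"], row["primaryTitle"],
                                startYear, row["genres"]))
    db.commit()
```

Feeding this function a csv.DictReader over title.basics.tsv, as in the lecture, would populate the database one INSERT at a time.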
01:08:45.968 --> 01:08:47.010 So what am I going to do? 01:08:47.010 --> 01:08:52.290 For every TV series in this file that's not an adult series 01:08:52.290 --> 01:08:57.270 and that started after 1970, insert it into my database. 01:08:57.270 --> 01:09:00.689 If I've made no typos, I'm going to go ahead and run this, cross my fingers, 01:09:00.689 --> 01:09:02.160 and enter. 01:09:02.160 --> 01:09:05.279 This one is going to take more time, because it turns out writing 01:09:05.279 --> 01:09:07.109 to a CSV file is actually pretty quick. 01:09:07.109 --> 01:09:10.080 You can just write row, write row, write row, write row, 01:09:10.080 --> 01:09:13.979 but inserting into a SQLite database is going to take more time. 01:09:13.979 --> 01:09:19.000 More time upfront, but it's going to be a lot faster to search thereafter. 01:09:19.000 --> 01:09:23.520 So let me go ahead and do the cake in the oven thing 01:09:23.520 --> 01:09:25.680 and go ahead and now open up a file I made 01:09:25.680 --> 01:09:29.250 in advance in today's src3 directory called shows3.db 01:09:29.250 --> 01:09:33.000 using SQLite3, that command line program we used earlier. 01:09:33.000 --> 01:09:38.310 Recall that I can say .schema to see the types of data in the database. 01:09:38.310 --> 01:09:40.470 And indeed, look what I've done in advance. 01:09:40.470 --> 01:09:47.250 I created a table called shows with exactly those columns in a-- 01:09:47.250 --> 01:09:51.840 with exactly these four columns, tconst, primaryTitle, startYear, and genres. 01:09:51.840 --> 01:09:55.950 But I did this in advance to save us time so that I can now do SELECT 01:09:55.950 --> 01:09:58.268 * FROM shows. 01:09:58.268 --> 01:09:59.560 And let me not get all of them. 01:09:59.560 --> 01:10:02.580 Let me do the first 10, semicolon. 01:10:02.580 --> 01:10:03.120 All right. 01:10:03.120 --> 01:10:06.610 So we see the first 10 shows from IMDB in whatever order 01:10:06.610 --> 01:10:08.010 IMDB distributes them. 
01:10:08.010 --> 01:10:10.590 You can see their ID numbers are incrementing. 01:10:10.590 --> 01:10:11.460 And All My Children. 01:10:11.460 --> 01:10:13.418 I remember growing up with that show years ago. 01:10:13.418 --> 01:10:15.410 And it seems that that's a drama-- 01:10:15.410 --> 01:10:18.540 it's not really mystery, but so be it, and a romance show there. 01:10:18.540 --> 01:10:23.370 But it's indeed 1970, as is every show thereafter, from 1970 onward. 01:10:23.370 --> 01:10:26.350 I can go ahead and search for more like this. 01:10:26.350 --> 01:10:28.590 Let me give myself the first 100 shows. 01:10:28.590 --> 01:10:30.540 All right, so this is a pretty large data set. 01:10:30.540 --> 01:10:32.370 And let me go ahead and count them all. 01:10:32.370 --> 01:10:34.745 Recall that you can use a COUNT function so that we don't 01:10:34.745 --> 01:10:36.330 have to print them all on my screen. 01:10:36.330 --> 01:10:42.090 It looks like there are 153,331 TV series in IMDB. 01:10:42.090 --> 01:10:45.720 Crazier than that, you want to know how many of them came out this year? 01:10:45.720 --> 01:10:50.430 WHERE startYear = 2019;. 01:10:50.430 --> 01:10:57.090 There were 6,099 new TV shows in the world this year according to IMDB. 01:10:57.090 --> 01:11:00.900 Just one of those is The Office that won our-- 01:11:00.900 --> 01:11:01.950 won our vote earlier. 01:11:01.950 --> 01:11:06.280 So SELECT *, star denoting wild card, in this case, everything. 01:11:06.280 --> 01:11:08.290 And unfortunately, SQL has two wild cards. 01:11:08.290 --> 01:11:12.600 Star means select all of the columns in the table. 01:11:12.600 --> 01:11:17.220 Percent means let any characters come before or after a quoted 01:11:17.220 --> 01:11:18.870 string in a WHERE clause. 01:11:18.870 --> 01:11:24.660 So let me go ahead and SELECT * FROM shows WHERE title = The Office, 01:11:24.660 --> 01:11:26.700 and we'll actually see-- whoops. 01:11:26.700 --> 01:11:28.800 primaryTitle, sorry. 
01:11:28.800 --> 01:11:31.530 Let me fix that. primaryTitle = The Office. 01:11:31.530 --> 01:11:33.150 There are all of those Offices. 01:11:33.150 --> 01:11:37.230 And indeed, 2005 is probably the one we know and love. 01:11:37.230 --> 01:11:44.370 If I go to a browser and go to IMDB slash title slash that ID, 01:11:44.370 --> 01:11:46.620 indeed, that's probably the one we're all thinking of, 01:11:46.620 --> 01:11:48.790 unless you voted for the UK version instead. 01:11:48.790 --> 01:11:52.290 So again, this is actually real, live data that we're now playing with. 01:11:52.290 --> 01:11:53.738 Well, what more can we do? 01:11:53.738 --> 01:11:55.530 Well, there's one thing that I don't really 01:11:55.530 --> 01:12:00.540 like about this, which is that when we select all of the columns-- 01:12:00.540 --> 01:12:02.130 and let's go ahead and do this. 01:12:02.130 --> 01:12:04.440 Let's select another hundred of them before. 01:12:04.440 --> 01:12:10.470 This feels a little messy that we have all of these nice, clean columns except 01:12:10.470 --> 01:12:12.270 for when we get to genres. 01:12:12.270 --> 01:12:14.902 Then we just have this arbitrary comma-separated list. 01:12:14.902 --> 01:12:16.860 Suppose, for instance, I want to search for all 01:12:16.860 --> 01:12:19.820 of the comedies that came out in 2019. 01:12:19.820 --> 01:12:27.870 I could say SELECT * FROM shows where genres = "Comedy" AND-- 01:12:27.870 --> 01:12:29.880 turns out you can use conjunctions like this-- 01:12:29.880 --> 01:12:32.970 startYear = 2019. 01:12:32.970 --> 01:12:34.290 So that gives me a whole bunch. 01:12:34.290 --> 01:12:35.490 Let's count them. 01:12:35.490 --> 01:12:38.490 So COUNT this here, Enter. 01:12:38.490 --> 01:12:42.390 OK, a thousand of those 6,000 shows are comedies. 01:12:42.390 --> 01:12:44.850 But I think that's an underestimate. 01:12:44.850 --> 01:12:48.810 Why is this query buggy at the moment? 01:12:48.810 --> 01:12:49.458 Yeah? 
01:12:49.458 --> 01:12:50.587 AUDIENCE: [INAUDIBLE] 01:12:50.587 --> 01:12:52.920 DAVID MALAN: Yeah, some of them had more than one genre, 01:12:52.920 --> 01:12:56.070 so comedy is somewhere in that comma-separated list. 01:12:56.070 --> 01:13:00.450 And so what I should probably do instead is not say genre = "Comedy," 01:13:00.450 --> 01:13:05.490 but maybe genres LIKE "Comedy," and allow something maybe to appear before, 01:13:05.490 --> 01:13:07.140 something maybe to appear after. 01:13:07.140 --> 01:13:11.040 And that's going to give me 1,593 comedies that 01:13:11.040 --> 01:13:13.290 came out this year in 2019. 01:13:13.290 --> 01:13:14.820 So that seems a little better. 01:13:14.820 --> 01:13:17.220 But this is not very robust, right? 01:13:17.220 --> 01:13:19.890 Once you start resorting to techniques like this, 01:13:19.890 --> 01:13:22.920 it should start, as a programmer, to rub you the wrong way. 01:13:22.920 --> 01:13:24.168 It's kind of a hack, right? 01:13:24.168 --> 01:13:27.210 Like, you're searching for comedy, but there could be something before it 01:13:27.210 --> 01:13:29.640 or something after it, and odds are, there 01:13:29.640 --> 01:13:32.400 is no other word I can think of in the world of genres 01:13:32.400 --> 01:13:36.142 that starts with or ends with comedy, so we're probably OK. 01:13:36.142 --> 01:13:38.850 But this is kind of hack-ish, that you're just kind of searching. 01:13:38.850 --> 01:13:44.790 It would be nice if we could just search for a specific column called genre. 01:13:44.790 --> 01:13:47.170 So how can we go about doing that? 01:13:47.170 --> 01:13:50.670 Well, let me go ahead and do the following instead. 01:13:50.670 --> 01:13:54.570 Let me go ahead and open up a final version of my import script, 01:13:54.570 --> 01:13:58.020 this one that does two things up top. 
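The bug just identified, and its LIKE-based workaround, can be seen in miniature with a toy table (sketched here with the standard library's sqlite3; the show titles are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (primaryTitle TEXT, startYear NUMERIC, genres TEXT)")
db.executemany("INSERT INTO shows VALUES (?, ?, ?)", [
    ("Show A", 2019, "Comedy"),
    ("Show B", 2019, "Comedy,Drama"),  # equality misses this row
    ("Show C", 2019, "Drama"),
])

# Exact equality matches only rows whose entire genres string is "Comedy"
exact = db.execute("SELECT COUNT(*) FROM shows "
                   "WHERE genres = 'Comedy' AND startYear = 2019").fetchone()[0]

# LIKE with % wildcards matches "Comedy" anywhere in the comma-separated
# list -- the hack-ish fix, which also catches the multi-genre rows
like = db.execute("SELECT COUNT(*) FROM shows "
                  "WHERE genres LIKE '%Comedy%' AND startYear = 2019").fetchone()[0]
```

Here exact counts 1 row while like counts 2, mirroring the roughly 1,000-versus-1,593 discrepancy in the lecture's data.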
01:13:58.020 --> 01:14:02.780 At the top, I'm going to create two tables, one called shows, which has-- 01:14:02.780 --> 01:14:04.950 I'm going to clean up the column names, too. 01:14:04.950 --> 01:14:06.930 IMDB is a little nonconventional. 01:14:06.930 --> 01:14:10.290 What most people would do when describing a unique identifier, 01:14:10.290 --> 01:14:12.210 they're going to call it ID, not tconst. 01:14:12.210 --> 01:14:13.895 So we're going to rename it to ID. 01:14:13.895 --> 01:14:16.020 They're not going to call their title primaryTitle. 01:14:16.020 --> 01:14:18.660 They're going to call it title, so we're going to rename it title. 01:14:18.660 --> 01:14:19.770 They're not going to call it startYear. 01:14:19.770 --> 01:14:20.937 We're going to call it year. 01:14:20.937 --> 01:14:22.060 And then that's it. 01:14:22.060 --> 01:14:24.270 We'll come back to primary key in just a moment. 01:14:24.270 --> 01:14:25.530 But notice this. 01:14:25.530 --> 01:14:28.320 In my new and final version of this script, I'm creating, 01:14:28.320 --> 01:14:33.060 I propose, a second table called genres whose purpose in life 01:14:33.060 --> 01:14:38.250 is to contain a value called show_id and another one called genre. 01:14:38.250 --> 01:14:39.750 So what's going on? 01:14:39.750 --> 01:14:42.810 Well, let me go ahead and show this. 01:14:42.810 --> 01:14:48.930 If I load the resulting database from this one, shows4.db, and I do .schema, 01:14:48.930 --> 01:14:50.850 you'll see that I indeed have two tables. 01:14:50.850 --> 01:14:55.150 Let me go ahead and SELECT * FROM shows WHERE title, this time, 01:14:55.150 --> 01:14:59.410 because I've renamed it from primaryTitle, = The Office. 01:14:59.410 --> 01:14:59.910 OK. 01:14:59.910 --> 01:15:01.050 That's a lot of Offices. 01:15:01.050 --> 01:15:06.700 But let's go, AND year = 2005, which is the one we're all thinking about. 01:15:06.700 --> 01:15:07.860 And it's that one. 01:15:07.860 --> 01:15:09.810 And now, notice this. 
01:15:09.810 --> 01:15:12.360 Notice that I'm getting back what? 01:15:12.360 --> 01:15:19.590 An ID, I'm getting back a title, and I'm getting back a year, but no genres. 01:15:19.590 --> 01:15:24.053 That's because there's another table now called genres that's separate. 01:15:24.053 --> 01:15:25.470 And you know, I'm kind of curious. 01:15:25.470 --> 01:15:27.763 I see that the genres table has show_id. 01:15:27.763 --> 01:15:28.680 Let me go ahead and do this. 01:15:28.680 --> 01:15:34.200 SELECT * FROM genres WHERE show_id =, and let 01:15:34.200 --> 01:15:37.830 me do a little copy paste here, = this show_id. 01:15:37.830 --> 01:15:39.780 And what might I see? 01:15:39.780 --> 01:15:41.100 Comedy. 01:15:41.100 --> 01:15:42.840 So what have we done now? 01:15:42.840 --> 01:15:46.170 For any TV show that was in IMDB's database that 01:15:46.170 --> 01:15:50.010 had a comma-separated list of genres, I've exploded it, so to speak. 01:15:50.010 --> 01:15:52.170 I've split that value on the commas. 01:15:52.170 --> 01:15:56.070 And if the show is a comedy, I've added a row in the genres table, 01:15:56.070 --> 01:15:59.310 but then I've jotted down the show's ID next to that genre 01:15:59.310 --> 01:16:02.670 so I remember that that show was of that genre. 01:16:02.670 --> 01:16:05.770 But if another show has multiple genres-- for instance, 01:16:05.770 --> 01:16:10.980 let's go ahead and search for not The Office but, say, The Crown. 01:16:10.980 --> 01:16:12.540 And there's only one of those. 01:16:12.540 --> 01:16:20.190 And now I do SELECT * FROM genres WHERE show_id = this number, 01:16:20.190 --> 01:16:25.360 we'll see that, oh, The Crown now has drama and history as a genre. 01:16:25.360 --> 01:16:27.570 And so therefore, in the genres table, notice 01:16:27.570 --> 01:16:30.690 that there's two IDs and two genres. 01:16:30.690 --> 01:16:32.220 But now we can use equality. 
01:16:32.220 --> 01:16:39.300 I can now search for all of the comedies from 2019 in kind of a powerful way, 01:16:39.300 --> 01:16:47.010 SELECT * FROM shows WHERE id IN-- 01:16:47.010 --> 01:16:48.330 and here's the cool part-- 01:16:48.330 --> 01:17:01.020 SELECT show_id FROM genres WHERE genre = "Comedy" AND year = 2019. 01:17:01.020 --> 01:17:03.960 So this is admittedly a mouthful, but let's consider what's happening. 01:17:03.960 --> 01:17:05.910 First, I'm selecting star from shows. 01:17:05.910 --> 01:17:08.760 That means give me all the shows in all of the columns. 01:17:08.760 --> 01:17:14.640 But filter as follows, only show me those shows where the ID of the show 01:17:14.640 --> 01:17:17.170 is in the following list of IDs. 01:17:17.170 --> 01:17:20.610 Now, you can look at a nested query inside the parentheses here. 01:17:20.610 --> 01:17:25.140 This list here selects all of the show IDs from the genres table 01:17:25.140 --> 01:17:27.380 where genre = "Comedy". 01:17:27.380 --> 01:17:31.770 So the highlighted parenthetical right now returns essentially a list 01:17:31.770 --> 01:17:35.130 of all of the ID numbers of shows that are associated with comedy, 01:17:35.130 --> 01:17:37.900 even if they're associated with other things, too. 01:17:37.900 --> 01:17:41.400 And we're making sure that the year equals 2019. 01:17:41.400 --> 01:17:46.290 So if I now hit Enter, we'll see a whole bunch of results, but we should see, 01:17:46.290 --> 01:17:49.110 if I count these by using my usual syntax, 01:17:49.110 --> 01:17:54.120 there were 1,593 shows that are comedies in 2019. 01:17:54.120 --> 01:17:58.200 That does happen to equal the same count we did earlier by using like, 01:17:58.200 --> 01:18:01.560 but this is better designed in the sense that there's no ambiguity. 01:18:01.560 --> 01:18:03.730 You're not just hackishly looking for a substring, 01:18:03.730 --> 01:18:06.060 so to speak, in a comma-separated list. 
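The nested query just dissected can be sketched on a toy pair of tables (standard-library sqlite3 again; the IDs and titles here are made up, and the real data set would of course have far more rows):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id TEXT, title TEXT, year NUMERIC)")
db.execute("CREATE TABLE genres (show_id TEXT, genre TEXT)")
db.executemany("INSERT INTO shows VALUES (?, ?, ?)", [
    ("tt1", "Show A", 2019),
    ("tt2", "Show B", 2019),
    ("tt3", "Show C", 2018),
])
db.executemany("INSERT INTO genres VALUES (?, ?)", [
    ("tt1", "Comedy"), ("tt1", "Drama"),  # one show, many genre rows
    ("tt2", "Drama"),
    ("tt3", "Comedy"),
])

# The parenthesized subquery builds the list of comedy show IDs; the outer
# query then keeps only the 2019 shows whose id appears in that list
comedies_2019 = db.execute(
    "SELECT * FROM shows "
    "WHERE id IN (SELECT show_id FROM genres WHERE genre = 'Comedy') "
    "AND year = 2019"
).fetchall()
```

Only Show A survives both filters: Show C is a comedy but from 2018, and Show B is from 2019 but isn't a comedy.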
01:18:06.060 --> 01:18:08.580 You can actually now search more robustly 01:18:08.580 --> 01:18:10.710 by having redesigned your data. 01:18:10.710 --> 01:18:13.830 And what we've done really is something like this. 01:18:13.830 --> 01:18:18.150 Instead of storing our data in just one table called shows, 01:18:18.150 --> 01:18:22.230 and every show has an ID, a title, a year, and genres, 01:18:22.230 --> 01:18:26.880 we've instead exploded those genres into two separate tables, such 01:18:26.880 --> 01:18:29.650 that now our shows table looks like this. 01:18:29.650 --> 01:18:32.200 We have an ID, a title, and a year. 01:18:32.200 --> 01:18:37.350 And notice that the ID of a show can also appear over here 01:18:37.350 --> 01:18:40.950 in another table called genres, and just by convention, to keep us sane, 01:18:40.950 --> 01:18:43.380 instead of calling it ID here, we've called 01:18:43.380 --> 01:18:48.060 it show_id to connote that it came from a table called shows, plural, 01:18:48.060 --> 01:18:50.010 but it's a single ID from a show. 01:18:50.010 --> 01:18:55.050 So by convention, humans often call them table name, minus the S, underscore ID. 01:18:55.050 --> 01:18:59.280 And then the keyword here or genre is comedy or drama or documentary 01:18:59.280 --> 01:19:01.443 or some other genre as well. 01:19:01.443 --> 01:19:03.360 And now, this is a little subtle, but the fact 01:19:03.360 --> 01:19:05.940 that this little symbol here, drawn from the database world, 01:19:05.940 --> 01:19:08.370 flails out into three separate places, this 01:19:08.370 --> 01:19:11.370 is a one-to-many relationship, so to speak. 01:19:11.370 --> 01:19:15.630 You can have one show over here mapping to many genres over here, 01:19:15.630 --> 01:19:21.090 or maybe zero, but it's zero or more possible genres. 01:19:21.090 --> 01:19:21.990 All right. 01:19:21.990 --> 01:19:26.720 Any questions just yet? 01:19:26.720 --> 01:19:27.220 All right. 
01:19:27.220 --> 01:19:30.430 So the real power then, to be clear, is coming, now, 01:19:30.430 --> 01:19:31.870 from this kind of expressiveness. 01:19:31.870 --> 01:19:34.495 So now, let's play around with some other queries and features. 01:19:34.495 --> 01:19:37.330 But first, let's give ourselves a bit more data. 01:19:37.330 --> 01:19:40.420 It turns out besides the file called title.basics.tsv, 01:19:40.420 --> 01:19:44.140 IMDB.com makes a bunch of others available to us as well. 01:19:44.140 --> 01:19:46.900 There is one called name.basics.tsv. 01:19:46.900 --> 01:19:49.420 And this is one that has information on all 01:19:49.420 --> 01:19:53.690 of the names of actors and actresses and directors and writers 01:19:53.690 --> 01:19:54.940 and other people in the world. 01:19:54.940 --> 01:19:58.600 So for instance, there is an nconst, which is like a name constant 01:19:58.600 --> 01:20:03.190 or an ID, nm2946516, which happens to belong 01:20:03.190 --> 01:20:06.550 to the actor whose primary name is Claire Foy, the star of The Crown. 01:20:06.550 --> 01:20:10.990 She was born in 1984, and there are some other fields in that file as well. 01:20:10.990 --> 01:20:14.560 But also juicy is this file, title.principals.tsv, 01:20:14.560 --> 01:20:16.690 and this is where it gets interesting, too. 01:20:16.690 --> 01:20:20.450 In this file, notice there are no actual titles. 01:20:20.450 --> 01:20:21.740 There's no primary titles. 01:20:21.740 --> 01:20:23.800 There's no actual human names. 01:20:23.800 --> 01:20:28.810 Instead, there's just two unique identifiers, a tconst and an nconst, 01:20:28.810 --> 01:20:34.240 which IMDB speak for a title identifier and a name identifier. 
01:20:34.240 --> 01:20:37.450 So for instance, in one of the rows in this TSV file called 01:20:37.450 --> 01:20:42.790 title.principals.tsv, there is a row that starts with tt4786824; 01:20:42.790 --> 01:20:48.220 also has nm2946516, and has the word actress, 01:20:48.220 --> 01:20:52.090 thereby implying that if you look up the nm-- 01:20:52.090 --> 01:20:57.280 the nconst in the names file and you look up the tconst in the titles file, 01:20:57.280 --> 01:21:03.600 you will be able to, by transitivity, infer that Claire Foy is in The Crown. 01:21:03.600 --> 01:21:06.670 This allows us to have a many-to-many relationship. 01:21:06.670 --> 01:21:10.360 One movie or show can have many actors, 01:21:10.360 --> 01:21:16.030 and one actor can be in many shows, so we're using a sort of join file here, 01:21:16.030 --> 01:21:18.070 a join table, that's going to somehow allow 01:21:18.070 --> 01:21:21.590 us to link two different data sets together, and more on that in a moment. 01:21:21.590 --> 01:21:24.700 But what's really fun in IMDB is that it also has a bunch of ratings 01:21:24.700 --> 01:21:28.630 that humans have typed in saying, I give this a 10 out of 10, a 0 out of 10, 01:21:28.630 --> 01:21:29.200 and so forth. 01:21:29.200 --> 01:21:31.950 And they keep track of the number of votes that shows have gotten. 01:21:31.950 --> 01:21:36.370 And so in title.ratings.tsv, yet another file you can download from IMDB, 01:21:36.370 --> 01:21:40.120 you can look up, for a given tconst, the unique identifier for a title, 01:21:40.120 --> 01:21:42.940 what its average rating is and the number of votes. 01:21:42.940 --> 01:21:46.660 And in fact, if I pull this up, for instance, on The Crown, if I go back 01:21:46.660 --> 01:21:48.880 to IMDB itself-- 01:21:48.880 --> 01:21:53.930 IMDB, search for The Crown, the 2016 version, 01:21:53.930 --> 01:21:59.410 you'll see that indeed, it is an 8.7, which lines up with exactly what 01:21:59.410 --> 01:22:00.210 we have here. 
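The transitivity described above, a tconst plus an nconst linking Claire Foy to The Crown, can be sketched with a minimal join table. The two identifiers are the real IMDb IDs quoted in the lecture, but the table layout here is deliberately simplified to just the columns needed for the lookup:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id TEXT, title TEXT)")
db.execute("CREATE TABLE people (id TEXT, name TEXT)")
# The join table stores no titles and no names, just pairs of identifiers
db.execute("CREATE TABLE stars (show_id TEXT, person_id TEXT)")

db.execute("INSERT INTO shows VALUES ('tt4786824', 'The Crown')")
db.execute("INSERT INTO people VALUES ('nm2946516', 'Claire Foy')")
db.execute("INSERT INTO stars VALUES ('tt4786824', 'nm2946516')")

# Look up the person, follow the join table, land on the show
title = db.execute(
    "SELECT title FROM shows WHERE id IN ("
    "  SELECT show_id FROM stars WHERE person_id IN ("
    "    SELECT id FROM people WHERE name = 'Claire Foy'))"
).fetchone()[0]
```

Because the stars table can hold any number of rows per show and per person, this one extra table is what makes the relationship many-to-many.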
01:22:00.210 --> 01:22:03.580 But over time, that number is going to go up or down, because IMDB is updating 01:22:03.580 --> 01:22:06.280 their data set every day as well. 01:22:06.280 --> 01:22:09.730 So besides this data, we also then have the ability 01:22:09.730 --> 01:22:12.580 to consider what this all looks like collectively. 01:22:12.580 --> 01:22:14.920 So in this case here, here is another diagram. 01:22:14.920 --> 01:22:18.610 It's more complicated now, but it just captures the intuition 01:22:18.610 --> 01:22:22.450 that you would derive by just reading through IMDB's documentation, which 01:22:22.450 --> 01:22:23.800 defines the following. 01:22:23.800 --> 01:22:26.420 It turns out if you read closely among those files, 01:22:26.420 --> 01:22:29.710 you'll see that, oh, we can glean a whole bunch of shows that are going 01:22:29.710 --> 01:22:32.050 to have IDs, title, year, and episodes. 01:22:32.050 --> 01:22:37.840 I want to go ahead and associate those shows with a whole bunch of stars. 01:22:37.840 --> 01:22:42.170 But people are the entities in the world that have IDs, names, and birth. 01:22:42.170 --> 01:22:43.910 So now things get a little weird. 01:22:43.910 --> 01:22:46.300 Let's focus only on these two tables here. 01:22:46.300 --> 01:22:48.340 This is a diagram of the goal at hand, and this 01:22:48.340 --> 01:22:49.632 is a script I wrote in advance. 01:22:49.632 --> 01:22:53.150 And in a moment, we'll open up a SQL database that represents this. 01:22:53.150 --> 01:22:56.290 There's going to be a table called shows, every row of which 01:22:56.290 --> 01:22:59.350 has an ID, title, year, and some number of episodes, 01:22:59.350 --> 01:23:00.962 so you can see how long-running it is. 01:23:00.962 --> 01:23:02.920 There's also going to be a table called people. 01:23:02.920 --> 01:23:05.080 Claire Foy is going to be among them. 
01:23:05.080 --> 01:23:09.610 She and the other humans will have an ID, name, and birth year associated 01:23:09.610 --> 01:23:10.960 with them as well. 01:23:10.960 --> 01:23:14.230 But there's going to be this other table here called stars. 01:23:14.230 --> 01:23:19.390 And you'll notice there's a line that links shows to stars to people, 01:23:19.390 --> 01:23:21.190 again by transitivity. 01:23:21.190 --> 01:23:27.580 If there is, in the stars table, a show ID, a.k.a. tconst, and a person ID, 01:23:27.580 --> 01:23:28.260 a.k.a. 01:23:28.260 --> 01:23:32.260 nconst, that links, for instance, Claire Foy to The Crown. 01:23:32.260 --> 01:23:34.480 It's going to link Steve Carell to The Office. 01:23:34.480 --> 01:23:38.450 It's going to link every other actor to their show as well. 01:23:38.450 --> 01:23:41.770 Similarly for writers, we won't play too much with the writers today, 01:23:41.770 --> 01:23:45.760 but writers are people, too, just as stars are people. 01:23:45.760 --> 01:23:48.970 And so here's another feature or design goal of SQL. 01:23:48.970 --> 01:23:52.150 You want to ideally factor out the commonalities, so 01:23:52.150 --> 01:23:55.450 that Claire Foy appears only in one place by name, 01:23:55.450 --> 01:23:58.965 but her unique identifier might appear in bunches of places. 01:23:58.965 --> 01:24:01.840 There's a lot of actors in the world who are also writers themselves. 01:24:01.840 --> 01:24:03.970 They are, at the end of the day, people, but they 01:24:03.970 --> 01:24:08.200 might appear both in the stars table and in the writers table 01:24:08.200 --> 01:24:10.300 by way of their person ID. 01:24:10.300 --> 01:24:14.050 So the goal of SQL is to not copy and paste Claire Foy, Claire Foy, Claire 01:24:14.050 --> 01:24:17.750 Foy, or Steve Carell, Steve Carell, Steve Carell, all over the place. 
01:24:17.750 --> 01:24:20.170 You want to have one authoritative place for all 01:24:20.170 --> 01:24:23.320 of your people, one authoritative place for all of your shows, 01:24:23.320 --> 01:24:25.180 and then you have these other tables called 01:24:25.180 --> 01:24:28.940 join tables, which are similar in spirit to the TSV files 01:24:28.940 --> 01:24:32.790 you can download that somehow link these identifiers together. 01:24:32.790 --> 01:24:38.820 So if I, on my computer here, open a file that I had-- came with in advance, 01:24:38.820 --> 01:24:43.370 that you can also play with online as well, called, finally, shows.db. 01:24:43.370 --> 01:24:45.080 And I do .schema on this one. 01:24:45.080 --> 01:24:48.560 This is the largest of the databases, and this is the one that we, the staff, 01:24:48.560 --> 01:24:51.345 actually generated from all of the TSV files online, 01:24:51.345 --> 01:24:52.970 but we threw away a lot of information. 01:24:52.970 --> 01:24:55.850 We threw away anything before 1970, and we filtered out 01:24:55.850 --> 01:24:59.600 everything except TV series, so that we can actually 01:24:59.600 --> 01:25:01.820 play with the data in this data set. 01:25:01.820 --> 01:25:04.580 But honestly, using a black and white prompt and a terminal window 01:25:04.580 --> 01:25:07.455 tends not to be fun, especially when the data flies over the screen. 01:25:07.455 --> 01:25:10.413 So there are also things called GUIs, Graphical User Interfaces, 01:25:10.413 --> 01:25:13.080 and indeed, there is a program that's freely available for Macs, 01:25:13.080 --> 01:25:16.490 PCs, and other types of operating systems, called DB Browser. 01:25:16.490 --> 01:25:18.480 And indeed, we'll point you at this online.
01:25:18.480 --> 01:25:23.030 This is just a program that allows you to explore SQL files on your own Mac 01:25:23.030 --> 01:25:24.890 or PC much more visibly-- 01:25:24.890 --> 01:25:28.070 or much more visually and much more pleasantly than maybe just 01:25:28.070 --> 01:25:29.870 a command line interface allows. 01:25:29.870 --> 01:25:33.080 So let me go ahead and open up, for instance, shows.db. 01:25:33.080 --> 01:25:34.770 And we'll see a whole bunch of things. 01:25:34.770 --> 01:25:36.890 First of all, the graphical user interface 01:25:36.890 --> 01:25:40.280 here shows me the same information, just in slightly prettier format. 01:25:40.280 --> 01:25:44.450 Shows.db, per my diagram a moment ago, has six tables, people, 01:25:44.450 --> 01:25:46.670 like I mentioned; shows, like I mentioned; 01:25:46.670 --> 01:25:50.398 also, stars and writers, ratings, and then that separate table called genres. 01:25:50.398 --> 01:25:52.190 And if you look over here on the right, you 01:25:52.190 --> 01:25:56.360 can actually see the SQL code we wrote to create those tables. 01:25:56.360 --> 01:26:00.350 But cooler than that is that notice these tabs up top here? 01:26:00.350 --> 01:26:03.170 I am currently on Database Structure, and if you're using Windows, 01:26:03.170 --> 01:26:05.180 your interface will look a little different, 01:26:05.180 --> 01:26:06.740 but the options are still there. 01:26:06.740 --> 01:26:08.960 I'm going to go ahead and click on Browse Data. 01:26:08.960 --> 01:26:12.000 And now you'll see a little dropdown of all of the tables. 01:26:12.000 --> 01:26:14.510 So if I want to go ahead and look at a whole bunch of shows, 01:26:14.510 --> 01:26:18.680 I can actually see all of my data here, not unlike Google Spreadsheets. 01:26:18.680 --> 01:26:23.450 But notice, this is 153,331 movies-- 01:26:23.450 --> 01:26:27.380 or shows, rather, that I can see altogether here. 
01:26:27.380 --> 01:26:31.110 And lastly, what's cool is that if I go over to the SQL tab, 01:26:31.110 --> 01:26:33.058 I can now execute some sample queries. 01:26:33.058 --> 01:26:34.350 So let me go ahead and do this. 01:26:34.350 --> 01:26:37.340 SELECT * FROM shows;. 01:26:37.340 --> 01:26:39.530 This is going to give me a whole lot of shows. 01:26:39.530 --> 01:26:40.130 Enter. 01:26:40.130 --> 01:26:43.140 So I hit the little Play button that just executes that query, 01:26:43.140 --> 01:26:46.980 and you see the resulting rows that have come back. 01:26:46.980 --> 01:26:48.800 So again, how did we get to this point? 01:26:48.800 --> 01:26:51.920 We, the staff, downloaded all of those TSV files in advance. 01:26:51.920 --> 01:26:56.150 We wrote a Python script that imported all of the data from those files 01:26:56.150 --> 01:26:58.880 into memory, threw away the stuff we didn't care about, 01:26:58.880 --> 01:27:03.362 and then inserted it into tables like this table called shows. 01:27:03.362 --> 01:27:06.320 And what's nice about this tool is, because if you are an aspiring data 01:27:06.320 --> 01:27:09.020 scientist or you're just trying to get some analytical work done 01:27:09.020 --> 01:27:11.690 for this class or any other, or any other project, 01:27:11.690 --> 01:27:14.370 a graphical tool lends itself to just kind of poking around. 01:27:14.370 --> 01:27:19.010 So for instance, you'll see that the shows table no longer has any genres. 01:27:19.010 --> 01:27:20.010 But that's OK. 01:27:20.010 --> 01:27:21.950 We can reconstitute that data. 01:27:21.950 --> 01:27:25.670 I can go ahead and SELECT * FROM shows-- 01:27:25.670 --> 01:27:27.390 but you know what I can do? 01:27:27.390 --> 01:27:30.680 I can actually join in the genres table. 01:27:30.680 --> 01:27:34.250 So I can take the shows table here, the genres table here, 01:27:34.250 --> 01:27:37.640 and essentially link them together by way of the ID. 
01:27:37.640 --> 01:27:39.000 How is that possible? 01:27:39.000 --> 01:27:43.010 Well, if you look at shows, a show has an ID title, year, and episodes. 01:27:43.010 --> 01:27:45.020 Genres has a show ID. 01:27:45.020 --> 01:27:48.560 So if you think of ID on my left hand as representing my fingers here. 01:27:48.560 --> 01:27:51.830 Show ID in genres is representing my right hand here. 01:27:51.830 --> 01:27:54.830 What we want to do is lineup ID with show ID 01:27:54.830 --> 01:28:00.570 to make one larger, wider table that constitutes all of that data together. 01:28:00.570 --> 01:28:01.890 So how do I do this? 01:28:01.890 --> 01:28:04.130 Well, in SQL, you can join two tables. 01:28:04.130 --> 01:28:07.430 You say what table you want to join with what other table and 01:28:07.430 --> 01:28:08.570 how you want to do it. 01:28:08.570 --> 01:28:16.740 Well, I want to go ahead and join it on shows.id = genres.show_id;. 01:28:16.740 --> 01:28:19.700 And now, when I hit execute, it took a moment-- 01:28:19.700 --> 01:28:21.682 indeed, it took 408 milliseconds. 01:28:21.682 --> 01:28:24.140 But my god, that's a lot faster than writing a whole Python 01:28:24.140 --> 01:28:25.160 script to do this. 01:28:25.160 --> 01:28:29.220 Now I have a table with all of the shows as before. 01:28:29.220 --> 01:28:30.740 But notice the table got wider. 01:28:30.740 --> 01:28:33.590 This is a temporary table that SQL has returned to me. 01:28:33.590 --> 01:28:36.077 This one now has genre and show ID. 01:28:36.077 --> 01:28:37.910 So in fact, let me go ahead and filter this. 01:28:37.910 --> 01:28:52.290 WHERE title = "The Office" AND year = 2005, Play, we'll just get this. 01:28:52.290 --> 01:28:55.370 So notice I have a wider table containing all of the columns 01:28:55.370 --> 01:28:57.350 from both of those joined tables. 
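The same join can be reproduced with Python's built-in sqlite3 module. This sketch uses an in-memory database with a simplified version of the shows.db schema from the lecture; the table and column names follow the lecture, but the rows are made up:

```python
import sqlite3

# In-memory database with a pared-down version of the lecture's schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, episodes INTEGER)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")

# Made-up sample data: one show, two genres linked to it by show_id.
db.execute("INSERT INTO shows VALUES (1, 'The Office', 2005, 188)")
db.executemany("INSERT INTO genres VALUES (?, ?)", [(1, "Comedy"), (1, "Drama")])

# The same join as in the lecture: line up shows.id with genres.show_id,
# then filter by title and year.
rows = db.execute(
    "SELECT title, year, genre FROM shows JOIN genres "
    "ON shows.id = genres.show_id "
    "WHERE title = 'The Office' AND year = 2005"
).fetchall()

print(rows)  # one row per genre, with the title repeated in each
```

Note the "wider table" effect: the show's title and year appear once per matching genre row.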
01:28:57.350 --> 01:29:03.170 But if I change this now to The Crown, and that was 2016, and hit play, 01:29:03.170 --> 01:29:05.875 notice I get back seemingly redundant information. 01:29:05.875 --> 01:29:08.000 But this might be useful nonetheless, because now I 01:29:08.000 --> 01:29:13.040 can iterate over all of the rows knowing that every row has not only a title 01:29:13.040 --> 01:29:16.220 and a year, but also a genre as well. 01:29:16.220 --> 01:29:19.070 So I can reconstitute the table like that. 01:29:19.070 --> 01:29:21.530 Well, how can I ask other questions as well, 01:29:21.530 --> 01:29:26.030 like what are actors that are in other shows as well? 01:29:26.030 --> 01:29:30.000 Well, let me go ahead and do something like this. 01:29:30.000 --> 01:29:33.230 For instance, let's select all of Steve Carell's movies. 01:29:33.230 --> 01:29:35.280 Or-- let's select Steve Carell himself first. 01:29:35.280 --> 01:29:38.280 So SELECT * FROM-- 01:29:38.280 --> 01:29:40.140 let's see-- no. 01:29:40.140 --> 01:29:40.830 Let's do this. 01:29:40.830 --> 01:29:41.910 How about Ellen? 01:29:41.910 --> 01:29:49.405 So SELECT * from people where name = Ellen DeGeneres. 01:29:49.405 --> 01:29:50.280 I spelled that right. 01:29:50.280 --> 01:29:52.150 Semicolon, Play. 01:29:52.150 --> 01:29:52.650 All right. 01:29:52.650 --> 01:29:56.040 So this is handy, because I now know that Ellen's birth year is 1958, 01:29:56.040 --> 01:29:58.320 but her ID is 1122. 01:29:58.320 --> 01:30:01.530 That's the same thing as that nconst, but we threw away the nm 01:30:01.530 --> 01:30:05.520 and we got rid of all the zeros and made it into a proper number for efficiency, 01:30:05.520 --> 01:30:08.340 better to uniquely identify humans, typically-- or anything-- 01:30:08.340 --> 01:30:13.110 by numbers which fit in 32 or 64 bits rather than longer strings. 01:30:13.110 --> 01:30:16.740 So now I know Ellen's ID is 1122, what can I do? 
01:30:16.740 --> 01:30:25.470 Well, let me go ahead and SELECT * FROM stars WHERE person_id = this. 01:30:25.470 --> 01:30:26.970 This will tell me what? 01:30:26.970 --> 01:30:31.920 This will tell me all of the information about shows that Ellen starred in, 01:30:31.920 --> 01:30:34.150 including, presumably, her own. 01:30:34.150 --> 01:30:37.230 OK, so I now see person_id is the same, the same, the same, 01:30:37.230 --> 01:30:39.608 but Ellen is apparently in all of those shows. 01:30:39.608 --> 01:30:41.400 But that's not that helpful, and also, this 01:30:41.400 --> 01:30:44.070 is kind of lame that I've just hardcoded Ellen's ID. 01:30:44.070 --> 01:30:45.550 But I don't have to do that. 01:30:45.550 --> 01:30:55.020 I can do a subquery, SELECT id FROM people WHERE name = "Ellen DeGeneres", 01:30:55.020 --> 01:30:57.330 closed parenthesis, Play. 01:30:57.330 --> 01:31:00.480 Now it's dynamic, so now I've not hardcoded anything. 01:31:00.480 --> 01:31:01.930 But this isn't that useful. 01:31:01.930 --> 01:31:04.770 Let me go ahead and just select show_id here. 01:31:04.770 --> 01:31:08.910 So I now have SELECT show_id FROM stars WHERE the person_id ID 01:31:08.910 --> 01:31:12.108 = whatever Ellen's ID is here. 01:31:12.108 --> 01:31:13.650 How can I take this one step further? 01:31:13.650 --> 01:31:19.980 Well, what if I do SELECT * FROM shows WHERE the ID of the show 01:31:19.980 --> 01:31:23.400 is in the following list of values? 01:31:23.400 --> 01:31:25.590 So not only does SQL support equals, when 01:31:25.590 --> 01:31:28.530 you want to compare one value against the next, much like in Python, 01:31:28.530 --> 01:31:33.360 you have the keyword IN, where you can say select everything from shows where 01:31:33.360 --> 01:31:36.930 the ID of the show is in the following list of shows, 01:31:36.930 --> 01:31:42.260 which happens to represent the list of show IDs that Ellen is in. 01:31:42.260 --> 01:31:43.140 Phew. 
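The nested query built up step by step above can be run end to end with sqlite3. The IDs and rows here are made up (Ellen's real ID in the course's shows.db may differ), but the shape of the query is the one from the lecture:

```python
import sqlite3

# Minimal people/stars/shows tables in the shape the lecture describes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER)")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")

db.execute("INSERT INTO people VALUES (1122, 'Ellen DeGeneres', 1958)")
db.execute("INSERT INTO shows VALUES (10, 'Ellen', 1994)")
db.execute("INSERT INTO shows VALUES (20, 'The Ellen DeGeneres Show', 2003)")
db.executemany("INSERT INTO stars VALUES (?, ?)", [(10, 1122), (20, 1122)])

# Nested subqueries plus IN: nothing hardcoded except the name itself.
titles = db.execute(
    "SELECT title FROM shows WHERE id IN "
    "(SELECT show_id FROM stars WHERE person_id = "
    "(SELECT id FROM people WHERE name = 'Ellen DeGeneres'))"
).fetchall()

print(titles)
```

Reading it inside out mirrors the lecture: find Ellen's ID, find the show IDs she stars in, then find those shows.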
01:31:43.140 --> 01:31:45.150 Let's hit Play on this. 01:31:45.150 --> 01:31:48.660 It took a moment, but it looks like these are all of the shows 01:31:48.660 --> 01:31:51.120 that Ellen has been in, according to IMDB. 01:31:51.120 --> 01:31:54.600 And it looks like The Ellen DeGeneres Show is one. 01:31:54.600 --> 01:31:58.380 She's been on the air for 2,865 episodes. 01:31:58.380 --> 01:32:01.480 There is Ellen, her original TV show, which was on for a few seasons back 01:32:01.480 --> 01:32:01.980 then. 01:32:01.980 --> 01:32:06.430 Looks like she's doing some game shows these days as of 2017, and so forth. 01:32:06.430 --> 01:32:09.810 So using these step-by-step thought processes 01:32:09.810 --> 01:32:12.300 can we actually build up more interesting queries 01:32:12.300 --> 01:32:15.410 to get back information like that. 01:32:15.410 --> 01:32:21.192 All right, any questions before we try a few others out as well? 01:32:21.192 --> 01:32:21.860 No? 01:32:21.860 --> 01:32:22.360 All right. 01:32:22.360 --> 01:32:25.270 Well, let me show one other approach to this same problem. 01:32:25.270 --> 01:32:27.720 It turns out these are what are known as nested queries. 01:32:27.720 --> 01:32:30.012 You keep using the parenthesization, much like in math, 01:32:30.012 --> 01:32:32.490 where you can nest arithmetic expressions in parentheses. 01:32:32.490 --> 01:32:33.900 You can do the same in SQL. 01:32:33.900 --> 01:32:37.590 But you can also join information in slightly different ways as well. 01:32:37.590 --> 01:32:40.480 I can actually do something like this. 01:32:40.480 --> 01:32:47.670 Let me go ahead and SELECT Title FROM the following tables, 01:32:47.670 --> 01:32:56.980 people JOIN stars ON people.id = stars.person_id-- 01:32:56.980 --> 01:32:59.230 and we'll walk through this in just a second-- 01:32:59.230 --> 01:33:03.210 JOIN-- sorry-- 01:33:03.210 --> 01:33:14.640 JOIN shows ON stars.show_id = shows.id where name equals "Ellen DeGeneres."
01:33:14.640 --> 01:33:18.780 This, while more of a mouthful, is equivalent to what I've just done. 01:33:18.780 --> 01:33:21.450 Notice that I've selected the title of the show 01:33:21.450 --> 01:33:24.630 from the following, the result of joining people on stars. 01:33:24.630 --> 01:33:27.330 How do you join the people table and the stars table? 01:33:27.330 --> 01:33:28.740 Well, people have IDs. 01:33:28.740 --> 01:33:31.465 Stars have person IDs, according to the diagram. 01:33:31.465 --> 01:33:33.090 What else do you want to join together? 01:33:33.090 --> 01:33:35.310 Let's join all of that with shows. 01:33:35.310 --> 01:33:36.000 How? 01:33:36.000 --> 01:33:40.740 Well, let's go ahead and say stars.show_id = shows.id. 01:33:40.740 --> 01:33:43.770 So it's a way of linking, transitively, multiple tables together, 01:33:43.770 --> 01:33:47.910 and then filtering all of that with the name of Ellen DeGeneres. 01:33:47.910 --> 01:33:50.580 This is what we would call an explicit join. 01:33:50.580 --> 01:33:54.210 And if I click Play there, notice it's taking a moment. 01:33:54.210 --> 01:33:58.530 Took, in fact, 1,990 milliseconds, almost two human seconds. 01:33:58.530 --> 01:34:01.800 So it got slower, but it does give me another syntax 01:34:01.800 --> 01:34:02.910 via which to select data. 01:34:02.910 --> 01:34:07.380 And honestly, what's cool about this is I can select other fields as well. 01:34:07.380 --> 01:34:12.240 They don't have to come from the original query in my nested selects. 01:34:12.240 --> 01:34:15.060 But let me go ahead and do this, just as a teaser, and then 01:34:15.060 --> 01:34:17.850 we'll consider some problems before wrapping up. 01:34:17.850 --> 01:34:20.130 So it turns out 2 millisecond-- 01:34:20.130 --> 01:34:25.590 2 seconds is kind of slow, and if I do this again, it took 2,029 milliseconds.
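The explicit, three-table join described above can also be checked with sqlite3 and made-up data; the JOIN ... ON clauses are exactly the ones from the lecture:

```python
import sqlite3

# Same toy schema as before, pared down to the columns the join touches.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")
db.execute("INSERT INTO people VALUES (1122, 'Ellen DeGeneres')")
db.execute("INSERT INTO shows VALUES (20, 'The Ellen DeGeneres Show')")
db.execute("INSERT INTO stars VALUES (20, 1122)")

# Explicit joins: link people to stars, then stars to shows, then filter.
titles = db.execute(
    "SELECT title FROM people "
    "JOIN stars ON people.id = stars.person_id "
    "JOIN shows ON stars.show_id = shows.id "
    "WHERE name = 'Ellen DeGeneres'"
).fetchall()

print(titles)  # [('The Ellen DeGeneres Show',)]
```

Both styles, nested subqueries and explicit joins, produce the same rows; which one the database can execute faster depends on the data and, as the lecture goes on to show, on indexes.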
01:34:25.590 --> 01:34:29.345 If I do it again, it took 1,963 milliseconds, 01:34:29.345 --> 01:34:30.720 and I'm looking at the time here. 01:34:30.720 --> 01:34:31.530 That's slow, right? 01:34:31.530 --> 01:34:34.030 That means if you have a finite number of servers in the world 01:34:34.030 --> 01:34:36.720 and each of them can only handle some number of users at a time, 01:34:36.720 --> 01:34:41.700 it's kind of a waste of hardware and of money, arguably, to spend more seconds 01:34:41.700 --> 01:34:46.050 or milliseconds servicing one query from a given user on your app or your web 01:34:46.050 --> 01:34:47.100 application. 01:34:47.100 --> 01:34:49.950 But it turns out we can do something kind of smart here. 01:34:49.950 --> 01:34:54.340 Notice that in our database structure, we've done a few things here. 01:34:54.340 --> 01:34:57.570 Let me open up, for instance, people. 01:34:57.570 --> 01:35:02.990 Rather, let me go over to people, and you'll see a few keywords 01:35:02.990 --> 01:35:04.460 that I ignored earlier. 01:35:04.460 --> 01:35:07.640 When you're defining a table, you can specify that one of your columns 01:35:07.640 --> 01:35:09.253 is what's called a primary key. 01:35:09.253 --> 01:35:12.420 That is the column by which you're going to uniquely identify all your data, 01:35:12.420 --> 01:35:15.620 so it's like those numeric IDs we've given every person and show. 01:35:15.620 --> 01:35:18.020 A foreign key is what we call the same number 01:35:18.020 --> 01:35:19.590 when it appears in some other table. 01:35:19.590 --> 01:35:24.440 So when we saw a person_id or show_id, those are what are called foreign keys, 01:35:24.440 --> 01:35:27.140 because it's the same numbers, but in another table 01:35:27.140 --> 01:35:29.780 that we're using to join things together, eventually. 01:35:29.780 --> 01:35:31.718 And you can also define columns to be unique, 01:35:31.718 --> 01:35:33.260 or you can index them for efficiency. 
01:35:33.260 --> 01:35:35.750 And that's, in fact, where we're going right now. 01:35:35.750 --> 01:35:38.690 If you look at the same query, you'll notice that I'm 01:35:38.690 --> 01:35:40.640 searching on a bunch of columns. 01:35:40.640 --> 01:35:43.380 I'm clearly searching on the name field. 01:35:43.380 --> 01:35:48.050 I'm also searching on-- that is, joining on-- show_id, 01:35:48.050 --> 01:35:52.740 and I'm searching on person_id, not to mention the individual ID fields. 01:35:52.740 --> 01:35:55.040 But the cool thing about a primary key, which 01:35:55.040 --> 01:35:59.270 we have in advance defined all of our ID columns, all of the columns 01:35:59.270 --> 01:36:03.500 called ID, to be primary keys, is that you get a feature for free with SQL. 01:36:03.500 --> 01:36:06.530 SQL builds up what's called an index, a very fancy data 01:36:06.530 --> 01:36:09.950 structure, a tree-like structure, that actually allows you to search 01:36:09.950 --> 01:36:12.240 for information quite efficiently. 01:36:12.240 --> 01:36:14.240 So when you define a column to be a primary key, 01:36:14.240 --> 01:36:15.770 you get what's called an index. 01:36:15.770 --> 01:36:19.430 Specifically, if we go back a few weeks, to our discussion of trees, 01:36:19.430 --> 01:36:20.248 it looks like this. 01:36:20.248 --> 01:36:22.790 This is what's called a B-tree, which is not a binary tree. 01:36:22.790 --> 01:36:25.460 It's a B-tree in the sense that it's got a lot of nodes 01:36:25.460 --> 01:36:28.130 that might have one or two or more children each, 01:36:28.130 --> 01:36:32.270 but it's very short, with a lot of nodes-- a lot of width. 01:36:32.270 --> 01:36:35.570 So that means when you store your data in a tree-like structure, 01:36:35.570 --> 01:36:38.870 long story short, it just tends to be very efficiently searchable. 01:36:38.870 --> 01:36:40.970 So when you define a column to be a primary key, 01:36:40.970 --> 01:36:43.460 you get that speed for free.
01:36:43.460 --> 01:36:46.520 But we can also tell our SQL database, you know what? 01:36:46.520 --> 01:36:51.330 I plan on searching on person IDs and show IDs and names, also, 01:36:51.330 --> 01:36:53.990 so let's go ahead and create myself some indexes. 01:36:53.990 --> 01:36:57.530 Let me go ahead and execute the following queries just once. 01:36:57.530 --> 01:37:04.490 Let me create something called an index called person_index on the stars table 01:37:04.490 --> 01:37:07.040 using its person_id column. 01:37:07.040 --> 01:37:08.810 So pretty quick syntax. 01:37:08.810 --> 01:37:12.230 This means create a B-tree, create a fancy tree structure, 01:37:12.230 --> 01:37:16.730 called person index, on the stars table by person-- 01:37:16.730 --> 01:37:18.020 on the person_id column. 01:37:18.020 --> 01:37:21.810 That is, make a table like this in memory to store all of the person IDs. 01:37:21.810 --> 01:37:22.310 Why? 01:37:22.310 --> 01:37:26.450 So I can find them faster than linear search in the column itself. 01:37:26.450 --> 01:37:27.955 Let me go ahead and execute this. 01:37:27.955 --> 01:37:29.330 And you'll see it takes a moment. 01:37:29.330 --> 01:37:29.930 It's thinking. 01:37:29.930 --> 01:37:31.510 It took me 1.6 seconds. 01:37:31.510 --> 01:37:33.772 So a little slow, but I only have to do this once. 01:37:33.772 --> 01:37:34.730 Let me create two more. 01:37:34.730 --> 01:37:39.860 CREATE INDEX called show_index ON stars ON the show_id column. 01:37:39.860 --> 01:37:42.080 So almost the same, but this just means give me 01:37:42.080 --> 01:37:44.780 a tree that looks like this in memory, so 01:37:44.780 --> 01:37:48.360 that when I query for data like Ellen DeGeneres, 01:37:48.360 --> 01:37:52.640 it searches a tree-like structure instead of a list-like structure 01:37:52.640 --> 01:37:53.960 in a column alone. 01:37:53.960 --> 01:37:54.890 Let me execute that. 01:37:54.890 --> 01:37:58.440 And I screwed up ON stars show_ID. 
01:37:58.440 --> 01:37:59.100 There we go. 01:37:59.100 --> 01:38:00.210 That took 1 second. 01:38:00.210 --> 01:38:07.320 And lastly, let's create an index called name index on people on the name column 01:38:07.320 --> 01:38:10.620 so that I can search for people by name more efficiently as well. 01:38:10.620 --> 01:38:13.920 Otherwise, without these indexes on a column like name, 01:38:13.920 --> 01:38:17.310 it is going to check every damn cell in the column looking 01:38:17.310 --> 01:38:21.180 for Ellen DeGeneres, Ellen DeGeneres, Ellen DeGeneres, using big O of N, 01:38:21.180 --> 01:38:24.070 running time, or linear search. 01:38:24.070 --> 01:38:26.520 So now, let me go back to my query here. 01:38:26.520 --> 01:38:28.440 I've not made any changes to the query. 01:38:28.440 --> 01:38:31.500 The last time I ran this, it took almost 2 seconds. 01:38:31.500 --> 01:38:34.860 Now, after creating these indexes and telling my SQL database 01:38:34.860 --> 01:38:37.470 I plan to search on those columns, watch this, 01:38:37.470 --> 01:38:40.110 and watch the blue highlighted number. 01:38:40.110 --> 01:38:41.520 8 milliseconds. 01:38:41.520 --> 01:38:42.190 8 millisecond. 01:38:42.190 --> 01:38:43.110 7 milliseconds. 01:38:43.110 --> 01:38:44.490 7, 7, 7. 01:38:44.490 --> 01:38:46.245 Looks like 4 milliseconds that time. 
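The effect of CREATE INDEX can be observed without a stopwatch by asking SQLite how it plans to execute a query. This sketch uses sqlite3's EXPLAIN QUERY PLAN (a SQLite feature, not general SQL) on a made-up people table; the exact wording of the plan varies by SQLite version, but before the index you see a scan, and after it a search using the index:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO people (name) VALUES (?)",
               [(f"Person {i}",) for i in range(1000)])

query = "SELECT * FROM people WHERE name = 'Ellen DeGeneres'"

# Without an index on name, SQLite must scan the whole column: linear search.
before = " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query))

# Same syntax as in the lecture: build a B-tree over the name column.
db.execute("CREATE INDEX name_index ON people (name)")

# Now the plan searches the index instead of scanning the table.
after = " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query))

print(before)
print(after)
```

That plan change is exactly what turns the lecture's near-2-second query into a handful of milliseconds.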
01:38:46.245 --> 01:38:49.120 So this is what the Googles of the world, the Facebooks of the world, 01:38:49.120 --> 01:38:51.180 the Microsofts, who have very large data, 01:38:51.180 --> 01:38:53.310 they not only store their data in databases 01:38:53.310 --> 01:38:57.150 like we are here, but they also index their tables intelligently, 01:38:57.150 --> 01:39:01.200 drawing in ideas from weeks ago, so that the database, for them, and for free, 01:39:01.200 --> 01:39:04.110 and sort of magically, creates these kinds of structures in memory, 01:39:04.110 --> 01:39:07.380 but does it so that you can search and insert and update 01:39:07.380 --> 01:39:10.620 your data all the more efficiently. 01:39:10.620 --> 01:39:13.560 So there's got to be some problems or some prices paid. 01:39:13.560 --> 01:39:15.810 So let's consider just a couple of those. 01:39:15.810 --> 01:39:19.637 So what problems actually arise when using a SQL database like this? 01:39:19.637 --> 01:39:21.720 Well, the first is what's called a race condition. 01:39:21.720 --> 01:39:24.780 And perhaps the best way to explain this is by way of a-- 01:39:24.780 --> 01:39:27.780 sort of a story that I was told when I took a class like this years 01:39:27.780 --> 01:39:31.690 ago in a class called CS161, Operating Systems. 01:39:31.690 --> 01:39:33.540 So contrived scenario, but consider this. 01:39:33.540 --> 01:39:35.832 You and your roommates have a fridge in your dorm room, 01:39:35.832 --> 01:39:37.890 and you and your roommate really like milk. 01:39:37.890 --> 01:39:41.910 And one of you comes home one day, opens the fridge, and sees, oh, dammit, 01:39:41.910 --> 01:39:42.770 we're out of milk. 01:39:42.770 --> 01:39:45.270 And so you close the fridge and you walk into Harvard Square 01:39:45.270 --> 01:39:47.400 and you head to CVS or some other such store. 
01:39:47.400 --> 01:39:50.670 Meanwhile, your other roommate comes home, opens the fridge, 01:39:50.670 --> 01:39:53.280 really wants some milk, but, oh, darn it, we're out of milk. 01:39:53.280 --> 01:39:56.425 So they close the fridge, walk outside, and head 01:39:56.425 --> 01:39:59.550 to some other store that sells milk, and for reasons that we're contriving, 01:39:59.550 --> 01:40:01.500 don't bump into you at CVS. 01:40:01.500 --> 01:40:04.728 A few minutes later, you both get home, of course, having bought some milk. 01:40:04.728 --> 01:40:06.520 But of course, milk doesn't last that long, 01:40:06.520 --> 01:40:08.460 so now you've got twice as much milk, and surely some of it's 01:40:08.460 --> 01:40:09.127 going to go bad. 01:40:09.127 --> 01:40:11.640 Like horrible, horrible problem. 01:40:11.640 --> 01:40:12.690 Why, right? 01:40:12.690 --> 01:40:16.230 It's contrived, but this-- fond memories of when it was taught to me. 01:40:16.230 --> 01:40:19.320 So what is the fundamental problem here, if you put your finger on it? 01:40:19.320 --> 01:40:22.380 Both you and your roommate were allowed to inspect 01:40:22.380 --> 01:40:26.400 the state of that refrigerator without knowing that the other was 01:40:26.400 --> 01:40:27.720 about to do the same. 01:40:27.720 --> 01:40:31.405 Better would have been for one of you to leave a note on the fridge saying, gone 01:40:31.405 --> 01:40:33.780 for milk, so that the other person doesn't do it as well. 01:40:33.780 --> 01:40:37.230 Or more stringently, just lock the refrigerator 01:40:37.230 --> 01:40:41.430 so that no one else besides you can see inside of it until you have 01:40:41.430 --> 01:40:44.010 updated the state of that refrigerator. 01:40:44.010 --> 01:40:46.320 So this comes into play all too often these days 01:40:46.320 --> 01:40:48.130 when dealing with really large data sets. 
01:40:48.130 --> 01:40:50.880 The Twitters of the world, the Instagrams of the world, all of you 01:40:50.880 --> 01:40:53.130 are probably in the habit of liking or uploading posts 01:40:53.130 --> 01:40:54.640 on those sites or some other. 01:40:54.640 --> 01:40:58.120 But if a lot of other people are, too, especially when things go viral, 01:40:58.120 --> 01:41:01.080 there might be code like this underneath the hood that 01:41:01.080 --> 01:41:02.400 essentially does the following. 01:41:02.400 --> 01:41:05.700 Here's three lines of representative Python code that use SQL. 01:41:05.700 --> 01:41:08.850 The first line here selects the number of likes from a table 01:41:08.850 --> 01:41:11.880 called posts where the ID of the post is something, 01:41:11.880 --> 01:41:14.370 a variable called ID, 1, 2, 3, 4, 5. 01:41:14.370 --> 01:41:17.670 That just tells me, what is the total number of likes this post has? 01:41:17.670 --> 01:41:20.648 Then, I store that answer in a variable using likes. 01:41:20.648 --> 01:41:22.440 And you'll see this syntax in problem set 7 01:41:22.440 --> 01:41:25.560 as you begin to play with SQL yourself in the CS50 library. 01:41:25.560 --> 01:41:29.130 Suppose that your third line of code is to then update posts, 01:41:29.130 --> 01:41:32.670 setting likes equal to some place holder, where that place 01:41:32.670 --> 01:41:34.630 holder is going to be likes plus 1. 
01:41:34.630 --> 01:41:38.490 This is the same problem as the milk example, 01:41:38.490 --> 01:41:41.430 because if you check how many likes there 01:41:41.430 --> 01:41:43.490 are on this post, or your roommate checks how-- 01:41:43.490 --> 01:41:46.050 or if you check how much milk is available left in the fridge 01:41:46.050 --> 01:41:49.440 and then go off and begin to make a decision like buying milk, 01:41:49.440 --> 01:41:52.260 your roommate might, on a server that's running 01:41:52.260 --> 01:41:55.410 the same code to handle thousands and thousands of people, 01:41:55.410 --> 01:41:56.680 might ask that same question. 01:41:56.680 --> 01:41:58.472 How many likes does this current post have? 01:41:58.472 --> 01:42:00.120 Suppose the post has a million likes. 01:42:00.120 --> 01:42:02.400 Both of you execute that first line of code, 01:42:02.400 --> 01:42:04.020 see that, oh, I have a million likes. 01:42:04.020 --> 01:42:06.550 I've just clicked plus 1 on the site. 01:42:06.550 --> 01:42:10.830 And so you try changing the number of likes with this second update query 01:42:10.830 --> 01:42:12.720 to 1 million plus 1. 01:42:12.720 --> 01:42:15.750 But if both of you have been making a decision that 01:42:15.750 --> 01:42:18.660 gets interleaved with the other person, as will absolutely 01:42:18.660 --> 01:42:20.910 happen with Instagram and Twitter, who have thousands 01:42:20.910 --> 01:42:24.870 of servers all operating in parallel, instead of changing the number of likes 01:42:24.870 --> 01:42:30.750 from 1 million to 1,000,002, you might change it both of you to 1,000,001, 01:42:30.750 --> 01:42:33.870 thereby wasting one of the counts, because it gets lost, 01:42:33.870 --> 01:42:36.810 because you inspected the state of the fridge or the likes. 01:42:36.810 --> 01:42:37.983 You made a decision. 01:42:37.983 --> 01:42:39.900 And then you updated it based on that decision 01:42:39.900 --> 01:42:42.813 without realizing that life changed in between. 
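The lost-update race described above can be simulated deterministically with sqlite3: two "requests" each read the like count, then each write back what they read plus one. The posts table and its numbers are made up; a real race would involve two servers interleaving, but the arithmetic is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 1000000)")

# Both requests read the same stale value before either one writes...
likes_a = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
likes_b = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]

# ...so the second UPDATE clobbers the first, and one like is lost.
db.execute("UPDATE posts SET likes = ? WHERE id = 1", (likes_a + 1,))
db.execute("UPDATE posts SET likes = ? WHERE id = 1", (likes_b + 1,))

final = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final)  # 1000001, not the 1000002 that two likes should produce

# One atomic UPDATE avoids the read-then-write gap entirely:
db.execute("UPDATE posts SET likes = likes + 1 WHERE id = 1")
final2 = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final2)  # 1000002
```

The atomic form at the end is one simple fix; transactions, mentioned next, generalize the idea of locking the fridge.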
01:42:42.813 --> 01:42:44.730 Your roommate got home and checked the fridge, 01:42:44.730 --> 01:42:47.190 too, or someone else clicked the upvote. 01:42:47.190 --> 01:42:49.890 So this is a bad situation, but there are solutions in SQL 01:42:49.890 --> 01:42:54.270 that we won't look at this week, known as transactions. 01:42:54.270 --> 01:42:57.960 That is a solvable problem, and more on that in a higher-level database class. 01:42:57.960 --> 01:43:00.690 And one final example of this, because this one you 01:43:00.690 --> 01:43:04.380 can avoid throughout the coming weeks in CS50 and beyond. 01:43:04.380 --> 01:43:08.520 A SQL injection attack is when you write bad code that 01:43:08.520 --> 01:43:12.150 somehow allows some bad actor on the internet or in an application 01:43:12.150 --> 01:43:16.530 to trick your code into running SQL code that you did not intend. 01:43:16.530 --> 01:43:19.680 For instance, this is how Yale students log on in New Haven. 01:43:19.680 --> 01:43:22.470 This, of course, is how Harvard students log on here in Cambridge. 01:43:22.470 --> 01:43:24.887 You're asked for a username and password, or a Harvard key 01:43:24.887 --> 01:43:25.680 and a password. 01:43:25.680 --> 01:43:28.290 Suppose, though, you as a user don't cooperate, 01:43:28.290 --> 01:43:30.840 and instead of typing in your email address and password, 01:43:30.840 --> 01:43:34.080 suppose you type something like this, like your email address, 01:43:34.080 --> 01:43:38.520 Malan@Harvard.edu, then maybe a single quote, and then a dash dash. 01:43:38.520 --> 01:43:40.950 Well, it turns out in SQL, dash dash is how 01:43:40.950 --> 01:43:46.320 you start a comment, similar to hash in Python or slash slash in C. 
01:43:46.320 --> 01:43:50.180 But suppose that Harvard had implemented its website such 01:43:50.180 --> 01:43:52.680 that there's a query like this going on underneath the hood, 01:43:52.680 --> 01:43:56.370 if the Harvard key is implemented in Python, that says this: SELECT * FROM 01:43:56.370 --> 01:44:00.630 users WHERE username = placeholder AND password = placeholder, 01:44:00.630 --> 01:44:02.430 passing in username and password. 01:44:02.430 --> 01:44:03.550 That is good. 01:44:03.550 --> 01:44:07.050 That is correct code, because those placeholders are important. 01:44:07.050 --> 01:44:10.890 What the CS50 library does, and other libraries in the world like it, is 01:44:10.890 --> 01:44:12.810 it escapes user input. 01:44:12.810 --> 01:44:17.480 If a user tries to trick your code by inputting comments or single quotes, 01:44:17.480 --> 01:44:21.990 db.execute and other libraries will sanitize the user's input 01:44:21.990 --> 01:44:24.570 and prevent the following from happening. 01:44:24.570 --> 01:44:27.630 If, however, you do this-- not good, but bad. 01:44:27.630 --> 01:44:32.490 Suppose that you really practice what we preached last week using f-strings; 01:44:32.490 --> 01:44:34.920 that is now bad in the context of SQL. 01:44:34.920 --> 01:44:37.830 If you create a format string like this with a little f 01:44:37.830 --> 01:44:41.250 that is literally SELECT * FROM users WHERE username =, 01:44:41.250 --> 01:44:44.310 and then use the fancy Python notation with curly braces, 01:44:44.310 --> 01:44:48.660 and password = curly braces, this is a correct SQL construction 01:44:48.660 --> 01:44:53.280 so long as the human cooperates and puts their username here and password here. 01:44:53.280 --> 01:44:55.680 But what if the user is malicious, like me, 01:44:55.680 --> 01:45:01.110 and actually includes a single quote in their input, and then dash dash? 
01:45:01.110 --> 01:45:04.650 So Malan@Harvard.edu quote dash dash would 01:45:04.650 --> 01:45:08.520 have the effect of injecting that comment into the query via the user's input. 01:45:08.520 --> 01:45:11.040 But dash dash, I claim, is a comment, so it's 01:45:11.040 --> 01:45:15.270 a way of tricking the computer into executing not 01:45:15.270 --> 01:45:19.350 SELECT * FROM users WHERE username = this AND password = that; 01:45:19.350 --> 01:45:22.230 it has the effect of just saying SELECT * FROM users WHERE 01:45:22.230 --> 01:45:25.290 username = Malan@Harvard.edu. 01:45:25.290 --> 01:45:27.720 And if-- and you'll see this in problem set 7-- 01:45:27.720 --> 01:45:31.020 db.execute returns one or more rows from the table-- 01:45:31.020 --> 01:45:33.750 that's how you can actually select data using Python code-- 01:45:33.750 --> 01:45:35.670 and the number of rows equals 1, because it 01:45:35.670 --> 01:45:39.210 selected Malan@Harvard.edu, you could log in 01:45:39.210 --> 01:45:45.660 as me, because you very simply tricked the computer into executing code 01:45:45.660 --> 01:45:47.400 that you injected. 01:45:47.400 --> 01:45:50.900 And the biggest takeaway here is that this is a 100% solvable problem. 01:45:50.900 --> 01:45:53.670 You just need to know that it is solved, and not 01:45:53.670 --> 01:45:56.760 write code like this using f-strings in Python, but instead 01:45:56.760 --> 01:46:02.370 use library code like db.execute, or after this course, anything like it. 01:46:02.370 --> 01:46:05.290 With that said, we'll end with a joke. 01:46:17.475 --> 01:46:18.130 OK. 01:46:18.130 --> 01:46:19.300 That's it for CS50. 01:46:19.300 --> 01:46:21.480 We'll see you next time.
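The injection walked through in the lecture can be reproduced end to end with Python's built-in sqlite3 module, used here in place of the CS50 library (so db.execute below is sqlite3's, not CS50's, and the users table, row, and password are made up for illustration):

```python
import sqlite3

# A toy login database with one registered user.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'secret')")

# The attacker's input: an email, a single quote, then dash dash (a comment).
username = "malan@harvard.edu'--"
password = "totally wrong guess"

# BAD: the f-string splices the input into the SQL itself; the stray quote
# closes the string literal and -- comments out the entire password check.
query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
hacked = db.execute(query).fetchall()  # 1 row: logged in without the password

# GOOD: placeholders treat the input as plain data, quote and all,
# so no row matches and the login attempt fails.
safe = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()  # 0 rows
```

The only difference between the two queries is who builds the SQL: the f-string lets the user's keystrokes become code, while the placeholder version leaves that job to the library.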