WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.988 [MUSIC PLAYING] 00:01:01.790 --> 00:01:03.500 SPEAKER 1: All right. 00:01:03.500 --> 00:01:09.590 This is CS50, and this is week 7 on the day before All Hallows' Eve. 00:01:09.590 --> 00:01:13.760 Today, we will introduce yet another language, the goal of which 00:01:13.760 --> 00:01:16.430 is not to introduce another language for language's sake, 00:01:16.430 --> 00:01:20.510 but to really begin to emphasize that when it comes to the world of software 00:01:20.510 --> 00:01:23.390 programming, engineering more generally, there's often 00:01:23.390 --> 00:01:25.170 different tools for different jobs. 00:01:25.170 --> 00:01:28.632 And if you were to try to use C to solve all of our future problems, 00:01:28.632 --> 00:01:30.590 it would actually be painful, as you've already 00:01:30.590 --> 00:01:35.360 seen how much more pleasant it is to solve certain problems with Python. 00:01:35.360 --> 00:01:37.460 But today, quite quickly, we'll realize that even 00:01:37.460 --> 00:01:40.440 Python's actually not the most pleasant way to solve a lot of problems, 00:01:40.440 --> 00:01:43.100 especially when it comes to data, data at scale. 00:01:43.100 --> 00:01:45.830 And, in fact, as we transition in the final weeks of CS50 00:01:45.830 --> 00:01:49.130 to the world of web programming, and if you so choose with your final project 00:01:49.130 --> 00:01:51.500 mobile programming, you'll actually need more tools 00:01:51.500 --> 00:01:54.150 in your tool kit than C and Python alone. 00:01:54.150 --> 00:01:57.620 In fact, today we'll introduce a database-centric language called S-Q-L, 00:01:57.620 --> 00:01:58.590 or SQL. 00:01:58.590 --> 00:02:02.400 Next week, we'll explore markup languages, like HTML and CSS. 00:02:02.400 --> 00:02:03.970 A bit of JavaScript, as well. 00:02:03.970 --> 00:02:07.290 And then we'll synthesize all of this together at the end of the class, 00:02:07.290 --> 00:02:10.360 as some of you might, for your final projects, as well. 00:02:10.360 --> 00:02:13.120 But, before we do that, and talk about data, 00:02:13.120 --> 00:02:15.310 let's actually start to gather some data. 00:02:15.310 --> 00:02:19.770 So if you could visit this URL here on your phone, or a laptop. 00:02:19.770 --> 00:02:23.460 Or, if easier, here's a bar code version thereof. 00:02:23.460 --> 00:02:25.980 You can point your camera at this bar code here, 00:02:25.980 --> 00:02:29.220 and it's going to pull up a relatively short Google form that's 00:02:29.220 --> 00:02:32.040 just going to ask you a couple of questions, 00:02:32.040 --> 00:02:35.070 so that we can collect some actual live data 00:02:35.070 --> 00:02:37.800 and actually play around with it, initially, in Python. 00:02:37.800 --> 00:02:41.020 So if you go to that URL there, you'll see a Google form. 00:02:41.020 --> 00:02:44.577 And if you haven't been able to scan it quite yet, that's fine. 00:02:44.577 --> 00:02:47.160 Just kind of look over the shoulder of the person next to you. 00:02:47.160 --> 00:02:49.960 But you, or they, will see a little something like this. 00:02:49.960 --> 00:02:53.130 So among the questions will be, which is your favorite language? 00:02:53.130 --> 00:02:56.160 On the list, only thus far, is Scratch, C, and Python, 00:02:56.160 --> 00:02:58.350 and below that you'll see another question asking 00:02:58.350 --> 00:03:02.960 about your favorite problem across the problem sets thus far. 00:03:02.960 --> 00:03:04.710 Each of them is radio buttons, which means 00:03:04.710 --> 00:03:08.800 you'll be able to select one button for each of those questions. 00:03:08.800 --> 00:03:11.310 And, ultimately, what's going to be nice is 00:03:11.310 --> 00:03:14.132 that if you've never used Google Forms before as an administrator, 00:03:14.132 --> 00:03:17.340 all of that data is actually going to end up being, not only in Google Forms, 00:03:17.340 --> 00:03:19.525 but also, if you so choose, in Google Spreadsheets. 00:03:19.525 --> 00:03:22.650 Which is an example, of course, of a spreadsheet software similar in spirit 00:03:22.650 --> 00:03:26.940 to Apple Numbers on Macs or Microsoft Excel on different platforms. 00:03:26.940 --> 00:03:29.220 And Google Sheets is going to allow us to store 00:03:29.220 --> 00:03:30.960 all of that data in rows and columns. 00:03:30.960 --> 00:03:33.090 And so, since Google made both of these products, 00:03:33.090 --> 00:03:34.690 they integrated one with the other. 00:03:34.690 --> 00:03:39.930 So, in fact, if I, on my laptop here in another window, open 00:03:39.930 --> 00:03:41.280 this up-- let me flip over. 00:03:41.280 --> 00:03:42.660 Here's the live spreadsheet. 00:03:42.660 --> 00:03:45.510 And we'll see that the very first person who buzzed in really 00:03:45.510 --> 00:03:48.090 liked Python, as did a lot of other people thereafter. 00:03:48.090 --> 00:03:52.390 But, Hello, World was your favorite in Python, which is great. 00:03:52.390 --> 00:03:54.330 There's a couple of votes for Scratch here. 00:03:54.330 --> 00:03:58.110 If we scroll down, there's one hold out for C, who really liked Credit, 00:03:58.110 --> 00:03:59.320 in this case here. 00:03:59.320 --> 00:04:03.250 And if we scroll down further, it looks like Python and Scratch are in there. 00:04:03.250 --> 00:04:05.727 A few more C's, and so on and so forth. 00:04:05.727 --> 00:04:08.310 So suppose that we wanted to, actually, now analyze this data. 00:04:08.310 --> 00:04:11.587 Now, any of you who have used Excel, or Numbers, or Google Spreadsheets 00:04:11.587 --> 00:04:13.920 know that it comes with built in functions and formulas, 00:04:13.920 --> 00:04:17.040 and we can do all of that, but suppose there was a huge amount of data. 00:04:17.040 --> 00:04:19.920 Or, suppose that this data was not coming in via Google Sheets, 00:04:19.920 --> 00:04:23.130 but via your own web application, or your own mobile application, 00:04:23.130 --> 00:04:25.530 and it's just ending up in some kind of spreadsheet. 00:04:25.530 --> 00:04:28.530 Well, wouldn't it be nice if we could actually analyze that kind of data 00:04:28.530 --> 00:04:29.250 with code. 00:04:29.250 --> 00:04:30.570 And, in fact, you can. 00:04:30.570 --> 00:04:35.220 And the simplest way to store a bunch of data isn't with anything fancy, 00:04:35.220 --> 00:04:39.450 in fact, but just a literal text file, something ending in .txt, 00:04:39.450 --> 00:04:41.520 or, maybe more commonly, .csv. 00:04:41.520 --> 00:04:44.400 In fact, what we'd call a flat file database 00:04:44.400 --> 00:04:48.820 is literally just code for a text file containing all of your data. 00:04:48.820 --> 00:04:53.040 But your data typically has delimiters that separate some values from others. 00:04:53.040 --> 00:04:55.050 And, in fact, the most common approach, daresay, 00:04:55.050 --> 00:04:59.430 is to use, indeed, comma-separated values, or, CSV files. 00:04:59.430 --> 00:05:04.080 And what that means is that in simple text alone, just asking, maybe Unicode, 00:05:04.080 --> 00:05:06.960 you sort of mimic the idea of rows and columns 00:05:06.960 --> 00:05:09.570 by using newline characters, like backslash n, 00:05:09.570 --> 00:05:11.260 to represent row after row after row. 00:05:11.260 --> 00:05:12.510 That's pretty straightforward. 00:05:12.510 --> 00:05:14.280 To just move the cursor to the next line. 00:05:14.280 --> 00:05:18.930 And because text files aren't graphical, so there's no notion of vertical bars 00:05:18.930 --> 00:05:21.510 that you can put between what you and I think of as columns, 00:05:21.510 --> 00:05:24.630 you literally just use a comma, or some other such symbol 00:05:24.630 --> 00:05:27.212 to separate one value from another. 00:05:27.212 --> 00:05:28.920 So, in fact, let me go ahead and do this. 00:05:28.920 --> 00:05:31.837 Even if you've never done this before, it turns out, in Google Sheets, 00:05:31.837 --> 00:05:35.130 and also Excel and Numbers, you can export your data, 00:05:35.130 --> 00:05:38.880 not in some proprietary Apple or Microsoft or Google format, 00:05:38.880 --> 00:05:43.380 but in a globally portable format known as .csv. 00:05:43.380 --> 00:05:44.970 So let me go up to file. 00:05:44.970 --> 00:05:47.037 I will go to download. 00:05:47.037 --> 00:05:49.620 And notice, I can export this in a bunch of different formats, 00:05:49.620 --> 00:05:52.722 but the one I care about for today is going to be .csv. 00:05:52.722 --> 00:05:55.180 On my Mac, that's going to put it into my downloads folder. 00:05:55.180 --> 00:05:57.722 And what I'm going to go ahead here and do, in just a moment, 00:05:57.722 --> 00:06:00.690 is let me go ahead it and open up VS Code. 00:06:00.690 --> 00:06:03.300 So, in short, I've downloaded this file. 00:06:03.300 --> 00:06:04.980 Google gave it a long default file name. 00:06:04.980 --> 00:06:06.900 I've renamed it to favorites.csv. 00:06:06.900 --> 00:06:10.200 Let me go ahead and actually open it now in VS Code here. 00:06:10.200 --> 00:06:12.480 And this now is the exact same data. 00:06:12.480 --> 00:06:14.800 If you're still submitting the form, I'm afraid you didn't make the cut off. 00:06:14.800 --> 00:06:17.220 So we're not going to see more data ending up in this file 00:06:17.220 --> 00:06:19.260 because I've literally downloaded a copy of it. 00:06:19.260 --> 00:06:21.240 But, indeed, if I scroll through this, we'll 00:06:21.240 --> 00:06:25.050 see that it's got some 399 lines of data. 00:06:25.050 --> 00:06:28.080 Or, technically, 398 because the very first one 00:06:28.080 --> 00:06:30.990 is what we'd call a header row, which just describes 00:06:30.990 --> 00:06:32.820 what each of these columns means. 00:06:32.820 --> 00:06:35.670 Now notice that even though the Google Sheets interface actually 00:06:35.670 --> 00:06:39.780 put all of this data in proper graphical rows and columns, 00:06:39.780 --> 00:06:42.510 you can still see the rows because they're just separate lines, 00:06:42.510 --> 00:06:45.900 and you can kind of see the columns by focusing on the comma 00:06:45.900 --> 00:06:47.970 here, the comma here, the comma here. 00:06:47.970 --> 00:06:50.820 And then, also, the comma here, the comma here, 00:06:50.820 --> 00:06:52.540 the comma here, and so forth. 00:06:52.540 --> 00:06:53.910 Now, a quick subtlety. 00:06:53.910 --> 00:06:59.490 Hello, World is a two-word problem name, and it itself has a comma in it. 00:06:59.490 --> 00:07:06.090 What's to stop me from confusing the comma in Hello, World from the commas 00:07:06.090 --> 00:07:09.410 that Google, apparently, inserted into this file? 00:07:09.410 --> 00:07:09.910 Yeah. 00:07:09.910 --> 00:07:11.925 So, automatically, you all did not type this. 00:07:11.925 --> 00:07:13.300 We did not type it into the form. 00:07:13.300 --> 00:07:16.750 But Google is smart enough, as is Apple and Microsoft, when 00:07:16.750 --> 00:07:20.958 they export CSVs to somehow escape seemingly dangerous characters, 00:07:20.958 --> 00:07:22.750 or characters that could just break things. 00:07:22.750 --> 00:07:25.180 And, in this case, a convention in the CSV world 00:07:25.180 --> 00:07:28.340 is just to quote any strings that themselves have commas 00:07:28.340 --> 00:07:32.150 so that any program you're writing that reads this file doesn't get confused. 00:07:32.150 --> 00:07:35.260 So the only thing that came in automatically from Google 00:07:35.260 --> 00:07:37.690 is just this timestamp here, based on the time 00:07:37.690 --> 00:07:39.370 of day in our own local timezone. 00:07:39.370 --> 00:07:40.970 That was added automatically. 00:07:40.970 --> 00:07:43.100 So we have three, and not just two columns. 00:07:43.100 --> 00:07:45.460 So with that said, we have three columns: 00:07:45.460 --> 00:07:47.980 timestamp, language, and problem. 00:07:47.980 --> 00:07:50.200 The latter two of which came from all of you. 00:07:50.200 --> 00:07:52.840 Let's actually write some code that analyzes this data. 00:07:52.840 --> 00:07:55.720 Let's figure out what the most popular something or other is here. 00:07:55.720 --> 00:07:57.970 So I'll close the CSV file. 00:07:57.970 --> 00:08:02.140 Let me go ahead and use what seems to be folks favorite language, thus far, 00:08:02.140 --> 00:08:05.320 and write a file called favorites.py, for instance. 00:08:05.320 --> 00:08:08.440 And I'm going to use Python to open that CSV file, 00:08:08.440 --> 00:08:12.180 analyze it, crunch some numbers, and output some statistics, if you will. 00:08:12.180 --> 00:08:15.920 So the easiest way to manipulate CSV files, as you might have gleaned, 00:08:15.920 --> 00:08:19.940 is not to just open the file yourself, look for commas, allocate memory, 00:08:19.940 --> 00:08:22.220 or anything like that in C. In Python, you 00:08:22.220 --> 00:08:27.230 can literally just import CSV, which is a Python module that 00:08:27.230 --> 00:08:29.570 gives you CSV related functionality. 00:08:29.570 --> 00:08:32.390 In Python, there's a bunch of ways to open files. 00:08:32.390 --> 00:08:35.960 One way to open a file is like this in Python. 00:08:35.960 --> 00:08:40.039 You can say file equals open, similar to fopen in C. 00:08:40.039 --> 00:08:44.600 You can specify the name of the file you want to open, like favorites.csv, 00:08:44.600 --> 00:08:48.330 and you can explicitly say you want to open the file for reading, 00:08:48.330 --> 00:08:50.240 quote unquote, "r," just like fopen. 00:08:50.240 --> 00:08:53.280 Strictly speaking, in Python read is implied. 00:08:53.280 --> 00:08:55.580 So if you omit the second argument, it will still work. 00:08:55.580 --> 00:08:58.490 But for parity with fopen, I'll do the same here. 00:08:58.490 --> 00:09:03.020 Then you can, maybe, do something with file, and then, at the end, 00:09:03.020 --> 00:09:05.070 you can close the file like this. 00:09:05.070 --> 00:09:08.840 So in Python, there really is a mapping between fopen and fclose. 00:09:08.840 --> 00:09:12.560 But in Python the functions are called open and close for short. 00:09:12.560 --> 00:09:16.130 But a more common way, a more Pythonic way, so to speak, i.e. 00:09:16.130 --> 00:09:19.040 the way people tend to do it in Python, is actually 00:09:19.040 --> 00:09:22.610 to use a keyword that didn't exist in C, where you instead say with. 00:09:22.610 --> 00:09:26.990 And you say, with open this file name as a specific file name, 00:09:26.990 --> 00:09:28.970 and then indent it inside of that. 00:09:28.970 --> 00:09:30.950 Now you can do whatever you want with the file. 00:09:30.950 --> 00:09:33.350 And the implication of using with, even though it's not 00:09:33.350 --> 00:09:36.740 obvious from the keyword itself, is that the file will be automatically closed 00:09:36.740 --> 00:09:37.770 for you later. 00:09:37.770 --> 00:09:39.650 So this is just a minor Python convenience 00:09:39.650 --> 00:09:42.020 so that you don't have to remember to close the file. 00:09:42.020 --> 00:09:44.030 Unless something goes wrong, it will just 00:09:44.030 --> 00:09:47.690 close automatically as soon as you're outside of this with block. 00:09:47.690 --> 00:09:49.460 So how do I go about reading a CSV? 00:09:49.460 --> 00:09:52.940 The simplest way is to actually give yourself a variable called, 00:09:52.940 --> 00:09:53.900 maybe, reader. 00:09:53.900 --> 00:09:56.780 Like if I want to read the CSV, I'll call my variable reader. 00:09:56.780 --> 00:10:02.270 Set that equal to the return value of a function that comes with Python CSV 00:10:02.270 --> 00:10:07.080 module called reader, in lowercase, and you just pass in the file name. 00:10:07.080 --> 00:10:09.560 So the first line there on line three opens the file 00:10:09.560 --> 00:10:11.570 and gives me access to the bytes there in. 00:10:11.570 --> 00:10:14.450 Line four now actually uses this library that 00:10:14.450 --> 00:10:18.260 comes with Python to just go read it for me, figure out where the commas are, 00:10:18.260 --> 00:10:21.690 so that it can hand me, line after line, the data in the file. 00:10:21.690 --> 00:10:24.320 Now the first piece of data in the file, though, was what? 00:10:24.320 --> 00:10:27.080 What's the very first row? 00:10:27.080 --> 00:10:30.692 So it was that header row with timestamp, language, problem. 00:10:30.692 --> 00:10:32.900 I actually want to skip that because that's not data. 00:10:32.900 --> 00:10:36.870 That's what we might call metadata that's describing my actual data. 00:10:36.870 --> 00:10:39.670 So one way to deal with this, I can literally just say next reader. 00:10:39.670 --> 00:10:42.170 And next, even though the semantics here are a little weird, 00:10:42.170 --> 00:10:46.100 it just means, no, no, give me the next line from that reader instead 00:10:46.100 --> 00:10:47.550 of the first by default. 00:10:47.550 --> 00:10:51.650 And now inside of this with block, I can do something like this. 00:10:51.650 --> 00:10:56.810 For each row in that reader, let's do something super simple initially. 00:10:56.810 --> 00:11:00.380 Let's just print out row bracket one. 00:11:00.380 --> 00:11:01.910 So row bracket one. 00:11:01.910 --> 00:11:03.330 So what's going on here? 00:11:03.330 --> 00:11:07.040 Well, the CSV reader in Python is going to return to you, inside 00:11:07.040 --> 00:11:09.990 of this loop, one row after another. 00:11:09.990 --> 00:11:12.080 Each of those rows, though, has how many columns? 00:11:12.080 --> 00:11:12.920 Three. 00:11:12.920 --> 00:11:15.470 The timestamp, the language, and the problem. 00:11:15.470 --> 00:11:19.430 And just like in C, our lists in Python are zero index. 00:11:19.430 --> 00:11:20.720 So zero, one, two. 00:11:20.720 --> 00:11:25.350 So if I'm printing out row bracket one, that's the second, or middle, column. 00:11:25.350 --> 00:11:29.220 So what's this going to print row after row? 00:11:29.220 --> 00:11:32.340 Each of the languages that you all replied with. 00:11:32.340 --> 00:11:32.940 So that's all. 00:11:32.940 --> 00:11:34.440 This isn't doing any kind of analytics yet. 00:11:34.440 --> 00:11:36.440 I'm just going through the motions to, at least, 00:11:36.440 --> 00:11:37.900 print out some data of interest. 00:11:37.900 --> 00:11:38.692 So let me run this. 00:11:38.692 --> 00:11:41.550 Python of favorites.py, enter. 00:11:41.550 --> 00:11:43.270 And it happens super fast. 00:11:43.270 --> 00:11:47.170 But if I scroll back in my terminal window, there is all of that raw data. 00:11:47.170 --> 00:11:49.472 So this is to say, once you know the function names, 00:11:49.472 --> 00:11:51.180 once you know the keywords, it's actually 00:11:51.180 --> 00:11:54.090 pretty simple in Python to just get up and running with a file, 00:11:54.090 --> 00:11:56.070 and start looking at the data therein. 00:11:56.070 --> 00:12:02.910 But it turns out that it's a little sub-optimal to use the reader alone. 00:12:02.910 --> 00:12:05.400 It turns out there's better ways to do this, and let 00:12:05.400 --> 00:12:06.780 me make this clear as follows. 00:12:06.780 --> 00:12:09.660 I don't strictly need a variable, but let me actually declare 00:12:09.660 --> 00:12:13.560 a variable called favorite, set it equal to row bracket one, 00:12:13.560 --> 00:12:15.630 and then print out that favorite value. 00:12:15.630 --> 00:12:17.438 This is not doing anything new. 00:12:17.438 --> 00:12:19.230 It's just declaring an additional variable, 00:12:19.230 --> 00:12:22.380 but I wanted to highlight the fact that I'm just kind of trusting that row 00:12:22.380 --> 00:12:24.630 bracket one is the problem-- 00:12:24.630 --> 00:12:26.430 or, is the language I care about. 00:12:26.430 --> 00:12:27.660 Is language, is language. 00:12:27.660 --> 00:12:30.790 But suppose one of you, or I, go into the Google spreadsheet, 00:12:30.790 --> 00:12:32.668 and like anyone might with a spreadsheet, 00:12:32.668 --> 00:12:34.210 you might start moving things around. 00:12:34.210 --> 00:12:36.947 And you might swap some of the columns left and right. 00:12:36.947 --> 00:12:39.280 You might delete one of the columns, add something else. 00:12:39.280 --> 00:12:41.830 In short, spreadsheets are arguably fragile 00:12:41.830 --> 00:12:44.590 in that it's pretty easy in the GUI, the graphical user interface, 00:12:44.590 --> 00:12:46.040 to change them around. 00:12:46.040 --> 00:12:50.080 And so my code in Python, accordingly, is, arguably, fragile 00:12:50.080 --> 00:12:53.770 because I'm just hoping that row bracket one is always the data 00:12:53.770 --> 00:12:54.890 that I care about. 00:12:54.890 --> 00:12:56.450 So what would be marginally better? 00:12:56.450 --> 00:12:59.110 Well, let's actually use that header row instead. 00:12:59.110 --> 00:13:02.920 And more common, arguably, in Python, is not to use a simple reader, 00:13:02.920 --> 00:13:05.650 but, instead, what we would call a Dictionary Reader. 00:13:05.650 --> 00:13:09.190 I'm going to change this to DictReader, capital D, capital R, 00:13:09.190 --> 00:13:12.010 and that's it for the change to line four. 00:13:12.010 --> 00:13:14.197 On line five, I'm going to get rid of the next line 00:13:14.197 --> 00:13:16.030 because I don't want to skip the header now. 00:13:16.030 --> 00:13:19.210 What DictReader does, which reader does not, 00:13:19.210 --> 00:13:22.690 is it automatically analyzes that first line in the file, 00:13:22.690 --> 00:13:25.510 figures out what are all of your columns called, 00:13:25.510 --> 00:13:30.220 and, thereafter, when you iterate over this reader what each of your rows 00:13:30.220 --> 00:13:35.650 now is-- it's no longer a list of size three, bracket zero, bracket 00:13:35.650 --> 00:13:36.490 one, bracket two. 00:13:36.490 --> 00:13:39.670 Each row that you get back in this loop on line five 00:13:39.670 --> 00:13:44.890 is now, wonderfully, a dictionary instead, the keys of which 00:13:44.890 --> 00:13:46.090 are from the header field. 00:13:46.090 --> 00:13:47.650 Timestamp, language, problem. 00:13:47.650 --> 00:13:52.460 The values of which are whatever each of you typed in again and again. 00:13:52.460 --> 00:13:53.890 So how do I change the code? 00:13:53.890 --> 00:13:57.800 I no longer have these numeric indices because row is no longer a list, 00:13:57.800 --> 00:13:58.850 it's a dictionary. 00:13:58.850 --> 00:14:02.350 So if I literally want the language that you typed in, 00:14:02.350 --> 00:14:06.520 I can use row bracket, quote unquote, "language," treating row as a Dict 00:14:06.520 --> 00:14:09.010 not as a list anymore. 00:14:09.010 --> 00:14:12.700 Now this is, again, more robust because if you move the columns around on me, 00:14:12.700 --> 00:14:15.340 code is still going to work, at least so long as you don't 00:14:15.340 --> 00:14:17.570 rename the columns at the very top. 00:14:17.570 --> 00:14:20.180 So that's still one assumption I'm making. 00:14:20.180 --> 00:14:20.680 All right. 00:14:20.680 --> 00:14:23.770 Well, beyond that, what could I actually do here? 00:14:23.770 --> 00:14:27.422 Well, just to be clear, I don't, strictly speaking, need this variable. 00:14:27.422 --> 00:14:28.880 So no need to highlight this again. 00:14:28.880 --> 00:14:30.755 So let me just simplify the code a little bit 00:14:30.755 --> 00:14:33.220 and get rid of this variable, and instead just print out 00:14:33.220 --> 00:14:34.960 the language in that row. 00:14:34.960 --> 00:14:40.360 As a quick check, let me rerun Python of favorites.py 00:14:40.360 --> 00:14:43.430 and it seems to still work, even though there's a lot of output. 00:14:43.430 --> 00:14:45.610 We're not going to check all 399 lines, but it 00:14:45.610 --> 00:14:48.160 looks like it printed out all of those popular languages. 00:14:48.160 --> 00:14:50.120 Well, what more can we actually do? 00:14:50.120 --> 00:14:52.420 Well, let's actually now start to crunch some numbers 00:14:52.420 --> 00:14:55.450 and figure out how many people like Scratch, how many people like 00:14:55.450 --> 00:14:56.890 C, how many people like Python. 00:14:56.890 --> 00:14:58.580 Let's start to analyze this. 00:14:58.580 --> 00:15:02.290 So maybe the most pedantic way to do this in Python 00:15:02.290 --> 00:15:07.113 would be to just create some variables and do all of the counting myself. 00:15:07.113 --> 00:15:08.780 So let me actually go ahead and do this. 00:15:08.780 --> 00:15:10.113 Let me delete this code for now. 00:15:10.113 --> 00:15:12.370 And after opening the reader, let me create 00:15:12.370 --> 00:15:14.980 a variable called Scratch and set it equal to zero, 00:15:14.980 --> 00:15:19.060 a variable called C, set it equal to zero, a variable called Python, 00:15:19.060 --> 00:15:23.290 set it equal to zero, just so that I have three counters, similar to what 00:15:23.290 --> 00:15:26.620 we did in week zero, week one, anytime we counted anything Honestly, 00:15:26.620 --> 00:15:29.600 this looks a little stupid, and it's not wrong. 00:15:29.600 --> 00:15:31.600 In fact, this is how you would do it in Python, 00:15:31.600 --> 00:15:33.910 but Python also has some clever syntax. 00:15:33.910 --> 00:15:38.800 If you want to be really cool, you can do Scratch comma, c comma, 00:15:38.800 --> 00:15:43.932 Python equals zero, zero, zero, and do all three at once if you like that. 00:15:43.932 --> 00:15:47.140 So it tightens up the code a little bit, even though the effect is ultimately 00:15:47.140 --> 00:15:47.710 the same. 00:15:47.710 --> 00:15:52.010 Now let's go ahead and iterate over this file row by row by row. 00:15:52.010 --> 00:15:54.730 And if we see Scratch, increment the Scratch counter. 00:15:54.730 --> 00:15:56.320 If we see C, increment the C counter. 00:15:56.320 --> 00:15:59.000 If we see Python, increment the Python counter, instead. 00:15:59.000 --> 00:16:00.080 So how do I do this? 00:16:00.080 --> 00:16:03.220 Well, I could do something like, for each row in the reader, 00:16:03.220 --> 00:16:07.180 just like before, let me go ahead and get that favorite variable 00:16:07.180 --> 00:16:11.230 and set it equal to the language in that dictionary that just came back 00:16:11.230 --> 00:16:12.740 as part of this iteration. 00:16:12.740 --> 00:16:19.060 And now I can do something like this, if favorite equals, equals "Scratch" then, 00:16:19.060 --> 00:16:26.710 with my indentation, I can do Scratch plus equals one, elif favorite equals, 00:16:26.710 --> 00:16:27.670 equals "C." 00:16:27.670 --> 00:16:31.290 I can go ahead and increment C plus equals one. 00:16:31.290 --> 00:16:34.440 Else-- and I don't think I want else, just to be safe. 00:16:34.440 --> 00:16:37.410 Even though we only saw three options, I think just to be super safe, 00:16:37.410 --> 00:16:41.040 elif favorite equals, equals "Python," then let's go ahead 00:16:41.040 --> 00:16:42.210 and increment Python. 00:16:42.210 --> 00:16:45.360 The one thing we can't do that we could do in C is the plus, plus trick. 00:16:45.360 --> 00:16:47.890 So plus equals is as close as we can get. 00:16:47.890 --> 00:16:48.900 So what have I done? 00:16:48.900 --> 00:16:52.110 Inside of this loop, I've just incremented each of those counters 00:16:52.110 --> 00:16:56.230 by one if I see Scratch, or C, or Python, again and again. 00:16:56.230 --> 00:16:56.730 All right. 00:16:56.730 --> 00:17:00.060 Outside of the loop and outside of the with block, 00:17:00.060 --> 00:17:02.370 because once I'm done reading all of the rows, 00:17:02.370 --> 00:17:06.493 I might as well let the with clause close the file automatically, 00:17:06.493 --> 00:17:08.410 let's just go ahead and print out some values. 00:17:08.410 --> 00:17:13.560 So I'm going to go ahead and print out Scratch colon, and then inside of there 00:17:13.560 --> 00:17:16.290 let's print out whatever the value of that variable is. 00:17:16.290 --> 00:17:19.770 Let's then go ahead and print out C colon, 00:17:19.770 --> 00:17:22.030 and then whatever the value of the C variable is. 00:17:22.030 --> 00:17:26.160 And then, lastly, let's print out Python colon and whatever the value is. 00:17:26.160 --> 00:17:29.117 And now I made three typos here. 00:17:29.117 --> 00:17:30.700 This is not going to print the values. 00:17:30.700 --> 00:17:32.320 What do I need to do? 00:17:32.320 --> 00:17:34.030 Sorry? 00:17:34.030 --> 00:17:36.560 Yeah, I'm missing the f in front of each of these strings 00:17:36.560 --> 00:17:38.470 so that I actually get formatted. 00:17:38.470 --> 00:17:41.800 And that is to say, the variables get interpolated inside of the curly braces 00:17:41.800 --> 00:17:43.220 like we saw last week. 00:17:43.220 --> 00:17:43.720 All right. 00:17:43.720 --> 00:17:45.928 So, honestly, that's kind of a decent amount of code, 00:17:45.928 --> 00:17:49.630 like 18 or so lines of code, just to count the number of responses. 00:17:49.630 --> 00:17:51.070 But let's see if I got it right. 00:17:51.070 --> 00:17:55.330 Let me open my terminal and run Python of favorites.py, 00:17:55.330 --> 00:17:58.420 and now I see, by an overwhelming amount, 00:17:58.420 --> 00:18:04.660 folks like Python, followed by C, followed by Scratch, in that order. 00:18:04.660 --> 00:18:06.910 But that was a decent amount of code to have to write, 00:18:06.910 --> 00:18:10.540 and it turns out there's actually better ways of doing this, more Pythonic ways, 00:18:10.540 --> 00:18:12.530 more programmatic ways of doing this. 00:18:12.530 --> 00:18:15.760 And if we think back to one of our universal data structures. 00:18:15.760 --> 00:18:18.910 Think back to how we preached last week, and the week before, 00:18:18.910 --> 00:18:21.280 the value of these dictionaries, more generally. 00:18:21.280 --> 00:18:23.317 Like the CSV module, clearly a fan of them, 00:18:23.317 --> 00:18:25.150 because that's what DictReader is giving us, 00:18:25.150 --> 00:18:27.220 dictionary, after dictionary, after dictionary. 00:18:27.220 --> 00:18:29.290 And this was the general idea of a dictionary. 00:18:29.290 --> 00:18:31.840 It associates keys with values, much like you 00:18:31.840 --> 00:18:34.940 might in a two-column table on a chalkboard, or the like. 00:18:34.940 --> 00:18:39.340 Well, this is what I need if I want to keep track of how many people said 00:18:39.340 --> 00:18:42.250 Scratch, and C, and Python. 00:18:42.250 --> 00:18:46.900 If I had a piece of chalk, I could just write Scratch, and C, 00:18:46.900 --> 00:18:48.670 and Python as three keys. 00:18:48.670 --> 00:18:51.910 And then with my chalk and, maybe, an eraser, keep track of the values. 00:18:51.910 --> 00:18:52.900 They all start at zero. 00:18:52.900 --> 00:18:54.550 Then, I add one, add two, add three. 00:18:54.550 --> 00:18:56.800 Or, maybe, on a chalkboard, I actually use hash marks. 00:18:56.800 --> 00:19:00.910 But a dictionary is kind of the perfect data structure for just associating 00:19:00.910 --> 00:19:05.770 something like Scratch, C, Python, with something else-- keys with values, 00:19:05.770 --> 00:19:06.890 respectively. 00:19:06.890 --> 00:19:09.250 So this is going to look a little weirder, 00:19:09.250 --> 00:19:13.040 but it's going to be pretty conventional to do something like this instead. 00:19:13.040 --> 00:19:15.550 Let me go back into VS Code. 00:19:15.550 --> 00:19:17.360 I'll close my terminal window. 00:19:17.360 --> 00:19:20.680 And let me go ahead and actually delete a lot of this 00:19:20.680 --> 00:19:22.630 because I can simplify this further. 00:19:22.630 --> 00:19:26.500 Let me go ahead and now give myself, maybe, a variable just, 00:19:26.500 --> 00:19:30.698 generically, called counts, and set that equal to an empty dictionary. 00:19:30.698 --> 00:19:32.740 And you can actually do this in a couple of ways. 00:19:32.740 --> 00:19:34.727 You can literally write out dict with nothing 00:19:34.727 --> 00:19:37.060 in parentheses, which will give you an empty dictionary, 00:19:37.060 --> 00:19:39.602 like the picture on the board, but a little more conventional 00:19:39.602 --> 00:19:43.450 is to just use two keystrokes and use two curly braces with nothing inside. 00:19:43.450 --> 00:19:47.390 That gives me an empty dictionary like this picture here. 00:19:47.390 --> 00:19:49.160 Now my loop is going to be the same. 00:19:49.160 --> 00:19:52.000 I'm going to do, for each row in the reader, 00:19:52.000 --> 00:19:55.550 I'm going to go ahead and grab the favorite language for this person. 00:19:55.550 --> 00:19:58.652 So favorite equals row, quote unquote, "language." 00:19:58.652 --> 00:20:00.860 But now I'm going to do something a little different. 00:20:00.860 --> 00:20:02.680 There's two scenarios here. 00:20:02.680 --> 00:20:07.690 Either, I have seen this language before and I want to increment it by one. 00:20:07.690 --> 00:20:11.950 Or, if I've never seen this language before because the loop just started, 00:20:11.950 --> 00:20:16.950 what should I initialize the count for this language to? 00:20:16.950 --> 00:20:18.840 So, one because I've only seen it once. 00:20:18.840 --> 00:20:19.710 Exactly. 00:20:19.710 --> 00:20:22.470 So now let me go ahead and do exactly that. 00:20:22.470 --> 00:20:26.370 If this current favorite, that I am seeing in the row, 00:20:26.370 --> 00:20:29.997 is already in the counts dictionary-- and in Python, 00:20:29.997 --> 00:20:31.830 this is literally how you ask that question. 00:20:31.830 --> 00:20:33.840 If favorite in counts. 00:20:33.840 --> 00:20:38.250 That will check, is there a key with this name, Scratch, C, or Python? 00:20:38.250 --> 00:20:42.390 If so, go into that location in the counts dictionary, 00:20:42.390 --> 00:20:46.500 index into it at the favorite location, because favorite is a string. 00:20:46.500 --> 00:20:49.140 It's either, quote unquote, "Scratch," "C," or "Python," 00:20:49.140 --> 00:20:53.490 and just increment it by one like this. 00:20:53.490 --> 00:20:54.780 Else, as you noted. 00:20:54.780 --> 00:21:00.360 If it's not there implicitly, then counts bracket favorite should probably 00:21:00.360 --> 00:21:03.210 be set equal-- not to zero because we're literally 00:21:03.210 --> 00:21:04.920 are seeing it in the current row. 00:21:04.920 --> 00:21:06.660 Let's initialize it to one. 00:21:06.660 --> 00:21:08.460 And, thereafter, if we see it again, it's 00:21:08.460 --> 00:21:11.610 going to be plus equals one, plus equals one, plus equals one. 00:21:11.610 --> 00:21:15.330 So now outside of that loop, outside of the with block, let me do this. 00:21:15.330 --> 00:21:18.200 For each favorite in those counts-- 00:21:18.200 --> 00:21:20.920 And this, too, in Python is a trick if you want to iterate over 00:21:20.920 --> 00:21:22.870 all of the keys in a dictionary. 00:21:22.870 --> 00:21:26.560 That is, if you want to iterate over the left-hand column of all of these keys, 00:21:26.560 --> 00:21:30.220 you literally can say, for something in that dictionary. 00:21:30.220 --> 00:21:33.730 So for favorite in counts, this is giving me a variable called favorite 00:21:33.730 --> 00:21:37.420 and updating it automatically top to bottom in that dictionary. 00:21:37.420 --> 00:21:40.600 Let's go ahead and print out an f string that's 00:21:40.600 --> 00:21:44.230 going to say whatever the name of that language is colon, 00:21:44.230 --> 00:21:50.950 and whatever the value of that language is in that there dictionary. 00:21:50.950 --> 00:21:54.040 So, again, logically the only thing that's new is this. 00:21:54.040 --> 00:21:57.940 I'm now using one dictionary instead of three variables 00:21:57.940 --> 00:22:02.170 to keep track of three things, like updating this chalkboard with three 00:22:02.170 --> 00:22:04.043 different things, Scratch, C, Python. 00:22:04.043 --> 00:22:06.460 And the last thing I'm doing, which is a little different, 00:22:06.460 --> 00:22:09.430 is once I have that dictionary, whether there's three languages, 00:22:09.430 --> 00:22:11.597 or, maybe, tomorrow there'll be fourth because we're 00:22:11.597 --> 00:22:12.970 going to introduce SQL today. 00:22:12.970 --> 00:22:16.807 Well, this will iterate over all of those keys and print out the values. 00:22:16.807 --> 00:22:18.640 All right, so if I didn't do anything wrong, 00:22:18.640 --> 00:22:21.940 if I do Python of favorites.py and hit enter. 00:22:21.940 --> 00:22:23.043 There we have it. 00:22:23.043 --> 00:22:25.210 And it happens to be in a different order this time. 00:22:25.210 --> 00:22:28.090 That's because we saw Python first, we then saw Scratch, 00:22:28.090 --> 00:22:31.870 and, eventually, we saw C. But if we wanted to sort these differently, 00:22:31.870 --> 00:22:34.780 we actually could with some different code. 00:22:34.780 --> 00:22:36.170 But, in short, what have we done? 00:22:36.170 --> 00:22:39.340 We've created this kind of structure in memory with three keys, 00:22:39.340 --> 00:22:42.100 Python, C, and Scratch, because each time 00:22:42.100 --> 00:22:47.140 we encounter such a language from you all, we either set our counter to one 00:22:47.140 --> 00:22:50.110 or increment it by one instead. 00:22:50.110 --> 00:22:56.320 Any questions on this code, or this general idea of using dictionaries as, 00:22:56.320 --> 00:23:00.490 like, a little cheat sheet for doing some math in this way? 00:23:00.490 --> 00:23:04.200 Super common paradigm. 00:23:04.200 --> 00:23:04.700 All right. 00:23:04.700 --> 00:23:06.890 Well, let me tweak this a little bit. 00:23:06.890 --> 00:23:10.622 Right now, in my output, we're seeing Python, Scratch and C. 00:23:10.622 --> 00:23:13.580 Maybe, for the sake of discussion, suppose we want to sort this by key. 00:23:13.580 --> 00:23:14.580 We can actually do that. 00:23:14.580 --> 00:23:16.370 Let me close my terminal temporarily. 00:23:16.370 --> 00:23:19.037 And it turns out, in Python, there's a bunch of ways to do this, 00:23:19.037 --> 00:23:21.890 but the simplest way to sort a dictionary by key 00:23:21.890 --> 00:23:24.500 is literally to use a function called sorted 00:23:24.500 --> 00:23:26.970 that comes with Python, that just does it for you. 00:23:26.970 --> 00:23:29.720 And even if you pass it a dictionary, it will sort that dictionary 00:23:29.720 --> 00:23:34.170 by the left-hand column so you can iterate it over alphabetically instead. 00:23:34.170 --> 00:23:36.110 So if I go back now to VS Code. 00:23:36.110 --> 00:23:39.710 If I open my terminal window and I rerun Python on favorites.py, 00:23:39.710 --> 00:23:41.840 now that I've added the sorted call, we should now 00:23:41.840 --> 00:23:45.470 see just because it's sorted alphabetically instead. 00:23:45.470 --> 00:23:49.195 Now that's not that useful, especially if we had lots of languages. 00:23:49.195 --> 00:23:51.320 You probably don't care about it being alphabetized 00:23:51.320 --> 00:23:54.570 as much as you care about it being ranked by which is the most popular, 00:23:54.570 --> 00:23:55.700 which is the least popular. 00:23:55.700 --> 00:23:58.220 And, for that, there's a bunch of ways to do this in Python. 00:23:58.220 --> 00:24:03.410 And, I think, the simplest way to sort by value the right-hand column instead 00:24:03.410 --> 00:24:06.900 of the left-hand column is probably to make this change instead. 00:24:06.900 --> 00:24:09.240 Let me close my terminal temporarily. 00:24:09.240 --> 00:24:14.110 Let me still use the sorted function, which by default sorts by key, 00:24:14.110 --> 00:24:16.750 but let's change it to be as follows. 00:24:16.750 --> 00:24:21.960 Let's change it to sort by a function called counts.get, 00:24:21.960 --> 00:24:24.360 which is a little weird, but this comes back 00:24:24.360 --> 00:24:28.290 to last week's brief discussion of object-oriented programming, or oop. 00:24:28.290 --> 00:24:30.150 Remember, in Python, that almost everything 00:24:30.150 --> 00:24:31.860 is like an object of some sort. 00:24:31.860 --> 00:24:36.930 An int is an object, a dictionary is an object, a string is an object. 00:24:36.930 --> 00:24:40.590 Which is to say that, not only do these things have values like, quote unquote, 00:24:40.590 --> 00:24:44.340 "Hello, World," or 50, these variables, these objects, 00:24:44.340 --> 00:24:46.980 can also have functions built into them, a.k.a. 00:24:46.980 --> 00:24:47.670 methods. 00:24:47.670 --> 00:24:52.440 So it turns out that because counts is a dictionary, because I made it so, 00:24:52.440 --> 00:24:55.350 that counts dictionary, like any dictionary in Python, 00:24:55.350 --> 00:24:57.150 comes with a function called get. 00:24:57.150 --> 00:25:01.290 And if you just tell the sorted function to use that built-in method, 00:25:01.290 --> 00:25:05.400 it will actually, for every key, get its value, get its value, get its value, 00:25:05.400 --> 00:25:09.720 and sort effectively by the right-hand column instead of the left. 00:25:09.720 --> 00:25:13.140 Now we'll see down the line, perhaps, more sophisticated ways of using this, 00:25:13.140 --> 00:25:15.990 but, for now, this just overrides the default behavior 00:25:15.990 --> 00:25:19.510 and sorts the dictionary, not by key, but by value instead. 00:25:19.510 --> 00:25:20.010 All right. 00:25:20.010 --> 00:25:24.120 So now watch this if I run Python of favorites.py once more. 00:25:24.120 --> 00:25:28.140 Previously, it was in the order in which the languages appeared first 00:25:28.140 --> 00:25:31.930 in the CSV file, then it was sorted alphabetically. 00:25:31.930 --> 00:25:34.960 Now it should be sorted by value. 00:25:34.960 --> 00:25:35.850 And, indeed. 00:25:35.850 --> 00:25:39.180 Scratch is the least with 40, C is the next with 78, 00:25:39.180 --> 00:25:41.458 Python is the biggest with 280. 00:25:41.458 --> 00:25:43.500 That's not much of a top 10, or a top three list. 00:25:43.500 --> 00:25:44.820 Let's actually reverse it. 00:25:44.820 --> 00:25:46.800 And the easiest way in Python to do that is 00:25:46.800 --> 00:25:50.130 to pass a third argument into sorted, and you would know this 00:25:50.130 --> 00:25:51.750 by just reading the documentation. 00:25:51.750 --> 00:25:56.370 You can literally say, reverse equals True, capital T, 00:25:56.370 --> 00:26:00.480 and now if I rerun this one last time, Python of favorites.py, 00:26:00.480 --> 00:26:04.110 I'll see the same values but with the whole thing reversed in order. 00:26:04.110 --> 00:26:07.230 Long story short, even though this might feel like a slog, 00:26:07.230 --> 00:26:11.070 like adding this and looking up this, so much easier than in C 00:26:11.070 --> 00:26:13.980 where you would have had to figure out, how does bubble sort work? 00:26:13.980 --> 00:26:17.520 Let me implement bubble sort, selection sort, any of those sorting algorithms, 00:26:17.520 --> 00:26:18.660 or use some other library. 00:26:18.660 --> 00:26:21.430 In Python, you just get a lot more for free, so to speak. 00:26:21.430 --> 00:26:24.390 It's just built in once you get comfy with the documentation. 00:26:24.390 --> 00:26:29.370 And, to be clear, this is an argument, as is this, as is this. 00:26:29.370 --> 00:26:33.540 But in Python, we have not only positional arguments, which 00:26:33.540 --> 00:26:36.720 are based on what position they are in, left to right, just like C, 00:26:36.720 --> 00:26:39.900 you also have these named parameters whereby 00:26:39.900 --> 00:26:42.720 they have explicit names that you can use yourself, 00:26:42.720 --> 00:26:45.930 to make clear that you're using this one but not this other one. 00:26:45.930 --> 00:26:50.980 More parameters in Python can be optional than in C. Phew. 00:26:50.980 --> 00:26:51.480 All right. 00:26:51.480 --> 00:26:54.727 Any questions about that technique yet? 00:26:54.727 --> 00:26:57.810 And if you're feeling like this is starting to take the fun out of Python, 00:26:57.810 --> 00:27:01.930 that's actually kind of the point of doing this the hard way. 00:27:01.930 --> 00:27:02.430 All right. 00:27:02.430 --> 00:27:05.110 Well, let's do it one other way that's marginally better. 00:27:05.110 --> 00:27:09.833 It turns out, in Python there really is this rich ecosystem of libraries, 00:27:09.833 --> 00:27:12.750 the code that comes with the language itself, or, even, third parties. 00:27:12.750 --> 00:27:17.040 And coming with the language is another module 00:27:17.040 --> 00:27:19.920 called the collections module, or package here, 00:27:19.920 --> 00:27:25.350 whereby if I use from collections, I can import something called Counter, 00:27:25.350 --> 00:27:28.140 capital C. And it turns out, if this felt 00:27:28.140 --> 00:27:30.450 a little bit painful to create a dictionary yourself, 00:27:30.450 --> 00:27:33.470 initialize it, maybe, to zero or one, like this. 00:27:33.470 --> 00:27:36.547 Turns out, you have the same problem that people before you have had, 00:27:36.547 --> 00:27:38.130 and so there's another way to do this. 00:27:38.130 --> 00:27:42.150 You can create a variable called counts, set it equal to Counter, capital 00:27:42.150 --> 00:27:43.770 C, open paren, close paren. 00:27:43.770 --> 00:27:46.140 And this is a different type of object. 00:27:46.140 --> 00:27:48.300 It's a different type of object in Python, that 00:27:48.300 --> 00:27:50.615 has counting capabilities built in. 00:27:50.615 --> 00:27:53.490 And so if I actually want to use this counter instead, I can do this. 00:27:53.490 --> 00:27:57.930 For each row in the reader, let's go ahead and grab the favorite language 00:27:57.930 --> 00:28:00.540 from that row, just like before. 00:28:00.540 --> 00:28:04.530 And without doing any of that headache of like, if, elif, or any of this, 00:28:04.530 --> 00:28:09.060 you can literally just index into that counter using favorite, 00:28:09.060 --> 00:28:14.340 quote unquote, "Scratch," or "C," or "Python," and increment it by one. 00:28:14.340 --> 00:28:17.680 What the Counter class is going to do for you, so to speak-- 00:28:17.680 --> 00:28:20.190 Another example of object-oriented programming, and counts 00:28:20.190 --> 00:28:21.435 is now an object thereof. 00:28:21.435 --> 00:28:25.107 What this whole feature of Counter is going to do for you is it's 00:28:25.107 --> 00:28:26.940 going to automatically initialize everything 00:28:26.940 --> 00:28:28.710 to zero, even if you've never seen it before, 00:28:28.710 --> 00:28:30.940 and then you can just blindly start incrementing it. 00:28:30.940 --> 00:28:33.570 So, in short, there's just more pleasant ways, sometimes, 00:28:33.570 --> 00:28:36.280 to do something in Python as well. 00:28:36.280 --> 00:28:36.780 All right. 00:28:36.780 --> 00:28:42.750 How about, lastly, let's make things, maybe-- 00:28:42.750 --> 00:28:44.340 Oh, actually, let's do this. 00:28:44.340 --> 00:28:47.918 We can even simplify the sorting here. 00:28:47.918 --> 00:28:49.710 Let me actually take this one step further. 00:28:49.710 --> 00:28:52.950 Instead of manually figuring out how to sort this, I'm going to do this. 00:28:52.950 --> 00:29:02.650 For each favorite, and the count thereof in the counts variable's most 00:29:02.650 --> 00:29:07.820 common function's return value, go ahead and print out this as well. 00:29:07.820 --> 00:29:09.790 So in short, again, a bit new syntax. 00:29:09.790 --> 00:29:11.210 But what's going on here? 00:29:11.210 --> 00:29:14.410 Well, it turns out that this counts class and-- 00:29:14.410 --> 00:29:17.478 sorry, this Counter class, and, in turn, this counts variable, 00:29:17.478 --> 00:29:20.770 comes with a function built in that you would only know from the documentation. 00:29:20.770 --> 00:29:23.260 It's literally called most underscore common, 00:29:23.260 --> 00:29:31.700 and what it returns to you when you call it is a pair of key value, key value. 00:29:31.700 --> 00:29:34.480 And so this, too, is a trick in Python that we did not have in C. 00:29:34.480 --> 00:29:37.180 If you want to iterate over something, but grab 00:29:37.180 --> 00:29:40.120 two variables at a time on each iteration like this, 00:29:40.120 --> 00:29:42.880 you separate them by commas and can get favorite count, 00:29:42.880 --> 00:29:44.840 favorite count, favorite count. 00:29:44.840 --> 00:29:50.260 So if I run this now, Python of favorites.py, this, too, just works. 00:29:50.260 --> 00:29:53.440 And it's getting a little simpler, a little tighter than before 00:29:53.440 --> 00:29:57.740 than if we had actually done it all manually. 00:29:57.740 --> 00:30:01.310 Lastly, here is a code that's the shortest version thereof. 00:30:01.310 --> 00:30:03.650 We're down to like 14 or 15 lines. 00:30:03.650 --> 00:30:08.480 If I wanted to change this to analyze the most popular problem thus 00:30:08.480 --> 00:30:12.620 far in the class, how do I go about changing the code 00:30:12.620 --> 00:30:18.260 to print out, top to bottom, the most popular problem or problems? 00:30:18.260 --> 00:30:19.280 What line should change? 00:30:19.280 --> 00:30:20.570 Yeah. 00:30:20.570 --> 00:30:22.070 So, yeah. 00:30:22.070 --> 00:30:22.610 Line 10. 00:30:22.610 --> 00:30:25.310 Because I've written this in kind of a general purpose way 00:30:25.310 --> 00:30:27.282 and using dictionaries with keys, it suffices 00:30:27.282 --> 00:30:29.240 to change language to, quote unquote, "problem" 00:30:29.240 --> 00:30:31.325 because that was the third column from the CSV. 00:30:31.325 --> 00:30:33.200 And so now, if you're curious, let's actually 00:30:33.200 --> 00:30:35.240 make my terminal window a bit bigger. 00:30:35.240 --> 00:30:37.820 Python of favorites.py, enter. 00:30:37.820 --> 00:30:38.750 And, OK. 00:30:38.750 --> 00:30:40.910 Tragically, we peaked early with Hello, World-- 00:30:40.910 --> 00:30:45.060 is the most popular problem thus far, followed by Filter, then Scratch. 00:30:45.060 --> 00:30:45.560 OK. 00:30:45.560 --> 00:30:46.700 Peaked even earlier. 00:30:46.700 --> 00:30:51.660 Mario, DNA, and so forth, and a bunch of others thereafter. 00:30:51.660 --> 00:30:55.477 So based on this sample size, here's the ranking of the problems thus far. 00:30:55.477 --> 00:30:56.060 So, we got it. 00:30:56.060 --> 00:30:59.570 More Hello, World problems in the weeks to come. 00:30:59.570 --> 00:31:01.160 All right. 00:31:01.160 --> 00:31:04.520 Now that we've done that in that way, let's 00:31:04.520 --> 00:31:07.070 just make this program slightly interactive 00:31:07.070 --> 00:31:11.070 and see how we can really take a fundamentally different approach. 00:31:11.070 --> 00:31:12.620 I'm going to go into VS Code. 00:31:12.620 --> 00:31:15.798 I'm going to keep everything the same, except that, at the bottom, 00:31:15.798 --> 00:31:17.840 I'm going to get rid of this loop because I don't 00:31:17.840 --> 00:31:20.120 want any more print out everything. 00:31:20.120 --> 00:31:22.040 I want to look up specific counts. 00:31:22.040 --> 00:31:25.060 Like, how popular was this problem, how popular was this other problem? 00:31:25.060 --> 00:31:28.310 And what I'm going to go ahead and do is to create a variable called favorite, 00:31:28.310 --> 00:31:30.180 set it equal to-- 00:31:30.180 --> 00:31:32.060 I could use get string in the CS50 library, 00:31:32.060 --> 00:31:34.790 but we saw last week there's no need to for strings, certainly. 00:31:34.790 --> 00:31:37.340 Let me just use the input function that comes with Python, 00:31:37.340 --> 00:31:39.890 and prompt the human for their favorite problem. 00:31:39.890 --> 00:31:45.020 And then let me go ahead and print out, for instance, an f string containing 00:31:45.020 --> 00:31:51.920 whatever their favorite is, colon, and whatever the count is thereof 00:31:51.920 --> 00:31:54.290 of that favorite, close quote. 00:31:54.290 --> 00:31:56.400 So let me open my terminal window. 00:31:56.400 --> 00:31:59.300 Let me run Python of favorites.py, enter. 00:31:59.300 --> 00:32:04.640 And if I type in Hello, World, looks like 65 people, indeed, like that one. 00:32:04.640 --> 00:32:06.860 If I run Python of favorites.py again. 00:32:06.860 --> 00:32:09.350 I type in Scratch, now we see that one. 00:32:09.350 --> 00:32:12.570 If I type in anything else, I'm going to get its specific value. 00:32:12.570 --> 00:32:18.050 So this is to say, not only can we write Python code to analyze some data pretty 00:32:18.050 --> 00:32:20.720 tightly versus the manual code we wrote out earlier, 00:32:20.720 --> 00:32:23.238 you can also make these programs interactive as well. 00:32:23.238 --> 00:32:25.530 And this is going to be a super common paradigm, right, 00:32:25.530 --> 00:32:28.880 if you go into the world of consulting, analytics, data science, more 00:32:28.880 --> 00:32:29.540 generally. 00:32:29.540 --> 00:32:34.310 Among your roles is going to be to analyze data, to ask questions of data, 00:32:34.310 --> 00:32:37.610 get back the answer. be ask questions of data, get back the answer. 00:32:37.610 --> 00:32:40.190 Honestly, life gets pretty tedious, even though you've only 00:32:40.190 --> 00:32:43.670 been programming in Python, perhaps, for like one week, a week and a half now. 00:32:43.670 --> 00:32:46.880 When you have to write code to solve all of the world's problems-- 00:32:46.880 --> 00:32:49.250 and there's this sort of tenant in programming, 00:32:49.250 --> 00:32:53.270 that programmers tend to avoid writing code as much as they can 00:32:53.270 --> 00:32:55.190 because, ideally, you would solve problems 00:32:55.190 --> 00:32:58.640 with the right tool for the job, minimizing the number of lines of code 00:32:58.640 --> 00:32:59.880 you actually write. 00:32:59.880 --> 00:33:01.920 So how do we actually get to that point? 00:33:01.920 --> 00:33:06.740 Well, instead of just dealing with CSV files, pure text, it turns out 00:33:06.740 --> 00:33:09.200 there's an entire world of proper databases. 00:33:09.200 --> 00:33:12.080 Not flat file databases, where you store everything in text files, 00:33:12.080 --> 00:33:16.370 but a database program, a piece of software running on a computer, 00:33:16.370 --> 00:33:18.922 running on a server, that's always listening for you. 00:33:18.922 --> 00:33:22.130 It's got a lot of memory, it's got a lot of space, and in turn a lot of data, 00:33:22.130 --> 00:33:25.460 and it supports a database specific language 00:33:25.460 --> 00:33:31.020 that makes it much easier, much faster to ask questions of the very same data. 00:33:31.020 --> 00:33:33.290 It's a relational database in the sense, too, 00:33:33.290 --> 00:33:35.480 that it's not even necessarily one spreadsheet, 00:33:35.480 --> 00:33:37.430 one set of rows and columns. 00:33:37.430 --> 00:33:41.960 You can have two sheets, three sheets, 30 sheets across which there might very 00:33:41.960 --> 00:33:44.840 well be relationships, or relations. 00:33:44.840 --> 00:33:48.830 So S-Q-L, or SQL, is a database specific language, 00:33:48.830 --> 00:33:52.625 stands for Structured Query Language, that's a declarative language whereby 00:33:52.625 --> 00:33:54.500 you're not going to be in the habit with SQL, 00:33:54.500 --> 00:33:58.870 typically, of writing loops and conditionals, and this kind of thing. 00:33:58.870 --> 00:34:02.260 You're instead going to describe the data that you want to get back, 00:34:02.260 --> 00:34:05.680 you're going to describe the question that you want the answer to, 00:34:05.680 --> 00:34:08.650 and we'll do this using a relatively small grammar. 00:34:08.650 --> 00:34:11.620 That is to say, there's not that many keywords in SQL. 00:34:11.620 --> 00:34:12.850 It's a pretty small language. 00:34:12.850 --> 00:34:16.120 But it's going to allow us to eliminate dozens of lines of Python code, 00:34:16.120 --> 00:34:16.900 perhaps. 00:34:16.900 --> 00:34:19.429 SQL follows this CRUD paradigm. 00:34:19.429 --> 00:34:23.230 So C-R-U-D, which simply means that in a relational database, 00:34:23.230 --> 00:34:25.030 you can really only do four things. 00:34:25.030 --> 00:34:30.100 You can create data, read data-- that is, look at it or analyze it somehow. 00:34:30.100 --> 00:34:32.540 Update the data, or delete the data. 00:34:32.540 --> 00:34:33.770 So, CRUD, for short. 00:34:33.770 --> 00:34:36.673 And that really speaks to just how relatively simple 00:34:36.673 --> 00:34:39.340 the world is, even though we'll just scratch the surface of some 00:34:39.340 --> 00:34:40.423 of its capabilities today. 00:34:40.423 --> 00:34:42.560 And you'll explore more over time. 00:34:42.560 --> 00:34:45.340 Specifically, in SQL, there's going to be other keywords that 00:34:45.340 --> 00:34:46.719 map to those four ideas. 00:34:46.719 --> 00:34:49.480 Technically, you don't just create data in the world of SQL, 00:34:49.480 --> 00:34:53.270 you can also insert data, like inserting more rows into a sheet. 00:34:53.270 --> 00:34:55.330 And it's not the word "read" that people use. 00:34:55.330 --> 00:34:56.739 People say to "select" data. 00:34:56.739 --> 00:35:00.170 But they mean to read data, which is sort of the opposite of writing 00:35:00.170 --> 00:35:01.220 or creating data. 00:35:01.220 --> 00:35:04.070 But the U and the D are the same, except that there's also 00:35:04.070 --> 00:35:07.580 a keyword in SQL known as DROP, which lets you very destructively, 00:35:07.580 --> 00:35:11.610 very dangerously delete entire database tables, as well. 00:35:11.610 --> 00:35:15.500 So how do we do this, and what's the connection to our favorites data 00:35:15.500 --> 00:35:16.190 thus far? 00:35:16.190 --> 00:35:18.920 Well, here is the syntax in this language called 00:35:18.920 --> 00:35:22.880 SQL via which you can create a table. 00:35:22.880 --> 00:35:26.510 So the jargon is a little different, but the ideas are exactly the same 00:35:26.510 --> 00:35:27.840 from the world of spreadsheets. 00:35:27.840 --> 00:35:33.720 What you call a sheet in a spreadsheet, the database world calls a table. 00:35:33.720 --> 00:35:37.320 It's a table of rows and columns, but it's the exact same idea. 00:35:37.320 --> 00:35:40.320 You're going to have discretion over what to call the table, 00:35:40.320 --> 00:35:42.710 just like you can call a spreadsheet something, or else, 00:35:42.710 --> 00:35:44.790 and you can also specify the types of data 00:35:44.790 --> 00:35:46.790 that you want to store in your rows and columns. 00:35:46.790 --> 00:35:49.520 And it's going to go a little more deeply than just formatting it, 00:35:49.520 --> 00:35:51.687 like in Excel, and Numbers, and Google Spreadsheets, 00:35:51.687 --> 00:35:55.250 you can actually control, maybe, how big the data could be depending 00:35:55.250 --> 00:35:57.140 on the database you're actually using. 00:35:57.140 --> 00:35:59.990 In CS50, we're going to use a light version of SQL. 00:35:59.990 --> 00:36:01.820 Literally, a language called-- 00:36:01.820 --> 00:36:04.970 an implementation of SQL called SQLite, which has really 00:36:04.970 --> 00:36:08.360 all of the core functionality that you would see in the real world, 00:36:08.360 --> 00:36:10.702 and with larger, more scalable systems, but it's 00:36:10.702 --> 00:36:13.160 going to allow us to focus on a lot of the building blocks. 00:36:13.160 --> 00:36:17.010 And SQLite's actually really popular on Macs, PCs, and phones, nowadays. 00:36:17.010 --> 00:36:20.450 A lot of the data that games and other applications on your phone might store, 00:36:20.450 --> 00:36:24.470 actually have a file, a binary file with zeros and ones, 00:36:24.470 --> 00:36:26.393 that's in the SQLite format. 00:36:26.393 --> 00:36:29.060 So if you do a mobile app, for instance, for your final project, 00:36:29.060 --> 00:36:32.220 you'll have an opportunity to play with something like this. 00:36:32.220 --> 00:36:34.760 Well, how do you actually run SQLite3? 00:36:34.760 --> 00:36:37.500 It's just a command built into your code space. 00:36:37.500 --> 00:36:40.400 So this is a program you could install on your own Mac, your own PC, 00:36:40.400 --> 00:36:40.910 or the like. 00:36:40.910 --> 00:36:43.280 We'll do everything as we've done before, in the cloud, 00:36:43.280 --> 00:36:44.780 and actually use your code space. 00:36:44.780 --> 00:36:48.170 And by that I mean, we can just start to play with this data now 00:36:48.170 --> 00:36:51.890 using SQL instead of Python. 00:36:51.890 --> 00:36:52.710 So let me do this. 00:36:52.710 --> 00:36:55.190 Let me open up my terminal window here, and let 00:36:55.190 --> 00:36:58.010 me go ahead and maximize my terminal window just because we'll 00:36:58.010 --> 00:37:00.140 focus now on the files here. 00:37:00.140 --> 00:37:03.530 Recall that I have a file called favorites.csv, 00:37:03.530 --> 00:37:05.510 and that CSV file is just text. 00:37:05.510 --> 00:37:08.300 But let me load it into a proper database 00:37:08.300 --> 00:37:11.510 so I can actually use this other language called SQL on it. 00:37:11.510 --> 00:37:14.210 To do this, I'm going to run SQLite3, which 00:37:14.210 --> 00:37:16.070 just means the third version of it, and I'm 00:37:16.070 --> 00:37:18.950 going to create a new database called favorites.db. 00:37:18.950 --> 00:37:21.380 That's just a convention, but it means here comes 00:37:21.380 --> 00:37:23.030 a database that I'm going to create. 00:37:23.030 --> 00:37:25.370 Notice, I'm not using the tabbed code editor. 00:37:25.370 --> 00:37:28.460 I'm not using the code command because the code command is generally 00:37:28.460 --> 00:37:29.600 for text files. 00:37:29.600 --> 00:37:33.440 SQLite3 is going to create a binary file, zeros and ones, ultimately. 00:37:33.440 --> 00:37:35.750 When I run that, it's going to ask me to verify yes. 00:37:35.750 --> 00:37:37.400 I'm going to hit y and then enter. 00:37:37.400 --> 00:37:40.940 And now I'm at the SQLite prompt, which is not the dollar sign. 00:37:40.940 --> 00:37:43.730 It literally says SQLite with an angled bracket. 00:37:43.730 --> 00:37:49.220 Now, one time only, I want to go ahead and load favorites.csv 00:37:49.220 --> 00:37:53.150 into this database so I can actually play around with it using not Python, 00:37:53.150 --> 00:37:54.982 but this new language called SQL. 00:37:54.982 --> 00:37:56.940 And the way I'm going to do this is as follows. 00:37:56.940 --> 00:37:57.773 I'm going to do dot. 00:37:57.773 --> 00:38:00.130 Mode csv, enter. 00:38:00.130 --> 00:38:03.430 And that just puts SQLite into CSV mode. 00:38:03.430 --> 00:38:05.650 It has different modes for different file formats. 00:38:05.650 --> 00:38:09.700 I'm going to .import and then I'm going to specify the file that I want 00:38:09.700 --> 00:38:11.800 to import, which is favorites.csv. 00:38:11.800 --> 00:38:13.030 And then this one's up to me. 00:38:13.030 --> 00:38:15.580 What is the name of the table I want to create? 00:38:15.580 --> 00:38:18.012 And table, again, is essentially synonymous with sheets. 00:38:18.012 --> 00:38:19.720 So I'm going to call everything the same. 00:38:19.720 --> 00:38:22.700 I'm going to call my table favorites as well. 00:38:22.700 --> 00:38:25.420 So what this command is essentially going to do, 00:38:25.420 --> 00:38:29.890 is all of those lines of Python code that open the file, read it row by row, 00:38:29.890 --> 00:38:31.028 and do something with it-- 00:38:31.028 --> 00:38:32.320 This is just built into SQLite. 00:38:32.320 --> 00:38:37.720 It's going to load the whole darn CSV into this new favorites.db file, 00:38:37.720 --> 00:38:38.980 and then that's it for now. 00:38:38.980 --> 00:38:42.220 I'm going to go ahead and literally type .quit to get out of SQLite. 00:38:42.220 --> 00:38:43.810 I'm back at my dollar sign prompt. 00:38:43.810 --> 00:38:47.500 If I type ls, I have not only favorites.csv, 00:38:47.500 --> 00:38:54.790 I also have favorites.db now as well, a brand new file, and in that file now is 00:38:54.790 --> 00:38:57.560 an optimized version of the CSV file. 00:38:57.560 --> 00:39:00.860 In that DB file now is a version of the data that's 00:39:00.860 --> 00:39:04.430 going to lend itself to CRUD operations, creating, reading, update, 00:39:04.430 --> 00:39:08.810 and deleting, using this new language called SQL. 00:39:08.810 --> 00:39:10.478 All right, so how do I get into this? 00:39:10.478 --> 00:39:13.520 Well, let me clear my terminal window and pretend that I'm doing this now 00:39:13.520 --> 00:39:14.062 the next day. 00:39:14.062 --> 00:39:15.478 I've already created the database. 00:39:15.478 --> 00:39:16.670 That's a one time operation. 00:39:16.670 --> 00:39:18.545 Once you've got the data, now I'm going to go 00:39:18.545 --> 00:39:24.080 ahead and again run SQLite3 favorites.db just to open the file again. 00:39:24.080 --> 00:39:27.260 But it's already now-- all of the data is in there. 00:39:27.260 --> 00:39:32.420 Just as a teaser, let me go ahead and do this. 00:39:32.420 --> 00:39:38.580 .schema is a SQLite command that just shows me the schema of this database 00:39:38.580 --> 00:39:39.080 table. 00:39:39.080 --> 00:39:41.130 And we'll see more about this in a little bit, 00:39:41.130 --> 00:39:44.630 but for now this is showing me, essentially, 00:39:44.630 --> 00:39:48.170 the SQL command that was automatically run 00:39:48.170 --> 00:39:51.650 when I imported this database the first time around. 00:39:51.650 --> 00:39:54.680 And, for now, just notice that it mentions timestamp, 00:39:54.680 --> 00:39:57.270 it mentions language, it mentions problem. 00:39:57.270 --> 00:39:59.790 Very loosely, it calls each of those texts. 00:39:59.790 --> 00:40:02.370 So we're not trying very hard to distinguish one type of data 00:40:02.370 --> 00:40:02.912 from another. 00:40:02.912 --> 00:40:03.690 It's all text. 00:40:03.690 --> 00:40:06.790 But notice, create table If not exists favorites. 00:40:06.790 --> 00:40:09.420 This is essentially the create table syntax 00:40:09.420 --> 00:40:11.490 that I alluded to earlier via which you can 00:40:11.490 --> 00:40:15.510 create a table in a SQLite database. 00:40:15.510 --> 00:40:17.280 But more on that in just a bit. 00:40:17.280 --> 00:40:22.500 Here now is how we can actually get at the data in that database. 00:40:22.500 --> 00:40:28.380 It turns out that we can select one or more columns from a database 00:40:28.380 --> 00:40:30.180 table using syntax like this. 00:40:30.180 --> 00:40:33.870 Literally, the keyword select, then the name of one or more columns 00:40:33.870 --> 00:40:38.550 that are in that database, and then from the specific table that you care about. 00:40:38.550 --> 00:40:41.760 And notice that in capital letters here are all of the SQL 00:40:41.760 --> 00:40:44.760 specific keywords, select, and from, in particular. 00:40:44.760 --> 00:40:47.940 And in lowercase, by convention, here are the placeholders 00:40:47.940 --> 00:40:51.270 for the columns that you, or I, have created, and the tables that you, or I, 00:40:51.270 --> 00:40:52.210 have created. 00:40:52.210 --> 00:40:53.820 So if I go back to SQLite here. 00:40:53.820 --> 00:40:57.090 Let me just clear with Control L, which will just freshen up the screen here 00:40:57.090 --> 00:40:58.770 so we can focus on what's new. 00:40:58.770 --> 00:41:03.600 If I want to select everything from the table called favorites, 00:41:03.600 --> 00:41:04.890 here's what I can do. 00:41:04.890 --> 00:41:09.570 Select star from favorites semicolon. 00:41:09.570 --> 00:41:12.660 And, do forgive me, semicolons are back for SQL, in this case. 00:41:12.660 --> 00:41:16.680 But select star from favorites uses a syntax you might not be familiar with. 00:41:16.680 --> 00:41:18.660 Star here has nothing to do with pointers. 00:41:18.660 --> 00:41:20.100 Star is a wild card. 00:41:20.100 --> 00:41:22.290 It means give me everything, no matter what it's 00:41:22.290 --> 00:41:24.480 called, from this particular table. 00:41:24.480 --> 00:41:27.030 When I hit enter, what we're going to see 00:41:27.030 --> 00:41:30.810 is the entire contents of the favorite table 00:41:30.810 --> 00:41:34.480 that's the result of having imported that CSV into this database. 00:41:34.480 --> 00:41:37.380 So when I hit enter, there is all of that data. 00:41:37.380 --> 00:41:40.110 And SQLite, just to be friendly, it's using 00:41:40.110 --> 00:41:44.220 what we might call ASCII art, just very simple text, like hyphens, 00:41:44.220 --> 00:41:47.310 and vertical bars, and pluses on the corner, to make it look pretty 00:41:47.310 --> 00:41:49.690 and make it look like it is a proper table. 00:41:49.690 --> 00:41:52.620 But what you're really seeing is the contents of favorites.db, 00:41:52.620 --> 00:41:54.750 specifically in that table. 00:41:54.750 --> 00:41:57.130 Specifically, if I only care about languages, 00:41:57.130 --> 00:41:59.130 let me try something more specific than star. 00:41:59.130 --> 00:42:03.270 Select language from favorites semicolon. 00:42:03.270 --> 00:42:05.640 This is going to give me just a single column now, 00:42:05.640 --> 00:42:09.150 of all of the favorites that you selected for language specifically. 00:42:09.150 --> 00:42:12.730 This is a little overwhelming to see all 399 or so pieces of data, 00:42:12.730 --> 00:42:14.680 so let me actually truncate it a little bit. 00:42:14.680 --> 00:42:20.333 Let me do select language from favorites limit 10. 00:42:20.333 --> 00:42:22.500 So we're about to see that there's little tricks you 00:42:22.500 --> 00:42:26.430 can use to tweak the behavior of the language 00:42:26.430 --> 00:42:28.500 in order to get back more or less data. 00:42:28.500 --> 00:42:31.110 In fact, it turns out there's a bunch of keywords 00:42:31.110 --> 00:42:35.880 like these built into SQL, much like Google Spreadsheets, Apple Numbers, 00:42:35.880 --> 00:42:38.293 Microsoft Excel, and certainly Python. 00:42:38.293 --> 00:42:41.460 There's a lot of functionality that you just get for free with the language. 00:42:41.460 --> 00:42:44.610 If you want to calculate an average, count the number of things in a file, 00:42:44.610 --> 00:42:48.090 get the unique or distinct values, force everything to lowercase, 00:42:48.090 --> 00:42:50.380 force everything to uppercase, get the maximum value, 00:42:50.380 --> 00:42:53.505 minimum value-- much like spreadsheets, if you're familiar with that world, 00:42:53.505 --> 00:42:56.880 you get all of that functionality in SQL but also more. 00:42:56.880 --> 00:42:59.680 So, for instance, if I go back to my terminal window here. 00:42:59.680 --> 00:43:02.490 Let me go ahead and select the total number of favorites 00:43:02.490 --> 00:43:06.150 in this table, the total number of rows that you all inputted. 00:43:06.150 --> 00:43:10.320 So I could do select star from favorites semicolon, 00:43:10.320 --> 00:43:12.450 and then I could literally start counting these. 00:43:12.450 --> 00:43:15.330 Like 1, 2, 3, 4-- there's clearly a better way. 00:43:15.330 --> 00:43:19.480 And, indeed, on our list of functions was a count function. 00:43:19.480 --> 00:43:22.200 And so the way I can use that in SQL is like this. 00:43:22.200 --> 00:43:25.230 Select count of star-- 00:43:25.230 --> 00:43:28.113 so pass star in as an argument to the count function. 00:43:28.113 --> 00:43:30.780 You don't care what columns you're counting just count them all. 00:43:30.780 --> 00:43:33.440 From favorites semicolon. 00:43:33.440 --> 00:43:35.190 And now, you're actually going to get back 00:43:35.190 --> 00:43:40.050 like a little baby table that has just one row, one column inside of which-- 00:43:40.050 --> 00:43:43.410 one cell of which has the total actual count. 00:43:43.410 --> 00:43:47.610 And it's 398 because 399, recall, included the actual header 00:43:47.610 --> 00:43:49.480 row from the file. 00:43:49.480 --> 00:43:49.980 All right. 00:43:49.980 --> 00:43:52.260 So suppose you want to-- 00:43:52.260 --> 00:43:54.480 actually, note that this is the exact same thing 00:43:54.480 --> 00:43:58.840 as counting a specific column because every row 00:43:58.840 --> 00:44:00.670 has the same number of columns, three. 00:44:00.670 --> 00:44:03.790 We could just say select the count of languages, 00:44:03.790 --> 00:44:07.720 or select the count of problems. 00:44:07.720 --> 00:44:10.023 All of those are going to give me back the same answer. 00:44:10.023 --> 00:44:12.190 It is, therefore, conventional in SQL if you're just 00:44:12.190 --> 00:44:14.080 trying to count the number of rows, don't even 00:44:14.080 --> 00:44:15.413 worry about what they're called. 00:44:15.413 --> 00:44:18.722 Just do count star to get back everything more simply. 00:44:18.722 --> 00:44:21.430 All right, but what if we want to get back the distinct languages 00:44:21.430 --> 00:44:23.972 and we didn't know a priori that this came from a Google form 00:44:23.972 --> 00:44:25.107 with three radio buttons? 00:44:25.107 --> 00:44:26.690 Well, we could do something like this. 00:44:26.690 --> 00:44:33.130 We could select the distinct languages from the favorites table, enter. 00:44:33.130 --> 00:44:35.950 And that gives me Python, Scratch, C because distinct 00:44:35.950 --> 00:44:38.200 is one of the other functions that comes with SQL. 00:44:38.200 --> 00:44:40.990 This is, obviously, very easily countable with my human eyes, 00:44:40.990 --> 00:44:43.210 but if I wanted to do this more dynamically, 00:44:43.210 --> 00:44:47.530 I could change this to be count the distinct languages. 00:44:47.530 --> 00:44:50.200 And just like in C, just like in Python, just like in Scratch, 00:44:50.200 --> 00:44:52.987 I can nest these functions and pass the output of one 00:44:52.987 --> 00:44:54.070 into the input of another. 00:44:54.070 --> 00:44:59.550 If I hit enter now, I now get three in this case here. 00:44:59.550 --> 00:45:00.050 OK. 00:45:00.050 --> 00:45:04.580 Let me pause to see if there's any questions or confusion just yet. 00:45:04.580 --> 00:45:05.120 Yeah. 00:45:05.120 --> 00:45:07.000 [INDISTINCT SPEECH] 00:45:07.000 --> 00:45:07.770 Does SQLite-- 00:45:07.770 --> 00:45:09.910 [INDISTINCT SPEECH] 00:45:09.910 --> 00:45:11.980 SQLite3 is a program. 00:45:11.980 --> 00:45:16.210 And it's an implementation of the SQLite language, which 00:45:16.210 --> 00:45:18.820 itself is a lightweight version of what the world known 00:45:18.820 --> 00:45:21.890 as SQL, which is a very convoluted way of saying 00:45:21.890 --> 00:45:23.390 there's lots of humans in the world. 00:45:23.390 --> 00:45:25.420 Not everyone agrees what SQL should be. 00:45:25.420 --> 00:45:28.720 Microsoft might disagree with Oracle, might disagree with other companies, 00:45:28.720 --> 00:45:29.390 as well. 00:45:29.390 --> 00:45:32.500 So there's a common subset of SQL in the world 00:45:32.500 --> 00:45:35.200 that almost everyone knows, and learns, and uses, 00:45:35.200 --> 00:45:37.900 but there are also some vendor specific features. 00:45:37.900 --> 00:45:40.790 SQLite tries to distill things really into the essence, 00:45:40.790 --> 00:45:44.430 and so that's what you increasingly see on Android, on iOS, on Macs, and PCs, 00:45:44.430 --> 00:45:44.930 as well. 00:45:44.930 --> 00:45:47.650 So we use it because it's relatively canonical. 00:45:47.650 --> 00:45:49.330 Good question. 00:45:49.330 --> 00:45:49.830 All right. 00:45:49.830 --> 00:45:53.550 So let's do a few other things by introducing a few other keywords 00:45:53.550 --> 00:45:55.440 without trying all of these right now. 00:45:55.440 --> 00:45:59.730 Here in this list is a bunch of new keywords 00:45:59.730 --> 00:46:01.740 that are going to give us even finer control. 00:46:01.740 --> 00:46:04.920 And we saw limit already, and that just limits the output. 00:46:04.920 --> 00:46:08.010 But you can also have what are called predicates. 00:46:08.010 --> 00:46:11.280 You can literally use the keyword where to start filtering the data, 00:46:11.280 --> 00:46:14.640 without using an if, and an elif, and an elif, and an elif, and so forth. 00:46:14.640 --> 00:46:17.640 You can just in one line express something conditionally, 00:46:17.640 --> 00:46:21.067 you can order the data, and you can even group similar data together. 00:46:21.067 --> 00:46:22.150 So what do I mean by this? 00:46:22.150 --> 00:46:24.622 Let me go back to VS Code here, and let me play around 00:46:24.622 --> 00:46:25.830 with a few different queries. 00:46:25.830 --> 00:46:32.490 Let me select, maybe, the count of rows from favorites, which previously 00:46:32.490 --> 00:46:38.250 was going to be 398 if I just get back all of the rows, but suppose I only 00:46:38.250 --> 00:46:41.190 want to know how many of you liked C. I can then say something 00:46:41.190 --> 00:46:46.500 like where the language in each row equals, quote unquote, "C," 00:46:46.500 --> 00:46:49.750 and the convention here is to use single quotes, though SQLite 00:46:49.750 --> 00:46:51.790 is tolerant of other formats as well. 00:46:51.790 --> 00:46:56.257 If I hit enter here, I'll see, indeed, as we saw with Python, the 78 number. 00:46:56.257 --> 00:46:57.340 That, honestly, took what? 00:46:57.340 --> 00:46:59.690 13, 14, 15 lines of code? 00:46:59.690 --> 00:47:04.990 Now I've distilled that kind of query into a single line of SQL code 00:47:04.990 --> 00:47:07.180 instead, by using this built in functionality. 00:47:07.180 --> 00:47:11.110 Suppose I really want to get specific, and how many of you really liked 00:47:11.110 --> 00:47:14.500 Hello, World in C as your favorite? 00:47:14.500 --> 00:47:16.510 Well, I could change this query. 00:47:16.510 --> 00:47:19.330 And just like your dollar sign prompt, your shell, 00:47:19.330 --> 00:47:22.660 you can go up and down in your history in SQLite to save keystrokes. 00:47:22.660 --> 00:47:24.520 You can use Boolean logic. 00:47:24.520 --> 00:47:27.790 And I can say language equals C AND, maybe, 00:47:27.790 --> 00:47:31.490 problem equals, quote unquote, "Hello, World," 00:47:31.490 --> 00:47:34.730 and the number of you that liked that problem was seven. 00:47:34.730 --> 00:47:37.900 So really, really early on likes Hello, World in C. 00:47:37.900 --> 00:47:39.910 Now notice a couple of key differences. 00:47:39.910 --> 00:47:43.900 One, I'm using AND, and not ampersand, ampersand like in C. I'm 00:47:43.900 --> 00:47:46.100 using single equal signs. 00:47:46.100 --> 00:47:50.030 So SQL behaves like Scratch does, which is not like Python or C. Why? 00:47:50.030 --> 00:47:52.780 Different people have implemented different languages differently. 00:47:52.780 --> 00:47:57.310 Equals, equals, equality in the world of SQL 00:47:57.310 --> 00:47:59.120 for comparing things left and right. 00:47:59.120 --> 00:47:59.620 All right. 00:47:59.620 --> 00:48:01.870 Things are now going to get a little more interesting, 00:48:01.870 --> 00:48:05.140 but the whole goal of all of that Python code 00:48:05.140 --> 00:48:09.650 was to analyze the ranking of languages and popularity thereof. 00:48:09.650 --> 00:48:12.400 Turns out in SQL, once you have the vocabulary, 00:48:12.400 --> 00:48:14.590 it's pretty easy to do something like that. 00:48:14.590 --> 00:48:15.560 I'm going to do this. 00:48:15.560 --> 00:48:19.120 I'm going to select all of the languages in the table, 00:48:19.120 --> 00:48:23.020 but I'm also going to select the count thereof. 00:48:23.020 --> 00:48:26.020 And then I'm going to do that from the favorites table, 00:48:26.020 --> 00:48:29.560 but I'm going to group by language because I 00:48:29.560 --> 00:48:33.370 claimed a moment ago that group by is another one of our key phrases in SQL 00:48:33.370 --> 00:48:35.120 that's going to let us group data. 00:48:35.120 --> 00:48:37.600 And what this effectively means is that if you've 00:48:37.600 --> 00:48:41.530 got this table with a lot of duplicate languages, again, and again, 00:48:41.530 --> 00:48:45.130 and again, you can group by that column, and, essentially, 00:48:45.130 --> 00:48:48.850 smush all of the Python rows together, all of the Scratch rows together, 00:48:48.850 --> 00:48:52.930 all of the C rows together, but figure out how many of those 00:48:52.930 --> 00:48:55.240 rows just got smushed together. 00:48:55.240 --> 00:48:57.910 Effectively, doing all of that dictionary legwork, 00:48:57.910 --> 00:49:01.090 or the counter legwork, that I did in 13-- 00:49:01.090 --> 00:49:02.990 15 lines of Python code. 00:49:02.990 --> 00:49:06.700 So if I hit enter here, this now is the motivation 00:49:06.700 --> 00:49:08.410 for what we're now starting to do. 00:49:08.410 --> 00:49:11.860 I have distilled into a single line of code 00:49:11.860 --> 00:49:14.140 in a language called SQL what, indeed, took me 00:49:14.140 --> 00:49:18.280 more than a dozen lines of Python code just to get back an answer. 00:49:18.280 --> 00:49:20.110 And I can do the same thing with problem. 00:49:20.110 --> 00:49:23.230 I can just change language here, for instance, to problem instead. 00:49:23.230 --> 00:49:26.300 But, per this list, I can not only group things, I can order them. 00:49:26.300 --> 00:49:29.440 So if you actually want to get a top 10, or a top three list, well let's 00:49:29.440 --> 00:49:30.880 just change this query slightly. 00:49:30.880 --> 00:49:37.970 Before the semicolon, let me order by the count of those rows semicolon. 00:49:37.970 --> 00:49:40.570 And now what I get is from smallest to largest. 00:49:40.570 --> 00:49:42.850 40, 78, 280. 00:49:42.850 --> 00:49:44.590 If you want to flip that, that's fine. 00:49:44.590 --> 00:49:49.630 By default, order by uses ascending order, abbreviated A-S-C. 00:49:49.630 --> 00:49:53.810 If you want to do descending order, D-E-S-C, you can do that as well. 00:49:53.810 --> 00:49:57.680 And now we have a top three list, from largest to smallest. 00:49:57.680 --> 00:50:01.360 Now, honestly, this is a bit of a mouthful to use count star over here, 00:50:01.360 --> 00:50:02.920 count star over here. 00:50:02.920 --> 00:50:06.580 There's a nicety in SQL, too, where you can create little aliases of sorts. 00:50:06.580 --> 00:50:08.890 So if I use the same query again-- 00:50:08.890 --> 00:50:10.850 let me scroll over to the left. 00:50:10.850 --> 00:50:13.990 I can actually use the keyword as here, and I 00:50:13.990 --> 00:50:18.370 can rename this weird looking column, count star, to anything I want. 00:50:18.370 --> 00:50:19.930 I can rename it to n. 00:50:19.930 --> 00:50:24.280 And then at the end of this query, I can order by n, essentially, 00:50:24.280 --> 00:50:26.990 creating a synonym, if you will, for one versus the other. 00:50:26.990 --> 00:50:30.310 So if I hit enter now, same exact thing, but my little baby table 00:50:30.310 --> 00:50:33.310 that came back-- not a technical term-- has two columns, one of which 00:50:33.310 --> 00:50:36.580 is more simply called n now instead of count star. 00:50:36.580 --> 00:50:40.330 It just makes it minorly more convenient in your actual SQL code 00:50:40.330 --> 00:50:44.110 to reference things that might actually be a little annoying to type. 00:50:44.110 --> 00:50:47.200 Lastly, suppose we want to get a top one list, 00:50:47.200 --> 00:50:49.000 and we just want the most popular language. 00:50:49.000 --> 00:50:51.790 Honestly, I can just do limit one, enter. 00:50:51.790 --> 00:50:55.720 That gives me just this tiny little table, a temporary table, really, 00:50:55.720 --> 00:50:56.860 with one row. 00:50:56.860 --> 00:51:00.460 And, honestly, if I don't even care about what the language is, 00:51:00.460 --> 00:51:02.230 I can omit that entirely. 00:51:02.230 --> 00:51:06.340 Just see how many people really like the most popular language. 00:51:06.340 --> 00:51:07.660 280, in this case. 00:51:07.660 --> 00:51:10.910 But, of course, it's more interesting to see what it actually is. 00:51:10.910 --> 00:51:13.420 So, in short, just by turning these knobs syntactically, 00:51:13.420 --> 00:51:16.900 it's relatively easy to start getting at more and more data. 00:51:16.900 --> 00:51:19.270 And more answers there, too. 00:51:19.270 --> 00:51:19.770 Phew. 00:51:19.770 --> 00:51:24.270 Questions on this thus far? 00:51:24.270 --> 00:51:25.200 Any questions? 00:51:25.200 --> 00:51:26.040 No? 00:51:26.040 --> 00:51:26.670 OK. 00:51:26.670 --> 00:51:29.305 Well suppose that this week, for instance. 00:51:29.305 --> 00:51:31.680 One of our new problems is going to be called Fiftyville, 00:51:31.680 --> 00:51:33.960 and it's going to allow you to explore the world of SQL 00:51:33.960 --> 00:51:35.793 in the context of a place called Fiftyville. 00:51:35.793 --> 00:51:38.350 Suppose that suddenly becomes your favorite problem. 00:51:38.350 --> 00:51:40.920 Well, how can we go about adding more data to a database? 00:51:40.920 --> 00:51:43.720 Well, we've seen create table for creating the table, 00:51:43.720 --> 00:51:47.230 we've seen select for selecting data there from. 00:51:47.230 --> 00:51:50.340 Turns out there's also an insert into command 00:51:50.340 --> 00:51:54.210 that you can use to insert new data into a table. 00:51:54.210 --> 00:51:57.450 Now, I did this in bulk by just importing that whole CSV file, 00:51:57.450 --> 00:52:00.370 and SQLite3 did it all for me automatically. 00:52:00.370 --> 00:52:04.320 But in the real world, if you don't have a captive audience, every one of whom 00:52:04.320 --> 00:52:06.540 is submitting the form at the same time-- 00:52:06.540 --> 00:52:09.300 but maybe it's an application that's running 24/7, 00:52:09.300 --> 00:52:13.210 you're going to get more and more data over time, just like Google itself. 00:52:13.210 --> 00:52:16.290 So if you write code like this, you can insert one row 00:52:16.290 --> 00:52:20.530 at a time, one row at a time, and actually change the data in your table. 00:52:20.530 --> 00:52:24.820 So just as a check, let me do select star from favorites, enter. 00:52:24.820 --> 00:52:26.500 Just to see all of the data. 00:52:26.500 --> 00:52:31.030 And the last data we got was at 1:41 PM, and 21 seconds. 00:52:31.030 --> 00:52:34.390 Suppose now I've decided I want to insert one new row. 00:52:34.390 --> 00:52:35.560 I can do this. 00:52:35.560 --> 00:52:39.550 Insert into favorites-- and then I have to specify 00:52:39.550 --> 00:52:41.710 what columns do I want to insert into. 00:52:41.710 --> 00:52:47.230 I'm going to insert a new language column, and a new problem column. 00:52:47.230 --> 00:52:47.840 Timestamp? 00:52:47.840 --> 00:52:48.340 I could. 00:52:48.340 --> 00:52:51.590 I don't really want to look up the time, so I'm going to leave that one blank. 00:52:51.590 --> 00:52:54.520 And I'm going to put in values as follows for this. 00:52:54.520 --> 00:52:57.850 SQL for the language, which wasn't even an option on the form earlier, 00:52:57.850 --> 00:53:02.830 and Fiftyville for the name of the problem, semicolon. 00:53:02.830 --> 00:53:04.540 So there's a bit of dichotomy here. 00:53:04.540 --> 00:53:07.690 In the first set of parentheses, you specify a comma separated list 00:53:07.690 --> 00:53:09.910 of the columns that you want to put data into. 00:53:09.910 --> 00:53:11.890 In the second set of parentheses, you actually 00:53:11.890 --> 00:53:15.680 specify the values that you want to put into those columns. 00:53:15.680 --> 00:53:18.350 So when I hit enter nothing seems to happen, 00:53:18.350 --> 00:53:20.510 which, in general, is a good thing at my terminal. 00:53:20.510 --> 00:53:25.030 But if I now rerun select star from favorites, we will see-- 00:53:25.030 --> 00:53:27.070 voila, a brand new row. 00:53:27.070 --> 00:53:29.320 We don't know what time or date it was inputted at. 00:53:29.320 --> 00:53:33.130 In fact, we see an old friend, null, which indicates the absence of a value, 00:53:33.130 --> 00:53:37.610 but we do indeed see that SQL, in Fiftyville, is actually now in there. 00:53:37.610 --> 00:53:42.340 So in the world of SQL, null has nothing to do with pointers or addresses. 00:53:42.340 --> 00:53:46.120 The world of SQL, it's just using the same word to represent the same idea, 00:53:46.120 --> 00:53:48.700 that there's no data here, but it has nothing 00:53:48.700 --> 00:53:51.080 to do with actual memory addresses in this case. 00:53:51.080 --> 00:53:53.830 But suppose that you don't want to do that, and, like, no, no, no. 00:53:53.830 --> 00:53:54.640 Let's just delete that. 00:53:54.640 --> 00:53:56.348 Fiftyville hasn't even been released yet, 00:53:56.348 --> 00:53:58.720 nor have we even finished talking about SQL. 00:53:58.720 --> 00:54:02.110 How do we delete data from a database table? 00:54:02.110 --> 00:54:03.670 Well, there's a delete from command. 00:54:03.670 --> 00:54:05.360 Let me go back to VS Code here. 00:54:05.360 --> 00:54:08.080 Let me go ahead and clear my terminal just to keep things clean. 00:54:08.080 --> 00:54:10.570 Let me go ahead and delete from favorites, 00:54:10.570 --> 00:54:14.290 and let me not hit enter here after a semicolon. 00:54:14.290 --> 00:54:16.330 This is one of the most destructive things 00:54:16.330 --> 00:54:19.570 you can do as a database administrator. 00:54:19.570 --> 00:54:21.550 If you Google around, there are horror stories 00:54:21.550 --> 00:54:25.750 of interns in the real world executing commands like this at their companies. 00:54:25.750 --> 00:54:28.510 This will delete everything from favorites. 00:54:28.510 --> 00:54:31.660 So if you ever do this, remember, we told you not to today. 00:54:31.660 --> 00:54:38.530 But if we add a where clause here, only delete rows where the timestamp column 00:54:38.530 --> 00:54:41.433 is null, this is more reasonable. 00:54:41.433 --> 00:54:43.600 And, frankly, any companies you work for should also 00:54:43.600 --> 00:54:45.767 have backups of their database, so we shouldn't even 00:54:45.767 --> 00:54:48.680 be reading about these horror stories, but such is the real world. 00:54:48.680 --> 00:54:52.960 So this is going to delete any row from the favorites table 00:54:52.960 --> 00:54:57.260 where Timestamp, capital T, because that's how Google did it, is null. 00:54:57.260 --> 00:54:58.570 I go ahead and hit enter. 00:54:58.570 --> 00:55:03.880 Nothing seems to happen, but if I do select star from favorites semicolon 00:55:03.880 --> 00:55:06.550 that now row is, again, gone. 00:55:06.550 --> 00:55:11.410 So you can use these predicates, these where conditions, coupled with select, 00:55:11.410 --> 00:55:14.740 coupled with delete, and other operations as well. 00:55:14.740 --> 00:55:16.520 What if I actually want to make a change? 00:55:16.520 --> 00:55:18.910 So if you want to update existing data like this. 00:55:18.910 --> 00:55:20.540 Well, we could do this. 00:55:20.540 --> 00:55:25.300 I could update this table I could set one column equal to this value 00:55:25.300 --> 00:55:27.230 where some condition is true. 00:55:27.230 --> 00:55:28.638 So how might this work? 00:55:28.638 --> 00:55:30.430 Well, let me boldly claim that a lot of you 00:55:30.430 --> 00:55:33.070 are really going to like Fiftyville in the world of SQL, 00:55:33.070 --> 00:55:35.860 so all of these favorites are sort of passe now. 00:55:35.860 --> 00:55:36.890 So let's do this. 00:55:36.890 --> 00:55:39.950 Let me go ahead and update the favorites table, 00:55:39.950 --> 00:55:43.150 setting the language column equal to "SQL," quote unquote. 00:55:43.150 --> 00:55:47.650 And with a comma let me go ahead and also update the problem column to be 00:55:47.650 --> 00:55:49.810 equal to, quote unquote, "Fiftyville." 00:55:49.810 --> 00:55:52.930 I'm not going to have any kind of where here, which means this 00:55:52.930 --> 00:55:56.030 is just going to do its thing on all of the rows. 00:55:56.030 --> 00:55:58.270 So if I hit enter nothing seems to have happened, 00:55:58.270 --> 00:56:00.850 but if I now do select star from favorites, 00:56:00.850 --> 00:56:04.160 everyone's favorite is literally that problem. 00:56:04.160 --> 00:56:07.690 So this too is destructive, unlike the real digital world, 00:56:07.690 --> 00:56:10.120 there's no Control Z, or undo that. 00:56:10.120 --> 00:56:13.480 You better have made a backup of your database, otherwise 00:56:13.480 --> 00:56:15.460 that's not a good thing. 00:56:15.460 --> 00:56:17.830 In this case, I do have this CSV file, so I could just 00:56:17.830 --> 00:56:19.703 delete my favorites.db file. 00:56:19.703 --> 00:56:22.870 I could re-import the data, so I haven't really lost anything of importance, 00:56:22.870 --> 00:56:26.140 but you could in the case of the real world and any data 00:56:26.140 --> 00:56:27.470 you're actually working on. 00:56:27.470 --> 00:56:29.410 So just to make the point, let me go ahead 00:56:29.410 --> 00:56:33.250 and delete from favorites semicolon, enter. 00:56:33.250 --> 00:56:34.660 Let me reselect. 00:56:34.660 --> 00:56:36.710 There's no data there anymore. 00:56:36.710 --> 00:56:42.890 And in fact, if I do select count star from favorites, 00:56:42.890 --> 00:56:45.220 we'll see as much that the answer is, in fact, zero 00:56:45.220 --> 00:56:48.040 because everything has now been deleted. 00:56:48.040 --> 00:56:51.340 Any questions, then, on that code there? 00:56:51.340 --> 00:56:51.950 No? 00:56:51.950 --> 00:56:52.450 All right. 00:56:52.450 --> 00:56:55.533 So if not too scared yet, let's go ahead and take our 10-minute break now. 00:56:55.533 --> 00:56:59.110 Halloween candy is served, and we'll be back in 10. 00:56:59.110 --> 00:57:01.060 All right. 00:57:01.060 --> 00:57:05.750 So we are back, and before we dive back into SQL and some real world data, 00:57:05.750 --> 00:57:10.300 it turns out, unbeknownst to me, we've had a Halloween costume contest. 00:57:10.300 --> 00:57:14.560 So it's now time to announce the winners of this year's CS50 costume contest. 00:57:14.560 --> 00:57:17.530 If our two winners would like to come on up who, 00:57:17.530 --> 00:57:19.495 I'm told during break, dressed up as me. 00:57:19.495 --> 00:57:22.265 [APPLAUSE] 00:57:25.620 --> 00:57:26.460 Come on over. 00:57:26.460 --> 00:57:28.085 Would you like to introduce yourselves? 00:57:28.085 --> 00:57:31.240 SPEAKER 2: Yes Hi, everyone. 00:57:31.240 --> 00:57:32.100 I'm David. 00:57:32.100 --> 00:57:34.380 I am living in Matthews as a first year. 00:57:34.380 --> 00:57:38.700 I'm planning on studying gov and computer science. 00:57:38.700 --> 00:57:39.820 SPEAKER 3: Hi, everyone. 00:57:39.820 --> 00:57:40.740 I'm David. 00:57:40.740 --> 00:57:43.020 I'm a second year in Mather, and I'm planning 00:57:43.020 --> 00:57:44.745 on just studying computer science. 00:57:44.745 --> 00:57:45.480 [LAUGHING] 00:57:45.480 --> 00:57:46.605 SPEAKER 1: Well, thank you. 00:57:46.605 --> 00:57:50.190 We have some Oreos for you. 00:57:50.190 --> 00:57:51.497 Thank you, both, so much. 00:57:51.497 --> 00:57:52.830 Did anyone else dress like this? 00:57:52.830 --> 00:57:55.437 We have two more Oreos, if you'd like? 00:57:55.437 --> 00:57:56.020 Intentionally? 00:57:56.020 --> 00:57:58.082 [LAUGHING] 00:57:58.082 --> 00:57:59.040 Oh, that's pretty good. 00:57:59.040 --> 00:57:59.540 OK. 00:57:59.540 --> 00:58:00.750 Yes, we have one more winner. 00:58:00.750 --> 00:58:01.515 Come on down. 00:58:01.515 --> 00:58:03.810 [LAUGHS] 00:58:09.580 --> 00:58:10.317 Thank you. 00:58:10.317 --> 00:58:10.900 SPEAKER 4: Hi. 00:58:10.900 --> 00:58:12.010 SPEAKER 1: This is intentional? 00:58:12.010 --> 00:58:12.590 SPEAKER 4: No. 00:58:12.590 --> 00:58:13.340 SPEAKER 1: Oh, OK. 00:58:13.340 --> 00:58:14.410 [LAUGHING] 00:58:14.410 --> 00:58:16.150 SPEAKER 4: Hello, my name is David. 00:58:16.150 --> 00:58:18.490 I'm from Canada, and I'm a first year. 00:58:18.490 --> 00:58:20.200 I'm not sure what I'm going to study. 00:58:20.200 --> 00:58:20.783 SPEAKER 1: OK. 00:58:20.783 --> 00:58:21.800 Welcome, as well. 00:58:21.800 --> 00:58:22.675 SPEAKER 4: Thank you. 00:58:22.675 --> 00:58:23.560 SPEAKER 1: All right. 00:58:23.560 --> 00:58:26.270 [APPLAUSE] 00:58:27.732 --> 00:58:29.690 Up until now, we've played around with the data 00:58:29.690 --> 00:58:33.695 that you all gave us, which was based, very simply, on your favorite language 00:58:33.695 --> 00:58:34.820 and your favorite problems. 00:58:34.820 --> 00:58:38.030 But it turns out there's a lot of real world data in, indeed, 00:58:38.030 --> 00:58:40.880 the real world, some of which is quite voluminous. 00:58:40.880 --> 00:58:44.720 And, indeed, there can be not just dozens, or hundreds, but thousands, 00:58:44.720 --> 00:58:47.962 hundreds of thousands, even millions of rows in the biggest of databases. 00:58:47.962 --> 00:58:50.420 And so what we thought we'd do in the latter part of today, 00:58:50.420 --> 00:58:53.360 is really, actually, get our hands dirty with a real world 00:58:53.360 --> 00:58:57.020 data set from the Internet Movie Database, otherwise known as IMDb. 00:58:57.020 --> 00:59:01.010 And, in fact, if you go to imdb.com, you'll be able to answer, 00:59:01.010 --> 00:59:03.650 via their web interface, some of the very questions we'll 00:59:03.650 --> 00:59:05.720 do today using SQL alone. 00:59:05.720 --> 00:59:10.520 But what you'll find, ultimately, is that what websites like imdb.com, 00:59:10.520 --> 00:59:13.880 or their mobile app versions thereof, are probably doing 00:59:13.880 --> 00:59:17.420 is-- yes, giving you a nice pretty graphical interface to type queries, 00:59:17.420 --> 00:59:22.760 but underneath the hood they are passing your input into SQL queries, 00:59:22.760 --> 00:59:25.800 or similar queries, that they formed most of. 00:59:25.800 --> 00:59:28.320 They're just waiting for placeholders, like the keywords, 00:59:28.320 --> 00:59:30.160 that you're actually searching for. 00:59:30.160 --> 00:59:33.420 So let's go ahead and experiment, maybe, with just some real-world data, 00:59:33.420 --> 00:59:36.725 initially, before we consider how to actually store it at scale. 00:59:36.725 --> 00:59:38.850 So let me open up, just for the sake of discussion, 00:59:38.850 --> 00:59:41.190 an actual, empty spreadsheet, just so I have 00:59:41.190 --> 00:59:43.650 some rows and columns to play with. 00:59:43.650 --> 00:59:47.790 And let me propose that we want to model TV shows from the real world. 00:59:47.790 --> 00:59:49.420 How can we go about doing this? 00:59:49.420 --> 00:59:53.020 Well, maybe I could start in this first column A, so to speak, 00:59:53.020 --> 00:59:54.600 and I could create a title column. 00:59:54.600 --> 00:59:57.660 And then, maybe, a column for the star of that show. 00:59:57.660 --> 01:00:00.060 And a very popular show, of course, is The Office. 01:00:00.060 --> 01:00:03.900 So I might put this into the second cell in that first column. 01:00:03.900 --> 01:00:06.870 And under star, I could put someone like Steve Carell. 01:00:06.870 --> 01:00:10.200 But, of course, he wasn't the only star of the show. 01:00:10.200 --> 01:00:11.740 There are others as well. 01:00:11.740 --> 01:00:14.370 And so if I want to put in someone like Rainn Wilson, 01:00:14.370 --> 01:00:16.210 well, maybe, I need a second star column. 01:00:16.210 --> 01:00:17.640 So Rainn Wilson. 01:00:17.640 --> 01:00:20.010 But even as early as the first season, there 01:00:20.010 --> 01:00:24.420 was also another star in the credits, John Krasinski. 01:00:24.420 --> 01:00:25.440 So he was a star. 01:00:25.440 --> 01:00:29.280 Jenna Fischer was top credited in the first season. 01:00:29.280 --> 01:00:30.990 So Jenna Fischer. 01:00:30.990 --> 01:00:37.020 And then BJ Novak, Harvard alum, was also in the first season's opening 01:00:37.020 --> 01:00:37.870 credits, as well. 01:00:37.870 --> 01:00:39.360 So we've got all-- 01:00:39.360 --> 01:00:41.170 one, two, three, four, five of these folks. 01:00:41.170 --> 01:00:42.630 Hopefully, I didn't misspell anyone's name, 01:00:42.630 --> 01:00:45.030 but here's the beginnings of a real-world data set. 01:00:45.030 --> 01:00:47.970 And we could imagine doing this for everyone's favorite shows, 01:00:47.970 --> 01:00:49.630 adding more, and more rows. 01:00:49.630 --> 01:00:52.650 But let's consider, as we often do, not just the correctness 01:00:52.650 --> 01:00:56.220 of this implementation, but the design. 01:00:56.220 --> 01:00:57.990 It's pretty straightforward. 01:00:57.990 --> 01:00:58.972 It's very readable. 01:00:58.972 --> 01:01:00.430 So I think it's good in that sense. 01:01:00.430 --> 01:01:03.510 But if you start to nitpick what's poorly designed, 01:01:03.510 --> 01:01:06.240 even in the world of spreadsheets, about what I've done here-- 01:01:06.240 --> 01:01:08.610 assuming that the next row is another show, 01:01:08.610 --> 01:01:12.110 the next row is another show, and so forth. 01:01:12.110 --> 01:01:13.280 What's bad about this? 01:01:13.280 --> 01:01:13.780 Yeah? 01:01:13.780 --> 01:01:15.272 [INDISTINCT SPEECH] 01:01:15.272 --> 01:01:17.980 Yeah, so each row is going to have a different number of columns, 01:01:17.980 --> 01:01:20.440 and even I, kind of, couldn't make up my mind from the get go. 01:01:20.440 --> 01:01:22.000 Like, do I have just one star column? 01:01:22.000 --> 01:01:22.500 Or two? 01:01:22.500 --> 01:01:23.792 Or maybe now I'm up to 5? 01:01:23.792 --> 01:01:25.750 For even bigger shows, and later in The Office, 01:01:25.750 --> 01:01:27.625 when more people got top billing, we're going 01:01:27.625 --> 01:01:29.770 to need more than five columns for stars. 01:01:29.770 --> 01:01:31.060 So that's fine. 01:01:31.060 --> 01:01:34.682 We can clearly scroll to the right, and just keep adding more columns, 01:01:34.682 --> 01:01:36.640 but there should be something about that design 01:01:36.640 --> 01:01:38.890 that like rubs you the wrong way, like something 01:01:38.890 --> 01:01:41.620 feels a little off if some rows have this many columns, 01:01:41.620 --> 01:01:42.700 others have this many. 01:01:42.700 --> 01:01:45.520 The data would be very jagged along the right hand side. 01:01:45.520 --> 01:01:48.320 It would be very sparse, which would be another way to describe it. 01:01:48.320 --> 01:01:50.000 There's probably a better way. 01:01:50.000 --> 01:01:53.440 So maybe I should flip the data around, and maybe 01:01:53.440 --> 01:01:57.520 a better approach here would be to just have one column for stars. 01:01:57.520 --> 01:01:59.870 So let me do this. 01:01:59.870 --> 01:02:04.180 Let me just move Rainn Wilson over here, and John Krasinski over here, 01:02:04.180 --> 01:02:09.980 and Jenna Fischer over here, and BJ Novak over here, as well. 01:02:09.980 --> 01:02:13.850 I'll get rid of all of these superfluous, identically named columns. 01:02:13.850 --> 01:02:17.840 And now this is sort of better because now I can have any number of stars 01:02:17.840 --> 01:02:21.330 in the vertical, although it's a little weird to leave this blank. 01:02:21.330 --> 01:02:23.850 So maybe I should of copy paste here. 01:02:23.850 --> 01:02:27.920 So in some sense this is better, in that now I only 01:02:27.920 --> 01:02:30.860 have one title column, one star column, and I can just 01:02:30.860 --> 01:02:34.790 keep adding row, row, row, for each show and its stars. 01:02:34.790 --> 01:02:37.840 But what's now poorly designed about this? 01:02:37.840 --> 01:02:38.340 Yeah? 01:02:38.340 --> 01:02:39.970 [INDISTINCT SPEECH] 01:02:39.970 --> 01:02:40.470 Yeah. 01:02:40.470 --> 01:02:41.580 I'm repeating the title. 01:02:41.580 --> 01:02:44.220 And, in general, copy, paste, repeating yourself in code 01:02:44.220 --> 01:02:46.090 has generally been a bad thing. 01:02:46.090 --> 01:02:49.110 It's generally gotten us in trouble if I make a change, or maybe 01:02:49.110 --> 01:02:50.550 a typographical error somewhere. 01:02:50.550 --> 01:02:52.050 Maybe it propagates elsewhere. 01:02:52.050 --> 01:02:54.450 And if nothing else, it's just a lot of wasted space. 01:02:54.450 --> 01:02:57.300 If this is actually going to be stored in a database, in a CSV file, 01:02:57.300 --> 01:03:00.300 why are you duplicating the same string again, and again, and again, 01:03:00.300 --> 01:03:01.440 for large TV shows? 01:03:01.440 --> 01:03:02.370 That's just wasteful. 01:03:02.370 --> 01:03:04.780 It just doesn't seem wise. 01:03:04.780 --> 01:03:06.970 So how can we eliminate that redundancy? 01:03:06.970 --> 01:03:10.260 Well, unfortunately, in the world of spreadsheets, things kind of now 01:03:10.260 --> 01:03:13.080 escalate quickly to be kind of annoying. 01:03:13.080 --> 01:03:16.150 But let me do it, nonetheless, with just a small bit of data. 01:03:16.150 --> 01:03:18.250 Let me propose that we do this instead. 01:03:18.250 --> 01:03:21.977 Let me create not one sheet but maybe multiple sheets, 01:03:21.977 --> 01:03:24.060 and assume that there's some kind of relationship, 01:03:24.060 --> 01:03:25.680 or relation, across these sheets. 01:03:25.680 --> 01:03:29.100 So, just to be pedantic, let me call this sheet, not the default sheet one, 01:03:29.100 --> 01:03:30.360 but let's call this shows. 01:03:30.360 --> 01:03:34.750 And in this sheet, I'm going to have a title column for every show, 01:03:34.750 --> 01:03:36.990 and I think I'm going to be proactive here. 01:03:36.990 --> 01:03:41.200 I'm going to start giving every show a unique ID number, much like Harvard 01:03:41.200 --> 01:03:43.150 affiliates have Harvard IDs, Yale affiliates 01:03:43.150 --> 01:03:44.900 have Yale ID numbers, and so forth. 01:03:44.900 --> 01:03:48.280 Let's go ahead and give each show its own unique identifier, 01:03:48.280 --> 01:03:49.940 for reasons we'll soon see. 01:03:49.940 --> 01:03:53.110 So for The Office let me, just for consistency 01:03:53.110 --> 01:03:55.030 with the actual Internet Movie Database, I'm 01:03:55.030 --> 01:03:58.488 going to give it a unique number of 386676. 01:03:58.488 --> 01:04:00.280 The specifics don't really matter, but that 01:04:00.280 --> 01:04:03.010 happens to be what people in the real world actually do. 01:04:03.010 --> 01:04:05.950 But that's it for TV shows, even though I could imagine 01:04:05.950 --> 01:04:07.490 there being many more in this sheet. 01:04:07.490 --> 01:04:10.840 Let me create another sheet now here, and I'll call it people. 01:04:10.840 --> 01:04:15.490 And in the people sheet, let me keep track of all of those TV stars. 01:04:15.490 --> 01:04:20.500 So one column will be name, another will be also called ID here, 01:04:20.500 --> 01:04:25.070 but it's going to be a person ID, not a show ID. 01:04:25.070 --> 01:04:27.610 And here we have, for instance, Steve Carell. 01:04:27.610 --> 01:04:29.570 We have Rainn Wilson. 01:04:29.570 --> 01:04:31.870 We have John Krasinski. 01:04:31.870 --> 01:04:34.300 We have Jenna Fischer. 01:04:34.300 --> 01:04:36.670 And we have BJ Novak. 01:04:36.670 --> 01:04:38.600 And, this is going to be a little tedious, 01:04:38.600 --> 01:04:41.070 but just to be consistent with reality, turns out 01:04:41.070 --> 01:04:46.230 that according to IMDb Steve Carell's unique number in the world is 136797, 01:04:46.230 --> 01:04:56.640 Rainn's is 933988, John's is 1024677, Jenna's is 278979, 01:04:56.640 --> 01:05:01.814 and, lastly, BJ Novak's is 1145983. 01:05:01.814 --> 01:05:02.650 Phew. 01:05:02.650 --> 01:05:03.150 OK. 01:05:03.150 --> 01:05:06.360 So now we have the same people, but they each have a unique ID number. 01:05:06.360 --> 01:05:10.590 Lastly, let's associate those shows with those people in a way that 01:05:10.590 --> 01:05:13.320 avoids the two problems we identified earlier, which 01:05:13.320 --> 01:05:16.650 was having a variable number of columns in one case 01:05:16.650 --> 01:05:18.960 versus redundancy in the second case. 01:05:18.960 --> 01:05:22.710 Let's really tighten things up so nothing is in duplicate 01:05:22.710 --> 01:05:24.880 that doesn't actually need to be. 01:05:24.880 --> 01:05:28.890 So I'm going to create a third sheet here, and I'll call it stars, 01:05:28.890 --> 01:05:31.770 like the TV stars for these shows, and what I'm going to do 01:05:31.770 --> 01:05:36.292 is have only two columns, a show ID, and a person ID. 01:05:36.292 --> 01:05:38.250 I could write these in different ways, but it's 01:05:38.250 --> 01:05:41.740 conventional in the database world to use snake case, so to speak, 01:05:41.740 --> 01:05:46.150 where everything is lowercase, with underscores instead of spaces. 01:05:46.150 --> 01:05:49.390 And for show ID-- well, we're only mocking up one show for now, 01:05:49.390 --> 01:05:54.100 but I'm going to go ahead and say 386676, which is The Office, I claimed. 01:05:54.100 --> 01:05:57.160 And now I'm going to go ahead and have all of those person IDs, 01:05:57.160 --> 01:06:00.230 and this time it's OK to copy paste, if only to save time. 01:06:00.230 --> 01:06:04.540 So I'm going to grab all of these five stars IDs, paste them there, 01:06:04.540 --> 01:06:08.840 and I am going to indulge by duplicating the show ID, 01:06:08.840 --> 01:06:10.580 even though we didn't like that earlier. 01:06:10.580 --> 01:06:14.380 Now this indeed has escalated quickly because none of these sheets 01:06:14.380 --> 01:06:17.590 are very useful to look at to the human eye because none of them 01:06:17.590 --> 01:06:18.910 has the complete picture. 01:06:18.910 --> 01:06:22.600 We have shows in one, people in another, and then like this cryptic mapping 01:06:22.600 --> 01:06:24.250 of numbers in the third. 01:06:24.250 --> 01:06:29.200 But I propose that this is the right way to implement data 01:06:29.200 --> 01:06:32.530 if your goal is to have a canonical source of truth 01:06:32.530 --> 01:06:34.060 for every show and every person. 01:06:34.060 --> 01:06:37.990 That is to say, you only say the show's name once in one place. 01:06:37.990 --> 01:06:43.030 You only write the TV star's names once in one place. 01:06:43.030 --> 01:06:46.330 And you associate them, you relate one with the other, 01:06:46.330 --> 01:06:48.710 by way of this third sheet here. 01:06:48.710 --> 01:06:52.270 So if you've not seen it already, notice that if The Office has 01:06:52.270 --> 01:06:56.800 this unique ID, 386676, notice in the stars table, 01:06:56.800 --> 01:06:59.620 that same value appears multiple times. 01:06:59.620 --> 01:07:04.810 But what this third sheet is doing is associating that same show ID 01:07:04.810 --> 01:07:08.510 with one, two, three, four, five different people. 01:07:08.510 --> 01:07:11.860 Now I can see that it's similar in spirit to what we already 01:07:11.860 --> 01:07:13.780 indicted as bad design a moment ago. 01:07:13.780 --> 01:07:15.460 The Office, The Office, The Office. 01:07:15.460 --> 01:07:18.970 But think about our world of C. In the world of C, and really computers 01:07:18.970 --> 01:07:22.180 in general, data takes up finite amount of space, typically. 01:07:22.180 --> 01:07:24.470 Like an integer is four bytes, 32 bits. 01:07:24.470 --> 01:07:27.070 So even though, yes, I'm duplicating this value, 01:07:27.070 --> 01:07:30.460 it's just the same four bytes, four bytes, four bytes. 01:07:30.460 --> 01:07:36.340 It's not t-h-e space o-f-f-i-c-e, null character. 01:07:36.340 --> 01:07:38.755 It's not the same 11 bytes again, and again. 01:07:38.755 --> 01:07:41.380 It's just a number, and numbers tend to be much more efficient. 01:07:41.380 --> 01:07:43.660 That computers can crunch numbers much more quickly. 01:07:43.660 --> 01:07:47.920 Duplicating numbers is in general allowed, or smiled upon. 01:07:47.920 --> 01:07:50.660 Duplicating strings will get you into trouble. 01:07:50.660 --> 01:07:54.212 So with that said, is this a useful spreadsheet now? 01:07:54.212 --> 01:07:56.920 Would you want to be handed this in your job and asked questions? 01:07:56.920 --> 01:07:59.020 Like, hey, who stars in The Office? 01:07:59.020 --> 01:08:01.390 You can answer it, but you have to look up one sheet, 01:08:01.390 --> 01:08:02.950 then another, then a third. 01:08:02.950 --> 01:08:06.490 Or you need to use VLOOKUP, or special functions in Excel, or Google Sheets. 01:08:06.490 --> 01:08:08.823 I mean, you're just creating a lot of work for yourself. 01:08:08.823 --> 01:08:12.340 But, academically, if you will, systematically, this 01:08:12.340 --> 01:08:15.640 has a lot of merit because we've avoided all duplication. 01:08:15.640 --> 01:08:20.800 We've normalized the data, so to speak, by factoring out any duplication. 01:08:20.800 --> 01:08:22.700 So where are we going with this? 01:08:22.700 --> 01:08:25.720 Well, it turns out that we'll play now with some actual data 01:08:25.720 --> 01:08:28.481 from the real world from the actual Internet Movie database. 01:08:28.481 --> 01:08:31.189 And in a moment, it's going to look a little something like this. 01:08:31.189 --> 01:08:34.689 This is an artist's rendition of five different-- sorry, 01:08:34.689 --> 01:08:36.800 six different tables. 01:08:36.800 --> 01:08:39.430 So not one, but six different tables that we 01:08:39.430 --> 01:08:43.479 created using freely available Internet Movie Database data. 01:08:43.479 --> 01:08:47.904 They kindly provide, not CSVs, but TSVs, tab separated values, 01:08:47.904 --> 01:08:49.779 which are essentially the same thing, but you 01:08:49.779 --> 01:08:51.729 look for tab characters instead of commas 01:08:51.729 --> 01:08:55.120 in order to convert them, as we did, into our own format, 01:08:55.120 --> 01:08:56.800 SQLite in this case. 01:08:56.800 --> 01:08:59.404 But we'll see that there is a file that we've 01:08:59.404 --> 01:09:02.529 made available from today, which you can download off the course's website, 01:09:02.529 --> 01:09:07.149 called shows.db, and it contains all of this same information. 01:09:07.149 --> 01:09:10.660 And in that shows.db file there are indeed these six tables, 01:09:10.660 --> 01:09:13.180 but let's focus on just two of them initially. 01:09:13.180 --> 01:09:16.038 IMDb, the Internet Movie Database, is all about rating TV shows, 01:09:16.038 --> 01:09:18.580 and tracking that kind of information, so let's actually take 01:09:18.580 --> 01:09:20.497 a look at some of these ratings and figure out 01:09:20.497 --> 01:09:22.970 how we can actually answer actual questions. 01:09:22.970 --> 01:09:24.760 So let me go over to VS Code. 01:09:24.760 --> 01:09:29.750 And let me run SQLite of shows.db, which is a file that already exists. 01:09:29.750 --> 01:09:33.010 There's no CSVs, no TSVs, we did all of this for you already. 01:09:33.010 --> 01:09:34.932 When I hit enter, I get my SQLite prompt. 01:09:34.932 --> 01:09:36.640 And the first thing I like to do whenever 01:09:36.640 --> 01:09:39.189 I'm playing with a SQL database for the first time-- maybe 01:09:39.189 --> 01:09:41.410 I got it from a class, or my boss, or the like-- 01:09:41.410 --> 01:09:44.979 is just to wrap my mind around what's inside of the database 01:09:44.979 --> 01:09:48.069 because you're not typically going to be provided with pictures like this. 01:09:48.069 --> 01:09:49.729 You're just going to be given a file. 01:09:49.729 --> 01:09:51.220 So let me just select some data. 01:09:51.220 --> 01:09:54.468 Let me select star from the shows table. 01:09:54.468 --> 01:09:57.010 I don't really want to see all of it, so let me just limit it 01:09:57.010 --> 01:09:59.390 to the first 10 shows in the table. 01:09:59.390 --> 01:10:03.310 And here, we can infer what the shows table looks like. 01:10:03.310 --> 01:10:07.270 Every show has an ID, a title, the year in which it debuted, apparently, 01:10:07.270 --> 01:10:11.320 and the number of episodes as of last night when we exported the data. 01:10:11.320 --> 01:10:15.010 So that seems to reflect this picture, and this is technically 01:10:15.010 --> 01:10:19.000 an entity relationship diagram, a standard way of depicting things. 01:10:19.000 --> 01:10:21.670 And you'll see that in our picture shows, indeed, 01:10:21.670 --> 01:10:25.480 have an ID column, title column, year column, and episodes column. 01:10:25.480 --> 01:10:26.840 Well what about these ratings? 01:10:26.840 --> 01:10:29.650 Well, according to the picture, that has a show ID, a rating, 01:10:29.650 --> 01:10:30.830 and a votes column. 01:10:30.830 --> 01:10:34.690 So let's go back to VS Code here, and let's do select star 01:10:34.690 --> 01:10:39.670 from ratings, limit 10, just to wrap our mind around some of the data. 01:10:39.670 --> 01:10:42.100 And there we have a show ID in the left, we 01:10:42.100 --> 01:10:45.170 have rating in the middle, which seems to be like a floating point value, 01:10:45.170 --> 01:10:46.940 and then votes, which seems to be an integer. 01:10:46.940 --> 01:10:48.815 So we have some different types of data here. 01:10:48.815 --> 01:10:50.270 But there's a lot of data. 01:10:50.270 --> 01:10:51.610 In fact, if I do this now. 01:10:51.610 --> 01:10:55.180 Select star from shows-- 01:10:55.180 --> 01:10:56.630 let's not select all of the data. 01:10:56.630 --> 01:11:00.670 Let's do select count stars from shows, and in this database 01:11:00.670 --> 01:11:04.640 there are 214,000 shows in this database. 01:11:04.640 --> 01:11:08.780 So we're well past the 398 rows that we've been talking about thus far. 01:11:08.780 --> 01:11:12.010 So it turns out, per this diagram, there's actually 01:11:12.010 --> 01:11:16.360 a standard relationship between these two tables, shows and ratings 01:11:16.360 --> 01:11:18.880 respectively, and it's what we would call a one-to-one one 01:11:18.880 --> 01:11:23.430 relationship whereby every show in this design has one rating. 01:11:23.430 --> 01:11:25.180 And this is indicated, technically, if you 01:11:25.180 --> 01:11:27.755 look at what the arrowheads look like on these diagrams. 01:11:27.755 --> 01:11:30.130 This indicates that it's a one-to-one relationship, which 01:11:30.130 --> 01:11:34.450 means every show has one rating, which means every row in the shows table 01:11:34.450 --> 01:11:36.970 has a corresponding row in the ratings table. 01:11:36.970 --> 01:11:39.640 Strictly speaking, they could be in the same table. 01:11:39.640 --> 01:11:42.130 You could just join them together and make one wider table, 01:11:42.130 --> 01:11:44.800 but IMDb keeps the data separate so we too 01:11:44.800 --> 01:11:47.210 kept them separate in two separate tables here. 01:11:47.210 --> 01:11:50.120 So what does this actually mean in practice? 01:11:50.120 --> 01:11:54.590 Well, let's actually take a look in VS Code at the schema for these tables. 01:11:54.590 --> 01:11:57.160 Let me clear my screen, and let me .schema, 01:11:57.160 --> 01:12:01.570 but specifically look at the schema, or the design of, the shows table. 01:12:01.570 --> 01:12:06.640 So you can do .schema shows, and any command in SQLite with a dot is SQLite 01:12:06.640 --> 01:12:08.050 specific. 01:12:08.050 --> 01:12:10.900 In the real world, if you're using other products like Oracle, 01:12:10.900 --> 01:12:14.350 or Postgre, or MySQL, or others, they have different commands, 01:12:14.350 --> 01:12:16.300 but anything else that we've been typing, 01:12:16.300 --> 01:12:20.260 especially the capitalized keywords, is indeed standard SQL. 01:12:20.260 --> 01:12:25.310 If I hit enter here, here is what the shows table apparently looks like. 01:12:25.310 --> 01:12:28.480 In other words, here is the create table command that we, the staff, 01:12:28.480 --> 01:12:30.550 ran in order to create this table for you, 01:12:30.550 --> 01:12:32.560 and then we imported a bunch of data into it. 01:12:32.560 --> 01:12:38.200 Every show has an ID, has a title, has a year, has a number of episodes. 01:12:38.200 --> 01:12:39.610 But there's more detail here. 01:12:39.610 --> 01:12:43.600 Apparently, the ID is an integer, the title is text, and it is not null, 01:12:43.600 --> 01:12:47.170 cannot be null, the year is numeric, whatever that means, 01:12:47.170 --> 01:12:49.030 and the episodes is an integer. 01:12:49.030 --> 01:12:50.740 Well, that is now familiar. 01:12:50.740 --> 01:12:51.830 And then primary key. 01:12:51.830 --> 01:12:53.622 So there's some other stuff going on there, 01:12:53.622 --> 01:12:57.760 which we'll come back to, but let me also do .schema ratings, enter, 01:12:57.760 --> 01:13:00.280 and we'll see a couple of other data types here. 01:13:00.280 --> 01:13:03.070 There's show ID, which is still an integer, but not null. 01:13:03.070 --> 01:13:05.560 Rating, which is a real number, a.k.a. 01:13:05.560 --> 01:13:07.840 float, but it, too, cannot be null. 01:13:07.840 --> 01:13:10.750 And then some number of votes, which is an integer, cannot be null. 01:13:10.750 --> 01:13:12.560 And then there's mention of foreign key. 01:13:12.560 --> 01:13:16.300 So unlike our data set for the favorites a bit ago, which we just 01:13:16.300 --> 01:13:20.530 did automatically, and imported, this database by us, and by IMDb, 01:13:20.530 --> 01:13:23.680 has been more thoughtfully designed, where there's actually 01:13:23.680 --> 01:13:27.640 some relationships across multiple tables, rather than previously 01:13:27.640 --> 01:13:28.420 just one. 01:13:28.420 --> 01:13:32.300 Now in the world of SQL, we have indeed different data types. 01:13:32.300 --> 01:13:35.140 For instance, we have these five primarily. 01:13:35.140 --> 01:13:37.750 One, cutely named BLOB, which is actually 01:13:37.750 --> 01:13:41.990 Binary Large Object, which generally means like a file or some piece of data 01:13:41.990 --> 01:13:43.190 that's zeros and ones. 01:13:43.190 --> 01:13:46.160 Though, generally, it's best to store files on file systems, 01:13:46.160 --> 01:13:50.090 like in folders, on disks, so to speak, not in your database. 01:13:50.090 --> 01:13:51.440 There's integers we've seen. 01:13:51.440 --> 01:13:52.410 There's numeric. 01:13:52.410 --> 01:13:54.740 Which numeric is more like dates and times, 01:13:54.740 --> 01:13:58.340 things that are numbers, but not necessarily integers or floating point 01:13:58.340 --> 01:13:58.970 values. 01:13:58.970 --> 01:14:01.460 There's reals, which do have decimal points in them. 01:14:01.460 --> 01:14:02.780 And then there's just text. 01:14:02.780 --> 01:14:06.290 In other SQL databases in the real world, in your future jobs, 01:14:06.290 --> 01:14:09.740 or your future classes that you might use SQL again in, 01:14:09.740 --> 01:14:14.960 there are even more data types in other databases: Oracle, MySQL, Postgre, 01:14:14.960 --> 01:14:15.605 and so forth. 01:14:15.605 --> 01:14:17.480 But these are sort of representative of them. 01:14:17.480 --> 01:14:19.767 They just get more precise in other systems. 01:14:19.767 --> 01:14:22.100 But there's also some keywords, that we've seen already, 01:14:22.100 --> 01:14:24.230 that you can specify when designing a database. 01:14:24.230 --> 01:14:26.090 --that this column cannot be null. 01:14:26.090 --> 01:14:29.720 If you want to make sure that no one can insert or update 01:14:29.720 --> 01:14:32.030 data unbeknownst to you that is null, you 01:14:32.030 --> 01:14:34.340 can impose that when creating the table. 01:14:34.340 --> 01:14:36.740 And unlike Excel, and Google Spreadsheets, 01:14:36.740 --> 01:14:40.010 and Apple Numbers, which will generally let the human type in or not type 01:14:40.010 --> 01:14:43.520 in anything they want, with a database you have more protections 01:14:43.520 --> 01:14:45.365 over the integrity of your data. 01:14:45.365 --> 01:14:48.900 Moreover, you can specify that a column's values must be unique. 01:14:48.900 --> 01:14:50.750 If you want to avoid duplicates, like you 01:14:50.750 --> 01:14:54.740 don't want the same person to be able to register twice for your website thereby 01:14:54.740 --> 01:14:57.500 making sure they have one unique email address, 01:14:57.500 --> 01:14:59.840 your database can help with that too. 01:14:59.840 --> 01:15:03.470 You don't have to rely on Python to check if it already exists. 01:15:03.470 --> 01:15:07.520 But there's this other feature of relational databases, that 01:15:07.520 --> 01:15:11.480 is databases that have multiple tables across which there are relationships, 01:15:11.480 --> 01:15:13.880 and that's these keywords we saw briefly a moment ago. 01:15:13.880 --> 01:15:16.620 Primary key, and foreign key. 01:15:16.620 --> 01:15:20.180 And we started to scratch this surface here. 01:15:20.180 --> 01:15:22.490 It turns out what I was doing was actually 01:15:22.490 --> 01:15:25.340 best practice in the world of relational databases 01:15:25.340 --> 01:15:30.680 I gave or, really, IMDb gave every show in the world a unique ID. 01:15:30.680 --> 01:15:36.470 And that unique ID, in this case 386676, is a numeric value, an integer, 01:15:36.470 --> 01:15:38.930 that uniquely identifies that TV show. 01:15:38.930 --> 01:15:43.370 In other words, this is the primary key for this table. 01:15:43.370 --> 01:15:46.370 Technically a sheet, but I'm using that just because it's easier to type 01:15:46.370 --> 01:15:48.020 in than my black and white window. 01:15:48.020 --> 01:15:53.960 This ID column is the primary key for shows, as I was mocking up earlier. 01:15:53.960 --> 01:15:57.290 What is the primary key in the people sheet here? 01:15:57.290 --> 01:15:58.580 It is also ID. 01:15:58.580 --> 01:16:02.450 It's a different ID, but it's by convention often called the same thing. 01:16:02.450 --> 01:16:07.037 But this people column, called ID, is its primary key. 01:16:07.037 --> 01:16:08.870 And you can perhaps see where this is going. 01:16:08.870 --> 01:16:13.910 Those same numbers also happen to appear in this third table, 01:16:13.910 --> 01:16:17.180 but in that context, they're sort of foreign keys. 01:16:17.180 --> 01:16:21.420 They didn't come from this star sheet, but they are in this star sheet, 01:16:21.420 --> 01:16:23.690 so they're sort of relatively foreign to it. 01:16:23.690 --> 01:16:29.030 So foreign keys is simply the presence of primary keys in some other table, 01:16:29.030 --> 01:16:30.630 in some other data set. 01:16:30.630 --> 01:16:33.050 And so it's just a description of relativity, 01:16:33.050 --> 01:16:36.860 but the foreign key is the column that uniquely identifies your data. 01:16:36.860 --> 01:16:41.190 Foreign keys is just the appearance of those same numbers elsewhere. 01:16:41.190 --> 01:16:42.570 So what does this mean? 01:16:42.570 --> 01:16:45.020 If we go back to VS Code here, you'll see 01:16:45.020 --> 01:16:48.860 that when we created this table using IMDb's real data, 01:16:48.860 --> 01:16:52.640 we specified that in our shows table, which is bigger than the one 01:16:52.640 --> 01:16:56.180 I mocked up with Google Sheets there, has not only an ID and a title it, 01:16:56.180 --> 01:16:58.615 again, also has year in which the show debuted, 01:16:58.615 --> 01:16:59.990 and the total number of episodes. 01:16:59.990 --> 01:17:02.810 Because that's juicy data that comes from IMDb, 01:17:02.810 --> 01:17:04.670 beyond what I mocked up a moment ago. 01:17:04.670 --> 01:17:08.150 In the ratings table, meanwhile, there's a show ID, 01:17:08.150 --> 01:17:12.360 which is also an integer, just like this ID, but as the name implies, 01:17:12.360 --> 01:17:17.810 this show ID column is actually going to be a foreign key that references 01:17:17.810 --> 01:17:20.640 the shows tables ID column. 01:17:20.640 --> 01:17:23.420 So this is the relational in relational databases. 01:17:23.420 --> 01:17:25.530 These are two tables that have a relationship, 01:17:25.530 --> 01:17:32.520 and that relationship is that show ID is referring to this actual ID here. 01:17:32.520 --> 01:17:34.520 And it allows us essentially, conceptually, 01:17:34.520 --> 01:17:37.350 to link these two tables together. 01:17:37.350 --> 01:17:39.353 So what does this actually mean? 01:17:39.353 --> 01:17:40.770 Well, let me go ahead and do this. 01:17:40.770 --> 01:17:42.140 Let me go back to VS Code here. 01:17:42.140 --> 01:17:45.120 I'll clear my terminal, and let's play around with some of this data. 01:17:45.120 --> 01:17:48.080 So let's go ahead and do this just to experiment. 01:17:48.080 --> 01:17:51.320 Select star from ratings where-- 01:17:51.320 --> 01:17:54.860 let's get all of the good shows where, just like Rotten Tomatoes, 01:17:54.860 --> 01:17:57.410 we'll do the cut off at 6.0 out of 10. 01:17:57.410 --> 01:18:01.063 So where rating is greater than or equal to 6.0. 01:18:01.063 --> 01:18:03.230 And just so I don't overwhelm my screen, let me just 01:18:03.230 --> 01:18:05.370 limit this to the first 10 results. 01:18:05.370 --> 01:18:08.510 In other words, this is SQL syntax for selecting 01:18:08.510 --> 01:18:12.920 all of the ratings that are at least 6.0 or higher from that table. 01:18:12.920 --> 01:18:13.640 Enter. 01:18:13.640 --> 01:18:15.350 And we see just the first 10 of them. 01:18:15.350 --> 01:18:18.540 Not the top 10 because we've not sorted or grouped or anything like that, 01:18:18.540 --> 01:18:20.490 but the first 10 in the table. 01:18:20.490 --> 01:18:23.780 So what is interesting here is that we've 01:18:23.780 --> 01:18:26.390 seen just some sampling of the data, if you will. 01:18:26.390 --> 01:18:28.790 But this isn't all that interesting here. 01:18:28.790 --> 01:18:32.220 Let me actually distill this just to the show ID because, in other words, 01:18:32.220 --> 01:18:34.190 I want to know 10 good shows to watch. 01:18:34.190 --> 01:18:36.440 So let me just select show ID. 01:18:36.440 --> 01:18:39.480 So same result, but less data. 01:18:39.480 --> 01:18:41.700 It's just that first column thereof. 01:18:41.700 --> 01:18:43.540 Now this is going to be a little annoying, 01:18:43.540 --> 01:18:47.460 but if I want to find out the names of these shows-- 01:18:47.460 --> 01:18:50.400 think about the picture from whence we came. 01:18:50.400 --> 01:18:55.320 All of the show's names are in the shows table, but all of the show's ratings 01:18:55.320 --> 01:18:57.520 are in the ratings table. 01:18:57.520 --> 01:19:00.058 So even if I do select star from ratings, 01:19:00.058 --> 01:19:02.100 I'm never going to know what show I'm looking at. 01:19:02.100 --> 01:19:04.500 Like, what the heck is show ID 62614? 01:19:04.500 --> 01:19:05.680 Well, I could do this. 01:19:05.680 --> 01:19:14.370 I could select star from shows where the ID of the show equals 62614 semicolon. 01:19:14.370 --> 01:19:15.240 Enter. 01:19:15.240 --> 01:19:18.270 OK, so I could watch this show from 1981. 01:19:18.270 --> 01:19:19.290 Let me do another one. 01:19:19.290 --> 01:19:24.030 Select star from shows where ID equals 63881. 01:19:24.030 --> 01:19:26.710 So I'm just grabbing the second ID from here. 01:19:26.710 --> 01:19:27.210 OK. 01:19:27.210 --> 01:19:29.530 So Catweazle, a kid's show from 1970. 01:19:29.530 --> 01:19:30.030 All right. 01:19:30.030 --> 01:19:31.120 So I'll watch that. 01:19:31.120 --> 01:19:32.620 So now let's do another one. 01:19:32.620 --> 01:19:33.810 I'll just copy this. 01:19:33.810 --> 01:19:37.530 Like suffice it to say, this is not the best way to look up data, 01:19:37.530 --> 01:19:41.610 where I'm literally copying and pasting values from one query into the next. 01:19:41.610 --> 01:19:43.950 But this is where SQL gets a bit powerful. 01:19:43.950 --> 01:19:46.170 I can have nested queries. 01:19:46.170 --> 01:19:47.710 I can put one inside of the other. 01:19:47.710 --> 01:19:48.793 So let me instead do this. 01:19:48.793 --> 01:19:51.130 Let me clear the screen, and let me instead do this. 01:19:51.130 --> 01:19:57.620 Select star from shows where the ID of the show 01:19:57.620 --> 01:20:01.160 is in the following list of IDs. 01:20:01.160 --> 01:20:02.630 Select. 01:20:02.630 --> 01:20:05.480 And, actually, I'll do this on a separate line. 01:20:05.480 --> 01:20:12.020 Select show ID from ratings where the rating value is 01:20:12.020 --> 01:20:15.800 greater than or equal to 6.0 semicolon. 01:20:15.800 --> 01:20:17.570 So I've separated this onto two lines. 01:20:17.570 --> 01:20:19.710 The dot, dot, dot is just a continuation character, 01:20:19.710 --> 01:20:23.480 which means same query, multiple lines, but the parentheses are deliberate. 01:20:23.480 --> 01:20:27.500 Just like grade school math, I want what's in parentheses to happen first. 01:20:27.500 --> 01:20:29.930 And so what the database will do is we'll 01:20:29.930 --> 01:20:33.740 select, as before, all of the show IDs from the ratings table 01:20:33.740 --> 01:20:36.713 where the rating value is at least 6.0 out of 10. 01:20:36.713 --> 01:20:38.630 And that's going to return to me, effectively, 01:20:38.630 --> 01:20:41.750 a list, some kind of collection of show IDs, 01:20:41.750 --> 01:20:43.760 which previously I was copying and pasting, 01:20:43.760 --> 01:20:47.390 now the database will do the legwork for me. 01:20:47.390 --> 01:20:51.410 It will now select everything from the shows table, where the ID of the show 01:20:51.410 --> 01:20:53.655 is in that list of values. 01:20:53.655 --> 01:20:56.280 And it's actually going to be more than 10 unless I go in there 01:20:56.280 --> 01:20:58.690 and say limit 10, which I can do. 01:20:58.690 --> 01:21:00.510 So let me go ahead and hit enter now. 01:21:00.510 --> 01:21:03.630 And now I see more useful information not just 01:21:03.630 --> 01:21:05.640 the ratings information, which in a vacuum 01:21:05.640 --> 01:21:07.320 tells me nothing about what to watch. 01:21:07.320 --> 01:21:12.360 Now I see the show ID, the title, the year, and the episodes. 01:21:12.360 --> 01:21:17.200 But notably, what is-- and if I want to distill this into just the title, 01:21:17.200 --> 01:21:20.010 let me actually go back here and instead do 01:21:20.010 --> 01:21:25.140 select just the title from shows where the ID is in this whole list. 01:21:25.140 --> 01:21:28.210 And I'll re-execute it by just copying and pasting the same. 01:21:28.210 --> 01:21:31.240 The only difference now is instead of star, I'm selecting title. 01:21:31.240 --> 01:21:34.080 Here's now how the data analyst at IMDb might 01:21:34.080 --> 01:21:38.430 be selecting 10 shows that are really good to watch, according to ratings. 01:21:38.430 --> 01:21:41.490 But what, of course, is missing from the output, 01:21:41.490 --> 01:21:45.065 whether I do star or just title? 01:21:45.065 --> 01:21:45.690 What's missing? 01:21:45.690 --> 01:21:45.960 Yeah? 01:21:45.960 --> 01:21:46.860 AUDIENCE: Rating. 01:21:46.860 --> 01:21:47.940 - The actual rating. 01:21:47.940 --> 01:21:51.000 I know these are at least 6.0, but which is 7.0? 01:21:51.000 --> 01:21:52.050 Which is 10.0? 01:21:52.050 --> 01:21:55.270 It'd be nice to actually combine the data in some way. 01:21:55.270 --> 01:21:57.840 So we can actually do that, too, because it turns out 01:21:57.840 --> 01:22:00.960 that when you have two tables in the world of SQL, or even more, 01:22:00.960 --> 01:22:03.720 you can actually join them together. 01:22:03.720 --> 01:22:08.700 You can join them together, literally using a keyword called join. 01:22:08.700 --> 01:22:10.540 And you can do this as follows. 01:22:10.540 --> 01:22:13.253 Let me propose, with a sample data set, these two tables. 01:22:13.253 --> 01:22:16.170 And dot, dot, dot just means we don't really care about the specifics. 01:22:16.170 --> 01:22:17.800 We just care about the structure. 01:22:17.800 --> 01:22:21.000 So on the left here is a simplified version of my shows 01:22:21.000 --> 01:22:24.270 table that has a show ID and a show title, 01:22:24.270 --> 01:22:27.990 but I've omitted year and episodes as just intellectually 01:22:27.990 --> 01:22:30.180 distracting from the basic structure here. 01:22:30.180 --> 01:22:31.740 But they're there in the real table. 01:22:31.740 --> 01:22:34.890 On the right hand side here, we have the ratings table 01:22:34.890 --> 01:22:37.290 with just two of its columns, the show ID and the rating, 01:22:37.290 --> 01:22:39.457 but I've omitted the votes because it doesn't really 01:22:39.457 --> 01:22:40.750 add anything to the discussion. 01:22:40.750 --> 01:22:42.600 But let me propose this. 01:22:42.600 --> 01:22:45.750 Notice that these two tables and these two rows 01:22:45.750 --> 01:22:47.790 therein definitely have commonalities. 01:22:47.790 --> 01:22:50.370 Like, they both have the same ID. 01:22:50.370 --> 01:22:54.150 In the left table it's 386676, a.k.a. 01:22:54.150 --> 01:22:55.860 The primary key of that row. 01:22:55.860 --> 01:23:01.200 But it's also appearing in duplicate in the right hand table, 386676, 01:23:01.200 --> 01:23:02.977 and in that context it's a foreign key. 01:23:02.977 --> 01:23:04.810 The point, though, is that they're the same. 01:23:04.810 --> 01:23:08.320 So wouldn't it be nice if I could treat one table here, one table here, 01:23:08.320 --> 01:23:12.150 and if my fingertips represent these identical values, kind of like glue 01:23:12.150 --> 01:23:15.715 them together so I get one wider table with all of the information 01:23:15.715 --> 01:23:18.090 together to satisfy your concern, that we don't even know 01:23:18.090 --> 01:23:20.015 what the ratings are of those shows. 01:23:20.015 --> 01:23:21.390 Well let me go ahead and do this. 01:23:21.390 --> 01:23:24.488 Just for artist's rendition, let me flip title and ID, 01:23:24.488 --> 01:23:25.780 which has no functional effect. 01:23:25.780 --> 01:23:28.488 It's just going to put the numbers closer together on the screen. 01:23:28.488 --> 01:23:32.610 Let me then literally highlight the fact that these two numbers are identical, 01:23:32.610 --> 01:23:35.190 and let me propose that we do the equivalent of this, 01:23:35.190 --> 01:23:38.910 we somehow join these two tables on that common value. 01:23:38.910 --> 01:23:40.995 Strictly speaking, I don't need both values 01:23:40.995 --> 01:23:43.870 because they're duplicates, so I don't care if one of them goes away. 01:23:43.870 --> 01:23:46.470 But what I'd really like to do is select, indeed, 01:23:46.470 --> 01:23:52.230 a temporary table that is the joined version of that original data. 01:23:52.230 --> 01:23:54.840 And frankly, I don't really care as the user what the ID is. 01:23:54.840 --> 01:23:57.690 Heck, all I care about is what show to watch and what its rating is. 01:23:57.690 --> 01:23:59.370 Give me the title and the rating. 01:23:59.370 --> 01:24:01.590 All of these numbers are, again, metadata, things 01:24:01.590 --> 01:24:05.490 that the computer cares about, but we humans probably do not. 01:24:05.490 --> 01:24:07.890 So how can we implement that idea? 01:24:07.890 --> 01:24:09.840 Of taking one data set that has a relationship 01:24:09.840 --> 01:24:12.510 with this data set, and somehow combine it together? 01:24:12.510 --> 01:24:14.170 Well let me go back to VS Code here. 01:24:14.170 --> 01:24:15.330 Let me clear my screen. 01:24:15.330 --> 01:24:20.220 And this is going to be a bit cryptic at first, but it's very step by step. 01:24:20.220 --> 01:24:21.010 Let me do this. 01:24:21.010 --> 01:24:25.890 Select star from shows, but not from shows alone. 01:24:25.890 --> 01:24:28.650 Let me join it with the ratings table. 01:24:28.650 --> 01:24:32.400 So let me select everything from shows joined with ratings, 01:24:32.400 --> 01:24:35.910 but I need to tell the database, well, what do I want to join things on? 01:24:35.910 --> 01:24:37.200 What are my fingertips? 01:24:37.200 --> 01:24:40.030 Specifically, I want them to join on those common integers. 01:24:40.030 --> 01:24:44.940 So I can literally say, on, and then I can specify one table on the left. 01:24:44.940 --> 01:24:51.030 Shows.id on the left should equal the ratings table's shows.id column 01:24:51.030 --> 01:24:52.470 on the right. 01:24:52.470 --> 01:24:57.120 Again, if I'm joining shows with ratings it's called ID in one, 01:24:57.120 --> 01:25:00.640 it's called show ID in the other, but it's the exact same thing. 01:25:00.640 --> 01:25:03.210 In fact, if I rewind, this is where we came from. 01:25:03.210 --> 01:25:05.290 Two tables with the same value. 01:25:05.290 --> 01:25:10.170 So with this query here, if I go ahead and now specify not just that but let 01:25:10.170 --> 01:25:14.760 me further say, where rating is greater than or equal to 6.0. 01:25:14.760 --> 01:25:16.500 And, heck, let's limit it to 10. 01:25:16.500 --> 01:25:21.420 Just fits on the screen, so it's more of a mouthful, but when I hit enter now 01:25:21.420 --> 01:25:26.220 we have a wider table that indeed contains everything, star, from having 01:25:26.220 --> 01:25:29.340 joined these two tables left and right. 01:25:29.340 --> 01:25:31.710 Now, again, I don't really care about much of this data, 01:25:31.710 --> 01:25:34.410 like year and episodes and definitely not the IDs. 01:25:34.410 --> 01:25:36.070 So let me actually hit up. 01:25:36.070 --> 01:25:38.070 Let me go to the beginning of the query, and let 01:25:38.070 --> 01:25:41.640 me just select the title of the show and the rating of the show. 01:25:41.640 --> 01:25:45.780 The query is getting a little long and it's wrapping, but it's the same query. 01:25:45.780 --> 01:25:49.500 Except, instead of star, I've done title comma rating. 01:25:49.500 --> 01:25:50.910 Now when I hit enter-- 01:25:50.910 --> 01:25:54.190 like, this is the list that would have been nice to see the first time around. 01:25:54.190 --> 01:25:57.280 Show me 10 shows with a rating of 6.0 or higher, 01:25:57.280 --> 01:25:59.620 but remind me what the rating actually is so, 01:25:59.620 --> 01:26:03.580 maybe, I can prioritize the sevens, the eights, the nines, and even the tens, 01:26:03.580 --> 01:26:05.236 if any. 01:26:05.236 --> 01:26:11.760 Any questions about this technique of joining two tables? 01:26:11.760 --> 01:26:15.210 This sort of solves the problem that we created in the world of this sheet, 01:26:15.210 --> 01:26:18.030 where I was just kind of playing around, where I sort of moved 01:26:18.030 --> 01:26:20.280 all the data into its separate locations, which 01:26:20.280 --> 01:26:21.750 is not at all pleasant to use. 01:26:21.750 --> 01:26:25.590 But with SQL, with join, you can still get any of the data you want. 01:26:25.590 --> 01:26:27.546 Yeah. 01:26:27.546 --> 01:26:30.498 [INDISTINCT SPEECH] 01:26:34.450 --> 01:26:35.308 Correct. 01:26:35.308 --> 01:26:37.600 So, yes, I should have called that out more explicitly. 01:26:37.600 --> 01:26:41.472 In my query here, I was using dot notation, which we've seen in Python, 01:26:41.472 --> 01:26:43.930 we've seen in C. It means something similar in spirit here, 01:26:43.930 --> 01:26:45.972 but it has nothing to do with structured objects. 01:26:45.972 --> 01:26:48.650 In this case, it has to do with tables and columns. 01:26:48.650 --> 01:26:53.320 So shows.id just makes clear that I want the ID column from the shows table 01:26:53.320 --> 01:26:57.490 to line up with the show ID column from the ratings table. 01:26:57.490 --> 01:27:01.630 Strictly speaking, I don't need to do that because in this case 01:27:01.630 --> 01:27:03.460 there is no ambiguity. 01:27:03.460 --> 01:27:06.340 One table has a column called ID, the other table 01:27:06.340 --> 01:27:07.810 has a column called show ID. 01:27:07.810 --> 01:27:10.840 So certainly the database can just figure this out for me. 01:27:10.840 --> 01:27:15.730 But, for best practice, and for the sake of being explicit, using the dot 01:27:15.730 --> 01:27:17.950 notation and table names can help, especially 01:27:17.950 --> 01:27:20.270 if there's some common language across them. 01:27:20.270 --> 01:27:20.770 All right. 01:27:20.770 --> 01:27:23.030 Well let's go back to the bigger data set here. 01:27:23.030 --> 01:27:25.060 These are all six tables in IMDb. 01:27:25.060 --> 01:27:28.450 We focused for just a moment there on like shows and ratings alone, 01:27:28.450 --> 01:27:29.620 but what about genres. 01:27:29.620 --> 01:27:33.250 So, genres, like comedy, and documentary, and drama, and so forth. 01:27:33.250 --> 01:27:37.190 Turns out that this actually implements a different type of relationship. 01:27:37.190 --> 01:27:39.530 Previously we saw a one-to-one relationship, 01:27:39.530 --> 01:27:42.080 but it turns out that IMDb supports what's 01:27:42.080 --> 01:27:44.820 called a one-to-many relationship when it comes to genres. 01:27:44.820 --> 01:27:45.320 Why? 01:27:45.320 --> 01:27:48.650 Well, shows like The Office, I do think are generally considered comedy 01:27:48.650 --> 01:27:51.650 and that's it, but there are certainly other TV shows 01:27:51.650 --> 01:27:55.140 that might have multiple genres associated with them. 01:27:55.140 --> 01:27:59.160 Maybe it's comedy and a bit of romance thrown in, like rom-coms, and so forth, 01:27:59.160 --> 01:28:03.980 so you could imagine some shows having two or three or more genres, 01:28:03.980 --> 01:28:07.520 and so one-to-many means that one show can have many genres. 01:28:07.520 --> 01:28:11.450 One-to-one would mean one show can have one rating, as we've seen. 01:28:11.450 --> 01:28:17.330 So why don't we go ahead and focus, maybe, on how about a query like this. 01:28:17.330 --> 01:28:19.577 Let me go back to VS Code here, clear my screen, 01:28:19.577 --> 01:28:21.410 and let's just look at some of those genres. 01:28:21.410 --> 01:28:25.700 Select star from genres and then I'll limit it to 10. 01:28:25.700 --> 01:28:28.590 And, again, I do this just to wrap my mind around a new data set. 01:28:28.590 --> 01:28:30.740 I could look at the schema, but that tends to be more cryptic. 01:28:30.740 --> 01:28:32.310 I just want to look at the raw data. 01:28:32.310 --> 01:28:32.810 OK. 01:28:32.810 --> 01:28:37.227 It looks like here there are a bunch of genres: comedy, adventure, comedy. 01:28:37.227 --> 01:28:38.810 So two comedies, which is interesting. 01:28:38.810 --> 01:28:40.100 Oh, interesting. 01:28:40.100 --> 01:28:42.410 Family, action, sci-fi, family. 01:28:42.410 --> 01:28:45.260 So the values here are duplicated, which it turns out 01:28:45.260 --> 01:28:47.000 is not the best design of IMDb. 01:28:47.000 --> 01:28:49.710 We literally just imported the data as they implement it. 01:28:49.710 --> 01:28:55.190 But notice that show ID 62614 is a comedy, but so is 01:28:55.190 --> 01:29:00.950 show 63881, and so is show 65270. 01:29:00.950 --> 01:29:04.320 So it turns out that in the real world sometimes data is somewhat messy. 01:29:04.320 --> 01:29:06.830 There's duplication of comedy, comedy, comedy, 01:29:06.830 --> 01:29:09.380 but such is the way IMDb's data is. 01:29:09.380 --> 01:29:12.694 But what's more interesting to me, for now, is notice this. 01:29:12.694 --> 01:29:16.530 This show ID in three rows is the same. 01:29:16.530 --> 01:29:19.700 So there's some show out there that's considered by the world 01:29:19.700 --> 01:29:22.257 to be an adventure, a comedy, and a family show. 01:29:22.257 --> 01:29:23.340 So let's see what that is. 01:29:23.340 --> 01:29:26.880 Let me just highlight and copy that value, 63881, and do this. 01:29:26.880 --> 01:29:31.378 Select star from shows where the ID of the show equals that value. 01:29:31.378 --> 01:29:33.170 And it turns out, we saw it briefly before, 01:29:33.170 --> 01:29:37.400 it's a show from the 1970s called Catweazle, which falls into all three 01:29:37.400 --> 01:29:38.880 of those categories. 01:29:38.880 --> 01:29:43.880 So by using a one-to-many relationship, sort of depicted by this picture here, 01:29:43.880 --> 01:29:47.588 you can implement that same idea without having that jagged edge. 01:29:47.588 --> 01:29:50.630 When we looked at the spreadsheet earlier, in an earlier version of this, 01:29:50.630 --> 01:29:55.490 we had star, star, star, which we could do again, genre, genre, genre, 01:29:55.490 --> 01:29:58.460 but now we instead have two separate tables 01:29:58.460 --> 01:30:02.247 where this many-to-many relationship is implemented across. 01:30:02.247 --> 01:30:03.830 So let's actually play around with it. 01:30:03.830 --> 01:30:07.640 Let me go back to VS Code here, and let's actually take 01:30:07.640 --> 01:30:10.520 a look at the schema for genres. 01:30:10.520 --> 01:30:14.030 And we'll see that it's pretty small, as the picture suggests. 01:30:14.030 --> 01:30:14.990 It's called genres. 01:30:14.990 --> 01:30:18.530 Every row has a show ID, which is an integer, cannot be null. 01:30:18.530 --> 01:30:21.200 It has a genre, which is text, cannot be null. 01:30:21.200 --> 01:30:25.190 And that show ID is a foreign key in this table that references an ID 01:30:25.190 --> 01:30:26.690 column in the shows table. 01:30:26.690 --> 01:30:29.240 So very similar in spirit, and so it really 01:30:29.240 --> 01:30:32.090 is just kind of on the honor system that we're only 01:30:32.090 --> 01:30:34.670 putting one row for each show in ratings, 01:30:34.670 --> 01:30:39.150 but 0 or more shows in the genres table for shows, as well. 01:30:39.150 --> 01:30:41.360 So what can we do once we want to tinker with genres? 01:30:41.360 --> 01:30:42.420 Well, let me do this. 01:30:42.420 --> 01:30:49.587 How about we select the show ID from the genres table where the genre is comedy. 01:30:49.587 --> 01:30:51.170 Like, I'm in the mood for some comedy. 01:30:51.170 --> 01:30:54.950 Let's see all of the available comedies, except let's just limit it to 10. 01:30:54.950 --> 01:30:59.150 Here are the show IDs for 10 comedies according to the Internet Movie 01:30:59.150 --> 01:30:59.803 Database. 01:30:59.803 --> 01:31:01.220 Well, that's not very interesting. 01:31:01.220 --> 01:31:03.350 I care about the title, so we can do that. 01:31:03.350 --> 01:31:08.990 Select title from shows where the ID of the show is not equal to, 01:31:08.990 --> 01:31:12.710 but rather, in the following subquery, if you will. 01:31:12.710 --> 01:31:18.170 Select show ID from genres where genre equals, quote unquote, 01:31:18.170 --> 01:31:21.900 "Comedy," limit 10, just to keep things simple. 01:31:21.900 --> 01:31:25.460 So same query as before, but now I'm using it as a nested query 01:31:25.460 --> 01:31:28.730 to select the actual titles whose IDs match those their. 01:31:28.730 --> 01:31:29.540 Enter. 01:31:29.540 --> 01:31:32.540 And there are those titles for 10 comedies, 01:31:32.540 --> 01:31:34.240 maybe a couple of which we've seen. 01:31:34.240 --> 01:31:35.480 Catweazel we've seen before. 01:31:35.480 --> 01:31:36.970 But what if we want to now-- 01:31:36.970 --> 01:31:38.380 let's see. 01:31:38.380 --> 01:31:42.085 Maybe we want to get all of the-- let's flip it around. 01:31:42.085 --> 01:31:43.630 Catweazle keeps coming up. 01:31:43.630 --> 01:31:47.170 Why don't we figure out, with a query, what all of its genres are? 01:31:47.170 --> 01:31:48.110 So Catweazle. 01:31:48.110 --> 01:31:48.610 Let's see. 01:31:48.610 --> 01:31:55.450 So let's do select star from genres limit 10, 01:31:55.450 --> 01:31:57.430 because I know it's in the the first 10. 01:31:57.430 --> 01:31:59.950 This was, 63881 was Catweazle's ID. 01:31:59.950 --> 01:32:00.880 So let's do this. 01:32:00.880 --> 01:32:06.860 So select genre from genres where the show ID equals that value. 01:32:06.860 --> 01:32:07.360 OK. 01:32:07.360 --> 01:32:09.152 So there's the same query as we did before. 01:32:09.152 --> 01:32:10.480 Can we make this dynamic? 01:32:10.480 --> 01:32:11.650 Well, we can, too. 01:32:11.650 --> 01:32:13.870 What if I instead, more dynamically do, select 01:32:13.870 --> 01:32:18.250 genre from genres where the show ID-- 01:32:18.250 --> 01:32:19.180 and not in. 01:32:19.180 --> 01:32:23.020 If I'm looking for a specific show, now I can actually do equals. 01:32:23.020 --> 01:32:25.480 And in my subquery, I could do this, select ID 01:32:25.480 --> 01:32:28.660 from shows where the title of the show equals, 01:32:28.660 --> 01:32:33.160 quote unquote, "Catweazel," semicolon, enter. 01:32:33.160 --> 01:32:35.720 So, again, even though I'm typing these very quickly, 01:32:35.720 --> 01:32:38.630 I'm really just composing similar, smaller ideas 01:32:38.630 --> 01:32:41.060 that we've seen before into larger and larger queries 01:32:41.060 --> 01:32:42.573 to just get at more of this data. 01:32:42.573 --> 01:32:44.490 So what's really going on underneath the hood? 01:32:44.490 --> 01:32:46.160 Well, you can think of it like this. 01:32:46.160 --> 01:32:50.060 If we've got this relationship between shows and genres, 01:32:50.060 --> 01:32:52.580 here's an excerpt from shows, and I didn't bother 01:32:52.580 --> 01:32:54.170 showing the thousands of other shows. 01:32:54.170 --> 01:32:56.900 Here's an excerpt from genres on the right. 01:32:56.900 --> 01:33:00.530 What is that query-- what are we essentially trying to do? 01:33:00.530 --> 01:33:02.570 Well, let me flip this around here. 01:33:02.570 --> 01:33:06.290 Let me highlight the fact that this is the same, this is the same, 01:33:06.290 --> 01:33:07.260 this is the same. 01:33:07.260 --> 01:33:10.040 So wouldn't it be nice if I could kind of get these all together? 01:33:10.040 --> 01:33:11.660 Well, if I join these tables, we're actually 01:33:11.660 --> 01:33:13.370 going to notice an interesting artifact. 01:33:13.370 --> 01:33:16.490 If I join them together, as we did before with ratings, 01:33:16.490 --> 01:33:20.990 I'm going to need to fill in the gap there because this is not a table. 01:33:20.990 --> 01:33:24.860 Tables, by definition, always have the same number of rows and columns. 01:33:24.860 --> 01:33:26.730 You can't have gaps in them like this. 01:33:26.730 --> 01:33:29.820 So the simplest thing to do is just to fill that in this way. 01:33:29.820 --> 01:33:33.170 But if I were to try to combine two tables that 01:33:33.170 --> 01:33:36.320 have this one-to-many relationship, you're 01:33:36.320 --> 01:33:38.090 actually going to get duplication. 01:33:38.090 --> 01:33:40.190 It's not duplication in the original tables, 01:33:40.190 --> 01:33:43.850 but in the temporary tables, otherwise known as a result set, 01:33:43.850 --> 01:33:45.750 that's coming back to us. 01:33:45.750 --> 01:33:47.070 So what do I mean by this? 01:33:47.070 --> 01:33:50.060 Well, if we actually implement this same idea as before, 01:33:50.060 --> 01:33:52.970 where we try to join these two tables, let 01:33:52.970 --> 01:33:55.640 me propose that we do it with this syntax. 01:33:55.640 --> 01:33:59.540 Let me do select star from shows join genres, 01:33:59.540 --> 01:34:04.850 which is just like we did with ratings but now let's join it on shows.id 01:34:04.850 --> 01:34:07.760 equals genres.show_ID. 01:34:07.760 --> 01:34:14.120 But let's just do this for Catweazel, where ID equals 63881, semicolon. 01:34:14.120 --> 01:34:16.250 With the ratings, it worked perfectly because it 01:34:16.250 --> 01:34:19.920 was a one-to-one relationship, so the rows just got wider, if you will. 01:34:19.920 --> 01:34:23.660 But now, because it's a one-to-many relationship, when 01:34:23.660 --> 01:34:26.960 you execute these queries, you are going to get back duplicate data 01:34:26.960 --> 01:34:28.400 but this is OK. 01:34:28.400 --> 01:34:30.740 It's considered OK because this is sort of ephemeral. 01:34:30.740 --> 01:34:32.540 These result sets, these temporary tables, 01:34:32.540 --> 01:34:34.700 exist just for us to look at the data, just 01:34:34.700 --> 01:34:36.440 for us to crunch the numbers somehow. 01:34:36.440 --> 01:34:42.167 It's not actually stored in duplicate in the database itself. 01:34:42.167 --> 01:34:44.000 If I wanted to tighten this further, though, 01:34:44.000 --> 01:34:47.570 let me actually get rid of the star and let me just do title genre. 01:34:47.570 --> 01:34:49.040 And, indeed, we can now see. 01:34:49.040 --> 01:34:52.130 OK, Catweazel three times has three different categories. 01:34:52.130 --> 01:34:54.440 But, generally, we don't even care about that so I 01:34:54.440 --> 01:34:57.830 can even whittle this query down to just selecting genre, 01:34:57.830 --> 01:35:00.500 and that, too, will just give me the result, effectively, hiding 01:35:00.500 --> 01:35:01.550 the duplication. 01:35:01.550 --> 01:35:04.957 But when you join data with a one-to-many relationship 01:35:04.957 --> 01:35:07.790 you're temporarily going to get duplicates, which is actually useful 01:35:07.790 --> 01:35:10.580 because it's very easy then to get at the show's title 01:35:10.580 --> 01:35:14.080 no matter where you are in some loop. 01:35:14.080 --> 01:35:14.620 All right. 01:35:14.620 --> 01:35:15.912 Well, what more can we do here? 01:35:15.912 --> 01:35:19.470 Well let me propose that we revisit the main database, here with six tables, 01:35:19.470 --> 01:35:21.970 and let's look at, perhaps, the juiciest, and the one that's 01:35:21.970 --> 01:35:26.680 really what most people use IMDb for is to look up shows and people therein. 01:35:26.680 --> 01:35:28.300 Let's focus on these three tables. 01:35:28.300 --> 01:35:32.140 And we can infer, from this diagram, that there's now, for the first time, 01:35:32.140 --> 01:35:34.270 three tables involved in a relationship. 01:35:34.270 --> 01:35:38.710 There's people, there's shows, but I've proposed this intermediary stars 01:35:38.710 --> 01:35:40.930 table, much like I temporarily, in Google Sheets, 01:35:40.930 --> 01:35:44.120 gave us a third sheet to link the two together. 01:35:44.120 --> 01:35:46.810 This stars table we're about to see, it's purpose in life 01:35:46.810 --> 01:35:49.990 is to join two other tables together. 01:35:49.990 --> 01:35:54.670 And, in fact, it's only going to have two columns, show ID and person ID. 01:35:54.670 --> 01:35:57.790 So what this is going to do for us is implement this idea, 01:35:57.790 --> 01:35:59.360 many-to-many relationship. 01:35:59.360 --> 01:35:59.860 Why? 01:35:59.860 --> 01:36:02.810 Because any TV show can obviously have many people in it, 01:36:02.810 --> 01:36:06.670 but one person can presumably star in many different shows. 01:36:06.670 --> 01:36:09.680 Like, Steve Carell has been in multiple shows, not just The Office. 01:36:09.680 --> 01:36:11.870 So when you have a many-to-many relationship, 01:36:11.870 --> 01:36:16.655 you actually do need this third table to bridge the two any number of times. 01:36:16.655 --> 01:36:19.280 But it's going to make our life a little more unpleasant to get 01:36:19.280 --> 01:36:22.800 the data we want because it's going to add some additional steps, if you will. 01:36:22.800 --> 01:36:24.300 So let me do this. 01:36:24.300 --> 01:36:28.340 Suppose that I want to get everything I know about The Office. 01:36:28.340 --> 01:36:30.800 Well, let's start with a single query here in VS Code. 01:36:30.800 --> 01:36:35.780 Select star from shows where title equals, quote unquote, "The Office." 01:36:35.780 --> 01:36:37.130 And I should see-- 01:36:37.130 --> 01:36:38.060 Oh, interesting. 01:36:38.060 --> 01:36:41.810 Several attempts at creating a TV show called The Office over the years. 01:36:41.810 --> 01:36:45.740 You can, perhaps, infer the year in which the most popular of them began. 01:36:45.740 --> 01:36:46.910 2005? 01:36:46.910 --> 01:36:49.580 So I presume this is the one we all know and have watched, 01:36:49.580 --> 01:36:54.200 at least in the US, which is this one, 386676, which matches the ID that I 01:36:54.200 --> 01:36:55.770 very carefully used earlier. 01:36:55.770 --> 01:36:58.520 So let me actually be a little more deliberate, 01:36:58.520 --> 01:37:02.690 where the title equals The Office and the year equals 2005. 01:37:02.690 --> 01:37:07.400 That query now gets us The Office that we all in the US and, perhaps, love. 01:37:07.400 --> 01:37:09.830 But now let's actually do something like, get 01:37:09.830 --> 01:37:11.960 all of the people who starred in it, at least 01:37:11.960 --> 01:37:14.650 according to IMDb, whoever had top billing. 01:37:14.650 --> 01:37:16.370 So how can I do this? 01:37:16.370 --> 01:37:20.350 Well, unfortunately, in the shows table there are no people, 01:37:20.350 --> 01:37:24.310 and there's no stars even, but I could do a nested query like this. 01:37:24.310 --> 01:37:32.350 Why don't I select the person ID from the stars table where-- 01:37:32.350 --> 01:37:45.990 whoops, where the-- sorry, where show ID equals, and then in parentheses 01:37:45.990 --> 01:37:48.430 let me do that same query as before and for time's sake, 01:37:48.430 --> 01:37:52.260 I'll just copy paste so that we get back the one and only 01:37:52.260 --> 01:37:53.640 Office in that subquery. 01:37:53.640 --> 01:37:56.710 So what I'm going to do is take an intermediate step, a baby step, 01:37:56.710 --> 01:37:57.240 if you will. 01:37:57.240 --> 01:38:00.992 Right now, I have found in the shows table The Office I care about. 01:38:00.992 --> 01:38:02.700 But if I want to get to the people table, 01:38:02.700 --> 01:38:06.720 I have to take a step through the stars table, this intermediate table, just 01:38:06.720 --> 01:38:09.130 to get anywhere close to the names of those people. 01:38:09.130 --> 01:38:11.070 So what can I get from the stars table? 01:38:11.070 --> 01:38:14.040 Well, why don't I at least select all of the person 01:38:14.040 --> 01:38:18.340 IDs in that table that are somehow associated with the same show ID? 01:38:18.340 --> 01:38:20.340 So in VS Code, what I'm doing is this. 01:38:20.340 --> 01:38:23.340 Select person ID from that intermediate stars 01:38:23.340 --> 01:38:27.690 table, where the show ID in question is whatever the show ID is for The Office. 01:38:27.690 --> 01:38:30.510 I could literally type 386676, but I'm trying 01:38:30.510 --> 01:38:33.630 to do this more dynamically so I've used a nested query instead. 01:38:33.630 --> 01:38:34.890 All right, this is correct. 01:38:34.890 --> 01:38:36.760 Whoops, ironically. 01:38:36.760 --> 01:38:38.085 Where show ID-- 01:38:41.540 --> 01:38:42.440 Oh, thank you. 01:38:42.440 --> 01:38:43.610 Oh, who said that? 01:38:43.610 --> 01:38:44.570 OK. 01:38:44.570 --> 01:38:45.110 Batman. 01:38:45.110 --> 01:38:47.370 I owe Batman Oreos after class. 01:38:47.370 --> 01:38:48.470 Thank you, Batman. 01:38:48.470 --> 01:38:50.330 That's de facto. 01:38:50.330 --> 01:38:50.930 Yes. 01:38:50.930 --> 01:38:52.355 Other good prize for that costume. 01:38:52.355 --> 01:38:53.330 [LAUGHS] 01:38:53.330 --> 01:38:55.160 OK, so let me fix this. 01:38:55.160 --> 01:38:55.910 My apologies. 01:38:55.910 --> 01:38:58.790 So let's go ahead and select person ID from stars 01:38:58.790 --> 01:39:02.390 where show ID equals, and this is where I messed up before. 01:39:02.390 --> 01:39:05.510 I did select star from shows, but I can't 01:39:05.510 --> 01:39:09.410 look for a show ID equaling an entire row of information. 01:39:09.410 --> 01:39:16.160 I instead need to do select ID from shows where the shows title equals, 01:39:16.160 --> 01:39:20.510 quote unquote, "The Office" and the year of that show is 2005. 01:39:20.510 --> 01:39:25.340 And just to call this out, much like in C, I'm quoting strings, 01:39:25.340 --> 01:39:26.780 but I'm not quoting numbers. 01:39:26.780 --> 01:39:28.910 It's not necessary for something like 2005. 01:39:28.910 --> 01:39:33.410 And just to be super clear, too, I have generally adopted a style already today 01:39:33.410 --> 01:39:37.850 of capitalizing any SQL keywords, like select, like from, like where, 01:39:37.850 --> 01:39:40.470 and so forth, and then using lowercase for everything else. 01:39:40.470 --> 01:39:42.180 Strictly speaking, that's not necessary. 01:39:42.180 --> 01:39:44.130 Stylistically, we would encourage you to be 01:39:44.130 --> 01:39:46.592 in the habit of using uppercase for your SQL keywords 01:39:46.592 --> 01:39:48.300 because they just pop more on the screen. 01:39:48.300 --> 01:39:49.883 It sort of makes things more readable. 01:39:49.883 --> 01:39:52.200 But strictly speaking, SQL itself does not care 01:39:52.200 --> 01:39:54.690 about that capitalization of keywords. 01:39:54.690 --> 01:39:56.490 All right, now let me cross my fingers. 01:39:56.490 --> 01:39:59.100 And now I get back this list of person IDs. 01:39:59.100 --> 01:40:00.840 And, again, my goal was to figure out who 01:40:00.840 --> 01:40:03.018 is in The Office that debuted in 2005. 01:40:03.018 --> 01:40:06.060 This is not that interesting because I don't know who any of these people 01:40:06.060 --> 01:40:08.820 are, but here's where we can do one additional step 01:40:08.820 --> 01:40:10.650 and nest a nested query. 01:40:10.650 --> 01:40:15.000 So let me actually select the names from the people table 01:40:15.000 --> 01:40:18.720 where the ID of those people is in. 01:40:18.720 --> 01:40:20.820 And then on a new line, just to make it pretty, 01:40:20.820 --> 01:40:26.770 where the ID equals this query, which I'll paste here, 01:40:26.770 --> 01:40:31.260 and then that equals, and then another line and indent further, 01:40:31.260 --> 01:40:32.602 this query here. 01:40:32.602 --> 01:40:35.310 So just to save time, I'm copying and pasting the previous query, 01:40:35.310 --> 01:40:38.970 but I'm wrapping it with one outermost query, now, that's 01:40:38.970 --> 01:40:42.690 saying select names from people where the ID of those people 01:40:42.690 --> 01:40:47.940 is in this result set, where the ID of those shows is in this result set. 01:40:47.940 --> 01:40:52.170 So the parentheses make clear, ideally, what's happening 01:40:52.170 --> 01:40:53.790 and what order from inside out. 01:40:53.790 --> 01:40:54.540 Enter. 01:40:54.540 --> 01:40:58.860 And there we have it, at least according to IMDb, for the latest season, 01:40:58.860 --> 01:41:04.350 like this is the top-billed stars that are in this here database. 01:41:04.350 --> 01:41:04.860 All right. 01:41:04.860 --> 01:41:06.302 So how can we do something else? 01:41:06.302 --> 01:41:08.260 Well, let me just do it in the other direction. 01:41:08.260 --> 01:41:11.177 Suppose we want to see all of Steve Carell shows, not just The Office. 01:41:11.177 --> 01:41:16.650 Select title from shows where the ID of the show is in. 01:41:16.650 --> 01:41:18.990 And then here, I'm going to do select show 01:41:18.990 --> 01:41:24.150 ID from stars where person ID equals. 01:41:24.150 --> 01:41:26.490 And then here, indenting for clarity, I'm 01:41:26.490 --> 01:41:28.740 going to select Steve Carell's ID by saying, 01:41:28.740 --> 01:41:32.850 select ID from people where the name of that person is, 01:41:32.850 --> 01:41:35.040 quote unquote, "Steve Carell." 01:41:35.040 --> 01:41:36.780 And so in this way, 01:41:36.780 --> 01:41:42.250 I'm writing the SQL query sort of in reverse. 01:41:42.250 --> 01:41:44.102 I'm asking first what I care about. 01:41:44.102 --> 01:41:45.810 But before I can even answer that, I have 01:41:45.810 --> 01:41:48.510 to answer this nested query, what is Steve Carell's ID? 01:41:48.510 --> 01:41:52.320 Once I have that, what are all the show IDs that person ID has been in? 01:41:52.320 --> 01:41:55.740 And then, please tell me what the title of all of those shows is. 01:41:55.740 --> 01:41:57.510 Let me go ahead and cross my fingers. 01:41:57.510 --> 01:41:58.200 And, voila. 01:41:58.200 --> 01:42:00.950 Some of these you might have heard of, some of you might not have, 01:42:00.950 --> 01:42:03.750 but if you were to go on imdb.com and search for Steve Carell, 01:42:03.750 --> 01:42:08.445 you would presumably see this here list of shows that he's been in, 01:42:08.445 --> 01:42:11.410 in some particular order. 01:42:11.410 --> 01:42:15.330 Just to show you two other syntaxes, but let me not emphasize this 01:42:15.330 --> 01:42:17.200 because it will look complicated. 01:42:17.200 --> 01:42:19.500 There are other ways to solve this same problem. 01:42:19.500 --> 01:42:21.660 If you prefer the approach of joining, we 01:42:21.660 --> 01:42:25.720 can actually join not just two, but three tables together. 01:42:25.720 --> 01:42:26.535 But question first. 01:42:26.535 --> 01:42:28.310 [INDISTINCT SPEECH] 01:42:28.310 --> 01:42:29.810 No, just stretching first. 01:42:29.810 --> 01:42:35.120 So two final ways to execute the same idea, but the first of them that I just 01:42:35.120 --> 01:42:37.850 did is arguably, relatively, simpler. 01:42:37.850 --> 01:42:38.910 You could do this. 01:42:38.910 --> 01:42:42.800 I could select the title from the shows table 01:42:42.800 --> 01:42:49.220 by joining it on the stars table on the shows ID column, 01:42:49.220 --> 01:42:52.820 equaling the stars tables show ID column. 01:42:52.820 --> 01:42:58.460 And then I can further join it on the people table, on stars dot person ID 01:42:58.460 --> 01:43:00.453 equaling people.id. 01:43:00.453 --> 01:43:03.620 This is a mouthful, and even I am kind of crossing my fingers that it didn't 01:43:03.620 --> 01:43:06.510 screw up when transcribing it from my printout here, 01:43:06.510 --> 01:43:11.180 but what I'm effectively doing is joining one, two, three tables all 01:43:11.180 --> 01:43:17.570 together by telling the database how to join the shows table, with the stars 01:43:17.570 --> 01:43:19.070 table, and the people table. 01:43:19.070 --> 01:43:22.940 Specifically, the way to bridge that picture, per the diagram, 01:43:22.940 --> 01:43:28.890 is to specify that shows.id should be lined up with stars.show ID. 01:43:28.890 --> 01:43:34.380 And stars.person ID should be lined up with people ID. 01:43:34.380 --> 01:43:35.008 And that's it. 01:43:35.008 --> 01:43:37.050 That essentially allows us to connect these three 01:43:37.050 --> 01:43:39.000 tables with their common fields. 01:43:39.000 --> 01:43:42.390 If I hit enter now, I'm going to get back, somewhat slowly 01:43:42.390 --> 01:43:47.160 actually, a really long list, with some duplication, 01:43:47.160 --> 01:43:49.830 of all of those particular shows. 01:43:49.830 --> 01:43:52.182 Actually, all shows in the database because I 01:43:52.182 --> 01:43:53.640 didn't practice what I'm preaching. 01:43:53.640 --> 01:43:56.140 I wanted to search for just Steve Carell, what you're seeing 01:43:56.140 --> 01:43:58.950 is the entirety of the tens of thousands of TV shows. 01:43:58.950 --> 01:44:00.360 Control C is your friend. 01:44:00.360 --> 01:44:05.070 Let me go ahead and reload SQLite, and let me type that again. 01:44:05.070 --> 01:44:06.690 Let me type that once more, sorry. 01:44:06.690 --> 01:44:13.320 Select title from shows, join stars on shows.id equals stars.show ID, 01:44:13.320 --> 01:44:20.872 join people on stars.person ID equals people ID where-- 01:44:20.872 --> 01:44:22.080 this was the part I left out. 01:44:22.080 --> 01:44:24.990 Name equals Steve Carell. 01:44:24.990 --> 01:44:28.710 And if I didn't screw up by typing so fast-- enter. 01:44:28.710 --> 01:44:31.500 Now we get, a little more slowly than before, 01:44:31.500 --> 01:44:34.435 those same shows that Steve Carell starred in. 01:44:34.435 --> 01:44:36.810 So this is just to say there's another way of doing this. 01:44:36.810 --> 01:44:39.420 But maybe a third way, which is a little simpler than that, 01:44:39.420 --> 01:44:41.250 explicitly joining them in that way. 01:44:41.250 --> 01:44:47.490 You can, alternatively, still do this, select title from shows, stars, 01:44:47.490 --> 01:44:48.270 and people. 01:44:48.270 --> 01:44:51.480 You just literally enumerate, with commas, what three tables 01:44:51.480 --> 01:44:52.920 you want to join somehow. 01:44:52.920 --> 01:44:55.020 And then you can instead of using join, you 01:44:55.020 --> 01:44:58.560 can just use where clauses to make sure they line up properly. 01:44:58.560 --> 01:45:03.630 You can say, where shows.id equals stars.show ID, 01:45:03.630 --> 01:45:09.180 and people ID equals stars.person ID, and name 01:45:09.180 --> 01:45:11.760 equals, quote unquote, "Steve Carell." 01:45:11.760 --> 01:45:14.430 And I realize this is hard to keep track of everything now, 01:45:14.430 --> 01:45:16.353 all these darn different ways to do this, 01:45:16.353 --> 01:45:19.020 this is just to say that there's different approaches to solving 01:45:19.020 --> 01:45:20.728 the same problem and for different people 01:45:20.728 --> 01:45:23.970 you might think about things a little more differently than someone else. 01:45:23.970 --> 01:45:27.630 If I hit enter here, this too it's a little slower than the nested selects 01:45:27.630 --> 01:45:31.590 it seems, but it does, in fact, give us that same answer. 01:45:31.590 --> 01:45:34.440 And just for thoroughness, if I go back to our diagram. 01:45:34.440 --> 01:45:38.700 Besides the tables we've seen, there's actually another writers table 01:45:38.700 --> 01:45:39.480 in there, as well. 01:45:39.480 --> 01:45:42.480 If you're curious to see what writers is, let's just glance at that real 01:45:42.480 --> 01:45:42.980 fast. 01:45:42.980 --> 01:45:47.010 In VS Code, let me do .schema writers, and it's actually almost the same 01:45:47.010 --> 01:45:48.330 as stars. 01:45:48.330 --> 01:45:52.440 Except this case, in writers, we are associating a show 01:45:52.440 --> 01:45:54.750 with a person ID, both of which in this context 01:45:54.750 --> 01:45:58.770 are foreign keys that indeed reference back shows and people ID. 01:45:58.770 --> 01:46:02.520 Which, again, if I do this schema stars, which we didn't see before, 01:46:02.520 --> 01:46:04.120 is structurally the same. 01:46:04.120 --> 01:46:06.540 So the relationship is essentially embodied, in this case, 01:46:06.540 --> 01:46:12.690 by IMDb, and in turn by us, by way of the tables name, TV stars or writers 01:46:12.690 --> 01:46:14.410 thereof. 01:46:14.410 --> 01:46:14.940 All right. 01:46:14.940 --> 01:46:16.000 I know that's a lot. 01:46:16.000 --> 01:46:19.720 Any questions before we take a higher level step back? 01:46:19.720 --> 01:46:20.220 Yeah. 01:46:20.220 --> 01:46:22.992 [INDISTINCT SPEECH] 01:46:25.183 --> 01:46:25.850 A good question. 01:46:25.850 --> 01:46:28.400 Does SQL provide any way to figure out the mapping between tables 01:46:28.400 --> 01:46:29.540 without looking at the database? 01:46:29.540 --> 01:46:30.380 Short answer, no. 01:46:30.380 --> 01:46:32.668 Like, this is the dia-- well, that's not quite fair. 01:46:32.668 --> 01:46:34.460 Depending on the database you're using, you 01:46:34.460 --> 01:46:36.080 might be able to click a button, for instance, 01:46:36.080 --> 01:46:38.997 and get a nice pretty picture like this, that shows the relationships. 01:46:38.997 --> 01:46:40.730 Indeed, we use software to generate this. 01:46:40.730 --> 01:46:42.980 We didn't do this diagram, for instance, by hand. 01:46:42.980 --> 01:46:45.530 SQLite itself does not provide you with that. 01:46:45.530 --> 01:46:48.425 In SQLite the best you can do is run .schema. 01:46:48.425 --> 01:46:50.300 And if you don't specify a table name, you'll 01:46:50.300 --> 01:46:52.120 get everything from the table described. 01:46:52.120 --> 01:46:53.870 Once you get comfortable with SQL, though, 01:46:53.870 --> 01:46:56.240 the idea is that you can read the text and sort of infer what 01:46:56.240 --> 01:46:57.080 the structure is. 01:46:57.080 --> 01:46:58.850 But, yes, there are graphical programs can 01:46:58.850 --> 01:47:00.710 generate prettier pictures like this, but it 01:47:00.710 --> 01:47:02.900 depends on the software you're using. 01:47:02.900 --> 01:47:03.470 Yeah. 01:47:03.470 --> 01:47:06.600 [INDISTINCT SPEECH] 01:47:06.600 --> 01:47:10.590 SQL is not case sensitive with respect to its keywords. 01:47:10.590 --> 01:47:13.920 But table names, and other things that you chose, 01:47:13.920 --> 01:47:16.452 you've got to be consistent with how you capitalize them. 01:47:16.452 --> 01:47:19.410 I've done everything in lowercase, but that tends to be one convention. 01:47:19.410 --> 01:47:23.970 Other people might use camel case, where you alternate caps appropriately. 01:47:23.970 --> 01:47:26.737 All right, so let's take a higher level look at this, 01:47:26.737 --> 01:47:29.820 and also consider some of the actual real world problems that, tragically, 01:47:29.820 --> 01:47:32.200 are still with us in some form today. 01:47:32.200 --> 01:47:34.980 Notice that some of the queries we executed a bit ago 01:47:34.980 --> 01:47:38.440 were actually relatively slow whereas I hit enter and got a lot of my results 01:47:38.440 --> 01:47:38.940 like that. 01:47:38.940 --> 01:47:41.130 Those last two queries, where I was joining 01:47:41.130 --> 01:47:43.500 all of those tables looking for Steve Carell's shows 01:47:43.500 --> 01:47:45.138 were actually relatively slow. 01:47:45.138 --> 01:47:46.680 And let's try to take a simpler case. 01:47:46.680 --> 01:47:47.310 Let me do this. 01:47:47.310 --> 01:47:51.480 In SQLite you can actually time your queries by running .timer and then 01:47:51.480 --> 01:47:52.330 turning it on. 01:47:52.330 --> 01:47:55.620 This is just going to keep track now of how many seconds or milliseconds any 01:47:55.620 --> 01:47:57.787 of your queries take if you're curious to figure out 01:47:57.787 --> 01:47:58.930 what's faster, what's slow. 01:47:58.930 --> 01:48:00.930 Let me do something relatively simple like this, 01:48:00.930 --> 01:48:04.860 select star from shows where the title of the show equals, 01:48:04.860 --> 01:48:07.198 quote unquote, "The Office," semicolon. 01:48:07.198 --> 01:48:08.740 All right, that was pretty darn fast. 01:48:08.740 --> 01:48:13.240 And it took 0.044 seconds in reality. 01:48:13.240 --> 01:48:16.210 If you care further, you can break this time down into user time, 01:48:16.210 --> 01:48:18.708 like spent in my account, versus system time, which means 01:48:18.708 --> 01:48:20.500 spent in the operating system, essentially. 01:48:20.500 --> 01:48:28.120 But we'll focus on the real wall-clock time of 0.044 seconds. 01:48:28.120 --> 01:48:29.290 Pretty darn fast. 01:48:29.290 --> 01:48:32.770 But we can actually do better because it turns out in the world of SQL, 01:48:32.770 --> 01:48:37.690 you can create what are called indexes, which is a fancy way of saying a data 01:48:37.690 --> 01:48:41.170 structure that makes it faster to perform queries, 01:48:41.170 --> 01:48:43.660 like selects, and even other queries as well. 01:48:43.660 --> 01:48:46.180 In an index, you can use syntax like this, 01:48:46.180 --> 01:48:49.420 create index, the name of the index, on a specific table 01:48:49.420 --> 01:48:50.710 on the specific columns. 01:48:50.710 --> 01:48:54.340 And what I mean by this is if you know that your application, like imdb.com 01:48:54.340 --> 01:48:58.480 or their mobile app, is going to search on certain columns frequently, 01:48:58.480 --> 01:49:01.240 you can prepare the database in advance to build up 01:49:01.240 --> 01:49:04.450 some fancy data structures in memory so it can get back answers 01:49:04.450 --> 01:49:06.080 even faster than that. 01:49:06.080 --> 01:49:08.200 So case in point, let me go back to VS Code 01:49:08.200 --> 01:49:12.700 here, and let me create an index called, whatever, title index, 01:49:12.700 --> 01:49:17.180 for instance on the shows table, specifically on the title column. 01:49:17.180 --> 01:49:20.830 So that's simply the syntax for telling the database in advance, 01:49:20.830 --> 01:49:23.350 index this column because I'm going to do a lot of searching 01:49:23.350 --> 01:49:25.210 on it so I want the queries to be fast. 01:49:25.210 --> 01:49:26.020 Enter. 01:49:26.020 --> 01:49:26.920 It took a moment. 01:49:26.920 --> 01:49:30.010 It took almost half a second, but the index I only have to build once. 01:49:30.010 --> 01:49:32.320 Even though, technically, you have to maintain it over time if you're 01:49:32.320 --> 01:49:34.210 doing updates, deletes, and inserts. 01:49:34.210 --> 01:49:36.610 But now let me do the same query as before, 01:49:36.610 --> 01:49:41.530 select star from shows where title equals, quote unquote, "The Office." 01:49:41.530 --> 01:49:45.580 Previously, that query took 0.044 seconds. 01:49:45.580 --> 01:49:47.080 When I hit enter now? 01:49:47.080 --> 01:49:47.890 Boom. 01:49:47.890 --> 01:49:50.770 I mean, it takes no time at all, or less time 01:49:50.770 --> 01:49:53.998 than it's even keeping track of in terms of significant digits. 01:49:53.998 --> 01:49:57.040 Now, that might not seem like a big deal to us humans and our human eyes, 01:49:57.040 --> 01:49:59.620 but if you've got hundreds, thousands, millions of users, 01:49:59.620 --> 01:50:03.580 as maybe the real imdb.com has, you just saved yourself 01:50:03.580 --> 01:50:05.900 a fortune in servers and complexity. 01:50:05.900 --> 01:50:06.400 Why? 01:50:06.400 --> 01:50:08.320 Because the same server can, clearly, now 01:50:08.320 --> 01:50:12.400 handle way more people per unit of time, per second, because each query 01:50:12.400 --> 01:50:14.027 takes less and less time. 01:50:14.027 --> 01:50:16.360 I mean, we're all too familiar here and at Yale, surely, 01:50:16.360 --> 01:50:19.087 with certain University applications that are just so darn slow. 01:50:19.087 --> 01:50:20.920 When you click a button and the stupid thing 01:50:20.920 --> 01:50:23.740 spins and makes you wait and wait, a lot of the time that 01:50:23.740 --> 01:50:27.820 can be explained by poor database design, or databases that might not 01:50:27.820 --> 01:50:29.200 have been indexed properly. 01:50:29.200 --> 01:50:31.492 So when you're searching for some course, for instance, 01:50:31.492 --> 01:50:33.310 in the course catalog, it's taking forever 01:50:33.310 --> 01:50:35.170 because underneath the hood it's essentially 01:50:35.170 --> 01:50:37.730 doing linear search over everything. 01:50:37.730 --> 01:50:40.030 But, by contrast, in a relational database, 01:50:40.030 --> 01:50:42.670 when you create an index in advance because you 01:50:42.670 --> 01:50:45.640 have a hunch that maybe users are going to search on that column, 01:50:45.640 --> 01:50:47.560 like show titles. 01:50:47.560 --> 01:50:49.270 Essentially, you're building up in memory 01:50:49.270 --> 01:50:51.700 what's called a b-tree, which is not a binary tree. 01:50:51.700 --> 01:50:54.370 It's still a tree, though, if you think back to week five. 01:50:54.370 --> 01:50:57.820 But it's a very short, fat tree instead, where 01:50:57.820 --> 01:51:02.650 every node might have two, or three, or 30 children, which essentially 01:51:02.650 --> 01:51:04.820 pulls the height of the tree way up. 01:51:04.820 --> 01:51:07.990 Which is to say that when you search for some value in a b-tree, 01:51:07.990 --> 01:51:10.040 it's invariably going to be in the leaves. 01:51:10.040 --> 01:51:13.870 So the shorter the tree is, the fewer steps it takes to find the value 01:51:13.870 --> 01:51:14.750 you care about. 01:51:14.750 --> 01:51:17.860 So when you run create table, that kind of data structure 01:51:17.860 --> 01:51:21.680 is being magically created for you by the database 01:51:21.680 --> 01:51:25.540 so it's not a simple linear search through the entire column, 01:51:25.540 --> 01:51:27.350 top to bottom. 01:51:27.350 --> 01:51:31.330 So with that said, we can see this really with more complicated queries. 01:51:31.330 --> 01:51:33.260 And let me go back to VS Code here. 01:51:33.260 --> 01:51:36.853 Let me propose to run that same, slow query before, 01:51:36.853 --> 01:51:39.520 even though it's fine if you're not comfortable with the syntax. 01:51:39.520 --> 01:51:41.710 It was relatively slow, though, to watch. 01:51:41.710 --> 01:51:46.360 So let's do select title from shows stars and people 01:51:46.360 --> 01:51:56.500 where shows.id equals stars.show ID, and people.id equals stars.dot person ID, 01:51:56.500 --> 01:51:59.050 and name equals Steve Carell. 01:51:59.050 --> 01:52:00.820 So this was the last of those queries. 01:52:00.820 --> 01:52:03.280 That just searches for all of Steve Carell's TV shows 01:52:03.280 --> 01:52:06.430 without using joins explicitly but by just enumerating 01:52:06.430 --> 01:52:10.210 all three tables in question, and then using where to cleverly connect 01:52:10.210 --> 01:52:11.390 the dots, so to speak. 01:52:11.390 --> 01:52:12.770 But my timer is still on. 01:52:12.770 --> 01:52:16.390 So notice now, when I hit enter, it doesn't just feel slow. 01:52:16.390 --> 01:52:19.390 It actually took 2.763 seconds. 01:52:19.390 --> 01:52:20.410 Like, that's slow. 01:52:20.410 --> 01:52:21.250 That's expensive. 01:52:21.250 --> 01:52:22.650 That's going to annoy your users. 01:52:22.650 --> 01:52:25.900 That's going to annoy your students if the database is thinking, and thinking, 01:52:25.900 --> 01:52:27.730 and thinking, and taking that much time. 01:52:27.730 --> 01:52:29.770 But let's note this. 01:52:29.770 --> 01:52:34.080 That same query I just executed touched a bunch of columns, 01:52:34.080 --> 01:52:35.830 and it turns out that whenever you declare 01:52:35.830 --> 01:52:40.330 a primary key in a database, at least in SQLite, you get an index for free. 01:52:40.330 --> 01:52:43.060 Primary keys are automatically indexed, so searching 01:52:43.060 --> 01:52:44.320 for a number in that column? 01:52:44.320 --> 01:52:45.040 Super fast. 01:52:45.040 --> 01:52:48.580 Not linear search, it's something logarithmic, most likely, or, ideally, 01:52:48.580 --> 01:52:51.100 closer to something like constant time even. 01:52:51.100 --> 01:52:54.940 Here, though, I'm touching not just shows.id, 01:52:54.940 --> 01:52:58.840 but I'm also filtering on stars.show ID, so a foreign key. 01:52:58.840 --> 01:53:01.000 Foreign keys are not indexed by default. 01:53:01.000 --> 01:53:02.260 I'm looking at people ID. 01:53:02.260 --> 01:53:03.190 That's a primary key. 01:53:03.190 --> 01:53:03.880 That's indexed. 01:53:03.880 --> 01:53:08.200 But stars.person ID, not indexed by default, is a foreign key. 01:53:08.200 --> 01:53:11.890 Lastly, I'm filtering by name in the people table. 01:53:11.890 --> 01:53:17.140 Names are not indexed by default. So I'm touching three separate columns, 01:53:17.140 --> 01:53:20.980 two foreign keys, one name field, that have no fancy tree 01:53:20.980 --> 01:53:22.570 structure built for them. 01:53:22.570 --> 01:53:23.900 But I can do that. 01:53:23.900 --> 01:53:27.190 Let me go down to my terminal here. 01:53:27.190 --> 01:53:30.550 Let me create one index called, say, person index, though, I 01:53:30.550 --> 01:53:35.750 could call it anything I want, on the stars table on the person ID column. 01:53:35.750 --> 01:53:38.140 So that indexes that foreign key. 01:53:38.140 --> 01:53:41.050 Took 1.7 seconds, but I only have to do it once. 01:53:41.050 --> 01:53:46.720 Create index, show index on the stars table show ID. 01:53:46.720 --> 01:53:49.210 So another foreign key is getting its own index. 01:53:49.210 --> 01:53:51.490 Took 1.4 seconds, but it's a one time thing. 01:53:51.490 --> 01:53:54.070 And, lastly, let's index all of those actors names. 01:53:54.070 --> 01:53:59.770 Create index called name index on the people table on the name column. 01:53:59.770 --> 01:54:00.460 Enter. 01:54:00.460 --> 01:54:04.520 That took 1.0 seconds, but a one time operation. 01:54:04.520 --> 01:54:07.850 So, essentially, I've built up like three of these trees in memory 01:54:07.850 --> 01:54:09.920 now, specifically for these columns. 01:54:09.920 --> 01:54:15.160 So now recall, previously, that slow query. 01:54:15.160 --> 01:54:18.610 If I scroll back up, that took, what was it? 01:54:18.610 --> 01:54:20.570 2.7 seconds? 01:54:20.570 --> 01:54:21.070 I think. 01:54:21.070 --> 01:54:23.330 2.7 seconds, previously. 01:54:23.330 --> 01:54:29.080 But if I now run the same thing, select title from show 01:54:29.080 --> 01:54:33.850 stars people where shows ID equals stars.show show ID, 01:54:33.850 --> 01:54:37.240 and people ID equals stars.person person ID, 01:54:37.240 --> 01:54:42.730 and name equals Steve, so close, Carell. 01:54:42.730 --> 01:54:47.320 Same query as before previously took 2.7 seconds, which was the most annoying 01:54:47.320 --> 01:54:47.890 of them yet. 01:54:47.890 --> 01:54:49.420 Now, when I hit enter? 01:54:49.420 --> 01:54:50.230 Boom. 01:54:50.230 --> 01:54:55.240 0.001 seconds, which is the difference, again, between bunches 01:54:55.240 --> 01:55:00.970 of linear searches and, in this case, searching a fancier, week-five style, 01:55:00.970 --> 01:55:02.650 b-tree, in this case. 01:55:02.650 --> 01:55:04.010 So indexes matter. 01:55:04.010 --> 01:55:07.120 So what's then, maybe, the trade off here? 01:55:07.120 --> 01:55:09.680 Like, why not index every column in every table? 01:55:09.680 --> 01:55:10.930 Because this is feeling great. 01:55:10.930 --> 01:55:14.110 Like, we're speeding things up by factors of 1,000, practically. 01:55:14.110 --> 01:55:15.190 What's the trade off? 01:55:15.190 --> 01:55:16.240 [INDISTINCT SPEECH] 01:55:16.240 --> 01:55:18.010 Lots and lots of memory, or space. 01:55:18.010 --> 01:55:20.200 Yeah, so you're just trading off space for time, 01:55:20.200 --> 01:55:23.200 which we said a couple of weeks ago is an acceptable trade off depending 01:55:23.200 --> 01:55:24.610 on what resources you have. 01:55:24.610 --> 01:55:28.160 But it's probably an overcorrection to index everything, 01:55:28.160 --> 01:55:31.445 especially since it will slightly slow down inserts, updates, and deletes 01:55:31.445 --> 01:55:34.570 because you have to maintain this tree structure so it doesn't devolve back 01:55:34.570 --> 01:55:37.600 into a linked list or something linear. 01:55:37.600 --> 01:55:43.900 But, in fact, being selective about it is perhaps the best strategy. 01:55:43.900 --> 01:55:44.440 All right. 01:55:44.440 --> 01:55:48.250 So that we can now solve some other problems more generally, 01:55:48.250 --> 01:55:49.780 let me just connect two dots. 01:55:49.780 --> 01:55:53.740 Even though we focused today on SQL, specifically for databases, 01:55:53.740 --> 01:55:56.590 you can actually combine one language with another 01:55:56.590 --> 01:55:58.010 and solve different problems. 01:55:58.010 --> 01:55:59.420 And so, in fact, let me do this. 01:55:59.420 --> 01:56:02.530 Let me revisit our favorites.py from earlier, 01:56:02.530 --> 01:56:07.610 but let me actually now use the favorites database as follows. 01:56:07.610 --> 01:56:09.980 Let me go into VS Code here. 01:56:09.980 --> 01:56:16.810 Let me remove favorites.db because if you recall it made everyone's count 01:56:16.810 --> 01:56:18.760 previously-- 01:56:18.760 --> 01:56:19.510 became Fiftyville. 01:56:19.510 --> 01:56:21.640 So let me remove that file. 01:56:21.640 --> 01:56:26.140 Let me run SQLite3 on favorites.db again. 01:56:26.140 --> 01:56:27.940 Let me create the file anew. 01:56:27.940 --> 01:56:30.490 Let me set the mode to CSV again. 01:56:30.490 --> 01:56:34.630 Let me import that file called favorites.csv into an identical table 01:56:34.630 --> 01:56:36.160 as before called favorites. 01:56:36.160 --> 01:56:37.150 And then quit. 01:56:37.150 --> 01:56:40.970 So I've just reset things to my backup, if you will, from the CSV file. 01:56:40.970 --> 01:56:43.300 So, I again have a favorites.db. 01:56:43.300 --> 01:56:48.033 Let me now minimize my terminal window here and reopen favorites.py, 01:56:48.033 --> 01:56:50.200 and let me just go ahead and get rid of that version 01:56:50.200 --> 01:56:53.830 entirely and focus this time on not talking to a CSV file, 01:56:53.830 --> 01:56:55.780 opening it, iterating over the rows. 01:56:55.780 --> 01:56:58.930 We can actually use Python to execute SQL queries 01:56:58.930 --> 01:57:00.950 and kind of get the best of both worlds. 01:57:00.950 --> 01:57:03.160 So let me do this here. 01:57:03.160 --> 01:57:09.580 Let me, from the CS50 Python library, import our own SQL functionality. 01:57:09.580 --> 01:57:12.052 And this is a training wheel we still provide for SQL 01:57:12.052 --> 01:57:14.260 because it's just much easier than using the industry 01:57:14.260 --> 01:57:15.760 standard libraries for SQL. 01:57:15.760 --> 01:57:18.540 It just is painful for simple tasks. 01:57:18.540 --> 01:57:21.380 So now, let me create a variable called db, for database. 01:57:21.380 --> 01:57:24.410 I'm going to set it equal to this SQL function that CS50 wrote. 01:57:24.410 --> 01:57:28.730 And this is going to look weird, but the way you open a DB file in Python, 01:57:28.730 --> 01:57:31.580 whether it's with CS50 library or someone else's, you 01:57:31.580 --> 01:57:36.560 say SQLite colon slash, slash, slash favorites.db. 01:57:36.560 --> 01:57:38.780 So weird syntax, but it's commonplace. 01:57:38.780 --> 01:57:41.900 Three slashes, not two like a URL is usually. 01:57:41.900 --> 01:57:44.300 Now let's use a variable called favorite and set 01:57:44.300 --> 01:57:48.020 it equal to the return value of input by asking the human for their favorite TV 01:57:48.020 --> 01:57:48.650 show. 01:57:48.650 --> 01:57:51.860 And now, previously we opened up the CSV file, 01:57:51.860 --> 01:57:56.270 iterated over it looking for the show they typed in, show they typed in. 01:57:56.270 --> 01:57:59.060 Or, rather-- Oh, sorry, not the show. 01:57:59.060 --> 01:58:01.610 The problem, or the problem that we typed in. 01:58:01.610 --> 01:58:03.150 Let me instead do this. 01:58:03.150 --> 01:58:07.290 Let me set, use a db variables, execute function, 01:58:07.290 --> 01:58:11.360 which comes with the CS50 library, and let me execute this SQL query. 01:58:11.360 --> 01:58:20.250 Select count star as n from favorites where problem equals question mark. 01:58:20.250 --> 01:58:23.000 And the question mark is a little weird, but think of it, for now, 01:58:23.000 --> 01:58:24.902 like C's percent s. 01:58:24.902 --> 01:58:26.750 Comma favorite. 01:58:26.750 --> 01:58:29.840 I'm going to plug in whatever the human typed in into that query where 01:58:29.840 --> 01:58:30.920 the question mark is. 01:58:30.920 --> 01:58:34.340 So no percent s, I'm using a question mark in this world of SQL. 01:58:34.340 --> 01:58:36.883 This is going to give me back a temporary table, 01:58:36.883 --> 01:58:39.800 and I'm going to store that temporary table in a variable called rows. 01:58:39.800 --> 01:58:42.840 Because the temporary table is essentially zero or more rows, 01:58:42.840 --> 01:58:44.660 so I'm going to name my variable rows. 01:58:44.660 --> 01:58:48.170 And then if I want to get back the first and only row, 01:58:48.170 --> 01:58:51.860 I can literally do row equals rows bracket zero just to be pedantic, 01:58:51.860 --> 01:58:55.970 just to get, literally, the first row in that temporary table, or result set. 01:58:55.970 --> 01:58:59.810 And now if I want to print back the n column therein, 01:58:59.810 --> 01:59:03.590 I can do print row, quote unquote, "n." 01:59:03.590 --> 01:59:06.420 So let me take a step back and do this. 01:59:06.420 --> 01:59:12.710 Let me go into SQLite3 of favorites.db and let me literally type something 01:59:12.710 --> 01:59:13.580 like this. 01:59:13.580 --> 01:59:16.640 Here I'm in SQLite3 at the bottom and the problem I'm searching, 01:59:16.640 --> 01:59:19.520 for instance, is Scratch semicolon. 01:59:19.520 --> 01:59:23.480 Notice that is how in SQL, in the command line of SQLite3, 01:59:23.480 --> 01:59:24.960 I can get back the answer I want. 01:59:24.960 --> 01:59:26.960 But what if I want to make a Python program that 01:59:26.960 --> 01:59:29.480 queries the database for that value and then prints something out? 01:59:29.480 --> 01:59:31.790 In two weeks time, what if I want to do that to make 01:59:31.790 --> 01:59:35.300 a web application, like imdb.com or a mobile app, 01:59:35.300 --> 01:59:37.490 that writes code to generate a user interface 01:59:37.490 --> 01:59:39.320 but that pulls the data from a database? 01:59:39.320 --> 01:59:42.170 You're not going to have your human users using SQLite3, 01:59:42.170 --> 01:59:44.280 you're going to generate the output for them. 01:59:44.280 --> 01:59:46.790 So let me close my terminal window here. 01:59:46.790 --> 01:59:49.190 Rather, let me close out of SQLite 3. 01:59:49.190 --> 01:59:53.840 Let me now run Python of favorites.py, enter. 01:59:53.840 --> 01:59:56.450 I'm prompted for my favorite using the input function. 01:59:56.450 --> 02:00:00.290 I type in Scratch and hit enter, and there's my 34. 02:00:00.290 --> 02:00:02.650 So this is an incredibly common practice, 02:00:02.650 --> 02:00:04.400 to use one language for what it's best at. 02:00:04.400 --> 02:00:07.430 Like SQL is best at reading data from databases, 02:00:07.430 --> 02:00:11.060 Python is, maybe, best in this case for creating a user interface 02:00:11.060 --> 02:00:13.040 or, eventually, making a web application. 02:00:13.040 --> 02:00:16.560 But it's certainly fine to use one inside of the other. 02:00:16.560 --> 02:00:19.740 The documentation for this library, if and when you get curious, 02:00:19.740 --> 02:00:23.520 is it this URL here, along with a lot of CS50's own documentation. 02:00:23.520 --> 02:00:26.340 But there are some problems, nonetheless, we 02:00:26.340 --> 02:00:29.760 might encounter in this world, and we thought we'd end on these challenges. 02:00:29.760 --> 02:00:33.420 It turns out that in the world of SQL, even though we haven't touched 02:00:33.420 --> 02:00:36.638 upon this yet, you're generally working not with like hundreds 02:00:36.638 --> 02:00:39.180 of people and their favorite languages and problems, not even 02:00:39.180 --> 02:00:42.600 thousands of movies, but like millions of things in the database, 02:00:42.600 --> 02:00:45.180 like Instagram posts or TikTok videos or the like. 02:00:45.180 --> 02:00:48.510 Those are huge databases with millions of rows. 02:00:48.510 --> 02:00:51.510 The problem with SQL and, really, databases in general, 02:00:51.510 --> 02:00:54.030 is if you have thousands, millions of things 02:00:54.030 --> 02:00:57.765 happening at once, things can get out of order and your math can be wrong. 02:00:57.765 --> 02:00:59.890 You can lose track of how many likes something has. 02:00:59.890 --> 02:01:03.870 And so, for instance, as of last night this remains the most popular Instagram 02:01:03.870 --> 02:01:06.000 post, to date, in the world. 02:01:06.000 --> 02:01:08.368 It was clicked on by so many people. 02:01:08.368 --> 02:01:10.410 And you might think that's pretty straightforward 02:01:10.410 --> 02:01:13.230 to keep track click, click, click, but not 02:01:13.230 --> 02:01:16.870 when there's millions of devices in the world and thousands of servers, 02:01:16.870 --> 02:01:20.000 probably, at Meta running Instagram's back end. 02:01:20.000 --> 02:01:23.290 So how do you actually keep track of all of these likes? 02:01:23.290 --> 02:01:26.980 Well, maybe, Meta is using code like this 02:01:26.980 --> 02:01:30.370 to implement the counter for likes on Instagram. 02:01:30.370 --> 02:01:33.160 Maybe they are using lines of code similar to what we just 02:01:33.160 --> 02:01:37.510 wrote db.execute select the current number of likes 02:01:37.510 --> 02:01:40.750 from the posts table where the ID of the post 02:01:40.750 --> 02:01:43.760 equals whatever the one is that the user clicked on. 02:01:43.760 --> 02:01:48.160 So the post a moment ago, presumably, has a unique ID, a primary key, 02:01:48.160 --> 02:01:51.160 and that number just gets plugged in here when Meta wants to figure out, 02:01:51.160 --> 02:01:53.140 all right, someone just clicked on this post. 02:01:53.140 --> 02:01:56.320 Let's figure out what the current number of likes so we can add 1 to it 02:01:56.320 --> 02:01:57.890 and update the database. 02:01:57.890 --> 02:02:00.910 So this query here gives us a temporary table 02:02:00.910 --> 02:02:04.360 containing the current number of likes before you or someone else clicked. 02:02:04.360 --> 02:02:06.820 Maybe then we just declare a variable called 02:02:06.820 --> 02:02:09.772 likes to get at the first row's likes column. 02:02:09.772 --> 02:02:11.980 So this is just Python syntax, similar to what I did, 02:02:11.980 --> 02:02:15.640 just to get the actual number you care about, like 34 or whatever million 02:02:15.640 --> 02:02:16.480 it is here. 02:02:16.480 --> 02:02:20.650 But then suppose there's a second database query and a third line of code 02:02:20.650 --> 02:02:26.200 that updates the posts table setting the likes equal to this value 02:02:26.200 --> 02:02:29.030 where the ID of the post is this value. 02:02:29.030 --> 02:02:32.870 So these question marks are similar, again, in spirit to printf's percent s. 02:02:32.870 --> 02:02:34.810 They're placeholders for things that are going 02:02:34.810 --> 02:02:37.070 to be plugged in after the commas. 02:02:37.070 --> 02:02:38.860 So if I want to update the number of likes 02:02:38.860 --> 02:02:41.080 to be whatever the current number is plus one, 02:02:41.080 --> 02:02:45.100 I put it there as a second argument and then I plug-in the ID of that post. 02:02:45.100 --> 02:02:49.420 The problem, though, with large systems like the Metas, the Googles, 02:02:49.420 --> 02:02:51.640 the Microsofts, and others of the world, is 02:02:51.640 --> 02:02:54.880 that they are executing code like this on multiple servers, 02:02:54.880 --> 02:02:58.930 thousands of servers, that might be executing slightly out of order. 02:02:58.930 --> 02:03:01.070 One might be faster, one might be slower. 02:03:01.070 --> 02:03:04.990 Which is to say, even though these three lines of code 02:03:04.990 --> 02:03:07.240 represent what should happen when I click on that post 02:03:07.240 --> 02:03:09.670 and you click on that post and you click on that post, 02:03:09.670 --> 02:03:13.600 the lines of code chronologically might get shuffled a little bit. 02:03:13.600 --> 02:03:16.090 Like, maybe, this line of code gets executed for me 02:03:16.090 --> 02:03:18.100 and then it gets executed for you and then you, 02:03:18.100 --> 02:03:20.900 and then the server moves on to the next line of code. 02:03:20.900 --> 02:03:23.890 So it's sort of multitasking, handling lots of users at once. 02:03:23.890 --> 02:03:27.940 The problem with this is that you run into a race condition of sorts, 02:03:27.940 --> 02:03:30.820 where the servers are sort of racing to handle one user 02:03:30.820 --> 02:03:34.580 but other users requests are happening at the same time. 02:03:34.580 --> 02:03:39.220 So the analogy that I was taught years ago in an operating systems 02:03:39.220 --> 02:03:41.860 class actually pertains to something like a refrigerator here. 02:03:41.860 --> 02:03:43.360 So we have a mini refrigerator here. 02:03:43.360 --> 02:03:45.850 Suppose you've got one in your dorm or your house room, 02:03:45.850 --> 02:03:48.340 and you come home one day. 02:03:48.340 --> 02:03:51.860 And you really like milk, so you open the fridge and you look inside, 02:03:51.860 --> 02:03:52.480 and there's-- 02:03:52.480 --> 02:03:53.590 Oh, we're out of milk. 02:03:53.590 --> 02:03:57.160 So you close the fridge, you walk out to CVS, or somewhere else, 02:03:57.160 --> 02:03:58.600 and go to get more milk. 02:03:58.600 --> 02:04:01.240 Meanwhile, though, your roommate comes home, who in this story 02:04:01.240 --> 02:04:03.850 also likes milk, and so they decide that-- 02:04:03.850 --> 02:04:05.620 Oh, I'm out of milk in the fridge. 02:04:05.620 --> 02:04:07.120 So they, maybe, head out. 02:04:07.120 --> 02:04:10.660 Maybe they follow a different path to go get more milk as well. 02:04:10.660 --> 02:04:13.330 Some number of minutes pass, you both come home later on. 02:04:13.330 --> 02:04:14.080 Like, Oh, darn it. 02:04:14.080 --> 02:04:15.830 Now we have twice as much milk as we need. 02:04:15.830 --> 02:04:17.247 We don't really like it that much. 02:04:17.247 --> 02:04:19.720 And some of it's going to go sour now, so it's wasted like. 02:04:19.720 --> 02:04:20.720 We've made a mistake. 02:04:20.720 --> 02:04:22.780 We should not have bought twice as much milk. 02:04:22.780 --> 02:04:27.220 Now, stupid story, but the point is that both of you 02:04:27.220 --> 02:04:30.250 made decisions based on the state of a variable. 02:04:30.250 --> 02:04:33.790 But the problem was that variable was in the process of being 02:04:33.790 --> 02:04:36.430 updated when someone else looked at it. 02:04:36.430 --> 02:04:39.470 The first person in the story was on their way to the store 02:04:39.470 --> 02:04:41.560 so the variable was about to be incremented 02:04:41.560 --> 02:04:43.840 in terms of quantity of milk, but the other person 02:04:43.840 --> 02:04:46.430 didn't know that yet so they too tried to increment it. 02:04:46.430 --> 02:04:48.760 And, in that case, we ended up with too much milk. 02:04:48.760 --> 02:04:52.450 But suppose what might happen here is similar in spirit. 02:04:52.450 --> 02:04:56.560 Suppose that post, at some point in time, had just 1,000,000 likes. 02:04:56.560 --> 02:05:00.310 And suppose this line of code got executed for me, for you, and for you 02:05:00.310 --> 02:05:02.050 after all three of us clicked on it. 02:05:02.050 --> 02:05:05.530 Well, the value of our likes variable in Meta servers 02:05:05.530 --> 02:05:08.020 might be 1,000,000, 1,000,000, and 1,000,000. 02:05:08.020 --> 02:05:11.470 They therefore update 1,000,000 to be 1,000,000 plus one. 02:05:11.470 --> 02:05:14.890 And so what they update the database to be is 1,000,001, 02:05:14.890 --> 02:05:16.080 but they do it three times. 02:05:16.080 --> 02:05:18.690 1,000,001, 1,000,001, 1,000,001. 02:05:18.690 --> 02:05:21.900 But they've lost two of those likes because they 02:05:21.900 --> 02:05:26.040 might have inspected the variable while some other server, some other user's 02:05:26.040 --> 02:05:28.510 like, was being processed. 02:05:28.510 --> 02:05:31.350 So long story short, when you have lots of data 02:05:31.350 --> 02:05:34.365 on lots of servers, all of which is happening very quickly, 02:05:34.365 --> 02:05:36.240 you run into these so called race conditions. 02:05:36.240 --> 02:05:39.750 And code like this can be dangerous, even though it might not 02:05:39.750 --> 02:05:41.460 look incorrect at a glance. 02:05:41.460 --> 02:05:44.280 Thankfully, in the world of SQL, though you won't generally 02:05:44.280 --> 02:05:47.700 have to do this certainly for problem sets sake, there are solutions to this. 02:05:47.700 --> 02:05:51.210 But too many engineers in the world don't know this, don't remember this, 02:05:51.210 --> 02:05:53.580 or don't appreciate this reality. 02:05:53.580 --> 02:05:55.920 There are keywords in certain databases that 02:05:55.920 --> 02:05:59.610 let you instead begin a transaction, which means, essentially, 02:05:59.610 --> 02:06:02.970 that these three lines of code should either all 02:06:02.970 --> 02:06:05.370 happen together or not at all. 02:06:05.370 --> 02:06:06.660 They should be atomic. 02:06:06.660 --> 02:06:09.390 That is to say, they should all happen without interruption 02:06:09.390 --> 02:06:11.340 or they just shouldn't happen at all. 02:06:11.340 --> 02:06:16.000 And that ensures that the math does not go wrong 02:06:16.000 --> 02:06:18.820 because my like will get counted, and then you're like, 02:06:18.820 --> 02:06:22.390 and then you're like, as opposed to them being intermingled and lost 02:06:22.390 --> 02:06:23.630 track of accordingly. 02:06:23.630 --> 02:06:26.590 So in Python, using the CS50 library, you 02:06:26.590 --> 02:06:30.280 could wrap these three lines of code by saying begin transaction 02:06:30.280 --> 02:06:33.580 to the database, commit the transaction to the database, 02:06:33.580 --> 02:06:37.867 and that relatively simple solution avoids this problem of race conditions. 02:06:37.867 --> 02:06:38.950 But this, too, is a topic. 02:06:38.950 --> 02:06:41.020 If you Google, invariably you'll see that this 02:06:41.020 --> 02:06:46.000 is a problem that has hit various servers and apps over time. 02:06:46.000 --> 02:06:49.660 But there's one other problem that the world is still not very good at, 02:06:49.660 --> 02:06:52.060 and that's known as a SQL injection attack. 02:06:52.060 --> 02:06:55.690 And it turns out that even with what we've been doing, even code like this 02:06:55.690 --> 02:07:00.160 here, is all too easily vulnerable to being hacked, if you will. 02:07:00.160 --> 02:07:04.670 That is misused in some way, unless you practice what I'm preaching, 02:07:04.670 --> 02:07:07.070 which is using placeholders like this. 02:07:07.070 --> 02:07:10.570 It turns out that it's very dangerous to take user input in, generally. 02:07:10.570 --> 02:07:13.150 Like, most of your users might be nice, good people, 02:07:13.150 --> 02:07:16.330 but there's always going to be someone who's malicious or curious 02:07:16.330 --> 02:07:19.670 or just execute something you don't expect, and things can go wrong. 02:07:19.670 --> 02:07:21.940 So in the world of SQL, here's what can happen. 02:07:21.940 --> 02:07:24.890 For instance, here is the login screen for Yale accounts. 02:07:24.890 --> 02:07:27.220 Here's the comparable screen for Harvard accounts. 02:07:27.220 --> 02:07:29.803 And all of us are in their habit of using one of these screens 02:07:29.803 --> 02:07:31.150 or something similar or another. 02:07:31.150 --> 02:07:34.450 You're often asked for a username or email address and a password. 02:07:34.450 --> 02:07:38.950 But suppose that Harvard or Yale or Google or Microsoft or wherever, 02:07:38.950 --> 02:07:42.280 are taking user input from forms like this, be it on the web 02:07:42.280 --> 02:07:46.540 or on a mobile app, and they're just plugging your input into a SQL query 02:07:46.540 --> 02:07:50.320 that they wrote in advance that just is waiting for your username or password 02:07:50.320 --> 02:07:53.980 to be plugged in so they can ask a complete query of the database. 02:07:53.980 --> 02:07:57.850 It turns out that if I'm a bit malicious, or curious, 02:07:57.850 --> 02:08:01.810 I could, maybe, type in some funky syntax to fields like this 02:08:01.810 --> 02:08:04.570 that I know have special meaning to certain databases. 02:08:04.570 --> 02:08:09.490 And it turns out in the world of SQL, SQLite in particular, single quotes 02:08:09.490 --> 02:08:10.570 are clearly important. 02:08:10.570 --> 02:08:12.850 Because I've been using them all day long for strings, 02:08:12.850 --> 02:08:16.745 and I didn't mention this yet, but the comment character in SQLite is dash, 02:08:16.745 --> 02:08:17.245 dash. 02:08:17.245 --> 02:08:20.203 If you want the rest of a line to be ignored, you just say, dash, dash. 02:08:20.203 --> 02:08:23.605 So it's not hash, it's not slash, slash, like in Python and C, respectively. 02:08:23.605 --> 02:08:26.210 It's dash, dash, or two hyphens. 02:08:26.210 --> 02:08:28.887 So suppose that I'm a hacker or a curious student, 02:08:28.887 --> 02:08:30.970 and I want to see if Harvard knows what it's doing 02:08:30.970 --> 02:08:32.680 when it comes to SQL injection attacks. 02:08:32.680 --> 02:08:36.347 I could literally type in, maybe my username with a single quote 02:08:36.347 --> 02:08:37.180 and then dash, dash. 02:08:37.180 --> 02:08:38.870 Well, why would I do this? 02:08:38.870 --> 02:08:43.240 Well, suppose for the sake of discussion that some developer at Harvard, 02:08:43.240 --> 02:08:46.030 or any website, really, has written a line of code 02:08:46.030 --> 02:08:49.870 like this to check if the username and password just 02:08:49.870 --> 02:08:52.760 typed in match what's in the database. 02:08:52.760 --> 02:08:53.890 So how might I do this? 02:08:53.890 --> 02:08:57.410 And if so, let the user log in, show them their account, or whatever. 02:08:57.410 --> 02:08:59.800 So here's the line of code in question. 02:08:59.800 --> 02:09:05.050 Select star from users where username equals question mark and password 02:09:05.050 --> 02:09:05.980 equals question mark. 02:09:05.980 --> 02:09:07.040 This is correct. 02:09:07.040 --> 02:09:08.650 This is green because it's good. 02:09:08.650 --> 02:09:10.960 It is not vulnerable to attack because 02:09:10.960 --> 02:09:13.090 I'm using these placeholders here, which, 02:09:13.090 --> 02:09:15.370 even though we've implemented in the CS50 library, 02:09:15.370 --> 02:09:18.040 most SQL libraries support the same syntax. 02:09:18.040 --> 02:09:19.600 But it can vary by system. 02:09:19.600 --> 02:09:20.890 This is safe. 02:09:20.890 --> 02:09:23.230 What is not safe is to use some of the stuff 02:09:23.230 --> 02:09:25.750 you learned last week where you can just use f strings, 02:09:25.750 --> 02:09:29.680 or format your own strings and interpolate values with curly braces. 02:09:29.680 --> 02:09:31.930 For instance, suppose you took me at my word 02:09:31.930 --> 02:09:33.790 last week that you can do this in Python. 02:09:33.790 --> 02:09:37.430 This code here, unfortunately, in yellow, is not safe. 02:09:37.430 --> 02:09:39.190 Here's a format string in Python. 02:09:39.190 --> 02:09:41.140 Here's the beginning of a SQL query. 02:09:41.140 --> 02:09:43.810 Here's a placeholder for the user username. 02:09:43.810 --> 02:09:45.730 Here's a placeholder for the user's password. 02:09:45.730 --> 02:09:49.400 And I've proactively, correctly, put single quotes around them. 02:09:49.400 --> 02:09:53.410 The problem is if you just blindly plug users input 02:09:53.410 --> 02:09:58.390 into pre-made queries like this, they can finish your thought for you, 02:09:58.390 --> 02:10:01.280 in ways you don't expect, if you trust the users. 02:10:01.280 --> 02:10:03.910 For instance, if I plug in for my username, 02:10:03.910 --> 02:10:07.600 mail in at harvard.edu, single quote, dash, dash. 02:10:07.600 --> 02:10:09.200 Notice what happens. 02:10:09.200 --> 02:10:11.470 Here's the single quote from the query. 02:10:11.470 --> 02:10:13.100 Here's what I typed in. 02:10:13.100 --> 02:10:14.150 But, wait a minute. 02:10:14.150 --> 02:10:17.560 It looks like this single quote that I typed in finishes 02:10:17.560 --> 02:10:19.810 the thought that the developer started. 02:10:19.810 --> 02:10:22.527 The dash, dash means, heck, ignore the rest of that. 02:10:22.527 --> 02:10:25.360 And, indeed, I've lifted grayed out because what effectively happens 02:10:25.360 --> 02:10:29.510 is that only executed is what's in yellow here at the moment. 02:10:29.510 --> 02:10:30.010 Why? 02:10:30.010 --> 02:10:32.440 Because everything after that close quote, 02:10:32.440 --> 02:10:35.255 which finishes the developers thought, is just ignored. 02:10:35.255 --> 02:10:38.380 So I mean, this is literally an example of how you can hack into a database 02:10:38.380 --> 02:10:39.910 by injecting SQL. 02:10:39.910 --> 02:10:42.220 Like dash, dash is an example of SQL. 02:10:42.220 --> 02:10:45.280 Silly as it is, it's a comment that tells the database 02:10:45.280 --> 02:10:46.460 to ignore the password. 02:10:46.460 --> 02:10:47.450 So what does this mean? 02:10:47.450 --> 02:10:50.770 Well, of course you're going to get rows back now because if you're only 02:10:50.770 --> 02:10:53.950 searching for someone by username, it doesn't matter what their password is, 02:10:53.950 --> 02:10:57.460 what they typed in, you've essentially blacked out that part 02:10:57.460 --> 02:10:59.530 and you're not even checking the password. 02:10:59.530 --> 02:11:03.640 The effect then would be that you could log in as me or Carter or anyone else, 02:11:03.640 --> 02:11:08.410 just by knowing their usernames if the Harvard developer wrote code 02:11:08.410 --> 02:11:09.183 in this way. 02:11:09.183 --> 02:11:12.100 And even though I'm pretty sure Harvard key does not suffer from this, 02:11:12.100 --> 02:11:14.590 so many darn websites have in the past. 02:11:14.590 --> 02:11:17.733 And if you Google, SQL injection attacks-- search by 02:11:17.733 --> 02:11:20.150 and Google, for instance, the past month or the past year, 02:11:20.150 --> 02:11:22.940 you will tragically likely see results because humans 02:11:22.940 --> 02:11:24.650 continue to make this mistake. 02:11:24.650 --> 02:11:27.830 The solution though, ultimately, is actually just 02:11:27.830 --> 02:11:30.710 use placeholders, just use the library that 02:11:30.710 --> 02:11:32.990 escapes potentially dangerous input. 02:11:32.990 --> 02:11:35.600 And this looks a little weird, but in C we 02:11:35.600 --> 02:11:38.000 saw that the escape character was a backslash? 02:11:38.000 --> 02:11:41.690 And that made backslash n, or backslash something else, be treated specially. 02:11:41.690 --> 02:11:45.750 Weirdly, in SQL it tends to be just another single quote. 02:11:45.750 --> 02:11:48.450 So if you do quote, quote that actually means-- 02:11:48.450 --> 02:11:49.580 I mean, a literal quote. 02:11:49.580 --> 02:11:52.010 It's not like the empty string, or nothing in between it. 02:11:52.010 --> 02:11:53.060 So it looks weird. 02:11:53.060 --> 02:11:56.630 But, long story short, if you use a library like CS50, 02:11:56.630 --> 02:11:58.460 or anything in the real world that handles 02:11:58.460 --> 02:12:01.130 escaping of user input, the whole query you 02:12:01.130 --> 02:12:04.910 can think of as being now good and green again because it doesn't matter what 02:12:04.910 --> 02:12:05.780 the human types in. 02:12:05.780 --> 02:12:10.680 Any scary characters will be properly escaped by the database. 02:12:10.680 --> 02:12:12.740 So it'll depend on the library you're using, 02:12:12.740 --> 02:12:15.200 but it almost always is the syntax using question 02:12:15.200 --> 02:12:18.530 marks or some similar placeholder. 02:12:18.530 --> 02:12:19.130 All right. 02:12:19.130 --> 02:12:23.270 So with that said, you are now inducted into the Hall 02:12:23.270 --> 02:12:26.020 of People who know now a little something about databases. 02:12:26.020 --> 02:12:28.520 We've only just scratched the surface of using the language, 02:12:28.520 --> 02:12:31.340 but it's now something we'll use to build up more and more interesting 02:12:31.340 --> 02:12:33.673 applications, especially for final projects when we soon 02:12:33.673 --> 02:12:37.160 transition to web programming or mobile app development, if you go that route. 02:12:37.160 --> 02:12:41.540 But you'll soon be able to speak a language, literally and figuratively, 02:12:41.540 --> 02:12:44.310 that those before you have acquired as well. 02:12:44.310 --> 02:12:48.170 So you are now qualified to understand this sort of internet joke. 02:12:48.170 --> 02:12:51.020 This is someone who, if I Zoom in, was trying 02:12:51.020 --> 02:12:55.820 to get out of paying some camera based tolls by tricking the state or the city 02:12:55.820 --> 02:12:58.130 into deleting or dropping their whole database. 02:12:58.130 --> 02:13:01.100 Drop means delete the whole thing, not just the rows therein. 02:13:01.100 --> 02:13:02.270 OK, so maybe not that funny. 02:13:02.270 --> 02:13:06.910 But this is now the note will end on, similar to other xkcd comics 02:13:06.910 --> 02:13:07.910 we've introduced you to. 02:13:07.910 --> 02:13:11.150 Every CS person out there has seen, knows this comic. 02:13:11.150 --> 02:13:14.150 So if you ever refer to, with a wink-- 02:13:14.150 --> 02:13:17.090 if you ever refer to little Bobby tables with a wink, 02:13:17.090 --> 02:13:20.795 if it's a computer scientist on the other end, they'll know whom you mean. 02:13:23.918 --> 02:13:25.330 [LAUGHING] 02:13:25.330 --> 02:13:26.260 OK, there we go. 02:13:29.690 --> 02:13:30.230 All right. 02:13:30.230 --> 02:13:30.740 Tough crowd. 02:13:30.740 --> 02:13:33.990 All right, Batman, come on down for your cookies, and we'll see you next time. 02:13:33.990 --> 02:13:34.520 [APPLAUSE] 02:13:34.520 --> 02:13:37.570 [MUSIC PLAYING]