1 00:00:00,000 --> 00:00:02,988 [MUSIC PLAYING] 2 00:00:02,988 --> 00:01:01,790 3 00:01:01,790 --> 00:01:03,500 SPEAKER 1: All right. 4 00:01:03,500 --> 00:01:09,590 This is CS50, and this is week 7 on the day before All Hallows' Eve. 5 00:01:09,590 --> 00:01:13,760 Today, we will introduce yet another language, the goal of which 6 00:01:13,760 --> 00:01:16,430 is not to introduce another language for language's sake, 7 00:01:16,430 --> 00:01:20,510 but to really begin to emphasize that when it comes to the world of software 8 00:01:20,510 --> 00:01:23,390 programming, engineering more generally, there's often 9 00:01:23,390 --> 00:01:25,170 different tools for different jobs. 10 00:01:25,170 --> 00:01:28,632 And if you were to try to use C to solve all of our future problems, 11 00:01:28,632 --> 00:01:30,590 it would actually be painful, as you've already 12 00:01:30,590 --> 00:01:35,360 seen how much more pleasant it is to solve certain problems with Python. 13 00:01:35,360 --> 00:01:37,460 But today, quite quickly, we'll realize that even 14 00:01:37,460 --> 00:01:40,440 Python's actually not the most pleasant way to solve a lot of problems, 15 00:01:40,440 --> 00:01:43,100 especially when it comes to data, data at scale. 16 00:01:43,100 --> 00:01:45,830 And, in fact, as we transition in the final weeks of CS50 17 00:01:45,830 --> 00:01:49,130 to the world of web programming, and if you so choose with your final project 18 00:01:49,130 --> 00:01:51,500 mobile programming, you'll actually need more tools 19 00:01:51,500 --> 00:01:54,150 in your tool kit than C and Python alone. 20 00:01:54,150 --> 00:01:57,620 In fact, today we'll introduce a database-centric language called S-Q-L, 21 00:01:57,620 --> 00:01:58,590 or SQL. 22 00:01:58,590 --> 00:02:02,400 Next week, we'll explore markup languages, like HTML and CSS. 23 00:02:02,400 --> 00:02:03,970 A bit of JavaScript, as well. 24 00:02:03,970 --> 00:02:07,290 And then we'll synthesize all of this together at the end of the class, 25 00:02:07,290 --> 00:02:10,360 as some of you might, for your final projects, as well. 26 00:02:10,360 --> 00:02:13,120 But, before we do that, and talk about data, 27 00:02:13,120 --> 00:02:15,310 let's actually start to gather some data. 28 00:02:15,310 --> 00:02:19,770 So if you could visit this URL here on your phone, or a laptop. 29 00:02:19,770 --> 00:02:23,460 Or, if easier, here's a bar code version thereof. 30 00:02:23,460 --> 00:02:25,980 You can point your camera at this bar code here, 31 00:02:25,980 --> 00:02:29,220 and it's going to pull up a relatively short Google form that's 32 00:02:29,220 --> 00:02:32,040 just going to ask you a couple of questions, 33 00:02:32,040 --> 00:02:35,070 so that we can collect some actual live data 34 00:02:35,070 --> 00:02:37,800 and actually play around with it, initially, in Python. 35 00:02:37,800 --> 00:02:41,020 So if you go to that URL there, you'll see a Google form. 36 00:02:41,020 --> 00:02:44,577 And if you haven't been able to scan it quite yet, that's fine. 37 00:02:44,577 --> 00:02:47,160 Just kind of look over the shoulder of the person next to you. 38 00:02:47,160 --> 00:02:49,960 But you, or they, will see a little something like this. 39 00:02:49,960 --> 00:02:53,130 So among the questions will be, which is your favorite language? 40 00:02:53,130 --> 00:02:56,160 On the list, only thus far, is Scratch, C, and Python, 41 00:02:56,160 --> 00:02:58,350 and below that you'll see another question asking 42 00:02:58,350 --> 00:03:02,960 about your favorite problem across the problem sets thus far. 43 00:03:02,960 --> 00:03:04,710 Each of them is radio buttons, which means 44 00:03:04,710 --> 00:03:08,800 you'll be able to select one button for each of those questions. 45 00:03:08,800 --> 00:03:11,310 And, ultimately, what's going to be nice is 46 00:03:11,310 --> 00:03:14,132 that if you've never used Google Forms before as an administrator, 47 00:03:14,132 --> 00:03:17,340 all of that data is actually going to end up being, not only in Google Forms, 48 00:03:17,340 --> 00:03:19,525 but also, if you so choose, in Google Spreadsheets. 49 00:03:19,525 --> 00:03:22,650 Which is an example, of course, of a spreadsheet software similar in spirit 50 00:03:22,650 --> 00:03:26,940 to Apple Numbers on Macs or Microsoft Excel on different platforms. 51 00:03:26,940 --> 00:03:29,220 And Google Sheets is going to allow us to store 52 00:03:29,220 --> 00:03:30,960 all of that data in rows and columns. 53 00:03:30,960 --> 00:03:33,090 And so, since Google made both of these products, 54 00:03:33,090 --> 00:03:34,690 they integrated one with the other. 55 00:03:34,690 --> 00:03:39,930 So, in fact, if I, on my laptop here in another window, open 56 00:03:39,930 --> 00:03:41,280 this up-- let me flip over. 57 00:03:41,280 --> 00:03:42,660 Here's the live spreadsheet. 58 00:03:42,660 --> 00:03:45,510 And we'll see that the very first person who buzzed in really 59 00:03:45,510 --> 00:03:48,090 liked Python, as did a lot of other people thereafter. 60 00:03:48,090 --> 00:03:52,390 But, Hello, World was your favorite in Python, which is great. 61 00:03:52,390 --> 00:03:54,330 There's a couple of votes for Scratch here. 62 00:03:54,330 --> 00:03:58,110 If we scroll down, there's one hold out for C, who really liked Credit, 63 00:03:58,110 --> 00:03:59,320 in this case here. 64 00:03:59,320 --> 00:04:03,250 And if we scroll down further, it looks like Python and Scratch are in there. 65 00:04:03,250 --> 00:04:05,727 A few more C's, and so on and so forth. 66 00:04:05,727 --> 00:04:08,310 So suppose that we wanted to, actually, now analyze this data. 67 00:04:08,310 --> 00:04:11,587 Now, any of you who have used Excel, or Numbers, or Google Spreadsheets 68 00:04:11,587 --> 00:04:13,920 know that it comes with built in functions and formulas, 69 00:04:13,920 --> 00:04:17,040 and we can do all of that, but suppose there was a huge amount of data. 70 00:04:17,040 --> 00:04:19,920 Or, suppose that this data was not coming in via Google Sheets, 71 00:04:19,920 --> 00:04:23,130 but via your own web application, or your own mobile application, 72 00:04:23,130 --> 00:04:25,530 and it's just ending up in some kind of spreadsheet. 73 00:04:25,530 --> 00:04:28,530 Well, wouldn't it be nice if we could actually analyze that kind of data 74 00:04:28,530 --> 00:04:29,250 with code. 75 00:04:29,250 --> 00:04:30,570 And, in fact, you can. 76 00:04:30,570 --> 00:04:35,220 And the simplest way to store a bunch of data isn't with anything fancy, 77 00:04:35,220 --> 00:04:39,450 in fact, but just a literal text file, something ending in .txt, 78 00:04:39,450 --> 00:04:41,520 or, maybe more commonly, .csv. 79 00:04:41,520 --> 00:04:44,400 In fact, what we'd call a flat file database 80 00:04:44,400 --> 00:04:48,820 is literally just code for a text file containing all of your data. 81 00:04:48,820 --> 00:04:53,040 But your data typically has delimiters that separate some values from others. 82 00:04:53,040 --> 00:04:55,050 And, in fact, the most common approach, daresay, 83 00:04:55,050 --> 00:04:59,430 is to use, indeed, comma-separated values, or, CSV files. 84 00:04:59,430 --> 00:05:04,080 And what that means is that in simple text alone, just asking, maybe Unicode, 85 00:05:04,080 --> 00:05:06,960 you sort of mimic the idea of rows and columns 86 00:05:06,960 --> 00:05:09,570 by using newline characters, like backslash n, 87 00:05:09,570 --> 00:05:11,260 to represent row after row after row. 88 00:05:11,260 --> 00:05:12,510 That's pretty straightforward. 89 00:05:12,510 --> 00:05:14,280 To just move the cursor to the next line. 90 00:05:14,280 --> 00:05:18,930 And because text files aren't graphical, so there's no notion of vertical bars 91 00:05:18,930 --> 00:05:21,510 that you can put between what you and I think of as columns, 92 00:05:21,510 --> 00:05:24,630 you literally just use a comma, or some other such symbol 93 00:05:24,630 --> 00:05:27,212 to separate one value from another. 94 00:05:27,212 --> 00:05:28,920 So, in fact, let me go ahead and do this. 95 00:05:28,920 --> 00:05:31,837 Even if you've never done this before, it turns out, in Google Sheets, 96 00:05:31,837 --> 00:05:35,130 and also Excel and Numbers, you can export your data, 97 00:05:35,130 --> 00:05:38,880 not in some proprietary Apple or Microsoft or Google format, 98 00:05:38,880 --> 00:05:43,380 but in a globally portable format known as .csv. 99 00:05:43,380 --> 00:05:44,970 So let me go up to file. 100 00:05:44,970 --> 00:05:47,037 I will go to download. 101 00:05:47,037 --> 00:05:49,620 And notice, I can export this in a bunch of different formats, 102 00:05:49,620 --> 00:05:52,722 but the one I care about for today is going to be .csv. 103 00:05:52,722 --> 00:05:55,180 On my Mac, that's going to put it into my downloads folder. 104 00:05:55,180 --> 00:05:57,722 And what I'm going to go ahead here and do, in just a moment, 105 00:05:57,722 --> 00:06:00,690 is let me go ahead it and open up VS Code. 106 00:06:00,690 --> 00:06:03,300 So, in short, I've downloaded this file. 107 00:06:03,300 --> 00:06:04,980 Google gave it a long default file name. 108 00:06:04,980 --> 00:06:06,900 I've renamed it to favorites.csv. 109 00:06:06,900 --> 00:06:10,200 Let me go ahead and actually open it now in VS Code here. 110 00:06:10,200 --> 00:06:12,480 And this now is the exact same data. 111 00:06:12,480 --> 00:06:14,800 If you're still submitting the form, I'm afraid you didn't make the cut off. 112 00:06:14,800 --> 00:06:17,220 So we're not going to see more data ending up in this file 113 00:06:17,220 --> 00:06:19,260 because I've literally downloaded a copy of it. 114 00:06:19,260 --> 00:06:21,240 But, indeed, if I scroll through this, we'll 115 00:06:21,240 --> 00:06:25,050 see that it's got some 399 lines of data. 116 00:06:25,050 --> 00:06:28,080 Or, technically, 398 because the very first one 117 00:06:28,080 --> 00:06:30,990 is what we'd call a header row, which just describes 118 00:06:30,990 --> 00:06:32,820 what each of these columns means. 119 00:06:32,820 --> 00:06:35,670 Now notice that even though the Google Sheets interface actually 120 00:06:35,670 --> 00:06:39,780 put all of this data in proper graphical rows and columns, 121 00:06:39,780 --> 00:06:42,510 you can still see the rows because they're just separate lines, 122 00:06:42,510 --> 00:06:45,900 and you can kind of see the columns by focusing on the comma 123 00:06:45,900 --> 00:06:47,970 here, the comma here, the comma here. 124 00:06:47,970 --> 00:06:50,820 And then, also, the comma here, the comma here, 125 00:06:50,820 --> 00:06:52,540 the comma here, and so forth. 126 00:06:52,540 --> 00:06:53,910 Now, a quick subtlety. 127 00:06:53,910 --> 00:06:59,490 Hello, World is a two-word problem name, and it itself has a comma in it. 128 00:06:59,490 --> 00:07:06,090 What's to stop me from confusing the comma in Hello, World from the commas 129 00:07:06,090 --> 00:07:09,410 that Google, apparently, inserted into this file? 130 00:07:09,410 --> 00:07:09,910 Yeah. 131 00:07:09,910 --> 00:07:11,925 So, automatically, you all did not type this. 132 00:07:11,925 --> 00:07:13,300 We did not type it into the form. 133 00:07:13,300 --> 00:07:16,750 But Google is smart enough, as is Apple and Microsoft, when 134 00:07:16,750 --> 00:07:20,958 they export CSVs to somehow escape seemingly dangerous characters, 135 00:07:20,958 --> 00:07:22,750 or characters that could just break things. 136 00:07:22,750 --> 00:07:25,180 And, in this case, a convention in the CSV world 137 00:07:25,180 --> 00:07:28,340 is just to quote any strings that themselves have commas 138 00:07:28,340 --> 00:07:32,150 so that any program you're writing that reads this file doesn't get confused. 139 00:07:32,150 --> 00:07:35,260 So the only thing that came in automatically from Google 140 00:07:35,260 --> 00:07:37,690 is just this timestamp here, based on the time 141 00:07:37,690 --> 00:07:39,370 of day in our own local timezone. 142 00:07:39,370 --> 00:07:40,970 That was added automatically. 143 00:07:40,970 --> 00:07:43,100 So we have three, and not just two columns. 144 00:07:43,100 --> 00:07:45,460 So with that said, we have three columns: 145 00:07:45,460 --> 00:07:47,980 timestamp, language, and problem. 146 00:07:47,980 --> 00:07:50,200 The latter two of which came from all of you. 147 00:07:50,200 --> 00:07:52,840 Let's actually write some code that analyzes this data. 148 00:07:52,840 --> 00:07:55,720 Let's figure out what the most popular something or other is here. 149 00:07:55,720 --> 00:07:57,970 So I'll close the CSV file. 150 00:07:57,970 --> 00:08:02,140 Let me go ahead and use what seems to be folks favorite language, thus far, 151 00:08:02,140 --> 00:08:05,320 and write a file called favorites.py, for instance. 152 00:08:05,320 --> 00:08:08,440 And I'm going to use Python to open that CSV file, 153 00:08:08,440 --> 00:08:12,180 analyze it, crunch some numbers, and output some statistics, if you will. 154 00:08:12,180 --> 00:08:15,920 So the easiest way to manipulate CSV files, as you might have gleaned, 155 00:08:15,920 --> 00:08:19,940 is not to just open the file yourself, look for commas, allocate memory, 156 00:08:19,940 --> 00:08:22,220 or anything like that in C. In Python, you 157 00:08:22,220 --> 00:08:27,230 can literally just import CSV, which is a Python module that 158 00:08:27,230 --> 00:08:29,570 gives you CSV related functionality. 159 00:08:29,570 --> 00:08:32,390 In Python, there's a bunch of ways to open files. 160 00:08:32,390 --> 00:08:35,960 One way to open a file is like this in Python. 161 00:08:35,960 --> 00:08:40,039 You can say file equals open, similar to fopen in C. 162 00:08:40,039 --> 00:08:44,600 You can specify the name of the file you want to open, like favorites.csv, 163 00:08:44,600 --> 00:08:48,330 and you can explicitly say you want to open the file for reading, 164 00:08:48,330 --> 00:08:50,240 quote unquote, "r," just like fopen. 165 00:08:50,240 --> 00:08:53,280 Strictly speaking, in Python read is implied. 166 00:08:53,280 --> 00:08:55,580 So if you omit the second argument, it will still work. 167 00:08:55,580 --> 00:08:58,490 But for parity with fopen, I'll do the same here. 168 00:08:58,490 --> 00:09:03,020 Then you can, maybe, do something with file, and then, at the end, 169 00:09:03,020 --> 00:09:05,070 you can close the file like this. 170 00:09:05,070 --> 00:09:08,840 So in Python, there really is a mapping between fopen and fclose. 171 00:09:08,840 --> 00:09:12,560 But in Python the functions are called open and close for short. 172 00:09:12,560 --> 00:09:16,130 But a more common way, a more Pythonic way, so to speak, i.e. 173 00:09:16,130 --> 00:09:19,040 the way people tend to do it in Python, is actually 174 00:09:19,040 --> 00:09:22,610 to use a keyword that didn't exist in C, where you instead say with. 175 00:09:22,610 --> 00:09:26,990 And you say, with open this file name as a specific file name, 176 00:09:26,990 --> 00:09:28,970 and then indent it inside of that. 177 00:09:28,970 --> 00:09:30,950 Now you can do whatever you want with the file. 178 00:09:30,950 --> 00:09:33,350 And the implication of using with, even though it's not 179 00:09:33,350 --> 00:09:36,740 obvious from the keyword itself, is that the file will be automatically closed 180 00:09:36,740 --> 00:09:37,770 for you later. 181 00:09:37,770 --> 00:09:39,650 So this is just a minor Python convenience 182 00:09:39,650 --> 00:09:42,020 so that you don't have to remember to close the file. 183 00:09:42,020 --> 00:09:44,030 Unless something goes wrong, it will just 184 00:09:44,030 --> 00:09:47,690 close automatically as soon as you're outside of this with block. 185 00:09:47,690 --> 00:09:49,460 So how do I go about reading a CSV? 186 00:09:49,460 --> 00:09:52,940 The simplest way is to actually give yourself a variable called, 187 00:09:52,940 --> 00:09:53,900 maybe, reader. 188 00:09:53,900 --> 00:09:56,780 Like if I want to read the CSV, I'll call my variable reader. 189 00:09:56,780 --> 00:10:02,270 Set that equal to the return value of a function that comes with Python CSV 190 00:10:02,270 --> 00:10:07,080 module called reader, in lowercase, and you just pass in the file name. 191 00:10:07,080 --> 00:10:09,560 So the first line there on line three opens the file 192 00:10:09,560 --> 00:10:11,570 and gives me access to the bytes there in. 193 00:10:11,570 --> 00:10:14,450 Line four now actually uses this library that 194 00:10:14,450 --> 00:10:18,260 comes with Python to just go read it for me, figure out where the commas are, 195 00:10:18,260 --> 00:10:21,690 so that it can hand me, line after line, the data in the file. 196 00:10:21,690 --> 00:10:24,320 Now the first piece of data in the file, though, was what? 197 00:10:24,320 --> 00:10:27,080 What's the very first row? 198 00:10:27,080 --> 00:10:30,692 So it was that header row with timestamp, language, problem. 199 00:10:30,692 --> 00:10:32,900 I actually want to skip that because that's not data. 200 00:10:32,900 --> 00:10:36,870 That's what we might call metadata that's describing my actual data. 201 00:10:36,870 --> 00:10:39,670 So one way to deal with this, I can literally just say next reader. 202 00:10:39,670 --> 00:10:42,170 And next, even though the semantics here are a little weird, 203 00:10:42,170 --> 00:10:46,100 it just means, no, no, give me the next line from that reader instead 204 00:10:46,100 --> 00:10:47,550 of the first by default. 205 00:10:47,550 --> 00:10:51,650 And now inside of this with block, I can do something like this. 206 00:10:51,650 --> 00:10:56,810 For each row in that reader, let's do something super simple initially. 207 00:10:56,810 --> 00:11:00,380 Let's just print out row bracket one. 208 00:11:00,380 --> 00:11:01,910 So row bracket one. 209 00:11:01,910 --> 00:11:03,330 So what's going on here? 210 00:11:03,330 --> 00:11:07,040 Well, the CSV reader in Python is going to return to you, inside 211 00:11:07,040 --> 00:11:09,990 of this loop, one row after another. 212 00:11:09,990 --> 00:11:12,080 Each of those rows, though, has how many columns? 213 00:11:12,080 --> 00:11:12,920 Three. 214 00:11:12,920 --> 00:11:15,470 The timestamp, the language, and the problem. 215 00:11:15,470 --> 00:11:19,430 And just like in C, our lists in Python are zero index. 216 00:11:19,430 --> 00:11:20,720 So zero, one, two. 217 00:11:20,720 --> 00:11:25,350 So if I'm printing out row bracket one, that's the second, or middle, column. 218 00:11:25,350 --> 00:11:29,220 So what's this going to print row after row? 219 00:11:29,220 --> 00:11:32,340 Each of the languages that you all replied with. 220 00:11:32,340 --> 00:11:32,940 So that's all. 221 00:11:32,940 --> 00:11:34,440 This isn't doing any kind of analytics yet. 222 00:11:34,440 --> 00:11:36,440 I'm just going through the motions to, at least, 223 00:11:36,440 --> 00:11:37,900 print out some data of interest. 224 00:11:37,900 --> 00:11:38,692 So let me run this. 225 00:11:38,692 --> 00:11:41,550 Python of favorites.py, enter. 226 00:11:41,550 --> 00:11:43,270 And it happens super fast. 227 00:11:43,270 --> 00:11:47,170 But if I scroll back in my terminal window, there is all of that raw data. 228 00:11:47,170 --> 00:11:49,472 So this is to say, once you know the function names, 229 00:11:49,472 --> 00:11:51,180 once you know the keywords, it's actually 230 00:11:51,180 --> 00:11:54,090 pretty simple in Python to just get up and running with a file, 231 00:11:54,090 --> 00:11:56,070 and start looking at the data therein. 232 00:11:56,070 --> 00:12:02,910 But it turns out that it's a little sub-optimal to use the reader alone. 233 00:12:02,910 --> 00:12:05,400 It turns out there's better ways to do this, and let 234 00:12:05,400 --> 00:12:06,780 me make this clear as follows. 235 00:12:06,780 --> 00:12:09,660 I don't strictly need a variable, but let me actually declare 236 00:12:09,660 --> 00:12:13,560 a variable called favorite, set it equal to row bracket one, 237 00:12:13,560 --> 00:12:15,630 and then print out that favorite value. 238 00:12:15,630 --> 00:12:17,438 This is not doing anything new. 239 00:12:17,438 --> 00:12:19,230 It's just declaring an additional variable, 240 00:12:19,230 --> 00:12:22,380 but I wanted to highlight the fact that I'm just kind of trusting that row 241 00:12:22,380 --> 00:12:24,630 bracket one is the problem-- 242 00:12:24,630 --> 00:12:26,430 or, is the language I care about. 243 00:12:26,430 --> 00:12:27,660 Is language, is language. 244 00:12:27,660 --> 00:12:30,790 But suppose one of you, or I, go into the Google spreadsheet, 245 00:12:30,790 --> 00:12:32,668 and like anyone might with a spreadsheet, 246 00:12:32,668 --> 00:12:34,210 you might start moving things around. 247 00:12:34,210 --> 00:12:36,947 And you might swap some of the columns left and right. 248 00:12:36,947 --> 00:12:39,280 You might delete one of the columns, add something else. 249 00:12:39,280 --> 00:12:41,830 In short, spreadsheets are arguably fragile 250 00:12:41,830 --> 00:12:44,590 in that it's pretty easy in the GUI, the graphical user interface, 251 00:12:44,590 --> 00:12:46,040 to change them around. 252 00:12:46,040 --> 00:12:50,080 And so my code in Python, accordingly, is, arguably, fragile 253 00:12:50,080 --> 00:12:53,770 because I'm just hoping that row bracket one is always the data 254 00:12:53,770 --> 00:12:54,890 that I care about. 255 00:12:54,890 --> 00:12:56,450 So what would be marginally better? 256 00:12:56,450 --> 00:12:59,110 Well, let's actually use that header row instead. 257 00:12:59,110 --> 00:13:02,920 And more common, arguably, in Python, is not to use a simple reader, 258 00:13:02,920 --> 00:13:05,650 but, instead, what we would call a Dictionary Reader. 259 00:13:05,650 --> 00:13:09,190 I'm going to change this to DictReader, capital D, capital R, 260 00:13:09,190 --> 00:13:12,010 and that's it for the change to line four. 261 00:13:12,010 --> 00:13:14,197 On line five, I'm going to get rid of the next line 262 00:13:14,197 --> 00:13:16,030 because I don't want to skip the header now. 263 00:13:16,030 --> 00:13:19,210 What DictReader does, which reader does not, 264 00:13:19,210 --> 00:13:22,690 is it automatically analyzes that first line in the file, 265 00:13:22,690 --> 00:13:25,510 figures out what are all of your columns called, 266 00:13:25,510 --> 00:13:30,220 and, thereafter, when you iterate over this reader what each of your rows 267 00:13:30,220 --> 00:13:35,650 now is-- it's no longer a list of size three, bracket zero, bracket 268 00:13:35,650 --> 00:13:36,490 one, bracket two. 269 00:13:36,490 --> 00:13:39,670 Each row that you get back in this loop on line five 270 00:13:39,670 --> 00:13:44,890 is now, wonderfully, a dictionary instead, the keys of which 271 00:13:44,890 --> 00:13:46,090 are from the header field. 272 00:13:46,090 --> 00:13:47,650 Timestamp, language, problem. 273 00:13:47,650 --> 00:13:52,460 The values of which are whatever each of you typed in again and again. 274 00:13:52,460 --> 00:13:53,890 So how do I change the code? 275 00:13:53,890 --> 00:13:57,800 I no longer have these numeric indices because row is no longer a list, 276 00:13:57,800 --> 00:13:58,850 it's a dictionary. 277 00:13:58,850 --> 00:14:02,350 So if I literally want the language that you typed in, 278 00:14:02,350 --> 00:14:06,520 I can use row bracket, quote unquote, "language," treating row as a Dict 279 00:14:06,520 --> 00:14:09,010 not as a list anymore. 280 00:14:09,010 --> 00:14:12,700 Now this is, again, more robust because if you move the columns around on me, 281 00:14:12,700 --> 00:14:15,340 code is still going to work, at least so long as you don't 282 00:14:15,340 --> 00:14:17,570 rename the columns at the very top. 283 00:14:17,570 --> 00:14:20,180 So that's still one assumption I'm making. 284 00:14:20,180 --> 00:14:20,680 All right. 285 00:14:20,680 --> 00:14:23,770 Well, beyond that, what could I actually do here? 286 00:14:23,770 --> 00:14:27,422 Well, just to be clear, I don't, strictly speaking, need this variable. 287 00:14:27,422 --> 00:14:28,880 So no need to highlight this again. 288 00:14:28,880 --> 00:14:30,755 So let me just simplify the code a little bit 289 00:14:30,755 --> 00:14:33,220 and get rid of this variable, and instead just print out 290 00:14:33,220 --> 00:14:34,960 the language in that row. 291 00:14:34,960 --> 00:14:40,360 As a quick check, let me rerun Python of favorites.py 292 00:14:40,360 --> 00:14:43,430 and it seems to still work, even though there's a lot of output. 293 00:14:43,430 --> 00:14:45,610 We're not going to check all 399 lines, but it 294 00:14:45,610 --> 00:14:48,160 looks like it printed out all of those popular languages. 295 00:14:48,160 --> 00:14:50,120 Well, what more can we actually do? 296 00:14:50,120 --> 00:14:52,420 Well, let's actually now start to crunch some numbers 297 00:14:52,420 --> 00:14:55,450 and figure out how many people like Scratch, how many people like 298 00:14:55,450 --> 00:14:56,890 C, how many people like Python. 299 00:14:56,890 --> 00:14:58,580 Let's start to analyze this. 300 00:14:58,580 --> 00:15:02,290 So maybe the most pedantic way to do this in Python 301 00:15:02,290 --> 00:15:07,113 would be to just create some variables and do all of the counting myself. 302 00:15:07,113 --> 00:15:08,780 So let me actually go ahead and do this. 303 00:15:08,780 --> 00:15:10,113 Let me delete this code for now. 304 00:15:10,113 --> 00:15:12,370 And after opening the reader, let me create 305 00:15:12,370 --> 00:15:14,980 a variable called Scratch and set it equal to zero, 306 00:15:14,980 --> 00:15:19,060 a variable called C, set it equal to zero, a variable called Python, 307 00:15:19,060 --> 00:15:23,290 set it equal to zero, just so that I have three counters, similar to what 308 00:15:23,290 --> 00:15:26,620 we did in week zero, week one, anytime we counted anything Honestly, 309 00:15:26,620 --> 00:15:29,600 this looks a little stupid, and it's not wrong. 310 00:15:29,600 --> 00:15:31,600 In fact, this is how you would do it in Python, 311 00:15:31,600 --> 00:15:33,910 but Python also has some clever syntax. 312 00:15:33,910 --> 00:15:38,800 If you want to be really cool, you can do Scratch comma, c comma, 313 00:15:38,800 --> 00:15:43,932 Python equals zero, zero, zero, and do all three at once if you like that. 314 00:15:43,932 --> 00:15:47,140 So it tightens up the code a little bit, even though the effect is ultimately 315 00:15:47,140 --> 00:15:47,710 the same. 316 00:15:47,710 --> 00:15:52,010 Now let's go ahead and iterate over this file row by row by row. 317 00:15:52,010 --> 00:15:54,730 And if we see Scratch, increment the Scratch counter. 318 00:15:54,730 --> 00:15:56,320 If we see C, increment the C counter. 319 00:15:56,320 --> 00:15:59,000 If we see Python, increment the Python counter, instead. 320 00:15:59,000 --> 00:16:00,080 So how do I do this? 321 00:16:00,080 --> 00:16:03,220 Well, I could do something like, for each row in the reader, 322 00:16:03,220 --> 00:16:07,180 just like before, let me go ahead and get that favorite variable 323 00:16:07,180 --> 00:16:11,230 and set it equal to the language in that dictionary that just came back 324 00:16:11,230 --> 00:16:12,740 as part of this iteration. 325 00:16:12,740 --> 00:16:19,060 And now I can do something like this, if favorite equals, equals "Scratch" then, 326 00:16:19,060 --> 00:16:26,710 with my indentation, I can do Scratch plus equals one, elif favorite equals, 327 00:16:26,710 --> 00:16:27,670 equals "C." 328 00:16:27,670 --> 00:16:31,290 I can go ahead and increment C plus equals one. 329 00:16:31,290 --> 00:16:34,440 Else-- and I don't think I want else, just to be safe. 330 00:16:34,440 --> 00:16:37,410 Even though we only saw three options, I think just to be super safe, 331 00:16:37,410 --> 00:16:41,040 elif favorite equals, equals "Python," then let's go ahead 332 00:16:41,040 --> 00:16:42,210 and increment Python. 333 00:16:42,210 --> 00:16:45,360 The one thing we can't do that we could do in C is the plus, plus trick. 334 00:16:45,360 --> 00:16:47,890 So plus equals is as close as we can get. 335 00:16:47,890 --> 00:16:48,900 So what have I done? 336 00:16:48,900 --> 00:16:52,110 Inside of this loop, I've just incremented each of those counters 337 00:16:52,110 --> 00:16:56,230 by one if I see Scratch, or C, or Python, again and again. 338 00:16:56,230 --> 00:16:56,730 All right. 339 00:16:56,730 --> 00:17:00,060 Outside of the loop and outside of the with block, 340 00:17:00,060 --> 00:17:02,370 because once I'm done reading all of the rows, 341 00:17:02,370 --> 00:17:06,493 I might as well let the with clause close the file automatically, 342 00:17:06,493 --> 00:17:08,410 let's just go ahead and print out some values. 343 00:17:08,410 --> 00:17:13,560 So I'm going to go ahead and print out Scratch colon, and then inside of there 344 00:17:13,560 --> 00:17:16,290 let's print out whatever the value of that variable is. 345 00:17:16,290 --> 00:17:19,770 Let's then go ahead and print out C colon, 346 00:17:19,770 --> 00:17:22,030 and then whatever the value of the C variable is. 347 00:17:22,030 --> 00:17:26,160 And then, lastly, let's print out Python colon and whatever the value is. 348 00:17:26,160 --> 00:17:29,117 And now I made three typos here. 349 00:17:29,117 --> 00:17:30,700 This is not going to print the values. 350 00:17:30,700 --> 00:17:32,320 What do I need to do? 351 00:17:32,320 --> 00:17:34,030 Sorry? 352 00:17:34,030 --> 00:17:36,560 Yeah, I'm missing the f in front of each of these strings 353 00:17:36,560 --> 00:17:38,470 so that I actually get formatted. 354 00:17:38,470 --> 00:17:41,800 And that is to say, the variables get interpolated inside of the curly braces 355 00:17:41,800 --> 00:17:43,220 like we saw last week. 356 00:17:43,220 --> 00:17:43,720 All right. 357 00:17:43,720 --> 00:17:45,928 So, honestly, that's kind of a decent amount of code, 358 00:17:45,928 --> 00:17:49,630 like 18 or so lines of code, just to count the number of responses. 359 00:17:49,630 --> 00:17:51,070 But let's see if I got it right. 360 00:17:51,070 --> 00:17:55,330 Let me open my terminal and run Python of favorites.py, 361 00:17:55,330 --> 00:17:58,420 and now I see, by an overwhelming amount, 362 00:17:58,420 --> 00:18:04,660 folks like Python, followed by C, followed by Scratch, in that order. 363 00:18:04,660 --> 00:18:06,910 But that was a decent amount of code to have to write, 364 00:18:06,910 --> 00:18:10,540 and it turns out there's actually better ways of doing this, more Pythonic ways, 365 00:18:10,540 --> 00:18:12,530 more programmatic ways of doing this. 366 00:18:12,530 --> 00:18:15,760 And if we think back to one of our universal data structures. 367 00:18:15,760 --> 00:18:18,910 Think back to how we preached last week, and the week before, 368 00:18:18,910 --> 00:18:21,280 the value of these dictionaries, more generally. 369 00:18:21,280 --> 00:18:23,317 Like the CSV module, clearly a fan of them, 370 00:18:23,317 --> 00:18:25,150 because that's what DictReader is giving us, 371 00:18:25,150 --> 00:18:27,220 dictionary, after dictionary, after dictionary. 372 00:18:27,220 --> 00:18:29,290 And this was the general idea of a dictionary. 373 00:18:29,290 --> 00:18:31,840 It associates keys with values, much like you 374 00:18:31,840 --> 00:18:34,940 might in a two-column table on a chalkboard, or the like. 375 00:18:34,940 --> 00:18:39,340 Well, this is what I need if I want to keep track of how many people said 376 00:18:39,340 --> 00:18:42,250 Scratch, and C, and Python. 377 00:18:42,250 --> 00:18:46,900 If I had a piece of chalk, I could just write Scratch, and C, 378 00:18:46,900 --> 00:18:48,670 and Python as three keys. 379 00:18:48,670 --> 00:18:51,910 And then with my chalk and, maybe, an eraser, keep track of the values. 380 00:18:51,910 --> 00:18:52,900 They all start at zero. 381 00:18:52,900 --> 00:18:54,550 Then, I add one, add two, add three. 382 00:18:54,550 --> 00:18:56,800 Or, maybe, on a chalkboard, I actually use hash marks. 383 00:18:56,800 --> 00:19:00,910 But a dictionary is kind of the perfect data structure for just associating 384 00:19:00,910 --> 00:19:05,770 something like Scratch, C, Python, with something else-- keys with values, 385 00:19:05,770 --> 00:19:06,890 respectively. 386 00:19:06,890 --> 00:19:09,250 So this is going to look a little weirder, 387 00:19:09,250 --> 00:19:13,040 but it's going to be pretty conventional to do something like this instead. 388 00:19:13,040 --> 00:19:15,550 Let me go back into VS Code. 389 00:19:15,550 --> 00:19:17,360 I'll close my terminal window. 390 00:19:17,360 --> 00:19:20,680 And let me go ahead and actually delete a lot of this 391 00:19:20,680 --> 00:19:22,630 because I can simplify this further. 392 00:19:22,630 --> 00:19:26,500 Let me go ahead and now give myself, maybe, a variable just, 393 00:19:26,500 --> 00:19:30,698 generically, called counts, and set that equal to an empty dictionary. 394 00:19:30,698 --> 00:19:32,740 And you can actually do this in a couple of ways. 395 00:19:32,740 --> 00:19:34,727 You can literally write out dict with nothing 396 00:19:34,727 --> 00:19:37,060 in parentheses, which will give you an empty dictionary, 397 00:19:37,060 --> 00:19:39,602 like the picture on the board, but a little more conventional 398 00:19:39,602 --> 00:19:43,450 is to just use two keystrokes and use two curly braces with nothing inside. 399 00:19:43,450 --> 00:19:47,390 That gives me an empty dictionary like this picture here. 400 00:19:47,390 --> 00:19:49,160 Now my loop is going to be the same. 401 00:19:49,160 --> 00:19:52,000 I'm going to do, for each row in the reader, 402 00:19:52,000 --> 00:19:55,550 I'm going to go ahead and grab the favorite language for this person. 403 00:19:55,550 --> 00:19:58,652 So favorite equals row, quote unquote, "language." 404 00:19:58,652 --> 00:20:00,860 But now I'm going to do something a little different. 405 00:20:00,860 --> 00:20:02,680 There's two scenarios here. 406 00:20:02,680 --> 00:20:07,690 Either, I have seen this language before and I want to increment it by one. 407 00:20:07,690 --> 00:20:11,950 Or, if I've never seen this language before because the loop just started, 408 00:20:11,950 --> 00:20:16,950 what should I initialize the count for this language to? 409 00:20:16,950 --> 00:20:18,840 So, one because I've only seen it once. 410 00:20:18,840 --> 00:20:19,710 Exactly. 411 00:20:19,710 --> 00:20:22,470 So now let me go ahead and do exactly that. 412 00:20:22,470 --> 00:20:26,370 If this current favorite, that I am seeing in the row, 413 00:20:26,370 --> 00:20:29,997 is already in the counts dictionary-- and in Python, 414 00:20:29,997 --> 00:20:31,830 this is literally how you ask that question. 415 00:20:31,830 --> 00:20:33,840 If favorite in counts. 416 00:20:33,840 --> 00:20:38,250 That will check, is there a key with this name, Scratch, C, or Python? 417 00:20:38,250 --> 00:20:42,390 If so, go into that location in the counts dictionary, 418 00:20:42,390 --> 00:20:46,500 index into it at the favorite location, because favorite is a string. 419 00:20:46,500 --> 00:20:49,140 It's either, quote unquote, "Scratch," "C," or "Python," 420 00:20:49,140 --> 00:20:53,490 and just increment it by one like this. 421 00:20:53,490 --> 00:20:54,780 Else, as you noted. 422 00:20:54,780 --> 00:21:00,360 If it's not there implicitly, then counts bracket favorite should probably 423 00:21:00,360 --> 00:21:03,210 be set equal-- not to zero because we're literally 424 00:21:03,210 --> 00:21:04,920 are seeing it in the current row. 425 00:21:04,920 --> 00:21:06,660 Let's initialize it to one. 426 00:21:06,660 --> 00:21:08,460 And, thereafter, if we see it again, it's 427 00:21:08,460 --> 00:21:11,610 going to be plus equals one, plus equals one, plus equals one. 428 00:21:11,610 --> 00:21:15,330 So now outside of that loop, outside of the with block, let me do this. 429 00:21:15,330 --> 00:21:18,200 For each favorite in those counts-- 430 00:21:18,200 --> 00:21:20,920 And this, too, in Python is a trick if you want to iterate over 431 00:21:20,920 --> 00:21:22,870 all of the keys in a dictionary. 432 00:21:22,870 --> 00:21:26,560 That is, if you want to iterate over the left-hand column of all of these keys, 433 00:21:26,560 --> 00:21:30,220 you literally can say, for something in that dictionary. 434 00:21:30,220 --> 00:21:33,730 So for favorite in counts, this is giving me a variable called favorite 435 00:21:33,730 --> 00:21:37,420 and updating it automatically top to bottom in that dictionary. 436 00:21:37,420 --> 00:21:40,600 Let's go ahead and print out an f string that's 437 00:21:40,600 --> 00:21:44,230 going to say whatever the name of that language is colon, 438 00:21:44,230 --> 00:21:50,950 and whatever the value of that language is in that there dictionary. 439 00:21:50,950 --> 00:21:54,040 So, again, logically the only thing that's new is this. 440 00:21:54,040 --> 00:21:57,940 I'm now using one dictionary instead of three variables 441 00:21:57,940 --> 00:22:02,170 to keep track of three things, like updating this chalkboard with three 442 00:22:02,170 --> 00:22:04,043 different things, Scratch, C, Python. 443 00:22:04,043 --> 00:22:06,460 And the last thing I'm doing, which is a little different, 444 00:22:06,460 --> 00:22:09,430 is once I have that dictionary, whether there's three languages, 445 00:22:09,430 --> 00:22:11,597 or, maybe, tomorrow there'll be fourth because we're 446 00:22:11,597 --> 00:22:12,970 going to introduce SQL today. 447 00:22:12,970 --> 00:22:16,807 Well, this will iterate over all of those keys and print out the values. 448 00:22:16,807 --> 00:22:18,640 All right, so if I didn't do anything wrong, 449 00:22:18,640 --> 00:22:21,940 if I do Python of favorites.py and hit enter. 450 00:22:21,940 --> 00:22:23,043 There we have it. 451 00:22:23,043 --> 00:22:25,210 And it happens to be in a different order this time. 452 00:22:25,210 --> 00:22:28,090 That's because we saw Python first, we then saw Scratch, 453 00:22:28,090 --> 00:22:31,870 and, eventually, we saw C. But if we wanted to sort these differently, 454 00:22:31,870 --> 00:22:34,780 we actually could with some different code. 455 00:22:34,780 --> 00:22:36,170 But, in short, what have we done? 456 00:22:36,170 --> 00:22:39,340 We've created this kind of structure in memory with three keys, 457 00:22:39,340 --> 00:22:42,100 Python, C, and Scratch, because each time 458 00:22:42,100 --> 00:22:47,140 we encounter such a language from you all, we either set our counter to one 459 00:22:47,140 --> 00:22:50,110 or increment it by one instead. 460 00:22:50,110 --> 00:22:56,320 Any questions on this code, or this general idea of using dictionaries as, 461 00:22:56,320 --> 00:23:00,490 like, a little cheat sheet for doing some math in this way? 462 00:23:00,490 --> 00:23:04,200 Super common paradigm. 463 00:23:04,200 --> 00:23:04,700 All right. 464 00:23:04,700 --> 00:23:06,890 Well, let me tweak this a little bit. 465 00:23:06,890 --> 00:23:10,622 Right now, in my output, we're seeing Python, Scratch and C. 466 00:23:10,622 --> 00:23:13,580 Maybe, for the sake of discussion, suppose we want to sort this by key. 467 00:23:13,580 --> 00:23:14,580 We can actually do that. 468 00:23:14,580 --> 00:23:16,370 Let me close my terminal temporarily. 469 00:23:16,370 --> 00:23:19,037 And it turns out, in Python, there's a bunch of ways to do this, 470 00:23:19,037 --> 00:23:21,890 but the simplest way to sort a dictionary by key 471 00:23:21,890 --> 00:23:24,500 is literally to use a function called sorted 472 00:23:24,500 --> 00:23:26,970 that comes with Python, that just does it for you. 473 00:23:26,970 --> 00:23:29,720 And even if you pass it a dictionary, it will sort that dictionary 474 00:23:29,720 --> 00:23:34,170 by the left-hand column so you can iterate it over alphabetically instead. 475 00:23:34,170 --> 00:23:36,110 So if I go back now to VS Code. 476 00:23:36,110 --> 00:23:39,710 If I open my terminal window and I rerun Python on favorites.py, 477 00:23:39,710 --> 00:23:41,840 now that I've added the sorted call, we should now 478 00:23:41,840 --> 00:23:45,470 see just because it's sorted alphabetically instead. 479 00:23:45,470 --> 00:23:49,195 Now that's not that useful, especially if we had lots of languages. 480 00:23:49,195 --> 00:23:51,320 You probably don't care about it being alphabetized 481 00:23:51,320 --> 00:23:54,570 as much as you care about it being ranked by which is the most popular, 482 00:23:54,570 --> 00:23:55,700 which is the least popular. 483 00:23:55,700 --> 00:23:58,220 And, for that, there's a bunch of ways to do this in Python. 484 00:23:58,220 --> 00:24:03,410 And, I think, the simplest way to sort by value the right-hand column instead 485 00:24:03,410 --> 00:24:06,900 of the left-hand column is probably to make this change instead. 486 00:24:06,900 --> 00:24:09,240 Let me close my terminal temporarily. 487 00:24:09,240 --> 00:24:14,110 Let me still use the sorted function, which by default sorts by key, 488 00:24:14,110 --> 00:24:16,750 but let's change it to be as follows. 489 00:24:16,750 --> 00:24:21,960 Let's change it to sort by a function called counts.get, 490 00:24:21,960 --> 00:24:24,360 which is a little weird, but this comes back 491 00:24:24,360 --> 00:24:28,290 to last week's brief discussion of object-oriented programming, or oop. 492 00:24:28,290 --> 00:24:30,150 Remember, in Python, that almost everything 493 00:24:30,150 --> 00:24:31,860 is like an object of some sort. 494 00:24:31,860 --> 00:24:36,930 An int is an object, a dictionary is an object, a string is an object. 495 00:24:36,930 --> 00:24:40,590 Which is to say that, not only do these things have values like, quote unquote, 496 00:24:40,590 --> 00:24:44,340 "Hello, World," or 50, these variables, these objects, 497 00:24:44,340 --> 00:24:46,980 can also have functions built into them, a.k.a. 498 00:24:46,980 --> 00:24:47,670 methods. 499 00:24:47,670 --> 00:24:52,440 So it turns out that because counts is a dictionary, because I made it so, 500 00:24:52,440 --> 00:24:55,350 that counts dictionary, like any dictionary in Python, 501 00:24:55,350 --> 00:24:57,150 comes with a function called get. 502 00:24:57,150 --> 00:25:01,290 And if you just tell the sorted function to use that built-in method, 503 00:25:01,290 --> 00:25:05,400 it will actually, for every key, get its value, get its value, get its value, 504 00:25:05,400 --> 00:25:09,720 and sort effectively by the right-hand column instead of the left. 505 00:25:09,720 --> 00:25:13,140 Now we'll see down the line, perhaps, more sophisticated ways of using this, 506 00:25:13,140 --> 00:25:15,990 but, for now, this just overrides the default behavior 507 00:25:15,990 --> 00:25:19,510 and sorts the dictionary, not by key, but by value instead. 508 00:25:19,510 --> 00:25:20,010 All right. 509 00:25:20,010 --> 00:25:24,120 So now watch this if I run Python of favorites.py once more. 510 00:25:24,120 --> 00:25:28,140 Previously, it was in the order in which the languages appeared first 511 00:25:28,140 --> 00:25:31,930 in the CSV file, then it was sorted alphabetically. 512 00:25:31,930 --> 00:25:34,960 Now it should be sorted by value. 513 00:25:34,960 --> 00:25:35,850 And, indeed. 514 00:25:35,850 --> 00:25:39,180 Scratch is the least with 40, C is the next with 78, 515 00:25:39,180 --> 00:25:41,458 Python is the biggest with 280. 516 00:25:41,458 --> 00:25:43,500 That's not much of a top 10, or a top three list. 517 00:25:43,500 --> 00:25:44,820 Let's actually reverse it. 518 00:25:44,820 --> 00:25:46,800 And the easiest way in Python to do that is 519 00:25:46,800 --> 00:25:50,130 to pass a third argument into sorted, and you would know this 520 00:25:50,130 --> 00:25:51,750 by just reading the documentation. 521 00:25:51,750 --> 00:25:56,370 You can literally say, reverse equals True, capital T, 522 00:25:56,370 --> 00:26:00,480 and now if I rerun this one last time, Python of favorites.py, 523 00:26:00,480 --> 00:26:04,110 I'll see the same values but with the whole thing reversed in order. 524 00:26:04,110 --> 00:26:07,230 Long story short, even though this might feel like a slog, 525 00:26:07,230 --> 00:26:11,070 like adding this and looking up this, so much easier than in C 526 00:26:11,070 --> 00:26:13,980 where you would have had to figure out, how does bubble sort work? 527 00:26:13,980 --> 00:26:17,520 Let me implement bubble sort, selection sort, any of those sorting algorithms, 528 00:26:17,520 --> 00:26:18,660 or use some other library. 529 00:26:18,660 --> 00:26:21,430 In Python, you just get a lot more for free, so to speak. 530 00:26:21,430 --> 00:26:24,390 It's just built in once you get comfy with the documentation. 531 00:26:24,390 --> 00:26:29,370 And, to be clear, this is an argument, as is this, as is this. 532 00:26:29,370 --> 00:26:33,540 But in Python, we have not only positional arguments, which 533 00:26:33,540 --> 00:26:36,720 are based on what position they are in, left to right, just like C, 534 00:26:36,720 --> 00:26:39,900 you also have these named parameters whereby 535 00:26:39,900 --> 00:26:42,720 they have explicit names that you can use yourself, 536 00:26:42,720 --> 00:26:45,930 to make clear that you're using this one but not this other one. 537 00:26:45,930 --> 00:26:50,980 More parameters in Python can be optional than in C. Phew. 538 00:26:50,980 --> 00:26:51,480 All right. 539 00:26:51,480 --> 00:26:54,727 Any questions about that technique yet? 540 00:26:54,727 --> 00:26:57,810 And if you're feeling like this is starting to take the fun out of Python, 541 00:26:57,810 --> 00:27:01,930 that's actually kind of the point of doing this the hard way. 542 00:27:01,930 --> 00:27:02,430 All right. 543 00:27:02,430 --> 00:27:05,110 Well, let's do it one other way that's marginally better. 544 00:27:05,110 --> 00:27:09,833 It turns out, in Python there really is this rich ecosystem of libraries, 545 00:27:09,833 --> 00:27:12,750 the code that comes with the language itself, or, even, third parties. 546 00:27:12,750 --> 00:27:17,040 And coming with the language is another module 547 00:27:17,040 --> 00:27:19,920 called the collections module, or package here, 548 00:27:19,920 --> 00:27:25,350 whereby if I use from collections, I can import something called Counter, 549 00:27:25,350 --> 00:27:28,140 capital C. And it turns out, if this felt 550 00:27:28,140 --> 00:27:30,450 a little bit painful to create a dictionary yourself, 551 00:27:30,450 --> 00:27:33,470 initialize it, maybe, to zero or one, like this. 552 00:27:33,470 --> 00:27:36,547 Turns out, you have the same problem that people before you have had, 553 00:27:36,547 --> 00:27:38,130 and so there's another way to do this. 554 00:27:38,130 --> 00:27:42,150 You can create a variable called counts, set it equal to Counter, capital 555 00:27:42,150 --> 00:27:43,770 C, open paren, close paren. 556 00:27:43,770 --> 00:27:46,140 And this is a different type of object. 557 00:27:46,140 --> 00:27:48,300 It's a different type of object in Python, that 558 00:27:48,300 --> 00:27:50,615 has counting capabilities built in. 559 00:27:50,615 --> 00:27:53,490 And so if I actually want to use this counter instead, I can do this. 560 00:27:53,490 --> 00:27:57,930 For each row in the reader, let's go ahead and grab the favorite language 561 00:27:57,930 --> 00:28:00,540 from that row, just like before. 562 00:28:00,540 --> 00:28:04,530 And without doing any of that headache of like, if, elif, or any of this, 563 00:28:04,530 --> 00:28:09,060 you can literally just index into that counter using favorite, 564 00:28:09,060 --> 00:28:14,340 quote unquote, "Scratch," or "C," or "Python," and increment it by one. 565 00:28:14,340 --> 00:28:17,680 What the Counter class is going to do for you, so to speak-- 566 00:28:17,680 --> 00:28:20,190 Another example of object-oriented programming, and counts 567 00:28:20,190 --> 00:28:21,435 is now an object thereof. 568 00:28:21,435 --> 00:28:25,107 What this whole feature of Counter is going to do for you is it's 569 00:28:25,107 --> 00:28:26,940 going to automatically initialize everything 570 00:28:26,940 --> 00:28:28,710 to zero, even if you've never seen it before, 571 00:28:28,710 --> 00:28:30,940 and then you can just blindly start incrementing it. 572 00:28:30,940 --> 00:28:33,570 So, in short, there's just more pleasant ways, sometimes, 573 00:28:33,570 --> 00:28:36,280 to do something in Python as well. 574 00:28:36,280 --> 00:28:36,780 All right. 575 00:28:36,780 --> 00:28:42,750 How about, lastly, let's make things, maybe-- 576 00:28:42,750 --> 00:28:44,340 Oh, actually, let's do this. 577 00:28:44,340 --> 00:28:47,918 We can even simplify the sorting here. 578 00:28:47,918 --> 00:28:49,710 Let me actually take this one step further. 579 00:28:49,710 --> 00:28:52,950 Instead of manually figuring out how to sort this, I'm going to do this. 580 00:28:52,950 --> 00:29:02,650 For each favorite, and the count thereof in the counts variable's most 581 00:29:02,650 --> 00:29:07,820 common function's return value, go ahead and print out this as well. 582 00:29:07,820 --> 00:29:09,790 So in short, again, a bit new syntax. 583 00:29:09,790 --> 00:29:11,210 But what's going on here? 584 00:29:11,210 --> 00:29:14,410 Well, it turns out that this counts class and-- 585 00:29:14,410 --> 00:29:17,478 sorry, this Counter class, and, in turn, this counts variable, 586 00:29:17,478 --> 00:29:20,770 comes with a function built in that you would only know from the documentation. 587 00:29:20,770 --> 00:29:23,260 It's literally called most underscore common, 588 00:29:23,260 --> 00:29:31,700 and what it returns to you when you call it is a pair of key value, key value. 589 00:29:31,700 --> 00:29:34,480 And so this, too, is a trick in Python that we did not have in C. 590 00:29:34,480 --> 00:29:37,180 If you want to iterate over something, but grab 591 00:29:37,180 --> 00:29:40,120 two variables at a time on each iteration like this, 592 00:29:40,120 --> 00:29:42,880 you separate them by commas and can get favorite count, 593 00:29:42,880 --> 00:29:44,840 favorite count, favorite count. 594 00:29:44,840 --> 00:29:50,260 So if I run this now, Python of favorites.py, this, too, just works. 595 00:29:50,260 --> 00:29:53,440 And it's getting a little simpler, a little tighter than before 596 00:29:53,440 --> 00:29:57,740 than if we had actually done it all manually. 597 00:29:57,740 --> 00:30:01,310 Lastly, here is a code that's the shortest version thereof. 598 00:30:01,310 --> 00:30:03,650 We're down to like 14 or 15 lines. 599 00:30:03,650 --> 00:30:08,480 If I wanted to change this to analyze the most popular problem thus 600 00:30:08,480 --> 00:30:12,620 far in the class, how do I go about changing the code 601 00:30:12,620 --> 00:30:18,260 to print out, top to bottom, the most popular problem or problems? 602 00:30:18,260 --> 00:30:19,280 What line should change? 603 00:30:19,280 --> 00:30:20,570 Yeah. 604 00:30:20,570 --> 00:30:22,070 So, yeah. 605 00:30:22,070 --> 00:30:22,610 Line 10. 606 00:30:22,610 --> 00:30:25,310 Because I've written this in kind of a general purpose way 607 00:30:25,310 --> 00:30:27,282 and using dictionaries with keys, it suffices 608 00:30:27,282 --> 00:30:29,240 to change language to, quote unquote, "problem" 609 00:30:29,240 --> 00:30:31,325 because that was the third column from the CSV. 610 00:30:31,325 --> 00:30:33,200 And so now, if you're curious, let's actually 611 00:30:33,200 --> 00:30:35,240 make my terminal window a bit bigger. 612 00:30:35,240 --> 00:30:37,820 Python of favorites.py, enter. 613 00:30:37,820 --> 00:30:38,750 And, OK. 614 00:30:38,750 --> 00:30:40,910 Tragically, we peaked early with Hello, World-- 615 00:30:40,910 --> 00:30:45,060 is the most popular problem thus far, followed by Filter, then Scratch. 616 00:30:45,060 --> 00:30:45,560 OK. 617 00:30:45,560 --> 00:30:46,700 Peaked even earlier. 618 00:30:46,700 --> 00:30:51,660 Mario, DNA, and so forth, and a bunch of others thereafter. 619 00:30:51,660 --> 00:30:55,477 So based on this sample size, here's the ranking of the problems thus far. 620 00:30:55,477 --> 00:30:56,060 So, we got it. 621 00:30:56,060 --> 00:30:59,570 More Hello, World problems in the weeks to come. 622 00:30:59,570 --> 00:31:01,160 All right. 623 00:31:01,160 --> 00:31:04,520 Now that we've done that in that way, let's 624 00:31:04,520 --> 00:31:07,070 just make this program slightly interactive 625 00:31:07,070 --> 00:31:11,070 and see how we can really take a fundamentally different approach. 626 00:31:11,070 --> 00:31:12,620 I'm going to go into VS Code. 627 00:31:12,620 --> 00:31:15,798 I'm going to keep everything the same, except that, at the bottom, 628 00:31:15,798 --> 00:31:17,840 I'm going to get rid of this loop because I don't 629 00:31:17,840 --> 00:31:20,120 want any more print out everything. 630 00:31:20,120 --> 00:31:22,040 I want to look up specific counts. 631 00:31:22,040 --> 00:31:25,060 Like, how popular was this problem, how popular was this other problem? 632 00:31:25,060 --> 00:31:28,310 And what I'm going to go ahead and do is to create a variable called favorite, 633 00:31:28,310 --> 00:31:30,180 set it equal to-- 634 00:31:30,180 --> 00:31:32,060 I could use get string in the CS50 library, 635 00:31:32,060 --> 00:31:34,790 but we saw last week there's no need to for strings, certainly. 636 00:31:34,790 --> 00:31:37,340 Let me just use the input function that comes with Python, 637 00:31:37,340 --> 00:31:39,890 and prompt the human for their favorite problem. 638 00:31:39,890 --> 00:31:45,020 And then let me go ahead and print out, for instance, an f string containing 639 00:31:45,020 --> 00:31:51,920 whatever their favorite is, colon, and whatever the count is thereof 640 00:31:51,920 --> 00:31:54,290 of that favorite, close quote. 641 00:31:54,290 --> 00:31:56,400 So let me open my terminal window. 642 00:31:56,400 --> 00:31:59,300 Let me run Python of favorites.py, enter. 643 00:31:59,300 --> 00:32:04,640 And if I type in Hello, World, looks like 65 people, indeed, like that one. 644 00:32:04,640 --> 00:32:06,860 If I run Python of favorites.py again. 645 00:32:06,860 --> 00:32:09,350 I type in Scratch, now we see that one. 646 00:32:09,350 --> 00:32:12,570 If I type in anything else, I'm going to get its specific value. 647 00:32:12,570 --> 00:32:18,050 So this is to say, not only can we write Python code to analyze some data pretty 648 00:32:18,050 --> 00:32:20,720 tightly versus the manual code we wrote out earlier, 649 00:32:20,720 --> 00:32:23,238 you can also make these programs interactive as well. 650 00:32:23,238 --> 00:32:25,530 And this is going to be a super common paradigm, right, 651 00:32:25,530 --> 00:32:28,880 if you go into the world of consulting, analytics, data science, more 652 00:32:28,880 --> 00:32:29,540 generally. 653 00:32:29,540 --> 00:32:34,310 Among your roles is going to be to analyze data, to ask questions of data, 654 00:32:34,310 --> 00:32:37,610 get back the answer. be ask questions of data, get back the answer. 655 00:32:37,610 --> 00:32:40,190 Honestly, life gets pretty tedious, even though you've only 656 00:32:40,190 --> 00:32:43,670 been programming in Python, perhaps, for like one week, a week and a half now. 657 00:32:43,670 --> 00:32:46,880 When you have to write code to solve all of the world's problems-- 658 00:32:46,880 --> 00:32:49,250 and there's this sort of tenant in programming, 659 00:32:49,250 --> 00:32:53,270 that programmers tend to avoid writing code as much as they can 660 00:32:53,270 --> 00:32:55,190 because, ideally, you would solve problems 661 00:32:55,190 --> 00:32:58,640 with the right tool for the job, minimizing the number of lines of code 662 00:32:58,640 --> 00:32:59,880 you actually write. 663 00:32:59,880 --> 00:33:01,920 So how do we actually get to that point? 664 00:33:01,920 --> 00:33:06,740 Well, instead of just dealing with CSV files, pure text, it turns out 665 00:33:06,740 --> 00:33:09,200 there's an entire world of proper databases. 666 00:33:09,200 --> 00:33:12,080 Not flat file databases, where you store everything in text files, 667 00:33:12,080 --> 00:33:16,370 but a database program, a piece of software running on a computer, 668 00:33:16,370 --> 00:33:18,922 running on a server, that's always listening for you. 669 00:33:18,922 --> 00:33:22,130 It's got a lot of memory, it's got a lot of space, and in turn a lot of data, 670 00:33:22,130 --> 00:33:25,460 and it supports a database specific language 671 00:33:25,460 --> 00:33:31,020 that makes it much easier, much faster to ask questions of the very same data. 672 00:33:31,020 --> 00:33:33,290 It's a relational database in the sense, too, 673 00:33:33,290 --> 00:33:35,480 that it's not even necessarily one spreadsheet, 674 00:33:35,480 --> 00:33:37,430 one set of rows and columns. 675 00:33:37,430 --> 00:33:41,960 You can have two sheets, three sheets, 30 sheets across which there might very 676 00:33:41,960 --> 00:33:44,840 well be relationships, or relations. 677 00:33:44,840 --> 00:33:48,830 So S-Q-L, or SQL, is a database specific language, 678 00:33:48,830 --> 00:33:52,625 stands for Structured Query Language, that's a declarative language whereby 679 00:33:52,625 --> 00:33:54,500 you're not going to be in the habit with SQL, 680 00:33:54,500 --> 00:33:58,870 typically, of writing loops and conditionals, and this kind of thing. 681 00:33:58,870 --> 00:34:02,260 You're instead going to describe the data that you want to get back, 682 00:34:02,260 --> 00:34:05,680 you're going to describe the question that you want the answer to, 683 00:34:05,680 --> 00:34:08,650 and we'll do this using a relatively small grammar. 684 00:34:08,650 --> 00:34:11,620 That is to say, there's not that many keywords in SQL. 685 00:34:11,620 --> 00:34:12,850 It's a pretty small language. 686 00:34:12,850 --> 00:34:16,120 But it's going to allow us to eliminate dozens of lines of Python code, 687 00:34:16,120 --> 00:34:16,900 perhaps. 688 00:34:16,900 --> 00:34:19,429 SQL follows this CRUD paradigm. 689 00:34:19,429 --> 00:34:23,230 So C-R-U-D, which simply means that in a relational database, 690 00:34:23,230 --> 00:34:25,030 you can really only do four things. 691 00:34:25,030 --> 00:34:30,100 You can create data, read data-- that is, look at it or analyze it somehow. 692 00:34:30,100 --> 00:34:32,540 Update the data, or delete the data. 693 00:34:32,540 --> 00:34:33,770 So, CRUD, for short. 694 00:34:33,770 --> 00:34:36,673 And that really speaks to just how relatively simple 695 00:34:36,673 --> 00:34:39,340 the world is, even though we'll just scratch the surface of some 696 00:34:39,340 --> 00:34:40,423 of its capabilities today. 697 00:34:40,423 --> 00:34:42,560 And you'll explore more over time. 698 00:34:42,560 --> 00:34:45,340 Specifically, in SQL, there's going to be other keywords that 699 00:34:45,340 --> 00:34:46,719 map to those four ideas. 700 00:34:46,719 --> 00:34:49,480 Technically, you don't just create data in the world of SQL, 701 00:34:49,480 --> 00:34:53,270 you can also insert data, like inserting more rows into a sheet. 702 00:34:53,270 --> 00:34:55,330 And it's not the word "read" that people use. 703 00:34:55,330 --> 00:34:56,739 People say to "select" data. 704 00:34:56,739 --> 00:35:00,170 But they mean to read data, which is sort of the opposite of writing 705 00:35:00,170 --> 00:35:01,220 or creating data. 706 00:35:01,220 --> 00:35:04,070 But the U and the D are the same, except that there's also 707 00:35:04,070 --> 00:35:07,580 a keyword in SQL known as DROP, which lets you very destructively, 708 00:35:07,580 --> 00:35:11,610 very dangerously delete entire database tables, as well. 709 00:35:11,610 --> 00:35:15,500 So how do we do this, and what's the connection to our favorites data 710 00:35:15,500 --> 00:35:16,190 thus far? 711 00:35:16,190 --> 00:35:18,920 Well, here is the syntax in this language called 712 00:35:18,920 --> 00:35:22,880 SQL via which you can create a table. 713 00:35:22,880 --> 00:35:26,510 So the jargon is a little different, but the ideas are exactly the same 714 00:35:26,510 --> 00:35:27,840 from the world of spreadsheets. 715 00:35:27,840 --> 00:35:33,720 What you call a sheet in a spreadsheet, the database world calls a table. 716 00:35:33,720 --> 00:35:37,320 It's a table of rows and columns, but it's the exact same idea. 717 00:35:37,320 --> 00:35:40,320 You're going to have discretion over what to call the table, 718 00:35:40,320 --> 00:35:42,710 just like you can call a spreadsheet something, or else, 719 00:35:42,710 --> 00:35:44,790 and you can also specify the types of data 720 00:35:44,790 --> 00:35:46,790 that you want to store in your rows and columns. 721 00:35:46,790 --> 00:35:49,520 And it's going to go a little more deeply than just formatting it, 722 00:35:49,520 --> 00:35:51,687 like in Excel, and Numbers, and Google Spreadsheets, 723 00:35:51,687 --> 00:35:55,250 you can actually control, maybe, how big the data could be depending 724 00:35:55,250 --> 00:35:57,140 on the database you're actually using. 725 00:35:57,140 --> 00:35:59,990 In CS50, we're going to use a light version of SQL. 726 00:35:59,990 --> 00:36:01,820 Literally, a language called-- 727 00:36:01,820 --> 00:36:04,970 an implementation of SQL called SQLite, which has really 728 00:36:04,970 --> 00:36:08,360 all of the core functionality that you would see in the real world, 729 00:36:08,360 --> 00:36:10,702 and with larger, more scalable systems, but it's 730 00:36:10,702 --> 00:36:13,160 going to allow us to focus on a lot of the building blocks. 731 00:36:13,160 --> 00:36:17,010 And SQLite's actually really popular on Macs, PCs, and phones, nowadays. 732 00:36:17,010 --> 00:36:20,450 A lot of the data that games and other applications on your phone might store, 733 00:36:20,450 --> 00:36:24,470 actually have a file, a binary file with zeros and ones, 734 00:36:24,470 --> 00:36:26,393 that's in the SQLite format. 735 00:36:26,393 --> 00:36:29,060 So if you do a mobile app, for instance, for your final project, 736 00:36:29,060 --> 00:36:32,220 you'll have an opportunity to play with something like this. 737 00:36:32,220 --> 00:36:34,760 Well, how do you actually run SQLite3? 738 00:36:34,760 --> 00:36:37,500 It's just a command built into your code space. 739 00:36:37,500 --> 00:36:40,400 So this is a program you could install on your own Mac, your own PC, 740 00:36:40,400 --> 00:36:40,910 or the like. 741 00:36:40,910 --> 00:36:43,280 We'll do everything as we've done before, in the cloud, 742 00:36:43,280 --> 00:36:44,780 and actually use your code space. 743 00:36:44,780 --> 00:36:48,170 And by that I mean, we can just start to play with this data now 744 00:36:48,170 --> 00:36:51,890 using SQL instead of Python. 745 00:36:51,890 --> 00:36:52,710 So let me do this. 746 00:36:52,710 --> 00:36:55,190 Let me open up my terminal window here, and let 747 00:36:55,190 --> 00:36:58,010 me go ahead and maximize my terminal window just because we'll 748 00:36:58,010 --> 00:37:00,140 focus now on the files here. 749 00:37:00,140 --> 00:37:03,530 Recall that I have a file called favorites.csv, 750 00:37:03,530 --> 00:37:05,510 and that CSV file is just text. 751 00:37:05,510 --> 00:37:08,300 But let me load it into a proper database 752 00:37:08,300 --> 00:37:11,510 so I can actually use this other language called SQL on it. 753 00:37:11,510 --> 00:37:14,210 To do this, I'm going to run SQLite3, which 754 00:37:14,210 --> 00:37:16,070 just means the third version of it, and I'm 755 00:37:16,070 --> 00:37:18,950 going to create a new database called favorites.db. 756 00:37:18,950 --> 00:37:21,380 That's just a convention, but it means here comes 757 00:37:21,380 --> 00:37:23,030 a database that I'm going to create. 758 00:37:23,030 --> 00:37:25,370 Notice, I'm not using the tabbed code editor. 759 00:37:25,370 --> 00:37:28,460 I'm not using the code command because the code command is generally 760 00:37:28,460 --> 00:37:29,600 for text files. 761 00:37:29,600 --> 00:37:33,440 SQLite3 is going to create a binary file, zeros and ones, ultimately. 762 00:37:33,440 --> 00:37:35,750 When I run that, it's going to ask me to verify yes. 763 00:37:35,750 --> 00:37:37,400 I'm going to hit y and then enter. 764 00:37:37,400 --> 00:37:40,940 And now I'm at the SQLite prompt, which is not the dollar sign. 765 00:37:40,940 --> 00:37:43,730 It literally says SQLite with an angled bracket. 766 00:37:43,730 --> 00:37:49,220 Now, one time only, I want to go ahead and load favorites.csv 767 00:37:49,220 --> 00:37:53,150 into this database so I can actually play around with it using not Python, 768 00:37:53,150 --> 00:37:54,982 but this new language called SQL. 769 00:37:54,982 --> 00:37:56,940 And the way I'm going to do this is as follows. 770 00:37:56,940 --> 00:37:57,773 I'm going to do dot. 771 00:37:57,773 --> 00:38:00,130 Mode csv, enter. 772 00:38:00,130 --> 00:38:03,430 And that just puts SQLite into CSV mode. 773 00:38:03,430 --> 00:38:05,650 It has different modes for different file formats. 774 00:38:05,650 --> 00:38:09,700 I'm going to .import and then I'm going to specify the file that I want 775 00:38:09,700 --> 00:38:11,800 to import, which is favorites.csv. 776 00:38:11,800 --> 00:38:13,030 And then this one's up to me. 777 00:38:13,030 --> 00:38:15,580 What is the name of the table I want to create? 778 00:38:15,580 --> 00:38:18,012 And table, again, is essentially synonymous with sheets. 779 00:38:18,012 --> 00:38:19,720 So I'm going to call everything the same. 780 00:38:19,720 --> 00:38:22,700 I'm going to call my table favorites as well. 781 00:38:22,700 --> 00:38:25,420 So what this command is essentially going to do, 782 00:38:25,420 --> 00:38:29,890 is all of those lines of Python code that open the file, read it row by row, 783 00:38:29,890 --> 00:38:31,028 and do something with it-- 784 00:38:31,028 --> 00:38:32,320 This is just built into SQLite. 785 00:38:32,320 --> 00:38:37,720 It's going to load the whole darn CSV into this new favorites.db file, 786 00:38:37,720 --> 00:38:38,980 and then that's it for now. 787 00:38:38,980 --> 00:38:42,220 I'm going to go ahead and literally type .quit to get out of SQLite. 788 00:38:42,220 --> 00:38:43,810 I'm back at my dollar sign prompt. 789 00:38:43,810 --> 00:38:47,500 If I type ls, I have not only favorites.csv, 790 00:38:47,500 --> 00:38:54,790 I also have favorites.db now as well, a brand new file, and in that file now is 791 00:38:54,790 --> 00:38:57,560 an optimized version of the CSV file. 792 00:38:57,560 --> 00:39:00,860 In that DB file now is a version of the data that's 793 00:39:00,860 --> 00:39:04,430 going to lend itself to CRUD operations, creating, reading, update, 794 00:39:04,430 --> 00:39:08,810 and deleting, using this new language called SQL. 795 00:39:08,810 --> 00:39:10,478 All right, so how do I get into this? 796 00:39:10,478 --> 00:39:13,520 Well, let me clear my terminal window and pretend that I'm doing this now 797 00:39:13,520 --> 00:39:14,062 the next day. 798 00:39:14,062 --> 00:39:15,478 I've already created the database. 799 00:39:15,478 --> 00:39:16,670 That's a one time operation. 800 00:39:16,670 --> 00:39:18,545 Once you've got the data, now I'm going to go 801 00:39:18,545 --> 00:39:24,080 ahead and again run SQLite3 favorites.db just to open the file again. 802 00:39:24,080 --> 00:39:27,260 But it's already now-- all of the data is in there. 803 00:39:27,260 --> 00:39:32,420 Just as a teaser, let me go ahead and do this. 804 00:39:32,420 --> 00:39:38,580 .schema is a SQLite command that just shows me the schema of this database 805 00:39:38,580 --> 00:39:39,080 table. 806 00:39:39,080 --> 00:39:41,130 And we'll see more about this in a little bit, 807 00:39:41,130 --> 00:39:44,630 but for now this is showing me, essentially, 808 00:39:44,630 --> 00:39:48,170 the SQL command that was automatically run 809 00:39:48,170 --> 00:39:51,650 when I imported this database the first time around. 810 00:39:51,650 --> 00:39:54,680 And, for now, just notice that it mentions timestamp, 811 00:39:54,680 --> 00:39:57,270 it mentions language, it mentions problem. 812 00:39:57,270 --> 00:39:59,790 Very loosely, it calls each of those texts. 813 00:39:59,790 --> 00:40:02,370 So we're not trying very hard to distinguish one type of data 814 00:40:02,370 --> 00:40:02,912 from another. 815 00:40:02,912 --> 00:40:03,690 It's all text. 816 00:40:03,690 --> 00:40:06,790 But notice, create table If not exists favorites. 817 00:40:06,790 --> 00:40:09,420 This is essentially the create table syntax 818 00:40:09,420 --> 00:40:11,490 that I alluded to earlier via which you can 819 00:40:11,490 --> 00:40:15,510 create a table in a SQLite database. 820 00:40:15,510 --> 00:40:17,280 But more on that in just a bit. 821 00:40:17,280 --> 00:40:22,500 Here now is how we can actually get at the data in that database. 822 00:40:22,500 --> 00:40:28,380 It turns out that we can select one or more columns from a database 823 00:40:28,380 --> 00:40:30,180 table using syntax like this. 824 00:40:30,180 --> 00:40:33,870 Literally, the keyword select, then the name of one or more columns 825 00:40:33,870 --> 00:40:38,550 that are in that database, and then from the specific table that you care about. 826 00:40:38,550 --> 00:40:41,760 And notice that in capital letters here are all of the SQL 827 00:40:41,760 --> 00:40:44,760 specific keywords, select, and from, in particular. 828 00:40:44,760 --> 00:40:47,940 And in lowercase, by convention, here are the placeholders 829 00:40:47,940 --> 00:40:51,270 for the columns that you, or I, have created, and the tables that you, or I, 830 00:40:51,270 --> 00:40:52,210 have created. 831 00:40:52,210 --> 00:40:53,820 So if I go back to SQLite here. 832 00:40:53,820 --> 00:40:57,090 Let me just clear with Control L, which will just freshen up the screen here 833 00:40:57,090 --> 00:40:58,770 so we can focus on what's new. 834 00:40:58,770 --> 00:41:03,600 If I want to select everything from the table called favorites, 835 00:41:03,600 --> 00:41:04,890 here's what I can do. 836 00:41:04,890 --> 00:41:09,570 Select star from favorites semicolon. 837 00:41:09,570 --> 00:41:12,660 And, do forgive me, semicolons are back for SQL, in this case. 838 00:41:12,660 --> 00:41:16,680 But select star from favorites uses a syntax you might not be familiar with. 839 00:41:16,680 --> 00:41:18,660 Star here has nothing to do with pointers. 840 00:41:18,660 --> 00:41:20,100 Star is a wild card. 841 00:41:20,100 --> 00:41:22,290 It means give me everything, no matter what it's 842 00:41:22,290 --> 00:41:24,480 called, from this particular table. 843 00:41:24,480 --> 00:41:27,030 When I hit enter, what we're going to see 844 00:41:27,030 --> 00:41:30,810 is the entire contents of the favorite table 845 00:41:30,810 --> 00:41:34,480 that's the result of having imported that CSV into this database. 846 00:41:34,480 --> 00:41:37,380 So when I hit enter, there is all of that data. 847 00:41:37,380 --> 00:41:40,110 And SQLite, just to be friendly, it's using 848 00:41:40,110 --> 00:41:44,220 what we might call ASCII art, just very simple text, like hyphens, 849 00:41:44,220 --> 00:41:47,310 and vertical bars, and pluses on the corner, to make it look pretty 850 00:41:47,310 --> 00:41:49,690 and make it look like it is a proper table. 851 00:41:49,690 --> 00:41:52,620 But what you're really seeing is the contents of favorites.db, 852 00:41:52,620 --> 00:41:54,750 specifically in that table. 853 00:41:54,750 --> 00:41:57,130 Specifically, if I only care about languages, 854 00:41:57,130 --> 00:41:59,130 let me try something more specific than star. 855 00:41:59,130 --> 00:42:03,270 Select language from favorites semicolon. 856 00:42:03,270 --> 00:42:05,640 This is going to give me just a single column now, 857 00:42:05,640 --> 00:42:09,150 of all of the favorites that you selected for language specifically. 858 00:42:09,150 --> 00:42:12,730 This is a little overwhelming to see all 399 or so pieces of data, 859 00:42:12,730 --> 00:42:14,680 so let me actually truncate it a little bit. 860 00:42:14,680 --> 00:42:20,333 Let me do select language from favorites limit 10. 861 00:42:20,333 --> 00:42:22,500 So we're about to see that there's little tricks you 862 00:42:22,500 --> 00:42:26,430 can use to tweak the behavior of the language 863 00:42:26,430 --> 00:42:28,500 in order to get back more or less data. 864 00:42:28,500 --> 00:42:31,110 In fact, it turns out there's a bunch of keywords 865 00:42:31,110 --> 00:42:35,880 like these built into SQL, much like Google Spreadsheets, Apple Numbers, 866 00:42:35,880 --> 00:42:38,293 Microsoft Excel, and certainly Python. 867 00:42:38,293 --> 00:42:41,460 There's a lot of functionality that you just get for free with the language. 868 00:42:41,460 --> 00:42:44,610 If you want to calculate an average, count the number of things in a file, 869 00:42:44,610 --> 00:42:48,090 get the unique or distinct values, force everything to lowercase, 870 00:42:48,090 --> 00:42:50,380 force everything to uppercase, get the maximum value, 871 00:42:50,380 --> 00:42:53,505 minimum value-- much like spreadsheets, if you're familiar with that world, 872 00:42:53,505 --> 00:42:56,880 you get all of that functionality in SQL but also more. 873 00:42:56,880 --> 00:42:59,680 So, for instance, if I go back to my terminal window here. 874 00:42:59,680 --> 00:43:02,490 Let me go ahead and select the total number of favorites 875 00:43:02,490 --> 00:43:06,150 in this table, the total number of rows that you all inputted. 876 00:43:06,150 --> 00:43:10,320 So I could do select star from favorites semicolon, 877 00:43:10,320 --> 00:43:12,450 and then I could literally start counting these. 878 00:43:12,450 --> 00:43:15,330 Like 1, 2, 3, 4-- there's clearly a better way. 879 00:43:15,330 --> 00:43:19,480 And, indeed, on our list of functions was a count function. 880 00:43:19,480 --> 00:43:22,200 And so the way I can use that in SQL is like this. 881 00:43:22,200 --> 00:43:25,230 Select count of star-- 882 00:43:25,230 --> 00:43:28,113 so pass star in as an argument to the count function. 883 00:43:28,113 --> 00:43:30,780 You don't care what columns you're counting just count them all. 884 00:43:30,780 --> 00:43:33,440 From favorites semicolon. 885 00:43:33,440 --> 00:43:35,190 And now, you're actually going to get back 886 00:43:35,190 --> 00:43:40,050 like a little baby table that has just one row, one column inside of which-- 887 00:43:40,050 --> 00:43:43,410 one cell of which has the total actual count. 888 00:43:43,410 --> 00:43:47,610 And it's 398 because 399, recall, included the actual header 889 00:43:47,610 --> 00:43:49,480 row from the file. 890 00:43:49,480 --> 00:43:49,980 All right. 891 00:43:49,980 --> 00:43:52,260 So suppose you want to-- 892 00:43:52,260 --> 00:43:54,480 actually, note that this is the exact same thing 893 00:43:54,480 --> 00:43:58,840 as counting a specific column because every row 894 00:43:58,840 --> 00:44:00,670 has the same number of columns, three. 895 00:44:00,670 --> 00:44:03,790 We could just say select the count of languages, 896 00:44:03,790 --> 00:44:07,720 or select the count of problems. 897 00:44:07,720 --> 00:44:10,023 All of those are going to give me back the same answer. 898 00:44:10,023 --> 00:44:12,190 It is, therefore, conventional in SQL if you're just 899 00:44:12,190 --> 00:44:14,080 trying to count the number of rows, don't even 900 00:44:14,080 --> 00:44:15,413 worry about what they're called. 901 00:44:15,413 --> 00:44:18,722 Just do count star to get back everything more simply. 902 00:44:18,722 --> 00:44:21,430 All right, but what if we want to get back the distinct languages 903 00:44:21,430 --> 00:44:23,972 and we didn't know a priori that this came from a Google form 904 00:44:23,972 --> 00:44:25,107 with three radio buttons? 905 00:44:25,107 --> 00:44:26,690 Well, we could do something like this. 906 00:44:26,690 --> 00:44:33,130 We could select the distinct languages from the favorites table, enter. 907 00:44:33,130 --> 00:44:35,950 And that gives me Python, Scratch, C because distinct 908 00:44:35,950 --> 00:44:38,200 is one of the other functions that comes with SQL. 909 00:44:38,200 --> 00:44:40,990 This is, obviously, very easily countable with my human eyes, 910 00:44:40,990 --> 00:44:43,210 but if I wanted to do this more dynamically, 911 00:44:43,210 --> 00:44:47,530 I could change this to be count the distinct languages. 912 00:44:47,530 --> 00:44:50,200 And just like in C, just like in Python, just like in Scratch, 913 00:44:50,200 --> 00:44:52,987 I can nest these functions and pass the output of one 914 00:44:52,987 --> 00:44:54,070 into the input of another. 915 00:44:54,070 --> 00:44:59,550 If I hit enter now, I now get three in this case here. 916 00:44:59,550 --> 00:45:00,050 OK. 917 00:45:00,050 --> 00:45:04,580 Let me pause to see if there's any questions or confusion just yet. 918 00:45:04,580 --> 00:45:05,120 Yeah. 919 00:45:05,120 --> 00:45:07,000 [INDISTINCT SPEECH] 920 00:45:07,000 --> 00:45:07,770 Does SQLite-- 921 00:45:07,770 --> 00:45:09,910 [INDISTINCT SPEECH] 922 00:45:09,910 --> 00:45:11,980 SQLite3 is a program. 923 00:45:11,980 --> 00:45:16,210 And it's an implementation of the SQLite language, which 924 00:45:16,210 --> 00:45:18,820 itself is a lightweight version of what the world known 925 00:45:18,820 --> 00:45:21,890 as SQL, which is a very convoluted way of saying 926 00:45:21,890 --> 00:45:23,390 there's lots of humans in the world. 927 00:45:23,390 --> 00:45:25,420 Not everyone agrees what SQL should be. 928 00:45:25,420 --> 00:45:28,720 Microsoft might disagree with Oracle, might disagree with other companies, 929 00:45:28,720 --> 00:45:29,390 as well. 930 00:45:29,390 --> 00:45:32,500 So there's a common subset of SQL in the world 931 00:45:32,500 --> 00:45:35,200 that almost everyone knows, and learns, and uses, 932 00:45:35,200 --> 00:45:37,900 but there are also some vendor specific features. 933 00:45:37,900 --> 00:45:40,790 SQLite tries to distill things really into the essence, 934 00:45:40,790 --> 00:45:44,430 and so that's what you increasingly see on Android, on iOS, on Macs, and PCs, 935 00:45:44,430 --> 00:45:44,930 as well. 936 00:45:44,930 --> 00:45:47,650 So we use it because it's relatively canonical. 937 00:45:47,650 --> 00:45:49,330 Good question. 938 00:45:49,330 --> 00:45:49,830 All right. 939 00:45:49,830 --> 00:45:53,550 So let's do a few other things by introducing a few other keywords 940 00:45:53,550 --> 00:45:55,440 without trying all of these right now. 941 00:45:55,440 --> 00:45:59,730 Here in this list is a bunch of new keywords 942 00:45:59,730 --> 00:46:01,740 that are going to give us even finer control. 943 00:46:01,740 --> 00:46:04,920 And we saw limit already, and that just limits the output. 944 00:46:04,920 --> 00:46:08,010 But you can also have what are called predicates. 945 00:46:08,010 --> 00:46:11,280 You can literally use the keyword where to start filtering the data, 946 00:46:11,280 --> 00:46:14,640 without using an if, and an elif, and an elif, and an elif, and so forth. 947 00:46:14,640 --> 00:46:17,640 You can just in one line express something conditionally, 948 00:46:17,640 --> 00:46:21,067 you can order the data, and you can even group similar data together. 949 00:46:21,067 --> 00:46:22,150 So what do I mean by this? 950 00:46:22,150 --> 00:46:24,622 Let me go back to VS Code here, and let me play around 951 00:46:24,622 --> 00:46:25,830 with a few different queries. 952 00:46:25,830 --> 00:46:32,490 Let me select, maybe, the count of rows from favorites, which previously 953 00:46:32,490 --> 00:46:38,250 was going to be 398 if I just get back all of the rows, but suppose I only 954 00:46:38,250 --> 00:46:41,190 want to know how many of you liked C. I can then say something 955 00:46:41,190 --> 00:46:46,500 like where the language in each row equals, quote unquote, "C," 956 00:46:46,500 --> 00:46:49,750 and the convention here is to use single quotes, though SQLite 957 00:46:49,750 --> 00:46:51,790 is tolerant of other formats as well. 958 00:46:51,790 --> 00:46:56,257 If I hit enter here, I'll see, indeed, as we saw with Python, the 78 number. 959 00:46:56,257 --> 00:46:57,340 That, honestly, took what? 960 00:46:57,340 --> 00:46:59,690 13, 14, 15 lines of code? 961 00:46:59,690 --> 00:47:04,990 Now I've distilled that kind of query into a single line of SQL code 962 00:47:04,990 --> 00:47:07,180 instead, by using this built in functionality. 963 00:47:07,180 --> 00:47:11,110 Suppose I really want to get specific, and how many of you really liked 964 00:47:11,110 --> 00:47:14,500 Hello, World in C as your favorite? 965 00:47:14,500 --> 00:47:16,510 Well, I could change this query. 966 00:47:16,510 --> 00:47:19,330 And just like your dollar sign prompt, your shell, 967 00:47:19,330 --> 00:47:22,660 you can go up and down in your history in SQLite to save keystrokes. 968 00:47:22,660 --> 00:47:24,520 You can use Boolean logic. 969 00:47:24,520 --> 00:47:27,790 And I can say language equals C AND, maybe, 970 00:47:27,790 --> 00:47:31,490 problem equals, quote unquote, "Hello, World," 971 00:47:31,490 --> 00:47:34,730 and the number of you that liked that problem was seven. 972 00:47:34,730 --> 00:47:37,900 So really, really early on likes Hello, World in C. 973 00:47:37,900 --> 00:47:39,910 Now notice a couple of key differences. 974 00:47:39,910 --> 00:47:43,900 One, I'm using AND, and not ampersand, ampersand like in C. I'm 975 00:47:43,900 --> 00:47:46,100 using single equal signs. 976 00:47:46,100 --> 00:47:50,030 So SQL behaves like Scratch does, which is not like Python or C. Why? 977 00:47:50,030 --> 00:47:52,780 Different people have implemented different languages differently. 978 00:47:52,780 --> 00:47:57,310 Equals, equals, equality in the world of SQL 979 00:47:57,310 --> 00:47:59,120 for comparing things left and right. 980 00:47:59,120 --> 00:47:59,620 All right. 981 00:47:59,620 --> 00:48:01,870 Things are now going to get a little more interesting, 982 00:48:01,870 --> 00:48:05,140 but the whole goal of all of that Python code 983 00:48:05,140 --> 00:48:09,650 was to analyze the ranking of languages and popularity thereof. 984 00:48:09,650 --> 00:48:12,400 Turns out in SQL, once you have the vocabulary, 985 00:48:12,400 --> 00:48:14,590 it's pretty easy to do something like that. 986 00:48:14,590 --> 00:48:15,560 I'm going to do this. 987 00:48:15,560 --> 00:48:19,120 I'm going to select all of the languages in the table, 988 00:48:19,120 --> 00:48:23,020 but I'm also going to select the count thereof. 989 00:48:23,020 --> 00:48:26,020 And then I'm going to do that from the favorites table, 990 00:48:26,020 --> 00:48:29,560 but I'm going to group by language because I 991 00:48:29,560 --> 00:48:33,370 claimed a moment ago that group by is another one of our key phrases in SQL 992 00:48:33,370 --> 00:48:35,120 that's going to let us group data. 993 00:48:35,120 --> 00:48:37,600 And what this effectively means is that if you've 994 00:48:37,600 --> 00:48:41,530 got this table with a lot of duplicate languages, again, and again, 995 00:48:41,530 --> 00:48:45,130 and again, you can group by that column, and, essentially, 996 00:48:45,130 --> 00:48:48,850 smush all of the Python rows together, all of the Scratch rows together, 997 00:48:48,850 --> 00:48:52,930 all of the C rows together, but figure out how many of those 998 00:48:52,930 --> 00:48:55,240 rows just got smushed together. 999 00:48:55,240 --> 00:48:57,910 Effectively, doing all of that dictionary legwork, 1000 00:48:57,910 --> 00:49:01,090 or the counter legwork, that I did in 13-- 1001 00:49:01,090 --> 00:49:02,990 15 lines of Python code. 1002 00:49:02,990 --> 00:49:06,700 So if I hit enter here, this now is the motivation 1003 00:49:06,700 --> 00:49:08,410 for what we're now starting to do. 1004 00:49:08,410 --> 00:49:11,860 I have distilled into a single line of code 1005 00:49:11,860 --> 00:49:14,140 in a language called SQL what, indeed, took me 1006 00:49:14,140 --> 00:49:18,280 more than a dozen lines of Python code just to get back an answer. 1007 00:49:18,280 --> 00:49:20,110 And I can do the same thing with problem. 1008 00:49:20,110 --> 00:49:23,230 I can just change language here, for instance, to problem instead. 1009 00:49:23,230 --> 00:49:26,300 But, per this list, I can not only group things, I can order them. 1010 00:49:26,300 --> 00:49:29,440 So if you actually want to get a top 10, or a top three list, well let's 1011 00:49:29,440 --> 00:49:30,880 just change this query slightly. 1012 00:49:30,880 --> 00:49:37,970 Before the semicolon, let me order by the count of those rows semicolon. 1013 00:49:37,970 --> 00:49:40,570 And now what I get is from smallest to largest. 1014 00:49:40,570 --> 00:49:42,850 40, 78, 280. 1015 00:49:42,850 --> 00:49:44,590 If you want to flip that, that's fine. 1016 00:49:44,590 --> 00:49:49,630 By default, order by uses ascending order, abbreviated A-S-C. 1017 00:49:49,630 --> 00:49:53,810 If you want to do descending order, D-E-S-C, you can do that as well. 1018 00:49:53,810 --> 00:49:57,680 And now we have a top three list, from largest to smallest. 1019 00:49:57,680 --> 00:50:01,360 Now, honestly, this is a bit of a mouthful to use count star over here, 1020 00:50:01,360 --> 00:50:02,920 count star over here. 1021 00:50:02,920 --> 00:50:06,580 There's a nicety in SQL, too, where you can create little aliases of sorts. 1022 00:50:06,580 --> 00:50:08,890 So if I use the same query again-- 1023 00:50:08,890 --> 00:50:10,850 let me scroll over to the left. 1024 00:50:10,850 --> 00:50:13,990 I can actually use the keyword as here, and I 1025 00:50:13,990 --> 00:50:18,370 can rename this weird looking column, count star, to anything I want. 1026 00:50:18,370 --> 00:50:19,930 I can rename it to n. 1027 00:50:19,930 --> 00:50:24,280 And then at the end of this query, I can order by n, essentially, 1028 00:50:24,280 --> 00:50:26,990 creating a synonym, if you will, for one versus the other. 1029 00:50:26,990 --> 00:50:30,310 So if I hit enter now, same exact thing, but my little baby table 1030 00:50:30,310 --> 00:50:33,310 that came back-- not a technical term-- has two columns, one of which 1031 00:50:33,310 --> 00:50:36,580 is more simply called n now instead of count star. 1032 00:50:36,580 --> 00:50:40,330 It just makes it minorly more convenient in your actual SQL code 1033 00:50:40,330 --> 00:50:44,110 to reference things that might actually be a little annoying to type. 1034 00:50:44,110 --> 00:50:47,200 Lastly, suppose we want to get a top one list, 1035 00:50:47,200 --> 00:50:49,000 and we just want the most popular language. 1036 00:50:49,000 --> 00:50:51,790 Honestly, I can just do limit one, enter. 1037 00:50:51,790 --> 00:50:55,720 That gives me just this tiny little table, a temporary table, really, 1038 00:50:55,720 --> 00:50:56,860 with one row. 1039 00:50:56,860 --> 00:51:00,460 And, honestly, if I don't even care about what the language is, 1040 00:51:00,460 --> 00:51:02,230 I can omit that entirely. 1041 00:51:02,230 --> 00:51:06,340 Just see how many people really like the most popular language. 1042 00:51:06,340 --> 00:51:07,660 280, in this case. 1043 00:51:07,660 --> 00:51:10,910 But, of course, it's more interesting to see what it actually is. 1044 00:51:10,910 --> 00:51:13,420 So, in short, just by turning these knobs syntactically, 1045 00:51:13,420 --> 00:51:16,900 it's relatively easy to start getting at more and more data. 1046 00:51:16,900 --> 00:51:19,270 And more answers there, too. 1047 00:51:19,270 --> 00:51:19,770 Phew. 1048 00:51:19,770 --> 00:51:24,270 Questions on this thus far? 1049 00:51:24,270 --> 00:51:25,200 Any questions? 1050 00:51:25,200 --> 00:51:26,040 No? 1051 00:51:26,040 --> 00:51:26,670 OK. 1052 00:51:26,670 --> 00:51:29,305 Well suppose that this week, for instance. 1053 00:51:29,305 --> 00:51:31,680 One of our new problems is going to be called Fiftyville, 1054 00:51:31,680 --> 00:51:33,960 and it's going to allow you to explore the world of SQL 1055 00:51:33,960 --> 00:51:35,793 in the context of a place called Fiftyville. 1056 00:51:35,793 --> 00:51:38,350 Suppose that suddenly becomes your favorite problem. 1057 00:51:38,350 --> 00:51:40,920 Well, how can we go about adding more data to a database? 1058 00:51:40,920 --> 00:51:43,720 Well, we've seen create table for creating the table, 1059 00:51:43,720 --> 00:51:47,230 we've seen select for selecting data there from. 1060 00:51:47,230 --> 00:51:50,340 Turns out there's also an insert into command 1061 00:51:50,340 --> 00:51:54,210 that you can use to insert new data into a table. 1062 00:51:54,210 --> 00:51:57,450 Now, I did this in bulk by just importing that whole CSV file, 1063 00:51:57,450 --> 00:52:00,370 and SQLite3 did it all for me automatically. 1064 00:52:00,370 --> 00:52:04,320 But in the real world, if you don't have a captive audience, every one of whom 1065 00:52:04,320 --> 00:52:06,540 is submitting the form at the same time-- 1066 00:52:06,540 --> 00:52:09,300 but maybe it's an application that's running 24/7, 1067 00:52:09,300 --> 00:52:13,210 you're going to get more and more data over time, just like Google itself. 1068 00:52:13,210 --> 00:52:16,290 So if you write code like this, you can insert one row 1069 00:52:16,290 --> 00:52:20,530 at a time, one row at a time, and actually change the data in your table. 1070 00:52:20,530 --> 00:52:24,820 So just as a check, let me do select star from favorites, enter. 1071 00:52:24,820 --> 00:52:26,500 Just to see all of the data. 1072 00:52:26,500 --> 00:52:31,030 And the last data we got was at 1:41 PM, and 21 seconds. 1073 00:52:31,030 --> 00:52:34,390 Suppose now I've decided I want to insert one new row. 1074 00:52:34,390 --> 00:52:35,560 I can do this. 1075 00:52:35,560 --> 00:52:39,550 Insert into favorites-- and then I have to specify 1076 00:52:39,550 --> 00:52:41,710 what columns do I want to insert into. 1077 00:52:41,710 --> 00:52:47,230 I'm going to insert a new language column, and a new problem column. 1078 00:52:47,230 --> 00:52:47,840 Timestamp? 1079 00:52:47,840 --> 00:52:48,340 I could. 1080 00:52:48,340 --> 00:52:51,590 I don't really want to look up the time, so I'm going to leave that one blank. 1081 00:52:51,590 --> 00:52:54,520 And I'm going to put in values as follows for this. 1082 00:52:54,520 --> 00:52:57,850 SQL for the language, which wasn't even an option on the form earlier, 1083 00:52:57,850 --> 00:53:02,830 and Fiftyville for the name of the problem, semicolon. 1084 00:53:02,830 --> 00:53:04,540 So there's a bit of dichotomy here. 1085 00:53:04,540 --> 00:53:07,690 In the first set of parentheses, you specify a comma separated list 1086 00:53:07,690 --> 00:53:09,910 of the columns that you want to put data into. 1087 00:53:09,910 --> 00:53:11,890 In the second set of parentheses, you actually 1088 00:53:11,890 --> 00:53:15,680 specify the values that you want to put into those columns. 1089 00:53:15,680 --> 00:53:18,350 So when I hit enter nothing seems to happen, 1090 00:53:18,350 --> 00:53:20,510 which, in general, is a good thing at my terminal. 1091 00:53:20,510 --> 00:53:25,030 But if I now rerun select star from favorites, we will see-- 1092 00:53:25,030 --> 00:53:27,070 voila, a brand new row. 1093 00:53:27,070 --> 00:53:29,320 We don't know what time or date it was inputted at. 1094 00:53:29,320 --> 00:53:33,130 In fact, we see an old friend, null, which indicates the absence of a value, 1095 00:53:33,130 --> 00:53:37,610 but we do indeed see that SQL, in Fiftyville, is actually now in there. 1096 00:53:37,610 --> 00:53:42,340 So in the world of SQL, null has nothing to do with pointers or addresses. 1097 00:53:42,340 --> 00:53:46,120 The world of SQL, it's just using the same word to represent the same idea, 1098 00:53:46,120 --> 00:53:48,700 that there's no data here, but it has nothing 1099 00:53:48,700 --> 00:53:51,080 to do with actual memory addresses in this case. 1100 00:53:51,080 --> 00:53:53,830 But suppose that you don't want to do that, and, like, no, no, no. 1101 00:53:53,830 --> 00:53:54,640 Let's just delete that. 1102 00:53:54,640 --> 00:53:56,348 Fiftyville hasn't even been released yet, 1103 00:53:56,348 --> 00:53:58,720 nor have we even finished talking about SQL. 1104 00:53:58,720 --> 00:54:02,110 How do we delete data from a database table? 1105 00:54:02,110 --> 00:54:03,670 Well, there's a delete from command. 1106 00:54:03,670 --> 00:54:05,360 Let me go back to VS Code here. 1107 00:54:05,360 --> 00:54:08,080 Let me go ahead and clear my terminal just to keep things clean. 1108 00:54:08,080 --> 00:54:10,570 Let me go ahead and delete from favorites, 1109 00:54:10,570 --> 00:54:14,290 and let me not hit enter here after a semicolon. 1110 00:54:14,290 --> 00:54:16,330 This is one of the most destructive things 1111 00:54:16,330 --> 00:54:19,570 you can do as a database administrator. 1112 00:54:19,570 --> 00:54:21,550 If you Google around, there are horror stories 1113 00:54:21,550 --> 00:54:25,750 of interns in the real world executing commands like this at their companies. 1114 00:54:25,750 --> 00:54:28,510 This will delete everything from favorites. 1115 00:54:28,510 --> 00:54:31,660 So if you ever do this, remember, we told you not to today. 1116 00:54:31,660 --> 00:54:38,530 But if we add a where clause here, only delete rows where the timestamp column 1117 00:54:38,530 --> 00:54:41,433 is null, this is more reasonable. 1118 00:54:41,433 --> 00:54:43,600 And, frankly, any companies you work for should also 1119 00:54:43,600 --> 00:54:45,767 have backups of their database, so we shouldn't even 1120 00:54:45,767 --> 00:54:48,680 be reading about these horror stories, but such is the real world. 1121 00:54:48,680 --> 00:54:52,960 So this is going to delete any row from the favorites table 1122 00:54:52,960 --> 00:54:57,260 where Timestamp, capital T, because that's how Google did it, is null. 1123 00:54:57,260 --> 00:54:58,570 I go ahead and hit enter. 1124 00:54:58,570 --> 00:55:03,880 Nothing seems to happen, but if I do select star from favorites semicolon 1125 00:55:03,880 --> 00:55:06,550 that now row is, again, gone. 1126 00:55:06,550 --> 00:55:11,410 So you can use these predicates, these where conditions, coupled with select, 1127 00:55:11,410 --> 00:55:14,740 coupled with delete, and other operations as well. 1128 00:55:14,740 --> 00:55:16,520 What if I actually want to make a change? 1129 00:55:16,520 --> 00:55:18,910 So if you want to update existing data like this. 1130 00:55:18,910 --> 00:55:20,540 Well, we could do this. 1131 00:55:20,540 --> 00:55:25,300 I could update this table I could set one column equal to this value 1132 00:55:25,300 --> 00:55:27,230 where some condition is true. 1133 00:55:27,230 --> 00:55:28,638 So how might this work? 1134 00:55:28,638 --> 00:55:30,430 Well, let me boldly claim that a lot of you 1135 00:55:30,430 --> 00:55:33,070 are really going to like Fiftyville in the world of SQL, 1136 00:55:33,070 --> 00:55:35,860 so all of these favorites are sort of passe now. 1137 00:55:35,860 --> 00:55:36,890 So let's do this. 1138 00:55:36,890 --> 00:55:39,950 Let me go ahead and update the favorites table, 1139 00:55:39,950 --> 00:55:43,150 setting the language column equal to "SQL," quote unquote. 1140 00:55:43,150 --> 00:55:47,650 And with a comma let me go ahead and also update the problem column to be 1141 00:55:47,650 --> 00:55:49,810 equal to, quote unquote, "Fiftyville." 1142 00:55:49,810 --> 00:55:52,930 I'm not going to have any kind of where here, which means this 1143 00:55:52,930 --> 00:55:56,030 is just going to do its thing on all of the rows. 1144 00:55:56,030 --> 00:55:58,270 So if I hit enter nothing seems to have happened, 1145 00:55:58,270 --> 00:56:00,850 but if I now do select star from favorites, 1146 00:56:00,850 --> 00:56:04,160 everyone's favorite is literally that problem. 1147 00:56:04,160 --> 00:56:07,690 So this too is destructive, unlike the real digital world, 1148 00:56:07,690 --> 00:56:10,120 there's no Control Z, or undo that. 1149 00:56:10,120 --> 00:56:13,480 You better have made a backup of your database, otherwise 1150 00:56:13,480 --> 00:56:15,460 that's not a good thing. 1151 00:56:15,460 --> 00:56:17,830 In this case, I do have this CSV file, so I could just 1152 00:56:17,830 --> 00:56:19,703 delete my favorites.db file. 1153 00:56:19,703 --> 00:56:22,870 I could re-import the data, so I haven't really lost anything of importance, 1154 00:56:22,870 --> 00:56:26,140 but you could in the case of the real world and any data 1155 00:56:26,140 --> 00:56:27,470 you're actually working on. 1156 00:56:27,470 --> 00:56:29,410 So just to make the point, let me go ahead 1157 00:56:29,410 --> 00:56:33,250 and delete from favorites semicolon, enter. 1158 00:56:33,250 --> 00:56:34,660 Let me reselect. 1159 00:56:34,660 --> 00:56:36,710 There's no data there anymore. 1160 00:56:36,710 --> 00:56:42,890 And in fact, if I do select count star from favorites, 1161 00:56:42,890 --> 00:56:45,220 we'll see as much that the answer is, in fact, zero 1162 00:56:45,220 --> 00:56:48,040 because everything has now been deleted. 1163 00:56:48,040 --> 00:56:51,340 Any questions, then, on that code there? 1164 00:56:51,340 --> 00:56:51,950 No? 1165 00:56:51,950 --> 00:56:52,450 All right. 1166 00:56:52,450 --> 00:56:55,533 So if not too scared yet, let's go ahead and take our 10-minute break now. 1167 00:56:55,533 --> 00:56:59,110 Halloween candy is served, and we'll be back in 10. 1168 00:56:59,110 --> 00:57:01,060 All right. 1169 00:57:01,060 --> 00:57:05,750 So we are back, and before we dive back into SQL and some real world data, 1170 00:57:05,750 --> 00:57:10,300 it turns out, unbeknownst to me, we've had a Halloween costume contest. 1171 00:57:10,300 --> 00:57:14,560 So it's now time to announce the winners of this year's CS50 costume contest. 1172 00:57:14,560 --> 00:57:17,530 If our two winners would like to come on up who, 1173 00:57:17,530 --> 00:57:19,495 I'm told during break, dressed up as me. 1174 00:57:19,495 --> 00:57:22,265 [APPLAUSE] 1175 00:57:22,265 --> 00:57:25,620 1176 00:57:25,620 --> 00:57:26,460 Come on over. 1177 00:57:26,460 --> 00:57:28,085 Would you like to introduce yourselves? 1178 00:57:28,085 --> 00:57:31,240 SPEAKER 2: Yes Hi, everyone. 1179 00:57:31,240 --> 00:57:32,100 I'm David. 1180 00:57:32,100 --> 00:57:34,380 I am living in Matthews as a first year. 1181 00:57:34,380 --> 00:57:38,700 I'm planning on studying gov and computer science. 1182 00:57:38,700 --> 00:57:39,820 SPEAKER 3: Hi, everyone. 1183 00:57:39,820 --> 00:57:40,740 I'm David. 1184 00:57:40,740 --> 00:57:43,020 I'm a second year in Mather, and I'm planning 1185 00:57:43,020 --> 00:57:44,745 on just studying computer science. 1186 00:57:44,745 --> 00:57:45,480 [LAUGHING] 1187 00:57:45,480 --> 00:57:46,605 SPEAKER 1: Well, thank you. 1188 00:57:46,605 --> 00:57:50,190 We have some Oreos for you. 1189 00:57:50,190 --> 00:57:51,497 Thank you, both, so much. 1190 00:57:51,497 --> 00:57:52,830 Did anyone else dress like this? 1191 00:57:52,830 --> 00:57:55,437 We have two more Oreos, if you'd like? 1192 00:57:55,437 --> 00:57:56,020 Intentionally? 1193 00:57:56,020 --> 00:57:58,082 [LAUGHING] 1194 00:57:58,082 --> 00:57:59,040 Oh, that's pretty good. 1195 00:57:59,040 --> 00:57:59,540 OK. 1196 00:57:59,540 --> 00:58:00,750 Yes, we have one more winner. 1197 00:58:00,750 --> 00:58:01,515 Come on down. 1198 00:58:01,515 --> 00:58:03,810 [LAUGHS] 1199 00:58:03,810 --> 00:58:09,580 1200 00:58:09,580 --> 00:58:10,317 Thank you. 1201 00:58:10,317 --> 00:58:10,900 SPEAKER 4: Hi. 1202 00:58:10,900 --> 00:58:12,010 SPEAKER 1: This is intentional? 1203 00:58:12,010 --> 00:58:12,590 SPEAKER 4: No. 1204 00:58:12,590 --> 00:58:13,340 SPEAKER 1: Oh, OK. 1205 00:58:13,340 --> 00:58:14,410 [LAUGHING] 1206 00:58:14,410 --> 00:58:16,150 SPEAKER 4: Hello, my name is David. 1207 00:58:16,150 --> 00:58:18,490 I'm from Canada, and I'm a first year. 1208 00:58:18,490 --> 00:58:20,200 I'm not sure what I'm going to study. 1209 00:58:20,200 --> 00:58:20,783 SPEAKER 1: OK. 1210 00:58:20,783 --> 00:58:21,800 Welcome, as well. 1211 00:58:21,800 --> 00:58:22,675 SPEAKER 4: Thank you. 1212 00:58:22,675 --> 00:58:23,560 SPEAKER 1: All right. 1213 00:58:23,560 --> 00:58:26,270 [APPLAUSE] 1214 00:58:26,270 --> 00:58:27,732 1215 00:58:27,732 --> 00:58:29,690 Up until now, we've played around with the data 1216 00:58:29,690 --> 00:58:33,695 that you all gave us, which was based, very simply, on your favorite language 1217 00:58:33,695 --> 00:58:34,820 and your favorite problems. 1218 00:58:34,820 --> 00:58:38,030 But it turns out there's a lot of real world data in, indeed, 1219 00:58:38,030 --> 00:58:40,880 the real world, some of which is quite voluminous. 1220 00:58:40,880 --> 00:58:44,720 And, indeed, there can be not just dozens, or hundreds, but thousands, 1221 00:58:44,720 --> 00:58:47,962 hundreds of thousands, even millions of rows in the biggest of databases. 1222 00:58:47,962 --> 00:58:50,420 And so what we thought we'd do in the latter part of today, 1223 00:58:50,420 --> 00:58:53,360 is really, actually, get our hands dirty with a real world 1224 00:58:53,360 --> 00:58:57,020 data set from the Internet Movie Database, otherwise known as IMDb. 1225 00:58:57,020 --> 00:59:01,010 And, in fact, if you go to imdb.com, you'll be able to answer, 1226 00:59:01,010 --> 00:59:03,650 via their web interface, some of the very questions we'll 1227 00:59:03,650 --> 00:59:05,720 do today using SQL alone. 1228 00:59:05,720 --> 00:59:10,520 But what you'll find, ultimately, is that what websites like imdb.com, 1229 00:59:10,520 --> 00:59:13,880 or their mobile app versions thereof, are probably doing 1230 00:59:13,880 --> 00:59:17,420 is-- yes, giving you a nice pretty graphical interface to type queries, 1231 00:59:17,420 --> 00:59:22,760 but underneath the hood they are passing your input into SQL queries, 1232 00:59:22,760 --> 00:59:25,800 or similar queries, that they formed most of. 1233 00:59:25,800 --> 00:59:28,320 They're just waiting for placeholders, like the keywords, 1234 00:59:28,320 --> 00:59:30,160 that you're actually searching for. 1235 00:59:30,160 --> 00:59:33,420 So let's go ahead and experiment, maybe, with just some real-world data, 1236 00:59:33,420 --> 00:59:36,725 initially, before we consider how to actually store it at scale. 1237 00:59:36,725 --> 00:59:38,850 So let me open up, just for the sake of discussion, 1238 00:59:38,850 --> 00:59:41,190 an actual, empty spreadsheet, just so I have 1239 00:59:41,190 --> 00:59:43,650 some rows and columns to play with. 1240 00:59:43,650 --> 00:59:47,790 And let me propose that we want to model TV shows from the real world. 1241 00:59:47,790 --> 00:59:49,420 How can we go about doing this? 1242 00:59:49,420 --> 00:59:53,020 Well, maybe I could start in this first column A, so to speak, 1243 00:59:53,020 --> 00:59:54,600 and I could create a title column. 1244 00:59:54,600 --> 00:59:57,660 And then, maybe, a column for the star of that show. 1245 00:59:57,660 --> 01:00:00,060 And a very popular show, of course, is The Office. 1246 01:00:00,060 --> 01:00:03,900 So I might put this into the second cell in that first column. 1247 01:00:03,900 --> 01:00:06,870 And under star, I could put someone like Steve Carell. 1248 01:00:06,870 --> 01:00:10,200 But, of course, he wasn't the only star of the show. 1249 01:00:10,200 --> 01:00:11,740 There are others as well. 1250 01:00:11,740 --> 01:00:14,370 And so if I want to put in someone like Rainn Wilson, 1251 01:00:14,370 --> 01:00:16,210 well, maybe, I need a second star column. 1252 01:00:16,210 --> 01:00:17,640 So Rainn Wilson. 1253 01:00:17,640 --> 01:00:20,010 But even as early as the first season, there 1254 01:00:20,010 --> 01:00:24,420 was also another star in the credits, John Krasinski. 1255 01:00:24,420 --> 01:00:25,440 So he was a star. 1256 01:00:25,440 --> 01:00:29,280 Jenna Fischer was top credited in the first season. 1257 01:00:29,280 --> 01:00:30,990 So Jenna Fischer. 1258 01:00:30,990 --> 01:00:37,020 And then BJ Novak, Harvard alum, was also in the first season's opening 1259 01:00:37,020 --> 01:00:37,870 credits, as well. 1260 01:00:37,870 --> 01:00:39,360 So we've got all-- 1261 01:00:39,360 --> 01:00:41,170 one, two, three, four, five of these folks. 1262 01:00:41,170 --> 01:00:42,630 Hopefully, I didn't misspell anyone's name, 1263 01:00:42,630 --> 01:00:45,030 but here's the beginnings of a real-world data set. 1264 01:00:45,030 --> 01:00:47,970 And we could imagine doing this for everyone's favorite shows, 1265 01:00:47,970 --> 01:00:49,630 adding more, and more rows. 1266 01:00:49,630 --> 01:00:52,650 But let's consider, as we often do, not just the correctness 1267 01:00:52,650 --> 01:00:56,220 of this implementation, but the design. 1268 01:00:56,220 --> 01:00:57,990 It's pretty straightforward. 1269 01:00:57,990 --> 01:00:58,972 It's very readable. 1270 01:00:58,972 --> 01:01:00,430 So I think it's good in that sense. 1271 01:01:00,430 --> 01:01:03,510 But if you start to nitpick what's poorly designed, 1272 01:01:03,510 --> 01:01:06,240 even in the world of spreadsheets, about what I've done here-- 1273 01:01:06,240 --> 01:01:08,610 assuming that the next row is another show, 1274 01:01:08,610 --> 01:01:12,110 the next row is another show, and so forth. 1275 01:01:12,110 --> 01:01:13,280 What's bad about this? 1276 01:01:13,280 --> 01:01:13,780 Yeah? 1277 01:01:13,780 --> 01:01:15,272 [INDISTINCT SPEECH] 1278 01:01:15,272 --> 01:01:17,980 Yeah, so each row is going to have a different number of columns, 1279 01:01:17,980 --> 01:01:20,440 and even I, kind of, couldn't make up my mind from the get go. 1280 01:01:20,440 --> 01:01:22,000 Like, do I have just one star column? 1281 01:01:22,000 --> 01:01:22,500 Or two? 1282 01:01:22,500 --> 01:01:23,792 Or maybe now I'm up to 5? 1283 01:01:23,792 --> 01:01:25,750 For even bigger shows, and later in The Office, 1284 01:01:25,750 --> 01:01:27,625 when more people got top billing, we're going 1285 01:01:27,625 --> 01:01:29,770 to need more than five columns for stars. 1286 01:01:29,770 --> 01:01:31,060 So that's fine. 1287 01:01:31,060 --> 01:01:34,682 We can clearly scroll to the right, and just keep adding more columns, 1288 01:01:34,682 --> 01:01:36,640 but there should be something about that design 1289 01:01:36,640 --> 01:01:38,890 that like rubs you the wrong way, like something 1290 01:01:38,890 --> 01:01:41,620 feels a little off if some rows have this many columns, 1291 01:01:41,620 --> 01:01:42,700 others have this many. 1292 01:01:42,700 --> 01:01:45,520 The data would be very jagged along the right hand side. 1293 01:01:45,520 --> 01:01:48,320 It would be very sparse, which would be another way to describe it. 1294 01:01:48,320 --> 01:01:50,000 There's probably a better way. 1295 01:01:50,000 --> 01:01:53,440 So maybe I should flip the data around, and maybe 1296 01:01:53,440 --> 01:01:57,520 a better approach here would be to just have one column for stars. 1297 01:01:57,520 --> 01:01:59,870 So let me do this. 1298 01:01:59,870 --> 01:02:04,180 Let me just move Rainn Wilson over here, and John Krasinski over here, 1299 01:02:04,180 --> 01:02:09,980 and Jenna Fischer over here, and BJ Novak over here, as well. 1300 01:02:09,980 --> 01:02:13,850 I'll get rid of all of these superfluous, identically named columns. 1301 01:02:13,850 --> 01:02:17,840 And now this is sort of better because now I can have any number of stars 1302 01:02:17,840 --> 01:02:21,330 in the vertical, although it's a little weird to leave this blank. 1303 01:02:21,330 --> 01:02:23,850 So maybe I should of copy paste here. 1304 01:02:23,850 --> 01:02:27,920 So in some sense this is better, in that now I only 1305 01:02:27,920 --> 01:02:30,860 have one title column, one star column, and I can just 1306 01:02:30,860 --> 01:02:34,790 keep adding row, row, row, for each show and its stars. 1307 01:02:34,790 --> 01:02:37,840 But what's now poorly designed about this? 1308 01:02:37,840 --> 01:02:38,340 Yeah? 1309 01:02:38,340 --> 01:02:39,970 [INDISTINCT SPEECH] 1310 01:02:39,970 --> 01:02:40,470 Yeah. 1311 01:02:40,470 --> 01:02:41,580 I'm repeating the title. 1312 01:02:41,580 --> 01:02:44,220 And, in general, copy, paste, repeating yourself in code 1313 01:02:44,220 --> 01:02:46,090 has generally been a bad thing. 1314 01:02:46,090 --> 01:02:49,110 It's generally gotten us in trouble if I make a change, or maybe 1315 01:02:49,110 --> 01:02:50,550 a typographical error somewhere. 1316 01:02:50,550 --> 01:02:52,050 Maybe it propagates elsewhere. 1317 01:02:52,050 --> 01:02:54,450 And if nothing else, it's just a lot of wasted space. 1318 01:02:54,450 --> 01:02:57,300 If this is actually going to be stored in a database, in a CSV file, 1319 01:02:57,300 --> 01:03:00,300 why are you duplicating the same string again, and again, and again, 1320 01:03:00,300 --> 01:03:01,440 for large TV shows? 1321 01:03:01,440 --> 01:03:02,370 That's just wasteful. 1322 01:03:02,370 --> 01:03:04,780 It just doesn't seem wise. 1323 01:03:04,780 --> 01:03:06,970 So how can we eliminate that redundancy? 1324 01:03:06,970 --> 01:03:10,260 Well, unfortunately, in the world of spreadsheets, things kind of now 1325 01:03:10,260 --> 01:03:13,080 escalate quickly to be kind of annoying. 1326 01:03:13,080 --> 01:03:16,150 But let me do it, nonetheless, with just a small bit of data. 1327 01:03:16,150 --> 01:03:18,250 Let me propose that we do this instead. 1328 01:03:18,250 --> 01:03:21,977 Let me create not one sheet but maybe multiple sheets, 1329 01:03:21,977 --> 01:03:24,060 and assume that there's some kind of relationship, 1330 01:03:24,060 --> 01:03:25,680 or relation, across these sheets. 1331 01:03:25,680 --> 01:03:29,100 So, just to be pedantic, let me call this sheet, not the default sheet one, 1332 01:03:29,100 --> 01:03:30,360 but let's call this shows. 1333 01:03:30,360 --> 01:03:34,750 And in this sheet, I'm going to have a title column for every show, 1334 01:03:34,750 --> 01:03:36,990 and I think I'm going to be proactive here. 1335 01:03:36,990 --> 01:03:41,200 I'm going to start giving every show a unique ID number, much like Harvard 1336 01:03:41,200 --> 01:03:43,150 affiliates have Harvard IDs, Yale affiliates 1337 01:03:43,150 --> 01:03:44,900 have Yale ID numbers, and so forth. 1338 01:03:44,900 --> 01:03:48,280 Let's go ahead and give each show its own unique identifier, 1339 01:03:48,280 --> 01:03:49,940 for reasons we'll soon see. 1340 01:03:49,940 --> 01:03:53,110 So for The Office let me, just for consistency 1341 01:03:53,110 --> 01:03:55,030 with the actual Internet Movie Database, I'm 1342 01:03:55,030 --> 01:03:58,488 going to give it a unique number of 386676. 1343 01:03:58,488 --> 01:04:00,280 The specifics don't really matter, but that 1344 01:04:00,280 --> 01:04:03,010 happens to be what people in the real world actually do. 1345 01:04:03,010 --> 01:04:05,950 But that's it for TV shows, even though I could imagine 1346 01:04:05,950 --> 01:04:07,490 there being many more in this sheet. 1347 01:04:07,490 --> 01:04:10,840 Let me create another sheet now here, and I'll call it people. 1348 01:04:10,840 --> 01:04:15,490 And in the people sheet, let me keep track of all of those TV stars. 1349 01:04:15,490 --> 01:04:20,500 So one column will be name, another will be also called ID here, 1350 01:04:20,500 --> 01:04:25,070 but it's going to be a person ID, not a show ID. 1351 01:04:25,070 --> 01:04:27,610 And here we have, for instance, Steve Carell. 1352 01:04:27,610 --> 01:04:29,570 We have Rainn Wilson. 1353 01:04:29,570 --> 01:04:31,870 We have John Krasinski. 1354 01:04:31,870 --> 01:04:34,300 We have Jenna Fischer. 1355 01:04:34,300 --> 01:04:36,670 And we have BJ Novak. 1356 01:04:36,670 --> 01:04:38,600 And, this is going to be a little tedious, 1357 01:04:38,600 --> 01:04:41,070 but just to be consistent with reality, turns out 1358 01:04:41,070 --> 01:04:46,230 that according to IMDb Steve Carell's unique number in the world is 136797, 1359 01:04:46,230 --> 01:04:56,640 Rainn's is 933988, John's is 1024677, Jenna's is 278979, 1360 01:04:56,640 --> 01:05:01,814 and, lastly, BJ Novak's is 1145983. 1361 01:05:01,814 --> 01:05:02,650 Phew. 1362 01:05:02,650 --> 01:05:03,150 OK. 1363 01:05:03,150 --> 01:05:06,360 So now we have the same people, but they each have a unique ID number. 1364 01:05:06,360 --> 01:05:10,590 Lastly, let's associate those shows with those people in a way that 1365 01:05:10,590 --> 01:05:13,320 avoids the two problems we identified earlier, which 1366 01:05:13,320 --> 01:05:16,650 was having a variable number of columns in one case 1367 01:05:16,650 --> 01:05:18,960 versus redundancy in the second case. 1368 01:05:18,960 --> 01:05:22,710 Let's really tighten things up so nothing is in duplicate 1369 01:05:22,710 --> 01:05:24,880 that doesn't actually need to be. 1370 01:05:24,880 --> 01:05:28,890 So I'm going to create a third sheet here, and I'll call it stars, 1371 01:05:28,890 --> 01:05:31,770 like the TV stars for these shows, and what I'm going to do 1372 01:05:31,770 --> 01:05:36,292 is have only two columns, a show ID, and a person ID. 1373 01:05:36,292 --> 01:05:38,250 I could write these in different ways, but it's 1374 01:05:38,250 --> 01:05:41,740 conventional in the database world to use snake case, so to speak, 1375 01:05:41,740 --> 01:05:46,150 where everything is lowercase, with underscores instead of spaces. 1376 01:05:46,150 --> 01:05:49,390 And for show ID-- well, we're only mocking up one show for now, 1377 01:05:49,390 --> 01:05:54,100 but I'm going to go ahead and say 386676, which is The Office, I claimed. 1378 01:05:54,100 --> 01:05:57,160 And now I'm going to go ahead and have all of those person IDs, 1379 01:05:57,160 --> 01:06:00,230 and this time it's OK to copy paste, if only to save time. 1380 01:06:00,230 --> 01:06:04,540 So I'm going to grab all of these five stars IDs, paste them there, 1381 01:06:04,540 --> 01:06:08,840 and I am going to indulge by duplicating the show ID, 1382 01:06:08,840 --> 01:06:10,580 even though we didn't like that earlier. 1383 01:06:10,580 --> 01:06:14,380 Now this indeed has escalated quickly because none of these sheets 1384 01:06:14,380 --> 01:06:17,590 are very useful to look at to the human eye because none of them 1385 01:06:17,590 --> 01:06:18,910 has the complete picture. 1386 01:06:18,910 --> 01:06:22,600 We have shows in one, people in another, and then like this cryptic mapping 1387 01:06:22,600 --> 01:06:24,250 of numbers in the third. 1388 01:06:24,250 --> 01:06:29,200 But I propose that this is the right way to implement data 1389 01:06:29,200 --> 01:06:32,530 if your goal is to have a canonical source of truth 1390 01:06:32,530 --> 01:06:34,060 for every show and every person. 1391 01:06:34,060 --> 01:06:37,990 That is to say, you only say the show's name once in one place. 1392 01:06:37,990 --> 01:06:43,030 You only write the TV star's names once in one place. 1393 01:06:43,030 --> 01:06:46,330 And you associate them, you relate one with the other, 1394 01:06:46,330 --> 01:06:48,710 by way of this third sheet here. 1395 01:06:48,710 --> 01:06:52,270 So if you've not seen it already, notice that if The Office has 1396 01:06:52,270 --> 01:06:56,800 this unique ID, 386676, notice in the stars table, 1397 01:06:56,800 --> 01:06:59,620 that same value appears multiple times. 1398 01:06:59,620 --> 01:07:04,810 But what this third sheet is doing is associating that same show ID 1399 01:07:04,810 --> 01:07:08,510 with one, two, three, four, five different people. 1400 01:07:08,510 --> 01:07:11,860 Now I can see that it's similar in spirit to what we already 1401 01:07:11,860 --> 01:07:13,780 indicted as bad design a moment ago. 1402 01:07:13,780 --> 01:07:15,460 The Office, The Office, The Office. 1403 01:07:15,460 --> 01:07:18,970 But think about our world of C. In the world of C, and really computers 1404 01:07:18,970 --> 01:07:22,180 in general, data takes up finite amount of space, typically. 1405 01:07:22,180 --> 01:07:24,470 Like an integer is four bytes, 32 bits. 1406 01:07:24,470 --> 01:07:27,070 So even though, yes, I'm duplicating this value, 1407 01:07:27,070 --> 01:07:30,460 it's just the same four bytes, four bytes, four bytes. 1408 01:07:30,460 --> 01:07:36,340 It's not t-h-e space o-f-f-i-c-e, null character. 1409 01:07:36,340 --> 01:07:38,755 It's not the same 11 bytes again, and again. 1410 01:07:38,755 --> 01:07:41,380 It's just a number, and numbers tend to be much more efficient. 1411 01:07:41,380 --> 01:07:43,660 That computers can crunch numbers much more quickly. 1412 01:07:43,660 --> 01:07:47,920 Duplicating numbers is in general allowed, or smiled upon. 1413 01:07:47,920 --> 01:07:50,660 Duplicating strings will get you into trouble. 1414 01:07:50,660 --> 01:07:54,212 So with that said, is this a useful spreadsheet now? 1415 01:07:54,212 --> 01:07:56,920 Would you want to be handed this in your job and asked questions? 1416 01:07:56,920 --> 01:07:59,020 Like, hey, who stars in The Office? 1417 01:07:59,020 --> 01:08:01,390 You can answer it, but you have to look up one sheet, 1418 01:08:01,390 --> 01:08:02,950 then another, then a third. 1419 01:08:02,950 --> 01:08:06,490 Or you need to use VLOOKUP, or special functions in Excel, or Google Sheets. 1420 01:08:06,490 --> 01:08:08,823 I mean, you're just creating a lot of work for yourself. 1421 01:08:08,823 --> 01:08:12,340 But, academically, if you will, systematically, this 1422 01:08:12,340 --> 01:08:15,640 has a lot of merit because we've avoided all duplication. 1423 01:08:15,640 --> 01:08:20,800 We've normalized the data, so to speak, by factoring out any duplication. 1424 01:08:20,800 --> 01:08:22,700 So where are we going with this? 1425 01:08:22,700 --> 01:08:25,720 Well, it turns out that we'll play now with some actual data 1426 01:08:25,720 --> 01:08:28,481 from the real world from the actual Internet Movie database. 1427 01:08:28,481 --> 01:08:31,189 And in a moment, it's going to look a little something like this. 1428 01:08:31,189 --> 01:08:34,689 This is an artist's rendition of five different-- sorry, 1429 01:08:34,689 --> 01:08:36,800 six different tables. 1430 01:08:36,800 --> 01:08:39,430 So not one, but six different tables that we 1431 01:08:39,430 --> 01:08:43,479 created using freely available Internet Movie Database data. 1432 01:08:43,479 --> 01:08:47,904 They kindly provide, not CSVs, but TSVs, tab separated values, 1433 01:08:47,904 --> 01:08:49,779 which are essentially the same thing, but you 1434 01:08:49,779 --> 01:08:51,729 look for tab characters instead of commas 1435 01:08:51,729 --> 01:08:55,120 in order to convert them, as we did, into our own format, 1436 01:08:55,120 --> 01:08:56,800 SQLite in this case. 1437 01:08:56,800 --> 01:08:59,404 But we'll see that there is a file that we've 1438 01:08:59,404 --> 01:09:02,529 made available from today, which you can download off the course's website, 1439 01:09:02,529 --> 01:09:07,149 called shows.db, and it contains all of this same information. 1440 01:09:07,149 --> 01:09:10,660 And in that shows.db file there are indeed these six tables, 1441 01:09:10,660 --> 01:09:13,180 but let's focus on just two of them initially. 1442 01:09:13,180 --> 01:09:16,038 IMDb, the Internet Movie Database, is all about rating TV shows, 1443 01:09:16,038 --> 01:09:18,580 and tracking that kind of information, so let's actually take 1444 01:09:18,580 --> 01:09:20,497 a look at some of these ratings and figure out 1445 01:09:20,497 --> 01:09:22,970 how we can actually answer actual questions. 1446 01:09:22,970 --> 01:09:24,760 So let me go over to VS Code. 1447 01:09:24,760 --> 01:09:29,750 And let me run SQLite of shows.db, which is a file that already exists. 1448 01:09:29,750 --> 01:09:33,010 There's no CSVs, no TSVs, we did all of this for you already. 1449 01:09:33,010 --> 01:09:34,932 When I hit enter, I get my SQLite prompt. 1450 01:09:34,932 --> 01:09:36,640 And the first thing I like to do whenever 1451 01:09:36,640 --> 01:09:39,189 I'm playing with a SQL database for the first time-- maybe 1452 01:09:39,189 --> 01:09:41,410 I got it from a class, or my boss, or the like-- 1453 01:09:41,410 --> 01:09:44,979 is just to wrap my mind around what's inside of the database 1454 01:09:44,979 --> 01:09:48,069 because you're not typically going to be provided with pictures like this. 1455 01:09:48,069 --> 01:09:49,729 You're just going to be given a file. 1456 01:09:49,729 --> 01:09:51,220 So let me just select some data. 1457 01:09:51,220 --> 01:09:54,468 Let me select star from the shows table. 1458 01:09:54,468 --> 01:09:57,010 I don't really want to see all of it, so let me just limit it 1459 01:09:57,010 --> 01:09:59,390 to the first 10 shows in the table. 1460 01:09:59,390 --> 01:10:03,310 And here, we can infer what the shows table looks like. 1461 01:10:03,310 --> 01:10:07,270 Every show has an ID, a title, the year in which it debuted, apparently, 1462 01:10:07,270 --> 01:10:11,320 and the number of episodes as of last night when we exported the data. 1463 01:10:11,320 --> 01:10:15,010 So that seems to reflect this picture, and this is technically 1464 01:10:15,010 --> 01:10:19,000 an entity relationship diagram, a standard way of depicting things. 1465 01:10:19,000 --> 01:10:21,670 And you'll see that in our picture shows, indeed, 1466 01:10:21,670 --> 01:10:25,480 have an ID column, title column, year column, and episodes column. 1467 01:10:25,480 --> 01:10:26,840 Well what about these ratings? 1468 01:10:26,840 --> 01:10:29,650 Well, according to the picture, that has a show ID, a rating, 1469 01:10:29,650 --> 01:10:30,830 and a votes column. 1470 01:10:30,830 --> 01:10:34,690 So let's go back to VS Code here, and let's do select star 1471 01:10:34,690 --> 01:10:39,670 from ratings, limit 10, just to wrap our mind around some of the data. 1472 01:10:39,670 --> 01:10:42,100 And there we have a show ID in the left, we 1473 01:10:42,100 --> 01:10:45,170 have rating in the middle, which seems to be like a floating point value, 1474 01:10:45,170 --> 01:10:46,940 and then votes, which seems to be an integer. 1475 01:10:46,940 --> 01:10:48,815 So we have some different types of data here. 1476 01:10:48,815 --> 01:10:50,270 But there's a lot of data. 1477 01:10:50,270 --> 01:10:51,610 In fact, if I do this now. 1478 01:10:51,610 --> 01:10:55,180 Select star from shows-- 1479 01:10:55,180 --> 01:10:56,630 let's not select all of the data. 1480 01:10:56,630 --> 01:11:00,670 Let's do select count stars from shows, and in this database 1481 01:11:00,670 --> 01:11:04,640 there are 214,000 shows in this database. 1482 01:11:04,640 --> 01:11:08,780 So we're well past the 398 rows that we've been talking about thus far. 1483 01:11:08,780 --> 01:11:12,010 So it turns out, per this diagram, there's actually 1484 01:11:12,010 --> 01:11:16,360 a standard relationship between these two tables, shows and ratings 1485 01:11:16,360 --> 01:11:18,880 respectively, and it's what we would call a one-to-one one 1486 01:11:18,880 --> 01:11:23,430 relationship whereby every show in this design has one rating. 1487 01:11:23,430 --> 01:11:25,180 And this is indicated, technically, if you 1488 01:11:25,180 --> 01:11:27,755 look at what the arrowheads look like on these diagrams. 1489 01:11:27,755 --> 01:11:30,130 This indicates that it's a one-to-one relationship, which 1490 01:11:30,130 --> 01:11:34,450 means every show has one rating, which means every row in the shows table 1491 01:11:34,450 --> 01:11:36,970 has a corresponding row in the ratings table. 1492 01:11:36,970 --> 01:11:39,640 Strictly speaking, they could be in the same table. 1493 01:11:39,640 --> 01:11:42,130 You could just join them together and make one wider table, 1494 01:11:42,130 --> 01:11:44,800 but IMDb keeps the data separate so we too 1495 01:11:44,800 --> 01:11:47,210 kept them separate in two separate tables here. 1496 01:11:47,210 --> 01:11:50,120 So what does this actually mean in practice? 1497 01:11:50,120 --> 01:11:54,590 Well, let's actually take a look in VS Code at the schema for these tables. 1498 01:11:54,590 --> 01:11:57,160 Let me clear my screen, and let me .schema, 1499 01:11:57,160 --> 01:12:01,570 but specifically look at the schema, or the design of, the shows table. 1500 01:12:01,570 --> 01:12:06,640 So you can do .schema shows, and any command in SQLite with a dot is SQLite 1501 01:12:06,640 --> 01:12:08,050 specific. 1502 01:12:08,050 --> 01:12:10,900 In the real world, if you're using other products like Oracle, 1503 01:12:10,900 --> 01:12:14,350 or Postgre, or MySQL, or others, they have different commands, 1504 01:12:14,350 --> 01:12:16,300 but anything else that we've been typing, 1505 01:12:16,300 --> 01:12:20,260 especially the capitalized keywords, is indeed standard SQL. 1506 01:12:20,260 --> 01:12:25,310 If I hit enter here, here is what the shows table apparently looks like. 1507 01:12:25,310 --> 01:12:28,480 In other words, here is the create table command that we, the staff, 1508 01:12:28,480 --> 01:12:30,550 ran in order to create this table for you, 1509 01:12:30,550 --> 01:12:32,560 and then we imported a bunch of data into it. 1510 01:12:32,560 --> 01:12:38,200 Every show has an ID, has a title, has a year, has a number of episodes. 1511 01:12:38,200 --> 01:12:39,610 But there's more detail here. 1512 01:12:39,610 --> 01:12:43,600 Apparently, the ID is an integer, the title is text, and it is not null, 1513 01:12:43,600 --> 01:12:47,170 cannot be null, the year is numeric, whatever that means, 1514 01:12:47,170 --> 01:12:49,030 and the episodes is an integer. 1515 01:12:49,030 --> 01:12:50,740 Well, that is now familiar. 1516 01:12:50,740 --> 01:12:51,830 And then primary key. 1517 01:12:51,830 --> 01:12:53,622 So there's some other stuff going on there, 1518 01:12:53,622 --> 01:12:57,760 which we'll come back to, but let me also do .schema ratings, enter, 1519 01:12:57,760 --> 01:13:00,280 and we'll see a couple of other data types here. 1520 01:13:00,280 --> 01:13:03,070 There's show ID, which is still an integer, but not null. 1521 01:13:03,070 --> 01:13:05,560 Rating, which is a real number, a.k.a. 1522 01:13:05,560 --> 01:13:07,840 float, but it, too, cannot be null. 1523 01:13:07,840 --> 01:13:10,750 And then some number of votes, which is an integer, cannot be null. 1524 01:13:10,750 --> 01:13:12,560 And then there's mention of foreign key. 1525 01:13:12,560 --> 01:13:16,300 So unlike our data set for the favorites a bit ago, which we just 1526 01:13:16,300 --> 01:13:20,530 did automatically, and imported, this database by us, and by IMDb, 1527 01:13:20,530 --> 01:13:23,680 has been more thoughtfully designed, where there's actually 1528 01:13:23,680 --> 01:13:27,640 some relationships across multiple tables, rather than previously 1529 01:13:27,640 --> 01:13:28,420 just one. 1530 01:13:28,420 --> 01:13:32,300 Now in the world of SQL, we have indeed different data types. 1531 01:13:32,300 --> 01:13:35,140 For instance, we have these five primarily. 1532 01:13:35,140 --> 01:13:37,750 One, cutely named BLOB, which is actually 1533 01:13:37,750 --> 01:13:41,990 Binary Large Object, which generally means like a file or some piece of data 1534 01:13:41,990 --> 01:13:43,190 that's zeros and ones. 1535 01:13:43,190 --> 01:13:46,160 Though, generally, it's best to store files on file systems, 1536 01:13:46,160 --> 01:13:50,090 like in folders, on disks, so to speak, not in your database. 1537 01:13:50,090 --> 01:13:51,440 There's integers we've seen. 1538 01:13:51,440 --> 01:13:52,410 There's numeric. 1539 01:13:52,410 --> 01:13:54,740 Which numeric is more like dates and times, 1540 01:13:54,740 --> 01:13:58,340 things that are numbers, but not necessarily integers or floating point 1541 01:13:58,340 --> 01:13:58,970 values. 1542 01:13:58,970 --> 01:14:01,460 There's reals, which do have decimal points in them. 1543 01:14:01,460 --> 01:14:02,780 And then there's just text. 1544 01:14:02,780 --> 01:14:06,290 In other SQL databases in the real world, in your future jobs, 1545 01:14:06,290 --> 01:14:09,740 or your future classes that you might use SQL again in, 1546 01:14:09,740 --> 01:14:14,960 there are even more data types in other databases: Oracle, MySQL, Postgre, 1547 01:14:14,960 --> 01:14:15,605 and so forth. 1548 01:14:15,605 --> 01:14:17,480 But these are sort of representative of them. 1549 01:14:17,480 --> 01:14:19,767 They just get more precise in other systems. 1550 01:14:19,767 --> 01:14:22,100 But there's also some keywords, that we've seen already, 1551 01:14:22,100 --> 01:14:24,230 that you can specify when designing a database. 1552 01:14:24,230 --> 01:14:26,090 --that this column cannot be null. 1553 01:14:26,090 --> 01:14:29,720 If you want to make sure that no one can insert or update 1554 01:14:29,720 --> 01:14:32,030 data unbeknownst to you that is null, you 1555 01:14:32,030 --> 01:14:34,340 can impose that when creating the table. 1556 01:14:34,340 --> 01:14:36,740 And unlike Excel, and Google Spreadsheets, 1557 01:14:36,740 --> 01:14:40,010 and Apple Numbers, which will generally let the human type in or not type 1558 01:14:40,010 --> 01:14:43,520 in anything they want, with a database you have more protections 1559 01:14:43,520 --> 01:14:45,365 over the integrity of your data. 1560 01:14:45,365 --> 01:14:48,900 Moreover, you can specify that a column's values must be unique. 1561 01:14:48,900 --> 01:14:50,750 If you want to avoid duplicates, like you 1562 01:14:50,750 --> 01:14:54,740 don't want the same person to be able to register twice for your website thereby 1563 01:14:54,740 --> 01:14:57,500 making sure they have one unique email address, 1564 01:14:57,500 --> 01:14:59,840 your database can help with that too. 1565 01:14:59,840 --> 01:15:03,470 You don't have to rely on Python to check if it already exists. 1566 01:15:03,470 --> 01:15:07,520 But there's this other feature of relational databases, that 1567 01:15:07,520 --> 01:15:11,480 is databases that have multiple tables across which there are relationships, 1568 01:15:11,480 --> 01:15:13,880 and that's these keywords we saw briefly a moment ago. 1569 01:15:13,880 --> 01:15:16,620 Primary key, and foreign key. 1570 01:15:16,620 --> 01:15:20,180 And we started to scratch this surface here. 1571 01:15:20,180 --> 01:15:22,490 It turns out what I was doing was actually 1572 01:15:22,490 --> 01:15:25,340 best practice in the world of relational databases 1573 01:15:25,340 --> 01:15:30,680 I gave or, really, IMDb gave every show in the world a unique ID. 1574 01:15:30,680 --> 01:15:36,470 And that unique ID, in this case 386676, is a numeric value, an integer, 1575 01:15:36,470 --> 01:15:38,930 that uniquely identifies that TV show. 1576 01:15:38,930 --> 01:15:43,370 In other words, this is the primary key for this table. 1577 01:15:43,370 --> 01:15:46,370 Technically a sheet, but I'm using that just because it's easier to type 1578 01:15:46,370 --> 01:15:48,020 in than my black and white window. 1579 01:15:48,020 --> 01:15:53,960 This ID column is the primary key for shows, as I was mocking up earlier. 1580 01:15:53,960 --> 01:15:57,290 What is the primary key in the people sheet here? 1581 01:15:57,290 --> 01:15:58,580 It is also ID. 1582 01:15:58,580 --> 01:16:02,450 It's a different ID, but it's by convention often called the same thing. 1583 01:16:02,450 --> 01:16:07,037 But this people column, called ID, is its primary key. 1584 01:16:07,037 --> 01:16:08,870 And you can perhaps see where this is going. 1585 01:16:08,870 --> 01:16:13,910 Those same numbers also happen to appear in this third table, 1586 01:16:13,910 --> 01:16:17,180 but in that context, they're sort of foreign keys. 1587 01:16:17,180 --> 01:16:21,420 They didn't come from this star sheet, but they are in this star sheet, 1588 01:16:21,420 --> 01:16:23,690 so they're sort of relatively foreign to it. 1589 01:16:23,690 --> 01:16:29,030 So foreign keys is simply the presence of primary keys in some other table, 1590 01:16:29,030 --> 01:16:30,630 in some other data set. 1591 01:16:30,630 --> 01:16:33,050 And so it's just a description of relativity, 1592 01:16:33,050 --> 01:16:36,860 but the foreign key is the column that uniquely identifies your data. 1593 01:16:36,860 --> 01:16:41,190 Foreign keys is just the appearance of those same numbers elsewhere. 1594 01:16:41,190 --> 01:16:42,570 So what does this mean? 1595 01:16:42,570 --> 01:16:45,020 If we go back to VS Code here, you'll see 1596 01:16:45,020 --> 01:16:48,860 that when we created this table using IMDb's real data, 1597 01:16:48,860 --> 01:16:52,640 we specified that in our shows table, which is bigger than the one 1598 01:16:52,640 --> 01:16:56,180 I mocked up with Google Sheets there, has not only an ID and a title it, 1599 01:16:56,180 --> 01:16:58,615 again, also has year in which the show debuted, 1600 01:16:58,615 --> 01:16:59,990 and the total number of episodes. 1601 01:16:59,990 --> 01:17:02,810 Because that's juicy data that comes from IMDb, 1602 01:17:02,810 --> 01:17:04,670 beyond what I mocked up a moment ago. 1603 01:17:04,670 --> 01:17:08,150 In the ratings table, meanwhile, there's a show ID, 1604 01:17:08,150 --> 01:17:12,360 which is also an integer, just like this ID, but as the name implies, 1605 01:17:12,360 --> 01:17:17,810 this show ID column is actually going to be a foreign key that references 1606 01:17:17,810 --> 01:17:20,640 the shows tables ID column. 1607 01:17:20,640 --> 01:17:23,420 So this is the relational in relational databases. 1608 01:17:23,420 --> 01:17:25,530 These are two tables that have a relationship, 1609 01:17:25,530 --> 01:17:32,520 and that relationship is that show ID is referring to this actual ID here. 1610 01:17:32,520 --> 01:17:34,520 And it allows us essentially, conceptually, 1611 01:17:34,520 --> 01:17:37,350 to link these two tables together. 1612 01:17:37,350 --> 01:17:39,353 So what does this actually mean? 1613 01:17:39,353 --> 01:17:40,770 Well, let me go ahead and do this. 1614 01:17:40,770 --> 01:17:42,140 Let me go back to VS Code here. 1615 01:17:42,140 --> 01:17:45,120 I'll clear my terminal, and let's play around with some of this data. 1616 01:17:45,120 --> 01:17:48,080 So let's go ahead and do this just to experiment. 1617 01:17:48,080 --> 01:17:51,320 Select star from ratings where-- 1618 01:17:51,320 --> 01:17:54,860 let's get all of the good shows where, just like Rotten Tomatoes, 1619 01:17:54,860 --> 01:17:57,410 we'll do the cut off at 6.0 out of 10. 1620 01:17:57,410 --> 01:18:01,063 So where rating is greater than or equal to 6.0. 1621 01:18:01,063 --> 01:18:03,230 And just so I don't overwhelm my screen, let me just 1622 01:18:03,230 --> 01:18:05,370 limit this to the first 10 results. 1623 01:18:05,370 --> 01:18:08,510 In other words, this is SQL syntax for selecting 1624 01:18:08,510 --> 01:18:12,920 all of the ratings that are at least 6.0 or higher from that table. 1625 01:18:12,920 --> 01:18:13,640 Enter. 1626 01:18:13,640 --> 01:18:15,350 And we see just the first 10 of them. 1627 01:18:15,350 --> 01:18:18,540 Not the top 10 because we've not sorted or grouped or anything like that, 1628 01:18:18,540 --> 01:18:20,490 but the first 10 in the table. 1629 01:18:20,490 --> 01:18:23,780 So what is interesting here is that we've 1630 01:18:23,780 --> 01:18:26,390 seen just some sampling of the data, if you will. 1631 01:18:26,390 --> 01:18:28,790 But this isn't all that interesting here. 1632 01:18:28,790 --> 01:18:32,220 Let me actually distill this just to the show ID because, in other words, 1633 01:18:32,220 --> 01:18:34,190 I want to know 10 good shows to watch. 1634 01:18:34,190 --> 01:18:36,440 So let me just select show ID. 1635 01:18:36,440 --> 01:18:39,480 So same result, but less data. 1636 01:18:39,480 --> 01:18:41,700 It's just that first column thereof. 1637 01:18:41,700 --> 01:18:43,540 Now this is going to be a little annoying, 1638 01:18:43,540 --> 01:18:47,460 but if I want to find out the names of these shows-- 1639 01:18:47,460 --> 01:18:50,400 think about the picture from whence we came. 1640 01:18:50,400 --> 01:18:55,320 All of the show's names are in the shows table, but all of the show's ratings 1641 01:18:55,320 --> 01:18:57,520 are in the ratings table. 1642 01:18:57,520 --> 01:19:00,058 So even if I do select star from ratings, 1643 01:19:00,058 --> 01:19:02,100 I'm never going to know what show I'm looking at. 1644 01:19:02,100 --> 01:19:04,500 Like, what the heck is show ID 62614? 1645 01:19:04,500 --> 01:19:05,680 Well, I could do this. 1646 01:19:05,680 --> 01:19:14,370 I could select star from shows where the ID of the show equals 62614 semicolon. 1647 01:19:14,370 --> 01:19:15,240 Enter. 1648 01:19:15,240 --> 01:19:18,270 OK, so I could watch this show from 1981. 1649 01:19:18,270 --> 01:19:19,290 Let me do another one. 1650 01:19:19,290 --> 01:19:24,030 Select star from shows where ID equals 63881. 1651 01:19:24,030 --> 01:19:26,710 So I'm just grabbing the second ID from here. 1652 01:19:26,710 --> 01:19:27,210 OK. 1653 01:19:27,210 --> 01:19:29,530 So Catweazle, a kid's show from 1970. 1654 01:19:29,530 --> 01:19:30,030 All right. 1655 01:19:30,030 --> 01:19:31,120 So I'll watch that. 1656 01:19:31,120 --> 01:19:32,620 So now let's do another one. 1657 01:19:32,620 --> 01:19:33,810 I'll just copy this. 1658 01:19:33,810 --> 01:19:37,530 Like suffice it to say, this is not the best way to look up data, 1659 01:19:37,530 --> 01:19:41,610 where I'm literally copying and pasting values from one query into the next. 1660 01:19:41,610 --> 01:19:43,950 But this is where SQL gets a bit powerful. 1661 01:19:43,950 --> 01:19:46,170 I can have nested queries. 1662 01:19:46,170 --> 01:19:47,710 I can put one inside of the other. 1663 01:19:47,710 --> 01:19:48,793 So let me instead do this. 1664 01:19:48,793 --> 01:19:51,130 Let me clear the screen, and let me instead do this. 1665 01:19:51,130 --> 01:19:57,620 Select star from shows where the ID of the show 1666 01:19:57,620 --> 01:20:01,160 is in the following list of IDs. 1667 01:20:01,160 --> 01:20:02,630 Select. 1668 01:20:02,630 --> 01:20:05,480 And, actually, I'll do this on a separate line. 1669 01:20:05,480 --> 01:20:12,020 Select show ID from ratings where the rating value is 1670 01:20:12,020 --> 01:20:15,800 greater than or equal to 6.0 semicolon. 1671 01:20:15,800 --> 01:20:17,570 So I've separated this onto two lines. 1672 01:20:17,570 --> 01:20:19,710 The dot, dot, dot is just a continuation character, 1673 01:20:19,710 --> 01:20:23,480 which means same query, multiple lines, but the parentheses are deliberate. 1674 01:20:23,480 --> 01:20:27,500 Just like grade school math, I want what's in parentheses to happen first. 1675 01:20:27,500 --> 01:20:29,930 And so what the database will do is we'll 1676 01:20:29,930 --> 01:20:33,740 select, as before, all of the show IDs from the ratings table 1677 01:20:33,740 --> 01:20:36,713 where the rating value is at least 6.0 out of 10. 1678 01:20:36,713 --> 01:20:38,630 And that's going to return to me, effectively, 1679 01:20:38,630 --> 01:20:41,750 a list, some kind of collection of show IDs, 1680 01:20:41,750 --> 01:20:43,760 which previously I was copying and pasting, 1681 01:20:43,760 --> 01:20:47,390 now the database will do the legwork for me. 1682 01:20:47,390 --> 01:20:51,410 It will now select everything from the shows table, where the ID of the show 1683 01:20:51,410 --> 01:20:53,655 is in that list of values. 1684 01:20:53,655 --> 01:20:56,280 And it's actually going to be more than 10 unless I go in there 1685 01:20:56,280 --> 01:20:58,690 and say limit 10, which I can do. 1686 01:20:58,690 --> 01:21:00,510 So let me go ahead and hit enter now. 1687 01:21:00,510 --> 01:21:03,630 And now I see more useful information not just 1688 01:21:03,630 --> 01:21:05,640 the ratings information, which in a vacuum 1689 01:21:05,640 --> 01:21:07,320 tells me nothing about what to watch. 1690 01:21:07,320 --> 01:21:12,360 Now I see the show ID, the title, the year, and the episodes. 1691 01:21:12,360 --> 01:21:17,200 But notably, what is-- and if I want to distill this into just the title, 1692 01:21:17,200 --> 01:21:20,010 let me actually go back here and instead do 1693 01:21:20,010 --> 01:21:25,140 select just the title from shows where the ID is in this whole list. 1694 01:21:25,140 --> 01:21:28,210 And I'll re-execute it by just copying and pasting the same. 1695 01:21:28,210 --> 01:21:31,240 The only difference now is instead of star, I'm selecting title. 1696 01:21:31,240 --> 01:21:34,080 Here's now how the data analyst at IMDb might 1697 01:21:34,080 --> 01:21:38,430 be selecting 10 shows that are really good to watch, according to ratings. 1698 01:21:38,430 --> 01:21:41,490 But what, of course, is missing from the output, 1699 01:21:41,490 --> 01:21:45,065 whether I do star or just title? 1700 01:21:45,065 --> 01:21:45,690 What's missing? 1701 01:21:45,690 --> 01:21:45,960 Yeah? 1702 01:21:45,960 --> 01:21:46,860 AUDIENCE: Rating. 1703 01:21:46,860 --> 01:21:47,940 - The actual rating. 1704 01:21:47,940 --> 01:21:51,000 I know these are at least 6.0, but which is 7.0? 1705 01:21:51,000 --> 01:21:52,050 Which is 10.0? 1706 01:21:52,050 --> 01:21:55,270 It'd be nice to actually combine the data in some way. 1707 01:21:55,270 --> 01:21:57,840 So we can actually do that, too, because it turns out 1708 01:21:57,840 --> 01:22:00,960 that when you have two tables in the world of SQL, or even more, 1709 01:22:00,960 --> 01:22:03,720 you can actually join them together. 1710 01:22:03,720 --> 01:22:08,700 You can join them together, literally using a keyword called join. 1711 01:22:08,700 --> 01:22:10,540 And you can do this as follows. 1712 01:22:10,540 --> 01:22:13,253 Let me propose, with a sample data set, these two tables. 1713 01:22:13,253 --> 01:22:16,170 And dot, dot, dot just means we don't really care about the specifics. 1714 01:22:16,170 --> 01:22:17,800 We just care about the structure. 1715 01:22:17,800 --> 01:22:21,000 So on the left here is a simplified version of my shows 1716 01:22:21,000 --> 01:22:24,270 table that has a show ID and a show title, 1717 01:22:24,270 --> 01:22:27,990 but I've omitted year and episodes as just intellectually 1718 01:22:27,990 --> 01:22:30,180 distracting from the basic structure here. 1719 01:22:30,180 --> 01:22:31,740 But they're there in the real table. 1720 01:22:31,740 --> 01:22:34,890 On the right hand side here, we have the ratings table 1721 01:22:34,890 --> 01:22:37,290 with just two of its columns, the show ID and the rating, 1722 01:22:37,290 --> 01:22:39,457 but I've omitted the votes because it doesn't really 1723 01:22:39,457 --> 01:22:40,750 add anything to the discussion. 1724 01:22:40,750 --> 01:22:42,600 But let me propose this. 1725 01:22:42,600 --> 01:22:45,750 Notice that these two tables and these two rows 1726 01:22:45,750 --> 01:22:47,790 therein definitely have commonalities. 1727 01:22:47,790 --> 01:22:50,370 Like, they both have the same ID. 1728 01:22:50,370 --> 01:22:54,150 In the left table it's 386676, a.k.a. 1729 01:22:54,150 --> 01:22:55,860 The primary key of that row. 1730 01:22:55,860 --> 01:23:01,200 But it's also appearing in duplicate in the right hand table, 386676, 1731 01:23:01,200 --> 01:23:02,977 and in that context it's a foreign key. 1732 01:23:02,977 --> 01:23:04,810 The point, though, is that they're the same. 1733 01:23:04,810 --> 01:23:08,320 So wouldn't it be nice if I could treat one table here, one table here, 1734 01:23:08,320 --> 01:23:12,150 and if my fingertips represent these identical values, kind of like glue 1735 01:23:12,150 --> 01:23:15,715 them together so I get one wider table with all of the information 1736 01:23:15,715 --> 01:23:18,090 together to satisfy your concern, that we don't even know 1737 01:23:18,090 --> 01:23:20,015 what the ratings are of those shows. 1738 01:23:20,015 --> 01:23:21,390 Well let me go ahead and do this. 1739 01:23:21,390 --> 01:23:24,488 Just for artist's rendition, let me flip title and ID, 1740 01:23:24,488 --> 01:23:25,780 which has no functional effect. 1741 01:23:25,780 --> 01:23:28,488 It's just going to put the numbers closer together on the screen. 1742 01:23:28,488 --> 01:23:32,610 Let me then literally highlight the fact that these two numbers are identical, 1743 01:23:32,610 --> 01:23:35,190 and let me propose that we do the equivalent of this, 1744 01:23:35,190 --> 01:23:38,910 we somehow join these two tables on that common value. 1745 01:23:38,910 --> 01:23:40,995 Strictly speaking, I don't need both values 1746 01:23:40,995 --> 01:23:43,870 because they're duplicates, so I don't care if one of them goes away. 1747 01:23:43,870 --> 01:23:46,470 But what I'd really like to do is select, indeed, 1748 01:23:46,470 --> 01:23:52,230 a temporary table that is the joined version of that original data. 1749 01:23:52,230 --> 01:23:54,840 And frankly, I don't really care as the user what the ID is. 1750 01:23:54,840 --> 01:23:57,690 Heck, all I care about is what show to watch and what its rating is. 1751 01:23:57,690 --> 01:23:59,370 Give me the title and the rating. 1752 01:23:59,370 --> 01:24:01,590 All of these numbers are, again, metadata, things 1753 01:24:01,590 --> 01:24:05,490 that the computer cares about, but we humans probably do not. 1754 01:24:05,490 --> 01:24:07,890 So how can we implement that idea? 1755 01:24:07,890 --> 01:24:09,840 Of taking one data set that has a relationship 1756 01:24:09,840 --> 01:24:12,510 with this data set, and somehow combine it together? 1757 01:24:12,510 --> 01:24:14,170 Well let me go back to VS Code here. 1758 01:24:14,170 --> 01:24:15,330 Let me clear my screen. 1759 01:24:15,330 --> 01:24:20,220 And this is going to be a bit cryptic at first, but it's very step by step. 1760 01:24:20,220 --> 01:24:21,010 Let me do this. 1761 01:24:21,010 --> 01:24:25,890 Select star from shows, but not from shows alone. 1762 01:24:25,890 --> 01:24:28,650 Let me join it with the ratings table. 1763 01:24:28,650 --> 01:24:32,400 So let me select everything from shows joined with ratings, 1764 01:24:32,400 --> 01:24:35,910 but I need to tell the database, well, what do I want to join things on? 1765 01:24:35,910 --> 01:24:37,200 What are my fingertips? 1766 01:24:37,200 --> 01:24:40,030 Specifically, I want them to join on those common integers. 1767 01:24:40,030 --> 01:24:44,940 So I can literally say, on, and then I can specify one table on the left. 1768 01:24:44,940 --> 01:24:51,030 Shows.id on the left should equal the ratings table's shows.id column 1769 01:24:51,030 --> 01:24:52,470 on the right. 1770 01:24:52,470 --> 01:24:57,120 Again, if I'm joining shows with ratings it's called ID in one, 1771 01:24:57,120 --> 01:25:00,640 it's called show ID in the other, but it's the exact same thing. 1772 01:25:00,640 --> 01:25:03,210 In fact, if I rewind, this is where we came from. 1773 01:25:03,210 --> 01:25:05,290 Two tables with the same value. 1774 01:25:05,290 --> 01:25:10,170 So with this query here, if I go ahead and now specify not just that but let 1775 01:25:10,170 --> 01:25:14,760 me further say, where rating is greater than or equal to 6.0. 1776 01:25:14,760 --> 01:25:16,500 And, heck, let's limit it to 10. 1777 01:25:16,500 --> 01:25:21,420 Just fits on the screen, so it's more of a mouthful, but when I hit enter now 1778 01:25:21,420 --> 01:25:26,220 we have a wider table that indeed contains everything, star, from having 1779 01:25:26,220 --> 01:25:29,340 joined these two tables left and right. 1780 01:25:29,340 --> 01:25:31,710 Now, again, I don't really care about much of this data, 1781 01:25:31,710 --> 01:25:34,410 like year and episodes and definitely not the IDs. 1782 01:25:34,410 --> 01:25:36,070 So let me actually hit up. 1783 01:25:36,070 --> 01:25:38,070 Let me go to the beginning of the query, and let 1784 01:25:38,070 --> 01:25:41,640 me just select the title of the show and the rating of the show. 1785 01:25:41,640 --> 01:25:45,780 The query is getting a little long and it's wrapping, but it's the same query. 1786 01:25:45,780 --> 01:25:49,500 Except, instead of star, I've done title comma rating. 1787 01:25:49,500 --> 01:25:50,910 Now when I hit enter-- 1788 01:25:50,910 --> 01:25:54,190 like, this is the list that would have been nice to see the first time around. 1789 01:25:54,190 --> 01:25:57,280 Show me 10 shows with a rating of 6.0 or higher, 1790 01:25:57,280 --> 01:25:59,620 but remind me what the rating actually is so, 1791 01:25:59,620 --> 01:26:03,580 maybe, I can prioritize the sevens, the eights, the nines, and even the tens, 1792 01:26:03,580 --> 01:26:05,236 if any. 1793 01:26:05,236 --> 01:26:11,760 Any questions about this technique of joining two tables? 1794 01:26:11,760 --> 01:26:15,210 This sort of solves the problem that we created in the world of this sheet, 1795 01:26:15,210 --> 01:26:18,030 where I was just kind of playing around, where I sort of moved 1796 01:26:18,030 --> 01:26:20,280 all the data into its separate locations, which 1797 01:26:20,280 --> 01:26:21,750 is not at all pleasant to use. 1798 01:26:21,750 --> 01:26:25,590 But with SQL, with join, you can still get any of the data you want. 1799 01:26:25,590 --> 01:26:27,546 Yeah. 1800 01:26:27,546 --> 01:26:30,498 [INDISTINCT SPEECH] 1801 01:26:30,498 --> 01:26:34,450 1802 01:26:34,450 --> 01:26:35,308 Correct. 1803 01:26:35,308 --> 01:26:37,600 So, yes, I should have called that out more explicitly. 1804 01:26:37,600 --> 01:26:41,472 In my query here, I was using dot notation, which we've seen in Python, 1805 01:26:41,472 --> 01:26:43,930 we've seen in C. It means something similar in spirit here, 1806 01:26:43,930 --> 01:26:45,972 but it has nothing to do with structured objects. 1807 01:26:45,972 --> 01:26:48,650 In this case, it has to do with tables and columns. 1808 01:26:48,650 --> 01:26:53,320 So shows.id just makes clear that I want the ID column from the shows table 1809 01:26:53,320 --> 01:26:57,490 to line up with the show ID column from the ratings table. 1810 01:26:57,490 --> 01:27:01,630 Strictly speaking, I don't need to do that because in this case 1811 01:27:01,630 --> 01:27:03,460 there is no ambiguity. 1812 01:27:03,460 --> 01:27:06,340 One table has a column called ID, the other table 1813 01:27:06,340 --> 01:27:07,810 has a column called show ID. 1814 01:27:07,810 --> 01:27:10,840 So certainly the database can just figure this out for me. 1815 01:27:10,840 --> 01:27:15,730 But, for best practice, and for the sake of being explicit, using the dot 1816 01:27:15,730 --> 01:27:17,950 notation and table names can help, especially 1817 01:27:17,950 --> 01:27:20,270 if there's some common language across them. 1818 01:27:20,270 --> 01:27:20,770 All right. 1819 01:27:20,770 --> 01:27:23,030 Well let's go back to the bigger data set here. 1820 01:27:23,030 --> 01:27:25,060 These are all six tables in IMDb. 1821 01:27:25,060 --> 01:27:28,450 We focused for just a moment there on like shows and ratings alone, 1822 01:27:28,450 --> 01:27:29,620 but what about genres. 1823 01:27:29,620 --> 01:27:33,250 So, genres, like comedy, and documentary, and drama, and so forth. 1824 01:27:33,250 --> 01:27:37,190 Turns out that this actually implements a different type of relationship. 1825 01:27:37,190 --> 01:27:39,530 Previously we saw a one-to-one relationship, 1826 01:27:39,530 --> 01:27:42,080 but it turns out that IMDb supports what's 1827 01:27:42,080 --> 01:27:44,820 called a one-to-many relationship when it comes to genres. 1828 01:27:44,820 --> 01:27:45,320 Why? 1829 01:27:45,320 --> 01:27:48,650 Well, shows like The Office, I do think are generally considered comedy 1830 01:27:48,650 --> 01:27:51,650 and that's it, but there are certainly other TV shows 1831 01:27:51,650 --> 01:27:55,140 that might have multiple genres associated with them. 1832 01:27:55,140 --> 01:27:59,160 Maybe it's comedy and a bit of romance thrown in, like rom-coms, and so forth, 1833 01:27:59,160 --> 01:28:03,980 so you could imagine some shows having two or three or more genres, 1834 01:28:03,980 --> 01:28:07,520 and so one-to-many means that one show can have many genres. 1835 01:28:07,520 --> 01:28:11,450 One-to-one would mean one show can have one rating, as we've seen. 1836 01:28:11,450 --> 01:28:17,330 So why don't we go ahead and focus, maybe, on how about a query like this. 1837 01:28:17,330 --> 01:28:19,577 Let me go back to VS Code here, clear my screen, 1838 01:28:19,577 --> 01:28:21,410 and let's just look at some of those genres. 1839 01:28:21,410 --> 01:28:25,700 Select star from genres and then I'll limit it to 10. 1840 01:28:25,700 --> 01:28:28,590 And, again, I do this just to wrap my mind around a new data set. 1841 01:28:28,590 --> 01:28:30,740 I could look at the schema, but that tends to be more cryptic. 1842 01:28:30,740 --> 01:28:32,310 I just want to look at the raw data. 1843 01:28:32,310 --> 01:28:32,810 OK. 1844 01:28:32,810 --> 01:28:37,227 It looks like here there are a bunch of genres: comedy, adventure, comedy. 1845 01:28:37,227 --> 01:28:38,810 So two comedies, which is interesting. 1846 01:28:38,810 --> 01:28:40,100 Oh, interesting. 1847 01:28:40,100 --> 01:28:42,410 Family, action, sci-fi, family. 1848 01:28:42,410 --> 01:28:45,260 So the values here are duplicated, which it turns out 1849 01:28:45,260 --> 01:28:47,000 is not the best design of IMDb. 1850 01:28:47,000 --> 01:28:49,710 We literally just imported the data as they implement it. 1851 01:28:49,710 --> 01:28:55,190 But notice that show ID 62614 is a comedy, but so is 1852 01:28:55,190 --> 01:29:00,950 show 63881, and so is show 65270. 1853 01:29:00,950 --> 01:29:04,320 So it turns out that in the real world sometimes data is somewhat messy. 1854 01:29:04,320 --> 01:29:06,830 There's duplication of comedy, comedy, comedy, 1855 01:29:06,830 --> 01:29:09,380 but such is the way IMDb's data is. 1856 01:29:09,380 --> 01:29:12,694 But what's more interesting to me, for now, is notice this. 1857 01:29:12,694 --> 01:29:16,530 This show ID in three rows is the same. 1858 01:29:16,530 --> 01:29:19,700 So there's some show out there that's considered by the world 1859 01:29:19,700 --> 01:29:22,257 to be an adventure, a comedy, and a family show. 1860 01:29:22,257 --> 01:29:23,340 So let's see what that is. 1861 01:29:23,340 --> 01:29:26,880 Let me just highlight and copy that value, 63881, and do this. 1862 01:29:26,880 --> 01:29:31,378 Select star from shows where the ID of the show equals that value. 1863 01:29:31,378 --> 01:29:33,170 And it turns out, we saw it briefly before, 1864 01:29:33,170 --> 01:29:37,400 it's a show from the 1970s called Catweazle, which falls into all three 1865 01:29:37,400 --> 01:29:38,880 of those categories. 1866 01:29:38,880 --> 01:29:43,880 So by using a one-to-many relationship, sort of depicted by this picture here, 1867 01:29:43,880 --> 01:29:47,588 you can implement that same idea without having that jagged edge. 1868 01:29:47,588 --> 01:29:50,630 When we looked at the spreadsheet earlier, in an earlier version of this, 1869 01:29:50,630 --> 01:29:55,490 we had star, star, star, which we could do again, genre, genre, genre, 1870 01:29:55,490 --> 01:29:58,460 but now we instead have two separate tables 1871 01:29:58,460 --> 01:30:02,247 where this many-to-many relationship is implemented across. 1872 01:30:02,247 --> 01:30:03,830 So let's actually play around with it. 1873 01:30:03,830 --> 01:30:07,640 Let me go back to VS Code here, and let's actually take 1874 01:30:07,640 --> 01:30:10,520 a look at the schema for genres. 1875 01:30:10,520 --> 01:30:14,030 And we'll see that it's pretty small, as the picture suggests. 1876 01:30:14,030 --> 01:30:14,990 It's called genres. 1877 01:30:14,990 --> 01:30:18,530 Every row has a show ID, which is an integer, cannot be null. 1878 01:30:18,530 --> 01:30:21,200 It has a genre, which is text, cannot be null. 1879 01:30:21,200 --> 01:30:25,190 And that show ID is a foreign key in this table that references an ID 1880 01:30:25,190 --> 01:30:26,690 column in the shows table. 1881 01:30:26,690 --> 01:30:29,240 So very similar in spirit, and so it really 1882 01:30:29,240 --> 01:30:32,090 is just kind of on the honor system that we're only 1883 01:30:32,090 --> 01:30:34,670 putting one row for each show in ratings, 1884 01:30:34,670 --> 01:30:39,150 but 0 or more shows in the genres table for shows, as well. 1885 01:30:39,150 --> 01:30:41,360 So what can we do once we want to tinker with genres? 1886 01:30:41,360 --> 01:30:42,420 Well, let me do this. 1887 01:30:42,420 --> 01:30:49,587 How about we select the show ID from the genres table where the genre is comedy. 1888 01:30:49,587 --> 01:30:51,170 Like, I'm in the mood for some comedy. 1889 01:30:51,170 --> 01:30:54,950 Let's see all of the available comedies, except let's just limit it to 10. 1890 01:30:54,950 --> 01:30:59,150 Here are the show IDs for 10 comedies according to the Internet Movie 1891 01:30:59,150 --> 01:30:59,803 Database. 1892 01:30:59,803 --> 01:31:01,220 Well, that's not very interesting. 1893 01:31:01,220 --> 01:31:03,350 I care about the title, so we can do that. 1894 01:31:03,350 --> 01:31:08,990 Select title from shows where the ID of the show is not equal to, 1895 01:31:08,990 --> 01:31:12,710 but rather, in the following subquery, if you will. 1896 01:31:12,710 --> 01:31:18,170 Select show ID from genres where genre equals, quote unquote, 1897 01:31:18,170 --> 01:31:21,900 "Comedy," limit 10, just to keep things simple. 1898 01:31:21,900 --> 01:31:25,460 So same query as before, but now I'm using it as a nested query 1899 01:31:25,460 --> 01:31:28,730 to select the actual titles whose IDs match those their. 1900 01:31:28,730 --> 01:31:29,540 Enter. 1901 01:31:29,540 --> 01:31:32,540 And there are those titles for 10 comedies, 1902 01:31:32,540 --> 01:31:34,240 maybe a couple of which we've seen. 1903 01:31:34,240 --> 01:31:35,480 Catweazel we've seen before. 1904 01:31:35,480 --> 01:31:36,970 But what if we want to now-- 1905 01:31:36,970 --> 01:31:38,380 let's see. 1906 01:31:38,380 --> 01:31:42,085 Maybe we want to get all of the-- let's flip it around. 1907 01:31:42,085 --> 01:31:43,630 Catweazle keeps coming up. 1908 01:31:43,630 --> 01:31:47,170 Why don't we figure out, with a query, what all of its genres are? 1909 01:31:47,170 --> 01:31:48,110 So Catweazle. 1910 01:31:48,110 --> 01:31:48,610 Let's see. 1911 01:31:48,610 --> 01:31:55,450 So let's do select star from genres limit 10, 1912 01:31:55,450 --> 01:31:57,430 because I know it's in the the first 10. 1913 01:31:57,430 --> 01:31:59,950 This was, 63881 was Catweazle's ID. 1914 01:31:59,950 --> 01:32:00,880 So let's do this. 1915 01:32:00,880 --> 01:32:06,860 So select genre from genres where the show ID equals that value. 1916 01:32:06,860 --> 01:32:07,360 OK. 1917 01:32:07,360 --> 01:32:09,152 So there's the same query as we did before. 1918 01:32:09,152 --> 01:32:10,480 Can we make this dynamic? 1919 01:32:10,480 --> 01:32:11,650 Well, we can, too. 1920 01:32:11,650 --> 01:32:13,870 What if I instead, more dynamically do, select 1921 01:32:13,870 --> 01:32:18,250 genre from genres where the show ID-- 1922 01:32:18,250 --> 01:32:19,180 and not in. 1923 01:32:19,180 --> 01:32:23,020 If I'm looking for a specific show, now I can actually do equals. 1924 01:32:23,020 --> 01:32:25,480 And in my subquery, I could do this, select ID 1925 01:32:25,480 --> 01:32:28,660 from shows where the title of the show equals, 1926 01:32:28,660 --> 01:32:33,160 quote unquote, "Catweazel," semicolon, enter. 1927 01:32:33,160 --> 01:32:35,720 So, again, even though I'm typing these very quickly, 1928 01:32:35,720 --> 01:32:38,630 I'm really just composing similar, smaller ideas 1929 01:32:38,630 --> 01:32:41,060 that we've seen before into larger and larger queries 1930 01:32:41,060 --> 01:32:42,573 to just get at more of this data. 1931 01:32:42,573 --> 01:32:44,490 So what's really going on underneath the hood? 1932 01:32:44,490 --> 01:32:46,160 Well, you can think of it like this. 1933 01:32:46,160 --> 01:32:50,060 If we've got this relationship between shows and genres, 1934 01:32:50,060 --> 01:32:52,580 here's an excerpt from shows, and I didn't bother 1935 01:32:52,580 --> 01:32:54,170 showing the thousands of other shows. 1936 01:32:54,170 --> 01:32:56,900 Here's an excerpt from genres on the right. 1937 01:32:56,900 --> 01:33:00,530 What is that query-- what are we essentially trying to do? 1938 01:33:00,530 --> 01:33:02,570 Well, let me flip this around here. 1939 01:33:02,570 --> 01:33:06,290 Let me highlight the fact that this is the same, this is the same, 1940 01:33:06,290 --> 01:33:07,260 this is the same. 1941 01:33:07,260 --> 01:33:10,040 So wouldn't it be nice if I could kind of get these all together? 1942 01:33:10,040 --> 01:33:11,660 Well, if I join these tables, we're actually 1943 01:33:11,660 --> 01:33:13,370 going to notice an interesting artifact. 1944 01:33:13,370 --> 01:33:16,490 If I join them together, as we did before with ratings, 1945 01:33:16,490 --> 01:33:20,990 I'm going to need to fill in the gap there because this is not a table. 1946 01:33:20,990 --> 01:33:24,860 Tables, by definition, always have the same number of rows and columns. 1947 01:33:24,860 --> 01:33:26,730 You can't have gaps in them like this. 1948 01:33:26,730 --> 01:33:29,820 So the simplest thing to do is just to fill that in this way. 1949 01:33:29,820 --> 01:33:33,170 But if I were to try to combine two tables that 1950 01:33:33,170 --> 01:33:36,320 have this one-to-many relationship, you're 1951 01:33:36,320 --> 01:33:38,090 actually going to get duplication. 1952 01:33:38,090 --> 01:33:40,190 It's not duplication in the original tables, 1953 01:33:40,190 --> 01:33:43,850 but in the temporary tables, otherwise known as a result set, 1954 01:33:43,850 --> 01:33:45,750 that's coming back to us. 1955 01:33:45,750 --> 01:33:47,070 So what do I mean by this? 1956 01:33:47,070 --> 01:33:50,060 Well, if we actually implement this same idea as before, 1957 01:33:50,060 --> 01:33:52,970 where we try to join these two tables, let 1958 01:33:52,970 --> 01:33:55,640 me propose that we do it with this syntax. 1959 01:33:55,640 --> 01:33:59,540 Let me do select star from shows join genres, 1960 01:33:59,540 --> 01:34:04,850 which is just like we did with ratings but now let's join it on shows.id 1961 01:34:04,850 --> 01:34:07,760 equals genres.show_ID. 1962 01:34:07,760 --> 01:34:14,120 But let's just do this for Catweazel, where ID equals 63881, semicolon. 1963 01:34:14,120 --> 01:34:16,250 With the ratings, it worked perfectly because it 1964 01:34:16,250 --> 01:34:19,920 was a one-to-one relationship, so the rows just got wider, if you will. 1965 01:34:19,920 --> 01:34:23,660 But now, because it's a one-to-many relationship, when 1966 01:34:23,660 --> 01:34:26,960 you execute these queries, you are going to get back duplicate data 1967 01:34:26,960 --> 01:34:28,400 but this is OK. 1968 01:34:28,400 --> 01:34:30,740 It's considered OK because this is sort of ephemeral. 1969 01:34:30,740 --> 01:34:32,540 These result sets, these temporary tables, 1970 01:34:32,540 --> 01:34:34,700 exist just for us to look at the data, just 1971 01:34:34,700 --> 01:34:36,440 for us to crunch the numbers somehow. 1972 01:34:36,440 --> 01:34:42,167 It's not actually stored in duplicate in the database itself. 1973 01:34:42,167 --> 01:34:44,000 If I wanted to tighten this further, though, 1974 01:34:44,000 --> 01:34:47,570 let me actually get rid of the star and let me just do title genre. 1975 01:34:47,570 --> 01:34:49,040 And, indeed, we can now see. 1976 01:34:49,040 --> 01:34:52,130 OK, Catweazel three times has three different categories. 1977 01:34:52,130 --> 01:34:54,440 But, generally, we don't even care about that so I 1978 01:34:54,440 --> 01:34:57,830 can even whittle this query down to just selecting genre, 1979 01:34:57,830 --> 01:35:00,500 and that, too, will just give me the result, effectively, hiding 1980 01:35:00,500 --> 01:35:01,550 the duplication. 1981 01:35:01,550 --> 01:35:04,957 But when you join data with a one-to-many relationship 1982 01:35:04,957 --> 01:35:07,790 you're temporarily going to get duplicates, which is actually useful 1983 01:35:07,790 --> 01:35:10,580 because it's very easy then to get at the show's title 1984 01:35:10,580 --> 01:35:14,080 no matter where you are in some loop. 1985 01:35:14,080 --> 01:35:14,620 All right. 1986 01:35:14,620 --> 01:35:15,912 Well, what more can we do here? 1987 01:35:15,912 --> 01:35:19,470 Well let me propose that we revisit the main database, here with six tables, 1988 01:35:19,470 --> 01:35:21,970 and let's look at, perhaps, the juiciest, and the one that's 1989 01:35:21,970 --> 01:35:26,680 really what most people use IMDb for is to look up shows and people therein. 1990 01:35:26,680 --> 01:35:28,300 Let's focus on these three tables. 1991 01:35:28,300 --> 01:35:32,140 And we can infer, from this diagram, that there's now, for the first time, 1992 01:35:32,140 --> 01:35:34,270 three tables involved in a relationship. 1993 01:35:34,270 --> 01:35:38,710 There's people, there's shows, but I've proposed this intermediary stars 1994 01:35:38,710 --> 01:35:40,930 table, much like I temporarily, in Google Sheets, 1995 01:35:40,930 --> 01:35:44,120 gave us a third sheet to link the two together. 1996 01:35:44,120 --> 01:35:46,810 This stars table we're about to see, it's purpose in life 1997 01:35:46,810 --> 01:35:49,990 is to join two other tables together. 1998 01:35:49,990 --> 01:35:54,670 And, in fact, it's only going to have two columns, show ID and person ID. 1999 01:35:54,670 --> 01:35:57,790 So what this is going to do for us is implement this idea, 2000 01:35:57,790 --> 01:35:59,360 many-to-many relationship. 2001 01:35:59,360 --> 01:35:59,860 Why? 2002 01:35:59,860 --> 01:36:02,810 Because any TV show can obviously have many people in it, 2003 01:36:02,810 --> 01:36:06,670 but one person can presumably star in many different shows. 2004 01:36:06,670 --> 01:36:09,680 Like, Steve Carell has been in multiple shows, not just The Office. 2005 01:36:09,680 --> 01:36:11,870 So when you have a many-to-many relationship, 2006 01:36:11,870 --> 01:36:16,655 you actually do need this third table to bridge the two any number of times. 2007 01:36:16,655 --> 01:36:19,280 But it's going to make our life a little more unpleasant to get 2008 01:36:19,280 --> 01:36:22,800 the data we want because it's going to add some additional steps, if you will. 2009 01:36:22,800 --> 01:36:24,300 So let me do this. 2010 01:36:24,300 --> 01:36:28,340 Suppose that I want to get everything I know about The Office. 2011 01:36:28,340 --> 01:36:30,800 Well, let's start with a single query here in VS Code. 2012 01:36:30,800 --> 01:36:35,780 Select star from shows where title equals, quote unquote, "The Office." 2013 01:36:35,780 --> 01:36:37,130 And I should see-- 2014 01:36:37,130 --> 01:36:38,060 Oh, interesting. 2015 01:36:38,060 --> 01:36:41,810 Several attempts at creating a TV show called The Office over the years. 2016 01:36:41,810 --> 01:36:45,740 You can, perhaps, infer the year in which the most popular of them began. 2017 01:36:45,740 --> 01:36:46,910 2005? 2018 01:36:46,910 --> 01:36:49,580 So I presume this is the one we all know and have watched, 2019 01:36:49,580 --> 01:36:54,200 at least in the US, which is this one, 386676, which matches the ID that I 2020 01:36:54,200 --> 01:36:55,770 very carefully used earlier. 2021 01:36:55,770 --> 01:36:58,520 So let me actually be a little more deliberate, 2022 01:36:58,520 --> 01:37:02,690 where the title equals The Office and the year equals 2005. 2023 01:37:02,690 --> 01:37:07,400 That query now gets us The Office that we all in the US and, perhaps, love. 2024 01:37:07,400 --> 01:37:09,830 But now let's actually do something like, get 2025 01:37:09,830 --> 01:37:11,960 all of the people who starred in it, at least 2026 01:37:11,960 --> 01:37:14,650 according to IMDb, whoever had top billing. 2027 01:37:14,650 --> 01:37:16,370 So how can I do this? 2028 01:37:16,370 --> 01:37:20,350 Well, unfortunately, in the shows table there are no people, 2029 01:37:20,350 --> 01:37:24,310 and there's no stars even, but I could do a nested query like this. 2030 01:37:24,310 --> 01:37:32,350 Why don't I select the person ID from the stars table where-- 2031 01:37:32,350 --> 01:37:45,990 whoops, where the-- sorry, where show ID equals, and then in parentheses 2032 01:37:45,990 --> 01:37:48,430 let me do that same query as before and for time's sake, 2033 01:37:48,430 --> 01:37:52,260 I'll just copy paste so that we get back the one and only 2034 01:37:52,260 --> 01:37:53,640 Office in that subquery. 2035 01:37:53,640 --> 01:37:56,710 So what I'm going to do is take an intermediate step, a baby step, 2036 01:37:56,710 --> 01:37:57,240 if you will. 2037 01:37:57,240 --> 01:38:00,992 Right now, I have found in the shows table The Office I care about. 2038 01:38:00,992 --> 01:38:02,700 But if I want to get to the people table, 2039 01:38:02,700 --> 01:38:06,720 I have to take a step through the stars table, this intermediate table, just 2040 01:38:06,720 --> 01:38:09,130 to get anywhere close to the names of those people. 2041 01:38:09,130 --> 01:38:11,070 So what can I get from the stars table? 2042 01:38:11,070 --> 01:38:14,040 Well, why don't I at least select all of the person 2043 01:38:14,040 --> 01:38:18,340 IDs in that table that are somehow associated with the same show ID? 2044 01:38:18,340 --> 01:38:20,340 So in VS Code, what I'm doing is this. 2045 01:38:20,340 --> 01:38:23,340 Select person ID from that intermediate stars 2046 01:38:23,340 --> 01:38:27,690 table, where the show ID in question is whatever the show ID is for The Office. 2047 01:38:27,690 --> 01:38:30,510 I could literally type 386676, but I'm trying 2048 01:38:30,510 --> 01:38:33,630 to do this more dynamically so I've used a nested query instead. 2049 01:38:33,630 --> 01:38:34,890 All right, this is correct. 2050 01:38:34,890 --> 01:38:36,760 Whoops, ironically. 2051 01:38:36,760 --> 01:38:38,085 Where show ID-- 2052 01:38:38,085 --> 01:38:41,540 2053 01:38:41,540 --> 01:38:42,440 Oh, thank you. 2054 01:38:42,440 --> 01:38:43,610 Oh, who said that? 2055 01:38:43,610 --> 01:38:44,570 OK. 2056 01:38:44,570 --> 01:38:45,110 Batman. 2057 01:38:45,110 --> 01:38:47,370 I owe Batman Oreos after class. 2058 01:38:47,370 --> 01:38:48,470 Thank you, Batman. 2059 01:38:48,470 --> 01:38:50,330 That's de facto. 2060 01:38:50,330 --> 01:38:50,930 Yes. 2061 01:38:50,930 --> 01:38:52,355 Other good prize for that costume. 2062 01:38:52,355 --> 01:38:53,330 [LAUGHS] 2063 01:38:53,330 --> 01:38:55,160 OK, so let me fix this. 2064 01:38:55,160 --> 01:38:55,910 My apologies. 2065 01:38:55,910 --> 01:38:58,790 So let's go ahead and select person ID from stars 2066 01:38:58,790 --> 01:39:02,390 where show ID equals, and this is where I messed up before. 2067 01:39:02,390 --> 01:39:05,510 I did select star from shows, but I can't 2068 01:39:05,510 --> 01:39:09,410 look for a show ID equaling an entire row of information. 2069 01:39:09,410 --> 01:39:16,160 I instead need to do select ID from shows where the shows title equals, 2070 01:39:16,160 --> 01:39:20,510 quote unquote, "The Office" and the year of that show is 2005. 2071 01:39:20,510 --> 01:39:25,340 And just to call this out, much like in C, I'm quoting strings, 2072 01:39:25,340 --> 01:39:26,780 but I'm not quoting numbers. 2073 01:39:26,780 --> 01:39:28,910 It's not necessary for something like 2005. 2074 01:39:28,910 --> 01:39:33,410 And just to be super clear, too, I have generally adopted a style already today 2075 01:39:33,410 --> 01:39:37,850 of capitalizing any SQL keywords, like select, like from, like where, 2076 01:39:37,850 --> 01:39:40,470 and so forth, and then using lowercase for everything else. 2077 01:39:40,470 --> 01:39:42,180 Strictly speaking, that's not necessary. 2078 01:39:42,180 --> 01:39:44,130 Stylistically, we would encourage you to be 2079 01:39:44,130 --> 01:39:46,592 in the habit of using uppercase for your SQL keywords 2080 01:39:46,592 --> 01:39:48,300 because they just pop more on the screen. 2081 01:39:48,300 --> 01:39:49,883 It sort of makes things more readable. 2082 01:39:49,883 --> 01:39:52,200 But strictly speaking, SQL itself does not care 2083 01:39:52,200 --> 01:39:54,690 about that capitalization of keywords. 2084 01:39:54,690 --> 01:39:56,490 All right, now let me cross my fingers. 2085 01:39:56,490 --> 01:39:59,100 And now I get back this list of person IDs. 2086 01:39:59,100 --> 01:40:00,840 And, again, my goal was to figure out who 2087 01:40:00,840 --> 01:40:03,018 is in The Office that debuted in 2005. 2088 01:40:03,018 --> 01:40:06,060 This is not that interesting because I don't know who any of these people 2089 01:40:06,060 --> 01:40:08,820 are, but here's where we can do one additional step 2090 01:40:08,820 --> 01:40:10,650 and nest a nested query. 2091 01:40:10,650 --> 01:40:15,000 So let me actually select the names from the people table 2092 01:40:15,000 --> 01:40:18,720 where the ID of those people is in. 2093 01:40:18,720 --> 01:40:20,820 And then on a new line, just to make it pretty, 2094 01:40:20,820 --> 01:40:26,770 where the ID equals this query, which I'll paste here, 2095 01:40:26,770 --> 01:40:31,260 and then that equals, and then another line and indent further, 2096 01:40:31,260 --> 01:40:32,602 this query here. 2097 01:40:32,602 --> 01:40:35,310 So just to save time, I'm copying and pasting the previous query, 2098 01:40:35,310 --> 01:40:38,970 but I'm wrapping it with one outermost query, now, that's 2099 01:40:38,970 --> 01:40:42,690 saying select names from people where the ID of those people 2100 01:40:42,690 --> 01:40:47,940 is in this result set, where the ID of those shows is in this result set. 2101 01:40:47,940 --> 01:40:52,170 So the parentheses make clear, ideally, what's happening 2102 01:40:52,170 --> 01:40:53,790 and what order from inside out. 2103 01:40:53,790 --> 01:40:54,540 Enter. 2104 01:40:54,540 --> 01:40:58,860 And there we have it, at least according to IMDb, for the latest season, 2105 01:40:58,860 --> 01:41:04,350 like this is the top-billed stars that are in this here database. 2106 01:41:04,350 --> 01:41:04,860 All right. 2107 01:41:04,860 --> 01:41:06,302 So how can we do something else? 2108 01:41:06,302 --> 01:41:08,260 Well, let me just do it in the other direction. 2109 01:41:08,260 --> 01:41:11,177 Suppose we want to see all of Steve Carell shows, not just The Office. 2110 01:41:11,177 --> 01:41:16,650 Select title from shows where the ID of the show is in. 2111 01:41:16,650 --> 01:41:18,990 And then here, I'm going to do select show 2112 01:41:18,990 --> 01:41:24,150 ID from stars where person ID equals. 2113 01:41:24,150 --> 01:41:26,490 And then here, indenting for clarity, I'm 2114 01:41:26,490 --> 01:41:28,740 going to select Steve Carell's ID by saying, 2115 01:41:28,740 --> 01:41:32,850 select ID from people where the name of that person is, 2116 01:41:32,850 --> 01:41:35,040 quote unquote, "Steve Carell." 2117 01:41:35,040 --> 01:41:36,780 And so in this way, 2118 01:41:36,780 --> 01:41:42,250 I'm writing the SQL query sort of in reverse. 2119 01:41:42,250 --> 01:41:44,102 I'm asking first what I care about. 2120 01:41:44,102 --> 01:41:45,810 But before I can even answer that, I have 2121 01:41:45,810 --> 01:41:48,510 to answer this nested query, what is Steve Carell's ID? 2122 01:41:48,510 --> 01:41:52,320 Once I have that, what are all the show IDs that person ID has been in? 2123 01:41:52,320 --> 01:41:55,740 And then, please tell me what the title of all of those shows is. 2124 01:41:55,740 --> 01:41:57,510 Let me go ahead and cross my fingers. 2125 01:41:57,510 --> 01:41:58,200 And, voila. 2126 01:41:58,200 --> 01:42:00,950 Some of these you might have heard of, some of you might not have, 2127 01:42:00,950 --> 01:42:03,750 but if you were to go on imdb.com and search for Steve Carell, 2128 01:42:03,750 --> 01:42:08,445 you would presumably see this here list of shows that he's been in, 2129 01:42:08,445 --> 01:42:11,410 in some particular order. 2130 01:42:11,410 --> 01:42:15,330 Just to show you two other syntaxes, but let me not emphasize this 2131 01:42:15,330 --> 01:42:17,200 because it will look complicated. 2132 01:42:17,200 --> 01:42:19,500 There are other ways to solve this same problem. 2133 01:42:19,500 --> 01:42:21,660 If you prefer the approach of joining, we 2134 01:42:21,660 --> 01:42:25,720 can actually join not just two, but three tables together. 2135 01:42:25,720 --> 01:42:26,535 But question first. 2136 01:42:26,535 --> 01:42:28,310 [INDISTINCT SPEECH] 2137 01:42:28,310 --> 01:42:29,810 No, just stretching first. 2138 01:42:29,810 --> 01:42:35,120 So two final ways to execute the same idea, but the first of them that I just 2139 01:42:35,120 --> 01:42:37,850 did is arguably, relatively, simpler. 2140 01:42:37,850 --> 01:42:38,910 You could do this. 2141 01:42:38,910 --> 01:42:42,800 I could select the title from the shows table 2142 01:42:42,800 --> 01:42:49,220 by joining it on the stars table on the shows ID column, 2143 01:42:49,220 --> 01:42:52,820 equaling the stars tables show ID column. 2144 01:42:52,820 --> 01:42:58,460 And then I can further join it on the people table, on stars dot person ID 2145 01:42:58,460 --> 01:43:00,453 equaling people.id. 2146 01:43:00,453 --> 01:43:03,620 This is a mouthful, and even I am kind of crossing my fingers that it didn't 2147 01:43:03,620 --> 01:43:06,510 screw up when transcribing it from my printout here, 2148 01:43:06,510 --> 01:43:11,180 but what I'm effectively doing is joining one, two, three tables all 2149 01:43:11,180 --> 01:43:17,570 together by telling the database how to join the shows table, with the stars 2150 01:43:17,570 --> 01:43:19,070 table, and the people table. 2151 01:43:19,070 --> 01:43:22,940 Specifically, the way to bridge that picture, per the diagram, 2152 01:43:22,940 --> 01:43:28,890 is to specify that shows.id should be lined up with stars.show ID. 2153 01:43:28,890 --> 01:43:34,380 And stars.person ID should be lined up with people ID. 2154 01:43:34,380 --> 01:43:35,008 And that's it. 2155 01:43:35,008 --> 01:43:37,050 That essentially allows us to connect these three 2156 01:43:37,050 --> 01:43:39,000 tables with their common fields. 2157 01:43:39,000 --> 01:43:42,390 If I hit enter now, I'm going to get back, somewhat slowly 2158 01:43:42,390 --> 01:43:47,160 actually, a really long list, with some duplication, 2159 01:43:47,160 --> 01:43:49,830 of all of those particular shows. 2160 01:43:49,830 --> 01:43:52,182 Actually, all shows in the database because I 2161 01:43:52,182 --> 01:43:53,640 didn't practice what I'm preaching. 2162 01:43:53,640 --> 01:43:56,140 I wanted to search for just Steve Carell, what you're seeing 2163 01:43:56,140 --> 01:43:58,950 is the entirety of the tens of thousands of TV shows. 2164 01:43:58,950 --> 01:44:00,360 Control C is your friend. 2165 01:44:00,360 --> 01:44:05,070 Let me go ahead and reload SQLite, and let me type that again. 2166 01:44:05,070 --> 01:44:06,690 Let me type that once more, sorry. 2167 01:44:06,690 --> 01:44:13,320 Select title from shows, join stars on shows.id equals stars.show ID, 2168 01:44:13,320 --> 01:44:20,872 join people on stars.person ID equals people ID where-- 2169 01:44:20,872 --> 01:44:22,080 this was the part I left out. 2170 01:44:22,080 --> 01:44:24,990 Name equals Steve Carell. 2171 01:44:24,990 --> 01:44:28,710 And if I didn't screw up by typing so fast-- enter. 2172 01:44:28,710 --> 01:44:31,500 Now we get, a little more slowly than before, 2173 01:44:31,500 --> 01:44:34,435 those same shows that Steve Carell starred in. 2174 01:44:34,435 --> 01:44:36,810 So this is just to say there's another way of doing this. 2175 01:44:36,810 --> 01:44:39,420 But maybe a third way, which is a little simpler than that, 2176 01:44:39,420 --> 01:44:41,250 explicitly joining them in that way. 2177 01:44:41,250 --> 01:44:47,490 You can, alternatively, still do this, select title from shows, stars, 2178 01:44:47,490 --> 01:44:48,270 and people. 2179 01:44:48,270 --> 01:44:51,480 You just literally enumerate, with commas, what three tables 2180 01:44:51,480 --> 01:44:52,920 you want to join somehow. 2181 01:44:52,920 --> 01:44:55,020 And then you can instead of using join, you 2182 01:44:55,020 --> 01:44:58,560 can just use where clauses to make sure they line up properly. 2183 01:44:58,560 --> 01:45:03,630 You can say, where shows.id equals stars.show ID, 2184 01:45:03,630 --> 01:45:09,180 and people ID equals stars.person ID, and name 2185 01:45:09,180 --> 01:45:11,760 equals, quote unquote, "Steve Carell." 2186 01:45:11,760 --> 01:45:14,430 And I realize this is hard to keep track of everything now, 2187 01:45:14,430 --> 01:45:16,353 all these darn different ways to do this, 2188 01:45:16,353 --> 01:45:19,020 this is just to say that there's different approaches to solving 2189 01:45:19,020 --> 01:45:20,728 the same problem and for different people 2190 01:45:20,728 --> 01:45:23,970 you might think about things a little more differently than someone else. 2191 01:45:23,970 --> 01:45:27,630 If I hit enter here, this too it's a little slower than the nested selects 2192 01:45:27,630 --> 01:45:31,590 it seems, but it does, in fact, give us that same answer. 2193 01:45:31,590 --> 01:45:34,440 And just for thoroughness, if I go back to our diagram. 2194 01:45:34,440 --> 01:45:38,700 Besides the tables we've seen, there's actually another writers table 2195 01:45:38,700 --> 01:45:39,480 in there, as well. 2196 01:45:39,480 --> 01:45:42,480 If you're curious to see what writers is, let's just glance at that real 2197 01:45:42,480 --> 01:45:42,980 fast. 2198 01:45:42,980 --> 01:45:47,010 In VS Code, let me do .schema writers, and it's actually almost the same 2199 01:45:47,010 --> 01:45:48,330 as stars. 2200 01:45:48,330 --> 01:45:52,440 Except this case, in writers, we are associating a show 2201 01:45:52,440 --> 01:45:54,750 with a person ID, both of which in this context 2202 01:45:54,750 --> 01:45:58,770 are foreign keys that indeed reference back shows and people ID. 2203 01:45:58,770 --> 01:46:02,520 Which, again, if I do this schema stars, which we didn't see before, 2204 01:46:02,520 --> 01:46:04,120 is structurally the same. 2205 01:46:04,120 --> 01:46:06,540 So the relationship is essentially embodied, in this case, 2206 01:46:06,540 --> 01:46:12,690 by IMDb, and in turn by us, by way of the tables name, TV stars or writers 2207 01:46:12,690 --> 01:46:14,410 thereof. 2208 01:46:14,410 --> 01:46:14,940 All right. 2209 01:46:14,940 --> 01:46:16,000 I know that's a lot. 2210 01:46:16,000 --> 01:46:19,720 Any questions before we take a higher level step back? 2211 01:46:19,720 --> 01:46:20,220 Yeah. 2212 01:46:20,220 --> 01:46:22,992 [INDISTINCT SPEECH] 2213 01:46:22,992 --> 01:46:25,183 2214 01:46:25,183 --> 01:46:25,850 A good question. 2215 01:46:25,850 --> 01:46:28,400 Does SQL provide any way to figure out the mapping between tables 2216 01:46:28,400 --> 01:46:29,540 without looking at the database? 2217 01:46:29,540 --> 01:46:30,380 Short answer, no. 2218 01:46:30,380 --> 01:46:32,668 Like, this is the dia-- well, that's not quite fair. 2219 01:46:32,668 --> 01:46:34,460 Depending on the database you're using, you 2220 01:46:34,460 --> 01:46:36,080 might be able to click a button, for instance, 2221 01:46:36,080 --> 01:46:38,997 and get a nice pretty picture like this, that shows the relationships. 2222 01:46:38,997 --> 01:46:40,730 Indeed, we use software to generate this. 2223 01:46:40,730 --> 01:46:42,980 We didn't do this diagram, for instance, by hand. 2224 01:46:42,980 --> 01:46:45,530 SQLite itself does not provide you with that. 2225 01:46:45,530 --> 01:46:48,425 In SQLite the best you can do is run .schema. 2226 01:46:48,425 --> 01:46:50,300 And if you don't specify a table name, you'll 2227 01:46:50,300 --> 01:46:52,120 get everything from the table described. 2228 01:46:52,120 --> 01:46:53,870 Once you get comfortable with SQL, though, 2229 01:46:53,870 --> 01:46:56,240 the idea is that you can read the text and sort of infer what 2230 01:46:56,240 --> 01:46:57,080 the structure is. 2231 01:46:57,080 --> 01:46:58,850 But, yes, there are graphical programs can 2232 01:46:58,850 --> 01:47:00,710 generate prettier pictures like this, but it 2233 01:47:00,710 --> 01:47:02,900 depends on the software you're using. 2234 01:47:02,900 --> 01:47:03,470 Yeah. 2235 01:47:03,470 --> 01:47:06,600 [INDISTINCT SPEECH] 2236 01:47:06,600 --> 01:47:10,590 SQL is not case sensitive with respect to its keywords. 2237 01:47:10,590 --> 01:47:13,920 But table names, and other things that you chose, 2238 01:47:13,920 --> 01:47:16,452 you've got to be consistent with how you capitalize them. 2239 01:47:16,452 --> 01:47:19,410 I've done everything in lowercase, but that tends to be one convention. 2240 01:47:19,410 --> 01:47:23,970 Other people might use camel case, where you alternate caps appropriately. 2241 01:47:23,970 --> 01:47:26,737 All right, so let's take a higher level look at this, 2242 01:47:26,737 --> 01:47:29,820 and also consider some of the actual real world problems that, tragically, 2243 01:47:29,820 --> 01:47:32,200 are still with us in some form today. 2244 01:47:32,200 --> 01:47:34,980 Notice that some of the queries we executed a bit ago 2245 01:47:34,980 --> 01:47:38,440 were actually relatively slow whereas I hit enter and got a lot of my results 2246 01:47:38,440 --> 01:47:38,940 like that. 2247 01:47:38,940 --> 01:47:41,130 Those last two queries, where I was joining 2248 01:47:41,130 --> 01:47:43,500 all of those tables looking for Steve Carell's shows 2249 01:47:43,500 --> 01:47:45,138 were actually relatively slow. 2250 01:47:45,138 --> 01:47:46,680 And let's try to take a simpler case. 2251 01:47:46,680 --> 01:47:47,310 Let me do this. 2252 01:47:47,310 --> 01:47:51,480 In SQLite you can actually time your queries by running .timer and then 2253 01:47:51,480 --> 01:47:52,330 turning it on. 2254 01:47:52,330 --> 01:47:55,620 This is just going to keep track now of how many seconds or milliseconds any 2255 01:47:55,620 --> 01:47:57,787 of your queries take if you're curious to figure out 2256 01:47:57,787 --> 01:47:58,930 what's faster, what's slow. 2257 01:47:58,930 --> 01:48:00,930 Let me do something relatively simple like this, 2258 01:48:00,930 --> 01:48:04,860 select star from shows where the title of the show equals, 2259 01:48:04,860 --> 01:48:07,198 quote unquote, "The Office," semicolon. 2260 01:48:07,198 --> 01:48:08,740 All right, that was pretty darn fast. 2261 01:48:08,740 --> 01:48:13,240 And it took 0.044 seconds in reality. 2262 01:48:13,240 --> 01:48:16,210 If you care further, you can break this time down into user time, 2263 01:48:16,210 --> 01:48:18,708 like spent in my account, versus system time, which means 2264 01:48:18,708 --> 01:48:20,500 spent in the operating system, essentially. 2265 01:48:20,500 --> 01:48:28,120 But we'll focus on the real wall-clock time of 0.044 seconds. 2266 01:48:28,120 --> 01:48:29,290 Pretty darn fast. 2267 01:48:29,290 --> 01:48:32,770 But we can actually do better because it turns out in the world of SQL, 2268 01:48:32,770 --> 01:48:37,690 you can create what are called indexes, which is a fancy way of saying a data 2269 01:48:37,690 --> 01:48:41,170 structure that makes it faster to perform queries, 2270 01:48:41,170 --> 01:48:43,660 like selects, and even other queries as well. 2271 01:48:43,660 --> 01:48:46,180 In an index, you can use syntax like this, 2272 01:48:46,180 --> 01:48:49,420 create index, the name of the index, on a specific table 2273 01:48:49,420 --> 01:48:50,710 on the specific columns. 2274 01:48:50,710 --> 01:48:54,340 And what I mean by this is if you know that your application, like imdb.com 2275 01:48:54,340 --> 01:48:58,480 or their mobile app, is going to search on certain columns frequently, 2276 01:48:58,480 --> 01:49:01,240 you can prepare the database in advance to build up 2277 01:49:01,240 --> 01:49:04,450 some fancy data structures in memory so it can get back answers 2278 01:49:04,450 --> 01:49:06,080 even faster than that. 2279 01:49:06,080 --> 01:49:08,200 So case in point, let me go back to VS Code 2280 01:49:08,200 --> 01:49:12,700 here, and let me create an index called, whatever, title index, 2281 01:49:12,700 --> 01:49:17,180 for instance on the shows table, specifically on the title column. 2282 01:49:17,180 --> 01:49:20,830 So that's simply the syntax for telling the database in advance, 2283 01:49:20,830 --> 01:49:23,350 index this column because I'm going to do a lot of searching 2284 01:49:23,350 --> 01:49:25,210 on it so I want the queries to be fast. 2285 01:49:25,210 --> 01:49:26,020 Enter. 2286 01:49:26,020 --> 01:49:26,920 It took a moment. 2287 01:49:26,920 --> 01:49:30,010 It took almost half a second, but the index I only have to build once. 2288 01:49:30,010 --> 01:49:32,320 Even though, technically, you have to maintain it over time if you're 2289 01:49:32,320 --> 01:49:34,210 doing updates, deletes, and inserts. 2290 01:49:34,210 --> 01:49:36,610 But now let me do the same query as before, 2291 01:49:36,610 --> 01:49:41,530 select star from shows where title equals, quote unquote, "The Office." 2292 01:49:41,530 --> 01:49:45,580 Previously, that query took 0.044 seconds. 2293 01:49:45,580 --> 01:49:47,080 When I hit enter now? 2294 01:49:47,080 --> 01:49:47,890 Boom. 2295 01:49:47,890 --> 01:49:50,770 I mean, it takes no time at all, or less time 2296 01:49:50,770 --> 01:49:53,998 than it's even keeping track of in terms of significant digits. 2297 01:49:53,998 --> 01:49:57,040 Now, that might not seem like a big deal to us humans and our human eyes, 2298 01:49:57,040 --> 01:49:59,620 but if you've got hundreds, thousands, millions of users, 2299 01:49:59,620 --> 01:50:03,580 as maybe the real imdb.com has, you just saved yourself 2300 01:50:03,580 --> 01:50:05,900 a fortune in servers and complexity. 2301 01:50:05,900 --> 01:50:06,400 Why? 2302 01:50:06,400 --> 01:50:08,320 Because the same server can, clearly, now 2303 01:50:08,320 --> 01:50:12,400 handle way more people per unit of time, per second, because each query 2304 01:50:12,400 --> 01:50:14,027 takes less and less time. 2305 01:50:14,027 --> 01:50:16,360 I mean, we're all too familiar here and at Yale, surely, 2306 01:50:16,360 --> 01:50:19,087 with certain University applications that are just so darn slow. 2307 01:50:19,087 --> 01:50:20,920 When you click a button and the stupid thing 2308 01:50:20,920 --> 01:50:23,740 spins and makes you wait and wait, a lot of the time that 2309 01:50:23,740 --> 01:50:27,820 can be explained by poor database design, or databases that might not 2310 01:50:27,820 --> 01:50:29,200 have been indexed properly. 2311 01:50:29,200 --> 01:50:31,492 So when you're searching for some course, for instance, 2312 01:50:31,492 --> 01:50:33,310 in the course catalog, it's taking forever 2313 01:50:33,310 --> 01:50:35,170 because underneath the hood it's essentially 2314 01:50:35,170 --> 01:50:37,730 doing linear search over everything. 2315 01:50:37,730 --> 01:50:40,030 But, by contrast, in a relational database, 2316 01:50:40,030 --> 01:50:42,670 when you create an index in advance because you 2317 01:50:42,670 --> 01:50:45,640 have a hunch that maybe users are going to search on that column, 2318 01:50:45,640 --> 01:50:47,560 like show titles. 2319 01:50:47,560 --> 01:50:49,270 Essentially, you're building up in memory 2320 01:50:49,270 --> 01:50:51,700 what's called a b-tree, which is not a binary tree. 2321 01:50:51,700 --> 01:50:54,370 It's still a tree, though, if you think back to week five. 2322 01:50:54,370 --> 01:50:57,820 But it's a very short, fat tree instead, where 2323 01:50:57,820 --> 01:51:02,650 every node might have two, or three, or 30 children, which essentially 2324 01:51:02,650 --> 01:51:04,820 pulls the height of the tree way up. 2325 01:51:04,820 --> 01:51:07,990 Which is to say that when you search for some value in a b-tree, 2326 01:51:07,990 --> 01:51:10,040 it's invariably going to be in the leaves. 2327 01:51:10,040 --> 01:51:13,870 So the shorter the tree is, the fewer steps it takes to find the value 2328 01:51:13,870 --> 01:51:14,750 you care about. 2329 01:51:14,750 --> 01:51:17,860 So when you run create table, that kind of data structure 2330 01:51:17,860 --> 01:51:21,680 is being magically created for you by the database 2331 01:51:21,680 --> 01:51:25,540 so it's not a simple linear search through the entire column, 2332 01:51:25,540 --> 01:51:27,350 top to bottom. 2333 01:51:27,350 --> 01:51:31,330 So with that said, we can see this really with more complicated queries. 2334 01:51:31,330 --> 01:51:33,260 And let me go back to VS Code here. 2335 01:51:33,260 --> 01:51:36,853 Let me propose to run that same, slow query before, 2336 01:51:36,853 --> 01:51:39,520 even though it's fine if you're not comfortable with the syntax. 2337 01:51:39,520 --> 01:51:41,710 It was relatively slow, though, to watch. 2338 01:51:41,710 --> 01:51:46,360 So let's do select title from shows stars and people 2339 01:51:46,360 --> 01:51:56,500 where shows.id equals stars.show ID, and people.id equals stars.dot person ID, 2340 01:51:56,500 --> 01:51:59,050 and name equals Steve Carell. 2341 01:51:59,050 --> 01:52:00,820 So this was the last of those queries. 2342 01:52:00,820 --> 01:52:03,280 That just searches for all of Steve Carell's TV shows 2343 01:52:03,280 --> 01:52:06,430 without using joins explicitly but by just enumerating 2344 01:52:06,430 --> 01:52:10,210 all three tables in question, and then using where to cleverly connect 2345 01:52:10,210 --> 01:52:11,390 the dots, so to speak. 2346 01:52:11,390 --> 01:52:12,770 But my timer is still on. 2347 01:52:12,770 --> 01:52:16,390 So notice now, when I hit enter, it doesn't just feel slow. 2348 01:52:16,390 --> 01:52:19,390 It actually took 2.763 seconds. 2349 01:52:19,390 --> 01:52:20,410 Like, that's slow. 2350 01:52:20,410 --> 01:52:21,250 That's expensive. 2351 01:52:21,250 --> 01:52:22,650 That's going to annoy your users. 2352 01:52:22,650 --> 01:52:25,900 That's going to annoy your students if the database is thinking, and thinking, 2353 01:52:25,900 --> 01:52:27,730 and thinking, and taking that much time. 2354 01:52:27,730 --> 01:52:29,770 But let's note this. 2355 01:52:29,770 --> 01:52:34,080 That same query I just executed touched a bunch of columns, 2356 01:52:34,080 --> 01:52:35,830 and it turns out that whenever you declare 2357 01:52:35,830 --> 01:52:40,330 a primary key in a database, at least in SQLite, you get an index for free. 2358 01:52:40,330 --> 01:52:43,060 Primary keys are automatically indexed, so searching 2359 01:52:43,060 --> 01:52:44,320 for a number in that column? 2360 01:52:44,320 --> 01:52:45,040 Super fast. 2361 01:52:45,040 --> 01:52:48,580 Not linear search, it's something logarithmic, most likely, or, ideally, 2362 01:52:48,580 --> 01:52:51,100 closer to something like constant time even. 2363 01:52:51,100 --> 01:52:54,940 Here, though, I'm touching not just shows.id, 2364 01:52:54,940 --> 01:52:58,840 but I'm also filtering on stars.show ID, so a foreign key. 2365 01:52:58,840 --> 01:53:01,000 Foreign keys are not indexed by default. 2366 01:53:01,000 --> 01:53:02,260 I'm looking at people ID. 2367 01:53:02,260 --> 01:53:03,190 That's a primary key. 2368 01:53:03,190 --> 01:53:03,880 That's indexed. 2369 01:53:03,880 --> 01:53:08,200 But stars.person ID, not indexed by default, is a foreign key. 2370 01:53:08,200 --> 01:53:11,890 Lastly, I'm filtering by name in the people table. 2371 01:53:11,890 --> 01:53:17,140 Names are not indexed by default. So I'm touching three separate columns, 2372 01:53:17,140 --> 01:53:20,980 two foreign keys, one name field, that have no fancy tree 2373 01:53:20,980 --> 01:53:22,570 structure built for them. 2374 01:53:22,570 --> 01:53:23,900 But I can do that. 2375 01:53:23,900 --> 01:53:27,190 Let me go down to my terminal here. 2376 01:53:27,190 --> 01:53:30,550 Let me create one index called, say, person index, though, I 2377 01:53:30,550 --> 01:53:35,750 could call it anything I want, on the stars table on the person ID column. 2378 01:53:35,750 --> 01:53:38,140 So that indexes that foreign key. 2379 01:53:38,140 --> 01:53:41,050 Took 1.7 seconds, but I only have to do it once. 2380 01:53:41,050 --> 01:53:46,720 Create index, show index on the stars table show ID. 2381 01:53:46,720 --> 01:53:49,210 So another foreign key is getting its own index. 2382 01:53:49,210 --> 01:53:51,490 Took 1.4 seconds, but it's a one time thing. 2383 01:53:51,490 --> 01:53:54,070 And, lastly, let's index all of those actors names. 2384 01:53:54,070 --> 01:53:59,770 Create index called name index on the people table on the name column. 2385 01:53:59,770 --> 01:54:00,460 Enter. 2386 01:54:00,460 --> 01:54:04,520 That took 1.0 seconds, but a one time operation. 2387 01:54:04,520 --> 01:54:07,850 So, essentially, I've built up like three of these trees in memory 2388 01:54:07,850 --> 01:54:09,920 now, specifically for these columns. 2389 01:54:09,920 --> 01:54:15,160 So now recall, previously, that slow query. 2390 01:54:15,160 --> 01:54:18,610 If I scroll back up, that took, what was it? 2391 01:54:18,610 --> 01:54:20,570 2.7 seconds? 2392 01:54:20,570 --> 01:54:21,070 I think. 2393 01:54:21,070 --> 01:54:23,330 2.7 seconds, previously. 2394 01:54:23,330 --> 01:54:29,080 But if I now run the same thing, select title from show 2395 01:54:29,080 --> 01:54:33,850 stars people where shows ID equals stars.show show ID, 2396 01:54:33,850 --> 01:54:37,240 and people ID equals stars.person person ID, 2397 01:54:37,240 --> 01:54:42,730 and name equals Steve, so close, Carell. 2398 01:54:42,730 --> 01:54:47,320 Same query as before previously took 2.7 seconds, which was the most annoying 2399 01:54:47,320 --> 01:54:47,890 of them yet. 2400 01:54:47,890 --> 01:54:49,420 Now, when I hit enter? 2401 01:54:49,420 --> 01:54:50,230 Boom. 2402 01:54:50,230 --> 01:54:55,240 0.001 seconds, which is the difference, again, between bunches 2403 01:54:55,240 --> 01:55:00,970 of linear searches and, in this case, searching a fancier, week-five style, 2404 01:55:00,970 --> 01:55:02,650 b-tree, in this case. 2405 01:55:02,650 --> 01:55:04,010 So indexes matter. 2406 01:55:04,010 --> 01:55:07,120 So what's then, maybe, the trade off here? 2407 01:55:07,120 --> 01:55:09,680 Like, why not index every column in every table? 2408 01:55:09,680 --> 01:55:10,930 Because this is feeling great. 2409 01:55:10,930 --> 01:55:14,110 Like, we're speeding things up by factors of 1,000, practically. 2410 01:55:14,110 --> 01:55:15,190 What's the trade off? 2411 01:55:15,190 --> 01:55:16,240 [INDISTINCT SPEECH] 2412 01:55:16,240 --> 01:55:18,010 Lots and lots of memory, or space. 2413 01:55:18,010 --> 01:55:20,200 Yeah, so you're just trading off space for time, 2414 01:55:20,200 --> 01:55:23,200 which we said a couple of weeks ago is an acceptable trade off depending 2415 01:55:23,200 --> 01:55:24,610 on what resources you have. 2416 01:55:24,610 --> 01:55:28,160 But it's probably an overcorrection to index everything, 2417 01:55:28,160 --> 01:55:31,445 especially since it will slightly slow down inserts, updates, and deletes 2418 01:55:31,445 --> 01:55:34,570 because you have to maintain this tree structure so it doesn't devolve back 2419 01:55:34,570 --> 01:55:37,600 into a linked list or something linear. 2420 01:55:37,600 --> 01:55:43,900 But, in fact, being selective about it is perhaps the best strategy. 2421 01:55:43,900 --> 01:55:44,440 All right. 2422 01:55:44,440 --> 01:55:48,250 So that we can now solve some other problems more generally, 2423 01:55:48,250 --> 01:55:49,780 let me just connect two dots. 2424 01:55:49,780 --> 01:55:53,740 Even though we focused today on SQL, specifically for databases, 2425 01:55:53,740 --> 01:55:56,590 you can actually combine one language with another 2426 01:55:56,590 --> 01:55:58,010 and solve different problems. 2427 01:55:58,010 --> 01:55:59,420 And so, in fact, let me do this. 2428 01:55:59,420 --> 01:56:02,530 Let me revisit our favorites.py from earlier, 2429 01:56:02,530 --> 01:56:07,610 but let me actually now use the favorites database as follows. 2430 01:56:07,610 --> 01:56:09,980 Let me go into VS Code here. 2431 01:56:09,980 --> 01:56:16,810 Let me remove favorites.db because if you recall it made everyone's count 2432 01:56:16,810 --> 01:56:18,760 previously-- 2433 01:56:18,760 --> 01:56:19,510 became Fiftyville. 2434 01:56:19,510 --> 01:56:21,640 So let me remove that file. 2435 01:56:21,640 --> 01:56:26,140 Let me run SQLite3 on favorites.db again. 2436 01:56:26,140 --> 01:56:27,940 Let me create the file anew. 2437 01:56:27,940 --> 01:56:30,490 Let me set the mode to CSV again. 2438 01:56:30,490 --> 01:56:34,630 Let me import that file called favorites.csv into an identical table 2439 01:56:34,630 --> 01:56:36,160 as before called favorites. 2440 01:56:36,160 --> 01:56:37,150 And then quit. 2441 01:56:37,150 --> 01:56:40,970 So I've just reset things to my backup, if you will, from the CSV file. 2442 01:56:40,970 --> 01:56:43,300 So, I again have a favorites.db. 2443 01:56:43,300 --> 01:56:48,033 Let me now minimize my terminal window here and reopen favorites.py, 2444 01:56:48,033 --> 01:56:50,200 and let me just go ahead and get rid of that version 2445 01:56:50,200 --> 01:56:53,830 entirely and focus this time on not talking to a CSV file, 2446 01:56:53,830 --> 01:56:55,780 opening it, iterating over the rows. 2447 01:56:55,780 --> 01:56:58,930 We can actually use Python to execute SQL queries 2448 01:56:58,930 --> 01:57:00,950 and kind of get the best of both worlds. 2449 01:57:00,950 --> 01:57:03,160 So let me do this here. 2450 01:57:03,160 --> 01:57:09,580 Let me, from the CS50 Python library, import our own SQL functionality. 2451 01:57:09,580 --> 01:57:12,052 And this is a training wheel we still provide for SQL 2452 01:57:12,052 --> 01:57:14,260 because it's just much easier than using the industry 2453 01:57:14,260 --> 01:57:15,760 standard libraries for SQL. 2454 01:57:15,760 --> 01:57:18,540 It just is painful for simple tasks. 2455 01:57:18,540 --> 01:57:21,380 So now, let me create a variable called db, for database. 2456 01:57:21,380 --> 01:57:24,410 I'm going to set it equal to this SQL function that CS50 wrote. 2457 01:57:24,410 --> 01:57:28,730 And this is going to look weird, but the way you open a DB file in Python, 2458 01:57:28,730 --> 01:57:31,580 whether it's with CS50 library or someone else's, you 2459 01:57:31,580 --> 01:57:36,560 say SQLite colon slash, slash, slash favorites.db. 2460 01:57:36,560 --> 01:57:38,780 So weird syntax, but it's commonplace. 2461 01:57:38,780 --> 01:57:41,900 Three slashes, not two like a URL is usually. 2462 01:57:41,900 --> 01:57:44,300 Now let's use a variable called favorite and set 2463 01:57:44,300 --> 01:57:48,020 it equal to the return value of input by asking the human for their favorite TV 2464 01:57:48,020 --> 01:57:48,650 show. 2465 01:57:48,650 --> 01:57:51,860 And now, previously we opened up the CSV file, 2466 01:57:51,860 --> 01:57:56,270 iterated over it looking for the show they typed in, show they typed in. 2467 01:57:56,270 --> 01:57:59,060 Or, rather-- Oh, sorry, not the show. 2468 01:57:59,060 --> 01:58:01,610 The problem, or the problem that we typed in. 2469 01:58:01,610 --> 01:58:03,150 Let me instead do this. 2470 01:58:03,150 --> 01:58:07,290 Let me set, use a db variables, execute function, 2471 01:58:07,290 --> 01:58:11,360 which comes with the CS50 library, and let me execute this SQL query. 2472 01:58:11,360 --> 01:58:20,250 Select count star as n from favorites where problem equals question mark. 2473 01:58:20,250 --> 01:58:23,000 And the question mark is a little weird, but think of it, for now, 2474 01:58:23,000 --> 01:58:24,902 like C's percent s. 2475 01:58:24,902 --> 01:58:26,750 Comma favorite. 2476 01:58:26,750 --> 01:58:29,840 I'm going to plug in whatever the human typed in into that query where 2477 01:58:29,840 --> 01:58:30,920 the question mark is. 2478 01:58:30,920 --> 01:58:34,340 So no percent s, I'm using a question mark in this world of SQL. 2479 01:58:34,340 --> 01:58:36,883 This is going to give me back a temporary table, 2480 01:58:36,883 --> 01:58:39,800 and I'm going to store that temporary table in a variable called rows. 2481 01:58:39,800 --> 01:58:42,840 Because the temporary table is essentially zero or more rows, 2482 01:58:42,840 --> 01:58:44,660 so I'm going to name my variable rows. 2483 01:58:44,660 --> 01:58:48,170 And then if I want to get back the first and only row, 2484 01:58:48,170 --> 01:58:51,860 I can literally do row equals rows bracket zero just to be pedantic, 2485 01:58:51,860 --> 01:58:55,970 just to get, literally, the first row in that temporary table, or result set. 2486 01:58:55,970 --> 01:58:59,810 And now if I want to print back the n column therein, 2487 01:58:59,810 --> 01:59:03,590 I can do print row, quote unquote, "n." 2488 01:59:03,590 --> 01:59:06,420 So let me take a step back and do this. 2489 01:59:06,420 --> 01:59:12,710 Let me go into SQLite3 of favorites.db and let me literally type something 2490 01:59:12,710 --> 01:59:13,580 like this. 2491 01:59:13,580 --> 01:59:16,640 Here I'm in SQLite3 at the bottom and the problem I'm searching, 2492 01:59:16,640 --> 01:59:19,520 for instance, is Scratch semicolon. 2493 01:59:19,520 --> 01:59:23,480 Notice that is how in SQL, in the command line of SQLite3, 2494 01:59:23,480 --> 01:59:24,960 I can get back the answer I want. 2495 01:59:24,960 --> 01:59:26,960 But what if I want to make a Python program that 2496 01:59:26,960 --> 01:59:29,480 queries the database for that value and then prints something out? 2497 01:59:29,480 --> 01:59:31,790 In two weeks time, what if I want to do that to make 2498 01:59:31,790 --> 01:59:35,300 a web application, like imdb.com or a mobile app, 2499 01:59:35,300 --> 01:59:37,490 that writes code to generate a user interface 2500 01:59:37,490 --> 01:59:39,320 but that pulls the data from a database? 2501 01:59:39,320 --> 01:59:42,170 You're not going to have your human users using SQLite3, 2502 01:59:42,170 --> 01:59:44,280 you're going to generate the output for them. 2503 01:59:44,280 --> 01:59:46,790 So let me close my terminal window here. 2504 01:59:46,790 --> 01:59:49,190 Rather, let me close out of SQLite 3. 2505 01:59:49,190 --> 01:59:53,840 Let me now run Python of favorites.py, enter. 2506 01:59:53,840 --> 01:59:56,450 I'm prompted for my favorite using the input function. 2507 01:59:56,450 --> 02:00:00,290 I type in Scratch and hit enter, and there's my 34. 2508 02:00:00,290 --> 02:00:02,650 So this is an incredibly common practice, 2509 02:00:02,650 --> 02:00:04,400 to use one language for what it's best at. 2510 02:00:04,400 --> 02:00:07,430 Like SQL is best at reading data from databases, 2511 02:00:07,430 --> 02:00:11,060 Python is, maybe, best in this case for creating a user interface 2512 02:00:11,060 --> 02:00:13,040 or, eventually, making a web application. 2513 02:00:13,040 --> 02:00:16,560 But it's certainly fine to use one inside of the other. 2514 02:00:16,560 --> 02:00:19,740 The documentation for this library, if and when you get curious, 2515 02:00:19,740 --> 02:00:23,520 is it this URL here, along with a lot of CS50's own documentation. 2516 02:00:23,520 --> 02:00:26,340 But there are some problems, nonetheless, we 2517 02:00:26,340 --> 02:00:29,760 might encounter in this world, and we thought we'd end on these challenges. 2518 02:00:29,760 --> 02:00:33,420 It turns out that in the world of SQL, even though we haven't touched 2519 02:00:33,420 --> 02:00:36,638 upon this yet, you're generally working not with like hundreds 2520 02:00:36,638 --> 02:00:39,180 of people and their favorite languages and problems, not even 2521 02:00:39,180 --> 02:00:42,600 thousands of movies, but like millions of things in the database, 2522 02:00:42,600 --> 02:00:45,180 like Instagram posts or TikTok videos or the like. 2523 02:00:45,180 --> 02:00:48,510 Those are huge databases with millions of rows. 2524 02:00:48,510 --> 02:00:51,510 The problem with SQL and, really, databases in general, 2525 02:00:51,510 --> 02:00:54,030 is if you have thousands, millions of things 2526 02:00:54,030 --> 02:00:57,765 happening at once, things can get out of order and your math can be wrong. 2527 02:00:57,765 --> 02:00:59,890 You can lose track of how many likes something has. 2528 02:00:59,890 --> 02:01:03,870 And so, for instance, as of last night this remains the most popular Instagram 2529 02:01:03,870 --> 02:01:06,000 post, to date, in the world. 2530 02:01:06,000 --> 02:01:08,368 It was clicked on by so many people. 2531 02:01:08,368 --> 02:01:10,410 And you might think that's pretty straightforward 2532 02:01:10,410 --> 02:01:13,230 to keep track click, click, click, but not 2533 02:01:13,230 --> 02:01:16,870 when there's millions of devices in the world and thousands of servers, 2534 02:01:16,870 --> 02:01:20,000 probably, at Meta running Instagram's back end. 2535 02:01:20,000 --> 02:01:23,290 So how do you actually keep track of all of these likes? 2536 02:01:23,290 --> 02:01:26,980 Well, maybe, Meta is using code like this 2537 02:01:26,980 --> 02:01:30,370 to implement the counter for likes on Instagram. 2538 02:01:30,370 --> 02:01:33,160 Maybe they are using lines of code similar to what we just 2539 02:01:33,160 --> 02:01:37,510 wrote db.execute select the current number of likes 2540 02:01:37,510 --> 02:01:40,750 from the posts table where the ID of the post 2541 02:01:40,750 --> 02:01:43,760 equals whatever the one is that the user clicked on. 2542 02:01:43,760 --> 02:01:48,160 So the post a moment ago, presumably, has a unique ID, a primary key, 2543 02:01:48,160 --> 02:01:51,160 and that number just gets plugged in here when Meta wants to figure out, 2544 02:01:51,160 --> 02:01:53,140 all right, someone just clicked on this post. 2545 02:01:53,140 --> 02:01:56,320 Let's figure out what the current number of likes so we can add 1 to it 2546 02:01:56,320 --> 02:01:57,890 and update the database. 2547 02:01:57,890 --> 02:02:00,910 So this query here gives us a temporary table 2548 02:02:00,910 --> 02:02:04,360 containing the current number of likes before you or someone else clicked. 2549 02:02:04,360 --> 02:02:06,820 Maybe then we just declare a variable called 2550 02:02:06,820 --> 02:02:09,772 likes to get at the first row's likes column. 2551 02:02:09,772 --> 02:02:11,980 So this is just Python syntax, similar to what I did, 2552 02:02:11,980 --> 02:02:15,640 just to get the actual number you care about, like 34 or whatever million 2553 02:02:15,640 --> 02:02:16,480 it is here. 2554 02:02:16,480 --> 02:02:20,650 But then suppose there's a second database query and a third line of code 2555 02:02:20,650 --> 02:02:26,200 that updates the posts table setting the likes equal to this value 2556 02:02:26,200 --> 02:02:29,030 where the ID of the post is this value. 2557 02:02:29,030 --> 02:02:32,870 So these question marks are similar, again, in spirit to printf's percent s. 2558 02:02:32,870 --> 02:02:34,810 They're placeholders for things that are going 2559 02:02:34,810 --> 02:02:37,070 to be plugged in after the commas. 2560 02:02:37,070 --> 02:02:38,860 So if I want to update the number of likes 2561 02:02:38,860 --> 02:02:41,080 to be whatever the current number is plus one, 2562 02:02:41,080 --> 02:02:45,100 I put it there as a second argument and then I plug-in the ID of that post. 2563 02:02:45,100 --> 02:02:49,420 The problem, though, with large systems like the Metas, the Googles, 2564 02:02:49,420 --> 02:02:51,640 the Microsofts, and others of the world, is 2565 02:02:51,640 --> 02:02:54,880 that they are executing code like this on multiple servers, 2566 02:02:54,880 --> 02:02:58,930 thousands of servers, that might be executing slightly out of order. 2567 02:02:58,930 --> 02:03:01,070 One might be faster, one might be slower. 2568 02:03:01,070 --> 02:03:04,990 Which is to say, even though these three lines of code 2569 02:03:04,990 --> 02:03:07,240 represent what should happen when I click on that post 2570 02:03:07,240 --> 02:03:09,670 and you click on that post and you click on that post, 2571 02:03:09,670 --> 02:03:13,600 the lines of code chronologically might get shuffled a little bit. 2572 02:03:13,600 --> 02:03:16,090 Like, maybe, this line of code gets executed for me 2573 02:03:16,090 --> 02:03:18,100 and then it gets executed for you and then you, 2574 02:03:18,100 --> 02:03:20,900 and then the server moves on to the next line of code. 2575 02:03:20,900 --> 02:03:23,890 So it's sort of multitasking, handling lots of users at once. 2576 02:03:23,890 --> 02:03:27,940 The problem with this is that you run into a race condition of sorts, 2577 02:03:27,940 --> 02:03:30,820 where the servers are sort of racing to handle one user 2578 02:03:30,820 --> 02:03:34,580 but other users requests are happening at the same time. 2579 02:03:34,580 --> 02:03:39,220 So the analogy that I was taught years ago in an operating systems 2580 02:03:39,220 --> 02:03:41,860 class actually pertains to something like a refrigerator here. 2581 02:03:41,860 --> 02:03:43,360 So we have a mini refrigerator here. 2582 02:03:43,360 --> 02:03:45,850 Suppose you've got one in your dorm or your house room, 2583 02:03:45,850 --> 02:03:48,340 and you come home one day. 2584 02:03:48,340 --> 02:03:51,860 And you really like milk, so you open the fridge and you look inside, 2585 02:03:51,860 --> 02:03:52,480 and there's-- 2586 02:03:52,480 --> 02:03:53,590 Oh, we're out of milk. 2587 02:03:53,590 --> 02:03:57,160 So you close the fridge, you walk out to CVS, or somewhere else, 2588 02:03:57,160 --> 02:03:58,600 and go to get more milk. 2589 02:03:58,600 --> 02:04:01,240 Meanwhile, though, your roommate comes home, who in this story 2590 02:04:01,240 --> 02:04:03,850 also likes milk, and so they decide that-- 2591 02:04:03,850 --> 02:04:05,620 Oh, I'm out of milk in the fridge. 2592 02:04:05,620 --> 02:04:07,120 So they, maybe, head out. 2593 02:04:07,120 --> 02:04:10,660 Maybe they follow a different path to go get more milk as well. 2594 02:04:10,660 --> 02:04:13,330 Some number of minutes pass, you both come home later on. 2595 02:04:13,330 --> 02:04:14,080 Like, Oh, darn it. 2596 02:04:14,080 --> 02:04:15,830 Now we have twice as much milk as we need. 2597 02:04:15,830 --> 02:04:17,247 We don't really like it that much. 2598 02:04:17,247 --> 02:04:19,720 And some of it's going to go sour now, so it's wasted like. 2599 02:04:19,720 --> 02:04:20,720 We've made a mistake. 2600 02:04:20,720 --> 02:04:22,780 We should not have bought twice as much milk. 2601 02:04:22,780 --> 02:04:27,220 Now, stupid story, but the point is that both of you 2602 02:04:27,220 --> 02:04:30,250 made decisions based on the state of a variable. 2603 02:04:30,250 --> 02:04:33,790 But the problem was that variable was in the process of being 2604 02:04:33,790 --> 02:04:36,430 updated when someone else looked at it. 2605 02:04:36,430 --> 02:04:39,470 The first person in the story was on their way to the store 2606 02:04:39,470 --> 02:04:41,560 so the variable was about to be incremented 2607 02:04:41,560 --> 02:04:43,840 in terms of quantity of milk, but the other person 2608 02:04:43,840 --> 02:04:46,430 didn't know that yet so they too tried to increment it. 2609 02:04:46,430 --> 02:04:48,760 And, in that case, we ended up with too much milk. 2610 02:04:48,760 --> 02:04:52,450 But suppose what might happen here is similar in spirit. 2611 02:04:52,450 --> 02:04:56,560 Suppose that post, at some point in time, had just 1,000,000 likes. 2612 02:04:56,560 --> 02:05:00,310 And suppose this line of code got executed for me, for you, and for you 2613 02:05:00,310 --> 02:05:02,050 after all three of us clicked on it. 2614 02:05:02,050 --> 02:05:05,530 Well, the value of our likes variable in Meta servers 2615 02:05:05,530 --> 02:05:08,020 might be 1,000,000, 1,000,000, and 1,000,000. 2616 02:05:08,020 --> 02:05:11,470 They therefore update 1,000,000 to be 1,000,000 plus one. 2617 02:05:11,470 --> 02:05:14,890 And so what they update the database to be is 1,000,001, 2618 02:05:14,890 --> 02:05:16,080 but they do it three times. 2619 02:05:16,080 --> 02:05:18,690 1,000,001, 1,000,001, 1,000,001. 2620 02:05:18,690 --> 02:05:21,900 But they've lost two of those likes because they 2621 02:05:21,900 --> 02:05:26,040 might have inspected the variable while some other server, some other user's 2622 02:05:26,040 --> 02:05:28,510 like, was being processed. 2623 02:05:28,510 --> 02:05:31,350 So long story short, when you have lots of data 2624 02:05:31,350 --> 02:05:34,365 on lots of servers, all of which is happening very quickly, 2625 02:05:34,365 --> 02:05:36,240 you run into these so called race conditions. 2626 02:05:36,240 --> 02:05:39,750 And code like this can be dangerous, even though it might not 2627 02:05:39,750 --> 02:05:41,460 look incorrect at a glance. 2628 02:05:41,460 --> 02:05:44,280 Thankfully, in the world of SQL, though you won't generally 2629 02:05:44,280 --> 02:05:47,700 have to do this certainly for problem sets sake, there are solutions to this. 2630 02:05:47,700 --> 02:05:51,210 But too many engineers in the world don't know this, don't remember this, 2631 02:05:51,210 --> 02:05:53,580 or don't appreciate this reality. 2632 02:05:53,580 --> 02:05:55,920 There are keywords in certain databases that 2633 02:05:55,920 --> 02:05:59,610 let you instead begin a transaction, which means, essentially, 2634 02:05:59,610 --> 02:06:02,970 that these three lines of code should either all 2635 02:06:02,970 --> 02:06:05,370 happen together or not at all. 2636 02:06:05,370 --> 02:06:06,660 They should be atomic. 2637 02:06:06,660 --> 02:06:09,390 That is to say, they should all happen without interruption 2638 02:06:09,390 --> 02:06:11,340 or they just shouldn't happen at all. 2639 02:06:11,340 --> 02:06:16,000 And that ensures that the math does not go wrong 2640 02:06:16,000 --> 02:06:18,820 because my like will get counted, and then you're like, 2641 02:06:18,820 --> 02:06:22,390 and then you're like, as opposed to them being intermingled and lost 2642 02:06:22,390 --> 02:06:23,630 track of accordingly. 2643 02:06:23,630 --> 02:06:26,590 So in Python, using the CS50 library, you 2644 02:06:26,590 --> 02:06:30,280 could wrap these three lines of code by saying begin transaction 2645 02:06:30,280 --> 02:06:33,580 to the database, commit the transaction to the database, 2646 02:06:33,580 --> 02:06:37,867 and that relatively simple solution avoids this problem of race conditions. 2647 02:06:37,867 --> 02:06:38,950 But this, too, is a topic. 2648 02:06:38,950 --> 02:06:41,020 If you Google, invariably you'll see that this 2649 02:06:41,020 --> 02:06:46,000 is a problem that has hit various servers and apps over time. 2650 02:06:46,000 --> 02:06:49,660 But there's one other problem that the world is still not very good at, 2651 02:06:49,660 --> 02:06:52,060 and that's known as a SQL injection attack. 2652 02:06:52,060 --> 02:06:55,690 And it turns out that even with what we've been doing, even code like this 2653 02:06:55,690 --> 02:07:00,160 here, is all too easily vulnerable to being hacked, if you will. 2654 02:07:00,160 --> 02:07:04,670 That is misused in some way, unless you practice what I'm preaching, 2655 02:07:04,670 --> 02:07:07,070 which is using placeholders like this. 2656 02:07:07,070 --> 02:07:10,570 It turns out that it's very dangerous to take user input in, generally. 2657 02:07:10,570 --> 02:07:13,150 Like, most of your users might be nice, good people, 2658 02:07:13,150 --> 02:07:16,330 but there's always going to be someone who's malicious or curious 2659 02:07:16,330 --> 02:07:19,670 or just execute something you don't expect, and things can go wrong. 2660 02:07:19,670 --> 02:07:21,940 So in the world of SQL, here's what can happen. 2661 02:07:21,940 --> 02:07:24,890 For instance, here is the login screen for Yale accounts. 2662 02:07:24,890 --> 02:07:27,220 Here's the comparable screen for Harvard accounts. 2663 02:07:27,220 --> 02:07:29,803 And all of us are in their habit of using one of these screens 2664 02:07:29,803 --> 02:07:31,150 or something similar or another. 2665 02:07:31,150 --> 02:07:34,450 You're often asked for a username or email address and a password. 2666 02:07:34,450 --> 02:07:38,950 But suppose that Harvard or Yale or Google or Microsoft or wherever, 2667 02:07:38,950 --> 02:07:42,280 are taking user input from forms like this, be it on the web 2668 02:07:42,280 --> 02:07:46,540 or on a mobile app, and they're just plugging your input into a SQL query 2669 02:07:46,540 --> 02:07:50,320 that they wrote in advance that just is waiting for your username or password 2670 02:07:50,320 --> 02:07:53,980 to be plugged in so they can ask a complete query of the database. 2671 02:07:53,980 --> 02:07:57,850 It turns out that if I'm a bit malicious, or curious, 2672 02:07:57,850 --> 02:08:01,810 I could, maybe, type in some funky syntax to fields like this 2673 02:08:01,810 --> 02:08:04,570 that I know have special meaning to certain databases. 2674 02:08:04,570 --> 02:08:09,490 And it turns out in the world of SQL, SQLite in particular, single quotes 2675 02:08:09,490 --> 02:08:10,570 are clearly important. 2676 02:08:10,570 --> 02:08:12,850 Because I've been using them all day long for strings, 2677 02:08:12,850 --> 02:08:16,745 and I didn't mention this yet, but the comment character in SQLite is dash, 2678 02:08:16,745 --> 02:08:17,245 dash. 2679 02:08:17,245 --> 02:08:20,203 If you want the rest of a line to be ignored, you just say, dash, dash. 2680 02:08:20,203 --> 02:08:23,605 So it's not hash, it's not slash, slash, like in Python and C, respectively. 2681 02:08:23,605 --> 02:08:26,210 It's dash, dash, or two hyphens. 2682 02:08:26,210 --> 02:08:28,887 So suppose that I'm a hacker or a curious student, 2683 02:08:28,887 --> 02:08:30,970 and I want to see if Harvard knows what it's doing 2684 02:08:30,970 --> 02:08:32,680 when it comes to SQL injection attacks. 2685 02:08:32,680 --> 02:08:36,347 I could literally type in, maybe my username with a single quote 2686 02:08:36,347 --> 02:08:37,180 and then dash, dash. 2687 02:08:37,180 --> 02:08:38,870 Well, why would I do this? 2688 02:08:38,870 --> 02:08:43,240 Well, suppose for the sake of discussion that some developer at Harvard, 2689 02:08:43,240 --> 02:08:46,030 or any website, really, has written a line of code 2690 02:08:46,030 --> 02:08:49,870 like this to check if the username and password just 2691 02:08:49,870 --> 02:08:52,760 typed in match what's in the database. 2692 02:08:52,760 --> 02:08:53,890 So how might I do this? 2693 02:08:53,890 --> 02:08:57,410 And if so, let the user log in, show them their account, or whatever. 2694 02:08:57,410 --> 02:08:59,800 So here's the line of code in question. 2695 02:08:59,800 --> 02:09:05,050 Select star from users where username equals question mark and password 2696 02:09:05,050 --> 02:09:05,980 equals question mark. 2697 02:09:05,980 --> 02:09:07,040 This is correct. 2698 02:09:07,040 --> 02:09:08,650 This is green because it's good. 2699 02:09:08,650 --> 02:09:10,960 It is not vulnerable to attack because 2700 02:09:10,960 --> 02:09:13,090 I'm using these placeholders here, which, 2701 02:09:13,090 --> 02:09:15,370 even though we've implemented in the CS50 library, 2702 02:09:15,370 --> 02:09:18,040 most SQL libraries support the same syntax. 2703 02:09:18,040 --> 02:09:19,600 But it can vary by system. 2704 02:09:19,600 --> 02:09:20,890 This is safe. 2705 02:09:20,890 --> 02:09:23,230 What is not safe is to use some of the stuff 2706 02:09:23,230 --> 02:09:25,750 you learned last week where you can just use f strings, 2707 02:09:25,750 --> 02:09:29,680 or format your own strings and interpolate values with curly braces. 2708 02:09:29,680 --> 02:09:31,930 For instance, suppose you took me at my word 2709 02:09:31,930 --> 02:09:33,790 last week that you can do this in Python. 2710 02:09:33,790 --> 02:09:37,430 This code here, unfortunately, in yellow, is not safe. 2711 02:09:37,430 --> 02:09:39,190 Here's a format string in Python. 2712 02:09:39,190 --> 02:09:41,140 Here's the beginning of a SQL query. 2713 02:09:41,140 --> 02:09:43,810 Here's a placeholder for the user username. 2714 02:09:43,810 --> 02:09:45,730 Here's a placeholder for the user's password. 2715 02:09:45,730 --> 02:09:49,400 And I've proactively, correctly, put single quotes around them. 2716 02:09:49,400 --> 02:09:53,410 The problem is if you just blindly plug users input 2717 02:09:53,410 --> 02:09:58,390 into pre-made queries like this, they can finish your thought for you, 2718 02:09:58,390 --> 02:10:01,280 in ways you don't expect, if you trust the users. 2719 02:10:01,280 --> 02:10:03,910 For instance, if I plug in for my username, 2720 02:10:03,910 --> 02:10:07,600 mail in at harvard.edu, single quote, dash, dash. 2721 02:10:07,600 --> 02:10:09,200 Notice what happens. 2722 02:10:09,200 --> 02:10:11,470 Here's the single quote from the query. 2723 02:10:11,470 --> 02:10:13,100 Here's what I typed in. 2724 02:10:13,100 --> 02:10:14,150 But, wait a minute. 2725 02:10:14,150 --> 02:10:17,560 It looks like this single quote that I typed in finishes 2726 02:10:17,560 --> 02:10:19,810 the thought that the developer started. 2727 02:10:19,810 --> 02:10:22,527 The dash, dash means, heck, ignore the rest of that. 2728 02:10:22,527 --> 02:10:25,360 And, indeed, I've lifted grayed out because what effectively happens 2729 02:10:25,360 --> 02:10:29,510 is that only executed is what's in yellow here at the moment. 2730 02:10:29,510 --> 02:10:30,010 Why? 2731 02:10:30,010 --> 02:10:32,440 Because everything after that close quote, 2732 02:10:32,440 --> 02:10:35,255 which finishes the developers thought, is just ignored. 2733 02:10:35,255 --> 02:10:38,380 So I mean, this is literally an example of how you can hack into a database 2734 02:10:38,380 --> 02:10:39,910 by injecting SQL. 2735 02:10:39,910 --> 02:10:42,220 Like dash, dash is an example of SQL. 2736 02:10:42,220 --> 02:10:45,280 Silly as it is, it's a comment that tells the database 2737 02:10:45,280 --> 02:10:46,460 to ignore the password. 2738 02:10:46,460 --> 02:10:47,450 So what does this mean? 2739 02:10:47,450 --> 02:10:50,770 Well, of course you're going to get rows back now because if you're only 2740 02:10:50,770 --> 02:10:53,950 searching for someone by username, it doesn't matter what their password is, 2741 02:10:53,950 --> 02:10:57,460 what they typed in, you've essentially blacked out that part 2742 02:10:57,460 --> 02:10:59,530 and you're not even checking the password. 2743 02:10:59,530 --> 02:11:03,640 The effect then would be that you could log in as me or Carter or anyone else, 2744 02:11:03,640 --> 02:11:08,410 just by knowing their usernames if the Harvard developer wrote code 2745 02:11:08,410 --> 02:11:09,183 in this way. 2746 02:11:09,183 --> 02:11:12,100 And even though I'm pretty sure Harvard key does not suffer from this, 2747 02:11:12,100 --> 02:11:14,590 so many darn websites have in the past. 2748 02:11:14,590 --> 02:11:17,733 And if you Google, SQL injection attacks-- search by 2749 02:11:17,733 --> 02:11:20,150 and Google, for instance, the past month or the past year, 2750 02:11:20,150 --> 02:11:22,940 you will tragically likely see results because humans 2751 02:11:22,940 --> 02:11:24,650 continue to make this mistake. 2752 02:11:24,650 --> 02:11:27,830 The solution though, ultimately, is actually just 2753 02:11:27,830 --> 02:11:30,710 use placeholders, just use the library that 2754 02:11:30,710 --> 02:11:32,990 escapes potentially dangerous input. 2755 02:11:32,990 --> 02:11:35,600 And this looks a little weird, but in C we 2756 02:11:35,600 --> 02:11:38,000 saw that the escape character was a backslash? 2757 02:11:38,000 --> 02:11:41,690 And that made backslash n, or backslash something else, be treated specially. 2758 02:11:41,690 --> 02:11:45,750 Weirdly, in SQL it tends to be just another single quote. 2759 02:11:45,750 --> 02:11:48,450 So if you do quote, quote that actually means-- 2760 02:11:48,450 --> 02:11:49,580 I mean, a literal quote. 2761 02:11:49,580 --> 02:11:52,010 It's not like the empty string, or nothing in between it. 2762 02:11:52,010 --> 02:11:53,060 So it looks weird. 2763 02:11:53,060 --> 02:11:56,630 But, long story short, if you use a library like CS50, 2764 02:11:56,630 --> 02:11:58,460 or anything in the real world that handles 2765 02:11:58,460 --> 02:12:01,130 escaping of user input, the whole query you 2766 02:12:01,130 --> 02:12:04,910 can think of as being now good and green again because it doesn't matter what 2767 02:12:04,910 --> 02:12:05,780 the human types in. 2768 02:12:05,780 --> 02:12:10,680 Any scary characters will be properly escaped by the database. 2769 02:12:10,680 --> 02:12:12,740 So it'll depend on the library you're using, 2770 02:12:12,740 --> 02:12:15,200 but it almost always is the syntax using question 2771 02:12:15,200 --> 02:12:18,530 marks or some similar placeholder. 2772 02:12:18,530 --> 02:12:19,130 All right. 2773 02:12:19,130 --> 02:12:23,270 So with that said, you are now inducted into the Hall 2774 02:12:23,270 --> 02:12:26,020 of People who know now a little something about databases. 2775 02:12:26,020 --> 02:12:28,520 We've only just scratched the surface of using the language, 2776 02:12:28,520 --> 02:12:31,340 but it's now something we'll use to build up more and more interesting 2777 02:12:31,340 --> 02:12:33,673 applications, especially for final projects when we soon 2778 02:12:33,673 --> 02:12:37,160 transition to web programming or mobile app development, if you go that route. 2779 02:12:37,160 --> 02:12:41,540 But you'll soon be able to speak a language, literally and figuratively, 2780 02:12:41,540 --> 02:12:44,310 that those before you have acquired as well. 2781 02:12:44,310 --> 02:12:48,170 So you are now qualified to understand this sort of internet joke. 2782 02:12:48,170 --> 02:12:51,020 This is someone who, if I Zoom in, was trying 2783 02:12:51,020 --> 02:12:55,820 to get out of paying some camera based tolls by tricking the state or the city 2784 02:12:55,820 --> 02:12:58,130 into deleting or dropping their whole database. 2785 02:12:58,130 --> 02:13:01,100 Drop means delete the whole thing, not just the rows therein. 2786 02:13:01,100 --> 02:13:02,270 OK, so maybe not that funny. 2787 02:13:02,270 --> 02:13:06,910 But this is now the note will end on, similar to other xkcd comics 2788 02:13:06,910 --> 02:13:07,910 we've introduced you to. 2789 02:13:07,910 --> 02:13:11,150 Every CS person out there has seen, knows this comic. 2790 02:13:11,150 --> 02:13:14,150 So if you ever refer to, with a wink-- 2791 02:13:14,150 --> 02:13:17,090 if you ever refer to little Bobby tables with a wink, 2792 02:13:17,090 --> 02:13:20,795 if it's a computer scientist on the other end, they'll know whom you mean. 2793 02:13:20,795 --> 02:13:23,918 2794 02:13:23,918 --> 02:13:25,330 [LAUGHING] 2795 02:13:25,330 --> 02:13:26,260 OK, there we go. 2796 02:13:26,260 --> 02:13:29,690 2797 02:13:29,690 --> 02:13:30,230 All right. 2798 02:13:30,230 --> 02:13:30,740 Tough crowd. 2799 02:13:30,740 --> 02:13:33,990 All right, Batman, come on down for your cookies, and we'll see you next time. 2800 02:13:33,990 --> 02:13:34,520 [APPLAUSE] 2801 02:13:34,520 --> 02:13:37,570 [MUSIC PLAYING] 2802 02:13:37,570 --> 02:14:04,000