1 00:00:00,000 --> 00:00:02,988 [MUSIC PLAYING] 2 00:00:02,988 --> 00:01:12,582 3 00:01:12,582 --> 00:01:13,540 DAVID MALAN: All right. 4 00:01:13,540 --> 00:01:17,150 This is CS50 and this is already week seven. 5 00:01:17,150 --> 00:01:20,240 And this is the week where we'll continue where we left off 6 00:01:20,240 --> 00:01:22,430 with Python, introducing you to a bit more syntax 7 00:01:22,430 --> 00:01:25,580 and capabilities of the language so you can solve interesting problems. 8 00:01:25,580 --> 00:01:27,580 But a lot of those problems increasingly are now 9 00:01:27,580 --> 00:01:29,180 going to involve data in some form. 10 00:01:29,180 --> 00:01:31,250 After all, if you think of most any website 11 00:01:31,250 --> 00:01:36,320 or mobile app or process nowadays that involves solving problems, 12 00:01:36,320 --> 00:01:39,530 it almost always involves some amount of data and often data at scale. 13 00:01:39,530 --> 00:01:40,680 Lots and lots of data. 14 00:01:40,680 --> 00:01:42,430 And so what we're going to see first today 15 00:01:42,430 --> 00:01:45,710 is that, yes, you can use Python to solve all the problems past that we've 16 00:01:45,710 --> 00:01:48,327 seen and also some data specific ones, but sometimes it's 17 00:01:48,327 --> 00:01:49,410 just going to be annoying. 18 00:01:49,410 --> 00:01:50,420 It's going to be a little painful. 19 00:01:50,420 --> 00:01:54,060 It's just going to be more work than you might like to just get to some answer. 20 00:01:54,060 --> 00:01:56,870 And so today we'll too introduce you to a new language 21 00:01:56,870 --> 00:01:59,240 called SQL, Structured Query Language. 22 00:01:59,240 --> 00:02:02,780 And this is a language that rest assured is actually much smaller, 23 00:02:02,780 --> 00:02:05,240 relatively speaking, than C and Python. 24 00:02:05,240 --> 00:02:07,860 It sort of does less, but it doesn't really well. 25 00:02:07,860 --> 00:02:11,240 And it's a language for querying databases, storing data in it, 26 00:02:11,240 --> 00:02:13,950 updating it, inserting it, deleting it, and so much more. 27 00:02:13,950 --> 00:02:16,160 And it's the kind of technology that's used nowadays 28 00:02:16,160 --> 00:02:19,700 in, indeed, web apps and mobile apps, data science, analytics, 29 00:02:19,700 --> 00:02:20,940 and so much more. 30 00:02:20,940 --> 00:02:23,600 It's really good at storing lots and lots of data. 31 00:02:23,600 --> 00:02:25,430 Now, this is yet another language. 32 00:02:25,430 --> 00:02:27,800 And believe it or not, next week we'll introduce you 33 00:02:27,800 --> 00:02:31,880 to three more languages, HTML and CSS, which are not technically 34 00:02:31,880 --> 00:02:32,900 programming languages. 35 00:02:32,900 --> 00:02:35,630 They're all about aesthetics and markup of information. 36 00:02:35,630 --> 00:02:38,660 But also JavaScript, which is, in fact, a programming language. 37 00:02:38,660 --> 00:02:41,630 But the goals here in CS50 really are going 38 00:02:41,630 --> 00:02:44,000 to be to empower you to program more generally. 39 00:02:44,000 --> 00:02:47,000 And indeed, when you're out there in the real world some years from now, 40 00:02:47,000 --> 00:02:50,330 invariably there's going to be some new other popular language out there. 41 00:02:50,330 --> 00:02:53,880 And hopefully in this week and next week and beyond, among the goals 42 00:02:53,880 --> 00:02:56,630 is not just to teach you these languages specifically but again, 43 00:02:56,630 --> 00:02:59,330 how to teach yourself the future languages that we've not even 44 00:02:59,330 --> 00:03:00,480 heard about just yet. 45 00:03:00,480 --> 00:03:04,220 So with that said, let's begin with a survey of sorts. 46 00:03:04,220 --> 00:03:10,190 If you go to this URL on your phone or laptop, cs50.ly/favorites, 47 00:03:10,190 --> 00:03:14,390 a very simple Google Form awaits you that's just going to ask you a couple 48 00:03:14,390 --> 00:03:16,530 of multiple choice questions. 49 00:03:16,530 --> 00:03:24,800 So go to cs50.ly/favorites, and that should lead you to a Google Form that 50 00:03:24,800 --> 00:03:27,590 looks a little something like this asking you first, 51 00:03:27,590 --> 00:03:31,700 as of now in week seven, what is your favorite language among those options 52 00:03:31,700 --> 00:03:32,250 here. 53 00:03:32,250 --> 00:03:34,220 And then further down, one more question. 54 00:03:34,220 --> 00:03:39,170 If you think back on problem sets 0 through 6, what was, if any, 55 00:03:39,170 --> 00:03:42,080 your favorite problem set problem? 56 00:03:42,080 --> 00:03:45,110 Be it in Scratch or C or Python. 57 00:03:45,110 --> 00:03:48,650 So answer those two questions. 58 00:03:48,650 --> 00:03:52,820 And in a moment, I'll flip over to my screen here where you'll see, 59 00:03:52,820 --> 00:03:56,570 and anyone who's used Google Forms knows, the spreadsheet that's 60 00:03:56,570 --> 00:03:58,130 collecting now this data. 61 00:03:58,130 --> 00:04:01,060 Microsoft Office 365 can do the same if you use one of those forms. 62 00:04:01,060 --> 00:04:02,810 And what you see here now is a spreadsheet 63 00:04:02,810 --> 00:04:06,770 in Google Sheets enumerating all of the audience's questions. 64 00:04:06,770 --> 00:04:09,980 Language is in column B, problem is in column C, 65 00:04:09,980 --> 00:04:13,190 and each row represents one student who has responded. 66 00:04:13,190 --> 00:04:17,360 A few of you were super eager for class today at 8:33 AM Eastern time. 67 00:04:17,360 --> 00:04:19,040 10:32, 11:10. 68 00:04:19,040 --> 00:04:21,779 OK, so now we're getting into the actual class time here. 69 00:04:21,779 --> 00:04:24,830 And if I scroll down, we'll probably see a few dozen, 70 00:04:24,830 --> 00:04:26,900 a couple of hundred answers by now. 71 00:04:26,900 --> 00:04:29,225 And yeah, so we're getting a whole lot of answers here. 72 00:04:29,225 --> 00:04:31,100 And I'm seeing some patterns emerge, but it's 73 00:04:31,100 --> 00:04:34,190 not necessarily obvious to the human eyes what those patterns are. 74 00:04:34,190 --> 00:04:36,320 Now of course, you can use Google Spreadsheets. 75 00:04:36,320 --> 00:04:40,143 You can highlight the data and you can create charts magically out of it. 76 00:04:40,143 --> 00:04:42,560 But you can only do what Google lets you do with the data. 77 00:04:42,560 --> 00:04:45,350 And same thing for Microsoft Excel or Apple Numbers. 78 00:04:45,350 --> 00:04:48,230 But wouldn't it be nice to just be able to manipulate the raw data, 79 00:04:48,230 --> 00:04:51,740 relatively simple though it is, to just answer questions about the data? 80 00:04:51,740 --> 00:04:54,260 Maybe long term create your own charts, customize it 81 00:04:54,260 --> 00:04:56,240 just the way you want rather than beholden 82 00:04:56,240 --> 00:04:59,250 to software that's off the shelf like this. 83 00:04:59,250 --> 00:05:02,130 Well, how could we go about doing this? 84 00:05:02,130 --> 00:05:05,330 Well, let me propose that we treat this data set now 85 00:05:05,330 --> 00:05:08,790 as what we're going to call for now a flat file database. 86 00:05:08,790 --> 00:05:10,850 We'll see today that there's fancier databases, 87 00:05:10,850 --> 00:05:15,200 but the simplest database in the world is really just like a .csv file. 88 00:05:15,200 --> 00:05:18,500 And we saw that a couple of weeks ago in C. We wrote a bit of C code 89 00:05:18,500 --> 00:05:23,660 that used fprintf to write data to a file using commas as the separator. 90 00:05:23,660 --> 00:05:25,910 We didn't really do much more with CSVs at the time 91 00:05:25,910 --> 00:05:28,610 though, because it's really annoying, painful, time consuming, 92 00:05:28,610 --> 00:05:32,840 not fun to use C for something like that because of malloc and memory and all 93 00:05:32,840 --> 00:05:33,413 that stuff. 94 00:05:33,413 --> 00:05:35,330 But with Python, it's going to be much easier. 95 00:05:35,330 --> 00:05:39,530 And so any time you have access to some data set where you can just download it 96 00:05:39,530 --> 00:05:42,140 to your own Mac or PC or your cloud environment, 97 00:05:42,140 --> 00:05:46,460 it's sort of a candidate for now writing code to do something with the data. 98 00:05:46,460 --> 00:05:48,120 Maybe analyze it right away. 99 00:05:48,120 --> 00:05:50,150 If it's been human imported manually, maybe you 100 00:05:50,150 --> 00:05:52,550 have to clean it up by doing a lot of find and replace 101 00:05:52,550 --> 00:05:54,852 but not with your keyboard but rather with code. 102 00:05:54,852 --> 00:05:56,310 And so let me go ahead and do this. 103 00:05:56,310 --> 00:05:59,780 Let me go back to my Google Sheet here that has 104 00:05:59,780 --> 00:06:02,010 all of the data that's come in now. 105 00:06:02,010 --> 00:06:08,430 And let me go ahead and download this via the File menu here. 106 00:06:08,430 --> 00:06:09,590 And let's see. 107 00:06:09,590 --> 00:06:10,310 Download. 108 00:06:10,310 --> 00:06:12,170 And you can see a whole bunch of options. 109 00:06:12,170 --> 00:06:14,210 Most formats might be familiar, but today we'll 110 00:06:14,210 --> 00:06:17,602 focus just on this one, Comma Separated Values or CSV. 111 00:06:17,602 --> 00:06:20,810 That's going to go ahead and download it on my Mac here into my own Downloads 112 00:06:20,810 --> 00:06:21,140 folder. 113 00:06:21,140 --> 00:06:22,890 And now I'm going to go ahead and do this. 114 00:06:22,890 --> 00:06:25,140 Let me go ahead and pull up VS Code in the cloud here. 115 00:06:25,140 --> 00:06:28,140 And if you've never done this before, there's a couple of ways to do it. 116 00:06:28,140 --> 00:06:31,320 But the simplest way to upload a file to your code space, so to speak, 117 00:06:31,320 --> 00:06:32,990 is just a sort of drag and drop. 118 00:06:32,990 --> 00:06:36,020 That's going to magically upload it to the server there. 119 00:06:36,020 --> 00:06:39,688 And we'll see that, one, it has a very long file name, which I'm actually 120 00:06:39,688 --> 00:06:41,480 going to clean this up because that's going 121 00:06:41,480 --> 00:06:43,880 to be very tedious to type in my code. 122 00:06:43,880 --> 00:06:46,927 So I could either right click, of course, up here, 123 00:06:46,927 --> 00:06:48,510 but I'm going to use my Linux command. 124 00:06:48,510 --> 00:06:52,550 So let's move this file called CSV 50 2022 something or other and let's 125 00:06:52,550 --> 00:06:54,740 just name it more simply favorites.csv. 126 00:06:54,740 --> 00:06:58,370 So all lowercase, no spaces, sort of good basics. 127 00:06:58,370 --> 00:07:02,960 And let me go ahead now and open up this file with code favorites.csv. 128 00:07:02,960 --> 00:07:06,500 I'll close my File Explorer and we'll see exactly the same data 129 00:07:06,500 --> 00:07:09,890 as before but not quite as pretty as Google Sheets makes it be. 130 00:07:09,890 --> 00:07:13,640 Rather we see here that I still have three columns, timestamp, language, 131 00:07:13,640 --> 00:07:17,013 problem, and then all of the values down below, including the timestamps 132 00:07:17,013 --> 00:07:18,180 and the answers they're for. 133 00:07:18,180 --> 00:07:20,360 But it doesn't have proper columns. 134 00:07:20,360 --> 00:07:22,250 It just has commas separating them. 135 00:07:22,250 --> 00:07:24,860 Now, we could very easily write Python code 136 00:07:24,860 --> 00:07:28,820 just like we wrote C code to manipulate files like this 137 00:07:28,820 --> 00:07:30,140 either to write or read. 138 00:07:30,140 --> 00:07:32,450 But instead let's do something that's a little more 139 00:07:32,450 --> 00:07:34,640 pleasant, which is indeed in the form of Python. 140 00:07:34,640 --> 00:07:38,300 So Python actually comes with native support for CSVs. 141 00:07:38,300 --> 00:07:41,200 It has indeed a package called CSV that just 142 00:07:41,200 --> 00:07:43,700 lets you read and write and do a whole bunch of useful stuff 143 00:07:43,700 --> 00:07:45,007 when it comes to CSV files. 144 00:07:45,007 --> 00:07:47,090 So let's go ahead and do something with this file. 145 00:07:47,090 --> 00:07:49,250 Let me go back here to VS Code. 146 00:07:49,250 --> 00:07:51,860 I'm going to close favorites.csv for now. 147 00:07:51,860 --> 00:07:55,430 But just remember in your mind that timestamp was the first column, 148 00:07:55,430 --> 00:07:58,640 language was the second column, and problem was the third. 149 00:07:58,640 --> 00:08:01,790 And notice because we're using commas, they don't again line up perfectly, 150 00:08:01,790 --> 00:08:02,832 but that's not a problem. 151 00:08:02,832 --> 00:08:05,270 There are two commas in every line presumably. 152 00:08:05,270 --> 00:08:09,892 And I'm going to go ahead and now create a file called how about favorites.py so 153 00:08:09,892 --> 00:08:12,350 that I can start writing some code to manipulate this data. 154 00:08:12,350 --> 00:08:13,600 And let's do something simple. 155 00:08:13,600 --> 00:08:16,910 Let's just write a simple program in Python that opens this file, 156 00:08:16,910 --> 00:08:19,997 reads it, and print something out just as a safety check 157 00:08:19,997 --> 00:08:22,830 that I know what I'm doing, even though it's not going to be useful. 158 00:08:22,830 --> 00:08:26,780 So in Python, if you want CSV support, you import CSV. 159 00:08:26,780 --> 00:08:30,350 And that gives you access to all the magical capabilities thereof. 160 00:08:30,350 --> 00:08:32,450 Let me now go ahead and use this technique 161 00:08:32,450 --> 00:08:35,960 to open a file in Python, which is similar in C. 162 00:08:35,960 --> 00:08:37,789 But with Python, we're going to do this. 163 00:08:37,789 --> 00:08:39,020 The keyword with. 164 00:08:39,020 --> 00:08:42,049 I'm going to open a file called favorites.csv, 165 00:08:42,049 --> 00:08:43,880 which was the shorter name I gave it. 166 00:08:43,880 --> 00:08:46,280 This is optional, but just for explicitness, I'm 167 00:08:46,280 --> 00:08:50,090 going to open it in read mode explicitly, just like f open 168 00:08:50,090 --> 00:08:51,720 took a second argument as well. 169 00:08:51,720 --> 00:08:54,500 And I'm going to name this file once open quite simply file, 170 00:08:54,500 --> 00:08:56,240 though I could call it anything I want. 171 00:08:56,240 --> 00:08:59,060 And now it's just an open file. 172 00:08:59,060 --> 00:09:02,870 So far as Python knows at this moment, it's just text, or better yet, 173 00:09:02,870 --> 00:09:04,250 it's just zeros and ones. 174 00:09:04,250 --> 00:09:08,360 If you want this Python package called CSV to actually do 175 00:09:08,360 --> 00:09:12,500 something useful with it, you have to load this file now into the library. 176 00:09:12,500 --> 00:09:16,165 And the simplest way to do this is to give myself a variable called reader 177 00:09:16,165 --> 00:09:17,540 because I want to read this file. 178 00:09:17,540 --> 00:09:19,430 Though this too I could call anything else. 179 00:09:19,430 --> 00:09:22,610 I'm going to then set that equal to the return value of a function called 180 00:09:22,610 --> 00:09:24,710 csv.reader. 181 00:09:24,710 --> 00:09:28,220 And I pass to that per the documentation the open file. 182 00:09:28,220 --> 00:09:29,780 So step one, I open the file. 183 00:09:29,780 --> 00:09:32,030 And this just gives me access to the bytes therein. 184 00:09:32,030 --> 00:09:36,530 Step two now with csv.reader tells the Python package 185 00:09:36,530 --> 00:09:40,940 called CSV to do something useful with it and start analyzing the commas 186 00:09:40,940 --> 00:09:43,770 and allow me to parse it further. 187 00:09:43,770 --> 00:09:45,180 So let's go ahead and do this. 188 00:09:45,180 --> 00:09:49,730 Let me go ahead now and within this loop let's say this. 189 00:09:49,730 --> 00:09:52,340 Sorry, within this open file, let's do this. 190 00:09:52,340 --> 00:09:58,040 For every row, if you will, or line in the file, a.k.a. 191 00:09:58,040 --> 00:10:04,910 reader, let's go ahead and print out just how about row bracket 1. 192 00:10:04,910 --> 00:10:06,198 Now what's going on here? 193 00:10:06,198 --> 00:10:08,990 Well, it turns out if you read the documentation for the CSV reader 194 00:10:08,990 --> 00:10:13,610 function, what it hands you back is essentially this special object, so 195 00:10:13,610 --> 00:10:16,850 to speak, that allows you to treat it as though it's just 196 00:10:16,850 --> 00:10:20,660 a really big list of lines from the file, a.k.a. reader. 197 00:10:20,660 --> 00:10:25,310 So by saying for row in reader, this is a way more succinct way 198 00:10:25,310 --> 00:10:28,763 of saying give me the first line in the file plus plus, 199 00:10:28,763 --> 00:10:30,680 give me the second line in the file plus plus, 200 00:10:30,680 --> 00:10:33,597 and so forth that we would have done what much more mechanically in C. 201 00:10:33,597 --> 00:10:37,350 This is just much more Pythonic and English friendly, if you will. 202 00:10:37,350 --> 00:10:42,225 So in every iteration of this loop, row is going to contain all of the data 203 00:10:42,225 --> 00:10:43,100 from the current row. 204 00:10:43,100 --> 00:10:45,590 But better yet, what the reader function does 205 00:10:45,590 --> 00:10:50,240 for me is it hands me each row not just as a big string or STR 206 00:10:50,240 --> 00:10:51,590 of text in Python. 207 00:10:51,590 --> 00:10:56,590 It gives me what apparently based on the syntax on line six. 208 00:10:56,590 --> 00:10:57,460 Any instinct? 209 00:10:57,460 --> 00:10:58,030 Yeah. 210 00:10:58,030 --> 00:10:58,740 AUDIENCE: A list. 211 00:10:58,740 --> 00:11:00,698 DAVID MALAN: It's giving me back indeed a list. 212 00:11:00,698 --> 00:11:02,880 And I presume the visual clue for you was the fact 213 00:11:02,880 --> 00:11:04,463 that we're using square brackets here. 214 00:11:04,463 --> 00:11:07,780 And indeed, row bracket 1 is going to be not the first 215 00:11:07,780 --> 00:11:10,060 but the second element in that list. 216 00:11:10,060 --> 00:11:11,280 And so just take a guess. 217 00:11:11,280 --> 00:11:14,400 When I run this code in a moment, what's going to get printed? 218 00:11:14,400 --> 00:11:16,950 The timestamp, the language, or the problem? 219 00:11:16,950 --> 00:11:20,892 220 00:11:20,892 --> 00:11:21,392 Yeah? 221 00:11:21,392 --> 00:11:22,350 AUDIENCE: The language. 222 00:11:22,350 --> 00:11:25,260 DAVID MALAN: The language because it's the second column that is 223 00:11:25,260 --> 00:11:27,848 in the file delimited by those commas. 224 00:11:27,848 --> 00:11:29,140 So let me go ahead and do this. 225 00:11:29,140 --> 00:11:30,630 Let me clear my terminal down here. 226 00:11:30,630 --> 00:11:33,570 Let me run Python of favorites.py and Enter. 227 00:11:33,570 --> 00:11:34,710 And there's everything. 228 00:11:34,710 --> 00:11:35,490 It was super fast. 229 00:11:35,490 --> 00:11:37,090 But there's a really long list here. 230 00:11:37,090 --> 00:11:41,130 And in fact, if I increase the size of my terminal and start scrolling up, 231 00:11:41,130 --> 00:11:43,090 you'll just see all of the raw data. 232 00:11:43,090 --> 00:11:44,490 Now, this isn't that useful yet. 233 00:11:44,490 --> 00:11:46,032 I could have just glanced at the CSV. 234 00:11:46,032 --> 00:11:49,320 But clearly now I have the ability to open the file, 235 00:11:49,320 --> 00:11:52,830 parse it, so to speak, that is break it up into its constituent parts, 236 00:11:52,830 --> 00:11:56,380 and do something with specific parts therein. 237 00:11:56,380 --> 00:11:56,880 All right. 238 00:11:56,880 --> 00:11:59,640 So if I want to do this a little more pleasantly though, 239 00:11:59,640 --> 00:12:02,460 let me at least make this semantically a little cleaner. 240 00:12:02,460 --> 00:12:05,650 And you know what, just for clarity, let me just give myself a variable. 241 00:12:05,650 --> 00:12:09,000 It's not strictly necessary, but I know that this is 242 00:12:09,000 --> 00:12:12,958 the favorite, for instance, language. 243 00:12:12,958 --> 00:12:14,250 So let's just call it favorite. 244 00:12:14,250 --> 00:12:16,410 Set it equal to row bracket 1. 245 00:12:16,410 --> 00:12:18,600 And now just to be more explicit in my code, 246 00:12:18,600 --> 00:12:21,030 even though again, we don't need the variable per se, 247 00:12:21,030 --> 00:12:22,840 this code is, of course, going to do the same thing. 248 00:12:22,840 --> 00:12:25,132 It's just using an additional variable called favorite. 249 00:12:25,132 --> 00:12:28,450 If I go down here, scroll up, run the program again, 250 00:12:28,450 --> 00:12:30,280 I get back to the exact same data. 251 00:12:30,280 --> 00:12:32,280 But this is a stepping stone to something 252 00:12:32,280 --> 00:12:34,800 that's even more powerful about Python support 253 00:12:34,800 --> 00:12:38,610 for CSV files is that you don't have to just treat the return 254 00:12:38,610 --> 00:12:43,500 value as a list with 0 and 1 and 2. 255 00:12:43,500 --> 00:12:48,540 So just thinking intuitively here, why is this maybe not the best design 256 00:12:48,540 --> 00:12:51,960 to hand you, the programmer, back the data in a list 257 00:12:51,960 --> 00:12:55,640 that's numerically indexed with 0, 1, 2? 258 00:12:55,640 --> 00:12:59,760 It clearly works, but critique this. 259 00:12:59,760 --> 00:13:00,890 What could go wrong? 260 00:13:00,890 --> 00:13:03,282 What's a little poorly designed? 261 00:13:03,282 --> 00:13:04,169 Yeah? 262 00:13:04,169 --> 00:13:06,586 AUDIENCE: You have to always remember what the things are, 263 00:13:06,586 --> 00:13:10,945 what the order is [INAUDIBLE] 264 00:13:10,945 --> 00:13:11,820 DAVID MALAN: Exactly. 265 00:13:11,820 --> 00:13:14,880 So yeah, so it's up to you to repeat-- 266 00:13:14,880 --> 00:13:18,360 it's up to you to remember what column the data is actually in. 267 00:13:18,360 --> 00:13:22,020 And God forbid you're collaborating with someone else on the spreadsheet. 268 00:13:22,020 --> 00:13:24,803 If you've used Google Spreadsheets, you can move columns around 269 00:13:24,803 --> 00:13:27,220 maybe just because you want to visually reorganize things. 270 00:13:27,220 --> 00:13:30,150 And if you do this and then someone else downloads that same data, 271 00:13:30,150 --> 00:13:31,840 all of their code is going to break. 272 00:13:31,840 --> 00:13:33,360 So that's just really bad design. 273 00:13:33,360 --> 00:13:36,390 It's fragile just because you're sort of on the honor system 274 00:13:36,390 --> 00:13:38,588 that one means the data that you want. 275 00:13:38,588 --> 00:13:41,130 So wouldn't it be nice if it could be a little more explicit? 276 00:13:41,130 --> 00:13:47,130 Well, recall that the very first line in this file is actually this. 277 00:13:47,130 --> 00:13:50,730 And I paused the output this time so that we can see more optionally. 278 00:13:50,730 --> 00:13:51,960 I just reran favorites.py. 279 00:13:51,960 --> 00:13:55,050 And notice one of these things is not like the other. 280 00:13:55,050 --> 00:13:59,070 Every output was either scratch or C or Python except for this first one. 281 00:13:59,070 --> 00:14:00,855 Why am I seeing the word language here? 282 00:14:00,855 --> 00:14:03,610 283 00:14:03,610 --> 00:14:05,860 Where did language come from? 284 00:14:05,860 --> 00:14:08,230 You didn't have the ability to manually input. 285 00:14:08,230 --> 00:14:09,372 No. 286 00:14:09,372 --> 00:14:10,330 Where did it come from? 287 00:14:10,330 --> 00:14:11,030 Yeah. 288 00:14:11,030 --> 00:14:11,850 AUDIENCE: That would be the header. 289 00:14:11,850 --> 00:14:12,630 DAVID MALAN: Yeah, the header. 290 00:14:12,630 --> 00:14:15,300 The very first row in the file, which by human convention 291 00:14:15,300 --> 00:14:18,210 generally just defines what the columns represent so that there's 292 00:14:18,210 --> 00:14:20,470 some human useful information there. 293 00:14:20,470 --> 00:14:24,400 Now, that's not really intended to be part of my output at the moment, 294 00:14:24,400 --> 00:14:25,710 so there is a way to skip this. 295 00:14:25,710 --> 00:14:29,380 If you want to skip the first row, you can actually do something like this. 296 00:14:29,380 --> 00:14:32,560 You can say next row, and that will just ignore that row. 297 00:14:32,560 --> 00:14:35,520 So then I'm starting really with the every row thereafter. 298 00:14:35,520 --> 00:14:37,995 But there's a better way to handle this than that. 299 00:14:37,995 --> 00:14:40,620 That will get rid of the row in the output, but let me go ahead 300 00:14:40,620 --> 00:14:42,888 and use a different feature of the CSV package 301 00:14:42,888 --> 00:14:45,430 that's just going to make this a little cleaner all together. 302 00:14:45,430 --> 00:14:47,140 So let me clear my terminal window here. 303 00:14:47,140 --> 00:14:49,830 Let me undo this next thing that I just added. 304 00:14:49,830 --> 00:14:54,690 And instead of using a reader, let me go ahead and use a dictionary reader, 305 00:14:54,690 --> 00:15:00,720 abbreviated dict reader, that's going to now return me the equivalent of all 306 00:15:00,720 --> 00:15:02,260 of the rows one at a time. 307 00:15:02,260 --> 00:15:04,720 So I can still call it reader just as before. 308 00:15:04,720 --> 00:15:08,340 But as the name implies, what this reader is going to return 309 00:15:08,340 --> 00:15:11,560 is not a list after list after list but a dictionary, 310 00:15:11,560 --> 00:15:13,140 a dictionary, a dictionary. 311 00:15:13,140 --> 00:15:16,210 And remember, a dictionary is just a collection of key value pairs. 312 00:15:16,210 --> 00:15:17,640 So what does that mean? 313 00:15:17,640 --> 00:15:18,390 What are the keys? 314 00:15:18,390 --> 00:15:19,480 What are the values? 315 00:15:19,480 --> 00:15:24,210 Well, now that I'm using a dictionary reader, I can actually do this. 316 00:15:24,210 --> 00:15:27,270 Instead of on the honor system remembering that I want column one, 317 00:15:27,270 --> 00:15:30,630 I can treat row now not as a list but as a dictionary. 318 00:15:30,630 --> 00:15:34,410 And that means I can go in here and say quote unquote "language." 319 00:15:34,410 --> 00:15:36,150 And we saw that back in week six. 320 00:15:36,150 --> 00:15:38,820 Python allows you to index into dictionaries 321 00:15:38,820 --> 00:15:42,480 using square bracket notation in strings or STRs on the inside, 322 00:15:42,480 --> 00:15:44,790 just like lists allow for numbers. 323 00:15:44,790 --> 00:15:48,180 But this now I think is going to be a little more robust. 324 00:15:48,180 --> 00:15:52,830 If I run this again, Python of favorites.py, of that worked out fine. 325 00:15:52,830 --> 00:15:56,580 And let me pause the output too by using this program called more. 326 00:15:56,580 --> 00:15:58,600 Now I don't even see the header. 327 00:15:58,600 --> 00:16:03,660 So now whoever works with Python wrote the code for this package 328 00:16:03,660 --> 00:16:06,600 to just analyze that first line of code, use the header 329 00:16:06,600 --> 00:16:10,230 as you just called it as the keys, and then every time you 330 00:16:10,230 --> 00:16:13,380 iterate through this loop, it updates the values, the values, the values, 331 00:16:13,380 --> 00:16:15,810 but the keys stay the same. 332 00:16:15,810 --> 00:16:20,320 Any questions then on this technique? 333 00:16:20,320 --> 00:16:24,206 Suffice it to say this would be painful in C. Yes? 334 00:16:24,206 --> 00:16:28,670 AUDIENCE: [INAUDIBLE] 335 00:16:28,670 --> 00:16:33,395 336 00:16:33,395 --> 00:16:34,270 DAVID MALAN: Exactly. 337 00:16:34,270 --> 00:16:39,130 So the keys are always going to be quote unquote "timestamp, language, 338 00:16:39,130 --> 00:16:40,130 and problem." 339 00:16:40,130 --> 00:16:43,330 But on each iteration of this loop here, the row 340 00:16:43,330 --> 00:16:46,600 is going to contain a different row of values, different row of values, 341 00:16:46,600 --> 00:16:47,600 different row of values. 342 00:16:47,600 --> 00:16:50,830 So you're going to get back one dictionary for every student who 343 00:16:50,830 --> 00:16:54,140 submitted the Google Form, if you will, while iterating through it there. 344 00:16:54,140 --> 00:16:54,640 All right. 345 00:16:54,640 --> 00:17:01,270 So once we have this ability here, why don't we go ahead and transition to 346 00:17:01,270 --> 00:17:03,850 how about not just using that dictionary reader, which 347 00:17:03,850 --> 00:17:05,890 it makes the code a little more robust. 348 00:17:05,890 --> 00:17:08,599 Because now if you move the columns around, no big deal. 349 00:17:08,599 --> 00:17:11,170 It doesn't matter if the numeric indices change. 350 00:17:11,170 --> 00:17:13,660 You can still use those keywords instead. 351 00:17:13,660 --> 00:17:15,460 But let's actually analyze the data now. 352 00:17:15,460 --> 00:17:18,619 I'm just spitting it out, which is not solving any problems for anyone. 353 00:17:18,619 --> 00:17:22,750 So let's go ahead and count the popularity of Scratch, C, and Python 354 00:17:22,750 --> 00:17:25,790 and see what everyone's been thinking here. 355 00:17:25,790 --> 00:17:26,290 All right. 356 00:17:26,290 --> 00:17:27,470 So how might I do this? 357 00:17:27,470 --> 00:17:29,890 Well, let me go ahead and do this up here. 358 00:17:29,890 --> 00:17:33,880 Before I start iterating, let me give myself, let's say, three variables. 359 00:17:33,880 --> 00:17:37,120 And to keep things simple, I'll say one variable called Scratch. 360 00:17:37,120 --> 00:17:40,270 Set it equal to 0 for 0 students so far. 361 00:17:40,270 --> 00:17:44,200 C is going to equal 0 and Python is going to equal 0. 362 00:17:44,200 --> 00:17:46,157 There's a slightly prettier way of doing this, 363 00:17:46,157 --> 00:17:49,240 just because this is like three lines of code to do something very simple. 364 00:17:49,240 --> 00:17:53,080 You could alternatively in Python but not C 365 00:17:53,080 --> 00:17:58,720 do Scratch comma C comma Python equals 0 comma 0. 366 00:17:58,720 --> 00:18:02,168 So kind of slightly more elegant just to fit it all into one line. 367 00:18:02,168 --> 00:18:04,210 But now let's just do something more interesting. 368 00:18:04,210 --> 00:18:06,160 On line seven, I'm still going to figure out 369 00:18:06,160 --> 00:18:08,320 what the current favorite language is. 370 00:18:08,320 --> 00:18:11,740 And now I'm just going to do some conditional checks. 371 00:18:11,740 --> 00:18:16,610 How about if that favorite equals equals quote unquote "Scratch," 372 00:18:16,610 --> 00:18:19,990 Let's go ahead and increment Scratch by 1. 373 00:18:19,990 --> 00:18:23,560 We can't do plus plus in Python, but we can do plus equals 1. 374 00:18:23,560 --> 00:18:30,700 How about elif if favorite equals equals C, then let's do C plus equals 1. 375 00:18:30,700 --> 00:18:32,063 We could do else. 376 00:18:32,063 --> 00:18:33,730 This is actually a good design question. 377 00:18:33,730 --> 00:18:34,840 Should I do else? 378 00:18:34,840 --> 00:18:36,880 Should I do elif? 379 00:18:36,880 --> 00:18:39,250 Any instincts here? 380 00:18:39,250 --> 00:18:40,824 Yeah. 381 00:18:40,824 --> 00:18:43,259 AUDIENCE: [INAUDIBLE] 382 00:18:43,259 --> 00:18:46,322 383 00:18:46,322 --> 00:18:48,030 DAVID MALAN: Yeah, really good instincts. 384 00:18:48,030 --> 00:18:50,370 Just in case someone goes and adds another language 385 00:18:50,370 --> 00:18:51,960 to the form next week because we're obviously 386 00:18:51,960 --> 00:18:53,370 going to introduce another language today, 387 00:18:53,370 --> 00:18:55,620 you don't want your code to now artificially inflate 388 00:18:55,620 --> 00:18:58,380 the scores for Python just because you're 389 00:18:58,380 --> 00:19:00,340 conflating multiple languages together. 390 00:19:00,340 --> 00:19:03,810 So the more defensive sort of better way to write this code, I agree, 391 00:19:03,810 --> 00:19:07,680 would be elif favorite equals equals Python. 392 00:19:07,680 --> 00:19:10,590 Then let's go ahead and increment Python plus equals 1. 393 00:19:10,590 --> 00:19:12,840 And if there's a new language next week, we're 394 00:19:12,840 --> 00:19:14,260 obviously going to have to update the code, 395 00:19:14,260 --> 00:19:15,630 but at least we're not miscounting. 396 00:19:15,630 --> 00:19:17,130 We're just missing the new language. 397 00:19:17,130 --> 00:19:19,210 So I think that's slightly more robust. 398 00:19:19,210 --> 00:19:22,290 All right, now at the very bottom of this program and outside of the loop 399 00:19:22,290 --> 00:19:25,680 when I'm all done counting, let me go ahead and print out 400 00:19:25,680 --> 00:19:27,060 using some f strings. 401 00:19:27,060 --> 00:19:31,450 How about the total number of people whose favorite is Scratch? 402 00:19:31,450 --> 00:19:34,890 So this is just week six f string syntax. 403 00:19:34,890 --> 00:19:38,940 Let me go ahead and print out another f string for C. And I'm, of course, 404 00:19:38,940 --> 00:19:41,880 putting the variables in curly braces, all lowercase, 405 00:19:41,880 --> 00:19:45,060 but the English words I'm doing capitalization for. 406 00:19:45,060 --> 00:19:49,470 Let's do a final one with f Python colon and then in curly braces 407 00:19:49,470 --> 00:19:51,000 Python close quote. 408 00:19:51,000 --> 00:19:52,105 And I think I'm done. 409 00:19:52,105 --> 00:19:53,980 So let me just hide my terminal for a second. 410 00:19:53,980 --> 00:19:55,290 Here's the total program. 411 00:19:55,290 --> 00:19:56,730 Same stuff as before. 412 00:19:56,730 --> 00:19:58,980 Open favorites.csv. 413 00:19:58,980 --> 00:20:02,880 Open it further with the dictionary reader to do that processing for us. 414 00:20:02,880 --> 00:20:06,330 Initialize three variables to 0 just so we have something to count with. 415 00:20:06,330 --> 00:20:09,300 And then iterate over the file row by row. 416 00:20:09,300 --> 00:20:12,120 And this is just some sort of week one style conditional logic, 417 00:20:12,120 --> 00:20:14,410 albeit in Python, counting things. 418 00:20:14,410 --> 00:20:14,910 All right. 419 00:20:14,910 --> 00:20:17,080 So how can we now execute this? 420 00:20:17,080 --> 00:20:18,450 Let me go back to my terminal. 421 00:20:18,450 --> 00:20:20,610 Python of favorites.py. 422 00:20:20,610 --> 00:20:22,050 And here we go. 423 00:20:22,050 --> 00:20:26,400 As of today, everyone who's reporting in live via the Google Form, 424 00:20:26,400 --> 00:20:30,190 their favorite languages are-- 425 00:20:30,190 --> 00:20:31,330 interesting. 426 00:20:31,330 --> 00:20:34,510 That's pretty interesting too after just one week of Python no less. 427 00:20:34,510 --> 00:20:36,310 But Scratch is a healthy contender there. 428 00:20:36,310 --> 00:20:38,210 A lot of C. So a pretty good mix here. 429 00:20:38,210 --> 00:20:43,127 So is this going to be the best way to write this program long term? 430 00:20:43,127 --> 00:20:45,460 Well, as you noted, if there's a new language next week, 431 00:20:45,460 --> 00:20:47,830 this week we're going to have to constantly update this. 432 00:20:47,830 --> 00:20:51,610 And here's where you should let your mind wander to the future. 433 00:20:51,610 --> 00:20:55,270 If we have a fourth language, fifth language, sixth, seventh, eighth, 434 00:20:55,270 --> 00:20:59,410 which aspects here might kind of have some code smell to it? 435 00:20:59,410 --> 00:21:04,310 This probably isn't the best design to set us up for the future. 436 00:21:04,310 --> 00:21:07,480 What might be better than this? 437 00:21:07,480 --> 00:21:08,417 Yeah. 438 00:21:08,417 --> 00:21:09,880 AUDIENCE: We need to add a language to line five. 439 00:21:09,880 --> 00:21:10,330 DAVID MALAN: Yeah. 440 00:21:10,330 --> 00:21:12,288 We have to keep adding a language to line five. 441 00:21:12,288 --> 00:21:13,340 And OK, not a big deal. 442 00:21:13,340 --> 00:21:16,690 We could add SQL today and maybe JavaScript next week. 443 00:21:16,690 --> 00:21:19,510 But any time a line of code, a line of logic, 444 00:21:19,510 --> 00:21:21,450 it's just going to grow out of control. 445 00:21:21,450 --> 00:21:23,950 We've had this chat a couple of times with different syntax. 446 00:21:23,950 --> 00:21:25,700 There's probably a better way than that. 447 00:21:25,700 --> 00:21:26,440 So let's do that. 448 00:21:26,440 --> 00:21:30,550 Instead of using these individual variables, we could maybe use a list, 449 00:21:30,550 --> 00:21:35,560 but a list would be a little confusing because what does bracket 0 mean? 450 00:21:35,560 --> 00:21:37,090 What is bracket 1, bracket 2? 451 00:21:37,090 --> 00:21:40,930 But a dictionary, recall, is this Swiss army knife of data structures 452 00:21:40,930 --> 00:21:44,630 whereby you can associate anything with anything else, keys with values. 453 00:21:44,630 --> 00:21:48,370 So I dare say a cleaner way to solve this problem that sets us up 454 00:21:48,370 --> 00:21:52,570 for less work or confusion later would be to create a new variable called 455 00:21:52,570 --> 00:21:55,000 counts, if that's what we're doing, counting things up, 456 00:21:55,000 --> 00:21:58,340 and just set it equal to an empty dictionary. 457 00:21:58,340 --> 00:22:01,675 And you can literally say dict with the open parenthesis, close parenthesis 458 00:22:01,675 --> 00:22:05,020 and nothing or the more Pythonic just use open 459 00:22:05,020 --> 00:22:07,000 and close curly braces with nothing inside. 460 00:22:07,000 --> 00:22:11,320 That gives me an empty dictionary just like square brackets gives me a list. 461 00:22:11,320 --> 00:22:13,990 Now, my logic down here has to change a little bit. 462 00:22:13,990 --> 00:22:19,150 But what's nice is I don't need one conditional for every language. 463 00:22:19,150 --> 00:22:22,240 Because again, if we have a fourth, a fifth, a sixth, that chunk of code 464 00:22:22,240 --> 00:22:24,580 is also going to grow a bit out of control too. 465 00:22:24,580 --> 00:22:26,570 So I can get rid of this here. 466 00:22:26,570 --> 00:22:29,180 And what I think I'm going to do is say this. 467 00:22:29,180 --> 00:22:32,380 Whatever the current favorite is from the current row in the file, 468 00:22:32,380 --> 00:22:38,420 why don't we go into our counts variable at that key? 469 00:22:38,420 --> 00:22:40,303 And again, favorite is a variable. 470 00:22:40,303 --> 00:22:41,720 It's not quote unquote "favorite." 471 00:22:41,720 --> 00:22:44,560 It's going to be Scratch or C or Python. 472 00:22:44,560 --> 00:22:47,350 And then why don't we go ahead and just increment whatever 473 00:22:47,350 --> 00:22:51,740 the value of that count is at that key? 474 00:22:51,740 --> 00:22:53,500 Now, this is technically buggy. 475 00:22:53,500 --> 00:22:57,140 We're really close, but there is a bug. 476 00:22:57,140 --> 00:23:00,350 Does anyone want to conjecture what the bug is? 477 00:23:00,350 --> 00:23:00,890 Yeah. 478 00:23:00,890 --> 00:23:04,215 AUDIENCE: [INAUDIBLE] 479 00:23:04,215 --> 00:23:08,490 480 00:23:08,490 --> 00:23:11,640 DAVID MALAN: A good question that answers my question nonetheless. 481 00:23:11,640 --> 00:23:12,685 So no. 482 00:23:12,685 --> 00:23:14,310 The magic you describe will not happen. 483 00:23:14,310 --> 00:23:17,490 And to repeat the hypothesis, will this automatically 484 00:23:17,490 --> 00:23:22,590 create a key for every language that we try 485 00:23:22,590 --> 00:23:24,090 plugging into those square brackets? 486 00:23:24,090 --> 00:23:25,140 Short answer no. 487 00:23:25,140 --> 00:23:27,750 Odds are this is going to create a key error, one of those 488 00:23:27,750 --> 00:23:29,640 traceback error messages that you've probably 489 00:23:29,640 --> 00:23:31,890 seen by now either in class or in problem sets 490 00:23:31,890 --> 00:23:37,410 whereby if Scratch hasn't appeared in the dictionary before or C or Python, 491 00:23:37,410 --> 00:23:40,330 then the dictionary has no clue what you're talking about. 492 00:23:40,330 --> 00:23:43,200 So I think we actually still need some conditional logic but not 493 00:23:43,200 --> 00:23:45,900 that's going to grow longer and longer with each language. 494 00:23:45,900 --> 00:23:47,860 What I think we probably want to do is this. 495 00:23:47,860 --> 00:23:52,530 If the current favorite is in the count dictionary, 496 00:23:52,530 --> 00:23:56,730 and this is the Pythonic way of just saying is this key in this dictionary, 497 00:23:56,730 --> 00:24:02,160 then go ahead and safely do counts favorite plus equals 1. 498 00:24:02,160 --> 00:24:05,250 Else, to your conjecture now, else what I want to do. 499 00:24:05,250 --> 00:24:10,330 Counts favorites equals, yeah, 1. 500 00:24:10,330 --> 00:24:14,380 So initialize a brand new key to a brand new value of 1 501 00:24:14,380 --> 00:24:16,630 because I'm obviously just seeing this language. 502 00:24:16,630 --> 00:24:19,190 Otherwise increment again and again. 503 00:24:19,190 --> 00:24:22,390 And now down here I just need to tweak my syntax a little bit. 504 00:24:22,390 --> 00:24:26,540 I don't need to print out all of these things one at a time manually. 505 00:24:26,540 --> 00:24:30,470 I can actually get away I think with another loop at the very bottom here. 506 00:24:30,470 --> 00:24:31,480 So how about I do this? 507 00:24:31,480 --> 00:24:34,790 For each favorite in those counts, and this is, 508 00:24:34,790 --> 00:24:39,340 again, the Pythonic way to iterate over all of the keys in a dictionary, 509 00:24:39,340 --> 00:24:45,220 go ahead and print out using an f string whatever the current favorite is, 510 00:24:45,220 --> 00:24:52,180 Scratch or C or Python, and then a colon and then figure out what its count is. 511 00:24:52,180 --> 00:24:54,610 And you can do that by going into the counts dictionary, 512 00:24:54,610 --> 00:24:58,910 looking at the favorite key, and get back its value. 513 00:24:58,910 --> 00:25:00,190 So I close my curly braces. 514 00:25:00,190 --> 00:25:01,060 I close my quotes. 515 00:25:01,060 --> 00:25:04,690 And even though this looks ugly at the moment, now this is much more dynamic. 516 00:25:04,690 --> 00:25:07,690 Because if we go and add SQL to the CSV file tomorrow 517 00:25:07,690 --> 00:25:10,720 or we add JavaScript next week, this will just work. 518 00:25:10,720 --> 00:25:12,760 It will keep working now automatically. 519 00:25:12,760 --> 00:25:15,140 All I change is the Google Form, not my actual code. 520 00:25:15,140 --> 00:25:15,640 All right. 521 00:25:15,640 --> 00:25:17,620 Let's try Python of favorites.py. 522 00:25:17,620 --> 00:25:19,150 Cross my fingers as always. 523 00:25:19,150 --> 00:25:23,020 And there now is the data as of now. 524 00:25:23,020 --> 00:25:25,270 Questions on this code here? 525 00:25:25,270 --> 00:25:26,195 Yeah. 526 00:25:26,195 --> 00:25:28,093 AUDIENCE: [INAUDIBLE] 527 00:25:28,093 --> 00:25:29,510 DAVID MALAN: Really good question. 528 00:25:29,510 --> 00:25:33,040 What if you wanted to print it in a particular order? 529 00:25:33,040 --> 00:25:35,530 Well, I could give you a couple of solutions. 530 00:25:35,530 --> 00:25:37,390 If you want to print it out in-- 531 00:25:37,390 --> 00:25:40,850 it's already coincidentally in alphabetical order. 532 00:25:40,850 --> 00:25:43,630 So you got that for free although that's just by chance here. 533 00:25:43,630 --> 00:25:45,010 But there is a way to do this. 534 00:25:45,010 --> 00:25:47,320 And let me propose that we go down here to my loop. 535 00:25:47,320 --> 00:25:50,320 And I explicitly use a function you might not have seen in Python yet, 536 00:25:50,320 --> 00:25:52,240 but it's literally called sorted, which is 537 00:25:52,240 --> 00:25:54,970 going to take either a list or, in this case, a dictionary 538 00:25:54,970 --> 00:25:58,270 and by default sort it by key alphabetically. 539 00:25:58,270 --> 00:26:00,907 Now, if my intuition is correct, this is not 540 00:26:00,907 --> 00:26:03,490 going to change the output, because it's already alphabetical. 541 00:26:03,490 --> 00:26:05,980 But if you read the documentation for the sorted function, 542 00:26:05,980 --> 00:26:08,800 it takes multiple parameters potentially, some of which 543 00:26:08,800 --> 00:26:10,000 are named parameters. 544 00:26:10,000 --> 00:26:11,870 And so you can actually do this. 545 00:26:11,870 --> 00:26:14,200 If you want to the counts but you want to reverse 546 00:26:14,200 --> 00:26:18,018 the order for whatever reason here so that it's reverse alphabetical order. 547 00:26:18,018 --> 00:26:19,810 Now let me go ahead and rerun this and I'll 548 00:26:19,810 --> 00:26:21,470 keep the previous output on the screen. 549 00:26:21,470 --> 00:26:27,000 Enter and now it's backwards alphabetically, if you will. 550 00:26:27,000 --> 00:26:29,130 Other questions on this here? 551 00:26:29,130 --> 00:26:31,770 552 00:26:31,770 --> 00:26:33,830 No? 553 00:26:33,830 --> 00:26:39,930 How about then we transition to changing sorting by value. 554 00:26:39,930 --> 00:26:42,770 And this is going to escalate a little quickly briefly 555 00:26:42,770 --> 00:26:44,540 but then we'll tone it down again. 556 00:26:44,540 --> 00:26:47,600 Notice that right now this is indeed sorting by key. 557 00:26:47,600 --> 00:26:49,400 What if, especially if I have lots of data, 558 00:26:49,400 --> 00:26:52,670 it'd be nice to make a top 10 list or, in this case, a top three list 559 00:26:52,670 --> 00:26:58,890 and actually see in order of the counts, the values what these popular ones are. 560 00:26:58,890 --> 00:27:00,620 So it's not C, Python, Scratch. 561 00:27:00,620 --> 00:27:03,770 It should ideally be Python, then C, then 562 00:27:03,770 --> 00:27:06,710 Scratch because of the values and the magnitude thereof. 563 00:27:06,710 --> 00:27:08,010 So how can I do this? 564 00:27:08,010 --> 00:27:12,170 Well, it turns out there is another key, another parameter that you 565 00:27:12,170 --> 00:27:16,550 can pass to the sorted function that is typically 566 00:27:16,550 --> 00:27:17,925 implemented as a function itself. 567 00:27:17,925 --> 00:27:19,633 And so I'm going to go ahead and do this. 568 00:27:19,633 --> 00:27:21,950 I'm going to temporarily define a function called 569 00:27:21,950 --> 00:27:24,230 get value just to make my life easier. 570 00:27:24,230 --> 00:27:29,390 And this get value function is going to take I'll say a language parameter. 571 00:27:29,390 --> 00:27:35,400 And then all I'm going to do is return whatever the count is of that language. 572 00:27:35,400 --> 00:27:37,850 So out of context, this is just a super simple function 573 00:27:37,850 --> 00:27:41,252 that you hand it a language like Scratch or C or Python, 574 00:27:41,252 --> 00:27:43,460 it's just going to tell you what the count is thereof 575 00:27:43,460 --> 00:27:46,400 in that dictionary called counts. 576 00:27:46,400 --> 00:27:51,440 But what I can do now down here in my newly introduced call to sort it is I 577 00:27:51,440 --> 00:27:54,380 can tell it what to use as its key. 578 00:27:54,380 --> 00:27:57,170 Instead of using literally the key, Scratch, C, 579 00:27:57,170 --> 00:28:01,040 Python, I can sort of override that behavior and say, you know what? 580 00:28:01,040 --> 00:28:05,570 To figure out what to sort by, go ahead and call this function 581 00:28:05,570 --> 00:28:07,890 called get value. 582 00:28:07,890 --> 00:28:11,270 Notice that I have not put parentheses after get value 583 00:28:11,270 --> 00:28:14,420 because I don't want to call get value right then and there. 584 00:28:14,420 --> 00:28:20,737 I want to pass the get value function as itself in argument to the sorted 585 00:28:20,737 --> 00:28:23,570 function so that the sorted function written years ago by the people 586 00:28:23,570 --> 00:28:27,320 at Python can call my version of get value again and again 587 00:28:27,320 --> 00:28:30,300 and again when they try to sort this actual data. 588 00:28:30,300 --> 00:28:36,350 So now if I add that and I leave reverse equals true, let's see what happens. 589 00:28:36,350 --> 00:28:38,240 Python of favorites.py. 590 00:28:38,240 --> 00:28:38,810 Enter. 591 00:28:38,810 --> 00:28:42,410 And now I get my top 10, or in this case, top three list. 592 00:28:42,410 --> 00:28:45,290 And if I had more sophisticated data with more columns 593 00:28:45,290 --> 00:28:47,150 all together that I actually care about, I 594 00:28:47,150 --> 00:28:50,758 could even sort this more powerfully as well. 595 00:28:50,758 --> 00:28:52,550 But let me clean this up a little bit, just 596 00:28:52,550 --> 00:28:54,592 so you've seen it, even though we won't use these 597 00:28:54,592 --> 00:28:59,010 that often in CS50 until the end of the class will they come up again. 598 00:28:59,010 --> 00:29:00,740 Technically this is a little bit-- 599 00:29:00,740 --> 00:29:03,620 this isn't necessarily the best design to spend all this time 600 00:29:03,620 --> 00:29:06,680 implementing a function and then only use it in one place. 601 00:29:06,680 --> 00:29:09,430 In general, we've argued that you don't necessarily 602 00:29:09,430 --> 00:29:11,930 need a variable if you're only going to use it in one place. 603 00:29:11,930 --> 00:29:13,430 You don't really need a function if you're only 604 00:29:13,430 --> 00:29:14,638 going to use it in one place. 605 00:29:14,638 --> 00:29:17,160 And here we kind of have a good candidate for that. 606 00:29:17,160 --> 00:29:19,400 And so it turns out in Python if you don't 607 00:29:19,400 --> 00:29:23,240 want to bother creating a function just to use it once, 608 00:29:23,240 --> 00:29:26,810 you can create what's called an anonymous function, a.k.a. 609 00:29:26,810 --> 00:29:29,900 a lambda function, like the lambda symbol familiar. 610 00:29:29,900 --> 00:29:33,380 And a lambda function, the syntax is a little strange looking, 611 00:29:33,380 --> 00:29:34,490 but you say this. 612 00:29:34,490 --> 00:29:36,080 You literally say lambda. 613 00:29:36,080 --> 00:29:38,660 You literally then say the name of the argument 614 00:29:38,660 --> 00:29:42,470 that you want this anonymous function with no name to take. 615 00:29:42,470 --> 00:29:44,430 Then you have a colon. 616 00:29:44,430 --> 00:29:47,930 And then quite simply, you write what you want the return 617 00:29:47,930 --> 00:29:49,850 value of this function to be. 618 00:29:49,850 --> 00:29:51,800 You don't even say return literally. 619 00:29:51,800 --> 00:29:53,720 These lambda functions are meant to be used 620 00:29:53,720 --> 00:30:00,680 super tersely so that you can in one line express something like this. 621 00:30:00,680 --> 00:30:03,720 And I admit this looks more cryptic, I think, than the previous version. 622 00:30:03,720 --> 00:30:06,512 But as you get more comfortable with Python or other languages that 623 00:30:06,512 --> 00:30:10,370 support this feature, it allows you to not bother with lines of code like that 624 00:30:10,370 --> 00:30:12,690 and just tighten up your code a little bit. 625 00:30:12,690 --> 00:30:16,670 So this line here, lambda language colon counts language, 626 00:30:16,670 --> 00:30:18,927 is the one line version of this. 627 00:30:18,927 --> 00:30:21,260 And you don't even need to bother picking a name for it. 628 00:30:21,260 --> 00:30:23,840 Lambda tells Python I didn't waste any time 629 00:30:23,840 --> 00:30:26,440 thinking of a name for this function. 630 00:30:26,440 --> 00:30:35,310 So questions then on this technique of using Python to analyze data like this? 631 00:30:35,310 --> 00:30:37,900 Any questions? 632 00:30:37,900 --> 00:30:39,860 We're almost done with Python. 633 00:30:39,860 --> 00:30:42,070 Questions? 634 00:30:42,070 --> 00:30:42,640 No? 635 00:30:42,640 --> 00:30:43,390 OK. 636 00:30:43,390 --> 00:30:45,790 So why don't we make things a little more interesting? 637 00:30:45,790 --> 00:30:48,620 Because we had a much juicier data set with the problems 638 00:30:48,620 --> 00:30:50,620 that we've assigned over the past several weeks. 639 00:30:50,620 --> 00:30:54,970 Why don't we go ahead and, quite simply, I think 640 00:30:54,970 --> 00:30:57,290 we wrote pretty darn good code here. 641 00:30:57,290 --> 00:31:02,590 So I think we can pretty much just change a bit of it to say, 642 00:31:02,590 --> 00:31:05,720 let's see, if I don't want language, I want problem. 643 00:31:05,720 --> 00:31:10,510 And if I want to sort by not language but problem, I think that's it. 644 00:31:10,510 --> 00:31:12,580 I think if I didn't overlook something here 645 00:31:12,580 --> 00:31:16,210 just by changing what column I'm reading the data from and then 646 00:31:16,210 --> 00:31:18,550 just to be consistent renaming my variables 647 00:31:18,550 --> 00:31:21,310 just so I know what I'm looking at, what will this program now 648 00:31:21,310 --> 00:31:25,470 do after those minor changes? 649 00:31:25,470 --> 00:31:26,970 What will I see when I run this? 650 00:31:26,970 --> 00:31:29,693 651 00:31:29,693 --> 00:31:31,860 What would be the first thing I see when I run this? 652 00:31:31,860 --> 00:31:34,950 653 00:31:34,950 --> 00:31:35,970 Tough crowd today. 654 00:31:35,970 --> 00:31:36,720 Yes. 655 00:31:36,720 --> 00:31:37,637 AUDIENCE: The problem. 656 00:31:37,637 --> 00:31:39,110 DAVID MALAN: Yeah. 657 00:31:39,110 --> 00:31:40,330 The top problem. 658 00:31:40,330 --> 00:31:42,627 So the most popular problem, which I'm a little worried 659 00:31:42,627 --> 00:31:45,210 it might be hello or just Scratch, but let's go ahead and see. 660 00:31:45,210 --> 00:31:47,700 So let me go ahead and open my terminal window. 661 00:31:47,700 --> 00:31:50,670 I'll even maximize my terminal window so we can see a lot. 662 00:31:50,670 --> 00:31:54,167 Let me go ahead and run Python of favorites.py. 663 00:31:54,167 --> 00:31:57,000 I'm going to go ahead now and cross my fingers that I didn't mess up 664 00:31:57,000 --> 00:31:57,660 and hit Enter. 665 00:31:57,660 --> 00:31:59,550 And OK, great. 666 00:31:59,550 --> 00:32:00,480 We peaked early. 667 00:32:00,480 --> 00:32:03,840 So Scratch was the most popular program according 668 00:32:03,840 --> 00:32:06,460 to the data at the time I downloaded it. 669 00:32:06,460 --> 00:32:08,520 I'm sure other votes have come in since. 670 00:32:08,520 --> 00:32:13,080 Filter in week four was tied then with Tideman as well. 671 00:32:13,080 --> 00:32:16,480 Mario is a close third there and so forth. 672 00:32:16,480 --> 00:32:19,540 So this is helpful for us on staff that not so much love 673 00:32:19,540 --> 00:32:21,040 down here at the bottom of the list. 674 00:32:21,040 --> 00:32:23,400 So it was a bunch of code to write, but now 675 00:32:23,400 --> 00:32:26,250 that we've written it in this very versatile dynamic way, 676 00:32:26,250 --> 00:32:29,280 it's pretty good for just like crunching data and doing some analytics. 677 00:32:29,280 --> 00:32:33,480 But it's still a decent number of lines to have had to write manually. 678 00:32:33,480 --> 00:32:41,820 And this is where sometimes it isn't necessarily the right tool for the job, 679 00:32:41,820 --> 00:32:44,520 but rather a candidate for using some other language altogether, 680 00:32:44,520 --> 00:32:47,428 especially when it's not just a one time program that you run 681 00:32:47,428 --> 00:32:48,720 and you want to see the answer. 682 00:32:48,720 --> 00:32:53,130 What if you want to take input from the user and answer questions dynamically 683 00:32:53,130 --> 00:32:55,050 like a mobile app would, like a website would, 684 00:32:55,050 --> 00:32:58,740 like Microsoft Excel or Apple Numbers or Google Sheets would for you. 685 00:32:58,740 --> 00:33:02,430 Well, let's make one final change for now to this version of the program 686 00:33:02,430 --> 00:33:05,020 and actually take in some user input. 687 00:33:05,020 --> 00:33:08,170 So besides just loading all of the data into memory, 688 00:33:08,170 --> 00:33:13,290 let's go ahead and down below here not just print out the top 10 689 00:33:13,290 --> 00:33:16,660 list, if you will, but prompt the user for their favorites. 690 00:33:16,660 --> 00:33:18,660 I'm going to use Python's input function and I'm 691 00:33:18,660 --> 00:33:21,840 going to prompt them with "favorite," quote unquote, like tell me what 692 00:33:21,840 --> 00:33:23,370 your favorite problem-- 693 00:33:23,370 --> 00:33:27,660 what problem rather you are interested in. 694 00:33:27,660 --> 00:33:33,267 And now let me go ahead and say if that favorite is in the count's variable, 695 00:33:33,267 --> 00:33:35,850 so you didn't type in something random that we didn't actually 696 00:33:35,850 --> 00:33:37,890 assign as a problem, then let me go ahead 697 00:33:37,890 --> 00:33:44,130 and print with a format string whatever that favorite is of yours and show you 698 00:33:44,130 --> 00:33:47,040 the actual popularity thereof by indexing 699 00:33:47,040 --> 00:33:51,370 into counts using that favorite as the key and printing this. 700 00:33:51,370 --> 00:33:52,620 So now it's a dynamic program. 701 00:33:52,620 --> 00:33:55,170 It doesn't dump all of the data and all of the summations. 702 00:33:55,170 --> 00:33:59,100 Rather it's going to allow me to see what my choice of favorite is. 703 00:33:59,100 --> 00:34:03,940 And I'm going to go ahead and say, let's see, I'm a fan of Mario here. 704 00:34:03,940 --> 00:34:04,950 So Enter. 705 00:34:04,950 --> 00:34:08,880 And indeed, we see the same value we saw a moment ago but just for Mario. 706 00:34:08,880 --> 00:34:11,880 But the point now is that, one, all of this is possible. 707 00:34:11,880 --> 00:34:15,630 Two, it's way easier and more pleasant than this would have been in C. 708 00:34:15,630 --> 00:34:17,280 This is still only 15 lines of code. 709 00:34:17,280 --> 00:34:19,238 And in C, again, there's the memory management. 710 00:34:19,238 --> 00:34:21,947 There's the iterating over the strings trying to find the commas. 711 00:34:21,947 --> 00:34:23,190 There's just a lot more work. 712 00:34:23,190 --> 00:34:26,489 But honestly even when you just want to answer a question like this 713 00:34:26,489 --> 00:34:29,067 in Excel and Apple Numbers, Google Sheets, generally, you 714 00:34:29,067 --> 00:34:30,150 can just highlight things. 715 00:34:30,150 --> 00:34:34,500 You can click a button and boom, you get your answer for summation or max or min 716 00:34:34,500 --> 00:34:35,909 or any of those sort of basics. 717 00:34:35,909 --> 00:34:39,270 Wouldn't it be nice if we weren't taking a step backwards as programmers 718 00:34:39,270 --> 00:34:42,880 and being more powerful and yet we now have to do more of the work? 719 00:34:42,880 --> 00:34:46,230 So maybe sometimes Python's not or any language 720 00:34:46,230 --> 00:34:48,370 is not the best tool for the job. 721 00:34:48,370 --> 00:34:51,960 And that's going to now allow us to introduce more generally something 722 00:34:51,960 --> 00:34:53,670 called a relational database. 723 00:34:53,670 --> 00:34:57,137 Graduating from mere flat file databases like text files 724 00:34:57,137 --> 00:34:58,845 or binary files in which all of your data 725 00:34:58,845 --> 00:35:00,570 is stored to something more proper. 726 00:35:00,570 --> 00:35:02,566 But first, questions. 727 00:35:02,566 --> 00:35:08,518 AUDIENCE: [INAUDIBLE] 728 00:35:08,518 --> 00:35:12,943 729 00:35:12,943 --> 00:35:14,360 DAVID MALAN: Really good question. 730 00:35:14,360 --> 00:35:17,670 To reiterate, if I were-- is this case sensitive? 731 00:35:17,670 --> 00:35:21,470 So if I were to type in Mario in all lowercase and hit Enter, 732 00:35:21,470 --> 00:35:23,450 I actually get no such response. 733 00:35:23,450 --> 00:35:26,720 Now that might be acceptable, because the problem technically is a capital M. 734 00:35:26,720 --> 00:35:30,840 But that's a little ridiculous to be that pedantic about the input. 735 00:35:30,840 --> 00:35:34,250 So how could we solve this? 736 00:35:34,250 --> 00:35:37,010 Any tips for how we can make this a little more robust? 737 00:35:37,010 --> 00:35:38,378 Yeah. 738 00:35:38,378 --> 00:35:42,290 AUDIENCE: [INAUDIBLE] 739 00:35:42,290 --> 00:35:45,848 740 00:35:45,848 --> 00:35:46,765 DAVID MALAN: OK, yeah. 741 00:35:46,765 --> 00:35:49,400 742 00:35:49,400 --> 00:35:49,900 Yeah. 743 00:35:49,900 --> 00:35:51,700 So we could use a few different functions, one of which 744 00:35:51,700 --> 00:35:54,970 is called title, which will change it to title case where it capitalizes, 745 00:35:54,970 --> 00:35:58,390 like in most English sentences, the first letter of that sentence. 746 00:35:58,390 --> 00:35:59,740 We could use capitalize. 747 00:35:59,740 --> 00:36:00,940 We could use upper. 748 00:36:00,940 --> 00:36:01,990 We could use lower. 749 00:36:01,990 --> 00:36:06,435 But indeed, we could just decide how we want to standardize the capitalization. 750 00:36:06,435 --> 00:36:08,810 Either uppercase, lowercase, or some combination thereof. 751 00:36:08,810 --> 00:36:11,770 And just make sure that you change the counts themselves. 752 00:36:11,770 --> 00:36:13,930 Make sure that you do the same to favorite 753 00:36:13,930 --> 00:36:16,420 and make sure that maybe you keep a backup of the data 754 00:36:16,420 --> 00:36:19,780 if you want to show the original version that came from the CSV 755 00:36:19,780 --> 00:36:23,205 without presuming to just capitalize everything for the user. 756 00:36:23,205 --> 00:36:25,330 But indeed, that would be the most common scenario. 757 00:36:25,330 --> 00:36:28,540 You just make things case insensitive when doing those matches. 758 00:36:28,540 --> 00:36:36,910 Other questions now on Python before we leave it behind for the coming week? 759 00:36:36,910 --> 00:36:39,880 All right, well then let's introduce these relational databases. 760 00:36:39,880 --> 00:36:43,060 So relational database is what every-- 761 00:36:43,060 --> 00:36:45,572 it's a super popular way of storing lots of data. 762 00:36:45,572 --> 00:36:48,280 This is what the Twitters of the world, the Googles of the world, 763 00:36:48,280 --> 00:36:52,390 the Metas of the world use to store some of their data at scale. 764 00:36:52,390 --> 00:36:55,492 There are alternatives to relational databases. 765 00:36:55,492 --> 00:36:57,700 Indeed, today we'll talk about a language called SQL. 766 00:36:57,700 --> 00:37:00,408 There's also a movement, if you will, or an alternative generally 767 00:37:00,408 --> 00:37:02,330 called NoSQL, which is just the opposite. 768 00:37:02,330 --> 00:37:03,280 You don't use SQL. 769 00:37:03,280 --> 00:37:06,020 There are things called object oriented databases and the like. 770 00:37:06,020 --> 00:37:11,110 But if you've ever heard of MySQL or PostgreSQL or Microsoft SQL 771 00:37:11,110 --> 00:37:15,580 Server or Oracle or MariaDB or a bunch of other products, 772 00:37:15,580 --> 00:37:18,850 both free and commercial, this is what they're talking about. 773 00:37:18,850 --> 00:37:21,560 Databases that are designed to store lots of data. 774 00:37:21,560 --> 00:37:23,710 And what's nice about relational databases 775 00:37:23,710 --> 00:37:26,530 is that they're really similar to the spreadsheets with which you 776 00:37:26,530 --> 00:37:29,800 were presumably familiar long before today's class. 777 00:37:29,800 --> 00:37:32,530 So a relational database is going to store, 778 00:37:32,530 --> 00:37:35,452 as you'll see, all of the data in rows and columns. 779 00:37:35,452 --> 00:37:37,910 Now, the terminology will thereafter be a little different. 780 00:37:37,910 --> 00:37:40,570 Instead of having sheets, you're going to have tables. 781 00:37:40,570 --> 00:37:43,240 But those tables are still going to have rows and columns. 782 00:37:43,240 --> 00:37:46,330 And you're going to have even more control over the performance 783 00:37:46,330 --> 00:37:51,970 of your data when you start to access it using this Structured Query Language 784 00:37:51,970 --> 00:37:53,050 or SQL. 785 00:37:53,050 --> 00:37:56,440 This is a language you can use for web apps, mobile apps. 786 00:37:56,440 --> 00:37:58,990 A lot of analysts would sit down at their Mac or PC 787 00:37:58,990 --> 00:38:01,810 and actually ask questions of data to get back the answer. 788 00:38:01,810 --> 00:38:05,050 And wonderfully, even though there will be some new syntax today, 789 00:38:05,050 --> 00:38:07,570 SQL really just does four basic things. 790 00:38:07,570 --> 00:38:10,030 CRUD is the sort of crude acronym here. 791 00:38:10,030 --> 00:38:14,470 CRUD is a way of remembering that a relational database supports 792 00:38:14,470 --> 00:38:20,390 ultimately creating data, reading data, updating data, and deleting data. 793 00:38:20,390 --> 00:38:22,240 So even if you're feeling like, wow, this 794 00:38:22,240 --> 00:38:25,510 is a lot of new syntax, which it isn't relative to our past languages, 795 00:38:25,510 --> 00:38:29,650 the only things you're doing really are creating data, reading data, updating, 796 00:38:29,650 --> 00:38:30,910 and deleting the same. 797 00:38:30,910 --> 00:38:36,940 Now a little confusingly in SQL, the corresponding functions or commands 798 00:38:36,940 --> 00:38:41,150 that exist that map to CRUD are actually this. 799 00:38:41,150 --> 00:38:44,290 So it's still create, but there's another one called insert. 800 00:38:44,290 --> 00:38:48,250 It's not read, which is more of the computer scientist way of saying it, 801 00:38:48,250 --> 00:38:50,380 but select, which is a little more explicit. 802 00:38:50,380 --> 00:38:52,030 Like select data you care about. 803 00:38:52,030 --> 00:38:53,320 Update is still update. 804 00:38:53,320 --> 00:38:54,310 Delete is still delete. 805 00:38:54,310 --> 00:38:57,910 But there's another command called drop, which lets you drop, that is delete, 806 00:38:57,910 --> 00:38:59,930 entire tables as well. 807 00:38:59,930 --> 00:39:03,280 So you can create tables using syntax that's 808 00:39:03,280 --> 00:39:05,480 generally going to look like this. 809 00:39:05,480 --> 00:39:06,910 You'll say create table. 810 00:39:06,910 --> 00:39:10,210 You'll give the name of the table, which you can call most 811 00:39:10,210 --> 00:39:14,260 anything you want, but generally all lowercase, no spaces is best. 812 00:39:14,260 --> 00:39:18,550 Then in parentheses, you can specify a comma separated list of the columns 813 00:39:18,550 --> 00:39:20,660 that you might want in this table. 814 00:39:20,660 --> 00:39:24,070 So this is the code equivalent in the SQL language 815 00:39:24,070 --> 00:39:27,220 of manually opening Google Sheets or Excel or Numbers 816 00:39:27,220 --> 00:39:30,490 and clicking in the top left cell and like typing timestamp 817 00:39:30,490 --> 00:39:34,180 and then in the next typing language and then the third typing problem. 818 00:39:34,180 --> 00:39:37,750 This is the way to define what your headers are, if you will, 819 00:39:37,750 --> 00:39:38,500 in a spreadsheet. 820 00:39:38,500 --> 00:39:39,790 But now it's called a table. 821 00:39:39,790 --> 00:39:42,100 Now, we won't use this command manually first. 822 00:39:42,100 --> 00:39:44,050 Let's do something a little simpler. 823 00:39:44,050 --> 00:39:47,590 We're going to start off by just importing this data ourselves. 824 00:39:47,590 --> 00:39:49,340 And I'm going to go ahead and do this. 825 00:39:49,340 --> 00:39:51,160 Let me go back to VS Code here. 826 00:39:51,160 --> 00:39:54,040 I'm going to leave behind favorites.py for now, 827 00:39:54,040 --> 00:39:57,100 because now we're going to transition to this other language called SQL. 828 00:39:57,100 --> 00:40:02,680 And to do this, I am going to create a new database file. 829 00:40:02,680 --> 00:40:06,790 And I'm going to do so using a command called sqlite3, 830 00:40:06,790 --> 00:40:09,160 which is just the third version thereof, and I'm 831 00:40:09,160 --> 00:40:12,110 going to give the database a name of favorites.db. 832 00:40:12,110 --> 00:40:14,860 There's different conventions, but this is one of the most common. 833 00:40:14,860 --> 00:40:18,610 When I hit Enter, this is going to create for me a new empty database just 834 00:40:18,610 --> 00:40:23,080 like opening an untitled spreadsheet in Excel, Google Sheets, or Apple Numbers. 835 00:40:23,080 --> 00:40:25,720 I'm being prompted do I want to create favorites.db. 836 00:40:25,720 --> 00:40:26,890 I'll hit Y for Yes. 837 00:40:26,890 --> 00:40:28,562 OK, we're up and running. 838 00:40:28,562 --> 00:40:30,520 Now, you're going to notice a different prompt. 839 00:40:30,520 --> 00:40:34,360 I'm not in my Linux prompt per se, which is always the dollar sign. 840 00:40:34,360 --> 00:40:37,090 I'm now inside of the program called SQLite. 841 00:40:37,090 --> 00:40:41,500 And we're going to use SQLite, SQLite3, as just an interactive way for now 842 00:40:41,500 --> 00:40:43,180 of playing with SQL code. 843 00:40:43,180 --> 00:40:48,550 At the end of today, we'll show you how you can use SQL in Python code 844 00:40:48,550 --> 00:40:51,050 so that you still write Python code to do whatever you want, 845 00:40:51,050 --> 00:40:53,170 but you can talk to databases using Python. 846 00:40:53,170 --> 00:40:55,630 And this is exactly how web apps, mobile apps work. 847 00:40:55,630 --> 00:40:59,140 For instance, on iOS, on an iPhone, an iPad, or the like, 848 00:40:59,140 --> 00:41:03,010 if you want to store data, it's very often stored in a SQL database, 849 00:41:03,010 --> 00:41:04,580 as we're about to do. 850 00:41:04,580 --> 00:41:07,398 But you might use a language called SWIFT or Objective C. 851 00:41:07,398 --> 00:41:09,190 And the same exists in the world of Android 852 00:41:09,190 --> 00:41:12,830 using Java or Kotlin or something else to query the database. 853 00:41:12,830 --> 00:41:16,270 So we're going to see SQL in isolation for now like an analyst might just 854 00:41:16,270 --> 00:41:19,880 use with their Mac or PC, but we're going to tie it together by day's end. 855 00:41:19,880 --> 00:41:28,180 So at this terminal SQLite, let me go ahead and execute this command first. 856 00:41:28,180 --> 00:41:30,200 I'm going to first put SQLite into CSV mode, 857 00:41:30,200 --> 00:41:32,200 because I'm going to cut some corners initially. 858 00:41:32,200 --> 00:41:34,600 And I'm just going to automatically import 859 00:41:34,600 --> 00:41:38,560 all of the data that was submitted via the Google Form, which I exported 860 00:41:38,560 --> 00:41:41,410 as a CSV and uploaded to my code space. 861 00:41:41,410 --> 00:41:45,790 And I'm just going to automatically say turn this CSV file into a SQL 862 00:41:45,790 --> 00:41:48,040 database for me just so I don't have to figure out 863 00:41:48,040 --> 00:41:49,760 what those create table commands are. 864 00:41:49,760 --> 00:41:53,200 So to do this, I'm going to say mode csv so 865 00:41:53,200 --> 00:41:57,850 that SQLite knows that this is the command, knows that this is a CSV file. 866 00:41:57,850 --> 00:41:59,290 It's literally .mode. 867 00:41:59,290 --> 00:42:01,480 So the dot comes before the keyword there. 868 00:42:01,480 --> 00:42:05,770 And now I'm going to say .import and then the name of the file I want 869 00:42:05,770 --> 00:42:08,020 to import, which is favorites.csv. 870 00:42:08,020 --> 00:42:12,635 And now the name of the table that I want to create with that data. 871 00:42:12,635 --> 00:42:15,010 And just for consistency, I'm going to call it favorites. 872 00:42:15,010 --> 00:42:17,270 I could change these things to be anything I want, 873 00:42:17,270 --> 00:42:18,530 but I'm going to do that. 874 00:42:18,530 --> 00:42:20,860 And voila, nothing seems to have happened. 875 00:42:20,860 --> 00:42:24,503 But just like in C and in Python and Linux when nothing seems to happen, 876 00:42:24,503 --> 00:42:25,670 that's usually a good thing. 877 00:42:25,670 --> 00:42:27,590 It means I didn't mess up. 878 00:42:27,590 --> 00:42:31,120 So if I want to see what just happened, there's this other command. 879 00:42:31,120 --> 00:42:33,430 And these commands that start with dots, these 880 00:42:33,430 --> 00:42:37,690 are SQLite specific, which is indeed a lightweight version of SQL. 881 00:42:37,690 --> 00:42:39,010 They're not SQL, per se. 882 00:42:39,010 --> 00:42:41,260 So if you're using Oracle or something else like that, 883 00:42:41,260 --> 00:42:43,135 you're not going to use these exact commands. 884 00:42:43,135 --> 00:42:45,160 You'll see the ones we use in just a moment. 885 00:42:45,160 --> 00:42:46,660 And here's the first. 886 00:42:46,660 --> 00:42:51,230 When I type .schema, the schema of a database is the design of the database. 887 00:42:51,230 --> 00:42:52,120 What are the tables? 888 00:42:52,120 --> 00:42:53,900 What are the columns and all of that? 889 00:42:53,900 --> 00:42:58,390 So when I type .schema, this actually in this case shows me the create table 890 00:42:58,390 --> 00:43:03,700 command that was automatically drawn for me by just doing this import line. 891 00:43:03,700 --> 00:43:07,450 Once I get more comfortable with SQL, I could literally type this out myself 892 00:43:07,450 --> 00:43:09,560 or use some program to generate that as well. 893 00:43:09,560 --> 00:43:11,290 But what it's creating for me is this. 894 00:43:11,290 --> 00:43:15,580 Create table if it doesn't exist, even though it's more terse than that. 895 00:43:15,580 --> 00:43:18,130 I want to create a table called favorites. 896 00:43:18,130 --> 00:43:20,590 And then the columns for that table are going 897 00:43:20,590 --> 00:43:24,880 to be timestamp, which is going to be text, comma, language, 898 00:43:24,880 --> 00:43:27,248 which is also going to be text, comma, problem, 899 00:43:27,248 --> 00:43:28,540 which is also going to be text. 900 00:43:28,540 --> 00:43:34,330 That was just inferred very trivially by the .import command to just figure out 901 00:43:34,330 --> 00:43:38,680 that, yes, just give me a three column database table based on the Google 902 00:43:38,680 --> 00:43:40,390 Form. 903 00:43:40,390 --> 00:43:41,740 Questions on this? 904 00:43:41,740 --> 00:43:44,140 These are commands you run once to get up and running. 905 00:43:44,140 --> 00:43:47,260 You don't run these commands frequently, but we have them 906 00:43:47,260 --> 00:43:49,280 on the slide just for reference. 907 00:43:49,280 --> 00:43:49,780 All right. 908 00:43:49,780 --> 00:43:52,010 So now let's do something a little more interesting. 909 00:43:52,010 --> 00:43:55,330 I'm going to clear my SQLite terminal here, but I'm still in SQLite. 910 00:43:55,330 --> 00:43:58,570 I'm going to now use some of my first SQL commands, which 911 00:43:58,570 --> 00:44:02,440 recall were among them select. 912 00:44:02,440 --> 00:44:05,522 So CRUD, C-R-U-D. The R was select. 913 00:44:05,522 --> 00:44:07,480 This is maybe the most common, the most useful, 914 00:44:07,480 --> 00:44:10,330 the most powerful thing to use with a SQL database 915 00:44:10,330 --> 00:44:13,000 selecting data to answer questions akin to the ones we 916 00:44:13,000 --> 00:44:14,980 were trying to answer with Python. 917 00:44:14,980 --> 00:44:16,270 This is the general syntax. 918 00:44:16,270 --> 00:44:19,900 Any time you want to select data from a SQL database, you literally say select. 919 00:44:19,900 --> 00:44:24,730 You then specify the column or columns that you want to select data from. 920 00:44:24,730 --> 00:44:28,450 You literally write the word from and then you specify the name of the table. 921 00:44:28,450 --> 00:44:32,030 You want to get that data from semicolon, in this case. 922 00:44:32,030 --> 00:44:35,200 Everything that's in capitals here is a SQL keyword. 923 00:44:35,200 --> 00:44:37,750 Strictly speaking, you don't have to capitalize things, 924 00:44:37,750 --> 00:44:40,150 but we would encourage you to do so stylistically. 925 00:44:40,150 --> 00:44:43,030 And especially as you're learning and even as you're writing it, 926 00:44:43,030 --> 00:44:47,530 it just helps to distinguish SQL from words you chose, like the names 927 00:44:47,530 --> 00:44:49,460 of the columns and the data therein. 928 00:44:49,460 --> 00:44:52,700 So do adopt early on this convention. 929 00:44:52,700 --> 00:44:56,240 So let me go back now to my code space here. 930 00:44:56,240 --> 00:44:59,012 I'm running my terminal window with SQLite3 inside of it. 931 00:44:59,012 --> 00:45:00,970 Suppose that I just want to get all of the data 932 00:45:00,970 --> 00:45:04,720 from the favorites table, which was automatically imported. 933 00:45:04,720 --> 00:45:05,480 Let's do this. 934 00:45:05,480 --> 00:45:06,220 Select. 935 00:45:06,220 --> 00:45:07,630 I want everything. 936 00:45:07,630 --> 00:45:13,090 Well, I can do timestamp comma language comma problem. 937 00:45:13,090 --> 00:45:13,840 But you know what? 938 00:45:13,840 --> 00:45:15,850 Here's a convenience already. 939 00:45:15,850 --> 00:45:19,120 If you want everything, there's what's called a wild card character 940 00:45:19,120 --> 00:45:21,460 in SQL, which is just a star, an asterisk, which 941 00:45:21,460 --> 00:45:25,120 means give me every column without my knowing even what they're called. 942 00:45:25,120 --> 00:45:28,400 Let me go ahead now and say from favorites semicolon. 943 00:45:28,400 --> 00:45:33,010 And this is the SQL way of opening the database, iterating over 944 00:45:33,010 --> 00:45:36,400 every row therein, printing out every row therein, done. 945 00:45:36,400 --> 00:45:39,640 So those three steps, which was like nine lines of Python code 946 00:45:39,640 --> 00:45:42,700 give or take earlier, is now one line of SQL. 947 00:45:42,700 --> 00:45:43,990 I hit Enter. 948 00:45:43,990 --> 00:45:46,370 There is all of the data. 949 00:45:46,370 --> 00:45:48,130 So I see now all of the data. 950 00:45:48,130 --> 00:45:50,230 Just output it as a CSV here. 951 00:45:50,230 --> 00:45:51,940 But it's not the CSV file. 952 00:45:51,940 --> 00:45:53,210 It's now actually the table. 953 00:45:53,210 --> 00:45:54,580 And in fact, just for good measure, let me 954 00:45:54,580 --> 00:45:58,000 do this, because you'll see the behavior a little differently the next time we 955 00:45:58,000 --> 00:45:58,690 open the file. 956 00:45:58,690 --> 00:46:00,730 I've just exited out of SQLite3. 957 00:46:00,730 --> 00:46:03,490 I'm going to rerun it, but I'm not going to reimport the data 958 00:46:03,490 --> 00:46:05,920 or do anything like that, because my file now exists. 959 00:46:05,920 --> 00:46:07,790 In fact, let me take one step back. 960 00:46:07,790 --> 00:46:11,830 If I type ls at my Linux prompt, there's my favorites.py from before. 961 00:46:11,830 --> 00:46:13,870 There's my favorites.csv from before. 962 00:46:13,870 --> 00:46:18,580 And here's a third file that I did create a moment ago when I first 963 00:46:18,580 --> 00:46:19,690 ran SQLite3. 964 00:46:19,690 --> 00:46:21,130 So the data is persistent. 965 00:46:21,130 --> 00:46:23,050 It's not using RAM or memory. 966 00:46:23,050 --> 00:46:24,950 Anything I do now is saved there. 967 00:46:24,950 --> 00:46:27,653 So let's go ahead and rerun SQLite3 with the same file. 968 00:46:27,653 --> 00:46:28,570 But I'm not going to-- 969 00:46:28,570 --> 00:46:31,130 I don't have to reimport everything, because the file already exists. 970 00:46:31,130 --> 00:46:32,800 Let me now do that same thing again. 971 00:46:32,800 --> 00:46:36,460 Select star from favorites to get all of the data. 972 00:46:36,460 --> 00:46:39,990 And what you'll see now is the same data, but it's a little prettier now. 973 00:46:39,990 --> 00:46:44,400 Because I reran it, I effectively disabled CSV mode this time. 974 00:46:44,400 --> 00:46:49,590 And what I'm now seeing is the entire contents of this database 975 00:46:49,590 --> 00:46:51,420 table called favorites. 976 00:46:51,420 --> 00:46:53,370 Now, there's nothing new here, but you're just 977 00:46:53,370 --> 00:46:55,980 seeing now like an ASCII or Unicode version of all 978 00:46:55,980 --> 00:46:57,988 of the same data from that database. 979 00:46:57,988 --> 00:47:00,030 Well, suppose I want to get a subset of the data. 980 00:47:00,030 --> 00:47:01,238 Well, let me clear my screen. 981 00:47:01,238 --> 00:47:04,530 And just like in Linux, I can Control L just to clean things up aesthetically. 982 00:47:04,530 --> 00:47:06,580 Suppose I want to get just the languages. 983 00:47:06,580 --> 00:47:09,540 So I could do select language from favorites. 984 00:47:09,540 --> 00:47:12,930 And this will now select not all three columns, a.k.a. 985 00:47:12,930 --> 00:47:17,280 star, this will only select the language column and all of the data therein. 986 00:47:17,280 --> 00:47:19,020 If I hit Enter, voila. 987 00:47:19,020 --> 00:47:21,150 Now I just see those there. 988 00:47:21,150 --> 00:47:22,990 No timestamps, no problems. 989 00:47:22,990 --> 00:47:25,470 It's just a slice of the table, if you will. 990 00:47:25,470 --> 00:47:30,010 All right, not that interesting still because it's just a big column of data. 991 00:47:30,010 --> 00:47:31,900 But now things get more interesting. 992 00:47:31,900 --> 00:47:35,850 It turns out in SQL that there are functions that come with this language, 993 00:47:35,850 --> 00:47:37,920 just like C, just like Python. 994 00:47:37,920 --> 00:47:41,790 In SQL, some of the more useful ones, some of the simpler ones, 995 00:47:41,790 --> 00:47:42,630 are these here. 996 00:47:42,630 --> 00:47:44,880 Average, count, distinct, lower, max, min, 997 00:47:44,880 --> 00:47:48,150 upper, which pretty much do what they say. 998 00:47:48,150 --> 00:47:50,533 And count is a particularly useful one. 999 00:47:50,533 --> 00:47:51,450 Let's start with that. 1000 00:47:51,450 --> 00:47:55,740 It's a reasonable question to be asked how many people submitted the Google 1001 00:47:55,740 --> 00:47:58,523 Form by the time I actually downloaded the CSV. 1002 00:47:58,523 --> 00:48:00,190 Well, why don't we go ahead and do this? 1003 00:48:00,190 --> 00:48:02,850 Let me go back to VS Code here in my terminal window. 1004 00:48:02,850 --> 00:48:06,900 Let me select not star but the count of star. 1005 00:48:06,900 --> 00:48:10,020 So give me the count of the rows that are being 1006 00:48:10,020 --> 00:48:12,660 returned from the database called-- 1007 00:48:12,660 --> 00:48:14,190 the database table called favorites. 1008 00:48:14,190 --> 00:48:16,648 Now when I hit Enter, I'm not going to get all of the data. 1009 00:48:16,648 --> 00:48:18,630 I'm just going to get simply a number. 1010 00:48:18,630 --> 00:48:21,480 430 rows came back. 1011 00:48:21,480 --> 00:48:22,740 So that's pretty good. 1012 00:48:22,740 --> 00:48:24,580 I now know how much data is in there. 1013 00:48:24,580 --> 00:48:26,350 Well, what languages were in there? 1014 00:48:26,350 --> 00:48:30,520 Well, I could do select language from favorites just as before, 1015 00:48:30,520 --> 00:48:33,270 but that's not that useful, especially if I'm inheriting the data. 1016 00:48:33,270 --> 00:48:35,770 Like I'm the analyst who's been handed a data set by my boss 1017 00:48:35,770 --> 00:48:37,440 and they want me to crunch some numbers. 1018 00:48:37,440 --> 00:48:39,810 OK, I could load this into Excel. 1019 00:48:39,810 --> 00:48:40,770 I could sort it. 1020 00:48:40,770 --> 00:48:44,400 But you can use SQL now to answer pretty basic questions too. 1021 00:48:44,400 --> 00:48:48,540 If you want to select the distinct languages in the data set because you 1022 00:48:48,540 --> 00:48:51,000 weren't privy to the Google Form, let me go ahead 1023 00:48:51,000 --> 00:48:56,820 and select only the distinct languages from the favorites table. 1024 00:48:56,820 --> 00:49:00,330 And now I hit Enter and I get back a much more succinct answer. 1025 00:49:00,330 --> 00:49:02,220 Just the three languages in question. 1026 00:49:02,220 --> 00:49:04,963 Not really that useful since I created the Google Form, 1027 00:49:04,963 --> 00:49:07,380 but certainly if you're inheriting data from someone else, 1028 00:49:07,380 --> 00:49:09,390 you've just downloaded a data set, at least now 1029 00:49:09,390 --> 00:49:12,190 I'm arguably wrapping my mind around what's going on. 1030 00:49:12,190 --> 00:49:15,178 Now, this is not necessary for such a small data set, 1031 00:49:15,178 --> 00:49:16,470 but I can combine these things. 1032 00:49:16,470 --> 00:49:21,180 Select the count of the distinct languages in this data set 1033 00:49:21,180 --> 00:49:22,140 called favorites. 1034 00:49:22,140 --> 00:49:25,220 And now I should get back what answer? 1035 00:49:25,220 --> 00:49:27,378 So hopefully indeed an answer called three. 1036 00:49:27,378 --> 00:49:29,420 And what you're getting back notice aesthetically 1037 00:49:29,420 --> 00:49:32,180 too is like a mini temporary table. 1038 00:49:32,180 --> 00:49:36,890 When I asked for just the distinct languages, what SQL hands 1039 00:49:36,890 --> 00:49:39,410 me back is this temporary table in memory 1040 00:49:39,410 --> 00:49:43,340 that has one column called language and then three rows. 1041 00:49:43,340 --> 00:49:45,210 Now, this is not saved anywhere. 1042 00:49:45,210 --> 00:49:47,450 It's just executed ephemerally like this. 1043 00:49:47,450 --> 00:49:49,580 But that's why it's depicted in this way. 1044 00:49:49,580 --> 00:49:52,610 What you're getting is subsets of your data, smaller 1045 00:49:52,610 --> 00:49:54,410 tables containing some of your data. 1046 00:49:54,410 --> 00:49:55,550 And same thing down here. 1047 00:49:55,550 --> 00:49:58,400 This is a crazy long column name. 1048 00:49:58,400 --> 00:50:00,480 You can rename it if you really want. 1049 00:50:00,480 --> 00:50:02,900 But that's all we're seeing there. 1050 00:50:02,900 --> 00:50:07,100 And in fact, if that's a little ugly, we can actually alias these things. 1051 00:50:07,100 --> 00:50:12,090 N is a common name for a variable, a number in any programming language. 1052 00:50:12,090 --> 00:50:15,200 So I can actually alias this to be a column called n. 1053 00:50:15,200 --> 00:50:16,160 Hit Enter. 1054 00:50:16,160 --> 00:50:19,490 And now I'm getting a tiny, tiny table whose column is called 1055 00:50:19,490 --> 00:50:23,280 n that just has the one value there. 1056 00:50:23,280 --> 00:50:28,500 All right, questions on these application of these functions here? 1057 00:50:28,500 --> 00:50:31,950 1058 00:50:31,950 --> 00:50:33,152 Questions, yeah. 1059 00:50:33,152 --> 00:50:35,000 AUDIENCE: [INAUDIBLE] 1060 00:50:35,000 --> 00:50:36,400 DAVID MALAN: Say a little louder. 1061 00:50:36,400 --> 00:50:37,280 AUDIENCE: A-S. 1062 00:50:37,280 --> 00:50:38,475 DAVID MALAN: Oh, A-S. As. 1063 00:50:38,475 --> 00:50:39,350 Literally in English. 1064 00:50:39,350 --> 00:50:42,650 So rename this column as this. 1065 00:50:42,650 --> 00:50:45,317 Technically it creates an alias for the column. 1066 00:50:45,317 --> 00:50:45,900 So that's all. 1067 00:50:45,900 --> 00:50:46,818 Yeah? 1068 00:50:46,818 --> 00:50:49,955 AUDIENCE: [INAUDIBLE] 1069 00:50:49,955 --> 00:50:50,830 DAVID MALAN: Exactly. 1070 00:50:50,830 --> 00:50:53,530 Distinct will operate on whatever you hand it in parentheses 1071 00:50:53,530 --> 00:50:57,460 and get rid of all of the duplicates, giving you back just the uniques. 1072 00:50:57,460 --> 00:50:58,360 Correct. 1073 00:50:58,360 --> 00:50:59,780 Other questions here? 1074 00:50:59,780 --> 00:51:00,524 Yeah. 1075 00:51:00,524 --> 00:51:02,944 AUDIENCE: [INAUDIBLE] 1076 00:51:02,944 --> 00:51:06,935 1077 00:51:06,935 --> 00:51:08,060 DAVID MALAN: Good question. 1078 00:51:08,060 --> 00:51:10,355 When you define an alias like n, which I just did, 1079 00:51:10,355 --> 00:51:12,230 does it become like a variable you can reuse? 1080 00:51:12,230 --> 00:51:16,580 Short answer, no in this case, but you can reuse it within your same query. 1081 00:51:16,580 --> 00:51:20,630 Even though these queries are getting a little longer, admittedly, 1082 00:51:20,630 --> 00:51:23,270 statements that they are, you can actually 1083 00:51:23,270 --> 00:51:26,090 reuse n in an even longer query. 1084 00:51:26,090 --> 00:51:27,440 So later in your query. 1085 00:51:27,440 --> 00:51:30,380 And we'll see a few that are going to start to grow in length. 1086 00:51:30,380 --> 00:51:33,050 So it's a nice way of nicknaming things just to be 1087 00:51:33,050 --> 00:51:35,190 a little more terse in your query. 1088 00:51:35,190 --> 00:51:38,480 So we can transition to some of these more sophisticated queries 1089 00:51:38,480 --> 00:51:41,360 because it turns out there are some other techniques we 1090 00:51:41,360 --> 00:51:42,630 can introduce as well. 1091 00:51:42,630 --> 00:51:44,660 Here are some other keywords in SQL. 1092 00:51:44,660 --> 00:51:47,005 And again, even though this is another list of things, 1093 00:51:47,005 --> 00:51:49,130 there's only four things fundamentally we're doing. 1094 00:51:49,130 --> 00:51:51,600 Creating, reading, updating, and deleting data. 1095 00:51:51,600 --> 00:51:55,580 These are just allowing us to fine tune how we do it exactly. 1096 00:51:55,580 --> 00:51:59,210 So where is going to allow us to filter data, as we'll do in just a moment. 1097 00:51:59,210 --> 00:52:03,440 Like select data where this conditional is true. 1098 00:52:03,440 --> 00:52:06,260 Like is going to be an alternative to an equal sign. 1099 00:52:06,260 --> 00:52:11,210 So instead of looking for exactly Scratch or exactly Python or exactly C, 1100 00:52:11,210 --> 00:52:13,790 you can look for something like dot dot dot 1101 00:52:13,790 --> 00:52:16,770 and it can be a little bit of a fuzzier match, if you will, 1102 00:52:16,770 --> 00:52:18,020 with other characters as well. 1103 00:52:18,020 --> 00:52:19,790 Order by is going to deal with sorting. 1104 00:52:19,790 --> 00:52:23,360 Limit is going to just let me limit the total number of rows that come back 1105 00:52:23,360 --> 00:52:26,330 to 1 or 10 or finite if I don't want to see 1106 00:52:26,330 --> 00:52:28,375 all 400 plus rows all at once, because I'm just 1107 00:52:28,375 --> 00:52:29,750 trying to wrap my mind around it. 1108 00:52:29,750 --> 00:52:32,207 And group by is best shown by example. 1109 00:52:32,207 --> 00:52:34,290 So let's play with just a couple of these as well. 1110 00:52:34,290 --> 00:52:36,080 Let me go back to VS Code here. 1111 00:52:36,080 --> 00:52:37,170 I'll clear my screen. 1112 00:52:37,170 --> 00:52:39,530 I'm still in the same SQLite instance. 1113 00:52:39,530 --> 00:52:44,040 And let's count how many of you liked C without writing Python code as before. 1114 00:52:44,040 --> 00:52:47,960 So let me go ahead and select the count of the rows 1115 00:52:47,960 --> 00:52:54,530 from favorites where the language in each row equals C. 1116 00:52:54,530 --> 00:52:58,100 And the convention in SQLite is to use single quotes any time 1117 00:52:58,100 --> 00:53:02,810 you're surrounding a string that's meant to represent a literal piece of text 1118 00:53:02,810 --> 00:53:06,720 as opposed to C, which was double quotes, or Python, which was either. 1119 00:53:06,720 --> 00:53:11,090 So this is selecting the count of rows from favorites table where 1120 00:53:11,090 --> 00:53:13,730 the language in question is C. Enter. 1121 00:53:13,730 --> 00:53:15,410 And this gives me 98. 1122 00:53:15,410 --> 00:53:19,190 Notice, though, if I omit that predicate like we did before, 1123 00:53:19,190 --> 00:53:23,850 you'll get back the total number of rows that were in the table. 1124 00:53:23,850 --> 00:53:26,030 So where is what's called a predicate that just 1125 00:53:26,030 --> 00:53:30,020 allows me to filter things just like an if condition or the like in a language 1126 00:53:30,020 --> 00:53:31,130 that we've seen before. 1127 00:53:31,130 --> 00:53:34,670 You can be a little more specific like how many people really liked 1128 00:53:34,670 --> 00:53:38,870 C and the Mario problem specifically? 1129 00:53:38,870 --> 00:53:40,440 Well, let's do this. 1130 00:53:40,440 --> 00:53:45,830 Let's go ahead and do select the number of rows from the favorites table 1131 00:53:45,830 --> 00:53:49,640 where the language is C and. 1132 00:53:49,640 --> 00:53:52,880 So it's still literally the word ands and or, just 1133 00:53:52,880 --> 00:53:57,830 like in Python but not like in C. And problem equals Mario. 1134 00:53:57,830 --> 00:54:01,520 So let's see if there's any fans of both C and the Mario problem. 1135 00:54:01,520 --> 00:54:05,638 And three of us really like those two things together in this case. 1136 00:54:05,638 --> 00:54:06,930 All right, what else can we do? 1137 00:54:06,930 --> 00:54:10,280 Well, more compelling might be to see, kind of like in Python, 1138 00:54:10,280 --> 00:54:13,580 for each language, what was the popularity thereof? 1139 00:54:13,580 --> 00:54:15,560 And at the moment, we don't really have a way 1140 00:54:15,560 --> 00:54:17,960 of doing that except in Python where we had the loop 1141 00:54:17,960 --> 00:54:20,710 and we had those variables and the dictionary that did all of that 1142 00:54:20,710 --> 00:54:21,410 counting for us. 1143 00:54:21,410 --> 00:54:24,710 Totally doable but tedious, especially if your job is to analyze data. 1144 00:54:24,710 --> 00:54:28,190 My God, even writing 15 lines of code to answer simple questions 1145 00:54:28,190 --> 00:54:29,390 is kind of ridiculous. 1146 00:54:29,390 --> 00:54:31,560 SQL can do better for us. 1147 00:54:31,560 --> 00:54:33,630 So let me go ahead and do this. 1148 00:54:33,630 --> 00:54:38,540 Let me go ahead and select every language and the count 1149 00:54:38,540 --> 00:54:45,650 thereof from the favorites table but this time group by language. 1150 00:54:45,650 --> 00:54:48,440 So this was another one of the keywords that we 1151 00:54:48,440 --> 00:54:52,460 can use in this abbreviated list of extra features of SQL. 1152 00:54:52,460 --> 00:54:55,760 And this one takes a moment to wrap your mind around, 1153 00:54:55,760 --> 00:54:59,510 but this is going to give me a two column temporary table where 1154 00:54:59,510 --> 00:55:02,030 the first column is a language and the second column 1155 00:55:02,030 --> 00:55:05,330 is the count thereof from this data set. 1156 00:55:05,330 --> 00:55:10,010 And group by language just means that only show me Scratch once, 1157 00:55:10,010 --> 00:55:12,350 only show me C once, only show me Python once. 1158 00:55:12,350 --> 00:55:16,130 That is group all of the identical values together, 1159 00:55:16,130 --> 00:55:19,260 but keep track of how many of them there are. 1160 00:55:19,260 --> 00:55:27,200 And so now if I go over to SQLite and I hit Enter, now I have in SQL version 1161 00:55:27,200 --> 00:55:29,720 the exact same output that I had from Python 1162 00:55:29,720 --> 00:55:32,120 that took me, what, 15 plus lines before. 1163 00:55:32,120 --> 00:55:35,810 Now we're down to just one because SQL, Structured Query Language, 1164 00:55:35,810 --> 00:55:39,830 is all about constructing queries like this to answer questions and get back 1165 00:55:39,830 --> 00:55:41,152 answers quickly. 1166 00:55:41,152 --> 00:55:42,860 If we want to clean this up a little bit, 1167 00:55:42,860 --> 00:55:44,960 you asked earlier about sorting order. 1168 00:55:44,960 --> 00:55:46,200 Well, we can do that too. 1169 00:55:46,200 --> 00:55:48,930 There's another key phrase we can use here. 1170 00:55:48,930 --> 00:55:55,080 We can order by the count of those rows and then run that query here. 1171 00:55:55,080 --> 00:55:57,480 So now unfortunately they're from smallest to biggest, 1172 00:55:57,480 --> 00:55:58,675 but we can reverse that. 1173 00:55:58,675 --> 00:56:00,800 It turns out, and my query's starting to wrap here. 1174 00:56:00,800 --> 00:56:02,360 I'll zoom out for a moment. 1175 00:56:02,360 --> 00:56:07,050 If you want to order by count, the default is in ascending order, 1176 00:56:07,050 --> 00:56:10,980 abbreviated A-S-C. If you want to reverse the sort in SQL, 1177 00:56:10,980 --> 00:56:13,920 instead of using reverse equals true like we did in Python, 1178 00:56:13,920 --> 00:56:16,380 you say D-E-S-C for descending order. 1179 00:56:16,380 --> 00:56:19,557 And now we get almost the same output but flipped in reverse. 1180 00:56:19,557 --> 00:56:22,140 So it's just a lot faster to answer questions once, of course, 1181 00:56:22,140 --> 00:56:24,660 you get some muscle memory and some comfort with it. 1182 00:56:24,660 --> 00:56:26,280 Well, what else can I do? 1183 00:56:26,280 --> 00:56:28,890 What if I just care about the most popular language? 1184 00:56:28,890 --> 00:56:31,140 I don't care about the second place or the third place 1185 00:56:31,140 --> 00:56:32,460 languages or anything else. 1186 00:56:32,460 --> 00:56:34,830 Well, let me add one more clause here. 1187 00:56:34,830 --> 00:56:36,750 Limit the answer to one. 1188 00:56:36,750 --> 00:56:39,330 And no matter how many rows should come back now, 1189 00:56:39,330 --> 00:56:41,880 I just get the number one language as of the data 1190 00:56:41,880 --> 00:56:47,870 set we collected with 270 votes for it. 1191 00:56:47,870 --> 00:56:52,060 Questions on this? 1192 00:56:52,060 --> 00:56:55,260 Any questions here? 1193 00:56:55,260 --> 00:56:55,800 No? 1194 00:56:55,800 --> 00:56:58,620 Well, what if we're starting to introduce SQL 1195 00:56:58,620 --> 00:57:02,250 and it was kind of too late to make it into the Google Form? 1196 00:57:02,250 --> 00:57:04,260 So it turns out there's syntax for this too. 1197 00:57:04,260 --> 00:57:05,670 You can create data, of course. 1198 00:57:05,670 --> 00:57:09,240 Not just the tables, but the data therein and here's the typical syntax 1199 00:57:09,240 --> 00:57:11,640 for inserting data into a SQL database. 1200 00:57:11,640 --> 00:57:14,850 You literally say insert into the name of the table. 1201 00:57:14,850 --> 00:57:19,020 And then in parentheses, you specify one or more columns for which you 1202 00:57:19,020 --> 00:57:20,940 have values that you want to insert. 1203 00:57:20,940 --> 00:57:22,950 This is to say you don't have to give values 1204 00:57:22,950 --> 00:57:25,650 for every column in the given row. 1205 00:57:25,650 --> 00:57:27,990 If you only have answers to some of those questions, 1206 00:57:27,990 --> 00:57:29,940 you can enumerate them here like this. 1207 00:57:29,940 --> 00:57:32,010 But the values you insert are going to be these. 1208 00:57:32,010 --> 00:57:35,500 So you literally say after the close parenthesis values. 1209 00:57:35,500 --> 00:57:38,430 And then in a second set of parentheses with a same length 1210 00:57:38,430 --> 00:57:42,370 comma separated list, you specify what values do you want to insert. 1211 00:57:42,370 --> 00:57:43,710 So it's a little verbose. 1212 00:57:43,710 --> 00:57:47,520 And frankly longer term, you're going to use Python code to automatically do 1213 00:57:47,520 --> 00:57:50,010 these kinds of insertions, but let's go ahead and try this. 1214 00:57:50,010 --> 00:57:57,990 Right now if I do select distinct language from favorites, again, 1215 00:57:57,990 --> 00:57:58,833 we see this. 1216 00:57:58,833 --> 00:58:00,000 Just these three candidates. 1217 00:58:00,000 --> 00:58:01,870 But we've now taught you a bit of SQL. 1218 00:58:01,870 --> 00:58:09,120 So let's do insert into favorites the column called language. 1219 00:58:09,120 --> 00:58:10,620 And you know what? 1220 00:58:10,620 --> 00:58:12,150 I'm going to give a problem here. 1221 00:58:12,150 --> 00:58:14,430 The values for which, and let me zoom back out, 1222 00:58:14,430 --> 00:58:17,910 are going to be quote unquote "SQL" and quote, unquote "fiftyville." 1223 00:58:17,910 --> 00:58:20,130 You'll soon see what that's all about. 1224 00:58:20,130 --> 00:58:21,240 Semicolon. 1225 00:58:21,240 --> 00:58:24,580 Nothing seems to happen, but that's usually a good thing. 1226 00:58:24,580 --> 00:58:26,580 And now if I scroll back up in my queries, 1227 00:58:26,580 --> 00:58:30,990 in SQLite3 you can scroll back and forth in time to avoid retyping things, 1228 00:58:30,990 --> 00:58:35,400 now I should see indeed four candidate languages here. 1229 00:58:35,400 --> 00:58:39,150 Now, suppose that you were never really a fan of C and maybe you programmed 1230 00:58:39,150 --> 00:58:42,990 a little bit in high school or in the real world and you liked C++. 1231 00:58:42,990 --> 00:58:50,010 Well, there's a whole lot of answers for C. So select star from favorites 1232 00:58:50,010 --> 00:58:53,430 where language equals quote unquote "C." 1233 00:58:53,430 --> 00:58:58,020 So here's everyone who submitted the answer for C. Let's presume that, no, 1234 00:58:58,020 --> 00:58:59,850 they didn't really want C, they wanted C++, 1235 00:58:59,850 --> 00:59:01,767 which is not a language we teach in the class. 1236 00:59:01,767 --> 00:59:03,910 But I could also now do this. 1237 00:59:03,910 --> 00:59:08,670 You can use the update command to set a column or columns to different values 1238 00:59:08,670 --> 00:59:10,210 where some condition is met. 1239 00:59:10,210 --> 00:59:15,360 So if I do update table name set column name equal to some value 1240 00:59:15,360 --> 00:59:19,330 filtering it perhaps by where some condition is true. 1241 00:59:19,330 --> 00:59:22,890 So suppose I've changed my mind, or you know what, 1242 00:59:22,890 --> 00:59:31,320 let's go ahead and do update favorites set language equal to maybe C++ where 1243 00:59:31,320 --> 00:59:33,870 language equals C. 1244 00:59:33,870 --> 00:59:35,910 Now, this is destructive, so you generally 1245 00:59:35,910 --> 00:59:38,368 don't want to do this unless you have a backup of your data 1246 00:59:38,368 --> 00:59:40,170 too, overriding what people's answers are. 1247 00:59:40,170 --> 00:59:42,753 This seems to have been successful, because no error messages. 1248 00:59:42,753 --> 00:59:45,270 And if I rerun the previous select that gives me 1249 00:59:45,270 --> 00:59:49,410 all of the favorites where language equals C, now indeed I get none. 1250 00:59:49,410 --> 00:59:53,127 But if I search for C++, now I get a lot. 1251 00:59:53,127 --> 00:59:55,710 And if I get rid of that where clause altogether and just look 1252 00:59:55,710 --> 01:00:00,330 at the contents of my database, now you see that indeed C++ is comingled with 1253 01:00:00,330 --> 01:00:01,140 all the other data. 1254 01:00:01,140 --> 01:00:04,270 This is not what you all intended, of course, so I can undo this. 1255 01:00:04,270 --> 01:00:06,150 Let me go ahead and undo what I just did. 1256 01:00:06,150 --> 01:00:12,570 Let me set my favorite language to C where language equals C++. 1257 01:00:12,570 --> 01:00:14,460 But the predicate is important. 1258 01:00:14,460 --> 01:00:16,230 This I'm not going to do. 1259 01:00:16,230 --> 01:00:20,190 What if I accidentally omitted this predicate, the where clause? 1260 01:00:20,190 --> 01:00:22,545 How would that screw things up might you think? 1261 01:00:22,545 --> 01:00:26,160 1262 01:00:26,160 --> 01:00:26,970 Yeah, in the back. 1263 01:00:26,970 --> 01:00:29,110 AUDIENCE: [INAUDIBLE] 1264 01:00:29,110 --> 01:00:32,590 DAVID MALAN: It would set every row's language to indeed C. 1265 01:00:32,590 --> 01:00:34,060 And this is dangerous. 1266 01:00:34,060 --> 01:00:37,720 And if you start googling around for SQL mistakes or the like, 1267 01:00:37,720 --> 01:00:41,590 people in the real world have accidentally run commands like this. 1268 01:00:41,590 --> 01:00:43,510 And without naming names, a former member 1269 01:00:43,510 --> 01:00:47,200 of our teaching staff at one point accidentally ran a command like this 1270 01:00:47,200 --> 01:00:51,610 and changed every student's name in our database to Bobby I think it was. 1271 01:00:51,610 --> 01:00:55,120 The same name for every row because they simply forgot a predicate. 1272 01:00:55,120 --> 01:00:58,240 So here too there's dangers in code, and you 1273 01:00:58,240 --> 01:01:00,730 should adopt the habit quite quickly of always, 1274 01:01:00,730 --> 01:01:03,730 one, backing up your data like with CP, for instance, 1275 01:01:03,730 --> 01:01:09,010 in Linux or any other technique or just making sure before you hit Enter that, 1276 01:01:09,010 --> 01:01:11,322 yes, this is indeed the query I want to execute. 1277 01:01:11,322 --> 01:01:13,030 And generally speaking in the real world, 1278 01:01:13,030 --> 01:01:14,890 there should be process controls in place. 1279 01:01:14,890 --> 01:01:17,770 Like the intern should not have access to the production 1280 01:01:17,770 --> 01:01:19,570 database, the live database, and the like. 1281 01:01:19,570 --> 01:01:21,860 But you have a lot of power now with these queries. 1282 01:01:21,860 --> 01:01:25,520 So just be all the more careful, because very easily can you do bad things. 1283 01:01:25,520 --> 01:01:26,660 So let me undo this. 1284 01:01:26,660 --> 01:01:30,790 Where language equals quote unquote "C++." 1285 01:01:30,790 --> 01:01:31,840 And I'll zoom back out. 1286 01:01:31,840 --> 01:01:32,470 Enter. 1287 01:01:32,470 --> 01:01:34,480 And now I think we're back in business. 1288 01:01:34,480 --> 01:01:36,100 C is among the answers. 1289 01:01:36,100 --> 01:01:37,520 Yeah? 1290 01:01:37,520 --> 01:01:40,340 AUDIENCE: [INAUDIBLE] 1291 01:01:40,340 --> 01:01:42,293 1292 01:01:42,293 --> 01:01:44,460 DAVID MALAN: It's essentially doing what at the end? 1293 01:01:44,460 --> 01:01:45,270 AUDIENCE: [INAUDIBLE] 1294 01:01:45,270 --> 01:01:46,680 DAVID MALAN: It's essentially find and replace. 1295 01:01:46,680 --> 01:01:47,180 Yes. 1296 01:01:47,180 --> 01:01:51,060 In layperson's terms, this is find and replace implemented with SQL. 1297 01:01:51,060 --> 01:01:55,020 And in fact, the authors of Microsoft Word or Google Docs 1298 01:01:55,020 --> 01:01:58,290 might very well be using language like this SQL 1299 01:01:58,290 --> 01:02:01,710 when you go to the nice graphical user friendly find and replace box. 1300 01:02:01,710 --> 01:02:05,010 This may very well be what they're doing underneath the hood or, of course, 1301 01:02:05,010 --> 01:02:07,770 they could be using some other language altogether. 1302 01:02:07,770 --> 01:02:10,530 There's one last syntax that's worth knowing, 1303 01:02:10,530 --> 01:02:14,280 delete, which for better or for worse is even more destructive whereby it 1304 01:02:14,280 --> 01:02:16,590 allows you to delete rows from tables. 1305 01:02:16,590 --> 01:02:20,130 It's distinct from drop, which lets you delete tables themselves. 1306 01:02:20,130 --> 01:02:21,630 This focuses on rows. 1307 01:02:21,630 --> 01:02:26,250 So suppose that you really, really didn't like, 1308 01:02:26,250 --> 01:02:30,000 let's say, Tideman was a little challenging if you 1309 01:02:30,000 --> 01:02:31,840 tackled that more comfortable problem. 1310 01:02:31,840 --> 01:02:34,800 So if you really don't want to even think about Tideman anymore, 1311 01:02:34,800 --> 01:02:40,080 so why don't we do delete from favorites where problem equals, and I 1312 01:02:40,080 --> 01:02:42,240 won't execute it for real, Tideman. 1313 01:02:42,240 --> 01:02:45,360 This would have the effect of deleting every row, including 1314 01:02:45,360 --> 01:02:50,010 the language therein and the timestamp, where the student answered Tideman. 1315 01:02:50,010 --> 01:02:53,910 Worse than this would be this. 1316 01:02:53,910 --> 01:02:55,125 Why might this be bad? 1317 01:02:55,125 --> 01:02:58,560 1318 01:02:58,560 --> 01:03:00,930 OK, chuckling because there's no predicate. 1319 01:03:00,930 --> 01:03:04,960 There's no filter, which means literally this would delete all of the data. 1320 01:03:04,960 --> 01:03:07,740 So again, with great power here comes great responsibility. 1321 01:03:07,740 --> 01:03:12,390 Now, this has just been a data set of 430 rows by us dynamically created. 1322 01:03:12,390 --> 01:03:15,240 There's, of course, some really juicy data sets in the real world. 1323 01:03:15,240 --> 01:03:17,990 And one website you might have heard or an app you might have used 1324 01:03:17,990 --> 01:03:20,460 is IMDb, the Internet Movie Database, which wonderfully 1325 01:03:20,460 --> 01:03:24,330 makes some of their data available for download as CSV files 1326 01:03:24,330 --> 01:03:27,420 or technically TSV files, Tab Separated Values. 1327 01:03:27,420 --> 01:03:31,830 But what we did in advance of class was download some of that data for both TV 1328 01:03:31,830 --> 01:03:34,170 shows in the real world and movies in the real world. 1329 01:03:34,170 --> 01:03:35,962 And what's wonderful about this data set is 1330 01:03:35,962 --> 01:03:39,630 it's not just dozens or hundreds or even thousands of lines. 1331 01:03:39,630 --> 01:03:44,190 There are millions of rows of juicy data, TV shows and movies 1332 01:03:44,190 --> 01:03:47,190 with which most folks are probably familiar, at least with a subset. 1333 01:03:47,190 --> 01:03:50,580 And we'll see in just a little bit that this data 1334 01:03:50,580 --> 01:03:54,600 comes in the form of now six different tables that we've given you. 1335 01:03:54,600 --> 01:03:56,610 And the tables in question for today are going 1336 01:03:56,610 --> 01:04:00,735 to be the people in the TV business, the stars therein, the shows that people 1337 01:04:00,735 --> 01:04:01,860 are producing and the like. 1338 01:04:01,860 --> 01:04:04,860 This is a picture we'll revisit to enable 1339 01:04:04,860 --> 01:04:07,110 you to wrap your minds around what the actual data is. 1340 01:04:07,110 --> 01:04:09,948 This feels like a good opportunity though for a snack. 1341 01:04:09,948 --> 01:04:12,990 In fact, in just a moment, we have a whole lot of Rice Krispie treats out 1342 01:04:12,990 --> 01:04:13,690 in the lobby. 1343 01:04:13,690 --> 01:04:18,630 But if folks could perhaps acknowledge this mini wedding cake here. 1344 01:04:18,630 --> 01:04:21,640 CS50 zone Carter Zenke is getting married this week. 1345 01:04:21,640 --> 01:04:28,880 So congratulations to Carter as well. 1346 01:04:28,880 --> 01:04:30,550 Congrats. 1347 01:04:30,550 --> 01:04:31,940 All right. 1348 01:04:31,940 --> 01:04:34,440 [APPLAUSE] 1349 01:04:34,440 --> 01:04:38,115 1350 01:04:38,115 --> 01:04:40,952 There's only one piece of cake in that box but a lot of Rice Krispie 1351 01:04:40,952 --> 01:04:41,910 treats in the transept. 1352 01:04:41,910 --> 01:04:45,420 Let's take 10 minutes and we'll be back with Internet Movie Database in 10. 1353 01:04:45,420 --> 01:04:46,470 All right. 1354 01:04:46,470 --> 01:04:47,320 We are back. 1355 01:04:47,320 --> 01:04:51,000 So if you've never been, you can actually go to imdb.com right now 1356 01:04:51,000 --> 01:04:53,020 and play around or download the mobile app. 1357 01:04:53,020 --> 01:04:56,580 And it's just big database of a lot of TV shows and movies 1358 01:04:56,580 --> 01:04:58,090 and actors and the like. 1359 01:04:58,090 --> 01:05:00,828 But what indeed is nice is you can download some of that data. 1360 01:05:00,828 --> 01:05:02,370 And that's what I've done in advance. 1361 01:05:02,370 --> 01:05:05,160 And what we've done is we wrote some Python code 1362 01:05:05,160 --> 01:05:09,540 to convert some of the flat file databases that they let you download 1363 01:05:09,540 --> 01:05:13,890 and we converted it into a SQL database with six tables. 1364 01:05:13,890 --> 01:05:17,940 So not just one but six that ultimately are these here. 1365 01:05:17,940 --> 01:05:21,090 And let me just help you wrap your minds around what this picture is, which 1366 01:05:21,090 --> 01:05:25,410 is an entity relationship diagram, which is just to say each of these boxes 1367 01:05:25,410 --> 01:05:27,090 on the screen represents a table. 1368 01:05:27,090 --> 01:05:31,230 And each of the arrows or edges represents some kind of relationship 1369 01:05:31,230 --> 01:05:32,100 across the tables. 1370 01:05:32,100 --> 01:05:34,200 Because up until now, the only data we had 1371 01:05:34,200 --> 01:05:36,270 were those three columns in the favorites table. 1372 01:05:36,270 --> 01:05:39,130 But what gets really useful about SQL databases, 1373 01:05:39,130 --> 01:05:41,700 just like a Google Spreadsheet or an Excel file, 1374 01:05:41,700 --> 01:05:45,300 is you can have multiple sheets or in a database multiple tables. 1375 01:05:45,300 --> 01:05:50,010 And so what we're about to see is that in the IMDb database for TV shows, 1376 01:05:50,010 --> 01:05:53,820 there's going to be a dedicated table for all the people in the TV business. 1377 01:05:53,820 --> 01:05:57,210 There's going to be a dedicated table for all of the TV shows 1378 01:05:57,210 --> 01:05:59,130 that are in their database as of right now. 1379 01:05:59,130 --> 01:06:03,180 There's going to be a dedicated table for writers in that industry, 1380 01:06:03,180 --> 01:06:08,580 for the ratings of shows, for the genres to which shows belong, 1381 01:06:08,580 --> 01:06:09,700 comedy and the like. 1382 01:06:09,700 --> 01:06:12,310 And then lastly, there's going to be this table, 1383 01:06:12,310 --> 01:06:17,700 which somehow associates people with the TV shows that they star in 1384 01:06:17,700 --> 01:06:19,030 and vice versa. 1385 01:06:19,030 --> 01:06:22,290 And so let's consider first what this looks like in code. 1386 01:06:22,290 --> 01:06:25,110 And we'll see that it's going to overwhelm intentionally at first, 1387 01:06:25,110 --> 01:06:26,200 but I'm going to do this. 1388 01:06:26,200 --> 01:06:27,690 I'm going to go back to my terminal window. 1389 01:06:27,690 --> 01:06:30,240 And during the break, I downloaded from the course's website 1390 01:06:30,240 --> 01:06:33,960 a file called shows.db, which we made in advance for you. 1391 01:06:33,960 --> 01:06:37,440 And if I type ls, I'll see all of my favorites files from before. 1392 01:06:37,440 --> 01:06:40,020 The CSV, the DB, and the Python file. 1393 01:06:40,020 --> 01:06:42,390 But now they're shows.db. 1394 01:06:42,390 --> 01:06:45,540 So I'm going to go ahead in my full screen terminal window here. 1395 01:06:45,540 --> 01:06:47,850 I'm not using actual tabs or code files. 1396 01:06:47,850 --> 01:06:52,380 Now I'm going to run sqlite3 on the file called shows.db. 1397 01:06:52,380 --> 01:06:54,900 And I'm just going to see this version information here. 1398 01:06:54,900 --> 01:06:57,120 Let me clear my screen and run the one command 1399 01:06:57,120 --> 01:07:01,170 I ran earlier to show us the schema of the favorites database. 1400 01:07:01,170 --> 01:07:03,947 Now we'll see the schema for the shows database. 1401 01:07:03,947 --> 01:07:06,030 And there's a lot going on here, but let me scroll 1402 01:07:06,030 --> 01:07:08,070 back up to the very top, the beginning. 1403 01:07:08,070 --> 01:07:10,570 And we see this here. 1404 01:07:10,570 --> 01:07:15,690 So when I run .schema, we see a dump, really, 1405 01:07:15,690 --> 01:07:20,310 of all of the SQL create table commands that were run in order to create this 1406 01:07:20,310 --> 01:07:21,330 database for you. 1407 01:07:21,330 --> 01:07:24,420 And one of those tables is called genres and another people, 1408 01:07:24,420 --> 01:07:27,370 ratings, shows, stars, and so forth. 1409 01:07:27,370 --> 01:07:29,340 And the columns therein, even though it's 1410 01:07:29,340 --> 01:07:32,670 formatted a little more prettily than the automatically generated create 1411 01:07:32,670 --> 01:07:37,680 table statement for favorites whereby we have one column per line of output 1412 01:07:37,680 --> 01:07:41,520 here in the, for instance, people table, there's 1413 01:07:41,520 --> 01:07:45,900 going to be an ID column, like a unique identifier like a Harvard ID, a Yale 1414 01:07:45,900 --> 01:07:50,730 ID or the like, a name column, a birth year, and then some other stuff. 1415 01:07:50,730 --> 01:07:53,760 If I scroll down to shows, every show in the world 1416 01:07:53,760 --> 01:07:58,410 is going to have a unique ID as well, a title of course, the year in which it 1417 01:07:58,410 --> 01:08:00,390 debuted, and the total number of episodes 1418 01:08:00,390 --> 01:08:02,370 as of the time we downloaded the data. 1419 01:08:02,370 --> 01:08:04,080 And then what else is there? 1420 01:08:04,080 --> 01:08:08,170 Some of these are a little less obvious like ratings here. 1421 01:08:08,170 --> 01:08:10,830 So ratings don't have an ID column, but they 1422 01:08:10,830 --> 01:08:15,038 have a show ID column and a rating like on a five point scale or a 10 point 1423 01:08:15,038 --> 01:08:16,830 scale or the like and then the total number 1424 01:08:16,830 --> 01:08:19,319 of votes that were collected to contribute to that rating. 1425 01:08:19,319 --> 01:08:23,680 IMDb allows people to upvote and downvote shows and movies and the like. 1426 01:08:23,680 --> 01:08:26,380 And then similarly is genre structured. 1427 01:08:26,380 --> 01:08:29,290 There's a show ID and then there's a genre, 1428 01:08:29,290 --> 01:08:33,370 which is going to be an English word like comedy or drama or something else. 1429 01:08:33,370 --> 01:08:34,200 And then what else? 1430 01:08:34,200 --> 01:08:38,609 Let's go a little further at the bottom here for stars and writers. 1431 01:08:38,609 --> 01:08:41,729 If we go to the very bottom here, stars and writers 1432 01:08:41,729 --> 01:08:43,470 are similarly structured too. 1433 01:08:43,470 --> 01:08:46,890 They have a show ID and a person ID. 1434 01:08:46,890 --> 01:08:48,540 So show and person. 1435 01:08:48,540 --> 01:08:52,470 And then this writers table has a show ID and a person ID. 1436 01:08:52,470 --> 01:08:56,340 And there's a whole lot of other words that we'll come to in just a moment. 1437 01:08:56,340 --> 01:08:58,330 But what is this code hinting at? 1438 01:08:58,330 --> 01:09:01,000 Well, if I go back to the picture from earlier here, 1439 01:09:01,000 --> 01:09:05,160 you'll see that this picture captures the relationships 1440 01:09:05,160 --> 01:09:07,020 among these various tables. 1441 01:09:07,020 --> 01:09:11,670 So for instance, if we focus on shows for just a moment, a show, 1442 01:09:11,670 --> 01:09:15,359 again, has a unique ID, a title, a year in which it debuted, 1443 01:09:15,359 --> 01:09:16,859 and a total number of episodes. 1444 01:09:16,859 --> 01:09:20,550 If you want to figure out what genre or genres 1445 01:09:20,550 --> 01:09:24,300 a show belongs to, because some shows are just comedies, 1446 01:09:24,300 --> 01:09:29,010 some shows are just dramas, but some shows are arguably comedies and dramas 1447 01:09:29,010 --> 01:09:30,700 depending on the episode or the like. 1448 01:09:30,700 --> 01:09:34,740 So you can imagine wanting to associate two or three or even more genres 1449 01:09:34,740 --> 01:09:35,640 with a show. 1450 01:09:35,640 --> 01:09:39,830 This line here in this second table allows us to do that. 1451 01:09:39,830 --> 01:09:46,310 Every row in the genres table we'll see has two items, a show ID 1452 01:09:46,310 --> 01:09:49,250 which relates to the ID of a show. 1453 01:09:49,250 --> 01:09:54,350 And that's why these lines literally line up with that specific column name. 1454 01:09:54,350 --> 01:09:57,050 And a genre, which is going to be like, quote unquote, "comedy," 1455 01:09:57,050 --> 01:09:59,250 quote unquote "drama," or something else. 1456 01:09:59,250 --> 01:10:02,180 Now with that said, design question. 1457 01:10:02,180 --> 01:10:07,250 Why have we deliberately not just gotten rid of this genres table 1458 01:10:07,250 --> 01:10:12,950 and made our lives simpler by just adding a genre column to this show's 1459 01:10:12,950 --> 01:10:13,640 table? 1460 01:10:13,640 --> 01:10:17,150 And again, a table is just like a sheet with rows and columns. 1461 01:10:17,150 --> 01:10:21,140 At the moment, shows only have four columns, ID, title, year, episodes. 1462 01:10:21,140 --> 01:10:28,980 Why not just add a fifth column called genre and put the show's genre there? 1463 01:10:28,980 --> 01:10:32,200 Any intuition here? 1464 01:10:32,200 --> 01:10:34,960 Why not just keep things simple? 1465 01:10:34,960 --> 01:10:36,286 Yeah, in back. 1466 01:10:36,286 --> 01:10:42,035 AUDIENCE: [INAUDIBLE] 1467 01:10:42,035 --> 01:10:42,910 DAVID MALAN: Exactly. 1468 01:10:42,910 --> 01:10:46,670 If you add a fifth column here and call it genre, 1469 01:10:46,670 --> 01:10:49,900 then you have to pick a genre specifically. 1470 01:10:49,900 --> 01:10:55,840 You have to put in that cell presumably comedy or drama or musical or something 1471 01:10:55,840 --> 01:10:56,480 else. 1472 01:10:56,480 --> 01:10:58,270 Now, you could write multiple words in the cell, 1473 01:10:58,270 --> 01:11:00,645 but generally speaking, that would be sloppy, bad design. 1474 01:11:00,645 --> 01:11:02,492 Like every cell just like in a spreadsheet 1475 01:11:02,492 --> 01:11:03,700 should really have one value. 1476 01:11:03,700 --> 01:11:07,270 It might have multiple words, but it shouldn't be a weirdly comma separated 1477 01:11:07,270 --> 01:11:08,530 list of multiple things. 1478 01:11:08,530 --> 01:11:10,655 It should just be in a different cell in that case. 1479 01:11:10,655 --> 01:11:14,320 So if you instead were to design this with just a single column called genre, 1480 01:11:14,320 --> 01:11:16,510 you're imposing what a computer scientist 1481 01:11:16,510 --> 01:11:19,480 would call a one to one relationship. 1482 01:11:19,480 --> 01:11:21,580 Every show has one genre. 1483 01:11:21,580 --> 01:11:23,800 And that's not necessarily a good thing. 1484 01:11:23,800 --> 01:11:26,170 Or strictly speaking, it would be a many to one, 1485 01:11:26,170 --> 01:11:28,510 because the same genre could belong to multiple shows, 1486 01:11:28,510 --> 01:11:31,760 but each show could only have one genre in that case. 1487 01:11:31,760 --> 01:11:33,760 What a relational database allows you to do, 1488 01:11:33,760 --> 01:11:35,890 and relational is indeed the operative word, 1489 01:11:35,890 --> 01:11:39,100 it allows you to factor out some of your information 1490 01:11:39,100 --> 01:11:44,590 and then have maybe one show here in one row but then in this genres table, 1491 01:11:44,590 --> 01:11:47,320 you could have one row for that one show genre, 1492 01:11:47,320 --> 01:11:52,450 or you could have two rows in the genres table for comedy and for drama. 1493 01:11:52,450 --> 01:11:55,490 Or if it has a third genre, you could just add another row here. 1494 01:11:55,490 --> 01:11:59,380 So you still have one row for the show itself 1495 01:11:59,380 --> 01:12:02,890 with all the juiciest details but a variable number of rows 1496 01:12:02,890 --> 01:12:05,950 by having this relationship with another table. 1497 01:12:05,950 --> 01:12:10,000 Meanwhile, ratings work the same way, at least in this case. 1498 01:12:10,000 --> 01:12:12,882 A show has ID, title, year, and episodes. 1499 01:12:12,882 --> 01:12:14,590 But if you want to figure out its rating, 1500 01:12:14,590 --> 01:12:17,710 you have to follow the arrow here, so to speak, and look up 1501 01:12:17,710 --> 01:12:20,020 the corresponding show ID in this table. 1502 01:12:20,020 --> 01:12:23,750 Find the rating of that show and the total number of ratings. 1503 01:12:23,750 --> 01:12:26,470 So that's been factored out too, for better or for worse. 1504 01:12:26,470 --> 01:12:28,600 Now let's consider people. 1505 01:12:28,600 --> 01:12:32,050 People have just three columns, ID, name, and birth. 1506 01:12:32,050 --> 01:12:35,350 But there's no mention of the TV show in which people have starred 1507 01:12:35,350 --> 01:12:38,140 or the TV shows that a person has written. 1508 01:12:38,140 --> 01:12:39,200 Well, why is that? 1509 01:12:39,200 --> 01:12:42,730 Well, if you just had a fourth column here called show, 1510 01:12:42,730 --> 01:12:45,560 well, you would have to decide what show is that person in. 1511 01:12:45,560 --> 01:12:47,560 And no one could ever act again in another show, 1512 01:12:47,560 --> 01:12:49,310 because there's no room to store the data. 1513 01:12:49,310 --> 01:12:53,560 But if someone, of course, a popular actor can star in multiple shows, 1514 01:12:53,560 --> 01:12:57,790 well, we could have one ID for that person, one name, one birth year, 1515 01:12:57,790 --> 01:12:58,360 obviously. 1516 01:12:58,360 --> 01:13:02,230 Like there's only one Steve Carell as an actor in the world of people. 1517 01:13:02,230 --> 01:13:06,610 But Steve Carell in this example could have his person ID, 1518 01:13:06,610 --> 01:13:09,220 whatever his Harvard ID equivalent, Yale ID equivalent is, 1519 01:13:09,220 --> 01:13:12,130 appear in multiple rows in this table so that it 1520 01:13:12,130 --> 01:13:14,470 can be associated with multiple shows. 1521 01:13:14,470 --> 01:13:18,830 And this allows you to create what's called a one to many relationship, 1522 01:13:18,830 --> 01:13:20,500 or technically it's bidirectional. 1523 01:13:20,500 --> 01:13:22,450 It's a many to many relationship. 1524 01:13:22,450 --> 01:13:23,120 Why? 1525 01:13:23,120 --> 01:13:26,470 Well, one show can certainly have multiple people in it 1526 01:13:26,470 --> 01:13:29,090 and multiple people writing for it, just in the real world. 1527 01:13:29,090 --> 01:13:33,910 But conversely, one person could certainly act in multiple shows 1528 01:13:33,910 --> 01:13:35,330 or write multiple shows. 1529 01:13:35,330 --> 01:13:38,380 So this is what you get with relational databases. 1530 01:13:38,380 --> 01:13:41,440 You put your sort of canonical data for people in one place, 1531 01:13:41,440 --> 01:13:46,150 for shows in another place, and then you use these additional tables 1532 01:13:46,150 --> 01:13:48,993 to relate one thing to another. 1533 01:13:48,993 --> 01:13:50,410 So we won't dwell on the pictures. 1534 01:13:50,410 --> 01:13:53,350 That's just if you sort of can wrap your mind around the data 1535 01:13:53,350 --> 01:13:55,870 set better that way, that's one way of thinking about it. 1536 01:13:55,870 --> 01:14:00,177 But recall that the code we just saw for the schema, again, escalated quickly. 1537 01:14:00,177 --> 01:14:02,260 There's a lot of keywords I haven't mentioned yet. 1538 01:14:02,260 --> 01:14:04,360 But some of these are perhaps familiar. 1539 01:14:04,360 --> 01:14:06,130 They're capitalized differently here. 1540 01:14:06,130 --> 01:14:08,140 But integer is on the list here. 1541 01:14:08,140 --> 01:14:10,790 Null is on the list, albeit technically not null. 1542 01:14:10,790 --> 01:14:12,760 So let's tease apart some of these keywords 1543 01:14:12,760 --> 01:14:15,218 and consider what they're actually doing for your database, 1544 01:14:15,218 --> 01:14:18,640 because now we're exploring features that do not exist 1545 01:14:18,640 --> 01:14:20,980 in the world of spreadsheets alone. 1546 01:14:20,980 --> 01:14:24,190 So it turns out in a SQL database, specifically 1547 01:14:24,190 --> 01:14:27,265 SQLite which is the version of SQL we use in CS50 1548 01:14:27,265 --> 01:14:30,910 and which is commonly used for things like mobile applications nowadays. 1549 01:14:30,910 --> 01:14:32,560 It's like a lightweight version of SQL. 1550 01:14:32,560 --> 01:14:35,590 It's when you aren't trying to run Twitter and have billions 1551 01:14:35,590 --> 01:14:37,150 and billions of rows necessarily. 1552 01:14:37,150 --> 01:14:40,730 You've got hundreds, thousands, tens of thousands, maybe even a few million, 1553 01:14:40,730 --> 01:14:44,590 but not crazy numbers, crazy amounts of data. 1554 01:14:44,590 --> 01:14:49,250 In the world of SQLite specifically, there's these five data types. 1555 01:14:49,250 --> 01:14:52,390 So just like in C, we had int and char and the like. 1556 01:14:52,390 --> 01:14:54,190 In SQL, we have these. 1557 01:14:54,190 --> 01:14:57,447 Blob, which is kind of funny, but it just means binary large objects. 1558 01:14:57,447 --> 01:14:58,780 So it's like a binary data type. 1559 01:14:58,780 --> 01:15:02,830 Zeros and ones that aren't necessarily fitting into the other categories. 1560 01:15:02,830 --> 01:15:05,290 Integer, which of course, is an integer as we know it. 1561 01:15:05,290 --> 01:15:09,940 Numeric, which is kind of a catchall for numbers that are formatted specially. 1562 01:15:09,940 --> 01:15:14,710 So like a date would be like year, year, year, year, dash month, month, 1563 01:15:14,710 --> 01:15:16,390 dash day, day. 1564 01:15:16,390 --> 01:15:18,015 And this is actually a wonderful thing. 1565 01:15:18,015 --> 01:15:19,723 Depending on the country you're from, you 1566 01:15:19,723 --> 01:15:22,720 might think your date system in your country is great or it's horrible. 1567 01:15:22,720 --> 01:15:27,040 The US system is horrible because we have month, day, and then 1568 01:15:27,040 --> 01:15:29,350 year, which is impossible to sort. 1569 01:15:29,350 --> 01:15:31,930 It is the wrong way objectively to store data. 1570 01:15:31,930 --> 01:15:33,803 And yet here we are using this at scale. 1571 01:15:33,803 --> 01:15:35,470 Other countries have gotten this better. 1572 01:15:35,470 --> 01:15:38,383 Numeric in SQL itself standardizes that stuff. 1573 01:15:38,383 --> 01:15:40,300 So it doesn't matter what country you're from. 1574 01:15:40,300 --> 01:15:43,200 You're storing your data in this particular way for instance. 1575 01:15:43,200 --> 01:15:46,560 Times are standardized and other types of numeric data as well. 1576 01:15:46,560 --> 01:15:48,630 Real is synonymous with flow. 1577 01:15:48,630 --> 01:15:52,410 So something with a decimal point and some number of digits thereafter. 1578 01:15:52,410 --> 01:15:55,530 And then text is just for strings and the like. 1579 01:15:55,530 --> 01:16:00,240 With other even fancier databases like MySQL, PostgreSQL, Oracle, 1580 01:16:00,240 --> 01:16:02,370 and other products you might have heard of, 1581 01:16:02,370 --> 01:16:04,170 there's even more data types where you have 1582 01:16:04,170 --> 01:16:05,712 to make even finer grained decisions. 1583 01:16:05,712 --> 01:16:07,950 But for SQLite, it's indeed pretty lightweight 1584 01:16:07,950 --> 01:16:12,120 and you or we just have to decide the data types for each column in a table. 1585 01:16:12,120 --> 01:16:15,000 But there's these additional constraints in the world of SQL. 1586 01:16:15,000 --> 01:16:21,400 You can additionally say that cells in this column may or may not be null. 1587 01:16:21,400 --> 01:16:24,660 So if you want to protect yourself from yourself so you don't screw up 1588 01:16:24,660 --> 01:16:26,880 and insert a null, that is a blank value, 1589 01:16:26,880 --> 01:16:31,090 you can explicitly design a table to have a column that cannot be null. 1590 01:16:31,090 --> 01:16:33,090 And so in fact, someone came up during the break 1591 01:16:33,090 --> 01:16:36,450 to ask me about my having manually inserted SQL, 1592 01:16:36,450 --> 01:16:38,923 quote unquote "SQL," into our favorites database. 1593 01:16:38,923 --> 01:16:40,590 You might recall that I kind of cheated. 1594 01:16:40,590 --> 01:16:44,340 I just inserted "SQL" quote unquote and "fiftyville," 1595 01:16:44,340 --> 01:16:46,050 the name of a new problem, quote unquote. 1596 01:16:46,050 --> 01:16:49,170 But what did I not insert into the database? 1597 01:16:49,170 --> 01:16:50,100 A timestamp. 1598 01:16:50,100 --> 01:16:50,880 And I could have. 1599 01:16:50,880 --> 01:16:54,240 I could have put the current day and time a few minutes ago, but I didn't. 1600 01:16:54,240 --> 01:16:58,590 And that's fine if it's acceptable to you and the product you're building. 1601 01:16:58,590 --> 01:17:00,300 But I could have prevented that. 1602 01:17:00,300 --> 01:17:05,100 If we had defined the table to have a timestamp column that isn't just text 1603 01:17:05,100 --> 01:17:08,490 but it's text that's not null, SQL would have complained 1604 01:17:08,490 --> 01:17:10,650 and would not have let me complete that insertion. 1605 01:17:10,650 --> 01:17:13,567 So there's these kinds of built in defenses that you don't necessarily 1606 01:17:13,567 --> 01:17:15,180 get with a spreadsheet alone. 1607 01:17:15,180 --> 01:17:17,170 And unique means exactly that. 1608 01:17:17,170 --> 01:17:20,190 If you want to make sure that every row in that column 1609 01:17:20,190 --> 01:17:24,480 is unique, maybe for email addresses or in the US Social Security 1610 01:17:24,480 --> 01:17:26,190 numbers or anything that you want to make 1611 01:17:26,190 --> 01:17:30,330 sure you don't have two versions of, you can specify that the column is unique. 1612 01:17:30,330 --> 01:17:32,170 And there's other such constraints as well. 1613 01:17:32,170 --> 01:17:34,140 But again, this is just a list of features 1614 01:17:34,140 --> 01:17:37,170 that you get from a proper relational database. 1615 01:17:37,170 --> 01:17:39,990 But perhaps the most intellectually interesting one 1616 01:17:39,990 --> 01:17:42,420 and the most powerful one is what's called 1617 01:17:42,420 --> 01:17:45,000 here a primary key and a foreign key. 1618 01:17:45,000 --> 01:17:47,640 And let me go back now to this output. 1619 01:17:47,640 --> 01:17:51,970 If we look at shows, you'll see that a show, again, 1620 01:17:51,970 --> 01:17:55,320 has an ID, a title, a year, and a number of episodes. 1621 01:17:55,320 --> 01:17:57,120 And now the data types might make sense. 1622 01:17:57,120 --> 01:17:59,820 The ID it turns out, just like a Harvard ID, a Yale ID, 1623 01:17:59,820 --> 01:18:01,150 is going to be an integer. 1624 01:18:01,150 --> 01:18:02,220 So a simple number. 1625 01:18:02,220 --> 01:18:04,680 The title, of course, is going to be text but not null. 1626 01:18:04,680 --> 01:18:07,470 It would be weird if a TV show had no name. 1627 01:18:07,470 --> 01:18:08,615 That can't be. 1628 01:18:08,615 --> 01:18:10,990 The whole world would break or your TV Guide and whatnot. 1629 01:18:10,990 --> 01:18:13,590 So that makes sense there to say not null. 1630 01:18:13,590 --> 01:18:14,670 Year is numeric. 1631 01:18:14,670 --> 01:18:15,990 So it's a standardized form. 1632 01:18:15,990 --> 01:18:19,140 Episodes is an integer, like how many episodes have been produced. 1633 01:18:19,140 --> 01:18:20,790 And then lastly, notice this. 1634 01:18:20,790 --> 01:18:24,090 The primary key of the show's table is apparently 1635 01:18:24,090 --> 01:18:27,880 the column called ID mentioned a few lines earlier. 1636 01:18:27,880 --> 01:18:32,400 This just means that the database will use the ID 1637 01:18:32,400 --> 01:18:34,740 column as the unique identifier. 1638 01:18:34,740 --> 01:18:37,770 So it's similar to the unique keyword, but primary key just 1639 01:18:37,770 --> 01:18:40,260 means the database is going to treat it as special too 1640 01:18:40,260 --> 01:18:43,380 and make sure that it is uniquely identifying your data. 1641 01:18:43,380 --> 01:18:45,300 But what's interesting is this. 1642 01:18:45,300 --> 01:18:48,480 Notice if I scroll back up to people, people 1643 01:18:48,480 --> 01:18:51,480 were sort of similarly structured but with different attributes. 1644 01:18:51,480 --> 01:18:56,310 Like up here we had a person has an ID, a name, a birth 1645 01:18:56,310 --> 01:18:58,800 year, and a primary key of ID. 1646 01:18:58,800 --> 01:19:01,080 So a ID is, again, integer. 1647 01:19:01,080 --> 01:19:03,600 Name is text but not null, because it'd be 1648 01:19:03,600 --> 01:19:06,420 weird to have a human with absolutely no name textually. 1649 01:19:06,420 --> 01:19:07,860 Birth is going to be numeric. 1650 01:19:07,860 --> 01:19:11,590 But the primary key of people is ID as well. 1651 01:19:11,590 --> 01:19:15,150 So those are the unique columns that the database will just treat special. 1652 01:19:15,150 --> 01:19:15,810 Why? 1653 01:19:15,810 --> 01:19:17,310 Well, we just looked at shows. 1654 01:19:17,310 --> 01:19:18,900 We just looked at people. 1655 01:19:18,900 --> 01:19:22,380 Let's focus now on this one down here, stars. 1656 01:19:22,380 --> 01:19:25,260 How do you determine who stars in a TV show? 1657 01:19:25,260 --> 01:19:26,370 Well, we had two columns. 1658 01:19:26,370 --> 01:19:28,320 The show ID and the person ID. 1659 01:19:28,320 --> 01:19:31,080 This is the incarnation of a many to many relationship. 1660 01:19:31,080 --> 01:19:32,550 One person could be in many shows. 1661 01:19:32,550 --> 01:19:35,710 One show could certainly have many people in it or writing for it. 1662 01:19:35,710 --> 01:19:37,250 But notice this. 1663 01:19:37,250 --> 01:19:41,220 Within this table of two columns, show ID and person ID, 1664 01:19:41,220 --> 01:19:44,070 there's what's going to be called a foreign key called 1665 01:19:44,070 --> 01:19:49,350 show ID that references the show's table's ID column 1666 01:19:49,350 --> 01:19:51,367 and then another foreign key called person ID, 1667 01:19:51,367 --> 01:19:53,700 though I could call these things in parentheses anything 1668 01:19:53,700 --> 01:19:57,495 I want, that references the people table's ID column. 1669 01:19:57,495 --> 01:20:00,120 Now, you're not going to often have to type commands like this. 1670 01:20:00,120 --> 01:20:02,700 Again, you set the database up once in the beginning 1671 01:20:02,700 --> 01:20:04,590 typically, maybe with some help from a TF, 1672 01:20:04,590 --> 01:20:06,210 maybe with help of Google or the like. 1673 01:20:06,210 --> 01:20:09,203 But once your database is designed, it's back to the CRUD. 1674 01:20:09,203 --> 01:20:12,120 Create, read, update, delete, the selects, the inserts, the deletions, 1675 01:20:12,120 --> 01:20:12,690 and the like. 1676 01:20:12,690 --> 01:20:14,310 But what's this implying? 1677 01:20:14,310 --> 01:20:17,520 These keywords like primary key and foreign key 1678 01:20:17,520 --> 01:20:23,100 are what are doing in code what this picture was painting a moment ago. 1679 01:20:23,100 --> 01:20:27,570 These lines here are drawn literally to line up with the corresponding things. 1680 01:20:27,570 --> 01:20:30,660 People's ID lines up with person ID. 1681 01:20:30,660 --> 01:20:34,410 Show's ID lines up with show ID. 1682 01:20:34,410 --> 01:20:37,950 And so you're just seeing graphical version, code version, graphical, 1683 01:20:37,950 --> 01:20:40,860 code that creates these relationships. 1684 01:20:40,860 --> 01:20:45,190 Now, given that, let's actually see what these things look like. 1685 01:20:45,190 --> 01:20:47,160 So let me go back to VS Code here. 1686 01:20:47,160 --> 01:20:48,400 Let me clear my screen. 1687 01:20:48,400 --> 01:20:51,000 I'm still within SQLite with shows.db. 1688 01:20:51,000 --> 01:20:53,480 Let me go ahead and do what I do with any new database. 1689 01:20:53,480 --> 01:20:55,230 If I ever download something or I'm trying 1690 01:20:55,230 --> 01:20:57,600 to wrap my mind around a problem, usually it 1691 01:20:57,600 --> 01:21:01,230 doesn't come with a pretty picture or a three hour lecture 1692 01:21:01,230 --> 01:21:02,670 to explain what the data set is. 1693 01:21:02,670 --> 01:21:06,190 Rather you just have the data set in your own knowledge of SQL. 1694 01:21:06,190 --> 01:21:07,440 So let me play around. 1695 01:21:07,440 --> 01:21:09,608 So .schema shows me all of the tables. 1696 01:21:09,608 --> 01:21:10,900 That might be a starting point. 1697 01:21:10,900 --> 01:21:11,900 OK, this is interesting. 1698 01:21:11,900 --> 01:21:13,560 I know what people are. 1699 01:21:13,560 --> 01:21:16,530 Let's go ahead and show me all the people. 1700 01:21:16,530 --> 01:21:18,595 So select star from people. 1701 01:21:18,595 --> 01:21:20,970 I'm just trying to wrap my mind around what this data set 1702 01:21:20,970 --> 01:21:24,330 looks like in a more user friendly way. 1703 01:21:24,330 --> 01:21:26,607 That's already a lot of people. 1704 01:21:26,607 --> 01:21:28,440 As you see the years flying by, there's been 1705 01:21:28,440 --> 01:21:31,170 a lot of people in the TV business. 1706 01:21:31,170 --> 01:21:33,540 So this was maybe not the best query to run. 1707 01:21:33,540 --> 01:21:39,300 But this is indicative of just how large this data set is from IMDb. 1708 01:21:39,300 --> 01:21:42,960 When in doubt and whenever you lose control over your computer, 1709 01:21:42,960 --> 01:21:44,940 Control C is your friend to interrupt. 1710 01:21:44,940 --> 01:21:46,860 What would have been better, because I don't 1711 01:21:46,860 --> 01:21:49,980 think I need to know all of the million people in the world, 1712 01:21:49,980 --> 01:21:52,470 I could do limit me to 10 people. 1713 01:21:52,470 --> 01:21:55,860 And that's enough now to get a sense of Fred Astaire 1714 01:21:55,860 --> 01:21:58,920 has an ID of one, the first person ever. 1715 01:21:58,920 --> 01:22:00,660 Birth year of 1899. 1716 01:22:00,660 --> 01:22:04,740 Lauren Bacall and all of these other people from yesteryear. 1717 01:22:04,740 --> 01:22:07,410 You see that they are the first 10 people in the database. 1718 01:22:07,410 --> 01:22:09,160 So there's an example of some of the data. 1719 01:22:09,160 --> 01:22:12,540 Now if I want to wrap my mind around what a show is, I know it technically. 1720 01:22:12,540 --> 01:22:13,800 I know it from the picture. 1721 01:22:13,800 --> 01:22:15,400 But let's just look at some raw data. 1722 01:22:15,400 --> 01:22:18,210 So instead of saying select star from people, let me go ahead 1723 01:22:18,210 --> 01:22:21,480 and select star from shows limit 10. 1724 01:22:21,480 --> 01:22:26,250 And OK, I've only heard of or seen a couple of these, 1725 01:22:26,250 --> 01:22:27,970 but these are older shows at that. 1726 01:22:27,970 --> 01:22:31,530 But I see that every show has an ID, a title, a year in which it debuted, 1727 01:22:31,530 --> 01:22:32,910 and a number of episodes. 1728 01:22:32,910 --> 01:22:35,340 But perhaps most opaque is going to be this. 1729 01:22:35,340 --> 01:22:41,520 Select star from stars where this is the table that associates people 1730 01:22:41,520 --> 01:22:42,330 with shows. 1731 01:22:42,330 --> 01:22:46,950 Am I going to see any names or show titles here? 1732 01:22:46,950 --> 01:22:49,127 Not according to the definition we saw earlier. 1733 01:22:49,127 --> 01:22:50,460 Oh, I should have done my limit. 1734 01:22:50,460 --> 01:22:51,660 Let me interrupt that. 1735 01:22:51,660 --> 01:22:52,620 Let me do that again. 1736 01:22:52,620 --> 01:22:54,180 Limit 10. 1737 01:22:54,180 --> 01:22:55,050 No. 1738 01:22:55,050 --> 01:22:58,380 And this is where now you're definitely in the programmer world, 1739 01:22:58,380 --> 01:23:01,780 because this would be the most annoying spreadsheet to use on your Mac or PC 1740 01:23:01,780 --> 01:23:02,280 ever. 1741 01:23:02,280 --> 01:23:04,528 If you just had a sheet with all of these numbers 1742 01:23:04,528 --> 01:23:06,570 that associates one thing with the other, my God, 1743 01:23:06,570 --> 01:23:08,910 how do you figure out who this is or what this is? 1744 01:23:08,910 --> 01:23:12,780 You have to manually Control F or Command F looking for the data. 1745 01:23:12,780 --> 01:23:14,310 But a database doesn't care. 1746 01:23:14,310 --> 01:23:18,160 Once you know SQL, you can stitch these things back together. 1747 01:23:18,160 --> 01:23:21,360 So what you're seeing here are foreign keys. 1748 01:23:21,360 --> 01:23:22,150 Foreign keys. 1749 01:23:22,150 --> 01:23:22,650 Why? 1750 01:23:22,650 --> 01:23:27,600 Because show ID corresponds to the same numbers from that other table 1751 01:23:27,600 --> 01:23:30,990 called shows that has a proper primary key called ID. 1752 01:23:30,990 --> 01:23:33,360 Person ID is a foreign key in this context, 1753 01:23:33,360 --> 01:23:38,620 because it refers to numbers that belong to really the people table and its ID 1754 01:23:38,620 --> 01:23:39,120 column. 1755 01:23:39,120 --> 01:23:41,037 So this is just a way of somehow linking them. 1756 01:23:41,037 --> 01:23:44,100 And so if you think of I always think of this in my mind's eye as this. 1757 01:23:44,100 --> 01:23:47,760 If this is the people table, this is the shows table, 1758 01:23:47,760 --> 01:23:51,690 and there's this middle table in between, the stars table. 1759 01:23:51,690 --> 01:23:56,490 There's some way of stitching those two together by lining up the IDs of one 1760 01:23:56,490 --> 01:23:58,870 with the other and getting back some more data. 1761 01:23:58,870 --> 01:24:01,690 So let's actually play with some of this data. 1762 01:24:01,690 --> 01:24:05,148 How about we start where we emphasized earlier, genres. 1763 01:24:05,148 --> 01:24:08,440 So let me go ahead and take a quick look at all of the genres in this database. 1764 01:24:08,440 --> 01:24:10,500 So select star from genres. 1765 01:24:10,500 --> 01:24:13,030 Star is usually going to be a little overwhelming, 1766 01:24:13,030 --> 01:24:15,240 but it just gives me a sense of what the data is. 1767 01:24:15,240 --> 01:24:17,850 But let's actually look at-- 1768 01:24:17,850 --> 01:24:19,440 let's go look at all of them there. 1769 01:24:19,440 --> 01:24:20,850 OK, that's a lot. 1770 01:24:20,850 --> 01:24:24,210 These are all official genres from IMDb. 1771 01:24:24,210 --> 01:24:27,030 OK, it wasn't terribly long. 1772 01:24:27,030 --> 01:24:28,350 Let me filter that down. 1773 01:24:28,350 --> 01:24:34,800 So from genres where genre equals Comedy, capital C just based 1774 01:24:34,800 --> 01:24:36,000 on the data I'm seeing. 1775 01:24:36,000 --> 01:24:38,220 OK, so what am I seeing now? 1776 01:24:38,220 --> 01:24:40,838 And in fact, let me limit this arbitrarily to 10, 1777 01:24:40,838 --> 01:24:42,630 though I could limit it to anything I want. 1778 01:24:42,630 --> 01:24:45,060 Here are 10 comedies. 1779 01:24:45,060 --> 01:24:46,080 What are they? 1780 01:24:46,080 --> 01:24:47,490 Well, who the heck knows? 1781 01:24:47,490 --> 01:24:49,590 All I know are the 10 show IDs. 1782 01:24:49,590 --> 01:24:51,180 Now, I could do something like this. 1783 01:24:51,180 --> 01:24:54,037 As we've seen before with SQL, I could do, all right, well 1784 01:24:54,037 --> 01:24:55,620 let's figure out what this show ID is. 1785 01:24:55,620 --> 01:25:03,750 Select star from shows where the ID of the show I'm looking for equals what? 1786 01:25:03,750 --> 01:25:06,910 62614 semicolon. 1787 01:25:06,910 --> 01:25:11,200 So I could manually look it up by cross referencing the other table. 1788 01:25:11,200 --> 01:25:14,860 So that was the show in question there, the first comedy in the data set. 1789 01:25:14,860 --> 01:25:16,200 Let me look up the second one. 1790 01:25:16,200 --> 01:25:20,610 So instead of that, let's do 63881 Enter. 1791 01:25:20,610 --> 01:25:21,150 OK. 1792 01:25:21,150 --> 01:25:22,410 So that's that show. 1793 01:25:22,410 --> 01:25:23,640 And let's do one more. 1794 01:25:23,640 --> 01:25:27,210 And suffice it to say, this is just getting tedious and vulnerable 1795 01:25:27,210 --> 01:25:29,520 to mistakes quickly. 1796 01:25:29,520 --> 01:25:31,710 This surely can't be the way to do this. 1797 01:25:31,710 --> 01:25:35,280 And indeed, SQL is going to let us do this a little more powerfully instead. 1798 01:25:35,280 --> 01:25:36,790 Let's do this. 1799 01:25:36,790 --> 01:25:39,210 Instead of getting this table temporarily 1800 01:25:39,210 --> 01:25:42,640 with all these show IDs and all these genres, let's refine the query. 1801 01:25:42,640 --> 01:25:49,920 So let's just select the show ID from the genres table where the genre equals 1802 01:25:49,920 --> 01:25:51,810 quote unquote "comedy." 1803 01:25:51,810 --> 01:25:56,670 Now I have a big list of show IDs, all of which are comedy. 1804 01:25:56,670 --> 01:25:57,205 How many? 1805 01:25:57,205 --> 01:25:58,830 Well, I can combine ideas from earlier. 1806 01:25:58,830 --> 01:26:03,582 I can just count all of those show IDs or star if I want to just do that too. 1807 01:26:03,582 --> 01:26:05,040 But I can count all those show IDs. 1808 01:26:05,040 --> 01:26:09,880 48,706 comedies on IMDb's database for TV shows. 1809 01:26:09,880 --> 01:26:11,140 So feels like a lot. 1810 01:26:11,140 --> 01:26:14,010 But how can I now use that information and get back 1811 01:26:14,010 --> 01:26:18,100 the titles of comedies in the database without doing it manually? 1812 01:26:18,100 --> 01:26:19,720 Well, let's do this. 1813 01:26:19,720 --> 01:26:21,270 I have a moment ago this query. 1814 01:26:21,270 --> 01:26:25,710 Select the show ID from genres where the current genre 1815 01:26:25,710 --> 01:26:29,295 is quote unquote "comedy." 1816 01:26:29,295 --> 01:26:31,920 What if I kind of nest these queries, kind of like grade school 1817 01:26:31,920 --> 01:26:33,030 math in parentheses? 1818 01:26:33,030 --> 01:26:35,940 What if I combine this whole thing in parentheses? 1819 01:26:35,940 --> 01:26:38,040 And now let me select what I really want. 1820 01:26:38,040 --> 01:26:43,950 Let me go ahead and select how about the title of all 1821 01:26:43,950 --> 01:26:52,420 shows where the idea of the show is in this list of show IDs. 1822 01:26:52,420 --> 01:26:59,440 So if you agree that the shows table has an ID column, which is otherwise 1823 01:26:59,440 --> 01:27:02,170 known as its primary key, the unique ID that identifies it, just 1824 01:27:02,170 --> 01:27:06,490 like our Harvard IDs, our Yale IDs, and you agree that per a moment ago 1825 01:27:06,490 --> 01:27:09,580 this shorter query will give me back just the show 1826 01:27:09,580 --> 01:27:13,840 IDs of all of the comedies in the database, 1827 01:27:13,840 --> 01:27:17,710 you can actually combine or nest these queries together. 1828 01:27:17,710 --> 01:27:19,843 It's going to respect SQLite order of operations 1829 01:27:19,843 --> 01:27:21,760 with parentheses, just like grade school math. 1830 01:27:21,760 --> 01:27:24,070 So the thing in parentheses will be executed first. 1831 01:27:24,070 --> 01:27:27,340 That gives it back a list of IDs, like 48,000 IDs. 1832 01:27:27,340 --> 01:27:30,160 And then this query, the outer query, is going 1833 01:27:30,160 --> 01:27:33,790 to get the title from all of the shows where the ID of the show 1834 01:27:33,790 --> 01:27:36,280 is in that big list of 48,000. 1835 01:27:36,280 --> 01:27:39,233 So if I now execute these together, I think 1836 01:27:39,233 --> 01:27:42,400 the list is still going to be a little long, but let me execute it together. 1837 01:27:42,400 --> 01:27:45,880 Now I see this long list of outputs. 1838 01:27:45,880 --> 01:27:46,900 A little overwhelming. 1839 01:27:46,900 --> 01:27:50,650 Let's go ahead and maybe limit it to just 10 1840 01:27:50,650 --> 01:27:52,460 as before for discussion's sake. 1841 01:27:52,460 --> 01:27:56,110 And now I see 10 comedies ordered arbitrarily 1842 01:27:56,110 --> 01:27:58,330 from however they're in the database that happen 1843 01:27:58,330 --> 01:28:01,420 to indeed have comedy as their genre. 1844 01:28:01,420 --> 01:28:04,310 If I want to do this a little more cleanly, I could do this. 1845 01:28:04,310 --> 01:28:05,260 Let's see. 1846 01:28:05,260 --> 01:28:09,640 Why don't I order by title ascending order, which is alphabetically, 1847 01:28:09,640 --> 01:28:12,190 or the default is also an ascending. 1848 01:28:12,190 --> 01:28:13,240 Limit 10. 1849 01:28:13,240 --> 01:28:18,250 Now I see the top 10, I mean, weirdly named things with hash symbols 1850 01:28:18,250 --> 01:28:22,360 presumably to get their titles up to the beginning or maybe these are hashtags. 1851 01:28:22,360 --> 01:28:27,550 Here now we have alphabetically the first 10 shows that are comedies. 1852 01:28:27,550 --> 01:28:31,705 Any questions on these kinds of queries? 1853 01:28:31,705 --> 01:28:33,580 It's kind of a lot, but at the same time it's 1854 01:28:33,580 --> 01:28:37,540 just like composing the smaller ideas from before into slightly more 1855 01:28:37,540 --> 01:28:39,820 useful queries. 1856 01:28:39,820 --> 01:28:40,360 Yeah. 1857 01:28:40,360 --> 01:28:43,348 AUDIENCE: [INAUDIBLE] 1858 01:28:43,348 --> 01:28:46,343 1859 01:28:46,343 --> 01:28:48,760 DAVID MALAN: Do foreign keys have to set the relationship? 1860 01:28:48,760 --> 01:28:51,850 When you create the table, the programmer or the database 1861 01:28:51,850 --> 01:28:54,280 administrator would create that relationship 1862 01:28:54,280 --> 01:28:57,700 by using those keywords primary key and foreign key that 1863 01:28:57,700 --> 01:29:01,330 teaches the database what is related to what per the picture. 1864 01:29:01,330 --> 01:29:02,530 So you do that once. 1865 01:29:02,530 --> 01:29:07,390 And now I being the sort of programmer who's familiar with the database, 1866 01:29:07,390 --> 01:29:13,750 I am just using these foreign keys in a manner consistent with their design. 1867 01:29:13,750 --> 01:29:15,680 And this is where it's useful at some point, 1868 01:29:15,680 --> 01:29:18,940 even if no one hands you a picture, to make sure you understand the database, 1869 01:29:18,940 --> 01:29:22,180 because that's going to inform literally what you type in SQL 1870 01:29:22,180 --> 01:29:23,740 to get the data you care about. 1871 01:29:23,740 --> 01:29:26,060 Well, let's do something a little more precise. 1872 01:29:26,060 --> 01:29:27,860 How about-- very reasonable question. 1873 01:29:27,860 --> 01:29:30,970 And honestly, this is exactly what imdb.com and the app or for. 1874 01:29:30,970 --> 01:29:34,600 What if you want to find all of the shows that Steve Carell is in? 1875 01:29:34,600 --> 01:29:36,160 Kind of a reasonable query. 1876 01:29:36,160 --> 01:29:40,150 Literally something someone might type into Google or more specifically IMDb. 1877 01:29:40,150 --> 01:29:42,130 It's not really obvious at first glance how 1878 01:29:42,130 --> 01:29:44,795 to do that, though, because from my database, 1879 01:29:44,795 --> 01:29:47,170 if these are my six tables, well, I can pretty easily get 1880 01:29:47,170 --> 01:29:48,460 Steve Carell from here. 1881 01:29:48,460 --> 01:29:51,105 But I can really only get his ID number, whatever that is, 1882 01:29:51,105 --> 01:29:53,230 his name, which I know already, and his birth year. 1883 01:29:53,230 --> 01:29:56,200 OK, interesting but has nothing to do with the shows that he's in. 1884 01:29:56,200 --> 01:30:00,850 I can look at shows over here, but there's no mention of Steve Carell 1885 01:30:00,850 --> 01:30:02,710 because there's no person ID here. 1886 01:30:02,710 --> 01:30:04,660 Where is that relationship implemented? 1887 01:30:04,660 --> 01:30:06,590 Well, it's implemented down here. 1888 01:30:06,590 --> 01:30:07,910 So how do we do this? 1889 01:30:07,910 --> 01:30:10,210 Well, here's the perfect example of a lesson 1890 01:30:10,210 --> 01:30:13,450 we've been trying to emphasize for weeks of taking these baby steps. 1891 01:30:13,450 --> 01:30:17,750 Break larger problems down into smaller ones and let's do something like this. 1892 01:30:17,750 --> 01:30:20,740 Let's just get everything I know about Steve Carell from the database. 1893 01:30:20,740 --> 01:30:24,190 Let's select star from people where the name of the person 1894 01:30:24,190 --> 01:30:26,890 is quote unquote "Steve Carell." 1895 01:30:26,890 --> 01:30:28,630 I just want to see what data we've got. 1896 01:30:28,630 --> 01:30:30,640 And here's what we have. 1897 01:30:30,640 --> 01:30:33,130 There's only one Steve Carell born in 1962 1898 01:30:33,130 --> 01:30:37,150 and his unique ID is 136797 according to IMDb. 1899 01:30:37,150 --> 01:30:40,930 This isn't some global actor identifier, per se. 1900 01:30:40,930 --> 01:30:45,010 All right, well how do I get now all of the shows that Steve Carell is in? 1901 01:30:45,010 --> 01:30:46,130 Well, I could do this. 1902 01:30:46,130 --> 01:30:51,190 Select star from stars, not to confuse the two. 1903 01:30:51,190 --> 01:30:53,500 One's the symbol, one's the table name. 1904 01:30:53,500 --> 01:30:58,690 Where person ID equals 136797. 1905 01:30:58,690 --> 01:31:03,070 So I think this will now give me everything from the stars table 1906 01:31:03,070 --> 01:31:04,943 that relates to Steve Carell. 1907 01:31:04,943 --> 01:31:07,360 And you'll see person ID is the same because I'm literally 1908 01:31:07,360 --> 01:31:08,693 searching for just Steve Carell. 1909 01:31:08,693 --> 01:31:12,200 But there are like 20 or so shows that he's been in. 1910 01:31:12,200 --> 01:31:14,450 All right, well here's where things would get tedious. 1911 01:31:14,450 --> 01:31:15,340 What are those shows? 1912 01:31:15,340 --> 01:31:21,460 Well, I could do select title from shows where the ID of the show equals. 1913 01:31:21,460 --> 01:31:25,670 And here's whenever you copy paste, you're probably doing something wrong. 1914 01:31:25,670 --> 01:31:27,890 OK, he was in The Dana Carvey Show. 1915 01:31:27,890 --> 01:31:28,715 Familiar with that. 1916 01:31:28,715 --> 01:31:29,590 Let's do another one. 1917 01:31:29,590 --> 01:31:31,460 We'll copy paste this. 1918 01:31:31,460 --> 01:31:34,090 Where ID equals this. 1919 01:31:34,090 --> 01:31:34,890 Over The Top. 1920 01:31:34,890 --> 01:31:35,390 Another. 1921 01:31:35,390 --> 01:31:37,000 And if we keep digging, we'll probably find The Office. 1922 01:31:37,000 --> 01:31:40,240 But my God, that's going to take forever to do 20 queries manually. 1923 01:31:40,240 --> 01:31:41,252 It's not very dynamic. 1924 01:31:41,252 --> 01:31:43,960 But what if we just nest these queries a little more dynamically? 1925 01:31:43,960 --> 01:31:45,830 So let me start from the beginning again. 1926 01:31:45,830 --> 01:31:48,890 What if we go ahead and select everything 1927 01:31:48,890 --> 01:31:56,410 we know about people whose name equals Steve Carell. 1928 01:31:56,410 --> 01:31:58,220 That gave us earlier this data. 1929 01:31:58,220 --> 01:31:59,470 I don't need all of that data. 1930 01:31:59,470 --> 01:32:00,100 I know his name. 1931 01:32:00,100 --> 01:32:01,517 I don't care about his birth year. 1932 01:32:01,517 --> 01:32:05,140 So let's change this to just be give me the ID of Steve Carell. 1933 01:32:05,140 --> 01:32:08,390 And that gives me back now this smaller temporary data set. 1934 01:32:08,390 --> 01:32:08,890 All right. 1935 01:32:08,890 --> 01:32:12,195 Can I now use this inside of another query? 1936 01:32:12,195 --> 01:32:14,320 Well, let me wrap the whole thing with parentheses. 1937 01:32:14,320 --> 01:32:18,850 And now let me say select star from the stars table 1938 01:32:18,850 --> 01:32:22,477 where the person ID equals this. 1939 01:32:22,477 --> 01:32:24,310 So I'm deliberately not using in because I'm 1940 01:32:24,310 --> 01:32:26,780 assuming there's indeed only one Steve Carell in the world. 1941 01:32:26,780 --> 01:32:28,480 So I'm not getting back a list of Steve Carells. 1942 01:32:28,480 --> 01:32:30,640 I'm getting back the one and only in this case. 1943 01:32:30,640 --> 01:32:31,810 So equal is fine. 1944 01:32:31,810 --> 01:32:33,310 In is when you have multiple. 1945 01:32:33,310 --> 01:32:34,840 Equal is when you have one. 1946 01:32:34,840 --> 01:32:36,375 Let me go ahead and hit Enter now. 1947 01:32:36,375 --> 01:32:37,750 OK, that's more data than I need. 1948 01:32:37,750 --> 01:32:40,430 I don't need like 20 copies of Steve Carell's person ID. 1949 01:32:40,430 --> 01:32:41,380 So let me hit up. 1950 01:32:41,380 --> 01:32:46,420 Let me go back and let me just get show ID from Steve Carell. 1951 01:32:46,420 --> 01:32:52,070 And now I have a list of just the 20 or so show IDs that he has been in. 1952 01:32:52,070 --> 01:32:52,570 All right. 1953 01:32:52,570 --> 01:32:53,597 How can I now use this? 1954 01:32:53,597 --> 01:32:54,430 Well, let me hit up. 1955 01:32:54,430 --> 01:32:57,040 Let me put the whole thing in parentheses. 1956 01:32:57,040 --> 01:32:58,960 And now let me select what I really want. 1957 01:32:58,960 --> 01:33:03,340 Select title from shows where. 1958 01:33:03,340 --> 01:33:05,020 And here's the final flourish. 1959 01:33:05,020 --> 01:33:11,350 The shows table has an ID, has a title, has a year, and has an episode. 1960 01:33:11,350 --> 01:33:21,610 And what I really want, though, is to check which shows have ID that is what? 1961 01:33:21,610 --> 01:33:24,510 Anyone want to finish the thought? 1962 01:33:24,510 --> 01:33:25,860 I just want to-- yeah. 1963 01:33:25,860 --> 01:33:28,435 AUDIENCE: [INAUDIBLE] 1964 01:33:28,435 --> 01:33:29,310 DAVID MALAN: Exactly. 1965 01:33:29,310 --> 01:33:30,570 ID in this. 1966 01:33:30,570 --> 01:33:31,690 And this is getting ugly. 1967 01:33:31,690 --> 01:33:33,810 And when you actually write your queries in a text file, 1968 01:33:33,810 --> 01:33:35,730 you can format them nicely and indent them. 1969 01:33:35,730 --> 01:33:36,772 My font is just getting-- 1970 01:33:36,772 --> 01:33:38,938 I don't want to make it too small to fit everything. 1971 01:33:38,938 --> 01:33:40,230 But now we have three queries. 1972 01:33:40,230 --> 01:33:43,380 One is in doubly nested parentheses, then there's the middle one, 1973 01:33:43,380 --> 01:33:44,550 then there's the outer one. 1974 01:33:44,550 --> 01:33:47,310 So this last query is going to get me the title from shows 1975 01:33:47,310 --> 01:33:50,460 where the ID of the show is in this big list of 20 1976 01:33:50,460 --> 01:33:52,410 or so show IDs that Steve Carell is in. 1977 01:33:52,410 --> 01:33:54,812 And I knew that because I looked up his name here. 1978 01:33:54,812 --> 01:33:57,270 And notice what I did not do this time is I didn't manually 1979 01:33:57,270 --> 01:33:58,620 hardcode his ID number. 1980 01:33:58,620 --> 01:33:59,430 There's no need. 1981 01:33:59,430 --> 01:34:02,508 That would be kind of a bad way to implement a website if you're 1982 01:34:02,508 --> 01:34:04,050 using a database underneath the hood. 1983 01:34:04,050 --> 01:34:07,470 You want the IMDb for real to search for whatever 1984 01:34:07,470 --> 01:34:11,010 the human typed in and no one's going to know Steve Carell's person ID 1985 01:34:11,010 --> 01:34:12,070 or anything else. 1986 01:34:12,070 --> 01:34:14,580 So here we've done this all dynamically. 1987 01:34:14,580 --> 01:34:18,690 And now if I hit Enter, I think I get all of his shows. 1988 01:34:18,690 --> 01:34:21,840 Let's go ahead and order this by title just to make it tidy. 1989 01:34:21,840 --> 01:34:24,810 And you probably will see at least one or more shows that. 1990 01:34:24,810 --> 01:34:29,520 And probably the most popular is, dot dot dot, The Office. 1991 01:34:29,520 --> 01:34:32,280 So this is literally the kind of query that's 1992 01:34:32,280 --> 01:34:37,200 being executed underneath the hood when you go to websites or apps like IMDb. 1993 01:34:37,200 --> 01:34:42,030 Your textual query is probably being plugged into a longer SQL query 1994 01:34:42,030 --> 01:34:44,850 like this where some programmer at IMDb probably 1995 01:34:44,850 --> 01:34:48,420 wrote this whole query in advance weeks, months, years ago 1996 01:34:48,420 --> 01:34:52,470 and they're just somehow plugging in the value that you the human 1997 01:34:52,470 --> 01:34:55,550 typed into the search box or the like. 1998 01:34:55,550 --> 01:35:03,010 Questions now on finding this data or any other? 1999 01:35:03,010 --> 01:35:03,890 No? 2000 01:35:03,890 --> 01:35:04,390 OK. 2001 01:35:04,390 --> 01:35:06,980 So where else could we go with this? 2002 01:35:06,980 --> 01:35:12,700 Well, let's consider how else we might combine data. 2003 01:35:12,700 --> 01:35:16,060 Suppose that the next question actually perhaps appropriately 2004 01:35:16,060 --> 01:35:19,780 would be focusing in on not just people and shows and these stars, 2005 01:35:19,780 --> 01:35:23,680 but how do we gather more information about the shows themselves, 2006 01:35:23,680 --> 01:35:25,430 like the genres, the ratings, or the like. 2007 01:35:25,430 --> 01:35:28,030 So indeed, let's focus on just these two tables here. 2008 01:35:28,030 --> 01:35:31,640 Recall that every show has an ID, a title, a year, and episodes. 2009 01:35:31,640 --> 01:35:34,660 But it also might have one or more relationships with rows 2010 01:35:34,660 --> 01:35:36,280 and this other table called genres. 2011 01:35:36,280 --> 01:35:39,250 And this is so that a show can be a comedy, can be a drama, 2012 01:35:39,250 --> 01:35:40,960 can be any number of other things. 2013 01:35:40,960 --> 01:35:42,400 One row per. 2014 01:35:42,400 --> 01:35:45,040 So you would see the same show ID again and again 2015 01:35:45,040 --> 01:35:48,760 and again with a different genre written in English 2016 01:35:48,760 --> 01:35:51,070 like comedy, drama, or the like. 2017 01:35:51,070 --> 01:35:53,890 Well, how do I kind of reconstitute that data? 2018 01:35:53,890 --> 01:35:56,710 Well, turns out there's a few different ways to do this. 2019 01:35:56,710 --> 01:36:00,610 And let me propose that we introduce this keyword here, join. 2020 01:36:00,610 --> 01:36:05,525 And this is really the most powerful of the keywords in SQL itself. 2021 01:36:05,525 --> 01:36:06,650 It doesn't have to be used. 2022 01:36:06,650 --> 01:36:09,310 We've seen with nested queries that you can still 2023 01:36:09,310 --> 01:36:13,490 select data across multiple tables, but here is another way. 2024 01:36:13,490 --> 01:36:14,660 So let me do this. 2025 01:36:14,660 --> 01:36:17,710 Let me go back to my SQLite database. 2026 01:36:17,710 --> 01:36:23,680 And let me select sort of in one breath exactly the data I want. 2027 01:36:23,680 --> 01:36:26,980 Select star from shows. 2028 01:36:26,980 --> 01:36:30,500 And let's just limit this initially to 10 to see what it looks like. 2029 01:36:30,500 --> 01:36:31,000 All right. 2030 01:36:31,000 --> 01:36:32,250 That's, again, the shows data. 2031 01:36:32,250 --> 01:36:34,690 Select star from genres. 2032 01:36:34,690 --> 01:36:37,760 Let's limit that to 10 too, just to wrap our minds around it. 2033 01:36:37,760 --> 01:36:39,730 And now this is not that useful. 2034 01:36:39,730 --> 01:36:43,510 However, the data in the leftmost column here 2035 01:36:43,510 --> 01:36:45,490 is the primary key in the shows table. 2036 01:36:45,490 --> 01:36:46,870 These are just unique IDs. 2037 01:36:46,870 --> 01:36:51,530 The data here in the genres table, recall, show ID is the foreign key. 2038 01:36:51,530 --> 01:36:54,310 So it's the same numbers but just copied into another table 2039 01:36:54,310 --> 01:36:56,650 so that we can have this relationship across them. 2040 01:36:56,650 --> 01:37:00,070 How do I kind of line up these numbers with these numbers 2041 01:37:00,070 --> 01:37:06,430 to get back a wider table that has title and year and episodes and genre and, 2042 01:37:06,430 --> 01:37:08,950 heck, ratings and all of that too if we want? 2043 01:37:08,950 --> 01:37:11,230 Well, you can join these tables by just telling 2044 01:37:11,230 --> 01:37:12,890 the database what to join on what. 2045 01:37:12,890 --> 01:37:13,940 So let me do this. 2046 01:37:13,940 --> 01:37:17,110 Select star from shows. 2047 01:37:17,110 --> 01:37:21,760 Join that table though on the genres table. 2048 01:37:21,760 --> 01:37:24,160 Well, how do you want to join those two tables? 2049 01:37:24,160 --> 01:37:26,860 And again, the two tables from the picture looked like this. 2050 01:37:26,860 --> 01:37:30,400 How do you tell SQL programmatically to put one of them 2051 01:37:30,400 --> 01:37:32,420 right next to the other, line up all of the ID 2052 01:37:32,420 --> 01:37:34,820 so that you just get one larger data set? 2053 01:37:34,820 --> 01:37:39,160 Well, we can use indeed this syntax called join. 2054 01:37:39,160 --> 01:37:41,140 So back to VS Code here. 2055 01:37:41,140 --> 01:37:44,380 And let me join these two tables. 2056 01:37:44,380 --> 01:37:45,970 Sorry, typo here. 2057 01:37:45,970 --> 01:37:51,940 Join genres on the shows table's ID column, a.k.a. 2058 01:37:51,940 --> 01:37:58,240 its primary key, equaling the genres table's show ID column, a.k.a. 2059 01:37:58,240 --> 01:37:59,210 the foreign key. 2060 01:37:59,210 --> 01:38:01,090 So in other words, it looks a little cryptic, 2061 01:38:01,090 --> 01:38:05,080 but I'm just telling SQL how to line up these two tables and what column 2062 01:38:05,080 --> 01:38:08,230 to match with the other so that the numbers line up 2063 01:38:08,230 --> 01:38:10,610 and I get essentially a wider table. 2064 01:38:10,610 --> 01:38:14,930 Let me go ahead and hit semicolon and Enter. 2065 01:38:14,930 --> 01:38:17,330 And this is now going to give me a lot of data. 2066 01:38:17,330 --> 01:38:18,600 We might have to interrupt it. 2067 01:38:18,600 --> 01:38:22,650 But notice even at a glance, we're getting the ID, the title, the year, 2068 01:38:22,650 --> 01:38:25,130 the number of episodes, the ID again redundantly, 2069 01:38:25,130 --> 01:38:27,140 but that's to be expected if I'm joining them, 2070 01:38:27,140 --> 01:38:29,360 and the genre all the way on the right. 2071 01:38:29,360 --> 01:38:31,740 Let me hit Control C to interrupt. 2072 01:38:31,740 --> 01:38:33,890 Let me just limit this to The Office. 2073 01:38:33,890 --> 01:38:36,500 So where title equals quote unquote "The Office" 2074 01:38:36,500 --> 01:38:39,680 so we can focus on just one sample data. 2075 01:38:39,680 --> 01:38:43,490 And here, fun fact, there's been more than one Office. 2076 01:38:43,490 --> 01:38:46,100 The one that you all probably like is this one 2077 01:38:46,100 --> 01:38:49,250 that started in 2005 with 188 episodes. 2078 01:38:49,250 --> 01:38:53,060 Its ID in the shows table is 386676. 2079 01:38:53,060 --> 01:38:54,710 That's confirmed over here too. 2080 01:38:54,710 --> 01:38:56,850 So again, we've just joined the two tables. 2081 01:38:56,850 --> 01:38:57,350 How? 2082 01:38:57,350 --> 01:38:58,730 By lining up those fields. 2083 01:38:58,730 --> 01:39:01,490 But now that we can see that almost all of The Offices 2084 01:39:01,490 --> 01:39:04,700 produced over the decades are comedies except for this one. 2085 01:39:04,700 --> 01:39:07,460 There was a version of The Office produced in 2001 2086 01:39:07,460 --> 01:39:10,160 that was considered more of a drama. 2087 01:39:10,160 --> 01:39:11,940 Unsure if it's related to the other. 2088 01:39:11,940 --> 01:39:13,640 How can we link in other data? 2089 01:39:13,640 --> 01:39:16,920 Well, let's go ahead and link in ratings too or instead. 2090 01:39:16,920 --> 01:39:21,290 So instead of joining this with genres, let me go ahead and rewind here 2091 01:39:21,290 --> 01:39:28,550 and join shows on ratings on shows.id equals ratings.show_ID. 2092 01:39:28,550 --> 01:39:32,630 And let's limit it to The Office too for discussion's sake where title equals 2093 01:39:32,630 --> 01:39:35,820 quote unquote "The Office" semicolon. 2094 01:39:35,820 --> 01:39:39,650 And now you can see that among the various Offices, 2095 01:39:39,650 --> 01:39:43,100 it looks like the one that most of us probably know and love 2096 01:39:43,100 --> 01:39:49,070 is the highest rated also with a 9.0 with like 585,000 people having 2097 01:39:49,070 --> 01:39:52,280 cast votes for whereas this other shows seem to have been less popular. 2098 01:39:52,280 --> 01:39:56,370 And perhaps that's why indeed you see fewer episodes for them as well. 2099 01:39:56,370 --> 01:39:58,680 So even though we've put the data in multiple places, 2100 01:39:58,680 --> 01:40:02,670 you can still kind of reconstitute it by lining things up in this way 2101 01:40:02,670 --> 01:40:04,970 and rejoining the tables. 2102 01:40:04,970 --> 01:40:09,140 Questions now on this? 2103 01:40:09,140 --> 01:40:16,540 This is the heart of what SQL does and what relational databases do for you. 2104 01:40:16,540 --> 01:40:18,100 Questions? 2105 01:40:18,100 --> 01:40:18,610 All right. 2106 01:40:18,610 --> 01:40:20,470 A few final features. 2107 01:40:20,470 --> 01:40:22,660 There's not all that much that-- 2108 01:40:22,660 --> 01:40:24,320 SQL takes practice like anything else. 2109 01:40:24,320 --> 01:40:26,237 But in terms of syntax and capabilities, let's 2110 01:40:26,237 --> 01:40:30,610 just introduce you to a couple of final features here and problems that arise 2111 01:40:30,610 --> 01:40:32,210 and how we might solve them. 2112 01:40:32,210 --> 01:40:34,640 Let's do this as well. 2113 01:40:34,640 --> 01:40:36,980 So let me go back into VS Code here. 2114 01:40:36,980 --> 01:40:40,900 And let's just find out Steve Carell's information again. 2115 01:40:40,900 --> 01:40:43,180 Last time we did it with this nested query 2116 01:40:43,180 --> 01:40:47,503 by getting his ID and then the show IDs and then the titles for those show IDs. 2117 01:40:47,503 --> 01:40:49,420 With join, you can do it a little differently. 2118 01:40:49,420 --> 01:40:50,950 And any of these ways are fine. 2119 01:40:50,950 --> 01:40:53,740 One might become easier to mentally than another. 2120 01:40:53,740 --> 01:40:57,040 Let's go ahead and select the titles from what. 2121 01:40:57,040 --> 01:41:02,095 Let's select the title from the people table. 2122 01:41:02,095 --> 01:41:03,220 And I'm going to hit Enter. 2123 01:41:03,220 --> 01:41:05,260 And when you're using SQLite3 interactively, 2124 01:41:05,260 --> 01:41:09,400 if you ever find yourself with a prompt that says dot dot dot angle bracket, 2125 01:41:09,400 --> 01:41:12,067 it means you're continuing your thought onto the next line. 2126 01:41:12,067 --> 01:41:13,900 If you didn't intend that, you can sometimes 2127 01:41:13,900 --> 01:41:16,510 hit semicolon to just end the thought and hit 2128 01:41:16,510 --> 01:41:18,460 Enter even if it triggers an error. 2129 01:41:18,460 --> 01:41:21,430 But this is one way of formatting my queries now a little more nicely. 2130 01:41:21,430 --> 01:41:24,597 I'm just going to add some white space so that it's a little easier to read. 2131 01:41:24,597 --> 01:41:25,720 What do I want to select? 2132 01:41:25,720 --> 01:41:28,000 Well, I want to select the title of shows 2133 01:41:28,000 --> 01:41:35,320 from the people table joined with the stars table on the people table's ID 2134 01:41:35,320 --> 01:41:40,520 column equaling the stars table's person ID column. 2135 01:41:40,520 --> 01:41:44,230 So in other words, if you think back to what people are and what stars are, 2136 01:41:44,230 --> 01:41:48,250 one has an ID, one has a person ID, I'm just now connecting those two tables. 2137 01:41:48,250 --> 01:41:49,450 I'm joining those two. 2138 01:41:49,450 --> 01:41:53,350 But I want to do this as well with another table. 2139 01:41:53,350 --> 01:41:54,938 Let me additionally join in. 2140 01:41:54,938 --> 01:41:56,980 So now I only have two hands, but now I'm putting 2141 01:41:56,980 --> 01:41:59,320 a third table joined in together here. 2142 01:41:59,320 --> 01:42:05,960 Join shows on stars.show_id equals shows.id. 2143 01:42:05,960 --> 01:42:08,500 So this is now linking three tables together. 2144 01:42:08,500 --> 01:42:12,100 But I only care about this for one person, so where the name of the person 2145 01:42:12,100 --> 01:42:14,650 equals quote unquote "Steve Carell." 2146 01:42:14,650 --> 01:42:18,850 So more cryptic, to be sure, but what we're doing with this query 2147 01:42:18,850 --> 01:42:21,370 is just taking all three tables that we care about 2148 01:42:21,370 --> 01:42:23,440 and we're joining them all together at once 2149 01:42:23,440 --> 01:42:27,250 using this new join syntax literally telling the database what 2150 01:42:27,250 --> 01:42:29,020 columns to line up with what. 2151 01:42:29,020 --> 01:42:32,350 And then we filter at the very end just like before to get back, 2152 01:42:32,350 --> 01:42:35,650 if I hit Enter, the answer we want, which in this case 2153 01:42:35,650 --> 01:42:40,030 is a little slower at the moment, but that same list of 20 or so 2154 01:42:40,030 --> 01:42:41,350 shows that he's been in. 2155 01:42:41,350 --> 01:42:42,920 There's one other way to do this. 2156 01:42:42,920 --> 01:42:45,250 And again, these are all in the slides online. 2157 01:42:45,250 --> 01:42:47,890 So you can repeat them without having to jot down everything and we'll put them 2158 01:42:47,890 --> 01:42:48,820 in the notes too. 2159 01:42:48,820 --> 01:42:50,450 But there's another way to do this. 2160 01:42:50,450 --> 01:42:52,780 I could also use an implicit join. 2161 01:42:52,780 --> 01:42:55,030 So that was an explicit join because I literally typed 2162 01:42:55,030 --> 01:42:57,220 the word join multiple times at that. 2163 01:42:57,220 --> 01:43:00,880 But let me go ahead and select the title from these three tables. 2164 01:43:00,880 --> 01:43:02,628 People, stars, and shows. 2165 01:43:02,628 --> 01:43:04,420 And this might just be nicer because if you 2166 01:43:04,420 --> 01:43:06,337 know what tables you want to select data from, 2167 01:43:06,337 --> 01:43:08,980 just enumerate them separated by commas, which you might prefer 2168 01:43:08,980 --> 01:43:18,970 in your mind, where the people ID equals the stars person ID and the stars 2169 01:43:18,970 --> 01:43:25,450 show ID equals the shows ID and the name of the person equals Steve Carell. 2170 01:43:25,450 --> 01:43:26,788 So this is an implicit join. 2171 01:43:26,788 --> 01:43:29,830 And honestly, I constantly reference my notes for some of this stuff too. 2172 01:43:29,830 --> 01:43:31,720 It's not the kind of thing that's going to come like this to you 2173 01:43:31,720 --> 01:43:32,740 after just one day. 2174 01:43:32,740 --> 01:43:35,560 But it's just a different way of expressing the same thing. 2175 01:43:35,560 --> 01:43:37,720 I want to select data from three different tables. 2176 01:43:37,720 --> 01:43:43,120 And hey SQL, here is how I want you to line those tables up so that I can 2177 01:43:43,120 --> 01:43:45,640 get like related data for Steve Carell. 2178 01:43:45,640 --> 01:43:49,450 And this now will achieve the same results ultimately. 2179 01:43:49,450 --> 01:43:50,260 Let me hit Enter. 2180 01:43:50,260 --> 01:43:54,040 2181 01:43:54,040 --> 01:43:54,920 And there we go. 2182 01:43:54,920 --> 01:43:55,750 So a little slower. 2183 01:43:55,750 --> 01:43:57,850 And performance might vary based on computer, 2184 01:43:57,850 --> 01:44:01,270 based on implementation of SQL, but I think I still have the same answers. 2185 01:44:01,270 --> 01:44:04,790 Now suppose, as I often do, and I had to look it up again last time, 2186 01:44:04,790 --> 01:44:08,350 suppose you forget how to spell Steve Carell's name. 2187 01:44:08,350 --> 01:44:10,240 Is it two R's, two L's, or the like? 2188 01:44:10,240 --> 01:44:12,903 Well, I could also do something like this. 2189 01:44:12,903 --> 01:44:14,320 Well, let's just keep this simple. 2190 01:44:14,320 --> 01:44:17,103 Select star from people where name equals. 2191 01:44:17,103 --> 01:44:20,020 I've been deliberately getting it right so as to not embarrass myself. 2192 01:44:20,020 --> 01:44:22,000 That's the Steve Carell I keep querying. 2193 01:44:22,000 --> 01:44:25,960 If you forget, well you could try searching for just Steves, 2194 01:44:25,960 --> 01:44:28,450 but interestingly, there's a bunch of Steves. 2195 01:44:28,450 --> 01:44:30,070 We don't know when they were born. 2196 01:44:30,070 --> 01:44:35,360 But that's probably not the Steve Carell we want if we don't have his last name. 2197 01:44:35,360 --> 01:44:38,770 So I could alternatively do, well it's Steve and then it starts with a C, 2198 01:44:38,770 --> 01:44:39,460 I think. 2199 01:44:39,460 --> 01:44:42,700 Well, it turns out there's another wild card you can use in SQL. 2200 01:44:42,700 --> 01:44:46,060 We used the asterisk to select all of the columns. 2201 01:44:46,060 --> 01:44:50,800 You can in quotes use a percent sign to say C something. 2202 01:44:50,800 --> 01:44:54,370 So there's 0 or more characters after the letter C. 2203 01:44:54,370 --> 01:44:58,630 And now this doesn't work because now I would be literally looking 2204 01:44:58,630 --> 01:45:01,690 for Steve space C something. 2205 01:45:01,690 --> 01:45:04,910 But recall earlier I mentioned that one other keyword, 2206 01:45:04,910 --> 01:45:06,868 which is for fuzzier matching, so to speak, 2207 01:45:06,868 --> 01:45:09,160 where it's not exactly what you're looking for but it's 2208 01:45:09,160 --> 01:45:10,630 like what you're looking for. 2209 01:45:10,630 --> 01:45:14,350 If you instead say where his name is like Steve space C something, 2210 01:45:14,350 --> 01:45:17,060 now we'll get back a whole bunch of Steves. 2211 01:45:17,060 --> 01:45:19,810 But I think now I could probably find the one I'm actually looking 2212 01:45:19,810 --> 01:45:20,920 for if I don't remember his name. 2213 01:45:20,920 --> 01:45:22,510 You can use multiple percent signs. 2214 01:45:22,510 --> 01:45:25,260 If you forget what his first name is, you could reverse the order. 2215 01:45:25,260 --> 01:45:29,940 But that too is a very powerful SQL feature at that. 2216 01:45:29,940 --> 01:45:32,670 Questions on these queries here? 2217 01:45:32,670 --> 01:45:33,708 Yeah. 2218 01:45:33,708 --> 01:45:35,448 AUDIENCE: [INAUDIBLE] 2219 01:45:35,448 --> 01:45:36,240 DAVID MALAN: Sorry? 2220 01:45:36,240 --> 01:45:38,730 AUDIENCE: [INAUDIBLE] 2221 01:45:38,730 --> 01:45:40,003 DAVID MALAN: What about it? 2222 01:45:40,003 --> 01:45:40,990 AUDIENCE: [INAUDIBLE] 2223 01:45:40,990 --> 01:45:42,400 DAVID MALAN: Oh yeah, sure. 2224 01:45:42,400 --> 01:45:44,500 So the query I used here. 2225 01:45:44,500 --> 01:45:49,810 There's a lot of Steves whose last name starts with C. Oops, too far. 2226 01:45:49,810 --> 01:45:52,760 The last query I executed was this one here. 2227 01:45:52,760 --> 01:45:57,110 So where the name is like quote unquote "Steve C%." 2228 01:45:57,110 --> 01:45:59,950 So that's just another tool for your toolkit here. 2229 01:45:59,950 --> 01:46:02,830 But you'll perhaps have notice that those two-- 2230 01:46:02,830 --> 01:46:05,380 prior to that query, the joins I did were sort of slow. 2231 01:46:05,380 --> 01:46:07,480 And honestly, this database isn't even that big. 2232 01:46:07,480 --> 01:46:10,120 Like yes, it has tens of thousands of rows in it. 2233 01:46:10,120 --> 01:46:13,630 But in the real world and most of the apps you and I use a lot every day 2234 01:46:13,630 --> 01:46:17,080 or websites, there's millions, even billions of rows of data. 2235 01:46:17,080 --> 01:46:22,210 And if I had to wait on my computer here or my code space a second or two 2236 01:46:22,210 --> 01:46:25,990 to get the data, that's not going to work for millions of users or customers 2237 01:46:25,990 --> 01:46:26,680 certainly. 2238 01:46:26,680 --> 01:46:29,020 So how can we actually improve things? 2239 01:46:29,020 --> 01:46:34,330 Well, it turns out another upside of a proper relational database 2240 01:46:34,330 --> 01:46:38,830 is that it's not just a spreadsheet where the onus is on you to find 2241 01:46:38,830 --> 01:46:40,120 the data you're looking for. 2242 01:46:40,120 --> 01:46:43,780 You can also tell the database to index the data for you. 2243 01:46:43,780 --> 01:46:48,670 An index is an efficient cheat sheet for finding data fast. 2244 01:46:48,670 --> 01:46:52,908 Like books in the real world often have indices at the end of the book 2245 01:46:52,908 --> 01:46:54,700 where you can look things up alphabetically 2246 01:46:54,700 --> 01:46:57,790 and then you can cross reference it for the pages that topic appears on. 2247 01:46:57,790 --> 01:46:59,110 Same idea in a database. 2248 01:46:59,110 --> 01:47:03,190 If you tell the database in advance that you want to search on a certain column 2249 01:47:03,190 --> 01:47:06,700 frequently, you can tell it to build a fancy index that will just 2250 01:47:06,700 --> 01:47:08,770 allow you to search that column faster. 2251 01:47:08,770 --> 01:47:11,560 By default, these columns are going to be searched 2252 01:47:11,560 --> 01:47:13,197 most likely by a linear search. 2253 01:47:13,197 --> 01:47:15,280 Not even binary search, because the data might not 2254 01:47:15,280 --> 01:47:17,390 be sorted because it came in any order. 2255 01:47:17,390 --> 01:47:20,170 But if you create an index, you're probably 2256 01:47:20,170 --> 01:47:22,780 going to get something closer to logarithmic than linear, 2257 01:47:22,780 --> 01:47:24,892 and that's going to be a big plus overall. 2258 01:47:24,892 --> 01:47:26,350 So let me do something simple here. 2259 01:47:26,350 --> 01:47:30,670 First let me turn on a SQLite specific feature that just is going to time all 2260 01:47:30,670 --> 01:47:33,190 of my queries by writing .timer on. 2261 01:47:33,190 --> 01:47:37,515 I just want to keep track of how long each of these commands takes. 2262 01:47:37,515 --> 01:47:40,390 This one is not a slow command, so this is just going to be relative. 2263 01:47:40,390 --> 01:47:43,300 But let's just select everything from the shows table 2264 01:47:43,300 --> 01:47:45,970 where the title thereof is The Office. 2265 01:47:45,970 --> 01:47:48,940 Let's see how long this relatively simple query takes. 2266 01:47:48,940 --> 01:47:50,980 All right, not very long at all. 2267 01:47:50,980 --> 01:47:54,850 In real terms less than a second, 0.035 seconds. 2268 01:47:54,850 --> 01:47:57,070 So not slow by any means. 2269 01:47:57,070 --> 01:48:00,160 But if you've got hundreds, thousands, millions of users, 2270 01:48:00,160 --> 01:48:03,040 every one of those milliseconds could very well add up. 2271 01:48:03,040 --> 01:48:04,640 So can we do better? 2272 01:48:04,640 --> 01:48:06,440 Well, we can if I do this. 2273 01:48:06,440 --> 01:48:11,740 If I use syntax like this once in the beginning of the design of my database, 2274 01:48:11,740 --> 01:48:17,980 I create not a table but an index with some name on a specific table on one 2275 01:48:17,980 --> 01:48:18,820 or more columns. 2276 01:48:18,820 --> 01:48:21,370 I can give a clue, a hint to the database in advance 2277 01:48:21,370 --> 01:48:26,260 saying please optimize with some secret sauce searching or selecting 2278 01:48:26,260 --> 01:48:29,740 on this column in this table so that my searches are faster. 2279 01:48:29,740 --> 01:48:30,860 So let me do this. 2280 01:48:30,860 --> 01:48:32,590 Let me go back to VS Code here. 2281 01:48:32,590 --> 01:48:36,460 Let me create an index called how about title index. 2282 01:48:36,460 --> 01:48:40,240 I could call it anything I want, but I want to search faster on titles. 2283 01:48:40,240 --> 01:48:45,940 So I'm going to call this a title index where rather title index on the table 2284 01:48:45,940 --> 01:48:46,900 called shows. 2285 01:48:46,900 --> 01:48:49,150 And then in parentheses is the syntax. 2286 01:48:49,150 --> 01:48:50,620 The column called title. 2287 01:48:50,620 --> 01:48:53,350 So again, I've just borrowed this canonical syntax 2288 01:48:53,350 --> 01:48:57,590 and I've just translated it into something that's TV show specific. 2289 01:48:57,590 --> 01:48:58,090 All right. 2290 01:48:58,090 --> 01:49:00,040 What is this going to do for me? 2291 01:49:00,040 --> 01:49:03,700 Once I hit Enter, this is going to create in the computer's memory, 2292 01:49:03,700 --> 01:49:05,860 the database's memory something called a B-tree. 2293 01:49:05,860 --> 01:49:07,330 It's not a binary tree. 2294 01:49:07,330 --> 01:49:11,627 A B-tree is actually a potentially more efficient data structure 2295 01:49:11,627 --> 01:49:13,960 that we didn't talk about a few weeks back in week five, 2296 01:49:13,960 --> 01:49:15,710 but it looks a little something like this, 2297 01:49:15,710 --> 01:49:17,680 which looks similar to a binary tree. 2298 01:49:17,680 --> 01:49:21,847 But does anyone notice what makes this not a binary tree? 2299 01:49:21,847 --> 01:49:26,405 AUDIENCE: [INAUDIBLE] 2300 01:49:26,405 --> 01:49:27,280 DAVID MALAN: Exactly. 2301 01:49:27,280 --> 01:49:31,570 Binary tree, bi implying two, has no more than two children per node, 2302 01:49:31,570 --> 01:49:33,580 but here's a perfect example, one, two, three. 2303 01:49:33,580 --> 01:49:35,990 And there could be four children, five children or more. 2304 01:49:35,990 --> 01:49:38,530 But the effect of that, if you have a very wide tree, 2305 01:49:38,530 --> 01:49:40,780 the upside is that it's very short. 2306 01:49:40,780 --> 01:49:44,560 It pulls the data higher up closer to the node, to the root node. 2307 01:49:44,560 --> 01:49:48,080 And recall that the root node is where we began our searches in the past, 2308 01:49:48,080 --> 01:49:51,620 whether it was a BST, a Binary Search Tree, even a tri or other data 2309 01:49:51,620 --> 01:49:52,120 structures. 2310 01:49:52,120 --> 01:49:53,380 We always began at the top. 2311 01:49:53,380 --> 01:49:55,780 So the higher up you can pull the data, even 2312 01:49:55,780 --> 01:49:57,617 if it makes the data structure very wide, 2313 01:49:57,617 --> 01:50:00,700 you're going to be able to do boom, boom, boom, look up queries or look up 2314 01:50:00,700 --> 01:50:03,340 data probably much faster certainly than if it's just 2315 01:50:03,340 --> 01:50:06,590 a very long list like a column by default. 2316 01:50:06,590 --> 01:50:09,970 So with that said, let me go back to VS Code. 2317 01:50:09,970 --> 01:50:11,380 I didn't create the index yet. 2318 01:50:11,380 --> 01:50:13,277 Let me go ahead and hit Enter and create it. 2319 01:50:13,277 --> 01:50:14,860 All right, it took a minute, a moment. 2320 01:50:14,860 --> 01:50:17,920 It took like half a second, which obviously is not that slow. 2321 01:50:17,920 --> 01:50:20,380 But with more data, that could have been even slower. 2322 01:50:20,380 --> 01:50:22,600 But it's a one time operation as of now. 2323 01:50:22,600 --> 01:50:27,910 And now let me hit up and let me select the same data from shows 2324 01:50:27,910 --> 01:50:29,350 where title equals The Office. 2325 01:50:29,350 --> 01:50:34,360 Last time just a moment ago it took 0.035 seconds. 2326 01:50:34,360 --> 01:50:38,920 Not slow but also that's going to add up if I have lots of users of IMDb. 2327 01:50:38,920 --> 01:50:42,040 Let's go ahead now and execute the same query again. 2328 01:50:42,040 --> 01:50:44,700 How long did that take? 2329 01:50:44,700 --> 01:50:47,010 0.001 seconds now. 2330 01:50:47,010 --> 01:50:48,720 I mean, practically nothing. 2331 01:50:48,720 --> 01:50:50,970 And so that's the sort of opportunity now. 2332 01:50:50,970 --> 01:50:53,470 When you've got lots of data and you want to really speed up 2333 01:50:53,470 --> 01:50:56,130 these searches, these indexes, these indices that just create 2334 01:50:56,130 --> 01:50:59,130 for you these magical data structures in the databases memory, 2335 01:50:59,130 --> 01:51:02,430 it allows you to search on columns that you are pretty sure you 2336 01:51:02,430 --> 01:51:04,150 want to search on more effectively. 2337 01:51:04,150 --> 01:51:07,380 Now, by contrast, if you've ever used Google or Bing or some search 2338 01:51:07,380 --> 01:51:10,590 engine that has advanced search, some of those text boxes 2339 01:51:10,590 --> 01:51:13,470 that you can search more precisely in might very well be slower. 2340 01:51:13,470 --> 01:51:14,190 Why? 2341 01:51:14,190 --> 01:51:16,950 Well, probably you don't want to go crazy and just index 2342 01:51:16,950 --> 01:51:19,380 every column on every table. 2343 01:51:19,380 --> 01:51:21,020 Why? 2344 01:51:21,020 --> 01:51:25,070 What might be the intuition? 2345 01:51:25,070 --> 01:51:28,490 If logically indexes speed things up, why not index everything? 2346 01:51:28,490 --> 01:51:30,710 There's always going to be a trade off here. 2347 01:51:30,710 --> 01:51:33,040 What might that be? 2348 01:51:33,040 --> 01:51:34,000 Yeah. 2349 01:51:34,000 --> 01:51:34,875 AUDIENCE: [INAUDIBLE] 2350 01:51:34,875 --> 01:51:37,167 DAVID MALAN: Yeah, it's going to take a lot of storage. 2351 01:51:37,167 --> 01:51:38,960 This is just a slide on the screen. 2352 01:51:38,960 --> 01:51:41,020 But this has to go somewhere. 2353 01:51:41,020 --> 01:51:43,360 This needs space in the computer's memory 2354 01:51:43,360 --> 01:51:44,740 or on the hard drive or the like. 2355 01:51:44,740 --> 01:51:48,777 And that's fine if you have unlimited space, but odds are you don't. 2356 01:51:48,777 --> 01:51:51,110 And that's going to get expensive for different reasons. 2357 01:51:51,110 --> 01:51:55,810 So maybe you only want to index certain columns and certain tables 2358 01:51:55,810 --> 01:51:56,860 and not all of them. 2359 01:51:56,860 --> 01:51:57,777 Because you know what? 2360 01:51:57,777 --> 01:51:59,590 What even if a user really wants to search 2361 01:51:59,590 --> 01:52:05,020 maybe via advanced search on some other column or table altogether, fine. 2362 01:52:05,020 --> 01:52:07,240 If once in a while a query is slow, we're 2363 01:52:07,240 --> 01:52:11,110 probably getting the bigger bang for our buck by optimizing the common cases, 2364 01:52:11,110 --> 01:52:16,030 the more popular queries that people actually care about too. 2365 01:52:16,030 --> 01:52:16,640 All right. 2366 01:52:16,640 --> 01:52:22,450 So let's come full circle and bring this now back to how we actually 2367 01:52:22,450 --> 01:52:25,030 began, which was with some Python code. 2368 01:52:25,030 --> 01:52:27,550 So it turns out these are not either or decisions. 2369 01:52:27,550 --> 01:52:29,740 It turns out in the real world, developers 2370 01:52:29,740 --> 01:52:33,320 are constantly using one, two, three languages at once. 2371 01:52:33,320 --> 01:52:36,940 And in fact, next week I rattled off HTML, CSS, and JavaScript, one of which 2372 01:52:36,940 --> 01:52:39,310 is a proper programming language, but those languages 2373 01:52:39,310 --> 01:52:40,510 are often used together. 2374 01:52:40,510 --> 01:52:46,300 Totally normal and common to use Python and SQL or Java and SQL 2375 01:52:46,300 --> 01:52:49,300 or SWIFT and SQL or any number of different combinations 2376 01:52:49,300 --> 01:52:50,710 with a database language. 2377 01:52:50,710 --> 01:52:53,920 You might use your preferred programming language, Java, Python, 2378 01:52:53,920 --> 01:53:00,760 C++ to create the user interface and the logic that implements the program 2379 01:53:00,760 --> 01:53:01,330 itself. 2380 01:53:01,330 --> 01:53:04,092 But for your data, SQL's a really good candidate. 2381 01:53:04,092 --> 01:53:07,300 And indeed, we've seen already that SQL can just speed up certain operations. 2382 01:53:07,300 --> 01:53:08,140 You can change. 2383 01:53:08,140 --> 01:53:11,200 You can collapse 15 lines of code into just one 2384 01:53:11,200 --> 01:53:13,430 and you can use these things together. 2385 01:53:13,430 --> 01:53:14,740 So let me come back to-- 2386 01:53:14,740 --> 01:53:16,720 I'm going to quit out of SQLite. 2387 01:53:16,720 --> 01:53:18,580 I'm going to minimize my terminal window. 2388 01:53:18,580 --> 01:53:22,030 And here's where we left off before with favorites.py. 2389 01:53:22,030 --> 01:53:27,430 With favorites.py, everything was being stored in favorites.csv. 2390 01:53:27,430 --> 01:53:32,260 And recall that we eventually imported that CSV file into favorites.db 2391 01:53:32,260 --> 01:53:35,920 automatically with .import just so we could start playing around with SQL. 2392 01:53:35,920 --> 01:53:37,990 But we can now tie these two together. 2393 01:53:37,990 --> 01:53:40,690 And a way to do that is as follows. 2394 01:53:40,690 --> 01:53:42,700 CS50 has a library for Python. 2395 01:53:42,700 --> 01:53:47,350 You might recall having available get string, get int, get float. 2396 01:53:47,350 --> 01:53:49,360 You don't strictly need to use them in Python 2397 01:53:49,360 --> 01:53:51,970 because it's much easier to just use the input function 2398 01:53:51,970 --> 01:53:55,570 and then try, accept, and convert things to int or float or the like. 2399 01:53:55,570 --> 01:54:00,520 But it's a lot more work to use SQL in Python without a third party library. 2400 01:54:00,520 --> 01:54:03,370 A lot of the commercial options or popular open source options 2401 01:54:03,370 --> 01:54:05,200 are actually just complicated to use. 2402 01:54:05,200 --> 01:54:09,790 So CS50 does have a very useful function inside of its library for Python 2403 01:54:09,790 --> 01:54:12,640 that you should use and must use for the problem set that 2404 01:54:12,640 --> 01:54:18,100 just makes it easy to execute Python, execute SQL inside of your Python code. 2405 01:54:18,100 --> 01:54:22,360 But it's built on top of a very popular open source alternative. 2406 01:54:22,360 --> 01:54:24,350 So you can use that too in the real world. 2407 01:54:24,350 --> 01:54:26,800 So the documentation for that is at this URL here, 2408 01:54:26,800 --> 01:54:30,940 but I'll show you what we need to know here by focusing back on favorites.py. 2409 01:54:30,940 --> 01:54:35,540 So what I'm going to do here is follows is this. 2410 01:54:35,540 --> 01:54:42,550 Let me delete everything from favorites.py except for let's say this. 2411 01:54:42,550 --> 01:54:45,970 From CS50 import SQL in all caps. 2412 01:54:45,970 --> 01:54:49,060 So that's importing a SQL feature from CS50's library 2413 01:54:49,060 --> 01:54:53,200 that's going to allow me to open a DB file in code. 2414 01:54:53,200 --> 01:54:54,380 How do I do that? 2415 01:54:54,380 --> 01:54:56,650 Well, let me create a variable called DB for database, 2416 01:54:56,650 --> 01:54:58,275 though I could call it anything I want. 2417 01:54:58,275 --> 01:55:02,470 Let me call this SQL function and pass in using special syntax that's 2418 01:55:02,470 --> 01:55:03,640 not CS50 specific. 2419 01:55:03,640 --> 01:55:09,640 It's an industry thing. sqlite:///. 2420 01:55:09,640 --> 01:55:14,640 Unlike every other URL you type, this one literally has three in this context 2421 01:55:14,640 --> 01:55:15,140 here. 2422 01:55:15,140 --> 01:55:18,340 And then the name of the database, which in this case is favorites.db. 2423 01:55:18,340 --> 01:55:22,450 So this is just a way of telling this SQL library that we wrote 2424 01:55:22,450 --> 01:55:27,640 but that works exactly like third party alternatives open favorites.db using 2425 01:55:27,640 --> 01:55:30,380 the SQLite technology, if you will. 2426 01:55:30,380 --> 01:55:30,880 All right. 2427 01:55:30,880 --> 01:55:32,338 Let's just ask the user a question. 2428 01:55:32,338 --> 01:55:36,057 Give me your favorite problem. 2429 01:55:36,057 --> 01:55:38,140 So we're going to use input instead of get string, 2430 01:55:38,140 --> 01:55:41,450 but we could use get string, but they're pretty much the same for our purposes. 2431 01:55:41,450 --> 01:55:43,150 Let's ask the user for their favorite. 2432 01:55:43,150 --> 01:55:49,960 And now in Python code, let us select from favorites.db 2433 01:55:49,960 --> 01:55:55,450 all of the rows where students specify that problem as their favorite. 2434 01:55:55,450 --> 01:55:57,610 So in SQL alone, it would be this. 2435 01:55:57,610 --> 01:56:04,210 Select star from favorites where problem equals 2436 01:56:04,210 --> 01:56:08,170 and I'll do, well, whatever my favorite's going to be. 2437 01:56:08,170 --> 01:56:10,288 Like problem equals Mario, for instance. 2438 01:56:10,288 --> 01:56:13,330 So if I were just using SQL, I would literally write something like that. 2439 01:56:13,330 --> 01:56:15,130 But I'm in a .py file now. 2440 01:56:15,130 --> 01:56:16,870 I have to use Python syntax. 2441 01:56:16,870 --> 01:56:18,550 But Python supports strings. 2442 01:56:18,550 --> 01:56:20,090 SQL is just text. 2443 01:56:20,090 --> 01:56:20,990 It's just a string. 2444 01:56:20,990 --> 01:56:24,220 So I could certainly just put my SQL code in a string 2445 01:56:24,220 --> 01:56:26,800 perhaps and then pass it to a Python function. 2446 01:56:26,800 --> 01:56:28,780 And here's the bridge between the two. 2447 01:56:28,780 --> 01:56:31,960 If you just treat SQL as any old text, we can put it in a string 2448 01:56:31,960 --> 01:56:32,840 and execute it. 2449 01:56:32,840 --> 01:56:34,837 So let me actually do this. 2450 01:56:34,837 --> 01:56:36,670 Let me go ahead and create a variable called 2451 01:56:36,670 --> 01:56:40,130 rows, which is eventually going to contain all the rows from the database. 2452 01:56:40,130 --> 01:56:45,790 Let me go ahead and select db.execute. 2453 01:56:45,790 --> 01:56:49,030 This is the one function you need to know about inside of CS50's library, 2454 01:56:49,030 --> 01:56:51,310 and it literally executes a SQL statement. 2455 01:56:51,310 --> 01:56:56,140 And then in quotes, you pass it literally what you want to execute. 2456 01:56:56,140 --> 01:56:58,990 And let me go ahead and close the parenthesis at the end there. 2457 01:56:58,990 --> 01:57:00,740 And now let me just try this. 2458 01:57:00,740 --> 01:57:05,260 So for row in rows, let's iterate over all of the rows, let me go ahead 2459 01:57:05,260 --> 01:57:12,370 and print out how about row, quote unquote. 2460 01:57:12,370 --> 01:57:15,700 And what do I want here? 2461 01:57:15,700 --> 01:57:20,020 Let's print out the timestamp of that person for kicks. 2462 01:57:20,020 --> 01:57:22,120 All right, let me open my terminal window. 2463 01:57:22,120 --> 01:57:23,890 Python of favorites.py. 2464 01:57:23,890 --> 01:57:27,490 Crossing my fingers here for sure. 2465 01:57:27,490 --> 01:57:27,990 Enter. 2466 01:57:27,990 --> 01:57:31,200 2467 01:57:31,200 --> 01:57:31,770 There we go. 2468 01:57:31,770 --> 01:57:32,280 Favorites. 2469 01:57:32,280 --> 01:57:34,500 I'll type in Mario. 2470 01:57:34,500 --> 01:57:35,070 OK. 2471 01:57:35,070 --> 01:57:36,240 So I got back-- 2472 01:57:36,240 --> 01:57:37,980 it's not very interesting, but I got back 2473 01:57:37,980 --> 01:57:40,830 all of the timestamps of students who typed in Mario that we 2474 01:57:40,830 --> 01:57:42,330 imported into this database. 2475 01:57:42,330 --> 01:57:45,360 Well, what I really care about is how popular Mario is. 2476 01:57:45,360 --> 01:57:46,830 So let me change this a little bit. 2477 01:57:46,830 --> 01:57:50,490 Let me change this to count the number of rows. 2478 01:57:50,490 --> 01:57:52,080 And let me keep it simple. 2479 01:57:52,080 --> 01:57:56,260 Let me give an alias like I proposed earlier like as n, where n is a number. 2480 01:57:56,260 --> 01:58:00,090 So that now down here, I can actually just do this. 2481 01:58:00,090 --> 01:58:02,230 Print out the value of n. 2482 01:58:02,230 --> 01:58:02,730 All right. 2483 01:58:02,730 --> 01:58:04,450 Let me go back to my terminal window. 2484 01:58:04,450 --> 01:58:05,910 Run Python to favorites.py. 2485 01:58:05,910 --> 01:58:07,140 Let me type in Mario. 2486 01:58:07,140 --> 01:58:08,070 Enter. 2487 01:58:08,070 --> 01:58:09,660 OK, 39. 2488 01:58:09,660 --> 01:58:11,790 Now, technically I'm cheating. 2489 01:58:11,790 --> 01:58:14,430 Honestly if I'm executing select count, we've 2490 01:58:14,430 --> 01:58:17,752 seen before it only ever returns one row, not multiple. 2491 01:58:17,752 --> 01:58:20,460 So there's really nothing to iterate over, but it's working fine. 2492 01:58:20,460 --> 01:58:22,560 It's just iterating once, but I'm getting lucky. 2493 01:58:22,560 --> 01:58:26,100 So technically what I should probably just do is this. 2494 01:58:26,100 --> 01:58:29,220 I should probably give myself a variable called row, 2495 01:58:29,220 --> 01:58:34,770 set it equal to the very first row and only row that came back, 2496 01:58:34,770 --> 01:58:39,510 and now print out that rows and column. 2497 01:58:39,510 --> 01:58:40,890 Let me rerun the program. 2498 01:58:40,890 --> 01:58:42,130 I'll type in Mario again. 2499 01:58:42,130 --> 01:58:42,660 Enter. 2500 01:58:42,660 --> 01:58:44,980 And I still see 39. 2501 01:58:44,980 --> 01:58:48,328 So of course, I don't strictly need to do this. 2502 01:58:48,328 --> 01:58:49,620 I don't really need a variable. 2503 01:58:49,620 --> 01:58:52,590 I can do rows bracket 0 instead. 2504 01:58:52,590 --> 01:58:54,840 But let me focus on what this library is now doing. 2505 01:58:54,840 --> 01:59:01,620 So per the documentation, what the CS50 execute function always does for you is 2506 01:59:01,620 --> 01:59:05,620 it returns a list of dictionaries. 2507 01:59:05,620 --> 01:59:10,558 So if your query returns nothing, like no matches, you get back an empty list. 2508 01:59:10,558 --> 01:59:12,600 Like open bracket, closed bracket, nothing in it. 2509 01:59:12,600 --> 01:59:14,520 Any loop is not going to execute anything useful, 2510 01:59:14,520 --> 01:59:15,770 because there's nothing in it. 2511 01:59:15,770 --> 01:59:18,660 If, though, you get back one row, you're going 2512 01:59:18,660 --> 01:59:22,740 to get back a list of size one inside of which is a single dictionary. 2513 01:59:22,740 --> 01:59:24,930 That dictionary is going to have keys that 2514 01:59:24,930 --> 01:59:29,200 correspond to whatever you selected, be it the columns or the count. 2515 01:59:29,200 --> 01:59:32,880 So when I selected star before, I would have gotten all of the columns. 2516 01:59:32,880 --> 01:59:34,950 That's how I was able to access timestamp. 2517 01:59:34,950 --> 01:59:37,230 Here I'm just selecting count and I don't 2518 01:59:37,230 --> 01:59:38,910 want to have to type this down here. 2519 01:59:38,910 --> 01:59:40,230 That would just look kind of atrocious. 2520 01:59:40,230 --> 01:59:42,355 It would work, but it would look weird to just keep 2521 01:59:42,355 --> 01:59:44,640 retyping count paren star close paren. 2522 01:59:44,640 --> 01:59:49,530 So I just created an alias called n just to make my life easier or cleaner down 2523 01:59:49,530 --> 01:59:50,380 here. 2524 01:59:50,380 --> 01:59:55,680 So to be clear, the CS50 execute function returns a list of dictionaries 2525 01:59:55,680 --> 01:59:57,360 when you're using select. 2526 01:59:57,360 --> 02:00:01,500 And that is how I can now get back the first and only row 2527 02:00:01,500 --> 02:00:04,860 and then print out that row's end value. 2528 02:00:04,860 --> 02:00:07,380 It is identical to-- 2529 02:00:07,380 --> 02:00:08,650 let me do this. 2530 02:00:08,650 --> 02:00:11,610 Let me highlight this whole line of text. 2531 02:00:11,610 --> 02:00:15,120 Let me in my terminal window run SQLite3 of favorites.db 2532 02:00:15,120 --> 02:00:16,470 like we did before break. 2533 02:00:16,470 --> 02:00:19,170 Let me just copy paste this query. 2534 02:00:19,170 --> 02:00:20,040 Enter. 2535 02:00:20,040 --> 02:00:24,550 That's the table I got back earlier when we played with SQL manually. 2536 02:00:24,550 --> 02:00:28,170 And so when I get back this table, here's the key, here's the value, 2537 02:00:28,170 --> 02:00:31,860 and I only have one row, which is why I'm just blindly indexing 2538 02:00:31,860 --> 02:00:34,080 into rows bracket 0, because I know there's always 2539 02:00:34,080 --> 02:00:35,247 going to be an answer there. 2540 02:00:35,247 --> 02:00:36,690 It's going to be 0 or 1 or more. 2541 02:00:36,690 --> 02:00:40,360 But I know now it's going to be called n because of this here. 2542 02:00:40,360 --> 02:00:41,580 So what have I just done? 2543 02:00:41,580 --> 02:00:43,113 Well, this is SQL down here. 2544 02:00:43,113 --> 02:00:45,780 And this is just me being like a data scientist asking questions 2545 02:00:45,780 --> 02:00:48,640 about my data just using black and white SQL queries. 2546 02:00:48,640 --> 02:00:52,410 This is me now being a Python programmer who wants to talk to a SQL database 2547 02:00:52,410 --> 02:00:53,700 using Python. 2548 02:00:53,700 --> 02:00:56,820 And the bridge we're using happens to be the CS50 library. 2549 02:00:56,820 --> 02:00:59,940 But again, there's third party free libraries you can also use as well. 2550 02:00:59,940 --> 02:01:01,710 Ours is just very simple. 2551 02:01:01,710 --> 02:01:03,660 And indeed, the documentation will explain 2552 02:01:03,660 --> 02:01:07,113 how execute behaves a little differently for inserts, updates, and deletes. 2553 02:01:07,113 --> 02:01:09,780 You don't get back a list because you're not selecting anything, 2554 02:01:09,780 --> 02:01:13,050 but you do get back some return values. 2555 02:01:13,050 --> 02:01:15,990 Questions on this? 2556 02:01:15,990 --> 02:01:18,030 That's the last of our Python code. 2557 02:01:18,030 --> 02:01:23,320 That ties everything together in spirit. 2558 02:01:23,320 --> 02:01:24,768 Yeah? 2559 02:01:24,768 --> 02:01:29,460 AUDIENCE: [INAUDIBLE] 2560 02:01:29,460 --> 02:01:31,810 DAVID MALAN: This one here? 2561 02:01:31,810 --> 02:01:32,320 Yes. 2562 02:01:32,320 --> 02:01:37,558 So db.execute by definition returns a list of rows. 2563 02:01:37,558 --> 02:01:40,600 And each of those rows happens to be a dictionary because its convenient. 2564 02:01:40,600 --> 02:01:42,100 Key value pairs. 2565 02:01:42,100 --> 02:01:45,040 If I'm selecting the count of rows, I just 2566 02:01:45,040 --> 02:01:48,250 know from having learned SQL an hour ago that this is always 2567 02:01:48,250 --> 02:01:52,540 going to give me a single row whose column in this case is called n. 2568 02:01:52,540 --> 02:01:56,470 So if I know it's a single row, I can just blindly, just like in C, 2569 02:01:56,470 --> 02:02:00,160 go into that list or an array in C and go to the first location 2570 02:02:00,160 --> 02:02:02,590 and then treat that as the single row. 2571 02:02:02,590 --> 02:02:04,330 What you don't want to do is this. 2572 02:02:04,330 --> 02:02:08,500 Even if you the human know the query returns one row, 2573 02:02:08,500 --> 02:02:10,780 you can't just magically change the variable name 2574 02:02:10,780 --> 02:02:13,270 to be singular and expect to have only one value. 2575 02:02:13,270 --> 02:02:14,960 You will always have a list. 2576 02:02:14,960 --> 02:02:18,190 So even if there is only one value in it, it's up to you to do something 2577 02:02:18,190 --> 02:02:19,780 like this to get at it. 2578 02:02:19,780 --> 02:02:24,070 Or if you prefer more succinctness, you can do rows bracket I bracket n. 2579 02:02:24,070 --> 02:02:27,220 That'll achieve the same thing without a variable. 2580 02:02:27,220 --> 02:02:29,095 Yeah? 2581 02:02:29,095 --> 02:02:33,970 AUDIENCE: [INAUDIBLE] 2582 02:02:33,970 --> 02:02:34,720 DAVID MALAN: Good. 2583 02:02:34,720 --> 02:02:37,150 So I have been misleading this whole time 2584 02:02:37,150 --> 02:02:40,690 and cheating because this is only ever going to return Mario. 2585 02:02:40,690 --> 02:02:44,530 I'm ignoring the favorite that the human typed in here on line five. 2586 02:02:44,530 --> 02:02:45,770 So let me fix that. 2587 02:02:45,770 --> 02:02:49,000 And that's going to lead us to some of the problems that arise ultimately 2588 02:02:49,000 --> 02:02:49,780 with SQL. 2589 02:02:49,780 --> 02:02:53,320 The right way to solve that problem-- let me get rid of my terminal window 2590 02:02:53,320 --> 02:02:53,890 here. 2591 02:02:53,890 --> 02:02:56,170 The right way to solve this problem is not 2592 02:02:56,170 --> 02:02:59,170 to use an fstring like we did in Python generally, 2593 02:02:59,170 --> 02:03:02,680 because SQL queries, as we'll see in a moment, can be dangerous. 2594 02:03:02,680 --> 02:03:06,940 When you want to plug in users' data into a query 2595 02:03:06,940 --> 02:03:10,540 that you've written most of in advance, you should, you 2596 02:03:10,540 --> 02:03:15,640 must, you had better use a placeholder, namely a question mark in this case. 2597 02:03:15,640 --> 02:03:18,340 This is somewhat specific to CS50's library, 2598 02:03:18,340 --> 02:03:21,910 but we just borrowed the convention that every other library uses too. 2599 02:03:21,910 --> 02:03:25,660 In the world of SQL, single question marks are used as placeholders. 2600 02:03:25,660 --> 02:03:28,190 And the way you do this is as follows. 2601 02:03:28,190 --> 02:03:32,020 If you want to plug-in a value for that question mark, 2602 02:03:32,020 --> 02:03:37,180 just like in printf in C, you specify as a second or a third or fourth argument 2603 02:03:37,180 --> 02:03:39,320 all of the values you want plugged into this. 2604 02:03:39,320 --> 02:03:42,280 So in C weeks ago, we were using %s. 2605 02:03:42,280 --> 02:03:43,810 Same exact idea. 2606 02:03:43,810 --> 02:03:46,810 In SQL it's a question mark that you use instead. 2607 02:03:46,810 --> 02:03:49,480 This now, if I open back my terminal window 2608 02:03:49,480 --> 02:03:55,120 and I run Python of favorites.py, type in Mario, I should still get 39. 2609 02:03:55,120 --> 02:03:58,750 But now I can also type in Scratch perhaps and get 44 2610 02:03:58,750 --> 02:04:00,520 for that very first piece at 0. 2611 02:04:00,520 --> 02:04:02,900 And that one is even more popular here. 2612 02:04:02,900 --> 02:04:04,150 So this now is correct. 2613 02:04:04,150 --> 02:04:09,560 It would work to use an fstring here and then plug in a value like favorite 2614 02:04:09,560 --> 02:04:10,060 here. 2615 02:04:10,060 --> 02:04:11,977 But you'll see in just a moment don't do that. 2616 02:04:11,977 --> 02:04:15,640 You will expose yourself to potential hack or attacks 2617 02:04:15,640 --> 02:04:17,910 by trusting the user's input. 2618 02:04:17,910 --> 02:04:19,660 And so in fact, let's transition from that 2619 02:04:19,660 --> 02:04:21,610 to exactly some of these kinds of challenges, 2620 02:04:21,610 --> 02:04:24,020 namely two before we wrap up. 2621 02:04:24,020 --> 02:04:27,820 So in the world of SQL, especially when it's used at scale with the Twitters 2622 02:04:27,820 --> 02:04:30,610 and the Googles of the world, a lot of data 2623 02:04:30,610 --> 02:04:32,610 is probably coming into the database all at once 2624 02:04:32,610 --> 02:04:34,610 because multiple people are opening their phones 2625 02:04:34,610 --> 02:04:36,070 at the same time around the world. 2626 02:04:36,070 --> 02:04:37,690 They're clicking on the same links roughly 2627 02:04:37,690 --> 02:04:39,107 at the same time around the world. 2628 02:04:39,107 --> 02:04:42,235 When you have thousands of people all using your site at once, 2629 02:04:42,235 --> 02:04:44,110 order of operations is going to be important. 2630 02:04:44,110 --> 02:04:47,620 But unfortunately in SQL and in other contexts of computing, 2631 02:04:47,620 --> 02:04:50,150 there's this risk of what's known as a race condition. 2632 02:04:50,150 --> 02:04:53,710 So for instance, has anyone ever seen or liked this? 2633 02:04:53,710 --> 02:04:55,780 This is the world record egg. 2634 02:04:55,780 --> 02:04:58,467 Or it's this thing that was very popular a while back. 2635 02:04:58,467 --> 02:04:59,800 It's still kind of going strong. 2636 02:04:59,800 --> 02:05:03,340 But if you go to the Instagram profile for World Record Egg, 2637 02:05:03,340 --> 02:05:06,610 the goal was to make the most liked Instagram post ever. 2638 02:05:06,610 --> 02:05:08,000 And they did pretty well. 2639 02:05:08,000 --> 02:05:09,170 It's just this. 2640 02:05:09,170 --> 02:05:10,790 It's just a picture of an egg. 2641 02:05:10,790 --> 02:05:12,570 Now, at the height of the popularity, like 2642 02:05:12,570 --> 02:05:15,070 there might have been hundreds, thousands, tens of thousands 2643 02:05:15,070 --> 02:05:18,520 of people clicking pretty much at the same time on this egg. 2644 02:05:18,520 --> 02:05:20,830 So it actually creates a potential problem 2645 02:05:20,830 --> 02:05:23,470 with the integrity of Instagram's data. 2646 02:05:23,470 --> 02:05:24,010 Why? 2647 02:05:24,010 --> 02:05:26,302 Well, if you have all these requests coming in at once, 2648 02:05:26,302 --> 02:05:28,250 how do you possibly keep track of all of them 2649 02:05:28,250 --> 02:05:31,850 and update your counter in a way that can keep up with all of that traffic? 2650 02:05:31,850 --> 02:05:32,350 Why? 2651 02:05:32,350 --> 02:05:35,680 Well, let's just hypothesize what Meta, formerly Facebook, 2652 02:05:35,680 --> 02:05:38,860 was doing underneath the hood with Instagram if this were their code. 2653 02:05:38,860 --> 02:05:41,890 So suppose for the sake of discussion that Instagram servers 2654 02:05:41,890 --> 02:05:44,617 are using a mix of Python and SQL. 2655 02:05:44,617 --> 02:05:47,200 Probably not using the CS50 library, but they could absolutely 2656 02:05:47,200 --> 02:05:50,680 be using those two languages or two others together. 2657 02:05:50,680 --> 02:05:55,270 Suppose they do this in order to update the number of likes for that post. 2658 02:05:55,270 --> 02:06:00,160 They first execute a SQL query like select the current number of likes 2659 02:06:00,160 --> 02:06:03,580 from a table called posts where the idea of the post 2660 02:06:03,580 --> 02:06:08,530 equals whatever the unique identifier is for that specific egg in the table. 2661 02:06:08,530 --> 02:06:12,460 And then they store the result in this rows variable, just like I did. 2662 02:06:12,460 --> 02:06:14,830 And then they do this. 2663 02:06:14,830 --> 02:06:16,420 They create a variable called likes. 2664 02:06:16,420 --> 02:06:18,700 They set it equal to rows bracket 0. 2665 02:06:18,700 --> 02:06:21,280 So the very first row in the result set. 2666 02:06:21,280 --> 02:06:22,450 And they get the likes key. 2667 02:06:22,450 --> 02:06:25,030 So this is literally what I just did with the count. 2668 02:06:25,030 --> 02:06:27,880 Let me hypothesize that Instagram does something similar 2669 02:06:27,880 --> 02:06:29,330 with the total number of likes. 2670 02:06:29,330 --> 02:06:30,370 Why are they doing this? 2671 02:06:30,370 --> 02:06:34,600 Because they then want to execute a third line of code that executes 2672 02:06:34,600 --> 02:06:36,430 update the posts table. 2673 02:06:36,430 --> 02:06:41,920 Set the new number of likes equal to something where the idea of the post 2674 02:06:41,920 --> 02:06:43,330 equals this other thing. 2675 02:06:43,330 --> 02:06:46,870 Now, notice just like in printf there's the comma separated list of values. 2676 02:06:46,870 --> 02:06:50,560 They want to update the current number of likes from the current value 2677 02:06:50,560 --> 02:06:51,970 to the current value plus 1. 2678 02:06:51,970 --> 02:06:53,350 So it's likes plus 1. 2679 02:06:53,350 --> 02:06:55,640 And then we plug in the ID for this. 2680 02:06:55,640 --> 02:06:57,760 So suppose this is what Instagram is doing. 2681 02:06:57,760 --> 02:07:02,860 Unfortunately, whenever you execute multiple lines of code independently 2682 02:07:02,860 --> 02:07:06,260 and you're so popular like Instagram that you have thousands, 2683 02:07:06,260 --> 02:07:08,530 hundreds of thousands of servers potentially, 2684 02:07:08,530 --> 02:07:11,470 it is quite possible that if you and I and everyone else in the room 2685 02:07:11,470 --> 02:07:13,630 clicks that egg at the same time, it's not 2686 02:07:13,630 --> 02:07:17,093 going to be the case statistically that three lines of code are executed for me 2687 02:07:17,093 --> 02:07:19,510 and then three lines for you and then three lines for you. 2688 02:07:19,510 --> 02:07:21,490 They're probably going to get interspersed. 2689 02:07:21,490 --> 02:07:24,430 This gets executed for me and then this gets executed for you 2690 02:07:24,430 --> 02:07:27,520 and then they get back to doing work for me and so forth just to kind 2691 02:07:27,520 --> 02:07:30,820 of multitask, just like a human might, but at a super speed here. 2692 02:07:30,820 --> 02:07:33,590 The problem, though, is if these lines of code 2693 02:07:33,590 --> 02:07:35,838 get interrupted, what could go wrong? 2694 02:07:35,838 --> 02:07:38,630 Well, suppose that Carter and I both click the egg at the same time 2695 02:07:38,630 --> 02:07:41,780 and suppose the current number of likes back in the day is 100. 2696 02:07:41,780 --> 02:07:45,150 That stores in this variable the value 100. 2697 02:07:45,150 --> 02:07:49,010 But if we click so close in time, we might get back the same answer 2698 02:07:49,010 --> 02:07:50,300 to this select query. 2699 02:07:50,300 --> 02:07:54,380 As of that moment in time when David and Carter clicked, it had 100 likes. 2700 02:07:54,380 --> 02:07:58,220 But then this last line of code is executed for me and then maybe Carter. 2701 02:07:58,220 --> 02:08:03,380 Because that answer, the state of the database, was stored in this variable, 2702 02:08:03,380 --> 02:08:08,870 then both Carter and I will result in this line of code 2703 02:08:08,870 --> 02:08:10,610 being executed with the same value. 2704 02:08:10,610 --> 02:08:17,480 Update the post table setting the likes equal to 101 for that post's ID. 2705 02:08:17,480 --> 02:08:18,080 Why? 2706 02:08:18,080 --> 02:08:21,950 Because again, if each of these lines of code running on different servers 2707 02:08:21,950 --> 02:08:25,213 are checking the value of the current number of likes 2708 02:08:25,213 --> 02:08:28,130 but then getting interrupted because Carter clicked the darn thing too 2709 02:08:28,130 --> 02:08:30,350 and then resuming their work on my behalf, 2710 02:08:30,350 --> 02:08:33,890 we might have a race condition where the code is sort of racing to finish 2711 02:08:33,890 --> 02:08:36,320 but getting interrupted by other users' clicks. 2712 02:08:36,320 --> 02:08:38,720 And the problem with that is that if you are 2713 02:08:38,720 --> 02:08:42,710 inspecting the value of some variable, or in this case a database cell, 2714 02:08:42,710 --> 02:08:45,650 and making a decision based on it, like how to update it, 2715 02:08:45,650 --> 02:08:46,995 you might now lose data. 2716 02:08:46,995 --> 02:08:50,120 And Instagram is probably not good for advertising if they're losing likes. 2717 02:08:50,120 --> 02:08:54,620 And so that's probably a problem not to retain the value 102 2718 02:08:54,620 --> 02:08:57,230 and instead insert the number 101 twice. 2719 02:08:57,230 --> 02:09:00,500 It's actually similar in spirit to a story that 2720 02:09:00,500 --> 02:09:05,960 was told in a databases course I took myself years ago whereby-- 2721 02:09:05,960 --> 02:09:08,900 it's somewhat analogous to kind of a contrived scenario 2722 02:09:08,900 --> 02:09:10,070 involving a refrigerator. 2723 02:09:10,070 --> 02:09:12,380 And this is the closest thing to a refrigerator we could get on stage. 2724 02:09:12,380 --> 02:09:14,420 But imagine you've got one of these little dorm fridges 2725 02:09:14,420 --> 02:09:16,040 in your dorm too and your roommate. 2726 02:09:16,040 --> 02:09:19,102 And maybe both of you, as the story was told to me, really like milk. 2727 02:09:19,102 --> 02:09:21,560 And one of you is at class, but the other of you comes home 2728 02:09:21,560 --> 02:09:25,500 and you open your dorm fridge and you're like, oh darn it, we're out of milk. 2729 02:09:25,500 --> 02:09:27,020 And so you close the fridge. 2730 02:09:27,020 --> 02:09:29,720 You walk across the street to CVS or some other store 2731 02:09:29,720 --> 02:09:31,490 and you get in line to buy some milk. 2732 02:09:31,490 --> 02:09:33,020 Meanwhile, your roommate gets out of class. 2733 02:09:33,020 --> 02:09:34,160 They come back to your dorm room. 2734 02:09:34,160 --> 02:09:35,702 They're really thirsty for some milk. 2735 02:09:35,702 --> 02:09:37,010 They open up the fridge. 2736 02:09:37,010 --> 02:09:38,510 They say, oh, we're out of milk. 2737 02:09:38,510 --> 02:09:41,630 And then they take a different route perhaps to CVS 2738 02:09:41,630 --> 02:09:44,300 or some other store nearby, get in line to buy some milk. 2739 02:09:44,300 --> 02:09:48,260 Fast forward some amount of time in this very contrived story and what happens? 2740 02:09:48,260 --> 02:09:50,570 Oh damn it, we now ended up with two gallons of milk 2741 02:09:50,570 --> 02:09:54,480 and there's no way we can fit gallons of milk in there, let alone two of them. 2742 02:09:54,480 --> 02:09:56,000 So that's a problem. 2743 02:09:56,000 --> 02:10:00,000 But what's the relationship to this here? 2744 02:10:00,000 --> 02:10:02,693 Well both of us, yeah, did what? 2745 02:10:02,693 --> 02:10:07,523 AUDIENCE: [INAUDIBLE] 2746 02:10:07,523 --> 02:10:16,685 2747 02:10:16,685 --> 02:10:17,560 DAVID MALAN: Exactly. 2748 02:10:17,560 --> 02:10:19,925 AUDIENCE: [INAUDIBLE] 2749 02:10:19,925 --> 02:10:22,965 2750 02:10:22,965 --> 02:10:23,840 DAVID MALAN: Exactly. 2751 02:10:23,840 --> 02:10:26,480 So to summarize, both of us had a very similar thought process, 2752 02:10:26,480 --> 02:10:28,820 made a similar decision based on the same information, 2753 02:10:28,820 --> 02:10:31,760 not realizing that the information, the fridge, 2754 02:10:31,760 --> 02:10:34,600 was in the process of being updated. 2755 02:10:34,600 --> 02:10:37,100 And of course, in the Instagram world, it happens like this. 2756 02:10:37,100 --> 02:10:39,300 In the fridge world, it might take a few minutes. 2757 02:10:39,300 --> 02:10:41,840 But the problem is ultimately the result of our 2758 02:10:41,840 --> 02:10:46,070 having made a decision about the state of the world and the state of the world 2759 02:10:46,070 --> 02:10:48,600 was in the middle of being updated. 2760 02:10:48,600 --> 02:10:50,333 The queries got comingled with others. 2761 02:10:50,333 --> 02:10:53,000 Or, in this case, someone was already on their way to the store. 2762 02:10:53,000 --> 02:10:54,960 So what's the solution in the real world? 2763 02:10:54,960 --> 02:10:59,000 Well, you could very simply take a post it note and put like gone for milk 2764 02:10:59,000 --> 02:11:02,060 so as to communicate to your roommate that they should not 2765 02:11:02,060 --> 02:11:04,620 inspect the value of that variable and make a decision on it. 2766 02:11:04,620 --> 02:11:05,120 Why? 2767 02:11:05,120 --> 02:11:08,958 Because it's not yet consistent with the outcome that's about to happen. 2768 02:11:08,958 --> 02:11:11,000 You could be more dramatic and you could actually 2769 02:11:11,000 --> 02:11:13,580 lock the fridge somehow, put a padlock around it 2770 02:11:13,580 --> 02:11:15,470 or the like so they can't even get in there. 2771 02:11:15,470 --> 02:11:18,510 And that would achieve the same effect too. 2772 02:11:18,510 --> 02:11:22,400 And that is actually pretty much the solution to this problem in code too. 2773 02:11:22,400 --> 02:11:23,370 It's not safe. 2774 02:11:23,370 --> 02:11:27,110 It's not sufficient to only execute three lines of code like this. 2775 02:11:27,110 --> 02:11:32,000 Rather, what you probably want to do is use additional SQL keywords 2776 02:11:32,000 --> 02:11:35,270 that we won't spend much time on in the class itself, but these. 2777 02:11:35,270 --> 02:11:37,160 There are solutions to this problem. 2778 02:11:37,160 --> 02:11:39,320 You can begin what's called a transaction 2779 02:11:39,320 --> 02:11:43,460 and you can more explicitly commit to making a decision, 2780 02:11:43,460 --> 02:11:46,100 like updating the database to 101 or 102. 2781 02:11:46,100 --> 02:11:50,120 Or if you realize, wait a minute, Carter's query is interrupting mine. 2782 02:11:50,120 --> 02:11:53,540 Let me roll back to the previous state and just rewind. 2783 02:11:53,540 --> 02:11:54,290 Let me undo. 2784 02:11:54,290 --> 02:11:55,840 Control Z, if you will. 2785 02:11:55,840 --> 02:11:58,340 There's also another keyword that's not so much used anymore 2786 02:11:58,340 --> 02:11:59,600 in SQL which is locking. 2787 02:11:59,600 --> 02:12:01,170 You could literally back in the day. 2788 02:12:01,170 --> 02:12:05,690 Lock the entire database table, preventing anyone from updating it 2789 02:12:05,690 --> 02:12:09,320 or making changes or even reading it while someone else was accessing it. 2790 02:12:09,320 --> 02:12:12,320 That was a very heavy handed solution because it slowed everything down. 2791 02:12:12,320 --> 02:12:15,710 But in short, transactions are now a feature of SQL 2792 02:12:15,710 --> 02:12:17,840 that you won't necessarily need to use yourselves 2793 02:12:17,840 --> 02:12:22,490 that do solve this problem by doing the equivalent of saying while David's 2794 02:12:22,490 --> 02:12:26,360 like counter is in the process of being updated, keep Carter at bay, 2795 02:12:26,360 --> 02:12:29,510 ideally briefly, and then let his data go through too. 2796 02:12:29,510 --> 02:12:32,240 It's equivalent too to putting a note or a lock on the fridge. 2797 02:12:32,240 --> 02:12:33,920 And indeed, I mean lock literally. 2798 02:12:33,920 --> 02:12:36,380 They were once upon a time called and still 2799 02:12:36,380 --> 02:12:41,480 are in some contexts called locks on databases too. 2800 02:12:41,480 --> 02:12:44,730 And the code for which you might do this is almost the same. 2801 02:12:44,730 --> 02:12:51,860 You simply wrap the three queries with a transaction statement and a commit. 2802 02:12:51,860 --> 02:12:57,800 And the term of art here is that this makes your statements atomic. 2803 02:12:57,800 --> 02:13:01,230 So atomic means they're either all executed or not at all. 2804 02:13:01,230 --> 02:13:04,340 That is they're all very tightly coupled together without interruption. 2805 02:13:04,340 --> 02:13:08,480 Transactions solves that problem and avoid having two gallons of milk. 2806 02:13:08,480 --> 02:13:11,360 And the last problem that arises that is tragically 2807 02:13:11,360 --> 02:13:14,510 so darn common in the real world today is 2808 02:13:14,510 --> 02:13:16,173 what's called a SQL injection attack. 2809 02:13:16,173 --> 02:13:18,590 And it's what I alluded to earlier with the question mark. 2810 02:13:18,590 --> 02:13:20,298 So suppose you're in the habit of logging 2811 02:13:20,298 --> 02:13:23,240 into Yale websites with your net ID or password 2812 02:13:23,240 --> 02:13:25,950 or at Harvard, your Harvard key and password as well. 2813 02:13:25,950 --> 02:13:28,370 Suppose for the sake of discussion that the people that 2814 02:13:28,370 --> 02:13:32,360 implemented Harvard key log in allow you to type in your email address, 2815 02:13:32,360 --> 02:13:33,680 of course, and your password. 2816 02:13:33,680 --> 02:13:37,160 But suppose that they are using SQL underneath the hood 2817 02:13:37,160 --> 02:13:39,770 to check your username and password to make sure 2818 02:13:39,770 --> 02:13:43,400 that you are David Malan or Carter Zenke or whoever you claim to be. 2819 02:13:43,400 --> 02:13:48,770 I haven't shown you the syntax yet, but it turns out that in SQL, -- 2820 02:13:48,770 --> 02:13:51,410 is a special way of indicating a comment. 2821 02:13:51,410 --> 02:13:53,100 It means ignore everything to the right. 2822 02:13:53,100 --> 02:13:57,080 So it's just like // in C or the hash symbol in Python. 2823 02:13:57,080 --> 02:13:59,430 -- just means ignore everything to the right. 2824 02:13:59,430 --> 02:14:01,440 And we've, of course, seen single quotes. 2825 02:14:01,440 --> 02:14:04,730 So one way to wage a SQL injection attack 2826 02:14:04,730 --> 02:14:08,870 is to try to inject malicious SQL code into someone else's database 2827 02:14:08,870 --> 02:14:10,220 without them realizing it. 2828 02:14:10,220 --> 02:14:11,220 How do you do this? 2829 02:14:11,220 --> 02:14:15,492 Well, suppose I log in as malan@harvard.edu single quote dash, 2830 02:14:15,492 --> 02:14:16,316 dash. 2831 02:14:16,316 --> 02:14:20,510 I'm not double quoting anything clearly and there's nothing to the right 2832 02:14:20,510 --> 02:14:21,770 of the -- 2833 02:14:21,770 --> 02:14:23,150 anyway. 2834 02:14:23,150 --> 02:14:25,500 But this imbalance is going to be useful. 2835 02:14:25,500 --> 02:14:26,000 Why? 2836 02:14:26,000 --> 02:14:29,630 Because if I'm a hacker and I'm presuming someone at Harvard probably 2837 02:14:29,630 --> 02:14:33,680 is using single quotes to wrap the user's email address 2838 02:14:33,680 --> 02:14:38,360 and wrap the user's password, what if I try to complete their thought for them 2839 02:14:38,360 --> 02:14:40,370 and close one of those quotes for them? 2840 02:14:40,370 --> 02:14:41,850 What might happen? 2841 02:14:41,850 --> 02:14:43,220 Well, we could do this. 2842 02:14:43,220 --> 02:14:46,250 Here for instance, let me hypothesize is the code that Harvard wrote, 2843 02:14:46,250 --> 02:14:48,050 hopefully not, underneath the hood. 2844 02:14:48,050 --> 02:14:51,500 So they're using CS50's library in Python and they're using SQL inside. 2845 02:14:51,500 --> 02:14:54,350 Suppose that they have a query like this. 2846 02:14:54,350 --> 02:15:00,380 Select star from users where username equals question mark and password 2847 02:15:00,380 --> 02:15:01,315 equals question mark. 2848 02:15:01,315 --> 02:15:04,190 And then suppose they just plug in whatever username and password was 2849 02:15:04,190 --> 02:15:05,070 typed in. 2850 02:15:05,070 --> 02:15:08,090 And then if they get back some number of rows dot dot dot, 2851 02:15:08,090 --> 02:15:09,410 they assume I am David. 2852 02:15:09,410 --> 02:15:12,200 They assume Carter is Carter if both the username and password are 2853 02:15:12,200 --> 02:15:12,890 in the database. 2854 02:15:12,890 --> 02:15:14,480 Just end of story there. 2855 02:15:14,480 --> 02:15:15,680 This is good. 2856 02:15:15,680 --> 02:15:18,750 This has the question mark placeholder, as we discussed earlier. 2857 02:15:18,750 --> 02:15:21,140 But what if you don't quite remember that? 2858 02:15:21,140 --> 02:15:24,670 You don't quite take that to heart and you use your more familiar last week 2859 02:15:24,670 --> 02:15:28,690 fstrings whereby we use these curly braces to plug in values. 2860 02:15:28,690 --> 02:15:30,280 What if you do this instead? 2861 02:15:30,280 --> 02:15:31,900 So it's almost the same idea. 2862 02:15:31,900 --> 02:15:34,950 It's still db execute but now it's select star from users 2863 02:15:34,950 --> 02:15:36,120 where username equals. 2864 02:15:36,120 --> 02:15:39,570 And now notice I'm doing the single quotes, which is required by SQL, 2865 02:15:39,570 --> 02:15:41,850 but I'm using fstrings with the curly braces. 2866 02:15:41,850 --> 02:15:46,530 And the password equals single quote password and then close single quote. 2867 02:15:46,530 --> 02:15:50,910 The problem is if you're just blindly pasting effectively 2868 02:15:50,910 --> 02:15:55,530 the user's input into that web form into the username field and the password 2869 02:15:55,530 --> 02:15:59,890 field, there's nothing stopping a malicious user, student, faculty, 2870 02:15:59,890 --> 02:16:03,090 staff from including a single quote in their name. 2871 02:16:03,090 --> 02:16:07,470 Or maybe even benevolently if their name happens 2872 02:16:07,470 --> 02:16:10,440 to have a single quote, as some last names in particular do. 2873 02:16:10,440 --> 02:16:12,230 So this is very fragile. 2874 02:16:12,230 --> 02:16:12,730 Why? 2875 02:16:12,730 --> 02:16:15,090 Well, suppose that if we plug in my malicious, 2876 02:16:15,090 --> 02:16:18,030 malan@harvard.edu single quote -- 2877 02:16:18,030 --> 02:16:20,430 notice what happens to username here. 2878 02:16:20,430 --> 02:16:23,940 The username variable inside of the curly quotes 2879 02:16:23,940 --> 02:16:26,340 will get replaced with this. 2880 02:16:26,340 --> 02:16:29,880 And notice single quote, which the Harvard programmer wrote, 2881 02:16:29,880 --> 02:16:35,969 malan@harvard.edu single quote which I wrote -- 2882 02:16:35,969 --> 02:16:40,170 which I wrote single quote which Harvard wrote and whatever else 2883 02:16:40,170 --> 02:16:41,160 they want after that. 2884 02:16:41,160 --> 02:16:43,524 What's the implication, though, of the dash, dash? 2885 02:16:43,524 --> 02:16:46,722 2886 02:16:46,722 --> 02:16:48,680 Everything to the right is going to be ignored. 2887 02:16:48,680 --> 02:16:51,730 So the password is never even checked in this scenario. 2888 02:16:51,730 --> 02:16:55,855 I'm tricking the server into ignoring everything after the -- 2889 02:16:55,855 --> 02:17:00,650 but I have constructed very cleverly, very maliciously a syntactically valid 2890 02:17:00,650 --> 02:17:01,150 query. 2891 02:17:01,150 --> 02:17:01,650 Why? 2892 02:17:01,650 --> 02:17:03,580 Because I provided the single quote that's 2893 02:17:03,580 --> 02:17:06,670 going to finish the thought of that first single quote. 2894 02:17:06,670 --> 02:17:09,219 And now I would only know how to do this if I saw the code 2895 02:17:09,219 --> 02:17:12,209 or if I just randomly try putting apostrophes into web forms 2896 02:17:12,209 --> 02:17:13,209 and see if things break. 2897 02:17:13,209 --> 02:17:15,215 That's often how adversaries attack systems. 2898 02:17:15,215 --> 02:17:17,590 They type in potentially dangerous characters, hit Enter. 2899 02:17:17,590 --> 02:17:20,510 If something breaks, they're not necessarily into the system, 2900 02:17:20,510 --> 02:17:22,757 but they know that there might be a vulnerability. 2901 02:17:22,757 --> 02:17:25,340 And then they start trying more methodically things like this. 2902 02:17:25,340 --> 02:17:27,850 So this then is going to be bad, because it effectively 2903 02:17:27,850 --> 02:17:29,500 grays out the rest of the query. 2904 02:17:29,500 --> 02:17:32,889 And this query is surely going to return some rows 2905 02:17:32,889 --> 02:17:34,570 without even knowing my password. 2906 02:17:34,570 --> 02:17:37,450 And so this logic here dot dot dot means, well, 2907 02:17:37,450 --> 02:17:39,850 if a data came back from this query, Harvard 2908 02:17:39,850 --> 02:17:43,480 is presumably going to assume that Malan logged in. 2909 02:17:43,480 --> 02:17:47,840 Show him his account or whatever is being protected here. 2910 02:17:47,840 --> 02:17:50,559 So in short, using fstrings bad. 2911 02:17:50,559 --> 02:17:54,280 Using any equivalent like %s in C, bad. 2912 02:17:54,280 --> 02:17:56,559 When it comes to SQL, using question marks 2913 02:17:56,559 --> 02:18:00,280 or whatever a third party library like CS50 prescribes 2914 02:18:00,280 --> 02:18:02,110 is the way to solve this. 2915 02:18:02,110 --> 02:18:02,620 Why? 2916 02:18:02,620 --> 02:18:07,540 Because libraries like ours are designed to at least be smart and be paranoid. 2917 02:18:07,540 --> 02:18:09,740 And what we will do is this. 2918 02:18:09,740 --> 02:18:12,580 When you use the question marks and the values are plugged in, 2919 02:18:12,580 --> 02:18:16,660 we will escape any potentially dangerous characters 2920 02:18:16,660 --> 02:18:18,520 inside of those placeholders. 2921 02:18:18,520 --> 02:18:21,580 And so effectively, the single quote will no longer 2922 02:18:21,580 --> 02:18:23,469 be considered a grammatical single quote. 2923 02:18:23,469 --> 02:18:26,870 It will just be literally a character in the username or password. 2924 02:18:26,870 --> 02:18:30,040 So the library takes care of this for you 2925 02:18:30,040 --> 02:18:33,500 because you're plugging in the username and password as separate arguments. 2926 02:18:33,500 --> 02:18:37,360 And then we or the third party you're using actually sanitize. 2927 02:18:37,360 --> 02:18:41,510 That is clean up the data and prevent those bad characters. 2928 02:18:41,510 --> 02:18:44,650 Now, this is kind of an internet meme that went around for a while. 2929 02:18:44,650 --> 02:18:47,320 If you've ever driven a car or been in a car 2930 02:18:47,320 --> 02:18:49,930 where there's the automatic readers for tolls. 2931 02:18:49,930 --> 02:18:52,969 This person thought it might be funny to try doing something like this. 2932 02:18:52,969 --> 02:18:54,309 What are they presumably doing? 2933 02:18:54,309 --> 02:18:57,520 The presumption here is, whether or not it worked is unclear, 2934 02:18:57,520 --> 02:19:01,570 is that here's the end of actual license plate number, 2935 02:19:01,570 --> 02:19:03,903 but here's an interesting single quote and a semicolon. 2936 02:19:03,903 --> 02:19:06,070 That's especially bad because it means you can maybe 2937 02:19:06,070 --> 02:19:08,059 execute a second query on the database. 2938 02:19:08,059 --> 02:19:11,170 This is someone having fun trying to drop the entire database 2939 02:19:11,170 --> 02:19:16,090 table for whatever municipality is scanning through cameras their license 2940 02:19:16,090 --> 02:19:16,750 plate code. 2941 02:19:16,750 --> 02:19:19,150 And I would be remiss if we didn't end on this note. 2942 02:19:19,150 --> 02:19:22,240 At least in computer science circles, there 2943 02:19:22,240 --> 02:19:26,920 is someone named, no relation to the TF name we put in the database earlier, 2944 02:19:26,920 --> 02:19:31,660 little Bobby Tables, which ends with this XKCD comic. 2945 02:19:31,660 --> 02:19:36,340 And if you chuckle, if you laugh, you're now legit SQL programmers. 2946 02:19:36,340 --> 02:19:39,049 2947 02:19:39,049 --> 02:19:42,030 Nice, nice. 2948 02:19:42,030 --> 02:19:44,740 Every CS student out there knows about little Bobby Table. 2949 02:19:44,740 --> 02:19:47,657 So if you name drop little Bobby Tables now, you're in. 2950 02:19:47,657 --> 02:19:49,240 All right, that's it though for today. 2951 02:19:49,240 --> 02:19:51,780 We will see you next time. 2952 02:19:51,780 --> 02:19:55,130 [MUSIC PLAYING] 2953 02:19:55,130 --> 02:20:28,000