1 00:00:00,000 --> 00:00:03,968 [MUSIC PLAYING] 2 00:00:03,968 --> 00:01:18,020 [MUSIC PLAYING] 3 00:01:18,020 --> 00:01:20,030 DAVID J. MALAN: This is CS50. 4 00:01:20,030 --> 00:01:23,360 And this is week 7, the week, here, of Halloween. 5 00:01:23,360 --> 00:01:26,510 Indeed, special thanks to CS50's own Valerie and her mom 6 00:01:26,510 --> 00:01:30,590 for having created this very festive scenery, and all past ones as well. 7 00:01:30,590 --> 00:01:33,330 Today, we pick up where we left off last time, 8 00:01:33,330 --> 00:01:35,180 which, recall, we introduced Python. 9 00:01:35,180 --> 00:01:38,240 And that was our big transition from C, where suddenly things 10 00:01:38,240 --> 00:01:40,850 started to look new again, probably, syntactically. 11 00:01:40,850 --> 00:01:44,390 But also, probably things hopefully started to feel easier. 12 00:01:44,390 --> 00:01:48,080 Well, with that said, problem set 6 certainly added some challenges, 13 00:01:48,080 --> 00:01:49,380 and you did some new things. 14 00:01:49,380 --> 00:01:53,090 But hopefully you've begun to appreciate that with Python, just a lot more stuff 15 00:01:53,090 --> 00:01:54,290 is easier to do. 16 00:01:54,290 --> 00:01:56,960 You get more out of the box with the language itself. 17 00:01:56,960 --> 00:01:59,450 And that's going to be so useful over the coming weeks 18 00:01:59,450 --> 00:02:03,980 as we transition further to introducing something called databases today, 19 00:02:03,980 --> 00:02:06,270 web programming next week and the week after. 20 00:02:06,270 --> 00:02:08,900 So that by term's end, and perhaps even for your final project, 21 00:02:08,900 --> 00:02:11,870 you really are building something from scratch 22 00:02:11,870 --> 00:02:15,415 using all of these various tools somehow together. 23 00:02:15,415 --> 00:02:17,540 So before we do that, though, today, let's consider 24 00:02:17,540 --> 00:02:22,610 what we weren't really able to do last week, which was actually 25 00:02:22,610 --> 00:02:25,490 create and store data ourselves. 26 00:02:25,490 --> 00:02:30,710 In Python, we've played around with the CSV, comma-separated values library. 27 00:02:30,710 --> 00:02:33,810 And you've been able to read in CSVs from disk, 28 00:02:33,810 --> 00:02:37,880 so to speak, that is, from files in your programming environments. 29 00:02:37,880 --> 00:02:41,455 But we haven't necessarily started saving data, persisting data ourselves. 30 00:02:41,455 --> 00:02:44,330 And that's a huge limitation, because pretty much all of the examples 31 00:02:44,330 --> 00:02:46,288 we've done thus far with a couple of exceptions 32 00:02:46,288 --> 00:02:49,550 have involved my providing input at the keyboard or even vocally. 33 00:02:49,550 --> 00:02:51,508 But then nothing happens to it. 34 00:02:51,508 --> 00:02:53,300 It disappears the moment the program quits, 35 00:02:53,300 --> 00:02:55,410 because it was only being stored in memory. 36 00:02:55,410 --> 00:02:59,090 But today, we'll start to focus all the more on storing things on disk, 37 00:02:59,090 --> 00:03:02,100 that is, storing things in files and folders 38 00:03:02,100 --> 00:03:04,940 so that you can actually write programs that remember 39 00:03:04,940 --> 00:03:06,620 what it is the human did last time. 40 00:03:06,620 --> 00:03:09,320 And ultimately, you can actually make mobile or web apps 41 00:03:09,320 --> 00:03:12,050 that actually begin to grow, and grow, and grow their data sets, 42 00:03:12,050 --> 00:03:15,650 as might happen if you get more and more users, for instance, on a website. 43 00:03:15,650 --> 00:03:19,400 To play, then, with this new capability of being able to write files, 44 00:03:19,400 --> 00:03:22,217 let's go ahead and just collect some data. 45 00:03:22,217 --> 00:03:24,050 In fact, those of you here in person, if you 46 00:03:24,050 --> 00:03:26,960 want to pull up this URL on your phone or laptop, 47 00:03:26,960 --> 00:03:29,060 that's going to lead you to a Google Form. 48 00:03:29,060 --> 00:03:34,130 And that Google Form is going to ask you in just a moment for really 49 00:03:34,130 --> 00:03:35,320 just your favorite TV show. 50 00:03:35,320 --> 00:03:37,070 And it's going to ask you to categorize it 51 00:03:37,070 --> 00:03:41,270 according to a genre, like comedy, or drama, or action, or musical, 52 00:03:41,270 --> 00:03:42,460 or something like that. 53 00:03:42,460 --> 00:03:44,210 And this is useful, because if you've ever 54 00:03:44,210 --> 00:03:47,520 used a Google Form before, or Microsoft's equivalent with Office 365, 55 00:03:47,520 --> 00:03:50,270 it's a really useful mechanism at just collecting data from users, 56 00:03:50,270 --> 00:03:54,000 and then ultimately, putting it into a spreadsheet form. 57 00:03:54,000 --> 00:03:58,160 So this is a screenshot of the form that those of you 58 00:03:58,160 --> 00:04:01,080 here in person or tuning in on Zoom are currently filling out. 59 00:04:01,080 --> 00:04:02,420 It's asking only two questions. 60 00:04:02,420 --> 00:04:04,550 What's the title of your favorite TV show? 61 00:04:04,550 --> 00:04:09,470 And what are one or more genres into which your TV show falls? 62 00:04:09,470 --> 00:04:12,860 And I'll go ahead and pivot now to the view 63 00:04:12,860 --> 00:04:15,667 that I'll be able to see as the person who created this form, which 64 00:04:15,667 --> 00:04:17,209 is quite simply a Google spreadsheet. 65 00:04:17,209 --> 00:04:19,709 Google Forms has this nice feature, if you've never noticed, 66 00:04:19,709 --> 00:04:22,650 that allows you to export your data to a Google Spreadsheet. 67 00:04:22,650 --> 00:04:25,130 And then from there, we can actually grab the file 68 00:04:25,130 --> 00:04:27,500 and download it to my own Mac or your own PC 69 00:04:27,500 --> 00:04:30,320 so that we can actually play around with the data that's come in. 70 00:04:30,320 --> 00:04:32,570 So in fact, let me go ahead and slide over 71 00:04:32,570 --> 00:04:36,540 to this, the live Google Spreadsheet. 72 00:04:36,540 --> 00:04:40,490 And you'll see, probably, a whole bunch of familiar TV shows here, 73 00:04:40,490 --> 00:04:41,360 all coming in. 74 00:04:41,360 --> 00:04:43,820 And if we keep scrolling, and scrolling, and scrolling-- 75 00:04:43,820 --> 00:04:45,290 only 46, 47. 76 00:04:45,290 --> 00:04:47,430 There we go, up to 50 plus already. 77 00:04:47,430 --> 00:04:50,420 If you need that URL again here, if you're just tuning in, 78 00:04:50,420 --> 00:04:52,850 you can go to this URL here. 79 00:04:52,850 --> 00:04:55,760 And in just a moment, we'll have a bunch of data 80 00:04:55,760 --> 00:04:59,450 with which we can start to experiment. 81 00:04:59,450 --> 00:05:01,070 I'll give you a moment or so there. 82 00:05:01,070 --> 00:05:06,330 83 00:05:06,330 --> 00:05:08,418 All right. 84 00:05:08,418 --> 00:05:09,960 Let me hang in there a little longer. 85 00:05:09,960 --> 00:05:11,418 OK, we've got over 100 submissions. 86 00:05:11,418 --> 00:05:12,390 Good. 87 00:05:12,390 --> 00:05:15,270 Good, even more coming in now. 88 00:05:15,270 --> 00:05:16,890 And we can see them coming in live. 89 00:05:16,890 --> 00:05:18,750 Here, let me switch back to the spreadsheet. 90 00:05:18,750 --> 00:05:21,090 The list is growing, and growing, and growing. 91 00:05:21,090 --> 00:05:22,900 And in just a moment-- 92 00:05:22,900 --> 00:05:26,490 let me give Carter a moment to help me export it in real time. 93 00:05:26,490 --> 00:05:29,640 Carter, just give me a heads up when it's reasonable for me 94 00:05:29,640 --> 00:05:32,880 to download this file. 95 00:05:32,880 --> 00:05:35,140 All right, and I'll begin to do this very slowly. 96 00:05:35,140 --> 00:05:37,720 So I'm going to go up to the File menu, if you've never done this before. 97 00:05:37,720 --> 00:05:40,110 Download-- you can download a whole bunch of formats, one in Excel. 98 00:05:40,110 --> 00:05:42,060 But more simply, and the one we'll start to play with here, 99 00:05:42,060 --> 00:05:43,630 is comma-separated values. 100 00:05:43,630 --> 00:05:47,010 So CSV files we used this past week, why are they useful? 101 00:05:47,010 --> 00:05:50,190 Now that you've played with them or used them in past real world, 102 00:05:50,190 --> 00:05:55,260 what's the utility of a CSV file versus something like Excel, for instance? 103 00:05:55,260 --> 00:05:59,010 Why CSV in the first place? 104 00:05:59,010 --> 00:06:00,030 Any instincts? 105 00:06:00,030 --> 00:06:00,858 Yeah? 106 00:06:00,858 --> 00:06:02,730 AUDIENCE: Because it's just a text file? 107 00:06:02,730 --> 00:06:04,605 DAVID J. MALAN: OK, so storage is compelling. 108 00:06:04,605 --> 00:06:07,710 A simple text file with ASCII or Unicode text is probably pretty small. 109 00:06:07,710 --> 00:06:08,290 I like that. 110 00:06:08,290 --> 00:06:10,101 Other thoughts? 111 00:06:10,101 --> 00:06:11,184 AUDIENCE: Structure of it? 112 00:06:11,184 --> 00:06:12,518 DAVID J. MALAN: Yeah, well said. 113 00:06:12,518 --> 00:06:14,760 It's just a simple text format, but using conventions 114 00:06:14,760 --> 00:06:18,330 like commas you can represent the idea of columns using new lines, 115 00:06:18,330 --> 00:06:20,430 backslash ends invisibly at the end of your lines, 116 00:06:20,430 --> 00:06:22,000 you can create the idea of rows. 117 00:06:22,000 --> 00:06:24,000 So it's a very simple way of implementing what 118 00:06:24,000 --> 00:06:26,610 we might call a flat-file database. 119 00:06:26,610 --> 00:06:28,860 It's a way of storing data in a flat, that is, 120 00:06:28,860 --> 00:06:32,310 very simple file that's just pure ASCII or Unicode text. 121 00:06:32,310 --> 00:06:35,340 And more compellingly, I dare say, is that with a CSV file, 122 00:06:35,340 --> 00:06:36,862 it's completely portable. 123 00:06:36,862 --> 00:06:38,820 Something is portable in the world of computing 124 00:06:38,820 --> 00:06:41,910 if it means you can use it on a Mac or a PC running this operating system, 125 00:06:41,910 --> 00:06:42,750 or this other one. 126 00:06:42,750 --> 00:06:45,480 And portability is nice because if I were to download an Excel file, 127 00:06:45,480 --> 00:06:47,760 there'd be a whole bunch of people in this room and online 128 00:06:47,760 --> 00:06:50,190 who couldn't download it because they haven't bought Microsoft Excel 129 00:06:50,190 --> 00:06:50,940 or installed it. 130 00:06:50,940 --> 00:06:55,680 Or if they have a Mac, or if it's a .numbers file in the Mac world, 131 00:06:55,680 --> 00:06:57,610 a PC user might not be able to download it. 132 00:06:57,610 --> 00:06:59,800 So a CSV is indeed very portable. 133 00:06:59,800 --> 00:07:02,940 So I'm going to go ahead and download, quite simply, the CSV 134 00:07:02,940 --> 00:07:04,440 version of this file. 135 00:07:04,440 --> 00:07:06,960 That's going to put it onto my own Mac's Downloads folder. 136 00:07:06,960 --> 00:07:11,460 And let me go ahead here, and in just a moment, let me just simplify the name. 137 00:07:11,460 --> 00:07:14,910 Because it actually downloads it at a pretty large name. 138 00:07:14,910 --> 00:07:18,120 And give me just one moment here, and you'll see that, indeed, 139 00:07:18,120 --> 00:07:21,210 on my Mac I have a file called favorites.csv. 140 00:07:21,210 --> 00:07:22,750 I shortened the name real quick. 141 00:07:22,750 --> 00:07:28,680 And now what I'm going to do is go over to VS Code, and in VS Code, 142 00:07:28,680 --> 00:07:30,480 I'm going to open my File Explorer. 143 00:07:30,480 --> 00:07:34,620 And if I minimize my window here for a moment, a handy feature of VS Code 144 00:07:34,620 --> 00:07:37,980 is that you can just drag and drop a file, for instance, into your Explorer. 145 00:07:37,980 --> 00:07:40,660 And voila, it's going to automatically upload it for you. 146 00:07:40,660 --> 00:07:43,260 So let me go ahead and full screen here, close my Explorer, 147 00:07:43,260 --> 00:07:45,120 temporarily close my Terminal window. 148 00:07:45,120 --> 00:07:48,720 And you'll see here a CSV file, favorites.csv. 149 00:07:48,720 --> 00:07:51,390 And the first row, by convention, has whatever 150 00:07:51,390 --> 00:07:54,720 the columns were in Google Spreadsheets, or Office 365, 151 00:07:54,720 --> 00:07:58,620 in Excel online, timestamp, comma, title, comma, genres. 152 00:07:58,620 --> 00:08:00,390 Then, we have timestamps, which indicates 153 00:08:00,390 --> 00:08:01,722 when people started submitting. 154 00:08:01,722 --> 00:08:03,930 Looks like a couple of people were super eager to get 155 00:08:03,930 --> 00:08:05,430 started an hour or two ago. 156 00:08:05,430 --> 00:08:09,340 And then, you have the title next, after a comma. 157 00:08:09,340 --> 00:08:12,150 But there's kind of a curiosity after that. 158 00:08:12,150 --> 00:08:15,510 Sometimes I see the genre like comedy, comedy, comedy, 159 00:08:15,510 --> 00:08:19,870 but sometimes it's like crime, comma, drama, or action, comma, crime, comma, 160 00:08:19,870 --> 00:08:20,550 drama. 161 00:08:20,550 --> 00:08:22,470 And those things are quoted. 162 00:08:22,470 --> 00:08:24,180 And yet, I didn't do any quotes. 163 00:08:24,180 --> 00:08:25,800 You probably didn't type any quotes. 164 00:08:25,800 --> 00:08:30,180 Where are those quotes coming from in this CSV file? 165 00:08:30,180 --> 00:08:31,650 Why are they there if we infer? 166 00:08:31,650 --> 00:08:32,485 Yeah? 167 00:08:32,485 --> 00:08:35,340 AUDIENCE: [INAUDIBLE] 168 00:08:35,340 --> 00:08:38,308 DAVID J. MALAN: Yeah, so you have a corner case, if you will. 169 00:08:38,308 --> 00:08:40,350 Because if you're using commas, as you described, 170 00:08:40,350 --> 00:08:44,280 to separate your data into what are effectively columns, well, 171 00:08:44,280 --> 00:08:47,010 you've painted yourself into a corner if your actual data 172 00:08:47,010 --> 00:08:48,578 has commas in it itself. 173 00:08:48,578 --> 00:08:51,120 So what Google has done, what Microsoft does, what Apple does 174 00:08:51,120 --> 00:08:54,330 is, they quote any strings of text that themselves 175 00:08:54,330 --> 00:08:58,560 have commas so that these are now English grammatical commas, 176 00:08:58,560 --> 00:09:00,730 not CSV specific commas. 177 00:09:00,730 --> 00:09:03,400 So it's a way of escaping your data, if you will. 178 00:09:03,400 --> 00:09:06,070 And escaping just means to call out a symbol in a special way 179 00:09:06,070 --> 00:09:08,637 so it's not misinterpreted as something else. 180 00:09:08,637 --> 00:09:10,470 All right, so this is all to say that we now 181 00:09:10,470 --> 00:09:13,980 have all of this data with which we can play in the form of what we'll 182 00:09:13,980 --> 00:09:15,840 start calling a flat-file database. 183 00:09:15,840 --> 00:09:19,230 So suppose I wanted to now start manipulating this data, 184 00:09:19,230 --> 00:09:22,110 and I want to store it ultimately, indeed, in this CSV format. 185 00:09:22,110 --> 00:09:24,540 How can I actually start to read this data, 186 00:09:24,540 --> 00:09:26,950 maybe clean it up, maybe do some analytics on it 187 00:09:26,950 --> 00:09:30,570 and actually figure out, what's the most popular show among those who submitted 188 00:09:30,570 --> 00:09:32,190 here over the past few minutes? 189 00:09:32,190 --> 00:09:34,270 Well, let me go ahead and close this. 190 00:09:34,270 --> 00:09:38,970 Let me go ahead, then, and open up, for instance, just my Terminal window. 191 00:09:38,970 --> 00:09:41,970 And let's code up a file called favorites.py. 192 00:09:41,970 --> 00:09:46,110 And let's go ahead and iteratively start simple by just opening up this file 193 00:09:46,110 --> 00:09:47,830 and printing out what's inside of it. 194 00:09:47,830 --> 00:09:51,330 So you might recall that we can do this by doing something like import CSV 195 00:09:51,330 --> 00:09:54,990 to give myself some CSV reading functionality. 196 00:09:54,990 --> 00:09:59,610 Then, I can go ahead and do something like with open, the name of the file 197 00:09:59,610 --> 00:10:02,280 that I want to open in read mode. 198 00:10:02,280 --> 00:10:03,990 Quote, unquote, "r" means to read it. 199 00:10:03,990 --> 00:10:06,600 And then, I can say as file, or whatever other name 200 00:10:06,600 --> 00:10:09,670 for a variable to say that I want to open this file, 201 00:10:09,670 --> 00:10:12,480 and essentially store some kind of reference to it in that variable 202 00:10:12,480 --> 00:10:13,600 called file. 203 00:10:13,600 --> 00:10:16,920 Then, I can give myself a reader, and I can say csv.reader, 204 00:10:16,920 --> 00:10:18,397 passing in that file as input. 205 00:10:18,397 --> 00:10:19,980 And this is the magic of that library. 206 00:10:19,980 --> 00:10:23,190 It deals with the process of opening it, reading it, and giving you 207 00:10:23,190 --> 00:10:26,430 back something that you can just iterate over, like with a for loop 208 00:10:26,430 --> 00:10:30,510 I do want to skip the first row, and recall that I can do this. 209 00:10:30,510 --> 00:10:33,665 Next, reader, is this little trick that just says, ignore the first row. 210 00:10:33,665 --> 00:10:35,040 Because the first one is special. 211 00:10:35,040 --> 00:10:37,410 It said timestamp, title, genres. 212 00:10:37,410 --> 00:10:39,400 That's not your data, that was mine. 213 00:10:39,400 --> 00:10:41,880 But this means now that I've skipped that first row. 214 00:10:41,880 --> 00:10:44,700 Everything hereafter is going to be the title of a show 215 00:10:44,700 --> 00:10:46,260 that you all like, so let me do this. 216 00:10:46,260 --> 00:10:51,090 For row in the reader, let's go ahead and print out the title 217 00:10:51,090 --> 00:10:52,860 of the show each of you typed in. 218 00:10:52,860 --> 00:10:57,000 How do I get at the title of the show each of you typed in? 219 00:10:57,000 --> 00:10:58,740 It's somewhere inside of row. 220 00:10:58,740 --> 00:11:00,790 Row recalls a list. 221 00:11:00,790 --> 00:11:02,910 So what do I want to type next in order to get 222 00:11:02,910 --> 00:11:08,920 at the title of the current row just as a quick check here? 223 00:11:08,920 --> 00:11:11,350 What do I want to type to get at the title of the row, 224 00:11:11,350 --> 00:11:15,520 keeping in mind, again, that it was timestamp, title, genres? 225 00:11:15,520 --> 00:11:16,020 Yeah? 226 00:11:16,020 --> 00:11:16,895 AUDIENCE: [INAUDIBLE] 227 00:11:16,895 --> 00:11:18,910 DAVID J. MALAN: So row bracket 1 would give me 228 00:11:18,910 --> 00:11:22,640 the second column, 0 index, that is, the one in the middle with the title. 229 00:11:22,640 --> 00:11:24,520 So this program isn't that interesting yet, 230 00:11:24,520 --> 00:11:27,370 but it's a quick and dirty way to figure out, all right, what's my data 231 00:11:27,370 --> 00:11:27,820 look like? 232 00:11:27,820 --> 00:11:29,987 Let me actually just do a little bit of a check here 233 00:11:29,987 --> 00:11:32,380 and see if it contains the data I think it does. 234 00:11:32,380 --> 00:11:34,490 Let me maximize my Terminal window here. 235 00:11:34,490 --> 00:11:37,810 Let me run Python of favorites.py, hitting Enter. 236 00:11:37,810 --> 00:11:42,520 And you'll see now a purely textual list of all of the shows 237 00:11:42,520 --> 00:11:44,560 you all seem to like here. 238 00:11:44,560 --> 00:11:47,080 But what's noteworthy about it? 239 00:11:47,080 --> 00:11:49,960 Specific shows aside, judgment aside as to people's 240 00:11:49,960 --> 00:11:54,640 TV tastes, what's interesting or noteworthy about the data that 241 00:11:54,640 --> 00:11:58,090 might create some problems for us if we start to analyze this data, 242 00:11:58,090 --> 00:11:59,740 and figure out what's the most popular? 243 00:11:59,740 --> 00:12:02,590 How many people like this or that? 244 00:12:02,590 --> 00:12:03,340 What do you think? 245 00:12:03,340 --> 00:12:04,151 Yeah? 246 00:12:04,151 --> 00:12:07,458 AUDIENCE: User errors [INAUDIBLE]. 247 00:12:07,458 --> 00:12:09,500 DAVID J. MALAN: Yeah, there might be user errors, 248 00:12:09,500 --> 00:12:12,800 or just stylistic differences that give the appearance that one show 249 00:12:12,800 --> 00:12:14,400 is different from the other. 250 00:12:14,400 --> 00:12:16,970 For instance, here. 251 00:12:16,970 --> 00:12:20,250 Let's see if I can see an example on the screen here. 252 00:12:20,250 --> 00:12:24,340 Yeah, so friends here is an all lowercase, Friends here is capitalized. 253 00:12:24,340 --> 00:12:25,140 No big deal. 254 00:12:25,140 --> 00:12:26,370 We can sort of mitigate that. 255 00:12:26,370 --> 00:12:29,480 But this is just a tiny example of where data in the real world 256 00:12:29,480 --> 00:12:30,758 can get messy fast. 257 00:12:30,758 --> 00:12:32,300 And that probably wasn't even a typo. 258 00:12:32,300 --> 00:12:37,587 It was just someone not caring as much to capitalize it, and that's fine. 259 00:12:37,587 --> 00:12:39,920 Your users are going to type what they're going to type. 260 00:12:39,920 --> 00:12:43,580 So let's see if we can't now begin to get at more specific data, 261 00:12:43,580 --> 00:12:45,500 and maybe even clean some of this data up. 262 00:12:45,500 --> 00:12:50,570 Let me go back into my file called favorites.py here, 263 00:12:50,570 --> 00:12:54,350 and let's actually do something a little more user friendly for me. 264 00:12:54,350 --> 00:12:58,070 Instead of a reader, recall that there was this dictionary reader that's 265 00:12:58,070 --> 00:12:59,750 just a little more user friendly. 266 00:12:59,750 --> 00:13:04,670 And it means I can type in dictionary reader here, passing in the same file. 267 00:13:04,670 --> 00:13:10,670 But now, when I iterate over this reader variable, what is each row? 268 00:13:10,670 --> 00:13:14,060 When using a DictReader instead of a reader, recall, and this 269 00:13:14,060 --> 00:13:17,690 is just a peculiarity of the CSV library, 270 00:13:17,690 --> 00:13:22,530 this gives me back, not a list of cells, but what instead, 271 00:13:22,530 --> 00:13:24,950 which is marginally more user friendly for me? 272 00:13:24,950 --> 00:13:25,460 Yeah? 273 00:13:25,460 --> 00:13:27,585 AUDIENCE: [INAUDIBLE] 274 00:13:27,585 --> 00:13:28,460 DAVID J. MALAN: Yeah. 275 00:13:28,460 --> 00:13:30,860 I can now use open bracket, quotes, and the title. 276 00:13:30,860 --> 00:13:33,770 Because what's coming back now is a dict object, that is, 277 00:13:33,770 --> 00:13:37,010 a dictionary which has keys and values. 278 00:13:37,010 --> 00:13:39,060 The keys of which are the column headings. 279 00:13:39,060 --> 00:13:41,400 The values of which are the data I actually care about. 280 00:13:41,400 --> 00:13:44,030 So this is just marginally better because, one, it's 281 00:13:44,030 --> 00:13:47,330 just way more obvious to me, the author of this code, what it is I'm 282 00:13:47,330 --> 00:13:48,050 getting at. 283 00:13:48,050 --> 00:13:50,310 I don't remember what column the title was. 284 00:13:50,310 --> 00:13:50,810 Was it 0? 285 00:13:50,810 --> 00:13:51,320 Was it 1? 286 00:13:51,320 --> 00:13:51,890 Was it 2? 287 00:13:51,890 --> 00:13:53,450 That's something you're going to forget over time. 288 00:13:53,450 --> 00:13:56,510 And God forbid someone changes the data by just dragging and dropping 289 00:13:56,510 --> 00:13:59,390 the columns in Excel, or Apple Numbers, or Google Spreadsheets. 290 00:13:59,390 --> 00:14:01,978 That's going to break all of your numeric indices. 291 00:14:01,978 --> 00:14:03,770 And so a dictionary reader is arguably just 292 00:14:03,770 --> 00:14:07,310 better design because it's more robust against changes 293 00:14:07,310 --> 00:14:08,960 and potential errors like that. 294 00:14:08,960 --> 00:14:12,560 Now the effect of this change isn't going to be really any different. 295 00:14:12,560 --> 00:14:16,820 If I run Python of favorites.py, voila, I get all of the same results. 296 00:14:16,820 --> 00:14:21,290 But I've now not made any assumptions as to where each of the columns 297 00:14:21,290 --> 00:14:22,920 actually is numerically. 298 00:14:22,920 --> 00:14:23,420 All right. 299 00:14:23,420 --> 00:14:26,330 Well, let's go ahead and now filter out some duplicates. 300 00:14:26,330 --> 00:14:29,960 Because there's a lot of commonality among some of the shows here, so let's 301 00:14:29,960 --> 00:14:32,660 see if we can't filter out duplicates. 302 00:14:32,660 --> 00:14:38,900 If I'm reading a CSV file top to bottom, what intuitively might be the logic 303 00:14:38,900 --> 00:14:41,650 I want to implement to filter out duplicates? 304 00:14:41,650 --> 00:14:44,900 It's not going to be quite as simple as a simple function that does it for me. 305 00:14:44,900 --> 00:14:47,300 I'm going to have to build this. 306 00:14:47,300 --> 00:14:50,630 But logically, if you're reading a file from top to bottom, 307 00:14:50,630 --> 00:14:54,860 how might you go about, in Python or just any context, 308 00:14:54,860 --> 00:14:58,040 getting rid of duplicate values? 309 00:14:58,040 --> 00:14:59,862 Yeah, what do you think? 310 00:14:59,862 --> 00:15:05,735 AUDIENCE: [INAUDIBLE] 311 00:15:05,735 --> 00:15:06,610 DAVID J. MALAN: Sure. 312 00:15:06,610 --> 00:15:09,730 I could use a list and I could add each title to the list, 313 00:15:09,730 --> 00:15:13,070 but first check if I put this into the list before. 314 00:15:13,070 --> 00:15:15,140 So let's try a little something like that. 315 00:15:15,140 --> 00:15:18,340 Let me go ahead and create a variable at the top of my program here. 316 00:15:18,340 --> 00:15:21,340 I'll call it titles, for instance, initialize to an empty list, 317 00:15:21,340 --> 00:15:23,050 open bracket, close bracket. 318 00:15:23,050 --> 00:15:27,860 And then, inside of my loop here, instead of printing it out, 319 00:15:27,860 --> 00:15:29,630 let's start to make a decision. 320 00:15:29,630 --> 00:15:39,332 So if the current row's title is in the titles list 321 00:15:39,332 --> 00:15:40,540 I don't want to put it there. 322 00:15:40,540 --> 00:15:43,582 And actually, let me invert the logic so I'm doing something proactively. 323 00:15:43,582 --> 00:15:48,190 So if it's not the case that row bracket title is in titles, 324 00:15:48,190 --> 00:15:56,200 then, go ahead and do something like titles.append the current row's title. 325 00:15:56,200 --> 00:15:59,170 And recall that we saw .append a week or so ago, 326 00:15:59,170 --> 00:16:01,700 where it just allows you to append to the current list. 327 00:16:01,700 --> 00:16:04,810 And then, what can I do at the very end, after I'm all 328 00:16:04,810 --> 00:16:06,460 done reading the whole file? 329 00:16:06,460 --> 00:16:09,820 Why don't I go ahead and say, for title in titles, 330 00:16:09,820 --> 00:16:12,730 go ahead and print out the current title? 331 00:16:12,730 --> 00:16:16,700 So it's two loops now, and we can come back to the quality of that design. 332 00:16:16,700 --> 00:16:19,373 But let me go ahead here and rerun Python of favorites.py. 333 00:16:19,373 --> 00:16:22,540 Let me increase the size of my Terminal window so we can focus just on this, 334 00:16:22,540 --> 00:16:23,770 and hit Enter. 335 00:16:23,770 --> 00:16:28,120 And now, I'm just skimming. 336 00:16:28,120 --> 00:16:31,090 I don't think I'm seeing duplicates, although I 337 00:16:31,090 --> 00:16:34,000 am seeing some near duplicates. 338 00:16:34,000 --> 00:16:36,760 For instance, there's Friends again. 339 00:16:36,760 --> 00:16:40,060 And if we keep going, and going, and going, and going, 340 00:16:40,060 --> 00:16:41,260 there's Friends again. 341 00:16:41,260 --> 00:16:47,070 Oh, interesting, so that's curious that I seem to have multiple Friends, 342 00:16:47,070 --> 00:16:48,630 and I have this one here, too. 343 00:16:48,630 --> 00:16:51,120 So how might we clean this up further? 344 00:16:51,120 --> 00:16:53,393 I like your instincts, and it's a step closer to it. 345 00:16:53,393 --> 00:16:55,560 What are we going to have to do to really filter out 346 00:16:55,560 --> 00:16:57,360 those near duplicates? 347 00:16:57,360 --> 00:16:59,146 Any thoughts? 348 00:16:59,146 --> 00:17:04,045 AUDIENCE: You could set everything to lower [INAUDIBLE].. 349 00:17:04,045 --> 00:17:04,920 DAVID J. MALAN: Yeah. 350 00:17:04,920 --> 00:17:06,670 What are the common mistakes to summarize? 351 00:17:06,670 --> 00:17:09,405 We could ignore the capitalization altogether and maybe 352 00:17:09,405 --> 00:17:12,030 just force everything to lowercase, or everything to uppercase. 353 00:17:12,030 --> 00:17:13,920 Doesn't matter which, but let's just be consistent. 354 00:17:13,920 --> 00:17:16,795 And for those of you who might have accidentally or instinctively hit 355 00:17:16,795 --> 00:17:19,650 the spacebar at the beginning of your input or even at the end, 356 00:17:19,650 --> 00:17:21,180 we can strip that off, too. 357 00:17:21,180 --> 00:17:24,490 Stripping whitespace is a common thing just to clean up user input. 358 00:17:24,490 --> 00:17:27,869 So let me go back into my code here, and let me go ahead 359 00:17:27,869 --> 00:17:30,540 and tweak the title a little bit. 360 00:17:30,540 --> 00:17:33,360 Let me say that the current title inside of this loop 361 00:17:33,360 --> 00:17:36,300 is not going to be just the current row's title. 362 00:17:36,300 --> 00:17:40,440 But let me go ahead and strip off, from the left and the right implicitly, 363 00:17:40,440 --> 00:17:41,260 any whitespace. 364 00:17:41,260 --> 00:17:44,260 If you read the documentation for the strip function, it does just that. 365 00:17:44,260 --> 00:17:47,230 It gets rid of whitespace to the left, whitespace to the right. 366 00:17:47,230 --> 00:17:50,310 And then, if I want to force everything to maybe uppercase, 367 00:17:50,310 --> 00:17:52,710 I can just uppercase the entire string. 368 00:17:52,710 --> 00:17:56,310 And remember, what's handy about Python is you can chain some of these function 369 00:17:56,310 --> 00:17:59,550 calls together by just using dots again and again. 370 00:17:59,550 --> 00:18:01,500 And that just takes whatever just happened, 371 00:18:01,500 --> 00:18:04,320 like the whitespace got stripped off, then, it additionally 372 00:18:04,320 --> 00:18:06,280 uppercases the whole thing as well. 373 00:18:06,280 --> 00:18:10,980 So now, I'm going to just check whether this specific title is in titles. 374 00:18:10,980 --> 00:18:14,760 And if not, I'm going to go ahead and append that title, 375 00:18:14,760 --> 00:18:17,200 massaged into this different format, if you will. 376 00:18:17,200 --> 00:18:19,350 So I'm throwing away some information. 377 00:18:19,350 --> 00:18:23,760 I'm sacrificing all of the nuances of your grammar and input 378 00:18:23,760 --> 00:18:25,020 to the form itself. 379 00:18:25,020 --> 00:18:27,480 But at least I'm trying to canonicalize size, that is, 380 00:18:27,480 --> 00:18:29,950 standardize what the data actually looks like. 381 00:18:29,950 --> 00:18:33,810 So let me go ahead and run Python of favorites.py again and hit Enter. 382 00:18:33,810 --> 00:18:35,412 Oh, and this is just user error. 383 00:18:35,412 --> 00:18:36,870 Maybe you haven't seen this before. 384 00:18:36,870 --> 00:18:40,870 This just looks like a mistake on my part. 385 00:18:40,870 --> 00:18:43,200 I meant to say not even uppercase. 386 00:18:43,200 --> 00:18:44,202 That's completely wrong. 387 00:18:44,202 --> 00:18:46,410 The function is called upper, now that I think of it. 388 00:18:46,410 --> 00:18:46,830 All right. 389 00:18:46,830 --> 00:18:49,330 Let's go and increase the size of the Terminal window again. 390 00:18:49,330 --> 00:18:50,820 Run Python of favorites.py. 391 00:18:50,820 --> 00:18:55,530 And now, it's a little more overwhelming to look at because it's not sorted yet 392 00:18:55,530 --> 00:18:57,300 and it's all capitalized. 393 00:18:57,300 --> 00:19:03,300 But I don't think I'm seeing multiple Friends, so to speak. 394 00:19:03,300 --> 00:19:06,455 There's one Friends up here and that's it. 395 00:19:06,455 --> 00:19:07,830 I'm back up at my prompt already. 396 00:19:07,830 --> 00:19:10,270 So we seem now to be filtering out duplicates. 397 00:19:10,270 --> 00:19:13,582 Now, before we dive in further and clean this up further than this, 398 00:19:13,582 --> 00:19:14,790 what else could we have done? 399 00:19:14,790 --> 00:19:17,100 Well, it turns out that in Python 2 you often 400 00:19:17,100 --> 00:19:19,350 do get a lot of functionality built into the language. 401 00:19:19,350 --> 00:19:22,313 And I'm kind of implementing myself the idea of a set. 402 00:19:22,313 --> 00:19:24,480 If you think back to mathematics, a set is typically 403 00:19:24,480 --> 00:19:28,470 something with a bunch of values that has duplicates filtered out. 404 00:19:28,470 --> 00:19:31,290 Recall that Python already has this for us. 405 00:19:31,290 --> 00:19:34,200 And we saw it really briefly when I whipped up the dictionary 406 00:19:34,200 --> 00:19:36,190 implementation a couple of weeks back. 407 00:19:36,190 --> 00:19:40,980 So I could actually define my titles to be a set instead of a list, 408 00:19:40,980 --> 00:19:46,080 and this would just modestly allow me to refine my code here, such 409 00:19:46,080 --> 00:19:48,870 that I don't have to bother checking for duplicates anyway. 410 00:19:48,870 --> 00:19:52,710 I can instead just say something like, titles.add 411 00:19:52,710 --> 00:19:55,807 the current title, like this. 412 00:19:55,807 --> 00:19:58,890 Marginally better design if you know that a set exists because you're just 413 00:19:58,890 --> 00:20:00,753 getting more functionality out of this. 414 00:20:00,753 --> 00:20:02,670 All right, so let's clean the data up further. 415 00:20:02,670 --> 00:20:06,150 We've now gone ahead and fixed the problem of case sensitivity. 416 00:20:06,150 --> 00:20:08,940 We threw away whitespace in case someone had hit the spacebar 417 00:20:08,940 --> 00:20:09,990 with some of the input. 418 00:20:09,990 --> 00:20:13,960 Let's go ahead now and sort these things by the titles themselves. 419 00:20:13,960 --> 00:20:17,280 So instead of just printing out the titles in the same order 420 00:20:17,280 --> 00:20:22,142 you all inputted them, but filtering out duplicates as we go, let me go ahead 421 00:20:22,142 --> 00:20:24,600 and use another function in Python you might not have seen, 422 00:20:24,600 --> 00:20:26,940 which is literally called sorted, and will 423 00:20:26,940 --> 00:20:31,900 take care of the process of actually sorting titles for you. 424 00:20:31,900 --> 00:20:34,440 Let me go ahead and increase the font size of my Terminal, 425 00:20:34,440 --> 00:20:36,570 run Python of favorites.py, and hit Enter. 426 00:20:36,570 --> 00:20:40,410 And now you can really see how many of these shows start with the word "the" 427 00:20:40,410 --> 00:20:41,220 or do not. 428 00:20:41,220 --> 00:20:43,590 Now it's a little easier to wrap our minds around, 429 00:20:43,590 --> 00:20:46,650 just because it's at least sorted alphabetically. 430 00:20:46,650 --> 00:20:50,220 But now you can really see some of the differences in people's inputs. 431 00:20:50,220 --> 00:20:51,850 So far, so good. 432 00:20:51,850 --> 00:20:56,400 But a few of you decided to stylize Avatar in three different ways here. 433 00:20:56,400 --> 00:20:59,520 Brooklyn 99 is a couple of different ways here. 434 00:20:59,520 --> 00:21:02,700 And I think if we keep going we'll see further and further variances that we 435 00:21:02,700 --> 00:21:06,460 did not fix by focusing on whitespace and capitalization alone. 436 00:21:06,460 --> 00:21:09,870 So already here, this is only, what, 100 plus, 200 rows. 437 00:21:09,870 --> 00:21:12,795 Already real-world data starts to get messy quickly, 438 00:21:12,795 --> 00:21:14,670 and that might not bode well when we actually 439 00:21:14,670 --> 00:21:16,650 want to keep around real data from real users. 440 00:21:16,650 --> 00:21:19,025 You can imagine an actual website or a mobile application 441 00:21:19,025 --> 00:21:21,660 dealing with this kind of thing on scale. 442 00:21:21,660 --> 00:21:23,080 Well, let's go ahead and do this. 443 00:21:23,080 --> 00:21:26,370 Let's actually figure out the popularity of these various shows 444 00:21:26,370 --> 00:21:31,680 by now iterating over my data, and keeping track of how many of you 445 00:21:31,680 --> 00:21:33,000 inputted a given title. 446 00:21:33,000 --> 00:21:37,890 We're going to ignore the problems like Brooklyn 99 and the Avatar. 447 00:21:37,890 --> 00:21:42,390 Sorry, yeah, Avatar, where there was things 448 00:21:42,390 --> 00:21:47,070 that were different beyond just whitespace and capitalization. 449 00:21:47,070 --> 00:21:48,840 But let's go ahead and keep track of, now, 450 00:21:48,840 --> 00:21:52,290 how many of you inputted each of these titles. 451 00:21:52,290 --> 00:21:53,448 So how can I do this? 452 00:21:53,448 --> 00:21:55,740 I'm still going to take this approach of iterating over 453 00:21:55,740 --> 00:21:58,110 the CSV file from top to bottom. 454 00:21:58,110 --> 00:22:00,360 We've used a couple of data structures thus far, 455 00:22:00,360 --> 00:22:04,260 a list to keep track of titles, or a set to keep track of titles. 456 00:22:04,260 --> 00:22:07,350 But what if I now want to keep around a little more information? 457 00:22:07,350 --> 00:22:13,110 For each title, I want to keep around how many times I've seen it before. 458 00:22:13,110 --> 00:22:14,130 I'm not doing that yet. 459 00:22:14,130 --> 00:22:17,880 I'm throwing away the total number of times I see these shows. 460 00:22:17,880 --> 00:22:20,520 How could I start to keep that around? 461 00:22:20,520 --> 00:22:21,845 AUDIENCE: Use a dictionary. 462 00:22:21,845 --> 00:22:23,970 DAVID J. MALAN: We could use a dictionary, and how? 463 00:22:23,970 --> 00:22:25,260 Elaborate on that. 464 00:22:25,260 --> 00:22:26,135 AUDIENCE: [INAUDIBLE] 465 00:22:26,135 --> 00:22:28,093 DAVID J. MALAN: Perfect, really good instincts. 466 00:22:28,093 --> 00:22:29,850 Using a dictionary, insofar as it lets us 467 00:22:29,850 --> 00:22:33,300 store keys and values, that is, associate something with something 468 00:22:33,300 --> 00:22:33,850 else. 469 00:22:33,850 --> 00:22:36,480 This is why a dictionary or hash tables more 470 00:22:36,480 --> 00:22:39,630 generally are such a useful, practical data structure. 471 00:22:39,630 --> 00:22:43,120 Because they just let you remember stuff in some kind of structured way. 472 00:22:43,120 --> 00:22:45,870 So if the keys are going to be the titles I've seen, 473 00:22:45,870 --> 00:22:49,860 the values could be the number of times I've seen each of those titles. 474 00:22:49,860 --> 00:22:54,308 And so it's kind of like just having a two-column table on paper. 475 00:22:54,308 --> 00:22:56,850 For instance, if I were going to do this on a piece of paper, 476 00:22:56,850 --> 00:22:59,340 I might just have two columns here, where 477 00:22:59,340 --> 00:23:04,590 maybe this is the title that I've seen, and this is the count over here. 478 00:23:04,590 --> 00:23:08,250 This is, in effect, a dictionary in Python. 479 00:23:08,250 --> 00:23:11,490 It's two columns, keys on the left, values on the right. 480 00:23:11,490 --> 00:23:13,620 And this, if I can implement in code, will actually 481 00:23:13,620 --> 00:23:17,580 allow me to store this data, and then maybe do some simple arithmetic 482 00:23:17,580 --> 00:23:19,450 to figure out which is the most popular. 483 00:23:19,450 --> 00:23:20,380 So let's do this. 484 00:23:20,380 --> 00:23:24,240 Let me go ahead and change my titles to not be a list, not be a set. 485 00:23:24,240 --> 00:23:29,160 Let's have it be a dictionary instead, either doing this, or more succinctly, 486 00:23:29,160 --> 00:23:33,600 two curly braces that are empty gives me an empty dictionary automatically. 487 00:23:33,600 --> 00:23:34,860 What do I now want to do? 488 00:23:34,860 --> 00:23:37,440 I think most of my code can stay the same. 489 00:23:37,440 --> 00:23:40,860 But down here, I don't want to just blindly add titles 490 00:23:40,860 --> 00:23:42,180 to the data structure. 491 00:23:42,180 --> 00:23:45,060 I somehow need to keep track of the count. 492 00:23:45,060 --> 00:23:48,690 And unfortunately, if I just do this-- let's do titles, 493 00:23:48,690 --> 00:23:53,580 bracket, title, plus equals 1. 494 00:23:53,580 --> 00:23:56,130 This is a reasonable first attempt at this. 495 00:23:56,130 --> 00:23:57,420 Because what am I doing? 496 00:23:57,420 --> 00:24:02,730 If titles is a dictionary and I want to look up the current title therein, 497 00:24:02,730 --> 00:24:05,520 the syntax for that, like before, is titles, bracket, and then 498 00:24:05,520 --> 00:24:08,880 the key you want to use to index into the dictionary. 499 00:24:08,880 --> 00:24:12,000 It's not a number in this case, it's an actual word, a title. 500 00:24:12,000 --> 00:24:14,220 And you're just going to increment it by one, 501 00:24:14,220 --> 00:24:17,010 and then eventually I'll come back and finish my second loop 502 00:24:17,010 --> 00:24:19,710 and do things in terms of the order. 503 00:24:19,710 --> 00:24:23,580 But for now, let's just keep track of the total counts. 504 00:24:23,580 --> 00:24:25,660 Let me go ahead and increase my Terminal window. 505 00:24:25,660 --> 00:24:29,590 Let me do Python of favorites.py and hit Enter. 506 00:24:29,590 --> 00:24:30,090 Huh. 507 00:24:30,090 --> 00:24:34,140 How I Met Your Mother is giving me a key error. 508 00:24:34,140 --> 00:24:36,750 What does that mean? 509 00:24:36,750 --> 00:24:39,120 And why am I seeing this? 510 00:24:39,120 --> 00:24:42,330 And in fact, just to give a little bit of a breadcrumb here, 511 00:24:42,330 --> 00:24:44,140 let me zoom out here. 512 00:24:44,140 --> 00:24:46,920 Let me open up the CSV file again real quickly. 513 00:24:46,920 --> 00:24:50,460 And wow, we didn't even get past the second row in the file 514 00:24:50,460 --> 00:24:52,210 or the first show in the file. 515 00:24:52,210 --> 00:24:54,840 Notice that How I Met Your Mother, somewhat lowercased, 516 00:24:54,840 --> 00:24:57,562 is the very first show in therein. 517 00:24:57,562 --> 00:24:59,520 What's your instinct for why this is happening? 518 00:24:59,520 --> 00:25:01,750 AUDIENCE: You don't have a starting point. 519 00:25:01,750 --> 00:25:03,667 DAVID J. MALAN: I don't have a starting point. 520 00:25:03,667 --> 00:25:04,920 I'm adding one to what? 521 00:25:04,920 --> 00:25:09,840 I'm blindly indexing into the dictionary using a key, How I Met Your Mother, 522 00:25:09,840 --> 00:25:11,880 that doesn't yet exist in the dictionary. 523 00:25:11,880 --> 00:25:14,610 And so Python throws what's called a key error 524 00:25:14,610 --> 00:25:17,440 because the key you're trying to use just doesn't exist yet. 525 00:25:17,440 --> 00:25:21,000 So logically, how could we fix this? 526 00:25:21,000 --> 00:25:21,690 We're close. 527 00:25:21,690 --> 00:25:25,110 We got half of the problem solved, but I'm not handling the obvious, now, 528 00:25:25,110 --> 00:25:26,850 case of nothing being there. 529 00:25:26,850 --> 00:25:27,595 Yeah? 530 00:25:27,595 --> 00:25:28,930 AUDIENCE: Creating a counter. 531 00:25:28,930 --> 00:25:29,620 DAVID J. MALAN: Creating a-- 532 00:25:29,620 --> 00:25:30,370 AUDIENCE: Counter. 533 00:25:30,370 --> 00:25:32,240 DAVID J. MALAN: Creating the counter itself. 534 00:25:32,240 --> 00:25:34,190 So maybe I could do something like this. 535 00:25:34,190 --> 00:25:38,380 Let me close my Terminal window and let me ask a question first. 536 00:25:38,380 --> 00:25:44,980 If the current title is in the dictionary already, if title in titles, 537 00:25:44,980 --> 00:25:47,440 that's going to give me a true-false answer it turns out. 538 00:25:47,440 --> 00:25:52,240 Then, I can safely say, titles, bracket, title, plus equals 1. 539 00:25:52,240 --> 00:25:56,740 And recall, this is just shorthand notation for the same thing as in C, 540 00:25:56,740 --> 00:25:58,300 title plus 1. 541 00:25:58,300 --> 00:25:59,360 Whoops, typo. 542 00:25:59,360 --> 00:26:00,168 Don't do that. 543 00:26:00,168 --> 00:26:02,710 That's the same thing as this but it's a little more succinct 544 00:26:02,710 --> 00:26:04,810 just to say plus equals 1. 545 00:26:04,810 --> 00:26:09,490 Else, if it's logically not the case that the current title is in the titles 546 00:26:09,490 --> 00:26:13,610 dictionary, then I probably want to say titles, bracket, title equals? 547 00:26:13,610 --> 00:26:15,315 Feel free to just shout it out. 548 00:26:15,315 --> 00:26:15,940 AUDIENCE: Zero. 549 00:26:15,940 --> 00:26:16,815 DAVID J. MALAN: Zero. 550 00:26:16,815 --> 00:26:21,320 I just have to put some value there so that the key itself is also there. 551 00:26:21,320 --> 00:26:21,820 All right. 552 00:26:21,820 --> 00:26:24,370 So now that I've got this going on, let me go ahead 553 00:26:24,370 --> 00:26:26,620 and undo my sorting temporarily. 554 00:26:26,620 --> 00:26:29,650 And now let me go ahead and do this. 555 00:26:29,650 --> 00:26:33,460 I can, as a quick check, let me go ahead and just run the code 556 00:26:33,460 --> 00:26:35,050 as is, Python of favorites.py. 557 00:26:35,050 --> 00:26:37,030 I'm back in business. 558 00:26:37,030 --> 00:26:39,790 It's printing correctly, no key errors, but it's not sorted. 559 00:26:39,790 --> 00:26:41,620 And I'm not seeing any of the counts. 560 00:26:41,620 --> 00:26:43,780 Let me just quickly add the counts, and there's 561 00:26:43,780 --> 00:26:45,530 a couple of ways I could do this. 562 00:26:45,530 --> 00:26:52,900 I could, say, print out the title, and then, maybe, let's do something like-- 563 00:26:52,900 --> 00:26:57,220 how about just, comma, titles, bracket, title? 564 00:26:57,220 --> 00:26:59,020 So I'm going to print two things at once, 565 00:26:59,020 --> 00:27:01,600 both the current title in the dictionary, 566 00:27:01,600 --> 00:27:04,300 and whatever its value is by indexing into it. 567 00:27:04,300 --> 00:27:06,140 Let me increase my Terminal window. 568 00:27:06,140 --> 00:27:10,420 Let me run Python of favorites.py, Enter, and OK. 569 00:27:10,420 --> 00:27:11,908 Huh. 570 00:27:11,908 --> 00:27:14,080 Huh. 571 00:27:14,080 --> 00:27:17,560 None of you said a whole lot of TV shows, it seems. 572 00:27:17,560 --> 00:27:21,690 What's the logical error here? 573 00:27:21,690 --> 00:27:25,460 What did I do wrong if I look back at my code here? 574 00:27:25,460 --> 00:27:25,960 Yeah? 575 00:27:25,960 --> 00:27:27,600 Why so many 0s? 576 00:27:27,600 --> 00:27:29,490 AUDIENCE: [INAUDIBLE] 577 00:27:29,490 --> 00:27:30,490 DAVID J. MALAN: Exactly. 578 00:27:30,490 --> 00:27:35,140 To summarize, I initialized the count to 0 the first time I saw it, 579 00:27:35,140 --> 00:27:38,500 but I should have initialized it at least to 1 because I just saw it. 580 00:27:38,500 --> 00:27:40,220 Or I should change my code a bit. 581 00:27:40,220 --> 00:27:42,880 So for instance, if I go back in here, the simplest fix 582 00:27:42,880 --> 00:27:46,390 is probably to initialize to 1, because on this iteration of the loop, 583 00:27:46,390 --> 00:27:49,210 obviously, I'm seeing this title for the very first time. 584 00:27:49,210 --> 00:27:51,580 Or I could change my logic a little bit. 585 00:27:51,580 --> 00:27:53,470 I could do something like this instead. 586 00:27:53,470 --> 00:27:58,840 If the current title is not in titles, then I could initialize it to 0. 587 00:27:58,840 --> 00:28:02,860 And then I could get rid of the else, and now blindly index 588 00:28:02,860 --> 00:28:04,900 into the titles dictionary. 589 00:28:04,900 --> 00:28:09,100 Because now, on line 11, I can trust that lines 9 and 10 590 00:28:09,100 --> 00:28:12,040 took care of the initialization for me if need be. 591 00:28:12,040 --> 00:28:13,028 Which one is better? 592 00:28:13,028 --> 00:28:13,570 I don't know. 593 00:28:13,570 --> 00:28:17,150 This one's a little nicer, maybe because it's one line fewer. 594 00:28:17,150 --> 00:28:20,470 But I think both approaches are perfectly reasonable and well-designed. 595 00:28:20,470 --> 00:28:22,420 But the key thing, no pun intended, is that we 596 00:28:22,420 --> 00:28:27,100 have to make sure the key exists before we presume to actually incrue. 597 00:28:27,100 --> 00:28:28,810 Oh, this is wrong. 598 00:28:28,810 --> 00:28:30,370 This is incorrect code. 599 00:28:30,370 --> 00:28:33,040 What did I do wrong? 600 00:28:33,040 --> 00:28:33,790 OK, yes. 601 00:28:33,790 --> 00:28:34,400 There we go. 602 00:28:34,400 --> 00:28:37,690 So otherwise, everyone would have liked this show once, and no matter 603 00:28:37,690 --> 00:28:39,190 how many people said the same thing. 604 00:28:39,190 --> 00:28:41,237 Now the code is as it should be. 605 00:28:41,237 --> 00:28:43,570 So let me go ahead and open up my Terminal window again. 606 00:28:43,570 --> 00:28:47,710 Let me run Python of favorites.py, and now we see more reasonable counts. 607 00:28:47,710 --> 00:28:49,150 Some shows weren't that popular. 608 00:28:49,150 --> 00:28:50,830 There's just 1s and maybe 2s. 609 00:28:50,830 --> 00:28:56,570 But I bet if we sort these things we can start to see a little more detail. 610 00:28:56,570 --> 00:28:57,890 So how else can we do this? 611 00:28:57,890 --> 00:29:04,520 Well, turns out, when dealing with a dictionary like this-- 612 00:29:04,520 --> 00:29:07,160 let's go ahead and just sort the titles themselves. 613 00:29:07,160 --> 00:29:12,130 So let's reintroduce the sorted function as I did before, but no other changes. 614 00:29:12,130 --> 00:29:14,735 Let me go ahead now and run Python of favorites.py. 615 00:29:14,735 --> 00:29:17,110 Now it's just a little easier to wrap your mind around it 616 00:29:17,110 --> 00:29:18,568 because at least it's alphabetical. 617 00:29:18,568 --> 00:29:22,600 But it's not sorted by value, it's sorted by key. 618 00:29:22,600 --> 00:29:26,170 But sure enough, if we scroll down, there's something down here, 619 00:29:26,170 --> 00:29:29,202 for instance, like, let's see, The Office. 620 00:29:29,202 --> 00:29:30,910 That's definitely going to be a contender 621 00:29:30,910 --> 00:29:32,860 for most popular, 15 responses. 622 00:29:32,860 --> 00:29:35,860 But let's see what's actually going to bubble up to the top. 623 00:29:35,860 --> 00:29:40,870 Unfortunately, the sorted function only sorts dictionaries by keys 624 00:29:40,870 --> 00:29:44,410 by default, not by values. 625 00:29:44,410 --> 00:29:47,500 But it turns out, in Python, if you read the documentation 626 00:29:47,500 --> 00:29:49,510 for the sorted function, you can actually 627 00:29:49,510 --> 00:29:54,580 pass in other arguments that tell it how to sort things. 628 00:29:54,580 --> 00:29:57,430 For instance, if I want to do things in reverse order, 629 00:29:57,430 --> 00:30:02,140 I can add a second parameter to the sorted function called reverse. 630 00:30:02,140 --> 00:30:03,310 And it's a named parameter. 631 00:30:03,310 --> 00:30:05,620 You literally say, reverse equals true, so 632 00:30:05,620 --> 00:30:08,830 that the position of it in the comma-separated list doesn't matter. 633 00:30:08,830 --> 00:30:11,762 If I now rerun this after increasing my Terminal window, 634 00:30:11,762 --> 00:30:13,720 you'll see now that it's in the opposite order. 635 00:30:13,720 --> 00:30:16,330 Now adventure and Anne with an E is at the bottom 636 00:30:16,330 --> 00:30:18,410 of the output instead of the top. 637 00:30:18,410 --> 00:30:26,845 How can I tell it to sort by values instead of by key? 638 00:30:26,845 --> 00:30:28,220 Well, let's go ahead and do this. 639 00:30:28,220 --> 00:30:30,940 Let me go ahead and define a function. 640 00:30:30,940 --> 00:30:33,070 I'm just going to call it f to keep things simple. 641 00:30:33,070 --> 00:30:36,190 And this f function is going to take a title as input. 642 00:30:36,190 --> 00:30:41,140 And given a given title, it's going to return the value of that title. 643 00:30:41,140 --> 00:30:44,560 So actually, maybe a better name for this would be get value, 644 00:30:44,560 --> 00:30:46,820 and/or we could come up with something else as well. 645 00:30:46,820 --> 00:30:49,300 The purpose of the get value function, to be clear, 646 00:30:49,300 --> 00:30:54,200 is to take it as input a title and then return the corresponding value. 647 00:30:54,200 --> 00:30:55,460 Why is this useful? 648 00:30:55,460 --> 00:30:57,760 Well, it turns out that the sorted function in Python, 649 00:30:57,760 --> 00:31:01,870 according to its documentation, also takes a key parameter, 650 00:31:01,870 --> 00:31:05,860 where you can pass in, crazy enough, the name of a function 651 00:31:05,860 --> 00:31:11,650 that it will use in order to determine what it should sort by, by the key, 652 00:31:11,650 --> 00:31:16,550 or by the value, or in other cases, even other types of data as well. 653 00:31:16,550 --> 00:31:19,390 So there's a curiosity here, though, that's very deliberate. 654 00:31:19,390 --> 00:31:21,550 Key is the name of the parameter, just like reverse 655 00:31:21,550 --> 00:31:23,092 was the name of this other parameter. 656 00:31:23,092 --> 00:31:26,110 The value of it, though, is not a function call. 657 00:31:26,110 --> 00:31:27,580 It's a function name. 658 00:31:27,580 --> 00:31:30,520 Notice I am not doing this, no parentheses. 659 00:31:30,520 --> 00:31:35,400 I'm instead passing in get value, the function I wrote, by its name. 660 00:31:35,400 --> 00:31:37,900 And this is a feature of Python and certain other languages. 661 00:31:37,900 --> 00:31:41,110 Just like variables, you can actually pass whole functions 662 00:31:41,110 --> 00:31:45,560 around so that they can be called for you later on by someone else. 663 00:31:45,560 --> 00:31:48,880 So what this means is that the sorted function written by Python, 664 00:31:48,880 --> 00:31:51,410 they didn't know what you're going to want to sort by today. 665 00:31:51,410 --> 00:31:55,810 But if you provide them with a function called get value, or anything else, now 666 00:31:55,810 --> 00:31:58,000 their sorted function will use that function 667 00:31:58,000 --> 00:32:01,840 to determine, OK, if you don't want to sort by the key of the dictionary, what 668 00:32:01,840 --> 00:32:02,950 do you want to sort by? 669 00:32:02,950 --> 00:32:05,800 This is going to tell it to sort by the value 670 00:32:05,800 --> 00:32:08,750 by returning the specific value we care about. 671 00:32:08,750 --> 00:32:12,580 So let me go ahead now and rerun this after increasing my Terminal, Python 672 00:32:12,580 --> 00:32:14,620 of favorites.py, Enter. 673 00:32:14,620 --> 00:32:17,110 Here we have now an example of all of the titles you all 674 00:32:17,110 --> 00:32:22,180 typed in, albeit forced to uppercase and with any whitespace thrown out. 675 00:32:22,180 --> 00:32:24,733 And now, The Office is an easy win over Friends, 676 00:32:24,733 --> 00:32:27,400 versus Community, versus Game of Thrones, Breaking Bad, and then 677 00:32:27,400 --> 00:32:29,740 a lot of variants thereafter. 678 00:32:29,740 --> 00:32:31,680 So there's a lot of steps to go through. 679 00:32:31,680 --> 00:32:33,555 This isn't that bad once you've done it once, 680 00:32:33,555 --> 00:32:35,472 and you know what these functions are, and you 681 00:32:35,472 --> 00:32:36,870 know that these parameters exist. 682 00:32:36,870 --> 00:32:38,010 But it's a lot of work. 683 00:32:38,010 --> 00:32:42,090 That's 17 lines of code just to analyze a CSV file 684 00:32:42,090 --> 00:32:45,420 that you all created by way of those Google Form submissions. 685 00:32:45,420 --> 00:32:48,360 But it took me a lot of work just to get simple answers out of it. 686 00:32:48,360 --> 00:32:50,277 And indeed, that's going to be among the goals 687 00:32:50,277 --> 00:32:52,920 for today, ultimately, is, how can we just make this easier? 688 00:32:52,920 --> 00:32:54,795 It's one thing to learn new things in Python, 689 00:32:54,795 --> 00:32:57,300 but if we can avoid writing code, or this much code, 690 00:32:57,300 --> 00:32:58,840 that's going to be a good thing. 691 00:32:58,840 --> 00:33:01,020 And so one other technique we can introduce here 692 00:33:01,020 --> 00:33:03,570 that does allow us to write a little less code 693 00:33:03,570 --> 00:33:05,730 is, we can actually get rid of this function. 694 00:33:05,730 --> 00:33:09,240 It turns out, in Python, if you just need to make a function 695 00:33:09,240 --> 00:33:11,940 but it's going to be used and then essentially thrown away, 696 00:33:11,940 --> 00:33:14,790 it's not something you're going to be reusing in multiple places-- 697 00:33:14,790 --> 00:33:17,550 it's not like a library function that you want to keep around-- 698 00:33:17,550 --> 00:33:19,680 you can actually just do this. 699 00:33:19,680 --> 00:33:23,430 You can change the value of this key parameter 700 00:33:23,430 --> 00:33:25,950 to be what's called a lambda function, which 701 00:33:25,950 --> 00:33:29,040 is a fancy way of saying a function that technically has no name. 702 00:33:29,040 --> 00:33:30,870 It's an anonymous function. 703 00:33:30,870 --> 00:33:32,400 Why does it have no name? 704 00:33:32,400 --> 00:33:35,530 Well, it's kind of stupid that I invented this name on line 13. 705 00:33:35,530 --> 00:33:38,670 I used it on line 16, and then I never again used it. 706 00:33:38,670 --> 00:33:42,520 If there's only being used in one place, why bother giving it a name at all? 707 00:33:42,520 --> 00:33:45,000 So if you instead, in Python, say lambda, 708 00:33:45,000 --> 00:33:47,850 and then type out the name of the parameter 709 00:33:47,850 --> 00:33:50,130 you want this anonymous function to take, 710 00:33:50,130 --> 00:33:54,460 you can then say, go ahead and return this value. 711 00:33:54,460 --> 00:33:57,030 Now let's notice the inconsistencies here. 712 00:33:57,030 --> 00:33:59,850 When you use this special lambda keyword that says, hey Python, 713 00:33:59,850 --> 00:34:02,850 give me an anonymous function, a function with no name, 714 00:34:02,850 --> 00:34:06,610 it then says, Python, this anonymous function will take one parameter. 715 00:34:06,610 --> 00:34:08,699 Notice there's no parentheses. 716 00:34:08,699 --> 00:34:10,500 And that's deliberate, if confusing. 717 00:34:10,500 --> 00:34:12,909 It just tightens things up a little bit. 718 00:34:12,909 --> 00:34:16,770 Notice that there's no return keyword, which similarly tightens things 719 00:34:16,770 --> 00:34:18,780 up a bit, albeit inconsistently. 720 00:34:18,780 --> 00:34:22,650 But this line of code I've just highlighted 721 00:34:22,650 --> 00:34:26,429 is actually identical in functionality to this. 722 00:34:26,429 --> 00:34:28,320 But it throws away the word [INAUDIBLE]. 723 00:34:28,320 --> 00:34:29,850 It throws away the word get value. 724 00:34:29,850 --> 00:34:33,510 It throws away the parentheses, and it throws away the return keyword just 725 00:34:33,510 --> 00:34:34,679 to tighten things up. 726 00:34:34,679 --> 00:34:37,290 And it's well suited for a problem like this 727 00:34:37,290 --> 00:34:39,960 where I just want to pass in a tiny little function that 728 00:34:39,960 --> 00:34:41,090 does something useful. 729 00:34:41,090 --> 00:34:42,840 But it's not something I'm going to reuse. 730 00:34:42,840 --> 00:34:45,100 It doesn't need multiple lines to take up space. 731 00:34:45,100 --> 00:34:47,159 It's just a nice, elegant one liner. 732 00:34:47,159 --> 00:34:48,840 That's all a lambda function does. 733 00:34:48,840 --> 00:34:51,909 It allows you to create an anonymous function right then and there. 734 00:34:51,909 --> 00:34:57,210 And then the function you're passing it to, like sorted, will use it as before. 735 00:34:57,210 --> 00:35:00,810 Indeed, if I run Python of favorites.py after growing my Terminal window, 736 00:35:00,810 --> 00:35:02,800 the result is exactly the same. 737 00:35:02,800 --> 00:35:06,360 And we see at the bottom here all of those small results. 738 00:35:06,360 --> 00:35:10,810 Are any questions, then, on this syntax, on these ideas? 739 00:35:10,810 --> 00:35:13,770 The goal here has been to write a Python program that just starts 740 00:35:13,770 --> 00:35:18,860 to analyze or clean up data like this. 741 00:35:18,860 --> 00:35:19,765 Yeah? 742 00:35:19,765 --> 00:35:23,273 AUDIENCE: [INAUDIBLE] 743 00:35:23,273 --> 00:35:26,440 DAVID J. MALAN: Could you use the lambda if it's just returning immediately? 744 00:35:26,440 --> 00:35:29,260 It's really meant for one line of code, generally. 745 00:35:29,260 --> 00:35:31,420 So you don't use the return keyword. 746 00:35:31,420 --> 00:35:33,880 You just say what it is you want to return. 747 00:35:33,880 --> 00:35:37,680 AUDIENCE: [INAUDIBLE] 748 00:35:37,680 --> 00:35:38,930 DAVID J. MALAN: Good question. 749 00:35:38,930 --> 00:35:40,960 Could you do more in that one line if it's 750 00:35:40,960 --> 00:35:42,670 got to be a more involved algorithm? 751 00:35:42,670 --> 00:35:45,820 Yes, but you would just ultimately return the value in question. 752 00:35:45,820 --> 00:35:47,947 In short, if it's getting at all sophisticated 753 00:35:47,947 --> 00:35:49,780 you don't use the lambda function in Python. 754 00:35:49,780 --> 00:35:52,510 You go ahead and actually just define a name for it, 755 00:35:52,510 --> 00:35:54,233 even if it's a one-off name. 756 00:35:54,233 --> 00:35:56,650 JavaScript, another language we'll look at in a few weeks, 757 00:35:56,650 --> 00:35:59,590 makes heavier use, I dare say, of lambda functions. 758 00:35:59,590 --> 00:36:02,030 And those can actually be multiple, multiple lines, 759 00:36:02,030 --> 00:36:05,520 but Python does not support that instinct. 760 00:36:05,520 --> 00:36:06,020 All right. 761 00:36:06,020 --> 00:36:07,728 So let's go ahead and do one other thing. 762 00:36:07,728 --> 00:36:10,340 Office was clearly popping out of the code here quite a bit. 763 00:36:10,340 --> 00:36:12,760 Let's go ahead and write a slightly different program 764 00:36:12,760 --> 00:36:15,400 that maybe just focuses on The Office for the moment, 765 00:36:15,400 --> 00:36:16,900 just focuses on The Office. 766 00:36:16,900 --> 00:36:21,250 So let me go ahead and throw most of this code away, up until this point 767 00:36:21,250 --> 00:36:23,080 when I'm inside of my inner loop. 768 00:36:23,080 --> 00:36:26,050 And let me go ahead, and I don't even want the global variable here. 769 00:36:26,050 --> 00:36:28,240 All I want to do is focus on the current title. 770 00:36:28,240 --> 00:36:30,730 How could I detect if someone likes The Office? 771 00:36:30,730 --> 00:36:33,790 Well, I could say something like-- 772 00:36:33,790 --> 00:36:34,670 how about this? 773 00:36:34,670 --> 00:36:36,310 So counter equals 0. 774 00:36:36,310 --> 00:36:38,350 We'll just focus on The Office. 775 00:36:38,350 --> 00:36:43,930 If title equals, equals The Office, I could then go ahead and say, 776 00:36:43,930 --> 00:36:47,410 counter plus equals 1. 777 00:36:47,410 --> 00:36:48,400 I don't need a key. 778 00:36:48,400 --> 00:36:49,858 There's no dictionary involved now. 779 00:36:49,858 --> 00:36:51,880 It's just a simple integer variable. 780 00:36:51,880 --> 00:36:55,750 And then, down here I'll say something like, 781 00:36:55,750 --> 00:37:00,970 number of people who like The Office is, whatever this value is. 782 00:37:00,970 --> 00:37:03,850 And I'll put in counter in curly braces, and then I'll 783 00:37:03,850 --> 00:37:05,783 turn this whole thing into an F string. 784 00:37:05,783 --> 00:37:07,450 All right, let me go ahead and run this. 785 00:37:07,450 --> 00:37:10,100 Python of favorites.py, Enter. 786 00:37:10,100 --> 00:37:12,610 Number of people who like The Office is 15. 787 00:37:12,610 --> 00:37:13,990 All right, so that's great. 788 00:37:13,990 --> 00:37:17,530 But let's go ahead now and deliberately muddy the data a bit. 789 00:37:17,530 --> 00:37:21,160 All of you were very nice in that you typed in The Office. 790 00:37:21,160 --> 00:37:23,230 But you can imagine someone just typing Office, 791 00:37:23,230 --> 00:37:25,692 for instance, maybe there, maybe there. 792 00:37:25,692 --> 00:37:28,150 And many people might just write Office, you could imagine. 793 00:37:28,150 --> 00:37:30,400 Didn't happen here, but suppose it did, and probably 794 00:37:30,400 --> 00:37:33,290 would have if we had even more and more submissions over time. 795 00:37:33,290 --> 00:37:37,000 Now let's go ahead and rerun this program, no changes to the code. 796 00:37:37,000 --> 00:37:39,130 Now only 13 people like The Office. 797 00:37:39,130 --> 00:37:40,150 So let's fix this. 798 00:37:40,150 --> 00:37:45,790 The data is now as I mutated it to have a couple Offices, and many The Offices. 799 00:37:45,790 --> 00:37:51,050 How could I change my Python code to now count both of those situations? 800 00:37:51,050 --> 00:37:55,390 What could I change up here in order to improve this situation? 801 00:37:55,390 --> 00:37:57,910 Any thoughts? 802 00:37:57,910 --> 00:37:58,750 Yeah? 803 00:37:58,750 --> 00:38:02,300 AUDIENCE: You write the title [INAUDIBLE].. 804 00:38:02,300 --> 00:38:05,050 DAVID J. MALAN: Yeah, so I could just ask two questions like that. 805 00:38:05,050 --> 00:38:09,370 If title equals The Office, or title equals, equals just 806 00:38:09,370 --> 00:38:10,438 Office, for instance. 807 00:38:10,438 --> 00:38:12,730 And I'm still don't have to worry about capitalization. 808 00:38:12,730 --> 00:38:15,813 I don't have to worry about spaces because I at least threw that all away. 809 00:38:15,813 --> 00:38:18,200 Now I can go ahead and rerun this code. 810 00:38:18,200 --> 00:38:19,990 Let me go run it a third time. 811 00:38:19,990 --> 00:38:22,460 OK, so we're back up to 15. 812 00:38:22,460 --> 00:38:25,210 So I like that. 813 00:38:25,210 --> 00:38:28,660 You could imagine this not scaling very well. 814 00:38:28,660 --> 00:38:31,763 Avatar had three different permutations, and there were some others 815 00:38:31,763 --> 00:38:34,180 if we dug deeper that there might have been more variants. 816 00:38:34,180 --> 00:38:36,430 Could we do something a little more general purpose? 817 00:38:36,430 --> 00:38:38,230 Well, we could do something like this. 818 00:38:38,230 --> 00:38:41,883 If Office in the title-- 819 00:38:41,883 --> 00:38:44,050 this is kind of a cool thing you can do with Python. 820 00:38:44,050 --> 00:38:46,930 It's very English-like, just ask the question, albeit tersely. 821 00:38:46,930 --> 00:38:50,660 This, interesting, just got me into trouble. 822 00:38:50,660 --> 00:38:53,140 Now, all of a sudden, we're up to 16. 823 00:38:53,140 --> 00:38:56,252 Does anyone know what the other one is? 824 00:38:56,252 --> 00:38:58,098 AUDIENCE: Someone put V Office. 825 00:38:58,098 --> 00:38:59,265 DAVID J. MALAN: What Office? 826 00:38:59,265 --> 00:39:02,120 AUDIENCE: Someone entered a V Office, [INAUDIBLE].. 827 00:39:02,120 --> 00:39:04,517 828 00:39:04,517 --> 00:39:05,850 DAVID J. MALAN: Oh, interesting. 829 00:39:05,850 --> 00:39:07,930 Yes, so they hit The. 830 00:39:07,930 --> 00:39:08,430 OK. 831 00:39:08,430 --> 00:39:11,568 [APPLAUSE] 832 00:39:11,568 --> 00:39:12,360 DAVID J. MALAN: OK. 833 00:39:12,360 --> 00:39:13,800 Someone did that, sure. 834 00:39:13,800 --> 00:39:17,130 So The V Office. 835 00:39:17,130 --> 00:39:19,470 OK, this one's actually going to be hard to correct for. 836 00:39:19,470 --> 00:39:21,300 I can't really think of a general-- 837 00:39:21,300 --> 00:39:25,860 well, this is actually a good example of data gets messy fast. 838 00:39:25,860 --> 00:39:28,080 And you could imagine doing something where, OK, we 839 00:39:28,080 --> 00:39:32,920 could have like 26 conditions if someone said The A Office, or The B Office, 840 00:39:32,920 --> 00:39:33,420 right? 841 00:39:33,420 --> 00:39:34,650 You could imagine doing that. 842 00:39:34,650 --> 00:39:37,400 But then there's surely going to be other typos that are possible. 843 00:39:37,400 --> 00:39:39,610 So that's actually a hard one to fix. 844 00:39:39,610 --> 00:39:44,730 But it turns out we got lucky and now this is actually the accurate count. 845 00:39:44,730 --> 00:39:46,660 But the data is itself messy. 846 00:39:46,660 --> 00:39:49,950 Let me show another way that just adds another tool to our toolkit. 847 00:39:49,950 --> 00:39:54,750 It turns out that there's this feature in many programming languages, Python 848 00:39:54,750 --> 00:39:57,040 among them, called regular expressions. 849 00:39:57,040 --> 00:39:59,040 And this is actually a really powerful technique 850 00:39:59,040 --> 00:40:00,873 that we'll just scratch the surface of here. 851 00:40:00,873 --> 00:40:04,290 But it's going to be really useful, actually, maybe toward final projects, 852 00:40:04,290 --> 00:40:08,880 in web programming, any time you want to clean up data or validate data. 853 00:40:08,880 --> 00:40:11,910 And actually, just to make this clear, give me a moment 854 00:40:11,910 --> 00:40:14,160 before I switch screens here. 855 00:40:14,160 --> 00:40:18,450 And let me open up a Google Form from scratch. 856 00:40:18,450 --> 00:40:22,230 Give me just a moment to create something real quick. 857 00:40:22,230 --> 00:40:25,260 If you've never noticed this before when creating a Google Form, 858 00:40:25,260 --> 00:40:28,410 you can do a question. 859 00:40:28,410 --> 00:40:30,360 And if you want the user to type in something 860 00:40:30,360 --> 00:40:33,033 very specific as a short text answer like this, 861 00:40:33,033 --> 00:40:35,700 you might know that there's toggles like this in Google's world, 862 00:40:35,700 --> 00:40:36,930 like you can require it. 863 00:40:36,930 --> 00:40:39,270 Or you can do response validation. 864 00:40:39,270 --> 00:40:41,670 You could say, what's your email? 865 00:40:41,670 --> 00:40:47,250 And then you could say something like, text is an email. 866 00:40:47,250 --> 00:40:52,530 So here's an example in Google Forms how you can validate users' input. 867 00:40:52,530 --> 00:40:57,150 But a feature most of you have probably never noticed, or cared about, or used, 868 00:40:57,150 --> 00:40:59,490 is this thing called a regular expression, where 869 00:40:59,490 --> 00:41:01,440 you can actually define a pattern. 870 00:41:01,440 --> 00:41:04,830 And I could actually reimplement that same idea by doing something like this. 871 00:41:04,830 --> 00:41:11,070 I can say, let the user type in anything represented by .star, then an at sign, 872 00:41:11,070 --> 00:41:16,600 then something else, then a literal period, then, for instance, 873 00:41:16,600 --> 00:41:17,680 something else. 874 00:41:17,680 --> 00:41:19,950 So it's very cryptic, admittedly, at first glance. 875 00:41:19,950 --> 00:41:23,430 But this means any character 0 more times. 876 00:41:23,430 --> 00:41:26,160 This means any character 0 more times. 877 00:41:26,160 --> 00:41:28,725 This means a literal period, because apparently 878 00:41:28,725 --> 00:41:32,160 dot means any character in the context of these patterns. 879 00:41:32,160 --> 00:41:36,070 Then this thing means any character 0 more times. 880 00:41:36,070 --> 00:41:38,670 So I should actually be a little more nitpicky. 881 00:41:38,670 --> 00:41:41,530 You don't want 0 or more times, you want 1 or more times. 882 00:41:41,530 --> 00:41:45,340 So this with the plus means any character 1 or more time. 883 00:41:45,340 --> 00:41:47,020 So there has to be something there. 884 00:41:47,020 --> 00:41:51,630 And I think I want the same thing here 1 or more times, 1 or more times. 885 00:41:51,630 --> 00:41:56,160 Or heck, if I want to restrict this form in some sense to edu addresses, 886 00:41:56,160 --> 00:41:59,520 I could change that last thing to literally .edu. 887 00:41:59,520 --> 00:42:01,570 And so long story short, even though this looks, 888 00:42:01,570 --> 00:42:06,420 I'm sure, pretty cryptic, there's this mini language built into Python, 889 00:42:06,420 --> 00:42:09,900 and JavaScript, and Java, and other languages that allows you to express 890 00:42:09,900 --> 00:42:12,430 patterns in a standardized way. 891 00:42:12,430 --> 00:42:15,930 And this pattern is actually something we can implement in code, too. 892 00:42:15,930 --> 00:42:18,150 And let me switch back to Python for a second just 893 00:42:18,150 --> 00:42:19,920 to do the same kind of idea. 894 00:42:19,920 --> 00:42:22,950 Let me toggle back to my code here. 895 00:42:22,950 --> 00:42:26,730 Let me put up, for instance, a summary of what it is you can do. 896 00:42:26,730 --> 00:42:33,030 And here's just a quick summary of some of the available symbols. 897 00:42:33,030 --> 00:42:39,330 A period may represent any character. .star or .asterisks means 0 or more 898 00:42:39,330 --> 00:42:39,970 characters. 899 00:42:39,970 --> 00:42:43,020 So the dot means anything, so it can be A or nothing. 900 00:42:43,020 --> 00:42:44,160 It can be B or nothing. 901 00:42:44,160 --> 00:42:49,530 It can be A, B, A, B, C. It can be any combination of 0 or more characters. 902 00:42:49,530 --> 00:42:52,860 Change that to a plus and you now express one or more characters. 903 00:42:52,860 --> 00:42:55,860 Question mark means something is optional. 904 00:42:55,860 --> 00:42:59,370 Caret symbol means start matching at the beginning of the user's input. 905 00:42:59,370 --> 00:43:05,100 Dollar sign means stop matching at the end of the user's input. 906 00:43:05,100 --> 00:43:07,210 So we won't play with all of these just now. 907 00:43:07,210 --> 00:43:11,470 But let me go over here and actually tackle this Office problem. 908 00:43:11,470 --> 00:43:15,450 Let me go ahead and import a new library called the regular expression library, 909 00:43:15,450 --> 00:43:17,280 import re. 910 00:43:17,280 --> 00:43:20,560 And then, down here, let me say this. 911 00:43:20,560 --> 00:43:25,080 If re.search, this pattern. 912 00:43:25,080 --> 00:43:30,210 Let's just search for Office, quote, unquote, in the current title. 913 00:43:30,210 --> 00:43:32,730 Then we're going to go ahead and increase the counter. 914 00:43:32,730 --> 00:43:35,040 So it turns out that the regular expression library 915 00:43:35,040 --> 00:43:39,090 has a function called search that takes as its first argument a pattern, 916 00:43:39,090 --> 00:43:41,970 and then, as its second argument the string you 917 00:43:41,970 --> 00:43:44,160 want to analyze for that pattern. 918 00:43:44,160 --> 00:43:47,880 So it's sort of looking for a needle in this haystack, from left to right. 919 00:43:47,880 --> 00:43:52,080 Let me go ahead now and run this version of the program, Enter. 920 00:43:52,080 --> 00:43:56,250 And now I screwed up because I forgot my colon, but that's old stuff. 921 00:43:56,250 --> 00:43:57,322 Enter. 922 00:43:57,322 --> 00:43:59,150 Huh. 923 00:43:59,150 --> 00:44:01,800 Number of people who like The Office is now 0. 924 00:44:01,800 --> 00:44:02,970 So this seems like a big-- 925 00:44:02,970 --> 00:44:05,640 thank you-- big step backwards. 926 00:44:05,640 --> 00:44:08,010 What did I do wrong? 927 00:44:08,010 --> 00:44:08,730 Yeah? 928 00:44:08,730 --> 00:44:10,410 AUDIENCE: [INAUDIBLE] 929 00:44:10,410 --> 00:44:11,610 DAVID J. MALAN: Yeah. 930 00:44:11,610 --> 00:44:14,595 I forced all my input to uppercase, so I probably need to do this. 931 00:44:14,595 --> 00:44:16,470 So we'll come back to other approaches there. 932 00:44:16,470 --> 00:44:17,430 Let me rerun it now. 933 00:44:17,430 --> 00:44:19,800 OK, now we're back up to 16. 934 00:44:19,800 --> 00:44:22,200 But I could even, let's say-- 935 00:44:22,200 --> 00:44:25,110 I could tolerate just The Office. 936 00:44:25,110 --> 00:44:30,120 How about this, or how about something like, or The Office? 937 00:44:30,120 --> 00:44:32,373 Let me do this instead. 938 00:44:32,373 --> 00:44:34,290 And let me use these other special characters. 939 00:44:34,290 --> 00:44:37,380 This caret sign means the beginning of the string. 940 00:44:37,380 --> 00:44:40,740 This dollar sign weirdly represents the end of the string. 941 00:44:40,740 --> 00:44:44,490 I'm adding in some parentheses just like in math, just to add another symbol 942 00:44:44,490 --> 00:44:46,470 here, the or symbol here. 943 00:44:46,470 --> 00:44:50,580 And this is saying start matching at the beginning of the user string. 944 00:44:50,580 --> 00:44:54,660 Check if the beginning of the string is Office, or the beginning of the string 945 00:44:54,660 --> 00:44:55,920 is The Office. 946 00:44:55,920 --> 00:44:58,540 And then, you better be at the end of the string. 947 00:44:58,540 --> 00:45:01,650 So they can't keep typing words before or after that input. 948 00:45:01,650 --> 00:45:03,690 Let me go ahead and rerun the program. 949 00:45:03,690 --> 00:45:07,500 And now we're down to 15, which used to be our correct answer, 950 00:45:07,500 --> 00:45:10,770 but then we noticed The V Office. 951 00:45:10,770 --> 00:45:12,680 How can we deal with that? 952 00:45:12,680 --> 00:45:16,200 It's going to be messier to deal with that. 953 00:45:16,200 --> 00:45:21,555 How about if I tolerate any character represented by dot 954 00:45:21,555 --> 00:45:23,640 in between The and Office? 955 00:45:23,640 --> 00:45:28,110 Now if I rerun it, now I really have this expressive capability. 956 00:45:28,110 --> 00:45:32,340 So this is only to say, there are so many ways in languages, in general, 957 00:45:32,340 --> 00:45:33,390 to solve problems. 958 00:45:33,390 --> 00:45:35,950 And some of these tools are more sophisticated than others. 959 00:45:35,950 --> 00:45:38,957 This is one that you've actually probably glanced at but never used 960 00:45:38,957 --> 00:45:41,040 in the context of Google Forms for years if you're 961 00:45:41,040 --> 00:45:43,770 in the habit of creating these for student groups or other activities. 962 00:45:43,770 --> 00:45:45,840 But it's now something you can start to leverage. 963 00:45:45,840 --> 00:45:49,440 And we're just scratching the surface of what's actually possible with this. 964 00:45:49,440 --> 00:45:53,640 But let's now do one final example just using some Python code here. 965 00:45:53,640 --> 00:45:55,530 And let's actually write a program that's 966 00:45:55,530 --> 00:46:00,040 a little more general purpose that allows me to search for any given title 967 00:46:00,040 --> 00:46:01,770 and figure out its popularity. 968 00:46:01,770 --> 00:46:04,410 So let me go ahead and simplify this. 969 00:46:04,410 --> 00:46:06,630 Let's get rid of our regular expressions. 970 00:46:06,630 --> 00:46:09,940 Let's go ahead and continue capitalizing the title. 971 00:46:09,940 --> 00:46:11,580 And let's go ahead to-- 972 00:46:11,580 --> 00:46:16,020 at the beginning of this program, and first ask the user for the title 973 00:46:16,020 --> 00:46:17,380 they want to search for. 974 00:46:17,380 --> 00:46:20,320 So title equals, let's ask the user for input, 975 00:46:20,320 --> 00:46:23,490 which is essentially the same thing as our CS50 get_string function. 976 00:46:23,490 --> 00:46:24,960 Ask them for the title. 977 00:46:24,960 --> 00:46:28,170 And then whatever they type in, let's go ahead and strip whitespace 978 00:46:28,170 --> 00:46:30,900 and uppercase the thing again. 979 00:46:30,900 --> 00:46:35,820 And now, inside of my loop, I could say something like this. 980 00:46:35,820 --> 00:46:42,660 If the current row's title after stripping whitespace and forcing 981 00:46:42,660 --> 00:46:46,920 it to uppercase, too, equals the user's title, then, go ahead 982 00:46:46,920 --> 00:46:49,440 and maybe increment a counter. 983 00:46:49,440 --> 00:46:51,160 So I still need that counter back. 984 00:46:51,160 --> 00:46:56,610 So let me go ahead and define this maybe in here, counter equals 0. 985 00:46:56,610 --> 00:46:58,720 And then, at the very end of this program, 986 00:46:58,720 --> 00:47:01,050 let me go ahead and print out just the popularity 987 00:47:01,050 --> 00:47:03,040 of whatever the human typed in. 988 00:47:03,040 --> 00:47:06,030 So again, the only difference is I'm asking the human for some input 989 00:47:06,030 --> 00:47:06,720 this time. 990 00:47:06,720 --> 00:47:09,150 I'm initializing my counter to 0, then I'm 991 00:47:09,150 --> 00:47:12,660 searching for their title in the CSV file 992 00:47:12,660 --> 00:47:15,810 by doing the same massaging of the data by forcing it to uppercase 993 00:47:15,810 --> 00:47:18,570 and getting rid of the whitespace. 994 00:47:18,570 --> 00:47:21,780 So now, when I run Python of favorites.py, Enter, 995 00:47:21,780 --> 00:47:30,030 I could type in the office all lowercase even, and now we're down to 13. 996 00:47:30,030 --> 00:47:34,940 13, why? 997 00:47:34,940 --> 00:47:36,700 Oh, that's correct. 998 00:47:36,700 --> 00:47:40,640 Because I'm the one that went in and removed those The keywords a bit ago. 999 00:47:40,640 --> 00:47:43,030 If we fixed those, we would be back up to 15. 1000 00:47:43,030 --> 00:47:47,650 If we added support for The V Office, we would be up to 16 as well. 1001 00:47:47,650 --> 00:47:50,350 All right, any questions then on these various manipulations? 1002 00:47:50,350 --> 00:47:52,183 And if you're feeling like, oh, my god, this 1003 00:47:52,183 --> 00:47:55,100 is so much Python code just to do simple things, that's the point. 1004 00:47:55,100 --> 00:47:57,160 And indeed, even though it's a powerful language 1005 00:47:57,160 --> 00:48:00,670 and can solve these kinds of problems, we had to write almost 20 lines of code 1006 00:48:00,670 --> 00:48:03,470 just to ask a single question like this. 1007 00:48:03,470 --> 00:48:07,120 But any questions on how we did this, or on any of these building 1008 00:48:07,120 --> 00:48:10,750 blocks along the way? 1009 00:48:10,750 --> 00:48:11,890 Anything here? 1010 00:48:11,890 --> 00:48:12,390 No? 1011 00:48:12,390 --> 00:48:12,600 All right. 1012 00:48:12,600 --> 00:48:13,300 That was a lot. 1013 00:48:13,300 --> 00:48:14,800 Let's take a five-minute break here. 1014 00:48:14,800 --> 00:48:17,230 When we come back, we'll do it better. 1015 00:48:17,230 --> 00:48:18,480 So we are back. 1016 00:48:18,480 --> 00:48:20,610 And the rest of today is ultimately about, how 1017 00:48:20,610 --> 00:48:24,840 can we store, and manipulate, and change, and retrieve data 1018 00:48:24,840 --> 00:48:28,090 more efficiently than we might by just writing raw code? 1019 00:48:28,090 --> 00:48:31,440 This isn't to say that you shouldn't use Python to do the kinds of things 1020 00:48:31,440 --> 00:48:32,280 that we just did. 1021 00:48:32,280 --> 00:48:37,020 And in fact, it might be super common if you're getting a lot of messy input 1022 00:48:37,020 --> 00:48:39,073 from users that you might want to clean it up. 1023 00:48:39,073 --> 00:48:42,240 And maybe the best way to do that is to write a program so that step-by-step 1024 00:48:42,240 --> 00:48:44,370 you can make all of the requisite changes and fixes 1025 00:48:44,370 --> 00:48:47,523 like we did with The Office, for instance, again and again, 1026 00:48:47,523 --> 00:48:50,190 and reuse that code, especially if more and more submissions are 1027 00:48:50,190 --> 00:48:51,150 coming through. 1028 00:48:51,150 --> 00:48:53,550 But another theme of today, ultimately, is 1029 00:48:53,550 --> 00:48:57,638 that sometimes there are different, if not better tools for the same job. 1030 00:48:57,638 --> 00:48:59,430 And in fact, now at this point in the term, 1031 00:48:59,430 --> 00:49:02,310 as we begin to introduce not just Python, but in a moment 1032 00:49:02,310 --> 00:49:06,120 a language called SQL, and next week, a language called JavaScript, 1033 00:49:06,120 --> 00:49:09,150 and the week after that, synthesizing a whole lot of these languages 1034 00:49:09,150 --> 00:49:12,420 together is to just kind of paint a picture of how 1035 00:49:12,420 --> 00:49:15,900 you might decide what the trade-offs are between using this tool, or this tool, 1036 00:49:15,900 --> 00:49:16,830 or this other tool. 1037 00:49:16,830 --> 00:49:19,770 Because undoubtedly you can solve problems moving forward 1038 00:49:19,770 --> 00:49:22,660 in many different ways with many different tools. 1039 00:49:22,660 --> 00:49:25,020 So let's give you another tool, one with which 1040 00:49:25,020 --> 00:49:28,170 you can implement a proper relational database. 1041 00:49:28,170 --> 00:49:31,050 What we just saw in the form of CSV files 1042 00:49:31,050 --> 00:49:33,810 are what we might call flat-file databases. 1043 00:49:33,810 --> 00:49:37,500 Again, just a very simple file, flat in that there's no hierarchy to it. 1044 00:49:37,500 --> 00:49:39,270 It's just rows and columns. 1045 00:49:39,270 --> 00:49:44,280 And that is all ultimately storing ASCII or Unicode text. 1046 00:49:44,280 --> 00:49:47,400 A relational database, though, is something that's actually 1047 00:49:47,400 --> 00:49:50,850 closer to a proper spreadsheet program. 1048 00:49:50,850 --> 00:49:53,440 A CSV is an individual sheet, if you will, 1049 00:49:53,440 --> 00:49:55,260 from a spreadsheet when you export it. 1050 00:49:55,260 --> 00:49:57,460 If you had multiple sheets in a spreadsheet, 1051 00:49:57,460 --> 00:49:59,595 you would have to export multiple CSVs. 1052 00:49:59,595 --> 00:50:01,470 And that gets annoying quickly in code if you 1053 00:50:01,470 --> 00:50:03,990 have to open up this CSV, this CSV, all of which 1054 00:50:03,990 --> 00:50:07,080 represent different sheets or tabs in a proper spreadsheet. 1055 00:50:07,080 --> 00:50:11,520 A relational database is more like a spreadsheet program 1056 00:50:11,520 --> 00:50:14,640 that you, a programmer, now can interact with. 1057 00:50:14,640 --> 00:50:16,140 You can write data to it. 1058 00:50:16,140 --> 00:50:19,680 You can read data from it, and you can have multiple sheets, a.k.a., 1059 00:50:19,680 --> 00:50:22,150 tables storing all of your data. 1060 00:50:22,150 --> 00:50:24,240 So whereas Excel and numbers in Google spreadsheet 1061 00:50:24,240 --> 00:50:27,090 are meant to be reused really by humans with their mouse and their keyboard, 1062 00:50:27,090 --> 00:50:29,850 clicking, and pointing, and manipulating things graphically, 1063 00:50:29,850 --> 00:50:32,160 a relational database using a language called 1064 00:50:32,160 --> 00:50:37,320 SQL is one in which the programmer has similar capabilities, 1065 00:50:37,320 --> 00:50:39,000 but doing so in code. 1066 00:50:39,000 --> 00:50:42,720 Specifically, using a language called SQL, and at a scale 1067 00:50:42,720 --> 00:50:45,670 that's much grander than spreadsheets alone. 1068 00:50:45,670 --> 00:50:48,420 In fact, if you try on your Mac or PC to open a spreadsheet that's 1069 00:50:48,420 --> 00:50:51,090 got tens of thousands of rows, it'll probably 1070 00:50:51,090 --> 00:50:54,780 work fine, hundreds of thousands of rows, millions of rows, no way. 1071 00:50:54,780 --> 00:50:57,000 At some point your Mac or PC is going to struggle 1072 00:50:57,000 --> 00:50:59,130 to open particularly large data sets. 1073 00:50:59,130 --> 00:51:01,620 And that, too, is where proper databases come 1074 00:51:01,620 --> 00:51:04,140 into play and proper languages for databases come 1075 00:51:04,140 --> 00:51:06,120 into play, when it's all about scale. 1076 00:51:06,120 --> 00:51:09,390 And indeed, most any mobile app or web app today that you or someone else 1077 00:51:09,390 --> 00:51:13,420 might write should probably plan on lots of data if it's successful. 1078 00:51:13,420 --> 00:51:15,730 So we need the right tools for that problem. 1079 00:51:15,730 --> 00:51:19,110 So fortunately, even though we're about to learn yet another language, 1080 00:51:19,110 --> 00:51:24,360 it only does four things fundamentally, known by this silly acronym, CRUD. 1081 00:51:24,360 --> 00:51:27,870 SQL, this language for databases, supports the ability 1082 00:51:27,870 --> 00:51:32,400 to create data, read data, update data, and delete data. 1083 00:51:32,400 --> 00:51:33,420 That's it. 1084 00:51:33,420 --> 00:51:36,690 There's a few more keywords that exist in this language called SQL 1085 00:51:36,690 --> 00:51:37,750 that we'll soon see. 1086 00:51:37,750 --> 00:51:39,458 But at the end of the day, even if you're 1087 00:51:39,458 --> 00:51:42,180 starting to feel like this is a lot very quickly, 1088 00:51:42,180 --> 00:51:44,940 it all boils down to these four basic operations. 1089 00:51:44,940 --> 00:51:47,640 And the four commands in SQL, if you will, 1090 00:51:47,640 --> 00:51:51,780 functions in a sense that implement those four ideas happen to be these. 1091 00:51:51,780 --> 00:51:54,270 They're almost the same but with some slight variance. 1092 00:51:54,270 --> 00:51:59,280 The ability to create or insert data is the C. The ability to select data 1093 00:51:59,280 --> 00:52:01,080 is the R, or read. 1094 00:52:01,080 --> 00:52:02,130 Update is the same. 1095 00:52:02,130 --> 00:52:05,048 Delete is the same, but drop is also a keyword as well. 1096 00:52:05,048 --> 00:52:06,840 So we'll see these and a few other keywords 1097 00:52:06,840 --> 00:52:10,410 in SQL that, at the end of the day, just allow you to create, read, and update 1098 00:52:10,410 --> 00:52:14,310 data using verbs, if you will, like these. 1099 00:52:14,310 --> 00:52:18,030 So to do that, what's the syntax going to be? 1100 00:52:18,030 --> 00:52:20,290 Well, we won't get into the weeds too quickly on this. 1101 00:52:20,290 --> 00:52:22,650 But here's a representative syntax of how 1102 00:52:22,650 --> 00:52:25,710 you can create using this language called SQL, in your very 1103 00:52:25,710 --> 00:52:28,020 own database, a brand new table. 1104 00:52:28,020 --> 00:52:30,910 This is so easy in Excel, and Google Spreadsheets, and Apple Numbers. 1105 00:52:30,910 --> 00:52:32,910 You want a new sheet, you click the plus button. 1106 00:52:32,910 --> 00:52:33,690 You get a new tab. 1107 00:52:33,690 --> 00:52:35,490 You give it a name, and boom, you're done. 1108 00:52:35,490 --> 00:52:40,050 In the world of programming, though, if you want to create the analogue of that 1109 00:52:40,050 --> 00:52:43,440 spreadsheet in the computer's memory, you create something called a table, 1110 00:52:43,440 --> 00:52:48,420 like a sheet, that has a name, and then in parentheses has one or more columns. 1111 00:52:48,420 --> 00:52:51,990 But unlike Google Spreadsheets, and Apple Numbers, and Excel, 1112 00:52:51,990 --> 00:52:55,073 you have to decide as the programmer what types of data 1113 00:52:55,073 --> 00:52:57,240 you're going to be storing in each of these columns. 1114 00:52:57,240 --> 00:52:59,430 Now even though Excel, and Google Spreadsheets, 1115 00:52:59,430 --> 00:53:03,310 and Numbers does allow you to format or present data in different ways, 1116 00:53:03,310 --> 00:53:07,680 it's not strongly typed data like it is, for instance, when we were using C. 1117 00:53:07,680 --> 00:53:10,200 And heck, even in Python there's underlying data types. 1118 00:53:10,200 --> 00:53:12,158 Even if you don't have to type them explicitly, 1119 00:53:12,158 --> 00:53:14,900 databases are going to want to know, are you storing integers? 1120 00:53:14,900 --> 00:53:16,640 Are you storing real numbers or floats? 1121 00:53:16,640 --> 00:53:17,630 Are you storing text? 1122 00:53:17,630 --> 00:53:18,140 Why? 1123 00:53:18,140 --> 00:53:20,960 Because especially as your data scales, the more hints 1124 00:53:20,960 --> 00:53:24,410 you give the database about your data, the more performance it can be, 1125 00:53:24,410 --> 00:53:27,500 the faster it can help you get at and store that data. 1126 00:53:27,500 --> 00:53:29,302 So types are about to be important again, 1127 00:53:29,302 --> 00:53:31,760 but there's not going to be that many of them, fortunately. 1128 00:53:31,760 --> 00:53:34,640 Now how can I go about converting, for instance, some real data, 1129 00:53:34,640 --> 00:53:37,490 like that from you, my favorites.csv file, 1130 00:53:37,490 --> 00:53:39,440 into a proper relational database? 1131 00:53:39,440 --> 00:53:42,650 Well, it turns out that using SQL I can do this 1132 00:53:42,650 --> 00:53:45,260 in VS Code on my own Mac, or PC, or in the cloud 1133 00:53:45,260 --> 00:53:48,455 here by just importing the CSV into a database. 1134 00:53:48,455 --> 00:53:50,330 We'll see eventually how to do this manually. 1135 00:53:50,330 --> 00:53:52,622 For now, I'm going to use more of an automated process. 1136 00:53:52,622 --> 00:53:54,680 So let me go over to VS Code here. 1137 00:53:54,680 --> 00:53:57,170 Let me type ls to see where we left off before. 1138 00:53:57,170 --> 00:54:01,005 I had two files favorites.csv, which I downloaded from Google Spreadsheets. 1139 00:54:01,005 --> 00:54:02,630 Recall that I made a couple of changes. 1140 00:54:02,630 --> 00:54:06,050 We deleted a couple of Thes from the file for The Office. 1141 00:54:06,050 --> 00:54:08,600 But this is the same file as before, and then we 1142 00:54:08,600 --> 00:54:11,210 have favorites.py, which we'll set aside for now. 1143 00:54:11,210 --> 00:54:14,870 I'm going to go ahead now and run a command SQLite3. 1144 00:54:14,870 --> 00:54:18,020 So in the world of relational databases, there's 1145 00:54:18,020 --> 00:54:23,030 many different products out there, many different software that 1146 00:54:23,030 --> 00:54:25,370 implements the SQL language. 1147 00:54:25,370 --> 00:54:26,600 Microsoft has their own. 1148 00:54:26,600 --> 00:54:30,080 There's something called MySQL that's been very popular for years. 1149 00:54:30,080 --> 00:54:32,120 Facebook, for instance, used it early on. 1150 00:54:32,120 --> 00:54:35,030 PostgreSQL, Microsoft Access Server, Oracle, 1151 00:54:35,030 --> 00:54:36,958 and maybe a whole bunch of other product names 1152 00:54:36,958 --> 00:54:38,750 you might have encountered over time, which 1153 00:54:38,750 --> 00:54:42,980 is to say there's many different types of tools, and servers, 1154 00:54:42,980 --> 00:54:44,990 and software in which you can use SQL. 1155 00:54:44,990 --> 00:54:47,780 We're going to use a very lightweight version of the SQL language 1156 00:54:47,780 --> 00:54:49,370 today called SQLite. 1157 00:54:49,370 --> 00:54:51,680 This is the version of SQL that's generally 1158 00:54:51,680 --> 00:54:54,020 used on iPhones and Android devices these days. 1159 00:54:54,020 --> 00:54:56,930 If you download an app that stores data like your own contacts, 1160 00:54:56,930 --> 00:54:59,000 typically is stored using SQLite. 1161 00:54:59,000 --> 00:55:02,710 Because it's fairly lightweight, but you can still store hundreds, 1162 00:55:02,710 --> 00:55:05,810 thousands, even tens of thousands of pieces of data 1163 00:55:05,810 --> 00:55:07,970 even using this lightweight version thereof. 1164 00:55:07,970 --> 00:55:10,790 SQLite3 is like version 3 of this tool. 1165 00:55:10,790 --> 00:55:16,340 We're going to go ahead and run SQLite3 with a file called favorites.db. 1166 00:55:16,340 --> 00:55:20,120 It's conventional in the world of SQLite to name your file something.db. 1167 00:55:20,120 --> 00:55:22,490 I'm going to create a database called favorites.db. 1168 00:55:22,490 --> 00:55:27,010 Once I'm inside of the program, now I'm going to go ahead and enter CSV Mode. 1169 00:55:27,010 --> 00:55:28,760 Again, not something you have to memorize, 1170 00:55:28,760 --> 00:55:30,468 just something you can look up as needed. 1171 00:55:30,468 --> 00:55:34,100 And then, I'm going to import favorites.csv 1172 00:55:34,100 --> 00:55:40,260 into a table, that is, a sheet, if you will, called favorites as well. 1173 00:55:40,260 --> 00:55:44,030 Now I'm going to hit Enter and I'm going to go ahead and exit the program 1174 00:55:44,030 --> 00:55:45,650 altogether and type ls. 1175 00:55:45,650 --> 00:55:48,170 Now I have three files in my current directory-- 1176 00:55:48,170 --> 00:55:52,130 the CSV file, the Python file from before, and now favorites.db. 1177 00:55:52,130 --> 00:55:56,180 But if I did this right, all of the data you all typed into the CSV file 1178 00:55:56,180 --> 00:55:59,840 has now been loaded into a proper database where I can now use 1179 00:55:59,840 --> 00:56:03,180 this SQL language to access it instead. 1180 00:56:03,180 --> 00:56:07,730 So let's go ahead again and run SQLite3 of favorites.db, which now exists. 1181 00:56:07,730 --> 00:56:10,640 And now, at the SQLite prompt I can start 1182 00:56:10,640 --> 00:56:13,160 to play around and see what this data is. 1183 00:56:13,160 --> 00:56:16,610 For instance, I can look, by typing .schema, 1184 00:56:16,610 --> 00:56:19,362 at what the schema is of my data, what's the design. 1185 00:56:19,362 --> 00:56:22,070 Now no thought was put into the design of this data at the moment 1186 00:56:22,070 --> 00:56:23,900 because I automated the whole process. 1187 00:56:23,900 --> 00:56:26,750 Once we start creating our own databases we'll 1188 00:56:26,750 --> 00:56:29,750 give more thought to the data types and the columns that we have. 1189 00:56:29,750 --> 00:56:34,220 But we can see what SQLite presumed I wanted just 1190 00:56:34,220 --> 00:56:36,530 by importing the data by default. 1191 00:56:36,530 --> 00:56:41,120 What the import command did for me a moment ago is essentially the syntax. 1192 00:56:41,120 --> 00:56:44,510 It automated the process of creating a table, if it doesn't exist, 1193 00:56:44,510 --> 00:56:45,770 called favorites. 1194 00:56:45,770 --> 00:56:48,980 And then notice, in parentheses it gave me three columns-- 1195 00:56:48,980 --> 00:56:53,360 timestamp, title, and genres, which were inferred, obviously, from the CSV. 1196 00:56:53,360 --> 00:56:56,000 All three of which have been decreed to be text. 1197 00:56:56,000 --> 00:56:59,180 Again, once we're more comfortable we'll create our own tables, 1198 00:56:59,180 --> 00:57:01,010 choose our own types and column names. 1199 00:57:01,010 --> 00:57:03,350 But for now, I just automated the whole process just 1200 00:57:03,350 --> 00:57:08,120 to get us started by using this built-in import command as well. 1201 00:57:08,120 --> 00:57:08,810 All right. 1202 00:57:08,810 --> 00:57:11,630 So what now can I begin to do? 1203 00:57:11,630 --> 00:57:16,910 Well, if I wanted to, for instance, start playing around with data therein, 1204 00:57:16,910 --> 00:57:19,595 I might execute a couple of different commands. 1205 00:57:19,595 --> 00:57:23,000 1206 00:57:23,000 --> 00:57:28,421 Let me find the right one here-- one of which would be select. 1207 00:57:28,421 --> 00:57:31,610 Select being one of our most versatile tools 1208 00:57:31,610 --> 00:57:33,180 to select data from this database. 1209 00:57:33,180 --> 00:57:35,720 So if I have these three columns here-- timestamp, 1210 00:57:35,720 --> 00:57:39,020 title, and genres, suppose I want to select all of the titles. 1211 00:57:39,020 --> 00:57:43,790 Doing that earlier in Python required importing the CSV library, 1212 00:57:43,790 --> 00:57:48,740 opening the file, creating a reader or a DictReader, iterating over every row, 1213 00:57:48,740 --> 00:57:51,500 adding every title to a dictionary or just printing it out, 1214 00:57:51,500 --> 00:57:52,310 and dot, dot, dot. 1215 00:57:52,310 --> 00:57:55,170 There was a dozen or so lines of code when we first began. 1216 00:57:55,170 --> 00:57:56,840 Now, how about this? 1217 00:57:56,840 --> 00:58:01,220 Select title from favorites, semicolon, done. 1218 00:58:01,220 --> 00:58:05,570 So now, with this particular language, the output is very textual 1219 00:58:05,570 --> 00:58:08,930 and it's simulating what it looks like if it were more graphical by creating 1220 00:58:08,930 --> 00:58:10,870 this table, so to speak. 1221 00:58:10,870 --> 00:58:14,080 Select title from favorites is a distillation 1222 00:58:14,080 --> 00:58:17,530 in a different language called SQL of all the lines of code 1223 00:58:17,530 --> 00:58:20,740 I wrote early on when we first started playing with favorites.py. 1224 00:58:20,740 --> 00:58:25,540 SQL is therefore optimized for reading, and creating, and updating, 1225 00:58:25,540 --> 00:58:27,500 and ultimately, deleting data. 1226 00:58:27,500 --> 00:58:30,700 So here's perhaps a better tool for the job once you have the data. 1227 00:58:30,700 --> 00:58:34,030 Tossing it into a more powerful, versatile format 1228 00:58:34,030 --> 00:58:37,228 might allow you now to get more work done more quickly 1229 00:58:37,228 --> 00:58:38,770 without having to reinvent the wheel. 1230 00:58:38,770 --> 00:58:41,510 Someone else has figured out how to select data like this. 1231 00:58:41,510 --> 00:58:43,760 What more can I do here? 1232 00:58:43,760 --> 00:58:47,050 Well, let me go ahead and pull up, in a moment, just a little bit 1233 00:58:47,050 --> 00:58:49,390 of a cheat sheet here. 1234 00:58:49,390 --> 00:58:53,840 Give me one second to find this. 1235 00:58:53,840 --> 00:58:58,090 So suppose I want to now select data a little more powerfully. 1236 00:58:58,090 --> 00:59:00,220 So here's what I just did in a canonical way. 1237 00:59:00,220 --> 00:59:01,720 So select typically works like this. 1238 00:59:01,720 --> 00:59:05,860 You select columns from a specific table, semicolon. 1239 00:59:05,860 --> 00:59:08,260 Unfortunately, stupid semicolons are back. 1240 00:59:08,260 --> 00:59:12,710 Select columns from table then, is the generic form of what I just did. 1241 00:59:12,710 --> 00:59:17,122 More specifically, I selected one column called title from favorites. 1242 00:59:17,122 --> 00:59:18,580 Favorites is the name of the table. 1243 00:59:18,580 --> 00:59:19,690 Semicolon ends my thought. 1244 00:59:19,690 --> 00:59:23,440 Suppose I wanted to get two things, like the genres that each of you inputted. 1245 00:59:23,440 --> 00:59:28,420 I could instead do select title, comma, genres from favorites, 1246 00:59:28,420 --> 00:59:30,220 and then, a semicolon, and Enter. 1247 00:59:30,220 --> 00:59:32,110 It's going to look a little ugly on my screen 1248 00:59:32,110 --> 00:59:33,670 because some of these titles and-- 1249 00:59:33,670 --> 00:59:37,300 OK, one of you really went all out with Community. 1250 00:59:37,300 --> 00:59:40,660 You can see that it's just wrapping in an ugly way, 1251 00:59:40,660 --> 00:59:43,300 but it's just now showing me two columns. 1252 00:59:43,300 --> 00:59:46,840 If we scroll up to the very top again, the left most of one, 1253 00:59:46,840 --> 00:59:48,280 Black Mirror went all out, too. 1254 00:59:48,280 --> 00:59:49,150 Thank you. 1255 00:59:49,150 --> 00:59:52,000 And now, OK, we're going to have to clean some of these up. 1256 00:59:52,000 --> 00:59:54,115 Game of Thrones, good comedy, yes. 1257 00:59:54,115 --> 00:59:57,550 1258 00:59:57,550 --> 00:59:59,480 Keep going, keep going, keep going. 1259 00:59:59,480 --> 01:00:02,870 So now we've selected two of the columns that we care about. 1260 01:00:02,870 --> 01:00:03,370 There it is. 1261 01:00:03,370 --> 01:00:06,380 OK, so it's crazy wide because of all of those genres. 1262 01:00:06,380 --> 01:00:09,135 But it allows me to select exactly the data I want. 1263 01:00:09,135 --> 01:00:12,010 Let's go back to the titles, though, and perhaps start playing around 1264 01:00:12,010 --> 01:00:13,750 with some modifiers here. 1265 01:00:13,750 --> 01:00:18,310 For instance, it turns out, using SQL there's a lot of functionality 1266 01:00:18,310 --> 01:00:19,900 built into the language. 1267 01:00:19,900 --> 01:00:22,843 You've got a lot of functions, similar to Excel or Google Spreadsheets 1268 01:00:22,843 --> 01:00:24,010 where you can have formulas. 1269 01:00:24,010 --> 01:00:26,320 SQL provides you with some of the same heuristics that 1270 01:00:26,320 --> 01:00:30,350 allow you to apply operations like these on entire columns. 1271 01:00:30,350 --> 01:00:32,920 For instance, you can take averages, count the total, 1272 01:00:32,920 --> 01:00:36,010 get the distinct values, force things to lowercase, uppercase, min, 1273 01:00:36,010 --> 01:00:37,220 and max, and so forth. 1274 01:00:37,220 --> 01:00:39,610 So let's try distinct, for instance. 1275 01:00:39,610 --> 01:00:43,450 Let me go back to my Terminal, and let's say, select, 1276 01:00:43,450 --> 01:00:48,760 how about the distinct titles from the favorites table? 1277 01:00:48,760 --> 01:00:49,513 Enter. 1278 01:00:49,513 --> 01:00:51,430 I didn't bother selecting the genres because I 1279 01:00:51,430 --> 01:00:52,763 want it to be a little prettier. 1280 01:00:52,763 --> 01:00:58,090 And you can see here that we have just the distinct titles, 1281 01:00:58,090 --> 01:01:00,547 except for issues of formatting. 1282 01:01:00,547 --> 01:01:02,380 So whitespace is going to be an issue again. 1283 01:01:02,380 --> 01:01:04,213 Capitalization is going to be a thing again. 1284 01:01:04,213 --> 01:01:05,230 So there's a trade-off. 1285 01:01:05,230 --> 01:01:09,280 One of the things I was doing in Python was forcing everything to uppercase 1286 01:01:09,280 --> 01:01:10,930 and then getting rid of whitespace. 1287 01:01:10,930 --> 01:01:12,400 But we could combine some of these. 1288 01:01:12,400 --> 01:01:15,340 I could do something like force every title to uppercase, 1289 01:01:15,340 --> 01:01:16,507 then get the distinct value. 1290 01:01:16,507 --> 01:01:19,382 And that's actually going to get rid of some of those values as well. 1291 01:01:19,382 --> 01:01:21,860 And again, I did it all in one simple line that was fast. 1292 01:01:21,860 --> 01:01:24,027 So let me pull up at the bottom of the screen again. 1293 01:01:24,027 --> 01:01:28,090 I selected distinct upper titles from favorites, 1294 01:01:28,090 --> 01:01:31,070 and that did everything for me at once in just one breath. 1295 01:01:31,070 --> 01:01:33,640 Suppose I want to get the total number of counts of titles. 1296 01:01:33,640 --> 01:01:40,150 How about select count of all of those titles from favorites? 1297 01:01:40,150 --> 01:01:43,990 Semicolon, Enter, and now you get back a mini table 1298 01:01:43,990 --> 01:01:47,960 that contains just your answer, 158 in this case. 1299 01:01:47,960 --> 01:01:50,560 So that's the total number of, not distinct, 1300 01:01:50,560 --> 01:01:52,690 but total titles that we had in the file. 1301 01:01:52,690 --> 01:01:56,560 And we could continue to manipulate the data further using, again, 1302 01:01:56,560 --> 01:01:58,450 functions like these here. 1303 01:01:58,450 --> 01:02:01,550 But there's also additional filtration we can do. 1304 01:02:01,550 --> 01:02:07,010 We can also qualify our selections by saying where some condition is true. 1305 01:02:07,010 --> 01:02:10,630 So just as in Scratch, and C, and Python, you have Boolean expressions, 1306 01:02:10,630 --> 01:02:16,270 you can have the same in SQL as well, where I can filter my data where 1307 01:02:16,270 --> 01:02:19,390 something is true or false. 1308 01:02:19,390 --> 01:02:21,340 Like allows me to do approximations. 1309 01:02:21,340 --> 01:02:23,500 If I want to get something that's like The Office 1310 01:02:23,500 --> 01:02:26,320 but not necessarily T-H-E, space, Office, 1311 01:02:26,320 --> 01:02:29,440 I could do pattern matching using like here. 1312 01:02:29,440 --> 01:02:33,170 Order by, limit, and grouped by are other commands I can execute, too. 1313 01:02:33,170 --> 01:02:35,980 So let me go back and do a couple of these here. 1314 01:02:35,980 --> 01:02:42,610 How about, let me just get, oh, I don't know, all of the titles from favorites 1315 01:02:42,610 --> 01:02:44,848 but limit it to 10 results. 1316 01:02:44,848 --> 01:02:47,890 That might be one thing that's helpful to see if you just care about some 1317 01:02:47,890 --> 01:02:50,110 of the data at the top there instead. 1318 01:02:50,110 --> 01:02:56,530 How about, select all of the titles from favorites, where the title itself 1319 01:02:56,530 --> 01:02:59,710 is like, quote, unquote, "Office?" 1320 01:02:59,710 --> 01:03:02,740 And this will give me only two answers. 1321 01:03:02,740 --> 01:03:07,180 Those are the two rows, recall, that I mutated by getting rid of the word The. 1322 01:03:07,180 --> 01:03:11,880 Notice that like allows me too tolerate uppercase and lowercase. 1323 01:03:11,880 --> 01:03:14,880 Because if I instead just use the equal sign, 1324 01:03:14,880 --> 01:03:21,060 and in SQL a single equal sign does, in fact, mean equality. 1325 01:03:21,060 --> 01:03:23,370 For comparison's sake, it's not doing assignment. 1326 01:03:23,370 --> 01:03:26,070 This is not how you assign data in SQL. 1327 01:03:26,070 --> 01:03:27,940 I got back no answers there. 1328 01:03:27,940 --> 01:03:31,620 So indeed, the equal sign is giving me literal answers 1329 01:03:31,620 --> 01:03:33,990 that searches just for what I typed in. 1330 01:03:33,990 --> 01:03:35,370 How could I get all of these? 1331 01:03:35,370 --> 01:03:39,090 Well, similar in spirit to regular expressions but not quite as powerful 1332 01:03:39,090 --> 01:03:41,620 in SQL, I could do something like this. 1333 01:03:41,620 --> 01:03:45,330 I can select the title from favorites where the title is like, quote, 1334 01:03:45,330 --> 01:03:46,710 unquote, "Office." 1335 01:03:46,710 --> 01:03:52,450 But I can add, a bit weirdly, percent signs to the left and the right. 1336 01:03:52,450 --> 01:03:57,930 So the language SQL supports the same notion of pattern matching 1337 01:03:57,930 --> 01:03:59,547 but much more limited out of the box. 1338 01:03:59,547 --> 01:04:01,380 If we want more powerful regular expressions 1339 01:04:01,380 --> 01:04:03,420 we probably do want to use Python instead. 1340 01:04:03,420 --> 01:04:06,720 But the percent sign here means 0 or more characters 1341 01:04:06,720 --> 01:04:09,340 on the left, 0 or more characters on the right. 1342 01:04:09,340 --> 01:04:14,460 So this will just grab any title that contains O-F-F-I-C-E in it in that 1343 01:04:14,460 --> 01:04:15,100 order. 1344 01:04:15,100 --> 01:04:19,290 And now I get all 16, it would seem, of those results, again. 1345 01:04:19,290 --> 01:04:20,490 How do I know it's 16? 1346 01:04:20,490 --> 01:04:23,160 Well, I can just get the count of those titles 1347 01:04:23,160 --> 01:04:26,140 and get back that answer instead as well. 1348 01:04:26,140 --> 01:04:29,640 So again, it takes some getting used to, the vocabulary 1349 01:04:29,640 --> 01:04:30,973 and the syntax that you can use. 1350 01:04:30,973 --> 01:04:32,682 There's these building blocks and others. 1351 01:04:32,682 --> 01:04:35,500 But SQL is really designed, again, for creating, reading, updating, 1352 01:04:35,500 --> 01:04:36,480 and deleting data. 1353 01:04:36,480 --> 01:04:41,080 For instance, I've never really been a fan of Friends, for instance. 1354 01:04:41,080 --> 01:04:46,860 So right now if I do select, how about title from favorites 1355 01:04:46,860 --> 01:04:53,340 where title like, quote, unquote, Friends with the percent signs? 1356 01:04:53,340 --> 01:04:55,330 We can see that there's a whole bunch of them. 1357 01:04:55,330 --> 01:04:56,520 That's how many exactly. 1358 01:04:56,520 --> 01:04:57,960 Let's just do a quick count. 1359 01:04:57,960 --> 01:04:59,890 So that's nine of them. 1360 01:04:59,890 --> 01:05:03,300 Well, delete from favorites. 1361 01:05:03,300 --> 01:05:10,500 OK, you and me, delete from favorites, where title like Friends, Enter. 1362 01:05:10,500 --> 01:05:14,280 Nothing seems to happen, but bye-bye Friends. 1363 01:05:14,280 --> 01:05:15,060 [APPLAUSE] 1364 01:05:15,060 --> 01:05:16,143 DAVID J. MALAN: Thank you. 1365 01:05:16,143 --> 01:05:19,110 1366 01:05:19,110 --> 01:05:21,390 So now we've actually changed the data. 1367 01:05:21,390 --> 01:05:25,110 And this is what's compelling about a proper database. 1368 01:05:25,110 --> 01:05:29,320 Yes, you could technically write Python code that not only reads the CSV file, 1369 01:05:29,320 --> 01:05:30,540 but also writes it. 1370 01:05:30,540 --> 01:05:32,820 You can change using quote, unquote, "A" for append, 1371 01:05:32,820 --> 01:05:35,910 or quote, unquote, "W" for write, instead of quote, unquote, 1372 01:05:35,910 --> 01:05:37,263 "R" for read alone. 1373 01:05:37,263 --> 01:05:39,930 But it's definitely a little more involved to do that in Python. 1374 01:05:39,930 --> 01:05:42,250 But with SQL, you can update the data in real time. 1375 01:05:42,250 --> 01:05:45,750 And if I were actually running a web application here or a database 1376 01:05:45,750 --> 01:05:47,880 for a mobile app, that change, theoretically, 1377 01:05:47,880 --> 01:05:50,160 would be reflected everywhere on your own devices 1378 01:05:50,160 --> 01:05:52,210 if you're somehow talking to this application. 1379 01:05:52,210 --> 01:05:53,995 So that's the direction we're headed. 1380 01:05:53,995 --> 01:05:55,620 This other thing has been bothering me. 1381 01:05:55,620 --> 01:06:02,640 So select, how about title from favorites, where title equals, 1382 01:06:02,640 --> 01:06:03,180 what was it? 1383 01:06:03,180 --> 01:06:06,600 The V Office, was it? 1384 01:06:06,600 --> 01:06:07,680 Yeah, it was that one. 1385 01:06:07,680 --> 01:06:12,150 How about we update favorites by setting title 1386 01:06:12,150 --> 01:06:19,080 equal to The Office, where title equals quote, unquote, "The V Office" 1387 01:06:19,080 --> 01:06:20,470 semicolon? 1388 01:06:20,470 --> 01:06:22,650 And now, if I select the same thing again 1389 01:06:22,650 --> 01:06:24,810 I can go up and down with my arrow keys quickly. 1390 01:06:24,810 --> 01:06:27,090 Now there is no The V Office. 1391 01:06:27,090 --> 01:06:29,200 We've actually changed that value. 1392 01:06:29,200 --> 01:06:30,300 How about genres? 1393 01:06:30,300 --> 01:06:35,880 Select genres from favorites, where the title is title 1394 01:06:35,880 --> 01:06:39,450 equals Game of Thrones, semicolon. 1395 01:06:39,450 --> 01:06:43,000 These were kind of long, and I don't really agree with all of that. 1396 01:06:43,000 --> 01:06:49,560 So how about we update favorites, set genres equal to, sure, 1397 01:06:49,560 --> 01:06:52,650 action, adventure, sure, drama? 1398 01:06:52,650 --> 01:06:54,390 OK, so it's a decent list. 1399 01:06:54,390 --> 01:06:56,700 Fantasy, sure, thriller, war. 1400 01:06:56,700 --> 01:07:01,050 OK, anything really but comedy, I would say. 1401 01:07:01,050 --> 01:07:03,160 Let's go ahead and hit Enter now. 1402 01:07:03,160 --> 01:07:07,800 And now, if I select genres again, same query, now we've canonicalized that. 1403 01:07:07,800 --> 01:07:09,160 We've thrown data away. 1404 01:07:09,160 --> 01:07:11,250 So whether or not that is right is probably 1405 01:07:11,250 --> 01:07:13,020 a bit subjective and argumentative. 1406 01:07:13,020 --> 01:07:16,920 But I have at least cleaned up my data, which is, again, the U in CRUD. 1407 01:07:16,920 --> 01:07:20,670 Create, read, update, delete, you can do it that easily. 1408 01:07:20,670 --> 01:07:22,290 Beware using delete. 1409 01:07:22,290 --> 01:07:26,430 Beware worse using drop, whereby you can drop an entire table. 1410 01:07:26,430 --> 01:07:29,310 But via these kinds of commands, can we actually now 1411 01:07:29,310 --> 01:07:33,390 manipulate our data much more rapidly and with single thoughts. 1412 01:07:33,390 --> 01:07:36,390 And in fact, if you're an aspiring statistician, or data scientist, 1413 01:07:36,390 --> 01:07:40,320 or analyst in the real world, SQL is such a commonly used language 1414 01:07:40,320 --> 01:07:43,573 because it allows you to really dive into data quickly, and ask 1415 01:07:43,573 --> 01:07:45,990 questions of the data, and get back answers quite quickly. 1416 01:07:45,990 --> 01:07:47,530 And this is a simple data set. 1417 01:07:47,530 --> 01:07:51,840 You can do this with much larger data sets as we soon will, too. 1418 01:07:51,840 --> 01:07:55,050 Or any questions on what we've seen of SQL thus far? 1419 01:07:55,050 --> 01:07:57,170 Only scratched the surface, but again, it 1420 01:07:57,170 --> 01:08:03,100 boils down to creating, reading, updating, and deleting data. 1421 01:08:03,100 --> 01:08:04,910 Questions here? 1422 01:08:04,910 --> 01:08:05,410 All right. 1423 01:08:05,410 --> 01:08:07,820 Well, let's consider the design of this data. 1424 01:08:07,820 --> 01:08:11,900 Recall that if I do .schema, that shows me the design of my table, 1425 01:08:11,900 --> 01:08:13,990 the so-called schema of my data. 1426 01:08:13,990 --> 01:08:15,010 This is OK. 1427 01:08:15,010 --> 01:08:17,649 It gets the job done, and frankly, everything the user typed in 1428 01:08:17,649 --> 01:08:21,609 was arguably text, including the timestamp, which is the date and time. 1429 01:08:21,609 --> 01:08:24,040 But so the data set itself is somewhat simple. 1430 01:08:24,040 --> 01:08:29,569 But if we look at the data set itself, especially genres, let's do this. 1431 01:08:29,569 --> 01:08:31,960 Select genres from favorites. 1432 01:08:31,960 --> 01:08:34,540 And let me point out one other thing stylistically, too. 1433 01:08:34,540 --> 01:08:39,050 I am very deliberately capitalizing all of the special SQL keywords, 1434 01:08:39,050 --> 01:08:42,700 and I'm lowercasing all of the column names and the table names. 1435 01:08:42,700 --> 01:08:45,760 This is a convention, and honestly, it just helps you read, 1436 01:08:45,760 --> 01:08:49,210 I think, the code when you're co-mingling your names for columns 1437 01:08:49,210 --> 01:08:52,600 and tables with proper SQL keywords. 1438 01:08:52,600 --> 01:08:58,029 But I could just as easily do select genres from favorites, 1439 01:08:58,029 --> 01:09:01,149 but again, the SQL specific keywords don't quite jump out as much. 1440 01:09:01,149 --> 01:09:04,240 So stylistically, we would recommend this, selecting genres 1441 01:09:04,240 --> 01:09:06,250 from favorites, semicolon. 1442 01:09:06,250 --> 01:09:09,370 So here is where-- 1443 01:09:09,370 --> 01:09:11,560 oh. 1444 01:09:11,560 --> 01:09:12,850 OK, that was not intended. 1445 01:09:12,850 --> 01:09:15,220 I accidentally made every show, including 1446 01:09:15,220 --> 01:09:20,410 The Office about action, adventure, drama, fantasy, thriller, and war. 1447 01:09:20,410 --> 01:09:24,460 How did I do that accidentally? 1448 01:09:24,460 --> 01:09:25,600 What did I do wrong? 1449 01:09:25,600 --> 01:09:30,234 AUDIENCE: [INAUDIBLE] 1450 01:09:30,234 --> 01:09:31,109 DAVID J. MALAN: Yeah. 1451 01:09:31,109 --> 01:09:32,460 So beware, this is funny. 1452 01:09:32,460 --> 01:09:34,359 I think I did say beware around this time. 1453 01:09:34,359 --> 01:09:37,810 So the SQL database took me-- literally, I updated favorites, 1454 01:09:37,810 --> 01:09:41,010 setting genres equal to that, semicolon, end of thought. 1455 01:09:41,010 --> 01:09:43,500 I really wanted to say where title equals, 1456 01:09:43,500 --> 01:09:46,020 quote, unquote, "Game of Thrones." 1457 01:09:46,020 --> 01:09:49,080 Unfortunately, there isn't an undo command or time machine 1458 01:09:49,080 --> 01:09:51,930 with a SQL database, so the best we can do here 1459 01:09:51,930 --> 01:09:56,250 is, let's actually get rid of favorites.db. 1460 01:09:56,250 --> 01:10:02,340 Let's run SQLite of favorites.db again, which now will be recreated. 1461 01:10:02,340 --> 01:10:04,440 Let me change myself into CSV mode. 1462 01:10:04,440 --> 01:10:09,930 Let me import, into my favorites table, the CSV file. 1463 01:10:09,930 --> 01:10:14,250 And now, Friends is back, for better or for worse, 1464 01:10:14,250 --> 01:10:15,510 but so are all of our genres. 1465 01:10:15,510 --> 01:10:18,090 1466 01:10:18,090 --> 01:10:21,330 If I now reload the file and do select, star, from-- 1467 01:10:21,330 --> 01:10:21,840 sorry. 1468 01:10:21,840 --> 01:10:25,838 Select genres from favorites, that was the result I was getting. 1469 01:10:25,838 --> 01:10:28,630 It's much messier, but that's because some of these are quite long. 1470 01:10:28,630 --> 01:10:30,297 But now we're back to the original data. 1471 01:10:30,297 --> 01:10:32,700 Lesson here, be sure to back up your work. 1472 01:10:32,700 --> 01:10:33,240 All right. 1473 01:10:33,240 --> 01:10:36,850 So what more can we now do with this data? 1474 01:10:36,850 --> 01:10:40,480 Well, I don't love the design of the genres table for a couple of reasons. 1475 01:10:40,480 --> 01:10:43,200 One, we didn't have any sort of validation, 1476 01:10:43,200 --> 01:10:45,210 but user input is going to be messy. 1477 01:10:45,210 --> 01:10:47,790 There's just a lot of redundancy in here. 1478 01:10:47,790 --> 01:10:49,720 Let's go ahead and do this. 1479 01:10:49,720 --> 01:10:51,760 Let me select all the comedies you all typed in. 1480 01:10:51,760 --> 01:10:57,960 So select title from favorites, where genres equals, 1481 01:10:57,960 --> 01:11:00,120 quote, unquote, "comedy." 1482 01:11:00,120 --> 01:11:05,730 OK, so there's all of the shows that are explicitly comedies. 1483 01:11:05,730 --> 01:11:08,940 But I think there might actually be others. 1484 01:11:08,940 --> 01:11:11,350 Let me scroll back up here. 1485 01:11:11,350 --> 01:11:12,180 Comedy, drama. 1486 01:11:12,180 --> 01:11:14,040 What was a comedy and a drama? 1487 01:11:14,040 --> 01:11:19,140 How about let's search for the-- oops, let me copy paste comedy, comma, drama. 1488 01:11:19,140 --> 01:11:23,700 OK, so The Office, in this case, was considered comedy and drama, Billions, 1489 01:11:23,700 --> 01:11:27,010 It's Always Sunny in Philadelphia, and Gilmore Girls as well. 1490 01:11:27,010 --> 01:11:31,450 But notice that I get many more when I just search for comedy. 1491 01:11:31,450 --> 01:11:36,450 So the catch here is that, because I have all of these genres implemented 1492 01:11:36,450 --> 01:11:39,030 the way Google did, as a comma-separated list, 1493 01:11:39,030 --> 01:11:43,590 it's actually really hard and messy to get at any show, all of the shows 1494 01:11:43,590 --> 01:11:46,830 that are somewhere described as comedy. 1495 01:11:46,830 --> 01:11:49,680 Because if I search for quote, unquote, "comedy," the only answers 1496 01:11:49,680 --> 01:11:53,610 I'm going to get are this one, whatever that show is, this one, whatever 1497 01:11:53,610 --> 01:11:55,240 that show is, this one. 1498 01:11:55,240 --> 01:11:56,670 But I'm not going to get this one. 1499 01:11:56,670 --> 01:11:58,170 I'm not going to get this one. 1500 01:11:58,170 --> 01:11:59,070 Why? 1501 01:11:59,070 --> 01:12:02,790 If I'm searching for, where genres equals, quote, unquote, "comedy," 1502 01:12:02,790 --> 01:12:04,350 why am I missing those other shows? 1503 01:12:04,350 --> 01:12:07,202 1504 01:12:07,202 --> 01:12:07,910 Why am I missing? 1505 01:12:07,910 --> 01:12:08,595 Yeah? 1506 01:12:08,595 --> 01:12:11,020 AUDIENCE: [INAUDIBLE] 1507 01:12:11,020 --> 01:12:12,020 DAVID J. MALAN: Exactly. 1508 01:12:12,020 --> 01:12:14,390 It's not just a comedy, it's a comedy and a drama, 1509 01:12:14,390 --> 01:12:16,680 and a comedy or a news show, and so forth. 1510 01:12:16,680 --> 01:12:20,510 So I have to search for these commas, so this gets messy quickly, right? 1511 01:12:20,510 --> 01:12:22,560 Let me copy this so I can do this. 1512 01:12:22,560 --> 01:12:26,150 Let me search for where genres equals comedy. 1513 01:12:26,150 --> 01:12:33,230 How about, or genres equals comedy, drama, or genres 1514 01:12:33,230 --> 01:12:36,650 equals this whole thing, comedy, news, talk show? 1515 01:12:36,650 --> 01:12:38,370 I'm going to get more and more results. 1516 01:12:38,370 --> 01:12:39,900 But that's not going to scale well. 1517 01:12:39,900 --> 01:12:42,810 What could I do instead of enumerating with ors 1518 01:12:42,810 --> 01:12:45,800 all of the different permutations of genres, do you think? 1519 01:12:45,800 --> 01:12:49,655 AUDIENCE: [INAUDIBLE] 1520 01:12:49,655 --> 01:12:50,530 DAVID J. MALAN: Yeah. 1521 01:12:50,530 --> 01:12:54,430 So I could use the keyword is, similar in Python to the word in. 1522 01:12:54,430 --> 01:12:56,980 I could use the like keyword so that so long 1523 01:12:56,980 --> 01:13:02,080 as the genres is like comedy somewhere in there, 1524 01:13:02,080 --> 01:13:05,900 that's going to give me all of them, so long as the word comedy is in there. 1525 01:13:05,900 --> 01:13:09,265 But let me go ahead and just open the form from earlier. 1526 01:13:09,265 --> 01:13:12,250 1527 01:13:12,250 --> 01:13:15,073 Let me see if I can open this real quick before I toggle over. 1528 01:13:15,073 --> 01:13:16,990 If we look back at the form, recall that there 1529 01:13:16,990 --> 01:13:22,630 were all of those radio buttons asking for the specific genres 1530 01:13:22,630 --> 01:13:24,530 into which something fell. 1531 01:13:24,530 --> 01:13:29,710 And if I open this, let me full screen here and now open the original form. 1532 01:13:29,710 --> 01:13:32,920 You'll see all of the genres here, none of which 1533 01:13:32,920 --> 01:13:38,680 are that worrisome except for a corner case is jumping out at me. 1534 01:13:38,680 --> 01:13:43,330 Where might the like keyword alone get me into trouble? 1535 01:13:43,330 --> 01:13:44,380 It's not with comedy. 1536 01:13:44,380 --> 01:13:45,880 I'm OK with comedy. 1537 01:13:45,880 --> 01:13:47,680 AUDIENCE: Music and musical? 1538 01:13:47,680 --> 01:13:51,190 DAVID J. MALAN: Yeah, music and musical are deliberately on the list here. 1539 01:13:51,190 --> 01:13:53,660 Because, one, they're separate genres. 1540 01:13:53,660 --> 01:13:56,310 But if I just search for something that's like music, 1541 01:13:56,310 --> 01:13:58,810 I'm going to accidentally suck in all of the musicals, which 1542 01:13:58,810 --> 01:13:59,950 might not be what I intend. 1543 01:13:59,950 --> 01:14:03,310 If music is a music video or whatever, and musical is actually 1544 01:14:03,310 --> 01:14:06,620 a different type of show, I don't want to just do that. 1545 01:14:06,620 --> 01:14:08,110 So it seems just very messy. 1546 01:14:08,110 --> 01:14:11,682 I could probably hack something together with-- maybe add some commas in there, 1547 01:14:11,682 --> 01:14:12,640 or something like this. 1548 01:14:12,640 --> 01:14:15,520 But this is just not a good design for the data. 1549 01:14:15,520 --> 01:14:17,890 Google has done it this way because it's just 1550 01:14:17,890 --> 01:14:22,000 simple to actually keep the user's data all in a single column, 1551 01:14:22,000 --> 01:14:24,520 and just as they did, separate it by commas. 1552 01:14:24,520 --> 01:14:28,720 But this is a real messy way to use CSV is 1553 01:14:28,720 --> 01:14:32,830 by putting comma-separated values in your comma-separated values. 1554 01:14:32,830 --> 01:14:35,350 Arguably, the folks at Google probably just did this 1555 01:14:35,350 --> 01:14:36,520 because it's just simpler. 1556 01:14:36,520 --> 01:14:38,645 And they didn't want to give people multiple sheets 1557 01:14:38,645 --> 01:14:42,220 or complicate things using some other weirder character than commas alone. 1558 01:14:42,220 --> 01:14:44,450 But I bet there's a better way for us to do this. 1559 01:14:44,450 --> 01:14:45,860 And let me go ahead and do this. 1560 01:14:45,860 --> 01:14:47,978 Let me go back into my code here. 1561 01:14:47,978 --> 01:14:50,020 And in just a moment, I'm going to grab a program 1562 01:14:50,020 --> 01:14:54,250 that I wrote in advance that's going to use Python to open up the CSV file, 1563 01:14:54,250 --> 01:14:59,110 iterate over all of the rows, and load the data into two tables this time, 1564 01:14:59,110 --> 01:15:02,630 two tables, one called shows, and one called genres, 1565 01:15:02,630 --> 01:15:05,140 so as to actually separate these two things out. 1566 01:15:05,140 --> 01:15:07,730 Give me just a moment to grab the code. 1567 01:15:07,730 --> 01:15:10,720 And when I run this, I'll only have to run it once. 1568 01:15:10,720 --> 01:15:13,090 Let me go ahead and run Python in a moment, 1569 01:15:13,090 --> 01:15:15,940 and I'll reveal the results in a sec. 1570 01:15:15,940 --> 01:15:18,790 This is going to be version 8 of the code online. 1571 01:15:18,790 --> 01:15:22,480 When I do this, let me go ahead and open up this file. 1572 01:15:22,480 --> 01:15:26,260 Give me a second to move it into this directory. 1573 01:15:26,260 --> 01:15:28,570 Version 8, OK. 1574 01:15:28,570 --> 01:15:31,515 So here we have version 8 of this that's available online 1575 01:15:31,515 --> 01:15:32,890 that's going to do the following. 1576 01:15:32,890 --> 01:15:34,848 And I'll gloss over some of the details just so 1577 01:15:34,848 --> 01:15:38,740 that we don't get stuck in the weeds of some of this code. 1578 01:15:38,740 --> 01:15:41,380 I'm going to be using, at the top of this program, 1579 01:15:41,380 --> 01:15:45,820 as we'll soon see, a CS50 library, not for the sake of get_string, 1580 01:15:45,820 --> 01:15:48,730 or get_int, or get_float, but because there's some built-in SQL 1581 01:15:48,730 --> 01:15:51,880 functionality that we didn't discuss a couple of weeks back with the CS50 1582 01:15:51,880 --> 01:15:52,750 library itself. 1583 01:15:52,750 --> 01:15:56,680 But inside of the CS50 library we'll see there is a special function called 1584 01:15:56,680 --> 01:16:01,330 SQL that gives you the ability using this weird URL-like looking thing, 1585 01:16:01,330 --> 01:16:06,040 technically called a URI, that allows me to open a file called favorites.db. 1586 01:16:06,040 --> 01:16:08,380 And long story short, all of the subsequent code 1587 01:16:08,380 --> 01:16:12,580 is going to iterate over this favorites.csv file that we downloaded. 1588 01:16:12,580 --> 01:16:16,240 And it's going to import it into the SQLite database, 1589 01:16:16,240 --> 01:16:19,430 but it's going to use two tables instead of just one. 1590 01:16:19,430 --> 01:16:21,640 So give me just a moment to run this, and then I'll 1591 01:16:21,640 --> 01:16:23,600 reveal the actual results. 1592 01:16:23,600 --> 01:16:26,270 This is going to be run on favorites.csv. 1593 01:16:26,270 --> 01:16:31,510 1594 01:16:31,510 --> 01:16:35,485 And taking a look here, give me just a moment. 1595 01:16:35,485 --> 01:16:39,310 1596 01:16:39,310 --> 01:16:43,350 Oh, give me a sec. 1597 01:16:43,350 --> 01:16:45,270 Come on. 1598 01:16:45,270 --> 01:16:46,080 Come on. 1599 01:16:46,080 --> 01:16:49,020 This program should not be taking this long. 1600 01:16:49,020 --> 01:16:50,460 Sorry. 1601 01:16:50,460 --> 01:16:51,840 Let's open this real fast. 1602 01:16:51,840 --> 01:16:54,750 1603 01:16:54,750 --> 01:16:55,860 Whoops, not that file. 1604 01:16:55,860 --> 01:16:59,000 1605 01:16:59,000 --> 01:16:59,500 OK. 1606 01:16:59,500 --> 01:17:02,350 Let me just skim this code real quick to see where we've gone wrong. 1607 01:17:02,350 --> 01:17:04,840 [INAUDIBLE] reader. 1608 01:17:04,840 --> 01:17:10,390 Reader, title, show ID in certain two shows. 1609 01:17:10,390 --> 01:17:15,370 [INAUDIBLE] genres split, DB execute. 1610 01:17:15,370 --> 01:17:15,880 All right. 1611 01:17:15,880 --> 01:17:17,410 This is me debugging in real time. 1612 01:17:17,410 --> 01:17:22,843 All those times we encourage you to use print, this is me actually using print. 1613 01:17:22,843 --> 01:17:24,760 We'll see how quickly I can recover from this. 1614 01:17:24,760 --> 01:17:26,560 Python of favorites version 8. 1615 01:17:26,560 --> 01:17:29,100 1616 01:17:29,100 --> 01:17:32,070 OK, so here's me debugging in real time. 1617 01:17:32,070 --> 01:17:32,882 It's printing it. 1618 01:17:32,882 --> 01:17:34,590 Oh, maybe I just didn't wait long enough. 1619 01:17:34,590 --> 01:17:35,940 OK, so here we go. 1620 01:17:35,940 --> 01:17:39,900 What I'm doing is printing out the dictionary that represents 1621 01:17:39,900 --> 01:17:41,542 each row that you all typed in. 1622 01:17:41,542 --> 01:17:43,000 And we're actually making progress. 1623 01:17:43,000 --> 01:17:44,220 All right. 1624 01:17:44,220 --> 01:17:46,510 I was too impatient and didn't wait long enough. 1625 01:17:46,510 --> 01:17:47,470 So in a moment-- 1626 01:17:47,470 --> 01:17:47,970 there we go. 1627 01:17:47,970 --> 01:17:50,370 All right, so all we have to do sometimes is wait. 1628 01:17:50,370 --> 01:17:54,420 Let me go ahead now and open this file using SQLite3. 1629 01:17:54,420 --> 01:17:58,080 So in SQLite3 I now have a different version of favorites.db. 1630 01:17:58,080 --> 01:17:59,970 I named it number 8 for consistency. 1631 01:17:59,970 --> 01:18:03,280 Once I've run the program I can do .schema to look inside of it. 1632 01:18:03,280 --> 01:18:07,200 And here's what the two tables in this database are going to look like. 1633 01:18:07,200 --> 01:18:10,920 I've created a table called shows, this time to represent all of the TV shows 1634 01:18:10,920 --> 01:18:14,410 that are favorites, that has two columns. 1635 01:18:14,410 --> 01:18:16,812 One is called ID, one is called Title. 1636 01:18:16,812 --> 01:18:19,020 But now I'm going to start taking out for a spin some 1637 01:18:19,020 --> 01:18:20,640 of the other features of SQL. 1638 01:18:20,640 --> 01:18:24,900 And besides there being text, it turns out there's a data type called integer. 1639 01:18:24,900 --> 01:18:26,910 Besides there being a data type called text, 1640 01:18:26,910 --> 01:18:30,150 there's also a special key phrase that you can specify 1641 01:18:30,150 --> 01:18:31,830 that the title can never be null. 1642 01:18:31,830 --> 01:18:35,160 Think back to our use of null in C. Think back 1643 01:18:35,160 --> 01:18:37,260 to the keyword none in Python. 1644 01:18:37,260 --> 01:18:40,830 This is a database constraint that allows you to ensure that none of you 1645 01:18:40,830 --> 01:18:42,630 can't have of favorite TV show. 1646 01:18:42,630 --> 01:18:46,320 If you submit the form, you have to have typed in a title for it 1647 01:18:46,320 --> 01:18:48,250 to end up in our database here. 1648 01:18:48,250 --> 01:18:50,940 And you'll notice one other new feature. 1649 01:18:50,940 --> 01:18:53,460 It turns out, on this table I'm defining what's 1650 01:18:53,460 --> 01:18:56,970 called a primary key, specifically to be the ID column. 1651 01:18:56,970 --> 01:18:58,600 More on that in just a moment. 1652 01:18:58,600 --> 01:19:02,700 Meanwhile, the second table my code has created for me, as we'll soon see, 1653 01:19:02,700 --> 01:19:08,220 gives me a column called show ID, and then, a genre, 1654 01:19:08,220 --> 01:19:11,140 the value of which is text that can also not be null. 1655 01:19:11,140 --> 01:19:12,760 And then more on this in a moment. 1656 01:19:12,760 --> 01:19:15,840 This table has what we're going to call a foreign key, 1657 01:19:15,840 --> 01:19:20,550 specifically the show ID column that references shows ID. 1658 01:19:20,550 --> 01:19:23,250 So before we get into the weeds of this, this 1659 01:19:23,250 --> 01:19:27,120 is now a way of creating the relation in relational database. 1660 01:19:27,120 --> 01:19:31,140 If I have two tables now, not just one, they can somehow 1661 01:19:31,140 --> 01:19:34,720 be linked together by a common column. 1662 01:19:34,720 --> 01:19:37,860 In other words, the shows column-- 1663 01:19:37,860 --> 01:19:41,340 shows table is going to give me a table with two columns-- 1664 01:19:41,340 --> 01:19:43,080 an ID and a title. 1665 01:19:43,080 --> 01:19:46,800 Every title you gave me, I'm going to assign a unique value. 1666 01:19:46,800 --> 01:19:51,600 The genre's table, meanwhile, is going to associate individual genres 1667 01:19:51,600 --> 01:19:54,480 singular with that same idea. 1668 01:19:54,480 --> 01:20:01,150 And the result of this, to pop back to the Terminal here, is, let's do this. 1669 01:20:01,150 --> 01:20:05,460 Select star from shows of this new database, 1670 01:20:05,460 --> 01:20:09,030 and you'll see that I've given, indeed, all of the shows you all typed 1671 01:20:09,030 --> 01:20:10,663 in unique identifiers. 1672 01:20:10,663 --> 01:20:13,830 I didn't filter out duplicates or do anything beyond just forcing everything 1673 01:20:13,830 --> 01:20:14,430 to uppercase. 1674 01:20:14,430 --> 01:20:16,930 So there's going to be some duplicates here because I didn't 1675 01:20:16,930 --> 01:20:18,840 want to get rid of anyone's data. 1676 01:20:18,840 --> 01:20:21,960 But you'll see that, indeed, I've given everyone 1677 01:20:21,960 --> 01:20:24,420 a unique identifier, from the very first person who 1678 01:20:24,420 --> 01:20:28,650 typed How I Met Your Mother, all the way down to input number 158. 1679 01:20:28,650 --> 01:20:35,760 Meanwhile, if I do select star from genres, which is now a table, not just 1680 01:20:35,760 --> 01:20:38,280 a column in the original data, now you'll 1681 01:20:38,280 --> 01:20:43,195 see a much better design for this data. 1682 01:20:43,195 --> 01:20:44,320 Notice what I've done here. 1683 01:20:44,320 --> 01:20:47,370 Let me go all the way to the top and you'll see two columns, one of which 1684 01:20:47,370 --> 01:20:51,508 is called show ID, the other of which is called genre. 1685 01:20:51,508 --> 01:20:53,550 And again, I wrote some code to do this because I 1686 01:20:53,550 --> 01:20:56,675 had to take Google's messy output where everything was separated by commas. 1687 01:20:56,675 --> 01:21:00,330 I had to tear away the commas and then put each genre into this table 1688 01:21:00,330 --> 01:21:01,770 by itself. 1689 01:21:01,770 --> 01:21:04,200 Even though we haven't introduced the syntax via which 1690 01:21:04,200 --> 01:21:07,230 we can reconstitute the data and reassociate 1691 01:21:07,230 --> 01:21:10,530 your genres with your titles, why, at a glance, 1692 01:21:10,530 --> 01:21:13,140 might this be a better design now? 1693 01:21:13,140 --> 01:21:16,740 Even though I've doubled the number of tables from one to two, 1694 01:21:16,740 --> 01:21:21,540 why is this probably on the direction toward a better design? 1695 01:21:21,540 --> 01:21:23,130 What might your instincts be? 1696 01:21:23,130 --> 01:21:26,430 1697 01:21:26,430 --> 01:21:27,720 Why is this cleaner? 1698 01:21:27,720 --> 01:21:31,230 Again, first time with SQL, why is it better, perhaps, 1699 01:21:31,230 --> 01:21:33,780 that we've done this with our genre's table? 1700 01:21:33,780 --> 01:21:35,650 Can I come to you? 1701 01:21:35,650 --> 01:21:36,700 Why might this be better? 1702 01:21:36,700 --> 01:21:37,200 Yeah. 1703 01:21:37,200 --> 01:21:41,360 Oh, just because we had the conversation before about the commas. 1704 01:21:41,360 --> 01:21:42,470 AUDIENCE: [INAUDIBLE] 1705 01:21:42,470 --> 01:21:42,890 DAVID J. MALAN: Exactly. 1706 01:21:42,890 --> 01:21:43,848 It's as simple as that. 1707 01:21:43,848 --> 01:21:49,130 We've cleaned up the data by giving every genre, every word in the genres 1708 01:21:49,130 --> 01:21:51,260 column in the original Google Spreadsheet 1709 01:21:51,260 --> 01:21:54,030 its own cell in this table, if you will. 1710 01:21:54,030 --> 01:21:56,930 And now notice show ID might appear multiple times. 1711 01:21:56,930 --> 01:22:01,530 Whoever typed in How I Met Your Mother, they only associated one genre with it. 1712 01:22:01,530 --> 01:22:04,460 And so we see that show ID 1 is a comedy. 1713 01:22:04,460 --> 01:22:05,645 But whoever typed in-- 1714 01:22:05,645 --> 01:22:07,520 I forget the name of the second show offhand. 1715 01:22:07,520 --> 01:22:11,960 But that person, whoever was assigned show ID 2 checked off 1716 01:22:11,960 --> 01:22:13,850 a whole bunch of the genre's boxes. 1717 01:22:13,850 --> 01:22:17,540 That happened again with show ID 3, 4. 1718 01:22:17,540 --> 01:22:20,940 Persons 5, 6, 7 only checked one box. 1719 01:22:20,940 --> 01:22:25,040 And so you can see now that we've associated the data with what we 1720 01:22:25,040 --> 01:22:27,690 might call a one-to-many relationship. 1721 01:22:27,690 --> 01:22:33,050 A one-to-many relationship, whereby for every one show in the show's table, 1722 01:22:33,050 --> 01:22:36,800 it can now have many genres associated with it, each of which 1723 01:22:36,800 --> 01:22:41,250 is represented by a separate row here. 1724 01:22:41,250 --> 01:22:44,960 So again, if I go ahead and select star from shows-- 1725 01:22:44,960 --> 01:22:48,740 let's limit it to the first 10 just to focus on a subset of the data. 1726 01:22:48,740 --> 01:22:51,770 How I Met Your Mother, The Sopranos was the second input there. 1727 01:22:51,770 --> 01:22:55,100 It would seem that now that I've created the data in this way, 1728 01:22:55,100 --> 01:22:59,925 I could ideally somehow search the data, but a little more correctly. 1729 01:22:59,925 --> 01:23:01,550 I don't have to worry about the commas. 1730 01:23:01,550 --> 01:23:03,925 I don't have to worry about the hackish approach of music 1731 01:23:03,925 --> 01:23:05,630 being a substring of musical. 1732 01:23:05,630 --> 01:23:08,310 But how can I actually get back at this data? 1733 01:23:08,310 --> 01:23:09,750 Well, let's go ahead and do this. 1734 01:23:09,750 --> 01:23:13,700 Suppose I did want to get back maybe all of the comedies. 1735 01:23:13,700 --> 01:23:17,120 All of the comedies, no matter whether the person checked just the comedy 1736 01:23:17,120 --> 01:23:19,550 box or multiple boxes instead. 1737 01:23:19,550 --> 01:23:23,330 How now, given that I have two tables, could I 1738 01:23:23,330 --> 01:23:27,740 go about selecting only the titles of comedies? 1739 01:23:27,740 --> 01:23:29,780 I've actually made the problem a little harder, 1740 01:23:29,780 --> 01:23:32,720 but again, SQL is going to give me a solution for this. 1741 01:23:32,720 --> 01:23:35,030 The problem is that if I want to search for comedies, 1742 01:23:35,030 --> 01:23:37,820 I have to check the genres table first. 1743 01:23:37,820 --> 01:23:39,650 And then what's that going to give me? 1744 01:23:39,650 --> 01:23:43,580 If I search the genres table for comedies, 1745 01:23:43,580 --> 01:23:46,610 what's that going to give me back potentially? 1746 01:23:46,610 --> 01:23:47,360 Yeah? 1747 01:23:47,360 --> 01:23:48,110 AUDIENCE: Show ID. 1748 01:23:48,110 --> 01:23:49,110 DAVID J. MALAN: Maybe show ID. 1749 01:23:49,110 --> 01:23:49,902 So let me try that. 1750 01:23:49,902 --> 01:23:55,820 Let me do select show ID from genres, where the genre in a given row 1751 01:23:55,820 --> 01:23:57,320 equals quote, unquote, "comedy." 1752 01:23:57,320 --> 01:24:00,470 No commas, no like, no percent signs. 1753 01:24:00,470 --> 01:24:04,940 Because literally, that column now is singular words, like comedy, or drama, 1754 01:24:04,940 --> 01:24:05,720 or the like. 1755 01:24:05,720 --> 01:24:07,670 Let me go ahead and hit Enter here. 1756 01:24:07,670 --> 01:24:10,700 OK, so I got back a whole bunch of ID numbers. 1757 01:24:10,700 --> 01:24:13,100 Now this could very quickly get annoying. 1758 01:24:13,100 --> 01:24:17,930 It looks like show ID 1, 2, 4, 5, 6, 7, 9, and so forth, are all comedies. 1759 01:24:17,930 --> 01:24:23,690 So I could do something really crazy like, select title from shows, 1760 01:24:23,690 --> 01:24:29,390 where ID equals 1, or ID equals 2. 1761 01:24:29,390 --> 01:24:32,360 This is not going to scale very well, but this 1762 01:24:32,360 --> 01:24:34,610 is why SQL is especially powerful. 1763 01:24:34,610 --> 01:24:39,420 You can actually compose one SQL question from multiple ones. 1764 01:24:39,420 --> 01:24:40,440 So let's do this. 1765 01:24:40,440 --> 01:24:44,480 Why don't I select the title where the ID of the show 1766 01:24:44,480 --> 01:24:48,050 is in the following list of IDs? 1767 01:24:48,050 --> 01:24:54,710 Select show ID from genres, where the specific genre is, quote, unquote, 1768 01:24:54,710 --> 01:24:55,640 "comedy." 1769 01:24:55,640 --> 01:24:58,550 So I've got two SQL queries. 1770 01:24:58,550 --> 01:25:01,850 One is deliberately nested inside of parentheses. 1771 01:25:01,850 --> 01:25:04,760 That's going to give me back that whole list of show IDs. 1772 01:25:04,760 --> 01:25:07,070 But that's exactly what I want to then look up 1773 01:25:07,070 --> 01:25:11,000 the titles for by selecting title from shows where the ID of the show 1774 01:25:11,000 --> 01:25:13,380 is in that big, tall list. 1775 01:25:13,380 --> 01:25:18,710 And so now if I hit Enter, I get back only those shows 1776 01:25:18,710 --> 01:25:22,520 that were somehow flagged as comedy, whether you in the audience 1777 01:25:22,520 --> 01:25:26,570 checked one box for comedy, two boxes, or all of the boxes. 1778 01:25:26,570 --> 01:25:28,928 Somehow we teased out comedy, again, just 1779 01:25:28,928 --> 01:25:31,220 by using that Python script, which loaded this data not 1780 01:25:31,220 --> 01:25:33,800 into one big table, but instead, two. 1781 01:25:33,800 --> 01:25:36,420 And if we want to clean this up, let's do a couple of things. 1782 01:25:36,420 --> 01:25:40,580 Let's, outside of the parentheses, do order by title. 1783 01:25:40,580 --> 01:25:43,550 This is a way of sorting the data in SQL very easily. 1784 01:25:43,550 --> 01:25:48,140 Now we have a whole list of the same titles that are now sorted. 1785 01:25:48,140 --> 01:25:52,640 And what was the keyword with which I could filter out duplicates? 1786 01:25:52,640 --> 01:25:53,450 Yeah, distinct. 1787 01:25:53,450 --> 01:25:54,260 So let's try this. 1788 01:25:54,260 --> 01:25:59,600 Same query, but let's select only the distinct titles from that whole query. 1789 01:25:59,600 --> 01:26:01,770 And notice, I've very deliberately done it this way. 1790 01:26:01,770 --> 01:26:03,520 And to this day, any time I'm using SQL, I 1791 01:26:03,520 --> 01:26:06,187 don't just start at the beginning and type out my whole thought, 1792 01:26:06,187 --> 01:26:07,940 and just get it right on the first try. 1793 01:26:07,940 --> 01:26:10,610 I very commonly start with the subquery, if you 1794 01:26:10,610 --> 01:26:12,800 will, the thing in parentheses, just to get myself 1795 01:26:12,800 --> 01:26:14,600 one step toward what I care about. 1796 01:26:14,600 --> 01:26:15,332 Then I add to it. 1797 01:26:15,332 --> 01:26:16,040 Then I add to it. 1798 01:26:16,040 --> 01:26:18,590 Then I add to it, just like we've encouraged in Python and C, 1799 01:26:18,590 --> 01:26:22,370 taking baby steps in order to get to the answer you actually care about, 1800 01:26:22,370 --> 01:26:23,480 like this one now. 1801 01:26:23,480 --> 01:26:26,060 And other than this mistake, which we didn't 1802 01:26:26,060 --> 01:26:30,350 fix because I re-imported the data after accidentally changing everyone's genre, 1803 01:26:30,350 --> 01:26:35,190 we now have an alphabetized list of all of the same data. 1804 01:26:35,190 --> 01:26:40,670 But now it's better designed, because we have it split across these two tables. 1805 01:26:40,670 --> 01:26:41,630 Oh, thank you. 1806 01:26:41,630 --> 01:26:44,720 OK, just thanks. 1807 01:26:44,720 --> 01:26:48,950 What questions do we have, if any here? 1808 01:26:48,950 --> 01:26:50,480 Questions on this approach? 1809 01:26:50,480 --> 01:26:53,040 1810 01:26:53,040 --> 01:26:55,316 Yeah? 1811 01:26:55,316 --> 01:27:00,280 AUDIENCE: [INAUDIBLE] 1812 01:27:00,280 --> 01:27:02,280 DAVID J. MALAN: Oh, now that we have a database, 1813 01:27:02,280 --> 01:27:04,640 how do we transfer it to a CSV? 1814 01:27:04,640 --> 01:27:06,060 There are ways to do that. 1815 01:27:06,060 --> 01:27:08,570 And in fact, there's a command within SQLite 1816 01:27:08,570 --> 01:27:11,590 that allows you to export your data back to a CSV file. 1817 01:27:11,590 --> 01:27:13,340 If you want to email it to someone and you 1818 01:27:13,340 --> 01:27:16,430 want them to be able to open it in Excel, or Google Spreadsheets, or Apple 1819 01:27:16,430 --> 01:27:19,010 Numbers, or the like, you can go in the other direction. 1820 01:27:19,010 --> 01:27:21,890 Generally though, once you're in the world of SQL 1821 01:27:21,890 --> 01:27:24,620 you're probably storing your data there long term. 1822 01:27:24,620 --> 01:27:27,590 And you're probably updating it, maybe deleting it, adding to it, 1823 01:27:27,590 --> 01:27:28,273 and so forth. 1824 01:27:28,273 --> 01:27:30,440 For instance, the one command I did not show earlier 1825 01:27:30,440 --> 01:27:33,320 is, suppose someone forgot a show. 1826 01:27:33,320 --> 01:27:35,640 Let's see, did I see this in the output? 1827 01:27:35,640 --> 01:27:37,580 All right, so Curb Your Enthusiasm. 1828 01:27:37,580 --> 01:27:38,540 Saw that last night. 1829 01:27:38,540 --> 01:27:39,530 It was just, yeah. 1830 01:27:39,530 --> 01:27:41,160 Did anyone see it last night? 1831 01:27:41,160 --> 01:27:41,660 No? 1832 01:27:41,660 --> 01:27:44,743 All right, well, just the one person that checked that box, so you and me. 1833 01:27:44,743 --> 01:27:46,740 What's another show that didn't make the list? 1834 01:27:46,740 --> 01:27:48,217 How about Seinfeld? 1835 01:27:48,217 --> 01:27:49,550 It's now on Netflix, apparently. 1836 01:27:49,550 --> 01:27:55,220 So insert into shows. 1837 01:27:55,220 --> 01:27:56,390 What do we want to insert? 1838 01:27:56,390 --> 01:27:59,910 Well, we want to insert maybe an ID and a title. 1839 01:27:59,910 --> 01:28:02,150 But I don't actually care what the ID is, 1840 01:28:02,150 --> 01:28:03,650 so I'm just going to insert a title. 1841 01:28:03,650 --> 01:28:05,780 And the value I'm going to give to that title 1842 01:28:05,780 --> 01:28:08,900 is going to be, quote, unquote, "Seinfeld." 1843 01:28:08,900 --> 01:28:11,810 And then, let me go ahead and hit semicolon. 1844 01:28:11,810 --> 01:28:14,360 Nothing seems to happen, but let me rerun the big query 1845 01:28:14,360 --> 01:28:16,250 from before looking for comedies. 1846 01:28:16,250 --> 01:28:19,850 And unfortunately, Seinfeld has not yet been flagged as a comedy, 1847 01:28:19,850 --> 01:28:21,710 so let's get this right, too. 1848 01:28:21,710 --> 01:28:25,370 What intuitively I'm going to have to do to associate, now, 1849 01:28:25,370 --> 01:28:27,980 Seinfeld with my comedies? 1850 01:28:27,980 --> 01:28:30,140 I just inserted into the show's table. 1851 01:28:30,140 --> 01:28:33,890 What more needs to happen before we can flag Seinfeld as a comedy? 1852 01:28:33,890 --> 01:28:35,369 AUDIENCE: [INAUDIBLE] 1853 01:28:35,369 --> 01:28:36,452 DAVID J. MALAN: Say again? 1854 01:28:36,452 --> 01:28:37,635 AUDIENCE: [INAUDIBLE] 1855 01:28:37,635 --> 01:28:38,510 DAVID J. MALAN: Yeah. 1856 01:28:38,510 --> 01:28:42,950 So I need to insert into the genres table two things now, 1857 01:28:42,950 --> 01:28:48,180 a show ID, like this, and then, the name of the genre, 1858 01:28:48,180 --> 01:28:49,340 which presumably is comedy. 1859 01:28:49,340 --> 01:28:50,810 What values do I want to insert? 1860 01:28:50,810 --> 01:28:52,795 Well, the show ID, I better grab that. 1861 01:28:52,795 --> 01:28:54,170 Oh, I don't even know what it is. 1862 01:28:54,170 --> 01:28:55,770 I'm going to have to figure out what that is. 1863 01:28:55,770 --> 01:28:57,660 So I could do this in a couple of ways. 1864 01:28:57,660 --> 01:28:59,150 Let me do a one-time thing. 1865 01:28:59,150 --> 01:29:02,924 Select star from shows, where title equals, 1866 01:29:02,924 --> 01:29:06,770 quote, unquote, "Seinfeld" semicolon 159. 1867 01:29:06,770 --> 01:29:11,780 So now I could do, insert into genres a show ID 1868 01:29:11,780 --> 01:29:20,510 and a genre name, the values 159, and, quote, unquote, "comedy" semicolon, 1869 01:29:20,510 --> 01:29:21,260 Enter. 1870 01:29:21,260 --> 01:29:24,710 And now, if I scroll back in my history and execute that really big query 1871 01:29:24,710 --> 01:29:26,690 again, looking for all distinct comedies, 1872 01:29:26,690 --> 01:29:29,100 now Seinfeld has made the list. 1873 01:29:29,100 --> 01:29:32,637 But I did this manually so I didn't actually capitalize it. 1874 01:29:32,637 --> 01:29:33,470 Let's clean that up. 1875 01:29:33,470 --> 01:29:35,420 Let's do update. 1876 01:29:35,420 --> 01:29:37,700 Let's do update my shows. 1877 01:29:37,700 --> 01:29:43,440 Set title equals to Seinfeld semicolon. 1878 01:29:43,440 --> 01:29:43,940 No? 1879 01:29:43,940 --> 01:29:47,780 OK, thank you, where title equals, quote, unquote, "Seinfeld." 1880 01:29:47,780 --> 01:29:49,520 Let's not make that mistake again. 1881 01:29:49,520 --> 01:29:50,330 Enter. 1882 01:29:50,330 --> 01:29:53,270 And now, if I execute that really big query, now Seinfeld is, 1883 01:29:53,270 --> 01:29:56,480 indeed, considered a comedy. 1884 01:29:56,480 --> 01:29:57,960 So where are we going with this? 1885 01:29:57,960 --> 01:30:00,240 Well, thus far we've been doing all of this pretty manually. 1886 01:30:00,240 --> 01:30:02,780 And this is absolutely what an analyst, a data scientist type person 1887 01:30:02,780 --> 01:30:05,197 might do if just manipulating a pretty large data set just 1888 01:30:05,197 --> 01:30:08,010 to get at interesting answers that might be across one, 1889 01:30:08,010 --> 01:30:09,440 two, or even many more tables. 1890 01:30:09,440 --> 01:30:12,440 Eventually, in a few weeks, we're going to start to automate all of this 1891 01:30:12,440 --> 01:30:16,610 by writing code in Python that generates SQL to do this. 1892 01:30:16,610 --> 01:30:19,280 If you go to most any website on the internet today, 1893 01:30:19,280 --> 01:30:23,420 and you, for instance, log in, odds are you're typing a username and password, 1894 01:30:23,420 --> 01:30:24,260 clicking Submit. 1895 01:30:24,260 --> 01:30:25,430 What's then happening? 1896 01:30:25,430 --> 01:30:28,010 Well, the website might not be implemented in Python 1897 01:30:28,010 --> 01:30:31,610 but it's probably implemented in some language, Python, JavaScript, Java, 1898 01:30:31,610 --> 01:30:33,110 Ruby, something else. 1899 01:30:33,110 --> 01:30:38,330 And that language is probably using something like a relational database 1900 01:30:38,330 --> 01:30:41,930 to use SQL to get your username, get your password, 1901 01:30:41,930 --> 01:30:43,997 and compare the two against what you've typed in. 1902 01:30:43,997 --> 01:30:46,580 And actually, it's hopefully not getting your actual password, 1903 01:30:46,580 --> 01:30:48,163 but something called the hash thereof. 1904 01:30:48,163 --> 01:30:50,360 But there's probably a database involved doing that. 1905 01:30:50,360 --> 01:30:53,600 When you buy something on Amazon.com and you click Check Out, 1906 01:30:53,600 --> 01:30:56,720 odds are there's some code on Amazon's server 1907 01:30:56,720 --> 01:30:59,870 that's looking at what it is you added to your shopping cart, 1908 01:30:59,870 --> 01:31:03,690 and then maybe using a for loop of some sort, in Python or another language. 1909 01:31:03,690 --> 01:31:08,180 It's doing a whole bunch of SQL inserts to store in their database what 1910 01:31:08,180 --> 01:31:09,290 it is you bought. 1911 01:31:09,290 --> 01:31:12,260 There's other types of databases, too, but SQL databases, 1912 01:31:12,260 --> 01:31:14,647 or relational databases are quite popular. 1913 01:31:14,647 --> 01:31:16,730 So let's go ahead and write one other program here 1914 01:31:16,730 --> 01:31:21,320 in Python that now merges these two languages together, whereby 1915 01:31:21,320 --> 01:31:24,740 I'm going to use SQL inside of a Python program 1916 01:31:24,740 --> 01:31:28,430 so I can implement my logic of my program in Python, 1917 01:31:28,430 --> 01:31:30,050 step-by-step, line-by-line. 1918 01:31:30,050 --> 01:31:34,260 But when I want to get at some data I can actually talk to a SQL database. 1919 01:31:34,260 --> 01:31:37,190 So let me go ahead and open favorites.py. 1920 01:31:37,190 --> 01:31:39,740 1921 01:31:39,740 --> 01:31:45,200 And let me go ahead and throw away some of what we did earlier and really 1922 01:31:45,200 --> 01:31:47,690 just now add a SQL to the mix. 1923 01:31:47,690 --> 01:31:51,440 From the CS50 library, let's import the SQL function. 1924 01:31:51,440 --> 01:31:54,260 This will be useful to use because most third-party libraries 1925 01:31:54,260 --> 01:31:57,390 that deal with SQL and Python are more complicated than they need to be. 1926 01:31:57,390 --> 01:32:00,390 So I think you'll find this library easier to use. 1927 01:32:00,390 --> 01:32:01,780 Let's then do the following. 1928 01:32:01,780 --> 01:32:03,900 Create a variable called db for database. 1929 01:32:03,900 --> 01:32:05,400 But I could call it anything I want. 1930 01:32:05,400 --> 01:32:09,090 Let's use that you URI, which is a fancy way of saying something 1931 01:32:09,090 --> 01:32:17,790 that looks like a URL, but that actually opens up a database on disk, that is, 1932 01:32:17,790 --> 01:32:19,110 in the current folder. 1933 01:32:19,110 --> 01:32:22,620 Let's now ask the user for a title by prompting them for a, quote, unquote, 1934 01:32:22,620 --> 01:32:23,820 "title" like this. 1935 01:32:23,820 --> 01:32:27,960 And let's strip off any whitespace just so that the data is not messy. 1936 01:32:27,960 --> 01:32:30,730 And then, let's go ahead and do this. 1937 01:32:30,730 --> 01:32:32,220 And this is the new logic. 1938 01:32:32,220 --> 01:32:36,570 I'm going to go ahead now and write a line of code that uses Python 1939 01:32:36,570 --> 01:32:39,840 to talk to the original favorites.db. 1940 01:32:39,840 --> 01:32:43,980 So again, I'm not using the two-table database, which is in favorites8.db. 1941 01:32:43,980 --> 01:32:47,320 I'm using the original that we imported from your own data, 1942 01:32:47,320 --> 01:32:49,090 and I'm going to do the following. 1943 01:32:49,090 --> 01:32:54,150 I'm going to use db.execute to execute a SQL command inside of Python. 1944 01:32:54,150 --> 01:33:02,700 I'm going to select the count of shows from the favorites 1945 01:33:02,700 --> 01:33:09,960 table, where the title the user typed in is like this question mark. 1946 01:33:09,960 --> 01:33:11,910 And why I'm doing that is as follows. 1947 01:33:11,910 --> 01:33:15,600 Just like in C, when we had percent S, in SQL for now, 1948 01:33:15,600 --> 01:33:17,490 the analogue is going to be a question mark. 1949 01:33:17,490 --> 01:33:18,990 So same idea, different syntax. 1950 01:33:18,990 --> 01:33:21,180 Instead of percent S, it's just a question mark. 1951 01:33:21,180 --> 01:33:26,280 And using a comma outside of this first string, using CS50's execute 1952 01:33:26,280 --> 01:33:29,340 function I can pass in a SQL string, a command, 1953 01:33:29,340 --> 01:33:33,835 then any arguments I want to plug into the question marks therein. 1954 01:33:33,835 --> 01:33:36,210 So the goal at hand is to actually write a program that's 1955 01:33:36,210 --> 01:33:42,420 going to search favorites.csv, a.k.a., favorites.db for the total number 1956 01:33:42,420 --> 01:33:45,300 of people that liked a particular show. 1957 01:33:45,300 --> 01:33:49,050 So this is going to select the count of people from the favorites table 1958 01:33:49,050 --> 01:33:53,400 where the title they typed in is like whatever the user has just now typed 1959 01:33:53,400 --> 01:33:53,970 in. 1960 01:33:53,970 --> 01:33:56,730 This db execute function returns a list. 1961 01:33:56,730 --> 01:33:57,813 It returns a list of rows. 1962 01:33:57,813 --> 01:34:00,188 And you would only know that by my telling you or reading 1963 01:34:00,188 --> 01:34:01,290 the documentation. 1964 01:34:01,290 --> 01:34:04,140 And therefore, if I want to get back to the total count, 1965 01:34:04,140 --> 01:34:08,940 I'm going to go ahead and grab the first row from those rows. 1966 01:34:08,940 --> 01:34:11,220 Because it's only going to give me back the count. 1967 01:34:11,220 --> 01:34:16,440 And then I'm going to go ahead and print out that row's first value. 1968 01:34:16,440 --> 01:34:17,940 But it's going to be a little weird. 1969 01:34:17,940 --> 01:34:21,420 Technically the column is going to be called "count" star, quote, unquote, 1970 01:34:21,420 --> 01:34:22,420 which is a little weird. 1971 01:34:22,420 --> 01:34:24,180 Let me add one more feature to the mix. 1972 01:34:24,180 --> 01:34:26,280 You can actually give nicknames to columns 1973 01:34:26,280 --> 01:34:30,090 that are coming back, especially if they are the result of functions like this. 1974 01:34:30,090 --> 01:34:34,170 I can just call that column counter, in all lowercase. 1975 01:34:34,170 --> 01:34:41,620 That means I can now say get back the counter key inside of this dictionary. 1976 01:34:41,620 --> 01:34:43,360 So just to recap, what have we done? 1977 01:34:43,360 --> 01:34:46,010 We've imported the CS50 library SQL function. 1978 01:34:46,010 --> 01:34:49,080 We've, with this line of code, opened the favorites.db 1979 01:34:49,080 --> 01:34:54,810 file that you and I created earlier by importing your CSV into SQLite. 1980 01:34:54,810 --> 01:34:58,140 I'm now just asking the user for a title they want to search for. 1981 01:34:58,140 --> 01:35:02,070 I'm now executing this SQL query on that database, 1982 01:35:02,070 --> 01:35:05,250 plugging in whatever the human typed in as their title 1983 01:35:05,250 --> 01:35:07,170 in order to get back a total count. 1984 01:35:07,170 --> 01:35:10,950 And I'm giving the count a nickname, an alias of counter, 1985 01:35:10,950 --> 01:35:13,860 just so it's more self-explanatory. 1986 01:35:13,860 --> 01:35:18,330 This function, db execute, no matter what, always returns a list of rows, 1987 01:35:18,330 --> 01:35:20,470 even if there's only one row inside of it. 1988 01:35:20,470 --> 01:35:23,310 So this line of code just gives me the first and only row. 1989 01:35:23,310 --> 01:35:27,960 And then, this goes inside of that row, which it turns out is a dictionary, 1990 01:35:27,960 --> 01:35:33,730 and gives me the key counter and the value it corresponds to. 1991 01:35:33,730 --> 01:35:35,340 So what, to be clear, is this doing? 1992 01:35:35,340 --> 01:35:38,220 Let's go ahead and run this manually in my Terminal window first. 1993 01:35:38,220 --> 01:35:41,970 Let me run SQLite3 on favorites-- 1994 01:35:41,970 --> 01:35:43,380 Well, let's do this. 1995 01:35:43,380 --> 01:35:47,410 On favorites.db, let me import the data again. 1996 01:35:47,410 --> 01:35:54,910 So mode csv.import in from favorites.csv into a favorites table. 1997 01:35:54,910 --> 01:35:57,330 So I've just recreated the same data set that you all 1998 01:35:57,330 --> 01:35:59,782 gave me earlier in favorites.db. 1999 01:35:59,782 --> 01:36:02,490 If I were to do this manually, let's search for The Office again. 2000 01:36:02,490 --> 01:36:09,210 Select, count star from favorites, where title like, and let's 2001 01:36:09,210 --> 01:36:12,270 just manually type it in for now, The Office. 2002 01:36:12,270 --> 01:36:15,330 We'll search for the one with the word The, semicolon. 2003 01:36:15,330 --> 01:36:16,560 I get back 12. 2004 01:36:16,560 --> 01:36:18,780 But technically, notice what I get back. 2005 01:36:18,780 --> 01:36:25,080 I technically get back a miniature table containing one column and one row. 2006 01:36:25,080 --> 01:36:27,090 What if I want to rename that column? 2007 01:36:27,090 --> 01:36:28,840 That's where the as keyword comes in. 2008 01:36:28,840 --> 01:36:31,080 So select count star as counter. 2009 01:36:31,080 --> 01:36:32,910 Notice what happens, Enter. 2010 01:36:32,910 --> 01:36:35,883 I just get back-- same simple table, but I've 2011 01:36:35,883 --> 01:36:38,550 renamed the column to be counter just because it's a little more 2012 01:36:38,550 --> 01:36:40,410 self-explanatory as to what it is. 2013 01:36:40,410 --> 01:36:43,200 So what am I doing with this line of code? 2014 01:36:43,200 --> 01:36:47,430 This line of code is returning to me that miniature temporary table 2015 01:36:47,430 --> 01:36:50,820 in the form of a list of dictionaries. 2016 01:36:50,820 --> 01:36:55,620 The list contains one row, as we'll see, and it 2017 01:36:55,620 --> 01:37:00,800 contains one column, as we'll see, the key for which is counter. 2018 01:37:00,800 --> 01:37:02,540 So let's now run the code itself. 2019 01:37:02,540 --> 01:37:07,280 I'm going to get out of SQLite3 and I'm going to run Python of favorites.py. 2020 01:37:07,280 --> 01:37:08,120 Enter. 2021 01:37:08,120 --> 01:37:09,540 I'm being prompted for a title. 2022 01:37:09,540 --> 01:37:13,790 I'm going to type in The Office and cross my fingers, and there's that 12. 2023 01:37:13,790 --> 01:37:14,810 Why is it 12? 2024 01:37:14,810 --> 01:37:17,450 Well, there's a typo again because I re-imported the CSV. 2025 01:37:17,450 --> 01:37:21,270 I had deleted two of the Thes, so we're back at the original data set. 2026 01:37:21,270 --> 01:37:25,760 So there's 12 total that have, quote, unquote, "The Office" 2027 01:37:25,760 --> 01:37:27,780 in the title like that. 2028 01:37:27,780 --> 01:37:28,860 So what have we done? 2029 01:37:28,860 --> 01:37:31,820 We've combined some Python with some SQL, 2030 01:37:31,820 --> 01:37:34,790 but we've relegated all of the complexity of searching for something, 2031 01:37:34,790 --> 01:37:36,800 the selecting of something, gotten rid of all 2032 01:37:36,800 --> 01:37:39,560 of the with keyword, the open keyword, the for loop, 2033 01:37:39,560 --> 01:37:41,600 the reader the DictReader, and all of that. 2034 01:37:41,600 --> 01:37:46,460 And it's just one line of SQL now, using the best of both worlds. 2035 01:37:46,460 --> 01:37:51,720 All right, any questions on what we've just done here or how any of this 2036 01:37:51,720 --> 01:37:52,220 works? 2037 01:37:52,220 --> 01:37:54,870 2038 01:37:54,870 --> 01:37:56,210 Any questions here? 2039 01:37:56,210 --> 01:37:57,310 Yeah? 2040 01:37:57,310 --> 01:38:01,070 AUDIENCE: [INAUDIBLE] 2041 01:38:01,070 --> 01:38:04,413 DAVID J. MALAN: When does this function return more than one row? 2042 01:38:04,413 --> 01:38:05,330 Was that the question? 2043 01:38:05,330 --> 01:38:05,760 AUDIENCE: Yeah. 2044 01:38:05,760 --> 01:38:06,030 DAVID J. MALAN: Yeah. 2045 01:38:06,030 --> 01:38:08,240 So let's do that by changing the problem at hand. 2046 01:38:08,240 --> 01:38:10,970 This program was designed just to select the total count. 2047 01:38:10,970 --> 01:38:16,070 Let's go ahead and select, for instance, all 2048 01:38:16,070 --> 01:38:20,705 of the ways you all typed in The Office by selecting the title this time. 2049 01:38:20,705 --> 01:38:24,110 2050 01:38:24,110 --> 01:38:28,470 If I do this in SQLite3, let me go ahead and do this again 2051 01:38:28,470 --> 01:38:29,970 after increasing my Terminal window. 2052 01:38:29,970 --> 01:38:30,920 Let's do it manually. 2053 01:38:30,920 --> 01:38:35,570 Select title from favorites, where the title is like, 2054 01:38:35,570 --> 01:38:38,835 quote, unquote, "The Office," semicolon. 2055 01:38:38,835 --> 01:38:41,960 I get back all of these different rows, and we didn't even notice this one. 2056 01:38:41,960 --> 01:38:43,910 There's actually another little typo in there 2057 01:38:43,910 --> 01:38:47,630 with some capitalization of the E, and the C, and the E. That 2058 01:38:47,630 --> 01:38:50,840 would be an example of a query that gives me back therefore 2059 01:38:50,840 --> 01:38:52,080 for multiple rows. 2060 01:38:52,080 --> 01:38:53,990 So let's now change my Python program. 2061 01:38:53,990 --> 01:38:59,540 If I now, in my Python program, do this, I get back a whole bunch of rows 2062 01:38:59,540 --> 01:39:01,110 containing all of those titles. 2063 01:39:01,110 --> 01:39:06,350 I can now do, for row in rows, I can print out the current row's title, 2064 01:39:06,350 --> 01:39:09,390 and now manipulate all of those things together. 2065 01:39:09,390 --> 01:39:10,760 Let me keep both on the screen. 2066 01:39:10,760 --> 01:39:12,320 Let me run Python of favorites.py. 2067 01:39:12,320 --> 01:39:16,320 And that for loop now should iterate, what, 10 or more times, 2068 01:39:16,320 --> 01:39:18,020 once for each of those titles. 2069 01:39:18,020 --> 01:39:22,010 And indeed, if I type in The Office again, Enter. 2070 01:39:22,010 --> 01:39:24,890 Whoops. 2071 01:39:24,890 --> 01:39:25,670 Row title. 2072 01:39:25,670 --> 01:39:26,750 What did I do wrong? 2073 01:39:26,750 --> 01:39:30,052 Oh, I should not be renaming title to counter this time. 2074 01:39:30,052 --> 01:39:31,760 So that's just a dumb mistake on my part. 2075 01:39:31,760 --> 01:39:33,410 Let me rerun it again. 2076 01:39:33,410 --> 01:39:36,230 And now I should see after typing in The Office, 2077 01:39:36,230 --> 01:39:38,420 Enter, a whole bunch of The Offices. 2078 01:39:38,420 --> 01:39:40,520 And because I'm using like, even the missed 2079 01:39:40,520 --> 01:39:43,550 capitalizations are coming through, because like is case insensitive. 2080 01:39:43,550 --> 01:39:45,890 It doesn't matter if it's uppercase or lowercase. 2081 01:39:45,890 --> 01:39:50,300 Whereas had I used the equal sign I would get back only the same ones 2082 01:39:50,300 --> 01:39:51,935 capitalized correctly. 2083 01:39:51,935 --> 01:39:55,250 All right, any questions on this next? 2084 01:39:55,250 --> 01:39:59,700 All right, so let's transition to a larger, juicier data 2085 01:39:59,700 --> 01:40:01,610 set, and consider some of the issues that 2086 01:40:01,610 --> 01:40:05,870 arise when actually now using SQL and skating toward a world in which we're 2087 01:40:05,870 --> 01:40:08,970 using SQL for mobile apps, web apps, and generally speaking, 2088 01:40:08,970 --> 01:40:10,800 very large data sets. 2089 01:40:10,800 --> 01:40:14,090 So let's start with a larger data set just like that. 2090 01:40:14,090 --> 01:40:19,800 Give me just a moment to switch screens over to what we have for you today, 2091 01:40:19,800 --> 01:40:22,970 which is an actual relational database that we've created out 2092 01:40:22,970 --> 01:40:26,540 of a real-world data set from IMDb. 2093 01:40:26,540 --> 01:40:29,210 So InternetMovieDatabase.com is a website 2094 01:40:29,210 --> 01:40:31,790 where you can search for TV shows, and movies, and actors, 2095 01:40:31,790 --> 01:40:34,880 and so forth, all using their database behind the scenes. 2096 01:40:34,880 --> 01:40:39,530 IMDb wonderfully makes their data set available as not CSV files, 2097 01:40:39,530 --> 01:40:42,960 but TSV files, tab-separated values. 2098 01:40:42,960 --> 01:40:46,460 And so what we did is, before class we downloaded those TSV files. 2099 01:40:46,460 --> 01:40:50,300 We wrote a Python program similar to my favorites8.py file 2100 01:40:50,300 --> 01:40:53,750 earlier that read in all of those TSV files, 2101 01:40:53,750 --> 01:40:58,820 created some SQL tables in an IMDb database 2102 01:40:58,820 --> 01:41:03,270 for you in SQLite that has multiple tables and multiple columns. 2103 01:41:03,270 --> 01:41:07,190 So let's go and wrap our minds around what's actually in this data set. 2104 01:41:07,190 --> 01:41:10,940 Let me go back to VS Code here, and in just a moment, 2105 01:41:10,940 --> 01:41:15,260 I'm going to go ahead and copy the file, which we've named shows.db. 2106 01:41:15,260 --> 01:41:20,510 And I'm going to go ahead and increase my Terminal and do SQLite3 of shows.db. 2107 01:41:20,510 --> 01:41:23,570 Whenever playing around with a SQLite database for the first time, 2108 01:41:23,570 --> 01:41:26,480 typing .schema is perhaps a good place to start to give you a sense 2109 01:41:26,480 --> 01:41:27,410 of what's in there. 2110 01:41:27,410 --> 01:41:29,120 And things just escalated quickly. 2111 01:41:29,120 --> 01:41:31,640 There's a lot in this data set, because, indeed, there's 2112 01:41:31,640 --> 01:41:35,750 going to be tens of hundreds of thousands of rows in this data set, 2113 01:41:35,750 --> 01:41:39,230 and also problem set 7, where we'll look at the movie side of things 2114 01:41:39,230 --> 01:41:40,920 and not just the TV shows. 2115 01:41:40,920 --> 01:41:43,940 So what is the schema that we have created for you 2116 01:41:43,940 --> 01:41:47,150 from IMDb's actual real-world data? 2117 01:41:47,150 --> 01:41:48,950 One, there's a table called shows. 2118 01:41:48,950 --> 01:41:51,950 And notice we've just added whitespace by hitting Enter a bunch of times 2119 01:41:51,950 --> 01:41:54,050 to make it a little more stylistically readable. 2120 01:41:54,050 --> 01:41:57,980 The shows table has an ID column, a title column, a year, 2121 01:41:57,980 --> 01:42:00,740 and the total number of episodes for a given show. 2122 01:42:00,740 --> 01:42:05,750 And the types of those columns are integer, text, numeric, and integer. 2123 01:42:05,750 --> 01:42:08,090 So it turns out there's actually a few different data 2124 01:42:08,090 --> 01:42:13,850 types that are worth being aware of when it comes to creating tables themselves. 2125 01:42:13,850 --> 01:42:18,170 In fact, in SQLite there's five data types, and only five, 2126 01:42:18,170 --> 01:42:21,650 fortunately, one of which is, indeed, integer, negative or positive, 2127 01:42:21,650 --> 01:42:25,010 numeric, which is kind of a catchall for dates and times, 2128 01:42:25,010 --> 01:42:27,320 things that are numeric but are not just integers, 2129 01:42:27,320 --> 01:42:29,510 and not just real numbers, for instance. 2130 01:42:29,510 --> 01:42:33,020 Real number is what we've generally thought of as float up until now. 2131 01:42:33,020 --> 01:42:35,058 Text, of course, is just text, but notice 2132 01:42:35,058 --> 01:42:37,100 that you don't have to worry about how big it is. 2133 01:42:37,100 --> 01:42:39,110 Like in Python, it will size to fit. 2134 01:42:39,110 --> 01:42:41,840 And then there's BLOB, which is binary large object, which 2135 01:42:41,840 --> 01:42:45,300 is for just raw 0s and 1s, like for files or things like that. 2136 01:42:45,300 --> 01:42:47,570 But we'll generally use the other four of these. 2137 01:42:47,570 --> 01:42:50,960 And so, indeed, when we imported this data for you 2138 01:42:50,960 --> 01:42:56,270 we decided that every show would be given an ID, which is just an integer. 2139 01:42:56,270 --> 01:42:59,460 Every show has, of course, a title, which should not be null. 2140 01:42:59,460 --> 01:43:01,320 Otherwise, why is it in the database? 2141 01:43:01,320 --> 01:43:04,830 Every show has a year, which is numeric according 2142 01:43:04,830 --> 01:43:06,180 to that definition a moment ago. 2143 01:43:06,180 --> 01:43:09,540 And the total number of episodes for a show is going to be an integer. 2144 01:43:09,540 --> 01:43:13,110 What now is with these primary keys that we mentioned earlier, too? 2145 01:43:13,110 --> 01:43:18,090 A primary key is the column that uniquely identifies all of the data. 2146 01:43:18,090 --> 01:43:21,120 In our case, with the favorites, I automatically 2147 01:43:21,120 --> 01:43:24,750 gave each of your submissions a unique ID so that even if two or more of you 2148 01:43:24,750 --> 01:43:27,360 typed in The Office, your submission still 2149 01:43:27,360 --> 01:43:32,310 had a unique identifier, a number that allowed me to then correlate it 2150 01:43:32,310 --> 01:43:36,330 with your genres, just as we saw a moment ago. 2151 01:43:36,330 --> 01:43:39,280 In this version of IMDb, there's also genres. 2152 01:43:39,280 --> 01:43:42,120 But they don't come from us, they come from IMDb.com. 2153 01:43:42,120 --> 01:43:46,320 And so a genre has a show ID, and a genre just like our database. 2154 01:43:46,320 --> 01:43:49,890 But these are real-world genres with a bit more filtration. 2155 01:43:49,890 --> 01:43:54,520 Notice, though, just like my version, there's a foreign key. 2156 01:43:54,520 --> 01:43:59,970 A foreign key is the appearance of another table's primary key 2157 01:43:59,970 --> 01:44:02,050 in its own table. 2158 01:44:02,050 --> 01:44:05,100 So when you have a table like genres, which is somehow 2159 01:44:05,100 --> 01:44:10,920 cross referencing the original shows table, if shows have a primary key 2160 01:44:10,920 --> 01:44:15,270 called ID, and those same numbers appear in the genres table 2161 01:44:15,270 --> 01:44:20,250 under the column called show ID, by definition, show ID is a foreign key. 2162 01:44:20,250 --> 01:44:22,650 It's the same numbers but it's foreign in the sense 2163 01:44:22,650 --> 01:44:25,260 that the number is being used in this table, 2164 01:44:25,260 --> 01:44:29,130 even though it's officially defined primarily in this other table. 2165 01:44:29,130 --> 01:44:31,770 This is what we mean by relational databases. 2166 01:44:31,770 --> 01:44:37,170 You have multiple tables with some column in common, numbers typically. 2167 01:44:37,170 --> 01:44:40,860 And those numbers allow you to line the two tables up in such a way 2168 01:44:40,860 --> 01:44:44,040 that you can reconnect the shows with their genres, 2169 01:44:44,040 --> 01:44:46,800 just like we did with our smaller data set a moment ago. 2170 01:44:46,800 --> 01:44:49,050 This logic is extended further. 2171 01:44:49,050 --> 01:44:53,070 Notice that the IMDb database we've created for you has a stars table, 2172 01:44:53,070 --> 01:44:56,730 like TV show stars, the actors therein. 2173 01:44:56,730 --> 01:45:00,210 And that table, interestingly, has no mention of people 2174 01:45:00,210 --> 01:45:02,220 and no mention of shows, per se. 2175 01:45:02,220 --> 01:45:05,730 It only has a column called show ID, which is an integer, 2176 01:45:05,730 --> 01:45:08,220 and a person ID, which is an integer. 2177 01:45:08,220 --> 01:45:14,320 Meanwhile, if we scrolled down to the bottom, 2178 01:45:14,320 --> 01:45:17,490 you will see a table called people. 2179 01:45:17,490 --> 01:45:23,010 And we have decided in IMDb's world that every person in the TV show world 2180 01:45:23,010 --> 01:45:27,510 will have a unique identifier that's a number, a name that's text, a birth 2181 01:45:27,510 --> 01:45:31,500 date, which is numeric, and then, again, specifying that ID 2182 01:45:31,500 --> 01:45:35,350 is going to be their primary key. 2183 01:45:35,350 --> 01:45:36,940 So what's going on here? 2184 01:45:36,940 --> 01:45:42,640 Well, it turns out that TV stars and writers are both types of people. 2185 01:45:42,640 --> 01:45:47,730 So using this relational database, notice the road we're going down. 2186 01:45:47,730 --> 01:45:49,770 We're factoring out commonalities. 2187 01:45:49,770 --> 01:45:52,570 And if a person can be different things in life, 2188 01:45:52,570 --> 01:45:55,260 well, we're defining them first and foremost as people. 2189 01:45:55,260 --> 01:45:58,150 And then, notice these two tables are almost the same. 2190 01:45:58,150 --> 01:46:00,660 The stars table has a show ID, which is a number, 2191 01:46:00,660 --> 01:46:02,670 and a person ID, which is a number, which 2192 01:46:02,670 --> 01:46:10,710 allows us via this middleman table, if you will, to link people with TV shows. 2193 01:46:10,710 --> 01:46:16,080 Similarly, the writers table allows us to connect shows with people, too, 2194 01:46:16,080 --> 01:46:18,220 by just recording those numbers. 2195 01:46:18,220 --> 01:46:20,980 So if we go into this data set, let's do the following. 2196 01:46:20,980 --> 01:46:24,340 Let's do select star from people semicolon. 2197 01:46:24,340 --> 01:46:27,030 So a huge amount of data is coming back. 2198 01:46:27,030 --> 01:46:31,480 This is hundreds of thousands of rows now based on the ID numbers alone. 2199 01:46:31,480 --> 01:46:34,320 So this is real-world data now flying across the screen. 2200 01:46:34,320 --> 01:46:38,160 There's a lot of people in the TV show business, not just actors and writers, 2201 01:46:38,160 --> 01:46:39,780 but others as well. 2202 01:46:39,780 --> 01:46:40,680 It's still going. 2203 01:46:40,680 --> 01:46:41,977 There's a lot of data there. 2204 01:46:41,977 --> 01:46:44,310 So my god, if you had to do anything manual in this data 2205 01:46:44,310 --> 01:46:46,660 set it's probably not going to work out very well. 2206 01:46:46,660 --> 01:46:49,590 And actually, we're up to, what, a million people in this data 2207 01:46:49,590 --> 01:46:51,990 set, plus, which would mean this probably isn't even 2208 01:46:51,990 --> 01:46:55,590 going to open very well in Excel, or Google Spreadsheets, or Apple Numbers. 2209 01:46:55,590 --> 01:46:57,700 SQL probably is the better approach here. 2210 01:46:57,700 --> 01:47:00,360 Let's search for someone specific, like select star 2211 01:47:00,360 --> 01:47:06,060 from people, where name equals Steve Carell, for instance, sticking 2212 01:47:06,060 --> 01:47:06,702 with comedies. 2213 01:47:06,702 --> 01:47:08,160 All right, so there's Steve Carell. 2214 01:47:08,160 --> 01:47:14,100 He is person number 136,797, born in 1962. 2215 01:47:14,100 --> 01:47:16,540 And that's as much data as we have on Steve Carell here. 2216 01:47:16,540 --> 01:47:19,210 How do we figure out what shows, for instance, he's in? 2217 01:47:19,210 --> 01:47:23,500 Well, let's see, select star from shows, semicolon. 2218 01:47:23,500 --> 01:47:27,180 There's a crazy number of shows out there in the IMDb database. 2219 01:47:27,180 --> 01:47:30,150 And you can see it here again flying across the screen. 2220 01:47:30,150 --> 01:47:33,630 Feels like we're going to have to employ some techniques in order 2221 01:47:33,630 --> 01:47:37,060 to get at all of Steve Carell's shows. 2222 01:47:37,060 --> 01:47:39,220 So how are we going to do that? 2223 01:47:39,220 --> 01:47:41,730 Well, god, this is a lot of data here. 2224 01:47:41,730 --> 01:47:45,120 And in fact, yeah, we have, what, 15 million 2225 01:47:45,120 --> 01:47:47,280 shows plus in this data set, too. 2226 01:47:47,280 --> 01:47:50,340 So doing things efficiently is now going to start to matter. 2227 01:47:50,340 --> 01:47:51,790 So let's actually do this. 2228 01:47:51,790 --> 01:47:53,460 Let me select a specific show. 2229 01:47:53,460 --> 01:47:58,590 Select star from shows where title equals, quote, unquote, "The Office." 2230 01:47:58,590 --> 01:48:00,920 And there presumably shouldn't be typos in this data 2231 01:48:00,920 --> 01:48:03,470 because it comes from the real website IMDb.com. 2232 01:48:03,470 --> 01:48:05,000 Let's get back to show. 2233 01:48:05,000 --> 01:48:08,210 Turns out there's been a lot of The Offices out in the world. 2234 01:48:08,210 --> 01:48:12,170 The one that started in 2005 is the one that we want, 2235 01:48:12,170 --> 01:48:14,990 presumably the most popular with 188 episodes. 2236 01:48:14,990 --> 01:48:16,220 How can we get just that? 2237 01:48:16,220 --> 01:48:21,650 Maybe we could do and year equals, how about 2005? 2238 01:48:21,650 --> 01:48:25,250 All right, so now we've got back just the ID of The Office 2239 01:48:25,250 --> 01:48:26,210 that we care about. 2240 01:48:26,210 --> 01:48:27,230 And let's do this, too. 2241 01:48:27,230 --> 01:48:30,110 Let me turn on a timer within SQLite just 2242 01:48:30,110 --> 01:48:31,790 to get a sense of running time now. 2243 01:48:31,790 --> 01:48:33,110 Let me do that again. 2244 01:48:33,110 --> 01:48:36,050 Select star from shows, where title equals The Office, 2245 01:48:36,050 --> 01:48:37,808 and year equals 2005. 2246 01:48:37,808 --> 01:48:38,850 And let's keep it simple. 2247 01:48:38,850 --> 01:48:40,370 Let's just do titles for now. 2248 01:48:40,370 --> 01:48:41,450 Enter. 2249 01:48:41,450 --> 01:48:43,460 All right, so not terribly long. 2250 01:48:43,460 --> 01:48:46,990 It found it pretty fast, but it looks like it took how much real time? 2251 01:48:46,990 --> 01:48:50,010 0.02 seconds, not bad for just a title. 2252 01:48:50,010 --> 01:48:53,210 But just to plant a seed, it turns out that we can probably 2253 01:48:53,210 --> 01:48:54,200 speed even this up. 2254 01:48:54,200 --> 01:48:54,930 Let me do this. 2255 01:48:54,930 --> 01:48:58,220 Let me create something called an index, which is another use of the C 2256 01:48:58,220 --> 01:49:00,170 in CRUD for creating something. 2257 01:49:00,170 --> 01:49:02,810 And I'm going to call this title index. 2258 01:49:02,810 --> 01:49:07,340 And I'm going to create it on the shows table, 2259 01:49:07,340 --> 01:49:09,273 specifically on the title column. 2260 01:49:09,273 --> 01:49:11,690 And we'll see in a moment what this is going to do for me. 2261 01:49:11,690 --> 01:49:12,920 Enter. 2262 01:49:12,920 --> 01:49:17,130 Took a moment, like 0.349 seconds, to create something called an index. 2263 01:49:17,130 --> 01:49:21,590 But now watch, if I select star from shows searching for The Office again, 2264 01:49:21,590 --> 01:49:24,230 previously it took me 0.021 seconds. 2265 01:49:24,230 --> 01:49:27,680 Not bad, but now, wow. 2266 01:49:27,680 --> 01:49:31,190 Literally no time at all, or so low that it wasn't really measurable. 2267 01:49:31,190 --> 01:49:33,470 And I'll do it again just to get a sense of things. 2268 01:49:33,470 --> 01:49:34,860 Still quite low. 2269 01:49:34,860 --> 01:49:40,020 Now even though 0.021 seconds, not crazy long, imagine now having a lot of data, 2270 01:49:40,020 --> 01:49:42,560 a lot of users running a real website or real mobile app. 2271 01:49:42,560 --> 01:49:45,980 Every millisecond we can start to shave off is going to be compelling. 2272 01:49:45,980 --> 01:49:47,810 So what is it we just did? 2273 01:49:47,810 --> 01:49:51,830 Well, we actually just created something called an index. 2274 01:49:51,830 --> 01:49:53,960 And this is a nice way to tie in, now, some 2275 01:49:53,960 --> 01:49:56,420 of our week 5 discussion of data structures, 2276 01:49:56,420 --> 01:49:58,250 and our week 3 discussion of running times. 2277 01:49:58,250 --> 01:50:01,280 An index in a database is some kind of fancy data 2278 01:50:01,280 --> 01:50:06,110 structure that allows the database to do better than linear search. 2279 01:50:06,110 --> 01:50:10,580 Literally, as you just saw, these tables are crazy long or tall right now, 2280 01:50:10,580 --> 01:50:12,150 very linear, that is. 2281 01:50:12,150 --> 01:50:14,300 And so when I first searched for The Office, 2282 01:50:14,300 --> 01:50:17,780 it was literally doing linear search, top to bottom, looking at as many as, 2283 01:50:17,780 --> 01:50:19,400 what, a million plus rows. 2284 01:50:19,400 --> 01:50:20,870 That's relatively slow. 2285 01:50:20,870 --> 01:50:23,330 It's not that slow, 0.021 seconds. 2286 01:50:23,330 --> 01:50:27,020 But that's relatively slow just theoretically, algorithmically, 2287 01:50:27,020 --> 01:50:28,580 doing anything linearly. 2288 01:50:28,580 --> 01:50:32,090 But if you instead create an index using syntax 2289 01:50:32,090 --> 01:50:37,880 like this, which I just did, creating an index on the title column of the show's 2290 01:50:37,880 --> 01:50:41,220 table, that's like giving the database a clue in advance saying, 2291 01:50:41,220 --> 01:50:44,660 hey, I know I'm going to search on this column in this table a lot. 2292 01:50:44,660 --> 01:50:47,430 Do something with data structures to speed things up. 2293 01:50:47,430 --> 01:50:50,030 And so if you think back to our discussion of data structures, 2294 01:50:50,030 --> 01:50:51,720 maybe it's using a tree. 2295 01:50:51,720 --> 01:50:56,060 Maybe it's using a trie or a hash table, some fancier two-dimensional data 2296 01:50:56,060 --> 01:50:59,930 structure is generally going to lift the data up creating right maybe a tree 2297 01:50:59,930 --> 01:51:00,810 structure. 2298 01:51:00,810 --> 01:51:03,620 So it's just much faster to find data, especially 2299 01:51:03,620 --> 01:51:06,500 if it's sorting it now based on title, and not 2300 01:51:06,500 --> 01:51:08,030 just storing it in one long list. 2301 01:51:08,030 --> 01:51:10,220 And in fact, in the world of relational databases, 2302 01:51:10,220 --> 01:51:12,560 the type of structure that's often used in a database 2303 01:51:12,560 --> 01:51:13,920 is something called a B-tree. 2304 01:51:13,920 --> 01:51:15,170 It's not a binary tree. 2305 01:51:15,170 --> 01:51:19,040 Different use of the letter B, but it looks a little something like the trees 2306 01:51:19,040 --> 01:51:19,820 we've seen. 2307 01:51:19,820 --> 01:51:21,680 It's not binary because some of the nodes 2308 01:51:21,680 --> 01:51:24,890 might have more than two children or fewer, 2309 01:51:24,890 --> 01:51:28,190 but it's a very wide but relatively shallow tree. 2310 01:51:28,190 --> 01:51:29,990 It's not very tall. 2311 01:51:29,990 --> 01:51:33,770 And the upside of that is that if your data is stored in this tree, 2312 01:51:33,770 --> 01:51:36,060 the database can find it more quickly. 2313 01:51:36,060 --> 01:51:41,270 And the reason it took half a second, a third of a second to build the index 2314 01:51:41,270 --> 01:51:45,260 is because SQLite needed to take some non-zero amount of time 2315 01:51:45,260 --> 01:51:47,630 to just build up this tree in memory. 2316 01:51:47,630 --> 01:51:51,900 And it has algorithms for doing so based on alphabetization or other techniques. 2317 01:51:51,900 --> 01:51:55,340 But you spend a bit of time up front, a third of a second. 2318 01:51:55,340 --> 01:51:57,470 And then thereafter, wow. 2319 01:51:57,470 --> 01:52:00,470 Every subsequent query, if I keep doing it again and again, 2320 01:52:00,470 --> 01:52:04,040 is going to be crazy low, 0.000, maybe 0.001. 2321 01:52:04,040 --> 01:52:08,240 But an order of magnitude, a factor of 10 or 100 faster than it 2322 01:52:08,240 --> 01:52:10,790 previously was earlier. 2323 01:52:10,790 --> 01:52:14,360 So we have these indexes which allow us to get at data faster. 2324 01:52:14,360 --> 01:52:17,330 But what if we want to actually get data that's 2325 01:52:17,330 --> 01:52:19,370 now across these multiple tables? 2326 01:52:19,370 --> 01:52:20,240 How can we do that? 2327 01:52:20,240 --> 01:52:23,060 And how might these indices or indexes help further? 2328 01:52:23,060 --> 01:52:26,900 Well, it turns out there is a way that we've seen already 2329 01:52:26,900 --> 01:52:29,510 indirectly to join two tables together. 2330 01:52:29,510 --> 01:52:33,410 Previously, when I selected the ID of The Office, 2331 01:52:33,410 --> 01:52:37,740 and then I searched for it in the other table using select in a nested query, 2332 01:52:37,740 --> 01:52:40,410 I was joining two tables together. 2333 01:52:40,410 --> 01:52:42,900 And it turns out there's a couple of ways to do this. 2334 01:52:42,900 --> 01:52:46,550 Let's go ahead now and, for instance, find all of Steve Carell's TV shows. 2335 01:52:46,550 --> 01:52:48,680 Not just The Office but all of them, too. 2336 01:52:48,680 --> 01:52:56,315 Unfortunately, if we look at our schema, shows up here have no mention of TV-- 2337 01:52:56,315 --> 01:53:01,860 oh, shows over here has no mention of the TV stars in them. 2338 01:53:01,860 --> 01:53:05,130 And people have no mention of shows. 2339 01:53:05,130 --> 01:53:09,360 We somehow need to use this table here to connect the two. 2340 01:53:09,360 --> 01:53:14,820 And this is called a join table, in the sense that using two integer columns-- 2341 01:53:14,820 --> 01:53:17,790 it joins the two tables together logically. 2342 01:53:17,790 --> 01:53:21,750 And so if you're savvy enough with SQL, you can do what I did with my hands 2343 01:53:21,750 --> 01:53:26,010 earlier and like recombine tables by using these common IDs, 2344 01:53:26,010 --> 01:53:27,460 these integers together. 2345 01:53:27,460 --> 01:53:28,330 So let me do this. 2346 01:53:28,330 --> 01:53:32,730 Let me go ahead and figure out, step-by-step, Steve Carell's shows. 2347 01:53:32,730 --> 01:53:33,970 So how am I going to do this? 2348 01:53:33,970 --> 01:53:39,120 Well, if I select star from people, where name equals Steve Carell, 2349 01:53:39,120 --> 01:53:40,840 fortunately, there's only one of them. 2350 01:53:40,840 --> 01:53:46,800 So this gives me back his name, his ID, and his birth year. 2351 01:53:46,800 --> 01:53:48,960 But it's really only his ID that I care about. 2352 01:53:48,960 --> 01:53:49,680 Why? 2353 01:53:49,680 --> 01:53:55,500 Because in order to get back his shows, I need to link person ID with show ID. 2354 01:53:55,500 --> 01:53:57,640 So I need to know his ID number. 2355 01:53:57,640 --> 01:53:59,590 So what could I do with this? 2356 01:53:59,590 --> 01:54:04,230 Well, remember the schema and the stars table. 2357 01:54:04,230 --> 01:54:07,830 I've just gotten, from the people table, Steve Carell's ID. 2358 01:54:07,830 --> 01:54:13,530 I bet by transitivity I could now use his person ID, his ID, 2359 01:54:13,530 --> 01:54:15,900 to get back all of his show IDs. 2360 01:54:15,900 --> 01:54:19,170 And then once I've got all of his show IDs, I can take it one step further 2361 01:54:19,170 --> 01:54:21,330 and get back all of his shows' titles. 2362 01:54:21,330 --> 01:54:25,440 So the answer is actually English words and not just random, seemingly, 2363 01:54:25,440 --> 01:54:26,290 integers. 2364 01:54:26,290 --> 01:54:27,640 So let me go ahead and do this. 2365 01:54:27,640 --> 01:54:31,740 Let me, again, get Steve Carell's ID number, but not star. 2366 01:54:31,740 --> 01:54:33,060 Star represents everything. 2367 01:54:33,060 --> 01:54:35,610 It's a wildcard character in SQL. 2368 01:54:35,610 --> 01:54:38,310 Let me just select the ID of Steve Carell. 2369 01:54:38,310 --> 01:54:41,640 And that gives me back 136,797. 2370 01:54:41,640 --> 01:54:43,440 And it's only giving me back one value. 2371 01:54:43,440 --> 01:54:46,710 The thing called ID is just the column heading up above. 2372 01:54:46,710 --> 01:54:51,360 Now, suppose I want to select all of the show IDs 2373 01:54:51,360 --> 01:54:53,460 that Steve Carell is affiliated with. 2374 01:54:53,460 --> 01:55:00,570 Let me select Show ID from stars, where the person ID in stars 2375 01:55:00,570 --> 01:55:03,480 happens to equal Steve Carell's ID. 2376 01:55:03,480 --> 01:55:07,320 So again, I'm building up my answer in reverse and taking these baby steps. 2377 01:55:07,320 --> 01:55:11,430 On the right, in parentheses, I'm getting Steve Carell's ID. 2378 01:55:11,430 --> 01:55:15,330 On the left, I am now selecting all of the show IDs 2379 01:55:15,330 --> 01:55:19,410 that have some connection with that person ID in the stars table. 2380 01:55:19,410 --> 01:55:21,850 This answer, too, is not going to be that illuminating. 2381 01:55:21,850 --> 01:55:25,420 It's just a whole bunch of integers that have no meaning to me as a human. 2382 01:55:25,420 --> 01:55:27,770 But let's take this one step further. 2383 01:55:27,770 --> 01:55:29,520 And even though my code is getting long, I 2384 01:55:29,520 --> 01:55:31,920 could hit Enter and format it nicely, especially 2385 01:55:31,920 --> 01:55:33,660 if I were doing this in a code file. 2386 01:55:33,660 --> 01:55:35,580 But I'm just doing it interactively for now. 2387 01:55:35,580 --> 01:55:39,420 Let's now select all of the titles from the shows table, 2388 01:55:39,420 --> 01:55:48,100 where the ID of the show is in this following previous query. 2389 01:55:48,100 --> 01:55:49,740 So again, the query is getting long. 2390 01:55:49,740 --> 01:55:52,200 But notice, it's the third and last step. 2391 01:55:52,200 --> 01:55:55,950 Select title from the shows table, where the ID of the show 2392 01:55:55,950 --> 01:55:58,590 is in the list of all of the show IDs that 2393 01:55:58,590 --> 01:56:02,040 came back from the stars table searching for Steve Carell's person ID. 2394 01:56:02,040 --> 01:56:03,480 How did we get that person ID? 2395 01:56:03,480 --> 01:56:04,680 Let me scroll to the end. 2396 01:56:04,680 --> 01:56:11,160 Well, I selected, in my innermost parentheses, Steve Carell's own ID. 2397 01:56:11,160 --> 01:56:13,440 So now, when I hit Enter, voila. 2398 01:56:13,440 --> 01:56:16,525 I get all of Steve Carell's TV shows up until now. 2399 01:56:16,525 --> 01:56:19,650 And if I want to tidy this up further, I can use the same tricks as before. 2400 01:56:19,650 --> 01:56:22,180 Order by title, semicolon. 2401 01:56:22,180 --> 01:56:25,540 Now I've got it all alphabetized as before. 2402 01:56:25,540 --> 01:56:28,080 So again, with SQL comes the ability to search-- 2403 01:56:28,080 --> 01:56:31,290 I mean, look how quickly we do this, 0.094 seconds 2404 01:56:31,290 --> 01:56:34,650 to search across three different tables to get back this answer. 2405 01:56:34,650 --> 01:56:38,820 But my data is now all neatly designed in individual tables, 2406 01:56:38,820 --> 01:56:42,000 which is going to be important now that the data set is so large. 2407 01:56:42,000 --> 01:56:44,340 But let me take this one step further. 2408 01:56:44,340 --> 01:56:46,930 Let me go ahead and do this. 2409 01:56:46,930 --> 01:56:51,580 Let me go ahead and point out that with this query, 2410 01:56:51,580 --> 01:56:54,870 notice that I'm searching on-- 2411 01:56:54,870 --> 01:56:58,710 let's say I'm searching on a person ID here. 2412 01:56:58,710 --> 01:57:02,410 And at the end here, I'm searching on a name column here. 2413 01:57:02,410 --> 01:57:05,230 So let me actually go ahead and do this. 2414 01:57:05,230 --> 01:57:09,510 Let me go ahead and see if we can't speed this up. 2415 01:57:09,510 --> 01:57:13,090 This query at the moment takes 0.092 seconds. 2416 01:57:13,090 --> 01:57:15,930 Let's see if we can't speed this up further by just quickly creating 2417 01:57:15,930 --> 01:57:18,930 a few more of those B-trees in the databases memory. 2418 01:57:18,930 --> 01:57:24,240 Create an index called person index, and I'm going to do this on the stars table 2419 01:57:24,240 --> 01:57:25,950 on the person ID column. 2420 01:57:25,950 --> 01:57:26,850 Enter. 2421 01:57:26,850 --> 01:57:28,488 It's taking a moment, taking a moment. 2422 01:57:28,488 --> 01:57:30,780 That's almost a full second because that's a big table. 2423 01:57:30,780 --> 01:57:35,050 Let's create another index called show index on the stars table. 2424 01:57:35,050 --> 01:57:35,550 Why? 2425 01:57:35,550 --> 01:57:37,950 Because I want to search by the show ID also. 2426 01:57:37,950 --> 01:57:39,780 That was part of my big query. 2427 01:57:39,780 --> 01:57:40,660 Takes a moment. 2428 01:57:40,660 --> 01:57:43,810 OK, just more than about 2/3 of a second. 2429 01:57:43,810 --> 01:57:46,440 Now let's create one last one, another index called name index, 2430 01:57:46,440 --> 01:57:49,160 but I could call these things anything I want, on the people table. 2431 01:57:49,160 --> 01:57:49,560 Why? 2432 01:57:49,560 --> 01:57:51,550 Because I'm also searching on the name column. 2433 01:57:51,550 --> 01:57:53,730 So in short, I'm creating indexes on each 2434 01:57:53,730 --> 01:57:57,450 of the columns that are somehow involved in my search query, 2435 01:57:57,450 --> 01:57:59,680 going from one table to the other. 2436 01:57:59,680 --> 01:58:07,590 Now let's go back to the previous query, which, recall, took-- 2437 01:58:07,590 --> 01:58:10,270 I think I erased it, 0.091. 2438 01:58:10,270 --> 01:58:10,770 All right. 2439 01:58:10,770 --> 01:58:12,645 Well, it was roughly this order of magnitude. 2440 01:58:12,645 --> 01:58:14,140 We're not seeing the data now. 2441 01:58:14,140 --> 01:58:17,160 But let me go ahead and run my original big query once. 2442 01:58:17,160 --> 01:58:20,380 And boom, we're down to almost nothing. 2443 01:58:20,380 --> 01:58:22,860 So again, creating these indexes in memory 2444 01:58:22,860 --> 01:58:27,640 has the effect of rapidly speeding up our computation time. 2445 01:58:27,640 --> 01:58:31,140 Now if you've ever used, for instance, the my.harvard course shopping tool 2446 01:58:31,140 --> 01:58:35,130 here on campus, or Yale's analogue, you might wonder, why is the thing so slow? 2447 01:58:35,130 --> 01:58:39,330 This could be one of the reasons why large data sets with thousands of rows, 2448 01:58:39,330 --> 01:58:41,670 thousands of courses tend to be slow, if, 2449 01:58:41,670 --> 01:58:44,880 and I'm only conjecturing, if the database isn't properly indexed. 2450 01:58:44,880 --> 01:58:46,890 If you're building your own web application 2451 01:58:46,890 --> 01:58:49,170 and you're finding that users are waiting and waiting, 2452 01:58:49,170 --> 01:58:52,290 and things are spinning and spinning, what might be among the problems? 2453 01:58:52,290 --> 01:58:55,770 Well, it could absolutely just be bad algorithms and bad code that you wrote. 2454 01:58:55,770 --> 01:58:58,270 Or it might be that you haven't thought about, well, 2455 01:58:58,270 --> 01:59:01,770 what column should be optimized for searches and filtration 2456 01:59:01,770 --> 01:59:05,705 like I've done here in order to speed up subsequent queries? 2457 01:59:05,705 --> 01:59:07,830 Again, from the outside in, we can only conjecture. 2458 01:59:07,830 --> 01:59:11,580 But ultimately, this is just one of the things that 2459 01:59:11,580 --> 01:59:14,308 explains performance problems as well. 2460 01:59:14,308 --> 01:59:17,100 All right, let's point out just a couple of final syntactic things, 2461 01:59:17,100 --> 01:59:19,770 and then we'll consider, bigger picture, some problems 2462 01:59:19,770 --> 01:59:22,110 that might arise in this world. 2463 01:59:22,110 --> 01:59:26,880 If these nested, nested queries start to get a little much, 2464 01:59:26,880 --> 01:59:29,160 there are other ways, just so you've seen it, 2465 01:59:29,160 --> 01:59:31,920 that you can execute similar logic in SQL. 2466 01:59:31,920 --> 01:59:34,410 For instance, if I know in advance that I 2467 01:59:34,410 --> 01:59:39,390 want to connect Steve Carell to his show IDs and to their titles, 2468 01:59:39,390 --> 01:59:41,190 we can do something more like this. 2469 01:59:41,190 --> 01:59:52,050 Select title from the people table, joined with the stars table on people 2470 01:59:52,050 --> 01:59:54,900 ID equals stars.personID. 2471 01:59:54,900 --> 01:59:55,860 So what am I doing? 2472 01:59:55,860 --> 01:59:56,490 New syntax. 2473 01:59:56,490 --> 01:59:59,782 And again, this is not something you'll have to memorize or ingrain right away. 2474 01:59:59,782 --> 02:00:04,080 But just so you've seen other approaches, select title from people 2475 02:00:04,080 --> 02:00:05,160 join stars. 2476 02:00:05,160 --> 02:00:09,690 This is an explicit way to say, take the people table in one hand, the stars 2477 02:00:09,690 --> 02:00:12,030 table in the other hand, and somehow join them 2478 02:00:12,030 --> 02:00:13,990 as I keep doing with my fingertips here. 2479 02:00:13,990 --> 02:00:15,610 How to join them? 2480 02:00:15,610 --> 02:00:20,370 Join them so that the people, the ID column in the people table lines up 2481 02:00:20,370 --> 02:00:23,260 with the person ID in the stars table. 2482 02:00:23,260 --> 02:00:25,260 But that's not quite everything. 2483 02:00:25,260 --> 02:00:28,740 I could also say, join further on the shows table, 2484 02:00:28,740 --> 02:00:35,290 where the stars show ID equals the shows ID column. 2485 02:00:35,290 --> 02:00:36,540 So what am I doing here? 2486 02:00:36,540 --> 02:00:45,990 That's saying, go further and join the stars table with the show's table, 2487 02:00:45,990 --> 02:00:48,990 joining the show ID column with the ID column. 2488 02:00:48,990 --> 02:00:51,750 Again, this starts to get a little messy to think about. 2489 02:00:51,750 --> 02:00:55,830 But now I can just say, where name equals, quote, unquote, "Steve Carell." 2490 02:00:55,830 --> 02:00:59,070 I can do in one query what previously took me three nested queries 2491 02:00:59,070 --> 02:01:00,600 and get back the same answers. 2492 02:01:00,600 --> 02:01:05,380 And I can still add in my order by title to get back the result. 2493 02:01:05,380 --> 02:01:09,780 And if I do this a little more neatly, let me type this out a little 2494 02:01:09,780 --> 02:01:10,750 differently. 2495 02:01:10,750 --> 02:01:16,620 Let me type this out by adding a new line-- ah, I can't do that here. 2496 02:01:16,620 --> 02:01:18,120 I'm going to leave it alone for now. 2497 02:01:18,120 --> 02:01:20,700 We can type it on multiple lines in other contexts. 2498 02:01:20,700 --> 02:01:23,700 And let me do one last thing. 2499 02:01:23,700 --> 02:01:25,012 Do I want to show that? 2500 02:01:25,012 --> 02:01:26,970 I'm going to show it, but this is not something 2501 02:01:26,970 --> 02:01:28,440 you should ingrain just yet either. 2502 02:01:28,440 --> 02:01:31,620 Select title from people, stars, and shows. 2503 02:01:31,620 --> 02:01:34,860 If you know in advance that you want to do something with all three tables, 2504 02:01:34,860 --> 02:01:38,130 you can just enumerate them, one table name after the other. 2505 02:01:38,130 --> 02:01:43,497 And then you can say where people.ID equals stars.personID. 2506 02:01:43,497 --> 02:01:45,330 And now I'm hitting Enter so that it formats 2507 02:01:45,330 --> 02:01:47,070 a little more readably on my screen. 2508 02:01:47,070 --> 02:01:55,140 And stars.showID equals shows.ID, and lastly, name equals Steve Carell. 2509 02:01:55,140 --> 02:01:59,730 In short, you specify that you want to select data from all three 2510 02:01:59,730 --> 02:02:00,690 of these tables. 2511 02:02:00,690 --> 02:02:06,450 And then you tell the database how to combine foreign keys with primary keys, 2512 02:02:06,450 --> 02:02:09,690 that is, the columns that have those integers in common. 2513 02:02:09,690 --> 02:02:12,720 If I hit Enter now, I get the same exact results, ever 2514 02:02:12,720 --> 02:02:16,110 more so if I also add in an order by title. 2515 02:02:16,110 --> 02:02:17,770 Oops. 2516 02:02:17,770 --> 02:02:18,270 All right. 2517 02:02:18,270 --> 02:02:20,190 That's why I didn't want to do this earlier. 2518 02:02:20,190 --> 02:02:23,190 I have to go back through my history multiple times to actually get back 2519 02:02:23,190 --> 02:02:24,640 the multi-line query this time. 2520 02:02:24,640 --> 02:02:25,140 All right. 2521 02:02:25,140 --> 02:02:27,280 That was a lot all at once. 2522 02:02:27,280 --> 02:02:31,365 But this is only to say that, even as we make the design of the data 2523 02:02:31,365 --> 02:02:33,990 more sophisticated, and we put some of it over here, some of it 2524 02:02:33,990 --> 02:02:38,130 over here, some of it over here so as to avoid duplication of data, weird hacks 2525 02:02:38,130 --> 02:02:41,910 like putting commas in the data, we can still get back all of the answers 2526 02:02:41,910 --> 02:02:44,280 that we might want across these several tables. 2527 02:02:44,280 --> 02:02:48,580 And using indexes, we can significantly speed up these processes 2528 02:02:48,580 --> 02:02:52,140 so as to handle 10 times as many, a 100 times as many users 2529 02:02:52,140 --> 02:02:53,670 on the same actual database. 2530 02:02:53,670 --> 02:02:55,020 There is going to be a downside. 2531 02:02:55,020 --> 02:02:57,540 And thinking back to our discussion of algorithms and data 2532 02:02:57,540 --> 02:03:02,120 structures in past weeks, what might be a downside of creating these indexes? 2533 02:03:02,120 --> 02:03:06,110 Because as of now, I created four separate indexes on the name column, 2534 02:03:06,110 --> 02:03:09,560 the title column, and some other columns, too. 2535 02:03:09,560 --> 02:03:11,930 Why wouldn't I just go ahead and index everything 2536 02:03:11,930 --> 02:03:14,390 if it's clearly speeding things up? 2537 02:03:14,390 --> 02:03:15,770 Memory, so space. 2538 02:03:15,770 --> 02:03:18,890 Any time you're starting to benefit time wise in computer science, 2539 02:03:18,890 --> 02:03:21,950 odds are you're sacrificing space, or vice versa. 2540 02:03:21,950 --> 02:03:25,400 And probably indexing absolutely everything is a little dumb 2541 02:03:25,400 --> 02:03:29,430 because you're going to waste way more space than you might actually need. 2542 02:03:29,430 --> 02:03:31,610 So figuring out where the right inflection point is 2543 02:03:31,610 --> 02:03:36,410 is part of the process of designing and just getting better at these things. 2544 02:03:36,410 --> 02:03:40,910 Now unfortunately, a whole lot of things can go wrong in this world, 2545 02:03:40,910 --> 02:03:44,870 and they continue to in the real world with people using SQL databases. 2546 02:03:44,870 --> 02:03:46,760 And in fact, here on out, if you're reading 2547 02:03:46,760 --> 02:03:51,530 something technical about SQL databases, and websites being hacked in some form, 2548 02:03:51,530 --> 02:03:54,660 and passwords leaking out, unfortunately, all too often 2549 02:03:54,660 --> 02:03:57,505 it is because of what are called SQL injection attacks. 2550 02:03:57,505 --> 02:03:59,630 And just to give you a sense now to counterbalance, 2551 02:03:59,630 --> 02:04:01,588 maybe [INAUDIBLE] enthusiasm for like, oh, that 2552 02:04:01,588 --> 02:04:03,620 was neat how we can do things so quickly. 2553 02:04:03,620 --> 02:04:06,680 With great power comes responsibility in this world, too. 2554 02:04:06,680 --> 02:04:09,320 And so many people introduce bugs into their code 2555 02:04:09,320 --> 02:04:17,160 by not quite appreciating how it is the data is getting into your application. 2556 02:04:17,160 --> 02:04:18,390 So what do I mean by that? 2557 02:04:18,390 --> 02:04:21,200 Here, for instance, is a typical login screen for Yale. 2558 02:04:21,200 --> 02:04:23,780 And here's the analogue for Harvard where you're prompted, 2559 02:04:23,780 --> 02:04:26,450 every day probably, for your username and your password, 2560 02:04:26,450 --> 02:04:28,460 your email address and your password here. 2561 02:04:28,460 --> 02:04:32,420 Suppose, though, that behind this login page, 2562 02:04:32,420 --> 02:04:35,030 whether Harvard's or Yale's, there's some website. 2563 02:04:35,030 --> 02:04:38,270 And that website is using SQL underneath the hood 2564 02:04:38,270 --> 02:04:40,700 to store all of the Harvard or Yale people's 2565 02:04:40,700 --> 02:04:43,940 usernames, passwords, ID numbers, courses, transcripts, 2566 02:04:43,940 --> 02:04:44,970 all of that stuff. 2567 02:04:44,970 --> 02:04:47,480 So there's a SQL database underneath the website. 2568 02:04:47,480 --> 02:04:50,360 Well, what might go wrong with this process? 2569 02:04:50,360 --> 02:04:52,850 Unfortunately, there's some special syntax in SQL 2570 02:04:52,850 --> 02:04:54,530 just like there is in C and Python. 2571 02:04:54,530 --> 02:04:56,960 For instance, there are comments in SQL, too. 2572 02:04:56,960 --> 02:05:00,680 If you do two hyphens, dash, dash, that's a comment in SQL. 2573 02:05:00,680 --> 02:05:06,170 And if you, the programmer, aren't sufficiently distrustful of your users, 2574 02:05:06,170 --> 02:05:09,560 such that you defend against potentially adversarial attacks, 2575 02:05:09,560 --> 02:05:11,160 you might do something like this. 2576 02:05:11,160 --> 02:05:16,070 Suppose that I somewhat maliciously or curiously log in 2577 02:05:16,070 --> 02:05:19,130 by typing my username, Malan@harvard.edu, and then maybe 2578 02:05:19,130 --> 02:05:20,990 a single quote and a dash, dash. 2579 02:05:20,990 --> 02:05:21,680 Why? 2580 02:05:21,680 --> 02:05:24,860 Because I'm trying to suss out if there is a vulnerability here 2581 02:05:24,860 --> 02:05:26,420 to a SQL injection attack. 2582 02:05:26,420 --> 02:05:27,910 Do not do this in general. 2583 02:05:27,910 --> 02:05:31,160 But if I were the owner of the website trying to see if I've made any mistake, 2584 02:05:31,160 --> 02:05:35,300 I might try using potentially dangerous characters in my input. 2585 02:05:35,300 --> 02:05:36,290 Dangerous how? 2586 02:05:36,290 --> 02:05:40,340 Because single quote is used for quoting things in SQL, as we've seen-- 2587 02:05:40,340 --> 02:05:41,810 single quotes or double quotes. 2588 02:05:41,810 --> 02:05:44,930 Dash, dash, I claim now, is used for commenting. 2589 02:05:44,930 --> 02:05:47,960 But let's now imagine what the code underneath the hood 2590 02:05:47,960 --> 02:05:52,160 might be for something like Yale's login or Harvard's login. 2591 02:05:52,160 --> 02:05:54,600 What if it's code that looks like this? 2592 02:05:54,600 --> 02:05:56,540 So let me read it from left to right. 2593 02:05:56,540 --> 02:06:00,710 Suppose that they are using something like CS50's own execute function, 2594 02:06:00,710 --> 02:06:03,230 and they've got some SQL typed into the website that 2595 02:06:03,230 --> 02:06:07,160 says select star from users, where username equals this, 2596 02:06:07,160 --> 02:06:09,050 and password equals that. 2597 02:06:09,050 --> 02:06:12,510 And they're plugging in username and password. 2598 02:06:12,510 --> 02:06:13,607 So what am I doing here? 2599 02:06:13,607 --> 02:06:16,190 Well, when the user types their username password, hits Enter, 2600 02:06:16,190 --> 02:06:18,920 I probably want to select that user from my database 2601 02:06:18,920 --> 02:06:21,020 to see if the username and passwords match. 2602 02:06:21,020 --> 02:06:23,720 So the underlying SQL might be, select star 2603 02:06:23,720 --> 02:06:25,790 from users, where username equals question mark, 2604 02:06:25,790 --> 02:06:27,207 and password equals question mark. 2605 02:06:27,207 --> 02:06:28,100 Users is the table. 2606 02:06:28,100 --> 02:06:29,510 One column is username. 2607 02:06:29,510 --> 02:06:31,400 One column is password. 2608 02:06:31,400 --> 02:06:32,000 All right. 2609 02:06:32,000 --> 02:06:37,430 And if we get back one row, presumably Malan@harvard.edu 2610 02:06:37,430 --> 02:06:38,970 exists with that password. 2611 02:06:38,970 --> 02:06:41,190 We should let him proceed from there on out. 2612 02:06:41,190 --> 02:06:45,140 So that's some pseudo code, if you will, for this scenario. 2613 02:06:45,140 --> 02:06:49,580 What if, though, this code is not as well written as it currently 2614 02:06:49,580 --> 02:06:51,500 is, and isn't using question marks? 2615 02:06:51,500 --> 02:06:54,757 So the question mark syntax is a fairly common SQL thing, 2616 02:06:54,757 --> 02:06:56,840 where the question marks are used as placeholders, 2617 02:06:56,840 --> 02:06:59,390 just like in printf, percent S was. 2618 02:06:59,390 --> 02:07:02,900 But this function, db.execute from CS50's library 2619 02:07:02,900 --> 02:07:05,420 and third-party libraries as well, is also 2620 02:07:05,420 --> 02:07:07,790 doing some good stuff with these question marks, 2621 02:07:07,790 --> 02:07:09,830 and defending against the following attack. 2622 02:07:09,830 --> 02:07:12,920 Suppose that you were not using a third-party library like ours 2623 02:07:12,920 --> 02:07:16,490 and you were just manually constructing your SQL queries like this. 2624 02:07:16,490 --> 02:07:19,940 You were to do something like this instead using an f-string in Python. 2625 02:07:19,940 --> 02:07:21,800 You're comfortable with format strings now. 2626 02:07:21,800 --> 02:07:24,883 You've gotten into the habit of using curly braces and plugging in values. 2627 02:07:24,883 --> 02:07:27,230 Suppose that you, the aspiring programmer, 2628 02:07:27,230 --> 02:07:29,660 is just using techniques that you've been taught. 2629 02:07:29,660 --> 02:07:32,660 So you have an f-string with select star from users, 2630 02:07:32,660 --> 02:07:36,470 where username equals, quote, unquote, "username" in curly braces. 2631 02:07:36,470 --> 02:07:41,270 And password equals, quote, unquote, "password" in curly braces. 2632 02:07:41,270 --> 02:07:44,270 As of what, two weeks ago, this was perfectly 2633 02:07:44,270 --> 02:07:49,460 legitimate technique in Python to plug in values into a string. 2634 02:07:49,460 --> 02:07:53,630 But notice if you are using single quotes yourself 2635 02:07:53,630 --> 02:07:58,750 and the user has typed in single quotes to their input, what 2636 02:07:58,750 --> 02:07:59,920 could go wrong here? 2637 02:07:59,920 --> 02:08:04,510 Where are we going with this if you're just blindly plugging user input 2638 02:08:04,510 --> 02:08:08,350 into your own prepared string of text? 2639 02:08:08,350 --> 02:08:09,230 Yeah? 2640 02:08:09,230 --> 02:08:14,725 AUDIENCE: [INAUDIBLE] 2641 02:08:14,725 --> 02:08:15,600 DAVID J. MALAN: Yeah. 2642 02:08:15,600 --> 02:08:20,670 Worst case, they could insert what is actually SQL code into your database 2643 02:08:20,670 --> 02:08:21,810 as follows. 2644 02:08:21,810 --> 02:08:25,140 Generally speaking, if you're using special syntax like single quotes 2645 02:08:25,140 --> 02:08:27,000 to surround the user's input, you'd better 2646 02:08:27,000 --> 02:08:29,250 hope that they don't have an apostrophe in their name. 2647 02:08:29,250 --> 02:08:31,875 Or you better hope that they don't type a single quote as well. 2648 02:08:31,875 --> 02:08:36,025 Because what if their single quote finishes your single quote instead, 2649 02:08:36,025 --> 02:08:37,900 and then the rest of this is somehow ignored? 2650 02:08:37,900 --> 02:08:39,330 Well, let's consider how this might happen. 2651 02:08:39,330 --> 02:08:40,510 Let me go ahead in here. 2652 02:08:40,510 --> 02:08:42,840 This got a little blurry here, but let me 2653 02:08:42,840 --> 02:08:44,850 plug in here-- wow, that looks awful. 2654 02:08:44,850 --> 02:08:47,940 Let me fix the red. 2655 02:08:47,940 --> 02:08:50,430 Just change this to white so it's more readable. 2656 02:08:50,430 --> 02:08:56,730 What happens if the user does this instead? 2657 02:08:56,730 --> 02:08:59,310 They type in, like I did into the screenshot, 2658 02:08:59,310 --> 02:09:02,760 'Malan@harvard.edu,' single quote, dash, dash. 2659 02:09:02,760 --> 02:09:05,070 What has just happened logically, even though we've 2660 02:09:05,070 --> 02:09:06,900 only just begun with SQL today? 2661 02:09:06,900 --> 02:09:12,150 Well, select star from users, where username equals Malan@harvard.edu, end 2662 02:09:12,150 --> 02:09:13,320 quote. 2663 02:09:13,320 --> 02:09:17,340 What's bad about the rest of this? 2664 02:09:17,340 --> 02:09:19,752 Dash, dash, I claim, means a comment, which 2665 02:09:19,752 --> 02:09:22,210 means my color coding is going to be a little blurry again. 2666 02:09:22,210 --> 02:09:25,140 But everything after the dash, dash is just ignored. 2667 02:09:25,140 --> 02:09:27,440 The logic, then, of the SQL query, then, is 2668 02:09:27,440 --> 02:09:30,760 to just say, select Malan@harvard.edu from the database, 2669 02:09:30,760 --> 02:09:33,490 not even checking the password anymore. 2670 02:09:33,490 --> 02:09:36,180 Therefore, you will get back at least one row. 2671 02:09:36,180 --> 02:09:41,190 So length of rows will equal 1, and so presumably the rest of the pseudo code 2672 02:09:41,190 --> 02:09:44,190 logs the user in, gives them access to my my.harvard account, 2673 02:09:44,190 --> 02:09:45,150 or whatever it is. 2674 02:09:45,150 --> 02:09:50,640 And they've pretended to be me simply by using a single quote and a dash, 2675 02:09:50,640 --> 02:09:52,445 dash in the username field. 2676 02:09:52,445 --> 02:09:54,570 Again, please don't go start doing this later today 2677 02:09:54,570 --> 02:09:56,140 on Harvard, Yale, or other websites. 2678 02:09:56,140 --> 02:09:57,670 But it could be as simple as that. 2679 02:09:57,670 --> 02:09:58,170 Why? 2680 02:09:58,170 --> 02:10:00,030 Because the programmer practiced what they 2681 02:10:00,030 --> 02:10:04,110 were taught, which was just to use curly braces to plug in, 2682 02:10:04,110 --> 02:10:05,560 in f-strings, values. 2683 02:10:05,560 --> 02:10:08,590 But if you don't understand how the user's input is going to be used, 2684 02:10:08,590 --> 02:10:12,255 and if you don't distrust your users fundamentally, for every good person 2685 02:10:12,255 --> 02:10:14,130 out there there's going to be, unfortunately, 2686 02:10:14,130 --> 02:10:19,380 some adversary who just wants to try to find fault in your code or hack 2687 02:10:19,380 --> 02:10:20,490 into your data set. 2688 02:10:20,490 --> 02:10:22,980 This is what's known as a SQL injection attack, 2689 02:10:22,980 --> 02:10:27,000 because the user can type something that happens to be or look like SQL, 2690 02:10:27,000 --> 02:10:31,110 and trick your database into doing something it didn't intend to, 2691 02:10:31,110 --> 02:10:34,830 like, for instance, logging the user in. 2692 02:10:34,830 --> 02:10:36,880 Worst case, they could even do something else. 2693 02:10:36,880 --> 02:10:41,490 Maybe the user types a semicolon, then the word drop, or the word update. 2694 02:10:41,490 --> 02:10:45,180 You could imagine doing semicolon update table grades, where 2695 02:10:45,180 --> 02:10:49,140 name equals Malan, and set the grade equal to A instead of B, 2696 02:10:49,140 --> 02:10:50,670 or something like that. 2697 02:10:50,670 --> 02:10:53,550 The ability to inject SQL into the database 2698 02:10:53,550 --> 02:10:56,820 means you can do anything you want with the data set, either constructively, 2699 02:10:56,820 --> 02:11:00,000 or worse, destructively. 2700 02:11:00,000 --> 02:11:02,880 And now, just a quick, little cartoon that should now make sense. 2701 02:11:02,880 --> 02:11:09,410 2702 02:11:09,410 --> 02:11:13,020 OK, to, like, one of us, two of us. 2703 02:11:13,020 --> 02:11:14,400 Awkwardly somewhat funny. 2704 02:11:14,400 --> 02:11:16,560 All right, so let's move on to one last condition. 2705 02:11:16,560 --> 02:11:19,130 There's one other problem that can go awry here. 2706 02:11:19,130 --> 02:11:20,380 Oh, and I should explain this. 2707 02:11:20,380 --> 02:11:25,500 So this is an allusion to the son, Robert, having typed in semicolon. 2708 02:11:25,500 --> 02:11:28,810 The word drop, table, students, and doing some of the same technique. 2709 02:11:28,810 --> 02:11:31,440 This is humor that only CS people would understand 2710 02:11:31,440 --> 02:11:35,040 because it's the mom realizing, oh, her son's doing a SQL injection 2711 02:11:35,040 --> 02:11:36,308 attack onto the database. 2712 02:11:36,308 --> 02:11:39,600 Less funny when you explain it, but once you notice the syntax, that's all this 2713 02:11:39,600 --> 02:11:40,960 is an allusion to. 2714 02:11:40,960 --> 02:11:41,460 All right. 2715 02:11:41,460 --> 02:11:44,850 So one final threat, now that you are graduating 2716 02:11:44,850 --> 02:11:49,320 to the world of proper databases and away from CSV files alone. 2717 02:11:49,320 --> 02:11:52,170 Things can go wrong when using databases, 2718 02:11:52,170 --> 02:11:55,838 and honestly, even using CSV files if you have multiple users. 2719 02:11:55,838 --> 02:11:57,630 And thus far, you and I have had the luxury 2720 02:11:57,630 --> 02:12:00,547 in almost every program we've written that it's just me using my code. 2721 02:12:00,547 --> 02:12:01,830 It's just you using your code. 2722 02:12:01,830 --> 02:12:04,770 And even if your teaching fellow or TA is using it, probably 2723 02:12:04,770 --> 02:12:06,060 not at the same time. 2724 02:12:06,060 --> 02:12:10,770 But the world gets interesting if you start putting your code on phones, 2725 02:12:10,770 --> 02:12:14,730 on websites, such that now you might have two users literally trying 2726 02:12:14,730 --> 02:12:17,430 to log in at the same time, literally clicking a button 2727 02:12:17,430 --> 02:12:19,410 at the same, or nearly the same time. 2728 02:12:19,410 --> 02:12:22,380 What happens, then, if a computer is trying 2729 02:12:22,380 --> 02:12:25,290 to handle requests from two different people at once, 2730 02:12:25,290 --> 02:12:27,480 as might happen all the time on a website? 2731 02:12:27,480 --> 02:12:29,610 You might get what are called race conditions. 2732 02:12:29,610 --> 02:12:33,060 And this is a problem in computing in general, not just with SQL, not just 2733 02:12:33,060 --> 02:12:36,960 with Python, really just any time you have shared data, 2734 02:12:36,960 --> 02:12:39,150 like a database, as follows. 2735 02:12:39,150 --> 02:12:43,620 This apparently is one of the most liked Instagram posts ever. 2736 02:12:43,620 --> 02:12:46,110 It is literally just a picture of an egg. 2737 02:12:46,110 --> 02:12:47,645 Has anyone clicked on this egg? 2738 02:12:47,645 --> 02:12:48,270 Like, a couple? 2739 02:12:48,270 --> 02:12:48,780 Oh, OK. 2740 02:12:48,780 --> 02:12:49,280 Wow. 2741 02:12:49,280 --> 02:12:50,220 All right, so yes. 2742 02:12:50,220 --> 02:12:53,880 So go search for this photo if you'd like to add to the likes on Instagram. 2743 02:12:53,880 --> 02:12:56,110 The account is world_record_egg. 2744 02:12:56,110 --> 02:12:59,040 This is just a screenshot of Instagram of that picture of an egg. 2745 02:12:59,040 --> 02:13:00,540 If you're in the habit of using Instagram, 2746 02:13:00,540 --> 02:13:03,498 or like any social media site, there's some equivalent of a like button 2747 02:13:03,498 --> 02:13:04,920 or a heart button these days. 2748 02:13:04,920 --> 02:13:06,900 And that's actually a really hard problem. 2749 02:13:06,900 --> 02:13:10,050 Such a simple idea to count the number of likes something 2750 02:13:10,050 --> 02:13:12,990 has, but that means someone has to click on it. 2751 02:13:12,990 --> 02:13:14,910 Your code has to detect the click. 2752 02:13:14,910 --> 02:13:17,860 Your code has to update the database, and then do it again and again, 2753 02:13:17,860 --> 02:13:22,740 even if multiple people are perhaps right now clicking on that same egg. 2754 02:13:22,740 --> 02:13:28,350 And unfortunately, bad things can happen if two people try to do something 2755 02:13:28,350 --> 02:13:30,540 at the same time on a computer. 2756 02:13:30,540 --> 02:13:31,690 How might this happen? 2757 02:13:31,690 --> 02:13:35,670 So here's some more code, half pseudocode, half Python code here, 2758 02:13:35,670 --> 02:13:36,580 as follows. 2759 02:13:36,580 --> 02:13:40,230 Suppose that what happens when you, literally, right now, maybe click 2760 02:13:40,230 --> 02:13:43,470 on the like button on the Instagram post. 2761 02:13:43,470 --> 02:13:47,760 Suppose that code, like the following, is executed on Facebook servers. 2762 02:13:47,760 --> 02:13:54,190 db.execute of select likes from posts where ID equals question mark. 2763 02:13:54,190 --> 02:13:54,690 All right. 2764 02:13:54,690 --> 02:13:57,090 So what am I assuming here? 2765 02:13:57,090 --> 02:13:59,280 I'm assuming that that photograph has a unique ID. 2766 02:13:59,280 --> 02:14:02,670 It's some big integer, whatever it was, randomly assigned. 2767 02:14:02,670 --> 02:14:05,130 I'm assuming that when you click on the heart 2768 02:14:05,130 --> 02:14:08,160 the unique ID is somehow sent to Instagram servers 2769 02:14:08,160 --> 02:14:10,740 so that their code can call it ID. 2770 02:14:10,740 --> 02:14:13,830 And I'm assuming that Instagram is using its SQL database 2771 02:14:13,830 --> 02:14:17,790 and selecting, from a posts table, the current number of likes 2772 02:14:17,790 --> 02:14:21,160 of that egg for that given ID number. 2773 02:14:21,160 --> 02:14:21,660 Why? 2774 02:14:21,660 --> 02:14:24,952 Because I need to know how many likes it already has if I want to add one to it 2775 02:14:24,952 --> 02:14:26,190 and then update the database. 2776 02:14:26,190 --> 02:14:29,710 I need to select the data, then I need to update the data here. 2777 02:14:29,710 --> 02:14:30,210 All right. 2778 02:14:30,210 --> 02:14:33,780 So in some Python code here, let's store, in a variable called 2779 02:14:33,780 --> 02:14:37,950 likes, whatever comes back in the first row from the likes column. 2780 02:14:37,950 --> 02:14:40,660 Again, this is new syntax specific to our library, 2781 02:14:40,660 --> 02:14:43,830 but a common way of getting back first row and the column called 2782 02:14:43,830 --> 02:14:44,800 likes therein. 2783 02:14:44,800 --> 02:14:46,920 So at this point in the story, likes is storing 2784 02:14:46,920 --> 02:14:49,462 the total number of likes, in the millions or whatever it is, 2785 02:14:49,462 --> 02:14:50,610 of that particular egg. 2786 02:14:50,610 --> 02:14:51,900 Then I do this. 2787 02:14:51,900 --> 02:14:56,400 Execute update posts, set the number of likes 2788 02:14:56,400 --> 02:15:00,607 equal to this value, where the ID of the post equals this value. 2789 02:15:00,607 --> 02:15:02,190 What do I want to update the likes to? 2790 02:15:02,190 --> 02:15:06,540 Whatever likes currently is plus 1, and then plugging in the ID. 2791 02:15:06,540 --> 02:15:08,340 So a simple idea, right? 2792 02:15:08,340 --> 02:15:11,730 I'm checking the value of the likes, and maybe it's 10. 2793 02:15:11,730 --> 02:15:15,030 I'm changing 10 to 11 and then updating the table. 2794 02:15:15,030 --> 02:15:18,390 But a problem can arise if two people have 2795 02:15:18,390 --> 02:15:23,010 clicked on that egg at roughly the same time, or literally, the same time. 2796 02:15:23,010 --> 02:15:24,340 Why is that? 2797 02:15:24,340 --> 02:15:26,850 Well, in the world of databases and servers, 2798 02:15:26,850 --> 02:15:31,120 and the Instagrams of the world have thousands of physical servers nowadays. 2799 02:15:31,120 --> 02:15:35,430 So they can support millions, billions even, of users nowadays. 2800 02:15:35,430 --> 02:15:36,910 What can go wrong? 2801 02:15:36,910 --> 02:15:40,530 Well, typically code like this is not what we'll call atomic. 2802 02:15:40,530 --> 02:15:44,430 To be atomic means that it all executes together or not at all. 2803 02:15:44,430 --> 02:15:49,500 Rather, code typically is executed, as you might imagine, line by line. 2804 02:15:49,500 --> 02:15:53,110 And if your code is running on a server that multiple people have access to, 2805 02:15:53,110 --> 02:15:55,530 which is absolutely the case for an app like Instagram, 2806 02:15:55,530 --> 02:15:58,560 if you and I click on the heart at roughly the same time, 2807 02:15:58,560 --> 02:16:02,160 for efficiency, the computer, the server, owned by Instagram, 2808 02:16:02,160 --> 02:16:04,500 might execute this line of code for me. 2809 02:16:04,500 --> 02:16:06,537 Then it might execute this line of code for you. 2810 02:16:06,537 --> 02:16:09,120 Then this line of code for me, then this line of code for you, 2811 02:16:09,120 --> 02:16:11,703 then this line of code for me, then this line of code for you. 2812 02:16:11,703 --> 02:16:16,550 That is to say, our queries might get intermingled chronologically. 2813 02:16:16,550 --> 02:16:19,370 Because it'd be a little obnoxious if, when you're using Instagram, 2814 02:16:19,370 --> 02:16:22,040 I'm blocked out while you're interacting with the site. 2815 02:16:22,040 --> 02:16:24,778 It'd be a lot nicer for efficiency and fairness if somehow they 2816 02:16:24,778 --> 02:16:27,320 do a little bit of work for me, a little bit of work for you, 2817 02:16:27,320 --> 02:16:30,540 and back and forth, and back and forth, equitably on the server. 2818 02:16:30,540 --> 02:16:33,320 So that's what typically happens by default. These lines of code 2819 02:16:33,320 --> 02:16:35,420 get executed independently. 2820 02:16:35,420 --> 02:16:39,770 And they can happen in alternating order with other users. 2821 02:16:39,770 --> 02:16:41,809 You can get them combined like this. 2822 02:16:41,809 --> 02:16:45,840 Same order top to bottom, but other things might happen in between. 2823 02:16:45,840 --> 02:16:50,000 So suppose that the number of likes at the very beginning was 10. 2824 02:16:50,000 --> 02:16:54,080 And suppose that Carter and I both click on that egg at roughly the same time. 2825 02:16:54,080 --> 02:16:56,420 And suppose this line of code gets executed for me, 2826 02:16:56,420 --> 02:16:59,870 and that gives me a value in likes, ultimately, of 10. 2827 02:16:59,870 --> 02:17:03,290 Suppose, then, that the computer takes a break from dealing with my request, 2828 02:17:03,290 --> 02:17:05,690 does the same code for Carter, and gets back 2829 02:17:05,690 --> 02:17:07,910 what value for the current number of likes? 2830 02:17:07,910 --> 02:17:09,170 Also 10 for Carter. 2831 02:17:09,170 --> 02:17:11,209 Because mine has not been recorded yet. 2832 02:17:11,209 --> 02:17:13,910 At this point in the story, somewhere in the computer's memory 2833 02:17:13,910 --> 02:17:16,219 there's a likes variable for me, storing 10. 2834 02:17:16,219 --> 02:17:19,400 There's a likes variable storing 10 for Carter. 2835 02:17:19,400 --> 02:17:21,290 Then this line of code executes for me. 2836 02:17:21,290 --> 02:17:25,219 It updates the database to be likes plus 1, which stores 11 in the database. 2837 02:17:25,219 --> 02:17:30,440 Then Carter's code is executed, updating the same row in the database 2838 02:17:30,440 --> 02:17:34,160 to 11, unfortunately. 2839 02:17:34,160 --> 02:17:38,209 Because his value of likes happened to be the same value of mine. 2840 02:17:38,209 --> 02:17:41,540 And so the metaphor here, that if we had a refrigerator on stage we would 2841 02:17:41,540 --> 02:17:44,790 actually act out, is something that was taught to me years ago in an operating 2842 02:17:44,790 --> 02:17:50,299 systems class, whereby the most similar analogue in the real world would be 2843 02:17:50,299 --> 02:17:52,370 if you've got a mini fridge in your dorm room. 2844 02:17:52,370 --> 02:17:58,440 And one of you and your roommates comes home, opens the fridge, and realizes, 2845 02:17:58,440 --> 02:18:01,160 oh, we're out of milk, was how the story went in my day. 2846 02:18:01,160 --> 02:18:05,184 So you close the refrigerator, and you walk across the street, go to CVS, 2847 02:18:05,184 --> 02:18:06,559 and get in line to buy some milk. 2848 02:18:06,559 --> 02:18:08,309 Meanwhile, your roommate comes home. 2849 02:18:08,309 --> 02:18:12,924 They, too, inspect the state of your refrigerator, a.k.a., a variable, 2850 02:18:12,924 --> 02:18:15,049 open the door, and realizes, oh, we're out of milk. 2851 02:18:15,049 --> 02:18:15,980 I'll go get more milk. 2852 02:18:15,980 --> 02:18:18,020 Close the fridge, go across the street, and head 2853 02:18:18,020 --> 02:18:20,059 to maybe a different store, or the line is long enough 2854 02:18:20,059 --> 02:18:21,851 that you don't see each other at the store. 2855 02:18:21,851 --> 02:18:25,700 So long story short, you both eventually get home, open the door, and damn it, 2856 02:18:25,700 --> 02:18:27,559 now there's milk from your other roommate 2857 02:18:27,559 --> 02:18:30,680 there because you both made a decision on this 2858 02:18:30,680 --> 02:18:35,809 based on the state of a variable that you independently examined. 2859 02:18:35,809 --> 02:18:37,700 And you didn't somehow communicate. 2860 02:18:37,700 --> 02:18:40,610 Now in the real world, this is absolutely solvable. 2861 02:18:40,610 --> 02:18:44,120 How would you fix this or avoid this problem in the real world? 2862 02:18:44,120 --> 02:18:45,770 Literally, own roommate, own fridge. 2863 02:18:45,770 --> 02:18:47,750 AUDIENCE: Text your roommate [INAUDIBLE].. 2864 02:18:47,750 --> 02:18:48,750 DAVID J. MALAN: Perfect. 2865 02:18:48,750 --> 02:18:50,482 Let them know, so somehow communicate. 2866 02:18:50,482 --> 02:18:52,940 And in fact, the terminology here would be multiple threads 2867 02:18:52,940 --> 02:18:55,459 can somehow intercommunicate by having shared state, 2868 02:18:55,459 --> 02:18:57,230 like the iMessage thread on your phone. 2869 02:18:57,230 --> 02:18:58,190 You could leave a note. 2870 02:18:58,190 --> 02:19:01,490 You could, more dramatically, lock the refrigerator somehow, 2871 02:19:01,490 --> 02:19:05,570 thereby making the milk purchasing process atomic. 2872 02:19:05,570 --> 02:19:08,389 The fundamental problem is that for efficiency, again, 2873 02:19:08,389 --> 02:19:11,690 computers tend to intermingle logic that needs 2874 02:19:11,690 --> 02:19:15,980 to happen when it's happening across multiple users just for fairness' sake, 2875 02:19:15,980 --> 02:19:17,180 for scheduling sake. 2876 02:19:17,180 --> 02:19:19,820 You need to make sure that all three of these lines of code 2877 02:19:19,820 --> 02:19:23,269 execute for me, and then for Carter, and then for you 2878 02:19:23,269 --> 02:19:25,709 if you want to ensure that this count is correct. 2879 02:19:25,709 --> 02:19:29,219 And for years, when social media was first getting off the ground, 2880 02:19:29,219 --> 02:19:31,100 this was a super hard problem. 2881 02:19:31,100 --> 02:19:33,920 Twitter used to go down all of the time, and tweets, 2882 02:19:33,920 --> 02:19:36,260 and retweets were a thing that were similarly happening 2883 02:19:36,260 --> 02:19:37,490 with a very high frequency. 2884 02:19:37,490 --> 02:19:39,020 These are hard problems to solve. 2885 02:19:39,020 --> 02:19:40,690 And thankfully, there are solutions. 2886 02:19:40,690 --> 02:19:43,440 And we won't get into the weeds of how you might use these things, 2887 02:19:43,440 --> 02:19:46,100 but know that there are solutions in the form of things 2888 02:19:46,100 --> 02:19:49,610 called locks, which I use that word deliberately with the fridge. 2889 02:19:49,610 --> 02:19:53,570 Software locks can allow you to protect a variable so no one else can 2890 02:19:53,570 --> 02:19:55,550 look at it until you're done with it. 2891 02:19:55,550 --> 02:19:57,770 There are things called transactions, which 2892 02:19:57,770 --> 02:20:01,100 allow you to do the equivalent of sending a message to, or really locking 2893 02:20:01,100 --> 02:20:04,130 out your roommate from accessing that same variable, too, 2894 02:20:04,130 --> 02:20:06,920 but for slightly less amount of time. 2895 02:20:06,920 --> 02:20:08,880 There are solutions to these problems. 2896 02:20:08,880 --> 02:20:12,650 So for instance, in Python, the same code now in green 2897 02:20:12,650 --> 02:20:14,360 might look a little something like this. 2898 02:20:14,360 --> 02:20:17,150 When you know that something has to happen all at once, 2899 02:20:17,150 --> 02:20:21,020 altogether, you first begin a transaction, and you do your thing, 2900 02:20:21,020 --> 02:20:23,707 and then you commit the transaction at the very end. 2901 02:20:23,707 --> 02:20:25,790 Here, too, though, there's going to be a downside. 2902 02:20:25,790 --> 02:20:29,902 Typically, the more you use transactions in this way, 2903 02:20:29,902 --> 02:20:31,610 potentially the higher the probability is 2904 02:20:31,610 --> 02:20:35,190 that you're going to box someone out or make Carter's request a little slower. 2905 02:20:35,190 --> 02:20:35,690 Why? 2906 02:20:35,690 --> 02:20:37,482 Because we can't interact at the same time. 2907 02:20:37,482 --> 02:20:39,920 Or you might make his request fail if he tries to update 2908 02:20:39,920 --> 02:20:41,870 something that's already been updated. 2909 02:20:41,870 --> 02:20:44,720 So you generally want to have as few lines of code 2910 02:20:44,720 --> 02:20:47,840 together in between these transactions so that you get in and you get out. 2911 02:20:47,840 --> 02:20:50,780 And you go to CVS and you get back really fast so as to not 2912 02:20:50,780 --> 02:20:52,550 cause these kind of performance things. 2913 02:20:52,550 --> 02:20:55,070 So things indeed escalated quickly today. 2914 02:20:55,070 --> 02:20:58,272 The original goal was just to solve problems using a different language 2915 02:20:58,272 --> 02:20:59,480 more effectively than Python. 2916 02:20:59,480 --> 02:21:01,772 But as soon as you have these more powerful techniques, 2917 02:21:01,772 --> 02:21:03,525 a whole new set of problems arises. 2918 02:21:03,525 --> 02:21:05,150 Takes practice to get comfortable with. 2919 02:21:05,150 --> 02:21:09,170 But ultimately, this is all leading us toward the introduction next week 2920 02:21:09,170 --> 02:21:12,260 of web programming with HTML, CSS, and some JavaScript. 2921 02:21:12,260 --> 02:21:15,090 The week after, bringing Python and SQL back into the mix. 2922 02:21:15,090 --> 02:21:16,940 So that by term's end, we've really now used 2923 02:21:16,940 --> 02:21:19,760 all of these different languages for what they're best at. 2924 02:21:19,760 --> 02:21:22,843 And over the next few weeks, the goal is to make sure you're understanding 2925 02:21:22,843 --> 02:21:25,932 and comfortable with what each of these things is good and bad for. 2926 02:21:25,932 --> 02:21:27,140 Let's go ahead and wrap here. 2927 02:21:27,140 --> 02:21:28,473 I'll stick around for questions. 2928 02:21:28,473 --> 02:21:30,310 We'll see you next time. 2929 02:21:30,310 --> 02:22:04,344 [MUSIC PLAYING]