1 00:00:00,000 --> 00:00:02,994 [MUSIC PLAYING] 2 00:00:02,994 --> 00:01:22,792 3 00:01:22,792 --> 00:01:23,750 DAVID MALAN: All right. 4 00:01:23,750 --> 00:01:28,230 So this is CS50, and this day we continue our look at Python, 5 00:01:28,230 --> 00:01:30,330 but also introduce another language. 6 00:01:30,330 --> 00:01:32,700 This one is called SQL, or Structured Query Language, 7 00:01:32,700 --> 00:01:35,492 and it's going to allow us to solve some different problems as well 8 00:01:35,492 --> 00:01:37,260 as some old problems better. 9 00:01:37,260 --> 00:01:40,330 But to do this, let's try to motivate some of the problems initially. 10 00:01:40,330 --> 00:01:42,390 So this, of course, is just an example of a spreadsheet. 11 00:01:42,390 --> 00:01:43,920 Odds are you use these pretty frequently, 12 00:01:43,920 --> 00:01:46,253 whether it's in the form of Google Spreadsheets or Excel 13 00:01:46,253 --> 00:01:47,700 or Apple Numbers and the like. 14 00:01:47,700 --> 00:01:50,700 And this is actually the simplest form of what we might call a database, 15 00:01:50,700 --> 00:01:54,000 just some kind of application that stores lots of data. 16 00:01:54,000 --> 00:01:56,910 And odds are if you've ever taken registrations for a student group 17 00:01:56,910 --> 00:01:59,580 or had people sign up or fill out a form or a survey, 18 00:01:59,580 --> 00:02:03,270 odds are you've used Google Forms to get your data into a Google Spreadsheet. 19 00:02:03,270 --> 00:02:05,370 And so we thought we would begin there today, 20 00:02:05,370 --> 00:02:08,370 a representative problem whereby you might want to collect a lot of data 21 00:02:08,370 --> 00:02:10,660 from users and then do something with it, 22 00:02:10,660 --> 00:02:14,670 but ideally more powerfully than you could do as a human on your own. 23 00:02:14,670 --> 00:02:18,600 So we've put into place a Google Form that quite simply 24 00:02:18,600 --> 00:02:20,040 asks a couple of questions today. 
25 00:02:20,040 --> 00:02:21,423 What are your favorite TV shows? 26 00:02:21,423 --> 00:02:24,090 And we ask you for the title of your favorite show and the genre 27 00:02:24,090 --> 00:02:26,340 or genres into which that show falls. 28 00:02:26,340 --> 00:02:29,490 So if you wouldn't mind, go ahead and open up this URL here 29 00:02:29,490 --> 00:02:31,192 on your laptop or phone. 30 00:02:31,192 --> 00:02:32,650 If you don't have one, that's fine. 31 00:02:32,650 --> 00:02:35,790 You can use the person's next to you, if I may. 32 00:02:35,790 --> 00:02:41,970 And take just a moment to go to that URL there and answer those two questions. 33 00:02:41,970 --> 00:02:46,890 What is the title of your favorite TV show, and into what genre or genres 34 00:02:46,890 --> 00:02:48,060 does it fall? 35 00:02:48,060 --> 00:02:51,310 36 00:02:51,310 --> 00:02:52,720 Favorite TV shows. 37 00:02:52,720 --> 00:02:59,260 And when you go to that URL, you'll find yourself at, again, this form here, 38 00:02:59,260 --> 00:03:03,230 asking for your title and for your favorite show. 39 00:03:03,230 --> 00:03:06,040 So by the look of my own screen here we've 40 00:03:06,040 --> 00:03:09,640 got some 62 responses, 74 responses coming in already, 41 00:03:09,640 --> 00:03:10,780 so we can track this live. 42 00:03:10,780 --> 00:03:12,710 So 111 responses so far. 43 00:03:12,710 --> 00:03:14,830 That's a lot of favorite TV shows. 44 00:03:14,830 --> 00:03:16,090 This number's-- keep going up. 45 00:03:16,090 --> 00:03:20,890 169, up to, next, 191. 46 00:03:20,890 --> 00:03:22,480 Keep them coming. 47 00:03:22,480 --> 00:03:23,520 One per person is fine. 48 00:03:23,520 --> 00:03:25,103 It will give us more than enough data. 49 00:03:25,103 --> 00:03:26,860 246. 50 00:03:26,860 --> 00:03:29,300 We'll give it another few seconds, but suffice it to say, 51 00:03:29,300 --> 00:03:31,190 this is kind of a lot of data. 
52 00:03:31,190 --> 00:03:32,920 And if the next step in this process were 53 00:03:32,920 --> 00:03:35,830 for me to analyze the data because I'm curious to know 54 00:03:35,830 --> 00:03:38,050 what's the most popular show on campus or if I wanted 55 00:03:38,050 --> 00:03:40,330 to do some other analytical work on this, 56 00:03:40,330 --> 00:03:42,400 it'll be useful to have all of this data, 57 00:03:42,400 --> 00:03:44,740 not in Google Forms' proprietary interface, 58 00:03:44,740 --> 00:03:48,520 but in Google Spreadsheet form with rows and columns, a format you're probably 59 00:03:48,520 --> 00:03:49,420 pretty familiar with. 60 00:03:49,420 --> 00:03:51,670 So we'll let this counter keep going up and up and up, 61 00:03:51,670 --> 00:03:53,840 and I'm going to go ahead and click on responses. 62 00:03:53,840 --> 00:03:55,990 And if we really wanted to, we could look through-- 63 00:03:55,990 --> 00:03:59,950 The Office is up there, followed by Game of Thrones, Friends, OK. 64 00:03:59,950 --> 00:04:01,500 Perhaps pretty predictable here. 65 00:04:01,500 --> 00:04:03,250 Let me go ahead and click this icon, which 66 00:04:03,250 --> 00:04:06,100 is going to open the Google Spreadsheet version of this. 67 00:04:06,100 --> 00:04:08,590 And you'll see that there's three columns by default. 68 00:04:08,590 --> 00:04:12,230 [LAUGHTER] 69 00:04:12,230 --> 00:04:13,440 Oh, that's funny. 70 00:04:13,440 --> 00:04:16,386 71 00:04:16,386 --> 00:04:18,341 [APPLAUSE] 72 00:04:18,341 --> 00:04:18,841 Thank you. 73 00:04:18,841 --> 00:04:21,787 Thank you. 74 00:04:21,787 --> 00:04:23,616 OK. 75 00:04:23,616 --> 00:04:24,530 OK. 76 00:04:24,530 --> 00:04:25,310 Well-played. 77 00:04:25,310 --> 00:04:28,790 Now-- so you'll see that by default we got three columns. 78 00:04:28,790 --> 00:04:31,370 Timestamp, Google Forms just does automatically 79 00:04:31,370 --> 00:04:33,230 to timestamp the submission of the form. 
80 00:04:33,230 --> 00:04:34,910 Title comes from the first question. 81 00:04:34,910 --> 00:04:36,617 Genres comes from the second question. 82 00:04:36,617 --> 00:04:38,450 And if we kept scrolling and scrolling, odds 83 00:04:38,450 --> 00:04:40,010 are we're seeing more and more rows. 84 00:04:40,010 --> 00:04:43,040 Let's propose that we have enough of these rows already, 85 00:04:43,040 --> 00:04:44,960 and how might we go about analyzing the data? 86 00:04:44,960 --> 00:04:48,080 Well, if you're pretty expert with Google Spreadsheets or Excel 87 00:04:48,080 --> 00:04:50,570 or Numbers, maybe you could use some functions that 88 00:04:50,570 --> 00:04:54,197 are built into these programs or you could do some Command-F or Control-F 89 00:04:54,197 --> 00:04:56,030 and search for the favorite show that you're 90 00:04:56,030 --> 00:04:57,890 looking for to try to run some numbers. 91 00:04:57,890 --> 00:04:59,940 But we have programming abilities nowadays. 92 00:04:59,940 --> 00:05:02,480 And if you haven't used this feature already, 93 00:05:02,480 --> 00:05:04,710 if you go to Google Spreadsheet's file menu, 94 00:05:04,710 --> 00:05:06,960 you can go to download, and you can download your data 95 00:05:06,960 --> 00:05:07,970 into a bunch of formats. 96 00:05:07,970 --> 00:05:09,762 And the most useful one I'd propose for now 97 00:05:09,762 --> 00:05:12,770 is going to be this one, Comma-Separated Values or CSV. 98 00:05:12,770 --> 00:05:14,100 And we've seen this before. 99 00:05:14,100 --> 00:05:15,800 In fact, you've generated this before. 100 00:05:15,800 --> 00:05:18,920 So if I go ahead and do that, it pretty quickly downloaded a CSV file 101 00:05:18,920 --> 00:05:22,700 that if I really wanted to I could open in Excel or Apple Numbers or the like. 102 00:05:22,700 --> 00:05:26,460 But let's do something instead with that, and go ahead and put it into, 103 00:05:26,460 --> 00:05:30,870 for instance, my IDE so that I can actually do something with it. 
104 00:05:30,870 --> 00:05:33,480 I'm going to grab my downloaded file here. 105 00:05:33,480 --> 00:05:37,130 And I'm going to go ahead and drag that into the IDE. 106 00:05:37,130 --> 00:05:39,680 And so now you'll see a whole bunch of files here. 107 00:05:39,680 --> 00:05:40,850 Let me put-- whoops. 108 00:05:40,850 --> 00:05:43,102 Let me put this right in the root of my folder. 109 00:05:43,102 --> 00:05:44,310 And we'll see this file here. 110 00:05:44,310 --> 00:05:46,910 And if I double-click on it, you'll see exactly the same data, 111 00:05:46,910 --> 00:05:48,185 albeit in CSV format. 112 00:05:48,185 --> 00:05:50,810 So it looks the same, but we have commas separating the values. 113 00:05:50,810 --> 00:05:53,580 And Google Spreadsheets did that for us automatically. 114 00:05:53,580 --> 00:05:55,910 So let's now write a program that analyzes this data 115 00:05:55,910 --> 00:05:59,360 and see if we can't make sense of what some of the most popular values 116 00:05:59,360 --> 00:06:00,570 are in there. 117 00:06:00,570 --> 00:06:03,510 Let me go ahead and close this, open a new file, 118 00:06:03,510 --> 00:06:07,150 and I'm going to go ahead and call it, for instance, favorites.py, 119 00:06:07,150 --> 00:06:09,770 since the goal is to figure out everyone's favorites here. 120 00:06:09,770 --> 00:06:12,920 And let me go ahead and use some syntax from last week 121 00:06:12,920 --> 00:06:15,230 that you might be familiar with now from P set 6. 122 00:06:15,230 --> 00:06:17,390 Let me go ahead and import csv, the module, 123 00:06:17,390 --> 00:06:20,600 so that I have access to that library of functionality. 124 00:06:20,600 --> 00:06:22,238 Then let me do this with open. 125 00:06:22,238 --> 00:06:25,280 And then I'm going to go ahead and open this file, and I could rename it, 126 00:06:25,280 --> 00:06:28,220 but you know, I can just grab the whole file name this way, too. 
127 00:06:28,220 --> 00:06:32,360 It's a little excessive, but I'll go ahead and paste that in in read mode. 128 00:06:32,360 --> 00:06:34,530 And that gives me a file. 129 00:06:34,530 --> 00:06:36,990 Let me go ahead and zoom out just a little bit there. 130 00:06:36,990 --> 00:06:40,430 And once I have this file, I can throw it into a DictReader, 131 00:06:40,430 --> 00:06:42,620 for instance, to make it easier to read the CSV. 132 00:06:42,620 --> 00:06:45,080 So reader gets csv.DictReader. 133 00:06:45,080 --> 00:06:47,000 Pass in that file. 134 00:06:47,000 --> 00:06:48,890 And then after this, I have the ability, now, 135 00:06:48,890 --> 00:06:54,170 to iterate over each row like this, much like you probably did for DNA 136 00:06:54,170 --> 00:06:56,400 or one or more other problems, as well. 137 00:06:56,400 --> 00:06:59,390 Now, just as a sanity check, let me just go ahead and print out 138 00:06:59,390 --> 00:07:00,870 each of the row's titles. 139 00:07:00,870 --> 00:07:02,630 So I'm not going to see anything new here, 140 00:07:02,630 --> 00:07:05,720 but this will be a little sanity check that I'm writing correct code. 141 00:07:05,720 --> 00:07:09,930 So run Python of favorites.py, and voila, fwoom, there's all the data. 142 00:07:09,930 --> 00:07:13,270 So The Office is both at the beginning and at the end here, it seems. 143 00:07:13,270 --> 00:07:16,520 All right, so that's all fine and good, a nice stepping stone, but let's start 144 00:07:16,520 --> 00:07:19,580 to analyze this data, this time using some code. 145 00:07:19,580 --> 00:07:22,850 So let me go ahead and say this. 146 00:07:22,850 --> 00:07:25,580 Suppose I wanted to keep track and determine 147 00:07:25,580 --> 00:07:27,530 what the most popular show is. 148 00:07:27,530 --> 00:07:30,480 So I've got this big CSV file with title, title, title, title. 149 00:07:30,480 --> 00:07:32,480 Invariably, there's going to be some duplicates. 
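A runnable sketch of the reading-and-printing step just described; the inline CSV below is a made-up stand-in for the real downloaded file (whose long file name is elided here), with the same three columns the lecture names.

```python
import csv
import io

# Inline stand-in for the downloaded Google Forms export;
# columns match the lecture: Timestamp, title, genres
data = io.StringIO(
    "Timestamp,title,genres\n"
    "10/7/2019 14:01:00,The Office,Comedy\n"
    "10/7/2019 14:01:05,Game of Thrones,Fantasy\n"
    "10/7/2019 14:01:09,The Office,Comedy\n"
)

# DictReader yields each row as a dictionary keyed by the header row,
# so row["title"] is the answer to the first question
reader = csv.DictReader(data)
titles = [row["title"] for row in reader]
print(titles)
```

With a real file, the `io.StringIO` object would be replaced by `open("….csv", "r")` as in the lecture.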
150 00:07:32,480 --> 00:07:35,690 A bunch of you probably like The Office or some other show, 151 00:07:35,690 --> 00:07:38,180 and so it'd be nice to kind of aggregate that data. 152 00:07:38,180 --> 00:07:43,080 What might be a useful data structure in Python whereby for each row 153 00:07:43,080 --> 00:07:45,740 we can use this data structure to just kind of do 154 00:07:45,740 --> 00:07:49,490 the equivalent of like 1, 2, 3, and count each 155 00:07:49,490 --> 00:07:51,720 of the occurrences of those titles? 156 00:07:51,720 --> 00:07:56,030 Any-- OK, I heard a hash table, but more properly in Python? 157 00:07:56,030 --> 00:07:56,950 A dictionary. 158 00:07:56,950 --> 00:07:59,410 So a dictionary is kind of the Swiss army 159 00:07:59,410 --> 00:08:02,140 knife of data structures for a lot of people in the sense 160 00:08:02,140 --> 00:08:05,590 that insofar as it allows you to store keys and values, 161 00:08:05,590 --> 00:08:08,900 and those keys can be strings, and those values can be anything you want, 162 00:08:08,900 --> 00:08:11,763 including, for instance, in this case, numbers, like a counter, 163 00:08:11,763 --> 00:08:13,930 it's a really versatile data structure when you just 164 00:08:13,930 --> 00:08:17,000 want to throw some data into a data structure and do something with it. 165 00:08:17,000 --> 00:08:20,110 So let me go ahead and give myself a dictionary at the top of the file. 166 00:08:20,110 --> 00:08:23,110 I can do this in a couple of ways, and if my goal is to store the counts 167 00:08:23,110 --> 00:08:25,420 of each of these titles, I could call it counts, 168 00:08:25,420 --> 00:08:30,130 and I could set it equal to dict(), which is just a function that hands me 169 00:08:30,130 --> 00:08:31,450 an empty dictionary. 170 00:08:31,450 --> 00:08:33,730 Or I recall I can use this syntax, which is 171 00:08:33,730 --> 00:08:35,919 equivalent to just shorthand notation. 
172 00:08:35,919 --> 00:08:40,210 But what I want to do now inside of my loop, every time I 173 00:08:40,210 --> 00:08:44,210 iterate through this file, I want to check if my-- 174 00:08:44,210 --> 00:08:47,630 I've seen this title before, I want to add 1 to my counter. 175 00:08:47,630 --> 00:08:52,808 But if I've never seen this title before, what do I want to do instead? 176 00:08:52,808 --> 00:08:53,308 Yeah? 177 00:08:53,308 --> 00:08:55,117 AUDIENCE: [INAUDIBLE] 178 00:08:55,117 --> 00:08:56,950 DAVID MALAN: Add the title to the dictionary 179 00:08:56,950 --> 00:09:00,110 as a key with the value of what, the first time I see it? 180 00:09:00,110 --> 00:09:01,970 Well, the first time I see it. 181 00:09:01,970 --> 00:09:02,470 AUDIENCE: 1. 182 00:09:02,470 --> 00:09:03,940 DAVID MALAN: So we can just initialize it to 1. 183 00:09:03,940 --> 00:09:05,290 So there's two scenarios here. 184 00:09:05,290 --> 00:09:08,500 Either we've seen this title before and we want to do plus 1, 185 00:09:08,500 --> 00:09:11,830 or we've not seen the title before and we just want to set it equal to 1. 186 00:09:11,830 --> 00:09:13,630 So how might we express that logic? 187 00:09:13,630 --> 00:09:16,190 Well, let me go ahead and say, first of all, 188 00:09:16,190 --> 00:09:18,280 let's get the title into a variable of its own, 189 00:09:18,280 --> 00:09:20,740 just to shorten our syntax a little bit. 190 00:09:20,740 --> 00:09:25,040 And then I can say something like this, if title in counts-- 191 00:09:25,040 --> 00:09:26,300 well, what does that mean? 192 00:09:26,300 --> 00:09:29,140 That means that inside of my counts dictionary, 193 00:09:29,140 --> 00:09:34,300 I can index into it at that location, and I can increment whatever is there 194 00:09:34,300 --> 00:09:38,830 by 1, assuming on some previous iteration, 195 00:09:38,830 --> 00:09:44,738 else I went ahead and did counts of title equals 1. 196 00:09:44,738 --> 00:09:46,030 So these are the two scenarios. 
197 00:09:46,030 --> 00:09:48,940 If I've seen this title before, go ahead and add 1 198 00:09:48,940 --> 00:09:50,560 to its value in the dictionary. 199 00:09:50,560 --> 00:09:53,960 Otherwise, just default to I have seen it one time. 200 00:09:53,960 --> 00:09:56,710 So some of the shows are going to stay stuck at one if they're not 201 00:09:56,710 --> 00:09:58,502 very popular among the audience, or they're 202 00:09:58,502 --> 00:10:00,703 going to keep getting plus 1, plus 1, plus 1. 203 00:10:00,703 --> 00:10:02,620 So what I often do when writing code like this 204 00:10:02,620 --> 00:10:05,450 is, as I'm sort of finding my way, after I'm done with this, 205 00:10:05,450 --> 00:10:07,075 I'm not going to write the whole thing. 206 00:10:07,075 --> 00:10:10,070 Let me go ahead and just, at the end of this, print counts, right? 207 00:10:10,070 --> 00:10:11,570 I'm just curious to see what it is. 208 00:10:11,570 --> 00:10:13,570 I'm just going to go ahead and add print counts. 209 00:10:13,570 --> 00:10:16,060 Let me go ahead and run now Python of favorites. 210 00:10:16,060 --> 00:10:19,690 And voila, you see a huge, messy data structure. 211 00:10:19,690 --> 00:10:21,520 But let's go to the very top of it. 212 00:10:21,520 --> 00:10:25,210 Notice that it starts with curly braces, indicating it's, indeed, a dictionary. 213 00:10:25,210 --> 00:10:28,480 It starts with a quoted value, which is one of the titles in it, 214 00:10:28,480 --> 00:10:31,450 and it looks like two of you said Dynasty is your favorite show. 215 00:10:31,450 --> 00:10:33,790 25 of you said The Office is your favorite show. 216 00:10:33,790 --> 00:10:36,340 One of you said Blindspot is your favorite show. 217 00:10:36,340 --> 00:10:37,210 One of you, 24. 218 00:10:37,210 --> 00:10:39,060 Friends, 19, and so forth. 219 00:10:39,060 --> 00:10:40,810 So the data is kind of all over the place, 220 00:10:40,810 --> 00:10:43,570 because it's based on the order in which you all input it. 
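The two-branch counting logic just described, as a self-contained sketch; the list of titles here is made-up sample input standing in for the rows of the CSV.

```python
# The two scenarios from the lecture: if the title has been seen
# before, add 1 to its count; otherwise initialize its count to 1
counts = {}
for title in ["The Office", "Dynasty", "The Office", "Friends"]:
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
print(counts)
```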
221 00:10:43,570 --> 00:10:45,487 So let's just clean this up a little bit. 222 00:10:45,487 --> 00:10:47,570 And let me go ahead and print this out as follows. 223 00:10:47,570 --> 00:10:49,660 Instead of just printing a big, messy dictionary, 224 00:10:49,660 --> 00:10:52,990 why don't I go ahead and iterate over the dictionary? 225 00:10:52,990 --> 00:10:55,220 And we didn't see this necessarily last week, 226 00:10:55,220 --> 00:10:57,053 but you can actually do something like this. 227 00:10:57,053 --> 00:11:01,480 If your dictionary's storing titles as keys and counts as values, 228 00:11:01,480 --> 00:11:06,640 we can do something fancy like for title comma count in counts, 229 00:11:06,640 --> 00:11:10,090 but if you want to iterate over not just the keys but all of the items therein, 230 00:11:10,090 --> 00:11:12,250 there's this other function in Python associated 231 00:11:12,250 --> 00:11:15,370 with a dictionary that will hand you back two things at once, 232 00:11:15,370 --> 00:11:16,763 key and value together. 233 00:11:16,763 --> 00:11:18,430 And you can call them anything you want. 234 00:11:18,430 --> 00:11:21,470 I'm calling them title comma count, respectively. 235 00:11:21,470 --> 00:11:25,300 Now, let me go ahead and print out, for instance, something like the title, 236 00:11:25,300 --> 00:11:28,340 and then let me go ahead and just print out the count right next to it. 237 00:11:28,340 --> 00:11:32,540 So now, if I save this and run favorites.py, OK, it's prettier, 238 00:11:32,540 --> 00:11:35,530 but it's still a little messy in that we have-- 239 00:11:35,530 --> 00:11:38,300 the numbers are just right there after the name. 240 00:11:38,300 --> 00:11:39,890 So let's go ahead and tidy this up. 241 00:11:39,890 --> 00:11:41,890 Recall that when you print two things with print 242 00:11:41,890 --> 00:11:44,170 separating them with a comma, what character 243 00:11:44,170 --> 00:11:47,380 did you get automatically in between them? 
244 00:11:47,380 --> 00:11:48,560 Yeah, just a single space. 245 00:11:48,560 --> 00:11:50,630 We saw that at the very beginning of last week. 246 00:11:50,630 --> 00:11:53,740 But we can override that separator, otherwise known as sep, 247 00:11:53,740 --> 00:11:56,950 which has a default value of quote-unquote with a space. 248 00:11:56,950 --> 00:12:00,293 But suppose I just want to give some visual delineation here, 249 00:12:00,293 --> 00:12:03,460 maybe like a vertical bar to kind of sort of make it look like a spreadsheet 250 00:12:03,460 --> 00:12:04,000 again. 251 00:12:04,000 --> 00:12:05,590 Let me go ahead and run this. 252 00:12:05,590 --> 00:12:07,600 And now it's not super, super pretty, but you 253 00:12:07,600 --> 00:12:12,700 can see title followed by a vertical bar followed by the number 1. 254 00:12:12,700 --> 00:12:16,243 So this is all fine and good, but it'd be nice, for instance, 255 00:12:16,243 --> 00:12:18,160 if we could sort these, because I don't really 256 00:12:18,160 --> 00:12:22,000 want to sift through this visually, looking for all of the possible values. 257 00:12:22,000 --> 00:12:25,660 So you know, the easiest way to sort these things might be like this. 258 00:12:25,660 --> 00:12:28,960 Instead of passing in counts.items(), there's 259 00:12:28,960 --> 00:12:31,930 another function in Python where you can literally just sort something 260 00:12:31,930 --> 00:12:32,590 with sorted. 261 00:12:32,590 --> 00:12:34,850 You don't have to worry about bubble sort, selection sort, insertion sort, 262 00:12:34,850 --> 00:12:36,160 merge sort, anything. 263 00:12:36,160 --> 00:12:37,867 Just, you sort it, and Python 264 00:12:37,867 --> 00:12:39,700 will use one of the fastest implementations 265 00:12:39,700 --> 00:12:41,720 of sorting available to it. 266 00:12:41,720 --> 00:12:44,020 Now, if I rerun this, we'll see that-- 267 00:12:44,020 --> 00:12:45,220 [CHUCKLES] OK. 
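A sketch of the sep override together with sorted(), using a few sample counts; note that sorted() compares the (title, count) pairs element by element, so by default it orders by title, which is exactly what produces the surprise here.

```python
# Sample counts standing in for the dictionary built from the CSV
counts = {"The Office": 25, "24": 1, "Friends": 19}

# sorted() orders the pairs by their first element, the title,
# so this is alphabetical by title, not by popularity
for title, count in sorted(counts.items()):
    print(title, count, sep=" | ")
```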
268 00:12:45,220 --> 00:12:47,350 We'll see that we have all of these values 269 00:12:47,350 --> 00:12:51,170 here starting with 24, 9009, Adventure Time. 270 00:12:51,170 --> 00:12:53,420 But that's not what I meant. 271 00:12:53,420 --> 00:12:57,160 I wanted to sort not by title, but by what? 272 00:12:57,160 --> 00:12:58,250 Count. 273 00:12:58,250 --> 00:13:00,550 So this gets a little less obvious. 274 00:13:00,550 --> 00:13:04,390 And it turns out Python does this in a couple of possible ways. 275 00:13:04,390 --> 00:13:06,640 But let me go ahead and introduce this one first. 276 00:13:06,640 --> 00:13:11,170 It turns out that this sorted function takes an optional argument that 277 00:13:11,170 --> 00:13:12,910 happens to be called key. 278 00:13:12,910 --> 00:13:18,280 And key simply specifies, how do you want to sort 279 00:13:18,280 --> 00:13:20,620 the dictionaries that you're being-- 280 00:13:20,620 --> 00:13:23,680 or how do you want to sort the key value pairs that are being passed? 281 00:13:23,680 --> 00:13:26,540 By default, it literally uses the key, so title. 282 00:13:26,540 --> 00:13:29,960 But if you want to use the value, you know what you can do, is this. 283 00:13:29,960 --> 00:13:33,420 Let me go ahead, and up here, even though this is a little weird, 284 00:13:33,420 --> 00:13:35,920 and let me go ahead and say the following. 285 00:13:35,920 --> 00:13:37,960 Def function item. 286 00:13:37,960 --> 00:13:38,617 So def f. 287 00:13:38,617 --> 00:13:40,450 I'm just giving myself a pretty generic name 288 00:13:40,450 --> 00:13:43,075 for a function, because I'm going to get rid of it in a moment. 289 00:13:43,075 --> 00:13:46,780 I'm going to go ahead and return item bracket 1. 290 00:13:46,780 --> 00:13:48,080 Now, why is this? 291 00:13:48,080 --> 00:13:49,780 Take a look at this line here. 
292 00:13:49,780 --> 00:13:53,500 And even though sorted is new today, as is this counts.items(), 293 00:13:53,500 --> 00:13:58,490 all items does is return key comma value, and sorted sorts then by key, 294 00:13:58,490 --> 00:14:02,330 by default, handing them back as title comma count, or again, 295 00:14:02,330 --> 00:14:03,930 I could call these anything I want. 296 00:14:03,930 --> 00:14:07,700 But I'm calling them, more semantically usefully, title comma counts. 297 00:14:07,700 --> 00:14:09,890 So that's giving me back two things. 298 00:14:09,890 --> 00:14:13,370 Well, just like an array in C or a list in Python, 299 00:14:13,370 --> 00:14:16,850 if you're getting two things at once, the first of those things, 300 00:14:16,850 --> 00:14:19,790 conventionally, is going to be accessible in bracket zero. 301 00:14:19,790 --> 00:14:22,190 The second thing is going to be accessible in bracket 1. 302 00:14:22,190 --> 00:14:27,350 So this super, super simple function called f takes, as input, an item. 303 00:14:27,350 --> 00:14:30,050 And if that item has two values, a key and a value, 304 00:14:30,050 --> 00:14:32,270 return item one is just literally going to return 305 00:14:32,270 --> 00:14:33,710 the second thing, whatever it is. 306 00:14:33,710 --> 00:14:36,530 In this case, it's going to return, of course, the value. 307 00:14:36,530 --> 00:14:38,160 Now, why is this useful? 308 00:14:38,160 --> 00:14:43,260 Well, because the sorted function takes an additional argument called key, 309 00:14:43,260 --> 00:14:46,040 you can actually pass it the name of a function 310 00:14:46,040 --> 00:14:51,620 to use in order to determine how to sort the previous items. 311 00:14:51,620 --> 00:14:53,750 So this is an example of a really powerful feature 312 00:14:53,750 --> 00:14:57,860 of Python and other higher-level languages, in which case 313 00:14:57,860 --> 00:15:01,670 you can pass in, as arguments, the names of functions. 
314 00:15:01,670 --> 00:15:02,720 I'm not calling f. 315 00:15:02,720 --> 00:15:04,340 I'm just passing in its name. 316 00:15:04,340 --> 00:15:07,520 If I were calling it, I'd have open paren and a closed paren, 317 00:15:07,520 --> 00:15:09,620 but I'm just passing it in by name. 318 00:15:09,620 --> 00:15:13,050 And now, this is going to, by default, sort things from small to large. 319 00:15:13,050 --> 00:15:15,050 Turns out if you read the documentation, there's 320 00:15:15,050 --> 00:15:19,430 another argument to sorted called reverse that will invert it so that now 321 00:15:19,430 --> 00:15:21,780 you'll get largest to smallest. 322 00:15:21,780 --> 00:15:23,160 So now let's try this. 323 00:15:23,160 --> 00:15:25,460 Let me go ahead and rerun Python of favorites.py 324 00:15:25,460 --> 00:15:29,030 enter, and now, if I scroll up to the top, 325 00:15:29,030 --> 00:15:33,680 we should see The Office is indeed the most popular show, apparently 326 00:15:33,680 --> 00:15:35,660 consistent with Netflix's data these days, too, 327 00:15:35,660 --> 00:15:38,702 followed by Friends, Game of Thrones, Breaking Bad, Rick and Morty, Black 328 00:15:38,702 --> 00:15:39,932 Mirror, and so forth. 329 00:15:39,932 --> 00:15:42,890 And the least popular show among you, although there looks like there's 330 00:15:42,890 --> 00:15:44,735 a lot of ties, Demon Slayer. 331 00:15:44,735 --> 00:15:47,890 [CHUCKLES] Or Game of Thrones, but GOT. 332 00:15:47,890 --> 00:15:49,880 So interestingly, there are some-- 333 00:15:49,880 --> 00:15:51,420 [LAUGHTER] 334 00:15:51,420 --> 00:15:52,320 There's-- OK. 335 00:15:52,320 --> 00:15:54,930 336 00:15:54,930 --> 00:15:55,520 OK. 337 00:15:55,520 --> 00:15:57,950 So here, we've gone and sorted these values. 338 00:15:57,950 --> 00:16:00,680 Now, let me go ahead and clean this up a little bit. 
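The named-function approach just described, end to end; the counts here are sample values rather than the real survey data.

```python
# Sample counts standing in for the dictionary built from the CSV
counts = {"The Office": 25, "Friends": 19, "Demon Slayer": 1}

def f(item):
    # item is one (title, count) pair; bracket 1 is the count
    return item[1]

# Pass the function's name (no parentheses) as key, and reverse the
# default small-to-large order so the most popular title comes first
ranked = sorted(counts.items(), key=f, reverse=True)
print(ranked)
```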
339 00:16:00,680 --> 00:16:05,510 Much like in C where it's kind of silly to define a variable 340 00:16:05,510 --> 00:16:08,270 and then use it only in one place, similarly in Python, 341 00:16:08,270 --> 00:16:11,660 if you're going to define a function, but then only use it in one place, 342 00:16:11,660 --> 00:16:12,920 you don't actually need to. 343 00:16:12,920 --> 00:16:17,248 There's a fancy one-line equivalent where you can actually say this. 344 00:16:17,248 --> 00:16:19,790 And you won't see this too often, but it's a handy thing just 345 00:16:19,790 --> 00:16:21,170 to have seen at least once. 346 00:16:21,170 --> 00:16:25,850 You can define a function called a lambda function that takes in as input 347 00:16:25,850 --> 00:16:30,240 any argument like this, and then you can literally just return that value. 348 00:16:30,240 --> 00:16:34,175 So if I go ahead and get rid of this function f and replace it with this, 349 00:16:34,175 --> 00:16:37,730 the syntax is super weird-looking at first glance, 350 00:16:37,730 --> 00:16:40,520 but lambda is the term of art in computer 351 00:16:40,520 --> 00:16:42,037 science for an anonymous function. 352 00:16:42,037 --> 00:16:43,370 I don't care that it's called f. 353 00:16:43,370 --> 00:16:44,720 I'm never going to use it again. 354 00:16:44,720 --> 00:16:47,360 So lambda just means give me a function. 355 00:16:47,360 --> 00:16:51,080 Item means that is the input to this function, the argument, and the return 356 00:16:51,080 --> 00:16:54,080 value, or output, is going to be item bracket 1. 357 00:16:54,080 --> 00:16:55,580 So it's a little arcane. 358 00:16:55,580 --> 00:16:58,935 Odds are you'll have to double check the documentation for using this yourself 359 00:16:58,935 --> 00:17:00,560 in the future, but it's the same thing. 
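As a sketch, the lambda replaces the named function in one line (same sample counts as before):

```python
counts = {"The Office": 25, "Friends": 19, "Demon Slayer": 1}

# lambda item: item[1] is an anonymous function equivalent to the named f
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```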
360 00:17:00,560 --> 00:17:02,720 And indeed, if I rerun the program, I'm going 361 00:17:02,720 --> 00:17:08,030 to get back the exact same results, this time with the items still sorted 362 00:17:08,030 --> 00:17:09,589 like that. 363 00:17:09,589 --> 00:17:14,569 All right, any questions on the code here, the features of the syntax? 364 00:17:14,569 --> 00:17:16,160 Anything at all? 365 00:17:16,160 --> 00:17:17,261 Yeah. 366 00:17:17,261 --> 00:17:20,303 AUDIENCE: [INAUDIBLE] 367 00:17:20,303 --> 00:17:22,970 DAVID MALAN: Yes, if you're writing a lambda function like this, 368 00:17:22,970 --> 00:17:24,770 you are indeed limited to one line. 369 00:17:24,770 --> 00:17:27,187 That was OK, because f, even though it was two lines, only 370 00:17:27,187 --> 00:17:28,550 had one line of actual content. 371 00:17:28,550 --> 00:17:31,900 But yes, it has to be one line, indeed. 372 00:17:31,900 --> 00:17:32,400 All right. 373 00:17:32,400 --> 00:17:36,360 So the GOT thing opens up some interesting messiness, right? 374 00:17:36,360 --> 00:17:39,990 Because GOT presumably denoted Game of Thrones, and yet most of you 375 00:17:39,990 --> 00:17:42,235 wrote Game of Thrones, capitalized properly, 376 00:17:42,235 --> 00:17:44,610 and yet, it'd be nice if we could have some equivalences. 377 00:17:44,610 --> 00:17:48,300 And honestly, I'm guessing there's a bunch of capitalization differences, 378 00:17:48,300 --> 00:17:48,960 too. 379 00:17:48,960 --> 00:17:50,970 So you know what I might be inclined to do? 380 00:17:50,970 --> 00:17:54,390 What might be the easiest way for us to standardize 381 00:17:54,390 --> 00:17:58,710 what we're calling these shows' titles that would be a little more 382 00:17:58,710 --> 00:18:01,270 resilient against some of you typing with capitals, 383 00:18:01,270 --> 00:18:03,300 some of you typing with lower case? 384 00:18:03,300 --> 00:18:06,130 Let's ignore the GOT problem, though. 
385 00:18:06,130 --> 00:18:09,022 Yeah, so make them all lowercase, or heck, make them all upper case. 386 00:18:09,022 --> 00:18:10,980 It doesn't matter so long as you're consistent. 387 00:18:10,980 --> 00:18:11,688 So you know what? 388 00:18:11,688 --> 00:18:13,740 Before I get title from the row, let me go ahead 389 00:18:13,740 --> 00:18:18,450 and force it all to lower case, and then let me go ahead and-- 390 00:18:18,450 --> 00:18:24,160 in my-- go ahead and force it to lower case there. 391 00:18:24,160 --> 00:18:26,070 So no matter what the input is, I'm going 392 00:18:26,070 --> 00:18:29,790 to go ahead and force it on entry to lower case. 393 00:18:29,790 --> 00:18:33,490 Now, if I do this and scroll back up, looks like our counts went up a little, 394 00:18:33,490 --> 00:18:33,990 right? 395 00:18:33,990 --> 00:18:35,580 The Office is now at 27. 396 00:18:35,580 --> 00:18:37,080 Game of Thrones is now at 21. 397 00:18:37,080 --> 00:18:38,290 Friends is at 20. 398 00:18:38,290 --> 00:18:41,220 So it looks like some of you, kind of, reasonably if sloppily, 399 00:18:41,220 --> 00:18:44,272 didn't capitalize it exactly as the show officially is called. 400 00:18:44,272 --> 00:18:45,480 But that should be OK, right? 401 00:18:45,480 --> 00:18:47,700 This is all about UX or user experience. 402 00:18:47,700 --> 00:18:51,930 This should be one stepping stone toward making your program much, much more 403 00:18:51,930 --> 00:18:54,780 resilient against user input. 404 00:18:54,780 --> 00:18:55,530 But you know what? 405 00:18:55,530 --> 00:18:57,240 What if there were a better way to do this, right? 406 00:18:57,240 --> 00:18:59,490 Instead of writing code, this feels pretty tedious. 407 00:18:59,490 --> 00:19:01,948 Could we not just use a different tool, different language, 408 00:19:01,948 --> 00:19:03,160 and achieve these same goals? 409 00:19:03,160 --> 00:19:05,395 Today's answer is indeed going to be yes, with SQL. 
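The lowercasing-on-entry idea described above, as a sketch; the raw input list is hypothetical, chosen to show capitalization variants collapsing into one key.

```python
# Hypothetical raw input with inconsistent capitalization
counts = {}
for raw in ["The Office", "the office", "THE OFFICE", "Friends"]:
    title = raw.lower()  # force to lowercase on entry
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
print(counts)
```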
410 00:19:05,395 --> 00:19:06,270 But first a question. 411 00:19:06,270 --> 00:19:06,810 AUDIENCE: [INAUDIBLE] 412 00:19:06,810 --> 00:19:07,160 DAVID MALAN: No? 413 00:19:07,160 --> 00:19:07,920 No question. 414 00:19:07,920 --> 00:19:11,520 All right, so then the segue is exactly this. 415 00:19:11,520 --> 00:19:14,940 I'm going to go ahead now and close favorites.py. 416 00:19:14,940 --> 00:19:18,010 And I'm just going to maximize the size of my terminal window here. 417 00:19:18,010 --> 00:19:21,840 And I'm going to go ahead and run a program called SQLite3. 418 00:19:21,840 --> 00:19:25,590 SQLite3 is a command line program that you can install on your Mac or PC-- 419 00:19:25,590 --> 00:19:27,240 it's already on CS50 IDE-- 420 00:19:27,240 --> 00:19:31,470 that allows you to use another language called SQL interactively. 421 00:19:31,470 --> 00:19:34,170 And it's got this really cool feature whereby 422 00:19:34,170 --> 00:19:40,230 if I have this file called, very verbosely, CS50 2019 Lecture 7, 423 00:19:40,230 --> 00:19:43,020 and so forth, I'm going to copy that after highlighting it. 424 00:19:43,020 --> 00:19:47,730 I'm going to run SQLite3, and I'm going to put SQLite3 into .mode CSV. 425 00:19:47,730 --> 00:19:51,180 It's a weird syntax, .mode CSV, but you don't have to use these commands very 426 00:19:51,180 --> 00:19:51,840 often. 427 00:19:51,840 --> 00:19:54,840 Then I'm going to use one other command here, which is called import. 428 00:19:54,840 --> 00:19:58,450 And I'm going to go ahead and import this into a database. 429 00:19:58,450 --> 00:20:00,175 But what database, exactly? 430 00:20:00,175 --> 00:20:03,050 Well, let me go ahead and more specifically-- let me exit out of this 431 00:20:03,050 --> 00:20:03,550 here. 432 00:20:03,550 --> 00:20:04,170 Whoops. 433 00:20:04,170 --> 00:20:08,850 Let me exit out of this here and run SQLite3 with the name of a database.
434 00:20:08,850 --> 00:20:10,830 I'm dealing with favorite shows, so maybe I'll 435 00:20:10,830 --> 00:20:14,040 call this favorites.db, DB denoting database. 436 00:20:14,040 --> 00:20:17,730 I'm going to now go ahead and do .mode CSV. 437 00:20:17,730 --> 00:20:20,550 I'm going to now do .import, quoting the name of the file, 438 00:20:20,550 --> 00:20:23,550 and quotes are important, because it's got some ugly spaces in it, 439 00:20:23,550 --> 00:20:26,940 and I'm going to import this into a table, kind of like a spreadsheet, 440 00:20:26,940 --> 00:20:29,700 that I'm just going to call in all lowercase favorites. 441 00:20:29,700 --> 00:20:30,750 And hit Enter. 442 00:20:30,750 --> 00:20:32,410 Nothing seems to happen now. 443 00:20:32,410 --> 00:20:35,650 But if I go back to my terminal window and type ls, 444 00:20:35,650 --> 00:20:39,780 notice that in addition to my CSV file and favorites.py and src7, 445 00:20:39,780 --> 00:20:41,970 which I downloaded in advance with today's code, 446 00:20:41,970 --> 00:20:44,820 I now have this file called favorites.db. 447 00:20:44,820 --> 00:20:50,140 And if I rerun SQLite, it turns out when I pass in the name of that database, 448 00:20:50,140 --> 00:20:52,920 now I can ask the same kinds of questions as before, 449 00:20:52,920 --> 00:20:55,590 but I can do it with a different syntax. 450 00:20:55,590 --> 00:20:59,670 I can go ahead and select all the titles from my favorites 451 00:20:59,670 --> 00:21:04,350 by typing this command, SELECT title FROM favorites; Enter, and voila, 452 00:21:04,350 --> 00:21:06,010 we get all of the titles there. 453 00:21:06,010 --> 00:21:08,250 Suppose I want to sort those titles as before, 454 00:21:08,250 --> 00:21:14,230 or I can SELECT title FROM favorites, ORDER BY title; ending with-- 455 00:21:14,230 --> 00:21:14,730 oh, sorry. 456 00:21:14,730 --> 00:21:15,870 Semicolons are back. 457 00:21:15,870 --> 00:21:18,300 Then I'm going to go ahead and hit here. 
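For anyone following along outside the sqlite3 prompt, the same import-and-select workflow can be approximated with Python's built-in sqlite3 module; the handful of rows below are invented stand-ins for the survey's actual responses:

```python
# Sketch: load some title rows into a SQLite table and SELECT them back,
# mirroring the sqlite3 commands shown above. The sample titles are invented
# stand-ins for the survey's real responses.
import sqlite3

connection = sqlite3.connect(":memory:")  # a throwaway favorites.db

# Roughly what .import did automatically: create a table, insert each row.
connection.execute("CREATE TABLE favorites (title TEXT)")
connection.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("The Office",), ("Friends",), ("Game of Thrones",), ("Friends",)],
)

# SELECT title FROM favorites ORDER BY title;
rows = connection.execute("SELECT title FROM favorites ORDER BY title").fetchall()
for (title,) in rows:
    print(title)
```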
458 00:21:18,300 --> 00:21:20,730 And you'll see that now it's alphabetized, 459 00:21:20,730 --> 00:21:23,318 and there's all those Offices by first letter. 460 00:21:23,318 --> 00:21:24,360 Now, there are some bugs. 461 00:21:24,360 --> 00:21:26,500 If you scroll down, you'll see lowercase letters as well, 462 00:21:26,500 --> 00:21:28,620 so we'd have to deal with that situation, too. 463 00:21:28,620 --> 00:21:32,820 But this seems to be a user-friendly way of just selecting data 464 00:21:32,820 --> 00:21:33,570 that I care about. 465 00:21:33,570 --> 00:21:34,240 And watch this. 466 00:21:34,240 --> 00:21:36,010 Let me try to group things together. 467 00:21:36,010 --> 00:21:38,430 I'm going to go ahead and select title, and I'm 468 00:21:38,430 --> 00:21:41,880 going to go ahead and say count the number of titles 469 00:21:41,880 --> 00:21:45,000 from this table called favorites. 470 00:21:45,000 --> 00:21:48,660 But I want to group those things by title. 471 00:21:48,660 --> 00:21:51,820 Now, we'll see in a moment what this means, but the effect is what's cool. 472 00:21:51,820 --> 00:21:56,370 Voila, I now have output much like my Python program. 473 00:21:56,370 --> 00:22:00,028 It's not really as orderly as I would like, but notice 474 00:22:00,028 --> 00:22:01,320 what it's done at the very top. 475 00:22:01,320 --> 00:22:06,330 The very first line of output, if I keep scrolling is this here. 476 00:22:06,330 --> 00:22:08,670 On the left is my title, then a vertical bar. 477 00:22:08,670 --> 00:22:13,120 On the right is literally count of title, how many titles match that. 478 00:22:13,120 --> 00:22:14,800 So let me go ahead and do this. 479 00:22:14,800 --> 00:22:18,240 Let me go ahead and first of all limit this to the top 10 titles, 480 00:22:18,240 --> 00:22:20,910 so I can just deal with this and see all this at once. 481 00:22:20,910 --> 00:22:23,460 So here we have some alphabetized list of titles. 
482 00:22:23,460 --> 00:22:27,490 But now, suppose I want to go ahead and rename this thing. 483 00:22:27,490 --> 00:22:28,930 Let me go ahead and say that. 484 00:22:28,930 --> 00:22:30,510 Let me go ahead and say, as-- 485 00:22:30,510 --> 00:22:33,450 COUNT AS n. 486 00:22:33,450 --> 00:22:37,290 Now notice that the title of this output is title, and then n. 487 00:22:37,290 --> 00:22:41,650 But the reason I did that was so that I can do this. 488 00:22:41,650 --> 00:22:47,010 I'm going to now do ORDER BY n, descending order, 489 00:22:47,010 --> 00:22:49,370 and then limit to the top 10. 490 00:22:49,370 --> 00:22:51,120 Enter, and voila. 491 00:22:51,120 --> 00:22:53,460 Now, using a completely different language, 492 00:22:53,460 --> 00:22:56,460 I think I've gotten the same results, except for the lower casing issue, 493 00:22:56,460 --> 00:22:58,085 which we can come back to another time. 494 00:22:58,085 --> 00:23:00,210 But The Office is here as the number one hit. 495 00:23:00,210 --> 00:23:01,350 Friends then, 19. 496 00:23:01,350 --> 00:23:03,820 Game of Thrones, 18, and so forth. 497 00:23:03,820 --> 00:23:06,840 So what's then the motivation in part for this new language called SQL? 498 00:23:06,840 --> 00:23:08,550 And you've seen just snippets of it here. 499 00:23:08,550 --> 00:23:10,850 It allows us to solve the same kind of problems 500 00:23:10,850 --> 00:23:13,350 much more easily, because I don't have to sit down and write 501 00:23:13,350 --> 00:23:16,590 a whole darn Python program, faster though it is than 502 00:23:16,590 --> 00:23:20,933 C. I don't have to write custom code just to answer questions about my data.
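Put together, the full query built up above, grouping by title, counting, aliasing the count as n, ordering by n descending, and limiting to 10, can be tried out like so (again with placeholder rows rather than the real survey data):

```python
# The full query built up above, run via Python's sqlite3 module on a few
# placeholder rows (the lecture's real data came from the Google Form).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("The Office",)] * 3 + [("Friends",)] * 2 + [("Veep",)],
)

# SELECT title, COUNT(title) AS n FROM favorites
# GROUP BY title ORDER BY n DESC LIMIT 10;
results = db.execute(
    "SELECT title, COUNT(title) AS n FROM favorites "
    "GROUP BY title ORDER BY n DESC LIMIT 10"
).fetchall()
print(results)  # [('The Office', 3), ('Friends', 2), ('Veep', 1)]
```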
503 00:23:20,933 --> 00:23:23,100 And in fact, if you pursue the web track in a couple 504 00:23:23,100 --> 00:23:25,710 of weeks' time or the mobile track, both of those 505 00:23:25,710 --> 00:23:29,130 are going to offer support for SQL so that if you want to store data 506 00:23:29,130 --> 00:23:32,880 on users or interactive information on an app or a website application, 507 00:23:32,880 --> 00:23:37,050 SQL is going to be ultimately where you can put that data and access it later, 508 00:23:37,050 --> 00:23:40,507 thereby writing even less code than today. 509 00:23:40,507 --> 00:23:43,590 So let's go ahead and take a look at some of the features of these things. 510 00:23:43,590 --> 00:23:45,507 What we've really just done is introduce this, 511 00:23:45,507 --> 00:23:47,580 something called a relational database. 512 00:23:47,580 --> 00:23:50,940 The goal at hand is not to rely on CSV files anymore. 513 00:23:50,940 --> 00:23:54,420 CSVs are like the weakest form of databases you can use. 514 00:23:54,420 --> 00:23:54,960 Why? 515 00:23:54,960 --> 00:23:59,162 Well, every time you want to look in a CSV file for data, you have to open it. 516 00:23:59,162 --> 00:24:00,120 You have to read it in. 517 00:24:00,120 --> 00:24:02,590 You have to iterate over it, line by line by line, 518 00:24:02,590 --> 00:24:05,160 looking for the data you care about, and then you're done. 519 00:24:05,160 --> 00:24:08,942 That's going to be super slow when you have a lot of data in a CSV file. 520 00:24:08,942 --> 00:24:10,650 And in fact, in a little bit today, we'll 521 00:24:10,650 --> 00:24:14,190 download a really big database in text form 522 00:24:14,190 --> 00:24:19,230 that we're going to then see 500-plus megabytes later is much better handled 523 00:24:19,230 --> 00:24:21,420 by what are called relational databases, pieces 524 00:24:21,420 --> 00:24:24,990 of software you can run on your Mac or PC or servers in a company. 
525 00:24:24,990 --> 00:24:28,950 All they do is store data and provide you with faster access to it. 526 00:24:28,950 --> 00:24:32,010 So SQLite3 is just a command line program 527 00:24:32,010 --> 00:24:34,012 via which we can interact with that data. 528 00:24:34,012 --> 00:24:35,220 It's not terribly compelling. 529 00:24:35,220 --> 00:24:37,200 It's, again, a black and white blinking prompt, 530 00:24:37,200 --> 00:24:40,050 but we'll see a graphical user interface, or GUI, in a bit, 531 00:24:40,050 --> 00:24:43,200 too, that'll allow us to navigate that same data. 532 00:24:43,200 --> 00:24:45,570 Here's how we went ahead and imported the data. 533 00:24:45,570 --> 00:24:47,550 And .schema is the last of the commands. 534 00:24:47,550 --> 00:24:49,870 In fact, let me go ahead and type this at the prompt. 535 00:24:49,870 --> 00:24:52,440 Not .mode or .import but .schema. 536 00:24:52,440 --> 00:24:55,860 And you'll see what automatically happened for me is something 537 00:24:55,860 --> 00:24:58,590 like this, CREATE TABLE favorites. 538 00:24:58,590 --> 00:25:02,010 This all happened automatically when I imported my data, but in a bit, 539 00:25:02,010 --> 00:25:04,740 we'll be doing this more manually ourselves. 540 00:25:04,740 --> 00:25:08,580 So where does this then bring us in terms of features? 541 00:25:08,580 --> 00:25:11,280 Well, in the world of storing data, whether it's 542 00:25:11,280 --> 00:25:13,650 storing users who've registered for your websites 543 00:25:13,650 --> 00:25:16,350 or the number of likes a post on Instagram has gotten 544 00:25:16,350 --> 00:25:19,950 or any form of data that you might want to read or write, 545 00:25:19,950 --> 00:25:22,410 there are four fundamental operations in the world. 546 00:25:22,410 --> 00:25:25,180 CRUD is the sort of crude way of remembering this. 547 00:25:25,180 --> 00:25:28,260 These stand for Create, Read, Update, Delete. 
548 00:25:28,260 --> 00:25:30,870 And I dare say that with these four operations 549 00:25:30,870 --> 00:25:34,030 you can do anything you want when it comes to data. 550 00:25:34,030 --> 00:25:38,010 Creating that data, reading that data, updating it, or deleting it. 551 00:25:38,010 --> 00:25:42,450 Now, in the world of SQL, or Structured Query Language, or S-Q-L, 552 00:25:42,450 --> 00:25:44,490 this is just another programming language. 553 00:25:44,490 --> 00:25:47,460 It tends to be used only in the context of databases, 554 00:25:47,460 --> 00:25:51,810 but you can use it to solve problems not unlike Python, but in the context 555 00:25:51,810 --> 00:25:54,090 of the data you care about. 556 00:25:54,090 --> 00:25:56,370 So with SQL, these four CRUD operations actually 557 00:25:56,370 --> 00:25:58,530 have different keywords mapped to them. 558 00:25:58,530 --> 00:26:01,650 They happen to be these, INSERT, SELECT, UPDATE, DELETE. 559 00:26:01,650 --> 00:26:06,360 That is to say, unlike Python, which has hundreds of functions available to you, 560 00:26:06,360 --> 00:26:11,940 today is kind of neat insofar as SQL has maybe a dozen or so total functions 561 00:26:11,940 --> 00:26:14,700 that you can use, a few dozen functions total that you can use. 562 00:26:14,700 --> 00:26:18,790 Much more manageable, and these are the four that we'll use most frequently. 563 00:26:18,790 --> 00:26:21,330 So beyond this, how do you go about creating a table? 564 00:26:21,330 --> 00:26:22,420 Well, what does that mean? 565 00:26:22,420 --> 00:26:24,253 Well, a relational database is literally that. 566 00:26:24,253 --> 00:26:27,660 It's like a fancier version of Google Spreadsheets or Microsoft Excel 567 00:26:27,660 --> 00:26:30,120 or Apple Numbers that allows you to store all of your data 568 00:26:30,120 --> 00:26:31,500 in rows and columns.
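One small, self-contained example of each of those four keywords, run against a scratch table whose name and columns are invented purely for illustration:

```python
# One example of each CRUD keyword -- INSERT, SELECT, UPDATE, DELETE --
# against a scratch table; the table and its values are invented purely
# for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genre TEXT)")

# Create: INSERT a row.
db.execute("INSERT INTO shows (title, genre) VALUES (?, ?)", ("Veep", "Comedy"))

# Read: SELECT it back.
row = db.execute("SELECT title, genre FROM shows").fetchone()

# Update: change one column where a condition matches.
db.execute("UPDATE shows SET genre = ? WHERE title = ?", ("Satire", "Veep"))

# Delete: remove matching rows.
db.execute("DELETE FROM shows WHERE title = ?", ("Veep",))
remaining = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
```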
569 00:26:31,500 --> 00:26:35,790 But it gives you these four functions and more 570 00:26:35,790 --> 00:26:40,140 via which to select data you care about, or delete or update or insert data 571 00:26:40,140 --> 00:26:41,820 that you care about as well. 572 00:26:41,820 --> 00:26:45,390 So we'll see in a bit that this is going to be the syntax via which in SQL you 573 00:26:45,390 --> 00:26:46,820 can create a new table. 574 00:26:46,820 --> 00:26:49,500 In Google Spreadsheets you would literally go to the plus icon 575 00:26:49,500 --> 00:26:50,640 and say New Sheet. 576 00:26:50,640 --> 00:26:54,000 In SQL, you would express this more programmatically with a line of code 577 00:26:54,000 --> 00:26:54,795 like this. 578 00:26:54,795 --> 00:26:57,420 But you're going to have to make a few decisions along the way, 579 00:26:57,420 --> 00:27:00,280 and that's because SQL has its own data types. 580 00:27:00,280 --> 00:27:03,300 Now, fortunately, they're pretty reminiscent of C and Python. 581 00:27:03,300 --> 00:27:06,302 But there's five main ones here. 582 00:27:06,302 --> 00:27:08,260 Let's just walk through a few examples thereof. 583 00:27:08,260 --> 00:27:09,390 So first, INTEGER. 584 00:27:09,390 --> 00:27:12,930 If you know in advance that you want to store a whole bunch of information 585 00:27:12,930 --> 00:27:19,440 in a database like someone's age or the number of likes a post has 586 00:27:19,440 --> 00:27:21,660 or anything that lends itself to an integer, 587 00:27:21,660 --> 00:27:24,540 SQL supports an integer data type. 588 00:27:24,540 --> 00:27:28,120 But in some versions of SQL, you actually have a decision to make. 589 00:27:28,120 --> 00:27:31,470 Do you want an integer, a smallint, or a bigint? 590 00:27:31,470 --> 00:27:34,710 And these speak to, as in C, the size, the number 591 00:27:34,710 --> 00:27:36,180 of bits that are used to store it. 592 00:27:36,180 --> 00:27:37,530 An integer is 32 bits.
593 00:27:37,530 --> 00:27:39,090 A bigint is 64 bits. 594 00:27:39,090 --> 00:27:41,690 A smallint is fewer than both those. 595 00:27:41,690 --> 00:27:44,720 Now, we're using, for class, something called SQLite. 596 00:27:44,720 --> 00:27:47,363 It's a free and open source, very user friendly version of SQL. 597 00:27:47,363 --> 00:27:49,280 But there's others you might have heard about. 598 00:27:49,280 --> 00:27:51,230 So if you've ever heard of the company Oracle, 599 00:27:51,230 --> 00:27:53,540 their bread and butter is a database program 600 00:27:53,540 --> 00:27:56,210 called Oracle that is a SQL database that 601 00:27:56,210 --> 00:27:58,730 allows companies to store huge amounts of data 602 00:27:58,730 --> 00:28:01,490 and select and create and update and delete data 603 00:28:01,490 --> 00:28:03,180 using software they've written. 604 00:28:03,180 --> 00:28:07,940 MySQL, Postgres, MariaDB, Microsoft Access, 605 00:28:07,940 --> 00:28:12,050 SQL Server, and bunches of others, are all examples of SQL servers. 606 00:28:12,050 --> 00:28:15,162 And they'll each support data types, typically, like this. 607 00:28:15,162 --> 00:28:17,120 We happen to be showing you ones from Postgres, 608 00:28:17,120 --> 00:28:20,390 which is a super popular open source one that you can use on the internet, 609 00:28:20,390 --> 00:28:22,010 for instance for final projects. 610 00:28:22,010 --> 00:28:25,657 Real numbers, just like in Python or C, are similar in spirit to floats. 611 00:28:25,657 --> 00:28:27,240 However, you have a couple of options. 612 00:28:27,240 --> 00:28:29,330 One is a real, which uses 32 bits. 613 00:28:29,330 --> 00:28:31,040 Another is called double precision. 614 00:28:31,040 --> 00:28:34,400 It's actually two words, but it gives you 64 bits as well, 615 00:28:34,400 --> 00:28:36,770 to give you more precision numerically.
616 00:28:36,770 --> 00:28:39,950 Then you've got this catch-all known as numeric in SQL, 617 00:28:39,950 --> 00:28:43,490 whereby if you have a number that's number-like, essentially, 618 00:28:43,490 --> 00:28:46,460 but isn't just an integer-- 619 00:28:46,460 --> 00:28:50,270 this category encompasses things like Boolean values, dates, 620 00:28:50,270 --> 00:28:53,240 which have a predefined format like YYYY, 621 00:28:53,240 --> 00:28:56,390 four digits that typically represent a year. 622 00:28:56,390 --> 00:28:59,660 That would be considered numeric here, as would be a date and a date time, 623 00:28:59,660 --> 00:29:04,950 which has something like 00:00:00 for hours, minutes, seconds. 624 00:29:04,950 --> 00:29:06,780 And then you have other values as well. 625 00:29:06,780 --> 00:29:09,590 What's nice about this numeric type specifically 626 00:29:09,590 --> 00:29:14,570 is you can actually solve the floating point problem in both Python 627 00:29:14,570 --> 00:29:18,410 and C. Recall that floats or even doubles in C 628 00:29:18,410 --> 00:29:22,370 ultimately have finite precision, which is bad in terms of values 629 00:29:22,370 --> 00:29:24,140 not quite adding up as you intend. 630 00:29:24,140 --> 00:29:26,300 But with numeric data types in databases, 631 00:29:26,300 --> 00:29:30,380 you can specify exactly how many digits you want before the decimal point 632 00:29:30,380 --> 00:29:31,830 and after, maximally. 633 00:29:31,830 --> 00:29:35,060 So if you're dealing with financial information or scientific information, 634 00:29:35,060 --> 00:29:39,020 you can be super, super precise thanks to this data type in the database. 635 00:29:39,020 --> 00:29:43,160 So finally, a problem that we've solved, versus those other languages. 636 00:29:43,160 --> 00:29:43,910 Text.
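The precision point above can be demonstrated in Python, using the standard decimal module as a stand-in for what a SQL numeric column gives you; note this is an analogy, since SQLite itself doesn't enforce numeric precision the way, say, Postgres does:

```python
# Why a numeric type matters: binary floats can't represent 0.1 exactly,
# so sums drift; a fixed-precision decimal type (the spirit of SQL's
# numeric) stays exact. Python's decimal module stands in for the
# database column here.
from decimal import Decimal

as_floats = 0.1 + 0.2                          # binary floating point
as_decimals = Decimal("0.1") + Decimal("0.2")  # exact decimal arithmetic

print(as_floats)    # 0.30000000000000004
print(as_decimals)  # 0.3
```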
637 00:29:43,910 --> 00:29:46,050 When it comes to storing data in a database, 638 00:29:46,050 --> 00:29:48,470 you can specify that your columns can either 639 00:29:48,470 --> 00:29:52,220 be char(n), as in individual characters or chars. 640 00:29:52,220 --> 00:29:55,730 You literally specify n, the number of characters 641 00:29:55,730 --> 00:29:58,370 that every cell in that column will be. 642 00:29:58,370 --> 00:29:59,823 You never do this in Excel. 643 00:29:59,823 --> 00:30:02,240 You never do this in Google Spreadsheets or Apple Numbers. 644 00:30:02,240 --> 00:30:06,040 In a database, though, you can tell the database ever more precisely, 645 00:30:06,040 --> 00:30:10,790 I want to store two characters in every cell in some column. 646 00:30:10,790 --> 00:30:11,660 Why? 647 00:30:11,660 --> 00:30:13,310 Well, maybe like US state codes. 648 00:30:13,310 --> 00:30:16,640 If you have MA for Massachusetts or CA for California, 649 00:30:16,640 --> 00:30:20,210 it might be nice to tell the database that you can store, minimally 650 00:30:20,210 --> 00:30:23,420 and maximally, two characters for every value I'm going to give you. 651 00:30:23,420 --> 00:30:24,723 Varchar is a little different. 652 00:30:24,723 --> 00:30:26,390 This is a variable number of characters. 653 00:30:26,390 --> 00:30:30,290 If your column might have a few characters or a lot of characters based 654 00:30:30,290 --> 00:30:33,470 on what some human types in, you can give an upper bound 655 00:30:33,470 --> 00:30:36,710 n on how many characters will be stored in every cell.
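A sketch of such a table definition, using char(2) for state codes and an assumed upper bound of 64 for email addresses; it's run through SQLite here for convenience, which accepts these type names but, unlike Postgres, doesn't actually enforce the lengths:

```python
# A CREATE TABLE sketch using char(n) and varchar(n) as described above.
# It's run through SQLite for convenience; SQLite accepts these type names
# but, unlike Postgres, does not enforce the length limits.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE users (
        state CHAR(2),      -- exactly two characters, e.g. 'MA' or 'CA'
        email VARCHAR(64)   -- at most 64 characters, an assumed upper bound
    )
    """
)
db.execute(
    "INSERT INTO users (state, email) VALUES (?, ?)",
    ("MA", "malan@harvard.edu"),
)
row = db.execute("SELECT state, email FROM users").fetchone()
```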
656 00:30:36,710 --> 00:30:40,490 So if you don't know in advance how long a user's email address is going to be 657 00:30:40,490 --> 00:30:43,100 or how long my name or your name is going to be in advance, 658 00:30:43,100 --> 00:30:47,370 you can specify varchar and then pick some upper bound, some value that, ugh, 659 00:30:47,370 --> 00:30:50,600 no one's going to have more than 20 characters in their name or 200 660 00:30:50,600 --> 00:30:55,160 characters in their name, whatever line you decide to draw in the sand. 661 00:30:55,160 --> 00:30:57,710 As an aside, does anyone want to conjecture 662 00:30:57,710 --> 00:31:00,940 what a good upper bound is for the number of characters 663 00:31:00,940 --> 00:31:04,085 in a human name that might register for a website you're making? 664 00:31:04,085 --> 00:31:05,480 AUDIENCE: [INAUDIBLE] 665 00:31:05,480 --> 00:31:06,390 DAVID MALAN: 25? 666 00:31:06,390 --> 00:31:06,890 Pretty good. 667 00:31:06,890 --> 00:31:07,990 AUDIENCE: [INAUDIBLE] 668 00:31:07,990 --> 00:31:09,032 DAVID MALAN: 30, I heard. 669 00:31:09,032 --> 00:31:09,980 AUDIENCE: [INAUDIBLE] 670 00:31:09,980 --> 00:31:11,615 DAVID MALAN: 32, OK. 671 00:31:11,615 --> 00:31:12,490 AUDIENCE: [INAUDIBLE] 672 00:31:12,490 --> 00:31:14,330 DAVID MALAN: 45? 673 00:31:14,330 --> 00:31:15,740 Higher. 674 00:31:15,740 --> 00:31:16,720 60? 675 00:31:16,720 --> 00:31:17,420 AUDIENCE: 64. 676 00:31:17,420 --> 00:31:18,170 DAVID MALAN: 64? 677 00:31:18,170 --> 00:31:20,790 This is the kind of thing that Google, I dare say, is good at. 678 00:31:20,790 --> 00:31:23,570 So let me go ahead and open up Google real fast. 679 00:31:23,570 --> 00:31:26,570 Suppose the goal at hand is to store, in a database 680 00:31:26,570 --> 00:31:32,060 table, that is, the equivalent of a spreadsheet's column, the longest 681 00:31:32,060 --> 00:31:34,811 name in the world. 682 00:31:34,811 --> 00:31:35,950 All right.
683 00:31:35,950 --> 00:31:43,180 Looks like we had 988 characters in a fellow's name, Hubert. 684 00:31:43,180 --> 00:31:46,430 And I won't even try to pronounce his last name, but there it is. 685 00:31:46,430 --> 00:31:48,470 So if you want to fit-- 686 00:31:48,470 --> 00:31:50,795 if you want to be able to fit everyone's name-- 687 00:31:50,795 --> 00:31:52,900 oh, and note, senior. 688 00:31:52,900 --> 00:31:54,760 [LAUGHTER] 689 00:31:54,760 --> 00:31:56,870 You might need even more characters than that. 690 00:31:56,870 --> 00:31:59,950 So this is a rabbit hole we won't really go down today, 691 00:31:59,950 --> 00:32:02,530 but suffice it to say that the answers to these questions 692 00:32:02,530 --> 00:32:03,550 are often not obvious. 693 00:32:03,550 --> 00:32:06,340 And if you think about websites you've visited in the real world, 694 00:32:06,340 --> 00:32:10,930 whether on campus or off, odds are, you have tried typing in some input 695 00:32:10,930 --> 00:32:14,660 to some web-based form or even some mobile application, 696 00:32:14,660 --> 00:32:17,200 where all of a sudden your keyboard stops working, right? 697 00:32:17,200 --> 00:32:19,000 They won't tolerate any more characters. 698 00:32:19,000 --> 00:32:21,490 Maybe it's the Common Application and the length of your essay. 699 00:32:21,490 --> 00:32:23,680 Maybe it's the length that your resume can be in a field. 700 00:32:23,680 --> 00:32:26,440 Maybe it's the length of your name when registering for a website. 701 00:32:26,440 --> 00:32:27,670 Well, why is that? 702 00:32:27,670 --> 00:32:32,860 Those applications are probably using a database, probably using SQL, 703 00:32:32,860 --> 00:32:37,210 and they had to decide, in advance, what is the maximum length of an input 704 00:32:37,210 --> 00:32:39,190 that we're going to tolerate from a human? 705 00:32:39,190 --> 00:32:42,220 And then the computer ultimately enforces that. 706 00:32:42,220 --> 00:32:42,820 Why?
707 00:32:42,820 --> 00:32:46,900 Well, that would be a very clever attack, for some bad actor out there 708 00:32:46,900 --> 00:32:48,310 to exploit. 709 00:32:48,310 --> 00:32:50,998 If you had no bounds on how big the input could 710 00:32:50,998 --> 00:32:53,290 be that a human could type in, they could-- much like I 711 00:32:53,290 --> 00:32:55,123 tried the other day with Emma's name-- 712 00:32:55,123 --> 00:32:57,760 just keep pasting, pasting, pasting a massive input, 713 00:32:57,760 --> 00:33:01,750 hoping to overwhelm your computer's memory, maybe make your server crash, 714 00:33:01,750 --> 00:33:04,370 and therefore deny service to other people. 715 00:33:04,370 --> 00:33:06,670 So there are reasons for these kinds of defenses. 716 00:33:06,670 --> 00:33:08,140 Lastly is BLOB. 717 00:33:08,140 --> 00:33:10,153 At the very top, Binary Large Object. 718 00:33:10,153 --> 00:33:12,070 Don't need it typically that often, but if you 719 00:33:12,070 --> 00:33:16,270 want to store binary data, raw zeros and ones that represent files or the like, 720 00:33:16,270 --> 00:33:18,400 you can store it in a BLOB format as well. 721 00:33:18,400 --> 00:33:20,650 Now, just to give you a whirlwind tour of other syntax 722 00:33:20,650 --> 00:33:23,020 before we start using it, you can insert data 723 00:33:23,020 --> 00:33:25,570 into a database using syntax quite like this. 724 00:33:25,570 --> 00:33:29,590 You can select data from a database using syntax like I already did. 725 00:33:29,590 --> 00:33:32,180 You can use functions as well. 726 00:33:32,180 --> 00:33:35,230 In fact, in just a bit, when we load more data into the database, 727 00:33:35,230 --> 00:33:38,800 I could actually select all of the distinct names from a database-- 728 00:33:38,800 --> 00:33:40,870 and in fact, I could have done this a moment ago.
729 00:33:40,870 --> 00:33:45,430 Rather than SELECT title FROM favorites, as I 730 00:33:45,430 --> 00:33:49,360 did earlier-- that's going to give me all of the titles in that database. 731 00:33:49,360 --> 00:33:54,310 I could instead do something like SELECT DISTINCT title FROM favorites, 732 00:33:54,310 --> 00:33:56,680 and that's going to filter out all of the duplicates 733 00:33:56,680 --> 00:34:00,268 without me having to write a single line of code to do that. 734 00:34:00,268 --> 00:34:01,810 You can also count, as we did before. 735 00:34:01,810 --> 00:34:03,790 You can compute averages, mins, or maxes, which 736 00:34:03,790 --> 00:34:06,220 is really useful for analytical work that you might want 737 00:34:06,220 --> 00:34:07,909 to do as a data scientist or the like. 738 00:34:07,909 --> 00:34:10,659 And there's a whole bunch of other functions available to us as well. 739 00:34:10,659 --> 00:34:12,010 But there's some more powerful features. 740 00:34:12,010 --> 00:34:14,230 And we'll begin to use these in just a little bit. 741 00:34:14,230 --> 00:34:15,880 We used WHERE before. 742 00:34:15,880 --> 00:34:17,199 I used LIMIT before. 743 00:34:17,199 --> 00:34:18,460 I used GROUP BY before. 744 00:34:18,460 --> 00:34:21,280 And we'll see a bunch of examples that reinforce all of these. 745 00:34:21,280 --> 00:34:23,920 But there's keywords like LIKE, whereby if you 746 00:34:23,920 --> 00:34:28,540 want to search for everyone's input who mentions The Office, you know what? 747 00:34:28,540 --> 00:34:31,989 I bet we could do something like this instead. 748 00:34:31,989 --> 00:34:36,280 I know that I can SELECT star for-- oh, let's go ahead 749 00:34:36,280 --> 00:34:41,920 and SELECT title FROM favorites WHERE, and this is kind of nonsensical, 750 00:34:41,920 --> 00:34:45,820 but title equals "The Office," quote unquote. 751 00:34:45,820 --> 00:34:48,969 And I get back everyone who typed in literally The Office.
752 00:34:48,969 --> 00:34:55,090 However, if I instead did something like this, "the office" in lower case, 753 00:34:55,090 --> 00:34:58,780 I'm going to get one of you who typed in "the office" in lowercase. 754 00:34:58,780 --> 00:35:01,240 And the thing here is just the title of this output. 755 00:35:01,240 --> 00:35:03,940 But what if I want to do anything like The Office? 756 00:35:03,940 --> 00:35:10,780 I could say something like LIKE "%office%", and it turns out in SQL, 757 00:35:10,780 --> 00:35:12,880 this new language, percent is a placeholder. 758 00:35:12,880 --> 00:35:16,120 It's a wild card that just says, eh, anything can go before, 759 00:35:16,120 --> 00:35:18,190 and, eh, anything can go after. 760 00:35:18,190 --> 00:35:21,340 So now I'm searching for any titles that have the word The Office. 761 00:35:21,340 --> 00:35:22,810 We picked up another entry here. 762 00:35:22,810 --> 00:35:26,153 So one of you said Office, without the word "the," also pretty reasonable. 763 00:35:26,153 --> 00:35:28,570 Some of the uppercase and lowercase is all over the place. 764 00:35:28,570 --> 00:35:30,010 Now we've caught those. 765 00:35:30,010 --> 00:35:33,100 And if you want to count the total number of offices now, 766 00:35:33,100 --> 00:35:34,810 now I can do something like this. 767 00:35:34,810 --> 00:35:38,020 COUNT title-- so I can combine all of these building 768 00:35:38,020 --> 00:35:40,720 blocks, not unlike Scratch, and get back the total number 769 00:35:40,720 --> 00:35:45,610 of offices, which looks now to be 33 when we tolerate a lot more 770 00:35:45,610 --> 00:35:48,880 variability in our users' input. 771 00:35:48,880 --> 00:35:50,680 Most powerfully of all, we'll see something 772 00:35:50,680 --> 00:35:53,020 like JOIN in just a little bit, and that'll 773 00:35:53,020 --> 00:35:56,600 be all toward the end of using databases better with better design. 
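That wildcard search can be reproduced as follows; note that SQLite's LIKE happens to be case-insensitive for ASCII characters by default, which is why the lowercase variants get caught too:

```python
# The wildcard search above: % matches any run of characters, and SQLite's
# LIKE is case-insensitive for ASCII by default, so lowercase variants of
# "Office" are matched too. Sample rows are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("The Office",), ("the office",), ("Office",), ("Friends",)],
)

# SELECT COUNT(title) FROM favorites WHERE title LIKE "%office%";
n = db.execute(
    "SELECT COUNT(title) FROM favorites WHERE title LIKE '%office%'"
).fetchone()[0]
print(n)  # 3
```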
774 00:35:56,600 --> 00:36:00,010 But with SELECTs, we, of course, don't just want to select all of the titles. 775 00:36:00,010 --> 00:36:02,360 We might select where some condition is true. 776 00:36:02,360 --> 00:36:04,430 So conditions, just like in Python and C, 777 00:36:04,430 --> 00:36:08,323 exist in SQL, but at the end of our queries, as we've seen. 778 00:36:08,323 --> 00:36:10,240 If you want to update something, for instance, 779 00:36:10,240 --> 00:36:12,550 you can update data in a database by saying 780 00:36:12,550 --> 00:36:16,300 UPDATE table name SET the column name equal to some value, 781 00:36:16,300 --> 00:36:18,200 WHERE condition is true. 782 00:36:18,200 --> 00:36:19,820 So how might I use this? 783 00:36:19,820 --> 00:36:23,640 Well, let's again look at the data where it's like The Office. 784 00:36:23,640 --> 00:36:24,390 And you know what? 785 00:36:24,390 --> 00:36:25,807 I'm going to go ahead and do this. 786 00:36:25,807 --> 00:36:28,307 If I want to clean all of this data up-- so data cleaning is 787 00:36:28,307 --> 00:36:30,140 a very common approach, whether you're doing 788 00:36:30,140 --> 00:36:32,890 research or analytical work, often when you have messy data, 789 00:36:32,890 --> 00:36:34,770 the first pass is honestly grunt work. 790 00:36:34,770 --> 00:36:37,350 Like, writing some code, or god forbid, manually copying 791 00:36:37,350 --> 00:36:40,330 and pasting in an Excel file, just to clean up your data. 792 00:36:40,330 --> 00:36:42,630 You can do this a lot more powerfully in SQL. 793 00:36:42,630 --> 00:36:48,450 So for instance, let me go ahead and say UPDATE my favorites SET 794 00:36:48,450 --> 00:36:52,500 title equal to "The Office," the canonical form, capital 795 00:36:52,500 --> 00:36:56,400 T, capital O, WHERE title LIKE-- 796 00:36:56,400 --> 00:36:59,850 and now let me do this wildcard search, knowing 797 00:36:59,850 --> 00:37:02,940 that that's going to slurp in all 30-plus of those rows. 
798 00:37:02,940 --> 00:37:06,330 But now, when I hit Enter, nothing seems to happen. 799 00:37:06,330 --> 00:37:10,050 But if I then SELECT title again WHERE LIKE "%office%", 800 00:37:10,050 --> 00:37:13,590 now they all look the same, because I've updated my data. 801 00:37:13,590 --> 00:37:17,010 And so in fact, now I could go back to The Office, 802 00:37:17,010 --> 00:37:21,520 because I've canonicalized my data, and everything now looks the same. 803 00:37:21,520 --> 00:37:23,760 So with SQL, can you do that as well? 804 00:37:23,760 --> 00:37:28,620 Previously, let's go ahead and SELECT a title FROM favorites. 805 00:37:28,620 --> 00:37:32,880 And let's just limit this to 10 of them this time, so they don't all fly by. 806 00:37:32,880 --> 00:37:34,980 Suppose-- VEEP is a good show. 807 00:37:34,980 --> 00:37:37,380 Maybe I'm not a fan of, say-- 808 00:37:37,380 --> 00:37:40,770 let's give ourselves 20 here, see what comes up next. 809 00:37:40,770 --> 00:37:44,370 OK, I actually don't like Friends, even though everyone in the world seems to. 810 00:37:44,370 --> 00:37:45,420 No problem. 811 00:37:45,420 --> 00:37:46,665 DELETE FROM favorites-- 812 00:37:46,665 --> 00:37:47,700 [LAUGHTER] 813 00:37:47,700 --> 00:37:50,790 --WHERE title = "Friends";. 814 00:37:50,790 --> 00:37:53,070 Now, no more Friends. 815 00:37:53,070 --> 00:37:54,885 And so you can delete your data as well. 816 00:37:54,885 --> 00:37:56,010 [LAUGHTER] 817 00:37:56,010 --> 00:38:00,390 You can delete your data as well by using a query of this form. 818 00:38:00,390 --> 00:38:04,020 This one is perhaps the most dangerous of all. 819 00:38:04,020 --> 00:38:07,860 Consider what might happen if you're a little sloppy, a little hasty, 820 00:38:07,860 --> 00:38:09,720 and omit the WHERE condition. 821 00:38:09,720 --> 00:38:14,070 What happens if you go ahead and DELETE FROM favorites;? 822 00:38:14,070 --> 00:38:15,143 Everything goes away.
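The difference between a targeted DELETE and the dangerous bare DELETE can be sketched like this (sqlite3 in memory; the sample rows are invented for illustration).

```python
import sqlite3

# A small table of sample favorites.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("Friends",), ("Veep",), ("The Crown",)],
)

# With a WHERE clause, only the matching rows go away.
db.execute("DELETE FROM favorites WHERE title = 'Friends'")
remaining = [r[0] for r in db.execute("SELECT title FROM favorites")]
print(remaining)  # Friends is gone; the other shows survive

# Omit the WHERE clause and *every* row is deleted -- hence the danger.
db.execute("DELETE FROM favorites")
count = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]
print(count)
```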
823 00:38:15,143 --> 00:38:18,060 And slightly worse than that, though we won't have occasion to use it, 824 00:38:18,060 --> 00:38:20,352 there's another statement called DROP, where if you just 825 00:38:20,352 --> 00:38:23,970 want to delete the table altogether, delete all of your rows and columns, 826 00:38:23,970 --> 00:38:28,630 you can say DROP TABLE favorites;, and that will delete all of the data 827 00:38:28,630 --> 00:38:29,130 as well. 828 00:38:29,130 --> 00:38:32,700 And we'll come back to that, because the mere fact that you can scarily 829 00:38:32,700 --> 00:38:35,370 delete lots of data at once makes you super 830 00:38:35,370 --> 00:38:38,580 vulnerable to bad actors on the internet or in the office who 831 00:38:38,580 --> 00:38:41,770 want to potentially wreak havoc on your data. 832 00:38:41,770 --> 00:38:46,210 So we'll come back today, talking about matters of security, as well. 833 00:38:46,210 --> 00:38:49,920 So any questions before, now, we take things up a notch 834 00:38:49,920 --> 00:38:55,270 and look at a data set much more massive than our hundreds of favorites here? 835 00:38:55,270 --> 00:38:56,430 Is that a question there? 836 00:38:56,430 --> 00:38:58,586 Yeah. 837 00:38:58,586 --> 00:39:01,965 AUDIENCE: [INAUDIBLE] 838 00:39:01,965 --> 00:39:04,090 DAVID MALAN: Did not look like Friends got removed. 839 00:39:04,090 --> 00:39:05,050 Oh, interesting. 840 00:39:05,050 --> 00:39:06,730 You are right. 841 00:39:06,730 --> 00:39:09,700 Let's try a little harder. 842 00:39:09,700 --> 00:39:12,010 LIKE "%friends%". 843 00:39:12,010 --> 00:39:15,200 844 00:39:15,200 --> 00:39:16,370 Better? 845 00:39:16,370 --> 00:39:17,190 Now they're gone.
846 00:39:17,190 --> 00:39:19,607 So I'm guessing-- I would have to look closer at the data, 847 00:39:19,607 --> 00:39:21,442 but I'm guessing one or more people typed it 848 00:39:21,442 --> 00:39:23,150 with a slightly different capitalization, 849 00:39:23,150 --> 00:39:25,760 or maybe they hit a spacebar character or something, where 850 00:39:25,760 --> 00:39:29,040 I wasn't doing a precise match, thereby making 851 00:39:29,040 --> 00:39:32,730 the LIKE operator a more robust fix. 852 00:39:32,730 --> 00:39:35,280 However, let's consider for just a moment. 853 00:39:35,280 --> 00:39:36,860 What might the gotcha here be? 854 00:39:36,860 --> 00:39:39,170 Like, this feels like a nice quick and dirty fix, 855 00:39:39,170 --> 00:39:42,118 but it really is quick and dirty in what sense? 856 00:39:42,118 --> 00:39:43,910 AUDIENCE: Like, if you have another TV show 857 00:39:43,910 --> 00:39:47,543 that has the word friend in it, but not Friends, [INAUDIBLE].. 858 00:39:47,543 --> 00:39:49,460 DAVID MALAN: Yeah, there might be another show 859 00:39:49,460 --> 00:39:50,480 with the word "friends" in it. 860 00:39:50,480 --> 00:39:51,320 Quite reasonable. 861 00:39:51,320 --> 00:39:54,380 And I just blew those away as well from my database. 862 00:39:54,380 --> 00:39:56,480 And in fact, we can probably see this for real. 863 00:39:56,480 --> 00:39:59,540 If I go ahead and look at, for instance-- 864 00:39:59,540 --> 00:40:05,360 let me go into a browser here, IMDB.com, super popular website for just getting 865 00:40:05,360 --> 00:40:07,550 movie reviews or information about movies. 866 00:40:07,550 --> 00:40:09,330 Let me go ahead and search for friends. 867 00:40:09,330 --> 00:40:12,830 And yes, the most popular hit seems to be the 1994 version of Friends, 868 00:40:12,830 --> 00:40:14,740 but there's Friends with Benefits. 
869 00:40:14,740 --> 00:40:17,573 There's Just Friends, We Are Your Friends, My Best Friend's Wedding, 870 00:40:17,573 --> 00:40:18,990 Fisherman's Friends, and so forth. 871 00:40:18,990 --> 00:40:21,020 Some of those movies, some of those TV shows. 872 00:40:21,020 --> 00:40:24,950 And had we poked a little further among our own data of hundreds of rows, maybe 873 00:40:24,950 --> 00:40:27,980 some of you did have some of those shows as your favorites as well. 874 00:40:27,980 --> 00:40:30,290 So beware that, too. 875 00:40:30,290 --> 00:40:32,990 So in preparing for today, we wanted to play 876 00:40:32,990 --> 00:40:36,350 with an even juicier data set than one we could generate for ourselves. 877 00:40:36,350 --> 00:40:39,920 And wonderfully, IMDB makes a lot of their database 878 00:40:39,920 --> 00:40:44,900 available for download as TSV files, Tab-Separated Values. 879 00:40:44,900 --> 00:40:48,210 It's really not fundamentally different from Comma Separated Values. 880 00:40:48,210 --> 00:40:51,200 One human at IMDB just decided to use tabs to separate 881 00:40:51,200 --> 00:40:52,763 their values instead of commas. 882 00:40:52,763 --> 00:40:54,680 And if you read through this page here, you'll 883 00:40:54,680 --> 00:40:58,520 see that they let you download a bunch of different files, one of which 884 00:40:58,520 --> 00:41:02,360 we'll focus on, initially, specifically on title.basics.tsv.gz. 885 00:41:02,360 --> 00:41:05,105 886 00:41:05,105 --> 00:41:06,980 So that's a mouthful, but this is just saying 887 00:41:06,980 --> 00:41:09,920 this is basic information about titles in their database. 888 00:41:09,920 --> 00:41:12,785 Gz means it's G-zipped, which is like Zip. 889 00:41:12,785 --> 00:41:14,660 If you're familiar, on a Mac or PC, it's just 890 00:41:14,660 --> 00:41:19,850 another way of compressing information, but you can unzip that on a computer 891 00:41:19,850 --> 00:41:20,390 as well.
892 00:41:20,390 --> 00:41:22,790 Dot TSV means Tab-Separated Values. 893 00:41:22,790 --> 00:41:24,860 Now, we read in advance the information here, 894 00:41:24,860 --> 00:41:26,490 but let me summarize it as follows. 895 00:41:26,490 --> 00:41:28,972 There's some juicy information in that file. 896 00:41:28,972 --> 00:41:31,430 And I'm going to go ahead and download this file over here. 897 00:41:31,430 --> 00:41:33,710 If I click a link on IMDB's website, I'll 898 00:41:33,710 --> 00:41:37,607 see a file called title.basics.tsv.gz. 899 00:41:37,607 --> 00:41:40,190 I'm going to go ahead and let that download in the background. 900 00:41:40,190 --> 00:41:44,600 It looks like it is 100-plus megabytes. 901 00:41:44,600 --> 00:41:47,570 If I click on it on my Mac, it's going to decompress it. 902 00:41:47,570 --> 00:41:53,210 And you'll see here, this apparently is 532 megabytes once it's decompressed. 903 00:41:53,210 --> 00:41:56,840 So that's way bigger than the CSV file I downloaded from Google 904 00:41:56,840 --> 00:42:01,110 a few minutes ago, which is 24 kilobytes, 24,000 bytes. 905 00:42:01,110 --> 00:42:03,560 This is 532 million bytes. 906 00:42:03,560 --> 00:42:07,040 So it's orders of magnitude bigger, which means our design decisions today 907 00:42:07,040 --> 00:42:08,210 are really going to matter. 908 00:42:08,210 --> 00:42:10,500 Now, if we take a look inside of this file, 909 00:42:10,500 --> 00:42:12,260 we'll see a bunch of interesting fields. 910 00:42:12,260 --> 00:42:15,920 And I've essentially summarized this from their own online documentation. 911 00:42:15,920 --> 00:42:18,980 There's these five fields that are of interest right now, 912 00:42:18,980 --> 00:42:20,630 and we'll tease these apart now. 913 00:42:20,630 --> 00:42:22,550 So tconst, I didn't recognize that. 914 00:42:22,550 --> 00:42:26,510 It's an IMDB-specific term, it seems, but it means some kind of title 915 00:42:26,510 --> 00:42:27,200 constant.
916 00:42:27,200 --> 00:42:30,480 It's a unique identifier for every movie in the database. 917 00:42:30,480 --> 00:42:33,360 So for instance, one of the movies in that TSV file, 918 00:42:33,360 --> 00:42:39,040 which I'll open up in a moment, has the unique value tt4786824. 919 00:42:39,040 --> 00:42:39,830 Why? 920 00:42:39,830 --> 00:42:40,970 Just because. 921 00:42:40,970 --> 00:42:44,750 But it turns out we can actually see what this is. 922 00:42:44,750 --> 00:42:57,160 Let me actually grab that value and go to IMDB.com/title/tt4786824. 923 00:42:57,160 --> 00:42:58,160 Enter. 924 00:42:58,160 --> 00:42:59,870 These actually do have significance. 925 00:42:59,870 --> 00:43:01,760 That is the unique identifier for The Crown, 926 00:43:01,760 --> 00:43:04,400 popular hit TV show that you can now watch on Netflix. 927 00:43:04,400 --> 00:43:08,660 So if you go back to the actual field-- 928 00:43:08,660 --> 00:43:09,960 that's how I spent the summer. 929 00:43:09,960 --> 00:43:12,543 So if you go back to the fields here, we have not only tconst, 930 00:43:12,543 --> 00:43:15,440 which uniquely identifies the film, we also have title type. 931 00:43:15,440 --> 00:43:17,990 So it turns out in this 500 megabyte file, 932 00:43:17,990 --> 00:43:22,010 there's huge amounts of information on movies, TV series, documentaries, 933 00:43:22,010 --> 00:43:25,280 shorts, way more data than we could possibly play with in one day. 934 00:43:25,280 --> 00:43:29,270 So we'll focus just on TV series, like this, the show The Crown. 935 00:43:29,270 --> 00:43:31,910 Primary title is the name in the file that is 936 00:43:31,910 --> 00:43:33,660 given to the primary title of the show. 937 00:43:33,660 --> 00:43:35,660 Sometimes shows apparently have different titles 938 00:43:35,660 --> 00:43:38,320 in different geographies or communities, so the primary title 939 00:43:38,320 --> 00:43:40,070 is the one that most people know it by. 
940 00:43:40,070 --> 00:43:43,670 Start year is the year for a TV show that-- in which that TV show began. 941 00:43:43,670 --> 00:43:46,250 The Crown came out first in 2016. 942 00:43:46,250 --> 00:43:51,050 And genres, a little weirdly, is a comma-separated list 943 00:43:51,050 --> 00:43:53,690 of genres into which that show falls. 944 00:43:53,690 --> 00:43:57,320 And I say it's a little weird, because we do have a TSV file, 945 00:43:57,320 --> 00:43:59,580 but they're using commas inside. 946 00:43:59,580 --> 00:44:03,590 So this is probably among the motivations for having used tabs in-- 947 00:44:03,590 --> 00:44:05,660 to separate all of the columns, because they 948 00:44:05,660 --> 00:44:08,060 want to use commas for some other purpose here. 949 00:44:08,060 --> 00:44:09,570 But it's a little messy. 950 00:44:09,570 --> 00:44:13,160 This is actually not necessarily the best design decision here. 951 00:44:13,160 --> 00:44:17,090 So what can we go ahead and do with all of this data here? 952 00:44:17,090 --> 00:44:20,140 So let's go ahead and start to grab some of this data 953 00:44:20,140 --> 00:44:21,640 and write some scripts involving it. 954 00:44:21,640 --> 00:44:23,260 But let me issue a disclaimer. 955 00:44:23,260 --> 00:44:26,000 It turns out when you have a lot of data in a database, 956 00:44:26,000 --> 00:44:30,247 it can take quite a long time to analyze it or parse it. 957 00:44:30,247 --> 00:44:32,830 And so if you ever watch certain baking shows like Julia Child 958 00:44:32,830 --> 00:44:35,620 from down the road here in Cambridge, very often there 959 00:44:35,620 --> 00:44:39,580 would be some TV magic whereby someone mixes the cake mix 960 00:44:39,580 --> 00:44:42,723 and puts it into the oven, and then two minutes later, voila, it comes out. 
961 00:44:42,723 --> 00:44:44,890 That's because some of the scripts we'll write today 962 00:44:44,890 --> 00:44:47,410 might actually take many minutes to run, but we'll 963 00:44:47,410 --> 00:44:49,118 go ahead and run them in the background, 964 00:44:49,118 --> 00:44:51,760 and I'll use the outputs that I've created here in advance. 965 00:44:51,760 --> 00:44:55,600 So how do we go about beginning to navigate this data set? 966 00:44:55,600 --> 00:44:59,830 Well, first of all, let me go ahead and just open this file. 967 00:44:59,830 --> 00:45:02,230 And I'm going to use my own terminal window here. 968 00:45:02,230 --> 00:45:06,790 This file was called title.basics.tsv. 969 00:45:06,790 --> 00:45:09,580 Notice that even in my program here, it took a moment to open, 970 00:45:09,580 --> 00:45:11,060 but this is what it looks like. 971 00:45:11,060 --> 00:45:14,380 It's a little cryptic at first glance, but notice the first row 972 00:45:14,380 --> 00:45:16,030 is the headers of the file. 973 00:45:16,030 --> 00:45:19,900 So we've got tconst and title type and primary title and so forth. 974 00:45:19,900 --> 00:45:24,160 Then every row thereafter seems to start with one of those unique identifiers, 975 00:45:24,160 --> 00:45:27,250 and they seem to be monotonically increasing, 1, 2, 3, 4, 976 00:45:27,250 --> 00:45:29,830 but they start with tt then some number of digits. 977 00:45:29,830 --> 00:45:32,972 Then you see that these are a bunch of shorts, so short films or the like. 978 00:45:32,972 --> 00:45:35,180 And if we kept scrolling, we'd see some other values. 979 00:45:35,180 --> 00:45:36,825 Here's the first movie in the database. 980 00:45:36,825 --> 00:45:39,700 And if we keep scrolling and scrolling and scrolling, we'll see more.
981 00:45:39,700 --> 00:45:43,150 So if I actually use my text editor here and search for literally the word 982 00:45:43,150 --> 00:45:45,040 The Crown, it's going to take a little while 983 00:45:45,040 --> 00:45:48,100 to get to, because there's a bunch of shows apparently called 984 00:45:48,100 --> 00:45:50,470 The Crown or Crowning or something. 985 00:45:50,470 --> 00:45:52,060 But you know what'll be faster? 986 00:45:52,060 --> 00:45:53,860 Let me go ahead and search for-- 987 00:45:53,860 --> 00:45:55,090 what was it? 988 00:45:55,090 --> 00:46:03,020 If I go back to tt4786824, Enter, it takes a moment 989 00:46:03,020 --> 00:46:05,770 to find this because it's searching through hundreds of megabytes, 990 00:46:05,770 --> 00:46:10,493 but voila, down here is this tconst, TV series type. 991 00:46:10,493 --> 00:46:11,410 It's called The Crown. 992 00:46:11,410 --> 00:46:15,920 In another geography it's also called The Crown, 2016, and so forth. 993 00:46:15,920 --> 00:46:17,920 So that's a huge amount of data. 994 00:46:17,920 --> 00:46:19,180 In fact, how much data? 995 00:46:19,180 --> 00:46:22,310 Well, if I actually analyze this with a command on my computer, 996 00:46:22,310 --> 00:46:29,320 there are 6,267,469 shows in the world, be it TV 997 00:46:29,320 --> 00:46:31,330 or movies or documentaries or the like. 998 00:46:31,330 --> 00:46:32,830 That's a huge data set. 999 00:46:32,830 --> 00:46:36,460 So suppose I want to focus really just on a subset of that data. 1000 00:46:36,460 --> 00:46:39,970 What can I go about using instead? 1001 00:46:39,970 --> 00:46:45,040 Well, let me go ahead and open up, for instance, the same file, 1002 00:46:45,040 --> 00:46:47,685 but let me try putting it in CS50 IDE. 1003 00:46:47,685 --> 00:46:50,560 I'm going to go ahead and download this, as you might have downloaded 1004 00:46:50,560 --> 00:46:52,660 past problem sets, with wget. 1005 00:46:52,660 --> 00:46:54,867 That's going to give me a pretty big file.
1006 00:46:54,867 --> 00:46:56,950 There's a command now that you've not had occasion 1007 00:46:56,950 --> 00:47:02,640 to use in the IDE called gunzip, which will unzip a file that starts with-- 1008 00:47:02,640 --> 00:47:04,480 that ends with .gz. 1009 00:47:04,480 --> 00:47:09,060 That's going to take a moment to unzip the whole file. 1010 00:47:09,060 --> 00:47:11,310 A moment, a moment, a moment, OK. 1011 00:47:11,310 --> 00:47:12,970 [CHUCKLES] And no space left on disk. 1012 00:47:12,970 --> 00:47:14,880 But that's OK, because it turns out what I'm also 1013 00:47:14,880 --> 00:47:17,260 going to start doing today is using my own Mac a bit more. 1014 00:47:17,260 --> 00:47:19,260 And you'll recall last week, I did that a little bit, 1015 00:47:19,260 --> 00:47:22,218 because when I wanted to play with speech recognition in my microphone, 1016 00:47:22,218 --> 00:47:25,890 it's a lot easier to have Python running on my Mac or your PC 1017 00:47:25,890 --> 00:47:28,200 if I want to use hardware built into my computer. 1018 00:47:28,200 --> 00:47:31,680 Plus my MacBook is a lot faster, for instance, than a single cloud 1019 00:47:31,680 --> 00:47:33,270 account on a shared server. 1020 00:47:33,270 --> 00:47:37,770 So I'll go ahead and write some of this code here on my own machine instead. 1021 00:47:37,770 --> 00:47:40,930 So let me go ahead and open a file called import.py. 1022 00:47:40,930 --> 00:47:44,700 And I'm going to go ahead and import the CSV module, as always, 1023 00:47:44,700 --> 00:47:49,490 and I'm going to go ahead and open this file, which is called title.basics.tsv 1024 00:47:49,490 --> 00:47:53,425 in read-only mode, and I'm going to call this variable titles. 1025 00:47:53,425 --> 00:47:54,550 What am I next going to do? 1026 00:47:54,550 --> 00:47:58,230 Let me give myself a reader using csv.DictReader, 1027 00:47:58,230 --> 00:48:01,290 as before, reading in those titles.
1028 00:48:01,290 --> 00:48:04,800 But csv.DictReader feels like the wrong reader. 1029 00:48:04,800 --> 00:48:07,040 Why? 1030 00:48:07,040 --> 00:48:09,360 Just to be clear. 1031 00:48:09,360 --> 00:48:11,445 Yeah, I mean, maybe I want TSV. 1032 00:48:11,445 --> 00:48:14,290 But it turns out TSV doesn't exist. 1033 00:48:14,290 --> 00:48:17,310 So even though I might be inclined, for instance, to change this 1034 00:48:17,310 --> 00:48:21,570 to a T and this to a T, the CSV module does enough for me, 1035 00:48:21,570 --> 00:48:24,540 but I need to tell it that I want to use a different delimiter. 1036 00:48:24,540 --> 00:48:26,723 Instead of the default, which looks like this, 1037 00:48:26,723 --> 00:48:28,890 I can actually override that and say, you know what? 1038 00:48:28,890 --> 00:48:29,700 Use a tab. 1039 00:48:29,700 --> 00:48:33,690 And just like backslash n in C and in Python, it's a new line. 1040 00:48:33,690 --> 00:48:37,830 Backslash t in both languages is a tab character. 1041 00:48:37,830 --> 00:48:41,100 All right, so once I've done this, let me go ahead and open up, now-- 1042 00:48:41,100 --> 00:48:43,920 let me go ahead and open up shows0.csv. 1043 00:48:43,920 --> 00:48:47,760 My goal in life now is to make this file a lot more manageable for myself. 1044 00:48:47,760 --> 00:48:53,070 I want to take a 500-megabyte file and extract only the TV shows therein. 1045 00:48:53,070 --> 00:48:56,190 Moreover, how about only the TV shows from 1970 onward? 1046 00:48:56,190 --> 00:48:59,080 We won't go even further back than that. 1047 00:48:59,080 --> 00:49:04,320 So let me go ahead and open up, in write mode, a file called shows0.csv. 1048 00:49:04,320 --> 00:49:06,278 And I'm just going to call that variable shows. 1049 00:49:06,278 --> 00:49:08,445 Then I'm going to go ahead and give myself a writer. 1050 00:49:08,445 --> 00:49:10,290 And you might recall using this in the past.
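That delimiter override can be sketched as follows; since the real title.basics.tsv isn't at hand, the two rows below are an invented in-memory stand-in so the sketch runs anywhere.

```python
import csv
import io

# The csv module has no separate TSV reader; DictReader just needs
# delimiter="\t". StringIO simulates the opened TSV file here.
tsv = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt4786824\ttvSeries\tThe Crown\t2016\tDrama,History\n"
    "tt0000001\tshort\tCarmencita\t1894\tDocumentary,Short\n"
)
reader = csv.DictReader(tsv, delimiter="\t")
rows = list(reader)
print(rows[0]["primaryTitle"])  # fields are keyed by the header row
```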
1051 00:49:10,290 --> 00:49:13,082 This is simply going to give me a variable called writer, via which 1052 00:49:13,082 --> 00:49:14,400 I can write to a new file. 1053 00:49:14,400 --> 00:49:18,750 Because again, the goal is to read this file and write to this file 1054 00:49:18,750 --> 00:49:21,090 a subset of the data therein. 1055 00:49:21,090 --> 00:49:23,160 So let me go ahead and write one row first. 1056 00:49:23,160 --> 00:49:28,180 Write a row, passing in a list of values, specifically tconst, 1057 00:49:28,180 --> 00:49:29,320 which is the title-- 1058 00:49:29,320 --> 00:49:31,860 which is the ID field; primaryTitle, which 1059 00:49:31,860 --> 00:49:35,910 is the title field; startYear, which is the year field; and genres, 1060 00:49:35,910 --> 00:49:38,620 which was one more field that I mentioned earlier. 1061 00:49:38,620 --> 00:49:43,248 So my goal is to export only those four columns that I care about for now. 1062 00:49:43,248 --> 00:49:44,790 So I'm going to go ahead and do this. 1063 00:49:44,790 --> 00:49:49,590 For each row in my reader, I'm going to go ahead and say if that 1064 00:49:49,590 --> 00:49:56,340 row's titleType == tvSeries, which, recall, 1065 00:49:56,340 --> 00:50:00,720 The Crown was an example of, then I'm going to go ahead and write 1066 00:50:00,720 --> 00:50:01,710 to the writer-- 1067 00:50:01,710 --> 00:50:05,940 whoops-- writer.writerow, a list containing what? 1068 00:50:05,940 --> 00:50:13,230 The row's tconst value, the row's primary title value, the row-- 1069 00:50:13,230 --> 00:50:21,214 whoops-- the row's start year, and lastly, the row's-- 1070 00:50:21,214 --> 00:50:24,310 [CHUCKLES] the row's genres. 1071 00:50:24,310 --> 00:50:26,560 So what am I doing here, just to be clear?
1072 00:50:26,560 --> 00:50:30,760 So what I have done is I've first written out-- once and only 1073 00:50:30,760 --> 00:50:33,070 once-- literally these values, because I want 1074 00:50:33,070 --> 00:50:35,200 headers in the first line of my file. 1075 00:50:35,200 --> 00:50:37,750 After that, for each row in the reader, which is currently 1076 00:50:37,750 --> 00:50:40,900 iterating over this file in read-only mode, 1077 00:50:40,900 --> 00:50:44,200 I want to print out the current row's tconst, current row's primary title, 1078 00:50:44,200 --> 00:50:47,170 the current row's start year, and the current row's genres. 1079 00:50:47,170 --> 00:50:52,450 But notice, I'm ignoring movies and shorts and documentaries and bunches 1080 00:50:52,450 --> 00:50:54,080 of other values as well. 1081 00:50:54,080 --> 00:50:54,830 And you know what? 1082 00:50:54,830 --> 00:50:57,310 Just for good measure, let's shrink this a little bit. 1083 00:50:57,310 --> 00:51:02,320 And row, how about, isAdult == "0", for today's purposes? 1084 00:51:02,320 --> 00:51:03,970 So that'll filter the list further. 1085 00:51:03,970 --> 00:51:06,160 If you'd like to flip that later, that's fine. 1086 00:51:06,160 --> 00:51:11,920 So let me go ahead and do Python 3, for version 3 on my Mac-- actually, 1087 00:51:11,920 --> 00:51:17,632 no, let's do python import.py, all right? 1088 00:51:17,632 --> 00:51:19,090 I'm going to cross my fingers here. 1089 00:51:19,090 --> 00:51:22,300 Hopefully the file is working, working, working. 1090 00:51:22,300 --> 00:51:24,010 But it's taking a decent amount of time. 1091 00:51:24,010 --> 00:51:29,290 Like, this is how much time it takes, apparently, to process over 1092 00:51:29,290 --> 00:51:31,840 millions of rows of data. 1093 00:51:31,840 --> 00:51:33,700 Still running. 1094 00:51:33,700 --> 00:51:34,960 Still running.
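The filtering pass just described can be sketched self-containedly; in-memory buffers stand in for the real input and output files, and the rows are invented for illustration.

```python
import csv
import io

# Simulated title.basics.tsv: one TV series and one short film.
titles = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tisAdult\tstartYear\tgenres\n"
    "tt4786824\ttvSeries\tThe Crown\t0\t2016\tDrama,History\n"
    "tt0000001\tshort\tCarmencita\t0\t1894\tDocumentary,Short\n"
)
shows = io.StringIO()  # stands in for shows0.csv, opened in write mode

reader = csv.DictReader(titles, delimiter="\t")
writer = csv.writer(shows)
writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])  # header, once
for row in reader:
    # Keep only non-adult TV series; everything else is skipped.
    if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
        writer.writerow(
            [row["tconst"], row["primaryTitle"], row["startYear"], row["genres"]]
        )

print(shows.getvalue())
```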
1095 00:51:34,960 --> 00:51:37,465 But the goal, again, is to shrink the amount 1096 00:51:37,465 --> 00:51:40,090 of data I have to ultimately care about so that we can actually 1097 00:51:40,090 --> 00:51:41,980 search it much more effectively. 1098 00:51:41,980 --> 00:51:43,510 So, OK, it actually finished. 1099 00:51:43,510 --> 00:51:46,430 So let me go ahead and open up shows0.csv. 1100 00:51:46,430 --> 00:51:49,840 Notice now, in my text editor, I've got a lot less data. 1101 00:51:49,840 --> 00:51:51,820 I've thrown away everything I don't care about, 1102 00:51:51,820 --> 00:51:54,970 but I've been left with tconst, primary title, start year, and genres, 1103 00:51:54,970 --> 00:51:59,052 and everything herein is now consistent with that filtration. 1104 00:51:59,052 --> 00:52:00,760 But I haven't filtered everything I said. 1105 00:52:00,760 --> 00:52:03,380 What did I say I wanted to get rid of earlier? 1106 00:52:03,380 --> 00:52:06,590 1107 00:52:06,590 --> 00:52:08,090 Yeah, the shows before 1970. 1108 00:52:08,090 --> 00:52:11,550 And clearly some of these are coming from the 1940s and so forth. 1109 00:52:11,550 --> 00:52:14,480 So let's go ahead and get rid of those, but see how. 1110 00:52:14,480 --> 00:52:18,597 This is the CSV file that we just opened, but in Google Spreadsheet form. 1111 00:52:18,597 --> 00:52:20,930 So I literally just imported it into Google Spreadsheets 1112 00:52:20,930 --> 00:52:21,950 so that we could see it. 1113 00:52:21,950 --> 00:52:25,640 Literally the same data as before, and there are those 1940s movies. 1114 00:52:25,640 --> 00:52:28,490 But there's something curious that I wanted to be mindful of. 1115 00:52:28,490 --> 00:52:34,290 If I scroll down in the start years and keep going and going and going, 1116 00:52:34,290 --> 00:52:35,400 huh, those seem OK. 1117 00:52:35,400 --> 00:52:36,650 Those are increasing in order. 1118 00:52:36,650 --> 00:52:37,560 But let me try this.
1119 00:52:37,560 --> 00:52:40,880 Let me just poke around my data, sorting in reverse order. 1120 00:52:40,880 --> 00:52:44,480 It's going to take a while, because even this is a decent number of TV shows. 1121 00:52:44,480 --> 00:52:47,000 Notice this weirdness. 1122 00:52:47,000 --> 00:52:50,630 At the top of start year now, once I've reversed sorted them, 1123 00:52:50,630 --> 00:52:53,450 there's a whole bunch of backslash capital N's. 1124 00:52:53,450 --> 00:52:57,050 Now, this has nothing to do with C, and nothing to do with Python. 1125 00:52:57,050 --> 00:52:59,900 It has everything to do with the documentation. 1126 00:52:59,900 --> 00:53:05,930 If you read IMDB's data, as I only did carefully eventually, you'll see this. 1127 00:53:05,930 --> 00:53:10,370 A backslash and a capital N is used to denote that a particular field is 1128 00:53:10,370 --> 00:53:12,592 missing or null for that title name. 1129 00:53:12,592 --> 00:53:14,300 Now, this is important, because if I want 1130 00:53:14,300 --> 00:53:17,090 to filter for shows from 1970 onward, 1131 00:53:17,090 --> 00:53:18,670 I need to be resilient against that. 1132 00:53:18,670 --> 00:53:20,210 So let me go ahead and do this. 1133 00:53:20,210 --> 00:53:27,590 So if the current row's startYear does not equal backslash N, 1134 00:53:27,590 --> 00:53:31,850 then I'm going to go ahead and check that it's a TV series 1135 00:53:31,850 --> 00:53:34,160 and that it's not an adult show. 1136 00:53:34,160 --> 00:53:35,150 So that would help. 1137 00:53:35,150 --> 00:53:38,110 But furthermore, how can I check this here? 1138 00:53:38,110 --> 00:53:39,530 Let me do year. 1139 00:53:39,530 --> 00:53:43,500 And how can I convert row startYear to an integer? 1140 00:53:43,500 --> 00:53:45,500 Well, everything in a spreadsheet, by definition 1141 00:53:45,500 --> 00:53:50,510 of it having been in a spreadsheet, or a TSV file or a CSV file, is text.
1142 00:53:50,510 --> 00:53:52,940 But start year looks like years, so what Python 1143 00:53:52,940 --> 00:53:55,430 function can I use to actually convert text that 1144 00:53:55,430 --> 00:53:58,840 resembles a number to an actual number? 1145 00:53:58,840 --> 00:53:59,340 Yeah. 1146 00:53:59,340 --> 00:54:01,240 So we can do something like this. 1147 00:54:01,240 --> 00:54:03,840 So I can convert year to an int, and now I 1148 00:54:03,840 --> 00:54:08,530 can say, if year greater than or equal to 1970, 1149 00:54:08,530 --> 00:54:12,348 now I'm going to go ahead and do those lines instead. 1150 00:54:12,348 --> 00:54:14,640 Now, there's an opportunity, surely, for better design, 1151 00:54:14,640 --> 00:54:16,530 because once your code starts doing this, 1152 00:54:16,530 --> 00:54:18,360 you've done something suboptimally, right? 1153 00:54:18,360 --> 00:54:21,112 This is not going to end well if all of my code starts wrapping. 1154 00:54:21,112 --> 00:54:23,820 So I could clean up the logic in a little bit, but let's go ahead 1155 00:54:23,820 --> 00:54:25,750 and run this just one more time. 1156 00:54:25,750 --> 00:54:30,450 This time, changing this to shows1.csv so we can see slightly different 1157 00:54:30,450 --> 00:54:30,960 outputs. 1158 00:54:30,960 --> 00:54:33,546 Let me go ahead and run Python import.py. 1159 00:54:33,546 --> 00:54:37,740 Huh, syntax error, Unicode error, codec-- 1160 00:54:37,740 --> 00:54:39,000 that's a weird one. 1161 00:54:39,000 --> 00:54:43,810 But this is because backslashes, recall, in C and Python, have special meaning. 1162 00:54:43,810 --> 00:54:47,070 So when you do something like this, backslash capital n, 1163 00:54:47,070 --> 00:54:49,860 even though it's not a lower case n, backslash n, recall, 1164 00:54:49,860 --> 00:54:52,110 is the escape character in C and Python. 1165 00:54:52,110 --> 00:54:55,230 So this is like telling Python, this is a special character. 1166 00:54:55,230 --> 00:54:56,092 But it's not.
1167 00:54:56,092 --> 00:54:58,050 And we've never really had occasion to do this, 1168 00:54:58,050 --> 00:55:00,240 but how do you think we could output 1169 00:55:00,240 --> 00:55:05,620 a literal backslash before a capital N? 1170 00:55:05,620 --> 00:55:06,120 Yeah. 1171 00:55:06,120 --> 00:55:08,610 So it turns out the solution to this problem, usually, 1172 00:55:08,610 --> 00:55:11,730 no matter the language, is that if you want a literal character, not 1173 00:55:11,730 --> 00:55:15,070 an escape character, you literally put another one of it before. 1174 00:55:15,070 --> 00:55:17,260 So even though this looks a little funky now, 1175 00:55:17,260 --> 00:55:22,420 this backslash backslash capital N literally will mean backslash N. 1176 00:55:22,420 --> 00:55:22,920 All right. 1177 00:55:22,920 --> 00:55:26,360 So now, let me go ahead and run this on import.py. 1178 00:55:26,360 --> 00:55:29,790 This time, I'm hopefully going to actually generate 1179 00:55:29,790 --> 00:55:36,120 a new file called shows1.csv that has even less data that 1180 00:55:36,120 --> 00:55:42,400 actually is going to contain my shows, but only a subset of them. 1181 00:55:42,400 --> 00:55:45,000 And let's go ahead and pull the cake out of the oven this way. 1182 00:55:45,000 --> 00:55:47,050 This is what I get now this time. 1183 00:55:47,050 --> 00:55:50,058 So if I actually load the CSV, shows1.csv, into Google Spreadsheet, 1184 00:55:50,058 --> 00:55:53,100 just because it's easier to look at than the black and white window, 1185 00:55:53,100 --> 00:55:57,420 now you can see that I apparently am only getting shows 1970 and onward. 1186 00:55:57,420 --> 00:55:59,940 And indeed, if I sorted them, I would see no backslash N's. 1187 00:55:59,940 --> 00:56:03,960 I would have thrown away everything that doesn't meet that criteria. 1188 00:56:03,960 --> 00:56:07,110 Well, let me go ahead and do one last thing here.
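The backslash-doubling just described can be sketched in a few lines; the helper function here is an invented illustration, not the lecture's actual code.

```python
# IMDb writes the two characters backslash-then-N for a missing field.
# In a Python string literal, a lone \N is the start of an escape sequence
# (which is exactly why the interpreter reported a Unicode/codec error),
# so the backslash itself must be doubled: "\\N".
missing = "\\N"
assert len(missing) == 2  # exactly two characters: a backslash and an N

def keep(start_year):
    """True for rows whose start year is known and is 1970 or later."""
    if start_year == "\\N":         # the field is missing; skip this row
        return False
    return int(start_year) >= 1970  # TSV fields are text, so convert first

print(keep("2016"), keep("1944"), keep("\\N"))  # True False False
```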
1189 00:56:07,110 --> 00:56:10,020 I'm going to go ahead and make one more change. 1190 00:56:10,020 --> 00:56:12,330 And first, let's improve the design here. 1191 00:56:12,330 --> 00:56:16,710 This indentation is the result of my asking questions again and again 1192 00:56:16,710 --> 00:56:20,410 and again and indenting if and only if those things are true. 1193 00:56:20,410 --> 00:56:22,950 But notice, you can start to flip your logic here, right? 1194 00:56:22,950 --> 00:56:26,790 Instead of saying if the start year does not equal backslash N, 1195 00:56:26,790 --> 00:56:30,210 what if I just do this and say continue? 1196 00:56:30,210 --> 00:56:32,010 I can then unindent this-- 1197 00:56:32,010 --> 00:56:35,280 because if you've not used it before in Python and in C, 1198 00:56:35,280 --> 00:56:39,880 if you say continue inside of a loop, it's not going to continue down there, 1199 00:56:39,880 --> 00:56:42,690 it's going to [WHOOSH] continue to the start of the loop again. 1200 00:56:42,690 --> 00:56:45,840 So via this logic, we can actually keep wrapping around again and again. 1201 00:56:45,840 --> 00:56:50,460 And here, too, we could say, if year less than 1970, I can go ahead 1202 00:56:50,460 --> 00:56:54,340 and say continue, which would then allow me to unindent this as well. 1203 00:56:54,340 --> 00:56:57,510 So there are solutions, design-wise, to actually avoiding 1204 00:56:57,510 --> 00:56:59,230 that infinite indentation. 1205 00:56:59,230 --> 00:56:59,730 All right. 1206 00:56:59,730 --> 00:57:02,130 Let's go ahead and do one last version. 1207 00:57:02,130 --> 00:57:04,710 Then I'm going to go ahead and pull out of the oven in a-- 1208 00:57:04,710 --> 00:57:05,640 premade. 1209 00:57:05,640 --> 00:57:08,680 So the last thing I didn't load before was this. 1210 00:57:08,680 --> 00:57:12,780 Suppose that I want to load into the CSV file, all of the genres 1211 00:57:12,780 --> 00:57:14,820 associated with the show. 
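That flattening with `continue` can be sketched like this; the rows here are hypothetical stand-ins for the dictionaries a DictReader would produce:

```python
# Hypothetical rows, shaped like IMDb's TSV dictionaries.
rows = [
    {"titleType": "tvSeries", "startYear": "1994"},
    {"titleType": "movie",    "startYear": "1999"},
    {"titleType": "tvSeries", "startYear": "\\N"},
    {"titleType": "tvSeries", "startYear": "1965"},
]

kept = []
for row in rows:
    # Skip anything that is not a TV series.
    if row["titleType"] != "tvSeries":
        continue
    # Skip rows whose start year is missing (IMDb's \N sentinel).
    if row["startYear"] == "\\N":
        continue
    # Safe to convert now; skip shows that began before 1970.
    year = int(row["startYear"])
    if year < 1970:
        continue
    kept.append(row)

print(len(kept))  # 1: only the 1994 TV series survives
```

Each `continue` whooshes back to the top of the loop for the next row, so none of the later checks needs extra indentation.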
1212 00:57:14,820 --> 00:57:17,940 It looks like all of these shows have one or more genres, just like you 1213 00:57:17,940 --> 00:57:20,370 were asked for your favorite shows. 1214 00:57:20,370 --> 00:57:24,780 And so now, we have a CSV file with tconst, primary titles, start year, 1215 00:57:24,780 --> 00:57:29,280 and genres, where genres is itself a comma-separated list. 1216 00:57:29,280 --> 00:57:31,290 But there's a fundamental problem here. 1217 00:57:31,290 --> 00:57:35,040 Even though I have all of this data here, the best I can do 1218 00:57:35,040 --> 00:57:36,460 is a program like this. 1219 00:57:36,460 --> 00:57:42,190 Let me go ahead and search for, for instance, the following. 1220 00:57:42,190 --> 00:57:44,765 Let me go ahead and grab a file real fast. 1221 00:57:44,765 --> 00:57:48,410 1222 00:57:48,410 --> 00:57:53,300 Let me go ahead and grab a copy of shows2.csv 1223 00:57:53,300 --> 00:57:55,970 and write one final program here. 1224 00:57:55,970 --> 00:57:59,712 If I want to go ahead now and search this very large, still, data set, 1225 00:57:59,712 --> 00:58:02,780 in shows2.csv, well, let me go ahead and do this. 1226 00:58:02,780 --> 00:58:06,528 Import CSV. 1227 00:58:06,528 --> 00:58:08,570 Let me go ahead now and ask the user for a title. 1228 00:58:08,570 --> 00:58:11,120 I could use CS50's get string, but there's really no need for that 1229 00:58:11,120 --> 00:58:14,162 anymore now that we have the title function-- the input function, recall. 1230 00:58:14,162 --> 00:58:15,310 So I'll just use that. 1231 00:58:15,310 --> 00:58:20,900 And then I'm going to go ahead and open up shows2.csv in read-only mode. 1232 00:58:20,900 --> 00:58:23,270 And I'm going to call that my file. 1233 00:58:23,270 --> 00:58:27,470 Then I'm going to go ahead and give myself a reader from csv.DictReader, 1234 00:58:27,470 --> 00:58:29,330 passing in that file. 
1235 00:58:29,330 --> 00:58:33,422 And now I'm going to go ahead and, for row in reader, do the following. 1236 00:58:33,422 --> 00:58:35,630 The goal now is to write a Python program that allows 1237 00:58:35,630 --> 00:58:37,620 me to search only those TV shows. 1238 00:58:37,620 --> 00:58:44,420 So I could say something like this, if title == row "primaryTitle", 1239 00:58:44,420 --> 00:58:50,060 then I can go ahead, for instance, and print out row "primaryTitle," 1240 00:58:50,060 --> 00:58:53,040 started in row "startYear." 1241 00:58:53,040 --> 00:58:55,130 So what is the goal of this program? 1242 00:58:55,130 --> 00:58:56,887 It's going to ask the user for input. 1243 00:58:56,887 --> 00:58:58,970 It's going to open this big CSV that I've created. 1244 00:58:58,970 --> 00:59:01,520 But that's still smaller than the 50-megabyte version. 1245 00:59:01,520 --> 00:59:04,670 It's going to iterate over every row in that file via DictReader, 1246 00:59:04,670 --> 00:59:09,800 checking if the title the human typed in equals the current row's primary title. 1247 00:59:09,800 --> 00:59:13,190 And if so, it's going to print the title and year of that show. 1248 00:59:13,190 --> 00:59:17,750 So if I go ahead and run Python of search.py, 1249 00:59:17,750 --> 00:59:22,760 typing in something like The Crown, Enter, voila, I get that answer. 1250 00:59:22,760 --> 00:59:26,710 If I go ahead and do The Office, Enter, there's a bunch of Offices. 1251 00:59:26,710 --> 00:59:29,210 And in fact, if you haven't seen more than the American one, 1252 00:59:29,210 --> 00:59:31,620 there's the UK one, and apparently several others. 1253 00:59:31,620 --> 00:59:33,410 And we can actually corroborate this now. 1254 00:59:33,410 --> 00:59:36,590 If you go to IMDB, where all of this data originally came from, 1255 00:59:36,590 --> 00:59:40,730 and type The Office, there are all of The Offices that actually line up 1256 00:59:40,730 --> 00:59:42,470 with our very own data set. 
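A sketch of that search.py, with shows2.csv simulated by an in-memory string (the rows below are made up) and the `input()` call replaced by a hard-coded title so the logic is self-contained:

```python
import csv
import io

# Simulated shows2.csv; the real script opens the file on disk instead.
data = io.StringIO(
    "tconst,primaryTitle,startYear,genres\n"
    'tt0386676,The Office,2005,"Comedy"\n'
    'tt0290978,The Office,2001,"Comedy,Drama"\n'
    'tt4786824,The Crown,2016,"Drama,History"\n'
)

title = "The Office"  # in the real script: title = input("Title: ")

# Linear search: check every row's primaryTitle against the user's input.
matches = []
for row in csv.DictReader(data):
    if title == row["primaryTitle"]:
        matches.append((row["primaryTitle"], row["startYear"]))

print(matches)  # both Offices, 2005 and 2001
```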
1257 00:59:42,470 --> 00:59:44,690 It's going to be so damn tedious every time 1258 00:59:44,690 --> 00:59:46,610 you want to search for data or update data 1259 00:59:46,610 --> 00:59:49,880 or insert new data to write a Python program to do it, 1260 00:59:49,880 --> 00:59:51,822 so we need a few more features of SQL. 1261 00:59:51,822 --> 00:59:53,780 But I think first, we need some Halloween candy 1262 00:59:53,780 --> 00:59:55,790 for our five-minute break outback. 1263 00:59:55,790 --> 00:59:57,560 All right, we are back. 1264 00:59:57,560 --> 01:00:00,680 So we have a whole bunch more data now, because we've downloaded 1265 01:00:00,680 --> 01:00:03,470 that really big TSV file from IMDB. 1266 01:00:03,470 --> 01:00:07,700 I've simplified it into some CSV files, but that really gets me half of the way 1267 01:00:07,700 --> 01:00:10,910 there, because now if I want to search the data, as with search.py, 1268 01:00:10,910 --> 01:00:14,540 I still have to look over thousands of rows of TV shows, 1269 01:00:14,540 --> 01:00:17,990 and it's only going to be linear search by nature of just how files are read, 1270 01:00:17,990 --> 01:00:19,700 top to bottom, left to right. 1271 01:00:19,700 --> 01:00:25,130 But it turns out in Python that you can actually write SQL code itself. 
1272 01:00:25,130 --> 01:00:28,640 And CS50 has a Python library that has not only get string and get int 1273 01:00:28,640 --> 01:00:33,200 and so forth, but it also has a SQL function built in that allows you 1274 01:00:33,200 --> 01:00:38,000 to connect to, so to speak, a file that ends in something like .db, 1275 01:00:38,000 --> 01:00:41,960 which is to say, in a moment, we'll start to write some Python code now 1276 01:00:41,960 --> 01:00:47,750 toward an end of loading a really large data set like IMDB's into a proper SQL 1277 01:00:47,750 --> 01:00:51,500 database, thereby allowing us hereafter to use all of the power 1278 01:00:51,500 --> 01:00:54,650 and expressiveness of SQL, and more examples of that in just a bit like 1279 01:00:54,650 --> 01:00:58,160 SELECT, UPDATE, DELETE, and INSERT, without having to write a whole bunch 1280 01:00:58,160 --> 01:00:59,120 of Python code. 1281 01:00:59,120 --> 01:01:03,860 So to be clear, instead of using Python to search and manipulate our data, 1282 01:01:03,860 --> 01:01:08,120 we're going to write a script, a program in Python, whose sole purpose in life 1283 01:01:08,120 --> 01:01:10,283 is to get data from one format into another. 1284 01:01:10,283 --> 01:01:12,950 And you can imagine this being generally useful, whether it's 1285 01:01:12,950 --> 01:01:15,260 a Google Spreadsheet you've downloaded or a large data 1286 01:01:15,260 --> 01:01:17,750 set you found on the internet that you want to use for a final project 1287 01:01:17,750 --> 01:01:18,830 or for some other class. 1288 01:01:18,830 --> 01:01:22,640 Python can be a really powerful way of taking one data source as input 1289 01:01:22,640 --> 01:01:26,330 and producing as its output now, as of today, SQL instead. 1290 01:01:26,330 --> 01:01:31,670 So let's go ahead and iterate one final time over our title.basics.tsv. 1291 01:01:31,670 --> 01:01:34,220 But this time, not just save it into a CSV file. 
1292 01:01:34,220 --> 01:01:39,740 Let's put it into a proper SQL database on my own Mac or your PC. 1293 01:01:39,740 --> 01:01:41,510 So let me go ahead and do this. 1294 01:01:41,510 --> 01:01:47,150 First, let me go ahead and say, just like on the slide, db = cs50.sql, 1295 01:01:47,150 --> 01:01:53,567 and then quote unquote, "sqlite:///--" so the third slash is not a typo. 1296 01:01:53,567 --> 01:01:54,650 It should indeed be there. 1297 01:01:54,650 --> 01:01:57,260 And I'm going to say shows3.db, just because this 1298 01:01:57,260 --> 01:01:59,690 is version 3 now of my import script. 1299 01:01:59,690 --> 01:02:03,680 I'm going to go ahead, just as last week, and now import CSV-- 1300 01:02:03,680 --> 01:02:06,240 CS50's library as well. 1301 01:02:06,240 --> 01:02:11,330 But for this to work, the file shows3.db needs to exist first. 1302 01:02:11,330 --> 01:02:15,260 And there's a couple of ways on a Mac or a PC or a Linux computer, typically, 1303 01:02:15,260 --> 01:02:18,440 to create an empty file that's ready to receive data. 1304 01:02:18,440 --> 01:02:20,990 You can literally use the command touch, which 1305 01:02:20,990 --> 01:02:25,160 just will create an empty file by whatever name you type at the prompt. 1306 01:02:25,160 --> 01:02:26,900 Or we can do this programmatically. 1307 01:02:26,900 --> 01:02:28,310 And I'm going to do it programmatically, because I 1308 01:02:28,310 --> 01:02:30,352 bet I'm going to screw up one or more times here, 1309 01:02:30,352 --> 01:02:34,220 and it's going to be useful to let my Python program create and recreate 1310 01:02:34,220 --> 01:02:37,250 the database again and again and again until I get it right. 1311 01:02:37,250 --> 01:02:42,920 So let me go ahead and open a file called shows3.db in write mode. 
1312 01:02:42,920 --> 01:02:45,860 And recall, as with fopen in C, 1313 01:02:45,860 --> 01:02:50,180 anytime you open a file in write mode, it will overwrite any file that exists 1314 01:02:50,180 --> 01:02:52,340 or create any file that doesn't. 1315 01:02:52,340 --> 01:02:53,780 That's all I needed to do. 1316 01:02:53,780 --> 01:02:57,830 So in Python, too, recall that we were able to use this dot notation. 1317 01:02:57,830 --> 01:02:59,780 And it turns out here, when you open a file, 1318 01:02:59,780 --> 01:03:02,900 if you want to immediately close it, because your only goal was to create 1319 01:03:02,900 --> 01:03:06,930 it, you can just do .close on the very thing you just opened. 1320 01:03:06,930 --> 01:03:09,570 That is equivalent, just to be clear, to doing 1321 01:03:09,570 --> 01:03:14,790 something a little more pedantic like this, file = open, and then file.close. 1322 01:03:14,790 --> 01:03:18,070 But we can collapse this into one slick one-liner, so to speak, 1323 01:03:18,070 --> 01:03:21,840 by just doing instead what I did a moment ago. 1324 01:03:21,840 --> 01:03:27,330 All that does is create an empty shows3.db file. 1325 01:03:27,330 --> 01:03:32,385 Now, open that file for SQLite. 1326 01:03:32,385 --> 01:03:34,260 And again, SQLite is the light version of SQL 1327 01:03:34,260 --> 01:03:36,060 that anyone can use on their own Mac or PC. 1328 01:03:36,060 --> 01:03:39,220 You don't need a special server to get up and running with it. 1329 01:03:39,220 --> 01:03:42,990 So now let me go ahead and open up title.basics.tsv. 1330 01:03:42,990 --> 01:03:48,060 And then here, let me go ahead and create myself a DictReader so 1331 01:03:48,060 --> 01:03:50,910 that I can iterate over the lines in that TSV file. 1332 01:03:50,910 --> 01:03:55,180 And now, let me go ahead, and for row in reader, do the following. 1333 01:03:55,180 --> 01:03:57,390 I first want to filter out stuff just as before. 
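That create-then-close idiom can be sketched as follows, using a temporary directory so the example doesn't clutter the working directory:

```python
import os
import tempfile

# open(path, "w") truncates an existing file or creates a missing one;
# calling .close() immediately leaves behind an empty file, ready for
# SQLite to use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shows3.db")
    open(path, "w").close()  # same as: file = open(path, "w"); file.close()
    exists = os.path.exists(path)
    size = os.path.getsize(path)

print(exists, size)  # True 0
```

On the command line, `touch shows3.db` would accomplish the same thing, but doing it in the script lets the database be recreated on every run.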
1334 01:03:57,390 --> 01:04:04,200 So I'm going to say if row bracket "titleType" == "tvSeries" 1335 01:04:04,200 --> 01:04:10,830 and row "isAdult" == quote unquote "0", then I'm going to go ahead and check 1336 01:04:10,830 --> 01:04:11,720 one other thing. 1337 01:04:11,720 --> 01:04:15,465 I'm going to go ahead and give myself a start year variable, similar to before, 1338 01:04:15,465 --> 01:04:17,130 although I called it year earlier. 1339 01:04:17,130 --> 01:04:19,980 Then let me go ahead and do row "startYear", 1340 01:04:19,980 --> 01:04:22,050 just so I can cast that to an int. 1341 01:04:22,050 --> 01:04:27,240 But I only want to do that if row "startYear" does not 1342 01:04:27,240 --> 01:04:31,620 equal that special backslash N that IMDB told me to watch out for. 1343 01:04:31,620 --> 01:04:34,600 So I only want to do that if it's not that. 1344 01:04:34,600 --> 01:04:38,670 And then if startYear is greater than or equal to 1970, 1345 01:04:38,670 --> 01:04:41,080 let's go ahead and do the following. 1346 01:04:41,080 --> 01:04:46,680 Let's go ahead and do genres, gets row "genres". 1347 01:04:46,680 --> 01:04:50,073 Let's go ahead and get tconst, gets row "tconst," 1348 01:04:50,073 --> 01:04:52,740 just so I can put these in some slightly shorter variable names, 1349 01:04:52,740 --> 01:04:54,360 just to keep myself sane. 1350 01:04:54,360 --> 01:04:58,770 primaryTitle is going to be from row "primaryTitle." 1351 01:04:58,770 --> 01:05:00,720 and then let me go ahead and give myself-- 1352 01:05:00,720 --> 01:05:04,150 we already have startYear, so those are the only other three fields I need. 1353 01:05:04,150 --> 01:05:08,460 So now, I want to go ahead and insert this row from my TSV 1354 01:05:08,460 --> 01:05:10,002 into a SQLite database. 1355 01:05:10,002 --> 01:05:12,960 And the operative word that we saw earlier that we haven't used it yet, 1356 01:05:12,960 --> 01:05:13,710 is INSERT. 1357 01:05:13,710 --> 01:05:15,390 We did use SELECT. 
1358 01:05:15,390 --> 01:05:16,500 We did use UPDATE. 1359 01:05:16,500 --> 01:05:17,460 We did use DELETE. 1360 01:05:17,460 --> 01:05:18,850 We haven't used INSERT yet. 1361 01:05:18,850 --> 01:05:20,350 So I'm going to do that in a moment. 1362 01:05:20,350 --> 01:05:23,200 But first, I need my database to actually exist, 1363 01:05:23,200 --> 01:05:25,330 so I need to create an actual table. 1364 01:05:25,330 --> 01:05:28,860 So I'm going to go up here first and do this, db, is a reference now, 1365 01:05:28,860 --> 01:05:30,995 a variable representing my database, and I'm 1366 01:05:30,995 --> 01:05:33,120 going to call the only function inside of it that's 1367 01:05:33,120 --> 01:05:36,000 useful for our purposes, called execute. 1368 01:05:36,000 --> 01:05:39,240 What I can now do is execute any SQL I want. 1369 01:05:39,240 --> 01:05:41,280 So what do I want to load into this database? 1370 01:05:41,280 --> 01:05:45,930 I think I want to load in the tconst, the primaryTitle, the startYear, 1371 01:05:45,930 --> 01:05:51,690 and the genres, just like we had earlier from title.basics.tsv. 1372 01:05:51,690 --> 01:05:55,140 I want to load rows that represent this kind of data, all right? 1373 01:05:55,140 --> 01:05:56,620 So how am I going to do this? 1374 01:05:56,620 --> 01:05:58,590 Well, let me go ahead and create a table. 1375 01:05:58,590 --> 01:06:02,880 I'm going to call it shows, because that seems nice and conceptually consistent. 1376 01:06:02,880 --> 01:06:06,510 I'm going to go ahead and create a list of columns now. 1377 01:06:06,510 --> 01:06:08,400 tconst is going to be one column. 1378 01:06:08,400 --> 01:06:10,290 primaryTitle is going to be another. 1379 01:06:10,290 --> 01:06:12,000 startYear is going to be another. 1380 01:06:12,000 --> 01:06:14,610 And genres is going to be the last. 
1381 01:06:14,610 --> 01:06:20,310 I can literally, that is to say, write SQL inside of a string 1382 01:06:20,310 --> 01:06:23,700 that I pass to a Python function called db.execute. 1383 01:06:23,700 --> 01:06:27,300 And because db.execute-- or rather, because db 1384 01:06:27,300 --> 01:06:34,380 was configured with shows3.db, when I execute this string in Python, 1385 01:06:34,380 --> 01:06:38,100 it's going to get executed on that database file, shows3.db. 1386 01:06:38,100 --> 01:06:40,573 So it's a nice way of bridging these two worlds. 1387 01:06:40,573 --> 01:06:42,990 So I'm going to have to be a little more specific, though. 1388 01:06:42,990 --> 01:06:45,707 Recall that SQL has a bunch of types. 1389 01:06:45,707 --> 01:06:47,040 And I'm going to keep it simple. 1390 01:06:47,040 --> 01:06:50,910 I'm going to go ahead and say that the type of that tconst value is text. 1391 01:06:50,910 --> 01:06:53,220 The type of the primaryTitle is text. 1392 01:06:53,220 --> 01:06:56,400 The type of startYear is going to be numeric, kind of a catch 1393 01:06:56,400 --> 01:06:58,020 all for dates and date times. 1394 01:06:58,020 --> 01:07:00,390 And then genres is going to be text as well. 1395 01:07:00,390 --> 01:07:01,770 So the syntax is a little funky. 1396 01:07:01,770 --> 01:07:04,860 You actually specify the name of the column and then the type, 1397 01:07:04,860 --> 01:07:08,408 as opposed to the opposite, which we did in C. But that's the way SQL is. 1398 01:07:08,408 --> 01:07:10,200 So I'm going to go ahead and save that now. 1399 01:07:10,200 --> 01:07:14,180 And just to comment this, this is going to create 1400 01:07:14,180 --> 01:07:21,590 a table called shows in database file called shows3.db, 1401 01:07:21,590 --> 01:07:23,100 just to be super explicit. 1402 01:07:23,100 --> 01:07:24,780 So what am I going to do down here? 
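The CREATE TABLE step can be sketched with Python's built-in sqlite3 module in place of the cs50 library (the SQL string is the same either way; only where `db` comes from differs, and an in-memory database stands in for shows3.db):

```python
import sqlite3

# An in-memory database stands in for the shows3.db file.
db = sqlite3.connect(":memory:")

# Column name first, then type, as SQL expects (the reverse of C).
db.execute(
    "CREATE TABLE shows "
    "(tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)"
)

# .schema in the sqlite3 command-line program prints this DDL; from code,
# the same text can be read back from the sqlite_master catalog table.
schema = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'shows'"
).fetchone()[0]
print(schema)
```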
1403 01:07:24,780 --> 01:07:27,450 It looks like I have the ability with CS50's library 1404 01:07:27,450 --> 01:07:29,320 to execute any SQL I want. 1405 01:07:29,320 --> 01:07:34,230 So let me go ahead and insert into shows the following values, 1406 01:07:34,230 --> 01:07:39,570 a tconst, a primaryTitle, a startYear, and a genre-- 1407 01:07:39,570 --> 01:07:40,920 and genres. 1408 01:07:40,920 --> 01:07:41,940 What values? 1409 01:07:41,940 --> 01:07:43,443 I want to insert these values. 1410 01:07:43,443 --> 01:07:45,360 Now, I don't know in advance, so I'm literally 1411 01:07:45,360 --> 01:07:47,380 going to put some question marks here. 1412 01:07:47,380 --> 01:07:51,630 And it turns out in SQL, this is valid syntax for the library we're using. 1413 01:07:51,630 --> 01:07:54,000 This is this INSERT query. 1414 01:07:54,000 --> 01:07:57,750 INSERT INTO the table name a parenthesized list 1415 01:07:57,750 --> 01:08:00,270 of the columns you want to insert data into. 1416 01:08:00,270 --> 01:08:03,078 Then, a set of values in separate parentheses. 1417 01:08:03,078 --> 01:08:05,370 And for now, I'm using question marks for placeholders, 1418 01:08:05,370 --> 01:08:07,248 for reasons we'll come back to. 1419 01:08:07,248 --> 01:08:09,540 But I'm going to go ahead and plug the following values 1420 01:08:09,540 --> 01:08:18,479 into those placeholders, tconst, primaryTitle, startYear, and genres. 1421 01:08:18,479 --> 01:08:20,819 And what the db.execute function is going 1422 01:08:20,819 --> 01:08:24,399 to do for me automatically is it's going to look at this SQL query. 1423 01:08:24,399 --> 01:08:27,630 Notice that, oh, it's got four question marks in it, or placeholders. 1424 01:08:27,630 --> 01:08:32,850 Those, in SQL, are like the %s was in C or like the curly braces 1425 01:08:32,850 --> 01:08:34,319 in Python f-strings. 
1426 01:08:34,319 --> 01:08:38,130 So this says, give me 1, 2, 3, 4 placeholders and plug in, ultimately, 1427 01:08:38,130 --> 01:08:41,220 the following four values, tconst, which is just a variable; 1428 01:08:41,220 --> 01:08:44,040 primaryTitle, which is the same; startYear, which is the same; 1429 01:08:44,040 --> 01:08:45,968 and genres, which is the same. 1430 01:08:45,968 --> 01:08:47,010 So what am I going to do? 1431 01:08:47,010 --> 01:08:52,290 For every TV series in this file that's not an adult series 1432 01:08:52,290 --> 01:08:57,270 and that started in 1970 or later, insert it into my database. 1433 01:08:57,270 --> 01:09:00,689 If I've made no typos, I'm going to go ahead and run this, cross my fingers, 1434 01:09:00,689 --> 01:09:02,160 and enter. 1435 01:09:02,160 --> 01:09:05,279 This one is going to take more time, because it turns out writing 1436 01:09:05,279 --> 01:09:07,109 to a CSV file is actually pretty quick. 1437 01:09:07,109 --> 01:09:10,080 You can just write row, write row, write row, write row, 1438 01:09:10,080 --> 01:09:13,979 but inserting into a SQLite database is going to take more time. 1439 01:09:13,979 --> 01:09:19,000 More time upfront, but it's going to be a lot faster to search thereafter. 1440 01:09:19,000 --> 01:09:23,520 So let me go ahead and do the cake in the oven thing 1441 01:09:23,520 --> 01:09:25,680 and go ahead and now open up a file I made 1442 01:09:25,680 --> 01:09:29,250 in advance in today's src3 directory called shows3.db 1443 01:09:29,250 --> 01:09:33,000 using SQLite3, that command line program we used earlier. 1444 01:09:33,000 --> 01:09:38,310 Recall that I can say .schema to see the types of data in the database. 1445 01:09:38,310 --> 01:09:40,470 And indeed, look what I've done in advance. 1446 01:09:40,470 --> 01:09:47,250 I created a table called shows with exactly those columns in a-- 1447 01:09:47,250 --> 01:09:51,840 with exactly these four columns, tconst, primaryTitle, startYear, and genres. 
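The parameterized INSERT can likewise be sketched with the built-in sqlite3 module, which uses the same `?` placeholder syntax as cs50's db.execute; the row below is a made-up example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE shows "
    "(tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)"
)

# The ? placeholders are filled in safely by the library, one per value,
# much like %s in C's printf or {} in a Python f-string, but with quoting
# and escaping handled for you.
db.execute(
    "INSERT INTO shows (tconst, primaryTitle, startYear, genres) "
    "VALUES (?, ?, ?, ?)",
    ("tt0386676", "The Office", 2005, "Comedy"),
)

count = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
title = db.execute("SELECT primaryTitle FROM shows").fetchone()[0]
print(count, title)  # 1 The Office
```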
1448 01:09:51,840 --> 01:09:55,950 But I did this in advance to save us time so that I can now do SELECT 1449 01:09:55,950 --> 01:09:58,268 * FROM shows. 1450 01:09:58,268 --> 01:09:59,560 And let me not get all of them. 1451 01:09:59,560 --> 01:10:02,580 Let me do the first 10, semicolon. 1452 01:10:02,580 --> 01:10:03,120 All right. 1453 01:10:03,120 --> 01:10:06,610 So we see the first 10 shows from IMDB in whatever order 1454 01:10:06,610 --> 01:10:08,010 IMDB distributes them. 1455 01:10:08,010 --> 01:10:10,590 You can see their ID numbers are incrementing. 1456 01:10:10,590 --> 01:10:11,460 And All My Children. 1457 01:10:11,460 --> 01:10:13,418 I remember growing up with that show years ago. 1458 01:10:13,418 --> 01:10:15,410 And it seems that that's a drama-- 1459 01:10:15,410 --> 01:10:18,540 it's not really mystery, but so be it, and a romance show there. 1460 01:10:18,540 --> 01:10:23,370 But it's indeed 1970, as are every show thereafter in 1970 or onward. 1461 01:10:23,370 --> 01:10:26,350 I can go ahead and search for more like this. 1462 01:10:26,350 --> 01:10:28,590 Let me give myself the first 100 shows. 1463 01:10:28,590 --> 01:10:30,540 All right, so this is a pretty large data set. 1464 01:10:30,540 --> 01:10:32,370 And let me go ahead and count them all. 1465 01:10:32,370 --> 01:10:34,745 Recall that you can use a COUNT function so that we don't 1466 01:10:34,745 --> 01:10:36,330 have to print them all on my screen. 1467 01:10:36,330 --> 01:10:42,090 It looks like there are 153,331 TV series in IMDB. 1468 01:10:42,090 --> 01:10:45,720 Crazier than that, you want to know how many of them came out this year? 1469 01:10:45,720 --> 01:10:50,430 WHERE startYear = 2019;. 1470 01:10:50,430 --> 01:10:57,090 There were 6,099 new TV shows in the world this year according to IMDB. 1471 01:10:57,090 --> 01:11:00,900 Just one of those is The Office that won our-- 1472 01:11:00,900 --> 01:11:01,950 won our vote earlier. 
1473 01:11:01,950 --> 01:11:06,280 So SELECT *, star denoting wild card, in this case, everything. 1474 01:11:06,280 --> 01:11:08,290 And unfortunately, SQL has two wild cards. 1475 01:11:08,290 --> 01:11:12,600 Star means select all of the columns in the table. 1476 01:11:12,600 --> 01:11:17,220 Percent means let any characters come before or after a quoted 1477 01:11:17,220 --> 01:11:18,870 string in a WHERE clause. 1478 01:11:18,870 --> 01:11:24,660 So let me go ahead and SELECT * FROM shows WHERE title = The Office, 1479 01:11:24,660 --> 01:11:26,700 and we'll actually see-- whoops. 1480 01:11:26,700 --> 01:11:28,800 primaryTitle, sorry. 1481 01:11:28,800 --> 01:11:31,530 Let me fix that. primaryTitle = The Office. 1482 01:11:31,530 --> 01:11:33,150 There are all of those Offices. 1483 01:11:33,150 --> 01:11:37,230 And indeed, 2005 is probably the one we know and love. 1484 01:11:37,230 --> 01:11:44,370 If I go to a browser and go to IMDB slash title slash that ID, 1485 01:11:44,370 --> 01:11:46,620 indeed, that's probably the one we're all thinking of, 1486 01:11:46,620 --> 01:11:48,790 unless you voted for the UK version instead. 1487 01:11:48,790 --> 01:11:52,290 So again, this is actually real, live data that we're now playing with. 1488 01:11:52,290 --> 01:11:53,738 Well, what more can we do? 1489 01:11:53,738 --> 01:11:55,530 Well, there's one thing that I don't really 1490 01:11:55,530 --> 01:12:00,540 like about this, which is that when we select all of the columns-- 1491 01:12:00,540 --> 01:12:02,130 and let's go ahead and do this. 1492 01:12:02,130 --> 01:12:04,440 Let's select another hundred of them before. 1493 01:12:04,440 --> 01:12:10,470 This feels a little messy that we have all of these nice, clean columns except 1494 01:12:10,470 --> 01:12:12,270 for when we get to genres. 1495 01:12:12,270 --> 01:12:14,902 Then we just have this arbitrary comma-separated list. 
1496 01:12:14,902 --> 01:12:16,860 Suppose, for instance, I want to search for all 1497 01:12:16,860 --> 01:12:19,820 of the comedies that came out in 2019. 1498 01:12:19,820 --> 01:12:27,870 I could say SELECT * FROM shows where genres = "Comedy" AND-- 1499 01:12:27,870 --> 01:12:29,880 turns out you can use conjunctions like this-- 1500 01:12:29,880 --> 01:12:32,970 startYear = 2019. 1501 01:12:32,970 --> 01:12:34,290 So that gives me a whole bunch. 1502 01:12:34,290 --> 01:12:35,490 Let's count them. 1503 01:12:35,490 --> 01:12:38,490 So COUNT this here, Enter. 1504 01:12:38,490 --> 01:12:42,390 OK, a thousand of those 6,000 shows are comedies. 1505 01:12:42,390 --> 01:12:44,850 But I think that's an underestimate. 1506 01:12:44,850 --> 01:12:48,810 Why is this query buggy at the moment? 1507 01:12:48,810 --> 01:12:49,458 Yeah? 1508 01:12:49,458 --> 01:12:50,587 AUDIENCE: [INAUDIBLE] 1509 01:12:50,587 --> 01:12:52,920 DAVID MALAN: Yeah, some of them had more than one genre, 1510 01:12:52,920 --> 01:12:56,070 so comedy is somewhere in that comma-separated list. 1511 01:12:56,070 --> 01:13:00,450 And so what I should probably do instead is not say genre = "Comedy," 1512 01:13:00,450 --> 01:13:05,490 but maybe genres LIKE "Comedy," and allow something maybe to appear before, 1513 01:13:05,490 --> 01:13:07,140 something maybe to appear after. 1514 01:13:07,140 --> 01:13:11,040 And that's going to give me 1,593 comedies that 1515 01:13:11,040 --> 01:13:13,290 came out this year in 2019. 1516 01:13:13,290 --> 01:13:14,820 So that seems a little better. 1517 01:13:14,820 --> 01:13:17,220 But this is not very robust, right? 1518 01:13:17,220 --> 01:13:19,890 Once you start resorting to techniques like this, 1519 01:13:19,890 --> 01:13:22,920 it should start, as a programmer, to rub you the wrong way. 1520 01:13:22,920 --> 01:13:24,168 It's kind of a hack, right? 
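The difference between `=` and `LIKE '%Comedy%'` is easy to see on a tiny, made-up table: equality only matches rows whose entire genres string is "Comedy", while LIKE with percent wildcards also catches multi-genre rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE shows (primaryTitle TEXT, startYear NUMERIC, genres TEXT)"
)
db.executemany(
    "INSERT INTO shows VALUES (?, ?, ?)",
    [
        ("Show A", 2019, "Comedy"),        # exactly "Comedy"
        ("Show B", 2019, "Comedy,Drama"),  # comedy among other genres
        ("Show C", 2019, "Drama"),         # not a comedy at all
    ],
)

# Equality misses the multi-genre row.
exact = db.execute(
    "SELECT COUNT(*) FROM shows WHERE genres = 'Comedy' AND startYear = 2019"
).fetchone()[0]

# LIKE with % wildcards on either side catches it.
fuzzy = db.execute(
    "SELECT COUNT(*) FROM shows "
    "WHERE genres LIKE '%Comedy%' AND startYear = 2019"
).fetchone()[0]

print(exact, fuzzy)  # 1 2
```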
1521 01:13:24,168 --> 01:13:27,210 Like, you're searching for comedy, but there could be something before it 1522 01:13:27,210 --> 01:13:29,640 or something after it, and odds are, there 1523 01:13:29,640 --> 01:13:32,400 is no other word I can think of in the world of genres 1524 01:13:32,400 --> 01:13:36,142 that starts with or ends with comedy, so we're probably OK. 1525 01:13:36,142 --> 01:13:38,850 But this is kind of hack-ish, that you're just kind of searching. 1526 01:13:38,850 --> 01:13:44,790 It would be nice if we could just search for a specific column called genre. 1527 01:13:44,790 --> 01:13:47,170 So how can we go about doing that? 1528 01:13:47,170 --> 01:13:50,670 Well, let me go ahead and do the following instead. 1529 01:13:50,670 --> 01:13:54,570 Let me go ahead and open up a final version of my import script, 1530 01:13:54,570 --> 01:13:58,020 this one that does two things up top. 1531 01:13:58,020 --> 01:14:02,780 At the top, I'm going to create two tables, one called shows, which has-- 1532 01:14:02,780 --> 01:14:04,950 I'm going to clean up the column names, too. 1533 01:14:04,950 --> 01:14:06,930 IMDB is a little nonconventional. 1534 01:14:06,930 --> 01:14:10,290 What most people would do when describing a unique identifier, 1535 01:14:10,290 --> 01:14:12,210 they're going to call it ID, not tconst. 1536 01:14:12,210 --> 01:14:13,895 So we're going to rename it to ID. 1537 01:14:13,895 --> 01:14:16,020 They're not going to call their title primaryTitle. 1538 01:14:16,020 --> 01:14:18,660 They're going to call it title, so we're going to rename it title. 1539 01:14:18,660 --> 01:14:19,770 They're not going to name it startYear. 1540 01:14:19,770 --> 01:14:20,937 We're going to call it year. 1541 01:14:20,937 --> 01:14:22,060 And then that's it. 1542 01:14:22,060 --> 01:14:24,270 We'll come back to primary key in just a moment. 1543 01:14:24,270 --> 01:14:25,530 But notice this. 
1544 01:14:25,530 --> 01:14:28,320 In my new and final version of this script, I'm creating, 1545 01:14:28,320 --> 01:14:33,060 I propose, a second table called genres whose purpose in life 1546 01:14:33,060 --> 01:14:38,250 is to contain a value called show_id and another one called genre. 1547 01:14:38,250 --> 01:14:39,750 So what's going on? 1548 01:14:39,750 --> 01:14:42,810 Well, let me go ahead and show this. 1549 01:14:42,810 --> 01:14:48,930 If I load the resulting database from this one, shows4.db, and I do .schema, 1550 01:14:48,930 --> 01:14:50,850 you'll see that I indeed have two tables. 1551 01:14:50,850 --> 01:14:55,150 Let me go ahead and SELECT * FROM shows WHERE title, this time, 1552 01:14:55,150 --> 01:14:59,410 because I've renamed it from primaryTitle, = The Office. 1553 01:14:59,410 --> 01:14:59,910 OK. 1554 01:14:59,910 --> 01:15:01,050 That's a lot of Offices. 1555 01:15:01,050 --> 01:15:06,700 But let's go, AND year = 2005, which is the one we're all thinking about. 1556 01:15:06,700 --> 01:15:07,860 And it's that one. 1557 01:15:07,860 --> 01:15:09,810 And now, notice this. 1558 01:15:09,810 --> 01:15:12,360 Notice that I'm getting back what? 1559 01:15:12,360 --> 01:15:19,590 An ID, I'm getting back a title, and I'm getting back a year, but no genres. 1560 01:15:19,590 --> 01:15:24,053 That's because there's another table now called genres that's separate. 1561 01:15:24,053 --> 01:15:25,470 And you know, I'm kind of curious. 1562 01:15:25,470 --> 01:15:27,763 I see that a genre table has show_id. 1563 01:15:27,763 --> 01:15:28,680 Let me go and do this. 1564 01:15:28,680 --> 01:15:34,200 SELECT * FROM genres WHERE show_id =, and let 1565 01:15:34,200 --> 01:15:37,830 me do a little copy paste here, = this show_id. 1566 01:15:37,830 --> 01:15:39,780 And what might I see? 1567 01:15:39,780 --> 01:15:41,100 Comedy. 1568 01:15:41,100 --> 01:15:42,840 So what have we done now? 
1569 01:15:42,840 --> 01:15:46,170 For any TV show that was in IMDB's database that 1570 01:15:46,170 --> 01:15:50,010 had a comma-separated list of genres, I've exploded it, so to speak. 1571 01:15:50,010 --> 01:15:52,170 I've split that value on the commas. 1572 01:15:52,170 --> 01:15:56,070 And if the show is a comedy, I've added a row in this genres table, 1573 01:15:56,070 --> 01:15:59,310 but then I've jotted down the show's ID next to that genre 1574 01:15:59,310 --> 01:16:02,670 so I remember that that show was of that genre. 1575 01:16:02,670 --> 01:16:05,770 But if another show has multiple fields-- for instance, 1576 01:16:05,770 --> 01:16:10,980 let's go ahead and search for not The Office but, say, The Crown. 1577 01:16:10,980 --> 01:16:12,540 And there's only one of those. 1578 01:16:12,540 --> 01:16:20,190 And now I do SELECT * FROM genres WHERE show_id = this number, 1579 01:16:20,190 --> 01:16:25,360 we'll see that, oh, The Crown now has drama and history as a genre. 1580 01:16:25,360 --> 01:16:27,570 And so therefore, in the genres table, notice 1581 01:16:27,570 --> 01:16:30,690 that there's two IDs and two genres. 1582 01:16:30,690 --> 01:16:32,220 But now we can use equality. 1583 01:16:32,220 --> 01:16:39,300 I can now search for all of the comedies from 2019 in kind of a powerful way, 1584 01:16:39,300 --> 01:16:47,010 SELECT * FROM shows WHERE id IN-- 1585 01:16:47,010 --> 01:16:48,330 and here's the cool part-- 1586 01:16:48,330 --> 01:17:01,020 SELECT show_id FROM genres WHERE genre = "Comedy" AND year = 2019. 1587 01:17:01,020 --> 01:17:03,960 So this is admittedly a mouthful, but let's consider what's happening. 1588 01:17:03,960 --> 01:17:05,910 First, I'm selecting star from shows. 1589 01:17:05,910 --> 01:17:08,760 That means give me all the shows in all of the columns.
1590 01:17:08,760 --> 01:17:14,640 But filter as follows, only show me those shows where the ID of the show 1591 01:17:14,640 --> 01:17:17,170 is in the following list of IDs. 1592 01:17:17,170 --> 01:17:20,610 Now, you can look at a nested query inside the parentheses here. 1593 01:17:20,610 --> 01:17:25,140 This list here selects all of the show IDs from the genres table 1594 01:17:25,140 --> 01:17:27,380 where genre = "Comedy". 1595 01:17:27,380 --> 01:17:31,770 So the highlighted parenthetical right now returns essentially a list 1596 01:17:31,770 --> 01:17:35,130 of all of the ID numbers of shows that are associated with comedy, 1597 01:17:35,130 --> 01:17:37,900 even if they're associated with other things, too. 1598 01:17:37,900 --> 01:17:41,400 And we're making sure that the year equals 2019. 1599 01:17:41,400 --> 01:17:46,290 So if I now hit Enter, we'll see a whole bunch of results, but we should see, 1600 01:17:46,290 --> 01:17:49,110 if I count these by using my usual syntax, 1601 01:17:49,110 --> 01:17:54,120 there were 1,593 shows that are comedies in 2019. 1602 01:17:54,120 --> 01:17:58,200 That does happen to equal the same count we did earlier by using like, 1603 01:17:58,200 --> 01:18:01,560 but this is better designed in the sense that there's no ambiguity. 1604 01:18:01,560 --> 01:18:03,730 You're not just hackishly looking for a substring, 1605 01:18:03,730 --> 01:18:06,060 so to speak, in a comma-separated list. 1606 01:18:06,060 --> 01:18:08,580 You can actually now search more robustly 1607 01:18:08,580 --> 01:18:10,710 by having redesigned your data. 1608 01:18:10,710 --> 01:18:13,830 And what we've done really is something like this. 
1609 01:18:13,830 --> 01:18:18,150 Instead of storing our data in just one table called shows, 1610 01:18:18,150 --> 01:18:22,230 and every show has an ID, a title, a year, and genres, 1611 01:18:22,230 --> 01:18:26,880 we've instead exploded those genres into two separate tables, such 1612 01:18:26,880 --> 01:18:29,650 that now our shows table looks like this. 1613 01:18:29,650 --> 01:18:32,200 We have an ID, a title, and a year. 1614 01:18:32,200 --> 01:18:37,350 And notice that the ID of a show can also appear over here 1615 01:18:37,350 --> 01:18:40,950 in another table called genres, and just by convention, to keep us sane, 1616 01:18:40,950 --> 01:18:43,380 instead of calling it ID here, we've called 1617 01:18:43,380 --> 01:18:48,060 it show_id to connote that it came from a table called shows, plural, 1618 01:18:48,060 --> 01:18:50,010 but it's a single ID from a show. 1619 01:18:50,010 --> 01:18:55,050 So by convention, humans often call them table name, minus the S, underscore ID. 1620 01:18:55,050 --> 01:18:59,280 And then the keyword here or genre is comedy or drama or documentary 1621 01:18:59,280 --> 01:19:01,443 or some other genre as well. 1622 01:19:01,443 --> 01:19:03,360 And now, this is a little subtle, but the fact 1623 01:19:03,360 --> 01:19:05,940 that this little symbol here, drawn from the database world, 1624 01:19:05,940 --> 01:19:08,370 flails out into three separate places, this 1625 01:19:08,370 --> 01:19:11,370 is a one-to-many relationship, so to speak. 1626 01:19:11,370 --> 01:19:15,630 You can have one show over here mapping to many genres over here, 1627 01:19:15,630 --> 01:19:21,090 or maybe zero, but it's zero or more possible genres. 1628 01:19:21,090 --> 01:19:21,990 All right. 1629 01:19:21,990 --> 01:19:26,720 Any questions just yet? 1630 01:19:26,720 --> 01:19:27,220 All right. 
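The whole redesign just described, splitting the comma-separated genres into a second table and then searching with a nested query, can be sketched end to end with Python's built-in sqlite3 module. The database here is in-memory and the rows (including the show titled "Some Comedy") are made-up stand-ins for IMDb's data:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for shows4.db

# One table for shows, one for the exploded genres
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")

# IMDb-style comma-separated genre strings, split into one row apiece
sample = [(1, "The Office", 2005, "Comedy"),
          (2, "The Crown", 2016, "Drama,History"),
          (3, "Some Comedy", 2019, "Comedy")]
for show_id, title, year, genre_list in sample:
    db.execute("INSERT INTO shows VALUES (?, ?, ?)", (show_id, title, year))
    for genre in genre_list.split(","):
        db.execute("INSERT INTO genres VALUES (?, ?)", (show_id, genre))

# The parenthetical yields the IDs of every comedy; the outer query
# then keeps only those shows whose year is 2019
comedies = db.execute("""
    SELECT * FROM shows
    WHERE id IN (SELECT show_id FROM genres WHERE genre = 'Comedy')
    AND year = 2019
""").fetchall()
print(comedies)  # [(3, 'Some Comedy', 2019)]
```

Note that the year filter lives in the outer query, since the genres table has no year column; only shows does.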
1631 01:19:27,220 --> 01:19:30,430 So the real power then, to be clear, is coming, now, 1632 01:19:30,430 --> 01:19:31,870 from this kind of expressiveness. 1633 01:19:31,870 --> 01:19:34,495 So now, let's play around with some other queries and features. 1634 01:19:34,495 --> 01:19:37,330 But first, let's give ourselves a bit more data. 1635 01:19:37,330 --> 01:19:40,420 It turns out besides the file called title.basics.tsv, 1636 01:19:40,420 --> 01:19:44,140 IMDB.com makes a bunch of others available to us as well. 1637 01:19:44,140 --> 01:19:46,900 There is one called name.basics.tsv. 1638 01:19:46,900 --> 01:19:49,420 And this is one that has information on all 1639 01:19:49,420 --> 01:19:53,690 of the names of actors and actresses and directors and writers 1640 01:19:53,690 --> 01:19:54,940 and other people in the world. 1641 01:19:54,940 --> 01:19:58,600 So for instance, there is an nconst, which is like a name constant 1642 01:19:58,600 --> 01:20:03,190 or an ID, nm2946516, which happens to belong 1643 01:20:03,190 --> 01:20:06,550 to the actor whose primary name is Claire Foy, the star of The Crown. 1644 01:20:06,550 --> 01:20:10,990 She was born in 1984, and there are some other fields in that file as well. 1645 01:20:10,990 --> 01:20:14,560 But also juicy is this file, title.principals.tsv, 1646 01:20:14,560 --> 01:20:16,690 and this is where it gets interesting, too. 1647 01:20:16,690 --> 01:20:20,450 In this file, notice there are no actual titles. 1648 01:20:20,450 --> 01:20:21,740 There's no primary titles. 1649 01:20:21,740 --> 01:20:23,800 There's no actual human names. 1650 01:20:23,800 --> 01:20:28,810 Instead, there's just two unique identifiers, a tconst and an nconst, 1651 01:20:28,810 --> 01:20:34,240 which IMDB speak for a title identifier and a name identifier. 
1652 01:20:34,240 --> 01:20:37,450 So for instance, in one of the rows in this TSV file called 1653 01:20:37,450 --> 01:20:42,790 title.principals.tsv, there is a row that starts with tt4786824; 1654 01:20:42,790 --> 01:20:48,220 also has nm2946516, and has the word actress, 1655 01:20:48,220 --> 01:20:52,090 thereby implying that if you look up the nm-- 1656 01:20:52,090 --> 01:20:57,280 the nconst in the names file and you look up the tconst in the titles file, 1657 01:20:57,280 --> 01:21:03,600 you will be able to, by transitivity, infer that Claire Foy is in The Crown. 1658 01:21:03,600 --> 01:21:06,670 This allows us to have a many-to-many relationship. 1659 01:21:06,670 --> 01:21:10,360 One movie or show can have many actors, 1660 01:21:10,360 --> 01:21:16,030 and one actor can be in many shows, so we're using a sort of join file here, 1661 01:21:16,030 --> 01:21:18,070 a join table, that's going to somehow allow 1662 01:21:18,070 --> 01:21:21,590 us to link two different data sets together, and more on that in a moment. 1663 01:21:21,590 --> 01:21:24,700 But what's really fun in IMDB is that it also has a bunch of ratings 1664 01:21:24,700 --> 01:21:28,630 that humans have typed in saying, I give this 10 out of 10, a 0 out of 10, 1665 01:21:28,630 --> 01:21:29,200 and so forth. 1666 01:21:29,200 --> 01:21:31,950 And they keep track of the number of votes that shows have gotten. 1667 01:21:31,950 --> 01:21:36,370 And so in title.ratings.tsv, yet another file you can download from IMDB, 1668 01:21:36,370 --> 01:21:40,120 you can look up a given tconst, the unique identifier for a title, 1669 01:21:40,120 --> 01:21:42,940 what its average rating is and the number of votes.
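That transitivity trick can be sketched in miniature. The IDs below are the numeric parts of the tconst and nconst values mentioned above, and the tables are simplified stand-ins for the real titles, names, and principals files:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: one row per (title, person) pairing, so one show can
    -- have many people and one person can be in many shows
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO shows VALUES (4786824, 'The Crown');
    INSERT INTO people VALUES (2946516, 'Claire Foy');
    INSERT INTO stars VALUES (4786824, 2946516);
""")

# Look up each identifier in its own table to infer who is in what
show_id, person_id = db.execute("SELECT * FROM stars").fetchone()
name = db.execute("SELECT name FROM people WHERE id = ?", (person_id,)).fetchone()[0]
title = db.execute("SELECT title FROM shows WHERE id = ?", (show_id,)).fetchone()[0]
print(name, "is in", title)  # Claire Foy is in The Crown
```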
1670 01:21:42,940 --> 01:21:46,660 And in fact, if I pull this up, for instance, on The Crown, if I go back 1671 01:21:46,660 --> 01:21:48,880 to IMDB itself-- 1672 01:21:48,880 --> 01:21:53,930 IMDB, search for The Crown, the 2016 version, 1673 01:21:53,930 --> 01:21:59,410 you'll see that indeed, it is an 8.7, which lines up with exactly what 1674 01:21:59,410 --> 01:22:00,210 we have here. 1675 01:22:00,210 --> 01:22:03,580 But over time, that number is going to go up or down, because IMDB is updating 1676 01:22:03,580 --> 01:22:06,280 their data set every day as well. 1677 01:22:06,280 --> 01:22:09,730 So besides this data, we also then have the ability 1678 01:22:09,730 --> 01:22:12,580 to consider what this all looks like collectively. 1679 01:22:12,580 --> 01:22:14,920 So in this case here, here is another diagram. 1680 01:22:14,920 --> 01:22:18,610 It's more complicated now, but it just captures the intuition 1681 01:22:18,610 --> 01:22:22,450 that you would derive by just reading through IMDB's documentation, which 1682 01:22:22,450 --> 01:22:23,800 defines the following. 1683 01:22:23,800 --> 01:22:26,420 It turns out if you read closely among those files, 1684 01:22:26,420 --> 01:22:29,710 you'll see that, oh, we can glean a whole bunch of shows that are going 1685 01:22:29,710 --> 01:22:32,050 to have IDs, title, year, and episodes. 1686 01:22:32,050 --> 01:22:37,840 I want to go ahead and associate those shows with a whole bunch of stars. 1687 01:22:37,840 --> 01:22:42,170 But people are the entities in the world that have IDs, names, and birth. 1688 01:22:42,170 --> 01:22:43,910 So now things get a little weird. 1689 01:22:43,910 --> 01:22:46,300 Let's focus only on these two tables here. 1690 01:22:46,300 --> 01:22:48,340 This is a diagram of the goal at hand, and this 1691 01:22:48,340 --> 01:22:49,632 is a script I wrote in advance. 1692 01:22:49,632 --> 01:22:53,150 And in a moment, we'll open up a SQL database that represents this. 
1693 01:22:53,150 --> 01:22:56,290 There's going to be a table called shows, every row of which 1694 01:22:56,290 --> 01:22:59,350 has an ID, title, year, and some number of episodes, 1695 01:22:59,350 --> 01:23:00,962 so you can see how long-running it is. 1696 01:23:00,962 --> 01:23:02,920 There's also going to be a table called people. 1697 01:23:02,920 --> 01:23:05,080 Claire Foy is going to be among them. 1698 01:23:05,080 --> 01:23:09,610 She and the other humans will have an ID, name, and birth year associated 1699 01:23:09,610 --> 01:23:10,960 with them as well. 1700 01:23:10,960 --> 01:23:14,230 But there's going to be this other table here called stars. 1701 01:23:14,230 --> 01:23:19,390 And you'll notice there's a line that links shows to stars to people, 1702 01:23:19,390 --> 01:23:21,190 again by transitivity. 1703 01:23:21,190 --> 01:23:27,580 If there is, in the stars table, a show ID, a.k.a. tconst, and a person ID, 1704 01:23:27,580 --> 01:23:28,260 a.k.a. 1705 01:23:28,260 --> 01:23:32,260 nconst, that links, for instance, Claire Foy to The Crown. 1706 01:23:32,260 --> 01:23:34,480 It's going to link Steve Carell to The Office. 1707 01:23:34,480 --> 01:23:38,450 It's going to link every other actor to their show as well. 1708 01:23:38,450 --> 01:23:41,770 Similarly for writers, we won't play too much with the writers today, 1709 01:23:41,770 --> 01:23:45,760 but writers are people, too, just as stars are people. 1710 01:23:45,760 --> 01:23:48,970 And so here's another feature or design goal of SQL. 1711 01:23:48,970 --> 01:23:52,150 You want to ideally factor out the commonalities, so 1712 01:23:52,150 --> 01:23:55,450 that Claire Foy appears only in one place by name, 1713 01:23:55,450 --> 01:23:58,965 but her unique identifier might appear in bunches of places. 1714 01:23:58,965 --> 01:24:01,840 There's a lot of actors in the world who are also writers themselves. 
1715 01:24:01,840 --> 01:24:03,970 They are, at the end of the day, people, but they 1716 01:24:03,970 --> 01:24:08,200 might appear both in the stars table and in the writers table 1717 01:24:08,200 --> 01:24:10,300 by way of their person ID. 1718 01:24:10,300 --> 01:24:14,050 So the goal of SQL is to not copy and paste Claire Foy, Claire Foy, Claire 1719 01:24:14,050 --> 01:24:17,750 Foy, or Steve Carell, Steve Carell, Steve Carell, all over the place. 1720 01:24:17,750 --> 01:24:20,170 You want to have one authoritative place for all 1721 01:24:20,170 --> 01:24:23,320 of your people, one authoritative place for all of your shows, 1722 01:24:23,320 --> 01:24:25,180 and then you have these other tables called 1723 01:24:25,180 --> 01:24:28,940 join tables, which are similar in spirit to the TSV files 1724 01:24:28,940 --> 01:24:32,790 you can download that somehow link these identifiers together. 1725 01:24:32,790 --> 01:24:38,820 So if I, on my computer here, open a file that I came with in advance, 1726 01:24:38,820 --> 01:24:43,370 that you can also play with online as well, called, finally, shows.db. 1727 01:24:43,370 --> 01:24:45,080 And I do .schema on this one. 1728 01:24:45,080 --> 01:24:48,560 This is the largest of the databases, and this is the one that we, the staff, 1729 01:24:48,560 --> 01:24:51,345 actually generated from all of the TSV files online, 1730 01:24:51,345 --> 01:24:52,970 but we threw away a lot of information. 1731 01:24:52,970 --> 01:24:55,850 We threw away anything before 1970, and we filtered out 1732 01:24:55,850 --> 01:24:59,600 everything except TV series, so that we can actually 1733 01:24:59,600 --> 01:25:01,820 play with the data in this data set. 1734 01:25:01,820 --> 01:25:04,580 But honestly, using a black and white prompt and a terminal window 1735 01:25:04,580 --> 01:25:07,455 tends not to get fun, especially when the data flies over the screen.
1736 01:25:07,455 --> 01:25:10,413 So there are also things called GUIs, Graphical User Interfaces, 1737 01:25:10,413 --> 01:25:13,080 and indeed, there is a program that's freely available for Macs, 1738 01:25:13,080 --> 01:25:16,490 PCs, and other types of operating systems, called DB Browser. 1739 01:25:16,490 --> 01:25:18,480 And indeed, we'll point you at this online. 1740 01:25:18,480 --> 01:25:23,030 This is just a program that allows you to explore SQL files on your own Mac 1741 01:25:23,030 --> 01:25:24,890 or PC much more visibly-- 1742 01:25:24,890 --> 01:25:28,070 or much more visually and much more pleasantly than maybe just 1743 01:25:28,070 --> 01:25:29,870 a command line interface allows. 1744 01:25:29,870 --> 01:25:33,080 So let me go ahead and open up, for instance, shows.db. 1745 01:25:33,080 --> 01:25:34,770 And we'll see a whole bunch of things. 1746 01:25:34,770 --> 01:25:36,890 First of all, the graphical user interface 1747 01:25:36,890 --> 01:25:40,280 here shows me the same information, just in slightly prettier format. 1748 01:25:40,280 --> 01:25:44,450 Shows.db, per my diagram a moment ago, has six tables, people, 1749 01:25:44,450 --> 01:25:46,670 like I mentioned; shows, like I mentioned; 1750 01:25:46,670 --> 01:25:50,398 also, stars and writers, ratings, and then that separate table called genres. 1751 01:25:50,398 --> 01:25:52,190 And if you look over here on the right, you 1752 01:25:52,190 --> 01:25:56,360 can actually see the SQL code we wrote to create those tables. 1753 01:25:56,360 --> 01:26:00,350 But cooler than that is that notice these tabs up top here? 1754 01:26:00,350 --> 01:26:03,170 I am currently on Database Structure, and if you're using Windows, 1755 01:26:03,170 --> 01:26:05,180 your interface will look a little different, 1756 01:26:05,180 --> 01:26:06,740 but the options are still there. 1757 01:26:06,740 --> 01:26:08,960 I'm going to go ahead and click on Browse Data.
1758 01:26:08,960 --> 01:26:12,000 And now you'll see a little dropdown of all of the tables. 1759 01:26:12,000 --> 01:26:14,510 So if I want to go ahead and look at a whole bunch of shows, 1760 01:26:14,510 --> 01:26:18,680 I can actually see all of my data here, not unlike Google Spreadsheets. 1761 01:26:18,680 --> 01:26:23,450 But notice, this is 153,331 movies-- 1762 01:26:23,450 --> 01:26:27,380 or shows, rather, that I can see altogether here. 1763 01:26:27,380 --> 01:26:31,110 And lastly, what's cool is that if I go over to the SQL tab, 1764 01:26:31,110 --> 01:26:33,058 I can now execute some sample queries. 1765 01:26:33,058 --> 01:26:34,350 So let me go ahead and do this. 1766 01:26:34,350 --> 01:26:37,340 SELECT * FROM shows;. 1767 01:26:37,340 --> 01:26:39,530 This is going to give me a whole lot of shows. 1768 01:26:39,530 --> 01:26:40,130 Enter. 1769 01:26:40,130 --> 01:26:43,140 So I hit the little Play button that just executes that query, 1770 01:26:43,140 --> 01:26:46,980 and you see the resulting rows that have come back. 1771 01:26:46,980 --> 01:26:48,800 So again, how did we get to this point? 1772 01:26:48,800 --> 01:26:51,920 We, the staff, downloaded all of those TSV files in advance. 1773 01:26:51,920 --> 01:26:56,150 We wrote a Python script that imported all of the data from those files 1774 01:26:56,150 --> 01:26:58,880 into memory, threw away the stuff we didn't care about, 1775 01:26:58,880 --> 01:27:03,362 and then inserted it into tables like this table called shows. 1776 01:27:03,362 --> 01:27:06,320 And what's nice about this tool is, because if you are an aspiring data 1777 01:27:06,320 --> 01:27:09,020 scientist or you're just trying to get some analytical work done 1778 01:27:09,020 --> 01:27:11,690 for this class or any other, or any other project, 1779 01:27:11,690 --> 01:27:14,370 a graphical tool lends itself to just kind of poking around. 
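The staff's actual import script isn't shown here, so the following is only a minimal sketch of the pipeline just described: read the TSV, throw away rows that aren't TV series from 1970 on, and insert the rest. The IMDb column names (tconst, titleType, primaryTitle, startYear) come from the lecture, and the tiny two-row TSV is fabricated inline so the example runs on its own:

```python
import csv
import io
import sqlite3

# A tiny stand-in for title.basics.tsv (the real file is far larger)
tsv = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0386676\ttvSeries\tThe Office\t2005\n"
    "tt0000001\tshort\tSome Short\t1894\n"
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")

for row in csv.DictReader(tsv, delimiter="\t"):
    # Keep only TV series from 1970 onward, as the staff's script did
    if row["titleType"] != "tvSeries" or int(row["startYear"]) < 1970:
        continue
    # Strip the tt prefix so the ID becomes a proper number
    db.execute("INSERT INTO shows VALUES (?, ?, ?)",
               (int(row["tconst"][2:]), row["primaryTitle"], int(row["startYear"])))

print(db.execute("SELECT * FROM shows").fetchall())  # [(386676, 'The Office', 2005)]
```

The real files also contain \N placeholders for missing years, which an actual import script would have to handle before calling int().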
1780 01:27:14,370 --> 01:27:19,010 So for instance, you'll see that the shows table no longer has any genres. 1781 01:27:19,010 --> 01:27:20,010 But that's OK. 1782 01:27:20,010 --> 01:27:21,950 We can reconstitute that data. 1783 01:27:21,950 --> 01:27:25,670 I can go ahead and SELECT * FROM shows-- 1784 01:27:25,670 --> 01:27:27,390 but you know what I can do? 1785 01:27:27,390 --> 01:27:30,680 I can actually join in the genres table. 1786 01:27:30,680 --> 01:27:34,250 So I can take the shows table here, the genres table here, 1787 01:27:34,250 --> 01:27:37,640 and essentially link them together by way of the ID. 1788 01:27:37,640 --> 01:27:39,000 How is that possible? 1789 01:27:39,000 --> 01:27:43,010 Well, if you look at shows, a show has an ID, title, year, and episodes. 1790 01:27:43,010 --> 01:27:45,020 Genres has a show ID. 1791 01:27:45,020 --> 01:27:48,560 So if you think of ID as my left hand here, representing my fingers, 1792 01:27:48,560 --> 01:27:51,830 and show ID in genres as my right hand here, 1793 01:27:51,830 --> 01:27:54,830 what we want to do is line up ID with show ID 1794 01:27:54,830 --> 01:28:00,570 to make one larger, wider table that constitutes all of that data together. 1795 01:28:00,570 --> 01:28:01,890 So how do I do this? 1796 01:28:01,890 --> 01:28:04,130 Well, in SQL, you can join two tables. 1797 01:28:04,130 --> 01:28:07,430 You say what table you want to join with what other table and 1798 01:28:07,430 --> 01:28:08,570 how you want to do it. 1799 01:28:08,570 --> 01:28:16,740 Well, I want to go ahead and join it on shows.id = genres.show_id;. 1800 01:28:16,740 --> 01:28:19,700 And now, when I hit execute, it took a moment-- 1801 01:28:19,700 --> 01:28:21,682 indeed, it took 408 milliseconds. 1802 01:28:21,682 --> 01:28:24,140 But my god, that's a lot faster than writing a whole Python 1803 01:28:24,140 --> 01:28:25,160 script to do this.
1804 01:28:25,160 --> 01:28:29,220 Now I have a table with all of the shows as before. 1805 01:28:29,220 --> 01:28:30,740 But notice the table got wider. 1806 01:28:30,740 --> 01:28:33,590 This is a temporary table that SQL has returned to me. 1807 01:28:33,590 --> 01:28:36,077 This one now has genre and show ID. 1808 01:28:36,077 --> 01:28:37,910 So in fact, let me go ahead and filter this. 1809 01:28:37,910 --> 01:28:52,290 WHERE title = "The Office" AND year = 2005, Play, we'll just get this. 1810 01:28:52,290 --> 01:28:55,370 So notice I have a wider table containing all of the columns 1811 01:28:55,370 --> 01:28:57,350 from both of those joined tables. 1812 01:28:57,350 --> 01:29:03,170 But if I change this now to The Crown, and that was 2016, and hit play, 1813 01:29:03,170 --> 01:29:05,875 notice I get back seemingly redundant information. 1814 01:29:05,875 --> 01:29:08,000 But this might be useful nonetheless, because now I 1815 01:29:08,000 --> 01:29:13,040 can iterate over all of the rows knowing that every row has not only a title 1816 01:29:13,040 --> 01:29:16,220 and a year, but also a genre as well. 1817 01:29:16,220 --> 01:29:19,070 So I can reconstitute the table like that. 1818 01:29:19,070 --> 01:29:21,530 Well, how can I ask other questions as well, 1819 01:29:21,530 --> 01:29:26,030 like what are actors that are in other shows as well? 1820 01:29:26,030 --> 01:29:30,000 Well, let me go ahead and do something like this. 1821 01:29:30,000 --> 01:29:33,230 For instance, let's select all of Steve Carell's movies. 1822 01:29:33,230 --> 01:29:35,280 Or-- let's select Steve Carell himself first. 1823 01:29:35,280 --> 01:29:38,280 So SELECT * FROM-- 1824 01:29:38,280 --> 01:29:40,140 let's see-- no. 1825 01:29:40,140 --> 01:29:40,830 Let's do this. 1826 01:29:40,830 --> 01:29:41,910 How about Ellen? 1827 01:29:41,910 --> 01:29:49,405 So SELECT * from people where name = Ellen DeGeneres. 1828 01:29:49,405 --> 01:29:50,280 I spelled that right. 
1829 01:29:50,280 --> 01:29:52,150 Semicolon, Play. 1830 01:29:52,150 --> 01:29:52,650 All right. 1831 01:29:52,650 --> 01:29:56,040 So this is handy, because I now know that Ellen's birth year is 1958, 1832 01:29:56,040 --> 01:29:58,320 but her ID is 1122. 1833 01:29:58,320 --> 01:30:01,530 That's the same thing as that nconst, but we threw away the nm 1834 01:30:01,530 --> 01:30:05,520 and we got rid of all the zeros and made it into a proper number for efficiency, 1835 01:30:05,520 --> 01:30:08,340 better to uniquely identify humans, typically-- or anything-- 1836 01:30:08,340 --> 01:30:13,110 by numbers which fit in 32 or 64 bits rather than longer strings. 1837 01:30:13,110 --> 01:30:16,740 So now I know Ellen's ID is 1122, what can I do? 1838 01:30:16,740 --> 01:30:25,470 Well, let me go ahead and SELECT * FROM stars WHERE person_id = this. 1839 01:30:25,470 --> 01:30:26,970 This will tell me what? 1840 01:30:26,970 --> 01:30:31,920 This will tell me all of the information about shows that Ellen starred in, 1841 01:30:31,920 --> 01:30:34,150 including, presumably, her own. 1842 01:30:34,150 --> 01:30:37,230 OK, so I now see person_id is the same, the same, the same, 1843 01:30:37,230 --> 01:30:39,608 but Ellen is apparently in all of those shows. 1844 01:30:39,608 --> 01:30:41,400 But that's not that helpful, and also, this 1845 01:30:41,400 --> 01:30:44,070 is kind of lame that I've just hardcoded Ellen's ID. 1846 01:30:44,070 --> 01:30:45,550 But I don't have to do that. 1847 01:30:45,550 --> 01:30:55,020 I can do a subquery, SELECT id FROM people WHERE name = "Ellen DeGeneres", 1848 01:30:55,020 --> 01:30:57,330 closed parenthesis, Play. 1849 01:30:57,330 --> 01:31:00,480 Now it's dynamic, so now I've not hardcoded anything. 1850 01:31:00,480 --> 01:31:01,930 But this isn't that useful. 1851 01:31:01,930 --> 01:31:04,770 Let me go ahead and just select show_id here. 
1852 01:31:04,770 --> 01:31:08,910 So I now have SELECT show_id FROM stars WHERE the person_id 1853 01:31:08,910 --> 01:31:12,108 = whatever Ellen's ID is here. 1854 01:31:12,108 --> 01:31:13,650 How can I take this one step further? 1855 01:31:13,650 --> 01:31:19,980 Well, what if I do SELECT * FROM shows WHERE the ID of the show 1856 01:31:19,980 --> 01:31:23,400 is in the following list of values? 1857 01:31:23,400 --> 01:31:25,590 So not only does SQL support equals, when 1858 01:31:25,590 --> 01:31:28,530 you want to compare one value against the next, much like in Python, 1859 01:31:28,530 --> 01:31:33,360 you have the keyword IN, where you can say select everything from shows where 1860 01:31:33,360 --> 01:31:36,930 the ID of the show is in the following list of shows, 1861 01:31:36,930 --> 01:31:42,260 which happens to represent the list of show IDs that Ellen is in. 1862 01:31:42,260 --> 01:31:43,140 Phew. 1863 01:31:43,140 --> 01:31:45,150 Let's hit Play on this. 1864 01:31:45,150 --> 01:31:48,660 It took a moment, but it looks like these are all of the shows 1865 01:31:48,660 --> 01:31:51,120 that Ellen has been in, according to IMDB. 1866 01:31:51,120 --> 01:31:54,600 And it looks like The Ellen DeGeneres Show is one. 1867 01:31:54,600 --> 01:31:58,380 She's been on the air for 2,865 episodes. 1868 01:31:58,380 --> 01:32:01,480 There is Ellen, her original TV show, which was on for a few seasons back 1869 01:32:01,480 --> 01:32:01,980 then. 1870 01:32:01,980 --> 01:32:06,430 Looks like she's doing some game shows these days as of 2017, and so forth. 1871 01:32:06,430 --> 01:32:09,810 So using these step-by-step thought processes 1872 01:32:09,810 --> 01:32:12,300 can we actually build up more interesting queries 1873 01:32:12,300 --> 01:32:15,410 to get back information like that. 1874 01:32:15,410 --> 01:32:21,192 All right, any questions before we try a few others out as well? 1875 01:32:21,192 --> 01:32:21,860 No?
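The step-by-step buildup, from Ellen's hardcoded ID to a fully dynamic nested query, condenses to the sketch below. Only Ellen's ID of 1122 and birth year of 1958 come from the lecture; the shows, their IDs, and their years are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1122, 'Ellen DeGeneres', 1958);
    INSERT INTO shows VALUES (7, 'The Ellen DeGeneres Show', 2003);
    INSERT INTO shows VALUES (8, 'Some Other Show', 2010);
    INSERT INTO stars VALUES (7, 1122);
""")

# Innermost query: Ellen's ID; middle: her show IDs; outer: those shows.
# Nothing is hardcoded, so this works for any name you substitute in.
rows = db.execute("""
    SELECT * FROM shows WHERE id IN (
        SELECT show_id FROM stars WHERE person_id IN (
            SELECT id FROM people WHERE name = 'Ellen DeGeneres'
        )
    )
""").fetchall()
print(rows)  # [(7, 'The Ellen DeGeneres Show', 2003)]
```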
1876 01:32:21,860 --> 01:32:22,360 All right. 1877 01:32:22,360 --> 01:32:25,270 Well, let me show one other approach to this same problem. 1878 01:32:25,270 --> 01:32:27,720 It turns out these are what are known as nested queries. 1879 01:32:27,720 --> 01:32:30,012 You keep using the parenthesization, much like in math, 1880 01:32:30,012 --> 01:32:32,490 where you can nest arithmetic expressions in parentheses. 1881 01:32:32,490 --> 01:32:33,900 You can do the same in SQL. 1882 01:32:33,900 --> 01:32:37,590 But you can also join information in slightly different ways as well. 1883 01:32:37,590 --> 01:32:40,480 I can actually do something like this. 1884 01:32:40,480 --> 01:32:47,670 Let me go ahead and SELECT Title FROM the following tables, 1885 01:32:47,670 --> 01:32:56,980 people JOIN stars ON people.id = stars.person_id-- 1886 01:32:56,980 --> 01:32:59,230 and we'll walk through this in just a second-- 1887 01:32:59,230 --> 01:33:03,210 JOIN-- sorry-- 1888 01:33:03,210 --> 01:33:14,640 JOIN shows ON stars.show_id = shows.id where name equals "Ellen DeGeneres." 1889 01:33:14,640 --> 01:33:18,780 This, while more of a mouthful, is equivalent to what I've just done. 1890 01:33:18,780 --> 01:33:21,450 Notice that I've selected the title of the show 1891 01:33:21,450 --> 01:33:24,630 from the following, the result of joining people on stars. 1892 01:33:24,630 --> 01:33:27,330 How do you join the people table and the stars table? 1893 01:33:27,330 --> 01:33:28,740 Well, people have IDs. 1894 01:33:28,740 --> 01:33:31,465 Stars have person IDs, according to the diagram. 1895 01:33:31,465 --> 01:33:33,090 What else do you want to join together? 1896 01:33:33,090 --> 01:33:35,310 Let's join all of that with shows. 1897 01:33:35,310 --> 01:33:36,000 How? 1898 01:33:36,000 --> 01:33:40,740 Well, let's go ahead and say stars.show_id = shows.id.
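Spelled out as runnable code against toy rows (invented, apart from Ellen's ID and birth year from earlier), that same explicit join looks like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1122, 'Ellen DeGeneres', 1958);
    INSERT INTO shows VALUES (7, 'The Ellen DeGeneres Show', 2003);
    INSERT INTO stars VALUES (7, 1122);
""")

# Link people to stars on the person ID, that result to shows on the
# show ID, and only then filter the joined rows by name
titles = db.execute("""
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = 'Ellen DeGeneres'
""").fetchall()
print(titles)  # [('The Ellen DeGeneres Show',)]
```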
1899 01:33:40,740 --> 01:33:43,770 So it's a way of linking, transitively, multiple tables together, 1900 01:33:43,770 --> 01:33:47,910 and then filter all of that with the name of Ellen DeGeneres. 1901 01:33:47,910 --> 01:33:50,580 This is what we would call an explicit join. 1902 01:33:50,580 --> 01:33:54,210 And if I click Play there, notice it's taking a moment. 1903 01:33:54,210 --> 01:33:58,530 Took, in fact, 1,990 milliseconds, almost two human seconds. 1904 01:33:58,530 --> 01:34:01,800 So it got slower, but it does give me another syntax 1905 01:34:01,800 --> 01:34:02,910 via which to select data. 1906 01:34:02,910 --> 01:34:07,380 And honestly, what's cool about this is I can select other fields as well. 1907 01:34:07,380 --> 01:34:12,240 They don't have to come from the original query in my nested selects. 1908 01:34:12,240 --> 01:34:15,060 But let me go ahead and do this, just as a teaser, and then 1909 01:34:15,060 --> 01:34:17,850 we'll consider some problems before wrapping up. 1910 01:34:17,850 --> 01:34:20,130 So it turns out 2 millisecond-- 1911 01:34:20,130 --> 01:34:25,590 2 seconds is kind of slow, and if I do this again, it took 2,029 milliseconds. 1912 01:34:25,590 --> 01:34:29,345 If I do it again, it took 1,963 milliseconds, 1913 01:34:29,345 --> 01:34:30,720 and I'm looking at the time here. 1914 01:34:30,720 --> 01:34:31,530 That's slow, right? 1915 01:34:31,530 --> 01:34:34,030 That means if you have a finite number of servers in the world 1916 01:34:34,030 --> 01:34:36,720 and each of them can only handle some number of users at a time, 1917 01:34:36,720 --> 01:34:41,700 it's kind of a waste of hardware and of money, arguably, to spend more seconds 1918 01:34:41,700 --> 01:34:46,050 or milliseconds servicing one query from a given user on your app or your web 1919 01:34:46,050 --> 01:34:47,100 application. 1920 01:34:47,100 --> 01:34:49,950 But it turns out we can do something kind of smart here. 
1921 01:34:49,950 --> 01:34:54,340 Notice that in our database structure, we've done a few things here. 1922 01:34:54,340 --> 01:34:57,570 Let me open up, for instance, people. 1923 01:34:57,570 --> 01:35:02,990 Rather, let me go over to people, and you'll see a few keywords 1924 01:35:02,990 --> 01:35:04,460 that I ignored earlier. 1925 01:35:04,460 --> 01:35:07,640 When you're defining a table, you can specify that one of your columns 1926 01:35:07,640 --> 01:35:09,253 is what's called a primary key. 1927 01:35:09,253 --> 01:35:12,420 That is the column by which you're going to uniquely identify all your data, 1928 01:35:12,420 --> 01:35:15,620 so it's like those numeric IDs we've given every person and show. 1929 01:35:15,620 --> 01:35:18,020 A foreign key is what we call the same number 1930 01:35:18,020 --> 01:35:19,590 when it appears in some other table. 1931 01:35:19,590 --> 01:35:24,440 So when we saw a person_id or show_id, those are what are called foreign keys, 1932 01:35:24,440 --> 01:35:27,140 because it's the same numbers, but in another table 1933 01:35:27,140 --> 01:35:29,780 that we're using to join things together, eventually. 1934 01:35:29,780 --> 01:35:31,718 And you can also define columns to be unique, 1935 01:35:31,718 --> 01:35:33,260 or you can index them for efficiency. 1936 01:35:33,260 --> 01:35:35,750 And that's, in fact, where we're going right now. 1937 01:35:35,750 --> 01:35:38,690 If you look at the same query, you'll notice that I'm 1938 01:35:38,690 --> 01:35:40,640 searching on a bunch of columns. 1939 01:35:40,640 --> 01:35:43,380 I'm clearly searching on the name field. 1940 01:35:43,380 --> 01:35:48,050 I'm also searching on-- that is, joining on-- show_id, 1941 01:35:48,050 --> 01:35:52,740 and I'm searching on person_id, not to mention the individual ID fields. 
1942 01:35:52,740 --> 01:35:55,040 But the cool thing about a primary key, which 1943 01:35:55,040 --> 01:35:59,270 we have in advance defined all of our ID columns, all of the columns 1944 01:35:59,270 --> 01:36:03,500 called ID, to be primary keys, you get a feature for free with SQL. 1945 01:36:03,500 --> 01:36:06,530 SQL builds up what's called an index, a very fancy data 1946 01:36:06,530 --> 01:36:09,950 structure, a tree-like structure, that actually allows you to search 1947 01:36:09,950 --> 01:36:12,240 for information quite efficiently. 1948 01:36:12,240 --> 01:36:14,240 So when you define a column to be a primary key, 1949 01:36:14,240 --> 01:36:15,770 you get what's called an index. 1950 01:36:15,770 --> 01:36:19,430 Specifically, if we go back a few weeks to our discussion of trees, 1951 01:36:19,430 --> 01:36:20,248 it looks like this. 1952 01:36:20,248 --> 01:36:22,790 This is what's called a B-tree, which is not a binary tree. 1953 01:36:22,790 --> 01:36:25,460 It's a B-tree in the sense that it's got a lot of nodes 1954 01:36:25,460 --> 01:36:28,130 that might have one or two or more children each, 1955 01:36:28,130 --> 01:36:32,270 but it's very short, with a lot of nodes and a lot of width. 1956 01:36:32,270 --> 01:36:35,570 So that means when you store your data in a tree-like structure, 1957 01:36:35,570 --> 01:36:38,870 long story short, it just tends to be very efficiently searchable. 1958 01:36:38,870 --> 01:36:40,970 So when you define a column to be a primary key, 1959 01:36:40,970 --> 01:36:43,460 you get that speed for free. 1960 01:36:43,460 --> 01:36:46,520 But we can also tell our SQL database, you know what? 1961 01:36:46,520 --> 01:36:51,330 I plan on searching on person IDs and show IDs and names, also, 1962 01:36:51,330 --> 01:36:53,990 so let's go ahead and create myself some indexes. 1963 01:36:53,990 --> 01:36:57,530 Let me go ahead and execute the following queries just once.
1964 01:36:57,530 --> 01:37:04,490 Let me create something called an index called person_index on the stars table 1965 01:37:04,490 --> 01:37:07,040 using its person_id column. 1966 01:37:07,040 --> 01:37:08,810 So pretty quick syntax. 1967 01:37:08,810 --> 01:37:12,230 This means create a B-tree, create a fancy tree structure, 1968 01:37:12,230 --> 01:37:16,730 called person index, on the stars table by person-- 1969 01:37:16,730 --> 01:37:18,020 on the person_id column. 1970 01:37:18,020 --> 01:37:21,810 That is, make a table like this in memory to store all of the person IDs. 1971 01:37:21,810 --> 01:37:22,310 Why? 1972 01:37:22,310 --> 01:37:26,450 So I can find them faster than linear search in the column itself. 1973 01:37:26,450 --> 01:37:27,955 Let me go ahead and execute this. 1974 01:37:27,955 --> 01:37:29,330 And you'll see it takes a moment. 1975 01:37:29,330 --> 01:37:29,930 It's thinking. 1976 01:37:29,930 --> 01:37:31,510 It took me 1.6 seconds. 1977 01:37:31,510 --> 01:37:33,772 So a little slow, but I only have to do this once. 1978 01:37:33,772 --> 01:37:34,730 Let me create two more. 1979 01:37:34,730 --> 01:37:39,860 CREATE INDEX called show_index ON stars ON the show_id column. 1980 01:37:39,860 --> 01:37:42,080 So almost the same, but this just means give me 1981 01:37:42,080 --> 01:37:44,780 a tree that looks like this in memory, so 1982 01:37:44,780 --> 01:37:48,360 that when I query for data like Ellen DeGeneres, 1983 01:37:48,360 --> 01:37:52,640 it searches a tree-like structure instead of a list-like structure 1984 01:37:52,640 --> 01:37:53,960 in a column alone. 1985 01:37:53,960 --> 01:37:54,890 Let me execute that. 1986 01:37:54,890 --> 01:37:58,440 And I screwed up ON stars show_ID. 1987 01:37:58,440 --> 01:37:59,100 There we go. 1988 01:37:59,100 --> 01:38:00,210 That took 1 second. 
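The CREATE INDEX statements dictated above can be sketched against a toy stars table (the real shows.db is of course much larger). EXPLAIN QUERY PLAN is SQLite's way of showing whether a query will scan the whole table or search an index.

```python
import sqlite3

# Toy stars table with a grid of invented show/person pairs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")
db.executemany("INSERT INTO stars VALUES (?, ?)",
               [(s, p) for s in range(100) for p in range(100)])

# The two indexes created in lecture, same syntax.
db.execute("CREATE INDEX person_index ON stars (person_id)")
db.execute("CREATE INDEX show_index ON stars (show_id)")

# EXPLAIN QUERY PLAN confirms SQLite now searches the B-tree index
# rather than scanning every row of the table.
row = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM stars WHERE person_id = 5"
).fetchone()
detail = row[-1]
print(detail)  # a SEARCH line mentioning person_index
```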
1989 01:38:00,210 --> 01:38:07,320 And lastly, let's create an index called name index on people on the name column 1990 01:38:07,320 --> 01:38:10,620 so that I can search for people by name more efficiently as well. 1991 01:38:10,620 --> 01:38:13,920 Otherwise, without these indexes on a column like name, 1992 01:38:13,920 --> 01:38:17,310 it is going to check every damn cell in the column looking 1993 01:38:17,310 --> 01:38:21,180 for Ellen DeGeneres, Ellen DeGeneres, Ellen DeGeneres, using big O of N, 1994 01:38:21,180 --> 01:38:24,070 running time, or linear search. 1995 01:38:24,070 --> 01:38:26,520 So now, let me go back to my query here. 1996 01:38:26,520 --> 01:38:28,440 I've not made any changes to the query. 1997 01:38:28,440 --> 01:38:31,500 The last time I ran this, it took almost 2 seconds. 1998 01:38:31,500 --> 01:38:34,860 Now, after creating these indexes and telling my SQL database 1999 01:38:34,860 --> 01:38:37,470 I plan to search on those columns, watch this, 2000 01:38:37,470 --> 01:38:40,110 and watch the blue highlighted number. 2001 01:38:40,110 --> 01:38:41,520 8 milliseconds. 2002 01:38:41,520 --> 01:38:42,190 8 millisecond. 2003 01:38:42,190 --> 01:38:43,110 7 milliseconds. 2004 01:38:43,110 --> 01:38:44,490 7, 7, 7. 2005 01:38:44,490 --> 01:38:46,245 Looks like 4 milliseconds that time. 
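The before-and-after timing shown here can be reproduced roughly in code. A sketch on a toy people table; the absolute numbers depend entirely on the machine and data size, but the rows returned are identical either way, only the search strategy (linear scan versus B-tree search) changes.

```python
import sqlite3
import time

# Toy people table: 100,000 invented names plus one we'll search for.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO people (name) VALUES (?)",
               [(f"person{i}",) for i in range(100_000)])
db.execute("INSERT INTO people (name) VALUES ('Ellen DeGeneres')")

def timed(sql):
    # Run a query and return (rows, elapsed seconds).
    start = time.perf_counter()
    rows = db.execute(sql).fetchall()
    return rows, time.perf_counter() - start

query = "SELECT * FROM people WHERE name = 'Ellen DeGeneres'"
before, t_before = timed(query)   # linear scan of the whole column

db.execute("CREATE INDEX name_index ON people (name)")
after, t_after = timed(query)     # B-tree search via name_index

# Same single row either way; only the speed differs.
print(len(before), f"{t_before:.6f}s -> {t_after:.6f}s")
```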
2006 01:38:46,245 --> 01:38:49,120 So this is what the Googles of the world, the Facebooks of the world, 2007 01:38:49,120 --> 01:38:51,180 the Microsofts, who have very large data, 2008 01:38:51,180 --> 01:38:53,310 they not only store their data in databases 2009 01:38:53,310 --> 01:38:57,150 like we are here, but they also index their tables intelligently, 2010 01:38:57,150 --> 01:39:01,200 drawing in ideas from weeks ago, so that the database, for them, and for free, 2011 01:39:01,200 --> 01:39:04,110 and sort of magically, creates these kinds of structures in memory, 2012 01:39:04,110 --> 01:39:07,380 but does it so that you can search and insert and update 2013 01:39:07,380 --> 01:39:10,620 your data all the more efficiently. 2014 01:39:10,620 --> 01:39:13,560 So there's got to be some problems or some prices paid. 2015 01:39:13,560 --> 01:39:15,810 So let's consider just a couple of those. 2016 01:39:15,810 --> 01:39:19,637 So what problems actually arise when using a SQL database like this? 2017 01:39:19,637 --> 01:39:21,720 Well, the first is what's called a race condition. 2018 01:39:21,720 --> 01:39:24,780 And perhaps the best way to explain this is by way of a-- 2019 01:39:24,780 --> 01:39:27,780 sort of a story that I was told when I took a class like this years 2020 01:39:27,780 --> 01:39:31,690 ago in a class called CS161, Operating Systems. 2021 01:39:31,690 --> 01:39:33,540 So contrived scenario, but consider this. 2022 01:39:33,540 --> 01:39:35,832 You and your roommates have a fridge in your dorm room, 2023 01:39:35,832 --> 01:39:37,890 and you and your roommate really like milk. 2024 01:39:37,890 --> 01:39:41,910 And one of you comes home one day, opens the fridge, and sees, oh, dammit, 2025 01:39:41,910 --> 01:39:42,770 we're out of milk. 2026 01:39:42,770 --> 01:39:45,270 And so you close the fridge and you walk into Harvard Square 2027 01:39:45,270 --> 01:39:47,400 and you head to CVS or some other such store. 
2028 01:39:47,400 --> 01:39:50,670 Meanwhile, your other roommate comes home, opens the fridge, 2029 01:39:50,670 --> 01:39:53,280 really wants some milk, but, oh, darn it, we're out of milk. 2030 01:39:53,280 --> 01:39:56,425 So they close the fridge, walk outside, and head 2031 01:39:56,425 --> 01:39:59,550 to some other store that sells milk, and for reasons that we're contriving, 2032 01:39:59,550 --> 01:40:01,500 don't bump into you at CVS. 2033 01:40:01,500 --> 01:40:04,728 A few minutes later, you both get home, of course, having bought some milk. 2034 01:40:04,728 --> 01:40:06,520 But of course, milk doesn't last that long, 2035 01:40:06,520 --> 01:40:08,460 so now you've got twice as much milk, and surely some of it's 2036 01:40:08,460 --> 01:40:09,127 going to go bad. 2037 01:40:09,127 --> 01:40:11,640 Like horrible, horrible problem. 2038 01:40:11,640 --> 01:40:12,690 Why, right? 2039 01:40:12,690 --> 01:40:16,230 It's contrived, but this-- fond memories of when it was taught to me. 2040 01:40:16,230 --> 01:40:19,320 So what is the fundamental problem here, if you put your finger on it? 2041 01:40:19,320 --> 01:40:22,380 Both you and your roommate were allowed to inspect 2042 01:40:22,380 --> 01:40:26,400 the state of that refrigerator without knowing that the other was 2043 01:40:26,400 --> 01:40:27,720 about to do the same. 2044 01:40:27,720 --> 01:40:31,405 Better would have been for one of you to leave a note on the fridge saying, gone 2045 01:40:31,405 --> 01:40:33,780 for milk, so that the other person doesn't do it as well. 2046 01:40:33,780 --> 01:40:37,230 Or more stringently, just lock the refrigerator 2047 01:40:37,230 --> 01:40:41,430 so that no one else besides you can see inside of it until you have 2048 01:40:41,430 --> 01:40:44,010 updated the state of that refrigerator. 2049 01:40:44,010 --> 01:40:46,320 So this comes into play all too often these days 2050 01:40:46,320 --> 01:40:48,130 when dealing with really large data sets. 
2051 01:40:48,130 --> 01:40:50,880 The Twitters of the world, the Instagrams of the world, all of you 2052 01:40:50,880 --> 01:40:53,130 are probably in the habit of liking or uploading posts 2053 01:40:53,130 --> 01:40:54,640 on those sites or some other. 2054 01:40:54,640 --> 01:40:58,120 But if a lot of other people are, too, especially when things go viral, 2055 01:40:58,120 --> 01:41:01,080 there might be code like this underneath the hood that 2056 01:41:01,080 --> 01:41:02,400 essentially does the following. 2057 01:41:02,400 --> 01:41:05,700 Here's three lines of representative Python code that use SQL. 2058 01:41:05,700 --> 01:41:08,850 The first line here selects the number of likes from a table 2059 01:41:08,850 --> 01:41:11,880 called posts where the ID of the post is something, 2060 01:41:11,880 --> 01:41:14,370 a variable called ID, 1, 2, 3, 4, 5. 2061 01:41:14,370 --> 01:41:17,670 That just tells me, what is the total number of likes this post has? 2062 01:41:17,670 --> 01:41:20,648 Then, I store that answer in a variable using likes. 2063 01:41:20,648 --> 01:41:22,440 And you'll see this syntax in problem set 7 2064 01:41:22,440 --> 01:41:25,560 as you begin to play with SQL yourself in the CS50 library. 2065 01:41:25,560 --> 01:41:29,130 Suppose that your third line of code is to then update posts, 2066 01:41:29,130 --> 01:41:32,670 setting likes equal to some place holder, where that place 2067 01:41:32,670 --> 01:41:34,630 holder is going to be likes plus 1. 
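The lost-update interleaving described here can be simulated in one process: two "servers" both SELECT the like count, then both UPDATE based on their stale copy. A hedged sketch; the posts table and its values are invented, and the CS50 library's db.execute is approximated with Python's own sqlite3 module.

```python
import sqlite3

# Toy posts table; the row and its like count are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 1000000)")

# Two servers interleave: both read the same count...
a = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]
b = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]

# ...and both write back their own stale count plus 1.
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (a + 1, 1))
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (b + 1, 1))

(likes,) = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()
print(likes)  # → 1000001, not 1000002: one like was lost
```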
2068 01:41:34,630 --> 01:41:38,490 This is the same problem as the milk example, 2069 01:41:38,490 --> 01:41:41,430 because if you check how many likes there 2070 01:41:41,430 --> 01:41:43,490 are on this post, or your roommate checks how-- 2071 01:41:43,490 --> 01:41:46,050 or if you check how much milk is available left in the fridge 2072 01:41:46,050 --> 01:41:49,440 and then go off and begin to make a decision like buying milk, 2073 01:41:49,440 --> 01:41:52,260 your roommate might, on a server that's running 2074 01:41:52,260 --> 01:41:55,410 the same code to handle thousands and thousands of people, 2075 01:41:55,410 --> 01:41:56,680 might ask that same question. 2076 01:41:56,680 --> 01:41:58,472 How many likes does this current post have? 2077 01:41:58,472 --> 01:42:00,120 Suppose the post has a million likes. 2078 01:42:00,120 --> 01:42:02,400 Both of you execute that first line of code, 2079 01:42:02,400 --> 01:42:04,020 see that, oh, I have a million likes. 2080 01:42:04,020 --> 01:42:06,550 I've just clicked plus 1 on the site. 2081 01:42:06,550 --> 01:42:10,830 And so you try changing the number of likes with this second update query 2082 01:42:10,830 --> 01:42:12,720 to 1 million plus 1. 2083 01:42:12,720 --> 01:42:15,750 But if both of you have been making a decision that 2084 01:42:15,750 --> 01:42:18,660 gets interleaved with the other person, as will absolutely 2085 01:42:18,660 --> 01:42:20,910 happen with Instagram and Twitter, who have thousands 2086 01:42:20,910 --> 01:42:24,870 of servers all operating in parallel, instead of changing the number of likes 2087 01:42:24,870 --> 01:42:30,750 from 1 million to 1,000,002, you might change it both of you to 1,000,001, 2088 01:42:30,750 --> 01:42:33,870 thereby wasting one of the counts, because it gets lost, 2089 01:42:33,870 --> 01:42:36,810 because you inspected the state of the fridge or the likes. 2090 01:42:36,810 --> 01:42:37,983 You made a decision. 
2091 01:42:37,983 --> 01:42:39,900 And then you updated it based on that decision 2092 01:42:39,900 --> 01:42:42,813 without realizing that life changed in between. 2093 01:42:42,813 --> 01:42:44,730 Your roommate got home and checked the fridge, 2094 01:42:44,730 --> 01:42:47,190 too, or someone else clicked the upvote. 2095 01:42:47,190 --> 01:42:49,890 So this is a bad situation, but there are solutions in SQL 2096 01:42:49,890 --> 01:42:54,270 that we won't look at this week, but what are known as transactions. 2097 01:42:54,270 --> 01:42:57,960 That is a solvable problem, and more on that in a higher level database class. 2098 01:42:57,960 --> 01:43:00,690 And one final example of this, because this one you 2099 01:43:00,690 --> 01:43:04,380 can avoid throughout the coming weeks in CS50 and beyond. 2100 01:43:04,380 --> 01:43:08,520 A SQL injection attack is when you write bad code that 2101 01:43:08,520 --> 01:43:12,150 somehow allows some bad actor on the internet or an application 2102 01:43:12,150 --> 01:43:16,530 to trick your code into running SQL code that you did not intend. 2103 01:43:16,530 --> 01:43:19,680 For instance, this is how Yale students log on in New Haven. 2104 01:43:19,680 --> 01:43:22,470 This, of course, is how Harvard students log on here in Cambridge. 2105 01:43:22,470 --> 01:43:24,887 You're asked for a username and password, or a Harvard key 2106 01:43:24,887 --> 01:43:25,680 and a password. 2107 01:43:25,680 --> 01:43:28,290 Suppose, though, you as a user don't cooperate, 2108 01:43:28,290 --> 01:43:30,840 and instead of typing in your email address and password, 2109 01:43:30,840 --> 01:43:34,080 suppose you type something like this, like your email address, 2110 01:43:34,080 --> 01:43:38,520 Malan@Harvard.edu, then maybe a single quote, and then a dash dash. 
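One common fix, sketched here under the same toy schema: push the arithmetic into a single atomic UPDATE, so no code ever acts on a stale snapshot. (Transactions, mentioned above, are the more general tool for grouping several statements together.)

```python
import sqlite3

# Same toy posts table as before; values are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 1000000)")

# Each like is one atomic statement: the database itself does the
# read-increment-write, so interleaved likes can't overwrite each other.
for _ in range(2):
    db.execute("UPDATE posts SET likes = likes + 1 WHERE id = ?", (1,))

(likes,) = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()
print(likes)  # → 1000002: both likes counted
```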
2111 01:43:38,520 --> 01:43:40,950 Well, it turns out in SQL, dash dash is how 2112 01:43:40,950 --> 01:43:46,320 you start a comment, similar to hash in Python or slash slash in C. 2113 01:43:46,320 --> 01:43:50,180 But suppose that Harvard had implemented its website such 2114 01:43:50,180 --> 01:43:52,680 that there's a query like this going on underneath the hood, 2115 01:43:52,680 --> 01:43:56,370 if the Harvard key is implemented in Python, that says this, SELECT * FROM 2116 01:43:56,370 --> 01:44:00,630 users WHERE username = placeholder AND password = placeholder, 2117 01:44:00,630 --> 01:44:02,430 passing in username and password. 2118 01:44:02,430 --> 01:44:03,550 That is good. 2119 01:44:03,550 --> 01:44:07,050 That is correct code, because those placeholders are important. 2120 01:44:07,050 --> 01:44:10,890 What the CS50 library does, and other libraries in the world like it, is 2121 01:44:10,890 --> 01:44:12,810 it escapes user input. 2122 01:44:12,810 --> 01:44:17,480 If a user tries to trick your code into inputting comments or single quotes, 2123 01:44:17,480 --> 01:44:21,990 db.execute and other libraries will sanitize the user's input 2124 01:44:21,990 --> 01:44:24,570 and prevent the following from happening. 2125 01:44:24,570 --> 01:44:27,630 If, however, you do this-- not good, but bad. 2126 01:44:27,630 --> 01:44:32,490 Suppose that you really practice what we preached last week using f-strings, 2127 01:44:32,490 --> 01:44:34,920 this is now bad in the context of SQL.
2128 01:44:34,920 --> 01:44:37,830 If you create a format string like this with a little f 2129 01:44:37,830 --> 01:44:41,250 that is literally SELECT * from users WHERE username =, 2130 01:44:41,250 --> 01:44:44,310 and then use the fancy Python notation for curly braces, 2131 01:44:44,310 --> 01:44:48,660 and password = curly braces, this is a correct SQL construction 2132 01:44:48,660 --> 01:44:53,280 so long as the human cooperates and puts their username here and password here. 2133 01:44:53,280 --> 01:44:55,680 But what if the user is malicious, like me, 2134 01:44:55,680 --> 01:45:01,110 and actually includes a single quote in their input, and then dash dash? 2135 01:45:01,110 --> 01:45:04,650 So Malan@Harvard.edu quote dash dash would 2136 01:45:04,650 --> 01:45:08,520 have the effect of injecting that into the user's input. 2137 01:45:08,520 --> 01:45:11,040 But dash dash I claim as a comment, so it's 2138 01:45:11,040 --> 01:45:15,270 a way of tricking the computer into executing not SELECT * WHERE user-- 2139 01:45:15,270 --> 01:45:19,350 SELECT * FROM users where username = this AND password = that, 2140 01:45:19,350 --> 01:45:22,230 it has the effect of just saying SELECT * FROM users where 2141 01:45:22,230 --> 01:45:25,290 username equals Malan@Harvard.edu. 2142 01:45:25,290 --> 01:45:27,720 And if-- and you'll see this in p set 7-- 2143 01:45:27,720 --> 01:45:31,020 db.execute returns one or more rows from the table-- 2144 01:45:31,020 --> 01:45:33,750 that's how you can actually select data using Python code-- 2145 01:45:33,750 --> 01:45:35,670 and the number of rows equals 1, because it's 2146 01:45:35,670 --> 01:45:39,210 selected Malan@Harvard.edu you could log in 2147 01:45:39,210 --> 01:45:45,660 as me because you very simply tricked the computer into executing code 2148 01:45:45,660 --> 01:45:47,400 that you injected. 2149 01:45:47,400 --> 01:45:50,900 And the biggest takeaway here is this is 100% solvable problem. 
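The attack just described can be reproduced in a few lines. A sketch with an invented users table and made-up credentials: the f-string version splices the attacker's quote and dash dash straight into the SQL, while the placeholder version treats the same input as plain data.

```python
import sqlite3

# Invented users table and credentials, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'crimson')")

username = "malan@harvard.edu'--"   # a single quote, then a SQL comment
password = "wrong"

# BAD: the f-string builds
#   ... WHERE username = 'malan@harvard.edu'--' AND password = 'wrong'
# and the dash dash comments out the password check entirely.
bad_rows = db.execute(
    f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
).fetchall()
print(len(bad_rows))  # → 1: logged in with no valid password

# GOOD: placeholders escape the input, so the trick fails.
good_rows = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()
print(len(good_rows))  # → 0
```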
2150 01:45:50,900 --> 01:45:53,670 You just need to know that it is solved and not 2151 01:45:53,670 --> 01:45:56,760 do code like this using f strings in Python, but instead 2152 01:45:56,760 --> 01:46:02,370 to use library code like db.execute, or after this course, anything like it. 2153 01:46:02,370 --> 01:46:05,290 With that said, we'll end with a joke. 2154 01:46:05,290 --> 01:46:17,475 2155 01:46:17,475 --> 01:46:18,130 OK. 2156 01:46:18,130 --> 01:46:19,300 That's it for CS50. 2157 01:46:19,300 --> 01:46:21,480 We'll see you next time. 2158 01:46:21,480 --> 01:46:22,000