1 00:00:00,000 --> 00:00:03,493 [MUSIC PLAYING] 2 00:00:03,493 --> 00:00:49,357 3 00:00:49,357 --> 00:00:50,440 DAVID J. MALAN: All right. 4 00:00:50,440 --> 00:00:53,260 This is CS50, and this is week 7. 5 00:00:53,260 --> 00:00:56,207 And today's focus is going to be entirely on data-- 6 00:00:56,207 --> 00:00:58,540 the process of collecting it, the process of storing it, 7 00:00:58,540 --> 00:01:00,610 the process of searching it, and so much more. 8 00:01:00,610 --> 00:01:03,280 You'll recall that last week we started off by playing around 9 00:01:03,280 --> 00:01:04,750 with a relatively small data set. 10 00:01:04,750 --> 00:01:08,630 We asked everyone for what their preferred house at Hogwarts might be. 11 00:01:08,630 --> 00:01:12,400 And then we proceeded to analyze that data a little bit using some Python 12 00:01:12,400 --> 00:01:15,730 and counting up how many people wanted Gryffindor or Slytherin or the others, 13 00:01:15,730 --> 00:01:16,423 as well. 14 00:01:16,423 --> 00:01:19,090 And we ultimately did that by using a Google form to collect it. 15 00:01:19,090 --> 00:01:21,923 And we stored all of the data in a Google spreadsheet, which we then 16 00:01:21,923 --> 00:01:24,350 exported, of course, as a CSV file. 17 00:01:24,350 --> 00:01:26,740 So this week, we thought we'd collect a little more data 18 00:01:26,740 --> 00:01:28,780 and see what kinds of problems arise when 19 00:01:28,780 --> 00:01:32,350 we start using only a spreadsheet or, in turn, a CSV file 20 00:01:32,350 --> 00:01:34,220 to store the data that we care about. 21 00:01:34,220 --> 00:01:37,930 So in fact, if you could go ahead and go to this URL here that you see, 22 00:01:37,930 --> 00:01:41,230 you should see another Google form, this one asking you 23 00:01:41,230 --> 00:01:42,730 some different questions. 24 00:01:42,730 --> 00:01:46,540 All of us probably have some preferred TV shows, now more than ever, perhaps.
25 00:01:46,540 --> 00:01:49,150 And what we'd like to do is ask everyone to input 26 00:01:49,150 --> 00:01:53,590 into that form their favorite TV show followed by the genre 27 00:01:53,590 --> 00:01:58,190 or genres into which that particular TV show falls. 28 00:01:58,190 --> 00:02:00,680 So go ahead and take a moment to do that. 29 00:02:00,680 --> 00:02:03,820 And if you're unable to follow along at home, what folks are looking at 30 00:02:03,820 --> 00:02:07,360 is a form quite like this one here, whereby we're just asking them 31 00:02:07,360 --> 00:02:11,350 for the title of their preferred TV show and the genre 32 00:02:11,350 --> 00:02:15,770 or genres of that specific TV show. 33 00:02:15,770 --> 00:02:16,270 All right. 34 00:02:16,270 --> 00:02:19,270 So let's go ahead and start to look at some of this data that's come in. 35 00:02:19,270 --> 00:02:23,043 Here is the resulting Google spreadsheet that Google Forms has created for us. 36 00:02:23,043 --> 00:02:25,960 And you'll notice that by default, Google Forms, this particular tool, 37 00:02:25,960 --> 00:02:28,180 has three different columns, at least for this form. 38 00:02:28,180 --> 00:02:30,070 One is a timestamp, and Google automatically 39 00:02:30,070 --> 00:02:33,340 gives us that based on what day and time everyone was buzzing in 40 00:02:33,340 --> 00:02:34,390 with the responses. 41 00:02:34,390 --> 00:02:38,770 Then they have a header row beyond that for title and genres. 42 00:02:38,770 --> 00:02:42,100 I've manually boldfaced it in advance just to make it stand out. 43 00:02:42,100 --> 00:02:45,580 But you'll notice that the headings here, Title and Genres, 44 00:02:45,580 --> 00:02:48,790 perfectly matches the question that we asked in the Google form. 45 00:02:48,790 --> 00:02:53,020 That allows us to therefore line up your responses with our questions. 
46 00:02:53,020 --> 00:02:56,410 And you can see here Punisher was the first favorite TV 47 00:02:56,410 --> 00:03:00,040 show to be inputted followed by The Office, Breaking Bad, New Girl, Archer, 48 00:03:00,040 --> 00:03:02,270 another Office, and so forth. 49 00:03:02,270 --> 00:03:04,660 And in the third column, under Genres, you'll 50 00:03:04,660 --> 00:03:06,520 see that there's something curious here. 51 00:03:06,520 --> 00:03:08,230 While some of the cells-- 52 00:03:08,230 --> 00:03:10,330 that is, the little boxes of text-- 53 00:03:10,330 --> 00:03:12,970 have just single words like "comedy" or "drama," 54 00:03:12,970 --> 00:03:15,550 you'll notice that some of them have a comma-separated list. 55 00:03:15,550 --> 00:03:19,150 And that comma-separated list is because some of you checked, as you could, 56 00:03:19,150 --> 00:03:24,730 multiple check boxes to indicate that Breaking Bad is a crime genre 57 00:03:24,730 --> 00:03:26,830 drama and also thriller. 58 00:03:26,830 --> 00:03:31,240 And so the way Google Forms handles this is a bit sleazily in the sense 59 00:03:31,240 --> 00:03:35,350 that they just drop all of those values as a comma-separated list 60 00:03:35,350 --> 00:03:37,853 inside of the spreadsheet itself. 61 00:03:37,853 --> 00:03:40,270 And that's potentially a problem if we ultimately download 62 00:03:40,270 --> 00:03:43,570 this as a CSV file, comma-separated values, 63 00:03:43,570 --> 00:03:47,500 because now you have commas in between the commas. 64 00:03:47,500 --> 00:03:50,480 Fortunately, there's a solution to that that we'll ultimately see. 65 00:03:50,480 --> 00:03:52,160 So we've got a good amount of data here. 66 00:03:52,160 --> 00:03:55,300 In fact, if I keep scrolling down, we'll see a few hundred responses now. 
67 00:03:55,300 --> 00:03:58,120 And it would be nice to analyze this data in some way 68 00:03:58,120 --> 00:04:02,410 and figure out what the most popular TV show is, maybe search for new shows 69 00:04:02,410 --> 00:04:04,143 I might like via their genre. 70 00:04:04,143 --> 00:04:06,310 So you can imagine some number of queries that could 71 00:04:06,310 --> 00:04:08,620 be answered by way of this data set. 72 00:04:08,620 --> 00:04:12,340 But let's first consider the limitations of leaving this data 73 00:04:12,340 --> 00:04:14,980 in just a spreadsheet like this. 74 00:04:14,980 --> 00:04:17,440 All of us are probably in the habit of using occasionally 75 00:04:17,440 --> 00:04:22,490 Google Spreadsheets, Apple Numbers, Microsoft Excel, or some other tool. 76 00:04:22,490 --> 00:04:27,220 So let's consider what spreadsheets are good at and what they are bad at. 77 00:04:27,220 --> 00:04:30,370 Would anyone like to volunteer an answer to the first of those? 78 00:04:30,370 --> 00:04:34,030 What is a spreadsheet good at or good for? 79 00:04:34,030 --> 00:04:35,080 Yeah, Andrew? 80 00:04:35,080 --> 00:04:36,760 What's your thinking on spreadsheets? 81 00:04:36,760 --> 00:04:39,797 AUDIENCE: [INAUDIBLE] 82 00:04:39,797 --> 00:04:41,880 DAVID J. MALAN: OK, very good for quickly sorting. 83 00:04:41,880 --> 00:04:42,300 I like that. 84 00:04:42,300 --> 00:04:44,758 I could click on the top of the Title column, for instance, 85 00:04:44,758 --> 00:04:48,450 and immediately sort all of those titles alphabetically. 86 00:04:48,450 --> 00:04:49,140 I like that. 87 00:04:49,140 --> 00:04:53,370 Other reasons to use a spreadsheet-- what problems do they solve? 88 00:04:53,370 --> 00:04:55,050 What are they good at? 89 00:04:55,050 --> 00:04:56,670 Other thoughts on spreadsheets. 90 00:04:56,670 --> 00:04:58,530 Yeah, how about Peter? 91 00:04:58,530 --> 00:05:01,947 AUDIENCE: Storing large amounts of data that you can later analyze.
92 00:05:01,947 --> 00:05:03,780 DAVID J. MALAN: OK, so storing large amounts 93 00:05:03,780 --> 00:05:05,700 of data that you can later analyze. 94 00:05:05,700 --> 00:05:09,210 It's kind of a nice model for storing lots of rows of data, so to speak. 95 00:05:09,210 --> 00:05:11,310 I will say that there actually is a limit. 96 00:05:11,310 --> 00:05:13,890 And in fact, back in the day, I learned what this limit is. 97 00:05:13,890 --> 00:05:16,515 Long story short, in graduate school, I was using a spreadsheet 98 00:05:16,515 --> 00:05:17,940 to analyze some research data. 99 00:05:17,940 --> 00:05:23,370 And at one point, I had more data than Excel supported rows for. 100 00:05:23,370 --> 00:05:28,110 Specifically, I had some 65,536 rows, which 101 00:05:28,110 --> 00:05:30,360 was too many at that point for Excel at the time, 102 00:05:30,360 --> 00:05:33,870 because, long story short, if you recall from a spreadsheet program 103 00:05:33,870 --> 00:05:37,328 like Google Spreadsheets, every row is numbered from 1 on up. 104 00:05:37,328 --> 00:05:39,120 Well, unfortunately, at the time, Microsoft 105 00:05:39,120 --> 00:05:43,170 had used a 16-bit integer, 16 bits or 2 bytes, 106 00:05:43,170 --> 00:05:45,150 to represent each of those numbers. 107 00:05:45,150 --> 00:05:49,320 And it turns out that 2 to the 16th power is roughly 65,000. 108 00:05:49,320 --> 00:05:52,000 So at that point, I maxed out the total number of rows. 109 00:05:52,000 --> 00:05:54,828 Now, to Peter's point, they've increased that in recent years. 110 00:05:54,828 --> 00:05:56,620 And you can actually store a lot more data. 111 00:05:56,620 --> 00:05:58,870 So spreadsheets are indeed good at that. 112 00:05:58,870 --> 00:06:02,580 But they're not necessarily good at everything, because at some point, 113 00:06:02,580 --> 00:06:05,310 you're going to have more data potentially in a spreadsheet 114 00:06:05,310 --> 00:06:07,860 than your Mac or PC can handle.
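The arithmetic behind that old row limit can be checked in a couple of lines of Python. This is only a sketch of the math mentioned above; the variable names are illustrative, and nothing here is taken from Excel itself.

```python
# A 16-bit integer has 16 binary digits, so it can take on 2**16 distinct values.
BITS = 16
distinct_values = 2 ** BITS

print(distinct_values)  # 65536
```

That result is the exact figure behind the "roughly 65,000" quoted in the lecture.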
115 00:06:07,860 --> 00:06:10,860 In fact, if you're actually trying to build an application, whether it's 116 00:06:10,860 --> 00:06:14,490 Twitter, Instagram, or Facebook or anything of that scale, 117 00:06:14,490 --> 00:06:17,490 those companies are certainly not storing their data, suffice it to say, 118 00:06:17,490 --> 00:06:20,700 in a spreadsheet, because there would just be way too much data to use. 119 00:06:20,700 --> 00:06:23,050 And no one could literally open it on their computer. 120 00:06:23,050 --> 00:06:25,830 So we'll need a solution to that problem of scale. 121 00:06:25,830 --> 00:06:29,950 But I don't think we need to throw out what works well about spreadsheets. 122 00:06:29,950 --> 00:06:33,510 So you can store indeed a lot of data in row form. 123 00:06:33,510 --> 00:06:36,930 But it would seem that you can also store a lot of data in column form. 124 00:06:36,930 --> 00:06:39,567 And even though I'm only showing columns A, B, and C, 125 00:06:39,567 --> 00:06:41,400 of course, you've probably used spreadsheets 126 00:06:41,400 --> 00:06:42,570 where you add more columns-- 127 00:06:42,570 --> 00:06:44,680 D, E, F, and so forth. 128 00:06:44,680 --> 00:06:48,540 So what's the right mental model for how to think about rows 129 00:06:48,540 --> 00:06:51,540 versus columns in a spreadsheet? 130 00:06:51,540 --> 00:06:57,840 I feel like we probably use them in a somewhat different way conceptually. 131 00:06:57,840 --> 00:07:00,550 We might think about them a little differently. 132 00:07:00,550 --> 00:07:04,440 What's the difference between rows and columns in a spreadsheet? 133 00:07:04,440 --> 00:07:06,570 Sofia. 134 00:07:06,570 --> 00:07:07,890 AUDIENCE: Adding more entries. 135 00:07:07,890 --> 00:07:09,420 Adding more data is-- 136 00:07:09,420 --> 00:07:12,720 those are within the rows, but then the actual attributes or characteristics 137 00:07:12,720 --> 00:07:14,240 of the data should be in columns. 
138 00:07:14,240 --> 00:07:15,240 DAVID J. MALAN: Exactly. 139 00:07:15,240 --> 00:07:17,220 When you add more data to the spreadsheet, 140 00:07:17,220 --> 00:07:19,620 you should really be adding to the bottom of it, 141 00:07:19,620 --> 00:07:21,310 adding more and more rows. 142 00:07:21,310 --> 00:07:24,300 So these things sort of grow vertically, even though of course that's 143 00:07:24,300 --> 00:07:25,920 just a human's perception of it. 144 00:07:25,920 --> 00:07:28,740 They grow from top to bottom by adding more and more rows. 145 00:07:28,740 --> 00:07:31,560 But to Sofia's point, your columns represent 146 00:07:31,560 --> 00:07:37,920 what we might call attributes or fields or any other such characteristic that 147 00:07:37,920 --> 00:07:40,030 is a type of data that you're storing. 148 00:07:40,030 --> 00:07:42,930 So in this case of our form, Timestamp is the first column. 149 00:07:42,930 --> 00:07:44,460 Title is the second column. 150 00:07:44,460 --> 00:07:45,930 Genres is the third column. 151 00:07:45,930 --> 00:07:49,980 And those columns can indeed be thought of as fields or attributes, properties 152 00:07:49,980 --> 00:07:50,697 of your data. 153 00:07:50,697 --> 00:07:54,030 And those are properties that you should really decide on in advance when you're 154 00:07:54,030 --> 00:07:56,970 first creating the form, in our case, or when you're manually creating 155 00:07:56,970 --> 00:07:59,430 the spreadsheet in another case. 156 00:07:59,430 --> 00:08:01,320 You should not really be in the habit, when 157 00:08:01,320 --> 00:08:05,430 using spreadsheets, of adding data from left 158 00:08:05,430 --> 00:08:08,370 to right, adding more and more columns, unless you 159 00:08:08,370 --> 00:08:11,740 decide to collect more types of data. 160 00:08:11,740 --> 00:08:15,873 So just because someone adds a new favorite TV show to your data set, 161 00:08:15,873 --> 00:08:18,540 you shouldn't be adding that from left to right in a new column. 
162 00:08:18,540 --> 00:08:21,040 You should indeed be adding it from top to bottom. 163 00:08:21,040 --> 00:08:24,780 But suppose that we actually decided to collect more information from everyone. 164 00:08:24,780 --> 00:08:28,650 Maybe that form had instead asked you for your name or your email address 165 00:08:28,650 --> 00:08:30,120 or any other questions. 166 00:08:30,120 --> 00:08:34,480 Those properties or attributes or fields would belong as new columns. 167 00:08:34,480 --> 00:08:38,309 So this is to say we generally decide on the layout of our data, 168 00:08:38,309 --> 00:08:41,130 the schema of our data, in advance. 169 00:08:41,130 --> 00:08:45,420 And then from there on out, we proceed to add, add, add more rows, not 170 00:08:45,420 --> 00:08:47,670 columns, unless we change our mind and need to change 171 00:08:47,670 --> 00:08:50,230 the schema of our particular data. 172 00:08:50,230 --> 00:08:53,850 So it turns out that spreadsheets are indeed wonderfully useful, 173 00:08:53,850 --> 00:08:56,700 to Peter's point, for large or reasonably large 174 00:08:56,700 --> 00:08:59,610 data sets that we might collect. 175 00:08:59,610 --> 00:09:04,500 And we can, of course, per last week, export those data sets as CSV files. 176 00:09:04,500 --> 00:09:07,290 And so we can go from a spreadsheet to a simple text 177 00:09:07,290 --> 00:09:11,370 file stored in ASCII or Unicode, more generally, on your own hard drive 178 00:09:11,370 --> 00:09:12,660 or somewhere in the cloud. 179 00:09:12,660 --> 00:09:16,350 And you can actually think of that file, that .CSV file, 180 00:09:16,350 --> 00:09:19,540 as what we might call a flat-file database. 181 00:09:19,540 --> 00:09:22,980 A database is, generally speaking, a file that stores data. 182 00:09:22,980 --> 00:09:25,997 Or it's a program that stores data for you. 183 00:09:25,997 --> 00:09:29,080 And all of us have probably thought about or used databases in some sense. 
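The idea of fixing a schema up front and then only ever adding rows can be sketched with Python's csv module. This is a minimal illustration, not the lecture's own code: the in-memory `io.StringIO` object stands in for a .csv file on disk, the exact header spellings are an assumption, and the timestamps and genre values are made-up placeholders.

```python
import csv
import io

# The schema: decided once, in advance. These mirror the form's three columns
# (the exact spellings are an assumption).
FIELDS = ["Timestamp", "title", "genres"]

# In-memory stand-in for a flat-file database on disk (an assumption that keeps
# the example self-contained).
flat_file = io.StringIO()
writer = csv.DictWriter(flat_file, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()

# Each new response becomes a new row; the columns stay the same.
# Timestamps and genres here are placeholders, not real responses.
writer.writerow({"Timestamp": "2020-01-01 12:00:00", "title": "Punisher", "genres": "Action"})
writer.writerow({"Timestamp": "2020-01-01 12:00:05", "title": "The Office", "genres": "Comedy"})

contents = flat_file.getvalue()
print(contents)
```

The printed text is the whole "database": one header line naming the columns, then one line per response.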
184 00:09:29,080 --> 00:09:31,560 You're probably familiar with the fact that all 185 00:09:31,560 --> 00:09:35,310 of those same big websites, Google and Twitter and Facebook and others, use 186 00:09:35,310 --> 00:09:37,017 databases to store our data. 187 00:09:37,017 --> 00:09:38,850 Well, those databases are either just really 188 00:09:38,850 --> 00:09:42,120 big files containing lots of data or special programs 189 00:09:42,120 --> 00:09:44,130 that are storing our data for us. 190 00:09:44,130 --> 00:09:46,350 And a flat file is just referring to the fact 191 00:09:46,350 --> 00:09:48,580 that it really is a very simple design. 192 00:09:48,580 --> 00:09:51,510 In fact, years ago, decades ago, humans decided 193 00:09:51,510 --> 00:09:54,780 when storing data in simple text files that if you 194 00:09:54,780 --> 00:09:57,540 want to store different types of data, like, to Sofia's point, 195 00:09:57,540 --> 00:10:00,340 different properties or attributes, well, let's keep it simple. 196 00:10:00,340 --> 00:10:03,780 Let's just separate those columns with commas 197 00:10:03,780 --> 00:10:06,450 in our flat-file database, a.k.a. 198 00:10:06,450 --> 00:10:07,118 a CSV. 199 00:10:07,118 --> 00:10:08,160 You can use other things. 200 00:10:08,160 --> 00:10:09,430 You can use tabs. 201 00:10:09,430 --> 00:10:12,570 There's things called TSVs, for Tab-Separated Values. 202 00:10:12,570 --> 00:10:14,760 And frankly, you can use anything you want. 203 00:10:14,760 --> 00:10:16,050 But there is a corner case. 204 00:10:16,050 --> 00:10:17,980 And we've already seen a preview of it. 205 00:10:17,980 --> 00:10:21,190 What if your actual data has a comma in it? 206 00:10:21,190 --> 00:10:23,820 What if the title of your favorite TV show has a comma? 207 00:10:23,820 --> 00:10:27,660 What if Google is presuming to store genres as a comma-separated list? 208 00:10:27,660 --> 00:10:32,340 Bad things can happen if using a CSV as your flat-file database. 
209 00:10:32,340 --> 00:10:33,760 But there are solutions to that. 210 00:10:33,760 --> 00:10:35,580 And in fact, what the world typically does 211 00:10:35,580 --> 00:10:39,640 is whenever you have commas inside of your CSV file, 212 00:10:39,640 --> 00:10:42,300 you just make sure that the whole string is double 213 00:10:42,300 --> 00:10:44,460 quoted on the far left and far right. 214 00:10:44,460 --> 00:10:46,860 And anything inside of double quotes is not 215 00:10:46,860 --> 00:10:50,790 mistaken thereafter as delineating a column 216 00:10:50,790 --> 00:10:53,220 as the other commas in the file might. 217 00:10:53,220 --> 00:10:55,590 So that's all that's meant by a flat-file database. 218 00:10:55,590 --> 00:10:58,860 And CSV is perhaps one of the most common, the most common, formats 219 00:10:58,860 --> 00:11:01,240 thereof, if only because all of these programs, 220 00:11:01,240 --> 00:11:03,420 like Google Spreadsheets and Excel and Numbers, 221 00:11:03,420 --> 00:11:07,137 allow you to save your files as CSVs. 222 00:11:07,137 --> 00:11:08,970 Now, long story short, those of you who have 223 00:11:08,970 --> 00:11:12,570 used fancier features of spreadsheets like built-in functions and formulas 224 00:11:12,570 --> 00:11:14,850 and those kinds of things, those are built in 225 00:11:14,850 --> 00:11:19,120 and proprietary to Google Spreadsheets and Excel and Numbers. 226 00:11:19,120 --> 00:11:24,900 You cannot use formulas in a CSV file or a TSV file or in a flat-file database, 227 00:11:24,900 --> 00:11:25,870 more generally. 228 00:11:25,870 --> 00:11:27,990 You can only store static-- 229 00:11:27,990 --> 00:11:30,090 that is, unchanging-- values. 230 00:11:30,090 --> 00:11:33,490 So when you export the data, what you see is what you get. 
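The double-quoting convention described above can be seen by comparing a naive split on commas with what Python's csv module does. This is a small sketch with a single hand-written data line; the timestamp is a made-up placeholder, and the variable names are illustrative.

```python
import csv
import io

# One header line and one data line. The genres cell itself contains commas,
# so it is wrapped in double quotes. The timestamp is a placeholder.
text = 'Timestamp,title,genres\n2020-01-01 12:00:00,Breaking Bad,"Crime, Drama, Thriller"\n'

data_line = text.splitlines()[1]

# Naive approach: split on every comma, which wrongly breaks the genres cell apart.
naive = data_line.split(",")

# The csv module honors the double quotes and keeps the cell whole.
rows = list(csv.reader(io.StringIO(text)))
parsed = rows[1]

print(len(naive))   # 5
print(len(parsed))  # 3
print(parsed[2])    # Crime, Drama, Thriller
```

The naive split sees five "columns" where there are really three, which is exactly the corner case the quotes exist to prevent.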
231 00:11:33,490 --> 00:11:35,242 And that's why people use fancier programs 232 00:11:35,242 --> 00:11:37,200 like Excel and Numbers and Google Spreadsheets, 233 00:11:37,200 --> 00:11:38,658 because you get more functionality. 234 00:11:38,658 --> 00:11:41,100 But if you want to export the data, you can only 235 00:11:41,100 --> 00:11:44,190 get indeed the raw textual data out of it. 236 00:11:44,190 --> 00:11:45,690 But I daresay that's going to be OK. 237 00:11:45,690 --> 00:11:47,398 In fact, Brian, do you mind if I go ahead 238 00:11:47,398 --> 00:11:50,160 and download this spreadsheet as a CSV file now? 239 00:11:50,160 --> 00:11:51,510 BRIAN YU: Yep, go ahead. 240 00:11:51,510 --> 00:11:51,810 DAVID J. MALAN: All right. 241 00:11:51,810 --> 00:11:54,890 I'm going to go ahead in Google Spreadsheets and go to File, Download. 242 00:11:54,890 --> 00:11:56,640 And you can see a whole bunch of options-- 243 00:11:56,640 --> 00:12:01,850 PDF, Web Page, Comma-Separated Values, which is the one I want. 244 00:12:01,850 --> 00:12:04,320 So I'm going to indeed go ahead and choose CSV 245 00:12:04,320 --> 00:12:06,510 from this dropdown in spreadsheets. 246 00:12:06,510 --> 00:12:08,410 That, of course, downloaded that file for me. 247 00:12:08,410 --> 00:12:11,077 And now I'm going to go ahead and go into our familiar CS50 IDE. 248 00:12:11,077 --> 00:12:14,600 You'll recall that last week I was able to upload a file into the IDE. 249 00:12:14,600 --> 00:12:17,350 And I'm going to go ahead and do the same here this week, as well. 250 00:12:17,350 --> 00:12:20,730 I'm going to go ahead and grab my file, which ended up in my Downloads 251 00:12:20,730 --> 00:12:22,830 folder on my particular computer here. 252 00:12:22,830 --> 00:12:27,840 And I'm going to go ahead and drag and drop this into the IDE 253 00:12:27,840 --> 00:12:31,790 such that it ends up in my home directory, so to speak. 
254 00:12:31,790 --> 00:12:34,410 So now I have this file, Favorite TV Shows Forms. 255 00:12:34,410 --> 00:12:36,750 And in fact, if I double click this within the IDE, 256 00:12:36,750 --> 00:12:38,880 you'll see familiar data now. 257 00:12:38,880 --> 00:12:42,950 Timestamp comma title comma genres is our header row 258 00:12:42,950 --> 00:12:46,830 that contains the names of the properties or attributes in this file. 259 00:12:46,830 --> 00:12:51,390 Then we've got our timestamps comma favorite title comma and then 260 00:12:51,390 --> 00:12:53,310 a comma-separated list of genres. 261 00:12:53,310 --> 00:12:56,100 And here indeed, notice that Google took care 262 00:12:56,100 --> 00:13:00,030 to use double quotes around any values that themselves had commas. 263 00:13:00,030 --> 00:13:02,130 So it's a relatively simple file format. 264 00:13:02,130 --> 00:13:04,560 And I could certainly just kind of skim through this, 265 00:13:04,560 --> 00:13:07,920 figuring out who likes The Office, who likes Breaking Bad, or other shows. 266 00:13:07,920 --> 00:13:11,040 But per last week, we now have a pretty useful programming language 267 00:13:11,040 --> 00:13:14,220 at our disposal, Python, that could allow us to start manipulating 268 00:13:14,220 --> 00:13:16,860 and analyzing this data more readily. 269 00:13:16,860 --> 00:13:20,100 And here to my point last week about using the right tool for the job, 270 00:13:20,100 --> 00:13:24,860 you could absolutely do everything we're about to do in all weeks prior of CS50. 271 00:13:24,860 --> 00:13:27,720 We could have used C for what we're about to do. 272 00:13:27,720 --> 00:13:31,350 But as you can probably glean, C tends to be painful for certain things, 273 00:13:31,350 --> 00:13:34,290 like anything involving string manipulation, 274 00:13:34,290 --> 00:13:36,660 changing strings, analyzing strings. 275 00:13:36,660 --> 00:13:38,290 It's just a real pain, right? 
276 00:13:38,290 --> 00:13:42,330 God forbid you had to take this CSV file and load it all into memory, not 277 00:13:42,330 --> 00:13:43,470 unlike your spell checker. 278 00:13:43,470 --> 00:13:46,950 You would have to be using malloc all over the place or realloc or the like. 279 00:13:46,950 --> 00:13:50,640 There's just a lot of heavy lifting involved in just analyzing a text file. 280 00:13:50,640 --> 00:13:53,760 So Python does all of that for us by just giving us 281 00:13:53,760 --> 00:13:56,130 more functions at our disposal with which 282 00:13:56,130 --> 00:13:59,470 to start analyzing and opening data. 283 00:13:59,470 --> 00:14:01,570 So let me go ahead and close this file. 284 00:14:01,570 --> 00:14:05,082 And let me go ahead and create a new one called favorites.py, 285 00:14:05,082 --> 00:14:07,290 wherein I'm going to start playing with this data set 286 00:14:07,290 --> 00:14:09,900 and see if we can't start answering some questions about it. 287 00:14:09,900 --> 00:14:12,570 And frankly, to this day, 20-plus years after learning how 288 00:14:12,570 --> 00:14:14,670 to program for the first time, I myself am 289 00:14:14,670 --> 00:14:18,000 very much in the habit when writing a new program of just starting simple 290 00:14:18,000 --> 00:14:22,320 and not solving the problem I ultimately want to but something simpler just 291 00:14:22,320 --> 00:14:24,270 as a sort of proof of concept to make sure 292 00:14:24,270 --> 00:14:26,510 I have the right plumbing in place. 293 00:14:26,510 --> 00:14:27,510 So by that, I mean this. 294 00:14:27,510 --> 00:14:32,550 Let's go ahead and write a quick program that simply opens up this file, the CSV 295 00:14:32,550 --> 00:14:37,120 file, iterates over it top to bottom, and just prints out each of the titles, 296 00:14:37,120 --> 00:14:39,430 just as a quick sanity check that I know what I'm doing 297 00:14:39,430 --> 00:14:41,460 and I have access to the data therein. 
298 00:14:41,460 --> 00:14:43,740 So let me go ahead and import csv. 299 00:14:43,740 --> 00:14:45,840 And then I can do this in a few different ways. 300 00:14:45,840 --> 00:14:48,030 But by now, you've probably seen or remembered 301 00:14:48,030 --> 00:14:50,490 my using something like the open command and the 302 00:14:50,490 --> 00:14:55,260 with keyword to open and eventually automatically close this file for me. 303 00:14:55,260 --> 00:14:59,710 This file is called Favorite TV Shows - Form Responses 1.csv. 304 00:14:59,710 --> 00:15:02,400 305 00:15:02,400 --> 00:15:04,560 And I'm going to open this up in read mode. 306 00:15:04,560 --> 00:15:07,000 Strictly speaking, the r is not required. 307 00:15:07,000 --> 00:15:09,330 You might see examples online not including it. 308 00:15:09,330 --> 00:15:13,140 That's because read is the default. But for parity with C and fopen, 309 00:15:13,140 --> 00:15:15,900 I'm going to be explicit and actually do "r." 310 00:15:15,900 --> 00:15:18,670 And I'm going to go ahead and give this a variable name of file. 311 00:15:18,670 --> 00:15:23,820 So this line 3 here has the effect of opening that CSV file in read-only mode 312 00:15:23,820 --> 00:15:27,532 and creating a variable called file via which I can reference it. 313 00:15:27,532 --> 00:15:30,240 Now I'm going to go ahead and use some of that CSV functionality. 314 00:15:30,240 --> 00:15:32,790 I'm going to give myself what we keep calling a reader, which 315 00:15:32,790 --> 00:15:34,650 I could call xyz, anything else. 316 00:15:34,650 --> 00:15:37,740 But "reader" kind of describes what this variable is going to do. 317 00:15:37,740 --> 00:15:42,930 And it's going to be the return value of calling csv.reader on that file. 318 00:15:42,930 --> 00:15:46,740 And so essentially, the CSV library, per last week, 319 00:15:46,740 --> 00:15:48,360 has a lot of fancy features built in.
320 00:15:48,360 --> 00:15:52,470 And all it needs as input is an already opened text file. 321 00:15:52,470 --> 00:15:55,120 And then it will then wrap that file, so to speak, 322 00:15:55,120 --> 00:15:57,270 with a whole bunch of more useful functionality, 323 00:15:57,270 --> 00:16:01,750 like the ability to read it column and row at a time. 324 00:16:01,750 --> 00:16:02,250 All right. 325 00:16:02,250 --> 00:16:05,170 Now I'm going to go ahead and, you know what, just for now, 326 00:16:05,170 --> 00:16:08,190 I'm going to skip the first row. 327 00:16:08,190 --> 00:16:11,310 I'm going to skip the first row, because the first row has my headings-- 328 00:16:11,310 --> 00:16:13,530 Timestamp, Title, and Genres. 329 00:16:13,530 --> 00:16:17,692 And I know what my columns are, so I'm just going to ignore that line for now. 330 00:16:17,692 --> 00:16:18,900 And now I'm going to do this. 331 00:16:18,900 --> 00:16:24,570 For row in reader, let me go ahead and print out, quite simply, row. 332 00:16:24,570 --> 00:16:28,890 And I only want title, so I think if it's three columns from left to right, 333 00:16:28,890 --> 00:16:30,330 it's 0, 1, 2. 334 00:16:30,330 --> 00:16:33,480 So I want to print out row bracket 1, which 335 00:16:33,480 --> 00:16:35,680 is going to be the second column zero indexed. 336 00:16:35,680 --> 00:16:36,180 All right. 337 00:16:36,180 --> 00:16:39,240 Let me go ahead and save that, go down to my terminal window, 338 00:16:39,240 --> 00:16:42,850 and run python of favorites.py and cross my fingers. 339 00:16:42,850 --> 00:16:43,500 OK. 340 00:16:43,500 --> 00:16:44,920 Voila. 341 00:16:44,920 --> 00:16:46,740 It flew by super fast. 342 00:16:46,740 --> 00:16:49,530 But it looks like, indeed, these are all of the TV 343 00:16:49,530 --> 00:16:51,150 shows that folks have inputted. 344 00:16:51,150 --> 00:16:53,370 Indeed, there's a few hundred if I keep scrolling up.
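The steps just narrated (open the file, wrap it in a csv.reader, skip the header row, print the second column) might look roughly like the sketch below. To stay self-contained it reads from an in-memory `io.StringIO` with a few sample rows rather than the real download; with the actual file, that object would presumably be replaced by `open("Favorite TV Shows - Form Responses 1.csv", "r")`. The timestamps and single-word genres are made-up placeholders, and the `titles` list exists only so the result can be checked.

```python
import csv
import io

# Stand-in for the downloaded CSV file. Timestamps and the single-word genres
# are made-up placeholders; the titles come from the responses shown on screen.
SAMPLE = (
    "Timestamp,title,genres\n"
    "2020-01-01 12:00:00,Punisher,Action\n"
    "2020-01-01 12:00:05,The Office,Comedy\n"
    '2020-01-01 12:00:09,Breaking Bad,"Crime, Drama, Thriller"\n'
)

titles = []
with io.StringIO(SAMPLE) as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row (Timestamp, title, genres)
    for row in reader:
        # Columns are zero indexed: 0 is the timestamp, 1 the title, 2 the genres.
        titles.append(row[1])
        print(row[1])
```

Run as is, this prints the three sample titles, one per line.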
345 00:16:53,370 --> 00:16:55,740 So it looks like my program is working. 346 00:16:55,740 --> 00:16:57,850 But let's improve it just a little bit. 347 00:16:57,850 --> 00:17:02,490 It turns out that using the csv.reader isn't necessarily 348 00:17:02,490 --> 00:17:04,050 the best approach in Python. 349 00:17:04,050 --> 00:17:07,589 Many of you have already discovered a DictReader, a dictionary reader, 350 00:17:07,589 --> 00:17:10,740 which is nice, because then you don't have to know or keep double checking 351 00:17:10,740 --> 00:17:13,230 what number column your data is in. 352 00:17:13,230 --> 00:17:17,520 You can instead refer to it by the header itself, so by "title" 353 00:17:17,520 --> 00:17:18,660 or by "genres." 354 00:17:18,660 --> 00:17:21,052 This is also good, because if you or maybe a colleague 355 00:17:21,052 --> 00:17:23,010 are sort of messing around with the spreadsheet 356 00:17:23,010 --> 00:17:26,339 and they rearrange the columns by dragging them left or right, 357 00:17:26,339 --> 00:17:30,120 any numbers you have used in your code, 0, 1, 2 on up, 358 00:17:30,120 --> 00:17:34,390 could suddenly be incorrect if your colleague has reordered those columns. 359 00:17:34,390 --> 00:17:37,590 So using a dictionary reader tends to be a little more robust, because it 360 00:17:37,590 --> 00:17:40,480 uses the titles, not the mere numbers. 361 00:17:40,480 --> 00:17:43,230 It's still fallible if someone, yourself or someone else, 362 00:17:43,230 --> 00:17:47,978 changes the values in that very first row and renames titles or genres. 363 00:17:47,978 --> 00:17:49,270 Then things are going to break. 364 00:17:49,270 --> 00:17:51,270 But at that point, we kind of have to blame you 365 00:17:51,270 --> 00:17:53,730 for not having kept track of your code versus your data. 366 00:17:53,730 --> 00:17:55,020 But still a risk. 367 00:17:55,020 --> 00:17:58,445 So I'm going to change this to dictionary reader or DictReader here.
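The robustness argument above can be made concrete with a small comparison. The sketch below reads the same single response twice, once with its columns in the original order and once with them dragged around, using both a positional csv.reader and a csv.DictReader. The helper names `titles_by_index` and `titles_by_name` are hypothetical, the header spelling "title" is an assumption, and the timestamp and genre are made-up placeholders.

```python
import csv
import io

# The same response, before and after a colleague reorders the columns.
# Timestamp and genre values are placeholders.
ORIGINAL = "Timestamp,title,genres\n2020-01-01 12:00:00,The Office,Comedy\n"
REORDERED = "title,genres,Timestamp\nThe Office,Comedy,2020-01-01 12:00:00\n"


def titles_by_index(text):
    """Read titles by position, assuming the title is always column 1."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [row[1] for row in reader]


def titles_by_name(text):
    """Read titles by header name, wherever that column happens to sit."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["title"] for row in reader]


print(titles_by_index(ORIGINAL))   # ['The Office']
print(titles_by_index(REORDERED))  # ['Comedy']
print(titles_by_name(ORIGINAL))    # ['The Office']
print(titles_by_name(REORDERED))   # ['The Office']
```

The positional version silently returns the wrong column after the reorder, while the name-based version keeps working. Renaming the header itself would still break `titles_by_name`, as the lecture points out.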
368 00:17:58,445 --> 00:18:00,570 And pretty much the rest of my code can be the same 369 00:18:00,570 --> 00:18:02,970 except I don't need this hack here on line 5. 370 00:18:02,970 --> 00:18:06,750 I don't need to just skip over to the next row from the get-go, 371 00:18:06,750 --> 00:18:10,890 because I now want the dictionary reader to handle the process of reading 372 00:18:10,890 --> 00:18:11,985 that first row for me. 373 00:18:11,985 --> 00:18:13,860 But otherwise, everything else stays the same 374 00:18:13,860 --> 00:18:15,693 except for this last line, where now I think 375 00:18:15,693 --> 00:18:21,300 I can now use row as a dictionary, not as a list per se, 376 00:18:21,300 --> 00:18:24,880 and print out specifically the title from each given row. 377 00:18:24,880 --> 00:18:27,690 So let me go ahead and run python of favorites.py again. 378 00:18:27,690 --> 00:18:31,480 And voila, it looks like I got the same result, several hundred of them. 379 00:18:31,480 --> 00:18:34,260 But let me stipulate that it's doing the same thing if we actually 380 00:18:34,260 --> 00:18:36,490 compared both of those side-by-side. 381 00:18:36,490 --> 00:18:36,990 All right. 382 00:18:36,990 --> 00:18:39,180 Before I forge ahead now to actually augment this 383 00:18:39,180 --> 00:18:44,610 with new functionality, any questions or confusion on this Python script 384 00:18:44,610 --> 00:18:49,530 we just wrote to open a file, wrap it with a reader or DictReader, 385 00:18:49,530 --> 00:18:54,510 and then iterate over the rows one at a time, printing the titles? 386 00:18:54,510 --> 00:18:56,510 Any questions, confusion on syntax at all? 387 00:18:56,510 --> 00:18:57,010 It's OK. 388 00:18:57,010 --> 00:18:59,370 We've only known or seen Python for a week. 389 00:18:59,370 --> 00:19:01,380 It's fine if it's still quite new. 390 00:19:01,380 --> 00:19:04,115 Anything, Brian, we should address? 391 00:19:04,115 --> 00:19:04,740 BRIAN YU: Yeah. 
392 00:19:04,740 --> 00:19:08,800 So why is it that you don't need to close the file using the syntax 393 00:19:08,800 --> 00:19:10,190 that you're using right here? 394 00:19:10,190 --> 00:19:11,732 DAVID J. MALAN: Really good question. 395 00:19:11,732 --> 00:19:15,250 Last week, I more pedantically used open on its own. 396 00:19:15,250 --> 00:19:19,210 And then I later used a close function that was associated with the file 397 00:19:19,210 --> 00:19:20,470 that I had just opened. 398 00:19:20,470 --> 00:19:23,800 Now, the more Pythonic way to do things, if you will, 399 00:19:23,800 --> 00:19:27,370 is actually to use this with keyword, which didn't exist in C. 400 00:19:27,370 --> 00:19:29,830 And it just tends to be a useful feature in Python 401 00:19:29,830 --> 00:19:35,470 whereby if you say with open, dot dot dot, it will open the file for you. 402 00:19:35,470 --> 00:19:39,280 Then it will remain open so long as your code is indented inside 403 00:19:39,280 --> 00:19:41,410 of that with keyword's block. 404 00:19:41,410 --> 00:19:43,780 And as soon as you get to the end of your program, 405 00:19:43,780 --> 00:19:45,732 it will automatically be closed for you. 406 00:19:45,732 --> 00:19:48,190 So this is one of these features where Python in some sense 407 00:19:48,190 --> 00:19:50,770 is trying to protect us from ourselves. 408 00:19:50,770 --> 00:19:52,900 It's probably pretty common for humans, myself 409 00:19:52,900 --> 00:19:55,000 included, to forget to close your file. 410 00:19:55,000 --> 00:19:57,580 That can create problems with saving things permanently. 411 00:19:57,580 --> 00:19:59,990 It can create memory leaks, as we know from C. 412 00:19:59,990 --> 00:20:02,740 So the with keyword just assumes that I'm not going to be an idiot 413 00:20:02,740 --> 00:20:04,150 and forget to close the file. 414 00:20:04,150 --> 00:20:08,050 Python is going to do it for me automatically.
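[Editor's sketch] The pattern described above can be illustrated roughly as follows. This is a minimal sketch, not the code shown on screen: the sample rows, the temporary file, and the names sample, path, seen, still_open, and closed_after are all invented for illustration, and the lowercase "title" column name is an assumption based on how the headers are referred to in the lecture. Strictly speaking, the file is closed as soon as the indented with block ends, which in a short script like this one is effectively the end of the program.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the exported CSV; these rows are made up.
sample = (
    "Timestamp,title,genres\n"
    "10/19/2020 13:00:00,The Office,Comedy\n"
    "10/19/2020 13:00:05,Breaking Bad,Drama\n"
)

# Write the sample to a temporary file so that open() has something to read.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
    path = tmp.name

seen = []
with open(path, "r", newline="") as file:
    reader = csv.DictReader(file)  # the first row supplies the dictionary keys
    for row in reader:
        seen.append(row["title"])  # look up by header name, not column number
    still_open = not file.closed   # still inside the block, so the file is open

closed_after = file.closed         # leaving the block closed the file for us

os.remove(path)                    # clean up the temporary file
print(seen)
```

No explicit call to close appears anywhere; the with statement takes care of it even if an exception is raised inside the block.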
415 00:20:08,050 --> 00:20:10,870 Other questions or confusions, Brian? 416 00:20:10,870 --> 00:20:13,690 BRIAN YU: How does DictReader know that Title 417 00:20:13,690 --> 00:20:16,270 is the name of the key inside of the dictionary? 418 00:20:16,270 --> 00:20:18,020 DAVID J. MALAN: Really good question, too. 419 00:20:18,020 --> 00:20:22,090 So it is designed by the authors of the Python language 420 00:20:22,090 --> 00:20:25,450 to look at the very first row in the file, 421 00:20:25,450 --> 00:20:29,380 split it on the commas in that very first row, 422 00:20:29,380 --> 00:20:34,090 and just assume that the first word or phrase before the first comma 423 00:20:34,090 --> 00:20:37,270 is the name of the first column, that the second word 424 00:20:37,270 --> 00:20:42,470 or phrase after the first comma is the name of the second column, 425 00:20:42,470 --> 00:20:43,310 and so forth. 426 00:20:43,310 --> 00:20:47,500 So a DictReader just presumes, as is the convention with CSVs, 427 00:20:47,500 --> 00:20:51,280 that your first row is going to contain the headings that you 428 00:20:51,280 --> 00:20:53,290 want to use to refer to those columns. 429 00:20:53,290 --> 00:20:56,860 If your CSV happens not to have such a heading whereby it just 430 00:20:56,860 --> 00:20:59,050 jumps right in on the first row to real data, 431 00:20:59,050 --> 00:21:02,140 then you're not going to be able to use a DictReader correctly, at least 432 00:21:02,140 --> 00:21:04,670 not without some manual configuration. 433 00:21:04,670 --> 00:21:05,170 All right. 434 00:21:05,170 --> 00:21:06,840 So let's go ahead and-- 435 00:21:06,840 --> 00:21:08,590 now I feel like there's a whole mess here. 436 00:21:08,590 --> 00:21:10,562 And some of these shows are pretty popular. 437 00:21:10,562 --> 00:21:13,270 And as I'm glancing over this, I definitely see some duplication. 438 00:21:13,270 --> 00:21:15,010 A whole bunch of you like The Office. 
439 00:21:15,010 --> 00:21:17,530 A whole bunch of you like Breaking Bad, Game of Thrones, 440 00:21:17,530 --> 00:21:19,280 and a whole bunch of other shows, as well. 441 00:21:19,280 --> 00:21:21,250 So it would be nicer, I think, if we kind of 442 00:21:21,250 --> 00:21:25,480 narrow the scope of our look at this data by just looking at unique values. 443 00:21:25,480 --> 00:21:26,807 You're looking at unique value. 444 00:21:26,807 --> 00:21:29,140 So rather than just iterate over the file top to bottom, 445 00:21:29,140 --> 00:21:31,690 printing out one title after another, why 446 00:21:31,690 --> 00:21:34,330 don't we go ahead and sort of accumulate all of this data 447 00:21:34,330 --> 00:21:38,800 in some kind of data structure so that we can throw away duplicate values 448 00:21:38,800 --> 00:21:42,910 and then only print out the unique titles that we've accumulated? 449 00:21:42,910 --> 00:21:44,630 So I bet we can do this in a few ways. 450 00:21:44,630 --> 00:21:47,650 But if we think back to last week's demonstration of our dictionary, 451 00:21:47,650 --> 00:21:50,388 you'll recall that I used what was called a set. 452 00:21:50,388 --> 00:21:52,930 And I'm going to go ahead and create a variable called titles 453 00:21:52,930 --> 00:21:55,180 and set it equal to something called set. 454 00:21:55,180 --> 00:21:57,310 And a set is just a collection of values. 455 00:21:57,310 --> 00:21:58,540 It's kind of like a list. 456 00:21:58,540 --> 00:22:00,580 But it eliminates duplicates for me. 457 00:22:00,580 --> 00:22:02,860 And that would seem to be exactly the characteristic 458 00:22:02,860 --> 00:22:04,870 that I want for this program. 459 00:22:04,870 --> 00:22:08,560 Now, instead of printing each title, which is now premature 460 00:22:08,560 --> 00:22:10,480 if I want to first filter out duplicates, 461 00:22:10,480 --> 00:22:11,990 I'm going to go ahead and do this. 
462 00:22:11,990 --> 00:22:17,290 I'm going to go ahead and add to the titles set using the add function 463 00:22:17,290 --> 00:22:19,570 the current row's title. 464 00:22:19,570 --> 00:22:21,250 So again, I'm not printing it now. 465 00:22:21,250 --> 00:22:25,510 I'm instead adding to the titles set that particular title. 466 00:22:25,510 --> 00:22:27,190 And if it's there already, no big deal. 467 00:22:27,190 --> 00:22:29,260 The set data structure in Python is going 468 00:22:29,260 --> 00:22:31,000 to throw away the duplicates for me. 469 00:22:31,000 --> 00:22:33,310 And it's only going to go ahead and keep the uniques. 470 00:22:33,310 --> 00:22:37,330 Now, at the bottom of my file, I need to do a little more work, admittedly. 471 00:22:37,330 --> 00:22:40,990 Now I have to iterate over the set to print out only those unique titles. 472 00:22:40,990 --> 00:22:41,740 So let me do this. 473 00:22:41,740 --> 00:22:46,555 For title in titles, go ahead and print out title. 474 00:22:46,555 --> 00:22:49,180 And this is where Python just gets really user-friendly, right? 475 00:22:49,180 --> 00:22:53,050 You don't have to do int i gets 0, i less than n, or whatever. 476 00:22:53,050 --> 00:22:55,540 You can just say for title in titles. 477 00:22:55,540 --> 00:22:59,200 And if the titles variable is the type of data structure 478 00:22:59,200 --> 00:23:04,690 that you can iterate over, which it will be if it's a list or if it's a set 479 00:23:04,690 --> 00:23:06,940 or even if it's a dictionary, another data structure 480 00:23:06,940 --> 00:23:11,180 we saw last week in Python, the for loop in Python will just know what to do. 481 00:23:11,180 --> 00:23:15,880 This will loop over all of the titles in the titles set. 482 00:23:15,880 --> 00:23:18,700 So let me go ahead and save this file and go ahead now 483 00:23:18,700 --> 00:23:20,920 and run python of favorites.py. 484 00:23:20,920 --> 00:23:25,240 And it looks like, yeah, the list is different in some way.
485 00:23:25,240 --> 00:23:29,463 But I'm seeing fewer results as I scroll up, definitely fewer than before, 486 00:23:29,463 --> 00:23:31,630 because my scrollbar didn't jump nearly as far down. 487 00:23:31,630 --> 00:23:33,260 But honestly, this is kind of a mess. 488 00:23:33,260 --> 00:23:34,660 Let's go ahead and sort this. 489 00:23:34,660 --> 00:23:37,502 Now, in C, it would have been kind of a pain to sort things. 490 00:23:37,502 --> 00:23:39,460 We'd have to whip out the pseudocode, probably, 491 00:23:39,460 --> 00:23:41,460 for bubble sort, selection sort, or, god forbid, 492 00:23:41,460 --> 00:23:43,270 merge sort and then implement it ourselves. 493 00:23:43,270 --> 00:23:47,210 But no, with Python comes, really, the proverbial kitchen sink of functions. 494 00:23:47,210 --> 00:23:49,510 So if you want to sort this set, you know what? 495 00:23:49,510 --> 00:23:50,950 Just say you want it sorted. 496 00:23:50,950 --> 00:23:53,890 There is a function in Python called sorted 497 00:23:53,890 --> 00:23:57,257 that will use one of those better algorithms-- maybe it's merge sort. 498 00:23:57,257 --> 00:23:58,840 Maybe it's something called quicksort. 499 00:23:58,840 --> 00:24:00,350 Maybe it's something else altogether. 500 00:24:00,350 --> 00:24:02,380 It's not going to use a big O of n squared sort. 501 00:24:02,380 --> 00:24:06,940 Someone at Python probably has spent the time implementing a better sort for us. 502 00:24:06,940 --> 00:24:08,817 But it will go ahead and sort the set for me. 503 00:24:08,817 --> 00:24:10,400 Now let me go ahead and do this again. 504 00:24:10,400 --> 00:24:13,360 Let me increase the size of my terminal window and rerun python 505 00:24:13,360 --> 00:24:15,070 of favorites.py. 506 00:24:15,070 --> 00:24:15,640 OK. 
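[Editor's sketch] The two steps just described, collecting titles in a set and then sorting them, might look roughly like this. It is a sketch under stated assumptions: the in-memory sample data and the name unique_sorted are hypothetical, and io.StringIO stands in for the real CSV file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample standing in for the real file; note the repeated title.
sample = io.StringIO("title\nThe Office\nBreaking Bad\nThe Office\nFriends\n")

titles = set()
for row in csv.DictReader(sample):
    titles.add(row["title"])    # adding a value that is already present changes nothing

unique_sorted = sorted(titles)  # sorted returns a new list; the set itself has no order
for title in unique_sorted:
    print(title)
```

For what it's worth, sorted in CPython, the reference implementation of Python, uses a merge-sort-based algorithm commonly called Timsort, so it is indeed not an n squared sort.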
507 00:24:15,640 --> 00:24:19,330 And now we have an interesting assortment 508 00:24:19,330 --> 00:24:22,570 of shows that's easier for me to wrap my mind around, 509 00:24:22,570 --> 00:24:25,743 because I have it now sorted here. 510 00:24:25,743 --> 00:24:28,660 And indeed, if I scroll all the way up, we should see all of the shows 511 00:24:28,660 --> 00:24:32,257 beginning with numbers or a period, which 512 00:24:32,257 --> 00:24:34,090 might have just been someone playing around, 513 00:24:34,090 --> 00:24:36,290 followed by the A words, the B words, and so forth. 514 00:24:36,290 --> 00:24:38,707 So now it's a little easier to wrap our minds around this. 515 00:24:38,707 --> 00:24:39,760 But something's up. 516 00:24:39,760 --> 00:24:44,110 I feel like a lot of you like Avatar: The Last Airbender. 517 00:24:44,110 --> 00:24:47,980 And yet I'm seeing it, indeed, four different times. 518 00:24:47,980 --> 00:24:49,720 But I thought we were filtering this down 519 00:24:49,720 --> 00:24:53,350 to uniques by using that set structure. 520 00:24:53,350 --> 00:24:54,340 So what's going on? 521 00:24:54,340 --> 00:24:56,200 And in fact, if I keep scrolling, I'm pretty 522 00:24:56,200 --> 00:24:59,080 sure I saw more duplicates in here. 523 00:24:59,080 --> 00:25:01,840 BoJack Horseman, Breaking Bad, Breaking Bad, 524 00:25:01,840 --> 00:25:07,810 Brooklyn Nine-Nine, Brooklyn Nine-Nine, CS50 in several different flavors. 525 00:25:07,810 --> 00:25:10,210 And yes, it keeps going. 526 00:25:10,210 --> 00:25:11,110 Friends. 527 00:25:11,110 --> 00:25:13,050 So I see a lot of duplicate values. 528 00:25:13,050 --> 00:25:14,980 So what's going on? 529 00:25:14,980 --> 00:25:17,960 Yeah, [? Gadana? ?] 
530 00:25:17,960 --> 00:25:22,480 AUDIENCE: Yeah, so your current sort is case insensitive-- sorry, 531 00:25:22,480 --> 00:25:26,680 is case sensitive, meaning that if someone spells avatar with capital 532 00:25:26,680 --> 00:25:30,800 A's in some places, then it's going to be a different result each time. 533 00:25:30,800 --> 00:25:32,050 DAVID J. MALAN: Yeah, exactly. 534 00:25:32,050 --> 00:25:35,530 Some of you weren't quite diligent when it came to capitalization. 535 00:25:35,530 --> 00:25:38,048 And so in fact, the reality is, as [? Gadana ?] notes, 536 00:25:38,048 --> 00:25:39,840 that there's differences in capitalization. 537 00:25:39,840 --> 00:25:41,090 Now, we've addressed this before. 538 00:25:41,090 --> 00:25:43,230 In fact, when you implemented your spell checker, 539 00:25:43,230 --> 00:25:44,980 you had to deal with this already when you 540 00:25:44,980 --> 00:25:46,780 were spell checking an arbitrary text. 541 00:25:46,780 --> 00:25:48,160 Some words might be capitalized. 542 00:25:48,160 --> 00:25:50,300 Some might be all lowercase, all uppercase. 543 00:25:50,300 --> 00:25:52,810 And you wanted to tolerate different casings. 544 00:25:52,810 --> 00:25:55,840 And so we probably solved this by just forcing everything 545 00:25:55,840 --> 00:25:58,540 to uppercase or everything to lowercase and doing things, 546 00:25:58,540 --> 00:26:00,500 therefore, case insensitively. 547 00:26:00,500 --> 00:26:01,750 So give me just a moment here. 548 00:26:01,750 --> 00:26:06,007 And I'm going to go ahead and make a quick change to my form here. 549 00:26:06,007 --> 00:26:07,840 Let's go ahead and change this in such a way 550 00:26:07,840 --> 00:26:11,020 that we actually force everything to uppercase or lowercase. 551 00:26:11,020 --> 00:26:13,760 Doesn't really matter which, but we need to canonicalize things, 552 00:26:13,760 --> 00:26:14,860 so to speak, in some way. 
553 00:26:14,860 --> 00:26:18,790 And to canonicalize things just means to format all of your data 554 00:26:18,790 --> 00:26:20,020 in some standard way. 555 00:26:20,020 --> 00:26:22,390 So to [? Gadana's ?] point, let's just standardize 556 00:26:22,390 --> 00:26:23,920 the capitalization of things. 557 00:26:23,920 --> 00:26:25,600 Maybe all uppercase, all lowercase. 558 00:26:25,600 --> 00:26:27,260 We just need to make a judgment call. 559 00:26:27,260 --> 00:26:29,427 So I'm going to go ahead and make a few tweaks here. 560 00:26:29,427 --> 00:26:30,670 I'm still going to use a set. 561 00:26:30,670 --> 00:26:33,190 I'm still going to read the CSV as before. 562 00:26:33,190 --> 00:26:37,270 But instead of just adding the title with row bracket title, 563 00:26:37,270 --> 00:26:40,180 I'm going to go ahead and force it to uppercase, just 564 00:26:40,180 --> 00:26:42,850 arbitrarily, just for the sake of uniformity. 565 00:26:42,850 --> 00:26:45,610 And then let's go ahead and check what exactly has happened here. 566 00:26:45,610 --> 00:26:47,050 I'm not going to change anything else. 567 00:26:47,050 --> 00:26:49,383 But let me go ahead and increase the size of my terminal 568 00:26:49,383 --> 00:26:52,600 window, rerun python of favorites.py. 569 00:26:52,600 --> 00:26:53,517 And voila. 570 00:26:53,517 --> 00:26:55,600 It's a little harder to read, just because I'm not 571 00:26:55,600 --> 00:26:56,770 used to reading all caps. 572 00:26:56,770 --> 00:26:58,687 Kind of looks like we're yelling at ourselves. 573 00:26:58,687 --> 00:27:01,600 But I don't see-- wait a minute. 574 00:27:01,600 --> 00:27:05,560 I still see The Office over here twice. 575 00:27:05,560 --> 00:27:11,920 If I keep scrolling here, so far, I see Stranger Things and Strainger Things. 576 00:27:11,920 --> 00:27:13,570 That just looks like a typo. 577 00:27:13,570 --> 00:27:15,680 I see two Sherlocks, though. 578 00:27:15,680 --> 00:27:17,380 This is a little suspicious. 
579 00:27:17,380 --> 00:27:21,730 So [? Gadana, ?] you and I don't seem to have solved things fully. 580 00:27:21,730 --> 00:27:24,220 And this one's a little more subtle. 581 00:27:24,220 --> 00:27:30,970 What more should I perhaps do to my data to ensure we get duplicates removed? 582 00:27:30,970 --> 00:27:32,200 Olivia? 583 00:27:32,200 --> 00:27:34,407 AUDIENCE: Maybe trim around the edges. 584 00:27:34,407 --> 00:27:35,990 DAVID J. MALAN: Trim around the edges. 585 00:27:35,990 --> 00:27:37,480 I like the sound of that, but what do you mean? 586 00:27:37,480 --> 00:27:38,320 What does that do? 587 00:27:38,320 --> 00:27:40,862 AUDIENCE: Oh, like, trim off the extra spaces in case someone 588 00:27:40,862 --> 00:27:42,940 put a space before or after the words. 589 00:27:42,940 --> 00:27:44,230 DAVID J. MALAN: Yeah, exactly. 590 00:27:44,230 --> 00:27:47,140 It's pretty common for humans, intentionally or accidentally, 591 00:27:47,140 --> 00:27:48,920 to hit the Space bar where they shouldn't. 592 00:27:48,920 --> 00:27:51,790 And in fact, I'm kind of inferring that I bet one or more 593 00:27:51,790 --> 00:27:55,143 of you accidentally typed Sherlock, space, and then decided, 594 00:27:55,143 --> 00:27:55,810 nope, that's it. 595 00:27:55,810 --> 00:27:57,040 I'm not typing anything else. 596 00:27:57,040 --> 00:28:00,740 But that space, even though we can't quite see it obviously, is there. 597 00:28:00,740 --> 00:28:04,060 And when we do a string comparison or when the set data structure does that, 598 00:28:04,060 --> 00:28:08,110 it's actually going to be noticed when doing those comparisons. 599 00:28:08,110 --> 00:28:10,092 And therefore they're not going to be the same. 600 00:28:10,092 --> 00:28:11,800 So I can do this in a few different ways. 601 00:28:11,800 --> 00:28:15,080 But it turns out, in Python, you can chain functions together, 602 00:28:15,080 --> 00:28:17,390 which is also, too, kind of a fancy feature. 
603 00:28:17,390 --> 00:28:18,650 Notice what I'm doing here. 604 00:28:18,650 --> 00:28:21,070 I'm still accessing the titles set. 605 00:28:21,070 --> 00:28:23,710 I'm adding the following value to it. 606 00:28:23,710 --> 00:28:27,610 I'm adding the value row bracket title, but not quite. 607 00:28:27,610 --> 00:28:30,880 That is a string or an str, in Python speak. 608 00:28:30,880 --> 00:28:33,130 I'm going to go ahead and strip it, which 609 00:28:33,130 --> 00:28:36,340 means if we look up the documentation for this function, to Olivia's point, 610 00:28:36,340 --> 00:28:39,130 it's going to strip off or trim all of the white space 611 00:28:39,130 --> 00:28:41,200 to the left, all of the white space to the right, 612 00:28:41,200 --> 00:28:43,840 whether that's the Space bar or the Enter key 613 00:28:43,840 --> 00:28:46,870 or the Tab character or a few other things, as well. 614 00:28:46,870 --> 00:28:50,200 It's just going to get rid of leading and trailing white space. 615 00:28:50,200 --> 00:28:53,650 And then whatever's left over, I'm going to go ahead and force everything 616 00:28:53,650 --> 00:28:56,810 to uppercase in the spirit of [? Gadana's ?] suggestion, too. 617 00:28:56,810 --> 00:29:00,470 So we're sort of combining two good ideas now to really massage the data, 618 00:29:00,470 --> 00:29:02,470 if you will, into a cleaner format. 619 00:29:02,470 --> 00:29:04,780 And this is such a real-world reality. 620 00:29:04,780 --> 00:29:09,588 Humans, you and I, cannot be trusted to input data the way we are supposed to. 621 00:29:09,588 --> 00:29:11,380 Sometimes it's all lowercase, because we're 622 00:29:11,380 --> 00:29:13,463 being a little lazy or a little social media-like, 623 00:29:13,463 --> 00:29:16,120 even if we're checking out from Amazon and trying 624 00:29:16,120 --> 00:29:18,310 to input a valid postal address. 
625 00:29:18,310 --> 00:29:22,127 Sometimes it's all capitals, because I can think of a few people in my life 626 00:29:22,127 --> 00:29:24,460 who don't quite understand the Caps Lock thing just yet. 627 00:29:24,460 --> 00:29:26,710 And so things might be all capitalized instead. 628 00:29:26,710 --> 00:29:30,580 This is not good for computer systems that require precision, 629 00:29:30,580 --> 00:29:32,440 to our emphasis in week 0. 630 00:29:32,440 --> 00:29:35,140 And so massaging data means cleaning it up, 631 00:29:35,140 --> 00:29:38,500 doing some mutations that don't really change the meaning of the data 632 00:29:38,500 --> 00:29:41,740 but canonicalize it, standardize it, so that you're 633 00:29:41,740 --> 00:29:44,950 comparing apples and apples, so to speak, not apples and oranges. 634 00:29:44,950 --> 00:29:47,950 Well, let me go ahead and run this again in my bigger terminal 635 00:29:47,950 --> 00:29:50,140 window, python of favorites.py. 636 00:29:50,140 --> 00:29:50,710 Voila. 637 00:29:50,710 --> 00:29:55,220 In scrolling up, up, up, I think we're in a better place. 638 00:29:55,220 --> 00:29:57,520 I only see one Office now. 639 00:29:57,520 --> 00:30:01,510 And if I keep scrolling up and up and up, I'm seeing typos still, 640 00:30:01,510 --> 00:30:03,910 but nothing related to white space. 641 00:30:03,910 --> 00:30:08,340 And I think we have a much cleaner unique list of titles at this point. 642 00:30:08,340 --> 00:30:10,800 Of course, if we scroll up, I would have to be 643 00:30:10,800 --> 00:30:14,970 a lot more clever if I want to detect things like typographical errors. 644 00:30:14,970 --> 00:30:19,870 It looks like one of you was very diligent about putting F.R.I. 645 00:30:19,870 --> 00:30:22,970 and so forth but then got bored at the end and left off the last period. 646 00:30:22,970 --> 00:30:25,470 But that's going to happen when you're taking in user input. 
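[Editor's sketch] The chained strip and upper calls discussed above can be tried on a handful of made-up inputs. The list raw_titles and the name cleaned are invented for illustration; the entries imitate the kinds of variation mentioned in the lecture (different capitalization, stray spaces, and one genuine typo).

```python
# Hypothetical user input with inconsistent case and stray whitespace.
raw_titles = [
    "The Office",
    "the office",
    "THE OFFICE ",
    " Sherlock",
    "Sherlock",
    "Strainger Things",
]

titles = set()
for raw in raw_titles:
    # strip() removes leading and trailing whitespace; upper() then
    # standardizes the capitalization of whatever is left.
    titles.add(raw.strip().upper())

cleaned = sorted(titles)
print(cleaned)
```

Six inputs collapse to three entries. The misspelled title survives as its own entry, since trimming and uppercasing do nothing about typos.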
647 00:30:25,470 --> 00:30:28,140 We've, of course, got all these variants of CS50. 648 00:30:28,140 --> 00:30:30,570 That's going to be a mess to clean up, because now you 649 00:30:30,570 --> 00:30:35,130 can imagine having to add a whole bunch of if conditions and elses and else ifs 650 00:30:35,130 --> 00:30:38,160 to clean all of that up if we do want to canonicalize 651 00:30:38,160 --> 00:30:41,920 all different flavors of CS50 as, quote unquote, "CS50." 652 00:30:41,920 --> 00:30:43,890 So this is a very slippery slope. 653 00:30:43,890 --> 00:30:47,010 You and I could start writing a huge amount of data just to clean this up. 654 00:30:47,010 --> 00:30:50,940 But that's the reality when dealing with real-world data. 655 00:30:50,940 --> 00:30:55,140 Well, let's go ahead now and improve this program further, 656 00:30:55,140 --> 00:30:57,810 do something a little fancier, because I now 657 00:30:57,810 --> 00:31:00,090 can trust that my data has been canonicalized 658 00:31:00,090 --> 00:31:03,900 except for the actual typos or the weird variants of CS50 and the like. 659 00:31:03,900 --> 00:31:07,470 Let's go ahead and figure out what's the most popular favorite TV 660 00:31:07,470 --> 00:31:10,510 show among the audience here. 661 00:31:10,510 --> 00:31:12,300 So I'm going to start where I have before, 662 00:31:12,300 --> 00:31:14,133 with my current code, because I think I have 663 00:31:14,133 --> 00:31:16,143 most of the building blocks in place. 664 00:31:16,143 --> 00:31:18,810 I'm going to go ahead and clean up my code a little bit in here. 665 00:31:18,810 --> 00:31:22,050 I'm going to go ahead and give myself a separate variable now called title 666 00:31:22,050 --> 00:31:26,040 just so that I can think about things in a little more orderly fashion. 667 00:31:26,040 --> 00:31:29,200 But I'm not going to start adding things to this set anymore. 
668 00:31:29,200 --> 00:31:32,220 In fact, a set, I don't think, is really going 669 00:31:32,220 --> 00:31:35,880 to be sufficient to keep track of the popularity of TV shows, 670 00:31:35,880 --> 00:31:38,820 because by definition, the set is throwing away duplicates. 671 00:31:38,820 --> 00:31:40,680 But the goal now is kind of the opposite. 672 00:31:40,680 --> 00:31:45,240 I want to know which are the duplicates so that I can tell you 673 00:31:45,240 --> 00:31:46,860 that this many people like The Office. 674 00:31:46,860 --> 00:31:50,530 This many people like Breaking Bad and the like. 675 00:31:50,530 --> 00:31:56,010 So what tools do we have in Python's toolkit via which we could accumulate 676 00:31:56,010 --> 00:31:59,320 or figure out that information? 677 00:31:59,320 --> 00:32:02,740 Any thoughts on what data structure might help us here 678 00:32:02,740 --> 00:32:07,870 if we want to figure out show, popularity, show, popularity? 679 00:32:07,870 --> 00:32:11,950 And by popularity, I just mean the frequency of it in the CSV file. 680 00:32:11,950 --> 00:32:13,720 Santiago? 681 00:32:13,720 --> 00:32:17,110 AUDIENCE: I guess one option could be to use dictionaries 682 00:32:17,110 --> 00:32:20,410 so that you can have The Office, I don't know, 683 00:32:20,410 --> 00:32:23,110 20 votes, and then Game of Thrones, another one, 684 00:32:23,110 --> 00:32:27,023 so that a dictionary could really help you visualize that. 685 00:32:27,023 --> 00:32:28,690 DAVID J. MALAN: Yeah, perfect instincts. 686 00:32:28,690 --> 00:32:31,660 Recall that a dictionary, at the end of the day, no matter how 687 00:32:31,660 --> 00:32:34,450 sophisticated it's implemented underneath the hood, 688 00:32:34,450 --> 00:32:35,680 like your spell checker-- 689 00:32:35,680 --> 00:32:38,240 It's just a collection of key value pairs. 
690 00:32:38,240 --> 00:32:42,790 And indeed, it's maybe one of the most useful data structures in any language, 691 00:32:42,790 --> 00:32:45,820 because this ability to associate one piece of data with another 692 00:32:45,820 --> 00:32:49,150 is just a very general purpose solution to problems. 693 00:32:49,150 --> 00:32:51,730 And indeed, to Santiago's point, if the problem at hand 694 00:32:51,730 --> 00:32:53,650 is to figure out the popularity of shows, 695 00:32:53,650 --> 00:32:58,510 well, let's make the keys the titles of our shows and the frequencies thereof-- 696 00:32:58,510 --> 00:32:59,830 the votes, so to speak-- 697 00:32:59,830 --> 00:33:01,810 the values of those keys. 698 00:33:01,810 --> 00:33:06,450 We're going to map title to votes, title to vote, title to vote, and so forth. 699 00:33:06,450 --> 00:33:08,145 So a dictionary is exactly that. 700 00:33:08,145 --> 00:33:09,520 So let me go ahead and scroll up. 701 00:33:09,520 --> 00:33:10,978 And I can make a little tweak here. 702 00:33:10,978 --> 00:33:14,260 Instead of a set, I can instead say dict and give myself 703 00:33:14,260 --> 00:33:15,598 just an empty dictionary. 704 00:33:15,598 --> 00:33:18,640 There's actually shorthand notation for that that's a little more common. 705 00:33:18,640 --> 00:33:20,830 So you use two empty curly braces. 706 00:33:20,830 --> 00:33:22,810 That just means the exact same thing. 707 00:33:22,810 --> 00:33:25,270 Give me a dictionary that's initially empty. 708 00:33:25,270 --> 00:33:27,400 There's no fancy shortcut for a set. 709 00:33:27,400 --> 00:33:30,370 You have to literally type out S-E-T, open paren and closed paren. 710 00:33:30,370 --> 00:33:34,220 But dictionaries are so common, so popular, so powerful, 711 00:33:34,220 --> 00:33:38,350 they have this little syntactic shortcut of just two curly braces, 712 00:33:38,350 --> 00:33:39,560 open and closed. 713 00:33:39,560 --> 00:33:42,700 So now that I have that, let me go ahead and do this. 
714 00:33:42,700 --> 00:33:45,580 Inside of my for loop, instead of printing 715 00:33:45,580 --> 00:33:48,880 the title, which I don't want to do, and instead of adding it to the set, 716 00:33:48,880 --> 00:33:50,770 I now want to add it to the dictionary. 717 00:33:50,770 --> 00:33:51,860 So how do I do that? 718 00:33:51,860 --> 00:33:55,480 Well, if my dictionary is called titles, I think I can essentially do something 719 00:33:55,480 --> 00:34:02,710 like this, titles bracket title = or maybe += 1. 720 00:34:02,710 --> 00:34:07,120 Maybe I can kind of use the dictionary as just a little cheat sheet 721 00:34:07,120 --> 00:34:12,050 of counts, numbers, that start at 0 and then just add 1, add 2, add 3. 722 00:34:12,050 --> 00:34:17,860 So every time I see The Office, The Office, The Office, do += 1, += 1. 723 00:34:17,860 --> 00:34:20,199 We can't do ++, because that's not a thing in Python. 724 00:34:20,199 --> 00:34:24,580 It only exists in C. But this would seem to go into the dictionary called 725 00:34:24,580 --> 00:34:29,260 titles, look up the key that matches this specific title, 726 00:34:29,260 --> 00:34:34,340 and then increment whatever value is there by 1. 727 00:34:34,340 --> 00:34:37,380 But I'm going to go ahead and run this a little naively here. 728 00:34:37,380 --> 00:34:40,280 Let me go ahead and run python of favorites.py. 729 00:34:40,280 --> 00:34:43,400 And wow, it broke already on line 9. 730 00:34:43,400 --> 00:34:47,389 So sort of an apt choice of show to begin with, 731 00:34:47,389 --> 00:34:49,530 we have a key error with Punisher. 732 00:34:49,530 --> 00:34:50,917 So Punisher is bad. 733 00:34:50,917 --> 00:34:52,250 Something bad has just happened. 734 00:34:52,250 --> 00:34:53,250 But what does that mean? 735 00:34:53,250 --> 00:34:55,429 A key error is referring to the fact that I 736 00:34:55,429 --> 00:34:59,407 tried to access an invalid key in a dictionary.
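[Editor's sketch] The failure just described can be reproduced in isolation. This is a small sketch rather than the code on screen; the try and except wrapper and the name error are additions for illustration, so that the snippet can report the problem instead of stopping with a traceback.

```python
titles = {}          # an empty dictionary, as in the lecture
title = "PUNISHER"

try:
    # += has to read the current value of titles[title] before adding 1,
    # and there is no such key yet, so Python raises KeyError.
    titles[title] += 1
    error = None
except KeyError as e:
    error = e

print(repr(error))
```

The dictionary is still empty afterward, because the failed lookup happens before anything is stored.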
737 00:34:59,407 --> 00:35:01,490 This is saying that literally in this line of code 738 00:35:01,490 --> 00:35:04,610 here, even though titles is a dictionary and even 739 00:35:04,610 --> 00:35:07,130 though the value of title, singular, is, quote 740 00:35:07,130 --> 00:35:09,560 unquote, "PUNISHER," I'm getting a key error, 741 00:35:09,560 --> 00:35:13,230 because that title does not yet exist. 742 00:35:13,230 --> 00:35:17,060 So even if you're not sure of the Python syntax for fixing this problem, 743 00:35:17,060 --> 00:35:21,530 what's the intuitive solution here? 744 00:35:21,530 --> 00:35:25,610 I cannot increment the frequency of the Punisher, 745 00:35:25,610 --> 00:35:28,130 because Punisher is not in the dictionary. 746 00:35:28,130 --> 00:35:29,986 It almost feels like a catch-22. 747 00:35:29,986 --> 00:35:31,890 [? Greg? ?] 748 00:35:31,890 --> 00:35:35,900 AUDIENCE: I think that you need, first of all, to create a for loop 749 00:35:35,900 --> 00:35:40,520 and maybe assign a value to everything in the dictionary. 750 00:35:40,520 --> 00:35:43,683 For example, a value 0, and then add 1. 751 00:35:43,683 --> 00:35:45,350 DAVID J. MALAN: Yeah, so good instincts. 752 00:35:45,350 --> 00:35:46,730 And here, I can use another metaphor. 753 00:35:46,730 --> 00:35:49,147 I worry we might have a chicken and the egg problem there, 754 00:35:49,147 --> 00:35:51,470 because I don't think I can go to the top of my code, 755 00:35:51,470 --> 00:35:56,420 add a loop that initializes all of the values in the dictionary to 0, 756 00:35:56,420 --> 00:36:01,130 because I would need to know all of the names of the shows at that point. 757 00:36:01,130 --> 00:36:02,180 Now, that's fine. 758 00:36:02,180 --> 00:36:05,330 I think I could take you maybe more literally, [? Greg, ?] 
759 00:36:05,330 --> 00:36:09,630 and open up the CSV file, iterate over it top to bottom, 760 00:36:09,630 --> 00:36:12,920 and, any time I see a title, just initialize it 761 00:36:12,920 --> 00:36:16,220 in the dictionary as having a value of 0, 0, 0. 762 00:36:16,220 --> 00:36:20,280 Then have another for loop, maybe reopen the file, and do the same. 763 00:36:20,280 --> 00:36:21,380 And that would work. 764 00:36:21,380 --> 00:36:23,540 But it's arguably not very efficient. 765 00:36:23,540 --> 00:36:26,330 It is asymptotically, in terms of big O. But that would 766 00:36:26,330 --> 00:36:28,220 seem to be doing twice as much work. 767 00:36:28,220 --> 00:36:31,820 Iterate over the file once just to initialize everything to 0. 768 00:36:31,820 --> 00:36:35,330 Then iterate over the file a second time just to increment the counts. 769 00:36:35,330 --> 00:36:38,360 I think we can do things a little more efficiently. 770 00:36:38,360 --> 00:36:41,090 I think we can achieve not only correctness but better design. 771 00:36:41,090 --> 00:36:45,560 Any thoughts on how we can still solve this problem without having 772 00:36:45,560 --> 00:36:48,290 to iterate over the whole thing twice? 773 00:36:48,290 --> 00:36:50,360 Yeah, [? Semowit? ?] 774 00:36:50,360 --> 00:36:53,450 AUDIENCE: I think we can add in an if statement 775 00:36:53,450 --> 00:36:55,970 to check if that key is in the dictionary. 776 00:36:55,970 --> 00:36:59,865 And if it's not, then add it and then go ahead and increment the value after. 777 00:36:59,865 --> 00:37:00,740 DAVID J. MALAN: Nice. 778 00:37:00,740 --> 00:37:02,460 And we can do exactly that. 779 00:37:02,460 --> 00:37:04,310 So let's just apply that intuition. 780 00:37:04,310 --> 00:37:08,583 If the problem is that I'm trying to access a key that does not yet exist, 781 00:37:08,583 --> 00:37:10,500 well, let's just be a little smarter about it. 782 00:37:10,500 --> 00:37:14,090 And to [? Semowit's ?] 
point, let's check whether the key exists. 783 00:37:14,090 --> 00:37:16,020 And if it does, then increment it. 784 00:37:16,020 --> 00:37:19,340 But if it does not, then and only then, [? to Greg's ?] advice, 785 00:37:19,340 --> 00:37:20,730 initialize it to 0. 786 00:37:20,730 --> 00:37:21,570 So let me do that. 787 00:37:21,570 --> 00:37:24,980 Let me go ahead and say if title in titles, 788 00:37:24,980 --> 00:37:28,520 which is the very Pythonic, beautiful way of asking a question 789 00:37:28,520 --> 00:37:30,500 like that, way cleaner than in C-- 790 00:37:30,500 --> 00:37:35,390 let me go ahead, then, and say exactly the line from before. 791 00:37:35,390 --> 00:37:40,280 Else, though, if that title is not yet in the dictionary called titles, 792 00:37:40,280 --> 00:37:41,720 well, that's OK, too. 793 00:37:41,720 --> 00:37:47,150 I can go ahead and say titles bracket title = 0. 794 00:37:47,150 --> 00:37:51,950 So the difference here is that I can certainly index 795 00:37:51,950 --> 00:37:57,740 into a dictionary using a key that doesn't exist if I plan at that moment 796 00:37:57,740 --> 00:37:58,730 to give it a value. 797 00:37:58,730 --> 00:38:02,030 That's OK, and that has always been OK since last week. 798 00:38:02,030 --> 00:38:07,490 But, however, if I want to go ahead and increment the value that's there, 799 00:38:07,490 --> 00:38:11,630 I'm going to go ahead and do that in this separate line. 800 00:38:11,630 --> 00:38:13,850 But I did introduce a bug. 801 00:38:13,850 --> 00:38:15,770 I did introduce a bug here. 802 00:38:15,770 --> 00:38:19,220 I think I need to go one step further logically. 803 00:38:19,220 --> 00:38:24,480 I don't think I want to initialize this to 0 per se. 804 00:38:24,480 --> 00:38:29,090 Does anyone see a subtle bug in my logic here? 805 00:38:29,090 --> 00:38:32,570 If the title is already in the dictionary, I'm incrementing it by 1. 806 00:38:32,570 --> 00:38:37,000 Otherwise, I'm initializing it to 0. 
807 00:38:37,000 --> 00:38:38,500 Any subtle catches here? 808 00:38:38,500 --> 00:38:40,690 Yeah, Olivia, what do you see? 809 00:38:40,690 --> 00:38:44,700 AUDIENCE: I think you should initialize it to 1, since it's the first instance. 810 00:38:44,700 --> 00:38:45,700 DAVID J. MALAN: Exactly. 811 00:38:45,700 --> 00:38:46,870 I should initialize it to 1. 812 00:38:46,870 --> 00:38:50,137 Otherwise, I'm accidentally overlooking this particular title, 813 00:38:50,137 --> 00:38:51,970 and I'm going to go ahead and undercount it. 814 00:38:51,970 --> 00:38:54,100 So I can fix this either by doing this. 815 00:38:54,100 --> 00:38:57,820 Or frankly, if you prefer, I don't technically need to use an if else. 816 00:38:57,820 --> 00:39:00,730 I can use just an if by doing something like this instead. 817 00:39:00,730 --> 00:39:04,900 I could say if title not in titles, then I could go ahead 818 00:39:04,900 --> 00:39:07,420 and say titles bracket title gets 0. 819 00:39:07,420 --> 00:39:12,270 And then after that, I can blindly, so to speak, just do this. 820 00:39:12,270 --> 00:39:13,480 So which one is better? 821 00:39:13,480 --> 00:39:15,460 I think the second one is maybe a little better 822 00:39:15,460 --> 00:39:17,420 in that I'm saving one line of code. 823 00:39:17,420 --> 00:39:19,270 But it's ensuring with that if condition, 824 00:39:19,270 --> 00:39:24,190 to [? Semowit's ?] advice, that I'm not indexing into the titles dictionary 825 00:39:24,190 --> 00:39:27,050 until I'm sure that the title is in there. 826 00:39:27,050 --> 00:39:31,570 So let me go ahead and run this now, python of favorites.py, Enter. 827 00:39:31,570 --> 00:39:34,090 And OK, it didn't crash, so that's good. 828 00:39:34,090 --> 00:39:36,400 But I'm not yet seeing any useful information. 829 00:39:36,400 --> 00:39:38,740 But I now have access to a bit more. 
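The counting logic discussed above can be sketched roughly as follows. This is a minimal sketch rather than the lecture's actual file: the `rows` list is made-up sample data standing in for the CSV, and the whitespace and uppercase canonicalization is assumed from the "PUNISHER" key mentioned earlier. Only the `titles` dictionary and the `title` variable are named in the lecture.

```python
# Hypothetical sample data standing in for the rows of the CSV file.
rows = [
    {"title": "The Office"},
    {"title": "punisher "},
    {"title": "the office"},
]

titles = {}
for row in rows:
    # Assumed canonicalization: trim whitespace and force uppercase.
    title = row["title"].strip().upper()
    # Initialize only when the key is missing, so the increment below is safe.
    if title not in titles:
        titles[title] = 0
    titles[title] += 1

# print accepts multiple arguments and separates them with a space by default.
for title in titles:
    print(title, titles[title])
```

Initializing to 0 and then always incrementing gives the same counts as the if/else version that initializes a new title to 1.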
830 00:39:38,740 --> 00:39:41,680 Let me scroll down now to the bottom of this program, where 831 00:39:41,680 --> 00:39:43,360 I have now this loop. 832 00:39:43,360 --> 00:39:45,460 Let me go ahead and print out not just the title 833 00:39:45,460 --> 00:39:49,870 but the value of that key in the dictionary by just indexing 834 00:39:49,870 --> 00:39:50,500 into it here. 835 00:39:50,500 --> 00:39:52,030 And you might not have seen this syntax before. 836 00:39:52,030 --> 00:39:54,758 But with print, you can actually pass in multiple arguments. 837 00:39:54,758 --> 00:39:57,550 And by default, print will just separate them with a space for you. 838 00:39:57,550 --> 00:39:59,860 You can override that behavior and separate them with anything. 839 00:39:59,860 --> 00:40:02,777 But this is just meant to be a quick and dirty program that prints out 840 00:40:02,777 --> 00:40:04,820 titles and now the popularity thereof. 841 00:40:04,820 --> 00:40:07,210 So let me run this again, python of favorites.py. 842 00:40:07,210 --> 00:40:08,470 And voila. 843 00:40:08,470 --> 00:40:12,040 It's kind of all over the place. 844 00:40:12,040 --> 00:40:14,800 Office, super popular with 26 votes there. 845 00:40:14,800 --> 00:40:18,220 A lot of single votes here. 846 00:40:18,220 --> 00:40:19,810 Big Bang Theory has nine. 847 00:40:19,810 --> 00:40:21,340 You know, this is all nice and good. 848 00:40:21,340 --> 00:40:24,548 But I feel like this is going to take me forever to wrap my mind around which 849 00:40:24,548 --> 00:40:25,990 are the most popular shows. 850 00:40:25,990 --> 00:40:27,560 So of course, how would we do this? 851 00:40:27,560 --> 00:40:30,250 Well, to the point made earlier, with spreadsheets, my god, 852 00:40:30,250 --> 00:40:33,100 in Microsoft Excel or Google Spreadsheets or Apple Numbers, 853 00:40:33,100 --> 00:40:35,590 you just click the column heading and boom, sorted. 
854 00:40:35,590 --> 00:40:38,450 We seem to have lost that capability unless we now do it in code. 855 00:40:38,450 --> 00:40:40,450 So let me do that for us. 856 00:40:40,450 --> 00:40:42,550 Let me go ahead and go back to my code. 857 00:40:42,550 --> 00:40:48,520 And it looks like sorted, even though it does work on dictionaries, 858 00:40:48,520 --> 00:40:52,340 is actually sorting by key, not by value. 859 00:40:52,340 --> 00:40:55,030 And here's where our Python programming techniques need 860 00:40:55,030 --> 00:40:56,530 to get a little more sophisticated. 861 00:40:56,530 --> 00:40:58,572 And we want to introduce another feature here now 862 00:40:58,572 --> 00:41:01,660 of Python which is going to solve this problem specifically 863 00:41:01,660 --> 00:41:03,680 but in a pretty general way. 864 00:41:03,680 --> 00:41:06,310 So if we read the documentation for sorted, 865 00:41:06,310 --> 00:41:11,320 the sorted function indeed sorts sets by the values therein. 866 00:41:11,320 --> 00:41:13,840 It sorts lists by the values therein. 867 00:41:13,840 --> 00:41:17,140 It sorts dictionaries by the keys therein, 868 00:41:17,140 --> 00:41:20,960 because dictionaries have two pieces of information for every element. 869 00:41:20,960 --> 00:41:23,450 It has a key and a value, not just a value. 870 00:41:23,450 --> 00:41:25,390 So by default, sorted sorts by key. 871 00:41:25,390 --> 00:41:28,150 So we somehow have to override that behavior. 872 00:41:28,150 --> 00:41:29,390 So how can we do this? 873 00:41:29,390 --> 00:41:31,840 Well, it turns out that the sorted function 874 00:41:31,840 --> 00:41:35,890 takes another optional argument literally called key. 875 00:41:35,890 --> 00:41:41,570 And the key argument takes as its value the name of a function. 876 00:41:41,570 --> 00:41:43,690 And this is where things get really interesting, 877 00:41:43,690 --> 00:41:45,370 if not confusing, really quickly. 
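To illustrate the default behavior described above, here is a small sketch. The three counts are sample values chosen for illustration, not the full data set.

```python
# Sample counts for illustration; the real dictionary has many more titles.
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}

# Iterating over a dictionary yields its keys, so sorted() orders the keys
# alphabetically and ignores the counts entirely.
by_key = sorted(titles)
print(by_key)
```

The result is a list of titles in alphabetical order, which is why a key function is needed to order by count instead.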
878 00:41:45,370 --> 00:41:50,620 It turns out, in Python, you can pass around functions as arguments 879 00:41:50,620 --> 00:41:51,790 by way of their name. 880 00:41:51,790 --> 00:41:56,080 And technically, you can do this in C. It's a lot more syntactically involved. 881 00:41:56,080 --> 00:41:57,777 But in Python, it's very common. 882 00:41:57,777 --> 00:41:59,110 In JavaScript, it's very common. 883 00:41:59,110 --> 00:42:01,930 In a lot of languages, it's very common to think of functions 884 00:42:01,930 --> 00:42:06,040 as first-class objects, which is a fancy way of saying you can pass them around 885 00:42:06,040 --> 00:42:08,440 just like they are variables themselves. 886 00:42:08,440 --> 00:42:09,730 We're not calling them yet. 887 00:42:09,730 --> 00:42:11,720 But you can pass them around by their name. 888 00:42:11,720 --> 00:42:13,310 So what do I mean by this? 889 00:42:13,310 --> 00:42:18,940 Well, I need a function now to sort my dictionary by its value. 890 00:42:18,940 --> 00:42:22,900 And only I know how to do this, perhaps, so let me go ahead and give myself 891 00:42:22,900 --> 00:42:25,990 a generic function name just for the moment called f-- f for function, 892 00:42:25,990 --> 00:42:26,907 kind of like in math-- 893 00:42:26,907 --> 00:42:28,907 because we're going to get rid of it eventually. 894 00:42:28,907 --> 00:42:31,120 But let me go ahead and temporarily define a function 895 00:42:31,120 --> 00:42:34,150 called f that takes as input a title. 896 00:42:34,150 --> 00:42:39,140 And then it returns for me the value corresponding to that key. 897 00:42:39,140 --> 00:42:43,060 So I'm going to go ahead and return titles bracket title. 898 00:42:43,060 --> 00:42:47,480 So here, we have a function whose purpose in life is super simple. 899 00:42:47,480 --> 00:42:48,700 You give it a title. 
900 00:42:48,700 --> 00:42:52,990 It gives you the count thereof, the frequency, the popularity thereof, 901 00:42:52,990 --> 00:42:56,080 by just looking it up in that global dictionary. 902 00:42:56,080 --> 00:42:59,830 So it's super simple, but that's its only purpose in life. 903 00:42:59,830 --> 00:43:03,400 But now, according to the documentation for sorted, 904 00:43:03,400 --> 00:43:06,730 what it's now going to do, because I'm passing in a second argument called 905 00:43:06,730 --> 00:43:12,250 key, the sorted function, rather than just presume you want everything sorted 906 00:43:12,250 --> 00:43:15,280 alphabetically by key, it's instead going 907 00:43:15,280 --> 00:43:22,420 to call that function f on every one of the elements in your dictionary. 908 00:43:22,420 --> 00:43:25,960 And depending on your answer, the return value 909 00:43:25,960 --> 00:43:30,760 you give with that f function, that will be used instead 910 00:43:30,760 --> 00:43:34,060 to determine the actual ordering. 911 00:43:34,060 --> 00:43:36,900 So by default, sorted just looks at key. 912 00:43:36,900 --> 00:43:39,750 What I'm effectively doing with this f function 913 00:43:39,750 --> 00:43:44,160 is instead returning the value corresponding to every key. 914 00:43:44,160 --> 00:43:48,360 And so the logical implication of this, even though the syntax is a little new, 915 00:43:48,360 --> 00:43:51,690 is that this dictionary of titles will now 916 00:43:51,690 --> 00:43:55,140 be sorted by value instead of by key. 917 00:43:55,140 --> 00:43:57,460 Because again, by default, it sorts by key. 918 00:43:57,460 --> 00:44:01,740 But if I define my own key function and override that behavior 919 00:44:01,740 --> 00:44:05,370 to return the corresponding value, it's the values, the numbers, 920 00:44:05,370 --> 00:44:08,750 the counts that will actually be used to sort this thing. 921 00:44:08,750 --> 00:44:09,250 All right.
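A sketch of the key function just described, assuming a small sample `titles` dictionary in place of the real counts. The function name `f` follows the lecture; the data is illustrative only.

```python
# Sample counts for illustration; the real dictionary has many more titles.
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}


def f(title):
    # Return the count for this title so sorted() compares counts, not names.
    return titles[title]


# Pass f by name, without parentheses; sorted() calls it once per key.
by_value = sorted(titles, key=f)
for title in by_value:
    print(title, titles[title])
```

With this key function the titles come out in ascending order of count, so the most popular show is printed last.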
922 00:44:09,250 --> 00:44:11,333 Let's go ahead and see if that's true in practice. 923 00:44:11,333 --> 00:44:13,440 Let me go ahead and rerun python of favorites.py. 924 00:44:13,440 --> 00:44:14,790 I should see all the titles. 925 00:44:14,790 --> 00:44:17,520 And voila, conveniently, the most popular show 926 00:44:17,520 --> 00:44:22,170 seems to be Game of Thrones with 33 votes, followed by Friends with 27, 927 00:44:22,170 --> 00:44:25,000 followed by The Office with 26, and so forth. 928 00:44:25,000 --> 00:44:27,060 But of course, the list is kind of backwards. 929 00:44:27,060 --> 00:44:29,680 I mean, it's convenient that I can see it at the bottom of my screen. 930 00:44:29,680 --> 00:44:32,530 But really, if we're making a list, it should really be at the top. 931 00:44:32,530 --> 00:44:34,170 So how can we override that behavior? 932 00:44:34,170 --> 00:44:36,840 Turns out the sorted function, if you read its documentation, 933 00:44:36,840 --> 00:44:41,020 also takes another optional parameter called reverse. 934 00:44:41,020 --> 00:44:43,590 And if you set reverse equal to True, capital 935 00:44:43,590 --> 00:44:48,120 T in Python, that's going to go ahead and give us now 936 00:44:48,120 --> 00:44:50,190 the reverse order of that same sort. 937 00:44:50,190 --> 00:44:53,790 So let me go ahead and maximize my terminal window, rerun it again. 938 00:44:53,790 --> 00:44:57,480 And voila, if I scroll back up to the top, it's not alphabetically sorted. 939 00:44:57,480 --> 00:44:59,970 But if I keep going, keep going, keep going, keep going, 940 00:44:59,970 --> 00:45:01,262 the numbers are getting bigger. 941 00:45:01,262 --> 00:45:06,770 And voila, now Game of Thrones with 33 is all the way at the top. 942 00:45:06,770 --> 00:45:08,490 All right, so pretty cool. 
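The effect of the reverse parameter can be sketched like this, again with a small sample dictionary standing in for the real counts.

```python
# Sample counts for illustration; the real dictionary has many more titles.
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}


def f(title):
    # Key function: look up the count for a given title.
    return titles[title]


# reverse=True flips the order, so the largest count comes first.
ranked = sorted(titles, key=f, reverse=True)
for title in ranked:
    print(title, titles[title])
```

Both `key` and `reverse` are optional keyword arguments of the built-in `sorted` function and can be combined as shown.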
943 00:45:08,490 --> 00:45:11,360 And again, the new functionality here in Python, at least, 944 00:45:11,360 --> 00:45:15,110 is that we can actually pass in functions to functions 945 00:45:15,110 --> 00:45:19,380 and leave it to the latter to call the former. 946 00:45:19,380 --> 00:45:21,180 So that's complicated just to say. 947 00:45:21,180 --> 00:45:26,400 But any questions or confusion now on how we are using dictionaries 948 00:45:26,400 --> 00:45:34,290 and how we are sorting things in this reverse, value-based way? 949 00:45:34,290 --> 00:45:35,450 Any questions or confusion? 950 00:45:35,450 --> 00:45:39,000 Anything in the chat or verbally, Brian? 951 00:45:39,000 --> 00:45:41,470 BRIAN YU: Looks like all questions are answered here. 952 00:45:41,470 --> 00:45:42,320 DAVID J. MALAN: OK. 953 00:45:42,320 --> 00:45:44,780 Then in that case, let me point out a common mistake. 954 00:45:44,780 --> 00:45:50,000 Notice that even though f is a function, notice that I did not call it there. 955 00:45:50,000 --> 00:45:53,630 That would be incorrect, the reason being we deliberately 956 00:45:53,630 --> 00:45:58,410 want to pass the function f into the sorted function 957 00:45:58,410 --> 00:46:03,810 so that the sorted function can take it upon itself to call f again and again 958 00:46:03,810 --> 00:46:04,310 and again. 959 00:46:04,310 --> 00:46:07,227 We don't want to just call it once by using the parentheses ourselves. 960 00:46:07,227 --> 00:46:11,030 We want to just pass it in by name so that the sorted function, which comes 961 00:46:11,030 --> 00:46:14,630 with Python, can instead do it for us. 962 00:46:14,630 --> 00:46:17,060 Santiago, did you have a question? 963 00:46:17,060 --> 00:46:18,710 AUDIENCE: Yes, I was going to ask. 964 00:46:18,710 --> 00:46:21,425 Why didn't we put f of title? 965 00:46:21,425 --> 00:46:24,350 966 00:46:24,350 --> 00:46:26,960 I was going to ask that question specifically. 967 00:46:26,960 --> 00:46:29,005 DAVID J. 
MALAN: Oh, with the parentheses? 968 00:46:29,005 --> 00:46:29,630 AUDIENCE: Yeah. 969 00:46:29,630 --> 00:46:30,963 DAVID J. MALAN: Oh, OK, perfect. 970 00:46:30,963 --> 00:46:34,010 So because that would call the function once and only once. 971 00:46:34,010 --> 00:46:37,013 We want sorted to be able to call it again and again. 972 00:46:37,013 --> 00:46:38,930 Now, here's actually an example, as we've seen 973 00:46:38,930 --> 00:46:40,760 in the past, of a correct solution. 974 00:46:40,760 --> 00:46:45,170 This is behaving as I intend, a list of sorted titles from top to bottom 975 00:46:45,170 --> 00:46:47,450 in order of popularity. 976 00:46:47,450 --> 00:46:49,820 But it's a little poorly designed, because I'm 977 00:46:49,820 --> 00:46:53,390 defining this function f, whose name in the first place is kind of lame. 978 00:46:53,390 --> 00:46:56,660 But I'm defining a function only to use it in one place. 979 00:46:56,660 --> 00:47:00,860 And my god, the function is so tiny, it just feels like a waste of keystrokes 980 00:47:00,860 --> 00:47:03,740 to have defined a new function just to then pass it in. 981 00:47:03,740 --> 00:47:08,150 So it turns out, in Python, if you have a very short function whose 982 00:47:08,150 --> 00:47:13,130 purpose in life is meant to be to solve a local problem just once and that's it 983 00:47:13,130 --> 00:47:16,880 and it's short enough that you're pretty sure you can fit it on one line of code 984 00:47:16,880 --> 00:47:21,080 without things wrapping and starting to get ugly stylistically, it turns out 985 00:47:21,080 --> 00:47:23,270 you can actually do this instead. 986 00:47:23,270 --> 00:47:26,820 You can copy the code that you had in mind like this. 987 00:47:26,820 --> 00:47:30,680 And instead of actually defining f as a function name, 988 00:47:30,680 --> 00:47:34,070 you can actually use a special keyword in Python called lambda. 
989 00:47:34,070 --> 00:47:37,760 You can specify the name of an argument for your function as before. 990 00:47:37,760 --> 00:47:41,690 And then you can simply specify the return value, thereafter 991 00:47:41,690 --> 00:47:44,940 deleting the function itself. 992 00:47:44,940 --> 00:47:49,640 So to be clear, key is still an argument to the sorted function. 993 00:47:49,640 --> 00:47:54,120 It expects as its value typically the name of a function. 994 00:47:54,120 --> 00:47:57,590 But if you've decided that, eh, this seems like a waste of effort 995 00:47:57,590 --> 00:47:59,870 to define a function, then pass the function in, 996 00:47:59,870 --> 00:48:02,840 especially when it's so short, you can do it in a one liner. 997 00:48:02,840 --> 00:48:05,990 A lambda function is an anonymous function. 998 00:48:05,990 --> 00:48:09,230 Lambda literally says, Python, give me a function. 999 00:48:09,230 --> 00:48:11,147 I don't care about its name. 1000 00:48:11,147 --> 00:48:13,230 Therefore, you don't have to choose a name for it. 1001 00:48:13,230 --> 00:48:17,940 But it does care still about its arguments and its return value. 1002 00:48:17,940 --> 00:48:23,147 So it's still up to you to provide zero or more arguments and a return value. 1003 00:48:23,147 --> 00:48:24,230 And notice I've done that. 1004 00:48:24,230 --> 00:48:28,010 I've specified the keyword lambda followed by the name of the argument 1005 00:48:28,010 --> 00:48:31,490 I want this anonymous, nameless function to accept. 1006 00:48:31,490 --> 00:48:33,890 And then I'm specifying the return value. 1007 00:48:33,890 --> 00:48:37,940 And with lambda functions, you do not need to specify return. 1008 00:48:37,940 --> 00:48:41,000 Whatever you write after the colon is literally 1009 00:48:41,000 --> 00:48:43,050 what will be returned automatically. 1010 00:48:43,050 --> 00:48:45,320 So again, this is a very Pythonic thing to do. 
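The lambda version described above might look like the following sketch. The sample dictionary is illustrative; the point is only that the one-line lambda replaces the separately defined function f.

```python
# Sample counts for illustration; the real dictionary has many more titles.
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}

# The lambda takes one argument (title) and returns whatever follows the
# colon; no def, no name, and no explicit return statement are needed.
ranked = sorted(titles, key=lambda title: titles[title], reverse=True)
for title in ranked:
    print(title, titles[title])
```

The ordering is the same as with the named function; the only change is that the function is defined inline at the one place it is used.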
1011 00:48:45,320 --> 00:48:49,250 It's kind of a very clever one liner, even though it's a little cryptic 1012 00:48:49,250 --> 00:48:50,757 to see for the very first time. 1013 00:48:50,757 --> 00:48:53,840 But it allows you to condense your thoughts into a succinct statement that 1014 00:48:53,840 --> 00:48:57,470 gets the job done so you don't have to start defining more and more functions 1015 00:48:57,470 --> 00:49:02,490 that you or someone else then need to keep track of. 1016 00:49:02,490 --> 00:49:02,990 All right. 1017 00:49:02,990 --> 00:49:05,300 Any questions, then, on this? 1018 00:49:05,300 --> 00:49:10,580 And I am pretty sure this is as complex or sophisticated as our Python code 1019 00:49:10,580 --> 00:49:13,290 today will get. 1020 00:49:13,290 --> 00:49:16,020 Yeah, over to Sophia. 1021 00:49:16,020 --> 00:49:19,380 AUDIENCE: I was wondering why "lambda" is used specifically 1022 00:49:19,380 --> 00:49:21,217 rather than some other keyword. 1023 00:49:21,217 --> 00:49:23,550 DAVID J. MALAN: Yeah, so there's a long history in this. 1024 00:49:23,550 --> 00:49:27,060 And if, in fact, you take a course on functional programming-- at Harvard, 1025 00:49:27,060 --> 00:49:28,830 it's called CS51-- 1026 00:49:28,830 --> 00:49:32,280 there's a whole etymology behind keywords like this. 1027 00:49:32,280 --> 00:49:34,360 Let me defer that one for another time. 1028 00:49:34,360 --> 00:49:37,440 But indeed, not only in Python but in other languages, 1029 00:49:37,440 --> 00:49:41,290 as well, these things have come to exist called lambda functions. 1030 00:49:41,290 --> 00:49:44,230 So they're actually quite commonplace in other languages, as well. 1031 00:49:44,230 --> 00:49:48,580 And so Python just adopted the term of art. 1032 00:49:48,580 --> 00:49:52,060 Mathematically, lambda is often used as a symbol for functions. 1033 00:49:52,060 --> 00:49:55,980 And so they borrowed that same idea in the world of programming. 
1034 00:49:55,980 --> 00:49:56,550 All right. 1035 00:49:56,550 --> 00:50:00,840 So seeing no other questions, let's go ahead and solve a related problem still 1036 00:50:00,840 --> 00:50:03,510 with some Python but that's going to push up 1037 00:50:03,510 --> 00:50:09,150 against the limits of efficiency when it comes to storing our data in CSV files. 1038 00:50:09,150 --> 00:50:13,113 Let me go ahead and start fresh in this file, favorites.py. 1039 00:50:13,113 --> 00:50:15,030 All of the code I've written thus far, though, 1040 00:50:15,030 --> 00:50:16,905 is on the course's website in advance, so you 1041 00:50:16,905 --> 00:50:18,570 can see the incremental improvement. 1042 00:50:18,570 --> 00:50:21,280 I'm going to go ahead and, again, import csv at the top. 1043 00:50:21,280 --> 00:50:24,690 And now let's write a program this time that doesn't just 1044 00:50:24,690 --> 00:50:27,990 automatically open up the CSV and analyze it looking 1045 00:50:27,990 --> 00:50:31,020 for the total popularity of shows. 1046 00:50:31,020 --> 00:50:35,430 Let's search for a specific show in the CSV and then 1047 00:50:35,430 --> 00:50:39,082 go ahead and output the popularity thereof. 1048 00:50:39,082 --> 00:50:41,040 And I can do this in a bunch of different ways. 1049 00:50:41,040 --> 00:50:43,415 But I'm going to try to make this as concise as possible. 1050 00:50:43,415 --> 00:50:46,800 I'm first going to ask the user to input a title. 1051 00:50:46,800 --> 00:50:49,170 I could use CS50's get_string function. 1052 00:50:49,170 --> 00:50:52,330 But recall that it's pretty much the same as Python's input function, 1053 00:50:52,330 --> 00:50:55,740 so I'm going to use Python's input function today. 1054 00:50:55,740 --> 00:50:57,780 And then I'm going to go ahead and, as before, 1055 00:50:57,780 --> 00:51:01,350 open up that same CSV called Favorite TV Shows - 1056 00:51:01,350 --> 00:51:08,010 Form Responses 1.csv in read-only mode as a variable called file.
1057 00:51:08,010 --> 00:51:11,160 I'm then going to give myself a reader, and I'll use a DictReader again 1058 00:51:11,160 --> 00:51:14,520 so I don't have to worry about knowing which columns things are in, 1059 00:51:14,520 --> 00:51:16,080 passing in file. 1060 00:51:16,080 --> 00:51:17,340 And then let's see. 1061 00:51:17,340 --> 00:51:20,310 If I only care about one title, I can keep this program simpler. 1062 00:51:20,310 --> 00:51:23,340 I don't need to figure out the popularity of every show. 1063 00:51:23,340 --> 00:51:26,880 I just need to figure out the popularity of one show, the title 1064 00:51:26,880 --> 00:51:28,510 that the human has typed in. 1065 00:51:28,510 --> 00:51:32,160 So I'm going to go ahead and give myself a very simple int called counter 1066 00:51:32,160 --> 00:51:33,480 and set it equal to 0. 1067 00:51:33,480 --> 00:51:34,950 I don't need a whole dictionary. 1068 00:51:34,950 --> 00:51:36,930 Just one variable suffices now. 1069 00:51:36,930 --> 00:51:42,300 And I'm going to go ahead and iterate over the rows in the reader, as before. 1070 00:51:42,300 --> 00:51:48,600 And then I'm going to say if the current row's title == the title the human 1071 00:51:48,600 --> 00:51:51,930 typed in, let's go ahead and increment counter by 1. 1072 00:51:51,930 --> 00:51:54,510 And it's already initialized, because I did that on line 7. 1073 00:51:54,510 --> 00:51:55,500 So I think I'm good. 1074 00:51:55,500 --> 00:51:57,450 And then at the end of this program, let's 1075 00:51:57,450 --> 00:51:59,940 very simply print out the value of counter. 1076 00:51:59,940 --> 00:52:04,920 So the purpose of this program is to prompt the user for a title of a show 1077 00:52:04,920 --> 00:52:08,220 and then just report the popularity thereof 1078 00:52:08,220 --> 00:52:11,040 by counting the number of instances of it in the file. 1079 00:52:11,040 --> 00:52:14,520 So let me go ahead and run this with python of favorites.py. 
1080 00:52:14,520 --> 00:52:15,450 Enter. 1081 00:52:15,450 --> 00:52:21,440 Let me go ahead and type in "The Office," Enter, and 19. 1082 00:52:21,440 --> 00:52:23,710 Now, I don't remember exactly what the number was. 1083 00:52:23,710 --> 00:52:26,620 But I remember The Office was more popular than that. 1084 00:52:26,620 --> 00:52:29,840 I'm pretty sure it was not 19. 1085 00:52:29,840 --> 00:52:35,645 Any intuition as to why this program is buggy or so it would seem? 1086 00:52:35,645 --> 00:52:37,520 BRIAN YU: A few people in the chat are saying 1087 00:52:37,520 --> 00:52:40,635 you need to remember to deal with capitalization and white space again. 1088 00:52:40,635 --> 00:52:41,510 DAVID J. MALAN: Yeah. 1089 00:52:41,510 --> 00:52:44,790 So we need to practice those same lessons learned from before. 1090 00:52:44,790 --> 00:52:48,830 So I should really canonicalize the input that the human, I, just typed in 1091 00:52:48,830 --> 00:52:51,980 and also the input that's coming from the CSV file. 1092 00:52:51,980 --> 00:52:53,990 Perhaps the simplest way to do this is, up here, 1093 00:52:53,990 --> 00:52:57,200 to first strip off leading and trailing white space in case I get a little 1094 00:52:57,200 --> 00:52:59,670 sloppy and hit the Space bar where I shouldn't. 1095 00:52:59,670 --> 00:53:02,612 And then let's go ahead and force it to uppercase just because. 1096 00:53:02,612 --> 00:53:04,320 It doesn't matter if it's upper or lower, 1097 00:53:04,320 --> 00:53:06,420 but at least we'll standardize things that way. 1098 00:53:06,420 --> 00:53:10,130 And then when I do this, look at the current row's title. 1099 00:53:10,130 --> 00:53:12,170 I think I really need to do the same thing. 1100 00:53:12,170 --> 00:53:15,140 If I'm going to canonicalize one, I need to canonicalize the other. 1101 00:53:15,140 --> 00:53:19,790 And now compare the all-caps, white-space-stripped versions 1102 00:53:19,790 --> 00:53:20,730 of both strings.
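The search program with both sides canonicalized could be sketched as below. To keep the example self-contained, it reads from an in-memory string instead of the real CSV file and takes the query as a function argument instead of calling input. The `SAMPLE` data, the `count_title` helper, and the lowercase `title` column name are assumptions made for illustration, not the lecture's exact code.

```python
import csv
import io

# Hypothetical stand-in for the exported CSV; the real file has more rows.
SAMPLE = """title,genres
The Office,Comedy
the office ,Comedy
Friends,Comedy
"""


def count_title(file, query):
    """Count rows whose title matches query, ignoring case and outer spaces."""
    # Canonicalize the user's input once, before the loop.
    query = query.strip().upper()
    counter = 0
    for row in csv.DictReader(file):
        # Canonicalize each title from the file the same way before comparing.
        if row["title"].strip().upper() == query:
            counter += 1
    return counter


print(count_title(io.StringIO(SAMPLE), "  the office"))
```

In the real program the first argument would be the opened CSV file and the query would come from input; the comparison logic stays the same.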
1103 00:53:20,730 --> 00:53:21,920 So now let me rerun it. 1104 00:53:21,920 --> 00:53:24,020 Now I'm going to type in "The Office," Enter. 1105 00:53:24,020 --> 00:53:24,770 And voila. 1106 00:53:24,770 --> 00:53:28,130 Now I'm at 26, which I think is where we were at before. 1107 00:53:28,130 --> 00:53:30,890 And in fact, now I, the user, can be a little sloppy. 1108 00:53:30,890 --> 00:53:32,450 I can say "the office." 1109 00:53:32,450 --> 00:53:35,780 I can run it again and say "the office" and then, for whatever reason, 1110 00:53:35,780 --> 00:53:37,460 hit the Space bar a lot, Enter. 1111 00:53:37,460 --> 00:53:38,570 It's still going to work. 1112 00:53:38,570 --> 00:53:41,810 And indeed, though we seem to be belaboring the pedantic here 1113 00:53:41,810 --> 00:53:44,683 with trimming off white space and so forth, just think. 1114 00:53:44,683 --> 00:53:46,850 In a relatively small audience here, how many of you 1115 00:53:46,850 --> 00:53:50,150 accidentally hit the Space bar or capitalized things differently? 1116 00:53:50,150 --> 00:53:52,340 This happens massively on scale. 1117 00:53:52,340 --> 00:53:55,040 And you can imagine this being important when you're tagging 1118 00:53:55,040 --> 00:53:56,780 friends in some social media account. 1119 00:53:56,780 --> 00:53:58,940 You're doing @Brian or the like. 1120 00:53:58,940 --> 00:54:02,510 You don't want to have to require the user to type @, capital B, 1121 00:54:02,510 --> 00:54:05,100 lowercase r-i-a-n, and so forth. 1122 00:54:05,100 --> 00:54:07,880 So tolerating disparate, messy user input 1123 00:54:07,880 --> 00:54:11,990 is such a common problem to solve, including 1124 00:54:11,990 --> 00:54:14,740 in today's apps that we all use. 1125 00:54:14,740 --> 00:54:15,280 All right. 1126 00:54:15,280 --> 00:54:21,000 Any questions, then, on this program, which I think is correct? 1127 00:54:21,000 --> 00:54:22,890 Then let me ask a question of you. 
1128 00:54:22,890 --> 00:54:27,080 In what sense is this program poorly designed? 1129 00:54:27,080 --> 00:54:30,860 In what sense is this program poorly designed? 1130 00:54:30,860 --> 00:54:33,480 This is more subtle. 1131 00:54:33,480 --> 00:54:38,040 But think about the running time of this program in terms of big O. 1132 00:54:38,040 --> 00:54:45,030 What is the running time of this program if the CSV file has n different shows 1133 00:54:45,030 --> 00:54:47,610 in it or n different submissions? 1134 00:54:47,610 --> 00:54:50,310 So n is the variable in question. 1135 00:54:50,310 --> 00:54:53,122 Yeah, what's the running time, Andrew? 1136 00:54:53,122 --> 00:54:55,510 AUDIENCE: [INAUDIBLE] 1137 00:54:55,510 --> 00:54:58,260 DAVID J. MALAN: Yeah, it's big O of n, because I'm literally using 1138 00:54:58,260 --> 00:55:00,400 linear search by way of the for loop. 1139 00:55:00,400 --> 00:55:04,000 That's how a for loop works in Python, just like in C. Starts at the beginning 1140 00:55:04,000 --> 00:55:06,080 and potentially goes all the way till the end. 1141 00:55:06,080 --> 00:55:08,730 And so I'm using implicitly linear search, 1142 00:55:08,730 --> 00:55:11,910 because I'm not using any fancy data structures, no sets, no dictionaries. 1143 00:55:11,910 --> 00:55:14,140 I'm just looping from top to bottom. 1144 00:55:14,140 --> 00:55:18,390 So you can imagine that if we surveyed not just all of the students here 1145 00:55:18,390 --> 00:55:21,390 in class but maybe everyone on campus or everyone in the world-- 1146 00:55:21,390 --> 00:55:24,240 maybe we're Internet Movie Database, IMDb. 1147 00:55:24,240 --> 00:55:28,710 There could be a huge number of votes and a huge number of shows. 
1148 00:55:28,710 --> 00:55:32,460 And so writing a program, whether it's in a terminal window like mine 1149 00:55:32,460 --> 00:55:36,480 or maybe on a mobile device or maybe on a webpage for your laptop or desktop, 1150 00:55:36,480 --> 00:55:40,800 it's probably not the best design to constantly loop 1151 00:55:40,800 --> 00:55:44,190 over all of the shows in your database from top 1152 00:55:44,190 --> 00:55:47,250 to bottom just to answer a single question. 1153 00:55:47,250 --> 00:55:51,502 It would be much nicer to do things in log of n time or in constant time. 1154 00:55:51,502 --> 00:55:54,210 And thankfully, over the past few weeks, both in C and in Python, 1155 00:55:54,210 --> 00:55:57,390 we have seen smarter ways to do this. 1156 00:55:57,390 --> 00:56:00,420 But I'm not practicing what I've preached here. 1157 00:56:00,420 --> 00:56:05,520 And in fact, at some point, this notion of a flat-file database 1158 00:56:05,520 --> 00:56:07,440 starts to get too primitive for us. 1159 00:56:07,440 --> 00:56:11,670 Flat-file databases, like CSV files, are wonderfully useful 1160 00:56:11,670 --> 00:56:13,590 when you just want to do something quickly 1161 00:56:13,590 --> 00:56:16,410 or when you want to download data from some third party, 1162 00:56:16,410 --> 00:56:18,518 like Google, in a standard, portable way. 1163 00:56:18,518 --> 00:56:21,810 "Portable" means that it can be used by different people and different systems. 1164 00:56:21,810 --> 00:56:23,760 CSV is about as simple as it gets, because you 1165 00:56:23,760 --> 00:56:26,250 don't need to own Microsoft Word or Apple 1166 00:56:26,250 --> 00:56:28,150 Numbers or any particular product. 1167 00:56:28,150 --> 00:56:30,750 It's just a text file, so you can use any text editing 1168 00:56:30,750 --> 00:56:34,140 program or any programming language to access it. 
1169 00:56:34,140 --> 00:56:38,730 But flat-file databases aren't necessarily the best structure 1170 00:56:38,730 --> 00:56:42,750 to use ultimately for larger data sets, because they don't really 1171 00:56:42,750 --> 00:56:44,790 lend themselves to more efficient queries. 1172 00:56:44,790 --> 00:56:48,280 So CSV files, pretty much at best, you have to search top to bottom, 1173 00:56:48,280 --> 00:56:49,140 left to right. 1174 00:56:49,140 --> 00:56:53,070 But it turns out that there are better databases out there generally known 1175 00:56:53,070 --> 00:56:57,960 as relational databases that, instead of being files in which you store data, 1176 00:56:57,960 --> 00:57:01,230 they are instead programs in which you store data. 1177 00:57:01,230 --> 00:57:04,830 Now, to be fair, those programs use a lot of RAM, memory, 1178 00:57:04,830 --> 00:57:06,390 where they actually store your data. 1179 00:57:06,390 --> 00:57:08,670 And they do certainly persist your data. 1180 00:57:08,670 --> 00:57:12,600 They keep it long term by storing your data also in files. 1181 00:57:12,600 --> 00:57:16,110 But between you and your data, there is this running program. 1182 00:57:16,110 --> 00:57:20,070 And if you've ever heard of Oracle or MySQL or PostgreSQL or SQL Server 1183 00:57:20,070 --> 00:57:23,880 or Microsoft Access or bunches of other popular products, 1184 00:57:23,880 --> 00:57:26,520 both commercial and free and open source alike, 1185 00:57:26,520 --> 00:57:30,720 relational databases are so similar in spirit to spreadsheets. 1186 00:57:30,720 --> 00:57:33,690 But they are implemented in software. 1187 00:57:33,690 --> 00:57:35,490 And they give us more and more features. 
1188 00:57:35,490 --> 00:57:37,360 And they use more and more data structures 1189 00:57:37,360 --> 00:57:42,550 so that we can search for data, insert data, delete data, update data much, 1190 00:57:42,550 --> 00:57:47,350 much more efficiently than we could if just using something like a CSV file. 1191 00:57:47,350 --> 00:57:49,600 So let's go ahead and take our five-minute break here. 1192 00:57:49,600 --> 00:57:52,558 And when we come back, we'll look at relational databases and, in turn, 1193 00:57:52,558 --> 00:57:54,400 a language called SQL. 1194 00:57:54,400 --> 00:57:55,170 All right. 1195 00:57:55,170 --> 00:57:56,190 So we are back. 1196 00:57:56,190 --> 00:57:58,890 And the goal at hand now is to transition 1197 00:57:58,890 --> 00:58:02,340 from these fairly simplistic flat-file databases 1198 00:58:02,340 --> 00:58:04,260 to a more proper relational database. 1199 00:58:04,260 --> 00:58:07,500 And relational databases are indeed what power so many 1200 00:58:07,500 --> 00:58:10,470 of today's mobile applications, web applications, and the like. 1201 00:58:10,470 --> 00:58:13,170 Now we're beginning to transition to real-world software 1202 00:58:13,170 --> 00:58:16,210 with real-world languages, at that. 1203 00:58:16,210 --> 00:58:20,610 And so now, let me introduce what we're going to call SQLite. 1204 00:58:20,610 --> 00:58:23,040 So it turns out that a relational database 1205 00:58:23,040 --> 00:58:27,660 is a database that stores all of the data still in rows and columns. 1206 00:58:27,660 --> 00:58:30,960 But it doesn't do so using spreadsheets or sheets. 1207 00:58:30,960 --> 00:58:33,930 It instead does so using what we're going to call tables. 1208 00:58:33,930 --> 00:58:36,120 So it's pretty much the same idea. 1209 00:58:36,120 --> 00:58:39,120 But with tables, we get some additional functionality. 
1210 00:58:39,120 --> 00:58:41,130 With those tables, we'll have the ability 1211 00:58:41,130 --> 00:58:46,480 to search for data, update data, delete data, insert new data, and the like. 1212 00:58:46,480 --> 00:58:49,188 And these are things that we absolutely can do with spreadsheets. 1213 00:58:49,188 --> 00:58:52,105 But in the world of spreadsheets, if you want to search for something, 1214 00:58:52,105 --> 00:58:54,840 it's you, the human, doing it by manually clicking and scrolling, 1215 00:58:54,840 --> 00:58:55,340 typically. 1216 00:58:55,340 --> 00:58:57,390 If you want to insert data, it's you, the human, 1217 00:58:57,390 --> 00:58:59,362 typing it in manually after adding a new row. 1218 00:58:59,362 --> 00:59:01,320 If you want to delete something, it's you right 1219 00:59:01,320 --> 00:59:04,350 clicking or Control-clicking and deleting a whole row 1220 00:59:04,350 --> 00:59:06,750 or updating the individual cells therein. 1221 00:59:06,750 --> 00:59:11,910 With SQL, Structured Query Language, we have a new programming language 1222 00:59:11,910 --> 00:59:15,360 that is very often used in conjunction with other programming languages. 1223 00:59:15,360 --> 00:59:18,930 And so today, we'll see SQL used on its own initially. 1224 00:59:18,930 --> 00:59:21,990 But we'll also see it in the context of a Python program. 1225 00:59:21,990 --> 00:59:28,170 So a language like Python can itself use SQL to do more powerful things 1226 00:59:28,170 --> 00:59:30,660 than Python alone could do. 1227 00:59:30,660 --> 00:59:34,440 So with that said, SQLite is like a light version of SQL. 1228 00:59:34,440 --> 00:59:35,940 It's a more user-friendly version. 1229 00:59:35,940 --> 00:59:36,780 It's more portable. 1230 00:59:36,780 --> 00:59:40,260 It can be used on Macs and PCs and phones and laptops and desktops 1231 00:59:40,260 --> 00:59:40,950 and servers. 1232 00:59:40,950 --> 00:59:42,120 But it's incredibly common.
1233 00:59:42,120 --> 00:59:45,810 In fact, in your iPhone and your Android phone, many of the applications 1234 00:59:45,810 --> 00:59:50,250 you are running today on your own device are using SQLite underneath the hood. 1235 00:59:50,250 --> 00:59:52,290 So it isn't a toy language per se. 1236 00:59:52,290 --> 00:59:55,150 It's instead a relatively simple implementation 1237 00:59:55,150 --> 00:59:56,920 of a language generally known as SQL. 1238 00:59:56,920 --> 01:00:00,680 But long story short, there's other implementations of relational databases 1239 01:00:00,680 --> 01:00:01,180 out there. 1240 01:00:01,180 --> 01:00:02,972 And I rattled off several of them already-- 1241 01:00:02,972 --> 01:00:05,620 Oracle and MySQL and PostgreSQL and the like. 1242 01:00:05,620 --> 01:00:09,910 Those all have slightly different flavors or dialects of SQL. 1243 01:00:09,910 --> 01:00:14,860 So SQL is a fairly standard language for interacting with databases. 1244 01:00:14,860 --> 01:00:16,960 But different companies, different communities 1245 01:00:16,960 --> 01:00:20,200 have kind of added or subtracted their own preferred features. 1246 01:00:20,200 --> 01:00:24,340 And so the syntax you use is generally constant across all platforms. 1247 01:00:24,340 --> 01:00:27,458 But we will standardize for our purposes on SQLite. 1248 01:00:27,458 --> 01:00:29,500 And indeed, this is what you would use these days 1249 01:00:29,500 --> 01:00:31,820 in the world of mobile applications. 1250 01:00:31,820 --> 01:00:33,710 So it's very much germane there. 1251 01:00:33,710 --> 01:00:39,760 So with SQLite, we're going to have ultimately the ability to query data 1252 01:00:39,760 --> 01:00:41,690 and update data, delete data, and the like. 1253 01:00:41,690 --> 01:00:44,080 But to do so, we actually need a program with which 1254 01:00:44,080 --> 01:00:46,220 to interact with our database. 
1255 01:00:46,220 --> 01:00:50,320 So the way SQLite works is that it stores all of your data 1256 01:00:50,320 --> 01:00:51,920 still in a file. 1257 01:00:51,920 --> 01:00:53,500 But it's a binary file now. 1258 01:00:53,500 --> 01:00:55,690 That is, it's a file containing 0's and 1's. 1259 01:00:55,690 --> 01:00:57,550 And those 0's and 1's might represent text. 1260 01:00:57,550 --> 01:00:58,810 They might represent numbers. 1261 01:00:58,810 --> 01:01:01,570 But it's a more compact, efficient representation 1262 01:01:01,570 --> 01:01:05,290 than a mere CSV file would be using ASCII or Unicode. 1263 01:01:05,290 --> 01:01:06,670 So that's the first difference. 1264 01:01:06,670 --> 01:01:10,690 SQLite uses a single file, a binary file, 1265 01:01:10,690 --> 01:01:14,470 to store all of your data and represent it inside of that file by way of all 1266 01:01:14,470 --> 01:01:18,110 of those 0's and 1's or the tables to which I alluded before, 1267 01:01:18,110 --> 01:01:22,180 which are the analogue in the database world of sheets or spreadsheets 1268 01:01:22,180 --> 01:01:23,690 in the spreadsheet world. 1269 01:01:23,690 --> 01:01:28,940 So to interact with that binary file wherein all of your data is stored, 1270 01:01:28,940 --> 01:01:31,283 we need some kind of user-facing program. 1271 01:01:31,283 --> 01:01:32,950 And there's many different tools to use. 1272 01:01:32,950 --> 01:01:36,970 But the standard one that comes with SQLite 1273 01:01:36,970 --> 01:01:40,750 is called sqlite3, essentially version 3 of the tool. 1274 01:01:40,750 --> 01:01:44,050 This is a command line tool similar in spirit to any of the commands 1275 01:01:44,050 --> 01:01:46,000 you've run in a terminal window thus far that 1276 01:01:46,000 --> 01:01:50,395 allows you to open up that binary file and interact with all of your tables. 1277 01:01:50,395 --> 01:01:53,020 Now, here again, we kind of have a chicken and the egg problem. 
1278 01:01:53,020 --> 01:01:56,290 If I want to use a database but I don't yet have a database 1279 01:01:56,290 --> 01:01:58,570 and yet I want to select data from my database, 1280 01:01:58,570 --> 01:02:00,040 how do I actually load things in? 1281 01:02:00,040 --> 01:02:04,130 Well, you can load data into a SQLite database in at least two ways. 1282 01:02:04,130 --> 01:02:06,490 One, which I'll do in a moment, you can just 1283 01:02:06,490 --> 01:02:10,480 import an existing flat-file database, like a CSV. 1284 01:02:10,480 --> 01:02:15,640 And what you do is you save the CSV on your Mac or PC on your CS50 IDE. 1285 01:02:15,640 --> 01:02:18,100 You run a special command with sqlite3. 1286 01:02:18,100 --> 01:02:21,430 And it will just load the CSV into memory. 1287 01:02:21,430 --> 01:02:23,620 It will figure out where all of the commas are. 1288 01:02:23,620 --> 01:02:28,510 And it will construct inside of that binary file the corresponding rows 1289 01:02:28,510 --> 01:02:31,360 and columns using the appropriate 0's and 1's 1290 01:02:31,360 --> 01:02:32,810 to store all of that information. 1291 01:02:32,810 --> 01:02:35,410 So it just imports it for you automatically. 1292 01:02:35,410 --> 01:02:39,310 Approach 2 would be to actually write code in a language like Python 1293 01:02:39,310 --> 01:02:44,290 or any other that actually manually inserts all of the data 1294 01:02:44,290 --> 01:02:45,155 into your database. 1295 01:02:45,155 --> 01:02:46,280 And we'll do that, as well. 1296 01:02:46,280 --> 01:02:47,290 But let's start simple. 1297 01:02:47,290 --> 01:02:51,070 Let me go ahead and run, for instance, sqlite3. 1298 01:02:51,070 --> 01:02:54,550 And this is preinstalled on CS50 IDE, and it's not that hard to get it up 1299 01:02:54,550 --> 01:02:56,570 and running on a Mac and PC, as well. 1300 01:02:56,570 --> 01:02:59,860 I'm going to go ahead and run sqlite3 in my terminal window here. 1301 01:02:59,860 --> 01:03:00,610 And voila. 
1302 01:03:00,610 --> 01:03:03,430 You just see some very simple output. 1303 01:03:03,430 --> 01:03:07,023 It's telling me to type .help if I want to see some usage hints. 1304 01:03:07,023 --> 01:03:09,190 But I know most of the commands, and we'll generally 1305 01:03:09,190 --> 01:03:11,232 give you all of the commands that you might need. 1306 01:03:11,232 --> 01:03:15,760 In fact, one of the commands that we can use is .mode, and another is .import. 1307 01:03:15,760 --> 01:03:18,100 So generally, you won't use these that frequently. 1308 01:03:18,100 --> 01:03:21,670 You'll only use them when creating a database for the first time when 1309 01:03:21,670 --> 01:03:25,002 you are creating that database from an existing CSV file. 1310 01:03:25,002 --> 01:03:26,710 And indeed, that's my goal at the moment. 1311 01:03:26,710 --> 01:03:30,610 Let me take our CSV file containing all of your favorite TV shows 1312 01:03:30,610 --> 01:03:35,650 and load it into SQLite in a proper relational database 1313 01:03:35,650 --> 01:03:39,460 so that we can do better than, for instance, big O of n 1314 01:03:39,460 --> 01:03:42,730 when it comes to searching that data and doing anything else on it. 1315 01:03:42,730 --> 01:03:44,960 So to do this, I have to execute two commands. 1316 01:03:44,960 --> 01:03:48,280 One, I need to put SQLite into CSV mode. 1317 01:03:48,280 --> 01:03:51,010 And that's just to distinguish it from other flat-file formats, 1318 01:03:51,010 --> 01:03:53,890 like TSV for tabs or some other format. 1319 01:03:53,890 --> 01:03:56,230 And now I'm going to go ahead and run .import. 1320 01:03:56,230 --> 01:03:59,920 Then I have to specify the name of the file to import, which is the CSV. 1321 01:03:59,920 --> 01:04:03,490 And I'm going to go ahead and call my table shows. 
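The second approach mentioned a moment ago, writing code that inserts the rows itself, might look roughly like the sketch below. It uses Python's built-in `csv` and `sqlite3` modules. The CSV text, its timestamps, and the in-memory database are assumptions made so the example is self-contained; the real file has many more rows and would be opened from disk.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the exported CSV; the values are made up.
csv_text = """Timestamp,title,genres
10/19/2020 13:00:00,The Office,Comedy
10/19/2020 13:00:05,Friends,Comedy
10/19/2020 13:00:09,the office,"Comedy, Drama"
"""

# ":memory:" keeps the database in RAM so nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# DictReader uses the first line of the CSV as the column names.
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # The ? placeholders let sqlite3 handle quoting of each value.
    db.execute(
        "INSERT INTO shows (Timestamp, title, genres) VALUES (?, ?, ?)",
        (row["Timestamp"], row["title"], row["genres"]),
    )
db.commit()

row_count = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
print(row_count)
```

The `csv` module, rather than a manual split on commas, is what keeps a quoted value like "Comedy, Drama" together as a single genres field.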
1322 01:04:03,490 --> 01:04:08,500 So .import takes two arguments, the name of the file that you want to import 1323 01:04:08,500 --> 01:04:12,430 and the name of the table that you want to create out of that file. 1324 01:04:12,430 --> 01:04:14,680 And again, tables have rows and columns. 1325 01:04:14,680 --> 01:04:18,280 And the commas in the file are going to delineate 1326 01:04:18,280 --> 01:04:20,290 where those columns begin and end. 1327 01:04:20,290 --> 01:04:21,790 I'm going to go ahead and hit Enter. 1328 01:04:21,790 --> 01:04:24,670 It looks like it flew by pretty fast. 1329 01:04:24,670 --> 01:04:26,560 Nothing seems to have happened. 1330 01:04:26,560 --> 01:04:30,940 But I think that's OK, because now we're going to go ahead and have the ability 1331 01:04:30,940 --> 01:04:32,710 to actually manipulate that data. 1332 01:04:32,710 --> 01:04:34,630 But how do we manipulate the data? 1333 01:04:34,630 --> 01:04:36,070 We need a new language. 1334 01:04:36,070 --> 01:04:42,280 SQL, Structured Query Language, is the language used by SQLite and Oracle 1335 01:04:42,280 --> 01:04:45,220 and MySQL and PostgreSQL and bunches of other products 1336 01:04:45,220 --> 01:04:48,040 whose names you don't need to know or remember any time soon. 1337 01:04:48,040 --> 01:04:53,260 But SQL is the language we'll use to query the database for information 1338 01:04:53,260 --> 01:04:54,620 and do something with it. 1339 01:04:54,620 --> 01:04:57,920 Generally speaking, a relational database and, in turn, 1340 01:04:57,920 --> 01:05:02,480 SQL, which is a language via which you can interact with relational databases, 1341 01:05:02,480 --> 01:05:04,910 support four fundamental operations. 1342 01:05:04,910 --> 01:05:08,090 And there's sort of a crude acronym, pun intended, 1343 01:05:08,090 --> 01:05:11,960 that is just helpful for remembering what those fundamental operations are 1344 01:05:11,960 --> 01:05:13,190 with relational databases.
1345 01:05:13,190 --> 01:05:19,220 CRUD stands for Create, Read, Update, and Delete. 1346 01:05:19,220 --> 01:05:21,800 And indeed, the acronym is CRUD, C-R-U-D. 1347 01:05:21,800 --> 01:05:25,040 So it helps you remember that the four basic operations supported by any 1348 01:05:25,040 --> 01:05:28,590 relational database are create, read, update, delete. 1349 01:05:28,590 --> 01:05:30,710 "Create" means to create or add new data. 1350 01:05:30,710 --> 01:05:34,550 "Read" means to access and load into memory new data. 1351 01:05:34,550 --> 01:05:36,710 We've seen read before with opening files. 1352 01:05:36,710 --> 01:05:39,140 "Update" and "delete" mean exactly that, as well, 1353 01:05:39,140 --> 01:05:41,450 if you want to manipulate the data in your data set. 1354 01:05:41,450 --> 01:05:44,530 Now, those are generic terms for any relational database. 1355 01:05:44,530 --> 01:05:48,200 Those are the four properties typically supported by any relational database. 1356 01:05:48,200 --> 01:05:53,490 In the world of SQL, there are some very specific commands or functions, 1357 01:05:53,490 --> 01:05:58,550 if you will, that implement those four functionalities. 1358 01:05:58,550 --> 01:06:00,980 They are create and insert-- 1359 01:06:00,980 --> 01:06:03,620 achieve the same thing as create more generally. 1360 01:06:03,620 --> 01:06:07,850 The keyword "select" is what's used to read data from a database. 1361 01:06:07,850 --> 01:06:09,460 Update and delete are the same. 1362 01:06:09,460 --> 01:06:11,210 So it's kind of an annoying inconsistency. 1363 01:06:11,210 --> 01:06:14,803 The acronym or the term of art is CRUD, Create, Read, Update, Delete. 
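As a rough illustration of how the four CRUD ideas map onto the five SQL keywords, here is a small sketch using Python's built-in `sqlite3` module. The two-column table and its rows are made up for the example and live only in memory.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Create: CREATE makes the table, and INSERT adds rows to it.
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.execute("INSERT INTO shows (title, genres) VALUES (?, ?)", ("the office", "Comedy"))
db.execute("INSERT INTO shows (title, genres) VALUES (?, ?)", ("Friends", "Comedy"))

# Read: SELECT gets data back out.
before = db.execute("SELECT title FROM shows").fetchall()

# Update: UPDATE changes existing rows that match the condition.
db.execute("UPDATE shows SET title = ? WHERE title = ?", ("The Office", "the office"))

# Delete: DELETE removes rows that match the condition.
db.execute("DELETE FROM shows WHERE title = ?", ("Friends",))

after = db.execute("SELECT title FROM shows").fetchall()
print(before)
print(after)
```

Each row comes back as a tuple, so a one-column SELECT yields one-element tuples.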
1364 01:06:14,803 --> 01:06:16,970 But in the world of SQL, the authors of the language 1365 01:06:16,970 --> 01:06:20,030 decided to implement those four ideas by way 1366 01:06:20,030 --> 01:06:24,760 of these five keywords or functions or commands, if you will, in the language 1367 01:06:24,760 --> 01:06:25,260 SQL. 1368 01:06:25,260 --> 01:06:28,800 So what you are looking at are five of the keywords 1369 01:06:28,800 --> 01:06:32,990 that you can use in this new language called SQL to actually do something 1370 01:06:32,990 --> 01:06:33,980 with your database. 1371 01:06:33,980 --> 01:06:35,070 Now, what does that mean? 1372 01:06:35,070 --> 01:06:37,190 Well, suppose that you wanted to manually create 1373 01:06:37,190 --> 01:06:38,960 a database for the very first time. 1374 01:06:38,960 --> 01:06:39,585 What do you do? 1375 01:06:39,585 --> 01:06:42,752 Well, back in the world of spreadsheets, it's pretty straightforward, right? 1376 01:06:42,752 --> 01:06:44,300 You'd open up Google Spreadsheets. 1377 01:06:44,300 --> 01:06:46,370 You go to File, New or whatever. 1378 01:06:46,370 --> 01:06:48,350 And then you just, voila, get a new spreadsheet 1379 01:06:48,350 --> 01:06:51,170 into which you can start creating rows and columns and the like. 1380 01:06:51,170 --> 01:06:53,840 In Microsoft Excel, Apple Numbers, same thing-- 1381 01:06:53,840 --> 01:06:57,840 File menu, New Spreadsheet or whatever, and boom, you have a new spreadsheet. 1382 01:06:57,840 --> 01:07:00,860 Now, in the world of SQL, SQL databases are generally 1383 01:07:00,860 --> 01:07:02,840 meant to be interacted with code. 1384 01:07:02,840 --> 01:07:05,930 However, there are Graphical User Interfaces, GUIs, by which 1385 01:07:05,930 --> 01:07:07,430 you can interact with them, as well. 1386 01:07:07,430 --> 01:07:11,600 But we're going to use code today to do so and programs at a command line. 
1387 01:07:11,600 --> 01:07:17,120 It turns out that you can create tables programmatically 1388 01:07:17,120 --> 01:07:19,530 by running a command like this. 1389 01:07:19,530 --> 01:07:24,320 So if you literally type out syntax along the lines of CREATE TABLE, then 1390 01:07:24,320 --> 01:07:27,230 the name of your table, indicated here in lowercase, 1391 01:07:27,230 --> 01:07:31,490 then a parenthesis, then the name of your column that you want to create 1392 01:07:31,490 --> 01:07:36,190 and the type of that column, a la C, and then comma, dot, 1393 01:07:36,190 --> 01:07:39,050 dot, dot, some more columns, this is generally 1394 01:07:39,050 --> 01:07:43,350 speaking the syntax you'll use to create in this language called 1395 01:07:43,350 --> 01:07:44,802 SQL a new table. 1396 01:07:44,802 --> 01:07:46,010 Now, this is in the abstract. 1397 01:07:46,010 --> 01:07:49,077 Again, table in lowercase is meant to represent the name 1398 01:07:49,077 --> 01:07:50,660 you want to give to your actual table. 1399 01:07:50,660 --> 01:07:52,580 column in lowercase is meant to be the name 1400 01:07:52,580 --> 01:07:54,080 you want to give to your own column. 1401 01:07:54,080 --> 01:07:54,788 Maybe it's Title. 1402 01:07:54,788 --> 01:07:55,567 Maybe it's Genres. 1403 01:07:55,567 --> 01:07:57,400 And dot, dot, dot just means, of course, you 1404 01:07:57,400 --> 01:07:59,100 can have even more columns than that. 1405 01:07:59,100 --> 01:08:02,990 But literally in a moment, if I were to type in this kind of command 1406 01:08:02,990 --> 01:08:06,500 into the terminal window after running the sqlite3 program, 1407 01:08:06,500 --> 01:08:09,980 I could start creating one or more tables for myself. 1408 01:08:09,980 --> 01:08:12,920 And in fact, that's what already happened for me. 1409 01:08:12,920 --> 01:08:15,560 This .import command, which is not part of SQL-- 1410 01:08:15,560 --> 01:08:19,579 this is the equivalent of a Menu option in Excel or Google Spreadsheets. 
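Filling in the abstract CREATE TABLE syntax with concrete names, a sketch might look like this. It assumes a table called shows with title and genres columns, both of type TEXT, and it runs the SQL through Python's `sqlite3` module rather than the sqlite3 command-line tool. `PRAGMA table_info` is SQLite-specific and is used here only to confirm what was created.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# CREATE TABLE table (column type, ...); with names filled in.
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in db.execute("PRAGMA table_info(shows)")]
print(columns)
```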
1411 01:08:19,579 --> 01:08:22,729 .import just automates a certain process for me. 1412 01:08:22,729 --> 01:08:24,319 And what it did for me is this. 1413 01:08:24,319 --> 01:08:28,609 If I type now .schema, which is another SQLite-specific command-- 1414 01:08:28,609 --> 01:08:32,479 anything that starts with a . is specific only to sqlite3, 1415 01:08:32,479 --> 01:08:34,250 this terminal window program. 1416 01:08:34,250 --> 01:08:36,830 Notice what's outputted is this. 1417 01:08:36,830 --> 01:08:44,420 By running .import that automatically for me created a table in my database 1418 01:08:44,420 --> 01:08:46,010 called shows. 1419 01:08:46,010 --> 01:08:47,540 And it gave it three columns-- 1420 01:08:47,540 --> 01:08:50,060 Timestamp, title, and genres. 1421 01:08:50,060 --> 01:08:52,529 Where did those column names come from? 1422 01:08:52,529 --> 01:08:55,220 Well, they came from the very first line in the CSV. 1423 01:08:55,220 --> 01:08:58,850 And they all looked like text, so the type of those values 1424 01:08:58,850 --> 01:09:02,490 was just assumed to be text, text, text. 1425 01:09:02,490 --> 01:09:05,090 Now, to be clear, I could have manually typed this out, 1426 01:09:05,090 --> 01:09:08,359 created these three columns in a new table called shows for me. 1427 01:09:08,359 --> 01:09:11,870 But again, the .import command just automated that from a CSV. 1428 01:09:11,870 --> 01:09:17,370 But the SQL is what we see here, CREATE TABLE shows and so forth. 1429 01:09:17,370 --> 01:09:22,609 So that is to say now, in this database, there is a file-- 1430 01:09:22,609 --> 01:09:27,500 or rather, there is a table called shows inside 1431 01:09:27,500 --> 01:09:29,630 of which is all of the data from that CSV. 1432 01:09:29,630 --> 01:09:31,580 How do I actually get at that data? 1433 01:09:31,580 --> 01:09:33,830 Well, it turns out there are other commands we can call. 1434 01:09:33,830 --> 01:09:37,430 Not just CREATE, but also SELECT, it turns out.
1435 01:09:37,430 --> 01:09:40,850 SELECT is the equivalent of read, getting data from the database. 1436 01:09:40,850 --> 01:09:42,590 And this one is pretty powerful. 1437 01:09:42,590 --> 01:09:45,950 And the reason that so many data scientists and statisticians 1438 01:09:45,950 --> 01:09:48,290 use and like using languages like SQL-- 1439 01:09:48,290 --> 01:09:51,620 they make it relatively easy to just get data and filter that data 1440 01:09:51,620 --> 01:09:55,380 and analyze that data using new syntax for us today, 1441 01:09:55,380 --> 01:09:58,940 but relatively simple syntax relative to other things we've seen. 1442 01:09:58,940 --> 01:10:03,590 The SELECT command in SQL lets you select one or more columns 1443 01:10:03,590 --> 01:10:06,710 from your table by the given name. 1444 01:10:06,710 --> 01:10:10,040 So we'll see this now in just a moment here. 1445 01:10:10,040 --> 01:10:11,460 How might I go about doing this? 1446 01:10:11,460 --> 01:10:15,170 Well, let me go ahead and now, at my prompt after just clearing the window 1447 01:10:15,170 --> 01:10:17,340 to keep things neat, let me try this out. 1448 01:10:17,340 --> 01:10:26,090 Let me go ahead and SELECT, let's say, title FROM shows;. 1449 01:10:26,090 --> 01:10:27,290 So why am I doing this? 1450 01:10:27,290 --> 01:10:29,800 Well, again, the conventional format for the SELECT command 1451 01:10:29,800 --> 01:10:33,400 is to say SELECT, then the name of one or more columns, then 1452 01:10:33,400 --> 01:10:37,240 literally the preposition FROM, and then the name of the table from which you 1453 01:10:37,240 --> 01:10:38,840 want to select that data. 1454 01:10:38,840 --> 01:10:43,390 So if my table is called shows and the column is called title, 1455 01:10:43,390 --> 01:10:46,930 it stands to reason that SELECT title FROM shows should give me 1456 01:10:46,930 --> 01:10:48,100 back the data I want. 
1457 01:10:48,100 --> 01:10:50,080 Now, notice a couple of stylistic choices 1458 01:10:50,080 --> 01:10:52,630 that aren't strictly required but are good style. 1459 01:10:52,630 --> 01:10:56,470 Conventionally, I would capitalize any SQL keywords, 1460 01:10:56,470 --> 01:10:59,470 including SELECT and FROM in this case, and then 1461 01:10:59,470 --> 01:11:03,610 lowercase anything that's a column name or a table name, 1462 01:11:03,610 --> 01:11:07,023 assuming you created those columns and tables in, in fact, lowercase. 1463 01:11:07,023 --> 01:11:08,690 There's different conventions out there. 1464 01:11:08,690 --> 01:11:09,815 Some people will uppercase. 1465 01:11:09,815 --> 01:11:12,950 Some people will use something called camel case or snake case or the like. 1466 01:11:12,950 --> 01:11:15,220 But generally speaking, I would encourage all caps 1467 01:11:15,220 --> 01:11:19,180 for SQL syntax and lowercase for the column and table names. 1468 01:11:19,180 --> 01:11:21,190 I'm going to go ahead now and hit Enter. 1469 01:11:21,190 --> 01:11:22,060 And voila. 1470 01:11:22,060 --> 01:11:26,950 We see rapidly a whole list of values outputted from the database. 1471 01:11:26,950 --> 01:11:30,790 And if you think way back, you might recognize that this actually 1472 01:11:30,790 --> 01:11:35,200 happens to be the same order as before, because the CSV 1473 01:11:35,200 --> 01:11:39,010 file was loaded top to bottom into this same database table. 1474 01:11:39,010 --> 01:11:42,370 And so what we're seeing, in fact, is all of that same data, duplicates 1475 01:11:42,370 --> 01:11:46,030 and miscapitalizations and weird spacing and all. 1476 01:11:46,030 --> 01:11:48,790 But suppose I want to see all of the data from the CSV. 1477 01:11:48,790 --> 01:11:51,160 Well, it turns out you can select multiple columns. 1478 01:11:51,160 --> 01:11:54,478 You can select not only title, but maybe timestamp was of interest. 
1479 01:11:54,478 --> 01:11:56,770 And this one admittedly was capitalized, because that's 1480 01:11:56,770 --> 01:11:58,360 what it was in the spreadsheet. 1481 01:11:58,360 --> 01:12:00,290 That was not something I chose manually. 1482 01:12:00,290 --> 01:12:02,800 So if I just use a comma-separated list of column names, 1483 01:12:02,800 --> 01:12:04,090 notice what I can do now. 1484 01:12:04,090 --> 01:12:07,790 It's a little hard to see for us humans, because there's a lot going on now. 1485 01:12:07,790 --> 01:12:10,120 But notice that in double quotes on the left, 1486 01:12:10,120 --> 01:12:14,170 there are all of the timestamps, which represent the time at which you all 1487 01:12:14,170 --> 01:12:15,490 submitted your favorite shows. 1488 01:12:15,490 --> 01:12:19,390 And on the right of the comma, there's another quoted string 1489 01:12:19,390 --> 01:12:22,210 that is the title of the show that you liked, although SQLite 1490 01:12:22,210 --> 01:12:27,070 omits the quotes if it's just a single word, like Friends, just by convention. 1491 01:12:27,070 --> 01:12:29,290 In fact, if I want to get all of the columns, 1492 01:12:29,290 --> 01:12:31,510 turns out there's some shorthand syntax for that. 1493 01:12:31,510 --> 01:12:34,270 * is the so-called wild card operator. 1494 01:12:34,270 --> 01:12:37,780 And it will get me all of the columns from left to right in my table. 1495 01:12:37,780 --> 01:12:38,500 And voila. 1496 01:12:38,500 --> 01:12:44,180 Now I see all of the data, including all of the genres, as well. 1497 01:12:44,180 --> 01:12:49,090 So now I effectively have three columns being outputted all at once here. 1498 01:12:49,090 --> 01:12:51,520 Well, this is not that useful thus far. 1499 01:12:51,520 --> 01:12:53,770 In fact, all I've been doing is really just outputting 1500 01:12:53,770 --> 01:12:55,060 the contents of the CSV. 
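The three SELECT variations just shown (one column, a comma-separated list of columns, and * for every column) can be sketched from Python as follows. The rows and their timestamps are hypothetical sample data, inserted here only so the queries have something to return.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# Hypothetical sample rows; the values are made up.
db.executemany(
    "INSERT INTO shows (Timestamp, title, genres) VALUES (?, ?, ?)",
    [
        ("10/19/2020 13:00:00", "The Office", "Comedy"),
        ("10/19/2020 13:00:05", "Friends", "Comedy"),
    ],
)

# SELECT column FROM table
one_column = db.execute("SELECT title FROM shows").fetchall()

# SELECT column, column FROM table
two_columns = db.execute("SELECT Timestamp, title FROM shows").fetchall()

# SELECT * FROM table returns every column, left to right
all_columns = db.execute("SELECT * FROM shows").fetchall()

for row in all_columns:
    print(row)
```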
1501 01:12:55,060 --> 01:12:58,990 But SQL's powerful because it comes with other features right out of the box, 1502 01:12:58,990 --> 01:13:02,830 somewhat similar in spirit to functions that are built into Google Spreadsheets 1503 01:13:02,830 --> 01:13:03,550 and Excel. 1504 01:13:03,550 --> 01:13:06,110 But now we can use them ultimately in our own code. 1505 01:13:06,110 --> 01:13:09,460 So functions like AVG, COUNT, DISTINCT, LOWER, MAX, MIN, 1506 01:13:09,460 --> 01:13:13,540 and UPPER and bunches more, these are all functions built into SQL 1507 01:13:13,540 --> 01:13:19,370 that you can use as part of your query to alter the data as it's coming back 1508 01:13:19,370 --> 01:13:21,370 from the database-- not permanently, but as it's 1509 01:13:21,370 --> 01:13:25,040 coming back to you-- so that it's in a format you actually care about. 1510 01:13:25,040 --> 01:13:26,870 So for instance, one of my goals earlier 1511 01:13:26,870 --> 01:13:29,680 was to get back just the distinct, the unique titles. 1512 01:13:29,680 --> 01:13:32,620 And we had to write all that annoying code using a set 1513 01:13:32,620 --> 01:13:35,560 and then add things to the set and then loop over it again, right? 1514 01:13:35,560 --> 01:13:37,180 That was not a huge amount of code. 1515 01:13:37,180 --> 01:13:40,840 But it definitely took us, what, 5, 10 minutes to get the job done at least. 1516 01:13:40,840 --> 01:13:43,780 In SQL, you can do all of that in one breath. 1517 01:13:43,780 --> 01:13:45,650 I'm going to go ahead now and do this. 1518 01:13:45,650 --> 01:13:49,690 SELECT not just title FROM shows. 1519 01:13:49,690 --> 01:13:54,370 Let me go ahead and SELECT DISTINCT title FROM shows. 1520 01:13:54,370 --> 01:13:57,640 So DISTINCT, again, is an available function in SQL 1521 01:13:57,640 --> 01:13:58,900 that does what the name says.
1522 01:13:58,900 --> 01:14:00,650 It's going to filter out all of the titles 1523 01:14:00,650 --> 01:14:02,450 to just give me the distinct ones back. 1524 01:14:02,450 --> 01:14:08,740 So if I hit Enter now, you'll see a similarly messy list but including-- 1525 01:14:08,740 --> 01:14:10,810 "no idea," someone that doesn't watch TV-- 1526 01:14:10,810 --> 01:14:14,230 including an unsorted list of those titles. 1527 01:14:14,230 --> 01:14:18,130 So I think we can probably start to clean this thing up as we did before. 1528 01:14:18,130 --> 01:14:20,950 Let me go ahead and now SELECT not just DISTINCT, 1529 01:14:20,950 --> 01:14:23,660 but let me go ahead and uppercase everything as well. 1530 01:14:23,660 --> 01:14:25,970 And I can use UPPER as another function. 1531 01:14:25,970 --> 01:14:27,580 And notice I'm just nesting things. 1532 01:14:27,580 --> 01:14:30,247 The output of one function, as we've seen in many languages now, 1533 01:14:30,247 --> 01:14:31,450 can be the input to another. 1534 01:14:31,450 --> 01:14:32,830 Let me hit Enter now. 1535 01:14:32,830 --> 01:14:36,610 And now it's getting a little more canonicalized, so to speak, 1536 01:14:36,610 --> 01:14:39,190 because I'm using capitalization for everything. 1537 01:14:39,190 --> 01:14:43,690 But it would seem that things still aren't really sorted. 1538 01:14:43,690 --> 01:14:46,070 It's just the same order in which you inputted them 1539 01:14:46,070 --> 01:14:48,370 but without duplicates this time. 1540 01:14:48,370 --> 01:14:51,700 So it turns out that SQL has other syntax 1541 01:14:51,700 --> 01:14:55,580 that we can use to make our queries more precise and more powerful. 
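Here is a small sketch of what DISTINCT and UPPER do to the results, again through Python's `sqlite3` module. The four titles are hypothetical sample data chosen to include an exact duplicate and a capitalization variant.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")

# Hypothetical submissions, including duplicates and odd capitalization.
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office",), ("Friends",), ("The Office",)],
)

# DISTINCT removes exact duplicates only, so "the office" still differs.
distinct_titles = db.execute("SELECT DISTINCT title FROM shows").fetchall()

# Uppercasing first makes the capitalization variants collapse together.
distinct_upper = db.execute("SELECT DISTINCT UPPER(title) FROM shows").fetchall()

print(len(distinct_titles))
print(len(distinct_upper))
```

Neither query changes what is stored in the table. UPPER only alters the values as they come back.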
1542 01:14:55,580 --> 01:14:57,640 So in addition to these kinds of functions 1543 01:14:57,640 --> 01:15:00,340 that you can use to alter the data that's being shown to you 1544 01:15:00,340 --> 01:15:04,570 and coming back, you can also use these kinds of clauses or syntax 1545 01:15:04,570 --> 01:15:05,800 in SQL queries. 1546 01:15:05,800 --> 01:15:09,130 You can say WHERE, which is the equivalent of a condition. 1547 01:15:09,130 --> 01:15:13,600 You can say select all of this data where something is true or false. 1548 01:15:13,600 --> 01:15:17,440 You can say LIKE, where you can say give me data that isn't exactly this 1549 01:15:17,440 --> 01:15:18,520 but is like this. 1550 01:15:18,520 --> 01:15:20,660 You can order the data by some column. 1551 01:15:20,660 --> 01:15:23,210 You can limit the number of rows that come back. 1552 01:15:23,210 --> 01:15:26,850 And you can group identical values together in some way. 1553 01:15:26,850 --> 01:15:28,640 So let's see a few examples of this. 1554 01:15:28,640 --> 01:15:32,055 Let me go back here and play around now with-- 1555 01:15:32,055 --> 01:15:32,930 how about The Office? 1556 01:15:32,930 --> 01:15:34,513 That was the one we looked at earlier. 1557 01:15:34,513 --> 01:15:42,260 So let me go ahead and SELECT title FROM shows WHERE title = "The Office";. 1558 01:15:42,260 --> 01:15:48,200 So I've added this WHERE predicate, so to speak, WHERE title = "The Office." 1559 01:15:48,200 --> 01:15:49,190 So SQL's nice. 1560 01:15:49,190 --> 01:15:52,670 Similar in spirit to Python, it's more user friendly, perhaps, 1561 01:15:52,670 --> 01:15:55,910 than C where everything kind of sort of reads like an English sentence, 1562 01:15:55,910 --> 01:15:58,230 even though it's a little more precise. 1563 01:15:58,230 --> 01:15:59,880 And it's a little more succinct. 1564 01:15:59,880 --> 01:16:01,130 Let me go ahead and hit Enter. 1565 01:16:01,130 --> 01:16:02,120 And voila. 
1566 01:16:02,120 --> 01:16:05,850 That's how many of you inputted The Office. 1567 01:16:05,850 --> 01:16:08,520 But notice it's not everyone, is it? 1568 01:16:08,520 --> 01:16:10,050 We're missing some still. 1569 01:16:10,050 --> 01:16:14,070 It seems that I got back only those of you who typed in literally 1570 01:16:14,070 --> 01:16:16,710 "The Office," capital T, capital O. 1571 01:16:16,710 --> 01:16:19,200 So what if I want to be a little more resilient than that? 1572 01:16:19,200 --> 01:16:23,280 Well, let me get back any rows where you all typed in "office." 1573 01:16:23,280 --> 01:16:26,820 Maybe you omitted the article "the." 1574 01:16:26,820 --> 01:16:30,390 So let me go ahead and say not title = "Office." 1575 01:16:30,390 --> 01:16:33,780 but let me go ahead and say where the title is like "Office." 1576 01:16:33,780 --> 01:16:35,490 But I don't want it to just be "office." 1577 01:16:35,490 --> 01:16:39,120 I want to allow for maybe some stuff at the beginning, maybe some stuff 1578 01:16:39,120 --> 01:16:39,673 at the end. 1579 01:16:39,673 --> 01:16:42,090 And even though that seems like a bit of an inconsistency, 1580 01:16:42,090 --> 01:16:46,950 in the context of using LIKE, there's another wild card character. 1581 01:16:46,950 --> 01:16:51,390 The percent sign represents zero or more characters to the left. 1582 01:16:51,390 --> 01:16:55,410 And this percent sign represents zero or more characters to the right. 1583 01:16:55,410 --> 01:16:59,940 So it's kind of this catchall that will now find me all titles that somewhere 1584 01:16:59,940 --> 01:17:02,980 have O-F-F-I-C-E inside of them. 1585 01:17:02,980 --> 01:17:04,778 And it turns out LIKE is case insensitive, 1586 01:17:04,778 --> 01:17:07,320 so I don't even need to worry about capitalization with LIKE. 1587 01:17:07,320 --> 01:17:08,610 Now let me hit Enter. 1588 01:17:08,610 --> 01:17:09,450 And voila. 1589 01:17:09,450 --> 01:17:10,890 Now I get back more answers. 
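The difference between an exact match and a LIKE pattern can be sketched as follows, again with Python's sqlite3 module and invented rows. Single quotes are used around the strings because that is the standard SQL spelling.

```python
import sqlite3

# Hypothetical responses, including the kinds of variation people type.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office",), ("Office",), ("The Office ",), ("Friends",)],
)

# Equality matches only the exact spelling and capitalization.
exact = db.execute(
    "SELECT title FROM shows WHERE title = 'The Office'"
).fetchall()

# % stands for zero or more characters; SQLite's LIKE ignores ASCII case by default.
fuzzy = db.execute(
    "SELECT title FROM shows WHERE title LIKE '%office%'"
).fetchall()

print(len(exact))  # 1
print(len(fuzzy))  # 4
```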
1590 01:17:10,890 --> 01:17:12,780 And you can really see the messiness now. 1591 01:17:12,780 --> 01:17:15,900 Notice up here one of you used lowercase. 1592 01:17:15,900 --> 01:17:18,450 That tends to be common when typing things in quickly. 1593 01:17:18,450 --> 01:17:21,270 One of you did it lowercase here and then also gave 1594 01:17:21,270 --> 01:17:23,160 us an extra white space at the end. 1595 01:17:23,160 --> 01:17:24,900 One of you just typed in "office." 1596 01:17:24,900 --> 01:17:27,540 One of you typed in "the office" again with a space at the end. 1597 01:17:27,540 --> 01:17:29,200 And so there's a lot of variation here. 1598 01:17:29,200 --> 01:17:31,560 And that's why, when we forced everything to uppercase 1599 01:17:31,560 --> 01:17:34,650 and we started trimming things, we were able to get rid 1600 01:17:34,650 --> 01:17:37,440 of a lot of those redundancies. 1601 01:17:37,440 --> 01:17:40,290 Well, in fact, let's go ahead and order this now. 1602 01:17:40,290 --> 01:17:44,040 So let me go back to selecting the distinct uppercase title, 1603 01:17:44,040 --> 01:17:51,060 so SELECT DISTINCT UPPER of title FROM shows. 1604 01:17:51,060 --> 01:17:56,220 And let me now ORDER BY, which is a new clause, the uppercased version 1605 01:17:56,220 --> 01:17:57,868 of title. 1606 01:17:57,868 --> 01:17:59,910 So now notice there's a few things going on here. 1607 01:17:59,910 --> 01:18:01,710 But I'm just building up more complicated queries 1608 01:18:01,710 --> 01:18:04,260 similar to Scratch, where we just started throwing more and more puzzle 1609 01:18:04,260 --> 01:18:05,340 pieces at a problem. 1610 01:18:05,340 --> 01:18:10,530 I'm selecting all of the distinct uppercase titles from the shows table. 1611 01:18:10,530 --> 01:18:13,050 But I'm going to order the results this time 1612 01:18:13,050 --> 01:18:15,780 by the uppercased version of title. 1613 01:18:15,780 --> 01:18:17,550 So everything is going to be uppercased.
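The query being built up here can be sketched in Python's sqlite3 module like this. The four titles are made up for illustration.

```python
import sqlite3

# Hypothetical rows in the order people might have submitted them.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("the wire",), ("Friends",), ("The Wire",), ("avatar",)],
)

# DISTINCT removes duplicates after uppercasing; ORDER BY sorts what is left.
rows = db.execute(
    "SELECT DISTINCT UPPER(title) FROM shows ORDER BY UPPER(title)"
).fetchall()
titles = [row[0] for row in rows]

print(titles)  # ['AVATAR', 'FRIENDS', 'THE WIRE']
```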
1614 01:18:17,550 --> 01:18:20,460 And then it's going to be sorted A through Z. Hit Enter now, 1615 01:18:20,460 --> 01:18:23,160 and now things are a little easier to make sense of. 1616 01:18:23,160 --> 01:18:26,970 Notice the quotes are there only when there are multiple words in a title. 1617 01:18:26,970 --> 01:18:29,400 Otherwise, sqlite3 doesn't bother showing us. 1618 01:18:29,400 --> 01:18:32,190 But notice here's all the "the" shows. 1619 01:18:32,190 --> 01:18:36,270 And if we keep scrolling up, the P's, the N's, the M's, the L's, and so 1620 01:18:36,270 --> 01:18:41,190 forth-- it's indeed alphabetized thanks to using ORDER BY. 1621 01:18:41,190 --> 01:18:41,950 All right. 1622 01:18:41,950 --> 01:18:45,540 Well, let's start to solve more similar problems now in SQL 1623 01:18:45,540 --> 01:18:49,830 by writing way less code than we did a bit ago in Python. 1624 01:18:49,830 --> 01:18:54,780 Suppose I want to actually figure out the counts of these most popular shows. 1625 01:18:54,780 --> 01:18:58,050 So I want to combine all of the identical shows 1626 01:18:58,050 --> 01:19:00,510 and figure out all of the corresponding counts. 1627 01:19:00,510 --> 01:19:02,330 Well, let me go ahead and try this. 1628 01:19:02,330 --> 01:19:07,932 Let me go ahead and SELECT again the uppercased version of title. 1629 01:19:07,932 --> 01:19:10,140 But I'm not going to do DISTINCT this time, because I 1630 01:19:10,140 --> 01:19:11,830 want to do that a little differently. 1631 01:19:11,830 --> 01:19:13,650 I'm going to SELECT the uppercased version 1632 01:19:13,650 --> 01:19:16,510 of title, the COUNT of those titles-- 1633 01:19:16,510 --> 01:19:19,320 so the number of times a given title appears, so COUNT 1634 01:19:19,320 --> 01:19:20,610 is a new keyword now-- 1635 01:19:20,610 --> 01:19:22,080 FROM shows. 1636 01:19:22,080 --> 01:19:25,860 But now how do I figure out what the count is? 
1637 01:19:25,860 --> 01:19:29,700 Well, if you think about this table as having a lot of titles-- 1638 01:19:29,700 --> 01:19:31,930 title, title, title, title, title-- 1639 01:19:31,930 --> 01:19:35,970 it would be nice to kind of group the identical titles together 1640 01:19:35,970 --> 01:19:42,460 and then actually count how many such titles we grouped together. 1641 01:19:42,460 --> 01:19:47,710 And the syntax for that is literally to say GROUP BY UPPER(title);. 1642 01:19:47,710 --> 01:19:51,130 This tells SQL to group all of the uppercased titles 1643 01:19:51,130 --> 01:19:53,860 together, kind of collapse multiple rows into one, 1644 01:19:53,860 --> 01:19:58,990 but keep track of the count of titles after that collapse. 1645 01:19:58,990 --> 01:20:01,810 Let me go ahead now and hit Enter. 1646 01:20:01,810 --> 01:20:05,980 And you'll see, very similar to one of the earlier Python programs we wrote, 1647 01:20:05,980 --> 01:20:10,040 all of the titles on the left followed by a comma, followed by the count. 1648 01:20:10,040 --> 01:20:11,920 So one of you really likes Tom and Jerry. 1649 01:20:11,920 --> 01:20:14,470 One of you really likes Top Gear. 1650 01:20:14,470 --> 01:20:17,140 If I scroll up, though, two of you really liked The Wire. 1651 01:20:17,140 --> 01:20:19,930 23 of you here like The Office, although we still 1652 01:20:19,930 --> 01:20:22,010 haven't trimmed the issue here. 1653 01:20:22,010 --> 01:20:25,180 So we could still combine that further by trimming whitespace if we want. 1654 01:20:25,180 --> 01:20:27,040 But now we're getting these kinds of counts. 1655 01:20:27,040 --> 01:20:32,510 Well, how can I go ahead and order this, as we did before? 1656 01:20:32,510 --> 01:20:39,820 Let me go ahead here and add ORDER BY COUNT of title 1657 01:20:39,820 --> 01:20:42,010 and then hit semicolon now. 
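A small sketch of the grouping and counting just described, using Python's sqlite3 module. The rows are invented so that each show has a different count.

```python
import sqlite3

# Hypothetical responses with mixed capitalization.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office",), ("THE OFFICE",),
     ("Friends",), ("friends",), ("Top Gear",)],
)

# GROUP BY collapses rows whose uppercased titles match;
# COUNT reports how many rows went into each group.
rows = db.execute(
    "SELECT UPPER(title), COUNT(title) FROM shows "
    "GROUP BY UPPER(title) ORDER BY COUNT(title)"
).fetchall()

print(rows)  # [('TOP GEAR', 1), ('FRIENDS', 2), ('THE OFFICE', 3)]
```

With no direction given, ORDER BY sorts from the smallest count to the largest.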
1658 01:20:42,010 --> 01:20:45,310 And now notice, just as in Python, everything 1659 01:20:45,310 --> 01:20:47,800 is from smallest to largest initially, with Game of Thrones 1660 01:20:47,800 --> 01:20:49,180 here down on the bottom. 1661 01:20:49,180 --> 01:20:50,360 How can I fix this? 1662 01:20:50,360 --> 01:20:53,890 Well, it turns out if you can order things in descending order, 1663 01:20:53,890 --> 01:20:58,510 D-E-S-C for short instead of A-S-C, which is the default for ascending-- 1664 01:20:58,510 --> 01:21:02,110 so if I do it in descending order, now I'd have to scroll all the way back up 1665 01:21:02,110 --> 01:21:07,480 to the A's, the very top, to see where the lines begin. 1666 01:21:07,480 --> 01:21:09,420 Whoops. 1667 01:21:09,420 --> 01:21:13,020 If I scroll all the way back up to the top, we'll see where all of the A words 1668 01:21:13,020 --> 01:21:14,610 begin up here. 1669 01:21:14,610 --> 01:21:17,252 And now if I want to-- 1670 01:21:17,252 --> 01:21:18,210 whoops, whoops, whoops. 1671 01:21:18,210 --> 01:21:20,190 Did I do that right? 1672 01:21:20,190 --> 01:21:20,690 Sorry. 1673 01:21:20,690 --> 01:21:21,950 I don't want to-- 1674 01:21:21,950 --> 01:21:23,827 there we go, ORDER BY COUNT descending. 1675 01:21:23,827 --> 01:21:26,660 Now let me go ahead and-- this is just a little too unwieldy to see. 1676 01:21:26,660 --> 01:21:29,035 Let me just limit myself to the top 10 and keep it simple 1677 01:21:29,035 --> 01:21:30,920 and only look at the top 10 values here. 1678 01:21:30,920 --> 01:21:31,730 Voila. 1679 01:21:31,730 --> 01:21:36,585 Now I have Game of Thrones at 33, Friends at 26, The Office at 23-- 1680 01:21:36,585 --> 01:21:38,210 though I think I'm still missing a few. 1681 01:21:38,210 --> 01:21:41,660 Brian, do you recall the SQL function for trimming leading and trailing 1682 01:21:41,660 --> 01:21:43,410 white space? 1683 01:21:43,410 --> 01:21:44,785 BRIAN YU: I think it's just TRIM. 
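The descending sort and the limit from a moment ago can be sketched the same way. The data is made up, and the limit is 2 rather than 10 only because the sample is tiny.

```python
import sqlite3

# Hypothetical responses; counts differ so the ranking is unambiguous.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office",), ("THE OFFICE",),
     ("Friends",), ("friends",), ("Top Gear",)],
)

# DESC puts the largest counts first; LIMIT keeps only the top rows.
top = db.execute(
    "SELECT UPPER(title), COUNT(title) FROM shows "
    "GROUP BY UPPER(title) ORDER BY COUNT(title) DESC LIMIT 2"
).fetchall()

print(top)  # [('THE OFFICE', 3), ('FRIENDS', 2)]
```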
1684 01:21:44,785 --> 01:21:45,660 DAVID J. MALAN: TRIM? 1685 01:21:45,660 --> 01:21:46,340 OK. 1686 01:21:46,340 --> 01:21:47,577 I myself did not remember. 1687 01:21:47,577 --> 01:21:49,160 So when in doubt, Google or ask Brian. 1688 01:21:49,160 --> 01:21:51,000 So let me go ahead and fix this. 1689 01:21:51,000 --> 01:21:55,670 Let me go ahead and SELECT uppercase of trimming the title first. 1690 01:21:55,670 --> 01:22:00,840 And then I'm going to GROUP BY trimming and then uppercasing it there. 1691 01:22:00,840 --> 01:22:02,372 And now Enter, and voila. 1692 01:22:02,372 --> 01:22:03,080 Thank you, Brian. 1693 01:22:03,080 --> 01:22:07,020 So now we're up to our 26 Offices here. 1694 01:22:07,020 --> 01:22:09,110 So in short, it took us a little while to get 1695 01:22:09,110 --> 01:22:10,880 to this point in the story in SQL. 1696 01:22:10,880 --> 01:22:12,020 But notice what we've done. 1697 01:22:12,020 --> 01:22:14,210 We've taken a program that took us a few minutes 1698 01:22:14,210 --> 01:22:16,790 and certainly a dozen or more lines of code. 1699 01:22:16,790 --> 01:22:20,300 And we've distilled it into something that, yes, is a new language 1700 01:22:20,300 --> 01:22:22,310 but is just kind of a one-liner. 1701 01:22:22,310 --> 01:22:24,888 And once you get comfortable with a language like SQL, 1702 01:22:24,888 --> 01:22:27,680 especially if you're not even a computer scientist but maybe a data 1703 01:22:27,680 --> 01:22:31,130 scientist or an analyst of some sort who spends a lot of their day looking 1704 01:22:31,130 --> 01:22:33,470 at financial information or medical information 1705 01:22:33,470 --> 01:22:37,070 or really any data set that can be loaded into rows and columns, 1706 01:22:37,070 --> 01:22:41,150 once you start to speak and read SQL as a human, 1707 01:22:41,150 --> 01:22:44,990 you can start to express some pretty powerful queries relatively succinctly 1708 01:22:44,990 --> 01:22:47,390 and, boom, get back your answer.
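The TRIM fix can be sketched as below with Python's sqlite3 module. The rows are invented to include stray leading and trailing spaces.

```python
import sqlite3

# Hypothetical responses with stray whitespace around some titles.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("The Office",), ("the office ",), (" THE OFFICE",), ("Friends",)],
)

# Without TRIM, the padded spellings each form their own group.
untrimmed = db.execute(
    "SELECT UPPER(title), COUNT(title) FROM shows GROUP BY UPPER(title)"
).fetchall()

# TRIM strips leading and trailing spaces before UPPER and GROUP BY see the value.
trimmed = db.execute(
    "SELECT UPPER(TRIM(title)), COUNT(title) FROM shows "
    "GROUP BY UPPER(TRIM(title)) ORDER BY COUNT(title) DESC"
).fetchall()

print(len(untrimmed))  # 4
print(trimmed)         # [('THE OFFICE', 3), ('FRIENDS', 1)]
```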
1709 01:22:47,390 --> 01:22:50,000 And by using a command line program, like sqlite3, 1710 01:22:50,000 --> 01:22:53,540 you can immediately see the results there, albeit as very simplistic text. 1711 01:22:53,540 --> 01:22:56,690 But as mentioned, too, there's also some graphical programs 1712 01:22:56,690 --> 01:23:00,117 out there, free and commercial, that also support SQL, where you can still 1713 01:23:00,117 --> 01:23:00,950 type these commands. 1714 01:23:00,950 --> 01:23:03,770 And then it will show it to you in a more user friendly way, much 1715 01:23:03,770 --> 01:23:07,790 like Windows or macOS would by default. 1716 01:23:07,790 --> 01:23:16,058 So any questions now on the syntax or capabilities of SELECT statements? 1717 01:23:16,058 --> 01:23:17,350 BRIAN YU: One question came in. 1718 01:23:17,350 --> 01:23:20,450 Where is the file with this data actually being stored? 1719 01:23:20,450 --> 01:23:21,700 DAVID J. MALAN: Good question. 1720 01:23:21,700 --> 01:23:24,030 Where is the file actually being stored? 1721 01:23:24,030 --> 01:23:27,460 So before quitting, I can actually save this file as anything 1722 01:23:27,460 --> 01:23:30,043 I want. The file extension would typically be .db. 1723 01:23:30,043 --> 01:23:31,960 And in fact, Brian, do you mind just checking? 1724 01:23:31,960 --> 01:23:34,930 What's the syntax for writing the file manually with dot something? 1725 01:23:34,930 --> 01:23:36,910 It would be under .help, I think. 1726 01:23:36,910 --> 01:23:39,550 BRIAN YU: I think it's .save followed by the name of the file. 1727 01:23:39,550 --> 01:23:43,240 DAVID J. MALAN: .save, so I'll call this shows.db, Enter. 1728 01:23:43,240 --> 01:23:46,600 If I now go ahead and open up another terminal window and type 1729 01:23:46,600 --> 01:23:49,990 our old friend ls, you'll see that now I have a CSV file. 1730 01:23:49,990 --> 01:23:51,760 I have my Python file from before.
1731 01:23:51,760 --> 01:23:54,790 And I have a new file called shows.db, which I've created. 1732 01:23:54,790 --> 01:24:00,910 That is the binary file that contains the table that I've loaded dynamically 1733 01:24:00,910 --> 01:24:04,700 in from that CSV file. 1734 01:24:04,700 --> 01:24:08,810 Any other questions on SELECT queries or what we can do with them? 1735 01:24:08,810 --> 01:24:12,620 BRIAN YU: Yeah, a few people are asking about what the runtime of this is. 1736 01:24:12,620 --> 01:24:14,430 DAVID J. MALAN: Yeah, really good question. 1737 01:24:14,430 --> 01:24:15,170 What is the runtime? 1738 01:24:15,170 --> 01:24:18,253 I'm going to come back to that question in just a little bit if that's OK. 1739 01:24:18,253 --> 01:24:20,960 Right now, it's admittedly big O of n. 1740 01:24:20,960 --> 01:24:23,390 I've not actually done anything better than we did 1741 01:24:23,390 --> 01:24:26,090 with our CSV file or our Python code. 1742 01:24:26,090 --> 01:24:28,040 Right now, it's still big O of n by default. 1743 01:24:28,040 --> 01:24:30,230 But there's going to be a better answer to that 1744 01:24:30,230 --> 01:24:33,030 that's going to make it something much more logarithmic. 1745 01:24:33,030 --> 01:24:36,687 So let me come back to that feature when it's time to enable it. 1746 01:24:36,687 --> 01:24:39,020 But in fact, let's start to take some steps toward that. 1747 01:24:39,020 --> 01:24:40,812 Because it turns out, when loading in data, 1748 01:24:40,812 --> 01:24:42,853 we're not always going to have the luxury of just 1749 01:24:42,853 --> 01:24:44,900 having one big file in CSV format that we import, 1750 01:24:44,900 --> 01:24:46,070 and we go about our business. 
1751 01:24:46,070 --> 01:24:47,780 We're going to have to decide in advance how 1752 01:24:47,780 --> 01:24:50,210 we want to store the data and what data we want to store 1753 01:24:50,210 --> 01:24:53,120 and what the relationships might be across not one 1754 01:24:53,120 --> 01:24:55,278 single table, but multiple tables. 1755 01:24:55,278 --> 01:24:57,320 So let me go ahead and run one other command here 1756 01:24:57,320 --> 01:25:00,170 that actually introduces the first of a problem. 1757 01:25:00,170 --> 01:25:03,830 Let me go ahead and SELECT title FROM shows 1758 01:25:03,830 --> 01:25:07,160 WHERE genres equals, for instance, "Comedy." 1759 01:25:07,160 --> 01:25:08,570 That was one of the genres. 1760 01:25:08,570 --> 01:25:11,690 And notice that we get back a whole bunch of results. 1761 01:25:11,690 --> 01:25:14,300 But I bet I'm missing some. 1762 01:25:14,300 --> 01:25:16,470 I'm skimming through this pretty quickly. 1763 01:25:16,470 --> 01:25:19,880 But I bet I'm missing some, because if I check if genres 1764 01:25:19,880 --> 01:25:21,872 = "Comedy," what am I omitting? 1765 01:25:21,872 --> 01:25:24,830 Well, those of you who checked multiple boxes might have said something 1766 01:25:24,830 --> 01:25:28,310 is a comedy and a drama or comedy and romance 1767 01:25:28,310 --> 01:25:30,800 or maybe a couple of other permutations of genres. 1768 01:25:30,800 --> 01:25:34,070 If I'm searching for equality here, = "Comedy," 1769 01:25:34,070 --> 01:25:37,880 I'm only going to get those favorites from you where you only said, 1770 01:25:37,880 --> 01:25:40,250 my favorite TV show is a comedy. 1771 01:25:40,250 --> 01:25:48,113 But what if we want to do something like LIKE comedy instead? 1772 01:25:48,113 --> 01:25:50,030 And we could say something like, well, so long 1773 01:25:50,030 --> 01:25:54,290 as the word "comedy" is in there, then we should get back even more results. 
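The gap between the equality test and the LIKE test on genres can be sketched like this with Python's sqlite3 module. The three rows are made up, with genres stored as comma-separated text the way the form exported them.

```python
import sqlite3

# Hypothetical rows; genres is one comma-separated string per show.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.executemany(
    "INSERT INTO shows (title, genres) VALUES (?, ?)",
    [
        ("The Office", "Comedy"),
        ("Friends", "Comedy, Romance"),
        ("Breaking Bad", "Crime, Drama"),
    ],
)

# Equality finds only shows whose single checked box was Comedy.
only_comedy = db.execute(
    "SELECT title FROM shows WHERE genres = 'Comedy'"
).fetchall()

# The wildcard pattern also finds shows where Comedy is one of several genres.
any_comedy = db.execute(
    "SELECT title FROM shows WHERE genres LIKE '%Comedy%'"
).fetchall()

print(len(only_comedy))  # 1
print(len(any_comedy))   # 2
```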
1774 01:25:54,290 --> 01:25:57,480 And let me stipulate that, indeed, I now have a longer list of results. 1775 01:25:57,480 --> 01:26:01,010 Now we have all shows where you checked at least the Comedy box. 1776 01:26:01,010 --> 01:26:03,770 But unfortunately, this starts to get a little sloppy, 1777 01:26:03,770 --> 01:26:06,410 because recall what the Genres column looks like. 1778 01:26:06,410 --> 01:26:07,730 SELECT. 1779 01:26:07,730 --> 01:26:11,150 Let me SELECT genres FROM shows;. 1780 01:26:11,150 --> 01:26:16,010 Notice that all of the genres that we loaded into this table from the CSV 1781 01:26:16,010 --> 01:26:20,030 file are a comma-separated list of genres. 1782 01:26:20,030 --> 01:26:22,070 That's just the way Google Forms did it. 1783 01:26:22,070 --> 01:26:24,320 And that's fine for CSV purposes. 1784 01:26:24,320 --> 01:26:28,310 That's kind of fine for SQL purposes, but this is kind of messy. 1785 01:26:28,310 --> 01:26:31,700 Generally speaking, storing comma-separated lists 1786 01:26:31,700 --> 01:26:35,840 of values in a SQL database is not what you should be doing. 1787 01:26:35,840 --> 01:26:41,030 The whole point of using a SQL database is to move away from commas and CSVs 1788 01:26:41,030 --> 01:26:42,860 and to actually store things more cleanly. 1789 01:26:42,860 --> 01:26:45,920 Because in fact, let me propose a problem. 1790 01:26:45,920 --> 01:26:50,540 Suppose I want to search not for comedy but maybe also 1791 01:26:50,540 --> 01:26:55,520 music, like this, thereby allowing me to find any shows where 1792 01:26:55,520 --> 01:26:59,990 the word "music" is somewhere in the comma-separated list. 1793 01:26:59,990 --> 01:27:01,940 There's a subtle bug here. 1794 01:27:01,940 --> 01:27:05,690 And you might have to think back to where we began, the form 1795 01:27:05,690 --> 01:27:07,910 that you pulled up. 
1796 01:27:07,910 --> 01:27:09,860 I can't show the whole thing here, but we 1797 01:27:09,860 --> 01:27:14,060 started with action, adventure, animation, biography, dot, dot, dot, 1798 01:27:14,060 --> 01:27:15,620 music. 1799 01:27:15,620 --> 01:27:18,500 Musical was also there, so distinct. 1800 01:27:18,500 --> 01:27:22,700 A music video versus a musical are two different types of genres. 1801 01:27:22,700 --> 01:27:25,250 But notice my query at the moment. 1802 01:27:25,250 --> 01:27:26,930 What's problematic with this? 1803 01:27:26,930 --> 01:27:31,070 At the moment, we would seem to have a bug whereby this query will select 1804 01:27:31,070 --> 01:27:34,370 not only "music," but also "musical." 1805 01:27:34,370 --> 01:27:36,620 And so this is just where things are getting messy. 1806 01:27:36,620 --> 01:27:37,400 Now, yeah, you know what? 1807 01:27:37,400 --> 01:27:38,810 We could kind of clean this up. 1808 01:27:38,810 --> 01:27:43,790 Maybe we could put a comma here so that it can't just be music something. 1809 01:27:43,790 --> 01:27:45,410 It has to be music comma. 1810 01:27:45,410 --> 01:27:47,840 But what if music is the last box that you checked? 1811 01:27:47,840 --> 01:27:49,310 Well, then it's music nothing. 1812 01:27:49,310 --> 01:27:50,210 There is no comma. 1813 01:27:50,210 --> 01:27:52,262 So now I need to OR things together. 1814 01:27:52,262 --> 01:27:54,470 So maybe I have to do something like WHERE genres LIKE "%Music,%" 1815 01:27:54,470 --> 01:28:00,800 like this OR genres LIKE "%Music" like this. 1816 01:28:00,800 --> 01:28:02,750 But honestly, this is just getting messy. 1817 01:28:02,750 --> 01:28:04,040 This is poorly designed.
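The Music versus Musical bug, and the comma workaround, can be sketched as follows using Python's sqlite3 module. The show names are placeholders; only the genre strings matter.

```python
import sqlite3

# Hypothetical rows covering Music alone, first, last, and Musical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.executemany(
    "INSERT INTO shows (title, genres) VALUES (?, ?)",
    [
        ("Show A", "Music"),
        ("Show B", "Comedy, Musical"),
        ("Show C", "Music, Reality-TV"),
        ("Show D", "Drama, Music"),
    ],
)

# The naive pattern matches "Musical" too, which is the bug.
naive = [r[0] for r in db.execute(
    "SELECT title FROM shows WHERE genres LIKE '%Music%' ORDER BY title"
)]

# The workaround: Music followed by a comma, or Music at the very end.
hack = [r[0] for r in db.execute(
    "SELECT title FROM shows "
    "WHERE genres LIKE '%Music,%' OR genres LIKE '%Music' ORDER BY title"
)]

print(naive)  # ['Show A', 'Show B', 'Show C', 'Show D']
print(hack)   # ['Show A', 'Show C', 'Show D']
```

The second query works on this sample, but it depends on exactly how the commas were written, which is the kind of fragility being criticized.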
1818 01:28:04,040 --> 01:28:07,220 If you're just storing your data as a comma-separated list of values inside 1819 01:28:07,220 --> 01:28:11,010 of a column and you have to resort to this kind of hack to figure out, 1820 01:28:11,010 --> 01:28:13,130 well, maybe it's over here or here or here, 1821 01:28:13,130 --> 01:28:16,640 and thinking about all the permutations of syntax, you're doing it wrong. 1822 01:28:16,640 --> 01:28:20,130 You're not using a SQL database to its fullest potential. 1823 01:28:20,130 --> 01:28:22,490 So how do we go about designing this thing better 1824 01:28:22,490 --> 01:28:26,690 and actually load this CSV into a database a little more cleanly? 1825 01:28:26,690 --> 01:28:31,820 In short, how do we get rid of the stupid commas in the Genres column 1826 01:28:31,820 --> 01:28:36,740 and instead put one word, "comedy" or "music" or "musical," 1827 01:28:36,740 --> 01:28:38,930 in each of those cells, so to speak? 1828 01:28:38,930 --> 01:28:40,250 Not two, not three-- 1829 01:28:40,250 --> 01:28:43,820 one only without throwing away some of those genres. 1830 01:28:43,820 --> 01:28:46,730 Well, let me introduce a few building blocks that will get us there. 1831 01:28:46,730 --> 01:28:48,680 It turns out, when creating your own tables 1832 01:28:48,680 --> 01:28:51,260 and loading data into a database on your own, 1833 01:28:51,260 --> 01:28:53,375 we're going to need more than just SELECT. 1834 01:28:53,375 --> 01:28:55,220 SELECT, of course, is just for reading. 1835 01:28:55,220 --> 01:28:59,330 But if we're going to do this better and not just use sqlite3's built-in 1836 01:28:59,330 --> 01:29:04,880 .import command, but instead we're going to write some code to load all 1837 01:29:04,880 --> 01:29:07,580 of our data into maybe two tables-- 1838 01:29:07,580 --> 01:29:10,100 one for the titles, one for the genres-- 1839 01:29:10,100 --> 01:29:15,680 we're going to need a little more expressiveness when it comes to SQL.
1840 01:29:15,680 --> 01:29:17,990 And so for that, we're going to need, one, the ability 1841 01:29:17,990 --> 01:29:19,113 to create our own tables. 1842 01:29:19,113 --> 01:29:20,780 And we've seen a glimpse of this before. 1843 01:29:20,780 --> 01:29:23,280 But we're also going to need to see another piece of syntax, 1844 01:29:23,280 --> 01:29:24,500 as well, so inserting. 1845 01:29:24,500 --> 01:29:29,060 Inserting is another command that you can execute on a SQL database 1846 01:29:29,060 --> 01:29:32,720 in order to actually add data to a database, which is great. 1847 01:29:32,720 --> 01:29:38,630 Because if I want to ultimately iterate over that same CSV but, this time, 1848 01:29:38,630 --> 01:29:43,075 manually add all of the rows to the database myself, 1849 01:29:43,075 --> 01:29:45,200 well, then I'm going to need some way of inserting. 1850 01:29:45,200 --> 01:29:46,850 And the syntax for that is as follows. 1851 01:29:46,850 --> 01:29:50,720 INSERT INTO the name of the table, the column or columns 1852 01:29:50,720 --> 01:29:54,890 that you want to insert values into, then literally the word VALUES, 1853 01:29:54,890 --> 01:29:58,787 and then literally in parentheses again, the actual list of values. 1854 01:29:58,787 --> 01:30:01,370 So it's a little abstract when we see it in this generic form. 1855 01:30:01,370 --> 01:30:06,480 But we'll see this more explicitly in just a moment here, as well. 1856 01:30:06,480 --> 01:30:09,483 So when it comes to inserting something into a database, 1857 01:30:09,483 --> 01:30:10,650 let's go ahead and try this. 1858 01:30:10,650 --> 01:30:13,100 So suppose that-- let's see. 1859 01:30:13,100 --> 01:30:15,080 What's a show that-- 1860 01:30:15,080 --> 01:30:15,985 The Muppet Show. 1861 01:30:15,985 --> 01:30:17,360 I grew up loving The Muppet Show. 1862 01:30:17,360 --> 01:30:18,650 It was out in, like, the '70s. 
1863 01:30:18,650 --> 01:30:21,680 And I don't think it was on the list, but I can check this for sure. 1864 01:30:21,680 --> 01:30:28,100 So SELECT * FROM shows WHERE title LIKE-- 1865 01:30:28,100 --> 01:30:30,950 let's just search for "muppets" with a wild card. 1866 01:30:30,950 --> 01:30:32,500 And I'm guessing no one put it there. 1867 01:30:32,500 --> 01:30:33,000 Good. 1868 01:30:33,000 --> 01:30:34,320 So it's a missed opportunity. 1869 01:30:34,320 --> 01:30:35,570 I forgot to fill out the form. 1870 01:30:35,570 --> 01:30:37,820 I could go back and fill out the form and re-import the CSV, 1871 01:30:37,820 --> 01:30:39,487 but let's go ahead and do this manually. 1872 01:30:39,487 --> 01:30:44,420 So let me go ahead and INSERT INTO shows what columns? 1873 01:30:44,420 --> 01:30:50,360 title and genres, and I guess I could do a Timestamp just for kicks. 1874 01:30:50,360 --> 01:30:52,220 And then I'm going to insert what values? 1875 01:30:52,220 --> 01:30:55,430 The values will be, well, I don't know, whatever time it is now. 1876 01:30:55,430 --> 01:30:58,460 So I'm going to cheat there rather than look up the date and the time. 1877 01:30:58,460 --> 01:31:01,430 The title will be "The Muppet Show." 1878 01:31:01,430 --> 01:31:05,100 And the genres will be-- it was kind of a comedy. 1879 01:31:05,100 --> 01:31:06,290 It was kind of a musical. 1880 01:31:06,290 --> 01:31:08,360 So we'll kind of leave it at that. 1881 01:31:08,360 --> 01:31:09,350 Semicolon. 1882 01:31:09,350 --> 01:31:11,870 So again, this follows the standard syntax here 1883 01:31:11,870 --> 01:31:14,030 of specifying the table you want to insert into, 1884 01:31:14,030 --> 01:31:16,910 the columns you want to insert into, and the values 1885 01:31:16,910 --> 01:31:18,467 you want to put into those columns. 1886 01:31:18,467 --> 01:31:20,300 And I'm going to go ahead and hit Enter now. 1887 01:31:20,300 --> 01:31:22,250 Nothing seems to have happened. 
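That INSERT can be sketched from Python's sqlite3 module as below. The table starts empty here, and the timestamp is generated in Python as an assumption, since the exact value typed in the demo is not shown.

```python
import sqlite3
from datetime import datetime

# Same three columns as the imported spreadsheet, but starting empty.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (Timestamp TEXT, title TEXT, genres TEXT)")

# Table name, then the columns, then VALUES with one value per column.
# The ? placeholders are filled in from the tuple that follows.
db.execute(
    "INSERT INTO shows (Timestamp, title, genres) VALUES (?, ?, ?)",
    (datetime.now().isoformat(), "The Muppet Show", "Comedy, Musical"),
)
db.commit()

# Searching for the plural finds nothing; the singular finds the new row.
plural = db.execute(
    "SELECT title FROM shows WHERE title LIKE '%muppets%'"
).fetchall()
singular = db.execute(
    "SELECT title FROM shows WHERE title LIKE '%muppet%'"
).fetchall()

print(plural)    # []
print(singular)  # [('The Muppet Show',)]
```

A successful INSERT reports nothing back on its own, so a follow-up SELECT is the usual way to confirm the row is there.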
1888 01:31:22,250 --> 01:31:28,070 But if I now select that same query-- 1889 01:31:28,070 --> 01:31:32,630 oh, OK, it's still nothing, because I made a subtle mistake. 1890 01:31:32,630 --> 01:31:34,700 I'm not searching for "Muppets," plural. 1891 01:31:34,700 --> 01:31:37,250 I'm searching for "Muppet," singular, The Muppet Show. 1892 01:31:37,250 --> 01:31:38,000 Voila. 1893 01:31:38,000 --> 01:31:40,790 Now you see my row in this database. 1894 01:31:40,790 --> 01:31:42,680 And so INSERT would give us the ability now 1895 01:31:42,680 --> 01:31:44,570 to insert new rows into the database. 1896 01:31:44,570 --> 01:31:48,410 Suppose you want to update something. 1897 01:31:48,410 --> 01:31:51,540 You know, some of the Muppet Shows were actually pretty dramatic. 1898 01:31:51,540 --> 01:31:52,710 So how might we do that? 1899 01:31:52,710 --> 01:31:56,960 Well, I can say UPDATE shows SET-- 1900 01:31:56,960 --> 01:32:04,250 let's see-- genres = "Comedy, Drama, Musical" WHERE 1901 01:32:04,250 --> 01:32:07,910 title = "The Muppet Show." 1902 01:32:07,910 --> 01:32:10,890 So again, I'll pull up the canonical syntax for this in a bit. 1903 01:32:10,890 --> 01:32:14,120 But for now, just a little teaser, you can update things pretty simply. 1904 01:32:14,120 --> 01:32:16,662 And even though it takes a little getting used to the syntax, 1905 01:32:16,662 --> 01:32:17,960 it kind of does what it says. 1906 01:32:17,960 --> 01:32:23,250 UPDATE shows SET genres = this WHERE title = that. 1907 01:32:23,250 --> 01:32:24,650 And now I can go ahead and Enter. 1908 01:32:24,650 --> 01:32:27,290 If I go ahead and select the same thing, just like in a terminal window, 1909 01:32:27,290 --> 01:32:28,250 you can go up and down. 1910 01:32:28,250 --> 01:32:29,600 That's how I'm typing so quickly. 1911 01:32:29,600 --> 01:32:31,600 I'm just going up and down to previous commands. 1912 01:32:31,600 --> 01:32:32,120 Voila. 
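The UPDATE can be sketched the same way with Python's sqlite3 module. The starting row is recreated here so the example stands alone.

```python
import sqlite3

# Hypothetical table holding just the row that was inserted by hand.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")
db.execute(
    "INSERT INTO shows (title, genres) VALUES (?, ?)",
    ("The Muppet Show", "Comedy, Musical"),
)

# UPDATE the table, SET the new value, and use WHERE to pick which rows change.
cursor = db.execute(
    "UPDATE shows SET genres = ? WHERE title = ?",
    ("Comedy, Drama, Musical", "The Muppet Show"),
)
db.commit()

changed = cursor.rowcount  # number of rows the UPDATE touched
genres = db.execute(
    "SELECT genres FROM shows WHERE title = 'The Muppet Show'"
).fetchone()[0]

print(changed)  # 1
print(genres)   # Comedy, Drama, Musical
```

Leaving off the WHERE clause would apply the new genres to every row in the table, so it is worth double-checking before pressing Enter.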
1913 01:32:32,120 --> 01:32:36,830 Now I see that the Muppet Show is a comedy, a drama, and a musical. 1914 01:32:36,830 --> 01:32:40,070 Well, I take issue, though, with one of the more popular shows that 1915 01:32:40,070 --> 01:32:40,970 was in the list. 1916 01:32:40,970 --> 01:32:44,637 A whole bunch of you liked, let's say, Friends, 1917 01:32:44,637 --> 01:32:46,220 which I've never really been a fan of. 1918 01:32:46,220 --> 01:32:53,828 And let me go ahead and SELECT title FROM shows WHERE title = "Friends." 1919 01:32:53,828 --> 01:32:56,120 And maybe I should be a little more rigorous than that. 1920 01:32:56,120 --> 01:32:59,150 I should say title LIKE "Friends" just in case 1921 01:32:59,150 --> 01:33:00,650 there was different capitalizations. 1922 01:33:00,650 --> 01:33:01,460 Enter. 1923 01:33:01,460 --> 01:33:03,148 A lot of you really liked Friends. 1924 01:33:03,148 --> 01:33:04,190 In fact, how many of you? 1925 01:33:04,190 --> 01:33:05,610 Well, recall that I can do this. 1926 01:33:05,610 --> 01:33:08,900 I can say COUNT, and I can let SQL do the count for me. 1927 01:33:08,900 --> 01:33:10,575 26 of you, I disagree with strongly. 1928 01:33:10,575 --> 01:33:12,950 And there's a couple of you that even added all the dots, 1929 01:33:12,950 --> 01:33:14,240 but we'll deal with you later. 1930 01:33:14,240 --> 01:33:16,100 So suppose I do take issue with this. 1931 01:33:16,100 --> 01:33:22,970 Well, DELETE FROM shows WHERE title = "Friends"-- 1932 01:33:22,970 --> 01:33:24,390 actually, title LIKE "Friends." 1933 01:33:24,390 --> 01:33:25,220 Let's get them all. 1934 01:33:25,220 --> 01:33:26,090 Enter. 1935 01:33:26,090 --> 01:33:29,450 And now if we SELECT this again, I'm sorry. 1936 01:33:29,450 --> 01:33:30,870 Friends has been canceled. 
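Counting and then deleting those rows can be sketched as follows with Python's sqlite3 module. The rows are invented, including one dotted spelling that the pattern does not catch.

```python
import sqlite3

# Hypothetical responses with a few spellings of the same show.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [("Friends",), ("friends",), ("FRIENDS",), ("F.R.I.E.N.D.S",), ("The Office",)],
)

# LIKE with no % wildcard still ignores ASCII case, so three rows match.
before = db.execute(
    "SELECT COUNT(*) FROM shows WHERE title LIKE 'Friends'"
).fetchone()[0]

# DELETE removes every row that the WHERE clause matches.
db.execute("DELETE FROM shows WHERE title LIKE 'Friends'")
db.commit()

remaining = [r[0] for r in db.execute("SELECT title FROM shows ORDER BY title")]

print(before)     # 3
print(remaining)  # ['F.R.I.E.N.D.S', 'The Office']
```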
1937 01:33:30,870 --> 01:33:34,910 So you can again execute these fundamental commands of CRUD, 1938 01:33:34,910 --> 01:33:38,630 Create, Read, Update, and Delete, by using CREATE or INSERT, 1939 01:33:38,630 --> 01:33:41,540 by using SELECT, by using UPDATE literally 1940 01:33:41,540 --> 01:33:43,380 and DELETE literally, as well. 1941 01:33:43,380 --> 01:33:44,580 And that's about it. 1942 01:33:44,580 --> 01:33:46,580 Even though this was a lot quickly, there really 1943 01:33:46,580 --> 01:33:49,040 are just those four fundamental operations in SQL 1944 01:33:49,040 --> 01:33:53,090 plus some of these add-on features, like these additional functions like COUNT 1945 01:33:53,090 --> 01:33:57,420 that you can use and also some of these keywords like WHERE and the like. 1946 01:33:57,420 --> 01:33:59,810 Well, let me propose that we now do better. 1947 01:33:59,810 --> 01:34:04,580 If we have the ability to select data and create tables and insert data, 1948 01:34:04,580 --> 01:34:11,270 let's go ahead and write our own Python script that uses SQL, as in a loop, 1949 01:34:11,270 --> 01:34:16,130 to read over my CSV file and to insert, insert, insert, insert each of the rows 1950 01:34:16,130 --> 01:34:16,700 manually. 1951 01:34:16,700 --> 01:34:18,408 Because honestly, it will take me forever 1952 01:34:18,408 --> 01:34:22,220 to manually type out hundreds of SQL queries to import all of your rows 1953 01:34:22,220 --> 01:34:23,390 into a new database. 1954 01:34:23,390 --> 01:34:25,520 I want to write a program that does this instead. 1955 01:34:25,520 --> 01:34:29,430 And I'm going to propose that we design it in the following way. 1956 01:34:29,430 --> 01:34:32,720 I'm going to have two tables this time, represented here 1957 01:34:32,720 --> 01:34:34,190 with this artist's rendition. 1958 01:34:34,190 --> 01:34:36,020 One is going to be called shows. 1959 01:34:36,020 --> 01:34:38,060 One is going to be called genres.
1960 01:34:38,060 --> 01:34:44,270 And this is a fundamental principle of designing relational databases, 1961 01:34:44,270 --> 01:34:49,700 to figure out the relationships among data and to normalize your data. 1962 01:34:49,700 --> 01:34:53,480 To normalize your data means to eliminate redundancies. 1963 01:34:53,480 --> 01:34:58,520 To normalize your data means to eliminate mentions of the same words 1964 01:34:58,520 --> 01:35:02,320 again and again and have just single sources of truth for your data, 1965 01:35:02,320 --> 01:35:02,820 so to speak. 1966 01:35:02,820 --> 01:35:04,140 So what do I mean by that? 1967 01:35:04,140 --> 01:35:07,520 I'm going to propose that we instead create a simpler table called 1968 01:35:07,520 --> 01:35:10,320 shows that has just two columns. 1969 01:35:10,320 --> 01:35:13,098 One is going to be called id, which is new. 1970 01:35:13,098 --> 01:35:15,140 The other is going to be called title, as before. 1971 01:35:15,140 --> 01:35:16,940 Honestly, I don't care about timestamps, so we're just 1972 01:35:16,940 --> 01:35:19,730 going to throw that value away, which is another upside of writing 1973 01:35:19,730 --> 01:35:20,420 our own program. 1974 01:35:20,420 --> 01:35:23,030 We can add or remove any data we want. 1975 01:35:23,030 --> 01:35:25,850 For id, I'm introducing this, which is going 1976 01:35:25,850 --> 01:35:28,490 to be a unique identifier, literally a simple integer-- 1977 01:35:28,490 --> 01:35:31,190 1, 2, 3, all the way up to a billion or 2 billion, 1978 01:35:31,190 --> 01:35:33,080 however many favorites we have. 1979 01:35:33,080 --> 01:35:35,690 I'm just going to let this auto increment as we go. 1980 01:35:35,690 --> 01:35:36,710 Why? 
1981 01:35:36,710 --> 01:35:42,530 I propose that we move to another table all of the genres and that, 1982 01:35:42,530 --> 01:35:48,350 instead of having one or two or three or five genres in one column 1983 01:35:48,350 --> 01:35:51,860 as a stupid comma-separated list-- which is stupid only in the sense 1984 01:35:51,860 --> 01:35:53,180 that it's just messy, right? 1985 01:35:53,180 --> 01:35:55,040 It means that I have to run stupid commands 1986 01:35:55,040 --> 01:35:57,332 where I'm checking for the comma here, the comma there. 1987 01:35:57,332 --> 01:35:58,850 It's very hackish, so to speak. 1988 01:35:58,850 --> 01:36:00,080 Bad design. 1989 01:36:00,080 --> 01:36:03,770 Instead of doing that, I'm going to create another table that 1990 01:36:03,770 --> 01:36:05,580 also has two columns. 1991 01:36:05,580 --> 01:36:09,320 One is going to be called show_id, and the other is going to be called genre. 1992 01:36:09,320 --> 01:36:12,830 And genre here is just going to be a single word now. 1993 01:36:12,830 --> 01:36:16,340 That column will contain single words for genres, 1994 01:36:16,340 --> 01:36:19,400 like "comedy" or "music" or "musical." 1995 01:36:19,400 --> 01:36:23,570 But we're going to associate all of those genres 1996 01:36:23,570 --> 01:36:27,470 with the original show to which they belong, per your Google form 1997 01:36:27,470 --> 01:36:31,500 submissions, by using this show_id here. 1998 01:36:31,500 --> 01:36:33,290 So what does this mean in particular? 1999 01:36:33,290 --> 01:36:37,370 By adding to our first table, shows, this unique identifier-- 2000 01:36:37,370 --> 01:36:39,080 1, 2, 3, 4, 5, 6-- 2001 01:36:39,080 --> 01:36:44,630 I can now refer to that same show in a very efficient way using 2002 01:36:44,630 --> 01:36:46,940 a very simple number instead of redundantly 2003 01:36:46,940 --> 01:36:49,730 having The Office, The Office, The Office again and again. 
2004 01:36:49,730 --> 01:36:52,280 I can refer to it by just one canonical number, which 2005 01:36:52,280 --> 01:36:54,980 is only going to be 4 bytes or 32 bits. 2006 01:36:54,980 --> 01:36:56,330 Pretty efficient. 2007 01:36:56,330 --> 01:37:00,920 But I can still associate that show with one genre or two or three 2008 01:37:00,920 --> 01:37:03,210 or more or even none. 2009 01:37:03,210 --> 01:37:07,610 So in this way, every row in our current table 2010 01:37:07,610 --> 01:37:12,860 is going to become one or more rows in our new pair of tables. 2011 01:37:12,860 --> 01:37:15,560 We're factoring out the genres so that we 2012 01:37:15,560 --> 01:37:20,270 can add multiple rows for every show, potentially, but still 2013 01:37:20,270 --> 01:37:25,050 remap those genres back to the original show itself. 2014 01:37:25,050 --> 01:37:27,890 So what is some of the buzzwords here? 2015 01:37:27,890 --> 01:37:31,070 What's some of the language to be familiar with? 2016 01:37:31,070 --> 01:37:35,090 Well, we need to know what kinds of types are at our disposal here. 2017 01:37:35,090 --> 01:37:37,250 So for that, let me propose this. 2018 01:37:37,250 --> 01:37:41,300 Let me propose that we have this list here. 2019 01:37:41,300 --> 01:37:44,590 It turns out, in SQLite, there are five main data types. 2020 01:37:44,590 --> 01:37:46,340 And that's a bit of an oversimplification, 2021 01:37:46,340 --> 01:37:49,430 but there's five main data types, some of which look familiar, 2022 01:37:49,430 --> 01:37:51,410 a couple of which are a little weird. 2023 01:37:51,410 --> 01:37:53,810 INTEGER is a thing. 2024 01:37:53,810 --> 01:37:55,910 REAL is the same thing as float. 2025 01:37:55,910 --> 01:38:00,080 So an integer might be a 32-bit or 4-byte value, like 1, 2, 3, or 4, 2026 01:38:00,080 --> 01:38:01,130 positive or negative. 
2027 01:38:01,130 --> 01:38:03,213 Real number's going to have a decimal point in it, 2028 01:38:03,213 --> 01:38:05,570 a floating point value, probably 32 bits by default. 2029 01:38:05,570 --> 01:38:08,240 But those kinds of things, the sizes of these types, 2030 01:38:08,240 --> 01:38:10,430 vary by system, just like they technically 2031 01:38:10,430 --> 01:38:13,760 did in C. So do they vary by system in the world of SQL. 2032 01:38:13,760 --> 01:38:16,010 But generally speaking, these are good rules of thumb. 2033 01:38:16,010 --> 01:38:16,970 TEXT is just that. 2034 01:38:16,970 --> 01:38:19,820 It's sort of the equivalent of a string of some length. 2035 01:38:19,820 --> 01:38:22,362 But then in SQLite, it turns out there's two other data 2036 01:38:22,362 --> 01:38:23,570 types we've not seen before-- 2037 01:38:23,570 --> 01:38:25,010 NUMERIC and BLOB. 2038 01:38:25,010 --> 01:38:26,750 But more on those in just a little bit. 2039 01:38:26,750 --> 01:38:28,370 BLOB is Binary Large Object. 2040 01:38:28,370 --> 01:38:30,860 It means you can store 0's and 1's in your database. 2041 01:38:30,860 --> 01:38:34,670 NUMERIC is going to be something that's number-like but isn't a number per se. 2042 01:38:34,670 --> 01:38:38,360 It's like a year or a time, something that has numbers, but isn't 2043 01:38:38,360 --> 01:38:40,730 just a simple integer at that. 2044 01:38:40,730 --> 01:38:44,210 And let me propose, too, that SQLite is going to allow us to specify, too, 2045 01:38:44,210 --> 01:38:49,520 when we create our own columns manually by executing the SQL code ourselves, 2046 01:38:49,520 --> 01:38:52,430 we can specify that a column cannot be null. 2047 01:38:52,430 --> 01:38:53,840 Thus far, we've ignored this. 2048 01:38:53,840 --> 01:38:56,090 But some of you might have taken the fifth 2049 01:38:56,090 --> 01:38:58,850 and just not given us the title of a show or a genre. 2050 01:38:58,850 --> 01:39:01,020 Your answers might be blank. 
2051 01:39:01,020 --> 01:39:03,020 Some of you, maybe in registering for a website, 2052 01:39:03,020 --> 01:39:06,170 don't want to provide information like where you live or your phone number. 2053 01:39:06,170 --> 01:39:10,190 So a database in general sometimes does want to support null values. 2054 01:39:10,190 --> 01:39:12,290 But you might want to say that it can't be null. 2055 01:39:12,290 --> 01:39:14,390 A website probably needs your email address, 2056 01:39:14,390 --> 01:39:18,570 needs your password and a few other fields, but not everything. 2057 01:39:18,570 --> 01:39:22,250 And there's another keyword in SQL, just so you've seen it, called UNIQUE, where 2058 01:39:22,250 --> 01:39:25,460 you can additionally say that whatever values are in this column 2059 01:39:25,460 --> 01:39:26,520 must be unique. 2060 01:39:26,520 --> 01:39:28,670 So a website might also use that. 2061 01:39:28,670 --> 01:39:31,910 If you want to make sure that the same email address can't register 2062 01:39:31,910 --> 01:39:33,830 for your website multiple times, you just 2063 01:39:33,830 --> 01:39:36,020 specify that the email column is unique. 2064 01:39:36,020 --> 01:39:40,370 That way, you can't put multiple people in with identical email addresses. 2065 01:39:40,370 --> 01:39:44,060 So long story short, this is just more of the tools in our SQL toolkit, 2066 01:39:44,060 --> 01:39:46,280 because we'll see some of these now indirectly. 2067 01:39:46,280 --> 01:39:49,670 And the last piece of jargon we need before designing our own tables 2068 01:39:49,670 --> 01:39:51,150 is going to be this. 2069 01:39:51,150 --> 01:39:54,110 It turns out that, in SQL, there's this notion 2070 01:39:54,110 --> 01:39:56,270 of primary keys and foreign keys. 2071 01:39:56,270 --> 01:39:59,390 And we've not seen this in spreadsheets. 
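The NOT NULL and UNIQUE constraints described a moment ago can be seen rejecting bad rows in a short sketch. This is only an illustration under stated assumptions: the users table, its email and phone columns, and the sample values are hypothetical, and Python's built-in sqlite3 module is used to run the SQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical table: email is required and must not repeat; phone is optional.
db.execute("CREATE TABLE users (email TEXT NOT NULL UNIQUE, phone TEXT)")
db.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

attempts = [
    # No email given, so the column would be NULL: violates NOT NULL.
    "INSERT INTO users (phone) VALUES ('555-0100')",
    # Same email a second time: violates UNIQUE.
    "INSERT INTO users (email) VALUES ('alice@example.com')",
]

rejected = []
for statement in attempts:
    try:
        db.execute(statement)
    except sqlite3.IntegrityError as error:
        # SQLite refuses the row and reports which constraint failed.
        rejected.append(str(error))

user_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Both bad inserts are refused by the database itself, so the program does not need its own checks for blank or duplicate email addresses.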
2072 01:39:59,390 --> 01:40:02,150 Unless you've been working in the real world for some years 2073 01:40:02,150 --> 01:40:04,400 and you have fairly fancy spreadsheets in front of you 2074 01:40:04,400 --> 01:40:06,380 as an analyst or financial person or the like, 2075 01:40:06,380 --> 01:40:11,750 odds are you've not seen keys or unique identifiers in quite the same way. 2076 01:40:11,750 --> 01:40:13,170 But they're relatively simple. 2077 01:40:13,170 --> 01:40:17,390 In fact, let me go back to our picture before and propose 2078 01:40:17,390 --> 01:40:21,230 that when you have two tables like this and you 2079 01:40:21,230 --> 01:40:25,790 want to use a simple integer to uniquely identify all of the rows in one 2080 01:40:25,790 --> 01:40:28,395 of the tables, that's called technically an ID. 2081 01:40:28,395 --> 01:40:30,020 That's what I'll call it by convention. 2082 01:40:30,020 --> 01:40:33,770 You could call it anything you want, but ID just means it's a unique identifier. 2083 01:40:33,770 --> 01:40:37,470 But semantically, this ID is what's called a primary key. 2084 01:40:37,470 --> 01:40:43,940 A primary key is the column in a table that uniquely identifies every row. 2085 01:40:43,940 --> 01:40:46,820 This means you can have multiple versions of The Office 2086 01:40:46,820 --> 01:40:48,860 in that title field. 2087 01:40:48,860 --> 01:40:52,490 But each of those rows is going to have its own number uniquely, potentially. 2088 01:40:52,490 --> 01:40:56,630 So primary key uniquely identifies each row. 2089 01:40:56,630 --> 01:41:01,550 In another table, like genres, which I'm proposing we create in just a moment, 2090 01:41:01,550 --> 01:41:06,770 it turns out that you're welcome to refer back to another table 2091 01:41:06,770 --> 01:41:09,260 by way of that unique identifier. 2092 01:41:09,260 --> 01:41:13,710 But when it's in this context, that ID is called a foreign key. 
2093 01:41:13,710 --> 01:41:16,130 So even though I've called it show_id here, 2094 01:41:16,130 --> 01:41:18,470 that's just a convention in a lot of SQL databases 2095 01:41:18,470 --> 01:41:23,030 to imply that this is technically a column called ID in a table 2096 01:41:23,030 --> 01:41:26,760 called show or shows, plural in this case. 2097 01:41:26,760 --> 01:41:29,900 So if there's a number 1 here, and suppose 2098 01:41:29,900 --> 01:41:34,190 that The Office has a unique ID of 1, we would 2099 01:41:34,190 --> 01:41:38,420 have a row in this table called id is 1, title is The Office. 2100 01:41:38,420 --> 01:41:43,730 The Office might be in the comedy category, the drama category, 2101 01:41:43,730 --> 01:41:46,400 the romance category, so multiple ones. 2102 01:41:46,400 --> 01:41:51,050 Therefore, in the genres table, we want to output three rows, 2103 01:41:51,050 --> 01:41:56,150 the number 1, 1, 1 in each of those rows but the words "comedy," 2104 01:41:56,150 --> 01:42:00,450 "drama," "romance" in each of those rows respectively. 2105 01:42:00,450 --> 01:42:03,620 So again, the goal here is to just design our database better, not 2106 01:42:03,620 --> 01:42:08,120 have these stupid comma-separated lists of values inside of a single column. 2107 01:42:08,120 --> 01:42:12,980 We want to kind of blow that up, explode it, into individual rows. 2108 01:42:12,980 --> 01:42:15,710 You might think, well, why don't we just use multiple columns? 2109 01:42:15,710 --> 01:42:18,560 But again, per our principle from spreadsheets, 2110 01:42:18,560 --> 01:42:21,650 you should not be in the habit of adding more and more columns when 2111 01:42:21,650 --> 01:42:25,190 the data is all the same, like genre, genre, genre, right? 
2112 01:42:25,190 --> 01:42:27,410 The stupid way to do this in the spreadsheet world 2113 01:42:27,410 --> 01:42:29,660 would be to have one column called Genre 1, 2114 01:42:29,660 --> 01:42:34,100 another column called Genre 2, another column called Genre 3, Genre 4. 2115 01:42:34,100 --> 01:42:37,340 And you can imagine just how stupid and inefficient this is. 2116 01:42:37,340 --> 01:42:41,510 A lot of those columns are going to be empty for shows with very few genres. 2117 01:42:41,510 --> 01:42:43,770 And it's just kind of messy at that point. 2118 01:42:43,770 --> 01:42:47,030 So better, in the world of relational databases, 2119 01:42:47,030 --> 01:42:51,350 to have something like a second table, where you have multiple rows that 2120 01:42:51,350 --> 01:42:55,700 somehow link back to that primary key by way of what we're calling, 2121 01:42:55,700 --> 01:42:58,440 conceptually, a foreign key. 2122 01:42:58,440 --> 01:42:59,090 All right. 2123 01:42:59,090 --> 01:43:01,640 So let's go ahead now and try to write this code. 2124 01:43:01,640 --> 01:43:03,710 Let me go back to my IDE. 2125 01:43:03,710 --> 01:43:07,850 Let me quit out of SQLite now. 2126 01:43:07,850 --> 01:43:10,640 And let me just move away. 2127 01:43:10,640 --> 01:43:15,402 I'm going to move this away, my file, for just a moment 2128 01:43:15,402 --> 01:43:17,360 so that we're only left with our original data. 2129 01:43:17,360 --> 01:43:21,680 Let's go about implementing a final version of my Python file that does 2130 01:43:21,680 --> 01:43:23,540 this-- creates two tables-- 2131 01:43:23,540 --> 01:43:26,270 one called shows, one called genres-- 2132 01:43:26,270 --> 01:43:30,200 and then, two, in a for loop, iterates over that CSV 2133 01:43:30,200 --> 01:43:34,490 and inserts some data into the shows and other data into the genres. 2134 01:43:34,490 --> 01:43:36,350 How can we do this programmatically? 
2135 01:43:36,350 --> 01:43:38,720 Well, there's a final piece of the puzzle that we need. 2136 01:43:38,720 --> 01:43:41,912 We need some way of bridging the world of Python and SQL. 2137 01:43:41,912 --> 01:43:44,120 And here, we do need a library, because it would just 2138 01:43:44,120 --> 01:43:46,700 be way too painful to do without a library. 2139 01:43:46,700 --> 01:43:47,540 It can be CS50. 2140 01:43:47,540 --> 01:43:50,730 CS50, as we'll see, makes this very simple. 2141 01:43:50,730 --> 01:43:53,480 There are other third-party commercial and open-source libraries 2142 01:43:53,480 --> 01:43:56,522 that you can also use in the real world, as well, that do the same thing. 2143 01:43:56,522 --> 01:43:58,670 But the syntax is a little less friendly, 2144 01:43:58,670 --> 01:44:01,880 so we'll start by using the CS50 library, which in Python, recall, 2145 01:44:01,880 --> 01:44:04,330 has functions like get_string and get_int and get_float. 2146 01:44:04,330 --> 01:44:10,430 But today, it also has support, it turns out, for SQL capabilities, as well. 2147 01:44:10,430 --> 01:44:12,760 So I'm going to go back to my Favorites file. 2148 01:44:12,760 --> 01:44:15,970 And I'm going to import not only CSV, but I'm also 2149 01:44:15,970 --> 01:44:21,310 going to import from the CS50 library a feature called SQL. 2150 01:44:21,310 --> 01:44:25,930 So we have a variable, if you will, inside of the CS50 library 2151 01:44:25,930 --> 01:44:28,600 or, rather, a function inside of the CS50 library 2152 01:44:28,600 --> 01:44:31,870 called SQL that, if I call it, will allow me 2153 01:44:31,870 --> 01:44:35,270 to load a SQLite database into memory. 2154 01:44:35,270 --> 01:44:36,290 So how do I do this? 2155 01:44:36,290 --> 01:44:38,790 Well, let me go ahead and add a couple of new lines of code. 2156 01:44:38,790 --> 01:44:45,340 Let me go ahead and open up a file called shows.db, 2157 01:44:45,340 --> 01:44:47,055 but this time in write mode. 
2158 01:44:47,055 --> 01:44:49,180 And then just for kicks-- just for now, rather, I'm 2159 01:44:49,180 --> 01:44:50,930 going to go ahead and close it right away. 2160 01:44:50,930 --> 01:44:54,260 This is a Pythonic way of creating an empty file. 2161 01:44:54,260 --> 01:44:58,210 It's kind of stupid looking, but by opening a file called shows.db 2162 01:44:58,210 --> 01:45:00,790 in write mode and then immediately closing it, 2163 01:45:00,790 --> 01:45:03,670 it has the effect of creating the file, closing the file. 2164 01:45:03,670 --> 01:45:06,310 So I now have an empty file with which to interact. 2165 01:45:06,310 --> 01:45:09,100 I could also do this, as an aside, by doing this-- 2166 01:45:09,100 --> 01:45:10,810 touch shows.db. 2167 01:45:10,810 --> 01:45:14,470 touch is kind of a strange command, but in a terminal window, 2168 01:45:14,470 --> 01:45:17,870 it means to create a file if it doesn't exist. 2169 01:45:17,870 --> 01:45:19,450 So we could also do that instead. 2170 01:45:19,450 --> 01:45:22,420 But that would be independent of Python. 2171 01:45:22,420 --> 01:45:24,790 So once I've created this file, let me go ahead 2172 01:45:24,790 --> 01:45:28,720 and open the file now as a SQLite database. 2173 01:45:28,720 --> 01:45:31,600 I'm going to declare a variable called db for database. 2174 01:45:31,600 --> 01:45:34,930 I'm going to use the SQL function from CS50's library. 2175 01:45:34,930 --> 01:45:38,170 And I'm going to open via a somewhat cryptic string this-- 2176 01:45:38,170 --> 01:45:43,600 sqlite:///shows.db. 2177 01:45:43,600 --> 01:45:48,740 Now, it looks like a URL, http://, but it's SQLite instead. 2178 01:45:48,740 --> 01:45:52,300 And there's three slashes instead of the usual two.
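The create-then-open steps just described can be sketched as follows. This is not the lecture's code: the lecture opens the file through CS50's SQL function with the string sqlite:///shows.db, whereas this sketch assumes Python's built-in sqlite3 module as a stand-in and puts shows.db in a temporary directory so nothing in the current folder is overwritten.

```python
import os
import sqlite3
import tempfile

# Temporary directory (an assumption) so an existing shows.db is not clobbered.
path = os.path.join(tempfile.mkdtemp(), "shows.db")

# Opening in write mode and closing immediately leaves an empty file behind.
open(path, "w").close()
size_before = os.path.getsize(path)

# A zero-byte file is accepted by SQLite as an empty database.
db = sqlite3.connect(path)
table_count = db.execute("SELECT COUNT(*) FROM sqlite_master").fetchone()[0]
db.close()
```

With the built-in module the empty-file step is optional, since sqlite3.connect creates the file itself when it does not exist yet.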
2179 01:45:52,300 --> 01:45:54,430 But this line of code, line 6, has the result 2180 01:45:54,430 --> 01:45:57,820 of opening now that otherwise empty file with nothing 2181 01:45:57,820 --> 01:46:04,040 in it yet as being a SQLite database using CS50's library. 2182 01:46:04,040 --> 01:46:05,330 Why did I do that? 2183 01:46:05,330 --> 01:46:09,020 Well, I did that because I now want to create my first table. 2184 01:46:09,020 --> 01:46:12,140 Let me go ahead and execute, db.execute. 2185 01:46:12,140 --> 01:46:16,330 So there's a function called execute inside of the CS50 SQL library. 2186 01:46:16,330 --> 01:46:17,980 And I'm going to go ahead and run this. 2187 01:46:17,980 --> 01:46:23,770 CREATE TABLE called shows, the columns of which 2188 01:46:23,770 --> 01:46:27,430 are an id, which is going to be an integer, a title, which 2189 01:46:27,430 --> 01:46:33,380 is going to be text, the primary key in which is going to be the id column. 2190 01:46:33,380 --> 01:46:34,870 So this is a bit cryptic. 2191 01:46:34,870 --> 01:46:36,520 But let's see what's happening. 2192 01:46:36,520 --> 01:46:41,950 I seem to now, in line 8, be combining Python with SQL. 2193 01:46:41,950 --> 01:46:46,000 And this is where now programming gets really powerful, fancy, cool, 2194 01:46:46,000 --> 01:46:48,250 difficult, however you want to perceive it. 2195 01:46:48,250 --> 01:46:50,680 I can actually use one language inside of another. 2196 01:46:50,680 --> 01:46:51,250 How? 2197 01:46:51,250 --> 01:46:53,420 Well, SQL is just a bunch of textual commands. 2198 01:46:53,420 --> 01:46:55,420 Up until now, I've been typing them out manually 2199 01:46:55,420 --> 01:46:57,430 in this program called sqlite3. 2200 01:46:57,430 --> 01:47:00,010 There's nothing stopping me, though, from storing 2201 01:47:00,010 --> 01:47:02,830 those same commands in Python strings and then 2202 01:47:02,830 --> 01:47:05,890 passing them to a database using code.
2203 01:47:05,890 --> 01:47:08,230 The code I'm using is a function called execute. 2204 01:47:08,230 --> 01:47:10,990 And its purpose in life, and CS50 staff wrote this, 2205 01:47:10,990 --> 01:47:18,950 is to pass the argument from your Python code into the database for execution. 2206 01:47:18,950 --> 01:47:22,510 So it's like the programmatic way of just typing things manually 2207 01:47:22,510 --> 01:47:25,160 at the SQLite prompt a few minutes ago. 2208 01:47:25,160 --> 01:47:27,880 So that's going to go ahead and create my table called 2209 01:47:27,880 --> 01:47:30,610 shows, in which I'm going to store all of those unique IDs 2210 01:47:30,610 --> 01:47:32,290 and also the titles. 2211 01:47:32,290 --> 01:47:33,670 And then let me do this again. 2212 01:47:33,670 --> 01:47:39,040 db.execute CREATE TABLE genres, and that's 2213 01:47:39,040 --> 01:47:43,670 going to have a column called show_id, which is an integer also, genre, 2214 01:47:43,670 --> 01:47:45,340 which is text. 2215 01:47:45,340 --> 01:47:48,130 And lastly, it's going to have a foreign key-- 2216 01:47:48,130 --> 01:47:51,190 it's going to wrap a little long here-- 2217 01:47:51,190 --> 01:47:56,563 on show_id, which references the shows table id. 2218 01:47:56,563 --> 01:47:57,730 All right, so this is a lot. 2219 01:47:57,730 --> 01:47:59,860 So let's just recap left to right. 2220 01:47:59,860 --> 01:48:03,730 db.execute is my Python function that executes any SQL I want. 2221 01:48:03,730 --> 01:48:06,460 CREATE TABLE genres creates a table called genres. 2222 01:48:06,460 --> 01:48:10,060 The columns in that table will be something called show_id, 2223 01:48:10,060 --> 01:48:13,630 which is an integer, and genre, which is a text field. 2224 01:48:13,630 --> 01:48:17,050 But it's going to be one genre at a time, not multiple. 
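The two CREATE TABLE statements dictated above might look like this when written out. The exact punctuation of the SQL is reconstructed from the spoken description, so treat it as an assumption, and the sketch runs the statements through Python's built-in sqlite3 module on an in-memory database rather than through the CS50 library used in the lecture.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# shows: one row per show, identified by an integer primary key.
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

# genres: one row per (show, genre) pair; show_id points back at shows.id.
db.execute(
    "CREATE TABLE genres (show_id INTEGER, genre TEXT, "
    "FOREIGN KEY(show_id) REFERENCES shows(id))"
)

# Inspect what was created, much like .schema at the sqlite3 prompt.
tables = [
    row[0]
    for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
genre_columns = [row[1] for row in db.execute("PRAGMA table_info(genres)")]
```

Each SQL statement is an ordinary Python string, which is all that "using one language inside of another" amounts to here.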
2225 01:48:17,050 --> 01:48:20,170 And then here, I'm specifying a foreign key 2226 01:48:20,170 --> 01:48:24,280 will be the show_id column, which happens to refer back 2227 01:48:24,280 --> 01:48:28,180 to the shows table's IDs column. 2228 01:48:28,180 --> 01:48:31,480 It's a little cryptic, but all this is doing is implementing for us 2229 01:48:31,480 --> 01:48:33,470 the equivalent of this picture here. 2230 01:48:33,470 --> 01:48:35,770 I could have manually typed both of these SQL 2231 01:48:35,770 --> 01:48:37,690 commands at that blinking prompt. 2232 01:48:37,690 --> 01:48:39,850 But again, no, I want to write a program now 2233 01:48:39,850 --> 01:48:43,720 in Python that creates the tables for me and now, more interestingly, 2234 01:48:43,720 --> 01:48:47,583 loads the data into that database. 2235 01:48:47,583 --> 01:48:49,000 So let's go ahead and do this now. 2236 01:48:49,000 --> 01:48:51,100 I'm not going to select a title from the user, 2237 01:48:51,100 --> 01:48:52,660 because I want to import everything. 2238 01:48:52,660 --> 01:48:54,993 I'm not going to use any counting or anything like that. 2239 01:48:54,993 --> 01:48:57,700 So let's go ahead and just go inside of my loop as before. 2240 01:48:57,700 --> 01:49:02,240 And this time, let's go ahead and, for row in reader, 2241 01:49:02,240 --> 01:49:05,110 let's go ahead and get the current title, as we've always done. 2242 01:49:05,110 --> 01:49:08,640 But let's also, as always, go ahead and strip it of white space 2243 01:49:08,640 --> 01:49:11,700 and capitalize it, just to canonicalize it. 2244 01:49:11,700 --> 01:49:15,960 And now I'm going to go ahead and execute db.execute, quote unquote, 2245 01:49:15,960 --> 01:49:24,707 INSERT INTO shows the title column, the value of "title." 2246 01:49:24,707 --> 01:49:26,040 So I want to put the title here. 
2247 01:49:26,040 --> 01:49:31,690 It turns out that SQL libraries like ours support one final piece of syntax, 2248 01:49:31,690 --> 01:49:32,850 which is a placeholder. 2249 01:49:32,850 --> 01:49:34,800 In C, we use %s. 2250 01:49:34,800 --> 01:49:37,950 In Python, we just use curly braces and put the word right there. 2251 01:49:37,950 --> 01:49:41,520 In SQL, we have a third approach to the same problem-- just syntactically 2252 01:49:41,520 --> 01:49:43,590 different, but conceptually the same. 2253 01:49:43,590 --> 01:49:46,560 You put a question mark where you want to put a placeholder. 2254 01:49:46,560 --> 01:49:50,670 And then outside of this string, I'm going to actually type in the value 2255 01:49:50,670 --> 01:49:53,070 that I want to plug into that question mark. 2256 01:49:53,070 --> 01:49:55,590 So this is so similar to printf in week 1. 2257 01:49:55,590 --> 01:50:00,180 But instead of %s, it's a question mark now and then a comma-separated list 2258 01:50:00,180 --> 01:50:03,120 of the arguments you want to plug in for those placeholders. 2259 01:50:03,120 --> 01:50:08,820 So now this line of code 16 has just inserted all of those values 2260 01:50:08,820 --> 01:50:09,670 into my database. 2261 01:50:09,670 --> 01:50:10,440 And let's go ahead and run this. 2262 01:50:10,440 --> 01:50:12,970 Before I go any further, let me go ahead and do this. 2263 01:50:12,970 --> 01:50:15,960 I'm going to go ahead now and run python of favorites.py 2264 01:50:15,960 --> 01:50:18,030 and cross my fingers, as always. 2265 01:50:18,030 --> 01:50:20,010 It's taking a moment, taking a moment. 2266 01:50:20,010 --> 01:50:23,340 That's because there's a decent-sized file there. 2267 01:50:23,340 --> 01:50:25,650 Or I screwed up. 2268 01:50:25,650 --> 01:50:27,930 This is taking too long. 2269 01:50:27,930 --> 01:50:28,950 Oh, OK. 2270 01:50:28,950 --> 01:50:30,960 I should have just been more patient. 2271 01:50:30,960 --> 01:50:31,560 All right. 
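The question-mark placeholder described above can be sketched like this. The sketch assumes Python's built-in sqlite3 module, which takes the values as a tuple after the SQL string (the CS50 library's execute takes them as a comma-separated list of arguments instead), and it assumes that "capitalize" means upper(); the sample titles are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

# Made-up raw responses with stray whitespace, mixed case, and an apostrophe.
raw_titles = ["  the office ", "Friends", "grey's anatomy"]

for raw in raw_titles:
    # Canonicalize: trim whitespace and force uppercase (one possible choice).
    title = raw.strip().upper()
    # The ? is filled in by the library, which also handles quoting safely.
    db.execute("INSERT INTO shows (title) VALUES (?)", (title,))

titles = [row[0] for row in db.execute("SELECT title FROM shows ORDER BY id")]
```

Letting the library substitute the value, rather than building the SQL with string formatting, means a title containing an apostrophe cannot break the statement.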
2272 01:50:31,560 --> 01:50:33,970 So it just seems my connection's a little slow. 2273 01:50:33,970 --> 01:50:38,717 So as I expected, everything is 100% correct, and it's working fine. 2274 01:50:38,717 --> 01:50:40,800 So now let's go ahead and see what I actually did. 2275 01:50:40,800 --> 01:50:44,970 If I type ls, notice that I have a file called shows.db. 2276 01:50:44,970 --> 01:50:48,180 This is brand new, because my Python program created it this time. 2277 01:50:48,180 --> 01:50:51,060 Let's go ahead and run sqlite3 of shows.db 2278 01:50:51,060 --> 01:50:53,080 just so I can now see what's inside of it. 2279 01:50:53,080 --> 01:50:57,090 Notice that I can do .schema just to see what tables exist. 2280 01:50:57,090 --> 01:51:00,660 And indeed, the two tables that I created in my Python code 2281 01:51:00,660 --> 01:51:01,920 seem to exist. 2282 01:51:01,920 --> 01:51:04,020 But notice that there's-- 2283 01:51:04,020 --> 01:51:08,730 if I do SELECT * FROM shows, let's see all the data. 2284 01:51:08,730 --> 01:51:09,750 Voila. 2285 01:51:09,750 --> 01:51:13,170 There is a table that's been programmatically created. 2286 01:51:13,170 --> 01:51:16,350 And it has, notice this time, no timestamps, no genres. 2287 01:51:16,350 --> 01:51:20,730 But it has an ID on the left and the title on the right. 2288 01:51:20,730 --> 01:51:25,350 And amazingly, all of the IDs are monotonically increasing from 1 2289 01:51:25,350 --> 01:51:27,390 on up to 513, in this case. 2290 01:51:27,390 --> 01:51:28,300 Why is that? 2291 01:51:28,300 --> 01:51:30,600 Well, one of the features you get in a SQL database 2292 01:51:30,600 --> 01:51:34,410 is if you define a column as being a primary key in SQLite, 2293 01:51:34,410 --> 01:51:36,480 it's going to be auto incremented for you. 2294 01:51:36,480 --> 01:51:41,970 Recall that nowhere in my code did I even have a line, an integer, 2295 01:51:41,970 --> 01:51:43,830 inputting 1, then 2, then 3. 
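The automatic numbering just observed can be reproduced in a few lines. This is a small sketch assuming Python's built-in sqlite3 module and made-up titles; in SQLite the behavior applies to a single column declared as INTEGER and as the primary key, which is how the shows table is declared here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")

# Only titles are supplied; no id appears anywhere in the INSERT.
for title in ["THE OFFICE", "FRIENDS", "THE OFFICE"]:
    db.execute("INSERT INTO shows (title) VALUES (?)", (title,))

# SQLite filled in the ids on its own, counting up from 1.
rows = db.execute("SELECT id, title FROM shows ORDER BY id").fetchall()
```

Note that the duplicate title still receives its own id, which is what lets the primary key distinguish rows that otherwise look identical.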
2296 01:51:43,830 --> 01:51:45,310 I could absolutely do that. 2297 01:51:45,310 --> 01:51:47,730 I could have done something like this-- counter-- 2298 01:51:47,730 --> 01:51:51,660 rather, I could have done something like this-- counter = 1. 2299 01:51:51,660 --> 01:51:56,280 And then down here, I could have said id, title, give myself 2300 01:51:56,280 --> 01:51:59,122 two placeholders, and then pass in the counter each time. 2301 01:51:59,122 --> 01:52:01,830 I could have implemented this myself and then, on each iteration, 2302 01:52:01,830 --> 01:52:03,960 done counter += 1. 2303 01:52:03,960 --> 01:52:06,330 But with SQL databases, as we've seen, you 2304 01:52:06,330 --> 01:52:08,310 get a lot more functionality built in. 2305 01:52:08,310 --> 01:52:11,130 I don't have to do any of that, because if I've 2306 01:52:11,130 --> 01:52:16,710 declared that ID as being a primary key, SQLite is going to insert it for me 2307 01:52:16,710 --> 01:52:19,870 and increment it also for me, as well. 2308 01:52:19,870 --> 01:52:20,400 All right. 2309 01:52:20,400 --> 01:52:24,510 So if I go back to SQLite, though, notice that I do have IDs and titles. 2310 01:52:24,510 --> 01:52:28,860 But if I SELECT * FROM genres, there's of course nothing there yet. 2311 01:52:28,860 --> 01:52:32,250 So how now do I get all of the genres for each of these shows in? 2312 01:52:32,250 --> 01:52:33,910 I need to finish my script. 2313 01:52:33,910 --> 01:52:38,970 So inside of this same loop, I have not only the title in my current row, 2314 01:52:38,970 --> 01:52:42,570 but I also have genres in the current row. 2315 01:52:42,570 --> 01:52:45,570 But the genres are separated by commas. 2316 01:52:45,570 --> 01:52:47,880 Recall that in the CSV, next to every title, 2317 01:52:47,880 --> 01:52:51,450 there's a comma-separated list of genres. 2318 01:52:51,450 --> 01:52:53,460 How do I get at each genre individually? 
2319 01:52:53,460 --> 01:52:59,190 Well, I'd like to be able to say for genre in row bracket genres. 2320 01:52:59,190 --> 01:53:02,520 But this is not going to work, because that's not going 2321 01:53:02,520 --> 01:53:05,310 to be split up based on those commas. 2322 01:53:05,310 --> 01:53:07,190 That's literally just going to iterate over, 2323 01:53:07,190 --> 01:53:10,860 in fact, all of the characters in that string, as we saw last week. 2324 01:53:10,860 --> 01:53:13,950 But it turns out that strings in Python have a fancy split 2325 01:53:13,950 --> 01:53:19,300 function, whereby I can split on a comma followed by a space. 2326 01:53:19,300 --> 01:53:21,930 And what this function will do for me in Python is 2327 01:53:21,930 --> 01:53:26,130 take a comma separated list of genres and explode it, so to speak, 2328 01:53:26,130 --> 01:53:31,800 split it on every comma, space into a Python list 2329 01:53:31,800 --> 01:53:36,570 containing genre after genre in an actual Python list 2330 01:53:36,570 --> 01:53:37,990 a la square brackets. 2331 01:53:37,990 --> 01:53:42,360 So now I can iterate over that list of individual genres. 2332 01:53:42,360 --> 01:53:49,470 And inside of here, I can do db.execute INSERT INTO genres show_id, genre, 2333 01:53:49,470 --> 01:53:53,130 the values, question mark, question mark. 2334 01:53:53,130 --> 01:53:56,100 But huh, there's a problem. 2335 01:53:56,100 --> 01:53:59,970 I can definitely plug in the current genre, which is this. 2336 01:53:59,970 --> 01:54:02,970 But I need to put something here still. 2337 01:54:02,970 --> 01:54:07,560 For that first question mark, I need a value for the show_id. 2338 01:54:07,560 --> 01:54:11,130 How do I know what the ID is of the current TV show? 2339 01:54:11,130 --> 01:54:13,650 Well, it turns out the library can help you with this. 
2340 01:54:13,650 --> 01:54:18,970 When you insert new rows into a table that has a primary key, 2341 01:54:18,970 --> 01:54:23,400 it turns out that most libraries will return you that value in some way. 2342 01:54:23,400 --> 01:54:26,520 And if I go back to line 15 and I actually 2343 01:54:26,520 --> 01:54:31,470 store the return value of db.execute after using INSERT, 2344 01:54:31,470 --> 01:54:34,500 the library will tell me what was the integer that 2345 01:54:34,500 --> 01:54:36,390 was just used for this given show. 2346 01:54:36,390 --> 01:54:37,650 Maybe it's 1, 2, 3. 2347 01:54:37,650 --> 01:54:39,940 I don't have to know or care as the programmer. 2348 01:54:39,940 --> 01:54:42,570 But the return value, I can store in a variable. 2349 01:54:42,570 --> 01:54:47,520 And then down here, I can literally put that same ID so that now, 2350 01:54:47,520 --> 01:54:51,600 if I am inputting The Office, whose ID is 1, into the shows table 2351 01:54:51,600 --> 01:54:54,720 and its genres are comedy, drama, romance, 2352 01:54:54,720 --> 01:54:57,990 I can now inside of this for loop, this nested for loop, 2353 01:54:57,990 --> 01:55:03,240 insert 1 followed by "comedy," 1 followed by "drama," 1 followed 2354 01:55:03,240 --> 01:55:07,330 by "romance," three rows all at once. 2355 01:55:07,330 --> 01:55:11,980 And so now let's go back down here into my terminal window. 2356 01:55:11,980 --> 01:55:15,660 Let me remove the old shows.db with rm, just to start fresh. 2357 01:55:15,660 --> 01:55:19,920 Let me go ahead and rerun python of favorites.py. 2358 01:55:19,920 --> 01:55:23,733 I'll be more patient this time, because cloud's being a little slow. 2359 01:55:23,733 --> 01:55:24,900 So it's doing some thinking. 2360 01:55:24,900 --> 01:55:27,030 And in fact, there's more work being done now. 2361 01:55:27,030 --> 01:55:29,340 At this point in the story, my program is presumably 2362 01:55:29,340 --> 01:55:33,060 iterating over all of the rows in the CSV. 
2363 01:55:33,060 --> 01:55:37,170 And it's inserting into the shows table one at a time, 2364 01:55:37,170 --> 01:55:43,380 and then it's inserting one or more genres into the genres table. 2365 01:55:43,380 --> 01:55:44,250 It's a little slow. 2366 01:55:44,250 --> 01:55:47,370 If we were on a faster system or if I were doing it on my own Mac or PC, 2367 01:55:47,370 --> 01:55:49,480 it would probably go down more quickly. 2368 01:55:49,480 --> 01:55:52,740 But you can see here an example of why I use the .import command in the first 2369 01:55:52,740 --> 01:55:53,130 place. 2370 01:55:53,130 --> 01:55:54,670 That automated some of this process. 2371 01:55:54,670 --> 01:55:58,440 But unfortunately, it didn't allow me to change the format of my data. 2372 01:55:58,440 --> 01:56:01,530 But the key point to make here is that even though this 2373 01:56:01,530 --> 01:56:05,490 is taking a little bit of time to insert these hundreds of rows all at once, 2374 01:56:05,490 --> 01:56:07,260 I'm only going to have to do this once. 2375 01:56:07,260 --> 01:56:10,840 And what was asked a bit ago was the performance of this. 2376 01:56:10,840 --> 01:56:15,390 It turns out that now that we have full control over the SQL database, 2377 01:56:15,390 --> 01:56:20,640 it turns out we're going to have the ability to actually improve 2378 01:56:20,640 --> 01:56:22,230 the performance thereof. 2379 01:56:22,230 --> 01:56:24,000 Oh, OK. 2380 01:56:24,000 --> 01:56:25,830 As expected, it finished right on time. 2381 01:56:25,830 --> 01:56:29,970 And let me go ahead now and run sqlite3 on shows.db. 2382 01:56:29,970 --> 01:56:32,670 All right, so now I'm back in my raw SQL environment. 2383 01:56:32,670 --> 01:56:36,180 If I do SELECT * FROM shows, which I did before, 2384 01:56:36,180 --> 01:56:37,650 we'll see all of this as before. 
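The import loop described above can be sketched roughly as follows. This is a minimal sketch, not the lecture's actual favorites.py: it swaps in Python's built-in sqlite3 module for the library used in the lecture, reads a tiny made-up CSV from a string instead of the real export, and assumes lowercase column headers `title` and `genres`. The `strip().upper()` cleanup is also an assumption, inferred from the uppercase titles queried later.

```python
import csv
import io
import sqlite3

# Made-up stand-in for the exported CSV (hypothetical rows, not real responses).
SAMPLE_CSV = (
    "Timestamp,title,genres\n"
    't1,The Office,"Comedy, Drama, Romance"\n'
    't2,Game of Thrones,"Action, Adventure"\n'
)


def load_favorites(csv_file, conn):
    """Insert each CSV row into shows, then one genres row per genre."""
    conn.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE genres (show_id INTEGER, genre TEXT, "
        "FOREIGN KEY(show_id) REFERENCES shows(id))"
    )
    for row in csv.DictReader(csv_file):
        title = row["title"].strip().upper()  # assumed cleanup step
        # No id is supplied; SQLite assigns the next integer itself.
        cursor = conn.execute("INSERT INTO shows (title) VALUES (?)", (title,))
        show_id = cursor.lastrowid  # the id SQLite just used for this show
        # Split "Comedy, Drama, Romance" on comma-space into separate genres.
        for genre in row["genres"].split(", "):
            conn.execute(
                "INSERT INTO genres (show_id, genre) VALUES (?, ?)",
                (show_id, genre),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_favorites(io.StringIO(SAMPLE_CSV), conn)
shows = conn.execute("SELECT * FROM shows ORDER BY id").fetchall()
genres = conn.execute("SELECT * FROM genres ORDER BY show_id, genre").fetchall()
print(shows)
print(genres)
```

The inner loop is what turns one comma-separated cell into several rows that all share the same show_id.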
2385 01:56:37,650 --> 01:56:42,090 If I SELECT * FROM shows WHERE title = "THE OFFICE," 2386 01:56:42,090 --> 01:56:45,103 I'll see the actual unique IDs of all of those. 2387 01:56:45,103 --> 01:56:46,770 We didn't bother eliminating duplicates. 2388 01:56:46,770 --> 01:56:50,610 We just kept everything as is, but we gave everything a unique ID. 2389 01:56:50,610 --> 01:56:57,520 But if I now do SELECT * FROM genres, we'll see all of the values there. 2390 01:56:57,520 --> 01:56:59,070 And notice the key detail. 2391 01:56:59,070 --> 01:57:03,360 There is only one genre per row here. 2392 01:57:03,360 --> 01:57:06,480 And so we can ultimately line those up with our titles. 2393 01:57:06,480 --> 01:57:10,250 And our titles here, we had all of these here. 2394 01:57:10,250 --> 01:57:12,538 Something's wrong. 2395 01:57:12,538 --> 01:57:13,580 I want to get this right. 2396 01:57:13,580 --> 01:57:15,940 Let's go ahead and take our second and final five-minute break here. 2397 01:57:15,940 --> 01:57:18,280 And we'll come back, and I will explain what's going on. 2398 01:57:18,280 --> 01:57:20,170 All right, we are back. 2399 01:57:20,170 --> 01:57:23,710 And just before we broke up, my own self-doubt was starting to creep in. 2400 01:57:23,710 --> 01:57:26,830 But I'm happy to say, with no fancy magic behind the scenes, 2401 01:57:26,830 --> 01:57:28,430 everything was actually working fine. 2402 01:57:28,430 --> 01:57:30,263 I was just doubting the correctness of this. 2403 01:57:30,263 --> 01:57:33,460 If I do SELECT * FROM shows, I indeed get back 2404 01:57:33,460 --> 01:57:37,540 two columns, one with the unique ID, the so-called primary key, followed 2405 01:57:37,540 --> 01:57:40,280 by the title of each of those shows. 2406 01:57:40,280 --> 01:57:46,120 And if I similarly search for * FROM genres, I get single genres at a time. 
2407 01:57:46,120 --> 01:57:49,600 But on the left-hand side are not primary keys per se 2408 01:57:49,600 --> 01:57:52,450 but now those same numbers here in this context called 2409 01:57:52,450 --> 01:57:55,160 foreign keys that map one to the other. 2410 01:57:55,160 --> 01:58:01,260 So for instance, whatever show 512 is had five different genres associated 2411 01:58:01,260 --> 01:58:01,760 with it. 2412 01:58:01,760 --> 01:58:05,320 And in fact, if I go back a moment to shows, it looks like Game of Thrones 2413 01:58:05,320 --> 01:58:10,420 was decided by one of you as belonging in thriller, history, adventure, 2414 01:58:10,420 --> 01:58:14,660 action, and war, as well, those five. 2415 01:58:14,660 --> 01:58:17,320 So now this is what's meant by relational database. 2416 01:58:17,320 --> 01:58:21,430 You have this relation or relationship across multiple tables 2417 01:58:21,430 --> 01:58:25,050 that link some data in one to some other data in the like. 2418 01:58:25,050 --> 01:58:27,550 The catch, though, is that it would seem a little harder now 2419 01:58:27,550 --> 01:58:30,910 to answer questions, because now I have to kind of query two tables 2420 01:58:30,910 --> 01:58:34,450 or execute two separate queries and then combine the data. 2421 01:58:34,450 --> 01:58:36,130 But that's not actually the case. 2422 01:58:36,130 --> 01:58:39,100 Suppose that I want to answer the question of, 2423 01:58:39,100 --> 01:58:42,760 what are all of the musicals among your favorite TV shows? 2424 01:58:42,760 --> 01:58:46,490 I can't select just the shows, because there's no genres in there anymore. 2425 01:58:46,490 --> 01:58:48,730 But I also can't select just the genres table, 2426 01:58:48,730 --> 01:58:50,900 because there's no titles in there. 2427 01:58:50,900 --> 01:58:55,060 But there is a value that's bridging one and the other, that foreign key 2428 01:58:55,060 --> 01:58:56,980 to primary key relationship. 
2429 01:58:56,980 --> 01:58:59,170 So you know what I can do off the top of my head? 2430 01:58:59,170 --> 01:59:03,790 I'm pretty sure I can select all of the show_ids from the genres table 2431 01:59:03,790 --> 01:59:07,072 where a specific genre = "Musical." 2432 01:59:07,072 --> 01:59:09,280 And I don't have to worry about commas or spaces now, 2433 01:59:09,280 --> 01:59:13,210 because again, in this new version that I have designed programmatically 2434 01:59:13,210 --> 01:59:16,990 with code, musical and every other genre is just a single word. 2435 01:59:16,990 --> 01:59:21,220 If I hit Enter, all of these show_ids were decided 2436 01:59:21,220 --> 01:59:23,930 by you all as belonging to musicals. 2437 01:59:23,930 --> 01:59:25,930 But now this is not interesting, and I certainly 2438 01:59:25,930 --> 01:59:28,360 don't want to execute 10 or so queries manually 2439 01:59:28,360 --> 01:59:30,400 to look up every one of those IDs. 2440 01:59:30,400 --> 01:59:32,680 But notice what we can do in SQL, as well. 2441 01:59:32,680 --> 01:59:33,880 I can nest queries. 2442 01:59:33,880 --> 01:59:36,940 Let me put this whole query in parentheses for just a moment 2443 01:59:36,940 --> 01:59:39,070 and then prepend to it the following. 2444 01:59:39,070 --> 01:59:46,930 SELECT title FROM shows WHERE the primary key, id, is in this subquery. 2445 01:59:46,930 --> 01:59:50,650 So you can have nested queries similar in spirit a bit like in Python and C 2446 01:59:50,650 --> 01:59:52,510 when you have nested for loops. 2447 01:59:52,510 --> 01:59:55,690 In this case, just like in grade school math, whatever is in the parentheses 2448 01:59:55,690 --> 01:59:57,160 will be executed first. 2449 01:59:57,160 --> 02:00:02,140 Then the outer query will be executed using the results of that inner query. 2450 02:00:02,140 --> 02:00:07,000 So if I select the title from shows where the ID is in that list of IDs, 2451 02:00:07,000 --> 02:00:07,660 voila. 
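The nested query just described can be tried on a toy database like this. It is only a sketch under assumptions: the rows are made up for illustration, and Python's built-in sqlite3 module stands in for typing the SQL at the sqlite3 prompt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genres (show_id INTEGER, genre TEXT,
                         FOREIGN KEY(show_id) REFERENCES shows(id));
    -- Hypothetical sample rows, not the class's real responses.
    INSERT INTO shows (id, title) VALUES
        (1, 'GLEE'), (2, 'THE OFFICE'), (3, 'SHERLOCK');
    INSERT INTO genres (show_id, genre) VALUES
        (1, 'Musical'), (1, 'Comedy'), (2, 'Comedy'), (3, 'Musical');
""")

# Inner query on its own: just the foreign keys of shows tagged Musical.
inner = "SELECT show_id FROM genres WHERE genre = ?"
musical_ids = [
    row[0] for row in conn.execute(inner + " ORDER BY show_id", ("Musical",))
]

# Outer query wraps the inner one in parentheses and matches on the primary key.
nested = f"SELECT title FROM shows WHERE id IN ({inner}) ORDER BY title"
musicals = [row[0] for row in conn.execute(nested, ("Musical",))]

print(musical_ids)
print(musicals)
```

The first result is a list of IDs that means little by itself; the second is the same information translated into titles by a single query.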
2452 02:00:07,660 --> 02:00:11,560 It seems that, somewhat amusingly, several of you 2453 02:00:11,560 --> 02:00:15,280 think that Breaking Bad, Supernatural, Glee, Sherlock, How I Met Your Mother, 2454 02:00:15,280 --> 02:00:18,190 Hawaii Five-0, Twin Peaks, The Lawyer, and My Brother, My Brother 2455 02:00:18,190 --> 02:00:19,900 and Me are all musicals. 2456 02:00:19,900 --> 02:00:22,630 I take exception to a few of those, but so be it. 2457 02:00:22,630 --> 02:00:24,850 You checked the box for musical for those shows. 2458 02:00:24,850 --> 02:00:29,260 So even though we've designed things better in the sense 2459 02:00:29,260 --> 02:00:33,010 that we've normalized our database by factoring out commonalities 2460 02:00:33,010 --> 02:00:35,050 or, rather, we've cleaned up the data, there's 2461 02:00:35,050 --> 02:00:37,150 still admittedly some redundancy. 2462 02:00:37,150 --> 02:00:39,370 There's still admittedly some redundancy. 2463 02:00:39,370 --> 02:00:44,410 But I at least now have the data in clean fashion 2464 02:00:44,410 --> 02:00:47,800 so that every column has just a single value in it and not 2465 02:00:47,800 --> 02:00:49,870 some contrived comma-separated list. 2466 02:00:49,870 --> 02:00:51,725 Suppose I want to find out all of the genres 2467 02:00:51,725 --> 02:00:53,350 that you all thought The Office was in. 2468 02:00:53,350 --> 02:00:55,480 So let's ask kind of the opposite question. 2469 02:00:55,480 --> 02:00:56,840 Well, how might I do that? 2470 02:00:56,840 --> 02:01:00,430 Well, to figure out The Office, I'm going to first need to SELECT the id 2471 02:01:00,430 --> 02:01:06,400 FROM shows WHERE title = "THE OFFICE," because a whole bunch of you 2472 02:01:06,400 --> 02:01:07,330 typed in The Office. 2473 02:01:07,330 --> 02:01:09,902 And we gave each of your answers a unique identifier 2474 02:01:09,902 --> 02:01:11,110 so we could keep track of it. 2475 02:01:11,110 --> 02:01:12,500 And there's all of those numbers. 
2476 02:01:12,500 --> 02:01:14,260 Now, this is, like, dozens of responses. 2477 02:01:14,260 --> 02:01:16,540 I certainly don't want to execute that many queries. 2478 02:01:16,540 --> 02:01:18,850 But I think a subquery will help us out again. 2479 02:01:18,850 --> 02:01:21,470 Let me put parentheses around this whole thing. 2480 02:01:21,470 --> 02:01:27,910 And now let me say SELECT DISTINCT genre FROM genres WHERE 2481 02:01:27,910 --> 02:01:32,380 the show_id in the genres table is in that query. 2482 02:01:32,380 --> 02:01:37,400 And just for kicks, let me go ahead and ORDER BY genre. 2483 02:01:37,400 --> 02:01:38,930 So let me go ahead and execute this. 2484 02:01:38,930 --> 02:01:42,490 And, OK, somewhat amusingly, those of you who inputted The Office 2485 02:01:42,490 --> 02:01:46,960 checked boxes for animation, comedy, documentary, drama, family, horror, 2486 02:01:46,960 --> 02:01:49,000 reality-TV, romance, and sci-fi. 2487 02:01:49,000 --> 02:01:51,020 I take exception to a few of those, too. 2488 02:01:51,020 --> 02:01:53,720 But this is what happens when you accept user input. 2489 02:01:53,720 --> 02:01:57,293 So here again, we have with this SQL language 2490 02:01:57,293 --> 02:01:59,710 the ability to express fairly succinctly, even though it's 2491 02:01:59,710 --> 02:02:03,670 a lot of new features today all at once, what would otherwise take me 2492 02:02:03,670 --> 02:02:06,593 a dozen or two lines in Python code to implement 2493 02:02:06,593 --> 02:02:09,010 and god knows how many lines of code and how many hours it 2494 02:02:09,010 --> 02:02:13,060 would take me to implement something like this in C. Now, admittedly, 2495 02:02:13,060 --> 02:02:15,220 we could do better than this design. 2496 02:02:15,220 --> 02:02:18,550 This table or this picture represents what we have now. 2497 02:02:18,550 --> 02:02:22,360 But you'll notice a lot of redundancy implicit in the genres table. 
2498 02:02:22,360 --> 02:02:25,510 Any time you check the comedy box, I have a row now 2499 02:02:25,510 --> 02:02:27,940 that says comedy, comedy, comedy, comedy. 2500 02:02:27,940 --> 02:02:31,930 And the show_id differs, but I have the word "comedy" again and again. 2501 02:02:31,930 --> 02:02:35,740 And now, that tends to be frowned upon in the world of relational databases, 2502 02:02:35,740 --> 02:02:39,370 because if you have a genre called comedy or one 2503 02:02:39,370 --> 02:02:42,430 called musical or anything else, you should ideally just 2504 02:02:42,430 --> 02:02:43,970 have that living in one place. 2505 02:02:43,970 --> 02:02:47,980 And so if we really wanted to be particular and really, truly 2506 02:02:47,980 --> 02:02:51,100 normalize this database, which is an academic term referring 2507 02:02:51,100 --> 02:02:55,530 to removing all such redundancies, we could actually do it like this. 2508 02:02:55,530 --> 02:02:59,480 We could have a shows table still with an id and title, no difference there. 2509 02:02:59,480 --> 02:03:03,890 But we could have a genres table with two columns, id and name. 2510 02:03:03,890 --> 02:03:05,020 Now, this is its own id. 2511 02:03:05,020 --> 02:03:06,910 It has no connection with the show_id. 2512 02:03:06,910 --> 02:03:10,660 It's just its own unique identifier, a primary key here now, 2513 02:03:10,660 --> 02:03:12,320 and the name of that genre. 2514 02:03:12,320 --> 02:03:14,350 So you would have one row in the genres table 2515 02:03:14,350 --> 02:03:17,690 for comedy, for drama, music, musical, and everything else. 2516 02:03:17,690 --> 02:03:19,870 And then you would use a third table, which 2517 02:03:19,870 --> 02:03:23,920 is colloquially called a join table, which I'll draw here in the middle. 
2518 02:03:23,920 --> 02:03:25,960 And you can call it anything you want, but we've 2519 02:03:25,960 --> 02:03:29,920 called it shows_genres to make clear that this table implements 2520 02:03:29,920 --> 02:03:33,400 a relationship between those two tables. 2521 02:03:33,400 --> 02:03:36,910 And notice that in this table is really no juicy data. 2522 02:03:36,910 --> 02:03:38,800 It's just foreign keys-- 2523 02:03:38,800 --> 02:03:41,380 show_id, genre_id. 2524 02:03:41,380 --> 02:03:43,930 And by having this third table, we can now 2525 02:03:43,930 --> 02:03:47,890 make sure that the word "comedy" only appears in one row anywhere. 2526 02:03:47,890 --> 02:03:50,860 The word "musical" only appears in one row anywhere. 2527 02:03:50,860 --> 02:03:55,450 But we use these more efficient integers called show_id and genre_id, 2528 02:03:55,450 --> 02:04:00,850 which respectively point to those primary keys and their primary tables 2529 02:04:00,850 --> 02:04:02,072 to link those two together. 2530 02:04:02,072 --> 02:04:04,780 And this is an example of what's called in the world of databases 2531 02:04:04,780 --> 02:04:06,790 a many-to-many relationship. 2532 02:04:06,790 --> 02:04:09,610 One show can have many genres. 2533 02:04:09,610 --> 02:04:12,730 One genre can belong to many shows. 2534 02:04:12,730 --> 02:04:14,530 And so by having this third table, you can 2535 02:04:14,530 --> 02:04:16,730 have that many-to-many relationship. 2536 02:04:16,730 --> 02:04:19,570 And again, the third table now allows us to truly normalize 2537 02:04:19,570 --> 02:04:23,920 our data set by getting rid of all of the duplicate comedy, comedy, comedy. 2538 02:04:23,920 --> 02:04:25,420 Why is this important? 2539 02:04:25,420 --> 02:04:27,310 Probably not a huge deal for genres. 2540 02:04:27,310 --> 02:04:30,910 But imagine with my current design if I made a spelling mistake, 2541 02:04:30,910 --> 02:04:32,440 and I misnamed comedy. 
2542 02:04:32,440 --> 02:04:36,190 I would now have to change every row with the word comedy again and again. 2543 02:04:36,190 --> 02:04:39,580 Or if maybe you change the genres of the shows, 2544 02:04:39,580 --> 02:04:41,840 you would have to change it in multiple places. 2545 02:04:41,840 --> 02:04:44,260 But with this other approach with three tables, 2546 02:04:44,260 --> 02:04:46,450 you can argue that now you only have to change 2547 02:04:46,450 --> 02:04:49,750 the name of a genre in one place, not all over the place. 2548 02:04:49,750 --> 02:04:52,420 And that, in general, in C and now in Python and now 2549 02:04:52,420 --> 02:04:57,040 SQL has generally been a good thing not to copy paste identical values 2550 02:04:57,040 --> 02:05:00,060 all over the place. 2551 02:05:00,060 --> 02:05:00,720 All right. 2552 02:05:00,720 --> 02:05:04,410 So with that said, what other tools do we have at our disposal? 2553 02:05:04,410 --> 02:05:09,180 Well, it turns out that there are other data types out there in the real world 2554 02:05:09,180 --> 02:05:11,280 using SQL besides just these five-- 2555 02:05:11,280 --> 02:05:13,620 BLOB, INTEGER, NUMERIC, REAL, and TEXT. 2556 02:05:13,620 --> 02:05:15,870 BLOB, again, is for binary stuff, generally not 2557 02:05:15,870 --> 02:05:18,840 used except for more specialized applications, let's say. 2558 02:05:18,840 --> 02:05:21,270 INTEGER, which is an int, typically 32 bits; 2559 02:05:21,270 --> 02:05:23,700 NUMERIC, which is something like a date or a year 2560 02:05:23,700 --> 02:05:26,220 or time or something like that; REAL numbers, which 2561 02:05:26,220 --> 02:05:30,180 are floating point values; and TEXT, which are things like strings. 
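The three-table, many-to-many design described a moment ago (before the list of data types) can be sketched like this. The table and column names follow the lecture; the sample rows, the deliberate misspelling, and the UNIQUE constraint on the genre name are assumptions added for illustration, and Python's built-in sqlite3 module is used to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    -- Each genre name lives in exactly one row (UNIQUE is an added assumption).
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- The join table holds nothing but foreign keys.
    CREATE TABLE shows_genres (
        show_id INTEGER,
        genre_id INTEGER,
        FOREIGN KEY(show_id) REFERENCES shows(id),
        FOREIGN KEY(genre_id) REFERENCES genres(id)
    );
    -- Hypothetical rows, with 'Comdy' misspelled on purpose.
    INSERT INTO shows (id, title) VALUES (1, 'THE OFFICE'), (2, 'GLEE');
    INSERT INTO genres (id, name) VALUES (1, 'Comdy'), (2, 'Musical');
    INSERT INTO shows_genres (show_id, genre_id) VALUES (1, 1), (2, 1), (2, 2);
""")

# The spelling mistake is fixed in one row, no matter how many shows use it.
fixed = conn.execute("UPDATE genres SET name = 'Comedy' WHERE name = 'Comdy'")
rows_changed = fixed.rowcount

# Titles for a genre name, using nested queries through the join table.
query = """
    SELECT title FROM shows WHERE id IN (
        SELECT show_id FROM shows_genres WHERE genre_id IN (
            SELECT id FROM genres WHERE name = ?
        )
    ) ORDER BY title
"""
comedies = [row[0] for row in conn.execute(query, ("Comedy",))]

print(rows_changed)
print(comedies)
```

With the earlier two-table design, the same correction would have touched one row per show tagged with that genre.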
2562 02:05:30,180 --> 02:05:34,110 But if you graduate ultimately from SQLite on phones 2563 02:05:34,110 --> 02:05:38,522 and on Macs and PCs to actual servers that run Oracle, MySQL, 2564 02:05:38,522 --> 02:05:40,230 and PostgreSQL if you're actually running 2565 02:05:40,230 --> 02:05:42,540 your own internet-style business, well, it 2566 02:05:42,540 --> 02:05:47,310 turns out that more sophisticated, even more powerful 2567 02:05:47,310 --> 02:05:50,620 databases come with other subtypes, if you will. 2568 02:05:50,620 --> 02:05:54,270 So besides INTEGER, you can specify smallint for small numbers, 2569 02:05:54,270 --> 02:05:57,690 maybe using just a few bits instead of 32-- 2570 02:05:57,690 --> 02:06:01,800 INTEGER or bigint, which uses 64 bits instead of 32. 2571 02:06:01,800 --> 02:06:05,130 The Facebooks, the Twitters of the world need to use bigint a lot, 2572 02:06:05,130 --> 02:06:06,720 because they have so much data. 2573 02:06:06,720 --> 02:06:09,330 You and I can get away with simple integers, because we're not 2574 02:06:09,330 --> 02:06:12,450 going to have more than 4 billion favorite TV shows in a class, 2575 02:06:12,450 --> 02:06:13,290 certainly. 2576 02:06:13,290 --> 02:06:17,040 Something like REAL, you can have 32-bit real numbers or, a little weirdly 2577 02:06:17,040 --> 02:06:22,470 named, double precision, which is like a double was in C, using 64 bits instead 2578 02:06:22,470 --> 02:06:23,640 for more precision. 2579 02:06:23,640 --> 02:06:25,230 NUMERIC is kind of this catchall. 2580 02:06:25,230 --> 02:06:29,160 You can have not only dates and date times but things like Boolean values. 2581 02:06:29,160 --> 02:06:31,200 You can specify the total number of digits 2582 02:06:31,200 --> 02:06:34,180 to store using this numeric scale and precision. 2583 02:06:34,180 --> 02:06:37,440 So it relates to numbers that aren't just quite integers. 
2584 02:06:37,440 --> 02:06:39,720 And then you also have categories of TEXT-- 2585 02:06:39,720 --> 02:06:42,570 char followed by a number, which specifies 2586 02:06:42,570 --> 02:06:47,010 that every value in the column will have the same number of characters, 2587 02:06:47,010 --> 02:06:50,190 that's helpful for things where you know the length in advance, like in the US. 2588 02:06:50,190 --> 02:06:54,030 All states, all 50 states, have two-character codes, 2589 02:06:54,030 --> 02:06:57,450 like MA for Massachusetts, CA for California. 2590 02:06:57,450 --> 02:07:00,150 char(2) would be appropriate there, because you 2591 02:07:00,150 --> 02:07:03,000 know every value in the column is going to have two characters. 2592 02:07:03,000 --> 02:07:05,250 When you don't know, though, you can use varchar. 2593 02:07:05,250 --> 02:07:08,400 And varchar specifies a maximum number of characters. 2594 02:07:08,400 --> 02:07:12,060 And so you might specify varchar of, like, 32. 2595 02:07:12,060 --> 02:07:15,600 No one might be able to type in a name that's longer than 32 characters, 2596 02:07:15,600 --> 02:07:18,900 or varchar(200) if you want to allow for something even bigger. 2597 02:07:18,900 --> 02:07:21,690 But this is germane to our real-world experience with the web. 2598 02:07:21,690 --> 02:07:24,540 If you've ever gone to a website, started filling out a form, 2599 02:07:24,540 --> 02:07:26,850 and all of a sudden you can't type any more characters, 2600 02:07:26,850 --> 02:07:28,440 your response is too long-- 2601 02:07:28,440 --> 02:07:29,462 why is that? 2602 02:07:29,462 --> 02:07:31,170 Well, one, the programmers just might not 2603 02:07:31,170 --> 02:07:33,810 want you to keep expressing yourself in more detail, especially 2604 02:07:33,810 --> 02:07:36,330 if it's a complaint form on a customer service site.
2605 02:07:36,330 --> 02:07:40,620 But pragmatically, it's probably because their database was designed 2606 02:07:40,620 --> 02:07:42,570 to store a finite number of characters. 2607 02:07:42,570 --> 02:07:44,025 And you have hit that threshold. 2608 02:07:44,025 --> 02:07:45,900 And you certainly don't want to have a buffer 2609 02:07:45,900 --> 02:07:50,190 overflow, like in C. So the database will enforce a maximum value n. 2610 02:07:50,190 --> 02:07:52,830 And then text is for even bigger chunks of text. 2611 02:07:52,830 --> 02:07:54,930 If you're letting people copy paste their resumes 2612 02:07:54,930 --> 02:07:59,680 or whole documents or even larger sets of text, you might use text instead. 2613 02:07:59,680 --> 02:08:03,510 So let's then consider a real-world data set. 2614 02:08:03,510 --> 02:08:07,320 Things get really interesting, and all of these very academic ideas 2615 02:08:07,320 --> 02:08:09,600 and recommendations really come into play 2616 02:08:09,600 --> 02:08:14,830 when we don't have hundreds of favorites but when we have thousands instead. 2617 02:08:14,830 --> 02:08:19,180 And so what I'm going to go ahead and do here is download a file here, 2618 02:08:19,180 --> 02:08:25,120 which is a SQLite version of the IMDb, Internet Movie Database, 2619 02:08:25,120 --> 02:08:27,120 that some of you might have used in website form 2620 02:08:27,120 --> 02:08:30,330 in order to look up movies and ratings thereof and the like. 2621 02:08:30,330 --> 02:08:32,100 And what we've done in advance is we wrote 2622 02:08:32,100 --> 02:08:38,490 a script that downloaded all of that information in advance as TSV files. 2623 02:08:38,490 --> 02:08:42,600 It turns out that they, Internet Movie Database, make all of their data 2624 02:08:42,600 --> 02:08:46,650 available as TSV files, Tab-Separated Values. 2625 02:08:46,650 --> 02:08:54,010 And we went ahead and imported it with a script called shows.db as follows.
2626 02:08:54,010 --> 02:08:55,800 So I'm going to go ahead in just a moment 2627 02:08:55,800 --> 02:08:59,520 and open up shows.db, which is not the version I created earlier 2628 02:08:59,520 --> 02:09:00,990 based on your favorites. 2629 02:09:00,990 --> 02:09:02,820 This is now the version that we, the staff, 2630 02:09:02,820 --> 02:09:06,630 created in advance by downloading hundreds of thousands 2631 02:09:06,630 --> 02:09:10,950 of movies and TV shows and actors and directors from IMDb.com 2632 02:09:10,950 --> 02:09:15,660 under their license and then imported into a SQLite database. 2633 02:09:15,660 --> 02:09:17,130 So how can I see what's in here? 2634 02:09:17,130 --> 02:09:19,530 Well, let me go ahead and type .schema, recall. 2635 02:09:19,530 --> 02:09:22,545 And you'll see a whole bunch of data therein. 2636 02:09:22,545 --> 02:09:25,690 And in fact, in pictorial form, it actually looks like this. 2637 02:09:25,690 --> 02:09:28,140 Here is a picture that just gives you the lay of the land. 2638 02:09:28,140 --> 02:09:30,330 There's going to be a people table that has 2639 02:09:30,330 --> 02:09:33,903 an ID for every person, a name, and their birth year. 2640 02:09:33,903 --> 02:09:36,570 There's going to be a shows table, just like we've been talking, 2641 02:09:36,570 --> 02:09:41,100 which is IDs, titles of shows-- also, though, the year that the show debuted 2642 02:09:41,100 --> 02:09:43,380 and the number of episodes that the show had. 2643 02:09:43,380 --> 02:09:46,410 Then there's going to be genres, similar in design to before. 2644 02:09:46,410 --> 02:09:49,620 So we didn't go all out and factor it out into a third table. 2645 02:09:49,620 --> 02:09:52,830 We just have some duplication here, admittedly, in genres. 2646 02:09:52,830 --> 02:09:54,240 But then there's a ratings table. 2647 02:09:54,240 --> 02:09:57,240 And here's where you can see where relational databases get interesting. 
2648 02:09:57,240 --> 02:10:01,000 You can have a ratings table storing ratings, like 1 to 5, 2649 02:10:01,000 --> 02:10:05,080 but also associate those ratings with a show by way of its show_id. 2650 02:10:05,080 --> 02:10:08,440 And then you can keep track of the number of votes that that show got. 2651 02:10:08,440 --> 02:10:10,910 Writers, notice, is a separate table. 2652 02:10:10,910 --> 02:10:12,560 And notice this is kind of cool. 2653 02:10:12,560 --> 02:10:19,060 This table, per the arrows, relates to the shows table and the people table, 2654 02:10:19,060 --> 02:10:20,770 because this is a join table. 2655 02:10:20,770 --> 02:10:24,040 A foreign key of show_id and a foreign key of person_id 2656 02:10:24,040 --> 02:10:28,250 refer to the shows table and the people table respectively 2657 02:10:28,250 --> 02:10:32,710 so that a human person can be a writer for multiple shows 2658 02:10:32,710 --> 02:10:36,560 and one show can have multiple writers, another many-to-many relationship. 2659 02:10:36,560 --> 02:10:39,310 And then lastly, stars, the actors in a show. 2660 02:10:39,310 --> 02:10:41,050 Notice that this, too, is a join table. 2661 02:10:41,050 --> 02:10:43,540 It's only got two foreign keys, a show_id 2662 02:10:43,540 --> 02:10:47,447 and a person_id that are referring back to those tables respectively. 2663 02:10:47,447 --> 02:10:50,030 And here's where it really makes sense of relational database. 2664 02:10:50,030 --> 02:10:52,930 It would be pretty stupid and bad design if you 2665 02:10:52,930 --> 02:10:57,520 had names of all of the directors and names of all of the writers 2666 02:10:57,520 --> 02:11:01,840 and names of all of the stars of these shows in separate tables in duplicate, 2667 02:11:01,840 --> 02:11:04,330 like Steve Carell, Steve Carell, Steve Carell.
2668 02:11:04,330 --> 02:11:06,670 All of those actors and directors and writers 2669 02:11:06,670 --> 02:11:11,450 and every other role in the business are just people at the end of the day. 2670 02:11:11,450 --> 02:11:13,630 So in a relational database, the advice would 2671 02:11:13,630 --> 02:11:16,570 be to put all of those people in a people table 2672 02:11:16,570 --> 02:11:21,040 and then use primary and foreign keys to refer to, to relate them to, 2673 02:11:21,040 --> 02:11:24,010 these other types of tables. 2674 02:11:24,010 --> 02:11:26,720 The catch is, though, that when we do this, 2675 02:11:26,720 --> 02:11:31,280 it turns out that things can be slow when we have lots of data. 2676 02:11:31,280 --> 02:11:33,250 So for instance, let me go into this. 2677 02:11:33,250 --> 02:11:37,210 Let me go ahead and SELECT * FROM shows;. 2678 02:11:37,210 --> 02:11:38,343 That's a lot of data. 2679 02:11:38,343 --> 02:11:41,260 It's pretty fast on my Mac, and I switched from the IDE to my Mac just 2680 02:11:41,260 --> 02:11:43,270 to save time, because it's a little faster doing things 2681 02:11:43,270 --> 02:11:44,800 locally instead of in the cloud. 2682 02:11:44,800 --> 02:11:48,460 Let me go ahead and count the number of shows in this IMDb database 2683 02:11:48,460 --> 02:11:49,720 by using COUNT. 2684 02:11:49,720 --> 02:11:53,390 153,331 TV shows. 2685 02:11:53,390 --> 02:11:54,250 So that's a lot. 2686 02:11:54,250 --> 02:11:59,110 How about the count of people from the people table? 2687 02:11:59,110 --> 02:12:06,290 457,886 people who might be stars or writers or some other role, as well. 2688 02:12:06,290 --> 02:12:07,765 So this is a sizable data set. 2689 02:12:07,765 --> 02:12:09,890 So let me go ahead and do something simple, though. 2690 02:12:09,890 --> 02:12:14,560 Let me go ahead and SELECT * FROM shows WHERE title = "The Office." 2691 02:12:14,560 --> 02:12:17,900 And this time, I don't have to worry about weird capitalization or spacing. 
2692 02:12:17,900 --> 02:12:18,850 This is IMDb. 2693 02:12:18,850 --> 02:12:21,727 This is clean data from an authoritative source. 2694 02:12:21,727 --> 02:12:24,310 Notice that there's actually different versions of The Office. 2695 02:12:24,310 --> 02:12:26,590 You probably know the UK one and the US one. 2696 02:12:26,590 --> 02:12:30,520 There's other shows that are unrelated to that particular type of show. 2697 02:12:30,520 --> 02:12:34,540 But each of them is distinguished, notice, by the year here. 2698 02:12:34,540 --> 02:12:37,280 All right, so that's kind of a lot. 2699 02:12:37,280 --> 02:12:38,680 And let's do this again. 2700 02:12:38,680 --> 02:12:40,930 Let me go ahead and turn on a feature temporarily just 2701 02:12:40,930 --> 02:12:44,000 to time this query by turning on a timer in this program. 2702 02:12:44,000 --> 02:12:45,370 And let me run it again. 2703 02:12:45,370 --> 02:12:51,970 It looks like it took 0.012 seconds of real time to do that search. 2704 02:12:51,970 --> 02:12:52,780 That's pretty fast. 2705 02:12:52,780 --> 02:12:55,180 I barely noticed, certainly because it's so fast. 2706 02:12:55,180 --> 02:12:56,710 But let me go ahead and do this. 2707 02:12:56,710 --> 02:13:01,510 Let me go ahead and create an index called title_index on the table 2708 02:13:01,510 --> 02:13:04,360 called shows on its title column. 2709 02:13:04,360 --> 02:13:05,470 Well, what am I doing? 2710 02:13:05,470 --> 02:13:08,680 Well, to answer the question finally from before about performance, 2711 02:13:08,680 --> 02:13:11,340 by default, everything we've been doing is indeed big O of n. 2712 02:13:11,340 --> 02:13:13,090 It's just being linearly searched from top 2713 02:13:13,090 --> 02:13:16,630 to bottom, which seems to call into question the whole purpose of SQL if we 2714 02:13:16,630 --> 02:13:18,850 were doing no better than with CSVs. 
2715 02:13:18,850 --> 02:13:22,480 But an index is a clue to the database to load 2716 02:13:22,480 --> 02:13:25,960 the data more efficiently in such a way that you get logarithmic time. 2717 02:13:25,960 --> 02:13:30,520 An index is a fancy data structure that the SQLite database or the Oracle 2718 02:13:30,520 --> 02:13:33,520 database or the MySQL database, whatever product you're using, 2719 02:13:33,520 --> 02:13:35,680 builds up for you in memory. 2720 02:13:35,680 --> 02:13:38,560 And then it does something using syntax like this 2721 02:13:38,560 --> 02:13:42,340 that builds in memory generally something known as a B-tree. 2722 02:13:42,340 --> 02:13:44,178 We've talked a bit about trees in the class. 2723 02:13:44,178 --> 02:13:46,720 We talked about binary search trees, things that kind of look 2724 02:13:46,720 --> 02:13:47,920 like family trees. 2725 02:13:47,920 --> 02:13:50,620 A B-tree is essentially a family tree that's 2726 02:13:50,620 --> 02:13:53,020 just very wide and not that tall. 2727 02:13:53,020 --> 02:13:56,500 It's a data structure similar in spirit to what we looked at in C. 2728 02:13:56,500 --> 02:13:59,830 But it tries to keep all of the leaf nodes, all of the children 2729 02:13:59,830 --> 02:14:02,230 or grandchildren or great-grandchildren, so to speak, 2730 02:14:02,230 --> 02:14:04,390 as close to the root as possible. 2731 02:14:04,390 --> 02:14:08,320 And the algorithm it uses for that tends to be proprietary or documented 2732 02:14:08,320 --> 02:14:09,970 based on the system you're using. 2733 02:14:09,970 --> 02:14:12,100 But it doesn't store things in a list. 2734 02:14:12,100 --> 02:14:17,620 It does not store things top to bottom, like the tables we view them as. 2735 02:14:17,620 --> 02:14:21,640 Underneath the hood, those tables that look like very tall structures 2736 02:14:21,640 --> 02:14:23,770 are actually, underneath the hood, implemented 2737 02:14:23,770 --> 02:14:25,820 with fancier things called trees. 
2738 02:14:25,820 --> 02:14:29,710 And if we create those trees by creating what are properly called indexes 2739 02:14:29,710 --> 02:14:34,660 like this, it might take us a moment, like 0.098 seconds, to create an index. 2740 02:14:34,660 --> 02:14:36,220 But now notice what happens. 2741 02:14:36,220 --> 02:14:40,210 Previously, when I searched the titles for The Office, using linear search, 2742 02:14:40,210 --> 02:14:43,180 it took 0.012 seconds. 2743 02:14:43,180 --> 02:14:46,750 If I do the same query again after having created the index 2744 02:14:46,750 --> 02:14:50,920 and having told SQLite, build me this fancy tree in memory, voila. 2745 02:14:50,920 --> 02:14:55,450 0.001 seconds, so an order of magnitude faster. 2746 02:14:55,450 --> 02:14:57,550 Now, both are fast to us humans, certainly. 2747 02:14:57,550 --> 02:15:01,040 But imagine the data set being even bigger, the query being even bigger. 2748 02:15:01,040 --> 02:15:05,900 The time an index saves can get even larger than that. 2749 02:15:05,900 --> 02:15:07,970 Without one, the queries can take longer than that, 2750 02:15:07,970 --> 02:15:11,130 and the gap only widens as the data grows. 2751 02:15:11,130 --> 02:15:13,940 But unfortunately, if I've got all of my data all over the place, 2752 02:15:13,940 --> 02:15:16,970 as in a diagram like this, my god. 2753 02:15:16,970 --> 02:15:18,770 How do I actually get useful work done? 2754 02:15:18,770 --> 02:15:21,590 How do I get back the people in a movie and the writers 2755 02:15:21,590 --> 02:15:24,260 and the stars and the ratings if it's all over the place? 2756 02:15:24,260 --> 02:15:26,840 I would seem to have created such a mess that I now 2757 02:15:26,840 --> 02:15:28,910 need to execute all of these queries. 2758 02:15:28,910 --> 02:15:32,000 But notice it doesn't have to be that complicated.
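One way to see the effect of the index without relying on a stopwatch is to ask SQLite how it plans to run the query. The sketch below uses Python's sqlite3 module and a small synthetic shows table (the 10,000 filler titles are made up; only title_index and the query come from the lecture). EXPLAIN QUERY PLAN is a SQLite feature whose exact wording varies by version, so the code only looks for the index name in it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")

# Synthetic filler rows plus the one title we will search for
db.executemany(
    "INSERT INTO shows (title, year) VALUES (?, ?)",
    [(f"Show {i}", 1950 + i % 70) for i in range(10000)] + [("The Office", 2005)],
)

query = "SELECT * FROM shows WHERE title = ?"


def plan(sql, args):
    """Return SQLite's description of how it would execute the query."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    # The human-readable detail is the fourth column of each plan row
    return " | ".join(row[3] for row in rows)


before = plan(query, ("The Office",))  # expected to be a full scan of the table
db.execute("CREATE INDEX title_index ON shows (title)")
after = plan(query, ("The Office",))   # expected to be a search using title_index

print(before)
print(after)
```

Before the index exists, the plan should describe a scan of the whole table; afterward it should mention title_index, which is the tree-based lookup described above.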
2759 02:15:32,000 --> 02:15:35,660 It turns out that there's another keyword in SQL, really the last 2760 02:15:35,660 --> 02:15:38,150 that we'll look at here, called JOIN. 2761 02:15:38,150 --> 02:15:41,480 The JOIN keyword, which you can use implicitly or explicitly, 2762 02:15:41,480 --> 02:15:45,470 allows you to just join tables together and sort of reconstitute 2763 02:15:45,470 --> 02:15:47,760 a bigger, more user friendly table. 2764 02:15:47,760 --> 02:15:51,020 So for instance, suppose I want to get all of Steve Carell's TV shows, 2765 02:15:51,020 --> 02:15:52,250 not just The Office. 2766 02:15:52,250 --> 02:15:55,880 Well, recall that I can select Steve's ID from the people 2767 02:15:55,880 --> 02:15:59,390 table WHERE name = "Steve Carell." 2768 02:15:59,390 --> 02:16:02,780 So again, he has a different ID in this table, because this is from IMDb. 2769 02:16:02,780 --> 02:16:04,400 But there's his ID. 2770 02:16:04,400 --> 02:16:07,260 And let me go ahead and turn the timer off for now. 2771 02:16:07,260 --> 02:16:07,760 All right. 2772 02:16:07,760 --> 02:16:11,510 So there is his ID, 126797. 2773 02:16:11,510 --> 02:16:14,780 I could copy paste that into my code, but that's not necessary 2774 02:16:14,780 --> 02:16:16,490 thanks to these nested queries. 2775 02:16:16,490 --> 02:16:18,660 I can do something like this. 2776 02:16:18,660 --> 02:16:23,720 Let me go ahead and now select all of the show_ids from the stars table 2777 02:16:23,720 --> 02:16:29,790 where person_id from that table is equal to this result. 2778 02:16:29,790 --> 02:16:33,240 So there's that join table, stars, that links people and shows. 2779 02:16:33,240 --> 02:16:35,370 So let me go ahead and execute that. 2780 02:16:35,370 --> 02:16:35,870 All right. 2781 02:16:35,870 --> 02:16:39,559 So there's all of the show_ids of Steve Carell's TV shows. 2782 02:16:39,559 --> 02:16:40,379 That's a lot. 2783 02:16:40,379 --> 02:16:42,139 And it's very nonobvious what they are. 
2784 02:16:42,139 --> 02:16:45,680 So let me do another nested query by putting all of that in parentheses 2785 02:16:45,680 --> 02:16:51,530 and now SELECT title FROM shows WHERE the ID of the show 2786 02:16:51,530 --> 02:16:55,820 is in this big, long list of show_ids. 2787 02:16:55,820 --> 02:17:00,260 And there are all of the shows that he's in, including The Dana Carvey Show 2788 02:17:00,260 --> 02:17:04,430 back when, The Office up at the top, and then, most recently, 2789 02:17:04,430 --> 02:17:07,142 shows like The Morning Show on Apple TV. 2790 02:17:07,142 --> 02:17:09,350 All right, so that's pretty cool that we can actually 2791 02:17:09,350 --> 02:17:11,129 reconstitute the data like that. 2792 02:17:11,129 --> 02:17:13,889 But it turns out there's different ways of doing that, as well. 2793 02:17:13,889 --> 02:17:15,950 And you'll see more of this in the coming weeks 2794 02:17:15,950 --> 02:17:18,150 and in the problem sets and labs and the like. 2795 02:17:18,150 --> 02:17:19,879 But it turns out we can do other things, as well. 2796 02:17:19,879 --> 02:17:21,962 And let me just show this syntax even though it'll 2797 02:17:21,962 --> 02:17:23,670 look a little cryptic at first glance. 2798 02:17:23,670 --> 02:17:26,299 You can also use that JOIN keyword as follows. 2799 02:17:26,299 --> 02:17:33,350 I can select the title from the people table joined with the stars table 2800 02:17:33,350 --> 02:17:39,959 on the people.id column equaling the stars.person_id column. 2801 02:17:39,959 --> 02:17:42,799 So in other words, I can select a title from the result 2802 02:17:42,799 --> 02:17:46,940 of joining people and stars, like this, on the id column in one 2803 02:17:46,940 --> 02:17:49,129 and the person_id column in the other. 2804 02:17:49,129 --> 02:17:58,879 And I can join in the shows table on the stars.show_id equaling the shows.id. 
2805 02:17:58,879 --> 02:18:03,799 So again, now I'm joining the primary and foreign keys on these two tables 2806 02:18:03,799 --> 02:18:07,700 where the name equals "Steve Carell." 2807 02:18:07,700 --> 02:18:10,070 So this is the most cryptic thing we've seen yet. 2808 02:18:10,070 --> 02:18:12,530 But it just means take this table and join it with this one 2809 02:18:12,530 --> 02:18:16,580 and then join it with this one and filter all of the resulting joined rows 2810 02:18:16,580 --> 02:18:18,530 by a name of Steve Carell. 2811 02:18:18,530 --> 02:18:19,520 And voila. 2812 02:18:19,520 --> 02:18:22,469 There we have all of those answers, as well. 2813 02:18:22,469 --> 02:18:25,129 And there's other ways of doing this, too. 2814 02:18:25,129 --> 02:18:27,809 I'll leave unsaid now some of the syntax for that. 2815 02:18:27,809 --> 02:18:29,480 But that felt a little slow. 2816 02:18:29,480 --> 02:18:32,090 And in fact, let me go ahead and turn my timer back on. 2817 02:18:32,090 --> 02:18:34,610 Let me re-execute this last query. 2818 02:18:34,610 --> 02:18:40,879 SELECT title FROM people joining on stars, joining on shows 2819 02:18:40,879 --> 02:18:42,650 WHERE name = "Steve Carell." 2820 02:18:42,650 --> 02:18:44,700 That took over half a second. 2821 02:18:44,700 --> 02:18:47,480 So that was actually admittedly kind of slow. 2822 02:18:47,480 --> 02:18:50,209 But again, indexes come to the rescue if, again, we 2823 02:18:50,209 --> 02:18:52,610 don't allow linear search to dominate. 2824 02:18:52,610 --> 02:18:54,889 So let me go ahead and create a few indexes. 2825 02:18:54,889 --> 02:19:01,940 Create an index called person_index on the stars table, the person_id column. 2826 02:19:01,940 --> 02:19:02,570 Why? 2827 02:19:02,570 --> 02:19:05,600 Well, my query a moment ago used the person_id column. 2828 02:19:05,600 --> 02:19:06,510 It filtered on it. 2829 02:19:06,510 --> 02:19:08,000 So that might be a bottleneck.
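The nested-query version and the JOIN version can be compared side by side. This is a minimal sketch with Python's sqlite3 module; the three shows, two people, and their ids are tiny made-up stand-ins for the IMDb data, and the ORDER BY is added only so both results come back in the same order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    -- Hypothetical rows for illustration only
    INSERT INTO shows VALUES (1, 'The Office'), (2, 'The Morning Show'), (3, 'Some Other Show');
    INSERT INTO people VALUES (10, 'Steve Carell'), (11, 'Someone Else');
    INSERT INTO stars VALUES (1, 10), (2, 10), (3, 11);
""")

# Nested queries: person id -> show ids -> titles
nested_sql = """
    SELECT title FROM shows
    WHERE id IN (
        SELECT show_id FROM stars
        WHERE person_id = (SELECT id FROM people WHERE name = ?)
    )
    ORDER BY title
"""

# Explicit JOINs: line up primary keys with foreign keys, then filter by name
joined_sql = """
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = ?
    ORDER BY title
"""

nested_titles = [row[0] for row in db.execute(nested_sql, ("Steve Carell",))]
joined_titles = [row[0] for row in db.execute(joined_sql, ("Steve Carell",))]
print(nested_titles)
print(joined_titles)
```

Both queries should return the same titles on this data; which form reads better is largely a matter of taste.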
2830 02:19:08,000 --> 02:19:12,290 I'm going to go ahead and create another index called show_index 2831 02:19:12,290 --> 02:19:14,870 on the stars table on show_id. 2832 02:19:14,870 --> 02:19:18,290 Similarly, a moment ago, my query used the show_id column. 2833 02:19:18,290 --> 02:19:21,743 And so that, too, might have been a bottleneck linearly, top to bottom. 2834 02:19:21,743 --> 02:19:22,910 So let me create that index. 2835 02:19:22,910 --> 02:19:25,368 And then lastly, let me create an index called name_index-- 2836 02:19:25,368 --> 02:19:28,459 and this is perhaps the most obvious, similar to the show titles before-- 2837 02:19:28,459 --> 02:19:31,549 on the people table on the name column. 2838 02:19:31,549 --> 02:19:32,930 And that, too, took a moment. 2839 02:19:32,930 --> 02:19:35,330 Now, in total, this took almost a full second. 2840 02:19:35,330 --> 02:19:37,850 But these indexes only get created once. 2841 02:19:37,850 --> 02:19:40,070 They get maintained automatically over time. 2842 02:19:40,070 --> 02:19:42,080 But you don't incur this with every query. 2843 02:19:42,080 --> 02:19:44,389 Now let me do my SELECT again. 2844 02:19:44,389 --> 02:19:48,800 Let me SELECT title FROM people joining the stars table, 2845 02:19:48,800 --> 02:19:52,730 joining the shows table WHERE name = "Steve Carell." 2846 02:19:52,730 --> 02:19:53,690 Boom. 2847 02:19:53,690 --> 02:19:56,630 0.001 seconds. 2848 02:19:56,630 --> 02:20:00,930 That was orders of magnitude faster than the more than half a second 2849 02:20:00,930 --> 02:20:02,620 it took us a little bit ago. 2850 02:20:02,620 --> 02:20:05,860 So here, too, you see the power of a relational database. 2851 02:20:05,860 --> 02:20:08,912 So even though we've created some problems for ourselves over time, 2852 02:20:08,912 --> 02:20:12,120 we've solved them ultimately-- granted, with some more sophisticated features 2853 02:20:12,120 --> 02:20:13,320 and additional syntax.
2854 02:20:13,320 --> 02:20:15,990 But this is indeed why you use relational databases 2855 02:20:15,990 --> 02:20:19,470 in the real world for the Twitters, the Instagrams, the Facebooks, the Googles, 2856 02:20:19,470 --> 02:20:22,590 because they can store data so efficiently 2857 02:20:22,590 --> 02:20:25,960 without redundancy, because you can normalize them and factor everything 2858 02:20:25,960 --> 02:20:26,460 out. 2859 02:20:26,460 --> 02:20:28,740 But they can still maintain the relations 2860 02:20:28,740 --> 02:20:30,570 that you might have seen in a spreadsheet 2861 02:20:30,570 --> 02:20:32,940 but using something closer to logarithmic time thanks 2862 02:20:32,940 --> 02:20:34,770 to those tree structures. 2863 02:20:34,770 --> 02:20:35,910 But there are problems. 2864 02:20:35,910 --> 02:20:38,880 And what we wanted to do is end today on two primary problems 2865 02:20:38,880 --> 02:20:42,570 that are introduced with SQL, because they are just unfortunately 2866 02:20:42,570 --> 02:20:43,920 so common. 2867 02:20:43,920 --> 02:20:45,462 Notice this here. 2868 02:20:45,462 --> 02:20:47,670 There is something generally known as a SQL injection 2869 02:20:47,670 --> 02:20:51,330 attack, which you are vulnerable to in any application 2870 02:20:51,330 --> 02:20:52,830 where you're taking user input. 2871 02:20:52,830 --> 02:20:55,800 That hasn't been an issue for my favorites.py file, 2872 02:20:55,800 --> 02:20:58,260 where I only took input from a CSV. 2873 02:20:58,260 --> 02:21:00,510 But if one of you were malicious, what if one of you 2874 02:21:00,510 --> 02:21:03,750 had maliciously typed in the word "delete" or "update" 2875 02:21:03,750 --> 02:21:06,180 or something else as the title of your show 2876 02:21:06,180 --> 02:21:11,040 and I accidentally plugged it into my own Python code when executing a query? 2877 02:21:11,040 --> 02:21:14,940 You could potentially inject SQL into my own code.
2878 02:21:14,940 --> 02:21:15,750 How might that be? 2879 02:21:15,750 --> 02:21:18,960 Well, if logging in via Yale, you'll typically see a form like this. 2880 02:21:18,960 --> 02:21:21,850 Or logging in via Harvard to something, you'll see a form like this. 2881 02:21:21,850 --> 02:21:23,767 Here's an example that I'm pretty sure neither 2882 02:21:23,767 --> 02:21:25,710 Harvard nor Yale is vulnerable to. 2883 02:21:25,710 --> 02:21:28,590 Suppose I type in my email address to this login form 2884 02:21:28,590 --> 02:21:32,350 as malan@harvard.edu'--. 2885 02:21:32,350 --> 02:21:34,890 It turns out, in SQL, -- 2886 02:21:34,890 --> 02:21:38,250 is the symbol for commenting if you want to comment something out. 2887 02:21:38,250 --> 02:21:40,428 It turns out that the single quote is used 2888 02:21:40,428 --> 02:21:43,470 when you want to search for something like Steve Carell or, in this case, 2889 02:21:43,470 --> 02:21:44,930 malan@harvard.edu. 2890 02:21:44,930 --> 02:21:45,930 It can be double quotes. 2891 02:21:45,930 --> 02:21:47,040 It can be single quotes. 2892 02:21:47,040 --> 02:21:50,040 In this case, I'm using single quotes here. 2893 02:21:50,040 --> 02:21:53,400 But let's consider some sample code, if you will, in Python. 2894 02:21:53,400 --> 02:21:56,910 Here's a line of code that I propose might exist in the backend 2895 02:21:56,910 --> 02:22:00,180 for Harvard's authentication or Yale's or anyone else's. 2896 02:22:00,180 --> 02:22:04,890 Maybe someone wrote some Python code like this using SELECT * FROM users 2897 02:22:04,890 --> 02:22:06,870 WHERE username = ? 2898 02:22:06,870 --> 02:22:10,770 AND password = ?, and they plugged in username and password. 2899 02:22:10,770 --> 02:22:13,770 Whatever the user typed into that web form a moment ago gets 2900 02:22:13,770 --> 02:22:16,270 plugged in here to these question marks. 2901 02:22:16,270 --> 02:22:17,290 This is good.
2902 02:22:17,290 --> 02:22:20,980 This is good code, because you're using the SQL question marks. 2903 02:22:20,980 --> 02:22:24,315 So if you literally just do what we preach today and use these question 2904 02:22:24,315 --> 02:22:27,870 mark placeholders, you are safe from SQL injection attacks. 2905 02:22:27,870 --> 02:22:29,760 Unfortunately, there are too many developers 2906 02:22:29,760 --> 02:22:34,950 in the world who don't practice this or don't realize this or do forget this. 2907 02:22:34,950 --> 02:22:38,850 If you instead resort to Python approaches like this, 2908 02:22:38,850 --> 02:22:42,910 where you use an f-string instead, which might be your instincts after last 2909 02:22:42,910 --> 02:22:45,660 week, because they're wonderfully convenient with the curly braces 2910 02:22:45,660 --> 02:22:46,290 and all-- 2911 02:22:46,290 --> 02:22:50,370 suppose that you literally plug in username and password 2912 02:22:50,370 --> 02:22:53,430 not with the question mark placeholders but just literally 2913 02:22:53,430 --> 02:22:55,260 in between those curly braces. 2914 02:22:55,260 --> 02:22:58,210 Watch what happens if my username, malan@harvard.edu, 2915 02:22:58,210 --> 02:23:03,120 was actually typed in by me maliciously as malan@harvard.edu'--. 2916 02:23:03,120 --> 02:23:05,691 2917 02:23:05,691 --> 02:23:09,030 That would have the effect of tricking this Python 2918 02:23:09,030 --> 02:23:11,610 code into doing essentially this. 2919 02:23:11,610 --> 02:23:13,590 Let me do a find and replace. 2920 02:23:13,590 --> 02:23:22,881 It would trick Python into executing username = 'malan@harvard.edu'--' 2921 02:23:22,881 --> 02:23:24,660 and then other stuff. 2922 02:23:24,660 --> 02:23:27,480 Unfortunately, the -- again means comment, 2923 02:23:27,480 --> 02:23:33,390 which means you could maybe trick a server into ignoring the whole password 2924 02:23:33,390 --> 02:23:35,190 part of this SQL query.
2925 02:23:35,190 --> 02:23:37,530 And if the SQL query's purpose in life is to check, 2926 02:23:37,530 --> 02:23:42,210 is this username and password valid, so that you can decide to log the user in 2927 02:23:42,210 --> 02:23:44,880 or to say, no, you're not authorized, well, 2928 02:23:44,880 --> 02:23:48,390 by essentially commenting out everything related to password, 2929 02:23:48,390 --> 02:23:49,710 notice what I've done. 2930 02:23:49,710 --> 02:23:55,620 I've just now theoretically logged myself in as malan@harvard.edu without 2931 02:23:55,620 --> 02:24:00,030 even knowing or inputting a password, because I injected SQL syntax, 2932 02:24:00,030 --> 02:24:04,620 the quote and the --, into my query, tricking the server into just ignoring 2933 02:24:04,620 --> 02:24:06,870 the password equality check. 2934 02:24:06,870 --> 02:24:11,250 And so it turns out that db.execute, when you execute an INSERT, 2935 02:24:11,250 --> 02:24:15,240 it returns to you as said the ID of the newly inserted row. 2936 02:24:15,240 --> 02:24:20,370 When you use db.execute to select rows from a database table, 2937 02:24:20,370 --> 02:24:25,360 it returns to you a list of rows, each of which is a dictionary. 2938 02:24:25,360 --> 02:24:28,110 So this is now pseudocode down here with my comment. 2939 02:24:28,110 --> 02:24:31,140 But if you get back one row, that would seem 2940 02:24:31,140 --> 02:24:34,470 to imply that there is a user named malan@harvard.edu. 2941 02:24:34,470 --> 02:24:37,830 Don't know what his password is, because whoever this person is maliciously 2942 02:24:37,830 --> 02:24:40,860 tricked the server into ignoring that syntax. 2943 02:24:40,860 --> 02:24:43,890 So SQL injection attacks are unfortunately 2944 02:24:43,890 --> 02:24:46,570 one of the most common attacks against SQL databases. 
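The attack is easy to reproduce on a toy database. The sketch below uses Python's built-in sqlite3 module rather than CS50's library, and the users table and stored password are invented for the demonstration (a real system would store a password hash, not plaintext). It runs the same login check twice: once by pasting the input into an f-string, once with ? placeholders.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
# Plaintext password only to keep the illustration short
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'not-the-real-password')")

# What the attacker types into the form: a quote, then a SQL comment marker
username = "malan@harvard.edu'--"
password = ""

# Vulnerable: user input becomes part of the SQL text itself, so everything
# after -- (including the password check) is treated as a comment
unsafe_sql = (
    f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
)
unsafe_rows = db.execute(unsafe_sql).fetchall()

# Safe: placeholders pass the input as data, so the quote and -- are just
# characters in a username that matches no row
safe_rows = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()

print(unsafe_sql)
print(len(unsafe_rows), len(safe_rows))
```

The f-string version gets back one row without any password having been supplied; the placeholder version gets back none.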
2945 02:24:46,570 --> 02:24:51,090 They are completely preventable if you simply use placeholders and use 2946 02:24:51,090 --> 02:24:53,940 libraries, whether it's CS50's or other third-party libraries 2947 02:24:53,940 --> 02:24:55,440 that you may use down the road. 2948 02:24:55,440 --> 02:24:58,530 A common meme on the internet is this picture here. 2949 02:24:58,530 --> 02:25:00,810 If we zoom in on this person's license plate 2950 02:25:00,810 --> 02:25:02,970 or where the license plate should be, this 2951 02:25:02,970 --> 02:25:05,940 is an example of someone theoretically trying 2952 02:25:05,940 --> 02:25:10,350 to trick some camera on the highway into dropping the whole database. 2953 02:25:10,350 --> 02:25:13,710 DROP is another keyword in SQL that deletes a database table. 2954 02:25:13,710 --> 02:25:15,810 And this person was either seriously or just 2955 02:25:15,810 --> 02:25:19,980 humorously trying to trick it into executing SQL 2956 02:25:19,980 --> 02:25:21,760 by using syntax like this. 2957 02:25:21,760 --> 02:25:26,070 So characters like single quotes, --, semicolons are all potentially 2958 02:25:26,070 --> 02:25:29,190 dangerous characters in SQL if they're passed through unchanged 2959 02:25:29,190 --> 02:25:30,120 to the database. 2960 02:25:30,120 --> 02:25:34,140 A very popular xkcd comic-- let me give you a moment to just read this-- 2961 02:25:34,140 --> 02:25:40,080 is another well-known meme of sorts now in computer science. 2962 02:25:40,080 --> 02:25:43,880 If you'd like to, read this one on your own. 2963 02:25:43,880 --> 02:25:51,170 But henceforth, you are now in the family of educated learners who 2964 02:25:51,170 --> 02:25:54,410 know who Little Bobby Tables is. 2965 02:25:54,410 --> 02:25:56,240 Unfortunately, it's dead silence in here, 2966 02:25:56,240 --> 02:25:58,340 so I can't tell if anyone is actually laughing at this joke. 2967 02:25:58,340 --> 02:26:00,110 But anyhow, this is a very well-known meme.
2968 02:26:00,110 --> 02:26:02,690 So if you're a computer scientist who knows SQL, you know this one. 2969 02:26:02,690 --> 02:26:05,565 And there's one last problem we'd like to introduce if you don't mind 2970 02:26:05,565 --> 02:26:07,250 just a couple of final moments here. 2971 02:26:07,250 --> 02:26:09,530 And that is a fundamental problem in computing 2972 02:26:09,530 --> 02:26:11,690 called race conditions, which for the first time 2973 02:26:11,690 --> 02:26:14,300 is now manifest in our discussion of SQL. 2974 02:26:14,300 --> 02:26:18,230 It turns out that SQL and SQL databases are very often used, again, 2975 02:26:18,230 --> 02:26:21,380 in the real world for very high-performing applications. 2976 02:26:21,380 --> 02:26:24,320 And by that, I mean, again, the Googles, the Facebooks, the Twitters 2977 02:26:24,320 --> 02:26:28,490 of the world where lots and lots of data is coming into servers all at once. 2978 02:26:28,490 --> 02:26:30,200 And case in point, some of you might have 2979 02:26:30,200 --> 02:26:33,320 clicked Like on this egg some time ago. 2980 02:26:33,320 --> 02:26:35,690 This is the most-liked Instagram post ever. 2981 02:26:35,690 --> 02:26:39,710 As of last night, it was up to 50-plus million likes. 2982 02:26:39,710 --> 02:26:42,620 Well eclipsed Kim Kardashian's previous post, 2983 02:26:42,620 --> 02:26:44,690 which is still at 18 million or so. 2984 02:26:44,690 --> 02:26:47,780 This is to say this is a hard problem to solve, 2985 02:26:47,780 --> 02:26:51,800 this notion of likes coming in at such an incredible rate. 2986 02:26:51,800 --> 02:26:55,310 Because suppose that, long story short, Instagram actually 2987 02:26:55,310 --> 02:26:57,290 has a server with a SQL database. 2988 02:26:57,290 --> 02:27:01,490 And they have code in Python or C++ or whatever language that's talking 2989 02:27:01,490 --> 02:27:02,660 to that database. 
2990 02:27:02,660 --> 02:27:04,910 And suppose that they have code that's trying 2991 02:27:04,910 --> 02:27:06,680 to increment the total number of likes. 2992 02:27:06,680 --> 02:27:08,240 Well, how might this work logically? 2993 02:27:08,240 --> 02:27:11,660 Well, in order to increment the number of likes that a picture like this egg 2994 02:27:11,660 --> 02:27:14,060 has, you might first select from the database 2995 02:27:14,060 --> 02:27:18,260 the current number of likes for the ID of that egg photograph. 2996 02:27:18,260 --> 02:27:19,790 Then you might add 1 to it. 2997 02:27:19,790 --> 02:27:21,797 Then you might update the database. 2998 02:27:21,797 --> 02:27:24,630 And I didn't use it before, but just like there's INSERT and DELETE, 2999 02:27:24,630 --> 02:27:26,010 there's UPDATE, as well. 3000 02:27:26,010 --> 02:27:29,600 So you might update the database with the new count plus 1. 3001 02:27:29,600 --> 02:27:31,970 So the code for that might look a little something 3002 02:27:31,970 --> 02:27:35,600 like this, three lines of code using CS50's library here, 3003 02:27:35,600 --> 02:27:40,010 where you execute SELECT likes FROM posts WHERE id = ?, 3004 02:27:40,010 --> 02:27:42,890 where id is the unique identifier for that egg. 3005 02:27:42,890 --> 02:27:45,740 And then I'm storing the result in a rows variable, 3006 02:27:45,740 --> 02:27:48,950 which, again, I claim is a list of rows. 3007 02:27:48,950 --> 02:27:52,130 I'm going to go into the first row, so that's rows bracket 0. 3008 02:27:52,130 --> 02:27:55,070 And I'm going to go into the likes column to get the actual number. 3009 02:27:55,070 --> 02:27:57,140 And that number, I'm going to store in a variable called likes. 3010 02:27:57,140 --> 02:27:58,880 So this is going to be, like, 50,000,000, 3011 02:27:58,880 --> 02:28:01,100 and I want it to go to 50,000,001. 3012 02:28:01,100 --> 02:28:02,370 So how do I do that?
3013 02:28:02,370 --> 02:28:08,780 Well, I execute on the database UPDATE posts SET likes = ?. 3014 02:28:08,780 --> 02:28:10,980 And then I just plug in likes + 1. 3015 02:28:10,980 --> 02:28:15,020 The problem, though, with the Instagrams and Googles and Twitters of the world 3016 02:28:15,020 --> 02:28:16,790 is that they don't just have one server. 3017 02:28:16,790 --> 02:28:18,710 They have many thousands of servers. 3018 02:28:18,710 --> 02:28:22,580 And all of those servers might in parallel be receiving clicks from you 3019 02:28:22,580 --> 02:28:23,960 and me on the internet. 3020 02:28:23,960 --> 02:28:28,310 And those clicks translate into this code getting executed, executed, 3021 02:28:28,310 --> 02:28:28,970 executed. 3022 02:28:28,970 --> 02:28:32,930 And the problem is that when you have three lines of code and suppose Brian 3023 02:28:32,930 --> 02:28:35,420 and I click on that egg at roughly the same time, 3024 02:28:35,420 --> 02:28:40,010 my three lines might not get executed before his three lines or vice versa. 3025 02:28:40,010 --> 02:28:42,650 They might get commingled chronologically. 3026 02:28:42,650 --> 02:28:46,130 My first line might get executed, then Brian's first line might get executed. 3027 02:28:46,130 --> 02:28:48,750 My second line might get executed, Brian's second line. 3028 02:28:48,750 --> 02:28:50,960 So they might get interspersed on different servers 3029 02:28:50,960 --> 02:28:53,900 or just temporally in time, chronologically. 3030 02:28:53,900 --> 02:28:56,690 That's problematic, because suppose Brian and I click 3031 02:28:56,690 --> 02:28:58,580 on that egg roughly at the same time. 3032 02:28:58,580 --> 02:29:01,010 And we get back the same answer to the SELECT query. 3033 02:29:01,010 --> 02:29:03,290 50 million is the current count. 3034 02:29:03,290 --> 02:29:06,620 Then our next lines of code execute on the servers we happen to be on, 3035 02:29:06,620 --> 02:29:09,260 which adds 1 to the likes.
3036 02:29:09,260 --> 02:29:14,780 The server might accidentally end up updating the row for the egg 3037 02:29:14,780 --> 02:29:20,960 with 50,000,001 both times, because the fundamental problem is 3038 02:29:20,960 --> 02:29:24,890 if my code executes while Brian's code executes, 3039 02:29:24,890 --> 02:29:29,480 we are both checking the value of a variable at essentially the same time. 3040 02:29:29,480 --> 02:29:32,090 And we are both then making a conclusion-- 3041 02:29:32,090 --> 02:29:35,190 oh, the current likes are 50 million. 3042 02:29:35,190 --> 02:29:36,470 We are then making a decision. 3043 02:29:36,470 --> 02:29:38,310 Let's add 1 to 50 million. 3044 02:29:38,310 --> 02:29:41,600 We are then updating the value with 50,000,001. 3045 02:29:41,600 --> 02:29:46,640 The problem is, though, that, really, if Brian's code or the server he happens 3046 02:29:46,640 --> 02:29:50,780 to be connected to on Instagram happens to have selected the number of likes 3047 02:29:50,780 --> 02:29:53,900 first, he should be allowed to finish the code that's 3048 02:29:53,900 --> 02:29:57,950 being executed so that when I select it, I see 50,000,001, 3049 02:29:57,950 --> 02:30:02,270 and I add 1 to that so the new count is 50,000,002. 3050 02:30:02,270 --> 02:30:04,070 This is what's known as a race condition. 3051 02:30:04,070 --> 02:30:06,980 When you write code in a multiserver-- 3052 02:30:06,980 --> 02:30:11,120 more fancily known as a multithreaded environment-- lines of code 3053 02:30:11,120 --> 02:30:16,160 chronologically can get commingled on different servers at any given time. 3054 02:30:16,160 --> 02:30:18,200 The problem fundamentally derives from the fact 3055 02:30:18,200 --> 02:30:22,430 that if Brian's server is in the middle of checking the state of a variable, 3056 02:30:22,430 --> 02:30:23,840 I should be locked out. 
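The lost like can be replayed step by step. This sketch uses Python's sqlite3 module and simulates the two clicks in a single program by running their steps in the unlucky order by hand; it is not real concurrency, and the posts table with id 1 is a made-up stand-in for the egg. It adds WHERE id = ? to the UPDATE so only that one post is changed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts (id, likes) VALUES (1, 50000000)")


def read_likes(post_id):
    """Step 1 of a like: read the current count."""
    return db.execute(
        "SELECT likes FROM posts WHERE id = ?", (post_id,)
    ).fetchone()[0]


def write_likes(post_id, likes):
    """Step 3 of a like: store the new count."""
    db.execute("UPDATE posts SET likes = ? WHERE id = ?", (likes, post_id))


# Interleaved order: both clicks read before either one writes
david_sees = read_likes(1)      # 50,000,000
brian_sees = read_likes(1)      # also 50,000,000
write_likes(1, david_sees + 1)  # stores 50,000,001
write_likes(1, brian_sees + 1)  # stores 50,000,001 again; one like is lost

final = read_likes(1)
print(final)
```

Two likes went in, but the count rose by only one, because each write was based on a count that was already stale.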
3057 02:30:23,840 --> 02:30:26,870 I should not be allowed to click on that button at the same time, 3058 02:30:26,870 --> 02:30:30,590 or my code should not be allowed to execute logically. 3059 02:30:30,590 --> 02:30:33,050 So there is a solution when you have to write code 3060 02:30:33,050 --> 02:30:36,500 like this, as is common for Twitter and Instagram and Facebook and the like, 3061 02:30:36,500 --> 02:30:38,420 to use what are called transactions. 3062 02:30:38,420 --> 02:30:41,815 Transactions add a few new pieces of syntax that we won't dwell on today 3063 02:30:41,815 --> 02:30:43,690 and you don't need to use in the coming days. 3064 02:30:43,690 --> 02:30:46,180 But they do solve a fundamentally hard problem. 3065 02:30:46,180 --> 02:30:50,500 Transactions essentially allow you to lock a table or, really, 3066 02:30:50,500 --> 02:30:54,885 a row in the table so that if Brian's click on that egg 3067 02:30:54,885 --> 02:30:57,760 results in some code executing that's in the process of checking what 3068 02:30:57,760 --> 02:31:02,770 is the total like count, my click on the egg will not get handled by the server 3069 02:31:02,770 --> 02:31:05,630 until his code is done executing. 3070 02:31:05,630 --> 02:31:08,470 So in green here, I've proposed the way you should do this. 3071 02:31:08,470 --> 02:31:12,968 You shouldn't just execute the middle three lines, "you" being Instagram, 3072 02:31:12,968 --> 02:31:13,510 in this case. 3073 02:31:13,510 --> 02:31:17,200 Instagram should execute BEGIN TRANSACTION first, then 3074 02:31:17,200 --> 02:31:19,300 COMMIT the transaction at the end. 3075 02:31:19,300 --> 02:31:22,780 And the design of transactions is that all of the lines in between 3076 02:31:22,780 --> 02:31:26,320 will either succeed altogether or fail altogether. 3077 02:31:26,320 --> 02:31:28,180 The database won't get into this funky state 3078 02:31:28,180 --> 02:31:32,320 where we start losing track of likes on eggs.
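Here is a minimal sketch of wrapping the three steps in BEGIN TRANSACTION and COMMIT, using Python's sqlite3 module instead of CS50's library. The like function and its fail_midway flag are hypothetical helpers invented for the demonstration, as is the posts table. With a single connection it can only show the all-or-nothing behavior, not the locking between competing servers; how a database locks rows or tables for concurrent writers differs by product (SQLite, for instance, offers BEGIN IMMEDIATE to take its write lock up front).

```python
import sqlite3

# isolation_level=None leaves transaction control to our explicit
# BEGIN / COMMIT / ROLLBACK statements
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts (id, likes) VALUES (1, 50000000)")


def like(post_id, fail_midway=False):
    """Read, add 1, and write back as one all-or-nothing unit."""
    db.execute("BEGIN TRANSACTION")
    try:
        rows = db.execute(
            "SELECT likes FROM posts WHERE id = ?", (post_id,)
        ).fetchall()
        likes = rows[0][0]
        db.execute(
            "UPDATE posts SET likes = ? WHERE id = ?", (likes + 1, post_id)
        )
        if fail_midway:
            # Stand-in for a crash after the UPDATE but before the COMMIT
            raise RuntimeError("simulated failure before COMMIT")
        db.execute("COMMIT")
    except Exception:
        # Undo everything done since BEGIN, then report the error
        db.execute("ROLLBACK")
        raise


like(1)  # succeeds: 50,000,001
like(1)  # succeeds: 50,000,002
try:
    like(1, fail_midway=True)  # its UPDATE is rolled back
except RuntimeError:
    pass

final = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final)
```

For a simple counter, a single statement such as UPDATE posts SET likes = likes + 1 WHERE id = ? would also avoid the separate read and write; the transaction form generalizes to cases where several statements must stand or fall together.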
3079 02:31:32,320 --> 02:31:34,660 And though this has not been an issue in recent years, 3080 02:31:34,660 --> 02:31:36,952 back in the day when Twitter was first getting started, 3081 02:31:36,952 --> 02:31:40,232 Twitter was super popular and super offline a lot of the time. 3082 02:31:40,232 --> 02:31:42,190 There was this thing called a Fail Whale, which 3083 02:31:42,190 --> 02:31:44,037 is the picture they showed on their website 3084 02:31:44,037 --> 02:31:46,120 when they were getting too much traffic to handle. 3085 02:31:46,120 --> 02:31:49,540 That was because when people are liking and tweeting and retweeting things, 3086 02:31:49,540 --> 02:31:51,520 it's a huge amount of data coming in. 3087 02:31:51,520 --> 02:31:54,500 And it turns out it's very hard to solve these problems. 3088 02:31:54,500 --> 02:31:58,450 But locking the database table or the rows with these transactions 3089 02:31:58,450 --> 02:32:00,490 is one way fundamentally to solve this. 3090 02:32:00,490 --> 02:32:03,160 And in our final extra time today, we thought 3091 02:32:03,160 --> 02:32:05,080 we would play this out in the same example 3092 02:32:05,080 --> 02:32:07,510 that I was taught transactions in some years ago. 3093 02:32:07,510 --> 02:32:10,750 Suppose that the scenario at hand is that you and your roommates 3094 02:32:10,750 --> 02:32:12,370 have a nice dorm fridge. 3095 02:32:12,370 --> 02:32:15,100 And you're all in the habit of drinking lots of milk, 3096 02:32:15,100 --> 02:32:17,050 and you want to be able to drink some milk. 3097 02:32:17,050 --> 02:32:19,420 But you go to the fridge, like I'm about to here. 3098 02:32:19,420 --> 02:32:22,210 And you realize, uh-oh, we're out of milk. 3099 02:32:22,210 --> 02:32:25,570 And so now I am inspecting the state of this refrigerator, which 3100 02:32:25,570 --> 02:32:27,970 is quite old but also quite empty. 
3101 02:32:27,970 --> 02:32:30,160 And the state of this variable, being empty, 3102 02:32:30,160 --> 02:32:33,620 tells me that I should go to CVS and buy some more milk. 3103 02:32:33,620 --> 02:32:35,080 So what do I then do? 3104 02:32:35,080 --> 02:32:37,150 I'm presumably going to close the fridge, 3105 02:32:37,150 --> 02:32:40,600 and I'm going to go and leave and go head to CVS. 3106 02:32:40,600 --> 02:32:43,510 Unfortunately, the same problem arises that we'll act out here 3107 02:32:43,510 --> 02:32:46,150 in our final 60 or so seconds together, whereby 3108 02:32:46,150 --> 02:32:49,660 if Brian now, my roommate in this story, also wants some milk, 3109 02:32:49,660 --> 02:32:52,060 he comes by when I'm already headed to the store, 3110 02:32:52,060 --> 02:32:55,310 inspects the state of the fridge, and realizes, oh, we're out of milk. 3111 02:32:55,310 --> 02:32:57,650 So he nicely will go restock, as well. 3112 02:32:57,650 --> 02:32:59,620 So let's see how this plays out, and we'll 3113 02:32:59,620 --> 02:33:03,590 see if there isn't a similar, analogous solution. 3114 02:33:03,590 --> 02:33:05,620 So I've checked the state of the variable. 3115 02:33:05,620 --> 02:33:06,920 We're indeed out of milk. 3116 02:33:06,920 --> 02:33:08,030 I'll be right back. 3117 02:33:08,030 --> 02:33:09,085 Just going to go to CVS. 3118 02:33:09,085 --> 02:33:26,336 3119 02:33:26,336 --> 02:33:29,829 [MUSIC PLAYING] 3120 02:33:29,829 --> 02:34:44,240 3121 02:34:44,240 --> 02:34:45,020 All right. 3122 02:34:45,020 --> 02:34:46,550 I am now back from the store. 3123 02:34:46,550 --> 02:34:47,870 I've picked up some milk. 3124 02:34:47,870 --> 02:34:50,090 Going to go ahead and put it into the fridge and-- 3125 02:34:50,090 --> 02:34:51,710 oh, how did this happen? 3126 02:34:51,710 --> 02:34:53,570 Now there's multiple jugs of milk. 3127 02:34:53,570 --> 02:34:55,490 And of course, milk does not last that long. 
3128 02:34:55,490 --> 02:34:57,282 And Brian and I don't drink that much milk. 3129 02:34:57,282 --> 02:34:58,970 So this is a really serious problem. 3130 02:34:58,970 --> 02:35:03,150 We've sort of tried to update the value of this variable at the same time. 3131 02:35:03,150 --> 02:35:05,030 So how do we go about fixing this? 3132 02:35:05,030 --> 02:35:07,320 What's the actual solution here? 3133 02:35:07,320 --> 02:35:09,920 Well, I dare say that we can draw some inspiration 3134 02:35:09,920 --> 02:35:13,787 from the world of transactions and the world of databases. 3135 02:35:13,787 --> 02:35:15,620 And perhaps create a visual here that we 3136 02:35:15,620 --> 02:35:18,270 hope you never forget if you take nothing away from today. 3137 02:35:18,270 --> 02:35:21,327 Let's go ahead and act this out one last time where, this time, 3138 02:35:21,327 --> 02:35:22,910 I'm going to be a little more extreme. 3139 02:35:22,910 --> 02:35:24,290 I go ahead and open the fridge. 3140 02:35:24,290 --> 02:35:25,940 I realize, oh, we're out of milk. 3141 02:35:25,940 --> 02:35:27,380 I'm going to go to the store. 3142 02:35:27,380 --> 02:35:29,450 I do not want to allow for this situation 3143 02:35:29,450 --> 02:35:32,490 where Brian accidentally checks the fridge, as well. 3144 02:35:32,490 --> 02:35:37,670 So I am going to lock the refrigerator instead. 3145 02:35:37,670 --> 02:35:41,390 Let me go ahead and drape this through here. 3146 02:35:41,390 --> 02:35:43,940 3147 02:35:43,940 --> 02:35:49,050 A little extreme, but I think so long as he can't get into the fridge, 3148 02:35:49,050 --> 02:35:52,790 this shouldn't be a problem. 3149 02:35:52,790 --> 02:35:56,060 Let me go ahead now and just attach the lock here. 3150 02:35:56,060 --> 02:35:57,170 Almost got it. 3151 02:35:57,170 --> 02:35:58,340 Come on. 3152 02:35:58,340 --> 02:35:59,570 All right. 3153 02:35:59,570 --> 02:36:01,993 Now the fridge is locked.
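The locked fridge maps onto a lock in a multithreaded program. What follows is only a sketch of the analogy, using Python's standard `threading.Lock`. The `fridge` dictionary, the `restock` function, and the `purchases` list are hypothetical names chosen for illustration.

```python
import threading

fridge = {"milk": 0}            # hypothetical shared state: jugs in the fridge
fridge_lock = threading.Lock()  # the padlock on the fridge door


def restock(roommate, purchases):
    # Holding the lock makes "check the fridge, then buy milk" a single step
    # that the other roommate cannot interleave with.
    with fridge_lock:
        if fridge["milk"] == 0:
            fridge["milk"] += 1          # back from the store with one jug
            purchases.append(roommate)   # record who made the trip


purchases = []
roommates = [
    threading.Thread(target=restock, args=(name, purchases))
    for name in ("David", "Brian")
]
for thread in roommates:
    thread.start()
for thread in roommates:
    thread.join()

print(fridge["milk"], len(purchases))  # 1 1
```

Which roommate makes the trip depends on thread scheduling, but whoever arrives second finds the fridge stocked and buys nothing, so exactly one jug ends up in the fridge.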
3154 02:36:01,993 --> 02:36:03,410 Now I'm going to go get some milk. 3155 02:36:03,410 --> 02:36:17,210 3156 02:36:17,210 --> 02:36:18,210 BRIAN YU: [SIGHS] 3157 02:36:18,210 --> 02:36:18,710 3158 02:36:18,710 --> 02:36:22,060 [MUSIC PLAYING] 3159 02:36:22,060 --> 02:37:19,000