[MUSIC PLAYING] DAVID J. MALAN: All right. This is CS50, and this is week 7. And today's focus is going to be entirely on data-- the process of collecting it, the process of storing it, the process of searching it, and so much more. You'll recall that last week we started off by playing around with a relatively small data set. We asked everyone what their preferred house at Hogwarts might be. And then we proceeded to analyze that data a little bit using some Python and counting up how many people wanted Gryffindor or Slytherin or the others, as well. And we ultimately did that by using a Google form to collect it. And we stored all of the data in a Google spreadsheet, which we then exported, of course, as a CSV file. So this week, we thought we'd collect a little more data and see what kinds of problems arise when we start using only a spreadsheet or, in turn, a CSV file to store the data that we care about. So in fact, if you could go ahead and go to this URL here that you see, you should see another Google form, this one asking you some different questions. All of us probably have some preferred TV shows, now more than ever, perhaps. And what we'd like to do is ask everyone to input into that form their favorite TV show followed by the genre or genres into which that particular TV show falls. So go ahead and take a moment to do that. And if you're unable to follow along at home, what folks are looking at is a form quite like this one here, whereby we're just asking them for the title of their preferred TV show and the genre or genres of that specific TV show. All right. So let's go ahead and start to look at some of this data that's come in. Here is the resulting Google spreadsheet that Google Forms has created for us. And you'll notice that by default, Google Forms, this particular tool, has three different columns, at least for this form. 
One is a timestamp, and Google automatically gives us that based on what day and time everyone was buzzing in with the responses. Then they have a header row beyond that for title and genres. I've manually boldfaced it in advance just to make it stand out. But you'll notice that the headings here, Title and Genres, perfectly match the questions that we asked in the Google form. That allows us to therefore line up your responses with our questions. And you can see here Punisher was the first favorite TV show to be inputted followed by The Office, Breaking Bad, New Girl, Archer, another Office, and so forth. And in the third column, under Genres, you'll see that there's something curious here. While some of the cells-- that is, the little boxes of text-- have just single words like "comedy" or "drama," you'll notice that some of them have a comma-separated list. And that comma-separated list is because some of you checked, as you could, multiple check boxes to indicate that Breaking Bad is in the crime genre, drama, and also thriller. And so the way Google Forms handles this is a bit sloppy in the sense that they just drop all of those values as a comma-separated list inside of the spreadsheet itself. And that's potentially a problem if we ultimately download this as a CSV file, comma-separated values, because now you have commas inside the values themselves. Fortunately, there's a solution to that that we'll ultimately see. So we've got a good amount of data here. In fact, if I keep scrolling down, we'll see a few hundred responses now. And it would be nice to analyze this data in some way and figure out what the most popular TV show is, maybe search for new shows I might like via their genre. So you can imagine some number of queries that could be answered by way of this data set. But let's first consider the limitations of leaving this data in just a spreadsheet like this. 
All of us are probably in the habit of using occasionally Google Spreadsheets, Apple Numbers, Microsoft Excel, or some other tool. So let's consider what spreadsheets are good at and what they are bad at. Would anyone like to volunteer an answer to the first of those? What is a spreadsheet good at or good for? Yeah, Andrew? What's your thinking on spreadsheets? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: OK, very good for quickly sorting. I like that. I could click on the top of the Title column, for instance, and immediately sort all of those titles alphabetically. I like that. Other reasons to use a spreadsheet-- what problems do they solve? What are they good at? Other thoughts on spreadsheets? Yeah, how about Peter? AUDIENCE: Storing large amounts of data that you can later analyze. DAVID J. MALAN: OK, so storing large amounts of data that you can later analyze. It's kind of a nice model for storing lots of rows of data, so to speak. I will say that there actually is a limit. And in fact, back in the day, I learned what this limit is. Long story short, in graduate school, I was using a spreadsheet to analyze some research data. And at one point, I had more data than Excel supported rows for. Specifically, I had some 65,536 rows, which was too many at that point for Excel at the time, because, long story short, if you recall from a spreadsheet program like Google Spreadsheets, every row is numbered from 1 on up. Well, unfortunately, at the time, Microsoft had used a 16-bit integer, 16 bits or 2 bytes, to represent each of those numbers. And it turns out that 2 to the 16th power is 65,536. So at that point, I maxed out the total number of rows. Now, to Peter's point, they've increased that in recent years. And you can actually store a lot more data. So spreadsheets are indeed good at that. But they're not necessarily good at everything, because at some point, you're going to have more data potentially in a spreadsheet than your Mac or PC can handle. 
In fact, if you're actually trying to build an application, whether it's Twitter, Instagram, or Facebook or anything of that scale, those companies are certainly not storing their data, suffice it to say, in a spreadsheet, because there would just be way too much data to use. And no one could literally open it on their computer. So we'll need a solution to that problem of scale. But I don't think we need to throw out what works well about spreadsheets. So you can store indeed a lot of data in row form. But it would seem that you can also store a lot of data in column form. And even though I'm only showing columns A, B, and C, of course, you've probably used spreadsheets where you add more columns-- D, E, F, and so forth. So what's the right mental model for how to think about rows versus columns in a spreadsheet? I feel like we probably use them in a somewhat different way conceptually. We might think about them a little differently. What's the difference between rows and columns in a spreadsheet? Sofia. AUDIENCE: Adding more entries. Adding more data is-- those are within the rows, but then the actual attributes or characteristics of the data should be in columns. DAVID J. MALAN: Exactly. When you add more data to the spreadsheet, you should really be adding to the bottom of it, adding more and more rows. So these things sort of grow vertically, even though of course that's just a human's perception of it. They grow from top to bottom by adding more and more rows. But to Sofia's point, your columns represent what we might call attributes or fields or any other such characteristic that is a type of data that you're storing. So in this case of our form, Timestamp is the first column. Title is the second column. Genres is the third column. And those columns can indeed be thought of as fields or attributes, properties of your data. 
And those are properties that you should really decide on in advance when you're first creating the form, in our case, or when you're manually creating the spreadsheet in another case. You should not really be in the habit, when using spreadsheets, of adding data from left to right, adding more and more columns, unless you decide to collect more types of data. So just because someone adds a new favorite TV show to your data set, you shouldn't be adding that from left to right in a new column. You should indeed be adding it from top to bottom. But suppose that we actually decided to collect more information from everyone. Maybe that form had instead asked you for your name or your email address or any other questions. Those properties or attributes or fields would belong as new columns. So this is to say we generally decide on the layout of our data, the schema of our data, in advance. And then from there on out, we proceed to add, add, add more rows, not columns, unless we change our mind and need to change the schema of our particular data. So it turns out that spreadsheets are indeed wonderfully useful, to Peter's point, for large or reasonably large data sets that we might collect. And we can, of course, per last week, export those data sets as CSV files. And so we can go from a spreadsheet to a simple text file stored in ASCII or Unicode, more generally, on your own hard drive or somewhere in the cloud. And you can actually think of that file, that .CSV file, as what we might call a flat-file database. A database is, generally speaking, a file that stores data. Or it's a program that stores data for you. And all of us have probably thought about or used databases in some sense. You're probably familiar with the fact that all of those same big websites, Google and Twitter and Facebook and others, use databases to store our data. Well, those databases are either just really big files containing lots of data or special programs that are storing our data for us. 
And a flat file is just referring to the fact that it really is a very simple design. In fact, years ago, decades ago, humans decided when storing data in simple text files that if you want to store different types of data, like, to Sofia's point, different properties or attributes, well, let's keep it simple. Let's just separate those columns with commas in our flat-file database, a.k.a. a CSV. You can use other things. You can use tabs. There are things called TSVs, for Tab-Separated Values. And frankly, you can use anything you want. But there is a corner case. And we've already seen a preview of it. What if your actual data has a comma in it? What if the title of your favorite TV show has a comma? What if Google is presuming to store genres as a comma-separated list? Bad things can happen if using a CSV as your flat-file database. But there are solutions to that. And in fact, what the world typically does is whenever you have commas inside of your CSV file, you just make sure that the whole string is double quoted on the far left and far right. And anything inside of double quotes is not mistaken thereafter as delineating a column as the other commas in the file might. So that's all that's meant by a flat-file database. And CSV is perhaps the most common format thereof, if only because all of these programs, like Google Spreadsheets and Excel and Numbers, allow you to save your files as CSVs. Now, long story short, those of you who have used fancier features of spreadsheets like built-in functions and formulas and those kinds of things, those are built in and proprietary to Google Spreadsheets and Excel and Numbers. You cannot use formulas in a CSV file or a TSV file or in a flat-file database, more generally. You can only store static-- that is, unchanging-- values. So when you export the data, what you see is what you get. 
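To see that quoting rule in action, here is a small sketch using Python's csv module, which implements exactly that convention. The timestamp and show in this line are made-up stand-ins for the real spreadsheet data:

```python
import csv
import io

# A hypothetical CSV line in which the genres cell itself contains commas,
# so the exporter wrapped the whole cell in double quotes
line = '10/26/2020 13:02:44,Breaking Bad,"Crime, Drama, Thriller"\n'

# csv.reader treats the quoted commas as part of the value,
# not as column delimiters, so we still get exactly three columns
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['10/26/2020 13:02:44', 'Breaking Bad', 'Crime, Drama, Thriller']
```

Notice that the double quotes themselves are stripped away; they exist only to protect the embedded commas.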
And that's why people use fancier programs like Excel and Numbers and Google Spreadsheets, because you get more functionality. But if you want to export the data, you can only get indeed the raw textual data out of it. But I daresay that's going to be OK. In fact, Brian, do you mind if I go ahead and download this spreadsheet as a CSV file now? BRIAN YU: Yep, go ahead. DAVID J. MALAN: All right. I'm going to go ahead in Google Spreadsheets and go to File, Download. And you can see a whole bunch of options-- PDF, Web Page, Comma-Separated Values, which is the one I want. So I'm going to indeed go ahead and choose CSV from this dropdown in spreadsheets. That, of course, downloaded that file for me. And now I'm going to go ahead and go into our familiar CS50 IDE. You'll recall that last week I was able to upload a file into the IDE. And I'm going to go ahead and do the same here this week, as well. I'm going to go ahead and grab my file, which ended up in my Downloads folder on my particular computer here. And I'm going to go ahead and drag and drop this into the IDE such that it ends up in my home directory, so to speak. So now I have this file, Favorite TV Shows Forms. And in fact, if I double click this within the IDE, you'll see familiar data now. Timestamp comma title comma genres is our header row that contains the names of the properties or attributes in this file. Then we've got our timestamps comma favorite title comma and then a comma-separated list of genres. And here indeed, notice that Google took care to use double quotes around any values that themselves had commas. So it's a relatively simple file format. And I could certainly just kind of skim through this, figuring out who likes The Office, who likes Breaking Bad, or other shows. But per last week, we now have a pretty useful programming language at our disposal, Python, that could allow us to start manipulating and analyzing this data more readily. 
And here to my point last week about using the right tool for the job, you could absolutely do everything we're about to do in all weeks prior of CS50. We could have used C for what we're about to do. But as you can probably glean, C tends to be painful for certain things, like anything involving string manipulation, changing strings, analyzing strings. It's just a real pain, right? God forbid you had to take this CSV file and load it all into memory, not unlike your spell checker. You would have to be using malloc all over the place or realloc or the like. There's just a lot of heavy lifting involved in just analyzing a text file. So Python does all of that for us by just giving us more functions at our disposal with which to start analyzing and opening data. So let me go ahead and close this file. And let me go ahead and create a new one called favorites.py, wherein I'm going to start playing with this data set and see if we can't start answering some questions about it. And frankly, to this day, 20-plus years after learning how to program for the first time, I myself am very much in the habit when writing a new program of just starting simple and not solving the problem I ultimately want to solve but something simpler, just as a sort of proof of concept to make sure I have the right plumbing in place. So by that, I mean this. Let's go ahead and write a quick program that simply opens up this file, the CSV file, iterates over it top to bottom, and just prints out each of the titles, just as a quick sanity check that I know what I'm doing and I have access to the data therein. So let me go ahead and import CSV. And then I can do this in a few different ways. But by now, you've probably seen or remembered my using something like the open command and the with keyword to open and eventually automatically close this file for me. This file is called Favorite TV Shows - Form Responses 1.csv. And I'm going to open this up in read mode. Strictly speaking, the r is not required. 
You might see examples online not including it. That's because read is the default. But for parity with C and fopen, I'm going to be explicit and actually do "r." And I'm going to go ahead and give this a variable name of file. So this line 3 here has the effect of opening that CSV file in read-only mode and creating a variable called file via which I can reference it. Now I'm going to go ahead and use some of that CSV functionality. I'm going to give myself what we keep calling a reader, which I could call it xyz, anything else. But "reader" kind of describes what this variable is going to do. And it's going to be the return value of calling csv.reader on that file. And so essentially, the CSV library, per last week, has a lot of fancy features built in. And all it needs as input is an already opened text file. And then it will then wrap that file, so to speak, with a whole bunch of more useful functionality, like the ability to read it column and row at a time. All right. Now I'm going to go ahead and, you know what, just for now, I'm going to skip the first row. I'm going to skip the first row, because the first row has my headings-- Timestamp, Title, and Genres. And I know what my columns are, so I'm just going to ignore that line for now. And now I'm going to do this. For row in reader, let me go ahead and print out, quite simply, row. And I only want title, so I think if it's three columns from left to right, it's 0, 1, 2. So I want to print out column bracket 1, which is going to be the second column zero indexed. All right. Let me go ahead and save that, go down to my terminal window, and run python of favorites.py and cross my fingers. OK. Voila. It flew by super fast. But it looks like, indeed, these are all of the TV shows that folks have inputted. Indeed, there's a few hundred if I keep scrolling up. So it looks like my program is working. But let's improve it just a little bit. 
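Putting those steps together, this first version of favorites.py might look like the sketch below. Since the actual Google Forms download isn't available here, the sketch first writes a tiny stand-in CSV; the filename and the rows in it are hypothetical:

```python
import csv

# Hypothetical stand-in for the real Google Forms download,
# just to keep this sketch self-contained
with open("favorites.csv", "w") as file:
    file.write("Timestamp,Title,Genres\n")
    file.write("10/26/2020 13:01:22,The Office,Comedy\n")
    file.write('10/26/2020 13:02:44,Breaking Bad,"Crime, Drama, Thriller"\n')

titles = []
with open("favorites.csv", "r") as file:  # "r" is the default, but explicit here
    reader = csv.reader(file)
    next(reader)  # skip the header row (Timestamp, Title, Genres)
    for row in reader:
        print(row[1])  # column 1, zero-indexed, is the title
        titles.append(row[1])
```

Running this prints each title in file order, one per line.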
It turns out that using the csv.reader isn't necessarily the best approach in Python. Many of you have already discovered a DictReader, a dictionary reader, which is nice, because then you don't have to know or keep double checking what number column your data is in. You can instead refer to it by the header itself, so by "title" or by "genres." This is also good, because if you or maybe a colleague are sort of messing around with the spreadsheet and they rearrange the columns by dragging them left or right, any numbers you have used in your code, 0, 1, 2 on up, could suddenly be incorrect if your colleague has reordered those columns. So using a dictionary reader tends to be a little more robust, because it uses the titles, not the mere numbers. It's still fallible if someone, yourself or someone else, changes the values in that very first row and renames titles or genres. Then things are going to break. But at that point, we kind of have to blame you for not having kept track of your code versus your data. But still a risk. So I'm going to change this to dictionary reader or DictReader here. And pretty much the rest of my code can be the same except I don't need this hack here on line 5. I don't need to just skip over to the next row from the get-go, because I now want the dictionary reader to handle the process of reading that first row for me. But otherwise, everything else stays the same except for this last line, where I think I can now use row as a dictionary, not as a list per se, and print out specifically the title from each given row. So let me go ahead and run python of favorites.py again. And voila, it looks like I got the same result, several hundred of them. But let me stipulate that it's doing the same thing; we'd see as much if we actually compared both of those side by side. All right. 
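The DictReader version of the same sketch is below, again against a hypothetical stand-in file. Note that DictReader consumes the header row itself, so the next(reader) hack disappears, and each row becomes a dictionary keyed by the column headings:

```python
import csv

# Hypothetical stand-in data, just to keep this sketch self-contained
with open("favorites.csv", "w") as file:
    file.write("Timestamp,Title,Genres\n")
    file.write("10/26/2020 13:01:22,The Office,Comedy\n")
    file.write("10/26/2020 13:05:10,New Girl,Comedy\n")

titles = []
with open("favorites.csv", "r") as file:
    reader = csv.DictReader(file)  # reads the header row itself; no next() needed
    for row in reader:
        print(row["Title"])  # look up the column by name, not by number
        titles.append(row["Title"])
```

If a colleague later dragged the Title column elsewhere, this code would keep working, since it never mentions a column number.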
Before I forge ahead now to actually augment this with new functionality, any questions or confusion on this Python script we just wrote to open a file, wrap it with a reader or DictReader, and then iterate over the rows one at a time, printing the titles? Any questions, confusion on syntax at all? It's OK. We've only known or seen Python for a week. It's fine if it's still quite new. Anything, Brian, we should address? BRIAN YU: Yeah. So why is it that you don't need to close the file using the syntax that you're using right here? DAVID J. MALAN: Really good question. Last week, I more pedantically used open on its own. And then I later used a close function that was associated with the file that I had just opened. Now, the more Pythonic way to do things, if you will, is actually to use this with keyword, which didn't exist in C. And it just tends to be a useful feature in Python whereby if you say with open, dot dot dot, it will open the file for you. Then it will remain open so long as your code is indented inside of that with keyword's block. And as soon as you get to the end of your program, it will automatically be closed for you. So this is one of these features where Python in some sense is trying to protect us from ourselves. It's probably pretty common for humans, myself included, to forget to close your file. That can create problems with saving things permanently. It can create memory leaks, as we know from C. So the with keyword just assumes that I'm not going to be an idiot and forget to close the file. Python is going to do it for me automatically. Other questions or confusions, Brian? BRIAN YU: How does DictReader know that Title is the name of the key inside of the dictionary? DAVID J. MALAN: Really good question, too. 
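That auto-closing behavior is easy to observe directly, since every Python file object carries a closed attribute. A minimal sketch (the filename is arbitrary):

```python
# Inside the with block, the file is open; as soon as the indented
# block ends, Python closes it for us -- no explicit close() call
with open("demo.txt", "w") as file:
    file.write("hello, world")
    was_open = not file.closed  # True: still open inside the block

print(was_open, file.closed)  # True True
```

The file variable still exists after the block; it's only the underlying file handle that has been released.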
So it is designed by the authors of the Python language to look at the very first row in the file, split it on the commas in that very first row, and just assume that the first word or phrase before the first comma is the name of the first column, that the second word or phrase after the first comma is the name of the second column, and so forth. So a DictReader just presumes, as is the convention with CSVs, that your first row is going to contain the headings that you want to use to refer to those columns. If your CSV happens not to have such a heading whereby it just jumps right in on the first row to real data, then you're not going to be able to use a DictReader correctly, at least not without some manual configuration. All right. So let's go ahead and-- now I feel like there's a whole mess here. And some of these shows are pretty popular. And as I'm glancing over this, I definitely see some duplication. A whole bunch of you like The Office. A whole bunch of you like Breaking Bad, Game of Thrones, and a whole bunch of other shows, as well. So it would be nicer, I think, if we kind of narrow the scope of our look at this data by just looking at unique values. So rather than just iterate over the file top to bottom, printing out one title after another, why don't we go ahead and sort of accumulate all of this data in some kind of data structure so that we can throw away duplicate values and then only print out the unique titles that we've accumulated? So I bet we can do this in a few ways. But if we think back to last week's demonstration of our dictionary, you'll recall that I used what was called a set. And I'm going to go ahead and create a variable called titles and set it equal to something called set. And a set is just a collection of values. It's kind of like a list. But it eliminates duplicates for me. And that would seem to be exactly the characteristic that I want for this program. 
Now, instead of printing each title, which is now premature if I want to first filter out duplicates, I'm going to go ahead and do this. I'm going to go ahead and add to the titles set using the add function the current row's title. So again, I'm not printing it now. I'm instead adding to the titles set that particular title. And if it's there already, no big deal. The set data structure in Python is going to throw away the duplicates for me. And it's only going to go ahead and keep the uniques. Now, at the bottom of my file, I need to do a little more work, admittedly. Now I have to iterate over the set to print out only those unique titles. So let me do this. For title in titles, go ahead and print out title. And this is where Python just gets really user-friendly, right? You don't have to do int i gets 0, i less than n, or whatever. You can just say for title in titles. And if the titles variable is a type of data structure that you can iterate over, which it will be if it's a list or if it's a set or even if it's a dictionary, another data structure we saw last week in Python, the for loop in Python will just know what to do. This will loop over all of the titles in the titles set. So let me go ahead and save this file and go ahead now and run python of favorites.py. And it looks like, yeah, the list is different in some way. But I'm seeing fewer results as I scroll up, definitely fewer than before, because my scrollbar didn't jump nearly as far down. But honestly, this is kind of a mess. Let's go ahead and sort this. Now, in C, it would have been kind of a pain to sort things. We'd have to whip out the pseudocode, probably, for bubble sort, selection sort, or, god forbid, merge sort and then implement it ourselves. But no, with Python comes, really, the proverbial kitchen sink of functions. So if you want to sort this set, you know what? Just say you want it sorted. 
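The deduplicating behavior of a set is worth seeing in isolation. In this sketch, adding the same (made-up) title three times still leaves only one copy:

```python
# A set keeps only unique values; adding a duplicate is a no-op
titles = set()
for title in ["The Office", "Breaking Bad", "The Office", "The Office"]:
    titles.add(title)

print(len(titles))  # 2 -- only the unique titles remain
for title in titles:
    print(title)
```

Note that, unlike a list, a set makes no promise about the order in which its values come back out, which is exactly why the output above looks like "kind of a mess" until it's sorted.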
There is a function in Python called sorted that will use one of those better algorithms-- maybe it's merge sort. Maybe it's something called quicksort. Maybe it's something else altogether. It's not going to use a big O of n squared sort. Someone at Python probably has spent the time implementing a better sort for us. But it will go ahead and sort the set for me. Now let me go ahead and do this again. Let me increase the size of my terminal window and rerun python of favorites.py. OK. And now we have an interesting assortment of shows that's easier for me to wrap my mind around, because I have it now sorted here. And indeed, if I scroll all the way up, we should see all of the shows beginning with numbers or a period, which might have just been someone playing around, followed by the A words, the B words, and so forth. So now it's a little easier to wrap our minds around this. But something's up. I feel like a lot of you like Avatar: The Last Airbender. And yet I'm seeing it, indeed, four different times. But I thought we were filtering this down to uniques by using that set structure. So what's going on? And in fact, if I keep scrolling, I'm pretty sure I saw more duplicates in here. BoJack Horseman, Breaking Bad, Breaking Bad, Brooklyn Nine-Nine, Brooklyn Nine-Nine, CS50 in several different flavors. And yes, it keeps going. Friends. So I see a lot of duplicate values. So what's going on? Yeah, [? Gadana? ?] AUDIENCE: Yeah, so your current sort is case insensitive-- sorry, is case sensitive, meaning that if someone spells avatar with capital A's in some places, then it's going to be a different result each time. DAVID J. MALAN: Yeah, exactly. Some of you weren't quite diligent when it came to capitalization. And so in fact, the reality is, as [? Gadana ?] notes, that there's differences in capitalization. Now, we've addressed this before. 
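For the record, CPython's built-in sort is Timsort, a hybrid of merge sort and insertion sort that runs in O(n log n). And sorted() accepts any iterable, a set included, returning a new sorted list, so the change to the loop is just one word. A sketch with made-up titles:

```python
titles = {"New Girl", "Archer", "Breaking Bad"}

# sorted() takes any iterable (here, a set) and returns a new sorted list
for title in sorted(titles):
    print(title)
# Archer
# Breaking Bad
# New Girl
```

The original set is untouched; sorted() builds and returns a separate list.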
In fact, when you implemented your spell checker, you had to deal with this already when you were spell checking an arbitrary text. Some words might be capitalized. Some might be all lowercase, all uppercase. And you wanted to tolerate different casings. And so we probably solved this by just forcing everything to uppercase or everything to lowercase and doing things, therefore, case insensitively. So give me just a moment here. And I'm going to go ahead and make a quick change to my code here. Let's go ahead and change this in such a way that we actually force everything to uppercase or lowercase. Doesn't really matter which, but we need to canonicalize things, so to speak, in some way. And to canonicalize things just means to format all of your data in some standard way. So to [? Gadana's ?] point, let's just standardize the capitalization of things. Maybe all uppercase, all lowercase. We just need to make a judgment call. So I'm going to go ahead and make a few tweaks here. I'm still going to use a set. I'm still going to read the CSV as before. But instead of just adding the title with row bracket title, I'm going to go ahead and force it to uppercase, just arbitrarily, just for the sake of uniformity. And then let's go ahead and check what exactly has happened here. I'm not going to change anything else. But let me go ahead and increase the size of my terminal window, rerun python of favorites.py. And voila. It's a little harder to read, just because I'm not used to reading all caps. Kind of looks like we're yelling at ourselves. But I don't see-- wait a minute. I still see The Office over here twice. If I keep scrolling here, so far, I see Stranger Things and Strainger Things. That just looks like a typo. I see two Sherlocks, though. This is a little suspicious. So [? Gadana, ?] you and I don't seem to have solved things fully. And this one's a little more subtle. What more should I perhaps do to my data to ensure we get duplicates removed? Olivia? 
AUDIENCE: Maybe trim around the edges. DAVID J. MALAN: Trim around the edges. I like the sound of that, but what do you mean? What does that do? AUDIENCE: Oh, like, trim off the extra spaces in case someone put a space before or after the words. DAVID J. MALAN: Yeah, exactly. It's pretty common for humans, intentionally or accidentally, to hit the Space bar where they shouldn't. And in fact, I'm kind of inferring that I bet one or more of you accidentally typed Sherlock, space, and then decided, nope, that's it. I'm not typing anything else. But that space, even though we can't quite see it obviously, is there. And when we do a string comparison, or when the set data structure does one, that space is actually going to be noticed. And therefore the strings are not going to be the same. So I can do this in a few different ways. But it turns out, in Python, you can chain functions together, which is kind of a fancy feature, too. Notice what I'm doing here. I'm still accessing the titles set. I'm adding the following value to it. I'm adding the value row bracket title, but not quite. That is a string or an str, in Python speak. I'm going to go ahead and strip it, which means, if we look up the documentation for this function, to Olivia's point, it's going to strip off or trim all of the white space to the left, all of the white space to the right, whether that's the Space bar or the Enter key or the Tab character or a few other things, as well. It's just going to get rid of leading and trailing white space. And then whatever's left over, I'm going to go ahead and force everything to uppercase in the spirit of [? Gadana's ?] suggestion, too. So we're sort of combining two good ideas now to really massage the data, if you will, into a cleaner format. And this is such a real-world reality. Humans, you and I, cannot be trusted to input data the way we are supposed to. 
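Chaining those two string methods looks like this in isolation. The messy inputs below are made up, but they mirror the kinds of variants in the real data:

```python
# Four inputs a human would consider "the same show"
messy = ["Sherlock", "sherlock ", "  SHERLOCK", "Sherlock\t"]

titles = set()
for title in messy:
    # strip() trims leading and trailing whitespace (spaces, tabs, newlines);
    # upper() then canonicalizes the capitalization; chained left to right
    titles.add(title.strip().upper())

print(titles)  # {'SHERLOCK'} -- one canonical value survives
```

In the actual script, the corresponding line would be titles.add(row["Title"].strip().upper()).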
Sometimes it's all lowercase, because we're being a little lazy or a little social media-like, even if we're checking out from Amazon and trying to input a valid postal address. Sometimes it's all capitals, because I can think of a few people in my life who don't quite understand the Caps Lock thing just yet. And so things might be all capitalized instead. This is not good for computer systems that require precision, per our emphasis in week 0. And so massaging data means cleaning it up, doing some mutations that don't really change the meaning of the data but canonicalize it, standardize it, so that you're comparing apples and apples, so to speak, not apples and oranges. Well, let me go ahead and run this again in my bigger terminal window, python of favorites.py. Voila. In scrolling up, up, up, I think we're in a better place. I only see one Office now. And if I keep scrolling up and up and up, I'm seeing typos still, but nothing related to white space. And I think we have a much cleaner unique list of titles at this point. Of course, if we scroll up, I would have to be a lot more clever if I want to detect things like typographical errors. It looks like one of you was very diligent about putting F.R.I. and so forth but then got bored at the end and left off the last period. But that's going to happen when you're taking in user input. We've, of course, got all these variants of CS50. That's going to be a mess to clean up, because now you can imagine having to add a whole bunch of if conditions and elses and else ifs to clean all of that up if we do want to canonicalize all different flavors of CS50 as, quote unquote, "CS50." So this is a very slippery slope. You and I could start writing a huge amount of code just to clean this up. But that's the reality when dealing with real-world data. 
Well, let's go ahead now and improve this program further, do something a little fancier, because I now can trust that my data has been canonicalized except for the actual typos or the weird variants of CS50 and the like. Let's go ahead and figure out what's the most popular favorite TV show among the audience here. So I'm going to start where I have before, with my current code, because I think I have most of the building blocks in place. I'm going to go ahead and clean up my code a little bit in here. I'm going to go ahead and give myself a separate variable now called title just so that I can think about things in a little more orderly fashion. But I'm not going to start adding things to this set anymore. In fact, a set, I don't think, is really going to be sufficient to keep track of the popularity of TV shows, because by definition, the set is throwing away duplicates. But the goal now is kind of the opposite. I want to know which are the duplicates so that I can tell you that this many people like The Office. This many people like Breaking Bad and the like. So what tools do we have in Python's toolkit via which we could accumulate or figure out that information? Any thoughts on what data structure might help us here if we want to figure out show, popularity, show, popularity? And by popularity, I just mean the frequency of it in the CSV file. Santiago? AUDIENCE: I guess one option could be to use dictionaries so that you can have The Office, I don't know, 20 votes, and then Game of Thrones, another one, so that a dictionary could really help you visualize that. DAVID J. MALAN: Yeah, perfect instincts. Recall that a dictionary, at the end of the day, no matter how sophisticated it's implemented underneath the hood, like your spell checker-- It's just a collection of key value pairs. 
And indeed, it's maybe one of the most useful data structures in any language, because this ability to associate one piece of data with another is just a very general purpose solution to problems. And indeed, to Santiago's point, if the problem at hand is to figure out the popularity of shows, well, let's make the keys the titles of our shows and the frequencies thereof-- the votes, so to speak-- the values of those keys. We're going to map title to votes, title to vote, title to vote, and so forth. So a dictionary is exactly that. So let me go ahead and scroll up. And I can make a little tweak here. Instead of a set, I can instead say dict and give myself just an empty dictionary. There's actually shorthand notation for that that's a little more common. So you use two empty curly braces. That just means the exact same thing. Give me a dictionary that's initially empty. There's no fancy shortcut for a set. You have to literally type out S-E-T, open paren and closed paren. But dictionaries are so common, so popular, so powerful, they have this little syntactic shortcut of just two curly braces, open and closed. So now that I have that, let me go ahead and do this. Inside of my for loop, instead of printing the title, which I don't want to do, and instead of adding it to the set, I now want to add it to the dictionary. So how do I do that? Well, if my dictionary is called titles, I think I can essentially do something like this, titles bracket title = or maybe += 1. Maybe I can kind of use the dictionary as just a little cheat sheet of counts, numbers, that start at 0 and then just add 1, add 2, add 3. So every time I see The Office, The Office, The Office, do += 1, += 1. We can't do ++, because that's not a thing in Python. It only exists in C. But this would seem to go into the dictionary called titles, look up the key that matches this specific title, and then increment whatever value is there by 1. But I'm going to go ahead and run this a little naively here. 
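A tiny sketch of that distinction between the two literals:

```python
d = {}       # shorthand for an empty dictionary
s = set()    # no such shortcut for sets; {} always means dict
print(type(d).__name__, type(s).__name__)
```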
Let me go ahead and run python of favorites.py. And wow, it broke already on line 9. So sort of an apt choice of show to begin with, we have a key error with Punisher. So Punisher is bad. Something bad has just happened. But what does that mean? A key error is referring to the fact that I tried to access an invalid key in a dictionary. This is saying that literally in this line of code here, even though titles is a dictionary and even though the value of title, singular, is, quote unquote, "PUNISHER," I'm getting a key error, because that title does not yet exist. So even if you're not sure of the Python syntax for fixing this problem, what's the intuitive solution here? I cannot increment the frequency of the Punisher, because Punisher is not in the dictionary. It almost feels like a catch-22. [? Greg? ?] AUDIENCE: I think that you need, first of all, to create a for loop and maybe assign a value to everything in the dictionary. For example, a value 0, and then add 1. DAVID J. MALAN: Yeah, so good instincts. And here, I can use another metaphor. I worry we might have a chicken and the egg problem there, because I don't think I can go to the top of my code, add a loop that initializes all of the values in the dictionary to 0, because I would need to know all of the names of the shows at that point. Now, that's fine. I think I could take you maybe more literally, [? Greg, ?] and open up the CSV file, iterate over it top to bottom, and, any time I see a title, just initialize it in the dictionary as having a value of 0, 0, 0. Then have another for loop, maybe reopen the file, and do the same. And that would work. But it's arguably not very efficient. It is, asymptotically, in terms of big O-- two passes over the file are still big O of n. But that would seem to be doing twice as much work. Iterate over the file once just to initialize everything to 0. Then iterate over the file a second time just to increment the counts. I think we can do things a little more efficiently. 
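A minimal reproduction of that key error, caught here so the script keeps running:

```python
titles = {}
try:
    titles["PUNISHER"] += 1  # read-then-increment a key that doesn't exist yet
except KeyError as e:
    print("KeyError:", e)

titles["PUNISHER"] = 1  # plain assignment of a brand-new key is always fine
```

The += fails because it must first read the current value of the key, and there is none; assignment alone never needs to read, so it succeeds.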
I think we can achieve not only correctness but better design. Any thoughts on how we can still solve this problem without having to iterate over the whole thing twice? Yeah, [? Semowit? ?] AUDIENCE: I think we can add in an if statement to check if that key is in the dictionary. And if it's not, then add it and then go ahead and increment the value after. DAVID J. MALAN: Nice. And we can do exactly that. So let's just apply that intuition. If the problem is that I'm trying to access a key that does not yet exist, well, let's just be a little smarter about it. And to [? Semowit's ?] point, let's check whether the key exists. And if it does, then increment it. But if it does not, then and only then, [? to Greg's ?] advice, initialize it to 0. So let me do that. Let me go ahead and say if title in titles, which is the very Pythonic, beautiful way of asking a question like that, way cleaner than in C-- let me go ahead, then, and say exactly the line from before. Else, though, if that title is not yet in the dictionary called titles, well, that's OK, too. I can go ahead and say titles bracket title = 0. So the difference here is that I can certainly index into a dictionary using a key that doesn't exist if I plan at that moment to give it a value. That's OK, and that has always been OK since last week. But, however, if I want to go ahead and increment the value that's there, I'm going to go ahead and do that in this separate line. But I did introduce a bug. I did introduce a bug here. I think I need to go one step further logically. I don't think I want to initialize this to 0 per se. Does anyone see a subtle bug in my logic here? If the title is already in the dictionary, I'm incrementing it by 1. Otherwise, I'm initializing it to 0. Any subtle catches here? Yeah, Olivia, what do you see? AUDIENCE: I think you should initialize it to 1, since it's the first instance. DAVID J. MALAN: Exactly. I should initialize it to 1. 
Otherwise, I'm accidentally overlooking this particular title, and I'm going to go ahead and undercount it. So I can fix this either by doing this. Or frankly, if you prefer, I don't technically need to use an if else. I can use just an if by doing something like this instead. I could say if title not in titles, then I could go ahead and say titles bracket title gets 0. And then after that, I can blindly, so to speak, just do this. So which one is better? I think the second one is maybe a little better in that I'm saving one line of code. But it's ensuring with that if condition, to [? Semowit's ?] advice, that I'm not indexing into the titles dictionary until I'm sure that the title is in there. So let me go ahead and run this now, python of favorites.py, Enter. And OK, it didn't crash, so that's good. But I'm not yet seeing any useful information. But I now have access to a bit more. Let me scroll down now to the bottom of this program, where I have now this loop. Let me go ahead and print out not just the title but the value of that key in the dictionary by just indexing into it here. And you might not have seen this syntax before. But with print, you can actually pass in multiple arguments. And by default, print will just separate them with a space for you. You can override that behavior and separate them with anything. But this is just meant to be a quick and dirty program that prints out titles and now the popularity thereof. So let me run this again, python of favorites.py. And voila. It's kind of all over the place. Office, super popular with 26 votes there. A lot of single votes here. Big Bang Theory has nine. You know, this is all nice and good. But I feel like this is going to take me forever to wrap my mind around which are the most popular shows. So of course, how would we do this? Well, to the point made earlier, with spreadsheets, my god, in Microsoft Excel or Google Spreadsheets or Apple Numbers, you just click the column heading and boom, sorted. 
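Putting those pieces together, the counting loop might look roughly like this, with made-up rows standing in for the real CSV:

```python
import csv
import io

# hypothetical rows standing in for the real exported CSV
data = io.StringIO("title\nThe Office\nBreaking Bad\n the office \n")

titles = {}
for row in csv.DictReader(data):
    title = row["title"].strip().upper()
    # only index into the dictionary once we're sure the key exists
    if title not in titles:
        titles[title] = 0
    titles[title] += 1

# print pairs of arguments; print separates them with a space by default
for title in titles:
    print(title, titles[title])
```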
We seem to have lost that capability unless we now do it in code. So let me do that for us. Let me go ahead and go back to my code. And it looks like sorted, even though it does work on dictionaries, is actually sorting by key, not by value. And here's where our Python programming techniques need to get a little more sophisticated. And we want to introduce another feature here now of Python which is going to solve this problem specifically but in a pretty general way. So if we read the documentation for sorted, the sorted function indeed sorts sets by the values therein. It sorts lists by the values therein. It sorts dictionaries by the keys therein, because dictionaries have two pieces of information for every element. It has a key and a value, not just a value. So by default, sorted sorts by key. So we somehow have to override that behavior. So how can we do this? Well, it turns out that the sorted function takes another optional argument literally called key. And the key argument takes as its value the name of a function. And this is where things get really interesting, if not confusing, really quickly. It turns out, in Python, you can pass around functions as arguments by way of their name. And technically, you can do this in C. It's a lot more syntactically involved. But in Python, it's very common. In JavaScript, it's very common. In a lot of languages, it's very common to think of functions as first-class objects, which is a fancy way of saying you can pass them around just like they are variables themselves. We're not calling them yet. But you can pass them around by their name. So what do I mean by this? Well, I need a function now to sort my dictionary by its value. And only I know how to do this, perhaps, so let me go ahead and give myself a generic function name just for the moment called f-- f for function, kind of like in math-- because we're going to get rid of it eventually. 
But let me go ahead and temporarily define a function called f that takes as input a title. And then it returns for me the value corresponding to that key. So I'm going to go ahead and return titles bracket title. So here, we have a function whose purpose in life is super simple. You give it a title. It gives you the count thereof, the frequency, the popularity thereof, by just looking it up in that global dictionary. So it's super simple, but that's its only purpose in life. But now, according to the documentation for sorted, what it's now going to do, because I'm passing in a second argument called key, the sorted function, rather than just presume you want everything sorted alphabetically by key, it's instead going to call that function f on every one of the elements in your dictionary. And depending on your answer, the return value you give with that f function, that will be used instead to determine the actual ordering. So by default, sorted just looks at key. What I'm effectively doing with this f function is instead returning the value corresponding to every key. And so the logical implication of this, even though the syntax is a little new, is that this dictionary of titles will now be sorted by value instead of by key. Because again, by default, it sorts by key. But if I define my own key function and override that behavior to return the corresponding value, it's the values, the numbers, the counts that will actually be used to sort this thing. All right. Let's go ahead and see if that's true in practice. Let me go ahead and rerun python of favorites.py. I should see all the titles. And voila, conveniently, the most popular show seems to be Game of Thrones with 33 votes, followed by Friends with 27, followed by The Office with 26, and so forth. But of course, the list is kind of backwards. I mean, it's convenient that I can see it at the bottom of my screen. But really, if we're making a list, it should really be at the top. 
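A sketch of that key function in action, with counts borrowed from the lecture's results. Note that f is passed by name, without parentheses:

```python
# counts borrowed from the lecture's results
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}

def f(title):
    # given a key, return its value, so sorted orders by popularity instead
    return titles[title]

for title in sorted(titles, key=f):
    print(title, titles[title])
```

The least popular show prints first, since sorted is still ascending by default.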
So how can we override that behavior? Turns out the sorted function, if you read its documentation, also takes another optional parameter called reverse. And if you set reverse equal to True, capital T in Python, that's going to go ahead and give us now the reverse order of that same sort. So let me go ahead and maximize my terminal window, rerun it again. And voila, if I scroll back up to the top, it's not alphabetically sorted. But if I keep going, keep going, keep going, keep going, the numbers are getting bigger. And voila, now Game of Thrones with 33 is all the way at the top. All right, so pretty cool. And again, the new functionality here in Python, at least, is that we can actually pass in functions to functions and leave it to the latter to call the former. So that's complicated just to say. But any questions or confusion now on how we are using dictionaries and how we are sorting things in this reverse, value-based way? Any questions or confusion? Anything in the chat or verbally, Brian? BRIAN YU: Looks like all questions are answered here. DAVID J. MALAN: OK. Then in that case, let me point out a common mistake. Notice that even though f is a function, notice that I did not call it there. That would be incorrect, the reason being we deliberately want to pass the function f into the sorted function so that the sorted function can take it upon itself to call f again and again and again. We don't want to just call it once by using the parentheses ourselves. We want to just pass it in by name so that the sorted function, which comes with Python, can instead do it for us. Santiago, did you have a question? AUDIENCE: Yes, I was going to ask. Why didn't we put f of title? I was going to ask that question specifically. DAVID J. MALAN: Oh, with the parentheses? AUDIENCE: Yeah. DAVID J. MALAN: Oh, OK, perfect. So because that would call the function once and only once. We want sorted to be able to call it again and again. 
Now, here's actually an example, as we've seen in the past, of a correct solution. This is behaving as I intend, a list of sorted titles from top to bottom in order of popularity. But it's a little poorly designed, because I'm defining this function f, whose name in the first place is kind of lame. But I'm defining a function only to use it in one place. And my god, the function is so tiny, it just feels like a waste of keystrokes to have defined a new function just to then pass it in. So it turns out, in Python, if you have a very short function whose purpose in life is meant to be to solve a local problem just once and that's it and it's short enough that you're pretty sure you can fit it on one line of code without things wrapping and starting to get ugly stylistically, it turns out you can actually do this instead. You can copy the code that you had in mind like this. And instead of actually defining f as a function name, you can actually use a special keyword in Python called lambda. You can specify the name of an argument for your function as before. And then you can simply specify the return value, thereafter deleting the function itself. So to be clear, key is still an argument to the sorted function. It expects as its value typically the name of a function. But if you've decided that, eh, this seems like a waste of effort to define a function, then pass the function in, especially when it's so short, you can do it in a one liner. A lambda function is an anonymous function. Lambda literally says, Python, give me a function. I don't care about its name. Therefore, you don't have to choose a name for it. But it does care still about its arguments and its return value. So it's still up to you to provide zero or more arguments and a return value. And notice I've done that. I've specified the keyword lambda followed by the name of the argument I want this anonymous, nameless function to accept. And then I'm specifying the return value. 
And with lambda functions, you do not need to specify return. Whatever you write after the colon is literally what will be returned automatically. So again, this is a very Pythonic thing to do. It's kind of a very clever one liner, even though it's a little cryptic to see for the very first time. But it allows you to condense your thoughts into a succinct statement that gets the job done so you don't have to start defining more and more functions that you or someone else then need to keep track of. All right. Any questions, then, on this? And I am pretty sure this is as complex or sophisticated as our Python code today will get. Yeah, over to Sophia. AUDIENCE: I was wondering why "lambda" is used specifically rather than some other keyword. DAVID J. MALAN: Yeah, so there's a long history in this. And if, in fact, you take a course on functional programming-- at Harvard, it's called CS51-- there's a whole etymology behind keywords like this. Let me defer that one for another time. But indeed, not only in Python but in other languages, as well, these things have come to exist called lambda functions. So they're actually quite commonplace in other languages, as well. And so Python just adopted the term of art. Mathematically, lambda is often used as a symbol for functions. And so they borrowed that same idea in the world of programming. All right. So seeing no other questions, let's go ahead and solve a related problem still with some Python but that's going to push up against the limits of efficiency when it comes to storing our data in CSV files. Let me go ahead and start fresh in this file, Favorites.py. All of the code I've written thus far, though, is on the course's website in advance, so you can see the incremental improvement. I'm going to go ahead and, again, import csv at the top. And now let's write a program this time that doesn't just automatically open up the CSV and analyze it looking for the total popularity of shows. 
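The same sort, condensed into a one-liner with a lambda, plus the reverse=True from before:

```python
titles = {"THE OFFICE": 26, "FRIENDS": 27, "GAME OF THRONES": 33}

# the lambda is an anonymous function: its argument comes before the colon,
# its return value after it, with no "return" keyword needed
for title in sorted(titles, key=lambda title: titles[title], reverse=True):
    print(title, titles[title])
```

Same behavior as the named function f, but without defining and naming a function used only once.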
Let's search for a specific show in the CSV and then go ahead and output the popularity thereof. And I can do this in a bunch of different ways. But I'm going to try to make this as concise as possible. I'm first going to ask the user to input a title. I could use CS50's get_string function. But recall that it's pretty much the same as Python's input function, so I'm going to use Python's input function today. And then I'm going to go ahead and, as before, open up that same CSV called Favorite TV Shows - Form Responses 1.csv in read-only mode as a variable called file. I'm then going to give myself a reader, and I'll use a DictReader again so I don't have to worry about knowing which columns things are in, passing in file. And then let's see. If I only care about one title, I can keep this program simpler. I don't need to figure out the popularity of every show. I just need to figure out the popularity of one show, the title that the human has typed in. So I'm going to go ahead and give myself a very simple int called counter and set it equal to 0. I don't need a whole dictionary. Just one variable suffices now. And I'm going to go ahead and iterate over the rows in the reader, as before. And then I'm going to say if the current row's title == the title the human typed in, let's go ahead and increment counter by 1. And it's already initialized, because I did that on line 7. So I think I'm good. And then at the end of this program, let's very simply print out the value of counter. So the purpose of this program is to prompt the user for a title of a show and then just report the popularity thereof by counting the number of instances of it in the file. So let me go ahead and run this with python of favorites.py. Enter. Let me go ahead and type in "The Office," Enter, and 19. Now, I don't remember exactly what the number was. But I remember The Office was more popular than that. I'm pretty sure it was not 19. 
Any intuition as to why this program is buggy or so it would seem? BRIAN YU: A few people in the chat are saying you need to remember to deal with capitalization and white space again. DAVID J. MALAN: Yeah. So we need to practice those same lessons learned from before. So I should really canonicalize the input that the human, I, just typed in and also the input that's coming from the CSV file. Perhaps the simplest way to do this is, up here, to first strip off leading and trailing white space in case I get a little sloppy and hit the Space bar where I shouldn't. And then let's go ahead and force it to uppercase just because. It doesn't matter if it's upper or lower, but at least we'll standardize things that way. And then when I do this, look at the current row's title. I think I really need to do the same thing. If I'm going to canonicalize one, I need to canonicalize the other. And now compare the all-caps, white-space-stripped versions of both strings. So now let me rerun it. Now I'm going to type in "The Office," Enter. And voila. Now I'm at 26, which I think is where we were at before. And in fact, now I, the user, can be a little sloppy. I can say "the office." I can run it again and say "the office" and then, for whatever reason, hit the Space bar a lot, Enter. It's still going to work. And indeed, though we seem to be belaboring the pedantic here with trimming off white space and so forth, just think. In a relatively small audience here, how many of you accidentally hit the Space bar or capitalized things differently? This happens massively on scale. And you can imagine this being important when you're tagging friends in some social media account. You're doing @Brian or the like. You don't want to have to require the user to type @, capital B, lowercase r-i-a-n, and so forth. So tolerating disparate, messy user input is such a common problem to solve, including in today's apps that we all use. All right. 
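The corrected search might be sketched like this, with a hard-coded string and sample rows standing in for the real input() call and CSV file:

```python
import csv
import io

# hypothetical rows standing in for the real CSV
data = io.StringIO("title\nThe Office\n the office \nBreaking Bad\n")

# canonicalize the human's input the same way as the file's data;
# the literal string stands in for input("Title: ")
title = "  the office".strip().upper()

counter = 0
for row in csv.DictReader(data):
    # canonicalize one side, canonicalize the other, then compare
    if row["title"].strip().upper() == title:
        counter += 1

print(counter)
```

Sloppy capitalization and stray spaces on either side no longer affect the count.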
Any questions, then, on this program, which I think is correct? Then let me ask a question of you. In what sense is this program poorly designed? In what sense is this program poorly designed? This is more subtle. But think about the running time of this program in terms of big O. What is the running time of this program if the CSV file has n different shows in it or n different submissions? So n is the variable in question. Yeah, what's the running time, Andrew? AUDIENCE: [INAUDIBLE] DAVID J. MALAN: Yeah, it's big O of n, because I'm literally using linear search by way of the for loop. That's how a for loop works in Python, just like in C. Starts at the beginning and potentially goes all the way till the end. And so I'm using implicitly linear search, because I'm not using any fancy data structures, no sets, no dictionaries. I'm just looping from top to bottom. So you can imagine that if we surveyed not just all of the students here in class but maybe everyone on campus or everyone in the world-- maybe we're Internet Movie Database, IMDb. There could be a huge number of votes and a huge number of shows. And so writing a program, whether it's in a terminal window like mine or maybe on a mobile device or maybe on a webpage for your laptop or desktop, it's probably not the best design to constantly loop over all of the shows in your database from top to bottom just to answer a single question. It would be much nicer to do things in log of n time or in constant time. And thankfully, over the past few weeks, both in C and in Python, we have seen smarter ways to do this. But I'm not practicing what I've preached here. And in fact, at some point, this notion of a flat-file database starts to get too primitive for us. Flat-file databases, like CSV files, are wonderfully useful when you just want to do something quickly or when you want to download data from some third party, like Google, in a standard, portable way. 
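One of those smarter ways, sketched with hypothetical data: pay big O of n once to build a dictionary, after which each query is answered without re-scanning the file:

```python
import csv
import io

# hypothetical rows standing in for the real CSV
data = io.StringIO("title\nThe Office\nBreaking Bad\nThe Office\n")

# one O(n) pass builds the dictionary of counts...
counts = {}
for row in csv.DictReader(data):
    title = row["title"].strip().upper()
    counts[title] = counts.get(title, 0) + 1

# ...after which each lookup is effectively constant time,
# instead of another linear search per question asked
print(counts.get("THE OFFICE", 0))
print(counts.get("SHERLOCK", 0))
```

This trade-off, some up-front work in exchange for fast repeated queries, is exactly what a proper database will soon do for us with fancier data structures.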
"Portable" means that it can be used by different people and different systems. CSV is about as simple as it gets, because you don't need to own Microsoft Word or Apple Numbers or any particular product. It's just a text file, so you can use any text editing program or any programming language to access it. But flat-file databases aren't necessarily the best structure to use ultimately for larger data sets, because they don't really lend themselves to more efficient queries. So CSV files, pretty much at best, you have to search top to bottom, left to right. But it turns out that there are better databases out there generally known as relational databases that, instead of being files in which you store data, they are instead programs in which you store data. Now, to be fair, those programs use a lot of RAM, memory, where they actually store your data. And they do certainly persist your data. They keep it long term by storing your data also in files. But between you and your data, there is this running program. And if you've ever heard of Oracle or MySQL or PostgreSQL or SQL Server or Microsoft Access or bunches of other popular products, both commercial and free and open source alike, relational databases are so similar in spirit to spreadsheets. But they are implemented in software. And they give us more and more features. And they use more and more data structures so that we can search for data, insert data, delete data, update data much, much more efficiently than we could if just using something like a CSV file. So let's go ahead and take our five-minute break here. And when we come back, we'll look at relational databases and, in turn, a language called SQL. All right. So we are back. And the goal at hand now is to transition from these fairly simplistic flat-file databases to a more proper relational database. And relational databases are indeed what power so many of today's mobile applications, web applications, and the like. 
Now we're beginning to transition to real-world software with real-world languages, at that. And so now, let me introduce what we're going to call SQLite. So it turns out that a relational database is a database that stores all of the data still in rows and columns. But it doesn't do so using spreadsheets or sheets. It instead does so using what we're going to call tables. So it's pretty much the same idea. But with tables, we get some additional functionality. With those tables, we'll have the ability to search for data, update data, delete data, insert new data, and the like. And these are things that we absolutely can do with spreadsheets. But in the world of spreadsheets, if you want to search for something, it's you, the human, doing it by manually clicking and scrolling, typically. If you want to insert data, it's you, the human, typing it in manually after adding a new row. If you want to delete something, it's you right clicking or Control-clicking and deleting a whole row or updating the individual cells therein. With SQL, Structured Query Language, we have a new programming language that is very often used in conjunction with other programming languages. And so today, we'll see SQL used on its own initially. But we'll also see it in the context of a Python program. So a language like Python can itself use SQL to do more powerful things than Python alone could do. So with that said, SQLite is like a light version of SQL. It's a more user-friendly version. It's more portable. It can be used on Macs and PCs and phones and laptops and desktops and servers. But it's incredibly common. In fact, in your iPhone and your Android phone, many of the applications you are running today on your own device are using SQLite underneath the hood. So it isn't a toy language per se. It's instead a relatively simple implementation of a language generally known as SQL. But long story short, there's other implementations of relational databases out there. 
And I rattled off several of them already-- Oracle and MySQL and PostgreSQL and the like. Those all have slightly different flavors or dialects of SQL. So SQL is a fairly standard language for interacting with databases. But different companies, different communities have kind of added or subtracted their own preferred features. And so the syntax you use is generally constant across all platforms. But we will standardize for our purposes on SQLite. And indeed, this is what you would use these days in the world of mobile applications. So it's very much germane there. So with SQLite, we're going to have ultimately the ability to query data and update data, delete data, and the like. But to do so, we actually need a program with which to interact with our database. So the way SQLite works is that it stores all of your data still in a file. But it's a binary file now. That is, it's a file containing 0's and 1's. And those 0's and 1's might represent text. They might represent numbers. But it's a more compact, efficient representation than a mere CSV file would be using ASCII or Unicode. So that's the first difference. SQLite uses a single file, a binary file, to store all of your data and represent it inside of that file by way of all of those 0's and 1's or the tables to which I alluded before, which are the analogue in the database world of sheets or spreadsheets in the spreadsheet world. So to interact with that binary file wherein all of your data is stored, we need some kind of user-facing program. And there's many different tools to use. But the standard one that comes with SQLite is called sqlite3, essentially version 3 of the tool. This is a command line tool similar in spirit to any of the commands you've run in a terminal window thus far that allows you to open up that binary file and interact with all of your tables. Now, here again, we kind of have a chicken and the egg problem. 
If I want to use a database but I don't yet have a database and yet I want to select data from my database, how do I actually load things in? Well, you can load data into a SQLite database in at least two ways. One, which I'll do in a moment, you can just import an existing flat-file database, like a CSV. And what you do is you save the CSV on your Mac or PC on your CS50 IDE. You run a special command with sqlite3. And it will just load the CSV into memory. It will figure out where all of the commas are. And it will construct inside of that binary file the corresponding rows and columns using the appropriate 0's and 1's to store all of that information. So it just imports it for you automatically. Approach 2 would be to actually write code in a language like Python or any other that actually manually inserts all of the data into your database. And we'll do that, as well. But let's start simple. Let me go ahead and run, for instance, sqlite3. And this is preinstalled on CS50 IDE, and it's not that hard to get it up and running on a Mac and PC, as well. I'm going to go ahead and run sqlite3 in my terminal window here. And voila. You just see some very simple output. It's telling me to type .help if I want to see some usage hints. But I know most of the commands, and we'll generally give you all of the commands that you might need. In fact, one of the commands that we can use is .mode, and another is .import. So generally, you won't use these that frequently. You'll only use them when creating a database for the first time when you are creating that database from an existing CSV file. And indeed, that's my goal at the moment. Let me take our CSV file containing all of your favorite TV shows and load it into SQLite in a proper relational database so that we can do better than, for instance, big O of n when it comes to searching that data and doing anything else on it. So to do this, I have to execute two commands. One, I need to put SQLite into CSV mode. 
And that's just to distinguish it from other flat-file formats, like TSV for tabs or some other format. And now I'm going to go ahead and run .import. Then I have to specify the name of the file to import, which is the CSV. And I'm going to go ahead and call my table shows. So .import takes two arguments, the name of the file that you want to import and the name of the table that you want to create out of that file. And again, tables have rows and columns. And the commas in the file are going to delineate where those columns begin and end. I'm going to go ahead and hit Enter. It looks like it flew by pretty fast. Nothing seems to have happened. But I think that's OK, because now we're going to have the ability to actually manipulate that data. But how do we manipulate the data? We need a new language. SQL, Structured Query Language, is the language used by SQLite and Oracle and MySQL and PostgreSQL and bunches of other products whose names you don't need to know or remember any time soon. But SQL is the language we'll use to query the database for information and do something with it. Generally speaking, a relational database and, in turn, SQL, which is a language via which you can interact with relational databases, support four fundamental operations. And there's sort of a crude acronym, pun intended, that is just helpful for remembering what those fundamental operations are with relational databases. CRUD stands for Create, Read, Update, and Delete. And indeed, the acronym is CRUD, C-R-U-D. So it helps you remember that the four basic operations supported by any relational database are create, read, update, delete. "Create" means to create or add new data. "Read" means to access and load into memory existing data. We've seen read before with opening files. "Update" and "delete" mean exactly that, as well, if you want to manipulate the data in your data set. Now, those are generic terms for any relational database. 
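If you'd rather see the idea in code than at the sqlite3 prompt, here's a rough sketch in Python of what .mode csv followed by .import do under the hood. The sample CSV rows here are made up for illustration; the real file would be the Google Forms export.

```python
import csv
import os
import sqlite3
import tempfile

def import_csv(csv_path, db, table):
    # Use the CSV's first row as column names (typed TEXT, as .import
    # assumes), then insert every remaining row into a new table.
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        rows = list(reader)
    columns = ", ".join(f'"{h}" TEXT' for h in headers)
    db.execute(f'CREATE TABLE "{table}" ({columns})')
    marks = ", ".join("?" for _ in headers)
    db.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    db.commit()

# A tiny, hypothetical stand-in for the real Google Forms export.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as f:
    csv.writer(f).writerows([
        ["Timestamp", "title", "genres"],
        ["10/19/2020 1:00:00", "The Office", "Comedy"],
        ["10/19/2020 1:00:05", "Breaking Bad", "Crime, Drama, Thriller"],
    ])
    path = f.name

db = sqlite3.connect(":memory:")
import_csv(path, db, "shows")
os.remove(path)
titles = [row[0] for row in db.execute("SELECT title FROM shows")]
```

This is only a sketch of the behavior, not sqlite3's actual implementation, but it captures the essence: header row becomes columns, every other row becomes a row in the new table.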
Those are the four properties typically supported by any relational database. In the world of SQL, there are some very specific commands or functions, if you will, that implement those four functionalities. They are CREATE and INSERT, both of which achieve the same thing as create more generally. The keyword SELECT is what's used to read data from a database. UPDATE and DELETE are the same. So it's kind of an annoying inconsistency. The acronym or the term of art is CRUD, Create, Read, Update, Delete. But in the world of SQL, the authors of the language decided to implement those four ideas by way of these five keywords or functions or commands, if you will, in the language SQL. So what you are looking at are five of the keywords that you can use in this new language called SQL to actually do something with your database. Now, what does that mean? Well, suppose that you wanted to manually create a database for the very first time. What do you do? Well, back in the world of spreadsheets, it's pretty straightforward, right? You'd open up Google Spreadsheets. You go to File, New or whatever. And then you just, voila, get a new spreadsheet into which you can start creating rows and columns and the like. In Microsoft Excel, Apple Numbers, same thing-- File menu, New Spreadsheet or whatever, and boom, you have a new spreadsheet. Now, in the world of SQL, databases are generally meant to be interacted with via code, though there are Graphical User Interfaces, GUIs, by which you can interact with them, as well. But we're going to use code today to do so, and programs at a command line. It turns out that you can create tables programmatically by running a command like this. 
So if you literally type out syntax along the lines of CREATE TABLE, then the name of your table, indicated here in lowercase, then a parenthesis, then the name of the column that you want to create and the type of that column, a la C, and then comma, dot, dot, dot, some more columns, this is generally speaking the syntax you'll use to create, in this language called SQL, a new table. Now, this is in the abstract. Again, table in lowercase is meant to represent the name you want to give to your actual table. column in lowercase is meant to be the name you want to give to your own column. Maybe it's Title. Maybe it's Genres. And dot, dot, dot just means, of course, you can have even more columns than that. But literally in a moment, if I were to type in this kind of command into the terminal window after running the sqlite3 program, I could start creating one or more tables for myself. And in fact, that's what already happened for me. This .import command, which is not part of SQL-- this is the equivalent of a Menu option in Excel or Google Spreadsheets. .import just automates a certain process for me. And what it did for me is this. If I type now .schema, which is another SQLite-specific command-- anything that starts with a . is specific only to sqlite3, this terminal window program-- notice what's outputted is this. Running .import automatically created for me a table in my database called shows. And it gave it three columns-- Timestamp, title, and genres. Where did those column names come from? Well, they came from the very first line in the CSV. And they all looked like text, so the type of those values was just assumed to be text, text, text. Now, to be clear, I could have manually typed this out and created these three columns in a new table called shows myself. But again, the .import command just automated that from a CSV. And the SQL is what we see here, CREATE TABLE shows and so forth. 
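As a sanity check, here's a small Python sketch showing that the schema .import generates is just an ordinary CREATE TABLE statement, which .schema in turn reads back out of SQLite's built-in sqlite_master table. The column names here mirror the lecture's shows table.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# The same kind of statement .import generated for us: column names
# from the CSV's header row, every type defaulting to TEXT.
db.execute('CREATE TABLE shows ("Timestamp" TEXT, title TEXT, genres TEXT)')

# .schema is essentially a query against SQLite's built-in
# sqlite_master table, which stores each table's original SQL text.
(schema,) = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'shows'"
).fetchone()
```

Printing `schema` would show the CREATE TABLE statement exactly as it was typed.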
So that is to say now, in this database, there is a file-- or rather, there is a table called shows inside of which is all of the data from that CSV. How do I actually get at that data? Well, it turns out there are other commands, as well-- not just CREATE, but also SELECT. SELECT is the equivalent of read, getting data from the database. And this one is pretty powerful. And the reason that so many data scientists and statisticians like using languages like SQL is that they make it relatively easy to just get data and filter that data and analyze that data, using syntax that's new for us today but relatively simple compared to other things we've seen. The SELECT command in SQL lets you select one or more columns from your table by the given name. So we'll see this now in just a moment here. How might I go about doing this? Well, let me go ahead and now, at my prompt after just clearing the window to keep things neat, let me try this out. Let me go ahead and SELECT, let's say, title FROM shows;. So why am I doing this? Well, again, the conventional format for the SELECT command is to say SELECT, then the name of one or more columns, then literally the preposition FROM, and then the name of the table from which you want to select that data. So if my table is called shows and the column is called title, it stands to reason that SELECT title FROM shows should give me back the data I want. Now, notice a couple of stylistic choices that aren't strictly required but are good style. Conventionally, I would capitalize any SQL keywords, including SELECT and FROM in this case, and then lowercase anything that's a column name or a table name, assuming you created those columns and tables, in fact, in lowercase. There are different conventions out there. Some people will uppercase. Some people will use something called camel case or snake case or the like. 
But generally speaking, I would encourage all caps for SQL syntax and lowercase for the column and table names. I'm going to go ahead now and hit Enter. And voila. We see rapidly a whole list of values outputted from the database. And if you think way back, you might recognize that this actually happens to be the same order as before, because the CSV file was loaded top to bottom into this same database table. And so what we're seeing, in fact, is all of that same data, duplicates and miscapitalizations and weird spacing and all. But suppose I want to see all of the data from the CSV. Well, it turns out you can select multiple columns. You can select not only title, but maybe timestamp was of interest. And this one admittedly was capitalized, because that's what it was in the spreadsheet. That was not something I chose manually. So if I just use a comma-separated list of column names, notice what I can do now. It's a little hard to see for us humans, because there's a lot going on now. But notice that in double quotes on the left, there are all of the timestamps, which represent the time at which you all submitted your favorite shows. And on the right of the comma, there's another quoted string that is the title of the show that you liked, although SQLite omits the quotes if it's just a single word, like Friends, just by convention. In fact, if I want to get all of the columns, turns out there's some shorthand syntax for that. * is the so-called wild card operator. And it will get me all of the columns from left to right in my table. And voila. Now I see all of the data, including all of the genres, as well. So now I effectively have three columns being outputted all at once here. Well, this is not that useful thus far. In fact, all I've been doing is really just outputting the contents of the CSV. But SQL's powerful because it comes with other features right out of the box, somewhat similar in spirit to functions that are built into Google Spreadsheets and Excel. 
But now we can use them ultimately in our own code. So functions like AVG, COUNT, DISTINCT, LOWER, MAX, MIN, and UPPER and bunches more, these are all functions built into SQL that you can use as part of your query to alter the data as it's coming back from the database-- not permanently, but as it's coming back to you-- so that it's in a format you actually care about. So for instance, one of my goals earlier was to get back just the distinct, the unique, titles. And we had to write all that annoying code using a set and then add things to the set and then loop over it again, right? That was not a huge amount of code. But it definitely took us, what, 5, 10 minutes to get the job done at least. In SQL, you can do all of that in one breath. I'm going to go ahead now and do this. SELECT not just title FROM shows. Let me go ahead and SELECT DISTINCT title FROM shows. So DISTINCT, again, is available in SQL and does what the name says. It's going to filter all of the titles to just give me the distinct ones back. So if I hit Enter now, you'll see a similarly messy list but including-- "no idea," someone that doesn't watch TV-- including an unsorted list of those titles. So I think we can probably start to clean this thing up as we did before. Let me go ahead and now SELECT not just DISTINCT, but let me go ahead and uppercase everything as well. And I can use UPPER as another function. And notice I'm just nesting things. The output of one function, as we've seen in many languages now, can be the input to another. Let me hit Enter now. And now it's getting a little more canonicalized, so to speak, because I'm using capitalization for everything. But it would seem that things still aren't really sorted. It's just the same order in which you inputted them but without duplicates this time. So it turns out that SQL has other syntax that we can use to make our queries more precise and more powerful. 
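A minimal sketch of those queries using Python's built-in sqlite3 module, with a few made-up rows standing in for the class's real submissions: DISTINCT alone still treats differently capitalized titles as different, while nesting UPPER inside it canonicalizes them.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")

# Hypothetical submissions, including a miscapitalized duplicate.
db.executemany("INSERT INTO shows VALUES (?)",
               [("The Office",), ("the office",), ("Friends",)])

# DISTINCT alone: "The Office" and "the office" count as different.
mixed = db.execute("SELECT DISTINCT title FROM shows").fetchall()

# Nesting UPPER inside DISTINCT collapses them into one row.
canonical = db.execute("SELECT DISTINCT UPPER(title) FROM shows").fetchall()
```

With these three rows, `mixed` has three entries but `canonical` only two.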
So in addition to these kinds of functions that you can use to alter the data that's being shown to you and coming back, you can also use these kinds of clauses or syntax in SQL queries. You can say WHERE, which is the equivalent of a condition. You can say select all of this data where something is true or false. You can say LIKE, where you can say give me data that isn't exactly this but is like this. You can order the data by some column. You can limit the number of rows that come back. And you can group identical values together in some way. So let's see a few examples of this. Let me go back here and play around now with-- how about The Office? That was the one we looked at earlier. So let me go ahead and SELECT title FROM shows WHERE title = "The Office";. So I've added this WHERE predicate, so to speak, WHERE title = "The Office." So SQL's nice. Similar in spirit to Python, it's perhaps more user friendly than C, in that everything kind of reads like an English sentence, even though it's a little more precise and a little more succinct. Let me go ahead and hit Enter. And voila. That's how many of you inputted The Office. But notice it's not everyone, is it? We're missing some still. It seems that I got back only those of you who typed in literally "The Office," capital T, capital O. So what if I want to be a little more resilient than that? Well, let me get back any rows where you all typed in "office." Maybe you omitted the article "the." So let me go ahead and say not title = "Office" but where the title is LIKE "Office." But I don't want it to just be "office." I want to allow for maybe some stuff at the beginning, maybe some stuff at the end. And for that, in the context of using LIKE, there's another wild card character: the percent sign. A percent sign at the start of the pattern represents zero or more characters to the left, and one at the end represents zero or more characters to the right. 
So it's kind of this catchall that will now find me all titles that somewhere have O-F-F-I-C-E inside of them. And it turns out LIKE is case insensitive, so I don't even need to worry about capitalization with LIKE. Now let me hit Enter. And voila. Now I get back more answers. And you can really see the messiness now. Notice up here one of you used lowercase. That tends to be common when typing things in quickly. One of you did it lowercase here and then also gave us an extra white space at the end. One of you just typed in "office." One of you typed in "the office" again with a space at the end. And so there's a lot of variation here. And that's why, when we forced everything to uppercase and we started trimming things, we were able to get rid of a lot of those redundancies. Well, in fact, let's go ahead and order this now. So let me go back to selecting the distinct uppercase title, so SELECT DISTINCT UPPER of title FROM shows. And let me now ORDER BY, which is a new clause, the uppercased version of title. So now notice there's a few things going on here. But I'm just building up more complicated queries, similar to Scratch, where we just started throwing more and more puzzle pieces at a problem. I'm selecting all of the distinct uppercase titles from the shows table. But I'm going to order the results this time by the uppercased version of title. So everything is going to be uppercased. And then it's going to be sorted A through Z. Hit Enter now, and now things are a little easier to make sense of. Notice the quotes are there only when there are multiple words in a title. Otherwise, sqlite3 doesn't bother showing them. But notice here's all the "the" shows. And if we keep scrolling up, the P's, the N's, the M's, the L's, and so forth-- it's indeed alphabetized thanks to using ORDER BY. All right. Well, let's start to solve more similar problems now in SQL by writing way less code than we did a bit ago in Python. 
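The difference between = and LIKE with % wildcards, plus ORDER BY, can be sketched like so, again with hypothetical rows. Note that SQLite's LIKE is case-insensitive for ASCII letters by default.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")

# Hypothetical rows: one exact title, one lowercase with a trailing
# space, and one unrelated show.
db.executemany("INSERT INTO shows VALUES (?)",
               [("The Office",), ("office ",), ("Archer",)])

# = matches only the exact string.
exact = db.execute(
    "SELECT title FROM shows WHERE title = 'The Office'").fetchall()

# LIKE with % on both sides matches the substring, case-insensitively.
fuzzy = db.execute(
    "SELECT title FROM shows WHERE title LIKE '%office%'").fetchall()

# ORDER BY sorts the distinct uppercased titles A through Z.
ordered = [row[0] for row in db.execute(
    "SELECT DISTINCT UPPER(title) FROM shows ORDER BY UPPER(title)")]
```

Here `exact` finds one row, `fuzzy` finds two, and `ordered` begins with ARCHER.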
Suppose I want to actually figure out the counts of these most popular shows. So I want to combine all of the identical shows and figure out all of the corresponding counts. Well, let me go ahead and try this. Let me go ahead and SELECT again the uppercased version of title. But I'm not going to do DISTINCT this time, because I want to do that a little differently. I'm going to SELECT the uppercased version of title, the COUNT of those titles-- so the number of times a given title appears, so COUNT is a new keyword now-- FROM shows. But now how do I figure out what the count is? Well, if you think about this table as having a lot of titles-- title, title, title, title, title-- it would be nice to kind of group the identical titles together and then actually count how many such titles we grouped together. And the syntax for that is literally to say GROUP BY UPPER(title);. This tells SQL to group all of the uppercased titles together, kind of collapse multiple rows into one, but keep track of the count of titles after that collapse. Let me go ahead now and hit Enter. And you'll see, very similar to one of the earlier Python programs we wrote, all of the titles on the left followed by a comma, followed by the count. So one of you really likes Tom and Jerry. One of you really likes Top Gear. If I scroll up, though, two of you really liked The Wire. 23 of you here like The Office, although we still haven't dealt with the whitespace issue here. So we could still combine that further by trimming whitespace if we want. But now we're getting these kinds of counts. Well, how can I go ahead and order this, as we did before? Let me go ahead here and add ORDER BY COUNT of title and then hit semicolon now. And now notice, just as in Python, everything is from smallest to largest initially, with Game of Thrones here down on the bottom. How can I fix this? 
Well, it turns out you can order things in descending order-- DESC for short-- instead of ASC, which is the default, for ascending. So if I do it in descending order, now I'd have to scroll all the way back up to the A's, the very top, to see where the lines begin. Whoops. If I scroll all the way back up to the top, we'll see where all of the A words begin up here. And now if I want to-- whoops, whoops, whoops. Did I do that right? Sorry. I don't want to-- there we go, ORDER BY COUNT descending. Now let me go ahead and-- this is just a little too unwieldy to see. Let me just limit myself to the top 10 and keep it simple and only look at the top 10 values here. Voila. Now I have Game of Thrones at 33, Friends at 26, The Office at 23-- though I think I'm still missing a few. Brian, do you recall the SQL function for trimming leading and trailing white space? BRIAN YU: I think it's just TRIM. DAVID J. MALAN: TRIM? OK. I myself did not remember. So when in doubt, google or ask Brian. So let me go ahead and fix this. Let me go ahead and SELECT uppercase of trimming the title first. And then I'm going to GROUP BY trimming and then uppercasing it there. And now Enter, and voila. Thank you, Brian. So now we're up to our 26 Offices here. So in short, it took us a little while to get to this point in the story in SQL. But notice what we've done. We've taken a program that took us a few minutes and certainly a dozen or more lines of code. And we've distilled it into something that, yes, is a new language but is just kind of a one liner. 
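Putting TRIM, UPPER, GROUP BY, COUNT, ORDER BY ... DESC, and LIMIT together, that whole "top 10 shows" query really is one statement. A sketch with stand-in data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT)")

# Hypothetical submissions with duplicates, stray case, and whitespace.
db.executemany("INSERT INTO shows VALUES (?)",
               [("Friends",), ("Friends",), ("Friends",),
                ("The Office",), ("the office ",)])

# TRIM strips the whitespace, UPPER canonicalizes case, GROUP BY
# collapses identical titles, COUNT tallies each group, and
# ORDER BY ... DESC with LIMIT 10 keeps only the most popular.
top = db.execute("""
    SELECT UPPER(TRIM(title)), COUNT(title)
    FROM shows
    GROUP BY UPPER(TRIM(title))
    ORDER BY COUNT(title) DESC
    LIMIT 10
""").fetchall()
```

With these five rows, FRIENDS tops the list at 3, and the two Office variants collapse into one group of 2.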
And once you get comfortable with a language like SQL, especially if you're not even a computer scientist but maybe a data scientist or an analyst of some sort who spends a lot of their day looking at financial information or medical information or really any data set that can be loaded into rows and columns, once you start to speak and read SQL as a human can you start to express some pretty powerful queries relatively succinctly and, boom, get back your answer. And by using a command line program, like sqlite3, you can immediately see the results there, albeit as very simplistic text. But as mentioned, too, there are also some graphical programs out there, free and commercial, that also support SQL, where you can still type these commands. And then they will show the results to you in a more user friendly way, much like Windows or macOS would by default. So any questions now on the syntax or capabilities of SELECT statements? BRIAN YU: One question came in. Where is the file with this data actually being stored? DAVID J. MALAN: Good question. Where is the file actually being stored? So before quitting, I can actually save this file as anything I want. The file extension would typically be .db. And in fact, Brian, do you mind just checking? What's the syntax for writing the file manually with dot something? It would be under .help, I think. BRIAN YU: I think it's .save followed by the name of the file. DAVID J. MALAN: .save, so I'll call this shows.db, Enter. If I now go ahead and open up another terminal window and type our old friend ls, you'll see that now I have a CSV file. I have my Python file from before. And I have a new file called shows.db, which I've created. That is the binary file that contains the table that I've loaded dynamically in from that CSV file. Any other questions on SELECT queries or what we can do with them? BRIAN YU: Yeah, a few people are asking about what the runtime of this is. DAVID J. MALAN: Yeah, really good question. What is the runtime? 
I'm going to come back to that question in just a little bit if that's OK. Right now, it's admittedly big O of n. I've not actually done anything better than we did with our CSV file or our Python code. Right now, it's still big O of n by default. But there's going to be a better answer to that that's going to make it something much more logarithmic. So let me come back to that feature when it's time to enable it. But in fact, let's start to take some steps toward that. Because it turns out, when loading in data, we're not always going to have the luxury of just having one big file in CSV format that we import, and we go about our business. We're going to have to decide in advance how we want to store the data and what data we want to store and what the relationships might be across not one single table, but multiple tables. So let me go ahead and run one other command here that actually introduces our first problem. Let me go ahead and SELECT title FROM shows WHERE genres equals, for instance, "Comedy." That was one of the genres. And notice that we get back a whole bunch of results. But I bet I'm missing some. I'm skimming through this pretty quickly. But I bet I'm missing some, because if I check if genres = "Comedy," what am I omitting? Well, those of you who checked multiple boxes might have said something is a comedy and a drama or comedy and romance or maybe a couple of other permutations of genres. If I'm searching for equality here, = "Comedy," I'm only going to get those favorites from you where you only said, my favorite TV show is a comedy. But what if we use LIKE comedy instead? Then we could say, well, so long as the word "comedy" is in there, we should get back even more results. And let me stipulate that, indeed, I now have a longer list of results. Now we have all shows where you checked at least the Comedy box. 
But unfortunately, this starts to get a little sloppy, because recall what the Genres column looks like. Let me SELECT genres FROM shows;. Notice that all of the genres that we loaded into this table from the CSV file are a comma-separated list of genres. That's just the way Google Forms did it. And that's fine for CSV purposes. That's kind of fine for SQL purposes, but this is kind of messy. Generally speaking, storing comma-separated lists of values in a SQL database is not what you should be doing. The whole point of using a SQL database is to move away from commas and CSVs and to actually store things more cleanly. Because in fact, let me propose a problem. Suppose I want to search not for comedy but maybe also music, like this, thereby allowing me to find any shows where the word "music" is somewhere in the comma-separated list. There's a subtle bug here. And you might have to think back to where we began, the form that you pulled up. I can't show the whole thing here, but we started with action, adventure, animation, biography, dot, dot, dot, music. Musical was also there, as a distinct option. Music versus a musical are two different types of genres. But notice my query at the moment. What's problematic with this? At the moment, we would seem to have a bug whereby this query will select not only "music," but also "musical." And so this is just where things are getting messy. Now, yeah, you know what? We could kind of clean this up. Maybe we could put a comma here so that it can't just be music something. It has to be music comma. But what if music is the last box that you checked? Well, then it's music nothing. There is no comma. So now I need to OR things together. So maybe I have to do something like WHERE genres LIKE "%Music,%" OR genres LIKE "%Music", like this. But honestly, this is just getting messy. This is poorly designed. 
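To see the Music-versus-Musical bug concretely, here's a tiny reproduction (the shows and their genre lists are invented): the naive pattern %Music% also matches rows whose genre list contains only Musical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")

# Invented rows: one show tagged Musical, one tagged Music.
db.executemany("INSERT INTO shows VALUES (?, ?)", [
    ("Glee", "Comedy, Musical"),
    ("The Muppet Show", "Comedy, Music"),
])

# The naive pattern matches both rows, because "Music" is a prefix of
# "Musical" -- a false positive born of the comma-separated list.
naive = [row[0] for row in db.execute(
    "SELECT title FROM shows WHERE genres LIKE '%Music%'")]
```

Both titles come back, even though only one of them is actually tagged Music.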
If you're just storing your data as a comma-separated list of values inside of a column and you have to resort to this kind of hack to figure out, well, maybe it's over here or here or here, and thinking about all the permutations of syntax, you're doing it wrong. You're not using a SQL database to its fullest potential. So how do we go about designing this thing better and actually load this CSV into a database a little more cleanly? In short, how do we get rid of the stupid commas in the Genres column and instead put one word, "comedy" or "music" or "musical," in each of those cells, so to speak? Not two, not three-- one only, without throwing away some of those genres. Well, let me introduce a few building blocks that will get us there. It turns out, when creating your own tables and loading data into a database on your own, we're going to need more than just SELECT. SELECT, of course, is just for reading. But if we're going to do this better and not just use sqlite3's built-in .import command, but instead write some code to load all of our data into maybe two tables-- one for the titles, one for the genres-- we're going to need a little more expressiveness when it comes to SQL. And so for that, we're going to need, one, the ability to create our own tables. And we've seen a glimpse of this before. But we're also going to need to see another piece of syntax, as well: inserting. Inserting is another command that you can execute on a SQL database in order to actually add data to a database, which is great. Because if I want to ultimately iterate over that same CSV but, this time, manually add all of the rows to the database myself, well, then I'm going to need some way of inserting. And the syntax for that is as follows. INSERT INTO the name of the table, the column or columns that you want to insert values into, then literally the word VALUES, and then literally in parentheses again, the actual list of values. 
So it's a little abstract when we see it in this generic form. But we'll see this more explicitly in just a moment here, as well. So when it comes to inserting something into a database, let's go ahead and try this. So suppose that-- let's see. What's a show that-- The Muppet Show. I grew up loving The Muppet Show. It was out in, like, the '70s. And I don't think it was on the list, but I can check this for sure. So SELECT * FROM shows WHERE title LIKE-- let's just search for "muppets" with a wild card. And I'm guessing no one put it there. Good. So it's a missed opportunity. I forgot to fill out the form. I could go back and fill out the form and re-import the CSV, but let's go ahead and do this manually. So let me go ahead and INSERT INTO shows what columns? title and genres, and I guess I could do a Timestamp just for kicks. And then I'm going to insert what values? The values will be, well, I don't know, whatever time it is now. So I'm going to cheat there rather than look up the date and the time. The title will be "The Muppet Show." And the genres will be-- it was kind of a comedy. It was kind of a musical. So we'll kind of leave it at that. Semicolon. So again, this follows the standard syntax here of specifying the table you want to insert into, the columns you want to insert into, and the values you want to put into those columns. And I'm going to go ahead and hit Enter now. Nothing seems to have happened. But if I now run that same query-- oh, OK, it's still nothing, because I made a subtle mistake. I was searching for "Muppets," plural. Let me search for "Muppet," singular-- The Muppet Show. Voila. Now you see my row in this database. And so INSERT gives us the ability to insert new rows into the database. Suppose you want to update something. You know, some of the Muppet Shows were actually pretty dramatic. So how might we do that? 
Well, I can say UPDATE shows SET-- let's see-- genres = "Comedy, Drama, Musical" WHERE title = "The Muppet Show." So again, I'll pull up the canonical syntax for this in a bit. But for now, just a little teaser, you can update things pretty simply. And even though it takes a little getting used to the syntax, it kind of does what it says. UPDATE shows SET genres = this WHERE title = that. And now I can go ahead and Enter. If I go ahead and select the same thing, just like in a terminal window, you can go up and down. That's how I'm typing so quickly. I'm just going up and down to previous commands. Voila. Now I see that the Muppet Show is a comedy, a drama, and a musical. Well, I take issue, though, with one of the more popular shows that was in the list. A whole bunch of you liked, let's say, Friends, which I've never really been a fan of. And let me go ahead and SELECT title FROM shows WHERE title = "Friends." And maybe I should be a little more rigorous than that. I should say title LIKE "Friends" just in case there were different capitalizations. Enter. A lot of you really liked Friends. In fact, how many of you? Well, recall that I can do this. I can say COUNT, and I can let SQL do the count for me. 26 of you, I disagree with strongly. And there's a couple of you that even added all the dots, but we'll deal with you later. So suppose I do take issue with this. Well, DELETE FROM shows WHERE title = "Friends"-- actually, title LIKE "Friends." Let's get them all. Enter. And now if we SELECT this again, I'm sorry. Friends has been canceled. So you can again execute these fundamental commands of CRUD, Create, Read, Update, and Delete, by using CREATE or INSERT, by using SELECT, by using UPDATE literally and DELETE literally, as well. And that's about it. 
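The INSERT, UPDATE, and DELETE steps above can be sketched in Python with parameterized placeholders, which also sidestep any quoting headaches; the data here is abbreviated to just the rows the demo touches.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (title TEXT, genres TEXT)")

# Create: INSERT INTO names the columns, then VALUES supplies them.
db.execute("INSERT INTO shows (title, genres) VALUES (?, ?)",
           ("The Muppet Show", "Comedy, Musical"))

# Update: SET rewrites the genres of every row WHERE the title matches.
db.execute("UPDATE shows SET genres = ? WHERE title = ?",
           ("Comedy, Drama, Musical", "The Muppet Show"))
(genres,) = db.execute(
    "SELECT genres FROM shows WHERE title = 'The Muppet Show'").fetchone()

# Delete: remove every matching row entirely.
db.execute("INSERT INTO shows (title, genres) VALUES ('Friends', 'Comedy')")
db.execute("DELETE FROM shows WHERE title LIKE 'Friends'")
remaining = db.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
```

After the delete, only The Muppet Show remains, now tagged with all three genres.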
Even though this was a lot all at once, there really are just those four fundamental operations in SQL, plus some of these add-on features, like additional functions such as COUNT that you can use, and also some of these keywords like WHERE and the like. Well, let me propose that we now do better. If we have the ability to select data and create tables and insert data, let's go ahead and write our own Python script that uses SQL in a loop to read over my CSV file and insert, insert, insert each of the rows manually. Because honestly, it will take me forever to manually type out hundreds of SQL queries to import all of your rows into a new database. I want to write a program that does this instead. And I'm going to propose that we design it in the following way. I'm going to have two tables this time, represented here with this artist's rendition. One is going to be called shows. One is going to be called genres. And this is a fundamental principle of designing relational databases, to figure out the relationships among data and to normalize your data. To normalize your data means to eliminate redundancies. To normalize your data means to eliminate mentions of the same words again and again and have just single sources of truth for your data, so to speak. So what do I mean by that? I'm going to propose that we instead create a simpler table called shows that has just two columns. One is going to be called id, which is new. The other is going to be called title, as before. Honestly, I don't care about timestamps, so we're just going to throw that value away, which is another upside of writing our own program. We can add or remove any data we want. For id, I'm introducing this, which is going to be a unique identifier, literally a simple integer-- 1, 2, 3, all the way up to a billion or 2 billion, however many favorites we have. I'm just going to let this auto increment as we go. Why? 
I propose that we move to another table all of the genres and that, instead of having one or two or three or five genres in one column as a stupid comma-separated list-- which is stupid only in the sense that it's just messy, right? It means that I have to run stupid commands where I'm checking for the comma here, the comma there. It's very hackish, so to speak. Bad design. Instead of doing that, I'm going to create another table that also has two columns. One is going to be called show_id, and the other is going to be called genre. And genre here is just going to be a single word now. That column will contain single words for genres, like "comedy" or "music" or "musical." But we're going to associate all of those genres with the original show to which they belong, per your Google form submissions, by using this show_id here. So what does this mean in particular? By adding to our first table, shows, this unique identifier-- 1, 2, 3, 4, 5, 6-- I can now refer to that same show in a very efficient way using a very simple number instead of redundantly having The Office, The Office, The Office again and again. I can refer to it by just one canonical number, which is only going to be 4 bytes or 32 bits. Pretty efficient. But I can still associate that show with one genre or two or three or more or even none. So in this way, every row in our current table is going to become one or more rows in our new pair of tables. We're factoring out the genres so that we can add multiple rows for every show, potentially, but still remap those genres back to the original show itself. So what are some of the buzzwords here? What's some of the language to be familiar with? Well, we need to know what kinds of types are at our disposal here. So for that, let me propose this. Let me propose that we have this list here. It turns out, in SQLite, there are five main data types. 
And that's a bit of an oversimplification, but there are five main data types, some of which look familiar, a couple of which are a little weird. INTEGER is a thing. REAL is the same thing as float. So an integer might be a 32-bit or 4-byte value, like 1, 2, 3, or 4, positive or negative. A real number's going to have a decimal point in it, a floating point value, probably 32 bits by default. But those kinds of things, the sizes of these types, vary by system, just like they technically did in C. So do they vary by system in the world of SQL. But generally speaking, these are good rules of thumb. TEXT is just that. It's sort of the equivalent of a string of some length. But then in SQLite, it turns out there's two other data types we've not seen before-- NUMERIC and BLOB. But more on those in just a little bit. BLOB is Binary Large Object. It means you can store 0's and 1's in your database. NUMERIC is going to be something that's number-like but isn't a number per se. It's like a year or a time, something that has numbers in it but isn't just a simple integer. And let me propose, too, that when we create our own columns manually by executing the SQL code ourselves, SQLite is going to allow us to specify that a column cannot be null. Thus far, we've ignored this. But some of you might have taken the fifth and just not given us the title of a show or a genre. Your answers might be blank. Some of you, maybe in registering for a website, don't want to provide information like where you live or your phone number. So a database in general sometimes does want to support null values. But you might want to say that a column can't be null. A website probably needs your email address, needs your password and a few other fields, but not everything. And there's another keyword in SQL, just so you've seen it, called UNIQUE, where you can additionally say that whatever values are in this column must be unique. So a website might also use that. 
If you want to make sure that the same email address can't register for your website multiple times, you just specify that the email column is unique. That way, you can't put multiple people in with identical email addresses. So long story short, this is just more of the tools in our SQL toolkit, because we'll see some of these now indirectly. And the last piece of jargon we need before designing our own tables is going to be this. It turns out that, in SQL, there's this notion of primary keys and foreign keys. And we've not seen this in spreadsheets. Unless you've been working in the real world for some years and you have fairly fancy spreadsheets in front of you as an analyst or financial person or the like, odds are you've not seen keys or unique identifiers in quite the same way. But they're relatively simple. In fact, let me go back to our picture before and propose that when you have two tables like this and you want to use a simple integer to uniquely identify all of the rows in one of the tables, that's called technically an ID. That's what I'll call it by convention. You could call it anything you want, but ID just means it's a unique identifier. But semantically, this ID is what's called a primary key. A primary key is the column in a table that uniquely identifies every row. This means you can have multiple versions of The Office in that title field. But each of those rows is going to have its own number uniquely, potentially. So primary key uniquely identifies each row. In another table, like genres, which I'm proposing we create in just a moment, it turns out that you're welcome to refer back to another table by way of that unique identifier. But when it's in this context, that ID is called a foreign key. So even though I've called it show_id here, that's just a convention in a lot of SQL databases to imply that this is technically a column called ID in a table called show or shows, plural in this case. 
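The NOT NULL and UNIQUE constraints just described can be tried out directly. This sketch again uses Python's built-in sqlite3 module; the users table and email address are hypothetical examples, not part of the lecture's data set:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical users table: every row must have an email address,
# and no two rows may share one
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, "
           "email TEXT NOT NULL UNIQUE)")
db.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

# Registering the same address twice violates UNIQUE
try:
    db.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Omitting the address entirely violates NOT NULL
try:
    db.execute("INSERT INTO users (email) VALUES (NULL)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

print(db.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1
```

The database itself enforces the rules: both bad inserts are rejected, and only the one valid row survives.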
So if there's a number 1 here, and suppose that The Office has a unique ID of 1, we would have a row in this table where id is 1 and title is The Office. The Office might be in the comedy category, the drama category, the romance category, so multiple ones. Therefore, in the genres table, we want to insert three rows, with the number 1, 1, 1 in each of those rows but the words "comedy," "drama," "romance" in each of those rows respectively. So again, the goal here is to just design our database better, not have these stupid comma-separated lists of values inside of a single column. We want to kind of blow that up, explode it, into individual rows. You might think, well, why don't we just use multiple columns? But again, per our principle from spreadsheets, you should not be in the habit of adding more and more columns when the data is all the same, like genre, genre, genre, right? The stupid way to do this in the spreadsheet world would be to have one column called Genre 1, another column called Genre 2, another column called Genre 3, Genre 4. And you can imagine just how stupid and inefficient this is. A lot of those columns are going to be empty for shows with very few genres. And it's just kind of messy at that point. So better, in the world of relational databases, to have something like a second table, where you have multiple rows that somehow link back to that primary key by way of what we're calling, conceptually, a foreign key. All right. So let's go ahead now and try to write this code. Let me go back to my IDE. Let me quit out of SQLite now. And let me just move away. I'm going to move this away, my file, for just a moment so that we're only left with our original data. Let's go about implementing a final version of my Python file that does this-- one, creates two tables-- one called shows, one called genres-- and then, two, in a for loop, iterates over that CSV and inserts some data into the shows table and other data into the genres table. How can we do this programmatically? 
Well, there's a final piece of the puzzle that we need. We need some way of bridging the world of Python and SQL. And here, we do need a library, because it would just be way too painful to do without a library. It can be CS50's. CS50's library, as we'll see, makes this very simple. There are other third-party commercial and open-source libraries that you can also use in the real world, as well, that do the same thing. But the syntax is a little less friendly, so we'll start by using the CS50 library, which in Python, recall, has functions like get_string and get_int and get_float. But today, it also has support, it turns out, for SQL capabilities, as well. So I'm going to go back to my Favorites file. And I'm going to import not only CSV, but I'm also going to import from the CS50 library a feature called SQL. So we have a variable, if you will, inside of the CS50 library or, rather, a function inside of the CS50 library called SQL that, if I call it, will allow me to load a SQLite database into memory. So how do I do this? Well, let me go ahead and add a couple of new lines of code. Let me go ahead and open up a file called shows.db, but this time in write mode. And then just for now, I'm going to go ahead and close it right away. This is a Pythonic way of creating an empty file. It's kind of stupid looking, but by opening a file called shows.db in write mode and then immediately closing it, it has the effect of creating the file, closing the file. So I now have an empty file with which to interact. I could also do this, as an aside, with the command touch shows.db. touch is kind of a strange command, but in a terminal window, it means to create a file if it doesn't exist. So we could also do that instead. But that would be independent of Python. So once I've created this file, let me go ahead and open the file now as a SQLite database. I'm going to declare a variable called db for database. I'm going to use the SQL function from CS50's library. 
And I'm going to open it via this somewhat cryptic string-- sqlite:///shows.db. Now, it looks like a URL, http://, but it's SQLite instead. And there's three slashes instead of the usual two. But this line of code, line 6, has the effect of opening that otherwise empty file, with nothing in it yet, as a SQLite database using CS50's library. Why did I do that? Well, I did that because I now want to create my first table. Let me go ahead and execute, db.execute. So there's a function called execute inside of the CS50 SQL library. And I'm going to go ahead and run this. CREATE TABLE called shows, the columns of which are an id, which is going to be an integer, a title, which is going to be text, the primary key in which is going to be the id column. So this is a bit cryptic. But let's see what's happening. I seem to now, in line 8, be combining Python with SQL. And this is where now programming gets really powerful, fancy, cool, difficult, however you want to perceive it. I can actually use one language inside of another. How? Well, SQL is just a bunch of textual commands. Up until now, I've been typing them out manually in this program called sqlite3. There's nothing stopping me, though, from storing those same commands in Python strings and then passing them to a database using code. The code I'm using is a function called execute. And its purpose in life, and CS50 staff wrote this, is to pass the argument from your Python code into the database for execution. So it's like the programmatic way of just typing things manually at the SQLite prompt a few minutes ago. So that's going to go ahead and create my table called shows, in which I'm going to store all of those unique IDs and also the titles. And then let me do this again. db.execute CREATE TABLE genres, and that's going to have a column called show_id, which is an integer also, genre, which is text. 
And lastly, it's going to have a foreign key-- it's going to wrap a little long here-- on show_id, which references the shows table's id. All right, so this is a lot. So let's just recap left to right. db.execute is my Python function that executes any SQL I want. CREATE TABLE genres creates a table called genres. The columns in that table will be something called show_id, which is an integer, and genre, which is a text field. But it's going to be one genre at a time, not multiple. And then here, I'm specifying that the foreign key will be the show_id column, which happens to refer back to the shows table's id column. It's a little cryptic, but all this is doing is implementing for us the equivalent of this picture here. I could have manually typed both of these SQL commands at that blinking prompt. But again, no, I want to write a program now in Python that creates the tables for me and now, more interestingly, loads the data into that database. So let's go ahead and do this now. I'm not going to select a title from the user, because I want to import everything. I'm not going to use any counting or anything like that. So let's go ahead and just go inside of my loop as before. And this time, let's go ahead and, for row in reader, let's go ahead and get the current title, as we've always done. But let's also, as always, go ahead and strip it of white space and capitalize it, just to canonicalize it. And now I'm going to go ahead and execute db.execute, quote unquote, INSERT INTO shows the title column, the value of "title." So I want to put the title here. It turns out that SQL libraries like ours support one final piece of syntax, which is a placeholder. In C, we used %s. In Python, we just use curly braces and put the word right there. In SQL, we have a third approach to the same problem-- just syntactically different, but conceptually the same. You put a question mark where you want to put a placeholder. 
And then outside of this string, I'm going to actually type in the value that I want to plug into that question mark. So this is so similar to printf in week 1. But instead of %s, it's a question mark now and then a comma-separated list of the arguments you want to plug in for those placeholders. So now this line of code 16 has just inserted all of those values into my database. And let's go ahead and run this. Before I go any further, let me go ahead and do this. I'm going to go ahead now and run python of favorites.py and cross my fingers, as always. It's taking a moment, taking a moment. That's because there's a decent-sized file there. Or I screwed up. This is taking too long. Oh, OK. I should have just been more patient. All right. So it just seems my connection's a little slow. So as I expected, everything is 100% correct, and it's working fine. So now let's go ahead and see what I actually did. If I type ls, notice that I have a file called shows.db. This is brand new, because my Python program created it this time. Let's go ahead and run sqlite3 of shows.db just so I can now see what's inside of it. Notice that I can do .schema just to see what tables exist. And indeed, the two tables that I created in my Python code seem to exist. But notice that there's-- if I do SELECT * FROM shows, let's see all the data. Voila. There is a table that's been programmatically created. And it has, notice this time, no timestamps, no genres. But it has an ID on the left and the title on the right. And amazingly, all of the IDs are monotonically increasing from 1 on up to 513, in this case. Why is that? Well, one of the features you get in a SQL database is if you define a column as being a primary key in SQLite, it's going to be auto incremented for you. Recall that nowhere in my code did I even have a line, an integer, inputting 1, then 2, then 3. I could absolutely do that. 
I could have done something like this-- counter = 1. And then down here, I could have said id, title, give myself two placeholders, and then pass in the counter each time. I could have implemented this myself and then, on each iteration, done counter += 1. But with SQL databases, as we've seen, you get a lot more functionality built in. I don't have to do any of that, because if I've declared that ID as being a primary key, SQLite is going to insert it for me and increment it also for me, as well. All right. So if I go back to SQLite, though, notice that I do have IDs and titles. But if I SELECT * FROM genres, there's of course nothing there yet. So how now do I get all of the genres for each of these shows in? I need to finish my script. So inside of this same loop, I have not only the title in my current row, but I also have genres in the current row. But the genres are separated by commas. Recall that in the CSV, next to every title, there's a comma-separated list of genres. How do I get at each genre individually? Well, I'd like to be able to say for genre in row bracket genres. But this is not going to work, because that's not going to be split up based on those commas. That's literally just going to iterate over, in fact, all of the characters in that string, as we saw last week. But it turns out that strings in Python have a fancy split function, whereby I can split on a comma followed by a space. And what this function will do for me in Python is take a comma-separated list of genres and explode it, so to speak-- split it on every comma and space-- into an actual Python list, a la square brackets, containing genre after genre. So now I can iterate over that list of individual genres. And inside of here, I can do db.execute INSERT INTO genres show_id, genre, the values, question mark, question mark. But huh, there's a problem. I can definitely plug in the current genre, which is this. 
But I need to put something here still. For that first question mark, I need a value for the show_id. How do I know what the ID is of the current TV show? Well, it turns out the library can help you with this. When you insert new rows into a table that has a primary key, it turns out that most libraries will return you that value in some way. And if I go back to line 15 and I actually store the return value of db.execute after using INSERT, the library will tell me what was the integer that was just used for this given show. Maybe it's 1, 2, 3. I don't have to know or care as the programmer. But the return value, I can store in a variable. And then down here, I can literally put that same ID so that now, if I am inputting The Office, whose ID is 1, into the shows table and its genres are comedy, drama, romance, I can now inside of this for loop, this nested for loop, insert 1 followed by "comedy," 1 followed by "drama," 1 followed by "romance," three rows all at once. And so now let's go back down here into my terminal window. Let me remove the old shows.db with rm, just to start fresh. Let me go ahead and rerun python of favorites.py. I'll be more patient this time, because the cloud's being a little slow. So it's doing some thinking. And in fact, there's more work being done now. At this point in the story, my program is presumably iterating over all of the rows in the CSV. And it's inserting into the shows table one at a time, and then it's inserting one or more genres into the genres table. It's a little slow. If we were on a faster system or if I were doing it on my own Mac or PC, it would probably go more quickly. But you can see here an example of why I used the .import command in the first place. That automated some of this process. But unfortunately, it didn't allow me to change the format of my data. 
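Putting the whole loop together: the sketch below approximates the favorites.py just described, but with Python's standard sqlite3 module in place of the CS50 library, so cursor.lastrowid plays the role of db.execute's return value. The in-memory database and the two sample CSV rows are stand-ins for the real file:

```python
import csv
import io
import sqlite3

# Stand-in for favorites.csv (two hypothetical rows; the real file has hundreds)
csv_text = '''Title,Genres
The Office,"Comedy, Romance"
Breaking Bad,"Crime, Drama, Thriller"
'''

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, PRIMARY KEY(id))")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT, "
           "FOREIGN KEY(show_id) REFERENCES shows(id))")

for row in csv.DictReader(io.StringIO(csv_text)):
    title = row["Title"].strip().title()  # canonicalize, as in lecture
    # lastrowid is sqlite3's counterpart of the CS50 library's return
    # value: the auto-incremented primary key of the row just inserted
    show_id = db.execute("INSERT INTO shows (title) VALUES (?)",
                         (title,)).lastrowid
    for genre in row["Genres"].split(", "):
        db.execute("INSERT INTO genres (show_id, genre) VALUES (?, ?)",
                   (show_id, genre))

print(db.execute("SELECT * FROM shows").fetchall())
print(db.execute("SELECT * FROM genres WHERE show_id = 2").fetchall())
```

One design note: because id is declared as the INTEGER PRIMARY KEY, SQLite assigns 1, 2, 3, and so on automatically, so the script never has to maintain its own counter.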
But the key point to make here is that even though this is taking a little bit of time to insert these hundreds of rows all at once, I'm only going to have to do this once. And what was asked a bit ago was the performance of this. It turns out that now that we have full control over the SQL database, it turns out we're going to have the ability to actually improve the performance thereof. Oh, OK. As expected, it finished right on time. And let me go ahead now and run sqlite3 on shows.db. All right, so now I'm back in my raw SQL environment. If I do SELECT * FROM shows, which I did before, we'll see all of this as before. If I SELECT * FROM shows WHERE title = "THE OFFICE," I'll see the actual unique IDs of all of those. We didn't bother eliminating duplicates. We just kept everything as is, but we gave everything a unique ID. But if I now do SELECT * FROM genres, we'll see all of the values there. And notice the key detail. There is only one genre per row here. And so we can ultimately line those up with our titles. And our titles here, we had all of these here. Something's wrong. I want to get this right. Let's go ahead and take our second and final five-minute break here. And we'll come back, and I will explain what's going on. All right, we are back. And just before we broke up, my own self-doubt was starting to creep in. But I'm happy to say, with no fancy magic behind the scenes, everything was actually working fine. I was just doubting the correctness of this. If I do SELECT * FROM shows, I indeed get back two columns, one with the unique ID, the so-called primary key, followed by the title of each of those shows. And if I similarly search for * FROM genres, I get single genres at a time. But on the left-hand side are not primary keys per se but now those same numbers here in this context called foreign keys that map one to the other. So for instance, whatever show 512 is had five different genres associated with it. 
And in fact, if I go back a moment to shows, it looks like Game of Thrones was decided by one of you as belonging in thriller, history, adventure, action, and war, as well, those five. So now this is what's meant by relational database. You have this relation or relationship across multiple tables that link some data in one to some other data in the other. The catch, though, is that it would seem a little harder now to answer questions, because now I have to kind of query two tables or execute two separate queries and then combine the data. But that's not actually the case. Suppose that I want to answer the question of, what are all of the musicals among your favorite TV shows? I can't select just the shows, because there's no genres in there anymore. But I also can't select just the genres table, because there's no titles in there. But there is a value that's bridging one and the other, that foreign key to primary key relationship. So you know what I can do off the top of my head? I'm pretty sure I can select all of the show_ids from the genres table where a specific genre = "Musical." And I don't have to worry about commas or spaces now, because again, in this new version that I have designed programmatically with code, musical and every other genre is just a single word. If I hit Enter, all of these show_ids were decided by you all as belonging to musicals. But now, these raw IDs alone are not that interesting, and I certainly don't want to execute 10 or so queries manually to look up every one of those IDs. But notice what we can do in SQL, as well. I can nest queries. Let me put this whole query in parentheses for just a moment and then prepend to it the following. SELECT title FROM shows WHERE the primary key, id, is in this subquery. So you can have nested queries, similar in spirit to nested for loops in Python and C. In this case, just like in grade school math, whatever is in the parentheses will be executed first. 
Then the outer query will be executed using the results of that inner query. So if I select the title from shows where the ID is in that list of IDs, voila. It seems that, somewhat amusingly, several of you think that Breaking Bad, Supernatural, Glee, Sherlock, How I Met Your Mother, Hawaii Five-0, Twin Peaks, The Lawyer, and My Brother, My Brother and Me are all musicals. I take exception to a few of those, but so be it. You checked the box for musical for those shows. So even though we've designed things better in the sense that we've normalized our database by factoring out commonalities or, rather, we've cleaned up the data, there's still admittedly some redundancy. But I at least now have the data in clean fashion so that every column has just a single value in it and not some contrived comma-separated list. Suppose I want to find out all of the genres that you all thought The Office was in. So let's ask kind of the opposite question. Well, how might I do that? Well, to figure out The Office, I'm going to first need to SELECT the id FROM shows WHERE title = "THE OFFICE," because a whole bunch of you typed in The Office. And we gave each of your answers a unique identifier so we could keep track of it. And there's all of those numbers. Now, this is, like, dozens of responses. I certainly don't want to execute that many queries. But I think a subquery will help us out again. Let me put parentheses around this whole thing. And now let me say SELECT DISTINCT genre FROM genres WHERE the show_id in the genres table is in that subquery. And just for kicks, let me go ahead and ORDER BY genre. So let me go ahead and execute this. And, OK, somewhat amusingly, those of you who inputted The Office checked boxes for animation, comedy, documentary, drama, family, horror, reality-TV, romance, and sci-fi. I take exception to a few of those, too. But this is what happens when you accept user input. 
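Both nested queries from this passage can be reproduced on a miniature version of the two tables; again via Python's built-in sqlite3 module, with made-up rows standing in for the class's data:

```python
import sqlite3

# Miniature stand-in for the two normalized tables
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")
db.executemany("INSERT INTO shows (id, title) VALUES (?, ?)",
               [(1, "THE OFFICE"), (2, "GLEE"), (3, "BREAKING BAD")])
db.executemany("INSERT INTO genres (show_id, genre) VALUES (?, ?)",
               [(1, "Comedy"), (2, "Musical"), (2, "Comedy"),
                (3, "Musical"), (3, "Crime")])

# Inner query yields the show_ids tagged Musical; the outer query maps
# those ids back to titles, all in one statement
titles = db.execute(
    "SELECT title FROM shows WHERE id IN "
    "(SELECT show_id FROM genres WHERE genre = 'Musical')").fetchall()
print(sorted(t[0] for t in titles))  # ['BREAKING BAD', 'GLEE']

# The opposite direction: every distinct genre for one title
g = db.execute(
    "SELECT DISTINCT genre FROM genres WHERE show_id IN "
    "(SELECT id FROM shows WHERE title = 'GLEE') ORDER BY genre").fetchall()
print([row[0] for row in g])  # ['Comedy', 'Musical']
```

The parenthesized subquery runs first in each case, and its result feeds the outer SELECT, just as described.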
So here again, we have with this SQL language the ability to express fairly succinctly, even though it's a lot of new features today all at once, what would otherwise take me a dozen or two lines in Python code to implement and god knows how many lines of code and how many hours it would take me to implement something like this in C. Now, admittedly, we could do better than this design. This table or this picture represents what we have now. But you'll notice a lot of redundancy implicit in the genres table. Any time you check the comedy box, I have a row now that says comedy, comedy, comedy, comedy. And the show_id differs, but I have the word "comedy" again and again. And now, that tends to be frowned upon in the world of relational databases, because if you have a genre called comedy or one called musical or anything else, you should ideally just have that living in one place. And so if we really wanted to be particular and really, truly normalize this database, which is an academic term referring to removing all such redundancies, we could actually do it like this. We could have a shows table still with an id and title, no difference there. But we could have a genres table with two columns, id and name. Now, this is its own id. It has no connection with the show_id. It's just its own unique identifier, a primary key here now, and the name of that genre. So you would have one row in the genres table for comedy, for drama, music, musical, and everything else. And then you would use a third table, which is colloquially called a join table, which I'll draw here in the middle. And you can call it anything you want, but we've called it shows_genres to make clear that this table implements a relationship between those two tables. And notice that in this table is really no juicy data. It's just foreign keys-- show_id, genre_id. And by having this third table, we can now make sure that the word "comedy" only appears in one row anywhere. 
The word "musical" only appears in one row anywhere. But we use these more efficient integers called show_id and genre_id, which respectively point to those primary keys in their own tables, to link those two together. And this is an example of what's called in the world of databases a many-to-many relationship. One show can have many genres. One genre can belong to many shows. And so by having this third table, you can have that many-to-many relationship. And again, the third table now allows us to truly normalize our data set by getting rid of all of the duplicate comedy, comedy, comedy. Why is this important? Probably not a huge deal for genres. But imagine with my current design if I made a spelling mistake, and I misnamed comedy. I would now have to change every row with the word comedy again and again. Or if you change the genres of the shows, you would have to change them in multiple places. But with this other approach with three tables, you can argue that now you only have to change the name of a genre in one place, not all over the place. And in general-- in C, and now in Python and SQL-- it has been a good thing not to copy-paste identical values all over the place. All right. So with that said, what other tools do we have at our disposal? Well, it turns out that there are other data types out there in the real world using SQL besides just these five-- BLOB, INTEGER, NUMERIC, REAL, and TEXT. BLOB, again, is for binary stuff, generally not used except for more specialized applications, let's say. INTEGER, which is an int, typically 32 bits; NUMERIC, which is something like a date or a year or time or something like that; REAL numbers, which are floating point values; and TEXT, which are things like strings. 
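The fully normalized, three-table design described above-- shows, genres, and the shows_genres join table-- can be sketched as follows, again with Python's built-in sqlite3 module and made-up rows. The deliberate misspelling demonstrates the payoff: a genre's name is corrected in exactly one row, no matter how many shows reference it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Three tables: the join table shows_genres holds only foreign keys,
# so each genre's name lives in exactly one row of genres
db.executescript("""
CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE shows_genres (show_id INTEGER REFERENCES shows(id),
                           genre_id INTEGER REFERENCES genres(id));
INSERT INTO shows (id, title) VALUES (1, 'The Office'), (2, 'Glee');
INSERT INTO genres (id, name) VALUES (1, 'Comdy'), (2, 'Musical');
INSERT INTO shows_genres VALUES (1, 1), (2, 1), (2, 2);
""")

# Fix the misspelled genre in a single place
db.execute("UPDATE genres SET name = 'Comedy' WHERE name = 'Comdy'")

# Nested subqueries walk from genre name to genre_id to show_ids to titles
titles = db.execute(
    "SELECT title FROM shows WHERE id IN "
    "(SELECT show_id FROM shows_genres WHERE genre_id = "
    "(SELECT id FROM genres WHERE name = 'Comedy'))").fetchall()
print(sorted(t[0] for t in titles))  # ['Glee', 'The Office']
```

Both shows immediately see the corrected genre, because they refer to it by genre_id rather than storing the word themselves.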
But if you graduate ultimately from SQLite on phones and on Macs and PCs to actual servers that run Oracle, MySQL, and PostgreSQL if you're actually running your own internet-style business, well, it turns out that more sophisticated, even more powerful databases come with other subtypes, if you will. So besides INTEGER, you can specify smallint for small numbers, maybe using just a few bits instead of 32; INTEGER itself; or bigint, which uses 64 bits instead of 32. The Facebooks, the Twitters of the world need to use bigint a lot, because they have so much data. You and I can get away with simple integers, because we're not going to have more than 4 billion favorite TV shows in a class, certainly. Something like REAL, you can have 32-bit real numbers or, a little weirdly named, double precision, which is like a double was in C, using 64 bits instead for more precision. NUMERIC is kind of this catchall. You can have not only dates and date times but things like Boolean values. You can specify the total number of digits to store using a numeric scale and precision. So it relates to numbers that aren't quite just integers. And then you also have categories of TEXT-- char followed by a number, which specifies that every value in the column will have the same number of characters. That's helpful for things where you know the length in advance. In the US, for instance, all 50 states have two-character codes, like MA for Massachusetts, CA for California. char(2) would be appropriate there, because you know every value in the column is going to have two characters. When you don't know, though, you can use varchar. And varchar specifies a maximum number of characters. And so you might specify varchar of, like, 32, so that no one can type in a name that's longer than 32 characters, or varchar(200) if you want to allow for something even bigger. But this is germane to our real-world experience with the web. 
If you've ever gone to a website, started filling out a form, and all of a sudden you can't type any more characters, your response is too long-- why is that? Well, one, the programmers just might not want you to keep expressing yourself in more detail, especially if it's a complaint form on a customer service site. But pragmatically, it's probably because their database was designed to store a finite number of characters. And you have hit that threshold. And you certainly don't want to have a buffer overflow, like in C. So the database will enforce a maximum length n. And then text is for even bigger chunks of text. If you're letting people copy-paste their resumes or whole documents or even larger sets of text, you might use text instead. So let's then consider a real-world data set. Things get really interesting, and all of these very academic ideas and recommendations really come into play, when we don't have hundreds of favorites but thousands instead. And so what I'm going to go ahead and do here is download a file here, which is a SQLite version of the IMDb, Internet Movie Database, that some of you might have used in website form in order to look up movies and ratings thereof and the like. And what we've done in advance is we wrote a script that downloaded all of that information in advance as TSV files. It turns out that they, Internet Movie Database, make all of their data available as TSV files, Tab-Separated Values. And we went ahead and imported it with a script into a database called shows.db, as follows. So I'm going to go ahead in just a moment and open up shows.db, which is not the version I created earlier based on your favorites. This is now the version that we, the staff, created in advance by downloading hundreds of thousands of movies and TV shows and actors and directors from IMDb.com under their license and then imported into a SQLite database. So how can I see what's in here? Well, let me go ahead and type .schema, recall. 
And you'll see a whole bunch of data therein. And in fact, in pictorial form, it actually looks like this. Here is a picture that just gives you the lay of the land. There's going to be a people table that has an ID for every person, a name, and their birth year. There's going to be a shows table, just like we've been discussing, with IDs and titles of shows-- also, though, the year that the show debuted and the number of episodes that the show had. Then there's going to be genres, similar in design to before. So we didn't go all out and factor it out into a third table. We just have some duplication here, admittedly, in genres. But then there's a ratings table. And here's where you can see where relational databases get interesting. You can have a ratings table storing ratings, like 1 to 5, but also associate those ratings with a show by way of its show_id. And then you can keep track of the number of votes that that show got. Writers, notice, is a separate table. And notice this is kind of cool. This table, per the arrows, relates to the shows table and the people table, because this is a join table. A foreign key of show_id and a foreign key of person_id refer to the shows table and the people table, respectively, so that a human person can be a writer for multiple shows and one show can have multiple writers-- another many-to-many relationship. And then lastly, stars, the actors in a show. Notice that this, too, is a join table. It's only got two foreign keys, a show_id and a person_id, that refer back to those tables respectively. And here's where a relational database really makes sense. It would be pretty stupid and bad design if you had names of all of the directors and names of all of the writers and names of all of the stars of these shows in separate tables, in duplicate, like Steve Carell, Steve Carell, Steve Carell. All of those actors and directors and writers and every other role in the business are just people at the end of the day.
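As a rough sketch of the join-table design just described, here's a minimal version of a few of those tables in Python's sqlite3 module. The column names approximate the diagram, but this is an illustration, not IMDb's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        birth INTEGER
    );
    CREATE TABLE shows (
        id INTEGER PRIMARY KEY,
        title TEXT,
        year INTEGER,
        episodes INTEGER
    );
    -- A join table: two foreign keys give a many-to-many relationship,
    -- so one person can star in many shows and one show can have many stars.
    CREATE TABLE stars (
        show_id INTEGER,
        person_id INTEGER,
        FOREIGN KEY(show_id) REFERENCES shows(id),
        FOREIGN KEY(person_id) REFERENCES people(id)
    );
""")
tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['people', 'shows', 'stars']
```

The key design point is that a person's name appears exactly once, in people; every other table refers to that person only by id.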
So in a relational database, the advice would be to put all of those people in a people table and then use primary and foreign keys to refer to, to relate them to, these other types of tables. The catch, though, is that when we do this, it turns out that things can be slow when we have lots of data. So for instance, let me dive into this. Let me go ahead and SELECT * FROM shows;. That's a lot of data. It's pretty fast on my Mac, and I switched from the IDE to my Mac just to save time, because it's a little faster doing things locally instead of in the cloud. Let me go ahead and count the number of shows in this IMDb database by using COUNT. 153,331 TV shows. So that's a lot. How about the count of people from the people table? 457,886 people, who might be stars or writers or have some other role, as well. So this is a sizable data set. So let me go ahead and do something simple, though. Let me go ahead and SELECT * FROM shows WHERE title = "The Office". And this time, I don't have to worry about weird capitalization or spacing. This is IMDb. This is clean data from an authoritative source. Notice that there are actually different versions of The Office. You probably know the UK one and the US one. There are other shows, too, that are unrelated to that particular type of show. But each of them is distinguished, notice, by the year here. All right, so that's kind of a lot. And let's do this again. Let me go ahead and turn on a feature temporarily just to time this query by turning on a timer in this program. And let me run it again. It looks like it took 0.012 seconds of real time to do that search. That's pretty fast. I barely noticed, certainly, because it's so fast. But let me go ahead and do this. Let me go ahead and create an index called title_index on the table called shows, on its title column. Well, what am I doing? Well, to answer the performance question finally from before: by default, everything we've been doing is indeed big O of n.
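Here's a hedged sketch of what CREATE INDEX does, using Python's sqlite3 module. Rather than timing the queries, we can ask SQLite directly how it plans to execute them with EXPLAIN QUERY PLAN, and watch the plan change from a linear scan to an index search. The table, index name, and data here are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany("INSERT INTO shows (title) VALUES (?)",
               [(f"Show {i}",) for i in range(10_000)])

# Without an index, SQLite scans the whole table: linear search, O(n).
before = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM shows WHERE title = 'Show 123'").fetchone()

db.execute("CREATE INDEX title_index ON shows (title)")

# With the index, SQLite searches the B-tree instead: roughly logarithmic.
after = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM shows WHERE title = 'Show 123'").fetchone()

print(before[-1])  # a SCAN of the shows table
print(after[-1])   # a SEARCH using title_index
```

The exact wording of the plan varies by SQLite version, but the shift from "SCAN" to "SEARCH ... USING INDEX" is the whole point of the exercise.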
It's just being linearly searched from top to bottom, which seems to call into question the whole purpose of SQL if we were doing no better than with CSVs. But an index is a hint to the database to store the data more cleverly, in such a way that you get logarithmic time. An index is a fancy data structure that the SQLite database or the Oracle database or the MySQL database, whatever product you're using, builds up for you in memory. And using syntax like this, it generally builds in memory something known as a B-tree. We've talked a bit about trees in this class. We talked about binary search trees, things that kind of look like family trees. A B-tree is essentially a family tree that's just very wide and not that tall. It's a data structure similar in spirit to what we looked at in C. But it tries to keep all of the leaf nodes, all of the children or grandchildren or great-grandchildren, so to speak, as close to the root as possible. And the algorithm it uses for that tends to be proprietary or documented, depending on the system you're using. But it doesn't store things in a list. It does not store things top to bottom, like the tables as we view them. Underneath the hood, those tables that look like very tall structures are actually implemented with fancier things called trees. And if we create those trees by creating what are properly called indexes like this, it might take us a moment, like 0.098 seconds, to create an index. But now notice what happens. Previously, when I searched the titles for The Office using linear search, it took 0.012 seconds. If I do the same query again after having created the index and having told SQLite, build me this fancy tree in memory, voila. 0.001 seconds, so an order of magnitude faster. Now, both are fast to us humans, certainly. But imagine the data set being even bigger, the queries being even more complex.
The queries, then, can take even longer, and the savings from an index matter even more. But unfortunately, if I've got all of my data all over the place, as in a diagram like this, my god. How do I actually get useful work done? How do I get back the people in a movie and the writers and the stars and the ratings if it's all over the place? I would seem to have created such a mess that I now need to execute all of these separate queries. But notice it doesn't have to be that complicated. It turns out that there's another keyword in SQL, really the last that we'll look at here, called JOIN. The JOIN keyword, which you can use implicitly or explicitly, allows you to join tables together and sort of reconstitute a bigger, more user-friendly table. So for instance, suppose I want to get all of Steve Carell's TV shows, not just The Office. Well, recall that I can select Steve's ID from the people table WHERE name = "Steve Carell". So again, he has a different ID in this table, because this is from IMDb. But there's his ID. And let me go ahead and turn the timer off for now. All right. So there is his ID, 126797. I could copy-paste that into my code, but that's not necessary thanks to nested queries. I can do something like this. Let me go ahead and now select all of the show_ids from the stars table where person_id in that table is equal to this result. So there's that join table, stars, that links people and shows. So let me go ahead and execute that. All right. So there are all of the show_ids of Steve Carell's TV shows. That's a lot. And it's very nonobvious what they are. So let me do another nested query by putting all of that in parentheses and now SELECT title FROM shows WHERE the ID of the show is IN this big, long list of show_ids. And there are all of the shows that he's in, including The Dana Carvey Show back when, The Office up at the top, and then, most recently, shows like The Morning Show on Apple TV.
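The nested-query pattern just described might look like this against a tiny, made-up data set (the IDs and the second actor are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1, 'Steve Carell'), (2, 'Someone Else');
    INSERT INTO shows VALUES (10, 'The Office'), (11, 'The Morning Show'), (12, 'Other Show');
    INSERT INTO stars VALUES (10, 1), (11, 1), (12, 2);
""")

# Nested queries: the innermost finds the person's id, the middle finds the
# matching show_ids in the join table, the outermost turns those into titles.
titles = [r[0] for r in db.execute("""
    SELECT title FROM shows WHERE id IN (
        SELECT show_id FROM stars WHERE person_id = (
            SELECT id FROM people WHERE name = 'Steve Carell'
        )
    )
""")]
print(sorted(titles))  # ['The Morning Show', 'The Office']
```

Notice there's no hard-coded ID anywhere: the innermost SELECT looks it up for us.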
All right, so that's pretty cool that we can actually reconstitute the data like that. But it turns out there are different ways of doing that, as well. And you'll see more of this in the coming weeks and in the problem sets and labs and the like. But it turns out we can do other things, as well. And let me just show this syntax, even though it'll look a little cryptic at first glance. You can also use that JOIN keyword as follows. I can select the title from the people table joined with the stars table on the people.id column equaling the stars.person_id column. So in other words, I can select a title from the result of joining people and stars, like this, on the id column in one and the person_id column in the other. And I can join in the shows table on stars.show_id equaling shows.id. So again, now I'm joining on the primary and foreign keys of these tables, where the name equals "Steve Carell". So this is the most cryptic thing we've seen yet. But it just means take this table and join it with this one and then join it with this one and filter all of the resulting joined rows by a name of Steve Carell. And voila. There we have all of those answers, as well. And there are other ways of doing this, too. I'll leave unsaid for now some of the syntax for that. But that felt a little slow. And in fact, let me go ahead and turn my timer back on. Let me re-execute this last query. SELECT title FROM people joining on stars, joining on shows WHERE name = "Steve Carell". That took over half a second. So that was actually, admittedly, kind of slow. But again, indexes come to the rescue, so long as we don't allow linear search to dominate. So let me go ahead and create a few indexes. Create an index called person_index on the stars table, on the person_id column. Why? Well, my query a moment ago used the person_id column. It filtered on it. So that might be a bottleneck. I'm going to go ahead and create another index called show_index on the stars table, on show_id.
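The explicit JOIN version of that same query might be sketched like so, again against invented sample data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO people VALUES (1, 'Steve Carell');
    INSERT INTO shows VALUES (10, 'The Office'), (11, 'The Morning Show');
    INSERT INTO stars VALUES (10, 1), (11, 1);
""")

# Explicit JOINs: line up people with stars on the person's id, then
# stars with shows on the show's id, then filter the joined rows by name.
titles = [r[0] for r in db.execute("""
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = 'Steve Carell'
""")]
print(sorted(titles))  # ['The Morning Show', 'The Office']
```

Both forms return the same rows; the JOIN form just makes the relationships between the tables explicit in one query.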
Similarly, a moment ago, my query used the show_id column. And so that, too, might have been a bottleneck, linearly, top to bottom. So let me create that index. And then lastly, let me create an index called name_index-- and this is perhaps the most obvious, similar to the show titles before-- on the people table, on the name column. And that, too, took a moment. Now, in total, this took almost a full second. But these indexes only get created once. They get maintained automatically over time, so you don't incur this cost with every query. Now let me do my SELECT again. Let me SELECT title FROM people joining the stars table, joining the shows table WHERE name = "Steve Carell". Boom. 0.001 seconds. That was orders of magnitude faster than the more than half a second it took us a little bit ago. So here, too, you see the power of a relational database. So even though we've created some problems for ourselves over time, we've solved them ultimately-- granted, with some more sophisticated features and additional syntax. And this is indeed why relational databases are used in the real world by the Twitters, the Instagrams, the Facebooks, the Googles: because they can store data so efficiently, without redundancy, because you can normalize them and factor everything out. But they can still maintain the relations that you might have seen in a spreadsheet, while searching in something closer to logarithmic time thanks to those tree structures. But there are problems. And what we wanted to do today is end on two primary problems that are introduced with SQL, because they are unfortunately just so common. There is something generally known as a SQL injection attack, which you are vulnerable to in any application where you're taking user input. That hasn't been an issue for my favorites.py file, where I only took input from a CSV.
But what if one of you had maliciously typed in the word "delete" or "update" or something else as the title of your show, and I accidentally plugged it into my own Python code when executing a query? You could potentially inject SQL into my own code. How might that be? Well, if logging in via Yale, you'll typically see a form like this. Or logging in via Harvard to something, you'll see a form like this. Here's an example that I'm pretty sure neither Harvard nor Yale is vulnerable to. Suppose I type in my email address to this login form as malan@harvard.edu'--. It turns out that, in SQL, -- is the symbol for commenting, if you want to comment something out. It turns out that the single quote is used when you want to search for something like Steve Carell or, in this case, malan@harvard.edu. It can be double quotes. It can be single quotes. In this case, I'm using single quotes here. But let's consider some sample code, if you will, in Python. Here's a line of code that I propose might exist in the backend for Harvard's authentication or Yale's or anyone else's. Maybe someone wrote some Python code like this, using SELECT * FROM users WHERE username = ? AND password = ?, and they plugged in username and password. Whatever the user typed into that web form a moment ago gets plugged in here, into these question marks. This is good. This is good code, because you're using the SQL placeholders, those question marks. So if you literally just do what we preach today and use these question mark placeholders, you are safe from SQL injection attacks. Unfortunately, there are too many developers in the world who don't practice this or don't realize this or forget this.
If you instead resort to Python approaches like this, where you use an f-string instead, which might be your instinct after last week, because they're wonderfully convenient with the curly braces and all-- suppose that you literally plug in username and password not with the question mark placeholders but just literally in between those curly braces. Watch what happens if my username, malan@harvard.edu, was actually typed in by me maliciously as malan@harvard.edu'--. That would have the effect of tricking this Python code into doing essentially this. Let me do a find and replace. It would trick Python into executing username = 'malan@harvard.edu'--' and then other stuff. Unfortunately, the -- again means comment, which means you could maybe trick a server into ignoring the whole password part of this SQL query. And if the SQL query's purpose in life is to check, is this username and password valid, so that you can decide to log the user in or to say, no, you're not authorized, well, by essentially commenting out everything related to password, notice what I've done. I've just now theoretically logged myself in as malan@harvard.edu without even knowing or inputting a password, because I injected SQL syntax, the quote and the --, into my query, tricking the server into just ignoring the password equality check. And recall that db.execute, when you execute an INSERT, returns to you, as said, the ID of the newly inserted row. When you use db.execute to select rows from a database table, it returns to you a list of rows, each of which is a dictionary. So this is now pseudocode down here with my comment. But if you get back one row, that would seem to imply that there is a user named malan@harvard.edu. We don't know what his password is, because whoever this person is maliciously tricked the server into ignoring that syntax. So SQL injection attacks are unfortunately one of the most common attacks against SQL databases.
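Here's a self-contained toy demonstration of both styles, with a hypothetical users table and using Python's sqlite3 module rather than CS50's library. The unsafe function pastes user input straight into the SQL string, exactly as warned against above; the safe one uses placeholders, so the library escapes the input for us.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'secret')")

def login_unsafe(username, password):
    # DANGEROUS: user input is interpolated directly into the SQL string.
    query = (f"SELECT * FROM users WHERE username = '{username}' "
             f"AND password = '{password}'")
    return db.execute(query).fetchall()

def login_safe(username, password):
    # Placeholders: the library treats the input as data, never as SQL.
    return db.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (username, password)).fetchall()

# The attacker closes the quote, then -- comments out the password check.
evil = "malan@harvard.edu'--"
print(len(login_unsafe(evil, "wrong")))  # 1: logged in without the password!
print(len(login_safe(evil, "wrong")))    # 0: no such user; the attack fails
```

The unsafe query effectively becomes SELECT * FROM users WHERE username = 'malan@harvard.edu', with everything after the -- ignored, which is precisely the injection described above.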
They are completely preventable if you simply use placeholders and use libraries, whether it's CS50's or other third-party libraries that you may use down the road. A common meme on the internet is this picture here. If we zoom in on this person's license plate, or where the license plate should be, this is an example of someone theoretically trying to trick some camera on the highway into dropping the whole database. DROP is another keyword in SQL that deletes a database table. And this person was either intentionally or just humorously trying to trick it into executing SQL by using syntax like this. So characters like single quotes, --, and semicolons are all potentially dangerous characters in SQL if they're passed through unchanged to the database. A very popular xkcd comic-- let me give you a moment to just read this-- is another well-known meme of sorts now in computer science. If you'd like to, read this one on your own. But henceforth, you are now in the family of educated learners who know who Little Bobby Tables is. Unfortunately, it's dead silence in here, so I can't tell if anyone is actually laughing at this joke. But anyhow, this is a very well-known meme. So if you're a computer scientist who knows SQL, you know this one. And there's one last problem we'd like to introduce, if you don't mind just a couple of final moments here. And that is a fundamental problem in computing called race conditions, which for the first time is now manifest in our discussion of SQL. It turns out that SQL and SQL databases are very often used, again, in the real world for very high-performing applications. And by that, I mean, again, the Googles, the Facebooks, the Twitters of the world, where lots and lots of data is coming into servers all at once. And case in point, some of you might have clicked Like on this egg some time ago. This is the most-liked Instagram post ever. As of last night, it was up to 50-plus million likes.
It well eclipsed Kim Kardashian's previous post, which is still at 18 million or so. This is to say this is a hard problem to solve, this notion of likes coming in at such an incredible rate. Because suppose that, long story short, Instagram actually has a server with a SQL database. And they have code in Python or C++ or whatever language that's talking to that database. And suppose that they have code that's trying to increment the total number of likes. Well, how might this work logically? Well, in order to increment the number of likes that a picture like this egg has, you might first select from the database the current number of likes for the ID of that egg photograph. Then you might add 1 to it. Then you might update the database. And I didn't use it before, but just like there's INSERT and DELETE, there's UPDATE, as well. So you might update the database with the new count plus 1. So the code for that might look a little something like this, three lines of code using CS50's library here, where you execute SELECT likes FROM posts WHERE id = ?, where id is the unique identifier for that egg. And then I'm storing the result in a rows variable, which, again, I claim is a list of rows. I'm going to go into the first row, so that's rows bracket 0. And I'm going to go into the likes column to get the actual number. And that number, I'm going to store in a variable called likes. So this is going to be, like, 50,000,000, and I want it to go to 50,000,001. So how do I do that? Well, I execute on the database UPDATE posts SET likes = ?. And then I just plug in likes + 1. The problem, though, with the Instagrams and Googles and Twitters of the world is that they don't just have one server. They have many thousands of servers. And all of those servers might in parallel be receiving clicks from you and me on the internet. And those clicks translate into this code getting executed, executed, executed.
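Using Python's standard sqlite3 module rather than CS50's library, that three-step read-then-write logic might be sketched as follows (the posts table and the egg's ID here are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 50000000)")  # the egg, hypothetically

# The three steps from the lecture: SELECT the current count,
# add 1 in application code, then UPDATE the row.
row = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()
likes = row[0]
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (likes + 1, 1))

new_likes = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(new_likes)  # 50000001
```

On a single server this is perfectly fine; the trouble described next only arises when two copies of these three lines run at the same time.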
And the problem is that when you have three lines of code, and suppose Brian and I click on that egg at roughly the same time, my three lines might not get executed before his three lines, or vice versa. They might get commingled chronologically. My first line might get executed, then Brian's first line might get executed. My second line might get executed, then Brian's second line. So they might get interspersed on different servers or just chronologically in time. That's problematic, because suppose Brian and I click on that egg at roughly the same time. And we get back the same answer to the SELECT query. 50 million is the current count. Then our next lines of code execute on the servers we happen to be on, which add 1 to the likes. The servers might accidentally end up updating the row for the egg with 50,000,001 both times, because the fundamental problem is that if my code executes while Brian's code executes, we are both checking the value of a variable at essentially the same time. And we are both then making a conclusion-- oh, the current likes are 50 million. We are then making a decision. Let's add 1 to 50 million. We are then updating the value with 50,000,001. The problem, though, is that, really, if Brian's code, or the server he happens to be connected to on Instagram, happens to have selected the number of likes first, he should be allowed to finish the code that's being executed, so that when I select it, I see 50,000,001, and I add 1 to that so the new count is 50,000,002. This is what's known as a race condition. When you write code in a multiserver-- or, more fancily, a multithreaded-- environment, lines of code can get commingled chronologically on different servers at any given time. The problem fundamentally derives from the fact that if Brian's server is in the middle of checking the state of a variable, I should be locked out.
I should not be allowed to click on that button at the same time, or my code should not be allowed to execute logically. So there is a solution when you have to write code like this, as is common for Twitter and Instagram and Facebook and the like: to use what are called transactions. Transactions add a few new pieces of syntax that we won't dwell on today and that you don't need to use in the coming days. But they do solve a fundamentally hard problem. Transactions essentially allow you to lock a table or, really, a row in the table, so that if Brian's click on that egg results in some code executing that's in the process of checking what the total like count is, my click on the egg will not get handled by the server until his code is done executing. So in green here, I've proposed the way you should do this. You shouldn't just execute the middle three lines, "you" being Facebook, in this case. Instagram should execute BEGIN TRANSACTION first, then COMMIT the transaction at the end. And the design of transactions is that all of the lines in between will either succeed altogether or fail altogether. The database won't get into this funky state where we start losing track of likes on eggs. And though this has not been an issue in recent years, back in the day when Twitter was first getting started, Twitter was super popular and super offline a lot of the time. There was this thing called the Fail Whale, which was the picture they showed on their website when they were getting too much traffic to handle. That was because when people are liking and tweeting and retweeting things, it's a huge amount of data coming in. And it turns out it's very hard to solve these problems. But locking the database table, or the rows, with these transactions is one way fundamentally to solve this. And in our final extra time today, we thought we would play this out with the same example with which I was taught transactions some years ago.
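A minimal sketch of that transaction idea in Python's sqlite3 module: BEGIN IMMEDIATE takes a write lock up front, so another connection attempting the same increment would have to wait until our COMMIT rather than interleaving with it. (With an in-memory database there's no second connection to contend with, so this only demonstrates the syntax, not the contention itself.)

```python
import sqlite3

# isolation_level=None lets us manage BEGIN/COMMIT ourselves.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 50000000)")

# BEGIN IMMEDIATE acquires a write lock now, so a concurrent writer on
# another connection would block until our COMMIT below.
db.execute("BEGIN IMMEDIATE")
likes = db.execute("SELECT likes FROM posts WHERE id = ?", (1,)).fetchone()[0]
db.execute("UPDATE posts SET likes = ? WHERE id = ?", (likes + 1, 1))
db.execute("COMMIT")

final = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final)  # 50000001
```

The SELECT and UPDATE in between now succeed or fail as a unit, which is exactly the all-or-nothing property described above.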
Suppose that the scenario at hand is that you and your roommates have a nice dorm fridge. And you're all in the habit of drinking lots of milk, and you want to be able to drink some milk. But you go to the fridge, like I'm about to here. And you realize, uh-oh, we're out of milk. And so now I am inspecting the state of this refrigerator, which is quite old but also quite empty. And the state of this variable, being empty, tells me that I should go to CVS and buy some more milk. So what do I then do? I'm presumably going to close the fridge, and I'm going to leave and head to CVS. Unfortunately, the same problem arises that we'll act out here in our final 60 or so seconds together, whereby if Brian now, my roommate in this story, also wants some milk, he comes by when I'm already headed to the store, inspects the state of the fridge, and realizes, oh, we're out of milk. So he nicely will go restock, as well. So let's see how this plays out, and we'll see if there isn't a similar, analogous solution. So I've checked the state of the variable. We're indeed out of milk. I'll be right back. Just going to go to CVS. [MUSIC PLAYING] All right. I am now back from the store. I've picked up some milk. Going to go ahead and put it into the fridge and-- oh, how did this happen? Now there are multiple jugs of milk. And of course, milk does not last that long. And Brian and I don't drink that much milk. So this is a really serious problem. We've sort of tried to update the value of this variable at the same time. So how do we go about fixing this? What's the actual solution here? Well, I dare say that we can draw some inspiration from the world of transactions and the world of databases. And perhaps create a visual here that we hope you never forget, if you take nothing away from today. Let's go ahead and act this out one last time where, this time, I'm going to be a little more extreme. I go ahead and open the fridge. I realize, oh, we're out of milk.
I'm going to go to the store. I do not want to allow for this situation where Brian accidentally checks the fridge, as well. So I am going to lock the refrigerator instead. Let me go ahead and drape this through here. A little extreme, but I think so long as he can't get into the fridge, this shouldn't be a problem. Let me go ahead now and just attach the lock here. Almost got it. Come on. All right. Now the fridge is locked. Now I'm going to go get some milk. BRIAN YU: [SIGHS] [MUSIC PLAYING]