CARTER ZENKE: Well, welcome everyone to our week seven, CS50 section. My name is Carter Zenke. I'm the Course Preceptor here on campus. And the goal of these sections is really to help bridge the gap between lecture and this week's problem set. So the goal of the day is to dive into more SQL to give you more practice writing queries, working with databases, and hopefully, at the very end, you'll feel prepared to tackle this week's problem set. Now actually, this week seven, is one of my favorite weeks in the course. And I like this week so much because it's all about data, and data is often about real people, real things in the real world. So it's like really good connection between computer science and how things actually work in the real world. It kind of brings these questions about how we design things, makes them all the more important. Because we're actually going to talk about real people, real students, real things in this world, and we're going to represent inside of our own programs now. So thinking about data and representing real people, let's actually take a look into a fantasy world, one in Harry Potter. And if you're familiar with Harry Potter, maybe you're familiar with the Hall of Prophecy, which is this place where all of the prophecies go to, live, and reside. The place of we're they're going to store lots of different prophecies. And maybe be one of these prophecies is this place of where Hogwarts students, people who go to the Hogwarts School of Magic, will be could have housed, where they're going to actually live during their time at Hogwarts. So, let's think about how we might store a database of Hogwarts students inside of the different houses of Hogwarts. And if you're familiar, there's like Slytherin, Gryffindor, Hufflepuff, and Ravenclaw. And all these students live in these various houses here. So if we think of a database, we think of this space that is maybe currently empty, we could fill it with some things. So let's go ahead and call this space, for the sake of our discussion here, as roster.db. We're going to have some students in here, and they're going to go into different houses inside of this database. So all databases have some tables, some places that we're going to store data inside of them. And in this case, let's say we have this table that looks a bit like this. And if you can see that, we see we have Adelaide Murton, who's in the house for Slytherin. And the head of the Slytherin house is a professor named Severus Snape. And down below and kind of row three here, we have Anthony Goldstein, who's in Ravenclaw. And the leader, the head of Ravenclaw, is Filius Flitwick, a professor at Hogwarts School of Magic. So here, we have a table. And often, in our databases, tables are named. So let's maybe call this one students. And now, we have our full database. We have a name for our database, roster.db, and we have a table inside of it, this one called students. But what we might want to do with this table? What are some things we might want to actually do to this data. If want to chime in and chat, what's the point of storing data in this way? What could we do with it? Yeah. So I'm seeing somebody messaging me. So you could maybe actually try to figure out what else is inside of this table. You query the database. We could try to ask what's inside of it. And to do that, we can actually use these SQL keywords that begins with a select. How do we select data from this table? So let's dive into selecting data from tables as we saw in lecture. Let's go a bit more deeply into it this time. So if I wanted to select all the data from this particular table, does anyone know what kind of query I could use for this? I'm seeing in the chat. I'm seeing some select, which is good. I'm seeing some stars, which is also good. And now, let's go ahead and try this out. So if I go to my own VS code over here, I can see that I have my own roster.db. So I'll go ahead and open this up with SQLite, this kind of database engine we can use to execute SQL queries on this database, so I'll do SQLite3 roster.db. And now, just to show you, if I type dot schema to see the kinds of tables I have inside the database, I see I have this table called students. It has the very same attributes or columns we saw earlier in our visual back here. So, the hypothesis right now is we could kind of select everything in this table using this syntax I see from the chat. We could say SELECT star. We can select everything from this students table. So let's go ahead and do that. And we end all of our queries in SQL with a semicolon. So this is the end of our thought. This is our entire query. Select everything from this students table. I'll hit Enter now. And now, I can see kind of, if I zoom back out, all of the students who are inside of this roster. So if you scroll through, we have about 40 students here, and we see some of that very same data. So we see that Anthony Goldstein is in Ravenclaw, led by Filius Flitwick. We've got in all data from inside this table, so that's pretty good. Let's create a visual for this. So if you said we had this roster here with a table called students, if we do SELECT star FROM students, what we'll see is the entire table will select everything in this kind of yellow highlight here. But we could also get more specific, because it's often not good to select everything. Often we care about some particular set of students or particular column in general. So let's say I just wanted the student names, like how can I get student names from this list, disregarding the house and the head and whoever else is in this database? How could I just get student names? I'm seeing some WHERE, and that is true. We could use WHERE for other cases. But here, I just want to get all of the student names from this roster. Nice. So I'm seeing some syntax here that looks a bit like this. We could say SELECT student_name FROM students. So often, we use select. After we say SELECT, we have a column name, like the column we want to actually select from our table. So in this case, if we looked at student's table here, we have a column called student_name. So if you want to select that column, we could say SELECT student_name FROM this table called student. So if we go back to our SQLite prompt here, we actually type in SQL commands. Only we now type SELECT student_name FROM students. And what should I end this with now, because it's a SQL query? A semicolon. Good. So I'll hit semicolon here, and I'll hit Enter. And now, notice how my table just has that single column. It doesn't have every column now, but just that student_name column. So often, we're using SELECT. If we want to include the column name, we're selecting from our table. And if we just want everything, we can say star. SELECT star, meaning select everything from our given table. Now, to kind of focus in on getting particular students, let's say we cared not just about every student's name, but only those who are in a particular house. If you go back to our table here, we have student names, houses, and heads. So maybe we care about the students who are in Slytherin house. How could we find the students who are in that house? Any ideas? We just care now about students who are in Slytherin. Yeah. So here, we can bring back that WHERE clause that we saw before. So, if we want not just an entire column of students as we see here but some rows, depending on a certain condition in another column, we can use WHERE. So WHERE will give us back only certain rows from our table. So let's try this. We could say, let's SELECT student_names. Let's get that student name column from the student's table now, but let's go ahead and get only those students where the house is equal to what we're going to call Slytherin in this case. So if we go back to our table now and I type this query, I could say SELECT student_name, and I can type this all in one line. For the sake of argument here, I could say FROM students. And let's do WHERE house is Slytherin. OK, my query will wrap here. So we kind of go around the edge, but it's all one line, theoretically. Semicolon here. What do we see? We see only the students who are inside of Slytherin. And just to prove this, let's go ahead and actually update our query and let's show you how could do it on multiple lines. So we can say SELECT student_name and house. We could say student_name and house. We want this in the same query here separated by comma. Well, let's get this FROM our students table. And let's say we only want the rows where that house column is going to be Slytherin in this case. So I'll do a semicolon here and I'll hit Enter. And now, I do see that these same student names do belong to Slytherin. So with SELECT, we can always try to select not just a single column, but also multiple columns separated by commas here. Now, what questions are there so far on these SELECT, or does it seem pretty straightforward so far? Any questions in the chat? OK, not seeing any right now. But let's try to go a little more advanced now. So we've seen how we can select some data. We've seen how we can get different rows back from our table. But what if we're not kind of sure exactly what we want to look for in our table? Maybe we want to get just the students whose last name is Potter. Well, how could we do that? Any ideas what kind of clause we could use, if you want to find the students whose name end with Potter? Somebody said we should use LIKE. And as we saw in lecture, LIKE is this query that kind of lets us do a fuzzy match, meaning that we get all the rows that kind of roughly match some clause or some string we might give to our query here. So to do this, let's go ahead and look at our table again. For Slytherin, we got kind of a result like this. But now, let's look at how we could use it, LIKE. So if we're interested in students whose name ends with Harry Potter, well, we could do a query like this. We could say, SELECT student_names, obviously, FROM our student table. And now, we're only going to take the student names that are like Potter with this kind of percent sign in front. And I want to recall what that percent sign is doing for us here, if you want to chime in the chat. Why can we not just say LIKE Potter, and why do I have to say LIKE percent Potter? Yes, I'm seeing that people are calling this a wild card character. And basically, this percent sign says that I can match any character ahead of this string Potter. It could be Harry Potter. It could be Lily Potter. It could be, really, any name as long as it ends with Potter. And putting that percent sign at the beginning, well, that says that I can have any characters I want before Potter. And someone can maybe just have Harry and then percent sign after Harry, and that would give me all of the names that begin with Harry in this case. So let's go ahead and take a look at this. We can say SELECT student_name now, and we want to take student name FROM our students table. And then after this, let's say, OK, we want to get the student names that are LIKE, not equal to but LIKE, this string we're going to give called Potter, percent Potter. Meaning, we can get any character before Potter as long as at the end, we see Potter exactly, so I'll hit Enter now. And we see that there's really only one student whose name ends with Potter, which is Harry James Potter. But you might also see other characters in the Harry Potter universe whose names also end in Potter too. Now, maybe a better question here, to get more students in this query, is this one. So we could say, find the names and houses of students whose names begin with H. So let's try this one. We're trying to find students whose names begin with H. So we want to select again names. So I'll say, OK, I want to SELECT a student_name from my student_name column, and I also want to select the house. So I'll say comma house, and I'll get back now in this query student_name and house. Where am I going to get this data from though? Any ideas? Where am I selecting from in this case? In the chat? From students. So I'll say FROM, students here. Now to finish this off, I only want the students whose names begin with H. So I could say WHERE to kind of subset my rows, get a certain number of rows who only meet a certain condition. And now, the column I care about, the student_name. And I'm seeing in the chat, I could say, not equals, not particular H here. I don't want students whose name literally is H. I want students whose name is LIKE H and then any other characters that come after it. So I could hit Enter here. And now, I'll do this. Now, we see we have Hannah Abbott, whose name begins with H, Harry, and Hermione also begins with H. And they're H in Hufflepuff for Hannah, and Harry and Hermione are in Gryffindor. A question, is this case sensitive? It's a good question. Let's go ahead and try it. So we could say, SELECT student_name, and we're going to select also the houses FROM the students table, where the student name is LIKE, and we saw capital H beforehand, like this. But now, let's try a lowercase h and see if we get any students here or any different results, now that it's lowercase. So let's hit Enter now, and we still get the very same results. So what do you think? Is LIKE case sensitive? Does it matter what case I use? No, it doesn't seem to. I could use the capital H. I could say the lowercase h and I get the very same results here. So that's a great question. And now, I'm seeing another question. We need to use keywords like SELECT, like WHERE in all caps, and why is that so? So here, you'll see that I'm using SELECT in all caps, I'm using FROM in all caps, and using WHERE in all caps, and LIKE in all caps. And the convention here is, is that if you're typing in a SQL keyword, some kind of command to SQL, like SELECT or FROM or WHERE or LIKE, well, you should capitalize that to denote it as a SQL keyword. I can keep my column names though or my strings in lowercase because if I look at my table here, well, these literally are in lowercase, student_name, house is lowercase. So I want to make sure that casing matches there. But for a SQL query, I should, by convention, sort of capitalize these keywords. I could very much do this. I could say SELECT star FROM students and I would get the very same results if I zoom out a bit, but it's just not quite as clean. It's hard to tell. Let's say, if I do this, select student_name, house from students. It's kind of hard to tell where my keywords begin and where my column names actually join in, like it's hard to tell which what thing is which here. So always capitalize your SQL keywords as you go through. Great. Another question, when do we use LIKE? So as we've seen before here, we use LIKE-- let me actually just finish the query here. We use LIKE when we want to actually fuzzy match something, kind of roughly match something. So if I know that I want students whose name begins with H but isn't exactly H, LIKE is a good use case for that here. Other questions on selecting then? I'm just going through the chat here to be sure. All right, so I think that's most of our questions on this. So there's a main question about, is there a way to make search case sensitive? There is. So we saw before that LIKE is generally not case sensitive, but let's try this. If I do a search for Harry James Potter, who we know is in our database, I could do SELECT, the student_name, FROM students, WHERE the student_name is exactly Harry James Potter, and I'll hit semicolon here. And I should get, well, none other than Harry James Potter. But what if I did this, what if I said SELECT student_name, FROM students, WHERE the student_name is now all lowercase harry james potter, hit Enter now, and no one seems to match that exactly. So equals, when you're using that, is case sensitive, but LIKE, by definition, is not. So that seems to be most of what you might want to use our selecting for. We can use WHERE and we can use LIKE to kind of narrow our matches from our table. But what else could we do with these queries? So one thing I want to do, as we saw in lecture, is kind of order our students, put them in like alphabetical order or kind of roster order and so on. And we can use some new school keywords for that exactly. So we can not just select, we can also order our data. So we saw before, we can just kind of get every student we want to. I guess by SELECT star FROM students give us all the students here. But at very end of this query, once it's complete, we could have tacked on this other keyword called ORDER BY to tell SQL what column I want to sort this data by. So I'll get every column, but if I want to sort the data by particular column, I could say ORDER BY that particular column name. So here, if I do this, ORDER BY student_name, well, I should get the names in alphabetical order. So, I think if I select everything, SELECT star FROM students, I might already have these names in alphabetical order by virtue of how they were added to this data set. But if I were to do this, SELECT star FROM students, I'll zoom in down below here. Zoom in here, SELECT star FROM students, and then ORDER BY student_name, well, I kind of get the very same thing. But it's important to keep in mind, you keep in mind that when you use ORDER BY by without kind of anything after it, like this here, we're getting the names in what's called ascending order. So ascending, in SQL, means from kind of the lower Ascii value, like A, to the higher Ascii value, like Z, so A to Z in this case. But if I wanted to flip it, not having A to Z, but Z to A, well, I could change this, not to order by ascending, but to order by descending, like this. So by default, SQL always order by the ascending order like this. But if you want to order by descending, you could simply say descending at the end or DESC for descending. So let's try that now so I can actually see the results. So I'll go, clear my terminal, zoom in a bit, and I'll say SELECT. I'll say SELECT star FROM students, and then I will say, ORDER BY student_name, now descending, so from Z to A. Now, let's try this. I'll hit Enter. And now, if I zoom out, you should see. I start with Vincent now, whose name is towards the end of the alphabet, and I go up to Adelaide, whose name is at the top of the alphabet. In this case. And I mean there are other things we could sort of sort by here. So we don't have to just sort by student name, we could also sort by house. So maybe we want to sort by this. We could say SELECT student_name, or sorry, SELECT-- let's just select everything here, SELECT everything FROM students. Whoops, my internet's a little wonky here. I'll SELECT star FROM-- I'm going to give my code space a minute to breathe here. I'm going too quickly through these SELECT. I think we're back. So we'll do SELECT star FROM students. And now I'm going to order by, not student name, but by house. And so now, I should see, if I zoom out, my data organized by house. I have all the students in Gryffindor, all students in Hufflepuff, and all the students in Ravenclaw, and all the students in Slytherin. So a question now is, let's say we wanted to sort the student data in alphabetical order, but have it first sorted by house, and then by student name. So ideally, we should have all of our data from our table viewed now, but sorted first by house, alphabetical order. And then, inside of each house, sorted by name. So how could we do this, do you think? Do you want to chime in the chat? How could we have this kind of multiple, multi-layered sort going on? I've seen a few options, and one that I like here is this comma. We could use a comma, something that we did before. We could say, if we start off in SELECT, we could do SELECT student_name, and house, or we could do the very same thing for ORDER BY. We could ORDER BY multiple things at the very same time, so let's do this. We could do SELECT everything FROM students, and let's ORDER BY first house, and then by student_name. So with the comma here is saying is that the very first thing before the comma is the first thing I will sort by. Then afterwards, student name is the next thing I will sort by. So inside of my sorted list of houses, I will then sort student names, so let's try this. And now, we can see, if I zoom out, well, we have Gryffindor all sorted. And inside of the students who are in Gryffindor, well, we have those students in alphabetical order, CDGHHLN and so on. So let's try it maybe with a more advanced technique. We could go in and say SELECT everything FROM students. But now, we just want to get the houses as they are, SELECT ORDER BY house, and then the student_name, but in reverse alphabetical order. And how can I get reverse alphabetical order now? Descending, right. So I could say DESC here. And now, I'll hit Enter. And now I see, I'm still sorting by house alphabetically, G, H, R, and S. But now, inside of each house, notice how I'm going from the bottom of the alphabet to the very top. So questions on these sorts? What do you wonder about? I could write this sort case sensitive. Let's check that one out actually. So we see-- we can do SELECT star FROM students, and now, I'll do ORDER BY. And because we said house before, why don't we just try capital H house and I'll hit semicolon and Enter again? And that still seems to work. So when you're talking about SQL column names, those also are not going to be case sensitive for you. And the question, what will happen if I switch between house and student_name in that query? That's a great question. So before, we had a query like this, SELECT star FROM students, and then we did ORDER BY first house and then student_name? But what if we did this, what if we did, first, student_name and then house? So what do you think will happen in this case? In the chat, if you want to chime in, what do you think? What will happen if we flip this order now? Yes, so somebody says it'll first sort by name and then by house, so let's try running this. We'll hit Enter here and I'll zoom out. And notice how my student names are in alphabetical order. But within the C's, for example, I now have my houses in alphabetical order, so Hufflepuff, Ravenclaw, and Gryffindor. Or let's see, we have a lot of M's. So here, we have Ravenclaw, Slytherin. Well, that didn't seem to quite work in this case actually. Why did that not work? MRR, interesting. Oh, yeah, so somebody-- it's a good point. So somebody asked, there's no two identical names. So what happened here is, this isn't quite useful for us, and I'll see if I can explain this. Because what SQL will do is not sort by the very first letter of student_name, but sort by the entire string. And so, in order for the houses that we see in alphabetical order, we have to think the exact same student name. So not just the same first letter, but the same student name. And because no two student names are exactly alike, we probably won't actually see the results in this case. So good catch from somebody in the chat here. Other questions on ordering then? A good question here. So what if a name starts with a number, kind of unlikely, but a good case to consider. So if we had a name search with a number, well, it's going to go by the Ascii value for that number, I believe. So in this case, numbers would likely come before A or come before, certainly before Z, and so on. So first, numbers and symbols, and then the alphabet. Any question here? Can we order one column by ascending and the second by descending? You absolutely can. So we could say this. We can say SELECT star FROM students. Let's go ahead and ORDER BY house, descending, and then we could ORDER BY student_name ascending, like this. So certainly, if you're free to apply your own ascending or descending to each of these multiple arguments to ORDER BY here. And another question before we move on, so one is, how are lower or uppercase letters sorted? So I think, in this case, it would be uppercase would come first because I think uppercase is lower in the Ascii alphabet. If you remember how letters correspond to Ascii numbers, I think uppercase is lower there, so it would come first in the default ascending order of these characters. So let's move on from ordering, and let's think about one keyword that's actually kind of useful along with ordering. That's the use case of a limit. So if we have some query, and we won't get back just a certain number of rows from that query, all we could do is simply apply this limit one or limit five or limit 10, whatever number you want to limit at the very end of that query. This is helpful for finding the top 10 of something, for example, because you can order by a certain column and then take the top 10 that show up. So, let's go back to our query here. We can maybe say, let's want the first 10 students whose name is first in the alphabet, the first 10 students from the very first, alphabetically. So we say SELECT star FROM students. I'm going to order by student_name, and then we'll limit our query to only 10 students. I'll hit Enter here. And now, I should see only 10 students, those whose names are the very first 10 in the alphabet. And somewhat I could change this. I could say, let's not get the first 10 alphabetically, let's get the last 10 alphabetically. So I could do SELECT star FROM students. Now, let's order by student_name, but in descending order. And then let's go ahead and limit 10, and we should see, we get the very last 10 students in our alphabetical order. So limiting is helpful along with ORDER BY to kind of at the top or the bottom sort of segments of your data set, as long as you order by a certain column that you care about here. Question in the chat is, what if you don't have an ID and you select 10 queries, but you want to see the number, is there a way to do this? Just so I can understand the question, I think we're talking about these IDs here on the left hand side. And as we'll see later in section, these IDs are the primary keys, the unique identifiers for each of these students. And generally speaking, every student should have their own unique ID. If they don't have an ID, well, you might have some problems with database design. But if we're talking about ordering by here, ORDER BY student_name, for example, we're not really caring about the student ID just the student name, in this case. We could certainly ORDER BY ID if we like to, ORDER BY ID descending, ORDER BY ID ascending. But in this case, it so happens to be that the student ID corresponds to the student name kind of alphabetically sorted. All right, and I'm seeing a question about the last 10 students. So this actually is going to work somewhat similarly. So if I wanted the last 10 students alphabetically, I could do this. I could sort SELECT star FROM students, and then go ahead and say I want to ORDER BY student_name in descending order to get those bottom 10 students, and then whoops, and then I could limit 10 at the very end. Other questions too on limiting? Let's keep moving then. And what we could do here is we often-- it's nice to get the individual student names, but one thing we often care about is how many students are there? What's like an average test score? Trying to aggregate our data, not just take individual rows, but see kind of the bigger picture from our data. And for that, SQL has a variety of keywords you can use to get that data, get that aggregate data, so to speak. So here, we'll talk about aggregating, where aggregating means taking those individual rows, as I said before, but kind of summing things up or averaging them or kind of seeing the whole larger picture. So if I wanted to count the rows I get back, well, we could do this. I could say SELECT COUNT star FROM students. And this will take everything in our table, force selecting everything, and return the count of it, the number of rows that I have. So for example, to figure out how many students are in this table, I can say SELECT COUNT star FROM students, semicolon, hit Enter. And now, I see, I get 40 in this case. There are 40 total students. And this is kind of not very pretty, like the COUNT star. I'd love to just say like number of students, for example. So how could I change this column names, anyone know in the chat? Yeah, I could use AS. So I can say SELECT COUNT star AS in this case, maybe number_of_students, and then I could say we want to select that from the student's table. So now, I see I get the very same answer, but now my column is called number_of_students. So again, what we're doing here is selecting everything from our students table, counting it up, and returning that as a column named number_of_students, in this case. So what if we only wanted particular students, like let's say we just wanted to count students from Gryffindor? Well, how could I do that? In the chat, if you want to chime in, counting just students from Gryffindor. So I'm seeing the use of WHERE, which is helpful. Other ideas too? Yeah. So I'm seeing-- I want to use COUNT. I probably use WHERE here too, and we'll get to what we call GROUP BY in just a moment. But let's try WHERE first. So normally, to get all the things in Gryffindor, I would do a query like this. I could say SELECT star FROM students WHERE, in this case, house is Gryffindor, and then semicolon Enter. And now, these are all students who are in Gryffindor. But if I wanted just the count of these, not every individual row, what could modify my query, and to not SELECT star but SELECT COUNT star, so like the count of everything here. SELECT COUNT star FROM students, and then I'll say WHERE. In this case, the house is Gryffindor, semicolon, Enter. And now, I see I have 11 students who are in Gryffindor here, and that's all pretty good, but we also have to care about Slytherin. So let's say we want to SELECT the COUNT star FROM students where the house is Slytherin, and that would give us also 11. But how could I get all of the houses in one place, like not just individual areas for each house, for how many students are in them, but actually have it all in one query and have a table that says Slytherin has this many students, Gryffindor has many students? And I'm seeing, in the chat, this GROUP BY keyword that we're going to explore now here. So GROUP BY is helpful if you want to get the number of students by a certain attribute. So we want to kind of group our data by house and figure out, well, how many students are in each house? So let's try this. We can use GROUP BY as follows. I can say SELECT star FROM our students table, and let's go ahead and GROUP BY, in this case, house. We want to group our data by house, and I think we might have messed up slightly here. Let's go ahead to the query. We'll just hit Enter here. And now we can see where we have individual houses, but there's something weird going on with this student_name over here. We have much fewer pieces of data, but I'm just not sure why we're getting student_name over here. Well, instead of searching everything, why don't we select the count from each of our houses? So SELECT COUNT again FROM our students table, and then we'll GROUP BY house. I'll hit Enter here. And so I get back 11, 5, 13, and 11. What do you think this means, in this case? Any ideas in the chat? It's the number of students in each house, but it's hard for us to tell which house is which. Lets include the house name in our query, so let's not just select the count of everything. Let's select the house in our query too. Let's order it like this. Let's first select the house, and then select the COUNT of everything FROM students, as long as we're grouping by house. I'll Enter here. And now, I get Gryffindor, Hufflepuff, Ravenclaw, and Slytherin, where each one has a count next to it, like 11, 5, 13, and 11. So this is the total number of students in each group, as somebody in the chat is saying, that we're going to first group by house and then take the count by every house. Other questions on these GROUP BYs or this aggregation here? So it seems like we have most questions answered there. And now that we've kind of seen a variety of SQL keywords, we've kind of reviewed them from lecture, let's dive into something about database that probably isn't great for us. Like if we select everything here, SELECT star FROM students, let's just zoom out a bit and think about what isn't very well designed about this database if you take a look at it. What could be better designed about it, or where do you see some repetition that we maybe can get rid of? Yeah, so I'm seeing some people saying that there's some repetition in the houses. We have Slytherin here, Slytherin again, Ravenclaw, Slytherin, Hufflepuff, like a lot of these names are repeating. And similarly for the heads, notice how we have all the heads kind of repeated throughout this data set. And maybe, ideally, we wouldn't actually have to repeat Severus Snape, Severus Snape, all the way through our data set. We could just put Severus Snape in one place and reference Severus Snape from other places. And maybe for the house, we could say, well, Slytherin shows up a lot of times. What if we just had Slytherin in one table, and we reference Slytherin in this table? So the goal is to avoid repeating ourselves over and over again in this data set, like having Slytherin so many times or Hufflepuff so many times. I just want to make our own tables for individual pieces of our data sets. So if we're thinking about how we could better design this table, let's look at a few design principles we could use to actually improve this table here. So go back to our slides. Let's think about database design. So one principle for a database is that every table you have inside of your database should only represent a single entity, and we've actually kind of broken that rule here. So if I go back into my table, well, I have students in this table. It's a table of students. But I also have houses, I have heads, like there's lots of things going on here that don't just belong to students. So ideally, what we could do is try to simplify this a little bit. Let's try to have a single table for students, a single table for houses, and then a single table for the assignments of students between houses. So in this case, students wouldn't also have the house assignment. And similarly, houses wouldn't have the house assignment. But we have this extra table called student-house assignments that sort of tells us how these students and houses relate to each other. So as a visual here, let's dive in. Let's say-- this is what we currently look like. We have a student's table that has student names, houses, and the head of each house, but we want to break this up. We only want to have a single thing inside of each of our tables. So let's break this up. Let's say get rid of the houses and heads here. Those are gone. Let's just have students in this table, but where should we put our houses now? Let's make a new table called houses. And this one we'll put here, we'll call it houses. And notice how in this table called houses, well, we have some new IDs for each of these houses. But now, we have, well, in this case, it's called student_name. Let me fix that. That should be house in this case, house called Gryffindor, Hufflepuff, Ravenclaw, and Slytherin, and then the head of that house in this case, like Minerva, Pomona, Filius, and Severus. So, we now have our own separate table for houses. And we could even go one step further and say, well, how do we now know where students are assigned. And for that, we could make our new table, this one called assignments, down below. So you can see that I have assignment 1 to 4, where student ID is 1 and house ID is 4. So what do you think this table now is telling us down below? If I give you the full screen below here, if you type in the chat, what is this table telling us, only this first row? It's telling us that Adelaide, Adelaide Murton, belongs to Slytherin. And how do you know that in this case? How do you know that Adelaide belongs to Slytherin now? Because of the IDs. So we saw before that here, we have ID. Student ID is one, but which student has the ID one? Well, it seems like it's Adelaide if you look at that table. And I look at the 4 here for the house ID, in which house is 4. Well, that's Slytherin. So we know that Adelaide has a belonging to Slytherin in this case. Adelaide is assigned to Slytherin house. So this brings us to another design principle here, which is that, ideally, we only want to have every piece of data stored in one place. And then from thereafter, refers that piece of data by its ID. So notice how we have, in the students table, an ID for every student. In the houses table, an ID for every house. And now, in our other tables, we just have to reference that ID. We don't have to actually reference Slytherin or the student name anymore. If those things change, we change them in one place. And now, the relationship, we can sort of define elsewhere and update that as we need to. And this is a good segue into what people are saying in the chat, which is primary keys and foreign keys. So what is the primary key now in each of these tables? Where do you see a primary key? Yes, so you're seeing in the ID columns for every table. So here in students, students is primary key. The unique ID for every student is just that ID column. And for houses, the primary key is that ID column as well, and similarly, for assignments. But now, student_id is not a primary key. It's not a unique identifier for every student here. What is it instead? It's a foreign key. So it's saying that this references a primary key in another table. It's references the unique ID in students now. And similarly, this house ID 4, well, that reference the primary key in houses. So we're going to say is that the primary key is that unique ID inside of the table, and a foreign key is a reference to that primary key from another table, a separate table in this case. So if we wanted to update this table, we could do it as follows. Let's actually go through the work of trying to update this table. Currently, looks like this. We have just a single table here. But we ideally want to make a different table for houses, a different table for assignments, so let's explore some syntax to actually make that happen for us. Now, we could create a table a bit like this, and something's missing here. But the general consensus is it looks a bit like this. We could say CREATE TABLE, call it houses, that has four, actually three different columns now, one called ID, one called house, and one called head. So the table called houses will have a column called ID, a column called house, and a column called head. And what is this last bit doing for us here? Any ideas? What is that specifying about our table, at fourth kind of row here? Yeah, so it's going to say this will be a primary key. This is going to be our unique ID inside of this house's table. So ID should be always unique for every row. It should never change for every row. So, if you want to do this, we could try this, but it won't quite work for us. That's because in SQL, there are also data types. So if we kind of table if we say, what kind of type we're storing in each of our columns? So here, we might say not just ID, but that ID is an integer, or not just house, but that house is a piece of text, and so it's head. So now, we're saying that we're going to have these three columns, ID, house, and head, and that ID is an integer, house is text, and head is text as well. And we could even get more fancy with this and apply some constraints, which means we could kind of make sure that as we insert data to this table, we adhere to some rules that we set beforehand. And maybe one of these rules is that each of these columns should never be null. No meaning empty, meaning that every column or every time we add a new row, no column can be blank. We can't have a house without a name, and we can't have a house out ahead. That wouldn't make sense. So every row has to have each of these three columns in order for us to insert it. And so, kind of a translation here is that not null is similar to required. We have to make sure this column name is required when we insert some new data here. So let's try this soon, but what questions do you have so far on this great table syntax? Let me see. A good question here. So if we update the particular house's ID in the house table, can that update the ID and the assignments table? So let's go back to our visual here. It's a great question. If we looked at the ID for Slytherin, that is 4. And in the assignments table, we have the ID 4. Now, if I changed Slytherin ID to be 5, well, this is assignments table would not automatically update for me, at least if I do it naively. If I just go in here and I say that this row here, Slytherin, should now have the ID 5, that will mess this reference up. We'll now actually know what to point to anymore. What I could do perhaps is better is let's say, OK, Slytherin-- perhaps it's no longer called Slytherin anymore. Maybe it's called a different kind of name. I could change this, well, this house now. Let me fix that typo here. Fix this house, and I could call it something else. And now, if I reference that same ID, I'll get a different name for the student house, but I should never change the primary key in general. And I see a question to our primary keys. So is it good practice to have an auto increment for the primary key? Meaning that if I add a new row, like let's say if I add a new student over here, that the ID automatically increases. It's a great idea. We always want to make sure that our ID is unique. And so as we add new rows, we'll make sure they're always increasing our ID, and SQL will handle that for you. If I go back to my create table statement and if I have this primary key ID constraint down the bottom, that will mean that this ID column will always auto increment for me, and often it actually won't ever be null. So primary key handles some of these constraints for me by saying that, look, this column should never be null. It should also never be a random value, which always auto increment as I add new rows. Great questions here. Other ones too? A good question here. So when we chose to introduce, to distribute the database into three tables, why is this the best distribution? So if you go back to our visual again here, why is this the best? I actually won't say this is the very best we could do. There's probably some more updates we could make here, but this is simply better. Because in this case, we're reducing some of our redundancies, and I could more easily change my data set if I'd like to. So notice how, here, going back to our example of changing some names of houses, well, if I change the name of Ravenclaw, in my prior data set like this one, I would have to go through and update all of these names, like individually, to find Ravenclaw, Ravenclaw, Ravenclaw, update that every time. But if I have this kind of arrangement, well, I can just change Ravenclaw in one table. And now, that name has changed on my entire database as I go through. So it's helpful to not repeat ourselves while we're having these databases design questions here. So let's try actually creating this table now, and our goal was to create a table called houses that had some of these constraints here. So let's go ahead and go back to our SQLite database, and let's try it. Let's go ahead and CREATE TABLE, this one called houses. And now, I'll add my parentheses to say what columns should be inside of houses. I'll hit Enter here. And now, I'll kind of, by convention, indent four spaces. I'll hit 1, 2, 3, 4. And the first column I want to have, that one is called ID. That one will be an INTEGER. And ideally, I want it to be NOT NULL, never should be able to have a blank ID in this table. Next one, I want to have a name of the house. This one, some text. And I also want to make sure that I can't have this ever be null, never blank. And similarly, I want to have a head, head name. This one called, this up type TEXT, and also never NULL. And there's one more thing I need to add here, and what is that? A primary key. So I can say PRIMARY KEY id, and then close this out down below in my parentheses, and hit a semicolon, and hit Enter. And now, if I type dot schema again, well, I should see I have a table for students now and a table for houses. And what we could do if we have more time is spend some time actually updating this database, like take data from our students table, insert into houses, make sure the students table removes the house and head columns as well. We won't spend too much time on that here today, but the goal is simply to show you how we can add more tables and insert some data. But on with inserting some data, how could we actually add heads here? Because if I do SELECT star FROM houses, well, there's nothing in there, nothing shows up. So how should I add some data here? I'm seeing insert, so let's actually see what that could look like. I could go here, and I could say our picture right now is like this. We have a blank houses table that has the right column names, but no data, so let's actually insert some data. So in general, inserting looks like a bit like this. Insert into the table name, followed by some columns I want to add data into, and then some values want to add to those columns. And to add in let's say Gryffindor, I could do this. INSERT INTO the houses table on the house and head columns Gryffindor for the house and McGonagall for the head, so let's try that. I could say, down below, INSERT INTO houses. And which columns we want to insert into now for houses? Head and house. Let's do it in the right order. Let's first do house and then head, let's then do the values. Let's say, OK, what should go into the house column? Gryffindor. And what should go into the head column? McGonagall. And then we'll hit semicolon, Enter here. And now, let's select. Let's SELECT star FROM houses, and now, we see Gryffindor as the house and the head is McGonagall, with the ID of 1. That primary key has auto incremented for us. Let's maybe add now, whoops-- let's maybe add now Slytherin. We could say INSERT INTO houses with the VALUES Slytherin and Severus Snape. And now, if we SELECT star FROM houses, we should see our ID has auto incremented, and we now have Slytherin with Severus Snape as the head. So questions on these inserts or these creating tables? Question, what if one insert more than one row at once? I believe that that is a good use case for something like Python. So we saw in lecture, not so much today, but in lecture, how we can use Python to kind of automate SQL queries. You could imagine, for example, having a Python loop, like a for loop, that kind of runs INSERT query command over and over again to add multiple rows over and over again there. So more on that next week, when we work on like-- actually in later weeks like with Flask and so on. But for now, I would refer back to lecture and look at how we can use Python along with SQL. So to close us out here, what we'll do is just figure out how we can actually modify our queries to fit this kind of more complex database. Now that we have data in all kinds of different tables, well, how do we query it? How do we get actually the information we want to get out of it? Something like this in the very end. How do we get the data we want out of this? And there are a number of ways we could do this. And let's take a look at actually an already complete database. So I'll go ahead and I will get a database that I've already created here for you. I'll do roster.db, copy it here. And now, if I do SQLite3 roster.db, type dot schema, you should see that now we have this brand new schema for our table. So we have a student's table with a student name and an ID, a houses table with the house name and a head name and an ID. And then finally, it's assignments table, where we say that a foreign key, called student_id, references the ID column in the student's table like this, and then the house_id column here references the ID column in houses just like this. And if I SELECT star FROM houses, notice how we have all of our houses and heads. If I SELECT star FROM students, notice I have all of my students and their ID. If I then SELECT star FROM the assignments table, well, I then just see all of the student IDs and the house ID they belong to. So a bit more complex now, but the goal was to reduce our repetition, to make sure we have every entity in a single table. But how do we then get the data that we want out of this? Well, there are a few ways we could do it. One way is to use these selects and these joins. Maybe we would actually try to figure out the number of students in Gryffindor, like how many students are in Gryffindor? We tried it out before, but now we have this new schema, this new table organization. So how should we change our approach? Well, we could use these selects and we can actually make sure our select are kind of nested inside of each other, like have one select work first and then feed the result to another select later on. So let's try this. Let's say I want to figure out how many students are in Gryffindor. Well, I should probably first notice that the ID column in houses corresponds with the house_id column in assignments. So maybe, what I should first do is figure out, well, what is the ID of Gryffindor? And to do that-- any ideas for a query to find out the ID for Gryffindor? What query could I use, assuming I couldn't see it before? Assuming I don't know what that ID is, how could I find the ID for Gryffindor? Here's our schema again. So maybe one thing I could do is I'd-- I don't the ID of Gryffindor off the bat, but I could try to select it. I could say SELECT maybe the ID from the houses table WHERE the house name is Gryffindor. I'll hit Enter and I get back 3 and does that match? Let's see. SELECT star FROM, in this case, houses. And I see Gryffindor in this case is 3, although admittedly, over here is one. That's a little confusing. But in our actual table, it is 3 over here. And now that I have the idea of Gryffindor, well, how could I find all the students who are in Gryffindor? I could try to do this. I can maybe say select the count of the student IDs. That's the count of student_ids from the assignments table WHERE the house_id, in this case, was 3. Gryffindor in this table here is 3. I'll hit Enter, and I get 11 students, which seems right. So this query could still be better designed though, because here, we're talking about this number three, but what could we do instead? How could we maybe not hard code this number three, if you recall from lecture? Any ideas in the chat? We could use a JOIN. We'll see those later on, but what else could we do? I'm seeing some ideas of kind of nesting these queries. So as an example here, let's say we first run this query. SELECT if FROM houses WHERE the house is Gryffindor. Give us the ID of Gryffindor. Then let's feed the results of that query into a larger query. This one, the above. So here, notice how we have some parentheses saying that this query first will be run at the very beginning. The result of this will then be passed up to our larger query, our query that's kind of nesting this one. So I'll run this first query. I'll go back one, at least in this table, and then I'll say that in this case, the house_id for Gryffindor is 1. And now, there's a query itself completes. I could say, it's like COUNT the student_id FROM assignments WHERE house_id is one in this case. So let's try it out. I could say SELECT. I know they're beginning the count of student_ids FROM the assignments table. But now, I need to figure out what's the ID of Gryffindor? Well, I know I want a house ID and I want it to equal something, but what is that something? What I could do in this case is I could say, well, I don't really quite know. I could instead-- I could say, well, go ahead and first find me, that ID for Gryffindor, SELECT id, in this case, FROM the houses table WHERE the name or the house is Gryffindor. And then I will go ahead and hit Enter here, close the query with a semicolon here, hit Enter, and I still get that same result. This is better now, because if the idea of Gryffindor changes, notice how in this table it's three, but in this table it's 1. If the idea of Gryffindor changes, well, now, I can just figure out what I'm looking for without actually caring what the ID of Gryffindor actually is. So thoughts or questions on these nest and selects? What questions do you have? A question here, can we nest without any limit? You theoretically can have as many nestings as you want to have. So we could maybe have a query here, if I look up here. Like maybe we don't know something about this query, we could say, OK, run another query to give me that result, to kind of nest them and nest them over and over again. You can absolutely do that. It comes at a cost though of performance. You want to make sure that you're kind of limiting your number of nestings to make sure you're not just looking up arbitrary data that you don't actually need in this case. So, no theoretical limit really, but there is a limit if you're concerned about performance, which you should be. So this was one way of approaching this problem, counting students of Gryffindor with this new table using a nest and selects. But another way-- this maybe a little more intuitive. In some ways, it's going to be this JOIN that we saw earlier. So this will bring us almost to the very end of our lecture, to look at JOINs here. But instead of using SELECT, we could use this idea of putting these tables together. Here in our select, which we're treating them still as separate tables. But with JOINs, we could say, let's actually put our tables together and think of it as one bigger table in this case. So if we look at our visual again, counting students in Gryffindor, we're going to see something like this. So we have three different tables, but it would be helpful, like ideally as a human, we could just simply take these tables and put them together, like just kind of LEGOs stick them together so they fit right next each other. We can see which houses align with which IDs. So let's again notice that in this case, the ID column in houses corresponds to the house_id column in assignments, and we just want to pair them up like put them right there and see that larger, bigger table, and that's what join does for you. So a JOIN, once you run it, look a bit like this. It'll put your two tables together and have them online based on the ID you specify. So here I'm saying, give me the assignments table and JOIN houses onto it. So notice how on the left hand side here, I see assignments, that is the assignments table, and I see houses, which is the houses table. But they're corresponding, they're now next each other, based on the fact that house_id in the assignment table corresponds to or references that ID column in houses. So SQL knows which rows to put together with which rows in each table based on this ID we specify. And the syntax for this looks a bit like this. We could say SELECT everything FROM the assignments table. But first, JOIN on the houses table, and make sure we use this column as the corresponding column in this case, like the assignments.house_id, meaning the house ID column in assignments, and houses.id, meaning the ID column in houses, making sure that these correspond to each other in this case. So notice how I go back to our visual here. We have the ID column with houses corresponding to the house_id column in assignment. So let's try this in SQL. Let's go ahead and SELECT star FROM our assignments table. But first, let's go ahead and JOIN on the houses table, and let's join houses on a certain column. Let's say we want the houses.id column to match up with the assignments.house_id column. And here, we do this. And if I zoom out a bit, you'll see the very same table. We have a student_id, the house_id, the ID of the house in the house's table, the house name, and the head. So what can we do now to figure out how many students are inside of Slytherin or Gryffindor in this case? I'm seeing some longer queries, which is good. We could simply count up student IDs. I'm sure we can do this, but let's try this one. We could say, I want to SELECT the COUNT of the student_ids that I have from the assignments table. But I first want to JOIN in. I want to JOIN in the houses table, making sure that, in this case, the houses.id column matches up with the assignments table with the house_id column. And now, with that bigger table, well, I only want the rows where it seems to be that the houses' house column, the house column in houses is equal to Gryffindor. Now, hit Enter here. and now, I get back 11 still, and I could do the very same thing for Slytherin. I could just change Gryffindor for Slytherin or Hufflepuff and so on. But the idea here is that we're trying to see our tables not as individual tables anymore, but now as one bigger table, and we can use that to figure out, well, what kinds of columns can I query to get the result that I want now. So what questions are there on these JOINs, either on the syntax or how they look? A good question-- so here we say, so it does not matter when we say JOIN on a foreign table in a primary table? Let me try to parse this question. What we could do-- notice how in this above query, I said houses.id equals assignments.house_id. And the reason for this was that if I have two ID columns in two tables, well, those columns are unambiguous in their own tables. But when I JOIN them, those columns of the same name are now ambiguous, meaning that they have the same name and there are two columns, so we have to differentiate them somehow. So if we say houses.id and assignments.house_id, well, if I had two columns with the same name, maybe it's just ID in assignments, I could say, houses.id equals assignments.id and therefore get the appropriate column name as we go through. A question, how not to show identical IDs, like the house_id in the house table and assignment table. So if I do this, go back to SELECT star, SELECT everything FROM our assignments table, joining in the houses table on houses.id equals assignments.house_id, notice how we had some repetition here. We had house_id and we had ID. In the way that SQLite works, this is just the way it adjoins tables. You're always going to get those duplicate columns here, but maybe helpful if you narrow down your SELECT. You could say maybe not select everything, but maybe SELECT the house_id. You could SELECT student_name and so on. So you could narrow your SELECT, not select everything. But in general, when you JOIN two tables, you are going to get those duplicate columns because SQL needs know which columns to JOIN by. Does that makes sense? And a good question, can you give an example of a JOIN with a GROUP BY? So we absolutely could. So notice how-- I remember how an earlier in section, we tried to have this GROUP BY that figured out how many students are in every house as a whole, like Slytherin, Gryffindor all in one table. We could do the same query now, but with a JOIN, with our separate tables. We could say, to be first, SELECT for me the houses, the house like the actual house name, and the COUNT of the student_ids in that house. And what I'll do now is I'll say I want to SELECT that FROM the assignments table. But I want to first JOIN in the houses table so long as the house's ID column corresponds with the assignments.house_id column. And now, what I could do is I could just simply GROUP BY house, or to be more specific, I could GROUP BY houses.house, the house column in houses. But I'll just say house for now, for consistency, and I'll hit Enter now. And now, I see that very same thing. Gryffindor has 11 students, Hufflepuff has 5, Ravenclaw has 13, and Slytherin has 11. So good question there. If we just treat our JOIN table as a regular big old table, well, we can do the very same, arithmetic as before. A good question here, so can we use AS when we JOIN the two tables? Let's try it. So, let's say SELECT house-- SELECT star, sorry, from, let's say, assignments. And then let's go ahead and JOIN on the houses table with houses.id equals, in this case, assignments.house_id. And the question was, can we use AS here? Can we say AS? Let's just call all of this ID, and let's see what happens. We'll hit Enter here. And I don't think we can do that, because we're trying to kind of rename these columns out or joining them. SQL expect the actual column names, not the aliases that we're going to use while we're joining, so a good question there. Other questions on these joins? For a visual for folks, if you are still wrapping your mind around this, it's helpful to see in this case that we're trying to simply figure out which IDs are sort of matching up with each other, and this part of our query is actually doing that matching for us. Can you join a table more than once, is a question in the chat. You absolutely can. So if I wanted to have more than one table in my joint, I can certainly do that. I could maybe get back my entirety of that original table we had before. I could say SELECT star FROM students, originally, but then I want to join, let's say, the assignments table, where the, let's see, it would be students.id is corresponding to the assignments.student_id. Let me zoom out a bit, so that JOIN is ready. Now I'll also JOIN to this table. I'll JOIN on the houses table so long as the houses.id column corresponds to the assignments.house_id column. And now, I should get back my entire table, joining these two back together. And now, it's kind of big and kind of redundant now, but I do see that very same table we had before, everything joined together, originally. Can you alias table name? I don't believe-- I'm not quite sure. If you were going to try it out yourself, I don't think you would be able to do that in this case, often because they don't show up in the query itself or the result of the query itself. Difference between JOIN and a UNION. We probably won't dive into unions so much here. But a UNION is good for finding out things that are in common with different data sets. Let's say you have one table and another, and both tables have some intersection, like datas [INAUDIBLE] a UNION would be good for that, whereas a JOIN is just good at simply combining your tables together as you go through. Other questions on joins before we wrap up? I'm happy to stay after as well. So, why don't we wrap officially here? Well, this was the end of our section for this week seven. Thank you all so much for coming. It was wonderful to see you over Zoom this week. I'm happy to stay on and answer questions in the chat, but otherwise, we'll see you in the future. Thank you all. Recording stopped. And thank you for all of your wonderful questions. I appreciate them in the chat. If you'd like to ask any others, I'm happy to stick around for a few more minutes. And thank you for that congratulations. I appreciate that. A question on GROUP BY, let's go back to that. The question was, how does GROUP BY work? What's the point of GROUP BY? It's maybe helpful if I show you how it would work in this context. Let's go ahead and join-- back to we have before, let's select everything from our students table. And let's go ahead and also JOIN in our assignments table ON assignments dot, in this case, student_id is equal to students_id. And let's join the, I can't remember, the houses table on houses.id equals assignments.house_id. So now, we have our big, big table. But let's say we want to group that data or do something with that data within a certain group, so we could say this. GROUP BY, often this case makes sense as GROUP BY house. And this is saying that if I wanted to count up anything or average anything or figure out how many of something I have, well, normally, if I didn't say GROUP BY, I would just get the count of all the students. But if I instead GROUP BY house, what I'll get-- and my code space is kind of glitching right now, but I will get back that same count, but only looking at the smaller subset of data within each group, so looking at how every house individually and counting up within that house group, if that makes sense. And if that doesn't quite make sense yet, I encourage you to experiment with it more to figure out how GROUP BY works, and there's still some resources on the web you could learn from as well. Yep. And we will certainly share the recordings for these in [INAUDIBLE] 2023, so it will be very early January as well. So other questions here? A question, are commands with capital case? That's by convention, so it helps us note between the capital of SQL keywords and our column names here, so by convention. Are sorting and GROUP BY similar? Sorting is helpful if you like ORDER BY, like is that an example of sorting? I could order my data. And that is kind of helpful for seeing groups in my data, like seeing what groups there are. But if I want to count something up within each group, well, I should still use GROUP BY in that case. So GROUP BY is often used with these aggregators, like count or average. Here, I don't have that up here, but I would normally type like COUNT star, AVERAGE star, and then a GROUP BY, so good question. Well, so glad this is helpful, and thank you all. It's great to see you over Zoom. I hope to see you next week.