DOUG LLOYD: So in our videos on web development, we've talked a couple of times now about what a database is. And in this video, we're goingww to get into a bit more detail and show you exactly what a database is, why we would use it, and how we might manipulate it. In order for us to build websites that are sort of more complex than just a page where they just go and see news, for example, we might need a database to store information, such as username and password combinations so that a user attempts to log in. That log-in information is sent to the database. It is checked against information in the database to see whether that username-password combination matches.

And if so, it lets the user in. We might also store other stuff for users, like their shopping history or really any other information that you might want to keep long-term for a user. Now, if you ever used programs like Microsoft Excel or Google Sheets or Numbers, you're probably familiar with the basic idea of a database.A database consists of a couple of different levels of hierarchy. Within a database, we have tables. And within each of those tables, we have rows and columns. And if you are familiar with Excel, or perhaps even if you're not, let's take a quick second and I can draw this analogy for you about what a database, a table, a row, a column is. And then we can translate that in just a moment to the same idea in SQL.

So on my screen here, I've opened up an Excel file. And across the top here, you see I have these letters-- A, B, C, D, E. These would be different columns in my file. Down the left side, I have numbered rows, where I might put information. So, for example, I might just put my name here in cell A1. So it's in the A column in the first row. Now, down at the bottom left is a feature that you may not have used too much if you don't use Excel very frequently. But there's these notions of sheets. Sheets are sort of akin to different tables. Notice that when I switch to sheet 2, the data that was in sheet 1, my name, is no longer there. So each sheet has its own unique set of rows and columns. But all of these sheets are still bound up in one single file called Book 1. So the analogy here is that Book 1 is our database, which contains a different number of tables, which in Excel parlance is just a sheet. And each sheet has columns that we can put data in and rows that we can put data in, as well.

So there are a couple of different types of database engines that we can use in our programs. We're going to talk about SQL, S-Q-L, which stands for the Structured Query Language. The Structure Query Language is a programming language whose sole purpose in life is to query or ask questions of or retrieve data from a database. And there are many different implementations of SQL. Two of the most popular are as follows-- we have MySQL which is an open-source platform. It is very commonly used to establish relational databases. We're not going to get into a lot of detail in this video about what a relational database is versus other types of database. But just know that there are more than one type, and we cover relational databases in this course. Another type, which we actually have used in CS50 since 2016, is SQLite, which has a very similar feature set to MySQL. But it's just a little more lightweight. It's a little easier to use on CS50 IDE.

And so that's why we're actually going to be using that one in the class. Now, regardless of which implementation of SQL that you use, a lot of them will come with a tool called phpMyAdmin, which is a GUI or Graphical User Interface tool that is used to execute some of the more tedious or mundane database queries in a more user-m friendly way because you can just click and do things in the web browser. And those are most commonly used to build databases in the first place and to set up tables because that's the first thing you're going to do once you get your database configured, is to create a table. Because without a table, we have no rows and columns. And without rows and columns, we can't store any data. Tables have very cumbersome syntax that is used to set them up. And so using phpMyAdmin, the graphical interface, to build your table is definitely gonna come in handy. In the process of building your table, you're going to have to specify exactly which columns are going to be stored in that table. So at the very beginning, when you create your table, you say, my table's going to store usernames and passwords and whatever sort of other information you want. and you have to specify that before you have inserted any data into the table.

So you have to design ahead of time what your table is going to look like. Once you've done that, pretty much every query except for, like, deleting the database and deleting the table that you're going to use on that table going forward is going to refer to data which is stored in the different rows of the table. Just like in C, every column of our SQL table is capable of holding data of different data types. So in C, for example, we had characters and strings and integers and floats. And SQL has a few more data types than that. This is just 20 of them, and this is not even an exhaustive list. But some of these things should look familiar. So, for example, we have int, which can store integers. But we also have these four other types, which can hold integers with different upper bounds. So you may recall from C that the upper bound of integer is 2 to the 31st power, or 2 to the 32nd power if they're unsigned integers. But here we can say small ints or tiny ints or medium or big ints, each of which have different upper bounds on the values. Decimal and float stand in place of double and float, which we're familiar with from C. We can also store date and time stamps in SQL databases. There's no data type for that that's native to C. But in SQL, there are a couple of different ways to do this. We can even do more exotic things like store geometry or line strings.

And what are these? Well, geometry and line strings can be used to store in a SQL database a mapping out or a drawing out of an area on a map, such as using GIS data. And we can actually store that in our SQL table and recreate that exact drawing a little bit later on. Text sort of stands in the stead of strings for just arbitrarily large chunks of text. Enums are also a type that exists in C, but we don't talk about them too much in C. But really quickly, what an enum is is it is a column of your table that can be used to store a limited set of values. So, for example, I could have an enum that is called, like, favorite colors, for example. And I could specify when I'm building my table that it can only be capable of holding red, green, and blue. If you tried to insert a row that had purple in that place, that wouldn't work because it is not one of the enumerated values that can be stored in that column. There are also char and varchar. And these are not quite the same as what you might think they are in C. So let's just take a quick second to explain the difference between these two data types, which are pretty important. So unlike in C, char does not refer to a single character. It is actually sort of akin to our notion of a string, but with a caveat that that string is a fixed length. And typically when we specify a char or a varchar type, we have to specify the length of that string at the outset, just like we do in c. So, for example, my column type might be a char 10. That means that I can store 10 character strings in that column of my table, exactly 10 character strings.

So if I tried to store, for example, the word "hi," H-I, which is just two letters, that would go into the column OK. But it would also store 8 extra-- the equivalent of null bytes, basically, so that I still had 10 characters, per se, in that column for that row. And if I tried to store, like, a 15-character string, that wouldn't work. I'd only end up storing the first 10 letters. So it's always going to be 10, every time. A varchar, on the other hand, refers to a variable-length string. So if I said that my column was varchar 99, I can store 1, 2, 3, 4, 5 character strings, up to 99 characters long, without having to have all of this extra sort of slack space or null bytes or zeros tacked onto the end. So char, fixed-length strings. Varchar, variable-length strings. We won't get into the difference between why you might want to use one or the other. But there are reasons why you might not always want to use a varchar, and you might want to use a char in some situations. Now, SQLite actually has many or all of these same data types. But the difference there is that each of those data types is affiliated with what's called a type affinity to simplify things.

So that whole list of 20-- or more, really-- can be reduced to one of these five different affinities-- null, integer, real, text, and blob. Null is probably pretty obvious-- nothing. Integer-- whole numbers. Real-- that would include things like decimal and float. Text would include things like char and varchar. Blob is just-- that would be more like the geometry or data that isn't really text, but it's just a large number of bits or bytes. That would be a blob. And everything can reduce to one of these five different type affinities. So after specifying the columns, it's also really important in our SQL table to have one other consideration, which is to have one column which is our primary key. Why do we need a primary key? The reason is every row of our table, in order to make our SQL queries most effective, should be able to be uniquely and quickly identified. So choosing the right primary key will allow us to make sure that there is one value in every row that is completely unique. And if that is true, if there is one column in every row that's unique, then we can uniquely identify or very quickly identify which row we're talking about. Now, it's also possible to establish a joint primary key, which is just a combination of, say, two columns-- but it can be an arbitrarily large number of columns-- that is always guaranteed to be unique. So I could have one column that could always have A's or B's or C's. I could have another column that has ones and twos and threes and fours. I could have multiple A's and multiple B's and multiple C's, multiple ones, twos, and threes. But across all of those rows, I'm only ever allowed to have 1 combination of A1. I can have B1, C1. I can have A2, A3, A4. But there's only ever one combination of those two columns that is unique. That would also be acceptable because that is a joint primary key. Now, SQL is a programming language, like other programming languages that we've discussed. But it has a very limited vocabulary. Now, there are more things that you can do with SQL than we're going to talk about in this video because in this video we're just going to talk about the four operations that one can perform on a table, or the four main operations that you'll most likely perform on the table-- INSERT, SELECT, UPDATE, and DELETE. And these four things we'll get into a bit more detail in just a second about. But these are definitely the most common four things you'll be doing with the tables that you're building, certainly in CS50, but also probably more generally whenever you're working with databases. So for all of the examples that we're going to cover in this video, we're going to consider a database that contains these two tables called users and moms.

And you can see users has four different columns-- idnum, username, password, and fullname. And moms has two different columns, username and mother. And let's now start to work with these and see how we can manipulate this database and the tables within it to-- for whatever reason we need them to-- be updated in our website. So the first operation we'll cover is insert. And as you might expect, it adds information to a table. Now, SQL queries have a certain structure to them. And so with each of these operations, what I'm going to do is give you the general sort of skeleton of what a query might look like. And then we'll use a couple of examples of that query to see how it affects our table. So an insert query, in general, looks something like this-- INSERT INTO table. So we specify what table we want to insert into. And then we have parentheses, columns-- which is a comma-separated list of all of the columns of our table that we want to insert data into. Then we have VALUES, and then a comma-separated list of the values that we want to put into those columns in the same order. So for example, I might want to INSERT INTO users into these three columns-- username, password, name. The values, respectively-- newman, in lowercase, USMAIL-- that's going to be corresponding to the password-- and then capital-N Newman for fullname. So assuming that I execute this query on the users table shown here, what's going to happen? Well, this is what's going to happen. We're going to add this row because we inserted into the users table. But you might be asking yourself, wait a minute. You never specified ID number. So how exactly did ID number get there? I mean, you specified newman and USMAIL and Newman. But that 12, that was never part of our query. And you're right. Because when you define the column that ultimately ends up being your table's primary key, it's usually a good idea to have that column be an integer. It's not a requirement, but it's usually a good idea. And because having a primary key is so important, you might want to specify it to autoincrement.

And what this basically means is you can forget to include it as part of your query. And if you do, it will automatically insert a value there that is unique from every other value in that column, typically by just incrementing by one every time, so that that row is still guaranteed to be unique. Because if it wasn't set to autoincrement and we just forgot to omit it, there might be a couple of rows that are actually blank or null there. But if that's our primary key, if ID number is our primary key, and we have a couple of rows that are blank there, now they're not unique. We have two rows that are blank. We can't uniquely identify them. And so when you set up your primary key, you usually want it to be an integer, and you usually want it to autoincrement so that you can completely forget about having to worry about that and let the table do it for you. So that's how 12 got there. I had specified my idnum column do autoincrement every time I make an insert into the table. Let's do this another time. Let's insert into the other table, insert into the moms table, the following values into these following columns. We can INSERT INTO the username and mother columns of the users table the VALUES kramer and Babs Kramer. And just like what we saw before, this is what we would have as a result. We would just insert that row, and it would now be part of our database and part of our tables. So that's the insert operation. Now let's talk about the second operation, which is select. So if insert is used to put information into a table, select is used to get that information back out of the table so that we can do something with it. Select queries also have a very similar skeleton. And they look like this-- SELECT and then whatever columns you want to choose FROM whichever table you want to choose them from. And then optionally in red here, these two other things-- WHERE some condition is satisfied. And we usually, in the context of SQL, refer to a condition as a predicate. But basically, we're just checking to see that some situation is true. And we might want to, for example, order them by a specific column so that they are organized alphabetically by the values in some column or whatever else. We don't have to do either WHERE or ORDER BY. But generally WHERE in particular is going to be used so that you don't get your entire database back. And ORDER BY just helps to keep things a little more organized. So, for example, here's a select query that we could use-- SELECT idnum, fullname from the users table.

So let's see what's going to happen here. So remember, I'm selecting idnum and fullname from the users table. Well, what is the select query going to return to me? This information here. It's going to look at each row of the table, and it's going to just pull out the ID number and the full name and give me all of those down the entire users table. My user table only consists of three rows, so it's given me three pairs of information, three idnum-fullname combinations. Now let's say I want to restrict my search a little bit, my query a little bit. SELECT password FROM users WHERE idnum is less than 12. So now I'm adding a predicate or a condition to my select query. What's going to happen here? Well, I'm going to get this information. This is just the password column from the users table where the ID number in that row is less than 12. So only rows 10 and 11's password get given back to me. 12's does not because 12 is not less than 12. There's one other thing we can do with a select query, and that's to select *. SELECT * from moms where username equals jerry. * is just shorthand for every column. So instead of having to specify username, mother, the two columns here, or, if I was using the users table, having to specify idnum, username, password, fullname, I can just say SELECT *. Give me everything in that row. And if I SELECT * from moms where username equals jerry, I am given this set of information back. So here's the thing that's cool about databases, though-- we don't have to just have two tables. And in particular, we don't have to have just one table that stores every relevant piece of information about a user. If we wanted to store, in addition to what we have there, information like their address and their date of birth and their social security number or whatever else we wanted to keep about them, this table could get bigger and bigger and bigger and bigger to the point where it becomes almost annoying to have to use it. But we can use relationships between different tables. And that's where the term relational comes into play for a relational database. We can set up our tables within our databases carefully enough that we can use relationships between them to pull information from where we need it without all that information having to be located in the same table in the first place. So, for example, here is our database as we last left it, with three rows in each of the users and the moms tables. Now let's imagine a hypothetical situation where we want to pair a user's full name, which is currently only stored in the users table, with their mother's name, which is only stored in the mother table. They're not on the same table right now. So we can't use the same sort of select query that we used before. We have to use what's called a select join query. It looks pretty similar to the select query before. But we're adding one extra piece of information, which is what tables we're joining together. So I want to select a specific set of columns from one table, joining another table onto it just temporarily. It doesn't really merge them. But it creates this sort of hypothetical table that does merge them. On predicate, where predicate is-- basically, we're trying to find where the two tables overlap so that we can create this merged table just temporarily for purposes of this query, and then have them separate again.

So, for example, I might have a query that looks like this-- SELECT users.fullname, comma, moms.mother FROM users joining the moms table onto it temporarily ON-- which is akin to WHERE-- users.username equals moms.username. And this syntax is also a little unfamiliar, right? I'm now prepending some of my column names with table names. The reason I'm doing this is particularly because of the last line there, where in each of those two tables I had a column called username. I need to disambiguate between them. So I need to specify what table and what column. That's all that's happening here, is I'm being very explicit about which column I want by specifying not only the column name, but also the table name. So here is what would happen if I executed this select query. Here are our two tables at the outset. This is all of the information that I'm trying to get to start with. I'm going to join these two tables together where where users.username equals moms.username. Newman does not appear in both tables. So I'm not going to even come close to extracting that piece of information. Kramer does not appear in both tables, so that one's out too. But jerry and gcostanza both appear in both tables. And so that is sort of the overlap point. And you can think about this as if we have our users table, which is like one circle over here, and our moms table, which is one circle over here. When we're joining, it's sort of like if it's a Venn diagram. We're trying to figure out what those two tables have in common and using that as, like, our anchor point for the join tables. So that's what we did here. We found where the two tables have data in common.

And based on that, we create this hypothetical table called users & moms that contains all of the information that appears in both of those tables. But notice that the second column there sort of indicates the overlap. This column refers to data that is in both users.username and moms.username. So that's what we're extracting here. But remember that this query, we didn't want everything. It wasn't SELECT * FROM users JOIN moms WHERE users.username equals moms.username. Username We just want the full name and the mother's name. And so that's what we extract. And so that really long query just results in us getting this set of information-- users.fullname and moms.mother for every row in both tables where the username happens to be the same. All right. So that was a bit more complicated. You probably won't do too many joins. Fortunately, the last two operations are a little bit easier. So update is the third of the four major operations. This is what we use to modify information in the table or change it or update information. The skeleton looks like this-- UPDATE table SET column equals value-- so we're updating some column to have a new piece of data in it-- WHERE predicates. We're trying to figure out which row we want to update, and then we're updating that column. So, for example, I want to UPDATE users SET password equals yadayada WHERE idnum equals 10. You can probably guess what's going to happen here. We're going to find the row whose ID number is 10, and we're going update the password there from "fusilli" to "yadayada." Pretty straightforward. And the fourth and final operation that we do in SQL most commonly is to delete, which strikes or completely removes information from a table. The basic framework looks something like this-- DELETE from some table WHERE predicate, or, again, where some condition is satisfied. So, for example, you might want to DELETE from our users table WHERE the user name equals 'newman.'

Now, if we did that, and this was our table to start with, you can probably guess what's going to happen. After that query executes, the row whose username is newman is no longer in our table. Now, all of the operations that we just showed are really easy to do in the graphical interface of phpMyAdmin, which, again, comes with most implementations of SQL that you might download. Here's the catch, though-- phpMyAdmin, despite being very user-friendly to use, requires manual intervention. You're going to have to log in and click on some buttons to delete information or update it or change it. And we don't want to do that. We're a webmaster, right? We want things to happen automatically for us. We want programs that we write to make those changes for us. So we don't want to go into phpMyAdmin and make those changes. We have to write code that does it for us. Fortunately, SQL integrates really nicely with a lot of modern programming languages such as Python or PHP that have functions that you can use to connect to your database. And then the programming language has other functions that will query the database for you and make those changes.

Now, we're going to leave off for now exactly how those languages do that. We'll save that for a video on those languages themselves. But know that that is something that you can do to avoid having to do any sort of manual intervention when you want to update or do any sort of operation at all on your databases.

My name is Doug Lloyd. This is CS50.