WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:03.486 [MUSIC PLAYING] 00:00:16.434 --> 00:00:19.090 CARTER ZENKE: Well hello, one and all, and welcome 00:00:19.090 --> 00:00:22.510 back to CS50 Introduction to Databases with SQL. 00:00:22.510 --> 00:00:25.090 My name is Carter Zenke, and today we'll focus 00:00:25.090 --> 00:00:29.050 on optimizing-- trying to reduce the time our queries take 00:00:29.050 --> 00:00:32.500 as well as reducing the storage space our database might use. 00:00:32.500 --> 00:00:34.960 We'll also see how to handle concurrency. 00:00:34.960 --> 00:00:38.300 That is, many queries all at once. 00:00:38.300 --> 00:00:41.560 Now we'll do all of this in the context of a new database, 00:00:41.560 --> 00:00:45.940 this one-- the Internet Movies Database, or IMDb for short. 00:00:45.940 --> 00:00:51.680 If you've been to imdb.com you may have seen this whole big database of movies, 00:00:51.680 --> 00:00:56.360 and we've compiled that into our very own SQLite database for us today. 00:00:56.360 --> 00:00:58.830 Now this database is big. 00:00:58.830 --> 00:01:01.240 It is much bigger than anything we've seen so far. 00:01:01.240 --> 00:01:07.090 In fact, it has over 200,000 movie ratings, over 400,000 movies, 00:01:07.090 --> 00:01:12.850 and 1.4 million stars or people who have acted in some movie. 00:01:12.850 --> 00:01:17.360 And the goal of this database is to represent movies, the people in them, 00:01:17.360 --> 00:01:21.280 and the average ratings for each of those movies. 00:01:21.280 --> 00:01:24.250 Now we have ourselves an entity relationship diagram 00:01:24.250 --> 00:01:27.470 to describe this database, and it looks a bit like this. 00:01:27.470 --> 00:01:31.690 We have people in one table, movies in another, 00:01:31.690 --> 00:01:34.870 and ratings in a separate table down here. 00:01:34.870 --> 00:01:37.870 Notice that people have a name and a birthday 00:01:37.870 --> 00:01:41.320 along with an ID, the primary key for that table. 00:01:41.320 --> 00:01:45.020 Movies too have a title and the year they were released 00:01:45.020 --> 00:01:48.130 as well as their own primary key ID here. 00:01:48.130 --> 00:01:51.970 Rating down here has the average rating for a given movie 00:01:51.970 --> 00:01:55.540 as well as the number of votes that went into calculating 00:01:55.540 --> 00:01:57.220 that average rating here. 00:01:57.220 --> 00:02:02.950 And notice too that the movie and people table has a many-to-many relationship. 00:02:02.950 --> 00:02:06.790 That is, a person can star in more than one movie 00:02:06.790 --> 00:02:11.450 in the same way a movie can have multiple stars. 00:02:11.450 --> 00:02:14.560 So if we want to implement this many-to-many relationship, 00:02:14.560 --> 00:02:17.230 we've seen one way to do that. 00:02:17.230 --> 00:02:19.720 Let me actually ask the audience here, how 00:02:19.720 --> 00:02:23.470 have we implemented a many-to-many relationship historically? 00:02:23.470 --> 00:02:27.170 What kind of tables do we need for this, do you think? 00:02:27.170 --> 00:02:27.670 All right. 00:02:27.670 --> 00:02:30.320 So not to worry if you don't have an idea of what we're going to use here. 00:02:30.320 --> 00:02:32.528 Let's take a peek at what tables we can actually use. 00:02:32.528 --> 00:02:37.030 So here as we've seen historically we might have two tables, one for people 00:02:37.030 --> 00:02:38.770 and one for movies. 00:02:38.770 --> 00:02:41.800 And notice here in the people table we have names 00:02:41.800 --> 00:02:45.130 along with a primary key called ID. 00:02:45.130 --> 00:02:49.300 And similarly in our movies table, we also have a title 00:02:49.300 --> 00:02:55.180 column along with a movies own primary key, an ID column over there. 00:02:55.180 --> 00:02:59.380 Now to implement the relationship between people and movies, 00:02:59.380 --> 00:03:02.810 we have to have some table in the middle of these two. 00:03:02.810 --> 00:03:08.050 One that relates people IDs or person IDs with movie IDs. 00:03:08.050 --> 00:03:09.940 And if we look at this table, I could see 00:03:09.940 --> 00:03:14.020 that if I see person ID next to some given movie ID, well 00:03:14.020 --> 00:03:17.260 it means that the person with this ID has starred in, 00:03:17.260 --> 00:03:19.660 played a role in, this movie here. 00:03:19.660 --> 00:03:21.610 We see person 158-- 00:03:21.610 --> 00:03:27.160 let's see, Tom Hanks-- was in the movie ID with 114709, or Toy Story 00:03:27.160 --> 00:03:29.500 as we see in the movies table. 00:03:29.500 --> 00:03:33.160 And now knowing this, how can we figure out which 00:03:33.160 --> 00:03:36.160 movies Kristen Bell has starred in? 00:03:36.160 --> 00:03:38.770 If we know of Kristen Bell's ID, what should we 00:03:38.770 --> 00:03:42.841 then do to find the movies Kristen Bell has starred in? 00:03:42.841 --> 00:03:48.490 AUDIENCE: You could select the ID of the people then look into the stars 00:03:48.490 --> 00:03:51.462 table, then look at [INAUDIBLE] the movies table. 00:03:51.462 --> 00:03:53.170 CARTER ZENKE: Yeah, I like your thinking. 00:03:53.170 --> 00:03:54.610 So there's multiple steps here. 00:03:54.610 --> 00:03:58.780 We have to go through each table to find ultimately the movie that Kristen 00:03:58.780 --> 00:03:59.950 Bell has starred in. 00:03:59.950 --> 00:04:03.610 The first step might be to ask, what is Kristen Bell's ID? 00:04:03.610 --> 00:04:06.310 Well it seems to be 68338. 00:04:06.310 --> 00:04:09.040 Then, to your point, we could look at the stars table 00:04:09.040 --> 00:04:12.710 and find out what movie IDs correspond to that ID here. 00:04:12.710 --> 00:04:17.779 So it seems like 2294629 corresponds with Kristen Bell's ID here. 00:04:17.779 --> 00:04:20.200 Then finally we look up in the movies table, well, 00:04:20.200 --> 00:04:23.770 what title corresponds with 2294629? 00:04:23.770 --> 00:04:25.660 It seems to be the movie Frozen. 00:04:25.660 --> 00:04:30.350 So visually, we have this relationship among these tables here. 00:04:30.350 --> 00:04:33.610 So let's now dive into our movies database 00:04:33.610 --> 00:04:36.100 to actually get a sense of what data is inside of this. 00:04:36.100 --> 00:04:39.100 And for that I'll go back to my computer so we can open this database up 00:04:39.100 --> 00:04:40.300 in SQLite. 00:04:40.300 --> 00:04:44.410 I'll come back over here and I'll open up my own environment 00:04:44.410 --> 00:04:48.250 where I can type sqlite3 movies.db. 00:04:48.250 --> 00:04:49.780 movies.db. 00:04:49.780 --> 00:04:56.980 And now if I zoom in just a bit, I can then type .schema to show you all 00:04:56.980 --> 00:04:59.530 the tables that exist in this database. 00:04:59.530 --> 00:05:06.070 I see up top I do have a movies table that has a primary key called ID, 00:05:06.070 --> 00:05:08.980 a column called title and a column called year. 00:05:08.980 --> 00:05:12.040 I have that same people table with names and birthdays, 00:05:12.040 --> 00:05:17.530 and I also have this stars table that corresponds with person IDs and movie 00:05:17.530 --> 00:05:21.780 IDs, relating in this case people and movies. 00:05:21.780 --> 00:05:25.600 OK, so let's try selecting then from our movies table 00:05:25.600 --> 00:05:28.930 to see what movies we have, take a peek at our data. 00:05:28.930 --> 00:05:30.880 Well, we saw a little bit ago, we could try 00:05:30.880 --> 00:05:33.640 to take a peek at our data using a certain SQL keyword. 00:05:33.640 --> 00:05:38.800 I could SELECT * from "movies" to see all columns from movies, 00:05:38.800 --> 00:05:42.790 but if I have let's say 400,000 rows I don't 00:05:42.790 --> 00:05:44.540 want to print all those to the screen. 00:05:44.540 --> 00:05:47.480 So what could I use instead? 00:05:47.480 --> 00:05:49.598 A certain SQL keyword. 00:05:49.598 --> 00:05:51.140 I'm hearing LIMIT, so let's try that. 00:05:51.140 --> 00:05:53.000 I'll say LIMIT 5. 00:05:53.000 --> 00:05:55.100 I only want the first five results here. 00:05:55.100 --> 00:06:00.820 I'll hit Enter, and here I'll see those first five movies in my movies table. 00:06:00.820 --> 00:06:03.850 They each have their own unique ID, a primary key. 00:06:03.850 --> 00:06:07.360 They have a title and the year that they were published. 00:06:07.360 --> 00:06:12.070 But if you've been to imdb.com, you may have wanted 00:06:12.070 --> 00:06:13.600 to search for a particular movie. 00:06:13.600 --> 00:06:17.590 Maybe you wanted to find your favorite movie among all the possible movies 00:06:17.590 --> 00:06:18.520 in their database. 00:06:18.520 --> 00:06:22.480 And we can simulate that kind of query by querying the database here. 00:06:22.480 --> 00:06:25.930 Whenever you type in IMDd's browser-- that's like the search for a movie 00:06:25.930 --> 00:06:28.930 called Frozen, for example-- 00:06:28.930 --> 00:06:31.180 odds are that behind the scenes a database is actually 00:06:31.180 --> 00:06:33.490 running that same query to return to you what 00:06:33.490 --> 00:06:36.220 movies have that title called Frozen. 00:06:36.220 --> 00:06:38.470 Now for me my favorite movie since I was a kid 00:06:38.470 --> 00:06:41.650 is one called Cars, so let me try to search for Cars here. 00:06:41.650 --> 00:06:48.700 I'll try to find SELECT * from "movies" where the "title" = 'Cars'. 00:06:48.700 --> 00:06:52.840 And you could imagine me going to IMDb, typing in Cars in the search bar, 00:06:52.840 --> 00:06:55.720 and behind the scenes this SQL query gets 00:06:55.720 --> 00:06:59.920 run to find what movies have the title Cars. 00:06:59.920 --> 00:07:05.320 So let me hit Enter here, and now I'll see that one movie has that title Cars. 00:07:05.320 --> 00:07:10.540 Came out in 2006 and the ID is 317219. 00:07:10.540 --> 00:07:13.720 But this is a lecture on optimizing, so it's worth 00:07:13.720 --> 00:07:18.370 asking how much time did it take for this query to run? 00:07:18.370 --> 00:07:19.600 Let's find out. 00:07:19.600 --> 00:07:23.380 Well I can use a SQLite command here called timer. 00:07:23.380 --> 00:07:29.080 I'll say .timer on to make sure that I time my future queries. 00:07:29.080 --> 00:07:30.670 I'll hit Enter now. 00:07:30.670 --> 00:07:33.820 And now if I try selecting again, I'll hit the up arrow twice 00:07:33.820 --> 00:07:35.260 to get back my old query. 00:07:35.260 --> 00:07:40.030 I can type Enter and I'll see not just the results 00:07:40.030 --> 00:07:44.380 but the time it took to run that query. 00:07:44.380 --> 00:07:47.250 Now notice, we have several different measures of time here. 00:07:47.250 --> 00:07:50.130 I have the quote, unquote "real time" which is essentially 00:07:50.130 --> 00:07:54.060 the stopwatch time, meaning from the time I press Enter to the time I saw 00:07:54.060 --> 00:07:56.835 these results, how long did that take? 00:07:56.835 --> 00:07:58.710 But of course you're familiar with computers. 00:07:58.710 --> 00:08:02.037 You know there's more involved than just this query running. 00:08:02.037 --> 00:08:03.870 I also have to wait for the operating system 00:08:03.870 --> 00:08:05.340 to finish other processes as well. 00:08:05.340 --> 00:08:07.740 That's where user and system time come in. 00:08:07.740 --> 00:08:12.070 User time is how much time this query particularly took on the CPU. 00:08:12.070 --> 00:08:14.820 System time is how long I spent waiting around for other processes 00:08:14.820 --> 00:08:17.130 to finish before I could run this query. 00:08:17.130 --> 00:08:19.200 For us, we'll focus on real time. 00:08:19.200 --> 00:08:22.120 What was the wall clock time, the stopwatch time, so to speak, 00:08:22.120 --> 00:08:26.480 between the time I pressed Enter and the time I saw these results. 00:08:26.480 --> 00:08:27.380 OK. 00:08:27.380 --> 00:08:33.350 So here we have a query taking 0.084 seconds. 00:08:33.350 --> 00:08:35.270 I mean, that is pretty good. 00:08:35.270 --> 00:08:41.030 If I only wait around for less than a tenth of a second, I'm not complaining. 00:08:41.030 --> 00:08:45.260 But if I'm an engineer at IMDb, why might I 00:08:45.260 --> 00:08:50.450 be concerned about a query that takes 0.084 seconds, particularly 00:08:50.450 --> 00:08:53.000 if I have lots and lots of users? 00:08:53.000 --> 00:08:55.120 What's the problem there? 00:08:55.120 --> 00:09:00.320 AUDIENCE: For users, the database might be overwhelmed by the queries. 00:09:00.320 --> 00:09:05.840 And you should also consider how much time the application takes 00:09:05.840 --> 00:09:08.738 to form the query to the database too. 00:09:08.738 --> 00:09:09.530 CARTER ZENKE: Yeah. 00:09:09.530 --> 00:09:11.720 So the basic idea here is that when I have 00:09:11.720 --> 00:09:15.090 a single query that takes maybe less than tenth of a second, that's fine. 00:09:15.090 --> 00:09:19.010 But if I have many, many users running those queries all at once, 00:09:19.010 --> 00:09:21.390 that time is going to add up. 00:09:21.390 --> 00:09:24.500 And so if I'm an engineer or even somebody in finance, 00:09:24.500 --> 00:09:28.160 I want to reduce that time because I'm paying per second that I 00:09:28.160 --> 00:09:30.830 run this database on some other server. 00:09:30.830 --> 00:09:33.980 So we can optimize these queries, but let's first 00:09:33.980 --> 00:09:36.470 get a sense of how they're presently running. 00:09:36.470 --> 00:09:41.360 What is SQLite doing underneath the hood to find this movie called Cars? 00:09:41.360 --> 00:09:45.920 Well for that, I'll bring up a certain visual here of our titles column. 00:09:45.920 --> 00:09:50.840 So imagine here that we have a smaller database called Movies 00:09:50.840 --> 00:09:52.890 that just has titles in this case. 00:09:52.890 --> 00:09:57.350 And now I want to search these titles, but I'm a computer. 00:09:57.350 --> 00:09:58.190 I'm not a human. 00:09:58.190 --> 00:10:01.550 Meaning I can only look at one item at a time. 00:10:01.550 --> 00:10:05.720 I can only look at one row and check its title. 00:10:05.720 --> 00:10:11.570 Now if I want to find the row or the rows that have Cars in them, 00:10:11.570 --> 00:10:16.610 what's the best I could do if my titles are not sorted? 00:10:16.610 --> 00:10:18.890 What's the best approach I could take here? 00:10:18.890 --> 00:10:24.922 If I can only look at one row at a time, how might I search this column? 00:10:24.922 --> 00:10:27.990 AUDIENCE: You should use a linear search then, right? 00:10:27.990 --> 00:10:29.680 CARTER ZENKE: So I like your intuition. 00:10:29.680 --> 00:10:31.900 There is an algorithm, a step-by-step process 00:10:31.900 --> 00:10:34.990 called linear search where we look at one element 00:10:34.990 --> 00:10:38.170 at a time or, in this case one row at a time. 00:10:38.170 --> 00:10:40.750 That's exactly what we can do for this table. 00:10:40.750 --> 00:10:44.980 I can look at one row at a time to find is this title in this column 00:10:44.980 --> 00:10:46.360 or is it not? 00:10:46.360 --> 00:10:50.120 So here to visualize, I could take this title column. 00:10:50.120 --> 00:10:52.970 I can only look at one at a time. 00:10:52.970 --> 00:10:58.630 But if this data is not sorted, I can do no better than looking at every row 00:10:58.630 --> 00:10:59.920 exactly once. 00:10:59.920 --> 00:11:01.780 So here we start with Toy Story. 00:11:01.780 --> 00:11:05.110 We ask, is Toy Story equal to Cars? 00:11:05.110 --> 00:11:07.000 It's not, so we'll keep going. 00:11:07.000 --> 00:11:09.100 Here I'll check Cars 3. 00:11:09.100 --> 00:11:12.070 Is Cars 3 strictly equal to Cars? 00:11:12.070 --> 00:11:15.040 It has Cars in it, but it's not Cars exactly. 00:11:15.040 --> 00:11:16.180 So I'll keep going. 00:11:16.180 --> 00:11:17.380 I'll look at Frozen. 00:11:17.380 --> 00:11:23.170 And Frozen is not Cars either, so I'll keep going again and here I see Cars. 00:11:23.170 --> 00:11:26.540 So maybe I could stop here. 00:11:26.540 --> 00:11:28.360 But why couldn't I? 00:11:28.360 --> 00:11:34.300 Well if I need to find not just the one row that has Cars but all the rows 00:11:34.300 --> 00:11:39.790 that have Cars, I have to keep going to see is Cars elsewhere in this column? 00:11:39.790 --> 00:11:42.910 Is it not just here, but also in other rows too? 00:11:42.910 --> 00:11:44.050 So I'll keep going. 00:11:44.050 --> 00:11:47.020 I'll look at WALL-E, Ratatouille, Soul, Turning Red 00:11:47.020 --> 00:11:50.290 until I get to the end of this column. 00:11:50.290 --> 00:11:53.530 And this process is called a scan of the table. 00:11:53.530 --> 00:11:58.210 We're scanning our column top to bottom looking for the result 00:11:58.210 --> 00:12:00.160 that we need to find. 00:12:00.160 --> 00:12:03.580 But it turns out we can actually optimize this search. 00:12:03.580 --> 00:12:07.898 We can do a bit better than going one by one by one by one, 00:12:07.898 --> 00:12:09.940 and for this we need a bit of a metaphor to think 00:12:09.940 --> 00:12:12.170 about how we can search our table. 00:12:12.170 --> 00:12:15.080 So if you have ever seen a kind of textbook, 00:12:15.080 --> 00:12:17.620 you might have seen something a bit like this here. 00:12:17.620 --> 00:12:19.780 This is a Handbook of Education Research. 00:12:19.780 --> 00:12:23.470 And maybe I want to find which pages correspond 00:12:23.470 --> 00:12:26.060 to how students learn computer science. 00:12:26.060 --> 00:12:29.350 So one thing I could do is open to the start of the book. 00:12:29.350 --> 00:12:34.630 I could go page by page looking for which pages have "how students 00:12:34.630 --> 00:12:36.700 learn computer science." 00:12:36.700 --> 00:12:40.690 I keep going here, and if I want to find all the pages that have 00:12:40.690 --> 00:12:42.610 "how students learn computer science" on them, 00:12:42.610 --> 00:12:45.610 I can do no better than going through every one 00:12:45.610 --> 00:12:49.600 by one by one until I'm at the end of this textbook. 00:12:49.600 --> 00:12:53.420 But chances are you yourself don't do this. 00:12:53.420 --> 00:12:56.270 You do something better. 00:12:56.270 --> 00:13:00.010 And there's some built-in structure in this book that could 00:13:00.010 --> 00:13:01.930 help you find what you're looking for. 00:13:01.930 --> 00:13:07.170 What might you use to find something in a textbook or some other book too? 00:13:07.170 --> 00:13:10.300 AUDIENCE: We can use the index of the book. 00:13:10.300 --> 00:13:12.100 CARTER ZENKE: You can use the index. 00:13:12.100 --> 00:13:14.470 So often a book like a textbook has something 00:13:14.470 --> 00:13:17.440 at the very back of the book that tells me 00:13:17.440 --> 00:13:21.370 all the subjects that are inside this book, but in alphabetical order. 00:13:21.370 --> 00:13:24.550 So I can see here there is a subject index in the back of the book 00:13:24.550 --> 00:13:28.330 here that lists the subjects in alphabetical order, A to Z. 00:13:28.330 --> 00:13:31.210 And if I want to find "how students learn computer science," 00:13:31.210 --> 00:13:34.300 I could perhaps search for computer science in this index. 00:13:34.300 --> 00:13:37.990 I could find A, B, C. There's C, and here I see computer science. 00:13:37.990 --> 00:13:42.310 Now I know all the pages I need to look at inside this book 00:13:42.310 --> 00:13:46.490 without going through each of them one by one by one. 00:13:46.490 --> 00:13:50.080 So in the same way that books have an index, so too 00:13:50.080 --> 00:13:53.810 can databases and tables have an index as well. 00:13:53.810 --> 00:13:57.400 So let's see a definition here for what an index is. 00:13:57.400 --> 00:14:00.340 Well an index is a structure that we can use. 00:14:00.340 --> 00:14:03.010 It's built into our database to help us speed up 00:14:03.010 --> 00:14:06.550 the retrieval of rows from that table. 00:14:06.550 --> 00:14:12.670 I first search my index, find out what rows my given search term is in, 00:14:12.670 --> 00:14:16.730 and then look those rows up in my table. 00:14:16.730 --> 00:14:22.540 Now if I want to create an index on a table, I can use some SQL syntax. 00:14:22.540 --> 00:14:27.760 I can say "create index" and give it some name. 00:14:27.760 --> 00:14:31.270 But I need to also specify a few other things. 00:14:31.270 --> 00:14:35.500 Similar to my books index, that index has some content. 00:14:35.500 --> 00:14:38.770 It's an index by subject or an index by author. 00:14:38.770 --> 00:14:42.670 So I need to tell this index what to include. 00:14:42.670 --> 00:14:47.440 So I should also say create index, give it a name on some table 00:14:47.440 --> 00:14:52.720 and some columns in that table, telling my index what data for my table 00:14:52.720 --> 00:14:55.090 it should include. 00:14:55.090 --> 00:14:59.230 So let's create our very own index in the movies table 00:14:59.230 --> 00:15:02.690 to identify how we can make this query just a bit faster overall. 00:15:02.690 --> 00:15:08.440 So I'll come back to my computer here, and let's now try to create this index. 00:15:08.440 --> 00:15:11.290 I will open up a new tab here. 00:15:11.290 --> 00:15:12.070 A new tab. 00:15:12.070 --> 00:15:14.500 And then I'll log back into my database. 00:15:14.500 --> 00:15:17.950 I'll say sqlite3 movies.db. 00:15:17.950 --> 00:15:22.550 And now I want to first create some index here. 00:15:22.550 --> 00:15:27.280 So I'll turn my timer back on and now I'll type the syntax we just learned. 00:15:27.280 --> 00:15:29.810 I want to create an index. 00:15:29.810 --> 00:15:33.760 And if I'm searching my title column in Movies, 00:15:33.760 --> 00:15:38.110 I should probably include the Title column in this index. 00:15:38.110 --> 00:15:41.470 So I'll say call this the "title_index" here. 00:15:41.470 --> 00:15:46.450 It'll be on the Movies table and it'll contain the data in the Title 00:15:46.450 --> 00:15:49.850 column of Movies a bit like this. 00:15:49.850 --> 00:15:55.052 Now I'll hit Enter and I'll see this index took some time to be created. 00:15:55.052 --> 00:15:57.010 There's some stuff going on underneath the hood 00:15:57.010 --> 00:16:00.940 where I'm putting this data in a new structure we're calling the index here. 00:16:00.940 --> 00:16:07.930 But now if I rerun this query, if I say SELECT * FROM "movies," 00:16:07.930 --> 00:16:15.450 SELECT * from "movies" where the "title" = 'Cars' semicolon, 00:16:15.450 --> 00:16:16.620 it was a lot faster. 00:16:16.620 --> 00:16:25.590 Here I see 0.01 seconds real time compared to 0.084 seconds. 00:16:25.590 --> 00:16:28.380 That's a savings of almost eight times. 00:16:28.380 --> 00:16:31.740 Almost eight times faster now that we have this index. 00:16:31.740 --> 00:16:34.260 Now it took some time to create the index. 00:16:34.260 --> 00:16:38.070 If I run just a few queries, I much easily 00:16:38.070 --> 00:16:40.770 make that time up in terms of the time I'm 00:16:40.770 --> 00:16:46.560 using on my server to look up these kinds of pieces of information. 00:16:46.560 --> 00:16:50.760 OK, so let's actually take a look at how we can ensure a query 00:16:50.760 --> 00:16:53.460 is using some index. 00:16:53.460 --> 00:16:55.980 Here I'm just kind of taking it on my word. 00:16:55.980 --> 00:16:57.780 Here it was faster, so it used the index. 00:16:57.780 --> 00:17:01.710 But there is a way for you to know whether an index was used. 00:17:01.710 --> 00:17:06.839 I could use a SQL statement called EXPLAIN QUERY PLAN. 00:17:06.839 --> 00:17:10.329 Other DBMSs might call this just simply EXPLAIN, 00:17:10.329 --> 00:17:14.190 but in SQLite it's EXPLAIN QUERY PLAN. 00:17:14.190 --> 00:17:17.819 And then after EXPLAIN QUERY PLAN, I can type the query 00:17:17.819 --> 00:17:20.760 that I want to learn the plan for. 00:17:20.760 --> 00:17:24.880 That is, how does SQLite plan to find me this information? 00:17:24.880 --> 00:17:27.579 How does it plan to execute this query? 00:17:27.579 --> 00:17:33.540 So now I'll type SELECT * FROM "movies" WHERE "title" = 'Cars' 00:17:33.540 --> 00:17:35.460 like this, semicolon. 00:17:35.460 --> 00:17:36.600 I'll hit Enter. 00:17:36.600 --> 00:17:39.420 And now I don't get the results, but I do 00:17:39.420 --> 00:17:45.060 get the plan SQLite intends to use to find me this result. 00:17:45.060 --> 00:17:46.350 And take a look here. 00:17:46.350 --> 00:17:51.930 I see QUERY PLAN, and underneath I see what SQLite intends to do. 00:17:51.930 --> 00:17:55.260 It will search the movies table, but it will do so 00:17:55.260 --> 00:17:58.410 using the index that we called title_index. 00:17:58.410 --> 00:18:02.560 And it will do so given some value for title here. 00:18:02.560 --> 00:18:06.210 So we can actually see SQLite does plan to use the index 00:18:06.210 --> 00:18:08.050 to speed up this query. 00:18:08.050 --> 00:18:11.770 We can confirm that this is exactly what's happening. 00:18:11.770 --> 00:18:15.450 But let me try now to remove this index and see 00:18:15.450 --> 00:18:19.380 what SQLite would have done when we didn't have this index. 00:18:19.380 --> 00:18:23.070 I'll come back over here and I'll go back to my other connection 00:18:23.070 --> 00:18:24.030 to my database. 00:18:24.030 --> 00:18:25.470 This one over here. 00:18:25.470 --> 00:18:28.290 And now I'll remove the index we just created. 00:18:28.290 --> 00:18:30.510 To remove an index I can drop it. 00:18:30.510 --> 00:18:35.948 I can say DROP INDEX "title_index" semicolon. 00:18:35.948 --> 00:18:37.740 And now that took just some amount of time, 00:18:37.740 --> 00:18:39.365 but now I don't have the index anymore. 00:18:39.365 --> 00:18:45.180 I can type .schema, and I'll see that index is not part of my schema. 00:18:45.180 --> 00:18:54.330 But now if I try EXPLAIN QUERY PLAN SELECT * from "movies" WHERE "title" 00:18:54.330 --> 00:19:00.780 = 'Cars' semicolon, I'll see that the plan is not to use the index. 00:19:00.780 --> 00:19:04.860 The plan is to scan movies, where scan, we saw before, 00:19:04.860 --> 00:19:08.130 means to go top to bottom through that title column 00:19:08.130 --> 00:19:12.070 and find me all the rows that have Cars in them. 00:19:12.070 --> 00:19:14.460 So you can see here the optimization we're already 00:19:14.460 --> 00:19:18.890 making by using this index here. 00:19:18.890 --> 00:19:21.740 OK, so let's take a moment here and pause. 00:19:21.740 --> 00:19:26.780 What questions do we have about indexes and how they're used in this case? 00:19:26.780 --> 00:19:33.950 AUDIENCE: Doesn't databases have implicit algorithms to optimize search? 00:19:33.950 --> 00:19:35.420 CARTER ZENKE: Yeah, good question. 00:19:35.420 --> 00:19:39.050 Do databases have implicit algorithms to optimize search? 00:19:39.050 --> 00:19:41.190 They do for some columns. 00:19:41.190 --> 00:19:46.520 So if we saw in our table earlier, we had a primary key column like ID. 00:19:46.520 --> 00:19:52.310 In SQLite and most other DBMSs, if I specify that a column is a primary key 00:19:52.310 --> 00:19:55.430 then that database management system will automatically 00:19:55.430 --> 00:19:59.280 create an index via which I can search for a primary key. 00:19:59.280 --> 00:20:03.920 If though I have a regular old column like title or like year or so on, 00:20:03.920 --> 00:20:07.130 it doesn't go through the process of automatically optimizing 00:20:07.130 --> 00:20:08.030 that search for me. 00:20:08.030 --> 00:20:12.457 I have to try to do that myself as we're doing here. 00:20:12.457 --> 00:20:13.290 Let's take one more. 00:20:13.290 --> 00:20:16.000 AUDIENCE: Would it be advisable to create a different index 00:20:16.000 --> 00:20:17.800 for every column in case we need it? 00:20:17.800 --> 00:20:20.380 CARTER ZENKE: Yeah, I mean it seems like indexes are so good. 00:20:20.380 --> 00:20:21.790 They speed up queries for us. 00:20:21.790 --> 00:20:25.030 Why not create an index on every column we have? 00:20:25.030 --> 00:20:26.800 Well as we'll see in just a moment, there 00:20:26.800 --> 00:20:30.730 are some trade-offs involving space and also the time it later takes 00:20:30.730 --> 00:20:33.020 us to insert or add new data. 00:20:33.020 --> 00:20:33.910 So good questions. 00:20:33.910 --> 00:20:36.620 I'll come back to my laptop where we can continue on. 00:20:36.620 --> 00:20:40.870 So we've seen how indexes can speed up queries. 00:20:40.870 --> 00:20:43.900 We're searching for one column here, but they could also speed up 00:20:43.900 --> 00:20:46.310 queries across multiple tables. 00:20:46.310 --> 00:20:48.910 So let's try a different kind of query here. 00:20:48.910 --> 00:20:52.810 What I might do now is try to find all of those movies 00:20:52.810 --> 00:20:55.480 that Tom Hanks starred in. 00:20:55.480 --> 00:20:59.230 And we did a bit of a example of this a little earlier visually, 00:20:59.230 --> 00:21:04.450 but my goal is to find all of those titles that Tom Hanks played a role in. 00:21:04.450 --> 00:21:07.690 Well if my goal is to end up with titles of movies, 00:21:07.690 --> 00:21:13.520 I could try selecting "title" from the "movies" column like this. 00:21:13.520 --> 00:21:16.720 But if I look at that movie's table, oh no, I only 00:21:16.720 --> 00:21:20.740 have titles, year, and the ID of those movies. 00:21:20.740 --> 00:21:23.230 I don't have anything related to Tom Hanks. 00:21:23.230 --> 00:21:27.250 Well Tom Hanks though is probably showing up in the stars table 00:21:27.250 --> 00:21:30.940 where I could correspond Tom Hanks's ID with the movie IDs 00:21:30.940 --> 00:21:32.950 that Tom Hanks played a role in. 00:21:32.950 --> 00:21:35.860 So let me try filtering these movies by an ID. 00:21:35.860 --> 00:21:40.450 I'll say WHERE the "id" of that movie is IN some list of movie IDs 00:21:40.450 --> 00:21:43.960 that also correspond with Tom Hanks's person ID. 00:21:43.960 --> 00:21:47.650 So I'll make a subquery here to find exactly those films. 00:21:47.650 --> 00:21:54.340 Now I'll say SELECT "movie_id" FROM "stars" where the "person _id" 00:21:54.340 --> 00:21:56.140 is equal to-- 00:21:56.140 --> 00:21:58.180 well I don't know Tom Hanks's person ID. 00:21:58.180 --> 00:22:00.880 I could try to hard-code it here, but I could do better 00:22:00.880 --> 00:22:05.710 by having another subquery to find what is Tom Hanks's ID? 00:22:05.710 --> 00:22:08.950 Now I'll indent not just four times but eight 00:22:08.950 --> 00:22:12.160 to show this is my next level of a subquery here. 00:22:12.160 --> 00:22:19.270 I'll now SELECT "id" FROM "people" WHERE the "name" = 'Tom Hanks'. 00:22:19.270 --> 00:22:21.020 'Tom Hanks' like this. 00:22:21.020 --> 00:22:23.950 And now I can close out my subqueries. 00:22:23.950 --> 00:22:28.300 I've found the ID of Tom Hanks, I've found those movie IDs that 00:22:28.300 --> 00:22:30.970 correspond with the person ID of Tom Hanks, 00:22:30.970 --> 00:22:34.900 and then I've found the titles that correspond to those movie IDs. 00:22:34.900 --> 00:22:39.100 Now I'll hit a semicolon here, press Enter, and I should see-- 00:22:39.100 --> 00:22:43.300 I get back all of Tom Hanks's movies, all of those that he played a role in. 00:22:43.300 --> 00:22:45.910 And I'll also see the time that it took. 00:22:45.910 --> 00:22:50.840 The real time for this query was about 0.197 seconds. 00:22:50.840 --> 00:22:54.880 So getting a little bit slower now that we're looking across multiple tables 00:22:54.880 --> 00:22:57.580 and having multiple subqueries. 00:22:57.580 --> 00:23:00.890 One index could speed this up as well. 00:23:00.890 --> 00:23:03.555 I can make a new terminal here. 00:23:03.555 --> 00:23:05.680 Let me open up the connection to my database again. 00:23:05.680 --> 00:23:08.470 I'll say sqlite3 movies.db. 00:23:08.470 --> 00:23:12.490 And now I should figure out what columns do 00:23:12.490 --> 00:23:15.670 I need to index to speed up this query? 00:23:15.670 --> 00:23:19.940 Well one thing I could do is try EXPLAIN QUERY PLAN again. 00:23:19.940 --> 00:23:25.310 I want to understand what SQLite plans to do to execute this query. 00:23:25.310 --> 00:23:29.020 So I'll EXPLAIN QUERY PLAN and I'll type in the same query from before. 00:23:29.020 --> 00:23:33.430 I'll say SELECT the "title" FROM "movies" WHERE 00:23:33.430 --> 00:23:37.720 the "id" is IN those movies that Tom Hanks played a role in. 00:23:37.720 --> 00:23:45.400 So I'll SELECT in this case movie_id from the "stars" table where 00:23:45.400 --> 00:23:49.180 the person_id equals Tom Hanks's ID. 00:23:49.180 --> 00:23:50.450 Now I'll find that. 00:23:50.450 --> 00:23:53.650 I'll indent eight times to show this is my next level subquery. 00:23:53.650 --> 00:23:59.050 I'll SELECT "id" from "people" WHERE "name" 00:23:59.050 --> 00:24:02.510 = 'Tom Hanks' and I'll wrap the line just a little bit. 00:24:02.510 --> 00:24:04.510 Let me zoom out so you can see it all in one go. 00:24:04.510 --> 00:24:10.570 And now I'll close out these subqueries just like this, hit Enter, 00:24:10.570 --> 00:24:16.090 and I should see if I zoom back in the plan SQLite intends to use to find me 00:24:16.090 --> 00:24:17.590 the results of this query. 00:24:17.590 --> 00:24:19.490 Now let's take a look here. 00:24:19.490 --> 00:24:21.900 I'm planning to first scan people. 00:24:21.900 --> 00:24:23.650 To understand the structure of this I need 00:24:23.650 --> 00:24:26.500 to first go all the way in to this bottom part here. 00:24:26.500 --> 00:24:32.320 I'm going to scan people to find presumably Tom Hanks's ID. 00:24:32.320 --> 00:24:36.610 Then after I do that, I'm going to scan stars presumably 00:24:36.610 --> 00:24:41.080 to find all the movie IDs that have Tom Hanks's ID. 00:24:41.080 --> 00:24:45.130 But then up here I'm finally going to search movies 00:24:45.130 --> 00:24:47.320 using integer primary key. 00:24:47.320 --> 00:24:49.750 Some index is automatically created for me 00:24:49.750 --> 00:24:52.850 as we saw before for my primary keys. 00:24:52.850 --> 00:24:56.620 So it seems like the optimization here could be in the people 00:24:56.620 --> 00:25:00.040 table and the stars table. 00:25:00.040 --> 00:25:05.620 Now let me ask, if I wanted to create an index on either the people 00:25:05.620 --> 00:25:09.970 table or the stars table, what do you think I should include? 00:25:09.970 --> 00:25:15.650 Which column should I include in this index given my query here? 00:25:15.650 --> 00:25:21.130 What column should I include in this index given this previous query? 00:25:21.130 --> 00:25:26.180 AUDIENCE: We could index the ID and the name of the person? 00:25:26.180 --> 00:25:27.530 CARTER ZENKE: Yeah, good idea. 00:25:27.530 --> 00:25:29.930 Index the ID and the name of the person. 00:25:29.930 --> 00:25:34.670 So if we look at this query again we'll see that our WHERE clause here 00:25:34.670 --> 00:25:37.770 is searching by the name in the people table, 00:25:37.770 --> 00:25:41.240 and up here this WHERE clause is searching by the person ID 00:25:41.240 --> 00:25:43.700 column in our stars table. 00:25:43.700 --> 00:25:46.280 So it would stand to reason that I should actually 00:25:46.280 --> 00:25:49.850 create an index that includes the name column in people 00:25:49.850 --> 00:25:52.250 and the person ID column in stars. 00:25:52.250 --> 00:25:54.200 Perhaps two separate indexes. 00:25:54.200 --> 00:25:55.710 One for each year. 00:25:55.710 --> 00:25:57.470 So let's go ahead and create those now. 00:25:57.470 --> 00:26:03.350 I'll come back to my computer and here I will try to create this index. 00:26:03.350 --> 00:26:06.590 I'll say first create an index. 00:26:06.590 --> 00:26:10.040 CREATE INDEX called "person_index." 00:26:10.040 --> 00:26:16.070 And I'll create it on stars and the person ID column in stars 00:26:16.070 --> 00:26:17.960 as we just discussed. 00:26:17.960 --> 00:26:23.600 Now if I hit Enter I should let it take some time to build up this index, 00:26:23.600 --> 00:26:26.270 but now I can also create my next one. 00:26:26.270 --> 00:26:35.300 I'll say CREATE INDEX, and then I'll choose "name_index" on people table. 00:26:35.300 --> 00:26:37.890 And I'll include the name column here. 00:26:37.890 --> 00:26:39.710 So now I'll hit Enter. 00:26:39.710 --> 00:26:42.980 That too will take some time to run, but now 00:26:42.980 --> 00:26:45.350 if I try to explain the query plan again-- 00:26:45.350 --> 00:26:50.630 I'll clear my screen, try EXPLAIN QUERY PLAN, and then I'll type in this query 00:26:50.630 --> 00:26:51.170 again. 00:26:51.170 --> 00:26:56.060 I'll say SELECT "title" FROM movies WHERE the "id" is IN those movies 00:26:56.060 --> 00:26:59.600 that Tom Hanks played a role in that have the person ID next to it. 00:26:59.600 --> 00:27:03.590 Then I'll say which is Tom Hanks's ID? 00:27:03.590 --> 00:27:08.650 And I'll close out the subquery here just like this. 00:27:08.650 --> 00:27:12.480 And now if I explain the plan yet again, I 00:27:12.480 --> 00:27:16.570 should see I'm now optimizing this query even further. 00:27:16.570 --> 00:27:19.950 I can see down below here, my plan is to search the people table. 00:27:19.950 --> 00:27:23.160 Not to scan it, but to search it using what we call 00:27:23.160 --> 00:27:25.710 a covering index called "name_index." 00:27:25.710 --> 00:27:27.010 The one we just created. 00:27:27.010 --> 00:27:29.610 We'll explain covering index in just a little bit. 00:27:29.610 --> 00:27:32.760 Next though, I'm going to search stars using 00:27:32.760 --> 00:27:37.060 the index we just created called "name_index" given some person ID. 00:27:37.060 --> 00:27:40.290 And then finally, I'll search movies using 00:27:40.290 --> 00:27:42.660 an integer primary key index, the one that's 00:27:42.660 --> 00:27:45.850 automatically created for me here. 00:27:45.850 --> 00:27:48.570 So here we see some new vocabulary. 00:27:48.570 --> 00:27:53.060 Not just using index, but using a covering index. 00:27:53.060 --> 00:27:55.470 Well, what is a covering index? 00:27:55.470 --> 00:27:58.650 A covering index, it turns out, is still an index. 00:27:58.650 --> 00:28:01.050 It's the same kind of idea, but it is one 00:28:01.050 --> 00:28:06.390 in which we have all the data we need inside the index itself. 00:28:06.390 --> 00:28:11.250 It's one in which the query data can be found only by searching the index. 00:28:11.250 --> 00:28:14.460 There's no need for me to go from the index to the table. 00:28:14.460 --> 00:28:17.040 I can just look it up in the index. 00:28:17.040 --> 00:28:20.850 It's similar to me searching this textbook and getting to the index 00:28:20.850 --> 00:28:24.450 and not then needing to actually go back in the book's pages. 00:28:24.450 --> 00:28:29.290 I can find everything I'm looking for in the index itself. 00:28:29.290 --> 00:28:34.260 And this is good because if I need to not just find the items in my index 00:28:34.260 --> 00:28:39.130 but then look them up in my table, that look up in the table takes some time. 00:28:39.130 --> 00:28:41.580 So if I instead have a covering index, one 00:28:41.580 --> 00:28:45.310 that includes all data I want to find exactly right there in it, 00:28:45.310 --> 00:28:49.120 that's going to be faster for me than a regular old index. 00:28:49.120 --> 00:28:54.750 So let's try then turning one more index into a covering index. 00:28:54.750 --> 00:28:55.890 I'll come back over here. 00:28:55.890 --> 00:29:00.780 And we saw before that we had the index on the stars 00:29:00.780 --> 00:29:03.660 table using just a regular old index. 00:29:03.660 --> 00:29:06.420 Using index person index. 00:29:06.420 --> 00:29:10.050 But what are we trying to find from this index? 00:29:10.050 --> 00:29:13.170 We're trying to find the movie ID column. 00:29:13.170 --> 00:29:18.910 So I could create an index that includes the movie ID column as well. 00:29:18.910 --> 00:29:22.300 I can have indexes that span multiple columns. 00:29:22.300 --> 00:29:23.730 So let's try that here. 00:29:23.730 --> 00:29:27.240 I'll come back over and I'll create this new index 00:29:27.240 --> 00:29:29.730 that intends to cover our search here. 00:29:29.730 --> 00:29:35.940 I want to include not just the person ID column but also the movie ID 00:29:35.940 --> 00:29:40.320 column so I can quickly find that data inside the index itself. 00:29:40.320 --> 00:29:45.620 Let me drop our current implementation of the person index like this. 00:29:45.620 --> 00:29:49.090 And I'll hit semicolon here, and now that index is gone. 00:29:49.090 --> 00:29:53.040 But if I type now CREATE INDEX, I'll also 00:29:53.040 --> 00:29:56.700 call this "person_index" ON "stars." 00:29:56.700 --> 00:30:00.240 and I'll say let's include not just the person ID column. 00:30:00.240 --> 00:30:05.940 Let's also include in this case, the movie ID column like this. 00:30:05.940 --> 00:30:08.730 Now I'll hit semicolon, Enter. 00:30:08.730 --> 00:30:11.580 And it'll take a bit of time to run, but once it's 00:30:11.580 --> 00:30:15.420 done I can probably try to explain the query plan again and see 00:30:15.420 --> 00:30:17.160 if we have our covering index. 00:30:17.160 --> 00:30:20.910 I'll go back up top here, I'll EXPLAIN QUERY PLAN, 00:30:20.910 --> 00:30:24.910 and I'll choose to recreate this query from before just like this. 00:30:24.910 --> 00:30:30.660 I'm going to select the "movie_id" FROM "stars" and then find Tom Hanks's ID 00:30:30.660 --> 00:30:36.580 and then close out these subqueries here just like this. 00:30:36.580 --> 00:30:41.830 And now once I EXPLAIN QUERY PLAN, we've optimized even further. 00:30:41.830 --> 00:30:44.160 Now I have two covering indexes. 00:30:44.160 --> 00:30:48.690 I can find all the information I need just by looking in these indexes, 00:30:48.690 --> 00:30:52.100 not by doing a table lookup. 00:30:52.100 --> 00:30:56.375 So let's now run this query and see just how much faster it might be. 00:30:56.375 --> 00:31:00.470 I'll come back to my laptop here, and let's get rid of EXPLAIN QUERY PLAN. 00:31:00.470 --> 00:31:02.660 Let's now turn our timer on. 00:31:02.660 --> 00:31:06.380 I'll turn the timer on like this and I'll rerun this query. 00:31:06.380 --> 00:31:10.520 I'll say we select title from movies where the ID is in the stars table 00:31:10.520 --> 00:31:13.790 so long as the movie ID aligns with the person ID. 00:31:13.790 --> 00:31:18.590 Then I'll find Tom Hanks's person ID here, I'll close out my subqueries, 00:31:18.590 --> 00:31:24.078 and I should be able to find that this query is just a little bit-- 00:31:24.078 --> 00:31:25.370 this is an understatement here. 00:31:25.370 --> 00:31:26.400 A lot faster. 00:31:26.400 --> 00:31:32.120 So if I go back to my old query here, I'll see this one took 0.197 seconds. 00:31:32.120 --> 00:31:37.320 This one with two covering indexes takes 0.004 seconds. 00:31:37.320 --> 00:31:40.970 This is an order of magnitude faster than my prior query 00:31:40.970 --> 00:31:44.540 because I've used indexes now. 00:31:44.540 --> 00:31:46.940 So, let's see. 00:31:46.940 --> 00:31:53.180 If we have all these benefits of indexes, we must be paying something. 00:31:53.180 --> 00:31:59.310 If I have this book here and I'm trying to find some data inside of it, 00:31:59.310 --> 00:32:03.740 what might the trade-offs be if I want to create an index? 00:32:03.740 --> 00:32:07.460 Let me think first about this database on my computer. 00:32:07.460 --> 00:32:08.870 What am I giving up? 00:32:08.870 --> 00:32:13.400 If I'm making sure my query is faster, what am I losing, do you think? 00:32:13.400 --> 00:32:16.010 AUDIENCE: Space will require for indexing also. 00:32:16.010 --> 00:32:19.370 The more index we would have, the more space and more memory 00:32:19.370 --> 00:32:21.278 we would have to consume for that. 00:32:21.278 --> 00:32:22.070 CARTER ZENKE: Yeah. 00:32:22.070 --> 00:32:25.070 So it's a general theme that when I try to speed up 00:32:25.070 --> 00:32:28.130 something in computer science, I have to pay some cost which 00:32:28.130 --> 00:32:29.990 is often in terms of space. 00:32:29.990 --> 00:32:34.400 That when I create an index I'm using more space in my database. 00:32:34.400 --> 00:32:37.010 And this similarly applies to an old fashioned textbook. 00:32:37.010 --> 00:32:39.470 If I have an index in here, I'm literally 00:32:39.470 --> 00:32:43.550 paying in pages to print so people can find content better. 00:32:43.550 --> 00:32:46.130 This is some extra money I have to spend, 00:32:46.130 --> 00:32:49.670 extra pages I have to print in order to make my queries faster, 00:32:49.670 --> 00:32:53.220 either on this book or in my database. 00:32:53.220 --> 00:32:57.650 So let's understand why indexes take up so much space 00:32:57.650 --> 00:33:00.440 by looking at the underlying data structure they 00:33:00.440 --> 00:33:03.110 use to store information. 00:33:03.110 --> 00:33:05.900 Well it turns out that the part of the reason indexes 00:33:05.900 --> 00:33:09.780 take up so much space is because they use a certain data structure, 00:33:09.780 --> 00:33:16.110 a way of organizing data called a B-Tree, or a balanced tree structure. 00:33:16.110 --> 00:33:18.920 So not to worry if you don't know what a tree structure is. 00:33:18.920 --> 00:33:23.030 Here's a visual to keep in mind as you think about using trees in computer 00:33:23.030 --> 00:33:25.520 science and in indexes in particular. 00:33:25.520 --> 00:33:27.780 A tree might look a bit like this. 00:33:27.780 --> 00:33:31.640 And notice how it's similar to a family tree with grandparents 00:33:31.640 --> 00:33:34.430 and parents and children and so on. 00:33:34.430 --> 00:33:38.450 You could if you wanted to kind of do a 180 on this diagram. 00:33:38.450 --> 00:33:41.030 Flip it right side up and now you might see 00:33:41.030 --> 00:33:44.010 more of a tree structure going from top to bottom. 00:33:44.010 --> 00:33:47.300 You can see down here what might call the trunk of the tree followed 00:33:47.300 --> 00:33:49.670 by some branches-- these over here. 00:33:49.670 --> 00:33:50.930 And then the leaves. 00:33:50.930 --> 00:33:54.890 These up top kind of the edge of this tree diagram. 00:33:54.890 --> 00:33:56.750 And there's a few vocabulary words you might 00:33:56.750 --> 00:33:59.090 want to know when you talk about trees. 00:33:59.090 --> 00:34:02.190 One of them is this idea of a node. 00:34:02.190 --> 00:34:06.230 So this right here is a single node in our tree. 00:34:06.230 --> 00:34:11.210 A tree is made up of a collection of nodes, and each of these nodes 00:34:11.210 --> 00:34:13.590 stores some value. 00:34:13.590 --> 00:34:17.900 Maybe it's a value in the column we're trying to index, for example. 00:34:17.900 --> 00:34:22.040 But in addition to that value, this node also 00:34:22.040 --> 00:34:26.780 stores pointers, kind of arrows metaphorically, to other nodes 00:34:26.780 --> 00:34:30.590 to tell us where in memory these nodes are. 00:34:30.590 --> 00:34:32.719 And by stringing these nodes together, we 00:34:32.719 --> 00:34:36.270 can build up this tree-like structure. 00:34:36.270 --> 00:34:40.280 Now this node here points to three other nodes. 00:34:40.280 --> 00:34:45.530 And some new vocabulary here is that this node has three children. 00:34:45.530 --> 00:34:47.929 Going back to that family tree example, we 00:34:47.929 --> 00:34:52.469 see that this node here has three children which are these right here. 00:34:52.469 --> 00:34:56.078 But these nodes themselves also have three children of their own. 00:34:56.078 --> 00:34:58.370 So if I were to go down to the next level of our tree-- 00:34:58.370 --> 00:35:02.180 well if I'm referring to this node here, these nodes down below 00:35:02.180 --> 00:35:04.190 are that node's grandchildren. 00:35:04.190 --> 00:35:07.080 The children of its children. 00:35:07.080 --> 00:35:10.490 Now these nodes here also have another name. 00:35:10.490 --> 00:35:14.240 We saw before they're often called leaf nodes. 00:35:14.240 --> 00:35:15.560 They're the edge of our tree. 00:35:15.560 --> 00:35:21.040 And we know a node is a leaf node when there are no other nodes it points to. 00:35:21.040 --> 00:35:24.970 So this is the general idea of what an index looks like. 00:35:24.970 --> 00:35:26.920 But let's get a little more concrete and try 00:35:26.920 --> 00:35:32.540 to build an index from our title column in our movies table here. 00:35:32.540 --> 00:35:37.150 So let's assume we have a table of IDs and titles here, 00:35:37.150 --> 00:35:41.140 and I want to create an index on this title column 00:35:41.140 --> 00:35:43.340 to search it more efficiently. 00:35:43.340 --> 00:35:47.350 Well we saw before that the best I can do when this data is unsorted 00:35:47.350 --> 00:35:49.130 is to go top to bottom. 00:35:49.130 --> 00:35:51.970 First look at Toy Story, then look at Cars 3, 00:35:51.970 --> 00:35:54.740 then look at Frozen, all the way down to the bottom. 00:35:54.740 --> 00:35:58.220 I can only scan this column, so to speak. 00:35:58.220 --> 00:36:02.560 But I know I could probably do a little better if the data is sorted. 00:36:02.560 --> 00:36:05.030 If the data is sorted I could look roughly in the middle, 00:36:05.030 --> 00:36:07.270 and if I've gone too far, go back a little bit. 00:36:07.270 --> 00:36:09.770 Or if I haven't gone far enough, go forward a little bit 00:36:09.770 --> 00:36:13.910 and do a fewer number of lookups to find data I'm looking for. 00:36:13.910 --> 00:36:16.210 So let me sort this title column. 00:36:16.210 --> 00:36:18.980 I'll put it like this. 00:36:18.980 --> 00:36:22.230 What have I done wrong? 00:36:22.230 --> 00:36:26.980 First I had this, then I wanted to sort my title 00:36:26.980 --> 00:36:29.860 column to make it faster to search. 00:36:29.860 --> 00:36:35.365 But what rule did I break here by sorting this title column like that? 00:36:35.365 --> 00:36:38.620 AUDIENCE: Yes, Carter, the thing is that by sorting only the movie's 00:36:38.620 --> 00:36:43.630 name and not the ID you will break the entire other databases that 00:36:43.630 --> 00:36:45.320 relates to that one. 00:36:45.320 --> 00:36:50.260 So for example, let's look for a movie in which, I don't know, 00:36:50.260 --> 00:36:53.230 Johnny Depp is working or has worked on. 00:36:53.230 --> 00:36:56.080 And the result may be, I don't know, Titanic. 00:36:56.080 --> 00:36:57.820 CARTER ZENKE: Yeah, I love that example. 00:36:57.820 --> 00:36:59.450 That really drives it home here. 00:36:59.450 --> 00:37:04.330 And in more general terms, I had this primary key called ID. 00:37:04.330 --> 00:37:10.120 And in the prior state I had Toy Story with the ID of 114709. 00:37:10.120 --> 00:37:14.650 A primary key means that Toy Story should be consistently identified 00:37:14.650 --> 00:37:18.580 with this number, 114709. 00:37:18.580 --> 00:37:22.600 But to your point, Mateo, if I sort only the title column, 00:37:22.600 --> 00:37:25.060 well now Cars has that ID-- 00:37:25.060 --> 00:37:27.070 114709. 00:37:27.070 --> 00:37:29.620 I've broken that relationship, particularly 00:37:29.620 --> 00:37:37.180 if other tables are relying on Toy Story being identified by 114709. 00:37:37.180 --> 00:37:38.605 So I can't do this. 00:37:38.605 --> 00:37:39.730 This is bad. 00:37:39.730 --> 00:37:43.520 I need some other way to sort my title column. 00:37:43.520 --> 00:37:46.570 So let's bring this back to what it was here. 00:37:46.570 --> 00:37:49.300 Let me focus just on the title column. 00:37:49.300 --> 00:37:54.100 Well I can't sort this title column because it's part of my table 00:37:54.100 --> 00:37:57.100 and the ordering of that table does matter. 00:37:57.100 --> 00:38:00.440 What I could do though is create a copy of this column 00:38:00.440 --> 00:38:02.180 and maybe sort that copy. 00:38:02.180 --> 00:38:06.530 So I'll literally create a copy of this title column elsewhere in memory. 00:38:06.530 --> 00:38:09.230 And let me just remove this title column identifier here 00:38:09.230 --> 00:38:10.850 so I have just the data. 00:38:10.850 --> 00:38:14.350 Just the data in this column somewhere else in memory. 00:38:14.350 --> 00:38:17.990 And now I could sort this data like this. 00:38:17.990 --> 00:38:21.370 So let's say I now want to find Cars. 00:38:21.370 --> 00:38:24.470 Well now that my data is sorted, what could I do? 00:38:24.470 --> 00:38:27.730 I could look first in the middle-- at Ratatouille for instance-- 00:38:27.730 --> 00:38:31.000 and then realize, well Cars comes before Ratatouille. 00:38:31.000 --> 00:38:34.450 Maybe I'll go a little higher in my index here. 00:38:34.450 --> 00:38:36.670 I'll search in the middle of this first half. 00:38:36.670 --> 00:38:38.800 I'll look at, in this case, Frozen. 00:38:38.800 --> 00:38:41.320 And Cars-- well it still comes before Frozen, 00:38:41.320 --> 00:38:44.560 so let me look up at the first half of this first half. 00:38:44.560 --> 00:38:47.770 I'll then go up here and I'll have spotted Cars. 00:38:47.770 --> 00:38:54.010 So a much faster way to search for information now once I have it sorted. 00:38:54.010 --> 00:38:56.500 But I don't think we're quite there yet. 00:38:56.500 --> 00:39:01.090 If I wanted to find the year that Cars was released, 00:39:01.090 --> 00:39:08.620 I found that it's in my title column but what am I not able to do at this point? 00:39:08.620 --> 00:39:10.840 I've copied the data somewhere else, but now I 00:39:10.840 --> 00:39:13.270 want to find some other data in my table. 00:39:13.270 --> 00:39:15.390 What can't I do? 00:39:15.390 --> 00:39:18.148 AUDIENCE: We need to point to index. 00:39:18.148 --> 00:39:18.940 CARTER ZENKE: Yeah. 00:39:18.940 --> 00:39:21.430 I need to be able to find some way to relate 00:39:21.430 --> 00:39:24.590 this index with the rows in my table. 00:39:24.590 --> 00:39:27.250 So here this is elsewhere in memory. 00:39:27.250 --> 00:39:31.460 It's good that I can find that Cars is in my title column. 00:39:31.460 --> 00:39:35.050 But if I want to find the other data in this row like the year 00:39:35.050 --> 00:39:37.060 that Cars was released, well I have to have 00:39:37.060 --> 00:39:43.340 some way of linking this data here, Cars, to my table. 00:39:43.340 --> 00:39:46.930 So often in an index, we won't just copy the data to some new location 00:39:46.930 --> 00:39:47.620 and sort it. 00:39:47.620 --> 00:39:52.240 We'll also have some link between these indexed pieces of data over here 00:39:52.240 --> 00:39:54.920 and our rows and our table back here. 00:39:54.920 --> 00:40:00.940 So for instance here is Cars, and Cars appears to be in row 4. 00:40:00.940 --> 00:40:05.650 Similarly Cars 3 looks to be in row 2 in my table. 00:40:05.650 --> 00:40:11.620 And I could metaphorically visually draw some lines between Cars over here 00:40:11.620 --> 00:40:14.540 in my index and Cars in my table. 00:40:14.540 --> 00:40:17.950 I could draw one that looks a bit like this going from Cars in my index 00:40:17.950 --> 00:40:20.510 to actually Cars in my table. 00:40:20.510 --> 00:40:24.010 I could draw one for Cars 3 going from this index place 00:40:24.010 --> 00:40:27.410 here to row 2 in my title column. 00:40:27.410 --> 00:40:32.380 So that is the way I can link my index with my table data. 00:40:32.380 --> 00:40:38.860 I could do the rest of these too for all of my other pieces of data in my index. 00:40:38.860 --> 00:40:42.090 So let's focus on this now just to check for understanding. 00:40:42.090 --> 00:40:44.100 I have this index. 00:40:44.100 --> 00:40:48.180 That index has my sorted title data and also 00:40:48.180 --> 00:40:51.820 the row number in which I found that piece of data. 00:40:51.820 --> 00:40:56.580 So let's say here I want to find the movie Soul. 00:40:56.580 --> 00:40:58.380 I want to find the movie Soul. 00:40:58.380 --> 00:41:03.876 How could I find Soul in my index? 00:41:03.876 --> 00:41:06.500 AUDIENCE: Yeah, since it is already sorted 00:41:06.500 --> 00:41:08.198 I believe you could use binary search. 00:41:08.198 --> 00:41:08.990 CARTER ZENKE: Yeah. 00:41:08.990 --> 00:41:12.020 I could use algorithm called binary search, which if you're not familiar 00:41:12.020 --> 00:41:15.320 simply means start in the middle and if I've gone too far, 00:41:15.320 --> 00:41:17.690 look at either the left half or the right half 00:41:17.690 --> 00:41:20.520 and repeat that process over and over. 00:41:20.520 --> 00:41:22.160 So here is my index again. 00:41:22.160 --> 00:41:25.470 I'll go roughly to the middle and I'll look at Ratatouille here. 00:41:25.470 --> 00:41:29.390 Well Soul, I know, comes after Ratatouille alphabetically, 00:41:29.390 --> 00:41:31.020 so what should I do? 00:41:31.020 --> 00:41:33.410 I should now look in this second half here. 00:41:33.410 --> 00:41:35.570 I might go to Toy Story. 00:41:35.570 --> 00:41:38.720 Once I see Toy Story, I know I've gone a little bit too far. 00:41:38.720 --> 00:41:41.720 Soul comes before Toy Story alphabetically, 00:41:41.720 --> 00:41:45.080 so I'll look in the first half of this second half. 00:41:45.080 --> 00:41:46.230 And what do I see? 00:41:46.230 --> 00:41:49.183 I see Soul right there. 00:41:49.183 --> 00:41:50.130 OK. 00:41:50.130 --> 00:41:52.700 So our index seems to be working. 00:41:52.700 --> 00:41:54.950 It allows me to more efficiently find data. 00:41:54.950 --> 00:41:58.250 I no longer have to look through every single row. 00:41:58.250 --> 00:42:03.190 I can simply use a more efficient algorithm like binary search. 00:42:03.190 --> 00:42:08.980 But now there's one more catch which is, if I had many, many, many rows-- 00:42:08.980 --> 00:42:12.940 not just nine but literally thousands or hundreds of thousands or millions 00:42:12.940 --> 00:42:19.090 of rows in this index, that's a lot of data to store in one location in memory 00:42:19.090 --> 00:42:22.450 or even to load into my computer's RAM, so to speak-- 00:42:22.450 --> 00:42:24.020 its random access memory-- 00:42:24.020 --> 00:42:26.740 so it can do all this jumping around. 00:42:26.740 --> 00:42:31.510 Ideally what I would do is break this index up into smaller pieces 00:42:31.510 --> 00:42:34.300 so I don't have to load all of this at once. 00:42:34.300 --> 00:42:38.420 And in reality our index is not a single column like this, 00:42:38.420 --> 00:42:42.160 but often individual nodes where each of these-- 00:42:42.160 --> 00:42:48.560 1, 2, 3-- might be some node scattered about the computer's memory. 00:42:48.560 --> 00:42:51.190 But what problem have I introduced now? 00:42:51.190 --> 00:42:53.500 If these are around the computer's memory, 00:42:53.500 --> 00:42:56.510 I can't actually jump between them anymore. 00:42:56.510 --> 00:43:00.140 I can't go maybe one space above Toy Story and find Soul. 00:43:00.140 --> 00:43:03.500 This node might be elsewhere in memory. 00:43:03.500 --> 00:43:06.760 And so we'll need some other nodes to actually help 00:43:06.760 --> 00:43:10.090 us find what we're looking for in this index. 00:43:10.090 --> 00:43:15.010 I might have one more over here that tells me what kind of data 00:43:15.010 --> 00:43:17.230 is in each of these nodes. 00:43:17.230 --> 00:43:21.190 Maybe I have one here that has both Frozen and Soul. 00:43:21.190 --> 00:43:25.990 Notice how here Frozen is the furthest down alphabetically in this 00:43:25.990 --> 00:43:26.980 node up here. 00:43:26.980 --> 00:43:30.280 And notice down below here Soul is the furthest down 00:43:30.280 --> 00:43:33.520 alphabetically in this node here. 00:43:33.520 --> 00:43:37.150 That is, if what I'm looking for comes before Frozen, 00:43:37.150 --> 00:43:38.800 I should look at this node. 00:43:38.800 --> 00:43:42.730 If what I'm looking for comes after Frozen but before Soul, 00:43:42.730 --> 00:43:44.620 I should look in this node. 00:43:44.620 --> 00:43:49.390 Or if what I'm looking for comes before none of these, it comes after Soul, 00:43:49.390 --> 00:43:53.390 I should look at this node down here. 00:43:53.390 --> 00:43:54.910 So let's find an example. 00:43:54.910 --> 00:43:56.970 Let's say I want to find Cars now. 00:43:56.970 --> 00:44:00.610 Well I'd start in this root node, this base of this tree 00:44:00.610 --> 00:44:03.190 we're building here that I've turned kind of right side-- 00:44:03.190 --> 00:44:04.240 this way. 00:44:04.240 --> 00:44:05.980 Let's say Frozen. 00:44:05.980 --> 00:44:09.610 I'll go here and does Cars come before Frozen? 00:44:09.610 --> 00:44:12.490 Well it does, so I'll look in this node now. 00:44:12.490 --> 00:44:16.570 I'll come up here and I'll find Cars in that node. 00:44:16.570 --> 00:44:17.500 Let's see here. 00:44:17.500 --> 00:44:19.840 Maybe I want to find Ratatouille. 00:44:19.840 --> 00:44:23.380 I'll go back to the beginning and I'll look through this root node. 00:44:23.380 --> 00:44:27.080 I'll say, does Ratatouille come before Frozen? 00:44:27.080 --> 00:44:27.580 It doesn't. 00:44:27.580 --> 00:44:30.970 It comes after Frozen, so I'll look and check Soul. 00:44:30.970 --> 00:44:33.670 Does Ratatouille come before Soul? 00:44:33.670 --> 00:44:36.070 Well it does, so I'll look in this node now. 00:44:36.070 --> 00:44:37.330 I'll go find Luca. 00:44:37.330 --> 00:44:40.240 I'm not quite there yet, but I will see Ratatouille 00:44:40.240 --> 00:44:43.870 so I know that I've now found Ratatouille in my index. 00:44:43.870 --> 00:44:45.880 And now for one more-- 00:44:45.880 --> 00:44:47.680 let me check for understanding here. 00:44:47.680 --> 00:44:51.370 How could I find Turning Red? 00:44:51.370 --> 00:44:56.310 What steps would I follow to find Turning Red? 00:44:56.310 --> 00:45:02.115 I might start at that root node, but then what questions should I ask? 00:45:02.115 --> 00:45:08.740 AUDIENCE: From Frozen also we can see Ratatouille is to the right 00:45:08.740 --> 00:45:10.890 so that we move in direction of Soul. 00:45:10.890 --> 00:45:15.850 Now once we get there we find the middle elements and we can see Ratatouille 00:45:15.850 --> 00:45:19.120 is to the right of the middle elements. 00:45:19.120 --> 00:45:20.520 So that's how we find it. 00:45:20.520 --> 00:45:25.560 Divide the list into two halves and look into the half 00:45:25.560 --> 00:45:27.448 that the element is found. 00:45:27.448 --> 00:45:29.740 CARTER ZENKE: Yeah, I like what you were thinking here. 00:45:29.740 --> 00:45:33.590 So you're talking about traversing this tree we've built and turned on its side 00:45:33.590 --> 00:45:34.090 here. 00:45:34.090 --> 00:45:36.010 And let's visualize that search here. 00:45:36.010 --> 00:45:41.440 So first we'll start at Frozen and ask, does Turning Red come before Frozen? 00:45:41.440 --> 00:45:41.940 It doesn't. 00:45:41.940 --> 00:45:43.530 It comes after, so I'll keep looking. 00:45:43.530 --> 00:45:44.640 I'll look at Soul. 00:45:44.640 --> 00:45:47.620 Does Turning Red come before Soul? 00:45:47.620 --> 00:45:48.120 It doesn't. 00:45:48.120 --> 00:45:53.430 It comes after Soul, so I know I need to look at this node down here. 00:45:53.430 --> 00:45:55.050 I'll go find Toy Story. 00:45:55.050 --> 00:45:57.990 Not quite there yet, but now I'll see Turning Red. 00:45:57.990 --> 00:46:00.150 I'm able to find what I'm looking for. 00:46:00.150 --> 00:46:01.110 And not just that. 00:46:01.110 --> 00:46:07.000 I can find what row number Turning Red is actually in in my table. 00:46:07.000 --> 00:46:12.180 So when you use an index, SQL or SQLite is using some algorithm a bit 00:46:12.180 --> 00:46:13.090 like this. 00:46:13.090 --> 00:46:15.870 It creates some tree structure a bit like this here 00:46:15.870 --> 00:46:18.780 with individual nodes that has your column data. 00:46:18.780 --> 00:46:22.470 It then uses this kind of algorithm to search that index 00:46:22.470 --> 00:46:26.770 and find whether or not that piece of data is in your index. 00:46:26.770 --> 00:46:32.050 And if it is, it will go ahead and look up that piece of data in your table. 00:46:32.050 --> 00:46:36.000 So see here, although this is turned on its side, the relationship 00:46:36.000 --> 00:46:37.740 between a kind of tree. 00:46:37.740 --> 00:46:43.680 We had a root node or a trunk node here pointing to some of these leaf nodes-- 00:46:43.680 --> 00:46:45.570 the edge of our graph right here. 00:46:45.570 --> 00:46:48.750 And you can see some relationship between these nodes here 00:46:48.750 --> 00:46:52.350 and our tree as well right here with arrows 00:46:52.350 --> 00:46:55.770 pointing across these individual nodes. 00:46:55.770 --> 00:47:01.050 So let me pause here and ask what questions do we have on this structure 00:47:01.050 --> 00:47:02.985 and how we've searched it? 00:47:02.985 --> 00:47:07.683 AUDIENCE: Is that related to linked lists or arrays? 00:47:07.683 --> 00:47:08.850 CARTER ZENKE: Good question. 00:47:08.850 --> 00:47:11.780 Is this related to linked lists or arrays? 00:47:11.780 --> 00:47:14.550 So if you're familiar, a linked list is some set 00:47:14.550 --> 00:47:18.030 of nodes that have been linked together in memory using these arrows like we 00:47:18.030 --> 00:47:19.800 saw on our diagram here. 00:47:19.800 --> 00:47:24.135 It is the case that sometimes an index will include a linked list. 00:47:27.100 --> 00:47:28.245 Let's take one more here. 00:47:28.245 --> 00:47:32.400 AUDIENCE: I want to know, a covering index-- 00:47:32.400 --> 00:47:37.890 how is the [INAUDIBLE] covering index and what has been-- 00:47:37.890 --> 00:47:44.160 I mean, if we just [INAUDIBLE] index and then we just start covering index, 00:47:44.160 --> 00:47:49.740 how SQL automatically getting that covering index that's here? 00:47:49.740 --> 00:47:53.010 CARTER ZENKE: Yeah, so I think we're asking, how is the index created? 00:47:53.010 --> 00:47:54.940 Like, what is the process for doing that? 00:47:54.940 --> 00:47:58.920 Well it turns out that just as we saw here in the slides, SQLite or whatever 00:47:58.920 --> 00:48:03.240 DBMS you're using will take a copy of the data that should be in that index 00:48:03.240 --> 00:48:06.750 and organize it into a tree structure like we saw here. 00:48:06.750 --> 00:48:09.150 It may or may not be similar to the structure we saw just 00:48:09.150 --> 00:48:14.640 a little earlier with nodes that have different pieces of row data in them. 00:48:14.640 --> 00:48:17.280 Good question there. 00:48:17.280 --> 00:48:18.180 OK. 00:48:18.180 --> 00:48:21.090 So we've seen now the trade-off that we get 00:48:21.090 --> 00:48:24.510 with index, which is they're very fast to look up some data 00:48:24.510 --> 00:48:26.400 but they take up a lot more space. 00:48:26.400 --> 00:48:29.370 There's a lot of redundancy to them here. 00:48:29.370 --> 00:48:31.890 There's also one more trade-off, which is 00:48:31.890 --> 00:48:37.230 if I want to insert something into this structure, it'll take me more time. 00:48:37.230 --> 00:48:39.240 I have to navigate this structure here. 00:48:39.240 --> 00:48:44.010 I have to figure out should it come before this value or after this value? 00:48:44.010 --> 00:48:48.030 And I'll have to traverse this tree every time verses 00:48:48.030 --> 00:48:51.420 without an index I could just add something to the end of this list. 00:48:51.420 --> 00:48:54.930 But with a B-Tree, with an index, I need to traverse all these nodes 00:48:54.930 --> 00:48:57.760 and figure out where should that piece of data go? 00:48:57.760 --> 00:48:59.593 So these trade-offs are something you should 00:48:59.593 --> 00:49:03.660 keep in mind as you build indexes and optimize your queries. 00:49:03.660 --> 00:49:04.560 OK. 00:49:04.560 --> 00:49:07.650 There are ways though to actually optimize the space 00:49:07.650 --> 00:49:10.710 that we use using a tree like this. 00:49:10.710 --> 00:49:15.310 We can use a structure called not an index, but a partial index. 00:49:15.310 --> 00:49:19.620 And if you're concerned about space, a partial index might be good for you. 00:49:19.620 --> 00:49:25.200 A partial index only includes a subset of the data from a given column, 00:49:25.200 --> 00:49:28.800 thus making your index smaller overall. 00:49:28.800 --> 00:49:31.590 You could think of a partial index being useful when 00:49:31.590 --> 00:49:35.490 you know that your users will only query a subset-- 00:49:35.490 --> 00:49:38.820 only a small number of rows from that table. 00:49:38.820 --> 00:49:42.690 And for example in IMDb, maybe we know people 00:49:42.690 --> 00:49:46.770 are more likely to search for a movie that came out this year versus someone 00:49:46.770 --> 00:49:49.720 that came out 15 years ago or so. 00:49:49.720 --> 00:49:53.340 So we could create an index that focuses on those movies that 00:49:53.340 --> 00:49:59.800 came out this year versus 15 years ago or so, and we'll do just that. 00:49:59.800 --> 00:50:03.670 If I want to create a partial index I can use this syntax here. 00:50:03.670 --> 00:50:06.720 I could say create an index and give it some name, 00:50:06.720 --> 00:50:11.670 then include some data from this table and this column here, 00:50:11.670 --> 00:50:13.530 or a list of columns I can provide. 00:50:13.530 --> 00:50:19.140 But then what I need0 to do is add a condition where some condition is true. 00:50:19.140 --> 00:50:23.370 These are the rows I want to include in this index. 00:50:23.370 --> 00:50:27.910 Before we had just create index name on table from a list of columns. 00:50:27.910 --> 00:50:33.420 Now we can add in this WHERE clause that says only include those rows where 00:50:33.420 --> 00:50:35.320 this condition is true. 00:50:35.320 --> 00:50:36.840 So let's try that now. 00:50:36.840 --> 00:50:39.210 I'll come back to my computer. 00:50:39.210 --> 00:50:41.880 And why don't we try creating a partial index 00:50:41.880 --> 00:50:46.710 to help speed up the lookup of movies that came out this year. 00:50:46.710 --> 00:50:50.520 Well I could come back to my terminal and I could remove, let's say, 00:50:50.520 --> 00:50:51.900 this one over here. 00:50:51.900 --> 00:50:55.320 Now I'll try to create this partial index. 00:50:55.320 --> 00:51:00.690 I'll say create an index called "recents" to indicate this 00:51:00.690 --> 00:51:03.090 will help us search for recent movies. 00:51:03.090 --> 00:51:08.400 Now I'll make this "recents" ON "movies" in the ("title") column. 00:51:08.400 --> 00:51:12.510 I want to look up titles ultimately, but now I 00:51:12.510 --> 00:51:20.190 only want to create an index for those rows where the year equals 2023. 00:51:20.190 --> 00:51:26.790 I only want to include those titles in my index that were released in 2023. 00:51:26.790 --> 00:51:31.890 So I'll hit Enter now and I'll see I created this index for myself here. 00:51:31.890 --> 00:51:33.900 Now let me try searching for those. 00:51:33.900 --> 00:51:40.440 I'll SELECT "title" FROM "movies" WHERE the "year" = 2023. 00:51:40.440 --> 00:51:41.610 Enter. 00:51:41.610 --> 00:51:45.910 And I'll see a lot of movies came out in 2023. 00:51:45.910 --> 00:51:48.550 Now, it's taking 1.3 seconds here. 00:51:48.550 --> 00:51:51.600 But now I could prove to you that we are using some index. 00:51:51.600 --> 00:52:00.120 I could say EXPLAIN QUERY PLAN and now type SELECT "title" FROM "movies" 00:52:00.120 --> 00:52:03.120 WHERE the "year" = 2023. 00:52:03.120 --> 00:52:04.440 Hit Enter. 00:52:04.440 --> 00:52:06.600 And now we'll see-- what are we doing? 00:52:06.600 --> 00:52:09.870 We're going to-- we'll scan movies but we're still 00:52:09.870 --> 00:52:13.950 going to use the index we just created called "recents." 00:52:13.950 --> 00:52:17.580 So it's helping speed us up just a little bit overall. 00:52:17.580 --> 00:52:21.540 Now let me try dropping this index and showing you the opposite. 00:52:21.540 --> 00:52:24.560 I'll come back over here and I could try dropping it. 00:52:24.560 --> 00:52:26.310 And actually, before I do that let me show 00:52:26.310 --> 00:52:29.340 you what would happen if I didn't search for movie in 2023. 00:52:29.340 --> 00:52:32.560 I would try EXPLAIN QUERY PLAN. 00:52:32.560 --> 00:52:42.150 And let me try SELECT "title" FROM "movies" where the year = 1998 00:52:42.150 --> 00:52:43.140 like this. 00:52:43.140 --> 00:52:47.130 And see now I'm only back to scanning movies. 00:52:47.130 --> 00:52:50.670 So before in my prior query I was able to use the index 00:52:50.670 --> 00:52:54.480 because my WHERE clause included 2023. 00:52:54.480 --> 00:52:58.590 Now though I'm using 1998, which were rows that were not 00:52:58.590 --> 00:53:02.520 included in this partial index. 00:53:02.520 --> 00:53:03.270 OK. 00:53:03.270 --> 00:53:05.850 So let me ask now, what questions do we have 00:53:05.850 --> 00:53:09.600 on these partial index which we can use to optimize some 00:53:09.600 --> 00:53:12.920 of the space these indexes take up? 00:53:12.920 --> 00:53:16.030 AUDIENCE: Are indexes not saved in schema? 00:53:16.030 --> 00:53:17.280 CARTER ZENKE: A good question. 00:53:17.280 --> 00:53:19.370 Are indexes saved in the schema? 00:53:19.370 --> 00:53:23.510 In SQLite if I type .schema I can actually see them in my schema. 00:53:23.510 --> 00:53:25.070 So let me go back here and show you. 00:53:25.070 --> 00:53:27.560 If I now type something like-- 00:53:27.560 --> 00:53:29.620 let me turn timer off here-- 00:53:29.620 --> 00:53:35.150 .schema, you can see down below here that I actually have these indexes 00:53:35.150 --> 00:53:37.190 as part of my schema. 00:53:37.190 --> 00:53:40.910 I'll see the create index statements I used down below here. 00:53:40.910 --> 00:53:44.810 Create an index called name_index, called person_index, called recents. 00:53:44.810 --> 00:53:49.460 And all that I used to create these indexes now inside my schema. 00:53:49.460 --> 00:53:54.790 If I drop some index it'll then be just removed from my schema altogether. 00:53:54.790 --> 00:53:57.160 Good question. 00:53:57.160 --> 00:53:57.660 OK. 00:53:57.660 --> 00:54:00.450 So speaking of dropping indexes, one other way 00:54:00.450 --> 00:54:04.470 we can more efficiently use space is to do a bit of vacuuming so to speak. 00:54:04.470 --> 00:54:08.040 So let's introduce this idea of vacuuming up our database. 00:54:08.040 --> 00:54:10.200 Let me show you a slide here. 00:54:10.200 --> 00:54:14.010 And this one involves this idea of trying to clean up 00:54:14.010 --> 00:54:17.580 unused space in our database. 00:54:17.580 --> 00:54:22.500 Often when we delete something, either a rows or an index, 00:54:22.500 --> 00:54:25.890 we don't actually delete those bits that were 00:54:25.890 --> 00:54:29.190 being used by those rows or that index. 00:54:29.190 --> 00:54:33.690 We just mark them as being available for whatever we next insert. 00:54:33.690 --> 00:54:36.400 They can be overwritten, so to speak. 00:54:36.400 --> 00:54:39.930 But if I want to actually reduce the size of my database 00:54:39.930 --> 00:54:44.880 after I delete something, I should use something like vacuum in SQLite 00:54:44.880 --> 00:54:47.610 or optimize in some other DBMS. 00:54:47.610 --> 00:54:50.950 Vacuuming is a bit like taking those unused bits and just sucking them up. 00:54:50.950 --> 00:54:55.500 Taking them back in so we can give them back to the operating system. 00:54:55.500 --> 00:54:59.490 And the keyword we can use for vacuum is just simply vacuum. 00:54:59.490 --> 00:55:00.720 So let's try this. 00:55:00.720 --> 00:55:02.640 I'll come back to my computer over here. 00:55:02.640 --> 00:55:06.120 And let me open up a new terminal-- one that I can use 00:55:06.120 --> 00:55:09.240 to find the current size of movies.db. 00:55:09.240 --> 00:55:12.720 So I'll hit this + button here and now I could type a command. 00:55:12.720 --> 00:55:17.830 Not a SQL command or a SQLite command, but actually a Unix command. 00:55:17.830 --> 00:55:22.320 I could type du -b for the disk usage in bytes. 00:55:22.320 --> 00:55:24.810 How many bytes is a particular file? 00:55:24.810 --> 00:55:31.560 I could then say movies.db here, and if I hit Enter I should see movies.db 00:55:31.560 --> 00:55:37.230 is something like 158 million bytes or 158 megabytes. 00:55:37.230 --> 00:55:41.290 So let's see what happens if I were to delete an index which you know takes up 00:55:41.290 --> 00:55:43.380 some amount of space in our database. 00:55:43.380 --> 00:55:45.510 I'll go back to my terminal here. 00:55:45.510 --> 00:55:48.990 And let me try to drop an index that we had recently created. 00:55:48.990 --> 00:55:55.170 I'll say DROP INDEX "person_index" to remove the person index in this case. 00:55:55.170 --> 00:55:58.860 I'll hit Enter and now I see that "person_index" is dropped. 00:55:58.860 --> 00:56:05.250 If I type .schema here, I no longer see it in my schema down below. 00:56:05.250 --> 00:56:09.180 So now let me check the disk usage in bytes and see if it's any smaller. 00:56:09.180 --> 00:56:13.410 I'll hit the up arrow to get access to that same command, hit Enter. 00:56:13.410 --> 00:56:14.940 It seems like it's the same size. 00:56:14.940 --> 00:56:17.940 It's still 158 megabytes. 00:56:17.940 --> 00:56:19.830 So let me try dropping something more here. 00:56:19.830 --> 00:56:23.430 I'll go back to my terminal here and now I'll try-- 00:56:23.430 --> 00:56:25.500 what other do I have? 00:56:25.500 --> 00:56:30.630 I have "name_index" so I'll say let's DROP index "name_index" like this. 00:56:30.630 --> 00:56:34.680 I'll hit Enter and I'll see that index is now gone. 00:56:34.680 --> 00:56:37.200 Well now I'll also try dropping recents. 00:56:37.200 --> 00:56:43.470 I'll DROP INDEX "recents" and now I believe all of my indexes 00:56:43.470 --> 00:56:44.310 should be gone. 00:56:44.310 --> 00:56:48.510 If I type .schema, I don't see any more indexes. 00:56:48.510 --> 00:56:51.300 So now let's check on the disk usage in bytes. 00:56:51.300 --> 00:56:54.750 I'll go back to my other terminal, hit du -b again, 00:56:54.750 --> 00:56:58.290 and it's still 158 megabytes. 00:56:58.290 --> 00:57:01.170 Well what can we do to clean this up? 00:57:01.170 --> 00:57:03.180 We could run vacuum. 00:57:03.180 --> 00:57:07.140 If you remember here when we delete an index-- we drop the index, 00:57:07.140 --> 00:57:09.840 we don't actually shrink our database file. 00:57:09.840 --> 00:57:13.320 We just mark those bits that were part of the index as being 00:57:13.320 --> 00:57:15.120 available for re-use. 00:57:15.120 --> 00:57:17.700 So let's now vacuum and compress our file 00:57:17.700 --> 00:57:21.210 to give back those bytes and those bits to the operating system. 00:57:21.210 --> 00:57:27.000 I'll come back to my terminal here and type VACUUM; hit Enter, 00:57:27.000 --> 00:57:30.360 and I'll wait for it to find those bits and bytes, 00:57:30.360 --> 00:57:33.300 wait for it to give them back, to de-allocate them, 00:57:33.300 --> 00:57:34.950 to go back to the operating system. 00:57:34.950 --> 00:57:39.660 And now if I come back and I type du -b again, hit Enter-- 00:57:39.660 --> 00:57:42.970 I should see a much smaller file. 00:57:42.970 --> 00:57:45.990 So before we had about 158 million bytes. 00:57:45.990 --> 00:57:50.670 We're down to in this case 100 million after dropping our indexes 00:57:50.670 --> 00:57:53.360 and using VACUUM. 00:57:53.360 --> 00:57:57.390 So let me ask now, what questions do we have on vacuuming? 00:57:57.390 --> 00:58:00.020 AUDIENCE: Can we do vacuuming in a faster way? 00:58:00.020 --> 00:58:02.730 CARTER ZENKE: Can you vacuum in a faster way? 00:58:02.730 --> 00:58:05.120 So I think it depends on the size of the vacuum 00:58:05.120 --> 00:58:06.990 that you're trying to implement here. 00:58:06.990 --> 00:58:08.900 If you have a lot of bits and bytes you're 00:58:08.900 --> 00:58:11.990 trying to de-allocate, that could take more time. 00:58:11.990 --> 00:58:15.290 The process here is the computer has to find those bits and bytes you're 00:58:15.290 --> 00:58:19.590 no longer using and then give each of those back to the operating system. 00:58:19.590 --> 00:58:24.178 So it could depend on the size and how long they take to find. 00:58:24.178 --> 00:58:25.720 And let's take another question here. 00:58:25.720 --> 00:58:30.691 AUDIENCE: Initially you said once we drop a database, we don't lose the-- 00:58:30.691 --> 00:58:33.310 the data is not deleted. 00:58:33.310 --> 00:58:36.550 The space is just reallocated, right? 00:58:36.550 --> 00:58:37.960 If I go you right. 00:58:37.960 --> 00:58:42.280 So does that mean the queries we run today, 00:58:42.280 --> 00:58:46.780 are they locked somewhere that-- let's say you mistakenly did a query. 00:58:46.780 --> 00:58:50.710 You can go back and pull that data back? 00:58:50.710 --> 00:58:51.880 Is that possible? 00:58:51.880 --> 00:58:55.000 Kind of like a reversal of a query. 00:58:55.000 --> 00:58:57.880 CARTER ZENKE: Yeah, so you're talking about some forensics here. 00:58:57.880 --> 00:59:01.900 Like, could we mark something as deleted but still find it later? 00:59:01.900 --> 00:59:05.290 There actually are people who are trained in this, in forensics, 00:59:05.290 --> 00:59:07.600 to find pieces of data that you thought were deleted 00:59:07.600 --> 00:59:12.160 but are actually still on your computer just forgotten, if you will. 00:59:12.160 --> 00:59:14.740 In this case in SQLite it would take a lot of work 00:59:14.740 --> 00:59:18.190 to go ahead and find all of those bits and bytes we've removed. 00:59:18.190 --> 00:59:21.550 But once we vacuum, once we actually give them back to the operating system, 00:59:21.550 --> 00:59:24.220 we can no longer find those in the database 00:59:24.220 --> 00:59:26.830 because they're not just marked as being de-allocated. 00:59:26.830 --> 00:59:30.720 They actually have been given back and they're no longer part of our database. 00:59:30.720 --> 00:59:32.710 Good question there. 00:59:32.710 --> 00:59:35.800 OK, so we've seen here how we can optimize 00:59:35.800 --> 00:59:38.680 our queries by reducing the time that they take 00:59:38.680 --> 00:59:40.870 although at the cost of some space. 00:59:40.870 --> 00:59:45.130 When we come back, we'll see how we can handle not just one query at a time 00:59:45.130 --> 00:59:47.440 but many all at once. 00:59:47.440 --> 00:59:49.010 See you in a bit. 00:59:49.010 --> 00:59:50.330 And we're back. 00:59:50.330 --> 00:59:54.460 So we've seen so far how to optimize individual queries, ensuring 00:59:54.460 --> 00:59:58.000 they take less time and also to ensure our databases 00:59:58.000 --> 01:00:00.800 use as little space as possible. 01:00:00.800 --> 01:00:03.010 What we'll focus on now is trying to ensure 01:00:03.010 --> 01:00:08.125 we can handle not just one query at a time but multiple all at once. 01:00:08.125 --> 01:00:10.570 And this is this idea of concurrency. 01:00:10.570 --> 01:00:16.180 Concurrency is when I get multiple queries, multiple interactions, all 01:00:16.180 --> 01:00:20.110 at roughly the same time and my computer system, my database, 01:00:20.110 --> 01:00:23.810 needs to figure out how to handle all of those at once. 01:00:23.810 --> 01:00:27.040 This is particularly important for websites that get a lot of traffic, 01:00:27.040 --> 01:00:28.900 or even for the financial sector. 01:00:28.900 --> 01:00:32.110 People are constantly making trades, placing orders, 01:00:32.110 --> 01:00:34.610 exchanging money, and so on. 01:00:34.610 --> 01:00:37.690 So let's see one example here of a bank. 01:00:37.690 --> 01:00:40.870 Let's say here I have a table of accounts 01:00:40.870 --> 01:00:44.860 and each account has a name and some balance to it. 01:00:44.860 --> 01:00:50.890 Well we decide here that maybe Alice wants to pay Bob $10. 01:00:50.890 --> 01:00:54.220 So what should I do to update the balances here? 01:00:54.220 --> 01:00:58.510 Maybe the first thing I should do is make sure that Bob receives that money. 01:00:58.510 --> 01:01:03.460 I could update Bob's account to have a balance not of $20 but of $30 now. 01:01:03.460 --> 01:01:05.920 From $20 to $30. 01:01:05.920 --> 01:01:07.450 But here's the thing. 01:01:07.450 --> 01:01:14.806 If at this moment somebody looks at this table, what will they see that's wrong? 01:01:14.806 --> 01:01:20.205 What will they see that they might not get the right picture of at this point? 01:01:23.800 --> 01:01:27.850 I'm hearing they might not see the correct balance for Alice. 01:01:27.850 --> 01:01:29.980 Bob has gotten some money. 01:01:29.980 --> 01:01:31.990 He's gone from $20 here to $30. 01:01:31.990 --> 01:01:34.060 But now I have not just-- 01:01:34.060 --> 01:01:35.710 let's see, $60 in the bank. 01:01:35.710 --> 01:01:38.210 I now have $70 it looks like. 01:01:38.210 --> 01:01:40.900 So if somebody looks at this transaction in this moment 01:01:40.900 --> 01:01:43.420 they'll see some incorrect information. 01:01:43.420 --> 01:01:45.910 What I should then do is update Alice's balance, 01:01:45.910 --> 01:01:50.740 subtracting $10 so we're at the right amount of dollars across our bank 01:01:50.740 --> 01:01:51.830 accounts here. 01:01:51.830 --> 01:01:53.890 So this is an issue. 01:01:53.890 --> 01:01:57.370 When I have one ongoing transaction like this 01:01:57.370 --> 01:01:59.950 and somebody else wants to look at that data, 01:01:59.950 --> 01:02:03.430 they really shouldn't be able to until this transaction is 01:02:03.430 --> 01:02:06.020 complete in its entirety. 01:02:06.020 --> 01:02:09.340 So ideally-- let's go back to what we had before with Alice at $10 01:02:09.340 --> 01:02:10.480 Bob at $20. 01:02:10.480 --> 01:02:15.430 Ideally if Alice wants to pay Bob, this should happen to an outside observer 01:02:15.430 --> 01:02:16.510 all at once. 01:02:16.510 --> 01:02:19.360 Bob should get those $10 and Alice should 01:02:19.360 --> 01:02:23.590 lose those $10 at the same exact moment so nobody 01:02:23.590 --> 01:02:29.230 can look and see that Bob has more money while Alice doesn't have less money. 01:02:29.230 --> 01:02:34.060 Now we can implement this idea of this transaction using 01:02:34.060 --> 01:02:39.040 a very similarly-named idea in databases called a transaction. 01:02:39.040 --> 01:02:41.830 Now a transaction in the world of databases 01:02:41.830 --> 01:02:44.800 is an individual unit of work. 01:02:44.800 --> 01:02:49.370 That means a transaction can't be broken down into smaller pieces. 01:02:49.370 --> 01:02:52.720 It happens all at once or not at all. 01:02:52.720 --> 01:02:55.540 And in fact a transaction has several properties 01:02:55.540 --> 01:02:58.690 and helps implement several properties of a database. 01:02:58.690 --> 01:03:00.340 Among these are ACID-- 01:03:00.340 --> 01:03:05.020 A-C-I-D. You may have heard of these if you've looked up something about 01:03:05.020 --> 01:03:10.150 databases, but each one stands for some principle of ensuring a consistent 01:03:10.150 --> 01:03:11.870 and reliable database. 01:03:11.870 --> 01:03:17.710 In fact, A stands for atomicity, which means I can't break a transaction down 01:03:17.710 --> 01:03:18.970 into smaller pieces. 01:03:18.970 --> 01:03:21.310 I can't break a transaction between Alice and Bob 01:03:21.310 --> 01:03:24.250 into Bob's update and then Alice's update. 01:03:24.250 --> 01:03:28.150 It's just update both accounts at the same time. 01:03:28.150 --> 01:03:32.080 Consistency means a transaction to help me ensure that I 01:03:32.080 --> 01:03:35.180 don't violate some database constraint. 01:03:35.180 --> 01:03:39.070 Let's say I try to pay Bob from Alice's account, 01:03:39.070 --> 01:03:40.900 but Alice doesn't have the money. 01:03:40.900 --> 01:03:44.530 If that is the case, a transaction will revert, undo. 01:03:44.530 --> 01:03:49.090 So I go back to being in this consistent state for my database. 01:03:49.090 --> 01:03:53.530 Isolation means if I have more than one person trying to access 01:03:53.530 --> 01:03:55.600 this database, their queries-- 01:03:55.600 --> 01:03:58.940 their transactions won't interfere with each other. 01:03:58.940 --> 01:04:03.385 And then finally, durability means if I were to unplug this database 01:04:03.385 --> 01:04:06.010 or we were to encounter a power failure or the server to crash, 01:04:06.010 --> 01:04:11.230 all that data would still be there from the time we committed our transaction 01:04:11.230 --> 01:04:14.150 or saved the results there. 01:04:14.150 --> 01:04:18.670 So transactions do a lot for us, but they have some very simple syntax. 01:04:18.670 --> 01:04:24.130 I can simply use BEGIN TRANSACTION and then follow it up with some statements 01:04:24.130 --> 01:04:26.780 I want to include in that transaction. 01:04:26.780 --> 01:04:30.850 And once I'm sure I want to save the results of these statements, 01:04:30.850 --> 01:04:36.340 I can then say COMMIT, meaning save these changes thus far. 01:04:36.340 --> 01:04:38.620 And it's using this kind of syntax we can 01:04:38.620 --> 01:04:41.830 ensure that nobody sees incorrect balances, 01:04:41.830 --> 01:04:45.590 let's say, for this bank account table here. 01:04:45.590 --> 01:04:49.840 So let's try this out and implement a payment between Alice and Bob, 01:04:49.840 --> 01:04:52.180 but now using a transaction. 01:04:52.180 --> 01:04:54.670 I'll come back to my computer here. 01:04:54.670 --> 01:04:57.400 And let me show you a new database. 01:04:57.400 --> 01:05:00.580 I'll type sqlite3 bank.db. 01:05:00.580 --> 01:05:01.600 I'll hit Enter. 01:05:01.600 --> 01:05:02.980 Oh, sqlite3. 01:05:02.980 --> 01:05:03.520 Sorry. 01:05:03.520 --> 01:05:04.720 bank.db. 01:05:04.720 --> 01:05:09.430 I'll hit Enter, and now let me type .schema to see the schema of this 01:05:09.430 --> 01:05:10.360 database. 01:05:10.360 --> 01:05:12.970 I'll see here a few different columns. 01:05:12.970 --> 01:05:15.970 But I have an "id" column, my primary key, 01:05:15.970 --> 01:05:19.900 a "name" column like we saw in our slides that is text and can't be null, 01:05:19.900 --> 01:05:22.390 and then a "balance" column. 01:05:22.390 --> 01:05:26.740 There's an integer, a whole number of dollars, that also must have a value. 01:05:26.740 --> 01:05:27.850 Can't be null. 01:05:27.850 --> 01:05:30.730 And I have a check constraint here saying 01:05:30.730 --> 01:05:35.590 the balance must be at all times greater than or equal to 0. 01:05:35.590 --> 01:05:40.690 So this is our implementation of a table of bank accounts. 01:05:40.690 --> 01:05:44.650 Let me try querying here to see what our account balances look like. 01:05:44.650 --> 01:05:50.390 I'll say SELECT * FROM "accounts" semicolon. 01:05:50.390 --> 01:05:53.980 And now I'll see that very same arrangement we saw in our slides. 01:05:53.980 --> 01:05:59.450 Alice has $10, Bob has $20, and Charlie has $30. 01:05:59.450 --> 01:06:04.570 So let me try then processing this transaction between Alice and Bob. 01:06:04.570 --> 01:06:08.890 Well at this moment I might try to update Bob's account 01:06:08.890 --> 01:06:10.600 at the first [? bat. ?] I'll say-- 01:06:10.600 --> 01:06:13.510 let's try update and I'll update the accounts 01:06:13.510 --> 01:06:21.730 table, set the "balance" equal to "balance" + 10 where the "id" equals 2. 01:06:21.730 --> 01:06:25.040 Remember, Bob's accounts "id" is 2. 01:06:25.040 --> 01:06:30.350 So now if I hit Enter here, Bob should have gotten just 10 more dollars. 01:06:30.350 --> 01:06:31.610 But there's a problem. 01:06:31.610 --> 01:06:34.790 If I also connect to this database through another connection here, 01:06:34.790 --> 01:06:37.457 sqlite3 bank.DB. 01:06:37.457 --> 01:06:40.040 Maybe I'm somebody who's checking on the balances of the bank. 01:06:40.040 --> 01:06:44.120 I say SELECT * FROM "accounts." 01:06:44.120 --> 01:06:48.360 What do I see but an incorrect total balance? 01:06:48.360 --> 01:06:51.890 We should have $60, but here I see $70. 01:06:51.890 --> 01:06:54.620 So I need to complete this transaction to ensure 01:06:54.620 --> 01:06:58.790 that I actually have the right amount of balances across the board. 01:06:58.790 --> 01:07:03.188 Let me go back here and let me try to run the next part here. 01:07:03.188 --> 01:07:03.980 Let we update this. 01:07:03.980 --> 01:07:10.910 I'll say UPDATE "accounts," and I'll SET "balance" equal to "balance" minus 10. 01:07:10.910 --> 01:07:17.960 Minus 10 where the "id" equals in this case 1 where Alice's bank account ID 01:07:17.960 --> 01:07:18.980 is 1. 01:07:18.980 --> 01:07:23.780 Now if I hit Enter I should be able to go back to my other terminal 01:07:23.780 --> 01:07:25.830 and check on the balances again. 01:07:25.830 --> 01:07:29.700 And now I'll see the right balances all together. 01:07:29.700 --> 01:07:33.380 But there is a way to avoid me looking at this table 01:07:33.380 --> 01:07:38.640 and seeing an incorrect state, and the way to do that is called a transaction. 01:07:38.640 --> 01:07:42.020 So let's go back here, and I will reset my terminals. 01:07:42.020 --> 01:07:45.410 I'll now also give back Alice's money. 01:07:45.410 --> 01:07:47.183 Let me give back Bob the $10. 01:07:47.183 --> 01:07:48.100 I'm going to go back-- 01:07:48.100 --> 01:07:51.860 [INAUDIBLE] say where "id" equals 1 here. 01:07:51.860 --> 01:07:54.140 Now Alice will have 10 more dollars. 01:07:54.140 --> 01:07:57.020 And I'll go back and I'll update accounts. 01:07:57.020 --> 01:08:02.810 I'll set the "balance" equal to "balance" minus 10 01:08:02.810 --> 01:08:05.690 where the "id" equals 2. 01:08:05.690 --> 01:08:09.390 So I'm giving Bob back the $10 that they just gave to Alice. 01:08:09.390 --> 01:08:15.160 Now if I SELECT * from "accounts" we're back to where we began. 01:08:15.160 --> 01:08:19.649 So ideally I should able to complete this transaction between Alice and Bob 01:08:19.649 --> 01:08:23.310 without somebody seeing the intermediate process of Bob 01:08:23.310 --> 01:08:25.470 having more money than Alice. 01:08:25.470 --> 01:08:26.850 So let's try this. 01:08:26.850 --> 01:08:31.649 I'll say BEGIN TRANSACTION; Enter. 01:08:31.649 --> 01:08:36.479 And now I can include what I want to have happen in this transaction. 01:08:36.479 --> 01:08:38.609 First I want to update Bob's account. 01:08:38.609 --> 01:08:45.640 I'll say UPDATE "accounts" and SET "balance" equal to "balance" + 10 01:08:45.640 --> 01:08:49.290 where the "id" in this case equals-- 01:08:49.290 --> 01:08:53.170 let's see, 2, because Bob's account ID is 2. 01:08:53.170 --> 01:08:57.000 Now I'll hit Enter, and now I'm in the middle of this transaction. 01:08:57.000 --> 01:09:00.960 But if I go to my other terminal-- let's say I'm somebody new 01:09:00.960 --> 01:09:02.640 and I check on the balances of the bank. 01:09:02.640 --> 01:09:06.520 I say SELECT * FROM "accounts". 01:09:06.520 --> 01:09:07.960 What do I see? 01:09:07.960 --> 01:09:10.390 I don't see any change from before. 01:09:10.390 --> 01:09:14.920 Alice still has $10, Bob has $20, and Charlie has $30. 01:09:14.920 --> 01:09:17.380 And even though in this terminal I've been 01:09:17.380 --> 01:09:21.430 able to update the account balance, I haven't committed those changes. 01:09:21.430 --> 01:09:24.380 I haven't saved them to the database just yet. 01:09:24.380 --> 01:09:26.770 So let me try now removing money from Alice's account. 01:09:26.770 --> 01:09:33.430 I'll say UPDATE "accounts" and set "balance" equal to balance minus 10 01:09:33.430 --> 01:09:36.700 where the "id" equals 1. 01:09:36.700 --> 01:09:39.550 Now I'll remove $10 from Alice's account. 01:09:39.550 --> 01:09:43.060 Hit Enter, and still if I go back to this other perspective 01:09:43.060 --> 01:09:46.870 here from somebody else checking on the bank balances, I see-- 01:09:46.870 --> 01:09:49.399 well everything is still the same. 01:09:49.399 --> 01:09:51.939 But the moment I then go to this transaction 01:09:51.939 --> 01:09:57.370 and finish it by typing COMMIT, meaning save these changes, I'll commit here. 01:09:57.370 --> 01:10:01.240 I can go back and I'll see all in one moment 01:10:01.240 --> 01:10:06.070 that it appears Alice has paid Bob $10. 01:10:06.070 --> 01:10:10.060 So this is the atomicity part of transactions. 01:10:10.060 --> 01:10:13.480 I don't see the intermediate steps in the transaction. 01:10:13.480 --> 01:10:19.000 I only see that it happened all at once or not at all. 01:10:19.000 --> 01:10:23.300 OK, so this is one example here of a transaction. 01:10:23.300 --> 01:10:27.700 Let me ask what questions we have so far on the transactions we've made 01:10:27.700 --> 01:10:31.060 and how they can be used. 01:10:31.060 --> 01:10:32.950 AUDIENCE: You are searching based on ID-- 01:10:32.950 --> 01:10:35.560 like an ID is equal to a question mark. 01:10:35.560 --> 01:10:39.550 But in a real-world scenario, how they would Search maybe 01:10:39.550 --> 01:10:44.063 like using the Transaction ID or basic account number? 01:10:44.063 --> 01:10:45.230 CARTER ZENKE: Good question. 01:10:45.230 --> 01:10:47.050 So here I'm searching by ID. 01:10:47.050 --> 01:10:51.820 I'm saying that I want to update the balance of account with ID 1 or ID 2, 01:10:51.820 --> 01:10:52.660 for instance. 01:10:52.660 --> 01:10:55.510 And this is similar in spirit to how your bank account likely 01:10:55.510 --> 01:10:58.850 has some unique number, an account number so to speak. 01:10:58.850 --> 01:11:01.360 So here in this table I'm updating balances 01:11:01.360 --> 01:11:03.700 by referring to the account number. 01:11:03.700 --> 01:11:06.670 You could imagine me using names, but there 01:11:06.670 --> 01:11:10.120 might be in some world multiple Alices, multiple Bobs. 01:11:10.120 --> 01:11:15.740 So I should uniquely identify each account using its ID in this case. 01:11:15.740 --> 01:11:20.490 OK, so let's come back and think through more kinds of transactions here. 01:11:20.490 --> 01:11:24.570 And let's think through a transaction we actually don't want to complete. 01:11:24.570 --> 01:11:30.410 So let's go back to our balance sheet where we have Alice at $0 and Bob 01:11:30.410 --> 01:11:35.660 at $30, and let's say Alice wants to pay Bob 10 more dollars. 01:11:35.660 --> 01:11:39.830 Well if I go through this transaction, what's going to happen? 01:11:39.830 --> 01:11:44.390 Alice will have -$10 and Bob will have $40, 01:11:44.390 --> 01:11:47.990 and that violates our check constraint that we saw in our schema. 01:11:47.990 --> 01:11:53.550 The balance of Alice's account must always be at least $0. 01:11:53.550 --> 01:11:56.360 So if I were to complete this transaction, 01:11:56.360 --> 01:11:59.870 that would bring our database to an inconsistent state. 01:11:59.870 --> 01:12:03.410 It violates some constraint that we have. 01:12:03.410 --> 01:12:06.800 But what I can do is undo these changes. 01:12:06.800 --> 01:12:10.730 Revert them so that we go back to the original state 01:12:10.730 --> 01:12:14.540 if we ever violate some constraint like we just did. 01:12:14.540 --> 01:12:17.880 There's a technical name for this, which is rolling back. 01:12:17.880 --> 01:12:21.980 So if I ever want to not save the changes in this transaction, 01:12:21.980 --> 01:12:23.840 I can say ROLLBACK. 01:12:23.840 --> 01:12:27.620 Go back to the beginning, think about it as if this transaction never even 01:12:27.620 --> 01:12:29.470 happened at all. 01:12:29.470 --> 01:12:30.380 So let's try it. 01:12:30.380 --> 01:12:33.870 So I'll come back to my computer, and what I'll do 01:12:33.870 --> 01:12:38.040 is try to implement this idea first without a transaction. 01:12:38.040 --> 01:12:41.400 I'll go back to, let's say this first terminal here, 01:12:41.400 --> 01:12:45.400 and I'll SELECT * FROM "accounts" like this. 01:12:45.400 --> 01:12:50.060 Notice how Alice has $0 and Bob has $30. 01:12:50.060 --> 01:12:53.010 Well I want to pay Bob 10 more dollars from Alice, 01:12:53.010 --> 01:12:54.810 so I'll update Bob's account. 01:12:54.810 --> 01:12:59.430 I'll find here my prior statement, UPDATE "accounts" 01:12:59.430 --> 01:13:03.900 SET "balance" equal to "balance" + 10 where the "id" equals 2. 01:13:03.900 --> 01:13:07.260 And recall that Bob's account ID is 2 01:13:07.260 --> 01:13:10.085 I'll hit Enter, and now if I SELECT * FROM "accounts" 01:13:10.085 --> 01:13:11.460 if we're looking somewhere else-- 01:13:11.460 --> 01:13:17.850 SELECT * FROM "accounts," I'll see Bob has 10 more dollars. 01:13:17.850 --> 01:13:21.300 But now I need to subtract money from Alice's account. 01:13:21.300 --> 01:13:27.750 So I'll use the same thing but now I'll say minus 10 where the "id" equals 1. 01:13:27.750 --> 01:13:30.120 Alice's account ID is 1. 01:13:30.120 --> 01:13:35.820 Now I'll try to subtract $10 from Alice, but I get a constraint failure here. 01:13:35.820 --> 01:13:39.120 My check constraint failed on the balance column. 01:13:39.120 --> 01:13:40.290 Why did it do that? 01:13:40.290 --> 01:13:47.460 Because I now have less than $0 in that balance column. 01:13:47.460 --> 01:13:52.260 So I need to undo this but I don't have a transaction here. 01:13:52.260 --> 01:13:57.550 What could I do instead to undo what I've just done? 01:13:57.550 --> 01:13:59.025 AUDIENCE: So we can roll back. 01:13:59.025 --> 01:14:01.740 CARTER ZENKE: We could roll back, but here's the thing. 01:14:01.740 --> 01:14:05.790 A rollback only works while we're inside of a transaction. 01:14:05.790 --> 01:14:09.700 And a transaction begins when we say begin transaction. 01:14:09.700 --> 01:14:12.930 So here I actually didn't say begin transaction. 01:14:12.930 --> 01:14:15.960 What I have to do instead is run another update statement. 01:14:15.960 --> 01:14:19.290 I have to update, let's see, Bob's account 01:14:19.290 --> 01:14:22.950 to remove the $10 that Alice paid Bob. 01:14:22.950 --> 01:14:23.710 So let's try that. 01:14:23.710 --> 01:14:27.810 I'll come back to my computer and I will then revert back. 01:14:27.810 --> 01:14:29.380 I'll say let's undo this. 01:14:29.380 --> 01:14:34.420 Let's subtract $10 now from Bob's account where the ID is 2. 01:14:34.420 --> 01:14:36.510 Now we're back to where we began. 01:14:36.510 --> 01:14:38.970 If I check in from some other [INAUDIBLE] here, 01:14:38.970 --> 01:14:44.310 I'll see we're back to Alice having $0 and Bob having $30. 01:14:44.310 --> 01:14:49.470 But we ideally want to use rollback to undo some changes that we 01:14:49.470 --> 01:14:51.400 have made in a transaction. 01:14:51.400 --> 01:14:54.000 So for that let me now use a transaction. 01:14:54.000 --> 01:14:58.980 I'll say BEGIN TRANSACTION like this, hit Enter. 01:14:58.980 --> 01:15:01.470 Now I want to update Bob's balance. 01:15:01.470 --> 01:15:05.070 I'll say UPDATE "accounts SET "balance" equal to "balance" + 01:15:05.070 --> 01:15:09.360 10 where the "id" = 2. 01:15:09.360 --> 01:15:11.860 Bob's account ID is 2. 01:15:11.860 --> 01:15:14.760 Now let me subtract $10 from Alice like this. 01:15:14.760 --> 01:15:16.290 - 10. 01:15:16.290 --> 01:15:19.200 I'll go back here and say where the "id" equals 1 01:15:19.200 --> 01:15:21.660 for Alice's account, hit Enter. 01:15:21.660 --> 01:15:24.360 But now I get that same constraint failure. 01:15:24.360 --> 01:15:28.600 Alice has less than $0, which can't be possible. 01:15:28.600 --> 01:15:29.880 So what do we do? 01:15:29.880 --> 01:15:33.000 Instead of having another update, we could just roll back. 01:15:33.000 --> 01:15:36.030 We could undo all of these previous statements here. 01:15:36.030 --> 01:15:40.500 ROLLBACK like this, hit Enter, and now I should see from another perspective 01:15:40.500 --> 01:15:46.380 here my table exactly as it was before with no intermediate step of Bob 01:15:46.380 --> 01:15:50.140 having money but Alice having negative $10. 01:15:50.140 --> 01:15:54.240 So let me ask m we've seen COMMIT and ROLLBACK. 01:15:54.240 --> 01:15:57.440 What questions do we have on transactions now? 01:16:03.530 --> 01:16:04.030 OK. 01:16:04.030 --> 01:16:07.390 Seeing none so far, so let's keep going. 01:16:07.390 --> 01:16:11.290 Transactions are helpful for ensuring that our data is consistent 01:16:11.290 --> 01:16:13.570 like we just saw, and that they're also-- 01:16:13.570 --> 01:16:15.130 we have atomic transactions. 01:16:15.130 --> 01:16:18.620 We can't break things down into smaller pieces. 01:16:18.620 --> 01:16:21.490 But there is a case that transactions can help guard 01:16:21.490 --> 01:16:24.880 against some adversarial kind of attack. 01:16:24.880 --> 01:16:27.460 We're into what we call race condition. 01:16:27.460 --> 01:16:32.140 A race condition is when we have multiple people, multiple processes 01:16:32.140 --> 01:16:37.900 trying to access some value and also making a decision based on that value. 01:16:37.900 --> 01:16:43.300 If unaddressed it can lead to a scenario that's inconsistent in our database. 01:16:43.300 --> 01:16:46.090 And actually hackers can exploit race conditions 01:16:46.090 --> 01:16:51.020 to bring our database to an inconsistent state in a way that benefits them. 01:16:51.020 --> 01:16:53.800 So let's say our friends Charlie and Alice here, they 01:16:53.800 --> 01:16:55.510 decide to rob our bank. 01:16:55.510 --> 01:16:59.510 They decide to take money from it by exploiting race conditions. 01:16:59.510 --> 01:17:04.000 And here are the logs of what Charlie and Alice actually did. 01:17:04.000 --> 01:17:10.090 We see here that Charlie and Alice executed three concurrent transactions 01:17:10.090 --> 01:17:11.260 with our bank. 01:17:11.260 --> 01:17:13.960 On the one hand Charlie opened up his bank account 01:17:13.960 --> 01:17:16.000 on two different computers. 01:17:16.000 --> 01:17:20.470 He logged in and he said let me transfer $30 to Alice. 01:17:20.470 --> 01:17:25.240 He at the same time clicked transfer on both computers knowing just well 01:17:25.240 --> 01:17:30.820 he had only $30 to give to Alice, but he was going to send her $60 in this case. 01:17:30.820 --> 01:17:34.120 He was hoping that we weren't guarding against race conditions. 01:17:34.120 --> 01:17:38.660 That he could send $30 on both accounts to Alice. 01:17:38.660 --> 01:17:41.260 Now at the same time Alice is at the ATM, 01:17:41.260 --> 01:17:45.010 and they're hoping that at the same time Charlie presses transfer 01:17:45.010 --> 01:17:50.890 Alice hits confirm on their transaction to withdraw $60 from the ATM. 01:17:50.890 --> 01:17:53.410 And when all of these happen at once, we might 01:17:53.410 --> 01:17:56.810 run into some inconsistent states in our database. 01:17:56.810 --> 01:17:59.710 Let's say first we run this transaction here. 01:17:59.710 --> 01:18:00.895 Let's focus on this one. 01:18:00.895 --> 01:18:02.770 The first step is to check Charlie's balance. 01:18:02.770 --> 01:18:05.500 Does Charlie have $30? 01:18:05.500 --> 01:18:06.970 It seems like Charlie does. 01:18:06.970 --> 01:18:11.410 We continue we might say, OK, let me add $30 to Alice's account like this. 01:18:11.410 --> 01:18:15.280 Then let me subtract $30 from Charlie's account down below here. 01:18:15.280 --> 01:18:18.130 And finally, that's the end of our transaction. 01:18:18.130 --> 01:18:21.610 We checked Charlie's balance, we added $30 to Alice, 01:18:21.610 --> 01:18:24.790 and subtracted $30 from Charlie. 01:18:24.790 --> 01:18:31.180 But now at the same time, we're also processing the other transfer of $30 01:18:31.180 --> 01:18:32.050 to Alice. 01:18:32.050 --> 01:18:35.590 So here we check in terms of our time, top to bottom here-- 01:18:35.590 --> 01:18:37.810 we check Charlie's balance at this point. 01:18:37.810 --> 01:18:39.820 It's still $30. 01:18:39.820 --> 01:18:43.060 Knowing this, this process goes ahead and says 01:18:43.060 --> 01:18:45.730 let's add $30 to Alice's account. 01:18:45.730 --> 01:18:49.930 And now only later let's subtract $30 like this. 01:18:49.930 --> 01:18:54.740 Now Charlie well appears to have no money in his account. 01:18:54.740 --> 01:19:01.330 But at this time above we've already sent $30 to Alice's account two times. 01:19:01.330 --> 01:19:03.610 And if Alice is waiting at the ATM over here 01:19:03.610 --> 01:19:06.400 and they press Confirm on that withdrawal, what might happen 01:19:06.400 --> 01:19:07.360 is something like this. 01:19:07.360 --> 01:19:10.840 At that moment that Alice has $60 in their account-- 01:19:10.840 --> 01:19:14.440 they check and they see it-- they could then withdraw that $60. 01:19:14.440 --> 01:19:17.390 And now that money is gone from our bank, 01:19:17.390 --> 01:19:19.840 even if we realize Charlie shouldn't have been 01:19:19.840 --> 01:19:22.980 able to make that transaction happen. 01:19:22.980 --> 01:19:26.510 So what logically is a solution here? 01:19:26.510 --> 01:19:31.170 It seems these transactions happened all at the same time. 01:19:31.170 --> 01:19:35.060 But what would you do to reorder these statements to make it clear 01:19:35.060 --> 01:19:39.690 that we can only do one at a time? 01:19:39.690 --> 01:19:45.260 What would you do in this case to update this table to fix this problem? 01:19:45.260 --> 01:19:52.710 AUDIENCE: I would do the withdrawal and the debits as one transaction. 01:19:52.710 --> 01:19:55.580 So the subtraction is done and the addition 01:19:55.580 --> 01:20:01.658 is done as one unit of work in order for the double transfer not to happen. 01:20:01.658 --> 01:20:02.450 CARTER ZENKE: Yeah. 01:20:02.450 --> 01:20:06.020 So ideally we should treat each of these as a separate transaction. 01:20:06.020 --> 01:20:10.280 Literally a transaction in the database sense where if I go back to this table 01:20:10.280 --> 01:20:15.000 if Charlie transfers $30 to Alice at roughly the same time, 01:20:15.000 --> 01:20:17.210 I should handle these sequentially. 01:20:17.210 --> 01:20:20.990 I should first do this one, then I should do this one. 01:20:20.990 --> 01:20:25.760 And only after that should Alice be able to withdraw their $60, or at least 01:20:25.760 --> 01:20:27.450 try to in this case. 01:20:27.450 --> 01:20:30.680 So we need to make these transactions sequential, and the word for this 01:20:30.680 --> 01:20:32.540 is isolating our transactions. 01:20:32.540 --> 01:20:34.580 Here they are not isolated. 01:20:34.580 --> 01:20:36.390 They're working at the same time. 01:20:36.390 --> 01:20:38.630 But in reality I want them to be isolated-- 01:20:38.630 --> 01:20:41.390 happen one after another after another. 01:20:41.390 --> 01:20:44.870 So ideally our table looks more like this. 01:20:44.870 --> 01:20:48.890 Charlie hits transfer at the exact same time on two laptops, 01:20:48.890 --> 01:20:54.170 but because we're working in transactions we only process one first. 01:20:54.170 --> 01:20:55.610 We start with this one here. 01:20:55.610 --> 01:21:00.860 We check the balance, we say let's add $30 to Alice's account and subtract $30 01:21:00.860 --> 01:21:01.940 from Charlie's. 01:21:01.940 --> 01:21:05.930 Once we're sure that is OK, we'll commit the transaction 01:21:05.930 --> 01:21:08.120 and that is a done deal. 01:21:08.120 --> 01:21:09.770 Now though, we'll go down below. 01:21:09.770 --> 01:21:11.630 We'll say begin transaction. 01:21:11.630 --> 01:21:15.140 We'll then say Charlie's balance is $0 because we're doing this 01:21:15.140 --> 01:21:16.940 after we did our first transaction. 01:21:16.940 --> 01:21:19.370 Charlie has no more money to give Alice. 01:21:19.370 --> 01:21:23.480 At that point what we'll do is try to add $30 to Alice 01:21:23.480 --> 01:21:27.240 but subtract $30 from Charlie, and we'll realize we made a mistake. 01:21:27.240 --> 01:21:31.050 We should roll back and undo what we just did here. 01:21:31.050 --> 01:21:35.060 So these now are separate transactions happening sequentially, 01:21:35.060 --> 01:21:38.720 and we avoid our error of giving Alice $60 when 01:21:38.720 --> 01:21:41.180 she shouldn't have that money at all. 01:21:41.180 --> 01:21:46.460 And then only after this is done do we let Alice try to withdraw $60. 01:21:46.460 --> 01:21:47.780 We'll go over here and see. 01:21:47.780 --> 01:21:51.900 Alice tries to have a balance of $60, withdraw $60, and commit it. 01:21:51.900 --> 01:21:56.570 But at this point she wouldn't actually have that money there. 01:21:56.570 --> 01:22:01.810 So let's see now what questions do we have on the way transactions 01:22:01.810 --> 01:22:04.860 are sequential in this case? 01:22:04.860 --> 01:22:08.410 AUDIENCE: In particularly I want to ask that there are millions 01:22:08.410 --> 01:22:10.480 of transactions that are going on. 01:22:10.480 --> 01:22:14.320 Particularly, also, is it like there that sequentially these transactions 01:22:14.320 --> 01:22:16.750 are being handled for the financial transaction over here 01:22:16.750 --> 01:22:20.650 that because there are a number of banks in a single country 01:22:20.650 --> 01:22:22.150 or international transactions. 01:22:22.150 --> 01:22:25.210 Are they handle sequentially or there is some parallel earning mechanism 01:22:25.210 --> 01:22:27.583 also exist if you may please clarify. 01:22:27.583 --> 01:22:29.000 CARTER ZENKE: Yeah, good question. 01:22:29.000 --> 01:22:31.630 So here visually we see these transactions are actually 01:22:31.630 --> 01:22:34.000 sequential in the operation here. 01:22:34.000 --> 01:22:37.240 And then your question is, are they literally sequential, like one 01:22:37.240 --> 01:22:38.470 after another after another? 01:22:38.470 --> 01:22:41.855 In reality they are very much just sequential like this. 01:22:41.855 --> 01:22:43.730 A database can get creative with use of locks 01:22:43.730 --> 01:22:46.063 as we'll see you in just a minute, but by and large they 01:22:46.063 --> 01:22:48.130 are sequential one after the other. 01:22:48.130 --> 01:22:51.370 And it's the job of a database module to ensure 01:22:51.370 --> 01:22:56.300 that these transactions are executed sequentially in an isolated way. 01:22:56.300 --> 01:22:58.360 Good question. 01:22:58.360 --> 01:22:58.900 OK. 01:22:58.900 --> 01:23:04.340 So let's see now how, if a database is able to make these sequential, 01:23:04.340 --> 01:23:06.610 it must be using something underneath the hood 01:23:06.610 --> 01:23:10.180 to ensure that these transactions don't interfere. 01:23:10.180 --> 01:23:15.610 And in SQLite and other DBMSs we use this idea of a lock on the database 01:23:15.610 --> 01:23:20.680 to ensure people can't access data when they really shouldn't be able to. 01:23:20.680 --> 01:23:23.200 So here let's look at an idea of what SQLite 01:23:23.200 --> 01:23:26.920 has at different kinds of locks on this database. 01:23:26.920 --> 01:23:28.490 We have a few different kinds-- 01:23:28.490 --> 01:23:32.290 unlocked, shared, and exclusive, among some others here. 01:23:32.290 --> 01:23:35.913 Each one has a different meaning for different processes. 01:23:35.913 --> 01:23:38.080 So let's take a look here at our bank account table. 01:23:38.080 --> 01:23:42.280 Let's say two computers want to access this account's table. 01:23:42.280 --> 01:23:47.560 Well at this time because nobody is accessing our database, it is unlocked. 01:23:47.560 --> 01:23:50.650 Anyone could read from the database, sort of see what's inside, 01:23:50.650 --> 01:23:54.490 or write to it-- add some data or update some data. 01:23:54.490 --> 01:23:59.650 But let's say we have one computer here who wants to read from this database. 01:23:59.650 --> 01:24:02.560 They want to see what data is inside. 01:24:02.560 --> 01:24:05.650 Well that database in SQLite would acquire 01:24:05.650 --> 01:24:08.650 what's called a shared lock like this. 01:24:08.650 --> 01:24:12.130 They're able to read the database but also 01:24:12.130 --> 01:24:14.500 it's to allow others to read as well. 01:24:14.500 --> 01:24:17.260 A shared lock means I'm only reading. 01:24:17.260 --> 01:24:20.180 You too could read as well. 01:24:20.180 --> 01:24:22.180 So let's say this computer comes in. 01:24:22.180 --> 01:24:24.760 They decide to read from this table. 01:24:24.760 --> 01:24:28.090 They could acquire the same shared lock. 01:24:28.090 --> 01:24:31.240 And because we're only reading, not updating, 01:24:31.240 --> 01:24:35.120 multiple transactions can read all at once. 01:24:35.120 --> 01:24:37.820 We don't have to make these sequential. 01:24:37.820 --> 01:24:42.170 But if we have a transaction that's hoping to write to the database, 01:24:42.170 --> 01:24:43.730 to update some value-- 01:24:43.730 --> 01:24:46.370 well that's when things get a little weird. 01:24:46.370 --> 01:24:50.760 I can't let somebody else read my data while I'm updating it. 01:24:50.760 --> 01:24:53.690 So what I need is for this process here, if it's 01:24:53.690 --> 01:24:57.290 writing, to acquire an exclusive lock. 01:24:57.290 --> 01:25:01.100 Meaning this is the only computer, only transaction that 01:25:01.100 --> 01:25:04.310 can read and write to this database. 01:25:04.310 --> 01:25:09.540 All others must hold off in order to be sequential. 01:25:09.540 --> 01:25:13.320 So we've seen transactions then are sequential 01:25:13.320 --> 01:25:17.070 and they use this idea of a lock to ensure other folks 01:25:17.070 --> 01:25:22.210 can't read when they're doing something sensitive like writing to the database. 01:25:22.210 --> 01:25:28.210 Let me ask now what questions we have on locks and how they work for us here? 01:25:28.210 --> 01:25:30.680 AUDIENCE: OK, my question is that suppose 01:25:30.680 --> 01:25:33.080 you're one computer [INAUDIBLE]. 01:25:33.080 --> 01:25:38.030 For what amount this would allow to do the exclusive lock? 01:25:38.030 --> 01:25:43.610 Maybe if the timestamp is larger, then some other computers 01:25:43.610 --> 01:25:46.740 will get timeout exception and other things, right? 01:25:46.740 --> 01:25:48.800 Because it's not allowed to do-- 01:25:48.800 --> 01:25:50.330 exclusively it's a lock. 01:25:50.330 --> 01:25:52.370 How are we going to decide those things? 01:25:52.370 --> 01:25:55.010 At what time one instance is going to be-- 01:25:55.010 --> 01:25:57.440 put the exclusive lock on the resource? 01:25:57.440 --> 01:25:59.600 CARTER ZENKE: So the question is, at what point 01:25:59.600 --> 01:26:02.690 does the transaction actually get an exclusive lock, 01:26:02.690 --> 01:26:07.050 and how do we decide how to prioritize different transactions here? 01:26:07.050 --> 01:26:10.310 Well there are a few algorithms you could use, one of which 01:26:10.310 --> 01:26:14.090 is simply taking the transaction that just came first. 01:26:14.090 --> 01:26:17.630 If at this point I receive a request for an exclusive lock, 01:26:17.630 --> 01:26:22.040 I might just wait for others to finish reading and allow that one to proceed. 01:26:22.040 --> 01:26:26.490 I could also use a process called time-stamping, which is similar to this 01:26:26.490 --> 01:26:30.230 where I look at the timestamp, the time my database tried to write to the log 01:26:30.230 --> 01:26:33.560 and use the one that is earliest in that case. 01:26:33.560 --> 01:26:38.210 Now it's important to note that an exclusive lock does block others 01:26:38.210 --> 01:26:40.380 from running their own transactions. 01:26:40.380 --> 01:26:44.480 If I have an exclusive lock, no one else can run their transaction. 01:26:44.480 --> 01:26:48.290 That is kind of a necessary downside in this case 01:26:48.290 --> 01:26:53.920 to ensure that I don't make mistakes in updating my table. 01:26:53.920 --> 01:26:55.480 Let's take one more here. 01:26:55.480 --> 01:26:57.850 AUDIENCE: What's the granularity on locking? 01:26:57.850 --> 01:27:02.050 I mean, you seem to say you locked the database, but that's craziness. 01:27:02.050 --> 01:27:05.080 Do you lock a table or do you lock a record on a table? 01:27:05.080 --> 01:27:07.607 CARTER ZENKE: Yeah, it might depend on the DBMS itself. 01:27:07.607 --> 01:27:13.540 So at the very coarsest, a lock might just lock the entire database. 01:27:13.540 --> 01:27:15.517 I can't actually see anything inside of it. 01:27:15.517 --> 01:27:18.100 And I'll be able to demonstrate that for you in a minute here. 01:27:18.100 --> 01:27:19.420 Let me go back to my computer. 01:27:19.420 --> 01:27:22.900 Let me try to deliberately lock this database. 01:27:22.900 --> 01:27:26.260 If I go to my terminal, I can begin a transaction that 01:27:26.260 --> 01:27:28.870 automatically has an exclusive lock. 01:27:28.870 --> 01:27:32.980 In SQLite the statement is BEGIN EXCLUSIVE. 01:27:32.980 --> 01:27:34.930 BEGIN EXCLUSIVE TRANSACTION. 01:27:34.930 --> 01:27:39.280 I'll hit Enter like this, and now I'm in the middle of this transaction. 01:27:39.280 --> 01:27:39.910 It's exclusive. 01:27:39.910 --> 01:27:43.060 Has an exclusive lock nobody else can read. 01:27:43.060 --> 01:27:47.230 If I go back to my other terminal here, connect to my database, 01:27:47.230 --> 01:27:51.880 I could try SELECT * FROM "accounts" just to read from this table. 01:27:51.880 --> 01:27:54.280 And now I see Runtime error-- 01:27:54.280 --> 01:27:55.900 database is locked. 01:27:55.900 --> 01:28:00.280 The entire database cannot be read from at the very coarsest level 01:28:00.280 --> 01:28:02.170 of an exclusive lock. 01:28:02.170 --> 01:28:04.600 Now because these locks are so coarse-- 01:28:04.600 --> 01:28:09.310 I have to lock the entire database, SQLite has its own module, 01:28:09.310 --> 01:28:14.050 a way of prioritizing transactions and ensuring that a transaction only 01:28:14.050 --> 01:28:17.590 has an exclusive lock for the duration it absolutely needs to. 01:28:17.590 --> 01:28:19.802 Because otherwise you would slow down substantially 01:28:19.802 --> 01:28:21.760 if I had to wait and wait and wait for somebody 01:28:21.760 --> 01:28:25.400 to finish with their transaction. 01:28:25.400 --> 01:28:26.770 OK. 01:28:26.770 --> 01:28:31.900 So this then is how we can handle not just one query at a time but multiple. 01:28:31.900 --> 01:28:35.860 And today we've seen how to optimize our databases from making sure queries 01:28:35.860 --> 01:28:39.670 are faster to also making sure our databases use as little space as 01:28:39.670 --> 01:28:40.420 possible. 01:28:40.420 --> 01:28:43.270 We'll see next time how to actually scale up 01:28:43.270 --> 01:28:45.460 our databases using other DBMSs-- 01:28:45.460 --> 01:28:48.280 like MySQL-- besides SQLite. 01:28:48.280 --> 01:28:51.120 We'll see you next time.