1 00:00:00,000 --> 00:00:03,486 [MUSIC PLAYING] 2 00:00:03,486 --> 00:00:16,434 3 00:00:16,434 --> 00:00:19,090 CARTER ZENKE: Well hello, one and all, and welcome 4 00:00:19,090 --> 00:00:22,510 back to CS50 Introduction to Databases with SQL. 5 00:00:22,510 --> 00:00:25,090 My name is Carter Zenke, and today we'll focus 6 00:00:25,090 --> 00:00:29,050 on optimizing-- trying to reduce the time our queries take 7 00:00:29,050 --> 00:00:32,500 as well as reducing the storage space our database might use. 8 00:00:32,500 --> 00:00:34,960 We'll also see how to handle concurrency. 9 00:00:34,960 --> 00:00:38,300 That is, many queries all at once. 10 00:00:38,300 --> 00:00:41,560 Now we'll do all of this in the context of a new database, 11 00:00:41,560 --> 00:00:45,940 this one-- the Internet Movies Database, or IMDb for short. 12 00:00:45,940 --> 00:00:51,680 If you've been to imdb.com you may have seen this whole big database of movies, 13 00:00:51,680 --> 00:00:56,360 and we've compiled that into our very own SQLite database for us today. 14 00:00:56,360 --> 00:00:58,830 Now this database is big. 15 00:00:58,830 --> 00:01:01,240 It is much bigger than anything we've seen so far. 16 00:01:01,240 --> 00:01:07,090 In fact, it has over 200,000 movie ratings, over 400,000 movies, 17 00:01:07,090 --> 00:01:12,850 and 1.4 million stars or people who have acted in some movie. 18 00:01:12,850 --> 00:01:17,360 And the goal of this database is to represent movies, the people in them, 19 00:01:17,360 --> 00:01:21,280 and the average ratings for each of those movies. 20 00:01:21,280 --> 00:01:24,250 Now we have ourselves an entity relationship diagram 21 00:01:24,250 --> 00:01:27,470 to describe this database, and it looks a bit like this. 22 00:01:27,470 --> 00:01:31,690 We have people in one table, movies in another, 23 00:01:31,690 --> 00:01:34,870 and ratings in a separate table down here. 24 00:01:34,870 --> 00:01:37,870 Notice that people have a name and a birthday 25 00:01:37,870 --> 00:01:41,320 along with an ID, the primary key for that table. 26 00:01:41,320 --> 00:01:45,020 Movies too have a title and the year they were released 27 00:01:45,020 --> 00:01:48,130 as well as their own primary key ID here. 28 00:01:48,130 --> 00:01:51,970 Rating down here has the average rating for a given movie 29 00:01:51,970 --> 00:01:55,540 as well as the number of votes that went into calculating 30 00:01:55,540 --> 00:01:57,220 that average rating here. 31 00:01:57,220 --> 00:02:02,950 And notice too that the movie and people table has a many-to-many relationship. 32 00:02:02,950 --> 00:02:06,790 That is, a person can star in more than one movie 33 00:02:06,790 --> 00:02:11,450 in the same way a movie can have multiple stars. 34 00:02:11,450 --> 00:02:14,560 So if we want to implement this many-to-many relationship, 35 00:02:14,560 --> 00:02:17,230 we've seen one way to do that. 36 00:02:17,230 --> 00:02:19,720 Let me actually ask the audience here, how 37 00:02:19,720 --> 00:02:23,470 have we implemented a many-to-many relationship historically? 38 00:02:23,470 --> 00:02:27,170 What kind of tables do we need for this, do you think? 39 00:02:27,170 --> 00:02:27,670 All right. 40 00:02:27,670 --> 00:02:30,320 So not to worry if you don't have an idea of what we're going to use here. 41 00:02:30,320 --> 00:02:32,528 Let's take a peek at what tables we can actually use. 42 00:02:32,528 --> 00:02:37,030 So here as we've seen historically we might have two tables, one for people 43 00:02:37,030 --> 00:02:38,770 and one for movies. 44 00:02:38,770 --> 00:02:41,800 And notice here in the people table we have names 45 00:02:41,800 --> 00:02:45,130 along with a primary key called ID. 46 00:02:45,130 --> 00:02:49,300 And similarly in our movies table, we also have a title 47 00:02:49,300 --> 00:02:55,180 column along with a movies own primary key, an ID column over there. 48 00:02:55,180 --> 00:02:59,380 Now to implement the relationship between people and movies, 49 00:02:59,380 --> 00:03:02,810 we have to have some table in the middle of these two. 50 00:03:02,810 --> 00:03:08,050 One that relates people IDs or person IDs with movie IDs. 51 00:03:08,050 --> 00:03:09,940 And if we look at this table, I could see 52 00:03:09,940 --> 00:03:14,020 that if I see person ID next to some given movie ID, well 53 00:03:14,020 --> 00:03:17,260 it means that the person with this ID has starred in, 54 00:03:17,260 --> 00:03:19,660 played a role in, this movie here. 55 00:03:19,660 --> 00:03:21,610 We see person 158-- 56 00:03:21,610 --> 00:03:27,160 let's see, Tom Hanks-- was in the movie ID with 114709, or Toy Story 57 00:03:27,160 --> 00:03:29,500 as we see in the movies table. 58 00:03:29,500 --> 00:03:33,160 And now knowing this, how can we figure out which 59 00:03:33,160 --> 00:03:36,160 movies Kristen Bell has starred in? 60 00:03:36,160 --> 00:03:38,770 If we know of Kristen Bell's ID, what should we 61 00:03:38,770 --> 00:03:42,841 then do to find the movies Kristen Bell has starred in? 62 00:03:42,841 --> 00:03:48,490 AUDIENCE: You could select the ID of the people then look into the stars 63 00:03:48,490 --> 00:03:51,462 table, then look at [INAUDIBLE] the movies table. 64 00:03:51,462 --> 00:03:53,170 CARTER ZENKE: Yeah, I like your thinking. 65 00:03:53,170 --> 00:03:54,610 So there's multiple steps here. 66 00:03:54,610 --> 00:03:58,780 We have to go through each table to find ultimately the movie that Kristen 67 00:03:58,780 --> 00:03:59,950 Bell has starred in. 68 00:03:59,950 --> 00:04:03,610 The first step might be to ask, what is Kristen Bell's ID? 69 00:04:03,610 --> 00:04:06,310 Well it seems to be 68338. 70 00:04:06,310 --> 00:04:09,040 Then, to your point, we could look at the stars table 71 00:04:09,040 --> 00:04:12,710 and find out what movie IDs correspond to that ID here. 72 00:04:12,710 --> 00:04:17,779 So it seems like 2294629 corresponds with Kristen Bell's ID here. 73 00:04:17,779 --> 00:04:20,200 Then finally we look up in the movies table, well, 74 00:04:20,200 --> 00:04:23,770 what title corresponds with 2294629? 75 00:04:23,770 --> 00:04:25,660 It seems to be the movie Frozen. 76 00:04:25,660 --> 00:04:30,350 So visually, we have this relationship among these tables here. 77 00:04:30,350 --> 00:04:33,610 So let's now dive into our movies database 78 00:04:33,610 --> 00:04:36,100 to actually get a sense of what data is inside of this. 79 00:04:36,100 --> 00:04:39,100 And for that I'll go back to my computer so we can open this database up 80 00:04:39,100 --> 00:04:40,300 in SQLite. 81 00:04:40,300 --> 00:04:44,410 I'll come back over here and I'll open up my own environment 82 00:04:44,410 --> 00:04:48,250 where I can type sqlite3 movies.db. 83 00:04:48,250 --> 00:04:49,780 movies.db. 84 00:04:49,780 --> 00:04:56,980 And now if I zoom in just a bit, I can then type .schema to show you all 85 00:04:56,980 --> 00:04:59,530 the tables that exist in this database. 86 00:04:59,530 --> 00:05:06,070 I see up top I do have a movies table that has a primary key called ID, 87 00:05:06,070 --> 00:05:08,980 a column called title and a column called year. 88 00:05:08,980 --> 00:05:12,040 I have that same people table with names and birthdays, 89 00:05:12,040 --> 00:05:17,530 and I also have this stars table that corresponds with person IDs and movie 90 00:05:17,530 --> 00:05:21,780 IDs, relating in this case people and movies. 91 00:05:21,780 --> 00:05:25,600 OK, so let's try selecting then from our movies table 92 00:05:25,600 --> 00:05:28,930 to see what movies we have, take a peek at our data. 93 00:05:28,930 --> 00:05:30,880 Well, we saw a little bit ago, we could try 94 00:05:30,880 --> 00:05:33,640 to take a peek at our data using a certain SQL keyword. 95 00:05:33,640 --> 00:05:38,800 I could SELECT * from "movies" to see all columns from movies, 96 00:05:38,800 --> 00:05:42,790 but if I have let's say 400,000 rows I don't 97 00:05:42,790 --> 00:05:44,540 want to print all those to the screen. 98 00:05:44,540 --> 00:05:47,480 So what could I use instead? 99 00:05:47,480 --> 00:05:49,598 A certain SQL keyword. 100 00:05:49,598 --> 00:05:51,140 I'm hearing LIMIT, so let's try that. 101 00:05:51,140 --> 00:05:53,000 I'll say LIMIT 5. 102 00:05:53,000 --> 00:05:55,100 I only want the first five results here. 103 00:05:55,100 --> 00:06:00,820 I'll hit Enter, and here I'll see those first five movies in my movies table. 104 00:06:00,820 --> 00:06:03,850 They each have their own unique ID, a primary key. 105 00:06:03,850 --> 00:06:07,360 They have a title and the year that they were published. 106 00:06:07,360 --> 00:06:12,070 But if you've been to imdb.com, you may have wanted 107 00:06:12,070 --> 00:06:13,600 to search for a particular movie. 108 00:06:13,600 --> 00:06:17,590 Maybe you wanted to find your favorite movie among all the possible movies 109 00:06:17,590 --> 00:06:18,520 in their database. 110 00:06:18,520 --> 00:06:22,480 And we can simulate that kind of query by querying the database here. 111 00:06:22,480 --> 00:06:25,930 Whenever you type in IMDd's browser-- that's like the search for a movie 112 00:06:25,930 --> 00:06:28,930 called Frozen, for example-- 113 00:06:28,930 --> 00:06:31,180 odds are that behind the scenes a database is actually 114 00:06:31,180 --> 00:06:33,490 running that same query to return to you what 115 00:06:33,490 --> 00:06:36,220 movies have that title called Frozen. 116 00:06:36,220 --> 00:06:38,470 Now for me my favorite movie since I was a kid 117 00:06:38,470 --> 00:06:41,650 is one called Cars, so let me try to search for Cars here. 118 00:06:41,650 --> 00:06:48,700 I'll try to find SELECT * from "movies" where the "title" = 'Cars'. 119 00:06:48,700 --> 00:06:52,840 And you could imagine me going to IMDb, typing in Cars in the search bar, 120 00:06:52,840 --> 00:06:55,720 and behind the scenes this SQL query gets 121 00:06:55,720 --> 00:06:59,920 run to find what movies have the title Cars. 122 00:06:59,920 --> 00:07:05,320 So let me hit Enter here, and now I'll see that one movie has that title Cars. 123 00:07:05,320 --> 00:07:10,540 Came out in 2006 and the ID is 317219. 124 00:07:10,540 --> 00:07:13,720 But this is a lecture on optimizing, so it's worth 125 00:07:13,720 --> 00:07:18,370 asking how much time did it take for this query to run? 126 00:07:18,370 --> 00:07:19,600 Let's find out. 127 00:07:19,600 --> 00:07:23,380 Well I can use a SQLite command here called timer. 128 00:07:23,380 --> 00:07:29,080 I'll say .timer on to make sure that I time my future queries. 129 00:07:29,080 --> 00:07:30,670 I'll hit Enter now. 130 00:07:30,670 --> 00:07:33,820 And now if I try selecting again, I'll hit the up arrow twice 131 00:07:33,820 --> 00:07:35,260 to get back my old query. 132 00:07:35,260 --> 00:07:40,030 I can type Enter and I'll see not just the results 133 00:07:40,030 --> 00:07:44,380 but the time it took to run that query. 134 00:07:44,380 --> 00:07:47,250 Now notice, we have several different measures of time here. 135 00:07:47,250 --> 00:07:50,130 I have the quote, unquote "real time" which is essentially 136 00:07:50,130 --> 00:07:54,060 the stopwatch time, meaning from the time I press Enter to the time I saw 137 00:07:54,060 --> 00:07:56,835 these results, how long did that take? 138 00:07:56,835 --> 00:07:58,710 But of course you're familiar with computers. 139 00:07:58,710 --> 00:08:02,037 You know there's more involved than just this query running. 140 00:08:02,037 --> 00:08:03,870 I also have to wait for the operating system 141 00:08:03,870 --> 00:08:05,340 to finish other processes as well. 142 00:08:05,340 --> 00:08:07,740 That's where user and system time come in. 143 00:08:07,740 --> 00:08:12,070 User time is how much time this query particularly took on the CPU. 144 00:08:12,070 --> 00:08:14,820 System time is how long I spent waiting around for other processes 145 00:08:14,820 --> 00:08:17,130 to finish before I could run this query. 146 00:08:17,130 --> 00:08:19,200 For us, we'll focus on real time. 147 00:08:19,200 --> 00:08:22,120 What was the wall clock time, the stopwatch time, so to speak, 148 00:08:22,120 --> 00:08:26,480 between the time I pressed Enter and the time I saw these results. 149 00:08:26,480 --> 00:08:27,380 OK. 150 00:08:27,380 --> 00:08:33,350 So here we have a query taking 0.084 seconds. 151 00:08:33,350 --> 00:08:35,270 I mean, that is pretty good. 152 00:08:35,270 --> 00:08:41,030 If I only wait around for less than a tenth of a second, I'm not complaining. 153 00:08:41,030 --> 00:08:45,260 But if I'm an engineer at IMDb, why might I 154 00:08:45,260 --> 00:08:50,450 be concerned about a query that takes 0.084 seconds, particularly 155 00:08:50,450 --> 00:08:53,000 if I have lots and lots of users? 156 00:08:53,000 --> 00:08:55,120 What's the problem there? 157 00:08:55,120 --> 00:09:00,320 AUDIENCE: For users, the database might be overwhelmed by the queries. 158 00:09:00,320 --> 00:09:05,840 And you should also consider how much time the application takes 159 00:09:05,840 --> 00:09:08,738 to form the query to the database too. 160 00:09:08,738 --> 00:09:09,530 CARTER ZENKE: Yeah. 161 00:09:09,530 --> 00:09:11,720 So the basic idea here is that when I have 162 00:09:11,720 --> 00:09:15,090 a single query that takes maybe less than tenth of a second, that's fine. 163 00:09:15,090 --> 00:09:19,010 But if I have many, many users running those queries all at once, 164 00:09:19,010 --> 00:09:21,390 that time is going to add up. 165 00:09:21,390 --> 00:09:24,500 And so if I'm an engineer or even somebody in finance, 166 00:09:24,500 --> 00:09:28,160 I want to reduce that time because I'm paying per second that I 167 00:09:28,160 --> 00:09:30,830 run this database on some other server. 168 00:09:30,830 --> 00:09:33,980 So we can optimize these queries, but let's first 169 00:09:33,980 --> 00:09:36,470 get a sense of how they're presently running. 170 00:09:36,470 --> 00:09:41,360 What is SQLite doing underneath the hood to find this movie called Cars? 171 00:09:41,360 --> 00:09:45,920 Well for that, I'll bring up a certain visual here of our titles column. 172 00:09:45,920 --> 00:09:50,840 So imagine here that we have a smaller database called Movies 173 00:09:50,840 --> 00:09:52,890 that just has titles in this case. 174 00:09:52,890 --> 00:09:57,350 And now I want to search these titles, but I'm a computer. 175 00:09:57,350 --> 00:09:58,190 I'm not a human. 176 00:09:58,190 --> 00:10:01,550 Meaning I can only look at one item at a time. 177 00:10:01,550 --> 00:10:05,720 I can only look at one row and check its title. 178 00:10:05,720 --> 00:10:11,570 Now if I want to find the row or the rows that have Cars in them, 179 00:10:11,570 --> 00:10:16,610 what's the best I could do if my titles are not sorted? 180 00:10:16,610 --> 00:10:18,890 What's the best approach I could take here? 181 00:10:18,890 --> 00:10:24,922 If I can only look at one row at a time, how might I search this column? 182 00:10:24,922 --> 00:10:27,990 AUDIENCE: You should use a linear search then, right? 183 00:10:27,990 --> 00:10:29,680 CARTER ZENKE: So I like your intuition. 184 00:10:29,680 --> 00:10:31,900 There is an algorithm, a step-by-step process 185 00:10:31,900 --> 00:10:34,990 called linear search where we look at one element 186 00:10:34,990 --> 00:10:38,170 at a time or, in this case one row at a time. 187 00:10:38,170 --> 00:10:40,750 That's exactly what we can do for this table. 188 00:10:40,750 --> 00:10:44,980 I can look at one row at a time to find is this title in this column 189 00:10:44,980 --> 00:10:46,360 or is it not? 190 00:10:46,360 --> 00:10:50,120 So here to visualize, I could take this title column. 191 00:10:50,120 --> 00:10:52,970 I can only look at one at a time. 192 00:10:52,970 --> 00:10:58,630 But if this data is not sorted, I can do no better than looking at every row 193 00:10:58,630 --> 00:10:59,920 exactly once. 194 00:10:59,920 --> 00:11:01,780 So here we start with Toy Story. 195 00:11:01,780 --> 00:11:05,110 We ask, is Toy Story equal to Cars? 196 00:11:05,110 --> 00:11:07,000 It's not, so we'll keep going. 197 00:11:07,000 --> 00:11:09,100 Here I'll check Cars 3. 198 00:11:09,100 --> 00:11:12,070 Is Cars 3 strictly equal to Cars? 199 00:11:12,070 --> 00:11:15,040 It has Cars in it, but it's not Cars exactly. 200 00:11:15,040 --> 00:11:16,180 So I'll keep going. 201 00:11:16,180 --> 00:11:17,380 I'll look at Frozen. 202 00:11:17,380 --> 00:11:23,170 And Frozen is not Cars either, so I'll keep going again and here I see Cars. 203 00:11:23,170 --> 00:11:26,540 So maybe I could stop here. 204 00:11:26,540 --> 00:11:28,360 But why couldn't I? 205 00:11:28,360 --> 00:11:34,300 Well if I need to find not just the one row that has Cars but all the rows 206 00:11:34,300 --> 00:11:39,790 that have Cars, I have to keep going to see is Cars elsewhere in this column? 207 00:11:39,790 --> 00:11:42,910 Is it not just here, but also in other rows too? 208 00:11:42,910 --> 00:11:44,050 So I'll keep going. 209 00:11:44,050 --> 00:11:47,020 I'll look at WALL-E, Ratatouille, Soul, Turning Red 210 00:11:47,020 --> 00:11:50,290 until I get to the end of this column. 211 00:11:50,290 --> 00:11:53,530 And this process is called a scan of the table. 212 00:11:53,530 --> 00:11:58,210 We're scanning our column top to bottom looking for the result 213 00:11:58,210 --> 00:12:00,160 that we need to find. 214 00:12:00,160 --> 00:12:03,580 But it turns out we can actually optimize this search. 215 00:12:03,580 --> 00:12:07,898 We can do a bit better than going one by one by one by one, 216 00:12:07,898 --> 00:12:09,940 and for this we need a bit of a metaphor to think 217 00:12:09,940 --> 00:12:12,170 about how we can search our table. 218 00:12:12,170 --> 00:12:15,080 So if you have ever seen a kind of textbook, 219 00:12:15,080 --> 00:12:17,620 you might have seen something a bit like this here. 220 00:12:17,620 --> 00:12:19,780 This is a Handbook of Education Research. 221 00:12:19,780 --> 00:12:23,470 And maybe I want to find which pages correspond 222 00:12:23,470 --> 00:12:26,060 to how students learn computer science. 223 00:12:26,060 --> 00:12:29,350 So one thing I could do is open to the start of the book. 224 00:12:29,350 --> 00:12:34,630 I could go page by page looking for which pages have "how students 225 00:12:34,630 --> 00:12:36,700 learn computer science." 226 00:12:36,700 --> 00:12:40,690 I keep going here, and if I want to find all the pages that have 227 00:12:40,690 --> 00:12:42,610 "how students learn computer science" on them, 228 00:12:42,610 --> 00:12:45,610 I can do no better than going through every one 229 00:12:45,610 --> 00:12:49,600 by one by one until I'm at the end of this textbook. 230 00:12:49,600 --> 00:12:53,420 But chances are you yourself don't do this. 231 00:12:53,420 --> 00:12:56,270 You do something better. 232 00:12:56,270 --> 00:13:00,010 And there's some built-in structure in this book that could 233 00:13:00,010 --> 00:13:01,930 help you find what you're looking for. 234 00:13:01,930 --> 00:13:07,170 What might you use to find something in a textbook or some other book too? 235 00:13:07,170 --> 00:13:10,300 AUDIENCE: We can use the index of the book. 236 00:13:10,300 --> 00:13:12,100 CARTER ZENKE: You can use the index. 237 00:13:12,100 --> 00:13:14,470 So often a book like a textbook has something 238 00:13:14,470 --> 00:13:17,440 at the very back of the book that tells me 239 00:13:17,440 --> 00:13:21,370 all the subjects that are inside this book, but in alphabetical order. 240 00:13:21,370 --> 00:13:24,550 So I can see here there is a subject index in the back of the book 241 00:13:24,550 --> 00:13:28,330 here that lists the subjects in alphabetical order, A to Z. 242 00:13:28,330 --> 00:13:31,210 And if I want to find "how students learn computer science," 243 00:13:31,210 --> 00:13:34,300 I could perhaps search for computer science in this index. 244 00:13:34,300 --> 00:13:37,990 I could find A, B, C. There's C, and here I see computer science. 245 00:13:37,990 --> 00:13:42,310 Now I know all the pages I need to look at inside this book 246 00:13:42,310 --> 00:13:46,490 without going through each of them one by one by one. 247 00:13:46,490 --> 00:13:50,080 So in the same way that books have an index, so too 248 00:13:50,080 --> 00:13:53,810 can databases and tables have an index as well. 249 00:13:53,810 --> 00:13:57,400 So let's see a definition here for what an index is. 250 00:13:57,400 --> 00:14:00,340 Well an index is a structure that we can use. 251 00:14:00,340 --> 00:14:03,010 It's built into our database to help us speed up 252 00:14:03,010 --> 00:14:06,550 the retrieval of rows from that table. 253 00:14:06,550 --> 00:14:12,670 I first search my index, find out what rows my given search term is in, 254 00:14:12,670 --> 00:14:16,730 and then look those rows up in my table. 255 00:14:16,730 --> 00:14:22,540 Now if I want to create an index on a table, I can use some SQL syntax. 256 00:14:22,540 --> 00:14:27,760 I can say "create index" and give it some name. 257 00:14:27,760 --> 00:14:31,270 But I need to also specify a few other things. 258 00:14:31,270 --> 00:14:35,500 Similar to my books index, that index has some content. 259 00:14:35,500 --> 00:14:38,770 It's an index by subject or an index by author. 260 00:14:38,770 --> 00:14:42,670 So I need to tell this index what to include. 261 00:14:42,670 --> 00:14:47,440 So I should also say create index, give it a name on some table 262 00:14:47,440 --> 00:14:52,720 and some columns in that table, telling my index what data for my table 263 00:14:52,720 --> 00:14:55,090 it should include. 264 00:14:55,090 --> 00:14:59,230 So let's create our very own index in the movies table 265 00:14:59,230 --> 00:15:02,690 to identify how we can make this query just a bit faster overall. 266 00:15:02,690 --> 00:15:08,440 So I'll come back to my computer here, and let's now try to create this index. 267 00:15:08,440 --> 00:15:11,290 I will open up a new tab here. 268 00:15:11,290 --> 00:15:12,070 A new tab. 269 00:15:12,070 --> 00:15:14,500 And then I'll log back into my database. 270 00:15:14,500 --> 00:15:17,950 I'll say sqlite3 movies.db. 271 00:15:17,950 --> 00:15:22,550 And now I want to first create some index here. 272 00:15:22,550 --> 00:15:27,280 So I'll turn my timer back on and now I'll type the syntax we just learned. 273 00:15:27,280 --> 00:15:29,810 I want to create an index. 274 00:15:29,810 --> 00:15:33,760 And if I'm searching my title column in Movies, 275 00:15:33,760 --> 00:15:38,110 I should probably include the Title column in this index. 276 00:15:38,110 --> 00:15:41,470 So I'll say call this the "title_index" here. 277 00:15:41,470 --> 00:15:46,450 It'll be on the Movies table and it'll contain the data in the Title 278 00:15:46,450 --> 00:15:49,850 column of Movies a bit like this. 279 00:15:49,850 --> 00:15:55,052 Now I'll hit Enter and I'll see this index took some time to be created. 280 00:15:55,052 --> 00:15:57,010 There's some stuff going on underneath the hood 281 00:15:57,010 --> 00:16:00,940 where I'm putting this data in a new structure we're calling the index here. 282 00:16:00,940 --> 00:16:07,930 But now if I rerun this query, if I say SELECT * FROM "movies," 283 00:16:07,930 --> 00:16:15,450 SELECT * from "movies" where the "title" = 'Cars' semicolon, 284 00:16:15,450 --> 00:16:16,620 it was a lot faster. 285 00:16:16,620 --> 00:16:25,590 Here I see 0.01 seconds real time compared to 0.084 seconds. 286 00:16:25,590 --> 00:16:28,380 That's a savings of almost eight times. 287 00:16:28,380 --> 00:16:31,740 Almost eight times faster now that we have this index. 288 00:16:31,740 --> 00:16:34,260 Now it took some time to create the index. 289 00:16:34,260 --> 00:16:38,070 If I run just a few queries, I much easily 290 00:16:38,070 --> 00:16:40,770 make that time up in terms of the time I'm 291 00:16:40,770 --> 00:16:46,560 using on my server to look up these kinds of pieces of information. 292 00:16:46,560 --> 00:16:50,760 OK, so let's actually take a look at how we can ensure a query 293 00:16:50,760 --> 00:16:53,460 is using some index. 294 00:16:53,460 --> 00:16:55,980 Here I'm just kind of taking it on my word. 295 00:16:55,980 --> 00:16:57,780 Here it was faster, so it used the index. 296 00:16:57,780 --> 00:17:01,710 But there is a way for you to know whether an index was used. 297 00:17:01,710 --> 00:17:06,839 I could use a SQL statement called EXPLAIN QUERY PLAN. 298 00:17:06,839 --> 00:17:10,329 Other DBMSs might call this just simply EXPLAIN, 299 00:17:10,329 --> 00:17:14,190 but in SQLite it's EXPLAIN QUERY PLAN. 300 00:17:14,190 --> 00:17:17,819 And then after EXPLAIN QUERY PLAN, I can type the query 301 00:17:17,819 --> 00:17:20,760 that I want to learn the plan for. 302 00:17:20,760 --> 00:17:24,880 That is, how does SQLite plan to find me this information? 303 00:17:24,880 --> 00:17:27,579 How does it plan to execute this query? 304 00:17:27,579 --> 00:17:33,540 So now I'll type SELECT * FROM "movies" WHERE "title" = 'Cars' 305 00:17:33,540 --> 00:17:35,460 like this, semicolon. 306 00:17:35,460 --> 00:17:36,600 I'll hit Enter. 307 00:17:36,600 --> 00:17:39,420 And now I don't get the results, but I do 308 00:17:39,420 --> 00:17:45,060 get the plan SQLite intends to use to find me this result. 309 00:17:45,060 --> 00:17:46,350 And take a look here. 310 00:17:46,350 --> 00:17:51,930 I see QUERY PLAN, and underneath I see what SQLite intends to do. 311 00:17:51,930 --> 00:17:55,260 It will search the movies table, but it will do so 312 00:17:55,260 --> 00:17:58,410 using the index that we called title_index. 313 00:17:58,410 --> 00:18:02,560 And it will do so given some value for title here. 314 00:18:02,560 --> 00:18:06,210 So we can actually see SQLite does plan to use the index 315 00:18:06,210 --> 00:18:08,050 to speed up this query. 316 00:18:08,050 --> 00:18:11,770 We can confirm that this is exactly what's happening. 317 00:18:11,770 --> 00:18:15,450 But let me try now to remove this index and see 318 00:18:15,450 --> 00:18:19,380 what SQLite would have done when we didn't have this index. 319 00:18:19,380 --> 00:18:23,070 I'll come back over here and I'll go back to my other connection 320 00:18:23,070 --> 00:18:24,030 to my database. 321 00:18:24,030 --> 00:18:25,470 This one over here. 322 00:18:25,470 --> 00:18:28,290 And now I'll remove the index we just created. 323 00:18:28,290 --> 00:18:30,510 To remove an index I can drop it. 324 00:18:30,510 --> 00:18:35,948 I can say DROP INDEX "title_index" semicolon. 325 00:18:35,948 --> 00:18:37,740 And now that took just some amount of time, 326 00:18:37,740 --> 00:18:39,365 but now I don't have the index anymore. 327 00:18:39,365 --> 00:18:45,180 I can type .schema, and I'll see that index is not part of my schema. 328 00:18:45,180 --> 00:18:54,330 But now if I try EXPLAIN QUERY PLAN SELECT * from "movies" WHERE "title" 329 00:18:54,330 --> 00:19:00,780 = 'Cars' semicolon, I'll see that the plan is not to use the index. 330 00:19:00,780 --> 00:19:04,860 The plan is to scan movies, where scan, we saw before, 331 00:19:04,860 --> 00:19:08,130 means to go top to bottom through that title column 332 00:19:08,130 --> 00:19:12,070 and find me all the rows that have Cars in them. 333 00:19:12,070 --> 00:19:14,460 So you can see here the optimization we're already 334 00:19:14,460 --> 00:19:18,890 making by using this index here. 335 00:19:18,890 --> 00:19:21,740 OK, so let's take a moment here and pause. 336 00:19:21,740 --> 00:19:26,780 What questions do we have about indexes and how they're used in this case? 337 00:19:26,780 --> 00:19:33,950 AUDIENCE: Doesn't databases have implicit algorithms to optimize search? 338 00:19:33,950 --> 00:19:35,420 CARTER ZENKE: Yeah, good question. 339 00:19:35,420 --> 00:19:39,050 Do databases have implicit algorithms to optimize search? 340 00:19:39,050 --> 00:19:41,190 They do for some columns. 341 00:19:41,190 --> 00:19:46,520 So if we saw in our table earlier, we had a primary key column like ID. 342 00:19:46,520 --> 00:19:52,310 In SQLite and most other DBMSs, if I specify that a column is a primary key 343 00:19:52,310 --> 00:19:55,430 then that database management system will automatically 344 00:19:55,430 --> 00:19:59,280 create an index via which I can search for a primary key. 345 00:19:59,280 --> 00:20:03,920 If though I have a regular old column like title or like year or so on, 346 00:20:03,920 --> 00:20:07,130 it doesn't go through the process of automatically optimizing 347 00:20:07,130 --> 00:20:08,030 that search for me. 348 00:20:08,030 --> 00:20:12,457 I have to try to do that myself as we're doing here. 349 00:20:12,457 --> 00:20:13,290 Let's take one more. 350 00:20:13,290 --> 00:20:16,000 AUDIENCE: Would it be advisable to create a different index 351 00:20:16,000 --> 00:20:17,800 for every column in case we need it? 352 00:20:17,800 --> 00:20:20,380 CARTER ZENKE: Yeah, I mean it seems like indexes are so good. 353 00:20:20,380 --> 00:20:21,790 They speed up queries for us. 354 00:20:21,790 --> 00:20:25,030 Why not create an index on every column we have? 355 00:20:25,030 --> 00:20:26,800 Well as we'll see in just a moment, there 356 00:20:26,800 --> 00:20:30,730 are some trade-offs involving space and also the time it later takes 357 00:20:30,730 --> 00:20:33,020 us to insert or add new data. 358 00:20:33,020 --> 00:20:33,910 So good questions. 359 00:20:33,910 --> 00:20:36,620 I'll come back to my laptop where we can continue on. 360 00:20:36,620 --> 00:20:40,870 So we've seen how indexes can speed up queries. 361 00:20:40,870 --> 00:20:43,900 We're searching for one column here, but they could also speed up 362 00:20:43,900 --> 00:20:46,310 queries across multiple tables. 363 00:20:46,310 --> 00:20:48,910 So let's try a different kind of query here. 364 00:20:48,910 --> 00:20:52,810 What I might do now is try to find all of those movies 365 00:20:52,810 --> 00:20:55,480 that Tom Hanks starred in. 366 00:20:55,480 --> 00:20:59,230 And we did a bit of a example of this a little earlier visually, 367 00:20:59,230 --> 00:21:04,450 but my goal is to find all of those titles that Tom Hanks played a role in. 368 00:21:04,450 --> 00:21:07,690 Well if my goal is to end up with titles of movies, 369 00:21:07,690 --> 00:21:13,520 I could try selecting "title" from the "movies" column like this. 370 00:21:13,520 --> 00:21:16,720 But if I look at that movie's table, oh no, I only 371 00:21:16,720 --> 00:21:20,740 have titles, year, and the ID of those movies. 372 00:21:20,740 --> 00:21:23,230 I don't have anything related to Tom Hanks. 373 00:21:23,230 --> 00:21:27,250 Well Tom Hanks though is probably showing up in the stars table 374 00:21:27,250 --> 00:21:30,940 where I could correspond Tom Hanks's ID with the movie IDs 375 00:21:30,940 --> 00:21:32,950 that Tom Hanks played a role in. 376 00:21:32,950 --> 00:21:35,860 So let me try filtering these movies by an ID. 377 00:21:35,860 --> 00:21:40,450 I'll say WHERE the "id" of that movie is IN some list of movie IDs 378 00:21:40,450 --> 00:21:43,960 that also correspond with Tom Hanks's person ID. 379 00:21:43,960 --> 00:21:47,650 So I'll make a subquery here to find exactly those films. 380 00:21:47,650 --> 00:21:54,340 Now I'll say SELECT "movie_id" FROM "stars" where the "person _id" 381 00:21:54,340 --> 00:21:56,140 is equal to-- 382 00:21:56,140 --> 00:21:58,180 well I don't know Tom Hanks's person ID. 383 00:21:58,180 --> 00:22:00,880 I could try to hard-code it here, but I could do better 384 00:22:00,880 --> 00:22:05,710 by having another subquery to find what is Tom Hanks's ID? 385 00:22:05,710 --> 00:22:08,950 Now I'll indent not just four times but eight 386 00:22:08,950 --> 00:22:12,160 to show this is my next level of a subquery here. 387 00:22:12,160 --> 00:22:19,270 I'll now SELECT "id" FROM "people" WHERE the "name" = 'Tom Hanks'. 388 00:22:19,270 --> 00:22:21,020 'Tom Hanks' like this. 389 00:22:21,020 --> 00:22:23,950 And now I can close out my subqueries. 390 00:22:23,950 --> 00:22:28,300 I've found the ID of Tom Hanks, I've found those movie IDs that 391 00:22:28,300 --> 00:22:30,970 correspond with the person ID of Tom Hanks, 392 00:22:30,970 --> 00:22:34,900 and then I've found the titles that correspond to those movie IDs. 393 00:22:34,900 --> 00:22:39,100 Now I'll hit a semicolon here, press Enter, and I should see-- 394 00:22:39,100 --> 00:22:43,300 I get back all of Tom Hanks's movies, all of those that he played a role in. 395 00:22:43,300 --> 00:22:45,910 And I'll also see the time that it took. 396 00:22:45,910 --> 00:22:50,840 The real time for this query was about 0.197 seconds. 397 00:22:50,840 --> 00:22:54,880 So getting a little bit slower now that we're looking across multiple tables 398 00:22:54,880 --> 00:22:57,580 and having multiple subqueries. 399 00:22:57,580 --> 00:23:00,890 One index could speed this up as well. 400 00:23:00,890 --> 00:23:03,555 I can make a new terminal here. 401 00:23:03,555 --> 00:23:05,680 Let me open up the connection to my database again. 402 00:23:05,680 --> 00:23:08,470 I'll say sqlite3 movies.db. 403 00:23:08,470 --> 00:23:12,490 And now I should figure out what columns do 404 00:23:12,490 --> 00:23:15,670 I need to index to speed up this query? 405 00:23:15,670 --> 00:23:19,940 Well one thing I could do is try EXPLAIN QUERY PLAN again. 406 00:23:19,940 --> 00:23:25,310 I want to understand what SQLite plans to do to execute this query. 407 00:23:25,310 --> 00:23:29,020 So I'll EXPLAIN QUERY PLAN and I'll type in the same query from before. 408 00:23:29,020 --> 00:23:33,430 I'll say SELECT the "title" FROM "movies" WHERE 409 00:23:33,430 --> 00:23:37,720 the "id" is IN those movies that Tom Hanks played a role in. 410 00:23:37,720 --> 00:23:45,400 So I'll SELECT in this case movie_id from the "stars" table where 411 00:23:45,400 --> 00:23:49,180 the person_id equals Tom Hanks's ID. 412 00:23:49,180 --> 00:23:50,450 Now I'll find that. 413 00:23:50,450 --> 00:23:53,650 I'll indent eight times to show this is my next level subquery. 414 00:23:53,650 --> 00:23:59,050 I'll SELECT "id" from "people" WHERE "name" 415 00:23:59,050 --> 00:24:02,510 = 'Tom Hanks' and I'll wrap the line just a little bit. 416 00:24:02,510 --> 00:24:04,510 Let me zoom out so you can see it all in one go. 417 00:24:04,510 --> 00:24:10,570 And now I'll close out these subqueries just like this, hit Enter, 418 00:24:10,570 --> 00:24:16,090 and I should see if I zoom back in the plan SQLite intends to use to find me 419 00:24:16,090 --> 00:24:17,590 the results of this query. 420 00:24:17,590 --> 00:24:19,490 Now let's take a look here. 421 00:24:19,490 --> 00:24:21,900 I'm planning to first scan people. 422 00:24:21,900 --> 00:24:23,650 To understand the structure of this I need 423 00:24:23,650 --> 00:24:26,500 to first go all the way in to this bottom part here. 424 00:24:26,500 --> 00:24:32,320 I'm going to scan people to find presumably Tom Hanks's ID. 425 00:24:32,320 --> 00:24:36,610 Then after I do that, I'm going to scan stars presumably 426 00:24:36,610 --> 00:24:41,080 to find all the movie IDs that have Tom Hanks's ID. 427 00:24:41,080 --> 00:24:45,130 But then up here I'm finally going to search movies 428 00:24:45,130 --> 00:24:47,320 using integer primary key. 429 00:24:47,320 --> 00:24:49,750 Some index is automatically created for me 430 00:24:49,750 --> 00:24:52,850 as we saw before for my primary keys. 431 00:24:52,850 --> 00:24:56,620 So it seems like the optimization here could be in the people 432 00:24:56,620 --> 00:25:00,040 table and the stars table. 433 00:25:00,040 --> 00:25:05,620 Now let me ask, if I wanted to create an index on either the people 434 00:25:05,620 --> 00:25:09,970 table or the stars table, what do you think I should include? 435 00:25:09,970 --> 00:25:15,650 Which column should I include in this index given my query here? 436 00:25:15,650 --> 00:25:21,130 What column should I include in this index given this previous query? 437 00:25:21,130 --> 00:25:26,180 AUDIENCE: We could index the ID and the name of the person? 438 00:25:26,180 --> 00:25:27,530 CARTER ZENKE: Yeah, good idea. 439 00:25:27,530 --> 00:25:29,930 Index the ID and the name of the person. 440 00:25:29,930 --> 00:25:34,670 So if we look at this query again we'll see that our WHERE clause here 441 00:25:34,670 --> 00:25:37,770 is searching by the name in the people table, 442 00:25:37,770 --> 00:25:41,240 and up here this WHERE clause is searching by the person ID 443 00:25:41,240 --> 00:25:43,700 column in our stars table. 444 00:25:43,700 --> 00:25:46,280 So it would stand to reason that I should actually 445 00:25:46,280 --> 00:25:49,850 create an index that includes the name column in people 446 00:25:49,850 --> 00:25:52,250 and the person ID column in stars. 447 00:25:52,250 --> 00:25:54,200 Perhaps two separate indexes. 448 00:25:54,200 --> 00:25:55,710 One for each year. 449 00:25:55,710 --> 00:25:57,470 So let's go ahead and create those now. 450 00:25:57,470 --> 00:26:03,350 I'll come back to my computer and here I will try to create this index. 451 00:26:03,350 --> 00:26:06,590 I'll say first create an index. 452 00:26:06,590 --> 00:26:10,040 CREATE INDEX called "person_index." 453 00:26:10,040 --> 00:26:16,070 And I'll create it on stars and the person ID column in stars 454 00:26:16,070 --> 00:26:17,960 as we just discussed. 455 00:26:17,960 --> 00:26:23,600 Now if I hit Enter I should let it take some time to build up this index, 456 00:26:23,600 --> 00:26:26,270 but now I can also create my next one. 457 00:26:26,270 --> 00:26:35,300 I'll say CREATE INDEX, and then I'll choose "name_index" on people table. 458 00:26:35,300 --> 00:26:37,890 And I'll include the name column here. 459 00:26:37,890 --> 00:26:39,710 So now I'll hit Enter. 460 00:26:39,710 --> 00:26:42,980 That too will take some time to run, but now 461 00:26:42,980 --> 00:26:45,350 if I try to explain the query plan again-- 462 00:26:45,350 --> 00:26:50,630 I'll clear my screen, try EXPLAIN QUERY PLAN, and then I'll type in this query 463 00:26:50,630 --> 00:26:51,170 again. 464 00:26:51,170 --> 00:26:56,060 I'll say SELECT "title" FROM movies WHERE the "id" is IN those movies 465 00:26:56,060 --> 00:26:59,600 that Tom Hanks played a role in that have the person ID next to it. 466 00:26:59,600 --> 00:27:03,590 Then I'll say which is Tom Hanks's ID? 467 00:27:03,590 --> 00:27:08,650 And I'll close out the subquery here just like this. 468 00:27:08,650 --> 00:27:12,480 And now if I explain the plan yet again, I 469 00:27:12,480 --> 00:27:16,570 should see I'm now optimizing this query even further. 470 00:27:16,570 --> 00:27:19,950 I can see down below here, my plan is to search the people table. 471 00:27:19,950 --> 00:27:23,160 Not to scan it, but to search it using what we call 472 00:27:23,160 --> 00:27:25,710 a covering index called "name_index." 473 00:27:25,710 --> 00:27:27,010 The one we just created. 474 00:27:27,010 --> 00:27:29,610 We'll explain covering index in just a little bit. 475 00:27:29,610 --> 00:27:32,760 Next though, I'm going to search stars using 476 00:27:32,760 --> 00:27:37,060 the index we just created called "name_index" given some person ID. 477 00:27:37,060 --> 00:27:40,290 And then finally, I'll search movies using 478 00:27:40,290 --> 00:27:42,660 an integer primary key index, the one that's 479 00:27:42,660 --> 00:27:45,850 automatically created for me here. 480 00:27:45,850 --> 00:27:48,570 So here we see some new vocabulary. 481 00:27:48,570 --> 00:27:53,060 Not just using index, but using a covering index. 482 00:27:53,060 --> 00:27:55,470 Well, what is a covering index? 483 00:27:55,470 --> 00:27:58,650 A covering index, it turns out, is still an index. 484 00:27:58,650 --> 00:28:01,050 It's the same kind of idea, but it is one 485 00:28:01,050 --> 00:28:06,390 in which we have all the data we need inside the index itself. 486 00:28:06,390 --> 00:28:11,250 It's one in which the query data can be found only by searching the index. 487 00:28:11,250 --> 00:28:14,460 There's no need for me to go from the index to the table. 488 00:28:14,460 --> 00:28:17,040 I can just look it up in the index. 489 00:28:17,040 --> 00:28:20,850 It's similar to me searching this textbook and getting to the index 490 00:28:20,850 --> 00:28:24,450 and not then needing to actually go back in the book's pages. 491 00:28:24,450 --> 00:28:29,290 I can find everything I'm looking for in the index itself. 492 00:28:29,290 --> 00:28:34,260 And this is good because if I need to not just find the items in my index 493 00:28:34,260 --> 00:28:39,130 but then look them up in my table, that look up in the table takes some time. 494 00:28:39,130 --> 00:28:41,580 So if I instead have a covering index, one 495 00:28:41,580 --> 00:28:45,310 that includes all data I want to find exactly right there in it, 496 00:28:45,310 --> 00:28:49,120 that's going to be faster for me than a regular old index. 497 00:28:49,120 --> 00:28:54,750 So let's try then turning one more index into a covering index. 498 00:28:54,750 --> 00:28:55,890 I'll come back over here. 499 00:28:55,890 --> 00:29:00,780 And we saw before that we had the index on the stars 500 00:29:00,780 --> 00:29:03,660 table using just a regular old index. 501 00:29:03,660 --> 00:29:06,420 Using index person index. 502 00:29:06,420 --> 00:29:10,050 But what are we trying to find from this index? 503 00:29:10,050 --> 00:29:13,170 We're trying to find the movie ID column. 504 00:29:13,170 --> 00:29:18,910 So I could create an index that includes the movie ID column as well. 505 00:29:18,910 --> 00:29:22,300 I can have indexes that span multiple columns. 506 00:29:22,300 --> 00:29:23,730 So let's try that here. 507 00:29:23,730 --> 00:29:27,240 I'll come back over and I'll create this new index 508 00:29:27,240 --> 00:29:29,730 that intends to cover our search here. 509 00:29:29,730 --> 00:29:35,940 I want to include not just the person ID column but also the movie ID 510 00:29:35,940 --> 00:29:40,320 column so I can quickly find that data inside the index itself. 511 00:29:40,320 --> 00:29:45,620 Let me drop our current implementation of the person index like this. 512 00:29:45,620 --> 00:29:49,090 And I'll hit semicolon here, and now that index is gone. 513 00:29:49,090 --> 00:29:53,040 But if I type now CREATE INDEX, I'll also 514 00:29:53,040 --> 00:29:56,700 call this "person_index" ON "stars." 515 00:29:56,700 --> 00:30:00,240 and I'll say let's include not just the person ID column. 516 00:30:00,240 --> 00:30:05,940 Let's also include in this case, the movie ID column like this. 517 00:30:05,940 --> 00:30:08,730 Now I'll hit semicolon, Enter. 518 00:30:08,730 --> 00:30:11,580 And it'll take a bit of time to run, but once it's 519 00:30:11,580 --> 00:30:15,420 done I can probably try to explain the query plan again and see 520 00:30:15,420 --> 00:30:17,160 if we have our covering index. 521 00:30:17,160 --> 00:30:20,910 I'll go back up top here, I'll EXPLAIN QUERY PLAN, 522 00:30:20,910 --> 00:30:24,910 and I'll choose to recreate this query from before just like this. 523 00:30:24,910 --> 00:30:30,660 I'm going to select the "movie_id" FROM "stars" and then find Tom Hanks's ID 524 00:30:30,660 --> 00:30:36,580 and then close out these subqueries here just like this. 525 00:30:36,580 --> 00:30:41,830 And now once I EXPLAIN QUERY PLAN, we've optimized even further. 526 00:30:41,830 --> 00:30:44,160 Now I have two covering indexes. 527 00:30:44,160 --> 00:30:48,690 I can find all the information I need just by looking in these indexes, 528 00:30:48,690 --> 00:30:52,100 not by doing a table lookup. 529 00:30:52,100 --> 00:30:56,375 So let's now run this query and see just how much faster it might be. 530 00:30:56,375 --> 00:31:00,470 I'll come back to my laptop here, and let's get rid of EXPLAIN QUERY PLAN. 531 00:31:00,470 --> 00:31:02,660 Let's now turn our timer on. 532 00:31:02,660 --> 00:31:06,380 I'll turn the timer on like this and I'll rerun this query. 533 00:31:06,380 --> 00:31:10,520 I'll say we select title from movies where the ID is in the stars table 534 00:31:10,520 --> 00:31:13,790 so long as the movie ID aligns with the person ID. 535 00:31:13,790 --> 00:31:18,590 Then I'll find Tom Hanks's person ID here, I'll close out my subqueries, 536 00:31:18,590 --> 00:31:24,078 and I should be able to find that this query is just a little bit-- 537 00:31:24,078 --> 00:31:25,370 this is an understatement here. 538 00:31:25,370 --> 00:31:26,400 A lot faster. 539 00:31:26,400 --> 00:31:32,120 So if I go back to my old query here, I'll see this one took 0.197 seconds. 540 00:31:32,120 --> 00:31:37,320 This one with two covering indexes takes 0.004 seconds. 541 00:31:37,320 --> 00:31:40,970 This is an order of magnitude faster than my prior query 542 00:31:40,970 --> 00:31:44,540 because I've used indexes now. 543 00:31:44,540 --> 00:31:46,940 So, let's see. 544 00:31:46,940 --> 00:31:53,180 If we have all these benefits of indexes, we must be paying something. 545 00:31:53,180 --> 00:31:59,310 If I have this book here and I'm trying to find some data inside of it, 546 00:31:59,310 --> 00:32:03,740 what might the trade-offs be if I want to create an index? 547 00:32:03,740 --> 00:32:07,460 Let me think first about this database on my computer. 548 00:32:07,460 --> 00:32:08,870 What am I giving up? 549 00:32:08,870 --> 00:32:13,400 If I'm making sure my query is faster, what am I losing, do you think? 550 00:32:13,400 --> 00:32:16,010 AUDIENCE: Space will require for indexing also. 551 00:32:16,010 --> 00:32:19,370 The more index we would have, the more space and more memory 552 00:32:19,370 --> 00:32:21,278 we would have to consume for that. 553 00:32:21,278 --> 00:32:22,070 CARTER ZENKE: Yeah. 554 00:32:22,070 --> 00:32:25,070 So it's a general theme that when I try to speed up 555 00:32:25,070 --> 00:32:28,130 something in computer science, I have to pay some cost which 556 00:32:28,130 --> 00:32:29,990 is often in terms of space. 557 00:32:29,990 --> 00:32:34,400 That when I create an index I'm using more space in my database. 558 00:32:34,400 --> 00:32:37,010 And this similarly applies to an old fashioned textbook. 559 00:32:37,010 --> 00:32:39,470 If I have an index in here, I'm literally 560 00:32:39,470 --> 00:32:43,550 paying in pages to print so people can find content better. 561 00:32:43,550 --> 00:32:46,130 This is some extra money I have to spend, 562 00:32:46,130 --> 00:32:49,670 extra pages I have to print in order to make my queries faster, 563 00:32:49,670 --> 00:32:53,220 either on this book or in my database. 564 00:32:53,220 --> 00:32:57,650 So let's understand why indexes take up so much space 565 00:32:57,650 --> 00:33:00,440 by looking at the underlying data structure they 566 00:33:00,440 --> 00:33:03,110 use to store information. 567 00:33:03,110 --> 00:33:05,900 Well it turns out that the part of the reason indexes 568 00:33:05,900 --> 00:33:09,780 take up so much space is because they use a certain data structure, 569 00:33:09,780 --> 00:33:16,110 a way of organizing data called a B-Tree, or a balanced tree structure. 570 00:33:16,110 --> 00:33:18,920 So not to worry if you don't know what a tree structure is. 571 00:33:18,920 --> 00:33:23,030 Here's a visual to keep in mind as you think about using trees in computer 572 00:33:23,030 --> 00:33:25,520 science and in indexes in particular. 573 00:33:25,520 --> 00:33:27,780 A tree might look a bit like this. 574 00:33:27,780 --> 00:33:31,640 And notice how it's similar to a family tree with grandparents 575 00:33:31,640 --> 00:33:34,430 and parents and children and so on. 576 00:33:34,430 --> 00:33:38,450 You could if you wanted to kind of do a 180 on this diagram. 577 00:33:38,450 --> 00:33:41,030 Flip it right side up and now you might see 578 00:33:41,030 --> 00:33:44,010 more of a tree structure going from top to bottom. 579 00:33:44,010 --> 00:33:47,300 You can see down here what might call the trunk of the tree followed 580 00:33:47,300 --> 00:33:49,670 by some branches-- these over here. 581 00:33:49,670 --> 00:33:50,930 And then the leaves. 582 00:33:50,930 --> 00:33:54,890 These up top kind of the edge of this tree diagram. 583 00:33:54,890 --> 00:33:56,750 And there's a few vocabulary words you might 584 00:33:56,750 --> 00:33:59,090 want to know when you talk about trees. 585 00:33:59,090 --> 00:34:02,190 One of them is this idea of a node. 586 00:34:02,190 --> 00:34:06,230 So this right here is a single node in our tree. 587 00:34:06,230 --> 00:34:11,210 A tree is made up of a collection of nodes, and each of these nodes 588 00:34:11,210 --> 00:34:13,590 stores some value. 589 00:34:13,590 --> 00:34:17,900 Maybe it's a value in the column we're trying to index, for example. 590 00:34:17,900 --> 00:34:22,040 But in addition to that value, this node also 591 00:34:22,040 --> 00:34:26,780 stores pointers, kind of arrows metaphorically, to other nodes 592 00:34:26,780 --> 00:34:30,590 to tell us where in memory these nodes are. 593 00:34:30,590 --> 00:34:32,719 And by stringing these nodes together, we 594 00:34:32,719 --> 00:34:36,270 can build up this tree-like structure. 595 00:34:36,270 --> 00:34:40,280 Now this node here points to three other nodes. 596 00:34:40,280 --> 00:34:45,530 And some new vocabulary here is that this node has three children. 597 00:34:45,530 --> 00:34:47,929 Going back to that family tree example, we 598 00:34:47,929 --> 00:34:52,469 see that this node here has three children which are these right here. 599 00:34:52,469 --> 00:34:56,078 But these nodes themselves also have three children of their own. 600 00:34:56,078 --> 00:34:58,370 So if I were to go down to the next level of our tree-- 601 00:34:58,370 --> 00:35:02,180 well if I'm referring to this node here, these nodes down below 602 00:35:02,180 --> 00:35:04,190 are that node's grandchildren. 603 00:35:04,190 --> 00:35:07,080 The children of its children. 604 00:35:07,080 --> 00:35:10,490 Now these nodes here also have another name. 605 00:35:10,490 --> 00:35:14,240 We saw before they're often called leaf nodes. 606 00:35:14,240 --> 00:35:15,560 They're the edge of our tree. 607 00:35:15,560 --> 00:35:21,040 And we know a node is a leaf node when there are no other nodes it points to. 608 00:35:21,040 --> 00:35:24,970 So this is the general idea of what an index looks like. 609 00:35:24,970 --> 00:35:26,920 But let's get a little more concrete and try 610 00:35:26,920 --> 00:35:32,540 to build an index from our title column in our movies table here. 611 00:35:32,540 --> 00:35:37,150 So let's assume we have a table of IDs and titles here, 612 00:35:37,150 --> 00:35:41,140 and I want to create an index on this title column 613 00:35:41,140 --> 00:35:43,340 to search it more efficiently. 614 00:35:43,340 --> 00:35:47,350 Well we saw before that the best I can do when this data is unsorted 615 00:35:47,350 --> 00:35:49,130 is to go top to bottom. 616 00:35:49,130 --> 00:35:51,970 First look at Toy Story, then look at Cars 3, 617 00:35:51,970 --> 00:35:54,740 then look at Frozen, all the way down to the bottom. 618 00:35:54,740 --> 00:35:58,220 I can only scan this column, so to speak. 619 00:35:58,220 --> 00:36:02,560 But I know I could probably do a little better if the data is sorted. 620 00:36:02,560 --> 00:36:05,030 If the data is sorted I could look roughly in the middle, 621 00:36:05,030 --> 00:36:07,270 and if I've gone too far, go back a little bit. 622 00:36:07,270 --> 00:36:09,770 Or if I haven't gone far enough, go forward a little bit 623 00:36:09,770 --> 00:36:13,910 and do a fewer number of lookups to find data I'm looking for. 624 00:36:13,910 --> 00:36:16,210 So let me sort this title column. 625 00:36:16,210 --> 00:36:18,980 I'll put it like this. 626 00:36:18,980 --> 00:36:22,230 What have I done wrong? 627 00:36:22,230 --> 00:36:26,980 First I had this, then I wanted to sort my title 628 00:36:26,980 --> 00:36:29,860 column to make it faster to search. 629 00:36:29,860 --> 00:36:35,365 But what rule did I break here by sorting this title column like that? 630 00:36:35,365 --> 00:36:38,620 AUDIENCE: Yes, Carter, the thing is that by sorting only the movie's 631 00:36:38,620 --> 00:36:43,630 name and not the ID you will break the entire other databases that 632 00:36:43,630 --> 00:36:45,320 relates to that one. 633 00:36:45,320 --> 00:36:50,260 So for example, let's look for a movie in which, I don't know, 634 00:36:50,260 --> 00:36:53,230 Johnny Depp is working or has worked on. 635 00:36:53,230 --> 00:36:56,080 And the result may be, I don't know, Titanic. 636 00:36:56,080 --> 00:36:57,820 CARTER ZENKE: Yeah, I love that example. 637 00:36:57,820 --> 00:36:59,450 That really drives it home here. 638 00:36:59,450 --> 00:37:04,330 And in more general terms, I had this primary key called ID. 639 00:37:04,330 --> 00:37:10,120 And in the prior state I had Toy Story with the ID of 114709. 640 00:37:10,120 --> 00:37:14,650 A primary key means that Toy Story should be consistently identified 641 00:37:14,650 --> 00:37:18,580 with this number, 114709. 642 00:37:18,580 --> 00:37:22,600 But to your point, Mateo, if I sort only the title column, 643 00:37:22,600 --> 00:37:25,060 well now Cars has that ID-- 644 00:37:25,060 --> 00:37:27,070 114709. 645 00:37:27,070 --> 00:37:29,620 I've broken that relationship, particularly 646 00:37:29,620 --> 00:37:37,180 if other tables are relying on Toy Story being identified by 114709. 647 00:37:37,180 --> 00:37:38,605 So I can't do this. 648 00:37:38,605 --> 00:37:39,730 This is bad. 649 00:37:39,730 --> 00:37:43,520 I need some other way to sort my title column. 650 00:37:43,520 --> 00:37:46,570 So let's bring this back to what it was here. 651 00:37:46,570 --> 00:37:49,300 Let me focus just on the title column. 652 00:37:49,300 --> 00:37:54,100 Well I can't sort this title column because it's part of my table 653 00:37:54,100 --> 00:37:57,100 and the ordering of that table does matter. 654 00:37:57,100 --> 00:38:00,440 What I could do though is create a copy of this column 655 00:38:00,440 --> 00:38:02,180 and maybe sort that copy. 656 00:38:02,180 --> 00:38:06,530 So I'll literally create a copy of this title column elsewhere in memory. 657 00:38:06,530 --> 00:38:09,230 And let me just remove this title column identifier here 658 00:38:09,230 --> 00:38:10,850 so I have just the data. 659 00:38:10,850 --> 00:38:14,350 Just the data in this column somewhere else in memory. 660 00:38:14,350 --> 00:38:17,990 And now I could sort this data like this. 661 00:38:17,990 --> 00:38:21,370 So let's say I now want to find Cars. 662 00:38:21,370 --> 00:38:24,470 Well now that my data is sorted, what could I do? 663 00:38:24,470 --> 00:38:27,730 I could look first in the middle-- at Ratatouille for instance-- 664 00:38:27,730 --> 00:38:31,000 and then realize, well Cars comes before Ratatouille. 665 00:38:31,000 --> 00:38:34,450 Maybe I'll go a little higher in my index here. 666 00:38:34,450 --> 00:38:36,670 I'll search in the middle of this first half. 667 00:38:36,670 --> 00:38:38,800 I'll look at, in this case, Frozen. 668 00:38:38,800 --> 00:38:41,320 And Cars-- well it still comes before Frozen, 669 00:38:41,320 --> 00:38:44,560 so let me look up at the first half of this first half. 670 00:38:44,560 --> 00:38:47,770 I'll then go up here and I'll have spotted Cars. 671 00:38:47,770 --> 00:38:54,010 So a much faster way to search for information now once I have it sorted. 672 00:38:54,010 --> 00:38:56,500 But I don't think we're quite there yet. 673 00:38:56,500 --> 00:39:01,090 If I wanted to find the year that Cars was released, 674 00:39:01,090 --> 00:39:08,620 I found that it's in my title column but what am I not able to do at this point? 675 00:39:08,620 --> 00:39:10,840 I've copied the data somewhere else, but now I 676 00:39:10,840 --> 00:39:13,270 want to find some other data in my table. 677 00:39:13,270 --> 00:39:15,390 What can't I do? 678 00:39:15,390 --> 00:39:18,148 AUDIENCE: We need to point to index. 679 00:39:18,148 --> 00:39:18,940 CARTER ZENKE: Yeah. 680 00:39:18,940 --> 00:39:21,430 I need to be able to find some way to relate 681 00:39:21,430 --> 00:39:24,590 this index with the rows in my table. 682 00:39:24,590 --> 00:39:27,250 So here this is elsewhere in memory. 683 00:39:27,250 --> 00:39:31,460 It's good that I can find that Cars is in my title column. 684 00:39:31,460 --> 00:39:35,050 But if I want to find the other data in this row like the year 685 00:39:35,050 --> 00:39:37,060 that Cars was released, well I have to have 686 00:39:37,060 --> 00:39:43,340 some way of linking this data here, Cars, to my table. 687 00:39:43,340 --> 00:39:46,930 So often in an index, we won't just copy the data to some new location 688 00:39:46,930 --> 00:39:47,620 and sort it. 689 00:39:47,620 --> 00:39:52,240 We'll also have some link between these indexed pieces of data over here 690 00:39:52,240 --> 00:39:54,920 and our rows and our table back here. 691 00:39:54,920 --> 00:40:00,940 So for instance here is Cars, and Cars appears to be in row 4. 692 00:40:00,940 --> 00:40:05,650 Similarly Cars 3 looks to be in row 2 in my table. 693 00:40:05,650 --> 00:40:11,620 And I could metaphorically visually draw some lines between Cars over here 694 00:40:11,620 --> 00:40:14,540 in my index and Cars in my table. 695 00:40:14,540 --> 00:40:17,950 I could draw one that looks a bit like this going from Cars in my index 696 00:40:17,950 --> 00:40:20,510 to actually Cars in my table. 697 00:40:20,510 --> 00:40:24,010 I could draw one for Cars 3 going from this index place 698 00:40:24,010 --> 00:40:27,410 here to row 2 in my title column. 699 00:40:27,410 --> 00:40:32,380 So that is the way I can link my index with my table data. 700 00:40:32,380 --> 00:40:38,860 I could do the rest of these too for all of my other pieces of data in my index. 701 00:40:38,860 --> 00:40:42,090 So let's focus on this now just to check for understanding. 702 00:40:42,090 --> 00:40:44,100 I have this index. 703 00:40:44,100 --> 00:40:48,180 That index has my sorted title data and also 704 00:40:48,180 --> 00:40:51,820 the row number in which I found that piece of data. 705 00:40:51,820 --> 00:40:56,580 So let's say here I want to find the movie Soul. 706 00:40:56,580 --> 00:40:58,380 I want to find the movie Soul. 707 00:40:58,380 --> 00:41:03,876 How could I find Soul in my index? 708 00:41:03,876 --> 00:41:06,500 AUDIENCE: Yeah, since it is already sorted 709 00:41:06,500 --> 00:41:08,198 I believe you could use binary search. 710 00:41:08,198 --> 00:41:08,990 CARTER ZENKE: Yeah. 711 00:41:08,990 --> 00:41:12,020 I could use algorithm called binary search, which if you're not familiar 712 00:41:12,020 --> 00:41:15,320 simply means start in the middle and if I've gone too far, 713 00:41:15,320 --> 00:41:17,690 look at either the left half or the right half 714 00:41:17,690 --> 00:41:20,520 and repeat that process over and over. 715 00:41:20,520 --> 00:41:22,160 So here is my index again. 716 00:41:22,160 --> 00:41:25,470 I'll go roughly to the middle and I'll look at Ratatouille here. 717 00:41:25,470 --> 00:41:29,390 Well Soul, I know, comes after Ratatouille alphabetically, 718 00:41:29,390 --> 00:41:31,020 so what should I do? 719 00:41:31,020 --> 00:41:33,410 I should now look in this second half here. 720 00:41:33,410 --> 00:41:35,570 I might go to Toy Story. 721 00:41:35,570 --> 00:41:38,720 Once I see Toy Story, I know I've gone a little bit too far. 722 00:41:38,720 --> 00:41:41,720 Soul comes before Toy Story alphabetically, 723 00:41:41,720 --> 00:41:45,080 so I'll look in the first half of this second half. 724 00:41:45,080 --> 00:41:46,230 And what do I see? 725 00:41:46,230 --> 00:41:49,183 I see Soul right there. 726 00:41:49,183 --> 00:41:50,130 OK. 727 00:41:50,130 --> 00:41:52,700 So our index seems to be working. 728 00:41:52,700 --> 00:41:54,950 It allows me to more efficiently find data. 729 00:41:54,950 --> 00:41:58,250 I no longer have to look through every single row. 730 00:41:58,250 --> 00:42:03,190 I can simply use a more efficient algorithm like binary search. 731 00:42:03,190 --> 00:42:08,980 But now there's one more catch which is, if I had many, many, many rows-- 732 00:42:08,980 --> 00:42:12,940 not just nine but literally thousands or hundreds of thousands or millions 733 00:42:12,940 --> 00:42:19,090 of rows in this index, that's a lot of data to store in one location in memory 734 00:42:19,090 --> 00:42:22,450 or even to load into my computer's RAM, so to speak-- 735 00:42:22,450 --> 00:42:24,020 its random access memory-- 736 00:42:24,020 --> 00:42:26,740 so it can do all this jumping around. 737 00:42:26,740 --> 00:42:31,510 Ideally what I would do is break this index up into smaller pieces 738 00:42:31,510 --> 00:42:34,300 so I don't have to load all of this at once. 739 00:42:34,300 --> 00:42:38,420 And in reality our index is not a single column like this, 740 00:42:38,420 --> 00:42:42,160 but often individual nodes where each of these-- 741 00:42:42,160 --> 00:42:48,560 1, 2, 3-- might be some node scattered about the computer's memory. 742 00:42:48,560 --> 00:42:51,190 But what problem have I introduced now? 743 00:42:51,190 --> 00:42:53,500 If these are around the computer's memory, 744 00:42:53,500 --> 00:42:56,510 I can't actually jump between them anymore. 745 00:42:56,510 --> 00:43:00,140 I can't go maybe one space above Toy Story and find Soul. 746 00:43:00,140 --> 00:43:03,500 This node might be elsewhere in memory. 747 00:43:03,500 --> 00:43:06,760 And so we'll need some other nodes to actually help 748 00:43:06,760 --> 00:43:10,090 us find what we're looking for in this index. 749 00:43:10,090 --> 00:43:15,010 I might have one more over here that tells me what kind of data 750 00:43:15,010 --> 00:43:17,230 is in each of these nodes. 751 00:43:17,230 --> 00:43:21,190 Maybe I have one here that has both Frozen and Soul. 752 00:43:21,190 --> 00:43:25,990 Notice how here Frozen is the furthest down alphabetically in this 753 00:43:25,990 --> 00:43:26,980 node up here. 754 00:43:26,980 --> 00:43:30,280 And notice down below here Soul is the furthest down 755 00:43:30,280 --> 00:43:33,520 alphabetically in this node here. 756 00:43:33,520 --> 00:43:37,150 That is, if what I'm looking for comes before Frozen, 757 00:43:37,150 --> 00:43:38,800 I should look at this node. 758 00:43:38,800 --> 00:43:42,730 If what I'm looking for comes after Frozen but before Soul, 759 00:43:42,730 --> 00:43:44,620 I should look in this node. 760 00:43:44,620 --> 00:43:49,390 Or if what I'm looking for comes before none of these, it comes after Soul, 761 00:43:49,390 --> 00:43:53,390 I should look at this node down here. 762 00:43:53,390 --> 00:43:54,910 So let's find an example. 763 00:43:54,910 --> 00:43:56,970 Let's say I want to find Cars now. 764 00:43:56,970 --> 00:44:00,610 Well I'd start in this root node, this base of this tree 765 00:44:00,610 --> 00:44:03,190 we're building here that I've turned kind of right side-- 766 00:44:03,190 --> 00:44:04,240 this way. 767 00:44:04,240 --> 00:44:05,980 Let's say Frozen. 768 00:44:05,980 --> 00:44:09,610 I'll go here and does Cars come before Frozen? 769 00:44:09,610 --> 00:44:12,490 Well it does, so I'll look in this node now. 770 00:44:12,490 --> 00:44:16,570 I'll come up here and I'll find Cars in that node. 771 00:44:16,570 --> 00:44:17,500 Let's see here. 772 00:44:17,500 --> 00:44:19,840 Maybe I want to find Ratatouille. 773 00:44:19,840 --> 00:44:23,380 I'll go back to the beginning and I'll look through this root node. 774 00:44:23,380 --> 00:44:27,080 I'll say, does Ratatouille come before Frozen? 775 00:44:27,080 --> 00:44:27,580 It doesn't. 776 00:44:27,580 --> 00:44:30,970 It comes after Frozen, so I'll look and check Soul. 777 00:44:30,970 --> 00:44:33,670 Does Ratatouille come before Soul? 778 00:44:33,670 --> 00:44:36,070 Well it does, so I'll look in this node now. 779 00:44:36,070 --> 00:44:37,330 I'll go find Luca. 780 00:44:37,330 --> 00:44:40,240 I'm not quite there yet, but I will see Ratatouille 781 00:44:40,240 --> 00:44:43,870 so I know that I've now found Ratatouille in my index. 782 00:44:43,870 --> 00:44:45,880 And now for one more-- 783 00:44:45,880 --> 00:44:47,680 let me check for understanding here. 784 00:44:47,680 --> 00:44:51,370 How could I find Turning Red? 785 00:44:51,370 --> 00:44:56,310 What steps would I follow to find Turning Red? 786 00:44:56,310 --> 00:45:02,115 I might start at that root node, but then what questions should I ask? 787 00:45:02,115 --> 00:45:08,740 AUDIENCE: From Frozen also we can see Ratatouille is to the right 788 00:45:08,740 --> 00:45:10,890 so that we move in direction of Soul. 789 00:45:10,890 --> 00:45:15,850 Now once we get there we find the middle elements and we can see Ratatouille 790 00:45:15,850 --> 00:45:19,120 is to the right of the middle elements. 791 00:45:19,120 --> 00:45:20,520 So that's how we find it. 792 00:45:20,520 --> 00:45:25,560 Divide the list into two halves and look into the half 793 00:45:25,560 --> 00:45:27,448 that the element is found. 794 00:45:27,448 --> 00:45:29,740 CARTER ZENKE: Yeah, I like what you were thinking here. 795 00:45:29,740 --> 00:45:33,590 So you're talking about traversing this tree we've built and turned on its side 796 00:45:33,590 --> 00:45:34,090 here. 797 00:45:34,090 --> 00:45:36,010 And let's visualize that search here. 798 00:45:36,010 --> 00:45:41,440 So first we'll start at Frozen and ask, does Turning Red come before Frozen? 799 00:45:41,440 --> 00:45:41,940 It doesn't. 800 00:45:41,940 --> 00:45:43,530 It comes after, so I'll keep looking. 801 00:45:43,530 --> 00:45:44,640 I'll look at Soul. 802 00:45:44,640 --> 00:45:47,620 Does Turning Red come before Soul? 803 00:45:47,620 --> 00:45:48,120 It doesn't. 804 00:45:48,120 --> 00:45:53,430 It comes after Soul, so I know I need to look at this node down here. 805 00:45:53,430 --> 00:45:55,050 I'll go find Toy Story. 806 00:45:55,050 --> 00:45:57,990 Not quite there yet, but now I'll see Turning Red. 807 00:45:57,990 --> 00:46:00,150 I'm able to find what I'm looking for. 808 00:46:00,150 --> 00:46:01,110 And not just that. 809 00:46:01,110 --> 00:46:07,000 I can find what row number Turning Red is actually in in my table. 810 00:46:07,000 --> 00:46:12,180 So when you use an index, SQL or SQLite is using some algorithm a bit 811 00:46:12,180 --> 00:46:13,090 like this. 812 00:46:13,090 --> 00:46:15,870 It creates some tree structure a bit like this here 813 00:46:15,870 --> 00:46:18,780 with individual nodes that has your column data. 814 00:46:18,780 --> 00:46:22,470 It then uses this kind of algorithm to search that index 815 00:46:22,470 --> 00:46:26,770 and find whether or not that piece of data is in your index. 816 00:46:26,770 --> 00:46:32,050 And if it is, it will go ahead and look up that piece of data in your table. 817 00:46:32,050 --> 00:46:36,000 So see here, although this is turned on its side, the relationship 818 00:46:36,000 --> 00:46:37,740 between a kind of tree. 819 00:46:37,740 --> 00:46:43,680 We had a root node or a trunk node here pointing to some of these leaf nodes-- 820 00:46:43,680 --> 00:46:45,570 the edge of our graph right here. 821 00:46:45,570 --> 00:46:48,750 And you can see some relationship between these nodes here 822 00:46:48,750 --> 00:46:52,350 and our tree as well right here with arrows 823 00:46:52,350 --> 00:46:55,770 pointing across these individual nodes. 824 00:46:55,770 --> 00:47:01,050 So let me pause here and ask what questions do we have on this structure 825 00:47:01,050 --> 00:47:02,985 and how we've searched it? 826 00:47:02,985 --> 00:47:07,683 AUDIENCE: Is that related to linked lists or arrays? 827 00:47:07,683 --> 00:47:08,850 CARTER ZENKE: Good question. 828 00:47:08,850 --> 00:47:11,780 Is this related to linked lists or arrays? 829 00:47:11,780 --> 00:47:14,550 So if you're familiar, a linked list is some set 830 00:47:14,550 --> 00:47:18,030 of nodes that have been linked together in memory using these arrows like we 831 00:47:18,030 --> 00:47:19,800 saw on our diagram here. 832 00:47:19,800 --> 00:47:24,135 It is the case that sometimes an index will include a linked list. 833 00:47:24,135 --> 00:47:27,100 834 00:47:27,100 --> 00:47:28,245 Let's take one more here. 835 00:47:28,245 --> 00:47:32,400 AUDIENCE: I want to know, a covering index-- 836 00:47:32,400 --> 00:47:37,890 how is the [INAUDIBLE] covering index and what has been-- 837 00:47:37,890 --> 00:47:44,160 I mean, if we just [INAUDIBLE] index and then we just start covering index, 838 00:47:44,160 --> 00:47:49,740 how SQL automatically getting that covering index that's here? 839 00:47:49,740 --> 00:47:53,010 CARTER ZENKE: Yeah, so I think we're asking, how is the index created? 840 00:47:53,010 --> 00:47:54,940 Like, what is the process for doing that? 841 00:47:54,940 --> 00:47:58,920 Well it turns out that just as we saw here in the slides, SQLite or whatever 842 00:47:58,920 --> 00:48:03,240 DBMS you're using will take a copy of the data that should be in that index 843 00:48:03,240 --> 00:48:06,750 and organize it into a tree structure like we saw here. 844 00:48:06,750 --> 00:48:09,150 It may or may not be similar to the structure we saw just 845 00:48:09,150 --> 00:48:14,640 a little earlier with nodes that have different pieces of row data in them. 846 00:48:14,640 --> 00:48:17,280 Good question there. 847 00:48:17,280 --> 00:48:18,180 OK. 848 00:48:18,180 --> 00:48:21,090 So we've seen now the trade-off that we get 849 00:48:21,090 --> 00:48:24,510 with index, which is they're very fast to look up some data 850 00:48:24,510 --> 00:48:26,400 but they take up a lot more space. 851 00:48:26,400 --> 00:48:29,370 There's a lot of redundancy to them here. 852 00:48:29,370 --> 00:48:31,890 There's also one more trade-off, which is 853 00:48:31,890 --> 00:48:37,230 if I want to insert something into this structure, it'll take me more time. 854 00:48:37,230 --> 00:48:39,240 I have to navigate this structure here. 855 00:48:39,240 --> 00:48:44,010 I have to figure out should it come before this value or after this value? 856 00:48:44,010 --> 00:48:48,030 And I'll have to traverse this tree every time verses 857 00:48:48,030 --> 00:48:51,420 without an index I could just add something to the end of this list. 858 00:48:51,420 --> 00:48:54,930 But with a B-Tree, with an index, I need to traverse all these nodes 859 00:48:54,930 --> 00:48:57,760 and figure out where should that piece of data go? 860 00:48:57,760 --> 00:48:59,593 So these trade-offs are something you should 861 00:48:59,593 --> 00:49:03,660 keep in mind as you build indexes and optimize your queries. 862 00:49:03,660 --> 00:49:04,560 OK. 863 00:49:04,560 --> 00:49:07,650 There are ways though to actually optimize the space 864 00:49:07,650 --> 00:49:10,710 that we use using a tree like this. 865 00:49:10,710 --> 00:49:15,310 We can use a structure called not an index, but a partial index. 866 00:49:15,310 --> 00:49:19,620 And if you're concerned about space, a partial index might be good for you. 867 00:49:19,620 --> 00:49:25,200 A partial index only includes a subset of the data from a given column, 868 00:49:25,200 --> 00:49:28,800 thus making your index smaller overall. 869 00:49:28,800 --> 00:49:31,590 You could think of a partial index being useful when 870 00:49:31,590 --> 00:49:35,490 you know that your users will only query a subset-- 871 00:49:35,490 --> 00:49:38,820 only a small number of rows from that table. 872 00:49:38,820 --> 00:49:42,690 And for example in IMDb, maybe we know people 873 00:49:42,690 --> 00:49:46,770 are more likely to search for a movie that came out this year versus someone 874 00:49:46,770 --> 00:49:49,720 that came out 15 years ago or so. 875 00:49:49,720 --> 00:49:53,340 So we could create an index that focuses on those movies that 876 00:49:53,340 --> 00:49:59,800 came out this year versus 15 years ago or so, and we'll do just that. 877 00:49:59,800 --> 00:50:03,670 If I want to create a partial index I can use this syntax here. 878 00:50:03,670 --> 00:50:06,720 I could say create an index and give it some name, 879 00:50:06,720 --> 00:50:11,670 then include some data from this table and this column here, 880 00:50:11,670 --> 00:50:13,530 or a list of columns I can provide. 881 00:50:13,530 --> 00:50:19,140 But then what I need0 to do is add a condition where some condition is true. 882 00:50:19,140 --> 00:50:23,370 These are the rows I want to include in this index. 883 00:50:23,370 --> 00:50:27,910 Before we had just create index name on table from a list of columns. 884 00:50:27,910 --> 00:50:33,420 Now we can add in this WHERE clause that says only include those rows where 885 00:50:33,420 --> 00:50:35,320 this condition is true. 886 00:50:35,320 --> 00:50:36,840 So let's try that now. 887 00:50:36,840 --> 00:50:39,210 I'll come back to my computer. 888 00:50:39,210 --> 00:50:41,880 And why don't we try creating a partial index 889 00:50:41,880 --> 00:50:46,710 to help speed up the lookup of movies that came out this year. 890 00:50:46,710 --> 00:50:50,520 Well I could come back to my terminal and I could remove, let's say, 891 00:50:50,520 --> 00:50:51,900 this one over here. 892 00:50:51,900 --> 00:50:55,320 Now I'll try to create this partial index. 893 00:50:55,320 --> 00:51:00,690 I'll say create an index called "recents" to indicate this 894 00:51:00,690 --> 00:51:03,090 will help us search for recent movies. 895 00:51:03,090 --> 00:51:08,400 Now I'll make this "recents" ON "movies" in the ("title") column. 896 00:51:08,400 --> 00:51:12,510 I want to look up titles ultimately, but now I 897 00:51:12,510 --> 00:51:20,190 only want to create an index for those rows where the year equals 2023. 898 00:51:20,190 --> 00:51:26,790 I only want to include those titles in my index that were released in 2023. 899 00:51:26,790 --> 00:51:31,890 So I'll hit Enter now and I'll see I created this index for myself here. 900 00:51:31,890 --> 00:51:33,900 Now let me try searching for those. 901 00:51:33,900 --> 00:51:40,440 I'll SELECT "title" FROM "movies" WHERE the "year" = 2023. 902 00:51:40,440 --> 00:51:41,610 Enter. 903 00:51:41,610 --> 00:51:45,910 And I'll see a lot of movies came out in 2023. 904 00:51:45,910 --> 00:51:48,550 Now, it's taking 1.3 seconds here. 905 00:51:48,550 --> 00:51:51,600 But now I could prove to you that we are using some index. 906 00:51:51,600 --> 00:52:00,120 I could say EXPLAIN QUERY PLAN and now type SELECT "title" FROM "movies" 907 00:52:00,120 --> 00:52:03,120 WHERE the "year" = 2023. 908 00:52:03,120 --> 00:52:04,440 Hit Enter. 909 00:52:04,440 --> 00:52:06,600 And now we'll see-- what are we doing? 910 00:52:06,600 --> 00:52:09,870 We're going to-- we'll scan movies but we're still 911 00:52:09,870 --> 00:52:13,950 going to use the index we just created called "recents." 912 00:52:13,950 --> 00:52:17,580 So it's helping speed us up just a little bit overall. 913 00:52:17,580 --> 00:52:21,540 Now let me try dropping this index and showing you the opposite. 914 00:52:21,540 --> 00:52:24,560 I'll come back over here and I could try dropping it. 915 00:52:24,560 --> 00:52:26,310 And actually, before I do that let me show 916 00:52:26,310 --> 00:52:29,340 you what would happen if I didn't search for movie in 2023. 917 00:52:29,340 --> 00:52:32,560 I would try EXPLAIN QUERY PLAN. 918 00:52:32,560 --> 00:52:42,150 And let me try SELECT "title" FROM "movies" where the year = 1998 919 00:52:42,150 --> 00:52:43,140 like this. 920 00:52:43,140 --> 00:52:47,130 And see now I'm only back to scanning movies. 921 00:52:47,130 --> 00:52:50,670 So before in my prior query I was able to use the index 922 00:52:50,670 --> 00:52:54,480 because my WHERE clause included 2023. 923 00:52:54,480 --> 00:52:58,590 Now though I'm using 1998, which were rows that were not 924 00:52:58,590 --> 00:53:02,520 included in this partial index. 925 00:53:02,520 --> 00:53:03,270 OK. 926 00:53:03,270 --> 00:53:05,850 So let me ask now, what questions do we have 927 00:53:05,850 --> 00:53:09,600 on these partial index which we can use to optimize some 928 00:53:09,600 --> 00:53:12,920 of the space these indexes take up? 929 00:53:12,920 --> 00:53:16,030 AUDIENCE: Are indexes not saved in schema? 930 00:53:16,030 --> 00:53:17,280 CARTER ZENKE: A good question. 931 00:53:17,280 --> 00:53:19,370 Are indexes saved in the schema? 932 00:53:19,370 --> 00:53:23,510 In SQLite if I type .schema I can actually see them in my schema. 933 00:53:23,510 --> 00:53:25,070 So let me go back here and show you. 934 00:53:25,070 --> 00:53:27,560 If I now type something like-- 935 00:53:27,560 --> 00:53:29,620 let me turn timer off here-- 936 00:53:29,620 --> 00:53:35,150 .schema, you can see down below here that I actually have these indexes 937 00:53:35,150 --> 00:53:37,190 as part of my schema. 938 00:53:37,190 --> 00:53:40,910 I'll see the create index statements I used down below here. 939 00:53:40,910 --> 00:53:44,810 Create an index called name_index, called person_index, called recents. 940 00:53:44,810 --> 00:53:49,460 And all that I used to create these indexes now inside my schema. 941 00:53:49,460 --> 00:53:54,790 If I drop some index it'll then be just removed from my schema altogether. 942 00:53:54,790 --> 00:53:57,160 Good question. 943 00:53:57,160 --> 00:53:57,660 OK. 944 00:53:57,660 --> 00:54:00,450 So speaking of dropping indexes, one other way 945 00:54:00,450 --> 00:54:04,470 we can more efficiently use space is to do a bit of vacuuming so to speak. 946 00:54:04,470 --> 00:54:08,040 So let's introduce this idea of vacuuming up our database. 947 00:54:08,040 --> 00:54:10,200 Let me show you a slide here. 948 00:54:10,200 --> 00:54:14,010 And this one involves this idea of trying to clean up 949 00:54:14,010 --> 00:54:17,580 unused space in our database. 950 00:54:17,580 --> 00:54:22,500 Often when we delete something, either a rows or an index, 951 00:54:22,500 --> 00:54:25,890 we don't actually delete those bits that were 952 00:54:25,890 --> 00:54:29,190 being used by those rows or that index. 953 00:54:29,190 --> 00:54:33,690 We just mark them as being available for whatever we next insert. 954 00:54:33,690 --> 00:54:36,400 They can be overwritten, so to speak. 955 00:54:36,400 --> 00:54:39,930 But if I want to actually reduce the size of my database 956 00:54:39,930 --> 00:54:44,880 after I delete something, I should use something like vacuum in SQLite 957 00:54:44,880 --> 00:54:47,610 or optimize in some other DBMS. 958 00:54:47,610 --> 00:54:50,950 Vacuuming is a bit like taking those unused bits and just sucking them up. 959 00:54:50,950 --> 00:54:55,500 Taking them back in so we can give them back to the operating system. 960 00:54:55,500 --> 00:54:59,490 And the keyword we can use for vacuum is just simply vacuum. 961 00:54:59,490 --> 00:55:00,720 So let's try this. 962 00:55:00,720 --> 00:55:02,640 I'll come back to my computer over here. 963 00:55:02,640 --> 00:55:06,120 And let me open up a new terminal-- one that I can use 964 00:55:06,120 --> 00:55:09,240 to find the current size of movies.db. 965 00:55:09,240 --> 00:55:12,720 So I'll hit this + button here and now I could type a command. 966 00:55:12,720 --> 00:55:17,830 Not a SQL command or a SQLite command, but actually a Unix command. 967 00:55:17,830 --> 00:55:22,320 I could type du -b for the disk usage in bytes. 968 00:55:22,320 --> 00:55:24,810 How many bytes is a particular file? 969 00:55:24,810 --> 00:55:31,560 I could then say movies.db here, and if I hit Enter I should see movies.db 970 00:55:31,560 --> 00:55:37,230 is something like 158 million bytes or 158 megabytes. 971 00:55:37,230 --> 00:55:41,290 So let's see what happens if I were to delete an index which you know takes up 972 00:55:41,290 --> 00:55:43,380 some amount of space in our database. 973 00:55:43,380 --> 00:55:45,510 I'll go back to my terminal here. 974 00:55:45,510 --> 00:55:48,990 And let me try to drop an index that we had recently created. 975 00:55:48,990 --> 00:55:55,170 I'll say DROP INDEX "person_index" to remove the person index in this case. 976 00:55:55,170 --> 00:55:58,860 I'll hit Enter and now I see that "person_index" is dropped. 977 00:55:58,860 --> 00:56:05,250 If I type .schema here, I no longer see it in my schema down below. 978 00:56:05,250 --> 00:56:09,180 So now let me check the disk usage in bytes and see if it's any smaller. 979 00:56:09,180 --> 00:56:13,410 I'll hit the up arrow to get access to that same command, hit Enter. 980 00:56:13,410 --> 00:56:14,940 It seems like it's the same size. 981 00:56:14,940 --> 00:56:17,940 It's still 158 megabytes. 982 00:56:17,940 --> 00:56:19,830 So let me try dropping something more here. 983 00:56:19,830 --> 00:56:23,430 I'll go back to my terminal here and now I'll try-- 984 00:56:23,430 --> 00:56:25,500 what other do I have? 985 00:56:25,500 --> 00:56:30,630 I have "name_index" so I'll say let's DROP index "name_index" like this. 986 00:56:30,630 --> 00:56:34,680 I'll hit Enter and I'll see that index is now gone. 987 00:56:34,680 --> 00:56:37,200 Well now I'll also try dropping recents. 988 00:56:37,200 --> 00:56:43,470 I'll DROP INDEX "recents" and now I believe all of my indexes 989 00:56:43,470 --> 00:56:44,310 should be gone. 990 00:56:44,310 --> 00:56:48,510 If I type .schema, I don't see any more indexes. 991 00:56:48,510 --> 00:56:51,300 So now let's check on the disk usage in bytes. 992 00:56:51,300 --> 00:56:54,750 I'll go back to my other terminal, hit du -b again, 993 00:56:54,750 --> 00:56:58,290 and it's still 158 megabytes. 994 00:56:58,290 --> 00:57:01,170 Well what can we do to clean this up? 995 00:57:01,170 --> 00:57:03,180 We could run vacuum. 996 00:57:03,180 --> 00:57:07,140 If you remember here when we delete an index-- we drop the index, 997 00:57:07,140 --> 00:57:09,840 we don't actually shrink our database file. 998 00:57:09,840 --> 00:57:13,320 We just mark those bits that were part of the index as being 999 00:57:13,320 --> 00:57:15,120 available for re-use. 1000 00:57:15,120 --> 00:57:17,700 So let's now vacuum and compress our file 1001 00:57:17,700 --> 00:57:21,210 to give back those bytes and those bits to the operating system. 1002 00:57:21,210 --> 00:57:27,000 I'll come back to my terminal here and type VACUUM; hit Enter, 1003 00:57:27,000 --> 00:57:30,360 and I'll wait for it to find those bits and bytes, 1004 00:57:30,360 --> 00:57:33,300 wait for it to give them back, to de-allocate them, 1005 00:57:33,300 --> 00:57:34,950 to go back to the operating system. 1006 00:57:34,950 --> 00:57:39,660 And now if I come back and I type du -b again, hit Enter-- 1007 00:57:39,660 --> 00:57:42,970 I should see a much smaller file. 1008 00:57:42,970 --> 00:57:45,990 So before we had about 158 million bytes. 1009 00:57:45,990 --> 00:57:50,670 We're down to in this case 100 million after dropping our indexes 1010 00:57:50,670 --> 00:57:53,360 and using VACUUM. 1011 00:57:53,360 --> 00:57:57,390 So let me ask now, what questions do we have on vacuuming? 1012 00:57:57,390 --> 00:58:00,020 AUDIENCE: Can we do vacuuming in a faster way? 1013 00:58:00,020 --> 00:58:02,730 CARTER ZENKE: Can you vacuum in a faster way? 1014 00:58:02,730 --> 00:58:05,120 So I think it depends on the size of the vacuum 1015 00:58:05,120 --> 00:58:06,990 that you're trying to implement here. 1016 00:58:06,990 --> 00:58:08,900 If you have a lot of bits and bytes you're 1017 00:58:08,900 --> 00:58:11,990 trying to de-allocate, that could take more time. 1018 00:58:11,990 --> 00:58:15,290 The process here is the computer has to find those bits and bytes you're 1019 00:58:15,290 --> 00:58:19,590 no longer using and then give each of those back to the operating system. 1020 00:58:19,590 --> 00:58:24,178 So it could depend on the size and how long they take to find. 1021 00:58:24,178 --> 00:58:25,720 And let's take another question here. 1022 00:58:25,720 --> 00:58:30,691 AUDIENCE: Initially you said once we drop a database, we don't lose the-- 1023 00:58:30,691 --> 00:58:33,310 the data is not deleted. 1024 00:58:33,310 --> 00:58:36,550 The space is just reallocated, right? 1025 00:58:36,550 --> 00:58:37,960 If I go you right. 1026 00:58:37,960 --> 00:58:42,280 So does that mean the queries we run today, 1027 00:58:42,280 --> 00:58:46,780 are they locked somewhere that-- let's say you mistakenly did a query. 1028 00:58:46,780 --> 00:58:50,710 You can go back and pull that data back? 1029 00:58:50,710 --> 00:58:51,880 Is that possible? 1030 00:58:51,880 --> 00:58:55,000 Kind of like a reversal of a query. 1031 00:58:55,000 --> 00:58:57,880 CARTER ZENKE: Yeah, so you're talking about some forensics here. 1032 00:58:57,880 --> 00:59:01,900 Like, could we mark something as deleted but still find it later? 1033 00:59:01,900 --> 00:59:05,290 There actually are people who are trained in this, in forensics, 1034 00:59:05,290 --> 00:59:07,600 to find pieces of data that you thought were deleted 1035 00:59:07,600 --> 00:59:12,160 but are actually still on your computer just forgotten, if you will. 1036 00:59:12,160 --> 00:59:14,740 In this case in SQLite it would take a lot of work 1037 00:59:14,740 --> 00:59:18,190 to go ahead and find all of those bits and bytes we've removed. 1038 00:59:18,190 --> 00:59:21,550 But once we vacuum, once we actually give them back to the operating system, 1039 00:59:21,550 --> 00:59:24,220 we can no longer find those in the database 1040 00:59:24,220 --> 00:59:26,830 because they're not just marked as being de-allocated. 1041 00:59:26,830 --> 00:59:30,720 They actually have been given back and they're no longer part of our database. 1042 00:59:30,720 --> 00:59:32,710 Good question there. 1043 00:59:32,710 --> 00:59:35,800 OK, so we've seen here how we can optimize 1044 00:59:35,800 --> 00:59:38,680 our queries by reducing the time that they take 1045 00:59:38,680 --> 00:59:40,870 although at the cost of some space. 1046 00:59:40,870 --> 00:59:45,130 When we come back, we'll see how we can handle not just one query at a time 1047 00:59:45,130 --> 00:59:47,440 but many all at once. 1048 00:59:47,440 --> 00:59:49,010 See you in a bit. 1049 00:59:49,010 --> 00:59:50,330 And we're back. 1050 00:59:50,330 --> 00:59:54,460 So we've seen so far how to optimize individual queries, ensuring 1051 00:59:54,460 --> 00:59:58,000 they take less time and also to ensure our databases 1052 00:59:58,000 --> 01:00:00,800 use as little space as possible. 1053 01:00:00,800 --> 01:00:03,010 What we'll focus on now is trying to ensure 1054 01:00:03,010 --> 01:00:08,125 we can handle not just one query at a time but multiple all at once. 1055 01:00:08,125 --> 01:00:10,570 And this is this idea of concurrency. 1056 01:00:10,570 --> 01:00:16,180 Concurrency is when I get multiple queries, multiple interactions, all 1057 01:00:16,180 --> 01:00:20,110 at roughly the same time and my computer system, my database, 1058 01:00:20,110 --> 01:00:23,810 needs to figure out how to handle all of those at once. 1059 01:00:23,810 --> 01:00:27,040 This is particularly important for websites that get a lot of traffic, 1060 01:00:27,040 --> 01:00:28,900 or even for the financial sector. 1061 01:00:28,900 --> 01:00:32,110 People are constantly making trades, placing orders, 1062 01:00:32,110 --> 01:00:34,610 exchanging money, and so on. 1063 01:00:34,610 --> 01:00:37,690 So let's see one example here of a bank. 1064 01:00:37,690 --> 01:00:40,870 Let's say here I have a table of accounts 1065 01:00:40,870 --> 01:00:44,860 and each account has a name and some balance to it. 1066 01:00:44,860 --> 01:00:50,890 Well we decide here that maybe Alice wants to pay Bob $10. 1067 01:00:50,890 --> 01:00:54,220 So what should I do to update the balances here? 1068 01:00:54,220 --> 01:00:58,510 Maybe the first thing I should do is make sure that Bob receives that money. 1069 01:00:58,510 --> 01:01:03,460 I could update Bob's account to have a balance not of $20 but of $30 now. 1070 01:01:03,460 --> 01:01:05,920 From $20 to $30. 1071 01:01:05,920 --> 01:01:07,450 But here's the thing. 1072 01:01:07,450 --> 01:01:14,806 If at this moment somebody looks at this table, what will they see that's wrong? 1073 01:01:14,806 --> 01:01:20,205 What will they see that they might not get the right picture of at this point? 1074 01:01:20,205 --> 01:01:23,800 1075 01:01:23,800 --> 01:01:27,850 I'm hearing they might not see the correct balance for Alice. 1076 01:01:27,850 --> 01:01:29,980 Bob has gotten some money. 1077 01:01:29,980 --> 01:01:31,990 He's gone from $20 here to $30. 1078 01:01:31,990 --> 01:01:34,060 But now I have not just-- 1079 01:01:34,060 --> 01:01:35,710 let's see, $60 in the bank. 1080 01:01:35,710 --> 01:01:38,210 I now have $70 it looks like. 1081 01:01:38,210 --> 01:01:40,900 So if somebody looks at this transaction in this moment 1082 01:01:40,900 --> 01:01:43,420 they'll see some incorrect information. 1083 01:01:43,420 --> 01:01:45,910 What I should then do is update Alice's balance, 1084 01:01:45,910 --> 01:01:50,740 subtracting $10 so we're at the right amount of dollars across our bank 1085 01:01:50,740 --> 01:01:51,830 accounts here. 1086 01:01:51,830 --> 01:01:53,890 So this is an issue. 1087 01:01:53,890 --> 01:01:57,370 When I have one ongoing transaction like this 1088 01:01:57,370 --> 01:01:59,950 and somebody else wants to look at that data, 1089 01:01:59,950 --> 01:02:03,430 they really shouldn't be able to until this transaction is 1090 01:02:03,430 --> 01:02:06,020 complete in its entirety. 1091 01:02:06,020 --> 01:02:09,340 So ideally-- let's go back to what we had before with Alice at $10 1092 01:02:09,340 --> 01:02:10,480 Bob at $20. 1093 01:02:10,480 --> 01:02:15,430 Ideally if Alice wants to pay Bob, this should happen to an outside observer 1094 01:02:15,430 --> 01:02:16,510 all at once. 1095 01:02:16,510 --> 01:02:19,360 Bob should get those $10 and Alice should 1096 01:02:19,360 --> 01:02:23,590 lose those $10 at the same exact moment so nobody 1097 01:02:23,590 --> 01:02:29,230 can look and see that Bob has more money while Alice doesn't have less money. 1098 01:02:29,230 --> 01:02:34,060 Now we can implement this idea of this transaction using 1099 01:02:34,060 --> 01:02:39,040 a very similarly-named idea in databases called a transaction. 1100 01:02:39,040 --> 01:02:41,830 Now a transaction in the world of databases 1101 01:02:41,830 --> 01:02:44,800 is an individual unit of work. 1102 01:02:44,800 --> 01:02:49,370 That means a transaction can't be broken down into smaller pieces. 1103 01:02:49,370 --> 01:02:52,720 It happens all at once or not at all. 1104 01:02:52,720 --> 01:02:55,540 And in fact a transaction has several properties 1105 01:02:55,540 --> 01:02:58,690 and helps implement several properties of a database. 1106 01:02:58,690 --> 01:03:00,340 Among these are ACID-- 1107 01:03:00,340 --> 01:03:05,020 A-C-I-D. You may have heard of these if you've looked up something about 1108 01:03:05,020 --> 01:03:10,150 databases, but each one stands for some principle of ensuring a consistent 1109 01:03:10,150 --> 01:03:11,870 and reliable database. 1110 01:03:11,870 --> 01:03:17,710 In fact, A stands for atomicity, which means I can't break a transaction down 1111 01:03:17,710 --> 01:03:18,970 into smaller pieces. 1112 01:03:18,970 --> 01:03:21,310 I can't break a transaction between Alice and Bob 1113 01:03:21,310 --> 01:03:24,250 into Bob's update and then Alice's update. 1114 01:03:24,250 --> 01:03:28,150 It's just update both accounts at the same time. 1115 01:03:28,150 --> 01:03:32,080 Consistency means a transaction to help me ensure that I 1116 01:03:32,080 --> 01:03:35,180 don't violate some database constraint. 1117 01:03:35,180 --> 01:03:39,070 Let's say I try to pay Bob from Alice's account, 1118 01:03:39,070 --> 01:03:40,900 but Alice doesn't have the money. 1119 01:03:40,900 --> 01:03:44,530 If that is the case, a transaction will revert, undo. 1120 01:03:44,530 --> 01:03:49,090 So I go back to being in this consistent state for my database. 1121 01:03:49,090 --> 01:03:53,530 Isolation means if I have more than one person trying to access 1122 01:03:53,530 --> 01:03:55,600 this database, their queries-- 1123 01:03:55,600 --> 01:03:58,940 their transactions won't interfere with each other. 1124 01:03:58,940 --> 01:04:03,385 And then finally, durability means if I were to unplug this database 1125 01:04:03,385 --> 01:04:06,010 or we were to encounter a power failure or the server to crash, 1126 01:04:06,010 --> 01:04:11,230 all that data would still be there from the time we committed our transaction 1127 01:04:11,230 --> 01:04:14,150 or saved the results there. 1128 01:04:14,150 --> 01:04:18,670 So transactions do a lot for us, but they have some very simple syntax. 1129 01:04:18,670 --> 01:04:24,130 I can simply use BEGIN TRANSACTION and then follow it up with some statements 1130 01:04:24,130 --> 01:04:26,780 I want to include in that transaction. 1131 01:04:26,780 --> 01:04:30,850 And once I'm sure I want to save the results of these statements, 1132 01:04:30,850 --> 01:04:36,340 I can then say COMMIT, meaning save these changes thus far. 1133 01:04:36,340 --> 01:04:38,620 And it's using this kind of syntax we can 1134 01:04:38,620 --> 01:04:41,830 ensure that nobody sees incorrect balances, 1135 01:04:41,830 --> 01:04:45,590 let's say, for this bank account table here. 1136 01:04:45,590 --> 01:04:49,840 So let's try this out and implement a payment between Alice and Bob, 1137 01:04:49,840 --> 01:04:52,180 but now using a transaction. 1138 01:04:52,180 --> 01:04:54,670 I'll come back to my computer here. 1139 01:04:54,670 --> 01:04:57,400 And let me show you a new database. 1140 01:04:57,400 --> 01:05:00,580 I'll type sqlite3 bank.db. 1141 01:05:00,580 --> 01:05:01,600 I'll hit Enter. 1142 01:05:01,600 --> 01:05:02,980 Oh, sqlite3. 1143 01:05:02,980 --> 01:05:03,520 Sorry. 1144 01:05:03,520 --> 01:05:04,720 bank.db. 1145 01:05:04,720 --> 01:05:09,430 I'll hit Enter, and now let me type .schema to see the schema of this 1146 01:05:09,430 --> 01:05:10,360 database. 1147 01:05:10,360 --> 01:05:12,970 I'll see here a few different columns. 1148 01:05:12,970 --> 01:05:15,970 But I have an "id" column, my primary key, 1149 01:05:15,970 --> 01:05:19,900 a "name" column like we saw in our slides that is text and can't be null, 1150 01:05:19,900 --> 01:05:22,390 and then a "balance" column. 1151 01:05:22,390 --> 01:05:26,740 There's an integer, a whole number of dollars, that also must have a value. 1152 01:05:26,740 --> 01:05:27,850 Can't be null. 1153 01:05:27,850 --> 01:05:30,730 And I have a check constraint here saying 1154 01:05:30,730 --> 01:05:35,590 the balance must be at all times greater than or equal to 0. 1155 01:05:35,590 --> 01:05:40,690 So this is our implementation of a table of bank accounts. 1156 01:05:40,690 --> 01:05:44,650 Let me try querying here to see what our account balances look like. 1157 01:05:44,650 --> 01:05:50,390 I'll say SELECT * FROM "accounts" semicolon. 1158 01:05:50,390 --> 01:05:53,980 And now I'll see that very same arrangement we saw in our slides. 1159 01:05:53,980 --> 01:05:59,450 Alice has $10, Bob has $20, and Charlie has $30. 1160 01:05:59,450 --> 01:06:04,570 So let me try then processing this transaction between Alice and Bob. 1161 01:06:04,570 --> 01:06:08,890 Well at this moment I might try to update Bob's account 1162 01:06:08,890 --> 01:06:10,600 at the first [? bat. ?] I'll say-- 1163 01:06:10,600 --> 01:06:13,510 let's try update and I'll update the accounts 1164 01:06:13,510 --> 01:06:21,730 table, set the "balance" equal to "balance" + 10 where the "id" equals 2. 1165 01:06:21,730 --> 01:06:25,040 Remember, Bob's accounts "id" is 2. 1166 01:06:25,040 --> 01:06:30,350 So now if I hit Enter here, Bob should have gotten just 10 more dollars. 1167 01:06:30,350 --> 01:06:31,610 But there's a problem. 1168 01:06:31,610 --> 01:06:34,790 If I also connect to this database through another connection here, 1169 01:06:34,790 --> 01:06:37,457 sqlite3 bank.DB. 1170 01:06:37,457 --> 01:06:40,040 Maybe I'm somebody who's checking on the balances of the bank. 1171 01:06:40,040 --> 01:06:44,120 I say SELECT * FROM "accounts." 1172 01:06:44,120 --> 01:06:48,360 What do I see but an incorrect total balance? 1173 01:06:48,360 --> 01:06:51,890 We should have $60, but here I see $70. 1174 01:06:51,890 --> 01:06:54,620 So I need to complete this transaction to ensure 1175 01:06:54,620 --> 01:06:58,790 that I actually have the right amount of balances across the board. 1176 01:06:58,790 --> 01:07:03,188 Let me go back here and let me try to run the next part here. 1177 01:07:03,188 --> 01:07:03,980 Let we update this. 1178 01:07:03,980 --> 01:07:10,910 I'll say UPDATE "accounts," and I'll SET "balance" equal to "balance" minus 10. 1179 01:07:10,910 --> 01:07:17,960 Minus 10 where the "id" equals in this case 1 where Alice's bank account ID 1180 01:07:17,960 --> 01:07:18,980 is 1. 1181 01:07:18,980 --> 01:07:23,780 Now if I hit Enter I should be able to go back to my other terminal 1182 01:07:23,780 --> 01:07:25,830 and check on the balances again. 1183 01:07:25,830 --> 01:07:29,700 And now I'll see the right balances all together. 1184 01:07:29,700 --> 01:07:33,380 But there is a way to avoid me looking at this table 1185 01:07:33,380 --> 01:07:38,640 and seeing an incorrect state, and the way to do that is called a transaction. 1186 01:07:38,640 --> 01:07:42,020 So let's go back here, and I will reset my terminals. 1187 01:07:42,020 --> 01:07:45,410 I'll now also give back Alice's money. 1188 01:07:45,410 --> 01:07:47,183 Let me give back Bob the $10. 1189 01:07:47,183 --> 01:07:48,100 I'm going to go back-- 1190 01:07:48,100 --> 01:07:51,860 [INAUDIBLE] say where "id" equals 1 here. 1191 01:07:51,860 --> 01:07:54,140 Now Alice will have 10 more dollars. 1192 01:07:54,140 --> 01:07:57,020 And I'll go back and I'll update accounts. 1193 01:07:57,020 --> 01:08:02,810 I'll set the "balance" equal to "balance" minus 10 1194 01:08:02,810 --> 01:08:05,690 where the "id" equals 2. 1195 01:08:05,690 --> 01:08:09,390 So I'm giving Bob back the $10 that they just gave to Alice. 1196 01:08:09,390 --> 01:08:15,160 Now if I SELECT * from "accounts" we're back to where we began. 1197 01:08:15,160 --> 01:08:19,649 So ideally I should able to complete this transaction between Alice and Bob 1198 01:08:19,649 --> 01:08:23,310 without somebody seeing the intermediate process of Bob 1199 01:08:23,310 --> 01:08:25,470 having more money than Alice. 1200 01:08:25,470 --> 01:08:26,850 So let's try this. 1201 01:08:26,850 --> 01:08:31,649 I'll say BEGIN TRANSACTION; Enter. 1202 01:08:31,649 --> 01:08:36,479 And now I can include what I want to have happen in this transaction. 1203 01:08:36,479 --> 01:08:38,609 First I want to update Bob's account. 1204 01:08:38,609 --> 01:08:45,640 I'll say UPDATE "accounts" and SET "balance" equal to "balance" + 10 1205 01:08:45,640 --> 01:08:49,290 where the "id" in this case equals-- 1206 01:08:49,290 --> 01:08:53,170 let's see, 2, because Bob's account ID is 2. 1207 01:08:53,170 --> 01:08:57,000 Now I'll hit Enter, and now I'm in the middle of this transaction. 1208 01:08:57,000 --> 01:09:00,960 But if I go to my other terminal-- let's say I'm somebody new 1209 01:09:00,960 --> 01:09:02,640 and I check on the balances of the bank. 1210 01:09:02,640 --> 01:09:06,520 I say SELECT * FROM "accounts". 1211 01:09:06,520 --> 01:09:07,960 What do I see? 1212 01:09:07,960 --> 01:09:10,390 I don't see any change from before. 1213 01:09:10,390 --> 01:09:14,920 Alice still has $10, Bob has $20, and Charlie has $30. 1214 01:09:14,920 --> 01:09:17,380 And even though in this terminal I've been 1215 01:09:17,380 --> 01:09:21,430 able to update the account balance, I haven't committed those changes. 1216 01:09:21,430 --> 01:09:24,380 I haven't saved them to the database just yet. 1217 01:09:24,380 --> 01:09:26,770 So let me try now removing money from Alice's account. 1218 01:09:26,770 --> 01:09:33,430 I'll say UPDATE "accounts" and set "balance" equal to balance minus 10 1219 01:09:33,430 --> 01:09:36,700 where the "id" equals 1. 1220 01:09:36,700 --> 01:09:39,550 Now I'll remove $10 from Alice's account. 1221 01:09:39,550 --> 01:09:43,060 Hit Enter, and still if I go back to this other perspective 1222 01:09:43,060 --> 01:09:46,870 here from somebody else checking on the bank balances, I see-- 1223 01:09:46,870 --> 01:09:49,399 well everything is still the same. 1224 01:09:49,399 --> 01:09:51,939 But the moment I then go to this transaction 1225 01:09:51,939 --> 01:09:57,370 and finish it by typing COMMIT, meaning save these changes, I'll commit here. 1226 01:09:57,370 --> 01:10:01,240 I can go back and I'll see all in one moment 1227 01:10:01,240 --> 01:10:06,070 that it appears Alice has paid Bob $10. 1228 01:10:06,070 --> 01:10:10,060 So this is the atomicity part of transactions. 1229 01:10:10,060 --> 01:10:13,480 I don't see the intermediate steps in the transaction. 1230 01:10:13,480 --> 01:10:19,000 I only see that it happened all at once or not at all. 1231 01:10:19,000 --> 01:10:23,300 OK, so this is one example here of a transaction. 1232 01:10:23,300 --> 01:10:27,700 Let me ask what questions we have so far on the transactions we've made 1233 01:10:27,700 --> 01:10:31,060 and how they can be used. 1234 01:10:31,060 --> 01:10:32,950 AUDIENCE: You are searching based on ID-- 1235 01:10:32,950 --> 01:10:35,560 like an ID is equal to a question mark. 1236 01:10:35,560 --> 01:10:39,550 But in a real-world scenario, how they would Search maybe 1237 01:10:39,550 --> 01:10:44,063 like using the Transaction ID or basic account number? 1238 01:10:44,063 --> 01:10:45,230 CARTER ZENKE: Good question. 1239 01:10:45,230 --> 01:10:47,050 So here I'm searching by ID. 1240 01:10:47,050 --> 01:10:51,820 I'm saying that I want to update the balance of account with ID 1 or ID 2, 1241 01:10:51,820 --> 01:10:52,660 for instance. 1242 01:10:52,660 --> 01:10:55,510 And this is similar in spirit to how your bank account likely 1243 01:10:55,510 --> 01:10:58,850 has some unique number, an account number so to speak. 1244 01:10:58,850 --> 01:11:01,360 So here in this table I'm updating balances 1245 01:11:01,360 --> 01:11:03,700 by referring to the account number. 1246 01:11:03,700 --> 01:11:06,670 You could imagine me using names, but there 1247 01:11:06,670 --> 01:11:10,120 might be in some world multiple Alices, multiple Bobs. 1248 01:11:10,120 --> 01:11:15,740 So I should uniquely identify each account using its ID in this case. 1249 01:11:15,740 --> 01:11:20,490 OK, so let's come back and think through more kinds of transactions here. 1250 01:11:20,490 --> 01:11:24,570 And let's think through a transaction we actually don't want to complete. 1251 01:11:24,570 --> 01:11:30,410 So let's go back to our balance sheet where we have Alice at $0 and Bob 1252 01:11:30,410 --> 01:11:35,660 at $30, and let's say Alice wants to pay Bob 10 more dollars. 1253 01:11:35,660 --> 01:11:39,830 Well if I go through this transaction, what's going to happen? 1254 01:11:39,830 --> 01:11:44,390 Alice will have -$10 and Bob will have $40, 1255 01:11:44,390 --> 01:11:47,990 and that violates our check constraint that we saw in our schema. 1256 01:11:47,990 --> 01:11:53,550 The balance of Alice's account must always be at least $0. 1257 01:11:53,550 --> 01:11:56,360 So if I were to complete this transaction, 1258 01:11:56,360 --> 01:11:59,870 that would bring our database to an inconsistent state. 1259 01:11:59,870 --> 01:12:03,410 It violates some constraint that we have. 1260 01:12:03,410 --> 01:12:06,800 But what I can do is undo these changes. 1261 01:12:06,800 --> 01:12:10,730 Revert them so that we go back to the original state 1262 01:12:10,730 --> 01:12:14,540 if we ever violate some constraint like we just did. 1263 01:12:14,540 --> 01:12:17,880 There's a technical name for this, which is rolling back. 1264 01:12:17,880 --> 01:12:21,980 So if I ever want to not save the changes in this transaction, 1265 01:12:21,980 --> 01:12:23,840 I can say ROLLBACK. 1266 01:12:23,840 --> 01:12:27,620 Go back to the beginning, think about it as if this transaction never even 1267 01:12:27,620 --> 01:12:29,470 happened at all. 1268 01:12:29,470 --> 01:12:30,380 So let's try it. 1269 01:12:30,380 --> 01:12:33,870 So I'll come back to my computer, and what I'll do 1270 01:12:33,870 --> 01:12:38,040 is try to implement this idea first without a transaction. 1271 01:12:38,040 --> 01:12:41,400 I'll go back to, let's say this first terminal here, 1272 01:12:41,400 --> 01:12:45,400 and I'll SELECT * FROM "accounts" like this. 1273 01:12:45,400 --> 01:12:50,060 Notice how Alice has $0 and Bob has $30. 1274 01:12:50,060 --> 01:12:53,010 Well I want to pay Bob 10 more dollars from Alice, 1275 01:12:53,010 --> 01:12:54,810 so I'll update Bob's account. 1276 01:12:54,810 --> 01:12:59,430 I'll find here my prior statement, UPDATE "accounts" 1277 01:12:59,430 --> 01:13:03,900 SET "balance" equal to "balance" + 10 where the "id" equals 2. 1278 01:13:03,900 --> 01:13:07,260 And recall that Bob's account ID is 2 1279 01:13:07,260 --> 01:13:10,085 I'll hit Enter, and now if I SELECT * FROM "accounts" 1280 01:13:10,085 --> 01:13:11,460 if we're looking somewhere else-- 1281 01:13:11,460 --> 01:13:17,850 SELECT * FROM "accounts," I'll see Bob has 10 more dollars. 1282 01:13:17,850 --> 01:13:21,300 But now I need to subtract money from Alice's account. 1283 01:13:21,300 --> 01:13:27,750 So I'll use the same thing but now I'll say minus 10 where the "id" equals 1. 1284 01:13:27,750 --> 01:13:30,120 Alice's account ID is 1. 1285 01:13:30,120 --> 01:13:35,820 Now I'll try to subtract $10 from Alice, but I get a constraint failure here. 1286 01:13:35,820 --> 01:13:39,120 My check constraint failed on the balance column. 1287 01:13:39,120 --> 01:13:40,290 Why did it do that? 1288 01:13:40,290 --> 01:13:47,460 Because I now have less than $0 in that balance column. 1289 01:13:47,460 --> 01:13:52,260 So I need to undo this but I don't have a transaction here. 1290 01:13:52,260 --> 01:13:57,550 What could I do instead to undo what I've just done? 1291 01:13:57,550 --> 01:13:59,025 AUDIENCE: So we can roll back. 1292 01:13:59,025 --> 01:14:01,740 CARTER ZENKE: We could roll back, but here's the thing. 1293 01:14:01,740 --> 01:14:05,790 A rollback only works while we're inside of a transaction. 1294 01:14:05,790 --> 01:14:09,700 And a transaction begins when we say begin transaction. 1295 01:14:09,700 --> 01:14:12,930 So here I actually didn't say begin transaction. 1296 01:14:12,930 --> 01:14:15,960 What I have to do instead is run another update statement. 1297 01:14:15,960 --> 01:14:19,290 I have to update, let's see, Bob's account 1298 01:14:19,290 --> 01:14:22,950 to remove the $10 that Alice paid Bob. 1299 01:14:22,950 --> 01:14:23,710 So let's try that. 1300 01:14:23,710 --> 01:14:27,810 I'll come back to my computer and I will then revert back. 1301 01:14:27,810 --> 01:14:29,380 I'll say let's undo this. 1302 01:14:29,380 --> 01:14:34,420 Let's subtract $10 now from Bob's account where the ID is 2. 1303 01:14:34,420 --> 01:14:36,510 Now we're back to where we began. 1304 01:14:36,510 --> 01:14:38,970 If I check in from some other [INAUDIBLE] here, 1305 01:14:38,970 --> 01:14:44,310 I'll see we're back to Alice having $0 and Bob having $30. 1306 01:14:44,310 --> 01:14:49,470 But we ideally want to use rollback to undo some changes that we 1307 01:14:49,470 --> 01:14:51,400 have made in a transaction. 1308 01:14:51,400 --> 01:14:54,000 So for that let me now use a transaction. 1309 01:14:54,000 --> 01:14:58,980 I'll say BEGIN TRANSACTION like this, hit Enter. 1310 01:14:58,980 --> 01:15:01,470 Now I want to update Bob's balance. 1311 01:15:01,470 --> 01:15:05,070 I'll say UPDATE "accounts SET "balance" equal to "balance" + 1312 01:15:05,070 --> 01:15:09,360 10 where the "id" = 2. 1313 01:15:09,360 --> 01:15:11,860 Bob's account ID is 2. 1314 01:15:11,860 --> 01:15:14,760 Now let me subtract $10 from Alice like this. 1315 01:15:14,760 --> 01:15:16,290 - 10. 1316 01:15:16,290 --> 01:15:19,200 I'll go back here and say where the "id" equals 1 1317 01:15:19,200 --> 01:15:21,660 for Alice's account, hit Enter. 1318 01:15:21,660 --> 01:15:24,360 But now I get that same constraint failure. 1319 01:15:24,360 --> 01:15:28,600 Alice has less than $0, which can't be possible. 1320 01:15:28,600 --> 01:15:29,880 So what do we do? 1321 01:15:29,880 --> 01:15:33,000 Instead of having another update, we could just roll back. 1322 01:15:33,000 --> 01:15:36,030 We could undo all of these previous statements here. 1323 01:15:36,030 --> 01:15:40,500 ROLLBACK like this, hit Enter, and now I should see from another perspective 1324 01:15:40,500 --> 01:15:46,380 here my table exactly as it was before with no intermediate step of Bob 1325 01:15:46,380 --> 01:15:50,140 having money but Alice having negative $10. 1326 01:15:50,140 --> 01:15:54,240 So let me ask m we've seen COMMIT and ROLLBACK. 1327 01:15:54,240 --> 01:15:57,440 What questions do we have on transactions now? 1328 01:15:57,440 --> 01:16:03,530 1329 01:16:03,530 --> 01:16:04,030 OK. 1330 01:16:04,030 --> 01:16:07,390 Seeing none so far, so let's keep going. 1331 01:16:07,390 --> 01:16:11,290 Transactions are helpful for ensuring that our data is consistent 1332 01:16:11,290 --> 01:16:13,570 like we just saw, and that they're also-- 1333 01:16:13,570 --> 01:16:15,130 we have atomic transactions. 1334 01:16:15,130 --> 01:16:18,620 We can't break things down into smaller pieces. 1335 01:16:18,620 --> 01:16:21,490 But there is a case that transactions can help guard 1336 01:16:21,490 --> 01:16:24,880 against some adversarial kind of attack. 1337 01:16:24,880 --> 01:16:27,460 We're into what we call race condition. 1338 01:16:27,460 --> 01:16:32,140 A race condition is when we have multiple people, multiple processes 1339 01:16:32,140 --> 01:16:37,900 trying to access some value and also making a decision based on that value. 1340 01:16:37,900 --> 01:16:43,300 If unaddressed it can lead to a scenario that's inconsistent in our database. 1341 01:16:43,300 --> 01:16:46,090 And actually hackers can exploit race conditions 1342 01:16:46,090 --> 01:16:51,020 to bring our database to an inconsistent state in a way that benefits them. 1343 01:16:51,020 --> 01:16:53,800 So let's say our friends Charlie and Alice here, they 1344 01:16:53,800 --> 01:16:55,510 decide to rob our bank. 1345 01:16:55,510 --> 01:16:59,510 They decide to take money from it by exploiting race conditions. 1346 01:16:59,510 --> 01:17:04,000 And here are the logs of what Charlie and Alice actually did. 1347 01:17:04,000 --> 01:17:10,090 We see here that Charlie and Alice executed three concurrent transactions 1348 01:17:10,090 --> 01:17:11,260 with our bank. 1349 01:17:11,260 --> 01:17:13,960 On the one hand Charlie opened up his bank account 1350 01:17:13,960 --> 01:17:16,000 on two different computers. 1351 01:17:16,000 --> 01:17:20,470 He logged in and he said let me transfer $30 to Alice. 1352 01:17:20,470 --> 01:17:25,240 He at the same time clicked transfer on both computers knowing just well 1353 01:17:25,240 --> 01:17:30,820 he had only $30 to give to Alice, but he was going to send her $60 in this case. 1354 01:17:30,820 --> 01:17:34,120 He was hoping that we weren't guarding against race conditions. 1355 01:17:34,120 --> 01:17:38,660 That he could send $30 on both accounts to Alice. 1356 01:17:38,660 --> 01:17:41,260 Now at the same time Alice is at the ATM, 1357 01:17:41,260 --> 01:17:45,010 and they're hoping that at the same time Charlie presses transfer 1358 01:17:45,010 --> 01:17:50,890 Alice hits confirm on their transaction to withdraw $60 from the ATM. 1359 01:17:50,890 --> 01:17:53,410 And when all of these happen at once, we might 1360 01:17:53,410 --> 01:17:56,810 run into some inconsistent states in our database. 1361 01:17:56,810 --> 01:17:59,710 Let's say first we run this transaction here. 1362 01:17:59,710 --> 01:18:00,895 Let's focus on this one. 1363 01:18:00,895 --> 01:18:02,770 The first step is to check Charlie's balance. 1364 01:18:02,770 --> 01:18:05,500 Does Charlie have $30? 1365 01:18:05,500 --> 01:18:06,970 It seems like Charlie does. 1366 01:18:06,970 --> 01:18:11,410 We continue we might say, OK, let me add $30 to Alice's account like this. 1367 01:18:11,410 --> 01:18:15,280 Then let me subtract $30 from Charlie's account down below here. 1368 01:18:15,280 --> 01:18:18,130 And finally, that's the end of our transaction. 1369 01:18:18,130 --> 01:18:21,610 We checked Charlie's balance, we added $30 to Alice, 1370 01:18:21,610 --> 01:18:24,790 and subtracted $30 from Charlie. 1371 01:18:24,790 --> 01:18:31,180 But now at the same time, we're also processing the other transfer of $30 1372 01:18:31,180 --> 01:18:32,050 to Alice. 1373 01:18:32,050 --> 01:18:35,590 So here we check in terms of our time, top to bottom here-- 1374 01:18:35,590 --> 01:18:37,810 we check Charlie's balance at this point. 1375 01:18:37,810 --> 01:18:39,820 It's still $30. 1376 01:18:39,820 --> 01:18:43,060 Knowing this, this process goes ahead and says 1377 01:18:43,060 --> 01:18:45,730 let's add $30 to Alice's account. 1378 01:18:45,730 --> 01:18:49,930 And now only later let's subtract $30 like this. 1379 01:18:49,930 --> 01:18:54,740 Now Charlie well appears to have no money in his account. 1380 01:18:54,740 --> 01:19:01,330 But at this time above we've already sent $30 to Alice's account two times. 1381 01:19:01,330 --> 01:19:03,610 And if Alice is waiting at the ATM over here 1382 01:19:03,610 --> 01:19:06,400 and they press Confirm on that withdrawal, what might happen 1383 01:19:06,400 --> 01:19:07,360 is something like this. 1384 01:19:07,360 --> 01:19:10,840 At that moment that Alice has $60 in their account-- 1385 01:19:10,840 --> 01:19:14,440 they check and they see it-- they could then withdraw that $60. 1386 01:19:14,440 --> 01:19:17,390 And now that money is gone from our bank, 1387 01:19:17,390 --> 01:19:19,840 even if we realize Charlie shouldn't have been 1388 01:19:19,840 --> 01:19:22,980 able to make that transaction happen. 1389 01:19:22,980 --> 01:19:26,510 So what logically is a solution here? 1390 01:19:26,510 --> 01:19:31,170 It seems these transactions happened all at the same time. 1391 01:19:31,170 --> 01:19:35,060 But what would you do to reorder these statements to make it clear 1392 01:19:35,060 --> 01:19:39,690 that we can only do one at a time? 1393 01:19:39,690 --> 01:19:45,260 What would you do in this case to update this table to fix this problem? 1394 01:19:45,260 --> 01:19:52,710 AUDIENCE: I would do the withdrawal and the debits as one transaction. 1395 01:19:52,710 --> 01:19:55,580 So the subtraction is done and the addition 1396 01:19:55,580 --> 01:20:01,658 is done as one unit of work in order for the double transfer not to happen. 1397 01:20:01,658 --> 01:20:02,450 CARTER ZENKE: Yeah. 1398 01:20:02,450 --> 01:20:06,020 So ideally we should treat each of these as a separate transaction. 1399 01:20:06,020 --> 01:20:10,280 Literally a transaction in the database sense where if I go back to this table 1400 01:20:10,280 --> 01:20:15,000 if Charlie transfers $30 to Alice at roughly the same time, 1401 01:20:15,000 --> 01:20:17,210 I should handle these sequentially. 1402 01:20:17,210 --> 01:20:20,990 I should first do this one, then I should do this one. 1403 01:20:20,990 --> 01:20:25,760 And only after that should Alice be able to withdraw their $60, or at least 1404 01:20:25,760 --> 01:20:27,450 try to in this case. 1405 01:20:27,450 --> 01:20:30,680 So we need to make these transactions sequential, and the word for this 1406 01:20:30,680 --> 01:20:32,540 is isolating our transactions. 1407 01:20:32,540 --> 01:20:34,580 Here they are not isolated. 1408 01:20:34,580 --> 01:20:36,390 They're working at the same time. 1409 01:20:36,390 --> 01:20:38,630 But in reality I want them to be isolated-- 1410 01:20:38,630 --> 01:20:41,390 happen one after another after another. 1411 01:20:41,390 --> 01:20:44,870 So ideally our table looks more like this. 1412 01:20:44,870 --> 01:20:48,890 Charlie hits transfer at the exact same time on two laptops, 1413 01:20:48,890 --> 01:20:54,170 but because we're working in transactions we only process one first. 1414 01:20:54,170 --> 01:20:55,610 We start with this one here. 1415 01:20:55,610 --> 01:21:00,860 We check the balance, we say let's add $30 to Alice's account and subtract $30 1416 01:21:00,860 --> 01:21:01,940 from Charlie's. 1417 01:21:01,940 --> 01:21:05,930 Once we're sure that is OK, we'll commit the transaction 1418 01:21:05,930 --> 01:21:08,120 and that is a done deal. 1419 01:21:08,120 --> 01:21:09,770 Now though, we'll go down below. 1420 01:21:09,770 --> 01:21:11,630 We'll say begin transaction. 1421 01:21:11,630 --> 01:21:15,140 We'll then say Charlie's balance is $0 because we're doing this 1422 01:21:15,140 --> 01:21:16,940 after we did our first transaction. 1423 01:21:16,940 --> 01:21:19,370 Charlie has no more money to give Alice. 1424 01:21:19,370 --> 01:21:23,480 At that point what we'll do is try to add $30 to Alice 1425 01:21:23,480 --> 01:21:27,240 but subtract $30 from Charlie, and we'll realize we made a mistake. 1426 01:21:27,240 --> 01:21:31,050 We should roll back and undo what we just did here. 1427 01:21:31,050 --> 01:21:35,060 So these now are separate transactions happening sequentially, 1428 01:21:35,060 --> 01:21:38,720 and we avoid our error of giving Alice $60 when 1429 01:21:38,720 --> 01:21:41,180 she shouldn't have that money at all. 1430 01:21:41,180 --> 01:21:46,460 And then only after this is done do we let Alice try to withdraw $60. 1431 01:21:46,460 --> 01:21:47,780 We'll go over here and see. 1432 01:21:47,780 --> 01:21:51,900 Alice tries to have a balance of $60, withdraw $60, and commit it. 1433 01:21:51,900 --> 01:21:56,570 But at this point she wouldn't actually have that money there. 1434 01:21:56,570 --> 01:22:01,810 So let's see now what questions do we have on the way transactions 1435 01:22:01,810 --> 01:22:04,860 are sequential in this case? 1436 01:22:04,860 --> 01:22:08,410 AUDIENCE: In particularly I want to ask that there are millions 1437 01:22:08,410 --> 01:22:10,480 of transactions that are going on. 1438 01:22:10,480 --> 01:22:14,320 Particularly, also, is it like there that sequentially these transactions 1439 01:22:14,320 --> 01:22:16,750 are being handled for the financial transaction over here 1440 01:22:16,750 --> 01:22:20,650 that because there are a number of banks in a single country 1441 01:22:20,650 --> 01:22:22,150 or international transactions. 1442 01:22:22,150 --> 01:22:25,210 Are they handle sequentially or there is some parallel earning mechanism 1443 01:22:25,210 --> 01:22:27,583 also exist if you may please clarify. 1444 01:22:27,583 --> 01:22:29,000 CARTER ZENKE: Yeah, good question. 1445 01:22:29,000 --> 01:22:31,630 So here visually we see these transactions are actually 1446 01:22:31,630 --> 01:22:34,000 sequential in the operation here. 1447 01:22:34,000 --> 01:22:37,240 And then your question is, are they literally sequential, like one 1448 01:22:37,240 --> 01:22:38,470 after another after another? 1449 01:22:38,470 --> 01:22:41,855 In reality they are very much just sequential like this. 1450 01:22:41,855 --> 01:22:43,730 A database can get creative with use of locks 1451 01:22:43,730 --> 01:22:46,063 as we'll see you in just a minute, but by and large they 1452 01:22:46,063 --> 01:22:48,130 are sequential one after the other. 1453 01:22:48,130 --> 01:22:51,370 And it's the job of a database module to ensure 1454 01:22:51,370 --> 01:22:56,300 that these transactions are executed sequentially in an isolated way. 1455 01:22:56,300 --> 01:22:58,360 Good question. 1456 01:22:58,360 --> 01:22:58,900 OK. 1457 01:22:58,900 --> 01:23:04,340 So let's see now how, if a database is able to make these sequential, 1458 01:23:04,340 --> 01:23:06,610 it must be using something underneath the hood 1459 01:23:06,610 --> 01:23:10,180 to ensure that these transactions don't interfere. 1460 01:23:10,180 --> 01:23:15,610 And in SQLite and other DBMSs we use this idea of a lock on the database 1461 01:23:15,610 --> 01:23:20,680 to ensure people can't access data when they really shouldn't be able to. 1462 01:23:20,680 --> 01:23:23,200 So here let's look at an idea of what SQLite 1463 01:23:23,200 --> 01:23:26,920 has at different kinds of locks on this database. 1464 01:23:26,920 --> 01:23:28,490 We have a few different kinds-- 1465 01:23:28,490 --> 01:23:32,290 unlocked, shared, and exclusive, among some others here. 1466 01:23:32,290 --> 01:23:35,913 Each one has a different meaning for different processes. 1467 01:23:35,913 --> 01:23:38,080 So let's take a look here at our bank account table. 1468 01:23:38,080 --> 01:23:42,280 Let's say two computers want to access this account's table. 1469 01:23:42,280 --> 01:23:47,560 Well at this time because nobody is accessing our database, it is unlocked. 1470 01:23:47,560 --> 01:23:50,650 Anyone could read from the database, sort of see what's inside, 1471 01:23:50,650 --> 01:23:54,490 or write to it-- add some data or update some data. 1472 01:23:54,490 --> 01:23:59,650 But let's say we have one computer here who wants to read from this database. 1473 01:23:59,650 --> 01:24:02,560 They want to see what data is inside. 1474 01:24:02,560 --> 01:24:05,650 Well that database in SQLite would acquire 1475 01:24:05,650 --> 01:24:08,650 what's called a shared lock like this. 1476 01:24:08,650 --> 01:24:12,130 They're able to read the database but also 1477 01:24:12,130 --> 01:24:14,500 it's to allow others to read as well. 1478 01:24:14,500 --> 01:24:17,260 A shared lock means I'm only reading. 1479 01:24:17,260 --> 01:24:20,180 You too could read as well. 1480 01:24:20,180 --> 01:24:22,180 So let's say this computer comes in. 1481 01:24:22,180 --> 01:24:24,760 They decide to read from this table. 1482 01:24:24,760 --> 01:24:28,090 They could acquire the same shared lock. 1483 01:24:28,090 --> 01:24:31,240 And because we're only reading, not updating, 1484 01:24:31,240 --> 01:24:35,120 multiple transactions can read all at once. 1485 01:24:35,120 --> 01:24:37,820 We don't have to make these sequential. 1486 01:24:37,820 --> 01:24:42,170 But if we have a transaction that's hoping to write to the database, 1487 01:24:42,170 --> 01:24:43,730 to update some value-- 1488 01:24:43,730 --> 01:24:46,370 well that's when things get a little weird. 1489 01:24:46,370 --> 01:24:50,760 I can't let somebody else read my data while I'm updating it. 1490 01:24:50,760 --> 01:24:53,690 So what I need is for this process here, if it's 1491 01:24:53,690 --> 01:24:57,290 writing, to acquire an exclusive lock. 1492 01:24:57,290 --> 01:25:01,100 Meaning this is the only computer, only transaction that 1493 01:25:01,100 --> 01:25:04,310 can read and write to this database. 1494 01:25:04,310 --> 01:25:09,540 All others must hold off in order to be sequential. 1495 01:25:09,540 --> 01:25:13,320 So we've seen transactions then are sequential 1496 01:25:13,320 --> 01:25:17,070 and they use this idea of a lock to ensure other folks 1497 01:25:17,070 --> 01:25:22,210 can't read when they're doing something sensitive like writing to the database. 1498 01:25:22,210 --> 01:25:28,210 Let me ask now what questions we have on locks and how they work for us here? 1499 01:25:28,210 --> 01:25:30,680 AUDIENCE: OK, my question is that suppose 1500 01:25:30,680 --> 01:25:33,080 you're one computer [INAUDIBLE]. 1501 01:25:33,080 --> 01:25:38,030 For what amount this would allow to do the exclusive lock? 1502 01:25:38,030 --> 01:25:43,610 Maybe if the timestamp is larger, then some other computers 1503 01:25:43,610 --> 01:25:46,740 will get timeout exception and other things, right? 1504 01:25:46,740 --> 01:25:48,800 Because it's not allowed to do-- 1505 01:25:48,800 --> 01:25:50,330 exclusively it's a lock. 1506 01:25:50,330 --> 01:25:52,370 How are we going to decide those things? 1507 01:25:52,370 --> 01:25:55,010 At what time one instance is going to be-- 1508 01:25:55,010 --> 01:25:57,440 put the exclusive lock on the resource? 1509 01:25:57,440 --> 01:25:59,600 CARTER ZENKE: So the question is, at what point 1510 01:25:59,600 --> 01:26:02,690 does the transaction actually get an exclusive lock, 1511 01:26:02,690 --> 01:26:07,050 and how do we decide how to prioritize different transactions here? 1512 01:26:07,050 --> 01:26:10,310 Well there are a few algorithms you could use, one of which 1513 01:26:10,310 --> 01:26:14,090 is simply taking the transaction that just came first. 1514 01:26:14,090 --> 01:26:17,630 If at this point I receive a request for an exclusive lock, 1515 01:26:17,630 --> 01:26:22,040 I might just wait for others to finish reading and allow that one to proceed. 1516 01:26:22,040 --> 01:26:26,490 I could also use a process called time-stamping, which is similar to this 1517 01:26:26,490 --> 01:26:30,230 where I look at the timestamp, the time my database tried to write to the log 1518 01:26:30,230 --> 01:26:33,560 and use the one that is earliest in that case. 1519 01:26:33,560 --> 01:26:38,210 Now it's important to note that an exclusive lock does block others 1520 01:26:38,210 --> 01:26:40,380 from running their own transactions. 1521 01:26:40,380 --> 01:26:44,480 If I have an exclusive lock, no one else can run their transaction. 1522 01:26:44,480 --> 01:26:48,290 That is kind of a necessary downside in this case 1523 01:26:48,290 --> 01:26:53,920 to ensure that I don't make mistakes in updating my table. 1524 01:26:53,920 --> 01:26:55,480 Let's take one more here. 1525 01:26:55,480 --> 01:26:57,850 AUDIENCE: What's the granularity on locking? 1526 01:26:57,850 --> 01:27:02,050 I mean, you seem to say you locked the database, but that's craziness. 1527 01:27:02,050 --> 01:27:05,080 Do you lock a table or do you lock a record on a table? 1528 01:27:05,080 --> 01:27:07,607 CARTER ZENKE: Yeah, it might depend on the DBMS itself. 1529 01:27:07,607 --> 01:27:13,540 So at the very coarsest, a lock might just lock the entire database. 1530 01:27:13,540 --> 01:27:15,517 I can't actually see anything inside of it. 1531 01:27:15,517 --> 01:27:18,100 And I'll be able to demonstrate that for you in a minute here. 1532 01:27:18,100 --> 01:27:19,420 Let me go back to my computer. 1533 01:27:19,420 --> 01:27:22,900 Let me try to deliberately lock this database. 1534 01:27:22,900 --> 01:27:26,260 If I go to my terminal, I can begin a transaction that 1535 01:27:26,260 --> 01:27:28,870 automatically has an exclusive lock. 1536 01:27:28,870 --> 01:27:32,980 In SQLite the statement is BEGIN EXCLUSIVE. 1537 01:27:32,980 --> 01:27:34,930 BEGIN EXCLUSIVE TRANSACTION. 1538 01:27:34,930 --> 01:27:39,280 I'll hit Enter like this, and now I'm in the middle of this transaction. 1539 01:27:39,280 --> 01:27:39,910 It's exclusive. 1540 01:27:39,910 --> 01:27:43,060 Has an exclusive lock nobody else can read. 1541 01:27:43,060 --> 01:27:47,230 If I go back to my other terminal here, connect to my database, 1542 01:27:47,230 --> 01:27:51,880 I could try SELECT * FROM "accounts" just to read from this table. 1543 01:27:51,880 --> 01:27:54,280 And now I see Runtime error-- 1544 01:27:54,280 --> 01:27:55,900 database is locked. 1545 01:27:55,900 --> 01:28:00,280 The entire database cannot be read from at the very coarsest level 1546 01:28:00,280 --> 01:28:02,170 of an exclusive lock. 1547 01:28:02,170 --> 01:28:04,600 Now because these locks are so coarse-- 1548 01:28:04,600 --> 01:28:09,310 I have to lock the entire database, SQLite has its own module, 1549 01:28:09,310 --> 01:28:14,050 a way of prioritizing transactions and ensuring that a transaction only 1550 01:28:14,050 --> 01:28:17,590 has an exclusive lock for the duration it absolutely needs to. 1551 01:28:17,590 --> 01:28:19,802 Because otherwise you would slow down substantially 1552 01:28:19,802 --> 01:28:21,760 if I had to wait and wait and wait for somebody 1553 01:28:21,760 --> 01:28:25,400 to finish with their transaction. 1554 01:28:25,400 --> 01:28:26,770 OK. 1555 01:28:26,770 --> 01:28:31,900 So this then is how we can handle not just one query at a time but multiple. 1556 01:28:31,900 --> 01:28:35,860 And today we've seen how to optimize our databases from making sure queries 1557 01:28:35,860 --> 01:28:39,670 are faster to also making sure our databases use as little space as 1558 01:28:39,670 --> 01:28:40,420 possible. 1559 01:28:40,420 --> 01:28:43,270 We'll see next time how to actually scale up 1560 01:28:43,270 --> 01:28:45,460 our databases using other DBMSs-- 1561 01:28:45,460 --> 01:28:48,280 like MySQL-- besides SQLite. 1562 01:28:48,280 --> 01:28:51,120 We'll see you next time. 1563 01:28:51,120 --> 01:28:52,000