WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:03.486
[MUSIC PLAYING]

00:00:16.434 --> 00:00:19.090
CARTER ZENKE: Well hello,
one and all, and welcome

00:00:19.090 --> 00:00:22.510
back to CS50 Introduction
to Databases with SQL.

00:00:22.510 --> 00:00:25.090
My name is Carter Zenke,
and today we'll focus

00:00:25.090 --> 00:00:29.050
on optimizing-- trying to
reduce the time our queries take

00:00:29.050 --> 00:00:32.500
as well as reducing the storage
space our database might use.

00:00:32.500 --> 00:00:34.960
We'll also see how to
handle concurrency.

00:00:34.960 --> 00:00:38.300
That is, many queries all at once.

00:00:38.300 --> 00:00:41.560
Now we'll do all of this in
the context of a new database,

00:00:41.560 --> 00:00:45.940
this one-- the Internet Movies
Database, or IMDb for short.

00:00:45.940 --> 00:00:51.680
If you've been to imdb.com you may have
seen this whole big database of movies,

00:00:51.680 --> 00:00:56.360
and we've compiled that into our very
own SQLite database for us today.

00:00:56.360 --> 00:00:58.830
Now this database is big.

00:00:58.830 --> 00:01:01.240
It is much bigger than
anything we've seen so far.

00:01:01.240 --> 00:01:07.090
In fact, it has over 200,000 movie
ratings, over 400,000 movies,

00:01:07.090 --> 00:01:12.850
and 1.4 million stars or people
who have acted in some movie.

00:01:12.850 --> 00:01:17.360
And the goal of this database is to
represent movies, the people in them,

00:01:17.360 --> 00:01:21.280
and the average ratings
for each of those movies.

00:01:21.280 --> 00:01:24.250
Now we have ourselves an
entity relationship diagram

00:01:24.250 --> 00:01:27.470
to describe this database,
and it looks a bit like this.

00:01:27.470 --> 00:01:31.690
We have people in one
table, movies in another,

00:01:31.690 --> 00:01:34.870
and ratings in a
separate table down here.

00:01:34.870 --> 00:01:37.870
Notice that people have
a name and a birthday

00:01:37.870 --> 00:01:41.320
along with an ID, the
primary key for that table.

00:01:41.320 --> 00:01:45.020
Movies too have a title and
the year they were released

00:01:45.020 --> 00:01:48.130
as well as their own
primary key ID here.

00:01:48.130 --> 00:01:51.970
Rating down here has the
average rating for a given movie

00:01:51.970 --> 00:01:55.540
as well as the number of votes
that went into calculating

00:01:55.540 --> 00:01:57.220
that average rating here.

00:01:57.220 --> 00:02:02.950
And notice too that the movie and people
table has a many-to-many relationship.

00:02:02.950 --> 00:02:06.790
That is, a person can star
in more than one movie

00:02:06.790 --> 00:02:11.450
in the same way a movie
can have multiple stars.

00:02:11.450 --> 00:02:14.560
So if we want to implement
this many-to-many relationship,

00:02:14.560 --> 00:02:17.230
we've seen one way to do that.

00:02:17.230 --> 00:02:19.720
Let me actually ask
the audience here, how

00:02:19.720 --> 00:02:23.470
have we implemented a many-to-many
relationship historically?

00:02:23.470 --> 00:02:27.170
What kind of tables do we
need for this, do you think?

00:02:27.170 --> 00:02:27.670
All right.

00:02:27.670 --> 00:02:30.320
So not to worry if you don't have an
idea of what we're going to use here.

00:02:30.320 --> 00:02:32.528
Let's take a peek at what
tables we can actually use.

00:02:32.528 --> 00:02:37.030
So here as we've seen historically we
might have two tables, one for people

00:02:37.030 --> 00:02:38.770
and one for movies.

00:02:38.770 --> 00:02:41.800
And notice here in the
people table we have names

00:02:41.800 --> 00:02:45.130
along with a primary key called ID.

00:02:45.130 --> 00:02:49.300
And similarly in our movies
table, we also have a title

00:02:49.300 --> 00:02:55.180
column along with a movies own
primary key, an ID column over there.

00:02:55.180 --> 00:02:59.380
Now to implement the relationship
between people and movies,

00:02:59.380 --> 00:03:02.810
we have to have some table
in the middle of these two.

00:03:02.810 --> 00:03:08.050
One that relates people IDs
or person IDs with movie IDs.

00:03:08.050 --> 00:03:09.940
And if we look at this
table, I could see

00:03:09.940 --> 00:03:14.020
that if I see person ID next
to some given movie ID, well

00:03:14.020 --> 00:03:17.260
it means that the person
with this ID has starred in,

00:03:17.260 --> 00:03:19.660
played a role in, this movie here.

00:03:19.660 --> 00:03:21.610
We see person 158--

00:03:21.610 --> 00:03:27.160
let's see, Tom Hanks-- was in the
movie ID with 114709, or Toy Story

00:03:27.160 --> 00:03:29.500
as we see in the movies table.

00:03:29.500 --> 00:03:33.160
And now knowing this, how
can we figure out which

00:03:33.160 --> 00:03:36.160
movies Kristen Bell has starred in?

00:03:36.160 --> 00:03:38.770
If we know of Kristen
Bell's ID, what should we

00:03:38.770 --> 00:03:42.841
then do to find the movies
Kristen Bell has starred in?

00:03:42.841 --> 00:03:48.490
AUDIENCE: You could select the ID of
the people then look into the stars

00:03:48.490 --> 00:03:51.462
table, then look at
[INAUDIBLE] the movies table.

00:03:51.462 --> 00:03:53.170
CARTER ZENKE: Yeah,
I like your thinking.

00:03:53.170 --> 00:03:54.610
So there's multiple steps here.

00:03:54.610 --> 00:03:58.780
We have to go through each table to
find ultimately the movie that Kristen

00:03:58.780 --> 00:03:59.950
Bell has starred in.

00:03:59.950 --> 00:04:03.610
The first step might be to
ask, what is Kristen Bell's ID?

00:04:03.610 --> 00:04:06.310
Well it seems to be 68338.

00:04:06.310 --> 00:04:09.040
Then, to your point, we
could look at the stars table

00:04:09.040 --> 00:04:12.710
and find out what movie IDs
correspond to that ID here.

00:04:12.710 --> 00:04:17.779
So it seems like 2294629 corresponds
with Kristen Bell's ID here.

00:04:17.779 --> 00:04:20.200
Then finally we look up
in the movies table, well,

00:04:20.200 --> 00:04:23.770
what title corresponds with 2294629?

00:04:23.770 --> 00:04:25.660
It seems to be the movie Frozen.

00:04:25.660 --> 00:04:30.350
So visually, we have this
relationship among these tables here.

00:04:30.350 --> 00:04:33.610
So let's now dive into
our movies database

00:04:33.610 --> 00:04:36.100
to actually get a sense of
what data is inside of this.

00:04:36.100 --> 00:04:39.100
And for that I'll go back to my
computer so we can open this database up

00:04:39.100 --> 00:04:40.300
in SQLite.

00:04:40.300 --> 00:04:44.410
I'll come back over here and
I'll open up my own environment

00:04:44.410 --> 00:04:48.250
where I can type sqlite3 movies.db.

00:04:48.250 --> 00:04:49.780
movies.db.

00:04:49.780 --> 00:04:56.980
And now if I zoom in just a bit, I
can then type .schema to show you all

00:04:56.980 --> 00:04:59.530
the tables that exist in this database.

00:04:59.530 --> 00:05:06.070
I see up top I do have a movies table
that has a primary key called ID,

00:05:06.070 --> 00:05:08.980
a column called title
and a column called year.

00:05:08.980 --> 00:05:12.040
I have that same people table
with names and birthdays,

00:05:12.040 --> 00:05:17.530
and I also have this stars table that
corresponds with person IDs and movie

00:05:17.530 --> 00:05:21.780
IDs, relating in this
case people and movies.

00:05:21.780 --> 00:05:25.600
OK, so let's try selecting
then from our movies table

00:05:25.600 --> 00:05:28.930
to see what movies we have,
take a peek at our data.

00:05:28.930 --> 00:05:30.880
Well, we saw a little
bit ago, we could try

00:05:30.880 --> 00:05:33.640
to take a peek at our data
using a certain SQL keyword.

00:05:33.640 --> 00:05:38.800
I could SELECT * from "movies"
to see all columns from movies,

00:05:38.800 --> 00:05:42.790
but if I have let's say
400,000 rows I don't

00:05:42.790 --> 00:05:44.540
want to print all those to the screen.

00:05:44.540 --> 00:05:47.480
So what could I use instead?

00:05:47.480 --> 00:05:49.598
A certain SQL keyword.

00:05:49.598 --> 00:05:51.140
I'm hearing LIMIT, so let's try that.

00:05:51.140 --> 00:05:53.000
I'll say LIMIT 5.

00:05:53.000 --> 00:05:55.100
I only want the first five results here.

00:05:55.100 --> 00:06:00.820
I'll hit Enter, and here I'll see those
first five movies in my movies table.

00:06:00.820 --> 00:06:03.850
They each have their own
unique ID, a primary key.

00:06:03.850 --> 00:06:07.360
They have a title and the
year that they were published.

00:06:07.360 --> 00:06:12.070
But if you've been to
imdb.com, you may have wanted

00:06:12.070 --> 00:06:13.600
to search for a particular movie.

00:06:13.600 --> 00:06:17.590
Maybe you wanted to find your favorite
movie among all the possible movies

00:06:17.590 --> 00:06:18.520
in their database.

00:06:18.520 --> 00:06:22.480
And we can simulate that kind of
query by querying the database here.

00:06:22.480 --> 00:06:25.930
Whenever you type in IMDd's browser--
that's like the search for a movie

00:06:25.930 --> 00:06:28.930
called Frozen, for example--

00:06:28.930 --> 00:06:31.180
odds are that behind the
scenes a database is actually

00:06:31.180 --> 00:06:33.490
running that same query
to return to you what

00:06:33.490 --> 00:06:36.220
movies have that title called Frozen.

00:06:36.220 --> 00:06:38.470
Now for me my favorite
movie since I was a kid

00:06:38.470 --> 00:06:41.650
is one called Cars, so let me
try to search for Cars here.

00:06:41.650 --> 00:06:48.700
I'll try to find SELECT * from
"movies" where the "title" = 'Cars'.

00:06:48.700 --> 00:06:52.840
And you could imagine me going to
IMDb, typing in Cars in the search bar,

00:06:52.840 --> 00:06:55.720
and behind the scenes
this SQL query gets

00:06:55.720 --> 00:06:59.920
run to find what movies
have the title Cars.

00:06:59.920 --> 00:07:05.320
So let me hit Enter here, and now I'll
see that one movie has that title Cars.

00:07:05.320 --> 00:07:10.540
Came out in 2006 and the ID is 317219.

00:07:10.540 --> 00:07:13.720
But this is a lecture on
optimizing, so it's worth

00:07:13.720 --> 00:07:18.370
asking how much time did it
take for this query to run?

00:07:18.370 --> 00:07:19.600
Let's find out.

00:07:19.600 --> 00:07:23.380
Well I can use a SQLite
command here called timer.

00:07:23.380 --> 00:07:29.080
I'll say .timer on to make sure
that I time my future queries.

00:07:29.080 --> 00:07:30.670
I'll hit Enter now.

00:07:30.670 --> 00:07:33.820
And now if I try selecting again,
I'll hit the up arrow twice

00:07:33.820 --> 00:07:35.260
to get back my old query.

00:07:35.260 --> 00:07:40.030
I can type Enter and I'll
see not just the results

00:07:40.030 --> 00:07:44.380
but the time it took to run that query.

00:07:44.380 --> 00:07:47.250
Now notice, we have several
different measures of time here.

00:07:47.250 --> 00:07:50.130
I have the quote, unquote "real
time" which is essentially

00:07:50.130 --> 00:07:54.060
the stopwatch time, meaning from the
time I press Enter to the time I saw

00:07:54.060 --> 00:07:56.835
these results, how long did that take?

00:07:56.835 --> 00:07:58.710
But of course you're
familiar with computers.

00:07:58.710 --> 00:08:02.037
You know there's more involved
than just this query running.

00:08:02.037 --> 00:08:03.870
I also have to wait for
the operating system

00:08:03.870 --> 00:08:05.340
to finish other processes as well.

00:08:05.340 --> 00:08:07.740
That's where user and
system time come in.

00:08:07.740 --> 00:08:12.070
User time is how much time this
query particularly took on the CPU.

00:08:12.070 --> 00:08:14.820
System time is how long I spent
waiting around for other processes

00:08:14.820 --> 00:08:17.130
to finish before I could run this query.

00:08:17.130 --> 00:08:19.200
For us, we'll focus on real time.

00:08:19.200 --> 00:08:22.120
What was the wall clock time,
the stopwatch time, so to speak,

00:08:22.120 --> 00:08:26.480
between the time I pressed Enter
and the time I saw these results.

00:08:26.480 --> 00:08:27.380
OK.

00:08:27.380 --> 00:08:33.350
So here we have a query
taking 0.084 seconds.

00:08:33.350 --> 00:08:35.270
I mean, that is pretty good.

00:08:35.270 --> 00:08:41.030
If I only wait around for less than a
tenth of a second, I'm not complaining.

00:08:41.030 --> 00:08:45.260
But if I'm an engineer
at IMDb, why might I

00:08:45.260 --> 00:08:50.450
be concerned about a query that
takes 0.084 seconds, particularly

00:08:50.450 --> 00:08:53.000
if I have lots and lots of users?

00:08:53.000 --> 00:08:55.120
What's the problem there?

00:08:55.120 --> 00:09:00.320
AUDIENCE: For users, the database
might be overwhelmed by the queries.

00:09:00.320 --> 00:09:05.840
And you should also consider how
much time the application takes

00:09:05.840 --> 00:09:08.738
to form the query to the database too.

00:09:08.738 --> 00:09:09.530
CARTER ZENKE: Yeah.

00:09:09.530 --> 00:09:11.720
So the basic idea here
is that when I have

00:09:11.720 --> 00:09:15.090
a single query that takes maybe less
than tenth of a second, that's fine.

00:09:15.090 --> 00:09:19.010
But if I have many, many users
running those queries all at once,

00:09:19.010 --> 00:09:21.390
that time is going to add up.

00:09:21.390 --> 00:09:24.500
And so if I'm an engineer
or even somebody in finance,

00:09:24.500 --> 00:09:28.160
I want to reduce that time because
I'm paying per second that I

00:09:28.160 --> 00:09:30.830
run this database on some other server.

00:09:30.830 --> 00:09:33.980
So we can optimize these
queries, but let's first

00:09:33.980 --> 00:09:36.470
get a sense of how
they're presently running.

00:09:36.470 --> 00:09:41.360
What is SQLite doing underneath the
hood to find this movie called Cars?

00:09:41.360 --> 00:09:45.920
Well for that, I'll bring up a certain
visual here of our titles column.

00:09:45.920 --> 00:09:50.840
So imagine here that we have a
smaller database called Movies

00:09:50.840 --> 00:09:52.890
that just has titles in this case.

00:09:52.890 --> 00:09:57.350
And now I want to search these
titles, but I'm a computer.

00:09:57.350 --> 00:09:58.190
I'm not a human.

00:09:58.190 --> 00:10:01.550
Meaning I can only look
at one item at a time.

00:10:01.550 --> 00:10:05.720
I can only look at one
row and check its title.

00:10:05.720 --> 00:10:11.570
Now if I want to find the row or
the rows that have Cars in them,

00:10:11.570 --> 00:10:16.610
what's the best I could do
if my titles are not sorted?

00:10:16.610 --> 00:10:18.890
What's the best approach
I could take here?

00:10:18.890 --> 00:10:24.922
If I can only look at one row at a
time, how might I search this column?

00:10:24.922 --> 00:10:27.990
AUDIENCE: You should use a
linear search then, right?

00:10:27.990 --> 00:10:29.680
CARTER ZENKE: So I like your intuition.

00:10:29.680 --> 00:10:31.900
There is an algorithm,
a step-by-step process

00:10:31.900 --> 00:10:34.990
called linear search where
we look at one element

00:10:34.990 --> 00:10:38.170
at a time or, in this
case one row at a time.

00:10:38.170 --> 00:10:40.750
That's exactly what we
can do for this table.

00:10:40.750 --> 00:10:44.980
I can look at one row at a time to
find is this title in this column

00:10:44.980 --> 00:10:46.360
or is it not?

00:10:46.360 --> 00:10:50.120
So here to visualize, I
could take this title column.

00:10:50.120 --> 00:10:52.970
I can only look at one at a time.

00:10:52.970 --> 00:10:58.630
But if this data is not sorted, I can
do no better than looking at every row

00:10:58.630 --> 00:10:59.920
exactly once.

00:10:59.920 --> 00:11:01.780
So here we start with Toy Story.

00:11:01.780 --> 00:11:05.110
We ask, is Toy Story equal to Cars?

00:11:05.110 --> 00:11:07.000
It's not, so we'll keep going.

00:11:07.000 --> 00:11:09.100
Here I'll check Cars 3.

00:11:09.100 --> 00:11:12.070
Is Cars 3 strictly equal to Cars?

00:11:12.070 --> 00:11:15.040
It has Cars in it, but
it's not Cars exactly.

00:11:15.040 --> 00:11:16.180
So I'll keep going.

00:11:16.180 --> 00:11:17.380
I'll look at Frozen.

00:11:17.380 --> 00:11:23.170
And Frozen is not Cars either, so I'll
keep going again and here I see Cars.

00:11:23.170 --> 00:11:26.540
So maybe I could stop here.

00:11:26.540 --> 00:11:28.360
But why couldn't I?

00:11:28.360 --> 00:11:34.300
Well if I need to find not just the
one row that has Cars but all the rows

00:11:34.300 --> 00:11:39.790
that have Cars, I have to keep going to
see is Cars elsewhere in this column?

00:11:39.790 --> 00:11:42.910
Is it not just here, but
also in other rows too?

00:11:42.910 --> 00:11:44.050
So I'll keep going.

00:11:44.050 --> 00:11:47.020
I'll look at WALL-E,
Ratatouille, Soul, Turning Red

00:11:47.020 --> 00:11:50.290
until I get to the end of this column.

00:11:50.290 --> 00:11:53.530
And this process is called
a scan of the table.

00:11:53.530 --> 00:11:58.210
We're scanning our column top
to bottom looking for the result

00:11:58.210 --> 00:12:00.160
that we need to find.

00:12:00.160 --> 00:12:03.580
But it turns out we can
actually optimize this search.

00:12:03.580 --> 00:12:07.898
We can do a bit better than
going one by one by one by one,

00:12:07.898 --> 00:12:09.940
and for this we need a
bit of a metaphor to think

00:12:09.940 --> 00:12:12.170
about how we can search our table.

00:12:12.170 --> 00:12:15.080
So if you have ever
seen a kind of textbook,

00:12:15.080 --> 00:12:17.620
you might have seen something
a bit like this here.

00:12:17.620 --> 00:12:19.780
This is a Handbook of
Education Research.

00:12:19.780 --> 00:12:23.470
And maybe I want to find
which pages correspond

00:12:23.470 --> 00:12:26.060
to how students learn computer science.

00:12:26.060 --> 00:12:29.350
So one thing I could do is
open to the start of the book.

00:12:29.350 --> 00:12:34.630
I could go page by page looking
for which pages have "how students

00:12:34.630 --> 00:12:36.700
learn computer science."

00:12:36.700 --> 00:12:40.690
I keep going here, and if I want
to find all the pages that have

00:12:40.690 --> 00:12:42.610
"how students learn
computer science" on them,

00:12:42.610 --> 00:12:45.610
I can do no better than
going through every one

00:12:45.610 --> 00:12:49.600
by one by one until I'm at
the end of this textbook.

00:12:49.600 --> 00:12:53.420
But chances are you
yourself don't do this.

00:12:53.420 --> 00:12:56.270
You do something better.

00:12:56.270 --> 00:13:00.010
And there's some built-in
structure in this book that could

00:13:00.010 --> 00:13:01.930
help you find what you're looking for.

00:13:01.930 --> 00:13:07.170
What might you use to find something
in a textbook or some other book too?

00:13:07.170 --> 00:13:10.300
AUDIENCE: We can use
the index of the book.

00:13:10.300 --> 00:13:12.100
CARTER ZENKE: You can use the index.

00:13:12.100 --> 00:13:14.470
So often a book like a
textbook has something

00:13:14.470 --> 00:13:17.440
at the very back of
the book that tells me

00:13:17.440 --> 00:13:21.370
all the subjects that are inside
this book, but in alphabetical order.

00:13:21.370 --> 00:13:24.550
So I can see here there is a subject
index in the back of the book

00:13:24.550 --> 00:13:28.330
here that lists the subjects
in alphabetical order, A to Z.

00:13:28.330 --> 00:13:31.210
And if I want to find "how
students learn computer science,"

00:13:31.210 --> 00:13:34.300
I could perhaps search for
computer science in this index.

00:13:34.300 --> 00:13:37.990
I could find A, B, C. There's C,
and here I see computer science.

00:13:37.990 --> 00:13:42.310
Now I know all the pages I need
to look at inside this book

00:13:42.310 --> 00:13:46.490
without going through each
of them one by one by one.

00:13:46.490 --> 00:13:50.080
So in the same way that
books have an index, so too

00:13:50.080 --> 00:13:53.810
can databases and tables
have an index as well.

00:13:53.810 --> 00:13:57.400
So let's see a definition
here for what an index is.

00:13:57.400 --> 00:14:00.340
Well an index is a
structure that we can use.

00:14:00.340 --> 00:14:03.010
It's built into our
database to help us speed up

00:14:03.010 --> 00:14:06.550
the retrieval of rows from that table.

00:14:06.550 --> 00:14:12.670
I first search my index, find out
what rows my given search term is in,

00:14:12.670 --> 00:14:16.730
and then look those rows up in my table.

00:14:16.730 --> 00:14:22.540
Now if I want to create an index on
a table, I can use some SQL syntax.

00:14:22.540 --> 00:14:27.760
I can say "create index"
and give it some name.

00:14:27.760 --> 00:14:31.270
But I need to also specify
a few other things.

00:14:31.270 --> 00:14:35.500
Similar to my books index,
that index has some content.

00:14:35.500 --> 00:14:38.770
It's an index by subject
or an index by author.

00:14:38.770 --> 00:14:42.670
So I need to tell this
index what to include.

00:14:42.670 --> 00:14:47.440
So I should also say create index,
give it a name on some table

00:14:47.440 --> 00:14:52.720
and some columns in that table,
telling my index what data for my table

00:14:52.720 --> 00:14:55.090
it should include.

00:14:55.090 --> 00:14:59.230
So let's create our very own
index in the movies table

00:14:59.230 --> 00:15:02.690
to identify how we can make this
query just a bit faster overall.

00:15:02.690 --> 00:15:08.440
So I'll come back to my computer here,
and let's now try to create this index.

00:15:08.440 --> 00:15:11.290
I will open up a new tab here.

00:15:11.290 --> 00:15:12.070
A new tab.

00:15:12.070 --> 00:15:14.500
And then I'll log back into my database.

00:15:14.500 --> 00:15:17.950
I'll say sqlite3 movies.db.

00:15:17.950 --> 00:15:22.550
And now I want to first
create some index here.

00:15:22.550 --> 00:15:27.280
So I'll turn my timer back on and now
I'll type the syntax we just learned.

00:15:27.280 --> 00:15:29.810
I want to create an index.

00:15:29.810 --> 00:15:33.760
And if I'm searching my
title column in Movies,

00:15:33.760 --> 00:15:38.110
I should probably include the
Title column in this index.

00:15:38.110 --> 00:15:41.470
So I'll say call this
the "title_index" here.

00:15:41.470 --> 00:15:46.450
It'll be on the Movies table and
it'll contain the data in the Title

00:15:46.450 --> 00:15:49.850
column of Movies a bit like this.

00:15:49.850 --> 00:15:55.052
Now I'll hit Enter and I'll see this
index took some time to be created.

00:15:55.052 --> 00:15:57.010
There's some stuff going
on underneath the hood

00:15:57.010 --> 00:16:00.940
where I'm putting this data in a new
structure we're calling the index here.

00:16:00.940 --> 00:16:07.930
But now if I rerun this query,
if I say SELECT * FROM "movies,"

00:16:07.930 --> 00:16:15.450
SELECT * from "movies" where
the "title" = 'Cars' semicolon,

00:16:15.450 --> 00:16:16.620
it was a lot faster.

00:16:16.620 --> 00:16:25.590
Here I see 0.01 seconds real
time compared to 0.084 seconds.

00:16:25.590 --> 00:16:28.380
That's a savings of almost eight times.

00:16:28.380 --> 00:16:31.740
Almost eight times faster
now that we have this index.

00:16:31.740 --> 00:16:34.260
Now it took some time
to create the index.

00:16:34.260 --> 00:16:38.070
If I run just a few
queries, I much easily

00:16:38.070 --> 00:16:40.770
make that time up in
terms of the time I'm

00:16:40.770 --> 00:16:46.560
using on my server to look up these
kinds of pieces of information.

00:16:46.560 --> 00:16:50.760
OK, so let's actually take a
look at how we can ensure a query

00:16:50.760 --> 00:16:53.460
is using some index.

00:16:53.460 --> 00:16:55.980
Here I'm just kind of
taking it on my word.

00:16:55.980 --> 00:16:57.780
Here it was faster,
so it used the index.

00:16:57.780 --> 00:17:01.710
But there is a way for you to
know whether an index was used.

00:17:01.710 --> 00:17:06.839
I could use a SQL statement
called EXPLAIN QUERY PLAN.

00:17:06.839 --> 00:17:10.329
Other DBMSs might call
this just simply EXPLAIN,

00:17:10.329 --> 00:17:14.190
but in SQLite it's EXPLAIN QUERY PLAN.

00:17:14.190 --> 00:17:17.819
And then after EXPLAIN QUERY
PLAN, I can type the query

00:17:17.819 --> 00:17:20.760
that I want to learn the plan for.

00:17:20.760 --> 00:17:24.880
That is, how does SQLite plan
to find me this information?

00:17:24.880 --> 00:17:27.579
How does it plan to execute this query?

00:17:27.579 --> 00:17:33.540
So now I'll type SELECT * FROM
"movies" WHERE "title" = 'Cars'

00:17:33.540 --> 00:17:35.460
like this, semicolon.

00:17:35.460 --> 00:17:36.600
I'll hit Enter.

00:17:36.600 --> 00:17:39.420
And now I don't get
the results, but I do

00:17:39.420 --> 00:17:45.060
get the plan SQLite intends
to use to find me this result.

00:17:45.060 --> 00:17:46.350
And take a look here.

00:17:46.350 --> 00:17:51.930
I see QUERY PLAN, and underneath
I see what SQLite intends to do.

00:17:51.930 --> 00:17:55.260
It will search the movies
table, but it will do so

00:17:55.260 --> 00:17:58.410
using the index that
we called title_index.

00:17:58.410 --> 00:18:02.560
And it will do so given
some value for title here.

00:18:02.560 --> 00:18:06.210
So we can actually see SQLite
does plan to use the index

00:18:06.210 --> 00:18:08.050
to speed up this query.

00:18:08.050 --> 00:18:11.770
We can confirm that this is
exactly what's happening.

00:18:11.770 --> 00:18:15.450
But let me try now to
remove this index and see

00:18:15.450 --> 00:18:19.380
what SQLite would have done
when we didn't have this index.

00:18:19.380 --> 00:18:23.070
I'll come back over here and I'll
go back to my other connection

00:18:23.070 --> 00:18:24.030
to my database.

00:18:24.030 --> 00:18:25.470
This one over here.

00:18:25.470 --> 00:18:28.290
And now I'll remove the
index we just created.

00:18:28.290 --> 00:18:30.510
To remove an index I can drop it.

00:18:30.510 --> 00:18:35.948
I can say DROP INDEX
"title_index" semicolon.

00:18:35.948 --> 00:18:37.740
And now that took just
some amount of time,

00:18:37.740 --> 00:18:39.365
but now I don't have the index anymore.

00:18:39.365 --> 00:18:45.180
I can type .schema, and I'll see
that index is not part of my schema.

00:18:45.180 --> 00:18:54.330
But now if I try EXPLAIN QUERY PLAN
SELECT * from "movies" WHERE "title"

00:18:54.330 --> 00:19:00.780
= 'Cars' semicolon, I'll see that
the plan is not to use the index.

00:19:00.780 --> 00:19:04.860
The plan is to scan movies,
where scan, we saw before,

00:19:04.860 --> 00:19:08.130
means to go top to bottom
through that title column

00:19:08.130 --> 00:19:12.070
and find me all the rows
that have Cars in them.

00:19:12.070 --> 00:19:14.460
So you can see here the
optimization we're already

00:19:14.460 --> 00:19:18.890
making by using this index here.

00:19:18.890 --> 00:19:21.740
OK, so let's take a
moment here and pause.

00:19:21.740 --> 00:19:26.780
What questions do we have about indexes
and how they're used in this case?

00:19:26.780 --> 00:19:33.950
AUDIENCE: Doesn't databases have
implicit algorithms to optimize search?

00:19:33.950 --> 00:19:35.420
CARTER ZENKE: Yeah, good question.

00:19:35.420 --> 00:19:39.050
Do databases have implicit
algorithms to optimize search?

00:19:39.050 --> 00:19:41.190
They do for some columns.

00:19:41.190 --> 00:19:46.520
So if we saw in our table earlier,
we had a primary key column like ID.

00:19:46.520 --> 00:19:52.310
In SQLite and most other DBMSs, if I
specify that a column is a primary key

00:19:52.310 --> 00:19:55.430
then that database management
system will automatically

00:19:55.430 --> 00:19:59.280
create an index via which I
can search for a primary key.

00:19:59.280 --> 00:20:03.920
If though I have a regular old column
like title or like year or so on,

00:20:03.920 --> 00:20:07.130
it doesn't go through the process
of automatically optimizing

00:20:07.130 --> 00:20:08.030
that search for me.

00:20:08.030 --> 00:20:12.457
I have to try to do that
myself as we're doing here.

00:20:12.457 --> 00:20:13.290
Let's take one more.

00:20:13.290 --> 00:20:16.000
AUDIENCE: Would it be advisable
to create a different index

00:20:16.000 --> 00:20:17.800
for every column in case we need it?

00:20:17.800 --> 00:20:20.380
CARTER ZENKE: Yeah, I mean it
seems like indexes are so good.

00:20:20.380 --> 00:20:21.790
They speed up queries for us.

00:20:21.790 --> 00:20:25.030
Why not create an index
on every column we have?

00:20:25.030 --> 00:20:26.800
Well as we'll see in
just a moment, there

00:20:26.800 --> 00:20:30.730
are some trade-offs involving space
and also the time it later takes

00:20:30.730 --> 00:20:33.020
us to insert or add new data.

00:20:33.020 --> 00:20:33.910
So good questions.

00:20:33.910 --> 00:20:36.620
I'll come back to my laptop
where we can continue on.

00:20:36.620 --> 00:20:40.870
So we've seen how indexes
can speed up queries.

00:20:40.870 --> 00:20:43.900
We're searching for one column
here, but they could also speed up

00:20:43.900 --> 00:20:46.310
queries across multiple tables.

00:20:46.310 --> 00:20:48.910
So let's try a different
kind of query here.

00:20:48.910 --> 00:20:52.810
What I might do now is try
to find all of those movies

00:20:52.810 --> 00:20:55.480
that Tom Hanks starred in.

00:20:55.480 --> 00:20:59.230
And we did a bit of a example of
this a little earlier visually,

00:20:59.230 --> 00:21:04.450
but my goal is to find all of those
titles that Tom Hanks played a role in.

00:21:04.450 --> 00:21:07.690
Well if my goal is to end
up with titles of movies,

00:21:07.690 --> 00:21:13.520
I could try selecting "title" from
the "movies" column like this.

00:21:13.520 --> 00:21:16.720
But if I look at that
movie's table, oh no, I only

00:21:16.720 --> 00:21:20.740
have titles, year, and
the ID of those movies.

00:21:20.740 --> 00:21:23.230
I don't have anything
related to Tom Hanks.

00:21:23.230 --> 00:21:27.250
Well Tom Hanks though is probably
showing up in the stars table

00:21:27.250 --> 00:21:30.940
where I could correspond Tom
Hanks's ID with the movie IDs

00:21:30.940 --> 00:21:32.950
that Tom Hanks played a role in.

00:21:32.950 --> 00:21:35.860
So let me try filtering
these movies by an ID.

00:21:35.860 --> 00:21:40.450
I'll say WHERE the "id" of that
movie is IN some list of movie IDs

00:21:40.450 --> 00:21:43.960
that also correspond with
Tom Hanks's person ID.

00:21:43.960 --> 00:21:47.650
So I'll make a subquery here
to find exactly those films.

00:21:47.650 --> 00:21:54.340
Now I'll say SELECT "movie_id"
FROM "stars" where the "person _id"

00:21:54.340 --> 00:21:56.140
is equal to--

00:21:56.140 --> 00:21:58.180
well I don't know Tom Hanks's person ID.

00:21:58.180 --> 00:22:00.880
I could try to hard-code it
here, but I could do better

00:22:00.880 --> 00:22:05.710
by having another subquery to
find what is Tom Hanks's ID?

00:22:05.710 --> 00:22:08.950
Now I'll indent not just
four times but eight

00:22:08.950 --> 00:22:12.160
to show this is my next
level of a subquery here.

00:22:12.160 --> 00:22:19.270
I'll now SELECT "id" FROM "people"
WHERE the "name" = 'Tom Hanks'.

00:22:19.270 --> 00:22:21.020
'Tom Hanks' like this.

00:22:21.020 --> 00:22:23.950
And now I can close out my subqueries.

00:22:23.950 --> 00:22:28.300
I've found the ID of Tom Hanks,
I've found those movie IDs that

00:22:28.300 --> 00:22:30.970
correspond with the
person ID of Tom Hanks,

00:22:30.970 --> 00:22:34.900
and then I've found the titles
that correspond to those movie IDs.

00:22:34.900 --> 00:22:39.100
Now I'll hit a semicolon here,
press Enter, and I should see--

00:22:39.100 --> 00:22:43.300
I get back all of Tom Hanks's movies,
all of those that he played a role in.

00:22:43.300 --> 00:22:45.910
And I'll also see the time that it took.

00:22:45.910 --> 00:22:50.840
The real time for this query
was about 0.197 seconds.

00:22:50.840 --> 00:22:54.880
So getting a little bit slower now that
we're looking across multiple tables

00:22:54.880 --> 00:22:57.580
and having multiple subqueries.

00:22:57.580 --> 00:23:00.890
One index could speed this up as well.

00:23:00.890 --> 00:23:03.555
I can make a new terminal here.

00:23:03.555 --> 00:23:05.680
Let me open up the connection
to my database again.

00:23:05.680 --> 00:23:08.470
I'll say sqlite3 movies.db.

00:23:08.470 --> 00:23:12.490
And now I should figure
out what columns do

00:23:12.490 --> 00:23:15.670
I need to index to speed up this query?

00:23:15.670 --> 00:23:19.940
Well one thing I could do is
try EXPLAIN QUERY PLAN again.

00:23:19.940 --> 00:23:25.310
I want to understand what SQLite
plans to do to execute this query.

00:23:25.310 --> 00:23:29.020
So I'll EXPLAIN QUERY PLAN and I'll
type in the same query from before.

00:23:29.020 --> 00:23:33.430
I'll say SELECT the
"title" FROM "movies" WHERE

00:23:33.430 --> 00:23:37.720
the "id" is IN those movies
that Tom Hanks played a role in.

00:23:37.720 --> 00:23:45.400
So I'll SELECT in this case movie_id
from the "stars" table where

00:23:45.400 --> 00:23:49.180
the person_id equals Tom Hanks's ID.

00:23:49.180 --> 00:23:50.450
Now I'll find that.

00:23:50.450 --> 00:23:53.650
I'll indent eight times to show
this is my next level subquery.

00:23:53.650 --> 00:23:59.050
I'll SELECT "id" from
"people" WHERE "name"

00:23:59.050 --> 00:24:02.510
= 'Tom Hanks' and I'll wrap
the line just a little bit.

00:24:02.510 --> 00:24:04.510
Let me zoom out so you
can see it all in one go.

00:24:04.510 --> 00:24:10.570
And now I'll close out these
subqueries just like this, hit Enter,

00:24:10.570 --> 00:24:16.090
and I should see if I zoom back in the
plan SQLite intends to use to find me

00:24:16.090 --> 00:24:17.590
the results of this query.

00:24:17.590 --> 00:24:19.490
Now let's take a look here.

00:24:19.490 --> 00:24:21.900
I'm planning to first scan people.

00:24:21.900 --> 00:24:23.650
To understand the
structure of this I need

00:24:23.650 --> 00:24:26.500
to first go all the way in
to this bottom part here.

00:24:26.500 --> 00:24:32.320
I'm going to scan people to
find presumably Tom Hanks's ID.

00:24:32.320 --> 00:24:36.610
Then after I do that, I'm
going to scan stars presumably

00:24:36.610 --> 00:24:41.080
to find all the movie IDs
that have Tom Hanks's ID.

00:24:41.080 --> 00:24:45.130
But then up here I'm finally
going to search movies

00:24:45.130 --> 00:24:47.320
using integer primary key.

00:24:47.320 --> 00:24:49.750
Some index is automatically
created for me

00:24:49.750 --> 00:24:52.850
as we saw before for my primary keys.

00:24:52.850 --> 00:24:56.620
So it seems like the optimization
here could be in the people

00:24:56.620 --> 00:25:00.040
table and the stars table.

00:25:00.040 --> 00:25:05.620
Now let me ask, if I wanted to
create an index on either the people

00:25:05.620 --> 00:25:09.970
table or the stars table, what
do you think I should include?

00:25:09.970 --> 00:25:15.650
Which column should I include in
this index given my query here?

00:25:15.650 --> 00:25:21.130
What column should I include in this
index given this previous query?

00:25:21.130 --> 00:25:26.180
AUDIENCE: We could index the
ID and the name of the person?

00:25:26.180 --> 00:25:27.530
CARTER ZENKE: Yeah, good idea.

00:25:27.530 --> 00:25:29.930
Index the ID and the name of the person.

00:25:29.930 --> 00:25:34.670
So if we look at this query again
we'll see that our WHERE clause here

00:25:34.670 --> 00:25:37.770
is searching by the name
in the people table,

00:25:37.770 --> 00:25:41.240
and up here this WHERE clause
is searching by the person ID

00:25:41.240 --> 00:25:43.700
column in our stars table.

00:25:43.700 --> 00:25:46.280
So it would stand to reason
that I should actually

00:25:46.280 --> 00:25:49.850
create an index that includes
the name column in people

00:25:49.850 --> 00:25:52.250
and the person ID column in stars.

00:25:52.250 --> 00:25:54.200
Perhaps two separate indexes.

00:25:54.200 --> 00:25:55.710
One for each year.

00:25:55.710 --> 00:25:57.470
So let's go ahead and create those now.

00:25:57.470 --> 00:26:03.350
I'll come back to my computer and
here I will try to create this index.

00:26:03.350 --> 00:26:06.590
I'll say first create an index.

00:26:06.590 --> 00:26:10.040
CREATE INDEX called "person_index."

00:26:10.040 --> 00:26:16.070
And I'll create it on stars and
the person ID column in stars

00:26:16.070 --> 00:26:17.960
as we just discussed.

00:26:17.960 --> 00:26:23.600
Now if I hit Enter I should let it
take some time to build up this index,

00:26:23.600 --> 00:26:26.270
but now I can also create my next one.

00:26:26.270 --> 00:26:35.300
I'll say CREATE INDEX, and then I'll
choose "name_index" on people table.

00:26:35.300 --> 00:26:37.890
And I'll include the name column here.

00:26:37.890 --> 00:26:39.710
So now I'll hit Enter.

00:26:39.710 --> 00:26:42.980
That too will take some
time to run, but now

00:26:42.980 --> 00:26:45.350
if I try to explain
the query plan again--

00:26:45.350 --> 00:26:50.630
I'll clear my screen, try EXPLAIN QUERY
PLAN, and then I'll type in this query

00:26:50.630 --> 00:26:51.170
again.

00:26:51.170 --> 00:26:56.060
I'll say SELECT "title" FROM movies
WHERE the "id" is IN those movies

00:26:56.060 --> 00:26:59.600
that Tom Hanks played a role in
that have the person ID next to it.

00:26:59.600 --> 00:27:03.590
Then I'll say which is Tom Hanks's ID?

00:27:03.590 --> 00:27:08.650
And I'll close out the
subquery here just like this.

00:27:08.650 --> 00:27:12.480
And now if I explain
the plan yet again, I

00:27:12.480 --> 00:27:16.570
should see I'm now optimizing
this query even further.

00:27:16.570 --> 00:27:19.950
I can see down below here, my plan
is to search the people table.

00:27:19.950 --> 00:27:23.160
Not to scan it, but to
search it using what we call

00:27:23.160 --> 00:27:25.710
a covering index called "name_index."

00:27:25.710 --> 00:27:27.010
The one we just created.

00:27:27.010 --> 00:27:29.610
We'll explain covering
index in just a little bit.

00:27:29.610 --> 00:27:32.760
Next though, I'm going
to search stars using

00:27:32.760 --> 00:27:37.060
the index we just created called
"name_index" given some person ID.

00:27:37.060 --> 00:27:40.290
And then finally, I'll
search movies using

00:27:40.290 --> 00:27:42.660
an integer primary key
index, the one that's

00:27:42.660 --> 00:27:45.850
automatically created for me here.

00:27:45.850 --> 00:27:48.570
So here we see some new vocabulary.

00:27:48.570 --> 00:27:53.060
Not just using index, but
using a covering index.

00:27:53.060 --> 00:27:55.470
Well, what is a covering index?

00:27:55.470 --> 00:27:58.650
A covering index, it turns
out, is still an index.

00:27:58.650 --> 00:28:01.050
It's the same kind of
idea, but it is one

00:28:01.050 --> 00:28:06.390
in which we have all the data
we need inside the index itself.

00:28:06.390 --> 00:28:11.250
It's one in which the query data can
be found only by searching the index.

00:28:11.250 --> 00:28:14.460
There's no need for me to go
from the index to the table.

00:28:14.460 --> 00:28:17.040
I can just look it up in the index.

00:28:17.040 --> 00:28:20.850
It's similar to me searching this
textbook and getting to the index

00:28:20.850 --> 00:28:24.450
and not then needing to actually
go back in the book's pages.

00:28:24.450 --> 00:28:29.290
I can find everything I'm
looking for in the index itself.

00:28:29.290 --> 00:28:34.260
And this is good because if I need to
not just find the items in my index

00:28:34.260 --> 00:28:39.130
but then look them up in my table, that
look up in the table takes some time.

00:28:39.130 --> 00:28:41.580
So if I instead have
a covering index, one

00:28:41.580 --> 00:28:45.310
that includes all data I want to
find exactly right there in it,

00:28:45.310 --> 00:28:49.120
that's going to be faster for
me than a regular old index.

00:28:49.120 --> 00:28:54.750
So let's try then turning one
more index into a covering index.

00:28:54.750 --> 00:28:55.890
I'll come back over here.

00:28:55.890 --> 00:29:00.780
And we saw before that we
had the index on the stars

00:29:00.780 --> 00:29:03.660
table using just a regular old index.

00:29:03.660 --> 00:29:06.420
Using index person index.

00:29:06.420 --> 00:29:10.050
But what are we trying
to find from this index?

00:29:10.050 --> 00:29:13.170
We're trying to find
the movie ID column.

00:29:13.170 --> 00:29:18.910
So I could create an index that
includes the movie ID column as well.

00:29:18.910 --> 00:29:22.300
I can have indexes that
span multiple columns.

00:29:22.300 --> 00:29:23.730
So let's try that here.

00:29:23.730 --> 00:29:27.240
I'll come back over and
I'll create this new index

00:29:27.240 --> 00:29:29.730
that intends to cover our search here.

00:29:29.730 --> 00:29:35.940
I want to include not just the person
ID column but also the movie ID

00:29:35.940 --> 00:29:40.320
column so I can quickly find that
data inside the index itself.

00:29:40.320 --> 00:29:45.620
Let me drop our current implementation
of the person index like this.

00:29:45.620 --> 00:29:49.090
And I'll hit semicolon here,
and now that index is gone.

00:29:49.090 --> 00:29:53.040
But if I type now
CREATE INDEX, I'll also

00:29:53.040 --> 00:29:56.700
call this "person_index" ON "stars."

00:29:56.700 --> 00:30:00.240
and I'll say let's include
not just the person ID column.

00:30:00.240 --> 00:30:05.940
Let's also include in this case,
the movie ID column like this.

00:30:05.940 --> 00:30:08.730
Now I'll hit semicolon, Enter.

00:30:08.730 --> 00:30:11.580
And it'll take a bit of
time to run, but once it's

00:30:11.580 --> 00:30:15.420
done I can probably try to explain
the query plan again and see

00:30:15.420 --> 00:30:17.160
if we have our covering index.

00:30:17.160 --> 00:30:20.910
I'll go back up top here,
I'll EXPLAIN QUERY PLAN,

00:30:20.910 --> 00:30:24.910
and I'll choose to recreate this
query from before just like this.

00:30:24.910 --> 00:30:30.660
I'm going to select the "movie_id" FROM
"stars" and then find Tom Hanks's ID

00:30:30.660 --> 00:30:36.580
and then close out these
subqueries here just like this.

00:30:36.580 --> 00:30:41.830
And now once I EXPLAIN QUERY PLAN,
we've optimized even further.

00:30:41.830 --> 00:30:44.160
Now I have two covering indexes.

00:30:44.160 --> 00:30:48.690
I can find all the information I need
just by looking in these indexes,

00:30:48.690 --> 00:30:52.100
not by doing a table lookup.

00:30:52.100 --> 00:30:56.375
So let's now run this query and see
just how much faster it might be.

00:30:56.375 --> 00:31:00.470
I'll come back to my laptop here, and
let's get rid of EXPLAIN QUERY PLAN.

00:31:00.470 --> 00:31:02.660
Let's now turn our timer on.

00:31:02.660 --> 00:31:06.380
I'll turn the timer on like
this and I'll rerun this query.

00:31:06.380 --> 00:31:10.520
I'll say we select title from movies
where the ID is in the stars table

00:31:10.520 --> 00:31:13.790
so long as the movie ID
aligns with the person ID.

00:31:13.790 --> 00:31:18.590
Then I'll find Tom Hanks's person ID
here, I'll close out my subqueries,

00:31:18.590 --> 00:31:24.078
and I should be able to find that
this query is just a little bit--

00:31:24.078 --> 00:31:25.370
this is an understatement here.

00:31:25.370 --> 00:31:26.400
A lot faster.

00:31:26.400 --> 00:31:32.120
So if I go back to my old query here,
I'll see this one took 0.197 seconds.

00:31:32.120 --> 00:31:37.320
This one with two covering
indexes takes 0.004 seconds.

00:31:37.320 --> 00:31:40.970
This is an order of magnitude
faster than my prior query

00:31:40.970 --> 00:31:44.540
because I've used indexes now.

00:31:44.540 --> 00:31:46.940
So, let's see.

00:31:46.940 --> 00:31:53.180
If we have all these benefits of
indexes, we must be paying something.

00:31:53.180 --> 00:31:59.310
If I have this book here and I'm
trying to find some data inside of it,

00:31:59.310 --> 00:32:03.740
what might the trade-offs be
if I want to create an index?

00:32:03.740 --> 00:32:07.460
Let me think first about
this database on my computer.

00:32:07.460 --> 00:32:08.870
What am I giving up?

00:32:08.870 --> 00:32:13.400
If I'm making sure my query is faster,
what am I losing, do you think?

00:32:13.400 --> 00:32:16.010
AUDIENCE: Space will
require for indexing also.

00:32:16.010 --> 00:32:19.370
The more index we would have,
the more space and more memory

00:32:19.370 --> 00:32:21.278
we would have to consume for that.

00:32:21.278 --> 00:32:22.070
CARTER ZENKE: Yeah.

00:32:22.070 --> 00:32:25.070
So it's a general theme
that when I try to speed up

00:32:25.070 --> 00:32:28.130
something in computer science,
I have to pay some cost which

00:32:28.130 --> 00:32:29.990
is often in terms of space.

00:32:29.990 --> 00:32:34.400
That when I create an index I'm
using more space in my database.

00:32:34.400 --> 00:32:37.010
And this similarly applies
to an old fashioned textbook.

00:32:37.010 --> 00:32:39.470
If I have an index in
here, I'm literally

00:32:39.470 --> 00:32:43.550
paying in pages to print so
people can find content better.

00:32:43.550 --> 00:32:46.130
This is some extra
money I have to spend,

00:32:46.130 --> 00:32:49.670
extra pages I have to print in
order to make my queries faster,

00:32:49.670 --> 00:32:53.220
either on this book or in my database.

00:32:53.220 --> 00:32:57.650
So let's understand why
indexes take up so much space

00:32:57.650 --> 00:33:00.440
by looking at the underlying
data structure they

00:33:00.440 --> 00:33:03.110
use to store information.

00:33:03.110 --> 00:33:05.900
Well it turns out that the
part of the reason indexes

00:33:05.900 --> 00:33:09.780
take up so much space is because
they use a certain data structure,

00:33:09.780 --> 00:33:16.110
a way of organizing data called a
B-Tree, or a balanced tree structure.

00:33:16.110 --> 00:33:18.920
So not to worry if you don't
know what a tree structure is.

00:33:18.920 --> 00:33:23.030
Here's a visual to keep in mind as you
think about using trees in computer

00:33:23.030 --> 00:33:25.520
science and in indexes in particular.

00:33:25.520 --> 00:33:27.780
A tree might look a bit like this.

00:33:27.780 --> 00:33:31.640
And notice how it's similar to
a family tree with grandparents

00:33:31.640 --> 00:33:34.430
and parents and children and so on.

00:33:34.430 --> 00:33:38.450
You could if you wanted to kind
of do a 180 on this diagram.

00:33:38.450 --> 00:33:41.030
Flip it right side up
and now you might see

00:33:41.030 --> 00:33:44.010
more of a tree structure
going from top to bottom.

00:33:44.010 --> 00:33:47.300
You can see down here what might
call the trunk of the tree followed

00:33:47.300 --> 00:33:49.670
by some branches-- these over here.

00:33:49.670 --> 00:33:50.930
And then the leaves.

00:33:50.930 --> 00:33:54.890
These up top kind of the
edge of this tree diagram.

00:33:54.890 --> 00:33:56.750
And there's a few
vocabulary words you might

00:33:56.750 --> 00:33:59.090
want to know when you talk about trees.

00:33:59.090 --> 00:34:02.190
One of them is this idea of a node.

00:34:02.190 --> 00:34:06.230
So this right here is a
single node in our tree.

00:34:06.230 --> 00:34:11.210
A tree is made up of a collection
of nodes, and each of these nodes

00:34:11.210 --> 00:34:13.590
stores some value.

00:34:13.590 --> 00:34:17.900
Maybe it's a value in the column
we're trying to index, for example.

00:34:17.900 --> 00:34:22.040
But in addition to that
value, this node also

00:34:22.040 --> 00:34:26.780
stores pointers, kind of arrows
metaphorically, to other nodes

00:34:26.780 --> 00:34:30.590
to tell us where in
memory these nodes are.

00:34:30.590 --> 00:34:32.719
And by stringing these
nodes together, we

00:34:32.719 --> 00:34:36.270
can build up this tree-like structure.

00:34:36.270 --> 00:34:40.280
Now this node here points
to three other nodes.

00:34:40.280 --> 00:34:45.530
And some new vocabulary here is
that this node has three children.

00:34:45.530 --> 00:34:47.929
Going back to that
family tree example, we

00:34:47.929 --> 00:34:52.469
see that this node here has three
children which are these right here.

00:34:52.469 --> 00:34:56.078
But these nodes themselves also
have three children of their own.

00:34:56.078 --> 00:34:58.370
So if I were to go down to
the next level of our tree--

00:34:58.370 --> 00:35:02.180
well if I'm referring to this
node here, these nodes down below

00:35:02.180 --> 00:35:04.190
are that node's grandchildren.

00:35:04.190 --> 00:35:07.080
The children of its children.

00:35:07.080 --> 00:35:10.490
Now these nodes here
also have another name.

00:35:10.490 --> 00:35:14.240
We saw before they're
often called leaf nodes.

00:35:14.240 --> 00:35:15.560
They're the edge of our tree.

00:35:15.560 --> 00:35:21.040
And we know a node is a leaf node when
there are no other nodes it points to.

00:35:21.040 --> 00:35:24.970
So this is the general idea
of what an index looks like.

00:35:24.970 --> 00:35:26.920
But let's get a little
more concrete and try

00:35:26.920 --> 00:35:32.540
to build an index from our title
column in our movies table here.

00:35:32.540 --> 00:35:37.150
So let's assume we have a
table of IDs and titles here,

00:35:37.150 --> 00:35:41.140
and I want to create an
index on this title column

00:35:41.140 --> 00:35:43.340
to search it more efficiently.

00:35:43.340 --> 00:35:47.350
Well we saw before that the best I
can do when this data is unsorted

00:35:47.350 --> 00:35:49.130
is to go top to bottom.

00:35:49.130 --> 00:35:51.970
First look at Toy Story,
then look at Cars 3,

00:35:51.970 --> 00:35:54.740
then look at Frozen, all
the way down to the bottom.

00:35:54.740 --> 00:35:58.220
I can only scan this
column, so to speak.

00:35:58.220 --> 00:36:02.560
But I know I could probably do a
little better if the data is sorted.

00:36:02.560 --> 00:36:05.030
If the data is sorted I could
look roughly in the middle,

00:36:05.030 --> 00:36:07.270
and if I've gone too far,
go back a little bit.

00:36:07.270 --> 00:36:09.770
Or if I haven't gone far
enough, go forward a little bit

00:36:09.770 --> 00:36:13.910
and do a fewer number of lookups
to find data I'm looking for.

00:36:13.910 --> 00:36:16.210
So let me sort this title column.

00:36:16.210 --> 00:36:18.980
I'll put it like this.

00:36:18.980 --> 00:36:22.230
What have I done wrong?

00:36:22.230 --> 00:36:26.980
First I had this, then I
wanted to sort my title

00:36:26.980 --> 00:36:29.860
column to make it faster to search.

00:36:29.860 --> 00:36:35.365
But what rule did I break here by
sorting this title column like that?

00:36:35.365 --> 00:36:38.620
AUDIENCE: Yes, Carter, the thing
is that by sorting only the movie's

00:36:38.620 --> 00:36:43.630
name and not the ID you will break
the entire other databases that

00:36:43.630 --> 00:36:45.320
relates to that one.

00:36:45.320 --> 00:36:50.260
So for example, let's look for
a movie in which, I don't know,

00:36:50.260 --> 00:36:53.230
Johnny Depp is working or has worked on.

00:36:53.230 --> 00:36:56.080
And the result may be,
I don't know, Titanic.

00:36:56.080 --> 00:36:57.820
CARTER ZENKE: Yeah, I love that example.

00:36:57.820 --> 00:36:59.450
That really drives it home here.

00:36:59.450 --> 00:37:04.330
And in more general terms, I
had this primary key called ID.

00:37:04.330 --> 00:37:10.120
And in the prior state I had
Toy Story with the ID of 114709.

00:37:10.120 --> 00:37:14.650
A primary key means that Toy Story
should be consistently identified

00:37:14.650 --> 00:37:18.580
with this number, 114709.

00:37:18.580 --> 00:37:22.600
But to your point, Mateo, if
I sort only the title column,

00:37:22.600 --> 00:37:25.060
well now Cars has that ID--

00:37:25.060 --> 00:37:27.070
114709.

00:37:27.070 --> 00:37:29.620
I've broken that
relationship, particularly

00:37:29.620 --> 00:37:37.180
if other tables are relying on Toy
Story being identified by 114709.

00:37:37.180 --> 00:37:38.605
So I can't do this.

00:37:38.605 --> 00:37:39.730
This is bad.

00:37:39.730 --> 00:37:43.520
I need some other way
to sort my title column.

00:37:43.520 --> 00:37:46.570
So let's bring this back
to what it was here.

00:37:46.570 --> 00:37:49.300
Let me focus just on the title column.

00:37:49.300 --> 00:37:54.100
Well I can't sort this title column
because it's part of my table

00:37:54.100 --> 00:37:57.100
and the ordering of
that table does matter.

00:37:57.100 --> 00:38:00.440
What I could do though is
create a copy of this column

00:38:00.440 --> 00:38:02.180
and maybe sort that copy.

00:38:02.180 --> 00:38:06.530
So I'll literally create a copy of
this title column elsewhere in memory.

00:38:06.530 --> 00:38:09.230
And let me just remove this
title column identifier here

00:38:09.230 --> 00:38:10.850
so I have just the data.

00:38:10.850 --> 00:38:14.350
Just the data in this column
somewhere else in memory.

00:38:14.350 --> 00:38:17.990
And now I could sort
this data like this.

00:38:17.990 --> 00:38:21.370
So let's say I now want to find Cars.

00:38:21.370 --> 00:38:24.470
Well now that my data is
sorted, what could I do?

00:38:24.470 --> 00:38:27.730
I could look first in the middle--
at Ratatouille for instance--

00:38:27.730 --> 00:38:31.000
and then realize, well Cars
comes before Ratatouille.

00:38:31.000 --> 00:38:34.450
Maybe I'll go a little
higher in my index here.

00:38:34.450 --> 00:38:36.670
I'll search in the middle
of this first half.

00:38:36.670 --> 00:38:38.800
I'll look at, in this case, Frozen.

00:38:38.800 --> 00:38:41.320
And Cars-- well it still
comes before Frozen,

00:38:41.320 --> 00:38:44.560
so let me look up at the
first half of this first half.

00:38:44.560 --> 00:38:47.770
I'll then go up here and
I'll have spotted Cars.

00:38:47.770 --> 00:38:54.010
So a much faster way to search for
information now once I have it sorted.

00:38:54.010 --> 00:38:56.500
But I don't think we're quite there yet.

00:38:56.500 --> 00:39:01.090
If I wanted to find the
year that Cars was released,

00:39:01.090 --> 00:39:08.620
I found that it's in my title column but
what am I not able to do at this point?

00:39:08.620 --> 00:39:10.840
I've copied the data
somewhere else, but now I

00:39:10.840 --> 00:39:13.270
want to find some
other data in my table.

00:39:13.270 --> 00:39:15.390
What can't I do?

00:39:15.390 --> 00:39:18.148
AUDIENCE: We need to point to index.

00:39:18.148 --> 00:39:18.940
CARTER ZENKE: Yeah.

00:39:18.940 --> 00:39:21.430
I need to be able to
find some way to relate

00:39:21.430 --> 00:39:24.590
this index with the rows in my table.

00:39:24.590 --> 00:39:27.250
So here this is elsewhere in memory.

00:39:27.250 --> 00:39:31.460
It's good that I can find that
Cars is in my title column.

00:39:31.460 --> 00:39:35.050
But if I want to find the other
data in this row like the year

00:39:35.050 --> 00:39:37.060
that Cars was released,
well I have to have

00:39:37.060 --> 00:39:43.340
some way of linking this
data here, Cars, to my table.

00:39:43.340 --> 00:39:46.930
So often in an index, we won't just
copy the data to some new location

00:39:46.930 --> 00:39:47.620
and sort it.

00:39:47.620 --> 00:39:52.240
We'll also have some link between
these indexed pieces of data over here

00:39:52.240 --> 00:39:54.920
and our rows and our table back here.

00:39:54.920 --> 00:40:00.940
So for instance here is Cars,
and Cars appears to be in row 4.

00:40:00.940 --> 00:40:05.650
Similarly Cars 3 looks to
be in row 2 in my table.

00:40:05.650 --> 00:40:11.620
And I could metaphorically visually
draw some lines between Cars over here

00:40:11.620 --> 00:40:14.540
in my index and Cars in my table.

00:40:14.540 --> 00:40:17.950
I could draw one that looks a bit
like this going from Cars in my index

00:40:17.950 --> 00:40:20.510
to actually Cars in my table.

00:40:20.510 --> 00:40:24.010
I could draw one for Cars 3
going from this index place

00:40:24.010 --> 00:40:27.410
here to row 2 in my title column.

00:40:27.410 --> 00:40:32.380
So that is the way I can link
my index with my table data.

00:40:32.380 --> 00:40:38.860
I could do the rest of these too for all
of my other pieces of data in my index.

00:40:38.860 --> 00:40:42.090
So let's focus on this now just
to check for understanding.

00:40:42.090 --> 00:40:44.100
I have this index.

00:40:44.100 --> 00:40:48.180
That index has my sorted
title data and also

00:40:48.180 --> 00:40:51.820
the row number in which I
found that piece of data.

00:40:51.820 --> 00:40:56.580
So let's say here I want
to find the movie Soul.

00:40:56.580 --> 00:40:58.380
I want to find the movie Soul.

00:40:58.380 --> 00:41:03.876
How could I find Soul in my index?

00:41:03.876 --> 00:41:06.500
AUDIENCE: Yeah, since
it is already sorted

00:41:06.500 --> 00:41:08.198
I believe you could use binary search.

00:41:08.198 --> 00:41:08.990
CARTER ZENKE: Yeah.

00:41:08.990 --> 00:41:12.020
I could use algorithm called binary
search, which if you're not familiar

00:41:12.020 --> 00:41:15.320
simply means start in the
middle and if I've gone too far,

00:41:15.320 --> 00:41:17.690
look at either the left
half or the right half

00:41:17.690 --> 00:41:20.520
and repeat that process over and over.

00:41:20.520 --> 00:41:22.160
So here is my index again.

00:41:22.160 --> 00:41:25.470
I'll go roughly to the middle and
I'll look at Ratatouille here.

00:41:25.470 --> 00:41:29.390
Well Soul, I know, comes after
Ratatouille alphabetically,

00:41:29.390 --> 00:41:31.020
so what should I do?

00:41:31.020 --> 00:41:33.410
I should now look in
this second half here.

00:41:33.410 --> 00:41:35.570
I might go to Toy Story.

00:41:35.570 --> 00:41:38.720
Once I see Toy Story, I know
I've gone a little bit too far.

00:41:38.720 --> 00:41:41.720
Soul comes before Toy
Story alphabetically,

00:41:41.720 --> 00:41:45.080
so I'll look in the first
half of this second half.

00:41:45.080 --> 00:41:46.230
And what do I see?

00:41:46.230 --> 00:41:49.183
I see Soul right there.

00:41:49.183 --> 00:41:50.130
OK.

00:41:50.130 --> 00:41:52.700
So our index seems to be working.

00:41:52.700 --> 00:41:54.950
It allows me to more
efficiently find data.

00:41:54.950 --> 00:41:58.250
I no longer have to look
through every single row.

00:41:58.250 --> 00:42:03.190
I can simply use a more efficient
algorithm like binary search.

00:42:03.190 --> 00:42:08.980
But now there's one more catch which
is, if I had many, many, many rows--

00:42:08.980 --> 00:42:12.940
not just nine but literally thousands
or hundreds of thousands or millions

00:42:12.940 --> 00:42:19.090
of rows in this index, that's a lot of
data to store in one location in memory

00:42:19.090 --> 00:42:22.450
or even to load into my
computer's RAM, so to speak--

00:42:22.450 --> 00:42:24.020
its random access memory--

00:42:24.020 --> 00:42:26.740
so it can do all this jumping around.

00:42:26.740 --> 00:42:31.510
Ideally what I would do is break
this index up into smaller pieces

00:42:31.510 --> 00:42:34.300
so I don't have to load
all of this at once.

00:42:34.300 --> 00:42:38.420
And in reality our index is
not a single column like this,

00:42:38.420 --> 00:42:42.160
but often individual nodes
where each of these--

00:42:42.160 --> 00:42:48.560
1, 2, 3-- might be some node
scattered about the computer's memory.

00:42:48.560 --> 00:42:51.190
But what problem have I introduced now?

00:42:51.190 --> 00:42:53.500
If these are around
the computer's memory,

00:42:53.500 --> 00:42:56.510
I can't actually jump
between them anymore.

00:42:56.510 --> 00:43:00.140
I can't go maybe one space
above Toy Story and find Soul.

00:43:00.140 --> 00:43:03.500
This node might be elsewhere in memory.

00:43:03.500 --> 00:43:06.760
And so we'll need some
other nodes to actually help

00:43:06.760 --> 00:43:10.090
us find what we're
looking for in this index.

00:43:10.090 --> 00:43:15.010
I might have one more over here
that tells me what kind of data

00:43:15.010 --> 00:43:17.230
is in each of these nodes.

00:43:17.230 --> 00:43:21.190
Maybe I have one here that
has both Frozen and Soul.

00:43:21.190 --> 00:43:25.990
Notice how here Frozen is the
furthest down alphabetically in this

00:43:25.990 --> 00:43:26.980
node up here.

00:43:26.980 --> 00:43:30.280
And notice down below here
Soul is the furthest down

00:43:30.280 --> 00:43:33.520
alphabetically in this node here.

00:43:33.520 --> 00:43:37.150
That is, if what I'm looking
for comes before Frozen,

00:43:37.150 --> 00:43:38.800
I should look at this node.

00:43:38.800 --> 00:43:42.730
If what I'm looking for comes
after Frozen but before Soul,

00:43:42.730 --> 00:43:44.620
I should look in this node.

00:43:44.620 --> 00:43:49.390
Or if what I'm looking for comes before
none of these, it comes after Soul,

00:43:49.390 --> 00:43:53.390
I should look at this node down here.

00:43:53.390 --> 00:43:54.910
So let's find an example.

00:43:54.910 --> 00:43:56.970
Let's say I want to find Cars now.

00:43:56.970 --> 00:44:00.610
Well I'd start in this root
node, this base of this tree

00:44:00.610 --> 00:44:03.190
we're building here that I've
turned kind of right side--

00:44:03.190 --> 00:44:04.240
this way.

00:44:04.240 --> 00:44:05.980
Let's say Frozen.

00:44:05.980 --> 00:44:09.610
I'll go here and does
Cars come before Frozen?

00:44:09.610 --> 00:44:12.490
Well it does, so I'll
look in this node now.

00:44:12.490 --> 00:44:16.570
I'll come up here and I'll
find Cars in that node.

00:44:16.570 --> 00:44:17.500
Let's see here.

00:44:17.500 --> 00:44:19.840
Maybe I want to find Ratatouille.

00:44:19.840 --> 00:44:23.380
I'll go back to the beginning and
I'll look through this root node.

00:44:23.380 --> 00:44:27.080
I'll say, does Ratatouille
come before Frozen?

00:44:27.080 --> 00:44:27.580
It doesn't.

00:44:27.580 --> 00:44:30.970
It comes after Frozen, so
I'll look and check Soul.

00:44:30.970 --> 00:44:33.670
Does Ratatouille come before Soul?

00:44:33.670 --> 00:44:36.070
Well it does, so I'll
look in this node now.

00:44:36.070 --> 00:44:37.330
I'll go find Luca.

00:44:37.330 --> 00:44:40.240
I'm not quite there yet,
but I will see Ratatouille

00:44:40.240 --> 00:44:43.870
so I know that I've now found
Ratatouille in my index.

00:44:43.870 --> 00:44:45.880
And now for one more--

00:44:45.880 --> 00:44:47.680
let me check for understanding here.

00:44:47.680 --> 00:44:51.370
How could I find Turning Red?

00:44:51.370 --> 00:44:56.310
What steps would I follow
to find Turning Red?

00:44:56.310 --> 00:45:02.115
I might start at that root node, but
then what questions should I ask?

00:45:02.115 --> 00:45:08.740
AUDIENCE: From Frozen also we can
see Ratatouille is to the right

00:45:08.740 --> 00:45:10.890
so that we move in direction of Soul.

00:45:10.890 --> 00:45:15.850
Now once we get there we find the middle
elements and we can see Ratatouille

00:45:15.850 --> 00:45:19.120
is to the right of the middle elements.

00:45:19.120 --> 00:45:20.520
So that's how we find it.

00:45:20.520 --> 00:45:25.560
Divide the list into two
halves and look into the half

00:45:25.560 --> 00:45:27.448
that the element is found.

00:45:27.448 --> 00:45:29.740
CARTER ZENKE: Yeah, I like
what you were thinking here.

00:45:29.740 --> 00:45:33.590
So you're talking about traversing this
tree we've built and turned on its side

00:45:33.590 --> 00:45:34.090
here.

00:45:34.090 --> 00:45:36.010
And let's visualize that search here.

00:45:36.010 --> 00:45:41.440
So first we'll start at Frozen and ask,
does Turning Red come before Frozen?

00:45:41.440 --> 00:45:41.940
It doesn't.

00:45:41.940 --> 00:45:43.530
It comes after, so I'll keep looking.

00:45:43.530 --> 00:45:44.640
I'll look at Soul.

00:45:44.640 --> 00:45:47.620
Does Turning Red come before Soul?

00:45:47.620 --> 00:45:48.120
It doesn't.

00:45:48.120 --> 00:45:53.430
It comes after Soul, so I know I
need to look at this node down here.

00:45:53.430 --> 00:45:55.050
I'll go find Toy Story.

00:45:55.050 --> 00:45:57.990
Not quite there yet, but
now I'll see Turning Red.

00:45:57.990 --> 00:46:00.150
I'm able to find what I'm looking for.

00:46:00.150 --> 00:46:01.110
And not just that.

00:46:01.110 --> 00:46:07.000
I can find what row number Turning
Red is actually in in my table.

00:46:07.000 --> 00:46:12.180
So when you use an index, SQL or
SQLite is using some algorithm a bit

00:46:12.180 --> 00:46:13.090
like this.

00:46:13.090 --> 00:46:15.870
It creates some tree
structure a bit like this here

00:46:15.870 --> 00:46:18.780
with individual nodes
that has your column data.

00:46:18.780 --> 00:46:22.470
It then uses this kind of
algorithm to search that index

00:46:22.470 --> 00:46:26.770
and find whether or not that
piece of data is in your index.

00:46:26.770 --> 00:46:32.050
And if it is, it will go ahead and look
up that piece of data in your table.

00:46:32.050 --> 00:46:36.000
So see here, although this is
turned on its side, the relationship

00:46:36.000 --> 00:46:37.740
between a kind of tree.

00:46:37.740 --> 00:46:43.680
We had a root node or a trunk node here
pointing to some of these leaf nodes--

00:46:43.680 --> 00:46:45.570
the edge of our graph right here.

00:46:45.570 --> 00:46:48.750
And you can see some relationship
between these nodes here

00:46:48.750 --> 00:46:52.350
and our tree as well
right here with arrows

00:46:52.350 --> 00:46:55.770
pointing across these individual nodes.

00:46:55.770 --> 00:47:01.050
So let me pause here and ask what
questions do we have on this structure

00:47:01.050 --> 00:47:02.985
and how we've searched it?

00:47:02.985 --> 00:47:07.683
AUDIENCE: Is that related
to linked lists or arrays?

00:47:07.683 --> 00:47:08.850
CARTER ZENKE: Good question.

00:47:08.850 --> 00:47:11.780
Is this related to
linked lists or arrays?

00:47:11.780 --> 00:47:14.550
So if you're familiar, a
linked list is some set

00:47:14.550 --> 00:47:18.030
of nodes that have been linked together
in memory using these arrows like we

00:47:18.030 --> 00:47:19.800
saw on our diagram here.

00:47:19.800 --> 00:47:24.135
It is the case that sometimes an
index will include a linked list.

00:47:27.100 --> 00:47:28.245
Let's take one more here.

00:47:28.245 --> 00:47:32.400
AUDIENCE: I want to
know, a covering index--

00:47:32.400 --> 00:47:37.890
how is the [INAUDIBLE] covering
index and what has been--

00:47:37.890 --> 00:47:44.160
I mean, if we just [INAUDIBLE] index
and then we just start covering index,

00:47:44.160 --> 00:47:49.740
how SQL automatically getting
that covering index that's here?

00:47:49.740 --> 00:47:53.010
CARTER ZENKE: Yeah, so I think we're
asking, how is the index created?

00:47:53.010 --> 00:47:54.940
Like, what is the
process for doing that?

00:47:54.940 --> 00:47:58.920
Well it turns out that just as we saw
here in the slides, SQLite or whatever

00:47:58.920 --> 00:48:03.240
DBMS you're using will take a copy of
the data that should be in that index

00:48:03.240 --> 00:48:06.750
and organize it into a tree
structure like we saw here.

00:48:06.750 --> 00:48:09.150
It may or may not be similar
to the structure we saw just

00:48:09.150 --> 00:48:14.640
a little earlier with nodes that have
different pieces of row data in them.

00:48:14.640 --> 00:48:17.280
Good question there.

00:48:17.280 --> 00:48:18.180
OK.

00:48:18.180 --> 00:48:21.090
So we've seen now the
trade-off that we get

00:48:21.090 --> 00:48:24.510
with index, which is they're
very fast to look up some data

00:48:24.510 --> 00:48:26.400
but they take up a lot more space.

00:48:26.400 --> 00:48:29.370
There's a lot of
redundancy to them here.

00:48:29.370 --> 00:48:31.890
There's also one more
trade-off, which is

00:48:31.890 --> 00:48:37.230
if I want to insert something into this
structure, it'll take me more time.

00:48:37.230 --> 00:48:39.240
I have to navigate this structure here.

00:48:39.240 --> 00:48:44.010
I have to figure out should it come
before this value or after this value?

00:48:44.010 --> 00:48:48.030
And I'll have to traverse
this tree every time verses

00:48:48.030 --> 00:48:51.420
without an index I could just add
something to the end of this list.

00:48:51.420 --> 00:48:54.930
But with a B-Tree, with an index,
I need to traverse all these nodes

00:48:54.930 --> 00:48:57.760
and figure out where should
that piece of data go?

00:48:57.760 --> 00:48:59.593
So these trade-offs are
something you should

00:48:59.593 --> 00:49:03.660
keep in mind as you build indexes
and optimize your queries.

00:49:03.660 --> 00:49:04.560
OK.

00:49:04.560 --> 00:49:07.650
There are ways though to
actually optimize the space

00:49:07.650 --> 00:49:10.710
that we use using a tree like this.

00:49:10.710 --> 00:49:15.310
We can use a structure called not
an index, but a partial index.

00:49:15.310 --> 00:49:19.620
And if you're concerned about space,
a partial index might be good for you.

00:49:19.620 --> 00:49:25.200
A partial index only includes a subset
of the data from a given column,

00:49:25.200 --> 00:49:28.800
thus making your index smaller overall.

00:49:28.800 --> 00:49:31.590
You could think of a partial
index being useful when

00:49:31.590 --> 00:49:35.490
you know that your users
will only query a subset--

00:49:35.490 --> 00:49:38.820
only a small number of
rows from that table.

00:49:38.820 --> 00:49:42.690
And for example in IMDb,
maybe we know people

00:49:42.690 --> 00:49:46.770
are more likely to search for a movie
that came out this year versus someone

00:49:46.770 --> 00:49:49.720
that came out 15 years ago or so.

00:49:49.720 --> 00:49:53.340
So we could create an index that
focuses on those movies that

00:49:53.340 --> 00:49:59.800
came out this year versus 15 years
ago or so, and we'll do just that.

00:49:59.800 --> 00:50:03.670
If I want to create a partial
index I can use this syntax here.

00:50:03.670 --> 00:50:06.720
I could say create an index
and give it some name,

00:50:06.720 --> 00:50:11.670
then include some data from
this table and this column here,

00:50:11.670 --> 00:50:13.530
or a list of columns I can provide.

00:50:13.530 --> 00:50:19.140
But then what I need0 to do is add a
condition where some condition is true.

00:50:19.140 --> 00:50:23.370
These are the rows I want
to include in this index.

00:50:23.370 --> 00:50:27.910
Before we had just create index name
on table from a list of columns.

00:50:27.910 --> 00:50:33.420
Now we can add in this WHERE clause
that says only include those rows where

00:50:33.420 --> 00:50:35.320
this condition is true.

00:50:35.320 --> 00:50:36.840
So let's try that now.

00:50:36.840 --> 00:50:39.210
I'll come back to my computer.

00:50:39.210 --> 00:50:41.880
And why don't we try
creating a partial index

00:50:41.880 --> 00:50:46.710
to help speed up the lookup of
movies that came out this year.

00:50:46.710 --> 00:50:50.520
Well I could come back to my terminal
and I could remove, let's say,

00:50:50.520 --> 00:50:51.900
this one over here.

00:50:51.900 --> 00:50:55.320
Now I'll try to create
this partial index.

00:50:55.320 --> 00:51:00.690
I'll say create an index called
"recents" to indicate this

00:51:00.690 --> 00:51:03.090
will help us search for recent movies.

00:51:03.090 --> 00:51:08.400
Now I'll make this "recents" ON
"movies" in the ("title") column.

00:51:08.400 --> 00:51:12.510
I want to look up titles
ultimately, but now I

00:51:12.510 --> 00:51:20.190
only want to create an index for
those rows where the year equals 2023.

00:51:20.190 --> 00:51:26.790
I only want to include those titles in
my index that were released in 2023.

00:51:26.790 --> 00:51:31.890
So I'll hit Enter now and I'll see I
created this index for myself here.

00:51:31.890 --> 00:51:33.900
Now let me try searching for those.

00:51:33.900 --> 00:51:40.440
I'll SELECT "title" FROM
"movies" WHERE the "year" = 2023.

00:51:40.440 --> 00:51:41.610
Enter.

00:51:41.610 --> 00:51:45.910
And I'll see a lot of
movies came out in 2023.

00:51:45.910 --> 00:51:48.550
Now, it's taking 1.3 seconds here.

00:51:48.550 --> 00:51:51.600
But now I could prove to you
that we are using some index.

00:51:51.600 --> 00:52:00.120
I could say EXPLAIN QUERY PLAN and
now type SELECT "title" FROM "movies"

00:52:00.120 --> 00:52:03.120
WHERE the "year" = 2023.

00:52:03.120 --> 00:52:04.440
Hit Enter.

00:52:04.440 --> 00:52:06.600
And now we'll see-- what are we doing?

00:52:06.600 --> 00:52:09.870
We're going to-- we'll
scan movies but we're still

00:52:09.870 --> 00:52:13.950
going to use the index we
just created called "recents."

00:52:13.950 --> 00:52:17.580
So it's helping speed us up
just a little bit overall.

00:52:17.580 --> 00:52:21.540
Now let me try dropping this index
and showing you the opposite.

00:52:21.540 --> 00:52:24.560
I'll come back over here
and I could try dropping it.

00:52:24.560 --> 00:52:26.310
And actually, before
I do that let me show

00:52:26.310 --> 00:52:29.340
you what would happen if I
didn't search for movie in 2023.

00:52:29.340 --> 00:52:32.560
I would try EXPLAIN QUERY PLAN.

00:52:32.560 --> 00:52:42.150
And let me try SELECT "title" FROM
"movies" where the year = 1998

00:52:42.150 --> 00:52:43.140
like this.

00:52:43.140 --> 00:52:47.130
And see now I'm only
back to scanning movies.

00:52:47.130 --> 00:52:50.670
So before in my prior query
I was able to use the index

00:52:50.670 --> 00:52:54.480
because my WHERE clause included 2023.

00:52:54.480 --> 00:52:58.590
Now though I'm using 1998,
which were rows that were not

00:52:58.590 --> 00:53:02.520
included in this partial index.

00:53:02.520 --> 00:53:03.270
OK.

00:53:03.270 --> 00:53:05.850
So let me ask now, what
questions do we have

00:53:05.850 --> 00:53:09.600
on these partial index which
we can use to optimize some

00:53:09.600 --> 00:53:12.920
of the space these indexes take up?

00:53:12.920 --> 00:53:16.030
AUDIENCE: Are indexes
not saved in schema?

00:53:16.030 --> 00:53:17.280
CARTER ZENKE: A good question.

00:53:17.280 --> 00:53:19.370
Are indexes saved in the schema?

00:53:19.370 --> 00:53:23.510
In SQLite if I type .schema I can
actually see them in my schema.

00:53:23.510 --> 00:53:25.070
So let me go back here and show you.

00:53:25.070 --> 00:53:27.560
If I now type something like--

00:53:27.560 --> 00:53:29.620
let me turn timer off here--

00:53:29.620 --> 00:53:35.150
.schema, you can see down below here
that I actually have these indexes

00:53:35.150 --> 00:53:37.190
as part of my schema.

00:53:37.190 --> 00:53:40.910
I'll see the create index
statements I used down below here.

00:53:40.910 --> 00:53:44.810
Create an index called name_index,
called person_index, called recents.

00:53:44.810 --> 00:53:49.460
And all that I used to create
these indexes now inside my schema.

00:53:49.460 --> 00:53:54.790
If I drop some index it'll then be
just removed from my schema altogether.

00:53:54.790 --> 00:53:57.160
Good question.

00:53:57.160 --> 00:53:57.660
OK.

00:53:57.660 --> 00:54:00.450
So speaking of dropping
indexes, one other way

00:54:00.450 --> 00:54:04.470
we can more efficiently use space is
to do a bit of vacuuming so to speak.

00:54:04.470 --> 00:54:08.040
So let's introduce this idea
of vacuuming up our database.

00:54:08.040 --> 00:54:10.200
Let me show you a slide here.

00:54:10.200 --> 00:54:14.010
And this one involves this
idea of trying to clean up

00:54:14.010 --> 00:54:17.580
unused space in our database.

00:54:17.580 --> 00:54:22.500
Often when we delete something,
either a rows or an index,

00:54:22.500 --> 00:54:25.890
we don't actually delete
those bits that were

00:54:25.890 --> 00:54:29.190
being used by those rows or that index.

00:54:29.190 --> 00:54:33.690
We just mark them as being available
for whatever we next insert.

00:54:33.690 --> 00:54:36.400
They can be overwritten, so to speak.

00:54:36.400 --> 00:54:39.930
But if I want to actually
reduce the size of my database

00:54:39.930 --> 00:54:44.880
after I delete something, I should
use something like vacuum in SQLite

00:54:44.880 --> 00:54:47.610
or optimize in some other DBMS.

00:54:47.610 --> 00:54:50.950
Vacuuming is a bit like taking those
unused bits and just sucking them up.

00:54:50.950 --> 00:54:55.500
Taking them back in so we can give
them back to the operating system.

00:54:55.500 --> 00:54:59.490
And the keyword we can use for
vacuum is just simply vacuum.

00:54:59.490 --> 00:55:00.720
So let's try this.

00:55:00.720 --> 00:55:02.640
I'll come back to my computer over here.

00:55:02.640 --> 00:55:06.120
And let me open up a new
terminal-- one that I can use

00:55:06.120 --> 00:55:09.240
to find the current size of movies.db.

00:55:09.240 --> 00:55:12.720
So I'll hit this + button here
and now I could type a command.

00:55:12.720 --> 00:55:17.830
Not a SQL command or a SQLite
command, but actually a Unix command.

00:55:17.830 --> 00:55:22.320
I could type du -b for
the disk usage in bytes.

00:55:22.320 --> 00:55:24.810
How many bytes is a particular file?

00:55:24.810 --> 00:55:31.560
I could then say movies.db here, and
if I hit Enter I should see movies.db

00:55:31.560 --> 00:55:37.230
is something like 158 million
bytes or 158 megabytes.

00:55:37.230 --> 00:55:41.290
So let's see what happens if I were to
delete an index which you know takes up

00:55:41.290 --> 00:55:43.380
some amount of space in our database.

00:55:43.380 --> 00:55:45.510
I'll go back to my terminal here.

00:55:45.510 --> 00:55:48.990
And let me try to drop an index
that we had recently created.

00:55:48.990 --> 00:55:55.170
I'll say DROP INDEX "person_index" to
remove the person index in this case.

00:55:55.170 --> 00:55:58.860
I'll hit Enter and now I see
that "person_index" is dropped.

00:55:58.860 --> 00:56:05.250
If I type .schema here, I no longer
see it in my schema down below.

00:56:05.250 --> 00:56:09.180
So now let me check the disk usage
in bytes and see if it's any smaller.

00:56:09.180 --> 00:56:13.410
I'll hit the up arrow to get access
to that same command, hit Enter.

00:56:13.410 --> 00:56:14.940
It seems like it's the same size.

00:56:14.940 --> 00:56:17.940
It's still 158 megabytes.

00:56:17.940 --> 00:56:19.830
So let me try dropping
something more here.

00:56:19.830 --> 00:56:23.430
I'll go back to my terminal
here and now I'll try--

00:56:23.430 --> 00:56:25.500
what other do I have?

00:56:25.500 --> 00:56:30.630
I have "name_index" so I'll say let's
DROP index "name_index" like this.

00:56:30.630 --> 00:56:34.680
I'll hit Enter and I'll
see that index is now gone.

00:56:34.680 --> 00:56:37.200
Well now I'll also try dropping recents.

00:56:37.200 --> 00:56:43.470
I'll DROP INDEX "recents" and
now I believe all of my indexes

00:56:43.470 --> 00:56:44.310
should be gone.

00:56:44.310 --> 00:56:48.510
If I type .schema, I don't
see any more indexes.

00:56:48.510 --> 00:56:51.300
So now let's check on
the disk usage in bytes.

00:56:51.300 --> 00:56:54.750
I'll go back to my other
terminal, hit du -b again,

00:56:54.750 --> 00:56:58.290
and it's still 158 megabytes.

00:56:58.290 --> 00:57:01.170
Well what can we do to clean this up?

00:57:01.170 --> 00:57:03.180
We could run vacuum.

00:57:03.180 --> 00:57:07.140
If you remember here when we delete
an index-- we drop the index,

00:57:07.140 --> 00:57:09.840
we don't actually shrink
our database file.

00:57:09.840 --> 00:57:13.320
We just mark those bits that
were part of the index as being

00:57:13.320 --> 00:57:15.120
available for re-use.

00:57:15.120 --> 00:57:17.700
So let's now vacuum
and compress our file

00:57:17.700 --> 00:57:21.210
to give back those bytes and those
bits to the operating system.

00:57:21.210 --> 00:57:27.000
I'll come back to my terminal
here and type VACUUM; hit Enter,

00:57:27.000 --> 00:57:30.360
and I'll wait for it to
find those bits and bytes,

00:57:30.360 --> 00:57:33.300
wait for it to give them
back, to de-allocate them,

00:57:33.300 --> 00:57:34.950
to go back to the operating system.

00:57:34.950 --> 00:57:39.660
And now if I come back and I
type du -b again, hit Enter--

00:57:39.660 --> 00:57:42.970
I should see a much smaller file.

00:57:42.970 --> 00:57:45.990
So before we had about
158 million bytes.

00:57:45.990 --> 00:57:50.670
We're down to in this case 100
million after dropping our indexes

00:57:50.670 --> 00:57:53.360
and using VACUUM.

00:57:53.360 --> 00:57:57.390
So let me ask now, what questions
do we have on vacuuming?

00:57:57.390 --> 00:58:00.020
AUDIENCE: Can we do
vacuuming in a faster way?

00:58:00.020 --> 00:58:02.730
CARTER ZENKE: Can you
vacuum in a faster way?

00:58:02.730 --> 00:58:05.120
So I think it depends on
the size of the vacuum

00:58:05.120 --> 00:58:06.990
that you're trying to implement here.

00:58:06.990 --> 00:58:08.900
If you have a lot of
bits and bytes you're

00:58:08.900 --> 00:58:11.990
trying to de-allocate,
that could take more time.

00:58:11.990 --> 00:58:15.290
The process here is the computer has
to find those bits and bytes you're

00:58:15.290 --> 00:58:19.590
no longer using and then give each of
those back to the operating system.

00:58:19.590 --> 00:58:24.178
So it could depend on the size
and how long they take to find.

00:58:24.178 --> 00:58:25.720
And let's take another question here.

00:58:25.720 --> 00:58:30.691
AUDIENCE: Initially you said once we
drop a database, we don't lose the--

00:58:30.691 --> 00:58:33.310
the data is not deleted.

00:58:33.310 --> 00:58:36.550
The space is just reallocated, right?

00:58:36.550 --> 00:58:37.960
If I go you right.

00:58:37.960 --> 00:58:42.280
So does that mean the
queries we run today,

00:58:42.280 --> 00:58:46.780
are they locked somewhere that--
let's say you mistakenly did a query.

00:58:46.780 --> 00:58:50.710
You can go back and pull that data back?

00:58:50.710 --> 00:58:51.880
Is that possible?

00:58:51.880 --> 00:58:55.000
Kind of like a reversal of a query.

00:58:55.000 --> 00:58:57.880
CARTER ZENKE: Yeah, so you're
talking about some forensics here.

00:58:57.880 --> 00:59:01.900
Like, could we mark something as
deleted but still find it later?

00:59:01.900 --> 00:59:05.290
There actually are people who are
trained in this, in forensics,

00:59:05.290 --> 00:59:07.600
to find pieces of data that
you thought were deleted

00:59:07.600 --> 00:59:12.160
but are actually still on your
computer just forgotten, if you will.

00:59:12.160 --> 00:59:14.740
In this case in SQLite it
would take a lot of work

00:59:14.740 --> 00:59:18.190
to go ahead and find all of those
bits and bytes we've removed.

00:59:18.190 --> 00:59:21.550
But once we vacuum, once we actually
give them back to the operating system,

00:59:21.550 --> 00:59:24.220
we can no longer find
those in the database

00:59:24.220 --> 00:59:26.830
because they're not just
marked as being de-allocated.

00:59:26.830 --> 00:59:30.720
They actually have been given back and
they're no longer part of our database.

00:59:30.720 --> 00:59:32.710
Good question there.

00:59:32.710 --> 00:59:35.800
OK, so we've seen here
how we can optimize

00:59:35.800 --> 00:59:38.680
our queries by reducing
the time that they take

00:59:38.680 --> 00:59:40.870
although at the cost of some space.

00:59:40.870 --> 00:59:45.130
When we come back, we'll see how we
can handle not just one query at a time

00:59:45.130 --> 00:59:47.440
but many all at once.

00:59:47.440 --> 00:59:49.010
See you in a bit.

00:59:49.010 --> 00:59:50.330
And we're back.

00:59:50.330 --> 00:59:54.460
So we've seen so far how to optimize
individual queries, ensuring

00:59:54.460 --> 00:59:58.000
they take less time and
also to ensure our databases

00:59:58.000 --> 01:00:00.800
use as little space as possible.

01:00:00.800 --> 01:00:03.010
What we'll focus on
now is trying to ensure

01:00:03.010 --> 01:00:08.125
we can handle not just one query
at a time but multiple all at once.

01:00:08.125 --> 01:00:10.570
And this is this idea of concurrency.

01:00:10.570 --> 01:00:16.180
Concurrency is when I get multiple
queries, multiple interactions, all

01:00:16.180 --> 01:00:20.110
at roughly the same time and my
computer system, my database,

01:00:20.110 --> 01:00:23.810
needs to figure out how to
handle all of those at once.

01:00:23.810 --> 01:00:27.040
This is particularly important for
websites that get a lot of traffic,

01:00:27.040 --> 01:00:28.900
or even for the financial sector.

01:00:28.900 --> 01:00:32.110
People are constantly making
trades, placing orders,

01:00:32.110 --> 01:00:34.610
exchanging money, and so on.

01:00:34.610 --> 01:00:37.690
So let's see one example here of a bank.

01:00:37.690 --> 01:00:40.870
Let's say here I have
a table of accounts

01:00:40.870 --> 01:00:44.860
and each account has a name
and some balance to it.

01:00:44.860 --> 01:00:50.890
Well we decide here that maybe
Alice wants to pay Bob $10.

01:00:50.890 --> 01:00:54.220
So what should I do to
update the balances here?

01:00:54.220 --> 01:00:58.510
Maybe the first thing I should do is
make sure that Bob receives that money.

01:00:58.510 --> 01:01:03.460
I could update Bob's account to have
a balance not of $20 but of $30 now.

01:01:03.460 --> 01:01:05.920
From $20 to $30.

01:01:05.920 --> 01:01:07.450
But here's the thing.

01:01:07.450 --> 01:01:14.806
If at this moment somebody looks at this
table, what will they see that's wrong?

01:01:14.806 --> 01:01:20.205
What will they see that they might not
get the right picture of at this point?

01:01:23.800 --> 01:01:27.850
I'm hearing they might not see
the correct balance for Alice.

01:01:27.850 --> 01:01:29.980
Bob has gotten some money.

01:01:29.980 --> 01:01:31.990
He's gone from $20 here to $30.

01:01:31.990 --> 01:01:34.060
But now I have not just--

01:01:34.060 --> 01:01:35.710
let's see, $60 in the bank.

01:01:35.710 --> 01:01:38.210
I now have $70 it looks like.

01:01:38.210 --> 01:01:40.900
So if somebody looks at this
transaction in this moment

01:01:40.900 --> 01:01:43.420
they'll see some incorrect information.

01:01:43.420 --> 01:01:45.910
What I should then do is
update Alice's balance,

01:01:45.910 --> 01:01:50.740
subtracting $10 so we're at the right
amount of dollars across our bank

01:01:50.740 --> 01:01:51.830
accounts here.

01:01:51.830 --> 01:01:53.890
So this is an issue.

01:01:53.890 --> 01:01:57.370
When I have one ongoing
transaction like this

01:01:57.370 --> 01:01:59.950
and somebody else wants
to look at that data,

01:01:59.950 --> 01:02:03.430
they really shouldn't be able
to until this transaction is

01:02:03.430 --> 01:02:06.020
complete in its entirety.

01:02:06.020 --> 01:02:09.340
So ideally-- let's go back to what
we had before with Alice at $10

01:02:09.340 --> 01:02:10.480
Bob at $20.

01:02:10.480 --> 01:02:15.430
Ideally if Alice wants to pay Bob, this
should happen to an outside observer

01:02:15.430 --> 01:02:16.510
all at once.

01:02:16.510 --> 01:02:19.360
Bob should get those
$10 and Alice should

01:02:19.360 --> 01:02:23.590
lose those $10 at the same
exact moment so nobody

01:02:23.590 --> 01:02:29.230
can look and see that Bob has more money
while Alice doesn't have less money.

01:02:29.230 --> 01:02:34.060
Now we can implement this
idea of this transaction using

01:02:34.060 --> 01:02:39.040
a very similarly-named idea in
databases called a transaction.

01:02:39.040 --> 01:02:41.830
Now a transaction in
the world of databases

01:02:41.830 --> 01:02:44.800
is an individual unit of work.

01:02:44.800 --> 01:02:49.370
That means a transaction can't be
broken down into smaller pieces.

01:02:49.370 --> 01:02:52.720
It happens all at once or not at all.

01:02:52.720 --> 01:02:55.540
And in fact a transaction
has several properties

01:02:55.540 --> 01:02:58.690
and helps implement several
properties of a database.

01:02:58.690 --> 01:03:00.340
Among these are ACID--

01:03:00.340 --> 01:03:05.020
A-C-I-D. You may have heard of these
if you've looked up something about

01:03:05.020 --> 01:03:10.150
databases, but each one stands for
some principle of ensuring a consistent

01:03:10.150 --> 01:03:11.870
and reliable database.

01:03:11.870 --> 01:03:17.710
In fact, A stands for atomicity, which
means I can't break a transaction down

01:03:17.710 --> 01:03:18.970
into smaller pieces.

01:03:18.970 --> 01:03:21.310
I can't break a transaction
between Alice and Bob

01:03:21.310 --> 01:03:24.250
into Bob's update and
then Alice's update.

01:03:24.250 --> 01:03:28.150
It's just update both
accounts at the same time.

01:03:28.150 --> 01:03:32.080
Consistency means a transaction
to help me ensure that I

01:03:32.080 --> 01:03:35.180
don't violate some database constraint.

01:03:35.180 --> 01:03:39.070
Let's say I try to pay
Bob from Alice's account,

01:03:39.070 --> 01:03:40.900
but Alice doesn't have the money.

01:03:40.900 --> 01:03:44.530
If that is the case, a
transaction will revert, undo.

01:03:44.530 --> 01:03:49.090
So I go back to being in this
consistent state for my database.

01:03:49.090 --> 01:03:53.530
Isolation means if I have more
than one person trying to access

01:03:53.530 --> 01:03:55.600
this database, their queries--

01:03:55.600 --> 01:03:58.940
their transactions won't
interfere with each other.

01:03:58.940 --> 01:04:03.385
And then finally, durability means
if I were to unplug this database

01:04:03.385 --> 01:04:06.010
or we were to encounter a power
failure or the server to crash,

01:04:06.010 --> 01:04:11.230
all that data would still be there from
the time we committed our transaction

01:04:11.230 --> 01:04:14.150
or saved the results there.

01:04:14.150 --> 01:04:18.670
So transactions do a lot for us, but
they have some very simple syntax.

01:04:18.670 --> 01:04:24.130
I can simply use BEGIN TRANSACTION and
then follow it up with some statements

01:04:24.130 --> 01:04:26.780
I want to include in that transaction.

01:04:26.780 --> 01:04:30.850
And once I'm sure I want to save
the results of these statements,

01:04:30.850 --> 01:04:36.340
I can then say COMMIT, meaning
save these changes thus far.

01:04:36.340 --> 01:04:38.620
And it's using this
kind of syntax we can

01:04:38.620 --> 01:04:41.830
ensure that nobody sees
incorrect balances,

01:04:41.830 --> 01:04:45.590
let's say, for this
bank account table here.

01:04:45.590 --> 01:04:49.840
So let's try this out and implement
a payment between Alice and Bob,

01:04:49.840 --> 01:04:52.180
but now using a transaction.

01:04:52.180 --> 01:04:54.670
I'll come back to my computer here.

01:04:54.670 --> 01:04:57.400
And let me show you a new database.

01:04:57.400 --> 01:05:00.580
I'll type sqlite3 bank.db.

01:05:00.580 --> 01:05:01.600
I'll hit Enter.

01:05:01.600 --> 01:05:02.980
Oh, sqlite3.

01:05:02.980 --> 01:05:03.520
Sorry.

01:05:03.520 --> 01:05:04.720
bank.db.

01:05:04.720 --> 01:05:09.430
I'll hit Enter, and now let me type
.schema to see the schema of this

01:05:09.430 --> 01:05:10.360
database.

01:05:10.360 --> 01:05:12.970
I'll see here a few different columns.

01:05:12.970 --> 01:05:15.970
But I have an "id"
column, my primary key,

01:05:15.970 --> 01:05:19.900
a "name" column like we saw in our
slides that is text and can't be null,

01:05:19.900 --> 01:05:22.390
and then a "balance" column.

01:05:22.390 --> 01:05:26.740
There's an integer, a whole number of
dollars, that also must have a value.

01:05:26.740 --> 01:05:27.850
Can't be null.

01:05:27.850 --> 01:05:30.730
And I have a check
constraint here saying

01:05:30.730 --> 01:05:35.590
the balance must be at all times
greater than or equal to 0.

01:05:35.590 --> 01:05:40.690
So this is our implementation
of a table of bank accounts.

01:05:40.690 --> 01:05:44.650
Let me try querying here to see
what our account balances look like.

01:05:44.650 --> 01:05:50.390
I'll say SELECT * FROM
"accounts" semicolon.

01:05:50.390 --> 01:05:53.980
And now I'll see that very same
arrangement we saw in our slides.

01:05:53.980 --> 01:05:59.450
Alice has $10, Bob has
$20, and Charlie has $30.

01:05:59.450 --> 01:06:04.570
So let me try then processing this
transaction between Alice and Bob.

01:06:04.570 --> 01:06:08.890
Well at this moment I might
try to update Bob's account

01:06:08.890 --> 01:06:10.600
at the first [? bat. ?] I'll say--

01:06:10.600 --> 01:06:13.510
let's try update and
I'll update the accounts

01:06:13.510 --> 01:06:21.730
table, set the "balance" equal to
"balance" + 10 where the "id" equals 2.

01:06:21.730 --> 01:06:25.040
Remember, Bob's accounts "id" is 2.

01:06:25.040 --> 01:06:30.350
So now if I hit Enter here, Bob should
have gotten just 10 more dollars.

01:06:30.350 --> 01:06:31.610
But there's a problem.

01:06:31.610 --> 01:06:34.790
If I also connect to this database
through another connection here,

01:06:34.790 --> 01:06:37.457
sqlite3 bank.DB.

01:06:37.457 --> 01:06:40.040
Maybe I'm somebody who's checking
on the balances of the bank.

01:06:40.040 --> 01:06:44.120
I say SELECT * FROM "accounts."

01:06:44.120 --> 01:06:48.360
What do I see but an
incorrect total balance?

01:06:48.360 --> 01:06:51.890
We should have $60, but here I see $70.

01:06:51.890 --> 01:06:54.620
So I need to complete
this transaction to ensure

01:06:54.620 --> 01:06:58.790
that I actually have the right
amount of balances across the board.

01:06:58.790 --> 01:07:03.188
Let me go back here and let me
try to run the next part here.

01:07:03.188 --> 01:07:03.980
Let we update this.

01:07:03.980 --> 01:07:10.910
I'll say UPDATE "accounts," and I'll SET
"balance" equal to "balance" minus 10.

01:07:10.910 --> 01:07:17.960
Minus 10 where the "id" equals in this
case 1 where Alice's bank account ID

01:07:17.960 --> 01:07:18.980
is 1.

01:07:18.980 --> 01:07:23.780
Now if I hit Enter I should be able
to go back to my other terminal

01:07:23.780 --> 01:07:25.830
and check on the balances again.

01:07:25.830 --> 01:07:29.700
And now I'll see the right
balances all together.

01:07:29.700 --> 01:07:33.380
But there is a way to avoid
me looking at this table

01:07:33.380 --> 01:07:38.640
and seeing an incorrect state, and the
way to do that is called a transaction.

01:07:38.640 --> 01:07:42.020
So let's go back here, and
I will reset my terminals.

01:07:42.020 --> 01:07:45.410
I'll now also give back Alice's money.

01:07:45.410 --> 01:07:47.183
Let me give back Bob the $10.

01:07:47.183 --> 01:07:48.100
I'm going to go back--

01:07:48.100 --> 01:07:51.860
[INAUDIBLE] say where
"id" equals 1 here.

01:07:51.860 --> 01:07:54.140
Now Alice will have 10 more dollars.

01:07:54.140 --> 01:07:57.020
And I'll go back and
I'll update accounts.

01:07:57.020 --> 01:08:02.810
I'll set the "balance"
equal to "balance" minus 10

01:08:02.810 --> 01:08:05.690
where the "id" equals 2.

01:08:05.690 --> 01:08:09.390
So I'm giving Bob back the $10
that they just gave to Alice.

01:08:09.390 --> 01:08:15.160
Now if I SELECT * from "accounts"
we're back to where we began.

01:08:15.160 --> 01:08:19.649
So ideally I should able to complete
this transaction between Alice and Bob

01:08:19.649 --> 01:08:23.310
without somebody seeing the
intermediate process of Bob

01:08:23.310 --> 01:08:25.470
having more money than Alice.

01:08:25.470 --> 01:08:26.850
So let's try this.

01:08:26.850 --> 01:08:31.649
I'll say BEGIN TRANSACTION; Enter.

01:08:31.649 --> 01:08:36.479
And now I can include what I want
to have happen in this transaction.

01:08:36.479 --> 01:08:38.609
First I want to update Bob's account.

01:08:38.609 --> 01:08:45.640
I'll say UPDATE "accounts" and SET
"balance" equal to "balance" + 10

01:08:45.640 --> 01:08:49.290
where the "id" in this case equals--

01:08:49.290 --> 01:08:53.170
let's see, 2, because
Bob's account ID is 2.

01:08:53.170 --> 01:08:57.000
Now I'll hit Enter, and now I'm
in the middle of this transaction.

01:08:57.000 --> 01:09:00.960
But if I go to my other terminal--
let's say I'm somebody new

01:09:00.960 --> 01:09:02.640
and I check on the balances of the bank.

01:09:02.640 --> 01:09:06.520
I say SELECT * FROM "accounts".

01:09:06.520 --> 01:09:07.960
What do I see?

01:09:07.960 --> 01:09:10.390
I don't see any change from before.

01:09:10.390 --> 01:09:14.920
Alice still has $10, Bob has
$20, and Charlie has $30.

01:09:14.920 --> 01:09:17.380
And even though in
this terminal I've been

01:09:17.380 --> 01:09:21.430
able to update the account balance,
I haven't committed those changes.

01:09:21.430 --> 01:09:24.380
I haven't saved them to
the database just yet.

01:09:24.380 --> 01:09:26.770
So let me try now removing
money from Alice's account.

01:09:26.770 --> 01:09:33.430
I'll say UPDATE "accounts" and set
"balance" equal to balance minus 10

01:09:33.430 --> 01:09:36.700
where the "id" equals 1.

01:09:36.700 --> 01:09:39.550
Now I'll remove $10
from Alice's account.

01:09:39.550 --> 01:09:43.060
Hit Enter, and still if I go
back to this other perspective

01:09:43.060 --> 01:09:46.870
here from somebody else checking
on the bank balances, I see--

01:09:46.870 --> 01:09:49.399
well everything is still the same.

01:09:49.399 --> 01:09:51.939
But the moment I then
go to this transaction

01:09:51.939 --> 01:09:57.370
and finish it by typing COMMIT, meaning
save these changes, I'll commit here.

01:09:57.370 --> 01:10:01.240
I can go back and I'll
see all in one moment

01:10:01.240 --> 01:10:06.070
that it appears Alice has paid Bob $10.

01:10:06.070 --> 01:10:10.060
So this is the atomicity
part of transactions.

01:10:10.060 --> 01:10:13.480
I don't see the intermediate
steps in the transaction.

01:10:13.480 --> 01:10:19.000
I only see that it happened
all at once or not at all.

01:10:19.000 --> 01:10:23.300
OK, so this is one example
here of a transaction.

01:10:23.300 --> 01:10:27.700
Let me ask what questions we have so
far on the transactions we've made

01:10:27.700 --> 01:10:31.060
and how they can be used.

01:10:31.060 --> 01:10:32.950
AUDIENCE: You are
searching based on ID--

01:10:32.950 --> 01:10:35.560
like an ID is equal to a question mark.

01:10:35.560 --> 01:10:39.550
But in a real-world scenario,
how they would Search maybe

01:10:39.550 --> 01:10:44.063
like using the Transaction
ID or basic account number?

01:10:44.063 --> 01:10:45.230
CARTER ZENKE: Good question.

01:10:45.230 --> 01:10:47.050
So here I'm searching by ID.

01:10:47.050 --> 01:10:51.820
I'm saying that I want to update the
balance of account with ID 1 or ID 2,

01:10:51.820 --> 01:10:52.660
for instance.

01:10:52.660 --> 01:10:55.510
And this is similar in spirit
to how your bank account likely

01:10:55.510 --> 01:10:58.850
has some unique number, an
account number so to speak.

01:10:58.850 --> 01:11:01.360
So here in this table
I'm updating balances

01:11:01.360 --> 01:11:03.700
by referring to the account number.

01:11:03.700 --> 01:11:06.670
You could imagine me
using names, but there

01:11:06.670 --> 01:11:10.120
might be in some world
multiple Alices, multiple Bobs.

01:11:10.120 --> 01:11:15.740
So I should uniquely identify each
account using its ID in this case.

01:11:15.740 --> 01:11:20.490
OK, so let's come back and think
through more kinds of transactions here.

01:11:20.490 --> 01:11:24.570
And let's think through a transaction
we actually don't want to complete.

01:11:24.570 --> 01:11:30.410
So let's go back to our balance sheet
where we have Alice at $0 and Bob

01:11:30.410 --> 01:11:35.660
at $30, and let's say Alice
wants to pay Bob 10 more dollars.

01:11:35.660 --> 01:11:39.830
Well if I go through this
transaction, what's going to happen?

01:11:39.830 --> 01:11:44.390
Alice will have -$10
and Bob will have $40,

01:11:44.390 --> 01:11:47.990
and that violates our check
constraint that we saw in our schema.

01:11:47.990 --> 01:11:53.550
The balance of Alice's account
must always be at least $0.

01:11:53.550 --> 01:11:56.360
So if I were to complete
this transaction,

01:11:56.360 --> 01:11:59.870
that would bring our database
to an inconsistent state.

01:11:59.870 --> 01:12:03.410
It violates some
constraint that we have.

01:12:03.410 --> 01:12:06.800
But what I can do is undo these changes.

01:12:06.800 --> 01:12:10.730
Revert them so that we go
back to the original state

01:12:10.730 --> 01:12:14.540
if we ever violate some
constraint like we just did.

01:12:14.540 --> 01:12:17.880
There's a technical name for
this, which is rolling back.

01:12:17.880 --> 01:12:21.980
So if I ever want to not save
the changes in this transaction,

01:12:21.980 --> 01:12:23.840
I can say ROLLBACK.

01:12:23.840 --> 01:12:27.620
Go back to the beginning, think about
it as if this transaction never even

01:12:27.620 --> 01:12:29.470
happened at all.

01:12:29.470 --> 01:12:30.380
So let's try it.

01:12:30.380 --> 01:12:33.870
So I'll come back to my
computer, and what I'll do

01:12:33.870 --> 01:12:38.040
is try to implement this idea
first without a transaction.

01:12:38.040 --> 01:12:41.400
I'll go back to, let's say
this first terminal here,

01:12:41.400 --> 01:12:45.400
and I'll SELECT * FROM
"accounts" like this.

01:12:45.400 --> 01:12:50.060
Notice how Alice has $0 and Bob has $30.

01:12:50.060 --> 01:12:53.010
Well I want to pay Bob 10
more dollars from Alice,

01:12:53.010 --> 01:12:54.810
so I'll update Bob's account.

01:12:54.810 --> 01:12:59.430
I'll find here my prior
statement, UPDATE "accounts"

01:12:59.430 --> 01:13:03.900
SET "balance" equal to "balance"
+ 10 where the "id" equals 2.

01:13:03.900 --> 01:13:07.260
And recall that Bob's account ID is 2

01:13:07.260 --> 01:13:10.085
I'll hit Enter, and now if
I SELECT * FROM "accounts"

01:13:10.085 --> 01:13:11.460
if we're looking somewhere else--

01:13:11.460 --> 01:13:17.850
SELECT * FROM "accounts," I'll
see Bob has 10 more dollars.

01:13:17.850 --> 01:13:21.300
But now I need to subtract
money from Alice's account.

01:13:21.300 --> 01:13:27.750
So I'll use the same thing but now I'll
say minus 10 where the "id" equals 1.

01:13:27.750 --> 01:13:30.120
Alice's account ID is 1.

01:13:30.120 --> 01:13:35.820
Now I'll try to subtract $10 from Alice,
but I get a constraint failure here.

01:13:35.820 --> 01:13:39.120
My check constraint failed
on the balance column.

01:13:39.120 --> 01:13:40.290
Why did it do that?

01:13:40.290 --> 01:13:47.460
Because I now have less than
$0 in that balance column.

01:13:47.460 --> 01:13:52.260
So I need to undo this but I
don't have a transaction here.

01:13:52.260 --> 01:13:57.550
What could I do instead to
undo what I've just done?

01:13:57.550 --> 01:13:59.025
AUDIENCE: So we can roll back.

01:13:59.025 --> 01:14:01.740
CARTER ZENKE: We could roll
back, but here's the thing.

01:14:01.740 --> 01:14:05.790
A rollback only works while
we're inside of a transaction.

01:14:05.790 --> 01:14:09.700
And a transaction begins when
we say begin transaction.

01:14:09.700 --> 01:14:12.930
So here I actually didn't
say begin transaction.

01:14:12.930 --> 01:14:15.960
What I have to do instead is
run another update statement.

01:14:15.960 --> 01:14:19.290
I have to update, let's
see, Bob's account

01:14:19.290 --> 01:14:22.950
to remove the $10 that Alice paid Bob.

01:14:22.950 --> 01:14:23.710
So let's try that.

01:14:23.710 --> 01:14:27.810
I'll come back to my computer
and I will then revert back.

01:14:27.810 --> 01:14:29.380
I'll say let's undo this.

01:14:29.380 --> 01:14:34.420
Let's subtract $10 now from
Bob's account where the ID is 2.

01:14:34.420 --> 01:14:36.510
Now we're back to where we began.

01:14:36.510 --> 01:14:38.970
If I check in from some
other [INAUDIBLE] here,

01:14:38.970 --> 01:14:44.310
I'll see we're back to Alice
having $0 and Bob having $30.

01:14:44.310 --> 01:14:49.470
But we ideally want to use rollback
to undo some changes that we

01:14:49.470 --> 01:14:51.400
have made in a transaction.

01:14:51.400 --> 01:14:54.000
So for that let me
now use a transaction.

01:14:54.000 --> 01:14:58.980
I'll say BEGIN TRANSACTION
like this, hit Enter.

01:14:58.980 --> 01:15:01.470
Now I want to update Bob's balance.

01:15:01.470 --> 01:15:05.070
I'll say UPDATE "accounts SET
"balance" equal to "balance" +

01:15:05.070 --> 01:15:09.360
10 where the "id" = 2.

01:15:09.360 --> 01:15:11.860
Bob's account ID is 2.

01:15:11.860 --> 01:15:14.760
Now let me subtract $10
from Alice like this.

01:15:14.760 --> 01:15:16.290
- 10.

01:15:16.290 --> 01:15:19.200
I'll go back here and say
where the "id" equals 1

01:15:19.200 --> 01:15:21.660
for Alice's account, hit Enter.

01:15:21.660 --> 01:15:24.360
But now I get that same
constraint failure.

01:15:24.360 --> 01:15:28.600
Alice has less than $0,
which can't be possible.

01:15:28.600 --> 01:15:29.880
So what do we do?

01:15:29.880 --> 01:15:33.000
Instead of having another
update, we could just roll back.

01:15:33.000 --> 01:15:36.030
We could undo all of these
previous statements here.

01:15:36.030 --> 01:15:40.500
ROLLBACK like this, hit Enter, and now
I should see from another perspective

01:15:40.500 --> 01:15:46.380
here my table exactly as it was before
with no intermediate step of Bob

01:15:46.380 --> 01:15:50.140
having money but Alice
having negative $10.

01:15:50.140 --> 01:15:54.240
So let me ask m we've
seen COMMIT and ROLLBACK.

01:15:54.240 --> 01:15:57.440
What questions do we
have on transactions now?

01:16:03.530 --> 01:16:04.030
OK.

01:16:04.030 --> 01:16:07.390
Seeing none so far, so let's keep going.

01:16:07.390 --> 01:16:11.290
Transactions are helpful for
ensuring that our data is consistent

01:16:11.290 --> 01:16:13.570
like we just saw, and
that they're also--

01:16:13.570 --> 01:16:15.130
we have atomic transactions.

01:16:15.130 --> 01:16:18.620
We can't break things
down into smaller pieces.

01:16:18.620 --> 01:16:21.490
But there is a case that
transactions can help guard

01:16:21.490 --> 01:16:24.880
against some adversarial kind of attack.

01:16:24.880 --> 01:16:27.460
We're into what we call race condition.

01:16:27.460 --> 01:16:32.140
A race condition is when we have
multiple people, multiple processes

01:16:32.140 --> 01:16:37.900
trying to access some value and also
making a decision based on that value.

01:16:37.900 --> 01:16:43.300
If unaddressed it can lead to a scenario
that's inconsistent in our database.

01:16:43.300 --> 01:16:46.090
And actually hackers can
exploit race conditions

01:16:46.090 --> 01:16:51.020
to bring our database to an inconsistent
state in a way that benefits them.

01:16:51.020 --> 01:16:53.800
So let's say our friends
Charlie and Alice here, they

01:16:53.800 --> 01:16:55.510
decide to rob our bank.

01:16:55.510 --> 01:16:59.510
They decide to take money from
it by exploiting race conditions.

01:16:59.510 --> 01:17:04.000
And here are the logs of what
Charlie and Alice actually did.

01:17:04.000 --> 01:17:10.090
We see here that Charlie and Alice
executed three concurrent transactions

01:17:10.090 --> 01:17:11.260
with our bank.

01:17:11.260 --> 01:17:13.960
On the one hand Charlie
opened up his bank account

01:17:13.960 --> 01:17:16.000
on two different computers.

01:17:16.000 --> 01:17:20.470
He logged in and he said let
me transfer $30 to Alice.

01:17:20.470 --> 01:17:25.240
He at the same time clicked transfer
on both computers knowing just well

01:17:25.240 --> 01:17:30.820
he had only $30 to give to Alice, but he
was going to send her $60 in this case.

01:17:30.820 --> 01:17:34.120
He was hoping that we weren't
guarding against race conditions.

01:17:34.120 --> 01:17:38.660
That he could send $30 on
both accounts to Alice.

01:17:38.660 --> 01:17:41.260
Now at the same time
Alice is at the ATM,

01:17:41.260 --> 01:17:45.010
and they're hoping that at the
same time Charlie presses transfer

01:17:45.010 --> 01:17:50.890
Alice hits confirm on their transaction
to withdraw $60 from the ATM.

01:17:50.890 --> 01:17:53.410
And when all of these
happen at once, we might

01:17:53.410 --> 01:17:56.810
run into some inconsistent
states in our database.

01:17:56.810 --> 01:17:59.710
Let's say first we run
this transaction here.

01:17:59.710 --> 01:18:00.895
Let's focus on this one.

01:18:00.895 --> 01:18:02.770
The first step is to
check Charlie's balance.

01:18:02.770 --> 01:18:05.500
Does Charlie have $30?

01:18:05.500 --> 01:18:06.970
It seems like Charlie does.

01:18:06.970 --> 01:18:11.410
We continue we might say, OK, let me
add $30 to Alice's account like this.

01:18:11.410 --> 01:18:15.280
Then let me subtract $30 from
Charlie's account down below here.

01:18:15.280 --> 01:18:18.130
And finally, that's the
end of our transaction.

01:18:18.130 --> 01:18:21.610
We checked Charlie's balance,
we added $30 to Alice,

01:18:21.610 --> 01:18:24.790
and subtracted $30 from Charlie.

01:18:24.790 --> 01:18:31.180
But now at the same time, we're also
processing the other transfer of $30

01:18:31.180 --> 01:18:32.050
to Alice.

01:18:32.050 --> 01:18:35.590
So here we check in terms of
our time, top to bottom here--

01:18:35.590 --> 01:18:37.810
we check Charlie's
balance at this point.

01:18:37.810 --> 01:18:39.820
It's still $30.

01:18:39.820 --> 01:18:43.060
Knowing this, this process
goes ahead and says

01:18:43.060 --> 01:18:45.730
let's add $30 to Alice's account.

01:18:45.730 --> 01:18:49.930
And now only later let's
subtract $30 like this.

01:18:49.930 --> 01:18:54.740
Now Charlie well appears to
have no money in his account.

01:18:54.740 --> 01:19:01.330
But at this time above we've already
sent $30 to Alice's account two times.

01:19:01.330 --> 01:19:03.610
And if Alice is waiting
at the ATM over here

01:19:03.610 --> 01:19:06.400
and they press Confirm on that
withdrawal, what might happen

01:19:06.400 --> 01:19:07.360
is something like this.

01:19:07.360 --> 01:19:10.840
At that moment that Alice
has $60 in their account--

01:19:10.840 --> 01:19:14.440
they check and they see it-- they
could then withdraw that $60.

01:19:14.440 --> 01:19:17.390
And now that money is
gone from our bank,

01:19:17.390 --> 01:19:19.840
even if we realize Charlie
shouldn't have been

01:19:19.840 --> 01:19:22.980
able to make that transaction happen.

01:19:22.980 --> 01:19:26.510
So what logically is a solution here?

01:19:26.510 --> 01:19:31.170
It seems these transactions
happened all at the same time.

01:19:31.170 --> 01:19:35.060
But what would you do to reorder
these statements to make it clear

01:19:35.060 --> 01:19:39.690
that we can only do one at a time?

01:19:39.690 --> 01:19:45.260
What would you do in this case to
update this table to fix this problem?

01:19:45.260 --> 01:19:52.710
AUDIENCE: I would do the withdrawal
and the debits as one transaction.

01:19:52.710 --> 01:19:55.580
So the subtraction is
done and the addition

01:19:55.580 --> 01:20:01.658
is done as one unit of work in order
for the double transfer not to happen.

01:20:01.658 --> 01:20:02.450
CARTER ZENKE: Yeah.

01:20:02.450 --> 01:20:06.020
So ideally we should treat each of
these as a separate transaction.

01:20:06.020 --> 01:20:10.280
Literally a transaction in the database
sense where if I go back to this table

01:20:10.280 --> 01:20:15.000
if Charlie transfers $30 to
Alice at roughly the same time,

01:20:15.000 --> 01:20:17.210
I should handle these sequentially.

01:20:17.210 --> 01:20:20.990
I should first do this one,
then I should do this one.

01:20:20.990 --> 01:20:25.760
And only after that should Alice be
able to withdraw their $60, or at least

01:20:25.760 --> 01:20:27.450
try to in this case.

01:20:27.450 --> 01:20:30.680
So we need to make these transactions
sequential, and the word for this

01:20:30.680 --> 01:20:32.540
is isolating our transactions.

01:20:32.540 --> 01:20:34.580
Here they are not isolated.

01:20:34.580 --> 01:20:36.390
They're working at the same time.

01:20:36.390 --> 01:20:38.630
But in reality I want
them to be isolated--

01:20:38.630 --> 01:20:41.390
happen one after another after another.

01:20:41.390 --> 01:20:44.870
So ideally our table
looks more like this.

01:20:44.870 --> 01:20:48.890
Charlie hits transfer at the
exact same time on two laptops,

01:20:48.890 --> 01:20:54.170
but because we're working in
transactions we only process one first.

01:20:54.170 --> 01:20:55.610
We start with this one here.

01:20:55.610 --> 01:21:00.860
We check the balance, we say let's add
$30 to Alice's account and subtract $30

01:21:00.860 --> 01:21:01.940
from Charlie's.

01:21:01.940 --> 01:21:05.930
Once we're sure that is OK,
we'll commit the transaction

01:21:05.930 --> 01:21:08.120
and that is a done deal.

01:21:08.120 --> 01:21:09.770
Now though, we'll go down below.

01:21:09.770 --> 01:21:11.630
We'll say begin transaction.

01:21:11.630 --> 01:21:15.140
We'll then say Charlie's balance
is $0 because we're doing this

01:21:15.140 --> 01:21:16.940
after we did our first transaction.

01:21:16.940 --> 01:21:19.370
Charlie has no more money to give Alice.

01:21:19.370 --> 01:21:23.480
At that point what we'll do
is try to add $30 to Alice

01:21:23.480 --> 01:21:27.240
but subtract $30 from Charlie, and
we'll realize we made a mistake.

01:21:27.240 --> 01:21:31.050
We should roll back and
undo what we just did here.

01:21:31.050 --> 01:21:35.060
So these now are separate
transactions happening sequentially,

01:21:35.060 --> 01:21:38.720
and we avoid our error
of giving Alice $60 when

01:21:38.720 --> 01:21:41.180
she shouldn't have that money at all.

01:21:41.180 --> 01:21:46.460
And then only after this is done do
we let Alice try to withdraw $60.

01:21:46.460 --> 01:21:47.780
We'll go over here and see.

01:21:47.780 --> 01:21:51.900
Alice tries to have a balance of
$60, withdraw $60, and commit it.

01:21:51.900 --> 01:21:56.570
But at this point she wouldn't
actually have that money there.

01:21:56.570 --> 01:22:01.810
So let's see now what questions
do we have on the way transactions

01:22:01.810 --> 01:22:04.860
are sequential in this case?

01:22:04.860 --> 01:22:08.410
AUDIENCE: In particularly I want
to ask that there are millions

01:22:08.410 --> 01:22:10.480
of transactions that are going on.

01:22:10.480 --> 01:22:14.320
Particularly, also, is it like there
that sequentially these transactions

01:22:14.320 --> 01:22:16.750
are being handled for the
financial transaction over here

01:22:16.750 --> 01:22:20.650
that because there are a number
of banks in a single country

01:22:20.650 --> 01:22:22.150
or international transactions.

01:22:22.150 --> 01:22:25.210
Are they handle sequentially or there
is some parallel earning mechanism

01:22:25.210 --> 01:22:27.583
also exist if you may please clarify.

01:22:27.583 --> 01:22:29.000
CARTER ZENKE: Yeah, good question.

01:22:29.000 --> 01:22:31.630
So here visually we see these
transactions are actually

01:22:31.630 --> 01:22:34.000
sequential in the operation here.

01:22:34.000 --> 01:22:37.240
And then your question is, are
they literally sequential, like one

01:22:37.240 --> 01:22:38.470
after another after another?

01:22:38.470 --> 01:22:41.855
In reality they are very much
just sequential like this.

01:22:41.855 --> 01:22:43.730
A database can get
creative with use of locks

01:22:43.730 --> 01:22:46.063
as we'll see you in just a
minute, but by and large they

01:22:46.063 --> 01:22:48.130
are sequential one after the other.

01:22:48.130 --> 01:22:51.370
And it's the job of a
database module to ensure

01:22:51.370 --> 01:22:56.300
that these transactions are executed
sequentially in an isolated way.

01:22:56.300 --> 01:22:58.360
Good question.

01:22:58.360 --> 01:22:58.900
OK.

01:22:58.900 --> 01:23:04.340
So let's see now how, if a database
is able to make these sequential,

01:23:04.340 --> 01:23:06.610
it must be using something
underneath the hood

01:23:06.610 --> 01:23:10.180
to ensure that these
transactions don't interfere.

01:23:10.180 --> 01:23:15.610
And in SQLite and other DBMSs we use
this idea of a lock on the database

01:23:15.610 --> 01:23:20.680
to ensure people can't access data
when they really shouldn't be able to.

01:23:20.680 --> 01:23:23.200
So here let's look at
an idea of what SQLite

01:23:23.200 --> 01:23:26.920
has at different kinds of
locks on this database.

01:23:26.920 --> 01:23:28.490
We have a few different kinds--

01:23:28.490 --> 01:23:32.290
unlocked, shared, and exclusive,
among some others here.

01:23:32.290 --> 01:23:35.913
Each one has a different
meaning for different processes.

01:23:35.913 --> 01:23:38.080
So let's take a look here
at our bank account table.

01:23:38.080 --> 01:23:42.280
Let's say two computers want
to access this account's table.

01:23:42.280 --> 01:23:47.560
Well at this time because nobody is
accessing our database, it is unlocked.

01:23:47.560 --> 01:23:50.650
Anyone could read from the
database, sort of see what's inside,

01:23:50.650 --> 01:23:54.490
or write to it-- add some
data or update some data.

01:23:54.490 --> 01:23:59.650
But let's say we have one computer here
who wants to read from this database.

01:23:59.650 --> 01:24:02.560
They want to see what data is inside.

01:24:02.560 --> 01:24:05.650
Well that database in
SQLite would acquire

01:24:05.650 --> 01:24:08.650
what's called a shared lock like this.

01:24:08.650 --> 01:24:12.130
They're able to read
the database but also

01:24:12.130 --> 01:24:14.500
it's to allow others to read as well.

01:24:14.500 --> 01:24:17.260
A shared lock means I'm only reading.

01:24:17.260 --> 01:24:20.180
You too could read as well.

01:24:20.180 --> 01:24:22.180
So let's say this computer comes in.

01:24:22.180 --> 01:24:24.760
They decide to read from this table.

01:24:24.760 --> 01:24:28.090
They could acquire the same shared lock.

01:24:28.090 --> 01:24:31.240
And because we're only
reading, not updating,

01:24:31.240 --> 01:24:35.120
multiple transactions
can read all at once.

01:24:35.120 --> 01:24:37.820
We don't have to make these sequential.

01:24:37.820 --> 01:24:42.170
But if we have a transaction that's
hoping to write to the database,

01:24:42.170 --> 01:24:43.730
to update some value--

01:24:43.730 --> 01:24:46.370
well that's when things
get a little weird.

01:24:46.370 --> 01:24:50.760
I can't let somebody else read
my data while I'm updating it.

01:24:50.760 --> 01:24:53.690
So what I need is for
this process here, if it's

01:24:53.690 --> 01:24:57.290
writing, to acquire an exclusive lock.

01:24:57.290 --> 01:25:01.100
Meaning this is the only
computer, only transaction that

01:25:01.100 --> 01:25:04.310
can read and write to this database.

01:25:04.310 --> 01:25:09.540
All others must hold off
in order to be sequential.

01:25:09.540 --> 01:25:13.320
So we've seen transactions
then are sequential

01:25:13.320 --> 01:25:17.070
and they use this idea of a
lock to ensure other folks

01:25:17.070 --> 01:25:22.210
can't read when they're doing something
sensitive like writing to the database.

01:25:22.210 --> 01:25:28.210
Let me ask now what questions we have
on locks and how they work for us here?

01:25:28.210 --> 01:25:30.680
AUDIENCE: OK, my
question is that suppose

01:25:30.680 --> 01:25:33.080
you're one computer [INAUDIBLE].

01:25:33.080 --> 01:25:38.030
For what amount this would
allow to do the exclusive lock?

01:25:38.030 --> 01:25:43.610
Maybe if the timestamp is
larger, then some other computers

01:25:43.610 --> 01:25:46.740
will get timeout exception
and other things, right?

01:25:46.740 --> 01:25:48.800
Because it's not allowed to do--

01:25:48.800 --> 01:25:50.330
exclusively it's a lock.

01:25:50.330 --> 01:25:52.370
How are we going to decide those things?

01:25:52.370 --> 01:25:55.010
At what time one
instance is going to be--

01:25:55.010 --> 01:25:57.440
put the exclusive lock on the resource?

01:25:57.440 --> 01:25:59.600
CARTER ZENKE: So the
question is, at what point

01:25:59.600 --> 01:26:02.690
does the transaction actually
get an exclusive lock,

01:26:02.690 --> 01:26:07.050
and how do we decide how to prioritize
different transactions here?

01:26:07.050 --> 01:26:10.310
Well there are a few algorithms
you could use, one of which

01:26:10.310 --> 01:26:14.090
is simply taking the transaction
that just came first.

01:26:14.090 --> 01:26:17.630
If at this point I receive a
request for an exclusive lock,

01:26:17.630 --> 01:26:22.040
I might just wait for others to finish
reading and allow that one to proceed.

01:26:22.040 --> 01:26:26.490
I could also use a process called
time-stamping, which is similar to this

01:26:26.490 --> 01:26:30.230
where I look at the timestamp, the time
my database tried to write to the log

01:26:30.230 --> 01:26:33.560
and use the one that is
earliest in that case.

01:26:33.560 --> 01:26:38.210
Now it's important to note that an
exclusive lock does block others

01:26:38.210 --> 01:26:40.380
from running their own transactions.

01:26:40.380 --> 01:26:44.480
If I have an exclusive lock, no
one else can run their transaction.

01:26:44.480 --> 01:26:48.290
That is kind of a necessary
downside in this case

01:26:48.290 --> 01:26:53.920
to ensure that I don't make
mistakes in updating my table.

01:26:53.920 --> 01:26:55.480
Let's take one more here.

01:26:55.480 --> 01:26:57.850
AUDIENCE: What's the
granularity on locking?

01:26:57.850 --> 01:27:02.050
I mean, you seem to say you locked
the database, but that's craziness.

01:27:02.050 --> 01:27:05.080
Do you lock a table or do
you lock a record on a table?

01:27:05.080 --> 01:27:07.607
CARTER ZENKE: Yeah, it might
depend on the DBMS itself.

01:27:07.607 --> 01:27:13.540
So at the very coarsest, a lock
might just lock the entire database.

01:27:13.540 --> 01:27:15.517
I can't actually see
anything inside of it.

01:27:15.517 --> 01:27:18.100
And I'll be able to demonstrate
that for you in a minute here.

01:27:18.100 --> 01:27:19.420
Let me go back to my computer.

01:27:19.420 --> 01:27:22.900
Let me try to deliberately
lock this database.

01:27:22.900 --> 01:27:26.260
If I go to my terminal, I
can begin a transaction that

01:27:26.260 --> 01:27:28.870
automatically has an exclusive lock.

01:27:28.870 --> 01:27:32.980
In SQLite the statement
is BEGIN EXCLUSIVE.

01:27:32.980 --> 01:27:34.930
BEGIN EXCLUSIVE TRANSACTION.

01:27:34.930 --> 01:27:39.280
I'll hit Enter like this, and now I'm
in the middle of this transaction.

01:27:39.280 --> 01:27:39.910
It's exclusive.

01:27:39.910 --> 01:27:43.060
Has an exclusive lock
nobody else can read.

01:27:43.060 --> 01:27:47.230
If I go back to my other terminal
here, connect to my database,

01:27:47.230 --> 01:27:51.880
I could try SELECT * FROM "accounts"
just to read from this table.

01:27:51.880 --> 01:27:54.280
And now I see Runtime error--

01:27:54.280 --> 01:27:55.900
database is locked.

01:27:55.900 --> 01:28:00.280
The entire database cannot be read
from at the very coarsest level

01:28:00.280 --> 01:28:02.170
of an exclusive lock.

01:28:02.170 --> 01:28:04.600
Now because these locks are so coarse--

01:28:04.600 --> 01:28:09.310
I have to lock the entire database,
SQLite has its own module,

01:28:09.310 --> 01:28:14.050
a way of prioritizing transactions
and ensuring that a transaction only

01:28:14.050 --> 01:28:17.590
has an exclusive lock for the
duration it absolutely needs to.

01:28:17.590 --> 01:28:19.802
Because otherwise you would
slow down substantially

01:28:19.802 --> 01:28:21.760
if I had to wait and wait
and wait for somebody

01:28:21.760 --> 01:28:25.400
to finish with their transaction.

01:28:25.400 --> 01:28:26.770
OK.

01:28:26.770 --> 01:28:31.900
So this then is how we can handle not
just one query at a time but multiple.

01:28:31.900 --> 01:28:35.860
And today we've seen how to optimize
our databases from making sure queries

01:28:35.860 --> 01:28:39.670
are faster to also making sure our
databases use as little space as

01:28:39.670 --> 01:28:40.420
possible.

01:28:40.420 --> 01:28:43.270
We'll see next time how
to actually scale up

01:28:43.270 --> 01:28:45.460
our databases using other DBMSs--

01:28:45.460 --> 01:28:48.280
like MySQL-- besides SQLite.

01:28:48.280 --> 01:28:51.120
We'll see you next time.