WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:02.988
[MUSIC PLAYING]

00:01:01.790 --> 00:01:03.500
SPEAKER 1: All right.

00:01:03.500 --> 00:01:09.590
This is CS50, and this is week 7
on the day before All Hallows' Eve.

00:01:09.590 --> 00:01:13.760
Today, we will introduce yet
another language, the goal of which

00:01:13.760 --> 00:01:16.430
is not to introduce another
language for language's sake,

00:01:16.430 --> 00:01:20.510
but to really begin to emphasize that
when it comes to the world of software

00:01:20.510 --> 00:01:23.390
programming, engineering
more generally, there's often

00:01:23.390 --> 00:01:25.170
different tools for different jobs.

00:01:25.170 --> 00:01:28.632
And if you were to try to use C to
solve all of our future problems,

00:01:28.632 --> 00:01:30.590
it would actually be
painful, as you've already

00:01:30.590 --> 00:01:35.360
seen how much more pleasant it is to
solve certain problems with Python.

00:01:35.360 --> 00:01:37.460
But today, quite quickly,
we'll realize that even

00:01:37.460 --> 00:01:40.440
Python's actually not the most pleasant
way to solve a lot of problems,

00:01:40.440 --> 00:01:43.100
especially when it comes
to data, data at scale.

00:01:43.100 --> 00:01:45.830
And, in fact, as we transition
in the final weeks of CS50

00:01:45.830 --> 00:01:49.130
to the world of web programming, and if
you so choose with your final project

00:01:49.130 --> 00:01:51.500
mobile programming, you'll
actually need more tools

00:01:51.500 --> 00:01:54.150
in your tool kit than
C and Python alone.

00:01:54.150 --> 00:01:57.620
In fact, today we'll introduce a
database-centric language called S-Q-L,

00:01:57.620 --> 00:01:58.590
or SQL.

00:01:58.590 --> 00:02:02.400
Next week, we'll explore markup
languages, like HTML and CSS.

00:02:02.400 --> 00:02:03.970
A bit of JavaScript, as well.

00:02:03.970 --> 00:02:07.290
And then we'll synthesize all of this
together at the end of the class,

00:02:07.290 --> 00:02:10.360
as some of you might, for
your final projects, as well.

00:02:10.360 --> 00:02:13.120
But, before we do that,
and talk about data,

00:02:13.120 --> 00:02:15.310
let's actually start
to gather some data.

00:02:15.310 --> 00:02:19.770
So if you could visit this URL
here on your phone, or a laptop.

00:02:19.770 --> 00:02:23.460
Or, if easier, here's a
bar code version thereof.

00:02:23.460 --> 00:02:25.980
You can point your camera
at this bar code here,

00:02:25.980 --> 00:02:29.220
and it's going to pull up a
relatively short Google form that's

00:02:29.220 --> 00:02:32.040
just going to ask you
a couple of questions,

00:02:32.040 --> 00:02:35.070
so that we can collect
some actual live data

00:02:35.070 --> 00:02:37.800
and actually play around with
it, initially, in Python.

00:02:37.800 --> 00:02:41.020
So if you go to that URL there,
you'll see a Google form.

00:02:41.020 --> 00:02:44.577
And if you haven't been able to
scan it quite yet, that's fine.

00:02:44.577 --> 00:02:47.160
Just kind of look over the
shoulder of the person next to you.

00:02:47.160 --> 00:02:49.960
But you, or they, will see a
little something like this.

00:02:49.960 --> 00:02:53.130
So among the questions will be,
which is your favorite language?

00:02:53.130 --> 00:02:56.160
On the list, only thus far,
is Scratch, C, and Python,

00:02:56.160 --> 00:02:58.350
and below that you'll see
another question asking

00:02:58.350 --> 00:03:02.960
about your favorite problem
across the problem sets thus far.

00:03:02.960 --> 00:03:04.710
Each of them is radio
buttons, which means

00:03:04.710 --> 00:03:08.800
you'll be able to select one
button for each of those questions.

00:03:08.800 --> 00:03:11.310
And, ultimately, what's
going to be nice is

00:03:11.310 --> 00:03:14.132
that if you've never used Google
Forms before as an administrator,

00:03:14.132 --> 00:03:17.340
all of that data is actually going to
end up being, not only in Google Forms,

00:03:17.340 --> 00:03:19.525
but also, if you so choose,
in Google Spreadsheets.

00:03:19.525 --> 00:03:22.650
Which is an example, of course, of a
spreadsheet software similar in spirit

00:03:22.650 --> 00:03:26.940
to Apple Numbers on Macs or Microsoft
Excel on different platforms.

00:03:26.940 --> 00:03:29.220
And Google Sheets is
going to allow us to store

00:03:29.220 --> 00:03:30.960
all of that data in rows and columns.

00:03:30.960 --> 00:03:33.090
And so, since Google made
both of these products,

00:03:33.090 --> 00:03:34.690
they integrated one with the other.

00:03:34.690 --> 00:03:39.930
So, in fact, if I, on my laptop
here in another window, open

00:03:39.930 --> 00:03:41.280
this up-- let me flip over.

00:03:41.280 --> 00:03:42.660
Here's the live spreadsheet.

00:03:42.660 --> 00:03:45.510
And we'll see that the very
first person who buzzed in really

00:03:45.510 --> 00:03:48.090
liked Python, as did a lot
of other people thereafter.

00:03:48.090 --> 00:03:52.390
But, Hello, World was your
favorite in Python, which is great.

00:03:52.390 --> 00:03:54.330
There's a couple of
votes for Scratch here.

00:03:54.330 --> 00:03:58.110
If we scroll down, there's one hold
out for C, who really liked Credit,

00:03:58.110 --> 00:03:59.320
in this case here.

00:03:59.320 --> 00:04:03.250
And if we scroll down further, it looks
like Python and Scratch are in there.

00:04:03.250 --> 00:04:05.727
A few more C's, and so on and so forth.

00:04:05.727 --> 00:04:08.310
So suppose that we wanted to,
actually, now analyze this data.

00:04:08.310 --> 00:04:11.587
Now, any of you who have used Excel,
or Numbers, or Google Spreadsheets

00:04:11.587 --> 00:04:13.920
know that it comes with built
in functions and formulas,

00:04:13.920 --> 00:04:17.040
and we can do all of that, but suppose
there was a huge amount of data.

00:04:17.040 --> 00:04:19.920
Or, suppose that this data was
not coming in via Google Sheets,

00:04:19.920 --> 00:04:23.130
but via your own web application,
or your own mobile application,

00:04:23.130 --> 00:04:25.530
and it's just ending up in
some kind of spreadsheet.

00:04:25.530 --> 00:04:28.530
Well, wouldn't it be nice if we could
actually analyze that kind of data

00:04:28.530 --> 00:04:29.250
with code.

00:04:29.250 --> 00:04:30.570
And, in fact, you can.

00:04:30.570 --> 00:04:35.220
And the simplest way to store a bunch
of data isn't with anything fancy,

00:04:35.220 --> 00:04:39.450
in fact, but just a literal text
file, something ending in .txt,

00:04:39.450 --> 00:04:41.520
or, maybe more commonly, .csv.

00:04:41.520 --> 00:04:44.400
In fact, what we'd call
a flat file database

00:04:44.400 --> 00:04:48.820
is literally just code for a text
file containing all of your data.

00:04:48.820 --> 00:04:53.040
But your data typically has delimiters
that separate some values from others.

00:04:53.040 --> 00:04:55.050
And, in fact, the most
common approach, daresay,

00:04:55.050 --> 00:04:59.430
is to use, indeed, comma-separated
values, or, CSV files.

00:04:59.430 --> 00:05:04.080
And what that means is that in simple
text alone, just asking, maybe Unicode,

00:05:04.080 --> 00:05:06.960
you sort of mimic the
idea of rows and columns

00:05:06.960 --> 00:05:09.570
by using newline characters,
like backslash n,

00:05:09.570 --> 00:05:11.260
to represent row after row after row.

00:05:11.260 --> 00:05:12.510
That's pretty straightforward.

00:05:12.510 --> 00:05:14.280
To just move the cursor
to the next line.

00:05:14.280 --> 00:05:18.930
And because text files aren't graphical,
so there's no notion of vertical bars

00:05:18.930 --> 00:05:21.510
that you can put between what
you and I think of as columns,

00:05:21.510 --> 00:05:24.630
you literally just use a comma,
or some other such symbol

00:05:24.630 --> 00:05:27.212
to separate one value from another.

00:05:27.212 --> 00:05:28.920
So, in fact, let me
go ahead and do this.

00:05:28.920 --> 00:05:31.837
Even if you've never done this before,
it turns out, in Google Sheets,

00:05:31.837 --> 00:05:35.130
and also Excel and Numbers,
you can export your data,

00:05:35.130 --> 00:05:38.880
not in some proprietary Apple
or Microsoft or Google format,

00:05:38.880 --> 00:05:43.380
but in a globally portable
format known as .csv.

00:05:43.380 --> 00:05:44.970
So let me go up to file.

00:05:44.970 --> 00:05:47.037
I will go to download.

00:05:47.037 --> 00:05:49.620
And notice, I can export this
in a bunch of different formats,

00:05:49.620 --> 00:05:52.722
but the one I care about for
today is going to be .csv.

00:05:52.722 --> 00:05:55.180
On my Mac, that's going to put
it into my downloads folder.

00:05:55.180 --> 00:05:57.722
And what I'm going to go ahead
here and do, in just a moment,

00:05:57.722 --> 00:06:00.690
is let me go ahead it
and open up VS Code.

00:06:00.690 --> 00:06:03.300
So, in short, I've downloaded this file.

00:06:03.300 --> 00:06:04.980
Google gave it a long default file name.

00:06:04.980 --> 00:06:06.900
I've renamed it to favorites.csv.

00:06:06.900 --> 00:06:10.200
Let me go ahead and actually
open it now in VS Code here.

00:06:10.200 --> 00:06:12.480
And this now is the exact same data.

00:06:12.480 --> 00:06:14.800
If you're still submitting the form,
I'm afraid you didn't make the cut off.

00:06:14.800 --> 00:06:17.220
So we're not going to see more
data ending up in this file

00:06:17.220 --> 00:06:19.260
because I've literally
downloaded a copy of it.

00:06:19.260 --> 00:06:21.240
But, indeed, if I scroll
through this, we'll

00:06:21.240 --> 00:06:25.050
see that it's got some
399 lines of data.

00:06:25.050 --> 00:06:28.080
Or, technically, 398
because the very first one

00:06:28.080 --> 00:06:30.990
is what we'd call a header
row, which just describes

00:06:30.990 --> 00:06:32.820
what each of these columns means.

00:06:32.820 --> 00:06:35.670
Now notice that even though the
Google Sheets interface actually

00:06:35.670 --> 00:06:39.780
put all of this data in proper
graphical rows and columns,

00:06:39.780 --> 00:06:42.510
you can still see the rows because
they're just separate lines,

00:06:42.510 --> 00:06:45.900
and you can kind of see the
columns by focusing on the comma

00:06:45.900 --> 00:06:47.970
here, the comma here, the comma here.

00:06:47.970 --> 00:06:50.820
And then, also, the comma
here, the comma here,

00:06:50.820 --> 00:06:52.540
the comma here, and so forth.

00:06:52.540 --> 00:06:53.910
Now, a quick subtlety.

00:06:53.910 --> 00:06:59.490
Hello, World is a two-word problem
name, and it itself has a comma in it.

00:06:59.490 --> 00:07:06.090
What's to stop me from confusing the
comma in Hello, World from the commas

00:07:06.090 --> 00:07:09.410
that Google, apparently,
inserted into this file?

00:07:09.410 --> 00:07:09.910
Yeah.

00:07:09.910 --> 00:07:11.925
So, automatically, you
all did not type this.

00:07:11.925 --> 00:07:13.300
We did not type it into the form.

00:07:13.300 --> 00:07:16.750
But Google is smart enough, as
is Apple and Microsoft, when

00:07:16.750 --> 00:07:20.958
they export CSVs to somehow escape
seemingly dangerous characters,

00:07:20.958 --> 00:07:22.750
or characters that
could just break things.

00:07:22.750 --> 00:07:25.180
And, in this case, a
convention in the CSV world

00:07:25.180 --> 00:07:28.340
is just to quote any strings
that themselves have commas

00:07:28.340 --> 00:07:32.150
so that any program you're writing that
reads this file doesn't get confused.

00:07:32.150 --> 00:07:35.260
So the only thing that came
in automatically from Google

00:07:35.260 --> 00:07:37.690
is just this timestamp
here, based on the time

00:07:37.690 --> 00:07:39.370
of day in our own local timezone.

00:07:39.370 --> 00:07:40.970
That was added automatically.

00:07:40.970 --> 00:07:43.100
So we have three, and
not just two columns.

00:07:43.100 --> 00:07:45.460
So with that said, we
have three columns:

00:07:45.460 --> 00:07:47.980
timestamp, language, and problem.

00:07:47.980 --> 00:07:50.200
The latter two of which
came from all of you.

00:07:50.200 --> 00:07:52.840
Let's actually write some
code that analyzes this data.

00:07:52.840 --> 00:07:55.720
Let's figure out what the most
popular something or other is here.

00:07:55.720 --> 00:07:57.970
So I'll close the CSV file.

00:07:57.970 --> 00:08:02.140
Let me go ahead and use what seems to
be folks favorite language, thus far,

00:08:02.140 --> 00:08:05.320
and write a file called
favorites.py, for instance.

00:08:05.320 --> 00:08:08.440
And I'm going to use Python
to open that CSV file,

00:08:08.440 --> 00:08:12.180
analyze it, crunch some numbers, and
output some statistics, if you will.

00:08:12.180 --> 00:08:15.920
So the easiest way to manipulate CSV
files, as you might have gleaned,

00:08:15.920 --> 00:08:19.940
is not to just open the file yourself,
look for commas, allocate memory,

00:08:19.940 --> 00:08:22.220
or anything like that
in C. In Python, you

00:08:22.220 --> 00:08:27.230
can literally just import CSV,
which is a Python module that

00:08:27.230 --> 00:08:29.570
gives you CSV related functionality.

00:08:29.570 --> 00:08:32.390
In Python, there's a bunch
of ways to open files.

00:08:32.390 --> 00:08:35.960
One way to open a file
is like this in Python.

00:08:35.960 --> 00:08:40.039
You can say file equals
open, similar to fopen in C.

00:08:40.039 --> 00:08:44.600
You can specify the name of the file
you want to open, like favorites.csv,

00:08:44.600 --> 00:08:48.330
and you can explicitly say you
want to open the file for reading,

00:08:48.330 --> 00:08:50.240
quote unquote, "r," just like fopen.

00:08:50.240 --> 00:08:53.280
Strictly speaking, in
Python read is implied.

00:08:53.280 --> 00:08:55.580
So if you omit the second
argument, it will still work.

00:08:55.580 --> 00:08:58.490
But for parity with fopen,
I'll do the same here.

00:08:58.490 --> 00:09:03.020
Then you can, maybe, do something
with file, and then, at the end,

00:09:03.020 --> 00:09:05.070
you can close the file like this.

00:09:05.070 --> 00:09:08.840
So in Python, there really is a
mapping between fopen and fclose.

00:09:08.840 --> 00:09:12.560
But in Python the functions are
called open and close for short.

00:09:12.560 --> 00:09:16.130
But a more common way, a more
Pythonic way, so to speak, i.e.

00:09:16.130 --> 00:09:19.040
the way people tend to do
it in Python, is actually

00:09:19.040 --> 00:09:22.610
to use a keyword that didn't exist
in C, where you instead say with.

00:09:22.610 --> 00:09:26.990
And you say, with open this file
name as a specific file name,

00:09:26.990 --> 00:09:28.970
and then indent it inside of that.

00:09:28.970 --> 00:09:30.950
Now you can do whatever
you want with the file.

00:09:30.950 --> 00:09:33.350
And the implication of using
with, even though it's not

00:09:33.350 --> 00:09:36.740
obvious from the keyword itself, is that
the file will be automatically closed

00:09:36.740 --> 00:09:37.770
for you later.

00:09:37.770 --> 00:09:39.650
So this is just a minor
Python convenience

00:09:39.650 --> 00:09:42.020
so that you don't have to
remember to close the file.

00:09:42.020 --> 00:09:44.030
Unless something goes
wrong, it will just

00:09:44.030 --> 00:09:47.690
close automatically as soon as
you're outside of this with block.

00:09:47.690 --> 00:09:49.460
So how do I go about reading a CSV?

00:09:49.460 --> 00:09:52.940
The simplest way is to actually
give yourself a variable called,

00:09:52.940 --> 00:09:53.900
maybe, reader.

00:09:53.900 --> 00:09:56.780
Like if I want to read the CSV,
I'll call my variable reader.

00:09:56.780 --> 00:10:02.270
Set that equal to the return value of
a function that comes with Python CSV

00:10:02.270 --> 00:10:07.080
module called reader, in lowercase,
and you just pass in the file name.

00:10:07.080 --> 00:10:09.560
So the first line there on
line three opens the file

00:10:09.560 --> 00:10:11.570
and gives me access
to the bytes there in.

00:10:11.570 --> 00:10:14.450
Line four now actually
uses this library that

00:10:14.450 --> 00:10:18.260
comes with Python to just go read it
for me, figure out where the commas are,

00:10:18.260 --> 00:10:21.690
so that it can hand me, line
after line, the data in the file.

00:10:21.690 --> 00:10:24.320
Now the first piece of data
in the file, though, was what?

00:10:24.320 --> 00:10:27.080
What's the very first row?

00:10:27.080 --> 00:10:30.692
So it was that header row with
timestamp, language, problem.

00:10:30.692 --> 00:10:32.900
I actually want to skip that
because that's not data.

00:10:32.900 --> 00:10:36.870
That's what we might call metadata
that's describing my actual data.

00:10:36.870 --> 00:10:39.670
So one way to deal with this, I
can literally just say next reader.

00:10:39.670 --> 00:10:42.170
And next, even though the
semantics here are a little weird,

00:10:42.170 --> 00:10:46.100
it just means, no, no, give me the
next line from that reader instead

00:10:46.100 --> 00:10:47.550
of the first by default.

00:10:47.550 --> 00:10:51.650
And now inside of this with block,
I can do something like this.

00:10:51.650 --> 00:10:56.810
For each row in that reader, let's
do something super simple initially.

00:10:56.810 --> 00:11:00.380
Let's just print out row bracket one.

00:11:00.380 --> 00:11:01.910
So row bracket one.

00:11:01.910 --> 00:11:03.330
So what's going on here?

00:11:03.330 --> 00:11:07.040
Well, the CSV reader in Python
is going to return to you, inside

00:11:07.040 --> 00:11:09.990
of this loop, one row after another.

00:11:09.990 --> 00:11:12.080
Each of those rows, though,
has how many columns?

00:11:12.080 --> 00:11:12.920
Three.

00:11:12.920 --> 00:11:15.470
The timestamp, the
language, and the problem.

00:11:15.470 --> 00:11:19.430
And just like in C, our lists
in Python are zero index.

00:11:19.430 --> 00:11:20.720
So zero, one, two.

00:11:20.720 --> 00:11:25.350
So if I'm printing out row bracket one,
that's the second, or middle, column.

00:11:25.350 --> 00:11:29.220
So what's this going
to print row after row?

00:11:29.220 --> 00:11:32.340
Each of the languages
that you all replied with.

00:11:32.340 --> 00:11:32.940
So that's all.

00:11:32.940 --> 00:11:34.440
This isn't doing any
kind of analytics yet.

00:11:34.440 --> 00:11:36.440
I'm just going through
the motions to, at least,

00:11:36.440 --> 00:11:37.900
print out some data of interest.

00:11:37.900 --> 00:11:38.692
So let me run this.

00:11:38.692 --> 00:11:41.550
Python of favorites.py, enter.

00:11:41.550 --> 00:11:43.270
And it happens super fast.

00:11:43.270 --> 00:11:47.170
But if I scroll back in my terminal
window, there is all of that raw data.

00:11:47.170 --> 00:11:49.472
So this is to say, once you
know the function names,

00:11:49.472 --> 00:11:51.180
once you know the
keywords, it's actually

00:11:51.180 --> 00:11:54.090
pretty simple in Python to just
get up and running with a file,

00:11:54.090 --> 00:11:56.070
and start looking at the data therein.

00:11:56.070 --> 00:12:02.910
But it turns out that it's a little
sub-optimal to use the reader alone.

00:12:02.910 --> 00:12:05.400
It turns out there's better
ways to do this, and let

00:12:05.400 --> 00:12:06.780
me make this clear as follows.

00:12:06.780 --> 00:12:09.660
I don't strictly need a variable,
but let me actually declare

00:12:09.660 --> 00:12:13.560
a variable called favorite, set
it equal to row bracket one,

00:12:13.560 --> 00:12:15.630
and then print out that favorite value.

00:12:15.630 --> 00:12:17.438
This is not doing anything new.

00:12:17.438 --> 00:12:19.230
It's just declaring an
additional variable,

00:12:19.230 --> 00:12:22.380
but I wanted to highlight the fact
that I'm just kind of trusting that row

00:12:22.380 --> 00:12:24.630
bracket one is the problem--

00:12:24.630 --> 00:12:26.430
or, is the language I care about.

00:12:26.430 --> 00:12:27.660
Is language, is language.

00:12:27.660 --> 00:12:30.790
But suppose one of you, or I,
go into the Google spreadsheet,

00:12:30.790 --> 00:12:32.668
and like anyone might
with a spreadsheet,

00:12:32.668 --> 00:12:34.210
you might start moving things around.

00:12:34.210 --> 00:12:36.947
And you might swap some of
the columns left and right.

00:12:36.947 --> 00:12:39.280
You might delete one of the
columns, add something else.

00:12:39.280 --> 00:12:41.830
In short, spreadsheets
are arguably fragile

00:12:41.830 --> 00:12:44.590
in that it's pretty easy in the
GUI, the graphical user interface,

00:12:44.590 --> 00:12:46.040
to change them around.

00:12:46.040 --> 00:12:50.080
And so my code in Python,
accordingly, is, arguably, fragile

00:12:50.080 --> 00:12:53.770
because I'm just hoping that row
bracket one is always the data

00:12:53.770 --> 00:12:54.890
that I care about.

00:12:54.890 --> 00:12:56.450
So what would be marginally better?

00:12:56.450 --> 00:12:59.110
Well, let's actually use
that header row instead.

00:12:59.110 --> 00:13:02.920
And more common, arguably, in Python,
is not to use a simple reader,

00:13:02.920 --> 00:13:05.650
but, instead, what we would
call a Dictionary Reader.

00:13:05.650 --> 00:13:09.190
I'm going to change this to
DictReader, capital D, capital R,

00:13:09.190 --> 00:13:12.010
and that's it for the
change to line four.

00:13:12.010 --> 00:13:14.197
On line five, I'm going to
get rid of the next line

00:13:14.197 --> 00:13:16.030
because I don't want
to skip the header now.

00:13:16.030 --> 00:13:19.210
What DictReader does,
which reader does not,

00:13:19.210 --> 00:13:22.690
is it automatically analyzes
that first line in the file,

00:13:22.690 --> 00:13:25.510
figures out what are all
of your columns called,

00:13:25.510 --> 00:13:30.220
and, thereafter, when you iterate over
this reader what each of your rows

00:13:30.220 --> 00:13:35.650
now is-- it's no longer a list of
size three, bracket zero, bracket

00:13:35.650 --> 00:13:36.490
one, bracket two.

00:13:36.490 --> 00:13:39.670
Each row that you get back
in this loop on line five

00:13:39.670 --> 00:13:44.890
is now, wonderfully, a dictionary
instead, the keys of which

00:13:44.890 --> 00:13:46.090
are from the header field.

00:13:46.090 --> 00:13:47.650
Timestamp, language, problem.

00:13:47.650 --> 00:13:52.460
The values of which are whatever
each of you typed in again and again.

00:13:52.460 --> 00:13:53.890
So how do I change the code?

00:13:53.890 --> 00:13:57.800
I no longer have these numeric indices
because row is no longer a list,

00:13:57.800 --> 00:13:58.850
it's a dictionary.

00:13:58.850 --> 00:14:02.350
So if I literally want the
language that you typed in,

00:14:02.350 --> 00:14:06.520
I can use row bracket, quote unquote,
"language," treating row as a Dict

00:14:06.520 --> 00:14:09.010
not as a list anymore.

00:14:09.010 --> 00:14:12.700
Now this is, again, more robust because
if you move the columns around on me,

00:14:12.700 --> 00:14:15.340
code is still going to work,
at least so long as you don't

00:14:15.340 --> 00:14:17.570
rename the columns at the very top.

00:14:17.570 --> 00:14:20.180
So that's still one
assumption I'm making.

00:14:20.180 --> 00:14:20.680
All right.

00:14:20.680 --> 00:14:23.770
Well, beyond that, what
could I actually do here?

00:14:23.770 --> 00:14:27.422
Well, just to be clear, I don't,
strictly speaking, need this variable.

00:14:27.422 --> 00:14:28.880
So no need to highlight this again.

00:14:28.880 --> 00:14:30.755
So let me just simplify
the code a little bit

00:14:30.755 --> 00:14:33.220
and get rid of this variable,
and instead just print out

00:14:33.220 --> 00:14:34.960
the language in that row.

00:14:34.960 --> 00:14:40.360
As a quick check, let me
rerun Python of favorites.py

00:14:40.360 --> 00:14:43.430
and it seems to still work, even
though there's a lot of output.

00:14:43.430 --> 00:14:45.610
We're not going to check
all 399 lines, but it

00:14:45.610 --> 00:14:48.160
looks like it printed out all
of those popular languages.

00:14:48.160 --> 00:14:50.120
Well, what more can we actually do?

00:14:50.120 --> 00:14:52.420
Well, let's actually now
start to crunch some numbers

00:14:52.420 --> 00:14:55.450
and figure out how many people
like Scratch, how many people like

00:14:55.450 --> 00:14:56.890
C, how many people like Python.

00:14:56.890 --> 00:14:58.580
Let's start to analyze this.

00:14:58.580 --> 00:15:02.290
So maybe the most pedantic
way to do this in Python

00:15:02.290 --> 00:15:07.113
would be to just create some variables
and do all of the counting myself.

00:15:07.113 --> 00:15:08.780
So let me actually go ahead and do this.

00:15:08.780 --> 00:15:10.113
Let me delete this code for now.

00:15:10.113 --> 00:15:12.370
And after opening the
reader, let me create

00:15:12.370 --> 00:15:14.980
a variable called Scratch
and set it equal to zero,

00:15:14.980 --> 00:15:19.060
a variable called C, set it equal
to zero, a variable called Python,

00:15:19.060 --> 00:15:23.290
set it equal to zero, just so that I
have three counters, similar to what

00:15:23.290 --> 00:15:26.620
we did in week zero, week one,
anytime we counted anything Honestly,

00:15:26.620 --> 00:15:29.600
this looks a little
stupid, and it's not wrong.

00:15:29.600 --> 00:15:31.600
In fact, this is how you
would do it in Python,

00:15:31.600 --> 00:15:33.910
but Python also has some clever syntax.

00:15:33.910 --> 00:15:38.800
If you want to be really cool,
you can do Scratch comma, c comma,

00:15:38.800 --> 00:15:43.932
Python equals zero, zero, zero, and
do all three at once if you like that.

00:15:43.932 --> 00:15:47.140
So it tightens up the code a little bit,
even though the effect is ultimately

00:15:47.140 --> 00:15:47.710
the same.

00:15:47.710 --> 00:15:52.010
Now let's go ahead and iterate
over this file row by row by row.

00:15:52.010 --> 00:15:54.730
And if we see Scratch,
increment the Scratch counter.

00:15:54.730 --> 00:15:56.320
If we see C, increment the C counter.

00:15:56.320 --> 00:15:59.000
If we see Python, increment
the Python counter, instead.

00:15:59.000 --> 00:16:00.080
So how do I do this?

00:16:00.080 --> 00:16:03.220
Well, I could do something like,
for each row in the reader,

00:16:03.220 --> 00:16:07.180
just like before, let me go ahead
and get that favorite variable

00:16:07.180 --> 00:16:11.230
and set it equal to the language in
that dictionary that just came back

00:16:11.230 --> 00:16:12.740
as part of this iteration.

00:16:12.740 --> 00:16:19.060
And now I can do something like this, if
favorite equals, equals "Scratch" then,

00:16:19.060 --> 00:16:26.710
with my indentation, I can do Scratch
plus equals one, elif favorite equals,

00:16:26.710 --> 00:16:27.670
equals "C."

00:16:27.670 --> 00:16:31.290
I can go ahead and
increment C plus equals one.

00:16:31.290 --> 00:16:34.440
Else-- and I don't think I
want else, just to be safe.

00:16:34.440 --> 00:16:37.410
Even though we only saw three options,
I think just to be super safe,

00:16:37.410 --> 00:16:41.040
elif favorite equals, equals
"Python," then let's go ahead

00:16:41.040 --> 00:16:42.210
and increment Python.

00:16:42.210 --> 00:16:45.360
The one thing we can't do that we
could do in C is the plus, plus trick.

00:16:45.360 --> 00:16:47.890
So plus equals is as
close as we can get.

00:16:47.890 --> 00:16:48.900
So what have I done?

00:16:48.900 --> 00:16:52.110
Inside of this loop, I've just
incremented each of those counters

00:16:52.110 --> 00:16:56.230
by one if I see Scratch, or
C, or Python, again and again.

00:16:56.230 --> 00:16:56.730
All right.

00:16:56.730 --> 00:17:00.060
Outside of the loop and
outside of the with block,

00:17:00.060 --> 00:17:02.370
because once I'm done
reading all of the rows,

00:17:02.370 --> 00:17:06.493
I might as well let the with clause
close the file automatically,

00:17:06.493 --> 00:17:08.410
let's just go ahead and
print out some values.

00:17:08.410 --> 00:17:13.560
So I'm going to go ahead and print out
Scratch colon, and then inside of there

00:17:13.560 --> 00:17:16.290
let's print out whatever the
value of that variable is.

00:17:16.290 --> 00:17:19.770
Let's then go ahead
and print out C colon,

00:17:19.770 --> 00:17:22.030
and then whatever the
value of the C variable is.

00:17:22.030 --> 00:17:26.160
And then, lastly, let's print out
Python colon and whatever the value is.

00:17:26.160 --> 00:17:29.117
And now I made three typos here.

00:17:29.117 --> 00:17:30.700
This is not going to print the values.

00:17:30.700 --> 00:17:32.320
What do I need to do?

00:17:32.320 --> 00:17:34.030
Sorry?

00:17:34.030 --> 00:17:36.560
Yeah, I'm missing the f in
front of each of these strings

00:17:36.560 --> 00:17:38.470
so that I actually get formatted.

00:17:38.470 --> 00:17:41.800
And that is to say, the variables get
interpolated inside of the curly braces

00:17:41.800 --> 00:17:43.220
like we saw last week.

00:17:43.220 --> 00:17:43.720
All right.

00:17:43.720 --> 00:17:45.928
So, honestly, that's kind
of a decent amount of code,

00:17:45.928 --> 00:17:49.630
like 18 or so lines of code, just
to count the number of responses.

00:17:49.630 --> 00:17:51.070
But let's see if I got it right.

00:17:51.070 --> 00:17:55.330
Let me open my terminal and
run Python of favorites.py,

00:17:55.330 --> 00:17:58.420
and now I see, by an
overwhelming amount,

00:17:58.420 --> 00:18:04.660
folks like Python, followed by C,
followed by Scratch, in that order.

00:18:04.660 --> 00:18:06.910
But that was a decent amount
of code to have to write,

00:18:06.910 --> 00:18:10.540
and it turns out there's actually better
ways of doing this, more Pythonic ways,

00:18:10.540 --> 00:18:12.530
more programmatic ways of doing this.

00:18:12.530 --> 00:18:15.760
And if we think back to one of
our universal data structures.

00:18:15.760 --> 00:18:18.910
Think back to how we preached
last week, and the week before,

00:18:18.910 --> 00:18:21.280
the value of these
dictionaries, more generally.

00:18:21.280 --> 00:18:23.317
Like the CSV module,
clearly a fan of them,

00:18:23.317 --> 00:18:25.150
because that's what
DictReader is giving us,

00:18:25.150 --> 00:18:27.220
dictionary, after
dictionary, after dictionary.

00:18:27.220 --> 00:18:29.290
And this was the general
idea of a dictionary.

00:18:29.290 --> 00:18:31.840
It associates keys with
values, much like you

00:18:31.840 --> 00:18:34.940
might in a two-column table
on a chalkboard, or the like.

00:18:34.940 --> 00:18:39.340
Well, this is what I need if I want
to keep track of how many people said

00:18:39.340 --> 00:18:42.250
Scratch, and C, and Python.

00:18:42.250 --> 00:18:46.900
If I had a piece of chalk, I
could just write Scratch, and C,

00:18:46.900 --> 00:18:48.670
and Python as three keys.

00:18:48.670 --> 00:18:51.910
And then with my chalk and, maybe,
an eraser, keep track of the values.

00:18:51.910 --> 00:18:52.900
They all start at zero.

00:18:52.900 --> 00:18:54.550
Then, I add one, add two, add three.

00:18:54.550 --> 00:18:56.800
Or, maybe, on a chalkboard,
I actually use hash marks.

00:18:56.800 --> 00:19:00.910
But a dictionary is kind of the perfect
data structure for just associating

00:19:00.910 --> 00:19:05.770
something like Scratch, C, Python,
with something else-- keys with values,

00:19:05.770 --> 00:19:06.890
respectively.

00:19:06.890 --> 00:19:09.250
So this is going to
look a little weirder,

00:19:09.250 --> 00:19:13.040
but it's going to be pretty conventional
to do something like this instead.

00:19:13.040 --> 00:19:15.550
Let me go back into VS Code.

00:19:15.550 --> 00:19:17.360
I'll close my terminal window.

00:19:17.360 --> 00:19:20.680
And let me go ahead and
actually delete a lot of this

00:19:20.680 --> 00:19:22.630
because I can simplify this further.

00:19:22.630 --> 00:19:26.500
Let me go ahead and now give
myself, maybe, a variable just,

00:19:26.500 --> 00:19:30.698
generically, called counts, and set
that equal to an empty dictionary.

00:19:30.698 --> 00:19:32.740
And you can actually do
this in a couple of ways.

00:19:32.740 --> 00:19:34.727
You can literally write
out dict with nothing

00:19:34.727 --> 00:19:37.060
in parentheses, which will
give you an empty dictionary,

00:19:37.060 --> 00:19:39.602
like the picture on the board,
but a little more conventional

00:19:39.602 --> 00:19:43.450
is to just use two keystrokes and use
two curly braces with nothing inside.

00:19:43.450 --> 00:19:47.390
That gives me an empty dictionary
like this picture here.

00:19:47.390 --> 00:19:49.160
Now my loop is going to be the same.

00:19:49.160 --> 00:19:52.000
I'm going to do, for
each row in the reader,

00:19:52.000 --> 00:19:55.550
I'm going to go ahead and grab the
favorite language for this person.

00:19:55.550 --> 00:19:58.652
So favorite equals row,
quote unquote, "language."

00:19:58.652 --> 00:20:00.860
But now I'm going to do
something a little different.

00:20:00.860 --> 00:20:02.680
There's two scenarios here.

00:20:02.680 --> 00:20:07.690
Either, I have seen this language before
and I want to increment it by one.

00:20:07.690 --> 00:20:11.950
Or, if I've never seen this language
before because the loop just started,

00:20:11.950 --> 00:20:16.950
what should I initialize the
count for this language to?

00:20:16.950 --> 00:20:18.840
So, one because I've only seen it once.

00:20:18.840 --> 00:20:19.710
Exactly.

00:20:19.710 --> 00:20:22.470
So now let me go ahead
and do exactly that.

00:20:22.470 --> 00:20:26.370
If this current favorite,
that I am seeing in the row,

00:20:26.370 --> 00:20:29.997
is already in the counts
dictionary-- and in Python,

00:20:29.997 --> 00:20:31.830
this is literally how
you ask that question.

00:20:31.830 --> 00:20:33.840
If favorite in counts.

00:20:33.840 --> 00:20:38.250
That will check, is there a key with
this name, Scratch, C, or Python?

00:20:38.250 --> 00:20:42.390
If so, go into that location
in the counts dictionary,

00:20:42.390 --> 00:20:46.500
index into it at the favorite
location, because favorite is a string.

00:20:46.500 --> 00:20:49.140
It's either, quote unquote,
"Scratch," "C," or "Python,"

00:20:49.140 --> 00:20:53.490
and just increment it by one like this.

00:20:53.490 --> 00:20:54.780
Else, as you noted.

00:20:54.780 --> 00:21:00.360
If it's not there implicitly, then
counts bracket favorite should probably

00:21:00.360 --> 00:21:03.210
be set equal-- not to zero
because we're literally

00:21:03.210 --> 00:21:04.920
are seeing it in the current row.

00:21:04.920 --> 00:21:06.660
Let's initialize it to one.

00:21:06.660 --> 00:21:08.460
And, thereafter, if
we see it again, it's

00:21:08.460 --> 00:21:11.610
going to be plus equals one,
plus equals one, plus equals one.

00:21:11.610 --> 00:21:15.330
So now outside of that loop, outside
of the with block, let me do this.

00:21:15.330 --> 00:21:18.200
For each favorite in those counts--

00:21:18.200 --> 00:21:20.920
And this, too, in Python is a
trick if you want to iterate over

00:21:20.920 --> 00:21:22.870
all of the keys in a dictionary.

00:21:22.870 --> 00:21:26.560
That is, if you want to iterate over the
left-hand column of all of these keys,

00:21:26.560 --> 00:21:30.220
you literally can say, for
something in that dictionary.

00:21:30.220 --> 00:21:33.730
So for favorite in counts, this is
giving me a variable called favorite

00:21:33.730 --> 00:21:37.420
and updating it automatically
top to bottom in that dictionary.

00:21:37.420 --> 00:21:40.600
Let's go ahead and print
out an f string that's

00:21:40.600 --> 00:21:44.230
going to say whatever the name
of that language is colon,

00:21:44.230 --> 00:21:50.950
and whatever the value of that
language is in that there dictionary.

00:21:50.950 --> 00:21:54.040
So, again, logically the only
thing that's new is this.

00:21:54.040 --> 00:21:57.940
I'm now using one dictionary
instead of three variables

00:21:57.940 --> 00:22:02.170
to keep track of three things, like
updating this chalkboard with three

00:22:02.170 --> 00:22:04.043
different things, Scratch, C, Python.

00:22:04.043 --> 00:22:06.460
And the last thing I'm doing,
which is a little different,

00:22:06.460 --> 00:22:09.430
is once I have that dictionary,
whether there's three languages,

00:22:09.430 --> 00:22:11.597
or, maybe, tomorrow there'll
be fourth because we're

00:22:11.597 --> 00:22:12.970
going to introduce SQL today.

00:22:12.970 --> 00:22:16.807
Well, this will iterate over all of
those keys and print out the values.

00:22:16.807 --> 00:22:18.640
All right, so if I
didn't do anything wrong,

00:22:18.640 --> 00:22:21.940
if I do Python of
favorites.py and hit enter.

00:22:21.940 --> 00:22:23.043
There we have it.

00:22:23.043 --> 00:22:25.210
And it happens to be in a
different order this time.

00:22:25.210 --> 00:22:28.090
That's because we saw Python
first, we then saw Scratch,

00:22:28.090 --> 00:22:31.870
and, eventually, we saw C. But if
we wanted to sort these differently,

00:22:31.870 --> 00:22:34.780
we actually could with
some different code.

00:22:34.780 --> 00:22:36.170
But, in short, what have we done?

00:22:36.170 --> 00:22:39.340
We've created this kind of
structure in memory with three keys,

00:22:39.340 --> 00:22:42.100
Python, C, and Scratch,
because each time

00:22:42.100 --> 00:22:47.140
we encounter such a language from you
all, we either set our counter to one

00:22:47.140 --> 00:22:50.110
or increment it by one instead.

00:22:50.110 --> 00:22:56.320
Any questions on this code, or this
general idea of using dictionaries as,

00:22:56.320 --> 00:23:00.490
like, a little cheat sheet for
doing some math in this way?

00:23:00.490 --> 00:23:04.200
Super common paradigm.

00:23:04.200 --> 00:23:04.700
All right.

00:23:04.700 --> 00:23:06.890
Well, let me tweak this a little bit.

00:23:06.890 --> 00:23:10.622
Right now, in my output, we're
seeing Python, Scratch and C.

00:23:10.622 --> 00:23:13.580
Maybe, for the sake of discussion,
suppose we want to sort this by key.

00:23:13.580 --> 00:23:14.580
We can actually do that.

00:23:14.580 --> 00:23:16.370
Let me close my terminal temporarily.

00:23:16.370 --> 00:23:19.037
And it turns out, in Python,
there's a bunch of ways to do this,

00:23:19.037 --> 00:23:21.890
but the simplest way to
sort a dictionary by key

00:23:21.890 --> 00:23:24.500
is literally to use a
function called sorted

00:23:24.500 --> 00:23:26.970
that comes with Python,
that just does it for you.

00:23:26.970 --> 00:23:29.720
And even if you pass it a dictionary,
it will sort that dictionary

00:23:29.720 --> 00:23:34.170
by the left-hand column so you can
iterate it over alphabetically instead.

00:23:34.170 --> 00:23:36.110
So if I go back now to VS Code.

00:23:36.110 --> 00:23:39.710
If I open my terminal window and
I rerun Python on favorites.py,

00:23:39.710 --> 00:23:41.840
now that I've added the
sorted call, we should now

00:23:41.840 --> 00:23:45.470
see just because it's sorted
alphabetically instead.

00:23:45.470 --> 00:23:49.195
Now that's not that useful, especially
if we had lots of languages.

00:23:49.195 --> 00:23:51.320
You probably don't care
about it being alphabetized

00:23:51.320 --> 00:23:54.570
as much as you care about it being
ranked by which is the most popular,

00:23:54.570 --> 00:23:55.700
which is the least popular.

00:23:55.700 --> 00:23:58.220
And, for that, there's a bunch
of ways to do this in Python.

00:23:58.220 --> 00:24:03.410
And, I think, the simplest way to sort
by value the right-hand column instead

00:24:03.410 --> 00:24:06.900
of the left-hand column is probably
to make this change instead.

00:24:06.900 --> 00:24:09.240
Let me close my terminal temporarily.

00:24:09.240 --> 00:24:14.110
Let me still use the sorted function,
which by default sorts by key,

00:24:14.110 --> 00:24:16.750
but let's change it to be as follows.

00:24:16.750 --> 00:24:21.960
Let's change it to sort by a
function called counts.get,

00:24:21.960 --> 00:24:24.360
which is a little weird,
but this comes back

00:24:24.360 --> 00:24:28.290
to last week's brief discussion of
object-oriented programming, or oop.

00:24:28.290 --> 00:24:30.150
Remember, in Python,
that almost everything

00:24:30.150 --> 00:24:31.860
is like an object of some sort.

00:24:31.860 --> 00:24:36.930
An int is an object, a dictionary
is an object, a string is an object.

00:24:36.930 --> 00:24:40.590
Which is to say that, not only do these
things have values like, quote unquote,

00:24:40.590 --> 00:24:44.340
"Hello, World," or 50, these
variables, these objects,

00:24:44.340 --> 00:24:46.980
can also have functions
built into them, a.k.a.

00:24:46.980 --> 00:24:47.670
methods.

00:24:47.670 --> 00:24:52.440
So it turns out that because counts
is a dictionary, because I made it so,

00:24:52.440 --> 00:24:55.350
that counts dictionary, like
any dictionary in Python,

00:24:55.350 --> 00:24:57.150
comes with a function called get.

00:24:57.150 --> 00:25:01.290
And if you just tell the sorted
function to use that built-in method,

00:25:01.290 --> 00:25:05.400
it will actually, for every key, get
its value, get its value, get its value,

00:25:05.400 --> 00:25:09.720
and sort effectively by the
right-hand column instead of the left.

00:25:09.720 --> 00:25:13.140
Now we'll see down the line, perhaps,
more sophisticated ways of using this,

00:25:13.140 --> 00:25:15.990
but, for now, this just
overrides the default behavior

00:25:15.990 --> 00:25:19.510
and sorts the dictionary, not
by key, but by value instead.

00:25:19.510 --> 00:25:20.010
All right.

00:25:20.010 --> 00:25:24.120
So now watch this if I run
Python of favorites.py once more.

00:25:24.120 --> 00:25:28.140
Previously, it was in the order in
which the languages appeared first

00:25:28.140 --> 00:25:31.930
in the CSV file, then it
was sorted alphabetically.

00:25:31.930 --> 00:25:34.960
Now it should be sorted by value.

00:25:34.960 --> 00:25:35.850
And, indeed.

00:25:35.850 --> 00:25:39.180
Scratch is the least with
40, C is the next with 78,

00:25:39.180 --> 00:25:41.458
Python is the biggest with 280.

00:25:41.458 --> 00:25:43.500
That's not much of a top
10, or a top three list.

00:25:43.500 --> 00:25:44.820
Let's actually reverse it.

00:25:44.820 --> 00:25:46.800
And the easiest way in
Python to do that is

00:25:46.800 --> 00:25:50.130
to pass a third argument into
sorted, and you would know this

00:25:50.130 --> 00:25:51.750
by just reading the documentation.

00:25:51.750 --> 00:25:56.370
You can literally say, reverse
equals True, capital T,

00:25:56.370 --> 00:26:00.480
and now if I rerun this one last
time, Python of favorites.py,

00:26:00.480 --> 00:26:04.110
I'll see the same values but with
the whole thing reversed in order.

00:26:04.110 --> 00:26:07.230
Long story short, even though
this might feel like a slog,

00:26:07.230 --> 00:26:11.070
like adding this and looking up
this, so much easier than in C

00:26:11.070 --> 00:26:13.980
where you would have had to figure
out, how does bubble sort work?

00:26:13.980 --> 00:26:17.520
Let me implement bubble sort, selection
sort, any of those sorting algorithms,

00:26:17.520 --> 00:26:18.660
or use some other library.

00:26:18.660 --> 00:26:21.430
In Python, you just get a lot
more for free, so to speak.

00:26:21.430 --> 00:26:24.390
It's just built in once you get
comfy with the documentation.

00:26:24.390 --> 00:26:29.370
And, to be clear, this is an
argument, as is this, as is this.

00:26:29.370 --> 00:26:33.540
But in Python, we have not only
positional arguments, which

00:26:33.540 --> 00:26:36.720
are based on what position they
are in, left to right, just like C,

00:26:36.720 --> 00:26:39.900
you also have these
named parameters whereby

00:26:39.900 --> 00:26:42.720
they have explicit names
that you can use yourself,

00:26:42.720 --> 00:26:45.930
to make clear that you're using
this one but not this other one.

00:26:45.930 --> 00:26:50.980
More parameters in Python can
be optional than in C. Phew.

00:26:50.980 --> 00:26:51.480
All right.

00:26:51.480 --> 00:26:54.727
Any questions about that technique yet?

00:26:54.727 --> 00:26:57.810
And if you're feeling like this is
starting to take the fun out of Python,

00:26:57.810 --> 00:27:01.930
that's actually kind of the
point of doing this the hard way.

00:27:01.930 --> 00:27:02.430
All right.

00:27:02.430 --> 00:27:05.110
Well, let's do it one other
way that's marginally better.

00:27:05.110 --> 00:27:09.833
It turns out, in Python there really
is this rich ecosystem of libraries,

00:27:09.833 --> 00:27:12.750
the code that comes with the language
itself, or, even, third parties.

00:27:12.750 --> 00:27:17.040
And coming with the
language is another module

00:27:17.040 --> 00:27:19.920
called the collections
module, or package here,

00:27:19.920 --> 00:27:25.350
whereby if I use from collections, I
can import something called Counter,

00:27:25.350 --> 00:27:28.140
capital C. And it
turns out, if this felt

00:27:28.140 --> 00:27:30.450
a little bit painful to
create a dictionary yourself,

00:27:30.450 --> 00:27:33.470
initialize it, maybe, to
zero or one, like this.

00:27:33.470 --> 00:27:36.547
Turns out, you have the same problem
that people before you have had,

00:27:36.547 --> 00:27:38.130
and so there's another way to do this.

00:27:38.130 --> 00:27:42.150
You can create a variable called
counts, set it equal to Counter, capital

00:27:42.150 --> 00:27:43.770
C, open paren, close paren.

00:27:43.770 --> 00:27:46.140
And this is a different type of object.

00:27:46.140 --> 00:27:48.300
It's a different type of
object in Python, that

00:27:48.300 --> 00:27:50.615
has counting capabilities built in.

00:27:50.615 --> 00:27:53.490
And so if I actually want to use
this counter instead, I can do this.

00:27:53.490 --> 00:27:57.930
For each row in the reader, let's go
ahead and grab the favorite language

00:27:57.930 --> 00:28:00.540
from that row, just like before.

00:28:00.540 --> 00:28:04.530
And without doing any of that headache
of like, if, elif, or any of this,

00:28:04.530 --> 00:28:09.060
you can literally just index
into that counter using favorite,

00:28:09.060 --> 00:28:14.340
quote unquote, "Scratch," or "C," or
"Python," and increment it by one.

00:28:14.340 --> 00:28:17.680
What the Counter class is going
to do for you, so to speak--

00:28:17.680 --> 00:28:20.190
Another example of object-oriented
programming, and counts

00:28:20.190 --> 00:28:21.435
is now an object thereof.

00:28:21.435 --> 00:28:25.107
What this whole feature of Counter
is going to do for you is it's

00:28:25.107 --> 00:28:26.940
going to automatically
initialize everything

00:28:26.940 --> 00:28:28.710
to zero, even if you've
never seen it before,

00:28:28.710 --> 00:28:30.940
and then you can just blindly
start incrementing it.

00:28:30.940 --> 00:28:33.570
So, in short, there's just
more pleasant ways, sometimes,

00:28:33.570 --> 00:28:36.280
to do something in Python as well.

00:28:36.280 --> 00:28:36.780
All right.

00:28:36.780 --> 00:28:42.750
How about, lastly, let's
make things, maybe--

00:28:42.750 --> 00:28:44.340
Oh, actually, let's do this.

00:28:44.340 --> 00:28:47.918
We can even simplify the sorting here.

00:28:47.918 --> 00:28:49.710
Let me actually take
this one step further.

00:28:49.710 --> 00:28:52.950
Instead of manually figuring out how
to sort this, I'm going to do this.

00:28:52.950 --> 00:29:02.650
For each favorite, and the count
thereof in the counts variable's most

00:29:02.650 --> 00:29:07.820
common function's return value, go
ahead and print out this as well.

00:29:07.820 --> 00:29:09.790
So in short, again, a bit new syntax.

00:29:09.790 --> 00:29:11.210
But what's going on here?

00:29:11.210 --> 00:29:14.410
Well, it turns out that
this counts class and--

00:29:14.410 --> 00:29:17.478
sorry, this Counter class, and,
in turn, this counts variable,

00:29:17.478 --> 00:29:20.770
comes with a function built in that you
would only know from the documentation.

00:29:20.770 --> 00:29:23.260
It's literally called
most underscore common,

00:29:23.260 --> 00:29:31.700
and what it returns to you when you call
it is a pair of key value, key value.

00:29:31.700 --> 00:29:34.480
And so this, too, is a trick in
Python that we did not have in C.

00:29:34.480 --> 00:29:37.180
If you want to iterate
over something, but grab

00:29:37.180 --> 00:29:40.120
two variables at a time on
each iteration like this,

00:29:40.120 --> 00:29:42.880
you separate them by commas
and can get favorite count,

00:29:42.880 --> 00:29:44.840
favorite count, favorite count.

00:29:44.840 --> 00:29:50.260
So if I run this now, Python of
favorites.py, this, too, just works.

00:29:50.260 --> 00:29:53.440
And it's getting a little simpler,
a little tighter than before

00:29:53.440 --> 00:29:57.740
than if we had actually
done it all manually.

00:29:57.740 --> 00:30:01.310
Lastly, here is a code that's
the shortest version thereof.

00:30:01.310 --> 00:30:03.650
We're down to like 14 or 15 lines.

00:30:03.650 --> 00:30:08.480
If I wanted to change this to
analyze the most popular problem thus

00:30:08.480 --> 00:30:12.620
far in the class, how do I
go about changing the code

00:30:12.620 --> 00:30:18.260
to print out, top to bottom, the
most popular problem or problems?

00:30:18.260 --> 00:30:19.280
What line should change?

00:30:19.280 --> 00:30:20.570
Yeah.

00:30:20.570 --> 00:30:22.070
So, yeah.

00:30:22.070 --> 00:30:22.610
Line 10.

00:30:22.610 --> 00:30:25.310
Because I've written this in
kind of a general purpose way

00:30:25.310 --> 00:30:27.282
and using dictionaries
with keys, it suffices

00:30:27.282 --> 00:30:29.240
to change language to,
quote unquote, "problem"

00:30:29.240 --> 00:30:31.325
because that was the
third column from the CSV.

00:30:31.325 --> 00:30:33.200
And so now, if you're
curious, let's actually

00:30:33.200 --> 00:30:35.240
make my terminal window a bit bigger.

00:30:35.240 --> 00:30:37.820
Python of favorites.py, enter.

00:30:37.820 --> 00:30:38.750
And, OK.

00:30:38.750 --> 00:30:40.910
Tragically, we peaked
early with Hello, World--

00:30:40.910 --> 00:30:45.060
is the most popular problem thus far,
followed by Filter, then Scratch.

00:30:45.060 --> 00:30:45.560
OK.

00:30:45.560 --> 00:30:46.700
Peaked even earlier.

00:30:46.700 --> 00:30:51.660
Mario, DNA, and so forth, and
a bunch of others thereafter.

00:30:51.660 --> 00:30:55.477
So based on this sample size, here's
the ranking of the problems thus far.

00:30:55.477 --> 00:30:56.060
So, we got it.

00:30:56.060 --> 00:30:59.570
More Hello, World problems
in the weeks to come.

00:30:59.570 --> 00:31:01.160
All right.

00:31:01.160 --> 00:31:04.520
Now that we've done
that in that way, let's

00:31:04.520 --> 00:31:07.070
just make this program
slightly interactive

00:31:07.070 --> 00:31:11.070
and see how we can really take a
fundamentally different approach.

00:31:11.070 --> 00:31:12.620
I'm going to go into VS Code.

00:31:12.620 --> 00:31:15.798
I'm going to keep everything the
same, except that, at the bottom,

00:31:15.798 --> 00:31:17.840
I'm going to get rid of
this loop because I don't

00:31:17.840 --> 00:31:20.120
want any more print out everything.

00:31:20.120 --> 00:31:22.040
I want to look up specific counts.

00:31:22.040 --> 00:31:25.060
Like, how popular was this problem,
how popular was this other problem?

00:31:25.060 --> 00:31:28.310
And what I'm going to go ahead and do
is to create a variable called favorite,

00:31:28.310 --> 00:31:30.180
set it equal to--

00:31:30.180 --> 00:31:32.060
I could use get string
in the CS50 library,

00:31:32.060 --> 00:31:34.790
but we saw last week there's no
need to for strings, certainly.

00:31:34.790 --> 00:31:37.340
Let me just use the input
function that comes with Python,

00:31:37.340 --> 00:31:39.890
and prompt the human for
their favorite problem.

00:31:39.890 --> 00:31:45.020
And then let me go ahead and print out,
for instance, an f string containing

00:31:45.020 --> 00:31:51.920
whatever their favorite is, colon,
and whatever the count is thereof

00:31:51.920 --> 00:31:54.290
of that favorite, close quote.

00:31:54.290 --> 00:31:56.400
So let me open my terminal window.

00:31:56.400 --> 00:31:59.300
Let me run Python of
favorites.py, enter.

00:31:59.300 --> 00:32:04.640
And if I type in Hello, World, looks
like 65 people, indeed, like that one.

00:32:04.640 --> 00:32:06.860
If I run Python of favorites.py again.

00:32:06.860 --> 00:32:09.350
I type in Scratch, now we see that one.

00:32:09.350 --> 00:32:12.570
If I type in anything else, I'm
going to get its specific value.

00:32:12.570 --> 00:32:18.050
So this is to say, not only can we write
Python code to analyze some data pretty

00:32:18.050 --> 00:32:20.720
tightly versus the manual
code we wrote out earlier,

00:32:20.720 --> 00:32:23.238
you can also make these
programs interactive as well.

00:32:23.238 --> 00:32:25.530
And this is going to be a
super common paradigm, right,

00:32:25.530 --> 00:32:28.880
if you go into the world of consulting,
analytics, data science, more

00:32:28.880 --> 00:32:29.540
generally.

00:32:29.540 --> 00:32:34.310
Among your roles is going to be to
analyze data, to ask questions of data,

00:32:34.310 --> 00:32:37.610
get back the answer. be ask questions
of data, get back the answer.

00:32:37.610 --> 00:32:40.190
Honestly, life gets pretty
tedious, even though you've only

00:32:40.190 --> 00:32:43.670
been programming in Python, perhaps, for
like one week, a week and a half now.

00:32:43.670 --> 00:32:46.880
When you have to write code to
solve all of the world's problems--

00:32:46.880 --> 00:32:49.250
and there's this sort of
tenant in programming,

00:32:49.250 --> 00:32:53.270
that programmers tend to avoid
writing code as much as they can

00:32:53.270 --> 00:32:55.190
because, ideally, you
would solve problems

00:32:55.190 --> 00:32:58.640
with the right tool for the job,
minimizing the number of lines of code

00:32:58.640 --> 00:32:59.880
you actually write.

00:32:59.880 --> 00:33:01.920
So how do we actually get to that point?

00:33:01.920 --> 00:33:06.740
Well, instead of just dealing with
CSV files, pure text, it turns out

00:33:06.740 --> 00:33:09.200
there's an entire world
of proper databases.

00:33:09.200 --> 00:33:12.080
Not flat file databases, where you
store everything in text files,

00:33:12.080 --> 00:33:16.370
but a database program, a piece
of software running on a computer,

00:33:16.370 --> 00:33:18.922
running on a server, that's
always listening for you.

00:33:18.922 --> 00:33:22.130
It's got a lot of memory, it's got a
lot of space, and in turn a lot of data,

00:33:22.130 --> 00:33:25.460
and it supports a
database specific language

00:33:25.460 --> 00:33:31.020
that makes it much easier, much faster
to ask questions of the very same data.

00:33:31.020 --> 00:33:33.290
It's a relational database
in the sense, too,

00:33:33.290 --> 00:33:35.480
that it's not even
necessarily one spreadsheet,

00:33:35.480 --> 00:33:37.430
one set of rows and columns.

00:33:37.430 --> 00:33:41.960
You can have two sheets, three sheets,
30 sheets across which there might very

00:33:41.960 --> 00:33:44.840
well be relationships, or relations.

00:33:44.840 --> 00:33:48.830
So S-Q-L, or SQL, is a
database specific language,

00:33:48.830 --> 00:33:52.625
stands for Structured Query Language,
that's a declarative language whereby

00:33:52.625 --> 00:33:54.500
you're not going to be
in the habit with SQL,

00:33:54.500 --> 00:33:58.870
typically, of writing loops and
conditionals, and this kind of thing.

00:33:58.870 --> 00:34:02.260
You're instead going to describe
the data that you want to get back,

00:34:02.260 --> 00:34:05.680
you're going to describe the
question that you want the answer to,

00:34:05.680 --> 00:34:08.650
and we'll do this using a
relatively small grammar.

00:34:08.650 --> 00:34:11.620
That is to say, there's not
that many keywords in SQL.

00:34:11.620 --> 00:34:12.850
It's a pretty small language.

00:34:12.850 --> 00:34:16.120
But it's going to allow us to eliminate
dozens of lines of Python code,

00:34:16.120 --> 00:34:16.900
perhaps.

00:34:16.900 --> 00:34:19.429
SQL follows this CRUD paradigm.

00:34:19.429 --> 00:34:23.230
So C-R-U-D, which simply means
that in a relational database,

00:34:23.230 --> 00:34:25.030
you can really only do four things.

00:34:25.030 --> 00:34:30.100
You can create data, read data-- that
is, look at it or analyze it somehow.

00:34:30.100 --> 00:34:32.540
Update the data, or delete the data.

00:34:32.540 --> 00:34:33.770
So, CRUD, for short.

00:34:33.770 --> 00:34:36.673
And that really speaks to
just how relatively simple

00:34:36.673 --> 00:34:39.340
the world is, even though we'll
just scratch the surface of some

00:34:39.340 --> 00:34:40.423
of its capabilities today.

00:34:40.423 --> 00:34:42.560
And you'll explore more over time.

00:34:42.560 --> 00:34:45.340
Specifically, in SQL, there's
going to be other keywords that

00:34:45.340 --> 00:34:46.719
map to those four ideas.

00:34:46.719 --> 00:34:49.480
Technically, you don't just
create data in the world of SQL,

00:34:49.480 --> 00:34:53.270
you can also insert data, like
inserting more rows into a sheet.

00:34:53.270 --> 00:34:55.330
And it's not the word
"read" that people use.

00:34:55.330 --> 00:34:56.739
People say to "select" data.

00:34:56.739 --> 00:35:00.170
But they mean to read data, which
is sort of the opposite of writing

00:35:00.170 --> 00:35:01.220
or creating data.

00:35:01.220 --> 00:35:04.070
But the U and the D are the
same, except that there's also

00:35:04.070 --> 00:35:07.580
a keyword in SQL known as DROP,
which lets you very destructively,

00:35:07.580 --> 00:35:11.610
very dangerously delete entire
database tables, as well.

00:35:11.610 --> 00:35:15.500
So how do we do this, and what's
the connection to our favorites data

00:35:15.500 --> 00:35:16.190
thus far?

00:35:16.190 --> 00:35:18.920
Well, here is the syntax
in this language called

00:35:18.920 --> 00:35:22.880
SQL via which you can create a table.

00:35:22.880 --> 00:35:26.510
So the jargon is a little different,
but the ideas are exactly the same

00:35:26.510 --> 00:35:27.840
from the world of spreadsheets.

00:35:27.840 --> 00:35:33.720
What you call a sheet in a spreadsheet,
the database world calls a table.

00:35:33.720 --> 00:35:37.320
It's a table of rows and columns,
but it's the exact same idea.

00:35:37.320 --> 00:35:40.320
You're going to have discretion
over what to call the table,

00:35:40.320 --> 00:35:42.710
just like you can call a
spreadsheet something, or else,

00:35:42.710 --> 00:35:44.790
and you can also specify
the types of data

00:35:44.790 --> 00:35:46.790
that you want to store
in your rows and columns.

00:35:46.790 --> 00:35:49.520
And it's going to go a little more
deeply than just formatting it,

00:35:49.520 --> 00:35:51.687
like in Excel, and Numbers,
and Google Spreadsheets,

00:35:51.687 --> 00:35:55.250
you can actually control, maybe,
how big the data could be depending

00:35:55.250 --> 00:35:57.140
on the database you're actually using.

00:35:57.140 --> 00:35:59.990
In CS50, we're going to
use a light version of SQL.

00:35:59.990 --> 00:36:01.820
Literally, a language called--

00:36:01.820 --> 00:36:04.970
an implementation of SQL
called SQLite, which has really

00:36:04.970 --> 00:36:08.360
all of the core functionality that
you would see in the real world,

00:36:08.360 --> 00:36:10.702
and with larger, more
scalable systems, but it's

00:36:10.702 --> 00:36:13.160
going to allow us to focus on
a lot of the building blocks.

00:36:13.160 --> 00:36:17.010
And SQLite's actually really popular
on Macs, PCs, and phones, nowadays.

00:36:17.010 --> 00:36:20.450
A lot of the data that games and other
applications on your phone might store,

00:36:20.450 --> 00:36:24.470
actually have a file, a binary
file with zeros and ones,

00:36:24.470 --> 00:36:26.393
that's in the SQLite format.

00:36:26.393 --> 00:36:29.060
So if you do a mobile app, for
instance, for your final project,

00:36:29.060 --> 00:36:32.220
you'll have an opportunity to
play with something like this.

00:36:32.220 --> 00:36:34.760
Well, how do you actually run SQLite3?

00:36:34.760 --> 00:36:37.500
It's just a command built
into your code space.

00:36:37.500 --> 00:36:40.400
So this is a program you could
install on your own Mac, your own PC,

00:36:40.400 --> 00:36:40.910
or the like.

00:36:40.910 --> 00:36:43.280
We'll do everything as we've
done before, in the cloud,

00:36:43.280 --> 00:36:44.780
and actually use your code space.

00:36:44.780 --> 00:36:48.170
And by that I mean, we can just
start to play with this data now

00:36:48.170 --> 00:36:51.890
using SQL instead of Python.

00:36:51.890 --> 00:36:52.710
So let me do this.

00:36:52.710 --> 00:36:55.190
Let me open up my terminal
window here, and let

00:36:55.190 --> 00:36:58.010
me go ahead and maximize my
terminal window just because we'll

00:36:58.010 --> 00:37:00.140
focus now on the files here.

00:37:00.140 --> 00:37:03.530
Recall that I have a file
called favorites.csv,

00:37:03.530 --> 00:37:05.510
and that CSV file is just text.

00:37:05.510 --> 00:37:08.300
But let me load it
into a proper database

00:37:08.300 --> 00:37:11.510
so I can actually use this
other language called SQL on it.

00:37:11.510 --> 00:37:14.210
To do this, I'm going
to run SQLite3, which

00:37:14.210 --> 00:37:16.070
just means the third
version of it, and I'm

00:37:16.070 --> 00:37:18.950
going to create a new
database called favorites.db.

00:37:18.950 --> 00:37:21.380
That's just a convention,
but it means here comes

00:37:21.380 --> 00:37:23.030
a database that I'm going to create.

00:37:23.030 --> 00:37:25.370
Notice, I'm not using
the tabbed code editor.

00:37:25.370 --> 00:37:28.460
I'm not using the code command
because the code command is generally

00:37:28.460 --> 00:37:29.600
for text files.

00:37:29.600 --> 00:37:33.440
SQLite3 is going to create a binary
file, zeros and ones, ultimately.

00:37:33.440 --> 00:37:35.750
When I run that, it's going
to ask me to verify yes.

00:37:35.750 --> 00:37:37.400
I'm going to hit y and then enter.

00:37:37.400 --> 00:37:40.940
And now I'm at the SQLite prompt,
which is not the dollar sign.

00:37:40.940 --> 00:37:43.730
It literally says SQLite
with an angled bracket.

00:37:43.730 --> 00:37:49.220
Now, one time only, I want to
go ahead and load favorites.csv

00:37:49.220 --> 00:37:53.150
into this database so I can actually
play around with it using not Python,

00:37:53.150 --> 00:37:54.982
but this new language called SQL.

00:37:54.982 --> 00:37:56.940
And the way I'm going to
do this is as follows.

00:37:56.940 --> 00:37:57.773
I'm going to do dot.

00:37:57.773 --> 00:38:00.130
Mode csv, enter.

00:38:00.130 --> 00:38:03.430
And that just puts SQLite into CSV mode.

00:38:03.430 --> 00:38:05.650
It has different modes for
different file formats.

00:38:05.650 --> 00:38:09.700
I'm going to .import and then I'm
going to specify the file that I want

00:38:09.700 --> 00:38:11.800
to import, which is favorites.csv.

00:38:11.800 --> 00:38:13.030
And then this one's up to me.

00:38:13.030 --> 00:38:15.580
What is the name of the
table I want to create?

00:38:15.580 --> 00:38:18.012
And table, again, is essentially
synonymous with sheets.

00:38:18.012 --> 00:38:19.720
So I'm going to call
everything the same.

00:38:19.720 --> 00:38:22.700
I'm going to call my
table favorites as well.

00:38:22.700 --> 00:38:25.420
So what this command is
essentially going to do,

00:38:25.420 --> 00:38:29.890
is all of those lines of Python code
that open the file, read it row by row,

00:38:29.890 --> 00:38:31.028
and do something with it--

00:38:31.028 --> 00:38:32.320
This is just built into SQLite.

00:38:32.320 --> 00:38:37.720
It's going to load the whole darn
CSV into this new favorites.db file,

00:38:37.720 --> 00:38:38.980
and then that's it for now.

00:38:38.980 --> 00:38:42.220
I'm going to go ahead and literally
type .quit to get out of SQLite.

00:38:42.220 --> 00:38:43.810
I'm back at my dollar sign prompt.

00:38:43.810 --> 00:38:47.500
If I type ls, I have
not only favorites.csv,

00:38:47.500 --> 00:38:54.790
I also have favorites.db now as well, a
brand new file, and in that file now is

00:38:54.790 --> 00:38:57.560
an optimized version of the CSV file.

00:38:57.560 --> 00:39:00.860
In that DB file now is a
version of the data that's

00:39:00.860 --> 00:39:04.430
going to lend itself to CRUD
operations, creating, reading, update,

00:39:04.430 --> 00:39:08.810
and deleting, using this
new language called SQL.

00:39:08.810 --> 00:39:10.478
All right, so how do I get into this?

00:39:10.478 --> 00:39:13.520
Well, let me clear my terminal window
and pretend that I'm doing this now

00:39:13.520 --> 00:39:14.062
the next day.

00:39:14.062 --> 00:39:15.478
I've already created the database.

00:39:15.478 --> 00:39:16.670
That's a one time operation.

00:39:16.670 --> 00:39:18.545
Once you've got the
data, now I'm going to go

00:39:18.545 --> 00:39:24.080
ahead and again run SQLite3 favorites.db
just to open the file again.

00:39:24.080 --> 00:39:27.260
But it's already now-- all
of the data is in there.

00:39:27.260 --> 00:39:32.420
Just as a teaser, let
me go ahead and do this.

00:39:32.420 --> 00:39:38.580
.schema is a SQLite command that just
shows me the schema of this database

00:39:38.580 --> 00:39:39.080
table.

00:39:39.080 --> 00:39:41.130
And we'll see more about
this in a little bit,

00:39:41.130 --> 00:39:44.630
but for now this is
showing me, essentially,

00:39:44.630 --> 00:39:48.170
the SQL command that
was automatically run

00:39:48.170 --> 00:39:51.650
when I imported this database
the first time around.

00:39:51.650 --> 00:39:54.680
And, for now, just notice
that it mentions timestamp,

00:39:54.680 --> 00:39:57.270
it mentions language,
it mentions problem.

00:39:57.270 --> 00:39:59.790
Very loosely, it calls
each of those texts.

00:39:59.790 --> 00:40:02.370
So we're not trying very hard
to distinguish one type of data

00:40:02.370 --> 00:40:02.912
from another.

00:40:02.912 --> 00:40:03.690
It's all text.

00:40:03.690 --> 00:40:06.790
But notice, create table
If not exists favorites.

00:40:06.790 --> 00:40:09.420
This is essentially
the create table syntax

00:40:09.420 --> 00:40:11.490
that I alluded to
earlier via which you can

00:40:11.490 --> 00:40:15.510
create a table in a SQLite database.

00:40:15.510 --> 00:40:17.280
But more on that in just a bit.

00:40:17.280 --> 00:40:22.500
Here now is how we can actually
get at the data in that database.

00:40:22.500 --> 00:40:28.380
It turns out that we can select
one or more columns from a database

00:40:28.380 --> 00:40:30.180
table using syntax like this.

00:40:30.180 --> 00:40:33.870
Literally, the keyword select, then
the name of one or more columns

00:40:33.870 --> 00:40:38.550
that are in that database, and then from
the specific table that you care about.

00:40:38.550 --> 00:40:41.760
And notice that in capital
letters here are all of the SQL

00:40:41.760 --> 00:40:44.760
specific keywords, select,
and from, in particular.

00:40:44.760 --> 00:40:47.940
And in lowercase, by convention,
here are the placeholders

00:40:47.940 --> 00:40:51.270
for the columns that you, or I, have
created, and the tables that you, or I,

00:40:51.270 --> 00:40:52.210
have created.

00:40:52.210 --> 00:40:53.820
So if I go back to SQLite here.

00:40:53.820 --> 00:40:57.090
Let me just clear with Control L, which
will just freshen up the screen here

00:40:57.090 --> 00:40:58.770
so we can focus on what's new.

00:40:58.770 --> 00:41:03.600
If I want to select everything
from the table called favorites,

00:41:03.600 --> 00:41:04.890
here's what I can do.

00:41:04.890 --> 00:41:09.570
Select star from favorites semicolon.

00:41:09.570 --> 00:41:12.660
And, do forgive me, semicolons
are back for SQL, in this case.

00:41:12.660 --> 00:41:16.680
But select star from favorites uses a
syntax you might not be familiar with.

00:41:16.680 --> 00:41:18.660
Star here has nothing
to do with pointers.

00:41:18.660 --> 00:41:20.100
Star is a wild card.

00:41:20.100 --> 00:41:22.290
It means give me everything,
no matter what it's

00:41:22.290 --> 00:41:24.480
called, from this particular table.

00:41:24.480 --> 00:41:27.030
When I hit enter, what
we're going to see

00:41:27.030 --> 00:41:30.810
is the entire contents
of the favorite table

00:41:30.810 --> 00:41:34.480
that's the result of having imported
that CSV into this database.

00:41:34.480 --> 00:41:37.380
So when I hit enter,
there is all of that data.

00:41:37.380 --> 00:41:40.110
And SQLite, just to be
friendly, it's using

00:41:40.110 --> 00:41:44.220
what we might call ASCII art, just
very simple text, like hyphens,

00:41:44.220 --> 00:41:47.310
and vertical bars, and pluses on
the corner, to make it look pretty

00:41:47.310 --> 00:41:49.690
and make it look like
it is a proper table.

00:41:49.690 --> 00:41:52.620
But what you're really seeing
is the contents of favorites.db,

00:41:52.620 --> 00:41:54.750
specifically in that table.

00:41:54.750 --> 00:41:57.130
Specifically, if I only
care about languages,

00:41:57.130 --> 00:41:59.130
let me try something
more specific than star.

00:41:59.130 --> 00:42:03.270
Select language from
favorites semicolon.

00:42:03.270 --> 00:42:05.640
This is going to give me
just a single column now,

00:42:05.640 --> 00:42:09.150
of all of the favorites that you
selected for language specifically.

00:42:09.150 --> 00:42:12.730
This is a little overwhelming to
see all 399 or so pieces of data,

00:42:12.730 --> 00:42:14.680
so let me actually
truncate it a little bit.

00:42:14.680 --> 00:42:20.333
Let me do select language
from favorites limit 10.

00:42:20.333 --> 00:42:22.500
So we're about to see that
there's little tricks you

00:42:22.500 --> 00:42:26.430
can use to tweak the
behavior of the language

00:42:26.430 --> 00:42:28.500
in order to get back more or less data.

00:42:28.500 --> 00:42:31.110
In fact, it turns out
there's a bunch of keywords

00:42:31.110 --> 00:42:35.880
like these built into SQL, much like
Google Spreadsheets, Apple Numbers,

00:42:35.880 --> 00:42:38.293
Microsoft Excel, and certainly Python.

00:42:38.293 --> 00:42:41.460
There's a lot of functionality that you
just get for free with the language.

00:42:41.460 --> 00:42:44.610
If you want to calculate an average,
count the number of things in a file,

00:42:44.610 --> 00:42:48.090
get the unique or distinct values,
force everything to lowercase,

00:42:48.090 --> 00:42:50.380
force everything to uppercase,
get the maximum value,

00:42:50.380 --> 00:42:53.505
minimum value-- much like spreadsheets,
if you're familiar with that world,

00:42:53.505 --> 00:42:56.880
you get all of that functionality
in SQL but also more.

00:42:56.880 --> 00:42:59.680
So, for instance, if I go back
to my terminal window here.

00:42:59.680 --> 00:43:02.490
Let me go ahead and select
the total number of favorites

00:43:02.490 --> 00:43:06.150
in this table, the total number
of rows that you all inputted.

00:43:06.150 --> 00:43:10.320
So I could do select star
from favorites semicolon,

00:43:10.320 --> 00:43:12.450
and then I could literally
start counting these.

00:43:12.450 --> 00:43:15.330
Like 1, 2, 3, 4-- there's
clearly a better way.

00:43:15.330 --> 00:43:19.480
And, indeed, on our list of
functions was a count function.

00:43:19.480 --> 00:43:22.200
And so the way I can use
that in SQL is like this.

00:43:22.200 --> 00:43:25.230
Select count of star--

00:43:25.230 --> 00:43:28.113
so pass star in as an argument
to the count function.

00:43:28.113 --> 00:43:30.780
You don't care what columns you're
counting just count them all.

00:43:30.780 --> 00:43:33.440
From favorites semicolon.

00:43:33.440 --> 00:43:35.190
And now, you're actually
going to get back

00:43:35.190 --> 00:43:40.050
like a little baby table that has just
one row, one column inside of which--

00:43:40.050 --> 00:43:43.410
one cell of which has
the total actual count.

00:43:43.410 --> 00:43:47.610
And it's 398 because 399, recall,
included the actual header

00:43:47.610 --> 00:43:49.480
row from the file.

00:43:49.480 --> 00:43:49.980
All right.

00:43:49.980 --> 00:43:52.260
So suppose you want to--

00:43:52.260 --> 00:43:54.480
actually, note that this
is the exact same thing

00:43:54.480 --> 00:43:58.840
as counting a specific
column because every row

00:43:58.840 --> 00:44:00.670
has the same number of columns, three.

00:44:00.670 --> 00:44:03.790
We could just say select
the count of languages,

00:44:03.790 --> 00:44:07.720
or select the count of problems.

00:44:07.720 --> 00:44:10.023
All of those are going to
give me back the same answer.

00:44:10.023 --> 00:44:12.190
It is, therefore, conventional
in SQL if you're just

00:44:12.190 --> 00:44:14.080
trying to count the
number of rows, don't even

00:44:14.080 --> 00:44:15.413
worry about what they're called.

00:44:15.413 --> 00:44:18.722
Just do count star to get
back everything more simply.

00:44:18.722 --> 00:44:21.430
All right, but what if we want to
get back the distinct languages

00:44:21.430 --> 00:44:23.972
and we didn't know a priori that
this came from a Google form

00:44:23.972 --> 00:44:25.107
with three radio buttons?

00:44:25.107 --> 00:44:26.690
Well, we could do something like this.

00:44:26.690 --> 00:44:33.130
We could select the distinct languages
from the favorites table, enter.

00:44:33.130 --> 00:44:35.950
And that gives me Python,
Scratch, C because distinct

00:44:35.950 --> 00:44:38.200
is one of the other functions
that comes with SQL.

00:44:38.200 --> 00:44:40.990
This is, obviously, very easily
countable with my human eyes,

00:44:40.990 --> 00:44:43.210
but if I wanted to do
this more dynamically,

00:44:43.210 --> 00:44:47.530
I could change this to be
count the distinct languages.

00:44:47.530 --> 00:44:50.200
And just like in C, just like
in Python, just like in Scratch,

00:44:50.200 --> 00:44:52.987
I can nest these functions
and pass the output of one

00:44:52.987 --> 00:44:54.070
into the input of another.

00:44:54.070 --> 00:44:59.550
If I hit enter now, I now
get three in this case here.

00:44:59.550 --> 00:45:00.050
OK.

00:45:00.050 --> 00:45:04.580
Let me pause to see if there's any
questions or confusion just yet.

00:45:04.580 --> 00:45:05.120
Yeah.

00:45:05.120 --> 00:45:07.000
[INDISTINCT SPEECH]

00:45:07.000 --> 00:45:07.770
Does SQLite--

00:45:07.770 --> 00:45:09.910
[INDISTINCT SPEECH]

00:45:09.910 --> 00:45:11.980
SQLite3 is a program.

00:45:11.980 --> 00:45:16.210
And it's an implementation
of the SQLite language, which

00:45:16.210 --> 00:45:18.820
itself is a lightweight
version of what the world known

00:45:18.820 --> 00:45:21.890
as SQL, which is a very
convoluted way of saying

00:45:21.890 --> 00:45:23.390
there's lots of humans in the world.

00:45:23.390 --> 00:45:25.420
Not everyone agrees what SQL should be.

00:45:25.420 --> 00:45:28.720
Microsoft might disagree with Oracle,
might disagree with other companies,

00:45:28.720 --> 00:45:29.390
as well.

00:45:29.390 --> 00:45:32.500
So there's a common
subset of SQL in the world

00:45:32.500 --> 00:45:35.200
that almost everyone knows,
and learns, and uses,

00:45:35.200 --> 00:45:37.900
but there are also some
vendor specific features.

00:45:37.900 --> 00:45:40.790
SQLite tries to distill things
really into the essence,

00:45:40.790 --> 00:45:44.430
and so that's what you increasingly see
on Android, on iOS, on Macs, and PCs,

00:45:44.430 --> 00:45:44.930
as well.

00:45:44.930 --> 00:45:47.650
So we use it because it's
relatively canonical.

00:45:47.650 --> 00:45:49.330
Good question.

00:45:49.330 --> 00:45:49.830
All right.

00:45:49.830 --> 00:45:53.550
So let's do a few other things by
introducing a few other keywords

00:45:53.550 --> 00:45:55.440
without trying all of these right now.

00:45:55.440 --> 00:45:59.730
Here in this list is a
bunch of new keywords

00:45:59.730 --> 00:46:01.740
that are going to give
us even finer control.

00:46:01.740 --> 00:46:04.920
And we saw limit already, and
that just limits the output.

00:46:04.920 --> 00:46:08.010
But you can also have what
are called predicates.

00:46:08.010 --> 00:46:11.280
You can literally use the keyword
where to start filtering the data,

00:46:11.280 --> 00:46:14.640
without using an if, and an elif, and
an elif, and an elif, and so forth.

00:46:14.640 --> 00:46:17.640
You can just in one line
express something conditionally,

00:46:17.640 --> 00:46:21.067
you can order the data, and you can
even group similar data together.

00:46:21.067 --> 00:46:22.150
So what do I mean by this?

00:46:22.150 --> 00:46:24.622
Let me go back to VS Code
here, and let me play around

00:46:24.622 --> 00:46:25.830
with a few different queries.

00:46:25.830 --> 00:46:32.490
Let me select, maybe, the count of
rows from favorites, which previously

00:46:32.490 --> 00:46:38.250
was going to be 398 if I just get back
all of the rows, but suppose I only

00:46:38.250 --> 00:46:41.190
want to know how many of you
liked C. I can then say something

00:46:41.190 --> 00:46:46.500
like where the language in each
row equals, quote unquote, "C,"

00:46:46.500 --> 00:46:49.750
and the convention here is to
use single quotes, though SQLite

00:46:49.750 --> 00:46:51.790
is tolerant of other formats as well.

00:46:51.790 --> 00:46:56.257
If I hit enter here, I'll see, indeed,
as we saw with Python, the 78 number.

00:46:56.257 --> 00:46:57.340
That, honestly, took what?

00:46:57.340 --> 00:46:59.690
13, 14, 15 lines of code?

00:46:59.690 --> 00:47:04.990
Now I've distilled that kind of
query into a single line of SQL code

00:47:04.990 --> 00:47:07.180
instead, by using this
built in functionality.

00:47:07.180 --> 00:47:11.110
Suppose I really want to get specific,
and how many of you really liked

00:47:11.110 --> 00:47:14.500
Hello, World in C as your favorite?

00:47:14.500 --> 00:47:16.510
Well, I could change this query.

00:47:16.510 --> 00:47:19.330
And just like your dollar
sign prompt, your shell,

00:47:19.330 --> 00:47:22.660
you can go up and down in your
history in SQLite to save keystrokes.

00:47:22.660 --> 00:47:24.520
You can use Boolean logic.

00:47:24.520 --> 00:47:27.790
And I can say language
equals C AND, maybe,

00:47:27.790 --> 00:47:31.490
problem equals, quote
unquote, "Hello, World,"

00:47:31.490 --> 00:47:34.730
and the number of you that
liked that problem was seven.

00:47:34.730 --> 00:47:37.900
So really, really early on
likes Hello, World in C.

00:47:37.900 --> 00:47:39.910
Now notice a couple of key differences.

00:47:39.910 --> 00:47:43.900
One, I'm using AND, and not
ampersand, ampersand like in C. I'm

00:47:43.900 --> 00:47:46.100
using single equal signs.

00:47:46.100 --> 00:47:50.030
So SQL behaves like Scratch does,
which is not like Python or C. Why?

00:47:50.030 --> 00:47:52.780
Different people have implemented
different languages differently.

00:47:52.780 --> 00:47:57.310
Equals, equals, equality
in the world of SQL

00:47:57.310 --> 00:47:59.120
for comparing things left and right.

00:47:59.120 --> 00:47:59.620
All right.

00:47:59.620 --> 00:48:01.870
Things are now going to get
a little more interesting,

00:48:01.870 --> 00:48:05.140
but the whole goal of
all of that Python code

00:48:05.140 --> 00:48:09.650
was to analyze the ranking of
languages and popularity thereof.

00:48:09.650 --> 00:48:12.400
Turns out in SQL, once
you have the vocabulary,

00:48:12.400 --> 00:48:14.590
it's pretty easy to do
something like that.

00:48:14.590 --> 00:48:15.560
I'm going to do this.

00:48:15.560 --> 00:48:19.120
I'm going to select all of
the languages in the table,

00:48:19.120 --> 00:48:23.020
but I'm also going to
select the count thereof.

00:48:23.020 --> 00:48:26.020
And then I'm going to do that
from the favorites table,

00:48:26.020 --> 00:48:29.560
but I'm going to group
by language because I

00:48:29.560 --> 00:48:33.370
claimed a moment ago that group by is
another one of our key phrases in SQL

00:48:33.370 --> 00:48:35.120
that's going to let us group data.

00:48:35.120 --> 00:48:37.600
And what this effectively
means is that if you've

00:48:37.600 --> 00:48:41.530
got this table with a lot of
duplicate languages, again, and again,

00:48:41.530 --> 00:48:45.130
and again, you can group by
that column, and, essentially,

00:48:45.130 --> 00:48:48.850
smush all of the Python rows together,
all of the Scratch rows together,

00:48:48.850 --> 00:48:52.930
all of the C rows together, but
figure out how many of those

00:48:52.930 --> 00:48:55.240
rows just got smushed together.

00:48:55.240 --> 00:48:57.910
Effectively, doing all of
that dictionary legwork,

00:48:57.910 --> 00:49:01.090
or the counter legwork,
that I did in 13--

00:49:01.090 --> 00:49:02.990
15 lines of Python code.

00:49:02.990 --> 00:49:06.700
So if I hit enter here,
this now is the motivation

00:49:06.700 --> 00:49:08.410
for what we're now starting to do.

00:49:08.410 --> 00:49:11.860
I have distilled into
a single line of code

00:49:11.860 --> 00:49:14.140
in a language called SQL
what, indeed, took me

00:49:14.140 --> 00:49:18.280
more than a dozen lines of Python
code just to get back an answer.

00:49:18.280 --> 00:49:20.110
And I can do the same
thing with problem.

00:49:20.110 --> 00:49:23.230
I can just change language here,
for instance, to problem instead.

00:49:23.230 --> 00:49:26.300
But, per this list, I can not only
group things, I can order them.

00:49:26.300 --> 00:49:29.440
So if you actually want to get a top
10, or a top three list, well let's

00:49:29.440 --> 00:49:30.880
just change this query slightly.

00:49:30.880 --> 00:49:37.970
Before the semicolon, let me order
by the count of those rows semicolon.

00:49:37.970 --> 00:49:40.570
And now what I get is
from smallest to largest.

00:49:40.570 --> 00:49:42.850
40, 78, 280.

00:49:42.850 --> 00:49:44.590
If you want to flip that, that's fine.

00:49:44.590 --> 00:49:49.630
By default, order by uses
ascending order, abbreviated A-S-C.

00:49:49.630 --> 00:49:53.810
If you want to do descending order,
D-E-S-C, you can do that as well.

00:49:53.810 --> 00:49:57.680
And now we have a top three
list, from largest to smallest.

00:49:57.680 --> 00:50:01.360
Now, honestly, this is a bit of a
mouthful to use count star over here,

00:50:01.360 --> 00:50:02.920
count star over here.

00:50:02.920 --> 00:50:06.580
There's a nicety in SQL, too, where
you can create little aliases of sorts.

00:50:06.580 --> 00:50:08.890
So if I use the same query again--

00:50:08.890 --> 00:50:10.850
let me scroll over to the left.

00:50:10.850 --> 00:50:13.990
I can actually use the
keyword as here, and I

00:50:13.990 --> 00:50:18.370
can rename this weird looking column,
count star, to anything I want.

00:50:18.370 --> 00:50:19.930
I can rename it to n.

00:50:19.930 --> 00:50:24.280
And then at the end of this query,
I can order by n, essentially,

00:50:24.280 --> 00:50:26.990
creating a synonym, if you
will, for one versus the other.

00:50:26.990 --> 00:50:30.310
So if I hit enter now, same exact
thing, but my little baby table

00:50:30.310 --> 00:50:33.310
that came back-- not a technical
term-- has two columns, one of which

00:50:33.310 --> 00:50:36.580
is more simply called n
now instead of count star.

00:50:36.580 --> 00:50:40.330
It just makes it minorly more
convenient in your actual SQL code

00:50:40.330 --> 00:50:44.110
to reference things that might
actually be a little annoying to type.

00:50:44.110 --> 00:50:47.200
Lastly, suppose we want
to get a top one list,

00:50:47.200 --> 00:50:49.000
and we just want the
most popular language.

00:50:49.000 --> 00:50:51.790
Honestly, I can just
do limit one, enter.

00:50:51.790 --> 00:50:55.720
That gives me just this tiny little
table, a temporary table, really,

00:50:55.720 --> 00:50:56.860
with one row.

00:50:56.860 --> 00:51:00.460
And, honestly, if I don't even
care about what the language is,

00:51:00.460 --> 00:51:02.230
I can omit that entirely.

00:51:02.230 --> 00:51:06.340
Just see how many people really
like the most popular language.

00:51:06.340 --> 00:51:07.660
280, in this case.

00:51:07.660 --> 00:51:10.910
But, of course, it's more interesting
to see what it actually is.

00:51:10.910 --> 00:51:13.420
So, in short, just by turning
these knobs syntactically,

00:51:13.420 --> 00:51:16.900
it's relatively easy to start
getting at more and more data.

00:51:16.900 --> 00:51:19.270
And more answers there, too.

00:51:19.270 --> 00:51:19.770
Phew.

00:51:19.770 --> 00:51:24.270
Questions on this thus far?

00:51:24.270 --> 00:51:25.200
Any questions?

00:51:25.200 --> 00:51:26.040
No?

00:51:26.040 --> 00:51:26.670
OK.

00:51:26.670 --> 00:51:29.305
Well suppose that this
week, for instance.

00:51:29.305 --> 00:51:31.680
One of our new problems is
going to be called Fiftyville,

00:51:31.680 --> 00:51:33.960
and it's going to allow you
to explore the world of SQL

00:51:33.960 --> 00:51:35.793
in the context of a
place called Fiftyville.

00:51:35.793 --> 00:51:38.350
Suppose that suddenly becomes
your favorite problem.

00:51:38.350 --> 00:51:40.920
Well, how can we go about
adding more data to a database?

00:51:40.920 --> 00:51:43.720
Well, we've seen create
table for creating the table,

00:51:43.720 --> 00:51:47.230
we've seen select for
selecting data there from.

00:51:47.230 --> 00:51:50.340
Turns out there's also
an insert into command

00:51:50.340 --> 00:51:54.210
that you can use to insert
new data into a table.

00:51:54.210 --> 00:51:57.450
Now, I did this in bulk by just
importing that whole CSV file,

00:51:57.450 --> 00:52:00.370
and SQLite3 did it all
for me automatically.

00:52:00.370 --> 00:52:04.320
But in the real world, if you don't have
a captive audience, every one of whom

00:52:04.320 --> 00:52:06.540
is submitting the form
at the same time--

00:52:06.540 --> 00:52:09.300
but maybe it's an application
that's running 24/7,

00:52:09.300 --> 00:52:13.210
you're going to get more and more data
over time, just like Google itself.

00:52:13.210 --> 00:52:16.290
So if you write code like
this, you can insert one row

00:52:16.290 --> 00:52:20.530
at a time, one row at a time, and
actually change the data in your table.

00:52:20.530 --> 00:52:24.820
So just as a check, let me do
select star from favorites, enter.

00:52:24.820 --> 00:52:26.500
Just to see all of the data.

00:52:26.500 --> 00:52:31.030
And the last data we got was
at 1:41 PM, and 21 seconds.

00:52:31.030 --> 00:52:34.390
Suppose now I've decided I
want to insert one new row.

00:52:34.390 --> 00:52:35.560
I can do this.

00:52:35.560 --> 00:52:39.550
Insert into favorites--
and then I have to specify

00:52:39.550 --> 00:52:41.710
what columns do I want to insert into.

00:52:41.710 --> 00:52:47.230
I'm going to insert a new language
column, and a new problem column.

00:52:47.230 --> 00:52:47.840
Timestamp?

00:52:47.840 --> 00:52:48.340
I could.

00:52:48.340 --> 00:52:51.590
I don't really want to look up the time,
so I'm going to leave that one blank.

00:52:51.590 --> 00:52:54.520
And I'm going to put in
values as follows for this.

00:52:54.520 --> 00:52:57.850
SQL for the language, which wasn't
even an option on the form earlier,

00:52:57.850 --> 00:53:02.830
and Fiftyville for the name
of the problem, semicolon.

00:53:02.830 --> 00:53:04.540
So there's a bit of dichotomy here.

00:53:04.540 --> 00:53:07.690
In the first set of parentheses,
you specify a comma separated list

00:53:07.690 --> 00:53:09.910
of the columns that you
want to put data into.

00:53:09.910 --> 00:53:11.890
In the second set of
parentheses, you actually

00:53:11.890 --> 00:53:15.680
specify the values that you
want to put into those columns.

00:53:15.680 --> 00:53:18.350
So when I hit enter
nothing seems to happen,

00:53:18.350 --> 00:53:20.510
which, in general, is a
good thing at my terminal.

00:53:20.510 --> 00:53:25.030
But if I now rerun select star
from favorites, we will see--

00:53:25.030 --> 00:53:27.070
voila, a brand new row.

00:53:27.070 --> 00:53:29.320
We don't know what time or
date it was inputted at.

00:53:29.320 --> 00:53:33.130
In fact, we see an old friend, null,
which indicates the absence of a value,

00:53:33.130 --> 00:53:37.610
but we do indeed see that SQL, in
Fiftyville, is actually now in there.

00:53:37.610 --> 00:53:42.340
So in the world of SQL, null has nothing
to do with pointers or addresses.

00:53:42.340 --> 00:53:46.120
The world of SQL, it's just using the
same word to represent the same idea,

00:53:46.120 --> 00:53:48.700
that there's no data
here, but it has nothing

00:53:48.700 --> 00:53:51.080
to do with actual memory
addresses in this case.

00:53:51.080 --> 00:53:53.830
But suppose that you don't want
to do that, and, like, no, no, no.

00:53:53.830 --> 00:53:54.640
Let's just delete that.

00:53:54.640 --> 00:53:56.348
Fiftyville hasn't even
been released yet,

00:53:56.348 --> 00:53:58.720
nor have we even finished
talking about SQL.

00:53:58.720 --> 00:54:02.110
How do we delete data
from a database table?

00:54:02.110 --> 00:54:03.670
Well, there's a delete from command.

00:54:03.670 --> 00:54:05.360
Let me go back to VS Code here.

00:54:05.360 --> 00:54:08.080
Let me go ahead and clear my
terminal just to keep things clean.

00:54:08.080 --> 00:54:10.570
Let me go ahead and
delete from favorites,

00:54:10.570 --> 00:54:14.290
and let me not hit enter
here after a semicolon.

00:54:14.290 --> 00:54:16.330
This is one of the
most destructive things

00:54:16.330 --> 00:54:19.570
you can do as a database administrator.

00:54:19.570 --> 00:54:21.550
If you Google around,
there are horror stories

00:54:21.550 --> 00:54:25.750
of interns in the real world executing
commands like this at their companies.

00:54:25.750 --> 00:54:28.510
This will delete
everything from favorites.

00:54:28.510 --> 00:54:31.660
So if you ever do this, remember,
we told you not to today.

00:54:31.660 --> 00:54:38.530
But if we add a where clause here, only
delete rows where the timestamp column

00:54:38.530 --> 00:54:41.433
is null, this is more reasonable.

00:54:41.433 --> 00:54:43.600
And, frankly, any companies
you work for should also

00:54:43.600 --> 00:54:45.767
have backups of their
database, so we shouldn't even

00:54:45.767 --> 00:54:48.680
be reading about these horror
stories, but such is the real world.

00:54:48.680 --> 00:54:52.960
So this is going to delete any
row from the favorites table

00:54:52.960 --> 00:54:57.260
where Timestamp, capital T, because
that's how Google did it, is null.

00:54:57.260 --> 00:54:58.570
I go ahead and hit enter.

00:54:58.570 --> 00:55:03.880
Nothing seems to happen, but if I do
select star from favorites semicolon

00:55:03.880 --> 00:55:06.550
that now row is, again, gone.

00:55:06.550 --> 00:55:11.410
So you can use these predicates, these
where conditions, coupled with select,

00:55:11.410 --> 00:55:14.740
coupled with delete, and
other operations as well.

00:55:14.740 --> 00:55:16.520
What if I actually
want to make a change?

00:55:16.520 --> 00:55:18.910
So if you want to update
existing data like this.

00:55:18.910 --> 00:55:20.540
Well, we could do this.

00:55:20.540 --> 00:55:25.300
I could update this table I could
set one column equal to this value

00:55:25.300 --> 00:55:27.230
where some condition is true.

00:55:27.230 --> 00:55:28.638
So how might this work?

00:55:28.638 --> 00:55:30.430
Well, let me boldly
claim that a lot of you

00:55:30.430 --> 00:55:33.070
are really going to like
Fiftyville in the world of SQL,

00:55:33.070 --> 00:55:35.860
so all of these favorites
are sort of passe now.

00:55:35.860 --> 00:55:36.890
So let's do this.

00:55:36.890 --> 00:55:39.950
Let me go ahead and update
the favorites table,

00:55:39.950 --> 00:55:43.150
setting the language column
equal to "SQL," quote unquote.

00:55:43.150 --> 00:55:47.650
And with a comma let me go ahead and
also update the problem column to be

00:55:47.650 --> 00:55:49.810
equal to, quote unquote, "Fiftyville."

00:55:49.810 --> 00:55:52.930
I'm not going to have any kind
of where here, which means this

00:55:52.930 --> 00:55:56.030
is just going to do its
thing on all of the rows.

00:55:56.030 --> 00:55:58.270
So if I hit enter nothing
seems to have happened,

00:55:58.270 --> 00:56:00.850
but if I now do select
star from favorites,

00:56:00.850 --> 00:56:04.160
everyone's favorite is
literally that problem.

00:56:04.160 --> 00:56:07.690
So this too is destructive,
unlike the real digital world,

00:56:07.690 --> 00:56:10.120
there's no Control Z, or undo that.

00:56:10.120 --> 00:56:13.480
You better have made a backup
of your database, otherwise

00:56:13.480 --> 00:56:15.460
that's not a good thing.

00:56:15.460 --> 00:56:17.830
In this case, I do have this
CSV file, so I could just

00:56:17.830 --> 00:56:19.703
delete my favorites.db file.

00:56:19.703 --> 00:56:22.870
I could re-import the data, so I haven't
really lost anything of importance,

00:56:22.870 --> 00:56:26.140
but you could in the case of
the real world and any data

00:56:26.140 --> 00:56:27.470
you're actually working on.

00:56:27.470 --> 00:56:29.410
So just to make the
point, let me go ahead

00:56:29.410 --> 00:56:33.250
and delete from favorites
semicolon, enter.

00:56:33.250 --> 00:56:34.660
Let me reselect.

00:56:34.660 --> 00:56:36.710
There's no data there anymore.

00:56:36.710 --> 00:56:42.890
And in fact, if I do select
count star from favorites,

00:56:42.890 --> 00:56:45.220
we'll see as much that the
answer is, in fact, zero

00:56:45.220 --> 00:56:48.040
because everything has now been deleted.

00:56:48.040 --> 00:56:51.340
Any questions, then, on that code there?

00:56:51.340 --> 00:56:51.950
No?

00:56:51.950 --> 00:56:52.450
All right.

00:56:52.450 --> 00:56:55.533
So if not too scared yet, let's go
ahead and take our 10-minute break now.

00:56:55.533 --> 00:56:59.110
Halloween candy is served,
and we'll be back in 10.

00:56:59.110 --> 00:57:01.060
All right.

00:57:01.060 --> 00:57:05.750
So we are back, and before we dive
back into SQL and some real world data,

00:57:05.750 --> 00:57:10.300
it turns out, unbeknownst to me,
we've had a Halloween costume contest.

00:57:10.300 --> 00:57:14.560
So it's now time to announce the winners
of this year's CS50 costume contest.

00:57:14.560 --> 00:57:17.530
If our two winners would
like to come on up who,

00:57:17.530 --> 00:57:19.495
I'm told during break, dressed up as me.

00:57:19.495 --> 00:57:22.265
[APPLAUSE]

00:57:25.620 --> 00:57:26.460
Come on over.

00:57:26.460 --> 00:57:28.085
Would you like to introduce yourselves?

00:57:28.085 --> 00:57:31.240
SPEAKER 2: Yes Hi, everyone.

00:57:31.240 --> 00:57:32.100
I'm David.

00:57:32.100 --> 00:57:34.380
I am living in Matthews as a first year.

00:57:34.380 --> 00:57:38.700
I'm planning on studying
gov and computer science.

00:57:38.700 --> 00:57:39.820
SPEAKER 3: Hi, everyone.

00:57:39.820 --> 00:57:40.740
I'm David.

00:57:40.740 --> 00:57:43.020
I'm a second year in
Mather, and I'm planning

00:57:43.020 --> 00:57:44.745
on just studying computer science.

00:57:44.745 --> 00:57:45.480
[LAUGHING]

00:57:45.480 --> 00:57:46.605
SPEAKER 1: Well, thank you.

00:57:46.605 --> 00:57:50.190
We have some Oreos for you.

00:57:50.190 --> 00:57:51.497
Thank you, both, so much.

00:57:51.497 --> 00:57:52.830
Did anyone else dress like this?

00:57:52.830 --> 00:57:55.437
We have two more Oreos, if you'd like?

00:57:55.437 --> 00:57:56.020
Intentionally?

00:57:56.020 --> 00:57:58.082
[LAUGHING]

00:57:58.082 --> 00:57:59.040
Oh, that's pretty good.

00:57:59.040 --> 00:57:59.540
OK.

00:57:59.540 --> 00:58:00.750
Yes, we have one more winner.

00:58:00.750 --> 00:58:01.515
Come on down.

00:58:01.515 --> 00:58:03.810
[LAUGHS]

00:58:09.580 --> 00:58:10.317
Thank you.

00:58:10.317 --> 00:58:10.900
SPEAKER 4: Hi.

00:58:10.900 --> 00:58:12.010
SPEAKER 1: This is intentional?

00:58:12.010 --> 00:58:12.590
SPEAKER 4: No.

00:58:12.590 --> 00:58:13.340
SPEAKER 1: Oh, OK.

00:58:13.340 --> 00:58:14.410
[LAUGHING]

00:58:14.410 --> 00:58:16.150
SPEAKER 4: Hello, my name is David.

00:58:16.150 --> 00:58:18.490
I'm from Canada, and I'm a first year.

00:58:18.490 --> 00:58:20.200
I'm not sure what I'm going to study.

00:58:20.200 --> 00:58:20.783
SPEAKER 1: OK.

00:58:20.783 --> 00:58:21.800
Welcome, as well.

00:58:21.800 --> 00:58:22.675
SPEAKER 4: Thank you.

00:58:22.675 --> 00:58:23.560
SPEAKER 1: All right.

00:58:23.560 --> 00:58:26.270
[APPLAUSE]

00:58:27.732 --> 00:58:29.690
Up until now, we've played
around with the data

00:58:29.690 --> 00:58:33.695
that you all gave us, which was based,
very simply, on your favorite language

00:58:33.695 --> 00:58:34.820
and your favorite problems.

00:58:34.820 --> 00:58:38.030
But it turns out there's a lot
of real world data in, indeed,

00:58:38.030 --> 00:58:40.880
the real world, some of
which is quite voluminous.

00:58:40.880 --> 00:58:44.720
And, indeed, there can be not just
dozens, or hundreds, but thousands,

00:58:44.720 --> 00:58:47.962
hundreds of thousands, even millions
of rows in the biggest of databases.

00:58:47.962 --> 00:58:50.420
And so what we thought we'd do
in the latter part of today,

00:58:50.420 --> 00:58:53.360
is really, actually, get our
hands dirty with a real world

00:58:53.360 --> 00:58:57.020
data set from the Internet Movie
Database, otherwise known as IMDb.

00:58:57.020 --> 00:59:01.010
And, in fact, if you go to
imdb.com, you'll be able to answer,

00:59:01.010 --> 00:59:03.650
via their web interface, some
of the very questions we'll

00:59:03.650 --> 00:59:05.720
do today using SQL alone.

00:59:05.720 --> 00:59:10.520
But what you'll find, ultimately,
is that what websites like imdb.com,

00:59:10.520 --> 00:59:13.880
or their mobile app versions
thereof, are probably doing

00:59:13.880 --> 00:59:17.420
is-- yes, giving you a nice pretty
graphical interface to type queries,

00:59:17.420 --> 00:59:22.760
but underneath the hood they are
passing your input into SQL queries,

00:59:22.760 --> 00:59:25.800
or similar queries, that
they formed most of.

00:59:25.800 --> 00:59:28.320
They're just waiting for
placeholders, like the keywords,

00:59:28.320 --> 00:59:30.160
that you're actually searching for.

00:59:30.160 --> 00:59:33.420
So let's go ahead and experiment,
maybe, with just some real-world data,

00:59:33.420 --> 00:59:36.725
initially, before we consider how
to actually store it at scale.

00:59:36.725 --> 00:59:38.850
So let me open up, just
for the sake of discussion,

00:59:38.850 --> 00:59:41.190
an actual, empty
spreadsheet, just so I have

00:59:41.190 --> 00:59:43.650
some rows and columns to play with.

00:59:43.650 --> 00:59:47.790
And let me propose that we want to
model TV shows from the real world.

00:59:47.790 --> 00:59:49.420
How can we go about doing this?

00:59:49.420 --> 00:59:53.020
Well, maybe I could start in
this first column A, so to speak,

00:59:53.020 --> 00:59:54.600
and I could create a title column.

00:59:54.600 --> 00:59:57.660
And then, maybe, a column
for the star of that show.

00:59:57.660 --> 01:00:00.060
And a very popular show,
of course, is The Office.

01:00:00.060 --> 01:00:03.900
So I might put this into the
second cell in that first column.

01:00:03.900 --> 01:00:06.870
And under star, I could put
someone like Steve Carell.

01:00:06.870 --> 01:00:10.200
But, of course, he wasn't
the only star of the show.

01:00:10.200 --> 01:00:11.740
There are others as well.

01:00:11.740 --> 01:00:14.370
And so if I want to put in
someone like Rainn Wilson,

01:00:14.370 --> 01:00:16.210
well, maybe, I need
a second star column.

01:00:16.210 --> 01:00:17.640
So Rainn Wilson.

01:00:17.640 --> 01:00:20.010
But even as early as
the first season, there

01:00:20.010 --> 01:00:24.420
was also another star in
the credits, John Krasinski.

01:00:24.420 --> 01:00:25.440
So he was a star.

01:00:25.440 --> 01:00:29.280
Jenna Fischer was top
credited in the first season.

01:00:29.280 --> 01:00:30.990
So Jenna Fischer.

01:00:30.990 --> 01:00:37.020
And then BJ Novak, Harvard alum, was
also in the first season's opening

01:00:37.020 --> 01:00:37.870
credits, as well.

01:00:37.870 --> 01:00:39.360
So we've got all--

01:00:39.360 --> 01:00:41.170
one, two, three, four,
five of these folks.

01:00:41.170 --> 01:00:42.630
Hopefully, I didn't
misspell anyone's name,

01:00:42.630 --> 01:00:45.030
but here's the beginnings
of a real-world data set.

01:00:45.030 --> 01:00:47.970
And we could imagine doing this
for everyone's favorite shows,

01:00:47.970 --> 01:00:49.630
adding more, and more rows.

01:00:49.630 --> 01:00:52.650
But let's consider, as we often
do, not just the correctness

01:00:52.650 --> 01:00:56.220
of this implementation, but the design.

01:00:56.220 --> 01:00:57.990
It's pretty straightforward.

01:00:57.990 --> 01:00:58.972
It's very readable.

01:00:58.972 --> 01:01:00.430
So I think it's good in that sense.

01:01:00.430 --> 01:01:03.510
But if you start to nitpick
what's poorly designed,

01:01:03.510 --> 01:01:06.240
even in the world of spreadsheets,
about what I've done here--

01:01:06.240 --> 01:01:08.610
assuming that the next
row is another show,

01:01:08.610 --> 01:01:12.110
the next row is another
show, and so forth.

01:01:12.110 --> 01:01:13.280
What's bad about this?

01:01:13.280 --> 01:01:13.780
Yeah?

01:01:13.780 --> 01:01:15.272
[INDISTINCT SPEECH]

01:01:15.272 --> 01:01:17.980
Yeah, so each row is going to have
a different number of columns,

01:01:17.980 --> 01:01:20.440
and even I, kind of, couldn't
make up my mind from the get go.

01:01:20.440 --> 01:01:22.000
Like, do I have just one star column?

01:01:22.000 --> 01:01:22.500
Or two?

01:01:22.500 --> 01:01:23.792
Or maybe now I'm up to 5?

01:01:23.792 --> 01:01:25.750
For even bigger shows,
and later in The Office,

01:01:25.750 --> 01:01:27.625
when more people got
top billing, we're going

01:01:27.625 --> 01:01:29.770
to need more than five
columns for stars.

01:01:29.770 --> 01:01:31.060
So that's fine.

01:01:31.060 --> 01:01:34.682
We can clearly scroll to the right,
and just keep adding more columns,

01:01:34.682 --> 01:01:36.640
but there should be
something about that design

01:01:36.640 --> 01:01:38.890
that like rubs you the
wrong way, like something

01:01:38.890 --> 01:01:41.620
feels a little off if some
rows have this many columns,

01:01:41.620 --> 01:01:42.700
others have this many.

01:01:42.700 --> 01:01:45.520
The data would be very jagged
along the right hand side.

01:01:45.520 --> 01:01:48.320
It would be very sparse, which
would be another way to describe it.

01:01:48.320 --> 01:01:50.000
There's probably a better way.

01:01:50.000 --> 01:01:53.440
So maybe I should flip
the data around, and maybe

01:01:53.440 --> 01:01:57.520
a better approach here would be
to just have one column for stars.

01:01:57.520 --> 01:01:59.870
So let me do this.

01:01:59.870 --> 01:02:04.180
Let me just move Rainn Wilson over
here, and John Krasinski over here,

01:02:04.180 --> 01:02:09.980
and Jenna Fischer over here,
and BJ Novak over here, as well.

01:02:09.980 --> 01:02:13.850
I'll get rid of all of these
superfluous, identically named columns.

01:02:13.850 --> 01:02:17.840
And now this is sort of better because
now I can have any number of stars

01:02:17.840 --> 01:02:21.330
in the vertical, although it's a
little weird to leave this blank.

01:02:21.330 --> 01:02:23.850
So maybe I should of copy paste here.

01:02:23.850 --> 01:02:27.920
So in some sense this is
better, in that now I only

01:02:27.920 --> 01:02:30.860
have one title column, one
star column, and I can just

01:02:30.860 --> 01:02:34.790
keep adding row, row, row,
for each show and its stars.

01:02:34.790 --> 01:02:37.840
But what's now poorly
designed about this?

01:02:37.840 --> 01:02:38.340
Yeah?

01:02:38.340 --> 01:02:39.970
[INDISTINCT SPEECH]

01:02:39.970 --> 01:02:40.470
Yeah.

01:02:40.470 --> 01:02:41.580
I'm repeating the title.

01:02:41.580 --> 01:02:44.220
And, in general, copy, paste,
repeating yourself in code

01:02:44.220 --> 01:02:46.090
has generally been a bad thing.

01:02:46.090 --> 01:02:49.110
It's generally gotten us in trouble
if I make a change, or maybe

01:02:49.110 --> 01:02:50.550
a typographical error somewhere.

01:02:50.550 --> 01:02:52.050
Maybe it propagates elsewhere.

01:02:52.050 --> 01:02:54.450
And if nothing else, it's
just a lot of wasted space.

01:02:54.450 --> 01:02:57.300
If this is actually going to be
stored in a database, in a CSV file,

01:02:57.300 --> 01:03:00.300
why are you duplicating the same
string again, and again, and again,

01:03:00.300 --> 01:03:01.440
for large TV shows?

01:03:01.440 --> 01:03:02.370
That's just wasteful.

01:03:02.370 --> 01:03:04.780
It just doesn't seem wise.

01:03:04.780 --> 01:03:06.970
So how can we eliminate that redundancy?

01:03:06.970 --> 01:03:10.260
Well, unfortunately, in the world
of spreadsheets, things kind of now

01:03:10.260 --> 01:03:13.080
escalate quickly to be kind of annoying.

01:03:13.080 --> 01:03:16.150
But let me do it, nonetheless,
with just a small bit of data.

01:03:16.150 --> 01:03:18.250
Let me propose that we do this instead.

01:03:18.250 --> 01:03:21.977
Let me create not one sheet
but maybe multiple sheets,

01:03:21.977 --> 01:03:24.060
and assume that there's
some kind of relationship,

01:03:24.060 --> 01:03:25.680
or relation, across these sheets.

01:03:25.680 --> 01:03:29.100
So, just to be pedantic, let me call
this sheet, not the default sheet one,

01:03:29.100 --> 01:03:30.360
but let's call this shows.

01:03:30.360 --> 01:03:34.750
And in this sheet, I'm going to
have a title column for every show,

01:03:34.750 --> 01:03:36.990
and I think I'm going
to be proactive here.

01:03:36.990 --> 01:03:41.200
I'm going to start giving every show
a unique ID number, much like Harvard

01:03:41.200 --> 01:03:43.150
affiliates have Harvard
IDs, Yale affiliates

01:03:43.150 --> 01:03:44.900
have Yale ID numbers, and so forth.

01:03:44.900 --> 01:03:48.280
Let's go ahead and give each
show its own unique identifier,

01:03:48.280 --> 01:03:49.940
for reasons we'll soon see.

01:03:49.940 --> 01:03:53.110
So for The Office let
me, just for consistency

01:03:53.110 --> 01:03:55.030
with the actual Internet
Movie Database, I'm

01:03:55.030 --> 01:03:58.488
going to give it a
unique number of 386676.

01:03:58.488 --> 01:04:00.280
The specifics don't
really matter, but that

01:04:00.280 --> 01:04:03.010
happens to be what people in
the real world actually do.

01:04:03.010 --> 01:04:05.950
But that's it for TV shows,
even though I could imagine

01:04:05.950 --> 01:04:07.490
there being many more in this sheet.

01:04:07.490 --> 01:04:10.840
Let me create another sheet now
here, and I'll call it people.

01:04:10.840 --> 01:04:15.490
And in the people sheet, let me
keep track of all of those TV stars.

01:04:15.490 --> 01:04:20.500
So one column will be name, another
will be also called ID here,

01:04:20.500 --> 01:04:25.070
but it's going to be a
person ID, not a show ID.

01:04:25.070 --> 01:04:27.610
And here we have, for
instance, Steve Carell.

01:04:27.610 --> 01:04:29.570
We have Rainn Wilson.

01:04:29.570 --> 01:04:31.870
We have John Krasinski.

01:04:31.870 --> 01:04:34.300
We have Jenna Fischer.

01:04:34.300 --> 01:04:36.670
And we have BJ Novak.

01:04:36.670 --> 01:04:38.600
And, this is going to
be a little tedious,

01:04:38.600 --> 01:04:41.070
but just to be consistent
with reality, turns out

01:04:41.070 --> 01:04:46.230
that according to IMDb Steve Carell's
unique number in the world is 136797,

01:04:46.230 --> 01:04:56.640
Rainn's is 933988, John's is
1024677, Jenna's is 278979,

01:04:56.640 --> 01:05:01.814
and, lastly, BJ Novak's is 1145983.

01:05:01.814 --> 01:05:02.650
Phew.

01:05:02.650 --> 01:05:03.150
OK.

01:05:03.150 --> 01:05:06.360
So now we have the same people, but
they each have a unique ID number.

01:05:06.360 --> 01:05:10.590
Lastly, let's associate those shows
with those people in a way that

01:05:10.590 --> 01:05:13.320
avoids the two problems we
identified earlier, which

01:05:13.320 --> 01:05:16.650
was having a variable number
of columns in one case

01:05:16.650 --> 01:05:18.960
versus redundancy in the second case.

01:05:18.960 --> 01:05:22.710
Let's really tighten things
up so nothing is in duplicate

01:05:22.710 --> 01:05:24.880
that doesn't actually need to be.

01:05:24.880 --> 01:05:28.890
So I'm going to create a third
sheet here, and I'll call it stars,

01:05:28.890 --> 01:05:31.770
like the TV stars for these
shows, and what I'm going to do

01:05:31.770 --> 01:05:36.292
is have only two columns,
a show ID, and a person ID.

01:05:36.292 --> 01:05:38.250
I could write these in
different ways, but it's

01:05:38.250 --> 01:05:41.740
conventional in the database world
to use snake case, so to speak,

01:05:41.740 --> 01:05:46.150
where everything is lowercase,
with underscores instead of spaces.

01:05:46.150 --> 01:05:49.390
And for show ID-- well, we're
only mocking up one show for now,

01:05:49.390 --> 01:05:54.100
but I'm going to go ahead and say
386676, which is The Office, I claimed.

01:05:54.100 --> 01:05:57.160
And now I'm going to go ahead
and have all of those person IDs,

01:05:57.160 --> 01:06:00.230
and this time it's OK to copy
paste, if only to save time.

01:06:00.230 --> 01:06:04.540
So I'm going to grab all of these
five stars IDs, paste them there,

01:06:04.540 --> 01:06:08.840
and I am going to indulge
by duplicating the show ID,

01:06:08.840 --> 01:06:10.580
even though we didn't like that earlier.

01:06:10.580 --> 01:06:14.380
Now this indeed has escalated
quickly because none of these sheets

01:06:14.380 --> 01:06:17.590
are very useful to look at to the
human eye because none of them

01:06:17.590 --> 01:06:18.910
has the complete picture.

01:06:18.910 --> 01:06:22.600
We have shows in one, people in another,
and then like this cryptic mapping

01:06:22.600 --> 01:06:24.250
of numbers in the third.

01:06:24.250 --> 01:06:29.200
But I propose that this is the
right way to implement data

01:06:29.200 --> 01:06:32.530
if your goal is to have a
canonical source of truth

01:06:32.530 --> 01:06:34.060
for every show and every person.

01:06:34.060 --> 01:06:37.990
That is to say, you only say the
show's name once in one place.

01:06:37.990 --> 01:06:43.030
You only write the TV star's
names once in one place.

01:06:43.030 --> 01:06:46.330
And you associate them, you
relate one with the other,

01:06:46.330 --> 01:06:48.710
by way of this third sheet here.

01:06:48.710 --> 01:06:52.270
So if you've not seen it already,
notice that if The Office has

01:06:52.270 --> 01:06:56.800
this unique ID, 386676,
notice in the stars table,

01:06:56.800 --> 01:06:59.620
that same value appears multiple times.

01:06:59.620 --> 01:07:04.810
But what this third sheet is doing
is associating that same show ID

01:07:04.810 --> 01:07:08.510
with one, two, three, four,
five different people.

01:07:08.510 --> 01:07:11.860
Now I can see that it's similar
in spirit to what we already

01:07:11.860 --> 01:07:13.780
indicted as bad design a moment ago.

01:07:13.780 --> 01:07:15.460
The Office, The Office, The Office.

01:07:15.460 --> 01:07:18.970
But think about our world of C. In
the world of C, and really computers

01:07:18.970 --> 01:07:22.180
in general, data takes up finite
amount of space, typically.

01:07:22.180 --> 01:07:24.470
Like an integer is four bytes, 32 bits.

01:07:24.470 --> 01:07:27.070
So even though, yes, I'm
duplicating this value,

01:07:27.070 --> 01:07:30.460
it's just the same four
bytes, four bytes, four bytes.

01:07:30.460 --> 01:07:36.340
It's not t-h-e space
o-f-f-i-c-e, null character.

01:07:36.340 --> 01:07:38.755
It's not the same 11
bytes again, and again.

01:07:38.755 --> 01:07:41.380
It's just a number, and numbers
tend to be much more efficient.

01:07:41.380 --> 01:07:43.660
That computers can crunch
numbers much more quickly.

01:07:43.660 --> 01:07:47.920
Duplicating numbers is in
general allowed, or smiled upon.

01:07:47.920 --> 01:07:50.660
Duplicating strings will
get you into trouble.

01:07:50.660 --> 01:07:54.212
So with that said, is this
a useful spreadsheet now?

01:07:54.212 --> 01:07:56.920
Would you want to be handed this
in your job and asked questions?

01:07:56.920 --> 01:07:59.020
Like, hey, who stars in The Office?

01:07:59.020 --> 01:08:01.390
You can answer it, but you
have to look up one sheet,

01:08:01.390 --> 01:08:02.950
then another, then a third.

01:08:02.950 --> 01:08:06.490
Or you need to use VLOOKUP, or special
functions in Excel, or Google Sheets.

01:08:06.490 --> 01:08:08.823
I mean, you're just creating
a lot of work for yourself.

01:08:08.823 --> 01:08:12.340
But, academically, if you
will, systematically, this

01:08:12.340 --> 01:08:15.640
has a lot of merit because
we've avoided all duplication.

01:08:15.640 --> 01:08:20.800
We've normalized the data, so to speak,
by factoring out any duplication.

01:08:20.800 --> 01:08:22.700
So where are we going with this?

01:08:22.700 --> 01:08:25.720
Well, it turns out that we'll
play now with some actual data

01:08:25.720 --> 01:08:28.481
from the real world from the
actual Internet Movie database.

01:08:28.481 --> 01:08:31.189
And in a moment, it's going to
look a little something like this.

01:08:31.189 --> 01:08:34.689
This is an artist's rendition
of five different-- sorry,

01:08:34.689 --> 01:08:36.800
six different tables.

01:08:36.800 --> 01:08:39.430
So not one, but six
different tables that we

01:08:39.430 --> 01:08:43.479
created using freely available
Internet Movie Database data.

01:08:43.479 --> 01:08:47.904
They kindly provide, not CSVs,
but TSVs, tab separated values,

01:08:47.904 --> 01:08:49.779
which are essentially
the same thing, but you

01:08:49.779 --> 01:08:51.729
look for tab characters
instead of commas

01:08:51.729 --> 01:08:55.120
in order to convert them, as
we did, into our own format,

01:08:55.120 --> 01:08:56.800
SQLite in this case.

01:08:56.800 --> 01:08:59.404
But we'll see that there
is a file that we've

01:08:59.404 --> 01:09:02.529
made available from today, which you
can download off the course's website,

01:09:02.529 --> 01:09:07.149
called shows.db, and it contains
all of this same information.

01:09:07.149 --> 01:09:10.660
And in that shows.db file there
are indeed these six tables,

01:09:10.660 --> 01:09:13.180
but let's focus on just
two of them initially.

01:09:13.180 --> 01:09:16.038
IMDb, the Internet Movie Database,
is all about rating TV shows,

01:09:16.038 --> 01:09:18.580
and tracking that kind of
information, so let's actually take

01:09:18.580 --> 01:09:20.497
a look at some of these
ratings and figure out

01:09:20.497 --> 01:09:22.970
how we can actually
answer actual questions.

01:09:22.970 --> 01:09:24.760
So let me go over to VS Code.

01:09:24.760 --> 01:09:29.750
And let me run SQLite of shows.db,
which is a file that already exists.

01:09:29.750 --> 01:09:33.010
There's no CSVs, no TSVs, we
did all of this for you already.

01:09:33.010 --> 01:09:34.932
When I hit enter, I
get my SQLite prompt.

01:09:34.932 --> 01:09:36.640
And the first thing
I like to do whenever

01:09:36.640 --> 01:09:39.189
I'm playing with a SQL database
for the first time-- maybe

01:09:39.189 --> 01:09:41.410
I got it from a class,
or my boss, or the like--

01:09:41.410 --> 01:09:44.979
is just to wrap my mind around
what's inside of the database

01:09:44.979 --> 01:09:48.069
because you're not typically going to
be provided with pictures like this.

01:09:48.069 --> 01:09:49.729
You're just going to be given a file.

01:09:49.729 --> 01:09:51.220
So let me just select some data.

01:09:51.220 --> 01:09:54.468
Let me select star from the shows table.

01:09:54.468 --> 01:09:57.010
I don't really want to see all
of it, so let me just limit it

01:09:57.010 --> 01:09:59.390
to the first 10 shows in the table.

01:09:59.390 --> 01:10:03.310
And here, we can infer what
the shows table looks like.

01:10:03.310 --> 01:10:07.270
Every show has an ID, a title, the
year in which it debuted, apparently,

01:10:07.270 --> 01:10:11.320
and the number of episodes as of
last night when we exported the data.

01:10:11.320 --> 01:10:15.010
So that seems to reflect this
picture, and this is technically

01:10:15.010 --> 01:10:19.000
an entity relationship diagram, a
standard way of depicting things.

01:10:19.000 --> 01:10:21.670
And you'll see that in
our picture shows, indeed,

01:10:21.670 --> 01:10:25.480
have an ID column, title column,
year column, and episodes column.

01:10:25.480 --> 01:10:26.840
Well what about these ratings?

01:10:26.840 --> 01:10:29.650
Well, according to the picture,
that has a show ID, a rating,

01:10:29.650 --> 01:10:30.830
and a votes column.

01:10:30.830 --> 01:10:34.690
So let's go back to VS Code
here, and let's do select star

01:10:34.690 --> 01:10:39.670
from ratings, limit 10, just to wrap
our mind around some of the data.

01:10:39.670 --> 01:10:42.100
And there we have a
show ID in the left, we

01:10:42.100 --> 01:10:45.170
have rating in the middle, which seems
to be like a floating point value,

01:10:45.170 --> 01:10:46.940
and then votes, which
seems to be an integer.

01:10:46.940 --> 01:10:48.815
So we have some different
types of data here.

01:10:48.815 --> 01:10:50.270
But there's a lot of data.

01:10:50.270 --> 01:10:51.610
In fact, if I do this now.

01:10:51.610 --> 01:10:55.180
Select star from shows--

01:10:55.180 --> 01:10:56.630
let's not select all of the data.

01:10:56.630 --> 01:11:00.670
Let's do select count stars
from shows, and in this database

01:11:00.670 --> 01:11:04.640
there are 214,000
shows in this database.

01:11:04.640 --> 01:11:08.780
So we're well past the 398 rows that
we've been talking about thus far.

01:11:08.780 --> 01:11:12.010
So it turns out, per this
diagram, there's actually

01:11:12.010 --> 01:11:16.360
a standard relationship between
these two tables, shows and ratings

01:11:16.360 --> 01:11:18.880
respectively, and it's what
we would call a one-to-one one

01:11:18.880 --> 01:11:23.430
relationship whereby every show
in this design has one rating.

01:11:23.430 --> 01:11:25.180
And this is indicated,
technically, if you

01:11:25.180 --> 01:11:27.755
look at what the arrowheads
look like on these diagrams.

01:11:27.755 --> 01:11:30.130
This indicates that it's a
one-to-one relationship, which

01:11:30.130 --> 01:11:34.450
means every show has one rating, which
means every row in the shows table

01:11:34.450 --> 01:11:36.970
has a corresponding row
in the ratings table.

01:11:36.970 --> 01:11:39.640
Strictly speaking, they
could be in the same table.

01:11:39.640 --> 01:11:42.130
You could just join them together
and make one wider table,

01:11:42.130 --> 01:11:44.800
but IMDb keeps the
data separate so we too

01:11:44.800 --> 01:11:47.210
kept them separate in
two separate tables here.

01:11:47.210 --> 01:11:50.120
So what does this
actually mean in practice?

01:11:50.120 --> 01:11:54.590
Well, let's actually take a look in VS
Code at the schema for these tables.

01:11:54.590 --> 01:11:57.160
Let me clear my screen,
and let me .schema,

01:11:57.160 --> 01:12:01.570
but specifically look at the schema,
or the design of, the shows table.

01:12:01.570 --> 01:12:06.640
So you can do .schema shows, and any
command in SQLite with a dot is SQLite

01:12:06.640 --> 01:12:08.050
specific.

01:12:08.050 --> 01:12:10.900
In the real world, if you're
using other products like Oracle,

01:12:10.900 --> 01:12:14.350
or Postgre, or MySQL, or others,
they have different commands,

01:12:14.350 --> 01:12:16.300
but anything else that
we've been typing,

01:12:16.300 --> 01:12:20.260
especially the capitalized
keywords, is indeed standard SQL.

01:12:20.260 --> 01:12:25.310
If I hit enter here, here is what the
shows table apparently looks like.

01:12:25.310 --> 01:12:28.480
In other words, here is the create
table command that we, the staff,

01:12:28.480 --> 01:12:30.550
ran in order to create
this table for you,

01:12:30.550 --> 01:12:32.560
and then we imported a
bunch of data into it.

01:12:32.560 --> 01:12:38.200
Every show has an ID, has a title,
has a year, has a number of episodes.

01:12:38.200 --> 01:12:39.610
But there's more detail here.

01:12:39.610 --> 01:12:43.600
Apparently, the ID is an integer, the
title is text, and it is not null,

01:12:43.600 --> 01:12:47.170
cannot be null, the year is
numeric, whatever that means,

01:12:47.170 --> 01:12:49.030
and the episodes is an integer.

01:12:49.030 --> 01:12:50.740
Well, that is now familiar.

01:12:50.740 --> 01:12:51.830
And then primary key.

01:12:51.830 --> 01:12:53.622
So there's some other
stuff going on there,

01:12:53.622 --> 01:12:57.760
which we'll come back to, but let
me also do .schema ratings, enter,

01:12:57.760 --> 01:13:00.280
and we'll see a couple
of other data types here.

01:13:00.280 --> 01:13:03.070
There's show ID, which is
still an integer, but not null.

01:13:03.070 --> 01:13:05.560
Rating, which is a real number, a.k.a.

01:13:05.560 --> 01:13:07.840
float, but it, too, cannot be null.

01:13:07.840 --> 01:13:10.750
And then some number of votes,
which is an integer, cannot be null.

01:13:10.750 --> 01:13:12.560
And then there's mention of foreign key.

01:13:12.560 --> 01:13:16.300
So unlike our data set for the
favorites a bit ago, which we just

01:13:16.300 --> 01:13:20.530
did automatically, and imported,
this database by us, and by IMDb,

01:13:20.530 --> 01:13:23.680
has been more thoughtfully
designed, where there's actually

01:13:23.680 --> 01:13:27.640
some relationships across multiple
tables, rather than previously

01:13:27.640 --> 01:13:28.420
just one.

01:13:28.420 --> 01:13:32.300
Now in the world of SQL, we have
indeed different data types.

01:13:32.300 --> 01:13:35.140
For instance, we have
these five primarily.

01:13:35.140 --> 01:13:37.750
One, cutely named
BLOB, which is actually

01:13:37.750 --> 01:13:41.990
Binary Large Object, which generally
means like a file or some piece of data

01:13:41.990 --> 01:13:43.190
that's zeros and ones.

01:13:43.190 --> 01:13:46.160
Though, generally, it's best
to store files on file systems,

01:13:46.160 --> 01:13:50.090
like in folders, on disks, so
to speak, not in your database.

01:13:50.090 --> 01:13:51.440
There's integers we've seen.

01:13:51.440 --> 01:13:52.410
There's numeric.

01:13:52.410 --> 01:13:54.740
Which numeric is more
like dates and times,

01:13:54.740 --> 01:13:58.340
things that are numbers, but not
necessarily integers or floating point

01:13:58.340 --> 01:13:58.970
values.

01:13:58.970 --> 01:14:01.460
There's reals, which do
have decimal points in them.

01:14:01.460 --> 01:14:02.780
And then there's just text.

01:14:02.780 --> 01:14:06.290
In other SQL databases in the
real world, in your future jobs,

01:14:06.290 --> 01:14:09.740
or your future classes that
you might use SQL again in,

01:14:09.740 --> 01:14:14.960
there are even more data types in other
databases: Oracle, MySQL, Postgre,

01:14:14.960 --> 01:14:15.605
and so forth.

01:14:15.605 --> 01:14:17.480
But these are sort of
representative of them.

01:14:17.480 --> 01:14:19.767
They just get more
precise in other systems.

01:14:19.767 --> 01:14:22.100
But there's also some keywords,
that we've seen already,

01:14:22.100 --> 01:14:24.230
that you can specify when
designing a database.

01:14:24.230 --> 01:14:26.090
--that this column cannot be null.

01:14:26.090 --> 01:14:29.720
If you want to make sure that
no one can insert or update

01:14:29.720 --> 01:14:32.030
data unbeknownst to
you that is null, you

01:14:32.030 --> 01:14:34.340
can impose that when creating the table.

01:14:34.340 --> 01:14:36.740
And unlike Excel, and
Google Spreadsheets,

01:14:36.740 --> 01:14:40.010
and Apple Numbers, which will generally
let the human type in or not type

01:14:40.010 --> 01:14:43.520
in anything they want, with a
database you have more protections

01:14:43.520 --> 01:14:45.365
over the integrity of your data.

01:14:45.365 --> 01:14:48.900
Moreover, you can specify that a
column's values must be unique.

01:14:48.900 --> 01:14:50.750
If you want to avoid
duplicates, like you

01:14:50.750 --> 01:14:54.740
don't want the same person to be able to
register twice for your website thereby

01:14:54.740 --> 01:14:57.500
making sure they have
one unique email address,

01:14:57.500 --> 01:14:59.840
your database can help with that too.

01:14:59.840 --> 01:15:03.470
You don't have to rely on Python
to check if it already exists.

01:15:03.470 --> 01:15:07.520
But there's this other feature
of relational databases, that

01:15:07.520 --> 01:15:11.480
is databases that have multiple tables
across which there are relationships,

01:15:11.480 --> 01:15:13.880
and that's these keywords
we saw briefly a moment ago.

01:15:13.880 --> 01:15:16.620
Primary key, and foreign key.

01:15:16.620 --> 01:15:20.180
And we started to scratch
this surface here.

01:15:20.180 --> 01:15:22.490
It turns out what I
was doing was actually

01:15:22.490 --> 01:15:25.340
best practice in the world
of relational databases

01:15:25.340 --> 01:15:30.680
I gave or, really, IMDb gave every
show in the world a unique ID.

01:15:30.680 --> 01:15:36.470
And that unique ID, in this case
386676, is a numeric value, an integer,

01:15:36.470 --> 01:15:38.930
that uniquely identifies that TV show.

01:15:38.930 --> 01:15:43.370
In other words, this is the
primary key for this table.

01:15:43.370 --> 01:15:46.370
Technically a sheet, but I'm using
that just because it's easier to type

01:15:46.370 --> 01:15:48.020
in than my black and white window.

01:15:48.020 --> 01:15:53.960
This ID column is the primary key for
shows, as I was mocking up earlier.

01:15:53.960 --> 01:15:57.290
What is the primary key
in the people sheet here?

01:15:57.290 --> 01:15:58.580
It is also ID.

01:15:58.580 --> 01:16:02.450
It's a different ID, but it's by
convention often called the same thing.

01:16:02.450 --> 01:16:07.037
But this people column,
called ID, is its primary key.

01:16:07.037 --> 01:16:08.870
And you can perhaps see
where this is going.

01:16:08.870 --> 01:16:13.910
Those same numbers also happen
to appear in this third table,

01:16:13.910 --> 01:16:17.180
but in that context, they're
sort of foreign keys.

01:16:17.180 --> 01:16:21.420
They didn't come from this star sheet,
but they are in this star sheet,

01:16:21.420 --> 01:16:23.690
so they're sort of
relatively foreign to it.

01:16:23.690 --> 01:16:29.030
So foreign keys is simply the presence
of primary keys in some other table,

01:16:29.030 --> 01:16:30.630
in some other data set.

01:16:30.630 --> 01:16:33.050
And so it's just a
description of relativity,

01:16:33.050 --> 01:16:36.860
but the foreign key is the column
that uniquely identifies your data.

01:16:36.860 --> 01:16:41.190
Foreign keys is just the appearance
of those same numbers elsewhere.

01:16:41.190 --> 01:16:42.570
So what does this mean?

01:16:42.570 --> 01:16:45.020
If we go back to VS
Code here, you'll see

01:16:45.020 --> 01:16:48.860
that when we created this
table using IMDb's real data,

01:16:48.860 --> 01:16:52.640
we specified that in our shows
table, which is bigger than the one

01:16:52.640 --> 01:16:56.180
I mocked up with Google Sheets there,
has not only an ID and a title it,

01:16:56.180 --> 01:16:58.615
again, also has year in
which the show debuted,

01:16:58.615 --> 01:16:59.990
and the total number of episodes.

01:16:59.990 --> 01:17:02.810
Because that's juicy data
that comes from IMDb,

01:17:02.810 --> 01:17:04.670
beyond what I mocked up a moment ago.

01:17:04.670 --> 01:17:08.150
In the ratings table,
meanwhile, there's a show ID,

01:17:08.150 --> 01:17:12.360
which is also an integer, just like
this ID, but as the name implies,

01:17:12.360 --> 01:17:17.810
this show ID column is actually going
to be a foreign key that references

01:17:17.810 --> 01:17:20.640
the shows tables ID column.

01:17:20.640 --> 01:17:23.420
So this is the relational
in relational databases.

01:17:23.420 --> 01:17:25.530
These are two tables
that have a relationship,

01:17:25.530 --> 01:17:32.520
and that relationship is that show ID
is referring to this actual ID here.

01:17:32.520 --> 01:17:34.520
And it allows us
essentially, conceptually,

01:17:34.520 --> 01:17:37.350
to link these two tables together.

01:17:37.350 --> 01:17:39.353
So what does this actually mean?

01:17:39.353 --> 01:17:40.770
Well, let me go ahead and do this.

01:17:40.770 --> 01:17:42.140
Let me go back to VS Code here.

01:17:42.140 --> 01:17:45.120
I'll clear my terminal, and let's
play around with some of this data.

01:17:45.120 --> 01:17:48.080
So let's go ahead and do
this just to experiment.

01:17:48.080 --> 01:17:51.320
Select star from ratings where--

01:17:51.320 --> 01:17:54.860
let's get all of the good shows
where, just like Rotten Tomatoes,

01:17:54.860 --> 01:17:57.410
we'll do the cut off at 6.0 out of 10.

01:17:57.410 --> 01:18:01.063
So where rating is greater
than or equal to 6.0.

01:18:01.063 --> 01:18:03.230
And just so I don't overwhelm
my screen, let me just

01:18:03.230 --> 01:18:05.370
limit this to the first 10 results.

01:18:05.370 --> 01:18:08.510
In other words, this is
SQL syntax for selecting

01:18:08.510 --> 01:18:12.920
all of the ratings that are at
least 6.0 or higher from that table.

01:18:12.920 --> 01:18:13.640
Enter.

01:18:13.640 --> 01:18:15.350
And we see just the first 10 of them.

01:18:15.350 --> 01:18:18.540
Not the top 10 because we've not sorted
or grouped or anything like that,

01:18:18.540 --> 01:18:20.490
but the first 10 in the table.

01:18:20.490 --> 01:18:23.780
So what is interesting
here is that we've

01:18:23.780 --> 01:18:26.390
seen just some sampling
of the data, if you will.

01:18:26.390 --> 01:18:28.790
But this isn't all
that interesting here.

01:18:28.790 --> 01:18:32.220
Let me actually distill this just to
the show ID because, in other words,

01:18:32.220 --> 01:18:34.190
I want to know 10 good shows to watch.

01:18:34.190 --> 01:18:36.440
So let me just select show ID.

01:18:36.440 --> 01:18:39.480
So same result, but less data.

01:18:39.480 --> 01:18:41.700
It's just that first column thereof.

01:18:41.700 --> 01:18:43.540
Now this is going to
be a little annoying,

01:18:43.540 --> 01:18:47.460
but if I want to find out
the names of these shows--

01:18:47.460 --> 01:18:50.400
think about the picture
from whence we came.

01:18:50.400 --> 01:18:55.320
All of the show's names are in the shows
table, but all of the show's ratings

01:18:55.320 --> 01:18:57.520
are in the ratings table.

01:18:57.520 --> 01:19:00.058
So even if I do select
star from ratings,

01:19:00.058 --> 01:19:02.100
I'm never going to know
what show I'm looking at.

01:19:02.100 --> 01:19:04.500
Like, what the heck is show ID 62614?

01:19:04.500 --> 01:19:05.680
Well, I could do this.

01:19:05.680 --> 01:19:14.370
I could select star from shows where the
ID of the show equals 62614 semicolon.

01:19:14.370 --> 01:19:15.240
Enter.

01:19:15.240 --> 01:19:18.270
OK, so I could watch
this show from 1981.

01:19:18.270 --> 01:19:19.290
Let me do another one.

01:19:19.290 --> 01:19:24.030
Select star from shows
where ID equals 63881.

01:19:24.030 --> 01:19:26.710
So I'm just grabbing
the second ID from here.

01:19:26.710 --> 01:19:27.210
OK.

01:19:27.210 --> 01:19:29.530
So Catweazle, a kid's show from 1970.

01:19:29.530 --> 01:19:30.030
All right.

01:19:30.030 --> 01:19:31.120
So I'll watch that.

01:19:31.120 --> 01:19:32.620
So now let's do another one.

01:19:32.620 --> 01:19:33.810
I'll just copy this.

01:19:33.810 --> 01:19:37.530
Like suffice it to say, this is
not the best way to look up data,

01:19:37.530 --> 01:19:41.610
where I'm literally copying and pasting
values from one query into the next.

01:19:41.610 --> 01:19:43.950
But this is where SQL
gets a bit powerful.

01:19:43.950 --> 01:19:46.170
I can have nested queries.

01:19:46.170 --> 01:19:47.710
I can put one inside of the other.

01:19:47.710 --> 01:19:48.793
So let me instead do this.

01:19:48.793 --> 01:19:51.130
Let me clear the screen,
and let me instead do this.

01:19:51.130 --> 01:19:57.620
Select star from shows
where the ID of the show

01:19:57.620 --> 01:20:01.160
is in the following list of IDs.

01:20:01.160 --> 01:20:02.630
Select.

01:20:02.630 --> 01:20:05.480
And, actually, I'll do
this on a separate line.

01:20:05.480 --> 01:20:12.020
Select show ID from ratings
where the rating value is

01:20:12.020 --> 01:20:15.800
greater than or equal to 6.0 semicolon.

01:20:15.800 --> 01:20:17.570
So I've separated this onto two lines.

01:20:17.570 --> 01:20:19.710
The dot, dot, dot is just
a continuation character,

01:20:19.710 --> 01:20:23.480
which means same query, multiple lines,
but the parentheses are deliberate.

01:20:23.480 --> 01:20:27.500
Just like grade school math, I want
what's in parentheses to happen first.

01:20:27.500 --> 01:20:29.930
And so what the database
will do is we'll

01:20:29.930 --> 01:20:33.740
select, as before, all of the
show IDs from the ratings table

01:20:33.740 --> 01:20:36.713
where the rating value is
at least 6.0 out of 10.

01:20:36.713 --> 01:20:38.630
And that's going to
return to me, effectively,

01:20:38.630 --> 01:20:41.750
a list, some kind of
collection of show IDs,

01:20:41.750 --> 01:20:43.760
which previously I was
copying and pasting,

01:20:43.760 --> 01:20:47.390
now the database will
do the legwork for me.

01:20:47.390 --> 01:20:51.410
It will now select everything from the
shows table, where the ID of the show

01:20:51.410 --> 01:20:53.655
is in that list of values.

01:20:53.655 --> 01:20:56.280
And it's actually going to be
more than 10 unless I go in there

01:20:56.280 --> 01:20:58.690
and say limit 10, which I can do.

01:20:58.690 --> 01:21:00.510
So let me go ahead and hit enter now.

01:21:00.510 --> 01:21:03.630
And now I see more useful
information not just

01:21:03.630 --> 01:21:05.640
the ratings information,
which in a vacuum

01:21:05.640 --> 01:21:07.320
tells me nothing about what to watch.

01:21:07.320 --> 01:21:12.360
Now I see the show ID, the title,
the year, and the episodes.

01:21:12.360 --> 01:21:17.200
But notably, what is-- and if I want
to distill this into just the title,

01:21:17.200 --> 01:21:20.010
let me actually go back
here and instead do

01:21:20.010 --> 01:21:25.140
select just the title from shows
where the ID is in this whole list.

01:21:25.140 --> 01:21:28.210
And I'll re-execute it by just
copying and pasting the same.

01:21:28.210 --> 01:21:31.240
The only difference now is instead
of star, I'm selecting title.

01:21:31.240 --> 01:21:34.080
Here's now how the data
analyst at IMDb might

01:21:34.080 --> 01:21:38.430
be selecting 10 shows that are really
good to watch, according to ratings.

01:21:38.430 --> 01:21:41.490
But what, of course, is
missing from the output,

01:21:41.490 --> 01:21:45.065
whether I do star or just title?

01:21:45.065 --> 01:21:45.690
What's missing?

01:21:45.690 --> 01:21:45.960
Yeah?

01:21:45.960 --> 01:21:46.860
AUDIENCE: Rating.

01:21:46.860 --> 01:21:47.940
- The actual rating.

01:21:47.940 --> 01:21:51.000
I know these are at least
6.0, but which is 7.0?

01:21:51.000 --> 01:21:52.050
Which is 10.0?

01:21:52.050 --> 01:21:55.270
It'd be nice to actually
combine the data in some way.

01:21:55.270 --> 01:21:57.840
So we can actually do that,
too, because it turns out

01:21:57.840 --> 01:22:00.960
that when you have two tables in
the world of SQL, or even more,

01:22:00.960 --> 01:22:03.720
you can actually join them together.

01:22:03.720 --> 01:22:08.700
You can join them together, literally
using a keyword called join.

01:22:08.700 --> 01:22:10.540
And you can do this as follows.

01:22:10.540 --> 01:22:13.253
Let me propose, with a sample
data set, these two tables.

01:22:13.253 --> 01:22:16.170
And dot, dot, dot just means we don't
really care about the specifics.

01:22:16.170 --> 01:22:17.800
We just care about the structure.

01:22:17.800 --> 01:22:21.000
So on the left here is a
simplified version of my shows

01:22:21.000 --> 01:22:24.270
table that has a show
ID and a show title,

01:22:24.270 --> 01:22:27.990
but I've omitted year and
episodes as just intellectually

01:22:27.990 --> 01:22:30.180
distracting from the
basic structure here.

01:22:30.180 --> 01:22:31.740
But they're there in the real table.

01:22:31.740 --> 01:22:34.890
On the right hand side here,
we have the ratings table

01:22:34.890 --> 01:22:37.290
with just two of its columns,
the show ID and the rating,

01:22:37.290 --> 01:22:39.457
but I've omitted the votes
because it doesn't really

01:22:39.457 --> 01:22:40.750
add anything to the discussion.

01:22:40.750 --> 01:22:42.600
But let me propose this.

01:22:42.600 --> 01:22:45.750
Notice that these two
tables and these two rows

01:22:45.750 --> 01:22:47.790
therein definitely have commonalities.

01:22:47.790 --> 01:22:50.370
Like, they both have the same ID.

01:22:50.370 --> 01:22:54.150
In the left table it's 386676, a.k.a.

01:22:54.150 --> 01:22:55.860
The primary key of that row.

01:22:55.860 --> 01:23:01.200
But it's also appearing in duplicate
in the right hand table, 386676,

01:23:01.200 --> 01:23:02.977
and in that context it's a foreign key.

01:23:02.977 --> 01:23:04.810
The point, though, is
that they're the same.

01:23:04.810 --> 01:23:08.320
So wouldn't it be nice if I could
treat one table here, one table here,

01:23:08.320 --> 01:23:12.150
and if my fingertips represent these
identical values, kind of like glue

01:23:12.150 --> 01:23:15.715
them together so I get one wider
table with all of the information

01:23:15.715 --> 01:23:18.090
together to satisfy your
concern, that we don't even know

01:23:18.090 --> 01:23:20.015
what the ratings are of those shows.

01:23:20.015 --> 01:23:21.390
Well let me go ahead and do this.

01:23:21.390 --> 01:23:24.488
Just for artist's rendition,
let me flip title and ID,

01:23:24.488 --> 01:23:25.780
which has no functional effect.

01:23:25.780 --> 01:23:28.488
It's just going to put the numbers
closer together on the screen.

01:23:28.488 --> 01:23:32.610
Let me then literally highlight the fact
that these two numbers are identical,

01:23:32.610 --> 01:23:35.190
and let me propose that we
do the equivalent of this,

01:23:35.190 --> 01:23:38.910
we somehow join these two
tables on that common value.

01:23:38.910 --> 01:23:40.995
Strictly speaking, I
don't need both values

01:23:40.995 --> 01:23:43.870
because they're duplicates, so I
don't care if one of them goes away.

01:23:43.870 --> 01:23:46.470
But what I'd really like
to do is select, indeed,

01:23:46.470 --> 01:23:52.230
a temporary table that is the joined
version of that original data.

01:23:52.230 --> 01:23:54.840
And frankly, I don't really
care as the user what the ID is.

01:23:54.840 --> 01:23:57.690
Heck, all I care about is what show
to watch and what its rating is.

01:23:57.690 --> 01:23:59.370
Give me the title and the rating.

01:23:59.370 --> 01:24:01.590
All of these numbers are,
again, metadata, things

01:24:01.590 --> 01:24:05.490
that the computer cares about,
but we humans probably do not.

01:24:05.490 --> 01:24:07.890
So how can we implement that idea?

01:24:07.890 --> 01:24:09.840
Of taking one data set
that has a relationship

01:24:09.840 --> 01:24:12.510
with this data set, and
somehow combine it together?

01:24:12.510 --> 01:24:14.170
Well let me go back to VS Code here.

01:24:14.170 --> 01:24:15.330
Let me clear my screen.

01:24:15.330 --> 01:24:20.220
And this is going to be a bit cryptic
at first, but it's very step by step.

01:24:20.220 --> 01:24:21.010
Let me do this.

01:24:21.010 --> 01:24:25.890
Select star from shows,
but not from shows alone.

01:24:25.890 --> 01:24:28.650
Let me join it with the ratings table.

01:24:28.650 --> 01:24:32.400
So let me select everything
from shows joined with ratings,

01:24:32.400 --> 01:24:35.910
but I need to tell the database, well,
what do I want to join things on?

01:24:35.910 --> 01:24:37.200
What are my fingertips?

01:24:37.200 --> 01:24:40.030
Specifically, I want them to
join on those common integers.

01:24:40.030 --> 01:24:44.940
So I can literally say, on, and then
I can specify one table on the left.

01:24:44.940 --> 01:24:51.030
Shows.id on the left should equal
the ratings table's shows.id column

01:24:51.030 --> 01:24:52.470
on the right.

01:24:52.470 --> 01:24:57.120
Again, if I'm joining shows with
ratings it's called ID in one,

01:24:57.120 --> 01:25:00.640
it's called show ID in the other,
but it's the exact same thing.

01:25:00.640 --> 01:25:03.210
In fact, if I rewind, this
is where we came from.

01:25:03.210 --> 01:25:05.290
Two tables with the same value.

01:25:05.290 --> 01:25:10.170
So with this query here, if I go ahead
and now specify not just that but let

01:25:10.170 --> 01:25:14.760
me further say, where rating is
greater than or equal to 6.0.

01:25:14.760 --> 01:25:16.500
And, heck, let's limit it to 10.

01:25:16.500 --> 01:25:21.420
Just fits on the screen, so it's more
of a mouthful, but when I hit enter now

01:25:21.420 --> 01:25:26.220
we have a wider table that indeed
contains everything, star, from having

01:25:26.220 --> 01:25:29.340
joined these two tables left and right.

01:25:29.340 --> 01:25:31.710
Now, again, I don't really
care about much of this data,

01:25:31.710 --> 01:25:34.410
like year and episodes and
definitely not the IDs.

01:25:34.410 --> 01:25:36.070
So let me actually hit up.

01:25:36.070 --> 01:25:38.070
Let me go to the beginning
of the query, and let

01:25:38.070 --> 01:25:41.640
me just select the title of the
show and the rating of the show.

01:25:41.640 --> 01:25:45.780
The query is getting a little long and
it's wrapping, but it's the same query.

01:25:45.780 --> 01:25:49.500
Except, instead of star,
I've done title comma rating.

01:25:49.500 --> 01:25:50.910
Now when I hit enter--

01:25:50.910 --> 01:25:54.190
like, this is the list that would have
been nice to see the first time around.

01:25:54.190 --> 01:25:57.280
Show me 10 shows with a
rating of 6.0 or higher,

01:25:57.280 --> 01:25:59.620
but remind me what the
rating actually is so,

01:25:59.620 --> 01:26:03.580
maybe, I can prioritize the sevens, the
eights, the nines, and even the tens,

01:26:03.580 --> 01:26:05.236
if any.

01:26:05.236 --> 01:26:11.760
Any questions about this
technique of joining two tables?

01:26:11.760 --> 01:26:15.210
This sort of solves the problem that
we created in the world of this sheet,

01:26:15.210 --> 01:26:18.030
where I was just kind of playing
around, where I sort of moved

01:26:18.030 --> 01:26:20.280
all the data into its
separate locations, which

01:26:20.280 --> 01:26:21.750
is not at all pleasant to use.

01:26:21.750 --> 01:26:25.590
But with SQL, with join, you can
still get any of the data you want.

01:26:25.590 --> 01:26:27.546
Yeah.

01:26:27.546 --> 01:26:30.498
[INDISTINCT SPEECH]

01:26:34.450 --> 01:26:35.308
Correct.

01:26:35.308 --> 01:26:37.600
So, yes, I should have called
that out more explicitly.

01:26:37.600 --> 01:26:41.472
In my query here, I was using dot
notation, which we've seen in Python,

01:26:41.472 --> 01:26:43.930
we've seen in C. It means
something similar in spirit here,

01:26:43.930 --> 01:26:45.972
but it has nothing to do
with structured objects.

01:26:45.972 --> 01:26:48.650
In this case, it has to do
with tables and columns.

01:26:48.650 --> 01:26:53.320
So shows.id just makes clear that I
want the ID column from the shows table

01:26:53.320 --> 01:26:57.490
to line up with the show ID
column from the ratings table.

01:26:57.490 --> 01:27:01.630
Strictly speaking, I don't need
to do that because in this case

01:27:01.630 --> 01:27:03.460
there is no ambiguity.

01:27:03.460 --> 01:27:06.340
One table has a column
called ID, the other table

01:27:06.340 --> 01:27:07.810
has a column called show ID.

01:27:07.810 --> 01:27:10.840
So certainly the database can
just figure this out for me.

01:27:10.840 --> 01:27:15.730
But, for best practice, and for the
sake of being explicit, using the dot

01:27:15.730 --> 01:27:17.950
notation and table names
can help, especially

01:27:17.950 --> 01:27:20.270
if there's some common
language across them.

01:27:20.270 --> 01:27:20.770
All right.

01:27:20.770 --> 01:27:23.030
Well let's go back to
the bigger data set here.

01:27:23.030 --> 01:27:25.060
These are all six tables in IMDb.

01:27:25.060 --> 01:27:28.450
We focused for just a moment there
on like shows and ratings alone,

01:27:28.450 --> 01:27:29.620
but what about genres.

01:27:29.620 --> 01:27:33.250
So, genres, like comedy, and
documentary, and drama, and so forth.

01:27:33.250 --> 01:27:37.190
Turns out that this actually implements
a different type of relationship.

01:27:37.190 --> 01:27:39.530
Previously we saw a
one-to-one relationship,

01:27:39.530 --> 01:27:42.080
but it turns out that
IMDb supports what's

01:27:42.080 --> 01:27:44.820
called a one-to-many relationship
when it comes to genres.

01:27:44.820 --> 01:27:45.320
Why?

01:27:45.320 --> 01:27:48.650
Well, shows like The Office, I do
think are generally considered comedy

01:27:48.650 --> 01:27:51.650
and that's it, but there
are certainly other TV shows

01:27:51.650 --> 01:27:55.140
that might have multiple
genres associated with them.

01:27:55.140 --> 01:27:59.160
Maybe it's comedy and a bit of romance
thrown in, like rom-coms, and so forth,

01:27:59.160 --> 01:28:03.980
so you could imagine some shows
having two or three or more genres,

01:28:03.980 --> 01:28:07.520
and so one-to-many means that
one show can have many genres.

01:28:07.520 --> 01:28:11.450
One-to-one would mean one show can
have one rating, as we've seen.

01:28:11.450 --> 01:28:17.330
So why don't we go ahead and focus,
maybe, on how about a query like this.

01:28:17.330 --> 01:28:19.577
Let me go back to VS Code
here, clear my screen,

01:28:19.577 --> 01:28:21.410
and let's just look at
some of those genres.

01:28:21.410 --> 01:28:25.700
Select star from genres and
then I'll limit it to 10.

01:28:25.700 --> 01:28:28.590
And, again, I do this just to wrap
my mind around a new data set.

01:28:28.590 --> 01:28:30.740
I could look at the schema, but
that tends to be more cryptic.

01:28:30.740 --> 01:28:32.310
I just want to look at the raw data.

01:28:32.310 --> 01:28:32.810
OK.

01:28:32.810 --> 01:28:37.227
It looks like here there are a bunch
of genres: comedy, adventure, comedy.

01:28:37.227 --> 01:28:38.810
So two comedies, which is interesting.

01:28:38.810 --> 01:28:40.100
Oh, interesting.

01:28:40.100 --> 01:28:42.410
Family, action, sci-fi, family.

01:28:42.410 --> 01:28:45.260
So the values here are
duplicated, which it turns out

01:28:45.260 --> 01:28:47.000
is not the best design of IMDb.

01:28:47.000 --> 01:28:49.710
We literally just imported
the data as they implement it.

01:28:49.710 --> 01:28:55.190
But notice that show ID
62614 is a comedy, but so is

01:28:55.190 --> 01:29:00.950
show 63881, and so is show 65270.

01:29:00.950 --> 01:29:04.320
So it turns out that in the real world
sometimes data is somewhat messy.

01:29:04.320 --> 01:29:06.830
There's duplication of
comedy, comedy, comedy,

01:29:06.830 --> 01:29:09.380
but such is the way IMDb's data is.

01:29:09.380 --> 01:29:12.694
But what's more interesting to
me, for now, is notice this.

01:29:12.694 --> 01:29:16.530
This show ID in three rows is the same.

01:29:16.530 --> 01:29:19.700
So there's some show out there
that's considered by the world

01:29:19.700 --> 01:29:22.257
to be an adventure, a
comedy, and a family show.

01:29:22.257 --> 01:29:23.340
So let's see what that is.

01:29:23.340 --> 01:29:26.880
Let me just highlight and copy
that value, 63881, and do this.

01:29:26.880 --> 01:29:31.378
Select star from shows where the
ID of the show equals that value.

01:29:31.378 --> 01:29:33.170
And it turns out, we
saw it briefly before,

01:29:33.170 --> 01:29:37.400
it's a show from the 1970s called
Catweazle, which falls into all three

01:29:37.400 --> 01:29:38.880
of those categories.

01:29:38.880 --> 01:29:43.880
So by using a one-to-many relationship,
sort of depicted by this picture here,

01:29:43.880 --> 01:29:47.588
you can implement that same idea
without having that jagged edge.

01:29:47.588 --> 01:29:50.630
When we looked at the spreadsheet
earlier, in an earlier version of this,

01:29:50.630 --> 01:29:55.490
we had star, star, star, which we
could do again, genre, genre, genre,

01:29:55.490 --> 01:29:58.460
but now we instead have
two separate tables

01:29:58.460 --> 01:30:02.247
where this many-to-many
relationship is implemented across.

01:30:02.247 --> 01:30:03.830
So let's actually play around with it.

01:30:03.830 --> 01:30:07.640
Let me go back to VS Code
here, and let's actually take

01:30:07.640 --> 01:30:10.520
a look at the schema for genres.

01:30:10.520 --> 01:30:14.030
And we'll see that it's pretty
small, as the picture suggests.

01:30:14.030 --> 01:30:14.990
It's called genres.

01:30:14.990 --> 01:30:18.530
Every row has a show ID, which
is an integer, cannot be null.

01:30:18.530 --> 01:30:21.200
It has a genre, which
is text, cannot be null.

01:30:21.200 --> 01:30:25.190
And that show ID is a foreign key
in this table that references an ID

01:30:25.190 --> 01:30:26.690
column in the shows table.

01:30:26.690 --> 01:30:29.240
So very similar in
spirit, and so it really

01:30:29.240 --> 01:30:32.090
is just kind of on the
honor system that we're only

01:30:32.090 --> 01:30:34.670
putting one row for
each show in ratings,

01:30:34.670 --> 01:30:39.150
but 0 or more shows in the
genres table for shows, as well.

01:30:39.150 --> 01:30:41.360
So what can we do once we
want to tinker with genres?

01:30:41.360 --> 01:30:42.420
Well, let me do this.

01:30:42.420 --> 01:30:49.587
How about we select the show ID from the
genres table where the genre is comedy.

01:30:49.587 --> 01:30:51.170
Like, I'm in the mood for some comedy.

01:30:51.170 --> 01:30:54.950
Let's see all of the available comedies,
except let's just limit it to 10.

01:30:54.950 --> 01:30:59.150
Here are the show IDs for 10 comedies
according to the Internet Movie

01:30:59.150 --> 01:30:59.803
Database.

01:30:59.803 --> 01:31:01.220
Well, that's not very interesting.

01:31:01.220 --> 01:31:03.350
I care about the title,
so we can do that.

01:31:03.350 --> 01:31:08.990
Select title from shows where the
ID of the show is not equal to,

01:31:08.990 --> 01:31:12.710
but rather, in the following
subquery, if you will.

01:31:12.710 --> 01:31:18.170
Select show ID from genres where
genre equals, quote unquote,

01:31:18.170 --> 01:31:21.900
"Comedy," limit 10, just
to keep things simple.

01:31:21.900 --> 01:31:25.460
So same query as before, but now
I'm using it as a nested query

01:31:25.460 --> 01:31:28.730
to select the actual titles
whose IDs match those their.

01:31:28.730 --> 01:31:29.540
Enter.

01:31:29.540 --> 01:31:32.540
And there are those
titles for 10 comedies,

01:31:32.540 --> 01:31:34.240
maybe a couple of which we've seen.

01:31:34.240 --> 01:31:35.480
Catweazel we've seen before.

01:31:35.480 --> 01:31:36.970
But what if we want to now--

01:31:36.970 --> 01:31:38.380
let's see.

01:31:38.380 --> 01:31:42.085
Maybe we want to get all of
the-- let's flip it around.

01:31:42.085 --> 01:31:43.630
Catweazle keeps coming up.

01:31:43.630 --> 01:31:47.170
Why don't we figure out, with a
query, what all of its genres are?

01:31:47.170 --> 01:31:48.110
So Catweazle.

01:31:48.110 --> 01:31:48.610
Let's see.

01:31:48.610 --> 01:31:55.450
So let's do select star
from genres limit 10,

01:31:55.450 --> 01:31:57.430
because I know it's in the the first 10.

01:31:57.430 --> 01:31:59.950
This was, 63881 was Catweazle's ID.

01:31:59.950 --> 01:32:00.880
So let's do this.

01:32:00.880 --> 01:32:06.860
So select genre from genres where
the show ID equals that value.

01:32:06.860 --> 01:32:07.360
OK.

01:32:07.360 --> 01:32:09.152
So there's the same
query as we did before.

01:32:09.152 --> 01:32:10.480
Can we make this dynamic?

01:32:10.480 --> 01:32:11.650
Well, we can, too.

01:32:11.650 --> 01:32:13.870
What if I instead, more
dynamically do, select

01:32:13.870 --> 01:32:18.250
genre from genres where the show ID--

01:32:18.250 --> 01:32:19.180
and not in.

01:32:19.180 --> 01:32:23.020
If I'm looking for a specific
show, now I can actually do equals.

01:32:23.020 --> 01:32:25.480
And in my subquery, I
could do this, select ID

01:32:25.480 --> 01:32:28.660
from shows where the
title of the show equals,

01:32:28.660 --> 01:32:33.160
quote unquote, "Catweazel,"
semicolon, enter.

01:32:33.160 --> 01:32:35.720
So, again, even though I'm
typing these very quickly,

01:32:35.720 --> 01:32:38.630
I'm really just composing
similar, smaller ideas

01:32:38.630 --> 01:32:41.060
that we've seen before into
larger and larger queries

01:32:41.060 --> 01:32:42.573
to just get at more of this data.

01:32:42.573 --> 01:32:44.490
So what's really going
on underneath the hood?

01:32:44.490 --> 01:32:46.160
Well, you can think of it like this.

01:32:46.160 --> 01:32:50.060
If we've got this relationship
between shows and genres,

01:32:50.060 --> 01:32:52.580
here's an excerpt from
shows, and I didn't bother

01:32:52.580 --> 01:32:54.170
showing the thousands of other shows.

01:32:54.170 --> 01:32:56.900
Here's an excerpt from
genres on the right.

01:32:56.900 --> 01:33:00.530
What is that query-- what are
we essentially trying to do?

01:33:00.530 --> 01:33:02.570
Well, let me flip this around here.

01:33:02.570 --> 01:33:06.290
Let me highlight the fact that
this is the same, this is the same,

01:33:06.290 --> 01:33:07.260
this is the same.

01:33:07.260 --> 01:33:10.040
So wouldn't it be nice if I could
kind of get these all together?

01:33:10.040 --> 01:33:11.660
Well, if I join these
tables, we're actually

01:33:11.660 --> 01:33:13.370
going to notice an interesting artifact.

01:33:13.370 --> 01:33:16.490
If I join them together, as
we did before with ratings,

01:33:16.490 --> 01:33:20.990
I'm going to need to fill in the gap
there because this is not a table.

01:33:20.990 --> 01:33:24.860
Tables, by definition, always have
the same number of rows and columns.

01:33:24.860 --> 01:33:26.730
You can't have gaps in them like this.

01:33:26.730 --> 01:33:29.820
So the simplest thing to do is
just to fill that in this way.

01:33:29.820 --> 01:33:33.170
But if I were to try to
combine two tables that

01:33:33.170 --> 01:33:36.320
have this one-to-many
relationship, you're

01:33:36.320 --> 01:33:38.090
actually going to get duplication.

01:33:38.090 --> 01:33:40.190
It's not duplication
in the original tables,

01:33:40.190 --> 01:33:43.850
but in the temporary tables,
otherwise known as a result set,

01:33:43.850 --> 01:33:45.750
that's coming back to us.

01:33:45.750 --> 01:33:47.070
So what do I mean by this?

01:33:47.070 --> 01:33:50.060
Well, if we actually implement
this same idea as before,

01:33:50.060 --> 01:33:52.970
where we try to join
these two tables, let

01:33:52.970 --> 01:33:55.640
me propose that we do
it with this syntax.

01:33:55.640 --> 01:33:59.540
Let me do select star
from shows join genres,

01:33:59.540 --> 01:34:04.850
which is just like we did with ratings
but now let's join it on shows.id

01:34:04.850 --> 01:34:07.760
equals genres.show_ID.

01:34:07.760 --> 01:34:14.120
But let's just do this for Catweazel,
where ID equals 63881, semicolon.

01:34:14.120 --> 01:34:16.250
With the ratings, it
worked perfectly because it

01:34:16.250 --> 01:34:19.920
was a one-to-one relationship, so
the rows just got wider, if you will.

01:34:19.920 --> 01:34:23.660
But now, because it's a
one-to-many relationship, when

01:34:23.660 --> 01:34:26.960
you execute these queries, you are
going to get back duplicate data

01:34:26.960 --> 01:34:28.400
but this is OK.

01:34:28.400 --> 01:34:30.740
It's considered OK because
this is sort of ephemeral.

01:34:30.740 --> 01:34:32.540
These result sets,
these temporary tables,

01:34:32.540 --> 01:34:34.700
exist just for us to
look at the data, just

01:34:34.700 --> 01:34:36.440
for us to crunch the numbers somehow.

01:34:36.440 --> 01:34:42.167
It's not actually stored in
duplicate in the database itself.

01:34:42.167 --> 01:34:44.000
If I wanted to tighten
this further, though,

01:34:44.000 --> 01:34:47.570
let me actually get rid of the star
and let me just do title genre.

01:34:47.570 --> 01:34:49.040
And, indeed, we can now see.

01:34:49.040 --> 01:34:52.130
OK, Catweazel three times has
three different categories.

01:34:52.130 --> 01:34:54.440
But, generally, we don't
even care about that so I

01:34:54.440 --> 01:34:57.830
can even whittle this query
down to just selecting genre,

01:34:57.830 --> 01:35:00.500
and that, too, will just give me
the result, effectively, hiding

01:35:00.500 --> 01:35:01.550
the duplication.

01:35:01.550 --> 01:35:04.957
But when you join data with
a one-to-many relationship

01:35:04.957 --> 01:35:07.790
you're temporarily going to get
duplicates, which is actually useful

01:35:07.790 --> 01:35:10.580
because it's very easy then
to get at the show's title

01:35:10.580 --> 01:35:14.080
no matter where you are in some loop.

01:35:14.080 --> 01:35:14.620
All right.

01:35:14.620 --> 01:35:15.912
Well, what more can we do here?

01:35:15.912 --> 01:35:19.470
Well let me propose that we revisit the
main database, here with six tables,

01:35:19.470 --> 01:35:21.970
and let's look at, perhaps, the
juiciest, and the one that's

01:35:21.970 --> 01:35:26.680
really what most people use IMDb for
is to look up shows and people therein.

01:35:26.680 --> 01:35:28.300
Let's focus on these three tables.

01:35:28.300 --> 01:35:32.140
And we can infer, from this diagram,
that there's now, for the first time,

01:35:32.140 --> 01:35:34.270
three tables involved in a relationship.

01:35:34.270 --> 01:35:38.710
There's people, there's shows, but
I've proposed this intermediary stars

01:35:38.710 --> 01:35:40.930
table, much like I
temporarily, in Google Sheets,

01:35:40.930 --> 01:35:44.120
gave us a third sheet to
link the two together.

01:35:44.120 --> 01:35:46.810
This stars table we're about
to see, it's purpose in life

01:35:46.810 --> 01:35:49.990
is to join two other tables together.

01:35:49.990 --> 01:35:54.670
And, in fact, it's only going to have
two columns, show ID and person ID.

01:35:54.670 --> 01:35:57.790
So what this is going to do
for us is implement this idea,

01:35:57.790 --> 01:35:59.360
many-to-many relationship.

01:35:59.360 --> 01:35:59.860
Why?

01:35:59.860 --> 01:36:02.810
Because any TV show can
obviously have many people in it,

01:36:02.810 --> 01:36:06.670
but one person can presumably
star in many different shows.

01:36:06.670 --> 01:36:09.680
Like, Steve Carell has been in
multiple shows, not just The Office.

01:36:09.680 --> 01:36:11.870
So when you have a
many-to-many relationship,

01:36:11.870 --> 01:36:16.655
you actually do need this third table
to bridge the two any number of times.

01:36:16.655 --> 01:36:19.280
But it's going to make our life
a little more unpleasant to get

01:36:19.280 --> 01:36:22.800
the data we want because it's going to
add some additional steps, if you will.

01:36:22.800 --> 01:36:24.300
So let me do this.

01:36:24.300 --> 01:36:28.340
Suppose that I want to get
everything I know about The Office.

01:36:28.340 --> 01:36:30.800
Well, let's start with a
single query here in VS Code.

01:36:30.800 --> 01:36:35.780
Select star from shows where title
equals, quote unquote, "The Office."

01:36:35.780 --> 01:36:37.130
And I should see--

01:36:37.130 --> 01:36:38.060
Oh, interesting.

01:36:38.060 --> 01:36:41.810
Several attempts at creating a TV
show called The Office over the years.

01:36:41.810 --> 01:36:45.740
You can, perhaps, infer the year in
which the most popular of them began.

01:36:45.740 --> 01:36:46.910
2005?

01:36:46.910 --> 01:36:49.580
So I presume this is the one
we all know and have watched,

01:36:49.580 --> 01:36:54.200
at least in the US, which is this one,
386676, which matches the ID that I

01:36:54.200 --> 01:36:55.770
very carefully used earlier.

01:36:55.770 --> 01:36:58.520
So let me actually be a
little more deliberate,

01:36:58.520 --> 01:37:02.690
where the title equals The
Office and the year equals 2005.

01:37:02.690 --> 01:37:07.400
That query now gets us The Office that
we all in the US and, perhaps, love.

01:37:07.400 --> 01:37:09.830
But now let's actually
do something like, get

01:37:09.830 --> 01:37:11.960
all of the people who
starred in it, at least

01:37:11.960 --> 01:37:14.650
according to IMDb,
whoever had top billing.

01:37:14.650 --> 01:37:16.370
So how can I do this?

01:37:16.370 --> 01:37:20.350
Well, unfortunately, in the
shows table there are no people,

01:37:20.350 --> 01:37:24.310
and there's no stars even, but I
could do a nested query like this.

01:37:24.310 --> 01:37:32.350
Why don't I select the person
ID from the stars table where--

01:37:32.350 --> 01:37:45.990
whoops, where the-- sorry, where show
ID equals, and then in parentheses

01:37:45.990 --> 01:37:48.430
let me do that same query as
before and for time's sake,

01:37:48.430 --> 01:37:52.260
I'll just copy paste so that
we get back the one and only

01:37:52.260 --> 01:37:53.640
Office in that subquery.

01:37:53.640 --> 01:37:56.710
So what I'm going to do is take
an intermediate step, a baby step,

01:37:56.710 --> 01:37:57.240
if you will.

01:37:57.240 --> 01:38:00.992
Right now, I have found in the
shows table The Office I care about.

01:38:00.992 --> 01:38:02.700
But if I want to get
to the people table,

01:38:02.700 --> 01:38:06.720
I have to take a step through the stars
table, this intermediate table, just

01:38:06.720 --> 01:38:09.130
to get anywhere close to
the names of those people.

01:38:09.130 --> 01:38:11.070
So what can I get from the stars table?

01:38:11.070 --> 01:38:14.040
Well, why don't I at least
select all of the person

01:38:14.040 --> 01:38:18.340
IDs in that table that are somehow
associated with the same show ID?

01:38:18.340 --> 01:38:20.340
So in VS Code, what I'm doing is this.

01:38:20.340 --> 01:38:23.340
Select person ID from
that intermediate stars

01:38:23.340 --> 01:38:27.690
table, where the show ID in question is
whatever the show ID is for The Office.

01:38:27.690 --> 01:38:30.510
I could literally type
386676, but I'm trying

01:38:30.510 --> 01:38:33.630
to do this more dynamically so
I've used a nested query instead.

01:38:33.630 --> 01:38:34.890
All right, this is correct.

01:38:34.890 --> 01:38:36.760
Whoops, ironically.

01:38:36.760 --> 01:38:38.085
Where show ID--

01:38:41.540 --> 01:38:42.440
Oh, thank you.

01:38:42.440 --> 01:38:43.610
Oh, who said that?

01:38:43.610 --> 01:38:44.570
OK.

01:38:44.570 --> 01:38:45.110
Batman.

01:38:45.110 --> 01:38:47.370
I owe Batman Oreos after class.

01:38:47.370 --> 01:38:48.470
Thank you, Batman.

01:38:48.470 --> 01:38:50.330
That's de facto.

01:38:50.330 --> 01:38:50.930
Yes.

01:38:50.930 --> 01:38:52.355
Other good prize for that costume.

01:38:52.355 --> 01:38:53.330
[LAUGHS]

01:38:53.330 --> 01:38:55.160
OK, so let me fix this.

01:38:55.160 --> 01:38:55.910
My apologies.

01:38:55.910 --> 01:38:58.790
So let's go ahead and
select person ID from stars

01:38:58.790 --> 01:39:02.390
where show ID equals, and this
is where I messed up before.

01:39:02.390 --> 01:39:05.510
I did select star from
shows, but I can't

01:39:05.510 --> 01:39:09.410
look for a show ID equaling
an entire row of information.

01:39:09.410 --> 01:39:16.160
I instead need to do select ID from
shows where the shows title equals,

01:39:16.160 --> 01:39:20.510
quote unquote, "The Office" and
the year of that show is 2005.

01:39:20.510 --> 01:39:25.340
And just to call this out, much
like in C, I'm quoting strings,

01:39:25.340 --> 01:39:26.780
but I'm not quoting numbers.

01:39:26.780 --> 01:39:28.910
It's not necessary for
something like 2005.

01:39:28.910 --> 01:39:33.410
And just to be super clear, too, I have
generally adopted a style already today

01:39:33.410 --> 01:39:37.850
of capitalizing any SQL keywords,
like select, like from, like where,

01:39:37.850 --> 01:39:40.470
and so forth, and then using
lowercase for everything else.

01:39:40.470 --> 01:39:42.180
Strictly speaking, that's not necessary.

01:39:42.180 --> 01:39:44.130
Stylistically, we would
encourage you to be

01:39:44.130 --> 01:39:46.592
in the habit of using
uppercase for your SQL keywords

01:39:46.592 --> 01:39:48.300
because they just pop
more on the screen.

01:39:48.300 --> 01:39:49.883
It sort of makes things more readable.

01:39:49.883 --> 01:39:52.200
But strictly speaking,
SQL itself does not care

01:39:52.200 --> 01:39:54.690
about that capitalization of keywords.

01:39:54.690 --> 01:39:56.490
All right, now let me cross my fingers.

01:39:56.490 --> 01:39:59.100
And now I get back this
list of person IDs.

01:39:59.100 --> 01:40:00.840
And, again, my goal
was to figure out who

01:40:00.840 --> 01:40:03.018
is in The Office that debuted in 2005.

01:40:03.018 --> 01:40:06.060
This is not that interesting because
I don't know who any of these people

01:40:06.060 --> 01:40:08.820
are, but here's where we
can do one additional step

01:40:08.820 --> 01:40:10.650
and nest a nested query.

01:40:10.650 --> 01:40:15.000
So let me actually select the
names from the people table

01:40:15.000 --> 01:40:18.720
where the ID of those people is in.

01:40:18.720 --> 01:40:20.820
And then on a new line,
just to make it pretty,

01:40:20.820 --> 01:40:26.770
where the ID equals this
query, which I'll paste here,

01:40:26.770 --> 01:40:31.260
and then that equals, and then
another line and indent further,

01:40:31.260 --> 01:40:32.602
this query here.

01:40:32.602 --> 01:40:35.310
So just to save time, I'm copying
and pasting the previous query,

01:40:35.310 --> 01:40:38.970
but I'm wrapping it with one
outermost query, now, that's

01:40:38.970 --> 01:40:42.690
saying select names from people
where the ID of those people

01:40:42.690 --> 01:40:47.940
is in this result set, where the ID
of those shows is in this result set.

01:40:47.940 --> 01:40:52.170
So the parentheses make clear,
ideally, what's happening

01:40:52.170 --> 01:40:53.790
and what order from inside out.

01:40:53.790 --> 01:40:54.540
Enter.

01:40:54.540 --> 01:40:58.860
And there we have it, at least according
to IMDb, for the latest season,

01:40:58.860 --> 01:41:04.350
like this is the top-billed stars
that are in this here database.

01:41:04.350 --> 01:41:04.860
All right.

01:41:04.860 --> 01:41:06.302
So how can we do something else?

01:41:06.302 --> 01:41:08.260
Well, let me just do it
in the other direction.

01:41:08.260 --> 01:41:11.177
Suppose we want to see all of Steve
Carell shows, not just The Office.

01:41:11.177 --> 01:41:16.650
Select title from shows where
the ID of the show is in.

01:41:16.650 --> 01:41:18.990
And then here, I'm
going to do select show

01:41:18.990 --> 01:41:24.150
ID from stars where person ID equals.

01:41:24.150 --> 01:41:26.490
And then here, indenting
for clarity, I'm

01:41:26.490 --> 01:41:28.740
going to select Steve
Carell's ID by saying,

01:41:28.740 --> 01:41:32.850
select ID from people where
the name of that person is,

01:41:32.850 --> 01:41:35.040
quote unquote, "Steve Carell."

01:41:35.040 --> 01:41:36.780
And so in this way,

01:41:36.780 --> 01:41:42.250
I'm writing the SQL
query sort of in reverse.

01:41:42.250 --> 01:41:44.102
I'm asking first what I care about.

01:41:44.102 --> 01:41:45.810
But before I can even
answer that, I have

01:41:45.810 --> 01:41:48.510
to answer this nested query,
what is Steve Carell's ID?

01:41:48.510 --> 01:41:52.320
Once I have that, what are all the
show IDs that person ID has been in?

01:41:52.320 --> 01:41:55.740
And then, please tell me what the
title of all of those shows is.

01:41:55.740 --> 01:41:57.510
Let me go ahead and cross my fingers.

01:41:57.510 --> 01:41:58.200
And, voila.

01:41:58.200 --> 01:42:00.950
Some of these you might have heard
of, some of you might not have,

01:42:00.950 --> 01:42:03.750
but if you were to go on imdb.com
and search for Steve Carell,

01:42:03.750 --> 01:42:08.445
you would presumably see this here
list of shows that he's been in,

01:42:08.445 --> 01:42:11.410
in some particular order.

01:42:11.410 --> 01:42:15.330
Just to show you two other syntaxes,
but let me not emphasize this

01:42:15.330 --> 01:42:17.200
because it will look complicated.

01:42:17.200 --> 01:42:19.500
There are other ways to
solve this same problem.

01:42:19.500 --> 01:42:21.660
If you prefer the
approach of joining, we

01:42:21.660 --> 01:42:25.720
can actually join not just
two, but three tables together.

01:42:25.720 --> 01:42:26.535
But question first.

01:42:26.535 --> 01:42:28.310
[INDISTINCT SPEECH]

01:42:28.310 --> 01:42:29.810
No, just stretching first.

01:42:29.810 --> 01:42:35.120
So two final ways to execute the same
idea, but the first of them that I just

01:42:35.120 --> 01:42:37.850
did is arguably, relatively, simpler.

01:42:37.850 --> 01:42:38.910
You could do this.

01:42:38.910 --> 01:42:42.800
I could select the title
from the shows table

01:42:42.800 --> 01:42:49.220
by joining it on the stars
table on the shows ID column,

01:42:49.220 --> 01:42:52.820
equaling the stars
tables show ID column.

01:42:52.820 --> 01:42:58.460
And then I can further join it on the
people table, on stars dot person ID

01:42:58.460 --> 01:43:00.453
equaling people.id.

01:43:00.453 --> 01:43:03.620
This is a mouthful, and even I am kind
of crossing my fingers that it didn't

01:43:03.620 --> 01:43:06.510
screw up when transcribing
it from my printout here,

01:43:06.510 --> 01:43:11.180
but what I'm effectively doing is
joining one, two, three tables all

01:43:11.180 --> 01:43:17.570
together by telling the database how
to join the shows table, with the stars

01:43:17.570 --> 01:43:19.070
table, and the people table.

01:43:19.070 --> 01:43:22.940
Specifically, the way to bridge
that picture, per the diagram,

01:43:22.940 --> 01:43:28.890
is to specify that shows.id should
be lined up with stars.show ID.

01:43:28.890 --> 01:43:34.380
And stars.person ID should
be lined up with people ID.

01:43:34.380 --> 01:43:35.008
And that's it.

01:43:35.008 --> 01:43:37.050
That essentially allows
us to connect these three

01:43:37.050 --> 01:43:39.000
tables with their common fields.

01:43:39.000 --> 01:43:42.390
If I hit enter now, I'm going
to get back, somewhat slowly

01:43:42.390 --> 01:43:47.160
actually, a really long
list, with some duplication,

01:43:47.160 --> 01:43:49.830
of all of those particular shows.

01:43:49.830 --> 01:43:52.182
Actually, all shows in
the database because I

01:43:52.182 --> 01:43:53.640
didn't practice what I'm preaching.

01:43:53.640 --> 01:43:56.140
I wanted to search for just
Steve Carell, what you're seeing

01:43:56.140 --> 01:43:58.950
is the entirety of the tens
of thousands of TV shows.

01:43:58.950 --> 01:44:00.360
Control C is your friend.

01:44:00.360 --> 01:44:05.070
Let me go ahead and reload SQLite,
and let me type that again.

01:44:05.070 --> 01:44:06.690
Let me type that once more, sorry.

01:44:06.690 --> 01:44:13.320
Select title from shows, join stars
on shows.id equals stars.show ID,

01:44:13.320 --> 01:44:20.872
join people on stars.person
ID equals people ID where--

01:44:20.872 --> 01:44:22.080
this was the part I left out.

01:44:22.080 --> 01:44:24.990
Name equals Steve Carell.

01:44:24.990 --> 01:44:28.710
And if I didn't screw up
by typing so fast-- enter.

01:44:28.710 --> 01:44:31.500
Now we get, a little
more slowly than before,

01:44:31.500 --> 01:44:34.435
those same shows that
Steve Carell starred in.

01:44:34.435 --> 01:44:36.810
So this is just to say there's
another way of doing this.

01:44:36.810 --> 01:44:39.420
But maybe a third way, which
is a little simpler than that,

01:44:39.420 --> 01:44:41.250
explicitly joining them in that way.

01:44:41.250 --> 01:44:47.490
You can, alternatively, still do
this, select title from shows, stars,

01:44:47.490 --> 01:44:48.270
and people.

01:44:48.270 --> 01:44:51.480
You just literally enumerate,
with commas, what three tables

01:44:51.480 --> 01:44:52.920
you want to join somehow.

01:44:52.920 --> 01:44:55.020
And then you can instead
of using join, you

01:44:55.020 --> 01:44:58.560
can just use where clauses to
make sure they line up properly.

01:44:58.560 --> 01:45:03.630
You can say, where shows.id
equals stars.show ID,

01:45:03.630 --> 01:45:09.180
and people ID equals
stars.person ID, and name

01:45:09.180 --> 01:45:11.760
equals, quote unquote, "Steve Carell."

01:45:11.760 --> 01:45:14.430
And I realize this is hard to
keep track of everything now,

01:45:14.430 --> 01:45:16.353
all these darn different
ways to do this,

01:45:16.353 --> 01:45:19.020
this is just to say that there's
different approaches to solving

01:45:19.020 --> 01:45:20.728
the same problem and
for different people

01:45:20.728 --> 01:45:23.970
you might think about things a little
more differently than someone else.

01:45:23.970 --> 01:45:27.630
If I hit enter here, this too it's a
little slower than the nested selects

01:45:27.630 --> 01:45:31.590
it seems, but it does, in
fact, give us that same answer.

01:45:31.590 --> 01:45:34.440
And just for thoroughness,
if I go back to our diagram.

01:45:34.440 --> 01:45:38.700
Besides the tables we've seen,
there's actually another writers table

01:45:38.700 --> 01:45:39.480
in there, as well.

01:45:39.480 --> 01:45:42.480
If you're curious to see what writers
is, let's just glance at that real

01:45:42.480 --> 01:45:42.980
fast.

01:45:42.980 --> 01:45:47.010
In VS Code, let me do .schema writers,
and it's actually almost the same

01:45:47.010 --> 01:45:48.330
as stars.

01:45:48.330 --> 01:45:52.440
Except this case, in writers,
we are associating a show

01:45:52.440 --> 01:45:54.750
with a person ID, both
of which in this context

01:45:54.750 --> 01:45:58.770
are foreign keys that indeed
reference back shows and people ID.

01:45:58.770 --> 01:46:02.520
Which, again, if I do this schema
stars, which we didn't see before,

01:46:02.520 --> 01:46:04.120
is structurally the same.

01:46:04.120 --> 01:46:06.540
So the relationship is essentially
embodied, in this case,

01:46:06.540 --> 01:46:12.690
by IMDb, and in turn by us, by way of
the tables name, TV stars or writers

01:46:12.690 --> 01:46:14.410
thereof.

01:46:14.410 --> 01:46:14.940
All right.

01:46:14.940 --> 01:46:16.000
I know that's a lot.

01:46:16.000 --> 01:46:19.720
Any questions before we take
a higher level step back?

01:46:19.720 --> 01:46:20.220
Yeah.

01:46:20.220 --> 01:46:22.992
[INDISTINCT SPEECH]

01:46:25.183 --> 01:46:25.850
A good question.

01:46:25.850 --> 01:46:28.400
Does SQL provide any way to figure
out the mapping between tables

01:46:28.400 --> 01:46:29.540
without looking at the database?

01:46:29.540 --> 01:46:30.380
Short answer, no.

01:46:30.380 --> 01:46:32.668
Like, this is the dia--
well, that's not quite fair.

01:46:32.668 --> 01:46:34.460
Depending on the database
you're using, you

01:46:34.460 --> 01:46:36.080
might be able to click
a button, for instance,

01:46:36.080 --> 01:46:38.997
and get a nice pretty picture like
this, that shows the relationships.

01:46:38.997 --> 01:46:40.730
Indeed, we use software
to generate this.

01:46:40.730 --> 01:46:42.980
We didn't do this diagram,
for instance, by hand.

01:46:42.980 --> 01:46:45.530
SQLite itself does not
provide you with that.

01:46:45.530 --> 01:46:48.425
In SQLite the best you
can do is run .schema.

01:46:48.425 --> 01:46:50.300
And if you don't specify
a table name, you'll

01:46:50.300 --> 01:46:52.120
get everything from the table described.

01:46:52.120 --> 01:46:53.870
Once you get comfortable
with SQL, though,

01:46:53.870 --> 01:46:56.240
the idea is that you can read
the text and sort of infer what

01:46:56.240 --> 01:46:57.080
the structure is.

01:46:57.080 --> 01:46:58.850
But, yes, there are
graphical programs can

01:46:58.850 --> 01:47:00.710
generate prettier
pictures like this, but it

01:47:00.710 --> 01:47:02.900
depends on the software you're using.

01:47:02.900 --> 01:47:03.470
Yeah.

01:47:03.470 --> 01:47:06.600
[INDISTINCT SPEECH]

01:47:06.600 --> 01:47:10.590
SQL is not case sensitive
with respect to its keywords.

01:47:10.590 --> 01:47:13.920
But table names, and other
things that you chose,

01:47:13.920 --> 01:47:16.452
you've got to be consistent
with how you capitalize them.

01:47:16.452 --> 01:47:19.410
I've done everything in lowercase,
but that tends to be one convention.

01:47:19.410 --> 01:47:23.970
Other people might use camel case,
where you alternate caps appropriately.

01:47:23.970 --> 01:47:26.737
All right, so let's take a
higher level look at this,

01:47:26.737 --> 01:47:29.820
and also consider some of the actual
real world problems that, tragically,

01:47:29.820 --> 01:47:32.200
are still with us in some form today.

01:47:32.200 --> 01:47:34.980
Notice that some of the
queries we executed a bit ago

01:47:34.980 --> 01:47:38.440
were actually relatively slow whereas
I hit enter and got a lot of my results

01:47:38.440 --> 01:47:38.940
like that.

01:47:38.940 --> 01:47:41.130
Those last two queries,
where I was joining

01:47:41.130 --> 01:47:43.500
all of those tables looking
for Steve Carell's shows

01:47:43.500 --> 01:47:45.138
were actually relatively slow.

01:47:45.138 --> 01:47:46.680
And let's try to take a simpler case.

01:47:46.680 --> 01:47:47.310
Let me do this.

01:47:47.310 --> 01:47:51.480
In SQLite you can actually time your
queries by running .timer and then

01:47:51.480 --> 01:47:52.330
turning it on.

01:47:52.330 --> 01:47:55.620
This is just going to keep track now
of how many seconds or milliseconds any

01:47:55.620 --> 01:47:57.787
of your queries take if
you're curious to figure out

01:47:57.787 --> 01:47:58.930
what's faster, what's slow.

01:47:58.930 --> 01:48:00.930
Let me do something
relatively simple like this,

01:48:00.930 --> 01:48:04.860
select star from shows where
the title of the show equals,

01:48:04.860 --> 01:48:07.198
quote unquote, "The Office," semicolon.

01:48:07.198 --> 01:48:08.740
All right, that was pretty darn fast.

01:48:08.740 --> 01:48:13.240
And it took 0.044 seconds in reality.

01:48:13.240 --> 01:48:16.210
If you care further, you can break
this time down into user time,

01:48:16.210 --> 01:48:18.708
like spent in my account,
versus system time, which means

01:48:18.708 --> 01:48:20.500
spent in the operating
system, essentially.

01:48:20.500 --> 01:48:28.120
But we'll focus on the real
wall-clock time of 0.044 seconds.

01:48:28.120 --> 01:48:29.290
Pretty darn fast.

01:48:29.290 --> 01:48:32.770
But we can actually do better because
it turns out in the world of SQL,

01:48:32.770 --> 01:48:37.690
you can create what are called indexes,
which is a fancy way of saying a data

01:48:37.690 --> 01:48:41.170
structure that makes it
faster to perform queries,

01:48:41.170 --> 01:48:43.660
like selects, and even
other queries as well.

01:48:43.660 --> 01:48:46.180
In an index, you can
use syntax like this,

01:48:46.180 --> 01:48:49.420
create index, the name of the
index, on a specific table

01:48:49.420 --> 01:48:50.710
on the specific columns.

01:48:50.710 --> 01:48:54.340
And what I mean by this is if you know
that your application, like imdb.com

01:48:54.340 --> 01:48:58.480
or their mobile app, is going to
search on certain columns frequently,

01:48:58.480 --> 01:49:01.240
you can prepare the database
in advance to build up

01:49:01.240 --> 01:49:04.450
some fancy data structures in
memory so it can get back answers

01:49:04.450 --> 01:49:06.080
even faster than that.

01:49:06.080 --> 01:49:08.200
So case in point, let
me go back to VS Code

01:49:08.200 --> 01:49:12.700
here, and let me create an index
called, whatever, title index,

01:49:12.700 --> 01:49:17.180
for instance on the shows table,
specifically on the title column.

01:49:17.180 --> 01:49:20.830
So that's simply the syntax for
telling the database in advance,

01:49:20.830 --> 01:49:23.350
index this column because I'm
going to do a lot of searching

01:49:23.350 --> 01:49:25.210
on it so I want the queries to be fast.

01:49:25.210 --> 01:49:26.020
Enter.

01:49:26.020 --> 01:49:26.920
It took a moment.

01:49:26.920 --> 01:49:30.010
It took almost half a second, but
the index I only have to build once.

01:49:30.010 --> 01:49:32.320
Even though, technically, you have
to maintain it over time if you're

01:49:32.320 --> 01:49:34.210
doing updates, deletes, and inserts.

01:49:34.210 --> 01:49:36.610
But now let me do the
same query as before,

01:49:36.610 --> 01:49:41.530
select star from shows where title
equals, quote unquote, "The Office."

01:49:41.530 --> 01:49:45.580
Previously, that query
took 0.044 seconds.

01:49:45.580 --> 01:49:47.080
When I hit enter now?

01:49:47.080 --> 01:49:47.890
Boom.

01:49:47.890 --> 01:49:50.770
I mean, it takes no time
at all, or less time

01:49:50.770 --> 01:49:53.998
than it's even keeping track of
in terms of significant digits.

01:49:53.998 --> 01:49:57.040
Now, that might not seem like a big
deal to us humans and our human eyes,

01:49:57.040 --> 01:49:59.620
but if you've got hundreds,
thousands, millions of users,

01:49:59.620 --> 01:50:03.580
as maybe the real imdb.com
has, you just saved yourself

01:50:03.580 --> 01:50:05.900
a fortune in servers and complexity.

01:50:05.900 --> 01:50:06.400
Why?

01:50:06.400 --> 01:50:08.320
Because the same server
can, clearly, now

01:50:08.320 --> 01:50:12.400
handle way more people per unit of
time, per second, because each query

01:50:12.400 --> 01:50:14.027
takes less and less time.

01:50:14.027 --> 01:50:16.360
I mean, we're all too familiar
here and at Yale, surely,

01:50:16.360 --> 01:50:19.087
with certain University applications
that are just so darn slow.

01:50:19.087 --> 01:50:20.920
When you click a button
and the stupid thing

01:50:20.920 --> 01:50:23.740
spins and makes you wait and
wait, a lot of the time that

01:50:23.740 --> 01:50:27.820
can be explained by poor database
design, or databases that might not

01:50:27.820 --> 01:50:29.200
have been indexed properly.

01:50:29.200 --> 01:50:31.492
So when you're searching for
some course, for instance,

01:50:31.492 --> 01:50:33.310
in the course catalog,
it's taking forever

01:50:33.310 --> 01:50:35.170
because underneath the
hood it's essentially

01:50:35.170 --> 01:50:37.730
doing linear search over everything.

01:50:37.730 --> 01:50:40.030
But, by contrast, in
a relational database,

01:50:40.030 --> 01:50:42.670
when you create an index
in advance because you

01:50:42.670 --> 01:50:45.640
have a hunch that maybe users are
going to search on that column,

01:50:45.640 --> 01:50:47.560
like show titles.

01:50:47.560 --> 01:50:49.270
Essentially, you're
building up in memory

01:50:49.270 --> 01:50:51.700
what's called a b-tree,
which is not a binary tree.

01:50:51.700 --> 01:50:54.370
It's still a tree, though, if
you think back to week five.

01:50:54.370 --> 01:50:57.820
But it's a very short,
fat tree instead, where

01:50:57.820 --> 01:51:02.650
every node might have two, or three,
or 30 children, which essentially

01:51:02.650 --> 01:51:04.820
pulls the height of the tree way up.

01:51:04.820 --> 01:51:07.990
Which is to say that when you
search for some value in a b-tree,

01:51:07.990 --> 01:51:10.040
it's invariably going
to be in the leaves.

01:51:10.040 --> 01:51:13.870
So the shorter the tree is, the fewer
steps it takes to find the value

01:51:13.870 --> 01:51:14.750
you care about.

01:51:14.750 --> 01:51:17.860
So when you run create table,
that kind of data structure

01:51:17.860 --> 01:51:21.680
is being magically created
for you by the database

01:51:21.680 --> 01:51:25.540
so it's not a simple linear
search through the entire column,

01:51:25.540 --> 01:51:27.350
top to bottom.

01:51:27.350 --> 01:51:31.330
So with that said, we can see this
really with more complicated queries.

01:51:31.330 --> 01:51:33.260
And let me go back to VS Code here.

01:51:33.260 --> 01:51:36.853
Let me propose to run that
same, slow query before,

01:51:36.853 --> 01:51:39.520
even though it's fine if you're
not comfortable with the syntax.

01:51:39.520 --> 01:51:41.710
It was relatively
slow, though, to watch.

01:51:41.710 --> 01:51:46.360
So let's do select title
from shows stars and people

01:51:46.360 --> 01:51:56.500
where shows.id equals stars.show ID, and
people.id equals stars.dot person ID,

01:51:56.500 --> 01:51:59.050
and name equals Steve Carell.

01:51:59.050 --> 01:52:00.820
So this was the last of those queries.

01:52:00.820 --> 01:52:03.280
That just searches for all
of Steve Carell's TV shows

01:52:03.280 --> 01:52:06.430
without using joins explicitly
but by just enumerating

01:52:06.430 --> 01:52:10.210
all three tables in question, and
then using where to cleverly connect

01:52:10.210 --> 01:52:11.390
the dots, so to speak.

01:52:11.390 --> 01:52:12.770
But my timer is still on.

01:52:12.770 --> 01:52:16.390
So notice now, when I hit enter,
it doesn't just feel slow.

01:52:16.390 --> 01:52:19.390
It actually took 2.763 seconds.

01:52:19.390 --> 01:52:20.410
Like, that's slow.

01:52:20.410 --> 01:52:21.250
That's expensive.

01:52:21.250 --> 01:52:22.650
That's going to annoy your users.

01:52:22.650 --> 01:52:25.900
That's going to annoy your students if
the database is thinking, and thinking,

01:52:25.900 --> 01:52:27.730
and thinking, and taking that much time.

01:52:27.730 --> 01:52:29.770
But let's note this.

01:52:29.770 --> 01:52:34.080
That same query I just executed
touched a bunch of columns,

01:52:34.080 --> 01:52:35.830
and it turns out that
whenever you declare

01:52:35.830 --> 01:52:40.330
a primary key in a database, at least
in SQLite, you get an index for free.

01:52:40.330 --> 01:52:43.060
Primary keys are automatically
indexed, so searching

01:52:43.060 --> 01:52:44.320
for a number in that column?

01:52:44.320 --> 01:52:45.040
Super fast.

01:52:45.040 --> 01:52:48.580
Not linear search, it's something
logarithmic, most likely, or, ideally,

01:52:48.580 --> 01:52:51.100
closer to something
like constant time even.

01:52:51.100 --> 01:52:54.940
Here, though, I'm touching
not just shows.id,

01:52:54.940 --> 01:52:58.840
but I'm also filtering on
stars.show ID, so a foreign key.

01:52:58.840 --> 01:53:01.000
Foreign keys are not indexed by default.

01:53:01.000 --> 01:53:02.260
I'm looking at people ID.

01:53:02.260 --> 01:53:03.190
That's a primary key.

01:53:03.190 --> 01:53:03.880
That's indexed.

01:53:03.880 --> 01:53:08.200
But stars.person ID, not indexed
by default, is a foreign key.

01:53:08.200 --> 01:53:11.890
Lastly, I'm filtering by
name in the people table.

01:53:11.890 --> 01:53:17.140
Names are not indexed by default. So
I'm touching three separate columns,

01:53:17.140 --> 01:53:20.980
two foreign keys, one name
field, that have no fancy tree

01:53:20.980 --> 01:53:22.570
structure built for them.

01:53:22.570 --> 01:53:23.900
But I can do that.

01:53:23.900 --> 01:53:27.190
Let me go down to my terminal here.

01:53:27.190 --> 01:53:30.550
Let me create one index called,
say, person index, though, I

01:53:30.550 --> 01:53:35.750
could call it anything I want, on the
stars table on the person ID column.

01:53:35.750 --> 01:53:38.140
So that indexes that foreign key.

01:53:38.140 --> 01:53:41.050
Took 1.7 seconds, but I
only have to do it once.

01:53:41.050 --> 01:53:46.720
Create index, show index
on the stars table show ID.

01:53:46.720 --> 01:53:49.210
So another foreign key
is getting its own index.

01:53:49.210 --> 01:53:51.490
Took 1.4 seconds, but
it's a one time thing.

01:53:51.490 --> 01:53:54.070
And, lastly, let's index
all of those actors names.

01:53:54.070 --> 01:53:59.770
Create index called name index on
the people table on the name column.

01:53:59.770 --> 01:54:00.460
Enter.

01:54:00.460 --> 01:54:04.520
That took 1.0 seconds,
but a one time operation.

01:54:04.520 --> 01:54:07.850
So, essentially, I've built up
like three of these trees in memory

01:54:07.850 --> 01:54:09.920
now, specifically for these columns.

01:54:09.920 --> 01:54:15.160
So now recall, previously,
that slow query.

01:54:15.160 --> 01:54:18.610
If I scroll back up,
that took, what was it?

01:54:18.610 --> 01:54:20.570
2.7 seconds?

01:54:20.570 --> 01:54:21.070
I think.

01:54:21.070 --> 01:54:23.330
2.7 seconds, previously.

01:54:23.330 --> 01:54:29.080
But if I now run the same
thing, select title from show

01:54:29.080 --> 01:54:33.850
stars people where shows ID
equals stars.show show ID,

01:54:33.850 --> 01:54:37.240
and people ID equals
stars.person person ID,

01:54:37.240 --> 01:54:42.730
and name equals Steve, so close, Carell.

01:54:42.730 --> 01:54:47.320
Same query as before previously took
2.7 seconds, which was the most annoying

01:54:47.320 --> 01:54:47.890
of them yet.

01:54:47.890 --> 01:54:49.420
Now, when I hit enter?

01:54:49.420 --> 01:54:50.230
Boom.

01:54:50.230 --> 01:54:55.240
0.001 seconds, which is the
difference, again, between bunches

01:54:55.240 --> 01:55:00.970
of linear searches and, in this case,
searching a fancier, week-five style,

01:55:00.970 --> 01:55:02.650
b-tree, in this case.

01:55:02.650 --> 01:55:04.010
So indexes matter.

01:55:04.010 --> 01:55:07.120
So what's then, maybe,
the trade off here?

01:55:07.120 --> 01:55:09.680
Like, why not index every
column in every table?

01:55:09.680 --> 01:55:10.930
Because this is feeling great.

01:55:10.930 --> 01:55:14.110
Like, we're speeding things up
by factors of 1,000, practically.

01:55:14.110 --> 01:55:15.190
What's the trade off?

01:55:15.190 --> 01:55:16.240
[INDISTINCT SPEECH]

01:55:16.240 --> 01:55:18.010
Lots and lots of memory, or space.

01:55:18.010 --> 01:55:20.200
Yeah, so you're just
trading off space for time,

01:55:20.200 --> 01:55:23.200
which we said a couple of weeks ago
is an acceptable trade off depending

01:55:23.200 --> 01:55:24.610
on what resources you have.

01:55:24.610 --> 01:55:28.160
But it's probably an
overcorrection to index everything,

01:55:28.160 --> 01:55:31.445
especially since it will slightly slow
down inserts, updates, and deletes

01:55:31.445 --> 01:55:34.570
because you have to maintain this tree
structure so it doesn't devolve back

01:55:34.570 --> 01:55:37.600
into a linked list or something linear.

01:55:37.600 --> 01:55:43.900
But, in fact, being selective about
it is perhaps the best strategy.

01:55:43.900 --> 01:55:44.440
All right.

01:55:44.440 --> 01:55:48.250
So that we can now solve some
other problems more generally,

01:55:48.250 --> 01:55:49.780
let me just connect two dots.

01:55:49.780 --> 01:55:53.740
Even though we focused today on
SQL, specifically for databases,

01:55:53.740 --> 01:55:56.590
you can actually combine
one language with another

01:55:56.590 --> 01:55:58.010
and solve different problems.

01:55:58.010 --> 01:55:59.420
And so, in fact, let me do this.

01:55:59.420 --> 01:56:02.530
Let me revisit our
favorites.py from earlier,

01:56:02.530 --> 01:56:07.610
but let me actually now use the
favorites database as follows.

01:56:07.610 --> 01:56:09.980
Let me go into VS Code here.

01:56:09.980 --> 01:56:16.810
Let me remove favorites.db because if
you recall it made everyone's count

01:56:16.810 --> 01:56:18.760
previously--

01:56:18.760 --> 01:56:19.510
became Fiftyville.

01:56:19.510 --> 01:56:21.640
So let me remove that file.

01:56:21.640 --> 01:56:26.140
Let me run SQLite3 on
favorites.db again.

01:56:26.140 --> 01:56:27.940
Let me create the file anew.

01:56:27.940 --> 01:56:30.490
Let me set the mode to CSV again.

01:56:30.490 --> 01:56:34.630
Let me import that file called
favorites.csv into an identical table

01:56:34.630 --> 01:56:36.160
as before called favorites.

01:56:36.160 --> 01:56:37.150
And then quit.

01:56:37.150 --> 01:56:40.970
So I've just reset things to my
backup, if you will, from the CSV file.

01:56:40.970 --> 01:56:43.300
So, I again have a favorites.db.

01:56:43.300 --> 01:56:48.033
Let me now minimize my terminal
window here and reopen favorites.py,

01:56:48.033 --> 01:56:50.200
and let me just go ahead
and get rid of that version

01:56:50.200 --> 01:56:53.830
entirely and focus this time
on not talking to a CSV file,

01:56:53.830 --> 01:56:55.780
opening it, iterating over the rows.

01:56:55.780 --> 01:56:58.930
We can actually use Python
to execute SQL queries

01:56:58.930 --> 01:57:00.950
and kind of get the best of both worlds.

01:57:00.950 --> 01:57:03.160
So let me do this here.

01:57:03.160 --> 01:57:09.580
Let me, from the CS50 Python library,
import our own SQL functionality.

01:57:09.580 --> 01:57:12.052
And this is a training wheel
we still provide for SQL

01:57:12.052 --> 01:57:14.260
because it's just much easier
than using the industry

01:57:14.260 --> 01:57:15.760
standard libraries for SQL.

01:57:15.760 --> 01:57:18.540
It just is painful for simple tasks.

01:57:18.540 --> 01:57:21.380
So now, let me create a variable
called db, for database.

01:57:21.380 --> 01:57:24.410
I'm going to set it equal to this
SQL function that CS50 wrote.

01:57:24.410 --> 01:57:28.730
And this is going to look weird, but
the way you open a DB file in Python,

01:57:28.730 --> 01:57:31.580
whether it's with CS50
library or someone else's, you

01:57:31.580 --> 01:57:36.560
say SQLite colon slash,
slash, slash favorites.db.

01:57:36.560 --> 01:57:38.780
So weird syntax, but it's commonplace.

01:57:38.780 --> 01:57:41.900
Three slashes, not two
like a URL is usually.

01:57:41.900 --> 01:57:44.300
Now let's use a variable
called favorite and set

01:57:44.300 --> 01:57:48.020
it equal to the return value of input by
asking the human for their favorite TV

01:57:48.020 --> 01:57:48.650
show.

01:57:48.650 --> 01:57:51.860
And now, previously we
opened up the CSV file,

01:57:51.860 --> 01:57:56.270
iterated over it looking for the show
they typed in, show they typed in.

01:57:56.270 --> 01:57:59.060
Or, rather-- Oh, sorry, not the show.

01:57:59.060 --> 01:58:01.610
The problem, or the
problem that we typed in.

01:58:01.610 --> 01:58:03.150
Let me instead do this.

01:58:03.150 --> 01:58:07.290
Let me set, use a db
variables, execute function,

01:58:07.290 --> 01:58:11.360
which comes with the CS50 library,
and let me execute this SQL query.

01:58:11.360 --> 01:58:20.250
Select count star as n from favorites
where problem equals question mark.

01:58:20.250 --> 01:58:23.000
And the question mark is a little
weird, but think of it, for now,

01:58:23.000 --> 01:58:24.902
like C's percent s.

01:58:24.902 --> 01:58:26.750
Comma favorite.

01:58:26.750 --> 01:58:29.840
I'm going to plug in whatever the
human typed in into that query where

01:58:29.840 --> 01:58:30.920
the question mark is.

01:58:30.920 --> 01:58:34.340
So no percent s, I'm using a
question mark in this world of SQL.

01:58:34.340 --> 01:58:36.883
This is going to give me
back a temporary table,

01:58:36.883 --> 01:58:39.800
and I'm going to store that temporary
table in a variable called rows.

01:58:39.800 --> 01:58:42.840
Because the temporary table is
essentially zero or more rows,

01:58:42.840 --> 01:58:44.660
so I'm going to name my variable rows.

01:58:44.660 --> 01:58:48.170
And then if I want to get
back the first and only row,

01:58:48.170 --> 01:58:51.860
I can literally do row equals rows
bracket zero just to be pedantic,

01:58:51.860 --> 01:58:55.970
just to get, literally, the first row
in that temporary table, or result set.

01:58:55.970 --> 01:58:59.810
And now if I want to print
back the n column therein,

01:58:59.810 --> 01:59:03.590
I can do print row, quote unquote, "n."

01:59:03.590 --> 01:59:06.420
So let me take a step back and do this.

01:59:06.420 --> 01:59:12.710
Let me go into SQLite3 of favorites.db
and let me literally type something

01:59:12.710 --> 01:59:13.580
like this.

01:59:13.580 --> 01:59:16.640
Here I'm in SQLite3 at the bottom
and the problem I'm searching,

01:59:16.640 --> 01:59:19.520
for instance, is Scratch semicolon.

01:59:19.520 --> 01:59:23.480
Notice that is how in SQL, in
the command line of SQLite3,

01:59:23.480 --> 01:59:24.960
I can get back the answer I want.

01:59:24.960 --> 01:59:26.960
But what if I want to
make a Python program that

01:59:26.960 --> 01:59:29.480
queries the database for that value
and then prints something out?

01:59:29.480 --> 01:59:31.790
In two weeks time, what if
I want to do that to make

01:59:31.790 --> 01:59:35.300
a web application, like
imdb.com or a mobile app,

01:59:35.300 --> 01:59:37.490
that writes code to
generate a user interface

01:59:37.490 --> 01:59:39.320
but that pulls the data from a database?

01:59:39.320 --> 01:59:42.170
You're not going to have your
human users using SQLite3,

01:59:42.170 --> 01:59:44.280
you're going to generate
the output for them.

01:59:44.280 --> 01:59:46.790
So let me close my terminal window here.

01:59:46.790 --> 01:59:49.190
Rather, let me close out of SQLite 3.

01:59:49.190 --> 01:59:53.840
Let me now run Python
of favorites.py, enter.

01:59:53.840 --> 01:59:56.450
I'm prompted for my favorite
using the input function.

01:59:56.450 --> 02:00:00.290
I type in Scratch and hit
enter, and there's my 34.

02:00:00.290 --> 02:00:02.650
So this is an incredibly
common practice,

02:00:02.650 --> 02:00:04.400
to use one language
for what it's best at.

02:00:04.400 --> 02:00:07.430
Like SQL is best at reading
data from databases,

02:00:07.430 --> 02:00:11.060
Python is, maybe, best in this
case for creating a user interface

02:00:11.060 --> 02:00:13.040
or, eventually, making
a web application.

02:00:13.040 --> 02:00:16.560
But it's certainly fine to
use one inside of the other.

02:00:16.560 --> 02:00:19.740
The documentation for this library,
if and when you get curious,

02:00:19.740 --> 02:00:23.520
is it this URL here, along with a
lot of CS50's own documentation.

02:00:23.520 --> 02:00:26.340
But there are some
problems, nonetheless, we

02:00:26.340 --> 02:00:29.760
might encounter in this world, and we
thought we'd end on these challenges.

02:00:29.760 --> 02:00:33.420
It turns out that in the world of
SQL, even though we haven't touched

02:00:33.420 --> 02:00:36.638
upon this yet, you're generally
working not with like hundreds

02:00:36.638 --> 02:00:39.180
of people and their favorite
languages and problems, not even

02:00:39.180 --> 02:00:42.600
thousands of movies, but like
millions of things in the database,

02:00:42.600 --> 02:00:45.180
like Instagram posts or
TikTok videos or the like.

02:00:45.180 --> 02:00:48.510
Those are huge databases
with millions of rows.

02:00:48.510 --> 02:00:51.510
The problem with SQL and,
really, databases in general,

02:00:51.510 --> 02:00:54.030
is if you have thousands,
millions of things

02:00:54.030 --> 02:00:57.765
happening at once, things can get out
of order and your math can be wrong.

02:00:57.765 --> 02:00:59.890
You can lose track of how
many likes something has.

02:00:59.890 --> 02:01:03.870
And so, for instance, as of last night
this remains the most popular Instagram

02:01:03.870 --> 02:01:06.000
post, to date, in the world.

02:01:06.000 --> 02:01:08.368
It was clicked on by so many people.

02:01:08.368 --> 02:01:10.410
And you might think that's
pretty straightforward

02:01:10.410 --> 02:01:13.230
to keep track click,
click, click, but not

02:01:13.230 --> 02:01:16.870
when there's millions of devices in
the world and thousands of servers,

02:01:16.870 --> 02:01:20.000
probably, at Meta running
Instagram's back end.

02:01:20.000 --> 02:01:23.290
So how do you actually keep
track of all of these likes?

02:01:23.290 --> 02:01:26.980
Well, maybe, Meta is
using code like this

02:01:26.980 --> 02:01:30.370
to implement the counter
for likes on Instagram.

02:01:30.370 --> 02:01:33.160
Maybe they are using lines of
code similar to what we just

02:01:33.160 --> 02:01:37.510
wrote db.execute select
the current number of likes

02:01:37.510 --> 02:01:40.750
from the posts table
where the ID of the post

02:01:40.750 --> 02:01:43.760
equals whatever the one is
that the user clicked on.

02:01:43.760 --> 02:01:48.160
So the post a moment ago, presumably,
has a unique ID, a primary key,

02:01:48.160 --> 02:01:51.160
and that number just gets plugged in
here when Meta wants to figure out,

02:01:51.160 --> 02:01:53.140
all right, someone just
clicked on this post.

02:01:53.140 --> 02:01:56.320
Let's figure out what the current
number of likes so we can add 1 to it

02:01:56.320 --> 02:01:57.890
and update the database.

02:01:57.890 --> 02:02:00.910
So this query here gives
us a temporary table

02:02:00.910 --> 02:02:04.360
containing the current number of likes
before you or someone else clicked.

02:02:04.360 --> 02:02:06.820
Maybe then we just
declare a variable called

02:02:06.820 --> 02:02:09.772
likes to get at the
first row's likes column.

02:02:09.772 --> 02:02:11.980
So this is just Python
syntax, similar to what I did,

02:02:11.980 --> 02:02:15.640
just to get the actual number you care
about, like 34 or whatever million

02:02:15.640 --> 02:02:16.480
it is here.

02:02:16.480 --> 02:02:20.650
But then suppose there's a second
database query and a third line of code

02:02:20.650 --> 02:02:26.200
that updates the posts table setting
the likes equal to this value

02:02:26.200 --> 02:02:29.030
where the ID of the post is this value.

02:02:29.030 --> 02:02:32.870
So these question marks are similar,
again, in spirit to printf's percent s.

02:02:32.870 --> 02:02:34.810
They're placeholders for
things that are going

02:02:34.810 --> 02:02:37.070
to be plugged in after the commas.

02:02:37.070 --> 02:02:38.860
So if I want to update
the number of likes

02:02:38.860 --> 02:02:41.080
to be whatever the current
number is plus one,

02:02:41.080 --> 02:02:45.100
I put it there as a second argument
and then I plug-in the ID of that post.

02:02:45.100 --> 02:02:49.420
The problem, though, with large
systems like the Metas, the Googles,

02:02:49.420 --> 02:02:51.640
the Microsofts, and
others of the world, is

02:02:51.640 --> 02:02:54.880
that they are executing code
like this on multiple servers,

02:02:54.880 --> 02:02:58.930
thousands of servers, that might
be executing slightly out of order.

02:02:58.930 --> 02:03:01.070
One might be faster,
one might be slower.

02:03:01.070 --> 02:03:04.990
Which is to say, even though
these three lines of code

02:03:04.990 --> 02:03:07.240
represent what should happen
when I click on that post

02:03:07.240 --> 02:03:09.670
and you click on that post
and you click on that post,

02:03:09.670 --> 02:03:13.600
the lines of code chronologically
might get shuffled a little bit.

02:03:13.600 --> 02:03:16.090
Like, maybe, this line of
code gets executed for me

02:03:16.090 --> 02:03:18.100
and then it gets executed
for you and then you,

02:03:18.100 --> 02:03:20.900
and then the server moves
on to the next line of code.

02:03:20.900 --> 02:03:23.890
So it's sort of multitasking,
handling lots of users at once.

02:03:23.890 --> 02:03:27.940
The problem with this is that you
run into a race condition of sorts,

02:03:27.940 --> 02:03:30.820
where the servers are sort
of racing to handle one user

02:03:30.820 --> 02:03:34.580
but other users requests are
happening at the same time.

02:03:34.580 --> 02:03:39.220
So the analogy that I was taught
years ago in an operating systems

02:03:39.220 --> 02:03:41.860
class actually pertains to
something like a refrigerator here.

02:03:41.860 --> 02:03:43.360
So we have a mini refrigerator here.

02:03:43.360 --> 02:03:45.850
Suppose you've got one in
your dorm or your house room,

02:03:45.850 --> 02:03:48.340
and you come home one day.

02:03:48.340 --> 02:03:51.860
And you really like milk, so you
open the fridge and you look inside,

02:03:51.860 --> 02:03:52.480
and there's--

02:03:52.480 --> 02:03:53.590
Oh, we're out of milk.

02:03:53.590 --> 02:03:57.160
So you close the fridge, you walk
out to CVS, or somewhere else,

02:03:57.160 --> 02:03:58.600
and go to get more milk.

02:03:58.600 --> 02:04:01.240
Meanwhile, though, your roommate
comes home, who in this story

02:04:01.240 --> 02:04:03.850
also likes milk, and
so they decide that--

02:04:03.850 --> 02:04:05.620
Oh, I'm out of milk in the fridge.

02:04:05.620 --> 02:04:07.120
So they, maybe, head out.

02:04:07.120 --> 02:04:10.660
Maybe they follow a different
path to go get more milk as well.

02:04:10.660 --> 02:04:13.330
Some number of minutes pass,
you both come home later on.

02:04:13.330 --> 02:04:14.080
Like, Oh, darn it.

02:04:14.080 --> 02:04:15.830
Now we have twice as
much milk as we need.

02:04:15.830 --> 02:04:17.247
We don't really like it that much.

02:04:17.247 --> 02:04:19.720
And some of it's going to go
sour now, so it's wasted like.

02:04:19.720 --> 02:04:20.720
We've made a mistake.

02:04:20.720 --> 02:04:22.780
We should not have bought
twice as much milk.

02:04:22.780 --> 02:04:27.220
Now, stupid story, but the
point is that both of you

02:04:27.220 --> 02:04:30.250
made decisions based on
the state of a variable.

02:04:30.250 --> 02:04:33.790
But the problem was that variable
was in the process of being

02:04:33.790 --> 02:04:36.430
updated when someone else looked at it.

02:04:36.430 --> 02:04:39.470
The first person in the story
was on their way to the store

02:04:39.470 --> 02:04:41.560
so the variable was
about to be incremented

02:04:41.560 --> 02:04:43.840
in terms of quantity of
milk, but the other person

02:04:43.840 --> 02:04:46.430
didn't know that yet so they
too tried to increment it.

02:04:46.430 --> 02:04:48.760
And, in that case, we ended
up with too much milk.

02:04:48.760 --> 02:04:52.450
But suppose what might happen
here is similar in spirit.

02:04:52.450 --> 02:04:56.560
Suppose that post, at some point
in time, had just 1,000,000 likes.

02:04:56.560 --> 02:05:00.310
And suppose this line of code got
executed for me, for you, and for you

02:05:00.310 --> 02:05:02.050
after all three of us clicked on it.

02:05:02.050 --> 02:05:05.530
Well, the value of our likes
variable in Meta servers

02:05:05.530 --> 02:05:08.020
might be 1,000,000,
1,000,000, and 1,000,000.

02:05:08.020 --> 02:05:11.470
They therefore update 1,000,000
to be 1,000,000 plus one.

02:05:11.470 --> 02:05:14.890
And so what they update the
database to be is 1,000,001,

02:05:14.890 --> 02:05:16.080
but they do it three times.

02:05:16.080 --> 02:05:18.690
1,000,001, 1,000,001, 1,000,001.

02:05:18.690 --> 02:05:21.900
But they've lost two of
those likes because they

02:05:21.900 --> 02:05:26.040
might have inspected the variable while
some other server, some other user's

02:05:26.040 --> 02:05:28.510
like, was being processed.

02:05:28.510 --> 02:05:31.350
So long story short, when
you have lots of data

02:05:31.350 --> 02:05:34.365
on lots of servers, all of
which is happening very quickly,

02:05:34.365 --> 02:05:36.240
you run into these so
called race conditions.

02:05:36.240 --> 02:05:39.750
And code like this can be
dangerous, even though it might not

02:05:39.750 --> 02:05:41.460
look incorrect at a glance.

02:05:41.460 --> 02:05:44.280
Thankfully, in the world of
SQL, though you won't generally

02:05:44.280 --> 02:05:47.700
have to do this certainly for problem
sets sake, there are solutions to this.

02:05:47.700 --> 02:05:51.210
But too many engineers in the world
don't know this, don't remember this,

02:05:51.210 --> 02:05:53.580
or don't appreciate this reality.

02:05:53.580 --> 02:05:55.920
There are keywords in
certain databases that

02:05:55.920 --> 02:05:59.610
let you instead begin a transaction,
which means, essentially,

02:05:59.610 --> 02:06:02.970
that these three lines
of code should either all

02:06:02.970 --> 02:06:05.370
happen together or not at all.

02:06:05.370 --> 02:06:06.660
They should be atomic.

02:06:06.660 --> 02:06:09.390
That is to say, they should
all happen without interruption

02:06:09.390 --> 02:06:11.340
or they just shouldn't happen at all.

02:06:11.340 --> 02:06:16.000
And that ensures that the
math does not go wrong

02:06:16.000 --> 02:06:18.820
because my like will get
counted, and then you're like,

02:06:18.820 --> 02:06:22.390
and then you're like, as opposed
to them being intermingled and lost

02:06:22.390 --> 02:06:23.630
track of accordingly.

02:06:23.630 --> 02:06:26.590
So in Python, using
the CS50 library, you

02:06:26.590 --> 02:06:30.280
could wrap these three lines of
code by saying begin transaction

02:06:30.280 --> 02:06:33.580
to the database, commit the
transaction to the database,

02:06:33.580 --> 02:06:37.867
and that relatively simple solution
avoids this problem of race conditions.

02:06:37.867 --> 02:06:38.950
But this, too, is a topic.

02:06:38.950 --> 02:06:41.020
If you Google, invariably
you'll see that this

02:06:41.020 --> 02:06:46.000
is a problem that has hit various
servers and apps over time.

02:06:46.000 --> 02:06:49.660
But there's one other problem that
the world is still not very good at,

02:06:49.660 --> 02:06:52.060
and that's known as a
SQL injection attack.

02:06:52.060 --> 02:06:55.690
And it turns out that even with what
we've been doing, even code like this

02:06:55.690 --> 02:07:00.160
here, is all too easily vulnerable
to being hacked, if you will.

02:07:00.160 --> 02:07:04.670
That is misused in some way, unless
you practice what I'm preaching,

02:07:04.670 --> 02:07:07.070
which is using placeholders like this.

02:07:07.070 --> 02:07:10.570
It turns out that it's very dangerous
to take user input in, generally.

02:07:10.570 --> 02:07:13.150
Like, most of your users
might be nice, good people,

02:07:13.150 --> 02:07:16.330
but there's always going to be
someone who's malicious or curious

02:07:16.330 --> 02:07:19.670
or just execute something you don't
expect, and things can go wrong.

02:07:19.670 --> 02:07:21.940
So in the world of SQL,
here's what can happen.

02:07:21.940 --> 02:07:24.890
For instance, here is the
login screen for Yale accounts.

02:07:24.890 --> 02:07:27.220
Here's the comparable
screen for Harvard accounts.

02:07:27.220 --> 02:07:29.803
And all of us are in their habit
of using one of these screens

02:07:29.803 --> 02:07:31.150
or something similar or another.

02:07:31.150 --> 02:07:34.450
You're often asked for a username
or email address and a password.

02:07:34.450 --> 02:07:38.950
But suppose that Harvard or Yale
or Google or Microsoft or wherever,

02:07:38.950 --> 02:07:42.280
are taking user input from forms
like this, be it on the web

02:07:42.280 --> 02:07:46.540
or on a mobile app, and they're just
plugging your input into a SQL query

02:07:46.540 --> 02:07:50.320
that they wrote in advance that just is
waiting for your username or password

02:07:50.320 --> 02:07:53.980
to be plugged in so they can ask
a complete query of the database.

02:07:53.980 --> 02:07:57.850
It turns out that if I'm a
bit malicious, or curious,

02:07:57.850 --> 02:08:01.810
I could, maybe, type in some
funky syntax to fields like this

02:08:01.810 --> 02:08:04.570
that I know have special
meaning to certain databases.

02:08:04.570 --> 02:08:09.490
And it turns out in the world of SQL,
SQLite in particular, single quotes

02:08:09.490 --> 02:08:10.570
are clearly important.

02:08:10.570 --> 02:08:12.850
Because I've been using them
all day long for strings,

02:08:12.850 --> 02:08:16.745
and I didn't mention this yet, but the
comment character in SQLite is dash,

02:08:16.745 --> 02:08:17.245
dash.

02:08:17.245 --> 02:08:20.203
If you want the rest of a line to be
ignored, you just say, dash, dash.

02:08:20.203 --> 02:08:23.605
So it's not hash, it's not slash, slash,
like in Python and C, respectively.

02:08:23.605 --> 02:08:26.210
It's dash, dash, or two hyphens.

02:08:26.210 --> 02:08:28.887
So suppose that I'm a
hacker or a curious student,

02:08:28.887 --> 02:08:30.970
and I want to see if Harvard
knows what it's doing

02:08:30.970 --> 02:08:32.680
when it comes to SQL injection attacks.

02:08:32.680 --> 02:08:36.347
I could literally type in, maybe
my username with a single quote

02:08:36.347 --> 02:08:37.180
and then dash, dash.

02:08:37.180 --> 02:08:38.870
Well, why would I do this?

02:08:38.870 --> 02:08:43.240
Well, suppose for the sake of discussion
that some developer at Harvard,

02:08:43.240 --> 02:08:46.030
or any website, really,
has written a line of code

02:08:46.030 --> 02:08:49.870
like this to check if the
username and password just

02:08:49.870 --> 02:08:52.760
typed in match what's in the database.

02:08:52.760 --> 02:08:53.890
So how might I do this?

02:08:53.890 --> 02:08:57.410
And if so, let the user log in, show
them their account, or whatever.

02:08:57.410 --> 02:08:59.800
So here's the line of code in question.

02:08:59.800 --> 02:09:05.050
Select star from users where username
equals question mark and password

02:09:05.050 --> 02:09:05.980
equals question mark.

02:09:05.980 --> 02:09:07.040
This is correct.

02:09:07.040 --> 02:09:08.650
This is green because it's good.

02:09:08.650 --> 02:09:10.960
It is not vulnerable to attack because

02:09:10.960 --> 02:09:13.090
I'm using these
placeholders here, which,

02:09:13.090 --> 02:09:15.370
even though we've implemented
in the CS50 library,

02:09:15.370 --> 02:09:18.040
most SQL libraries
support the same syntax.

02:09:18.040 --> 02:09:19.600
But it can vary by system.

02:09:19.600 --> 02:09:20.890
This is safe.

02:09:20.890 --> 02:09:23.230
What is not safe is to
use some of the stuff

02:09:23.230 --> 02:09:25.750
you learned last week where
you can just use f strings,

02:09:25.750 --> 02:09:29.680
or format your own strings and
interpolate values with curly braces.

02:09:29.680 --> 02:09:31.930
For instance, suppose
you took me at my word

02:09:31.930 --> 02:09:33.790
last week that you
can do this in Python.

02:09:33.790 --> 02:09:37.430
This code here, unfortunately,
in yellow, is not safe.

02:09:37.430 --> 02:09:39.190
Here's a format string in Python.

02:09:39.190 --> 02:09:41.140
Here's the beginning of a SQL query.

02:09:41.140 --> 02:09:43.810
Here's a placeholder
for the user username.

02:09:43.810 --> 02:09:45.730
Here's a placeholder
for the user's password.

02:09:45.730 --> 02:09:49.400
And I've proactively, correctly,
put single quotes around them.

02:09:49.400 --> 02:09:53.410
The problem is if you just
blindly plug users input

02:09:53.410 --> 02:09:58.390
into pre-made queries like this,
they can finish your thought for you,

02:09:58.390 --> 02:10:01.280
in ways you don't expect,
if you trust the users.

02:10:01.280 --> 02:10:03.910
For instance, if I plug
in for my username,

02:10:03.910 --> 02:10:07.600
mail in at harvard.edu,
single quote, dash, dash.

02:10:07.600 --> 02:10:09.200
Notice what happens.

02:10:09.200 --> 02:10:11.470
Here's the single quote from the query.

02:10:11.470 --> 02:10:13.100
Here's what I typed in.

02:10:13.100 --> 02:10:14.150
But, wait a minute.

02:10:14.150 --> 02:10:17.560
It looks like this single
quote that I typed in finishes

02:10:17.560 --> 02:10:19.810
the thought that the developer started.

02:10:19.810 --> 02:10:22.527
The dash, dash means, heck,
ignore the rest of that.

02:10:22.527 --> 02:10:25.360
And, indeed, I've lifted grayed out
because what effectively happens

02:10:25.360 --> 02:10:29.510
is that only executed is what's
in yellow here at the moment.

02:10:29.510 --> 02:10:30.010
Why?

02:10:30.010 --> 02:10:32.440
Because everything
after that close quote,

02:10:32.440 --> 02:10:35.255
which finishes the developers
thought, is just ignored.

02:10:35.255 --> 02:10:38.380
So I mean, this is literally an example
of how you can hack into a database

02:10:38.380 --> 02:10:39.910
by injecting SQL.

02:10:39.910 --> 02:10:42.220
Like dash, dash is an example of SQL.

02:10:42.220 --> 02:10:45.280
Silly as it is, it's a comment
that tells the database

02:10:45.280 --> 02:10:46.460
to ignore the password.

02:10:46.460 --> 02:10:47.450
So what does this mean?

02:10:47.450 --> 02:10:50.770
Well, of course you're going to get
rows back now because if you're only

02:10:50.770 --> 02:10:53.950
searching for someone by username, it
doesn't matter what their password is,

02:10:53.950 --> 02:10:57.460
what they typed in, you've
essentially blacked out that part

02:10:57.460 --> 02:10:59.530
and you're not even
checking the password.

02:10:59.530 --> 02:11:03.640
The effect then would be that you could
log in as me or Carter or anyone else,

02:11:03.640 --> 02:11:08.410
just by knowing their usernames if
the Harvard developer wrote code

02:11:08.410 --> 02:11:09.183
in this way.

02:11:09.183 --> 02:11:12.100
And even though I'm pretty sure
Harvard key does not suffer from this,

02:11:12.100 --> 02:11:14.590
so many darn websites have in the past.

02:11:14.590 --> 02:11:17.733
And if you Google, SQL
injection attacks-- search by

02:11:17.733 --> 02:11:20.150
and Google, for instance, the
past month or the past year,

02:11:20.150 --> 02:11:22.940
you will tragically likely
see results because humans

02:11:22.940 --> 02:11:24.650
continue to make this mistake.

02:11:24.650 --> 02:11:27.830
The solution though,
ultimately, is actually just

02:11:27.830 --> 02:11:30.710
use placeholders, just
use the library that

02:11:30.710 --> 02:11:32.990
escapes potentially dangerous input.

02:11:32.990 --> 02:11:35.600
And this looks a little
weird, but in C we

02:11:35.600 --> 02:11:38.000
saw that the escape
character was a backslash?

02:11:38.000 --> 02:11:41.690
And that made backslash n, or backslash
something else, be treated specially.

02:11:41.690 --> 02:11:45.750
Weirdly, in SQL it tends to
be just another single quote.

02:11:45.750 --> 02:11:48.450
So if you do quote, quote
that actually means--

02:11:48.450 --> 02:11:49.580
I mean, a literal quote.

02:11:49.580 --> 02:11:52.010
It's not like the empty string,
or nothing in between it.

02:11:52.010 --> 02:11:53.060
So it looks weird.

02:11:53.060 --> 02:11:56.630
But, long story short, if
you use a library like CS50,

02:11:56.630 --> 02:11:58.460
or anything in the
real world that handles

02:11:58.460 --> 02:12:01.130
escaping of user input,
the whole query you

02:12:01.130 --> 02:12:04.910
can think of as being now good and green
again because it doesn't matter what

02:12:04.910 --> 02:12:05.780
the human types in.

02:12:05.780 --> 02:12:10.680
Any scary characters will be
properly escaped by the database.

02:12:10.680 --> 02:12:12.740
So it'll depend on the
library you're using,

02:12:12.740 --> 02:12:15.200
but it almost always is
the syntax using question

02:12:15.200 --> 02:12:18.530
marks or some similar placeholder.

02:12:18.530 --> 02:12:19.130
All right.

02:12:19.130 --> 02:12:23.270
So with that said, you are
now inducted into the Hall

02:12:23.270 --> 02:12:26.020
of People who know now a little
something about databases.

02:12:26.020 --> 02:12:28.520
We've only just scratched the
surface of using the language,

02:12:28.520 --> 02:12:31.340
but it's now something we'll use to
build up more and more interesting

02:12:31.340 --> 02:12:33.673
applications, especially for
final projects when we soon

02:12:33.673 --> 02:12:37.160
transition to web programming or mobile
app development, if you go that route.

02:12:37.160 --> 02:12:41.540
But you'll soon be able to speak a
language, literally and figuratively,

02:12:41.540 --> 02:12:44.310
that those before you
have acquired as well.

02:12:44.310 --> 02:12:48.170
So you are now qualified to
understand this sort of internet joke.

02:12:48.170 --> 02:12:51.020
This is someone who, if
I Zoom in, was trying

02:12:51.020 --> 02:12:55.820
to get out of paying some camera based
tolls by tricking the state or the city

02:12:55.820 --> 02:12:58.130
into deleting or dropping
their whole database.

02:12:58.130 --> 02:13:01.100
Drop means delete the whole
thing, not just the rows therein.

02:13:01.100 --> 02:13:02.270
OK, so maybe not that funny.

02:13:02.270 --> 02:13:06.910
But this is now the note will end
on, similar to other xkcd comics

02:13:06.910 --> 02:13:07.910
we've introduced you to.

02:13:07.910 --> 02:13:11.150
Every CS person out there
has seen, knows this comic.

02:13:11.150 --> 02:13:14.150
So if you ever refer to, with a wink--

02:13:14.150 --> 02:13:17.090
if you ever refer to little
Bobby tables with a wink,

02:13:17.090 --> 02:13:20.795
if it's a computer scientist on the
other end, they'll know whom you mean.

02:13:23.918 --> 02:13:25.330
[LAUGHING]

02:13:25.330 --> 02:13:26.260
OK, there we go.

02:13:29.690 --> 02:13:30.230
All right.

02:13:30.230 --> 02:13:30.740
Tough crowd.

02:13:30.740 --> 02:13:33.990
All right, Batman, come on down for your
cookies, and we'll see you next time.

02:13:33.990 --> 02:13:34.520
[APPLAUSE]

02:13:34.520 --> 02:13:37.570
[MUSIC PLAYING]