WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:03.493
[MUSIC PLAYING]

00:00:49.357 --> 00:00:50.440
DAVID J. MALAN: All right.

00:00:50.440 --> 00:00:53.260
This is CS50, and this is week 7.

00:00:53.260 --> 00:00:56.207
And today's focus is going
to be entirely on data--

00:00:56.207 --> 00:00:58.540
the process of collecting it,
the process of storing it,

00:00:58.540 --> 00:01:00.610
the process of searching
it, and so much more.

00:01:00.610 --> 00:01:03.280
You'll recall that last week we
started off by playing around

00:01:03.280 --> 00:01:04.750
with the relatively small data set.

00:01:04.750 --> 00:01:08.630
We asked everyone for what their
preferred house at Hogwarts might be.

00:01:08.630 --> 00:01:12.400
And then we proceeded to analyze that
data a little bit using some Python

00:01:12.400 --> 00:01:15.730
and counting up how many people wanted
Gryffindor or Slytherin or the others,

00:01:15.730 --> 00:01:16.423
as well.

00:01:16.423 --> 00:01:19.090
And we ultimately did that by
using a Google form to collect it.

00:01:19.090 --> 00:01:21.923
And we stored all of the data in a
Google spreadsheet, which we then

00:01:21.923 --> 00:01:24.350
exported, of course, as a CSV file.

00:01:24.350 --> 00:01:26.740
So this week, we thought we'd
collect a little more data

00:01:26.740 --> 00:01:28.780
and see what kinds of
problems arise when

00:01:28.780 --> 00:01:32.350
we start using only a spreadsheet
or, in turn, a CSV file

00:01:32.350 --> 00:01:34.220
to store the data that we care about.

00:01:34.220 --> 00:01:37.930
So in fact, if you could go ahead
and go to this URL here that you see,

00:01:37.930 --> 00:01:41.230
you should see another Google
form, this one asking you

00:01:41.230 --> 00:01:42.730
some different questions.

00:01:42.730 --> 00:01:46.540
All of us probably have some preferred
TV shows, now more than ever, perhaps.

00:01:46.540 --> 00:01:49.150
And what we'd like to do
is ask everyone to input

00:01:49.150 --> 00:01:53.590
into that form their favorite
TV show followed by the genre

00:01:53.590 --> 00:01:58.190
or genres into which that
particular TV show falls.

00:01:58.190 --> 00:02:00.680
So go ahead and take
a moment to do that.

00:02:00.680 --> 00:02:03.820
And if you're unable to follow along
at home, what folks are looking at

00:02:03.820 --> 00:02:07.360
is a form quite like this one here,
whereby we're just asking them

00:02:07.360 --> 00:02:11.350
for the title of their
preferred TV show and the genre

00:02:11.350 --> 00:02:15.770
or genres of that specific TV show.

00:02:15.770 --> 00:02:16.270
All right.

00:02:16.270 --> 00:02:19.270
So let's go ahead and start to look
at some of this data that's come in.

00:02:19.270 --> 00:02:23.043
Here is the resulting Google spreadsheet
that Google Forms has created for us.

00:02:23.043 --> 00:02:25.960
And you'll notice that by default,
Google Forms, this particular tool,

00:02:25.960 --> 00:02:28.180
has three different columns,
at least for this form.

00:02:28.180 --> 00:02:30.070
One is a timestamp, and
Google automatically

00:02:30.070 --> 00:02:33.340
gives us that based on what day
and time everyone was buzzing in

00:02:33.340 --> 00:02:34.390
with the responses.

00:02:34.390 --> 00:02:38.770
Then they have a header row
beyond that for title and genres.

00:02:38.770 --> 00:02:42.100
I've manually boldfaced it in
advance just to make it stand out.

00:02:42.100 --> 00:02:45.580
But you'll notice that the
headings here, Title and Genres,

00:02:45.580 --> 00:02:48.790
perfectly matches the question
that we asked in the Google form.

00:02:48.790 --> 00:02:53.020
That allows us to therefore line up
your responses with our questions.

00:02:53.020 --> 00:02:56.410
And you can see here Punisher
was the first favorite TV

00:02:56.410 --> 00:03:00.040
show to be inputted followed by The
Office, Breaking Bad, New Girl, Archer,

00:03:00.040 --> 00:03:02.270
another Office, and so forth.

00:03:02.270 --> 00:03:04.660
And in the third column,
under Genres, you'll

00:03:04.660 --> 00:03:06.520
see that there's something curious here.

00:03:06.520 --> 00:03:08.230
While some of the cells--

00:03:08.230 --> 00:03:10.330
that is, the little boxes of text--

00:03:10.330 --> 00:03:12.970
have just single words
like "comedy" or "drama,"

00:03:12.970 --> 00:03:15.550
you'll notice that some of them
have a comma-separated list.

00:03:15.550 --> 00:03:19.150
And that comma-separated list is because
some of you checked, as you could,

00:03:19.150 --> 00:03:24.730
multiple check boxes to indicate
that Breaking Bad is a crime genre

00:03:24.730 --> 00:03:26.830
drama and also thriller.

00:03:26.830 --> 00:03:31.240
And so the way Google Forms handles
this is a bit sleazily in the sense

00:03:31.240 --> 00:03:35.350
that they just drop all of those
values as a comma-separated list

00:03:35.350 --> 00:03:37.853
inside of the spreadsheet itself.

00:03:37.853 --> 00:03:40.270
And that's potentially a problem
if we ultimately download

00:03:40.270 --> 00:03:43.570
this as a CSV file,
comma-separated values,

00:03:43.570 --> 00:03:47.500
because now you have commas
in between the commas.

00:03:47.500 --> 00:03:50.480
Fortunately, there's a solution
to that that we'll ultimately see.

00:03:50.480 --> 00:03:52.160
So we've got a good amount of data here.

00:03:52.160 --> 00:03:55.300
In fact, if I keep scrolling down,
we'll see a few hundred responses now.

00:03:55.300 --> 00:03:58.120
And it would be nice to
analyze this data in some way

00:03:58.120 --> 00:04:02.410
and figure out what the most popular
TV show is, maybe search for new shows

00:04:02.410 --> 00:04:04.143
I might like via their genre.

00:04:04.143 --> 00:04:06.310
So you can imagine some
number of queries that could

00:04:06.310 --> 00:04:08.620
be answered by way of this data set.

00:04:08.620 --> 00:04:12.340
But let's first consider the
limitations of leaving this data

00:04:12.340 --> 00:04:14.980
in just a spreadsheet like this.

00:04:14.980 --> 00:04:17.440
All of us are probably in the
habit of using occasionally

00:04:17.440 --> 00:04:22.490
Google Spreadsheets, Apple Numbers,
Microsoft Excel, or some other tool.

00:04:22.490 --> 00:04:27.220
So let's consider what spreadsheets
are good at and what they are bad at.

00:04:27.220 --> 00:04:30.370
Would anyone like to volunteer
an answer to the first of those?

00:04:30.370 --> 00:04:34.030
What is a spreadsheet
good at or good for?

00:04:34.030 --> 00:04:35.080
Yeah, Andrew?

00:04:35.080 --> 00:04:36.760
What's your thinking on spreadsheets?

00:04:36.760 --> 00:04:39.797
AUDIENCE: [INAUDIBLE]

00:04:39.797 --> 00:04:41.880
DAVID J. MALAN: OK, very
good for quickly sorting.

00:04:41.880 --> 00:04:42.300
I like that.

00:04:42.300 --> 00:04:44.758
I could click on the top of
the Title column, for instance,

00:04:44.758 --> 00:04:48.450
and immediately sort all of
those titles by alphabetically.

00:04:48.450 --> 00:04:49.140
I like that.

00:04:49.140 --> 00:04:53.370
Other reasons to use a spreadsheet--
what problems do they solve?

00:04:53.370 --> 00:04:55.050
What are they good at?

00:04:55.050 --> 00:04:56.670
Other thoughts on spreadsheets.

00:04:56.670 --> 00:04:58.530
Yeah, how about Peter?

00:04:58.530 --> 00:05:01.947
AUDIENCE: Storing large amounts of
data that you can later analyze.

00:05:01.947 --> 00:05:03.780
DAVID J. MALAN: OK, so
storing large amounts

00:05:03.780 --> 00:05:05.700
of data that you can later analyze.

00:05:05.700 --> 00:05:09.210
It's kind of a nice model for storing
lots of rows of data, so to speak.

00:05:09.210 --> 00:05:11.310
I will say that there
actually is a limit.

00:05:11.310 --> 00:05:13.890
And in fact, back in the day,
I learned what this limit is.

00:05:13.890 --> 00:05:16.515
Long story short, in graduate
school, I was using a spreadsheet

00:05:16.515 --> 00:05:17.940
to analyze some research data.

00:05:17.940 --> 00:05:23.370
And at one point, I had more data
than Excel supported rows for.

00:05:23.370 --> 00:05:28.110
Specifically, I had
some 65,536 rows, which

00:05:28.110 --> 00:05:30.360
was too many at that point
for Excel at the time,

00:05:30.360 --> 00:05:33.870
because, long story short, if you
recall from a spreadsheet program

00:05:33.870 --> 00:05:37.328
like Google Spreadsheets, every
row is numbered from 1 on up.

00:05:37.328 --> 00:05:39.120
Well, unfortunately,
at the time, Microsoft

00:05:39.120 --> 00:05:43.170
had used a 16-bit integer,
16 bits or 2 bytes,

00:05:43.170 --> 00:05:45.150
to represent each of those numbers.

00:05:45.150 --> 00:05:49.320
And it turns out the 2 to the
16th power is roughly 65,000.

00:05:49.320 --> 00:05:52.000
So at that point, I maxed
out the total number of rows.

00:05:52.000 --> 00:05:54.828
Now, to Peter's point, they've
increased that in recent years.

00:05:54.828 --> 00:05:56.620
And you can actually
store a lot more data.

00:05:56.620 --> 00:05:58.870
So spreadsheets are indeed good at that.

00:05:58.870 --> 00:06:02.580
But they're not necessarily good at
everything, because at some point,

00:06:02.580 --> 00:06:05.310
you're going to have more data
potentially in a spreadsheet

00:06:05.310 --> 00:06:07.860
than your Mac or PC can handle.

00:06:07.860 --> 00:06:10.860
In fact, if you're actually trying
to build an application, whether it's

00:06:10.860 --> 00:06:14.490
Twitter, Instagram, or Facebook
or anything of that scale,

00:06:14.490 --> 00:06:17.490
those companies are certainly not
storing their data, suffice it to say,

00:06:17.490 --> 00:06:20.700
in a spreadsheet, because there would
just be way too much data to use.

00:06:20.700 --> 00:06:23.050
And no one could literally
open it on their computer.

00:06:23.050 --> 00:06:25.830
So we'll need a solution
to that problem of scale.

00:06:25.830 --> 00:06:29.950
But I don't think we need to throw out
what works well about spreadsheets.

00:06:29.950 --> 00:06:33.510
So you can store indeed a
lot of data in row form.

00:06:33.510 --> 00:06:36.930
But it would seem that you can also
store a lot of data in column form.

00:06:36.930 --> 00:06:39.567
And even though I'm only
showing columns A, B, and C,

00:06:39.567 --> 00:06:41.400
of course, you've
probably used spreadsheets

00:06:41.400 --> 00:06:42.570
where you add more columns--

00:06:42.570 --> 00:06:44.680
D, E, F, and so forth.

00:06:44.680 --> 00:06:48.540
So what's the right mental model
for how to think about rows

00:06:48.540 --> 00:06:51.540
versus columns in a spreadsheet?

00:06:51.540 --> 00:06:57.840
I feel like we probably use them in a
somewhat different way conceptually.

00:06:57.840 --> 00:07:00.550
We might think about them
a little differently.

00:07:00.550 --> 00:07:04.440
What's the difference between
rows and columns in a spreadsheet?

00:07:04.440 --> 00:07:06.570
Sofia.

00:07:06.570 --> 00:07:07.890
AUDIENCE: Adding more entries.

00:07:07.890 --> 00:07:09.420
Adding more data is--

00:07:09.420 --> 00:07:12.720
those are within the rows, but then the
actual attributes or characteristics

00:07:12.720 --> 00:07:14.240
of the data should be in columns.

00:07:14.240 --> 00:07:15.240
DAVID J. MALAN: Exactly.

00:07:15.240 --> 00:07:17.220
When you add more data
to the spreadsheet,

00:07:17.220 --> 00:07:19.620
you should really be
adding to the bottom of it,

00:07:19.620 --> 00:07:21.310
adding more and more rows.

00:07:21.310 --> 00:07:24.300
So these things sort of grow
vertically, even though of course that's

00:07:24.300 --> 00:07:25.920
just a human's perception of it.

00:07:25.920 --> 00:07:28.740
They grow from top to bottom
by adding more and more rows.

00:07:28.740 --> 00:07:31.560
But to Sofia's point,
your columns represent

00:07:31.560 --> 00:07:37.920
what we might call attributes or fields
or any other such characteristic that

00:07:37.920 --> 00:07:40.030
is a type of data that you're storing.

00:07:40.030 --> 00:07:42.930
So in this case of our form,
Timestamp is the first column.

00:07:42.930 --> 00:07:44.460
Title is the second column.

00:07:44.460 --> 00:07:45.930
Genres is the third column.

00:07:45.930 --> 00:07:49.980
And those columns can indeed be thought
of as fields or attributes, properties

00:07:49.980 --> 00:07:50.697
of your data.

00:07:50.697 --> 00:07:54.030
And those are properties that you should
really decide on in advance when you're

00:07:54.030 --> 00:07:56.970
first creating the form, in our case,
or when you're manually creating

00:07:56.970 --> 00:07:59.430
the spreadsheet in another case.

00:07:59.430 --> 00:08:01.320
You should not really
be in the habit, when

00:08:01.320 --> 00:08:05.430
using spreadsheets, of
adding data from left

00:08:05.430 --> 00:08:08.370
to right, adding more and
more columns, unless you

00:08:08.370 --> 00:08:11.740
decide to collect more types of data.

00:08:11.740 --> 00:08:15.873
So just because someone adds a new
favorite TV show to your data set,

00:08:15.873 --> 00:08:18.540
you shouldn't be adding that from
left to right in a new column.

00:08:18.540 --> 00:08:21.040
You should indeed be adding
it from top to bottom.

00:08:21.040 --> 00:08:24.780
But suppose that we actually decided to
collect more information from everyone.

00:08:24.780 --> 00:08:28.650
Maybe that form had instead asked you
for your name or your email address

00:08:28.650 --> 00:08:30.120
or any other questions.

00:08:30.120 --> 00:08:34.480
Those properties or attributes or
fields would belong as new columns.

00:08:34.480 --> 00:08:38.309
So this is to say we generally
decide on the layout of our data,

00:08:38.309 --> 00:08:41.130
the schema of our data, in advance.

00:08:41.130 --> 00:08:45.420
And then from there on out, we proceed
to add, add, add more rows, not

00:08:45.420 --> 00:08:47.670
columns, unless we change
our mind and need to change

00:08:47.670 --> 00:08:50.230
the schema of our particular data.

00:08:50.230 --> 00:08:53.850
So it turns out that spreadsheets
are indeed wonderfully useful,

00:08:53.850 --> 00:08:56.700
to Peter's point, for
large or reasonably large

00:08:56.700 --> 00:08:59.610
data sets that we might collect.

00:08:59.610 --> 00:09:04.500
And we can, of course, per last week,
export those data sets as CSV files.

00:09:04.500 --> 00:09:07.290
And so we can go from a
spreadsheet to a simple text

00:09:07.290 --> 00:09:11.370
file stored in ASCII or Unicode, more
generally, on your own hard drive

00:09:11.370 --> 00:09:12.660
or somewhere in the cloud.

00:09:12.660 --> 00:09:16.350
And you can actually think
of that file, that .CSV file,

00:09:16.350 --> 00:09:19.540
as what we might call
a flat-file database.

00:09:19.540 --> 00:09:22.980
A database is, generally
speaking, a file that stores data.

00:09:22.980 --> 00:09:25.997
Or it's a program that
stores data for you.

00:09:25.997 --> 00:09:29.080
And all of us have probably thought
about or used databases in some sense.

00:09:29.080 --> 00:09:31.560
You're probably familiar
with the fact that all

00:09:31.560 --> 00:09:35.310
of those same big websites, Google and
Twitter and Facebook and others, use

00:09:35.310 --> 00:09:37.017
databases to store our data.

00:09:37.017 --> 00:09:38.850
Well, those databases
are either just really

00:09:38.850 --> 00:09:42.120
big files containing lots
of data or special programs

00:09:42.120 --> 00:09:44.130
that are storing our data for us.

00:09:44.130 --> 00:09:46.350
And a flat file is just
referring to the fact

00:09:46.350 --> 00:09:48.580
that it really is a very simple design.

00:09:48.580 --> 00:09:51.510
In fact, years ago,
decades ago, humans decided

00:09:51.510 --> 00:09:54.780
when storing data in simple
text files that if you

00:09:54.780 --> 00:09:57.540
want to store different types
of data, like, to Sofia's point,

00:09:57.540 --> 00:10:00.340
different properties or attributes,
well, let's keep it simple.

00:10:00.340 --> 00:10:03.780
Let's just separate
those columns with commas

00:10:03.780 --> 00:10:06.450
in our flat-file database, a.k.a.

00:10:06.450 --> 00:10:07.118
a CSV.

00:10:07.118 --> 00:10:08.160
You can use other things.

00:10:08.160 --> 00:10:09.430
You can use tabs.

00:10:09.430 --> 00:10:12.570
There's things called TSVs,
for Tab-Separated Values.

00:10:12.570 --> 00:10:14.760
And frankly, you can
use anything you want.

00:10:14.760 --> 00:10:16.050
But there is a corner case.

00:10:16.050 --> 00:10:17.980
And we've already seen a preview of it.

00:10:17.980 --> 00:10:21.190
What if your actual
data has a comma in it?

00:10:21.190 --> 00:10:23.820
What if the title of your
favorite TV show has a comma?

00:10:23.820 --> 00:10:27.660
What if Google is presuming to store
genres as a comma-separated list?

00:10:27.660 --> 00:10:32.340
Bad things can happen if using a
CSV as your flat-file database.

00:10:32.340 --> 00:10:33.760
But there are solutions to that.

00:10:33.760 --> 00:10:35.580
And in fact, what the
world typically does

00:10:35.580 --> 00:10:39.640
is whenever you have commas
inside of your CSV file,

00:10:39.640 --> 00:10:42.300
you just make sure that
the whole string is double

00:10:42.300 --> 00:10:44.460
quoted on the far left and far right.

00:10:44.460 --> 00:10:46.860
And anything inside of
double quotes is not

00:10:46.860 --> 00:10:50.790
mistaken thereafter as
delineating a column

00:10:50.790 --> 00:10:53.220
as the other commas in the file might.

00:10:53.220 --> 00:10:55.590
So that's all that's meant
by a flat-file database.

00:10:55.590 --> 00:10:58.860
And CSV is perhaps one of the most
common, the most common, formats

00:10:58.860 --> 00:11:01.240
thereof, if only because
all of these programs,

00:11:01.240 --> 00:11:03.420
like Google Spreadsheets
and Excel and Numbers,

00:11:03.420 --> 00:11:07.137
allow you to save your files as CSVs.

00:11:07.137 --> 00:11:08.970
Now, long story short,
those of you who have

00:11:08.970 --> 00:11:12.570
used fancier features of spreadsheets
like built-in functions and formulas

00:11:12.570 --> 00:11:14.850
and those kinds of
things, those are built in

00:11:14.850 --> 00:11:19.120
and proprietary to Google
Spreadsheets and Excel and Numbers.

00:11:19.120 --> 00:11:24.900
You cannot use formulas in a CSV file or
a TSV file or in a flat-file database,

00:11:24.900 --> 00:11:25.870
more generally.

00:11:25.870 --> 00:11:27.990
You can only store static--

00:11:27.990 --> 00:11:30.090
that is, unchanging-- values.

00:11:30.090 --> 00:11:33.490
So when you export the data,
what you see is what you get.

00:11:33.490 --> 00:11:35.242
And that's why people
use fancier programs

00:11:35.242 --> 00:11:37.200
like Excel and Numbers
and Google Spreadsheets,

00:11:37.200 --> 00:11:38.658
because you get more functionality.

00:11:38.658 --> 00:11:41.100
But if you want to export
the data, you can only

00:11:41.100 --> 00:11:44.190
get indeed the raw
textual data out of it.

00:11:44.190 --> 00:11:45.690
But I daresay that's going to be OK.

00:11:45.690 --> 00:11:47.398
In fact, Brian, do
you mind if I go ahead

00:11:47.398 --> 00:11:50.160
and download this spreadsheet
as a CSV file now?

00:11:50.160 --> 00:11:51.510
BRIAN YU: Yep, go ahead.

00:11:51.510 --> 00:11:51.810
DAVID J. MALAN: All right.

00:11:51.810 --> 00:11:54.890
I'm going to go ahead in Google
Spreadsheets and go to File, Download.

00:11:54.890 --> 00:11:56.640
And you can see a whole
bunch of options--

00:11:56.640 --> 00:12:01.850
PDF, Web Page, Comma-Separated
Values, which is the one I want.

00:12:01.850 --> 00:12:04.320
So I'm going to indeed
go ahead and choose CSV

00:12:04.320 --> 00:12:06.510
from this dropdown in spreadsheets.

00:12:06.510 --> 00:12:08.410
That, of course, downloaded
that file for me.

00:12:08.410 --> 00:12:11.077
And now I'm going to go ahead and
go into our familiar CS50 IDE.

00:12:11.077 --> 00:12:14.600
You'll recall that last week I was
able to upload a file into the IDE.

00:12:14.600 --> 00:12:17.350
And I'm going to go ahead and do
the same here this week, as well.

00:12:17.350 --> 00:12:20.730
I'm going to go ahead and grab my
file, which ended up in my Downloads

00:12:20.730 --> 00:12:22.830
folder on my particular computer here.

00:12:22.830 --> 00:12:27.840
And I'm going to go ahead and
drag and drop this into the IDE

00:12:27.840 --> 00:12:31.790
such that it ends up in my
home directory, so to speak.

00:12:31.790 --> 00:12:34.410
So now I have this file,
Favorite TV Shows Forms.

00:12:34.410 --> 00:12:36.750
And in fact, if I double
click this within the IDE,

00:12:36.750 --> 00:12:38.880
you'll see familiar data now.

00:12:38.880 --> 00:12:42.950
Timestamp comma title comma
genres is our header row

00:12:42.950 --> 00:12:46.830
that contains the names of the
properties or attributes in this file.

00:12:46.830 --> 00:12:51.390
Then we've got our timestamps
comma favorite title comma and then

00:12:51.390 --> 00:12:53.310
a comma-separated list of genres.

00:12:53.310 --> 00:12:56.100
And here indeed, notice
that Google took care

00:12:56.100 --> 00:13:00.030
to use double quotes around any
values that themselves had commas.

00:13:00.030 --> 00:13:02.130
So it's a relatively simple file format.

00:13:02.130 --> 00:13:04.560
And I could certainly just
kind of skim through this,

00:13:04.560 --> 00:13:07.920
figuring out who likes The Office, who
likes Breaking Bad, or other shows.

00:13:07.920 --> 00:13:11.040
But per last week, we now have a
pretty useful programming language

00:13:11.040 --> 00:13:14.220
at our disposal, Python, that could
allow us to start manipulating

00:13:14.220 --> 00:13:16.860
and analyzing this data more readily.

00:13:16.860 --> 00:13:20.100
And here to my point last week about
using the right tool for the job,

00:13:20.100 --> 00:13:24.860
you could absolutely do everything we're
about to do in all weeks prior of CS50.

00:13:24.860 --> 00:13:27.720
We could have used C for
what we're about to do.

00:13:27.720 --> 00:13:31.350
But as you can probably glean, C tends
to be painful for certain things,

00:13:31.350 --> 00:13:34.290
like anything involving
string manipulation,

00:13:34.290 --> 00:13:36.660
changing strings, analyzing strings.

00:13:36.660 --> 00:13:38.290
It's just a real pain, right?

00:13:38.290 --> 00:13:42.330
God forbid you had to take this CSV
file and load it all into memory, not

00:13:42.330 --> 00:13:43.470
unlike your spell checker.

00:13:43.470 --> 00:13:46.950
You would have to be using malloc all
over the place or realloc or the like.

00:13:46.950 --> 00:13:50.640
There's just a lot of heavy lifting
involved in just analyzing a text file.

00:13:50.640 --> 00:13:53.760
So Python does all of that
for us by just giving us

00:13:53.760 --> 00:13:56.130
more functions at our
disposal with which

00:13:56.130 --> 00:13:59.470
to start analyzing and opening data.

00:13:59.470 --> 00:14:01.570
So let me go ahead and close this file.

00:14:01.570 --> 00:14:05.082
And let me go ahead and create
a new one called favorites.py,

00:14:05.082 --> 00:14:07.290
wherein I'm going to start
playing with this data set

00:14:07.290 --> 00:14:09.900
and see if we can't start
answering some questions about it.

00:14:09.900 --> 00:14:12.570
And frankly, to this day,
20-plus years after learning how

00:14:12.570 --> 00:14:14.670
to program for the
first time, I myself am

00:14:14.670 --> 00:14:18.000
very much in the habit when writing
a new program of just starting simple

00:14:18.000 --> 00:14:22.320
and not solving the problem I ultimately
want to but something simpler just

00:14:22.320 --> 00:14:24.270
as a sort of proof of
concept to make sure

00:14:24.270 --> 00:14:26.510
I have the right plumbing in place.

00:14:26.510 --> 00:14:27.510
So by that, I mean this.

00:14:27.510 --> 00:14:32.550
Let's go ahead and write a quick program
that simply opens up this file, the CSV

00:14:32.550 --> 00:14:37.120
file, iterates over it top to bottom,
and just prints out each of the titles,

00:14:37.120 --> 00:14:39.430
just as a quick sanity check
that I know what I'm doing

00:14:39.430 --> 00:14:41.460
and I have access to the data therein.

00:14:41.460 --> 00:14:43.740
So let me go ahead and import CSV.

00:14:43.740 --> 00:14:45.840
And then I can do this
in a few different ways.

00:14:45.840 --> 00:14:48.030
But by now, you've
probably seen or remembered

00:14:48.030 --> 00:14:50.490
my using something like
the open command and the

00:14:50.490 --> 00:14:55.260
with keyword to open and eventually
automatically close this file for me.

00:14:55.260 --> 00:14:59.710
This file is called Favorite TV
Shows - Form Responses 1.csv.

00:15:02.400 --> 00:15:04.560
And I'm going to open
this up in read mode.

00:15:04.560 --> 00:15:07.000
Strictly speaking,
the r is not required.

00:15:07.000 --> 00:15:09.330
You might see examples
online not including it.

00:15:09.330 --> 00:15:13.140
That's because read is the default.
But for parity with C and fopen,

00:15:13.140 --> 00:15:15.900
I'm going to be explicit
and actually do "r."

00:15:15.900 --> 00:15:18.670
And I'm going to go ahead and
give this a variable name of file.

00:15:18.670 --> 00:15:23.820
So this line 3 here has the effect of
opening that CSV file in read-only mode

00:15:23.820 --> 00:15:27.532
and creating a variable called
file via which I can reference it.

00:15:27.532 --> 00:15:30.240
Now I'm going to go ahead and use
some of that CSV functionality.

00:15:30.240 --> 00:15:32.790
I'm going to give myself what
we keep calling a reader, which

00:15:32.790 --> 00:15:34.650
I could call it xyz, anything else.

00:15:34.650 --> 00:15:37.740
But "reader" kind of describes
what this variable is going to do.

00:15:37.740 --> 00:15:42.930
And it's going to be the return value
of calling csv.reader on that file.

00:15:42.930 --> 00:15:46.740
And so essentially, the
CSV library, per last week,

00:15:46.740 --> 00:15:48.360
has a lot of fancy features built in.

00:15:48.360 --> 00:15:52.470
And all it needs as input is
an already opened text file.

00:15:52.470 --> 00:15:55.120
And then it will then wrap
that file, so to speak,

00:15:55.120 --> 00:15:57.270
with a whole bunch of
more useful functionality,

00:15:57.270 --> 00:16:01.750
like the ability to read it
column and row at a time.

00:16:01.750 --> 00:16:02.250
All right.

00:16:02.250 --> 00:16:05.170
Now I'm going to go ahead and,
you know what, just for now,

00:16:05.170 --> 00:16:08.190
I'm going to skip the first row.

00:16:08.190 --> 00:16:11.310
I'm going to skip the first row,
because the first row has my headings--

00:16:11.310 --> 00:16:13.530
Timestamp, Title, and Genres.

00:16:13.530 --> 00:16:17.692
And I know what my columns are, so I'm
just going to ignore that line for now.

00:16:17.692 --> 00:16:18.900
And now I'm going to do this.

00:16:18.900 --> 00:16:24.570
For row in reader, let me go ahead
and print out, quite simply, row.

00:16:24.570 --> 00:16:28.890
And I only want title, so I think if
it's three columns from left to right,

00:16:28.890 --> 00:16:30.330
it's 0, 1, 2.

00:16:30.330 --> 00:16:33.480
So I want to print out
column bracket 1, which

00:16:33.480 --> 00:16:35.680
is going to be the second
column zero indexed.

00:16:35.680 --> 00:16:36.180
All right.

00:16:36.180 --> 00:16:39.240
Let me go ahead and save that,
go down to my terminal window,

00:16:39.240 --> 00:16:42.850
and run python of favorites.py
and cross my fingers.

00:16:42.850 --> 00:16:43.500
OK.

00:16:43.500 --> 00:16:44.920
Voila.

00:16:44.920 --> 00:16:46.740
It flew by super fast.

00:16:46.740 --> 00:16:49.530
But it looks like, indeed,
these are all of the TV

00:16:49.530 --> 00:16:51.150
shows that folks have inputted.

00:16:51.150 --> 00:16:53.370
Indeed, there's a few hundred
if I keep scrolling up.

00:16:53.370 --> 00:16:55.740
So it looks like my program is working.

00:16:55.740 --> 00:16:57.850
But let's improve it just a little bit.

00:16:57.850 --> 00:17:02.490
It turns out that using the
csv.reader isn't necessarily

00:17:02.490 --> 00:17:04.050
the best approach in Python.

00:17:04.050 --> 00:17:07.589
Many of you have already discovered
a DictReader, a dictionary reader,

00:17:07.589 --> 00:17:10.740
which is nice, because then you don't
have to know or keep double checking

00:17:10.740 --> 00:17:13.230
what number column your data is in.

00:17:13.230 --> 00:17:17.520
You can instead refer it to by
the header itself, so by "title"

00:17:17.520 --> 00:17:18.660
or by "genres."

00:17:18.660 --> 00:17:21.052
This is also good, because
if you or maybe a colleague

00:17:21.052 --> 00:17:23.010
are sort of messing around
with the spreadsheet

00:17:23.010 --> 00:17:26.339
and they rearrange the columns
by dragging them left or right,

00:17:26.339 --> 00:17:30.120
any numbers you have used
in your code, 0, 1, 2 on up,

00:17:30.120 --> 00:17:34.390
could suddenly be incorrect if your
colleague has reordered those columns.

00:17:34.390 --> 00:17:37.590
So using a dictionary reader tends to
be a little more robust, because it

00:17:37.590 --> 00:17:40.480
uses the titles, not the mere numbers.

00:17:40.480 --> 00:17:43.230
It's still fallible if someone,
yourself or someone else,

00:17:43.230 --> 00:17:47.978
changes the values in that very first
row and renames titles or genres.

00:17:47.978 --> 00:17:49.270
Then things are going to break.

00:17:49.270 --> 00:17:51.270
But at that point, we
kind of have to blame you

00:17:51.270 --> 00:17:53.730
for not having kept track of
your code versus your data.

00:17:53.730 --> 00:17:55.020
But still a risk.

00:17:55.020 --> 00:17:58.445
So I'm going to change this to
dictionary reader or DictReader here.

00:17:58.445 --> 00:18:00.570
And pretty much the rest
of my code can be the same

00:18:00.570 --> 00:18:02.970
except I don't need this
hack here on line 5.

00:18:02.970 --> 00:18:06.750
I don't need to just skip over
to the next row from the get-go,

00:18:06.750 --> 00:18:10.890
because I now want the dictionary
reader to handle the process of reading

00:18:10.890 --> 00:18:11.985
that first row for me.

00:18:11.985 --> 00:18:13.860
But otherwise, everything
else stays the same

00:18:13.860 --> 00:18:15.693
except for this last
line, where now I think

00:18:15.693 --> 00:18:21.300
I can now use row as a
dictionary, not as a list per se,

00:18:21.300 --> 00:18:24.880
and print out specifically
the title from each given row.

00:18:24.880 --> 00:18:27.690
So let me go ahead and run
python of favorites.py again.

00:18:27.690 --> 00:18:31.480
And voila, it looks like I got the
same result, several hundred of them.

00:18:31.480 --> 00:18:34.260
But let me stipulate that it's
doing the same thing if we actually

00:18:34.260 --> 00:18:36.490
compared both of those side-by-side.

00:18:36.490 --> 00:18:36.990
All right.

00:18:36.990 --> 00:18:39.180
Before I forge ahead now
to actually augment this

00:18:39.180 --> 00:18:44.610
with new functionality, any questions
or confusion on this Python script

00:18:44.610 --> 00:18:49.530
we just wrote to open a file, wrap
it with a reader or DictReader,

00:18:49.530 --> 00:18:54.510
and then iterate over the rows one
at a time, printing the titles?

00:18:54.510 --> 00:18:56.510
Any questions, confusion
on syntax at all?

00:18:56.510 --> 00:18:57.010
It's OK.

00:18:57.010 --> 00:18:59.370
We've only known or
seen Python for a week.

00:18:59.370 --> 00:19:01.380
It's fine if it's still quite new.

00:19:01.380 --> 00:19:04.115
Anything, Brian, we should address?

00:19:04.115 --> 00:19:04.740
BRIAN YU: Yeah.

00:19:04.740 --> 00:19:08.800
So why is it that you don't need
to close the file using the syntax

00:19:08.800 --> 00:19:10.190
that you're using right here?

00:19:10.190 --> 00:19:11.732
DAVID J. MALAN: Really good question.

00:19:11.732 --> 00:19:15.250
Last week, I more pedantically
used open on its own.

00:19:15.250 --> 00:19:19.210
And then I later used a close function
that was associated with the file

00:19:19.210 --> 00:19:20.470
that I had just opened.

00:19:20.470 --> 00:19:23.800
Now, the more Pythonic way
to do things, if you will,

00:19:23.800 --> 00:19:27.370
is actually to use this with
keyword, which didn't exist in C.

00:19:27.370 --> 00:19:29.830
And it just tends to be a
useful feature in Python

00:19:29.830 --> 00:19:35.470
whereby if you say with open, dot dot
dot, it will open the file for you.

00:19:35.470 --> 00:19:39.280
Then it will remain open so long
as your code is indented inside

00:19:39.280 --> 00:19:41.410
of that with keywords block.

00:19:41.410 --> 00:19:43.780
And as soon as you get to
the end of your program,

00:19:43.780 --> 00:19:45.732
it will automatically be closed for you.

00:19:45.732 --> 00:19:48.190
So this is one of these features
where Python in some sense

00:19:48.190 --> 00:19:50.770
is trying to protect us from ourselves.

00:19:50.770 --> 00:19:52.900
It's probably pretty
common for humans, myself

00:19:52.900 --> 00:19:55.000
included, to forget to close your file.

00:19:55.000 --> 00:19:57.580
That can create problems with
saving things permanently.

00:19:57.580 --> 00:19:59.990
It can create memory
leaks, as we know from C.

00:19:59.990 --> 00:20:02.740
So the with keyword just assumes
that I'm not going to be an idiot

00:20:02.740 --> 00:20:04.150
and forget to close the file.

00:20:04.150 --> 00:20:08.050
Python is going to do
it for me automatically.

00:20:08.050 --> 00:20:10.870
Other questions or confusions, Brian?

00:20:10.870 --> 00:20:13.690
BRIAN YU: How does
DictReader know that Title

00:20:13.690 --> 00:20:16.270
is the name of the key
inside of the dictionary?

00:20:16.270 --> 00:20:18.020
DAVID J. MALAN: Really
good question, too.

00:20:18.020 --> 00:20:22.090
So it is designed by the
authors of the Python language

00:20:22.090 --> 00:20:25.450
to look at the very
first row in the file,

00:20:25.450 --> 00:20:29.380
split it on the commas
in that very first row,

00:20:29.380 --> 00:20:34.090
and just assume that the first word
or phrase before the first comma

00:20:34.090 --> 00:20:37.270
is the name of the first
column, that the second word

00:20:37.270 --> 00:20:42.470
or phrase after the first comma
is the name of the second column,

00:20:42.470 --> 00:20:43.310
and so forth.

00:20:43.310 --> 00:20:47.500
So a DictReader just presumes,
as is the convention with CSVs,

00:20:47.500 --> 00:20:51.280
that your first row is going to
contain the headings that you

00:20:51.280 --> 00:20:53.290
want to use to refer to those columns.

00:20:53.290 --> 00:20:56.860
If your CSV happens not to have
such a heading whereby it just

00:20:56.860 --> 00:20:59.050
jumps right in on the
first row to real data,

00:20:59.050 --> 00:21:02.140
then you're not going to be able to
use a DictReader correctly, at least

00:21:02.140 --> 00:21:04.670
not without some manual configuration.

00:21:04.670 --> 00:21:05.170
All right.

00:21:05.170 --> 00:21:06.840
So let's go ahead and--

00:21:06.840 --> 00:21:08.590
now I feel like there's
a whole mess here.

00:21:08.590 --> 00:21:10.562
And some of these shows
are pretty popular.

00:21:10.562 --> 00:21:13.270
And as I'm glancing over this, I
definitely see some duplication.

00:21:13.270 --> 00:21:15.010
A whole bunch of you like The Office.

00:21:15.010 --> 00:21:17.530
A whole bunch of you like
Breaking Bad, Game of Thrones,

00:21:17.530 --> 00:21:19.280
and a whole bunch of
other shows, as well.

00:21:19.280 --> 00:21:21.250
So it would be nicer,
I think, if we kind of

00:21:21.250 --> 00:21:25.480
narrow the scope of our look at this
data by just looking at unique values.

00:21:25.480 --> 00:21:26.807
You're looking at unique value.

00:21:26.807 --> 00:21:29.140
So rather than just iterate
over the file top to bottom,

00:21:29.140 --> 00:21:31.690
printing out one title
after another, why

00:21:31.690 --> 00:21:34.330
don't we go ahead and sort of
accumulate all of this data

00:21:34.330 --> 00:21:38.800
in some kind of data structure so that
we can throw away duplicate values

00:21:38.800 --> 00:21:42.910
and then only print out the unique
titles that we've accumulated?

00:21:42.910 --> 00:21:44.630
So I bet we can do this in a few ways.

00:21:44.630 --> 00:21:47.650
But if we think back to last week's
demonstration of our dictionary,

00:21:47.650 --> 00:21:50.388
you'll recall that I used
what was called a set.

00:21:50.388 --> 00:21:52.930
And I'm going to go ahead and
create a variable called titles

00:21:52.930 --> 00:21:55.180
and set it equal to
something called set.

00:21:55.180 --> 00:21:57.310
And a set is just a
collection of values.

00:21:57.310 --> 00:21:58.540
It's kind of like a list.

00:21:58.540 --> 00:22:00.580
But it eliminates duplicates for me.

00:22:00.580 --> 00:22:02.860
And that would seem to be
exactly the characteristic

00:22:02.860 --> 00:22:04.870
that I want for this program.

00:22:04.870 --> 00:22:08.560
Now, instead of printing each
title, which is now premature

00:22:08.560 --> 00:22:10.480
if I want to first
filter out duplicates,

00:22:10.480 --> 00:22:11.990
I'm going to go ahead and do this.

00:22:11.990 --> 00:22:17.290
I'm going to go ahead and add to the
titles set using the add function

00:22:17.290 --> 00:22:19.570
the current row's title.

00:22:19.570 --> 00:22:21.250
So again, I'm not printing it now.

00:22:21.250 --> 00:22:25.510
I'm instead adding to the title
set that particular title.

00:22:25.510 --> 00:22:27.190
And if it's there already, no big deal.

00:22:27.190 --> 00:22:29.260
The set data structure
in Python is going

00:22:29.260 --> 00:22:31.000
to throw away the duplicates for me.

00:22:31.000 --> 00:22:33.310
And it's only going to go
ahead and keep the uniques.

00:22:33.310 --> 00:22:37.330
Now, at the bottom of my file, I need
to do a little more work, admittedly.

00:22:37.330 --> 00:22:40.990
Now I have to iterate over the set to
print out only those unique titles.

00:22:40.990 --> 00:22:41.740
So let me do this.

00:22:41.740 --> 00:22:46.555
For title in titles, go
ahead and print out title.

00:22:46.555 --> 00:22:49.180
And this is where Python just
gets really user-friendly, right?

00:22:49.180 --> 00:22:53.050
You don't have to do int i get
0, i less than n, or whatever.

00:22:53.050 --> 00:22:55.540
You can just say for title in titles.

00:22:55.540 --> 00:22:59.200
And if the title's variable
is the type of data structure

00:22:59.200 --> 00:23:04.690
that you can iterate over, which it
will be if it's a list or if it's a set

00:23:04.690 --> 00:23:06.940
or even if it's a dictionary,
another data structure

00:23:06.940 --> 00:23:11.180
we saw last week in Python, the for loop
in Python will just know what to do.

00:23:11.180 --> 00:23:15.880
This will loop over all of
the titles in the titles set.

00:23:15.880 --> 00:23:18.700
So let me go ahead and save
this file and go ahead now

00:23:18.700 --> 00:23:20.920
and run python of favorites.py.

00:23:20.920 --> 00:23:25.240
And it looks like, yeah, the
list is different in some way.

00:23:25.240 --> 00:23:29.463
But I'm seeing fewer results as I
scroll up, definitely fewer than before,

00:23:29.463 --> 00:23:31.630
because my scrollbar didn't
jump nearly as far down.

00:23:31.630 --> 00:23:33.260
But honestly, this is kind of a mess.

00:23:33.260 --> 00:23:34.660
Let's go ahead and sort this.

00:23:34.660 --> 00:23:37.502
Now, in C, it would have been
kind of a pain to sort things.

00:23:37.502 --> 00:23:39.460
We'd have to whip out
the pseudocode, probably,

00:23:39.460 --> 00:23:41.460
for bubble sort, selection
sort, or, god forbid,

00:23:41.460 --> 00:23:43.270
merge sort and then
implement it ourselves.

00:23:43.270 --> 00:23:47.210
But no, with Python comes, really, the
proverbial kitchen sink of functions.

00:23:47.210 --> 00:23:49.510
So if you want to sort
this set, you know what?

00:23:49.510 --> 00:23:50.950
Just say you want it sorted.

00:23:50.950 --> 00:23:53.890
There is a function in
Python called sorted

00:23:53.890 --> 00:23:57.257
that will use one of those better
algorithms-- maybe it's merge sort.

00:23:57.257 --> 00:23:58.840
Maybe it's something called quicksort.

00:23:58.840 --> 00:24:00.350
Maybe it's something else altogether.

00:24:00.350 --> 00:24:02.380
It's not going to use a
big O of n squared sort.

00:24:02.380 --> 00:24:06.940
Someone at Python probably has spent the
time implementing a better sort for us.

00:24:06.940 --> 00:24:08.817
But it will go ahead
and sort the set for me.

00:24:08.817 --> 00:24:10.400
Now let me go ahead and do this again.

00:24:10.400 --> 00:24:13.360
Let me increase the size of my
terminal window and rerun python

00:24:13.360 --> 00:24:15.070
of favorites.py.

00:24:15.070 --> 00:24:15.640
OK.

00:24:15.640 --> 00:24:19.330
And now we have an
interesting assortment

00:24:19.330 --> 00:24:22.570
of shows that's easier for
me to wrap my mind around,

00:24:22.570 --> 00:24:25.743
because I have it now sorted here.

00:24:25.743 --> 00:24:28.660
And indeed, if I scroll all the way
up, we should see all of the shows

00:24:28.660 --> 00:24:32.257
beginning with numbers
or a period, which

00:24:32.257 --> 00:24:34.090
might have just been
someone playing around,

00:24:34.090 --> 00:24:36.290
followed by the A words,
the B words, and so forth.

00:24:36.290 --> 00:24:38.707
So now it's a little easier
to wrap our minds around this.

00:24:38.707 --> 00:24:39.760
But something's up.

00:24:39.760 --> 00:24:44.110
I feel like a lot of you like
Avatar: The Last Airbender.

00:24:44.110 --> 00:24:47.980
And yet I'm seeing it,
indeed, four different times.

00:24:47.980 --> 00:24:49.720
But I thought we were
filtering this down

00:24:49.720 --> 00:24:53.350
to uniques by using that set structure.

00:24:53.350 --> 00:24:54.340
So what's going on?

00:24:54.340 --> 00:24:56.200
And in fact, if I keep
scrolling, I'm pretty

00:24:56.200 --> 00:24:59.080
sure I saw more duplicates in here.

00:24:59.080 --> 00:25:01.840
BoJack Horseman, Breaking
Bad, Breaking Bad,

00:25:01.840 --> 00:25:07.810
Brooklyn Nine-Nine, Brooklyn Nine-Nine,
CS50 in several different flavors.

00:25:07.810 --> 00:25:10.210
And yes, it keeps going.

00:25:10.210 --> 00:25:11.110
Friends.

00:25:11.110 --> 00:25:13.050
So I see a lot of duplicate values.

00:25:13.050 --> 00:25:14.980
So what's going on?

00:25:14.980 --> 00:25:17.960
Yeah, [? Gadana? ?]

00:25:17.960 --> 00:25:22.480
AUDIENCE: Yeah, so your current
sort is case insensitive-- sorry,

00:25:22.480 --> 00:25:26.680
is case sensitive, meaning that if
someone spells avatar with capital

00:25:26.680 --> 00:25:30.800
A's in some places, then it's going
to be a different result each time.

00:25:30.800 --> 00:25:32.050
DAVID J. MALAN: Yeah, exactly.

00:25:32.050 --> 00:25:35.530
Some of you weren't quite diligent
when it came to capitalization.

00:25:35.530 --> 00:25:38.048
And so in fact, the reality
is, as [? Gadana ?] notes,

00:25:38.048 --> 00:25:39.840
that there's differences
in capitalization.

00:25:39.840 --> 00:25:41.090
Now, we've addressed this before.

00:25:41.090 --> 00:25:43.230
In fact, when you implemented
your spell checker,

00:25:43.230 --> 00:25:44.980
you had to deal with
this already when you

00:25:44.980 --> 00:25:46.780
were spell checking an arbitrary text.

00:25:46.780 --> 00:25:48.160
Some words might be capitalized.

00:25:48.160 --> 00:25:50.300
Some might be all
lowercase, all uppercase.

00:25:50.300 --> 00:25:52.810
And you wanted to tolerate
different casings.

00:25:52.810 --> 00:25:55.840
And so we probably solved this
by just forcing everything

00:25:55.840 --> 00:25:58.540
to uppercase or everything to
lowercase and doing things,

00:25:58.540 --> 00:26:00.500
therefore, case insensitively.

00:26:00.500 --> 00:26:01.750
So give me just a moment here.

00:26:01.750 --> 00:26:06.007
And I'm going to go ahead and make
a quick change to my form here.

00:26:06.007 --> 00:26:07.840
Let's go ahead and
change this in such a way

00:26:07.840 --> 00:26:11.020
that we actually force everything
to uppercase or lowercase.

00:26:11.020 --> 00:26:13.760
Doesn't really matter which, but
we need to canonicalize things,

00:26:13.760 --> 00:26:14.860
so to speak, in some way.

00:26:14.860 --> 00:26:18.790
And to canonicalize things just
means to format all of your data

00:26:18.790 --> 00:26:20.020
in some standard way.

00:26:20.020 --> 00:26:22.390
So to [? Gadana's ?] point,
let's just standardize

00:26:22.390 --> 00:26:23.920
the capitalization of things.

00:26:23.920 --> 00:26:25.600
Maybe all uppercase, all lowercase.

00:26:25.600 --> 00:26:27.260
We just need to make a judgment call.

00:26:27.260 --> 00:26:29.427
So I'm going to go ahead
and make a few tweaks here.

00:26:29.427 --> 00:26:30.670
I'm still going to use a set.

00:26:30.670 --> 00:26:33.190
I'm still going to
read the CSV as before.

00:26:33.190 --> 00:26:37.270
But instead of just adding the
title with row bracket title,

00:26:37.270 --> 00:26:40.180
I'm going to go ahead and
force it to uppercase, just

00:26:40.180 --> 00:26:42.850
arbitrarily, just for
the sake of uniformity.

00:26:42.850 --> 00:26:45.610
And then let's go ahead and check
what exactly has happened here.

00:26:45.610 --> 00:26:47.050
I'm not going to change anything else.

00:26:47.050 --> 00:26:49.383
But let me go ahead and
increase the size of my terminal

00:26:49.383 --> 00:26:52.600
window, rerun python of favorites.py.

00:26:52.600 --> 00:26:53.517
And voila.

00:26:53.517 --> 00:26:55.600
It's a little harder to
read, just because I'm not

00:26:55.600 --> 00:26:56.770
used to reading all caps.

00:26:56.770 --> 00:26:58.687
Kind of looks like we're
yelling at ourselves.

00:26:58.687 --> 00:27:01.600
But I don't see-- wait a minute.

00:27:01.600 --> 00:27:05.560
I still see The Office over here twice.

00:27:05.560 --> 00:27:11.920
If I keep scrolling here, so far, I see
Stranger Things and Strainger Things.

00:27:11.920 --> 00:27:13.570
That just looks like a typo.

00:27:13.570 --> 00:27:15.680
I see two Sherlocks, though.

00:27:15.680 --> 00:27:17.380
This is a little suspicious.

00:27:17.380 --> 00:27:21.730
So [? Gadana, ?] you and I don't
seem to have solved things fully.

00:27:21.730 --> 00:27:24.220
And this one's a little more subtle.

00:27:24.220 --> 00:27:30.970
What more should I perhaps do to my data
to ensure we get duplicates removed?

00:27:30.970 --> 00:27:32.200
Olivia?

00:27:32.200 --> 00:27:34.407
AUDIENCE: Maybe trim around the edges.

00:27:34.407 --> 00:27:35.990
DAVID J. MALAN: Trim around the edges.

00:27:35.990 --> 00:27:37.480
I like the sound of that,
but what do you mean?

00:27:37.480 --> 00:27:38.320
What does that do?

00:27:38.320 --> 00:27:40.862
AUDIENCE: Oh, like, trim off
the extra spaces in case someone

00:27:40.862 --> 00:27:42.940
put a space before or after the words.

00:27:42.940 --> 00:27:44.230
DAVID J. MALAN: Yeah, exactly.

00:27:44.230 --> 00:27:47.140
It's pretty common for humans,
intentionally or accidentally,

00:27:47.140 --> 00:27:48.920
to hit the Space bar
where they shouldn't.

00:27:48.920 --> 00:27:51.790
And in fact, I'm kind of
inferring that I bet one or more

00:27:51.790 --> 00:27:55.143
of you accidentally typed
Sherlock, space, and then decided,

00:27:55.143 --> 00:27:55.810
nope, that's it.

00:27:55.810 --> 00:27:57.040
I'm not typing anything else.

00:27:57.040 --> 00:28:00.740
But that space, even though we can't
quite see it obviously, is there.

00:28:00.740 --> 00:28:04.060
And when we do a string comparison or
when the set data structure does that,

00:28:04.060 --> 00:28:08.110
it's actually going to be noticed
when doing those comparisons.

00:28:08.110 --> 00:28:10.092
And therefore they're
not going to be the same.

00:28:10.092 --> 00:28:11.800
So I can do this in
a few different ways.

00:28:11.800 --> 00:28:15.080
But it turns out, in Python, you
can chain functions together,

00:28:15.080 --> 00:28:17.390
which is also, too,
kind of a fancy feature.

00:28:17.390 --> 00:28:18.650
Notice what I'm doing here.

00:28:18.650 --> 00:28:21.070
I'm still accessing the titles set.

00:28:21.070 --> 00:28:23.710
I'm adding the following value to it.

00:28:23.710 --> 00:28:27.610
I'm adding the value row
bracket title, but not quite.

00:28:27.610 --> 00:28:30.880
That is a string or an
str, in Python speak.

00:28:30.880 --> 00:28:33.130
I'm going to go ahead
and strip it, which

00:28:33.130 --> 00:28:36.340
means if we look up the documentation
for this function, to Olivia's point,

00:28:36.340 --> 00:28:39.130
it's going to strip off or
trim all of the white space

00:28:39.130 --> 00:28:41.200
to the left, all of the
white space to the right,

00:28:41.200 --> 00:28:43.840
whether that's the Space
bar or the Enter key

00:28:43.840 --> 00:28:46.870
or the Tab character or a
few other things, as well.

00:28:46.870 --> 00:28:50.200
It's just going to get rid of
leading and trailing white space.

00:28:50.200 --> 00:28:53.650
And then whatever's left over, I'm
going to go ahead and force everything

00:28:53.650 --> 00:28:56.810
to uppercase in the spirit of
[? Gadana's ?] suggestion, too.

00:28:56.810 --> 00:29:00.470
So we're sort of combining two good
ideas now to really massage the data,

00:29:00.470 --> 00:29:02.470
if you will, into a cleaner format.

00:29:02.470 --> 00:29:04.780
And this is such a real-world reality.

00:29:04.780 --> 00:29:09.588
Humans, you and I, cannot be trusted to
input data the way we are supposed to.

00:29:09.588 --> 00:29:11.380
Sometimes it's all
lowercase, because we're

00:29:11.380 --> 00:29:13.463
being a little lazy or a
little social media-like,

00:29:13.463 --> 00:29:16.120
even if we're checking
out from Amazon and trying

00:29:16.120 --> 00:29:18.310
to input a valid postal address.

00:29:18.310 --> 00:29:22.127
Sometimes it's all capitals, because
I can think of a few people in my life

00:29:22.127 --> 00:29:24.460
who don't quite understand
the Caps Lock thing just yet.

00:29:24.460 --> 00:29:26.710
And so things might be
all capitalized instead.

00:29:26.710 --> 00:29:30.580
This is not good for computer
systems that require precision,

00:29:30.580 --> 00:29:32.440
to our emphasis in week 0.

00:29:32.440 --> 00:29:35.140
And so massaging data
means cleaning it up,

00:29:35.140 --> 00:29:38.500
doing some mutations that don't
really change the meaning of the data

00:29:38.500 --> 00:29:41.740
but canonicalize it,
standardize it, so that you're

00:29:41.740 --> 00:29:44.950
comparing apples and apples, so
to speak, not apples and oranges.

00:29:44.950 --> 00:29:47.950
Well, let me go ahead and run
this again in my bigger terminal

00:29:47.950 --> 00:29:50.140
window, python of favorites.py.

00:29:50.140 --> 00:29:50.710
Voila.

00:29:50.710 --> 00:29:55.220
In scrolling up, up, up, I
think we're in a better place.

00:29:55.220 --> 00:29:57.520
I only see one Office now.

00:29:57.520 --> 00:30:01.510
And if I keep scrolling up and up
and up, I'm seeing typos still,

00:30:01.510 --> 00:30:03.910
but nothing related to white space.

00:30:03.910 --> 00:30:08.340
And I think we have a much cleaner
unique list of titles at this point.

00:30:08.340 --> 00:30:10.800
Of course, if we scroll
up, I would have to be

00:30:10.800 --> 00:30:14.970
a lot more clever if I want to detect
things like typographical errors.

00:30:14.970 --> 00:30:19.870
It looks like one of you was very
diligent about putting F.R.I.

00:30:19.870 --> 00:30:22.970
and so forth but then got bored at
the end and left off the last period.

00:30:22.970 --> 00:30:25.470
But that's going to happen when
you're taking in user input.

00:30:25.470 --> 00:30:28.140
We've, of course, got all
these variants of CS50.

00:30:28.140 --> 00:30:30.570
That's going to be a mess
to clean up, because now you

00:30:30.570 --> 00:30:35.130
can imagine having to add a whole bunch
of if conditions and elses and else ifs

00:30:35.130 --> 00:30:38.160
to clean all of that up if
we do want to canonicalize

00:30:38.160 --> 00:30:41.920
all different flavors of CS50
as, quote unquote, "CS50."

00:30:41.920 --> 00:30:43.890
So this is a very slippery slope.

00:30:43.890 --> 00:30:47.010
You and I could start writing a huge
amount of data just to clean this up.

00:30:47.010 --> 00:30:50.940
But that's the reality when
dealing with real-world data.

00:30:50.940 --> 00:30:55.140
Well, let's go ahead now and
improve this program further,

00:30:55.140 --> 00:30:57.810
do something a little
fancier, because I now

00:30:57.810 --> 00:31:00.090
can trust that my data
has been canonicalized

00:31:00.090 --> 00:31:03.900
except for the actual typos or the
weird variants of CS50 and the like.

00:31:03.900 --> 00:31:07.470
Let's go ahead and figure out
what's the most popular favorite TV

00:31:07.470 --> 00:31:10.510
show among the audience here.

00:31:10.510 --> 00:31:12.300
So I'm going to start
where I have before,

00:31:12.300 --> 00:31:14.133
with my current code,
because I think I have

00:31:14.133 --> 00:31:16.143
most of the building blocks in place.

00:31:16.143 --> 00:31:18.810
I'm going to go ahead and clean
up my code a little bit in here.

00:31:18.810 --> 00:31:22.050
I'm going to go ahead and give myself
a separate variable now called title

00:31:22.050 --> 00:31:26.040
just so that I can think about things
in a little more orderly fashion.

00:31:26.040 --> 00:31:29.200
But I'm not going to start adding
things to this set anymore.

00:31:29.200 --> 00:31:32.220
In fact, a set, I don't
think, is really going

00:31:32.220 --> 00:31:35.880
to be sufficient to keep track
of the popularity of TV shows,

00:31:35.880 --> 00:31:38.820
because by definition, the set
is throwing away duplicates.

00:31:38.820 --> 00:31:40.680
But the goal now is
kind of the opposite.

00:31:40.680 --> 00:31:45.240
I want to know which are the
duplicates so that I can tell you

00:31:45.240 --> 00:31:46.860
that this many people like The Office.

00:31:46.860 --> 00:31:50.530
This many people like
Breaking Bad and the like.

00:31:50.530 --> 00:31:56.010
So what tools do we have in Python's
toolkit via which we could accumulate

00:31:56.010 --> 00:31:59.320
or figure out that information?

00:31:59.320 --> 00:32:02.740
Any thoughts on what data
structure might help us here

00:32:02.740 --> 00:32:07.870
if we want to figure out show,
popularity, show, popularity?

00:32:07.870 --> 00:32:11.950
And by popularity, I just mean the
frequency of it in the CSV file.

00:32:11.950 --> 00:32:13.720
Santiago?

00:32:13.720 --> 00:32:17.110
AUDIENCE: I guess one option
could be to use dictionaries

00:32:17.110 --> 00:32:20.410
so that you can have The
Office, I don't know,

00:32:20.410 --> 00:32:23.110
20 votes, and then Game
of Thrones, another one,

00:32:23.110 --> 00:32:27.023
so that a dictionary could
really help you visualize that.

00:32:27.023 --> 00:32:28.690
DAVID J. MALAN: Yeah, perfect instincts.

00:32:28.690 --> 00:32:31.660
Recall that a dictionary, at the
end of the day, no matter how

00:32:31.660 --> 00:32:34.450
sophisticated it's implemented
underneath the hood,

00:32:34.450 --> 00:32:35.680
like your spell checker--

00:32:35.680 --> 00:32:38.240
It's just a collection
of key value pairs.

00:32:38.240 --> 00:32:42.790
And indeed, it's maybe one of the most
useful data structures in any language,

00:32:42.790 --> 00:32:45.820
because this ability to associate
one piece of data with another

00:32:45.820 --> 00:32:49.150
is just a very general
purpose solution to problems.

00:32:49.150 --> 00:32:51.730
And indeed, to Santiago's
point, if the problem at hand

00:32:51.730 --> 00:32:53.650
is to figure out the
popularity of shows,

00:32:53.650 --> 00:32:58.510
well, let's make the keys the titles of
our shows and the frequencies thereof--

00:32:58.510 --> 00:32:59.830
the votes, so to speak--

00:32:59.830 --> 00:33:01.810
the values of those keys.

00:33:01.810 --> 00:33:06.450
We're going to map title to votes, title
to vote, title to vote, and so forth.

00:33:06.450 --> 00:33:08.145
So a dictionary is exactly that.

00:33:08.145 --> 00:33:09.520
So let me go ahead and scroll up.

00:33:09.520 --> 00:33:10.978
And I can make a little tweak here.

00:33:10.978 --> 00:33:14.260
Instead of a set, I can instead
say dict and give myself

00:33:14.260 --> 00:33:15.598
just an empty dictionary.

00:33:15.598 --> 00:33:18.640
There's actually shorthand notation
for that that's a little more common.

00:33:18.640 --> 00:33:20.830
So you use two empty curly braces.

00:33:20.830 --> 00:33:22.810
That just means the exact same thing.

00:33:22.810 --> 00:33:25.270
Give me a dictionary
that's initially empty.

00:33:25.270 --> 00:33:27.400
There's no fancy shortcut for a set.

00:33:27.400 --> 00:33:30.370
You have to literally type out
S-E-T, open paren and closed paren.

00:33:30.370 --> 00:33:34.220
But dictionaries are so common,
so popular, so powerful,

00:33:34.220 --> 00:33:38.350
they have this little syntactic
shortcut of just two curly braces,

00:33:38.350 --> 00:33:39.560
open and closed.

00:33:39.560 --> 00:33:42.700
So now that I have that,
let me go ahead and do this.

00:33:42.700 --> 00:33:45.580
Inside of my for loop,
instead of printing

00:33:45.580 --> 00:33:48.880
the title, which I don't want to do,
and instead of adding it to the set,

00:33:48.880 --> 00:33:50.770
I now want to add it to the dictionary.

00:33:50.770 --> 00:33:51.860
So how do I do that?

00:33:51.860 --> 00:33:55.480
Well, if my dictionary is called titles,
I think I can essentially do something

00:33:55.480 --> 00:34:02.710
like this, titles bracket
title = or maybe += 1.

00:34:02.710 --> 00:34:07.120
Maybe I can kind of use the dictionary
as just a little cheat sheet

00:34:07.120 --> 00:34:12.050
of counts, numbers, that start at
0 and then just add 1, at 2, add 3.

00:34:12.050 --> 00:34:17.860
So every time I see The Office, The
Office, The Office, do += 1, += 1.

00:34:17.860 --> 00:34:20.199
We can't do ++, because
that's not a thing in Python.

00:34:20.199 --> 00:34:24.580
It only exists in C. But this would
seem to go into the dictionary called

00:34:24.580 --> 00:34:29.260
titles, look up the key that
matches this specific title,

00:34:29.260 --> 00:34:34.340
and then increment whatever
value is there by 1.

00:34:34.340 --> 00:34:37.380
But I'm going to go ahead and
run this a little naively here.

00:34:37.380 --> 00:34:40.280
Let me go ahead and run
python of favorites.py.

00:34:40.280 --> 00:34:43.400
And wow, it broke already on line 9.

00:34:43.400 --> 00:34:47.389
So sort of an apt choice
of show to begin with,

00:34:47.389 --> 00:34:49.530
we have a key error with Punisher.

00:34:49.530 --> 00:34:50.917
So Punisher is bad.

00:34:50.917 --> 00:34:52.250
Something bad has just happened.

00:34:52.250 --> 00:34:53.250
But what does that mean?

00:34:53.250 --> 00:34:55.429
A key error is referring
to the fact that I

00:34:55.429 --> 00:34:59.407
tried to access an invalid
key in a dictionary.

00:34:59.407 --> 00:35:01.490
This is saying that literally
in this line of code

00:35:01.490 --> 00:35:04.610
here, even though titles
is a dictionary and even

00:35:04.610 --> 00:35:07.130
though the value of
title, singular, is, quote

00:35:07.130 --> 00:35:09.560
unquote, "PUNISHER,"
I'm getting a key error,

00:35:09.560 --> 00:35:13.230
because that title does not yet exist.

00:35:13.230 --> 00:35:17.060
So even if you're not sure of the
Python syntax for fixing this problem,

00:35:17.060 --> 00:35:21.530
what's the intuitive solution here?

00:35:21.530 --> 00:35:25.610
I cannot increment the
frequency of the Punisher,

00:35:25.610 --> 00:35:28.130
because Punisher is
not in the dictionary.

00:35:28.130 --> 00:35:29.986
It almost feels like a catch-22.

00:35:29.986 --> 00:35:31.890
[? Greg? ?]

00:35:31.890 --> 00:35:35.900
AUDIENCE: I think that you need,
first of all, to create a for loop

00:35:35.900 --> 00:35:40.520
and maybe assign a value to
everything in the dictionary.

00:35:40.520 --> 00:35:43.683
For example, a value 0, and then add 1.

00:35:43.683 --> 00:35:45.350
DAVID J. MALAN: Yeah, so good instincts.

00:35:45.350 --> 00:35:46.730
And here, I can use another metaphor.

00:35:46.730 --> 00:35:49.147
I worry we might have a chicken
and the egg problem there,

00:35:49.147 --> 00:35:51.470
because I don't think I can
go to the top of my code,

00:35:51.470 --> 00:35:56.420
add a loop that initializes all of
the values in the dictionary to 0,

00:35:56.420 --> 00:36:01.130
because I would need to know all of
the names of the shows at that point.

00:36:01.130 --> 00:36:02.180
Now, that's fine.

00:36:02.180 --> 00:36:05.330
I think I could take you maybe
more literally, [? Greg, ?]

00:36:05.330 --> 00:36:09.630
and open up the CSV file,
iterate over it top to bottom,

00:36:09.630 --> 00:36:12.920
and, any time I see a
title, just initialize it

00:36:12.920 --> 00:36:16.220
in the dictionary as
having a value of 0, 0, 0.

00:36:16.220 --> 00:36:20.280
Then have another for loop, maybe
reopen the file, and do the same.

00:36:20.280 --> 00:36:21.380
And that would work.

00:36:21.380 --> 00:36:23.540
But it's arguably not very efficient.

00:36:23.540 --> 00:36:26.330
It is asymptotically, in
terms of big O. But that would

00:36:26.330 --> 00:36:28.220
seem to be doing twice as much work.

00:36:28.220 --> 00:36:31.820
Iterate over the file once just
to initialize everything to 0.

00:36:31.820 --> 00:36:35.330
Then iterate over the file a second
time just to increment the counts.

00:36:35.330 --> 00:36:38.360
I think we can do things
a little more efficiently.

00:36:38.360 --> 00:36:41.090
I think we can achieve not only
correctness but better design.

00:36:41.090 --> 00:36:45.560
Any thoughts on how we can still
solve this problem without having

00:36:45.560 --> 00:36:48.290
to iterate over the whole thing twice?

00:36:48.290 --> 00:36:50.360
Yeah, [? Semowit? ?]

00:36:50.360 --> 00:36:53.450
AUDIENCE: I think we can
add in an if statement

00:36:53.450 --> 00:36:55.970
to check if that key
is in the dictionary.

00:36:55.970 --> 00:36:59.865
And if it's not, then add it and then
go ahead and increment the value after.

00:36:59.865 --> 00:37:00.740
DAVID J. MALAN: Nice.

00:37:00.740 --> 00:37:02.460
And we can do exactly that.

00:37:02.460 --> 00:37:04.310
So let's just apply that intuition.

00:37:04.310 --> 00:37:08.583
If the problem is that I'm trying to
access a key that does not yet exist,

00:37:08.583 --> 00:37:10.500
well, let's just be a
little smarter about it.

00:37:10.500 --> 00:37:14.090
And to [? Semowit's ?] point,
let's check whether the key exists.

00:37:14.090 --> 00:37:16.020
And if it does, then increment it.

00:37:16.020 --> 00:37:19.340
But if it does not, then and only
then, [? to Greg's ?] advice,

00:37:19.340 --> 00:37:20.730
initialize it to 0.

00:37:20.730 --> 00:37:21.570
So let me do that.

00:37:21.570 --> 00:37:24.980
Let me go ahead and
say if title in titles,

00:37:24.980 --> 00:37:28.520
which is the very Pythonic,
beautiful way of asking a question

00:37:28.520 --> 00:37:30.500
like that, way cleaner than in C--

00:37:30.500 --> 00:37:35.390
let me go ahead, then, and say
exactly the line from before.

00:37:35.390 --> 00:37:40.280
Else, though, if that title is not
yet in the dictionary called titles,

00:37:40.280 --> 00:37:41.720
well, that's OK, too.

00:37:41.720 --> 00:37:47.150
I can go ahead and say
titles bracket title = 0.

00:37:47.150 --> 00:37:51.950
So the difference here is
that I can certainly index

00:37:51.950 --> 00:37:57.740
into a dictionary using a key that
doesn't exist if I plan at that moment

00:37:57.740 --> 00:37:58.730
to give it a value.

00:37:58.730 --> 00:38:02.030
That's OK, and that has always
been OK since last week.

00:38:02.030 --> 00:38:07.490
But, however, if I want to go ahead
and increment the value that's there,

00:38:07.490 --> 00:38:11.630
I'm going to go ahead and do
that in this separate line.

00:38:11.630 --> 00:38:13.850
But I did introduce a bug.

00:38:13.850 --> 00:38:15.770
I did introduce a bug here.

00:38:15.770 --> 00:38:19.220
I think I need to go one
step further logically.

00:38:19.220 --> 00:38:24.480
I don't think I want to
initialize this to 0 per se.

00:38:24.480 --> 00:38:29.090
Does anyone see a subtle
bug in my logic here?

00:38:29.090 --> 00:38:32.570
If the title is already in the
dictionary, I'm incrementing it by 1.

00:38:32.570 --> 00:38:37.000
Otherwise, I'm initializing it to 0.

00:38:37.000 --> 00:38:38.500
Any subtle catches here?

00:38:38.500 --> 00:38:40.690
Yeah, Olivia, what do you see?

00:38:40.690 --> 00:38:44.700
AUDIENCE: I think you should initialize
it to 1, since it's the first instance.

00:38:44.700 --> 00:38:45.700
DAVID J. MALAN: Exactly.

00:38:45.700 --> 00:38:46.870
I should initialize it to 1.

00:38:46.870 --> 00:38:50.137
Otherwise, I'm accidentally
overlooking this particular title,

00:38:50.137 --> 00:38:51.970
and I'm going to go
ahead and undercount it.

00:38:51.970 --> 00:38:54.100
So I can fix this either by doing this.

00:38:54.100 --> 00:38:57.820
Or frankly, if you prefer, I don't
technically need to use an if else.

00:38:57.820 --> 00:39:00.730
I can use just an if by doing
something like this instead.

00:39:00.730 --> 00:39:04.900
I could say if title not in
titles, then I could go ahead

00:39:04.900 --> 00:39:07.420
and say titles bracket title gets 0.

00:39:07.420 --> 00:39:12.270
And then after that, I can
blindly, so to speak, just do this.

00:39:12.270 --> 00:39:13.480
So which one is better?

00:39:13.480 --> 00:39:15.460
I think the second one
is maybe a little better

00:39:15.460 --> 00:39:17.420
in that I'm saving one line of code.

00:39:17.420 --> 00:39:19.270
But it's ensuring with
that if condition,

00:39:19.270 --> 00:39:24.190
to [? Semowit's ?] advice, that I'm
not indexing into the titles dictionary

00:39:24.190 --> 00:39:27.050
until I'm sure that
the title is in there.

00:39:27.050 --> 00:39:31.570
So let me go ahead and run this
now, python of favorites.py, Enter.

00:39:31.570 --> 00:39:34.090
And OK, it didn't crash, so that's good.

00:39:34.090 --> 00:39:36.400
But I'm not yet seeing
any useful information.

00:39:36.400 --> 00:39:38.740
But I now have access to a bit more.

00:39:38.740 --> 00:39:41.680
Let me scroll down now to the
bottom of this program, where

00:39:41.680 --> 00:39:43.360
I have now this loop.

00:39:43.360 --> 00:39:45.460
Let me go ahead and print
out not just the title

00:39:45.460 --> 00:39:49.870
but the value of that key in
the dictionary by just indexing

00:39:49.870 --> 00:39:50.500
into it here.

00:39:50.500 --> 00:39:52.030
And you might not have
seen this syntax before.

00:39:52.030 --> 00:39:54.758
But with print, you can actually
pass in multiple arguments.

00:39:54.758 --> 00:39:57.550
And by default, print will just
separate them with a space for you.

00:39:57.550 --> 00:39:59.860
You can override that behavior
and separate them with anything.

00:39:59.860 --> 00:40:02.777
But this is just meant to be a quick
and dirty program that prints out

00:40:02.777 --> 00:40:04.820
titles and now the popularity thereof.

00:40:04.820 --> 00:40:07.210
So let me run this again,
python of favorites.py.

00:40:07.210 --> 00:40:08.470
And voila.

00:40:08.470 --> 00:40:12.040
It's kind of all over the place.

00:40:12.040 --> 00:40:14.800
Office, super popular
with 26 votes there.

00:40:14.800 --> 00:40:18.220
A lot of single votes here.

00:40:18.220 --> 00:40:19.810
Big Bang Theory has nine.

00:40:19.810 --> 00:40:21.340
You know, this is all nice and good.

00:40:21.340 --> 00:40:24.548
But I feel like this is going to take
me forever to wrap my mind around which

00:40:24.548 --> 00:40:25.990
are the most popular shows.

00:40:25.990 --> 00:40:27.560
So of course, how would we do this?

00:40:27.560 --> 00:40:30.250
Well, to the point made earlier,
with spreadsheets, my god,

00:40:30.250 --> 00:40:33.100
in Microsoft Excel or Google
Spreadsheets or Apple Numbers,

00:40:33.100 --> 00:40:35.590
you just click the column
heading and boom, sorted.

00:40:35.590 --> 00:40:38.450
We seem to have lost that capability
unless we now do it in code.

00:40:38.450 --> 00:40:40.450
So let me do that for us.

00:40:40.450 --> 00:40:42.550
Let me go ahead and go back to my code.

00:40:42.550 --> 00:40:48.520
And it looks like sorted, even
though it does work on dictionaries,

00:40:48.520 --> 00:40:52.340
is actually sorting
by key, not by value.

00:40:52.340 --> 00:40:55.030
And here's where our Python
programming techniques need

00:40:55.030 --> 00:40:56.530
to get a little more sophisticated.

00:40:56.530 --> 00:40:58.572
And we want to introduce
another feature here now

00:40:58.572 --> 00:41:01.660
of Python which is going to
solve this problem specifically

00:41:01.660 --> 00:41:03.680
but in a pretty general way.

00:41:03.680 --> 00:41:06.310
So if we read the
documentation for sorted,

00:41:06.310 --> 00:41:11.320
the sorted function indeed sorts
sets by the values therein.

00:41:11.320 --> 00:41:13.840
It sorts lists by the values therein.

00:41:13.840 --> 00:41:17.140
It sorts dictionaries
by the keys therein,

00:41:17.140 --> 00:41:20.960
because dictionaries have two pieces
of information for every element.

00:41:20.960 --> 00:41:23.450
It has a key and a
value, not just a value.

00:41:23.450 --> 00:41:25.390
So by default, sorted sorts by key.

00:41:25.390 --> 00:41:28.150
So we somehow have to
override that behavior.

00:41:28.150 --> 00:41:29.390
So how can we do this?

00:41:29.390 --> 00:41:31.840
Well, it turns out that
the sorted function

00:41:31.840 --> 00:41:35.890
takes another optional
argument literally called key.

00:41:35.890 --> 00:41:41.570
And the key argument takes as
its value the name of a function.

00:41:41.570 --> 00:41:43.690
And this is where things
get really interesting,

00:41:43.690 --> 00:41:45.370
if not confusing, really quickly.

00:41:45.370 --> 00:41:50.620
It turns out, in Python, you can
pass around functions as arguments

00:41:50.620 --> 00:41:51.790
by way of their name.

00:41:51.790 --> 00:41:56.080
And technically, you can do this in C.
It's a lot more syntactically involved.

00:41:56.080 --> 00:41:57.777
But in Python, it's very common.

00:41:57.777 --> 00:41:59.110
In JavaScript, it's very common.

00:41:59.110 --> 00:42:01.930
In a lot of languages, it's very
common to think of functions

00:42:01.930 --> 00:42:06.040
as first-class objects, which is a fancy
way of saying you can pass them around

00:42:06.040 --> 00:42:08.440
just like they are variables themselves.

00:42:08.440 --> 00:42:09.730
We're not calling them yet.

00:42:09.730 --> 00:42:11.720
But you can pass them
around by their name.

00:42:11.720 --> 00:42:13.310
So what do I mean by this?

00:42:13.310 --> 00:42:18.940
Well, I need a function now to
sort my dictionary by its value.

00:42:18.940 --> 00:42:22.900
And only I know how to do this, perhaps,
so let me go ahead and give myself

00:42:22.900 --> 00:42:25.990
a generic function name just for the
moment called f-- f for function,

00:42:25.990 --> 00:42:26.907
kind of like in math--

00:42:26.907 --> 00:42:28.907
because we're going to
get rid of it eventually.

00:42:28.907 --> 00:42:31.120
But let me go ahead and
temporarily define a function

00:42:31.120 --> 00:42:34.150
called f that takes as input a title.

00:42:34.150 --> 00:42:39.140
And then it returns for me the
value corresponding to that key.

00:42:39.140 --> 00:42:43.060
So I'm going to go ahead and
return titles bracket title.

00:42:43.060 --> 00:42:47.480
So here, we have a function whose
purpose in life is super simple.

00:42:47.480 --> 00:42:48.700
You give it a title.

00:42:48.700 --> 00:42:52.990
It gives you the count thereof, the
frequency, the popularity thereof,

00:42:52.990 --> 00:42:56.080
by just looking it up in
that global dictionary.

00:42:56.080 --> 00:42:59.830
So it's super simple, but
that's its only purpose in life.

00:42:59.830 --> 00:43:03.400
But now, according to the
documentation for sorted,

00:43:03.400 --> 00:43:06.730
what it's now going to do, because I'm
passing in a second argument called

00:43:06.730 --> 00:43:12.250
key, the sorted function, rather than
just presume you want everything sorted

00:43:12.250 --> 00:43:15.280
alphabetically by key,
it's instead going

00:43:15.280 --> 00:43:22.420
to call that function f on every one
of the elements in your dictionary.

00:43:22.420 --> 00:43:25.960
And depending on your
answer, the return value

00:43:25.960 --> 00:43:30.760
you give with that f function,
that will be used instead

00:43:30.760 --> 00:43:34.060
to determine the actual ordering.

00:43:34.060 --> 00:43:36.900
So by default, sorted just looks at key.

00:43:36.900 --> 00:43:39.750
What I'm effectively
doing with this f function

00:43:39.750 --> 00:43:44.160
is instead returning the value
corresponding to every key.

00:43:44.160 --> 00:43:48.360
And so the logical implication of this,
even though the syntax is a little new,

00:43:48.360 --> 00:43:51.690
is that this dictionary
of titles will now

00:43:51.690 --> 00:43:55.140
be sorted by value instead of by key.

00:43:55.140 --> 00:43:57.460
Because again, by
default, it sorts by key.

00:43:57.460 --> 00:44:01.740
But if I define my own key
function and override that behavior

00:44:01.740 --> 00:44:05.370
to return the corresponding value,
it's the values, the numbers,

00:44:05.370 --> 00:44:08.750
the counts that will actually
be used to this thing.

00:44:08.750 --> 00:44:09.250
All right.

00:44:09.250 --> 00:44:11.333
Let's go ahead and see if
that's true in practice.

00:44:11.333 --> 00:44:13.440
Let me go ahead and rerun
python of favorites.py.

00:44:13.440 --> 00:44:14.790
I should see all the titles.

00:44:14.790 --> 00:44:17.520
And voila, conveniently,
the most popular show

00:44:17.520 --> 00:44:22.170
seems to be Game of Thrones with 33
votes, followed by Friends with 27,

00:44:22.170 --> 00:44:25.000
followed by The Office
with 26, and so forth.

00:44:25.000 --> 00:44:27.060
But of course, the list
is kind of backwards.

00:44:27.060 --> 00:44:29.680
I mean, it's convenient that I can
see it at the bottom of my screen.

00:44:29.680 --> 00:44:32.530
But really, if we're making a list,
it should really be at the top.

00:44:32.530 --> 00:44:34.170
So how can we override that behavior?

00:44:34.170 --> 00:44:36.840
Turns out the sorted function,
if you read its documentation,

00:44:36.840 --> 00:44:41.020
also takes another optional
parameter called reverse.

00:44:41.020 --> 00:44:43.590
And if you set reverse
equal to True, capital

00:44:43.590 --> 00:44:48.120
T in Python, that's going
to go ahead and give us now

00:44:48.120 --> 00:44:50.190
the reverse order of that same sort.

00:44:50.190 --> 00:44:53.790
So let me go ahead and maximize my
terminal window, rerun it again.

00:44:53.790 --> 00:44:57.480
And voila, if I scroll back up to the
top, it's not alphabetically sorted.

00:44:57.480 --> 00:44:59.970
But if I keep going, keep
going, keep going, keep going,

00:44:59.970 --> 00:45:01.262
the numbers are getting bigger.

00:45:01.262 --> 00:45:06.770
And voila, now Game of Thrones
with 33 is all the way at the top.

00:45:06.770 --> 00:45:08.490
All right, so pretty cool.

00:45:08.490 --> 00:45:11.360
And again, the new functionality
here in Python, at least,

00:45:11.360 --> 00:45:15.110
is that we can actually pass
in functions to functions

00:45:15.110 --> 00:45:19.380
and leave it to the
latter to call the former.

00:45:19.380 --> 00:45:21.180
So that's complicated just to say.

00:45:21.180 --> 00:45:26.400
But any questions or confusion now
on how we are using dictionaries

00:45:26.400 --> 00:45:34.290
and how we are sorting things in
this reverse, value-based way?

00:45:34.290 --> 00:45:35.450
Any questions or confusion?

00:45:35.450 --> 00:45:39.000
Anything in the chat or verbally, Brian?

00:45:39.000 --> 00:45:41.470
BRIAN YU: Looks like all
questions are answered here.

00:45:41.470 --> 00:45:42.320
DAVID J. MALAN: OK.

00:45:42.320 --> 00:45:44.780
Then in that case, let me
point out a common mistake.

00:45:44.780 --> 00:45:50.000
Notice that even though f is a function,
notice that I did not call it there.

00:45:50.000 --> 00:45:53.630
That would be incorrect, the
reason being we deliberately

00:45:53.630 --> 00:45:58.410
want to pass the function
f into the sorted function

00:45:58.410 --> 00:46:03.810
so that the sorted function can take it
upon itself to call f again and again

00:46:03.810 --> 00:46:04.310
and again.

00:46:04.310 --> 00:46:07.227
We don't want to just call it once
by using the parentheses ourselves.

00:46:07.227 --> 00:46:11.030
We want to just pass it in by name so
that the sorted function, which comes

00:46:11.030 --> 00:46:14.630
with Python, can instead do it for us.

00:46:14.630 --> 00:46:17.060
Santiago, did you have a question?

00:46:17.060 --> 00:46:18.710
AUDIENCE: Yes, I was going to ask.

00:46:18.710 --> 00:46:21.425
Why didn't we put f of title?

00:46:24.350 --> 00:46:26.960
I was going to ask that
question specifically.

00:46:26.960 --> 00:46:29.005
DAVID J. MALAN: Oh,
with the parentheses?

00:46:29.005 --> 00:46:29.630
AUDIENCE: Yeah.

00:46:29.630 --> 00:46:30.963
DAVID J. MALAN: Oh, OK, perfect.

00:46:30.963 --> 00:46:34.010
So because that would call the
function once and only once.

00:46:34.010 --> 00:46:37.013
We want sorted to be able
to call it again and again.

00:46:37.013 --> 00:46:38.930
Now, here's actually an
example, as we've seen

00:46:38.930 --> 00:46:40.760
in the past, of a correct solution.

00:46:40.760 --> 00:46:45.170
This is behaving as I intend, a list
of sorted titles from top to bottom

00:46:45.170 --> 00:46:47.450
in order of popularity.

00:46:47.450 --> 00:46:49.820
But it's a little poorly
designed, because I'm

00:46:49.820 --> 00:46:53.390
defining this function f, whose name
in the first place is kind of lame.

00:46:53.390 --> 00:46:56.660
But I'm defining a function
only to use it in one place.

00:46:56.660 --> 00:47:00.860
And my god, the function is so tiny, it
just feels like a waste of keystrokes

00:47:00.860 --> 00:47:03.740
to have defined a new function
just to then pass it in.

00:47:03.740 --> 00:47:08.150
So it turns out, in Python, if you
have a very short function whose

00:47:08.150 --> 00:47:13.130
purpose in life is meant to be to solve
a local problem just once and that's it

00:47:13.130 --> 00:47:16.880
and it's short enough that you're pretty
sure you can fit it on one line of code

00:47:16.880 --> 00:47:21.080
without things wrapping and starting
to get ugly stylistically, it turns out

00:47:21.080 --> 00:47:23.270
you can actually do this instead.

00:47:23.270 --> 00:47:26.820
You can copy the code that
you had in mind like this.

00:47:26.820 --> 00:47:30.680
And instead of actually
defining f as a function name,

00:47:30.680 --> 00:47:34.070
you can actually use a special
keyword in Python called lambda.

00:47:34.070 --> 00:47:37.760
You can specify the name of an
argument for your function as before.

00:47:37.760 --> 00:47:41.690
And then you can simply specify
the return value, thereafter

00:47:41.690 --> 00:47:44.940
deleting the function itself.

00:47:44.940 --> 00:47:49.640
So to be clear, key is still an
argument to the sorted function.

00:47:49.640 --> 00:47:54.120
It expects as its value
typically the name of a function.

00:47:54.120 --> 00:47:57.590
But if you've decided that, eh,
this seems like a waste of effort

00:47:57.590 --> 00:47:59.870
to define a function,
then pass the function in,

00:47:59.870 --> 00:48:02.840
especially when it's so short,
you can do it in a one liner.

00:48:02.840 --> 00:48:05.990
A lambda function is
an anonymous function.

00:48:05.990 --> 00:48:09.230
Lambda literally says,
Python, give me a function.

00:48:09.230 --> 00:48:11.147
I don't care about its name.

00:48:11.147 --> 00:48:13.230
Therefore, you don't have
to choose a name for it.

00:48:13.230 --> 00:48:17.940
But it does care still about its
arguments and its return value.

00:48:17.940 --> 00:48:23.147
So it's still up to you to provide zero
or more arguments and a return value.

00:48:23.147 --> 00:48:24.230
And notice I've done that.

00:48:24.230 --> 00:48:28.010
I've specified the keyword lambda
followed by the name of the argument

00:48:28.010 --> 00:48:31.490
I want this anonymous,
nameless function to accept.

00:48:31.490 --> 00:48:33.890
And then I'm specifying
the return value.

00:48:33.890 --> 00:48:37.940
And with lambda functions, you
do not need to specify return.

00:48:37.940 --> 00:48:41.000
Whatever you write after
the colon is literally

00:48:41.000 --> 00:48:43.050
what will be returned automatically.

00:48:43.050 --> 00:48:45.320
So again, this is a very
Pythonic thing to do.

00:48:45.320 --> 00:48:49.250
It's kind of a very clever one liner,
even though it's a little cryptic

00:48:49.250 --> 00:48:50.757
to see for the very first time.

00:48:50.757 --> 00:48:53.840
But it allows you to condense your
thoughts into a succinct statement that

00:48:53.840 --> 00:48:57.470
gets the job done so you don't have to
start defining more and more functions

00:48:57.470 --> 00:49:02.490
that you or someone else
then need to keep track of.

00:49:02.490 --> 00:49:02.990
All right.

00:49:02.990 --> 00:49:05.300
Any questions, then, on this?

00:49:05.300 --> 00:49:10.580
And I am pretty sure this is as complex
or sophisticated as our Python code

00:49:10.580 --> 00:49:13.290
today will get.

00:49:13.290 --> 00:49:16.020
Yeah, over to Sophia.

00:49:16.020 --> 00:49:19.380
AUDIENCE: I was wondering why
"lambda" is used specifically

00:49:19.380 --> 00:49:21.217
rather than some other keyword.

00:49:21.217 --> 00:49:23.550
DAVID J. MALAN: Yeah, so
there's a long history in this.

00:49:23.550 --> 00:49:27.060
And if, in fact, you take a course on
functional programming-- at Harvard,

00:49:27.060 --> 00:49:28.830
it's called CS51--

00:49:28.830 --> 00:49:32.280
there's a whole etymology
behind keywords like this.

00:49:32.280 --> 00:49:34.360
Let me defer that one for another time.

00:49:34.360 --> 00:49:37.440
But indeed, not only in
Python but in other languages,

00:49:37.440 --> 00:49:41.290
as well, these things have come
to exist called lambda functions.

00:49:41.290 --> 00:49:44.230
So they're actually quite commonplace
in other languages, as well.

00:49:44.230 --> 00:49:48.580
And so Python just
adopted the term of art.

00:49:48.580 --> 00:49:52.060
Mathematically, lambda is often
used as a symbol for functions.

00:49:52.060 --> 00:49:55.980
And so they borrowed that same
idea in the world of programming.

00:49:55.980 --> 00:49:56.550
All right.

00:49:56.550 --> 00:50:00.840
So seeing no other questions, let's go
ahead and solve a related problem still

00:50:00.840 --> 00:50:03.510
with some Python but
that's going to push up

00:50:03.510 --> 00:50:09.150
against the limits of efficiency when it
comes to storing our data in CSV files.

00:50:09.150 --> 00:50:13.113
Let me go ahead and start fresh
in this file, Favorites.py.

00:50:13.113 --> 00:50:15.030
All of the code I've
written thus far, though,

00:50:15.030 --> 00:50:16.905
is on the course's
website in advance, so you

00:50:16.905 --> 00:50:18.570
can see the incremental improvement.

00:50:18.570 --> 00:50:21.280
I'm going to go ahead and,
again, import csv at the top.

00:50:21.280 --> 00:50:24.690
And now let's write a program
this time that doesn't just

00:50:24.690 --> 00:50:27.990
automatically open up the
CSV and analyze it looking

00:50:27.990 --> 00:50:31.020
for the total popularity of shows.

00:50:31.020 --> 00:50:35.430
Let's search for a specific
show in the CSV and then

00:50:35.430 --> 00:50:39.082
go ahead and output
the popularity thereof.

00:50:39.082 --> 00:50:41.040
And I can do this in a
bunch of different ways.

00:50:41.040 --> 00:50:43.415
But I'm going to try to make
this as concise as possible.

00:50:43.415 --> 00:50:46.800
I'm first going to ask
the user to input a title.

00:50:46.800 --> 00:50:49.170
I could use CS50's get_string function.

00:50:49.170 --> 00:50:52.330
But recall that it's pretty much
the same as Python's input function,

00:50:52.330 --> 00:50:55.740
so I'm going to use Python's
input function today.

00:50:55.740 --> 00:50:57.780
And then I'm going to
go ahead and, as before,

00:50:57.780 --> 00:51:01.350
open up that same CSV
called Favorite TV Shows -

00:51:01.350 --> 00:51:08.010
Form Responses 1.csv in read-only
mode as a variable called file.

00:51:08.010 --> 00:51:11.160
I'm then going to give myself a
reader, and I'll use a DictReader again

00:51:11.160 --> 00:51:14.520
so I don't have to worry about
knowing which columns things are in,

00:51:14.520 --> 00:51:16.080
passing in file.

00:51:16.080 --> 00:51:17.340
And then let's see.

00:51:17.340 --> 00:51:20.310
If I only care about one title,
I can keep this program simpler.

00:51:20.310 --> 00:51:23.340
I don't need to figure out
the popularity of every show.

00:51:23.340 --> 00:51:26.880
I just need to figure out the
popularity of one show, the title

00:51:26.880 --> 00:51:28.510
that the human has typed in.

00:51:28.510 --> 00:51:32.160
So I'm going to go ahead and give
myself a very simple int called counter

00:51:32.160 --> 00:51:33.480
and set it equal to 0.

00:51:33.480 --> 00:51:34.950
I don't need a whole dictionary.

00:51:34.950 --> 00:51:36.930
Just one variable suffices now.

00:51:36.930 --> 00:51:42.300
And I'm going to go ahead and iterate
over the rows in the reader, as before.

00:51:42.300 --> 00:51:48.600
And then I'm going to say if the current
row's title == the title the human

00:51:48.600 --> 00:51:51.930
typed in, let's go ahead
and increment counter by 1.

00:51:51.930 --> 00:51:54.510
And it's already initialized,
because I did that on line 7.

00:51:54.510 --> 00:51:55.500
So I think I'm good.

00:51:55.500 --> 00:51:57.450
And then at the end
of this program, let's

00:51:57.450 --> 00:51:59.940
very simply print out
the value of counter.

00:51:59.940 --> 00:52:04.920
So the purpose of this program is to
prompt the user for a title of a show

00:52:04.920 --> 00:52:08.220
and then just report
the popularity thereof

00:52:08.220 --> 00:52:11.040
by counting the number of
instances of it in the file.

00:52:11.040 --> 00:52:14.520
So let me go ahead and run this
with python of favorites.py.

00:52:14.520 --> 00:52:15.450
Enter.

00:52:15.450 --> 00:52:21.440
Let me go ahead and type in
"The Office," Enter, and 19.

00:52:21.440 --> 00:52:23.710
Now, I don't remember
exactly what the number was.

00:52:23.710 --> 00:52:26.620
But I remember The Office
was more popular than that.

00:52:26.620 --> 00:52:29.840
I'm pretty sure it was not 19.

00:52:29.840 --> 00:52:35.645
Any intuition as to why this program
is buggy or so it would seem?

00:52:35.645 --> 00:52:37.520
BRIAN YU: A few people
in the chat are saying

00:52:37.520 --> 00:52:40.635
you need to remember to deal with
capitalization and white space again.

00:52:40.635 --> 00:52:41.510
DAVID J. MALAN: Yeah.

00:52:41.510 --> 00:52:44.790
So we need to practice those
same lessons learned from before.

00:52:44.790 --> 00:52:48.830
So I should really canonicalize the
input that the human, I, just typed in

00:52:48.830 --> 00:52:51.980
and also the input that's
coming from the CSV file.

00:52:51.980 --> 00:52:53.990
Perhaps the simplest way
to do this is, up here,

00:52:53.990 --> 00:52:57.200
to first strip off leading and trailing
white space in case I get a little

00:52:57.200 --> 00:52:59.670
sloppy and hit the Space
bar where I shouldn't.

00:52:59.670 --> 00:53:02.612
And then let's go ahead and force
it to uppercase just because.

00:53:02.612 --> 00:53:04.320
It doesn't matter if
it's upper or lower,

00:53:04.320 --> 00:53:06.420
but at least we'll
standardize things that way.

00:53:06.420 --> 00:53:10.130
And then when I do this, look
at the current rows title.

00:53:10.130 --> 00:53:12.170
I think I really need
to do the same thing.

00:53:12.170 --> 00:53:15.140
If I'm going to canonicalize one,
I need to canonical the other.

00:53:15.140 --> 00:53:19.790
And now compare the all-caps,
white-space-stripped versions

00:53:19.790 --> 00:53:20.730
of both strings.

00:53:20.730 --> 00:53:21.920
So now let me rerun it.

00:53:21.920 --> 00:53:24.020
Now I'm going to type
in "The Office," Enter.

00:53:24.020 --> 00:53:24.770
And voila.

00:53:24.770 --> 00:53:28.130
Now I'm at 26, which I think
is where we were at before.

00:53:28.130 --> 00:53:30.890
And in fact, now I, the
user, can be a little sloppy.

00:53:30.890 --> 00:53:32.450
I can say "the office."

00:53:32.450 --> 00:53:35.780
I can run it again and say "the
office" and then, for whatever reason,

00:53:35.780 --> 00:53:37.460
hit the Space bar a lot, Enter.

00:53:37.460 --> 00:53:38.570
It's still going to work.

00:53:38.570 --> 00:53:41.810
And indeed, though we seem to
be belaboring the pedantic here

00:53:41.810 --> 00:53:44.683
with trimming off white space
and so forth, just think.

00:53:44.683 --> 00:53:46.850
In a relatively small
audience here, how many of you

00:53:46.850 --> 00:53:50.150
accidentally hit the Space bar or
capitalized things differently?

00:53:50.150 --> 00:53:52.340
This happens massively on scale.

00:53:52.340 --> 00:53:55.040
And you can imagine this being
important when you're tagging

00:53:55.040 --> 00:53:56.780
friends in some social media account.

00:53:56.780 --> 00:53:58.940
You're doing @Brian or the like.

00:53:58.940 --> 00:54:02.510
You don't want to have to require
the user to type @, capital B,

00:54:02.510 --> 00:54:05.100
lowercase r-i-a-n, and so forth.

00:54:05.100 --> 00:54:07.880
So tolerating disparate,
messy user input

00:54:07.880 --> 00:54:11.990
is such a common problem
to solve, including

00:54:11.990 --> 00:54:14.740
in today's apps that we all use.

00:54:14.740 --> 00:54:15.280
All right.

00:54:15.280 --> 00:54:21.000
Any questions, then, on this
program, which I think is correct?

00:54:21.000 --> 00:54:22.890
Then let me ask a question of you.

00:54:22.890 --> 00:54:27.080
In what sense is this
program poorly designed?

00:54:27.080 --> 00:54:30.860
In what sense is this
program poorly designed?

00:54:30.860 --> 00:54:33.480
This is more subtle.

00:54:33.480 --> 00:54:38.040
But think about the running time
of this program in terms of big O.

00:54:38.040 --> 00:54:45.030
What is the running time of this program
if the CSV file has n different shows

00:54:45.030 --> 00:54:47.610
in it or n different submissions?

00:54:47.610 --> 00:54:50.310
So n is the variable in question.

00:54:50.310 --> 00:54:53.122
Yeah, what's the running time, Andrew?

00:54:53.122 --> 00:54:55.510
AUDIENCE: [INAUDIBLE]

00:54:55.510 --> 00:54:58.260
DAVID J. MALAN: Yeah, it's big O
of n, because I'm literally using

00:54:58.260 --> 00:55:00.400
linear search by way of the for loop.

00:55:00.400 --> 00:55:04.000
That's how a for loop works in Python,
just like in C. Starts at the beginning

00:55:04.000 --> 00:55:06.080
and potentially goes all
the way till the end.

00:55:06.080 --> 00:55:08.730
And so I'm using
implicitly linear search,

00:55:08.730 --> 00:55:11.910
because I'm not using any fancy data
structures, no sets, no dictionaries.

00:55:11.910 --> 00:55:14.140
I'm just looping from top to bottom.

00:55:14.140 --> 00:55:18.390
So you can imagine that if we surveyed
not just all of the students here

00:55:18.390 --> 00:55:21.390
in class but maybe everyone on
campus or everyone in the world--

00:55:21.390 --> 00:55:24.240
maybe we're Internet
Movie Database, IMDb.

00:55:24.240 --> 00:55:28.710
There could be a huge number of
votes and a huge number of shows.

00:55:28.710 --> 00:55:32.460
And so writing a program, whether
it's in a terminal window like mine

00:55:32.460 --> 00:55:36.480
or maybe on a mobile device or maybe on
a webpage for your laptop or desktop,

00:55:36.480 --> 00:55:40.800
it's probably not the best
design to constantly loop

00:55:40.800 --> 00:55:44.190
over all of the shows in
your database from top

00:55:44.190 --> 00:55:47.250
to bottom just to answer
a single question.

00:55:47.250 --> 00:55:51.502
It would be much nicer to do things
in log of n time or in constant time.

00:55:51.502 --> 00:55:54.210
And thankfully, over the past few
weeks, both in C and in Python,

00:55:54.210 --> 00:55:57.390
we have seen smarter ways to do this.

00:55:57.390 --> 00:56:00.420
But I'm not practicing
what I've preached here.

00:56:00.420 --> 00:56:05.520
And in fact, at some point, this
notion of a flat-file database

00:56:05.520 --> 00:56:07.440
starts to get too primitive for us.

00:56:07.440 --> 00:56:11.670
Flat-file databases, like CSV
files, are wonderfully useful

00:56:11.670 --> 00:56:13.590
when you just want to
do something quickly

00:56:13.590 --> 00:56:16.410
or when you want to download
data from some third party,

00:56:16.410 --> 00:56:18.518
like Google, in a
standard, portable way.

00:56:18.518 --> 00:56:21.810
"Portable" means that it can be used by
different people and different systems.

00:56:21.810 --> 00:56:23.760
CSV is about as simple
as it gets, because you

00:56:23.760 --> 00:56:26.250
don't need to own
Microsoft Word or Apple

00:56:26.250 --> 00:56:28.150
Numbers or any particular product.

00:56:28.150 --> 00:56:30.750
It's just a text file, so
you can use any text editing

00:56:30.750 --> 00:56:34.140
program or any programming
language to access it.

00:56:34.140 --> 00:56:38.730
But flat-file databases aren't
necessarily the best structure

00:56:38.730 --> 00:56:42.750
to use ultimately for larger data
sets, because they don't really

00:56:42.750 --> 00:56:44.790
lend themselves to
more efficient queries.

00:56:44.790 --> 00:56:48.280
So CSV files, pretty much at best,
you have to search top to bottom,

00:56:48.280 --> 00:56:49.140
left to right.

00:56:49.140 --> 00:56:53.070
But it turns out that there are better
databases out there generally known

00:56:53.070 --> 00:56:57.960
as relational databases that, instead
of being files in which you store data,

00:56:57.960 --> 00:57:01.230
they are instead programs
in which you store data.

00:57:01.230 --> 00:57:04.830
Now, to be fair, those programs
use a lot of RAM, memory,

00:57:04.830 --> 00:57:06.390
where they actually store your data.

00:57:06.390 --> 00:57:08.670
And they do certainly persist your data.

00:57:08.670 --> 00:57:12.600
They keep it long term by
storing your data also in files.

00:57:12.600 --> 00:57:16.110
But between you and your data,
there is this running program.

00:57:16.110 --> 00:57:20.070
And if you've ever heard of Oracle
or MySQL or PostgreSQL or SQL Server

00:57:20.070 --> 00:57:23.880
or Microsoft Access or bunches
of other popular products,

00:57:23.880 --> 00:57:26.520
both commercial and free
and open source alike,

00:57:26.520 --> 00:57:30.720
relational databases are so
similar in spirit to spreadsheets.

00:57:30.720 --> 00:57:33.690
But they are implemented in software.

00:57:33.690 --> 00:57:35.490
And they give us more and more features.

00:57:35.490 --> 00:57:37.360
And they use more and
more data structures

00:57:37.360 --> 00:57:42.550
so that we can search for data, insert
data, delete data, update data much,

00:57:42.550 --> 00:57:47.350
much more efficiently than we could if
just using something like a CSV file.

00:57:47.350 --> 00:57:49.600
So let's go ahead and take
our five-minute break here.

00:57:49.600 --> 00:57:52.558
And when we come back, we'll look at
relational databases and, in turn,

00:57:52.558 --> 00:57:54.400
a language called SQL.

00:57:54.400 --> 00:57:55.170
All right.

00:57:55.170 --> 00:57:56.190
So we are back.

00:57:56.190 --> 00:57:58.890
And the goal at hand
now is to transition

00:57:58.890 --> 00:58:02.340
from these fairly simplistic
flat-file databases

00:58:02.340 --> 00:58:04.260
to a more proper relational database.

00:58:04.260 --> 00:58:07.500
And relational databases are
indeed what power so many

00:58:07.500 --> 00:58:10.470
of today's mobile applications,
web applications, and the like.

00:58:10.470 --> 00:58:13.170
Now we're beginning to
transition to real-world software

00:58:13.170 --> 00:58:16.210
with real-world languages, at that.

00:58:16.210 --> 00:58:20.610
And so now, let me introduce
what we're going to call SQLite.

00:58:20.610 --> 00:58:23.040
So it turns out that
a relational database

00:58:23.040 --> 00:58:27.660
is a database that stores all of
the data still in rows and columns.

00:58:27.660 --> 00:58:30.960
But it doesn't do so using
spreadsheets or sheets.

00:58:30.960 --> 00:58:33.930
It instead does so using what
we're going to call tables.

00:58:33.930 --> 00:58:36.120
So it's pretty much the same idea.

00:58:36.120 --> 00:58:39.120
But with tables, we get some
additional functionality.

00:58:39.120 --> 00:58:41.130
With those tables,
we'll have the ability

00:58:41.130 --> 00:58:46.480
to search for data, update data, delete
data, insert new data, and the like.

00:58:46.480 --> 00:58:49.188
And these are things that we
absolutely can do with spreadsheets.

00:58:49.188 --> 00:58:52.105
But in the world of spreadsheets,
if you want to search for something,

00:58:52.105 --> 00:58:54.840
it's you, the human, doing it by
manually clicking and scrolling,

00:58:54.840 --> 00:58:55.340
typically.

00:58:55.340 --> 00:58:57.390
If you want to insert
data, it's you, the human,

00:58:57.390 --> 00:58:59.362
typing it in manually
after adding a new row.

00:58:59.362 --> 00:59:01.320
If you want to delete
something, it's you right

00:59:01.320 --> 00:59:04.350
clicking or Control-clicking
and deleting a whole row

00:59:04.350 --> 00:59:06.750
or updating the individual
cells they're in.

00:59:06.750 --> 00:59:11.910
With SQL, Structured Query Language,
we have a new programming language

00:59:11.910 --> 00:59:15.360
that is very often used in conjunction
with other programming languages.

00:59:15.360 --> 00:59:18.930
And so today, we'll see SQL
used on its own initially.

00:59:18.930 --> 00:59:21.990
But we'll also see it in the
context of a Python program.

00:59:21.990 --> 00:59:28.170
So a language like Python can itself
use SQL to do more powerful things

00:59:28.170 --> 00:59:30.660
than Python alone could do.

00:59:30.660 --> 00:59:34.440
So with that said, SQLite is
like a light version of SQL.

00:59:34.440 --> 00:59:35.940
It's a more user-friendly version.

00:59:35.940 --> 00:59:36.780
It's more portable.

00:59:36.780 --> 00:59:40.260
It can be used on Macs and PCS and
phones and laptops and desktops

00:59:40.260 --> 00:59:40.950
and servers.

00:59:40.950 --> 00:59:42.120
But it's incredibly common.

00:59:42.120 --> 00:59:45.810
In fact, in your iPhone and your
Android phone, many of the applications

00:59:45.810 --> 00:59:50.250
you are running today on your own device
are using SQLite underneath the hood.

00:59:50.250 --> 00:59:52.290
So it isn't a toy language per se.

00:59:52.290 --> 00:59:55.150
It's instead a relatively
simple implementation

00:59:55.150 --> 00:59:56.920
of a language generally known as SQL.

00:59:56.920 --> 01:00:00.680
But long story short, there's other
implementations of relational databases

01:00:00.680 --> 01:00:01.180
out there.

01:00:01.180 --> 01:00:02.972
And I rattled off
several of them already--

01:00:02.972 --> 01:00:05.620
Oracle and MySQL and
PostgreSQL and the like.

01:00:05.620 --> 01:00:09.910
Those all have slightly different
flavors or dialects of SQL.

01:00:09.910 --> 01:00:14.860
So SQL is a fairly standard language
for interacting with databases.

01:00:14.860 --> 01:00:16.960
But different companies,
different communities

01:00:16.960 --> 01:00:20.200
have kind of added or subtracted
their own preferred features.

01:00:20.200 --> 01:00:24.340
And so the syntax you use is generally
constant across all platforms.

01:00:24.340 --> 01:00:27.458
But we will standardize
for our purposes on SQLite.

01:00:27.458 --> 01:00:29.500
And indeed, this is what
you would use these days

01:00:29.500 --> 01:00:31.820
in the world of mobile applications.

01:00:31.820 --> 01:00:33.710
So it's very much germane there.

01:00:33.710 --> 01:00:39.760
So with SQLite, we're going to have
ultimately the ability to query data

01:00:39.760 --> 01:00:41.690
and update data, delete
data, and the like.

01:00:41.690 --> 01:00:44.080
But to do so, we actually
need a program with which

01:00:44.080 --> 01:00:46.220
to interact with our database.

01:00:46.220 --> 01:00:50.320
So the way SQLite works is
that it stores all of your data

01:00:50.320 --> 01:00:51.920
still in a file.

01:00:51.920 --> 01:00:53.500
But it's a binary file now.

01:00:53.500 --> 01:00:55.690
That is, it's a file
containing 0's and 1's.

01:00:55.690 --> 01:00:57.550
And those 0's and 1's
might represent text.

01:00:57.550 --> 01:00:58.810
They might represent numbers.

01:00:58.810 --> 01:01:01.570
But it's a more compact,
efficient representation

01:01:01.570 --> 01:01:05.290
than a mere CSV file would
be using ASCII or Unicode.

01:01:05.290 --> 01:01:06.670
So that's the first difference.

01:01:06.670 --> 01:01:10.690
SQLite uses a single
file, a binary file,

01:01:10.690 --> 01:01:14.470
to store all of your data and represent
it inside of that file by way of all

01:01:14.470 --> 01:01:18.110
of those 0's and 1's or the
tables to which I alluded before,

01:01:18.110 --> 01:01:22.180
which are the analogue in the database
world of sheets or spreadsheets

01:01:22.180 --> 01:01:23.690
in the spreadsheet world.

01:01:23.690 --> 01:01:28.940
So to interact with that binary file
wherein all of your data is stored,

01:01:28.940 --> 01:01:31.283
we need some kind of
user-facing program.

01:01:31.283 --> 01:01:32.950
And there's many different tools to use.

01:01:32.950 --> 01:01:36.970
But the standard one
that comes with SQLite

01:01:36.970 --> 01:01:40.750
is called sqlite3, essentially
version 3 of the tool.

01:01:40.750 --> 01:01:44.050
This is a command line tool similar
in spirit to any of the commands

01:01:44.050 --> 01:01:46.000
you've run in a terminal
window thus far that

01:01:46.000 --> 01:01:50.395
allows you to open up that binary file
and interact with all of your tables.

01:01:50.395 --> 01:01:53.020
Now, here again, we kind of have
a chicken and the egg problem.

01:01:53.020 --> 01:01:56.290
If I want to use a database
but I don't yet have a database

01:01:56.290 --> 01:01:58.570
and yet I want to select
data from my database,

01:01:58.570 --> 01:02:00.040
how do I actually load things in?

01:02:00.040 --> 01:02:04.130
Well, you can load data into a
SQLite database in at least two ways.

01:02:04.130 --> 01:02:06.490
One, which I'll do in
a moment, you can just

01:02:06.490 --> 01:02:10.480
import an existing flat-file
database, like a CSV.

01:02:10.480 --> 01:02:15.640
And what you do is you save the CSV
on your Mac or PC on your CS50 IDE.

01:02:15.640 --> 01:02:18.100
You run a special command with sqlite3.

01:02:18.100 --> 01:02:21.430
And it will just load
the CSV into memory.

01:02:21.430 --> 01:02:23.620
It will figure out where
all of the commas are.

01:02:23.620 --> 01:02:28.510
And it will construct inside of that
binary file the corresponding rows

01:02:28.510 --> 01:02:31.360
and columns using the
appropriate 0's and 1's

01:02:31.360 --> 01:02:32.810
to store all of that information.

01:02:32.810 --> 01:02:35.410
So it just imports it
for you automatically.

01:02:35.410 --> 01:02:39.310
Approach 2 would be to actually
write code in a language like Python

01:02:39.310 --> 01:02:44.290
or any other that actually
manually inserts all of the data

01:02:44.290 --> 01:02:45.155
into your database.

01:02:45.155 --> 01:02:46.280
And we'll do that, as well.

01:02:46.280 --> 01:02:47.290
But let's start simple.

01:02:47.290 --> 01:02:51.070
Let me go ahead and run,
for instance, sqlite3.

01:02:51.070 --> 01:02:54.550
And this is preinstalled on CS50 IDE,
and it's not that hard to get it up

01:02:54.550 --> 01:02:56.570
and running on a Mac and PC, as well.

01:02:56.570 --> 01:02:59.860
I'm going to go ahead and run
sqlite3 in my terminal window here.

01:02:59.860 --> 01:03:00.610
And voila.

01:03:00.610 --> 01:03:03.430
You just see some very simple output.

01:03:03.430 --> 01:03:07.023
It's telling me to type .help if
I want to see some usage hints.

01:03:07.023 --> 01:03:09.190
But I know most of the
commands, and we'll generally

01:03:09.190 --> 01:03:11.232
give you all of the commands
that you might need.

01:03:11.232 --> 01:03:15.760
In fact, one of the commands that we can
use is .mode, and another is .import.

01:03:15.760 --> 01:03:18.100
So generally, you won't
use these that frequently.

01:03:18.100 --> 01:03:21.670
You'll only use them when creating
a database for the first time when

01:03:21.670 --> 01:03:25.002
you are creating that database
from an existing CSV file.

01:03:25.002 --> 01:03:26.710
And indeed, that's my
goal at the moment.

01:03:26.710 --> 01:03:30.610
Let me take our CSV file containing
all of your favorite TV shows

01:03:30.610 --> 01:03:35.650
and load it into SQLite in
a proper relational database

01:03:35.650 --> 01:03:39.460
so that we can do better
than, for instance, big O of n

01:03:39.460 --> 01:03:42.730
when it comes to searching that
data and doing anything else on it.

01:03:42.730 --> 01:03:44.960
So to do this, I have
to execute two commands.

01:03:44.960 --> 01:03:48.280
One, I need to put SQLite into CSV mode.

01:03:48.280 --> 01:03:51.010
And that's just to distinguish
it from other flat-file formats,

01:03:51.010 --> 01:03:53.890
like TSV for tabs or some other format.

01:03:53.890 --> 01:03:56.230
And now I'm going to go
ahead and run .import.

01:03:56.230 --> 01:03:59.920
Then I have to specify the name of
the file to import, which is the CSV.

01:03:59.920 --> 01:04:03.490
And I'm going to go ahead
and call my table shows.

01:04:03.490 --> 01:04:08.500
So .import takes two arguments, the
name of the file that you want to import

01:04:08.500 --> 01:04:12.430
and the name of the table that you
want to create out of that file.

01:04:12.430 --> 01:04:14.680
And again, tables have rows and columns.

01:04:14.680 --> 01:04:18.280
And the commas in the file
are going to delineate

01:04:18.280 --> 01:04:20.290
where those columns begin and end.

01:04:20.290 --> 01:04:21.790
I'm going to go ahead and hit Enter.

01:04:21.790 --> 01:04:24.670
It looks like it flew by pretty fast.

01:04:24.670 --> 01:04:26.560
Nothing seems to have happened.

01:04:26.560 --> 01:04:30.940
But I think that's OK, because now we're
going to go ahead and have the ability

01:04:30.940 --> 01:04:32.710
to actually manipulate that data.

01:04:32.710 --> 01:04:34.630
But how do we manipulate the data?

01:04:34.630 --> 01:04:36.070
We need a new language.

01:04:36.070 --> 01:04:42.280
SQL, Structured Query Language, is the
language used by SQLites and Oracle

01:04:42.280 --> 01:04:45.220
and MySQL and PostgreSQL and
bunches of other products

01:04:45.220 --> 01:04:48.040
whose names you don't need to
know or remember any time soon.

01:04:48.040 --> 01:04:53.260
But SQL is the language we'll use to
query the database for information

01:04:53.260 --> 01:04:54.620
and do something with it.

01:04:54.620 --> 01:04:57.920
Generally speaking, a relational
database and, in turn,

01:04:57.920 --> 01:05:02.480
SQL, which is a language via which you
can interact with relational databases,

01:05:02.480 --> 01:05:04.910
support four fundamental operations.

01:05:04.910 --> 01:05:08.090
And they're sort of a crude
acronym, pun intended,

01:05:08.090 --> 01:05:11.960
that is just helpful for remembering
what those fundamental operations are

01:05:11.960 --> 01:05:13.190
with relational databases.

01:05:13.190 --> 01:05:19.220
CRUD stands for Create,
Read, Update, and Delete.

01:05:19.220 --> 01:05:21.800
And indeed, the acronym
is CRUD, C-R-U-D.

01:05:21.800 --> 01:05:25.040
So it helps you remember that the
four basic operations supported by any

01:05:25.040 --> 01:05:28.590
relational database are
create, read, update, delete.

01:05:28.590 --> 01:05:30.710
"Create" means to
create or add new data.

01:05:30.710 --> 01:05:34.550
"Read" means to access and
load into memory new data.

01:05:34.550 --> 01:05:36.710
We've seen read before
with opening files.

01:05:36.710 --> 01:05:39.140
"Update" and "delete" mean
exactly that, as well,

01:05:39.140 --> 01:05:41.450
if you want to manipulate
the data in your data set.

01:05:41.450 --> 01:05:44.530
Now, those are generic terms
for any relational database.

01:05:44.530 --> 01:05:48.200
Those are the four properties typically
supported by any relational database.

01:05:48.200 --> 01:05:53.490
In the world of SQL, there are some
very specific commands or functions,

01:05:53.490 --> 01:05:58.550
if you will, that implement
those four functionalities.

01:05:58.550 --> 01:06:00.980
They are create and insert--

01:06:00.980 --> 01:06:03.620
achieve the same thing
as create more generally.

01:06:03.620 --> 01:06:07.850
The keyword "select" is what's
used to read data from a database.

01:06:07.850 --> 01:06:09.460
Update and delete are the same.

01:06:09.460 --> 01:06:11.210
So it's kind of an
annoying inconsistency.

01:06:11.210 --> 01:06:14.803
The acronym or the term of art is
CRUD, Create, Read, Update, Delete.

01:06:14.803 --> 01:06:16.970
But in the world of SQL,
the authors of the language

01:06:16.970 --> 01:06:20.030
decided to implement
those four ideas by way

01:06:20.030 --> 01:06:24.760
of these five keywords or functions or
commands, if you will, in the language

01:06:24.760 --> 01:06:25.260
SQL.

01:06:25.260 --> 01:06:28.800
So what you are looking at
are five of the keywords

01:06:28.800 --> 01:06:32.990
that you can use in this new language
called SQL to actually do something

01:06:32.990 --> 01:06:33.980
with your database.

01:06:33.980 --> 01:06:35.070
Now, what does that mean?

01:06:35.070 --> 01:06:37.190
Well, suppose that you
wanted to manually create

01:06:37.190 --> 01:06:38.960
a database for the very first time.

01:06:38.960 --> 01:06:39.585
What do you do?

01:06:39.585 --> 01:06:42.752
Well, back in the world of spreadsheets,
it's pretty straightforward, right?

01:06:42.752 --> 01:06:44.300
You'd open up Google Spreadsheets.

01:06:44.300 --> 01:06:46.370
You go to File, New or whatever.

01:06:46.370 --> 01:06:48.350
And then you just, voila,
get a new spreadsheet

01:06:48.350 --> 01:06:51.170
into which you can start creating
rows and columns and the like.

01:06:51.170 --> 01:06:53.840
In Microsoft Excel, Apple
Numbers, same thing--

01:06:53.840 --> 01:06:57.840
File menu, New Spreadsheet or whatever,
and boom, you have a new spreadsheet.

01:06:57.840 --> 01:07:00.860
Now, in the world of SQL,
SQL databases are generally

01:07:00.860 --> 01:07:02.840
meant to be interacted with code.

01:07:02.840 --> 01:07:05.930
However, there are Graphical
User Interfaces, GUIs, by which

01:07:05.930 --> 01:07:07.430
you can interact with them, as well.

01:07:07.430 --> 01:07:11.600
But we're going to use code today to
do so and programs at a command line.

01:07:11.600 --> 01:07:17.120
It turns out that you can
create tables programmatically

01:07:17.120 --> 01:07:19.530
by running a command like this.

01:07:19.530 --> 01:07:24.320
So if you literally type out syntax
along the lines of CREATE TABLE, then

01:07:24.320 --> 01:07:27.230
the name of your table,
indicated here in lowercase,

01:07:27.230 --> 01:07:31.490
then a parenthesis, then the name of
your column that you want to create

01:07:31.490 --> 01:07:36.190
and the type of that column,
a la C, and then comma, dot,

01:07:36.190 --> 01:07:39.050
dot, dot, some more
columns, this is generally

01:07:39.050 --> 01:07:43.350
speaking the syntax you'll use
to create in this language called

01:07:43.350 --> 01:07:44.802
SQL a new table.

01:07:44.802 --> 01:07:46.010
Now, this is in the abstract.

01:07:46.010 --> 01:07:49.077
Again, table in lowercase is
meant to represent the name

01:07:49.077 --> 01:07:50.660
you want to give to your actual table.

01:07:50.660 --> 01:07:52.580
column in lowercase is
meant to be the name

01:07:52.580 --> 01:07:54.080
you want to give to your own column.

01:07:54.080 --> 01:07:54.788
Maybe it's Title.

01:07:54.788 --> 01:07:55.567
Maybe it's Genres.

01:07:55.567 --> 01:07:57.400
And dot, dot, dot just
means, of course, you

01:07:57.400 --> 01:07:59.100
can have even more columns than that.

01:07:59.100 --> 01:08:02.990
But literally in a moment, if I
were to type in this kind of command

01:08:02.990 --> 01:08:06.500
into the terminal window after
running the sqlite3 program,

01:08:06.500 --> 01:08:09.980
I could start creating one
or more tables for myself.

01:08:09.980 --> 01:08:12.920
And in fact, that's what
already happened for me.

01:08:12.920 --> 01:08:15.560
This .import command,
which is not part of SQL--

01:08:15.560 --> 01:08:19.579
this is the equivalent of a Menu
option in Excel or Google Spreadsheets.

01:08:19.579 --> 01:08:22.729
.import just automates a
certain process for me.

01:08:22.729 --> 01:08:24.319
And what it did for me is this.

01:08:24.319 --> 01:08:28.609
If I type now .schema, which is
another SQLite-specific command--

01:08:28.609 --> 01:08:32.479
anything that starts with a .
is specific only to sqlite3,

01:08:32.479 --> 01:08:34.250
this terminal window program.

01:08:34.250 --> 01:08:36.830
Notice what's outputted is this.

01:08:36.830 --> 01:08:44.420
By running .import that automatically
for me created a table in my database

01:08:44.420 --> 01:08:46.010
called shows.

01:08:46.010 --> 01:08:47.540
And it gave it three columns--

01:08:47.540 --> 01:08:50.060
Timestamp, title, and genres.

01:08:50.060 --> 01:08:52.529
Where did those column names come from?

01:08:52.529 --> 01:08:55.220
Well, they came from the
very first line in the CSV.

01:08:55.220 --> 01:08:58.850
And they all looked like text,
so the type of those values

01:08:58.850 --> 01:09:02.490
was just assumed to be text, text, text.

01:09:02.490 --> 01:09:05.090
Now, to be clear, I could
have manually type this out,

01:09:05.090 --> 01:09:08.359
created these three columns in
a new table called shows for me.

01:09:08.359 --> 01:09:11.870
But again, the .import command
just automated that from a CSV.

01:09:11.870 --> 01:09:17.370
But the SQL is what we see here,
CREATE TABLE shows and so forth.

01:09:17.370 --> 01:09:22.609
So that is to say now, in this
database, there is a file--

01:09:22.609 --> 01:09:27.500
or rather, there is a
table called shows inside

01:09:27.500 --> 01:09:29.630
of which is all of the
data from that CSV.

01:09:29.630 --> 01:09:31.580
How do I actually get at that data?

01:09:31.580 --> 01:09:33.830
Well, it turns out there's
other commands were called.

01:09:33.830 --> 01:09:37.430
Not just CREATE, but also
SELECT, it turns out.

01:09:37.430 --> 01:09:40.850
SELECT is the equivalent of read,
getting data from the database.

01:09:40.850 --> 01:09:42.590
And this one is pretty powerful.

01:09:42.590 --> 01:09:45.950
And the reason that so many data
scientists and statisticians

01:09:45.950 --> 01:09:48.290
use and like using languages like SQL--

01:09:48.290 --> 01:09:51.620
they make it relatively easy to
just get data and filter that data

01:09:51.620 --> 01:09:55.380
and analyze that data using
new syntax for us today,

01:09:55.380 --> 01:09:58.940
but relatively simple syntax
relative to other things we've seen.

01:09:58.940 --> 01:10:03.590
The SELECT command in SQL lets
you select one or more columns

01:10:03.590 --> 01:10:06.710
from your table by the given name.

01:10:06.710 --> 01:10:10.040
So we'll see this now
in just a moment here.

01:10:10.040 --> 01:10:11.460
How might I go about doing this?

01:10:11.460 --> 01:10:15.170
Well, let me go ahead and now, at my
prompt after just clearing the window

01:10:15.170 --> 01:10:17.340
to keep things neat,
let me try this out.

01:10:17.340 --> 01:10:26.090
Let me go ahead and SELECT,
let's say, title FROM shows;.

01:10:26.090 --> 01:10:27.290
So why am I doing this?

01:10:27.290 --> 01:10:29.800
Well, again, the conventional
format for the SELECT command

01:10:29.800 --> 01:10:33.400
is to say SELECT, then the name
of one or more columns, then

01:10:33.400 --> 01:10:37.240
literally the preposition FROM, and then
the name of the table from which you

01:10:37.240 --> 01:10:38.840
want to select that data.

01:10:38.840 --> 01:10:43.390
So if my table is called shows
and the column is called title,

01:10:43.390 --> 01:10:46.930
it stands to reason that SELECT
title FROM shows should give me

01:10:46.930 --> 01:10:48.100
back the data I want.

01:10:48.100 --> 01:10:50.080
Now, notice a couple
of stylistic choices

01:10:50.080 --> 01:10:52.630
that aren't strictly
required but are good style.

01:10:52.630 --> 01:10:56.470
Conventionally, I would
capitalize any SQL keywords,

01:10:56.470 --> 01:10:59.470
including SELECT and FROM
in this case, and then

01:10:59.470 --> 01:11:03.610
lowercase anything that's a
column name or a table name,

01:11:03.610 --> 01:11:07.023
assuming you created those columns
and tables in, in fact, lowercase.

01:11:07.023 --> 01:11:08.690
There's different conventions out there.

01:11:08.690 --> 01:11:09.815
Some people will uppercase.

01:11:09.815 --> 01:11:12.950
Some people will use something called
camel case or snake case or the like.

01:11:12.950 --> 01:11:15.220
But generally speaking, I
would encourage all caps

01:11:15.220 --> 01:11:19.180
for SQL syntax and lowercase
for the column and table names.

01:11:19.180 --> 01:11:21.190
I'm going to go ahead now and hit Enter.

01:11:21.190 --> 01:11:22.060
And voila.

01:11:22.060 --> 01:11:26.950
We see rapidly a whole list of
values outputted from the database.

01:11:26.950 --> 01:11:30.790
And if you think way back, you
might recognize that this actually

01:11:30.790 --> 01:11:35.200
happens to be the same order
as before, because the CSV

01:11:35.200 --> 01:11:39.010
file was loaded top to bottom
into this same database table.

01:11:39.010 --> 01:11:42.370
And so what we're seeing, in fact,
is all of that same data, duplicates

01:11:42.370 --> 01:11:46.030
and miscapitalizations
and weird spacing and all.

01:11:46.030 --> 01:11:48.790
But suppose I want to see
all of the data from the CSV.

01:11:48.790 --> 01:11:51.160
Well, it turns out you can
select multiple columns.

01:11:51.160 --> 01:11:54.478
You can select not only title, but
maybe timestamp was of interest.

01:11:54.478 --> 01:11:56.770
And this one admittedly was
capitalized, because that's

01:11:56.770 --> 01:11:58.360
what it was in the spreadsheet.

01:11:58.360 --> 01:12:00.290
That was not something I chose manually.

01:12:00.290 --> 01:12:02.800
So if I just use a comma-separated
list of column names,

01:12:02.800 --> 01:12:04.090
notice what I can do now.

01:12:04.090 --> 01:12:07.790
It's a little hard to see for us humans,
because there's a lot going on now.

01:12:07.790 --> 01:12:10.120
But notice that in double
quotes on the left,

01:12:10.120 --> 01:12:14.170
there are all of the timestamps, which
represent the time at which you all

01:12:14.170 --> 01:12:15.490
submitted your favorite shows.

01:12:15.490 --> 01:12:19.390
And on the right of the comma,
there's another quoted string

01:12:19.390 --> 01:12:22.210
that is the title of the show
that you liked, although SQLite

01:12:22.210 --> 01:12:27.070
omits the quotes if it's just a single
word, like Friends, just by convention.

01:12:27.070 --> 01:12:29.290
In fact, if I want to
get all of the columns,

01:12:29.290 --> 01:12:31.510
turns out there's some
shorthand syntax for that.

01:12:31.510 --> 01:12:34.270
* is the so-called wild card operator.

01:12:34.270 --> 01:12:37.780
And it will get me all of the columns
from left to right in my table.

01:12:37.780 --> 01:12:38.500
And voila.

01:12:38.500 --> 01:12:44.180
Now I see all of the data, including
all of the genres, as well.

01:12:44.180 --> 01:12:49.090
So now I effectively have three columns
being outputted all at once here.

01:12:49.090 --> 01:12:51.520
Well, this is not that useful thus far.

01:12:51.520 --> 01:12:53.770
In fact, all I've been doing
is really just outputting

01:12:53.770 --> 01:12:55.060
the contents of the CSV.

01:12:55.060 --> 01:12:58.990
But SQL's powerful because it comes with
other features right out of the box,

01:12:58.990 --> 01:13:02.830
somewhat similar in spirit to functions
that are built into Google Spreadsheets

01:13:02.830 --> 01:13:03.550
and Excel.

01:13:03.550 --> 01:13:06.110
But now we can use them
ultimately in our own code.

01:13:06.110 --> 01:13:09.460
So functions like AVG, COUNT,
DISTINCT, LOWER, MAX, MIN,

01:13:09.460 --> 01:13:13.540
and UPPER and bunches more, these
are all functions built into SQL

01:13:13.540 --> 01:13:19.370
that you can use as part of your query
to alter the data as it's coming back

01:13:19.370 --> 01:13:21.370
from the database-- not
permanently, but as it's

01:13:21.370 --> 01:13:25.040
coming back to you-- so that it's
in a format you actually care about.

01:13:25.040 --> 01:13:26.870
So for instance, one
of my goals earlier,

01:13:26.870 --> 01:13:29.680
was to get back just the
distinct, the unique titles.

01:13:29.680 --> 01:13:32.620
And we had to write all that
annoying code using a set

01:13:32.620 --> 01:13:35.560
and then add things to the set and
then loop over it again, right?

01:13:35.560 --> 01:13:37.180
That was not a huge amount of code.

01:13:37.180 --> 01:13:40.840
But it definitely took us, what, 5, 10
minutes to get the job done at least.

01:13:40.840 --> 01:13:43.780
In SQL, you can do all
of that in one breath.

01:13:43.780 --> 01:13:45.650
I'm going to go ahead now and do this.

01:13:45.650 --> 01:13:49.690
SELECT not just title FROM shows.

01:13:49.690 --> 01:13:54.370
Let me go ahead and SELECT
DISTINCT title FROM shows.

01:13:54.370 --> 01:13:57.640
So DISTINCT, again, is an
available function in SQL

01:13:57.640 --> 01:13:58.900
that does what the name says.

01:13:58.900 --> 01:14:00.650
It's going to filter
out all of the titles

01:14:00.650 --> 01:14:02.450
to just give me the distinct ones back.

01:14:02.450 --> 01:14:08.740
So if I hit Enter now, you'll see a
similarly messy list but including--

01:14:08.740 --> 01:14:10.810
"no idea," someone
that doesn't watch TV--

01:14:10.810 --> 01:14:14.230
including an unsorted
list of those titles.

01:14:14.230 --> 01:14:18.130
So I think we can probably start to
clean this thing up as we did before.

01:14:18.130 --> 01:14:20.950
Let me go ahead and now
SELECT not just DISTINCT,

01:14:20.950 --> 01:14:23.660
but let me go ahead and
uppercase everything as well.

01:14:23.660 --> 01:14:25.970
And I can use UPPER as another function.

01:14:25.970 --> 01:14:27.580
And notice I'm just nesting things.

01:14:27.580 --> 01:14:30.247
The output of one function, as
we've seen in many languages now,

01:14:30.247 --> 01:14:31.450
can be the input to another.

01:14:31.450 --> 01:14:32.830
Let me hit Enter now.

01:14:32.830 --> 01:14:36.610
And now it's getting a little
more canonicalized, so to speak,

01:14:36.610 --> 01:14:39.190
because I'm using
capitalization for everything.

01:14:39.190 --> 01:14:43.690
But it would seem that things
still aren't really sorted.

01:14:43.690 --> 01:14:46.070
It's just the same order
in which you inputted them

01:14:46.070 --> 01:14:48.370
but without duplicates this time.

01:14:48.370 --> 01:14:51.700
So it turns out that
SQL has other syntax

01:14:51.700 --> 01:14:55.580
that we can use to make our queries
more precise and more powerful.

01:14:55.580 --> 01:14:57.640
So in addition to these
kinds of functions

01:14:57.640 --> 01:15:00.340
that you can use to alter the
data that's being shown to you

01:15:00.340 --> 01:15:04.570
and coming back, you can also use
these kinds of clauses or syntax

01:15:04.570 --> 01:15:05.800
in SQL queries.

01:15:05.800 --> 01:15:09.130
You can say WHERE, which is
the equivalent of a condition.

01:15:09.130 --> 01:15:13.600
You can say select all of this data
where something is true or false.

01:15:13.600 --> 01:15:17.440
You can say LIKE, where you can say
give me data that isn't exactly this

01:15:17.440 --> 01:15:18.520
but is like this.

01:15:18.520 --> 01:15:20.660
You can order the data by some column.

01:15:20.660 --> 01:15:23.210
You can limit the number
of rows that come back.

01:15:23.210 --> 01:15:26.850
And you can group identical
values together in some way.

01:15:26.850 --> 01:15:28.640
So let's see a few examples of this.

01:15:28.640 --> 01:15:32.055
Let me go back here and
play around now with--

01:15:32.055 --> 01:15:32.930
how about The Office?

01:15:32.930 --> 01:15:34.513
That was the one we looked at earlier.

01:15:34.513 --> 01:15:42.260
So let me go ahead and SELECT title
FROM shows WHERE title = "The Office";.

01:15:42.260 --> 01:15:48.200
So I've added this WHERE predicate, so
to speak, WHERE title = "The Office."

01:15:48.200 --> 01:15:49.190
So SQL's nice.

01:15:49.190 --> 01:15:52.670
Similar in spirit to Python,
it's more user friendly, perhaps,

01:15:52.670 --> 01:15:55.910
than C where everything kind of sort
of reads like an English sentence,

01:15:55.910 --> 01:15:58.230
even though it's a little more precise.

01:15:58.230 --> 01:15:59.880
And it's a little more succinct.

01:15:59.880 --> 01:16:01.130
Let me go ahead and hit Enter.

01:16:01.130 --> 01:16:02.120
And voila.

01:16:02.120 --> 01:16:05.850
That's how many of you
inputted The Office.

01:16:05.850 --> 01:16:08.520
But notice it's not everyone, is it?

01:16:08.520 --> 01:16:10.050
We're missing some still.

01:16:10.050 --> 01:16:14.070
It seems that I got back only
those of you who typed in literally

01:16:14.070 --> 01:16:16.710
"The Office," capital T, capital O.

01:16:16.710 --> 01:16:19.200
So what if I want to be a
little more resilient than that?

01:16:19.200 --> 01:16:23.280
Well, let me get back any rows
where you all typed in "office."

01:16:23.280 --> 01:16:26.820
Maybe you omitted the article "the."

01:16:26.820 --> 01:16:30.390
So let me go ahead and
say not title = "Office."

01:16:30.390 --> 01:16:33.780
but let me go ahead and say
where the title is like "Office."

01:16:33.780 --> 01:16:35.490
But I don't want it to just be "office."

01:16:35.490 --> 01:16:39.120
I want to allow for maybe some stuff
at the beginning, maybe some stuff

01:16:39.120 --> 01:16:39.673
at the end.

01:16:39.673 --> 01:16:42.090
And even though that seems
like a bit of an inconsistency,

01:16:42.090 --> 01:16:46.950
in the context of using LIKE,
there's another wild card character.

01:16:46.950 --> 01:16:51.390
The percent sign represents zero
or more characters to the left.

01:16:51.390 --> 01:16:55.410
And this percent sign represents
zero or more characters to the right.

01:16:55.410 --> 01:16:59.940
So it's kind of this catchall that will
now find me all titles that somewhere

01:16:59.940 --> 01:17:02.980
have O-F-F-I-C-E inside of them.

01:17:02.980 --> 01:17:04.778
And it turns out LIKE
is case insensitive,

01:17:04.778 --> 01:17:07.320
so I don't even need to worry
about capitalization with LIKE.

01:17:07.320 --> 01:17:08.610
Now let me hit Enter.

01:17:08.610 --> 01:17:09.450
And voila.

01:17:09.450 --> 01:17:10.890
Now I get back more answers.

01:17:10.890 --> 01:17:12.780
And you can really
see the messiness now.

01:17:12.780 --> 01:17:15.900
Notice up here one of
you used lowercase.

01:17:15.900 --> 01:17:18.450
That tends to be common when
typing things in quickly.

01:17:18.450 --> 01:17:21.270
One of you did it lowercase
here and then also gave

01:17:21.270 --> 01:17:23.160
us an extra white space at the end.

01:17:23.160 --> 01:17:24.900
One of you just typed in "office."

01:17:24.900 --> 01:17:27.540
One of you typed in "the office"
again with a space at the end.

01:17:27.540 --> 01:17:29.200
And so there's a lot of variation here.

01:17:29.200 --> 01:17:31.560
And that's why, when we
forced everything to uppercase

01:17:31.560 --> 01:17:34.650
and we started trimming
things, we were able to get rid

01:17:34.650 --> 01:17:37.440
of a lot of those redundancies.

01:17:37.440 --> 01:17:40.290
Well, in fact, let's go
ahead and order this now.

01:17:40.290 --> 01:17:44.040
So let me go back to selecting
the distinct uppercase title,

01:17:44.040 --> 01:17:51.060
so SELECT DISTINCT UPPER
of title FROM shows.

01:17:51.060 --> 01:17:56.220
And let me now ORDER BY, which is a
new clause, the uppercased version

01:17:56.220 --> 01:17:57.868
of title.

01:17:57.868 --> 01:17:59.910
So now notice there's a
few things going on here.

01:17:59.910 --> 01:18:01.710
But I'm just building up
more complicated queries

01:18:01.710 --> 01:18:04.260
similar to scratch, where we just
started throwing more and more puzzle

01:18:04.260 --> 01:18:05.340
pieces at a problem.

01:18:05.340 --> 01:18:10.530
I'm selecting all of the distinct
uppercase titles from the shows table.

01:18:10.530 --> 01:18:13.050
But I'm going to order
the results this time

01:18:13.050 --> 01:18:15.780
by the uppercased version of title.

01:18:15.780 --> 01:18:17.550
So everything is going to be uppercased.

01:18:17.550 --> 01:18:20.460
And then it's going to be sorted
A through Z. Hit Enter now,

01:18:20.460 --> 01:18:23.160
and now things are a little
easier to make sense of.

01:18:23.160 --> 01:18:26.970
Notice the quotes are there only when
there are multiple words in a title.

01:18:26.970 --> 01:18:29.400
Otherwise, sqlite3
doesn't bother showing us.

01:18:29.400 --> 01:18:32.190
But notice here's all the "the" shows.

01:18:32.190 --> 01:18:36.270
And if we keep scrolling up, the P's,
the N's, the M's, the L's, and so

01:18:36.270 --> 01:18:41.190
forth-- it's indeed alphabetized
thanks to using ORDER BY.

01:18:41.190 --> 01:18:41.950
All right.

01:18:41.950 --> 01:18:45.540
Well, let's start to solve more
similar problems now in SQL

01:18:45.540 --> 01:18:49.830
by writing way less code than
we did a bit ago in Python.

01:18:49.830 --> 01:18:54.780
Suppose I want to actually figure out
the counts of these most popular shows.

01:18:54.780 --> 01:18:58.050
So I want to combine all
of the identical shows

01:18:58.050 --> 01:19:00.510
and figure out all of
the corresponding counts.

01:19:00.510 --> 01:19:02.330
Well, let me go ahead and try this.

01:19:02.330 --> 01:19:07.932
Let me go ahead and SELECT again
the uppercased version of title.

01:19:07.932 --> 01:19:10.140
But I'm not going to do
DISTINCT this time, because I

01:19:10.140 --> 01:19:11.830
want to do that a little differently.

01:19:11.830 --> 01:19:13.650
I'm going to SELECT
the uppercased version

01:19:13.650 --> 01:19:16.510
of title, the COUNT of those titles--

01:19:16.510 --> 01:19:19.320
so the number of times a
given title appears, so COUNT

01:19:19.320 --> 01:19:20.610
is a new keyword now--

01:19:20.610 --> 01:19:22.080
FROM shows.

01:19:22.080 --> 01:19:25.860
But now how do I figure
out what the count is?

01:19:25.860 --> 01:19:29.700
Well, if you think about this
table as having a lot of titles--

01:19:29.700 --> 01:19:31.930
title, title, title, title, title--

01:19:31.930 --> 01:19:35.970
it would be nice to kind of group
the identical titles together

01:19:35.970 --> 01:19:42.460
and then actually count how many
such titles we grouped together.

01:19:42.460 --> 01:19:47.710
And the syntax for that is literally
to say GROUP BY UPPER(title);.

01:19:47.710 --> 01:19:51.130
This tells SQL to group all
of the uppercased titles

01:19:51.130 --> 01:19:53.860
together, kind of collapse
multiple rows into one,

01:19:53.860 --> 01:19:58.990
but keep track of the count
of titles after that collapse.

01:19:58.990 --> 01:20:01.810
Let me go ahead now and hit Enter.

01:20:01.810 --> 01:20:05.980
And you'll see, very similar to one of
the earlier Python programs we wrote,

01:20:05.980 --> 01:20:10.040
all of the titles on the left followed
by a comma, followed by the count.

01:20:10.040 --> 01:20:11.920
So one of you really
likes Tom and Jerry.

01:20:11.920 --> 01:20:14.470
One of you really likes Top Gear.

01:20:14.470 --> 01:20:17.140
If I scroll up, though, two
of you really liked The Wire.

01:20:17.140 --> 01:20:19.930
23 of you here like The
Office, although we still

01:20:19.930 --> 01:20:22.010
haven't trimmed the issue here.

01:20:22.010 --> 01:20:25.180
So we could still combine that further
by trimming whitespace if we want.

01:20:25.180 --> 01:20:27.040
But now we're getting
these kinds of counts.

01:20:27.040 --> 01:20:32.510
Well, how can I go ahead and
order this, as we did before?

01:20:32.510 --> 01:20:39.820
Let me go ahead here and
add ORDER BY COUNT of title

01:20:39.820 --> 01:20:42.010
and then hit semicolon now.

01:20:42.010 --> 01:20:45.310
And now notice, just as
in Python, everything

01:20:45.310 --> 01:20:47.800
is from smallest to largest
initially, with Game of Thrones

01:20:47.800 --> 01:20:49.180
here down on the bottom.

01:20:49.180 --> 01:20:50.360
How can I fix this?

01:20:50.360 --> 01:20:53.890
Well, it turns out if you can
order things in descending order,

01:20:53.890 --> 01:20:58.510
D-E-S-C for short instead of A-S-C,
which is the default for ascending--

01:20:58.510 --> 01:21:02.110
so if I do it in descending order, now
I'd have to scroll all the way back up

01:21:02.110 --> 01:21:07.480
to the A's, the very top, to
see where the lines begin.

01:21:07.480 --> 01:21:09.420
Whoops.

01:21:09.420 --> 01:21:13.020
If I scroll all the way back up to the
top, we'll see where all of the A words

01:21:13.020 --> 01:21:14.610
begin up here.

01:21:14.610 --> 01:21:17.252
And now if I want to--

01:21:17.252 --> 01:21:18.210
whoops, whoops, whoops.

01:21:18.210 --> 01:21:20.190
Did I do that right?

01:21:20.190 --> 01:21:20.690
Sorry.

01:21:20.690 --> 01:21:21.950
I don't want to--

01:21:21.950 --> 01:21:23.827
there we go, ORDER BY COUNT descending.

01:21:23.827 --> 01:21:26.660
Now let me go ahead and-- this is
just a little too unwieldy to see.

01:21:26.660 --> 01:21:29.035
Let me just limit myself to
the top 10 and keep it simple

01:21:29.035 --> 01:21:30.920
and only look at the top 10 values here.

01:21:30.920 --> 01:21:31.730
Voila.

01:21:31.730 --> 01:21:36.585
Now I have Game of Thrones at 33,
Friends at 26, The Office at 23--

01:21:36.585 --> 01:21:38.210
though I think I'm still missing a few.

01:21:38.210 --> 01:21:41.660
Brian, do you recall the SQL function
for trimming leading and trailing

01:21:41.660 --> 01:21:43.410
white space?

01:21:43.410 --> 01:21:44.785
BRIAN YU: I think it's just TRIM.

01:21:44.785 --> 01:21:45.660
DAVID J. MALAN: TRIM?

01:21:45.660 --> 01:21:46.340
OK.

01:21:46.340 --> 01:21:47.577
I myself did not remember.

01:21:47.577 --> 01:21:49.160
So when in doubt, google or ask Brian.

01:21:49.160 --> 01:21:51.000
So let me go ahead and fix this.

01:21:51.000 --> 01:21:55.670
Let me go ahead and SELECT uppercase
of trimming the title first.

01:21:55.670 --> 01:22:00.840
And then I'm going to GROUP BY
trimming and then uppercasing it there.

01:22:00.840 --> 01:22:02.372
And now Enter, and voila.

01:22:02.372 --> 01:22:03.080
Thank you, Brian.

01:22:03.080 --> 01:22:07.020
So now we're up to our 26 Offices here.

01:22:07.020 --> 01:22:09.110
So in short, it took us
a little while to get

01:22:09.110 --> 01:22:10.880
to this point in the story in SQL.

01:22:10.880 --> 01:22:12.020
But notice what we've done.

01:22:12.020 --> 01:22:14.210
We've taken a program
that took us a few minutes

01:22:14.210 --> 01:22:16.790
and certainly a dozen
or more lines of code.

01:22:16.790 --> 01:22:20.300
And we've distilled it into something
that, yes, is a new language

01:22:20.300 --> 01:22:22.310
but is just kind of a one liner.

01:22:22.310 --> 01:22:24.888
And once you get comfortable
with a language like SQL,

01:22:24.888 --> 01:22:27.680
especially if you're not even a
computer scientist but maybe a data

01:22:27.680 --> 01:22:31.130
scientist or an analyst of some sort
who spends a lot of their day looking

01:22:31.130 --> 01:22:33.470
at financial information
or medical information

01:22:33.470 --> 01:22:37.070
or really any data set that can
be loaded into rows and columns,

01:22:37.070 --> 01:22:41.150
once you start to speak
and read SQL as a human can

01:22:41.150 --> 01:22:44.990
you start to express some pretty
powerful queries relatively succinctly

01:22:44.990 --> 01:22:47.390
and, boom, get back your answer.

01:22:47.390 --> 01:22:50.000
And by using a command
line program, like sqlite3,

01:22:50.000 --> 01:22:53.540
you can immediately see the results
there, albeit as very simplistic text.

01:22:53.540 --> 01:22:56.690
But as mentioned, too, there's
also some graphical programs

01:22:56.690 --> 01:23:00.117
out there, free and commercial, that
also support SQL, where you can still

01:23:00.117 --> 01:23:00.950
type these commands.

01:23:00.950 --> 01:23:03.770
And then it will show it to you
in a more user friendly way, much

01:23:03.770 --> 01:23:07.790
like in Windows or
macOS would by default.

01:23:07.790 --> 01:23:16.058
So any questions now on the syntax
or capabilities of SELECT statements?

01:23:16.058 --> 01:23:17.350
BRIAN YU: One question came in.

01:23:17.350 --> 01:23:20.450
Where is the file with this
data actually being stored?

01:23:20.450 --> 01:23:21.700
DAVID J. MALAN: Good question.

01:23:21.700 --> 01:23:24.030
Where is the file actually being stored?

01:23:24.030 --> 01:23:27.460
So before quitting, I can actually
save this file as anything.

01:23:27.460 --> 01:23:30.043
I want the file extension
would typically be .db.

01:23:30.043 --> 01:23:31.960
And in fact, Brian, do
you mind just checking?

01:23:31.960 --> 01:23:34.930
What's the syntax for writing the
file manually with dot something?

01:23:34.930 --> 01:23:36.910
It would be under .help, I think.

01:23:36.910 --> 01:23:39.550
BRIAN YU: I think it's .save
followed by the name of the file.

01:23:39.550 --> 01:23:43.240
DAVID J. MALAN: .save, so I'll
call this shows.db, Enter.

01:23:43.240 --> 01:23:46.600
If I now go ahead and open up
another terminal window and type

01:23:46.600 --> 01:23:49.990
our old friend ls, you'll see
that now I have a CSV file.

01:23:49.990 --> 01:23:51.760
I have my Python file from before.

01:23:51.760 --> 01:23:54.790
And I have a new file called
shows.db, which I've created.

01:23:54.790 --> 01:24:00.910
That is the binary file that contains
the table that I've loaded dynamically

01:24:00.910 --> 01:24:04.700
in from that CSV file.

01:24:04.700 --> 01:24:08.810
Any other questions on SELECT
queries or what we can do with them?

01:24:08.810 --> 01:24:12.620
BRIAN YU: Yeah, a few people are asking
about what the runtime of this is.

01:24:12.620 --> 01:24:14.430
DAVID J. MALAN: Yeah,
really good question.

01:24:14.430 --> 01:24:15.170
What is the runtime?

01:24:15.170 --> 01:24:18.253
I'm going to come back to that question
in just a little bit if that's OK.

01:24:18.253 --> 01:24:20.960
Right now, it's admittedly big O of n.

01:24:20.960 --> 01:24:23.390
I've not actually done
anything better than we did

01:24:23.390 --> 01:24:26.090
with our CSV file or our Python code.

01:24:26.090 --> 01:24:28.040
Right now, it's still
big O of n by default.

01:24:28.040 --> 01:24:30.230
But there's going to be
a better answer to that

01:24:30.230 --> 01:24:33.030
that's going to make it
something much more logarithmic.

01:24:33.030 --> 01:24:36.687
So let me come back to that feature
when it's time to enable it.

01:24:36.687 --> 01:24:39.020
But in fact, let's start to
take some steps toward that.

01:24:39.020 --> 01:24:40.812
Because it turns out,
when loading in data,

01:24:40.812 --> 01:24:42.853
we're not always going to
have the luxury of just

01:24:42.853 --> 01:24:44.900
having one big file in
CSV format that we import,

01:24:44.900 --> 01:24:46.070
and we go about our business.

01:24:46.070 --> 01:24:47.780
We're going to have to
decide in advance how

01:24:47.780 --> 01:24:50.210
we want to store the data and
what data we want to store

01:24:50.210 --> 01:24:53.120
and what the relationships
might be across not one

01:24:53.120 --> 01:24:55.278
single table, but multiple tables.

01:24:55.278 --> 01:24:57.320
So let me go ahead and
run one other command here

01:24:57.320 --> 01:25:00.170
that actually introduces
the first of a problem.

01:25:00.170 --> 01:25:03.830
Let me go ahead and
SELECT title FROM shows

01:25:03.830 --> 01:25:07.160
WHERE genres equals,
for instance, "Comedy."

01:25:07.160 --> 01:25:08.570
That was one of the genres.

01:25:08.570 --> 01:25:11.690
And notice that we get back
a whole bunch of results.

01:25:11.690 --> 01:25:14.300
But I bet I'm missing some.

01:25:14.300 --> 01:25:16.470
I'm skimming through
this pretty quickly.

01:25:16.470 --> 01:25:19.880
But I bet I'm missing some,
because if I check if genres

01:25:19.880 --> 01:25:21.872
= "Comedy," what am I omitting?

01:25:21.872 --> 01:25:24.830
Well, those of you who checked multiple
boxes might have said something

01:25:24.830 --> 01:25:28.310
is a comedy and a drama
or comedy and romance

01:25:28.310 --> 01:25:30.800
or maybe a couple of other
permutations of genres.

01:25:30.800 --> 01:25:34.070
If I'm searching for
equality here, = "Comedy,"

01:25:34.070 --> 01:25:37.880
I'm only going to get those favorites
from you where you only said,

01:25:37.880 --> 01:25:40.250
my favorite TV show is a comedy.

01:25:40.250 --> 01:25:48.113
But what if we want to do
something like LIKE comedy instead?

01:25:48.113 --> 01:25:50.030
And we could say something
like, well, so long

01:25:50.030 --> 01:25:54.290
as the word "comedy" is in there, then
we should get back even more results.

01:25:54.290 --> 01:25:57.480
And let me stipulate that, indeed,
I now have a longer list of results.

01:25:57.480 --> 01:26:01.010
Now we have all shows where you
checked at least the Comedy box.

01:26:01.010 --> 01:26:03.770
But unfortunately, this
starts to get a little sloppy,

01:26:03.770 --> 01:26:06.410
because recall what the
Genres column looks like.

01:26:06.410 --> 01:26:07.730
SELECT.

01:26:07.730 --> 01:26:11.150
Let me SELECT genres FROM shows;.

01:26:11.150 --> 01:26:16.010
Notice that all of the genres that we
loaded into this table from the CSV

01:26:16.010 --> 01:26:20.030
file are a comma-separated
list of genres.

01:26:20.030 --> 01:26:22.070
That's just the way Google Forms did it.

01:26:22.070 --> 01:26:24.320
And that's fine for CSV purposes.

01:26:24.320 --> 01:26:28.310
That's kind of fine for SQL
purposes, but this is kind of messy.

01:26:28.310 --> 01:26:31.700
Generally speaking, storing
comma-separated lists

01:26:31.700 --> 01:26:35.840
of values in a SQL database is
not what you should be doing.

01:26:35.840 --> 01:26:41.030
The whole point of using a SQL database
is to move away from commas and CSVs

01:26:41.030 --> 01:26:42.860
and to actually store
things more cleanly.

01:26:42.860 --> 01:26:45.920
Because in fact, let
me propose a problem.

01:26:45.920 --> 01:26:50.540
Suppose I want to search not
for comedy but maybe also

01:26:50.540 --> 01:26:55.520
music, like this, thereby allowing
me to find any shows where

01:26:55.520 --> 01:26:59.990
the word "music" is somewhere
in the comma-separated list.

01:26:59.990 --> 01:27:01.940
There's a subtle bug here.

01:27:01.940 --> 01:27:05.690
And you might have to think
back to where we began, the form

01:27:05.690 --> 01:27:07.910
that you pulled up.

01:27:07.910 --> 01:27:09.860
I can't show the whole
thing here, but we

01:27:09.860 --> 01:27:14.060
started with action, adventure,
animation, biography, dot, dot, dot,

01:27:14.060 --> 01:27:15.620
music.

01:27:15.620 --> 01:27:18.500
Musical was also there, so distinct.

01:27:18.500 --> 01:27:22.700
A music video versus a musical
are two different types of genres.

01:27:22.700 --> 01:27:25.250
But notice my query at the moment.

01:27:25.250 --> 01:27:26.930
What's problematic with this?

01:27:26.930 --> 01:27:31.070
At the moment, we would seem to have
a bug whereby this query will select

01:27:31.070 --> 01:27:34.370
not only "music," but also "musical."

01:27:34.370 --> 01:27:36.620
And so this is just where
things are getting messy.

01:27:36.620 --> 01:27:37.400
Now, yeah, you know what?

01:27:37.400 --> 01:27:38.810
We could kind of clean this up.

01:27:38.810 --> 01:27:43.790
Maybe we could put a comma here so
that it can't just be music something.

01:27:43.790 --> 01:27:45.410
It has to be music comma.

01:27:45.410 --> 01:27:47.840
But what if music is the
last box that you checked?

01:27:47.840 --> 01:27:49.310
Well, then it's music nothing.

01:27:49.310 --> 01:27:50.210
There is no comma.

01:27:50.210 --> 01:27:52.262
So now I need to OR things together.

01:27:52.262 --> 01:27:54.470
So maybe I have to do
something like WHERE "%Music,%"

01:27:54.470 --> 01:28:00.800
like this or OR genres
LIKE "%Music" like this.

01:28:00.800 --> 01:28:02.750
But honestly, this is
just getting messy.

01:28:02.750 --> 01:28:04.040
This is poorly designed.

01:28:04.040 --> 01:28:07.220
If you're just storing your data as a
comma-separated list of values inside

01:28:07.220 --> 01:28:11.010
of a column and you have to resort
to this kind of hack to figure out,

01:28:11.010 --> 01:28:13.130
well, maybe it's over
here or here or here,

01:28:13.130 --> 01:28:16.640
and thinking about all the permutations
of syntax, you're doing it wrong.

01:28:16.640 --> 01:28:20.130
You're not using a SQL database
to its fullest potential.

01:28:20.130 --> 01:28:22.490
So how do we go about
designing this thing better

01:28:22.490 --> 01:28:26.690
and actually load this CSV into
a database a little more cleanly?

01:28:26.690 --> 01:28:31.820
In short, how do we get rid of the
stupid commas in the Genres column

01:28:31.820 --> 01:28:36.740
and instead put one word,
"comedy" or "music" or "musical,"

01:28:36.740 --> 01:28:38.930
in each of those cells, so to speak?

01:28:38.930 --> 01:28:40.250
Not two, not three--

01:28:40.250 --> 01:28:43.820
one only without throwing
away some of those genres.

01:28:43.820 --> 01:28:46.730
Well, let me introduce a few building
blocks that will get us there.

01:28:46.730 --> 01:28:48.680
It turns out, when
creating your own tables

01:28:48.680 --> 01:28:51.260
and loading data into
a database on your own,

01:28:51.260 --> 01:28:53.375
we're going to need
more than just SELECT.

01:28:53.375 --> 01:28:55.220
SELECT, of course, is just for reading.

01:28:55.220 --> 01:28:59.330
But if we're going to do this better
and not just use sqlite3 as a built-in

01:28:59.330 --> 01:29:04.880
.import command, but instead we're
going to write some code to load all

01:29:04.880 --> 01:29:07.580
of our data into maybe two tables--

01:29:07.580 --> 01:29:10.100
one for the titles, one for the genres--

01:29:10.100 --> 01:29:15.680
we're going to need a little more
expressiveness when it comes to SQL.

01:29:15.680 --> 01:29:17.990
And so for that, we're going
to need, one, the ability

01:29:17.990 --> 01:29:19.113
to create our own tables.

01:29:19.113 --> 01:29:20.780
And we've seen a glimpse of this before.

01:29:20.780 --> 01:29:23.280
But we're also going to need
to see another piece of syntax,

01:29:23.280 --> 01:29:24.500
as well, so inserting.

01:29:24.500 --> 01:29:29.060
Inserting is another command that
you can execute on a SQL database

01:29:29.060 --> 01:29:32.720
in order to actually add data
to a database, which is great.

01:29:32.720 --> 01:29:38.630
Because if I want to ultimately iterate
over that same CSV but, this time,

01:29:38.630 --> 01:29:43.075
manually add all of the
rows to the database myself,

01:29:43.075 --> 01:29:45.200
well, then I'm going to
need some way of inserting.

01:29:45.200 --> 01:29:46.850
And the syntax for that is as follows.

01:29:46.850 --> 01:29:50.720
INSERT INTO the name of the
table, the column or columns

01:29:50.720 --> 01:29:54.890
that you want to insert values into,
then literally the word VALUES,

01:29:54.890 --> 01:29:58.787
and then literally in parentheses
again, the actual list of values.

01:29:58.787 --> 01:30:01.370
So it's a little abstract when
we see it in this generic form.

01:30:01.370 --> 01:30:06.480
But we'll see this more explicitly
in just a moment here, as well.

01:30:06.480 --> 01:30:09.483
So when it comes to inserting
something into a database,

01:30:09.483 --> 01:30:10.650
let's go ahead and try this.

01:30:10.650 --> 01:30:13.100
So suppose that-- let's see.

01:30:13.100 --> 01:30:15.080
What's a show that--

01:30:15.080 --> 01:30:15.985
The Muppet Show.

01:30:15.985 --> 01:30:17.360
I grew up loving The Muppet Show.

01:30:17.360 --> 01:30:18.650
It was out in, like, the '70s.

01:30:18.650 --> 01:30:21.680
And I don't think it was on the
list, but I can check this for sure.

01:30:21.680 --> 01:30:28.100
So SELECT * FROM shows
WHERE title LIKE--

01:30:28.100 --> 01:30:30.950
let's just search for
"muppets" with a wild card.

01:30:30.950 --> 01:30:32.500
And I'm guessing no one put it there.

01:30:32.500 --> 01:30:33.000
Good.

01:30:33.000 --> 01:30:34.320
So it's a missed opportunity.

01:30:34.320 --> 01:30:35.570
I forgot to fill out the form.

01:30:35.570 --> 01:30:37.820
I could go back and fill out
the form and re-import the CSV,

01:30:37.820 --> 01:30:39.487
but let's go ahead and do this manually.

01:30:39.487 --> 01:30:44.420
So let me go ahead and INSERT
INTO shows what columns?

01:30:44.420 --> 01:30:50.360
title and genres, and I guess I
could do a Timestamp just for kicks.

01:30:50.360 --> 01:30:52.220
And then I'm going to
insert what values?

01:30:52.220 --> 01:30:55.430
The values will be, well, I don't
know, whatever time it is now.

01:30:55.430 --> 01:30:58.460
So I'm going to cheat there rather
than look up the date and the time.

01:30:58.460 --> 01:31:01.430
The title will be "The Muppet Show."

01:31:01.430 --> 01:31:05.100
And the genres will be--
it was kind of a comedy.

01:31:05.100 --> 01:31:06.290
It was kind of a musical.

01:31:06.290 --> 01:31:08.360
So we'll kind of leave it at that.

01:31:08.360 --> 01:31:09.350
Semicolon.

01:31:09.350 --> 01:31:11.870
So again, this follows
the standard syntax here

01:31:11.870 --> 01:31:14.030
of specifying the table
you want to insert into,

01:31:14.030 --> 01:31:16.910
the columns you want to
insert into, and the values

01:31:16.910 --> 01:31:18.467
you want to put into those columns.

01:31:18.467 --> 01:31:20.300
And I'm going to go
ahead and hit Enter now.

01:31:20.300 --> 01:31:22.250
Nothing seems to have happened.

01:31:22.250 --> 01:31:28.070
But if I now select that same query--

01:31:28.070 --> 01:31:32.630
oh, OK, it's still nothing,
because I made a subtle mistake.

01:31:32.630 --> 01:31:34.700
I'm not searching for "Muppets," plural.

01:31:34.700 --> 01:31:37.250
I'm searching for "Muppet,"
singular, The Muppet Show.

01:31:37.250 --> 01:31:38.000
Voila.

01:31:38.000 --> 01:31:40.790
Now you see my row in this database.

01:31:40.790 --> 01:31:42.680
And so INSERT would
give us the ability now

01:31:42.680 --> 01:31:44.570
to insert new rows into the database.

01:31:44.570 --> 01:31:48.410
Suppose you want to update something.

01:31:48.410 --> 01:31:51.540
You know, some of the Muppet Shows
were actually pretty dramatic.

01:31:51.540 --> 01:31:52.710
So how might we do that?

01:31:52.710 --> 01:31:56.960
Well, I can say UPDATE shows SET--

01:31:56.960 --> 01:32:04.250
let's see-- genres = "Comedy,
Drama, Musical" WHERE

01:32:04.250 --> 01:32:07.910
title = "The Muppet Show."

01:32:07.910 --> 01:32:10.890
So again, I'll pull up the
canonical syntax for this in a bit.

01:32:10.890 --> 01:32:14.120
But for now, just a little teaser,
you can update things pretty simply.

01:32:14.120 --> 01:32:16.662
And even though it takes a little
getting used to the syntax,

01:32:16.662 --> 01:32:17.960
it kind of does what it says.

01:32:17.960 --> 01:32:23.250
UPDATE shows SET genres =
this WHERE title = that.

01:32:23.250 --> 01:32:24.650
And now I can go ahead and Enter.

01:32:24.650 --> 01:32:27.290
If I go ahead and select the same
thing, just like in a terminal window,

01:32:27.290 --> 01:32:28.250
you can go up and down.

01:32:28.250 --> 01:32:29.600
That's how I'm typing so quickly.

01:32:29.600 --> 01:32:31.600
I'm just going up and
down to previous commands.

01:32:31.600 --> 01:32:32.120
Voila.

01:32:32.120 --> 01:32:36.830
Now I see that the Muppet Show is
a comedy, a drama, and a musical.

01:32:36.830 --> 01:32:40.070
Well, I take issue, though, with
one of the more popular shows that

01:32:40.070 --> 01:32:40.970
was in the list.

01:32:40.970 --> 01:32:44.637
A whole bunch of you
liked, let's say, Friends,

01:32:44.637 --> 01:32:46.220
which I've never really been a fan of.

01:32:46.220 --> 01:32:53.828
And let me go ahead and SELECT title
FROM shows WHERE title = "Friends."

01:32:53.828 --> 01:32:56.120
And maybe I should be a little
more rigorous than that.

01:32:56.120 --> 01:32:59.150
I should say title LIKE
"Friends" just in case

01:32:59.150 --> 01:33:00.650
there was different capitalizations.

01:33:00.650 --> 01:33:01.460
Enter.

01:33:01.460 --> 01:33:03.148
A lot of you really liked Friends.

01:33:03.148 --> 01:33:04.190
In fact, how many of you?

01:33:04.190 --> 01:33:05.610
Well, recall that I can do this.

01:33:05.610 --> 01:33:08.900
I can say COUNT, and I can
let SQL do the count for me.

01:33:08.900 --> 01:33:10.575
26 of you, I disagree with strongly.

01:33:10.575 --> 01:33:12.950
And there's a couple of you
that even added all the dots,

01:33:12.950 --> 01:33:14.240
but we'll deal with you later.

01:33:14.240 --> 01:33:16.100
So suppose I do take issue with this.

01:33:16.100 --> 01:33:22.970
Well, DELETE FROM shows
WHERE title = "Friends"--

01:33:22.970 --> 01:33:24.390
actually, title LIKE "Friends."

01:33:24.390 --> 01:33:25.220
Let's get them all.

01:33:25.220 --> 01:33:26.090
Enter.

01:33:26.090 --> 01:33:29.450
And now if we SELECT
this again, I'm sorry.

01:33:29.450 --> 01:33:30.870
Friends has been canceled.

01:33:30.870 --> 01:33:34.910
So you can again execute these
fundamental commands of CRUD,

01:33:34.910 --> 01:33:38.630
Create Read, Update, and Delete,
by using CREATE or INSERT,

01:33:38.630 --> 01:33:41.540
by using SELECT, by
using UPDATE literally

01:33:41.540 --> 01:33:43.380
and DELETE literally, as well.

01:33:43.380 --> 01:33:44.580
And that's about it.

01:33:44.580 --> 01:33:46.580
Even though this was a
lot quickly, there really

01:33:46.580 --> 01:33:49.040
are just those four
fundamental operations in SQL

01:33:49.040 --> 01:33:53.090
plus some of these add-on features, like
these additional functions like COUNT

01:33:53.090 --> 01:33:57.420
that you can use and also some of
these keywords like WHERE and the like.

01:33:57.420 --> 01:33:59.810
Well, let me propose
that we now do better.

01:33:59.810 --> 01:34:04.580
If we have the ability to select data
and create tables and insert data,

01:34:04.580 --> 01:34:11.270
let's go ahead and write our own Python
script that uses SQL, as in a loop,

01:34:11.270 --> 01:34:16.130
to read over my CSV file and to insert,
insert, insert, insert each of the rows

01:34:16.130 --> 01:34:16.700
manually.

01:34:16.700 --> 01:34:18.408
Because honestly, it
will take me forever

01:34:18.408 --> 01:34:22.220
to manually type out hundreds of SQL
queries to import all of your rows

01:34:22.220 --> 01:34:23.390
into a new database.

01:34:23.390 --> 01:34:25.520
I want to write a program
that does this instead.

01:34:25.520 --> 01:34:29.430
And I'm going to propose that we
design it in the following way.

01:34:29.430 --> 01:34:32.720
I'm going to have two tables
this time, represented here

01:34:32.720 --> 01:34:34.190
with this artist's rendition.

01:34:34.190 --> 01:34:36.020
One is going to be called shows.

01:34:36.020 --> 01:34:38.060
One is going to be called genres.

01:34:38.060 --> 01:34:44.270
And this is a fundamental principle
of designing relational databases,

01:34:44.270 --> 01:34:49.700
to figure out the relationships among
data and to normalize your data.

01:34:49.700 --> 01:34:53.480
To normalize your data means
to eliminate redundancies.

01:34:53.480 --> 01:34:58.520
To normalize your data means to
eliminate mentions of the same words

01:34:58.520 --> 01:35:02.320
again and again and have just single
sources of truth for your data,

01:35:02.320 --> 01:35:02.820
so to speak.

01:35:02.820 --> 01:35:04.140
So what do I mean by that?

01:35:04.140 --> 01:35:07.520
I'm going to propose that we instead
create a simpler table called

01:35:07.520 --> 01:35:10.320
shows that has just two columns.

01:35:10.320 --> 01:35:13.098
One is going to be
called id, which is new.

01:35:13.098 --> 01:35:15.140
The other is going to be
called title, as before.

01:35:15.140 --> 01:35:16.940
Honestly, I don't care about
timestamps, so we're just

01:35:16.940 --> 01:35:19.730
going to throw that value away,
which is another upside of writing

01:35:19.730 --> 01:35:20.420
our own program.

01:35:20.420 --> 01:35:23.030
We can add or remove any data we want.

01:35:23.030 --> 01:35:25.850
For id, I'm introducing
this, which is going

01:35:25.850 --> 01:35:28.490
to be a unique identifier,
literally a simple integer--

01:35:28.490 --> 01:35:31.190
1, 2, 3, all the way up
to a billion or 2 billion,

01:35:31.190 --> 01:35:33.080
however many favorites we have.

01:35:33.080 --> 01:35:35.690
I'm just going to let this
auto increment as we go.

01:35:35.690 --> 01:35:36.710
Why?

01:35:36.710 --> 01:35:42.530
I propose that we move to another
table all of the genres and that,

01:35:42.530 --> 01:35:48.350
instead of having one or two or
three or five genres in one column

01:35:48.350 --> 01:35:51.860
as a stupid comma-separated list--
which is stupid only in the sense

01:35:51.860 --> 01:35:53.180
that it's just messy, right?

01:35:53.180 --> 01:35:55.040
It means that I have
to run stupid commands

01:35:55.040 --> 01:35:57.332
where I'm checking for the
comma here, the comma there.

01:35:57.332 --> 01:35:58.850
It's very hackish, so to speak.

01:35:58.850 --> 01:36:00.080
Bad design.

01:36:00.080 --> 01:36:03.770
Instead of doing that, I'm going
to create another table that

01:36:03.770 --> 01:36:05.580
also has two columns.

01:36:05.580 --> 01:36:09.320
One is going to be called show_id, and
the other is going to be called genre.

01:36:09.320 --> 01:36:12.830
And genre here is just going
to be a single word now.

01:36:12.830 --> 01:36:16.340
That column will contain
single words for genres,

01:36:16.340 --> 01:36:19.400
like "comedy" or "music" or "musical."

01:36:19.400 --> 01:36:23.570
But we're going to associate
all of those genres

01:36:23.570 --> 01:36:27.470
with the original show to which
they belong, per your Google form

01:36:27.470 --> 01:36:31.500
submissions, by using this show_id here.

01:36:31.500 --> 01:36:33.290
So what does this mean in particular?

01:36:33.290 --> 01:36:37.370
By adding to our first table,
shows, this unique identifier--

01:36:37.370 --> 01:36:39.080
1, 2, 3, 4, 5, 6--

01:36:39.080 --> 01:36:44.630
I can now refer to that same show
in a very efficient way using

01:36:44.630 --> 01:36:46.940
a very simple number
instead of redundantly

01:36:46.940 --> 01:36:49.730
having The Office, The Office,
The Office again and again.

01:36:49.730 --> 01:36:52.280
I can refer to it by just
one canonical number, which

01:36:52.280 --> 01:36:54.980
is only going to be 4 bytes or 32 bits.

01:36:54.980 --> 01:36:56.330
Pretty efficient.

01:36:56.330 --> 01:37:00.920
But I can still associate that
show with one genre or two or three

01:37:00.920 --> 01:37:03.210
or more or even none.

01:37:03.210 --> 01:37:07.610
So in this way, every
row in our current table

01:37:07.610 --> 01:37:12.860
is going to become one or more
rows in our new pair of tables.

01:37:12.860 --> 01:37:15.560
We're factoring out
the genres so that we

01:37:15.560 --> 01:37:20.270
can add multiple rows for every
show, potentially, but still

01:37:20.270 --> 01:37:25.050
remap those genres back to
the original show itself.

01:37:25.050 --> 01:37:27.890
So what is some of the buzzwords here?

01:37:27.890 --> 01:37:31.070
What's some of the language
to be familiar with?

01:37:31.070 --> 01:37:35.090
Well, we need to know what kinds
of types are at our disposal here.

01:37:35.090 --> 01:37:37.250
So for that, let me propose this.

01:37:37.250 --> 01:37:41.300
Let me propose that we
have this list here.

01:37:41.300 --> 01:37:44.590
It turns out, in SQLite, there
are five main data types.

01:37:44.590 --> 01:37:46.340
And that's a bit of
an oversimplification,

01:37:46.340 --> 01:37:49.430
but there's five main data types,
some of which look familiar,

01:37:49.430 --> 01:37:51.410
a couple of which are a little weird.

01:37:51.410 --> 01:37:53.810
INTEGER is a thing.

01:37:53.810 --> 01:37:55.910
REAL is the same thing as float.

01:37:55.910 --> 01:38:00.080
So an integer might be a 32-bit or
4-byte value, like 1, 2, 3, or 4,

01:38:00.080 --> 01:38:01.130
positive or negative.

01:38:01.130 --> 01:38:03.213
Real number's going to
have a decimal point in it,

01:38:03.213 --> 01:38:05.570
a floating point value,
probably 32 bits by default.

01:38:05.570 --> 01:38:08.240
But those kinds of things,
the sizes of these types,

01:38:08.240 --> 01:38:10.430
vary by system, just
like they technically

01:38:10.430 --> 01:38:13.760
did in C. So do they vary by
system in the world of SQL.

01:38:13.760 --> 01:38:16.010
But generally speaking, these
are good rules of thumb.

01:38:16.010 --> 01:38:16.970
TEXT is just that.

01:38:16.970 --> 01:38:19.820
It's sort of the equivalent
of a string of some length.

01:38:19.820 --> 01:38:22.362
But then in SQLite, it turns
out there's two other data

01:38:22.362 --> 01:38:23.570
types we've not seen before--

01:38:23.570 --> 01:38:25.010
NUMERIC and BLOB.

01:38:25.010 --> 01:38:26.750
But more on those in just a little bit.

01:38:26.750 --> 01:38:28.370
BLOB is Binary Large Object.

01:38:28.370 --> 01:38:30.860
It means you can store 0's
and 1's in your database.

01:38:30.860 --> 01:38:34.670
NUMERIC is going to be something that's
number-like but isn't a number per se.

01:38:34.670 --> 01:38:38.360
It's like a year or a time,
something that has numbers, but isn't

01:38:38.360 --> 01:38:40.730
just a simple integer at that.

01:38:40.730 --> 01:38:44.210
And let me propose, too, that SQLite
is going to allow us to specify, too,

01:38:44.210 --> 01:38:49.520
when we create our own columns manually
by executing the SQL code ourselves,

01:38:49.520 --> 01:38:52.430
we can specify that a
column cannot be null.

01:38:52.430 --> 01:38:53.840
Thus far, we've ignored this.

01:38:53.840 --> 01:38:56.090
But some of you might
have taken the fifth

01:38:56.090 --> 01:38:58.850
and just not given us the
title of a show or a genre.

01:38:58.850 --> 01:39:01.020
Your answers might be blank.

01:39:01.020 --> 01:39:03.020
Some of you, maybe in
registering for a website,

01:39:03.020 --> 01:39:06.170
don't want to provide information like
where you live or your phone number.

01:39:06.170 --> 01:39:10.190
So a database in general sometimes
does want to support null values.

01:39:10.190 --> 01:39:12.290
But you might want to say
that it can't be null.

01:39:12.290 --> 01:39:14.390
A website probably needs
your email address,

01:39:14.390 --> 01:39:18.570
needs your password and a few
other fields, but not everything.

01:39:18.570 --> 01:39:22.250
And there's another keyword in SQL, just
so you've seen it, called UNIQUE, where

01:39:22.250 --> 01:39:25.460
you can additionally say that
whatever values are in this column

01:39:25.460 --> 01:39:26.520
must be unique.

01:39:26.520 --> 01:39:28.670
So a website might also use that.

01:39:28.670 --> 01:39:31.910
If you want to make sure that the
same email address can't register

01:39:31.910 --> 01:39:33.830
for your website
multiple times, you just

01:39:33.830 --> 01:39:36.020
specify that the email column is unique.

01:39:36.020 --> 01:39:40.370
That way, you can't put multiple people
in with identical email addresses.

01:39:40.370 --> 01:39:44.060
So long story short, this is just
more of the tools in our SQL toolkit,

01:39:44.060 --> 01:39:46.280
because we'll see some
of these now indirectly.

01:39:46.280 --> 01:39:49.670
And the last piece of jargon we
need before designing our own tables

01:39:49.670 --> 01:39:51.150
is going to be this.

01:39:51.150 --> 01:39:54.110
It turns out that, in
SQL, there's this notion

01:39:54.110 --> 01:39:56.270
of primary keys and foreign keys.

01:39:56.270 --> 01:39:59.390
And we've not seen this in spreadsheets.

01:39:59.390 --> 01:40:02.150
Unless you've been working in
the real world for some years

01:40:02.150 --> 01:40:04.400
and you have fairly fancy
spreadsheets in front of you

01:40:04.400 --> 01:40:06.380
as an analyst or financial
person or the like,

01:40:06.380 --> 01:40:11.750
odds are you've not seen keys or unique
identifiers in quite the same way.

01:40:11.750 --> 01:40:13.170
But they're relatively simple.

01:40:13.170 --> 01:40:17.390
In fact, let me go back to
our picture before and propose

01:40:17.390 --> 01:40:21.230
that when you have two
tables like this and you

01:40:21.230 --> 01:40:25.790
want to use a simple integer to
uniquely identify all of the rows in one

01:40:25.790 --> 01:40:28.395
of the tables, that's
called technically an ID.

01:40:28.395 --> 01:40:30.020
That's what I'll call it by convention.

01:40:30.020 --> 01:40:33.770
You could call it anything you want, but
ID just means it's a unique identifier.

01:40:33.770 --> 01:40:37.470
But semantically, this ID is
what's called a primary key.

01:40:37.470 --> 01:40:43.940
A primary key is the column in a table
that uniquely identifies every row.

01:40:43.940 --> 01:40:46.820
This means you can have
multiple versions of The Office

01:40:46.820 --> 01:40:48.860
in that title field.

01:40:48.860 --> 01:40:52.490
But each of those rows is going to have
its own number uniquely, potentially.

01:40:52.490 --> 01:40:56.630
So primary key uniquely
identifies each row.

01:40:56.630 --> 01:41:01.550
In another table, like genres, which I'm
proposing we create in just a moment,

01:41:01.550 --> 01:41:06.770
it turns out that you're welcome
to refer back to another table

01:41:06.770 --> 01:41:09.260
by way of that unique identifier.

01:41:09.260 --> 01:41:13.710
But when it's in this context,
that ID is called a foreign key.

01:41:13.710 --> 01:41:16.130
So even though I've
called it show_id here,

01:41:16.130 --> 01:41:18.470
that's just a convention
in a lot of SQL databases

01:41:18.470 --> 01:41:23.030
to imply that this is technically
a column called ID in a table

01:41:23.030 --> 01:41:26.760
called show or shows,
plural in this case.

01:41:26.760 --> 01:41:29.900
So if there's a number
1 here, and suppose

01:41:29.900 --> 01:41:34.190
that The Office has a
unique ID of 1, we would

01:41:34.190 --> 01:41:38.420
have a row in this table called
id is 1, title is The Office.

01:41:38.420 --> 01:41:43.730
The Office might be in the comedy
category, the drama category,

01:41:43.730 --> 01:41:46.400
the romance category, so multiple ones.

01:41:46.400 --> 01:41:51.050
Therefore, in the genres table,
we want to output three rows,

01:41:51.050 --> 01:41:56.150
the number 1, 1, 1 in each of
those rows but the words "comedy,"

01:41:56.150 --> 01:42:00.450
"drama," "romance" in each
of those rows respectively.

01:42:00.450 --> 01:42:03.620
So again, the goal here is to just
design our database better, not

01:42:03.620 --> 01:42:08.120
have these stupid comma-separated lists
of values inside of a single column.

01:42:08.120 --> 01:42:12.980
We want to kind of blow that up,
explode it, into individual rows.

01:42:12.980 --> 01:42:15.710
You might think, well, why don't
we just use multiple columns?

01:42:15.710 --> 01:42:18.560
But again, per our
principle from spreadsheets,

01:42:18.560 --> 01:42:21.650
you should not be in the habit of
adding more and more columns when

01:42:21.650 --> 01:42:25.190
the data is all the same, like
genre, genre, genre, right?

01:42:25.190 --> 01:42:27.410
The stupid way to do this
in the spreadsheet world

01:42:27.410 --> 01:42:29.660
would be to have one
column called Genre 1,

01:42:29.660 --> 01:42:34.100
another column called Genre 2, another
column called Genre 3, Genre 4.

01:42:34.100 --> 01:42:37.340
And you can imagine just how
stupid and inefficient this is.

01:42:37.340 --> 01:42:41.510
A lot of those columns are going to be
empty for shows with very few genres.

01:42:41.510 --> 01:42:43.770
And it's just kind of
messy at that point.

01:42:43.770 --> 01:42:47.030
So better, in the world
of relational databases,

01:42:47.030 --> 01:42:51.350
to have something like a second table,
where you have multiple rows that

01:42:51.350 --> 01:42:55.700
somehow link back to that primary
key by way of what we're calling,

01:42:55.700 --> 01:42:58.440
conceptually, a foreign key.

01:42:58.440 --> 01:42:59.090
All right.

01:42:59.090 --> 01:43:01.640
So let's go ahead now and
try to write this code.

01:43:01.640 --> 01:43:03.710
Let me go back to my IDE.

01:43:03.710 --> 01:43:07.850
Let me quit out of SQLite now.

01:43:07.850 --> 01:43:10.640
And let me just move away.

01:43:10.640 --> 01:43:15.402
I'm going to move this away,
my file, for just a moment

01:43:15.402 --> 01:43:17.360
so that we're only left
with our original data.

01:43:17.360 --> 01:43:21.680
Let's go about implementing a final
version of my Python file that does

01:43:21.680 --> 01:43:23.540
this-- creates two tables--

01:43:23.540 --> 01:43:26.270
one called shows, one called genres--

01:43:26.270 --> 01:43:30.200
and then, two, in a for
loop, iterates over that CSV

01:43:30.200 --> 01:43:34.490
and inserts some data into the shows
and other data into the genres.

01:43:34.490 --> 01:43:36.350
How can we do this programmatically?

01:43:36.350 --> 01:43:38.720
Well, there's a final piece
of the puzzle that we need.

01:43:38.720 --> 01:43:41.912
We need some way of bridging
the world of Python and SQL.

01:43:41.912 --> 01:43:44.120
And here, we do need a
library, because it would just

01:43:44.120 --> 01:43:46.700
be way too painful to
do without a library.

01:43:46.700 --> 01:43:47.540
It can be CS50.

01:43:47.540 --> 01:43:50.730
CS50, as we'll see,
makes this very simple.

01:43:50.730 --> 01:43:53.480
There are other third-party
commercial and open-source libraries

01:43:53.480 --> 01:43:56.522
that you can also use in the real
world, as well, that do the same thing.

01:43:56.522 --> 01:43:58.670
But the syntax is a
little less friendly,

01:43:58.670 --> 01:44:01.880
so we'll start by using the CS50
library, which in Python, recall,

01:44:01.880 --> 01:44:04.330
has functions like get_string
and get_int and get_float.

01:44:04.330 --> 01:44:10.430
But today, it also has support, it turns
out, for SQL capabilities, as well.

01:44:10.430 --> 01:44:12.760
So I'm going to go back
to my Favorites file.

01:44:12.760 --> 01:44:15.970
And I'm going to import
not only CSV, but I'm also

01:44:15.970 --> 01:44:21.310
going to import from the CS50
library a feature called SQL.

01:44:21.310 --> 01:44:25.930
So we have a variable, if you
will, inside of the CS50 library

01:44:25.930 --> 01:44:28.600
or, rather, a function
inside of the CS50 library

01:44:28.600 --> 01:44:31.870
called SQL that, if I
call it, will allow me

01:44:31.870 --> 01:44:35.270
to load a SQLite database into memory.

01:44:35.270 --> 01:44:36.290
So how do I do this?

01:44:36.290 --> 01:44:38.790
Well, let me go ahead and add
a couple of new lines of code.

01:44:38.790 --> 01:44:45.340
Let me go ahead and open
up a file called shows.db,

01:44:45.340 --> 01:44:47.055
but this time in write mode.

01:44:47.055 --> 01:44:49.180
And then just for kicks--
just for now, rather, I'm

01:44:49.180 --> 01:44:50.930
going to go ahead and
close it right away.

01:44:50.930 --> 01:44:54.260
This is a Pythonic way of
creating an empty file.

01:44:54.260 --> 01:44:58.210
It's kind of stupid looking, but
by opening a file called shows.db

01:44:58.210 --> 01:45:00.790
in write mode and then
immediately closing it,

01:45:00.790 --> 01:45:03.670
it has the effect of creating
the file, closing the file.

01:45:03.670 --> 01:45:06.310
So I now have an empty file
with which to interact.

01:45:06.310 --> 01:45:09.100
I could also do this, as
an aside, by doing this--

01:45:09.100 --> 01:45:10.810
touch shows.db.

01:45:10.810 --> 01:45:14.470
touch kind of a strange command,
but in a terminal window,

01:45:14.470 --> 01:45:17.870
it means to create a
file if it doesn't exist.

01:45:17.870 --> 01:45:19.450
So we could also do that instead.

01:45:19.450 --> 01:45:22.420
But that would be independent of Python.

01:45:22.420 --> 01:45:24.790
So once I've created this
file, let me go ahead

01:45:24.790 --> 01:45:28.720
and open the file now
as a SQLite database.

01:45:28.720 --> 01:45:31.600
I'm going to declare a variable
called db for database.

01:45:31.600 --> 01:45:34.930
I'm going to use the SQL
function from CS50's library.

01:45:34.930 --> 01:45:38.170
And I'm going to open via
somewhat cryptic string this--

01:45:38.170 --> 01:45:43.600
sqlite:///shows.db.

01:45:43.600 --> 01:45:48.740
Now, it looks like a URL,
http://, but it's SQLite instead.

01:45:48.740 --> 01:45:52.300
And there's three slashes
instead of the usual two.

01:45:52.300 --> 01:45:54.430
But this line of code,
line 6, has the result

01:45:54.430 --> 01:45:57.820
of opening now that otherwise
empty file with nothing

01:45:57.820 --> 01:46:04.040
in it yet as being a SQLite
database using CS50's library.

01:46:04.040 --> 01:46:05.330
Why did I do that?

01:46:05.330 --> 01:46:09.020
Well, I did that because I now
want to create my first table.

01:46:09.020 --> 01:46:12.140
Let me go ahead and execute, db.execute.

01:46:12.140 --> 01:46:16.330
So there's a function called execute
inside of the CS50 SQL library.

01:46:16.330 --> 01:46:17.980
And I'm going to go ahead and run this.

01:46:17.980 --> 01:46:23.770
CREATE TABLE called shows,
the columns of which

01:46:23.770 --> 01:46:27.430
are an id, which is going to
be an integer, a title, which

01:46:27.430 --> 01:46:33.380
is going to be text, the primary key
in which is going to be the id column.

01:46:33.380 --> 01:46:34.870
So this is a bit cryptic.

01:46:34.870 --> 01:46:36.520
But let's see what's happening.

01:46:36.520 --> 01:46:41.950
I seem to now, in line 8, be
combining Python with SQL.

01:46:41.950 --> 01:46:46.000
And this is where now programming
gets really powerful, fancy, cool,

01:46:46.000 --> 01:46:48.250
difficult, however you
want to perceive it.

01:46:48.250 --> 01:46:50.680
I can actually use one
language inside of another.

01:46:50.680 --> 01:46:51.250
How?

01:46:51.250 --> 01:46:53.420
Well, SQL is just a bunch
of textural commands.

01:46:53.420 --> 01:46:55.420
Up until now, I've been
typing them out manually

01:46:55.420 --> 01:46:57.430
in this program called SQLite3.

01:46:57.430 --> 01:47:00.010
There's nothing stopping
me, though, from storing

01:47:00.010 --> 01:47:02.830
those same commands in
Python strings and then

01:47:02.830 --> 01:47:05.890
passing them to a database using code.

01:47:05.890 --> 01:47:08.230
The code I'm using is a
function called execute.

01:47:08.230 --> 01:47:10.990
And its purpose in life,
and CS50 staff wrote this,

01:47:10.990 --> 01:47:18.950
is to pass the argument from your Python
code into the database for execution.

01:47:18.950 --> 01:47:22.510
So it's like the programmatic way
of just typing things manually

01:47:22.510 --> 01:47:25.160
at the SQLite prompt a few minutes ago.

01:47:25.160 --> 01:47:27.880
So that's going to go ahead
and create my table called

01:47:27.880 --> 01:47:30.610
shows, in which I'm going to
store all of those unique IDs

01:47:30.610 --> 01:47:32.290
and also the titles.

01:47:32.290 --> 01:47:33.670
And then let me do this again.

01:47:33.670 --> 01:47:39.040
db.execute CREATE TABLE
genres, and that's

01:47:39.040 --> 01:47:43.670
going to have a column called show_id,
which is an integer also, genre,

01:47:43.670 --> 01:47:45.340
which is text.

01:47:45.340 --> 01:47:48.130
And lastly, it's going
to have a foreign key--

01:47:48.130 --> 01:47:51.190
it's going to wrap a little long here--

01:47:51.190 --> 01:47:56.563
on show_id, which references
the shows table id.

01:47:56.563 --> 01:47:57.730
All right, so this is a lot.

01:47:57.730 --> 01:47:59.860
So let's just recap left to right.

01:47:59.860 --> 01:48:03.730
db.execute is my Python function
that executes any SQL I want.

01:48:03.730 --> 01:48:06.460
CREATE TABLE genres creates
a table called genres.

01:48:06.460 --> 01:48:10.060
The columns in that table will
be something called show_id,

01:48:10.060 --> 01:48:13.630
which is an integer, and
genre, which is a text field.

01:48:13.630 --> 01:48:17.050
But it's going to be one
genre at a time, not multiple.

01:48:17.050 --> 01:48:20.170
And then here, I'm
specifying a foreign key

01:48:20.170 --> 01:48:24.280
will be the show_id column,
which happens to refer back

01:48:24.280 --> 01:48:28.180
to the shows table's IDs column.

01:48:28.180 --> 01:48:31.480
It's a little cryptic, but all this
is doing is implementing for us

01:48:31.480 --> 01:48:33.470
the equivalent of this picture here.

01:48:33.470 --> 01:48:35.770
I could have manually
typed both of these SQL

01:48:35.770 --> 01:48:37.690
commands at that blinking prompt.

01:48:37.690 --> 01:48:39.850
But again, no, I want
to write a program now

01:48:39.850 --> 01:48:43.720
in Python that creates the tables
for me and now, more interestingly,

01:48:43.720 --> 01:48:47.583
loads the data into that database.

01:48:47.583 --> 01:48:49.000
So let's go ahead and do this now.

01:48:49.000 --> 01:48:51.100
I'm not going to select
a title from the user,

01:48:51.100 --> 01:48:52.660
because I want to import everything.

01:48:52.660 --> 01:48:54.993
I'm not going to use any
counting or anything like that.

01:48:54.993 --> 01:48:57.700
So let's go ahead and just go
inside of my loop as before.

01:48:57.700 --> 01:49:02.240
And this time, let's go
ahead and, for row in reader,

01:49:02.240 --> 01:49:05.110
let's go ahead and get the current
title, as we've always done.

01:49:05.110 --> 01:49:08.640
But let's also, as always, go
ahead and strip it of white space

01:49:08.640 --> 01:49:11.700
and capitalize it, just
to canonicalize it.

01:49:11.700 --> 01:49:15.960
And now I'm going to go ahead and
execute db.execute, quote unquote,

01:49:15.960 --> 01:49:24.707
INSERT INTO shows the title
column, the value of "title."

01:49:24.707 --> 01:49:26.040
So I want to put the title here.

01:49:26.040 --> 01:49:31.690
It turns out that SQL libraries like
ours support one final piece of syntax,

01:49:31.690 --> 01:49:32.850
which is a placeholder.

01:49:32.850 --> 01:49:34.800
In C, we use %s.

01:49:34.800 --> 01:49:37.950
In Python, we just use curly braces
and put the word right there.

01:49:37.950 --> 01:49:41.520
In SQL, we have a third approach to
the same problem-- just syntactically

01:49:41.520 --> 01:49:43.590
different, but conceptually the same.

01:49:43.590 --> 01:49:46.560
You put a question mark where
you want to put a placeholder.

01:49:46.560 --> 01:49:50.670
And then outside of this string, I'm
going to actually type in the value

01:49:50.670 --> 01:49:53.070
that I want to plug
into that question mark.

01:49:53.070 --> 01:49:55.590
So this is so similar
to printf in week 1.

01:49:55.590 --> 01:50:00.180
But instead of %s, it's a question mark
now and then a comma-separated list

01:50:00.180 --> 01:50:03.120
of the arguments you want to
plug in for those placeholders.

01:50:03.120 --> 01:50:08.820
So now this line of code 16 has
just inserted all of those values

01:50:08.820 --> 01:50:09.670
into my database.

01:50:09.670 --> 01:50:10.440
And let's go ahead and run this.

01:50:10.440 --> 01:50:12.970
Before I go any further,
let me go ahead and do this.

01:50:12.970 --> 01:50:15.960
I'm going to go ahead now and
run python of favorites.py

01:50:15.960 --> 01:50:18.030
and cross my fingers, as always.

01:50:18.030 --> 01:50:20.010
It's taking a moment, taking a moment.

01:50:20.010 --> 01:50:23.340
That's because there's a
decent-sized file there.

01:50:23.340 --> 01:50:25.650
Or I screwed up.

01:50:25.650 --> 01:50:27.930
This is taking too long.

01:50:27.930 --> 01:50:28.950
Oh, OK.

01:50:28.950 --> 01:50:30.960
I should have just been more patient.

01:50:30.960 --> 01:50:31.560
All right.

01:50:31.560 --> 01:50:33.970
So it just seems my
connection's a little slow.

01:50:33.970 --> 01:50:38.717
So as I expected, everything is
100% correct, and it's working fine.

01:50:38.717 --> 01:50:40.800
So now let's go ahead and
see what I actually did.

01:50:40.800 --> 01:50:44.970
If I type ls, notice that I
have a file called shows.db.

01:50:44.970 --> 01:50:48.180
This is brand new, because my
Python program created it this time.

01:50:48.180 --> 01:50:51.060
Let's go ahead and run
sqlite3 of shows.db

01:50:51.060 --> 01:50:53.080
just so I can now see
what's inside of it.

01:50:53.080 --> 01:50:57.090
Notice that I can do .schema
just to see what tables exist.

01:50:57.090 --> 01:51:00.660
And indeed, the two tables that
I created in my Python code

01:51:00.660 --> 01:51:01.920
seem to exist.

01:51:01.920 --> 01:51:04.020
But notice that there's--

01:51:04.020 --> 01:51:08.730
if I do SELECT * FROM shows,
let's see all the data.

01:51:08.730 --> 01:51:09.750
Voila.

01:51:09.750 --> 01:51:13.170
There is a table that's been
programmatically created.

01:51:13.170 --> 01:51:16.350
And it has, notice this time,
no timestamps, no genres.

01:51:16.350 --> 01:51:20.730
But it has an ID on the left
and the title on the right.

01:51:20.730 --> 01:51:25.350
And amazingly, all of the IDs are
monotonically increasing from 1

01:51:25.350 --> 01:51:27.390
on up to 513, in this case.

01:51:27.390 --> 01:51:28.300
Why is that?

01:51:28.300 --> 01:51:30.600
Well, one of the features
you get in a SQL database

01:51:30.600 --> 01:51:34.410
is if you define a column as
being a primary key in SQLite,

01:51:34.410 --> 01:51:36.480
it's going to be auto
incremented for you.

01:51:36.480 --> 01:51:41.970
Recall that nowhere in my code did
I even have a line, an integer,

01:51:41.970 --> 01:51:43.830
inputting 1, then 2, then 3.

01:51:43.830 --> 01:51:45.310
I could absolutely do that.

01:51:45.310 --> 01:51:47.730
I could have done something
like this-- counter--

01:51:47.730 --> 01:51:51.660
rather, I could have done
something like this-- counter = 1.

01:51:51.660 --> 01:51:56.280
And then down here, I could
have said id, title, give myself

01:51:56.280 --> 01:51:59.122
two placeholders, and then
pass in the counter each time.

01:51:59.122 --> 01:52:01.830
I could have implemented this
myself and then, on each iteration,

01:52:01.830 --> 01:52:03.960
done counter += 1.

01:52:03.960 --> 01:52:06.330
But with SQL databases,
as we've seen, you

01:52:06.330 --> 01:52:08.310
get a lot more functionality built in.

01:52:08.310 --> 01:52:11.130
I don't have to do any
of that, because if I've

01:52:11.130 --> 01:52:16.710
declared that ID as being a primary
key, SQLite is going to insert it for me

01:52:16.710 --> 01:52:19.870
and increment it also for me, as well.

01:52:19.870 --> 01:52:20.400
All right.

01:52:20.400 --> 01:52:24.510
So if I go back to SQLite, though,
notice that I do have IDs and titles.

01:52:24.510 --> 01:52:28.860
But if I SELECT * FROM genres,
there's of course nothing there yet.

01:52:28.860 --> 01:52:32.250
So how now do I get all of the
genres for each of these shows in?

01:52:32.250 --> 01:52:33.910
I need to finish my script.

01:52:33.910 --> 01:52:38.970
So inside of this same loop, I have
not only the title in my current row,

01:52:38.970 --> 01:52:42.570
but I also have genres
in the current row.

01:52:42.570 --> 01:52:45.570
But the genres are separated by commas.

01:52:45.570 --> 01:52:47.880
Recall that in the CSV,
next to every title,

01:52:47.880 --> 01:52:51.450
there's a comma-separated
list of genres.

01:52:51.450 --> 01:52:53.460
How do I get at each genre individually?

01:52:53.460 --> 01:52:59.190
Well, I'd like to be able to say
for genre in row bracket genres.

01:52:59.190 --> 01:53:02.520
But this is not going to
work, because that's not going

01:53:02.520 --> 01:53:05.310
to be split up based on those commas.

01:53:05.310 --> 01:53:07.190
That's literally just
going to iterate over,

01:53:07.190 --> 01:53:10.860
in fact, all of the characters in
that string, as we saw last week.

01:53:10.860 --> 01:53:13.950
But it turns out that strings
in Python have a fancy split

01:53:13.950 --> 01:53:19.300
function, whereby I can split
on a comma followed by a space.

01:53:19.300 --> 01:53:21.930
And what this function
will do for me in Python is

01:53:21.930 --> 01:53:26.130
take a comma separated list of
genres and explode it, so to speak,

01:53:26.130 --> 01:53:31.800
split it on every comma,
space into a Python list

01:53:31.800 --> 01:53:36.570
containing genre after genre
in an actual Python list

01:53:36.570 --> 01:53:37.990
a la square brackets.

01:53:37.990 --> 01:53:42.360
So now I can iterate over that
list of individual genres.

01:53:42.360 --> 01:53:49.470
And inside of here, I can do db.execute
INSERT INTO genres show_id, genre,

01:53:49.470 --> 01:53:53.130
the values, question
mark, question mark.

01:53:53.130 --> 01:53:56.100
But huh, there's a problem.

01:53:56.100 --> 01:53:59.970
I can definitely plug in the
current genre, which is this.

01:53:59.970 --> 01:54:02.970
But I need to put something here still.

01:54:02.970 --> 01:54:07.560
For that first question mark,
I need a value for the show_id.

01:54:07.560 --> 01:54:11.130
How do I know what the ID
is of the current TV show?

01:54:11.130 --> 01:54:13.650
Well, it turns out the library
can help you with this.

01:54:13.650 --> 01:54:18.970
When you insert new rows into
a table that has a primary key,

01:54:18.970 --> 01:54:23.400
it turns out that most libraries will
return you that value in some way.

01:54:23.400 --> 01:54:26.520
And if I go back to
line 15 and I actually

01:54:26.520 --> 01:54:31.470
store the return value of
db.execute after using INSERT,

01:54:31.470 --> 01:54:34.500
the library will tell me
what was the integer that

01:54:34.500 --> 01:54:36.390
was just used for this given show.

01:54:36.390 --> 01:54:37.650
Maybe it's 1, 2, 3.

01:54:37.650 --> 01:54:39.940
I don't have to know or
care as the programmer.

01:54:39.940 --> 01:54:42.570
But the return value, I
can store in a variable.

01:54:42.570 --> 01:54:47.520
And then down here, I can literally
put that same ID so that now,

01:54:47.520 --> 01:54:51.600
if I am inputting The Office,
whose ID is 1, into the shows table

01:54:51.600 --> 01:54:54.720
and its genres are
comedy, drama, romance,

01:54:54.720 --> 01:54:57.990
I can now inside of this for
loop, this nested for loop,

01:54:57.990 --> 01:55:03.240
insert 1 followed by "comedy," 1
followed by "drama," 1 followed

01:55:03.240 --> 01:55:07.330
by "romance," three rows all at once.

01:55:07.330 --> 01:55:11.980
And so now let's go back down
here into my terminal window.

01:55:11.980 --> 01:55:15.660
Let me remove the old shows.db
with rm, just to start fresh.

01:55:15.660 --> 01:55:19.920
Let me go ahead and rerun
python of favorites.py.

01:55:19.920 --> 01:55:23.733
I'll be more patient this time,
because cloud's being a little slow.

01:55:23.733 --> 01:55:24.900
So it's doing some thinking.

01:55:24.900 --> 01:55:27.030
And in fact, there's
more work being done now.

01:55:27.030 --> 01:55:29.340
At this point in the story,
my program is presumably

01:55:29.340 --> 01:55:33.060
iterating over all of
the rows in the CSV.

01:55:33.060 --> 01:55:37.170
And it's inserting into the
shows table one at a time,

01:55:37.170 --> 01:55:43.380
and then it's inserting one or
more genres into the genres table.

01:55:43.380 --> 01:55:44.250
It's a little slow.

01:55:44.250 --> 01:55:47.370
If we were on a faster system or if
I were doing it on my own Mac or PC,

01:55:47.370 --> 01:55:49.480
it would probably go down more quickly.

01:55:49.480 --> 01:55:52.740
But you can see here an example of why
I use the .import command in the first

01:55:52.740 --> 01:55:53.130
place.

01:55:53.130 --> 01:55:54.670
That automated some of this process.

01:55:54.670 --> 01:55:58.440
But unfortunately, it didn't allow
me to change the format of my data.

01:55:58.440 --> 01:56:01.530
But the key point to make
here is that even though this

01:56:01.530 --> 01:56:05.490
is taking a little bit of time to insert
these hundreds of rows all at once,

01:56:05.490 --> 01:56:07.260
I'm only going to have to do this once.

01:56:07.260 --> 01:56:10.840
And what was asked a bit ago
was the performance of this.

01:56:10.840 --> 01:56:15.390
It turns out that now that we have
full control over the SQL database,

01:56:15.390 --> 01:56:20.640
it turns out we're going to have
the ability to actually improve

01:56:20.640 --> 01:56:22.230
the performance thereof.

01:56:22.230 --> 01:56:24.000
Oh, OK.

01:56:24.000 --> 01:56:25.830
As expected, it finished right on time.

01:56:25.830 --> 01:56:29.970
And let me go ahead now and
run sqlite3 on shows.db.

01:56:29.970 --> 01:56:32.670
All right, so now I'm back
in my raw SQL environment.

01:56:32.670 --> 01:56:36.180
If I do SELECT * FROM
shows, which I did before,

01:56:36.180 --> 01:56:37.650
we'll see all of this as before.

01:56:37.650 --> 01:56:42.090
If I SELECT * FROM shows
WHERE title = "THE OFFICE,"

01:56:42.090 --> 01:56:45.103
I'll see the actual unique
IDs of all of those.

01:56:45.103 --> 01:56:46.770
We didn't bother eliminating duplicates.

01:56:46.770 --> 01:56:50.610
We just kept everything as is, but
we gave everything a unique ID.

01:56:50.610 --> 01:56:57.520
But if I now do SELECT * FROM genres,
we'll see all of the values there.

01:56:57.520 --> 01:56:59.070
And notice the key detail.

01:56:59.070 --> 01:57:03.360
There is only one genre per row here.

01:57:03.360 --> 01:57:06.480
And so we can ultimately line
those up with our titles.

01:57:06.480 --> 01:57:10.250
And our titles here, we
had all of these here.

01:57:10.250 --> 01:57:12.538
Something's wrong.

01:57:12.538 --> 01:57:13.580
I want to get this right.

01:57:13.580 --> 01:57:15.940
Let's go ahead and take our second
and final five-minute break here.

01:57:15.940 --> 01:57:18.280
And we'll come back, and I
will explain what's going on.

01:57:18.280 --> 01:57:20.170
All right, we are back.

01:57:20.170 --> 01:57:23.710
And just before we broke up, my own
self-doubt was starting to creep in.

01:57:23.710 --> 01:57:26.830
But I'm happy to say, with no
fancy magic behind the scenes,

01:57:26.830 --> 01:57:28.430
everything was actually working fine.

01:57:28.430 --> 01:57:30.263
I was just doubting the
correctness of this.

01:57:30.263 --> 01:57:33.460
If I do SELECT * FROM
shows, I indeed get back

01:57:33.460 --> 01:57:37.540
two columns, one with the unique ID,
the so-called primary key, followed

01:57:37.540 --> 01:57:40.280
by the title of each of those shows.

01:57:40.280 --> 01:57:46.120
And if I similarly search for * FROM
genres, I get single genres at a time.

01:57:46.120 --> 01:57:49.600
But on the left-hand side
are not primary keys per se

01:57:49.600 --> 01:57:52.450
but now those same numbers
here in this context called

01:57:52.450 --> 01:57:55.160
foreign keys that map one to the other.

01:57:55.160 --> 01:58:01.260
So for instance, whatever show 512 is
had five different genres associated

01:58:01.260 --> 01:58:01.760
with it.

01:58:01.760 --> 01:58:05.320
And in fact, if I go back a moment to
shows, it looks like Game of Thrones

01:58:05.320 --> 01:58:10.420
was decided by one of you as belonging
in thriller, history, adventure,

01:58:10.420 --> 01:58:14.660
action, and war, as well, those five.

01:58:14.660 --> 01:58:17.320
So now this is what's meant
by relational database.

01:58:17.320 --> 01:58:21.430
You have this relation or
relationship across multiple tables

01:58:21.430 --> 01:58:25.050
that link some data in one to
some other data in the like.

01:58:25.050 --> 01:58:27.550
The catch, though, is that it
would seem a little harder now

01:58:27.550 --> 01:58:30.910
to answer questions, because now
I have to kind of query two tables

01:58:30.910 --> 01:58:34.450
or execute two separate queries
and then combine the data.

01:58:34.450 --> 01:58:36.130
But that's not actually the case.

01:58:36.130 --> 01:58:39.100
Suppose that I want to
answer the question of,

01:58:39.100 --> 01:58:42.760
what are all of the musicals
among your favorite TV shows?

01:58:42.760 --> 01:58:46.490
I can't select just the shows, because
there's no genres in there anymore.

01:58:46.490 --> 01:58:48.730
But I also can't select
just the genres table,

01:58:48.730 --> 01:58:50.900
because there's no titles in there.

01:58:50.900 --> 01:58:55.060
But there is a value that's bridging
one and the other, that foreign key

01:58:55.060 --> 01:58:56.980
to primary key relationship.

01:58:56.980 --> 01:58:59.170
So you know what I can do
off the top of my head?

01:58:59.170 --> 01:59:03.790
I'm pretty sure I can select all of
the show_ids from the genres table

01:59:03.790 --> 01:59:07.072
where a specific genre = "Musical."

01:59:07.072 --> 01:59:09.280
And I don't have to worry
about commas or spaces now,

01:59:09.280 --> 01:59:13.210
because again, in this new version
that I have designed programmatically

01:59:13.210 --> 01:59:16.990
with code, musical and every
other genre is just a single word.

01:59:16.990 --> 01:59:21.220
If I hit Enter, all of
these show_ids were decided

01:59:21.220 --> 01:59:23.930
by you all as belonging to musicals.

01:59:23.930 --> 01:59:25.930
But now this is not
interesting, and I certainly

01:59:25.930 --> 01:59:28.360
don't want to execute 10
or so queries manually

01:59:28.360 --> 01:59:30.400
to look up every one of those IDs.

01:59:30.400 --> 01:59:32.680
But notice what we can
do in SQL, as well.

01:59:32.680 --> 01:59:33.880
I can nest queries.

01:59:33.880 --> 01:59:36.940
Let me put this whole query in
parentheses for just a moment

01:59:36.940 --> 01:59:39.070
and then prepend to it the following.

01:59:39.070 --> 01:59:46.930
SELECT title FROM shows WHERE the
primary key, id, is in this subquery.

01:59:46.930 --> 01:59:50.650
So you can have nested queries similar
in spirit a bit like in Python and C

01:59:50.650 --> 01:59:52.510
when you have nested for loops.

01:59:52.510 --> 01:59:55.690
In this case, just like in grade school
math, whatever is in the parentheses

01:59:55.690 --> 01:59:57.160
will be executed first.

01:59:57.160 --> 02:00:02.140
Then the outer query will be executed
using the results of that inner query.

02:00:02.140 --> 02:00:07.000
So if I select the title from shows
where the ID is in that list of IDs,

02:00:07.000 --> 02:00:07.660
voila.

02:00:07.660 --> 02:00:11.560
It seems that, somewhat
amusingly, several of you

02:00:11.560 --> 02:00:15.280
think that Breaking Bad, Supernatural,
Glee, Sherlock, How I Met Your Mother,

02:00:15.280 --> 02:00:18.190
Hawaii Five-0, Twin Peaks, The
Lawyer, and My Brother, My Brother

02:00:18.190 --> 02:00:19.900
and Me are all musicals.

02:00:19.900 --> 02:00:22.630
I take exception to a few
of those, but so be it.

02:00:22.630 --> 02:00:24.850
You checked the box for
musical for those shows.

02:00:24.850 --> 02:00:29.260
So even though we've designed
things better in the sense

02:00:29.260 --> 02:00:33.010
that we've normalized our database
by factoring out commonalities

02:00:33.010 --> 02:00:35.050
or, rather, we've cleaned
up the data, there's

02:00:35.050 --> 02:00:37.150
still admittedly some redundancy.

02:00:37.150 --> 02:00:39.370
There's still admittedly
some redundancy.

02:00:39.370 --> 02:00:44.410
But I at least now have
the data in clean fashion

02:00:44.410 --> 02:00:47.800
so that every column has just
a single value in it and not

02:00:47.800 --> 02:00:49.870
some contrived comma-separated list.

02:00:49.870 --> 02:00:51.725
Suppose I want to find
out all of the genres

02:00:51.725 --> 02:00:53.350
that you all thought The Office was in.

02:00:53.350 --> 02:00:55.480
So let's ask kind of
the opposite question.

02:00:55.480 --> 02:00:56.840
Well, how might I do that?

02:00:56.840 --> 02:01:00.430
Well, to figure out The Office, I'm
going to first need to SELECT the id

02:01:00.430 --> 02:01:06.400
FROM shows WHERE title = "THE
OFFICE," because a whole bunch of you

02:01:06.400 --> 02:01:07.330
typed in The Office.

02:01:07.330 --> 02:01:09.902
And we gave each of your
answers a unique identifier

02:01:09.902 --> 02:01:11.110
so we could keep track of it.

02:01:11.110 --> 02:01:12.500
And there's all of those numbers.

02:01:12.500 --> 02:01:14.260
Now, this is, like, dozens of responses.

02:01:14.260 --> 02:01:16.540
I certainly don't want to
execute that many queries.

02:01:16.540 --> 02:01:18.850
But I think a subquery
will help us out again.

02:01:18.850 --> 02:01:21.470
Let me put parentheses
around this whole thing.

02:01:21.470 --> 02:01:27.910
And now let me say SELECT
DISTINCT genre FROM genres WHERE

02:01:27.910 --> 02:01:32.380
the show_id in the genres
table is in that query.

02:01:32.380 --> 02:01:37.400
And just for kicks, let me
go ahead and ORDER BY genre.

02:01:37.400 --> 02:01:38.930
So let me go ahead and execute this.

02:01:38.930 --> 02:01:42.490
And, OK, somewhat amusingly, those
of you who inputted The Office

02:01:42.490 --> 02:01:46.960
checked boxes for animation, comedy,
documentary, drama, family, horror,

02:01:46.960 --> 02:01:49.000
reality-TV, romance, and sci-fi.

02:01:49.000 --> 02:01:51.020
I take exception to a few of those, too.

02:01:51.020 --> 02:01:53.720
But this is what happens
when you accept user input.

02:01:53.720 --> 02:01:57.293
So here again, we have
with this SQL language

02:01:57.293 --> 02:01:59.710
the ability to express fairly
succinctly, even though it's

02:01:59.710 --> 02:02:03.670
a lot of new features today all at
once, what would otherwise take me

02:02:03.670 --> 02:02:06.593
a dozen or two lines in
Python code to implement

02:02:06.593 --> 02:02:09.010
and god knows how many lines
of code and how many hours it

02:02:09.010 --> 02:02:13.060
would take me to implement something
like this in C. Now, admittedly,

02:02:13.060 --> 02:02:15.220
we could do better than this design.

02:02:15.220 --> 02:02:18.550
This table or this picture
represents what we have now.

02:02:18.550 --> 02:02:22.360
But you'll notice a lot of redundancy
implicit in the genres table.

02:02:22.360 --> 02:02:25.510
Any time you check the
comedy box, I have a row now

02:02:25.510 --> 02:02:27.940
that says comedy,
comedy, comedy, comedy.

02:02:27.940 --> 02:02:31.930
And the show_id differs, but I have
the word "comedy" again and again.

02:02:31.930 --> 02:02:35.740
And now, that tends to be frowned upon
in the world of relational databases,

02:02:35.740 --> 02:02:39.370
because if you have a
genre called comedy or one

02:02:39.370 --> 02:02:42.430
called musical or anything
else, you should ideally just

02:02:42.430 --> 02:02:43.970
have that living in one place.

02:02:43.970 --> 02:02:47.980
And so if we really wanted to
be particular and really, truly

02:02:47.980 --> 02:02:51.100
normalize this database, which
is an academic term referring

02:02:51.100 --> 02:02:55.530
to removing all such redundancies,
we could actually do it like this.

02:02:55.530 --> 02:02:59.480
We could have a shows table still with
an id and title, no difference there.

02:02:59.480 --> 02:03:03.890
But we could have a genres table
with two columns, id and name.

02:03:03.890 --> 02:03:05.020
Now, this is its own id.

02:03:05.020 --> 02:03:06.910
It has no connection with the show_id.

02:03:06.910 --> 02:03:10.660
It's just its own unique
identifier, a primary key here now,

02:03:10.660 --> 02:03:12.320
and the name of that genre.

02:03:12.320 --> 02:03:14.350
So you would have one
row in the genres table

02:03:14.350 --> 02:03:17.690
for comedy, for drama, music,
musical, and everything else.

02:03:17.690 --> 02:03:19.870
And then you would use
a third table, which

02:03:19.870 --> 02:03:23.920
is colloquially called a join table,
which I'll draw here in the middle.

02:03:23.920 --> 02:03:25.960
And you can call it
anything you want, but we've

02:03:25.960 --> 02:03:29.920
called it shows_genres to make
clear that this table implements

02:03:29.920 --> 02:03:33.400
a relationship between those two tables.

02:03:33.400 --> 02:03:36.910
And notice that in this table
is really no juicy data.

02:03:36.910 --> 02:03:38.800
It's just foreign keys--

02:03:38.800 --> 02:03:41.380
show_id, genre_id.

02:03:41.380 --> 02:03:43.930
And by having this
third table, we can now

02:03:43.930 --> 02:03:47.890
make sure that the word "comedy"
only appears in one row anywhere.

02:03:47.890 --> 02:03:50.860
The word "musical" only
appears in one row anywhere.

02:03:50.860 --> 02:03:55.450
But we use these more efficient
integers called show_id and genre_id,

02:03:55.450 --> 02:04:00.850
which respectively point to those
primary keys and their primary tables

02:04:00.850 --> 02:04:02.072
to link those two together.

02:04:02.072 --> 02:04:04.780
And this is an example of what's
called in the world of databases

02:04:04.780 --> 02:04:06.790
a many-to-many relationship.

02:04:06.790 --> 02:04:09.610
One show can have many genres.

02:04:09.610 --> 02:04:12.730
One genre can belong to many shows.

02:04:12.730 --> 02:04:14.530
And so by having this
third table, you can

02:04:14.530 --> 02:04:16.730
have that many-to-many relationship.

02:04:16.730 --> 02:04:19.570
And again, the third table now
allows us to truly normalize

02:04:19.570 --> 02:04:23.920
our data set by getting rid of all of
the duplicate comedy, comedy, comedy.

02:04:23.920 --> 02:04:25.420
Why is this important?

02:04:25.420 --> 02:04:27.310
Probably not a huge deal for genres.

02:04:27.310 --> 02:04:30.910
But imagine with my current design
if I made a spelling mistake,

02:04:30.910 --> 02:04:32.440
and I misnamed comedy.

02:04:32.440 --> 02:04:36.190
I would now have to change every row
with the word comedy again and again.

02:04:36.190 --> 02:04:39.580
Or if maybe you change
the genres of the shows,

02:04:39.580 --> 02:04:41.840
you would have to change
it in multiple places.

02:04:41.840 --> 02:04:44.260
But with this other
approach with three tables,

02:04:44.260 --> 02:04:46.450
you can argue that now
you only have to change

02:04:46.450 --> 02:04:49.750
the name of a genre in one
place, not all over the place.

02:04:49.750 --> 02:04:52.420
And that, in general, in C
and now in Python and now

02:04:52.420 --> 02:04:57.040
SQL has generally been a good thing
not to copy paste identical values

02:04:57.040 --> 02:05:00.060
all over the place.

02:05:00.060 --> 02:05:00.720
All right.

02:05:00.720 --> 02:05:04.410
So with that said, what other
tools do we have at our disposal?

02:05:04.410 --> 02:05:09.180
Well, it turns out that there are other
data types out there in the real world

02:05:09.180 --> 02:05:11.280
using SQL besides just these five--

02:05:11.280 --> 02:05:13.620
BLOB, INTEGER, NUMERIC, REAL, and TEXT.

02:05:13.620 --> 02:05:15.870
BLOB, again, is for binary
stuff, generally not

02:05:15.870 --> 02:05:18.840
used except for more specialized
applications, let's say.

02:05:18.840 --> 02:05:21.270
INTEGER, which is an
int, typically 32 bits;

02:05:21.270 --> 02:05:23.700
NUMERIC, which is something
like a date or a year

02:05:23.700 --> 02:05:26.220
or time or something like
that; REAL numbers, which

02:05:26.220 --> 02:05:30.180
are floating point values; and
TEXT, which are things like strings.

02:05:30.180 --> 02:05:34.110
But if you graduate ultimately
from SQLite on phones

02:05:34.110 --> 02:05:38.522
and on Macs and PCs to actual
servers that run Oracle, MySQL,

02:05:38.522 --> 02:05:40.230
and PostgreSQL if
you're actually running

02:05:40.230 --> 02:05:42.540
your own internet-style
business, well, it

02:05:42.540 --> 02:05:47.310
turns out that more
sophisticated, even more powerful

02:05:47.310 --> 02:05:50.620
databases come with other
subtypes, if you will.

02:05:50.620 --> 02:05:54.270
So besides INTEGER, you can
specify smallint for small numbers,

02:05:54.270 --> 02:05:57.690
maybe using just a few
bits instead of 32--

02:05:57.690 --> 02:06:01.800
INTEGER or bigint, which
uses 64 bits instead of 32.

02:06:01.800 --> 02:06:05.130
The Facebooks, the Twitters of the
world need to use bigint a lot,

02:06:05.130 --> 02:06:06.720
because they have so much data.

02:06:06.720 --> 02:06:09.330
You and I can get away with
simple integers, because we're not

02:06:09.330 --> 02:06:12.450
going to have more than 4 billion
favorite TV shows in a class,

02:06:12.450 --> 02:06:13.290
certainly.

02:06:13.290 --> 02:06:17.040
Something like REAL, you can have
32-bit real numbers or, a little weirdly

02:06:17.040 --> 02:06:22.470
named, double precision, which is like
a double was in C, using 64 bits instead

02:06:22.470 --> 02:06:23.640
for more precision.

02:06:23.640 --> 02:06:25.230
NUMERIC is kind of this catchall.

02:06:25.230 --> 02:06:29.160
You can have not only dates and date
times but things like Boolean values.

02:06:29.160 --> 02:06:31.200
You can specify the
total number of digits

02:06:31.200 --> 02:06:34.180
to store using this numeric
scale and precision.

02:06:34.180 --> 02:06:37.440
So it relates to numbers that
aren't just quite integers.

02:06:37.440 --> 02:06:39.720
And then you also have
categories of TEXT--

02:06:39.720 --> 02:06:42.570
char followed by a
number, which specifies

02:06:42.570 --> 02:06:47.010
that every value in the column will
have the same number of characters,

02:06:47.010 --> 02:06:50.190
that's helpful for things where the
length in advance, like in the US.

02:06:50.190 --> 02:06:54.030
All states, all 50 states,
have two-character codes,

02:06:54.030 --> 02:06:57.450
like MA for Massachusetts,
CA for California.

02:06:57.450 --> 02:07:00.150
char(2) would be appropriate
there, because you

02:07:00.150 --> 02:07:03.000
know every value in the column
is going to have two characters.

02:07:03.000 --> 02:07:05.250
When you don't know,
though, you can use varchar.

02:07:05.250 --> 02:07:08.400
And varchar specifies a
maximum number of characters.

02:07:08.400 --> 02:07:12.060
And so you might specify
varchar of, like, 32.

02:07:12.060 --> 02:07:15.600
No one might be able to type in a
name that's longer than 32 characters,

02:07:15.600 --> 02:07:18.900
or varchar(200) if you want to
allow for something even bigger.

02:07:18.900 --> 02:07:21.690
But this is germane to our
real-world experience with the web.

02:07:21.690 --> 02:07:24.540
If you've ever gone to a website,
start filling out a form,

02:07:24.540 --> 02:07:26.850
and all of a sudden you can't
type any more characters,

02:07:26.850 --> 02:07:28.440
your response is too long--

02:07:28.440 --> 02:07:29.462
why is that?

02:07:29.462 --> 02:07:31.170
Well, one, the
programmers just might not

02:07:31.170 --> 02:07:33.810
want you to keep expressing
yourself in more detail, especially

02:07:33.810 --> 02:07:36.330
if it's a complaint form
on a customer service site.

02:07:36.330 --> 02:07:40.620
But pragmatically, it's probably
because their database was designed

02:07:40.620 --> 02:07:42.570
to store a finite number of characters.

02:07:42.570 --> 02:07:44.025
And you have hit that threshold.

02:07:44.025 --> 02:07:45.900
And you certainly don't
want to have a buffer

02:07:45.900 --> 02:07:50.190
overflow, like in C. So the database
will enforce a maximum value n.

02:07:50.190 --> 02:07:52.830
And then text is for even
bigger chunks of text.

02:07:52.830 --> 02:07:54.930
If you're letting people
copy paste their resumes

02:07:54.930 --> 02:07:59.680
or hold documents or even larger sets
of text, you might use text instead.

02:07:59.680 --> 02:08:03.510
So let's then consider
a real-world data set.

02:08:03.510 --> 02:08:07.320
Things get really interesting, and
all of these very academic ideas

02:08:07.320 --> 02:08:09.600
and recommendations
really come into play

02:08:09.600 --> 02:08:14.830
when we don't had hundreds of favorites
but when we have thousands instead.

02:08:14.830 --> 02:08:19.180
And so what I'm going to go ahead
and do here is download a file here,

02:08:19.180 --> 02:08:25.120
which is a SQLite version of the
IMDb, Internet Movie Database,

02:08:25.120 --> 02:08:27.120
that some of you might
have used in website form

02:08:27.120 --> 02:08:30.330
in order to look up movies and
ratings thereof and the like.

02:08:30.330 --> 02:08:32.100
And what we've done
in advance is we wrote

02:08:32.100 --> 02:08:38.490
a script that downloaded all of that
information in advance as TSV files.

02:08:38.490 --> 02:08:42.600
It turns out that they, Internet
Movie Database, make all of their data

02:08:42.600 --> 02:08:46.650
available as TSV files,
Tab-Separated Values.

02:08:46.650 --> 02:08:54.010
And we went ahead and imported it with
a script called shows.db as follows.

02:08:54.010 --> 02:08:55.800
So I'm going to go
ahead in just a moment

02:08:55.800 --> 02:08:59.520
and open up shows.db, which is
not the version I created earlier

02:08:59.520 --> 02:09:00.990
based on your favorites.

02:09:00.990 --> 02:09:02.820
This is now the version
that we, the staff,

02:09:02.820 --> 02:09:06.630
created in advance by
downloading hundreds of thousands

02:09:06.630 --> 02:09:10.950
of movies and TV shows and actors
and directors from IMDb.com

02:09:10.950 --> 02:09:15.660
under their license and then
imported into a SQLite database.

02:09:15.660 --> 02:09:17.130
So how can I see what's in here?

02:09:17.130 --> 02:09:19.530
Well, let me go ahead
and type .schema, recall.

02:09:19.530 --> 02:09:22.545
And you'll see a whole
bunch of data therein.

02:09:22.545 --> 02:09:25.690
And in fact, in pictorial form,
it actually looks like this.

02:09:25.690 --> 02:09:28.140
Here is a picture that just
gives you the lay of the land.

02:09:28.140 --> 02:09:30.330
There's going to be a
people table that has

02:09:30.330 --> 02:09:33.903
an ID for every person, a
name, and their birth year.

02:09:33.903 --> 02:09:36.570
There's going to be a shows table,
just like we've been talking,

02:09:36.570 --> 02:09:41.100
which is IDs, titles of shows-- also,
though, the year that the show debuted

02:09:41.100 --> 02:09:43.380
and the number of episodes
that the show had.

02:09:43.380 --> 02:09:46.410
Then there's going to be genres,
similar in design to before.

02:09:46.410 --> 02:09:49.620
So we didn't go all out and
factor it out into a third table.

02:09:49.620 --> 02:09:52.830
We just have some duplication
here, admittedly, in genres.

02:09:52.830 --> 02:09:54.240
But then there's a ratings table.

02:09:54.240 --> 02:09:57.240
And here's where you can see where
relational databases get interesting.

02:09:57.240 --> 02:10:01.000
You can have a ratings table
storing ratings, like 1 to 5,

02:10:01.000 --> 02:10:05.080
but also associate those ratings
with a show by way of its show_id.

02:10:05.080 --> 02:10:08.440
And then you can keep track of the
number of votes that that show got.

02:10:08.440 --> 02:10:10.910
Writers, notice, is a separate table.

02:10:10.910 --> 02:10:12.560
And notice this is kind of cool.

02:10:12.560 --> 02:10:19.060
This table, per the arrows, relates to
the shows table and the people table,

02:10:19.060 --> 02:10:20.770
because this is a joined table.

02:10:20.770 --> 02:10:24.040
A foreign key of show_id and
a foreign key of person_id

02:10:24.040 --> 02:10:28.250
refer to the shows table and
the people table respectively

02:10:28.250 --> 02:10:32.710
so that a human person can be
a writer for multiple shows

02:10:32.710 --> 02:10:36.560
and one show can have multiple writers,
another many-to-many relationship.

02:10:36.560 --> 02:10:39.310
And then lastly, stars,
the actors in a show.

02:10:39.310 --> 02:10:41.050
Notice that this, too, is a join table.

02:10:41.050 --> 02:10:43.540
It's only got two
foreign keys, a show_id

02:10:43.540 --> 02:10:47.447
and a person_id that are referring
back to those tables respectively.

02:10:47.447 --> 02:10:50.030
And here's where it really makes
sense of relational database.

02:10:50.030 --> 02:10:52.930
It would be pretty stupid
and bad design if you

02:10:52.930 --> 02:10:57.520
had names of all of the directors
and names of all of the writers

02:10:57.520 --> 02:11:01.840
and names of all of the stars of these
shows in separate tables in duplicate,

02:11:01.840 --> 02:11:04.330
like Steve Carell, Steve
Carell, Steve Carell.

02:11:04.330 --> 02:11:06.670
All of those actors and
directors and writers

02:11:06.670 --> 02:11:11.450
and every other role in the business
are just people at the end of the day.

02:11:11.450 --> 02:11:13.630
So in a relational
database, the advice would

02:11:13.630 --> 02:11:16.570
be to put all of those
people in a people table

02:11:16.570 --> 02:11:21.040
and then use primary and foreign
keys to refer to, to relate them to,

02:11:21.040 --> 02:11:24.010
these other types of tables.

02:11:24.010 --> 02:11:26.720
The catch is, though,
that when we do this,

02:11:26.720 --> 02:11:31.280
it turns out that things can be
slow when we have lots of data.

02:11:31.280 --> 02:11:33.250
So for instance, let me go into this.

02:11:33.250 --> 02:11:37.210
Let me go ahead and
SELECT * FROM shows;.

02:11:37.210 --> 02:11:38.343
That's a lot of data.

02:11:38.343 --> 02:11:41.260
It's pretty fast on my Mac, and I
switched from the IDE to my Mac just

02:11:41.260 --> 02:11:43.270
to save time, because it's
a little faster doing things

02:11:43.270 --> 02:11:44.800
locally instead of in the cloud.

02:11:44.800 --> 02:11:48.460
Let me go ahead and count the number
of shows in this IMDb database

02:11:48.460 --> 02:11:49.720
by using COUNT.

02:11:49.720 --> 02:11:53.390
153,331 TV shows.

02:11:53.390 --> 02:11:54.250
So that's a lot.

02:11:54.250 --> 02:11:59.110
How about the count of
people from the people table?

02:11:59.110 --> 02:12:06.290
457,886 people who might be stars or
writers or some other role, as well.

02:12:06.290 --> 02:12:07.765
So this is a sizable data set.

02:12:07.765 --> 02:12:09.890
So let me go ahead and do
something simple, though.

02:12:09.890 --> 02:12:14.560
Let me go ahead and SELECT * FROM
shows WHERE title = "The Office."

02:12:14.560 --> 02:12:17.900
And this time, I don't have to worry
about weird capitalization or spacing.

02:12:17.900 --> 02:12:18.850
This is IMDb.

02:12:18.850 --> 02:12:21.727
This is clean data from
an authoritative source.

02:12:21.727 --> 02:12:24.310
Notice that there's actually
different versions of The Office.

02:12:24.310 --> 02:12:26.590
You probably know the
UK one and the US one.

02:12:26.590 --> 02:12:30.520
There's other shows that are unrelated
to that particular type of show.

02:12:30.520 --> 02:12:34.540
But each of them is distinguished,
notice, by the year here.

02:12:34.540 --> 02:12:37.280
All right, so that's kind of a lot.

02:12:37.280 --> 02:12:38.680
And let's do this again.

02:12:38.680 --> 02:12:40.930
Let me go ahead and turn on
a feature temporarily just

02:12:40.930 --> 02:12:44.000
to time this query by turning
on a timer in this program.

02:12:44.000 --> 02:12:45.370
And let me run it again.

02:12:45.370 --> 02:12:51.970
It looks like it took 0.012 seconds
of real time to do that search.

02:12:51.970 --> 02:12:52.780
That's pretty fast.

02:12:52.780 --> 02:12:55.180
I barely noticed, certainly
because it's so fast.

02:12:55.180 --> 02:12:56.710
But let me go ahead and do this.

02:12:56.710 --> 02:13:01.510
Let me go ahead and create an index
called title_index on the table

02:13:01.510 --> 02:13:04.360
called shows on its title column.

02:13:04.360 --> 02:13:05.470
Well, what am I doing?

02:13:05.470 --> 02:13:08.680
Well, to answer the question finally
from before about performance,

02:13:08.680 --> 02:13:11.340
by default, everything we've
been doing is indeed big O of n.

02:13:11.340 --> 02:13:13.090
It's just being linearly
searched from top

02:13:13.090 --> 02:13:16.630
to bottom, which seems to call into
question the whole purpose of SQL if we

02:13:16.630 --> 02:13:18.850
were doing no better than with CSVs.

02:13:18.850 --> 02:13:22.480
But an index is a clue
to the database to load

02:13:22.480 --> 02:13:25.960
the data more efficiently in such a
way that you get logarithmic time.

02:13:25.960 --> 02:13:30.520
An index is a fancy data structure
that the SQLite database or the Oracle

02:13:30.520 --> 02:13:33.520
database or the MySQL database,
whatever product you're using,

02:13:33.520 --> 02:13:35.680
builds up for you in memory.

02:13:35.680 --> 02:13:38.560
And then it does something
using syntax like this

02:13:38.560 --> 02:13:42.340
that builds in memory generally
something known as a B-tree.

02:13:42.340 --> 02:13:44.178
We've talked a bit about
trees in the class.

02:13:44.178 --> 02:13:46.720
We talked about binary search
trees, things that kind of look

02:13:46.720 --> 02:13:47.920
like family trees.

02:13:47.920 --> 02:13:50.620
A B-tree is essentially
a family tree that's

02:13:50.620 --> 02:13:53.020
just very wide and not that tall.

02:13:53.020 --> 02:13:56.500
It's a data structure similar in
spirit to what we looked at in C.

02:13:56.500 --> 02:13:59.830
But it tries to keep all of the
leaf nodes, all of the children

02:13:59.830 --> 02:14:02.230
or grandchildren or
great-grandchildren, so to speak,

02:14:02.230 --> 02:14:04.390
as close to the root as possible.

02:14:04.390 --> 02:14:08.320
And the algorithm it uses for that
tends to be proprietary or documented

02:14:08.320 --> 02:14:09.970
based on the system you're using.

02:14:09.970 --> 02:14:12.100
But it doesn't store things in a list.

02:14:12.100 --> 02:14:17.620
It does not store things top to bottom,
like the tables we view them as.

02:14:17.620 --> 02:14:21.640
Underneath the hood, those tables
that look like very tall structures

02:14:21.640 --> 02:14:23.770
are actually, underneath
the hood, implemented

02:14:23.770 --> 02:14:25.820
with fancier things called trees.

02:14:25.820 --> 02:14:29.710
And if we create those trees by creating
what are properly called indexes

02:14:29.710 --> 02:14:34.660
like this, it might take us a moment,
like 0.098 seconds, to create an index.

02:14:34.660 --> 02:14:36.220
But now notice what happens.

02:14:36.220 --> 02:14:40.210
Previously, when I searched the titles
for The Office, using linear search,

02:14:40.210 --> 02:14:43.180
it took 0.012 seconds.

02:14:43.180 --> 02:14:46.750
If I do the same query again
after having created the index

02:14:46.750 --> 02:14:50.920
and having told SQLite, build me
this fancy tree in memory, voila.

02:14:50.920 --> 02:14:55.450
0.001 seconds, so orders
of magnitude faster.

02:14:55.450 --> 02:14:57.550
Now, both are fast to
us humans, certainly.

02:14:57.550 --> 02:15:01.040
But imagine the data set being even
bigger, the query being even bigger.

02:15:01.040 --> 02:15:05.900
These indexes can get
even larger than that.

02:15:05.900 --> 02:15:07.970
Rather, the queries can
take longer than that

02:15:07.970 --> 02:15:11.130
and therefore take even
more time than that.

02:15:11.130 --> 02:15:13.940
But unfortunately, if I've got
all of my data all over the place,

02:15:13.940 --> 02:15:16.970
as in a diagram like this, my god.

02:15:16.970 --> 02:15:18.770
How do I actually get useful work done?

02:15:18.770 --> 02:15:21.590
How do I get back the people
in a movie and the writers

02:15:21.590 --> 02:15:24.260
and the stars and the ratings
if it's all over the place?

02:15:24.260 --> 02:15:26.840
I would seem to have created
such a mess and that I now

02:15:26.840 --> 02:15:28.910
need to execute all of these queries.

02:15:28.910 --> 02:15:32.000
But notice it doesn't have
to be that complicated.

02:15:32.000 --> 02:15:35.660
It turns out that there's another
keyword in SQL, really the last

02:15:35.660 --> 02:15:38.150
that we'll look at here, called JOIN.

02:15:38.150 --> 02:15:41.480
The JOIN keyword, which you can
use implicitly or explicitly,

02:15:41.480 --> 02:15:45.470
allows you to just join tables
together and sort of reconstitute

02:15:45.470 --> 02:15:47.760
a bigger, more user friendly table.

02:15:47.760 --> 02:15:51.020
So for instance, suppose I want to
get all of Steve Carell's TV shows,

02:15:51.020 --> 02:15:52.250
not just The Office.

02:15:52.250 --> 02:15:55.880
Well, recall that I can select
Steve's ID from the people

02:15:55.880 --> 02:15:59.390
table WHERE name = "Steve Carell."

02:15:59.390 --> 02:16:02.780
So again, he has a different ID in
this table, because this is from IMDb.

02:16:02.780 --> 02:16:04.400
But there's his ID.

02:16:04.400 --> 02:16:07.260
And let me go ahead and
turn the timer off for now.

02:16:07.260 --> 02:16:07.760
All right.

02:16:07.760 --> 02:16:11.510
So there is his ID, 126797.

02:16:11.510 --> 02:16:14.780
I could copy paste that into my
code, but that's not necessary

02:16:14.780 --> 02:16:16.490
thanks to these nested queries.

02:16:16.490 --> 02:16:18.660
I can do something like this.

02:16:18.660 --> 02:16:23.720
Let me go ahead and now select all
of the show_ids from the stars table

02:16:23.720 --> 02:16:29.790
where person_id from that
table is equal to this result.

02:16:29.790 --> 02:16:33.240
So there's that join table, stars,
that links people and shows.

02:16:33.240 --> 02:16:35.370
So let me go ahead and execute that.

02:16:35.370 --> 02:16:35.870
All right.

02:16:35.870 --> 02:16:39.559
So there's all of the show_ids
of Steve Carell's TV shows.

02:16:39.559 --> 02:16:40.379
That's a lot.

02:16:40.379 --> 02:16:42.139
And it's very nonobvious what they are.

02:16:42.139 --> 02:16:45.680
So let me do another nested query by
putting all of that in parentheses

02:16:45.680 --> 02:16:51.530
and now SELECT title FROM
shows WHERE the ID of the show

02:16:51.530 --> 02:16:55.820
is in this big, long list of show_ids.

02:16:55.820 --> 02:17:00.260
And there are all of the shows that
he's in, including The Dana Carvey Show

02:17:00.260 --> 02:17:04.430
back when, The Office up at the
top, and then, most recently,

02:17:04.430 --> 02:17:07.142
shows like The Morning Show on Apple TV.

02:17:07.142 --> 02:17:09.350
All right, so that's pretty
cool that we can actually

02:17:09.350 --> 02:17:11.129
reconstitute the data like that.

02:17:11.129 --> 02:17:13.889
But it turns out there's different
ways of doing that, as well.

02:17:13.889 --> 02:17:15.950
And you'll see more of
this in the coming weeks

02:17:15.950 --> 02:17:18.150
and in the problem sets
and labs and the like.

02:17:18.150 --> 02:17:19.879
But it turns out we can
do other things, as well.

02:17:19.879 --> 02:17:21.962
And let me just show this
syntax even though it'll

02:17:21.962 --> 02:17:23.670
look a little cryptic at first glance.

02:17:23.670 --> 02:17:26.299
You can also use that
JOIN keyword as follows.

02:17:26.299 --> 02:17:33.350
I can select the title from the people
table joined with the stars table

02:17:33.350 --> 02:17:39.959
on the people.id column equaling
the stars.person_id column.

02:17:39.959 --> 02:17:42.799
So in other words, I can
select a title from the result

02:17:42.799 --> 02:17:46.940
of joining people and stars, like
this, on the id column in one

02:17:46.940 --> 02:17:49.129
and the person_id column in the other.

02:17:49.129 --> 02:17:58.879
And I can join in the shows table on
the stars.show_id equaling the shows.id.

02:17:58.879 --> 02:18:03.799
So again, now I'm joining the primary
and foreign keys on these two tables

02:18:03.799 --> 02:18:07.700
where the name equals "Steve Carell."

02:18:07.700 --> 02:18:10.070
So this is the most cryptic
thing we've seen yet.

02:18:10.070 --> 02:18:12.530
But it just means take this
table and join it with this one

02:18:12.530 --> 02:18:16.580
and then join it with this one and
filter all of the resulting joined rows

02:18:16.580 --> 02:18:18.530
by a name of Steve Carell.

02:18:18.530 --> 02:18:19.520
And voila.

02:18:19.520 --> 02:18:22.469
There we have all of
those answers, as well.

02:18:22.469 --> 02:18:25.129
And there's other ways
of doing this, too.

02:18:25.129 --> 02:18:27.809
I'll leave unsaid now some
of the syntax for that.

02:18:27.809 --> 02:18:29.480
But that felt a little slow.

02:18:29.480 --> 02:18:32.090
And in fact, let me go ahead
and turn my timer back on.

02:18:32.090 --> 02:18:34.610
Let me re-execute this last query.

02:18:34.610 --> 02:18:40.879
SELECT title FROM people joining
on stars, joining on shows

02:18:40.879 --> 02:18:42.650
WHERE name = "Steve Carell."

02:18:42.650 --> 02:18:44.700
That took over half a second.

02:18:44.700 --> 02:18:47.480
So that was actually
admittedly kind of slow.

02:18:47.480 --> 02:18:50.209
But again, indexes come to
the rescue and if, again, we

02:18:50.209 --> 02:18:52.610
don't allow linear search to dominate.

02:18:52.610 --> 02:18:54.889
But let me go ahead and
create a few indexes.

02:18:54.889 --> 02:19:01.940
Create an index called person_index on
the stars table, the person_id column.

02:19:01.940 --> 02:19:02.570
Why?

02:19:02.570 --> 02:19:05.600
Well, my query a moment ago
used the person_id column.

02:19:05.600 --> 02:19:06.510
It filtered on it.

02:19:06.510 --> 02:19:08.000
So that might be a bottleneck.

02:19:08.000 --> 02:19:12.290
I'm going to go ahead and create
another index called show_index

02:19:12.290 --> 02:19:14.870
on the stars table on show_id.

02:19:14.870 --> 02:19:18.290
Similarly, a moment ago, my
query used the show_id column.

02:19:18.290 --> 02:19:21.743
And so that, too, might have been a
bottleneck linearly, top to bottom.

02:19:21.743 --> 02:19:22.910
So let me create that index.

02:19:22.910 --> 02:19:25.368
And then lastly, let me create
an index called name_index--

02:19:25.368 --> 02:19:28.459
and this is perhaps the most obvious,
similar to the show titles before--

02:19:28.459 --> 02:19:31.549
on the people table on the name column.

02:19:31.549 --> 02:19:32.930
And that, too, took a moment.

02:19:32.930 --> 02:19:35.330
Now, in total, this took
almost a full second.

02:19:35.330 --> 02:19:37.850
But these indexes only get created once.

02:19:37.850 --> 02:19:40.070
They get maintained
automatically over time.

02:19:40.070 --> 02:19:42.080
But you don't incur
this with every query.

02:19:42.080 --> 02:19:44.389
Now let me do my SELECT again.

02:19:44.389 --> 02:19:48.800
Let me SELECT title FROM
people joining the stars table,

02:19:48.800 --> 02:19:52.730
joining the shows table
WHERE name = "Steve Carell."

02:19:52.730 --> 02:19:53.690
Boom.

02:19:53.690 --> 02:19:56.630
0.001 seconds.

02:19:56.630 --> 02:20:00.930
That was an order of magnitude faster
than the more than half a second

02:20:00.930 --> 02:20:02.620
it took us a little bit ago.

02:20:02.620 --> 02:20:05.860
So here, too, you see the
power of a relational database.

02:20:05.860 --> 02:20:08.912
So even though we've created some
problems for ourselves over time,

02:20:08.912 --> 02:20:12.120
we've solved them ultimately-- granted,
with some more sophisticated features

02:20:12.120 --> 02:20:13.320
and additional syntax.

02:20:13.320 --> 02:20:15.990
But a relational database
is indeed why you use them

02:20:15.990 --> 02:20:19.470
in the real world for the Twitters, the
Instagrams, the Facebooks, the Googles,

02:20:19.470 --> 02:20:22.590
because they can store
data so efficiently

02:20:22.590 --> 02:20:25.960
without redundancy, because you can
normalize them and factor everything

02:20:25.960 --> 02:20:26.460
out.

02:20:26.460 --> 02:20:28.740
But they can still
maintain the relations

02:20:28.740 --> 02:20:30.570
that you might have
seen in a spreadsheet

02:20:30.570 --> 02:20:32.940
but using something closer
to logarithmic thanks

02:20:32.940 --> 02:20:34.770
to those tree structures.

02:20:34.770 --> 02:20:35.910
But there are problems.

02:20:35.910 --> 02:20:38.880
And what we wanted to do is end
on today two primary problems

02:20:38.880 --> 02:20:42.570
that are introduced with SQL,
because they are just unfortunately

02:20:42.570 --> 02:20:43.920
so commonly done.

02:20:43.920 --> 02:20:45.462
Notice this year.

02:20:45.462 --> 02:20:47.670
There is something generally
known as a SQL injection

02:20:47.670 --> 02:20:51.330
attack, which you are
vulnerable to in any application

02:20:51.330 --> 02:20:52.830
where you're taking user input.

02:20:52.830 --> 02:20:55.800
That hasn't been an issue
for my favorites.py file,

02:20:55.800 --> 02:20:58.260
where I only took input from a CSV.

02:20:58.260 --> 02:21:00.510
But if one of you were
malicious, what if one of you

02:21:00.510 --> 02:21:03.750
had maliciously typed in the
word "delete" or "update"

02:21:03.750 --> 02:21:06.180
or something else as
the title of your show

02:21:06.180 --> 02:21:11.040
and I accidentally plugged it into my
own Python code when executing a query?

02:21:11.040 --> 02:21:14.940
You could potentially
inject SQL into my own code.

02:21:14.940 --> 02:21:15.750
How might that be?

02:21:15.750 --> 02:21:18.960
Well, if logging in via Yale, you'll
typically see a form like this.

02:21:18.960 --> 02:21:21.850
Or logging in via Harvard to
something, you'll see a form like this.

02:21:21.850 --> 02:21:23.767
Here's an example that
I'm pretty sure neither

02:21:23.767 --> 02:21:25.710
Harvard nor Yale are vulnerable to.

02:21:25.710 --> 02:21:28.590
Suppose I type in my email
address to this login form

02:21:28.590 --> 02:21:32.350
as malan@harvard.edu'--.

02:21:32.350 --> 02:21:34.890
It turns out, in SQL, --

02:21:34.890 --> 02:21:38.250
is the symbol for commenting if
you want to comment something out.

02:21:38.250 --> 02:21:40.428
It turns out that the
single quote is used

02:21:40.428 --> 02:21:43.470
when you want to search for something
like Steve Carell or, in this case,

02:21:43.470 --> 02:21:44.930
malan@harvard.edu.

02:21:44.930 --> 02:21:45.930
It can be double quotes.

02:21:45.930 --> 02:21:47.040
It can be single quotes.

02:21:47.040 --> 02:21:50.040
In this case, I'm using
single quotes here.

02:21:50.040 --> 02:21:53.400
But let's consider some sample
code, if you will, in Python.

02:21:53.400 --> 02:21:56.910
Here's a line of code that I
propose might exist in the backend

02:21:56.910 --> 02:22:00.180
for Harvard's authentication
or Yale's or anyone else's.

02:22:00.180 --> 02:22:04.890
Maybe someone wrote some Python code
like this using SELECT * FROM users

02:22:04.890 --> 02:22:06.870
WHERE username = question?

02:22:06.870 --> 02:22:10.770
AND password = question?, and they
plugged in username and password.

02:22:10.770 --> 02:22:13.770
Whatever the user typed into
that web form a moment ago gets

02:22:13.770 --> 02:22:16.270
plugged in here to these question marks.

02:22:16.270 --> 02:22:17.290
This is good.

02:22:17.290 --> 02:22:20.980
This is good code, because you're
using the SQL question marks.

02:22:20.980 --> 02:22:24.315
So if you literally just do what we
preach today and use these question

02:22:24.315 --> 02:22:27.870
mark placeholders, you are safe
from SQL injection attacks.

02:22:27.870 --> 02:22:29.760
Unfortunately, there
are too many developers

02:22:29.760 --> 02:22:34.950
in the world that don't practice this
or don't realize this or do forget this.

02:22:34.950 --> 02:22:38.850
If you instead resort to
Python approaches like this,

02:22:38.850 --> 02:22:42.910
where you use an f-string instead,
which might be your instincts after last

02:22:42.910 --> 02:22:45.660
week, because they're wonderfully
convenient with the curly braces

02:22:45.660 --> 02:22:46.290
and all--

02:22:46.290 --> 02:22:50.370
suppose that you literally
plug in username and password

02:22:50.370 --> 02:22:53.430
not with the question mark
placeholders but just literally

02:22:53.430 --> 02:22:55.260
in between those curly braces.

02:22:55.260 --> 02:22:58.210
Watch what happens if my
username, malan@harvard.edu,

02:22:58.210 --> 02:23:03.120
was actually typed in by me
maliciously as malan@harvard.edu'--.

02:23:05.691 --> 02:23:09.030
That would have the effect
of tricking this Python

02:23:09.030 --> 02:23:11.610
code into doing essentially this.

02:23:11.610 --> 02:23:13.590
Let me do a find and replace.

02:23:13.590 --> 02:23:22.881
It would trick Python into executing
username = "malan@harvard.edu"--"

02:23:22.881 --> 02:23:24.660
and then other stuff.

02:23:24.660 --> 02:23:27.480
Unfortunately, the --
again means comment,

02:23:27.480 --> 02:23:33.390
which means you could maybe trick a
server into ignoring the whole password

02:23:33.390 --> 02:23:35.190
part of this SQL query.

02:23:35.190 --> 02:23:37.530
And if the SQL query's
purpose in life is to check,

02:23:37.530 --> 02:23:42.210
is this username and password valid, so
that you can decide to log the user in

02:23:42.210 --> 02:23:44.880
or to say, no, you're
not authorized, well,

02:23:44.880 --> 02:23:48.390
by essentially commenting out
everything related to password,

02:23:48.390 --> 02:23:49.710
notice what I've done.

02:23:49.710 --> 02:23:55.620
I've just now theoretically logged
myself in as malan@harvard.edu without

02:23:55.620 --> 02:24:00.030
even knowing or inputting a password,
because I injected SQL syntax,

02:24:00.030 --> 02:24:04.620
the quote and the --, into my query,
tricking the server into just ignoring

02:24:04.620 --> 02:24:06.870
the password equality check.

02:24:06.870 --> 02:24:11.250
And so it turns out that db.execute,
when you execute an INSERT,

02:24:11.250 --> 02:24:15.240
it returns to you as said the
ID of the newly inserted row.

02:24:15.240 --> 02:24:20.370
When you use db.execute to select
rows from a database table,

02:24:20.370 --> 02:24:25.360
it returns to you a list of rows,
each of which is a dictionary.

02:24:25.360 --> 02:24:28.110
So this is now pseudocode
down here with my comment.

02:24:28.110 --> 02:24:31.140
But if you get back one
row, that would seem

02:24:31.140 --> 02:24:34.470
to imply that there is a
user named malan@harvard.edu.

02:24:34.470 --> 02:24:37.830
Don't know what his password is, because
whoever this person is maliciously

02:24:37.830 --> 02:24:40.860
tricked the server into
ignoring that syntax.

02:24:40.860 --> 02:24:43.890
So SQL injection attacks
are unfortunately

02:24:43.890 --> 02:24:46.570
one of the most common
attacks against SQL databases.

02:24:46.570 --> 02:24:51.090
They are completely preventable if
you simply use placeholders and use

02:24:51.090 --> 02:24:53.940
libraries, whether it's CS50's
or other third-party libraries

02:24:53.940 --> 02:24:55.440
that you may use down the road.

02:24:55.440 --> 02:24:58.530
A common meme on the internet
is this picture here.

02:24:58.530 --> 02:25:00.810
If we Zoom in on this
person's license plate

02:25:00.810 --> 02:25:02.970
or where the license
plate should be, this

02:25:02.970 --> 02:25:05.940
is an example of someone
theoretically trying

02:25:05.940 --> 02:25:10.350
to trick some camera on the highway
into dropping the whole database.

02:25:10.350 --> 02:25:13.710
DROP is another keyword in SQL
that deletes a database table.

02:25:13.710 --> 02:25:15.810
And this person was either
intentionally or just

02:25:15.810 --> 02:25:19.980
a humorously trying to
trick it into executing SQL

02:25:19.980 --> 02:25:21.760
by using syntax like this.

02:25:21.760 --> 02:25:26.070
So characters like single quotes,
--, semicolons are all potentially

02:25:26.070 --> 02:25:29.190
dangerous characters in SQL if
they're passed through unchanged

02:25:29.190 --> 02:25:30.120
to the database.

02:25:30.120 --> 02:25:34.140
A very popular xkcd comic-- let me
give you a moment to just read this--

02:25:34.140 --> 02:25:40.080
is another well-known meme of
sorts now in computer science.

02:25:40.080 --> 02:25:43.880
If you'd like to, read
this one on your own.

02:25:43.880 --> 02:25:51.170
But henceforth, you are now in the
family of educated learners who

02:25:51.170 --> 02:25:54.410
know who Little Bobby Tables is.

02:25:54.410 --> 02:25:56.240
Unfortunately, it's
dead silence in here,

02:25:56.240 --> 02:25:58.340
so I can't tell if anyone is
actually laughing at this joke.

02:25:58.340 --> 02:26:00.110
But anyhow, this is a
very well-known meme.

02:26:00.110 --> 02:26:02.690
So if you're a computer scientist
who knows SQL, you know this one.

02:26:02.690 --> 02:26:05.565
And there's one last problem we'd
like to introduce if you don't mind

02:26:05.565 --> 02:26:07.250
just a couple of final moments here.

02:26:07.250 --> 02:26:09.530
And that is a fundamental
problem in computing

02:26:09.530 --> 02:26:11.690
called race conditions,
which for the first time

02:26:11.690 --> 02:26:14.300
is now manifest in
our discussion of SQL.

02:26:14.300 --> 02:26:18.230
It turns out that SQL and SQL
databases are very often used, again,

02:26:18.230 --> 02:26:21.380
in the real world for very
high-performing applications.

02:26:21.380 --> 02:26:24.320
And by that, I mean, again, the
Googles, the Facebooks, the Twitters

02:26:24.320 --> 02:26:28.490
of the world where lots and lots of
data is coming into servers all at once.

02:26:28.490 --> 02:26:30.200
And case in point,
some of you might have

02:26:30.200 --> 02:26:33.320
clicked Like on this egg some time ago.

02:26:33.320 --> 02:26:35.690
This is the most-liked
Instagram post ever.

02:26:35.690 --> 02:26:39.710
As of last night, it was up
to 50-plus million likes.

02:26:39.710 --> 02:26:42.620
Well eclipsed Kim
Kardashian's previous post,

02:26:42.620 --> 02:26:44.690
which is still at 18 million or so.

02:26:44.690 --> 02:26:47.780
This is to say this is
a hard problem to solve,

02:26:47.780 --> 02:26:51.800
this notion of likes coming
in at such an incredible rate.

02:26:51.800 --> 02:26:55.310
Because suppose that, long
story short, Instagram actually

02:26:55.310 --> 02:26:57.290
has a server with a SQL database.

02:26:57.290 --> 02:27:01.490
And they have code in Python or C++
or whatever language that's talking

02:27:01.490 --> 02:27:02.660
to that database.

02:27:02.660 --> 02:27:04.910
And suppose that they
have code that's trying

02:27:04.910 --> 02:27:06.680
to increment the total number of likes.

02:27:06.680 --> 02:27:08.240
Well, how might this work logically?

02:27:08.240 --> 02:27:11.660
Well, in order to increment the number
of likes that a picture like this egg

02:27:11.660 --> 02:27:14.060
has, you might first
select from the database

02:27:14.060 --> 02:27:18.260
the current number of likes for
the ID of that egg photograph.

02:27:18.260 --> 02:27:19.790
Then you might add 1 to it.

02:27:19.790 --> 02:27:21.797
Then you might update the database.

02:27:21.797 --> 02:27:24.630
And I didn't use it before, but
just like there's INSERT and DELETE,

02:27:24.630 --> 02:27:26.010
there's UPDATE, as well.

02:27:26.010 --> 02:27:29.600
So you might update the database
with the new count plus 1.

02:27:29.600 --> 02:27:31.970
So the code for that might
look a little something

02:27:31.970 --> 02:27:35.600
like this, three lines of code
using CS50's library here,

02:27:35.600 --> 02:27:40.010
where you execute SELECT likes
FROM posts WHERE id = question?,

02:27:40.010 --> 02:27:42.890
where id is the unique
identifier for that egg.

02:27:42.890 --> 02:27:45.740
And then I'm storing the
result in a rows variable,

02:27:45.740 --> 02:27:48.950
which, again, I claim is a list of rows.

02:27:48.950 --> 02:27:52.130
I'm going to go into the first
row, so that's rows bracket 0.

02:27:52.130 --> 02:27:55.070
And I'm going to go into the likes
column to get the actual number.

02:27:55.070 --> 02:27:57.140
And that number, I'm going to
store in a variable called likes.

02:27:57.140 --> 02:27:58.880
So this is going to
be, like, 50,000,000,

02:27:58.880 --> 02:28:01.100
and I want it to go to 50,000,001.

02:28:01.100 --> 02:28:02.370
So how do I do that?

02:28:02.370 --> 02:28:08.780
Well, I execute on the database
UPDATE posts SET likes = ?.

02:28:08.780 --> 02:28:10.980
And then I just plug in likes + 1.

02:28:10.980 --> 02:28:15.020
The problem, though, with the Instagrams
and Googles and Twitters of the world

02:28:15.020 --> 02:28:16.790
is that they don't just have one server.

02:28:16.790 --> 02:28:18.710
They have many thousands of servers.

02:28:18.710 --> 02:28:22.580
And all of those servers might in
parallel be receiving clicks from you

02:28:22.580 --> 02:28:23.960
and I on the internet.

02:28:23.960 --> 02:28:28.310
And those clicks translate into this
code getting executed, executed,

02:28:28.310 --> 02:28:28.970
executed.

02:28:28.970 --> 02:28:32.930
And the problem is that when you have
three lines of code and suppose Brian

02:28:32.930 --> 02:28:35.420
and I click on that egg
at roughly the same time,

02:28:35.420 --> 02:28:40.010
my three lines might not get executed
before his three lines or vice versa.

02:28:40.010 --> 02:28:42.650
They might get commingled
chronologically.

02:28:42.650 --> 02:28:46.130
My first line might get executed, then
Brian's first line might get executed.

02:28:46.130 --> 02:28:48.750
My second line might get
executed, Brian's second line.

02:28:48.750 --> 02:28:50.960
So they might get interspersed
on different servers

02:28:50.960 --> 02:28:53.900
or just temporally in
time, chronologically.

02:28:53.900 --> 02:28:56.690
That's problematic, because
suppose Brian and I click

02:28:56.690 --> 02:28:58.580
on that egg roughly at the same time.

02:28:58.580 --> 02:29:01.010
And we get back the same
answer to the SELECT query.

02:29:01.010 --> 02:29:03.290
50 million is the current count.

02:29:03.290 --> 02:29:06.620
Then our next lines of code execute
on the servers we happen to be on,

02:29:06.620 --> 02:29:09.260
which adds 1 to the likes.

02:29:09.260 --> 02:29:14.780
The server might accidentally end
up updating the row for the egg

02:29:14.780 --> 02:29:20.960
with 50,000,001 both times,
because the fundamental problem is

02:29:20.960 --> 02:29:24.890
if my code executes while
Brian's code executes,

02:29:24.890 --> 02:29:29.480
we are both checking the value of a
variable at essentially the same time.

02:29:29.480 --> 02:29:32.090
And we are both then
making a conclusion--

02:29:32.090 --> 02:29:35.190
oh, the current likes are 50 million.

02:29:35.190 --> 02:29:36.470
We are then making a decision.

02:29:36.470 --> 02:29:38.310
Let's add 1 to 50 million.

02:29:38.310 --> 02:29:41.600
We are then updating the
value with 50,000,001.

02:29:41.600 --> 02:29:46.640
The problem is, though, that, really,
if Brian's code or the server he happens

02:29:46.640 --> 02:29:50.780
to be connected to on Instagram happens
to have selected the number of likes

02:29:50.780 --> 02:29:53.900
first, he should be allowed
to finish the code that's

02:29:53.900 --> 02:29:57.950
being executed so that when I
select it, I see 50,000,001,

02:29:57.950 --> 02:30:02.270
and I add 1 to that so the
new count is 50,000,002.

02:30:02.270 --> 02:30:04.070
This is what's known
as a race condition.

02:30:04.070 --> 02:30:06.980
When you write code in a multiserver--

02:30:06.980 --> 02:30:11.120
more fancily known as a multithreaded
environment-- lines of code

02:30:11.120 --> 02:30:16.160
chronologically can get commingled on
different servers at any given time.

02:30:16.160 --> 02:30:18.200
The problem fundamentally
derives from the fact

02:30:18.200 --> 02:30:22.430
that if Brian's server is in the middle
of checking the state of a variable,

02:30:22.430 --> 02:30:23.840
I should be locked out.

02:30:23.840 --> 02:30:26.870
I should not be allowed to click
on that button at the same time,

02:30:26.870 --> 02:30:30.590
or my code should not be
allowed to execute logically.

02:30:30.590 --> 02:30:33.050
So there is a solution
when you have to write code

02:30:33.050 --> 02:30:36.500
like this, as is common for Twitter and
Instagram and Facebook and the like,

02:30:36.500 --> 02:30:38.420
to use what are called transactions.

02:30:38.420 --> 02:30:41.815
Transactions add some few new pieces
of syntax that we won't dwell on today

02:30:41.815 --> 02:30:43.690
and you don't need to
use in the coming days.

02:30:43.690 --> 02:30:46.180
But they do solve a
fundamentally hard problem.

02:30:46.180 --> 02:30:50.500
Transactions essentially allow
you to lock a table or, really,

02:30:50.500 --> 02:30:54.885
a row in the table so that
if Brian's click on that egg

02:30:54.885 --> 02:30:57.760
results in some code executing that's
in the process of checking what

02:30:57.760 --> 02:31:02.770
is the total like count, my click on the
egg will not get handled by the server

02:31:02.770 --> 02:31:05.630
until his code is done executing.

02:31:05.630 --> 02:31:08.470
So in green here, I've proposed
the way you should do this.

02:31:08.470 --> 02:31:12.968
You shouldn't just execute the middle
three lines, "you" being Facebook,

02:31:12.968 --> 02:31:13.510
in this case.

02:31:13.510 --> 02:31:17.200
Instagram should execute
BEGIN TRANSACTION first, then

02:31:17.200 --> 02:31:19.300
COMMIT the transaction at the end.

02:31:19.300 --> 02:31:22.780
And the design of transactions is
that all of the lines in between

02:31:22.780 --> 02:31:26.320
will either succeed
altogether or fail altogether.

02:31:26.320 --> 02:31:28.180
The database won't get
into this funky state

02:31:28.180 --> 02:31:32.320
where we start losing
track of likes on eggs.

02:31:32.320 --> 02:31:34.660
And though this has not been
an issue in recent years,

02:31:34.660 --> 02:31:36.952
back in the day when Twitter
was first getting started,

02:31:36.952 --> 02:31:40.232
Twitter was super popular and
super offline a lot of the time.

02:31:40.232 --> 02:31:42.190
There was this thing
called a Fail Whale, which

02:31:42.190 --> 02:31:44.037
is the picture they
showed on their website

02:31:44.037 --> 02:31:46.120
when they were getting too
much traffic to handle.

02:31:46.120 --> 02:31:49.540
That was because when people are liking
and tweeting and retweeting things,

02:31:49.540 --> 02:31:51.520
it's a huge amount of data coming in.

02:31:51.520 --> 02:31:54.500
And it turns out it's very
hard to solve these problems.

02:31:54.500 --> 02:31:58.450
But locking the database table or
the rows with these transactions

02:31:58.450 --> 02:32:00.490
is one way fundamentally to solve this.

02:32:00.490 --> 02:32:03.160
And in our final extra
time today, we thought

02:32:03.160 --> 02:32:05.080
we would play this out
in the same example

02:32:05.080 --> 02:32:07.510
that I was taught transactions
in some years ago.

02:32:07.510 --> 02:32:10.750
Suppose that the scenario at hand
is that you and your roommates

02:32:10.750 --> 02:32:12.370
have a nice dorm fridge.

02:32:12.370 --> 02:32:15.100
And you're all in the habit
of drinking lots of milk,

02:32:15.100 --> 02:32:17.050
and you want to be able
to drink some milk.

02:32:17.050 --> 02:32:19.420
But you go to the fridge,
like I'm about to here.

02:32:19.420 --> 02:32:22.210
And you realize, uh-oh,
we're out of milk.

02:32:22.210 --> 02:32:25.570
And so now I am inspecting the
state of this refrigerator, which

02:32:25.570 --> 02:32:27.970
is quite old but also quite empty.

02:32:27.970 --> 02:32:30.160
And the state of this
variable, being empty,

02:32:30.160 --> 02:32:33.620
tells me that I should go to
CVS and buy some more milk.

02:32:33.620 --> 02:32:35.080
So what do I then do?

02:32:35.080 --> 02:32:37.150
I'm presumably going
to close the fridge,

02:32:37.150 --> 02:32:40.600
and I'm going to go and
leave and go head to CVS.

02:32:40.600 --> 02:32:43.510
Unfortunately, the same problem
arises that we'll act out here

02:32:43.510 --> 02:32:46.150
in our final 60 or so
seconds together, whereby

02:32:46.150 --> 02:32:49.660
if Brian now, my roommate in
this story, also wants some milk,

02:32:49.660 --> 02:32:52.060
he comes by when I'm
already headed to the store,

02:32:52.060 --> 02:32:55.310
inspects the state of the fridge,
and realizes, oh, we're out of milk.

02:32:55.310 --> 02:32:57.650
So he nicely will go restock, as well.

02:32:57.650 --> 02:32:59.620
So let's see how this
plays out, and we'll

02:32:59.620 --> 02:33:03.590
see if there isn't a
similar, analogous solution.

02:33:03.590 --> 02:33:05.620
So I've checked the
state of the variable.

02:33:05.620 --> 02:33:06.920
We're indeed out of milk.

02:33:06.920 --> 02:33:08.030
I'll be right back.

02:33:08.030 --> 02:33:09.085
Just going to go to CVS.

02:33:26.336 --> 02:33:29.829
[MUSIC PLAYING]

02:34:44.240 --> 02:34:45.020
All right.

02:34:45.020 --> 02:34:46.550
I am now back from the store.

02:34:46.550 --> 02:34:47.870
I've picked up some milk.

02:34:47.870 --> 02:34:50.090
Going to go ahead and put
it into the fridge and--

02:34:50.090 --> 02:34:51.710
oh, how did this happen?

02:34:51.710 --> 02:34:53.570
Now there's multiple jugs of milk.

02:34:53.570 --> 02:34:55.490
And of course, milk
does not last that long.

02:34:55.490 --> 02:34:57.282
And Brian and I don't
drink that much milk.

02:34:57.282 --> 02:34:58.970
So this is a really serious problem.

02:34:58.970 --> 02:35:03.150
We've sort of tried to update the value
of this variable at the same time.

02:35:03.150 --> 02:35:05.030
So how do we go about fixing this?

02:35:05.030 --> 02:35:07.320
What's the actual solution here?

02:35:07.320 --> 02:35:09.920
Well, I dare say that we
can draw some inspiration

02:35:09.920 --> 02:35:13.787
from the world of transactions
and the world of databases.

02:35:13.787 --> 02:35:15.620
And perhaps create a
visual for here that we

02:35:15.620 --> 02:35:18.270
hope you never forget if you
take nothing away from today.

02:35:18.270 --> 02:35:21.327
Let's go ahead and act this out
one last time where, this time,

02:35:21.327 --> 02:35:22.910
I'm going to be a little more extreme.

02:35:22.910 --> 02:35:24.290
I go ahead and open the fridge.

02:35:24.290 --> 02:35:25.940
I realize, oh, we're out of milk.

02:35:25.940 --> 02:35:27.380
I'm going to go to the store.

02:35:27.380 --> 02:35:29.450
I do not want to allow
for this situation

02:35:29.450 --> 02:35:32.490
where Brian accidentally
checks the fridge, as well.

02:35:32.490 --> 02:35:37.670
So I am going to lock
the refrigerator instead.

02:35:37.670 --> 02:35:41.390
Let me go ahead and
drape this through here.

02:35:43.940 --> 02:35:49.050
A little extreme, but I think so
long as he can't get into the fridge,

02:35:49.050 --> 02:35:52.790
this shouldn't be a problem.

02:35:52.790 --> 02:35:56.060
Let me go ahead now and
just attach the lock here.

02:35:56.060 --> 02:35:57.170
Almost got it.

02:35:57.170 --> 02:35:58.340
Come on.

02:35:58.340 --> 02:35:59.570
All right.

02:35:59.570 --> 02:36:01.993
Now the fridge is locked.

02:36:01.993 --> 02:36:03.410
Now I'm going to go get some milk.

02:36:17.210 --> 02:36:18.210
BRIAN YU: [SIGHS]

02:36:18.710 --> 02:36:22.060
[MUSIC PLAYING]