WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:03.968
[MUSIC PLAYING]

00:00:03.968 --> 00:01:18.020
[MUSIC PLAYING]

00:01:18.020 --> 00:01:20.030
DAVID J. MALAN: This is CS50.

00:01:20.030 --> 00:01:23.360
And this is week 7, the
week, here, of Halloween.

00:01:23.360 --> 00:01:26.510
Indeed, special thanks to
CS50's own Valerie and her mom

00:01:26.510 --> 00:01:30.590
for having created this very festive
scenery, and all past ones as well.

00:01:30.590 --> 00:01:33.330
Today, we pick up where
we left off last time,

00:01:33.330 --> 00:01:35.180
which, recall, we introduced Python.

00:01:35.180 --> 00:01:38.240
And that was our big transition
from C, where suddenly things

00:01:38.240 --> 00:01:40.850
started to look new again,
probably, syntactically.

00:01:40.850 --> 00:01:44.390
But also, probably things
hopefully started to feel easier.

00:01:44.390 --> 00:01:48.080
Well, with that said, problem set
6 certainly added some challenges,

00:01:48.080 --> 00:01:49.380
and you did some new things.

00:01:49.380 --> 00:01:53.090
But hopefully you've begun to appreciate
that with Python, just a lot more stuff

00:01:53.090 --> 00:01:54.290
is easier to do.

00:01:54.290 --> 00:01:56.960
You get more out of the box
with the language itself.

00:01:56.960 --> 00:01:59.450
And that's going to be so
useful over the coming weeks

00:01:59.450 --> 00:02:03.980
as we transition further to introducing
something called databases today,

00:02:03.980 --> 00:02:06.270
web programming next
week and the week after.

00:02:06.270 --> 00:02:08.900
So that by term's end, and perhaps
even for your final project,

00:02:08.900 --> 00:02:11.870
you really are building
something from scratch

00:02:11.870 --> 00:02:15.415
using all of these various
tools somehow together.

00:02:15.415 --> 00:02:17.540
So before we do that,
though, today, let's consider

00:02:17.540 --> 00:02:22.610
what we weren't really able to
do last week, which was actually

00:02:22.610 --> 00:02:25.490
create and store data ourselves.

00:02:25.490 --> 00:02:30.710
In Python, we've played around with the
CSV, comma-separated values library.

00:02:30.710 --> 00:02:33.810
And you've been able to
read in CSVs from disk,

00:02:33.810 --> 00:02:37.880
so to speak, that is, from files
in your programming environments.

00:02:37.880 --> 00:02:41.455
But we haven't necessarily started
saving data, persisting data ourselves.

00:02:41.455 --> 00:02:44.330
And that's a huge limitation, because
pretty much all of the examples

00:02:44.330 --> 00:02:46.288
we've done thus far with
a couple of exceptions

00:02:46.288 --> 00:02:49.550
have involved my providing input
at the keyboard or even vocally.

00:02:49.550 --> 00:02:51.508
But then nothing happens to it.

00:02:51.508 --> 00:02:53.300
It disappears the moment
the program quits,

00:02:53.300 --> 00:02:55.410
because it was only
being stored in memory.

00:02:55.410 --> 00:02:59.090
But today, we'll start to focus all
the more on storing things on disk,

00:02:59.090 --> 00:03:02.100
that is, storing things
in files and folders

00:03:02.100 --> 00:03:04.940
so that you can actually
write programs that remember

00:03:04.940 --> 00:03:06.620
what it is the human did last time.

00:03:06.620 --> 00:03:09.320
And ultimately, you can
actually make mobile or web apps

00:03:09.320 --> 00:03:12.050
that actually begin to grow, and
grow, and grow their data sets,

00:03:12.050 --> 00:03:15.650
as might happen if you get more and
more users, for instance, on a website.

00:03:15.650 --> 00:03:19.400
To play, then, with this new capability
of being able to write files,

00:03:19.400 --> 00:03:22.217
let's go ahead and
just collect some data.

00:03:22.217 --> 00:03:24.050
In fact, those of you
here in person, if you

00:03:24.050 --> 00:03:26.960
want to pull up this URL
on your phone or laptop,

00:03:26.960 --> 00:03:29.060
that's going to lead
you to a Google Form.

00:03:29.060 --> 00:03:34.130
And that Google Form is going to
ask you in just a moment for really

00:03:34.130 --> 00:03:35.320
just your favorite TV show.

00:03:35.320 --> 00:03:37.070
And it's going to ask
you to categorize it

00:03:37.070 --> 00:03:41.270
according to a genre, like comedy,
or drama, or action, or musical,

00:03:41.270 --> 00:03:42.460
or something like that.

00:03:42.460 --> 00:03:44.210
And this is useful,
because if you've ever

00:03:44.210 --> 00:03:47.520
used a Google Form before, or
Microsoft's equivalent with Office 365,

00:03:47.520 --> 00:03:50.270
it's a really useful mechanism at
just collecting data from users,

00:03:50.270 --> 00:03:54.000
and then ultimately, putting
it into a spreadsheet form.

00:03:54.000 --> 00:03:58.160
So this is a screenshot of
the form that those of you

00:03:58.160 --> 00:04:01.080
here in person or tuning in on
Zoom are currently filling out.

00:04:01.080 --> 00:04:02.420
It's asking only two questions.

00:04:02.420 --> 00:04:04.550
What's the title of
your favorite TV show?

00:04:04.550 --> 00:04:09.470
And what are one or more genres
into which your TV show falls?

00:04:09.470 --> 00:04:12.860
And I'll go ahead and
pivot now to the view

00:04:12.860 --> 00:04:15.667
that I'll be able to see as the
person who created this form, which

00:04:15.667 --> 00:04:17.209
is quite simply a Google spreadsheet.

00:04:17.209 --> 00:04:19.709
Google Forms has this nice
feature, if you've never noticed,

00:04:19.709 --> 00:04:22.650
that allows you to export your
data to a Google Spreadsheet.

00:04:22.650 --> 00:04:25.130
And then from there, we
can actually grab the file

00:04:25.130 --> 00:04:27.500
and download it to my
own Mac or your own PC

00:04:27.500 --> 00:04:30.320
so that we can actually play around
with the data that's come in.

00:04:30.320 --> 00:04:32.570
So in fact, let me go
ahead and slide over

00:04:32.570 --> 00:04:36.540
to this, the live Google Spreadsheet.

00:04:36.540 --> 00:04:40.490
And you'll see, probably, a whole
bunch of familiar TV shows here,

00:04:40.490 --> 00:04:41.360
all coming in.

00:04:41.360 --> 00:04:43.820
And if we keep scrolling, and
scrolling, and scrolling--

00:04:43.820 --> 00:04:45.290
only 46, 47.

00:04:45.290 --> 00:04:47.430
There we go, up to 50 plus already.

00:04:47.430 --> 00:04:50.420
If you need that URL again
here, if you're just tuning in,

00:04:50.420 --> 00:04:52.850
you can go to this URL here.

00:04:52.850 --> 00:04:55.760
And in just a moment,
we'll have a bunch of data

00:04:55.760 --> 00:04:59.450
with which we can start to experiment.

00:04:59.450 --> 00:05:01.070
I'll give you a moment or so there.

00:05:06.330 --> 00:05:08.418
All right.

00:05:08.418 --> 00:05:09.960
Let me hang in there a little longer.

00:05:09.960 --> 00:05:11.418
OK, we've got over 100 submissions.

00:05:11.418 --> 00:05:12.390
Good.

00:05:12.390 --> 00:05:15.270
Good, even more coming in now.

00:05:15.270 --> 00:05:16.890
And we can see them coming in live.

00:05:16.890 --> 00:05:18.750
Here, let me switch
back to the spreadsheet.

00:05:18.750 --> 00:05:21.090
The list is growing, and
growing, and growing.

00:05:21.090 --> 00:05:22.900
And in just a moment--

00:05:22.900 --> 00:05:26.490
let me give Carter a moment to
help me export it in real time.

00:05:26.490 --> 00:05:29.640
Carter, just give me a heads
up when it's reasonable for me

00:05:29.640 --> 00:05:32.880
to download this file.

00:05:32.880 --> 00:05:35.140
All right, and I'll begin
to do this very slowly.

00:05:35.140 --> 00:05:37.720
So I'm going to go up to the File
menu, if you've never done this before.

00:05:37.720 --> 00:05:40.110
Download-- you can download a whole
bunch of formats, one in Excel.

00:05:40.110 --> 00:05:42.060
But more simply, and the one
we'll start to play with here,

00:05:42.060 --> 00:05:43.630
is comma-separated values.

00:05:43.630 --> 00:05:47.010
So CSV files we used this past
week, why are they useful?

00:05:47.010 --> 00:05:50.190
Now that you've played with them
or used them in past real world,

00:05:50.190 --> 00:05:55.260
what's the utility of a CSV file versus
something like Excel, for instance?

00:05:55.260 --> 00:05:59.010
Why CSV in the first place?

00:05:59.010 --> 00:06:00.030
Any instincts?

00:06:00.030 --> 00:06:00.858
Yeah?

00:06:00.858 --> 00:06:02.730
AUDIENCE: Because it's just a text file?

00:06:02.730 --> 00:06:04.605
DAVID J. MALAN: OK, so
storage is compelling.

00:06:04.605 --> 00:06:07.710
A simple text file with ASCII or
Unicode text is probably pretty small.

00:06:07.710 --> 00:06:08.290
I like that.

00:06:08.290 --> 00:06:10.101
Other thoughts?

00:06:10.101 --> 00:06:11.184
AUDIENCE: Structure of it?

00:06:11.184 --> 00:06:12.518
DAVID J. MALAN: Yeah, well said.

00:06:12.518 --> 00:06:14.760
It's just a simple text
format, but using conventions

00:06:14.760 --> 00:06:18.330
like commas you can represent the
idea of columns using new lines,

00:06:18.330 --> 00:06:20.430
backslash ends invisibly
at the end of your lines,

00:06:20.430 --> 00:06:22.000
you can create the idea of rows.

00:06:22.000 --> 00:06:24.000
So it's a very simple
way of implementing what

00:06:24.000 --> 00:06:26.610
we might call a flat-file database.

00:06:26.610 --> 00:06:28.860
It's a way of storing
data in a flat, that is,

00:06:28.860 --> 00:06:32.310
very simple file that's just
pure ASCII or Unicode text.

00:06:32.310 --> 00:06:35.340
And more compellingly, I dare
say, is that with a CSV file,

00:06:35.340 --> 00:06:36.862
it's completely portable.

00:06:36.862 --> 00:06:38.820
Something is portable in
the world of computing

00:06:38.820 --> 00:06:41.910
if it means you can use it on a Mac
or a PC running this operating system,

00:06:41.910 --> 00:06:42.750
or this other one.

00:06:42.750 --> 00:06:45.480
And portability is nice because if
I were to download an Excel file,

00:06:45.480 --> 00:06:47.760
there'd be a whole bunch of
people in this room and online

00:06:47.760 --> 00:06:50.190
who couldn't download it because
they haven't bought Microsoft Excel

00:06:50.190 --> 00:06:50.940
or installed it.

00:06:50.940 --> 00:06:55.680
Or if they have a Mac, or if it's
a .numbers file in the Mac world,

00:06:55.680 --> 00:06:57.610
a PC user might not be
able to download it.

00:06:57.610 --> 00:06:59.800
So a CSV is indeed very portable.

00:06:59.800 --> 00:07:02.940
So I'm going to go ahead and
download, quite simply, the CSV

00:07:02.940 --> 00:07:04.440
version of this file.

00:07:04.440 --> 00:07:06.960
That's going to put it onto
my own Mac's Downloads folder.

00:07:06.960 --> 00:07:11.460
And let me go ahead here, and in just a
moment, let me just simplify the name.

00:07:11.460 --> 00:07:14.910
Because it actually downloads
it at a pretty large name.

00:07:14.910 --> 00:07:18.120
And give me just one moment here,
and you'll see that, indeed,

00:07:18.120 --> 00:07:21.210
on my Mac I have a file
called favorites.csv.

00:07:21.210 --> 00:07:22.750
I shortened the name real quick.

00:07:22.750 --> 00:07:28.680
And now what I'm going to do is go
over to VS Code, and in VS Code,

00:07:28.680 --> 00:07:30.480
I'm going to open my File Explorer.

00:07:30.480 --> 00:07:34.620
And if I minimize my window here for
a moment, a handy feature of VS Code

00:07:34.620 --> 00:07:37.980
is that you can just drag and drop a
file, for instance, into your Explorer.

00:07:37.980 --> 00:07:40.660
And voila, it's going to
automatically upload it for you.

00:07:40.660 --> 00:07:43.260
So let me go ahead and full
screen here, close my Explorer,

00:07:43.260 --> 00:07:45.120
temporarily close my Terminal window.

00:07:45.120 --> 00:07:48.720
And you'll see here a
CSV file, favorites.csv.

00:07:48.720 --> 00:07:51.390
And the first row, by
convention, has whatever

00:07:51.390 --> 00:07:54.720
the columns were in Google
Spreadsheets, or Office 365,

00:07:54.720 --> 00:07:58.620
in Excel online, timestamp,
comma, title, comma, genres.

00:07:58.620 --> 00:08:00.390
Then, we have timestamps,
which indicates

00:08:00.390 --> 00:08:01.722
when people started submitting.

00:08:01.722 --> 00:08:03.930
Looks like a couple of people
were super eager to get

00:08:03.930 --> 00:08:05.430
started an hour or two ago.

00:08:05.430 --> 00:08:09.340
And then, you have the
title next, after a comma.

00:08:09.340 --> 00:08:12.150
But there's kind of a
curiosity after that.

00:08:12.150 --> 00:08:15.510
Sometimes I see the genre
like comedy, comedy, comedy,

00:08:15.510 --> 00:08:19.870
but sometimes it's like crime, comma,
drama, or action, comma, crime, comma,

00:08:19.870 --> 00:08:20.550
drama.

00:08:20.550 --> 00:08:22.470
And those things are quoted.

00:08:22.470 --> 00:08:24.180
And yet, I didn't do any quotes.

00:08:24.180 --> 00:08:25.800
You probably didn't type any quotes.

00:08:25.800 --> 00:08:30.180
Where are those quotes
coming from in this CSV file?

00:08:30.180 --> 00:08:31.650
Why are they there if we infer?

00:08:31.650 --> 00:08:32.485
Yeah?

00:08:32.485 --> 00:08:35.340
AUDIENCE: [INAUDIBLE]

00:08:35.340 --> 00:08:38.308
DAVID J. MALAN: Yeah, so you
have a corner case, if you will.

00:08:38.308 --> 00:08:40.350
Because if you're using
commas, as you described,

00:08:40.350 --> 00:08:44.280
to separate your data into what
are effectively columns, well,

00:08:44.280 --> 00:08:47.010
you've painted yourself into
a corner if your actual data

00:08:47.010 --> 00:08:48.578
has commas in it itself.

00:08:48.578 --> 00:08:51.120
So what Google has done, what
Microsoft does, what Apple does

00:08:51.120 --> 00:08:54.330
is, they quote any strings
of text that themselves

00:08:54.330 --> 00:08:58.560
have commas so that these are
now English grammatical commas,

00:08:58.560 --> 00:09:00.730
not CSV specific commas.

00:09:00.730 --> 00:09:03.400
So it's a way of escaping
your data, if you will.

00:09:03.400 --> 00:09:06.070
And escaping just means to call
out a symbol in a special way

00:09:06.070 --> 00:09:08.637
so it's not misinterpreted
as something else.

00:09:08.637 --> 00:09:10.470
All right, so this is
all to say that we now

00:09:10.470 --> 00:09:13.980
have all of this data with which we
can play in the form of what we'll

00:09:13.980 --> 00:09:15.840
start calling a flat-file database.

00:09:15.840 --> 00:09:19.230
So suppose I wanted to now
start manipulating this data,

00:09:19.230 --> 00:09:22.110
and I want to store it ultimately,
indeed, in this CSV format.

00:09:22.110 --> 00:09:24.540
How can I actually
start to read this data,

00:09:24.540 --> 00:09:26.950
maybe clean it up, maybe
do some analytics on it

00:09:26.950 --> 00:09:30.570
and actually figure out, what's the most
popular show among those who submitted

00:09:30.570 --> 00:09:32.190
here over the past few minutes?

00:09:32.190 --> 00:09:34.270
Well, let me go ahead and close this.

00:09:34.270 --> 00:09:38.970
Let me go ahead, then, and open up,
for instance, just my Terminal window.

00:09:38.970 --> 00:09:41.970
And let's code up a file
called favorites.py.

00:09:41.970 --> 00:09:46.110
And let's go ahead and iteratively start
simple by just opening up this file

00:09:46.110 --> 00:09:47.830
and printing out what's inside of it.

00:09:47.830 --> 00:09:51.330
So you might recall that we can do
this by doing something like import CSV

00:09:51.330 --> 00:09:54.990
to give myself some CSV
reading functionality.

00:09:54.990 --> 00:09:59.610
Then, I can go ahead and do something
like with open, the name of the file

00:09:59.610 --> 00:10:02.280
that I want to open in read mode.

00:10:02.280 --> 00:10:03.990
Quote, unquote, "r" means to read it.

00:10:03.990 --> 00:10:06.600
And then, I can say as
file, or whatever other name

00:10:06.600 --> 00:10:09.670
for a variable to say that
I want to open this file,

00:10:09.670 --> 00:10:12.480
and essentially store some kind of
reference to it in that variable

00:10:12.480 --> 00:10:13.600
called file.

00:10:13.600 --> 00:10:16.920
Then, I can give myself a
reader, and I can say csv.reader,

00:10:16.920 --> 00:10:18.397
passing in that file as input.

00:10:18.397 --> 00:10:19.980
And this is the magic of that library.

00:10:19.980 --> 00:10:23.190
It deals with the process of opening
it, reading it, and giving you

00:10:23.190 --> 00:10:26.430
back something that you can just
iterate over, like with a for loop

00:10:26.430 --> 00:10:30.510
I do want to skip the first row,
and recall that I can do this.

00:10:30.510 --> 00:10:33.665
Next, reader, is this little trick
that just says, ignore the first row.

00:10:33.665 --> 00:10:35.040
Because the first one is special.

00:10:35.040 --> 00:10:37.410
It said timestamp, title, genres.

00:10:37.410 --> 00:10:39.400
That's not your data, that was mine.

00:10:39.400 --> 00:10:41.880
But this means now that
I've skipped that first row.

00:10:41.880 --> 00:10:44.700
Everything hereafter is going
to be the title of a show

00:10:44.700 --> 00:10:46.260
that you all like, so let me do this.

00:10:46.260 --> 00:10:51.090
For row in the reader, let's go
ahead and print out the title

00:10:51.090 --> 00:10:52.860
of the show each of you typed in.

00:10:52.860 --> 00:10:57.000
How do I get at the title of
the show each of you typed in?

00:10:57.000 --> 00:10:58.740
It's somewhere inside of row.

00:10:58.740 --> 00:11:00.790
Row recalls a list.

00:11:00.790 --> 00:11:02.910
So what do I want to
type next in order to get

00:11:02.910 --> 00:11:08.920
at the title of the current
row just as a quick check here?

00:11:08.920 --> 00:11:11.350
What do I want to type to
get at the title of the row,

00:11:11.350 --> 00:11:15.520
keeping in mind, again, that it
was timestamp, title, genres?

00:11:15.520 --> 00:11:16.020
Yeah?

00:11:16.020 --> 00:11:16.895
AUDIENCE: [INAUDIBLE]

00:11:16.895 --> 00:11:18.910
DAVID J. MALAN: So row
bracket 1 would give me

00:11:18.910 --> 00:11:22.640
the second column, 0 index, that is,
the one in the middle with the title.

00:11:22.640 --> 00:11:24.520
So this program isn't
that interesting yet,

00:11:24.520 --> 00:11:27.370
but it's a quick and dirty way to
figure out, all right, what's my data

00:11:27.370 --> 00:11:27.820
look like?

00:11:27.820 --> 00:11:29.987
Let me actually just do a
little bit of a check here

00:11:29.987 --> 00:11:32.380
and see if it contains
the data I think it does.

00:11:32.380 --> 00:11:34.490
Let me maximize my Terminal window here.

00:11:34.490 --> 00:11:37.810
Let me run Python of
favorites.py, hitting Enter.

00:11:37.810 --> 00:11:42.520
And you'll see now a purely
textual list of all of the shows

00:11:42.520 --> 00:11:44.560
you all seem to like here.

00:11:44.560 --> 00:11:47.080
But what's noteworthy about it?

00:11:47.080 --> 00:11:49.960
Specific shows aside,
judgment aside as to people's

00:11:49.960 --> 00:11:54.640
TV tastes, what's interesting or
noteworthy about the data that

00:11:54.640 --> 00:11:58.090
might create some problems for us
if we start to analyze this data,

00:11:58.090 --> 00:11:59.740
and figure out what's the most popular?

00:11:59.740 --> 00:12:02.590
How many people like this or that?

00:12:02.590 --> 00:12:03.340
What do you think?

00:12:03.340 --> 00:12:04.151
Yeah?

00:12:04.151 --> 00:12:07.458
AUDIENCE: User errors [INAUDIBLE].

00:12:07.458 --> 00:12:09.500
DAVID J. MALAN: Yeah,
there might be user errors,

00:12:09.500 --> 00:12:12.800
or just stylistic differences that
give the appearance that one show

00:12:12.800 --> 00:12:14.400
is different from the other.

00:12:14.400 --> 00:12:16.970
For instance, here.

00:12:16.970 --> 00:12:20.250
Let's see if I can see an
example on the screen here.

00:12:20.250 --> 00:12:24.340
Yeah, so friends here is an all
lowercase, Friends here is capitalized.

00:12:24.340 --> 00:12:25.140
No big deal.

00:12:25.140 --> 00:12:26.370
We can sort of mitigate that.

00:12:26.370 --> 00:12:29.480
But this is just a tiny example
of where data in the real world

00:12:29.480 --> 00:12:30.758
can get messy fast.

00:12:30.758 --> 00:12:32.300
And that probably wasn't even a typo.

00:12:32.300 --> 00:12:37.587
It was just someone not caring as much
to capitalize it, and that's fine.

00:12:37.587 --> 00:12:39.920
Your users are going to type
what they're going to type.

00:12:39.920 --> 00:12:43.580
So let's see if we can't now begin
to get at more specific data,

00:12:43.580 --> 00:12:45.500
and maybe even clean
some of this data up.

00:12:45.500 --> 00:12:50.570
Let me go back into my file
called favorites.py here,

00:12:50.570 --> 00:12:54.350
and let's actually do something a
little more user friendly for me.

00:12:54.350 --> 00:12:58.070
Instead of a reader, recall that there
was this dictionary reader that's

00:12:58.070 --> 00:12:59.750
just a little more user friendly.

00:12:59.750 --> 00:13:04.670
And it means I can type in dictionary
reader here, passing in the same file.

00:13:04.670 --> 00:13:10.670
But now, when I iterate over this
reader variable, what is each row?

00:13:10.670 --> 00:13:14.060
When using a DictReader instead
of a reader, recall, and this

00:13:14.060 --> 00:13:17.690
is just a peculiarity
of the CSV library,

00:13:17.690 --> 00:13:22.530
this gives me back, not a list
of cells, but what instead,

00:13:22.530 --> 00:13:24.950
which is marginally more
user friendly for me?

00:13:24.950 --> 00:13:25.460
Yeah?

00:13:25.460 --> 00:13:27.585
AUDIENCE: [INAUDIBLE]

00:13:27.585 --> 00:13:28.460
DAVID J. MALAN: Yeah.

00:13:28.460 --> 00:13:30.860
I can now use open bracket,
quotes, and the title.

00:13:30.860 --> 00:13:33.770
Because what's coming back
now is a dict object, that is,

00:13:33.770 --> 00:13:37.010
a dictionary which has keys and values.

00:13:37.010 --> 00:13:39.060
The keys of which are
the column headings.

00:13:39.060 --> 00:13:41.400
The values of which are the
data I actually care about.

00:13:41.400 --> 00:13:44.030
So this is just marginally
better because, one, it's

00:13:44.030 --> 00:13:47.330
just way more obvious to me, the
author of this code, what it is I'm

00:13:47.330 --> 00:13:48.050
getting at.

00:13:48.050 --> 00:13:50.310
I don't remember what
column the title was.

00:13:50.310 --> 00:13:50.810
Was it 0?

00:13:50.810 --> 00:13:51.320
Was it 1?

00:13:51.320 --> 00:13:51.890
Was it 2?

00:13:51.890 --> 00:13:53.450
That's something you're
going to forget over time.

00:13:53.450 --> 00:13:56.510
And God forbid someone changes the
data by just dragging and dropping

00:13:56.510 --> 00:13:59.390
the columns in Excel, or Apple
Numbers, or Google Spreadsheets.

00:13:59.390 --> 00:14:01.978
That's going to break all
of your numeric indices.

00:14:01.978 --> 00:14:03.770
And so a dictionary
reader is arguably just

00:14:03.770 --> 00:14:07.310
better design because it's
more robust against changes

00:14:07.310 --> 00:14:08.960
and potential errors like that.

00:14:08.960 --> 00:14:12.560
Now the effect of this change isn't
going to be really any different.

00:14:12.560 --> 00:14:16.820
If I run Python of favorites.py,
voila, I get all of the same results.

00:14:16.820 --> 00:14:21.290
But I've now not made any assumptions
as to where each of the columns

00:14:21.290 --> 00:14:22.920
actually is numerically.

00:14:22.920 --> 00:14:23.420
All right.

00:14:23.420 --> 00:14:26.330
Well, let's go ahead and now
filter out some duplicates.

00:14:26.330 --> 00:14:29.960
Because there's a lot of commonality
among some of the shows here, so let's

00:14:29.960 --> 00:14:32.660
see if we can't filter out duplicates.

00:14:32.660 --> 00:14:38.900
If I'm reading a CSV file top to bottom,
what intuitively might be the logic

00:14:38.900 --> 00:14:41.650
I want to implement to
filter out duplicates?

00:14:41.650 --> 00:14:44.900
It's not going to be quite as simple as
a simple function that does it for me.

00:14:44.900 --> 00:14:47.300
I'm going to have to build this.

00:14:47.300 --> 00:14:50.630
But logically, if you're reading
a file from top to bottom,

00:14:50.630 --> 00:14:54.860
how might you go about, in
Python or just any context,

00:14:54.860 --> 00:14:58.040
getting rid of duplicate values?

00:14:58.040 --> 00:14:59.862
Yeah, what do you think?

00:14:59.862 --> 00:15:05.735
AUDIENCE: [INAUDIBLE]

00:15:05.735 --> 00:15:06.610
DAVID J. MALAN: Sure.

00:15:06.610 --> 00:15:09.730
I could use a list and I could
add each title to the list,

00:15:09.730 --> 00:15:13.070
but first check if I put
this into the list before.

00:15:13.070 --> 00:15:15.140
So let's try a little
something like that.

00:15:15.140 --> 00:15:18.340
Let me go ahead and create a variable
at the top of my program here.

00:15:18.340 --> 00:15:21.340
I'll call it titles, for instance,
initialize to an empty list,

00:15:21.340 --> 00:15:23.050
open bracket, close bracket.

00:15:23.050 --> 00:15:27.860
And then, inside of my loop
here, instead of printing it out,

00:15:27.860 --> 00:15:29.630
let's start to make a decision.

00:15:29.630 --> 00:15:39.332
So if the current row's
title is in the titles list

00:15:39.332 --> 00:15:40.540
I don't want to put it there.

00:15:40.540 --> 00:15:43.582
And actually, let me invert the logic
so I'm doing something proactively.

00:15:43.582 --> 00:15:48.190
So if it's not the case that
row bracket title is in titles,

00:15:48.190 --> 00:15:56.200
then, go ahead and do something like
titles.append the current row's title.

00:15:56.200 --> 00:15:59.170
And recall that we saw
.append a week or so ago,

00:15:59.170 --> 00:16:01.700
where it just allows you to
append to the current list.

00:16:01.700 --> 00:16:04.810
And then, what can I do at
the very end, after I'm all

00:16:04.810 --> 00:16:06.460
done reading the whole file?

00:16:06.460 --> 00:16:09.820
Why don't I go ahead and
say, for title in titles,

00:16:09.820 --> 00:16:12.730
go ahead and print
out the current title?

00:16:12.730 --> 00:16:16.700
So it's two loops now, and we can come
back to the quality of that design.

00:16:16.700 --> 00:16:19.373
But let me go ahead here and
rerun Python of favorites.py.

00:16:19.373 --> 00:16:22.540
Let me increase the size of my Terminal
window so we can focus just on this,

00:16:22.540 --> 00:16:23.770
and hit Enter.

00:16:23.770 --> 00:16:28.120
And now, I'm just skimming.

00:16:28.120 --> 00:16:31.090
I don't think I'm seeing
duplicates, although I

00:16:31.090 --> 00:16:34.000
am seeing some near duplicates.

00:16:34.000 --> 00:16:36.760
For instance, there's Friends again.

00:16:36.760 --> 00:16:40.060
And if we keep going, and
going, and going, and going,

00:16:40.060 --> 00:16:41.260
there's Friends again.

00:16:41.260 --> 00:16:47.070
Oh, interesting, so that's curious
that I seem to have multiple Friends,

00:16:47.070 --> 00:16:48.630
and I have this one here, too.

00:16:48.630 --> 00:16:51.120
So how might we clean this up further?

00:16:51.120 --> 00:16:53.393
I like your instincts, and
it's a step closer to it.

00:16:53.393 --> 00:16:55.560
What are we going to have
to do to really filter out

00:16:55.560 --> 00:16:57.360
those near duplicates?

00:16:57.360 --> 00:16:59.146
Any thoughts?

00:16:59.146 --> 00:17:04.045
AUDIENCE: You could set
everything to lower [INAUDIBLE]..

00:17:04.045 --> 00:17:04.920
DAVID J. MALAN: Yeah.

00:17:04.920 --> 00:17:06.670
What are the common
mistakes to summarize?

00:17:06.670 --> 00:17:09.405
We could ignore the capitalization
altogether and maybe

00:17:09.405 --> 00:17:12.030
just force everything to lowercase,
or everything to uppercase.

00:17:12.030 --> 00:17:13.920
Doesn't matter which, but
let's just be consistent.

00:17:13.920 --> 00:17:16.795
And for those of you who might have
accidentally or instinctively hit

00:17:16.795 --> 00:17:19.650
the spacebar at the beginning of
your input or even at the end,

00:17:19.650 --> 00:17:21.180
we can strip that off, too.

00:17:21.180 --> 00:17:24.490
Stripping whitespace is a common
thing just to clean up user input.

00:17:24.490 --> 00:17:27.869
So let me go back into my
code here, and let me go ahead

00:17:27.869 --> 00:17:30.540
and tweak the title a little bit.

00:17:30.540 --> 00:17:33.360
Let me say that the current
title inside of this loop

00:17:33.360 --> 00:17:36.300
is not going to be just
the current row's title.

00:17:36.300 --> 00:17:40.440
But let me go ahead and strip off,
from the left and the right implicitly,

00:17:40.440 --> 00:17:41.260
any whitespace.

00:17:41.260 --> 00:17:44.260
If you read the documentation for the
strip function, it does just that.

00:17:44.260 --> 00:17:47.230
It gets rid of whitespace to the
left, whitespace to the right.

00:17:47.230 --> 00:17:50.310
And then, if I want to force
everything to maybe uppercase,

00:17:50.310 --> 00:17:52.710
I can just uppercase the entire string.

00:17:52.710 --> 00:17:56.310
And remember, what's handy about Python
is you can chain some of these function

00:17:56.310 --> 00:17:59.550
calls together by just
using dots again and again.

00:17:59.550 --> 00:18:01.500
And that just takes
whatever just happened,

00:18:01.500 --> 00:18:04.320
like the whitespace got stripped
off, then, it additionally

00:18:04.320 --> 00:18:06.280
uppercases the whole thing as well.

00:18:06.280 --> 00:18:10.980
So now, I'm going to just check whether
this specific title is in titles.

00:18:10.980 --> 00:18:14.760
And if not, I'm going to go
ahead and append that title,

00:18:14.760 --> 00:18:17.200
massaged into this different
format, if you will.

00:18:17.200 --> 00:18:19.350
So I'm throwing away some information.

00:18:19.350 --> 00:18:23.760
I'm sacrificing all of the
nuances of your grammar and input

00:18:23.760 --> 00:18:25.020
to the form itself.

00:18:25.020 --> 00:18:27.480
But at least I'm trying to
canonicalize size, that is,

00:18:27.480 --> 00:18:29.950
standardize what the
data actually looks like.

00:18:29.950 --> 00:18:33.810
So let me go ahead and run Python
of favorites.py again and hit Enter.

00:18:33.810 --> 00:18:35.412
Oh, and this is just user error.

00:18:35.412 --> 00:18:36.870
Maybe you haven't seen this before.

00:18:36.870 --> 00:18:40.870
This just looks like
a mistake on my part.

00:18:40.870 --> 00:18:43.200
I meant to say not even uppercase.

00:18:43.200 --> 00:18:44.202
That's completely wrong.

00:18:44.202 --> 00:18:46.410
The function is called upper,
now that I think of it.

00:18:46.410 --> 00:18:46.830
All right.

00:18:46.830 --> 00:18:49.330
Let's go and increase the size
of the Terminal window again.

00:18:49.330 --> 00:18:50.820
Run Python of favorites.py.

00:18:50.820 --> 00:18:55.530
And now, it's a little more overwhelming
to look at because it's not sorted yet

00:18:55.530 --> 00:18:57.300
and it's all capitalized.

00:18:57.300 --> 00:19:03.300
But I don't think I'm seeing
multiple Friends, so to speak.

00:19:03.300 --> 00:19:06.455
There's one Friends
up here and that's it.

00:19:06.455 --> 00:19:07.830
I'm back up at my prompt already.

00:19:07.830 --> 00:19:10.270
So we seem now to be
filtering out duplicates.

00:19:10.270 --> 00:19:13.582
Now, before we dive in further and
clean this up further than this,

00:19:13.582 --> 00:19:14.790
what else could we have done?

00:19:14.790 --> 00:19:17.100
Well, it turns out that
in Python 2 you often

00:19:17.100 --> 00:19:19.350
do get a lot of functionality
built into the language.

00:19:19.350 --> 00:19:22.313
And I'm kind of implementing
myself the idea of a set.

00:19:22.313 --> 00:19:24.480
If you think back to
mathematics, a set is typically

00:19:24.480 --> 00:19:28.470
something with a bunch of values
that has duplicates filtered out.

00:19:28.470 --> 00:19:31.290
Recall that Python
already has this for us.

00:19:31.290 --> 00:19:34.200
And we saw it really briefly
when I whipped up the dictionary

00:19:34.200 --> 00:19:36.190
implementation a couple of weeks back.

00:19:36.190 --> 00:19:40.980
So I could actually define my titles
to be a set instead of a list,

00:19:40.980 --> 00:19:46.080
and this would just modestly allow
me to refine my code here, such

00:19:46.080 --> 00:19:48.870
that I don't have to bother
checking for duplicates anyway.

00:19:48.870 --> 00:19:52.710
I can instead just say
something like, titles.add

00:19:52.710 --> 00:19:55.807
the current title, like this.

00:19:55.807 --> 00:19:58.890
Marginally better design if you know
that a set exists because you're just

00:19:58.890 --> 00:20:00.753
getting more functionality out of this.

00:20:00.753 --> 00:20:02.670
All right, so let's clean
the data up further.

00:20:02.670 --> 00:20:06.150
We've now gone ahead and fixed
the problem of case sensitivity.

00:20:06.150 --> 00:20:08.940
We threw away whitespace in case
someone had hit the spacebar

00:20:08.940 --> 00:20:09.990
with some of the input.

00:20:09.990 --> 00:20:13.960
Let's go ahead now and sort these
things by the titles themselves.

00:20:13.960 --> 00:20:17.280
So instead of just printing out
the titles in the same order

00:20:17.280 --> 00:20:22.142
you all inputted them, but filtering
out duplicates as we go, let me go ahead

00:20:22.142 --> 00:20:24.600
and use another function in
Python you might not have seen,

00:20:24.600 --> 00:20:26.940
which is literally
called sorted, and will

00:20:26.940 --> 00:20:31.900
take care of the process of
actually sorting titles for you.

00:20:31.900 --> 00:20:34.440
Let me go ahead and increase
the font size of my Terminal,

00:20:34.440 --> 00:20:36.570
run Python of favorites.py,
and hit Enter.

00:20:36.570 --> 00:20:40.410
And now you can really see how many of
these shows start with the word "the"

00:20:40.410 --> 00:20:41.220
or do not.

00:20:41.220 --> 00:20:43.590
Now it's a little easier
to wrap our minds around,

00:20:43.590 --> 00:20:46.650
just because it's at least
sorted alphabetically.

00:20:46.650 --> 00:20:50.220
But now you can really see some of
the differences in people's inputs.

00:20:50.220 --> 00:20:51.850
So far, so good.

00:20:51.850 --> 00:20:56.400
But a few of you decided to stylize
Avatar in three different ways here.

00:20:56.400 --> 00:20:59.520
Brooklyn 99 is a couple
of different ways here.

00:20:59.520 --> 00:21:02.700
And I think if we keep going we'll see
further and further variances that we

00:21:02.700 --> 00:21:06.460
did not fix by focusing on
whitespace and capitalization alone.

00:21:06.460 --> 00:21:09.870
So already here, this is only,
what, 100 plus, 200 rows.

00:21:09.870 --> 00:21:12.795
Already real-world data
starts to get messy quickly,

00:21:12.795 --> 00:21:14.670
and that might not bode
well when we actually

00:21:14.670 --> 00:21:16.650
want to keep around real
data from real users.

00:21:16.650 --> 00:21:19.025
You can imagine an actual
website or a mobile application

00:21:19.025 --> 00:21:21.660
dealing with this kind
of thing on scale.

00:21:21.660 --> 00:21:23.080
Well, let's go ahead and do this.

00:21:23.080 --> 00:21:26.370
Let's actually figure out the
popularity of these various shows

00:21:26.370 --> 00:21:31.680
by now iterating over my data, and
keeping track of how many of you

00:21:31.680 --> 00:21:33.000
inputted a given title.

00:21:33.000 --> 00:21:37.890
We're going to ignore the problems
like Brooklyn 99 and the Avatar.

00:21:37.890 --> 00:21:42.390
Sorry, yeah, Avatar,
where there was things

00:21:42.390 --> 00:21:47.070
that were different beyond just
whitespace and capitalization.

00:21:47.070 --> 00:21:48.840
But let's go ahead and
keep track of, now,

00:21:48.840 --> 00:21:52.290
how many of you inputted
each of these titles.

00:21:52.290 --> 00:21:53.448
So how can I do this?

00:21:53.448 --> 00:21:55.740
I'm still going to take this
approach of iterating over

00:21:55.740 --> 00:21:58.110
the CSV file from top to bottom.

00:21:58.110 --> 00:22:00.360
We've used a couple of
data structures thus far,

00:22:00.360 --> 00:22:04.260
a list to keep track of titles,
or a set to keep track of titles.

00:22:04.260 --> 00:22:07.350
But what if I now want to keep
around a little more information?

00:22:07.350 --> 00:22:13.110
For each title, I want to keep around
how many times I've seen it before.

00:22:13.110 --> 00:22:14.130
I'm not doing that yet.

00:22:14.130 --> 00:22:17.880
I'm throwing away the total
number of times I see these shows.

00:22:17.880 --> 00:22:20.520
How could I start to keep that around?

00:22:20.520 --> 00:22:21.845
AUDIENCE: Use a dictionary.

00:22:21.845 --> 00:22:23.970
DAVID J. MALAN: We could
use a dictionary, and how?

00:22:23.970 --> 00:22:25.260
Elaborate on that.

00:22:25.260 --> 00:22:26.135
AUDIENCE: [INAUDIBLE]

00:22:26.135 --> 00:22:28.093
DAVID J. MALAN: Perfect,
really good instincts.

00:22:28.093 --> 00:22:29.850
Using a dictionary,
insofar as it lets us

00:22:29.850 --> 00:22:33.300
store keys and values, that is,
associate something with something

00:22:33.300 --> 00:22:33.850
else.

00:22:33.850 --> 00:22:36.480
This is why a dictionary
or hash tables more

00:22:36.480 --> 00:22:39.630
generally are such a useful,
practical data structure.

00:22:39.630 --> 00:22:43.120
Because they just let you remember
stuff in some kind of structured way.

00:22:43.120 --> 00:22:45.870
So if the keys are going
to be the titles I've seen,

00:22:45.870 --> 00:22:49.860
the values could be the number of
times I've seen each of those titles.

00:22:49.860 --> 00:22:54.308
And so it's kind of like just
having a two-column table on paper.

00:22:54.308 --> 00:22:56.850
For instance, if I were going
to do this on a piece of paper,

00:22:56.850 --> 00:22:59.340
I might just have two
columns here, where

00:22:59.340 --> 00:23:04.590
maybe this is the title that I've
seen, and this is the count over here.

00:23:04.590 --> 00:23:08.250
This is, in effect, a
dictionary in Python.

00:23:08.250 --> 00:23:11.490
It's two columns, keys on the
left, values on the right.

00:23:11.490 --> 00:23:13.620
And this, if I can implement
in code, will actually

00:23:13.620 --> 00:23:17.580
allow me to store this data, and
then maybe do some simple arithmetic

00:23:17.580 --> 00:23:19.450
to figure out which is the most popular.

00:23:19.450 --> 00:23:20.380
So let's do this.

00:23:20.380 --> 00:23:24.240
Let me go ahead and change my titles
to not be a list, not be a set.

00:23:24.240 --> 00:23:29.160
Let's have it be a dictionary instead,
either doing this, or more succinctly,

00:23:29.160 --> 00:23:33.600
two curly braces that are empty gives
me an empty dictionary automatically.

00:23:33.600 --> 00:23:34.860
What do I now want to do?

00:23:34.860 --> 00:23:37.440
I think most of my
code can stay the same.

00:23:37.440 --> 00:23:40.860
But down here, I don't want
to just blindly add titles

00:23:40.860 --> 00:23:42.180
to the data structure.

00:23:42.180 --> 00:23:45.060
I somehow need to keep
track of the count.

00:23:45.060 --> 00:23:48.690
And unfortunately, if I just
do this-- let's do titles,

00:23:48.690 --> 00:23:53.580
bracket, title, plus equals 1.

00:23:53.580 --> 00:23:56.130
This is a reasonable
first attempt at this.

00:23:56.130 --> 00:23:57.420
Because what am I doing?

00:23:57.420 --> 00:24:02.730
If titles is a dictionary and I want
to look up the current title therein,

00:24:02.730 --> 00:24:05.520
the syntax for that, like before,
is titles, bracket, and then

00:24:05.520 --> 00:24:08.880
the key you want to use to
index into the dictionary.

00:24:08.880 --> 00:24:12.000
It's not a number in this case,
it's an actual word, a title.

00:24:12.000 --> 00:24:14.220
And you're just going
to increment it by one,

00:24:14.220 --> 00:24:17.010
and then eventually I'll come
back and finish my second loop

00:24:17.010 --> 00:24:19.710
and do things in terms of the order.

00:24:19.710 --> 00:24:23.580
But for now, let's just keep
track of the total counts.

00:24:23.580 --> 00:24:25.660
Let me go ahead and
increase my Terminal window.

00:24:25.660 --> 00:24:29.590
Let me do Python of
favorites.py and hit Enter.

00:24:29.590 --> 00:24:30.090
Huh.

00:24:30.090 --> 00:24:34.140
How I Met Your Mother is
giving me a key error.

00:24:34.140 --> 00:24:36.750
What does that mean?

00:24:36.750 --> 00:24:39.120
And why am I seeing this?

00:24:39.120 --> 00:24:42.330
And in fact, just to give a
little bit of a breadcrumb here,

00:24:42.330 --> 00:24:44.140
let me zoom out here.

00:24:44.140 --> 00:24:46.920
Let me open up the CSV
file again real quickly.

00:24:46.920 --> 00:24:50.460
And wow, we didn't even get
past the second row in the file

00:24:50.460 --> 00:24:52.210
or the first show in the file.

00:24:52.210 --> 00:24:54.840
Notice that How I Met Your
Mother, somewhat lowercased,

00:24:54.840 --> 00:24:57.562
is the very first show in therein.

00:24:57.562 --> 00:24:59.520
What's your instinct for
why this is happening?

00:24:59.520 --> 00:25:01.750
AUDIENCE: You don't
have a starting point.

00:25:01.750 --> 00:25:03.667
DAVID J. MALAN: I don't
have a starting point.

00:25:03.667 --> 00:25:04.920
I'm adding one to what?

00:25:04.920 --> 00:25:09.840
I'm blindly indexing into the dictionary
using a key, How I Met Your Mother,

00:25:09.840 --> 00:25:11.880
that doesn't yet exist
in the dictionary.

00:25:11.880 --> 00:25:14.610
And so Python throws
what's called a key error

00:25:14.610 --> 00:25:17.440
because the key you're trying
to use just doesn't exist yet.

00:25:17.440 --> 00:25:21.000
So logically, how could we fix this?

00:25:21.000 --> 00:25:21.690
We're close.

00:25:21.690 --> 00:25:25.110
We got half of the problem solved,
but I'm not handling the obvious, now,

00:25:25.110 --> 00:25:26.850
case of nothing being there.

00:25:26.850 --> 00:25:27.595
Yeah?

00:25:27.595 --> 00:25:28.930
AUDIENCE: Creating a counter.

00:25:28.930 --> 00:25:29.620
DAVID J. MALAN: Creating a--

00:25:29.620 --> 00:25:30.370
AUDIENCE: Counter.

00:25:30.370 --> 00:25:32.240
DAVID J. MALAN: Creating
the counter itself.

00:25:32.240 --> 00:25:34.190
So maybe I could do something like this.

00:25:34.190 --> 00:25:38.380
Let me close my Terminal window
and let me ask a question first.

00:25:38.380 --> 00:25:44.980
If the current title is in the
dictionary already, if title in titles,

00:25:44.980 --> 00:25:47.440
that's going to give me a
true-false answer it turns out.

00:25:47.440 --> 00:25:52.240
Then, I can safely say, titles,
bracket, title, plus equals 1.

00:25:52.240 --> 00:25:56.740
And recall, this is just shorthand
notation for the same thing as in C,

00:25:56.740 --> 00:25:58.300
title plus 1.

00:25:58.300 --> 00:25:59.360
Whoops, typo.

00:25:59.360 --> 00:26:00.168
Don't do that.

00:26:00.168 --> 00:26:02.710
That's the same thing as this
but it's a little more succinct

00:26:02.710 --> 00:26:04.810
just to say plus equals 1.

00:26:04.810 --> 00:26:09.490
Else, if it's logically not the case
that the current title is in the titles

00:26:09.490 --> 00:26:13.610
dictionary, then I probably want to
say titles, bracket, title equals?

00:26:13.610 --> 00:26:15.315
Feel free to just shout it out.

00:26:15.315 --> 00:26:15.940
AUDIENCE: Zero.

00:26:15.940 --> 00:26:16.815
DAVID J. MALAN: Zero.

00:26:16.815 --> 00:26:21.320
I just have to put some value there
so that the key itself is also there.

00:26:21.320 --> 00:26:21.820
All right.

00:26:21.820 --> 00:26:24.370
So now that I've got this
going on, let me go ahead

00:26:24.370 --> 00:26:26.620
and undo my sorting temporarily.

00:26:26.620 --> 00:26:29.650
And now let me go ahead and do this.

00:26:29.650 --> 00:26:33.460
I can, as a quick check, let me
go ahead and just run the code

00:26:33.460 --> 00:26:35.050
as is, Python of favorites.py.

00:26:35.050 --> 00:26:37.030
I'm back in business.

00:26:37.030 --> 00:26:39.790
It's printing correctly, no key
errors, but it's not sorted.

00:26:39.790 --> 00:26:41.620
And I'm not seeing any of the counts.

00:26:41.620 --> 00:26:43.780
Let me just quickly add
the counts, and there's

00:26:43.780 --> 00:26:45.530
a couple of ways I could do this.

00:26:45.530 --> 00:26:52.900
I could, say, print out the title, and
then, maybe, let's do something like--

00:26:52.900 --> 00:26:57.220
how about just, comma,
titles, bracket, title?

00:26:57.220 --> 00:26:59.020
So I'm going to print
two things at once,

00:26:59.020 --> 00:27:01.600
both the current title
in the dictionary,

00:27:01.600 --> 00:27:04.300
and whatever its value
is by indexing into it.

00:27:04.300 --> 00:27:06.140
Let me increase my Terminal window.

00:27:06.140 --> 00:27:10.420
Let me run Python of
favorites.py, Enter, and OK.

00:27:10.420 --> 00:27:11.908
Huh.

00:27:11.908 --> 00:27:14.080
Huh.

00:27:14.080 --> 00:27:17.560
None of you said a whole
lot of TV shows, it seems.

00:27:17.560 --> 00:27:21.690
What's the logical error here?

00:27:21.690 --> 00:27:25.460
What did I do wrong if I
look back at my code here?

00:27:25.460 --> 00:27:25.960
Yeah?

00:27:25.960 --> 00:27:27.600
Why so many 0s?

00:27:27.600 --> 00:27:29.490
AUDIENCE: [INAUDIBLE]

00:27:29.490 --> 00:27:30.490
DAVID J. MALAN: Exactly.

00:27:30.490 --> 00:27:35.140
To summarize, I initialized the
count to 0 the first time I saw it,

00:27:35.140 --> 00:27:38.500
but I should have initialized it at
least to 1 because I just saw it.

00:27:38.500 --> 00:27:40.220
Or I should change my code a bit.

00:27:40.220 --> 00:27:42.880
So for instance, if I go back
in here, the simplest fix

00:27:42.880 --> 00:27:46.390
is probably to initialize to 1,
because on this iteration of the loop,

00:27:46.390 --> 00:27:49.210
obviously, I'm seeing this
title for the very first time.

00:27:49.210 --> 00:27:51.580
Or I could change my logic a little bit.

00:27:51.580 --> 00:27:53.470
I could do something like this instead.

00:27:53.470 --> 00:27:58.840
If the current title is not in titles,
then I could initialize it to 0.

00:27:58.840 --> 00:28:02.860
And then I could get rid of
the else, and now blindly index

00:28:02.860 --> 00:28:04.900
into the titles dictionary.

00:28:04.900 --> 00:28:09.100
Because now, on line 11, I
can trust that lines 9 and 10

00:28:09.100 --> 00:28:12.040
took care of the initialization
for me if need be.

00:28:12.040 --> 00:28:13.028
Which one is better?

00:28:13.028 --> 00:28:13.570
I don't know.

00:28:13.570 --> 00:28:17.150
This one's a little nicer, maybe
because it's one line fewer.

00:28:17.150 --> 00:28:20.470
But I think both approaches are
perfectly reasonable and well-designed.

00:28:20.470 --> 00:28:22.420
But the key thing, no
pun intended, is that we

00:28:22.420 --> 00:28:27.100
have to make sure the key exists
before we presume to actually incrue.

00:28:27.100 --> 00:28:28.810
Oh, this is wrong.

00:28:28.810 --> 00:28:30.370
This is incorrect code.

00:28:30.370 --> 00:28:33.040
What did I do wrong?

00:28:33.040 --> 00:28:33.790
OK, yes.

00:28:33.790 --> 00:28:34.400
There we go.

00:28:34.400 --> 00:28:37.690
So otherwise, everyone would have
liked this show once, and no matter

00:28:37.690 --> 00:28:39.190
how many people said the same thing.

00:28:39.190 --> 00:28:41.237
Now the code is as it should be.

00:28:41.237 --> 00:28:43.570
So let me go ahead and open
up my Terminal window again.

00:28:43.570 --> 00:28:47.710
Let me run Python of favorites.py,
and now we see more reasonable counts.

00:28:47.710 --> 00:28:49.150
Some shows weren't that popular.

00:28:49.150 --> 00:28:50.830
There's just 1s and maybe 2s.

00:28:50.830 --> 00:28:56.570
But I bet if we sort these things we
can start to see a little more detail.

00:28:56.570 --> 00:28:57.890
So how else can we do this?

00:28:57.890 --> 00:29:04.520
Well, turns out, when dealing
with a dictionary like this--

00:29:04.520 --> 00:29:07.160
let's go ahead and just
sort the titles themselves.

00:29:07.160 --> 00:29:12.130
So let's reintroduce the sorted function
as I did before, but no other changes.

00:29:12.130 --> 00:29:14.735
Let me go ahead now and
run Python of favorites.py.

00:29:14.735 --> 00:29:17.110
Now it's just a little easier
to wrap your mind around it

00:29:17.110 --> 00:29:18.568
because at least it's alphabetical.

00:29:18.568 --> 00:29:22.600
But it's not sorted by
value, it's sorted by key.

00:29:22.600 --> 00:29:26.170
But sure enough, if we scroll
down, there's something down here,

00:29:26.170 --> 00:29:29.202
for instance, like,
let's see, The Office.

00:29:29.202 --> 00:29:30.910
That's definitely
going to be a contender

00:29:30.910 --> 00:29:32.860
for most popular, 15 responses.

00:29:32.860 --> 00:29:35.860
But let's see what's actually
going to bubble up to the top.

00:29:35.860 --> 00:29:40.870
Unfortunately, the sorted function
only sorts dictionaries by keys

00:29:40.870 --> 00:29:44.410
by default, not by values.

00:29:44.410 --> 00:29:47.500
But it turns out, in Python,
if you read the documentation

00:29:47.500 --> 00:29:49.510
for the sorted function,
you can actually

00:29:49.510 --> 00:29:54.580
pass in other arguments that
tell it how to sort things.

00:29:54.580 --> 00:29:57.430
For instance, if I want to
do things in reverse order,

00:29:57.430 --> 00:30:02.140
I can add a second parameter to
the sorted function called reverse.

00:30:02.140 --> 00:30:03.310
And it's a named parameter.

00:30:03.310 --> 00:30:05.620
You literally say,
reverse equals true, so

00:30:05.620 --> 00:30:08.830
that the position of it in the
comma-separated list doesn't matter.

00:30:08.830 --> 00:30:11.762
If I now rerun this after
increasing my Terminal window,

00:30:11.762 --> 00:30:13.720
you'll see now that it's
in the opposite order.

00:30:13.720 --> 00:30:16.330
Now adventure and Anne
with an E is at the bottom

00:30:16.330 --> 00:30:18.410
of the output instead of the top.

00:30:18.410 --> 00:30:26.845
How can I tell it to sort
by values instead of by key?

00:30:26.845 --> 00:30:28.220
Well, let's go ahead and do this.

00:30:28.220 --> 00:30:30.940
Let me go ahead and define a function.

00:30:30.940 --> 00:30:33.070
I'm just going to call it
f to keep things simple.

00:30:33.070 --> 00:30:36.190
And this f function is going
to take a title as input.

00:30:36.190 --> 00:30:41.140
And given a given title, it's going
to return the value of that title.

00:30:41.140 --> 00:30:44.560
So actually, maybe a better name
for this would be get value,

00:30:44.560 --> 00:30:46.820
and/or we could come up
with something else as well.

00:30:46.820 --> 00:30:49.300
The purpose of the get
value function, to be clear,

00:30:49.300 --> 00:30:54.200
is to take it as input a title and
then return the corresponding value.

00:30:54.200 --> 00:30:55.460
Why is this useful?

00:30:55.460 --> 00:30:57.760
Well, it turns out that the
sorted function in Python,

00:30:57.760 --> 00:31:01.870
according to its documentation,
also takes a key parameter,

00:31:01.870 --> 00:31:05.860
where you can pass in, crazy
enough, the name of a function

00:31:05.860 --> 00:31:11.650
that it will use in order to determine
what it should sort by, by the key,

00:31:11.650 --> 00:31:16.550
or by the value, or in other cases,
even other types of data as well.

00:31:16.550 --> 00:31:19.390
So there's a curiosity here,
though, that's very deliberate.

00:31:19.390 --> 00:31:21.550
Key is the name of the
parameter, just like reverse

00:31:21.550 --> 00:31:23.092
was the name of this other parameter.

00:31:23.092 --> 00:31:26.110
The value of it, though,
is not a function call.

00:31:26.110 --> 00:31:27.580
It's a function name.

00:31:27.580 --> 00:31:30.520
Notice I am not doing
this, no parentheses.

00:31:30.520 --> 00:31:35.400
I'm instead passing in get value,
the function I wrote, by its name.

00:31:35.400 --> 00:31:37.900
And this is a feature of Python
and certain other languages.

00:31:37.900 --> 00:31:41.110
Just like variables, you can
actually pass whole functions

00:31:41.110 --> 00:31:45.560
around so that they can be called
for you later on by someone else.

00:31:45.560 --> 00:31:48.880
So what this means is that the
sorted function written by Python,

00:31:48.880 --> 00:31:51.410
they didn't know what you're
going to want to sort by today.

00:31:51.410 --> 00:31:55.810
But if you provide them with a function
called get value, or anything else, now

00:31:55.810 --> 00:31:58.000
their sorted function
will use that function

00:31:58.000 --> 00:32:01.840
to determine, OK, if you don't want to
sort by the key of the dictionary, what

00:32:01.840 --> 00:32:02.950
do you want to sort by?

00:32:02.950 --> 00:32:05.800
This is going to tell
it to sort by the value

00:32:05.800 --> 00:32:08.750
by returning the specific
value we care about.

00:32:08.750 --> 00:32:12.580
So let me go ahead now and rerun this
after increasing my Terminal, Python

00:32:12.580 --> 00:32:14.620
of favorites.py, Enter.

00:32:14.620 --> 00:32:17.110
Here we have now an example
of all of the titles you all

00:32:17.110 --> 00:32:22.180
typed in, albeit forced to uppercase
and with any whitespace thrown out.

00:32:22.180 --> 00:32:24.733
And now, The Office is
an easy win over Friends,

00:32:24.733 --> 00:32:27.400
versus Community, versus Game of
Thrones, Breaking Bad, and then

00:32:27.400 --> 00:32:29.740
a lot of variants thereafter.

00:32:29.740 --> 00:32:31.680
So there's a lot of steps to go through.

00:32:31.680 --> 00:32:33.555
This isn't that bad once
you've done it once,

00:32:33.555 --> 00:32:35.472
and you know what these
functions are, and you

00:32:35.472 --> 00:32:36.870
know that these parameters exist.

00:32:36.870 --> 00:32:38.010
But it's a lot of work.

00:32:38.010 --> 00:32:42.090
That's 17 lines of code
just to analyze a CSV file

00:32:42.090 --> 00:32:45.420
that you all created by way of
those Google Form submissions.

00:32:45.420 --> 00:32:48.360
But it took me a lot of work just
to get simple answers out of it.

00:32:48.360 --> 00:32:50.277
And indeed, that's going
to be among the goals

00:32:50.277 --> 00:32:52.920
for today, ultimately, is, how
can we just make this easier?

00:32:52.920 --> 00:32:54.795
It's one thing to learn
new things in Python,

00:32:54.795 --> 00:32:57.300
but if we can avoid writing
code, or this much code,

00:32:57.300 --> 00:32:58.840
that's going to be a good thing.

00:32:58.840 --> 00:33:01.020
And so one other technique
we can introduce here

00:33:01.020 --> 00:33:03.570
that does allow us to
write a little less code

00:33:03.570 --> 00:33:05.730
is, we can actually get
rid of this function.

00:33:05.730 --> 00:33:09.240
It turns out, in Python, if you
just need to make a function

00:33:09.240 --> 00:33:11.940
but it's going to be used and
then essentially thrown away,

00:33:11.940 --> 00:33:14.790
it's not something you're going
to be reusing in multiple places--

00:33:14.790 --> 00:33:17.550
it's not like a library function
that you want to keep around--

00:33:17.550 --> 00:33:19.680
you can actually just do this.

00:33:19.680 --> 00:33:23.430
You can change the value
of this key parameter

00:33:23.430 --> 00:33:25.950
to be what's called a
lambda function, which

00:33:25.950 --> 00:33:29.040
is a fancy way of saying a function
that technically has no name.

00:33:29.040 --> 00:33:30.870
It's an anonymous function.

00:33:30.870 --> 00:33:32.400
Why does it have no name?

00:33:32.400 --> 00:33:35.530
Well, it's kind of stupid that
I invented this name on line 13.

00:33:35.530 --> 00:33:38.670
I used it on line 16, and
then I never again used it.

00:33:38.670 --> 00:33:42.520
If there's only being used in one place,
why bother giving it a name at all?

00:33:42.520 --> 00:33:45.000
So if you instead, in
Python, say lambda,

00:33:45.000 --> 00:33:47.850
and then type out the
name of the parameter

00:33:47.850 --> 00:33:50.130
you want this anonymous
function to take,

00:33:50.130 --> 00:33:54.460
you can then say, go ahead
and return this value.

00:33:54.460 --> 00:33:57.030
Now let's notice the
inconsistencies here.

00:33:57.030 --> 00:33:59.850
When you use this special lambda
keyword that says, hey Python,

00:33:59.850 --> 00:34:02.850
give me an anonymous function,
a function with no name,

00:34:02.850 --> 00:34:06.610
it then says, Python, this anonymous
function will take one parameter.

00:34:06.610 --> 00:34:08.699
Notice there's no parentheses.

00:34:08.699 --> 00:34:10.500
And that's deliberate, if confusing.

00:34:10.500 --> 00:34:12.909
It just tightens things up a little bit.

00:34:12.909 --> 00:34:16.770
Notice that there's no return keyword,
which similarly tightens things

00:34:16.770 --> 00:34:18.780
up a bit, albeit inconsistently.

00:34:18.780 --> 00:34:22.650
But this line of code
I've just highlighted

00:34:22.650 --> 00:34:26.429
is actually identical in
functionality to this.

00:34:26.429 --> 00:34:28.320
But it throws away the word [INAUDIBLE].

00:34:28.320 --> 00:34:29.850
It throws away the word get value.

00:34:29.850 --> 00:34:33.510
It throws away the parentheses, and
it throws away the return keyword just

00:34:33.510 --> 00:34:34.679
to tighten things up.

00:34:34.679 --> 00:34:37.290
And it's well suited
for a problem like this

00:34:37.290 --> 00:34:39.960
where I just want to pass in
a tiny little function that

00:34:39.960 --> 00:34:41.090
does something useful.

00:34:41.090 --> 00:34:42.840
But it's not something
I'm going to reuse.

00:34:42.840 --> 00:34:45.100
It doesn't need multiple
lines to take up space.

00:34:45.100 --> 00:34:47.159
It's just a nice, elegant one liner.

00:34:47.159 --> 00:34:48.840
That's all a lambda function does.

00:34:48.840 --> 00:34:51.909
It allows you to create an anonymous
function right then and there.

00:34:51.909 --> 00:34:57.210
And then the function you're passing it
to, like sorted, will use it as before.

00:34:57.210 --> 00:35:00.810
Indeed, if I run Python of favorites.py
after growing my Terminal window,

00:35:00.810 --> 00:35:02.800
the result is exactly the same.

00:35:02.800 --> 00:35:06.360
And we see at the bottom here
all of those small results.

00:35:06.360 --> 00:35:10.810
Are any questions, then, on
this syntax, on these ideas?

00:35:10.810 --> 00:35:13.770
The goal here has been to write
a Python program that just starts

00:35:13.770 --> 00:35:18.860
to analyze or clean up data like this.

00:35:18.860 --> 00:35:19.765
Yeah?

00:35:19.765 --> 00:35:23.273
AUDIENCE: [INAUDIBLE]

00:35:23.273 --> 00:35:26.440
DAVID J. MALAN: Could you use the lambda
if it's just returning immediately?

00:35:26.440 --> 00:35:29.260
It's really meant for one
line of code, generally.

00:35:29.260 --> 00:35:31.420
So you don't use the return keyword.

00:35:31.420 --> 00:35:33.880
You just say what it
is you want to return.

00:35:33.880 --> 00:35:37.680
AUDIENCE: [INAUDIBLE]

00:35:37.680 --> 00:35:38.930
DAVID J. MALAN: Good question.

00:35:38.930 --> 00:35:40.960
Could you do more in
that one line if it's

00:35:40.960 --> 00:35:42.670
got to be a more involved algorithm?

00:35:42.670 --> 00:35:45.820
Yes, but you would just ultimately
return the value in question.

00:35:45.820 --> 00:35:47.947
In short, if it's getting
at all sophisticated

00:35:47.947 --> 00:35:49.780
you don't use the lambda
function in Python.

00:35:49.780 --> 00:35:52.510
You go ahead and actually
just define a name for it,

00:35:52.510 --> 00:35:54.233
even if it's a one-off name.

00:35:54.233 --> 00:35:56.650
JavaScript, another language
we'll look at in a few weeks,

00:35:56.650 --> 00:35:59.590
makes heavier use, I dare
say, of lambda functions.

00:35:59.590 --> 00:36:02.030
And those can actually be
multiple, multiple lines,

00:36:02.030 --> 00:36:05.520
but Python does not
support that instinct.

00:36:05.520 --> 00:36:06.020
All right.

00:36:06.020 --> 00:36:07.728
So let's go ahead and
do one other thing.

00:36:07.728 --> 00:36:10.340
Office was clearly popping out
of the code here quite a bit.

00:36:10.340 --> 00:36:12.760
Let's go ahead and write a
slightly different program

00:36:12.760 --> 00:36:15.400
that maybe just focuses on
The Office for the moment,

00:36:15.400 --> 00:36:16.900
just focuses on The Office.

00:36:16.900 --> 00:36:21.250
So let me go ahead and throw most of
this code away, up until this point

00:36:21.250 --> 00:36:23.080
when I'm inside of my inner loop.

00:36:23.080 --> 00:36:26.050
And let me go ahead, and I don't
even want the global variable here.

00:36:26.050 --> 00:36:28.240
All I want to do is focus
on the current title.

00:36:28.240 --> 00:36:30.730
How could I detect if
someone likes The Office?

00:36:30.730 --> 00:36:33.790
Well, I could say something like--

00:36:33.790 --> 00:36:34.670
how about this?

00:36:34.670 --> 00:36:36.310
So counter equals 0.

00:36:36.310 --> 00:36:38.350
We'll just focus on The Office.

00:36:38.350 --> 00:36:43.930
If title equals, equals The Office,
I could then go ahead and say,

00:36:43.930 --> 00:36:47.410
counter plus equals 1.

00:36:47.410 --> 00:36:48.400
I don't need a key.

00:36:48.400 --> 00:36:49.858
There's no dictionary involved now.

00:36:49.858 --> 00:36:51.880
It's just a simple integer variable.

00:36:51.880 --> 00:36:55.750
And then, down here
I'll say something like,

00:36:55.750 --> 00:37:00.970
number of people who like The
Office is, whatever this value is.

00:37:00.970 --> 00:37:03.850
And I'll put in counter in
curly braces, and then I'll

00:37:03.850 --> 00:37:05.783
turn this whole thing into an F string.

00:37:05.783 --> 00:37:07.450
All right, let me go ahead and run this.

00:37:07.450 --> 00:37:10.100
Python of favorites.py, Enter.

00:37:10.100 --> 00:37:12.610
Number of people who
like The Office is 15.

00:37:12.610 --> 00:37:13.990
All right, so that's great.

00:37:13.990 --> 00:37:17.530
But let's go ahead now and
deliberately muddy the data a bit.

00:37:17.530 --> 00:37:21.160
All of you were very nice in
that you typed in The Office.

00:37:21.160 --> 00:37:23.230
But you can imagine
someone just typing Office,

00:37:23.230 --> 00:37:25.692
for instance, maybe there, maybe there.

00:37:25.692 --> 00:37:28.150
And many people might just
write Office, you could imagine.

00:37:28.150 --> 00:37:30.400
Didn't happen here, but
suppose it did, and probably

00:37:30.400 --> 00:37:33.290
would have if we had even more
and more submissions over time.

00:37:33.290 --> 00:37:37.000
Now let's go ahead and rerun this
program, no changes to the code.

00:37:37.000 --> 00:37:39.130
Now only 13 people like The Office.

00:37:39.130 --> 00:37:40.150
So let's fix this.

00:37:40.150 --> 00:37:45.790
The data is now as I mutated it to have
a couple Offices, and many The Offices.

00:37:45.790 --> 00:37:51.050
How could I change my Python code to
now count both of those situations?

00:37:51.050 --> 00:37:55.390
What could I change up here in
order to improve this situation?

00:37:55.390 --> 00:37:57.910
Any thoughts?

00:37:57.910 --> 00:37:58.750
Yeah?

00:37:58.750 --> 00:38:02.300
AUDIENCE: You write
the title [INAUDIBLE]..

00:38:02.300 --> 00:38:05.050
DAVID J. MALAN: Yeah, so I could
just ask two questions like that.

00:38:05.050 --> 00:38:09.370
If title equals The Office,
or title equals, equals just

00:38:09.370 --> 00:38:10.438
Office, for instance.

00:38:10.438 --> 00:38:12.730
And I'm still don't have to
worry about capitalization.

00:38:12.730 --> 00:38:15.813
I don't have to worry about spaces
because I at least threw that all away.

00:38:15.813 --> 00:38:18.200
Now I can go ahead and rerun this code.

00:38:18.200 --> 00:38:19.990
Let me go run it a third time.

00:38:19.990 --> 00:38:22.460
OK, so we're back up to 15.

00:38:22.460 --> 00:38:25.210
So I like that.

00:38:25.210 --> 00:38:28.660
You could imagine this
not scaling very well.

00:38:28.660 --> 00:38:31.763
Avatar had three different
permutations, and there were some others

00:38:31.763 --> 00:38:34.180
if we dug deeper that there
might have been more variants.

00:38:34.180 --> 00:38:36.430
Could we do something a
little more general purpose?

00:38:36.430 --> 00:38:38.230
Well, we could do something like this.

00:38:38.230 --> 00:38:41.883
If Office in the title--

00:38:41.883 --> 00:38:44.050
this is kind of a cool thing
you can do with Python.

00:38:44.050 --> 00:38:46.930
It's very English-like, just ask
the question, albeit tersely.

00:38:46.930 --> 00:38:50.660
This, interesting, just
got me into trouble.

00:38:50.660 --> 00:38:53.140
Now, all of a sudden, we're up to 16.

00:38:53.140 --> 00:38:56.252
Does anyone know what the other one is?

00:38:56.252 --> 00:38:58.098
AUDIENCE: Someone put V Office.

00:38:58.098 --> 00:38:59.265
DAVID J. MALAN: What Office?

00:38:59.265 --> 00:39:02.120
AUDIENCE: Someone entered
a V Office, [INAUDIBLE]..

00:39:04.517 --> 00:39:05.850
DAVID J. MALAN: Oh, interesting.

00:39:05.850 --> 00:39:07.930
Yes, so they hit The.

00:39:07.930 --> 00:39:08.430
OK.

00:39:08.430 --> 00:39:11.568
[APPLAUSE]

00:39:11.568 --> 00:39:12.360
DAVID J. MALAN: OK.

00:39:12.360 --> 00:39:13.800
Someone did that, sure.

00:39:13.800 --> 00:39:17.130
So The V Office.

00:39:17.130 --> 00:39:19.470
OK, this one's actually going
to be hard to correct for.

00:39:19.470 --> 00:39:21.300
I can't really think of a general--

00:39:21.300 --> 00:39:25.860
well, this is actually a good
example of data gets messy fast.

00:39:25.860 --> 00:39:28.080
And you could imagine doing
something where, OK, we

00:39:28.080 --> 00:39:32.920
could have like 26 conditions if someone
said The A Office, or The B Office,

00:39:32.920 --> 00:39:33.420
right?

00:39:33.420 --> 00:39:34.650
You could imagine doing that.

00:39:34.650 --> 00:39:37.400
But then there's surely going to
be other typos that are possible.

00:39:37.400 --> 00:39:39.610
So that's actually a hard one to fix.

00:39:39.610 --> 00:39:44.730
But it turns out we got lucky and now
this is actually the accurate count.

00:39:44.730 --> 00:39:46.660
But the data is itself messy.

00:39:46.660 --> 00:39:49.950
Let me show another way that just
adds another tool to our toolkit.

00:39:49.950 --> 00:39:54.750
It turns out that there's this feature
in many programming languages, Python

00:39:54.750 --> 00:39:57.040
among them, called regular expressions.

00:39:57.040 --> 00:39:59.040
And this is actually a
really powerful technique

00:39:59.040 --> 00:40:00.873
that we'll just scratch
the surface of here.

00:40:00.873 --> 00:40:04.290
But it's going to be really useful,
actually, maybe toward final projects,

00:40:04.290 --> 00:40:08.880
in web programming, any time you want
to clean up data or validate data.

00:40:08.880 --> 00:40:11.910
And actually, just to make
this clear, give me a moment

00:40:11.910 --> 00:40:14.160
before I switch screens here.

00:40:14.160 --> 00:40:18.450
And let me open up a
Google Form from scratch.

00:40:18.450 --> 00:40:22.230
Give me just a moment to
create something real quick.

00:40:22.230 --> 00:40:25.260
If you've never noticed this
before when creating a Google Form,

00:40:25.260 --> 00:40:28.410
you can do a question.

00:40:28.410 --> 00:40:30.360
And if you want the user
to type in something

00:40:30.360 --> 00:40:33.033
very specific as a short
text answer like this,

00:40:33.033 --> 00:40:35.700
you might know that there's toggles
like this in Google's world,

00:40:35.700 --> 00:40:36.930
like you can require it.

00:40:36.930 --> 00:40:39.270
Or you can do response validation.

00:40:39.270 --> 00:40:41.670
You could say, what's your email?

00:40:41.670 --> 00:40:47.250
And then you could say something
like, text is an email.

00:40:47.250 --> 00:40:52.530
So here's an example in Google Forms
how you can validate users' input.

00:40:52.530 --> 00:40:57.150
But a feature most of you have probably
never noticed, or cared about, or used,

00:40:57.150 --> 00:40:59.490
is this thing called a
regular expression, where

00:40:59.490 --> 00:41:01.440
you can actually define a pattern.

00:41:01.440 --> 00:41:04.830
And I could actually reimplement that
same idea by doing something like this.

00:41:04.830 --> 00:41:11.070
I can say, let the user type in anything
represented by .star, then an at sign,

00:41:11.070 --> 00:41:16.600
then something else, then a
literal period, then, for instance,

00:41:16.600 --> 00:41:17.680
something else.

00:41:17.680 --> 00:41:19.950
So it's very cryptic,
admittedly, at first glance.

00:41:19.950 --> 00:41:23.430
But this means any
character 0 more times.

00:41:23.430 --> 00:41:26.160
This means any character 0 more times.

00:41:26.160 --> 00:41:28.725
This means a literal
period, because apparently

00:41:28.725 --> 00:41:32.160
dot means any character in
the context of these patterns.

00:41:32.160 --> 00:41:36.070
Then this thing means any
character 0 more times.

00:41:36.070 --> 00:41:38.670
So I should actually be
a little more nitpicky.

00:41:38.670 --> 00:41:41.530
You don't want 0 or more times,
you want 1 or more times.

00:41:41.530 --> 00:41:45.340
So this with the plus means
any character 1 or more time.

00:41:45.340 --> 00:41:47.020
So there has to be something there.

00:41:47.020 --> 00:41:51.630
And I think I want the same thing
here 1 or more times, 1 or more times.

00:41:51.630 --> 00:41:56.160
Or heck, if I want to restrict this
form in some sense to edu addresses,

00:41:56.160 --> 00:41:59.520
I could change that last
thing to literally .edu.

00:41:59.520 --> 00:42:01.570
And so long story short,
even though this looks,

00:42:01.570 --> 00:42:06.420
I'm sure, pretty cryptic, there's
this mini language built into Python,

00:42:06.420 --> 00:42:09.900
and JavaScript, and Java, and other
languages that allows you to express

00:42:09.900 --> 00:42:12.430
patterns in a standardized way.

00:42:12.430 --> 00:42:15.930
And this pattern is actually something
we can implement in code, too.

00:42:15.930 --> 00:42:18.150
And let me switch back to
Python for a second just

00:42:18.150 --> 00:42:19.920
to do the same kind of idea.

00:42:19.920 --> 00:42:22.950
Let me toggle back to my code here.

00:42:22.950 --> 00:42:26.730
Let me put up, for instance, a
summary of what it is you can do.

00:42:26.730 --> 00:42:33.030
And here's just a quick summary
of some of the available symbols.

00:42:33.030 --> 00:42:39.330
A period may represent any character.
.star or .asterisks means 0 or more

00:42:39.330 --> 00:42:39.970
characters.

00:42:39.970 --> 00:42:43.020
So the dot means anything,
so it can be A or nothing.

00:42:43.020 --> 00:42:44.160
It can be B or nothing.

00:42:44.160 --> 00:42:49.530
It can be A, B, A, B, C. It can be any
combination of 0 or more characters.

00:42:49.530 --> 00:42:52.860
Change that to a plus and you now
express one or more characters.

00:42:52.860 --> 00:42:55.860
Question mark means
something is optional.

00:42:55.860 --> 00:42:59.370
Caret symbol means start matching at
the beginning of the user's input.

00:42:59.370 --> 00:43:05.100
Dollar sign means stop matching
at the end of the user's input.

00:43:05.100 --> 00:43:07.210
So we won't play with
all of these just now.

00:43:07.210 --> 00:43:11.470
But let me go over here and
actually tackle this Office problem.

00:43:11.470 --> 00:43:15.450
Let me go ahead and import a new library
called the regular expression library,

00:43:15.450 --> 00:43:17.280
import re.

00:43:17.280 --> 00:43:20.560
And then, down here, let me say this.

00:43:20.560 --> 00:43:25.080
If re.search, this pattern.

00:43:25.080 --> 00:43:30.210
Let's just search for Office, quote,
unquote, in the current title.

00:43:30.210 --> 00:43:32.730
Then we're going to go ahead
and increase the counter.

00:43:32.730 --> 00:43:35.040
So it turns out that the
regular expression library

00:43:35.040 --> 00:43:39.090
has a function called search that
takes as its first argument a pattern,

00:43:39.090 --> 00:43:41.970
and then, as its second
argument the string you

00:43:41.970 --> 00:43:44.160
want to analyze for that pattern.

00:43:44.160 --> 00:43:47.880
So it's sort of looking for a needle
in this haystack, from left to right.

00:43:47.880 --> 00:43:52.080
Let me go ahead now and run this
version of the program, Enter.

00:43:52.080 --> 00:43:56.250
And now I screwed up because I forgot
my colon, but that's old stuff.

00:43:56.250 --> 00:43:57.322
Enter.

00:43:57.322 --> 00:43:59.150
Huh.

00:43:59.150 --> 00:44:01.800
Number of people who
like The Office is now 0.

00:44:01.800 --> 00:44:02.970
So this seems like a big--

00:44:02.970 --> 00:44:05.640
thank you-- big step backwards.

00:44:05.640 --> 00:44:08.010
What did I do wrong?

00:44:08.010 --> 00:44:08.730
Yeah?

00:44:08.730 --> 00:44:10.410
AUDIENCE: [INAUDIBLE]

00:44:10.410 --> 00:44:11.610
DAVID J. MALAN: Yeah.

00:44:11.610 --> 00:44:14.595
I forced all my input to uppercase,
so I probably need to do this.

00:44:14.595 --> 00:44:16.470
So we'll come back to
other approaches there.

00:44:16.470 --> 00:44:17.430
Let me rerun it now.

00:44:17.430 --> 00:44:19.800
OK, now we're back up to 16.

00:44:19.800 --> 00:44:22.200
But I could even, let's say--

00:44:22.200 --> 00:44:25.110
I could tolerate just The Office.

00:44:25.110 --> 00:44:30.120
How about this, or how about
something like, or The Office?

00:44:30.120 --> 00:44:32.373
Let me do this instead.

00:44:32.373 --> 00:44:34.290
And let me use these
other special characters.

00:44:34.290 --> 00:44:37.380
This caret sign means the
beginning of the string.

00:44:37.380 --> 00:44:40.740
This dollar sign weirdly
represents the end of the string.

00:44:40.740 --> 00:44:44.490
I'm adding in some parentheses just
like in math, just to add another symbol

00:44:44.490 --> 00:44:46.470
here, the or symbol here.

00:44:46.470 --> 00:44:50.580
And this is saying start matching
at the beginning of the user string.

00:44:50.580 --> 00:44:54.660
Check if the beginning of the string is
Office, or the beginning of the string

00:44:54.660 --> 00:44:55.920
is The Office.

00:44:55.920 --> 00:44:58.540
And then, you better be
at the end of the string.

00:44:58.540 --> 00:45:01.650
So they can't keep typing words
before or after that input.

00:45:01.650 --> 00:45:03.690
Let me go ahead and rerun the program.

00:45:03.690 --> 00:45:07.500
And now we're down to 15, which
used to be our correct answer,

00:45:07.500 --> 00:45:10.770
but then we noticed The V Office.

00:45:10.770 --> 00:45:12.680
How can we deal with that?

00:45:12.680 --> 00:45:16.200
It's going to be messier
to deal with that.

00:45:16.200 --> 00:45:21.555
How about if I tolerate any
character represented by dot

00:45:21.555 --> 00:45:23.640
in between The and Office?

00:45:23.640 --> 00:45:28.110
Now if I rerun it, now I really
have this expressive capability.

00:45:28.110 --> 00:45:32.340
So this is only to say, there are so
many ways in languages, in general,

00:45:32.340 --> 00:45:33.390
to solve problems.

00:45:33.390 --> 00:45:35.950
And some of these tools are
more sophisticated than others.

00:45:35.950 --> 00:45:38.957
This is one that you've actually
probably glanced at but never used

00:45:38.957 --> 00:45:41.040
in the context of Google
Forms for years if you're

00:45:41.040 --> 00:45:43.770
in the habit of creating these for
student groups or other activities.

00:45:43.770 --> 00:45:45.840
But it's now something
you can start to leverage.

00:45:45.840 --> 00:45:49.440
And we're just scratching the surface
of what's actually possible with this.

00:45:49.440 --> 00:45:53.640
But let's now do one final example
just using some Python code here.

00:45:53.640 --> 00:45:55.530
And let's actually
write a program that's

00:45:55.530 --> 00:46:00.040
a little more general purpose that
allows me to search for any given title

00:46:00.040 --> 00:46:01.770
and figure out its popularity.

00:46:01.770 --> 00:46:04.410
So let me go ahead and simplify this.

00:46:04.410 --> 00:46:06.630
Let's get rid of our
regular expressions.

00:46:06.630 --> 00:46:09.940
Let's go ahead and continue
capitalizing the title.

00:46:09.940 --> 00:46:11.580
And let's go ahead to--

00:46:11.580 --> 00:46:16.020
at the beginning of this program,
and first ask the user for the title

00:46:16.020 --> 00:46:17.380
they want to search for.

00:46:17.380 --> 00:46:20.320
So title equals, let's
ask the user for input,

00:46:20.320 --> 00:46:23.490
which is essentially the same thing
as our CS50 get_string function.

00:46:23.490 --> 00:46:24.960
Ask them for the title.

00:46:24.960 --> 00:46:28.170
And then whatever they type in,
let's go ahead and strip whitespace

00:46:28.170 --> 00:46:30.900
and uppercase the thing again.

00:46:30.900 --> 00:46:35.820
And now, inside of my loop, I
could say something like this.

00:46:35.820 --> 00:46:42.660
If the current row's title after
stripping whitespace and forcing

00:46:42.660 --> 00:46:46.920
it to uppercase, too, equals
the user's title, then, go ahead

00:46:46.920 --> 00:46:49.440
and maybe increment a counter.

00:46:49.440 --> 00:46:51.160
So I still need that counter back.

00:46:51.160 --> 00:46:56.610
So let me go ahead and define this
maybe in here, counter equals 0.

00:46:56.610 --> 00:46:58.720
And then, at the very
end of this program,

00:46:58.720 --> 00:47:01.050
let me go ahead and print
out just the popularity

00:47:01.050 --> 00:47:03.040
of whatever the human typed in.

00:47:03.040 --> 00:47:06.030
So again, the only difference is
I'm asking the human for some input

00:47:06.030 --> 00:47:06.720
this time.

00:47:06.720 --> 00:47:09.150
I'm initializing my
counter to 0, then I'm

00:47:09.150 --> 00:47:12.660
searching for their
title in the CSV file

00:47:12.660 --> 00:47:15.810
by doing the same massaging of the
data by forcing it to uppercase

00:47:15.810 --> 00:47:18.570
and getting rid of the whitespace.

00:47:18.570 --> 00:47:21.780
So now, when I run Python
of favorites.py, Enter,

00:47:21.780 --> 00:47:30.030
I could type in the office all lowercase
even, and now we're down to 13.

00:47:30.030 --> 00:47:34.940
13, why?

00:47:34.940 --> 00:47:36.700
Oh, that's correct.

00:47:36.700 --> 00:47:40.640
Because I'm the one that went in and
removed those The keywords a bit ago.

00:47:40.640 --> 00:47:43.030
If we fixed those, we
would be back up to 15.

00:47:43.030 --> 00:47:47.650
If we added support for The V
Office, we would be up to 16 as well.

00:47:47.650 --> 00:47:50.350
All right, any questions then
on these various manipulations?

00:47:50.350 --> 00:47:52.183
And if you're feeling
like, oh, my god, this

00:47:52.183 --> 00:47:55.100
is so much Python code just to do
simple things, that's the point.

00:47:55.100 --> 00:47:57.160
And indeed, even though
it's a powerful language

00:47:57.160 --> 00:48:00.670
and can solve these kinds of problems,
we had to write almost 20 lines of code

00:48:00.670 --> 00:48:03.470
just to ask a single question like this.

00:48:03.470 --> 00:48:07.120
But any questions on how we did
this, or on any of these building

00:48:07.120 --> 00:48:10.750
blocks along the way?

00:48:10.750 --> 00:48:11.890
Anything here?

00:48:11.890 --> 00:48:12.390
No?

00:48:12.390 --> 00:48:12.600
All right.

00:48:12.600 --> 00:48:13.300
That was a lot.

00:48:13.300 --> 00:48:14.800
Let's take a five-minute break here.

00:48:14.800 --> 00:48:17.230
When we come back, we'll do it better.

00:48:17.230 --> 00:48:18.480
So we are back.

00:48:18.480 --> 00:48:20.610
And the rest of today
is ultimately about, how

00:48:20.610 --> 00:48:24.840
can we store, and manipulate,
and change, and retrieve data

00:48:24.840 --> 00:48:28.090
more efficiently than we might
by just writing raw code?

00:48:28.090 --> 00:48:31.440
This isn't to say that you shouldn't
use Python to do the kinds of things

00:48:31.440 --> 00:48:32.280
that we just did.

00:48:32.280 --> 00:48:37.020
And in fact, it might be super common
if you're getting a lot of messy input

00:48:37.020 --> 00:48:39.073
from users that you might
want to clean it up.

00:48:39.073 --> 00:48:42.240
And maybe the best way to do that is
to write a program so that step-by-step

00:48:42.240 --> 00:48:44.370
you can make all of the
requisite changes and fixes

00:48:44.370 --> 00:48:47.523
like we did with The Office,
for instance, again and again,

00:48:47.523 --> 00:48:50.190
and reuse that code, especially
if more and more submissions are

00:48:50.190 --> 00:48:51.150
coming through.

00:48:51.150 --> 00:48:53.550
But another theme of
today, ultimately, is

00:48:53.550 --> 00:48:57.638
that sometimes there are different,
if not better tools for the same job.

00:48:57.638 --> 00:48:59.430
And in fact, now at
this point in the term,

00:48:59.430 --> 00:49:02.310
as we begin to introduce not
just Python, but in a moment

00:49:02.310 --> 00:49:06.120
a language called SQL, and next
week, a language called JavaScript,

00:49:06.120 --> 00:49:09.150
and the week after that, synthesizing
a whole lot of these languages

00:49:09.150 --> 00:49:12.420
together is to just kind
of paint a picture of how

00:49:12.420 --> 00:49:15.900
you might decide what the trade-offs are
between using this tool, or this tool,

00:49:15.900 --> 00:49:16.830
or this other tool.

00:49:16.830 --> 00:49:19.770
Because undoubtedly you can
solve problems moving forward

00:49:19.770 --> 00:49:22.660
in many different ways
with many different tools.

00:49:22.660 --> 00:49:25.020
So let's give you another
tool, one with which

00:49:25.020 --> 00:49:28.170
you can implement a proper
relational database.

00:49:28.170 --> 00:49:31.050
What we just saw in
the form of CSV files

00:49:31.050 --> 00:49:33.810
are what we might call
flat-file databases.

00:49:33.810 --> 00:49:37.500
Again, just a very simple file, flat
in that there's no hierarchy to it.

00:49:37.500 --> 00:49:39.270
It's just rows and columns.

00:49:39.270 --> 00:49:44.280
And that is all ultimately
storing ASCII or Unicode text.

00:49:44.280 --> 00:49:47.400
A relational database, though,
is something that's actually

00:49:47.400 --> 00:49:50.850
closer to a proper spreadsheet program.

00:49:50.850 --> 00:49:53.440
A CSV is an individual
sheet, if you will,

00:49:53.440 --> 00:49:55.260
from a spreadsheet when you export it.

00:49:55.260 --> 00:49:57.460
If you had multiple
sheets in a spreadsheet,

00:49:57.460 --> 00:49:59.595
you would have to export multiple CSVs.

00:49:59.595 --> 00:50:01.470
And that gets annoying
quickly in code if you

00:50:01.470 --> 00:50:03.990
have to open up this CSV,
this CSV, all of which

00:50:03.990 --> 00:50:07.080
represent different sheets or
tabs in a proper spreadsheet.

00:50:07.080 --> 00:50:11.520
A relational database is more
like a spreadsheet program

00:50:11.520 --> 00:50:14.640
that you, a programmer,
now can interact with.

00:50:14.640 --> 00:50:16.140
You can write data to it.

00:50:16.140 --> 00:50:19.680
You can read data from it, and you
can have multiple sheets, a.k.a.,

00:50:19.680 --> 00:50:22.150
tables storing all of your data.

00:50:22.150 --> 00:50:24.240
So whereas Excel and numbers
in Google spreadsheet

00:50:24.240 --> 00:50:27.090
are meant to be reused really by humans
with their mouse and their keyboard,

00:50:27.090 --> 00:50:29.850
clicking, and pointing, and
manipulating things graphically,

00:50:29.850 --> 00:50:32.160
a relational database
using a language called

00:50:32.160 --> 00:50:37.320
SQL is one in which the programmer
has similar capabilities,

00:50:37.320 --> 00:50:39.000
but doing so in code.

00:50:39.000 --> 00:50:42.720
Specifically, using a language
called SQL, and at a scale

00:50:42.720 --> 00:50:45.670
that's much grander
than spreadsheets alone.

00:50:45.670 --> 00:50:48.420
In fact, if you try on your Mac
or PC to open a spreadsheet that's

00:50:48.420 --> 00:50:51.090
got tens of thousands
of rows, it'll probably

00:50:51.090 --> 00:50:54.780
work fine, hundreds of thousands
of rows, millions of rows, no way.

00:50:54.780 --> 00:50:57.000
At some point your Mac or
PC is going to struggle

00:50:57.000 --> 00:50:59.130
to open particularly large data sets.

00:50:59.130 --> 00:51:01.620
And that, too, is where
proper databases come

00:51:01.620 --> 00:51:04.140
into play and proper
languages for databases come

00:51:04.140 --> 00:51:06.120
into play, when it's all about scale.

00:51:06.120 --> 00:51:09.390
And indeed, most any mobile app or
web app today that you or someone else

00:51:09.390 --> 00:51:13.420
might write should probably plan
on lots of data if it's successful.

00:51:13.420 --> 00:51:15.730
So we need the right
tools for that problem.

00:51:15.730 --> 00:51:19.110
So fortunately, even though we're
about to learn yet another language,

00:51:19.110 --> 00:51:24.360
it only does four things fundamentally,
known by this silly acronym, CRUD.

00:51:24.360 --> 00:51:27.870
SQL, this language for
databases, supports the ability

00:51:27.870 --> 00:51:32.400
to create data, read data,
update data, and delete data.

00:51:32.400 --> 00:51:33.420
That's it.

00:51:33.420 --> 00:51:36.690
There's a few more keywords that
exist in this language called SQL

00:51:36.690 --> 00:51:37.750
that we'll soon see.

00:51:37.750 --> 00:51:39.458
But at the end of the
day, even if you're

00:51:39.458 --> 00:51:42.180
starting to feel like this
is a lot very quickly,

00:51:42.180 --> 00:51:44.940
it all boils down to these
four basic operations.

00:51:44.940 --> 00:51:47.640
And the four commands
in SQL, if you will,

00:51:47.640 --> 00:51:51.780
functions in a sense that implement
those four ideas happen to be these.

00:51:51.780 --> 00:51:54.270
They're almost the same but
with some slight variance.

00:51:54.270 --> 00:51:59.280
The ability to create or insert data
is the C. The ability to select data

00:51:59.280 --> 00:52:01.080
is the R, or read.

00:52:01.080 --> 00:52:02.130
Update is the same.

00:52:02.130 --> 00:52:05.048
Delete is the same, but drop
is also a keyword as well.

00:52:05.048 --> 00:52:06.840
So we'll see these and
a few other keywords

00:52:06.840 --> 00:52:10.410
in SQL that, at the end of the day, just
allow you to create, read, and update

00:52:10.410 --> 00:52:14.310
data using verbs, if
you will, like these.

00:52:14.310 --> 00:52:18.030
So to do that, what's
the syntax going to be?

00:52:18.030 --> 00:52:20.290
Well, we won't get into the
weeds too quickly on this.

00:52:20.290 --> 00:52:22.650
But here's a representative
syntax of how

00:52:22.650 --> 00:52:25.710
you can create using this
language called SQL, in your very

00:52:25.710 --> 00:52:28.020
own database, a brand new table.

00:52:28.020 --> 00:52:30.910
This is so easy in Excel, and Google
Spreadsheets, and Apple Numbers.

00:52:30.910 --> 00:52:32.910
You want a new sheet, you
click the plus button.

00:52:32.910 --> 00:52:33.690
You get a new tab.

00:52:33.690 --> 00:52:35.490
You give it a name,
and boom, you're done.

00:52:35.490 --> 00:52:40.050
In the world of programming, though, if
you want to create the analogue of that

00:52:40.050 --> 00:52:43.440
spreadsheet in the computer's memory,
you create something called a table,

00:52:43.440 --> 00:52:48.420
like a sheet, that has a name, and then
in parentheses has one or more columns.

00:52:48.420 --> 00:52:51.990
But unlike Google Spreadsheets,
and Apple Numbers, and Excel,

00:52:51.990 --> 00:52:55.073
you have to decide as the
programmer what types of data

00:52:55.073 --> 00:52:57.240
you're going to be storing
in each of these columns.

00:52:57.240 --> 00:52:59.430
Now even though Excel,
and Google Spreadsheets,

00:52:59.430 --> 00:53:03.310
and Numbers does allow you to format
or present data in different ways,

00:53:03.310 --> 00:53:07.680
it's not strongly typed data like it
is, for instance, when we were using C.

00:53:07.680 --> 00:53:10.200
And heck, even in Python
there's underlying data types.

00:53:10.200 --> 00:53:12.158
Even if you don't have
to type them explicitly,

00:53:12.158 --> 00:53:14.900
databases are going to want to
know, are you storing integers?

00:53:14.900 --> 00:53:16.640
Are you storing real numbers or floats?

00:53:16.640 --> 00:53:17.630
Are you storing text?

00:53:17.630 --> 00:53:18.140
Why?

00:53:18.140 --> 00:53:20.960
Because especially as your
data scales, the more hints

00:53:20.960 --> 00:53:24.410
you give the database about your
data, the more performance it can be,

00:53:24.410 --> 00:53:27.500
the faster it can help you
get at and store that data.

00:53:27.500 --> 00:53:29.302
So types are about to
be important again,

00:53:29.302 --> 00:53:31.760
but there's not going to be
that many of them, fortunately.

00:53:31.760 --> 00:53:34.640
Now how can I go about converting,
for instance, some real data,

00:53:34.640 --> 00:53:37.490
like that from you,
my favorites.csv file,

00:53:37.490 --> 00:53:39.440
into a proper relational database?

00:53:39.440 --> 00:53:42.650
Well, it turns out that
using SQL I can do this

00:53:42.650 --> 00:53:45.260
in VS Code on my own Mac,
or PC, or in the cloud

00:53:45.260 --> 00:53:48.455
here by just importing
the CSV into a database.

00:53:48.455 --> 00:53:50.330
We'll see eventually
how to do this manually.

00:53:50.330 --> 00:53:52.622
For now, I'm going to use
more of an automated process.

00:53:52.622 --> 00:53:54.680
So let me go over to VS Code here.

00:53:54.680 --> 00:53:57.170
Let me type ls to see
where we left off before.

00:53:57.170 --> 00:54:01.005
I had two files favorites.csv, which
I downloaded from Google Spreadsheets.

00:54:01.005 --> 00:54:02.630
Recall that I made a couple of changes.

00:54:02.630 --> 00:54:06.050
We deleted a couple of Thes
from the file for The Office.

00:54:06.050 --> 00:54:08.600
But this is the same file
as before, and then we

00:54:08.600 --> 00:54:11.210
have favorites.py, which
we'll set aside for now.

00:54:11.210 --> 00:54:14.870
I'm going to go ahead now
and run a command SQLite3.

00:54:14.870 --> 00:54:18.020
So in the world of
relational databases, there's

00:54:18.020 --> 00:54:23.030
many different products out there,
many different software that

00:54:23.030 --> 00:54:25.370
implements the SQL language.

00:54:25.370 --> 00:54:26.600
Microsoft has their own.

00:54:26.600 --> 00:54:30.080
There's something called MySQL
that's been very popular for years.

00:54:30.080 --> 00:54:32.120
Facebook, for instance,
used it early on.

00:54:32.120 --> 00:54:35.030
PostgreSQL, Microsoft
Access Server, Oracle,

00:54:35.030 --> 00:54:36.958
and maybe a whole bunch
of other product names

00:54:36.958 --> 00:54:38.750
you might have encountered
over time, which

00:54:38.750 --> 00:54:42.980
is to say there's many different
types of tools, and servers,

00:54:42.980 --> 00:54:44.990
and software in which you can use SQL.

00:54:44.990 --> 00:54:47.780
We're going to use a very lightweight
version of the SQL language

00:54:47.780 --> 00:54:49.370
today called SQLite.

00:54:49.370 --> 00:54:51.680
This is the version of
SQL that's generally

00:54:51.680 --> 00:54:54.020
used on iPhones and
Android devices these days.

00:54:54.020 --> 00:54:56.930
If you download an app that stores
data like your own contacts,

00:54:56.930 --> 00:54:59.000
typically is stored using SQLite.

00:54:59.000 --> 00:55:02.710
Because it's fairly lightweight,
but you can still store hundreds,

00:55:02.710 --> 00:55:05.810
thousands, even tens of
thousands of pieces of data

00:55:05.810 --> 00:55:07.970
even using this lightweight
version thereof.

00:55:07.970 --> 00:55:10.790
SQLite3 is like version 3 of this tool.

00:55:10.790 --> 00:55:16.340
We're going to go ahead and run SQLite3
with a file called favorites.db.

00:55:16.340 --> 00:55:20.120
It's conventional in the world of
SQLite to name your file something.db.

00:55:20.120 --> 00:55:22.490
I'm going to create a
database called favorites.db.

00:55:22.490 --> 00:55:27.010
Once I'm inside of the program, now I'm
going to go ahead and enter CSV Mode.

00:55:27.010 --> 00:55:28.760
Again, not something
you have to memorize,

00:55:28.760 --> 00:55:30.468
just something you
can look up as needed.

00:55:30.468 --> 00:55:34.100
And then, I'm going to
import favorites.csv

00:55:34.100 --> 00:55:40.260
into a table, that is, a sheet, if
you will, called favorites as well.

00:55:40.260 --> 00:55:44.030
Now I'm going to hit Enter and I'm
going to go ahead and exit the program

00:55:44.030 --> 00:55:45.650
altogether and type ls.

00:55:45.650 --> 00:55:48.170
Now I have three files
in my current directory--

00:55:48.170 --> 00:55:52.130
the CSV file, the Python file
from before, and now favorites.db.

00:55:52.130 --> 00:55:56.180
But if I did this right, all of the
data you all typed into the CSV file

00:55:56.180 --> 00:55:59.840
has now been loaded into a proper
database where I can now use

00:55:59.840 --> 00:56:03.180
this SQL language to access it instead.

00:56:03.180 --> 00:56:07.730
So let's go ahead again and run SQLite3
of favorites.db, which now exists.

00:56:07.730 --> 00:56:10.640
And now, at the SQLite
prompt I can start

00:56:10.640 --> 00:56:13.160
to play around and
see what this data is.

00:56:13.160 --> 00:56:16.610
For instance, I can
look, by typing .schema,

00:56:16.610 --> 00:56:19.362
at what the schema is of
my data, what's the design.

00:56:19.362 --> 00:56:22.070
Now no thought was put into the
design of this data at the moment

00:56:22.070 --> 00:56:23.900
because I automated the whole process.

00:56:23.900 --> 00:56:26.750
Once we start creating
our own databases we'll

00:56:26.750 --> 00:56:29.750
give more thought to the data
types and the columns that we have.

00:56:29.750 --> 00:56:34.220
But we can see what SQLite
presumed I wanted just

00:56:34.220 --> 00:56:36.530
by importing the data by default.

00:56:36.530 --> 00:56:41.120
What the import command did for me a
moment ago is essentially the syntax.

00:56:41.120 --> 00:56:44.510
It automated the process of creating
a table, if it doesn't exist,

00:56:44.510 --> 00:56:45.770
called favorites.

00:56:45.770 --> 00:56:48.980
And then notice, in parentheses
it gave me three columns--

00:56:48.980 --> 00:56:53.360
timestamp, title, and genres, which
were inferred, obviously, from the CSV.

00:56:53.360 --> 00:56:56.000
All three of which have
been decreed to be text.

00:56:56.000 --> 00:56:59.180
Again, once we're more comfortable
we'll create our own tables,

00:56:59.180 --> 00:57:01.010
choose our own types and column names.

00:57:01.010 --> 00:57:03.350
But for now, I just automated
the whole process just

00:57:03.350 --> 00:57:08.120
to get us started by using this
built-in import command as well.

00:57:08.120 --> 00:57:08.810
All right.

00:57:08.810 --> 00:57:11.630
So what now can I begin to do?

00:57:11.630 --> 00:57:16.910
Well, if I wanted to, for instance,
start playing around with data therein,

00:57:16.910 --> 00:57:19.595
I might execute a couple
of different commands.

00:57:23.000 --> 00:57:28.421
Let me find the right one here--
one of which would be select.

00:57:28.421 --> 00:57:31.610
Select being one of our
most versatile tools

00:57:31.610 --> 00:57:33.180
to select data from this database.

00:57:33.180 --> 00:57:35.720
So if I have these three
columns here-- timestamp,

00:57:35.720 --> 00:57:39.020
title, and genres, suppose I
want to select all of the titles.

00:57:39.020 --> 00:57:43.790
Doing that earlier in Python
required importing the CSV library,

00:57:43.790 --> 00:57:48.740
opening the file, creating a reader or
a DictReader, iterating over every row,

00:57:48.740 --> 00:57:51.500
adding every title to a dictionary
or just printing it out,

00:57:51.500 --> 00:57:52.310
and dot, dot, dot.

00:57:52.310 --> 00:57:55.170
There was a dozen or so lines
of code when we first began.

00:57:55.170 --> 00:57:56.840
Now, how about this?

00:57:56.840 --> 00:58:01.220
Select title from
favorites, semicolon, done.

00:58:01.220 --> 00:58:05.570
So now, with this particular
language, the output is very textual

00:58:05.570 --> 00:58:08.930
and it's simulating what it looks like
if it were more graphical by creating

00:58:08.930 --> 00:58:10.870
this table, so to speak.

00:58:10.870 --> 00:58:14.080
Select title from
favorites is a distillation

00:58:14.080 --> 00:58:17.530
in a different language called
SQL of all the lines of code

00:58:17.530 --> 00:58:20.740
I wrote early on when we first
started playing with favorites.py.

00:58:20.740 --> 00:58:25.540
SQL is therefore optimized for
reading, and creating, and updating,

00:58:25.540 --> 00:58:27.500
and ultimately, deleting data.

00:58:27.500 --> 00:58:30.700
So here's perhaps a better tool
for the job once you have the data.

00:58:30.700 --> 00:58:34.030
Tossing it into a more
powerful, versatile format

00:58:34.030 --> 00:58:37.228
might allow you now to get
more work done more quickly

00:58:37.228 --> 00:58:38.770
without having to reinvent the wheel.

00:58:38.770 --> 00:58:41.510
Someone else has figured out
how to select data like this.

00:58:41.510 --> 00:58:43.760
What more can I do here?

00:58:43.760 --> 00:58:47.050
Well, let me go ahead and pull
up, in a moment, just a little bit

00:58:47.050 --> 00:58:49.390
of a cheat sheet here.

00:58:49.390 --> 00:58:53.840
Give me one second to find this.

00:58:53.840 --> 00:58:58.090
So suppose I want to now select
data a little more powerfully.

00:58:58.090 --> 00:59:00.220
So here's what I just
did in a canonical way.

00:59:00.220 --> 00:59:01.720
So select typically works like this.

00:59:01.720 --> 00:59:05.860
You select columns from a
specific table, semicolon.

00:59:05.860 --> 00:59:08.260
Unfortunately, stupid
semicolons are back.

00:59:08.260 --> 00:59:12.710
Select columns from table then, is
the generic form of what I just did.

00:59:12.710 --> 00:59:17.122
More specifically, I selected one
column called title from favorites.

00:59:17.122 --> 00:59:18.580
Favorites is the name of the table.

00:59:18.580 --> 00:59:19.690
Semicolon ends my thought.

00:59:19.690 --> 00:59:23.440
Suppose I wanted to get two things, like
the genres that each of you inputted.

00:59:23.440 --> 00:59:28.420
I could instead do select title,
comma, genres from favorites,

00:59:28.420 --> 00:59:30.220
and then, a semicolon, and Enter.

00:59:30.220 --> 00:59:32.110
It's going to look a
little ugly on my screen

00:59:32.110 --> 00:59:33.670
because some of these titles and--

00:59:33.670 --> 00:59:37.300
OK, one of you really went
all out with Community.

00:59:37.300 --> 00:59:40.660
You can see that it's just
wrapping in an ugly way,

00:59:40.660 --> 00:59:43.300
but it's just now
showing me two columns.

00:59:43.300 --> 00:59:46.840
If we scroll up to the very top
again, the left most of one,

00:59:46.840 --> 00:59:48.280
Black Mirror went all out, too.

00:59:48.280 --> 00:59:49.150
Thank you.

00:59:49.150 --> 00:59:52.000
And now, OK, we're going to
have to clean some of these up.

00:59:52.000 --> 00:59:54.115
Game of Thrones, good comedy, yes.

00:59:57.550 --> 00:59:59.480
Keep going, keep going, keep going.

00:59:59.480 --> 01:00:02.870
So now we've selected two of
the columns that we care about.

01:00:02.870 --> 01:00:03.370
There it is.

01:00:03.370 --> 01:00:06.380
OK, so it's crazy wide because
of all of those genres.

01:00:06.380 --> 01:00:09.135
But it allows me to select
exactly the data I want.

01:00:09.135 --> 01:00:12.010
Let's go back to the titles, though,
and perhaps start playing around

01:00:12.010 --> 01:00:13.750
with some modifiers here.

01:00:13.750 --> 01:00:18.310
For instance, it turns out, using
SQL there's a lot of functionality

01:00:18.310 --> 01:00:19.900
built into the language.

01:00:19.900 --> 01:00:22.843
You've got a lot of functions, similar
to Excel or Google Spreadsheets

01:00:22.843 --> 01:00:24.010
where you can have formulas.

01:00:24.010 --> 01:00:26.320
SQL provides you with some
of the same heuristics that

01:00:26.320 --> 01:00:30.350
allow you to apply operations
like these on entire columns.

01:00:30.350 --> 01:00:32.920
For instance, you can take
averages, count the total,

01:00:32.920 --> 01:00:36.010
get the distinct values, force
things to lowercase, uppercase, min,

01:00:36.010 --> 01:00:37.220
and max, and so forth.

01:00:37.220 --> 01:00:39.610
So let's try distinct, for instance.

01:00:39.610 --> 01:00:43.450
Let me go back to my Terminal,
and let's say, select,

01:00:43.450 --> 01:00:48.760
how about the distinct titles
from the favorites table?

01:00:48.760 --> 01:00:49.513
Enter.

01:00:49.513 --> 01:00:51.430
I didn't bother selecting
the genres because I

01:00:51.430 --> 01:00:52.763
want it to be a little prettier.

01:00:52.763 --> 01:00:58.090
And you can see here that we
have just the distinct titles,

01:00:58.090 --> 01:01:00.547
except for issues of formatting.

01:01:00.547 --> 01:01:02.380
So whitespace is going
to be an issue again.

01:01:02.380 --> 01:01:04.213
Capitalization is going
to be a thing again.

01:01:04.213 --> 01:01:05.230
So there's a trade-off.

01:01:05.230 --> 01:01:09.280
One of the things I was doing in Python
was forcing everything to uppercase

01:01:09.280 --> 01:01:10.930
and then getting rid of whitespace.

01:01:10.930 --> 01:01:12.400
But we could combine some of these.

01:01:12.400 --> 01:01:15.340
I could do something like
force every title to uppercase,

01:01:15.340 --> 01:01:16.507
then get the distinct value.

01:01:16.507 --> 01:01:19.382
And that's actually going to get
rid of some of those values as well.

01:01:19.382 --> 01:01:21.860
And again, I did it all in
one simple line that was fast.

01:01:21.860 --> 01:01:24.027
So let me pull up at the
bottom of the screen again.

01:01:24.027 --> 01:01:28.090
I selected distinct upper
titles from favorites,

01:01:28.090 --> 01:01:31.070
and that did everything for
me at once in just one breath.

01:01:31.070 --> 01:01:33.640
Suppose I want to get the total
number of counts of titles.

01:01:33.640 --> 01:01:40.150
How about select count of all
of those titles from favorites?

01:01:40.150 --> 01:01:43.990
Semicolon, Enter, and now
you get back a mini table

01:01:43.990 --> 01:01:47.960
that contains just your
answer, 158 in this case.

01:01:47.960 --> 01:01:50.560
So that's the total
number of, not distinct,

01:01:50.560 --> 01:01:52.690
but total titles that
we had in the file.

01:01:52.690 --> 01:01:56.560
And we could continue to manipulate
the data further using, again,

01:01:56.560 --> 01:01:58.450
functions like these here.

01:01:58.450 --> 01:02:01.550
But there's also additional
filtration we can do.

01:02:01.550 --> 01:02:07.010
We can also qualify our selections by
saying where some condition is true.

01:02:07.010 --> 01:02:10.630
So just as in Scratch, and C, and
Python, you have Boolean expressions,

01:02:10.630 --> 01:02:16.270
you can have the same in SQL as well,
where I can filter my data where

01:02:16.270 --> 01:02:19.390
something is true or false.

01:02:19.390 --> 01:02:21.340
Like allows me to do approximations.

01:02:21.340 --> 01:02:23.500
If I want to get something
that's like The Office

01:02:23.500 --> 01:02:26.320
but not necessarily
T-H-E, space, Office,

01:02:26.320 --> 01:02:29.440
I could do pattern
matching using like here.

01:02:29.440 --> 01:02:33.170
Order by, limit, and grouped by are
other commands I can execute, too.

01:02:33.170 --> 01:02:35.980
So let me go back and do
a couple of these here.

01:02:35.980 --> 01:02:42.610
How about, let me just get, oh, I don't
know, all of the titles from favorites

01:02:42.610 --> 01:02:44.848
but limit it to 10 results.

01:02:44.848 --> 01:02:47.890
That might be one thing that's helpful
to see if you just care about some

01:02:47.890 --> 01:02:50.110
of the data at the top there instead.

01:02:50.110 --> 01:02:56.530
How about, select all of the titles
from favorites, where the title itself

01:02:56.530 --> 01:02:59.710
is like, quote, unquote, "Office?"

01:02:59.710 --> 01:03:02.740
And this will give me only two answers.

01:03:02.740 --> 01:03:07.180
Those are the two rows, recall, that I
mutated by getting rid of the word The.

01:03:07.180 --> 01:03:11.880
Notice that like allows me too
tolerate uppercase and lowercase.

01:03:11.880 --> 01:03:14.880
Because if I instead
just use the equal sign,

01:03:14.880 --> 01:03:21.060
and in SQL a single equal sign
does, in fact, mean equality.

01:03:21.060 --> 01:03:23.370
For comparison's sake,
it's not doing assignment.

01:03:23.370 --> 01:03:26.070
This is not how you assign data in SQL.

01:03:26.070 --> 01:03:27.940
I got back no answers there.

01:03:27.940 --> 01:03:31.620
So indeed, the equal sign
is giving me literal answers

01:03:31.620 --> 01:03:33.990
that searches just for what I typed in.

01:03:33.990 --> 01:03:35.370
How could I get all of these?

01:03:35.370 --> 01:03:39.090
Well, similar in spirit to regular
expressions but not quite as powerful

01:03:39.090 --> 01:03:41.620
in SQL, I could do something like this.

01:03:41.620 --> 01:03:45.330
I can select the title from favorites
where the title is like, quote,

01:03:45.330 --> 01:03:46.710
unquote, "Office."

01:03:46.710 --> 01:03:52.450
But I can add, a bit weirdly, percent
signs to the left and the right.

01:03:52.450 --> 01:03:57.930
So the language SQL supports the
same notion of pattern matching

01:03:57.930 --> 01:03:59.547
but much more limited out of the box.

01:03:59.547 --> 01:04:01.380
If we want more powerful
regular expressions

01:04:01.380 --> 01:04:03.420
we probably do want
to use Python instead.

01:04:03.420 --> 01:04:06.720
But the percent sign here
means 0 or more characters

01:04:06.720 --> 01:04:09.340
on the left, 0 or more
characters on the right.

01:04:09.340 --> 01:04:14.460
So this will just grab any title that
contains O-F-F-I-C-E in it in that

01:04:14.460 --> 01:04:15.100
order.

01:04:15.100 --> 01:04:19.290
And now I get all 16, it would
seem, of those results, again.

01:04:19.290 --> 01:04:20.490
How do I know it's 16?

01:04:20.490 --> 01:04:23.160
Well, I can just get the
count of those titles

01:04:23.160 --> 01:04:26.140
and get back that
answer instead as well.

01:04:26.140 --> 01:04:29.640
So again, it takes some
getting used to, the vocabulary

01:04:29.640 --> 01:04:30.973
and the syntax that you can use.

01:04:30.973 --> 01:04:32.682
There's these building
blocks and others.

01:04:32.682 --> 01:04:35.500
But SQL is really designed, again,
for creating, reading, updating,

01:04:35.500 --> 01:04:36.480
and deleting data.

01:04:36.480 --> 01:04:41.080
For instance, I've never really
been a fan of Friends, for instance.

01:04:41.080 --> 01:04:46.860
So right now if I do select,
how about title from favorites

01:04:46.860 --> 01:04:53.340
where title like, quote, unquote,
Friends with the percent signs?

01:04:53.340 --> 01:04:55.330
We can see that there's
a whole bunch of them.

01:04:55.330 --> 01:04:56.520
That's how many exactly.

01:04:56.520 --> 01:04:57.960
Let's just do a quick count.

01:04:57.960 --> 01:04:59.890
So that's nine of them.

01:04:59.890 --> 01:05:03.300
Well, delete from favorites.

01:05:03.300 --> 01:05:10.500
OK, you and me, delete from favorites,
where title like Friends, Enter.

01:05:10.500 --> 01:05:14.280
Nothing seems to happen,
but bye-bye Friends.

01:05:14.280 --> 01:05:15.060
[APPLAUSE]

01:05:15.060 --> 01:05:16.143
DAVID J. MALAN: Thank you.

01:05:19.110 --> 01:05:21.390
So now we've actually changed the data.

01:05:21.390 --> 01:05:25.110
And this is what's compelling
about a proper database.

01:05:25.110 --> 01:05:29.320
Yes, you could technically write Python
code that not only reads the CSV file,

01:05:29.320 --> 01:05:30.540
but also writes it.

01:05:30.540 --> 01:05:32.820
You can change using quote,
unquote, "A" for append,

01:05:32.820 --> 01:05:35.910
or quote, unquote, "W" for
write, instead of quote, unquote,

01:05:35.910 --> 01:05:37.263
"R" for read alone.

01:05:37.263 --> 01:05:39.930
But it's definitely a little more
involved to do that in Python.

01:05:39.930 --> 01:05:42.250
But with SQL, you can update
the data in real time.

01:05:42.250 --> 01:05:45.750
And if I were actually running a
web application here or a database

01:05:45.750 --> 01:05:47.880
for a mobile app, that
change, theoretically,

01:05:47.880 --> 01:05:50.160
would be reflected everywhere
on your own devices

01:05:50.160 --> 01:05:52.210
if you're somehow talking
to this application.

01:05:52.210 --> 01:05:53.995
So that's the direction we're headed.

01:05:53.995 --> 01:05:55.620
This other thing has been bothering me.

01:05:55.620 --> 01:06:02.640
So select, how about title from
favorites, where title equals,

01:06:02.640 --> 01:06:03.180
what was it?

01:06:03.180 --> 01:06:06.600
The V Office, was it?

01:06:06.600 --> 01:06:07.680
Yeah, it was that one.

01:06:07.680 --> 01:06:12.150
How about we update
favorites by setting title

01:06:12.150 --> 01:06:19.080
equal to The Office, where title
equals quote, unquote, "The V Office"

01:06:19.080 --> 01:06:20.470
semicolon?

01:06:20.470 --> 01:06:22.650
And now, if I select
the same thing again

01:06:22.650 --> 01:06:24.810
I can go up and down with
my arrow keys quickly.

01:06:24.810 --> 01:06:27.090
Now there is no The V Office.

01:06:27.090 --> 01:06:29.200
We've actually changed that value.

01:06:29.200 --> 01:06:30.300
How about genres?

01:06:30.300 --> 01:06:35.880
Select genres from favorites,
where the title is title

01:06:35.880 --> 01:06:39.450
equals Game of Thrones, semicolon.

01:06:39.450 --> 01:06:43.000
These were kind of long, and I
don't really agree with all of that.

01:06:43.000 --> 01:06:49.560
So how about we update favorites,
set genres equal to, sure,

01:06:49.560 --> 01:06:52.650
action, adventure, sure, drama?

01:06:52.650 --> 01:06:54.390
OK, so it's a decent list.

01:06:54.390 --> 01:06:56.700
Fantasy, sure, thriller, war.

01:06:56.700 --> 01:07:01.050
OK, anything really but
comedy, I would say.

01:07:01.050 --> 01:07:03.160
Let's go ahead and hit Enter now.

01:07:03.160 --> 01:07:07.800
And now, if I select genres again, same
query, now we've canonicalized that.

01:07:07.800 --> 01:07:09.160
We've thrown data away.

01:07:09.160 --> 01:07:11.250
So whether or not that
is right is probably

01:07:11.250 --> 01:07:13.020
a bit subjective and argumentative.

01:07:13.020 --> 01:07:16.920
But I have at least cleaned up my
data, which is, again, the U in CRUD.

01:07:16.920 --> 01:07:20.670
Create, read, update, delete,
you can do it that easily.

01:07:20.670 --> 01:07:22.290
Beware using delete.

01:07:22.290 --> 01:07:26.430
Beware worse using drop, whereby
you can drop an entire table.

01:07:26.430 --> 01:07:29.310
But via these kinds of
commands, can we actually now

01:07:29.310 --> 01:07:33.390
manipulate our data much more
rapidly and with single thoughts.

01:07:33.390 --> 01:07:36.390
And in fact, if you're an aspiring
statistician, or data scientist,

01:07:36.390 --> 01:07:40.320
or analyst in the real world, SQL
is such a commonly used language

01:07:40.320 --> 01:07:43.573
because it allows you to really
dive into data quickly, and ask

01:07:43.573 --> 01:07:45.990
questions of the data, and get
back answers quite quickly.

01:07:45.990 --> 01:07:47.530
And this is a simple data set.

01:07:47.530 --> 01:07:51.840
You can do this with much larger
data sets as we soon will, too.

01:07:51.840 --> 01:07:55.050
Or any questions on what
we've seen of SQL thus far?

01:07:55.050 --> 01:07:57.170
Only scratched the
surface, but again, it

01:07:57.170 --> 01:08:03.100
boils down to creating, reading,
updating, and deleting data.

01:08:03.100 --> 01:08:04.910
Questions here?

01:08:04.910 --> 01:08:05.410
All right.

01:08:05.410 --> 01:08:07.820
Well, let's consider
the design of this data.

01:08:07.820 --> 01:08:11.900
Recall that if I do .schema, that
shows me the design of my table,

01:08:11.900 --> 01:08:13.990
the so-called schema of my data.

01:08:13.990 --> 01:08:15.010
This is OK.

01:08:15.010 --> 01:08:17.649
It gets the job done, and frankly,
everything the user typed in

01:08:17.649 --> 01:08:21.609
was arguably text, including the
timestamp, which is the date and time.

01:08:21.609 --> 01:08:24.040
But so the data set
itself is somewhat simple.

01:08:24.040 --> 01:08:29.569
But if we look at the data set itself,
especially genres, let's do this.

01:08:29.569 --> 01:08:31.960
Select genres from favorites.

01:08:31.960 --> 01:08:34.540
And let me point out one other
thing stylistically, too.

01:08:34.540 --> 01:08:39.050
I am very deliberately capitalizing
all of the special SQL keywords,

01:08:39.050 --> 01:08:42.700
and I'm lowercasing all of the
column names and the table names.

01:08:42.700 --> 01:08:45.760
This is a convention, and
honestly, it just helps you read,

01:08:45.760 --> 01:08:49.210
I think, the code when you're
co-mingling your names for columns

01:08:49.210 --> 01:08:52.600
and tables with proper SQL keywords.

01:08:52.600 --> 01:08:58.029
But I could just as easily do
select genres from favorites,

01:08:58.029 --> 01:09:01.149
but again, the SQL specific keywords
don't quite jump out as much.

01:09:01.149 --> 01:09:04.240
So stylistically, we would
recommend this, selecting genres

01:09:04.240 --> 01:09:06.250
from favorites, semicolon.

01:09:06.250 --> 01:09:09.370
So here is where--

01:09:09.370 --> 01:09:11.560
oh.

01:09:11.560 --> 01:09:12.850
OK, that was not intended.

01:09:12.850 --> 01:09:15.220
I accidentally made
every show, including

01:09:15.220 --> 01:09:20.410
The Office about action, adventure,
drama, fantasy, thriller, and war.

01:09:20.410 --> 01:09:24.460
How did I do that accidentally?

01:09:24.460 --> 01:09:25.600
What did I do wrong?

01:09:25.600 --> 01:09:30.234
AUDIENCE: [INAUDIBLE]

01:09:30.234 --> 01:09:31.109
DAVID J. MALAN: Yeah.

01:09:31.109 --> 01:09:32.460
So beware, this is funny.

01:09:32.460 --> 01:09:34.359
I think I did say
beware around this time.

01:09:34.359 --> 01:09:37.810
So the SQL database took me--
literally, I updated favorites,

01:09:37.810 --> 01:09:41.010
setting genres equal to that,
semicolon, end of thought.

01:09:41.010 --> 01:09:43.500
I really wanted to say
where title equals,

01:09:43.500 --> 01:09:46.020
quote, unquote, "Game of Thrones."

01:09:46.020 --> 01:09:49.080
Unfortunately, there isn't an
undo command or time machine

01:09:49.080 --> 01:09:51.930
with a SQL database, so
the best we can do here

01:09:51.930 --> 01:09:56.250
is, let's actually get
rid of favorites.db.

01:09:56.250 --> 01:10:02.340
Let's run SQLite of favorites.db
again, which now will be recreated.

01:10:02.340 --> 01:10:04.440
Let me change myself into CSV mode.

01:10:04.440 --> 01:10:09.930
Let me import, into my
favorites table, the CSV file.

01:10:09.930 --> 01:10:14.250
And now, Friends is back,
for better or for worse,

01:10:14.250 --> 01:10:15.510
but so are all of our genres.

01:10:18.090 --> 01:10:21.330
If I now reload the file
and do select, star, from--

01:10:21.330 --> 01:10:21.840
sorry.

01:10:21.840 --> 01:10:25.838
Select genres from favorites,
that was the result I was getting.

01:10:25.838 --> 01:10:28.630
It's much messier, but that's
because some of these are quite long.

01:10:28.630 --> 01:10:30.297
But now we're back to the original data.

01:10:30.297 --> 01:10:32.700
Lesson here, be sure
to back up your work.

01:10:32.700 --> 01:10:33.240
All right.

01:10:33.240 --> 01:10:36.850
So what more can we
now do with this data?

01:10:36.850 --> 01:10:40.480
Well, I don't love the design of the
genres table for a couple of reasons.

01:10:40.480 --> 01:10:43.200
One, we didn't have
any sort of validation,

01:10:43.200 --> 01:10:45.210
but user input is going to be messy.

01:10:45.210 --> 01:10:47.790
There's just a lot of
redundancy in here.

01:10:47.790 --> 01:10:49.720
Let's go ahead and do this.

01:10:49.720 --> 01:10:51.760
Let me select all the
comedies you all typed in.

01:10:51.760 --> 01:10:57.960
So select title from
favorites, where genres equals,

01:10:57.960 --> 01:11:00.120
quote, unquote, "comedy."

01:11:00.120 --> 01:11:05.730
OK, so there's all of the shows
that are explicitly comedies.

01:11:05.730 --> 01:11:08.940
But I think there might
actually be others.

01:11:08.940 --> 01:11:11.350
Let me scroll back up here.

01:11:11.350 --> 01:11:12.180
Comedy, drama.

01:11:12.180 --> 01:11:14.040
What was a comedy and a drama?

01:11:14.040 --> 01:11:19.140
How about let's search for the-- oops,
let me copy paste comedy, comma, drama.

01:11:19.140 --> 01:11:23.700
OK, so The Office, in this case, was
considered comedy and drama, Billions,

01:11:23.700 --> 01:11:27.010
It's Always Sunny in Philadelphia,
and Gilmore Girls as well.

01:11:27.010 --> 01:11:31.450
But notice that I get many more
when I just search for comedy.

01:11:31.450 --> 01:11:36.450
So the catch here is that, because I
have all of these genres implemented

01:11:36.450 --> 01:11:39.030
the way Google did, as
a comma-separated list,

01:11:39.030 --> 01:11:43.590
it's actually really hard and messy
to get at any show, all of the shows

01:11:43.590 --> 01:11:46.830
that are somewhere described as comedy.

01:11:46.830 --> 01:11:49.680
Because if I search for quote,
unquote, "comedy," the only answers

01:11:49.680 --> 01:11:53.610
I'm going to get are this one, whatever
that show is, this one, whatever

01:11:53.610 --> 01:11:55.240
that show is, this one.

01:11:55.240 --> 01:11:56.670
But I'm not going to get this one.

01:11:56.670 --> 01:11:58.170
I'm not going to get this one.

01:11:58.170 --> 01:11:59.070
Why?

01:11:59.070 --> 01:12:02.790
If I'm searching for, where genres
equals, quote, unquote, "comedy,"

01:12:02.790 --> 01:12:04.350
why am I missing those other shows?

01:12:07.202 --> 01:12:07.910
Why am I missing?

01:12:07.910 --> 01:12:08.595
Yeah?

01:12:08.595 --> 01:12:11.020
AUDIENCE: [INAUDIBLE]

01:12:11.020 --> 01:12:12.020
DAVID J. MALAN: Exactly.

01:12:12.020 --> 01:12:14.390
It's not just a comedy,
it's a comedy and a drama,

01:12:14.390 --> 01:12:16.680
and a comedy or a news
show, and so forth.

01:12:16.680 --> 01:12:20.510
So I have to search for these commas,
so this gets messy quickly, right?

01:12:20.510 --> 01:12:22.560
Let me copy this so I can do this.

01:12:22.560 --> 01:12:26.150
Let me search for where
genres equals comedy.

01:12:26.150 --> 01:12:33.230
How about, or genres equals
comedy, drama, or genres

01:12:33.230 --> 01:12:36.650
equals this whole thing,
comedy, news, talk show?

01:12:36.650 --> 01:12:38.370
I'm going to get more and more results.

01:12:38.370 --> 01:12:39.900
But that's not going to scale well.

01:12:39.900 --> 01:12:42.810
What could I do instead
of enumerating with ors

01:12:42.810 --> 01:12:45.800
all of the different permutations
of genres, do you think?

01:12:45.800 --> 01:12:49.655
AUDIENCE: [INAUDIBLE]

01:12:49.655 --> 01:12:50.530
DAVID J. MALAN: Yeah.

01:12:50.530 --> 01:12:54.430
So I could use the keyword is,
similar in Python to the word in.

01:12:54.430 --> 01:12:56.980
I could use the like
keyword so that so long

01:12:56.980 --> 01:13:02.080
as the genres is like
comedy somewhere in there,

01:13:02.080 --> 01:13:05.900
that's going to give me all of them,
so long as the word comedy is in there.

01:13:05.900 --> 01:13:09.265
But let me go ahead and just
open the form from earlier.

01:13:12.250 --> 01:13:15.073
Let me see if I can open this
real quick before I toggle over.

01:13:15.073 --> 01:13:16.990
If we look back at the
form, recall that there

01:13:16.990 --> 01:13:22.630
were all of those radio buttons
asking for the specific genres

01:13:22.630 --> 01:13:24.530
into which something fell.

01:13:24.530 --> 01:13:29.710
And if I open this, let me full screen
here and now open the original form.

01:13:29.710 --> 01:13:32.920
You'll see all of the
genres here, none of which

01:13:32.920 --> 01:13:38.680
are that worrisome except for a
corner case is jumping out at me.

01:13:38.680 --> 01:13:43.330
Where might the like keyword
alone get me into trouble?

01:13:43.330 --> 01:13:44.380
It's not with comedy.

01:13:44.380 --> 01:13:45.880
I'm OK with comedy.

01:13:45.880 --> 01:13:47.680
AUDIENCE: Music and musical?

01:13:47.680 --> 01:13:51.190
DAVID J. MALAN: Yeah, music and musical
are deliberately on the list here.

01:13:51.190 --> 01:13:53.660
Because, one, they're separate genres.

01:13:53.660 --> 01:13:56.310
But if I just search for
something that's like music,

01:13:56.310 --> 01:13:58.810
I'm going to accidentally suck
in all of the musicals, which

01:13:58.810 --> 01:13:59.950
might not be what I intend.

01:13:59.950 --> 01:14:03.310
If music is a music video or
whatever, and musical is actually

01:14:03.310 --> 01:14:06.620
a different type of show, I
don't want to just do that.

01:14:06.620 --> 01:14:08.110
So it seems just very messy.

01:14:08.110 --> 01:14:11.682
I could probably hack something together
with-- maybe add some commas in there,

01:14:11.682 --> 01:14:12.640
or something like this.

01:14:12.640 --> 01:14:15.520
But this is just not a
good design for the data.

01:14:15.520 --> 01:14:17.890
Google has done it this
way because it's just

01:14:17.890 --> 01:14:22.000
simple to actually keep the user's
data all in a single column,

01:14:22.000 --> 01:14:24.520
and just as they did,
separate it by commas.

01:14:24.520 --> 01:14:28.720
But this is a real
messy way to use CSV is

01:14:28.720 --> 01:14:32.830
by putting comma-separated values
in your comma-separated values.

01:14:32.830 --> 01:14:35.350
Arguably, the folks at
Google probably just did this

01:14:35.350 --> 01:14:36.520
because it's just simpler.

01:14:36.520 --> 01:14:38.645
And they didn't want to
give people multiple sheets

01:14:38.645 --> 01:14:42.220
or complicate things using some other
weirder character than commas alone.

01:14:42.220 --> 01:14:44.450
But I bet there's a better
way for us to do this.

01:14:44.450 --> 01:14:45.860
And let me go ahead and do this.

01:14:45.860 --> 01:14:47.978
Let me go back into my code here.

01:14:47.978 --> 01:14:50.020
And in just a moment, I'm
going to grab a program

01:14:50.020 --> 01:14:54.250
that I wrote in advance that's going
to use Python to open up the CSV file,

01:14:54.250 --> 01:14:59.110
iterate over all of the rows, and load
the data into two tables this time,

01:14:59.110 --> 01:15:02.630
two tables, one called
shows, and one called genres,

01:15:02.630 --> 01:15:05.140
so as to actually separate
these two things out.

01:15:05.140 --> 01:15:07.730
Give me just a moment to grab the code.

01:15:07.730 --> 01:15:10.720
And when I run this, I'll
only have to run it once.

01:15:10.720 --> 01:15:13.090
Let me go ahead and
run Python in a moment,

01:15:13.090 --> 01:15:15.940
and I'll reveal the results in a sec.

01:15:15.940 --> 01:15:18.790
This is going to be version
8 of the code online.

01:15:18.790 --> 01:15:22.480
When I do this, let me go
ahead and open up this file.

01:15:22.480 --> 01:15:26.260
Give me a second to move
it into this directory.

01:15:26.260 --> 01:15:28.570
Version 8, OK.

01:15:28.570 --> 01:15:31.515
So here we have version 8 of
this that's available online

01:15:31.515 --> 01:15:32.890
that's going to do the following.

01:15:32.890 --> 01:15:34.848
And I'll gloss over some
of the details just so

01:15:34.848 --> 01:15:38.740
that we don't get stuck in the
weeds of some of this code.

01:15:38.740 --> 01:15:41.380
I'm going to be using, at
the top of this program,

01:15:41.380 --> 01:15:45.820
as we'll soon see, a CS50 library,
not for the sake of get_string,

01:15:45.820 --> 01:15:48.730
or get_int, or get_float, but
because there's some built-in SQL

01:15:48.730 --> 01:15:51.880
functionality that we didn't discuss
a couple of weeks back with the CS50

01:15:51.880 --> 01:15:52.750
library itself.

01:15:52.750 --> 01:15:56.680
But inside of the CS50 library we'll
see there is a special function called

01:15:56.680 --> 01:16:01.330
SQL that gives you the ability using
this weird URL-like looking thing,

01:16:01.330 --> 01:16:06.040
technically called a URI, that allows
me to open a file called favorites.db.

01:16:06.040 --> 01:16:08.380
And long story short, all
of the subsequent code

01:16:08.380 --> 01:16:12.580
is going to iterate over this
favorites.csv file that we downloaded.

01:16:12.580 --> 01:16:16.240
And it's going to import it
into the SQLite database,

01:16:16.240 --> 01:16:19.430
but it's going to use two
tables instead of just one.

01:16:19.430 --> 01:16:21.640
So give me just a moment
to run this, and then I'll

01:16:21.640 --> 01:16:23.600
reveal the actual results.

01:16:23.600 --> 01:16:26.270
This is going to be
run on favorites.csv.

01:16:31.510 --> 01:16:35.485
And taking a look here,
give me just a moment.

01:16:39.310 --> 01:16:43.350
Oh, give me a sec.

01:16:43.350 --> 01:16:45.270
Come on.

01:16:45.270 --> 01:16:46.080
Come on.

01:16:46.080 --> 01:16:49.020
This program should not
be taking this long.

01:16:49.020 --> 01:16:50.460
Sorry.

01:16:50.460 --> 01:16:51.840
Let's open this real fast.

01:16:54.750 --> 01:16:55.860
Whoops, not that file.

01:16:59.000 --> 01:16:59.500
OK.

01:16:59.500 --> 01:17:02.350
Let me just skim this code real
quick to see where we've gone wrong.

01:17:02.350 --> 01:17:04.840
[INAUDIBLE] reader.

01:17:04.840 --> 01:17:10.390
Reader, title, show ID
in certain two shows.

01:17:10.390 --> 01:17:15.370
[INAUDIBLE] genres split, DB execute.

01:17:15.370 --> 01:17:15.880
All right.

01:17:15.880 --> 01:17:17.410
This is me debugging in real time.

01:17:17.410 --> 01:17:22.843
All those times we encourage you to use
print, this is me actually using print.

01:17:22.843 --> 01:17:24.760
We'll see how quickly I
can recover from this.

01:17:24.760 --> 01:17:26.560
Python of favorites version 8.

01:17:29.100 --> 01:17:32.070
OK, so here's me debugging in real time.

01:17:32.070 --> 01:17:32.882
It's printing it.

01:17:32.882 --> 01:17:34.590
Oh, maybe I just didn't
wait long enough.

01:17:34.590 --> 01:17:35.940
OK, so here we go.

01:17:35.940 --> 01:17:39.900
What I'm doing is printing out
the dictionary that represents

01:17:39.900 --> 01:17:41.542
each row that you all typed in.

01:17:41.542 --> 01:17:43.000
And we're actually making progress.

01:17:43.000 --> 01:17:44.220
All right.

01:17:44.220 --> 01:17:46.510
I was too impatient and
didn't wait long enough.

01:17:46.510 --> 01:17:47.470
So in a moment--

01:17:47.470 --> 01:17:47.970
there we go.

01:17:47.970 --> 01:17:50.370
All right, so all we have
to do sometimes is wait.

01:17:50.370 --> 01:17:54.420
Let me go ahead now and open
this file using SQLite3.

01:17:54.420 --> 01:17:58.080
So in SQLite3 I now have a
different version of favorites.db.

01:17:58.080 --> 01:17:59.970
I named it number 8 for consistency.

01:17:59.970 --> 01:18:03.280
Once I've run the program I can
do .schema to look inside of it.

01:18:03.280 --> 01:18:07.200
And here's what the two tables in
this database are going to look like.

01:18:07.200 --> 01:18:10.920
I've created a table called shows, this
time to represent all of the TV shows

01:18:10.920 --> 01:18:14.410
that are favorites,
that has two columns.

01:18:14.410 --> 01:18:16.812
One is called ID, one is called Title.

01:18:16.812 --> 01:18:19.020
But now I'm going to start
taking out for a spin some

01:18:19.020 --> 01:18:20.640
of the other features of SQL.

01:18:20.640 --> 01:18:24.900
And besides there being text, it turns
out there's a data type called integer.

01:18:24.900 --> 01:18:26.910
Besides there being a
data type called text,

01:18:26.910 --> 01:18:30.150
there's also a special key
phrase that you can specify

01:18:30.150 --> 01:18:31.830
that the title can never be null.

01:18:31.830 --> 01:18:35.160
Think back to our use
of null in C. Think back

01:18:35.160 --> 01:18:37.260
to the keyword none in Python.

01:18:37.260 --> 01:18:40.830
This is a database constraint that
allows you to ensure that none of you

01:18:40.830 --> 01:18:42.630
can't have of favorite TV show.

01:18:42.630 --> 01:18:46.320
If you submit the form, you have
to have typed in a title for it

01:18:46.320 --> 01:18:48.250
to end up in our database here.

01:18:48.250 --> 01:18:50.940
And you'll notice one other new feature.

01:18:50.940 --> 01:18:53.460
It turns out, on this
table I'm defining what's

01:18:53.460 --> 01:18:56.970
called a primary key,
specifically to be the ID column.

01:18:56.970 --> 01:18:58.600
More on that in just a moment.

01:18:58.600 --> 01:19:02.700
Meanwhile, the second table my code
has created for me, as we'll soon see,

01:19:02.700 --> 01:19:08.220
gives me a column called
show ID, and then, a genre,

01:19:08.220 --> 01:19:11.140
the value of which is text
that can also not be null.

01:19:11.140 --> 01:19:12.760
And then more on this in a moment.

01:19:12.760 --> 01:19:15.840
This table has what we're
going to call a foreign key,

01:19:15.840 --> 01:19:20.550
specifically the show ID column
that references shows ID.

01:19:20.550 --> 01:19:23.250
So before we get into
the weeds of this, this

01:19:23.250 --> 01:19:27.120
is now a way of creating the
relation in relational database.

01:19:27.120 --> 01:19:31.140
If I have two tables now, not
just one, they can somehow

01:19:31.140 --> 01:19:34.720
be linked together by a common column.

01:19:34.720 --> 01:19:37.860
In other words, the shows column--

01:19:37.860 --> 01:19:41.340
shows table is going to give
me a table with two columns--

01:19:41.340 --> 01:19:43.080
an ID and a title.

01:19:43.080 --> 01:19:46.800
Every title you gave me, I'm
going to assign a unique value.

01:19:46.800 --> 01:19:51.600
The genre's table, meanwhile, is
going to associate individual genres

01:19:51.600 --> 01:19:54.480
singular with that same idea.

01:19:54.480 --> 01:20:01.150
And the result of this, to pop back to
the Terminal here, is, let's do this.

01:20:01.150 --> 01:20:05.460
Select star from shows
of this new database,

01:20:05.460 --> 01:20:09.030
and you'll see that I've given,
indeed, all of the shows you all typed

01:20:09.030 --> 01:20:10.663
in unique identifiers.

01:20:10.663 --> 01:20:13.830
I didn't filter out duplicates or do
anything beyond just forcing everything

01:20:13.830 --> 01:20:14.430
to uppercase.

01:20:14.430 --> 01:20:16.930
So there's going to be some
duplicates here because I didn't

01:20:16.930 --> 01:20:18.840
want to get rid of anyone's data.

01:20:18.840 --> 01:20:21.960
But you'll see that,
indeed, I've given everyone

01:20:21.960 --> 01:20:24.420
a unique identifier, from
the very first person who

01:20:24.420 --> 01:20:28.650
typed How I Met Your Mother, all
the way down to input number 158.

01:20:28.650 --> 01:20:35.760
Meanwhile, if I do select star from
genres, which is now a table, not just

01:20:35.760 --> 01:20:38.280
a column in the original
data, now you'll

01:20:38.280 --> 01:20:43.195
see a much better design for this data.

01:20:43.195 --> 01:20:44.320
Notice what I've done here.

01:20:44.320 --> 01:20:47.370
Let me go all the way to the top and
you'll see two columns, one of which

01:20:47.370 --> 01:20:51.508
is called show ID, the other
of which is called genre.

01:20:51.508 --> 01:20:53.550
And again, I wrote some
code to do this because I

01:20:53.550 --> 01:20:56.675
had to take Google's messy output where
everything was separated by commas.

01:20:56.675 --> 01:21:00.330
I had to tear away the commas and
then put each genre into this table

01:21:00.330 --> 01:21:01.770
by itself.

01:21:01.770 --> 01:21:04.200
Even though we haven't
introduced the syntax via which

01:21:04.200 --> 01:21:07.230
we can reconstitute the
data and reassociate

01:21:07.230 --> 01:21:10.530
your genres with your
titles, why, at a glance,

01:21:10.530 --> 01:21:13.140
might this be a better design now?

01:21:13.140 --> 01:21:16.740
Even though I've doubled the
number of tables from one to two,

01:21:16.740 --> 01:21:21.540
why is this probably on the
direction toward a better design?

01:21:21.540 --> 01:21:23.130
What might your instincts be?

01:21:26.430 --> 01:21:27.720
Why is this cleaner?

01:21:27.720 --> 01:21:31.230
Again, first time with SQL,
why is it better, perhaps,

01:21:31.230 --> 01:21:33.780
that we've done this
with our genre's table?

01:21:33.780 --> 01:21:35.650
Can I come to you?

01:21:35.650 --> 01:21:36.700
Why might this be better?

01:21:36.700 --> 01:21:37.200
Yeah.

01:21:37.200 --> 01:21:41.360
Oh, just because we had the
conversation before about the commas.

01:21:41.360 --> 01:21:42.470
AUDIENCE: [INAUDIBLE]

01:21:42.470 --> 01:21:42.890
DAVID J. MALAN: Exactly.

01:21:42.890 --> 01:21:43.848
It's as simple as that.

01:21:43.848 --> 01:21:49.130
We've cleaned up the data by giving
every genre, every word in the genres

01:21:49.130 --> 01:21:51.260
column in the original
Google Spreadsheet

01:21:51.260 --> 01:21:54.030
its own cell in this table, if you will.

01:21:54.030 --> 01:21:56.930
And now notice show ID
might appear multiple times.

01:21:56.930 --> 01:22:01.530
Whoever typed in How I Met Your Mother,
they only associated one genre with it.

01:22:01.530 --> 01:22:04.460
And so we see that
show ID 1 is a comedy.

01:22:04.460 --> 01:22:05.645
But whoever typed in--

01:22:05.645 --> 01:22:07.520
I forget the name of
the second show offhand.

01:22:07.520 --> 01:22:11.960
But that person, whoever was
assigned show ID 2 checked off

01:22:11.960 --> 01:22:13.850
a whole bunch of the genre's boxes.

01:22:13.850 --> 01:22:17.540
That happened again with show ID 3, 4.

01:22:17.540 --> 01:22:20.940
Persons 5, 6, 7 only checked one box.

01:22:20.940 --> 01:22:25.040
And so you can see now that we've
associated the data with what we

01:22:25.040 --> 01:22:27.690
might call a one-to-many relationship.

01:22:27.690 --> 01:22:33.050
A one-to-many relationship, whereby
for every one show in the show's table,

01:22:33.050 --> 01:22:36.800
it can now have many genres
associated with it, each of which

01:22:36.800 --> 01:22:41.250
is represented by a separate row here.

01:22:41.250 --> 01:22:44.960
So again, if I go ahead and
select star from shows--

01:22:44.960 --> 01:22:48.740
let's limit it to the first 10 just
to focus on a subset of the data.

01:22:48.740 --> 01:22:51.770
How I Met Your Mother, The Sopranos
was the second input there.

01:22:51.770 --> 01:22:55.100
It would seem that now that I've
created the data in this way,

01:22:55.100 --> 01:22:59.925
I could ideally somehow search the
data, but a little more correctly.

01:22:59.925 --> 01:23:01.550
I don't have to worry about the commas.

01:23:01.550 --> 01:23:03.925
I don't have to worry about
the hackish approach of music

01:23:03.925 --> 01:23:05.630
being a substring of musical.

01:23:05.630 --> 01:23:08.310
But how can I actually
get back at this data?

01:23:08.310 --> 01:23:09.750
Well, let's go ahead and do this.

01:23:09.750 --> 01:23:13.700
Suppose I did want to get back
maybe all of the comedies.

01:23:13.700 --> 01:23:17.120
All of the comedies, no matter whether
the person checked just the comedy

01:23:17.120 --> 01:23:19.550
box or multiple boxes instead.

01:23:19.550 --> 01:23:23.330
How now, given that I
have two tables, could I

01:23:23.330 --> 01:23:27.740
go about selecting only
the titles of comedies?

01:23:27.740 --> 01:23:29.780
I've actually made the
problem a little harder,

01:23:29.780 --> 01:23:32.720
but again, SQL is going to
give me a solution for this.

01:23:32.720 --> 01:23:35.030
The problem is that if I
want to search for comedies,

01:23:35.030 --> 01:23:37.820
I have to check the genres table first.

01:23:37.820 --> 01:23:39.650
And then what's that going to give me?

01:23:39.650 --> 01:23:43.580
If I search the genres
table for comedies,

01:23:43.580 --> 01:23:46.610
what's that going to
give me back potentially?

01:23:46.610 --> 01:23:47.360
Yeah?

01:23:47.360 --> 01:23:48.110
AUDIENCE: Show ID.

01:23:48.110 --> 01:23:49.110
DAVID J. MALAN: Maybe show ID.

01:23:49.110 --> 01:23:49.902
So let me try that.

01:23:49.902 --> 01:23:55.820
Let me do select show ID from genres,
where the genre in a given row

01:23:55.820 --> 01:23:57.320
equals quote, unquote, "comedy."

01:23:57.320 --> 01:24:00.470
No commas, no like, no percent signs.

01:24:00.470 --> 01:24:04.940
Because literally, that column now is
singular words, like comedy, or drama,

01:24:04.940 --> 01:24:05.720
or the like.

01:24:05.720 --> 01:24:07.670
Let me go ahead and hit Enter here.

01:24:07.670 --> 01:24:10.700
OK, so I got back a whole
bunch of ID numbers.

01:24:10.700 --> 01:24:13.100
Now this could very
quickly get annoying.

01:24:13.100 --> 01:24:17.930
It looks like show ID 1, 2, 4, 5, 6,
7, 9, and so forth, are all comedies.

01:24:17.930 --> 01:24:23.690
So I could do something really
crazy like, select title from shows,

01:24:23.690 --> 01:24:29.390
where ID equals 1, or ID equals 2.

01:24:29.390 --> 01:24:32.360
This is not going to
scale very well, but this

01:24:32.360 --> 01:24:34.610
is why SQL is especially powerful.

01:24:34.610 --> 01:24:39.420
You can actually compose one
SQL question from multiple ones.

01:24:39.420 --> 01:24:40.440
So let's do this.

01:24:40.440 --> 01:24:44.480
Why don't I select the title
where the ID of the show

01:24:44.480 --> 01:24:48.050
is in the following list of IDs?

01:24:48.050 --> 01:24:54.710
Select show ID from genres, where the
specific genre is, quote, unquote,

01:24:54.710 --> 01:24:55.640
"comedy."

01:24:55.640 --> 01:24:58.550
So I've got two SQL queries.

01:24:58.550 --> 01:25:01.850
One is deliberately nested
inside of parentheses.

01:25:01.850 --> 01:25:04.760
That's going to give me back
that whole list of show IDs.

01:25:04.760 --> 01:25:07.070
But that's exactly what
I want to then look up

01:25:07.070 --> 01:25:11.000
the titles for by selecting title
from shows where the ID of the show

01:25:11.000 --> 01:25:13.380
is in that big, tall list.

01:25:13.380 --> 01:25:18.710
And so now if I hit Enter,
I get back only those shows

01:25:18.710 --> 01:25:22.520
that were somehow flagged as
comedy, whether you in the audience

01:25:22.520 --> 01:25:26.570
checked one box for comedy,
two boxes, or all of the boxes.

01:25:26.570 --> 01:25:28.928
Somehow we teased out
comedy, again, just

01:25:28.928 --> 01:25:31.220
by using that Python script,
which loaded this data not

01:25:31.220 --> 01:25:33.800
into one big table, but instead, two.

01:25:33.800 --> 01:25:36.420
And if we want to clean this
up, let's do a couple of things.

01:25:36.420 --> 01:25:40.580
Let's, outside of the
parentheses, do order by title.

01:25:40.580 --> 01:25:43.550
This is a way of sorting
the data in SQL very easily.

01:25:43.550 --> 01:25:48.140
Now we have a whole list of the
same titles that are now sorted.

01:25:48.140 --> 01:25:52.640
And what was the keyword with which
I could filter out duplicates?

01:25:52.640 --> 01:25:53.450
Yeah, distinct.

01:25:53.450 --> 01:25:54.260
So let's try this.

01:25:54.260 --> 01:25:59.600
Same query, but let's select only the
distinct titles from that whole query.

01:25:59.600 --> 01:26:01.770
And notice, I've very
deliberately done it this way.

01:26:01.770 --> 01:26:03.520
And to this day, any
time I'm using SQL, I

01:26:03.520 --> 01:26:06.187
don't just start at the beginning
and type out my whole thought,

01:26:06.187 --> 01:26:07.940
and just get it right on the first try.

01:26:07.940 --> 01:26:10.610
I very commonly start
with the subquery, if you

01:26:10.610 --> 01:26:12.800
will, the thing in
parentheses, just to get myself

01:26:12.800 --> 01:26:14.600
one step toward what I care about.

01:26:14.600 --> 01:26:15.332
Then I add to it.

01:26:15.332 --> 01:26:16.040
Then I add to it.

01:26:16.040 --> 01:26:18.590
Then I add to it, just like
we've encouraged in Python and C,

01:26:18.590 --> 01:26:22.370
taking baby steps in order to get to
the answer you actually care about,

01:26:22.370 --> 01:26:23.480
like this one now.

01:26:23.480 --> 01:26:26.060
And other than this
mistake, which we didn't

01:26:26.060 --> 01:26:30.350
fix because I re-imported the data after
accidentally changing everyone's genre,

01:26:30.350 --> 01:26:35.190
we now have an alphabetized
list of all of the same data.

01:26:35.190 --> 01:26:40.670
But now it's better designed, because we
have it split across these two tables.

01:26:40.670 --> 01:26:41.630
Oh, thank you.

01:26:41.630 --> 01:26:44.720
OK, just thanks.

01:26:44.720 --> 01:26:48.950
What questions do we have, if any here?

01:26:48.950 --> 01:26:50.480
Questions on this approach?

01:26:53.040 --> 01:26:55.316
Yeah?

01:26:55.316 --> 01:27:00.280
AUDIENCE: [INAUDIBLE]

01:27:00.280 --> 01:27:02.280
DAVID J. MALAN: Oh, now
that we have a database,

01:27:02.280 --> 01:27:04.640
how do we transfer it to a CSV?

01:27:04.640 --> 01:27:06.060
There are ways to do that.

01:27:06.060 --> 01:27:08.570
And in fact, there's a
command within SQLite

01:27:08.570 --> 01:27:11.590
that allows you to export
your data back to a CSV file.

01:27:11.590 --> 01:27:13.340
If you want to email
it to someone and you

01:27:13.340 --> 01:27:16.430
want them to be able to open it in
Excel, or Google Spreadsheets, or Apple

01:27:16.430 --> 01:27:19.010
Numbers, or the like, you can
go in the other direction.

01:27:19.010 --> 01:27:21.890
Generally though, once
you're in the world of SQL

01:27:21.890 --> 01:27:24.620
you're probably storing
your data there long term.

01:27:24.620 --> 01:27:27.590
And you're probably updating it,
maybe deleting it, adding to it,

01:27:27.590 --> 01:27:28.273
and so forth.

01:27:28.273 --> 01:27:30.440
For instance, the one command
I did not show earlier

01:27:30.440 --> 01:27:33.320
is, suppose someone forgot a show.

01:27:33.320 --> 01:27:35.640
Let's see, did I see this in the output?

01:27:35.640 --> 01:27:37.580
All right, so Curb Your Enthusiasm.

01:27:37.580 --> 01:27:38.540
Saw that last night.

01:27:38.540 --> 01:27:39.530
It was just, yeah.

01:27:39.530 --> 01:27:41.160
Did anyone see it last night?

01:27:41.160 --> 01:27:41.660
No?

01:27:41.660 --> 01:27:44.743
All right, well, just the one person
that checked that box, so you and me.

01:27:44.743 --> 01:27:46.740
What's another show that
didn't make the list?

01:27:46.740 --> 01:27:48.217
How about Seinfeld?

01:27:48.217 --> 01:27:49.550
It's now on Netflix, apparently.

01:27:49.550 --> 01:27:55.220
So insert into shows.

01:27:55.220 --> 01:27:56.390
What do we want to insert?

01:27:56.390 --> 01:27:59.910
Well, we want to insert
maybe an ID and a title.

01:27:59.910 --> 01:28:02.150
But I don't actually
care what the ID is,

01:28:02.150 --> 01:28:03.650
so I'm just going to insert a title.

01:28:03.650 --> 01:28:05.780
And the value I'm going
to give to that title

01:28:05.780 --> 01:28:08.900
is going to be, quote,
unquote, "Seinfeld."

01:28:08.900 --> 01:28:11.810
And then, let me go
ahead and hit semicolon.

01:28:11.810 --> 01:28:14.360
Nothing seems to happen, but
let me rerun the big query

01:28:14.360 --> 01:28:16.250
from before looking for comedies.

01:28:16.250 --> 01:28:19.850
And unfortunately, Seinfeld has
not yet been flagged as a comedy,

01:28:19.850 --> 01:28:21.710
so let's get this right, too.

01:28:21.710 --> 01:28:25.370
What intuitively I'm going to
have to do to associate, now,

01:28:25.370 --> 01:28:27.980
Seinfeld with my comedies?

01:28:27.980 --> 01:28:30.140
I just inserted into the show's table.

01:28:30.140 --> 01:28:33.890
What more needs to happen before
we can flag Seinfeld as a comedy?

01:28:33.890 --> 01:28:35.369
AUDIENCE: [INAUDIBLE]

01:28:35.369 --> 01:28:36.452
DAVID J. MALAN: Say again?

01:28:36.452 --> 01:28:37.635
AUDIENCE: [INAUDIBLE]

01:28:37.635 --> 01:28:38.510
DAVID J. MALAN: Yeah.

01:28:38.510 --> 01:28:42.950
So I need to insert into the
genres table two things now,

01:28:42.950 --> 01:28:48.180
a show ID, like this, and
then, the name of the genre,

01:28:48.180 --> 01:28:49.340
which presumably is comedy.

01:28:49.340 --> 01:28:50.810
What values do I want to insert?

01:28:50.810 --> 01:28:52.795
Well, the show ID, I better grab that.

01:28:52.795 --> 01:28:54.170
Oh, I don't even know what it is.

01:28:54.170 --> 01:28:55.770
I'm going to have to
figure out what that is.

01:28:55.770 --> 01:28:57.660
So I could do this in a couple of ways.

01:28:57.660 --> 01:28:59.150
Let me do a one-time thing.

01:28:59.150 --> 01:29:02.924
Select star from shows,
where title equals,

01:29:02.924 --> 01:29:06.770
quote, unquote,
"Seinfeld" semicolon 159.

01:29:06.770 --> 01:29:11.780
So now I could do, insert
into genres a show ID

01:29:11.780 --> 01:29:20.510
and a genre name, the values 159, and,
quote, unquote, "comedy" semicolon,

01:29:20.510 --> 01:29:21.260
Enter.

01:29:21.260 --> 01:29:24.710
And now, if I scroll back in my history
and execute that really big query

01:29:24.710 --> 01:29:26.690
again, looking for
all distinct comedies,

01:29:26.690 --> 01:29:29.100
now Seinfeld has made the list.

01:29:29.100 --> 01:29:32.637
But I did this manually so I
didn't actually capitalize it.

01:29:32.637 --> 01:29:33.470
Let's clean that up.

01:29:33.470 --> 01:29:35.420
Let's do update.

01:29:35.420 --> 01:29:37.700
Let's do update my shows.

01:29:37.700 --> 01:29:43.440
Set title equals to Seinfeld semicolon.

01:29:43.440 --> 01:29:43.940
No?

01:29:43.940 --> 01:29:47.780
OK, thank you, where title equals,
quote, unquote, "Seinfeld."

01:29:47.780 --> 01:29:49.520
Let's not make that mistake again.

01:29:49.520 --> 01:29:50.330
Enter.

01:29:50.330 --> 01:29:53.270
And now, if I execute that really
big query, now Seinfeld is,

01:29:53.270 --> 01:29:56.480
indeed, considered a comedy.

01:29:56.480 --> 01:29:57.960
So where are we going with this?

01:29:57.960 --> 01:30:00.240
Well, thus far we've been doing
all of this pretty manually.

01:30:00.240 --> 01:30:02.780
And this is absolutely what an
analyst, a data scientist type person

01:30:02.780 --> 01:30:05.197
might do if just manipulating
a pretty large data set just

01:30:05.197 --> 01:30:08.010
to get at interesting answers
that might be across one,

01:30:08.010 --> 01:30:09.440
two, or even many more tables.

01:30:09.440 --> 01:30:12.440
Eventually, in a few weeks, we're
going to start to automate all of this

01:30:12.440 --> 01:30:16.610
by writing code in Python
that generates SQL to do this.

01:30:16.610 --> 01:30:19.280
If you go to most any website
on the internet today,

01:30:19.280 --> 01:30:23.420
and you, for instance, log in, odds are
you're typing a username and password,

01:30:23.420 --> 01:30:24.260
clicking Submit.

01:30:24.260 --> 01:30:25.430
What's then happening?

01:30:25.430 --> 01:30:28.010
Well, the website might not
be implemented in Python

01:30:28.010 --> 01:30:31.610
but it's probably implemented in some
language, Python, JavaScript, Java,

01:30:31.610 --> 01:30:33.110
Ruby, something else.

01:30:33.110 --> 01:30:38.330
And that language is probably using
something like a relational database

01:30:38.330 --> 01:30:41.930
to use SQL to get your
username, get your password,

01:30:41.930 --> 01:30:43.997
and compare the two against
what you've typed in.

01:30:43.997 --> 01:30:46.580
And actually, it's hopefully not
getting your actual password,

01:30:46.580 --> 01:30:48.163
but something called the hash thereof.

01:30:48.163 --> 01:30:50.360
But there's probably a
database involved doing that.

01:30:50.360 --> 01:30:53.600
When you buy something on
Amazon.com and you click Check Out,

01:30:53.600 --> 01:30:56.720
odds are there's some
code on Amazon's server

01:30:56.720 --> 01:30:59.870
that's looking at what it is
you added to your shopping cart,

01:30:59.870 --> 01:31:03.690
and then maybe using a for loop of some
sort, in Python or another language.

01:31:03.690 --> 01:31:08.180
It's doing a whole bunch of SQL
inserts to store in their database what

01:31:08.180 --> 01:31:09.290
it is you bought.

01:31:09.290 --> 01:31:12.260
There's other types of databases,
too, but SQL databases,

01:31:12.260 --> 01:31:14.647
or relational databases
are quite popular.

01:31:14.647 --> 01:31:16.730
So let's go ahead and write
one other program here

01:31:16.730 --> 01:31:21.320
in Python that now merges these
two languages together, whereby

01:31:21.320 --> 01:31:24.740
I'm going to use SQL
inside of a Python program

01:31:24.740 --> 01:31:28.430
so I can implement my logic
of my program in Python,

01:31:28.430 --> 01:31:30.050
step-by-step, line-by-line.

01:31:30.050 --> 01:31:34.260
But when I want to get at some data I
can actually talk to a SQL database.

01:31:34.260 --> 01:31:37.190
So let me go ahead
and open favorites.py.

01:31:39.740 --> 01:31:45.200
And let me go ahead and throw away
some of what we did earlier and really

01:31:45.200 --> 01:31:47.690
just now add a SQL to the mix.

01:31:47.690 --> 01:31:51.440
From the CS50 library, let's
import the SQL function.

01:31:51.440 --> 01:31:54.260
This will be useful to use
because most third-party libraries

01:31:54.260 --> 01:31:57.390
that deal with SQL and Python are
more complicated than they need to be.

01:31:57.390 --> 01:32:00.390
So I think you'll find
this library easier to use.

01:32:00.390 --> 01:32:01.780
Let's then do the following.

01:32:01.780 --> 01:32:03.900
Create a variable
called db for database.

01:32:03.900 --> 01:32:05.400
But I could call it anything I want.

01:32:05.400 --> 01:32:09.090
Let's use that you URI, which is
a fancy way of saying something

01:32:09.090 --> 01:32:17.790
that looks like a URL, but that actually
opens up a database on disk, that is,

01:32:17.790 --> 01:32:19.110
in the current folder.

01:32:19.110 --> 01:32:22.620
Let's now ask the user for a title by
prompting them for a, quote, unquote,

01:32:22.620 --> 01:32:23.820
"title" like this.

01:32:23.820 --> 01:32:27.960
And let's strip off any whitespace
just so that the data is not messy.

01:32:27.960 --> 01:32:30.730
And then, let's go ahead and do this.

01:32:30.730 --> 01:32:32.220
And this is the new logic.

01:32:32.220 --> 01:32:36.570
I'm going to go ahead now and write
a line of code that uses Python

01:32:36.570 --> 01:32:39.840
to talk to the original favorites.db.

01:32:39.840 --> 01:32:43.980
So again, I'm not using the two-table
database, which is in favorites8.db.

01:32:43.980 --> 01:32:47.320
I'm using the original that we
imported from your own data,

01:32:47.320 --> 01:32:49.090
and I'm going to do the following.

01:32:49.090 --> 01:32:54.150
I'm going to use db.execute to execute
a SQL command inside of Python.

01:32:54.150 --> 01:33:02.700
I'm going to select the count
of shows from the favorites

01:33:02.700 --> 01:33:09.960
table, where the title the user
typed in is like this question mark.

01:33:09.960 --> 01:33:11.910
And why I'm doing that is as follows.

01:33:11.910 --> 01:33:15.600
Just like in C, when we had
percent S, in SQL for now,

01:33:15.600 --> 01:33:17.490
the analogue is going
to be a question mark.

01:33:17.490 --> 01:33:18.990
So same idea, different syntax.

01:33:18.990 --> 01:33:21.180
Instead of percent S,
it's just a question mark.

01:33:21.180 --> 01:33:26.280
And using a comma outside of this
first string, using CS50's execute

01:33:26.280 --> 01:33:29.340
function I can pass in
a SQL string, a command,

01:33:29.340 --> 01:33:33.835
then any arguments I want to plug
into the question marks therein.

01:33:33.835 --> 01:33:36.210
So the goal at hand is to
actually write a program that's

01:33:36.210 --> 01:33:42.420
going to search favorites.csv, a.k.a.,
favorites.db for the total number

01:33:42.420 --> 01:33:45.300
of people that liked a particular show.

01:33:45.300 --> 01:33:49.050
So this is going to select the count
of people from the favorites table

01:33:49.050 --> 01:33:53.400
where the title they typed in is like
whatever the user has just now typed

01:33:53.400 --> 01:33:53.970
in.

01:33:53.970 --> 01:33:56.730
This db execute function returns a list.

01:33:56.730 --> 01:33:57.813
It returns a list of rows.

01:33:57.813 --> 01:34:00.188
And you would only know that
by my telling you or reading

01:34:00.188 --> 01:34:01.290
the documentation.

01:34:01.290 --> 01:34:04.140
And therefore, if I want to
get back to the total count,

01:34:04.140 --> 01:34:08.940
I'm going to go ahead and grab
the first row from those rows.

01:34:08.940 --> 01:34:11.220
Because it's only going
to give me back the count.

01:34:11.220 --> 01:34:16.440
And then I'm going to go ahead and
print out that row's first value.

01:34:16.440 --> 01:34:17.940
But it's going to be a little weird.

01:34:17.940 --> 01:34:21.420
Technically the column is going to be
called "count" star, quote, unquote,

01:34:21.420 --> 01:34:22.420
which is a little weird.

01:34:22.420 --> 01:34:24.180
Let me add one more feature to the mix.

01:34:24.180 --> 01:34:26.280
You can actually give
nicknames to columns

01:34:26.280 --> 01:34:30.090
that are coming back, especially if they
are the result of functions like this.

01:34:30.090 --> 01:34:34.170
I can just call that column
counter, in all lowercase.

01:34:34.170 --> 01:34:41.620
That means I can now say get back the
counter key inside of this dictionary.

01:34:41.620 --> 01:34:43.360
So just to recap, what have we done?

01:34:43.360 --> 01:34:46.010
We've imported the CS50
library SQL function.

01:34:46.010 --> 01:34:49.080
We've, with this line of
code, opened the favorites.db

01:34:49.080 --> 01:34:54.810
file that you and I created earlier
by importing your CSV into SQLite.

01:34:54.810 --> 01:34:58.140
I'm now just asking the user for
a title they want to search for.

01:34:58.140 --> 01:35:02.070
I'm now executing this SQL
query on that database,

01:35:02.070 --> 01:35:05.250
plugging in whatever the
human typed in as their title

01:35:05.250 --> 01:35:07.170
in order to get back a total count.

01:35:07.170 --> 01:35:10.950
And I'm giving the count a
nickname, an alias of counter,

01:35:10.950 --> 01:35:13.860
just so it's more self-explanatory.

01:35:13.860 --> 01:35:18.330
This function, db execute, no matter
what, always returns a list of rows,

01:35:18.330 --> 01:35:20.470
even if there's only
one row inside of it.

01:35:20.470 --> 01:35:23.310
So this line of code just gives
me the first and only row.

01:35:23.310 --> 01:35:27.960
And then, this goes inside of that row,
which it turns out is a dictionary,

01:35:27.960 --> 01:35:33.730
and gives me the key counter
and the value it corresponds to.

01:35:33.730 --> 01:35:35.340
So what, to be clear, is this doing?

01:35:35.340 --> 01:35:38.220
Let's go ahead and run this manually
in my Terminal window first.

01:35:38.220 --> 01:35:41.970
Let me run SQLite3 on favorites--

01:35:41.970 --> 01:35:43.380
Well, let's do this.

01:35:43.380 --> 01:35:47.410
On favorites.db, let me
import the data again.

01:35:47.410 --> 01:35:54.910
So mode csv.import in from
favorites.csv into a favorites table.

01:35:54.910 --> 01:35:57.330
So I've just recreated the
same data set that you all

01:35:57.330 --> 01:35:59.782
gave me earlier in favorites.db.

01:35:59.782 --> 01:36:02.490
If I were to do this manually,
let's search for The Office again.

01:36:02.490 --> 01:36:09.210
Select, count star from favorites,
where title like, and let's

01:36:09.210 --> 01:36:12.270
just manually type it
in for now, The Office.

01:36:12.270 --> 01:36:15.330
We'll search for the one
with the word The, semicolon.

01:36:15.330 --> 01:36:16.560
I get back 12.

01:36:16.560 --> 01:36:18.780
But technically, notice what I get back.

01:36:18.780 --> 01:36:25.080
I technically get back a miniature
table containing one column and one row.

01:36:25.080 --> 01:36:27.090
What if I want to rename that column?

01:36:27.090 --> 01:36:28.840
That's where the as keyword comes in.

01:36:28.840 --> 01:36:31.080
So select count star as counter.

01:36:31.080 --> 01:36:32.910
Notice what happens, Enter.

01:36:32.910 --> 01:36:35.883
I just get back-- same
simple table, but I've

01:36:35.883 --> 01:36:38.550
renamed the column to be counter
just because it's a little more

01:36:38.550 --> 01:36:40.410
self-explanatory as to what it is.

01:36:40.410 --> 01:36:43.200
So what am I doing
with this line of code?

01:36:43.200 --> 01:36:47.430
This line of code is returning to
me that miniature temporary table

01:36:47.430 --> 01:36:50.820
in the form of a list of dictionaries.

01:36:50.820 --> 01:36:55.620
The list contains one
row, as we'll see, and it

01:36:55.620 --> 01:37:00.800
contains one column, as we'll
see, the key for which is counter.

01:37:00.800 --> 01:37:02.540
So let's now run the code itself.

01:37:02.540 --> 01:37:07.280
I'm going to get out of SQLite3 and I'm
going to run Python of favorites.py.

01:37:07.280 --> 01:37:08.120
Enter.

01:37:08.120 --> 01:37:09.540
I'm being prompted for a title.

01:37:09.540 --> 01:37:13.790
I'm going to type in The Office and
cross my fingers, and there's that 12.

01:37:13.790 --> 01:37:14.810
Why is it 12?

01:37:14.810 --> 01:37:17.450
Well, there's a typo again
because I re-imported the CSV.

01:37:17.450 --> 01:37:21.270
I had deleted two of the Thes, so
we're back at the original data set.

01:37:21.270 --> 01:37:25.760
So there's 12 total that have,
quote, unquote, "The Office"

01:37:25.760 --> 01:37:27.780
in the title like that.

01:37:27.780 --> 01:37:28.860
So what have we done?

01:37:28.860 --> 01:37:31.820
We've combined some
Python with some SQL,

01:37:31.820 --> 01:37:34.790
but we've relegated all of the
complexity of searching for something,

01:37:34.790 --> 01:37:36.800
the selecting of something,
gotten rid of all

01:37:36.800 --> 01:37:39.560
of the with keyword, the
open keyword, the for loop,

01:37:39.560 --> 01:37:41.600
the reader the DictReader,
and all of that.

01:37:41.600 --> 01:37:46.460
And it's just one line of SQL now,
using the best of both worlds.

01:37:46.460 --> 01:37:51.720
All right, any questions on what we've
just done here or how any of this

01:37:51.720 --> 01:37:52.220
works?

01:37:54.870 --> 01:37:56.210
Any questions here?

01:37:56.210 --> 01:37:57.310
Yeah?

01:37:57.310 --> 01:38:01.070
AUDIENCE: [INAUDIBLE]

01:38:01.070 --> 01:38:04.413
DAVID J. MALAN: When does this
function return more than one row?

01:38:04.413 --> 01:38:05.330
Was that the question?

01:38:05.330 --> 01:38:05.760
AUDIENCE: Yeah.

01:38:05.760 --> 01:38:06.030
DAVID J. MALAN: Yeah.

01:38:06.030 --> 01:38:08.240
So let's do that by changing
the problem at hand.

01:38:08.240 --> 01:38:10.970
This program was designed just
to select the total count.

01:38:10.970 --> 01:38:16.070
Let's go ahead and
select, for instance, all

01:38:16.070 --> 01:38:20.705
of the ways you all typed in The Office
by selecting the title this time.

01:38:24.110 --> 01:38:28.470
If I do this in SQLite3, let
me go ahead and do this again

01:38:28.470 --> 01:38:29.970
after increasing my Terminal window.

01:38:29.970 --> 01:38:30.920
Let's do it manually.

01:38:30.920 --> 01:38:35.570
Select title from favorites,
where the title is like,

01:38:35.570 --> 01:38:38.835
quote, unquote, "The Office," semicolon.

01:38:38.835 --> 01:38:41.960
I get back all of these different rows,
and we didn't even notice this one.

01:38:41.960 --> 01:38:43.910
There's actually another
little typo in there

01:38:43.910 --> 01:38:47.630
with some capitalization of the
E, and the C, and the E. That

01:38:47.630 --> 01:38:50.840
would be an example of a query
that gives me back therefore

01:38:50.840 --> 01:38:52.080
for multiple rows.

01:38:52.080 --> 01:38:53.990
So let's now change my Python program.

01:38:53.990 --> 01:38:59.540
If I now, in my Python program, do
this, I get back a whole bunch of rows

01:38:59.540 --> 01:39:01.110
containing all of those titles.

01:39:01.110 --> 01:39:06.350
I can now do, for row in rows, I can
print out the current row's title,

01:39:06.350 --> 01:39:09.390
and now manipulate all
of those things together.

01:39:09.390 --> 01:39:10.760
Let me keep both on the screen.

01:39:10.760 --> 01:39:12.320
Let me run Python of favorites.py.

01:39:12.320 --> 01:39:16.320
And that for loop now should
iterate, what, 10 or more times,

01:39:16.320 --> 01:39:18.020
once for each of those titles.

01:39:18.020 --> 01:39:22.010
And indeed, if I type in
The Office again, Enter.

01:39:22.010 --> 01:39:24.890
Whoops.

01:39:24.890 --> 01:39:25.670
Row title.

01:39:25.670 --> 01:39:26.750
What did I do wrong?

01:39:26.750 --> 01:39:30.052
Oh, I should not be renaming
title to counter this time.

01:39:30.052 --> 01:39:31.760
So that's just a dumb
mistake on my part.

01:39:31.760 --> 01:39:33.410
Let me rerun it again.

01:39:33.410 --> 01:39:36.230
And now I should see after
typing in The Office,

01:39:36.230 --> 01:39:38.420
Enter, a whole bunch of The Offices.

01:39:38.420 --> 01:39:40.520
And because I'm using
like, even the missed

01:39:40.520 --> 01:39:43.550
capitalizations are coming through,
because like is case insensitive.

01:39:43.550 --> 01:39:45.890
It doesn't matter if it's
uppercase or lowercase.

01:39:45.890 --> 01:39:50.300
Whereas had I used the equal sign
I would get back only the same ones

01:39:50.300 --> 01:39:51.935
capitalized correctly.

01:39:51.935 --> 01:39:55.250
All right, any questions on this next?

01:39:55.250 --> 01:39:59.700
All right, so let's transition
to a larger, juicier data

01:39:59.700 --> 01:40:01.610
set, and consider some
of the issues that

01:40:01.610 --> 01:40:05.870
arise when actually now using SQL and
skating toward a world in which we're

01:40:05.870 --> 01:40:08.970
using SQL for mobile apps, web
apps, and generally speaking,

01:40:08.970 --> 01:40:10.800
very large data sets.

01:40:10.800 --> 01:40:14.090
So let's start with a larger
data set just like that.

01:40:14.090 --> 01:40:19.800
Give me just a moment to switch screens
over to what we have for you today,

01:40:19.800 --> 01:40:22.970
which is an actual relational
database that we've created out

01:40:22.970 --> 01:40:26.540
of a real-world data set from IMDb.

01:40:26.540 --> 01:40:29.210
So InternetMovieDatabase.com
is a website

01:40:29.210 --> 01:40:31.790
where you can search for TV
shows, and movies, and actors,

01:40:31.790 --> 01:40:34.880
and so forth, all using their
database behind the scenes.

01:40:34.880 --> 01:40:39.530
IMDb wonderfully makes their data
set available as not CSV files,

01:40:39.530 --> 01:40:42.960
but TSV files, tab-separated values.

01:40:42.960 --> 01:40:46.460
And so what we did is, before class
we downloaded those TSV files.

01:40:46.460 --> 01:40:50.300
We wrote a Python program
similar to my favorites8.py file

01:40:50.300 --> 01:40:53.750
earlier that read in
all of those TSV files,

01:40:53.750 --> 01:40:58.820
created some SQL tables
in an IMDb database

01:40:58.820 --> 01:41:03.270
for you in SQLite that has multiple
tables and multiple columns.

01:41:03.270 --> 01:41:07.190
So let's go and wrap our minds around
what's actually in this data set.

01:41:07.190 --> 01:41:10.940
Let me go back to VS Code
here, and in just a moment,

01:41:10.940 --> 01:41:15.260
I'm going to go ahead and copy the
file, which we've named shows.db.

01:41:15.260 --> 01:41:20.510
And I'm going to go ahead and increase
my Terminal and do SQLite3 of shows.db.

01:41:20.510 --> 01:41:23.570
Whenever playing around with a
SQLite database for the first time,

01:41:23.570 --> 01:41:26.480
typing .schema is perhaps a good
place to start to give you a sense

01:41:26.480 --> 01:41:27.410
of what's in there.

01:41:27.410 --> 01:41:29.120
And things just escalated quickly.

01:41:29.120 --> 01:41:31.640
There's a lot in this data
set, because, indeed, there's

01:41:31.640 --> 01:41:35.750
going to be tens of hundreds of
thousands of rows in this data set,

01:41:35.750 --> 01:41:39.230
and also problem set 7, where we'll
look at the movie side of things

01:41:39.230 --> 01:41:40.920
and not just the TV shows.

01:41:40.920 --> 01:41:43.940
So what is the schema that
we have created for you

01:41:43.940 --> 01:41:47.150
from IMDb's actual real-world data?

01:41:47.150 --> 01:41:48.950
One, there's a table called shows.

01:41:48.950 --> 01:41:51.950
And notice we've just added whitespace
by hitting Enter a bunch of times

01:41:51.950 --> 01:41:54.050
to make it a little more
stylistically readable.

01:41:54.050 --> 01:41:57.980
The shows table has an ID
column, a title column, a year,

01:41:57.980 --> 01:42:00.740
and the total number of
episodes for a given show.

01:42:00.740 --> 01:42:05.750
And the types of those columns are
integer, text, numeric, and integer.

01:42:05.750 --> 01:42:08.090
So it turns out there's
actually a few different data

01:42:08.090 --> 01:42:13.850
types that are worth being aware of when
it comes to creating tables themselves.

01:42:13.850 --> 01:42:18.170
In fact, in SQLite there's
five data types, and only five,

01:42:18.170 --> 01:42:21.650
fortunately, one of which is, indeed,
integer, negative or positive,

01:42:21.650 --> 01:42:25.010
numeric, which is kind of a
catchall for dates and times,

01:42:25.010 --> 01:42:27.320
things that are numeric
but are not just integers,

01:42:27.320 --> 01:42:29.510
and not just real numbers, for instance.

01:42:29.510 --> 01:42:33.020
Real number is what we've generally
thought of as float up until now.

01:42:33.020 --> 01:42:35.058
Text, of course, is
just text, but notice

01:42:35.058 --> 01:42:37.100
that you don't have to
worry about how big it is.

01:42:37.100 --> 01:42:39.110
Like in Python, it will size to fit.

01:42:39.110 --> 01:42:41.840
And then there's BLOB, which
is binary large object, which

01:42:41.840 --> 01:42:45.300
is for just raw 0s and 1s, like
for files or things like that.

01:42:45.300 --> 01:42:47.570
But we'll generally use
the other four of these.

01:42:47.570 --> 01:42:50.960
And so, indeed, when we
imported this data for you

01:42:50.960 --> 01:42:56.270
we decided that every show would be
given an ID, which is just an integer.

01:42:56.270 --> 01:42:59.460
Every show has, of course, a
title, which should not be null.

01:42:59.460 --> 01:43:01.320
Otherwise, why is it in the database?

01:43:01.320 --> 01:43:04.830
Every show has a year,
which is numeric according

01:43:04.830 --> 01:43:06.180
to that definition a moment ago.

01:43:06.180 --> 01:43:09.540
And the total number of episodes for
a show is going to be an integer.

01:43:09.540 --> 01:43:13.110
What now is with these primary keys
that we mentioned earlier, too?

01:43:13.110 --> 01:43:18.090
A primary key is the column that
uniquely identifies all of the data.

01:43:18.090 --> 01:43:21.120
In our case, with the
favorites, I automatically

01:43:21.120 --> 01:43:24.750
gave each of your submissions a unique
ID so that even if two or more of you

01:43:24.750 --> 01:43:27.360
typed in The Office,
your submission still

01:43:27.360 --> 01:43:32.310
had a unique identifier, a number
that allowed me to then correlate it

01:43:32.310 --> 01:43:36.330
with your genres, just
as we saw a moment ago.

01:43:36.330 --> 01:43:39.280
In this version of IMDb,
there's also genres.

01:43:39.280 --> 01:43:42.120
But they don't come from
us, they come from IMDb.com.

01:43:42.120 --> 01:43:46.320
And so a genre has a show ID, and
a genre just like our database.

01:43:46.320 --> 01:43:49.890
But these are real-world genres
with a bit more filtration.

01:43:49.890 --> 01:43:54.520
Notice, though, just like my
version, there's a foreign key.

01:43:54.520 --> 01:43:59.970
A foreign key is the appearance
of another table's primary key

01:43:59.970 --> 01:44:02.050
in its own table.

01:44:02.050 --> 01:44:05.100
So when you have a table
like genres, which is somehow

01:44:05.100 --> 01:44:10.920
cross referencing the original shows
table, if shows have a primary key

01:44:10.920 --> 01:44:15.270
called ID, and those same numbers
appear in the genres table

01:44:15.270 --> 01:44:20.250
under the column called show ID, by
definition, show ID is a foreign key.

01:44:20.250 --> 01:44:22.650
It's the same numbers but
it's foreign in the sense

01:44:22.650 --> 01:44:25.260
that the number is being
used in this table,

01:44:25.260 --> 01:44:29.130
even though it's officially defined
primarily in this other table.

01:44:29.130 --> 01:44:31.770
This is what we mean by
relational databases.

01:44:31.770 --> 01:44:37.170
You have multiple tables with some
column in common, numbers typically.

01:44:37.170 --> 01:44:40.860
And those numbers allow you to line
the two tables up in such a way

01:44:40.860 --> 01:44:44.040
that you can reconnect the
shows with their genres,

01:44:44.040 --> 01:44:46.800
just like we did with our
smaller data set a moment ago.

01:44:46.800 --> 01:44:49.050
This logic is extended further.

01:44:49.050 --> 01:44:53.070
Notice that the IMDb database we've
created for you has a stars table,

01:44:53.070 --> 01:44:56.730
like TV show stars, the actors therein.

01:44:56.730 --> 01:45:00.210
And that table, interestingly,
has no mention of people

01:45:00.210 --> 01:45:02.220
and no mention of shows, per se.

01:45:02.220 --> 01:45:05.730
It only has a column called
show ID, which is an integer,

01:45:05.730 --> 01:45:08.220
and a person ID, which is an integer.

01:45:08.220 --> 01:45:14.320
Meanwhile, if we scrolled
down to the bottom,

01:45:14.320 --> 01:45:17.490
you will see a table called people.

01:45:17.490 --> 01:45:23.010
And we have decided in IMDb's world
that every person in the TV show world

01:45:23.010 --> 01:45:27.510
will have a unique identifier that's
a number, a name that's text, a birth

01:45:27.510 --> 01:45:31.500
date, which is numeric, and
then, again, specifying that ID

01:45:31.500 --> 01:45:35.350
is going to be their primary key.

01:45:35.350 --> 01:45:36.940
So what's going on here?

01:45:36.940 --> 01:45:42.640
Well, it turns out that TV stars and
writers are both types of people.

01:45:42.640 --> 01:45:47.730
So using this relational database,
notice the road we're going down.

01:45:47.730 --> 01:45:49.770
We're factoring out commonalities.

01:45:49.770 --> 01:45:52.570
And if a person can be
different things in life,

01:45:52.570 --> 01:45:55.260
well, we're defining them
first and foremost as people.

01:45:55.260 --> 01:45:58.150
And then, notice these two
tables are almost the same.

01:45:58.150 --> 01:46:00.660
The stars table has a show
ID, which is a number,

01:46:00.660 --> 01:46:02.670
and a person ID, which
is a number, which

01:46:02.670 --> 01:46:10.710
allows us via this middleman table, if
you will, to link people with TV shows.

01:46:10.710 --> 01:46:16.080
Similarly, the writers table allows
us to connect shows with people, too,

01:46:16.080 --> 01:46:18.220
by just recording those numbers.

01:46:18.220 --> 01:46:20.980
So if we go into this data
set, let's do the following.

01:46:20.980 --> 01:46:24.340
Let's do select star
from people semicolon.

01:46:24.340 --> 01:46:27.030
So a huge amount of data is coming back.

01:46:27.030 --> 01:46:31.480
This is hundreds of thousands of rows
now based on the ID numbers alone.

01:46:31.480 --> 01:46:34.320
So this is real-world data
now flying across the screen.

01:46:34.320 --> 01:46:38.160
There's a lot of people in the TV show
business, not just actors and writers,

01:46:38.160 --> 01:46:39.780
but others as well.

01:46:39.780 --> 01:46:40.680
It's still going.

01:46:40.680 --> 01:46:41.977
There's a lot of data there.

01:46:41.977 --> 01:46:44.310
So my god, if you had to do
anything manual in this data

01:46:44.310 --> 01:46:46.660
set it's probably not going
to work out very well.

01:46:46.660 --> 01:46:49.590
And actually, we're up to, what,
a million people in this data

01:46:49.590 --> 01:46:51.990
set, plus, which would mean
this probably isn't even

01:46:51.990 --> 01:46:55.590
going to open very well in Excel, or
Google Spreadsheets, or Apple Numbers.

01:46:55.590 --> 01:46:57.700
SQL probably is the
better approach here.

01:46:57.700 --> 01:47:00.360
Let's search for someone
specific, like select star

01:47:00.360 --> 01:47:06.060
from people, where name equals
Steve Carell, for instance, sticking

01:47:06.060 --> 01:47:06.702
with comedies.

01:47:06.702 --> 01:47:08.160
All right, so there's Steve Carell.

01:47:08.160 --> 01:47:14.100
He is person number
136,797, born in 1962.

01:47:14.100 --> 01:47:16.540
And that's as much data as
we have on Steve Carell here.

01:47:16.540 --> 01:47:19.210
How do we figure out what
shows, for instance, he's in?

01:47:19.210 --> 01:47:23.500
Well, let's see, select
star from shows, semicolon.

01:47:23.500 --> 01:47:27.180
There's a crazy number of shows
out there in the IMDb database.

01:47:27.180 --> 01:47:30.150
And you can see it here again
flying across the screen.

01:47:30.150 --> 01:47:33.630
Feels like we're going to have to
employ some techniques in order

01:47:33.630 --> 01:47:37.060
to get at all of Steve Carell's shows.

01:47:37.060 --> 01:47:39.220
So how are we going to do that?

01:47:39.220 --> 01:47:41.730
Well, god, this is a lot of data here.

01:47:41.730 --> 01:47:45.120
And in fact, yeah, we
have, what, 15 million

01:47:45.120 --> 01:47:47.280
shows plus in this data set, too.

01:47:47.280 --> 01:47:50.340
So doing things efficiently is
now going to start to matter.

01:47:50.340 --> 01:47:51.790
So let's actually do this.

01:47:51.790 --> 01:47:53.460
Let me select a specific show.

01:47:53.460 --> 01:47:58.590
Select star from shows where title
equals, quote, unquote, "The Office."

01:47:58.590 --> 01:48:00.920
And there presumably shouldn't
be typos in this data

01:48:00.920 --> 01:48:03.470
because it comes from the
real website IMDb.com.

01:48:03.470 --> 01:48:05.000
Let's get back to show.

01:48:05.000 --> 01:48:08.210
Turns out there's been a lot of
The Offices out in the world.

01:48:08.210 --> 01:48:12.170
The one that started in 2005
is the one that we want,

01:48:12.170 --> 01:48:14.990
presumably the most
popular with 188 episodes.

01:48:14.990 --> 01:48:16.220
How can we get just that?

01:48:16.220 --> 01:48:21.650
Maybe we could do and year
equals, how about 2005?

01:48:21.650 --> 01:48:25.250
All right, so now we've got
back just the ID of The Office

01:48:25.250 --> 01:48:26.210
that we care about.

01:48:26.210 --> 01:48:27.230
And let's do this, too.

01:48:27.230 --> 01:48:30.110
Let me turn on a timer
within SQLite just

01:48:30.110 --> 01:48:31.790
to get a sense of running time now.

01:48:31.790 --> 01:48:33.110
Let me do that again.

01:48:33.110 --> 01:48:36.050
Select star from shows, where
title equals The Office,

01:48:36.050 --> 01:48:37.808
and year equals 2005.

01:48:37.808 --> 01:48:38.850
And let's keep it simple.

01:48:38.850 --> 01:48:40.370
Let's just do titles for now.

01:48:40.370 --> 01:48:41.450
Enter.

01:48:41.450 --> 01:48:43.460
All right, so not terribly long.

01:48:43.460 --> 01:48:46.990
It found it pretty fast, but it looks
like it took how much real time?

01:48:46.990 --> 01:48:50.010
0.02 seconds, not bad for just a title.

01:48:50.010 --> 01:48:53.210
But just to plant a seed, it
turns out that we can probably

01:48:53.210 --> 01:48:54.200
speed even this up.

01:48:54.200 --> 01:48:54.930
Let me do this.

01:48:54.930 --> 01:48:58.220
Let me create something called an
index, which is another use of the C

01:48:58.220 --> 01:49:00.170
in CRUD for creating something.

01:49:00.170 --> 01:49:02.810
And I'm going to call this title index.

01:49:02.810 --> 01:49:07.340
And I'm going to create
it on the shows table,

01:49:07.340 --> 01:49:09.273
specifically on the title column.

01:49:09.273 --> 01:49:11.690
And we'll see in a moment what
this is going to do for me.

01:49:11.690 --> 01:49:12.920
Enter.

01:49:12.920 --> 01:49:17.130
Took a moment, like 0.349 seconds,
to create something called an index.

01:49:17.130 --> 01:49:21.590
But now watch, if I select star from
shows searching for The Office again,

01:49:21.590 --> 01:49:24.230
previously it took me 0.021 seconds.

01:49:24.230 --> 01:49:27.680
Not bad, but now, wow.

01:49:27.680 --> 01:49:31.190
Literally no time at all, or so low
that it wasn't really measurable.

01:49:31.190 --> 01:49:33.470
And I'll do it again just
to get a sense of things.

01:49:33.470 --> 01:49:34.860
Still quite low.

01:49:34.860 --> 01:49:40.020
Now even though 0.021 seconds, not crazy
long, imagine now having a lot of data,

01:49:40.020 --> 01:49:42.560
a lot of users running a real
website or real mobile app.

01:49:42.560 --> 01:49:45.980
Every millisecond we can start to
shave off is going to be compelling.

01:49:45.980 --> 01:49:47.810
So what is it we just did?

01:49:47.810 --> 01:49:51.830
Well, we actually just created
something called an index.

01:49:51.830 --> 01:49:53.960
And this is a nice way
to tie in, now, some

01:49:53.960 --> 01:49:56.420
of our week 5 discussion
of data structures,

01:49:56.420 --> 01:49:58.250
and our week 3 discussion
of running times.

01:49:58.250 --> 01:50:01.280
An index in a database is
some kind of fancy data

01:50:01.280 --> 01:50:06.110
structure that allows the database
to do better than linear search.

01:50:06.110 --> 01:50:10.580
Literally, as you just saw, these
tables are crazy long or tall right now,

01:50:10.580 --> 01:50:12.150
very linear, that is.

01:50:12.150 --> 01:50:14.300
And so when I first
searched for The Office,

01:50:14.300 --> 01:50:17.780
it was literally doing linear search,
top to bottom, looking at as many as,

01:50:17.780 --> 01:50:19.400
what, a million plus rows.

01:50:19.400 --> 01:50:20.870
That's relatively slow.

01:50:20.870 --> 01:50:23.330
It's not that slow, 0.021 seconds.

01:50:23.330 --> 01:50:27.020
But that's relatively slow just
theoretically, algorithmically,

01:50:27.020 --> 01:50:28.580
doing anything linearly.

01:50:28.580 --> 01:50:32.090
But if you instead create
an index using syntax

01:50:32.090 --> 01:50:37.880
like this, which I just did, creating an
index on the title column of the show's

01:50:37.880 --> 01:50:41.220
table, that's like giving the
database a clue in advance saying,

01:50:41.220 --> 01:50:44.660
hey, I know I'm going to search on
this column in this table a lot.

01:50:44.660 --> 01:50:47.430
Do something with data
structures to speed things up.

01:50:47.430 --> 01:50:50.030
And so if you think back to our
discussion of data structures,

01:50:50.030 --> 01:50:51.720
maybe it's using a tree.

01:50:51.720 --> 01:50:56.060
Maybe it's using a trie or a hash
table, some fancier two-dimensional data

01:50:56.060 --> 01:50:59.930
structure is generally going to lift
the data up creating right maybe a tree

01:50:59.930 --> 01:51:00.810
structure.

01:51:00.810 --> 01:51:03.620
So it's just much faster
to find data, especially

01:51:03.620 --> 01:51:06.500
if it's sorting it now
based on title, and not

01:51:06.500 --> 01:51:08.030
just storing it in one long list.

01:51:08.030 --> 01:51:10.220
And in fact, in the world
of relational databases,

01:51:10.220 --> 01:51:12.560
the type of structure that's
often used in a database

01:51:12.560 --> 01:51:13.920
is something called a B-tree.

01:51:13.920 --> 01:51:15.170
It's not a binary tree.

01:51:15.170 --> 01:51:19.040
Different use of the letter B, but it
looks a little something like the trees

01:51:19.040 --> 01:51:19.820
we've seen.

01:51:19.820 --> 01:51:21.680
It's not binary because
some of the nodes

01:51:21.680 --> 01:51:24.890
might have more than
two children or fewer,

01:51:24.890 --> 01:51:28.190
but it's a very wide but
relatively shallow tree.

01:51:28.190 --> 01:51:29.990
It's not very tall.

01:51:29.990 --> 01:51:33.770
And the upside of that is that if
your data is stored in this tree,

01:51:33.770 --> 01:51:36.060
the database can find it more quickly.

01:51:36.060 --> 01:51:41.270
And the reason it took half a second,
a third of a second to build the index

01:51:41.270 --> 01:51:45.260
is because SQLite needed to take
some non-zero amount of time

01:51:45.260 --> 01:51:47.630
to just build up this tree in memory.

01:51:47.630 --> 01:51:51.900
And it has algorithms for doing so based
on alphabetization or other techniques.

01:51:51.900 --> 01:51:55.340
But you spend a bit of time
up front, a third of a second.

01:51:55.340 --> 01:51:57.470
And then thereafter, wow.

01:51:57.470 --> 01:52:00.470
Every subsequent query, if I
keep doing it again and again,

01:52:00.470 --> 01:52:04.040
is going to be crazy
low, 0.000, maybe 0.001.

01:52:04.040 --> 01:52:08.240
But an order of magnitude, a
factor of 10 or 100 faster than it

01:52:08.240 --> 01:52:10.790
previously was earlier.

01:52:10.790 --> 01:52:14.360
So we have these indexes which
allow us to get at data faster.

01:52:14.360 --> 01:52:17.330
But what if we want to
actually get data that's

01:52:17.330 --> 01:52:19.370
now across these multiple tables?

01:52:19.370 --> 01:52:20.240
How can we do that?

01:52:20.240 --> 01:52:23.060
And how might these indices
or indexes help further?

01:52:23.060 --> 01:52:26.900
Well, it turns out there is
a way that we've seen already

01:52:26.900 --> 01:52:29.510
indirectly to join two tables together.

01:52:29.510 --> 01:52:33.410
Previously, when I selected
the ID of The Office,

01:52:33.410 --> 01:52:37.740
and then I searched for it in the other
table using select in a nested query,

01:52:37.740 --> 01:52:40.410
I was joining two tables together.

01:52:40.410 --> 01:52:42.900
And it turns out there's a
couple of ways to do this.

01:52:42.900 --> 01:52:46.550
Let's go ahead now and, for instance,
find all of Steve Carell's TV shows.

01:52:46.550 --> 01:52:48.680
Not just The Office
but all of them, too.

01:52:48.680 --> 01:52:56.315
Unfortunately, if we look at our schema,
shows up here have no mention of TV--

01:52:56.315 --> 01:53:01.860
oh, shows over here has no
mention of the TV stars in them.

01:53:01.860 --> 01:53:05.130
And people have no mention of shows.

01:53:05.130 --> 01:53:09.360
We somehow need to use this
table here to connect the two.

01:53:09.360 --> 01:53:14.820
And this is called a join table, in the
sense that using two integer columns--

01:53:14.820 --> 01:53:17.790
it joins the two tables
together logically.

01:53:17.790 --> 01:53:21.750
And so if you're savvy enough with SQL,
you can do what I did with my hands

01:53:21.750 --> 01:53:26.010
earlier and like recombine
tables by using these common IDs,

01:53:26.010 --> 01:53:27.460
these integers together.

01:53:27.460 --> 01:53:28.330
So let me do this.

01:53:28.330 --> 01:53:32.730
Let me go ahead and figure out,
step-by-step, Steve Carell's shows.

01:53:32.730 --> 01:53:33.970
So how am I going to do this?

01:53:33.970 --> 01:53:39.120
Well, if I select star from people,
where name equals Steve Carell,

01:53:39.120 --> 01:53:40.840
fortunately, there's only one of them.

01:53:40.840 --> 01:53:46.800
So this gives me back his name,
his ID, and his birth year.

01:53:46.800 --> 01:53:48.960
But it's really only his
ID that I care about.

01:53:48.960 --> 01:53:49.680
Why?

01:53:49.680 --> 01:53:55.500
Because in order to get back his shows,
I need to link person ID with show ID.

01:53:55.500 --> 01:53:57.640
So I need to know his ID number.

01:53:57.640 --> 01:53:59.590
So what could I do with this?

01:53:59.590 --> 01:54:04.230
Well, remember the schema
and the stars table.

01:54:04.230 --> 01:54:07.830
I've just gotten, from the
people table, Steve Carell's ID.

01:54:07.830 --> 01:54:13.530
I bet by transitivity I could
now use his person ID, his ID,

01:54:13.530 --> 01:54:15.900
to get back all of his show IDs.

01:54:15.900 --> 01:54:19.170
And then once I've got all of his show
IDs, I can take it one step further

01:54:19.170 --> 01:54:21.330
and get back all of his shows' titles.

01:54:21.330 --> 01:54:25.440
So the answer is actually English
words and not just random, seemingly,

01:54:25.440 --> 01:54:26.290
integers.

01:54:26.290 --> 01:54:27.640
So let me go ahead and do this.

01:54:27.640 --> 01:54:31.740
Let me, again, get Steve
Carell's ID number, but not star.

01:54:31.740 --> 01:54:33.060
Star represents everything.

01:54:33.060 --> 01:54:35.610
It's a wildcard character in SQL.

01:54:35.610 --> 01:54:38.310
Let me just select the
ID of Steve Carell.

01:54:38.310 --> 01:54:41.640
And that gives me back 136,797.

01:54:41.640 --> 01:54:43.440
And it's only giving me back one value.

01:54:43.440 --> 01:54:46.710
The thing called ID is just
the column heading up above.

01:54:46.710 --> 01:54:51.360
Now, suppose I want to
select all of the show IDs

01:54:51.360 --> 01:54:53.460
that Steve Carell is affiliated with.

01:54:53.460 --> 01:55:00.570
Let me select Show ID from stars,
where the person ID in stars

01:55:00.570 --> 01:55:03.480
happens to equal Steve Carell's ID.

01:55:03.480 --> 01:55:07.320
So again, I'm building up my answer in
reverse and taking these baby steps.

01:55:07.320 --> 01:55:11.430
On the right, in parentheses,
I'm getting Steve Carell's ID.

01:55:11.430 --> 01:55:15.330
On the left, I am now
selecting all of the show IDs

01:55:15.330 --> 01:55:19.410
that have some connection with
that person ID in the stars table.

01:55:19.410 --> 01:55:21.850
This answer, too, is not
going to be that illuminating.

01:55:21.850 --> 01:55:25.420
It's just a whole bunch of integers
that have no meaning to me as a human.

01:55:25.420 --> 01:55:27.770
But let's take this one step further.

01:55:27.770 --> 01:55:29.520
And even though my
code is getting long, I

01:55:29.520 --> 01:55:31.920
could hit Enter and format
it nicely, especially

01:55:31.920 --> 01:55:33.660
if I were doing this in a code file.

01:55:33.660 --> 01:55:35.580
But I'm just doing it
interactively for now.

01:55:35.580 --> 01:55:39.420
Let's now select all of the
titles from the shows table,

01:55:39.420 --> 01:55:48.100
where the ID of the show is in
this following previous query.

01:55:48.100 --> 01:55:49.740
So again, the query is getting long.

01:55:49.740 --> 01:55:52.200
But notice, it's the
third and last step.

01:55:52.200 --> 01:55:55.950
Select title from the shows
table, where the ID of the show

01:55:55.950 --> 01:55:58.590
is in the list of all
of the show IDs that

01:55:58.590 --> 01:56:02.040
came back from the stars table
searching for Steve Carell's person ID.

01:56:02.040 --> 01:56:03.480
How did we get that person ID?

01:56:03.480 --> 01:56:04.680
Let me scroll to the end.

01:56:04.680 --> 01:56:11.160
Well, I selected, in my innermost
parentheses, Steve Carell's own ID.

01:56:11.160 --> 01:56:13.440
So now, when I hit Enter, voila.

01:56:13.440 --> 01:56:16.525
I get all of Steve Carell's
TV shows up until now.

01:56:16.525 --> 01:56:19.650
And if I want to tidy this up further,
I can use the same tricks as before.

01:56:19.650 --> 01:56:22.180
Order by title, semicolon.

01:56:22.180 --> 01:56:25.540
Now I've got it all
alphabetized as before.

01:56:25.540 --> 01:56:28.080
So again, with SQL comes
the ability to search--

01:56:28.080 --> 01:56:31.290
I mean, look how quickly
we do this, 0.094 seconds

01:56:31.290 --> 01:56:34.650
to search across three different
tables to get back this answer.

01:56:34.650 --> 01:56:38.820
But my data is now all neatly
designed in individual tables,

01:56:38.820 --> 01:56:42.000
which is going to be important
now that the data set is so large.

01:56:42.000 --> 01:56:44.340
But let me take this one step further.

01:56:44.340 --> 01:56:46.930
Let me go ahead and do this.

01:56:46.930 --> 01:56:51.580
Let me go ahead and point
out that with this query,

01:56:51.580 --> 01:56:54.870
notice that I'm searching on--

01:56:54.870 --> 01:56:58.710
let's say I'm searching
on a person ID here.

01:56:58.710 --> 01:57:02.410
And at the end here, I'm
searching on a name column here.

01:57:02.410 --> 01:57:05.230
So let me actually go ahead and do this.

01:57:05.230 --> 01:57:09.510
Let me go ahead and see
if we can't speed this up.

01:57:09.510 --> 01:57:13.090
This query at the moment
takes 0.092 seconds.

01:57:13.090 --> 01:57:15.930
Let's see if we can't speed this
up further by just quickly creating

01:57:15.930 --> 01:57:18.930
a few more of those B-trees
in the databases memory.

01:57:18.930 --> 01:57:24.240
Create an index called person index, and
I'm going to do this on the stars table

01:57:24.240 --> 01:57:25.950
on the person ID column.

01:57:25.950 --> 01:57:26.850
Enter.

01:57:26.850 --> 01:57:28.488
It's taking a moment, taking a moment.

01:57:28.488 --> 01:57:30.780
That's almost a full second
because that's a big table.

01:57:30.780 --> 01:57:35.050
Let's create another index called
show index on the stars table.

01:57:35.050 --> 01:57:35.550
Why?

01:57:35.550 --> 01:57:37.950
Because I want to search
by the show ID also.

01:57:37.950 --> 01:57:39.780
That was part of my big query.

01:57:39.780 --> 01:57:40.660
Takes a moment.

01:57:40.660 --> 01:57:43.810
OK, just more than
about 2/3 of a second.

01:57:43.810 --> 01:57:46.440
Now let's create one last one,
another index called name index,

01:57:46.440 --> 01:57:49.160
but I could call these things
anything I want, on the people table.

01:57:49.160 --> 01:57:49.560
Why?

01:57:49.560 --> 01:57:51.550
Because I'm also searching
on the name column.

01:57:51.550 --> 01:57:53.730
So in short, I'm
creating indexes on each

01:57:53.730 --> 01:57:57.450
of the columns that are somehow
involved in my search query,

01:57:57.450 --> 01:57:59.680
going from one table to the other.

01:57:59.680 --> 01:58:07.590
Now let's go back to the previous
query, which, recall, took--

01:58:07.590 --> 01:58:10.270
I think I erased it, 0.091.

01:58:10.270 --> 01:58:10.770
All right.

01:58:10.770 --> 01:58:12.645
Well, it was roughly
this order of magnitude.

01:58:12.645 --> 01:58:14.140
We're not seeing the data now.

01:58:14.140 --> 01:58:17.160
But let me go ahead and run
my original big query once.

01:58:17.160 --> 01:58:20.380
And boom, we're down to almost nothing.

01:58:20.380 --> 01:58:22.860
So again, creating
these indexes in memory

01:58:22.860 --> 01:58:27.640
has the effect of rapidly
speeding up our computation time.

01:58:27.640 --> 01:58:31.140
Now if you've ever used, for instance,
the my.harvard course shopping tool

01:58:31.140 --> 01:58:35.130
here on campus, or Yale's analogue, you
might wonder, why is the thing so slow?

01:58:35.130 --> 01:58:39.330
This could be one of the reasons why
large data sets with thousands of rows,

01:58:39.330 --> 01:58:41.670
thousands of courses
tend to be slow, if,

01:58:41.670 --> 01:58:44.880
and I'm only conjecturing, if the
database isn't properly indexed.

01:58:44.880 --> 01:58:46.890
If you're building your
own web application

01:58:46.890 --> 01:58:49.170
and you're finding that users
are waiting and waiting,

01:58:49.170 --> 01:58:52.290
and things are spinning and spinning,
what might be among the problems?

01:58:52.290 --> 01:58:55.770
Well, it could absolutely just be bad
algorithms and bad code that you wrote.

01:58:55.770 --> 01:58:58.270
Or it might be that you
haven't thought about, well,

01:58:58.270 --> 01:59:01.770
what column should be optimized
for searches and filtration

01:59:01.770 --> 01:59:05.705
like I've done here in order
to speed up subsequent queries?

01:59:05.705 --> 01:59:07.830
Again, from the outside
in, we can only conjecture.

01:59:07.830 --> 01:59:11.580
But ultimately, this is
just one of the things that

01:59:11.580 --> 01:59:14.308
explains performance problems as well.

01:59:14.308 --> 01:59:17.100
All right, let's point out just a
couple of final syntactic things,

01:59:17.100 --> 01:59:19.770
and then we'll consider,
bigger picture, some problems

01:59:19.770 --> 01:59:22.110
that might arise in this world.

01:59:22.110 --> 01:59:26.880
If these nested, nested queries
start to get a little much,

01:59:26.880 --> 01:59:29.160
there are other ways,
just so you've seen it,

01:59:29.160 --> 01:59:31.920
that you can execute
similar logic in SQL.

01:59:31.920 --> 01:59:34.410
For instance, if I
know in advance that I

01:59:34.410 --> 01:59:39.390
want to connect Steve Carell to
his show IDs and to their titles,

01:59:39.390 --> 01:59:41.190
we can do something more like this.

01:59:41.190 --> 01:59:52.050
Select title from the people table,
joined with the stars table on people

01:59:52.050 --> 01:59:54.900
ID equals stars.personID.

01:59:54.900 --> 01:59:55.860
So what am I doing?

01:59:55.860 --> 01:59:56.490
New syntax.

01:59:56.490 --> 01:59:59.782
And again, this is not something you'll
have to memorize or ingrain right away.

01:59:59.782 --> 02:00:04.080
But just so you've seen other
approaches, select title from people

02:00:04.080 --> 02:00:05.160
join stars.

02:00:05.160 --> 02:00:09.690
This is an explicit way to say, take
the people table in one hand, the stars

02:00:09.690 --> 02:00:12.030
table in the other hand,
and somehow join them

02:00:12.030 --> 02:00:13.990
as I keep doing with my fingertips here.

02:00:13.990 --> 02:00:15.610
How to join them?

02:00:15.610 --> 02:00:20.370
Join them so that the people, the ID
column in the people table lines up

02:00:20.370 --> 02:00:23.260
with the person ID in the stars table.

02:00:23.260 --> 02:00:25.260
But that's not quite everything.

02:00:25.260 --> 02:00:28.740
I could also say, join
further on the shows table,

02:00:28.740 --> 02:00:35.290
where the stars show ID
equals the shows ID column.

02:00:35.290 --> 02:00:36.540
So what am I doing here?

02:00:36.540 --> 02:00:45.990
That's saying, go further and join
the stars table with the show's table,

02:00:45.990 --> 02:00:48.990
joining the show ID
column with the ID column.

02:00:48.990 --> 02:00:51.750
Again, this starts to get a
little messy to think about.

02:00:51.750 --> 02:00:55.830
But now I can just say, where name
equals, quote, unquote, "Steve Carell."

02:00:55.830 --> 02:00:59.070
I can do in one query what previously
took me three nested queries

02:00:59.070 --> 02:01:00.600
and get back the same answers.

02:01:00.600 --> 02:01:05.380
And I can still add in my order
by title to get back the result.

02:01:05.380 --> 02:01:09.780
And if I do this a little more
neatly, let me type this out a little

02:01:09.780 --> 02:01:10.750
differently.

02:01:10.750 --> 02:01:16.620
Let me type this out by adding a
new line-- ah, I can't do that here.

02:01:16.620 --> 02:01:18.120
I'm going to leave it alone for now.

02:01:18.120 --> 02:01:20.700
We can type it on multiple
lines in other contexts.

02:01:20.700 --> 02:01:23.700
And let me do one last thing.

02:01:23.700 --> 02:01:25.012
Do I want to show that?

02:01:25.012 --> 02:01:26.970
I'm going to show it,
but this is not something

02:01:26.970 --> 02:01:28.440
you should ingrain just yet either.

02:01:28.440 --> 02:01:31.620
Select title from
people, stars, and shows.

02:01:31.620 --> 02:01:34.860
If you know in advance that you want
to do something with all three tables,

02:01:34.860 --> 02:01:38.130
you can just enumerate them,
one table name after the other.

02:01:38.130 --> 02:01:43.497
And then you can say where
people.ID equals stars.personID.

02:01:43.497 --> 02:01:45.330
And now I'm hitting
Enter so that it formats

02:01:45.330 --> 02:01:47.070
a little more readably on my screen.

02:01:47.070 --> 02:01:55.140
And stars.showID equals shows.ID,
and lastly, name equals Steve Carell.

02:01:55.140 --> 02:01:59.730
In short, you specify that you
want to select data from all three

02:01:59.730 --> 02:02:00.690
of these tables.

02:02:00.690 --> 02:02:06.450
And then you tell the database how to
combine foreign keys with primary keys,

02:02:06.450 --> 02:02:09.690
that is, the columns that
have those integers in common.

02:02:09.690 --> 02:02:12.720
If I hit Enter now, I get
the same exact results, ever

02:02:12.720 --> 02:02:16.110
more so if I also add
in an order by title.

02:02:16.110 --> 02:02:17.770
Oops.

02:02:17.770 --> 02:02:18.270
All right.

02:02:18.270 --> 02:02:20.190
That's why I didn't
want to do this earlier.

02:02:20.190 --> 02:02:23.190
I have to go back through my history
multiple times to actually get back

02:02:23.190 --> 02:02:24.640
the multi-line query this time.

02:02:24.640 --> 02:02:25.140
All right.

02:02:25.140 --> 02:02:27.280
That was a lot all at once.

02:02:27.280 --> 02:02:31.365
But this is only to say that, even
as we make the design of the data

02:02:31.365 --> 02:02:33.990
more sophisticated, and we put
some of it over here, some of it

02:02:33.990 --> 02:02:38.130
over here, some of it over here so as to
avoid duplication of data, weird hacks

02:02:38.130 --> 02:02:41.910
like putting commas in the data, we
can still get back all of the answers

02:02:41.910 --> 02:02:44.280
that we might want across
these several tables.

02:02:44.280 --> 02:02:48.580
And using indexes, we can
significantly speed up these processes

02:02:48.580 --> 02:02:52.140
so as to handle 10 times as
many, a 100 times as many users

02:02:52.140 --> 02:02:53.670
on the same actual database.

02:02:53.670 --> 02:02:55.020
There is going to be a downside.

02:02:55.020 --> 02:02:57.540
And thinking back to our
discussion of algorithms and data

02:02:57.540 --> 02:03:02.120
structures in past weeks, what might be
a downside of creating these indexes?

02:03:02.120 --> 02:03:06.110
Because as of now, I created four
separate indexes on the name column,

02:03:06.110 --> 02:03:09.560
the title column, and
some other columns, too.

02:03:09.560 --> 02:03:11.930
Why wouldn't I just go
ahead and index everything

02:03:11.930 --> 02:03:14.390
if it's clearly speeding things up?

02:03:14.390 --> 02:03:15.770
Memory, so space.

02:03:15.770 --> 02:03:18.890
Any time you're starting to benefit
time wise in computer science,

02:03:18.890 --> 02:03:21.950
odds are you're sacrificing
space, or vice versa.

02:03:21.950 --> 02:03:25.400
And probably indexing absolutely
everything is a little dumb

02:03:25.400 --> 02:03:29.430
because you're going to waste way more
space than you might actually need.

02:03:29.430 --> 02:03:31.610
So figuring out where the
right inflection point is

02:03:31.610 --> 02:03:36.410
is part of the process of designing and
just getting better at these things.

02:03:36.410 --> 02:03:40.910
Now unfortunately, a whole lot of
things can go wrong in this world,

02:03:40.910 --> 02:03:44.870
and they continue to in the real
world with people using SQL databases.

02:03:44.870 --> 02:03:46.760
And in fact, here on
out, if you're reading

02:03:46.760 --> 02:03:51.530
something technical about SQL databases,
and websites being hacked in some form,

02:03:51.530 --> 02:03:54.660
and passwords leaking out,
unfortunately, all too often

02:03:54.660 --> 02:03:57.505
it is because of what are
called SQL injection attacks.

02:03:57.505 --> 02:03:59.630
And just to give you a
sense now to counterbalance,

02:03:59.630 --> 02:04:01.588
maybe [INAUDIBLE] enthusiasm
for like, oh, that

02:04:01.588 --> 02:04:03.620
was neat how we can
do things so quickly.

02:04:03.620 --> 02:04:06.680
With great power comes
responsibility in this world, too.

02:04:06.680 --> 02:04:09.320
And so many people introduce
bugs into their code

02:04:09.320 --> 02:04:17.160
by not quite appreciating how it is the
data is getting into your application.

02:04:17.160 --> 02:04:18.390
So what do I mean by that?

02:04:18.390 --> 02:04:21.200
Here, for instance, is a
typical login screen for Yale.

02:04:21.200 --> 02:04:23.780
And here's the analogue for
Harvard where you're prompted,

02:04:23.780 --> 02:04:26.450
every day probably, for your
username and your password,

02:04:26.450 --> 02:04:28.460
your email address and
your password here.

02:04:28.460 --> 02:04:32.420
Suppose, though, that
behind this login page,

02:04:32.420 --> 02:04:35.030
whether Harvard's or Yale's,
there's some website.

02:04:35.030 --> 02:04:38.270
And that website is using
SQL underneath the hood

02:04:38.270 --> 02:04:40.700
to store all of the
Harvard or Yale people's

02:04:40.700 --> 02:04:43.940
usernames, passwords, ID
numbers, courses, transcripts,

02:04:43.940 --> 02:04:44.970
all of that stuff.

02:04:44.970 --> 02:04:47.480
So there's a SQL database
underneath the website.

02:04:47.480 --> 02:04:50.360
Well, what might go
wrong with this process?

02:04:50.360 --> 02:04:52.850
Unfortunately, there's
some special syntax in SQL

02:04:52.850 --> 02:04:54.530
just like there is in C and Python.

02:04:54.530 --> 02:04:56.960
For instance, there are
comments in SQL, too.

02:04:56.960 --> 02:05:00.680
If you do two hyphens, dash,
dash, that's a comment in SQL.

02:05:00.680 --> 02:05:06.170
And if you, the programmer, aren't
sufficiently distrustful of your users,

02:05:06.170 --> 02:05:09.560
such that you defend against
potentially adversarial attacks,

02:05:09.560 --> 02:05:11.160
you might do something like this.

02:05:11.160 --> 02:05:16.070
Suppose that I somewhat
maliciously or curiously log in

02:05:16.070 --> 02:05:19.130
by typing my username,
Malan@harvard.edu, and then maybe

02:05:19.130 --> 02:05:20.990
a single quote and a dash, dash.

02:05:20.990 --> 02:05:21.680
Why?

02:05:21.680 --> 02:05:24.860
Because I'm trying to suss out
if there is a vulnerability here

02:05:24.860 --> 02:05:26.420
to a SQL injection attack.

02:05:26.420 --> 02:05:27.910
Do not do this in general.

02:05:27.910 --> 02:05:31.160
But if I were the owner of the website
trying to see if I've made any mistake,

02:05:31.160 --> 02:05:35.300
I might try using potentially
dangerous characters in my input.

02:05:35.300 --> 02:05:36.290
Dangerous how?

02:05:36.290 --> 02:05:40.340
Because single quote is used for
quoting things in SQL, as we've seen--

02:05:40.340 --> 02:05:41.810
single quotes or double quotes.

02:05:41.810 --> 02:05:44.930
Dash, dash, I claim now,
is used for commenting.

02:05:44.930 --> 02:05:47.960
But let's now imagine what
the code underneath the hood

02:05:47.960 --> 02:05:52.160
might be for something like
Yale's login or Harvard's login.

02:05:52.160 --> 02:05:54.600
What if it's code that looks like this?

02:05:54.600 --> 02:05:56.540
So let me read it from left to right.

02:05:56.540 --> 02:06:00.710
Suppose that they are using something
like CS50's own execute function,

02:06:00.710 --> 02:06:03.230
and they've got some SQL
typed into the website that

02:06:03.230 --> 02:06:07.160
says select star from users,
where username equals this,

02:06:07.160 --> 02:06:09.050
and password equals that.

02:06:09.050 --> 02:06:12.510
And they're plugging in
username and password.

02:06:12.510 --> 02:06:13.607
So what am I doing here?

02:06:13.607 --> 02:06:16.190
Well, when the user types their
username password, hits Enter,

02:06:16.190 --> 02:06:18.920
I probably want to select
that user from my database

02:06:18.920 --> 02:06:21.020
to see if the username
and passwords match.

02:06:21.020 --> 02:06:23.720
So the underlying SQL
might be, select star

02:06:23.720 --> 02:06:25.790
from users, where username
equals question mark,

02:06:25.790 --> 02:06:27.207
and password equals question mark.

02:06:27.207 --> 02:06:28.100
Users is the table.

02:06:28.100 --> 02:06:29.510
One column is username.

02:06:29.510 --> 02:06:31.400
One column is password.

02:06:31.400 --> 02:06:32.000
All right.

02:06:32.000 --> 02:06:37.430
And if we get back one row,
presumably Malan@harvard.edu

02:06:37.430 --> 02:06:38.970
exists with that password.

02:06:38.970 --> 02:06:41.190
We should let him proceed
from there on out.

02:06:41.190 --> 02:06:45.140
So that's some pseudo code, if
you will, for this scenario.

02:06:45.140 --> 02:06:49.580
What if, though, this code is not
as well written as it currently

02:06:49.580 --> 02:06:51.500
is, and isn't using question marks?

02:06:51.500 --> 02:06:54.757
So the question mark syntax
is a fairly common SQL thing,

02:06:54.757 --> 02:06:56.840
where the question marks
are used as placeholders,

02:06:56.840 --> 02:06:59.390
just like in printf, percent S was.

02:06:59.390 --> 02:07:02.900
But this function, db.execute
from CS50's library

02:07:02.900 --> 02:07:05.420
and third-party libraries
as well, is also

02:07:05.420 --> 02:07:07.790
doing some good stuff
with these question marks,

02:07:07.790 --> 02:07:09.830
and defending against
the following attack.

02:07:09.830 --> 02:07:12.920
Suppose that you were not using
a third-party library like ours

02:07:12.920 --> 02:07:16.490
and you were just manually constructing
your SQL queries like this.

02:07:16.490 --> 02:07:19.940
You were to do something like this
instead using an f-string in Python.

02:07:19.940 --> 02:07:21.800
You're comfortable with
format strings now.

02:07:21.800 --> 02:07:24.883
You've gotten into the habit of using
curly braces and plugging in values.

02:07:24.883 --> 02:07:27.230
Suppose that you, the
aspiring programmer,

02:07:27.230 --> 02:07:29.660
is just using techniques
that you've been taught.

02:07:29.660 --> 02:07:32.660
So you have an f-string
with select star from users,

02:07:32.660 --> 02:07:36.470
where username equals, quote,
unquote, "username" in curly braces.

02:07:36.470 --> 02:07:41.270
And password equals, quote,
unquote, "password" in curly braces.

02:07:41.270 --> 02:07:44.270
As of what, two weeks
ago, this was perfectly

02:07:44.270 --> 02:07:49.460
legitimate technique in Python
to plug in values into a string.

02:07:49.460 --> 02:07:53.630
But notice if you are using
single quotes yourself

02:07:53.630 --> 02:07:58.750
and the user has typed in single
quotes to their input, what

02:07:58.750 --> 02:07:59.920
could go wrong here?

02:07:59.920 --> 02:08:04.510
Where are we going with this if you're
just blindly plugging user input

02:08:04.510 --> 02:08:08.350
into your own prepared string of text?

02:08:08.350 --> 02:08:09.230
Yeah?

02:08:09.230 --> 02:08:14.725
AUDIENCE: [INAUDIBLE]

02:08:14.725 --> 02:08:15.600
DAVID J. MALAN: Yeah.

02:08:15.600 --> 02:08:20.670
Worst case, they could insert what is
actually SQL code into your database

02:08:20.670 --> 02:08:21.810
as follows.

02:08:21.810 --> 02:08:25.140
Generally speaking, if you're using
special syntax like single quotes

02:08:25.140 --> 02:08:27.000
to surround the user's
input, you'd better

02:08:27.000 --> 02:08:29.250
hope that they don't have
an apostrophe in their name.

02:08:29.250 --> 02:08:31.875
Or you better hope that they
don't type a single quote as well.

02:08:31.875 --> 02:08:36.025
Because what if their single quote
finishes your single quote instead,

02:08:36.025 --> 02:08:37.900
and then the rest of
this is somehow ignored?

02:08:37.900 --> 02:08:39.330
Well, let's consider
how this might happen.

02:08:39.330 --> 02:08:40.510
Let me go ahead in here.

02:08:40.510 --> 02:08:42.840
This got a little
blurry here, but let me

02:08:42.840 --> 02:08:44.850
plug in here-- wow, that looks awful.

02:08:44.850 --> 02:08:47.940
Let me fix the red.

02:08:47.940 --> 02:08:50.430
Just change this to white
so it's more readable.

02:08:50.430 --> 02:08:56.730
What happens if the
user does this instead?

02:08:56.730 --> 02:08:59.310
They type in, like I
did into the screenshot,

02:08:59.310 --> 02:09:02.760
'Malan@harvard.edu,'
single quote, dash, dash.

02:09:02.760 --> 02:09:05.070
What has just happened
logically, even though we've

02:09:05.070 --> 02:09:06.900
only just begun with SQL today?

02:09:06.900 --> 02:09:12.150
Well, select star from users, where
username equals Malan@harvard.edu, end

02:09:12.150 --> 02:09:13.320
quote.

02:09:13.320 --> 02:09:17.340
What's bad about the rest of this?

02:09:17.340 --> 02:09:19.752
Dash, dash, I claim,
means a comment, which

02:09:19.752 --> 02:09:22.210
means my color coding is going
to be a little blurry again.

02:09:22.210 --> 02:09:25.140
But everything after the
dash, dash is just ignored.

02:09:25.140 --> 02:09:27.440
The logic, then, of
the SQL query, then, is

02:09:27.440 --> 02:09:30.760
to just say, select
Malan@harvard.edu from the database,

02:09:30.760 --> 02:09:33.490
not even checking the password anymore.

02:09:33.490 --> 02:09:36.180
Therefore, you will get
back at least one row.

02:09:36.180 --> 02:09:41.190
So length of rows will equal 1, and so
presumably the rest of the pseudo code

02:09:41.190 --> 02:09:44.190
logs the user in, gives them
access to my my.harvard account,

02:09:44.190 --> 02:09:45.150
or whatever it is.

02:09:45.150 --> 02:09:50.640
And they've pretended to be me simply
by using a single quote and a dash,

02:09:50.640 --> 02:09:52.445
dash in the username field.

02:09:52.445 --> 02:09:54.570
Again, please don't go
start doing this later today

02:09:54.570 --> 02:09:56.140
on Harvard, Yale, or other websites.

02:09:56.140 --> 02:09:57.670
But it could be as simple as that.

02:09:57.670 --> 02:09:58.170
Why?

02:09:58.170 --> 02:10:00.030
Because the programmer
practiced what they

02:10:00.030 --> 02:10:04.110
were taught, which was just to
use curly braces to plug in,

02:10:04.110 --> 02:10:05.560
in f-strings, values.

02:10:05.560 --> 02:10:08.590
But if you don't understand how the
user's input is going to be used,

02:10:08.590 --> 02:10:12.255
and if you don't distrust your users
fundamentally, for every good person

02:10:12.255 --> 02:10:14.130
out there there's going
to be, unfortunately,

02:10:14.130 --> 02:10:19.380
some adversary who just wants to try
to find fault in your code or hack

02:10:19.380 --> 02:10:20.490
into your data set.

02:10:20.490 --> 02:10:22.980
This is what's known as
a SQL injection attack,

02:10:22.980 --> 02:10:27.000
because the user can type something
that happens to be or look like SQL,

02:10:27.000 --> 02:10:31.110
and trick your database into doing
something it didn't intend to,

02:10:31.110 --> 02:10:34.830
like, for instance, logging the user in.

02:10:34.830 --> 02:10:36.880
Worst case, they could
even do something else.

02:10:36.880 --> 02:10:41.490
Maybe the user types a semicolon, then
the word drop, or the word update.

02:10:41.490 --> 02:10:45.180
You could imagine doing semicolon
update table grades, where

02:10:45.180 --> 02:10:49.140
name equals Malan, and set the
grade equal to A instead of B,

02:10:49.140 --> 02:10:50.670
or something like that.

02:10:50.670 --> 02:10:53.550
The ability to inject
SQL into the database

02:10:53.550 --> 02:10:56.820
means you can do anything you want with
the data set, either constructively,

02:10:56.820 --> 02:11:00.000
or worse, destructively.

02:11:00.000 --> 02:11:02.880
And now, just a quick, little
cartoon that should now make sense.

02:11:09.410 --> 02:11:13.020
OK, to, like, one of us, two of us.

02:11:13.020 --> 02:11:14.400
Awkwardly somewhat funny.

02:11:14.400 --> 02:11:16.560
All right, so let's move
on to one last condition.

02:11:16.560 --> 02:11:19.130
There's one other problem
that can go awry here.

02:11:19.130 --> 02:11:20.380
Oh, and I should explain this.

02:11:20.380 --> 02:11:25.500
So this is an allusion to the son,
Robert, having typed in semicolon.

02:11:25.500 --> 02:11:28.810
The word drop, table, students, and
doing some of the same technique.

02:11:28.810 --> 02:11:31.440
This is humor that only
CS people would understand

02:11:31.440 --> 02:11:35.040
because it's the mom realizing,
oh, her son's doing a SQL injection

02:11:35.040 --> 02:11:36.308
attack onto the database.

02:11:36.308 --> 02:11:39.600
Less funny when you explain it, but once
you notice the syntax, that's all this

02:11:39.600 --> 02:11:40.960
is an allusion to.

02:11:40.960 --> 02:11:41.460
All right.

02:11:41.460 --> 02:11:44.850
So one final threat, now
that you are graduating

02:11:44.850 --> 02:11:49.320
to the world of proper databases
and away from CSV files alone.

02:11:49.320 --> 02:11:52.170
Things can go wrong
when using databases,

02:11:52.170 --> 02:11:55.838
and honestly, even using CSV
files if you have multiple users.

02:11:55.838 --> 02:11:57.630
And thus far, you and
I have had the luxury

02:11:57.630 --> 02:12:00.547
in almost every program we've written
that it's just me using my code.

02:12:00.547 --> 02:12:01.830
It's just you using your code.

02:12:01.830 --> 02:12:04.770
And even if your teaching fellow
or TA is using it, probably

02:12:04.770 --> 02:12:06.060
not at the same time.

02:12:06.060 --> 02:12:10.770
But the world gets interesting if you
start putting your code on phones,

02:12:10.770 --> 02:12:14.730
on websites, such that now you might
have two users literally trying

02:12:14.730 --> 02:12:17.430
to log in at the same time,
literally clicking a button

02:12:17.430 --> 02:12:19.410
at the same, or nearly the same time.

02:12:19.410 --> 02:12:22.380
What happens, then, if
a computer is trying

02:12:22.380 --> 02:12:25.290
to handle requests from two
different people at once,

02:12:25.290 --> 02:12:27.480
as might happen all
the time on a website?

02:12:27.480 --> 02:12:29.610
You might get what are
called race conditions.

02:12:29.610 --> 02:12:33.060
And this is a problem in computing in
general, not just with SQL, not just

02:12:33.060 --> 02:12:36.960
with Python, really just any
time you have shared data,

02:12:36.960 --> 02:12:39.150
like a database, as follows.

02:12:39.150 --> 02:12:43.620
This apparently is one of the
most liked Instagram posts ever.

02:12:43.620 --> 02:12:46.110
It is literally just
a picture of an egg.

02:12:46.110 --> 02:12:47.645
Has anyone clicked on this egg?

02:12:47.645 --> 02:12:48.270
Like, a couple?

02:12:48.270 --> 02:12:48.780
Oh, OK.

02:12:48.780 --> 02:12:49.280
Wow.

02:12:49.280 --> 02:12:50.220
All right, so yes.

02:12:50.220 --> 02:12:53.880
So go search for this photo if you'd
like to add to the likes on Instagram.

02:12:53.880 --> 02:12:56.110
The account is world_record_egg.

02:12:56.110 --> 02:12:59.040
This is just a screenshot of
Instagram of that picture of an egg.

02:12:59.040 --> 02:13:00.540
If you're in the habit
of using Instagram,

02:13:00.540 --> 02:13:03.498
or like any social media site, there's
some equivalent of a like button

02:13:03.498 --> 02:13:04.920
or a heart button these days.

02:13:04.920 --> 02:13:06.900
And that's actually a
really hard problem.

02:13:06.900 --> 02:13:10.050
Such a simple idea to count
the number of likes something

02:13:10.050 --> 02:13:12.990
has, but that means
someone has to click on it.

02:13:12.990 --> 02:13:14.910
Your code has to detect the click.

02:13:14.910 --> 02:13:17.860
Your code has to update the database,
and then do it again and again,

02:13:17.860 --> 02:13:22.740
even if multiple people are perhaps
right now clicking on that same egg.

02:13:22.740 --> 02:13:28.350
And unfortunately, bad things can
happen if two people try to do something

02:13:28.350 --> 02:13:30.540
at the same time on a computer.

02:13:30.540 --> 02:13:31.690
How might this happen?

02:13:31.690 --> 02:13:35.670
So here's some more code, half
pseudocode, half Python code here,

02:13:35.670 --> 02:13:36.580
as follows.

02:13:36.580 --> 02:13:40.230
Suppose that what happens when you,
literally, right now, maybe click

02:13:40.230 --> 02:13:43.470
on the like button on
the Instagram post.

02:13:43.470 --> 02:13:47.760
Suppose that code, like the following,
is executed on Facebook servers.

02:13:47.760 --> 02:13:54.190
db.execute of select likes from
posts where ID equals question mark.

02:13:54.190 --> 02:13:54.690
All right.

02:13:54.690 --> 02:13:57.090
So what am I assuming here?

02:13:57.090 --> 02:13:59.280
I'm assuming that that
photograph has a unique ID.

02:13:59.280 --> 02:14:02.670
It's some big integer, whatever
it was, randomly assigned.

02:14:02.670 --> 02:14:05.130
I'm assuming that when
you click on the heart

02:14:05.130 --> 02:14:08.160
the unique ID is somehow
sent to Instagram servers

02:14:08.160 --> 02:14:10.740
so that their code can call it ID.

02:14:10.740 --> 02:14:13.830
And I'm assuming that Instagram
is using its SQL database

02:14:13.830 --> 02:14:17.790
and selecting, from a posts
table, the current number of likes

02:14:17.790 --> 02:14:21.160
of that egg for that given ID number.

02:14:21.160 --> 02:14:21.660
Why?

02:14:21.660 --> 02:14:24.952
Because I need to know how many likes it
already has if I want to add one to it

02:14:24.952 --> 02:14:26.190
and then update the database.

02:14:26.190 --> 02:14:29.710
I need to select the data, then
I need to update the data here.

02:14:29.710 --> 02:14:30.210
All right.

02:14:30.210 --> 02:14:33.780
So in some Python code here,
let's store, in a variable called

02:14:33.780 --> 02:14:37.950
likes, whatever comes back in the
first row from the likes column.

02:14:37.950 --> 02:14:40.660
Again, this is new syntax
specific to our library,

02:14:40.660 --> 02:14:43.830
but a common way of getting back
first row and the column called

02:14:43.830 --> 02:14:44.800
likes therein.

02:14:44.800 --> 02:14:46.920
So at this point in the
story, likes is storing

02:14:46.920 --> 02:14:49.462
the total number of likes, in
the millions or whatever it is,

02:14:49.462 --> 02:14:50.610
of that particular egg.

02:14:50.610 --> 02:14:51.900
Then I do this.

02:14:51.900 --> 02:14:56.400
Execute update posts,
set the number of likes

02:14:56.400 --> 02:15:00.607
equal to this value, where the
ID of the post equals this value.

02:15:00.607 --> 02:15:02.190
What do I want to update the likes to?

02:15:02.190 --> 02:15:06.540
Whatever likes currently is plus
1, and then plugging in the ID.

02:15:06.540 --> 02:15:08.340
So a simple idea, right?

02:15:08.340 --> 02:15:11.730
I'm checking the value of
the likes, and maybe it's 10.

02:15:11.730 --> 02:15:15.030
I'm changing 10 to 11 and
then updating the table.

02:15:15.030 --> 02:15:18.390
But a problem can arise
if two people have

02:15:18.390 --> 02:15:23.010
clicked on that egg at roughly the
same time, or literally, the same time.

02:15:23.010 --> 02:15:24.340
Why is that?

02:15:24.340 --> 02:15:26.850
Well, in the world of
databases and servers,

02:15:26.850 --> 02:15:31.120
and the Instagrams of the world have
thousands of physical servers nowadays.

02:15:31.120 --> 02:15:35.430
So they can support millions,
billions even, of users nowadays.

02:15:35.430 --> 02:15:36.910
What can go wrong?

02:15:36.910 --> 02:15:40.530
Well, typically code like this
is not what we'll call atomic.

02:15:40.530 --> 02:15:44.430
To be atomic means that it all
executes together or not at all.

02:15:44.430 --> 02:15:49.500
Rather, code typically is executed,
as you might imagine, line by line.

02:15:49.500 --> 02:15:53.110
And if your code is running on a server
that multiple people have access to,

02:15:53.110 --> 02:15:55.530
which is absolutely the case
for an app like Instagram,

02:15:55.530 --> 02:15:58.560
if you and I click on the
heart at roughly the same time,

02:15:58.560 --> 02:16:02.160
for efficiency, the computer,
the server, owned by Instagram,

02:16:02.160 --> 02:16:04.500
might execute this line of code for me.

02:16:04.500 --> 02:16:06.537
Then it might execute
this line of code for you.

02:16:06.537 --> 02:16:09.120
Then this line of code for me,
then this line of code for you,

02:16:09.120 --> 02:16:11.703
then this line of code for me,
then this line of code for you.

02:16:11.703 --> 02:16:16.550
That is to say, our queries might
get intermingled chronologically.

02:16:16.550 --> 02:16:19.370
Because it'd be a little obnoxious
if, when you're using Instagram,

02:16:19.370 --> 02:16:22.040
I'm blocked out while you're
interacting with the site.

02:16:22.040 --> 02:16:24.778
It'd be a lot nicer for efficiency
and fairness if somehow they

02:16:24.778 --> 02:16:27.320
do a little bit of work for me,
a little bit of work for you,

02:16:27.320 --> 02:16:30.540
and back and forth, and back and
forth, equitably on the server.

02:16:30.540 --> 02:16:33.320
So that's what typically happens
by default. These lines of code

02:16:33.320 --> 02:16:35.420
get executed independently.

02:16:35.420 --> 02:16:39.770
And they can happen in alternating
order with other users.

02:16:39.770 --> 02:16:41.809
You can get them combined like this.

02:16:41.809 --> 02:16:45.840
Same order top to bottom, but other
things might happen in between.

02:16:45.840 --> 02:16:50.000
So suppose that the number of
likes at the very beginning was 10.

02:16:50.000 --> 02:16:54.080
And suppose that Carter and I both click
on that egg at roughly the same time.

02:16:54.080 --> 02:16:56.420
And suppose this line of
code gets executed for me,

02:16:56.420 --> 02:16:59.870
and that gives me a value
in likes, ultimately, of 10.

02:16:59.870 --> 02:17:03.290
Suppose, then, that the computer takes
a break from dealing with my request,

02:17:03.290 --> 02:17:05.690
does the same code for
Carter, and gets back

02:17:05.690 --> 02:17:07.910
what value for the
current number of likes?

02:17:07.910 --> 02:17:09.170
Also 10 for Carter.

02:17:09.170 --> 02:17:11.209
Because mine has not been recorded yet.

02:17:11.209 --> 02:17:13.910
At this point in the story,
somewhere in the computer's memory

02:17:13.910 --> 02:17:16.219
there's a likes variable
for me, storing 10.

02:17:16.219 --> 02:17:19.400
There's a likes variable
storing 10 for Carter.

02:17:19.400 --> 02:17:21.290
Then this line of code executes for me.

02:17:21.290 --> 02:17:25.219
It updates the database to be likes
plus 1, which stores 11 in the database.

02:17:25.219 --> 02:17:30.440
Then Carter's code is executed,
updating the same row in the database

02:17:30.440 --> 02:17:34.160
to 11, unfortunately.

02:17:34.160 --> 02:17:38.209
Because his value of likes happened
to be the same value of mine.

02:17:38.209 --> 02:17:41.540
And so the metaphor here, that if we
had a refrigerator on stage we would

02:17:41.540 --> 02:17:44.790
actually act out, is something that was
taught to me years ago in an operating

02:17:44.790 --> 02:17:50.299
systems class, whereby the most similar
analogue in the real world would be

02:17:50.299 --> 02:17:52.370
if you've got a mini
fridge in your dorm room.

02:17:52.370 --> 02:17:58.440
And one of you and your roommates comes
home, opens the fridge, and realizes,

02:17:58.440 --> 02:18:01.160
oh, we're out of milk, was
how the story went in my day.

02:18:01.160 --> 02:18:05.184
So you close the refrigerator, and
you walk across the street, go to CVS,

02:18:05.184 --> 02:18:06.559
and get in line to buy some milk.

02:18:06.559 --> 02:18:08.309
Meanwhile, your roommate comes home.

02:18:08.309 --> 02:18:12.924
They, too, inspect the state of your
refrigerator, a.k.a., a variable,

02:18:12.924 --> 02:18:15.049
open the door, and realizes,
oh, we're out of milk.

02:18:15.049 --> 02:18:15.980
I'll go get more milk.

02:18:15.980 --> 02:18:18.020
Close the fridge, go
across the street, and head

02:18:18.020 --> 02:18:20.059
to maybe a different store,
or the line is long enough

02:18:20.059 --> 02:18:21.851
that you don't see each
other at the store.

02:18:21.851 --> 02:18:25.700
So long story short, you both eventually
get home, open the door, and damn it,

02:18:25.700 --> 02:18:27.559
now there's milk from
your other roommate

02:18:27.559 --> 02:18:30.680
there because you both
made a decision on this

02:18:30.680 --> 02:18:35.809
based on the state of a variable
that you independently examined.

02:18:35.809 --> 02:18:37.700
And you didn't somehow communicate.

02:18:37.700 --> 02:18:40.610
Now in the real world, this
is absolutely solvable.

02:18:40.610 --> 02:18:44.120
How would you fix this or avoid
this problem in the real world?

02:18:44.120 --> 02:18:45.770
Literally, own roommate, own fridge.

02:18:45.770 --> 02:18:47.750
AUDIENCE: Text your
roommate [INAUDIBLE]..

02:18:47.750 --> 02:18:48.750
DAVID J. MALAN: Perfect.

02:18:48.750 --> 02:18:50.482
Let them know, so somehow communicate.

02:18:50.482 --> 02:18:52.940
And in fact, the terminology
here would be multiple threads

02:18:52.940 --> 02:18:55.459
can somehow intercommunicate
by having shared state,

02:18:55.459 --> 02:18:57.230
like the iMessage thread on your phone.

02:18:57.230 --> 02:18:58.190
You could leave a note.

02:18:58.190 --> 02:19:01.490
You could, more dramatically,
lock the refrigerator somehow,

02:19:01.490 --> 02:19:05.570
thereby making the milk
purchasing process atomic.

02:19:05.570 --> 02:19:08.389
The fundamental problem is
that for efficiency, again,

02:19:08.389 --> 02:19:11.690
computers tend to
intermingle logic that needs

02:19:11.690 --> 02:19:15.980
to happen when it's happening across
multiple users just for fairness' sake,

02:19:15.980 --> 02:19:17.180
for scheduling sake.

02:19:17.180 --> 02:19:19.820
You need to make sure that all
three of these lines of code

02:19:19.820 --> 02:19:23.269
execute for me, and then
for Carter, and then for you

02:19:23.269 --> 02:19:25.709
if you want to ensure that
this count is correct.

02:19:25.709 --> 02:19:29.219
And for years, when social media
was first getting off the ground,

02:19:29.219 --> 02:19:31.100
this was a super hard problem.

02:19:31.100 --> 02:19:33.920
Twitter used to go down all
of the time, and tweets,

02:19:33.920 --> 02:19:36.260
and retweets were a thing
that were similarly happening

02:19:36.260 --> 02:19:37.490
with a very high frequency.

02:19:37.490 --> 02:19:39.020
These are hard problems to solve.

02:19:39.020 --> 02:19:40.690
And thankfully, there are solutions.

02:19:40.690 --> 02:19:43.440
And we won't get into the weeds
of how you might use these things,

02:19:43.440 --> 02:19:46.100
but know that there are
solutions in the form of things

02:19:46.100 --> 02:19:49.610
called locks, which I use that
word deliberately with the fridge.

02:19:49.610 --> 02:19:53.570
Software locks can allow you to
protect a variable so no one else can

02:19:53.570 --> 02:19:55.550
look at it until you're done with it.

02:19:55.550 --> 02:19:57.770
There are things called
transactions, which

02:19:57.770 --> 02:20:01.100
allow you to do the equivalent of
sending a message to, or really locking

02:20:01.100 --> 02:20:04.130
out your roommate from accessing
that same variable, too,

02:20:04.130 --> 02:20:06.920
but for slightly less amount of time.

02:20:06.920 --> 02:20:08.880
There are solutions to these problems.

02:20:08.880 --> 02:20:12.650
So for instance, in Python,
the same code now in green

02:20:12.650 --> 02:20:14.360
might look a little something like this.

02:20:14.360 --> 02:20:17.150
When you know that something
has to happen all at once,

02:20:17.150 --> 02:20:21.020
altogether, you first begin a
transaction, and you do your thing,

02:20:21.020 --> 02:20:23.707
and then you commit the
transaction at the very end.

02:20:23.707 --> 02:20:25.790
Here, too, though, there's
going to be a downside.

02:20:25.790 --> 02:20:29.902
Typically, the more you use
transactions in this way,

02:20:29.902 --> 02:20:31.610
potentially the higher
the probability is

02:20:31.610 --> 02:20:35.190
that you're going to box someone out or
make Carter's request a little slower.

02:20:35.190 --> 02:20:35.690
Why?

02:20:35.690 --> 02:20:37.482
Because we can't interact
at the same time.

02:20:37.482 --> 02:20:39.920
Or you might make his request
fail if he tries to update

02:20:39.920 --> 02:20:41.870
something that's already been updated.

02:20:41.870 --> 02:20:44.720
So you generally want to
have as few lines of code

02:20:44.720 --> 02:20:47.840
together in between these transactions
so that you get in and you get out.

02:20:47.840 --> 02:20:50.780
And you go to CVS and you get
back really fast so as to not

02:20:50.780 --> 02:20:52.550
cause these kind of performance things.

02:20:52.550 --> 02:20:55.070
So things indeed
escalated quickly today.

02:20:55.070 --> 02:20:58.272
The original goal was just to solve
problems using a different language

02:20:58.272 --> 02:20:59.480
more effectively than Python.

02:20:59.480 --> 02:21:01.772
But as soon as you have these
more powerful techniques,

02:21:01.772 --> 02:21:03.525
a whole new set of problems arises.

02:21:03.525 --> 02:21:05.150
Takes practice to get comfortable with.

02:21:05.150 --> 02:21:09.170
But ultimately, this is all leading
us toward the introduction next week

02:21:09.170 --> 02:21:12.260
of web programming with HTML,
CSS, and some JavaScript.

02:21:12.260 --> 02:21:15.090
The week after, bringing Python
and SQL back into the mix.

02:21:15.090 --> 02:21:16.940
So that by term's end,
we've really now used

02:21:16.940 --> 02:21:19.760
all of these different languages
for what they're best at.

02:21:19.760 --> 02:21:22.843
And over the next few weeks, the goal
is to make sure you're understanding

02:21:22.843 --> 02:21:25.932
and comfortable with what each of
these things is good and bad for.

02:21:25.932 --> 02:21:27.140
Let's go ahead and wrap here.

02:21:27.140 --> 02:21:28.473
I'll stick around for questions.

02:21:28.473 --> 02:21:30.310
We'll see you next time.

02:21:30.310 --> 02:22:04.344
[MUSIC PLAYING]