WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.994 [MUSIC PLAYING] 00:01:22.792 --> 00:01:23.750 DAVID MALAN: All right. 00:01:23.750 --> 00:01:28.230 So this is CS50, and this day we continue our look at Python, 00:01:28.230 --> 00:01:30.330 but also introduce another language. 00:01:30.330 --> 00:01:32.700 This one is called SQL, or Structured Query Language, 00:01:32.700 --> 00:01:35.492 and it's going to allow us to solve some different problems as well 00:01:35.492 --> 00:01:37.260 as some old problems better. 00:01:37.260 --> 00:01:40.330 But to do this, let's try to motivate some of the problems initially. 00:01:40.330 --> 00:01:42.390 So this, of course, is just an example of a spreadsheet. 00:01:42.390 --> 00:01:43.920 Odds are you use these pretty frequently, 00:01:43.920 --> 00:01:46.253 whether it's in the form of Google Spreadsheets or Excel 00:01:46.253 --> 00:01:47.700 or Apple Numbers and the like. 00:01:47.700 --> 00:01:50.700 And this is actually the simplest form of what we might call a database, 00:01:50.700 --> 00:01:54.000 just some kind of application that stores lots of data. 00:01:54.000 --> 00:01:56.910 And odds are if you've ever taken registrations for a student group 00:01:56.910 --> 00:01:59.580 or had people sign up or fill out a form or a survey, 00:01:59.580 --> 00:02:03.270 odds are you've used Google Forms to get your data into a Google Spreadsheet. 00:02:03.270 --> 00:02:05.370 And so we thought we would begin there today, 00:02:05.370 --> 00:02:08.370 a representative problem whereby you might want to collect a lot of data 00:02:08.370 --> 00:02:10.660 from users and then do something with it, 00:02:10.660 --> 00:02:14.670 but ideally more powerfully than you could do as a human on your own. 00:02:14.670 --> 00:02:18.600 So we've put into place a Google Form that quite simply 00:02:18.600 --> 00:02:20.040 asks a couple of questions today. 
00:02:20.040 --> 00:02:21.423 What are your favorite TV shows? 00:02:21.423 --> 00:02:24.090 And we ask you for the title of your favorite show and the genre 00:02:24.090 --> 00:02:26.340 or genres into which that show falls. 00:02:26.340 --> 00:02:29.490 So if you wouldn't mind, go ahead and open up this URL here 00:02:29.490 --> 00:02:31.192 on your laptop or phone. 00:02:31.192 --> 00:02:32.650 If you don't have one, that's fine. 00:02:32.650 --> 00:02:35.790 You can use the person's next to you, if I may. 00:02:35.790 --> 00:02:41.970 And take just a moment to go to that URL there and answer those two questions. 00:02:41.970 --> 00:02:46.890 What is the title of your favorite TV show, and into what genre or genres 00:02:46.890 --> 00:02:48.060 does it fall? 00:02:51.310 --> 00:02:52.720 Favorite TV shows. 00:02:52.720 --> 00:02:59.260 And when you go to that URL, you'll find yourself at, again, this form here, 00:02:59.260 --> 00:03:03.230 asking for your title and for your favorite show. 00:03:03.230 --> 00:03:06.040 So by the look of my own screen here we've 00:03:06.040 --> 00:03:09.640 got some 62 responses, 74 responses coming in already, 00:03:09.640 --> 00:03:10.780 so we can track this live. 00:03:10.780 --> 00:03:12.710 So 111 responses so far. 00:03:12.710 --> 00:03:14.830 That's a lot of favorite TV shows. 00:03:14.830 --> 00:03:16.090 This number's-- keep going up. 00:03:16.090 --> 00:03:20.890 169, up to, next, 191. 00:03:20.890 --> 00:03:22.480 Keep them coming. 00:03:22.480 --> 00:03:23.520 One per person is fine. 00:03:23.520 --> 00:03:25.103 It will give us more than enough data. 00:03:25.103 --> 00:03:26.860 246. 00:03:26.860 --> 00:03:29.300 We'll give it another few seconds, but suffice it to say, 00:03:29.300 --> 00:03:31.190 this is kind of a lot of data. 
00:03:31.190 --> 00:03:32.920 And if the next step in this process were 00:03:32.920 --> 00:03:35.830 for me to analyze the data because I'm curious to know 00:03:35.830 --> 00:03:38.050 what's the most popular show on campus or if I wanted 00:03:38.050 --> 00:03:40.330 to do some other analytical work on this, 00:03:40.330 --> 00:03:42.400 it'll be useful to have all of this data, 00:03:42.400 --> 00:03:44.740 not in Google Form's proprietary interface, 00:03:44.740 --> 00:03:48.520 but in Google Spreadsheet form with rows and columns, a format you're probably 00:03:48.520 --> 00:03:49.420 pretty familiar with. 00:03:49.420 --> 00:03:51.670 So we'll let this counter keep going up and up and up, 00:03:51.670 --> 00:03:53.840 and I'm going to go ahead and click on responses. 00:03:53.840 --> 00:03:55.990 And if we really wanted to, we could look through-- 00:03:55.990 --> 00:03:59.950 The Office is up there, followed by Game of Thrones, Friends, OK. 00:03:59.950 --> 00:04:01.500 Perhaps pretty predictable here. 00:04:01.500 --> 00:04:03.250 Let me go ahead and click this icon, which 00:04:03.250 --> 00:04:06.100 is going to open the Google Spreadsheet version of this. 00:04:06.100 --> 00:04:08.590 And you'll see that there's three columns by default. 00:04:08.590 --> 00:04:12.230 [LAUGHTER] 00:04:12.230 --> 00:04:13.440 Oh, that's funny. 00:04:16.386 --> 00:04:18.341 [APPLAUSE] 00:04:18.341 --> 00:04:18.841 Thank you. 00:04:18.841 --> 00:04:21.787 Thank you. 00:04:21.787 --> 00:04:23.616 OK. 00:04:23.616 --> 00:04:24.530 OK. 00:04:24.530 --> 00:04:25.310 Well-played. 00:04:25.310 --> 00:04:28.790 Now-- so you'll see that by default we got three columns. 00:04:28.790 --> 00:04:31.370 Timestamp, Google Forms just does automatically 00:04:31.370 --> 00:04:33.230 to timestamp the submission of the form. 00:04:33.230 --> 00:04:34.910 Title comes from the first question. 00:04:34.910 --> 00:04:36.617 Genres comes from the second question. 
00:04:36.617 --> 00:04:38.450 And if we kept scrolling and scrolling, odds 00:04:38.450 --> 00:04:40.010 are we're seeing more and more rows. 00:04:40.010 --> 00:04:43.040 Let's propose that we have enough of these rows already, 00:04:43.040 --> 00:04:44.960 and how might we go about analyzing the data? 00:04:44.960 --> 00:04:48.080 Well, if you're pretty expert with Google Spreadsheets or Excel 00:04:48.080 --> 00:04:50.570 or Numbers, maybe you could use some functions that 00:04:50.570 --> 00:04:54.197 are built into these programs or you could do some Command-F or Control-F 00:04:54.197 --> 00:04:56.030 and search for the favorite show that you're 00:04:56.030 --> 00:04:57.890 looking for to try to run some numbers. 00:04:57.890 --> 00:04:59.940 But we have programming abilities nowadays. 00:04:59.940 --> 00:05:02.480 And if you haven't used this feature already, 00:05:02.480 --> 00:05:04.710 if you go to Google Spreadsheet's file menu, 00:05:04.710 --> 00:05:06.960 you can go to download, and you can download your data 00:05:06.960 --> 00:05:07.970 into a bunch of formats. 00:05:07.970 --> 00:05:09.762 And the most useful one I'd propose for now 00:05:09.762 --> 00:05:12.770 is going to be this one, Comma-Separated Values or CSV. 00:05:12.770 --> 00:05:14.100 And we've seen this before. 00:05:14.100 --> 00:05:15.800 In fact, you've generated this before. 00:05:15.800 --> 00:05:18.920 So if I go ahead and do that, it pretty quickly downloaded a CSV file 00:05:18.920 --> 00:05:22.700 that, if I really wanted to, I could open in Excel or Apple Numbers or the like. 00:05:22.700 --> 00:05:26.460 But let's do something instead with that, and go ahead and put it into, 00:05:26.460 --> 00:05:30.870 for instance, my IDE so that I can actually do something with it. 00:05:30.870 --> 00:05:33.480 I'm going to grab my downloaded file here. 00:05:33.480 --> 00:05:37.130 And I'm going to go ahead and drag that into the IDE. 
00:05:37.130 --> 00:05:39.680 And so now you'll see a whole bunch of files here. 00:05:39.680 --> 00:05:40.850 Let me put-- whoops. 00:05:40.850 --> 00:05:43.102 Let me put this right in the root of my folder. 00:05:43.102 --> 00:05:44.310 And we'll see this file here. 00:05:44.310 --> 00:05:46.910 And if I double-click on it, you'll see exactly the same data, 00:05:46.910 --> 00:05:48.185 albeit in CSV format. 00:05:48.185 --> 00:05:50.810 So it looks the same, but we have commas separating the values. 00:05:50.810 --> 00:05:53.580 And Google Spreadsheets did that for us automatically. 00:05:53.580 --> 00:05:55.910 So let's now write a program that analyzes this data 00:05:55.910 --> 00:05:59.360 and see if we can't make sense of what some of the most popular values 00:05:59.360 --> 00:06:00.570 are in there. 00:06:00.570 --> 00:06:03.510 Let me go ahead and close this, open a new file, 00:06:03.510 --> 00:06:07.150 and I'm going to go ahead and call it, for instance, favorites.py, 00:06:07.150 --> 00:06:09.770 since the goal is to figure out everyone's favorites here. 00:06:09.770 --> 00:06:12.920 And let me go ahead and use some syntax from last week 00:06:12.920 --> 00:06:15.230 that you might be familiar with now from P set 6. 00:06:15.230 --> 00:06:17.390 Let me go ahead and import CSV, the module, 00:06:17.390 --> 00:06:20.600 so that I have access to that library of functionality. 00:06:20.600 --> 00:06:22.238 Then let me do this with Open. 00:06:22.238 --> 00:06:25.280 And then I'm going to go ahead and open this file, and I could rename it, 00:06:25.280 --> 00:06:28.220 but you know, I can just grab the whole file name this way, too. 00:06:28.220 --> 00:06:32.360 It's a little excessive, but I'll go ahead and paste that in in read mode. 00:06:32.360 --> 00:06:34.530 And that gives me a file. 00:06:34.530 --> 00:06:36.990 Let me go ahead and zoom out just a little bit there. 
00:06:36.990 --> 00:06:40.430 And once I have this file, I can throw it in to a dict reader, 00:06:40.430 --> 00:06:42.620 for instance, to make it easier to read the CSV. 00:06:42.620 --> 00:06:45.080 So reader gets CSV.DictReader. 00:06:45.080 --> 00:06:47.000 Pass in that file. 00:06:47.000 --> 00:06:48.890 And then after this, I have the ability, now, 00:06:48.890 --> 00:06:54.170 to iterate over each row like this, much like you probably did for DNA 00:06:54.170 --> 00:06:56.400 or one or more other problems, as well. 00:06:56.400 --> 00:06:59.390 Now, just as a sanity check, let me just go ahead and print out 00:06:59.390 --> 00:07:00.870 each of the row's titles. 00:07:00.870 --> 00:07:02.630 So I'm not going to see anything new here, 00:07:02.630 --> 00:07:05.720 but this will be a little sanity check that I'm writing correct code. 00:07:05.720 --> 00:07:09.930 So run Python of favorites.py, and voila, fwoom, there's all the data. 00:07:09.930 --> 00:07:13.270 So The Office is both at the beginning and at the end here, it seems. 00:07:13.270 --> 00:07:16.520 All right, so that's all fine and good, a nice stepping stone, but let's start 00:07:16.520 --> 00:07:19.580 to analyze this data, this time using some code. 00:07:19.580 --> 00:07:22.850 So let me go ahead and say this. 00:07:22.850 --> 00:07:25.580 Suppose I wanted to keep track and determine 00:07:25.580 --> 00:07:27.530 what the most popular show is. 00:07:27.530 --> 00:07:30.480 So I've got this big CSV file with title, title, title, title. 00:07:30.480 --> 00:07:32.480 Invariably, there's going to be some duplicates. 00:07:32.480 --> 00:07:35.690 A bunch of you probably like The Office or some other show, 00:07:35.690 --> 00:07:38.180 and so it'd be nice to kind of aggregate that data. 
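The loop just described can be sketched as follows. This is a minimal sketch, not the lecture's exact file: a tiny hand-written favorites.csv stands in for the downloaded export, and its column names (title, genres) are assumed to match the form's questions.

```python
import csv

# Stand-in for the downloaded Google Forms export (file name and rows
# here are assumptions, not the actual lecture data).
with open("favorites.csv", "w") as file:
    file.write("Timestamp,title,genres\n")
    file.write("10/21/2019,The Office,Comedy\n")
    file.write("10/21/2019,Friends,Comedy\n")

# Open the file in read mode, wrap it in a DictReader, and collect each
# row's title, then print them as a sanity check.
with open("favorites.csv", "r") as file:
    reader = csv.DictReader(file)
    titles = [row["title"] for row in reader]

for title in titles:
    print(title)
```

DictReader uses the file's first row as field names, which is why each row can be indexed by the column header rather than by position.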
00:07:38.180 --> 00:07:43.080 What might be a useful data structure in Python whereby for each row 00:07:43.080 --> 00:07:45.740 we can use this data structure to just kind of do 00:07:45.740 --> 00:07:49.490 the equivalent of like 1, 2, 3, and count each 00:07:49.490 --> 00:07:51.720 of the occurrences of those titles? 00:07:51.720 --> 00:07:56.030 Any-- OK, I heard a hash table, but more properly in Python? 00:07:56.030 --> 00:07:56.950 A dictionary. 00:07:56.950 --> 00:07:59.410 So a dictionary is kind of the Swiss army 00:07:59.410 --> 00:08:02.140 knife of data structures for a lot of people in the sense 00:08:02.140 --> 00:08:05.590 that it allows you to store keys and values, 00:08:05.590 --> 00:08:08.900 and those keys can be strings, and those values can be anything you want, 00:08:08.900 --> 00:08:11.763 including, for instance, in this case, numbers, like a counter, 00:08:11.763 --> 00:08:13.930 it's a really versatile data structure when you just 00:08:13.930 --> 00:08:17.000 want to throw some data into a data structure and do something with it. 00:08:17.000 --> 00:08:20.110 So let me go ahead and give myself a dictionary at the top of the file. 00:08:20.110 --> 00:08:23.110 I can do this in a couple of ways, and if my goal is to store the counts 00:08:23.110 --> 00:08:25.420 of each of these titles, I could call it counts, 00:08:25.420 --> 00:08:30.130 and I could set it equal to dict(), which is just a function that hands me 00:08:30.130 --> 00:08:31.450 an empty dictionary. 00:08:31.450 --> 00:08:33.730 Or, recall, I can use this syntax, which is 00:08:33.730 --> 00:08:35.919 equivalent to just shorthand notation. 00:08:35.919 --> 00:08:40.210 But what I want to do now inside of my loop, every time I 00:08:40.210 --> 00:08:44.210 iterate through this file, I want to check if my-- 00:08:44.210 --> 00:08:47.630 I've seen this title before, I want to add 1 to my counter. 
00:08:47.630 --> 00:08:52.808 But if I've never seen this title before, what do I want to do instead? 00:08:52.808 --> 00:08:53.308 Yeah? 00:08:53.308 --> 00:08:55.117 AUDIENCE: [INAUDIBLE] 00:08:55.117 --> 00:08:56.950 DAVID MALAN: Add the title to the dictionary 00:08:56.950 --> 00:09:00.110 as a key with the value of what, the first time I see it? 00:09:00.110 --> 00:09:01.970 Well, the first time I see it. 00:09:01.970 --> 00:09:02.470 AUDIENCE: 1. 00:09:02.470 --> 00:09:03.940 DAVID MALAN: So we can just initialize it to 1. 00:09:03.940 --> 00:09:05.290 So there's two scenarios here. 00:09:05.290 --> 00:09:08.500 Either we've seen this title before and we want to do plus 1, 00:09:08.500 --> 00:09:11.830 or we've not seen the title before and we just want to set it equal to 1. 00:09:11.830 --> 00:09:13.630 So how might we express that logic? 00:09:13.630 --> 00:09:16.190 Well, let me go ahead and say, first of all, 00:09:16.190 --> 00:09:18.280 let's get the title into a variable of its own, 00:09:18.280 --> 00:09:20.740 just to shorten our syntax a little bit. 00:09:20.740 --> 00:09:25.040 And then I can say something like this, if title in counts-- 00:09:25.040 --> 00:09:26.300 well, what does that mean? 00:09:26.300 --> 00:09:29.140 That means that inside of my counts dictionary, 00:09:29.140 --> 00:09:34.300 I can index into it at that location, and I can increment whatever is there 00:09:34.300 --> 00:09:38.830 by 1, assuming on some previous iteration, 00:09:38.830 --> 00:09:44.738 else I went ahead and did counts of title equals 1. 00:09:44.738 --> 00:09:46.030 So these are the two scenarios. 00:09:46.030 --> 00:09:48.940 If I've seen this title before, go ahead and add 1 00:09:48.940 --> 00:09:50.560 to its value in the dictionary. 00:09:50.560 --> 00:09:53.960 Otherwise, just default to I have seen it one time. 
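The two scenarios just described can be sketched like this; the list of titles is a stand-in for the rows read from the CSV (in the actual program, title would come from row["title"]):

```python
# Count occurrences of each title in a dictionary: add 1 if we've seen
# the title before, otherwise initialize its count to 1.
counts = {}
for title in ["The Office", "Friends", "The Office", "The Office"]:
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
```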
00:09:53.960 --> 00:09:56.710 So some of the shows are going to stay stuck at one if they're not 00:09:56.710 --> 00:09:58.502 very popular among the audience, or they're 00:09:58.502 --> 00:10:00.703 going to keep getting plus 1, plus 1, plus 1. 00:10:00.703 --> 00:10:02.620 So what I often do when writing code like this 00:10:02.620 --> 00:10:05.450 is, as I'm sort of finding my way, after I'm done with this, 00:10:05.450 --> 00:10:07.075 I'm not going to write the whole thing. 00:10:07.075 --> 00:10:10.070 Let me go ahead and just, at the end of this, print counts, right? 00:10:10.070 --> 00:10:11.570 I'm just curious to see what it is. 00:10:11.570 --> 00:10:13.570 I'm just going to go ahead and add print counts. 00:10:13.570 --> 00:10:16.060 Let me go ahead and run now Python of favorites. 00:10:16.060 --> 00:10:19.690 And voila, you see a huge, messy data structure. 00:10:19.690 --> 00:10:21.520 But let's go to the very top of it. 00:10:21.520 --> 00:10:25.210 Notice that it starts with curly braces, indicating it's, indeed, a dictionary. 00:10:25.210 --> 00:10:28.480 It starts with a quoted value, which is one of the titles in it, 00:10:28.480 --> 00:10:31.450 and it looks like two of you said Dynasty is your favorite show. 00:10:31.450 --> 00:10:33.790 25 of you said The Office is your favorite show. 00:10:33.790 --> 00:10:36.340 One of you said Blindspot is your favorite show. 00:10:36.340 --> 00:10:37.210 One of you, 24. 00:10:37.210 --> 00:10:39.060 Friends, 19, and so forth. 00:10:39.060 --> 00:10:40.810 So the data is kind of all over the place, 00:10:40.810 --> 00:10:43.570 because it's based on the order in which you all input it. 00:10:43.570 --> 00:10:45.487 So let's just clean this up a little bit. 00:10:45.487 --> 00:10:47.570 And let me go ahead and print this out as follows. 00:10:47.570 --> 00:10:49.660 Instead of just printing a big, messy dictionary, 00:10:49.660 --> 00:10:52.990 why don't I go ahead and iterate over the dictionary? 
00:10:52.990 --> 00:10:55.220 And we didn't see this necessarily last week, 00:10:55.220 --> 00:10:57.053 but you can actually do something like this. 00:10:57.053 --> 00:11:01.480 If your dictionary's storing titles as keys and counts as values, 00:11:01.480 --> 00:11:06.640 we can do something fancy like for title comma count in counts, 00:11:06.640 --> 00:11:10.090 but if you want to iterate over not just the keys but all of the items therein, 00:11:10.090 --> 00:11:12.250 there's this other function in Python associated 00:11:12.250 --> 00:11:15.370 with a dictionary that will hand you back two things at once, 00:11:15.370 --> 00:11:16.763 key and value together. 00:11:16.763 --> 00:11:18.430 And you can call them anything you want. 00:11:18.430 --> 00:11:21.470 I'm calling them title comma count, respectively. 00:11:21.470 --> 00:11:25.300 Now, let me go ahead and print out, for instance, something like the title, 00:11:25.300 --> 00:11:28.340 and then let me go ahead and just print out the count right next to it. 00:11:28.340 --> 00:11:32.540 So now, if I save this and run favorites.py, OK, it's prettier, 00:11:32.540 --> 00:11:35.530 but it's still a little messy in that we have-- 00:11:35.530 --> 00:11:38.300 the numbers are just right there after the name. 00:11:38.300 --> 00:11:39.890 So let's go ahead and tidy this up. 00:11:39.890 --> 00:11:41.890 Recall that when you print two things with print 00:11:41.890 --> 00:11:44.170 separating them with a comma, what character 00:11:44.170 --> 00:11:47.380 did you get automatically in between them? 00:11:47.380 --> 00:11:48.560 Yeah, just a single space. 00:11:48.560 --> 00:11:50.630 We saw that at the very beginning of last week. 00:11:50.630 --> 00:11:53.740 But we can override that separator, otherwise known as sep, 00:11:53.740 --> 00:11:56.950 which has a default value of quote-unquote with a space. 
00:11:56.950 --> 00:12:00.293 But suppose I just want to give some visual delineation here, 00:12:00.293 --> 00:12:03.460 maybe like a vertical bar to kind of sort of make it look like a spreadsheet 00:12:03.460 --> 00:12:04.000 again. 00:12:04.000 --> 00:12:05.590 Let me go ahead and run this. 00:12:05.590 --> 00:12:07.600 And now it's not super, super pretty, but you 00:12:07.600 --> 00:12:12.700 can see title followed by a vertical bar followed by the number 1. 00:12:12.700 --> 00:12:16.243 So this is all fine and good, but it'd be nice, for instance, 00:12:16.243 --> 00:12:18.160 if we could sort these, because I don't really 00:12:18.160 --> 00:12:22.000 want to sift through this visually, looking for all of the possible values. 00:12:22.000 --> 00:12:25.660 So you know, the easiest way to sort these things might be like this. 00:12:25.660 --> 00:12:28.960 Instead of passing in counts of items, there's 00:12:28.960 --> 00:12:31.930 another function in Python where you can literally just sort something 00:12:31.930 --> 00:12:32.590 with sorted. 00:12:32.590 --> 00:12:34.850 You don't have to worry about bubble sort, selection sort, insertion sort, 00:12:34.850 --> 00:12:36.160 merge sort, anything. 00:12:36.160 --> 00:12:37.867 You just sort it, and Python 00:12:37.867 --> 00:12:39.700 will use one of the fastest implementations 00:12:39.700 --> 00:12:41.720 of sorting available to it. 00:12:41.720 --> 00:12:44.020 Now, if I rerun this, we'll see that-- 00:12:44.020 --> 00:12:45.220 [CHUCKLES] OK. 00:12:45.220 --> 00:12:47.350 We'll see that we have all of these values 00:12:47.350 --> 00:12:51.170 here starting with 24, 9009, Adventure Time. 00:12:51.170 --> 00:12:53.420 But that's not what I meant. 00:12:53.420 --> 00:12:57.160 I wanted to sort not by title, but by what? 00:12:57.160 --> 00:12:58.250 Count. 00:12:58.250 --> 00:13:00.550 So this gets a little less obvious. 
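A short sketch of that items() loop and the sep override, with a couple of stand-in counts:

```python
counts = {"The Office": 25, "Friends": 19}  # stand-in counts

# items() hands back each key-value pair together; print() separates its
# arguments with a single space unless sep is overridden, here with a bar.
for title, count in counts.items():
    print(title, count, sep=" | ")

# The same formatting, collected as strings for inspection.
lines = [f"{title} | {count}" for title, count in counts.items()]
```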
00:13:00.550 --> 00:13:04.390 And it turns out Python does this in a couple of possible ways. 00:13:04.390 --> 00:13:06.640 But let me go ahead and introduce this one first. 00:13:06.640 --> 00:13:11.170 It turns out that this sorted function takes an optional argument that 00:13:11.170 --> 00:13:12.910 happens to be called key. 00:13:12.910 --> 00:13:18.280 And key simply specifies, how do you want to sort 00:13:18.280 --> 00:13:20.620 the dictionaries that you're being-- 00:13:20.620 --> 00:13:23.680 or how do you want to sort the key value pairs that are being passed? 00:13:23.680 --> 00:13:26.540 By default, it literally uses the key, so title. 00:13:26.540 --> 00:13:29.960 But if you want to use the value, you know what you can do, is this. 00:13:29.960 --> 00:13:33.420 Let me go ahead, and up here, even though this is a little weird, 00:13:33.420 --> 00:13:35.920 and let me go ahead and say the following. 00:13:35.920 --> 00:13:37.960 Def function item. 00:13:37.960 --> 00:13:38.617 So def f. 00:13:38.617 --> 00:13:40.450 I'm just giving myself a pretty generic name 00:13:40.450 --> 00:13:43.075 for a function, because I'm going to get rid of it in a moment. 00:13:43.075 --> 00:13:46.780 I'm going to go ahead and return item bracket 1. 00:13:46.780 --> 00:13:48.080 Now, why is this? 00:13:48.080 --> 00:13:49.780 Take a look at this line here. 00:13:49.780 --> 00:13:53.500 And even though sorted is new today, as is this counts.items(), 00:13:53.500 --> 00:13:58.490 all items does is return key comma value, and sorted sorts then by key, 00:13:58.490 --> 00:14:02.330 by default, handing them back as title comma count, or again, 00:14:02.330 --> 00:14:03.930 I could call these anything I want. 00:14:03.930 --> 00:14:07.700 But I'm calling them, more semantically usefully, title comma counts. 00:14:07.700 --> 00:14:09.890 So that's giving me back two things. 
00:14:09.890 --> 00:14:13.370 Well, just like an array in C or a list in Python, 00:14:13.370 --> 00:14:16.850 if you're getting two things at once, the first of those things, 00:14:16.850 --> 00:14:19.790 conventionally, is going to be accessible in bracket zero. 00:14:19.790 --> 00:14:22.190 The second thing is going to be accessible in bracket 1. 00:14:22.190 --> 00:14:27.350 So this super, super simple function called f takes, as input, an item. 00:14:27.350 --> 00:14:30.050 And if that item has two values, a key and a value, 00:14:30.050 --> 00:14:32.270 return item one is just literally going to return 00:14:32.270 --> 00:14:33.710 the second thing, whatever it is. 00:14:33.710 --> 00:14:36.530 In this case, it's going to return, of course, the value. 00:14:36.530 --> 00:14:38.160 Now, why is this useful? 00:14:38.160 --> 00:14:43.260 Well, because the sorted function takes an additional argument called key, 00:14:43.260 --> 00:14:46.040 you can actually pass it the name of a function 00:14:46.040 --> 00:14:51.620 to use in order to determine how to sort the previous items. 00:14:51.620 --> 00:14:53.750 So this is an example of a really powerful feature 00:14:53.750 --> 00:14:57.860 of Python and other higher-level languages, in which case 00:14:57.860 --> 00:15:01.670 you can pass in, as arguments, the names of functions. 00:15:01.670 --> 00:15:02.720 I'm not calling f. 00:15:02.720 --> 00:15:04.340 I'm just passing in its name. 00:15:04.340 --> 00:15:07.520 If I were calling it, I'd have open paren and a closed paren, 00:15:07.520 --> 00:15:09.620 but I'm just passing it in by name. 00:15:09.620 --> 00:15:13.050 And now, this is going to, by default, sort things from small to large. 00:15:13.050 --> 00:15:15.050 Turns out if you read the documentation, there's 00:15:15.050 --> 00:15:19.430 another argument to sorted called reverse that will invert it so that now 00:15:19.430 --> 00:15:21.780 you'll get largest to smallest. 
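Putting those pieces together, a sketch of sorting by value with the named key function just described (the counts themselves are stand-ins):

```python
counts = {"Adventure Time": 1, "The Office": 25, "Friends": 19}  # stand-ins

# Each item is a (key, value) pair, so item[1] is the count; passing f
# by name (no parentheses) tells sorted() to order the items by that
# value, and reverse=True flips the order to largest-to-smallest.
def f(item):
    return item[1]

sorted_items = sorted(counts.items(), key=f, reverse=True)
```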
00:15:21.780 --> 00:15:23.160 So now let's try this. 00:15:23.160 --> 00:15:25.460 Let me go ahead and rerun Python of favorites.py 00:15:25.460 --> 00:15:29.030 enter, and now, if I scroll up to the top, 00:15:29.030 --> 00:15:33.680 we should see The Office is indeed the most popular show, apparently 00:15:33.680 --> 00:15:35.660 consistent with Netflix's data these days, too, 00:15:35.660 --> 00:15:38.702 followed by Friends, Game of Thrones, Breaking Bad, Rick and Morty, Black 00:15:38.702 --> 00:15:39.932 Mirror, and so forth. 00:15:39.932 --> 00:15:42.890 And the least popular show among you, although there looks like there's 00:15:42.890 --> 00:15:44.735 a lot of ties, Demon Slayer. 00:15:44.735 --> 00:15:47.890 [CHUCKLES] Or Game of Thrones, but GOT. 00:15:47.890 --> 00:15:49.880 So interestingly, there are some-- 00:15:49.880 --> 00:15:51.420 [LAUGHTER] 00:15:51.420 --> 00:15:52.320 There's-- OK. 00:15:54.930 --> 00:15:55.520 OK. 00:15:55.520 --> 00:15:57.950 So here, we've gone and sorted these values. 00:15:57.950 --> 00:16:00.680 Now, let me go ahead and clean this up a little bit. 00:16:00.680 --> 00:16:05.510 Much like in C where it's kind of silly to define a variable 00:16:05.510 --> 00:16:08.270 and then use it only in one place, similarly in Python, 00:16:08.270 --> 00:16:11.660 if you're going to define a function, but then only use it in one place, 00:16:11.660 --> 00:16:12.920 you don't actually need to. 00:16:12.920 --> 00:16:17.248 There's a fancy one-line equivalent where you can actually say this. 00:16:17.248 --> 00:16:19.790 And you won't see this too often, but it's a handy thing just 00:16:19.790 --> 00:16:21.170 to have seen at least once. 00:16:21.170 --> 00:16:25.850 You can define a function called a lambda function that takes in as input 00:16:25.850 --> 00:16:30.240 any argument like this, and then you can literally just return that value. 
00:16:30.240 --> 00:16:34.175 So if I go ahead and get rid of this function f and replace it with this, 00:16:34.175 --> 00:16:37.730 the syntax is super weird-looking at first glance, 00:16:37.730 --> 00:16:40.520 but lambda is the term of art in computer 00:16:40.520 --> 00:16:42.037 science for an anonymous function. 00:16:42.037 --> 00:16:43.370 I don't care that it's called f. 00:16:43.370 --> 00:16:44.720 I'm never going to use it again. 00:16:44.720 --> 00:16:47.360 So lambda just means give me a function. 00:16:47.360 --> 00:16:51.080 Item means that is the input to this function, the argument, and the return 00:16:51.080 --> 00:16:54.080 value, or output, is going to be item bracket 1. 00:16:54.080 --> 00:16:55.580 So it's a little arcane. 00:16:55.580 --> 00:16:58.935 Odds are you'll have to double check the documentation for using this yourself 00:16:58.935 --> 00:17:00.560 in the future, but it's the same thing. 00:17:00.560 --> 00:17:02.720 And indeed, if I rerun the program, I'm going 00:17:02.720 --> 00:17:08.030 to get back the exact same results, this time with the items still sorted 00:17:08.030 --> 00:17:09.589 like that. 00:17:09.589 --> 00:17:14.569 All right, any questions on the code here, the features of the syntax? 00:17:14.569 --> 00:17:16.160 Anything at all? 00:17:16.160 --> 00:17:17.261 Yeah. 00:17:17.261 --> 00:17:20.303 AUDIENCE: [INAUDIBLE] 00:17:20.303 --> 00:17:22.970 DAVID MALAN: Yes, if you're writing a lambda function like this, 00:17:22.970 --> 00:17:24.770 you are indeed limited to one line. 00:17:24.770 --> 00:17:27.187 That was OK, because f, even though it was two lines, only 00:17:27.187 --> 00:17:28.550 had one line of actual content. 00:17:28.550 --> 00:17:31.900 But yes, it has to be one line, indeed. 00:17:31.900 --> 00:17:32.400 All right. 00:17:32.400 --> 00:17:36.360 So the GOT thing opens up some interesting messiness, right? 
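The lambda version of the same sort, equivalent to the named function f from the lecture (stand-in counts again):

```python
counts = {"Adventure Time": 1, "The Office": 25, "Friends": 19}  # stand-ins

# An anonymous (lambda) function in place of the named f: it takes one
# argument, item, and returns item[1], the count.
sorted_items = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```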
00:17:36.360 --> 00:17:39.990 Because GOT presumably denoted Game of Thrones, and yet most of you 00:17:39.990 --> 00:17:42.235 wrote Game of Thrones, capitalized properly, 00:17:42.235 --> 00:17:44.610 and yet, it'd be nice if we could have some equivalences. 00:17:44.610 --> 00:17:48.300 And honestly, I'm guessing there's a bunch of capitalization differences, 00:17:48.300 --> 00:17:48.960 too. 00:17:48.960 --> 00:17:50.970 So you know what I might be inclined to do? 00:17:50.970 --> 00:17:54.390 What might be the easiest way for us to standardize 00:17:54.390 --> 00:17:58.710 what we're calling these shows' titles that would be a little more 00:17:58.710 --> 00:18:01.270 resilient against some of you typing with capitals, 00:18:01.270 --> 00:18:03.300 some of you typing with lower case? 00:18:03.300 --> 00:18:06.130 Let's ignore the GOT problem, though. 00:18:06.130 --> 00:18:09.022 Yeah, so make them all lowercase, or heck, make them all upper case. 00:18:09.022 --> 00:18:10.980 It doesn't matter so long as you're consistent. 00:18:10.980 --> 00:18:11.688 So you know what? 00:18:11.688 --> 00:18:13.740 Before I get title from the row, let me go ahead 00:18:13.740 --> 00:18:18.450 and force it all to lower case, and then let me go ahead and-- 00:18:18.450 --> 00:18:24.160 in my-- go ahead and force it to lower case there. 00:18:24.160 --> 00:18:26.070 So no matter what the input is, I'm going 00:18:26.070 --> 00:18:29.790 to go ahead and force it on entry to lower case. 00:18:29.790 --> 00:18:33.490 Now, if I do this and scroll back up, looks like our counts went up a little, 00:18:33.490 --> 00:18:33.990 right? 00:18:33.990 --> 00:18:35.580 The Office is now at 27. 00:18:35.580 --> 00:18:37.080 Game of Thrones is now at 21. 00:18:37.080 --> 00:18:38.290 Friends is at 20. 00:18:38.290 --> 00:18:41.220 So it looks like some of you, kind of, reasonably if sloppily, 00:18:41.220 --> 00:18:44.272 didn't capitalize it exactly as the show officially is called. 
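Forcing titles to lowercase on entry, sketched here with stand-in rows, collapses capitalization variants onto a single dictionary key:

```python
# Normalize each title to lowercase before counting so that "The Office",
# "the office", and "THE OFFICE" all land on the same key.
counts = {}
for title in ["The Office", "the office", "THE OFFICE", "Friends"]:
    title = title.lower()
    if title in counts:
        counts[title] += 1
    else:
        counts[title] = 1
```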
00:18:44.272 --> 00:18:45.480 But that should be OK, right? 00:18:45.480 --> 00:18:47.700 This is all about UX or user experience. 00:18:47.700 --> 00:18:51.930 This should be one stepping stone toward making your program much, much more 00:18:51.930 --> 00:18:54.780 resilient against user input. 00:18:54.780 --> 00:18:55.530 But you know what? 00:18:55.530 --> 00:18:57.240 What if there were a better way to do this, right? 00:18:57.240 --> 00:18:59.490 Instead of writing code, this feels pretty tedious. 00:18:59.490 --> 00:19:01.948 Could we not just use a different tool, different language, 00:19:01.948 --> 00:19:03.160 and achieve these same goals? 00:19:03.160 --> 00:19:05.395 Today's answer is indeed going to be yes, with SQL. 00:19:05.395 --> 00:19:06.270 But first a question. 00:19:06.270 --> 00:19:06.810 AUDIENCE: [INAUDIBLE] 00:19:06.810 --> 00:19:07.160 DAVID MALAN: No? 00:19:07.160 --> 00:19:07.920 No question. 00:19:07.920 --> 00:19:11.520 All right, so then the segue is exactly this. 00:19:11.520 --> 00:19:14.940 I'm going to go ahead now and close favorites.py. 00:19:14.940 --> 00:19:18.010 And I'm just going to maximize the size of my terminal window here. 00:19:18.010 --> 00:19:21.840 And I'm going to go ahead and run a program called SQLite3. 00:19:21.840 --> 00:19:25.590 SQLite3 is a command line program that you can install on your Mac or PC-- 00:19:25.590 --> 00:19:27.240 it's already on CS50 IDE-- 00:19:27.240 --> 00:19:31.470 that allows you to use another language called SQL interactively. 00:19:31.470 --> 00:19:34.170 And it's got this really cool feature whereby 00:19:34.170 --> 00:19:40.230 if I have this file called, very verbosely, CS50 2019 Lecture 7, 00:19:40.230 --> 00:19:43.020 and so forth, I'm going to copy that after highlighting it. 00:19:43.020 --> 00:19:47.730 I'm going to run SQLite3, and I'm going to put SQLite3 into .mode CSV. 
00:19:47.730 --> 00:19:51.180 It's a weird syntax, .mode CSV, but you don't have to use these commands very 00:19:51.180 --> 00:19:51.840 often. 00:19:51.840 --> 00:19:54.840 Then I'm going to use one other command here, which is called import. 00:19:54.840 --> 00:19:58.450 And I'm going to go ahead and import this into a database. 00:19:58.450 --> 00:20:00.175 But what database, exactly? 00:20:00.175 --> 00:20:03.050 Well, let me go ahead and more specifically-- let me exit out of this 00:20:03.050 --> 00:20:03.550 here. 00:20:03.550 --> 00:20:04.170 Whoops. 00:20:04.170 --> 00:20:08.850 Let me exit out of this here and run SQLite3 in the name of a database. 00:20:08.850 --> 00:20:10.830 I'm dealing with favorite shows, so maybe I'll 00:20:10.830 --> 00:20:14.040 call this favorites.db, DB denoting database. 00:20:14.040 --> 00:20:17.730 I'm going to now go ahead and do .mode CSV. 00:20:17.730 --> 00:20:20.550 I'm going to now do .import, quoting the name of the file, 00:20:20.550 --> 00:20:23.550 and quotes are important, because it's got some ugly spaces in it, 00:20:23.550 --> 00:20:26.940 and I'm going to import this into a table, kind of like a spreadsheet, 00:20:26.940 --> 00:20:29.700 that I'm just going to call in all lowercase favorites. 00:20:29.700 --> 00:20:30.750 And hit Enter. 00:20:30.750 --> 00:20:32.410 Nothing seems to happen now. 00:20:32.410 --> 00:20:35.650 But if I go back to my terminal window and type ls, 00:20:35.650 --> 00:20:39.780 notice that in addition to my CSV file and favorites.py and src7, 00:20:39.780 --> 00:20:41.970 which I downloaded in advance with today's code, 00:20:41.970 --> 00:20:44.820 I now have this file called favorites.db. 00:20:44.820 --> 00:20:50.140 And if I rerun SQLite, it turns out when I pass in the name of that database, 00:20:50.140 --> 00:20:52.920 now I can ask the same kinds of questions as before, 00:20:52.920 --> 00:20:55.590 but I can do it with a different syntax. 
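What the .mode csv and .import commands do at the sqlite3 prompt can be approximated with Python's built-in sqlite3 module. This is only an illustrative sketch of the same idea; the lecture itself used the command-line program, and the table and column names below are assumptions.

```python
# Programmatic equivalent of sqlite3's ".mode csv" and ".import":
# read a CSV file and load its rows into a table called "favorites".
import csv
import sqlite3

def import_csv(csv_path, db_path):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS favorites (title TEXT, genres TEXT)")
    with open(csv_path, newline="") as file:
        reader = csv.DictReader(file)
        # Parameterized INSERT, one row per line of the CSV.
        db.executemany(
            "INSERT INTO favorites (title, genres) VALUES (?, ?)",
            ((row["title"], row["genres"]) for row in reader),
        )
    db.commit()
    return db
```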
00:20:55.590 --> 00:20:59.670 I can go ahead and select all the titles from my favorites 00:20:59.670 --> 00:21:04.350 by typing this command, SELECT title FROM favorites; Enter, and voila, 00:21:04.350 --> 00:21:06.010 we get all of the titles there. 00:21:06.010 --> 00:21:08.250 Suppose I want to sort those titles as before, 00:21:08.250 --> 00:21:14.230 or I can SELECT title FROM favorites, ORDER BY title; ending with-- 00:21:14.230 --> 00:21:14.730 oh, sorry. 00:21:14.730 --> 00:21:15.870 Semicolons are back. 00:21:15.870 --> 00:21:18.300 Then I'm going to go ahead and hit here. 00:21:18.300 --> 00:21:20.730 And you'll see that now it's alphabetized, 00:21:20.730 --> 00:21:23.318 and there's all those Offices by first letter. 00:21:23.318 --> 00:21:24.360 Now, there are some bugs. 00:21:24.360 --> 00:21:26.500 If you scroll down, you'll see lowercase letters as well, 00:21:26.500 --> 00:21:28.620 so we'd have to deal with that situation, too. 00:21:28.620 --> 00:21:32.820 But this seems to be a user-friendly way of just selecting data 00:21:32.820 --> 00:21:33.570 that I care about. 00:21:33.570 --> 00:21:34.240 And watch this. 00:21:34.240 --> 00:21:36.010 Let me try to group things together. 00:21:36.010 --> 00:21:38.430 I'm going to go ahead and select title, and I'm 00:21:38.430 --> 00:21:41.880 going to go ahead and say count the number of titles 00:21:41.880 --> 00:21:45.000 from this table called favorites. 00:21:45.000 --> 00:21:48.660 But I want to group those things by title. 00:21:48.660 --> 00:21:51.820 Now, we'll see in a moment what this means, but the effect is what's cool. 00:21:51.820 --> 00:21:56.370 Voila, I now have output much like my Python program. 00:21:56.370 --> 00:22:00.028 It's not really as orderly as I would like, but notice 00:22:00.028 --> 00:22:01.320 what it's done at the very top. 00:22:01.320 --> 00:22:06.330 The very first line of output, if I keep scrolling is this here. 
00:22:06.330 --> 00:22:08.670 On the left is my title, then a vertical bar. 00:22:08.670 --> 00:22:13.120 On the right is literally count of title, how many titles match that. 00:22:13.120 --> 00:22:14.800 So let me go ahead and do this. 00:22:14.800 --> 00:22:18.240 Let me go ahead and first of all limit this to the top 10 titles, 00:22:18.240 --> 00:22:20.910 so I can just deal with this and see all this at once. 00:22:20.910 --> 00:22:23.460 So here we have some alphabetized list of titles. 00:22:23.460 --> 00:22:27.490 But now, suppose I want to go ahead and rename this thing. 00:22:27.490 --> 00:22:28.930 Let me go ahead and say that. 00:22:28.930 --> 00:22:30.510 Let me go ahead and say, as-- 00:22:30.510 --> 00:22:33.450 COUNT AS n. 00:22:33.450 --> 00:22:37.290 Now notice that the title of this output is title, and then n. 00:22:37.290 --> 00:22:41.650 But the reason I did that was so that I can do this. 00:22:41.650 --> 00:22:47.010 I'm going to now do ORDER BY n, descending order, 00:22:47.010 --> 00:22:49.370 and then limit to the top 10. 00:22:49.370 --> 00:22:51.120 Enter, and voila. 00:22:51.120 --> 00:22:53.460 Now, using a completely different language, 00:22:53.460 --> 00:22:56.460 I think I've gotten the same results, except for the lower casing issue, 00:22:56.460 --> 00:22:58.085 which we can come back to another time. 00:22:58.085 --> 00:23:00.210 But The Office is here as the number one hit. 00:23:00.210 --> 00:23:01.350 Friends then, 19. 00:23:01.350 --> 00:23:03.820 Game of Thrones, 18, and so forth. 00:23:03.820 --> 00:23:06.840 So what's then the motivation in part for this new language called SQL? 00:23:06.840 --> 00:23:08.550 And you've seen just snippets of it here. 
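The query built up in these steps, counting each title, aliasing the count as n, ordering descending, and limiting the output, can be run end-to-end like this (a sketch against a tiny in-memory table, just to show the shape of the result):

```python
# SELECT title, COUNT(title) AS n FROM favorites
# GROUP BY title ORDER BY n DESC LIMIT 10, as built up above.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("The Office",), ("The Office",), ("Friends",)])
rows = db.execute(
    "SELECT title, COUNT(title) AS n FROM favorites "
    "GROUP BY title ORDER BY n DESC LIMIT 10"
).fetchall()
# rows is now [('The Office', 2), ('Friends', 1)]
```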
00:23:08.550 --> 00:23:10.850 It allows us to solve the same kind of problems 00:23:10.850 --> 00:23:13.350 much more easily, because I don't have to sit down and write 00:23:13.350 --> 00:23:16.590 a whole darn Python program, faster though it is than 00:23:16.590 --> 00:23:20.933 C. I don't have to write custom code just to answer questions about my data. 00:23:20.933 --> 00:23:23.100 And in fact, if you pursue the web track in a couple 00:23:23.100 --> 00:23:25.710 of weeks' time or the mobile track, both of those 00:23:25.710 --> 00:23:29.130 are going to offer support for SQL so that if you want to store data 00:23:29.130 --> 00:23:32.880 on users or interactive information on an app or a website application, 00:23:32.880 --> 00:23:37.050 SQL is going to be ultimately where you can put that data and access it later, 00:23:37.050 --> 00:23:40.507 thereby writing even less code than today. 00:23:40.507 --> 00:23:43.590 So let's go ahead and take a look at some of the features of these things. 00:23:43.590 --> 00:23:45.507 What we've really just done is introduce this, 00:23:45.507 --> 00:23:47.580 something called a relational database. 00:23:47.580 --> 00:23:50.940 The goal at hand is not to rely on CSV files anymore. 00:23:50.940 --> 00:23:54.420 CSVs are like the weakest form of databases you can use. 00:23:54.420 --> 00:23:54.960 Why? 00:23:54.960 --> 00:23:59.162 Well, every time you want to look in a CSV file for data, you have to open it. 00:23:59.162 --> 00:24:00.120 You have to read it in. 00:24:00.120 --> 00:24:02.590 You have to iterate over it, line by line by line, 00:24:02.590 --> 00:24:05.160 looking for the data you care about, and then you're done. 00:24:05.160 --> 00:24:08.942 That's going to be super slow when you have a lot of data in a CSV file. 
00:24:08.942 --> 00:24:10.650 And in fact, in a little bit today, we'll 00:24:10.650 --> 00:24:14.190 download a really big database in text form 00:24:14.190 --> 00:24:19.230 that we're going to then see 500-plus megabytes later is much better handled 00:24:19.230 --> 00:24:21.420 by what are called relational databases, pieces 00:24:21.420 --> 00:24:24.990 of software you can run on your Mac or PC or servers in a company. 00:24:24.990 --> 00:24:28.950 All they do is store data and provide you with faster access to it. 00:24:28.950 --> 00:24:32.010 So SQLite3 is just a command line program 00:24:32.010 --> 00:24:34.012 via which we can interact with that data. 00:24:34.012 --> 00:24:35.220 It's not terribly compelling. 00:24:35.220 --> 00:24:37.200 It's, again, a black and white blinking prompt, 00:24:37.200 --> 00:24:40.050 but we'll see a graphical user interface, or GUI, in a bit, 00:24:40.050 --> 00:24:43.200 too, that'll allow us to navigate that same data. 00:24:43.200 --> 00:24:45.570 Here's how we went ahead and imported the data. 00:24:45.570 --> 00:24:47.550 And .schema is the last of the commands. 00:24:47.550 --> 00:24:49.870 In fact, let me go ahead and type this at the prompt. 00:24:49.870 --> 00:24:52.440 Not .mode or .import but .schema. 00:24:52.440 --> 00:24:55.860 And you'll see what automatically happened for me is something 00:24:55.860 --> 00:24:58.590 like this, CREATE TABLE favorites. 00:24:58.590 --> 00:25:02.010 This all happened automatically when I imported my data, but in a bit, 00:25:02.010 --> 00:25:04.740 we'll be doing this more manually ourselves. 00:25:04.740 --> 00:25:08.580 So where does this then bring us in terms of features? 
00:25:08.580 --> 00:25:11.280 Well, in the world of storing data, whether it's 00:25:11.280 --> 00:25:13.650 storing users who've registered for your websites 00:25:13.650 --> 00:25:16.350 or the number of likes a post on Instagram has gotten 00:25:16.350 --> 00:25:19.950 or any form of data that you might want to read or write, 00:25:19.950 --> 00:25:22.410 there are four fundamental operations in the world. 00:25:22.410 --> 00:25:25.180 CRUD is the sort of crude way of remembering this. 00:25:25.180 --> 00:25:28.260 These stand for Create, Read, Update, Delete. 00:25:28.260 --> 00:25:30.870 And I dare say that with these four operations 00:25:30.870 --> 00:25:34.030 you can do anything you want when it comes to data. 00:25:34.030 --> 00:25:38.010 Creating that data, reading that data, updating it, or deleting it. 00:25:38.010 --> 00:25:42.450 Now, in the world of SQL, or Structured Query Language, or S-Q-L, 00:25:42.450 --> 00:25:44.490 this is just another programming language. 00:25:44.490 --> 00:25:47.460 It tends to be used only in the context of databases, 00:25:47.460 --> 00:25:51.810 but you can use it to solve problems not unlike Python, but in the context 00:25:51.810 --> 00:25:54.090 of the data you care about. 00:25:54.090 --> 00:25:56.370 So with SQL, these four CRUD operations actually 00:25:56.370 --> 00:25:58.530 have different keywords mapped to them. 00:25:58.530 --> 00:26:01.650 They happen to be these, INSERT, SELECT, UPDATE, DELETE. 00:26:01.650 --> 00:26:06.360 That is to say, unlike Python, which has hundreds of functions available to you, 00:26:06.360 --> 00:26:11.940 today is kind of neat insofar as SQL has maybe a dozen or so total functions 00:26:11.940 --> 00:26:14.700 that you can use, a few dozen functions total that you can use. 00:26:14.700 --> 00:26:18.790 Much more manageable, and these are the four that we'll use most frequently. 00:26:18.790 --> 00:26:21.330 So beyond this, how do you go about creating a table? 
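The four CRUD operations and their SQL keywords can be seen back to back in one short sketch, here run in order against a throwaway in-memory table (table and values are illustrative, not from the lecture's database):

```python
# Create, Read, Update, Delete mapped to INSERT, SELECT, UPDATE, DELETE.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.execute("INSERT INTO favorites (title) VALUES ('Veep')")             # Create
titles = db.execute("SELECT title FROM favorites").fetchall()           # Read
db.execute("UPDATE favorites SET title = 'VEEP' WHERE title = 'Veep'")  # Update
db.execute("DELETE FROM favorites WHERE title = 'VEEP'")                # Delete
```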
00:26:21.330 --> 00:26:22.420 Well, what does that mean? 00:26:22.420 --> 00:26:24.253 Well, a relational database is literally that. 00:26:24.253 --> 00:26:27.660 It's like a fancier version of Google Spreadsheets or Microsoft Excel 00:26:27.660 --> 00:26:30.120 or Apple Numbers that allows you to store all of your data 00:26:30.120 --> 00:26:31.500 in rows and columns. 00:26:31.500 --> 00:26:35.790 But it gives you these four functions and more 00:26:35.790 --> 00:26:40.140 via which to select data you care about, or delete or update or insert data 00:26:40.140 --> 00:26:41.820 that you care about as well. 00:26:41.820 --> 00:26:45.390 So we'll see in a bit that this is going to be the syntax via which in SQL you 00:26:45.390 --> 00:26:46.820 can create a new table. 00:26:46.820 --> 00:26:49.500 In Google Spreadsheets you would literally go to the plus icon 00:26:49.500 --> 00:26:50.640 and say New Sheet. 00:26:50.640 --> 00:26:54.000 In SQL, you would express this more programmatically with a line of code 00:26:54.000 --> 00:26:54.795 like this. 00:26:54.795 --> 00:26:57.420 But you're going to have to make a few decisions along the way, 00:26:57.420 --> 00:27:00.280 and that's because SQL has its own data types. 00:27:00.280 --> 00:27:03.300 Now, fortunately, they're pretty reminiscent of C and Python. 00:27:03.300 --> 00:27:06.302 But there's five main ones here. 00:27:06.302 --> 00:27:08.260 Let's just walk through a few examples thereof. 00:27:08.260 --> 00:27:09.390 So first, INTEGER. 00:27:09.390 --> 00:27:12.930 If you know in advance that you want to store a whole bunch of information 00:27:12.930 --> 00:27:19.440 in a database like someone's age or the number of likes a post has 00:27:19.440 --> 00:27:21.660 or anything that lends itself to an integer, 00:27:21.660 --> 00:27:24.540 SQL supports an integer data type. 00:27:24.540 --> 00:27:28.120 But in some versions of SQL, you actually have a decision to make. 
00:27:28.120 --> 00:27:31.470 Do you want an integer, a smallint, or a bigint? 00:27:31.470 --> 00:27:34.710 And these speak to, as in C, the size, the number 00:27:34.710 --> 00:27:36.180 of bits that are used to store it. 00:27:36.180 --> 00:27:37.530 An integer is 32 bits. 00:27:37.530 --> 00:27:39.090 A bigint is 64 bits. 00:27:39.090 --> 00:27:41.690 A smallint is fewer than both those. 00:27:41.690 --> 00:27:44.720 Now, we're using, for class, something called SQLite. 00:27:44.720 --> 00:27:47.363 It's a free and open source, very user friendly version of SQL. 00:27:47.363 --> 00:27:49.280 But there's others you might have heard about. 00:27:49.280 --> 00:27:51.230 So if you've ever heard of the company Oracle, 00:27:51.230 --> 00:27:53.540 their bread and butter is a database program 00:27:53.540 --> 00:27:56.210 called Oracle that is a SQL database that 00:27:56.210 --> 00:27:58.730 allows companies to store huge amounts of data 00:27:58.730 --> 00:28:01.490 and select and create and update and delete data 00:28:01.490 --> 00:28:03.180 using software they've written. 00:28:03.180 --> 00:28:07.940 MySQL, Postgres, MariaDB, Microsoft Access Server, 00:28:07.940 --> 00:28:12.050 SQL Server, and bunches of others, are all examples of SQL servers. 00:28:12.050 --> 00:28:15.162 And they'll each support data types, typically, like this. 00:28:15.162 --> 00:28:17.120 We happen to be showing you ones from Postgres, 00:28:17.120 --> 00:28:20.390 which is a super popular open source one that you can use on the internet, 00:28:20.390 --> 00:28:22.010 for instance for final projects. 00:28:22.010 --> 00:28:25.657 Real numbers, just like in Python or C, are similar in spirit to floats. 00:28:25.657 --> 00:28:27.240 However, you have a couple of options. 00:28:27.240 --> 00:28:29.330 One is a real, which uses 32 bits. 00:28:29.330 --> 00:28:31.040 Another is called double precision. 
00:28:31.040 --> 00:28:34.400 It's actually two words, but it gives you 64 bits as well, 00:28:34.400 --> 00:28:36.770 to give you more precision numerically. 00:28:36.770 --> 00:28:39.950 Then you've got this catch-all known as numeric in SQL, 00:28:39.950 --> 00:28:43.490 whereby if you have a number that's number-like, essentially, 00:28:43.490 --> 00:28:46.460 but isn't just an integer, categories-- 00:28:46.460 --> 00:28:50.270 this category encompasses things like Boolean values, dates, 00:28:50.270 --> 00:28:53.240 which have a predefined format like YYYY, 00:28:53.240 --> 00:28:56.390 four digits typically representing a year. 00:28:56.390 --> 00:28:59.660 That would be considered numeric here, as would be a date and a date time, 00:28:59.660 --> 00:29:04.950 which has something like 00:00:00 for hours, minutes, seconds. 00:29:04.950 --> 00:29:06.780 And then you have other values as well. 00:29:06.780 --> 00:29:09.590 What's nice about this numeric type specifically 00:29:09.590 --> 00:29:14.570 is you can actually solve the floating point problem in both Python 00:29:14.570 --> 00:29:18.410 and C. Recall that floats or even doubles in C 00:29:18.410 --> 00:29:22.370 ultimately have finite precision, which is bad in terms of values 00:29:22.370 --> 00:29:24.140 not quite adding up as you intend. 00:29:24.140 --> 00:29:26.300 But with numeric data types and databases, 00:29:26.300 --> 00:29:30.380 you can specify exactly how many digits you want before the decimal point 00:29:30.380 --> 00:29:31.830 and after, maximally. 00:29:31.830 --> 00:29:35.060 So if you're dealing with financial information or scientific information, 00:29:35.060 --> 00:29:39.020 you can be super, super precise thanks to this data type in the database. 00:29:39.020 --> 00:29:43.160 So finally, a problem that we've solved, versus those other languages. 00:29:43.160 --> 00:29:43.910 Text. 
00:29:43.910 --> 00:29:46.050 When it comes to storing data in a database, 00:29:46.050 --> 00:29:48.470 you can specify that your columns can either 00:29:48.470 --> 00:29:52.220 be chars, which isn't an individual character as in C. 00:29:52.220 --> 00:29:55.730 You literally specify n, the number of characters 00:29:55.730 --> 00:29:58.370 that every cell in that column will be. 00:29:58.370 --> 00:29:59.823 You never do this in Excel. 00:29:59.823 --> 00:30:02.240 You never do this in Google Spreadsheets or Apple Numbers. 00:30:02.240 --> 00:30:06.040 In a database, though, you can tell the database ever more precisely, 00:30:06.040 --> 00:30:10.790 I want to store two characters in every cell in some column. 00:30:10.790 --> 00:30:11.660 Why? 00:30:11.660 --> 00:30:13.310 Well, maybe like US state codes. 00:30:13.310 --> 00:30:16.640 If you have MA for Massachusetts or CA for California, 00:30:16.640 --> 00:30:20.210 it might be nice to tell the database that you can store, minimally 00:30:20.210 --> 00:30:23.420 and maximally, two characters for every value I'm going to give you. 00:30:23.420 --> 00:30:24.723 Varchar is a little different. 00:30:24.723 --> 00:30:26.390 This is a variable number of characters. 00:30:26.390 --> 00:30:30.290 If your column might have a few characters or a lot of characters based 00:30:30.290 --> 00:30:33.470 on what some human types in, you can give an upper bound 00:30:33.470 --> 00:30:36.710 n on how many characters will be stored in every cell. 
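A table declared with the column types just discussed might look like the sketch below. One caveat worth hedging: SQLite, used here, accepts CHAR(n), VARCHAR(n), and NUMERIC(p, s) syntax but maps them to type affinities and does not actually enforce the lengths; databases like Postgres do. The table and column names are illustrative assumptions.

```python
# CREATE TABLE with the data types discussed above, run via SQLite.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        state CHAR(2),          -- fixed two characters, e.g. MA, CA
        name VARCHAR(64),       -- variable length, upper bound of 64
        balance NUMERIC(10, 2)  -- up to 10 digits, 2 after the decimal
    )
""")
```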
00:30:36.710 --> 00:30:40.490 So if you don't know in advance how long a user's email address is going to be 00:30:40.490 --> 00:30:43.100 or how long my name or your name is going to be in advance, 00:30:43.100 --> 00:30:47.370 you can specify varchar and then pick some upper bound, some value that, ugh, 00:30:47.370 --> 00:30:50.600 no one's going to have more than 20 characters in their name or 200 00:30:50.600 --> 00:30:55.160 characters in their name, whatever line you decide to draw in the sand. 00:30:55.160 --> 00:30:57.710 As an aside, does anyone want to conjecture 00:30:57.710 --> 00:31:00.940 what a good upper bound is for the number of characters 00:31:00.940 --> 00:31:04.085 in a human name that might register for a website you're making? 00:31:04.085 --> 00:31:05.480 AUDIENCE: [INAUDIBLE] 00:31:05.480 --> 00:31:06.390 DAVID MALAN: 25? 00:31:06.390 --> 00:31:06.890 Pretty good. 00:31:06.890 --> 00:31:07.990 AUDIENCE: [INAUDIBLE] 00:31:07.990 --> 00:31:09.032 DAVID MALAN: 30, I heard. 00:31:09.032 --> 00:31:09.980 AUDIENCE: [INAUDIBLE] 00:31:09.980 --> 00:31:11.615 DAVID MALAN: 32, OK. 00:31:11.615 --> 00:31:12.490 AUDIENCE: [INAUDIBLE] 00:31:12.490 --> 00:31:14.330 DAVID MALAN: 45? 00:31:14.330 --> 00:31:15.740 Higher. 00:31:15.740 --> 00:31:16.720 60? 00:31:16.720 --> 00:31:17.420 AUDIENCE: 64. 00:31:17.420 --> 00:31:18.170 DAVID MALAN: 64? 00:31:18.170 --> 00:31:20.790 This is the kind of thing that Google, I dare say, is good at. 00:31:20.790 --> 00:31:23.570 So let me go ahead and open up Google real fast. 00:31:23.570 --> 00:31:26.570 Suppose the goal at hand is the store, in a database 00:31:26.570 --> 00:31:32.060 table, that is the equivalent of a spreadsheets column, the longest 00:31:32.060 --> 00:31:34.811 name in world. 00:31:34.811 --> 00:31:35.950 All right. 00:31:35.950 --> 00:31:43.180 Looks like we had 988 characters in a fellow's name, Hubert. 
00:31:43.180 --> 00:31:46.430 And I won't even try to pronounce his last name, but there it is. 00:31:46.430 --> 00:31:48.470 So if you want to fill-- 00:31:48.470 --> 00:31:50.795 if you want to be able to fit everyone's name-- 00:31:50.795 --> 00:31:52.900 oh, and note, senior. 00:31:52.900 --> 00:31:54.760 [LAUGHTER] 00:31:54.760 --> 00:31:56.870 You might need even more characters than that. 00:31:56.870 --> 00:31:59.950 So this is a rabbit hole we won't really go down today, 00:31:59.950 --> 00:32:02.530 but suffice it to say that the answers to these questions 00:32:02.530 --> 00:32:03.550 are often not obvious. 00:32:03.550 --> 00:32:06.340 And if you think about websites you've visited in the real world, 00:32:06.340 --> 00:32:10.930 whether on campus or off, odds are, you have tried typing in some input 00:32:10.930 --> 00:32:14.660 to some web-based form or even some mobile application, 00:32:14.660 --> 00:32:17.200 where all of a sudden your keyboard stops working, right? 00:32:17.200 --> 00:32:19.000 They won't tolerate any more characters. 00:32:19.000 --> 00:32:21.490 Maybe it's the Common Application and the length of your essay. 00:32:21.490 --> 00:32:23.680 Maybe it's the length that your resume can be in a field. 00:32:23.680 --> 00:32:26.440 Maybe it's the length of your name when registering for a website. 00:32:26.440 --> 00:32:27.670 Well, why is that? 00:32:27.670 --> 00:32:32.860 Those applications are probably using a database, probably using SQL, 00:32:32.860 --> 00:32:37.210 and they had to decide, in advance, what is the maximum length of an input 00:32:37.210 --> 00:32:39.190 that we're going to tolerate from a human? 00:32:39.190 --> 00:32:42.220 And then the computer ultimately enforces that. 00:32:42.220 --> 00:32:42.820 Why? 00:32:42.820 --> 00:32:46.900 Well that would be a very clever attack, for some bad actor out there 00:32:46.900 --> 00:32:48.310 to exploit. 
00:32:48.310 --> 00:32:50.998 If you had no bounds on how big the input could 00:32:50.998 --> 00:32:53.290 be that a human could type in, they could-- much like I 00:32:53.290 --> 00:32:55.123 need to try the other day with Emma's name-- 00:32:55.123 --> 00:32:57.760 just keep pasting pasting, pasting, pasting a massive input, 00:32:57.760 --> 00:33:01.750 hoping to overwhelm your computer's memory, maybe make your server crash, 00:33:01.750 --> 00:33:04.370 and therefore deny service to other people. 00:33:04.370 --> 00:33:06.670 So there are reasons for these kinds of defenses. 00:33:06.670 --> 00:33:08.140 Lastly is BLOB. 00:33:08.140 --> 00:33:10.153 At the very top, Binary Large Object. 00:33:10.153 --> 00:33:12.070 Don't need it typically that often, but if you 00:33:12.070 --> 00:33:16.270 want to store binary data, raw zeros and ones that represent files or the like, 00:33:16.270 --> 00:33:18.400 you can store it in a BLOB format as well. 00:33:18.400 --> 00:33:20.650 Now, just to give you a whirlwind tour of other syntax 00:33:20.650 --> 00:33:23.020 before we start using it, you can insert data 00:33:23.020 --> 00:33:25.570 into a database using syntax quite like this. 00:33:25.570 --> 00:33:29.590 You can select data from a database using syntax like I already did. 00:33:29.590 --> 00:33:32.180 You can use functions as well. 00:33:32.180 --> 00:33:35.230 In fact, in just a bit, when we load more data into the database, 00:33:35.230 --> 00:33:38.800 I could actually select all of the distinct names from a database-- 00:33:38.800 --> 00:33:40.870 and in fact, I could have done this a moment ago. 00:33:40.870 --> 00:33:45.430 Rather than SELECT title FROM favorites, as I 00:33:45.430 --> 00:33:49.360 did earlier-- that's going to give me all of the titles in that database. 
00:33:49.360 --> 00:33:54.310 I could instead do something like SELECT DISTINCT title FROM favorites, 00:33:54.310 --> 00:33:56.680 and that's going to filter out all of the duplicates 00:33:56.680 --> 00:34:00.268 without me having to write a single line of code to do that. 00:34:00.268 --> 00:34:01.810 You can also count, as we did before. 00:34:01.810 --> 00:34:03.790 You can compute averages, mins, or maxes, which 00:34:03.790 --> 00:34:06.220 is really useful for analytical work that you might want 00:34:06.220 --> 00:34:07.909 to do as a data scientist or the like. 00:34:07.909 --> 00:34:10.659 And there's a whole bunch of other functions available to you as well. 00:34:10.659 --> 00:34:12.010 But there's some more powerful features. 00:34:12.010 --> 00:34:14.230 And we'll begin to use these in just a little bit. 00:34:14.230 --> 00:34:15.880 We used WHERE before. 00:34:15.880 --> 00:34:17.199 I used LIMIT before. 00:34:17.199 --> 00:34:18.460 I used GROUP BY before. 00:34:18.460 --> 00:34:21.280 And we'll see a bunch of examples that reinforce all of these. 00:34:21.280 --> 00:34:23.920 But there's keywords like LIKE, whereby if you 00:34:23.920 --> 00:34:28.540 want to search for everyone's input who mentions The Office, you know what? 00:34:28.540 --> 00:34:31.989 I bet we could do something like this instead. 00:34:31.989 --> 00:34:36.280 I know that I can SELECT star for-- oh, let's go ahead 00:34:36.280 --> 00:34:41.920 and SELECT title FROM favorites WHERE, and this is kind of nonsensical, 00:34:41.920 --> 00:34:45.820 but title equals "The Office," quote unquote. 00:34:45.820 --> 00:34:48.969 And I get back everyone who typed in literally The Office. 00:34:48.969 --> 00:34:55.090 However, if I instead did something like this, "the office" in lower case, 00:34:55.090 --> 00:34:58.780 I'm going to get one of you who typed in "the office" in lowercase. 00:34:58.780 --> 00:35:01.240 And the thing here is just the title of this output. 
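SELECT DISTINCT and the aggregate functions mentioned here can be sketched side by side against a tiny illustrative table:

```python
# DISTINCT filters out duplicate rows; COUNT is one of SQL's aggregates.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("Friends",), ("Friends",), ("Veep",)])
distinct = db.execute("SELECT DISTINCT title FROM favorites").fetchall()
total = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]
# distinct has one row per unique title; total counts every row.
```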
00:35:01.240 --> 00:35:03.940 But what if I want to do anything like The Office? 00:35:03.940 --> 00:35:10.780 I could say something like LIKE "%office%", and it turns out in SQL, 00:35:10.780 --> 00:35:12.880 this new language, percent is a placeholder. 00:35:12.880 --> 00:35:16.120 It's a wild card that just says, eh, anything can go before, 00:35:16.120 --> 00:35:18.190 and, eh, anything can go after. 00:35:18.190 --> 00:35:21.340 So now I'm searching for any titles that have the word The Office. 00:35:21.340 --> 00:35:22.810 We picked up another entry here. 00:35:22.810 --> 00:35:26.153 So one of you said Office, without the word "the," also pretty reasonable. 00:35:26.153 --> 00:35:28.570 Some of the uppercase and lowercase is all over the place. 00:35:28.570 --> 00:35:30.010 Now we've caught those. 00:35:30.010 --> 00:35:33.100 And if you want to count the total number of offices now, 00:35:33.100 --> 00:35:34.810 now I can do something like this. 00:35:34.810 --> 00:35:38.020 COUNT title-- so I can combine all of these building 00:35:38.020 --> 00:35:40.720 blocks, not unlike Scratch, and get back the total number 00:35:40.720 --> 00:35:45.610 of offices, which looks now to be 33 when we tolerate a lot more 00:35:45.610 --> 00:35:48.880 variability in our users' input. 00:35:48.880 --> 00:35:50.680 Most powerfully of all, we'll see something 00:35:50.680 --> 00:35:53.020 like JOIN in just a little bit, and that'll 00:35:53.020 --> 00:35:56.600 be all toward the end of using databases better with better design. 00:35:56.600 --> 00:36:00.010 But with SELECTs, we, of course, don't just want to select all of the titles. 00:36:00.010 --> 00:36:02.360 We might select where some condition is true. 00:36:02.360 --> 00:36:04.430 So conditions, just like in Python and C, 00:36:04.430 --> 00:36:08.323 exist in SQL, but at the end of our queries, as we've seen. 
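The LIKE wildcard search described here can be sketched as follows; note that in SQLite specifically, LIKE is case-insensitive for ASCII characters by default, which is why the differently capitalized variants all match (the sample titles are illustrative):

```python
# The % wildcards match anything before and after "office".
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("The Office",), ("the office",), ("Office",), ("Veep",)])
matches = db.execute(
    "SELECT title FROM favorites WHERE title LIKE '%office%'"
).fetchall()
# All three Office variants match; Veep does not.
```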
00:36:08.323 --> 00:36:10.240 If you want to update something, for instance, 00:36:10.240 --> 00:36:12.550 you can update data in a database by saying 00:36:12.550 --> 00:36:16.300 UPDATE table name SET the column name equal to some value, 00:36:16.300 --> 00:36:18.200 WHERE condition is true. 00:36:18.200 --> 00:36:19.820 So how might I use this? 00:36:19.820 --> 00:36:23.640 Well, let's again look at the data where it's like The Office. 00:36:23.640 --> 00:36:24.390 And you know what? 00:36:24.390 --> 00:36:25.807 I'm going to go ahead and do this. 00:36:25.807 --> 00:36:28.307 If I want to clean all of this data up-- so data cleaning is 00:36:28.307 --> 00:36:30.140 a very common approach, whether you're doing 00:36:30.140 --> 00:36:32.890 research or analytical work, often when you have messy data, 00:36:32.890 --> 00:36:34.770 the first pass is honestly grunt work. 00:36:34.770 --> 00:36:37.350 Like, writing some code, or god forbid, manually copying 00:36:37.350 --> 00:36:40.330 and pasting in an Excel file, just to clean up your data. 00:36:40.330 --> 00:36:42.630 You can do this a lot more powerfully in SQL. 00:36:42.630 --> 00:36:48.450 So for instance, let me go ahead and say UPDATE my favorites SET 00:36:48.450 --> 00:36:52.500 title equal to "The Office," the canonical form, capital 00:36:52.500 --> 00:36:56.400 T, capital O, WHERE title LIKE-- 00:36:56.400 --> 00:36:59.850 and now let me do this wildcard search, knowing 00:36:59.850 --> 00:37:02.940 that that's going to slurp in all 30-plus of those rows. 00:37:02.940 --> 00:37:06.330 But now, when I hit Enter, nothing seems to happen. 00:37:06.330 --> 00:37:10.050 But if I then SELECT title again WHERE LIKE "%office%", 00:37:10.050 --> 00:37:13.590 now they all look the same, because I've updated my data. 00:37:13.590 --> 00:37:17.010 And so in fact, now I could go back to The Office, 00:37:17.010 --> 00:37:21.520 because I've canonicalized my data, and everything now looks the same. 
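The data-cleaning UPDATE described here, canonicalizing every Office-like row to one spelling, can be sketched like so (sample rows are illustrative):

```python
# UPDATE table SET column = value WHERE condition, used to clean data.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("the office",), ("Office",), ("THE OFFICE",)])
# Canonicalize every Office-like row to the official spelling.
db.execute(
    "UPDATE favorites SET title = 'The Office' WHERE title LIKE '%office%'"
)
rows = db.execute("SELECT DISTINCT title FROM favorites").fetchall()
# rows is now [('The Office',)]
```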
00:37:21.520 --> 00:37:23.760 So with SQL, you can do that as well. 00:37:23.760 --> 00:37:28.620 Previously, let's go ahead and SELECT a title FROM favorites. 00:37:28.620 --> 00:37:32.880 And let's just limit this to 10 of them this time, so they don't all fly by. 00:37:32.880 --> 00:37:34.980 Suppose-- VEEP is a good show. 00:37:34.980 --> 00:37:37.380 Maybe I'm not a fan of, say-- 00:37:37.380 --> 00:37:40.770 let's give ourselves 20 here, see what comes up next. 00:37:40.770 --> 00:37:44.370 OK, I actually don't like Friends, even though everyone in the world seems to. 00:37:44.370 --> 00:37:45.420 No problem. 00:37:45.420 --> 00:37:46.665 DELETE FROM favorites-- 00:37:46.665 --> 00:37:47.700 [LAUGHTER] 00:37:47.700 --> 00:37:50.790 --WHERE title = "Friends";. 00:37:50.790 --> 00:37:53.070 Now, no more Friends. 00:37:53.070 --> 00:37:54.885 And so you can delete your data as well. 00:37:54.885 --> 00:37:56.010 [LAUGHTER] 00:37:56.010 --> 00:38:00.390 You can delete your data as well by using a query of this form. 00:38:00.390 --> 00:38:04.020 This one is perhaps the most dangerous of all. 00:38:04.020 --> 00:38:07.860 Consider what might happen if you're a little sloppy, a little hasty, 00:38:07.860 --> 00:38:09.720 and omit the WHERE condition. 00:38:09.720 --> 00:38:14.070 What happens if you go ahead and DELETE FROM favorites;? 00:38:14.070 --> 00:38:15.143 Everything goes away. 00:38:15.143 --> 00:38:18.060 And slightly worse than that, though we won't have occasion to use it, 00:38:18.060 --> 00:38:20.352 there's another function called DROP, where if you just 00:38:20.352 --> 00:38:23.970 want to delete the table altogether, delete all of your rows and columns, 00:38:23.970 --> 00:38:28.630 you can say DROP TABLE favorites;, and that will delete all of the data 00:38:28.630 --> 00:38:29.130 as well. 
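The difference between a targeted DELETE and the dangerous WHERE-less form described here can be sketched in a few lines (sample rows are illustrative):

```python
# DELETE with a WHERE clause removes matching rows; without one,
# it removes *every* row in the table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (title TEXT)")
db.executemany("INSERT INTO favorites VALUES (?)",
               [("Friends",), ("Veep",)])
db.execute("DELETE FROM favorites WHERE title = 'Friends'")  # targeted
remaining = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]  # 1
db.execute("DELETE FROM favorites")  # no WHERE: everything goes away
empty = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]      # 0
```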
00:38:29.130 --> 00:38:32.700 And we'll come back to that, because the mere fact that you can scarily 00:38:32.700 --> 00:38:35.370 delete lots of data at once makes you super 00:38:35.370 --> 00:38:38.580 vulnerable to bad actors on the internet or in the office who 00:38:38.580 --> 00:38:41.770 want to potentially wreak havoc on your data. 00:38:41.770 --> 00:38:46.210 So we'll come back today, talking about matters of security, as well. 00:38:46.210 --> 00:38:49.920 So any questions before, now, we take things up a notch 00:38:49.920 --> 00:38:55.270 and look at a much more massive data set than our hundreds of favorites here? 00:38:55.270 --> 00:38:56.430 Is that a question there? 00:38:56.430 --> 00:38:58.586 Yeah. 00:38:58.586 --> 00:39:01.965 AUDIENCE: [INAUDIBLE] 00:39:01.965 --> 00:39:04.090 DAVID MALAN: Did not look like Friends got removed. 00:39:04.090 --> 00:39:05.050 Oh, interesting. 00:39:05.050 --> 00:39:06.730 You are right. 00:39:06.730 --> 00:39:09.700 Let's try a little harder. 00:39:09.700 --> 00:39:12.010 LIKE "%friends%". 00:39:15.200 --> 00:39:16.370 Better? 00:39:16.370 --> 00:39:17.190 Now they're gone. 00:39:17.190 --> 00:39:19.607 So I'm guessing-- I would have to look closer at the data, 00:39:19.607 --> 00:39:21.442 but I'm guessing one or more people typed it 00:39:21.442 --> 00:39:23.150 with a slightly different capitalization, 00:39:23.150 --> 00:39:25.760 or maybe they hit a spacebar character or something, where 00:39:25.760 --> 00:39:29.040 I wasn't doing a precise match, thereby making 00:39:29.040 --> 00:39:32.730 the LIKE operator a more robust fix. 00:39:32.730 --> 00:39:35.280 However, let's consider for just a moment. 00:39:35.280 --> 00:39:36.860 What might the gotcha here be? 00:39:36.860 --> 00:39:39.170 Like, this feels like a nice quick and dirty fix, 00:39:39.170 --> 00:39:42.118 but it really is quick and dirty in what sense? 
00:39:42.118 --> 00:39:43.910 AUDIENCE: Like, if you have another TV show 00:39:43.910 --> 00:39:47.543 that has the word friend in it, but not Friends, [INAUDIBLE].. 00:39:47.543 --> 00:39:49.460 DAVID MALAN: Yeah, there might be another show 00:39:49.460 --> 00:39:50.480 with the word "friends" in it. 00:39:50.480 --> 00:39:51.320 Quite reasonable. 00:39:51.320 --> 00:39:54.380 And I just blew those away as well from my database. 00:39:54.380 --> 00:39:56.480 And in fact, we can probably see this for real. 00:39:56.480 --> 00:39:59.540 If I go ahead and look at, for instance-- 00:39:59.540 --> 00:40:05.360 let me go into a browser here, IMDB.com, super popular website for just getting 00:40:05.360 --> 00:40:07.550 movie reviews or information about movies. 00:40:07.550 --> 00:40:09.330 Let me go ahead and search for friends. 00:40:09.330 --> 00:40:12.830 And yes, the most popular hit seems to be the 1994 version of Friends, 00:40:12.830 --> 00:40:14.740 but there's Friends with Benefits. 00:40:14.740 --> 00:40:17.573 There's Just Friends, We Are Your Friends, My Best Friend's Wedding, 00:40:17.573 --> 00:40:18.990 Fisherman's Friends, and so forth. 00:40:18.990 --> 00:40:21.020 Some of those movies, some of those TV shows. 00:40:21.020 --> 00:40:24.950 And had we poked a little further among our own data of hundreds of rows, maybe 00:40:24.950 --> 00:40:27.980 some of you did have some of those shows as your favorites as well. 00:40:27.980 --> 00:40:30.290 So beware that, too. 00:40:30.290 --> 00:40:32.990 So in preparing for today, we wanted to play 00:40:32.990 --> 00:40:36.350 with an even juicier data set than one we could generate for ourselves. 00:40:36.350 --> 00:40:39.920 And wonderfully, IMDB makes a lot of their database 00:40:39.920 --> 00:40:44.900 available for download as TSV files, Tab-Separated Values. 00:40:44.900 --> 00:40:48.210 It's really not fundamentally different from Comma Separated Values.
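A sketch of that gotcha, again with sqlite3 and made-up rows: SQLite's LIKE is case-insensitive for ASCII by default, and the % wildcards match any run of characters, so the broader pattern catches the stray capitalization and whitespace -- but also an unrelated title that merely contains the word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE favorites (title TEXT)")
conn.executemany(
    "INSERT INTO favorites (title) VALUES (?)",
    [("friends ",), ("FRIENDS",), ("Friends with Benefits",), ("The Crown",)],
)

# An exact match misses every variant with odd capitalization or spaces.
conn.execute("DELETE FROM favorites WHERE title = 'Friends'")
after_exact = conn.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]

# The wildcard pattern removes them all -- including the unrelated title.
conn.execute("DELETE FROM favorites WHERE title LIKE '%friends%'")
left = [row[0] for row in conn.execute("SELECT title FROM favorites")]
```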
00:40:48.210 --> 00:40:51.200 One human at IMDB just decided to use tabs to separate 00:40:51.200 --> 00:40:52.763 their values instead of commas. 00:40:52.763 --> 00:40:54.680 And if you read through this page here, you'll 00:40:54.680 --> 00:40:58.520 see that they let you download a bunch of different files, one of which 00:40:58.520 --> 00:41:02.360 we'll focus on, initially, specifically title.basics.tsv.gz. 00:41:05.105 --> 00:41:06.980 So that's a mouthful, but this is just saying 00:41:06.980 --> 00:41:09.920 this is basic information about titles in their database. 00:41:09.920 --> 00:41:12.785 Gz means it's G-zipped, which is like Zip. 00:41:12.785 --> 00:41:14.660 If you're familiar, on a Mac or PC, it's just 00:41:14.660 --> 00:41:19.850 another way of compressing information, but you can unzip that on a computer 00:41:19.850 --> 00:41:20.390 as well. 00:41:20.390 --> 00:41:22.790 Dot TSV means Tab-Separated Values. 00:41:22.790 --> 00:41:24.860 Now, we read in advance the information here, 00:41:24.860 --> 00:41:26.490 but let me summarize it as follows. 00:41:26.490 --> 00:41:28.972 There's some juicy information in that file. 00:41:28.972 --> 00:41:31.430 And I'm going to go ahead and download this file over here. 00:41:31.430 --> 00:41:33.710 If I click a link on IMDB's website, I'll 00:41:33.710 --> 00:41:37.607 see a file called title.basics.tsv.gz. 00:41:37.607 --> 00:41:40.190 I'm going to go ahead and let that download in the background. 00:41:40.190 --> 00:41:44.600 It looks like it is 100-plus megabytes. 00:41:44.600 --> 00:41:47.570 If I click on it on my Mac, it's going to decompress it. 00:41:47.570 --> 00:41:53.210 And you'll see here, this apparently is 532 megabytes once it's decompressed. 00:41:53.210 --> 00:41:56.840 So that's way bigger than the CSV file I downloaded from Google 00:41:56.840 --> 00:42:01.110 a few minutes ago, which is 24 kilobytes, 24,000 bytes. 00:42:01.110 --> 00:42:03.560 This is 532 million bytes.
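As an aside, the decompression step can also be done in Python with the standard gzip module; the two-line tab-separated sample here is a made-up miniature of title.basics.tsv:

```python
import gzip

# Compress a tiny tab-separated sample, then decompress it, mirroring
# what double-clicking the .gz file (or running gunzip) does.
raw = "tconst\ttitleType\ntt4786824\ttvSeries\n".encode()
packed = gzip.compress(raw)
text = gzip.decompress(packed).decode()
```

For the real file, gzip.open("title.basics.tsv.gz", "rt") would stream the contents without ever writing the decompressed 532 megabytes to disk.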
00:42:03.560 --> 00:42:07.040 So it's orders of magnitude bigger, which means our design decisions today 00:42:07.040 --> 00:42:08.210 are really going to matter. 00:42:08.210 --> 00:42:10.500 Now, if we take a look inside of this file, 00:42:10.500 --> 00:42:12.260 we'll see a bunch of interesting fields. 00:42:12.260 --> 00:42:15.920 And I've essentially summarized this from their own online documentation. 00:42:15.920 --> 00:42:18.980 There's these five fields that are of interest right now, 00:42:18.980 --> 00:42:20.630 and we'll tease these apart now. 00:42:20.630 --> 00:42:22.550 So tconst, I didn't recognize that. 00:42:22.550 --> 00:42:26.510 It's an IMDB-specific term, it seems, but it means some kind of title 00:42:26.510 --> 00:42:27.200 constant. 00:42:27.200 --> 00:42:30.480 It's a unique identifier for every movie in the database. 00:42:30.480 --> 00:42:33.360 So for instance, one of the movies in that TSV file, 00:42:33.360 --> 00:42:39.040 which I'll open up in a moment, has the unique value tt4786824. 00:42:39.040 --> 00:42:39.830 Why? 00:42:39.830 --> 00:42:40.970 Just because. 00:42:40.970 --> 00:42:44.750 But it turns out we can actually see what this is. 00:42:44.750 --> 00:42:57.160 Let me actually grab that value and go to IMDB.com/title/tt4786824. 00:42:57.160 --> 00:42:58.160 Enter. 00:42:58.160 --> 00:42:59.870 These actually do have significance. 00:42:59.870 --> 00:43:01.760 That is the unique identifier for The Crown, 00:43:01.760 --> 00:43:04.400 popular hit TV show that you can now watch on Netflix. 00:43:04.400 --> 00:43:08.660 So if you go back to the actual field-- 00:43:08.660 --> 00:43:09.960 that's how I spent the summer. 00:43:09.960 --> 00:43:12.543 So if you go back to the fields here, we have not only tconst, 00:43:12.543 --> 00:43:15.440 which uniquely identifies the film, we also have title type.
00:43:15.440 --> 00:43:17.990 So it turns out in this 500 megabyte file, 00:43:17.990 --> 00:43:22.010 there's huge amounts of information on movies, TV series, documentaries, 00:43:22.010 --> 00:43:25.280 shorts, way more data than we could possibly play with in one day. 00:43:25.280 --> 00:43:29.270 So we'll focus just on TV series, like this, the show The Crown. 00:43:29.270 --> 00:43:31.910 Primary title is the field in the file that 00:43:31.910 --> 00:43:33.660 gives the primary title of the show. 00:43:33.660 --> 00:43:35.660 Sometimes shows apparently have different titles 00:43:35.660 --> 00:43:38.320 in different geographies or communities, so the primary title 00:43:38.320 --> 00:43:40.070 is the one that most people know it by. 00:43:40.070 --> 00:43:43.670 Start year is the year for a TV show that-- in which that TV show began. 00:43:43.670 --> 00:43:46.250 The Crown came out first in 2016. 00:43:46.250 --> 00:43:51.050 And genres, a little weirdly, is a comma-separated list 00:43:51.050 --> 00:43:53.690 of genres into which that show falls. 00:43:53.690 --> 00:43:57.320 And I say it's a little weird, because we do have a TSV file, 00:43:57.320 --> 00:43:59.580 but they're using commas inside. 00:43:59.580 --> 00:44:03.590 So this is probably among the motivations for having used tabs in-- 00:44:03.590 --> 00:44:05.660 to separate all of the columns, because they 00:44:05.660 --> 00:44:08.060 want to use commas for some other purpose here. 00:44:08.060 --> 00:44:09.570 But it's a little messy. 00:44:09.570 --> 00:44:13.160 This is actually not necessarily the best design decision here. 00:44:13.160 --> 00:44:17.090 So what can we go ahead and do with all of this data here? 00:44:17.090 --> 00:44:20.140 So let's go ahead and start to grab some of this data 00:44:20.140 --> 00:44:21.640 and write some scripts involving it. 00:44:21.640 --> 00:44:23.260 But let me issue a disclaimer.
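Since the genres cell is itself a comma-separated list, any consumer of the file has to split it apart; a minimal sketch, with The Crown's row hard-coded here as an assumption:

```python
# One parsed row from the TSV; genres packs several values into one cell.
row = {"primaryTitle": "The Crown", "genres": "Biography,Drama,History"}

# Splitting on commas recovers the individual genres.
genres = row["genres"].split(",")
```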
00:44:23.260 --> 00:44:26.000 It turns out when you have a lot of data in a database, 00:44:26.000 --> 00:44:30.247 it can take quite a long time to analyze it or parse it. 00:44:30.247 --> 00:44:32.830 And so if you ever watch certain baking shows like Julia Child 00:44:32.830 --> 00:44:35.620 from down the road here in Cambridge, very often there 00:44:35.620 --> 00:44:39.580 would be some TV magic whereby someone mixes the cake mix 00:44:39.580 --> 00:44:42.723 and puts it into the oven, and then two minutes later, voila, it comes out. 00:44:42.723 --> 00:44:44.890 That's because some of the scripts we'll write today 00:44:44.890 --> 00:44:47.410 might actually take many minutes to run, but we'll 00:44:47.410 --> 00:44:49.118 go ahead and run them in the background, 00:44:49.118 --> 00:44:51.760 and I'll use the outputs that I've created here in advance. 00:44:51.760 --> 00:44:55.600 So how do we go about beginning to navigate this data set? 00:44:55.600 --> 00:44:59.830 Well, first of all, let me go ahead and just open this file. 00:44:59.830 --> 00:45:02.230 And I'm going to use my own terminal window here. 00:45:02.230 --> 00:45:06.790 This file was called title.basics.tsv. 00:45:06.790 --> 00:45:09.580 Notice that even in my program here, it took a moment to open, 00:45:09.580 --> 00:45:11.060 but this is what it looks like. 00:45:11.060 --> 00:45:14.380 It's a little cryptic at first glance, but notice the first row 00:45:14.380 --> 00:45:16.030 is the headers of the file. 00:45:16.030 --> 00:45:19.900 So we've got tconst and title type and primary title and so forth. 00:45:19.900 --> 00:45:24.160 Then every row thereafter seems to start with one of those unique identifiers, 00:45:24.160 --> 00:45:27.250 and they seem to be monotonically increasing, 1, 2, 3, 4, 00:45:27.250 --> 00:45:29.830 but they start with tt then some number of digits. 00:45:29.830 --> 00:45:32.972 Then you see that these are a bunch of shorts, so short films or the like.
00:45:32.972 --> 00:45:35.180 And if we kept scrolling, we'd see some other values. 00:45:35.180 --> 00:45:36.825 Here's the first movie in the database. 00:45:36.825 --> 00:45:39.700 And if we keep scrolling and scrolling and scrolling, we'll see more. 00:45:39.700 --> 00:45:43.150 So if I actually use my text editor here and search for literally the word 00:45:43.150 --> 00:45:45.040 The Crown, it's going to take a little while 00:45:45.040 --> 00:45:48.100 to get to it, because there's a bunch of shows apparently called 00:45:48.100 --> 00:45:50.470 The Crown or Crowning or something. 00:45:50.470 --> 00:45:52.060 But you know what'll be faster? 00:45:52.060 --> 00:45:53.860 Let me go ahead and search for-- 00:45:53.860 --> 00:45:55.090 what was it? 00:45:55.090 --> 00:46:03.020 If I go back to tt4786824, Enter, it takes a moment 00:46:03.020 --> 00:46:05.770 to find this because it's searching through hundreds of megabytes, 00:46:05.770 --> 00:46:10.493 but voila, down here is this tconst, TV series type. 00:46:10.493 --> 00:46:11.410 It's called The Crown. 00:46:11.410 --> 00:46:15.920 In another geography it's also called The Crown, 2016, and so forth. 00:46:15.920 --> 00:46:17.920 So that's a huge amount of data. 00:46:17.920 --> 00:46:19.180 In fact, how much data? 00:46:19.180 --> 00:46:22.310 Well, if I actually analyze this with a command on my computer, 00:46:22.310 --> 00:46:29.320 there are 6,267,469 shows in the world, be it TV 00:46:29.320 --> 00:46:31.330 or movies or documentaries or the like. 00:46:31.330 --> 00:46:32.830 That's a huge data set. 00:46:32.830 --> 00:46:36.460 So suppose I want to focus really just on a subset of that data. 00:46:36.460 --> 00:46:39.970 What can I go about using instead? 00:46:39.970 --> 00:46:45.040 Well, let me go ahead and open up, for instance, the same file, 00:46:45.040 --> 00:46:47.685 but let me try putting it in CS50 IDE.
00:46:47.685 --> 00:46:50.560 I'm going to go ahead and download this, as you might have downloaded 00:46:50.560 --> 00:46:52.660 past problem sets, with wget. 00:46:52.660 --> 00:46:54.867 That's going to give me a pretty big file. 00:46:54.867 --> 00:46:56.950 There's a command now that you've not had occasion 00:46:56.950 --> 00:47:02.640 to use in the IDE called gunzip, which will unzip a file that starts with-- 00:47:02.640 --> 00:47:04.480 that ends with .gz. 00:47:04.480 --> 00:47:09.060 That's going to take a moment to unzip the whole file. 00:47:09.060 --> 00:47:11.310 A moment, a moment, a moment, OK. 00:47:11.310 --> 00:47:12.970 [CHUCKLES] And no space left on disk. 00:47:12.970 --> 00:47:14.880 But that's OK, because it turns out what I'm also 00:47:14.880 --> 00:47:17.260 going to start doing today is using my own Mac a bit more. 00:47:17.260 --> 00:47:19.260 And you'll recall last week, I did that a little bit, 00:47:19.260 --> 00:47:22.218 because when I wanted to play with speech recognition and my microphone, 00:47:22.218 --> 00:47:25.890 it's a lot easier to have Python running on my Mac or your PC 00:47:25.890 --> 00:47:28.200 if I want to use hardware built into my computer. 00:47:28.200 --> 00:47:31.680 Plus my MacBook is a lot faster, for instance, than a single cloud 00:47:31.680 --> 00:47:33.270 account on a shared server. 00:47:33.270 --> 00:47:37.770 So I'll go ahead and write some of this code here on my own machine instead. 00:47:37.770 --> 00:47:40.930 So let me go ahead and open a file called import.py. 00:47:40.930 --> 00:47:44.700 And I'm going to go ahead and import the CSV module, as always, 00:47:44.700 --> 00:47:49.490 and I'm going to go ahead and open this file, which is called title.basics.tsv 00:47:49.490 --> 00:47:53.425 in read-only mode, and I'm going to call this variable titles. 00:47:53.425 --> 00:47:54.550 What am I next going to do?
00:47:54.550 --> 00:47:58.230 Let me give myself a reader using csv.DictReader, 00:47:58.230 --> 00:48:01.290 as before, reading in those titles. 00:48:01.290 --> 00:48:04.800 But csv.DictReader feels like the wrong reader. 00:48:04.800 --> 00:48:07.040 Why? 00:48:07.040 --> 00:48:09.360 Just to be clear. 00:48:09.360 --> 00:48:11.445 Yeah, I mean, maybe I want TSV. 00:48:11.445 --> 00:48:14.290 But it turns out a TSV module doesn't exist. 00:48:14.290 --> 00:48:17.310 So even though I might be inclined, for instance, to change this 00:48:17.310 --> 00:48:21.570 to a T and this to a T, the CSV module does enough for me, 00:48:21.570 --> 00:48:24.540 but I need to tell it that I want to use a different delimiter. 00:48:24.540 --> 00:48:26.723 Instead of the default, which looks like this, 00:48:26.723 --> 00:48:28.890 I can actually override that and say, you know what? 00:48:28.890 --> 00:48:29.700 Use a tab. 00:48:29.700 --> 00:48:33.690 And just like backslash n in C and in Python, it's a new line. 00:48:33.690 --> 00:48:37.830 Backslash t in both languages is a tab character. 00:48:37.830 --> 00:48:41.100 All right, so once I've done this, let me go ahead and open up, now-- 00:48:41.100 --> 00:48:43.920 let me go ahead and open up shows0.csv. 00:48:43.920 --> 00:48:47.760 My goal in life now is to make this file a lot more manageable for myself. 00:48:47.760 --> 00:48:53.070 I want to take a 500-megabyte file and extract only the TV shows therein. 00:48:53.070 --> 00:48:56.190 Moreover, how about only the TV shows from 1970 onward? 00:48:56.190 --> 00:48:59.080 We won't go even further back than that. 00:48:59.080 --> 00:49:04.320 So let me go ahead and open up, in write mode, a file called shows0.csv. 00:49:04.320 --> 00:49:06.278 And I'm just going to call that variable shows. 00:49:06.278 --> 00:49:08.445 Then I'm going to go ahead and give myself a writer. 00:49:08.445 --> 00:49:10.290 And you might recall using this in the past.
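The delimiter override just described can be sketched like so, with a two-line in-memory stand-in for the real half-gigabyte file:

```python
import csv
import io

# Two tab-separated lines standing in for title.basics.tsv.
tsv = "tconst\ttitleType\tprimaryTitle\ntt4786824\ttvSeries\tThe Crown\n"

# DictReader handles tabs fine once told the delimiter is "\t", not ",".
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
rows = list(reader)
```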
00:49:10.290 --> 00:49:13.082 This is simply going to give me a variable called writer, via which 00:49:13.082 --> 00:49:14.400 I can write to a new file. 00:49:14.400 --> 00:49:18.750 Because again, the goal is to read this file and write to this file 00:49:18.750 --> 00:49:21.090 a subset of the data therein. 00:49:21.090 --> 00:49:23.160 So let me go ahead and write one row first. 00:49:23.160 --> 00:49:28.180 Write a row, passing in a list of values, specifically tconst, 00:49:28.180 --> 00:49:29.320 which is the title-- 00:49:29.320 --> 00:49:31.860 which is the ID field; primaryTitle, which 00:49:31.860 --> 00:49:35.910 is the title field; startYear, which is the year field; and genres, 00:49:35.910 --> 00:49:38.620 which was one more field that I mentioned earlier. 00:49:38.620 --> 00:49:43.248 So my goal is to export only those four columns that I care about for now. 00:49:43.248 --> 00:49:44.790 So I'm going to go ahead and do this. 00:49:44.790 --> 00:49:49.590 For each row in my reader, I'm going to go ahead and say if that 00:49:49.590 --> 00:49:56.340 row's titleType == tvSeries, which, recall, 00:49:56.340 --> 00:50:00.720 The Crown was an example of, then I'm going to go ahead and write 00:50:00.720 --> 00:50:01.710 to the writer-- 00:50:01.710 --> 00:50:05.940 whoops-- writer.writerow, a list containing what? 00:50:05.940 --> 00:50:13.230 The row's tconst value, the row's primary title value, the row-- 00:50:13.230 --> 00:50:21.214 whoops-- the row's start year, and lastly, the row's-- 00:50:21.214 --> 00:50:24.310 [CHUCKLES] the row's genres. 00:50:24.310 --> 00:50:26.560 So what am I doing here, just to be clear? 00:50:26.560 --> 00:50:30.760 So what I have done is I've first written out-- once and only 00:50:30.760 --> 00:50:33.070 once-- literally these values, because I want 00:50:33.070 --> 00:50:35.200 headers in the first line of my file.
00:50:35.200 --> 00:50:37.750 After that, for each row in the reader, which is currently 00:50:37.750 --> 00:50:40.900 iterating over this file in read-only mode, 00:50:40.900 --> 00:50:44.200 I want to print out the current row's tconst, current row's primary title, 00:50:44.200 --> 00:50:47.170 the current row's start year, and the current row's genres. 00:50:47.170 --> 00:50:52.450 But notice, I'm ignoring movies and shorts and documentaries and bunches 00:50:52.450 --> 00:50:54.080 of other values as well. 00:50:54.080 --> 00:50:54.830 And you know what? 00:50:54.830 --> 00:50:57.310 Just for good measure, let's shrink this a little bit. 00:50:57.310 --> 00:51:02.320 And row, how about, isAdult == "0", for today's purposes? 00:51:02.320 --> 00:51:03.970 So that'll filter the list further. 00:51:03.970 --> 00:51:06.160 If you'd like to flip that later, that's fine. 00:51:06.160 --> 00:51:11.920 So let me go ahead and do Python 3, for version 3 on my Mac-- actually, 00:51:11.920 --> 00:51:17.632 no, let's do Python of import.py, all right? 00:51:17.632 --> 00:51:19.090 I'm going to cross my fingers here. 00:51:19.090 --> 00:51:22.300 Hopefully the file is working, working, working. 00:51:22.300 --> 00:51:24.010 But it's taking a decent amount of time. 00:51:24.010 --> 00:51:29.290 Like, this is how much time it takes, apparently, to process over 00:51:29.290 --> 00:51:31.840 millions of rows of data. 00:51:31.840 --> 00:51:33.700 Still running. 00:51:33.700 --> 00:51:34.960 Still running. 00:51:34.960 --> 00:51:37.465 But the goal, again, is to shrink the amount 00:51:37.465 --> 00:51:40.090 of data I have to ultimately care about so that we can actually 00:51:40.090 --> 00:51:41.980 search it much more effectively. 00:51:41.980 --> 00:51:43.510 So, OK, it actually finished. 00:51:43.510 --> 00:51:46.430 So let me go ahead and open up shows0.csv. 00:51:46.430 --> 00:51:49.840 Notice now, in my text editor, I've got a lot less data.
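Put together, the import script being typed here amounts to something like the following sketch -- with tiny in-memory stand-ins for title.basics.tsv and shows0.csv rather than the actual files:

```python
import csv
import io

# In-memory stand-ins for title.basics.tsv (input) and shows0.csv (output).
tsv = ("tconst\ttitleType\tprimaryTitle\tisAdult\tstartYear\tgenres\n"
       "tt0000001\tshort\tCarmencita\t0\t1894\tDocumentary,Short\n"
       "tt4786824\ttvSeries\tThe Crown\t0\t2016\tBiography,Drama,History\n")
out = io.StringIO()

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
writer = csv.writer(out)

# Header row first, then only the non-adult TV series.
writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])
for row in reader:
    if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
        writer.writerow([row["tconst"], row["primaryTitle"],
                         row["startYear"], row["genres"]])

lines = out.getvalue().splitlines()
```

Note that csv.writer automatically quotes the genres cell, since it contains commas of its own.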
00:51:49.840 --> 00:51:51.820 I've thrown away everything I don't care about, 00:51:51.820 --> 00:51:54.970 but I've been left with tconst, primary title, start year, and genres, 00:51:54.970 --> 00:51:59.052 and everything herein is now consistent with that filtration. 00:51:59.052 --> 00:52:00.760 But I haven't filtered everything I said. 00:52:00.760 --> 00:52:03.380 What did I say I wanted to get rid of earlier? 00:52:06.590 --> 00:52:08.090 Yeah, the shows before 1970. 00:52:08.090 --> 00:52:11.550 And clearly some of these are coming from the 1940s and so forth. 00:52:11.550 --> 00:52:14.480 So let's go ahead and get rid of those, but see how. 00:52:14.480 --> 00:52:18.597 This is the CSV file that we just opened, but in Google Spreadsheet form. 00:52:18.597 --> 00:52:20.930 So I literally just imported it into Google Spreadsheets 00:52:20.930 --> 00:52:21.950 so that we could see it. 00:52:21.950 --> 00:52:25.640 Literally the same data as before, and there are those 1940s movies. 00:52:25.640 --> 00:52:28.490 But there's something curious that I wanted to be mindful of. 00:52:28.490 --> 00:52:34.290 If I scroll down in the start years and keep going and going and going, 00:52:34.290 --> 00:52:35.400 huh, those seem OK. 00:52:35.400 --> 00:52:36.650 Those are increasing in order. 00:52:36.650 --> 00:52:37.560 But let me try this. 00:52:37.560 --> 00:52:40.880 Let me just poke around my data, sorting in reverse order. 00:52:40.880 --> 00:52:44.480 It's going to take a while, because even this is a decent number of TV shows. 00:52:44.480 --> 00:52:47.000 Notice this weirdness. 00:52:47.000 --> 00:52:50.630 At the top of start year now, once I've reverse sorted them, 00:52:50.630 --> 00:52:53.450 there's a whole bunch of backslash capital N's. 00:52:53.450 --> 00:52:57.050 Now, this has nothing to do with C, and nothing to do with Python. 00:52:57.050 --> 00:52:59.900 It has everything to do with the documentation.
00:52:59.900 --> 00:53:05.930 If you read IMDB's data, as I only did carefully eventually, you'll see this. 00:53:05.930 --> 00:53:10.370 A backslash and capital N is used to denote that a particular field is 00:53:10.370 --> 00:53:12.592 missing or null for that title name. 00:53:12.592 --> 00:53:14.300 Now, this is important, because if I want 00:53:14.300 --> 00:53:17.090 to filter down to shows from 1970 onward, 00:53:17.090 --> 00:53:18.670 I need to be resilient against that. 00:53:18.670 --> 00:53:20.210 So let me go ahead and do this. 00:53:20.210 --> 00:53:27.590 So if the current row's startYear does not equal backslash capital N, 00:53:27.590 --> 00:53:31.850 then I'm going to go ahead and check that it's a TV series 00:53:31.850 --> 00:53:34.160 and that it's not an adult show. 00:53:34.160 --> 00:53:35.150 So that would help. 00:53:35.150 --> 00:53:38.110 But furthermore, how can I check this here? 00:53:38.110 --> 00:53:39.530 Let me do year. 00:53:39.530 --> 00:53:43.500 And how can I convert row startYear to an integer? 00:53:43.500 --> 00:53:45.500 Well, everything in a spreadsheet, by definition 00:53:45.500 --> 00:53:50.510 of it having been in a spreadsheet, or a TSV file or a CSV file, is text. 00:53:50.510 --> 00:53:52.940 But start year looks like years, so what Python 00:53:52.940 --> 00:53:55.430 function can I use to actually convert text that 00:53:55.430 --> 00:53:58.840 resembles a number to an actual number? 00:53:58.840 --> 00:53:59.340 Yeah. 00:53:59.340 --> 00:54:01.240 So we can do something like this. 00:54:01.240 --> 00:54:03.840 So I can convert year to an int, and now I 00:54:03.840 --> 00:54:08.530 can say, if year greater than or equal to 1970, 00:54:08.530 --> 00:54:12.348 now I'm going to go ahead and do those lines instead.
00:54:12.348 --> 00:54:14.640 Now, there's an opportunity, surely, for better design, 00:54:14.640 --> 00:54:16.530 because once your code starts doing this, 00:54:16.530 --> 00:54:18.360 you've done something suboptimally, right? 00:54:18.360 --> 00:54:21.112 This is not going to end well if all of my code starts wrapping. 00:54:21.112 --> 00:54:23.820 So I could clean up the logic in a little bit, but let's go ahead 00:54:23.820 --> 00:54:25.750 and run this just one more time. 00:54:25.750 --> 00:54:30.450 This time, changing this to shows1.csv so we can see slightly different 00:54:30.450 --> 00:54:30.960 outputs. 00:54:30.960 --> 00:54:33.546 Let me go ahead and run Python import.py. 00:54:33.546 --> 00:54:37.740 Huh, syntax error, Unicode error, codec-- 00:54:37.740 --> 00:54:39.000 that's a weird one. 00:54:39.000 --> 00:54:43.810 But this is because backslashes, recall, in C and Python, have special meaning. 00:54:43.810 --> 00:54:47.070 So when you do something like this, backslash capital N, 00:54:47.070 --> 00:54:49.860 even though it's not a lowercase n, backslash n, recall, 00:54:49.860 --> 00:54:52.110 is the escape character in C and Python. 00:54:52.110 --> 00:54:55.230 So this is like telling Python, this is a special character. 00:54:55.230 --> 00:54:56.092 But it's not. 00:54:56.092 --> 00:54:58.050 And we've never really had occasion to do this, 00:54:58.050 --> 00:55:00.240 but how would do you think we could output 00:55:00.240 --> 00:55:05.620 a literal backslash before a capital N? 00:55:05.620 --> 00:55:06.120 Yeah. 00:55:06.120 --> 00:55:08.610 So it turns out the solution to this problem, usually, 00:55:08.610 --> 00:55:11.730 no matter the language, is that if you want a literal character, not 00:55:11.730 --> 00:55:15.070 an escape character, you literally put another one of it before. 
00:55:15.070 --> 00:55:17.260 So even though this looks a little funky now, 00:55:17.260 --> 00:55:22.420 this backslash backslash capital N literally will mean backslash N. 00:55:22.420 --> 00:55:22.920 All right. 00:55:22.920 --> 00:55:26.360 So now, let me go ahead and run this on import.py. 00:55:26.360 --> 00:55:29.790 This time, I'm hopefully going to actually generate 00:55:29.790 --> 00:55:36.120 a new file called shows1.csv that has even less data that 00:55:36.120 --> 00:55:42.400 actually is going to contain my shows, but only a subset of them. 00:55:42.400 --> 00:55:45.000 And let's go ahead and pull the cake out of the oven this way. 00:55:45.000 --> 00:55:47.050 This is what I get now this time. 00:55:47.050 --> 00:55:50.058 So if I actually load the CSV, shows1.csv, into Google Spreadsheet, 00:55:50.058 --> 00:55:53.100 just because it's easier to look at than the black-and-white window, 00:55:53.100 --> 00:55:57.420 now you can see that I apparently am only getting shows 1970 and onward. 00:55:57.420 --> 00:55:59.940 And indeed, if I sorted them, I would see no backslash N's. 00:55:59.940 --> 00:56:03.960 I would have thrown away everything that doesn't meet those criteria. 00:56:03.960 --> 00:56:07.110 Well, let me go ahead and do one last thing here. 00:56:07.110 --> 00:56:10.020 I'm going to go ahead and make one more change. 00:56:10.020 --> 00:56:12.330 And first, let's improve the design here. 00:56:12.330 --> 00:56:16.710 This indentation is the result of my asking questions again and again 00:56:16.710 --> 00:56:20.410 and again and indenting if and only if those things are true. 00:56:20.410 --> 00:56:22.950 But notice, you can start to flip your logic here, right? 00:56:22.950 --> 00:56:26.790 Instead of saying if the start year does not equal backslash N, 00:56:26.790 --> 00:56:30.210 what if I just do this and say continue?
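The error and its fix come down to how Python string literals treat backslashes: a backslash followed by capital N begins a named Unicode escape (as in "\N{BULLET}"), so an unescaped one in source code is rejected at compile time, while doubling it yields the literal two characters IMDB writes for a missing field. A small demonstration:

```python
# IMDB's null marker is the two characters: backslash, capital N.
null_marker = "\\N"

# An unescaped backslash-N in source is rejected when Python compiles it,
# which is the Unicode/codec error seen above. (chr(92) is a backslash,
# used here so this snippet itself stays valid Python.)
try:
    compile('x = "' + chr(92) + 'N"', "<string>", "exec")
    raised = False
except SyntaxError:
    raised = True
```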
00:56:30.210 --> 00:56:32.010 I can then unindent this-- 00:56:32.010 --> 00:56:35.280 because if you've not used it before in Python and in C, 00:56:35.280 --> 00:56:39.880 if you say continue inside of a loop, it's not going to continue down there, 00:56:39.880 --> 00:56:42.690 it's going to [WHOOSH] continue to the start of the loop again. 00:56:42.690 --> 00:56:45.840 So via this logic, we can actually keep wrapping around again and again. 00:56:45.840 --> 00:56:50.460 And here, too, we could say, if year less than 1970, I can go ahead 00:56:50.460 --> 00:56:54.340 and say continue, which would then allow me to unindent this as well. 00:56:54.340 --> 00:56:57.510 So there are solutions, design-wise, to actually avoiding 00:56:57.510 --> 00:56:59.230 that infinite indentation. 00:56:59.230 --> 00:56:59.730 All right. 00:56:59.730 --> 00:57:02.130 Let's go ahead and do one last version. 00:57:02.130 --> 00:57:04.710 Then I'm going to go ahead and pull one out of the oven-- 00:57:04.710 --> 00:57:05.640 premade. 00:57:05.640 --> 00:57:08.680 So the last thing I didn't load before was this. 00:57:08.680 --> 00:57:12.780 Suppose that I want to load into the CSV file all of the genres 00:57:12.780 --> 00:57:14.820 associated with the show. 00:57:14.820 --> 00:57:17.940 It looks like all of these shows have one or more genres, just like you 00:57:17.940 --> 00:57:20.370 were asked for your favorite shows. 00:57:20.370 --> 00:57:24.780 And so now, we have a CSV file with tconst, primary titles, start year, 00:57:24.780 --> 00:57:29.280 and genres, where genres itself is a comma-separated list. 00:57:29.280 --> 00:57:31.290 But there's a fundamental problem here. 00:57:31.290 --> 00:57:35.040 Even though I have all of this data here, the best I can do 00:57:35.040 --> 00:57:36.460 is a program like this. 00:57:36.460 --> 00:57:42.190 Let me go ahead and search for, for instance, the following. 00:57:42.190 --> 00:57:44.765 Let me go ahead and grab a file real fast.
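The flipped logic described above -- guard clauses with continue instead of ever-deeper indentation -- can be sketched against a few made-up rows:

```python
# Guard clauses with continue keep the loop body flat instead of
# nesting if inside if inside if.
rows = [
    {"titleType": "movie", "startYear": "1980"},
    {"titleType": "tvSeries", "startYear": "\\N"},
    {"titleType": "tvSeries", "startYear": "1940"},
    {"titleType": "tvSeries", "startYear": "2016"},
]

kept = []
for row in rows:
    if row["titleType"] != "tvSeries":
        continue  # back to the top of the loop, skipping this row
    if row["startYear"] == "\\N":
        continue  # missing year, per IMDB's null marker
    if int(row["startYear"]) < 1970:
        continue  # too old for today's purposes
    kept.append(row["startYear"])
```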
00:57:48.410 --> 00:57:53.300 Let me go ahead and grab a copy of shows2.csv 00:57:53.300 --> 00:57:55.970 and write one final program here. 00:57:55.970 --> 00:57:59.712 If I want to go ahead now and search this very large, still, data set, 00:57:59.712 --> 00:58:02.780 in shows2.csv, well, let me go ahead and do this. 00:58:02.780 --> 00:58:06.528 Import CSV. 00:58:06.528 --> 00:58:08.570 Let me go ahead now and ask the user for a title. 00:58:08.570 --> 00:58:11.120 I could use CS50's get string, but there's really no need for that 00:58:11.120 --> 00:58:14.162 anymore now that we have the title function-- the input function, recall. 00:58:14.162 --> 00:58:15.310 So I'll just use that. 00:58:15.310 --> 00:58:20.900 And then I'm going to go ahead and open up shows2.csv in read-only mode. 00:58:20.900 --> 00:58:23.270 And I'm going to call that my file. 00:58:23.270 --> 00:58:27.470 Then I'm going to go ahead and give myself a reader from csv.DictReader, 00:58:27.470 --> 00:58:29.330 passing in that file. 00:58:29.330 --> 00:58:33.422 And now I'm going to go ahead and, for row in reader, do the following. 00:58:33.422 --> 00:58:35.630 The goal now is to write a Python program that allows 00:58:35.630 --> 00:58:37.620 me to search only those TV shows. 00:58:37.620 --> 00:58:44.420 So I could say something like this, if title == row "primaryTitle", 00:58:44.420 --> 00:58:50.060 then I can go ahead, for instance, and print out row "primaryTitle," 00:58:50.060 --> 00:58:53.040 started in row "startYear." 00:58:53.040 --> 00:58:55.130 So what is the goal of this program? 00:58:55.130 --> 00:58:56.887 It's going to ask the user for input. 00:58:56.887 --> 00:58:58.970 It's going to open this big CSV that I've created. 00:58:58.970 --> 00:59:01.520 But that's still smaller than the 50-megabyte version. 
00:59:01.520 --> 00:59:04.670 It's going to iterate over every row in that file via DictReader, 00:59:04.670 --> 00:59:09.800 checking if the title the human typed in equals the current row's primary title. 00:59:09.800 --> 00:59:13.190 And if so, it's going to print the title and year of that show. 00:59:13.190 --> 00:59:17.750 So if I go ahead and run Python of search.py, 00:59:17.750 --> 00:59:22.760 typing in something like The Crown, Enter, voila, I get that answer. 00:59:22.760 --> 00:59:26.710 If I go ahead and do The Office, Enter, there's a bunch of Offices. 00:59:26.710 --> 00:59:29.210 And in fact, if you haven't seen more than the American one, 00:59:29.210 --> 00:59:31.620 there's the UK one, and apparently several others. 00:59:31.620 --> 00:59:33.410 And we can actually corroborate this now. 00:59:33.410 --> 00:59:36.590 If you go to IMDB, where all of this data originally came from, 00:59:36.590 --> 00:59:40.730 and type The Office, there are all of The Offices that actually line up 00:59:40.730 --> 00:59:42.470 with our very own data set. 00:59:42.470 --> 00:59:44.690 It's going to be so damn tedious every time 00:59:44.690 --> 00:59:46.610 you want to search for data or update data 00:59:46.610 --> 00:59:49.880 or insert new data to write a Python program to do it, 00:59:49.880 --> 00:59:51.822 so we need a few more features of SQL. 00:59:51.822 --> 00:59:53.780 But I think first, we need some Halloween candy 00:59:53.780 --> 00:59:55.790 for our five-minute break out back. 00:59:55.790 --> 00:59:57.560 All right, we are back. 00:59:57.560 --> 01:00:00.680 So we have a whole bunch more data now, because we've downloaded 01:00:00.680 --> 01:00:03.470 that really big TSV file from IMDB.
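search.py, as described, is a linear scan: every row is visited top to bottom. A self-contained sketch with a two-row stand-in for shows2.csv and the user's input hard-coded:

```python
import csv
import io

# Miniature shows2.csv; the quoted genres cell keeps its commas intact.
data = ("tconst,primaryTitle,startYear,genres\n"
        "tt0386676,The Office,2005,Comedy\n"
        'tt0290978,The Office,2001,"Comedy,Drama"\n')

title = "The Office"  # stands in for input("Title: ")

# Linear search: compare the title against every row, one by one.
matches = []
for row in csv.DictReader(io.StringIO(data)):
    if title == row["primaryTitle"]:
        matches.append((row["primaryTitle"], row["startYear"]))
```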
01:00:03.470 --> 01:00:07.700 I've simplified it into some CSV files, but that really gets me half of the way 01:00:07.700 --> 01:00:10.910 there, because now if I want to search the data, as with search.py, 01:00:10.910 --> 01:00:14.540 I still have to look over thousands of rows of TV shows, 01:00:14.540 --> 01:00:17.990 and it's only going to be linear search by nature of just how files are read, 01:00:17.990 --> 01:00:19.700 top to bottom, left to right. 01:00:19.700 --> 01:00:25.130 But it turns out in Python that you can actually write SQL code itself. 01:00:25.130 --> 01:00:28.640 And CS50 has a Python library that has not only get string and get int 01:00:28.640 --> 01:00:33.200 and so forth, but it also has a SQL function built in that allows you 01:00:33.200 --> 01:00:38.000 to connect to, so to speak, a file that ends in something like .db, 01:00:38.000 --> 01:00:41.960 which is to say, in a moment, we'll start to write some Python code now 01:00:41.960 --> 01:00:47.750 toward an end of loading a really large data set like IMDB's into a proper SQL 01:00:47.750 --> 01:00:51.500 database, thereby allowing us hereafter to use all of the power 01:00:51.500 --> 01:00:54.650 and expressiveness of SQL, and more examples of that in just a bit like 01:00:54.650 --> 01:00:58.160 SELECT, UPDATE, DELETE, and INSERT, without having to write a whole bunch 01:00:58.160 --> 01:00:59.120 of Python code. 01:00:59.120 --> 01:01:03.860 So to be clear, instead of using Python to search and manipulate our data, 01:01:03.860 --> 01:01:08.120 we're going to write a script, a program in Python, whose sole purpose in life 01:01:08.120 --> 01:01:10.283 is to get data from one format into another. 
01:01:10.283 --> 01:01:12.950 And you can imagine this being generally useful, whether it's 01:01:12.950 --> 01:01:15.260 a Google Spreadsheet you've downloaded or a large data 01:01:15.260 --> 01:01:17.750 set you found on the internet that you want to use for a final project 01:01:17.750 --> 01:01:18.830 or for some other class. 01:01:18.830 --> 01:01:22.640 Python can be a really powerful way of taking one data source as input 01:01:22.640 --> 01:01:26.330 and producing as its output now, as of today, SQL instead. 01:01:26.330 --> 01:01:31.670 So let's go ahead and iterate one final time over our title.basics.tsv. 01:01:31.670 --> 01:01:34.220 But this time, not just save it into a CSV file. 01:01:34.220 --> 01:01:39.740 Let's put it into a proper SQL database on my own Mac or your PC. 01:01:39.740 --> 01:01:41.510 So let me go ahead and do this. 01:01:41.510 --> 01:01:47.150 First, let me go ahead and say, just like on the slide, db = cs50.SQL, 01:01:47.150 --> 01:01:53.567 and then quote unquote, "sqlite:///--" so the third slash is not a typo. 01:01:53.567 --> 01:01:54.650 It should indeed be there. 01:01:54.650 --> 01:01:57.260 And I'm going to say shows3.db, just because this 01:01:57.260 --> 01:01:59.690 is version 3 now of my import script. 01:01:59.690 --> 01:02:03.680 I'm going to go ahead, just as last week, and now import CSV-- 01:02:03.680 --> 01:02:06.240 CS50's library as well. 01:02:06.240 --> 01:02:11.330 But for this to work, the file shows3.db needs to exist first. 01:02:11.330 --> 01:02:15.260 And there's a couple of ways on a Mac or a PC or a Linux computer, typically, 01:02:15.260 --> 01:02:18.440 to create an empty file that's ready to receive data. 01:02:18.440 --> 01:02:20.990 You can literally use the command touch, which 01:02:20.990 --> 01:02:25.160 just will create an empty file by whatever name you type at the prompt. 01:02:25.160 --> 01:02:26.900 Or we can do this programmatically. 
01:02:26.900 --> 01:02:28.310 And I'm going to do it programmatically, because I 01:02:28.310 --> 01:02:30.352 bet I'm going to screw up one or more times here, 01:02:30.352 --> 01:02:34.220 and it's going to be useful to let my Python program create and recreate 01:02:34.220 --> 01:02:37.250 the database again and again and again until I get it right. 01:02:37.250 --> 01:02:42.920 So let me go ahead and open a file called shows3.db in write mode. 01:02:42.920 --> 01:02:45.860 And recall from Python and C, using fopen, 01:02:45.860 --> 01:02:50.180 anytime you open a file in write mode, it will overwrite any file that exists 01:02:50.180 --> 01:02:52.340 or create any file that doesn't. 01:02:52.340 --> 01:02:53.780 That's all I needed to do. 01:02:53.780 --> 01:02:57.830 So in Python, too, recall that we were able to use this dot notation. 01:02:57.830 --> 01:02:59.780 And it turns out here, when you open a file, 01:02:59.780 --> 01:03:02.900 if you want to immediately close it, because your only goal was to create 01:03:02.900 --> 01:03:06.930 it, you can just do .close on the very thing you just opened. 01:03:06.930 --> 01:03:09.570 That is equivalent, just to be clear, to doing 01:03:09.570 --> 01:03:14.790 something a little more pedantic like this, file = open, and then file.close. 01:03:14.790 --> 01:03:18.070 But we can collapse this into one slick one-liner, so to speak, 01:03:18.070 --> 01:03:21.840 by just doing instead what I did a moment ago. 01:03:21.840 --> 01:03:27.330 All that does is create an empty shows3.db file. 01:03:27.330 --> 01:03:32.385 Now, open that file for SQLite. 01:03:32.385 --> 01:03:34.260 And again, SQLite is the light version of SQL 01:03:34.260 --> 01:03:36.060 that anyone can use on their own Mac or PC. 01:03:36.060 --> 01:03:39.220 You don't need a special server to get up and running with it. 01:03:39.220 --> 01:03:42.990 So now let me go ahead and open up title.basics.tsv. 
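The file-creation one-liner just described can be sketched as follows. Note that the last two lines swap in Python's built-in sqlite3 module for CS50's library, since db = cs50.SQL("sqlite:///shows3.db") plays the analogous role in the lecture itself:

```python
import sqlite3

# Opening a file in write mode overwrites any file that exists or creates
# one that doesn't, just like fopen(..., "w") in C; calling .close() on the
# very thing open() returns collapses create-then-close into one slick line
open("shows3.db", "w").close()

# The more pedantic, two-step equivalent:
file = open("shows3.db", "w")
file.close()

# With the empty file in place, a program can now connect to it
# (the lecture uses CS50's library here; this is the stdlib equivalent)
db = sqlite3.connect("shows3.db")
db.close()
```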
01:03:42.990 --> 01:03:48.060 And then here, let me go ahead and create myself a DictReader so 01:03:48.060 --> 01:03:50.910 that I can iterate over the lines in that TSV file. 01:03:50.910 --> 01:03:55.180 And now, let me go ahead, and for row in reader, do the following. 01:03:55.180 --> 01:03:57.390 I first want to filter out stuff just as before. 01:03:57.390 --> 01:04:04.200 So I'm going to say if row bracket "titleType" == "tvSeries" 01:04:04.200 --> 01:04:10.830 and row "isAdult" == quote unquote "0", then I'm going to go ahead and check 01:04:10.830 --> 01:04:11.720 one other thing. 01:04:11.720 --> 01:04:15.465 I'm going to go ahead and give myself a start year variable, similar to before, 01:04:15.465 --> 01:04:17.130 although I called it year earlier. 01:04:17.130 --> 01:04:19.980 Then let me go ahead and do row "startYear", 01:04:19.980 --> 01:04:22.050 just so I can cast that to an int. 01:04:22.050 --> 01:04:27.240 But I only want to do that if row "startYear" does not 01:04:27.240 --> 01:04:31.620 equal that special backslash N that IMDB told me to watch out for. 01:04:31.620 --> 01:04:34.600 So I only want to do that if it's not that. 01:04:34.600 --> 01:04:38.670 And then if startYear is greater than or equal to 1970, 01:04:38.670 --> 01:04:41.080 let's go ahead and do the following. 01:04:41.080 --> 01:04:46.680 Let's go ahead and do genres, gets row "genres". 01:04:46.680 --> 01:04:50.073 Let's go ahead and get tconst, gets row "tconst," 01:04:50.073 --> 01:04:52.740 just so I can put these in some slightly shorter variable names, 01:04:52.740 --> 01:04:54.360 just to keep myself sane. 01:04:54.360 --> 01:04:58.770 primaryTitle is going to be from row "primaryTitle." 01:04:58.770 --> 01:05:00.720 and then let me go ahead and give myself-- 01:05:00.720 --> 01:05:04.150 we already have startYear, so those are the only other three fields I need. 
01:05:04.150 --> 01:05:08.460 So now, I want to go ahead and insert this row from my TSV 01:05:08.460 --> 01:05:10.002 into a SQLite database. 01:05:10.002 --> 01:05:12.960 And the operative word that we saw earlier but haven't used yet 01:05:12.960 --> 01:05:13.710 is INSERT. 01:05:13.710 --> 01:05:15.390 We did use SELECT. 01:05:15.390 --> 01:05:16.500 We did use UPDATE. 01:05:16.500 --> 01:05:17.460 We did use DELETE. 01:05:17.460 --> 01:05:18.850 We haven't used INSERT yet. 01:05:18.850 --> 01:05:20.350 So I'm going to do that in a moment. 01:05:20.350 --> 01:05:23.200 But first, I need my database to actually exist, 01:05:23.200 --> 01:05:25.330 so I need to create an actual table. 01:05:25.330 --> 01:05:28.860 So I'm going to go up here first and do this, db is a reference now, 01:05:28.860 --> 01:05:30.995 a variable representing my database, and I'm 01:05:30.995 --> 01:05:33.120 going to call the only function inside of it that's 01:05:33.120 --> 01:05:36.000 useful for our purposes, called execute. 01:05:36.000 --> 01:05:39.240 What I can now do is execute any SQL I want. 01:05:39.240 --> 01:05:41.280 So what do I want to load into this database? 01:05:41.280 --> 01:05:45.930 I think I want to load in the tconst, the primaryTitle, the startYear, 01:05:45.930 --> 01:05:51.690 and the genres, just like we had earlier from title.basics.tsv. 01:05:51.690 --> 01:05:55.140 I want to load rows that represent this kind of data, all right? 01:05:55.140 --> 01:05:56.620 So how am I going to do this? 01:05:56.620 --> 01:05:58.590 Well, let me go ahead and create a table. 01:05:58.590 --> 01:06:02.880 I'm going to call it shows, because that seems nice and conceptually consistent. 01:06:02.880 --> 01:06:06.510 I'm going to go ahead and create a list of columns now. 01:06:06.510 --> 01:06:08.400 tconst is going to be one column. 01:06:08.400 --> 01:06:10.290 primaryTitle is going to be another. 01:06:10.290 --> 01:06:12.000 startYear is going to be another. 
01:06:12.000 --> 01:06:14.610 And genres is going to be the last. 01:06:14.610 --> 01:06:20.310 I can literally, that is to say, write SQL inside of a string 01:06:20.310 --> 01:06:23.700 that I pass to a Python function called db.execute. 01:06:23.700 --> 01:06:27.300 And because db.execute-- or rather, because db 01:06:27.300 --> 01:06:34.380 was configured with shows3.db, when I execute this string in Python, 01:06:34.380 --> 01:06:38.100 it's going to get executed on that database file, shows3.db. 01:06:38.100 --> 01:06:40.573 So it's a nice way of bridging these two worlds. 01:06:40.573 --> 01:06:42.990 So I'm going to have to be a little more specific, though. 01:06:42.990 --> 01:06:45.707 Recall that SQL has a bunch of types. 01:06:45.707 --> 01:06:47.040 And I'm going to keep it simple. 01:06:47.040 --> 01:06:50.910 I'm going to go ahead and say that the type of that tconst value is text. 01:06:50.910 --> 01:06:53.220 The type of the primaryTitle is text. 01:06:53.220 --> 01:06:56.400 The type of startYear is going to be numeric, kind of a catch 01:06:56.400 --> 01:06:58.020 all for dates and date times. 01:06:58.020 --> 01:07:00.390 And then genres is going to be text as well. 01:07:00.390 --> 01:07:01.770 So the syntax is a little funky. 01:07:01.770 --> 01:07:04.860 You actually specify the name of the column and then the type, 01:07:04.860 --> 01:07:08.408 as opposed to the opposite, which we did in C. But that's the way SQL is. 01:07:08.408 --> 01:07:10.200 So I'm going to go ahead and save that now. 01:07:10.200 --> 01:07:14.180 And just to comment this, this is going to create 01:07:14.180 --> 01:07:21.590 a table called shows in database file called shows3.db, 01:07:21.590 --> 01:07:23.100 just to be super explicit. 01:07:23.100 --> 01:07:24.780 So what am I going to do down here? 01:07:24.780 --> 01:07:27.450 It looks like I have the ability with CS50's library 01:07:27.450 --> 01:07:29.320 to execute any SQL I want. 
01:07:29.320 --> 01:07:34.230 So let me go ahead and insert into shows the following values, 01:07:34.230 --> 01:07:39.570 a tconst, a primaryTitle, a startYear, and a genre-- 01:07:39.570 --> 01:07:40.920 and genres. 01:07:40.920 --> 01:07:41.940 What values? 01:07:41.940 --> 01:07:43.443 I want to insert these values. 01:07:43.443 --> 01:07:45.360 Now, I don't know in advance, so I'm literally 01:07:45.360 --> 01:07:47.380 going to put some question marks here. 01:07:47.380 --> 01:07:51.630 And it turns out in SQL, this is valid syntax for the library we're using. 01:07:51.630 --> 01:07:54.000 This is the INSERT query. 01:07:54.000 --> 01:07:57.750 INSERT INTO the table name a parenthesized list 01:07:57.750 --> 01:08:00.270 of the columns you want to insert data into. 01:08:00.270 --> 01:08:03.078 Then, a set of values in separate parentheses. 01:08:03.078 --> 01:08:05.370 And for now, I'm using question marks for placeholders, 01:08:05.370 --> 01:08:07.248 for reasons we'll come back to. 01:08:07.248 --> 01:08:09.540 But I'm going to go ahead and plug the following values 01:08:09.540 --> 01:08:18.479 into those placeholders, tconst, primaryTitle, startYear, and genres. 01:08:18.479 --> 01:08:20.819 And what the db.execute function is going 01:08:20.819 --> 01:08:24.399 to do for me automatically is it's going to look at this SQL query. 01:08:24.399 --> 01:08:27.630 Notice that, oh, it's got four question marks in it, or placeholders. 01:08:27.630 --> 01:08:32.850 Those, in SQL, are like the %s in C or like the curly braces 01:08:32.850 --> 01:08:34.319 in Python f-strings. 01:08:34.319 --> 01:08:38.130 So this says, give me 1, 2, 3, 4 placeholders and plug in, ultimately, 01:08:38.130 --> 01:08:41.220 the following four values, tconst, which is just a variable; 01:08:41.220 --> 01:08:44.040 primaryTitle, which is the same; startYear, which is the same; 01:08:44.040 --> 01:08:45.968 and genres, which is the same. 
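The filtering and INSERT logic just walked through might be sketched like this. This uses the standard library's sqlite3, whose execute method also accepts ? placeholders, in place of CS50's db.execute, and the import_shows wrapper is an addition here so the loop is self-contained; the column names, the \N check, and the 1970 cutoff come from the lecture:

```python
import sqlite3

def import_shows(rows, db):
    """Create the shows table and INSERT every qualifying TSV row into it."""
    db.execute("CREATE TABLE shows (tconst TEXT, primaryTitle TEXT, "
               "startYear NUMERIC, genres TEXT)")
    for row in rows:
        # Filter: TV series only, non-adult only
        if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
            # \N is IMDb's marker for a missing value, so skip those
            if row["startYear"] != "\\N":
                startYear = int(row["startYear"])
                if startYear >= 1970:
                    # The ? placeholders are to SQL what %s is to C's printf
                    # or {} is to a Python f-string
                    db.execute("INSERT INTO shows (tconst, primaryTitle, "
                               "startYear, genres) VALUES (?, ?, ?, ?)",
                               (row["tconst"], row["primaryTitle"],
                                startYear, row["genres"]))
    db.commit()
```

Feeding this function a csv.DictReader over title.basics.tsv, as in the lecture, would populate the database one INSERT at a time.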
01:08:45.968 --> 01:08:47.010 So what am I going to do? 01:08:47.010 --> 01:08:52.290 For every TV series in this file that's not an adult series 01:08:52.290 --> 01:08:57.270 and that started after 1970, insert it into my database. 01:08:57.270 --> 01:09:00.689 If I've made no typos, I'm going to go ahead and run this, cross my fingers, 01:09:00.689 --> 01:09:02.160 and enter. 01:09:02.160 --> 01:09:05.279 This one is going to take more time, because it turns out writing 01:09:05.279 --> 01:09:07.109 to a CSV file is actually pretty quick. 01:09:07.109 --> 01:09:10.080 You can just write row, write row, write row, write row, 01:09:10.080 --> 01:09:13.979 but inserting into a SQLite database is going to take more time. 01:09:13.979 --> 01:09:19.000 More time upfront, but it's going to be a lot faster to search thereafter. 01:09:19.000 --> 01:09:23.520 So let me go ahead and do the cake in the oven thing 01:09:23.520 --> 01:09:25.680 and go ahead and now open up a file I made 01:09:25.680 --> 01:09:29.250 in advance in today's src3 directory called shows3.db 01:09:29.250 --> 01:09:33.000 using SQLite3, that command line program we used earlier. 01:09:33.000 --> 01:09:38.310 Recall that I can say .schema to see the types of data in the database. 01:09:38.310 --> 01:09:40.470 And indeed, look what I've done in advance. 01:09:40.470 --> 01:09:47.250 I created a table called shows with exactly those columns in a-- 01:09:47.250 --> 01:09:51.840 with exactly these four columns, tconst, primaryTitle, startYear, and genres. 01:09:51.840 --> 01:09:55.950 But I did this in advance to save us time so that I can now do SELECT 01:09:55.950 --> 01:09:58.268 * FROM shows. 01:09:58.268 --> 01:09:59.560 And let me not get all of them. 01:09:59.560 --> 01:10:02.580 Let me do the first 10, semicolon. 01:10:02.580 --> 01:10:03.120 All right. 01:10:03.120 --> 01:10:06.610 So we see the first 10 shows from IMDB in whatever order 01:10:06.610 --> 01:10:08.010 IMDB distributes them. 
01:10:08.010 --> 01:10:10.590 You can see their ID numbers are incrementing. 01:10:10.590 --> 01:10:11.460 And All My Children. 01:10:11.460 --> 01:10:13.418 I remember growing up with that show years ago. 01:10:13.418 --> 01:10:15.410 And it seems that that's a drama-- 01:10:15.410 --> 01:10:18.540 it's not really mystery, but so be it, and a romance show there. 01:10:18.540 --> 01:10:23.370 But it's indeed 1970, as is every show thereafter, from 1970 onward. 01:10:23.370 --> 01:10:26.350 I can go ahead and search for more like this. 01:10:26.350 --> 01:10:28.590 Let me give myself the first 100 shows. 01:10:28.590 --> 01:10:30.540 All right, so this is a pretty large data set. 01:10:30.540 --> 01:10:32.370 And let me go ahead and count them all. 01:10:32.370 --> 01:10:34.745 Recall that you can use a COUNT function so that we don't 01:10:34.745 --> 01:10:36.330 have to print them all on my screen. 01:10:36.330 --> 01:10:42.090 It looks like there are 153,331 TV series in IMDB. 01:10:42.090 --> 01:10:45.720 Crazier than that, you want to know how many of them came out this year? 01:10:45.720 --> 01:10:50.430 WHERE startYear = 2019;. 01:10:50.430 --> 01:10:57.090 There were 6,099 new TV shows in the world this year according to IMDB. 01:10:57.090 --> 01:11:00.900 Just one of those is The Office that won our-- 01:11:00.900 --> 01:11:01.950 won our vote earlier. 01:11:01.950 --> 01:11:06.280 So SELECT *, star denoting wild card, in this case, everything. 01:11:06.280 --> 01:11:08.290 And unfortunately, SQL has two wild cards. 01:11:08.290 --> 01:11:12.600 Star means select all of the columns in the table. 01:11:12.600 --> 01:11:17.220 Percent means let any characters come before or after a quoted 01:11:17.220 --> 01:11:18.870 string in a WHERE clause. 01:11:18.870 --> 01:11:24.660 So let me go ahead and SELECT * FROM shows WHERE title = The Office, 01:11:24.660 --> 01:11:26.700 and we'll actually see-- whoops. 01:11:26.700 --> 01:11:28.800 primaryTitle, sorry. 
01:11:28.800 --> 01:11:31.530 Let me fix that. primaryTitle = The Office. 01:11:31.530 --> 01:11:33.150 There are all of those Offices. 01:11:33.150 --> 01:11:37.230 And indeed, 2005 is probably the one we know and love. 01:11:37.230 --> 01:11:44.370 If I go to a browser and go to IMDB slash title slash that ID, 01:11:44.370 --> 01:11:46.620 indeed, that's probably the one we're all thinking of, 01:11:46.620 --> 01:11:48.790 unless you voted for the UK version instead. 01:11:48.790 --> 01:11:52.290 So again, this is actually real, live data that we're now playing with. 01:11:52.290 --> 01:11:53.738 Well, what more can we do? 01:11:53.738 --> 01:11:55.530 Well, there's one thing that I don't really 01:11:55.530 --> 01:12:00.540 like about this, which is that when we select all of the columns-- 01:12:00.540 --> 01:12:02.130 and let's go ahead and do this. 01:12:02.130 --> 01:12:04.440 Let's select another hundred of them before. 01:12:04.440 --> 01:12:10.470 This feels a little messy that we have all of these nice, clean columns except 01:12:10.470 --> 01:12:12.270 for when we get to genres. 01:12:12.270 --> 01:12:14.902 Then we just have this arbitrary comma-separated list. 01:12:14.902 --> 01:12:16.860 Suppose, for instance, I want to search for all 01:12:16.860 --> 01:12:19.820 of the comedies that came out in 2019. 01:12:19.820 --> 01:12:27.870 I could say SELECT * FROM shows where genres = "Comedy" AND-- 01:12:27.870 --> 01:12:29.880 turns out you can use conjunctions like this-- 01:12:29.880 --> 01:12:32.970 startYear = 2019. 01:12:32.970 --> 01:12:34.290 So that gives me a whole bunch. 01:12:34.290 --> 01:12:35.490 Let's count them. 01:12:35.490 --> 01:12:38.490 So COUNT this here, Enter. 01:12:38.490 --> 01:12:42.390 OK, a thousand of those 6,000 shows are comedies. 01:12:42.390 --> 01:12:44.850 But I think that's an underestimate. 01:12:44.850 --> 01:12:48.810 Why is this query buggy at the moment? 01:12:48.810 --> 01:12:49.458 Yeah? 
01:12:49.458 --> 01:12:50.587 AUDIENCE: [INAUDIBLE] 01:12:50.587 --> 01:12:52.920 DAVID MALAN: Yeah, some of them had more than one genre, 01:12:52.920 --> 01:12:56.070 so comedy is somewhere in that comma-separated list. 01:12:56.070 --> 01:13:00.450 And so what I should probably do instead is not say genre = "Comedy," 01:13:00.450 --> 01:13:05.490 but maybe genres LIKE "Comedy," and allow something maybe to appear before, 01:13:05.490 --> 01:13:07.140 something maybe to appear after. 01:13:07.140 --> 01:13:11.040 And that's going to give me 1,593 comedies that 01:13:11.040 --> 01:13:13.290 came out this year in 2019. 01:13:13.290 --> 01:13:14.820 So that seems a little better. 01:13:14.820 --> 01:13:17.220 But this is not very robust, right? 01:13:17.220 --> 01:13:19.890 Once you start resorting to techniques like this, 01:13:19.890 --> 01:13:22.920 it should start, as a programmer, to rub you the wrong way. 01:13:22.920 --> 01:13:24.168 It's kind of a hack, right? 01:13:24.168 --> 01:13:27.210 Like, you're searching for comedy, but there could be something before it 01:13:27.210 --> 01:13:29.640 or something after it, and odds are, there 01:13:29.640 --> 01:13:32.400 is no other word I can think of in the world of genres 01:13:32.400 --> 01:13:36.142 that starts with or ends with comedy, so we're probably OK. 01:13:36.142 --> 01:13:38.850 But this is kind of hack-ish, that you're just kind of searching. 01:13:38.850 --> 01:13:44.790 It would be nice if we could just search for a specific column called genre. 01:13:44.790 --> 01:13:47.170 So how can we go about doing that? 01:13:47.170 --> 01:13:50.670 Well, let me go ahead and do the following instead. 01:13:50.670 --> 01:13:54.570 Let me go ahead and open up a final version of my import script, 01:13:54.570 --> 01:13:58.020 this one that does two things up top. 
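The bug just identified, and its LIKE-based workaround, can be seen in miniature with a toy table (sketched here with the standard library's sqlite3; the show titles are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (primaryTitle TEXT, startYear NUMERIC, genres TEXT)")
db.executemany("INSERT INTO shows VALUES (?, ?, ?)", [
    ("Show A", 2019, "Comedy"),
    ("Show B", 2019, "Comedy,Drama"),  # equality misses this row
    ("Show C", 2019, "Drama"),
])

# Exact equality matches only rows whose entire genres string is "Comedy"
exact = db.execute("SELECT COUNT(*) FROM shows "
                   "WHERE genres = 'Comedy' AND startYear = 2019").fetchone()[0]

# LIKE with % wildcards matches "Comedy" anywhere in the comma-separated
# list -- the hack-ish fix, which also catches the multi-genre rows
like = db.execute("SELECT COUNT(*) FROM shows "
                  "WHERE genres LIKE '%Comedy%' AND startYear = 2019").fetchone()[0]
```

Here exact counts 1 row while like counts 2, mirroring the roughly 1,000-versus-1,593 discrepancy in the lecture's data.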
01:13:58.020 --> 01:14:02.780 At the top, I'm going to create two tables, one called shows, which has-- 01:14:02.780 --> 01:14:04.950 I'm going to clean up the column names, too. 01:14:04.950 --> 01:14:06.930 IMDB is a little nonconventional. 01:14:06.930 --> 01:14:10.290 What most people would do when describing a unique identifier, 01:14:10.290 --> 01:14:12.210 they're going to call it ID, not tconst. 01:14:12.210 --> 01:14:13.895 So we're going to rename it to ID. 01:14:13.895 --> 01:14:16.020 They're not going to call their title primaryTitle. 01:14:16.020 --> 01:14:18.660 They're going to call it title, so we're going to rename it title. 01:14:18.660 --> 01:14:19.770 They're not going to call it startYear. 01:14:19.770 --> 01:14:20.937 We're going to call it year. 01:14:20.937 --> 01:14:22.060 And then that's it. 01:14:22.060 --> 01:14:24.270 We'll come back to primary key in just a moment. 01:14:24.270 --> 01:14:25.530 But notice this. 01:14:25.530 --> 01:14:28.320 In my new and final version of this script, I'm creating, 01:14:28.320 --> 01:14:33.060 I propose, a second table called genres whose purpose in life 01:14:33.060 --> 01:14:38.250 is to contain a value called show_id and another one called genre. 01:14:38.250 --> 01:14:39.750 So what's going on? 01:14:39.750 --> 01:14:42.810 Well, let me go ahead and show this. 01:14:42.810 --> 01:14:48.930 If I load the resulting database from this one, shows4.db, and I do .schema, 01:14:48.930 --> 01:14:50.850 you'll see that I indeed have two tables. 01:14:50.850 --> 01:14:55.150 Let me go ahead and SELECT * FROM shows WHERE title, this time, 01:14:55.150 --> 01:14:59.410 because I've renamed it from primaryTitle, = The Office. 01:14:59.410 --> 01:14:59.910 OK. 01:14:59.910 --> 01:15:01.050 That's a lot of Offices. 01:15:01.050 --> 01:15:06.700 But let's go, AND year = 2005, which is the one we're all thinking about. 01:15:06.700 --> 01:15:07.860 And it's that one. 01:15:07.860 --> 01:15:09.810 And now, notice this. 
01:15:09.810 --> 01:15:12.360 Notice that I'm getting back what? 01:15:12.360 --> 01:15:19.590 An ID, I'm getting back a title, and I'm getting back a year, but no genres. 01:15:19.590 --> 01:15:24.053 That's because there's another table now called genres that's separate. 01:15:24.053 --> 01:15:25.470 And you know, I'm kind of curious. 01:15:25.470 --> 01:15:27.763 I see that the genres table has show_id. 01:15:27.763 --> 01:15:28.680 Let me go ahead and do this. 01:15:28.680 --> 01:15:34.200 SELECT * FROM genres WHERE show_id =, and let 01:15:34.200 --> 01:15:37.830 me do a little copy paste here, = this show_id. 01:15:37.830 --> 01:15:39.780 And what might I see? 01:15:39.780 --> 01:15:41.100 Comedy. 01:15:41.100 --> 01:15:42.840 So what have we done now? 01:15:42.840 --> 01:15:46.170 For any TV show that was in IMDB's database that 01:15:46.170 --> 01:15:50.010 had a comma-separated list of genres, I've exploded it, so to speak. 01:15:50.010 --> 01:15:52.170 I've split that value on the commas. 01:15:52.170 --> 01:15:56.070 And if the show is a comedy, I've added a row in the genres table, 01:15:56.070 --> 01:15:59.310 but then I've jotted down the show's ID next to that genre 01:15:59.310 --> 01:16:02.670 so I remember that that show was of that genre. 01:16:02.670 --> 01:16:05.770 But if another show has multiple genres-- for instance, 01:16:05.770 --> 01:16:10.980 let's go ahead and search for not The Office but, say, The Crown. 01:16:10.980 --> 01:16:12.540 And there's only one of those. 01:16:12.540 --> 01:16:20.190 And now I do SELECT * FROM genres WHERE show_id = this number, 01:16:20.190 --> 01:16:25.360 we'll see that, oh, The Crown now has drama and history as a genre. 01:16:25.360 --> 01:16:27.570 And so therefore, in the genres table, notice 01:16:27.570 --> 01:16:30.690 that there's two IDs and two genres. 01:16:30.690 --> 01:16:32.220 But now we can use equality. 
01:16:32.220 --> 01:16:39.300 I can now search for all of the comedies from 2019 in kind of a powerful way, 01:16:39.300 --> 01:16:47.010 SELECT * FROM shows WHERE id IN-- 01:16:47.010 --> 01:16:48.330 and here's the cool part-- 01:16:48.330 --> 01:17:01.020 SELECT show_id FROM genres WHERE genre = "Comedy" AND year = 2019. 01:17:01.020 --> 01:17:03.960 So this is admittedly a mouthful, but let's consider what's happening. 01:17:03.960 --> 01:17:05.910 First, I'm selecting star from shows. 01:17:05.910 --> 01:17:08.760 That means give me all the shows in all of the columns. 01:17:08.760 --> 01:17:14.640 But filter as follows, only show me those shows where the ID of the show 01:17:14.640 --> 01:17:17.170 is in the following list of IDs. 01:17:17.170 --> 01:17:20.610 Now, you can look at a nested query inside the parentheses here. 01:17:20.610 --> 01:17:25.140 This list here selects all of the show IDs from the genres table 01:17:25.140 --> 01:17:27.380 where genre = "Comedy". 01:17:27.380 --> 01:17:31.770 So the highlighted parenthetical right now returns essentially a list 01:17:31.770 --> 01:17:35.130 of all of the ID numbers of shows that are associated with comedy, 01:17:35.130 --> 01:17:37.900 even if they're associated with other things, too. 01:17:37.900 --> 01:17:41.400 And we're making sure that the year equals 2019. 01:17:41.400 --> 01:17:46.290 So if I now hit Enter, we'll see a whole bunch of results, but we should see, 01:17:46.290 --> 01:17:49.110 if I count these by using my usual syntax, 01:17:49.110 --> 01:17:54.120 there were 1,593 shows that are comedies in 2019. 01:17:54.120 --> 01:17:58.200 That does happen to equal the same count we did earlier by using like, 01:17:58.200 --> 01:18:01.560 but this is better designed in the sense that there's no ambiguity. 01:18:01.560 --> 01:18:03.730 You're not just hackishly looking for a substring, 01:18:03.730 --> 01:18:06.060 so to speak, in a comma-separated list. 
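The nested query just dissected can be sketched on a toy pair of tables (standard-library sqlite3 again; the IDs and titles here are made up, and the real data set would of course have far more rows):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id TEXT, title TEXT, year NUMERIC)")
db.execute("CREATE TABLE genres (show_id TEXT, genre TEXT)")
db.executemany("INSERT INTO shows VALUES (?, ?, ?)", [
    ("tt1", "Show A", 2019),
    ("tt2", "Show B", 2019),
    ("tt3", "Show C", 2018),
])
db.executemany("INSERT INTO genres VALUES (?, ?)", [
    ("tt1", "Comedy"), ("tt1", "Drama"),  # one show, many genre rows
    ("tt2", "Drama"),
    ("tt3", "Comedy"),
])

# The parenthesized subquery builds the list of comedy show IDs; the outer
# query then keeps only the 2019 shows whose id appears in that list
comedies_2019 = db.execute(
    "SELECT * FROM shows "
    "WHERE id IN (SELECT show_id FROM genres WHERE genre = 'Comedy') "
    "AND year = 2019"
).fetchall()
```

Only Show A survives both filters: Show C is a comedy but from 2018, and Show B is from 2019 but isn't a comedy.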
01:18:06.060 --> 01:18:08.580 You can actually now search more robustly 01:18:08.580 --> 01:18:10.710 by having redesigned your data. 01:18:10.710 --> 01:18:13.830 And what we've done really is something like this. 01:18:13.830 --> 01:18:18.150 Instead of storing our data in just one table called shows, 01:18:18.150 --> 01:18:22.230 and every show has an ID, a title, a year, and genres, 01:18:22.230 --> 01:18:26.880 we've instead exploded those genres into two separate tables, such 01:18:26.880 --> 01:18:29.650 that now our shows table looks like this. 01:18:29.650 --> 01:18:32.200 We have an ID, a title, and a year. 01:18:32.200 --> 01:18:37.350 And notice that the ID of a show can also appear over here 01:18:37.350 --> 01:18:40.950 in another table called genres, and just by convention, to keep us sane, 01:18:40.950 --> 01:18:43.380 instead of calling it ID here, we've called 01:18:43.380 --> 01:18:48.060 it show_id to connote that it came from a table called shows, plural, 01:18:48.060 --> 01:18:50.010 but it's a single ID from a show. 01:18:50.010 --> 01:18:55.050 So by convention, humans often call them table name, minus the S, underscore ID. 01:18:55.050 --> 01:18:59.280 And then the keyword here or genre is comedy or drama or documentary 01:18:59.280 --> 01:19:01.443 or some other genre as well. 01:19:01.443 --> 01:19:03.360 And now, this is a little subtle, but the fact 01:19:03.360 --> 01:19:05.940 that this little symbol here, drawn from the database world, 01:19:05.940 --> 01:19:08.370 flails out into three separate places, this 01:19:08.370 --> 01:19:11.370 is a one-to-many relationship, so to speak. 01:19:11.370 --> 01:19:15.630 You can have one show over here mapping to many genres over here, 01:19:15.630 --> 01:19:21.090 or maybe zero, but it's zero or more possible genres. 01:19:21.090 --> 01:19:21.990 All right. 01:19:21.990 --> 01:19:26.720 Any questions just yet? 01:19:26.720 --> 01:19:27.220 All right. 
01:19:27.220 --> 01:19:30.430 So the real power then, to be clear, is coming, now, 01:19:30.430 --> 01:19:31.870 from this kind of expressiveness. 01:19:31.870 --> 01:19:34.495 So now, let's play around with some other queries and features. 01:19:34.495 --> 01:19:37.330 But first, let's give ourselves a bit more data. 01:19:37.330 --> 01:19:40.420 It turns out besides the file called title.basics.tsv, 01:19:40.420 --> 01:19:44.140 IMDB.com makes a bunch of others available to us as well. 01:19:44.140 --> 01:19:46.900 There is one called name.basics.tsv. 01:19:46.900 --> 01:19:49.420 And this is one that has information on all 01:19:49.420 --> 01:19:53.690 of the names of actors and actresses and directors and writers 01:19:53.690 --> 01:19:54.940 and other people in the world. 01:19:54.940 --> 01:19:58.600 So for instance, there is an nconst, which is like a name constant 01:19:58.600 --> 01:20:03.190 or an ID, nm2946516, which happens to belong 01:20:03.190 --> 01:20:06.550 to the actor whose primary name is Claire Foy, the star of The Crown. 01:20:06.550 --> 01:20:10.990 She was born in 1984, and there are some other fields in that file as well. 01:20:10.990 --> 01:20:14.560 But also juicy is this file, title.principals.tsv, 01:20:14.560 --> 01:20:16.690 and this is where it gets interesting, too. 01:20:16.690 --> 01:20:20.450 In this file, notice there are no actual titles. 01:20:20.450 --> 01:20:21.740 There's no primary titles. 01:20:21.740 --> 01:20:23.800 There's no actual human names. 01:20:23.800 --> 01:20:28.810 Instead, there's just two unique identifiers, a tconst and an nconst, 01:20:28.810 --> 01:20:34.240 which IMDB speak for a title identifier and a name identifier. 
01:20:34.240 --> 01:20:37.450 So for instance, in one of the rows in this TSV file called 01:20:37.450 --> 01:20:42.790 title.principals.tsv, there is a row that starts with tt4786824; 01:20:42.790 --> 01:20:48.220 also has nm2946516, and has the word actress, 01:20:48.220 --> 01:20:52.090 thereby implying that if you look up the nm-- 01:20:52.090 --> 01:20:57.280 the nconst in the names file and you look up the tconst in the titles file, 01:20:57.280 --> 01:21:03.600 you will be able to, by transitivity, infer that Claire Foy is in The Crown. 01:21:03.600 --> 01:21:06.670 This allows us to have a many-to-many relationship. 01:21:06.670 --> 01:21:10.360 One movie or show can have many actors, 01:21:10.360 --> 01:21:16.030 and one actor can be in many shows, so we're using a sort of join file here, 01:21:16.030 --> 01:21:18.070 a join table, that's going to somehow allow 01:21:18.070 --> 01:21:21.590 us to link two different data sets together, and more on that in a moment. 01:21:21.590 --> 01:21:24.700 But what's really fun in IMDB is that it also has a bunch of ratings 01:21:24.700 --> 01:21:28.630 that humans have typed in saying, I give this a 10 out of 10, a 0 out of 10, 01:21:28.630 --> 01:21:29.200 and so forth. 01:21:29.200 --> 01:21:31.950 And they keep track of the number of votes that shows have gotten. 01:21:31.950 --> 01:21:36.370 And so in title.ratings.tsv, yet another file you can download from IMDB, 01:21:36.370 --> 01:21:40.120 you can look up, for a given tconst, the unique identifier for a title, 01:21:40.120 --> 01:21:42.940 what its average rating is and the number of votes. 01:21:42.940 --> 01:21:46.660 And in fact, if I pull this up, for instance, on The Crown, if I go back 01:21:46.660 --> 01:21:48.880 to IMDB itself-- 01:21:48.880 --> 01:21:53.930 IMDB, search for The Crown, the 2016 version, 01:21:53.930 --> 01:21:59.410 you'll see that indeed, it is an 8.7, which lines up with exactly what 01:21:59.410 --> 01:22:00.210 we have here. 
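The transitivity described above, a tconst plus an nconst linking Claire Foy to The Crown, can be sketched with a minimal join table. The two identifiers are the real IMDb IDs quoted in the lecture, but the table layout here is deliberately simplified to just the columns needed for the lookup:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id TEXT, title TEXT)")
db.execute("CREATE TABLE people (id TEXT, name TEXT)")
# The join table stores no titles and no names, just pairs of identifiers
db.execute("CREATE TABLE stars (show_id TEXT, person_id TEXT)")

db.execute("INSERT INTO shows VALUES ('tt4786824', 'The Crown')")
db.execute("INSERT INTO people VALUES ('nm2946516', 'Claire Foy')")
db.execute("INSERT INTO stars VALUES ('tt4786824', 'nm2946516')")

# Look up the person, follow the join table, land on the show
title = db.execute(
    "SELECT title FROM shows WHERE id IN ("
    "  SELECT show_id FROM stars WHERE person_id IN ("
    "    SELECT id FROM people WHERE name = 'Claire Foy'))"
).fetchone()[0]
```

Because the stars table can hold any number of rows per show and per person, this one extra table is what makes the relationship many-to-many.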
01:22:00.210 --> 01:22:03.580 But over time, that number is going to go up or down, because IMDB is updating 01:22:03.580 --> 01:22:06.280 their data set every day as well. 01:22:06.280 --> 01:22:09.730 So besides this data, we also then have the ability 01:22:09.730 --> 01:22:12.580 to consider what this all looks like collectively. 01:22:12.580 --> 01:22:14.920 So in this case here, here is another diagram. 01:22:14.920 --> 01:22:18.610 It's more complicated now, but it just captures the intuition 01:22:18.610 --> 01:22:22.450 that you would derive by just reading through IMDB's documentation, which 01:22:22.450 --> 01:22:23.800 defines the following. 01:22:23.800 --> 01:22:26.420 It turns out if you read closely among those files, 01:22:26.420 --> 01:22:29.710 you'll see that, oh, we can glean a whole bunch of shows that are going 01:22:29.710 --> 01:22:32.050 to have IDs, title, year, and episodes. 01:22:32.050 --> 01:22:37.840 I want to go ahead and associate those shows with a whole bunch of stars. 01:22:37.840 --> 01:22:42.170 But people are the entities in the world that have IDs, names, and birth. 01:22:42.170 --> 01:22:43.910 So now things get a little weird. 01:22:43.910 --> 01:22:46.300 Let's focus only on these two tables here. 01:22:46.300 --> 01:22:48.340 This is a diagram of the goal at hand, and this 01:22:48.340 --> 01:22:49.632 is a script I wrote in advance. 01:22:49.632 --> 01:22:53.150 And in a moment, we'll open up a SQL database that represents this. 01:22:53.150 --> 01:22:56.290 There's going to be a table called shows, every row of which 01:22:56.290 --> 01:22:59.350 has an ID, title, year, and some number of episodes, 01:22:59.350 --> 01:23:00.962 so you can see how long-running it is. 01:23:00.962 --> 01:23:02.920 There's also going to be a table called people. 01:23:02.920 --> 01:23:05.080 Claire Foy is going to be among them. 
01:23:05.080 --> 01:23:09.610 She and the other humans will have an ID, name, and birth year associated 01:23:09.610 --> 01:23:10.960 with them as well. 01:23:10.960 --> 01:23:14.230 But there's going to be this other table here called stars. 01:23:14.230 --> 01:23:19.390 And you'll notice there's a line that links shows to stars to people, 01:23:19.390 --> 01:23:21.190 again by transitivity. 01:23:21.190 --> 01:23:27.580 If there is, in the stars table, a show ID, a.k.a. tconst, and a person ID, 01:23:27.580 --> 01:23:28.260 a.k.a. 01:23:28.260 --> 01:23:32.260 nconst, that links, for instance, Claire Foy to The Crown. 01:23:32.260 --> 01:23:34.480 It's going to link Steve Carell to The Office. 01:23:34.480 --> 01:23:38.450 It's going to link every other actor to their show as well. 01:23:38.450 --> 01:23:41.770 Similarly for writers, we won't play too much with the writers today, 01:23:41.770 --> 01:23:45.760 but writers are people, too, just as stars are people. 01:23:45.760 --> 01:23:48.970 And so here's another feature or design goal of SQL. 01:23:48.970 --> 01:23:52.150 You want to ideally factor out the commonalities, so 01:23:52.150 --> 01:23:55.450 that Claire Foy appears only in one place by name, 01:23:55.450 --> 01:23:58.965 but her unique identifier might appear in bunches of places. 01:23:58.965 --> 01:24:01.840 There's a lot of actors in the world who are also writers themselves. 01:24:01.840 --> 01:24:03.970 They are, at the end of the day, people, but they 01:24:03.970 --> 01:24:08.200 might appear both in the stars table and in the writers table 01:24:08.200 --> 01:24:10.300 by way of their person ID. 01:24:10.300 --> 01:24:14.050 So the goal of SQL is to not copy and paste Claire Foy, Claire Foy, Claire 01:24:14.050 --> 01:24:17.750 Foy, or Steve Carell, Steve Carell, Steve Carell, all over the place. 
01:24:17.750 --> 01:24:20.170 You want to have one authoritative place for all 01:24:20.170 --> 01:24:23.320 of your people, one authoritative place for all of your shows, 01:24:23.320 --> 01:24:25.180 and then you have these other tables called 01:24:25.180 --> 01:24:28.940 join tables, which are similar in spirit to the TSV files 01:24:28.940 --> 01:24:32.790 you can download that somehow link these identifiers together. 01:24:32.790 --> 01:24:38.820 So if I, on my computer here, open a file that I had-- came with in advance, 01:24:38.820 --> 01:24:43.370 that you can also play with online as well, called, finally, shows.db. 01:24:43.370 --> 01:24:45.080 And I do .schema on this one. 01:24:45.080 --> 01:24:48.560 This is the largest of the databases, and this is the one that we, the staff, 01:24:48.560 --> 01:24:51.345 actually generated from all of the TSV files online, 01:24:51.345 --> 01:24:52.970 but we threw away a lot of information. 01:24:52.970 --> 01:24:55.850 We threw away anything before 1970, and we filtered out 01:24:55.850 --> 01:24:59.600 everything except TV series, so that we can actually 01:24:59.600 --> 01:25:01.820 play with the data in this data set. 01:25:01.820 --> 01:25:04.580 But honestly, using a black and white prompt and a terminal window 01:25:04.580 --> 01:25:07.455 tends not to be fun, especially when the data flies over the screen. 01:25:07.455 --> 01:25:10.413 So there are also things called GUIs, Graphical User Interfaces, 01:25:10.413 --> 01:25:13.080 and indeed, there is a program that's freely available for Macs, 01:25:13.080 --> 01:25:16.490 PCs, and other types of operating systems, called DB Browser. 01:25:16.490 --> 01:25:18.480 And indeed, we'll point you at this online.
01:25:18.480 --> 01:25:23.030 This is just a program that allows you to explore SQL files on your own Mac 01:25:23.030 --> 01:25:24.890 or PC much more visibly-- 01:25:24.890 --> 01:25:28.070 or much more visually and much more pleasantly than maybe just 01:25:28.070 --> 01:25:29.870 a command line interface allows. 01:25:29.870 --> 01:25:33.080 So let me go ahead and open up, for instance, shows.db. 01:25:33.080 --> 01:25:34.770 And we'll see a whole bunch of things. 01:25:34.770 --> 01:25:36.890 First of all, the graphical user interface 01:25:36.890 --> 01:25:40.280 here shows me the same information, just in slightly prettier format. 01:25:40.280 --> 01:25:44.450 Shows.db, per my diagram a moment ago, has six tables, people, 01:25:44.450 --> 01:25:46.670 like I mentioned; shows, like I mentioned; 01:25:46.670 --> 01:25:50.398 also, stars and writers, ratings, and then that separate table called genres. 01:25:50.398 --> 01:25:52.190 And if you look over here on the right, you 01:25:52.190 --> 01:25:56.360 can actually see the SQL code we wrote to create those tables. 01:25:56.360 --> 01:26:00.350 But cooler than that is that notice these tabs up top here? 01:26:00.350 --> 01:26:03.170 I am currently on Database Structure, and if you're using Windows, 01:26:03.170 --> 01:26:05.180 your interface will look a little different, 01:26:05.180 --> 01:26:06.740 but the options are still there. 01:26:06.740 --> 01:26:08.960 I'm going to go ahead and click on Browse Data. 01:26:08.960 --> 01:26:12.000 And now you'll see a little dropdown of all of the tables. 01:26:12.000 --> 01:26:14.510 So if I want to go ahead and look at a whole bunch of shows, 01:26:14.510 --> 01:26:18.680 I can actually see all of my data here, not unlike Google Spreadsheets. 01:26:18.680 --> 01:26:23.450 But notice, this is 153,331 movies-- 01:26:23.450 --> 01:26:27.380 or shows, rather, that I can see altogether here. 
01:26:27.380 --> 01:26:31.110 And lastly, what's cool is that if I go over to the SQL tab, 01:26:31.110 --> 01:26:33.058 I can now execute some sample queries. 01:26:33.058 --> 01:26:34.350 So let me go ahead and do this. 01:26:34.350 --> 01:26:37.340 SELECT * FROM shows;. 01:26:37.340 --> 01:26:39.530 This is going to give me a whole lot of shows. 01:26:39.530 --> 01:26:40.130 Enter. 01:26:40.130 --> 01:26:43.140 So I hit the little Play button that just executes that query, 01:26:43.140 --> 01:26:46.980 and you see the resulting rows that have come back. 01:26:46.980 --> 01:26:48.800 So again, how did we get to this point? 01:26:48.800 --> 01:26:51.920 We, the staff, downloaded all of those TSV files in advance. 01:26:51.920 --> 01:26:56.150 We wrote a Python script that imported all of the data from those files 01:26:56.150 --> 01:26:58.880 into memory, threw away the stuff we didn't care about, 01:26:58.880 --> 01:27:03.362 and then inserted it into tables like this table called shows. 01:27:03.362 --> 01:27:06.320 And what's nice about this tool is, because if you are an aspiring data 01:27:06.320 --> 01:27:09.020 scientist or you're just trying to get some analytical work done 01:27:09.020 --> 01:27:11.690 for this class or any other, or any other project, 01:27:11.690 --> 01:27:14.370 a graphical tool lends itself to just kind of poking around. 01:27:14.370 --> 01:27:19.010 So for instance, you'll see that the shows table no longer has any genres. 01:27:19.010 --> 01:27:20.010 But that's OK. 01:27:20.010 --> 01:27:21.950 We can reconstitute that data. 01:27:21.950 --> 01:27:25.670 I can go ahead and SELECT * FROM shows-- 01:27:25.670 --> 01:27:27.390 but you know what I can do? 01:27:27.390 --> 01:27:30.680 I can actually join in the genres table. 01:27:30.680 --> 01:27:34.250 So I can take the shows table here, the genres table here, 01:27:34.250 --> 01:27:37.640 and essentially link them together by way of the ID. 
01:27:37.640 --> 01:27:39.000 How is that possible? 01:27:39.000 --> 01:27:43.010 Well, if you look at shows, a show has an ID title, year, and episodes. 01:27:43.010 --> 01:27:45.020 Genres has a show ID. 01:27:45.020 --> 01:27:48.560 So if you think of ID on my left hand as representing my fingers here. 01:27:48.560 --> 01:27:51.830 Show ID in genres is representing my right hand here. 01:27:51.830 --> 01:27:54.830 What we want to do is lineup ID with show ID 01:27:54.830 --> 01:28:00.570 to make one larger, wider table that constitutes all of that data together. 01:28:00.570 --> 01:28:01.890 So how do I do this? 01:28:01.890 --> 01:28:04.130 Well, in SQL, you can join two tables. 01:28:04.130 --> 01:28:07.430 You say what table you want to join with what other table and 01:28:07.430 --> 01:28:08.570 how you want to do it. 01:28:08.570 --> 01:28:16.740 Well, I want to go ahead and join it on shows.id = genres.show_id;. 01:28:16.740 --> 01:28:19.700 And now, when I hit execute, it took a moment-- 01:28:19.700 --> 01:28:21.682 indeed, it took 408 milliseconds. 01:28:21.682 --> 01:28:24.140 But my god, that's a lot faster than writing a whole Python 01:28:24.140 --> 01:28:25.160 script to do this. 01:28:25.160 --> 01:28:29.220 Now I have a table with all of the shows as before. 01:28:29.220 --> 01:28:30.740 But notice the table got wider. 01:28:30.740 --> 01:28:33.590 This is a temporary table that SQL has returned to me. 01:28:33.590 --> 01:28:36.077 This one now has genre and show ID. 01:28:36.077 --> 01:28:37.910 So in fact, let me go ahead and filter this. 01:28:37.910 --> 01:28:52.290 WHERE title = "The Office" AND year = 2005, Play, we'll just get this. 01:28:52.290 --> 01:28:55.370 So notice I have a wider table containing all of the columns 01:28:55.370 --> 01:28:57.350 from both of those joined tables. 
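The same join can be reproduced with Python's built-in sqlite3 module. This sketch uses an in-memory database with a simplified version of the shows.db schema from the lecture; the table and column names follow the lecture, but the rows are made up:

```python
import sqlite3

# In-memory database with a pared-down version of the lecture's schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER, episodes INTEGER)")
db.execute("CREATE TABLE genres (show_id INTEGER, genre TEXT)")

# Made-up sample data: one show, two genres linked to it by show_id.
db.execute("INSERT INTO shows VALUES (1, 'The Office', 2005, 188)")
db.executemany("INSERT INTO genres VALUES (?, ?)", [(1, "Comedy"), (1, "Drama")])

# The same join as in the lecture: line up shows.id with genres.show_id,
# then filter by title and year.
rows = db.execute(
    "SELECT title, year, genre FROM shows JOIN genres "
    "ON shows.id = genres.show_id "
    "WHERE title = 'The Office' AND year = 2005"
).fetchall()

print(rows)  # one row per genre, with the title repeated in each
```

Note the "wider table" effect: the show's title and year appear once per matching genre row.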
01:28:57.350 --> 01:29:03.170 But if I change this now to The Crown, and that was 2016, and hit play, 01:29:03.170 --> 01:29:05.875 notice I get back seemingly redundant information. 01:29:05.875 --> 01:29:08.000 But this might be useful nonetheless, because now I 01:29:08.000 --> 01:29:13.040 can iterate over all of the rows knowing that every row has not only a title 01:29:13.040 --> 01:29:16.220 and a year, but also a genre as well. 01:29:16.220 --> 01:29:19.070 So I can reconstitute the table like that. 01:29:19.070 --> 01:29:21.530 Well, how can I ask other questions as well, 01:29:21.530 --> 01:29:26.030 like what are actors that are in other shows as well? 01:29:26.030 --> 01:29:30.000 Well, let me go ahead and do something like this. 01:29:30.000 --> 01:29:33.230 For instance, let's select all of Steve Carell's movies. 01:29:33.230 --> 01:29:35.280 Or-- let's select Steve Carell himself first. 01:29:35.280 --> 01:29:38.280 So SELECT * FROM-- 01:29:38.280 --> 01:29:40.140 let's see-- no. 01:29:40.140 --> 01:29:40.830 Let's do this. 01:29:40.830 --> 01:29:41.910 How about Ellen? 01:29:41.910 --> 01:29:49.405 So SELECT * from people where name = Ellen DeGeneres. 01:29:49.405 --> 01:29:50.280 I spelled that right. 01:29:50.280 --> 01:29:52.150 Semicolon, Play. 01:29:52.150 --> 01:29:52.650 All right. 01:29:52.650 --> 01:29:56.040 So this is handy, because I now know that Ellen's birth year is 1958, 01:29:56.040 --> 01:29:58.320 but her ID is 1122. 01:29:58.320 --> 01:30:01.530 That's the same thing as that nconst, but we threw away the nm 01:30:01.530 --> 01:30:05.520 and we got rid of all the zeros and made it into a proper number for efficiency, 01:30:05.520 --> 01:30:08.340 better to uniquely identify humans, typically-- or anything-- 01:30:08.340 --> 01:30:13.110 by numbers which fit in 32 or 64 bits rather than longer strings. 01:30:13.110 --> 01:30:16.740 So now I know Ellen's ID is 1122, what can I do? 
01:30:16.740 --> 01:30:25.470 Well, let me go ahead and SELECT * FROM stars WHERE person_id = this. 01:30:25.470 --> 01:30:26.970 This will tell me what? 01:30:26.970 --> 01:30:31.920 This will tell me all of the information about shows that Ellen starred in, 01:30:31.920 --> 01:30:34.150 including, presumably, her own. 01:30:34.150 --> 01:30:37.230 OK, so I now see person_id is the same, the same, the same, 01:30:37.230 --> 01:30:39.608 but Ellen is apparently in all of those shows. 01:30:39.608 --> 01:30:41.400 But that's not that helpful, and also, this 01:30:41.400 --> 01:30:44.070 is kind of lame that I've just hardcoded Ellen's ID. 01:30:44.070 --> 01:30:45.550 But I don't have to do that. 01:30:45.550 --> 01:30:55.020 I can do a subquery, SELECT id FROM people WHERE name = "Ellen DeGeneres", 01:30:55.020 --> 01:30:57.330 closed parenthesis, Play. 01:30:57.330 --> 01:31:00.480 Now it's dynamic, so now I've not hardcoded anything. 01:31:00.480 --> 01:31:01.930 But this isn't that useful. 01:31:01.930 --> 01:31:04.770 Let me go ahead and just select show_id here. 01:31:04.770 --> 01:31:08.910 So I now have SELECT show_id FROM stars WHERE the person_id ID 01:31:08.910 --> 01:31:12.108 = whatever Ellen's ID is here. 01:31:12.108 --> 01:31:13.650 How can I take this one step further? 01:31:13.650 --> 01:31:19.980 Well, what if I do SELECT * FROM shows WHERE the ID of the show 01:31:19.980 --> 01:31:23.400 is in the following list of values? 01:31:23.400 --> 01:31:25.590 So not only does SQL support equals, when 01:31:25.590 --> 01:31:28.530 you want to compare one value against the next, much like in Python, 01:31:28.530 --> 01:31:33.360 you have the keyword IN, where you can say select everything from shows where 01:31:33.360 --> 01:31:36.930 the ID of the show is in the following list of shows, 01:31:36.930 --> 01:31:42.260 which happens to represent the list of show IDs that Ellen is in. 01:31:42.260 --> 01:31:43.140 Phew. 
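The nested query built up step by step above can be run end to end with sqlite3. The IDs and rows here are made up (Ellen's real ID in the course's shows.db may differ), but the shape of the query is the one from the lecture:

```python
import sqlite3

# Minimal people/stars/shows tables in the shape the lecture describes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER)")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")

db.execute("INSERT INTO people VALUES (1122, 'Ellen DeGeneres', 1958)")
db.execute("INSERT INTO shows VALUES (10, 'Ellen', 1994)")
db.execute("INSERT INTO shows VALUES (20, 'The Ellen DeGeneres Show', 2003)")
db.executemany("INSERT INTO stars VALUES (?, ?)", [(10, 1122), (20, 1122)])

# Nested subqueries plus IN: nothing hardcoded except the name itself.
titles = db.execute(
    "SELECT title FROM shows WHERE id IN "
    "(SELECT show_id FROM stars WHERE person_id = "
    "(SELECT id FROM people WHERE name = 'Ellen DeGeneres'))"
).fetchall()

print(titles)
```

Reading it inside out mirrors the lecture: find Ellen's ID, find the show IDs she stars in, then find those shows.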
01:31:43.140 --> 01:31:45.150 Let's hit Play on this. 01:31:45.150 --> 01:31:48.660 It took a moment, but it looks like these are all of the shows 01:31:48.660 --> 01:31:51.120 that Ellen has been in, according to IMDB. 01:31:51.120 --> 01:31:54.600 And it looks like The Ellen DeGeneres Show is one. 01:31:54.600 --> 01:31:58.380 She's been on the air for 2,865 episodes. 01:31:58.380 --> 01:32:01.480 There is Ellen, her original TV show, which was on for a few seasons back 01:32:01.480 --> 01:32:01.980 then. 01:32:01.980 --> 01:32:06.430 Looks like she's doing some game shows these days as of 2017, and so forth. 01:32:06.430 --> 01:32:09.810 So using these step-by-step thought processes 01:32:09.810 --> 01:32:12.300 can we actually build up more interesting queries 01:32:12.300 --> 01:32:15.410 to get back information like that. 01:32:15.410 --> 01:32:21.192 All right, any questions before we try a few others out as well? 01:32:21.192 --> 01:32:21.860 No? 01:32:21.860 --> 01:32:22.360 All right. 01:32:22.360 --> 01:32:25.270 Well, let me show one other approach to this same problem. 01:32:25.270 --> 01:32:27.720 It turns out these are what are known as nested queries. 01:32:27.720 --> 01:32:30.012 You keep using the parenthesization, much like in math, 01:32:30.012 --> 01:32:32.490 where you can nest arithmetic expressions in parentheses. 01:32:32.490 --> 01:32:33.900 You can do the same in SQL. 01:32:33.900 --> 01:32:37.590 But you can also join information in slightly different ways as well. 01:32:37.590 --> 01:32:40.480 I can actually do something like this. 01:32:40.480 --> 01:32:47.670 Let me go ahead and SELECT Title FROM the following tables, 01:32:47.670 --> 01:32:56.980 people JOIN stars ON people.id = stars.person_id-- 01:32:56.980 --> 01:32:59.230 and we'll walk through this in just a second-- 01:32:59.230 --> 01:33:03.210 JOIN-- sorry-- 01:33:03.210 --> 01:33:14.640 JOIN shows ON stars.show_id = shows.id where name equals "Ellen DeGeneres."
01:33:14.640 --> 01:33:18.780 This, while more of a mouthful, is equivalent to what I've just done. 01:33:18.780 --> 01:33:21.450 Notice that I've selected the title of the show 01:33:21.450 --> 01:33:24.630 from the following, the result of joining people on stars. 01:33:24.630 --> 01:33:27.330 How do you join the people table and the stars table? 01:33:27.330 --> 01:33:28.740 Well, people have IDs. 01:33:28.740 --> 01:33:31.465 Stars have person IDs, according to the diagram. 01:33:31.465 --> 01:33:33.090 What else do you want to join together? 01:33:33.090 --> 01:33:35.310 Let's join all of that with shows. 01:33:35.310 --> 01:33:36.000 How? 01:33:36.000 --> 01:33:40.740 Well, let's go ahead and say stars.show_id = shows.id. 01:33:40.740 --> 01:33:43.770 So it's a way of linking, transitively, multiple tables together, 01:33:43.770 --> 01:33:47.910 and then filtering all of that with the name of Ellen DeGeneres. 01:33:47.910 --> 01:33:50.580 This is what we would call an explicit join. 01:33:50.580 --> 01:33:54.210 And if I click Play there, notice it's taking a moment. 01:33:54.210 --> 01:33:58.530 Took, in fact, 1,990 milliseconds, almost two human seconds. 01:33:58.530 --> 01:34:01.800 So it got slower, but it does give me another syntax 01:34:01.800 --> 01:34:02.910 via which to select data. 01:34:02.910 --> 01:34:07.380 And honestly, what's cool about this is I can select other fields as well. 01:34:07.380 --> 01:34:12.240 They don't have to come from the original query in my nested selects. 01:34:12.240 --> 01:34:15.060 But let me go ahead and do this, just as a teaser, and then 01:34:15.060 --> 01:34:17.850 we'll consider some problems before wrapping up. 01:34:17.850 --> 01:34:20.130 So it turns out 2 millisecond-- 01:34:20.130 --> 01:34:25.590 2 seconds is kind of slow, and if I do this again, it took 2,029 milliseconds.
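The explicit, three-table join described above can also be checked with sqlite3 and made-up data; the JOIN ... ON clauses are exactly the ones from the lecture:

```python
import sqlite3

# Same toy schema as before, pared down to the columns the join touches.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")
db.execute("INSERT INTO people VALUES (1122, 'Ellen DeGeneres')")
db.execute("INSERT INTO shows VALUES (20, 'The Ellen DeGeneres Show')")
db.execute("INSERT INTO stars VALUES (20, 1122)")

# Explicit joins: link people to stars, then stars to shows, then filter.
titles = db.execute(
    "SELECT title FROM people "
    "JOIN stars ON people.id = stars.person_id "
    "JOIN shows ON stars.show_id = shows.id "
    "WHERE name = 'Ellen DeGeneres'"
).fetchall()

print(titles)  # [('The Ellen DeGeneres Show',)]
```

Both styles, nested subqueries and explicit joins, produce the same rows; which one the database can execute faster depends on the data and, as the lecture goes on to show, on indexes.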
01:34:25.590 --> 01:34:29.345 If I do it again, it took 1,963 milliseconds, 01:34:29.345 --> 01:34:30.720 and I'm looking at the time here. 01:34:30.720 --> 01:34:31.530 That's slow, right? 01:34:31.530 --> 01:34:34.030 That means if you have a finite number of servers in the world 01:34:34.030 --> 01:34:36.720 and each of them can only handle some number of users at a time, 01:34:36.720 --> 01:34:41.700 it's kind of a waste of hardware and of money, arguably, to spend more seconds 01:34:41.700 --> 01:34:46.050 or milliseconds servicing one query from a given user on your app or your web 01:34:46.050 --> 01:34:47.100 application. 01:34:47.100 --> 01:34:49.950 But it turns out we can do something kind of smart here. 01:34:49.950 --> 01:34:54.340 Notice that in our database structure, we've done a few things here. 01:34:54.340 --> 01:34:57.570 Let me open up, for instance, people. 01:34:57.570 --> 01:35:02.990 Rather, let me go over to people, and you'll see a few keywords 01:35:02.990 --> 01:35:04.460 that I ignored earlier. 01:35:04.460 --> 01:35:07.640 When you're defining a table, you can specify that one of your columns 01:35:07.640 --> 01:35:09.253 is what's called a primary key. 01:35:09.253 --> 01:35:12.420 That is the column by which you're going to uniquely identify all your data, 01:35:12.420 --> 01:35:15.620 so it's like those numeric IDs we've given every person and show. 01:35:15.620 --> 01:35:18.020 A foreign key is what we call the same number 01:35:18.020 --> 01:35:19.590 when it appears in some other table. 01:35:19.590 --> 01:35:24.440 So when we saw a person_id or show_id, those are what are called foreign keys, 01:35:24.440 --> 01:35:27.140 because it's the same numbers, but in another table 01:35:27.140 --> 01:35:29.780 that we're using to join things together, eventually. 01:35:29.780 --> 01:35:31.718 And you can also define columns to be unique, 01:35:31.718 --> 01:35:33.260 or you can index them for efficiency. 
01:35:33.260 --> 01:35:35.750 And that's, in fact, where we're going right now. 01:35:35.750 --> 01:35:38.690 If you look at the same query, you'll notice that I'm 01:35:38.690 --> 01:35:40.640 searching on a bunch of columns. 01:35:40.640 --> 01:35:43.380 I'm clearly searching on the name field. 01:35:43.380 --> 01:35:48.050 I'm also searching on-- that is, joining on-- show_id, 01:35:48.050 --> 01:35:52.740 and I'm searching on person_id, not to mention the individual ID fields. 01:35:52.740 --> 01:35:55.040 But the cool thing about a primary key, which 01:35:55.040 --> 01:35:59.270 we have in advance defined all of our ID columns, all of the columns 01:35:59.270 --> 01:36:03.500 called ID, to be primary keys, is that you get a feature for free with SQL. 01:36:03.500 --> 01:36:06.530 SQL builds up what's called an index, a very fancy data 01:36:06.530 --> 01:36:09.950 structure, a tree-like structure, that actually allows you to search 01:36:09.950 --> 01:36:12.240 for information quite efficiently. 01:36:12.240 --> 01:36:14.240 So when you define a column to be a primary key, 01:36:14.240 --> 01:36:15.770 you get what's called an index. 01:36:15.770 --> 01:36:19.430 Specifically, if we go back a few weeks, to our discussion of trees, 01:36:19.430 --> 01:36:20.248 it looks like this. 01:36:20.248 --> 01:36:22.790 This is what's called a B-tree, which is not a binary tree. 01:36:22.790 --> 01:36:25.460 It's a B-tree in the sense that it's got a lot of nodes 01:36:25.460 --> 01:36:28.130 that might have one or two or more children each, 01:36:28.130 --> 01:36:32.270 but it's very short, with a lot of nodes-- a lot of width. 01:36:32.270 --> 01:36:35.570 So that means when you store your data in a tree-like structure, 01:36:35.570 --> 01:36:38.870 long story short, it just tends to be very efficiently searchable. 01:36:38.870 --> 01:36:40.970 So when you define a column to be a primary key, 01:36:40.970 --> 01:36:43.460 you get that speed for free.
01:36:43.460 --> 01:36:46.520 But we can also tell our SQL database, you know what? 01:36:46.520 --> 01:36:51.330 I plan on searching on person IDs and show IDs and names, also, 01:36:51.330 --> 01:36:53.990 so let's go ahead and create myself some indexes. 01:36:53.990 --> 01:36:57.530 Let me go ahead and execute the following queries just once. 01:36:57.530 --> 01:37:04.490 Let me create something called an index called person_index on the stars table 01:37:04.490 --> 01:37:07.040 using its person_id column. 01:37:07.040 --> 01:37:08.810 So pretty quick syntax. 01:37:08.810 --> 01:37:12.230 This means create a B-tree, create a fancy tree structure, 01:37:12.230 --> 01:37:16.730 called person index, on the stars table by person-- 01:37:16.730 --> 01:37:18.020 on the person_id column. 01:37:18.020 --> 01:37:21.810 That is, make a table like this in memory to store all of the person IDs. 01:37:21.810 --> 01:37:22.310 Why? 01:37:22.310 --> 01:37:26.450 So I can find them faster than linear search in the column itself. 01:37:26.450 --> 01:37:27.955 Let me go ahead and execute this. 01:37:27.955 --> 01:37:29.330 And you'll see it takes a moment. 01:37:29.330 --> 01:37:29.930 It's thinking. 01:37:29.930 --> 01:37:31.510 It took me 1.6 seconds. 01:37:31.510 --> 01:37:33.772 So a little slow, but I only have to do this once. 01:37:33.772 --> 01:37:34.730 Let me create two more. 01:37:34.730 --> 01:37:39.860 CREATE INDEX called show_index ON stars ON the show_id column. 01:37:39.860 --> 01:37:42.080 So almost the same, but this just means give me 01:37:42.080 --> 01:37:44.780 a tree that looks like this in memory, so 01:37:44.780 --> 01:37:48.360 that when I query for data like Ellen DeGeneres, 01:37:48.360 --> 01:37:52.640 it searches a tree-like structure instead of a list-like structure 01:37:52.640 --> 01:37:53.960 in a column alone. 01:37:53.960 --> 01:37:54.890 Let me execute that. 01:37:54.890 --> 01:37:58.440 And I screwed up ON stars show_ID. 
01:37:58.440 --> 01:37:59.100 There we go. 01:37:59.100 --> 01:38:00.210 That took 1 second. 01:38:00.210 --> 01:38:07.320 And lastly, let's create an index called name index on people on the name column 01:38:07.320 --> 01:38:10.620 so that I can search for people by name more efficiently as well. 01:38:10.620 --> 01:38:13.920 Otherwise, without these indexes on a column like name, 01:38:13.920 --> 01:38:17.310 it is going to check every damn cell in the column looking 01:38:17.310 --> 01:38:21.180 for Ellen DeGeneres, Ellen DeGeneres, Ellen DeGeneres, using big O of N, 01:38:21.180 --> 01:38:24.070 running time, or linear search. 01:38:24.070 --> 01:38:26.520 So now, let me go back to my query here. 01:38:26.520 --> 01:38:28.440 I've not made any changes to the query. 01:38:28.440 --> 01:38:31.500 The last time I ran this, it took almost 2 seconds. 01:38:31.500 --> 01:38:34.860 Now, after creating these indexes and telling my SQL database 01:38:34.860 --> 01:38:37.470 I plan to search on those columns, watch this, 01:38:37.470 --> 01:38:40.110 and watch the blue highlighted number. 01:38:40.110 --> 01:38:41.520 8 milliseconds. 01:38:41.520 --> 01:38:42.190 8 millisecond. 01:38:42.190 --> 01:38:43.110 7 milliseconds. 01:38:43.110 --> 01:38:44.490 7, 7, 7. 01:38:44.490 --> 01:38:46.245 Looks like 4 milliseconds that time. 
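The effect of CREATE INDEX can be observed without a stopwatch by asking SQLite how it plans to execute a query. This sketch uses sqlite3's EXPLAIN QUERY PLAN (a SQLite feature, not general SQL) on a made-up people table; the exact wording of the plan varies by SQLite version, but before the index you see a scan, and after it a search using the index:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO people (name) VALUES (?)",
               [(f"Person {i}",) for i in range(1000)])

query = "SELECT * FROM people WHERE name = 'Ellen DeGeneres'"

# Without an index on name, SQLite must scan the whole column: linear search.
before = " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query))

# Same syntax as in the lecture: build a B-tree over the name column.
db.execute("CREATE INDEX name_index ON people (name)")

# Now the plan searches the index instead of scanning the table.
after = " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query))

print(before)
print(after)
```

That plan change is exactly what turns the lecture's near-2-second query into a handful of milliseconds.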
01:38:46.245 --> 01:38:49.120 So this is what the Googles of the world, the Facebooks of the world, 01:38:49.120 --> 01:38:51.180 the Microsofts, who have very large data, 01:38:51.180 --> 01:38:53.310 they not only store their data in databases 01:38:53.310 --> 01:38:57.150 like we are here, but they also index their tables intelligently, 01:38:57.150 --> 01:39:01.200 drawing in ideas from weeks ago, so that the database, for them, and for free, 01:39:01.200 --> 01:39:04.110 and sort of magically, creates these kinds of structures in memory, 01:39:04.110 --> 01:39:07.380 but does it so that you can search and insert and update 01:39:07.380 --> 01:39:10.620 your data all the more efficiently. 01:39:10.620 --> 01:39:13.560 So there's got to be some problems or some prices paid. 01:39:13.560 --> 01:39:15.810 So let's consider just a couple of those. 01:39:15.810 --> 01:39:19.637 So what problems actually arise when using a SQL database like this? 01:39:19.637 --> 01:39:21.720 Well, the first is what's called a race condition. 01:39:21.720 --> 01:39:24.780 And perhaps the best way to explain this is by way of a-- 01:39:24.780 --> 01:39:27.780 sort of a story that I was told when I took a class like this years 01:39:27.780 --> 01:39:31.690 ago in a class called CS161, Operating Systems. 01:39:31.690 --> 01:39:33.540 So contrived scenario, but consider this. 01:39:33.540 --> 01:39:35.832 You and your roommates have a fridge in your dorm room, 01:39:35.832 --> 01:39:37.890 and you and your roommate really like milk. 01:39:37.890 --> 01:39:41.910 And one of you comes home one day, opens the fridge, and sees, oh, dammit, 01:39:41.910 --> 01:39:42.770 we're out of milk. 01:39:42.770 --> 01:39:45.270 And so you close the fridge and you walk into Harvard Square 01:39:45.270 --> 01:39:47.400 and you head to CVS or some other such store. 
01:39:47.400 --> 01:39:50.670 Meanwhile, your other roommate comes home, opens the fridge, 01:39:50.670 --> 01:39:53.280 really wants some milk, but, oh, darn it, we're out of milk. 01:39:53.280 --> 01:39:56.425 So they close the fridge, walk outside, and head 01:39:56.425 --> 01:39:59.550 to some other store that sells milk, and for reasons that we're contriving, 01:39:59.550 --> 01:40:01.500 don't bump into you at CVS. 01:40:01.500 --> 01:40:04.728 A few minutes later, you both get home, of course, having bought some milk. 01:40:04.728 --> 01:40:06.520 But of course, milk doesn't last that long, 01:40:06.520 --> 01:40:08.460 so now you've got twice as much milk, and surely some of it's 01:40:08.460 --> 01:40:09.127 going to go bad. 01:40:09.127 --> 01:40:11.640 Like horrible, horrible problem. 01:40:11.640 --> 01:40:12.690 Why, right? 01:40:12.690 --> 01:40:16.230 It's contrived, but this-- fond memories of when it was taught to me. 01:40:16.230 --> 01:40:19.320 So what is the fundamental problem here, if you put your finger on it? 01:40:19.320 --> 01:40:22.380 Both you and your roommate were allowed to inspect 01:40:22.380 --> 01:40:26.400 the state of that refrigerator without knowing that the other was 01:40:26.400 --> 01:40:27.720 about to do the same. 01:40:27.720 --> 01:40:31.405 Better would have been for one of you to leave a note on the fridge saying, gone 01:40:31.405 --> 01:40:33.780 for milk, so that the other person doesn't do it as well. 01:40:33.780 --> 01:40:37.230 Or more stringently, just lock the refrigerator 01:40:37.230 --> 01:40:41.430 so that no one else besides you can see inside of it until you have 01:40:41.430 --> 01:40:44.010 updated the state of that refrigerator. 01:40:44.010 --> 01:40:46.320 So this comes into play all too often these days 01:40:46.320 --> 01:40:48.130 when dealing with really large data sets. 
01:40:48.130 --> 01:40:50.880 The Twitters of the world, the Instagrams of the world, all of you 01:40:50.880 --> 01:40:53.130 are probably in the habit of liking or uploading posts 01:40:53.130 --> 01:40:54.640 on those sites or some other. 01:40:54.640 --> 01:40:58.120 But if a lot of other people are, too, especially when things go viral, 01:40:58.120 --> 01:41:01.080 there might be code like this underneath the hood that 01:41:01.080 --> 01:41:02.400 essentially does the following. 01:41:02.400 --> 01:41:05.700 Here's three lines of representative Python code that use SQL. 01:41:05.700 --> 01:41:08.850 The first line here selects the number of likes from a table 01:41:08.850 --> 01:41:11.880 called posts where the ID of the post is something, 01:41:11.880 --> 01:41:14.370 a variable called ID, 1, 2, 3, 4, 5. 01:41:14.370 --> 01:41:17.670 That just tells me, what is the total number of likes this post has? 01:41:17.670 --> 01:41:20.648 Then, I store that answer in a variable using likes. 01:41:20.648 --> 01:41:22.440 And you'll see this syntax in problem set 7 01:41:22.440 --> 01:41:25.560 as you begin to play with SQL yourself in the CS50 library. 01:41:25.560 --> 01:41:29.130 Suppose that your third line of code is to then update posts, 01:41:29.130 --> 01:41:32.670 setting likes equal to some place holder, where that place 01:41:32.670 --> 01:41:34.630 holder is going to be likes plus 1. 
01:41:34.630 --> 01:41:38.490 This is the same problem as the milk example, 01:41:38.490 --> 01:41:41.430 because if you check how many likes there 01:41:41.430 --> 01:41:43.490 are on this post, or your roommate checks how-- 01:41:43.490 --> 01:41:46.050 or if you check how much milk is available left in the fridge 01:41:46.050 --> 01:41:49.440 and then go off and begin to make a decision like buying milk, 01:41:49.440 --> 01:41:52.260 your roommate might, on a server that's running 01:41:52.260 --> 01:41:55.410 the same code to handle thousands and thousands of people, 01:41:55.410 --> 01:41:56.680 might ask that same question. 01:41:56.680 --> 01:41:58.472 How many likes does this current post have? 01:41:58.472 --> 01:42:00.120 Suppose the post has a million likes. 01:42:00.120 --> 01:42:02.400 Both of you execute that first line of code, 01:42:02.400 --> 01:42:04.020 see that, oh, I have a million likes. 01:42:04.020 --> 01:42:06.550 I've just clicked plus 1 on the site. 01:42:06.550 --> 01:42:10.830 And so you try changing the number of likes with this second update query 01:42:10.830 --> 01:42:12.720 to 1 million plus 1. 01:42:12.720 --> 01:42:15.750 But if both of you have been making a decision that 01:42:15.750 --> 01:42:18.660 gets interleaved with the other person, as will absolutely 01:42:18.660 --> 01:42:20.910 happen with Instagram and Twitter, who have thousands 01:42:20.910 --> 01:42:24.870 of servers all operating in parallel, instead of changing the number of likes 01:42:24.870 --> 01:42:30.750 from 1 million to 1,000,002, you might change it both of you to 1,000,001, 01:42:30.750 --> 01:42:33.870 thereby wasting one of the counts, because it gets lost, 01:42:33.870 --> 01:42:36.810 because you inspected the state of the fridge or the likes. 01:42:36.810 --> 01:42:37.983 You made a decision. 01:42:37.983 --> 01:42:39.900 And then you updated it based on that decision 01:42:39.900 --> 01:42:42.813 without realizing that life changed in between. 
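The lost-update race described above can be simulated deterministically with sqlite3: two "requests" each read the like count, then each write back what they read plus one. The posts table and its numbers are made up; a real race would involve two servers interleaving, but the arithmetic is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (1, 1000000)")

# Both requests read the same stale value before either one writes...
likes_a = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
likes_b = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]

# ...so the second UPDATE clobbers the first, and one like is lost.
db.execute("UPDATE posts SET likes = ? WHERE id = 1", (likes_a + 1,))
db.execute("UPDATE posts SET likes = ? WHERE id = 1", (likes_b + 1,))

final = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final)  # 1000001, not the 1000002 that two likes should produce

# One atomic UPDATE avoids the read-then-write gap entirely:
db.execute("UPDATE posts SET likes = likes + 1 WHERE id = 1")
final2 = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final2)  # 1000002
```

The atomic form at the end is one simple fix; transactions, mentioned next, generalize the idea of locking the fridge.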
01:42:42.813 --> 01:42:44.730 Your roommate got home and checked the fridge, 01:42:44.730 --> 01:42:47.190 too, or someone else clicked the upvote. 01:42:47.190 --> 01:42:49.890 So this is a bad situation, but there are solutions in SQL 01:42:49.890 --> 01:42:54.270 that we won't look at this week, known as transactions. 01:42:54.270 --> 01:42:57.960 That is a solvable problem, and more on that in a higher-level database class. 01:42:57.960 --> 01:43:00.690 And one final example of this, because this one you 01:43:00.690 --> 01:43:04.380 can avoid throughout the coming weeks in CS50 and beyond. 01:43:04.380 --> 01:43:08.520 A SQL injection attack is when you write bad code that 01:43:08.520 --> 01:43:12.150 somehow allows some bad actor on the internet or in an application 01:43:12.150 --> 01:43:16.530 to trick your code into running SQL code that you did not intend. 01:43:16.530 --> 01:43:19.680 For instance, this is how Yale students log on in New Haven. 01:43:19.680 --> 01:43:22.470 This, of course, is how Harvard students log on here in Cambridge. 01:43:22.470 --> 01:43:24.887 You're asked for a username and password, or a Harvard key 01:43:24.887 --> 01:43:25.680 and a password. 01:43:25.680 --> 01:43:28.290 Suppose, though, you as a user don't cooperate, 01:43:28.290 --> 01:43:30.840 and instead of typing in your email address and password, 01:43:30.840 --> 01:43:34.080 suppose you type something like this, like your email address, 01:43:34.080 --> 01:43:38.520 Malan@Harvard.edu, then maybe a single quote, and then a dash dash. 01:43:38.520 --> 01:43:40.950 Well, it turns out in SQL, dash dash is how 01:43:40.950 --> 01:43:46.320 you start a comment, similar to hash in Python or slash slash in C. 
01:43:46.320 --> 01:43:50.180 But suppose that Harvard had implemented its website such 01:43:50.180 --> 01:43:52.680 that there's a query like this going on underneath the hood, 01:43:52.680 --> 01:43:56.370 if the Harvard key is implemented in Python, that says this: SELECT * FROM 01:43:56.370 --> 01:44:00.630 users WHERE username = placeholder AND password = placeholder, 01:44:00.630 --> 01:44:02.430 passing in username and password. 01:44:02.430 --> 01:44:03.550 That is good. 01:44:03.550 --> 01:44:07.050 That is correct code, because those placeholders are important. 01:44:07.050 --> 01:44:10.890 What the CS50 library does, and other libraries in the world like it, is 01:44:10.890 --> 01:44:12.810 it escapes user input. 01:44:12.810 --> 01:44:17.480 If a user tries to trick your code by inputting comments or single quotes, 01:44:17.480 --> 01:44:21.990 db.execute and other libraries will sanitize the user's input 01:44:21.990 --> 01:44:24.570 and prevent the following from happening. 01:44:24.570 --> 01:44:27.630 If, however, you do this-- not good, but bad. 01:44:27.630 --> 01:44:32.490 Suppose that you really practice what we preached last week using f-strings; 01:44:32.490 --> 01:44:34.920 that is now bad in the context of SQL. 01:44:34.920 --> 01:44:37.830 If you create a format string like this with a little f 01:44:37.830 --> 01:44:41.250 that is literally SELECT * FROM users WHERE username =, 01:44:41.250 --> 01:44:44.310 and then use the fancy Python notation with curly braces, 01:44:44.310 --> 01:44:48.660 and password = curly braces, this is a correct SQL construction 01:44:48.660 --> 01:44:53.280 so long as the human cooperates and puts their username here and password here. 01:44:53.280 --> 01:44:55.680 But what if the user is malicious, like me, 01:44:55.680 --> 01:45:01.110 and actually includes a single quote in their input, and then dash dash? 
01:45:01.110 --> 01:45:04.650 So Malan@Harvard.edu quote dash dash would 01:45:04.650 --> 01:45:08.520 have the effect of injecting that comment into the query via the user's input. 01:45:08.520 --> 01:45:11.040 But dash dash, I claim, is a comment, so it's 01:45:11.040 --> 01:45:15.270 a way of tricking the computer into executing not 01:45:15.270 --> 01:45:19.350 SELECT * FROM users WHERE username = this AND password = that; 01:45:19.350 --> 01:45:22.230 it has the effect of just saying SELECT * FROM users WHERE 01:45:22.230 --> 01:45:25.290 username = Malan@Harvard.edu. 01:45:25.290 --> 01:45:27.720 And if-- and you'll see this in problem set 7-- 01:45:27.720 --> 01:45:31.020 db.execute returns one or more rows from the table-- 01:45:31.020 --> 01:45:33.750 that's how you can actually select data using Python code-- 01:45:33.750 --> 01:45:35.670 and the number of rows equals 1, because it 01:45:35.670 --> 01:45:39.210 selected Malan@Harvard.edu, you could log in 01:45:39.210 --> 01:45:45.660 as me, because you very simply tricked the computer into executing code 01:45:45.660 --> 01:45:47.400 that you injected. 01:45:47.400 --> 01:45:50.900 And the biggest takeaway here is that this is a 100% solvable problem. 01:45:50.900 --> 01:45:53.670 You just need to know that it is solved, and not 01:45:53.670 --> 01:45:56.760 write code like this using f-strings in Python, but instead 01:45:56.760 --> 01:46:02.370 use library code like db.execute, or after this course, anything like it. 01:46:02.370 --> 01:46:05.290 With that said, we'll end with a joke. 01:46:17.475 --> 01:46:18.130 OK. 01:46:18.130 --> 01:46:19.300 That's it for CS50. 01:46:19.300 --> 01:46:21.480 We'll see you next time.
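The injection walked through in the lecture can be reproduced end to end with Python's built-in sqlite3 module, used here in place of the CS50 library (so db.execute below is sqlite3's, not CS50's, and the users table, row, and password are made up for illustration):

```python
import sqlite3

# A toy login database with one registered user.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'secret')")

# The attacker's input: an email, a single quote, then dash dash (a comment).
username = "malan@harvard.edu'--"
password = "totally wrong guess"

# BAD: the f-string splices the input into the SQL itself; the stray quote
# closes the string literal and -- comments out the entire password check.
query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
hacked = db.execute(query).fetchall()  # 1 row: logged in without the password

# GOOD: placeholders treat the input as plain data, quote and all,
# so no row matches and the login attempt fails.
safe = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()  # 0 rows
```

The only difference between the two queries is who builds the SQL: the f-string lets the user's keystrokes become code, while the placeholder version leaves that job to the library.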