WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:02.988
[MUSIC PLAYING]

00:01:12.582 --> 00:01:13.540
DAVID MALAN: All right.

00:01:13.540 --> 00:01:17.150
This is CS50 and this
is already week seven.

00:01:17.150 --> 00:01:20.240
And this is the week where
we'll continue where we left off

00:01:20.240 --> 00:01:22.430
with Python, introducing
you to a bit more syntax

00:01:22.430 --> 00:01:25.580
and capabilities of the language so
you can solve interesting problems.

00:01:25.580 --> 00:01:27.580
But a lot of those problems
increasingly are now

00:01:27.580 --> 00:01:29.180
going to involve data in some form.

00:01:29.180 --> 00:01:31.250
After all, if you think
of most any website

00:01:31.250 --> 00:01:36.320
or mobile app or process nowadays
that involves solving problems,

00:01:36.320 --> 00:01:39.530
it almost always involves some amount
of data and often data at scale.

00:01:39.530 --> 00:01:40.680
Lots and lots of data.

00:01:40.680 --> 00:01:42.430
And so what we're going
to see first today

00:01:42.430 --> 00:01:45.710
is that, yes, you can use Python to
solve all the problems past that we've

00:01:45.710 --> 00:01:48.327
seen and also some data specific
ones, but sometimes it's

00:01:48.327 --> 00:01:49.410
just going to be annoying.

00:01:49.410 --> 00:01:50.420
It's going to be a little painful.

00:01:50.420 --> 00:01:54.060
It's just going to be more work than you
might like to just get to some answer.

00:01:54.060 --> 00:01:56.870
And so today we'll too
introduce you to a new language

00:01:56.870 --> 00:01:59.240
called SQL, Structured Query Language.

00:01:59.240 --> 00:02:02.780
And this is a language that rest
assured is actually much smaller,

00:02:02.780 --> 00:02:05.240
relatively speaking, than C and Python.

00:02:05.240 --> 00:02:07.860
It sort of does less, but
it doesn't really well.

00:02:07.860 --> 00:02:11.240
And it's a language for querying
databases, storing data in it,

00:02:11.240 --> 00:02:13.950
updating it, inserting it,
deleting it, and so much more.

00:02:13.950 --> 00:02:16.160
And it's the kind of
technology that's used nowadays

00:02:16.160 --> 00:02:19.700
in, indeed, web apps and mobile
apps, data science, analytics,

00:02:19.700 --> 00:02:20.940
and so much more.

00:02:20.940 --> 00:02:23.600
It's really good at storing
lots and lots of data.

00:02:23.600 --> 00:02:25.430
Now, this is yet another language.

00:02:25.430 --> 00:02:27.800
And believe it or not, next
week we'll introduce you

00:02:27.800 --> 00:02:31.880
to three more languages, HTML and
CSS, which are not technically

00:02:31.880 --> 00:02:32.900
programming languages.

00:02:32.900 --> 00:02:35.630
They're all about aesthetics
and markup of information.

00:02:35.630 --> 00:02:38.660
But also JavaScript, which is,
in fact, a programming language.

00:02:38.660 --> 00:02:41.630
But the goals here in
CS50 really are going

00:02:41.630 --> 00:02:44.000
to be to empower you to
program more generally.

00:02:44.000 --> 00:02:47.000
And indeed, when you're out there in
the real world some years from now,

00:02:47.000 --> 00:02:50.330
invariably there's going to be some
new other popular language out there.

00:02:50.330 --> 00:02:53.880
And hopefully in this week and next
week and beyond, among the goals

00:02:53.880 --> 00:02:56.630
is not just to teach you these
languages specifically but again,

00:02:56.630 --> 00:02:59.330
how to teach yourself the future
languages that we've not even

00:02:59.330 --> 00:03:00.480
heard about just yet.

00:03:00.480 --> 00:03:04.220
So with that said, let's
begin with a survey of sorts.

00:03:04.220 --> 00:03:10.190
If you go to this URL on your
phone or laptop, cs50.ly/favorites,

00:03:10.190 --> 00:03:14.390
a very simple Google Form awaits you
that's just going to ask you a couple

00:03:14.390 --> 00:03:16.530
of multiple choice questions.

00:03:16.530 --> 00:03:24.800
So go to cs50.ly/favorites, and that
should lead you to a Google Form that

00:03:24.800 --> 00:03:27.590
looks a little something
like this asking you first,

00:03:27.590 --> 00:03:31.700
as of now in week seven, what is your
favorite language among those options

00:03:31.700 --> 00:03:32.250
here.

00:03:32.250 --> 00:03:34.220
And then further down,
one more question.

00:03:34.220 --> 00:03:39.170
If you think back on problem sets
0 through 6, what was, if any,

00:03:39.170 --> 00:03:42.080
your favorite problem set problem?

00:03:42.080 --> 00:03:45.110
Be it in Scratch or C or Python.

00:03:45.110 --> 00:03:48.650
So answer those two questions.

00:03:48.650 --> 00:03:52.820
And in a moment, I'll flip over to
my screen here where you'll see,

00:03:52.820 --> 00:03:56.570
and anyone who's used Google Forms
knows, the spreadsheet that's

00:03:56.570 --> 00:03:58.130
collecting now this data.

00:03:58.130 --> 00:04:01.060
Microsoft Office 365 can do the
same if you use one of those forms.

00:04:01.060 --> 00:04:02.810
And what you see here
now is a spreadsheet

00:04:02.810 --> 00:04:06.770
in Google Sheets enumerating
all of the audience's questions.

00:04:06.770 --> 00:04:09.980
Language is in column B,
problem is in column C,

00:04:09.980 --> 00:04:13.190
and each row represents one
student who has responded.

00:04:13.190 --> 00:04:17.360
A few of you were super eager for
class today at 8:33 AM Eastern time.

00:04:17.360 --> 00:04:19.040
10:32, 11:10.

00:04:19.040 --> 00:04:21.779
OK, so now we're getting into
the actual class time here.

00:04:21.779 --> 00:04:24.830
And if I scroll down, we'll
probably see a few dozen,

00:04:24.830 --> 00:04:26.900
a couple of hundred answers by now.

00:04:26.900 --> 00:04:29.225
And yeah, so we're getting
a whole lot of answers here.

00:04:29.225 --> 00:04:31.100
And I'm seeing some
patterns emerge, but it's

00:04:31.100 --> 00:04:34.190
not necessarily obvious to the
human eyes what those patterns are.

00:04:34.190 --> 00:04:36.320
Now of course, you can
use Google Spreadsheets.

00:04:36.320 --> 00:04:40.143
You can highlight the data and you
can create charts magically out of it.

00:04:40.143 --> 00:04:42.560
But you can only do what Google
lets you do with the data.

00:04:42.560 --> 00:04:45.350
And same thing for Microsoft
Excel or Apple Numbers.

00:04:45.350 --> 00:04:48.230
But wouldn't it be nice to just be
able to manipulate the raw data,

00:04:48.230 --> 00:04:51.740
relatively simple though it is, to
just answer questions about the data?

00:04:51.740 --> 00:04:54.260
Maybe long term create your
own charts, customize it

00:04:54.260 --> 00:04:56.240
just the way you want
rather than beholden

00:04:56.240 --> 00:04:59.250
to software that's off
the shelf like this.

00:04:59.250 --> 00:05:02.130
Well, how could we go about doing this?

00:05:02.130 --> 00:05:05.330
Well, let me propose that
we treat this data set now

00:05:05.330 --> 00:05:08.790
as what we're going to call
for now a flat file database.

00:05:08.790 --> 00:05:10.850
We'll see today that
there's fancier databases,

00:05:10.850 --> 00:05:15.200
but the simplest database in the
world is really just like a .csv file.

00:05:15.200 --> 00:05:18.500
And we saw that a couple of weeks
ago in C. We wrote a bit of C code

00:05:18.500 --> 00:05:23.660
that used fprintf to write data to a
file using commas as the separator.

00:05:23.660 --> 00:05:25.910
We didn't really do much
more with CSVs at the time

00:05:25.910 --> 00:05:28.610
though, because it's really
annoying, painful, time consuming,

00:05:28.610 --> 00:05:32.840
not fun to use C for something like that
because of malloc and memory and all

00:05:32.840 --> 00:05:33.413
that stuff.

00:05:33.413 --> 00:05:35.330
But with Python, it's
going to be much easier.

00:05:35.330 --> 00:05:39.530
And so any time you have access to some
data set where you can just download it

00:05:39.530 --> 00:05:42.140
to your own Mac or PC or
your cloud environment,

00:05:42.140 --> 00:05:46.460
it's sort of a candidate for now writing
code to do something with the data.

00:05:46.460 --> 00:05:48.120
Maybe analyze it right away.

00:05:48.120 --> 00:05:50.150
If it's been human imported
manually, maybe you

00:05:50.150 --> 00:05:52.550
have to clean it up by doing
a lot of find and replace

00:05:52.550 --> 00:05:54.852
but not with your keyboard
but rather with code.

00:05:54.852 --> 00:05:56.310
And so let me go ahead and do this.

00:05:56.310 --> 00:05:59.780
Let me go back to my
Google Sheet here that has

00:05:59.780 --> 00:06:02.010
all of the data that's come in now.

00:06:02.010 --> 00:06:08.430
And let me go ahead and download
this via the File menu here.

00:06:08.430 --> 00:06:09.590
And let's see.

00:06:09.590 --> 00:06:10.310
Download.

00:06:10.310 --> 00:06:12.170
And you can see a
whole bunch of options.

00:06:12.170 --> 00:06:14.210
Most formats might be
familiar, but today we'll

00:06:14.210 --> 00:06:17.602
focus just on this one, Comma
Separated Values or CSV.

00:06:17.602 --> 00:06:20.810
That's going to go ahead and download
it on my Mac here into my own Downloads

00:06:20.810 --> 00:06:21.140
folder.

00:06:21.140 --> 00:06:22.890
And now I'm going to
go ahead and do this.

00:06:22.890 --> 00:06:25.140
Let me go ahead and pull up
VS Code in the cloud here.

00:06:25.140 --> 00:06:28.140
And if you've never done this before,
there's a couple of ways to do it.

00:06:28.140 --> 00:06:31.320
But the simplest way to upload a
file to your code space, so to speak,

00:06:31.320 --> 00:06:32.990
is just a sort of drag and drop.

00:06:32.990 --> 00:06:36.020
That's going to magically
upload it to the server there.

00:06:36.020 --> 00:06:39.688
And we'll see that, one, it has a very
long file name, which I'm actually

00:06:39.688 --> 00:06:41.480
going to clean this up
because that's going

00:06:41.480 --> 00:06:43.880
to be very tedious to type in my code.

00:06:43.880 --> 00:06:46.927
So I could either right
click, of course, up here,

00:06:46.927 --> 00:06:48.510
but I'm going to use my Linux command.

00:06:48.510 --> 00:06:52.550
So let's move this file called CSV
50 2022 something or other and let's

00:06:52.550 --> 00:06:54.740
just name it more simply favorites.csv.

00:06:54.740 --> 00:06:58.370
So all lowercase, no
spaces, sort of good basics.

00:06:58.370 --> 00:07:02.960
And let me go ahead now and open up
this file with code favorites.csv.

00:07:02.960 --> 00:07:06.500
I'll close my File Explorer and
we'll see exactly the same data

00:07:06.500 --> 00:07:09.890
as before but not quite as pretty
as Google Sheets makes it be.

00:07:09.890 --> 00:07:13.640
Rather we see here that I still have
three columns, timestamp, language,

00:07:13.640 --> 00:07:17.013
problem, and then all of the values
down below, including the timestamps

00:07:17.013 --> 00:07:18.180
and the answers they're for.

00:07:18.180 --> 00:07:20.360
But it doesn't have proper columns.

00:07:20.360 --> 00:07:22.250
It just has commas separating them.

00:07:22.250 --> 00:07:24.860
Now, we could very
easily write Python code

00:07:24.860 --> 00:07:28.820
just like we wrote C code to
manipulate files like this

00:07:28.820 --> 00:07:30.140
either to write or read.

00:07:30.140 --> 00:07:32.450
But instead let's do
something that's a little more

00:07:32.450 --> 00:07:34.640
pleasant, which is indeed
in the form of Python.

00:07:34.640 --> 00:07:38.300
So Python actually comes
with native support for CSVs.

00:07:38.300 --> 00:07:41.200
It has indeed a package
called CSV that just

00:07:41.200 --> 00:07:43.700
lets you read and write and do
a whole bunch of useful stuff

00:07:43.700 --> 00:07:45.007
when it comes to CSV files.

00:07:45.007 --> 00:07:47.090
So let's go ahead and do
something with this file.

00:07:47.090 --> 00:07:49.250
Let me go back here to VS Code.

00:07:49.250 --> 00:07:51.860
I'm going to close
favorites.csv for now.

00:07:51.860 --> 00:07:55.430
But just remember in your mind that
timestamp was the first column,

00:07:55.430 --> 00:07:58.640
language was the second column,
and problem was the third.

00:07:58.640 --> 00:08:01.790
And notice because we're using commas,
they don't again line up perfectly,

00:08:01.790 --> 00:08:02.832
but that's not a problem.

00:08:02.832 --> 00:08:05.270
There are two commas in
every line presumably.

00:08:05.270 --> 00:08:09.892
And I'm going to go ahead and now create
a file called how about favorites.py so

00:08:09.892 --> 00:08:12.350
that I can start writing some
code to manipulate this data.

00:08:12.350 --> 00:08:13.600
And let's do something simple.

00:08:13.600 --> 00:08:16.910
Let's just write a simple program
in Python that opens this file,

00:08:16.910 --> 00:08:19.997
reads it, and print something
out just as a safety check

00:08:19.997 --> 00:08:22.830
that I know what I'm doing, even
though it's not going to be useful.

00:08:22.830 --> 00:08:26.780
So in Python, if you want
CSV support, you import CSV.

00:08:26.780 --> 00:08:30.350
And that gives you access to all
the magical capabilities thereof.

00:08:30.350 --> 00:08:32.450
Let me now go ahead
and use this technique

00:08:32.450 --> 00:08:35.960
to open a file in Python,
which is similar in C.

00:08:35.960 --> 00:08:37.789
But with Python, we're going to do this.

00:08:37.789 --> 00:08:39.020
The keyword with.

00:08:39.020 --> 00:08:42.049
I'm going to open a file
called favorites.csv,

00:08:42.049 --> 00:08:43.880
which was the shorter name I gave it.

00:08:43.880 --> 00:08:46.280
This is optional, but
just for explicitness, I'm

00:08:46.280 --> 00:08:50.090
going to open it in read mode
explicitly, just like f open

00:08:50.090 --> 00:08:51.720
took a second argument as well.

00:08:51.720 --> 00:08:54.500
And I'm going to name this file
once open quite simply file,

00:08:54.500 --> 00:08:56.240
though I could call it anything I want.

00:08:56.240 --> 00:08:59.060
And now it's just an open file.

00:08:59.060 --> 00:09:02.870
So far as Python knows at this
moment, it's just text, or better yet,

00:09:02.870 --> 00:09:04.250
it's just zeros and ones.

00:09:04.250 --> 00:09:08.360
If you want this Python package
called CSV to actually do

00:09:08.360 --> 00:09:12.500
something useful with it, you have to
load this file now into the library.

00:09:12.500 --> 00:09:16.165
And the simplest way to do this is to
give myself a variable called reader

00:09:16.165 --> 00:09:17.540
because I want to read this file.

00:09:17.540 --> 00:09:19.430
Though this too I could
call anything else.

00:09:19.430 --> 00:09:22.610
I'm going to then set that equal to
the return value of a function called

00:09:22.610 --> 00:09:24.710
csv.reader.

00:09:24.710 --> 00:09:28.220
And I pass to that per the
documentation the open file.

00:09:28.220 --> 00:09:29.780
So step one, I open the file.

00:09:29.780 --> 00:09:32.030
And this just gives me
access to the bytes therein.

00:09:32.030 --> 00:09:36.530
Step two now with csv.reader
tells the Python package

00:09:36.530 --> 00:09:40.940
called CSV to do something useful
with it and start analyzing the commas

00:09:40.940 --> 00:09:43.770
and allow me to parse it further.

00:09:43.770 --> 00:09:45.180
So let's go ahead and do this.

00:09:45.180 --> 00:09:49.730
Let me go ahead now and within
this loop let's say this.

00:09:49.730 --> 00:09:52.340
Sorry, within this open
file, let's do this.

00:09:52.340 --> 00:09:58.040
For every row, if you will,
or line in the file, a.k.a.

00:09:58.040 --> 00:10:04.910
reader, let's go ahead and print
out just how about row bracket 1.

00:10:04.910 --> 00:10:06.198
Now what's going on here?

00:10:06.198 --> 00:10:08.990
Well, it turns out if you read the
documentation for the CSV reader

00:10:08.990 --> 00:10:13.610
function, what it hands you back is
essentially this special object, so

00:10:13.610 --> 00:10:16.850
to speak, that allows you to
treat it as though it's just

00:10:16.850 --> 00:10:20.660
a really big list of lines
from the file, a.k.a. reader.

00:10:20.660 --> 00:10:25.310
So by saying for row in reader,
this is a way more succinct way

00:10:25.310 --> 00:10:28.763
of saying give me the first
line in the file plus plus,

00:10:28.763 --> 00:10:30.680
give me the second line
in the file plus plus,

00:10:30.680 --> 00:10:33.597
and so forth that we would have done
what much more mechanically in C.

00:10:33.597 --> 00:10:37.350
This is just much more Pythonic
and English friendly, if you will.

00:10:37.350 --> 00:10:42.225
So in every iteration of this loop,
row is going to contain all of the data

00:10:42.225 --> 00:10:43.100
from the current row.

00:10:43.100 --> 00:10:45.590
But better yet, what
the reader function does

00:10:45.590 --> 00:10:50.240
for me is it hands me each row
not just as a big string or STR

00:10:50.240 --> 00:10:51.590
of text in Python.

00:10:51.590 --> 00:10:56.590
It gives me what apparently
based on the syntax on line six.

00:10:56.590 --> 00:10:57.460
Any instinct?

00:10:57.460 --> 00:10:58.030
Yeah.

00:10:58.030 --> 00:10:58.740
AUDIENCE: A list.

00:10:58.740 --> 00:11:00.698
DAVID MALAN: It's giving
me back indeed a list.

00:11:00.698 --> 00:11:02.880
And I presume the visual
clue for you was the fact

00:11:02.880 --> 00:11:04.463
that we're using square brackets here.

00:11:04.463 --> 00:11:07.780
And indeed, row bracket 1
is going to be not the first

00:11:07.780 --> 00:11:10.060
but the second element in that list.

00:11:10.060 --> 00:11:11.280
And so just take a guess.

00:11:11.280 --> 00:11:14.400
When I run this code in a moment,
what's going to get printed?

00:11:14.400 --> 00:11:16.950
The timestamp, the
language, or the problem?

00:11:20.892 --> 00:11:21.392
Yeah?

00:11:21.392 --> 00:11:22.350
AUDIENCE: The language.

00:11:22.350 --> 00:11:25.260
DAVID MALAN: The language because
it's the second column that is

00:11:25.260 --> 00:11:27.848
in the file delimited by those commas.

00:11:27.848 --> 00:11:29.140
So let me go ahead and do this.

00:11:29.140 --> 00:11:30.630
Let me clear my terminal down here.

00:11:30.630 --> 00:11:33.570
Let me run Python of
favorites.py and Enter.

00:11:33.570 --> 00:11:34.710
And there's everything.

00:11:34.710 --> 00:11:35.490
It was super fast.

00:11:35.490 --> 00:11:37.090
But there's a really long list here.

00:11:37.090 --> 00:11:41.130
And in fact, if I increase the size
of my terminal and start scrolling up,

00:11:41.130 --> 00:11:43.090
you'll just see all of the raw data.

00:11:43.090 --> 00:11:44.490
Now, this isn't that useful yet.

00:11:44.490 --> 00:11:46.032
I could have just glanced at the CSV.

00:11:46.032 --> 00:11:49.320
But clearly now I have the
ability to open the file,

00:11:49.320 --> 00:11:52.830
parse it, so to speak, that is break
it up into its constituent parts,

00:11:52.830 --> 00:11:56.380
and do something with
specific parts therein.

00:11:56.380 --> 00:11:56.880
All right.

00:11:56.880 --> 00:11:59.640
So if I want to do this a
little more pleasantly though,

00:11:59.640 --> 00:12:02.460
let me at least make this
semantically a little cleaner.

00:12:02.460 --> 00:12:05.650
And you know what, just for clarity,
let me just give myself a variable.

00:12:05.650 --> 00:12:09.000
It's not strictly necessary,
but I know that this is

00:12:09.000 --> 00:12:12.958
the favorite, for instance, language.

00:12:12.958 --> 00:12:14.250
So let's just call it favorite.

00:12:14.250 --> 00:12:16.410
Set it equal to row bracket 1.

00:12:16.410 --> 00:12:18.600
And now just to be more
explicit in my code,

00:12:18.600 --> 00:12:21.030
even though again, we don't
need the variable per se,

00:12:21.030 --> 00:12:22.840
this code is, of course,
going to do the same thing.

00:12:22.840 --> 00:12:25.132
It's just using an additional
variable called favorite.

00:12:25.132 --> 00:12:28.450
If I go down here, scroll
up, run the program again,

00:12:28.450 --> 00:12:30.280
I get back to the exact same data.

00:12:30.280 --> 00:12:32.280
But this is a stepping
stone to something

00:12:32.280 --> 00:12:34.800
that's even more powerful
about Python support

00:12:34.800 --> 00:12:38.610
for CSV files is that you don't
have to just treat the return

00:12:38.610 --> 00:12:43.500
value as a list with 0 and 1 and 2.

00:12:43.500 --> 00:12:48.540
So just thinking intuitively here,
why is this maybe not the best design

00:12:48.540 --> 00:12:51.960
to hand you, the programmer,
back the data in a list

00:12:51.960 --> 00:12:55.640
that's numerically indexed with 0, 1, 2?

00:12:55.640 --> 00:12:59.760
It clearly works, but critique this.

00:12:59.760 --> 00:13:00.890
What could go wrong?

00:13:00.890 --> 00:13:03.282
What's a little poorly designed?

00:13:03.282 --> 00:13:04.169
Yeah?

00:13:04.169 --> 00:13:06.586
AUDIENCE: You have to always
remember what the things are,

00:13:06.586 --> 00:13:10.945
what the order is [INAUDIBLE]

00:13:10.945 --> 00:13:11.820
DAVID MALAN: Exactly.

00:13:11.820 --> 00:13:14.880
So yeah, so it's up to you to repeat--

00:13:14.880 --> 00:13:18.360
it's up to you to remember what
column the data is actually in.

00:13:18.360 --> 00:13:22.020
And God forbid you're collaborating
with someone else on the spreadsheet.

00:13:22.020 --> 00:13:24.803
If you've used Google Spreadsheets,
you can move columns around

00:13:24.803 --> 00:13:27.220
maybe just because you want
to visually reorganize things.

00:13:27.220 --> 00:13:30.150
And if you do this and then someone
else downloads that same data,

00:13:30.150 --> 00:13:31.840
all of their code is going to break.

00:13:31.840 --> 00:13:33.360
So that's just really bad design.

00:13:33.360 --> 00:13:36.390
It's fragile just because you're
sort of on the honor system

00:13:36.390 --> 00:13:38.588
that one means the data that you want.

00:13:38.588 --> 00:13:41.130
So wouldn't it be nice if it
could be a little more explicit?

00:13:41.130 --> 00:13:47.130
Well, recall that the very first
line in this file is actually this.

00:13:47.130 --> 00:13:50.730
And I paused the output this time
so that we can see more optionally.

00:13:50.730 --> 00:13:51.960
I just reran favorites.py.

00:13:51.960 --> 00:13:55.050
And notice one of these
things is not like the other.

00:13:55.050 --> 00:13:59.070
Every output was either scratch or C
or Python except for this first one.

00:13:59.070 --> 00:14:00.855
Why am I seeing the word language here?

00:14:03.610 --> 00:14:05.860
Where did language come from?

00:14:05.860 --> 00:14:08.230
You didn't have the
ability to manually input.

00:14:08.230 --> 00:14:09.372
No.

00:14:09.372 --> 00:14:10.330
Where did it come from?

00:14:10.330 --> 00:14:11.030
Yeah.

00:14:11.030 --> 00:14:11.850
AUDIENCE: That would be the header.

00:14:11.850 --> 00:14:12.630
DAVID MALAN: Yeah, the header.

00:14:12.630 --> 00:14:15.300
The very first row in the
file, which by human convention

00:14:15.300 --> 00:14:18.210
generally just defines what the
columns represent so that there's

00:14:18.210 --> 00:14:20.470
some human useful information there.

00:14:20.470 --> 00:14:24.400
Now, that's not really intended to
be part of my output at the moment,

00:14:24.400 --> 00:14:25.710
so there is a way to skip this.

00:14:25.710 --> 00:14:29.380
If you want to skip the first row, you
can actually do something like this.

00:14:29.380 --> 00:14:32.560
You can say next row, and that
will just ignore that row.

00:14:32.560 --> 00:14:35.520
So then I'm starting really
with the every row thereafter.

00:14:35.520 --> 00:14:37.995
But there's a better way
to handle this than that.

00:14:37.995 --> 00:14:40.620
That will get rid of the row in
the output, but let me go ahead

00:14:40.620 --> 00:14:42.888
and use a different
feature of the CSV package

00:14:42.888 --> 00:14:45.430
that's just going to make this
a little cleaner all together.

00:14:45.430 --> 00:14:47.140
So let me clear my terminal window here.

00:14:47.140 --> 00:14:49.830
Let me undo this next
thing that I just added.

00:14:49.830 --> 00:14:54.690
And instead of using a reader, let me
go ahead and use a dictionary reader,

00:14:54.690 --> 00:15:00.720
abbreviated dict reader, that's going
to now return me the equivalent of all

00:15:00.720 --> 00:15:02.260
of the rows one at a time.

00:15:02.260 --> 00:15:04.720
So I can still call it
reader just as before.

00:15:04.720 --> 00:15:08.340
But as the name implies, what
this reader is going to return

00:15:08.340 --> 00:15:11.560
is not a list after list
after list but a dictionary,

00:15:11.560 --> 00:15:13.140
a dictionary, a dictionary.

00:15:13.140 --> 00:15:16.210
And remember, a dictionary is just
a collection of key value pairs.

00:15:16.210 --> 00:15:17.640
So what does that mean?

00:15:17.640 --> 00:15:18.390
What are the keys?

00:15:18.390 --> 00:15:19.480
What are the values?

00:15:19.480 --> 00:15:24.210
Well, now that I'm using a dictionary
reader, I can actually do this.

00:15:24.210 --> 00:15:27.270
Instead of on the honor system
remembering that I want column one,

00:15:27.270 --> 00:15:30.630
I can treat row now not as
a list but as a dictionary.

00:15:30.630 --> 00:15:34.410
And that means I can go in here
and say quote unquote "language."

00:15:34.410 --> 00:15:36.150
And we saw that back in week six.

00:15:36.150 --> 00:15:38.820
Python allows you to
index into dictionaries

00:15:38.820 --> 00:15:42.480
using square bracket notation in
strings or STRs on the inside,

00:15:42.480 --> 00:15:44.790
just like lists allow for numbers.

00:15:44.790 --> 00:15:48.180
But this now I think is going
to be a little more robust.

00:15:48.180 --> 00:15:52.830
If I run this again, Python of
favorites.py, of that worked out fine.

00:15:52.830 --> 00:15:56.580
And let me pause the output too
by using this program called more.

00:15:56.580 --> 00:15:58.600
Now I don't even see the header.

00:15:58.600 --> 00:16:03.660
So now whoever works with Python
wrote the code for this package

00:16:03.660 --> 00:16:06.600
to just analyze that first
line of code, use the header

00:16:06.600 --> 00:16:10.230
as you just called it as the
keys, and then every time you

00:16:10.230 --> 00:16:13.380
iterate through this loop, it updates
the values, the values, the values,

00:16:13.380 --> 00:16:15.810
but the keys stay the same.

00:16:15.810 --> 00:16:20.320
Any questions then on this technique?

00:16:20.320 --> 00:16:24.206
Suffice it to say this
would be painful in C. Yes?

00:16:24.206 --> 00:16:28.670
AUDIENCE: [INAUDIBLE]

00:16:33.395 --> 00:16:34.270
DAVID MALAN: Exactly.

00:16:34.270 --> 00:16:39.130
So the keys are always going to be
quote unquote "timestamp, language,

00:16:39.130 --> 00:16:40.130
and problem."

00:16:40.130 --> 00:16:43.330
But on each iteration of
this loop here, the row

00:16:43.330 --> 00:16:46.600
is going to contain a different row
of values, different row of values,

00:16:46.600 --> 00:16:47.600
different row of values.

00:16:47.600 --> 00:16:50.830
So you're going to get back one
dictionary for every student who

00:16:50.830 --> 00:16:54.140
submitted the Google Form, if you
will, while iterating through it there.

00:16:54.140 --> 00:16:54.640
All right.

00:16:54.640 --> 00:17:01.270
So once we have this ability here, why
don't we go ahead and transition to

00:17:01.270 --> 00:17:03.850
how about not just using
that dictionary reader, which

00:17:03.850 --> 00:17:05.890
it makes the code a little more robust.

00:17:05.890 --> 00:17:08.599
Because now if you move the
columns around, no big deal.

00:17:08.599 --> 00:17:11.170
It doesn't matter if the
numeric indices change.

00:17:11.170 --> 00:17:13.660
You can still use
those keywords instead.

00:17:13.660 --> 00:17:15.460
But let's actually analyze the data now.

00:17:15.460 --> 00:17:18.619
I'm just spitting it out, which is
not solving any problems for anyone.

00:17:18.619 --> 00:17:22.750
So let's go ahead and count the
popularity of Scratch, C, and Python

00:17:22.750 --> 00:17:25.790
and see what everyone's
been thinking here.

00:17:25.790 --> 00:17:26.290
All right.

00:17:26.290 --> 00:17:27.470
So how might I do this?

00:17:27.470 --> 00:17:29.890
Well, let me go ahead
and do this up here.

00:17:29.890 --> 00:17:33.880
Before I start iterating, let me give
myself, let's say, three variables.

00:17:33.880 --> 00:17:37.120
And to keep things simple, I'll
say one variable called Scratch.

00:17:37.120 --> 00:17:40.270
Set it equal to 0 for 0 students so far.

00:17:40.270 --> 00:17:44.200
C is going to equal 0 and
Python is going to equal 0.

00:17:44.200 --> 00:17:46.157
There's a slightly
prettier way of doing this,

00:17:46.157 --> 00:17:49.240
just because this is like three lines
of code to do something very simple.

00:17:49.240 --> 00:17:53.080
You could alternatively
in Python but not C

00:17:53.080 --> 00:17:58.720
do Scratch comma C comma
Python equals 0 comma 0.

00:17:58.720 --> 00:18:02.168
So kind of slightly more elegant
just to fit it all into one line.

00:18:02.168 --> 00:18:04.210
But now let's just do
something more interesting.

00:18:04.210 --> 00:18:06.160
On line seven, I'm still
going to figure out

00:18:06.160 --> 00:18:08.320
what the current favorite language is.

00:18:08.320 --> 00:18:11.740
And now I'm just going to
do some conditional checks.

00:18:11.740 --> 00:18:16.610
How about if that favorite equals
equals quote unquote "Scratch,"

00:18:16.610 --> 00:18:19.990
Let's go ahead and
increment Scratch by 1.

00:18:19.990 --> 00:18:23.560
We can't do plus plus in Python,
but we can do plus equals 1.

00:18:23.560 --> 00:18:30.700
How about elif if favorite equals
equals C, then let's do C plus equals 1.

00:18:30.700 --> 00:18:32.063
We could do else.

00:18:32.063 --> 00:18:33.730
This is actually a good design question.

00:18:33.730 --> 00:18:34.840
Should I do else?

00:18:34.840 --> 00:18:36.880
Should I do elif?

00:18:36.880 --> 00:18:39.250
Any instincts here?

00:18:39.250 --> 00:18:40.824
Yeah.

00:18:40.824 --> 00:18:43.259
AUDIENCE: [INAUDIBLE]

00:18:46.322 --> 00:18:48.030
DAVID MALAN: Yeah,
really good instincts.

00:18:48.030 --> 00:18:50.370
Just in case someone goes
and adds another language

00:18:50.370 --> 00:18:51.960
to the form next week
because we're obviously

00:18:51.960 --> 00:18:53.370
going to introduce
another language today,

00:18:53.370 --> 00:18:55.620
you don't want your code
to now artificially inflate

00:18:55.620 --> 00:18:58.380
the scores for Python
just because you're

00:18:58.380 --> 00:19:00.340
conflating multiple languages together.

00:19:00.340 --> 00:19:03.810
So the more defensive sort of better
way to write this code, I agree,

00:19:03.810 --> 00:19:07.680
would be elif favorite
equals equals Python.

00:19:07.680 --> 00:19:10.590
Then let's go ahead and
increment Python plus equals 1.

00:19:10.590 --> 00:19:12.840
And if there's a new
language next week, we're

00:19:12.840 --> 00:19:14.260
obviously going to have
to update the code,

00:19:14.260 --> 00:19:15.630
but at least we're not miscounting.

00:19:15.630 --> 00:19:17.130
We're just missing the new language.

00:19:17.130 --> 00:19:19.210
So I think that's slightly more robust.

00:19:19.210 --> 00:19:22.290
All right, now at the very bottom of
this program and outside of the loop

00:19:22.290 --> 00:19:25.680
when I'm all done counting,
let me go ahead and print out

00:19:25.680 --> 00:19:27.060
using some f strings.

00:19:27.060 --> 00:19:31.450
How about the total number of
people whose favorite is Scratch?

00:19:31.450 --> 00:19:34.890
So this is just week
six f string syntax.

00:19:34.890 --> 00:19:38.940
Let me go ahead and print out another
f string for C. And I'm, of course,

00:19:38.940 --> 00:19:41.880
putting the variables in
curly braces, all lowercase,

00:19:41.880 --> 00:19:45.060
but the English words I'm
doing capitalization for.

00:19:45.060 --> 00:19:49.470
Let's do a final one with f Python
colon and then in curly braces

00:19:49.470 --> 00:19:51.000
Python close quote.

00:19:51.000 --> 00:19:52.105
And I think I'm done.

00:19:52.105 --> 00:19:53.980
So let me just hide my
terminal for a second.

00:19:53.980 --> 00:19:55.290
Here's the total program.

00:19:55.290 --> 00:19:56.730
Same stuff as before.

00:19:56.730 --> 00:19:58.980
Open favorites.csv.

00:19:58.980 --> 00:20:02.880
Open it further with the dictionary
reader to do that processing for us.

00:20:02.880 --> 00:20:06.330
Initialize three variables to 0 just
so we have something to count with.

00:20:06.330 --> 00:20:09.300
And then iterate over
the file row by row.

00:20:09.300 --> 00:20:12.120
And this is just some sort of
week one style conditional logic,

00:20:12.120 --> 00:20:14.410
albeit in Python, counting things.

00:20:14.410 --> 00:20:14.910
All right.

00:20:14.910 --> 00:20:17.080
So how can we now execute this?

00:20:17.080 --> 00:20:18.450
Let me go back to my terminal.

00:20:18.450 --> 00:20:20.610
Python of favorites.py.

00:20:20.610 --> 00:20:22.050
And here we go.

00:20:22.050 --> 00:20:26.400
As of today, everyone who's reporting
in live via the Google Form,

00:20:26.400 --> 00:20:30.190
their favorite languages are--

00:20:30.190 --> 00:20:31.330
interesting.

00:20:31.330 --> 00:20:34.510
That's pretty interesting too after
just one week of Python no less.

00:20:34.510 --> 00:20:36.310
But Scratch is a
healthy contender there.

00:20:36.310 --> 00:20:38.210
A lot of C. So a pretty good mix here.

00:20:38.210 --> 00:20:43.127
So is this going to be the best way
to write this program long term?

00:20:43.127 --> 00:20:45.460
Well, as you noted, if there's
a new language next week,

00:20:45.460 --> 00:20:47.830
this week we're going to have
to constantly update this.

00:20:47.830 --> 00:20:51.610
And here's where you should let
your mind wander to the future.

00:20:51.610 --> 00:20:55.270
If we have a fourth language, fifth
language, sixth, seventh, eighth,

00:20:55.270 --> 00:20:59.410
which aspects here might kind
of have some code smell to it?

00:20:59.410 --> 00:21:04.310
This probably isn't the best
design to set us up for the future.

00:21:04.310 --> 00:21:07.480
What might be better than this?

00:21:07.480 --> 00:21:08.417
Yeah.

00:21:08.417 --> 00:21:09.880
AUDIENCE: We need to add
a language to line five.

00:21:09.880 --> 00:21:10.330
DAVID MALAN: Yeah.

00:21:10.330 --> 00:21:12.288
We have to keep adding
a language to line five.

00:21:12.288 --> 00:21:13.340
And OK, not a big deal.

00:21:13.340 --> 00:21:16.690
We could add SQL today and
maybe JavaScript next week.

00:21:16.690 --> 00:21:19.510
But any time a line of
code, a line of logic,

00:21:19.510 --> 00:21:21.450
it's just going to grow out of control.

00:21:21.450 --> 00:21:23.950
We've had this chat a couple
of times with different syntax.

00:21:23.950 --> 00:21:25.700
There's probably a better way than that.

00:21:25.700 --> 00:21:26.440
So let's do that.

00:21:26.440 --> 00:21:30.550
Instead of using these individual
variables, we could maybe use a list,

00:21:30.550 --> 00:21:35.560
but a list would be a little confusing
because what does bracket 0 mean?

00:21:35.560 --> 00:21:37.090
What is bracket 1, bracket 2?

00:21:37.090 --> 00:21:40.930
But a dictionary, recall, is this
Swiss army knife of data structures

00:21:40.930 --> 00:21:44.630
whereby you can associate anything
with anything else, keys with values.

00:21:44.630 --> 00:21:48.370
So I dare say a cleaner way to
solve this problem that sets us up

00:21:48.370 --> 00:21:52.570
for less work or confusion later would
be to create a new variable called

00:21:52.570 --> 00:21:55.000
counts, if that's what we're
doing, counting things up,

00:21:55.000 --> 00:21:58.340
and just set it equal
to an empty dictionary.

00:21:58.340 --> 00:22:01.675
And you can literally say dict with
the open parenthesis, close parenthesis

00:22:01.675 --> 00:22:05.020
and nothing or the more
Pythonic just use open

00:22:05.020 --> 00:22:07.000
and close curly braces
with nothing inside.

00:22:07.000 --> 00:22:11.320
That gives me an empty dictionary just
like square brackets gives me a list.

00:22:11.320 --> 00:22:13.990
Now, my logic down here
has to change a little bit.

00:22:13.990 --> 00:22:19.150
But what's nice is I don't need
one conditional for every language.

00:22:19.150 --> 00:22:22.240
Because again, if we have a fourth,
a fifth, a sixth, that chunk of code

00:22:22.240 --> 00:22:24.580
is also going to grow a
bit out of control too.

00:22:24.580 --> 00:22:26.570
So I can get rid of this here.

00:22:26.570 --> 00:22:29.180
And what I think I'm
going to do is say this.

00:22:29.180 --> 00:22:32.380
Whatever the current favorite is
from the current row in the file,

00:22:32.380 --> 00:22:38.420
why don't we go into our
counts variable at that key?

00:22:38.420 --> 00:22:40.303
And again, favorite is a variable.

00:22:40.303 --> 00:22:41.720
It's not quote unquote "favorite."

00:22:41.720 --> 00:22:44.560
It's going to be Scratch or C or Python.

00:22:44.560 --> 00:22:47.350
And then why don't we go ahead
and just increment whatever

00:22:47.350 --> 00:22:51.740
the value of that count is at that key?

00:22:51.740 --> 00:22:53.500
Now, this is technically buggy.

00:22:53.500 --> 00:22:57.140
We're really close, but there is a bug.

00:22:57.140 --> 00:23:00.350
Does anyone want to
conjecture what the bug is?

00:23:00.350 --> 00:23:00.890
Yeah.

00:23:00.890 --> 00:23:04.215
AUDIENCE: [INAUDIBLE]

00:23:08.490 --> 00:23:11.640
DAVID MALAN: A good question that
answers my question nonetheless.

00:23:11.640 --> 00:23:12.685
So no.

00:23:12.685 --> 00:23:14.310
The magic you describe will not happen.

00:23:14.310 --> 00:23:17.490
And to repeat the hypothesis,
will this automatically

00:23:17.490 --> 00:23:22.590
create a key for every
language that we try

00:23:22.590 --> 00:23:24.090
plugging into those square brackets?

00:23:24.090 --> 00:23:25.140
Short answer no.

00:23:25.140 --> 00:23:27.750
Odds are this is going to
create a key error, one of those

00:23:27.750 --> 00:23:29.640
traceback error messages
that you've probably

00:23:29.640 --> 00:23:31.890
seen by now either in
class or in problem sets

00:23:31.890 --> 00:23:37.410
whereby if Scratch hasn't appeared in
the dictionary before or C or Python,

00:23:37.410 --> 00:23:40.330
then the dictionary has no
clue what you're talking about.

00:23:40.330 --> 00:23:43.200
So I think we actually still need
some conditional logic but not

00:23:43.200 --> 00:23:45.900
that's going to grow longer
and longer with each language.

00:23:45.900 --> 00:23:47.860
What I think we probably
want to do is this.

00:23:47.860 --> 00:23:52.530
If the current favorite is
in the count dictionary,

00:23:52.530 --> 00:23:56.730
and this is the Pythonic way of just
saying is this key in this dictionary,

00:23:56.730 --> 00:24:02.160
then go ahead and safely do
counts favorite plus equals 1.

00:24:02.160 --> 00:24:05.250
Else, to your conjecture
now, else what I want to do.

00:24:05.250 --> 00:24:10.330
Counts favorites equals, yeah, 1.

00:24:10.330 --> 00:24:14.380
So initialize a brand new
key to a brand new value of 1

00:24:14.380 --> 00:24:16.630
because I'm obviously
just seeing this language.

00:24:16.630 --> 00:24:19.190
Otherwise increment again and again.

00:24:19.190 --> 00:24:22.390
And now down here I just need
to tweak my syntax a little bit.

00:24:22.390 --> 00:24:26.540
I don't need to print out all of
these things one at a time manually.

00:24:26.540 --> 00:24:30.470
I can actually get away I think with
another loop at the very bottom here.

00:24:30.470 --> 00:24:31.480
So how about I do this?

00:24:31.480 --> 00:24:34.790
For each favorite in
those counts, and this is,

00:24:34.790 --> 00:24:39.340
again, the Pythonic way to iterate
over all of the keys in a dictionary,

00:24:39.340 --> 00:24:45.220
go ahead and print out using an f
string whatever the current favorite is,

00:24:45.220 --> 00:24:52.180
Scratch or C or Python, and then a colon
and then figure out what its count is.

00:24:52.180 --> 00:24:54.610
And you can do that by going
into the counts dictionary,

00:24:54.610 --> 00:24:58.910
looking at the favorite
key, and get back its value.

00:24:58.910 --> 00:25:00.190
So I close my curly braces.

00:25:00.190 --> 00:25:01.060
I close my quotes.

00:25:01.060 --> 00:25:04.690
And even though this looks ugly at the
moment, now this is much more dynamic.

00:25:04.690 --> 00:25:07.690
Because if we go and add
SQL to the CSV file tomorrow

00:25:07.690 --> 00:25:10.720
or we add JavaScript next
week, this will just work.

00:25:10.720 --> 00:25:12.760
It will keep working now automatically.

00:25:12.760 --> 00:25:15.140
All I change is the Google
Form, not my actual code.

00:25:15.140 --> 00:25:15.640
All right.

00:25:15.640 --> 00:25:17.620
Let's try Python of favorites.py.

00:25:17.620 --> 00:25:19.150
Cross my fingers as always.

00:25:19.150 --> 00:25:23.020
And there now is the data as of now.

00:25:23.020 --> 00:25:25.270
Questions on this code here?

00:25:25.270 --> 00:25:26.195
Yeah.

00:25:26.195 --> 00:25:28.093
AUDIENCE: [INAUDIBLE]

00:25:28.093 --> 00:25:29.510
DAVID MALAN: Really good question.

00:25:29.510 --> 00:25:33.040
What if you wanted to print
it in a particular order?

00:25:33.040 --> 00:25:35.530
Well, I could give you
a couple of solutions.

00:25:35.530 --> 00:25:37.390
If you want to print it out in--

00:25:37.390 --> 00:25:40.850
it's already coincidentally
in alphabetical order.

00:25:40.850 --> 00:25:43.630
So you got that for free although
that's just by chance here.

00:25:43.630 --> 00:25:45.010
But there is a way to do this.

00:25:45.010 --> 00:25:47.320
And let me propose that we
go down here to my loop.

00:25:47.320 --> 00:25:50.320
And I explicitly use a function you
might not have seen in Python yet,

00:25:50.320 --> 00:25:52.240
but it's literally
called sorted, which is

00:25:52.240 --> 00:25:54.970
going to take either a list
or, in this case, a dictionary

00:25:54.970 --> 00:25:58.270
and by default sort it
by key alphabetically.

00:25:58.270 --> 00:26:00.907
Now, if my intuition
is correct, this is not

00:26:00.907 --> 00:26:03.490
going to change the output,
because it's already alphabetical.

00:26:03.490 --> 00:26:05.980
But if you read the documentation
for the sorted function,

00:26:05.980 --> 00:26:08.800
it takes multiple parameters
potentially, some of which

00:26:08.800 --> 00:26:10.000
are named parameters.

00:26:10.000 --> 00:26:11.870
And so you can actually do this.

00:26:11.870 --> 00:26:14.200
If you want to the counts
but you want to reverse

00:26:14.200 --> 00:26:18.018
the order for whatever reason here so
that it's reverse alphabetical order.

00:26:18.018 --> 00:26:19.810
Now let me go ahead
and rerun this and I'll

00:26:19.810 --> 00:26:21.470
keep the previous output on the screen.

00:26:21.470 --> 00:26:27.000
Enter and now it's backwards
alphabetically, if you will.

00:26:27.000 --> 00:26:29.130
Other questions on this here?

00:26:31.770 --> 00:26:33.830
No?

00:26:33.830 --> 00:26:39.930
How about then we transition
to changing sorting by value.

00:26:39.930 --> 00:26:42.770
And this is going to escalate
a little quickly briefly

00:26:42.770 --> 00:26:44.540
but then we'll tone it down again.

00:26:44.540 --> 00:26:47.600
Notice that right now this
is indeed sorting by key.

00:26:47.600 --> 00:26:49.400
What if, especially if
I have lots of data,

00:26:49.400 --> 00:26:52.670
it'd be nice to make a top 10 list
or, in this case, a top three list

00:26:52.670 --> 00:26:58.890
and actually see in order of the counts,
the values what these popular ones are.

00:26:58.890 --> 00:27:00.620
So it's not C, Python, Scratch.

00:27:00.620 --> 00:27:03.770
It should ideally be
Python, then C, then

00:27:03.770 --> 00:27:06.710
Scratch because of the values
and the magnitude thereof.

00:27:06.710 --> 00:27:08.010
So how can I do this?

00:27:08.010 --> 00:27:12.170
Well, it turns out there is another
key, another parameter that you

00:27:12.170 --> 00:27:16.550
can pass to the sorted
function that is typically

00:27:16.550 --> 00:27:17.925
implemented as a function itself.

00:27:17.925 --> 00:27:19.633
And so I'm going to
go ahead and do this.

00:27:19.633 --> 00:27:21.950
I'm going to temporarily
define a function called

00:27:21.950 --> 00:27:24.230
get value just to make my life easier.

00:27:24.230 --> 00:27:29.390
And this get value function is going
to take I'll say a language parameter.

00:27:29.390 --> 00:27:35.400
And then all I'm going to do is return
whatever the count is of that language.

00:27:35.400 --> 00:27:37.850
So out of context, this is
just a super simple function

00:27:37.850 --> 00:27:41.252
that you hand it a language
like Scratch or C or Python,

00:27:41.252 --> 00:27:43.460
it's just going to tell you
what the count is thereof

00:27:43.460 --> 00:27:46.400
in that dictionary called counts.

00:27:46.400 --> 00:27:51.440
But what I can do now down here in my
newly introduced call to sort it is I

00:27:51.440 --> 00:27:54.380
can tell it what to use as its key.

00:27:54.380 --> 00:27:57.170
Instead of using literally
the key, Scratch, C,

00:27:57.170 --> 00:28:01.040
Python, I can sort of override that
behavior and say, you know what?

00:28:01.040 --> 00:28:05.570
To figure out what to sort by,
go ahead and call this function

00:28:05.570 --> 00:28:07.890
called get value.

00:28:07.890 --> 00:28:11.270
Notice that I have not put
parentheses after get value

00:28:11.270 --> 00:28:14.420
because I don't want to call
get value right then and there.

00:28:14.420 --> 00:28:20.737
I want to pass the get value function
as itself in argument to the sorted

00:28:20.737 --> 00:28:23.570
function so that the sorted function
written years ago by the people

00:28:23.570 --> 00:28:27.320
at Python can call my version
of get value again and again

00:28:27.320 --> 00:28:30.300
and again when they try
to sort this actual data.

00:28:30.300 --> 00:28:36.350
So now if I add that and I leave reverse
equals true, let's see what happens.

00:28:36.350 --> 00:28:38.240
Python of favorites.py.

00:28:38.240 --> 00:28:38.810
Enter.

00:28:38.810 --> 00:28:42.410
And now I get my top 10, or
in this case, top three list.

00:28:42.410 --> 00:28:45.290
And if I had more sophisticated
data with more columns

00:28:45.290 --> 00:28:47.150
all together that I
actually care about, I

00:28:47.150 --> 00:28:50.758
could even sort this
more powerfully as well.

00:28:50.758 --> 00:28:52.550
But let me clean this
up a little bit, just

00:28:52.550 --> 00:28:54.592
so you've seen it, even
though we won't use these

00:28:54.592 --> 00:28:59.010
that often in CS50 until the end of
the class will they come up again.

00:28:59.010 --> 00:29:00.740
Technically this is a little bit--

00:29:00.740 --> 00:29:03.620
this isn't necessarily the best
design to spend all this time

00:29:03.620 --> 00:29:06.680
implementing a function and
then only use it in one place.

00:29:06.680 --> 00:29:09.430
In general, we've argued
that you don't necessarily

00:29:09.430 --> 00:29:11.930
need a variable if you're only
going to use it in one place.

00:29:11.930 --> 00:29:13.430
You don't really need a
function if you're only

00:29:13.430 --> 00:29:14.638
going to use it in one place.

00:29:14.638 --> 00:29:17.160
And here we kind of have
a good candidate for that.

00:29:17.160 --> 00:29:19.400
And so it turns out
in Python if you don't

00:29:19.400 --> 00:29:23.240
want to bother creating a
function just to use it once,

00:29:23.240 --> 00:29:26.810
you can create what's called
an anonymous function, a.k.a.

00:29:26.810 --> 00:29:29.900
a lambda function, like
the lambda symbol familiar.

00:29:29.900 --> 00:29:33.380
And a lambda function, the syntax
is a little strange looking,

00:29:33.380 --> 00:29:34.490
but you say this.

00:29:34.490 --> 00:29:36.080
You literally say lambda.

00:29:36.080 --> 00:29:38.660
You literally then say
the name of the argument

00:29:38.660 --> 00:29:42.470
that you want this anonymous
function with no name to take.

00:29:42.470 --> 00:29:44.430
Then you have a colon.

00:29:44.430 --> 00:29:47.930
And then quite simply, you
write what you want the return

00:29:47.930 --> 00:29:49.850
value of this function to be.

00:29:49.850 --> 00:29:51.800
You don't even say return literally.

00:29:51.800 --> 00:29:53.720
These lambda functions
are meant to be used

00:29:53.720 --> 00:30:00.680
super tersely so that you can in one
line express something like this.

00:30:00.680 --> 00:30:03.720
And I admit this looks more cryptic,
I think, than the previous version.

00:30:03.720 --> 00:30:06.512
But as you get more comfortable
with Python or other languages that

00:30:06.512 --> 00:30:10.370
support this feature, it allows you to
not bother with lines of code like that

00:30:10.370 --> 00:30:12.690
and just tighten up
your code a little bit.

00:30:12.690 --> 00:30:16.670
So this line here, lambda
language colon counts language,

00:30:16.670 --> 00:30:18.927
is the one line version of this.

00:30:18.927 --> 00:30:21.260
And you don't even need to
bother picking a name for it.

00:30:21.260 --> 00:30:23.840
Lambda tells Python I
didn't waste any time

00:30:23.840 --> 00:30:26.440
thinking of a name for this function.

00:30:26.440 --> 00:30:35.310
So questions then on this technique of
using Python to analyze data like this?

00:30:35.310 --> 00:30:37.900
Any questions?

00:30:37.900 --> 00:30:39.860
We're almost done with Python.

00:30:39.860 --> 00:30:42.070
Questions?

00:30:42.070 --> 00:30:42.640
No?

00:30:42.640 --> 00:30:43.390
OK.

00:30:43.390 --> 00:30:45.790
So why don't we make things
a little more interesting?

00:30:45.790 --> 00:30:48.620
Because we had a much juicier
data set with the problems

00:30:48.620 --> 00:30:50.620
that we've assigned over
the past several weeks.

00:30:50.620 --> 00:30:54.970
Why don't we go ahead
and, quite simply, I think

00:30:54.970 --> 00:30:57.290
we wrote pretty darn good code here.

00:30:57.290 --> 00:31:02.590
So I think we can pretty much
just change a bit of it to say,

00:31:02.590 --> 00:31:05.720
let's see, if I don't want
language, I want problem.

00:31:05.720 --> 00:31:10.510
And if I want to sort by not language
but problem, I think that's it.

00:31:10.510 --> 00:31:12.580
I think if I didn't
overlook something here

00:31:12.580 --> 00:31:16.210
just by changing what column I'm
reading the data from and then

00:31:16.210 --> 00:31:18.550
just to be consistent
renaming my variables

00:31:18.550 --> 00:31:21.310
just so I know what I'm looking
at, what will this program now

00:31:21.310 --> 00:31:25.470
do after those minor changes?

00:31:25.470 --> 00:31:26.970
What will I see when I run this?

00:31:29.693 --> 00:31:31.860
What would be the first
thing I see when I run this?

00:31:34.950 --> 00:31:35.970
Tough crowd today.

00:31:35.970 --> 00:31:36.720
Yes.

00:31:36.720 --> 00:31:37.637
AUDIENCE: The problem.

00:31:37.637 --> 00:31:39.110
DAVID MALAN: Yeah.

00:31:39.110 --> 00:31:40.330
The top problem.

00:31:40.330 --> 00:31:42.627
So the most popular problem,
which I'm a little worried

00:31:42.627 --> 00:31:45.210
it might be hello or just Scratch,
but let's go ahead and see.

00:31:45.210 --> 00:31:47.700
So let me go ahead and
open my terminal window.

00:31:47.700 --> 00:31:50.670
I'll even maximize my terminal
window so we can see a lot.

00:31:50.670 --> 00:31:54.167
Let me go ahead and run
Python of favorites.py.

00:31:54.167 --> 00:31:57.000
I'm going to go ahead now and cross
my fingers that I didn't mess up

00:31:57.000 --> 00:31:57.660
and hit Enter.

00:31:57.660 --> 00:31:59.550
And OK, great.

00:31:59.550 --> 00:32:00.480
We peaked early.

00:32:00.480 --> 00:32:03.840
So Scratch was the most
popular program according

00:32:03.840 --> 00:32:06.460
to the data at the time I downloaded it.

00:32:06.460 --> 00:32:08.520
I'm sure other votes have come in since.

00:32:08.520 --> 00:32:13.080
Filter in week four was tied
then with Tideman as well.

00:32:13.080 --> 00:32:16.480
Mario is a close third
there and so forth.

00:32:16.480 --> 00:32:19.540
So this is helpful for us on
staff that not so much love

00:32:19.540 --> 00:32:21.040
down here at the bottom of the list.

00:32:21.040 --> 00:32:23.400
So it was a bunch of
code to write, but now

00:32:23.400 --> 00:32:26.250
that we've written it in this
very versatile dynamic way,

00:32:26.250 --> 00:32:29.280
it's pretty good for just like
crunching data and doing some analytics.

00:32:29.280 --> 00:32:33.480
But it's still a decent number of
lines to have had to write manually.

00:32:33.480 --> 00:32:41.820
And this is where sometimes it isn't
necessarily the right tool for the job,

00:32:41.820 --> 00:32:44.520
but rather a candidate for using
some other language altogether,

00:32:44.520 --> 00:32:47.428
especially when it's not just
a one time program that you run

00:32:47.428 --> 00:32:48.720
and you want to see the answer.

00:32:48.720 --> 00:32:53.130
What if you want to take input from the
user and answer questions dynamically

00:32:53.130 --> 00:32:55.050
like a mobile app would,
like a website would,

00:32:55.050 --> 00:32:58.740
like Microsoft Excel or Apple Numbers
or Google Sheets would for you.

00:32:58.740 --> 00:33:02.430
Well, let's make one final change for
now to this version of the program

00:33:02.430 --> 00:33:05.020
and actually take in some user input.

00:33:05.020 --> 00:33:08.170
So besides just loading all
of the data into memory,

00:33:08.170 --> 00:33:13.290
let's go ahead and down below
here not just print out the top 10

00:33:13.290 --> 00:33:16.660
list, if you will, but prompt
the user for their favorites.

00:33:16.660 --> 00:33:18.660
I'm going to use Python's
input function and I'm

00:33:18.660 --> 00:33:21.840
going to prompt them with "favorite,"
quote unquote, like tell me what

00:33:21.840 --> 00:33:23.370
your favorite problem--

00:33:23.370 --> 00:33:27.660
what problem rather
you are interested in.

00:33:27.660 --> 00:33:33.267
And now let me go ahead and say if that
favorite is in the count's variable,

00:33:33.267 --> 00:33:35.850
so you didn't type in something
random that we didn't actually

00:33:35.850 --> 00:33:37.890
assign as a problem,
then let me go ahead

00:33:37.890 --> 00:33:44.130
and print with a format string whatever
that favorite is of yours and show you

00:33:44.130 --> 00:33:47.040
the actual popularity
thereof by indexing

00:33:47.040 --> 00:33:51.370
into counts using that favorite
as the key and printing this.

00:33:51.370 --> 00:33:52.620
So now it's a dynamic program.

00:33:52.620 --> 00:33:55.170
It doesn't dump all of the
data and all of the summations.

00:33:55.170 --> 00:33:59.100
Rather it's going to allow me to
see what my choice of favorite is.

00:33:59.100 --> 00:34:03.940
And I'm going to go ahead and say,
let's see, I'm a fan of Mario here.

00:34:03.940 --> 00:34:04.950
So Enter.

00:34:04.950 --> 00:34:08.880
And indeed, we see the same value we
saw a moment ago but just for Mario.

00:34:08.880 --> 00:34:11.880
But the point now is that,
one, all of this is possible.

00:34:11.880 --> 00:34:15.630
Two, it's way easier and more pleasant
than this would have been in C.

00:34:15.630 --> 00:34:17.280
This is still only 15 lines of code.

00:34:17.280 --> 00:34:19.238
And in C, again, there's
the memory management.

00:34:19.238 --> 00:34:21.947
There's the iterating over the
strings trying to find the commas.

00:34:21.947 --> 00:34:23.190
There's just a lot more work.

00:34:23.190 --> 00:34:26.489
But honestly even when you just
want to answer a question like this

00:34:26.489 --> 00:34:29.067
in Excel and Apple Numbers,
Google Sheets, generally, you

00:34:29.067 --> 00:34:30.150
can just highlight things.

00:34:30.150 --> 00:34:34.500
You can click a button and boom, you get
your answer for summation or max or min

00:34:34.500 --> 00:34:35.909
or any of those sort of basics.

00:34:35.909 --> 00:34:39.270
Wouldn't it be nice if we weren't
taking a step backwards as programmers

00:34:39.270 --> 00:34:42.880
and being more powerful and yet we
now have to do more of the work?

00:34:42.880 --> 00:34:46.230
So maybe sometimes Python's
not or any language

00:34:46.230 --> 00:34:48.370
is not the best tool for the job.

00:34:48.370 --> 00:34:51.960
And that's going to now allow us to
introduce more generally something

00:34:51.960 --> 00:34:53.670
called a relational database.

00:34:53.670 --> 00:34:57.137
Graduating from mere flat
file databases like text files

00:34:57.137 --> 00:34:58.845
or binary files in
which all of your data

00:34:58.845 --> 00:35:00.570
is stored to something more proper.

00:35:00.570 --> 00:35:02.566
But first, questions.

00:35:02.566 --> 00:35:08.518
AUDIENCE: [INAUDIBLE]

00:35:12.943 --> 00:35:14.360
DAVID MALAN: Really good question.

00:35:14.360 --> 00:35:17.670
To reiterate, if I were--
is this case sensitive?

00:35:17.670 --> 00:35:21.470
So if I were to type in Mario
in all lowercase and hit Enter,

00:35:21.470 --> 00:35:23.450
I actually get no such response.

00:35:23.450 --> 00:35:26.720
Now that might be acceptable, because
the problem technically is a capital M.

00:35:26.720 --> 00:35:30.840
But that's a little ridiculous to
be that pedantic about the input.

00:35:30.840 --> 00:35:34.250
So how could we solve this?

00:35:34.250 --> 00:35:37.010
Any tips for how we can make
this a little more robust?

00:35:37.010 --> 00:35:38.378
Yeah.

00:35:38.378 --> 00:35:42.290
AUDIENCE: [INAUDIBLE]

00:35:45.848 --> 00:35:46.765
DAVID MALAN: OK, yeah.

00:35:49.400 --> 00:35:49.900
Yeah.

00:35:49.900 --> 00:35:51.700
So we could use a few different
functions, one of which

00:35:51.700 --> 00:35:54.970
is called title, which will change it
to title case where it capitalizes,

00:35:54.970 --> 00:35:58.390
like in most English sentences,
the first letter of that sentence.

00:35:58.390 --> 00:35:59.740
We could use capitalize.

00:35:59.740 --> 00:36:00.940
We could use upper.

00:36:00.940 --> 00:36:01.990
We could use lower.

00:36:01.990 --> 00:36:06.435
But indeed, we could just decide how we
want to standardize the capitalization.

00:36:06.435 --> 00:36:08.810
Either uppercase, lowercase,
or some combination thereof.

00:36:08.810 --> 00:36:11.770
And just make sure that you
change the counts themselves.

00:36:11.770 --> 00:36:13.930
Make sure that you do
the same to favorite

00:36:13.930 --> 00:36:16.420
and make sure that maybe you
keep a backup of the data

00:36:16.420 --> 00:36:19.780
if you want to show the original
version that came from the CSV

00:36:19.780 --> 00:36:23.205
without presuming to just
capitalize everything for the user.

00:36:23.205 --> 00:36:25.330
But indeed, that would be
the most common scenario.

00:36:25.330 --> 00:36:28.540
You just make things case
insensitive when doing those matches.

00:36:28.540 --> 00:36:36.910
Other questions now on Python before
we leave it behind for the coming week?

00:36:36.910 --> 00:36:39.880
All right, well then let's introduce
these relational databases.

00:36:39.880 --> 00:36:43.060
So relational database is what every--

00:36:43.060 --> 00:36:45.572
it's a super popular way
of storing lots of data.

00:36:45.572 --> 00:36:48.280
This is what the Twitters of the
world, the Googles of the world,

00:36:48.280 --> 00:36:52.390
the Metas of the world use to
store some of their data at scale.

00:36:52.390 --> 00:36:55.492
There are alternatives
to relational databases.

00:36:55.492 --> 00:36:57.700
Indeed, today we'll talk
about a language called SQL.

00:36:57.700 --> 00:37:00.408
There's also a movement, if you
will, or an alternative generally

00:37:00.408 --> 00:37:02.330
called NoSQL, which
is just the opposite.

00:37:02.330 --> 00:37:03.280
You don't use SQL.

00:37:03.280 --> 00:37:06.020
There are things called object
oriented databases and the like.

00:37:06.020 --> 00:37:11.110
But if you've ever heard of MySQL
or PostgreSQL or Microsoft SQL

00:37:11.110 --> 00:37:15.580
Server or Oracle or MariaDB
or a bunch of other products,

00:37:15.580 --> 00:37:18.850
both free and commercial, this
is what they're talking about.

00:37:18.850 --> 00:37:21.560
Databases that are designed
to store lots of data.

00:37:21.560 --> 00:37:23.710
And what's nice about
relational databases

00:37:23.710 --> 00:37:26.530
is that they're really similar to
the spreadsheets with which you

00:37:26.530 --> 00:37:29.800
were presumably familiar
long before today's class.

00:37:29.800 --> 00:37:32.530
So a relational database
is going to store,

00:37:32.530 --> 00:37:35.452
as you'll see, all of the
data in rows and columns.

00:37:35.452 --> 00:37:37.910
Now, the terminology will
thereafter be a little different.

00:37:37.910 --> 00:37:40.570
Instead of having sheets,
you're going to have tables.

00:37:40.570 --> 00:37:43.240
But those tables are still
going to have rows and columns.

00:37:43.240 --> 00:37:46.330
And you're going to have even
more control over the performance

00:37:46.330 --> 00:37:51.970
of your data when you start to access
it using this Structured Query Language

00:37:51.970 --> 00:37:53.050
or SQL.

00:37:53.050 --> 00:37:56.440
This is a language you can
use for web apps, mobile apps.

00:37:56.440 --> 00:37:58.990
A lot of analysts would
sit down at their Mac or PC

00:37:58.990 --> 00:38:01.810
and actually ask questions of
data to get back the answer.

00:38:01.810 --> 00:38:05.050
And wonderfully, even though there
will be some new syntax today,

00:38:05.050 --> 00:38:07.570
SQL really just does four basic things.

00:38:07.570 --> 00:38:10.030
CRUD is the sort of crude acronym here.

00:38:10.030 --> 00:38:14.470
CRUD is a way of remembering that
a relational database supports

00:38:14.470 --> 00:38:20.390
ultimately creating data, reading
data, updating data, and deleting data.

00:38:20.390 --> 00:38:22.240
So even if you're
feeling like, wow, this

00:38:22.240 --> 00:38:25.510
is a lot of new syntax, which it
isn't relative to our past languages,

00:38:25.510 --> 00:38:29.650
the only things you're doing really are
creating data, reading data, updating,

00:38:29.650 --> 00:38:30.910
and deleting the same.

00:38:30.910 --> 00:38:36.940
Now a little confusingly in SQL, the
corresponding functions or commands

00:38:36.940 --> 00:38:41.150
that exist that map to
CRUD are actually this.

00:38:41.150 --> 00:38:44.290
So it's still create, but there's
another one called insert.

00:38:44.290 --> 00:38:48.250
It's not read, which is more of the
computer scientist way of saying it,

00:38:48.250 --> 00:38:50.380
but select, which is a
little more explicit.

00:38:50.380 --> 00:38:52.030
Like select data you care about.

00:38:52.030 --> 00:38:53.320
Update is still update.

00:38:53.320 --> 00:38:54.310
Delete is still delete.

00:38:54.310 --> 00:38:57.910
But there's another command called drop,
which lets you drop, that is delete,

00:38:57.910 --> 00:38:59.930
entire tables as well.

00:38:59.930 --> 00:39:03.280
So you can create tables
using syntax that's

00:39:03.280 --> 00:39:05.480
generally going to look like this.

00:39:05.480 --> 00:39:06.910
You'll say create table.

00:39:06.910 --> 00:39:10.210
You'll give the name of the
table, which you can call most

00:39:10.210 --> 00:39:14.260
anything you want, but generally
all lowercase, no spaces is best.

00:39:14.260 --> 00:39:18.550
Then in parentheses, you can specify
a comma separated list of the columns

00:39:18.550 --> 00:39:20.660
that you might want in this table.

00:39:20.660 --> 00:39:24.070
So this is the code
equivalent in the SQL language

00:39:24.070 --> 00:39:27.220
of manually opening Google
Sheets or Excel or Numbers

00:39:27.220 --> 00:39:30.490
and clicking in the top left
cell and like typing timestamp

00:39:30.490 --> 00:39:34.180
and then in the next typing language
and then the third typing problem.

00:39:34.180 --> 00:39:37.750
This is the way to define what
your headers are, if you will,

00:39:37.750 --> 00:39:38.500
in a spreadsheet.

00:39:38.500 --> 00:39:39.790
But now it's called a table.

00:39:39.790 --> 00:39:42.100
Now, we won't use this
command manually first.

00:39:42.100 --> 00:39:44.050
Let's do something a little simpler.

00:39:44.050 --> 00:39:47.590
We're going to start off by just
importing this data ourselves.

00:39:47.590 --> 00:39:49.340
And I'm going to go ahead and do this.

00:39:49.340 --> 00:39:51.160
Let me go back to VS Code here.

00:39:51.160 --> 00:39:54.040
I'm going to leave behind
favorites.py for now,

00:39:54.040 --> 00:39:57.100
because now we're going to transition
to this other language called SQL.

00:39:57.100 --> 00:40:02.680
And to do this, I am going to
create a new database file.

00:40:02.680 --> 00:40:06.790
And I'm going to do so using
a command called sqlite3,

00:40:06.790 --> 00:40:09.160
which is just the third
version thereof, and I'm

00:40:09.160 --> 00:40:12.110
going to give the database
a name of favorites.db.

00:40:12.110 --> 00:40:14.860
There's different conventions, but
this is one of the most common.

00:40:14.860 --> 00:40:18.610
When I hit Enter, this is going to
create for me a new empty database just

00:40:18.610 --> 00:40:23.080
like opening an untitled spreadsheet in
Excel, Google Sheets, or Apple Numbers.

00:40:23.080 --> 00:40:25.720
I'm being prompted do I
want to create favorites.db.

00:40:25.720 --> 00:40:26.890
I'll hit Y for Yes.

00:40:26.890 --> 00:40:28.562
OK, we're up and running.

00:40:28.562 --> 00:40:30.520
Now, you're going to
notice a different prompt.

00:40:30.520 --> 00:40:34.360
I'm not in my Linux prompt per se,
which is always the dollar sign.

00:40:34.360 --> 00:40:37.090
I'm now inside of the
program called SQLite.

00:40:37.090 --> 00:40:41.500
And we're going to use SQLite, SQLite3,
as just an interactive way for now

00:40:41.500 --> 00:40:43.180
of playing with SQL code.

00:40:43.180 --> 00:40:48.550
At the end of today, we'll show you
how you can use SQL in Python code

00:40:48.550 --> 00:40:51.050
so that you still write Python
code to do whatever you want,

00:40:51.050 --> 00:40:53.170
but you can talk to
databases using Python.

00:40:53.170 --> 00:40:55.630
And this is exactly how
web apps, mobile apps work.

00:40:55.630 --> 00:40:59.140
For instance, on iOS, on an
iPhone, an iPad, or the like,

00:40:59.140 --> 00:41:03.010
if you want to store data, it's
very often stored in a SQL database,

00:41:03.010 --> 00:41:04.580
as we're about to do.

00:41:04.580 --> 00:41:07.398
But you might use a language
called SWIFT or Objective C.

00:41:07.398 --> 00:41:09.190
And the same exists in
the world of Android

00:41:09.190 --> 00:41:12.830
using Java or Kotlin or something
else to query the database.

00:41:12.830 --> 00:41:16.270
So we're going to see SQL in isolation
for now like an analyst might just

00:41:16.270 --> 00:41:19.880
use with their Mac or PC, but we're
going to tie it together by day's end.

00:41:19.880 --> 00:41:28.180
So at this terminal SQLite, let me go
ahead and execute this command first.

00:41:28.180 --> 00:41:30.200
I'm going to first put
SQLite into CSV mode,

00:41:30.200 --> 00:41:32.200
because I'm going to cut
some corners initially.

00:41:32.200 --> 00:41:34.600
And I'm just going to
automatically import

00:41:34.600 --> 00:41:38.560
all of the data that was submitted
via the Google Form, which I exported

00:41:38.560 --> 00:41:41.410
as a CSV and uploaded to my code space.

00:41:41.410 --> 00:41:45.790
And I'm just going to automatically
say turn this CSV file into a SQL

00:41:45.790 --> 00:41:48.040
database for me just so I
don't have to figure out

00:41:48.040 --> 00:41:49.760
what those create table commands are.

00:41:49.760 --> 00:41:53.200
So to do this, I'm
going to say mode csv so

00:41:53.200 --> 00:41:57.850
that SQLite knows that this is the
command, knows that this is a CSV file.

00:41:57.850 --> 00:41:59.290
It's literally .mode.

00:41:59.290 --> 00:42:01.480
So the dot comes before
the keyword there.

00:42:01.480 --> 00:42:05.770
And now I'm going to say .import
and then the name of the file I want

00:42:05.770 --> 00:42:08.020
to import, which is favorites.csv.

00:42:08.020 --> 00:42:12.635
And now the name of the table that
I want to create with that data.

00:42:12.635 --> 00:42:15.010
And just for consistency, I'm
going to call it favorites.

00:42:15.010 --> 00:42:17.270
I could change these things
to be anything I want,

00:42:17.270 --> 00:42:18.530
but I'm going to do that.

00:42:18.530 --> 00:42:20.860
And voila, nothing
seems to have happened.

00:42:20.860 --> 00:42:24.503
But just like in C and in Python and
Linux when nothing seems to happen,

00:42:24.503 --> 00:42:25.670
that's usually a good thing.

00:42:25.670 --> 00:42:27.590
It means I didn't mess up.

00:42:27.590 --> 00:42:31.120
So if I want to see what just
happened, there's this other command.

00:42:31.120 --> 00:42:33.430
And these commands that
start with dots, these

00:42:33.430 --> 00:42:37.690
are SQLite specific, which is
indeed a lightweight version of SQL.

00:42:37.690 --> 00:42:39.010
They're not SQL, per se.

00:42:39.010 --> 00:42:41.260
So if you're using Oracle
or something else like that,

00:42:41.260 --> 00:42:43.135
you're not going to use
these exact commands.

00:42:43.135 --> 00:42:45.160
You'll see the ones we
use in just a moment.

00:42:45.160 --> 00:42:46.660
And here's the first.

00:42:46.660 --> 00:42:51.230
When I type .schema, the schema of a
database is the design of the database.

00:42:51.230 --> 00:42:52.120
What are the tables?

00:42:52.120 --> 00:42:53.900
What are the columns and all of that?

00:42:53.900 --> 00:42:58.390
So when I type .schema, this actually
in this case shows me the create table

00:42:58.390 --> 00:43:03.700
command that was automatically drawn
for me by just doing this import line.

00:43:03.700 --> 00:43:07.450
Once I get more comfortable with SQL,
I could literally type this out myself

00:43:07.450 --> 00:43:09.560
or use some program to
generate that as well.

00:43:09.560 --> 00:43:11.290
But what it's creating for me is this.

00:43:11.290 --> 00:43:15.580
Create table if it doesn't exist,
even though it's more terse than that.

00:43:15.580 --> 00:43:18.130
I want to create a
table called favorites.

00:43:18.130 --> 00:43:20.590
And then the columns
for that table are going

00:43:20.590 --> 00:43:24.880
to be timestamp, which is going
to be text, comma, language,

00:43:24.880 --> 00:43:27.248
which is also going to
be text, comma, problem,

00:43:27.248 --> 00:43:28.540
which is also going to be text.

00:43:28.540 --> 00:43:34.330
That was just inferred very trivially by
the .import command to just figure out

00:43:34.330 --> 00:43:38.680
that, yes, just give me a three column
database table based on the Google

00:43:38.680 --> 00:43:40.390
Form.

00:43:40.390 --> 00:43:41.740
Questions on this?

00:43:41.740 --> 00:43:44.140
These are commands you run
once to get up and running.

00:43:44.140 --> 00:43:47.260
You don't run these commands
frequently, but we have them

00:43:47.260 --> 00:43:49.280
on the slide just for reference.

00:43:49.280 --> 00:43:49.780
All right.

00:43:49.780 --> 00:43:52.010
So now let's do something
a little more interesting.

00:43:52.010 --> 00:43:55.330
I'm going to clear my SQLite terminal
here, but I'm still in SQLite.

00:43:55.330 --> 00:43:58.570
I'm going to now use some of
my first SQL commands, which

00:43:58.570 --> 00:44:02.440
recall were among them select.

00:44:02.440 --> 00:44:05.522
So CRUD, C-R-U-D. The R was select.

00:44:05.522 --> 00:44:07.480
This is maybe the most
common, the most useful,

00:44:07.480 --> 00:44:10.330
the most powerful thing
to use with a SQL database

00:44:10.330 --> 00:44:13.000
selecting data to answer
questions akin to the ones we

00:44:13.000 --> 00:44:14.980
were trying to answer with Python.

00:44:14.980 --> 00:44:16.270
This is the general syntax.

00:44:16.270 --> 00:44:19.900
Any time you want to select data from a
SQL database, you literally say select.

00:44:19.900 --> 00:44:24.730
You then specify the column or columns
that you want to select data from.

00:44:24.730 --> 00:44:28.450
You literally write the word from and
then you specify the name of the table.

00:44:28.450 --> 00:44:32.030
You want to get that data
from semicolon, in this case.

00:44:32.030 --> 00:44:35.200
Everything that's in capitals
here is a SQL keyword.

00:44:35.200 --> 00:44:37.750
Strictly speaking, you don't
have to capitalize things,

00:44:37.750 --> 00:44:40.150
but we would encourage you
to do so stylistically.

00:44:40.150 --> 00:44:43.030
And especially as you're learning
and even as you're writing it,

00:44:43.030 --> 00:44:47.530
it just helps to distinguish SQL
from words you chose, like the names

00:44:47.530 --> 00:44:49.460
of the columns and the data therein.

00:44:49.460 --> 00:44:52.700
So do adopt early on this convention.

00:44:52.700 --> 00:44:56.240
So let me go back now
to my code space here.

00:44:56.240 --> 00:44:59.012
I'm running my terminal window
with SQLite3 inside of it.

00:44:59.012 --> 00:45:00.970
Suppose that I just want
to get all of the data

00:45:00.970 --> 00:45:04.720
from the favorites table, which
was automatically imported.

00:45:04.720 --> 00:45:05.480
Let's do this.

00:45:05.480 --> 00:45:06.220
Select.

00:45:06.220 --> 00:45:07.630
I want everything.

00:45:07.630 --> 00:45:13.090
Well, I can do timestamp
comma language comma problem.

00:45:13.090 --> 00:45:13.840
But you know what?

00:45:13.840 --> 00:45:15.850
Here's a convenience already.

00:45:15.850 --> 00:45:19.120
If you want everything, there's
what's called a wild card character

00:45:19.120 --> 00:45:21.460
in SQL, which is just a
star, an asterisk, which

00:45:21.460 --> 00:45:25.120
means give me every column without
my knowing even what they're called.

00:45:25.120 --> 00:45:28.400
Let me go ahead now and say
from favorites semicolon.

00:45:28.400 --> 00:45:33.010
And this is the SQL way of opening
the database, iterating over

00:45:33.010 --> 00:45:36.400
every row therein, printing
out every row therein, done.

00:45:36.400 --> 00:45:39.640
So those three steps, which was
like nine lines of Python code

00:45:39.640 --> 00:45:42.700
give or take earlier,
is now one line of SQL.

00:45:42.700 --> 00:45:43.990
I hit Enter.

00:45:43.990 --> 00:45:46.370
There is all of the data.

00:45:46.370 --> 00:45:48.130
So I see now all of the data.

00:45:48.130 --> 00:45:50.230
Just output it as a CSV here.

00:45:50.230 --> 00:45:51.940
But it's not the CSV file.

00:45:51.940 --> 00:45:53.210
It's now actually the table.

00:45:53.210 --> 00:45:54.580
And in fact, just for
good measure, let me

00:45:54.580 --> 00:45:58.000
do this, because you'll see the behavior
a little differently the next time we

00:45:58.000 --> 00:45:58.690
open the file.

00:45:58.690 --> 00:46:00.730
I've just exited out of SQLite3.

00:46:00.730 --> 00:46:03.490
I'm going to rerun it, but I'm
not going to reimport the data

00:46:03.490 --> 00:46:05.920
or do anything like that,
because my file now exists.

00:46:05.920 --> 00:46:07.790
In fact, let me take one step back.

00:46:07.790 --> 00:46:11.830
If I type ls at my Linux prompt,
there's my favorites.py from before.

00:46:11.830 --> 00:46:13.870
There's my favorites.csv from before.

00:46:13.870 --> 00:46:18.580
And here's a third file that I did
create a moment ago when I first

00:46:18.580 --> 00:46:19.690
ran SQLite3.

00:46:19.690 --> 00:46:21.130
So the data is persistent.

00:46:21.130 --> 00:46:23.050
It's not using RAM or memory.

00:46:23.050 --> 00:46:24.950
Anything I do now is saved there.

00:46:24.950 --> 00:46:27.653
So let's go ahead and rerun
SQLite3 with the same file.

00:46:27.653 --> 00:46:28.570
But I'm not going to--

00:46:28.570 --> 00:46:31.130
I don't have to reimport everything,
because the file already exists.

00:46:31.130 --> 00:46:32.800
Let me now do that same thing again.

00:46:32.800 --> 00:46:36.460
Select star from favorites
to get all of the data.

00:46:36.460 --> 00:46:39.990
And what you'll see now is the same
data, but it's a little prettier now.

00:46:39.990 --> 00:46:44.400
Because I reran it, I effectively
disabled CSV mode this time.

00:46:44.400 --> 00:46:49.590
And what I'm now seeing is the
entire contents of this database

00:46:49.590 --> 00:46:51.420
table called favorites.

00:46:51.420 --> 00:46:53.370
Now, there's nothing new
here, but you're just

00:46:53.370 --> 00:46:55.980
seeing now like an ASCII
or Unicode version of all

00:46:55.980 --> 00:46:57.988
of the same data from that database.

00:46:57.988 --> 00:47:00.030
Well, suppose I want to
get a subset of the data.

00:47:00.030 --> 00:47:01.238
Well, let me clear my screen.

00:47:01.238 --> 00:47:04.530
And just like in Linux, I can Control L
just to clean things up aesthetically.

00:47:04.530 --> 00:47:06.580
Suppose I want to get
just the languages.

00:47:06.580 --> 00:47:09.540
So I could do select
language from favorites.

00:47:09.540 --> 00:47:12.930
And this will now select not
all three columns, a.k.a.

00:47:12.930 --> 00:47:17.280
star, this will only select the language
column and all of the data therein.

00:47:17.280 --> 00:47:19.020
If I hit Enter, voila.

00:47:19.020 --> 00:47:21.150
Now I just see those there.

00:47:21.150 --> 00:47:22.990
No timestamps, no problems.

00:47:22.990 --> 00:47:25.470
It's just a slice of
the table, if you will.

00:47:25.470 --> 00:47:30.010
All right, not that interesting still
because it's just a big column of data.

00:47:30.010 --> 00:47:31.900
But now things get more interesting.

00:47:31.900 --> 00:47:35.850
It turns out in SQL that there are
functions that come with this language,

00:47:35.850 --> 00:47:37.920
just like C, just like Python.

00:47:37.920 --> 00:47:41.790
In SQL, some of the more useful
ones, some of the simpler ones,

00:47:41.790 --> 00:47:42.630
are these here.

00:47:42.630 --> 00:47:44.880
Average, count, distinct,
lower, max, min,

00:47:44.880 --> 00:47:48.150
upper, which pretty
much do what they say.

00:47:48.150 --> 00:47:50.533
And count is a particularly useful one.

00:47:50.533 --> 00:47:51.450
Let's start with that.

00:47:51.450 --> 00:47:55.740
It's a reasonable question to be asked
how many people submitted the Google

00:47:55.740 --> 00:47:58.523
Form by the time I actually
downloaded the CSV.

00:47:58.523 --> 00:48:00.190
Well, why don't we go ahead and do this?

00:48:00.190 --> 00:48:02.850
Let me go back to VS Code
here in my terminal window.

00:48:02.850 --> 00:48:06.900
Let me select not star
but the count of star.

00:48:06.900 --> 00:48:10.020
So give me the count of
the rows that are being

00:48:10.020 --> 00:48:12.660
returned from the database called--

00:48:12.660 --> 00:48:14.190
the database table called favorites.

00:48:14.190 --> 00:48:16.648
Now when I hit Enter, I'm not
going to get all of the data.

00:48:16.648 --> 00:48:18.630
I'm just going to get simply a number.

00:48:18.630 --> 00:48:21.480
430 rows came back.

00:48:21.480 --> 00:48:22.740
So that's pretty good.

00:48:22.740 --> 00:48:24.580
I now know how much data is in there.

00:48:24.580 --> 00:48:26.350
Well, what languages were in there?

00:48:26.350 --> 00:48:30.520
Well, I could do select language
from favorites just as before,

00:48:30.520 --> 00:48:33.270
but that's not that useful,
especially if I'm inheriting the data.

00:48:33.270 --> 00:48:35.770
Like I'm the analyst who's been
handed a data set by my boss

00:48:35.770 --> 00:48:37.440
and they want me to crunch some numbers.

00:48:37.440 --> 00:48:39.810
OK, I could load this into Excel.

00:48:39.810 --> 00:48:40.770
I could sort it.

00:48:40.770 --> 00:48:44.400
But you can use SQL now to answer
pretty basic questions too.

00:48:44.400 --> 00:48:48.540
If you want to select the distinct
languages in the data set because you

00:48:48.540 --> 00:48:51.000
weren't privy to the Google
Form, let me go ahead

00:48:51.000 --> 00:48:56.820
and select only the distinct
languages from the favorites table.

00:48:56.820 --> 00:49:00.330
And now I hit Enter and I get
back a much more succinct answer.

00:49:00.330 --> 00:49:02.220
Just the three languages in question.

00:49:02.220 --> 00:49:04.963
Not really that useful since
I created the Google Form,

00:49:04.963 --> 00:49:07.380
but certainly if you're
inheriting data from someone else,

00:49:07.380 --> 00:49:09.390
you've just downloaded
a data set, at least now

00:49:09.390 --> 00:49:12.190
I'm arguably wrapping my
mind around what's going on.

00:49:12.190 --> 00:49:15.178
Now, this is not necessary
for such a small data set,

00:49:15.178 --> 00:49:16.470
but I can combine these things.

00:49:16.470 --> 00:49:21.180
Select the count of the distinct
languages in this data set

00:49:21.180 --> 00:49:22.140
called favorites.

00:49:22.140 --> 00:49:25.220
And now I should get back what answer?

00:49:25.220 --> 00:49:27.378
So hopefully indeed an
answer called three.

00:49:27.378 --> 00:49:29.420
And what you're getting
back notice aesthetically

00:49:29.420 --> 00:49:32.180
too is like a mini temporary table.

00:49:32.180 --> 00:49:36.890
When I asked for just the
distinct languages, what SQL hands

00:49:36.890 --> 00:49:39.410
me back is this
temporary table in memory

00:49:39.410 --> 00:49:43.340
that has one column called
language and then three rows.

00:49:43.340 --> 00:49:45.210
Now, this is not saved anywhere.

00:49:45.210 --> 00:49:47.450
It's just executed
ephemerally like this.

00:49:47.450 --> 00:49:49.580
But that's why it's
depicted in this way.

00:49:49.580 --> 00:49:52.610
What you're getting is
subsets of your data, smaller

00:49:52.610 --> 00:49:54.410
tables containing some of your data.

00:49:54.410 --> 00:49:55.550
And same thing down here.

00:49:55.550 --> 00:49:58.400
This is a crazy long column name.

00:49:58.400 --> 00:50:00.480
You can rename it if you really want.

00:50:00.480 --> 00:50:02.900
But that's all we're seeing there.

00:50:02.900 --> 00:50:07.100
And in fact, if that's a little ugly,
we can actually alias these things.

00:50:07.100 --> 00:50:12.090
N is a common name for a variable, a
number in any programming language.

00:50:12.090 --> 00:50:15.200
So I can actually alias this
to be a column called n.

00:50:15.200 --> 00:50:16.160
Hit Enter.

00:50:16.160 --> 00:50:19.490
And now I'm getting a tiny, tiny
table whose column is called

00:50:19.490 --> 00:50:23.280
n that just has the one value there.

00:50:23.280 --> 00:50:28.500
All right, questions on these
application of these functions here?

00:50:31.950 --> 00:50:33.152
Questions, yeah.

00:50:33.152 --> 00:50:35.000
AUDIENCE: [INAUDIBLE]

00:50:35.000 --> 00:50:36.400
DAVID MALAN: Say a little louder.

00:50:36.400 --> 00:50:37.280
AUDIENCE: A-S.

00:50:37.280 --> 00:50:38.475
DAVID MALAN: Oh, A-S. As.

00:50:38.475 --> 00:50:39.350
Literally in English.

00:50:39.350 --> 00:50:42.650
So rename this column as this.

00:50:42.650 --> 00:50:45.317
Technically it creates
an alias for the column.

00:50:45.317 --> 00:50:45.900
So that's all.

00:50:45.900 --> 00:50:46.818
Yeah?

00:50:46.818 --> 00:50:49.955
AUDIENCE: [INAUDIBLE]

00:50:49.955 --> 00:50:50.830
DAVID MALAN: Exactly.

00:50:50.830 --> 00:50:53.530
Distinct will operate on whatever
you hand it in parentheses

00:50:53.530 --> 00:50:57.460
and get rid of all of the duplicates,
giving you back just the uniques.

00:50:57.460 --> 00:50:58.360
Correct.

00:50:58.360 --> 00:50:59.780
Other questions here?

00:50:59.780 --> 00:51:00.524
Yeah.

00:51:00.524 --> 00:51:02.944
AUDIENCE: [INAUDIBLE]

00:51:06.935 --> 00:51:08.060
DAVID MALAN: Good question.

00:51:08.060 --> 00:51:10.355
When you define an alias
like n, which I just did,

00:51:10.355 --> 00:51:12.230
does it become like a
variable you can reuse?

00:51:12.230 --> 00:51:16.580
Short answer, no in this case, but you
can reuse it within your same query.

00:51:16.580 --> 00:51:20.630
Even though these queries are
getting a little longer, admittedly,

00:51:20.630 --> 00:51:23.270
statements that they
are, you can actually

00:51:23.270 --> 00:51:26.090
reuse n in an even longer query.

00:51:26.090 --> 00:51:27.440
So later in your query.

00:51:27.440 --> 00:51:30.380
And we'll see a few that are
going to start to grow in length.

00:51:30.380 --> 00:51:33.050
So it's a nice way of
nicknaming things just to be

00:51:33.050 --> 00:51:35.190
a little more terse in your query.

00:51:35.190 --> 00:51:38.480
So we can transition to some of
these more sophisticated queries

00:51:38.480 --> 00:51:41.360
because it turns out there
are some other techniques we

00:51:41.360 --> 00:51:42.630
can introduce as well.

00:51:42.630 --> 00:51:44.660
Here are some other keywords in SQL.

00:51:44.660 --> 00:51:47.005
And again, even though this
is another list of things,

00:51:47.005 --> 00:51:49.130
there's only four things
fundamentally we're doing.

00:51:49.130 --> 00:51:51.600
Creating, reading,
updating, and deleting data.

00:51:51.600 --> 00:51:55.580
These are just allowing us to
fine tune how we do it exactly.

00:51:55.580 --> 00:51:59.210
So where is going to allow us to filter
data, as we'll do in just a moment.

00:51:59.210 --> 00:52:03.440
Like select data where
this conditional is true.

00:52:03.440 --> 00:52:06.260
Like is going to be an
alternative to an equal sign.

00:52:06.260 --> 00:52:11.210
So instead of looking for exactly
Scratch or exactly Python or exactly C,

00:52:11.210 --> 00:52:13.790
you can look for
something like dot dot dot

00:52:13.790 --> 00:52:16.770
and it can be a little bit of
a fuzzier match, if you will,

00:52:16.770 --> 00:52:18.020
with other characters as well.

00:52:18.020 --> 00:52:19.790
Order by is going to deal with sorting.

00:52:19.790 --> 00:52:23.360
Limit is going to just let me limit
the total number of rows that come back

00:52:23.360 --> 00:52:26.330
to 1 or 10 or finite
if I don't want to see

00:52:26.330 --> 00:52:28.375
all 400 plus rows all at
once, because I'm just

00:52:28.375 --> 00:52:29.750
trying to wrap my mind around it.

00:52:29.750 --> 00:52:32.207
And group by is best shown by example.

00:52:32.207 --> 00:52:34.290
So let's play with just a
couple of these as well.

00:52:34.290 --> 00:52:36.080
Let me go back to VS Code here.

00:52:36.080 --> 00:52:37.170
I'll clear my screen.

00:52:37.170 --> 00:52:39.530
I'm still in the same SQLite instance.

00:52:39.530 --> 00:52:44.040
And let's count how many of you liked C
without writing Python code as before.

00:52:44.040 --> 00:52:47.960
So let me go ahead and
select the count of the rows

00:52:47.960 --> 00:52:54.530
from favorites where the
language in each row equals C.

00:52:54.530 --> 00:52:58.100
And the convention in SQLite is
to use single quotes any time

00:52:58.100 --> 00:53:02.810
you're surrounding a string that's meant
to represent a literal piece of text

00:53:02.810 --> 00:53:06.720
as opposed to C, which was double
quotes, or Python, which was either.

00:53:06.720 --> 00:53:11.090
So this is selecting the count of
rows from favorites table where

00:53:11.090 --> 00:53:13.730
the language in question is C. Enter.

00:53:13.730 --> 00:53:15.410
And this gives me 98.

00:53:15.410 --> 00:53:19.190
Notice, though, if I omit that
predicate like we did before,

00:53:19.190 --> 00:53:23.850
you'll get back the total number
of rows that were in the table.

00:53:23.850 --> 00:53:26.030
So where is what's called
a predicate that just

00:53:26.030 --> 00:53:30.020
allows me to filter things just like an
if condition or the like in a language

00:53:30.020 --> 00:53:31.130
that we've seen before.

00:53:31.130 --> 00:53:34.670
You can be a little more specific
like how many people really liked

00:53:34.670 --> 00:53:38.870
C and the Mario problem specifically?

00:53:38.870 --> 00:53:40.440
Well, let's do this.

00:53:40.440 --> 00:53:45.830
Let's go ahead and do select the
number of rows from the favorites table

00:53:45.830 --> 00:53:49.640
where the language is C and.

00:53:49.640 --> 00:53:52.880
So it's still literally
the word ands and or, just

00:53:52.880 --> 00:53:57.830
like in Python but not like in
C. And problem equals Mario.

00:53:57.830 --> 00:54:01.520
So let's see if there's any fans
of both C and the Mario problem.

00:54:01.520 --> 00:54:05.638
And three of us really like those
two things together in this case.

00:54:05.638 --> 00:54:06.930
All right, what else can we do?

00:54:06.930 --> 00:54:10.280
Well, more compelling might be
to see, kind of like in Python,

00:54:10.280 --> 00:54:13.580
for each language, what
was the popularity thereof?

00:54:13.580 --> 00:54:15.560
And at the moment, we
don't really have a way

00:54:15.560 --> 00:54:17.960
of doing that except in
Python where we had the loop

00:54:17.960 --> 00:54:20.710
and we had those variables and the
dictionary that did all of that

00:54:20.710 --> 00:54:21.410
counting for us.

00:54:21.410 --> 00:54:24.710
Totally doable but tedious, especially
if your job is to analyze data.

00:54:24.710 --> 00:54:28.190
My God, even writing 15 lines of
code to answer simple questions

00:54:28.190 --> 00:54:29.390
is kind of ridiculous.

00:54:29.390 --> 00:54:31.560
SQL can do better for us.

00:54:31.560 --> 00:54:33.630
So let me go ahead and do this.

00:54:33.630 --> 00:54:38.540
Let me go ahead and select
every language and the count

00:54:38.540 --> 00:54:45.650
thereof from the favorites table
but this time group by language.

00:54:45.650 --> 00:54:48.440
So this was another one
of the keywords that we

00:54:48.440 --> 00:54:52.460
can use in this abbreviated
list of extra features of SQL.

00:54:52.460 --> 00:54:55.760
And this one takes a moment
to wrap your mind around,

00:54:55.760 --> 00:54:59.510
but this is going to give me a
two column temporary table where

00:54:59.510 --> 00:55:02.030
the first column is a
language and the second column

00:55:02.030 --> 00:55:05.330
is the count thereof from this data set.

00:55:05.330 --> 00:55:10.010
And group by language just means
that only show me Scratch once,

00:55:10.010 --> 00:55:12.350
only show me C once,
only show me Python once.

00:55:12.350 --> 00:55:16.130
That is group all of the
identical values together,

00:55:16.130 --> 00:55:19.260
but keep track of how
many of them there are.

00:55:19.260 --> 00:55:27.200
And so now if I go over to SQLite and
I hit Enter, now I have in SQL version

00:55:27.200 --> 00:55:29.720
the exact same output
that I had from Python

00:55:29.720 --> 00:55:32.120
that took me, what,
15 plus lines before.

00:55:32.120 --> 00:55:35.810
Now we're down to just one because
SQL, Structured Query Language,

00:55:35.810 --> 00:55:39.830
is all about constructing queries like
this to answer questions and get back

00:55:39.830 --> 00:55:41.152
answers quickly.

00:55:41.152 --> 00:55:42.860
If we want to clean
this up a little bit,

00:55:42.860 --> 00:55:44.960
you asked earlier about sorting order.

00:55:44.960 --> 00:55:46.200
Well, we can do that too.

00:55:46.200 --> 00:55:48.930
There's another key
phrase we can use here.

00:55:48.930 --> 00:55:55.080
We can order by the count of those
rows and then run that query here.

00:55:55.080 --> 00:55:57.480
So now unfortunately they're
from smallest to biggest,

00:55:57.480 --> 00:55:58.675
but we can reverse that.

00:55:58.675 --> 00:56:00.800
It turns out, and my query's
starting to wrap here.

00:56:00.800 --> 00:56:02.360
I'll zoom out for a moment.

00:56:02.360 --> 00:56:07.050
If you want to order by count,
the default is in ascending order,

00:56:07.050 --> 00:56:10.980
abbreviated A-S-C. If you want
to reverse the sort in SQL,

00:56:10.980 --> 00:56:13.920
instead of using reverse equals
true like we did in Python,

00:56:13.920 --> 00:56:16.380
you say D-E-S-C for descending order.

00:56:16.380 --> 00:56:19.557
And now we get almost the same
output but flipped in reverse.

00:56:19.557 --> 00:56:22.140
So it's just a lot faster to
answer questions once, of course,

00:56:22.140 --> 00:56:24.660
you get some muscle memory
and some comfort with it.

00:56:24.660 --> 00:56:26.280
Well, what else can I do?

00:56:26.280 --> 00:56:28.890
What if I just care about
the most popular language?

00:56:28.890 --> 00:56:31.140
I don't care about the second
place or the third place

00:56:31.140 --> 00:56:32.460
languages or anything else.

00:56:32.460 --> 00:56:34.830
Well, let me add one more clause here.

00:56:34.830 --> 00:56:36.750
Limit the answer to one.

00:56:36.750 --> 00:56:39.330
And no matter how many
rows should come back now,

00:56:39.330 --> 00:56:41.880
I just get the number one
language as of the data

00:56:41.880 --> 00:56:47.870
set we collected with 270 votes for it.

00:56:47.870 --> 00:56:52.060
Questions on this?

00:56:52.060 --> 00:56:55.260
Any questions here?

00:56:55.260 --> 00:56:55.800
No?

00:56:55.800 --> 00:56:58.620
Well, what if we're
starting to introduce SQL

00:56:58.620 --> 00:57:02.250
and it was kind of too late to
make it into the Google Form?

00:57:02.250 --> 00:57:04.260
So it turns out there's
syntax for this too.

00:57:04.260 --> 00:57:05.670
You can create data, of course.

00:57:05.670 --> 00:57:09.240
Not just the tables, but the data
therein and here's the typical syntax

00:57:09.240 --> 00:57:11.640
for inserting data into a SQL database.

00:57:11.640 --> 00:57:14.850
You literally say insert
into the name of the table.

00:57:14.850 --> 00:57:19.020
And then in parentheses, you specify
one or more columns for which you

00:57:19.020 --> 00:57:20.940
have values that you want to insert.

00:57:20.940 --> 00:57:22.950
This is to say you don't
have to give values

00:57:22.950 --> 00:57:25.650
for every column in the given row.

00:57:25.650 --> 00:57:27.990
If you only have answers
to some of those questions,

00:57:27.990 --> 00:57:29.940
you can enumerate them here like this.

00:57:29.940 --> 00:57:32.010
But the values you insert
are going to be these.

00:57:32.010 --> 00:57:35.500
So you literally say after
the close parenthesis values.

00:57:35.500 --> 00:57:38.430
And then in a second set of
parentheses with a same length

00:57:38.430 --> 00:57:42.370
comma separated list, you specify
what values do you want to insert.

00:57:42.370 --> 00:57:43.710
So it's a little verbose.

00:57:43.710 --> 00:57:47.520
And frankly longer term, you're going
to use Python code to automatically do

00:57:47.520 --> 00:57:50.010
these kinds of insertions, but
let's go ahead and try this.

00:57:50.010 --> 00:57:57.990
Right now if I do select distinct
language from favorites, again,

00:57:57.990 --> 00:57:58.833
we see this.

00:57:58.833 --> 00:58:00.000
Just these three candidates.

00:58:00.000 --> 00:58:01.870
But we've now taught you a bit of SQL.

00:58:01.870 --> 00:58:09.120
So let's do insert into favorites
the column called language.

00:58:09.120 --> 00:58:10.620
And you know what?

00:58:10.620 --> 00:58:12.150
I'm going to give a problem here.

00:58:12.150 --> 00:58:14.430
The values for which,
and let me zoom back out,

00:58:14.430 --> 00:58:17.910
are going to be quote unquote "SQL"
and quote, unquote "fiftyville."

00:58:17.910 --> 00:58:20.130
You'll soon see what that's all about.

00:58:20.130 --> 00:58:21.240
Semicolon.

00:58:21.240 --> 00:58:24.580
Nothing seems to happen, but
that's usually a good thing.

00:58:24.580 --> 00:58:26.580
And now if I scroll
back up in my queries,

00:58:26.580 --> 00:58:30.990
in SQLite3 you can scroll back and
forth in time to avoid retyping things,

00:58:30.990 --> 00:58:35.400
now I should see indeed four
candidate languages here.

00:58:35.400 --> 00:58:39.150
Now, suppose that you were never really
a fan of C and maybe you programmed

00:58:39.150 --> 00:58:42.990
a little bit in high school or in
the real world and you liked C++.

00:58:42.990 --> 00:58:50.010
Well, there's a whole lot of answers
for C. So select star from favorites

00:58:50.010 --> 00:58:53.430
where language equals quote unquote "C."

00:58:53.430 --> 00:58:58.020
So here's everyone who submitted the
answer for C. Let's presume that, no,

00:58:58.020 --> 00:58:59.850
they didn't really want
C, they wanted C++,

00:58:59.850 --> 00:59:01.767
which is not a language
we teach in the class.

00:59:01.767 --> 00:59:03.910
But I could also now do this.

00:59:03.910 --> 00:59:08.670
You can use the update command to set
a column or columns to different values

00:59:08.670 --> 00:59:10.210
where some condition is met.

00:59:10.210 --> 00:59:15.360
So if I do update table name set
column name equal to some value

00:59:15.360 --> 00:59:19.330
filtering it perhaps by
where some condition is true.

00:59:19.330 --> 00:59:22.890
So suppose I've changed
my mind, or you know what,

00:59:22.890 --> 00:59:31.320
let's go ahead and do update favorites
set language equal to maybe C++ where

00:59:31.320 --> 00:59:33.870
language equals C.

00:59:33.870 --> 00:59:35.910
Now, this is destructive,
so you generally

00:59:35.910 --> 00:59:38.368
don't want to do this unless
you have a backup of your data

00:59:38.368 --> 00:59:40.170
too, overriding what
people's answers are.

00:59:40.170 --> 00:59:42.753
This seems to have been successful,
because no error messages.

00:59:42.753 --> 00:59:45.270
And if I rerun the previous
select that gives me

00:59:45.270 --> 00:59:49.410
all of the favorites where language
equals C, now indeed I get none.

00:59:49.410 --> 00:59:53.127
But if I search for
C++, now I get a lot.

00:59:53.127 --> 00:59:55.710
And if I get rid of that where
clause altogether and just look

00:59:55.710 --> 01:00:00.330
at the contents of my database, now you
see that indeed C++ is comingled with

01:00:00.330 --> 01:00:01.140
all the other data.

01:00:01.140 --> 01:00:04.270
This is not what you all intended,
of course, so I can undo this.

01:00:04.270 --> 01:00:06.150
Let me go ahead and
undo what I just did.

01:00:06.150 --> 01:00:12.570
Let me set my favorite language
to C where language equals C++.

01:00:12.570 --> 01:00:14.460
But the predicate is important.

01:00:14.460 --> 01:00:16.230
This I'm not going to do.

01:00:16.230 --> 01:00:20.190
What if I accidentally omitted
this predicate, the where clause?

01:00:20.190 --> 01:00:22.545
How would that screw
things up might you think?

01:00:26.160 --> 01:00:26.970
Yeah, in the back.

01:00:26.970 --> 01:00:29.110
AUDIENCE: [INAUDIBLE]

01:00:29.110 --> 01:00:32.590
DAVID MALAN: It would set every
row's language to indeed C.

01:00:32.590 --> 01:00:34.060
And this is dangerous.

01:00:34.060 --> 01:00:37.720
And if you start googling around
for SQL mistakes or the like,

01:00:37.720 --> 01:00:41.590
people in the real world have
accidentally run commands like this.

01:00:41.590 --> 01:00:43.510
And without naming
names, a former member

01:00:43.510 --> 01:00:47.200
of our teaching staff at one point
accidentally ran a command like this

01:00:47.200 --> 01:00:51.610
and changed every student's name in
our database to Bobby I think it was.

01:00:51.610 --> 01:00:55.120
The same name for every row because
they simply forgot a predicate.

01:00:55.120 --> 01:00:58.240
So here too there's
dangers in code, and you

01:00:58.240 --> 01:01:00.730
should adopt the habit
quite quickly of always,

01:01:00.730 --> 01:01:03.730
one, backing up your data
like with CP, for instance,

01:01:03.730 --> 01:01:09.010
in Linux or any other technique or just
making sure before you hit Enter that,

01:01:09.010 --> 01:01:11.322
yes, this is indeed the
query I want to execute.

01:01:11.322 --> 01:01:13.030
And generally speaking
in the real world,

01:01:13.030 --> 01:01:14.890
there should be process
controls in place.

01:01:14.890 --> 01:01:17.770
Like the intern should not
have access to the production

01:01:17.770 --> 01:01:19.570
database, the live
database, and the like.

01:01:19.570 --> 01:01:21.860
But you have a lot of power
now with these queries.

01:01:21.860 --> 01:01:25.520
So just be all the more careful, because
very easily can you do bad things.

01:01:25.520 --> 01:01:26.660
So let me undo this.

01:01:26.660 --> 01:01:30.790
Where language equals
quote unquote "C++."

01:01:30.790 --> 01:01:31.840
And I'll zoom back out.

01:01:31.840 --> 01:01:32.470
Enter.

01:01:32.470 --> 01:01:34.480
And now I think we're back in business.

01:01:34.480 --> 01:01:36.100
C is among the answers.

01:01:36.100 --> 01:01:37.520
Yeah?

01:01:37.520 --> 01:01:40.340
AUDIENCE: [INAUDIBLE]

01:01:42.293 --> 01:01:44.460
DAVID MALAN: It's essentially
doing what at the end?

01:01:44.460 --> 01:01:45.270
AUDIENCE: [INAUDIBLE]

01:01:45.270 --> 01:01:46.680
DAVID MALAN: It's
essentially find and replace.

01:01:46.680 --> 01:01:47.180
Yes.

01:01:47.180 --> 01:01:51.060
In layperson's terms, this is find
and replace implemented with SQL.

01:01:51.060 --> 01:01:55.020
And in fact, the authors of
Microsoft Word or Google Docs

01:01:55.020 --> 01:01:58.290
might very well be using
language like this SQL

01:01:58.290 --> 01:02:01.710
when you go to the nice graphical
user friendly find and replace box.

01:02:01.710 --> 01:02:05.010
This may very well be what they're
doing underneath the hood or, of course,

01:02:05.010 --> 01:02:07.770
they could be using some
other language altogether.

01:02:07.770 --> 01:02:10.530
There's one last syntax
that's worth knowing,

01:02:10.530 --> 01:02:14.280
delete, which for better or for worse
is even more destructive whereby it

01:02:14.280 --> 01:02:16.590
allows you to delete rows from tables.

01:02:16.590 --> 01:02:20.130
It's distinct from drop, which
lets you delete tables themselves.

01:02:20.130 --> 01:02:21.630
This focuses on rows.

01:02:21.630 --> 01:02:26.250
So suppose that you
really, really didn't like,

01:02:26.250 --> 01:02:30.000
let's say, Tideman was a
little challenging if you

01:02:30.000 --> 01:02:31.840
tackled that more comfortable problem.

01:02:31.840 --> 01:02:34.800
So if you really don't want to
even think about Tideman anymore,

01:02:34.800 --> 01:02:40.080
so why don't we do delete from
favorites where problem equals, and I

01:02:40.080 --> 01:02:42.240
won't execute it for real, Tideman.

01:02:42.240 --> 01:02:45.360
This would have the effect of
deleting every row, including

01:02:45.360 --> 01:02:50.010
the language therein and the timestamp,
where the student answered Tideman.

01:02:50.010 --> 01:02:53.910
Worse than this would be this.

01:02:53.910 --> 01:02:55.125
Why might this be bad?

01:02:58.560 --> 01:03:00.930
OK, chuckling because
there's no predicate.

01:03:00.930 --> 01:03:04.960
There's no filter, which means literally
this would delete all of the data.

01:03:04.960 --> 01:03:07.740
So again, with great power here
comes great responsibility.

01:03:07.740 --> 01:03:12.390
Now, this has just been a data set of
430 rows by us dynamically created.

01:03:12.390 --> 01:03:15.240
There's, of course, some really
juicy data sets in the real world.

01:03:15.240 --> 01:03:17.990
And one website you might have
heard or an app you might have used

01:03:17.990 --> 01:03:20.460
is IMDb, the Internet Movie
Database, which wonderfully

01:03:20.460 --> 01:03:24.330
makes some of their data available
for download as CSV files

01:03:24.330 --> 01:03:27.420
or technically TSV files,
Tab Separated Values.

01:03:27.420 --> 01:03:31.830
But what we did in advance of class was
download some of that data for both TV

01:03:31.830 --> 01:03:34.170
shows in the real world and
movies in the real world.

01:03:34.170 --> 01:03:35.962
And what's wonderful
about this data set is

01:03:35.962 --> 01:03:39.630
it's not just dozens or hundreds
or even thousands of lines.

01:03:39.630 --> 01:03:44.190
There are millions of rows of
juicy data, TV shows and movies

01:03:44.190 --> 01:03:47.190
with which most folks are probably
familiar, at least with a subset.

01:03:47.190 --> 01:03:50.580
And we'll see in just a
little bit that this data

01:03:50.580 --> 01:03:54.600
comes in the form of now six
different tables that we've given you.

01:03:54.600 --> 01:03:56.610
And the tables in question
for today are going

01:03:56.610 --> 01:04:00.735
to be the people in the TV business,
the stars therein, the shows that people

01:04:00.735 --> 01:04:01.860
are producing and the like.

01:04:01.860 --> 01:04:04.860
This is a picture
we'll revisit to enable

01:04:04.860 --> 01:04:07.110
you to wrap your minds around
what the actual data is.

01:04:07.110 --> 01:04:09.948
This feels like a good
opportunity though for a snack.

01:04:09.948 --> 01:04:12.990
In fact, in just a moment, we have a
whole lot of Rice Krispie treats out

01:04:12.990 --> 01:04:13.690
in the lobby.

01:04:13.690 --> 01:04:18.630
But if folks could perhaps acknowledge
this mini wedding cake here.

01:04:18.630 --> 01:04:21.640
CS50 zone Carter Zenke is
getting married this week.

01:04:21.640 --> 01:04:28.880
So congratulations to Carter as well.

01:04:28.880 --> 01:04:30.550
Congrats.

01:04:30.550 --> 01:04:31.940
All right.

01:04:31.940 --> 01:04:34.440
[APPLAUSE]

01:04:38.115 --> 01:04:40.952
There's only one piece of cake in
that box but a lot of Rice Krispie

01:04:40.952 --> 01:04:41.910
treats in the transept.

01:04:41.910 --> 01:04:45.420
Let's take 10 minutes and we'll be back
with Internet Movie Database in 10.

01:04:45.420 --> 01:04:46.470
All right.

01:04:46.470 --> 01:04:47.320
We are back.

01:04:47.320 --> 01:04:51.000
So if you've never been, you can
actually go to imdb.com right now

01:04:51.000 --> 01:04:53.020
and play around or
download the mobile app.

01:04:53.020 --> 01:04:56.580
And it's just big database of
a lot of TV shows and movies

01:04:56.580 --> 01:04:58.090
and actors and the like.

01:04:58.090 --> 01:05:00.828
But what indeed is nice is you
can download some of that data.

01:05:00.828 --> 01:05:02.370
And that's what I've done in advance.

01:05:02.370 --> 01:05:05.160
And what we've done is
we wrote some Python code

01:05:05.160 --> 01:05:09.540
to convert some of the flat file
databases that they let you download

01:05:09.540 --> 01:05:13.890
and we converted it into a
SQL database with six tables.

01:05:13.890 --> 01:05:17.940
So not just one but six that
ultimately are these here.

01:05:17.940 --> 01:05:21.090
And let me just help you wrap your
minds around what this picture is, which

01:05:21.090 --> 01:05:25.410
is an entity relationship diagram,
which is just to say each of these boxes

01:05:25.410 --> 01:05:27.090
on the screen represents a table.

01:05:27.090 --> 01:05:31.230
And each of the arrows or edges
represents some kind of relationship

01:05:31.230 --> 01:05:32.100
across the tables.

01:05:32.100 --> 01:05:34.200
Because up until now,
the only data we had

01:05:34.200 --> 01:05:36.270
were those three columns
in the favorites table.

01:05:36.270 --> 01:05:39.130
But what gets really
useful about SQL databases,

01:05:39.130 --> 01:05:41.700
just like a Google
Spreadsheet or an Excel file,

01:05:41.700 --> 01:05:45.300
is you can have multiple sheets
or in a database multiple tables.

01:05:45.300 --> 01:05:50.010
And so what we're about to see is that
in the IMDb database for TV shows,

01:05:50.010 --> 01:05:53.820
there's going to be a dedicated table
for all the people in the TV business.

01:05:53.820 --> 01:05:57.210
There's going to be a dedicated
table for all of the TV shows

01:05:57.210 --> 01:05:59.130
that are in their
database as of right now.

01:05:59.130 --> 01:06:03.180
There's going to be a dedicated
table for writers in that industry,

01:06:03.180 --> 01:06:08.580
for the ratings of shows, for
the genres to which shows belong,

01:06:08.580 --> 01:06:09.700
comedy and the like.

01:06:09.700 --> 01:06:12.310
And then lastly, there's
going to be this table,

01:06:12.310 --> 01:06:17.700
which somehow associates people
with the TV shows that they star in

01:06:17.700 --> 01:06:19.030
and vice versa.

01:06:19.030 --> 01:06:22.290
And so let's consider first
what this looks like in code.

01:06:22.290 --> 01:06:25.110
And we'll see that it's going to
overwhelm intentionally at first,

01:06:25.110 --> 01:06:26.200
but I'm going to do this.

01:06:26.200 --> 01:06:27.690
I'm going to go back
to my terminal window.

01:06:27.690 --> 01:06:30.240
And during the break, I downloaded
from the course's website

01:06:30.240 --> 01:06:33.960
a file called shows.db, which
we made in advance for you.

01:06:33.960 --> 01:06:37.440
And if I type ls, I'll see all of
my favorites files from before.

01:06:37.440 --> 01:06:40.020
The CSV, the DB, and the Python file.

01:06:40.020 --> 01:06:42.390
But now they're shows.db.

01:06:42.390 --> 01:06:45.540
So I'm going to go ahead in my
full screen terminal window here.

01:06:45.540 --> 01:06:47.850
I'm not using actual tabs or code files.

01:06:47.850 --> 01:06:52.380
Now I'm going to run sqlite3
on the file called shows.db.

01:06:52.380 --> 01:06:54.900
And I'm just going to see
this version information here.

01:06:54.900 --> 01:06:57.120
Let me clear my screen
and run the one command

01:06:57.120 --> 01:07:01.170
I ran earlier to show us the
schema of the favorites database.

01:07:01.170 --> 01:07:03.947
Now we'll see the schema
for the shows database.

01:07:03.947 --> 01:07:06.030
And there's a lot going
on here, but let me scroll

01:07:06.030 --> 01:07:08.070
back up to the very top, the beginning.

01:07:08.070 --> 01:07:10.570
And we see this here.

01:07:10.570 --> 01:07:15.690
So when I run .schema,
we see a dump, really,

01:07:15.690 --> 01:07:20.310
of all of the SQL create table commands
that were run in order to create this

01:07:20.310 --> 01:07:21.330
database for you.

01:07:21.330 --> 01:07:24.420
And one of those tables is
called genres and another people,

01:07:24.420 --> 01:07:27.370
ratings, shows, stars, and so forth.

01:07:27.370 --> 01:07:29.340
And the columns therein,
even though it's

01:07:29.340 --> 01:07:32.670
formatted a little more prettily than
the automatically generated create

01:07:32.670 --> 01:07:37.680
table statement for favorites whereby
we have one column per line of output

01:07:37.680 --> 01:07:41.520
here in the, for instance,
people table, there's

01:07:41.520 --> 01:07:45.900
going to be an ID column, like a unique
identifier like a Harvard ID, a Yale

01:07:45.900 --> 01:07:50.730
ID or the like, a name column, a
birth year, and then some other stuff.

01:07:50.730 --> 01:07:53.760
If I scroll down to shows,
every show in the world

01:07:53.760 --> 01:07:58.410
is going to have a unique ID as well,
a title of course, the year in which it

01:07:58.410 --> 01:08:00.390
debuted, and the total
number of episodes

01:08:00.390 --> 01:08:02.370
as of the time we downloaded the data.

01:08:02.370 --> 01:08:04.080
And then what else is there?

01:08:04.080 --> 01:08:08.170
Some of these are a little
less obvious like ratings here.

01:08:08.170 --> 01:08:10.830
So ratings don't have
an ID column, but they

01:08:10.830 --> 01:08:15.038
have a show ID column and a rating like
on a five point scale or a 10 point

01:08:15.038 --> 01:08:16.830
scale or the like and
then the total number

01:08:16.830 --> 01:08:19.319
of votes that were collected
to contribute to that rating.

01:08:19.319 --> 01:08:23.680
IMDb allows people to upvote and
downvote shows and movies and the like.

01:08:23.680 --> 01:08:26.380
And then similarly is genre structured.

01:08:26.380 --> 01:08:29.290
There's a show ID and
then there's a genre,

01:08:29.290 --> 01:08:33.370
which is going to be an English word
like comedy or drama or something else.

01:08:33.370 --> 01:08:34.200
And then what else?

01:08:34.200 --> 01:08:38.609
Let's go a little further at the
bottom here for stars and writers.

01:08:38.609 --> 01:08:41.729
If we go to the very bottom
here, stars and writers

01:08:41.729 --> 01:08:43.470
are similarly structured too.

01:08:43.470 --> 01:08:46.890
They have a show ID and a person ID.

01:08:46.890 --> 01:08:48.540
So show and person.

01:08:48.540 --> 01:08:52.470
And then this writers table
has a show ID and a person ID.

01:08:52.470 --> 01:08:56.340
And there's a whole lot of other words
that we'll come to in just a moment.

01:08:56.340 --> 01:08:58.330
But what is this code hinting at?

01:08:58.330 --> 01:09:01.000
Well, if I go back to the
picture from earlier here,

01:09:01.000 --> 01:09:05.160
you'll see that this picture
captures the relationships

01:09:05.160 --> 01:09:07.020
among these various tables.

01:09:07.020 --> 01:09:11.670
So for instance, if we focus on
shows for just a moment, a show,

01:09:11.670 --> 01:09:15.359
again, has a unique ID, a title,
a year in which it debuted,

01:09:15.359 --> 01:09:16.859
and a total number of episodes.

01:09:16.859 --> 01:09:20.550
If you want to figure
out what genre or genres

01:09:20.550 --> 01:09:24.300
a show belongs to, because
some shows are just comedies,

01:09:24.300 --> 01:09:29.010
some shows are just dramas, but some
shows are arguably comedies and dramas

01:09:29.010 --> 01:09:30.700
depending on the episode or the like.

01:09:30.700 --> 01:09:34.740
So you can imagine wanting to associate
two or three or even more genres

01:09:34.740 --> 01:09:35.640
with a show.

01:09:35.640 --> 01:09:39.830
This line here in this second
table allows us to do that.

01:09:39.830 --> 01:09:46.310
Every row in the genres table
we'll see has two items, a show ID

01:09:46.310 --> 01:09:49.250
which relates to the ID of a show.

01:09:49.250 --> 01:09:54.350
And that's why these lines literally
line up with that specific column name.

01:09:54.350 --> 01:09:57.050
And a genre, which is going to
be like, quote unquote, "comedy,"

01:09:57.050 --> 01:09:59.250
quote unquote "drama,"
or something else.

01:09:59.250 --> 01:10:02.180
Now with that said, design question.

01:10:02.180 --> 01:10:07.250
Why have we deliberately not just
gotten rid of this genres table

01:10:07.250 --> 01:10:12.950
and made our lives simpler by just
adding a genre column to this show's

01:10:12.950 --> 01:10:13.640
table?

01:10:13.640 --> 01:10:17.150
And again, a table is just like
a sheet with rows and columns.

01:10:17.150 --> 01:10:21.140
At the moment, shows only have four
columns, ID, title, year, episodes.

01:10:21.140 --> 01:10:28.980
Why not just add a fifth column called
genre and put the show's genre there?

01:10:28.980 --> 01:10:32.200
Any intuition here?

01:10:32.200 --> 01:10:34.960
Why not just keep things simple?

01:10:34.960 --> 01:10:36.286
Yeah, in back.

01:10:36.286 --> 01:10:42.035
AUDIENCE: [INAUDIBLE]

01:10:42.035 --> 01:10:42.910
DAVID MALAN: Exactly.

01:10:42.910 --> 01:10:46.670
If you add a fifth column
here and call it genre,

01:10:46.670 --> 01:10:49.900
then you have to pick
a genre specifically.

01:10:49.900 --> 01:10:55.840
You have to put in that cell presumably
comedy or drama or musical or something

01:10:55.840 --> 01:10:56.480
else.

01:10:56.480 --> 01:10:58.270
Now, you could write
multiple words in the cell,

01:10:58.270 --> 01:11:00.645
but generally speaking, that
would be sloppy, bad design.

01:11:00.645 --> 01:11:02.492
Like every cell just
like in a spreadsheet

01:11:02.492 --> 01:11:03.700
should really have one value.

01:11:03.700 --> 01:11:07.270
It might have multiple words, but it
shouldn't be a weirdly comma separated

01:11:07.270 --> 01:11:08.530
list of multiple things.

01:11:08.530 --> 01:11:10.655
It should just be in a
different cell in that case.

01:11:10.655 --> 01:11:14.320
So if you instead were to design this
with just a single column called genre,

01:11:14.320 --> 01:11:16.510
you're imposing what
a computer scientist

01:11:16.510 --> 01:11:19.480
would call a one to one relationship.

01:11:19.480 --> 01:11:21.580
Every show has one genre.

01:11:21.580 --> 01:11:23.800
And that's not necessarily a good thing.

01:11:23.800 --> 01:11:26.170
Or strictly speaking, it
would be a many to one,

01:11:26.170 --> 01:11:28.510
because the same genre could
belong to multiple shows,

01:11:28.510 --> 01:11:31.760
but each show could only
have one genre in that case.

01:11:31.760 --> 01:11:33.760
What a relational
database allows you to do,

01:11:33.760 --> 01:11:35.890
and relational is indeed
the operative word,

01:11:35.890 --> 01:11:39.100
it allows you to factor out
some of your information

01:11:39.100 --> 01:11:44.590
and then have maybe one show here in
one row but then in this genres table,

01:11:44.590 --> 01:11:47.320
you could have one row
for that one show genre,

01:11:47.320 --> 01:11:52.450
or you could have two rows in the
genres table for comedy and for drama.

01:11:52.450 --> 01:11:55.490
Or if it has a third genre, you
could just add another row here.

01:11:55.490 --> 01:11:59.380
So you still have one
row for the show itself

01:11:59.380 --> 01:12:02.890
with all the juiciest details
but a variable number of rows

01:12:02.890 --> 01:12:05.950
by having this relationship
with another table.

01:12:05.950 --> 01:12:10.000
Meanwhile, ratings work the
same way, at least in this case.

01:12:10.000 --> 01:12:12.882
A show has ID, title,
year, and episodes.

01:12:12.882 --> 01:12:14.590
But if you want to
figure out its rating,

01:12:14.590 --> 01:12:17.710
you have to follow the arrow
here, so to speak, and look up

01:12:17.710 --> 01:12:20.020
the corresponding show ID in this table.

01:12:20.020 --> 01:12:23.750
Find the rating of that show
and the total number of ratings.

01:12:23.750 --> 01:12:26.470
So that's been factored out
too, for better or for worse.

01:12:26.470 --> 01:12:28.600
Now let's consider people.

01:12:28.600 --> 01:12:32.050
People have just three
columns, ID, name, and birth.

01:12:32.050 --> 01:12:35.350
But there's no mention of the TV
show in which people have starred

01:12:35.350 --> 01:12:38.140
or the TV shows that
a person has written.

01:12:38.140 --> 01:12:39.200
Well, why is that?

01:12:39.200 --> 01:12:42.730
Well, if you just had a fourth
column here called show,

01:12:42.730 --> 01:12:45.560
well, you would have to decide
what show is that person in.

01:12:45.560 --> 01:12:47.560
And no one could ever act
again in another show,

01:12:47.560 --> 01:12:49.310
because there's no
room to store the data.

01:12:49.310 --> 01:12:53.560
But if someone, of course, a popular
actor can star in multiple shows,

01:12:53.560 --> 01:12:57.790
well, we could have one ID for that
person, one name, one birth year,

01:12:57.790 --> 01:12:58.360
obviously.

01:12:58.360 --> 01:13:02.230
Like there's only one Steve Carell
as an actor in the world of people.

01:13:02.230 --> 01:13:06.610
But Steve Carell in this example
could have his person ID,

01:13:06.610 --> 01:13:09.220
whatever his Harvard ID
equivalent, Yale ID equivalent is,

01:13:09.220 --> 01:13:12.130
appear in multiple rows
in this table so that it

01:13:12.130 --> 01:13:14.470
can be associated with multiple shows.

01:13:14.470 --> 01:13:18.830
And this allows you to create what's
called a one to many relationship,

01:13:18.830 --> 01:13:20.500
or technically it's bidirectional.

01:13:20.500 --> 01:13:22.450
It's a many to many relationship.

01:13:22.450 --> 01:13:23.120
Why?

01:13:23.120 --> 01:13:26.470
Well, one show can certainly
have multiple people in it

01:13:26.470 --> 01:13:29.090
and multiple people writing
for it, just in the real world.

01:13:29.090 --> 01:13:33.910
But conversely, one person could
certainly act in multiple shows

01:13:33.910 --> 01:13:35.330
or write multiple shows.

01:13:35.330 --> 01:13:38.380
So this is what you get
with relational databases.

01:13:38.380 --> 01:13:41.440
You put your sort of canonical
data for people in one place,

01:13:41.440 --> 01:13:46.150
for shows in another place, and
then you use these additional tables

01:13:46.150 --> 01:13:48.993
to relate one thing to another.

01:13:48.993 --> 01:13:50.410
So we won't dwell on the pictures.

01:13:50.410 --> 01:13:53.350
That's just if you sort of can
wrap your mind around the data

01:13:53.350 --> 01:13:55.870
set better that way, that's
one way of thinking about it.

01:13:55.870 --> 01:14:00.177
But recall that the code we just saw for
the schema, again, escalated quickly.

01:14:00.177 --> 01:14:02.260
There's a lot of keywords
I haven't mentioned yet.

01:14:02.260 --> 01:14:04.360
But some of these are perhaps familiar.

01:14:04.360 --> 01:14:06.130
They're capitalized differently here.

01:14:06.130 --> 01:14:08.140
But integer is on the list here.

01:14:08.140 --> 01:14:10.790
Null is on the list, albeit
technically not null.

01:14:10.790 --> 01:14:12.760
So let's tease apart
some of these keywords

01:14:12.760 --> 01:14:15.218
and consider what they're
actually doing for your database,

01:14:15.218 --> 01:14:18.640
because now we're exploring
features that do not exist

01:14:18.640 --> 01:14:20.980
in the world of spreadsheets alone.

01:14:20.980 --> 01:14:24.190
So it turns out in a SQL
database, specifically

01:14:24.190 --> 01:14:27.265
SQLite which is the version
of SQL we use in CS50

01:14:27.265 --> 01:14:30.910
and which is commonly used for things
like mobile applications nowadays.

01:14:30.910 --> 01:14:32.560
It's like a lightweight version of SQL.

01:14:32.560 --> 01:14:35.590
It's when you aren't trying to
run Twitter and have billions

01:14:35.590 --> 01:14:37.150
and billions of rows necessarily.

01:14:37.150 --> 01:14:40.730
You've got hundreds, thousands, tens
of thousands, maybe even a few million,

01:14:40.730 --> 01:14:44.590
but not crazy numbers,
crazy amounts of data.

01:14:44.590 --> 01:14:49.250
In the world of SQLite specifically,
there's these five data types.

01:14:49.250 --> 01:14:52.390
So just like in C, we had
int and char and the like.

01:14:52.390 --> 01:14:54.190
In SQL, we have these.

01:14:54.190 --> 01:14:57.447
Blob, which is kind of funny, but
it just means binary large objects.

01:14:57.447 --> 01:14:58.780
So it's like a binary data type.

01:14:58.780 --> 01:15:02.830
Zeros and ones that aren't necessarily
fitting into the other categories.

01:15:02.830 --> 01:15:05.290
Integer, which of course,
is an integer as we know it.

01:15:05.290 --> 01:15:09.940
Numeric, which is kind of a catchall for
numbers that are formatted specially.

01:15:09.940 --> 01:15:14.710
So like a date would be like year,
year, year, year, dash month, month,

01:15:14.710 --> 01:15:16.390
dash day, day.

01:15:16.390 --> 01:15:18.015
And this is actually a wonderful thing.

01:15:18.015 --> 01:15:19.723
Depending on the
country you're from, you

01:15:19.723 --> 01:15:22.720
might think your date system in your
country is great or it's horrible.

01:15:22.720 --> 01:15:27.040
The US system is horrible because
we have month, day, and then

01:15:27.040 --> 01:15:29.350
year, which is impossible to sort.

01:15:29.350 --> 01:15:31.930
It is the wrong way
objectively to store data.

01:15:31.930 --> 01:15:33.803
And yet here we are using this at scale.

01:15:33.803 --> 01:15:35.470
Other countries have gotten this better.

01:15:35.470 --> 01:15:38.383
Numeric in SQL itself
standardizes that stuff.

01:15:38.383 --> 01:15:40.300
So it doesn't matter
what country you're from.

01:15:40.300 --> 01:15:43.200
You're storing your data in this
particular way for instance.

01:15:43.200 --> 01:15:46.560
Times are standardized and other
types of numeric data as well.

01:15:46.560 --> 01:15:48.630
Real is synonymous with flow.

01:15:48.630 --> 01:15:52.410
So something with a decimal point
and some number of digits thereafter.

01:15:52.410 --> 01:15:55.530
And then text is just
for strings and the like.

01:15:55.530 --> 01:16:00.240
With other even fancier databases
like MySQL, PostgreSQL, Oracle,

01:16:00.240 --> 01:16:02.370
and other products you
might have heard of,

01:16:02.370 --> 01:16:04.170
there's even more data
types where you have

01:16:04.170 --> 01:16:05.712
to make even finer grained decisions.

01:16:05.712 --> 01:16:07.950
But for SQLite, it's
indeed pretty lightweight

01:16:07.950 --> 01:16:12.120
and you or we just have to decide the
data types for each column in a table.

01:16:12.120 --> 01:16:15.000
But there's these additional
constraints in the world of SQL.

01:16:15.000 --> 01:16:21.400
You can additionally say that cells
in this column may or may not be null.

01:16:21.400 --> 01:16:24.660
So if you want to protect yourself
from yourself so you don't screw up

01:16:24.660 --> 01:16:26.880
and insert a null,
that is a blank value,

01:16:26.880 --> 01:16:31.090
you can explicitly design a table to
have a column that cannot be null.

01:16:31.090 --> 01:16:33.090
And so in fact, someone
came up during the break

01:16:33.090 --> 01:16:36.450
to ask me about my having
manually inserted SQL,

01:16:36.450 --> 01:16:38.923
quote unquote "SQL," into
our favorites database.

01:16:38.923 --> 01:16:40.590
You might recall that I kind of cheated.

01:16:40.590 --> 01:16:44.340
I just inserted "SQL" quote
unquote and "fiftyville,"

01:16:44.340 --> 01:16:46.050
the name of a new
problem, quote unquote.

01:16:46.050 --> 01:16:49.170
But what did I not
insert into the database?

01:16:49.170 --> 01:16:50.100
A timestamp.

01:16:50.100 --> 01:16:50.880
And I could have.

01:16:50.880 --> 01:16:54.240
I could have put the current day and
time a few minutes ago, but I didn't.

01:16:54.240 --> 01:16:58.590
And that's fine if it's acceptable to
you and the product you're building.

01:16:58.590 --> 01:17:00.300
But I could have prevented that.

01:17:00.300 --> 01:17:05.100
If we had defined the table to have a
timestamp column that isn't just text

01:17:05.100 --> 01:17:08.490
but it's text that's not null,
SQL would have complained

01:17:08.490 --> 01:17:10.650
and would not have let me
complete that insertion.

01:17:10.650 --> 01:17:13.567
So there's these kinds of built in
defenses that you don't necessarily

01:17:13.567 --> 01:17:15.180
get with a spreadsheet alone.

01:17:15.180 --> 01:17:17.170
And unique means exactly that.

01:17:17.170 --> 01:17:20.190
If you want to make sure
that every row in that column

01:17:20.190 --> 01:17:24.480
is unique, maybe for email addresses
or in the US Social Security

01:17:24.480 --> 01:17:26.190
numbers or anything
that you want to make

01:17:26.190 --> 01:17:30.330
sure you don't have two versions of, you
can specify that the column is unique.

01:17:30.330 --> 01:17:32.170
And there's other such
constraints as well.

01:17:32.170 --> 01:17:34.140
But again, this is
just a list of features

01:17:34.140 --> 01:17:37.170
that you get from a proper
relational database.

01:17:37.170 --> 01:17:39.990
But perhaps the most
intellectually interesting one

01:17:39.990 --> 01:17:42.420
and the most powerful
one is what's called

01:17:42.420 --> 01:17:45.000
here a primary key and a foreign key.

01:17:45.000 --> 01:17:47.640
And let me go back now to this output.

01:17:47.640 --> 01:17:51.970
If we look at shows, you'll
see that a show, again,

01:17:51.970 --> 01:17:55.320
has an ID, a title, a year,
and a number of episodes.

01:17:55.320 --> 01:17:57.120
And now the data types might make sense.

01:17:57.120 --> 01:17:59.820
The ID it turns out, just
like a Harvard ID, a Yale ID,

01:17:59.820 --> 01:18:01.150
is going to be an integer.

01:18:01.150 --> 01:18:02.220
So a simple number.

01:18:02.220 --> 01:18:04.680
The title, of course, is
going to be text but not null.

01:18:04.680 --> 01:18:07.470
It would be weird if
a TV show had no name.

01:18:07.470 --> 01:18:08.615
That can't be.

01:18:08.615 --> 01:18:10.990
The whole world would break
or your TV Guide and whatnot.

01:18:10.990 --> 01:18:13.590
So that makes sense
there to say not null.

01:18:13.590 --> 01:18:14.670
Year is numeric.

01:18:14.670 --> 01:18:15.990
So it's a standardized form.

01:18:15.990 --> 01:18:19.140
Episodes is an integer, like how
many episodes have been produced.

01:18:19.140 --> 01:18:20.790
And then lastly, notice this.

01:18:20.790 --> 01:18:24.090
The primary key of the
show's table is apparently

01:18:24.090 --> 01:18:27.880
the column called ID
mentioned a few lines earlier.

01:18:27.880 --> 01:18:32.400
This just means that the
database will use the ID

01:18:32.400 --> 01:18:34.740
column as the unique identifier.

01:18:34.740 --> 01:18:37.770
So it's similar to the unique
keyword, but primary key just

01:18:37.770 --> 01:18:40.260
means the database is going
to treat it as special too

01:18:40.260 --> 01:18:43.380
and make sure that it is
uniquely identifying your data.

01:18:43.380 --> 01:18:45.300
But what's interesting is this.

01:18:45.300 --> 01:18:48.480
Notice if I scroll back
up to people, people

01:18:48.480 --> 01:18:51.480
were sort of similarly structured
but with different attributes.

01:18:51.480 --> 01:18:56.310
Like up here we had a person
has an ID, a name, a birth

01:18:56.310 --> 01:18:58.800
year, and a primary key of ID.

01:18:58.800 --> 01:19:01.080
So a ID is, again, integer.

01:19:01.080 --> 01:19:03.600
Name is text but not
null, because it'd be

01:19:03.600 --> 01:19:06.420
weird to have a human with
absolutely no name textually.

01:19:06.420 --> 01:19:07.860
Birth is going to be numeric.

01:19:07.860 --> 01:19:11.590
But the primary key of
people is ID as well.

01:19:11.590 --> 01:19:15.150
So those are the unique columns that
the database will just treat special.

01:19:15.150 --> 01:19:15.810
Why?

01:19:15.810 --> 01:19:17.310
Well, we just looked at shows.

01:19:17.310 --> 01:19:18.900
We just looked at people.

01:19:18.900 --> 01:19:22.380
Let's focus now on this
one down here, stars.

01:19:22.380 --> 01:19:25.260
How do you determine
who stars in a TV show?

01:19:25.260 --> 01:19:26.370
Well, we had two columns.

01:19:26.370 --> 01:19:28.320
The show ID and the person ID.

01:19:28.320 --> 01:19:31.080
This is the incarnation of
a many to many relationship.

01:19:31.080 --> 01:19:32.550
One person could be in many shows.

01:19:32.550 --> 01:19:35.710
One show could certainly have many
people in it or writing for it.

01:19:35.710 --> 01:19:37.250
But notice this.

01:19:37.250 --> 01:19:41.220
Within this table of two
columns, show ID and person ID,

01:19:41.220 --> 01:19:44.070
there's what's going to be
called a foreign key called

01:19:44.070 --> 01:19:49.350
show ID that references the
show's table's ID column

01:19:49.350 --> 01:19:51.367
and then another foreign
key called person ID,

01:19:51.367 --> 01:19:53.700
though I could call these
things in parentheses anything

01:19:53.700 --> 01:19:57.495
I want, that references the
people table's ID column.

01:19:57.495 --> 01:20:00.120
Now, you're not going to often
have to type commands like this.

01:20:00.120 --> 01:20:02.700
Again, you set the database
up once in the beginning

01:20:02.700 --> 01:20:04.590
typically, maybe with
some help from a TF,

01:20:04.590 --> 01:20:06.210
maybe with help of Google or the like.

01:20:06.210 --> 01:20:09.203
But once your database is
designed, it's back to the CRUD.

01:20:09.203 --> 01:20:12.120
Create, read, update, delete, the
selects, the inserts, the deletions,

01:20:12.120 --> 01:20:12.690
and the like.

01:20:12.690 --> 01:20:14.310
But what's this implying?

01:20:14.310 --> 01:20:17.520
These keywords like
primary key and foreign key

01:20:17.520 --> 01:20:23.100
are what are doing in code what this
picture was painting a moment ago.

01:20:23.100 --> 01:20:27.570
These lines here are drawn literally to
line up with the corresponding things.

01:20:27.570 --> 01:20:30.660
People's ID lines up with person ID.

01:20:30.660 --> 01:20:34.410
Show's ID lines up with show ID.

01:20:34.410 --> 01:20:37.950
And so you're just seeing graphical
version, code version, graphical,

01:20:37.950 --> 01:20:40.860
code that creates these relationships.

01:20:40.860 --> 01:20:45.190
Now, given that, let's actually
see what these things look like.

01:20:45.190 --> 01:20:47.160
So let me go back to VS Code here.

01:20:47.160 --> 01:20:48.400
Let me clear my screen.

01:20:48.400 --> 01:20:51.000
I'm still within SQLite with shows.db.

01:20:51.000 --> 01:20:53.480
Let me go ahead and do what
I do with any new database.

01:20:53.480 --> 01:20:55.230
If I ever download
something or I'm trying

01:20:55.230 --> 01:20:57.600
to wrap my mind around
a problem, usually it

01:20:57.600 --> 01:21:01.230
doesn't come with a pretty
picture or a three hour lecture

01:21:01.230 --> 01:21:02.670
to explain what the data set is.

01:21:02.670 --> 01:21:06.190
Rather you just have the data
set in your own knowledge of SQL.

01:21:06.190 --> 01:21:07.440
So let me play around.

01:21:07.440 --> 01:21:09.608
So .schema shows me all of the tables.

01:21:09.608 --> 01:21:10.900
That might be a starting point.

01:21:10.900 --> 01:21:11.900
OK, this is interesting.

01:21:11.900 --> 01:21:13.560
I know what people are.

01:21:13.560 --> 01:21:16.530
Let's go ahead and
show me all the people.

01:21:16.530 --> 01:21:18.595
So select star from people.

01:21:18.595 --> 01:21:20.970
I'm just trying to wrap my
mind around what this data set

01:21:20.970 --> 01:21:24.330
looks like in a more user friendly way.

01:21:24.330 --> 01:21:26.607
That's already a lot of people.

01:21:26.607 --> 01:21:28.440
As you see the years
flying by, there's been

01:21:28.440 --> 01:21:31.170
a lot of people in the TV business.

01:21:31.170 --> 01:21:33.540
So this was maybe not
the best query to run.

01:21:33.540 --> 01:21:39.300
But this is indicative of just how
large this data set is from IMDb.

01:21:39.300 --> 01:21:42.960
When in doubt and whenever you
lose control over your computer,

01:21:42.960 --> 01:21:44.940
Control C is your friend to interrupt.

01:21:44.940 --> 01:21:46.860
What would have been
better, because I don't

01:21:46.860 --> 01:21:49.980
think I need to know all of the
million people in the world,

01:21:49.980 --> 01:21:52.470
I could do limit me to 10 people.

01:21:52.470 --> 01:21:55.860
And that's enough now to
get a sense of Fred Astaire

01:21:55.860 --> 01:21:58.920
has an ID of one, the first person ever.

01:21:58.920 --> 01:22:00.660
Birth year of 1899.

01:22:00.660 --> 01:22:04.740
Lauren Bacall and all of these
other people from yesteryear.

01:22:04.740 --> 01:22:07.410
You see that they are the first
10 people in the database.

01:22:07.410 --> 01:22:09.160
So there's an example
of some of the data.

01:22:09.160 --> 01:22:12.540
Now if I want to wrap my mind around
what a show is, I know it technically.

01:22:12.540 --> 01:22:13.800
I know it from the picture.

01:22:13.800 --> 01:22:15.400
But let's just look at some raw data.

01:22:15.400 --> 01:22:18.210
So instead of saying select star
from people, let me go ahead

01:22:18.210 --> 01:22:21.480
and select star from shows limit 10.

01:22:21.480 --> 01:22:26.250
And OK, I've only heard of
or seen a couple of these,

01:22:26.250 --> 01:22:27.970
but these are older shows at that.

01:22:27.970 --> 01:22:31.530
But I see that every show has an ID,
a title, a year in which it debuted,

01:22:31.530 --> 01:22:32.910
and a number of episodes.

01:22:32.910 --> 01:22:35.340
But perhaps most opaque
is going to be this.

01:22:35.340 --> 01:22:41.520
Select star from stars where this
is the table that associates people

01:22:41.520 --> 01:22:42.330
with shows.

01:22:42.330 --> 01:22:46.950
Am I going to see any
names or show titles here?

01:22:46.950 --> 01:22:49.127
Not according to the
definition we saw earlier.

01:22:49.127 --> 01:22:50.460
Oh, I should have done my limit.

01:22:50.460 --> 01:22:51.660
Let me interrupt that.

01:22:51.660 --> 01:22:52.620
Let me do that again.

01:22:52.620 --> 01:22:54.180
Limit 10.

01:22:54.180 --> 01:22:55.050
No.

01:22:55.050 --> 01:22:58.380
And this is where now you're
definitely in the programmer world,

01:22:58.380 --> 01:23:01.780
because this would be the most annoying
spreadsheet to use on your Mac or PC

01:23:01.780 --> 01:23:02.280
ever.

01:23:02.280 --> 01:23:04.528
If you just had a sheet
with all of these numbers

01:23:04.528 --> 01:23:06.570
that associates one thing
with the other, my God,

01:23:06.570 --> 01:23:08.910
how do you figure out who
this is or what this is?

01:23:08.910 --> 01:23:12.780
You have to manually Control F or
Command F looking for the data.

01:23:12.780 --> 01:23:14.310
But a database doesn't care.

01:23:14.310 --> 01:23:18.160
Once you know SQL, you can stitch
these things back together.

01:23:18.160 --> 01:23:21.360
So what you're seeing
here are foreign keys.

01:23:21.360 --> 01:23:22.150
Foreign keys.

01:23:22.150 --> 01:23:22.650
Why?

01:23:22.650 --> 01:23:27.600
Because show ID corresponds to the
same numbers from that other table

01:23:27.600 --> 01:23:30.990
called shows that has a
proper primary key called ID.

01:23:30.990 --> 01:23:33.360
Person ID is a foreign
key in this context,

01:23:33.360 --> 01:23:38.620
because it refers to numbers that belong
to really the people table and its ID

01:23:38.620 --> 01:23:39.120
column.

01:23:39.120 --> 01:23:41.037
So this is just a way
of somehow linking them.

01:23:41.037 --> 01:23:44.100
And so if you think of I always think
of this in my mind's eye as this.

01:23:44.100 --> 01:23:47.760
If this is the people table,
this is the shows table,

01:23:47.760 --> 01:23:51.690
and there's this middle table
in between, the stars table.

01:23:51.690 --> 01:23:56.490
There's some way of stitching those two
together by lining up the IDs of one

01:23:56.490 --> 01:23:58.870
with the other and getting
back some more data.

01:23:58.870 --> 01:24:01.690
So let's actually play
with some of this data.

01:24:01.690 --> 01:24:05.148
How about we start where we
emphasized earlier, genres.

01:24:05.148 --> 01:24:08.440
So let me go ahead and take a quick look
at all of the genres in this database.

01:24:08.440 --> 01:24:10.500
So select star from genres.

01:24:10.500 --> 01:24:13.030
Star is usually going to
be a little overwhelming,

01:24:13.030 --> 01:24:15.240
but it just gives me a
sense of what the data is.

01:24:15.240 --> 01:24:17.850
But let's actually look at--

01:24:17.850 --> 01:24:19.440
let's go look at all of them there.

01:24:19.440 --> 01:24:20.850
OK, that's a lot.

01:24:20.850 --> 01:24:24.210
These are all official genres from IMDb.

01:24:24.210 --> 01:24:27.030
OK, it wasn't terribly long.

01:24:27.030 --> 01:24:28.350
Let me filter that down.

01:24:28.350 --> 01:24:34.800
So from genres where genre equals
Comedy, capital C just based

01:24:34.800 --> 01:24:36.000
on the data I'm seeing.

01:24:36.000 --> 01:24:38.220
OK, so what am I seeing now?

01:24:38.220 --> 01:24:40.838
And in fact, let me limit
this arbitrarily to 10,

01:24:40.838 --> 01:24:42.630
though I could limit
it to anything I want.

01:24:42.630 --> 01:24:45.060
Here are 10 comedies.

01:24:45.060 --> 01:24:46.080
What are they?

01:24:46.080 --> 01:24:47.490
Well, who the heck knows?

01:24:47.490 --> 01:24:49.590
All I know are the 10 show IDs.

01:24:49.590 --> 01:24:51.180
Now, I could do something like this.

01:24:51.180 --> 01:24:54.037
As we've seen before with SQL,
I could do, all right, well

01:24:54.037 --> 01:24:55.620
let's figure out what this show ID is.

01:24:55.620 --> 01:25:03.750
Select star from shows where the ID of
the show I'm looking for equals what?

01:25:03.750 --> 01:25:06.910
62614 semicolon.

01:25:06.910 --> 01:25:11.200
So I could manually look it up by
cross referencing the other table.

01:25:11.200 --> 01:25:14.860
So that was the show in question there,
the first comedy in the data set.

01:25:14.860 --> 01:25:16.200
Let me look up the second one.

01:25:16.200 --> 01:25:20.610
So instead of that,
let's do 63881 Enter.

01:25:20.610 --> 01:25:21.150
OK.

01:25:21.150 --> 01:25:22.410
So that's that show.

01:25:22.410 --> 01:25:23.640
And let's do one more.

01:25:23.640 --> 01:25:27.210
And suffice it to say, this is
just getting tedious and vulnerable

01:25:27.210 --> 01:25:29.520
to mistakes quickly.

01:25:29.520 --> 01:25:31.710
This surely can't be the way to do this.

01:25:31.710 --> 01:25:35.280
And indeed, SQL is going to let us do
this a little more powerfully instead.

01:25:35.280 --> 01:25:36.790
Let's do this.

01:25:36.790 --> 01:25:39.210
Instead of getting
this table temporarily

01:25:39.210 --> 01:25:42.640
with all these show IDs and all
these genres, let's refine the query.

01:25:42.640 --> 01:25:49.920
So let's just select the show ID from
the genres table where the genre equals

01:25:49.920 --> 01:25:51.810
quote unquote "comedy."

01:25:51.810 --> 01:25:56.670
Now I have a big list of show
IDs, all of which are comedy.

01:25:56.670 --> 01:25:57.205
How many?

01:25:57.205 --> 01:25:58.830
Well, I can combine ideas from earlier.

01:25:58.830 --> 01:26:03.582
I can just count all of those show IDs
or star if I want to just do that too.

01:26:03.582 --> 01:26:05.040
But I can count all those show IDs.

01:26:05.040 --> 01:26:09.880
48,706 comedies on IMDb's
database for TV shows.

01:26:09.880 --> 01:26:11.140
So feels like a lot.

01:26:11.140 --> 01:26:14.010
But how can I now use that
information and get back

01:26:14.010 --> 01:26:18.100
the titles of comedies in the
database without doing it manually?

01:26:18.100 --> 01:26:19.720
Well, let's do this.

01:26:19.720 --> 01:26:21.270
I have a moment ago this query.

01:26:21.270 --> 01:26:25.710
Select the show ID from
genres where the current genre

01:26:25.710 --> 01:26:29.295
is quote unquote "comedy."

01:26:29.295 --> 01:26:31.920
What if I kind of nest these
queries, kind of like grade school

01:26:31.920 --> 01:26:33.030
math in parentheses?

01:26:33.030 --> 01:26:35.940
What if I combine this
whole thing in parentheses?

01:26:35.940 --> 01:26:38.040
And now let me select
what I really want.

01:26:38.040 --> 01:26:43.950
Let me go ahead and select
how about the title of all

01:26:43.950 --> 01:26:52.420
shows where the idea of the show
is in this list of show IDs.

01:26:52.420 --> 01:26:59.440
So if you agree that the shows table
has an ID column, which is otherwise

01:26:59.440 --> 01:27:02.170
known as its primary key, the
unique ID that identifies it, just

01:27:02.170 --> 01:27:06.490
like our Harvard IDs, our Yale IDs,
and you agree that per a moment ago

01:27:06.490 --> 01:27:09.580
this shorter query will
give me back just the show

01:27:09.580 --> 01:27:13.840
IDs of all of the
comedies in the database,

01:27:13.840 --> 01:27:17.710
you can actually combine or
nest these queries together.

01:27:17.710 --> 01:27:19.843
It's going to respect
SQLite order of operations

01:27:19.843 --> 01:27:21.760
with parentheses, just
like grade school math.

01:27:21.760 --> 01:27:24.070
So the thing in parentheses
will be executed first.

01:27:24.070 --> 01:27:27.340
That gives it back a list
of IDs, like 48,000 IDs.

01:27:27.340 --> 01:27:30.160
And then this query, the
outer query, is going

01:27:30.160 --> 01:27:33.790
to get the title from all of the
shows where the ID of the show

01:27:33.790 --> 01:27:36.280
is in that big list of 48,000.

01:27:36.280 --> 01:27:39.233
So if I now execute
these together, I think

01:27:39.233 --> 01:27:42.400
the list is still going to be a little
long, but let me execute it together.

01:27:42.400 --> 01:27:45.880
Now I see this long list of outputs.

01:27:45.880 --> 01:27:46.900
A little overwhelming.

01:27:46.900 --> 01:27:50.650
Let's go ahead and maybe
limit it to just 10

01:27:50.650 --> 01:27:52.460
as before for discussion's sake.

01:27:52.460 --> 01:27:56.110
And now I see 10 comedies
ordered arbitrarily

01:27:56.110 --> 01:27:58.330
from however they're in
the database that happen

01:27:58.330 --> 01:28:01.420
to indeed have comedy as their genre.

01:28:01.420 --> 01:28:04.310
If I want to do this a little
more cleanly, I could do this.

01:28:04.310 --> 01:28:05.260
Let's see.

01:28:05.260 --> 01:28:09.640
Why don't I order by title ascending
order, which is alphabetically,

01:28:09.640 --> 01:28:12.190
or the default is also an ascending.

01:28:12.190 --> 01:28:13.240
Limit 10.

01:28:13.240 --> 01:28:18.250
Now I see the top 10, I mean, weirdly
named things with hash symbols

01:28:18.250 --> 01:28:22.360
presumably to get their titles up to the
beginning or maybe these are hashtags.

01:28:22.360 --> 01:28:27.550
Here now we have alphabetically the
first 10 shows that are comedies.

01:28:27.550 --> 01:28:31.705
Any questions on these kinds of queries?

01:28:31.705 --> 01:28:33.580
It's kind of a lot, but
at the same time it's

01:28:33.580 --> 01:28:37.540
just like composing the smaller
ideas from before into slightly more

01:28:37.540 --> 01:28:39.820
useful queries.

01:28:39.820 --> 01:28:40.360
Yeah.

01:28:40.360 --> 01:28:43.348
AUDIENCE: [INAUDIBLE]

01:28:46.343 --> 01:28:48.760
DAVID MALAN: Do foreign keys
have to set the relationship?

01:28:48.760 --> 01:28:51.850
When you create the table,
the programmer or the database

01:28:51.850 --> 01:28:54.280
administrator would
create that relationship

01:28:54.280 --> 01:28:57.700
by using those keywords primary
key and foreign key that

01:28:57.700 --> 01:29:01.330
teaches the database what is
related to what per the picture.

01:29:01.330 --> 01:29:02.530
So you do that once.

01:29:02.530 --> 01:29:07.390
And now I being the sort of programmer
who's familiar with the database,

01:29:07.390 --> 01:29:13.750
I am just using these foreign keys in
a manner consistent with their design.

01:29:13.750 --> 01:29:15.680
And this is where it's
useful at some point,

01:29:15.680 --> 01:29:18.940
even if no one hands you a picture, to
make sure you understand the database,

01:29:18.940 --> 01:29:22.180
because that's going to inform
literally what you type in SQL

01:29:22.180 --> 01:29:23.740
to get the data you care about.

01:29:23.740 --> 01:29:26.060
Well, let's do something
a little more precise.

01:29:26.060 --> 01:29:27.860
How about-- very reasonable question.

01:29:27.860 --> 01:29:30.970
And honestly, this is exactly
what imdb.com and the app or for.

01:29:30.970 --> 01:29:34.600
What if you want to find all of
the shows that Steve Carell is in?

01:29:34.600 --> 01:29:36.160
Kind of a reasonable query.

01:29:36.160 --> 01:29:40.150
Literally something someone might type
into Google or more specifically IMDb.

01:29:40.150 --> 01:29:42.130
It's not really obvious
at first glance how

01:29:42.130 --> 01:29:44.795
to do that, though,
because from my database,

01:29:44.795 --> 01:29:47.170
if these are my six tables,
well, I can pretty easily get

01:29:47.170 --> 01:29:48.460
Steve Carell from here.

01:29:48.460 --> 01:29:51.105
But I can really only get his
ID number, whatever that is,

01:29:51.105 --> 01:29:53.230
his name, which I know
already, and his birth year.

01:29:53.230 --> 01:29:56.200
OK, interesting but has nothing
to do with the shows that he's in.

01:29:56.200 --> 01:30:00.850
I can look at shows over here, but
there's no mention of Steve Carell

01:30:00.850 --> 01:30:02.710
because there's no person ID here.

01:30:02.710 --> 01:30:04.660
Where is that relationship implemented?

01:30:04.660 --> 01:30:06.590
Well, it's implemented down here.

01:30:06.590 --> 01:30:07.910
So how do we do this?

01:30:07.910 --> 01:30:10.210
Well, here's the perfect
example of a lesson

01:30:10.210 --> 01:30:13.450
we've been trying to emphasize for
weeks of taking these baby steps.

01:30:13.450 --> 01:30:17.750
Break larger problems down into smaller
ones and let's do something like this.

01:30:17.750 --> 01:30:20.740
Let's just get everything I know
about Steve Carell from the database.

01:30:20.740 --> 01:30:24.190
Let's select star from people
where the name of the person

01:30:24.190 --> 01:30:26.890
is quote unquote "Steve Carell."

01:30:26.890 --> 01:30:28.630
I just want to see what data we've got.

01:30:28.630 --> 01:30:30.640
And here's what we have.

01:30:30.640 --> 01:30:33.130
There's only one Steve
Carell born in 1962

01:30:33.130 --> 01:30:37.150
and his unique ID is
136797 according to IMDb.

01:30:37.150 --> 01:30:40.930
This isn't some global
actor identifier, per se.

01:30:40.930 --> 01:30:45.010
All right, well how do I get now all
of the shows that Steve Carell is in?

01:30:45.010 --> 01:30:46.130
Well, I could do this.

01:30:46.130 --> 01:30:51.190
Select star from stars,
not to confuse the two.

01:30:51.190 --> 01:30:53.500
One's the symbol, one's the table name.

01:30:53.500 --> 01:30:58.690
Where person ID equals 136797.

01:30:58.690 --> 01:31:03.070
So I think this will now give me
everything from the stars table

01:31:03.070 --> 01:31:04.943
that relates to Steve Carell.

01:31:04.943 --> 01:31:07.360
And you'll see person ID is
the same because I'm literally

01:31:07.360 --> 01:31:08.693
searching for just Steve Carell.

01:31:08.693 --> 01:31:12.200
But there are like 20 or
so shows that he's been in.

01:31:12.200 --> 01:31:14.450
All right, well here's where
things would get tedious.

01:31:14.450 --> 01:31:15.340
What are those shows?

01:31:15.340 --> 01:31:21.460
Well, I could do select title from
shows where the ID of the show equals.

01:31:21.460 --> 01:31:25.670
And here's whenever you copy paste,
you're probably doing something wrong.

01:31:25.670 --> 01:31:27.890
OK, he was in The Dana Carvey Show.

01:31:27.890 --> 01:31:28.715
Familiar with that.

01:31:28.715 --> 01:31:29.590
Let's do another one.

01:31:29.590 --> 01:31:31.460
We'll copy paste this.

01:31:31.460 --> 01:31:34.090
Where ID equals this.

01:31:34.090 --> 01:31:34.890
Over The Top.

01:31:34.890 --> 01:31:35.390
Another.

01:31:35.390 --> 01:31:37.000
And if we keep digging, we'll
probably find The Office.

01:31:37.000 --> 01:31:40.240
But my God, that's going to take
forever to do 20 queries manually.

01:31:40.240 --> 01:31:41.252
It's not very dynamic.

01:31:41.252 --> 01:31:43.960
But what if we just nest these
queries a little more dynamically?

01:31:43.960 --> 01:31:45.830
So let me start from
the beginning again.

01:31:45.830 --> 01:31:48.890
What if we go ahead
and select everything

01:31:48.890 --> 01:31:56.410
we know about people whose
name equals Steve Carell.

01:31:56.410 --> 01:31:58.220
That gave us earlier this data.

01:31:58.220 --> 01:31:59.470
I don't need all of that data.

01:31:59.470 --> 01:32:00.100
I know his name.

01:32:00.100 --> 01:32:01.517
I don't care about his birth year.

01:32:01.517 --> 01:32:05.140
So let's change this to just be
give me the ID of Steve Carell.

01:32:05.140 --> 01:32:08.390
And that gives me back now this
smaller temporary data set.

01:32:08.390 --> 01:32:08.890
All right.

01:32:08.890 --> 01:32:12.195
Can I now use this
inside of another query?

01:32:12.195 --> 01:32:14.320
Well, let me wrap the whole
thing with parentheses.

01:32:14.320 --> 01:32:18.850
And now let me say select
star from the stars table

01:32:18.850 --> 01:32:22.477
where the person ID equals this.

01:32:22.477 --> 01:32:24.310
So I'm deliberately not
using in because I'm

01:32:24.310 --> 01:32:26.780
assuming there's indeed only
one Steve Carell in the world.

01:32:26.780 --> 01:32:28.480
So I'm not getting back
a list of Steve Carells.

01:32:28.480 --> 01:32:30.640
I'm getting back the one
and only in this case.

01:32:30.640 --> 01:32:31.810
So equal is fine.

01:32:31.810 --> 01:32:33.310
In is when you have multiple.

01:32:33.310 --> 01:32:34.840
Equal is when you have one.

01:32:34.840 --> 01:32:36.375
Let me go ahead and hit Enter now.

01:32:36.375 --> 01:32:37.750
OK, that's more data than I need.

01:32:37.750 --> 01:32:40.430
I don't need like 20 copies
of Steve Carell's person ID.

01:32:40.430 --> 01:32:41.380
So let me hit up.

01:32:41.380 --> 01:32:46.420
Let me go back and let me just
get show ID from Steve Carell.

01:32:46.420 --> 01:32:52.070
And now I have a list of just the 20
or so show IDs that he has been in.

01:32:52.070 --> 01:32:52.570
All right.

01:32:52.570 --> 01:32:53.597
How can I now use this?

01:32:53.597 --> 01:32:54.430
Well, let me hit up.

01:32:54.430 --> 01:32:57.040
Let me put the whole
thing in parentheses.

01:32:57.040 --> 01:32:58.960
And now let me select
what I really want.

01:32:58.960 --> 01:33:03.340
Select title from shows where.

01:33:03.340 --> 01:33:05.020
And here's the final flourish.

01:33:05.020 --> 01:33:11.350
The shows table has an ID, has a
title, has a year, and has an episode.

01:33:11.350 --> 01:33:21.610
And what I really want, though, is to
check which shows have ID that is what?

01:33:21.610 --> 01:33:24.510
Anyone want to finish the thought?

01:33:24.510 --> 01:33:25.860
I just want to-- yeah.

01:33:25.860 --> 01:33:28.435
AUDIENCE: [INAUDIBLE]

01:33:28.435 --> 01:33:29.310
DAVID MALAN: Exactly.

01:33:29.310 --> 01:33:30.570
ID in this.

01:33:30.570 --> 01:33:31.690
And this is getting ugly.

01:33:31.690 --> 01:33:33.810
And when you actually write
your queries in a text file,

01:33:33.810 --> 01:33:35.730
you can format them
nicely and indent them.

01:33:35.730 --> 01:33:36.772
My font is just getting--

01:33:36.772 --> 01:33:38.938
I don't want to make it too
small to fit everything.

01:33:38.938 --> 01:33:40.230
But now we have three queries.

01:33:40.230 --> 01:33:43.380
One is in doubly nested parentheses,
then there's the middle one,

01:33:43.380 --> 01:33:44.550
then there's the outer one.

01:33:44.550 --> 01:33:47.310
So this last query is going
to get me the title from shows

01:33:47.310 --> 01:33:50.460
where the ID of the show
is in this big list of 20

01:33:50.460 --> 01:33:52.410
or so show IDs that Steve Carell is in.

01:33:52.410 --> 01:33:54.812
And I knew that because I
looked up his name here.

01:33:54.812 --> 01:33:57.270
And notice what I did not do
this time is I didn't manually

01:33:57.270 --> 01:33:58.620
hardcode his ID number.

01:33:58.620 --> 01:33:59.430
There's no need.

01:33:59.430 --> 01:34:02.508
That would be kind of a bad way
to implement a website if you're

01:34:02.508 --> 01:34:04.050
using a database underneath the hood.

01:34:04.050 --> 01:34:07.470
You want the IMDb for real
to search for whatever

01:34:07.470 --> 01:34:11.010
the human typed in and no one's going
to know Steve Carell's person ID

01:34:11.010 --> 01:34:12.070
or anything else.

01:34:12.070 --> 01:34:14.580
So here we've done this all dynamically.

01:34:14.580 --> 01:34:18.690
And now if I hit Enter, I
think I get all of his shows.

01:34:18.690 --> 01:34:21.840
Let's go ahead and order this
by title just to make it tidy.

01:34:21.840 --> 01:34:24.810
And you probably will see at
least one or more shows that.

01:34:24.810 --> 01:34:29.520
And probably the most popular
is, dot dot dot, The Office.

01:34:29.520 --> 01:34:32.280
So this is literally
the kind of query that's

01:34:32.280 --> 01:34:37.200
being executed underneath the hood when
you go to websites or apps like IMDb.

01:34:37.200 --> 01:34:42.030
Your textual query is probably being
plugged into a longer SQL query

01:34:42.030 --> 01:34:44.850
like this where some
programmer at IMDb probably

01:34:44.850 --> 01:34:48.420
wrote this whole query in
advance weeks, months, years ago

01:34:48.420 --> 01:34:52.470
and they're just somehow plugging
in the value that you the human

01:34:52.470 --> 01:34:55.550
typed into the search box or the like.

01:34:55.550 --> 01:35:03.010
Questions now on finding
this data or any other?

01:35:03.010 --> 01:35:03.890
No?

01:35:03.890 --> 01:35:04.390
OK.

01:35:04.390 --> 01:35:06.980
So where else could we go with this?

01:35:06.980 --> 01:35:12.700
Well, let's consider how
else we might combine data.

01:35:12.700 --> 01:35:16.060
Suppose that the next question
actually perhaps appropriately

01:35:16.060 --> 01:35:19.780
would be focusing in on not just
people and shows and these stars,

01:35:19.780 --> 01:35:23.680
but how do we gather more information
about the shows themselves,

01:35:23.680 --> 01:35:25.430
like the genres, the
ratings, or the like.

01:35:25.430 --> 01:35:28.030
So indeed, let's focus on
just these two tables here.

01:35:28.030 --> 01:35:31.640
Recall that every show has an ID,
a title, a year, and episodes.

01:35:31.640 --> 01:35:34.660
But it also might have one or
more relationships with rows

01:35:34.660 --> 01:35:36.280
and this other table called genres.

01:35:36.280 --> 01:35:39.250
And this is so that a show can
be a comedy, can be a drama,

01:35:39.250 --> 01:35:40.960
can be any number of other things.

01:35:40.960 --> 01:35:42.400
One row per.

01:35:42.400 --> 01:35:45.040
So you would see the same
show ID again and again

01:35:45.040 --> 01:35:48.760
and again with a different
genre written in English

01:35:48.760 --> 01:35:51.070
like comedy, drama, or the like.

01:35:51.070 --> 01:35:53.890
Well, how do I kind of
reconstitute that data?

01:35:53.890 --> 01:35:56.710
Well, turns out there's a few
different ways to do this.

01:35:56.710 --> 01:36:00.610
And let me propose that we
introduce this keyword here, join.

01:36:00.610 --> 01:36:05.525
And this is really the most powerful
of the keywords in SQL itself.

01:36:05.525 --> 01:36:06.650
It doesn't have to be used.

01:36:06.650 --> 01:36:09.310
We've seen with nested
queries that you can still

01:36:09.310 --> 01:36:13.490
select data across multiple
tables, but here is another way.

01:36:13.490 --> 01:36:14.660
So let me do this.

01:36:14.660 --> 01:36:17.710
Let me go back to my SQLite database.

01:36:17.710 --> 01:36:23.680
And let me select sort of in one
breath exactly the data I want.

01:36:23.680 --> 01:36:26.980
Select star from shows.

01:36:26.980 --> 01:36:30.500
And let's just limit this initially
to 10 to see what it looks like.

01:36:30.500 --> 01:36:31.000
All right.

01:36:31.000 --> 01:36:32.250
That's, again, the shows data.

01:36:32.250 --> 01:36:34.690
Select star from genres.

01:36:34.690 --> 01:36:37.760
Let's limit that to 10 too, just
to wrap our minds around it.

01:36:37.760 --> 01:36:39.730
And now this is not that useful.

01:36:39.730 --> 01:36:43.510
However, the data in
the leftmost column here

01:36:43.510 --> 01:36:45.490
is the primary key in the shows table.

01:36:45.490 --> 01:36:46.870
These are just unique IDs.

01:36:46.870 --> 01:36:51.530
The data here in the genres table,
recall, show ID is the foreign key.

01:36:51.530 --> 01:36:54.310
So it's the same numbers but
just copied into another table

01:36:54.310 --> 01:36:56.650
so that we can have this
relationship across them.

01:36:56.650 --> 01:37:00.070
How do I kind of line up these
numbers with these numbers

01:37:00.070 --> 01:37:06.430
to get back a wider table that has title
and year and episodes and genre and,

01:37:06.430 --> 01:37:08.950
heck, ratings and all
of that too if we want?

01:37:08.950 --> 01:37:11.230
Well, you can join these
tables by just telling

01:37:11.230 --> 01:37:12.890
the database what to join on what.

01:37:12.890 --> 01:37:13.940
So let me do this.

01:37:13.940 --> 01:37:17.110
Select star from shows.

01:37:17.110 --> 01:37:21.760
Join that table though
on the genres table.

01:37:21.760 --> 01:37:24.160
Well, how do you want to
join those two tables?

01:37:24.160 --> 01:37:26.860
And again, the two tables from
the picture looked like this.

01:37:26.860 --> 01:37:30.400
How do you tell SQL
programmatically to put one of them

01:37:30.400 --> 01:37:32.420
right next to the other,
line up all of the ID

01:37:32.420 --> 01:37:34.820
so that you just get
one larger data set?

01:37:34.820 --> 01:37:39.160
Well, we can use indeed
this syntax called join.

01:37:39.160 --> 01:37:41.140
So back to VS Code here.

01:37:41.140 --> 01:37:44.380
And let me join these two tables.

01:37:44.380 --> 01:37:45.970
Sorry, typo here.

01:37:45.970 --> 01:37:51.940
Join genres on the shows
table's ID column, a.k.a.

01:37:51.940 --> 01:37:58.240
its primary key, equaling the genres
table's show ID column, a.k.a.

01:37:58.240 --> 01:37:59.210
the foreign key.

01:37:59.210 --> 01:38:01.090
So in other words, it
looks a little cryptic,

01:38:01.090 --> 01:38:05.080
but I'm just telling SQL how to line
up these two tables and what column

01:38:05.080 --> 01:38:08.230
to match with the other so
that the numbers line up

01:38:08.230 --> 01:38:10.610
and I get essentially a wider table.

01:38:10.610 --> 01:38:14.930
Let me go ahead and hit
semicolon and Enter.

01:38:14.930 --> 01:38:17.330
And this is now going to
give me a lot of data.

01:38:17.330 --> 01:38:18.600
We might have to interrupt it.

01:38:18.600 --> 01:38:22.650
But notice even at a glance, we're
getting the ID, the title, the year,

01:38:22.650 --> 01:38:25.130
the number of episodes,
the ID again redundantly,

01:38:25.130 --> 01:38:27.140
but that's to be expected
if I'm joining them,

01:38:27.140 --> 01:38:29.360
and the genre all the way on the right.

01:38:29.360 --> 01:38:31.740
Let me hit Control C to interrupt.

01:38:31.740 --> 01:38:33.890
Let me just limit this to The Office.

01:38:33.890 --> 01:38:36.500
So where title equals
quote unquote "The Office"

01:38:36.500 --> 01:38:39.680
so we can focus on just one sample data.

01:38:39.680 --> 01:38:43.490
And here, fun fact, there's
been more than one Office.

01:38:43.490 --> 01:38:46.100
The one that you all
probably like is this one

01:38:46.100 --> 01:38:49.250
that started in 2005 with 188 episodes.

01:38:49.250 --> 01:38:53.060
Its ID in the shows table is 386676.

01:38:53.060 --> 01:38:54.710
That's confirmed over here too.

01:38:54.710 --> 01:38:56.850
So again, we've just
joined the two tables.

01:38:56.850 --> 01:38:57.350
How?

01:38:57.350 --> 01:38:58.730
By lining up those fields.

01:38:58.730 --> 01:39:01.490
But now that we can see that
almost all of The Offices

01:39:01.490 --> 01:39:04.700
produced over the decades are
comedies except for this one.

01:39:04.700 --> 01:39:07.460
There was a version of The
Office produced in 2001

01:39:07.460 --> 01:39:10.160
that was considered more of a drama.

01:39:10.160 --> 01:39:11.940
Unsure if it's related to the other.

01:39:11.940 --> 01:39:13.640
How can we link in other data?

01:39:13.640 --> 01:39:16.920
Well, let's go ahead and link
in ratings too or instead.

01:39:16.920 --> 01:39:21.290
So instead of joining this with
genres, let me go ahead and rewind here

01:39:21.290 --> 01:39:28.550
and join shows on ratings on
shows.id equals ratings.show_ID.

01:39:28.550 --> 01:39:32.630
And let's limit it to The Office too
for discussion's sake where title equals

01:39:32.630 --> 01:39:35.820
quote unquote "The Office" semicolon.

01:39:35.820 --> 01:39:39.650
And now you can see that
among the various Offices,

01:39:39.650 --> 01:39:43.100
it looks like the one that most
of us probably know and love

01:39:43.100 --> 01:39:49.070
is the highest rated also with a
9.0 with like 585,000 people having

01:39:49.070 --> 01:39:52.280
cast votes for whereas this other
shows seem to have been less popular.

01:39:52.280 --> 01:39:56.370
And perhaps that's why indeed you
see fewer episodes for them as well.

01:39:56.370 --> 01:39:58.680
So even though we've put
the data in multiple places,

01:39:58.680 --> 01:40:02.670
you can still kind of reconstitute
it by lining things up in this way

01:40:02.670 --> 01:40:04.970
and rejoining the tables.

01:40:04.970 --> 01:40:09.140
Questions now on this?

01:40:09.140 --> 01:40:16.540
This is the heart of what SQL does and
what relational databases do for you.

01:40:16.540 --> 01:40:18.100
Questions?

01:40:18.100 --> 01:40:18.610
All right.

01:40:18.610 --> 01:40:20.470
A few final features.

01:40:20.470 --> 01:40:22.660
There's not all that much that--

01:40:22.660 --> 01:40:24.320
SQL takes practice like anything else.

01:40:24.320 --> 01:40:26.237
But in terms of syntax
and capabilities, let's

01:40:26.237 --> 01:40:30.610
just introduce you to a couple of final
features here and problems that arise

01:40:30.610 --> 01:40:32.210
and how we might solve them.

01:40:32.210 --> 01:40:34.640
Let's do this as well.

01:40:34.640 --> 01:40:36.980
So let me go back into VS Code here.

01:40:36.980 --> 01:40:40.900
And let's just find out Steve
Carell's information again.

01:40:40.900 --> 01:40:43.180
Last time we did it
with this nested query

01:40:43.180 --> 01:40:47.503
by getting his ID and then the show IDs
and then the titles for those show IDs.

01:40:47.503 --> 01:40:49.420
With join, you can do
it a little differently.

01:40:49.420 --> 01:40:50.950
And any of these ways are fine.

01:40:50.950 --> 01:40:53.740
One might become easier
to mentally than another.

01:40:53.740 --> 01:40:57.040
Let's go ahead and select
the titles from what.

01:40:57.040 --> 01:41:02.095
Let's select the title
from the people table.

01:41:02.095 --> 01:41:03.220
And I'm going to hit Enter.

01:41:03.220 --> 01:41:05.260
And when you're using
SQLite3 interactively,

01:41:05.260 --> 01:41:09.400
if you ever find yourself with a prompt
that says dot dot dot angle bracket,

01:41:09.400 --> 01:41:12.067
it means you're continuing your
thought onto the next line.

01:41:12.067 --> 01:41:13.900
If you didn't intend
that, you can sometimes

01:41:13.900 --> 01:41:16.510
hit semicolon to just
end the thought and hit

01:41:16.510 --> 01:41:18.460
Enter even if it triggers an error.

01:41:18.460 --> 01:41:21.430
But this is one way of formatting
my queries now a little more nicely.

01:41:21.430 --> 01:41:24.597
I'm just going to add some white space
so that it's a little easier to read.

01:41:24.597 --> 01:41:25.720
What do I want to select?

01:41:25.720 --> 01:41:28.000
Well, I want to select
the title of shows

01:41:28.000 --> 01:41:35.320
from the people table joined with the
stars table on the people table's ID

01:41:35.320 --> 01:41:40.520
column equaling the stars
table's person ID column.

01:41:40.520 --> 01:41:44.230
So in other words, if you think back
to what people are and what stars are,

01:41:44.230 --> 01:41:48.250
one has an ID, one has a person ID, I'm
just now connecting those two tables.

01:41:48.250 --> 01:41:49.450
I'm joining those two.

01:41:49.450 --> 01:41:53.350
But I want to do this as
well with another table.

01:41:53.350 --> 01:41:54.938
Let me additionally join in.

01:41:54.938 --> 01:41:56.980
So now I only have two
hands, but now I'm putting

01:41:56.980 --> 01:41:59.320
a third table joined in together here.

01:41:59.320 --> 01:42:05.960
Join shows on stars.show_id
equals shows.id.

01:42:05.960 --> 01:42:08.500
So this is now linking
three tables together.

01:42:08.500 --> 01:42:12.100
But I only care about this for one
person, so where the name of the person

01:42:12.100 --> 01:42:14.650
equals quote unquote "Steve Carell."

01:42:14.650 --> 01:42:18.850
So more cryptic, to be sure, but
what we're doing with this query

01:42:18.850 --> 01:42:21.370
is just taking all three
tables that we care about

01:42:21.370 --> 01:42:23.440
and we're joining them
all together at once

01:42:23.440 --> 01:42:27.250
using this new join syntax
literally telling the database what

01:42:27.250 --> 01:42:29.020
columns to line up with what.

01:42:29.020 --> 01:42:32.350
And then we filter at the very
end just like before to get back,

01:42:32.350 --> 01:42:35.650
if I hit Enter, the answer
we want, which in this case

01:42:35.650 --> 01:42:40.030
is a little slower at the moment,
but that same list of 20 or so

01:42:40.030 --> 01:42:41.350
shows that he's been in.

01:42:41.350 --> 01:42:42.920
There's one other way to do this.

01:42:42.920 --> 01:42:45.250
And again, these are all
in the slides online.

01:42:45.250 --> 01:42:47.890
So you can repeat them without having to
jot down everything and we'll put them

01:42:47.890 --> 01:42:48.820
in the notes too.

01:42:48.820 --> 01:42:50.450
But there's another way to do this.

01:42:50.450 --> 01:42:52.780
I could also use an implicit join.

01:42:52.780 --> 01:42:55.030
So that was an explicit join
because I literally typed

01:42:55.030 --> 01:42:57.220
the word join multiple times at that.

01:42:57.220 --> 01:43:00.880
But let me go ahead and select
the title from these three tables.

01:43:00.880 --> 01:43:02.628
People, stars, and shows.

01:43:02.628 --> 01:43:04.420
And this might just be
nicer because if you

01:43:04.420 --> 01:43:06.337
know what tables you
want to select data from,

01:43:06.337 --> 01:43:08.980
just enumerate them separated by
commas, which you might prefer

01:43:08.980 --> 01:43:18.970
in your mind, where the people ID
equals the stars person ID and the stars

01:43:18.970 --> 01:43:25.450
show ID equals the shows ID and the
name of the person equals Steve Carell.

01:43:25.450 --> 01:43:26.788
So this is an implicit join.

01:43:26.788 --> 01:43:29.830
And honestly, I constantly reference
my notes for some of this stuff too.

01:43:29.830 --> 01:43:31.720
It's not the kind of thing that's
going to come like this to you

01:43:31.720 --> 01:43:32.740
after just one day.

01:43:32.740 --> 01:43:35.560
But it's just a different way
of expressing the same thing.

01:43:35.560 --> 01:43:37.720
I want to select data from
three different tables.

01:43:37.720 --> 01:43:43.120
And hey SQL, here is how I want you
to line those tables up so that I can

01:43:43.120 --> 01:43:45.640
get like related data for Steve Carell.

01:43:45.640 --> 01:43:49.450
And this now will achieve
the same results ultimately.

01:43:49.450 --> 01:43:50.260
Let me hit Enter.

01:43:54.040 --> 01:43:54.920
And there we go.

01:43:54.920 --> 01:43:55.750
So a little slower.

01:43:55.750 --> 01:43:57.850
And performance might
vary based on computer,

01:43:57.850 --> 01:44:01.270
based on implementation of SQL, but I
think I still have the same answers.

01:44:01.270 --> 01:44:04.790
Now suppose, as I often do, and I
had to look it up again last time,

01:44:04.790 --> 01:44:08.350
suppose you forget how to
spell Steve Carell's name.

01:44:08.350 --> 01:44:10.240
Is it two R's, two L's, or the like?

01:44:10.240 --> 01:44:12.903
Well, I could also do
something like this.

01:44:12.903 --> 01:44:14.320
Well, let's just keep this simple.

01:44:14.320 --> 01:44:17.103
Select star from people
where name equals.

01:44:17.103 --> 01:44:20.020
I've been deliberately getting it
right so as to not embarrass myself.

01:44:20.020 --> 01:44:22.000
That's the Steve Carell I keep querying.

01:44:22.000 --> 01:44:25.960
If you forget, well you could
try searching for just Steves,

01:44:25.960 --> 01:44:28.450
but interestingly,
there's a bunch of Steves.

01:44:28.450 --> 01:44:30.070
We don't know when they were born.

01:44:30.070 --> 01:44:35.360
But that's probably not the Steve Carell
we want if we don't have his last name.

01:44:35.360 --> 01:44:38.770
So I could alternatively do, well it's
Steve and then it starts with a C,

01:44:38.770 --> 01:44:39.460
I think.

01:44:39.460 --> 01:44:42.700
Well, it turns out there's another
wild card you can use in SQL.

01:44:42.700 --> 01:44:46.060
We used the asterisk to
select all of the columns.

01:44:46.060 --> 01:44:50.800
You can in quotes use a percent
sign to say C something.

01:44:50.800 --> 01:44:54.370
So there's 0 or more
characters after the letter C.

01:44:54.370 --> 01:44:58.630
And now this doesn't work because
now I would be literally looking

01:44:58.630 --> 01:45:01.690
for Steve space C something.

01:45:01.690 --> 01:45:04.910
But recall earlier I mentioned
that one other keyword,

01:45:04.910 --> 01:45:06.868
which is for fuzzier
matching, so to speak,

01:45:06.868 --> 01:45:09.160
where it's not exactly what
you're looking for but it's

01:45:09.160 --> 01:45:10.630
like what you're looking for.

01:45:10.630 --> 01:45:14.350
If you instead say where his name
is like Steve space C something,

01:45:14.350 --> 01:45:17.060
now we'll get back a
whole bunch of Steves.

01:45:17.060 --> 01:45:19.810
But I think now I could probably
find the one I'm actually looking

01:45:19.810 --> 01:45:20.920
for if I don't remember his name.

01:45:20.920 --> 01:45:22.510
You can use multiple percent signs.

01:45:22.510 --> 01:45:25.260
If you forget what his first name
is, you could reverse the order.

01:45:25.260 --> 01:45:29.940
But that too is a very
powerful SQL feature at that.

01:45:29.940 --> 01:45:32.670
Questions on these queries here?

01:45:32.670 --> 01:45:33.708
Yeah.

01:45:33.708 --> 01:45:35.448
AUDIENCE: [INAUDIBLE]

01:45:35.448 --> 01:45:36.240
DAVID MALAN: Sorry?

01:45:36.240 --> 01:45:38.730
AUDIENCE: [INAUDIBLE]

01:45:38.730 --> 01:45:40.003
DAVID MALAN: What about it?

01:45:40.003 --> 01:45:40.990
AUDIENCE: [INAUDIBLE]

01:45:40.990 --> 01:45:42.400
DAVID MALAN: Oh yeah, sure.

01:45:42.400 --> 01:45:44.500
So the query I used here.

01:45:44.500 --> 01:45:49.810
There's a lot of Steves whose last
name starts with C. Oops, too far.

01:45:49.810 --> 01:45:52.760
The last query I executed
was this one here.

01:45:52.760 --> 01:45:57.110
So where the name is like
quote unquote "Steve C%."

01:45:57.110 --> 01:45:59.950
So that's just another
tool for your toolkit here.

01:45:59.950 --> 01:46:02.830
But you'll perhaps have
notice that those two--

01:46:02.830 --> 01:46:05.380
prior to that query, the
joins I did were sort of slow.

01:46:05.380 --> 01:46:07.480
And honestly, this database
isn't even that big.

01:46:07.480 --> 01:46:10.120
Like yes, it has tens of
thousands of rows in it.

01:46:10.120 --> 01:46:13.630
But in the real world and most of the
apps you and I use a lot every day

01:46:13.630 --> 01:46:17.080
or websites, there's millions,
even billions of rows of data.

01:46:17.080 --> 01:46:22.210
And if I had to wait on my computer
here or my code space a second or two

01:46:22.210 --> 01:46:25.990
to get the data, that's not going to
work for millions of users or customers

01:46:25.990 --> 01:46:26.680
certainly.

01:46:26.680 --> 01:46:29.020
So how can we actually improve things?

01:46:29.020 --> 01:46:34.330
Well, it turns out another upside
of a proper relational database

01:46:34.330 --> 01:46:38.830
is that it's not just a spreadsheet
where the onus is on you to find

01:46:38.830 --> 01:46:40.120
the data you're looking for.

01:46:40.120 --> 01:46:43.780
You can also tell the database
to index the data for you.

01:46:43.780 --> 01:46:48.670
An index is an efficient cheat
sheet for finding data fast.

01:46:48.670 --> 01:46:52.908
Like books in the real world often
have indices at the end of the book

01:46:52.908 --> 01:46:54.700
where you can look
things up alphabetically

01:46:54.700 --> 01:46:57.790
and then you can cross reference it
for the pages that topic appears on.

01:46:57.790 --> 01:46:59.110
Same idea in a database.

01:46:59.110 --> 01:47:03.190
If you tell the database in advance that
you want to search on a certain column

01:47:03.190 --> 01:47:06.700
frequently, you can tell it to
build a fancy index that will just

01:47:06.700 --> 01:47:08.770
allow you to search that column faster.

01:47:08.770 --> 01:47:11.560
By default, these columns
are going to be searched

01:47:11.560 --> 01:47:13.197
most likely by a linear search.

01:47:13.197 --> 01:47:15.280
Not even binary search,
because the data might not

01:47:15.280 --> 01:47:17.390
be sorted because it came in any order.

01:47:17.390 --> 01:47:20.170
But if you create an
index, you're probably

01:47:20.170 --> 01:47:22.780
going to get something closer
to logarithmic than linear,

01:47:22.780 --> 01:47:24.892
and that's going to
be a big plus overall.

01:47:24.892 --> 01:47:26.350
So let me do something simple here.

01:47:26.350 --> 01:47:30.670
First let me turn on a SQLite specific
feature that just is going to time all

01:47:30.670 --> 01:47:33.190
of my queries by writing .timer on.

01:47:33.190 --> 01:47:37.515
I just want to keep track of how
long each of these commands takes.

01:47:37.515 --> 01:47:40.390
This one is not a slow command, so
this is just going to be relative.

01:47:40.390 --> 01:47:43.300
But let's just select
everything from the shows table

01:47:43.300 --> 01:47:45.970
where the title thereof is The Office.

01:47:45.970 --> 01:47:48.940
Let's see how long this
relatively simple query takes.

01:47:48.940 --> 01:47:50.980
All right, not very long at all.

01:47:50.980 --> 01:47:54.850
In real terms less than
a second, 0.035 seconds.

01:47:54.850 --> 01:47:57.070
So not slow by any means.

01:47:57.070 --> 01:48:00.160
But if you've got hundreds,
thousands, millions of users,

01:48:00.160 --> 01:48:03.040
every one of those milliseconds
could very well add up.

01:48:03.040 --> 01:48:04.640
So can we do better?

01:48:04.640 --> 01:48:06.440
Well, we can if I do this.

01:48:06.440 --> 01:48:11.740
If I use syntax like this once in the
beginning of the design of my database,

01:48:11.740 --> 01:48:17.980
I create not a table but an index with
some name on a specific table on one

01:48:17.980 --> 01:48:18.820
or more columns.

01:48:18.820 --> 01:48:21.370
I can give a clue, a hint
to the database in advance

01:48:21.370 --> 01:48:26.260
saying please optimize with some
secret sauce searching or selecting

01:48:26.260 --> 01:48:29.740
on this column in this table
so that my searches are faster.

01:48:29.740 --> 01:48:30.860
So let me do this.

01:48:30.860 --> 01:48:32.590
Let me go back to VS Code here.

01:48:32.590 --> 01:48:36.460
Let me create an index
called how about title index.

01:48:36.460 --> 01:48:40.240
I could call it anything I want, but
I want to search faster on titles.

01:48:40.240 --> 01:48:45.940
So I'm going to call this a title index
where rather title index on the table

01:48:45.940 --> 01:48:46.900
called shows.

01:48:46.900 --> 01:48:49.150
And then in parentheses is the syntax.

01:48:49.150 --> 01:48:50.620
The column called title.

01:48:50.620 --> 01:48:53.350
So again, I've just borrowed
this canonical syntax

01:48:53.350 --> 01:48:57.590
and I've just translated it into
something that's TV show specific.

01:48:57.590 --> 01:48:58.090
All right.

01:48:58.090 --> 01:49:00.040
What is this going to do for me?

01:49:00.040 --> 01:49:03.700
Once I hit Enter, this is going to
create in the computer's memory,

01:49:03.700 --> 01:49:05.860
the database's memory
something called a B-tree.

01:49:05.860 --> 01:49:07.330
It's not a binary tree.

01:49:07.330 --> 01:49:11.627
A B-tree is actually a potentially
more efficient data structure

01:49:11.627 --> 01:49:13.960
that we didn't talk about a
few weeks back in week five,

01:49:13.960 --> 01:49:15.710
but it looks a little
something like this,

01:49:15.710 --> 01:49:17.680
which looks similar to a binary tree.

01:49:17.680 --> 01:49:21.847
But does anyone notice what
makes this not a binary tree?

01:49:21.847 --> 01:49:26.405
AUDIENCE: [INAUDIBLE]

01:49:26.405 --> 01:49:27.280
DAVID MALAN: Exactly.

01:49:27.280 --> 01:49:31.570
Binary tree, bi implying two, has
no more than two children per node,

01:49:31.570 --> 01:49:33.580
but here's a perfect
example, one, two, three.

01:49:33.580 --> 01:49:35.990
And there could be four
children, five children or more.

01:49:35.990 --> 01:49:38.530
But the effect of that, if
you have a very wide tree,

01:49:38.530 --> 01:49:40.780
the upside is that it's very short.

01:49:40.780 --> 01:49:44.560
It pulls the data higher up closer
to the node, to the root node.

01:49:44.560 --> 01:49:48.080
And recall that the root node is where
we began our searches in the past,

01:49:48.080 --> 01:49:51.620
whether it was a BST, a Binary
Search Tree, even a tri or other data

01:49:51.620 --> 01:49:52.120
structures.

01:49:52.120 --> 01:49:53.380
We always began at the top.

01:49:53.380 --> 01:49:55.780
So the higher up you
can pull the data, even

01:49:55.780 --> 01:49:57.617
if it makes the data
structure very wide,

01:49:57.617 --> 01:50:00.700
you're going to be able to do boom,
boom, boom, look up queries or look up

01:50:00.700 --> 01:50:03.340
data probably much faster
certainly than if it's just

01:50:03.340 --> 01:50:06.590
a very long list like
a column by default.

01:50:06.590 --> 01:50:09.970
So with that said, let
me go back to VS Code.

01:50:09.970 --> 01:50:11.380
I didn't create the index yet.

01:50:11.380 --> 01:50:13.277
Let me go ahead and hit
Enter and create it.

01:50:13.277 --> 01:50:14.860
All right, it took a minute, a moment.

01:50:14.860 --> 01:50:17.920
It took like half a second,
which obviously is not that slow.

01:50:17.920 --> 01:50:20.380
But with more data, that
could have been even slower.

01:50:20.380 --> 01:50:22.600
But it's a one time operation as of now.

01:50:22.600 --> 01:50:27.910
And now let me hit up and let me
select the same data from shows

01:50:27.910 --> 01:50:29.350
where title equals The Office.

01:50:29.350 --> 01:50:34.360
Last time just a moment
ago it took 0.035 seconds.

01:50:34.360 --> 01:50:38.920
Not slow but also that's going to add
up if I have lots of users of IMDb.

01:50:38.920 --> 01:50:42.040
Let's go ahead now and
execute the same query again.

01:50:42.040 --> 01:50:44.700
How long did that take?

01:50:44.700 --> 01:50:47.010
0.001 seconds now.

01:50:47.010 --> 01:50:48.720
I mean, practically nothing.

01:50:48.720 --> 01:50:50.970
And so that's the sort
of opportunity now.

01:50:50.970 --> 01:50:53.470
When you've got lots of data
and you want to really speed up

01:50:53.470 --> 01:50:56.130
these searches, these indexes,
these indices that just create

01:50:56.130 --> 01:50:59.130
for you these magical data
structures in the databases memory,

01:50:59.130 --> 01:51:02.430
it allows you to search on columns
that you are pretty sure you

01:51:02.430 --> 01:51:04.150
want to search on more effectively.

01:51:04.150 --> 01:51:07.380
Now, by contrast, if you've ever
used Google or Bing or some search

01:51:07.380 --> 01:51:10.590
engine that has advanced
search, some of those text boxes

01:51:10.590 --> 01:51:13.470
that you can search more precisely
in might very well be slower.

01:51:13.470 --> 01:51:14.190
Why?

01:51:14.190 --> 01:51:16.950
Well, probably you don't want
to go crazy and just index

01:51:16.950 --> 01:51:19.380
every column on every table.

01:51:19.380 --> 01:51:21.020
Why?

01:51:21.020 --> 01:51:25.070
What might be the intuition?

01:51:25.070 --> 01:51:28.490
If logically indexes speed things
up, why not index everything?

01:51:28.490 --> 01:51:30.710
There's always going
to be a trade off here.

01:51:30.710 --> 01:51:33.040
What might that be?

01:51:33.040 --> 01:51:34.000
Yeah.

01:51:34.000 --> 01:51:34.875
AUDIENCE: [INAUDIBLE]

01:51:34.875 --> 01:51:37.167
DAVID MALAN: Yeah, it's going
to take a lot of storage.

01:51:37.167 --> 01:51:38.960
This is just a slide on the screen.

01:51:38.960 --> 01:51:41.020
But this has to go somewhere.

01:51:41.020 --> 01:51:43.360
This needs space in
the computer's memory

01:51:43.360 --> 01:51:44.740
or on the hard drive or the like.

01:51:44.740 --> 01:51:48.777
And that's fine if you have unlimited
space, but odds are you don't.

01:51:48.777 --> 01:51:51.110
And that's going to get
expensive for different reasons.

01:51:51.110 --> 01:51:55.810
So maybe you only want to index
certain columns and certain tables

01:51:55.810 --> 01:51:56.860
and not all of them.

01:51:56.860 --> 01:51:57.777
Because you know what?

01:51:57.777 --> 01:51:59.590
What even if a user
really wants to search

01:51:59.590 --> 01:52:05.020
maybe via advanced search on some
other column or table altogether, fine.

01:52:05.020 --> 01:52:07.240
If once in a while a
query is slow, we're

01:52:07.240 --> 01:52:11.110
probably getting the bigger bang for
our buck by optimizing the common cases,

01:52:11.110 --> 01:52:16.030
the more popular queries that
people actually care about too.

01:52:16.030 --> 01:52:16.640
All right.

01:52:16.640 --> 01:52:22.450
So let's come full circle and bring
this now back to how we actually

01:52:22.450 --> 01:52:25.030
began, which was with some Python code.

01:52:25.030 --> 01:52:27.550
So it turns out these are
not either or decisions.

01:52:27.550 --> 01:52:29.740
It turns out in the
real world, developers

01:52:29.740 --> 01:52:33.320
are constantly using one,
two, three languages at once.

01:52:33.320 --> 01:52:36.940
And in fact, next week I rattled off
HTML, CSS, and JavaScript, one of which

01:52:36.940 --> 01:52:39.310
is a proper programming
language, but those languages

01:52:39.310 --> 01:52:40.510
are often used together.

01:52:40.510 --> 01:52:46.300
Totally normal and common to use
Python and SQL or Java and SQL

01:52:46.300 --> 01:52:49.300
or SWIFT and SQL or any number
of different combinations

01:52:49.300 --> 01:52:50.710
with a database language.

01:52:50.710 --> 01:52:53.920
You might use your preferred
programming language, Java, Python,

01:52:53.920 --> 01:53:00.760
C++ to create the user interface and
the logic that implements the program

01:53:00.760 --> 01:53:01.330
itself.

01:53:01.330 --> 01:53:04.092
But for your data, SQL's
a really good candidate.

01:53:04.092 --> 01:53:07.300
And indeed, we've seen already that SQL
can just speed up certain operations.

01:53:07.300 --> 01:53:08.140
You can change.

01:53:08.140 --> 01:53:11.200
You can collapse 15 lines
of code into just one

01:53:11.200 --> 01:53:13.430
and you can use these things together.

01:53:13.430 --> 01:53:14.740
So let me come back to--

01:53:14.740 --> 01:53:16.720
I'm going to quit out of SQLite.

01:53:16.720 --> 01:53:18.580
I'm going to minimize
my terminal window.

01:53:18.580 --> 01:53:22.030
And here's where we left off
before with favorites.py.

01:53:22.030 --> 01:53:27.430
With favorites.py, everything was
being stored in favorites.csv.

01:53:27.430 --> 01:53:32.260
And recall that we eventually imported
that CSV file into favorites.db

01:53:32.260 --> 01:53:35.920
automatically with .import just so we
could start playing around with SQL.

01:53:35.920 --> 01:53:37.990
But we can now tie these two together.

01:53:37.990 --> 01:53:40.690
And a way to do that is as follows.

01:53:40.690 --> 01:53:42.700
CS50 has a library for Python.

01:53:42.700 --> 01:53:47.350
You might recall having available
get string, get int, get float.

01:53:47.350 --> 01:53:49.360
You don't strictly need
to use them in Python

01:53:49.360 --> 01:53:51.970
because it's much easier to
just use the input function

01:53:51.970 --> 01:53:55.570
and then try, accept, and convert
things to int or float or the like.

01:53:55.570 --> 01:54:00.520
But it's a lot more work to use SQL in
Python without a third party library.

01:54:00.520 --> 01:54:03.370
A lot of the commercial options
or popular open source options

01:54:03.370 --> 01:54:05.200
are actually just complicated to use.

01:54:05.200 --> 01:54:09.790
So CS50 does have a very useful function
inside of its library for Python

01:54:09.790 --> 01:54:12.640
that you should use and must
use for the problem set that

01:54:12.640 --> 01:54:18.100
just makes it easy to execute Python,
execute SQL inside of your Python code.

01:54:18.100 --> 01:54:22.360
But it's built on top of a very
popular open source alternative.

01:54:22.360 --> 01:54:24.350
So you can use that
too in the real world.

01:54:24.350 --> 01:54:26.800
So the documentation for
that is at this URL here,

01:54:26.800 --> 01:54:30.940
but I'll show you what we need to know
here by focusing back on favorites.py.

01:54:30.940 --> 01:54:35.540
So what I'm going to do
here is follows is this.

01:54:35.540 --> 01:54:42.550
Let me delete everything from
favorites.py except for let's say this.

01:54:42.550 --> 01:54:45.970
From CS50 import SQL in all caps.

01:54:45.970 --> 01:54:49.060
So that's importing a SQL
feature from CS50's library

01:54:49.060 --> 01:54:53.200
that's going to allow me
to open a DB file in code.

01:54:53.200 --> 01:54:54.380
How do I do that?

01:54:54.380 --> 01:54:56.650
Well, let me create a variable
called DB for database,

01:54:56.650 --> 01:54:58.275
though I could call it anything I want.

01:54:58.275 --> 01:55:02.470
Let me call this SQL function and
pass in using special syntax that's

01:55:02.470 --> 01:55:03.640
not CS50 specific.

01:55:03.640 --> 01:55:09.640
It's an industry thing. sqlite:///.

01:55:09.640 --> 01:55:14.640
Unlike every other URL you type, this
one literally has three in this context

01:55:14.640 --> 01:55:15.140
here.

01:55:15.140 --> 01:55:18.340
And then the name of the database,
which in this case is favorites.db.

01:55:18.340 --> 01:55:22.450
So this is just a way of telling
this SQL library that we wrote

01:55:22.450 --> 01:55:27.640
but that works exactly like third party
alternatives open favorites.db using

01:55:27.640 --> 01:55:30.380
the SQLite technology, if you will.

01:55:30.380 --> 01:55:30.880
All right.

01:55:30.880 --> 01:55:32.338
Let's just ask the user a question.

01:55:32.338 --> 01:55:36.057
Give me your favorite problem.

01:55:36.057 --> 01:55:38.140
So we're going to use input
instead of get string,

01:55:38.140 --> 01:55:41.450
but we could use get string, but they're
pretty much the same for our purposes.

01:55:41.450 --> 01:55:43.150
Let's ask the user for their favorite.

01:55:43.150 --> 01:55:49.960
And now in Python code, let
us select from favorites.db

01:55:49.960 --> 01:55:55.450
all of the rows where students specify
that problem as their favorite.

01:55:55.450 --> 01:55:57.610
So in SQL alone, it would be this.

01:55:57.610 --> 01:56:04.210
Select star from favorites
where problem equals

01:56:04.210 --> 01:56:08.170
and I'll do, well, whatever
my favorite's going to be.

01:56:08.170 --> 01:56:10.288
Like problem equals Mario, for instance.

01:56:10.288 --> 01:56:13.330
So if I were just using SQL, I would
literally write something like that.

01:56:13.330 --> 01:56:15.130
But I'm in a .py file now.

01:56:15.130 --> 01:56:16.870
I have to use Python syntax.

01:56:16.870 --> 01:56:18.550
But Python supports strings.

01:56:18.550 --> 01:56:20.090
SQL is just text.

01:56:20.090 --> 01:56:20.990
It's just a string.

01:56:20.990 --> 01:56:24.220
So I could certainly just
put my SQL code in a string

01:56:24.220 --> 01:56:26.800
perhaps and then pass
it to a Python function.

01:56:26.800 --> 01:56:28.780
And here's the bridge between the two.

01:56:28.780 --> 01:56:31.960
If you just treat SQL as any old
text, we can put it in a string

01:56:31.960 --> 01:56:32.840
and execute it.

01:56:32.840 --> 01:56:34.837
So let me actually do this.

01:56:34.837 --> 01:56:36.670
Let me go ahead and
create a variable called

01:56:36.670 --> 01:56:40.130
rows, which is eventually going to
contain all the rows from the database.

01:56:40.130 --> 01:56:45.790
Let me go ahead and select db.execute.

01:56:45.790 --> 01:56:49.030
This is the one function you need to
know about inside of CS50's library,

01:56:49.030 --> 01:56:51.310
and it literally
executes a SQL statement.

01:56:51.310 --> 01:56:56.140
And then in quotes, you pass it
literally what you want to execute.

01:56:56.140 --> 01:56:58.990
And let me go ahead and close
the parenthesis at the end there.

01:56:58.990 --> 01:57:00.740
And now let me just try this.

01:57:00.740 --> 01:57:05.260
So for row in rows, let's iterate
over all of the rows, let me go ahead

01:57:05.260 --> 01:57:12.370
and print out how about
row, quote unquote.

01:57:12.370 --> 01:57:15.700
And what do I want here?

01:57:15.700 --> 01:57:20.020
Let's print out the timestamp
of that person for kicks.

01:57:20.020 --> 01:57:22.120
All right, let me open
my terminal window.

01:57:22.120 --> 01:57:23.890
Python of favorites.py.

01:57:23.890 --> 01:57:27.490
Crossing my fingers here for sure.

01:57:27.490 --> 01:57:27.990
Enter.

01:57:31.200 --> 01:57:31.770
There we go.

01:57:31.770 --> 01:57:32.280
Favorites.

01:57:32.280 --> 01:57:34.500
I'll type in Mario.

01:57:34.500 --> 01:57:35.070
OK.

01:57:35.070 --> 01:57:36.240
So I got back--

01:57:36.240 --> 01:57:37.980
it's not very interesting,
but I got back

01:57:37.980 --> 01:57:40.830
all of the timestamps of students
who typed in Mario that we

01:57:40.830 --> 01:57:42.330
imported into this database.

01:57:42.330 --> 01:57:45.360
Well, what I really care
about is how popular Mario is.

01:57:45.360 --> 01:57:46.830
So let me change this a little bit.

01:57:46.830 --> 01:57:50.490
Let me change this to
count the number of rows.

01:57:50.490 --> 01:57:52.080
And let me keep it simple.

01:57:52.080 --> 01:57:56.260
Let me give an alias like I proposed
earlier like as n, where n is a number.

01:57:56.260 --> 01:58:00.090
So that now down here, I
can actually just do this.

01:58:00.090 --> 01:58:02.230
Print out the value of n.

01:58:02.230 --> 01:58:02.730
All right.

01:58:02.730 --> 01:58:04.450
Let me go back to my terminal window.

01:58:04.450 --> 01:58:05.910
Run Python to favorites.py.

01:58:05.910 --> 01:58:07.140
Let me type in Mario.

01:58:07.140 --> 01:58:08.070
Enter.

01:58:08.070 --> 01:58:09.660
OK, 39.

01:58:09.660 --> 01:58:11.790
Now, technically I'm cheating.

01:58:11.790 --> 01:58:14.430
Honestly if I'm executing
select count, we've

01:58:14.430 --> 01:58:17.752
seen before it only ever
returns one row, not multiple.

01:58:17.752 --> 01:58:20.460
So there's really nothing to
iterate over, but it's working fine.

01:58:20.460 --> 01:58:22.560
It's just iterating once,
but I'm getting lucky.

01:58:22.560 --> 01:58:26.100
So technically what I should
probably just do is this.

01:58:26.100 --> 01:58:29.220
I should probably give
myself a variable called row,

01:58:29.220 --> 01:58:34.770
set it equal to the very first
row and only row that came back,

01:58:34.770 --> 01:58:39.510
and now print out that rows and column.

01:58:39.510 --> 01:58:40.890
Let me rerun the program.

01:58:40.890 --> 01:58:42.130
I'll type in Mario again.

01:58:42.130 --> 01:58:42.660
Enter.

01:58:42.660 --> 01:58:44.980
And I still see 39.

01:58:44.980 --> 01:58:48.328
So of course, I don't
strictly need to do this.

01:58:48.328 --> 01:58:49.620
I don't really need a variable.

01:58:49.620 --> 01:58:52.590
I can do rows bracket 0 instead.

01:58:52.590 --> 01:58:54.840
But let me focus on what
this library is now doing.

01:58:54.840 --> 01:59:01.620
So per the documentation, what the CS50
execute function always does for you is

01:59:01.620 --> 01:59:05.620
it returns a list of dictionaries.

01:59:05.620 --> 01:59:10.558
So if your query returns nothing, like
no matches, you get back an empty list.

01:59:10.558 --> 01:59:12.600
Like open bracket, closed
bracket, nothing in it.

01:59:12.600 --> 01:59:14.520
Any loop is not going to
execute anything useful,

01:59:14.520 --> 01:59:15.770
because there's nothing in it.

01:59:15.770 --> 01:59:18.660
If, though, you get back
one row, you're going

01:59:18.660 --> 01:59:22.740
to get back a list of size one inside
of which is a single dictionary.

01:59:22.740 --> 01:59:24.930
That dictionary is
going to have keys that

01:59:24.930 --> 01:59:29.200
correspond to whatever you selected,
be it the columns or the count.

01:59:29.200 --> 01:59:32.880
So when I selected star before, I
would have gotten all of the columns.

01:59:32.880 --> 01:59:34.950
That's how I was able
to access timestamp.

01:59:34.950 --> 01:59:37.230
Here I'm just selecting
count and I don't

01:59:37.230 --> 01:59:38.910
want to have to type this down here.

01:59:38.910 --> 01:59:40.230
That would just look kind of atrocious.

01:59:40.230 --> 01:59:42.355
It would work, but it would
look weird to just keep

01:59:42.355 --> 01:59:44.640
retyping count paren star close paren.

01:59:44.640 --> 01:59:49.530
So I just created an alias called n just
to make my life easier or cleaner down

01:59:49.530 --> 01:59:50.380
here.

01:59:50.380 --> 01:59:55.680
So to be clear, the CS50 execute
function returns a list of dictionaries

01:59:55.680 --> 01:59:57.360
when you're using select.

01:59:57.360 --> 02:00:01.500
And that is how I can now get
back the first and only row

02:00:01.500 --> 02:00:04.860
and then print out that row's end value.

02:00:04.860 --> 02:00:07.380
It is identical to--

02:00:07.380 --> 02:00:08.650
let me do this.

02:00:08.650 --> 02:00:11.610
Let me highlight this
whole line of text.

02:00:11.610 --> 02:00:15.120
Let me in my terminal window
run SQLite3 of favorites.db

02:00:15.120 --> 02:00:16.470
like we did before break.

02:00:16.470 --> 02:00:19.170
Let me just copy paste this query.

02:00:19.170 --> 02:00:20.040
Enter.

02:00:20.040 --> 02:00:24.550
That's the table I got back earlier
when we played with SQL manually.

02:00:24.550 --> 02:00:28.170
And so when I get back this table,
here's the key, here's the value,

02:00:28.170 --> 02:00:31.860
and I only have one row, which
is why I'm just blindly indexing

02:00:31.860 --> 02:00:34.080
into rows bracket 0, because
I know there's always

02:00:34.080 --> 02:00:35.247
going to be an answer there.

02:00:35.247 --> 02:00:36.690
It's going to be 0 or 1 or more.

02:00:36.690 --> 02:00:40.360
But I know now it's going to be
called n because of this here.

02:00:40.360 --> 02:00:41.580
So what have I just done?

02:00:41.580 --> 02:00:43.113
Well, this is SQL down here.

02:00:43.113 --> 02:00:45.780
And this is just me being like a
data scientist asking questions

02:00:45.780 --> 02:00:48.640
about my data just using
black and white SQL queries.

02:00:48.640 --> 02:00:52.410
This is me now being a Python programmer
who wants to talk to a SQL database

02:00:52.410 --> 02:00:53.700
using Python.

02:00:53.700 --> 02:00:56.820
And the bridge we're using
happens to be the CS50 library.

02:00:56.820 --> 02:00:59.940
But again, there's third party free
libraries you can also use as well.

02:00:59.940 --> 02:01:01.710
Ours is just very simple.

02:01:01.710 --> 02:01:03.660
And indeed, the
documentation will explain

02:01:03.660 --> 02:01:07.113
how execute behaves a little differently
for inserts, updates, and deletes.

02:01:07.113 --> 02:01:09.780
You don't get back a list because
you're not selecting anything,

02:01:09.780 --> 02:01:13.050
but you do get back some return values.

02:01:13.050 --> 02:01:15.990
Questions on this?

02:01:15.990 --> 02:01:18.030
That's the last of our Python code.

02:01:18.030 --> 02:01:23.320
That ties everything together in spirit.

02:01:23.320 --> 02:01:24.768
Yeah?

02:01:24.768 --> 02:01:29.460
AUDIENCE: [INAUDIBLE]

02:01:29.460 --> 02:01:31.810
DAVID MALAN: This one here?

02:01:31.810 --> 02:01:32.320
Yes.

02:01:32.320 --> 02:01:37.558
So db.execute by definition
returns a list of rows.

02:01:37.558 --> 02:01:40.600
And each of those rows happens to be
a dictionary because its convenient.

02:01:40.600 --> 02:01:42.100
Key value pairs.

02:01:42.100 --> 02:01:45.040
If I'm selecting the
count of rows, I just

02:01:45.040 --> 02:01:48.250
know from having learned SQL
an hour ago that this is always

02:01:48.250 --> 02:01:52.540
going to give me a single row whose
column in this case is called n.

02:01:52.540 --> 02:01:56.470
So if I know it's a single row, I
can just blindly, just like in C,

02:01:56.470 --> 02:02:00.160
go into that list or an array in
C and go to the first location

02:02:00.160 --> 02:02:02.590
and then treat that as the single row.

02:02:02.590 --> 02:02:04.330
What you don't want to do is this.

02:02:04.330 --> 02:02:08.500
Even if you the human know
the query returns one row,

02:02:08.500 --> 02:02:10.780
you can't just magically
change the variable name

02:02:10.780 --> 02:02:13.270
to be singular and expect
to have only one value.

02:02:13.270 --> 02:02:14.960
You will always have a list.

02:02:14.960 --> 02:02:18.190
So even if there is only one value
in it, it's up to you to do something

02:02:18.190 --> 02:02:19.780
like this to get at it.

02:02:19.780 --> 02:02:24.070
Or if you prefer more succinctness,
you can do rows bracket I bracket n.

02:02:24.070 --> 02:02:27.220
That'll achieve the same
thing without a variable.

02:02:27.220 --> 02:02:29.095
Yeah?

02:02:29.095 --> 02:02:33.970
AUDIENCE: [INAUDIBLE]

02:02:33.970 --> 02:02:34.720
DAVID MALAN: Good.

02:02:34.720 --> 02:02:37.150
So I have been misleading
this whole time

02:02:37.150 --> 02:02:40.690
and cheating because this is
only ever going to return Mario.

02:02:40.690 --> 02:02:44.530
I'm ignoring the favorite that the
human typed in here on line five.

02:02:44.530 --> 02:02:45.770
So let me fix that.

02:02:45.770 --> 02:02:49.000
And that's going to lead us to some
of the problems that arise ultimately

02:02:49.000 --> 02:02:49.780
with SQL.

02:02:49.780 --> 02:02:53.320
The right way to solve that problem--
let me get rid of my terminal window

02:02:53.320 --> 02:02:53.890
here.

02:02:53.890 --> 02:02:56.170
The right way to solve
this problem is not

02:02:56.170 --> 02:02:59.170
to use an fstring like we
did in Python generally,

02:02:59.170 --> 02:03:02.680
because SQL queries, as we'll see
in a moment, can be dangerous.

02:03:02.680 --> 02:03:06.940
When you want to plug in
users' data into a query

02:03:06.940 --> 02:03:10.540
that you've written most of
in advance, you should, you

02:03:10.540 --> 02:03:15.640
must, you had better use a placeholder,
namely a question mark in this case.

02:03:15.640 --> 02:03:18.340
This is somewhat specific
to CS50's library,

02:03:18.340 --> 02:03:21.910
but we just borrowed the convention
that every other library uses too.

02:03:21.910 --> 02:03:25.660
In the world of SQL, single question
marks are used as placeholders.

02:03:25.660 --> 02:03:28.190
And the way you do this is as follows.

02:03:28.190 --> 02:03:32.020
If you want to plug-in a
value for that question mark,

02:03:32.020 --> 02:03:37.180
just like in printf in C, you specify as
a second or a third or fourth argument

02:03:37.180 --> 02:03:39.320
all of the values you
want plugged into this.

02:03:39.320 --> 02:03:42.280
So in C weeks ago, we were using %s.

02:03:42.280 --> 02:03:43.810
Same exact idea.

02:03:43.810 --> 02:03:46.810
In SQL it's a question
mark that you use instead.

02:03:46.810 --> 02:03:49.480
This now, if I open
back my terminal window

02:03:49.480 --> 02:03:55.120
and I run Python of favorites.py,
type in Mario, I should still get 39.

02:03:55.120 --> 02:03:58.750
But now I can also type in
Scratch perhaps and get 44

02:03:58.750 --> 02:04:00.520
for that very first piece at 0.

02:04:00.520 --> 02:04:02.900
And that one is even more popular here.

02:04:02.900 --> 02:04:04.150
So this now is correct.

02:04:04.150 --> 02:04:09.560
It would work to use an fstring here
and then plug in a value like favorite

02:04:09.560 --> 02:04:10.060
here.

02:04:10.060 --> 02:04:11.977
But you'll see in just
a moment don't do that.

02:04:11.977 --> 02:04:15.640
You will expose yourself to
potential hack or attacks

02:04:15.640 --> 02:04:17.910
by trusting the user's input.

02:04:17.910 --> 02:04:19.660
And so in fact, let's
transition from that

02:04:19.660 --> 02:04:21.610
to exactly some of these
kinds of challenges,

02:04:21.610 --> 02:04:24.020
namely two before we wrap up.

02:04:24.020 --> 02:04:27.820
So in the world of SQL, especially when
it's used at scale with the Twitters

02:04:27.820 --> 02:04:30.610
and the Googles of the
world, a lot of data

02:04:30.610 --> 02:04:32.610
is probably coming into
the database all at once

02:04:32.610 --> 02:04:34.610
because multiple people
are opening their phones

02:04:34.610 --> 02:04:36.070
at the same time around the world.

02:04:36.070 --> 02:04:37.690
They're clicking on
the same links roughly

02:04:37.690 --> 02:04:39.107
at the same time around the world.

02:04:39.107 --> 02:04:42.235
When you have thousands of people
all using your site at once,

02:04:42.235 --> 02:04:44.110
order of operations is
going to be important.

02:04:44.110 --> 02:04:47.620
But unfortunately in SQL and
in other contexts of computing,

02:04:47.620 --> 02:04:50.150
there's this risk of what's
known as a race condition.

02:04:50.150 --> 02:04:53.710
So for instance, has anyone
ever seen or liked this?

02:04:53.710 --> 02:04:55.780
This is the world record egg.

02:04:55.780 --> 02:04:58.467
Or it's this thing that was
very popular a while back.

02:04:58.467 --> 02:04:59.800
It's still kind of going strong.

02:04:59.800 --> 02:05:03.340
But if you go to the Instagram
profile for World Record Egg,

02:05:03.340 --> 02:05:06.610
the goal was to make the most
liked Instagram post ever.

02:05:06.610 --> 02:05:08.000
And they did pretty well.

02:05:08.000 --> 02:05:09.170
It's just this.

02:05:09.170 --> 02:05:10.790
It's just a picture of an egg.

02:05:10.790 --> 02:05:12.570
Now, at the height of
the popularity, like

02:05:12.570 --> 02:05:15.070
there might have been hundreds,
thousands, tens of thousands

02:05:15.070 --> 02:05:18.520
of people clicking pretty much
at the same time on this egg.

02:05:18.520 --> 02:05:20.830
So it actually creates
a potential problem

02:05:20.830 --> 02:05:23.470
with the integrity of Instagram's data.

02:05:23.470 --> 02:05:24.010
Why?

02:05:24.010 --> 02:05:26.302
Well, if you have all these
requests coming in at once,

02:05:26.302 --> 02:05:28.250
how do you possibly keep
track of all of them

02:05:28.250 --> 02:05:31.850
and update your counter in a way that
can keep up with all of that traffic?

02:05:31.850 --> 02:05:32.350
Why?

02:05:32.350 --> 02:05:35.680
Well, let's just hypothesize
what Meta, formerly Facebook,

02:05:35.680 --> 02:05:38.860
was doing underneath the hood with
Instagram if this were their code.

02:05:38.860 --> 02:05:41.890
So suppose for the sake of
discussion that Instagram servers

02:05:41.890 --> 02:05:44.617
are using a mix of Python and SQL.

02:05:44.617 --> 02:05:47.200
Probably not using the CS50
library, but they could absolutely

02:05:47.200 --> 02:05:50.680
be using those two languages
or two others together.

02:05:50.680 --> 02:05:55.270
Suppose they do this in order to update
the number of likes for that post.

02:05:55.270 --> 02:06:00.160
They first execute a SQL query like
select the current number of likes

02:06:00.160 --> 02:06:03.580
from a table called posts
where the idea of the post

02:06:03.580 --> 02:06:08.530
equals whatever the unique identifier
is for that specific egg in the table.

02:06:08.530 --> 02:06:12.460
And then they store the result in
this rows variable, just like I did.

02:06:12.460 --> 02:06:14.830
And then they do this.

02:06:14.830 --> 02:06:16.420
They create a variable called likes.

02:06:16.420 --> 02:06:18.700
They set it equal to rows bracket 0.

02:06:18.700 --> 02:06:21.280
So the very first row in the result set.

02:06:21.280 --> 02:06:22.450
And they get the likes key.

02:06:22.450 --> 02:06:25.030
So this is literally what
I just did with the count.

02:06:25.030 --> 02:06:27.880
Let me hypothesize that
Instagram does something similar

02:06:27.880 --> 02:06:29.330
with the total number of likes.

02:06:29.330 --> 02:06:30.370
Why are they doing this?

02:06:30.370 --> 02:06:34.600
Because they then want to execute
a third line of code that executes

02:06:34.600 --> 02:06:36.430
update the posts table.

02:06:36.430 --> 02:06:41.920
Set the new number of likes equal to
something where the idea of the post

02:06:41.920 --> 02:06:43.330
equals this other thing.

02:06:43.330 --> 02:06:46.870
Now, notice just like in printf there's
the comma separated list of values.

02:06:46.870 --> 02:06:50.560
They want to update the current
number of likes from the current value

02:06:50.560 --> 02:06:51.970
to the current value plus 1.

02:06:51.970 --> 02:06:53.350
So it's likes plus 1.

02:06:53.350 --> 02:06:55.640
And then we plug in the ID for this.

02:06:55.640 --> 02:06:57.760
So suppose this is what
Instagram is doing.

02:06:57.760 --> 02:07:02.860
Unfortunately, whenever you execute
multiple lines of code independently

02:07:02.860 --> 02:07:06.260
and you're so popular like
Instagram that you have thousands,

02:07:06.260 --> 02:07:08.530
hundreds of thousands
of servers potentially,

02:07:08.530 --> 02:07:11.470
it is quite possible that if you
and I and everyone else in the room

02:07:11.470 --> 02:07:13.630
clicks that egg at the
same time, it's not

02:07:13.630 --> 02:07:17.093
going to be the case statistically that
three lines of code are executed for me

02:07:17.093 --> 02:07:19.510
and then three lines for you
and then three lines for you.

02:07:19.510 --> 02:07:21.490
They're probably going
to get interspersed.

02:07:21.490 --> 02:07:24.430
This gets executed for me and
then this gets executed for you

02:07:24.430 --> 02:07:27.520
and then they get back to doing work
for me and so forth just to kind

02:07:27.520 --> 02:07:30.820
of multitask, just like a human
might, but at a super speed here.

02:07:30.820 --> 02:07:33.590
The problem, though, is
if these lines of code

02:07:33.590 --> 02:07:35.838
get interrupted, what could go wrong?

02:07:35.838 --> 02:07:38.630
Well, suppose that Carter and I
both click the egg at the same time

02:07:38.630 --> 02:07:41.780
and suppose the current number
of likes back in the day is 100.

02:07:41.780 --> 02:07:45.150
That stores in this
variable the value 100.

02:07:45.150 --> 02:07:49.010
But if we click so close in time,
we might get back the same answer

02:07:49.010 --> 02:07:50.300
to this select query.

02:07:50.300 --> 02:07:54.380
As of that moment in time when David
and Carter clicked, it had 100 likes.

02:07:54.380 --> 02:07:58.220
But then this last line of code is
executed for me and then maybe Carter.

02:07:58.220 --> 02:08:03.380
Because that answer, the state of the
database, was stored in this variable,

02:08:03.380 --> 02:08:08.870
then both Carter and I will
result in this line of code

02:08:08.870 --> 02:08:10.610
being executed with the same value.

02:08:10.610 --> 02:08:17.480
Update the post table setting the
likes equal to 101 for that post's ID.

02:08:17.480 --> 02:08:18.080
Why?

02:08:18.080 --> 02:08:21.950
Because again, if each of these lines
of code running on different servers

02:08:21.950 --> 02:08:25.213
are checking the value of
the current number of likes

02:08:25.213 --> 02:08:28.130
but then getting interrupted because
Carter clicked the darn thing too

02:08:28.130 --> 02:08:30.350
and then resuming their
work on my behalf,

02:08:30.350 --> 02:08:33.890
we might have a race condition where
the code is sort of racing to finish

02:08:33.890 --> 02:08:36.320
but getting interrupted
by other users' clicks.

02:08:36.320 --> 02:08:38.720
And the problem with
that is that if you are

02:08:38.720 --> 02:08:42.710
inspecting the value of some variable,
or in this case a database cell,

02:08:42.710 --> 02:08:45.650
and making a decision based
on it, like how to update it,

02:08:45.650 --> 02:08:46.995
you might now lose data.

02:08:46.995 --> 02:08:50.120
And Instagram is probably not good for
advertising if they're losing likes.

02:08:50.120 --> 02:08:54.620
And so that's probably a problem
not to retain the value 102

02:08:54.620 --> 02:08:57.230
and instead insert the number 101 twice.

02:08:57.230 --> 02:09:00.500
It's actually similar in
spirit to a story that

02:09:00.500 --> 02:09:05.960
was told in a databases course I
took myself years ago whereby--

02:09:05.960 --> 02:09:08.900
it's somewhat analogous to
kind of a contrived scenario

02:09:08.900 --> 02:09:10.070
involving a refrigerator.

02:09:10.070 --> 02:09:12.380
And this is the closest thing to a
refrigerator we could get on stage.

02:09:12.380 --> 02:09:14.420
But imagine you've got one
of these little dorm fridges

02:09:14.420 --> 02:09:16.040
in your dorm too and your roommate.

02:09:16.040 --> 02:09:19.102
And maybe both of you, as the story
was told to me, really like milk.

02:09:19.102 --> 02:09:21.560
And one of you is at class,
but the other of you comes home

02:09:21.560 --> 02:09:25.500
and you open your dorm fridge and you're
like, oh darn it, we're out of milk.

02:09:25.500 --> 02:09:27.020
And so you close the fridge.

02:09:27.020 --> 02:09:29.720
You walk across the street
to CVS or some other store

02:09:29.720 --> 02:09:31.490
and you get in line to buy some milk.

02:09:31.490 --> 02:09:33.020
Meanwhile, your roommate
gets out of class.

02:09:33.020 --> 02:09:34.160
They come back to your dorm room.

02:09:34.160 --> 02:09:35.702
They're really thirsty for some milk.

02:09:35.702 --> 02:09:37.010
They open up the fridge.

02:09:37.010 --> 02:09:38.510
They say, oh, we're out of milk.

02:09:38.510 --> 02:09:41.630
And then they take a
different route perhaps to CVS

02:09:41.630 --> 02:09:44.300
or some other store nearby,
get in line to buy some milk.

02:09:44.300 --> 02:09:48.260
Fast forward some amount of time in this
very contrived story and what happens?

02:09:48.260 --> 02:09:50.570
Oh damn it, we now ended
up with two gallons of milk

02:09:50.570 --> 02:09:54.480
and there's no way we can fit gallons
of milk in there, let alone two of them.

02:09:54.480 --> 02:09:56.000
So that's a problem.

02:09:56.000 --> 02:10:00.000
But what's the
relationship to this here?

02:10:00.000 --> 02:10:02.693
Well both of us, yeah, did what?

02:10:02.693 --> 02:10:07.523
AUDIENCE: [INAUDIBLE]

02:10:16.685 --> 02:10:17.560
DAVID MALAN: Exactly.

02:10:17.560 --> 02:10:19.925
AUDIENCE: [INAUDIBLE]

02:10:22.965 --> 02:10:23.840
DAVID MALAN: Exactly.

02:10:23.840 --> 02:10:26.480
So to summarize, both of us had
a very similar thought process,

02:10:26.480 --> 02:10:28.820
made a similar decision based
on the same information,

02:10:28.820 --> 02:10:31.760
not realizing that the
information, the fridge,

02:10:31.760 --> 02:10:34.600
was in the process of being updated.

02:10:34.600 --> 02:10:37.100
And of course, in the Instagram
world, it happens like this.

02:10:37.100 --> 02:10:39.300
In the fridge world, it
might take a few minutes.

02:10:39.300 --> 02:10:41.840
But the problem is
ultimately the result of our

02:10:41.840 --> 02:10:46.070
having made a decision about the state
of the world and the state of the world

02:10:46.070 --> 02:10:48.600
was in the middle of being updated.

02:10:48.600 --> 02:10:50.333
The queries got comingled with others.

02:10:50.333 --> 02:10:53.000
Or, in this case, someone was
already on their way to the store.

02:10:53.000 --> 02:10:54.960
So what's the solution
in the real world?

02:10:54.960 --> 02:10:59.000
Well, you could very simply take a
post it note and put like gone for milk

02:10:59.000 --> 02:11:02.060
so as to communicate to your
roommate that they should not

02:11:02.060 --> 02:11:04.620
inspect the value of that variable
and make a decision on it.

02:11:04.620 --> 02:11:05.120
Why?

02:11:05.120 --> 02:11:08.958
Because it's not yet consistent with
the outcome that's about to happen.

02:11:08.958 --> 02:11:11.000
You could be more dramatic
and you could actually

02:11:11.000 --> 02:11:13.580
lock the fridge somehow,
put a padlock around it

02:11:13.580 --> 02:11:15.470
or the like so they
can't even get in there.

02:11:15.470 --> 02:11:18.510
And that would achieve
the same effect too.

02:11:18.510 --> 02:11:22.400
And that is actually pretty much the
solution to this problem in code too.

02:11:22.400 --> 02:11:23.370
It's not safe.

02:11:23.370 --> 02:11:27.110
It's not sufficient to only execute
three lines of code like this.

02:11:27.110 --> 02:11:32.000
Rather, what you probably want to
do is use additional SQL keywords

02:11:32.000 --> 02:11:35.270
that we won't spend much time on
in the class itself, but these.

02:11:35.270 --> 02:11:37.160
There are solutions to this problem.

02:11:37.160 --> 02:11:39.320
You can begin what's
called a transaction

02:11:39.320 --> 02:11:43.460
and you can more explicitly
commit to making a decision,

02:11:43.460 --> 02:11:46.100
like updating the
database to 101 or 102.

02:11:46.100 --> 02:11:50.120
Or if you realize, wait a minute,
Carter's query is interrupting mine.

02:11:50.120 --> 02:11:53.540
Let me roll back to the
previous state and just rewind.

02:11:53.540 --> 02:11:54.290
Let me undo.

02:11:54.290 --> 02:11:55.840
Control Z, if you will.

02:11:55.840 --> 02:11:58.340
There's also another keyword
that's not so much used anymore

02:11:58.340 --> 02:11:59.600
in SQL which is locking.

02:11:59.600 --> 02:12:01.170
You could literally back in the day.

02:12:01.170 --> 02:12:05.690
Lock the entire database table,
preventing anyone from updating it

02:12:05.690 --> 02:12:09.320
or making changes or even reading it
while someone else was accessing it.

02:12:09.320 --> 02:12:12.320
That was a very heavy handed solution
because it slowed everything down.

02:12:12.320 --> 02:12:15.710
But in short, transactions
are now a feature of SQL

02:12:15.710 --> 02:12:17.840
that you won't necessarily
need to use yourselves

02:12:17.840 --> 02:12:22.490
that do solve this problem by doing
the equivalent of saying while David's

02:12:22.490 --> 02:12:26.360
like counter is in the process of
being updated, keep Carter at bay,

02:12:26.360 --> 02:12:29.510
ideally briefly, and then
let his data go through too.

02:12:29.510 --> 02:12:32.240
It's equivalent too to putting
a note or a lock on the fridge.

02:12:32.240 --> 02:12:33.920
And indeed, I mean lock literally.

02:12:33.920 --> 02:12:36.380
They were once upon a
time called and still

02:12:36.380 --> 02:12:41.480
are in some contexts called
locks on databases too.

02:12:41.480 --> 02:12:44.730
And the code for which you might
do this is almost the same.

02:12:44.730 --> 02:12:51.860
You simply wrap the three queries with
a transaction statement and a commit.

02:12:51.860 --> 02:12:57.800
And the term of art here is that
this makes your statements atomic.

02:12:57.800 --> 02:13:01.230
So atomic means they're either
all executed or not at all.

02:13:01.230 --> 02:13:04.340
That is they're all very tightly
coupled together without interruption.

02:13:04.340 --> 02:13:08.480
Transactions solves that problem and
avoid having two gallons of milk.

02:13:08.480 --> 02:13:11.360
And the last problem that
arises that is tragically

02:13:11.360 --> 02:13:14.510
so darn common in the
real world today is

02:13:14.510 --> 02:13:16.173
what's called a SQL injection attack.

02:13:16.173 --> 02:13:18.590
And it's what I alluded to
earlier with the question mark.

02:13:18.590 --> 02:13:20.298
So suppose you're in
the habit of logging

02:13:20.298 --> 02:13:23.240
into Yale websites with
your net ID or password

02:13:23.240 --> 02:13:25.950
or at Harvard, your Harvard
key and password as well.

02:13:25.950 --> 02:13:28.370
Suppose for the sake of
discussion that the people that

02:13:28.370 --> 02:13:32.360
implemented Harvard key log in allow
you to type in your email address,

02:13:32.360 --> 02:13:33.680
of course, and your password.

02:13:33.680 --> 02:13:37.160
But suppose that they are
using SQL underneath the hood

02:13:37.160 --> 02:13:39.770
to check your username
and password to make sure

02:13:39.770 --> 02:13:43.400
that you are David Malan or Carter
Zenke or whoever you claim to be.

02:13:43.400 --> 02:13:48.770
I haven't shown you the syntax yet,
but it turns out that in SQL, --

02:13:48.770 --> 02:13:51.410
is a special way of
indicating a comment.

02:13:51.410 --> 02:13:53.100
It means ignore everything to the right.

02:13:53.100 --> 02:13:57.080
So it's just like // in C or
the hash symbol in Python.

02:13:57.080 --> 02:13:59.430
-- just means ignore
everything to the right.

02:13:59.430 --> 02:14:01.440
And we've, of course,
seen single quotes.

02:14:01.440 --> 02:14:04.730
So one way to wage a
SQL injection attack

02:14:04.730 --> 02:14:08.870
is to try to inject malicious SQL
code into someone else's database

02:14:08.870 --> 02:14:10.220
without them realizing it.

02:14:10.220 --> 02:14:11.220
How do you do this?

02:14:11.220 --> 02:14:15.492
Well, suppose I log in as
malan@harvard.edu single quote dash,

02:14:15.492 --> 02:14:16.316
dash.

02:14:16.316 --> 02:14:20.510
I'm not double quoting anything clearly
and there's nothing to the right

02:14:20.510 --> 02:14:21.770
of the --

02:14:21.770 --> 02:14:23.150
anyway.

02:14:23.150 --> 02:14:25.500
But this imbalance is
going to be useful.

02:14:25.500 --> 02:14:26.000
Why?

02:14:26.000 --> 02:14:29.630
Because if I'm a hacker and I'm
presuming someone at Harvard probably

02:14:29.630 --> 02:14:33.680
is using single quotes to
wrap the user's email address

02:14:33.680 --> 02:14:38.360
and wrap the user's password, what if I
try to complete their thought for them

02:14:38.360 --> 02:14:40.370
and close one of those quotes for them?

02:14:40.370 --> 02:14:41.850
What might happen?

02:14:41.850 --> 02:14:43.220
Well, we could do this.

02:14:43.220 --> 02:14:46.250
Here for instance, let me hypothesize
is the code that Harvard wrote,

02:14:46.250 --> 02:14:48.050
hopefully not, underneath the hood.

02:14:48.050 --> 02:14:51.500
So they're using CS50's library in
Python and they're using SQL inside.

02:14:51.500 --> 02:14:54.350
Suppose that they have
a query like this.

02:14:54.350 --> 02:15:00.380
Select star from users where username
equals question mark and password

02:15:00.380 --> 02:15:01.315
equals question mark.

02:15:01.315 --> 02:15:04.190
And then suppose they just plug in
whatever username and password was

02:15:04.190 --> 02:15:05.070
typed in.

02:15:05.070 --> 02:15:08.090
And then if they get back some
number of rows dot dot dot,

02:15:08.090 --> 02:15:09.410
they assume I am David.

02:15:09.410 --> 02:15:12.200
They assume Carter is Carter if
both the username and password are

02:15:12.200 --> 02:15:12.890
in the database.

02:15:12.890 --> 02:15:14.480
Just end of story there.

02:15:14.480 --> 02:15:15.680
This is good.

02:15:15.680 --> 02:15:18.750
This has the question mark
placeholder, as we discussed earlier.

02:15:18.750 --> 02:15:21.140
But what if you don't
quite remember that?

02:15:21.140 --> 02:15:24.670
You don't quite take that to heart and
you use your more familiar last week

02:15:24.670 --> 02:15:28.690
fstrings whereby we use these
curly braces to plug in values.

02:15:28.690 --> 02:15:30.280
What if you do this instead?

02:15:30.280 --> 02:15:31.900
So it's almost the same idea.

02:15:31.900 --> 02:15:34.950
It's still db execute but now
it's select star from users

02:15:34.950 --> 02:15:36.120
where username equals.

02:15:36.120 --> 02:15:39.570
And now notice I'm doing the single
quotes, which is required by SQL,

02:15:39.570 --> 02:15:41.850
but I'm using fstrings
with the curly braces.

02:15:41.850 --> 02:15:46.530
And the password equals single quote
password and then close single quote.

02:15:46.530 --> 02:15:50.910
The problem is if you're just
blindly pasting effectively

02:15:50.910 --> 02:15:55.530
the user's input into that web form
into the username field and the password

02:15:55.530 --> 02:15:59.890
field, there's nothing stopping a
malicious user, student, faculty,

02:15:59.890 --> 02:16:03.090
staff from including a
single quote in their name.

02:16:03.090 --> 02:16:07.470
Or maybe even benevolently
if their name happens

02:16:07.470 --> 02:16:10.440
to have a single quote, as some
last names in particular do.

02:16:10.440 --> 02:16:12.230
So this is very fragile.

02:16:12.230 --> 02:16:12.730
Why?

02:16:12.730 --> 02:16:15.090
Well, suppose that if
we plug in my malicious,

02:16:15.090 --> 02:16:18.030
malan@harvard.edu single quote --

02:16:18.030 --> 02:16:20.430
notice what happens to username here.

02:16:20.430 --> 02:16:23.940
The username variable
inside of the curly quotes

02:16:23.940 --> 02:16:26.340
will get replaced with this.

02:16:26.340 --> 02:16:29.880
And notice single quote, which
the Harvard programmer wrote,

02:16:29.880 --> 02:16:35.969
malan@harvard.edu single
quote which I wrote --

02:16:35.969 --> 02:16:40.170
which I wrote single quote which
Harvard wrote and whatever else

02:16:40.170 --> 02:16:41.160
they want after that.

02:16:41.160 --> 02:16:43.524
What's the implication,
though, of the dash, dash?

02:16:46.722 --> 02:16:48.680
Everything to the right
is going to be ignored.

02:16:48.680 --> 02:16:51.730
So the password is never even
checked in this scenario.

02:16:51.730 --> 02:16:55.855
I'm tricking the server into
ignoring everything after the --

02:16:55.855 --> 02:17:00.650
but I have constructed very cleverly,
very maliciously a syntactically valid

02:17:00.650 --> 02:17:01.150
query.

02:17:01.150 --> 02:17:01.650
Why?

02:17:01.650 --> 02:17:03.580
Because I provided the
single quote that's

02:17:03.580 --> 02:17:06.670
going to finish the thought
of that first single quote.

02:17:06.670 --> 02:17:09.219
And now I would only know how
to do this if I saw the code

02:17:09.219 --> 02:17:12.209
or if I just randomly try putting
apostrophes into web forms

02:17:12.209 --> 02:17:13.209
and see if things break.

02:17:13.209 --> 02:17:15.215
That's often how
adversaries attack systems.

02:17:15.215 --> 02:17:17.590
They type in potentially
dangerous characters, hit Enter.

02:17:17.590 --> 02:17:20.510
If something breaks, they're
not necessarily into the system,

02:17:20.510 --> 02:17:22.757
but they know that there
might be a vulnerability.

02:17:22.757 --> 02:17:25.340
And then they start trying more
methodically things like this.

02:17:25.340 --> 02:17:27.850
So this then is going to be
bad, because it effectively

02:17:27.850 --> 02:17:29.500
grays out the rest of the query.

02:17:29.500 --> 02:17:32.889
And this query is surely
going to return some rows

02:17:32.889 --> 02:17:34.570
without even knowing my password.

02:17:34.570 --> 02:17:37.450
And so this logic here
dot dot dot means, well,

02:17:37.450 --> 02:17:39.850
if a data came back
from this query, Harvard

02:17:39.850 --> 02:17:43.480
is presumably going to
assume that Malan logged in.

02:17:43.480 --> 02:17:47.840
Show him his account or whatever
is being protected here.

02:17:47.840 --> 02:17:50.559
So in short, using fstrings bad.

02:17:50.559 --> 02:17:54.280
Using any equivalent like %s in C, bad.

02:17:54.280 --> 02:17:56.559
When it comes to SQL,
using question marks

02:17:56.559 --> 02:18:00.280
or whatever a third party
library like CS50 prescribes

02:18:00.280 --> 02:18:02.110
is the way to solve this.

02:18:02.110 --> 02:18:02.620
Why?

02:18:02.620 --> 02:18:07.540
Because libraries like ours are designed
to at least be smart and be paranoid.

02:18:07.540 --> 02:18:09.740
And what we will do is this.

02:18:09.740 --> 02:18:12.580
When you use the question marks
and the values are plugged in,

02:18:12.580 --> 02:18:16.660
we will escape any potentially
dangerous characters

02:18:16.660 --> 02:18:18.520
inside of those placeholders.

02:18:18.520 --> 02:18:21.580
And so effectively, the
single quote will no longer

02:18:21.580 --> 02:18:23.469
be considered a
grammatical single quote.

02:18:23.469 --> 02:18:26.870
It will just be literally a character
in the username or password.

02:18:26.870 --> 02:18:30.040
So the library takes
care of this for you

02:18:30.040 --> 02:18:33.500
because you're plugging in the username
and password as separate arguments.

02:18:33.500 --> 02:18:37.360
And then we or the third party
you're using actually sanitize.

02:18:37.360 --> 02:18:41.510
That is clean up the data and
prevent those bad characters.

02:18:41.510 --> 02:18:44.650
Now, this is kind of an internet
meme that went around for a while.

02:18:44.650 --> 02:18:47.320
If you've ever driven
a car or been in a car

02:18:47.320 --> 02:18:49.930
where there's the automatic
readers for tolls.

02:18:49.930 --> 02:18:52.969
This person thought it might be funny
to try doing something like this.

02:18:52.969 --> 02:18:54.309
What are they presumably doing?

02:18:54.309 --> 02:18:57.520
The presumption here is, whether
or not it worked is unclear,

02:18:57.520 --> 02:19:01.570
is that here's the end of
actual license plate number,

02:19:01.570 --> 02:19:03.903
but here's an interesting
single quote and a semicolon.

02:19:03.903 --> 02:19:06.070
That's especially bad because
it means you can maybe

02:19:06.070 --> 02:19:08.059
execute a second query on the database.

02:19:08.059 --> 02:19:11.170
This is someone having fun trying
to drop the entire database

02:19:11.170 --> 02:19:16.090
table for whatever municipality is
scanning through cameras their license

02:19:16.090 --> 02:19:16.750
plate code.

02:19:16.750 --> 02:19:19.150
And I would be remiss if
we didn't end on this note.

02:19:19.150 --> 02:19:22.240
At least in computer
science circles, there

02:19:22.240 --> 02:19:26.920
is someone named, no relation to the
TF name we put in the database earlier,

02:19:26.920 --> 02:19:31.660
little Bobby Tables, which
ends with this XKCD comic.

02:19:31.660 --> 02:19:36.340
And if you chuckle, if you laugh,
you're now legit SQL programmers.

02:19:39.049 --> 02:19:42.030
Nice, nice.

02:19:42.030 --> 02:19:44.740
Every CS student out there
knows about little Bobby Table.

02:19:44.740 --> 02:19:47.657
So if you name drop little
Bobby Tables now, you're in.

02:19:47.657 --> 02:19:49.240
All right, that's it though for today.

02:19:49.240 --> 02:19:51.780
We will see you next time.

02:19:51.780 --> 02:19:55.130
[MUSIC PLAYING]