WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:17.600 --> 00:00:19.880
DAVID MALAN: We've seen how
with languages like Python

00:00:19.880 --> 00:00:23.240
can we implement business logic
on a server and, even if we want,

00:00:23.240 --> 00:00:24.680
generate web pages.

00:00:24.680 --> 00:00:27.770
And we've seen, on the browser
side, the so-called client side,

00:00:27.770 --> 00:00:31.340
how you can render
information or data to users.

00:00:31.340 --> 00:00:33.770
And with CSS can we style
it, and with JavaScript can

00:00:33.770 --> 00:00:35.480
we even make it interactive.

00:00:35.480 --> 00:00:38.120
But when building a business
or any application that

00:00:38.120 --> 00:00:42.060
is interacting with users, where
is all that data being stored?

00:00:42.060 --> 00:00:44.060
Well, if you're running
your own small business,

00:00:44.060 --> 00:00:46.040
you might simply be using a spreadsheet.

00:00:46.040 --> 00:00:48.207
Of course, spreadsheets,
whether they're implemented

00:00:48.207 --> 00:00:51.050
as Microsoft Excel, or Google
Spreadsheets, or Apple Numbers,

00:00:51.050 --> 00:00:53.930
generally stores data
in rows and columns.

00:00:53.930 --> 00:00:56.990
And if you're like me, you
probably use that first row

00:00:56.990 --> 00:01:00.230
to represent the names of the
columns and the types of the data

00:01:00.230 --> 00:01:01.220
that you're storing.

00:01:01.220 --> 00:01:04.010
And as you continue to add more
and more data to that spreadsheet,

00:01:04.010 --> 00:01:08.120
you probably, like me, continue
to add row after row after row.

00:01:08.120 --> 00:01:11.720
And meanwhile, if you've got so much
data or so many different types of data

00:01:11.720 --> 00:01:14.510
that it doesn't really
belong in one sheet,

00:01:14.510 --> 00:01:16.833
you might actually create
a second sheet or a tab

00:01:16.833 --> 00:01:19.250
along the bottom of the screen,
for instance, in which you

00:01:19.250 --> 00:01:22.430
can store additional data as well.

00:01:22.430 --> 00:01:27.440
So indeed, Microsoft Excel allows you
to store data relationally, so to speak.

00:01:27.440 --> 00:01:31.550
You might have one sheet with customer
info, another sheet with invoices,

00:01:31.550 --> 00:01:35.480
another sheet with products, or
any number of other types of data.

00:01:35.480 --> 00:01:40.610
And you can relate those to another
by somehow having some commonality

00:01:40.610 --> 00:01:45.360
among them, some identifier, some
name, some customer ID or the like.

00:01:45.360 --> 00:01:48.050
And so data that tends to be
stored in rows and columns

00:01:48.050 --> 00:01:49.502
can have these relationships.

00:01:49.502 --> 00:01:51.710
And you can do this certainly
in the cloud these days

00:01:51.710 --> 00:01:55.580
as well with Google Docs, simply
a web-based version of the same.

00:01:55.580 --> 00:01:58.940
But when designing your
spreadsheets, or more

00:01:58.940 --> 00:02:01.945
generally, your data's
schema, the design thereof,

00:02:01.945 --> 00:02:04.070
there's a number of questions
you need to consider.

00:02:04.070 --> 00:02:05.600
What data do you want to store?

00:02:05.600 --> 00:02:07.100
And where do you want to store it?

00:02:07.100 --> 00:02:09.623
And what type of data is it, in fact?

00:02:09.623 --> 00:02:12.290
Because whether using Excel, or
Numbers, or Google Spreadsheets,

00:02:12.290 --> 00:02:16.970
odds are you sometimes configure the
columns to be displayed or formatted

00:02:16.970 --> 00:02:17.930
in some ways.

00:02:17.930 --> 00:02:20.150
Numbers might be displayed
with or without commas.

00:02:20.150 --> 00:02:23.330
Numbers might be displayed with
dollar signs or some other symbol.

00:02:23.330 --> 00:02:25.700
You might separate
one field from another

00:02:25.700 --> 00:02:27.990
using special symbols or punctuation.

00:02:27.990 --> 00:02:31.580
So these are all configuration
options in a spreadsheet that might

00:02:31.580 --> 00:02:33.680
be among the first decisions you make.

00:02:33.680 --> 00:02:35.930
In fact, let's consider
a sample spreadsheet

00:02:35.930 --> 00:02:39.930
wherein I might want to represent
users in some web application.

00:02:39.930 --> 00:02:42.680
All of us have, of course, registered
for websites or applications

00:02:42.680 --> 00:02:43.710
these days.

00:02:43.710 --> 00:02:46.190
And what are some of the
questions you're asked?

00:02:46.190 --> 00:02:49.850
Well, you might be asked for your
user name by some application.

00:02:49.850 --> 00:02:52.405
You might be asked for your actual name.

00:02:52.405 --> 00:02:53.780
What else might you be asked for?

00:02:53.780 --> 00:02:56.180
Perhaps your email address.

00:02:56.180 --> 00:02:58.970
And perhaps if you're buying
something, or registering

00:02:58.970 --> 00:03:01.940
for something, or someone that
needs to know a bit more about you,

00:03:01.940 --> 00:03:05.450
perhaps they'll even ask you for your
address, maybe your phone number,

00:03:05.450 --> 00:03:07.590
or perhaps even your age.

00:03:07.590 --> 00:03:10.670
So you can certainly imagine there
being even other fields that you need.

00:03:10.670 --> 00:03:15.740
But let's consider now what types
of data each of these fields is.

00:03:15.740 --> 00:03:17.600
Now, what should a user name be?

00:03:17.600 --> 00:03:21.050
Well, by definition, this should be
a unique value that belongs to you

00:03:21.050 --> 00:03:22.910
and only you on a system.

00:03:22.910 --> 00:03:25.010
For instance, if I'm
registering for a website

00:03:25.010 --> 00:03:28.400
and that website is storing some
data in, for instance, a spreadsheet,

00:03:28.400 --> 00:03:31.730
I might ideally try to claim
Malan if it's available.

00:03:31.730 --> 00:03:33.800
My name meanwhile, of
course, will be just

00:03:33.800 --> 00:03:36.710
David Malan, which someone else
can certainly have as well.

00:03:36.710 --> 00:03:39.320
Hopefully my email address is only mine.

00:03:39.320 --> 00:03:42.020
And that, too, will be stored
in a column of its own.

00:03:42.020 --> 00:03:46.860
And then my address, for instance, here
on campus, 33 Oxford Street, Cambridge,

00:03:46.860 --> 00:03:49.940
Massachusetts, 02138.

00:03:49.940 --> 00:03:54.710
A phone number by which I
can be reached, 617-495-5000.

00:03:54.710 --> 00:03:56.450
And then my age shall be--

00:03:56.450 --> 00:03:58.430
well, we'll leave that
blank just for now.

00:03:58.430 --> 00:04:02.390
So notice how each of these pieces of
data is just a little bit different.

00:04:02.390 --> 00:04:04.100
Some of them seem to be numeric.

00:04:04.100 --> 00:04:06.020
Some of them seem to be alphabetic.

00:04:06.020 --> 00:04:08.900
Some of them even have
some punctuation therein.

00:04:08.900 --> 00:04:11.120
Now, you might or might
not in a spreadsheet

00:04:11.120 --> 00:04:13.820
care to display this
data a bit differently.

00:04:13.820 --> 00:04:16.700
Indeed, the only number that
I'm sweeping under the rug

00:04:16.700 --> 00:04:18.019
here is, in fact, my age.

00:04:18.019 --> 00:04:21.390
But hopefully I don't need one
or more commas in that value.

00:04:21.390 --> 00:04:25.190
So there isn't really any need
for special formatting here.

00:04:25.190 --> 00:04:27.350
But it turns out that
when you're actually

00:04:27.350 --> 00:04:30.950
building a website or
software-based application,

00:04:30.950 --> 00:04:34.970
you're probably not storing your
data ultimately in just spreadsheets.

00:04:34.970 --> 00:04:39.470
You might graduate eventually from being
a small shop to needing more than just

00:04:39.470 --> 00:04:41.478
Microsoft Excel, or
Numbers, or even something

00:04:41.478 --> 00:04:43.020
cloud-based like Google Spreadsheets.

00:04:43.020 --> 00:04:43.700
Why?

00:04:43.700 --> 00:04:47.580
Well, you have more rows than those
programs can generally handle.

00:04:47.580 --> 00:04:49.670
In fact, on your own
Mac or PC, odds are,

00:04:49.670 --> 00:04:53.600
when opening up big files, whether
a spreadsheet or any other document,

00:04:53.600 --> 00:04:55.430
sometimes you might actually feel that.

00:04:55.430 --> 00:04:57.380
The computer might
start to slow, and you

00:04:57.380 --> 00:05:00.200
might start to see a spinning
beach ball or hourglass

00:05:00.200 --> 00:05:02.090
because you start to
hit the limits of what

00:05:02.090 --> 00:05:05.240
a local computer can do with
just client-side software,

00:05:05.240 --> 00:05:06.950
like a spreadsheet tool.

00:05:06.950 --> 00:05:10.730
And so eventually you might actually
need to use a proper database.

00:05:10.730 --> 00:05:13.460
And a database is really
just a piece of software

00:05:13.460 --> 00:05:15.470
that can absolutely
run on your Mac or PC.

00:05:15.470 --> 00:05:19.010
But very commonly it runs on
a server or somewhere else

00:05:19.010 --> 00:05:21.670
in the cloud to which your
own software connects.

00:05:21.670 --> 00:05:24.170
For instance, if you're building
a website or an application

00:05:24.170 --> 00:05:27.110
in a language like Python,
you can, in Python,

00:05:27.110 --> 00:05:32.570
write code that talks or communicates
with that database to pull data down

00:05:32.570 --> 00:05:34.730
and to send data back up.

00:05:34.730 --> 00:05:37.170
But spreadsheets are
wonderfully straightforward.

00:05:37.170 --> 00:05:40.310
It's just so intuitive how you
store the data in rows and columns

00:05:40.310 --> 00:05:43.070
and just more and more rows as
you have more and more data.

00:05:43.070 --> 00:05:48.410
And so what's nice about some databases
is that they mimic exactly this design.

00:05:48.410 --> 00:05:51.560
There exists in the world what
are called relational databases.

00:05:51.560 --> 00:05:55.197
And indeed, this is among the most
common ways to store data relationally.

00:05:55.197 --> 00:05:57.530
The jargon is a bit different
in the world of databases.

00:05:57.530 --> 00:05:59.720
You no longer call these
things spreadsheets.

00:05:59.720 --> 00:06:01.160
We call them databases.

00:06:01.160 --> 00:06:03.830
And we don't call the
individual tabs sheets.

00:06:03.830 --> 00:06:05.240
We call them tables.

00:06:05.240 --> 00:06:08.240
But we continue to call the
structure of the data therein

00:06:08.240 --> 00:06:11.090
rows for each additional
piece of data and columns

00:06:11.090 --> 00:06:13.430
for the different types
of data that we have.

00:06:13.430 --> 00:06:18.500
But in a relational database, such as
Oracle, Microsoft Access, SQL Server,

00:06:18.500 --> 00:06:23.030
My SQL, Postgres, or something smaller
and lighter weight called SQLlite,

00:06:23.030 --> 00:06:26.930
the burden is on you, the
designer of the database,

00:06:26.930 --> 00:06:29.000
or the programmer, or
the business person

00:06:29.000 --> 00:06:32.690
to actually decide on what types
of data you are going to store.

00:06:32.690 --> 00:06:35.660
Because the database, in order
to handle more and more data

00:06:35.660 --> 00:06:39.350
than a typical spreadsheet can
support, needs a bit of help from you,

00:06:39.350 --> 00:06:42.680
needs to know what type of data you're
storing so that it can search it

00:06:42.680 --> 00:06:45.170
more efficiently and
sort it more effectively

00:06:45.170 --> 00:06:48.530
and make it easier for you
ultimately to add and remove data

00:06:48.530 --> 00:06:51.075
from that particular database.

00:06:51.075 --> 00:06:53.450
So in other words, you can do
all of the same operations.

00:06:53.450 --> 00:06:56.420
But generally, in a database,
you have so much more data,

00:06:56.420 --> 00:07:00.150
you're going to need to have the
database help you help yourself.

00:07:00.150 --> 00:07:02.232
So what does that actually mean here?

00:07:02.232 --> 00:07:04.190
Well, let's consider some
of these fields here.

00:07:04.190 --> 00:07:08.630
Were we to migrate my spreadsheet
with many, many, many rows of users,

00:07:08.630 --> 00:07:12.570
for instance, to a proper
database relationally,

00:07:12.570 --> 00:07:15.090
I'm going to have to make a
few more decisions as well.

00:07:15.090 --> 00:07:18.530
And it turns out that I need to choose
various types for these columns.

00:07:18.530 --> 00:07:22.740
And generally, in a database, you
only have a limited menu of options.

00:07:22.740 --> 00:07:26.740
So let's take a look
at what those might be.

00:07:26.740 --> 00:07:28.960
Here in many relational
databases are just

00:07:28.960 --> 00:07:31.940
some of the data types
that are available to you;

00:07:31.940 --> 00:07:35.560
an integer if you want to represent
something like 1, 2, 3, or perhaps

00:07:35.560 --> 00:07:39.250
even a negative, a real number,
otherwise known as a floating point

00:07:39.250 --> 00:07:41.200
value, for instance,
in Python that actually

00:07:41.200 --> 00:07:44.530
has a decimal point and perhaps
some numbers thereafter,

00:07:44.530 --> 00:07:47.530
numeric, which is more of a catch-all
and might handle things like dates

00:07:47.530 --> 00:07:50.800
and times that are numbers in
some sense but have some more

00:07:50.800 --> 00:07:53.770
formal structure to them, and then
the more general text, when you just

00:07:53.770 --> 00:07:56.830
have words or phrases,
characters or whole paragraphs

00:07:56.830 --> 00:08:00.070
or more that you might want to
store as well in some column.

00:08:00.070 --> 00:08:03.400
And then lastly, you might
have the cutely named BLOB,

00:08:03.400 --> 00:08:07.030
or binary large object, when you
actually want to store zeros and ones,

00:08:07.030 --> 00:08:10.730
that is to say binary data, like
actual files in your database.

00:08:10.730 --> 00:08:12.670
However, you needn't do this typically.

00:08:12.670 --> 00:08:16.690
You can actually store files, of course,
on a file system on your own hard drive

00:08:16.690 --> 00:08:20.260
somewhere on the server, but
that option exists for you.

00:08:20.260 --> 00:08:23.110
Now, this happens to be the
list of data types supported

00:08:23.110 --> 00:08:27.460
by the simplest of relational
databases, something called SQLite.

00:08:27.460 --> 00:08:31.300
And as its name implies, it actually
supports a very specific language

00:08:31.300 --> 00:08:36.890
via which you or anyone can request
data from the database and store

00:08:36.890 --> 00:08:39.760
data in the database, or
update it, or delete it.

00:08:39.760 --> 00:08:44.440
And that language is called
SQL, structured query language.

00:08:44.440 --> 00:08:47.200
SQL is a language via
which you can do exactly

00:08:47.200 --> 00:08:51.670
that, retrieve data from a database, put
data into a database, and so much more.

00:08:51.670 --> 00:08:54.867
It's an alternative to the pointing
and clicking with which you're probably

00:08:54.867 --> 00:08:57.700
familiar with something like Microsoft
Excel, or Google spreadsheet,

00:08:57.700 --> 00:08:59.830
or Apple Numbers, where
it's really you, the human,

00:08:59.830 --> 00:09:03.250
doing all the work via a
graphical user interface.

00:09:03.250 --> 00:09:05.710
Now, to be fair, in all of
this spreadsheet programs

00:09:05.710 --> 00:09:08.110
are there functions or
macros that you might

00:09:08.110 --> 00:09:11.540
be able to write so that you can
program in those environments

00:09:11.540 --> 00:09:13.690
but with some limitations.

00:09:13.690 --> 00:09:17.680
And indeed with SQL and with
SQL databases, more generally,

00:09:17.680 --> 00:09:20.680
that is databases that
understand this language, can you

00:09:20.680 --> 00:09:24.430
achieve far higher performance,
can you store far more data,

00:09:24.430 --> 00:09:26.380
and nonetheless get at it quickly.

00:09:26.380 --> 00:09:29.860
So it's sort of the next evolution
of what you might otherwise do only

00:09:29.860 --> 00:09:31.810
within your own computer spreadsheet.

00:09:31.810 --> 00:09:36.490
But many other databases, Oracle,
SQL Server, Microsoft Access, MySQL,

00:09:36.490 --> 00:09:39.100
Postgres and more, support
more than these data

00:09:39.100 --> 00:09:42.760
types that allow you to help
the database help you even more.

00:09:42.760 --> 00:09:46.700
Because the more the database knows
about your data, the smarter decisions

00:09:46.700 --> 00:09:49.637
it can make at a lower level
below these abstractions

00:09:49.637 --> 00:09:51.970
and storing that data so that,
when you have a question,

00:09:51.970 --> 00:09:54.260
it can answer it quickly.

00:09:54.260 --> 00:09:57.760
And so how might you refine
what we mean by integer?

00:09:57.760 --> 00:09:59.890
Well, in some databases,
you might want to tell it

00:09:59.890 --> 00:10:03.370
that this is just a small
integer in this column one

00:10:03.370 --> 00:10:04.990
after the other, row by row.

00:10:04.990 --> 00:10:08.020
Or maybe it's just an integer
somewhere a little bigger

00:10:08.020 --> 00:10:12.070
than that, or perhaps a big int,
which means it can be even larger.

00:10:12.070 --> 00:10:14.860
Now, these actually map to
very well-defined values.

00:10:14.860 --> 00:10:19.630
If you think back to how we considered
data itself is stored in a computer,

00:10:19.630 --> 00:10:22.300
it's ultimately with
zeros and ones or bits.

00:10:22.300 --> 00:10:26.110
And indeed, an integer, so
to speak, generally takes up

00:10:26.110 --> 00:10:28.930
just 32 bits or four bytes.

00:10:28.930 --> 00:10:31.930
But, of course, that few
bits actually translates,

00:10:31.930 --> 00:10:35.530
if you do a bit of the math, to
four billion possible values.

00:10:35.530 --> 00:10:39.160
So if you were to specify
that a column in your database

00:10:39.160 --> 00:10:43.300
is of type integer, that means you
could type in the number from zero

00:10:43.300 --> 00:10:46.570
all the way on up to four billion,
or if you want negative numbers,

00:10:46.570 --> 00:10:51.400
from negative two billion roughly
to positive two billion as well.

00:10:51.400 --> 00:10:54.580
But beyond that three billion
or four, you're out of luck

00:10:54.580 --> 00:10:56.320
if you're only using 32 bits.

00:10:56.320 --> 00:10:57.460
Now, that might seem huge.

00:10:57.460 --> 00:11:02.170
And that is a good problem to have if
I have four billion users, or sales,

00:11:02.170 --> 00:11:04.280
or anything in my database.

00:11:04.280 --> 00:11:08.080
But for certain large companies
these days, having four billion

00:11:08.080 --> 00:11:12.040
or more records of some
sort, transactions or logs,

00:11:12.040 --> 00:11:13.420
not at all uncommon.

00:11:13.420 --> 00:11:15.700
And for those purposes
does there exist big int,

00:11:15.700 --> 00:11:20.080
a bigger data type that uses 64
bits or eight bytes via which

00:11:20.080 --> 00:11:22.743
you can count much, much,
much higher, so high

00:11:22.743 --> 00:11:24.910
that I'm not even sure how
to pronounce that number.

00:11:24.910 --> 00:11:29.110
But indeed, it should be large enough
for almost all of our purposes.

00:11:29.110 --> 00:11:32.890
So ultimately, it's not enough to
say that the field in a database

00:11:32.890 --> 00:11:38.440
is an integer but to specify just how
big maximally that integer can be.

00:11:38.440 --> 00:11:42.400
Now, fortunately, I think for my age we
could probably get away with small int.

00:11:42.400 --> 00:11:45.510
But indeed, with a small int,
you typically have 16 bits.

00:11:45.510 --> 00:11:52.270
So even then could your user
be as old as 65,535 years old.

00:11:52.270 --> 00:11:53.750
Now, what about real numbers?

00:11:53.750 --> 00:11:56.260
These are floating point
values with decimal points.

00:11:56.260 --> 00:12:00.100
And here, too, you have fewer choices,
typically, but along the same lines.

00:12:00.100 --> 00:12:03.310
A real number would
typically take 32 bits,

00:12:03.310 --> 00:12:07.150
which allows you only some amount
of precision, some number of digits

00:12:07.150 --> 00:12:08.470
after that decimal point.

00:12:08.470 --> 00:12:11.440
If you want to go even further
and be ever more precise

00:12:11.440 --> 00:12:14.590
can you use twice as much
space, 64 bits or eight

00:12:14.590 --> 00:12:17.740
bytes, and use double precision instead.

00:12:17.740 --> 00:12:20.920
Well, it would seem that
we sort of start with an OK

00:12:20.920 --> 00:12:23.360
value and an even better value here.

00:12:23.360 --> 00:12:25.355
But what's the trade-off,
as there always is?

00:12:25.355 --> 00:12:27.480
Well, it's fine if want to
be able to count higher,

00:12:27.480 --> 00:12:29.820
whether with integers or real numbers.

00:12:29.820 --> 00:12:32.190
But you are literally
going to pay a price.

00:12:32.190 --> 00:12:34.770
Because if you want to count
higher than four billion

00:12:34.770 --> 00:12:36.900
in the context of integers,
you're going to have

00:12:36.900 --> 00:12:40.890
to spend twice as much space
from four bytes to eight

00:12:40.890 --> 00:12:42.390
in order to store those values.

00:12:42.390 --> 00:12:46.560
And if you're never actually going to
use values larger than four billion,

00:12:46.560 --> 00:12:49.420
you're just spending twice
as much space as you need to.

00:12:49.420 --> 00:12:51.810
And if you don't really
need to store floating point

00:12:51.810 --> 00:12:56.220
values to terribly many digits of
precision, you're just wasting space.

00:12:56.220 --> 00:12:59.240
Now, for small databases,
this might be inconsequential.

00:12:59.240 --> 00:13:01.740
But for the Googles, and
Microsofts, and others of the world

00:13:01.740 --> 00:13:07.050
to deal in gigabytes and terabytes
of data, all of these bits and bytes

00:13:07.050 --> 00:13:08.670
will surely add up.

00:13:08.670 --> 00:13:12.010
As for numeric, more of a
catch-all numerically, so to speak,

00:13:12.010 --> 00:13:16.110
whereby you have Booleans, zeros
or ones, or false and true values,

00:13:16.110 --> 00:13:18.990
dates which comes in a
prescribed format, typically

00:13:18.990 --> 00:13:24.270
year, year, year, year, dash,
month, month, dash, day, day.

00:13:24.270 --> 00:13:26.382
Indeed, even though
across the world there

00:13:26.382 --> 00:13:29.340
are different formats and different
communities for representing dates,

00:13:29.340 --> 00:13:33.820
in SQL do you specify them
exactly in that format only.

00:13:33.820 --> 00:13:38.460
Date time adds to that a space followed
by hour, hour, colon, minute, minute,

00:13:38.460 --> 00:13:42.240
colon, second, second, whereby if you
want to store both date and a time,

00:13:42.240 --> 00:13:45.720
you should store it in exactly that
format in your database, even though,

00:13:45.720 --> 00:13:48.930
thanks to code, whether
Python or SQL itself,

00:13:48.930 --> 00:13:52.080
you want to display that
date in a geographically

00:13:52.080 --> 00:13:54.810
localized and conventional
way, you nonetheless

00:13:54.810 --> 00:13:57.580
store it in this way in your database.

00:13:57.580 --> 00:14:01.680
Now, sometimes you need to fight
against the imprecision that's

00:14:01.680 --> 00:14:04.480
inherent as a risk with real numbers.

00:14:04.480 --> 00:14:07.110
And so you can use the
so-called numeric data type

00:14:07.110 --> 00:14:09.570
to specify a so-called
scale and precision.

00:14:09.570 --> 00:14:13.290
That is two numbers that
specify really how many digits

00:14:13.290 --> 00:14:15.490
you ultimately want to support.

00:14:15.490 --> 00:14:17.640
So if you want to store
things like dollar amounts,

00:14:17.640 --> 00:14:20.610
maybe to cents or hundreds
of cents, two or four decimal

00:14:20.610 --> 00:14:23.520
places, can you specify
exactly that, and you

00:14:23.520 --> 00:14:26.460
will get exactly that much precision.

00:14:26.460 --> 00:14:30.090
As for time and timestamp, well, time is
quite simply hour, hour, colon, minute,

00:14:30.090 --> 00:14:32.050
minute, colon, second, second.

00:14:32.050 --> 00:14:34.650
But timestamp's a
little more interesting.

00:14:34.650 --> 00:14:37.200
It, too, can be used to represent time.

00:14:37.200 --> 00:14:40.860
But as with many languages,
Python and JavaScript among them,

00:14:40.860 --> 00:14:43.560
it does so typically by
remembering or counting

00:14:43.560 --> 00:14:48.690
some number of milliseconds or seconds
from a particular point in time.

00:14:48.690 --> 00:14:51.570
In fact, some years ago, it was
conventional to simply start

00:14:51.570 --> 00:14:56.370
counting time from January 1st,
1970, the so-called epic as opposed

00:14:56.370 --> 00:14:58.950
to say 0 BC or AD.

00:14:58.950 --> 00:15:03.690
Unfortunately, it turns out that you
generally represent these timestamps

00:15:03.690 --> 00:15:08.250
with four bytes or 32 bits, which
gives you access to four billion

00:15:08.250 --> 00:15:09.900
or so possible values.

00:15:09.900 --> 00:15:13.170
And it turns out that
in the year 2038, we

00:15:13.170 --> 00:15:16.770
will have run out of bits
with which to represent time.

00:15:16.770 --> 00:15:20.460
So if you recall the so-called Y2K
problem when we ran into this issue

00:15:20.460 --> 00:15:24.180
around the year 2000, so have
we rather painted ourselves

00:15:24.180 --> 00:15:28.290
into a corner in many systems
by not allocating enough space

00:15:28.290 --> 00:15:30.120
to get us past that year.

00:15:30.120 --> 00:15:33.110
Well, what's the solution
invariably going to be?

00:15:33.110 --> 00:15:35.700
Well, use more space and, thus, cost.

00:15:35.700 --> 00:15:38.400
But that is inherent in many systems.

00:15:38.400 --> 00:15:41.880
And it's one of these things that
us humans will have to address.

00:15:41.880 --> 00:15:43.040
How about text?

00:15:43.040 --> 00:15:46.088
Well, text can, too,
come in multiple forms.

00:15:46.088 --> 00:15:47.880
Whether it's English
or any other language,

00:15:47.880 --> 00:15:49.770
you might have different
lengths thereof.

00:15:49.770 --> 00:15:52.890
And so within the world of
text values and databases,

00:15:52.890 --> 00:15:55.800
need you specify typically
how many characters

00:15:55.800 --> 00:15:57.750
you want to store in that column.

00:15:57.750 --> 00:16:00.750
And this is distinct from something
like a spreadsheet program

00:16:00.750 --> 00:16:04.410
where you can probably keep typing and
typing or pasting as many characters

00:16:04.410 --> 00:16:05.490
as you'd like.

00:16:05.490 --> 00:16:09.810
A database, in so far as its purpose in
life is to give you better performance

00:16:09.810 --> 00:16:12.510
and more scalability than
a spreadsheet allows,

00:16:12.510 --> 00:16:14.760
wants you to tell it a bit more.

00:16:14.760 --> 00:16:16.620
It's going to ask you
to decide whether you

00:16:16.620 --> 00:16:19.320
want to store a fixed number of chars.

00:16:19.320 --> 00:16:22.650
So char followed by a number
here represented at its end

00:16:22.650 --> 00:16:29.418
means how many characters exactly shall
you store in every cell in this column.

00:16:29.418 --> 00:16:30.835
Now, why might that be applicable?

00:16:30.835 --> 00:16:33.120
Well, in the US, for
instance, for states,

00:16:33.120 --> 00:16:35.250
we have a standardization
of two characters

00:16:35.250 --> 00:16:36.970
for every state in the country.

00:16:36.970 --> 00:16:39.570
And so you might say char
(2), thereby allowing

00:16:39.570 --> 00:16:43.230
you to store CT for
Connecticut, and FL for Florida,

00:16:43.230 --> 00:16:47.280
and MA for Massachusetts and so on,
because you know that every value is

00:16:47.280 --> 00:16:49.800
going to be that finite length, two.

00:16:49.800 --> 00:16:51.840
But sometimes you don't
know the max length.

00:16:51.840 --> 00:16:56.640
And indeed, in my own table of users,
my own name might have D-A-V-I-D

00:16:56.640 --> 00:17:00.060
and a space and M-A-L-A-N, so 11 for me.

00:17:00.060 --> 00:17:02.437
But your name might
be shorter or longer.

00:17:02.437 --> 00:17:04.770
Or some user that hasn't even
registered for our website

00:17:04.770 --> 00:17:07.980
yet might have an even
longer name than that.

00:17:07.980 --> 00:17:11.609
So varchar, or variable
number of chars, exists, too,

00:17:11.609 --> 00:17:13.859
where you can specify not a
fixed number of characters

00:17:13.859 --> 00:17:19.109
to use but a maximum upper
bound, perhaps something like 30,

00:17:19.109 --> 00:17:22.230
perhaps something like 300,
or some number of characters

00:17:22.230 --> 00:17:26.079
that you decide that's going to be the
upper limit of any human I actually

00:17:26.079 --> 00:17:29.410
see in my sight who might have a
name as many as that characters.

00:17:29.410 --> 00:17:32.350
But varchar is smart and
the databases that use it,

00:17:32.350 --> 00:17:35.110
because it will use that
many characters maximally.

00:17:35.110 --> 00:17:37.220
But it's not going to waste space.

00:17:37.220 --> 00:17:42.010
In fact, if you're using 30 characters
for every name in your table,

00:17:42.010 --> 00:17:45.890
well, it's only going to
store as many characters are

00:17:45.890 --> 00:17:49.870
necessary, perhaps plus a
bit of overhead or accounting

00:17:49.870 --> 00:17:51.340
to keep track of that value.

00:17:51.340 --> 00:17:56.800
But that way you save space but still
can handle large text or large strings.

00:17:56.800 --> 00:18:00.550
Lastly, if you have particularly large
text, whole paragraphs, or essays,

00:18:00.550 --> 00:18:02.440
or documents that
someone might paste in,

00:18:02.440 --> 00:18:05.232
or the contents of an entire web
page that you want to search,

00:18:05.232 --> 00:18:07.190
well, you can have a text
field more generally,

00:18:07.190 --> 00:18:11.290
which tends to support tens of
thousands or more characters in total.

00:18:11.290 --> 00:18:14.260
But it does so in a way
that's not quite as efficient.

00:18:14.260 --> 00:18:17.720
Rather than keep it right in
the column itself, so to speak,

00:18:17.720 --> 00:18:20.500
it puts it somewhere else on
the server, the result of which

00:18:20.500 --> 00:18:23.620
is that you can fit more
room over here, so to speak.

00:18:23.620 --> 00:18:27.520
But it takes more time and effort
to go search or find that data, so,

00:18:27.520 --> 00:18:30.610
again, a trade-off of space and time.

00:18:30.610 --> 00:18:33.170
But here, too, as with
integers and real,

00:18:33.170 --> 00:18:37.000
you seem to have multiple choices, one
of which at first glance is better.

00:18:37.000 --> 00:18:41.543
After all, why use char ever if
you could also just say varchar

00:18:41.543 --> 00:18:42.460
with that same amount?

00:18:42.460 --> 00:18:45.550
For instance, for those two
character codes for states,

00:18:45.550 --> 00:18:50.320
why not just say varchar
and not just char (2)?

00:18:50.320 --> 00:18:53.650
Well, it turns out that if
you promise the database

00:18:53.650 --> 00:18:58.900
that every cell in your column will
use exactly the same number of bytes,

00:18:58.900 --> 00:19:03.220
it turns out that you'll have
very straight or justified edges,

00:19:03.220 --> 00:19:05.890
in some sense, conceptually
for that column.

00:19:05.890 --> 00:19:08.560
For instance, if every
cell in a column takes up

00:19:08.560 --> 00:19:11.740
exactly the same amount of
space, then you might know,

00:19:11.740 --> 00:19:15.250
if you start numbering the bytes
that are represented by those cells,

00:19:15.250 --> 00:19:18.910
this might be byte zero up here
and then two characters away

00:19:18.910 --> 00:19:22.670
will be address two, and then
four, and then six, and then eight.

00:19:22.670 --> 00:19:26.320
And if you think back to how
algorithms like, say, binary search

00:19:26.320 --> 00:19:30.610
are implemented, as soon as you have
the ability to know arithmetically

00:19:30.610 --> 00:19:34.780
where the next value, or the last
value, or, best yet, the middle value

00:19:34.780 --> 00:19:39.880
is, numerically can you jump right
to it in so-called constant time,

00:19:39.880 --> 00:19:44.260
thereby enabling something like binary
search and ultimately logarithmic time.

00:19:44.260 --> 00:19:45.820
So this is a powerful thing.

00:19:45.820 --> 00:19:48.220
If you commit to the database
that all of your values

00:19:48.220 --> 00:19:50.740
will take up the exact
amount of space, the database

00:19:50.740 --> 00:19:54.820
can search that value faster for you.

00:19:54.820 --> 00:20:00.520
Varchar, unfortunately, is rather akin
to having only a left justified column,

00:20:00.520 --> 00:20:03.713
whereby you might have data that
looks very straight on one edge

00:20:03.713 --> 00:20:05.380
with all of the characters left aligned.

00:20:05.380 --> 00:20:07.630
But because the length
of those characters vary,

00:20:07.630 --> 00:20:11.080
this cell might be this wide, and this
one this wide, and this one this wide,

00:20:11.080 --> 00:20:12.430
and this one this wide.

00:20:12.430 --> 00:20:15.790
And as such, because you have a
very ragged edge, so to speak,

00:20:15.790 --> 00:20:18.610
you can't just do simple
math and add two and add two

00:20:18.610 --> 00:20:21.350
to get to the next value in that column.

00:20:21.350 --> 00:20:24.370
So in the worst case, if a
column is only a varchar,

00:20:24.370 --> 00:20:27.520
the best the database can do by
default is just linear search,

00:20:27.520 --> 00:20:31.810
or big O of n, which, recall,
was notably slower than something

00:20:31.810 --> 00:20:34.250
like log N or logarithmic time.

00:20:34.250 --> 00:20:37.250
So the more help you can provide to
the database the better off you are.

00:20:37.250 --> 00:20:39.792
But of course you don't want to
air too far on the other side

00:20:39.792 --> 00:20:45.190
and say, oh, well, I'm going
to have char 300 for every cell

00:20:45.190 --> 00:20:50.200
in my Names column, because then you're
using 300 characters by definition

00:20:50.200 --> 00:20:54.970
for just David Malan
who needs just, say, 11.

00:20:54.970 --> 00:20:58.570
So lastly is there this thing
BLOB, binary large object,

00:20:58.570 --> 00:21:03.160
which can only store binary data,
which can be used for images or files.

00:21:03.160 --> 00:21:08.230
But those, again, tend to be best
stored somewhere else on the system.

00:21:08.230 --> 00:21:10.720
So with these possibilities
in mind, how might we

00:21:10.720 --> 00:21:13.450
go about designing the
schema for this data?

00:21:13.450 --> 00:21:15.790
We already have the names of my columns.

00:21:15.790 --> 00:21:18.040
And I've already gone ahead
and put in one sample row.

00:21:18.040 --> 00:21:21.130
So let's consider in the context
of all of those SQL types

00:21:21.130 --> 00:21:22.960
which ones apply here.

00:21:22.960 --> 00:21:25.180
Well, for user name,
something like Malan,

00:21:25.180 --> 00:21:27.490
it's probably not numeric
because that's not

00:21:27.490 --> 00:21:29.230
something I'd be so inclined to type.

00:21:29.230 --> 00:21:30.700
But it probably is text.

00:21:30.700 --> 00:21:33.460
But within text we have the
choice of char or varchar

00:21:33.460 --> 00:21:36.070
or larger blocks of
text, more generally.

00:21:36.070 --> 00:21:37.760
So what should this be?

00:21:37.760 --> 00:21:39.040
Well, this one depends.

00:21:39.040 --> 00:21:41.170
A user name tends to be fairly short.

00:21:41.170 --> 00:21:44.500
Because after all, we humans probably
don't want to type very long strings

00:21:44.500 --> 00:21:46.090
just to log into some site.

00:21:46.090 --> 00:21:49.540
Historically, maximal values
of eight characters was common.

00:21:49.540 --> 00:21:52.060
But that tends to be fairly constrained.

00:21:52.060 --> 00:21:54.370
And you can no longer express
very much of your name

00:21:54.370 --> 00:21:56.950
if you tend to have
a long name yourself.

00:21:56.950 --> 00:22:01.030
So here we might say something like
char, because we know it will be small.

00:22:01.030 --> 00:22:03.710
And we'd like to be able to
search on this field efficiently,

00:22:03.710 --> 00:22:06.880
especially to log someone in
fast, but probably not quite as

00:22:06.880 --> 00:22:10.300
short as eight, so maybe 16 or 20.

00:22:10.300 --> 00:22:12.940
Or if it's hard to guess there
and you want more dynamism,

00:22:12.940 --> 00:22:17.050
maybe you would say varchar
of something like 255.

00:22:17.050 --> 00:22:18.000
Why that value?

00:22:18.000 --> 00:22:22.220
Well, recall that with eight bits
can you count 256 possible values.

00:22:22.220 --> 00:22:25.740
But if you start counting at zero,
you can only go as high as 255.

00:22:25.740 --> 00:22:29.280
And historically have many
databases had limits of, say,

00:22:29.280 --> 00:22:32.770
255 for a varchar, though
that's no longer the case.

00:22:32.770 --> 00:22:35.040
But you'll still see this very commonly.

00:22:35.040 --> 00:22:36.960
So what's the best answer?

00:22:36.960 --> 00:22:37.750
I don't know.

00:22:37.750 --> 00:22:40.140
It depends on your
data or someone else's.

00:22:40.140 --> 00:22:43.200
For here, I'll go ahead and
leave it as varchar (255)

00:22:43.200 --> 00:22:44.910
just to give myself flexibility.

00:22:44.910 --> 00:22:48.510
But a very good argument could be
made for char of some smaller value

00:22:48.510 --> 00:22:51.780
or even varchar of some
smaller value, too.

00:22:51.780 --> 00:22:54.810
As for name as well, I'm not really
sure what this one should be.

00:22:54.810 --> 00:22:58.170
I don't know how many
characters your own name has.

00:22:58.170 --> 00:23:00.390
Something like 11 is
probably too few, even

00:23:00.390 --> 00:23:02.740
though it satisfies my name just fine.

00:23:02.740 --> 00:23:05.520
30 feels a bit
[? type. ?] And frankly, I

00:23:05.520 --> 00:23:09.300
bet if we Google longest
name in world, something

00:23:09.300 --> 00:23:13.110
tells me there's someone out there
with quite a few characters, hopefully

00:23:13.110 --> 00:23:15.210
no more than, say, 255.

00:23:15.210 --> 00:23:18.240
But there, too, we might want
to do some due diligence.

00:23:18.240 --> 00:23:20.097
With email, too, this seems to be easy.

00:23:20.097 --> 00:23:22.680
This, too, is just characters,
even though you could certainly

00:23:22.680 --> 00:23:23.550
have numbers.

00:23:23.550 --> 00:23:26.010
I don't know how long the
maximum email address will be.

00:23:26.010 --> 00:23:29.500
But frankly, it's probably
going to be variable in length,

00:23:29.500 --> 00:23:31.120
so I'm going to go with a default.

00:23:31.120 --> 00:23:32.160
Why 255?

00:23:32.160 --> 00:23:34.140
Again, it's probably
way more than we need.

00:23:34.140 --> 00:23:35.533
But varchar is smart.

00:23:35.533 --> 00:23:37.200
It's not going to waste that many bytes.

00:23:37.200 --> 00:23:41.490
It's just a reasonable upper bound
in the absence of better intuition.

00:23:41.490 --> 00:23:43.680
At least we're being consistent.

00:23:43.680 --> 00:23:45.580
Now, address, that
might be decently long,

00:23:45.580 --> 00:23:48.370
especially if it's going on
envelope on, say, multiple lines.

00:23:48.370 --> 00:23:51.840
But here, too, this is probably
just text, and so I'll go ahead here

00:23:51.840 --> 00:23:54.358
and say this, too, 255.

00:23:54.358 --> 00:23:54.900
Phone number.

00:23:54.900 --> 00:23:57.540
Finally, a number by name.

00:23:57.540 --> 00:23:59.700
Well, what should a
phone numbers field be?

00:23:59.700 --> 00:24:04.410
Well, we had integer, or real, or, more
specifically, small int or integer,

00:24:04.410 --> 00:24:06.030
or big int.

00:24:06.030 --> 00:24:09.870
But the funny thing here is even
though a phone number is indeed

00:24:09.870 --> 00:24:13.260
called by us humans
a number, it's really

00:24:13.260 --> 00:24:18.300
just a symbolic representation of a
unique identifier for someone's phone.

00:24:18.300 --> 00:24:22.030
And so these hyphens might be
common in one community or culture.

00:24:22.030 --> 00:24:25.890
Maybe you'd have parentheses or
even pluses for country codes.

00:24:25.890 --> 00:24:30.390
So frankly, very quickly does even
a phone number not become so much

00:24:30.390 --> 00:24:33.040
a number but a textual string.

00:24:33.040 --> 00:24:35.070
So here I have some discretion.

00:24:35.070 --> 00:24:37.890
And maybe I could be
presumptuous and assume

00:24:37.890 --> 00:24:40.020
I'm only going to have
customers or users for now,

00:24:40.020 --> 00:24:43.343
say, in the US for whatever
business constraints.

00:24:43.343 --> 00:24:44.760
And so I might say, you know what?

00:24:44.760 --> 00:24:49.740
This is a candidate to actually
do something like char, say, 10,

00:24:49.740 --> 00:24:53.220
three-digit area code, a three-digit
exchange, and then four digits

00:24:53.220 --> 00:24:54.300
thereafter.

00:24:54.300 --> 00:24:58.470
But that doesn't leave room for those
hyphens, so I could make a char 12.

00:24:58.470 --> 00:25:01.170
Or frankly, if they're just
going to be there all the time,

00:25:01.170 --> 00:25:04.170
why don't I leave them
as char 10 and just

00:25:04.170 --> 00:25:07.440
get rid of those values in my data?

00:25:07.440 --> 00:25:11.207
Or alternatively, I could support
parentheses or pluses as well.

00:25:11.207 --> 00:25:13.290
It really depends on how
I want to store the data.

00:25:13.290 --> 00:25:19.500
But I like the idea of a textual type,
something like char or maybe varchar

00:25:19.500 --> 00:25:20.970
as opposed to an integer.

00:25:20.970 --> 00:25:23.220
Because at least if I've
called certain communities

00:25:23.220 --> 00:25:26.660
or out from some businesses, sometimes
you have to type unusual numbers.

00:25:26.660 --> 00:25:28.410
At least in other
countries, for instance,

00:25:28.410 --> 00:25:30.930
if we generalize beyond
this data set here do you

00:25:30.930 --> 00:25:34.320
type zero to actually
connect to someone local.

00:25:34.320 --> 00:25:37.440
And the problem with zero is
that, mathematically, it's

00:25:37.440 --> 00:25:39.690
meaningless to start a value with zero.

00:25:39.690 --> 00:25:41.910
And unfortunately, a
computer takes that to heart.

00:25:41.910 --> 00:25:45.190
And if you were to store
phone as an integer

00:25:45.190 --> 00:25:48.450
but a number were in some
community to start with a zero,

00:25:48.450 --> 00:25:51.090
your database would
probably just get rid of it.

00:25:51.090 --> 00:25:54.152
In fact, try that yourself in
Microsoft Excel, or Apple Numbers,

00:25:54.152 --> 00:25:55.110
or Google Spreadsheets.

00:25:55.110 --> 00:25:58.530
And depending on how you formatted
the column, you can type as many zeros

00:25:58.530 --> 00:26:00.280
as you want followed by other digits.

00:26:00.280 --> 00:26:04.380
And odds are, when you hit
Enter, those zeros are gone.

00:26:04.380 --> 00:26:09.540
As for age, here we perhaps finally have
a compelling candidate for a number.

00:26:09.540 --> 00:26:11.850
Small int should probably
get the job done.

00:26:11.850 --> 00:26:16.890
Integer would work as well, or big
int, but increasingly wasteful.

00:26:16.890 --> 00:26:17.760
But you know what?

00:26:17.760 --> 00:26:19.920
Even here it's not that obvious.

00:26:19.920 --> 00:26:24.930
I probably shouldn't even use
an integer type here at all.

00:26:24.930 --> 00:26:25.830
Why?

00:26:25.830 --> 00:26:30.270
Well, I might indeed ask a human
for his or her age upon registering.

00:26:30.270 --> 00:26:34.620
The catch is that age might
change the next day or the day

00:26:34.620 --> 00:26:38.490
after, or the day after, because,
of course, time is advancing.

00:26:38.490 --> 00:26:41.310
And unless I also stored
the date and ideally

00:26:41.310 --> 00:26:45.600
time at which the user
registered, I can't really

00:26:45.600 --> 00:26:49.500
even do any math to figure out,
oh, you registered a year ago.

00:26:49.500 --> 00:26:52.150
Let me assume your age is that plus one.

00:26:52.150 --> 00:26:55.170
So what would have been better
from the get-go than age?

00:26:55.170 --> 00:26:58.830
Probably just something
like date of birth, DOB.

00:26:58.830 --> 00:27:01.530
And, of course, in SQL
do we have the ability

00:27:01.530 --> 00:27:03.850
to store dates, even date times.

00:27:03.850 --> 00:27:06.120
So here we probably
have a better candidate

00:27:06.120 --> 00:27:09.700
for exactly one of those
numeric types, so to speak.

00:27:09.700 --> 00:27:13.097
Now, we've only just begun to scratch
the surface of available data types

00:27:13.097 --> 00:27:14.430
and data we might want to store.

00:27:14.430 --> 00:27:18.150
In fact, our spreadsheet or database
could contain even more types.

00:27:18.150 --> 00:27:21.540
But now let's just suppose that
we're happy with our types.

00:27:21.540 --> 00:27:24.480
And the goal now at hand is
to start searching our data

00:27:24.480 --> 00:27:27.940
and storing that data
in an actual database.

00:27:27.940 --> 00:27:31.600
I've been using here Google
Spreadsheets just to lay out my schema

00:27:31.600 --> 00:27:34.130
sort of as a worksheet, if you will.

00:27:34.130 --> 00:27:37.000
But now let's graduate
to actual SQL syntax

00:27:37.000 --> 00:27:39.250
and suppose that this
same data is stored

00:27:39.250 --> 00:27:43.780
not in Google Spreadsheets or
any other but in some database

00:27:43.780 --> 00:27:47.080
elsewhere, a piece of software running
somewhere else on my own computer

00:27:47.080 --> 00:27:50.320
or somewhere in the cloud that knows
how to organize this data, still

00:27:50.320 --> 00:27:56.760
in rows and columns but in such a way
that I can use this new language, SQL

00:27:56.760 --> 00:28:01.430
or SQL, in order to access my data.

00:28:01.430 --> 00:28:03.850
So what are the
fundamental operations that

00:28:03.850 --> 00:28:07.510
SQL supports, or a relational
database, more generally?

00:28:07.510 --> 00:28:09.910
It turns out that throughout
computer science is there

00:28:09.910 --> 00:28:13.630
a pattern of operations that
many different systems support.

00:28:13.630 --> 00:28:16.000
In the world of databases,
you have fairly crassly

00:28:16.000 --> 00:28:20.710
what's called CRUD, the ability
to create data, read data, update,

00:28:20.710 --> 00:28:21.490
and delete.

00:28:21.490 --> 00:28:26.680
But more specifically, in the context
of SQL, this new language, our last,

00:28:26.680 --> 00:28:31.540
you have the ability to create
data, select data, a.k.a.

00:28:31.540 --> 00:28:35.050
READ data, update or insert data, a.k.a.

00:28:35.050 --> 00:28:38.560
UPDATE, or delete or drop data, a.k.a.

00:28:38.560 --> 00:28:39.460
DELETE.

00:28:39.460 --> 00:28:42.490
So whereas in SQL you have
these very specific terms,

00:28:42.490 --> 00:28:45.880
they are just representative
of a class of operations

00:28:45.880 --> 00:28:48.520
that you might see
throughout computer science.

00:28:48.520 --> 00:28:50.830
So how do we go about using SQL?

00:28:50.830 --> 00:28:53.740
Well, short of buying and
installing something like Oracle

00:28:53.740 --> 00:28:56.230
or downloading something
free like MySQL,

00:28:56.230 --> 00:28:58.870
we can simply use something
that's indeed, by definition,

00:28:58.870 --> 00:29:01.540
lighter weight called SQLite.

00:29:01.540 --> 00:29:05.200
SQLite is an implementation of
SQL, this database language,

00:29:05.200 --> 00:29:08.430
via which you can install
it on your own Mac or PC,

00:29:08.430 --> 00:29:13.270
and it uses not special software
or servers but rather just a file

00:29:13.270 --> 00:29:16.330
on your own Mac or PC that
stores all of those tables

00:29:16.330 --> 00:29:18.300
and, in turn, rows and columns.

00:29:18.300 --> 00:29:20.440
It thereafter creates
an abstraction of sorts

00:29:20.440 --> 00:29:22.960
as though you have a
full-fledged server running

00:29:22.960 --> 00:29:26.350
on your own machine with which
you can communicate using SQL.

00:29:26.350 --> 00:29:31.360
Now, any number of programs can be used
to actually talk to a SQLite database.

00:29:31.360 --> 00:29:35.200
You might use a purely textual
interface, so-called command line

00:29:35.200 --> 00:29:37.090
interface, using only your keyboard.

00:29:37.090 --> 00:29:41.210
Or you might use something graphical,
a GUI, a graphical user interface.

00:29:41.210 --> 00:29:43.030
In fact, one of the
easiest programs to use

00:29:43.030 --> 00:29:46.600
is this one here, DB Browser,
freely available for Macs,

00:29:46.600 --> 00:29:50.740
for PCS running Windows, or Linux,
or any number of operating systems

00:29:50.740 --> 00:29:51.670
as well.

00:29:51.670 --> 00:29:56.110
In advance, I've gone ahead and opened
up a database in a file whose name ends

00:29:56.110 --> 00:29:59.830
in .sqllite or also commonly .db.

00:29:59.830 --> 00:30:01.900
This then is, again,
just a file, a binary

00:30:01.900 --> 00:30:04.510
file filled with zeros
and ones that collectively

00:30:04.510 --> 00:30:09.100
represent patterns of rows and columns
and the tables that contain them.

00:30:09.100 --> 00:30:13.120
And if I open this program, I see
all of the tables or, if you will,

00:30:13.120 --> 00:30:15.640
sheets from our old spreadsheet world.

00:30:15.640 --> 00:30:18.370
This happens to be a
database all about music.

00:30:18.370 --> 00:30:22.960
And indeed, I have a table about
musical albums and artists, customers

00:30:22.960 --> 00:30:26.560
and employees, and genres, and
invoices, invoice lines that

00:30:26.560 --> 00:30:29.920
represent actual songs bought,
the types of media involved,

00:30:29.920 --> 00:30:33.340
and playlists and playlists tracks
via which users can actually

00:30:33.340 --> 00:30:37.807
customize those songs, and then lastly,
the tracks or the songs themselves.

00:30:37.807 --> 00:30:39.640
In other words, someone
else on the internet

00:30:39.640 --> 00:30:42.250
has gone to the trouble of
aggregating all of this data

00:30:42.250 --> 00:30:47.770
about music and their authors and
organized it into an SQL database.

00:30:47.770 --> 00:30:48.550
Now, how?

00:30:48.550 --> 00:30:50.342
Well, if you look to
the right here, you'll

00:30:50.342 --> 00:30:54.040
see a number of esoteric commands
all involving CREATE TABLE.

00:30:54.040 --> 00:30:57.490
And indeed, CREATE is one of
the four fundamental operations

00:30:57.490 --> 00:30:58.660
that SQL supports.

00:30:58.660 --> 00:31:02.860
And without going into the weeds
of how these tables were created,

00:31:02.860 --> 00:31:07.070
they were created in such a way that the
author had to decide on every table's

00:31:07.070 --> 00:31:09.640
columns' names as well as their types.

00:31:09.640 --> 00:31:11.440
And that part is interesting.

00:31:11.440 --> 00:31:16.090
For instance, let me go into the Albums
table here and expand this one now.

00:31:16.090 --> 00:31:21.370
Here I have Album ID, and Title,
and, curiously, Artist ID.

00:31:21.370 --> 00:31:25.450
But you'll notice in the Album table
we have no mention of artist's name.

00:31:25.450 --> 00:31:27.970
In fact, if we follow
the white rabbit here,

00:31:27.970 --> 00:31:31.450
we see in the Artist table
that, oh, in Artist do

00:31:31.450 --> 00:31:33.700
we have an artist ID and name.

00:31:33.700 --> 00:31:37.150
And indeed, if we keep digging, we'll
see that in each of these tables

00:31:37.150 --> 00:31:41.050
there's not quite as much
information as I might like.

00:31:41.050 --> 00:31:44.830
In fact, IDs, whatever those
are seem to be more prevalent.

00:31:44.830 --> 00:31:46.930
And that is because, in
the world of databases,

00:31:46.930 --> 00:31:49.390
and perhaps even some of
your own spreadsheets,

00:31:49.390 --> 00:31:53.560
it's conventional and daresay best
practice to normalize your data.

00:31:53.560 --> 00:31:58.240
You indeed should try to avoid having
redundancies within tables and even

00:31:58.240 --> 00:31:59.620
across tables.

00:31:59.620 --> 00:32:00.730
Now, what does that mean?

00:32:00.730 --> 00:32:02.830
Well, consider our own
spreadsheet of users

00:32:02.830 --> 00:32:05.410
that we're trying to
get into a database.

00:32:05.410 --> 00:32:07.960
I had a column there called Address.

00:32:07.960 --> 00:32:12.170
And I proposed that my own address
was 33 Oxford Street in Cambridge,

00:32:12.170 --> 00:32:13.930
Massachusetts 02138.

00:32:13.930 --> 00:32:17.680
Now, it turns out that there's quite
a few other people at my workplace,

00:32:17.680 --> 00:32:21.260
and so they might have that
exact same address as well.

00:32:21.260 --> 00:32:23.770
And in fact, that address
doesn't really lend itself

00:32:23.770 --> 00:32:27.340
to very nice printing on
envelopes because all I did

00:32:27.340 --> 00:32:30.280
was separate things with a
comma, not actually hit Enter.

00:32:30.280 --> 00:32:33.190
And in fact, because everything
was all in one column,

00:32:33.190 --> 00:32:36.550
it would seem a little difficult
to search by, say, zip code.

00:32:36.550 --> 00:32:41.470
Find for me all of the users from
Cambridge, Massachusetts in 02138.

00:32:41.470 --> 00:32:44.330
Well, you could do it sort
of searching free form

00:32:44.330 --> 00:32:46.840
all of the values in some column.

00:32:46.840 --> 00:32:49.510
But it wouldn't be particularly
clean or efficient because you'd

00:32:49.510 --> 00:32:53.770
have to look at and then ignore all
of those other words and numbers, 33

00:32:53.770 --> 00:32:57.550
Oxford Street, Cambridge mass,
if you only care about that zip.

00:32:57.550 --> 00:32:59.380
So before we forge
ahead, let's see if we

00:32:59.380 --> 00:33:03.490
can't infer why this database
is seemingly, at first glance,

00:33:03.490 --> 00:33:05.650
more complicated than you might like.

00:33:05.650 --> 00:33:09.040
Well, if I go back into my
spreadsheet here, what really should

00:33:09.040 --> 00:33:11.050
I have done with Address?

00:33:11.050 --> 00:33:13.750
I probably should have
quantized that value

00:33:13.750 --> 00:33:17.830
into streets and then city,
then state and zip code,

00:33:17.830 --> 00:33:20.680
not having just one column but several.

00:33:20.680 --> 00:33:24.220
In fact, here, what I've gone ahead
and done is separate out Address

00:33:24.220 --> 00:33:27.050
into Street and City and State and Zip.

00:33:27.050 --> 00:33:30.040
And for each of those have I
specified a very precise type.

00:33:30.040 --> 00:33:34.000
I've gone ahead and proposed that Street
is still varchar (255), as is city,

00:33:34.000 --> 00:33:37.360
because I don't really know an upper
bound, so we'll at least be consistent.

00:33:37.360 --> 00:33:41.530
For state, I've hopefully been smart in
at least assuming users are in the US.

00:33:41.530 --> 00:33:44.800
I've said char (2) for just
that two-character code.

00:33:44.800 --> 00:33:47.110
And for Zip, too, I'm
preemptively trying

00:33:47.110 --> 00:33:51.670
to avoid a mistake with even Cambridge
whose zip codes start with a zero.

00:33:51.670 --> 00:33:54.610
Where I had to specify again
that that's just an integer,

00:33:54.610 --> 00:33:57.760
I might actually lose
mathematically that first digit.

00:33:57.760 --> 00:34:02.440
But by storing it as a char with five
characters and no hyphen or four others

00:34:02.440 --> 00:34:07.120
can I ensure that that
02138 remains exactly that.

00:34:07.120 --> 00:34:10.060
But here, too, we have
a bit of inefficiency.

00:34:10.060 --> 00:34:12.880
Imagine if more and more
users from my own building

00:34:12.880 --> 00:34:15.340
register for this
particular application.

00:34:15.340 --> 00:34:18.670
You might have again and
again these exact same values

00:34:18.670 --> 00:34:21.219
if all of my colleagues
are in the same building.

00:34:21.219 --> 00:34:23.500
And here, as with
programming, as soon as you

00:34:23.500 --> 00:34:27.639
start to see a lot of redundancy,
the same data again and again,

00:34:27.639 --> 00:34:30.520
there's surely an opportunity
for better design.

00:34:30.520 --> 00:34:31.330
This is correct.

00:34:31.330 --> 00:34:33.880
My colleagues might also
live here if we added

00:34:33.880 --> 00:34:35.679
their names and their distinct emails.

00:34:35.679 --> 00:34:39.940
But they don't necessarily need
to have the same exact text stored

00:34:39.940 --> 00:34:41.830
again and again.

00:34:41.830 --> 00:34:45.580
So what might you do in this case
even in the world of spreadsheets?

00:34:45.580 --> 00:34:49.989
Well, on the sheet, I might just rename
this actually more explicitly to users.

00:34:49.989 --> 00:34:50.739
And you know what?

00:34:50.739 --> 00:34:53.406
Let me go ahead and create another
sheet in my spreadsheet world

00:34:53.406 --> 00:34:55.239
and call this, say, Cities.

00:34:55.239 --> 00:35:00.980
And my cities might actually have a
city name and perhaps a state and a zip.

00:35:00.980 --> 00:35:04.600
But in this leftmost
column, I could be smart

00:35:04.600 --> 00:35:09.730
and start to assign each of these
cities some unique ID or identifier.

00:35:09.730 --> 00:35:13.870
And so here might I have just a unique
identifier, typically an integer.

00:35:13.870 --> 00:35:16.560
City might again be varchar (255).

00:35:16.560 --> 00:35:18.790
And State might, again, be char (2).

00:35:18.790 --> 00:35:21.250
Zip Code, meanwhile,
can stay as char (5).

00:35:21.250 --> 00:35:25.900
But now what I can do is presume this,
that if Cambridge, Massachusetts,

00:35:25.900 --> 00:35:31.720
02138 is in this sheet or, if you will,
table let's arbitrarily but reasonably

00:35:31.720 --> 00:35:33.920
give it a unique identifier as one.

00:35:33.920 --> 00:35:37.480
And if I were to add more cities here,
like Allston, Massachusetts and its zip

00:35:37.480 --> 00:35:40.660
code, I could sign it
a unique ID of two.

00:35:40.660 --> 00:35:43.420
Because now that I
have this data, here's

00:35:43.420 --> 00:35:45.340
where you get that relational aspect.

00:35:45.340 --> 00:35:50.050
I can relate this sheet or
table to my other as follows.

00:35:50.050 --> 00:35:54.040
Back in this table now can I go ahead
and delete much of this redundancy

00:35:54.040 --> 00:35:58.420
here and actually go ahead and say city
is not really a city but it's a city

00:35:58.420 --> 00:35:59.200
ID.

00:35:59.200 --> 00:36:02.410
And here now it can be a proper
integer, because the city

00:36:02.410 --> 00:36:06.460
in which all of my colleagues live
is the exact same one as mine.

00:36:06.460 --> 00:36:09.433
Now, here there's still an opportunity
for improvement, to be fair.

00:36:09.433 --> 00:36:11.350
Because if all of us are
in the same building,

00:36:11.350 --> 00:36:14.000
maybe that should be
factored out as well.

00:36:14.000 --> 00:36:16.150
And if I really wanted to
go down this rabbit hole,

00:36:16.150 --> 00:36:18.460
I could add another
sheet or table called,

00:36:18.460 --> 00:36:24.430
say, Buildings and factor out also that
33 Oxford Street, give it a unique ID,

00:36:24.430 --> 00:36:26.560
and only store numbers.

00:36:26.560 --> 00:36:29.552
So in short, the more
redundancy and the more waste

00:36:29.552 --> 00:36:32.260
that you end up having in your
data, the more opportunities there

00:36:32.260 --> 00:36:34.150
are to, so to speak, normalize it.

00:36:34.150 --> 00:36:36.400
To factor out those
commonalities and create

00:36:36.400 --> 00:36:39.580
relations between some
pieces of data and others

00:36:39.580 --> 00:36:43.180
and the convention in computing
is to do, quite simply, numbers.

00:36:43.180 --> 00:36:43.680
Why?

00:36:43.680 --> 00:36:48.340
Well, it turns out it's a lot more
efficient to just relate some data

00:36:48.340 --> 00:36:51.250
to others by relying on small integers.

00:36:51.250 --> 00:36:53.020
Four bytes is not all that much.

00:36:53.020 --> 00:36:55.390
And in fact, inside
of a computer CPU are

00:36:55.390 --> 00:36:58.390
small registers, tiny
pieces of memory that

00:36:58.390 --> 00:37:00.940
can be used on the most
basic of operations;

00:37:00.940 --> 00:37:04.330
additions, subtractions, and
comparisons for equality.

00:37:04.330 --> 00:37:07.180
And so with small values
like integers can you

00:37:07.180 --> 00:37:12.170
very quickly reassemble or
relate some data to others.

00:37:12.170 --> 00:37:15.430
And so here we have a general
principle of database design

00:37:15.430 --> 00:37:18.690
to normalize it by factoring things out.

00:37:18.690 --> 00:37:21.940
And so if we go back into
our musical database,

00:37:21.940 --> 00:37:26.920
you can perhaps infer why the author
of this data did that preemptively.

00:37:26.920 --> 00:37:31.990
They have their albums having album
IDs, a number like 1, 2, and 3, a title,

00:37:31.990 --> 00:37:34.210
which is the actual title of that album.

00:37:34.210 --> 00:37:38.650
And then to associate
those albums with artists,

00:37:38.650 --> 00:37:41.620
they've used not the
artist's name but an ID.

00:37:41.620 --> 00:37:45.940
In this way can an album have
the same artist as another

00:37:45.940 --> 00:37:48.430
without storing that
artist's name twice.

00:37:48.430 --> 00:37:51.190
Moreover, if the artist happens
to change his or her name,

00:37:51.190 --> 00:37:53.860
as is not uncommon in
the musical world, you

00:37:53.860 --> 00:37:56.740
can change that name in
just one place and not

00:37:56.740 --> 00:38:00.540
have to scour your tables
for multiple copies.

00:38:00.540 --> 00:38:03.630
And so if we continue this logic,
we'll see in more and more tables

00:38:03.630 --> 00:38:07.740
that we have this principle of naming
the data but then assigning it an ID.

00:38:07.740 --> 00:38:10.230
And if you want to relate
some of that data to another,

00:38:10.230 --> 00:38:14.520
you simply store the ID,
not the actual values.

00:38:14.520 --> 00:38:19.140
Of course, this is decreasingly useful
as we go, because now some of my data

00:38:19.140 --> 00:38:21.990
is in this table, and that,
and this other table, and here.

00:38:21.990 --> 00:38:26.310
And so while very academically clean
and refined, it doesn't anymore

00:38:26.310 --> 00:38:29.910
seem useful to the data scientist
in front of his or her computer

00:38:29.910 --> 00:38:32.610
that just wants to answer
questions about data.

00:38:32.610 --> 00:38:35.490
And yet it's all over the place,
whereas before, with Excel,

00:38:35.490 --> 00:38:37.710
you could just pull up a window.

00:38:37.710 --> 00:38:39.510
But that's where SQL itself comes in.

00:38:39.510 --> 00:38:44.010
SQL does not just prescribe
how to type your data

00:38:44.010 --> 00:38:46.530
but rather how to query it as well.

00:38:46.530 --> 00:38:49.010
And in this particular
program here, DB Browser,

00:38:49.010 --> 00:38:52.320
I can actually go over to
this tab here for Execute SQL.

00:38:52.320 --> 00:38:55.830
And I can begin to execute
those actual commands, SELECT,

00:38:55.830 --> 00:39:00.030
and CREATE, and UPDATE, DELETE, and
others and actually see the results.

00:39:00.030 --> 00:39:03.660
What SQL allows you to do is
express yourself programmatically

00:39:03.660 --> 00:39:07.200
using a fairly small
language, albeit new, that

00:39:07.200 --> 00:39:11.160
allows you to create temporary tables,
just the results, just the result

00:39:11.160 --> 00:39:15.130
sets, so to speak, that you want,
only those rows that you care about.

00:39:15.130 --> 00:39:18.240
So for instance, if I want to see
all of the albums in this database,

00:39:18.240 --> 00:39:21.897
well, in the world of spreadsheets, I
would just double click in and peruse.

00:39:21.897 --> 00:39:24.480
But in the world of SQL, I'm
actually going to type a command.

00:39:24.480 --> 00:39:28.740
I'm going to go ahead here and say
SELECT star, for give me everything,

00:39:28.740 --> 00:39:33.417
from the table called Album, and
then semicolon to finish my thought.

00:39:33.417 --> 00:39:36.000
I'm going to go ahead and click
the graphical Play button here

00:39:36.000 --> 00:39:37.330
to execute this command.

00:39:37.330 --> 00:39:40.110
And you'll see suddenly that
here are all of the albums

00:39:40.110 --> 00:39:45.930
apparently in this table, 347 of them
in total in this particular database.

00:39:45.930 --> 00:39:48.330
And notice that all of the
titles are in one column,

00:39:48.330 --> 00:39:53.190
the Album ID is to the left, and the
Artist ID, of course, to the right.

00:39:53.190 --> 00:39:57.510
Well, if you're now curious who is
the artist behind the album called

00:39:57.510 --> 00:40:00.240
For Those About to Rock,
We Salute You, well, I

00:40:00.240 --> 00:40:02.670
can just make a mental note
that the artist ID is one.

00:40:02.670 --> 00:40:03.540
And you know what?

00:40:03.540 --> 00:40:06.600
With SQL, it turns out
you can use predicates.

00:40:06.600 --> 00:40:08.802
You can qualify the data you want.

00:40:08.802 --> 00:40:10.260
You don't have to say, give me all.

00:40:10.260 --> 00:40:11.820
You can say, give me this.

00:40:11.820 --> 00:40:13.210
So how do I do that?

00:40:13.210 --> 00:40:18.690
Well, I can actually say SELECT star
from Artist, the other table, where--

00:40:18.690 --> 00:40:25.450
and here comes my predicate-- artist ID
equals one, a semicolon again to finish

00:40:25.450 --> 00:40:26.950
that thought, hit Play, and voila.

00:40:26.950 --> 00:40:32.410
It turns out it's AC/DC, the artist
behind that particular album.

00:40:32.410 --> 00:40:34.280
Of course, this felt a bit manual.

00:40:34.280 --> 00:40:36.790
And this seems no better
than a spreadsheet wherein

00:40:36.790 --> 00:40:38.920
now more of the work seems to be on me.

00:40:38.920 --> 00:40:41.230
But SQL's more expressive than this.

00:40:41.230 --> 00:40:44.140
Not only can you create, and
select, an update, and delete data,

00:40:44.140 --> 00:40:48.050
you can also join data
from one table and another.

00:40:48.050 --> 00:40:51.760
So in fact, let me go ahead and
do this a little more explicitly.

00:40:51.760 --> 00:40:56.650
I want to go ahead and select everything
from that original album table,

00:40:56.650 --> 00:41:00.100
but I'd like to first join
it with the artist table.

00:41:00.100 --> 00:41:00.880
How?

00:41:00.880 --> 00:41:03.850
Well, I want to join those two
tables, this one and this one,

00:41:03.850 --> 00:41:07.810
kind of like this,
conceptually, so to speak.

00:41:07.810 --> 00:41:09.010
On what fields?

00:41:09.010 --> 00:41:12.640
Well, in Album, I recall
there's an artist ID.

00:41:12.640 --> 00:41:17.290
I want that to be equal
to artist.artistid.

00:41:17.290 --> 00:41:21.730
In other words, if you imagine this hand
to be albums and this hand to be artist

00:41:21.730 --> 00:41:25.510
and the tips of my fingers each
represent those artist IDs,

00:41:25.510 --> 00:41:29.230
we essentially want the
SQL to line up my fingers

00:41:29.230 --> 00:41:32.290
so that I have on the left my albums
and I have on the right the artist.

00:41:32.290 --> 00:41:36.040
And every one of my rows now has both.

00:41:36.040 --> 00:41:39.100
Let me finish my thought with
a semicolon here and hit Play.

00:41:39.100 --> 00:41:43.750
And voila, now we see even more
information but all together.

00:41:43.750 --> 00:41:47.020
We see that album number one, For
Those About to Rock, We Salute You,

00:41:47.020 --> 00:41:51.370
has an artist ID of one and clearly
an artist ID of one but a name.

00:41:51.370 --> 00:41:52.420
Well, what's happened?

00:41:52.420 --> 00:41:56.230
Well, both of these tables have kind of
been concatenated together but joined

00:41:56.230 --> 00:42:00.010
intelligently such that the
artist IDs in both tables

00:42:00.010 --> 00:42:02.463
line up on the left and the right.

00:42:02.463 --> 00:42:05.380
Of course, at this point, I really
don't care about all these numbers.

00:42:05.380 --> 00:42:08.560
And I definitely don't need the
temporary duplication of data,

00:42:08.560 --> 00:42:10.270
so I don't have to
just keep saying star,

00:42:10.270 --> 00:42:13.480
which is the so-called wild card,
which means give me everything.

00:42:13.480 --> 00:42:17.380
I can actually instead just
say, give me the title of albums

00:42:17.380 --> 00:42:20.170
and the names of the artists
by specifying with commas

00:42:20.170 --> 00:42:23.290
the names of the columns
that I actually want.

00:42:23.290 --> 00:42:26.320
And if I now click Play, I
get much simpler results,

00:42:26.320 --> 00:42:32.030
just the titles of albums and
just the names of those artists.

00:42:32.030 --> 00:42:34.460
Now, how else can we
express ourselves with SQL?

00:42:34.460 --> 00:42:37.820
Well, there are other keywords
besides WHERE and besides JOIN.

00:42:37.820 --> 00:42:41.990
You can also group values by something
and specify that they must all

00:42:41.990 --> 00:42:43.460
have something as well.

00:42:43.460 --> 00:42:46.910
For instance, let me go back
to my data here and consider

00:42:46.910 --> 00:42:49.610
which artists have multiple albums.

00:42:49.610 --> 00:42:52.070
Well, if we consider the
results that we had earlier do

00:42:52.070 --> 00:42:56.600
we have AC/DC as the artist behind For
Those About to Rock, We Salute You,

00:42:56.600 --> 00:42:59.180
but also behind Let There Be Rock.

00:42:59.180 --> 00:43:02.400
Moreover, this band Accept
has multiple albums as well.

00:43:02.400 --> 00:43:04.970
And if we scrolled further,
we'd probably see others.

00:43:04.970 --> 00:43:08.600
So if we'd like to see those
bands, those artists that have

00:43:08.600 --> 00:43:10.760
multiple albums, how can we do this?

00:43:10.760 --> 00:43:16.490
Well, what if I somehow collapsed all
mentions of AC/DC into just one row,

00:43:16.490 --> 00:43:19.280
and next to that row
I put an actual count?

00:43:19.280 --> 00:43:21.420
Well, I could refine
this query as follows.

00:43:21.420 --> 00:43:24.260
I can say, yes, join these
tables together, but then

00:43:24.260 --> 00:43:28.710
collapse them, if you will, based
on grouping some common value.

00:43:28.710 --> 00:43:31.430
Let's go ahead now and
group the data by name

00:43:31.430 --> 00:43:33.650
so that any artist that
appears multiple times

00:43:33.650 --> 00:43:36.050
will effectively be collapsed into one.

00:43:36.050 --> 00:43:40.850
But I'd like to remember how many
rows got collapsed into just one.

00:43:40.850 --> 00:43:43.850
And so rather than select the
albums themselves this time,

00:43:43.850 --> 00:43:48.080
I'm going to select the album's
name and then the count there of,

00:43:48.080 --> 00:43:52.190
thereby specifying show me the name
and show me the count of that name

00:43:52.190 --> 00:43:54.350
before we grouped by.

00:43:54.350 --> 00:43:57.980
If I go ahead now and finish
my thought and click Execute,

00:43:57.980 --> 00:44:02.390
I'll see that, indeed, AC/DC had two
names and Aaron Goldberg had one.

00:44:02.390 --> 00:44:05.210
And if we keep scrolling, we'd
see all of the bands' names

00:44:05.210 --> 00:44:09.350
that had one or more albums and
the count for each of those.

00:44:09.350 --> 00:44:11.570
If I want to filter out
now maybe those bands that

00:44:11.570 --> 00:44:14.570
only had one hit album in this
database, I can instead say,

00:44:14.570 --> 00:44:17.840
you know what, go ahead and
group by the group's name,

00:44:17.840 --> 00:44:22.400
but then show me only those
bands having more than one album.

00:44:22.400 --> 00:44:25.790
Well, here, too, can I simply filter
my results saying literally quite

00:44:25.790 --> 00:44:32.840
that, having some number count
name greater than one, semicolon.

00:44:32.840 --> 00:44:36.735
Hitting Play now and you see the
results immediately eliminate all

00:44:36.735 --> 00:44:38.360
of those bands that had just one album.

00:44:38.360 --> 00:44:40.610
And now if I scroll through,
we'll see all those bands

00:44:40.610 --> 00:44:42.650
that had two or really more.

00:44:42.650 --> 00:44:45.740
And so with SQL can you
begin to express yourself

00:44:45.740 --> 00:44:47.870
certainly more arcanely
than you could with just

00:44:47.870 --> 00:44:52.310
a click of the mouse in a spreadsheet
but ever so much more powerfully.

00:44:52.310 --> 00:44:55.340
But the key is to build
these queries up piecemeal.

00:44:55.340 --> 00:44:58.160
Indeed, this now already
looks quite complicated.

00:44:58.160 --> 00:45:02.990
But we began by just selecting all data
and then refining, refining, refining,

00:45:02.990 --> 00:45:06.020
thereby working at this level
and only getting into the weeds

00:45:06.020 --> 00:45:09.110
once we knew what we wanted.

00:45:09.110 --> 00:45:11.960
Notice now just how fast
these operations have been.

00:45:11.960 --> 00:45:15.560
Indeed, at the bottom here do I
see that 56 rows were returned.

00:45:15.560 --> 00:45:16.130
How fast?

00:45:16.130 --> 00:45:18.320
In one millisecond.

00:45:18.320 --> 00:45:20.450
And indeed, even though
it might take me longer

00:45:20.450 --> 00:45:23.450
to describe in English
what it is I want, in fact,

00:45:23.450 --> 00:45:27.290
the computer can find
this data so quickly.

00:45:27.290 --> 00:45:30.080
But that's because we've done
at least one thing already.

00:45:30.080 --> 00:45:33.290
This data is already organized
with these proper types.

00:45:33.290 --> 00:45:36.650
And it also has a few other
key characteristics as well.

00:45:36.650 --> 00:45:38.990
When storing your data
in a SQL database,

00:45:38.990 --> 00:45:41.450
you're also asked to
answer some questions.

00:45:41.450 --> 00:45:44.900
For every table you're asked
to specify effectively what,

00:45:44.900 --> 00:45:47.900
if any, is this table's primary key.

00:45:47.900 --> 00:45:50.810
These are key words in
SQL that you can apply

00:45:50.810 --> 00:45:53.690
to some column that says,
hey, database, this column

00:45:53.690 --> 00:45:59.750
is my primary value that uniquely
identifies every row in this table.

00:45:59.750 --> 00:46:02.420
In the context then of
our user spreadsheet

00:46:02.420 --> 00:46:05.930
with which we began this
discussion, that identifier for City

00:46:05.930 --> 00:46:07.640
was a primary key.

00:46:07.640 --> 00:46:10.370
I might very well have
used a city's name

00:46:10.370 --> 00:46:12.500
as unique or perhaps even the zip code.

00:46:12.500 --> 00:46:15.050
But far more efficient,
especially if you

00:46:15.050 --> 00:46:17.380
want to avoid ambiguities
or duplication,

00:46:17.380 --> 00:46:19.130
is to just use that integer.

00:46:19.130 --> 00:46:23.420
And so here a primary key is almost
always a numeric value, at least

00:46:23.420 --> 00:46:27.140
in the most optimized layouts of data.

00:46:27.140 --> 00:46:29.960
But it guarantees to the
database that there will be

00:46:29.960 --> 00:46:32.480
no duplicates on this particular value.

00:46:32.480 --> 00:46:38.270
But more powerfully, you can define in
one table a column to be a primary key,

00:46:38.270 --> 00:46:43.490
but then in another table that same
value to be a so-called foreign key.

00:46:43.490 --> 00:46:47.420
In other words, throughout this
example in the actual SQL database,

00:46:47.420 --> 00:46:50.360
I had Albums in one table
and Artists in others.

00:46:50.360 --> 00:46:54.050
And that Artist table
had an Artist ID column

00:46:54.050 --> 00:46:57.020
that was within that table
known as a primary key.

00:46:57.020 --> 00:47:00.800
But when you saw that artist
ID in the Albums table,

00:47:00.800 --> 00:47:04.370
it was contextually there a foreign key.

00:47:04.370 --> 00:47:08.000
Now, beyond semantics, this is
an actual and valuable property.

00:47:08.000 --> 00:47:10.490
It ensures that the
database knows how to link

00:47:10.490 --> 00:47:13.430
and how to link those
two columns efficiently.

00:47:13.430 --> 00:47:17.810
Moreover, you have even fancier features
available to you when you declare keys.

00:47:17.810 --> 00:47:19.910
You can also tell the
database, you know what?

00:47:19.910 --> 00:47:24.800
If I ever delete this artist, go ahead
and delete all of that artist's albums

00:47:24.800 --> 00:47:25.310
as well.

00:47:25.310 --> 00:47:27.530
And you can configure a
database automatically

00:47:27.530 --> 00:47:31.520
to have this so-called cascade
effect whereby data is updated

00:47:31.520 --> 00:47:36.320
and your data is consistent at the end
of the day based on those relations,

00:47:36.320 --> 00:47:37.220
if you will.

00:47:37.220 --> 00:47:40.640
Now, in columns of data
can you also specify

00:47:40.640 --> 00:47:42.770
that every value's got to be unique.

00:47:42.770 --> 00:47:45.260
It doesn't necessarily need
to be your primary key,

00:47:45.260 --> 00:47:47.480
but it might still be unique.

00:47:47.480 --> 00:47:48.440
Like what?

00:47:48.440 --> 00:47:51.650
Well, in our Users table that
we were creating on the fly,

00:47:51.650 --> 00:47:54.200
an email address might,
by human convention,

00:47:54.200 --> 00:47:57.200
be unique, assuming I don't
share it with someone else.

00:47:57.200 --> 00:48:01.070
But using an email address, multiple
characters, many possible characters,

00:48:01.070 --> 00:48:04.340
tends not to be the most
efficient way to search on data.

00:48:04.340 --> 00:48:08.540
So even in my Users table might
I have added for best practice

00:48:08.540 --> 00:48:13.280
a numeric column as well, probably
called ID, as my primary key.

00:48:13.280 --> 00:48:17.330
But I might still specify when moving
that data from my Google spreadsheet

00:48:17.330 --> 00:48:20.030
into a SQL database that, you know what?

00:48:20.030 --> 00:48:23.330
Please ensure that this
Email column's unique

00:48:23.330 --> 00:48:25.970
so that I or some other
programmer doesn't accidentally

00:48:25.970 --> 00:48:28.760
insert duplicate data into this table.

00:48:28.760 --> 00:48:31.340
And moreover, the database
then can search it

00:48:31.340 --> 00:48:35.180
more efficiently because it
knows how many, if any, there

00:48:35.180 --> 00:48:36.380
are of any one value.

00:48:36.380 --> 00:48:39.590
There's one and only one maximally.

00:48:39.590 --> 00:48:42.830
Lastly, there's this
keyword here, index.

00:48:42.830 --> 00:48:46.610
Well, the other feature you get so
powerfully from proper databases

00:48:46.610 --> 00:48:50.180
is the ability to search
and sort values efficiently.

00:48:50.180 --> 00:48:54.020
But the database doesn't know
a priori how to do either

00:48:54.020 --> 00:48:55.250
on the data you care about.

00:48:55.250 --> 00:48:58.820
Because only if you tell the database
what data you plan to search on

00:48:58.820 --> 00:49:03.210
and sort frequently can
it help you in advance.

00:49:03.210 --> 00:49:08.000
And so if when creating a database table
you tell the database server, go ahead

00:49:08.000 --> 00:49:12.590
and index this column, what it will
do is use a database structure,

00:49:12.590 --> 00:49:17.210
a tree structure not unlike our
so-called binary search trees,

00:49:17.210 --> 00:49:20.960
that pulls all the data up in
an artist's rendition thereof,

00:49:20.960 --> 00:49:23.720
thereby ensuring that it
doesn't take as many steps

00:49:23.720 --> 00:49:27.560
to find some email address
or something else because you

00:49:27.560 --> 00:49:29.430
have indexed that particular column.

00:49:29.430 --> 00:49:32.310
It won't store it linearly top
to bottom or left to right.

00:49:32.310 --> 00:49:34.060
It will store it in a
two-dimensional tree

00:49:34.060 --> 00:49:38.500
structure of some sort,
often known as a B-tree, that

00:49:38.500 --> 00:49:44.260
allows you to grab the data in hopefully
logarithmic and not linear time.

00:49:44.260 --> 00:49:46.420
Well, turns out there are
even more features you

00:49:46.420 --> 00:49:49.840
get from actual databases like SQLite.

00:49:49.840 --> 00:49:53.530
Well, you have the ability to specify
when creating a table, please go ahead

00:49:53.530 --> 00:49:56.050
and auto increment this column.

00:49:56.050 --> 00:49:57.380
Well, what does that mean?

00:49:57.380 --> 00:49:59.320
Well, I very manually
a moment ago assigned

00:49:59.320 --> 00:50:02.230
Cambridge the unique identifier of one.

00:50:02.230 --> 00:50:04.330
But why should I, the
programmer, even have

00:50:04.330 --> 00:50:09.280
to worry or care about what the
unique values of my inputs are?

00:50:09.280 --> 00:50:11.560
I just need that that key exists.

00:50:11.560 --> 00:50:14.920
I do not need to care about what
that value is, just that it exists

00:50:14.920 --> 00:50:16.030
and it's unique.

00:50:16.030 --> 00:50:18.740
So you can tell the database
on its own, please go ahead

00:50:18.740 --> 00:50:21.490
and, any time I add a
new row to this table,

00:50:21.490 --> 00:50:25.010
increment that value automatically.

00:50:25.010 --> 00:50:26.980
You can also specify
to a database, please

00:50:26.980 --> 00:50:31.480
ensure that no values in my
database are null that is empty,

00:50:31.480 --> 00:50:35.560
thereby ensuring that a bug in your
code or some missing user input

00:50:35.560 --> 00:50:38.110
doesn't accidentally
put into your database

00:50:38.110 --> 00:50:41.860
a row that's only sparsely
filled with real data.

00:50:41.860 --> 00:50:44.980
The database can help you with these
things just as Python can as well,

00:50:44.980 --> 00:50:49.980
but it's a final layer of
a defense before your data.

00:50:49.980 --> 00:50:51.370
And then functions as well.

00:50:51.370 --> 00:50:53.830
SQL itself is a programming language.

00:50:53.830 --> 00:50:56.950
It might not necessarily have as high
ceiling as something like Python,

00:50:56.950 --> 00:50:58.120
as much functionality.

00:50:58.120 --> 00:51:01.480
But built into SQL are
any number of functions.

00:51:01.480 --> 00:51:03.940
If you want to select
the average revenue

00:51:03.940 --> 00:51:08.680
across any number of purchase orders,
you can use the average function.

00:51:08.680 --> 00:51:13.900
And in MySQL Query can I select data
but pass it into one of these functions

00:51:13.900 --> 00:51:16.960
and get back an answer without
having to paste it into, say,

00:51:16.960 --> 00:51:19.120
a spreadsheet, let alone calculator.

00:51:19.120 --> 00:51:20.920
I can count rows just as I did.

00:51:20.920 --> 00:51:24.520
I wanted to count the number of
albums that a given artist had,

00:51:24.520 --> 00:51:27.220
and COUNT was a function
supported by SQL.

00:51:27.220 --> 00:51:28.540
You can get maxes and mins.

00:51:28.540 --> 00:51:31.330
You can get summations
of value and so many more

00:51:31.330 --> 00:51:33.600
features built into this language.

00:51:33.600 --> 00:51:35.350
And while the tool you
might use might not

00:51:35.350 --> 00:51:39.850
be DB Browser, perhaps it's just a
textual interface or even something

00:51:39.850 --> 00:51:42.190
even more graphical,
it ultimately is just

00:51:42.190 --> 00:51:47.110
executing on your behalf the SQL queries
and handing them off to the database

00:51:47.110 --> 00:51:48.415
for execution.

00:51:48.415 --> 00:51:51.040
Now, with all of these features
that you get with the database,

00:51:51.040 --> 00:51:52.850
it all sounds too good to be true.

00:51:52.850 --> 00:51:54.910
You can scale, you can
eliminate redundancy,

00:51:54.910 --> 00:51:57.670
and you can still select
all the data you want.

00:51:57.670 --> 00:52:00.610
But unfortunately, you have
to start to think harder

00:52:00.610 --> 00:52:02.470
about the design of your system.

00:52:02.470 --> 00:52:06.940
Because databases are sometimes
vulnerable to mistakes, if you will.

00:52:06.940 --> 00:52:09.310
Consider, for instance,
something like Twitter

00:52:09.310 --> 00:52:12.790
that tends to keep track of how
many times something's retweeted.

00:52:12.790 --> 00:52:14.890
Or consider an app like
Instagram, which keeps

00:52:14.890 --> 00:52:19.990
track of how many likes or upvotes
or views some post has gotten.

00:52:19.990 --> 00:52:23.890
On the most popular or viral
of media, those counters,

00:52:23.890 --> 00:52:26.980
those integers might be getting
incremented ever so quickly.

00:52:26.980 --> 00:52:31.090
If you and I both happen to view or
like something at nearly the same time,

00:52:31.090 --> 00:52:35.140
well, that interface
from us into the system

00:52:35.140 --> 00:52:38.470
might actually trigger some
instruction on some server

00:52:38.470 --> 00:52:43.210
somewhere to tell Instagram's
database to increment some value.

00:52:43.210 --> 00:52:46.730
But how does a database go
about incrementing a value?

00:52:46.730 --> 00:52:50.650
Well, if the value of views
or the value of counts

00:52:50.650 --> 00:52:54.790
is somehow stored in a database,
a column of type integer,

00:52:54.790 --> 00:52:59.020
and you go ahead and execute a SQL
SELECT in order to get that value,

00:52:59.020 --> 00:53:03.730
and, for instance, 100 people
before me has liked some post,

00:53:03.730 --> 00:53:07.640
well, the value of that
result comes back as 100.

00:53:07.640 --> 00:53:10.180
I then do some math in
my code, perhaps Python.

00:53:10.180 --> 00:53:15.490
I increment the 100 to 101, and
then I use a SQL UPDATE, as you can,

00:53:15.490 --> 00:53:18.520
to push the data back into the database.

00:53:18.520 --> 00:53:21.220
But suppose both you and
I anywhere in the world

00:53:21.220 --> 00:53:25.240
both happen to like a post at
the same or nearly the same time,

00:53:25.240 --> 00:53:29.320
as can certainly happen when
posts are especially popular.

00:53:29.320 --> 00:53:33.040
Unfortunately, a computer can
sometimes do multiple things

00:53:33.040 --> 00:53:35.440
at once or at least in
such rapid succession

00:53:35.440 --> 00:53:39.250
that it appears to be at the
same time, but a race of sorts

00:53:39.250 --> 00:53:42.650
can happen, a race condition,
if you will, as follows.

00:53:42.650 --> 00:53:45.550
If both my button and
your button is pressed

00:53:45.550 --> 00:53:50.830
at nearly the same time and that induces
execution of code on Instagram server

00:53:50.830 --> 00:53:54.130
that selects for both of us
the current count of views,

00:53:54.130 --> 00:53:58.630
suppose that both of the threads, so
to speak, both of the SQL operations

00:53:58.630 --> 00:54:02.620
that select that data both
come back with the value 100,

00:54:02.620 --> 00:54:05.800
each of the blocks of
code serving our requests

00:54:05.800 --> 00:54:10.360
go ahead and increment that value
to 101 and then, via SQL UPDATE,

00:54:10.360 --> 00:54:13.690
pushes that value back to the database.

00:54:13.690 --> 00:54:19.510
Unfortunately, because both you and
I induced an evaluation of that math

00:54:19.510 --> 00:54:22.900
at nearly the same time, what
the database might end up storing

00:54:22.900 --> 00:54:26.470
is not 102 but 101.

00:54:26.470 --> 00:54:32.110
In other words, if two people's input
is triggering a race to update data,

00:54:32.110 --> 00:54:37.600
the database had better keep track of
who and when asked for that update.

00:54:37.600 --> 00:54:39.250
Otherwise, you lose data.

00:54:39.250 --> 00:54:43.840
Now, in the case of tweets or likes,
it's probably not all that problematic.

00:54:43.840 --> 00:54:46.510
Though, frankly, that is their business.

00:54:46.510 --> 00:54:50.380
But you can certainly imagine that with
banks or financial institutions, where

00:54:50.380 --> 00:54:53.140
the numbers matter ever
so more, you certainly

00:54:53.140 --> 00:54:56.800
don't want to accidentally
lose track of some dollars.

00:54:56.800 --> 00:55:02.310
And so how do we go about solving
this in the case of a SQL database?

00:55:02.310 --> 00:55:06.370
Well, it turns out that there is fairly
fundamental primitives or solutions

00:55:06.370 --> 00:55:07.360
you can use.

00:55:07.360 --> 00:55:12.790
Consider a metaphor in the real world,
such as, say, a familiar refrigerator.

00:55:12.790 --> 00:55:15.190
And suppose that you and
your significant other

00:55:15.190 --> 00:55:19.300
happened to both like something to drink
at the end of the day, like some milk.

00:55:19.300 --> 00:55:22.410
And so you go ahead when you get
home, and the other person's not,

00:55:22.410 --> 00:55:25.255
and you open the fridge and you
see, oh, darn it, we're out of milk.

00:55:25.255 --> 00:55:27.700
And so you close the fridge
and you head downstairs

00:55:27.700 --> 00:55:29.530
and you walk to the nearest store.

00:55:29.530 --> 00:55:32.440
Meanwhile, that other person
comes home and, wanting some milk,

00:55:32.440 --> 00:55:36.430
opens the fridge, and darn it if
we aren't out of milk as well.

00:55:36.430 --> 00:55:39.670
And so that person now heads
out, perhaps in a different car,

00:55:39.670 --> 00:55:43.420
in a different route, and heads to the
store, some other store to get milk.

00:55:43.420 --> 00:55:45.490
Fast forward some amount
of time and both of you

00:55:45.490 --> 00:55:49.390
come home, and darn it if you don't now
have twice as much milk as you need,

00:55:49.390 --> 00:55:50.980
and it does go bad.

00:55:50.980 --> 00:55:55.270
And so you've both ended up buying milk
when really only one of you needed to.

00:55:55.270 --> 00:55:57.520
And this is similar in
spirit, but now you've

00:55:57.520 --> 00:56:02.440
got more data than you actually wanted,
but it's not the right amount of data.

00:56:02.440 --> 00:56:03.830
So why did that happen?

00:56:03.830 --> 00:56:08.170
Well, both of you, like Instagram,
inspected the state of some value

00:56:08.170 --> 00:56:10.900
and made a decision on it
before the other person

00:56:10.900 --> 00:56:14.170
was done acting on that information.

00:56:14.170 --> 00:56:17.170
So in our very real world
of milk, how could you

00:56:17.170 --> 00:56:22.540
go about avoiding that conflict,
that race, to restock the fridge?

00:56:22.540 --> 00:56:26.290
Well, you could quite simply grab
a pen and paper and leave a note,

00:56:26.290 --> 00:56:29.260
so to speak, on the fridge telling
someone else, gone for milk,

00:56:29.260 --> 00:56:31.390
and hopefully they then
would not do the same.

00:56:31.390 --> 00:56:35.290
Or perhaps more dramatically you could
lock the refrigerator in some sense

00:56:35.290 --> 00:56:38.740
so that they can't even get
into it and inspect that state.

00:56:38.740 --> 00:56:41.800
But ultimately, you need your
act of checking the fridge

00:56:41.800 --> 00:56:44.950
and restocking it to be
what we'll call atomic.

00:56:44.950 --> 00:56:48.880
And databases can and
hopefully do provide atomicity,

00:56:48.880 --> 00:56:54.670
that property, the ability to do
multiple things together or not at all

00:56:54.670 --> 00:56:58.300
but not be interrupted
by someone else's work.

00:56:58.300 --> 00:57:00.160
And in fact, in the
database world, these

00:57:00.160 --> 00:57:04.270
are generally known as actual locks
whereby you say to the database,

00:57:04.270 --> 00:57:07.630
don't let anyone else
write to this table or row

00:57:07.630 --> 00:57:11.780
until I am ready to
release or unlock it.

00:57:11.780 --> 00:57:14.530
That, of course, though, tends to
be a very heavy-handed solution.

00:57:14.530 --> 00:57:17.140
Say don't let anyone
else touch this data.

00:57:17.140 --> 00:57:19.660
Better to do it on a
more fine-grained control

00:57:19.660 --> 00:57:21.940
so that you don't slow
your whole system down.

00:57:21.940 --> 00:57:25.660
And so SQL databases tend to
support what are more finally known

00:57:25.660 --> 00:57:30.530
as transactions whereby you can execute
one or more commands again and again

00:57:30.530 --> 00:57:33.430
and again back to back but
make sure that all of them

00:57:33.430 --> 00:57:36.220
go through it once
before, say, your commands

00:57:36.220 --> 00:57:41.260
that your user input induced
actually is allowed to get executed.

00:57:41.260 --> 00:57:43.960
Now, honestly, even
in the most high-tech

00:57:43.960 --> 00:57:47.260
of places like Instagram and
Twitter, this is a hard problem.

00:57:47.260 --> 00:57:51.490
Because at some point, even
waiting for my operations

00:57:51.490 --> 00:57:54.290
to complete before yours
can go in can be a challenge

00:57:54.290 --> 00:57:56.210
and a bottleneck for everyone else.

00:57:56.210 --> 00:58:00.360
And so in the case of the most viral of
posts, what can systems these days do?

00:58:00.360 --> 00:58:03.100
Well, you could just
kind of wait and write

00:58:03.100 --> 00:58:06.790
that data back to the database
that is updated eventually.

00:58:06.790 --> 00:58:09.100
And indeed, another
property of databases

00:58:09.100 --> 00:58:13.870
is known as just that, eventual
consistency, a property that says,

00:58:13.870 --> 00:58:19.900
don't lose any data, but only eventually
make sure it's reflected on the server.

00:58:19.900 --> 00:58:23.170
Eventually get the value
right, but do get it right.

00:58:23.170 --> 00:58:25.720
And so what Instagram and
Twitter and others might do

00:58:25.720 --> 00:58:30.690
is just cache or buffer that data,
waiting until things have quieted down

00:58:30.690 --> 00:58:35.140
'til the post is no longer viral
or most users have gone to sleep.

00:58:35.140 --> 00:58:38.050
Now, that alone might not
be to the best of solutions,

00:58:38.050 --> 00:58:43.990
but it avoids having to get the highest
powered and most expensive hardware.

00:58:43.990 --> 00:58:46.720
Of course, in other contexts,
that might be the better solution.

00:58:46.720 --> 00:58:48.760
In the world of finance,
sometimes it comes

00:58:48.760 --> 00:58:53.140
down to the actual length of cables or
distance from some server to another

00:58:53.140 --> 00:58:55.930
to ensure that the
data gets there so fast

00:58:55.930 --> 00:58:59.030
that you don't run into
these sorts of challenges.

00:58:59.030 --> 00:59:01.810
So databases can solve this,
but the developers and designers

00:59:01.810 --> 00:59:07.740
that use those databases need to know
how to do it and that they should.

00:59:07.740 --> 00:59:10.230
Lastly, there's another
challenge as well,

00:59:10.230 --> 00:59:12.780
unfortunately all too
commonly done these days

00:59:12.780 --> 00:59:17.280
because folks just don't defend
against it via common mechanisms.

00:59:17.280 --> 00:59:21.810
It turns out that a bad actor somewhere
on the internet or your own network

00:59:21.810 --> 00:59:24.960
can potentially, if
you're not careful, trick

00:59:24.960 --> 00:59:29.400
a database into executing
commands that you didn't intend.

00:59:29.400 --> 00:59:32.160
For instance, suppose
in the context of Python

00:59:32.160 --> 00:59:34.740
you have some code that
looks a bit like this.

00:59:34.740 --> 00:59:37.280
Here is a program written
in a mix of pseudocode

00:59:37.280 --> 00:59:40.380
and Python that's designed
to allow a user to input

00:59:40.380 --> 00:59:43.560
the title of an album for
which they want to search.

00:59:43.560 --> 00:59:48.720
And so here I use the Python function
INPUT to prompt the user for just that.

00:59:48.720 --> 00:59:51.570
On the left-hand side do I clear
a variable called Title, and then

00:59:51.570 --> 00:59:55.800
assigned from right to left, the
user's response to that variable.

00:59:55.800 --> 00:59:59.610
Then suppose for the sake of discussion
there is some function called EXECUTE

00:59:59.610 --> 01:00:05.020
whose purpose in life is to take input
that itself represents a SQL command.

01:00:05.020 --> 01:00:08.430
That SQL command might be
this, so like star from Artist

01:00:08.430 --> 01:00:11.190
where Title equals something.

01:00:11.190 --> 01:00:12.490
Now, what is that something?

01:00:12.490 --> 01:00:15.270
Well, if I have the user's input
in a variable called Title,

01:00:15.270 --> 01:00:19.680
I can use the plus operator in
Python, not to add but concatenate

01:00:19.680 --> 01:00:24.390
two strings together, coding them
singly and completing that thought.

01:00:24.390 --> 01:00:28.800
The problem, though, with SQL is
that user's not really to be trusted.

01:00:28.800 --> 01:00:32.580
And whether the user's input is
coming from a keyboard on a Mac,

01:00:32.580 --> 01:00:36.990
or PC, or perhaps, more
compellingly, from an app or website,

01:00:36.990 --> 01:00:40.260
you probably should not
trust all your users.

01:00:40.260 --> 01:00:44.880
Because suppose that your user typed
in not the album name for which they

01:00:44.880 --> 01:00:48.090
want to search, Let There Be Rock,
but rather they type something

01:00:48.090 --> 01:00:51.990
like Let There Be Rock,
semicolon, DELETE,

01:00:51.990 --> 01:00:56.160
thereby using SQL's apparent
DELETE command in order

01:00:56.160 --> 01:01:00.360
to trick your database into
executing not one but two

01:01:00.360 --> 01:01:02.430
commands, a SELECT and DELETE.

01:01:02.430 --> 01:01:05.130
And indeed, this is what's
known as a SQL injection

01:01:05.130 --> 01:01:08.100
attack, the ability for
an adversary, a bad actor

01:01:08.100 --> 01:01:12.060
out there, to somehow trick
your database and your code

01:01:12.060 --> 01:01:16.500
into executing some command
that you didn't intend.

01:01:16.500 --> 01:01:17.690
How is that possible?

01:01:17.690 --> 01:01:21.330
Well, some of these characters
are dangerous, so to speak.

01:01:21.330 --> 01:01:25.680
A semicolon in SQL tends to
separate one command from another.

01:01:25.680 --> 01:01:27.180
It finishes your thought.

01:01:27.180 --> 01:01:30.930
And if you yourself don't anticipate
that some human, this bad actor,

01:01:30.930 --> 01:01:33.960
might type in themselves a
semicolon when they really

01:01:33.960 --> 01:01:38.040
shouldn't be typing SQL at all,
you might mistake that semicolon

01:01:38.040 --> 01:01:42.330
for the actual terminus between
one command and another.

01:01:42.330 --> 01:01:45.240
And if you just blindly
pass it into your server

01:01:45.240 --> 01:01:49.080
and let it execute as usual, you
might execute not just that SELECT

01:01:49.080 --> 01:01:51.750
but that DELETE or
anything else as well.

01:01:51.750 --> 01:01:55.260
And in this way can an adversary not
only delete data from your database

01:01:55.260 --> 01:01:59.430
but maybe select more than you
intended, or update or insert.

01:01:59.430 --> 01:02:03.030
It's ultimately up to you to
defend against these threats.

01:02:03.030 --> 01:02:04.680
So, how?

01:02:04.680 --> 01:02:08.610
Well, it turns out that there are
libraries, code written by other people

01:02:08.610 --> 01:02:11.460
that, frankly, via very
easy-to-use functions,

01:02:11.460 --> 01:02:17.760
just make it easy to sanitize or
scrub, so to speak, user's input.

01:02:17.760 --> 01:02:20.700
What do these libraries or
these functions typically do?

01:02:20.700 --> 01:02:24.280
Honestly, they just escape, so to
speak, these dangerous characters.

01:02:24.280 --> 01:02:27.750
Something like a semicolon or
perhaps a single apostrophe that

01:02:27.750 --> 01:02:32.230
might, in SQL, have some special
and dangerous potential meaning,

01:02:32.230 --> 01:02:36.330
they just escape them as by putting a
backslash, a special character in front

01:02:36.330 --> 01:02:40.080
of them so that if the human were
to type in Let There Be Rock,

01:02:40.080 --> 01:02:46.020
semicolon, DELETE, that would actually
be interpreted safely by your database

01:02:46.020 --> 01:02:51.840
as a search for an album called Let
There Be Rock, semicolon, DELETE,

01:02:51.840 --> 01:02:55.380
which of course most likely
is not the name of an album.

01:02:55.380 --> 01:03:00.060
So that query would probably not
return or select any results.

01:03:00.060 --> 01:03:02.280
But more importantly,
it would not be tricked

01:03:02.280 --> 01:03:05.280
into executing two SQL commands.

01:03:05.280 --> 01:03:10.200
Rather, it would execute only the
SELECT but with a nonsensical value.

01:03:10.200 --> 01:03:13.260
Lastly, consider what a database is.

01:03:13.260 --> 01:03:16.170
It's really a piece of software
running on some computer

01:03:16.170 --> 01:03:20.820
somewhere, be it on my own Mac, or
PC, or some server in the cloud.

01:03:20.820 --> 01:03:24.780
But if you have just one database,
as I seem to keep presuming,

01:03:24.780 --> 01:03:27.180
you have this so-called
single point of failure,

01:03:27.180 --> 01:03:31.290
again, just as we had in the world
of cloud computing more generally.

01:03:31.290 --> 01:03:35.330
And so with something like data
where you don't want to lose it

01:03:35.330 --> 01:03:37.830
and you certainly don't want
all of your web servers or apps

01:03:37.830 --> 01:03:43.170
to go offline just because one server,
your database server, has gone out,

01:03:43.170 --> 01:03:48.327
it's time to revisit the architecture
or topology of systems more generally.

01:03:48.327 --> 01:03:50.160
Something tells me that
it's not going to be

01:03:50.160 --> 01:03:52.560
sufficient to have just one database.

01:03:52.560 --> 01:03:54.570
You probably want two.

01:03:54.570 --> 01:03:59.010
But if you have two databases, now how
do you decide where to put the data?

01:03:59.010 --> 01:04:02.580
Do you put it here, or over
here, or maybe in both places?

01:04:02.580 --> 01:04:06.050
If you put it in both places, though,
you're then using twice as much space,

01:04:06.050 --> 01:04:08.510
so already we've opened a can of worms.

01:04:08.510 --> 01:04:11.730
To solve one simple problem, don't
be a single point of failure.

01:04:11.730 --> 01:04:15.980
But that's going to cost you some time,
or some money, or certainly space.

01:04:15.980 --> 01:04:18.020
So what can you do if
you're architecting

01:04:18.020 --> 01:04:22.670
a system that has now not just
web servers but, say, databases?

01:04:22.670 --> 01:04:26.510
Well, odds are you're going to want to
have not just the one, pictured here

01:04:26.510 --> 01:04:29.690
as a cylinder, as this
convention, but you're probably

01:04:29.690 --> 01:04:32.600
going to want to have a second as well.

01:04:32.600 --> 01:04:35.120
But of course, if you
have two of them, odds

01:04:35.120 --> 01:04:38.200
are it's not sufficient just to
store half of your data on one

01:04:38.200 --> 01:04:40.670
and half of your data on the
other, because, of course,

01:04:40.670 --> 01:04:42.890
you've not solved the
single point of failure.

01:04:42.890 --> 01:04:45.890
You now just have two single points
of failure because half of your data

01:04:45.890 --> 01:04:48.590
could be lost here or half of it here.

01:04:48.590 --> 01:04:52.400
So odds are you're going to want
to start having backups of data.

01:04:52.400 --> 01:04:55.130
But you don't want to necessarily
have to waste time restoring

01:04:55.130 --> 01:04:57.650
from backup, especially
if you want to maintain

01:04:57.650 --> 01:05:00.740
as many as five nines of uptime.

01:05:00.740 --> 01:05:04.160
So odds are you're going to want
to have these databases really

01:05:04.160 --> 01:05:06.830
be duplicates of one another.

01:05:06.830 --> 01:05:09.140
And whenever you write
data to one database,

01:05:09.140 --> 01:05:12.170
you should probably write
it to the other in parallel.

01:05:12.170 --> 01:05:15.140
So, yes, admittedly, you have
just spent twice as much space

01:05:15.140 --> 01:05:17.190
and, frankly, twice as much money.

01:05:17.190 --> 01:05:20.390
But at some point those sorts
of costs are inevitable.

01:05:20.390 --> 01:05:23.000
But there's other ways
to scale here, too.

01:05:23.000 --> 01:05:26.720
You can, of course, hit a
ceiling on vertical scaling

01:05:26.720 --> 01:05:28.520
even when it comes to databases.

01:05:28.520 --> 01:05:32.570
After all, if a database is just
a program running on some server

01:05:32.570 --> 01:05:36.620
and there is only so much RAM or
disk space or CPU in that server,

01:05:36.620 --> 01:05:39.770
eventually you're not going to
be able to store as much data

01:05:39.770 --> 01:05:42.000
or as quickly as you want.

01:05:42.000 --> 01:05:43.340
So what can you do?

01:05:43.340 --> 01:05:45.620
Well, you could, for
instance, shard your data

01:05:45.620 --> 01:05:50.060
and have not just two but maybe four or
more servers and put all of the users

01:05:50.060 --> 01:05:55.040
whose names start from A to M on
one half of your cluster of servers,

01:05:55.040 --> 01:05:57.170
so to speak, but maybe
everyone else from M

01:05:57.170 --> 01:06:00.500
to Z based on, say, their last
name can go on the others.

01:06:00.500 --> 01:06:04.220
To shard a database means to split
the data in some predictable way

01:06:04.220 --> 01:06:05.990
that you can repeat again and again.

01:06:05.990 --> 01:06:10.070
But even there, too, even if only the
As through Ms are going to the left,

01:06:10.070 --> 01:06:14.060
you want to make sure that you
still have that backup or replica.

01:06:14.060 --> 01:06:17.600
And this arrow suggests that they
really should be intercommunicating,

01:06:17.600 --> 01:06:20.450
not unlike load balancers we've seen.

01:06:20.450 --> 01:06:23.120
But there's another way you can
scale your databases as well.

01:06:23.120 --> 01:06:27.440
You don't have to have databases
doing both reading and writing.

01:06:27.440 --> 01:06:31.340
To read data from a database or any
server means to take it from its memory

01:06:31.340 --> 01:06:32.720
and read it into yours.

01:06:32.720 --> 01:06:36.140
And to write means to do
the opposite, to save it.

01:06:36.140 --> 01:06:39.140
Well, what you can do actually
in the world of databases

01:06:39.140 --> 01:06:42.710
is also replicate your
databases multiple times.

01:06:42.710 --> 01:06:46.800
And you might have connected
to these two primary databases

01:06:46.800 --> 01:06:52.610
multiple other databases that are just
copies in one direction and not both.

01:06:52.610 --> 01:06:57.030
And what you might then do is use these
two primary databases not only to read

01:06:57.030 --> 01:06:59.720
but to write, abbreviated here RW.

01:06:59.720 --> 01:07:02.000
But these other databases
down here, which

01:07:02.000 --> 01:07:04.730
are just copies of the ones
to which they're connected,

01:07:04.730 --> 01:07:06.440
are just called read replicas.

01:07:06.440 --> 01:07:11.450
They exist solely for the purpose
to read from them again and again.

01:07:11.450 --> 01:07:13.200
When might this make sense?

01:07:13.200 --> 01:07:15.320
Well, in some contexts,
like social media,

01:07:15.320 --> 01:07:17.930
like Facebook, it's
probably the case that there

01:07:17.930 --> 01:07:20.690
are more reads than there are writes.

01:07:20.690 --> 01:07:24.800
That is to say you probably know more
people who post more content than you

01:07:24.800 --> 01:07:29.310
but you probably still
read or look at theirs.

01:07:29.310 --> 01:07:32.180
And so if the data for
your business follows

01:07:32.180 --> 01:07:36.560
that pattern whereby writes are maybe
common but reads are way more common,

01:07:36.560 --> 01:07:41.030
you can do exactly this model and
replicate again and again, honestly,

01:07:41.030 --> 01:07:44.420
as a tree structure for
efficiency so that it doesn't all

01:07:44.420 --> 01:07:46.280
have to replicate one at a time.

01:07:46.280 --> 01:07:50.000
But then you can write software,
be it in Python or something else,

01:07:50.000 --> 01:07:54.440
that writes data only to these
one or two servers but reads

01:07:54.440 --> 01:07:56.720
from any number of them as well.

01:07:56.720 --> 01:07:59.450
But this, too, is a bit of a
rabbit hole, because at some point

01:07:59.450 --> 01:08:01.580
you want to have this
redundancy not in one

01:08:01.580 --> 01:08:05.990
location but others, east coast and
west coast, one country and another.

01:08:05.990 --> 01:08:10.400
And at that point, you might actually
run into the limitations of time.

01:08:10.400 --> 01:08:13.910
Because after all, it takes a non-zero
number of milliseconds or seconds

01:08:13.910 --> 01:08:16.310
for data to travel long distance.

01:08:16.310 --> 01:08:18.050
Consider after all
how long it might take

01:08:18.050 --> 01:08:21.590
data to go from Cambridge,
Massachusetts to somewhere in Japan.

01:08:21.590 --> 01:08:26.210
That's far longer than it might take
to just go down the road to MIT.

01:08:26.210 --> 01:08:30.979
So here, too, we can revisit all of the
problems we saw in the world of cloud

01:08:30.979 --> 01:08:32.540
computing and servers more generally.

01:08:32.540 --> 01:08:35.990
They're back to revisit in
the context of databases.

01:08:35.990 --> 01:08:40.399
But with databases, you care ever
more that these things not go down,

01:08:40.399 --> 01:08:44.630
or if they do, that you have
spares, backups, or replicas.

01:08:44.630 --> 01:08:48.680
Because now that we're storing our
data in this centralized place,

01:08:48.680 --> 01:08:52.069
we have to think hard not only about
how we're scaling computationally

01:08:52.069 --> 01:08:56.359
but how we're scaling in
terms of our data as well.

01:08:56.359 --> 01:08:59.090
So consider where then we began.

01:08:59.090 --> 01:09:02.770
We started by laying out data in a
spreadsheet, be it Microsoft Excel,

01:09:02.770 --> 01:09:04.939
or Apple Numbers, or
Google Spreadsheets.

01:09:04.939 --> 01:09:07.130
From there we considered
what types of data

01:09:07.130 --> 01:09:09.950
we might store there so
that if we want to upgrade,

01:09:09.950 --> 01:09:12.410
so to speak, from a
spreadsheet to database,

01:09:12.410 --> 01:09:15.260
we know what types we can specify.

01:09:15.260 --> 01:09:19.189
And in SQL, whether implemented
in SQLite, Oracle, or MySQL,

01:09:19.189 --> 01:09:21.560
or something else, they tend
to be standard types that

01:09:21.560 --> 01:09:25.279
tend to be common across platforms,
albeit with some variations,

01:09:25.279 --> 01:09:27.260
so that we can think
hard about these types

01:09:27.260 --> 01:09:31.277
and then ultimately help the
database help us be performant.

01:09:31.277 --> 01:09:34.069
Because if I know that I'm going
to be searching or selecting based

01:09:34.069 --> 01:09:36.470
on certain data, I can tell
the database, for instance,

01:09:36.470 --> 01:09:39.680
to make it unique or at least index it.

01:09:39.680 --> 01:09:43.520
And then using SQL constructions
like SELECT, and INSERT, and UPDATE,

01:09:43.520 --> 01:09:46.670
and DELETE, and yet others can
I manipulate that data and get

01:09:46.670 --> 01:09:52.069
at it even faster, frankly, than the
human me could with a mere spreadsheet.

01:09:52.069 --> 01:09:54.860
But with the design of any
system, as with databases,

01:09:54.860 --> 01:09:58.430
we start to open new cans
of worms and new problems

01:09:58.430 --> 01:10:01.400
as we start to explore more
sophisticated challenges.

01:10:01.400 --> 01:10:04.880
But here, too, many, if
not all, of these problems

01:10:04.880 --> 01:10:09.290
can be solved by simply reducing
the problems to first principles

01:10:09.290 --> 01:10:11.450
and consider, what is
the problem to be solved?

01:10:11.450 --> 01:10:12.770
How is our data represented?

01:10:12.770 --> 01:10:14.150
Where is it stored?

01:10:14.150 --> 01:10:18.290
And consider ultimately what business
constraints or legal constraints

01:10:18.290 --> 01:10:22.700
we have when manipulating that data
and consider what tools of the trade

01:10:22.700 --> 01:10:24.050
are available to us.

01:10:24.050 --> 01:10:27.740
This then is database design and,
more generally, design unto itself,

01:10:27.740 --> 01:10:32.450
not just focusing on the correctness
of implementing solutions to problems

01:10:32.450 --> 01:10:37.670
but the efficiency and the
scalability as well thereof.