WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.470 [MUSIC PLAYING] 00:00:17.784 --> 00:00:19.620 BRIAN YU: All right, welcome back, everyone, 00:00:19.620 --> 00:00:22.500 to an introduction to Artificial Intelligence with Python. 00:00:22.500 --> 00:00:24.990 Now so far in this class, we've used AI to solve 00:00:24.990 --> 00:00:28.140 a number of different problems-- giving the AI instructions for how 00:00:28.140 --> 00:00:31.590 to search for a solution or how to satisfy certain constraints in order 00:00:31.590 --> 00:00:34.560 to find its way from some input point to some output point 00:00:34.560 --> 00:00:36.725 in order to solve some sort of problem. 00:00:36.725 --> 00:00:38.850 Today, we're going to turn to the world of learning 00:00:38.850 --> 00:00:40.890 in particular the idea of machine learning 00:00:40.890 --> 00:00:43.080 which generally refers to the idea where we are not 00:00:43.080 --> 00:00:47.160 going to give the computer explicit instructions for how to perform a task, 00:00:47.160 --> 00:00:50.040 but rather, we are going to give the computer access to information 00:00:50.040 --> 00:00:52.980 in the form of data or patterns that it can learn from, and let 00:00:52.980 --> 00:00:56.730 the computer try and figure out what those patterns are-- try and understand 00:00:56.730 --> 00:00:59.455 that data to be able to perform a task on its own. 00:00:59.455 --> 00:01:01.830 Now machine learning comes in a number of different forms 00:01:01.830 --> 00:01:03.180 and it's a very wide field. 00:01:03.180 --> 00:01:06.240 So today, we'll explore some of the foundational algorithms 00:01:06.240 --> 00:01:09.810 and ideas that are behind a lot of the different areas within machine 00:01:09.810 --> 00:01:10.480 learning. 00:01:10.480 --> 00:01:13.980 And one of the most popular is the idea of supervised machine 00:01:13.980 --> 00:01:15.840 learning or just supervised learning. 00:01:15.840 --> 00:01:18.540 And supervised learning is a particular type of task. 00:01:18.540 --> 00:01:22.620 It refers to the task where we give the computer access to a data set, 00:01:22.620 --> 00:01:26.320 where that data set consists of input-output pairs. 00:01:26.320 --> 00:01:28.080 And what we would like the computer to do 00:01:28.080 --> 00:01:31.620 is we would like our AI to be able to figure out some function 00:01:31.620 --> 00:01:34.327 that maps inputs to outputs. 00:01:34.327 --> 00:01:36.660 So we have a whole bunch of data that generally consists 00:01:36.660 --> 00:01:39.030 of some kind of input-- some evidence, some information 00:01:39.030 --> 00:01:40.890 that the computer will have access to. 00:01:40.890 --> 00:01:43.770 And we would like the computer, based on that input information, 00:01:43.770 --> 00:01:47.100 to predict what some output is going to be. 00:01:47.100 --> 00:01:50.295 And we'll give it some data so that the computer can train its model on 00:01:50.295 --> 00:01:53.310 to begin to understand how it is that this information works, 00:01:53.310 --> 00:01:56.640 and how it is that the inputs and outputs relate to each other. 00:01:56.640 --> 00:01:58.380 But ultimately, we hope that our computer 00:01:58.380 --> 00:02:01.470 will be able to figure out some function that given those inputs, 00:02:01.470 --> 00:02:03.720 is able to get those outputs. 
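To make that concrete, a data set of input-output pairs can be pictured as a list of labeled examples, as in the short Python sketch below. The measurements and category names are invented purely for illustration, and the placeholder hypothesis function is a stand-in for the function we would eventually want the computer to learn.

# A supervised learning data set: each example pairs an input
# (a tuple of measurements) with the output label a human assigned to it.
# Every value here is made up, just to show the shape of the data.
data = [
    ((0.93, 0.77), "category_a"),
    ((0.49, 0.11), "category_b"),
    ((0.88, 0.65), "category_a"),
    ((0.52, 0.20), "category_b"),
]

def hypothesis(inputs):
    # A placeholder: a learned hypothesis would map inputs to outputs
    # in a way that agrees with the labeled data as often as possible.
    return "category_a"

# Fraction of examples where the placeholder agrees with the true label.
accuracy = sum(hypothesis(x) == y for x, y in data) / len(data)
print(accuracy)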
00:02:03.720 --> 00:02:06.670 There are a couple of different tasks within supervised learning, 00:02:06.670 --> 00:02:09.900 the one we'll focus on and start with is known as classification. 00:02:09.900 --> 00:02:14.310 And classification is the problem where if I give you a whole bunch of inputs, 00:02:14.310 --> 00:02:16.740 you need to figure out some way to map those inputs 00:02:16.740 --> 00:02:20.700 into discrete categories, where you can decide what those categories are. 00:02:20.700 --> 00:02:22.867 And it's the job of the computer to predict 00:02:22.867 --> 00:02:24.450 what those categories are going to be. 00:02:24.450 --> 00:02:26.370 So that might be, for example, I give you 00:02:26.370 --> 00:02:28.980 information about a banknote like a US dollar 00:02:28.980 --> 00:02:31.290 and I'm asking you to predict for me doesn't belong 00:02:31.290 --> 00:02:33.930 to the category of authentic bank notes or does 00:02:33.930 --> 00:02:36.450 it belong to the category of counterfeit banknotes. 00:02:36.450 --> 00:02:38.160 You need to categorize the input. 00:02:38.160 --> 00:02:40.920 And we want to train the computer to figure out some function 00:02:40.920 --> 00:02:43.230 to be able to do that calculation. 00:02:43.230 --> 00:02:45.780 Another example might be the case of weather, something 00:02:45.780 --> 00:02:48.822 we've talked about a little bit so far in this class, where we would like 00:02:48.822 --> 00:02:52.020 to predict on a given day is it going to rain on that day, 00:02:52.020 --> 00:02:53.700 is it going to be cloudy on that day. 00:02:53.700 --> 00:02:55.533 And before, we've seen how we could do this, 00:02:55.533 --> 00:02:59.190 if we really give the computer all the exact probabilities for, you know, 00:02:59.190 --> 00:03:01.860 if these are the conditions, what's the probability of rain, 00:03:01.860 --> 00:03:04.590 oftentimes, we don't have access to that information, though. 00:03:04.590 --> 00:03:07.410 But what we do have access to is a whole bunch of data. 00:03:07.410 --> 00:03:10.410 So if we wanted to be able to predict something like is it going to rain 00:03:10.410 --> 00:03:12.810 or is it not going to rain, we would give the computer 00:03:12.810 --> 00:03:15.840 historical information about days when it was raining 00:03:15.840 --> 00:03:18.660 and days when it was not raining, and ask the computer 00:03:18.660 --> 00:03:21.240 to look for patterns in that data. 00:03:21.240 --> 00:03:22.960 So what might that data look like? 00:03:22.960 --> 00:03:25.380 Well, we could structure that data in a table like this. 00:03:25.380 --> 00:03:28.800 This might be what our table looks like, where are for any particular day going 00:03:28.800 --> 00:03:32.400 back, we have information about like that day's humidity, that day's air 00:03:32.400 --> 00:03:33.180 pressure. 00:03:33.180 --> 00:03:35.160 And then importantly, we have a label-- 00:03:35.160 --> 00:03:39.060 something where the human has said that on this particular day, it was raining 00:03:39.060 --> 00:03:40.070 or it was not raining. 00:03:40.070 --> 00:03:42.720 So you could fill in this table with a whole bunch of data. 00:03:42.720 --> 00:03:46.350 And what makes this what we would call a supervised learning exercise 00:03:46.350 --> 00:03:49.440 is that a human has gone in and labeled each of these data points. 
00:03:49.440 --> 00:03:52.260 Said that on this day, when these were the values for the humidity 00:03:52.260 --> 00:03:56.330 and pressure, that day was a rainy day and this day was a not rainy day. 00:03:56.330 --> 00:03:58.830 And what we would like the computer to be able to do then 00:03:58.830 --> 00:04:01.560 is to be able to figure out, given these inputs, given 00:04:01.560 --> 00:04:05.430 the humidity and the pressure, can the computer predict what label 00:04:05.430 --> 00:04:06.945 should be associated with that day. 00:04:06.945 --> 00:04:08.820 Does that day look more like it's going to be 00:04:08.820 --> 00:04:13.650 a day that rains or does it look more like a day when it's not going to rain. 00:04:13.650 --> 00:04:15.630 Put a little bit more mathematically, you 00:04:15.630 --> 00:04:18.928 can think of this as a function that takes two inputs-- 00:04:18.928 --> 00:04:21.720 the inputs being the data points that our computer will have access 00:04:21.720 --> 00:04:23.610 to-- things like humidity and pressure. 00:04:23.610 --> 00:04:25.500 So we could write a function, f, that takes 00:04:25.500 --> 00:04:27.730 as input both humidity and pressure. 00:04:27.730 --> 00:04:31.170 And then the output is going to be what category 00:04:31.170 --> 00:04:34.590 we would ascribe to these particular input points-- what label 00:04:34.590 --> 00:04:36.135 we would associate with that input. 00:04:36.135 --> 00:04:38.010 So we've seen a couple of example data points 00:04:38.010 --> 00:04:41.250 here, where given this value for humidity and this value for pressure, 00:04:41.250 --> 00:04:44.610 we predict is it going to rain or is it not going to rain. 00:04:44.610 --> 00:04:47.610 And that's information that we just gathered from the world. 00:04:47.610 --> 00:04:51.060 We measured on various different days what the humidity and pressure were. 00:04:51.060 --> 00:04:55.170 We observed whether or not we saw rain or no rain on that particular day. 00:04:55.170 --> 00:04:58.950 And this function, f, is what we would like to approximate. 00:04:58.950 --> 00:05:00.900 Now the computer and we humans don't really 00:05:00.900 --> 00:05:04.960 know exactly how this function f works-- it's probably quite a complex function. 00:05:04.960 --> 00:05:08.040 So what we're going to do instead is attempt to estimate it. 00:05:08.040 --> 00:05:12.240 We would like to come up with a hypothesis function, h, 00:05:12.240 --> 00:05:15.430 which is going to try to approximate what f does. 00:05:15.430 --> 00:05:19.260 We want to come up with some function h that will also take the same inputs 00:05:19.260 --> 00:05:22.800 and we'll also produce an output, rain or no rain. 00:05:22.800 --> 00:05:25.950 And ideally, we'd like these two functions to agree on as much 00:05:25.950 --> 00:05:27.030 as possible. 00:05:27.030 --> 00:05:30.810 So the goal then of these supervised learning classification tasks 00:05:30.810 --> 00:05:34.060 is going to be to figure out what does that function h look like. 00:05:34.060 --> 00:05:37.860 How can we begin to estimate, given all of this information, all of this data, 00:05:37.860 --> 00:05:42.330 what category or what label should be assigned to a particular data point. 00:05:42.330 --> 00:05:44.457 So where can you begin doing this? 
00:05:44.457 --> 00:05:47.040 Well, a reasonable thing to do, especially in this situation-- 00:05:47.040 --> 00:05:48.780 I have two numerical values-- 00:05:48.780 --> 00:05:53.460 is I could try to plot this on a graph that has two axes-- an x-axis 00:05:53.460 --> 00:05:54.332 and the y-axis. 00:05:54.332 --> 00:05:57.540 And in this case, we're just going to be using two numerical values as input, 00:05:57.540 --> 00:05:59.795 but these same types of ideas at scale as you 00:05:59.795 --> 00:06:01.170 add more and more inputs as well. 00:06:01.170 --> 00:06:04.260 We'll be plotting things in two dimensions, but as we'll soon see, 00:06:04.260 --> 00:06:07.930 you could add more inputs and just imagine things in multiple dimensions. 00:06:07.930 --> 00:06:10.920 And while we humans have trouble conceptualizing anything 00:06:10.920 --> 00:06:13.410 really beyond three dimensions, at least visually, 00:06:13.410 --> 00:06:15.600 a computer has no problem with trying to imagine 00:06:15.600 --> 00:06:17.280 things and many, many more dimensions. 00:06:17.280 --> 00:06:20.343 That for a computer, each dimension is just some separate number 00:06:20.343 --> 00:06:21.510 that are just keeping track. 00:06:21.510 --> 00:06:23.385 So it wouldn't be unreasonable for a computer 00:06:23.385 --> 00:06:25.890 to think in 10 dimensions or 100 dimensions 00:06:25.890 --> 00:06:28.000 to be able to try to solve a problem. 00:06:28.000 --> 00:06:31.200 But for now, we've got two inputs, so we'll graph things along two axes-- 00:06:31.200 --> 00:06:33.840 an x-axis, which will here represent humidity, 00:06:33.840 --> 00:06:36.580 and a y-axis, which here represents pressure. 00:06:36.580 --> 00:06:40.530 And what we might do is say, let's take all of the days that were raining, 00:06:40.530 --> 00:06:44.130 and just try to plop them on this graph, and see where they fall on this graph. 00:06:44.130 --> 00:06:46.590 And here might be all of the rainy days, where 00:06:46.590 --> 00:06:49.020 each rainy day is one of these blue dots here 00:06:49.020 --> 00:06:51.710 that corresponds to a particular value for humidity 00:06:51.710 --> 00:06:53.422 and a particular value for pressure. 00:06:53.422 --> 00:06:56.380 And then I might do the same thing with the days that were not raining. 00:06:56.380 --> 00:06:58.380 So I take all the not rainy days, figure out 00:06:58.380 --> 00:07:01.020 what their values were for each of these two inputs, 00:07:01.020 --> 00:07:03.743 and go ahead and plot them on this graph as well. 00:07:03.743 --> 00:07:05.160 And I've here plotted them in red. 00:07:05.160 --> 00:07:09.870 So blue here stands for a rainy day, red here stands for a not rainy day. 00:07:09.870 --> 00:07:11.340 And this then is the input-- 00:07:11.340 --> 00:07:14.160 that my computer has access to all of this input. 00:07:14.160 --> 00:07:18.270 And what I would like the computer to be able to do is to train a model such 00:07:18.270 --> 00:07:21.120 that if I'm ever presented with a new input that doesn't have 00:07:21.120 --> 00:07:25.170 a label associated with it, something like this white dot here, 00:07:25.170 --> 00:07:28.530 I would like to predict given those values for each of the two inputs, 00:07:28.530 --> 00:07:31.860 should we classify it as a blue dot, a rainy day, 00:07:31.860 --> 00:07:34.985 or should we classify it as a red dot, a not rainy day. 
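A plot along those lines could be sketched with matplotlib as follows. The humidity and pressure readings are made-up numbers chosen only so the two labels form rough clusters, and the single unlabeled point stands in for the white dot described above.

import matplotlib.pyplot as plt

# Made-up (humidity, pressure) readings for labeled days.
rain = [(0.81, 0.99), (0.85, 1.00), (0.78, 0.98), (0.88, 0.97)]
no_rain = [(0.45, 1.03), (0.50, 1.04), (0.40, 1.05), (0.55, 1.02)]
new_day = (0.80, 0.985)  # an unlabeled day we would like to classify

plt.scatter([h for h, p in rain], [p for h, p in rain], color="blue", label="rain")
plt.scatter([h for h, p in no_rain], [p for h, p in no_rain], color="red", label="no rain")
plt.scatter(*new_day, color="white", edgecolors="black", label="unlabeled")
plt.xlabel("humidity")
plt.ylabel("pressure")
plt.legend()
plt.show()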
00:07:34.985 --> 00:07:37.860 And if you're just looking at this picture graphically trying to say, 00:07:37.860 --> 00:07:41.100 all right, this white dot, does it look like it belongs to the blue category 00:07:41.100 --> 00:07:43.560 or does it look like it belongs to the red category, 00:07:43.560 --> 00:07:47.400 I think most people would agree that it probably belongs to the blue category. 00:07:47.400 --> 00:07:48.180 And why is that? 00:07:48.180 --> 00:07:51.967 Well, it looks like it's close to other blue dots. 00:07:51.967 --> 00:07:54.300 And that's not a very formal notion, but it's the notion 00:07:54.300 --> 00:07:56.100 that we'll formalize it in just a moment-- 00:07:56.100 --> 00:07:59.190 that because it seems to be close to, like, this blue dot here, like, 00:07:59.190 --> 00:08:01.560 nothing else it's closer to it, than we might say 00:08:01.560 --> 00:08:03.720 that it should be categorized as blue. 00:08:03.720 --> 00:08:06.030 It should fall into that category of, I think 00:08:06.030 --> 00:08:08.760 that day is going to be a rainy day based on that input. 00:08:08.760 --> 00:08:11.880 It might not be totally accurate, but it's a pretty good guess. 00:08:11.880 --> 00:08:15.090 And this type of algorithm is actually a very popular and common machine 00:08:15.090 --> 00:08:18.750 learning algorithm known as nearest neighbor classification. 00:08:18.750 --> 00:08:21.810 It's an algorithm for solving these classification type problems. 00:08:21.810 --> 00:08:25.540 And in nearest neighbor classification, it's going to perform this algorithm. 00:08:25.540 --> 00:08:29.700 What it will do is, given an input, it will choose the class of the nearest 00:08:29.700 --> 00:08:31.650 data point to that input. 00:08:31.650 --> 00:08:34.890 By class, we just here mean category, like rain or no rain, 00:08:34.890 --> 00:08:36.840 counterfeit or not counterfeit. 00:08:36.840 --> 00:08:41.799 And we choose the category or the class based on the nearest data point. 00:08:41.799 --> 00:08:43.559 So given all that data we just looked at, 00:08:43.559 --> 00:08:47.070 is the nearest data point a blue point or is that a red point. 00:08:47.070 --> 00:08:49.780 And depending on the answer to that question, 00:08:49.780 --> 00:08:51.660 we were able to make some sort of judgment. 00:08:51.660 --> 00:08:54.420 We were able to say something like, we think it's going to be blue 00:08:54.420 --> 00:08:56.262 or we think it's going to be red. 00:08:56.262 --> 00:08:58.470 So likewise, we could apply this to other data points 00:08:58.470 --> 00:08:59.850 that we encounter as well. 00:08:59.850 --> 00:09:03.840 If suddenly, this data point comes about, well, it's nearest data is red, 00:09:03.840 --> 00:09:07.320 so we would go ahead and classify this as a red point, not raining. 00:09:07.320 --> 00:09:09.540 Things get a little bit trickier, though, 00:09:09.540 --> 00:09:12.543 when you look at a point like this white point over here, 00:09:12.543 --> 00:09:14.460 and you ask the same sort of question-- should 00:09:14.460 --> 00:09:17.700 it belong to the category of blue points, the rainy days? 00:09:17.700 --> 00:09:21.870 Or should it belong to the category of red points, the not rainy days? 00:09:21.870 --> 00:09:24.240 Now nearest neighbor classification would 00:09:24.240 --> 00:09:26.820 say the way you solve this problem is look at which point 00:09:26.820 --> 00:09:28.190 it is nearest to that point. 
00:09:28.190 --> 00:09:31.440 You look at this nearest point and say it's red-- it's a not rainy day. 00:09:31.440 --> 00:09:34.200 And therefore, according to nearest neighbor classification, 00:09:34.200 --> 00:09:37.470 I would say that this unlabeled point, that should also be red. 00:09:37.470 --> 00:09:40.800 It should also be classified as a not rainy day. 00:09:40.800 --> 00:09:43.800 But your intuition might think that that's a reasonable judgment 00:09:43.800 --> 00:09:47.190 to make-- that the closest thing is a not rainy day, so may as well guess 00:09:47.190 --> 00:09:48.540 that it's not rainy day. 00:09:48.540 --> 00:09:51.680 But it's probably also reasonable to look at the bigger picture of things 00:09:51.680 --> 00:09:56.670 and to say, yes, it is true, that the nearest point to it was a red point, 00:09:56.670 --> 00:10:00.060 but it's surrounded by a whole bunch of other blue points. 00:10:00.060 --> 00:10:02.310 So looking at the bigger picture, there is potentially 00:10:02.310 --> 00:10:06.235 an argument to be made that this point should actually be blue. 00:10:06.235 --> 00:10:08.610 And with only this data, we actually don't know for sure. 00:10:08.610 --> 00:10:11.310 We are given some inputs, something we're trying to predict, 00:10:11.310 --> 00:10:14.410 and we don't necessarily know what the output is going to be. 00:10:14.410 --> 00:10:17.500 So in this case, which one is correct is difficult to say. 00:10:17.500 --> 00:10:21.280 But oftentimes, considering more than just a single neighbor, considering 00:10:21.280 --> 00:10:25.300 multiple neighbors, can sometimes give us a better result. 00:10:25.300 --> 00:10:27.400 And so there's a variant on the nearest neighbor 00:10:27.400 --> 00:10:31.330 a classification algorithm that is known as the k-nearest-neighbor 00:10:31.330 --> 00:10:34.150 classification algorithm, where k is some parameter, 00:10:34.150 --> 00:10:38.090 some number that we choose for how many neighbors are we going to look at. 00:10:38.090 --> 00:10:41.440 So one nearest neighbor classification is what we saw before. 00:10:41.440 --> 00:10:44.770 Just pick the one nearest neighbor and use that category. 00:10:44.770 --> 00:10:46.780 But with k-nearest-neighbor classification, 00:10:46.780 --> 00:10:49.150 where k might be three or five or seven-- 00:10:49.150 --> 00:10:52.000 to say look at the three or five or seven closest 00:10:52.000 --> 00:10:55.930 neighbors, closest data points to that point, works a little bit differently. 00:10:55.930 --> 00:10:57.760 This algorithm, we're given an input. 00:10:57.760 --> 00:11:02.690 Choose the most common class out of the k nearest data points to that input. 00:11:02.690 --> 00:11:06.010 So if we look at the five nearest points, and three of them 00:11:06.010 --> 00:11:08.380 say it's raining and two of them say it's not raining, 00:11:08.380 --> 00:11:10.810 we'll go with the three instead of the two, 00:11:10.810 --> 00:11:14.080 because each one effectively gets one vote towards what 00:11:14.080 --> 00:11:16.450 they believe the category ought to be. 00:11:16.450 --> 00:11:18.850 And ultimately, you choose the category that 00:11:18.850 --> 00:11:21.540 has the most votes as a consequence of that. 00:11:21.540 --> 00:11:24.700 So k-nearest-neighbor classification-- fairly straightforward one 00:11:24.700 --> 00:11:25.960 to understand intuitively. 00:11:25.960 --> 00:11:29.030 You just look at the neighbors and figure out what the answer might be. 
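A minimal from-scratch sketch of k-nearest-neighbor classification might look like the following. The Euclidean distance metric and the tiny made-up data set are assumptions chosen for illustration, not anything prescribed above.

import math
from collections import Counter

def knn_classify(data, point, k=3):
    # Sort the labeled examples by their Euclidean distance from the query point.
    neighbors = sorted(data, key=lambda example: math.dist(example[0], point))
    # Each of the k nearest examples casts one vote for its own label;
    # the label with the most votes is the prediction.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Made-up (humidity, pressure) readings with human-assigned labels.
data = [
    ((0.81, 0.99), "rain"), ((0.85, 1.00), "rain"), ((0.78, 0.98), "rain"),
    ((0.45, 1.03), "no rain"), ((0.50, 1.04), "no rain"), ((0.40, 1.05), "no rain"),
]

print(knn_classify(data, (0.80, 0.985), k=1))  # single nearest neighbor
print(knn_classify(data, (0.80, 0.985), k=3))  # majority vote among three neighbors

With k set to 1 this reduces to the single-nearest-neighbor rule from before; raising k trades sensitivity to one stray neighbor for a broader vote.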
00:11:29.030 --> 00:11:30.880 And it turns out this can work very, very 00:11:30.880 --> 00:11:34.480 well for solving a whole variety of different types of classification 00:11:34.480 --> 00:11:35.440 problems. 00:11:35.440 --> 00:11:38.240 But not every model is going to work under every situation. 00:11:38.240 --> 00:11:40.865 And so one of the things we'll take a look at today, especially 00:11:40.865 --> 00:11:42.740 in the context of supervised machine learning 00:11:42.740 --> 00:11:45.740 is that there are a number of different approaches to machine learning-- 00:11:45.740 --> 00:11:47.830 a number of different algorithms that we can apply 00:11:47.830 --> 00:11:49.690 all solving the same type of problem. 00:11:49.690 --> 00:11:52.780 All solving some kind of classification problem, where 00:11:52.780 --> 00:11:56.170 we want to take inputs and organize it into different categories 00:11:56.170 --> 00:11:58.570 And no one algorithm isn't necessarily always going 00:11:58.570 --> 00:12:00.387 to be better than some other algorithm. 00:12:00.387 --> 00:12:01.720 They each have their trade-offs. 00:12:01.720 --> 00:12:04.540 And maybe depending on the data, one type of algorithm 00:12:04.540 --> 00:12:07.420 is going to be better-suited to trying to model that information 00:12:07.420 --> 00:12:08.870 than some other algorithm. 00:12:08.870 --> 00:12:10.720 And so this is what a lot of machine learning research 00:12:10.720 --> 00:12:13.090 ends up being about-- that when you're trying to apply machine learning 00:12:13.090 --> 00:12:16.480 techniques, you're often looking not just at one particular algorithm, 00:12:16.480 --> 00:12:18.250 but trying multiple different algorithms, 00:12:18.250 --> 00:12:20.770 trying to see what is going to give you the best 00:12:20.770 --> 00:12:25.790 results for trying to predict some function that maps inputs to outputs. 00:12:25.790 --> 00:12:29.410 So what then are the drawbacks of k-nearest-neighbor classification? 00:12:29.410 --> 00:12:30.640 Well, there are a couple. 00:12:30.640 --> 00:12:33.070 One might be that in a naive approach at least, 00:12:33.070 --> 00:12:35.642 it could be fairly slow to have to go through and measure 00:12:35.642 --> 00:12:38.350 the distance between a point and every single one of these points 00:12:38.350 --> 00:12:39.310 that exist here. 00:12:39.310 --> 00:12:40.810 Now there are ways of trying to get around that. 00:12:40.810 --> 00:12:43.435 There are data structures that can help to make it more quickly 00:12:43.435 --> 00:12:45.260 to be able to find these neighbors. 00:12:45.260 --> 00:12:48.490 There are also techniques you can use to try and prune some of this data, 00:12:48.490 --> 00:12:50.680 remove some of the data points so that you're only 00:12:50.680 --> 00:12:54.470 left with the relevant data points just to make it a little bit easier. 00:12:54.470 --> 00:12:56.890 But ultimately, what we might like to do is come up 00:12:56.890 --> 00:13:00.370 with another way of trying to do this classification. 00:13:00.370 --> 00:13:02.560 And one way of trying to do the classification 00:13:02.560 --> 00:13:04.750 was looking at what are the neighboring points. 00:13:04.750 --> 00:13:08.200 But another way might be to try to look at all of the data 00:13:08.200 --> 00:13:11.440 and see if we can come up with some decision boundary-- 00:13:11.440 --> 00:13:15.860 some boundary that will separate the rainy days from the not rainy days. 
00:13:15.860 --> 00:13:19.640 In the case of two dimensions, we can do that by drawing a line, for example. 00:13:19.640 --> 00:13:22.900 So what we might want to try to do is just find some line, 00:13:22.900 --> 00:13:27.520 find some separator that divides the rainy days, the blue points over here, 00:13:27.520 --> 00:13:30.040 from the not rainy days, the red points over there. 00:13:30.040 --> 00:13:32.680 We're now trying a different approach in contrast 00:13:32.680 --> 00:13:34.570 with the nearest neighbor approach, which 00:13:34.570 --> 00:13:38.380 just looked at local data around the input data point that we cared about. 00:13:38.380 --> 00:13:42.160 Now what we're doing is trying to use a technique known as linear regression 00:13:42.160 --> 00:13:45.140 to find some sort of line that will separate the two 00:13:45.140 --> 00:13:46.858 halves from each other. 00:13:46.858 --> 00:13:49.150 Now, sometimes, it will actually be possible to come up 00:13:49.150 --> 00:13:52.630 with some line that perfectly separates all the rainy days from the not 00:13:52.630 --> 00:13:53.590 rainy days. 00:13:53.590 --> 00:13:56.140 Realistically, though, this is probably cleaner 00:13:56.140 --> 00:13:58.050 than many data sets will actually be. 00:13:58.050 --> 00:13:59.600 Oftentimes, data is messy. 00:13:59.600 --> 00:14:00.470 There are outliers. 00:14:00.470 --> 00:14:03.850 There's random noise that happens inside of a particular system. 00:14:03.850 --> 00:14:06.220 And what we'd like to do is still be able to figure out 00:14:06.220 --> 00:14:07.640 what a line might look like. 00:14:07.640 --> 00:14:12.000 So in practice, the data will not always be linearly separable, 00:14:12.000 --> 00:14:14.740 where linearly separable refers to some data set 00:14:14.740 --> 00:14:18.673 where I can draw a line just to separate the two halves of it perfectly. 00:14:18.673 --> 00:14:20.590 Instead, you might have a situation like this, 00:14:20.590 --> 00:14:24.352 where there are some rainy points that are on this side of the line and some 00:14:24.352 --> 00:14:26.560 not raining points that are on that side of the line. 00:14:26.560 --> 00:14:30.880 And there may not be a line that perfectly separates 00:14:30.880 --> 00:14:34.270 what path of the inputs from the other half-- that perfectly separates all 00:14:34.270 --> 00:14:36.490 the rainy days from the not rainy days. 00:14:36.490 --> 00:14:40.090 But we can still say that this line does a pretty good job. 00:14:40.090 --> 00:14:42.080 And we'll try to formalize a little bit later. 00:14:42.080 --> 00:14:44.200 What we mean when we say something like this line 00:14:44.200 --> 00:14:46.932 does a pretty good job of trying to make that prediction. 00:14:46.932 --> 00:14:48.640 But for now, let's just say we're looking 00:14:48.640 --> 00:14:51.690 for a line that does as good of a job as we can 00:14:51.690 --> 00:14:56.750 at trying to separate one category of things from another category of things. 00:14:56.750 --> 00:15:00.160 So let's now try to formalize this a little bit more mathematically. 00:15:00.160 --> 00:15:02.110 We want to come up with some sort of function, 00:15:02.110 --> 00:15:04.300 some way we can define this line. 00:15:04.300 --> 00:15:08.830 And our inputs are things like humidity and pressure in this case. 00:15:08.830 --> 00:15:13.840 So our inputs we might call x1 is going to be our represent humidity and x2 00:15:13.840 --> 00:15:15.500 is going to represent pressure. 
00:15:15.500 --> 00:15:18.500 These are inputs that we are going to provide to our machine learning 00:15:18.500 --> 00:15:19.250 algorithm. 00:15:19.250 --> 00:15:21.860 And given those inputs, we would like for our model 00:15:21.860 --> 00:15:24.260 to be able to predict some sort of output. 00:15:24.260 --> 00:15:27.260 And we're going to predict that using our hypothesis function, 00:15:27.260 --> 00:15:28.610 which we called h. 00:15:28.610 --> 00:15:33.680 Our hypothesis function is going to take as input, x1 and x2, humidity 00:15:33.680 --> 00:15:34.890 and pressure in this case. 00:15:34.890 --> 00:15:36.740 And you can imagine if we didn't just have two inputs-- 00:15:36.740 --> 00:15:38.870 we had three or four or five inputs or more-- 00:15:38.870 --> 00:15:42.260 we could have this hypothesis function take all of those as input. 00:15:42.260 --> 00:15:45.550 And we'll see examples of that a little bit later as well. 00:15:45.550 --> 00:15:49.470 And now the question is, what does this hypothesis function do? 00:15:49.470 --> 00:15:53.210 Well, it really just needs to measure is this data 00:15:53.210 --> 00:15:58.640 point on one side of the boundary or is it on the other side of the boundary? 00:15:58.640 --> 00:16:00.620 And how do we formalize that boundary? 00:16:00.620 --> 00:16:02.990 Well, the boundary is generally going to be 00:16:02.990 --> 00:16:06.780 a linear combination of these input variables, 00:16:06.780 --> 00:16:08.250 at least in this particular case. 00:16:08.250 --> 00:16:10.910 So what we're trying to do when we say linear combination 00:16:10.910 --> 00:16:13.925 is take each of these inputs and multiply them by some number 00:16:13.925 --> 00:16:15.550 that we're going to have to figure out. 00:16:15.550 --> 00:16:18.650 We'll generally call that number a weight for how important 00:16:18.650 --> 00:16:21.800 should these variables be in trying to determine the answer. 00:16:21.800 --> 00:16:24.440 So weight each of these variables with some weight. 00:16:24.440 --> 00:16:27.440 And we might add like a constant to it just to try and make the function 00:16:27.440 --> 00:16:28.610 a little bit different. 00:16:28.610 --> 00:16:30.280 And the result we just need to compare-- 00:16:30.280 --> 00:16:33.350 is it greater than 0 or is it less than 0 to say 00:16:33.350 --> 00:16:37.280 doesn't belong on one side of the line or the other side of the line. 00:16:37.280 --> 00:16:41.060 And so what that mathematical expression might look like is this. 00:16:41.060 --> 00:16:45.773 We would take each of my variables, x1 and x2, multiply them by some weight. 00:16:45.773 --> 00:16:47.690 I don't yet know what that weight is, but it's 00:16:47.690 --> 00:16:50.720 going to be some number, weight 1 and weight 2. 00:16:50.720 --> 00:16:53.570 And maybe we just want to add some other weight 0 to it 00:16:53.570 --> 00:16:56.870 because the function might require us to shift the entire value up 00:16:56.870 --> 00:16:58.610 or down by a certain amount. 00:16:58.610 --> 00:16:59.840 And then we just compare. 00:16:59.840 --> 00:17:02.870 If we do all this math, is a greater than or equal to 0. 00:17:02.870 --> 00:17:05.810 If so, we might categorize that data point as a rainy day. 00:17:05.810 --> 00:17:09.440 And otherwise, we might say no rain. 00:17:09.440 --> 00:17:12.260 So the key here then is that this expression 00:17:12.260 --> 00:17:15.619 is how we are going to calculate whether it's a rainy day or not. 
00:17:15.619 --> 00:17:18.690 We're going to do a bunch of math where we take each of the variables, 00:17:18.690 --> 00:17:21.650 multiply them by a weight, maybe add an extra weight to it, 00:17:21.650 --> 00:17:24.079 see if the result is greater than or equal to 0. 00:17:24.079 --> 00:17:26.240 And using that result of that expression, 00:17:26.240 --> 00:17:29.660 we're able to determine whether it's raining or not raining. 00:17:29.660 --> 00:17:33.080 This expression here is in this case going to refer to just some line. 00:17:33.080 --> 00:17:36.170 If you were to plot that graphically, it would just be some line. 00:17:36.170 --> 00:17:40.320 And what the line actually looks like depends upon these weights. 00:17:40.320 --> 00:17:42.710 x1 and x2 are the inputs, but these weights 00:17:42.710 --> 00:17:46.220 are really what determine the shape of that line, the slope of that line, 00:17:46.220 --> 00:17:49.190 and what that line actually looks like. 00:17:49.190 --> 00:17:52.330 So we then would like to figure out what these weights should be. 00:17:52.330 --> 00:17:54.520 We can choose whatever weights we want, but we 00:17:54.520 --> 00:17:58.330 want to choose weights in such a way that if you pass in a rainy day's 00:17:58.330 --> 00:18:00.850 humidity and pressure, then you end up with a result that 00:18:00.850 --> 00:18:02.440 is greater than or equal to 0. 00:18:02.440 --> 00:18:05.690 And we would like it such that if we passed into our hypothesis function 00:18:05.690 --> 00:18:09.040 a not rainy day's inputs, then the output that we get 00:18:09.040 --> 00:18:11.043 should be not raining. 00:18:11.043 --> 00:18:13.960 So before we get there, let's try and formalize this a little bit more 00:18:13.960 --> 00:18:17.350 mathematically just to get a sense for how it is that you'll often see this 00:18:17.350 --> 00:18:21.560 if you ever go further into supervised machine learning and explore this idea. 00:18:21.560 --> 00:18:23.660 One thing is that generally for these categories, 00:18:23.660 --> 00:18:27.420 we'll sometimes just use the names of the categories like rain and not rain. 00:18:27.420 --> 00:18:30.670 Often, mathematically, if we're trying to do comparisons between these things, 00:18:30.670 --> 00:18:33.110 it's easier just to deal in the world of numbers. 00:18:33.110 --> 00:18:35.530 So we could just say 1 and 0-- 00:18:35.530 --> 00:18:37.820 1 for raining, 0 for not raining. 00:18:37.820 --> 00:18:39.040 So we do all this math. 00:18:39.040 --> 00:18:41.500 And if the result is greater than or equal to 0, 00:18:41.500 --> 00:18:45.160 we'll go ahead and say our hypothesis function outputs 1, meaning raining. 00:18:45.160 --> 00:18:48.220 And otherwise, it outputs 0, meaning not raining. 00:18:48.220 --> 00:18:52.090 And oftentimes, this type of expression will instead 00:18:52.090 --> 00:18:54.890 be expressed using vector mathematics. 00:18:54.890 --> 00:18:57.390 And all a vector is, if you're not familiar with the term, 00:18:57.390 --> 00:19:00.400 is that it refers to a sequence of numerical values. 00:19:00.400 --> 00:19:03.580 You could represent that in Python using, like, a list of numerical values 00:19:03.580 --> 00:19:06.370 or a tuple of numerical values. 00:19:06.370 --> 00:19:10.160 And here, we have a couple of sequences of numerical values.
00:19:10.160 --> 00:19:13.300 One of our vectors, one of our sequences of numerical values, 00:19:13.300 --> 00:19:15.130 are all of these individual weights-- 00:19:15.130 --> 00:19:18.340 w0, w1 and w2. 00:19:18.340 --> 00:19:21.400 So we could construct what we'll call a weight vector 00:19:21.400 --> 00:19:23.860 and we'll see why this is useful in a moment called w, 00:19:23.860 --> 00:19:26.838 generally represented using a boldface w, that 00:19:26.838 --> 00:19:28.630 is just a sequence of these three weights-- 00:19:28.630 --> 00:19:31.690 weight 0, weight 1, and weight 2. 00:19:31.690 --> 00:19:33.820 And to be able to calculate based on those weights 00:19:33.820 --> 00:19:37.570 whether we think a day is raining or not raining, 00:19:37.570 --> 00:19:42.490 we're going to multiply each of those weights by one of our input variables. 00:19:42.490 --> 00:19:46.630 That w2, this weight, is going to be multiplied by input variable x2. 00:19:46.630 --> 00:19:49.780 w1 is going to be multiplied by input variable x1. 00:19:49.780 --> 00:19:53.097 And w0-- well, it's not being multiplied by anything, 00:19:53.097 --> 00:19:55.180 but to make sure the vectors are the same length-- 00:19:55.180 --> 00:19:57.263 and we'll see why that's useful in just a second-- 00:19:57.263 --> 00:20:01.077 we'll just go ahead and say w0 is being multiplied by 1. 00:20:01.077 --> 00:20:03.160 Because you can multiply by something by 1 and you 00:20:03.160 --> 00:20:05.090 end up getting the exact same number. 00:20:05.090 --> 00:20:07.570 So in addition to the weight vector, w, we'll 00:20:07.570 --> 00:20:11.890 also have an input vector that we'll call x that has three values-- 00:20:11.890 --> 00:20:18.160 1, again, because we're just multiplying w0 by 1 eventually, and then x1 and x2. 00:20:18.160 --> 00:20:21.850 So here then, we've represented two distinct vectors-- a vector of weights 00:20:21.850 --> 00:20:23.590 that we need to somehow learn. 00:20:23.590 --> 00:20:25.690 The goal of our machine learning algorithm 00:20:25.690 --> 00:20:28.310 is to learn what this weight vector is supposed to be. 00:20:28.310 --> 00:20:30.502 We could choose any arbitrary set of numbers 00:20:30.502 --> 00:20:33.460 and it would produce a function that tries to predict rain or not rain, 00:20:33.460 --> 00:20:35.140 but it probably wouldn't be very good. 00:20:35.140 --> 00:20:39.190 What we want to do is come up with a good choice of these weights 00:20:39.190 --> 00:20:41.980 so that we're able to do the accurate predictions. 00:20:41.980 --> 00:20:45.790 And then this input vector represents a particular input 00:20:45.790 --> 00:20:48.970 to the function, a data point for which we would like to estimate, 00:20:48.970 --> 00:20:52.442 is that day a rainy day or is that day not rainy day. 00:20:52.442 --> 00:20:54.400 And that's going to vary just depending on what 00:20:54.400 --> 00:20:58.300 input is provided to our function, what it is that we are trying to estimate. 00:20:58.300 --> 00:21:02.390 And then to do the calculation, we want to calculate this expression here. 00:21:02.390 --> 00:21:04.900 And it turns out that expression is what we would call 00:21:04.900 --> 00:21:07.390 the dot product of these two vectors. 00:21:07.390 --> 00:21:10.480 The dot product of two vectors just means 00:21:10.480 --> 00:21:13.630 taking each of the terms and the vectors and multiplying them together, 00:21:13.630 --> 00:21:18.860 w0 multiplied by 1, w1 multiplied it by x1, w2 multiply it by x2. 
00:21:18.860 --> 00:21:21.400 And that's why these vectors need to be the same length. 00:21:21.400 --> 00:21:24.400 And then we just add all of the results together. 00:21:24.400 --> 00:21:30.040 So the dot product of w and x, our weight vector and our input vector, 00:21:30.040 --> 00:21:35.770 that's just going to be w0 times 1, or just w0 plus w1 times x1, 00:21:35.770 --> 00:21:40.570 multiplying these two terms together, plus w2 times x2, 00:21:40.570 --> 00:21:42.773 multiplying those statements together. 00:21:42.773 --> 00:21:45.190 So we have our weight vector, which we need to figure out. 00:21:45.190 --> 00:21:47.020 We need our machine learning algorithm to figure out 00:21:47.020 --> 00:21:48.280 what the weights should be. 00:21:48.280 --> 00:21:51.370 We have the input vector representing the data point 00:21:51.370 --> 00:21:54.610 that we're trying to predict a category for, predict a label for. 00:21:54.610 --> 00:21:58.270 And we're able to do that calculation by taking this dot product, which you'll 00:21:58.270 --> 00:22:00.260 often see represented in vector form-- 00:22:00.260 --> 00:22:02.052 but if you haven't seen vectors before, you 00:22:02.052 --> 00:22:04.810 can think of it as identical to just this mathematical expression. 00:22:04.810 --> 00:22:08.170 Just doing the multiplication, adding the results together. 00:22:08.170 --> 00:22:11.440 And then seeing whether the result is greater than or equal to 0 or not. 00:22:11.440 --> 00:22:14.530 This expression here is identical to the expression 00:22:14.530 --> 00:22:16.810 that we're calculating to see whether or not 00:22:16.810 --> 00:22:21.340 that answer is greater than or equal to 0 in this case. 00:22:21.340 --> 00:22:24.370 And so for that reason, you'll often see the hypothesis function 00:22:24.370 --> 00:22:25.980 written as something like this-- 00:22:25.980 --> 00:22:29.800 a simpler representation where the hypothesis takes as input 00:22:29.800 --> 00:22:34.060 some input vector x, some humidity and pressure for some day. 00:22:34.060 --> 00:22:37.810 And we want to predict an output like rain or no rain or 1 or 0 00:22:37.810 --> 00:22:40.440 if we choose to represent things numerically. 00:22:40.440 --> 00:22:42.970 And the way we do that is by taking the dot 00:22:42.970 --> 00:22:45.612 product of the weights and our input. 00:22:45.612 --> 00:22:47.320 If it's greater than or equal to 0, we'll 00:22:47.320 --> 00:22:49.150 go ahead and save the output is 1. 00:22:49.150 --> 00:22:52.030 Otherwise, the output is going to be 0. 00:22:52.030 --> 00:22:56.170 And this hypothesis we say is parameterized by the weights. 00:22:56.170 --> 00:22:58.330 Depending on what weights we choose, we'll 00:22:58.330 --> 00:23:00.372 end up getting a different hypothesis. 00:23:00.372 --> 00:23:02.830 If we choose the weights randomly, we're probably not going 00:23:02.830 --> 00:23:04.455 to get a very good hypothesis function. 00:23:04.455 --> 00:23:06.730 We'll get a 1 or a 0, but it's probably not 00:23:06.730 --> 00:23:09.160 accurately going to reflect whether we think a day is 00:23:09.160 --> 00:23:11.350 going to be rainy or not rainy. 00:23:11.350 --> 00:23:13.930 But if we choose the weights right, we can often 00:23:13.930 --> 00:23:16.390 do a pretty good job of trying to estimate 00:23:16.390 --> 00:23:20.880 whether we think the output of the function should be a 1 or a 0. 
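In code, that hypothesis is just a dot product followed by a comparison; a minimal sketch is below. The particular weight values are made up, since choosing good weights is exactly the problem discussed next.

def dot(w, x):
    # Multiply corresponding terms of two equal-length vectors and add the results.
    return sum(wi * xi for wi, xi in zip(w, x))

def hypothesis(weights, x1, x2):
    # Pair the inputs with a leading 1 so that w0 has something to multiply by,
    # then output 1 (rain) if the dot product is at least 0, and 0 (no rain) otherwise.
    x = (1, x1, x2)
    return 1 if dot(weights, x) >= 0 else 0

weights = (-3.0, 2.5, 1.2)  # made-up values for w0, w1, w2
print(hypothesis(weights, x1=0.9, x2=1.0))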
00:23:20.880 --> 00:23:23.170 And so the question then is how to figure out 00:23:23.170 --> 00:23:27.042 what these weights should be-- how to be able to tune those parameters. 00:23:27.042 --> 00:23:29.000 And there are a number of ways you can do that. 00:23:29.000 --> 00:23:32.800 One of the most common is known as the perceptron learning rule. 00:23:32.800 --> 00:23:34.303 And we'll see more of this later. 00:23:34.303 --> 00:23:36.220 But the idea of the perceptron learning rule-- 00:23:36.220 --> 00:23:37.960 and we're not going to get too deep into the mathematics, 00:23:37.960 --> 00:23:40.570 we'll mostly just introduce it more conceptually-- is 00:23:40.570 --> 00:23:44.770 to say that given some data point that we would like to learn from, 00:23:44.770 --> 00:23:50.140 some data point that has an input x and an output y, where y is like 1 for rain 00:23:50.140 --> 00:23:53.052 or 0 for not rain, then we're going to update the weights. 00:23:53.052 --> 00:23:55.010 And we'll look at the formula in just a moment. 00:23:55.010 --> 00:23:59.020 But the big picture idea is that we can start with random weights 00:23:59.020 --> 00:24:00.610 but then learn from the data. 00:24:00.610 --> 00:24:02.900 Like, take the data points one at a time. 00:24:02.900 --> 00:24:05.370 And for each one of the data points figure out, 00:24:05.370 --> 00:24:09.370 all right, what parameters do we need to change inside of the weights 00:24:09.370 --> 00:24:12.050 in order to better match that input point. 00:24:12.050 --> 00:24:13.930 And so that is the value of having access 00:24:13.930 --> 00:24:15.940 to a lot of data in the supervised machine 00:24:15.940 --> 00:24:18.520 learning algorithm-- is that you take each of the data points 00:24:18.520 --> 00:24:22.310 and maybe look at them multiple times and constantly try and figure out 00:24:22.310 --> 00:24:24.490 whether you need to shift your weights in order 00:24:24.490 --> 00:24:29.260 to better create some weight vector that is able to correctly or more accurately 00:24:29.260 --> 00:24:31.300 try to estimate what the output should be. 00:24:31.300 --> 00:24:33.970 Whether we think it's going to be raining or whether we think 00:24:33.970 --> 00:24:35.760 it's not going to be raining. 00:24:35.760 --> 00:24:37.510 So what does that weight update look like? 00:24:37.510 --> 00:24:39.468 Without going into too much of the mathematics, 00:24:39.468 --> 00:24:41.410 we're going to update each of the weights 00:24:41.410 --> 00:24:46.480 to be the result of the original weight plus some additional expression. 00:24:46.480 --> 00:24:48.820 And to understand this expression y-- 00:24:48.820 --> 00:24:51.820 well, y is what the actual output is. 00:24:51.820 --> 00:24:57.280 And hypothesis of x, the input, that's going to be what we thought the output 00:24:57.280 --> 00:24:58.100 was. 00:24:58.100 --> 00:25:01.510 And so I can replace this by saying what the actual value was 00:25:01.510 --> 00:25:03.820 minus what our estimate was. 00:25:03.820 --> 00:25:08.440 And based on the difference between the actual value and what our estimate was, 00:25:08.440 --> 00:25:11.170 we might want to change our hypothesis, change the way 00:25:11.170 --> 00:25:13.300 that we do that estimation.
00:25:13.300 --> 00:25:15.850 If the actual value and the estimate were the same thing, 00:25:15.850 --> 00:25:18.910 meaning we were correctly able to predict what category this data 00:25:18.910 --> 00:25:21.280 point belonged to, well, then actual value 00:25:21.280 --> 00:25:23.560 minus estimate, that's just going to be 0, 00:25:23.560 --> 00:25:26.440 which means this whole term on the right hand side goes to be 0. 00:25:26.440 --> 00:25:27.820 And the weight doesn't change. 00:25:27.820 --> 00:25:31.180 Weight i, where i is weight 1 or weight 2 or weight 0, 00:25:31.180 --> 00:25:33.520 weight i just stays at weight i. 00:25:33.520 --> 00:25:36.880 And none of the weights change if we were able to correctly predict 00:25:36.880 --> 00:25:39.250 what category the input belonged to. 00:25:39.250 --> 00:25:42.010 But if our hypothesis didn't correctly predict 00:25:42.010 --> 00:25:44.980 what category the input belonged to, then maybe 00:25:44.980 --> 00:25:46.900 then we need to make some changes-- 00:25:46.900 --> 00:25:50.950 adjust the weights so that we're better able to predict this kind of data point 00:25:50.950 --> 00:25:51.850 in the future. 00:25:51.850 --> 00:25:54.140 And what is the way we might do that? 00:25:54.140 --> 00:25:58.870 Well, if the actual value was bigger than the estimate, then-- and for now, 00:25:58.870 --> 00:26:02.170 we'll go ahead and assume that these is are positive values-- 00:26:02.170 --> 00:26:04.720 if the actual value is bigger than the estimate, that 00:26:04.720 --> 00:26:08.050 means we need to increase the weight in order to make it such 00:26:08.050 --> 00:26:10.090 that the output is bigger and therefore, we're 00:26:10.090 --> 00:26:13.168 more likely to get to the right actual value. 00:26:13.168 --> 00:26:15.460 And so if the actual value is bigger than the estimate, 00:26:15.460 --> 00:26:18.292 then actual value minus estimate, that'll be a positive number. 00:26:18.292 --> 00:26:21.250 And so you imagine we're just adding some positive number to the weight 00:26:21.250 --> 00:26:23.740 just to increase it ever so slightly. 00:26:23.740 --> 00:26:25.600 And likewise, the inverse case is true-- 00:26:25.600 --> 00:26:30.460 that if the actual value was less than the estimate, the actual value was 0, 00:26:30.460 --> 00:26:33.460 but we estimated 1, meaning it actually was not raining, 00:26:33.460 --> 00:26:35.740 but we predicted it was going to be raining, 00:26:35.740 --> 00:26:39.250 then we want to decrease the value of the weight, because then in that case, 00:26:39.250 --> 00:26:43.600 we want to try and lower the total value of computing that dot product in order 00:26:43.600 --> 00:26:46.690 to make it less likely that we would predict that it would actually 00:26:46.690 --> 00:26:48.010 be raining. 00:26:48.010 --> 00:26:50.840 So no need to get too deep into the mathematics of that. 00:26:50.840 --> 00:26:53.920 But the general idea is that every time we encounter some data point, 00:26:53.920 --> 00:26:57.340 we can adjust these weights accordingly to try and make the weights better 00:26:57.340 --> 00:27:00.850 line up with the actual data that we have access to. 00:27:00.850 --> 00:27:03.610 And you can repeat this process with data point after data point 00:27:03.610 --> 00:27:05.680 until eventually, hopefully, your algorithm 00:27:05.680 --> 00:27:09.430 converges to some set of weights that do a pretty good job of trying 00:27:09.430 --> 00:27:13.040 to figure out whether a day is going to be rainy or not rainy. 
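A rough sketch of that training loop is below. It applies the update just described, wi becomes wi plus alpha times (actual value minus estimate) times xi, where alpha is the step size discussed next; the labeled points are made-up readings with 1 standing for rain and 0 for not rain.

def train(data, alpha=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]          # starting weights; random values would also work
    for _ in range(epochs):      # look at the data points multiple times
        for (x1, x2), y in data:
            x = (1, x1, x2)      # the leading 1 pairs with w0
            estimate = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            # If the estimate matches the label, (y - estimate) is 0 and nothing
            # changes; otherwise each weight is nudged up or down in proportion
            # to its input.
            for i in range(len(w)):
                w[i] += alpha * (y - estimate) * x[i]
    return w

# Made-up (humidity, pressure) readings, labeled 1 for rain and 0 for not rain.
data = [((0.81, 0.99), 1), ((0.85, 1.00), 1), ((0.45, 1.03), 0), ((0.40, 1.05), 0)]
print(train(data))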
00:27:13.040 --> 00:27:15.700 And just as a final point about this particular equation, 00:27:15.700 --> 00:27:19.470 this value alpha here is generally what we'll call the learning rate. 00:27:19.470 --> 00:27:22.120 It's just some parameter, some number we choose it 00:27:22.120 --> 00:27:25.690 for how quickly we're actually going to be updating these weight values. 00:27:25.690 --> 00:27:27.460 That if alpha is bigger, than we're going 00:27:27.460 --> 00:27:29.380 to update these weight values by a lot. 00:27:29.380 --> 00:27:32.480 And if alpha is smaller, then we'll update the weight values by less. 00:27:32.480 --> 00:27:35.200 And you can choose the value of alpha depending on the problem, 00:27:35.200 --> 00:27:39.960 different values might suit the situation better or worse than others. 00:27:39.960 --> 00:27:43.150 So after all of that, after we've done this training process, 00:27:43.150 --> 00:27:45.880 take all this data, and using this learning rule, 00:27:45.880 --> 00:27:50.240 look at all the pieces of data, and use each piece of data as an indication 00:27:50.240 --> 00:27:53.030 to us of do the weights stay the same, do we increase the weights, 00:27:53.030 --> 00:27:55.940 do we decrease the weights, and if so, by how much, 00:27:55.940 --> 00:28:00.110 what you end up with is effectively a threshold function. 00:28:00.110 --> 00:28:03.230 And we can look at what the threshold function looks like like this. 00:28:03.230 --> 00:28:05.870 On the x-axis here, we have the output of that function. 00:28:05.870 --> 00:28:10.150 Taking the weights taking, the dot product of it with the input. 00:28:10.150 --> 00:28:12.980 And on the y-axis, we have what the output is going to be. 00:28:12.980 --> 00:28:17.420 0, which in this case represented like not raining, and 1, which in this case, 00:28:17.420 --> 00:28:18.950 represented raining. 00:28:18.950 --> 00:28:23.570 And the way that our hypothesis function works is it calculates this value. 00:28:23.570 --> 00:28:27.380 And if it's greater than 0 or greater than some threshold value, 00:28:27.380 --> 00:28:29.480 then we declare that it's a rainy day. 00:28:29.480 --> 00:28:32.330 And otherwise, we declare that it's not rainy day. 00:28:32.330 --> 00:28:35.660 And this then graphically is what that function looks like. 00:28:35.660 --> 00:28:38.600 That Initially, when the value of this dot product is small-- 00:28:38.600 --> 00:28:40.850 it's not raining, it's not raining, it's not raining-- 00:28:40.850 --> 00:28:43.910 but as soon as it crosses that threshold, we suddenly say, 00:28:43.910 --> 00:28:46.700 OK, now it's raining, now it's raining, now it's raining. 00:28:46.700 --> 00:28:49.250 And the way to interpret this kind of representation 00:28:49.250 --> 00:28:51.680 is that anything on this side of the line, that 00:28:51.680 --> 00:28:55.010 would be the category of data points where we say yes, it's raining. 00:28:55.010 --> 00:28:56.960 Anything that falls on this side of the line 00:28:56.960 --> 00:28:59.510 are the data points where we would say it's not raining. 00:28:59.510 --> 00:29:02.120 And again, we want to choose some value for the weights that 00:29:02.120 --> 00:29:04.910 results in a function that does a pretty good job of trying 00:29:04.910 --> 00:29:07.280 to do this estimation. 00:29:07.280 --> 00:29:11.090 But one tricky thing with this type of hard threshold 00:29:11.090 --> 00:29:14.300 is that it only leaves two possible outcomes. 
00:29:14.300 --> 00:29:16.850 We plug-in some data as input. 00:29:16.850 --> 00:29:20.100 And the output we get is raining or not raining. 00:29:20.100 --> 00:29:22.978 And there is no room for it anywhere in between. 00:29:22.978 --> 00:29:24.270 And maybe that's what you want. 00:29:24.270 --> 00:29:26.510 Maybe all you want is given some data point, 00:29:26.510 --> 00:29:29.600 you would like to be able to classify it into one or two or more 00:29:29.600 --> 00:29:31.990 of these various different categories. 00:29:31.990 --> 00:29:34.270 But it might also be the case that you care 00:29:34.270 --> 00:29:37.855 about knowing how strong that prediction is, for example. 00:29:37.855 --> 00:29:39.730 So if we go back to this instance here, where 00:29:39.730 --> 00:29:42.550 we have rainy days on this side of the line, not 00:29:42.550 --> 00:29:45.370 rainy days on that side of the line, you might 00:29:45.370 --> 00:29:48.970 imagine that let's look now at these two white data points. 00:29:48.970 --> 00:29:53.110 This data point here that we would like to predict a label or a category for. 00:29:53.110 --> 00:29:55.660 And this data point over here that we would also like 00:29:55.660 --> 00:29:58.510 to predict a label or a category for. 00:29:58.510 --> 00:30:01.885 It seems likely that you could pretty confidently say that this data point, 00:30:01.885 --> 00:30:03.010 that should be a rainy day. 00:30:03.010 --> 00:30:05.500 It seems close to the other rainy days if we're 00:30:05.500 --> 00:30:07.330 going by the nearest neighbor strategy. 00:30:07.330 --> 00:30:11.830 It's on this side of the line if we're going by the strategy of just saying 00:30:11.830 --> 00:30:14.140 which side of the line does it fall on by figuring out 00:30:14.140 --> 00:30:15.670 what those weights should be. 00:30:15.670 --> 00:30:18.610 And if we're using the line strategy of just which side of the line 00:30:18.610 --> 00:30:21.640 does it fall on, which side of this decision boundary, 00:30:21.640 --> 00:30:24.700 we'd also say that this point here is also 00:30:24.700 --> 00:30:27.580 a rainy day, because it falls on the side of the line that 00:30:27.580 --> 00:30:30.620 corresponds to rainy days. 00:30:30.620 --> 00:30:33.020 But it's likely that even in this case, we 00:30:33.020 --> 00:30:37.040 would know that we don't feel nearly as confident about this data point 00:30:37.040 --> 00:30:40.360 on the left as compared to this data point on the right. 00:30:40.360 --> 00:30:42.620 For this one on the right, we can feel very confident 00:30:42.620 --> 00:30:44.060 that, yes, it's a rainy day. 00:30:44.060 --> 00:30:48.460 This one, it's pretty close to the line if we're judging just by distance. 00:30:48.460 --> 00:30:51.260 And so you might be less sure. 00:30:51.260 --> 00:30:56.150 But our threshold function doesn't allow for a notion of less sure or more sure 00:30:56.150 --> 00:30:57.080 about something. 00:30:57.080 --> 00:30:59.000 It's what we would call a hard threshold. 00:30:59.000 --> 00:31:02.810 It's once you've crossed this line, then immediately, we say, yes, 00:31:02.810 --> 00:31:04.520 this is going to be a rainy day. 00:31:04.520 --> 00:31:07.580 Anywhere before it, we're going to say it's not a rainy day. 00:31:07.580 --> 00:31:10.240 And that may not be helpful in a number of cases. 00:31:10.240 --> 00:31:13.070 One, this is not a particularly easy function to deal with. 
00:31:13.070 --> 00:31:15.710 If you get you get deeper into the world of machine learning 00:31:15.710 --> 00:31:18.830 and are trying to do things like taking derivatives of these curves, 00:31:18.830 --> 00:31:21.222 this type of function makes things challenging. 00:31:21.222 --> 00:31:23.180 But the other challenge is that we don't really 00:31:23.180 --> 00:31:25.013 have any notion of gradation between things. 00:31:25.013 --> 00:31:28.430 We don't have a notion of, yes, this is a very strong belief 00:31:28.430 --> 00:31:32.600 that it's going to be raining as opposed to it's probably more likely than not 00:31:32.600 --> 00:31:37.100 that it's going to be raining, but maybe not totally sure about that, either. 00:31:37.100 --> 00:31:39.590 So what we can do by taking advantage of a technique known 00:31:39.590 --> 00:31:43.190 as logistic regression is instead of using this hard threshold 00:31:43.190 --> 00:31:46.970 type of function, we can use instead a logistic function, something 00:31:46.970 --> 00:31:48.890 we might call a soft threshold. 00:31:48.890 --> 00:31:52.070 And that's going to transform this into looking something 00:31:52.070 --> 00:31:55.290 a little more like this-- something that more nicely curves. 00:31:55.290 --> 00:31:57.800 And as a result, the possible output values 00:31:57.800 --> 00:32:02.030 are no longer just 0 and 1, 0 for not raining, 1 for raining. 00:32:02.030 --> 00:32:06.360 But you can actually get any real number of value between 0 and 1. 00:32:06.360 --> 00:32:10.292 That if you're way over on this side, then you get a value of 0-- 00:32:10.292 --> 00:32:12.750 it's not going to be raining, we're pretty sure about that. 00:32:12.750 --> 00:32:15.042 And if you're over on this side, you get a value of 1-- 00:32:15.042 --> 00:32:17.330 yes, we're very sure that it's going to be raining. 00:32:17.330 --> 00:32:22.460 But in between, you could get some real numbered value where a value like 0.7 00:32:22.460 --> 00:32:24.260 might mean we think it's going to rain. 00:32:24.260 --> 00:32:27.810 It's more probable that it's going to rain than not based on the data, 00:32:27.810 --> 00:32:32.220 but we're not as confident as some of the other data points might be. 00:32:32.220 --> 00:32:34.580 So one of the advantages of the soft threshold 00:32:34.580 --> 00:32:37.940 is that it allows us to have an output that could be some real number that 00:32:37.940 --> 00:32:41.450 potentially reflects some sort of probability, the likelihood that we 00:32:41.450 --> 00:32:46.550 think that this particular data point belongs to that particular category. 00:32:46.550 --> 00:32:49.250 And there are some other nice mathematical properties of that 00:32:49.250 --> 00:32:50.950 as well. 00:32:50.950 --> 00:32:53.880 So that then is two different approaches to trying to solve 00:32:53.880 --> 00:32:55.710 this type of classification problem. 00:32:55.710 --> 00:32:58.860 One is this nearest neighbor type of approach, where you just 00:32:58.860 --> 00:33:01.230 take a data point and look at the data points that 00:33:01.230 --> 00:33:05.490 are nearby to try and estimate what category we think it belongs to. 
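Going back to that soft threshold for a moment: it is typically the logistic (sigmoid) function, and the minimal sketch below, with made-up weights, shows how the same dot product becomes a value between 0 and 1 rather than a hard 0 or 1.

import math

def soft_hypothesis(weights, x1, x2):
    # The same linear combination as before, squashed through the logistic
    # function so the output lands somewhere between 0 and 1.
    score = sum(wi * xi for wi, xi in zip(weights, (1, x1, x2)))
    return 1 / (1 + math.exp(-score))

weights = (-30.0, 25.0, 12.0)               # made-up values; larger weights sharpen the curve
print(soft_hypothesis(weights, 0.90, 1.0))  # far onto the rainy side: output near 1
print(soft_hypothesis(weights, 0.72, 1.0))  # near the boundary: output near 0.5
print(soft_hypothesis(weights, 0.50, 1.0))  # far onto the not-rainy side: output near 0

Read that way, the output behaves like a degree of confidence, which is exactly what the hard threshold could not express.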
00:33:05.490 --> 00:33:08.310 And the other approach is the approach of saying, all right, 00:33:08.310 --> 00:33:10.607 let's just try and use linear regression, 00:33:10.607 --> 00:33:13.440 figure out what these weights should be, adjust the weights in order 00:33:13.440 --> 00:33:16.950 to figure out what line or what decision boundary is going 00:33:16.950 --> 00:33:19.708 to best separate these two categories. 00:33:19.708 --> 00:33:22.500 It turns out that another popular approach, a very popular approach 00:33:22.500 --> 00:33:24.480 if you just have a data set and you want to start 00:33:24.480 --> 00:33:26.190 trying to do some learning on it, is what 00:33:26.190 --> 00:33:27.648 we call the support vector machine. 00:33:27.648 --> 00:33:30.690 We're not going to go too much into the mathematics of the support vector 00:33:30.690 --> 00:33:32.850 machine, but we'll at least explore it graphically 00:33:32.850 --> 00:33:34.680 to see what it is that it looks like. 00:33:34.680 --> 00:33:38.220 And the idea or the motivation behind the support vector machine 00:33:38.220 --> 00:33:41.400 is the idea that there are actually a lot of different lines 00:33:41.400 --> 00:33:44.070 that we could draw, a lot of different decision boundaries 00:33:44.070 --> 00:33:46.430 that we could draw to separate two groups. 00:33:46.430 --> 00:33:49.050 So for example, I had the red data points over here 00:33:49.050 --> 00:33:50.700 and the blue data points over here. 00:33:50.700 --> 00:33:54.620 One possible line I could draw is a line like this, 00:33:54.620 --> 00:33:57.600 that this line here would separate the red points from the blue points. 00:33:57.600 --> 00:33:58.680 And it does so perfectly. 00:33:58.680 --> 00:34:01.080 All the red points are on one side of the line. 00:34:01.080 --> 00:34:03.840 All the blue points around the other side of the line. 00:34:03.840 --> 00:34:07.000 But this should probably make you a little bit nervous. 00:34:07.000 --> 00:34:08.909 If you come up with a model and the model 00:34:08.909 --> 00:34:10.960 comes up with a line that looks like this. 00:34:10.960 --> 00:34:13.650 And the reason why is that you worry about how well it's 00:34:13.650 --> 00:34:18.190 going to generalize to other data points that are not necessarily in the data 00:34:18.190 --> 00:34:19.900 set that we have access to. 00:34:19.900 --> 00:34:23.650 For example, if there was a point that fell right here for example, 00:34:23.650 --> 00:34:26.370 on the right side of the line, then based on that, 00:34:26.370 --> 00:34:30.210 we might want to guess that it is in fact, a red point, 00:34:30.210 --> 00:34:33.780 but it falls on the side of the line where instead, we would estimate 00:34:33.780 --> 00:34:36.380 that it's a blue point instead. 00:34:36.380 --> 00:34:39.750 And so based on that, this line is probably not a great choice 00:34:39.750 --> 00:34:43.679 just because it is so close to these various data points. 00:34:43.679 --> 00:34:45.810 We might instead prefer a diagonal line that 00:34:45.810 --> 00:34:48.719 just goes diagonally through the data set like we've seen before. 00:34:48.719 --> 00:34:52.000 But there too, there's a lot of diagonal lines that we could draw as well. 00:34:52.000 --> 00:34:54.760 For example, I could draw this diagonal line here, 00:34:54.760 --> 00:34:57.663 which also successfully separates all the red points 00:34:57.663 --> 00:34:58.830 from all of the blue points. 
00:34:58.830 --> 00:35:01.400 From the perspective of something like a just trying 00:35:01.400 --> 00:35:03.150 to figure out some setting of weights that 00:35:03.150 --> 00:35:05.760 allows us to predict the correct output, this line 00:35:05.760 --> 00:35:09.155 will predict the correct output for this particular set of data 00:35:09.155 --> 00:35:11.530 every single time, because the red points are on one side 00:35:11.530 --> 00:35:13.518 and the blue points are on the other. 00:35:13.518 --> 00:35:15.810 But yet again, you should probably be a little nervous. 00:35:15.810 --> 00:35:18.690 Because this line is so close to these red points, 00:35:18.690 --> 00:35:22.470 even though we're able to correctly predict on the input data 00:35:22.470 --> 00:35:26.070 if there was a point that fell somewhere in this general area, 00:35:26.070 --> 00:35:28.530 our algorithm, this model, would say that yeah, 00:35:28.530 --> 00:35:30.930 we think it's a blue point, when in actuality, it 00:35:30.930 --> 00:35:34.000 might belong to the red category instead, 00:35:34.000 --> 00:35:36.870 just because it looks like it's close to the other red points. 00:35:36.870 --> 00:35:39.700 What we really want to be able to say, given this data, 00:35:39.700 --> 00:35:42.330 how can you generalize this out as best as possible, is 00:35:42.330 --> 00:35:46.480 to come up with a line like this that seems like the intuitive line to draw. 00:35:46.480 --> 00:35:48.810 And the reason why it's intuitive is because it 00:35:48.810 --> 00:35:54.267 seems to be as far apart as possible from the red data and the blue data 00:35:54.267 --> 00:35:56.850 so that if we generalize a little bit and assume that maybe we 00:35:56.850 --> 00:35:58.933 have some points that are different from the input 00:35:58.933 --> 00:36:02.610 but still slightly further away, we can still say that something on this side, 00:36:02.610 --> 00:36:06.300 probably red, something on that side, probably blue. 00:36:06.300 --> 00:36:08.607 And we can make those judgments that way. 00:36:08.607 --> 00:36:10.440 And that is what support vector machines are 00:36:10.440 --> 00:36:12.660 designed to do-- they're designed to try and find 00:36:12.660 --> 00:36:15.720 what we call the maximum margin separator, 00:36:15.720 --> 00:36:17.850 where the maximum margin separator is just 00:36:17.850 --> 00:36:21.780 some boundary that maximizes the distance between the groups of points. 00:36:21.780 --> 00:36:24.990 Rather than come up with some boundary that's very close to one side 00:36:24.990 --> 00:36:27.840 or the other, where in the case before, we wouldn't have cared-- 00:36:27.840 --> 00:36:31.680 as long as we're categorizing the input well, that seems all we need to do-- 00:36:31.680 --> 00:36:35.790 the support vector machine will try and find this maximum margin separator, 00:36:35.790 --> 00:36:39.090 some way of trying to maximize that particular distance. 00:36:39.090 --> 00:36:42.600 And it does so by finding what we call the support vectors, which 00:36:42.600 --> 00:36:44.790 are the vectors that are closest to the line 00:36:44.790 --> 00:36:48.060 and trying to maximize the distance between the line 00:36:48.060 --> 00:36:49.950 and those particular points. 00:36:49.950 --> 00:36:51.790 And it works that way in two dimensions. 
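As a rough sketch of that idea, assuming scikit-learn is available, an SVC with a linear kernel will look for exactly this kind of maximum margin separator on a small set of made-up points:

    from sklearn import svm

    # Made-up 2-D points: one cluster in the lower left, another in the upper right.
    X = [[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]]
    y = [0, 0, 0, 1, 1, 1]

    # A linear-kernel support vector classifier finds the boundary that keeps
    # the largest possible distance from the points closest to it.
    model = svm.SVC(kernel="linear")
    model.fit(X, y)

    print(model.support_vectors_)            # the support vectors nearest the boundary
    print(model.predict([[2, 2], [5, 4]]))   # classify two new points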
00:36:51.790 --> 00:36:53.498 It also works in higher dimensions, where 00:36:53.498 --> 00:36:56.670 we're not looking for some line that separates the two data points, 00:36:56.670 --> 00:36:58.440 but instead, looking for what we generally 00:36:58.440 --> 00:37:02.610 call a hyperplane, some decision boundary, effectively, that 00:37:02.610 --> 00:37:06.150 separates one set of data from the other set of data. 00:37:06.150 --> 00:37:09.270 And this ability of support vector machines to work in higher dimensions 00:37:09.270 --> 00:37:11.670 actually has a number of other applications as well. 00:37:11.670 --> 00:37:14.790 But one is that it helpfully deals with cases where 00:37:14.790 --> 00:37:17.740 data may not be linearly separable. 00:37:17.740 --> 00:37:19.800 So we talked about linear separability before, 00:37:19.800 --> 00:37:22.890 this idea that you can take data and just draw 00:37:22.890 --> 00:37:25.180 a line or some linear combination of the inputs 00:37:25.180 --> 00:37:28.750 that allows us to perfectly separate the two sets from each other. 00:37:28.750 --> 00:37:32.050 There are some data sets that are not linearly separable. 00:37:32.050 --> 00:37:35.280 And for some, you would not even be able to find 00:37:35.280 --> 00:37:39.270 a good line at all that would come close to doing that kind of separation. 00:37:39.270 --> 00:37:42.930 Something like this, for example, where if you imagine here 00:37:42.930 --> 00:37:45.780 are the red points and the blue points surround them. 00:37:45.780 --> 00:37:50.550 If you try to find a line that divides the red points from the blue points, 00:37:50.550 --> 00:37:53.310 it's actually going to be difficult, if not impossible, to do-- 00:37:53.310 --> 00:37:55.080 that any line you choose-- 00:37:55.080 --> 00:37:57.510 if you draw a line here, then you're ignoring 00:37:57.510 --> 00:38:00.450 all of these blue points that should actually be blue and not red-- 00:38:00.450 --> 00:38:03.240 anywhere else you draw a line, there's going to be a lot of error, 00:38:03.240 --> 00:38:07.560 a lot of mistakes, a lot of what we'll soon call loss for that line 00:38:07.560 --> 00:38:08.280 that you draw-- 00:38:08.280 --> 00:38:12.050 a lot of points that you're going to categorize incorrectly. 00:38:12.050 --> 00:38:14.610 What we really want is to be able to find a better decision 00:38:14.610 --> 00:38:18.570 boundary that may not be just a straight line through this two 00:38:18.570 --> 00:38:19.770 dimensional space. 00:38:19.770 --> 00:38:21.810 And what support vector machines can do is 00:38:21.810 --> 00:38:23.940 they can begin to operate in higher dimensions 00:38:23.940 --> 00:38:26.880 and be able to find some other decision boundary, 00:38:26.880 --> 00:38:28.860 like the circle in this case, that actually 00:38:28.860 --> 00:38:33.130 is able to separate one of these sets of data from the other set of data, 00:38:33.130 --> 00:38:33.750 a lot better. 00:38:33.750 --> 00:38:37.470 So oftentimes, in data sets where the data is not linearly separable, 00:38:37.470 --> 00:38:40.140 support vector machines, by working in higher dimensions, 00:38:40.140 --> 00:38:44.310 can actually figure out a way to solve that kind of problem effectively. 00:38:44.310 --> 00:38:46.290 So that then-- three different approaches 00:38:46.290 --> 00:38:48.330 to trying to solve these sorts of problems. 00:38:48.330 --> 00:38:50.040 We've seen support vector machines. 
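A minimal sketch of how a support vector machine can handle data that is not linearly separable, assuming scikit-learn and its make_circles helper for the made-up ring-shaped data:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Made-up data: an inner cluster of one class surrounded by a ring of the other.
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    # No straight line separates these two groups in two dimensions,
    # but the "rbf" kernel lets the SVC find a circular decision boundary.
    model = SVC(kernel="rbf")
    model.fit(X, y)

    print(model.score(X, y))   # fraction of these points the boundary classifies correctly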
00:38:50.040 --> 00:38:53.400 We've seen trying to use linear regression and the perceptron 00:38:53.400 --> 00:38:56.912 learning rule to be able to figure out how to categorize inputs and outputs. 00:38:56.912 --> 00:38:58.620 We've seen the nearest neighbor approach. 00:38:58.620 --> 00:39:00.630 No one necessarily better than any other. 00:39:00.630 --> 00:39:03.990 Again, it's going to depend on the data set, the information you have access 00:39:03.990 --> 00:39:04.490 to. 00:39:04.490 --> 00:39:07.157 It's going to depend on what the function looks like that you're 00:39:07.157 --> 00:39:08.370 ultimately trying to predict. 00:39:08.370 --> 00:39:11.160 And this is where a lot of research and experimentation 00:39:11.160 --> 00:39:14.880 can be involved in trying to figure out how it is to best perform 00:39:14.880 --> 00:39:16.710 that kind of estimation. 00:39:16.710 --> 00:39:20.100 But classification is only one of the tasks that you might encounter 00:39:20.100 --> 00:39:23.340 and supervised machine learning, because in classification, 00:39:23.340 --> 00:39:26.580 what we're trying to predict is some discrete category. 00:39:26.580 --> 00:39:29.040 We're trying to predict red or blue, rain 00:39:29.040 --> 00:39:31.950 or not rain, authentic or counterfeit. 00:39:31.950 --> 00:39:35.500 But sometimes, what we want to predict is a real number value. 00:39:35.500 --> 00:39:38.880 And for that, we have a related problem, not classification, but instead 00:39:38.880 --> 00:39:40.500 known as regression. 00:39:40.500 --> 00:39:42.930 And regression is the supervised learning problem 00:39:42.930 --> 00:39:45.570 where we try and learn a function mapping inputs to outputs, 00:39:45.570 --> 00:39:46.810 same as before. 00:39:46.810 --> 00:39:49.800 but instead of the outputs being discrete categories-- 00:39:49.800 --> 00:39:51.960 things like rain or not rain-- 00:39:51.960 --> 00:39:54.260 in a regression problem, the output values 00:39:54.260 --> 00:39:57.400 are generally continuous value-- some real number 00:39:57.400 --> 00:39:58.650 that we would like to predict. 00:39:58.650 --> 00:40:00.385 This happens all the time, as well. 00:40:00.385 --> 00:40:02.760 You might imagine that a company might take this approach 00:40:02.760 --> 00:40:04.950 if it's trying to figure out, for instance, 00:40:04.950 --> 00:40:06.720 what the effect of its advertising is. 00:40:06.720 --> 00:40:10.830 Like how do advertising dollars spent translate into sales 00:40:10.830 --> 00:40:13.000 for the company's product, for example. 00:40:13.000 --> 00:40:17.070 And so they might like to try to predict some function that takes as input, 00:40:17.070 --> 00:40:18.780 the amount of money spent on advertising. 00:40:18.780 --> 00:40:20.220 And here, we're just going to use one input, 00:40:20.220 --> 00:40:22.650 but again, you could scale this up to many more inputs 00:40:22.650 --> 00:40:25.800 as well if you have a lot of different kinds of data you have access to. 00:40:25.800 --> 00:40:27.540 And the goal is to learn to function-- that given 00:40:27.540 --> 00:40:29.423 this amount of spending on advertising, we're 00:40:29.423 --> 00:40:30.840 going to get this amount in sales. 00:40:30.840 --> 00:40:34.110 And you might judge it based on having access to a whole bunch of data-- 00:40:34.110 --> 00:40:37.830 like for every past month, here's how much we spent on advertising 00:40:37.830 --> 00:40:39.390 and here is what sales were. 
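To make that concrete, here is a minimal sketch of the regression idea, assuming scikit-learn's LinearRegression and with the advertising and sales numbers made up for illustration:

    from sklearn.linear_model import LinearRegression

    # Made-up monthly data: advertising dollars spent and the sales that followed.
    ad_spending = [[500], [1000], [1500], [2000], [2500], [3000]]
    sales = [1400, 2100, 2600, 3400, 3900, 4500]

    # Fit a line relating advertising to sales.
    model = LinearRegression()
    model.fit(ad_spending, sales)

    # Read the estimate off that line for a new advertising budget.
    print(model.coef_, model.intercept_)   # the learned weight and bias of the line
    print(model.predict([[1800]]))         # estimated sales for $1,800 of advertising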
00:40:39.390 --> 00:40:43.300 And we would like to predict some sort of hypothesis function 00:40:43.300 --> 00:40:46.260 that, again, given the amount spent on advertising, 00:40:46.260 --> 00:40:48.750 can predict in this case, some real number, 00:40:48.750 --> 00:40:54.840 some no estimate of how much sales we expect that company to do in this month 00:40:54.840 --> 00:40:56.940 or in this quarter or whatever unit of time 00:40:56.940 --> 00:40:58.980 we're choosing to measure things in. 00:40:58.980 --> 00:41:01.830 And so again, the approach to solving this type of problem, 00:41:01.830 --> 00:41:05.920 we could try using a linear regression type approach, where we take this data, 00:41:05.920 --> 00:41:07.050 and we just plot it. 00:41:07.050 --> 00:41:09.750 On the x-axis, we have advertising dollars spent. 00:41:09.750 --> 00:41:11.490 On the y-axis, we have sales. 00:41:11.490 --> 00:41:14.160 And we might just want to try and draw a line that 00:41:14.160 --> 00:41:16.680 does a pretty good job of trying to estimate 00:41:16.680 --> 00:41:19.980 this relationship between advertising and sales. 00:41:19.980 --> 00:41:21.840 And in this case, unlike before, we're not 00:41:21.840 --> 00:41:24.848 trying to separate the data points into discrete categories. 00:41:24.848 --> 00:41:26.640 But instead in this case, we're just trying 00:41:26.640 --> 00:41:31.410 to find a line that approximates this relationship between advertising 00:41:31.410 --> 00:41:34.860 and sales so that if we want to figure out what the estimated sales are 00:41:34.860 --> 00:41:38.520 for a particular advertising budget, you just look it up in this line, 00:41:38.520 --> 00:41:40.560 figure out for this amount of advertising, we 00:41:40.560 --> 00:41:42.540 would have this amount of sales, and just 00:41:42.540 --> 00:41:44.367 try and make the estimate that way. 00:41:44.367 --> 00:41:46.200 And so you can try and come up with a line-- 00:41:46.200 --> 00:41:48.720 again, figuring out how to modify the weights using 00:41:48.720 --> 00:41:51.990 various different techniques to try and make it so that this line fits 00:41:51.990 --> 00:41:54.860 as well as possible. 00:41:54.860 --> 00:41:57.880 So with all of these approaches to trying to solve machine 00:41:57.880 --> 00:41:59.270 learning style problems. 00:41:59.270 --> 00:42:02.000 The question becomes, how do we evaluate these approaches? 00:42:02.000 --> 00:42:06.340 How do we evaluate the various different hypotheses that we could come up with? 00:42:06.340 --> 00:42:09.880 Because each of these algorithms will give us some sort of hypothesis-- 00:42:09.880 --> 00:42:12.710 some function that maps inputs to outputs. 00:42:12.710 --> 00:42:16.720 And we want to know how well does that function work. 00:42:16.720 --> 00:42:19.000 And you can think of evaluating these hypotheses 00:42:19.000 --> 00:42:23.500 and trying to get a better hypothesis as kind of like an optimization problem. 00:42:23.500 --> 00:42:26.580 In an optimization problem, as you recall from before, 00:42:26.580 --> 00:42:30.100 you are either trying to maximize some objective function 00:42:30.100 --> 00:42:33.160 by trying to find a global maximum. 00:42:33.160 --> 00:42:36.160 Or we were trying to minimize some cost function 00:42:36.160 --> 00:42:38.210 by trying to find some global minimum. 
00:42:38.210 --> 00:42:41.890 And in the case of evaluating these hypotheses, one thing we might say 00:42:41.890 --> 00:42:45.280 is that this cost function, the thing we're trying to minimize, 00:42:45.280 --> 00:42:49.180 we might be trying to minimize what we would call a loss function. 00:42:49.180 --> 00:42:50.510 And what a loss function is-- 00:42:50.510 --> 00:42:53.980 it is a function that is going to estimate for us how 00:42:53.980 --> 00:42:56.200 poorly our function performs. 00:42:56.200 --> 00:42:58.240 More formally, it's like a loss of utility, 00:42:58.240 --> 00:43:02.740 in that whenever we predict something that is wrong, that is a loss of utility. 00:43:02.740 --> 00:43:06.450 That's going to add to the output of our loss function. 00:43:06.450 --> 00:43:08.200 And you can come up with any loss function 00:43:08.200 --> 00:43:11.320 that you want-- just some mathematical way of estimating given 00:43:11.320 --> 00:43:14.020 each of these data points, given what the actual output is, 00:43:14.020 --> 00:43:17.110 and given what our projected output is, our estimate, 00:43:17.110 --> 00:43:19.970 you could calculate some sort of numerical loss for it. 00:43:19.970 --> 00:43:23.140 But there are a couple of popular loss functions that are worth discussing-- 00:43:23.140 --> 00:43:25.210 just so that you've seen them before-- 00:43:25.210 --> 00:43:27.040 when it comes to discrete categories. 00:43:27.040 --> 00:43:30.570 Things like rain or not rain, counterfeit or not counterfeit. 00:43:30.570 --> 00:43:33.463 One approach is the 0-1 loss function. 00:43:33.463 --> 00:43:35.380 And the way that works is for each of the data 00:43:35.380 --> 00:43:40.180 points our loss function takes as input, what the actual output is, whether it 00:43:40.180 --> 00:43:44.650 was actually raining or not raining, and takes our prediction into account-- 00:43:44.650 --> 00:43:49.030 did we predict given this data point that it was raining or not raining. 00:43:49.030 --> 00:43:51.730 And if the actual value equals the prediction, 00:43:51.730 --> 00:43:54.580 well, then the 0-1 loss function will just say the loss is 0. 00:43:54.580 --> 00:43:58.900 There was no loss of utility because we were able to predict correctly. 00:43:58.900 --> 00:44:01.300 And otherwise, if the actual value was not 00:44:01.300 --> 00:44:05.260 the same thing as what we predicted, well, then in that case, our loss is 1. 00:44:05.260 --> 00:44:08.260 We lost something, lost some utility, because what 00:44:08.260 --> 00:44:12.750 we predicted as the output of the function was not what it actually was. 00:44:12.750 --> 00:44:14.500 And the goal then in a situation like this 00:44:14.500 --> 00:44:17.080 would be to come up with some hypothesis that 00:44:17.080 --> 00:44:21.040 minimizes the total empirical loss, the total amount that we've 00:44:21.040 --> 00:44:25.090 lost if you add up for all these data points what the actual output is 00:44:25.090 --> 00:44:28.250 and what your hypothesis would have predicted. 00:44:28.250 --> 00:44:30.100 So in this case, for example, if we go back 00:44:30.100 --> 00:44:32.530 to classifying days as raining or not raining, 00:44:32.530 --> 00:44:34.720 and we came up with this decision boundary, 00:44:34.720 --> 00:44:36.680 how would we evaluate this decision boundary-- 00:44:36.680 --> 00:44:40.510 how much better is it than drawing the line here or drawing the line there. 
00:44:40.510 --> 00:44:42.790 Well, we could take each of the input data points 00:44:42.790 --> 00:44:45.790 and each input data point has a label-- whether it was raining 00:44:45.790 --> 00:44:47.230 or whether it was not raining. 00:44:47.230 --> 00:44:49.090 And we could compare it to the prediction-- 00:44:49.090 --> 00:44:51.550 whether we predicted it would be raining or not raining-- 00:44:51.550 --> 00:44:55.040 and assign it a numerical value as a result. 00:44:55.040 --> 00:44:58.323 So for example, these points over here they were all rainy days. 00:44:58.323 --> 00:45:00.490 And we predicted they would be raining, because they 00:45:00.490 --> 00:45:02.300 fall in the bottom side of the line. 00:45:02.300 --> 00:45:03.520 So they had a loss of 0-- 00:45:03.520 --> 00:45:05.500 nothing lost from those situations. 00:45:05.500 --> 00:45:08.140 And likewise, same is true for some of these points over here, 00:45:08.140 --> 00:45:10.420 where it was not raining and we predicted 00:45:10.420 --> 00:45:12.220 it would not be raining, either. 00:45:12.220 --> 00:45:16.810 Where we do have loss are points like this point here and that point there, 00:45:16.810 --> 00:45:20.980 where we predicted that it would not be raining, but in actuality, 00:45:20.980 --> 00:45:21.730 it's a blue point. 00:45:21.730 --> 00:45:22.810 It was raining. 00:45:22.810 --> 00:45:25.870 Or likewise here, we predicted that it would be raining, 00:45:25.870 --> 00:45:27.790 but in actuality, it's a red point-- 00:45:27.790 --> 00:45:29.020 it was not raining. 00:45:29.020 --> 00:45:31.900 And so as a result, we miscategorized these data 00:45:31.900 --> 00:45:34.150 points that we were trying to train on. 00:45:34.150 --> 00:45:36.310 And as a result, there is some loss here. 00:45:36.310 --> 00:45:40.930 One loss here, there, here and there, for a total loss of four, for example, 00:45:40.930 --> 00:45:41.893 in this case. 00:45:41.893 --> 00:45:43.810 And that might be how we would estimate or how 00:45:43.810 --> 00:45:47.860 we would say that this line is better than a line that 00:45:47.860 --> 00:45:51.370 goes somewhere else or a line that's further down, because this line might 00:45:51.370 --> 00:45:52.670 minimize the loss. 00:45:52.670 --> 00:45:57.100 So there is no way to do better than just these four points of loss 00:45:57.100 --> 00:46:01.080 if you're just drawing a straight line through our space. 00:46:01.080 --> 00:46:05.020 So the 0-1 loss function checks did we get it right, did we get it wrong. 00:46:05.020 --> 00:46:07.630 If we got it right, the loss is 0-- nothing lost. 00:46:07.630 --> 00:46:11.470 If we got it wrong, then our loss function for that data point says 1, 00:46:11.470 --> 00:46:14.740 and we add up all of those losses across all of our data points 00:46:14.740 --> 00:46:17.290 to get some sort of empirical loss-- how much we 00:46:17.290 --> 00:46:20.420 have lost across all of these original data points 00:46:20.420 --> 00:46:23.500 that our algorithm had access to. 00:46:23.500 --> 00:46:26.550 There are other forms of loss as well that work especially well when 00:46:26.550 --> 00:46:28.980 we deal with more real value cases-- cases 00:46:28.980 --> 00:46:32.760 like the mapping between advertising budget and amount that we do in sales, 00:46:32.760 --> 00:46:33.720 for example. 
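A minimal sketch of that 0-1 loss, with hypothetical labels chosen so that, as in the example above, four points are misclassified:

    def zero_one_loss(actuals, predictions):
        """Add 1 to the loss for every data point whose prediction differs from its actual label."""
        return sum(0 if actual == predicted else 1
                   for actual, predicted in zip(actuals, predictions))

    # Hypothetical labels for eight days and what a hypothesis predicted for them.
    actual    = ["rain", "rain", "rain", "no rain", "no rain", "no rain", "rain", "no rain"]
    predicted = ["rain", "rain", "no rain", "no rain", "no rain", "rain", "no rain", "rain"]

    print(zero_one_loss(actual, predicted))   # 4 -- one point of loss per misclassified day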
00:46:33.720 --> 00:46:37.770 Because in that case, you care not just that you get the number exactly right, 00:46:37.770 --> 00:46:40.650 but you care how close you were to the actual value. 00:46:40.650 --> 00:46:44.700 If the actual value is you did $2,800 in sales 00:46:44.700 --> 00:46:47.910 and you predicted that you would do $2,900 in sales, 00:46:47.910 --> 00:46:49.330 maybe that's pretty good. 00:46:49.330 --> 00:46:52.380 That's much better than if you had predicted you do $1,000 in sales, 00:46:52.380 --> 00:46:53.620 for example. 00:46:53.620 --> 00:46:55.830 And so we would like our loss function to be 00:46:55.830 --> 00:46:58.530 able to take that into account as well. 00:46:58.530 --> 00:47:02.700 Take into account not just whether the actual value in the expected value 00:47:02.700 --> 00:47:08.490 are exactly the same, but also, take into account how far apart they were. 00:47:08.490 --> 00:47:12.450 And so for that one approach is what we call L1 loss. 00:47:12.450 --> 00:47:15.120 L1 Loss doesn't just look at whether actual and predicted 00:47:15.120 --> 00:47:20.400 are equal to each other, but we take the absolute value of the actual value 00:47:20.400 --> 00:47:22.180 minus the predicted value. 00:47:22.180 --> 00:47:25.860 In other words, we just ask, how far apart were the actual 00:47:25.860 --> 00:47:27.120 and predicted values? 00:47:27.120 --> 00:47:29.280 And we sum that up across all of the data 00:47:29.280 --> 00:47:33.870 points to be able to get what our answer ultimately is. 00:47:33.870 --> 00:47:36.690 So what might this actually look like for our data set? 00:47:36.690 --> 00:47:38.610 Well, if we go back to this representation 00:47:38.610 --> 00:47:42.720 where we had advertising along the x-axis, sales along the y-axis, 00:47:42.720 --> 00:47:45.210 our line was our prediction, our estimate 00:47:45.210 --> 00:47:47.520 for any given amount of advertising-- what 00:47:47.520 --> 00:47:50.080 we predicted sales was going to be. 00:47:50.080 --> 00:47:54.420 And our L1 loss is just how far apart vertically along the sales 00:47:54.420 --> 00:47:58.050 axis our prediction was from each of the data points. 00:47:58.050 --> 00:48:00.900 So we could figure out exactly how far apart our prediction 00:48:00.900 --> 00:48:02.910 was from each of the data points and figure out 00:48:02.910 --> 00:48:07.920 as a result of that what our loss is overall for this particular hypothesis 00:48:07.920 --> 00:48:12.090 just by adding up all of these various different individual losses for each 00:48:12.090 --> 00:48:13.110 of these data points. 00:48:13.110 --> 00:48:15.790 And our goal then is to try and minimize that loss-- 00:48:15.790 --> 00:48:20.550 to try and come up with some line that minimizes what the utility loss is 00:48:20.550 --> 00:48:23.280 by judging how far away our estimate amount of sales 00:48:23.280 --> 00:48:25.800 is from the actual amount of sales. 00:48:25.800 --> 00:48:28.050 And turns out there are other loss functions, as well. 00:48:28.050 --> 00:48:30.780 One that's quite popular is the L2 loss. 00:48:30.780 --> 00:48:33.840 The L2 loss, instead of just using the absolute value, 00:48:33.840 --> 00:48:37.350 like how far away the actual value is from the predicted value, 00:48:37.350 --> 00:48:40.350 it uses the square of actual minus predicted. 
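And a similarly minimal sketch of the L1 loss just described, together with the squared L2 variant being introduced here, using made-up sales figures:

    def l1_loss(actuals, predictions):
        """Sum of how far apart each actual and predicted value are."""
        return sum(abs(actual - predicted)
                   for actual, predicted in zip(actuals, predictions))

    def l2_loss(actuals, predictions):
        """Like L1 loss, but squaring each difference, so larger misses are penalized more harshly."""
        return sum((actual - predicted) ** 2
                   for actual, predicted in zip(actuals, predictions))

    # Made-up figures: predicting $2,900 when the actual was $2,800 costs far less
    # than a prediction of $1,000 would have.
    actual = [2800, 3100]
    predicted = [2900, 3000]

    print(l1_loss(actual, predicted))   # 200
    print(l2_loss(actual, predicted))   # 20000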
00:48:40.350 --> 00:48:43.470 So how far apart are the actual and predicted value, and it 00:48:43.470 --> 00:48:48.150 squares that value, effectively penalizing much more harshly 00:48:48.150 --> 00:48:50.290 anything that is a worse prediction. 00:48:50.290 --> 00:48:52.620 So you imagine if you have two data points 00:48:52.620 --> 00:48:57.210 that you predict as being one value away from their actual value 00:48:57.210 --> 00:48:59.490 as opposed to one data point that you predict 00:48:59.490 --> 00:49:04.290 as being two away from its actual value, the L2 loss function will more 00:49:04.290 --> 00:49:07.560 harshly penalize that one that is two away because it's 00:49:07.560 --> 00:49:11.430 going to square however much the difference is between the actual value 00:49:11.430 --> 00:49:12.520 and the predicted value. 00:49:12.520 --> 00:49:14.310 And depending on the situation, you might 00:49:14.310 --> 00:49:17.520 want to choose a loss function depending on what you care about minimizing. 00:49:17.520 --> 00:49:20.883 If you really care about minimizing the error on more outlier cases, 00:49:20.883 --> 00:49:23.050 then you might want to consider something like this. 00:49:23.050 --> 00:49:25.592 But if you've got a lot of outliers and you don't necessarily 00:49:25.592 --> 00:49:28.632 care about modeling them, then maybe an L1 loss function is preferable, 00:49:28.632 --> 00:49:30.840 but there are trade-offs here that you need to decide 00:49:30.840 --> 00:49:33.630 based on a particular set of data. 00:49:33.630 --> 00:49:37.020 But what you do run the risk of with any of these loss functions with anything 00:49:37.020 --> 00:49:40.232 that we're trying to do is a problem known as overfitting. 00:49:40.232 --> 00:49:41.940 And overfitting is a big problem that you 00:49:41.940 --> 00:49:44.910 can encounter in machine learning, which happens anytime 00:49:44.910 --> 00:49:49.380 a model fits too closely with a data set, and as a result, 00:49:49.380 --> 00:49:51.510 fails to generalize. 00:49:51.510 --> 00:49:55.200 We would like our model to be able to accurately predict 00:49:55.200 --> 00:49:59.430 the input-output pairs for the data that we have access to. 00:49:59.430 --> 00:50:02.220 But the reason we wanted to do so is because we 00:50:02.220 --> 00:50:06.690 want our model to generalize well to data that we haven't seen before. 00:50:06.690 --> 00:50:10.350 I would like to take data from the past year of whether it was raining or not 00:50:10.350 --> 00:50:12.707 raining and use that data to generalize it 00:50:12.707 --> 00:50:15.540 towards the future-- to say in the future, is it going to be raining 00:50:15.540 --> 00:50:16.170 or not raining. 00:50:16.170 --> 00:50:19.890 Or if I have a whole bunch of data on what counterfeit and not counterfeit US 00:50:19.890 --> 00:50:23.640 dollar bills looked like in the past when people have encountered them, 00:50:23.640 --> 00:50:26.550 I'd like to train a computer to be able to, in the future, 00:50:26.550 --> 00:50:32.020 generalize to other dollar bills that I might see as well. 00:50:32.020 --> 00:50:36.540 And the problem with overfitting is that if you try and tie yourself too closely 00:50:36.540 --> 00:50:39.450 to the data set that you're training your model on, you 00:50:39.450 --> 00:50:42.043 can end up not generalizing very well. 00:50:42.043 --> 00:50:43.210 So what does this look like? 
00:50:43.210 --> 00:50:46.750 Well, we might imagine the rainy day and not rainy day example again from here, 00:50:46.750 --> 00:50:49.440 where the blue points indicate rainy days and the red points 00:50:49.440 --> 00:50:50.970 indicate not rainy days. 00:50:50.970 --> 00:50:53.400 And we decided that we felt pretty comfortable 00:50:53.400 --> 00:50:58.230 with drawing a line like this as the decision boundary between rainy days 00:50:58.230 --> 00:50:59.090 and not rainy days. 00:50:59.090 --> 00:51:01.860 That we can pretty comfortably say that points on this side 00:51:01.860 --> 00:51:04.860 are more likely to be rainy days, points on that side, 00:51:04.860 --> 00:51:06.870 more likely to be not rainy days. 00:51:06.870 --> 00:51:11.430 But the empirical loss isn't 0 in this particular case, 00:51:11.430 --> 00:51:14.200 because we didn't categorize everything perfectly. 00:51:14.200 --> 00:51:17.640 There was this one outlier, this one day where it wasn't raining, 00:51:17.640 --> 00:51:20.580 but yet our model still predicts that it is raining. 00:51:20.580 --> 00:51:22.710 But that doesn't necessarily mean our model is bad. 00:51:22.710 --> 00:51:25.820 It just means the model isn't 100% accurate. 00:51:25.820 --> 00:51:28.700 If you really wanted to try and find a hypothesis that 00:51:28.700 --> 00:51:32.142 resulted in minimizing the loss, you could come up 00:51:32.142 --> 00:51:33.600 with a different decision boundary. 00:51:33.600 --> 00:51:37.100 It wouldn't be a line, but it would look something like this. 00:51:37.100 --> 00:51:41.090 This decision boundary does separate all of the red points 00:51:41.090 --> 00:51:44.780 from all of the blue points because the red points fall 00:51:44.780 --> 00:51:47.390 on this side of this decision boundary, the blue points 00:51:47.390 --> 00:51:49.700 fall on the other side of the decision boundary. 00:51:49.700 --> 00:51:54.530 But this, we would probably argue, is not as good of a prediction. 00:51:54.530 --> 00:51:59.600 Even though it seems to be more accurate based on all of the available training 00:51:59.600 --> 00:52:02.700 data that we have for training this machine learning model, 00:52:02.700 --> 00:52:05.330 we might say that it's probably not going to generalize well. 00:52:05.330 --> 00:52:07.730 That if there were other data points like here and there, 00:52:07.730 --> 00:52:11.270 we might still want to consider those to be rainy days, because we think 00:52:11.270 --> 00:52:13.770 this was probably just an outlier. 00:52:13.770 --> 00:52:17.480 So if the only thing you care about is minimizing the loss on the data 00:52:17.480 --> 00:52:20.340 you have available to you, you run the risk of overfitting. 00:52:20.340 --> 00:52:22.520 And this can happen in the classification case. 00:52:22.520 --> 00:52:25.460 It can also happen in the regression case, 00:52:25.460 --> 00:52:28.790 that here, we predicted what we thought was a pretty good line relating 00:52:28.790 --> 00:52:32.720 advertising to sales, trying to predict what sales were going to be for a given 00:52:32.720 --> 00:52:33.757 amount of advertising. 00:52:33.757 --> 00:52:36.590 But I could come up with a line that does a better job of predicting 00:52:36.590 --> 00:52:39.680 the training data, and it would be something that looks like this, 00:52:39.680 --> 00:52:42.620 just connecting all of the various different data points. 00:52:42.620 --> 00:52:44.750 And now, there is no loss at all. 
00:52:44.750 --> 00:52:48.620 Now I've perfectly predicted given any advertising what sales are, 00:52:48.620 --> 00:52:52.370 and for all the data available to me, it's going to be accurate. 00:52:52.370 --> 00:52:55.010 But it's probably not going to generalize very well. 00:52:55.010 --> 00:53:00.060 I have overfit my model on the training data that is available to me. 00:53:00.060 --> 00:53:02.060 And so in general, we want to avoid overfitting. 00:53:02.060 --> 00:53:04.880 We'd like strategies to make sure that we haven't 00:53:04.880 --> 00:53:07.190 overfit our model to a particular data set. 00:53:07.190 --> 00:53:09.810 And there are a number of ways that you could try to do this. 00:53:09.810 --> 00:53:12.830 One way is by examining what it is that we're optimizing for. 00:53:12.830 --> 00:53:17.060 In an optimization problem, all we do is we say there is some cost, 00:53:17.060 --> 00:53:19.610 and I want to minimize that cost. 00:53:19.610 --> 00:53:24.410 And so far, we've defined that cost function-- the cost of a hypothesis 00:53:24.410 --> 00:53:28.400 just as being equal to the empirical loss of that hypothesis. 00:53:28.400 --> 00:53:32.750 How far away are the actual data points, the outputs, from 00:53:32.750 --> 00:53:36.400 what I predicted them to be based on that particular hypothesis. 00:53:36.400 --> 00:53:38.150 And if all you're trying to do is minimize 00:53:38.150 --> 00:53:40.910 cost, meaning minimizing the loss in this case, 00:53:40.910 --> 00:53:44.030 then the result is going to be that you might overfit. 00:53:44.030 --> 00:53:48.440 That to minimize cost, you're going to try and find a way to perfectly match 00:53:48.440 --> 00:53:49.820 all of the input data. 00:53:49.820 --> 00:53:54.020 And that might happen as a result of overfitting on that particular input 00:53:54.020 --> 00:53:55.650 data. 00:53:55.650 --> 00:53:59.660 So in order to address this, you could add something to the cost function. 00:53:59.660 --> 00:54:00.880 What counts as cost? 00:54:00.880 --> 00:54:04.190 Well, not just loss, but also, some measure 00:54:04.190 --> 00:54:08.680 of the complexity of the hypothesis, where the complexity of the hypothesis 00:54:08.680 --> 00:54:12.440 is something that you would need to define for how complicated 00:54:12.440 --> 00:54:13.250 does our line look. 00:54:13.250 --> 00:54:15.830 This is sort of an Occam's razor style approach, where 00:54:15.830 --> 00:54:19.160 we want to give preference to a simpler decision boundary-- 00:54:19.160 --> 00:54:20.990 like a straight line, for example, 00:54:20.990 --> 00:54:24.710 some simpler curve as opposed to something far more complex 00:54:24.710 --> 00:54:26.840 that might represent the training data better, 00:54:26.840 --> 00:54:29.150 but might not generalize as well. We'll generally 00:54:29.150 --> 00:54:33.860 say that a simpler solution is probably the better solution and probably 00:54:33.860 --> 00:54:38.460 the one that is more likely to generalize well to other inputs. 00:54:38.460 --> 00:54:42.020 So we measure what the loss is, but we also measure the complexity. 00:54:42.020 --> 00:54:45.770 And now that all gets taken into account when we consider the overall cost. 00:54:45.770 --> 00:54:50.040 That yes, something might have less loss if it better predicts the training data, 00:54:50.040 --> 00:54:52.940 but if it's much more complex, it still might not 00:54:52.940 --> 00:54:55.230 be the best option that we have. 
00:54:55.230 --> 00:54:58.930 And we need to come up with some balance between loss and complexity. 00:54:58.930 --> 00:55:01.180 And for that reason, you'll often see this represented 00:55:01.180 --> 00:55:04.730 as multiplying the complexity by some parameter 00:55:04.730 --> 00:55:08.480 that we have to choose-- parameter lambda in this case, where we're saying 00:55:08.480 --> 00:55:11.300 if lambda has a greater value, then we really want 00:55:11.300 --> 00:55:13.852 to penalize more complex hypotheses. 00:55:13.852 --> 00:55:15.560 Whereas if lambda is smaller, we're only going 00:55:15.560 --> 00:55:18.470 to penalize more complex hypotheses a little bit. 00:55:18.470 --> 00:55:21.620 And it's up to the machine learning programmer 00:55:21.620 --> 00:55:24.470 to decide where they want to set that value of lambda 00:55:24.470 --> 00:55:28.430 for how much do I want to penalize a more complex hypothesis that 00:55:28.430 --> 00:55:30.293 might fit the data a little better. 00:55:30.293 --> 00:55:32.960 And again, there is no one right answer to a lot of these things. 00:55:32.960 --> 00:55:36.362 Depending on the data set, depending on the data you have available to you, 00:55:36.362 --> 00:55:39.320 and the problem you're trying to solve, your choice of these parameters 00:55:39.320 --> 00:55:39.868 may vary. 00:55:39.868 --> 00:55:41.660 And you may need to experiment a little bit 00:55:41.660 --> 00:55:45.800 to figure out what the right choice of that is ultimately going to be. 00:55:45.800 --> 00:55:49.100 This process then of considering not only the loss, but also 00:55:49.100 --> 00:55:53.000 some measure of the complexity is known as regularization. 00:55:53.000 --> 00:55:55.610 Regularization is the process of penalizing 00:55:55.610 --> 00:55:58.260 a hypothesis that is more complex 00:55:58.260 --> 00:56:00.950 in order to favor a simpler hypothesis that 00:56:00.950 --> 00:56:03.680 is more likely to generalize well-- more likely to be 00:56:03.680 --> 00:56:08.210 able to apply to other situations that are dealing with other input points 00:56:08.210 --> 00:56:11.540 unlike the ones that we've necessarily seen before. 00:56:11.540 --> 00:56:15.500 So oftentimes, you'll see us add some regularizing term 00:56:15.500 --> 00:56:17.570 to what we're trying to minimize in order 00:56:17.570 --> 00:56:21.220 to avoid this problem of overfitting. 00:56:21.220 --> 00:56:25.780 Now another way of making sure we don't overfit is to run some experiments 00:56:25.780 --> 00:56:29.950 and to see whether or not we are able to generalize our model that we've 00:56:29.950 --> 00:56:33.250 created to other data sets as well. 00:56:33.250 --> 00:56:35.320 And it's for that reason that oftentimes, 00:56:35.320 --> 00:56:37.800 when you're doing a machine learning experiment, when you've got some data 00:56:37.800 --> 00:56:40.717 and you want to try and come up with some function that predicts given 00:56:40.717 --> 00:56:44.230 some input, what the output is going to be, you don't necessarily 00:56:44.230 --> 00:56:48.340 want to do your training on all of the data you have available to you. 00:56:48.340 --> 00:56:52.410 That you could employ a method known as holdout cross-validation. 00:56:52.410 --> 00:56:55.480 Where in holdout cross-validation, we split up our data. 00:56:55.480 --> 00:57:00.373 We split up our data into a training set and a testing set. 
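As a rough numerical sketch of the regularized cost just described, cost = loss + lambda * complexity, with all of the numbers made up purely to show how the lambda parameter shifts the balance:

    def cost(loss, complexity, lam):
        """Regularized cost of a hypothesis: its loss plus lambda times its complexity."""
        return loss + lam * complexity

    # A very complex boundary that fits the training data perfectly (loss 0)
    # versus a simple straight line that misses a few points (loss 4).
    print(cost(loss=0, complexity=10, lam=1.0))   # 10.0 for the complex hypothesis
    print(cost(loss=4, complexity=1, lam=1.0))    # 5.0 for the simpler one, so it wins

    # A smaller lambda penalizes complexity less, so the complex hypothesis can win instead.
    print(cost(loss=0, complexity=10, lam=0.1))   # 1.0
    print(cost(loss=4, complexity=1, lam=0.1))    # 4.1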
00:57:00.373 --> 00:57:02.290 The training set is the set of data that we're 00:57:02.290 --> 00:57:04.870 going to use to train our machine learning model. 00:57:04.870 --> 00:57:07.440 And the testing set is the set of data that we 00:57:07.440 --> 00:57:11.260 are going to use in order to test to see how well our machine learning 00:57:11.260 --> 00:57:13.660 model actually performed. 00:57:13.660 --> 00:57:15.757 So the learning happens on the training set. 00:57:15.757 --> 00:57:17.590 We figure out what the parameters should be, 00:57:17.590 --> 00:57:19.660 we figure out what the right model is. 00:57:19.660 --> 00:57:22.360 And that we see, all right, now that we've trained the model, 00:57:22.360 --> 00:57:25.960 see how well it does at predicting things and inside of the testing 00:57:25.960 --> 00:57:29.342 set, some set of data that we haven't seen before. 00:57:29.342 --> 00:57:32.300 And the hope then is that we're going to be able to predict the testing 00:57:32.300 --> 00:57:36.460 set pretty well if we're able to generalize based on the training 00:57:36.460 --> 00:57:38.028 data that's available to us. 00:57:38.028 --> 00:57:39.820 If we've overfit the training data, though, 00:57:39.820 --> 00:57:43.112 and we're not able to generalize, then when we look at the testing set, 00:57:43.112 --> 00:57:45.070 it's likely going to be the case that we're not 00:57:45.070 --> 00:57:49.100 going to predict things from the testing set nearly as effectively. 00:57:49.100 --> 00:57:51.850 So this is one method of cross-validation-- validating 00:57:51.850 --> 00:57:55.270 to make sure that the work we have done is actually going to generalize 00:57:55.270 --> 00:57:56.770 to other data sets as well. 00:57:56.770 --> 00:57:59.620 And there are other statistical techniques we can use, as well. 00:57:59.620 --> 00:58:03.520 One of the downsides of this just holdout cross-validation is if you say, 00:58:03.520 --> 00:58:07.960 I just split it 50/50, I train using 50% of the data and test using 00:58:07.960 --> 00:58:11.110 the other 50%, or you could choose other percentages as well, 00:58:11.110 --> 00:58:15.190 is that there is a fair amount of data that I am now not using 00:58:15.190 --> 00:58:19.670 to train that I might be able to get a better model as a result, for example. 00:58:19.670 --> 00:58:23.500 So one approach is known as k-fold cross-validation. 00:58:23.500 --> 00:58:26.830 In k-fold cross-validation, rather than just divide things 00:58:26.830 --> 00:58:31.980 into two sets and run one experiment, we divide things into k different sets 00:58:31.980 --> 00:58:34.810 and maybe I divide things up into 10 different sets, 00:58:34.810 --> 00:58:37.270 and then run 10 different experiments. 00:58:37.270 --> 00:58:40.750 So if I split up my data into 10 different sets of data, 00:58:40.750 --> 00:58:44.410 then what I'll do is each time for each of my 10 experiments, 00:58:44.410 --> 00:58:47.020 I will hold out one of those sets of data, 00:58:47.020 --> 00:58:50.320 where I'll say, let me train my model on these nine sets, 00:58:50.320 --> 00:58:54.070 and then test to see how well it predicts on set number 10. 
00:58:54.070 --> 00:58:57.490 And then pick another set of nine sets to train on, and then 00:58:57.490 --> 00:58:59.320 test it on the other one that I held out, 00:58:59.320 --> 00:59:03.550 where each time, I train the model on everything minus the one set 00:59:03.550 --> 00:59:07.300 that I'm holding out, and then test to see how well our model performs 00:59:07.300 --> 00:59:09.100 on the test that I did hold out. 00:59:09.100 --> 00:59:12.250 And what you end up getting is 10 different results, 10 different answers 00:59:12.250 --> 00:59:14.470 for how accurately our model worked. 00:59:14.470 --> 00:59:16.870 And oftentimes, you can just take the average of those 10 00:59:16.870 --> 00:59:21.160 to get an approximation for how well we think our model performs overall. 00:59:21.160 --> 00:59:25.270 But the key idea is separating the training data from the testing data, 00:59:25.270 --> 00:59:29.170 because you want to test your model on data that is different from what 00:59:29.170 --> 00:59:30.273 you trained the model on. 00:59:30.273 --> 00:59:32.440 Because the training, you want to avoid overfitting, 00:59:32.440 --> 00:59:33.940 you want to be able to generalize. 00:59:33.940 --> 00:59:36.190 And the way you test whether you're able to generalize 00:59:36.190 --> 00:59:39.580 and is by looking at some data that you haven't seen before 00:59:39.580 --> 00:59:43.380 and seeing how well we're actually able to perform. 00:59:43.380 --> 00:59:46.010 And so if we want to actually implement any of these techniques 00:59:46.010 --> 00:59:48.980 inside of a programming language like Python, a number of ways 00:59:48.980 --> 00:59:49.760 we could do that. 00:59:49.760 --> 00:59:52.007 We could write this from scratch on our own, 00:59:52.007 --> 00:59:53.840 but there are libraries out there that allow 00:59:53.840 --> 00:59:57.290 us to take advantage of existing implementations of these algorithms-- 00:59:57.290 --> 01:00:00.080 that we can use the same types of algorithms 01:00:00.080 --> 01:00:01.950 in a lot of different situations. 01:00:01.950 --> 01:00:03.770 And so there is a library, very popular one 01:00:03.770 --> 01:00:07.240 known as scikit-learn, which allows us in Python to be 01:00:07.240 --> 01:00:10.490 able to very quickly get set up with a lot of these different machine learning 01:00:10.490 --> 01:00:10.990 models. 01:00:10.990 --> 01:00:14.210 So this library has already written an algorithm for nearest neighbor 01:00:14.210 --> 01:00:16.430 classification, for doing perceptron learning, 01:00:16.430 --> 01:00:19.035 for doing a bunch of other types of inference 01:00:19.035 --> 01:00:21.410 and supervised learning that we haven't yet talked about. 01:00:21.410 --> 01:00:26.840 But using it, we can begin to try actually testing how these methods work 01:00:26.840 --> 01:00:29.330 and how accurately they perform. 01:00:29.330 --> 01:00:31.900 So let's go ahead and take a look at one approach to trying 01:00:31.900 --> 01:00:33.920 to solve this type of problem. 01:00:33.920 --> 01:00:37.280 All right, so I'm first going to pull up banknotes.csv, 01:00:37.280 --> 01:00:40.970 which is a whole bunch of data provided by UC Irvine, which has information 01:00:40.970 --> 01:00:43.160 about various different banknotes. 01:00:43.160 --> 01:00:45.440 So people took pictures of various different banknotes 01:00:45.440 --> 01:00:48.530 and measured various different properties of those banknotes. 
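A minimal sketch of that k-fold idea, assuming scikit-learn's cross_val_score helper and a made-up stand-in data set (the lecture itself does not show this particular code):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in labeled data, generated just for illustration.
    X, y = make_classification(n_samples=500, n_features=4, random_state=0)

    # 10-fold cross-validation: train on nine folds, test on the held-out tenth,
    # and repeat ten times, producing ten separate accuracy scores.
    model = KNeighborsClassifier(n_neighbors=3)
    scores = cross_val_score(model, X, y, cv=10)

    print(scores)          # one accuracy per fold
    print(scores.mean())   # the average, as an overall estimate of performance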
01:00:48.530 --> 01:00:51.350 And in particular, some human categorized each of those 01:00:51.350 --> 01:00:55.810 banknotes as either a counterfeit bank note or as not counterfeit. 01:00:55.810 --> 01:00:59.540 And so what you're looking at here is each row represents one banknote. 01:00:59.540 --> 01:01:01.910 This is formatted as a CSV spreadsheet, where just 01:01:01.910 --> 01:01:05.780 comma-separated value separating each of these various different fields. 01:01:05.780 --> 01:01:08.990 We have four different input values for each 01:01:08.990 --> 01:01:12.410 of these data points, just information, some measurement that 01:01:12.410 --> 01:01:13.460 was made on the banknote. 01:01:13.460 --> 01:01:16.340 And what those measurements exactly aren't as important as the fact 01:01:16.340 --> 01:01:18.320 that we do have access to this data. 01:01:18.320 --> 01:01:21.950 But more importantly, we have access for each of these data points 01:01:21.950 --> 01:01:26.210 to a label, where 0 indicates something like this was not a counterfeit bill, 01:01:26.210 --> 01:01:27.920 meaning it was an authentic bill. 01:01:27.920 --> 01:01:32.840 And a data point labeled 1 means that it is a counterfeit bill at least, 01:01:32.840 --> 01:01:36.240 according to the human researcher who labeled this particular data. 01:01:36.240 --> 01:01:38.360 So we have a whole bunch of data representing 01:01:38.360 --> 01:01:40.940 a whole bunch of different data points, each of which 01:01:40.940 --> 01:01:44.360 has these various different measurements that were made on that particular bill. 01:01:44.360 --> 01:01:47.300 And each of which has an output value 0 or 1-- 01:01:47.300 --> 01:01:53.033 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill. 01:01:53.033 --> 01:01:54.950 And what we would like to do is use supervised 01:01:54.950 --> 01:01:58.550 learning to begin to predict or model some sort of function 01:01:58.550 --> 01:02:02.570 that can take these four values as input and predict what the output would be. 01:02:02.570 --> 01:02:05.777 We want our learning algorithm to find some sort of pattern that 01:02:05.777 --> 01:02:08.110 is able to predict based on these measurements something 01:02:08.110 --> 01:02:10.690 that you could measure just by taking a photo of a bill-- 01:02:10.690 --> 01:02:16.090 predict whether that bill is authentic or whether that bill is counterfeit. 01:02:16.090 --> 01:02:17.730 And so how can we do that? 01:02:17.730 --> 01:02:21.020 Well, I'm first going to open up banknotes0.py and see 01:02:21.020 --> 01:02:22.410 how it is that we do this. 01:02:22.410 --> 01:02:26.000 I'm first importing a lot of things from scikit-learn, 01:02:26.000 --> 01:02:29.753 but importantly, I'm going to set my model equal to the perceptron 01:02:29.753 --> 01:02:32.420 model, which is one of those models that we talked about before. 01:02:32.420 --> 01:02:35.120 We're just going to try and figure out some setting of weights 01:02:35.120 --> 01:02:38.950 that is able to divide our data into two different groups. 01:02:38.950 --> 01:02:43.270 Then I'm going to go ahead and read data in from my file from banknotes.csv. 01:02:43.270 --> 01:02:46.660 And basically, for every row, I'm going to separate that row 01:02:46.660 --> 01:02:51.460 into the first four values of that row, which is the evidence for that row. 
01:02:51.460 --> 01:02:56.920 And then the label where if the final column in that row is 0, 01:02:56.920 --> 01:03:00.770 the label is authentic, and otherwise, it's going to be counterfeit. 01:03:00.770 --> 01:03:03.880 So I'm effectively reading data in from the CSV file, 01:03:03.880 --> 01:03:07.330 dividing it into a whole bunch of rows, where each row has some evidence-- 01:03:07.330 --> 01:03:11.380 those four input values that are going to be inputs to my hypothesis function. 01:03:11.380 --> 01:03:14.710 And then the label, the output, whether it is authentic or counterfeit. 01:03:14.710 --> 01:03:17.270 That is the thing that I am then trying to predict. 01:03:17.270 --> 01:03:20.650 So the next step is that I would like to split up my data set into a training 01:03:20.650 --> 01:03:22.990 set and a testing set-- some set of data 01:03:22.990 --> 01:03:26.260 that I would like to train my machine learning model on and some set of data 01:03:26.260 --> 01:03:29.500 that I would like to use to test that model, see how well it performed. 01:03:29.500 --> 01:03:32.480 So what I'll do is I'll go ahead and figure out the length of the data, 01:03:32.480 --> 01:03:34.150 how many data points do I have. 01:03:34.150 --> 01:03:35.925 I'll go ahead and take half of them, save 01:03:35.925 --> 01:03:37.550 that number as a number called holdout. 01:03:37.550 --> 01:03:39.930 That is how many items I'm going to hold out from my data 01:03:39.930 --> 01:03:42.400 set to save for the testing phase. 01:03:42.400 --> 01:03:45.250 I'll randomly shuffle the data so it's in some random order. 01:03:45.250 --> 01:03:50.440 And then I'll say my testing set will be all of the data up to the holdout. 01:03:50.440 --> 01:03:54.790 So I'll hold out that many data items, and that will be my testing set. 01:03:54.790 --> 01:03:58.060 My training data will be everything else-- the information 01:03:58.060 --> 01:04:00.850 that I'm going to train my model on. 01:04:00.850 --> 01:04:04.870 And then I'll say, I need to divide up my training data 01:04:04.870 --> 01:04:06.040 into two different sets. 01:04:06.040 --> 01:04:10.750 I need to divide it into my x values, where x here represents the inputs. 01:04:10.750 --> 01:04:14.260 So for the x values that I'm going to train on, basically 01:04:14.260 --> 01:04:16.330 for every row in my training set, I'm going 01:04:16.330 --> 01:04:19.150 to get the evidence for that row, those four values, 01:04:19.150 --> 01:04:21.940 where it's basically a vector of four numbers, where 01:04:21.940 --> 01:04:24.010 that is going to be all of the input. 01:04:24.010 --> 01:04:27.460 And then I need the y values-- what are the outputs that I want to learn from, 01:04:27.460 --> 01:04:30.567 the labels that belong to each of these various different input points. 01:04:30.567 --> 01:04:33.650 Well, that's going to be the same thing for each row in the training data. 01:04:33.650 --> 01:04:36.430 But this time, I take that row and get what it's labeled as, 01:04:36.430 --> 01:04:38.680 whether it is authentic or counterfeit. 01:04:38.680 --> 01:04:43.270 So I end up with one list of all of these vectors of my input data 01:04:43.270 --> 01:04:45.790 and one list which follows the same order, 01:04:45.790 --> 01:04:49.690 but has all of the labels that correspond with each of those vectors. 
01:04:49.690 --> 01:04:51.610 And then to train my model, which in this case 01:04:51.610 --> 01:04:54.190 is just this perceptron model, I just called 01:04:54.190 --> 01:04:57.370 model.fit, pass in the training data, and what 01:04:57.370 --> 01:04:59.710 the labels for those training data are. 01:04:59.710 --> 01:05:02.120 And scikit-learn will take care of fitting the model-- 01:05:02.120 --> 01:05:04.250 will do the entire algorithm for me. 01:05:04.250 --> 01:05:08.320 And then when it's done, I can then test to see how well that model performed. 01:05:08.320 --> 01:05:10.780 So I can say, let me get all of these input 01:05:10.780 --> 01:05:13.040 vectors for what I want to test on. 01:05:13.040 --> 01:05:16.870 So for each row in my testing data set, go ahead and get the evidence. 01:05:16.870 --> 01:05:20.470 And the y values, those are what the actual values were-- 01:05:20.470 --> 01:05:24.610 for each of the rows in the testing data set, what the actual label is. 01:05:24.610 --> 01:05:26.860 But then I'm going to generate some predictions. 01:05:26.860 --> 01:05:29.350 I'm going to use this model and try and predict-- 01:05:29.350 --> 01:05:31.480 based on the testing vectors-- 01:05:31.480 --> 01:05:33.910 I want to predict what the output is. 01:05:33.910 --> 01:05:38.230 And my goal then is to now compare y testing with predictions. 01:05:38.230 --> 01:05:41.440 I want to see how well my predictions based on the model 01:05:41.440 --> 01:05:45.310 actually reflect what the y values were, what the output is 01:05:45.310 --> 01:05:46.540 that were actually labeled. 01:05:46.540 --> 01:05:48.580 Because I now have this label data, I can 01:05:48.580 --> 01:05:51.540 assess how well the algorithm worked. 01:05:51.540 --> 01:05:55.330 And so now I can just compute how well we did. 01:05:55.330 --> 01:05:58.840 This zip function basically just lets me look through two different lists, one 01:05:58.840 --> 01:06:00.500 by one at the same time. 01:06:00.500 --> 01:06:04.113 So for each actual value and for each predicted value, 01:06:04.113 --> 01:06:06.280 if the actual is the same thing as what I predicted, 01:06:06.280 --> 01:06:08.570 I'll go ahead and increment the counter by 1. 01:06:08.570 --> 01:06:11.727 Otherwise, I'll increment my incorrect counter by 1. 01:06:11.727 --> 01:06:14.060 And so at the end, I can print out here are the results, 01:06:14.060 --> 01:06:16.435 here's how many I got right, here's how many I got wrong. 01:06:16.435 --> 01:06:19.660 And here was my overall accuracy, for example. 01:06:19.660 --> 01:06:21.100 So I can go ahead and run this. 01:06:21.100 --> 01:06:24.790 I can run Python banknotes0.py. 01:06:24.790 --> 01:06:28.810 And it's going to train on half the data set and then test on half the data set. 01:06:28.810 --> 01:06:31.090 And here the results from my perceptron model. 01:06:31.090 --> 01:06:35.540 In this case, it correctly was able to classify 679 bills 01:06:35.540 --> 01:06:38.170 as correctly either authentic or counterfeit, 01:06:38.170 --> 01:06:43.580 and incorrectly classified seven of them for an overall accuracy of close to 99% 01:06:43.580 --> 01:06:44.080 accurate. 01:06:44.080 --> 01:06:47.230 So on this particular data set, using this perceptron model, 01:06:47.230 --> 01:06:51.310 we were able to predict very well what the output was going to be. 01:06:51.310 --> 01:06:53.010 And we can try different models, too. 
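The transcript describes banknotes0.py without showing it, so here is a rough sketch of what a script along those lines might look like, based purely on the description above; the header row handling and exact file layout are assumptions, and the real course file may differ:

    import csv
    import random

    from sklearn.linear_model import Perceptron

    model = Perceptron()

    # Read banknotes.csv: four measurements per row, then a 0/1 label in the last column.
    with open("banknotes.csv") as f:
        reader = csv.reader(f)
        next(reader)  # assumption: skip a header row
        data = []
        for row in reader:
            data.append({
                "evidence": [float(cell) for cell in row[:4]],
                "label": "Authentic" if row[4] == "0" else "Counterfeit"
            })

    # Hold out half of the data for testing; train on the rest.
    holdout = int(0.5 * len(data))
    random.shuffle(data)
    testing = data[:holdout]
    training = data[holdout:]

    # Train the model on the training set's inputs and labels.
    X_training = [row["evidence"] for row in training]
    y_training = [row["label"] for row in training]
    model.fit(X_training, y_training)

    # Make predictions on the testing set and compare them to the actual labels.
    X_testing = [row["evidence"] for row in testing]
    y_testing = [row["label"] for row in testing]
    predictions = model.predict(X_testing)

    correct = sum(1 for actual, predicted in zip(y_testing, predictions)
                  if actual == predicted)
    incorrect = len(y_testing) - correct

    print(f"Results for model {type(model).__name__}")
    print(f"Correct: {correct}")
    print(f"Incorrect: {incorrect}")
    print(f"Accuracy: {100 * correct / len(y_testing):.2f}%")

Swapping in a different model, as described next, would only mean changing the model = Perceptron() line to something like svm.SVC() or KNeighborsClassifier(n_neighbors=1).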
01:06:53.010 --> 01:06:55.390 And scikit-learn makes it very easy just to swap out 01:06:55.390 --> 01:06:57.970 one model for another model. 01:06:57.970 --> 01:07:02.710 So instead of the perceptron model, I can use a support vector machine 01:07:02.710 --> 01:07:06.490 using the SVC, otherwise known as a support vector classifier, 01:07:06.490 --> 01:07:08.950 using a support vector machine to classify things 01:07:08.950 --> 01:07:10.820 into two different groups. 01:07:10.820 --> 01:07:14.270 And now we can see, all right, how well does this perform. 01:07:14.270 --> 01:07:17.630 And this time, we were able to correctly predict 682 01:07:17.630 --> 01:07:22.280 and incorrectly predicted four, for an accuracy of 99.4%. 01:07:22.280 --> 01:07:26.990 And we could even try the KNeighborsClassifier as the model 01:07:26.990 --> 01:07:27.740 instead. 01:07:27.740 --> 01:07:31.270 And this takes a parameter n_neighbors for how many neighbors 01:07:31.270 --> 01:07:32.133 you want to look at. 01:07:32.133 --> 01:07:34.550 Let's just look at one neighbor, the one nearest neighbor, 01:07:34.550 --> 01:07:36.080 and use that to predict. 01:07:36.080 --> 01:07:38.250 Go ahead and run this as well. 01:07:38.250 --> 01:07:40.880 And it looks like, based on the KNeighborsClassifier looking 01:07:40.880 --> 01:07:45.350 at just one neighbor, we were able to correctly classify 685 data points, 01:07:45.350 --> 01:07:47.420 and incorrectly classified one. 01:07:47.420 --> 01:07:50.560 Maybe let's try three neighbors instead of just using one neighbor, 01:07:50.560 --> 01:07:52.560 do more of a k-nearest-neighbors approach, where 01:07:52.560 --> 01:07:55.693 I look at the three nearest neighbors and see how that performs. 01:07:55.693 --> 01:07:57.610 And that one in this case seems to have gotten 01:07:57.610 --> 01:08:05.330 100% of all of the predictions correctly described as either authentic banknotes 01:08:05.330 --> 01:08:07.370 or as counterfeit banknotes. 01:08:07.370 --> 01:08:09.470 And we could run these experiments multiple times. 01:08:09.470 --> 01:08:12.082 Because I'm randomly reorganizing the data every time, 01:08:12.082 --> 01:08:14.790 we're technically training these on slightly different data sets. 01:08:14.790 --> 01:08:16.832 And so you might want to run multiple experiments 01:08:16.832 --> 01:08:19.250 to really see how well they're actually going to perform. 01:08:19.250 --> 01:08:20.963 But in short, they all perform very well. 01:08:20.963 --> 01:08:23.630 And while some of them perform slightly better than others here, 01:08:23.630 --> 01:08:26.240 that might not always be the case for every data set, 01:08:26.240 --> 01:08:29.240 but you can begin to test now by very quickly putting together 01:08:29.240 --> 01:08:31.790 these machine learning models using scikit-learn 01:08:31.790 --> 01:08:33.920 to be able to train on some training set, 01:08:33.920 --> 01:08:36.814 and then test on some testing set as well. 01:08:36.814 --> 01:08:39.439 And this splitting up into training groups and testing groups, 01:08:39.439 --> 01:08:43.279 and then training and testing, happens so often that scikit-learn has functions built in 01:08:43.279 --> 01:08:44.160 for doing it. 01:08:44.160 --> 01:08:46.100 I did it all by hand just now. 01:08:46.100 --> 01:08:48.770 But if we take a look at banknotes1, we take 01:08:48.770 --> 01:08:52.100 advantage of some other features that exist in scikit-learn, 01:08:52.100 --> 01:08:55.399 where we can really simplify a lot of our logic.
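As a rough sketch of the simplified banknotes1-style approach described next, the model can be any of the classifiers mentioned above, and scikit-learn's train_test_split helper can replace the manual shuffling and slicing; the file name and layout are again assumptions for illustration:

# Sketch: swap in any of the models discussed above and let
# scikit-learn handle the train/test split.
import csv

from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Pick whichever model to experiment with.
model = Perceptron()
# model = SVC()
# model = KNeighborsClassifier(n_neighbors=3)

# Read data in from file (same assumed banknotes.csv layout as before).
evidence, labels = [], []
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # assumes a header row
    for row in reader:
        evidence.append([float(cell) for cell in row[:4]])
        labels.append("Authentic" if row[4] == "0" else "Counterfeit")

# Split into training and testing groups, half and half.
X_training, X_testing, y_training, y_testing = train_test_split(
    evidence, labels, test_size=0.5
)

# Fit the model on the training data and predict on the testing data.
model.fit(X_training, y_training)
predictions = model.predict(X_testing)

# Count up how often the predictions matched the actual labels.
correct = (predictions == y_testing).sum()
incorrect = (predictions != y_testing).sum()
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")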
01:08:55.399 --> 01:08:59.609 That there is a function built into scikit-learn called train_test_split, 01:08:59.609 --> 01:09:03.140 which will automatically split data into a training group and a testing group. 01:09:03.140 --> 01:09:05.536 I just have to say what proportion should 01:09:05.536 --> 01:09:07.369 be in the testing group, something like 0.5, 01:09:07.369 --> 01:09:10.080 half the data, inside the testing group. 01:09:10.080 --> 01:09:12.380 Then I can fit the model on the training data, 01:09:12.380 --> 01:09:15.859 make the predictions on the testing data, and then just count up. 01:09:15.859 --> 01:09:18.830 And scikit-learn has some nice methods for just counting up 01:09:18.830 --> 01:09:22.250 how many times our testing data matched the predictions, how 01:09:22.250 --> 01:09:25.439 many times our testing data didn't match the predictions. 01:09:25.439 --> 01:09:27.680 So very quickly, you can write programs with not all 01:09:27.680 --> 01:09:30.500 that many lines of code-- it's maybe, like, 40 lines of code 01:09:30.500 --> 01:09:32.540 to get through all of these predictions. 01:09:32.540 --> 01:09:35.420 And then as a result, see how well we're able to do. 01:09:35.420 --> 01:09:38.330 So these types of libraries can allow us, without really 01:09:38.330 --> 01:09:40.970 knowing the implementation details of these algorithms, 01:09:40.970 --> 01:09:43.939 to be able to use the algorithms in a very practical way 01:09:43.939 --> 01:09:47.279 to be able to solve these types of problems. 01:09:47.279 --> 01:09:49.939 So that then is supervised learning-- this task 01:09:49.939 --> 01:09:52.972 of, given a whole set of data, some input-output pairs, 01:09:52.972 --> 01:09:54.680 we would like to learn some function that 01:09:54.680 --> 01:09:57.110 maps those inputs to those outputs. 01:09:57.110 --> 01:09:59.630 But it turns out there are other forms of learning, as well. 01:09:59.630 --> 01:10:02.870 And another popular type of machine learning, especially nowadays, 01:10:02.870 --> 01:10:05.090 is known as reinforcement learning. 01:10:05.090 --> 01:10:07.250 And the idea of reinforcement learning is, 01:10:07.250 --> 01:10:09.140 rather than just being given a whole data set 01:10:09.140 --> 01:10:12.470 of input-output pairs at the beginning, reinforcement learning 01:10:12.470 --> 01:10:14.680 is all about learning from experience. 01:10:14.680 --> 01:10:17.390 In reinforcement learning, our agent, whether it's 01:10:17.390 --> 01:10:19.550 like a physical robot that's trying to make actions 01:10:19.550 --> 01:10:23.750 in the world or just some virtual agent that is a program running somewhere-- 01:10:23.750 --> 01:10:27.560 our agent is going to be given a set of rewards or punishments 01:10:27.560 --> 01:10:29.510 in the form of numerical values, but you can 01:10:29.510 --> 01:10:31.430 think of them as reward or punishment. 01:10:31.430 --> 01:10:35.480 And based on that, it learns what actions to take in the future. 01:10:35.480 --> 01:10:39.470 That is, our agent, our AI, will be put in some sort of environment. 01:10:39.470 --> 01:10:42.537 It will make some actions, and based on the actions that it makes, 01:10:42.537 --> 01:10:43.370 it learns something. 01:10:43.370 --> 01:10:45.537 It either gets a reward when it does something well, 01:10:45.537 --> 01:10:47.720 or it gets a punishment when it does something poorly.
01:10:47.720 --> 01:10:52.130 And it learns what to do or what not to do in the future based 01:10:52.130 --> 01:10:54.980 on those individual experiences. 01:10:54.980 --> 01:10:58.700 And so what this will often look like is it will often start with some agent, 01:10:58.700 --> 01:11:01.070 some AI, which might again, be a physical robot-- 01:11:01.070 --> 01:11:03.360 if you're imagining a physical robot moving around-- 01:11:03.360 --> 01:11:05.180 but it can also just be a program. 01:11:05.180 --> 01:11:08.210 And our agent is situated in their environment, 01:11:08.210 --> 01:11:11.120 where the environment is where they're going to make their actions. 01:11:11.120 --> 01:11:13.820 And it's what's going to give them rewards or punishments 01:11:13.820 --> 01:11:16.070 for various actions that they're in. 01:11:16.070 --> 01:11:19.220 So for example, the environment is going to start off 01:11:19.220 --> 01:11:21.980 by putting our agent inside of a state. 01:11:21.980 --> 01:11:25.640 Our agent has some state that in a game might be the state of the game 01:11:25.640 --> 01:11:28.730 that the agent is playing, in a world that the agent is exploring. 01:11:28.730 --> 01:11:31.375 Might be some position inside of a grid representing 01:11:31.375 --> 01:11:32.750 the world that they're exploring. 01:11:32.750 --> 01:11:35.090 But the agent is in some sort of state. 01:11:35.090 --> 01:11:39.140 And in that state, the agent needs to choose to take an action. 01:11:39.140 --> 01:11:41.760 The agent likely has multiple actions they can choose from, 01:11:41.760 --> 01:11:43.310 but they pick an action. 01:11:43.310 --> 01:11:46.310 So they take an action in a particular state. 01:11:46.310 --> 01:11:51.170 And as a result of that, the agent will generally get two things in response 01:11:51.170 --> 01:11:52.040 as we model them. 01:11:52.040 --> 01:11:54.770 The agent gets a new state that they find themselves in. 01:11:54.770 --> 01:11:57.110 After being in this state taking one action, 01:11:57.110 --> 01:11:59.210 they end up in some other state. 01:11:59.210 --> 01:12:02.360 And they're also given some sort of numerical reward-- 01:12:02.360 --> 01:12:05.660 positive meaning reward, meaning it was a good thing. 01:12:05.660 --> 01:12:08.000 Negative generally meaning they did something bad, 01:12:08.000 --> 01:12:10.310 they received some sort of punishment. 01:12:10.310 --> 01:12:13.190 And that is all the information the agent has. 01:12:13.190 --> 01:12:15.260 It's told what state it's in. 01:12:15.260 --> 01:12:17.150 It makes some sort of action. 01:12:17.150 --> 01:12:19.160 And based on that, it ends up in another state, 01:12:19.160 --> 01:12:21.660 and it ends up getting some particular reward. 01:12:21.660 --> 01:12:24.110 And it needs to learn based on that information what 01:12:24.110 --> 01:12:26.770 actions to begin to take in the future. 01:12:26.770 --> 01:12:30.090 As you can imagine generalizing this to a lot of different situations, 01:12:30.090 --> 01:12:31.685 this is oftentimes how you train. 01:12:31.685 --> 01:12:33.560 If you've ever seen those robots that are now 01:12:33.560 --> 01:12:36.140 able to walk around sort of the way humans do, 01:12:36.140 --> 01:12:39.500 it would be quite difficult to program the robot in exactly the right way 01:12:39.500 --> 01:12:41.330 to get it to walk the way humans do. 
01:12:41.330 --> 01:12:44.330 You could instead of train it through reinforcement learning-- give it 01:12:44.330 --> 01:12:47.060 some sort of numerical reward every time it does something 01:12:47.060 --> 01:12:50.690 good like take steps forward, and punish it every time it does something 01:12:50.690 --> 01:12:52.250 bad like fall over. 01:12:52.250 --> 01:12:54.080 And then let the AI just learn. 01:12:54.080 --> 01:12:56.450 Based on that sequence of rewards, based on trying 01:12:56.450 --> 01:12:58.610 to take various different actions, you can 01:12:58.610 --> 01:13:03.200 begin to have the agent learn what to do in the future and what not to do. 01:13:03.200 --> 01:13:06.530 So in order to begin to formalize this, the first thing we need to do 01:13:06.530 --> 01:13:10.700 is formalize this notion of what we mean about states and actions and rewards-- 01:13:10.700 --> 01:13:12.800 like what does this world look like. 01:13:12.800 --> 01:13:14.990 And oftentimes, we'll formulate this world 01:13:14.990 --> 01:13:17.990 as what's known as a Markov decision process. 01:13:17.990 --> 01:13:21.410 Similar in spirit to Markov chains, which you might recall from before, 01:13:21.410 --> 01:13:24.020 but a Markov decision process is a model that we 01:13:24.020 --> 01:13:26.780 can use for decision making for an agent trying 01:13:26.780 --> 01:13:28.580 to make decisions in this environment. 01:13:28.580 --> 01:13:32.270 And it's a model that allows us to represent the various different states 01:13:32.270 --> 01:13:35.930 that an agent can be in, the various different actions that they can take, 01:13:35.930 --> 01:13:40.400 and also, what the reward is for taking one action as opposed 01:13:40.400 --> 01:13:42.180 to another action. 01:13:42.180 --> 01:13:44.610 So what then does that actually look like? 01:13:44.610 --> 01:13:47.590 Well, if you recall, a Markov chain from before, 01:13:47.590 --> 01:13:50.270 a Markov chain looked a little something like this. 01:13:50.270 --> 01:13:53.600 Where we had a whole bunch of these individual states, and each state 01:13:53.600 --> 01:13:55.820 immediately transitioned to another state 01:13:55.820 --> 01:13:57.890 based on some probability distribution. 01:13:57.890 --> 01:14:01.070 We saw this in the context of the weather before, where if it was sunny, 01:14:01.070 --> 01:14:03.770 we said with some probability, it will be sunny the next day. 01:14:03.770 --> 01:14:06.920 With some other probability, it'll be rainy, for example. 01:14:06.920 --> 01:14:09.340 But we could also imagine generalizing this. 01:14:09.340 --> 01:14:11.060 It's not just sun and rain anymore. 01:14:11.060 --> 01:14:14.210 We just have these states, where one state leads to another state 01:14:14.210 --> 01:14:16.860 according to some probability distribution. 01:14:16.860 --> 01:14:19.340 But in this original model, there was no agent 01:14:19.340 --> 01:14:21.470 that had any control over this process. 01:14:21.470 --> 01:14:24.740 It was just entirely probability-based, where with some probability, 01:14:24.740 --> 01:14:26.540 we moved to this next state, but maybe it's 01:14:26.540 --> 01:14:29.230 going to be some other state with some other probability. 
01:14:29.230 --> 01:14:33.320 What we'll now have is the ability for the agent in this state 01:14:33.320 --> 01:14:37.190 to choose from a set of actions, where maybe instead of just one path forward, 01:14:37.190 --> 01:14:39.200 they have three different choices of actions 01:14:39.200 --> 01:14:41.180 that each lead them down different paths. 01:14:41.180 --> 01:14:43.340 And even this is a bit of an oversimplification, 01:14:43.340 --> 01:14:46.340 because in each of these states, you might imagine more branching points 01:14:46.340 --> 01:14:49.200 were there more decisions that can be taken as well. 01:14:49.200 --> 01:14:53.420 So we've extended the Markov chain to say that from a state, 01:14:53.420 --> 01:14:55.400 you now have available action choices. 01:14:55.400 --> 01:14:57.830 And each of those actions might be associated 01:14:57.830 --> 01:15:02.960 with its own probability distribution of going to various different states. 01:15:02.960 --> 01:15:05.870 Then in addition, we'll add another extension, 01:15:05.870 --> 01:15:08.180 where any time you move from a state taking 01:15:08.180 --> 01:15:10.640 an action going into this other state, we 01:15:10.640 --> 01:15:14.060 can associate a reward with that outcome, 01:15:14.060 --> 01:15:17.150 saying either r is positive, meaning some positive reward, 01:15:17.150 --> 01:15:20.460 or r is negative, meaning there were some sort of punishment. 01:15:20.460 --> 01:15:23.510 And this then is what we'll consider to be a Markov decision process. 01:15:23.510 --> 01:15:27.500 That a Markov decision process has some initial set of states in the world 01:15:27.500 --> 01:15:28.670 that we can be in. 01:15:28.670 --> 01:15:31.640 We have some set of actions that given a state, 01:15:31.640 --> 01:15:34.430 I can say what are the actions that are available to me 01:15:34.430 --> 01:15:37.730 in that state, an action that I can choose from. 01:15:37.730 --> 01:15:39.550 Then we have some transition model. 01:15:39.550 --> 01:15:41.710 The transition model before just said that 01:15:41.710 --> 01:15:44.740 given my current state, what is the probability that I end up 01:15:44.740 --> 01:15:46.930 in that next state or this other state. 01:15:46.930 --> 01:15:51.130 The transition model now has effectively two things we're conditioning on. 01:15:51.130 --> 01:15:53.260 We're saying, given that I'm in this state 01:15:53.260 --> 01:15:56.500 and that I take this action, what's the probability 01:15:56.500 --> 01:15:59.320 that I end up in this next state? 01:15:59.320 --> 01:16:02.050 Now maybe we live in a very deterministic world in this Markov 01:16:02.050 --> 01:16:05.170 decision process, where given a state and given an action, 01:16:05.170 --> 01:16:07.383 we know for sure what next state we'll end up in. 01:16:07.383 --> 01:16:10.550 But maybe there's some randomness in the world that when you take in a state 01:16:10.550 --> 01:16:14.260 and you take an action, you might not always end up in the exact same state. 01:16:14.260 --> 01:16:16.840 There might be some probabilities involved there as well. 01:16:16.840 --> 01:16:21.260 The Markov decision process can handle both of those possible cases. 01:16:21.260 --> 01:16:23.830 And then finally, we have a reward function, 01:16:23.830 --> 01:16:26.230 generally called r, that in this case says, 01:16:26.230 --> 01:16:30.340 what is the reward for being in this state, taking this action, 01:16:30.340 --> 01:16:33.687 and then getting to s prime, this next state. 
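To make the pieces of a Markov decision process concrete, here is one small way such a model could be represented in code; the tiny set of states, the transition probabilities, and the reward numbers are purely illustrative assumptions, not taken from the lecture's own examples:

# Sketch of a Markov decision process as plain Python data.

# States: cells of a tiny made-up world.
states = ["A", "B", "C", "D"]

# Actions available in each state.
actions = {
    "A": ["right", "down"],
    "B": ["left", "down"],
    "C": ["right", "up"],
    "D": [],  # assume D is a terminal goal state
}

# Transition model: P(s' | s, a), a distribution over next states
# given the current state and the chosen action.
transition = {
    ("A", "right"): {"B": 0.9, "A": 0.1},  # usually succeeds, sometimes slips
    ("A", "down"):  {"C": 1.0},
    ("B", "left"):  {"A": 1.0},
    ("B", "down"):  {"D": 1.0},
    ("C", "right"): {"D": 1.0},
    ("C", "up"):    {"A": 1.0},
}

# Reward function: R(s, a, s'), the reward for taking action a in state s
# and ending up in state s'.
reward = {
    ("B", "down", "D"): 1,     # reaching the goal is rewarded
    ("C", "right", "D"): 1,
    ("A", "right", "B"): 0,
    ("A", "right", "A"): 0,
    ("A", "down", "C"): 0,
    ("B", "left", "A"): 0,
    ("C", "up", "A"): 0,
}

With a representation like this, taking an action amounts to sampling a next state from the transition distribution and looking up the corresponding reward.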
01:16:33.687 --> 01:16:35.770 So I'm in this original state, I take this action, 01:16:35.770 --> 01:16:39.768 I get to this next state, what is the reward for doing that process? 01:16:39.768 --> 01:16:41.560 You can add up these rewards every time you 01:16:41.560 --> 01:16:44.440 take an action to get the total amount of rewards 01:16:44.440 --> 01:16:48.940 that an agent might get from interacting in a particular environment modeled 01:16:48.940 --> 01:16:51.170 using this Markov decision process. 01:16:51.170 --> 01:16:53.740 So what might this actually look like in practice? 01:16:53.740 --> 01:16:56.290 Well, let's just create a little simulated world here 01:16:56.290 --> 01:16:59.260 where I have this agent that is just trying to navigate its way-- 01:16:59.260 --> 01:17:02.350 this agent is this yellow dot here like a robot in the world trying 01:17:02.350 --> 01:17:04.240 to navigate its way through this grid. 01:17:04.240 --> 01:17:07.140 And ultimately, it's trying to find its way to the goal. 01:17:07.140 --> 01:17:11.230 And if it gets to the green goal, then it's going to get some sort of reward. 01:17:11.230 --> 01:17:14.795 But then we might also have some red squares 01:17:14.795 --> 01:17:17.920 that are places where you get some sort of punishment, some bad place where 01:17:17.920 --> 01:17:19.480 we don't want the agent to go. 01:17:19.480 --> 01:17:22.030 And if it ends up in the red square, then our agent 01:17:22.030 --> 01:17:25.400 is going to get some sort of punishment as a result of that. 01:17:25.400 --> 01:17:28.610 But the agent that originally doesn't know all of these details. 01:17:28.610 --> 01:17:31.360 It doesn't know that these states are associated with punishments, 01:17:31.360 --> 01:17:34.210 but maybe it does know that the state is associated with a reward-- 01:17:34.210 --> 01:17:35.200 maybe it doesn't. 01:17:35.200 --> 01:17:37.750 But it just needs to sort of interact with the environment 01:17:37.750 --> 01:17:41.150 to try and figure out what to do and what not to do. 01:17:41.150 --> 01:17:44.377 So the first thing the agent might do is given no additional information, 01:17:44.377 --> 01:17:46.210 if it doesn't know what the punishments are, 01:17:46.210 --> 01:17:50.030 it doesn't know where the rewards are, it just might try and take an action. 01:17:50.030 --> 01:17:52.690 And it takes an action and ends up realizing 01:17:52.690 --> 01:17:54.710 that he's got some sort of punishment. 01:17:54.710 --> 01:17:56.930 And so what does it learn from that experience? 01:17:56.930 --> 01:18:00.550 Well, it might learn that when you're in this state in the future 01:18:00.550 --> 01:18:02.790 don't take the action, move to the right-- 01:18:02.790 --> 01:18:04.213 that that is a bad action to take. 01:18:04.213 --> 01:18:06.880 That in the future, if you ever find yourself back in the state, 01:18:06.880 --> 01:18:09.430 don't take this action of going to the right when 01:18:09.430 --> 01:18:12.180 you're in this particular state, because that leads to punishment. 01:18:12.180 --> 01:18:13.763 That might be the intuition, at least. 01:18:13.763 --> 01:18:15.610 And so you could try doing other actions. 01:18:15.610 --> 01:18:16.300 You move up. 01:18:16.300 --> 01:18:18.508 All right, that didn't lead to any immediate rewards, 01:18:18.508 --> 01:18:21.830 maybe try something else, then maybe try something else. 01:18:21.830 --> 01:18:23.913 And now you found that you got another punishment. 
01:18:23.913 --> 01:18:25.913 And so you learn something from that experience. 01:18:25.913 --> 01:18:27.850 So the next time you do this whole process, 01:18:27.850 --> 01:18:30.010 you know that if you ever end up in this square, 01:18:30.010 --> 01:18:33.070 you shouldn't take the down action, because being in this state 01:18:33.070 --> 01:18:37.840 and taking that action ultimately leads to some sort of punishment, 01:18:37.840 --> 01:18:40.070 a negative reward, in other words. 01:18:40.070 --> 01:18:41.170 And this process repeats. 01:18:41.170 --> 01:18:44.230 You might imagine just letting our agent explore the world, 01:18:44.230 --> 01:18:48.220 learning over time which states and actions tend to 01:18:48.220 --> 01:18:50.980 correspond with poor outcomes, 01:18:50.980 --> 01:18:54.260 until eventually, if it tries enough things randomly, 01:18:54.260 --> 01:18:57.640 it might find that when you get to this state, 01:18:57.640 --> 01:19:01.240 if you take the up action in this state, it might find you 01:19:01.240 --> 01:19:03.160 actually get a reward from that. 01:19:03.160 --> 01:19:06.820 And what it can learn from that is that if you're in this state, 01:19:06.820 --> 01:19:09.595 you should take the up action, because that leads to a reward. 01:19:09.595 --> 01:19:12.220 And over time, you can also learn that if you're in this state, 01:19:12.220 --> 01:19:15.580 you should take the left action, because that leads to this state that also 01:19:15.580 --> 01:19:17.320 lets you eventually get to the reward. 01:19:17.320 --> 01:19:21.160 So you begin to learn over time, not only which actions 01:19:21.160 --> 01:19:25.390 are good in particular states, but also, which actions are bad, 01:19:25.390 --> 01:19:27.640 such that once you know some sequence of good actions 01:19:27.640 --> 01:19:30.910 that leads you to some sort of reward, our agent can just 01:19:30.910 --> 01:19:34.750 follow those instructions, follow the experience that it has learned. 01:19:34.750 --> 01:19:37.330 We didn't tell the agent what the goal was. 01:19:37.330 --> 01:19:39.910 We didn't tell the agent where the punishments were. 01:19:39.910 --> 01:19:42.760 But the agent can begin to learn from this experience 01:19:42.760 --> 01:19:47.850 and learn to begin to perform these sorts of tasks better in the future. 01:19:47.850 --> 01:19:51.370 And so let's now try to formalize this idea-- formalize the idea that we would 01:19:51.370 --> 01:19:54.520 like to be able to learn whether, in this state, taking this action 01:19:54.520 --> 01:19:56.257 is a good thing or a bad thing. 01:19:56.257 --> 01:19:58.840 There are lots of different models for reinforcement learning. 01:19:58.840 --> 01:20:00.757 We're just going to look at one of them today. 01:20:00.757 --> 01:20:04.330 And the one that we're going to look at is a method known as Q-learning. 01:20:04.330 --> 01:20:08.010 And what Q-learning is all about is learning a function, 01:20:08.010 --> 01:20:12.850 a function Q, that takes inputs s and a, where s is a state and a 01:20:12.850 --> 01:20:15.130 is an action that you take in that state. 01:20:15.130 --> 01:20:17.620 And what this Q function is going to do is 01:20:17.620 --> 01:20:21.520 it is going to estimate the value-- how much reward will I get 01:20:21.520 --> 01:20:25.960 from taking this action in this state.
01:20:25.960 --> 01:20:28.870 Originally, we don't know what this Q function should be, 01:20:28.870 --> 01:20:30.940 but over time, based on experience, based 01:20:30.940 --> 01:20:33.590 on trying things out and seeing what the result is, 01:20:33.590 --> 01:20:36.160 I would like to try and learn what Q of s, 01:20:36.160 --> 01:20:39.730 a is for any particular state and any particular action 01:20:39.730 --> 01:20:41.840 that I might take in that state. 01:20:41.840 --> 01:20:42.880 So what is the approach? 01:20:42.880 --> 01:20:44.890 Well, the approach originally is we'll start 01:20:44.890 --> 01:20:49.960 with Q s, a equal to 0 for all states s and for all actions a. 01:20:49.960 --> 01:20:52.270 That initially, before I've ever started anything, 01:20:52.270 --> 01:20:54.880 before I've had any experiences, I don't know 01:20:54.880 --> 01:20:57.820 the value of taking any action in any given state, 01:20:57.820 --> 01:21:02.390 so I'm going to assume that the value is 0 all across the board. 01:21:02.390 --> 01:21:06.790 But then as I interact with the world, as I experience rewards or punishments, 01:21:06.790 --> 01:21:10.450 or maybe I go to a cell where I don't get either a reward or a punishment, 01:21:10.450 --> 01:21:14.290 I want to somehow update my estimate of Q s, a. 01:21:14.290 --> 01:21:18.670 I want to continually update my estimate of Q s, a based on the experiences, 01:21:18.670 --> 01:21:20.730 and rewards, and punishments that I've received, 01:21:20.730 --> 01:21:24.240 such that in the future, my knowledge of what actions are good 01:21:24.240 --> 01:21:26.290 in what states will be better. 01:21:26.290 --> 01:21:29.280 So when we take an action and receive some sort of reward, 01:21:29.280 --> 01:21:32.750 I want to estimate the new value of Q s, a. 01:21:32.750 --> 01:21:35.390 And I estimate that based on a couple of different things. 01:21:35.390 --> 01:21:39.090 I estimate it based on the reward that I'm getting from taking this action 01:21:39.090 --> 01:21:40.830 and getting into the next state. 01:21:40.830 --> 01:21:44.130 But assuming the situation isn't over, assuming 01:21:44.130 --> 01:21:47.170 there are still future actions that I might take as well, 01:21:47.170 --> 01:21:51.670 I also need to take into account the expected future rewards. 01:21:51.670 --> 01:21:54.570 That if you imagine an agent interacting with the environment, 01:21:54.570 --> 01:21:57.000 sometimes, you'll take an action and get a reward, 01:21:57.000 --> 01:21:59.970 but then you can keep taking more actions and get more rewards. 01:21:59.970 --> 01:22:02.310 And these both are relevant-- both the current reward 01:22:02.310 --> 01:22:05.680 I'm getting from this current step, and also, my future reward. 01:22:05.680 --> 01:22:09.240 And it might be the case that I want to take a step that doesn't immediately 01:22:09.240 --> 01:22:12.180 lead to a reward, because later on down the line, 01:22:12.180 --> 01:22:14.650 I know it will lead to more rewards as well. 01:22:14.650 --> 01:22:17.550 So there's a balancing act between current rewards 01:22:17.550 --> 01:22:20.460 that the agent experiences and future rewards 01:22:20.460 --> 01:22:23.900 that the agent experiences as well. 01:22:23.900 --> 01:22:26.590 And then we need to update Q s, a. 01:22:26.590 --> 01:22:29.620 So we estimate the value of Q s, a based on the current reward 01:22:29.620 --> 01:22:31.390 and the expected future rewards.
01:22:31.390 --> 01:22:33.970 And then we need to update this Q function 01:22:33.970 --> 01:22:36.550 to take into account this new estimate. 01:22:36.550 --> 01:22:39.340 Now as we go through this process, we'll already 01:22:39.340 --> 01:22:42.160 have an estimate for what we think the value is. 01:22:42.160 --> 01:22:44.140 Now we have a new estimate and then somehow we 01:22:44.140 --> 01:22:46.570 need to combine these two estimates together. 01:22:46.570 --> 01:22:50.110 And we'll look at more formal ways that we can actually begin to do that. 01:22:50.110 --> 01:22:52.725 So to actually show you what this formula looks like, 01:22:52.725 --> 01:22:54.850 here's the approach we'll take with you Q-learning. 01:22:54.850 --> 01:22:59.500 We're going to again start with Q of s, a being equal to 0 for all states. 01:22:59.500 --> 01:23:06.850 And then every time we take an action a in state s and observe a reward r, 01:23:06.850 --> 01:23:11.230 we're going to update our value, our estimate for Q of s, a. 01:23:11.230 --> 01:23:13.780 And the idea is that we're going to figure out 01:23:13.780 --> 01:23:19.180 what the new value estimate is minus what our existing value estimate is. 01:23:19.180 --> 01:23:22.780 So we have some preconceived notion for what the value is 01:23:22.780 --> 01:23:24.500 for taking this action in this state. 01:23:24.500 --> 01:23:28.450 Maybe our expectation is we currently think the value is 10. 01:23:28.450 --> 01:23:31.540 But then we're going to estimate what we now think it's going to be. 01:23:31.540 --> 01:23:34.370 Maybe the new value estimate is something like 20. 01:23:34.370 --> 01:23:36.940 So there's a delta of, like, 10 that our new value 01:23:36.940 --> 01:23:40.390 estimate is 10 points higher than what our current value 01:23:40.390 --> 01:23:42.450 estimate happens to be. 01:23:42.450 --> 01:23:44.200 And so we have a couple of options here. 01:23:44.200 --> 01:23:49.360 We need to decide how much we want to adjust our current expectation of what 01:23:49.360 --> 01:23:52.720 the value is of taking this action in this particular state. 01:23:52.720 --> 01:23:56.620 And what that difference is-- how much we add or subtract 01:23:56.620 --> 01:23:59.800 from our existing notion of how much that we expect the value to be 01:23:59.800 --> 01:24:03.750 is dependent on this parameter alpha, also called the learning rate. 01:24:03.750 --> 01:24:08.710 And alpha represents in effect, how much we value new information compared 01:24:08.710 --> 01:24:11.770 to how much we value old information. 01:24:11.770 --> 01:24:15.500 And alpha value of 1 means we really value new information. 01:24:15.500 --> 01:24:17.590 That if we have a new estimate, then it doesn't 01:24:17.590 --> 01:24:19.240 matter what our old estimate is. 01:24:19.240 --> 01:24:21.910 We're only going to consider our new estimate, because we always 01:24:21.910 --> 01:24:25.430 just want to take into consideration our new information. 01:24:25.430 --> 01:24:29.230 So the way that works is that if you imagine alpha being 1, 01:24:29.230 --> 01:24:32.860 then we're taking the old value of Q s, a 01:24:32.860 --> 01:24:36.910 and then adding 1 times the new value minus the old value. 01:24:36.910 --> 01:24:38.990 And that just leaves us with the new value. 01:24:38.990 --> 01:24:42.010 So when alpha is 1, all we take into consideration 01:24:42.010 --> 01:24:44.650 is what our new estimate happens to be. 
01:24:44.650 --> 01:24:47.470 But over time, as we go through a lot of experiences, 01:24:47.470 --> 01:24:49.570 we already have some existing information. 01:24:49.570 --> 01:24:53.080 We might have tried taking this action nine times already, 01:24:53.080 --> 01:24:55.090 and now we just try to do a tenth time. 01:24:55.090 --> 01:24:58.370 And we don't only want to consider this 10th experience. 01:24:58.370 --> 01:25:02.020 I also want to consider the fact that my prior 9 experiences, those were 01:25:02.020 --> 01:25:02.710 meaningful, too. 01:25:02.710 --> 01:25:05.420 And that's data I don't necessarily want to lose them. 01:25:05.420 --> 01:25:08.620 And so this alpha controls that decision-- controls 01:25:08.620 --> 01:25:10.480 how important is the new information. 01:25:10.480 --> 01:25:13.570 0 would mean ignore all the new information, 01:25:13.570 --> 01:25:16.390 just keep this Q value the same. 01:25:16.390 --> 01:25:20.200 1 that means replace the old information entirely with the new information. 01:25:20.200 --> 01:25:24.772 And somewhere in between, keep some sort of balance between these two values. 01:25:24.772 --> 01:25:27.970 And we can put this equation a little bit more formally, as well. 01:25:27.970 --> 01:25:30.940 The old value estimate is our old estimate 01:25:30.940 --> 01:25:34.690 for what the value is of taking this action in a particular state. 01:25:34.690 --> 01:25:37.270 That's just Q of s, a. 01:25:37.270 --> 01:25:38.685 We have it once here. 01:25:38.685 --> 01:25:40.310 And we're going to add something to it. 01:25:40.310 --> 01:25:44.020 We're going to add alpha times the new value estimate minus the old value 01:25:44.020 --> 01:25:44.740 estimate. 01:25:44.740 --> 01:25:49.330 But the old value estimate, we just look up by calling this Q function. 01:25:49.330 --> 01:25:51.310 And what then is the new value estimate? 01:25:51.310 --> 01:25:53.530 Based on this experience we have just taken, 01:25:53.530 --> 01:25:55.870 what is our new estimate for the value of taking 01:25:55.870 --> 01:25:58.570 this action in this particular state? 01:25:58.570 --> 01:26:01.090 Well, it's going to be composed of two parts. 01:26:01.090 --> 01:26:04.540 It's going to be composed of what reward did I just get 01:26:04.540 --> 01:26:07.060 from taking this action in this state. 01:26:07.060 --> 01:26:10.390 And then it's going to be what can I expect my future rewards 01:26:10.390 --> 01:26:12.670 to be from this point forward. 01:26:12.670 --> 01:26:17.230 So it's going to be r, some reward I'm getting right now, 01:26:17.230 --> 01:26:21.443 plus whatever I estimate I'm going to get in the future. 01:26:21.443 --> 01:26:23.860 And how do I estimate what I'm going to get in the future? 01:26:23.860 --> 01:26:27.030 Well, it's a bit of another call to this Q function. 01:26:27.030 --> 01:26:31.360 It's going to be take the maximum across all possible actions I could 01:26:31.360 --> 01:26:35.800 take next and say, all right, of all of these possible actions I could take, 01:26:35.800 --> 01:26:38.560 which one is going to have the highest reward? 01:26:38.560 --> 01:26:40.840 So this then-- looks a little bit complicated-- 01:26:40.840 --> 01:26:44.710 is going to be our notion for how we're going to perform this kind of update. 
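Putting the update just described into code, a sketch might look like the following, where the Q values live in a dictionary keyed by (state, action) pairs; the dictionary representation and the default of 0 for unseen pairs are assumptions for illustration:

# Sketch of the Q-learning update:
# Q(s, a) <- Q(s, a) + alpha * ((r + max over a' of Q(s', a')) - Q(s, a))

Q = {}  # Q values start at 0 for every (state, action) pair not yet seen

def get_q(state, action):
    # Look up Q(s, a), defaulting to 0 for unseen pairs.
    return Q.get((state, action), 0)

def best_future_reward(state, available_actions):
    # Estimate of future rewards: the best Q value over actions in this state.
    if not available_actions:
        return 0  # terminal state, no future rewards
    return max(get_q(state, action) for action in available_actions)

def update(state, action, next_state, reward, next_actions, alpha=0.5):
    # Apply one Q-learning update after observing a single transition.
    old_estimate = get_q(state, action)
    new_estimate = reward + best_future_reward(next_state, next_actions)
    Q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)

With alpha equal to 1, this keeps only the new estimate; with alpha equal to 0, it ignores new information entirely, matching the discussion of the learning rate above.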
01:26:44.710 --> 01:26:48.070 I have some estimate, some old estimate, for what 01:26:48.070 --> 01:26:51.040 the value is of taking this action in the state, 01:26:51.040 --> 01:26:53.920 and I'm going to update it based on new information. 01:26:53.920 --> 01:26:56.200 That I experienced some reward, I predict 01:26:56.200 --> 01:26:58.240 what my future reward is going to be. 01:26:58.240 --> 01:27:01.510 And using that, I update what I estimate the reward 01:27:01.510 --> 01:27:04.720 will be for taking this action in this particular state. 01:27:04.720 --> 01:27:07.720 And there are other additions you might make to this algorithm, as well. 01:27:07.720 --> 01:27:10.270 Sometimes, it might not be the case that future rewards, 01:27:10.270 --> 01:27:12.820 you want to weight equally to current rewards. 01:27:12.820 --> 01:27:17.402 Maybe you want an agent that values like reward now over reward later. 01:27:17.402 --> 01:27:19.360 And so sometimes, you can even add another term 01:27:19.360 --> 01:27:23.050 in here or some other parameter, where you discount future rewards 01:27:23.050 --> 01:27:26.270 and say future rewards are not as valuable as rewards 01:27:26.270 --> 01:27:28.720 immediately-- that getting reward in the current time step 01:27:28.720 --> 01:27:31.542 is better than waiting a year and getting rewards later. 01:27:31.542 --> 01:27:33.250 But that's something up to the programmer 01:27:33.250 --> 01:27:36.290 to decide what that parameter ought to be. 01:27:36.290 --> 01:27:39.100 But the big picture idea of this entire formula 01:27:39.100 --> 01:27:42.670 is to say that every time we experience some new reward, 01:27:42.670 --> 01:27:43.900 we take that into account. 01:27:43.900 --> 01:27:47.830 We update our estimate of how good is this action. 01:27:47.830 --> 01:27:51.070 And then in the future, we can make decisions based on that algorithm. 01:27:51.070 --> 01:27:54.010 Once we have some good estimate for every state 01:27:54.010 --> 01:27:57.670 and for every action, what the value is of taking that action, 01:27:57.670 --> 01:28:01.960 then we can do something like implement a greedy decision making policy. 01:28:01.960 --> 01:28:05.200 That if I am in a state and I want to know what actions should 01:28:05.200 --> 01:28:09.880 I take in that state, then I consider for all of my possible actions, 01:28:09.880 --> 01:28:12.670 what is the value of Q s, a. 01:28:12.670 --> 01:28:16.150 What is my estimated value of taking that action in that state. 01:28:16.150 --> 01:28:18.850 And I will just pick the action that has the highest 01:28:18.850 --> 01:28:22.520 value after I evaluate that expression. 01:28:22.520 --> 01:28:24.730 So I pick the action that has the highest value. 01:28:24.730 --> 01:28:27.188 And based on that, that tells me what action I should take. 01:28:27.188 --> 01:28:30.880 At any given state that I'm in, I can just greedily say across 01:28:30.880 --> 01:28:35.050 all my actions, this action gives me the highest expected value, 01:28:35.050 --> 01:28:38.200 and so I'll go ahead and choose that action as the action 01:28:38.200 --> 01:28:40.430 that I take as well. 01:28:40.430 --> 01:28:43.270 But there is a downside to this kind of approach. 01:28:43.270 --> 01:28:45.880 And the downside comes up in a situation like this, 01:28:45.880 --> 01:28:51.170 where we know that there is some solution that gets me to the reward 01:28:51.170 --> 01:28:53.610 and our agent has been able to figure that out. 
01:28:53.610 --> 01:28:57.050 But it might not necessarily be the best way or the fastest way. 01:28:57.050 --> 01:28:59.850 If the agent is allowed to explore a little bit more, 01:28:59.850 --> 01:29:02.810 you might find that it can get the reward faster by taking 01:29:02.810 --> 01:29:06.830 some other route instead, by going through this particular path that 01:29:06.830 --> 01:29:11.360 is a faster way to get to that ultimate goal. 01:29:11.360 --> 01:29:14.730 And maybe we would like for the agent to be able to figure that out as well. 01:29:14.730 --> 01:29:16.940 But if the agent always takes the actions 01:29:16.940 --> 01:29:20.960 that it knows to be best, when it gets to this particular square, 01:29:20.960 --> 01:29:23.180 it doesn't know that this is a good action, 01:29:23.180 --> 01:29:24.770 because it's never really tried it. 01:29:24.770 --> 01:29:29.040 But it knows that going down eventually leads its way to this reward. 01:29:29.040 --> 01:29:32.270 So it might learn in the future that it should just always take this route, 01:29:32.270 --> 01:29:36.770 and it's never going to explore and go along that other route instead. 01:29:36.770 --> 01:29:39.170 So in reinforcement learning, there's this tension 01:29:39.170 --> 01:29:42.570 between exploration and exploitation. 01:29:42.570 --> 01:29:45.710 And exploitation generally refers to using knowledge 01:29:45.710 --> 01:29:47.360 that the AI already has. 01:29:47.360 --> 01:29:50.630 The AI already knows that this is a move that leads to reward, 01:29:50.630 --> 01:29:52.520 so it'll go ahead and use that move. 01:29:52.520 --> 01:29:56.390 And exploration is all about exploring other actions 01:29:56.390 --> 01:29:58.820 that we may not have explored as thoroughly before, 01:29:58.820 --> 01:30:01.970 because maybe one of these actions, even if I don't know anything about it, 01:30:01.970 --> 01:30:07.300 might lead to better rewards faster or more rewards in the future. 01:30:07.300 --> 01:30:10.160 And so an agent that only ever exploits information 01:30:10.160 --> 01:30:12.840 and never explores might be able to get reward, 01:30:12.840 --> 01:30:16.250 but it might not maximize its rewards, because it doesn't know what 01:30:16.250 --> 01:30:17.840 other possibilities are out there-- 01:30:17.840 --> 01:30:19.850 possibilities that it would only know about 01:30:19.850 --> 01:30:23.010 by taking advantage of exploration. 01:30:23.010 --> 01:30:24.710 And so how can we try and address this? 01:30:24.710 --> 01:30:28.550 Well, one possible solution is known as the epsilon-greedy algorithm, 01:30:28.550 --> 01:30:33.050 where we set epsilon equal to how often we want to just make a random move. 01:30:33.050 --> 01:30:36.750 Where occasionally, we will just make a random move in order to say, 01:30:36.750 --> 01:30:40.350 let's try to explore and see what happens. 01:30:40.350 --> 01:30:45.250 And then the logic of the algorithm will be, with probability 1 minus epsilon, 01:30:45.250 --> 01:30:47.830 choose the estimated best move. 01:30:47.830 --> 01:30:50.620 In a greedy case, we'd always choose the best move. 01:30:50.620 --> 01:30:55.360 But in epsilon-greedy, we're most of the time 01:30:55.360 --> 01:30:57.520 going to choose the best move, 01:30:57.520 --> 01:31:00.100 but sometimes, with probability epsilon, we're 01:31:00.100 --> 01:31:03.160 going to choose a random move instead.
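Here is a small sketch of how that epsilon-greedy choice might be written, assuming Q is a dictionary mapping (state, action) pairs to estimated values as in the earlier sketch:

import random

def choose_action(Q, state, available_actions, epsilon=0.1):
    # With probability epsilon, explore by picking a random action;
    # otherwise exploit by picking the action with the highest Q value.
    if random.random() < epsilon:
        return random.choice(available_actions)
    return max(available_actions, key=lambda action: Q.get((state, action), 0))

Setting epsilon to 0 recovers the purely greedy behavior, and decreasing epsilon over time is one common way to explore early and exploit later.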
01:31:03.160 --> 01:31:06.400 So every time we're faced with the ability to take an action, sometimes, 01:31:06.400 --> 01:31:07.930 we're going to choose the best move. 01:31:07.930 --> 01:31:10.900 Sometimes, we're just going to choose a random move. 01:31:10.900 --> 01:31:12.940 So this type of algorithm then can be quite 01:31:12.940 --> 01:31:16.720 powerful in a reinforcement learning context by not always just choosing 01:31:16.720 --> 01:31:20.620 the best possible move right now, but sometimes, especially early on, 01:31:20.620 --> 01:31:22.600 allowing yourself to make random moves that 01:31:22.600 --> 01:31:26.560 allow you to explore various different possible states and actions more. 01:31:26.560 --> 01:31:30.850 And maybe over time, you might decrease your value of epsilon, more and more 01:31:30.850 --> 01:31:32.720 often choosing the best mover after you are 01:31:32.720 --> 01:31:37.820 more confident that you've explored what all of the possibilities actually are. 01:31:37.820 --> 01:31:39.320 So we can put this into practice. 01:31:39.320 --> 01:31:41.830 And one very common application of reinforcement learning 01:31:41.830 --> 01:31:43.000 is in game playing. 01:31:43.000 --> 01:31:45.400 That if you want to teach an agent how to play a game, 01:31:45.400 --> 01:31:48.350 you just let the agent play the game a whole bunch. 01:31:48.350 --> 01:31:51.220 And then the reward signal happens at the end of the game. 01:31:51.220 --> 01:31:55.780 When the game is over, if our AI won the game, it gets a reward of like, 1, 01:31:55.780 --> 01:31:56.730 for example. 01:31:56.730 --> 01:32:00.100 And if it lost the game, it gets a reward of negative 1. 01:32:00.100 --> 01:32:03.108 And from that, it begins to learn what actions are good 01:32:03.108 --> 01:32:04.150 and what actions are bad. 01:32:04.150 --> 01:32:06.610 You don't have to tell the AI what's good and what's bad, 01:32:06.610 --> 01:32:08.920 but the AI figures it out based on that reward. 01:32:08.920 --> 01:32:10.450 Winning the game is some signal. 01:32:10.450 --> 01:32:12.040 Losing the game is some signal. 01:32:12.040 --> 01:32:14.320 And based on all of that, it begins to figure out 01:32:14.320 --> 01:32:16.985 what decisions it should actually make. 01:32:16.985 --> 01:32:19.360 So one very simple game, which you may have played before 01:32:19.360 --> 01:32:20.870 is a game called Nim. 01:32:20.870 --> 01:32:23.433 And in the game of Nim, you've got a whole bunch of objects 01:32:23.433 --> 01:32:25.600 in a whole bunch of different piles, where here I've 01:32:25.600 --> 01:32:27.940 represented each pile as an individual row. 01:32:27.940 --> 01:32:31.040 So you've got one object in the first pile, three in the second pile, five 01:32:31.040 --> 01:32:33.370 and the third pile, seven in the fourth pile. 01:32:33.370 --> 01:32:36.010 And the game of Nim is a two player game where players 01:32:36.010 --> 01:32:38.980 take turns removing objects from piles. 01:32:38.980 --> 01:32:41.230 And the rule is that on any given turn, you 01:32:41.230 --> 01:32:43.690 are allowed to remove as many objects as you 01:32:43.690 --> 01:32:47.162 want from any one of these piles, any one of these rows. 01:32:47.162 --> 01:32:49.120 You have to remove at least one object, but you 01:32:49.120 --> 01:32:53.890 can remove as many as you want from exactly one of the piles. 01:32:53.890 --> 01:32:57.720 And whoever takes the last object loses. 
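As a rough sketch of how these rules could be encoded (the representation here is an assumption for illustration, not necessarily the lecture's actual Nim implementation), the state can just be a list of pile sizes and an action a pair of (pile index, count):

# Sketch of the game state and legal moves in Nim.

initial_piles = [1, 3, 5, 7]  # how many objects remain in each pile

def available_actions(piles):
    # All legal moves: remove at least one object from exactly one pile.
    # Each action is a (pile_index, count) pair.
    return [(i, n) for i, pile in enumerate(piles) for n in range(1, pile + 1)]

def apply_action(piles, action):
    # Return the new pile sizes after removing count objects from pile i.
    i, count = action
    new_piles = piles.copy()
    new_piles[i] -= count
    return new_piles

def game_over(piles):
    # The game ends when no objects remain; whoever took the last object loses.
    return all(pile == 0 for pile in piles)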
01:32:57.720 --> 01:33:01.630 So player 1 might like remove four from this pile here. 01:33:01.630 --> 01:33:04.790 Player 2 might remove four from this pile here. 01:33:04.790 --> 01:33:08.140 So now we've got four piles left, one, three, one, and three Player 1 01:33:08.140 --> 01:33:11.020 might remove you know the entirety of the second pile. 01:33:11.020 --> 01:33:16.930 Player 2, if they're being strategic, might remove two from the third pile. 01:33:16.930 --> 01:33:20.140 Now we've got three piles left each with one object left. 01:33:20.140 --> 01:33:22.420 Player 1 might remove one from one pile. 01:33:22.420 --> 01:33:24.750 Player 2 removes one from the other pile. 01:33:24.750 --> 01:33:27.610 And now player 1 is left with choosing this one 01:33:27.610 --> 01:33:31.740 object from the last pile, at which point, player 1 loses the game. 01:33:31.740 --> 01:33:32.980 So fairly simple game. 01:33:32.980 --> 01:33:34.240 Piles of objects. 01:33:34.240 --> 01:33:37.300 Any turn, you choose how many objects to remove from the pile. 01:33:37.300 --> 01:33:40.330 Whoever removes the last object loses. 01:33:40.330 --> 01:33:42.730 And this is the type of game you can encode into an AI 01:33:42.730 --> 01:33:46.540 fairly easily, because the states are really just four numbers. 01:33:46.540 --> 01:33:50.120 Every state is just how many objects in each of the four piles. 01:33:50.120 --> 01:33:52.690 And the actions are things like how many am I 01:33:52.690 --> 01:33:56.110 going to remove from each one of these individual piles. 01:33:56.110 --> 01:33:58.090 And the reward happens at the end. 01:33:58.090 --> 01:34:01.000 That if you were the player that had to remove the last object, 01:34:01.000 --> 01:34:02.890 then you get some sort of punishment. 01:34:02.890 --> 01:34:06.430 But if you were not, and the other player had to remove the last object, 01:34:06.430 --> 01:34:08.830 well then you get some sort of reward. 01:34:08.830 --> 01:34:11.680 So we could actually try and show a demonstration of this-- 01:34:11.680 --> 01:34:15.570 that I have implemented an AI to play the game of Nim. 01:34:15.570 --> 01:34:17.320 All right, so here, what we're going to do 01:34:17.320 --> 01:34:22.210 is create an AI as a result of training the AI on some number of games 01:34:22.210 --> 01:34:24.160 that the AI is going to play against itself. 01:34:24.160 --> 01:34:27.060 Where the idea is the AI will play games against itself, 01:34:27.060 --> 01:34:30.800 learn from each of those experiences, and learn what to do in the future. 01:34:30.800 --> 01:34:33.340 And then I, the human, will play against the AI. 01:34:33.340 --> 01:34:35.380 So initially, we'll say train zero times, 01:34:35.380 --> 01:34:39.190 meaning we're not going to let the AI play any practice games against itself 01:34:39.190 --> 01:34:41.110 in order to learn from its experiences. 01:34:41.110 --> 01:34:43.750 We're just going to see how well it plays. 01:34:43.750 --> 01:34:45.460 And it looks like there are four piles. 01:34:45.460 --> 01:34:48.550 I can choose how many I remove from any one of the piles. 01:34:48.550 --> 01:34:53.960 So maybe from pile three, I will remove five objects, for example. 01:34:53.960 --> 01:34:57.340 So now AI chose to take one item from pile zero. 01:34:57.340 --> 01:35:00.170 So I'm left with these piles now, for example. 
01:35:00.170 --> 01:35:02.500 And so here, I could choose maybe to say I 01:35:02.500 --> 01:35:08.900 would like to remove them from pile two, all five of them, for example. 01:35:08.900 --> 01:35:11.290 And so AI chose to take two away from pile one. 01:35:11.290 --> 01:35:15.560 Now I'm left with one pile that has one object, one pile that has two objects. 01:35:15.560 --> 01:35:19.070 So from pile three, I will remove two objects. 01:35:19.070 --> 01:35:22.400 And now I've left the AI with no choice but to take that last one. 01:35:22.400 --> 01:35:24.648 And so the game is over and I was able to win. 01:35:24.648 --> 01:35:27.190 But I did so because the AI was really just playing randomly. 01:35:27.190 --> 01:35:30.130 It didn't have any prior experience that it was using in order 01:35:30.130 --> 01:35:32.410 to make these sorts of judgments. 01:35:32.410 --> 01:35:36.040 Now let the AI train itself on, like, 10,000 games. 01:35:36.040 --> 01:35:40.000 I'm going to let the AI play 10,000 games of Nim against itself. 01:35:40.000 --> 01:35:43.180 Every time it wins or loses, it's going to learn from that experience 01:35:43.180 --> 01:35:46.840 and learn in the future what to do and what not to do. 01:35:46.840 --> 01:35:49.300 So here then, I'll go ahead and run this again. 01:35:49.300 --> 01:35:52.300 And now you see the AI running through a whole bunch of training games-- 01:35:52.300 --> 01:35:54.730 10,000 training games against itself. 01:35:54.730 --> 01:35:57.530 And now it's going to let me make these sorts of decisions. 01:35:57.530 --> 01:35:59.620 So now I'm going to play against the AI. 01:35:59.620 --> 01:36:02.788 Maybe I'll remove one from pile 3. 01:36:02.788 --> 01:36:05.830 And the AI took everything from pile three, so I'm left with three piles. 01:36:05.830 --> 01:36:11.680 And I'll go ahead and from pile two, maybe remove three items. 01:36:11.680 --> 01:36:14.340 And the AI removes one item from pile zero. 01:36:14.340 --> 01:36:17.560 I'm left with two piles, each of which has two items in it. 01:36:17.560 --> 01:36:21.160 I'll remove one from pile one, I guess. 01:36:21.160 --> 01:36:24.610 And the AI took two from pile two, leaving me with no choice 01:36:24.610 --> 01:36:27.490 but to take one away from pile one. 01:36:27.490 --> 01:36:31.690 So it seems like after playing 10,000 games of Nim against itself, 01:36:31.690 --> 01:36:34.270 the AI has learned something about what states 01:36:34.270 --> 01:36:38.020 and what actions tend to be good, and has begun to learn some sort of pattern 01:36:38.020 --> 01:36:40.780 for how to predict what actions are going to be good 01:36:40.780 --> 01:36:44.290 and what actions are going to be bad in any given state. 01:36:44.290 --> 01:36:47.590 So reinforcement learning can be a very powerful technique for achieving 01:36:47.590 --> 01:36:49.150 these sorts of game playing Agents-- 01:36:49.150 --> 01:36:52.300 Agents that are able to play a game well just by learning 01:36:52.300 --> 01:36:54.970 from experience, whether that's playing against other people 01:36:54.970 --> 01:36:57.310 or by playing against itself and learning from those 01:36:57.310 --> 01:36:59.030 experiences, as well. 01:36:59.030 --> 01:37:01.900 Now Nim is a bit of an easy game to use reinforcement learning 01:37:01.900 --> 01:37:04.120 for because there are so few states. 
01:37:04.120 --> 01:37:07.120 There are only as many states as there are combinations of how many objects are 01:37:07.120 --> 01:37:09.190 in each of these various different piles. 01:37:09.190 --> 01:37:11.200 You might imagine that it's going to be harder 01:37:11.200 --> 01:37:14.990 if you think of a game like chess or games where there are many, 01:37:14.990 --> 01:37:18.430 many more states and many, many more actions that you can imagine taking, 01:37:18.430 --> 01:37:21.490 where it's not going to be as easy to learn for every state 01:37:21.490 --> 01:37:24.720 and for every action, what the value is going to be. 01:37:24.720 --> 01:37:27.130 So oftentimes in that case, we can't necessarily 01:37:27.130 --> 01:37:31.060 learn exactly what the value is for every state and for every action, 01:37:31.060 --> 01:37:32.257 but we can approximate it. 01:37:32.257 --> 01:37:34.090 So much as we saw with minimax, where we 01:37:34.090 --> 01:37:37.240 could use a depth-limiting approach to stop calculating 01:37:37.240 --> 01:37:39.910 at a certain point in time, we can do a similar type 01:37:39.910 --> 01:37:42.730 of approximation known as function approximation 01:37:42.730 --> 01:37:47.530 in a reinforcement learning context, where instead of learning a value of Q 01:37:47.530 --> 01:37:50.260 for every state and every action, we just 01:37:50.260 --> 01:37:53.260 have some function that estimates what the value is 01:37:53.260 --> 01:37:56.770 for taking this action in this particular state that might be based 01:37:56.770 --> 01:38:02.390 on various different features of the state that the agent happens to be in. 01:38:02.390 --> 01:38:04.480 Where you might have to choose what those features 01:38:04.480 --> 01:38:07.990 actually are, but you can begin to learn some patterns that 01:38:07.990 --> 01:38:11.763 generalize beyond one specific state and one specific action, 01:38:11.763 --> 01:38:13.930 that you can begin to learn if certain features tend 01:38:13.930 --> 01:38:15.580 to be good things or bad things. 01:38:15.580 --> 01:38:18.850 Reinforcement learning can allow you, using a very similar mechanism, 01:38:18.850 --> 01:38:20.800 to generalize beyond one particular state 01:38:20.800 --> 01:38:24.220 and say, if this other state looks kind of like this state, 01:38:24.220 --> 01:38:27.130 then maybe the similar types of actions that worked in one state 01:38:27.130 --> 01:38:30.150 will also work in another state as well. 01:38:30.150 --> 01:38:32.410 And so this type of approach can be quite helpful 01:38:32.410 --> 01:38:34.750 as you begin to deal with reinforcement learning that 01:38:34.750 --> 01:38:37.870 exists in larger and larger state spaces, where it's just not 01:38:37.870 --> 01:38:43.240 feasible to explore all of the possible states that could actually exist. 01:38:43.240 --> 01:38:46.540 So those then are two of the main categories of machine learning. 01:38:46.540 --> 01:38:49.110 Supervised learning, where you have labeled input and output 01:38:49.110 --> 01:38:52.920 pairs, and reinforcement learning, where an agent learns from rewards 01:38:52.920 --> 01:38:54.540 or punishments that it receives. 01:38:54.540 --> 01:38:56.700 The third major category of machine learning 01:38:56.700 --> 01:39:00.450 that we'll just touch on briefly is known as unsupervised learning. 01:39:00.450 --> 01:39:03.540 And unsupervised learning happens when we have data 01:39:03.540 --> 01:39:06.330 without any additional feedback, without labels.
01:39:06.330 --> 01:39:09.810 That in the supervised learning case, all of our data had labels. 01:39:09.810 --> 01:39:13.560 We labeled a data point with whether that was a rainy day or not rainy day. 01:39:13.560 --> 01:39:16.590 And using those labels, we were able to infer what the pattern was. 01:39:16.590 --> 01:39:20.220 Or we labeled data as a counterfeit banknote or not a counterfeit, 01:39:20.220 --> 01:39:23.760 and using those labels, we were able to draw inferences and patterns 01:39:23.760 --> 01:39:27.870 to figure out what does a banknote look like versus not. 01:39:27.870 --> 01:39:32.410 In unsupervised learning, we don't have any access to any of those labels, 01:39:32.410 --> 01:39:35.500 but we still would like to learn some of those patterns. 01:39:35.500 --> 01:39:39.000 And one of the tasks that you might want to perform in unsupervised learning 01:39:39.000 --> 01:39:42.600 is something like clustering, where clustering is just the task of, given 01:39:42.600 --> 01:39:47.310 some set of objects, organizing them into distinct clusters, groups of objects 01:39:47.310 --> 01:39:49.330 that are similar to one another. 01:39:49.330 --> 01:39:51.510 And there are lots of applications for clustering. 01:39:51.510 --> 01:39:53.340 It comes up in genetic research, where you 01:39:53.340 --> 01:39:55.740 might have a whole bunch of different genes, 01:39:55.740 --> 01:39:57.900 and you want to cluster them into similar genes 01:39:57.900 --> 01:40:01.530 if you're trying to analyze them across a population or across species. 01:40:01.530 --> 01:40:04.530 It comes up with images, if you want to take all the pixels of an image 01:40:04.530 --> 01:40:06.570 and cluster them into different parts of the image. 01:40:06.570 --> 01:40:09.060 It comes up a lot in market research, if you 01:40:09.060 --> 01:40:11.340 want to divide your consumers into different groups 01:40:11.340 --> 01:40:14.160 so you know which groups to target with certain types of product 01:40:14.160 --> 01:40:15.690 advertisements, for example. 01:40:15.690 --> 01:40:18.840 And there are a number of other contexts as well in which clustering 01:40:18.840 --> 01:40:20.310 can be very applicable. 01:40:20.310 --> 01:40:24.900 One technique for clustering is an algorithm known as k-means clustering. 01:40:24.900 --> 01:40:27.270 And what k-means clustering is going to do 01:40:27.270 --> 01:40:31.830 is divide all of our data points into k different clusters, 01:40:31.830 --> 01:40:35.700 and it's going to do so by repeating this process of assigning points 01:40:35.700 --> 01:40:39.720 to clusters, and then moving around those cluster centers. 01:40:39.720 --> 01:40:43.680 We're going to define a cluster by its center, the middle of the cluster, 01:40:43.680 --> 01:40:46.830 and then assign points to that cluster based on which 01:40:46.830 --> 01:40:49.430 center is closest to that point. 01:40:49.430 --> 01:40:51.660 And I'll show you an example of that now. 01:40:51.660 --> 01:40:55.020 Here, for example, I have a whole bunch of unlabeled data-- 01:40:55.020 --> 01:40:58.920 just various data points that are in some sort of graphical space. 01:40:58.920 --> 01:41:02.610 And I would like to group them into various different clusters. 01:41:02.610 --> 01:41:04.500 But I don't know how to do that originally. 01:41:04.500 --> 01:41:07.500 And let's say I want to assign three clusters to this group.
01:41:07.500 --> 01:41:10.500 And you have to choose how many clusters you want in k-means clustering, 01:41:10.500 --> 01:41:13.860 but you could try multiple values and see how well each of them performs. 01:41:13.860 --> 01:41:17.070 But I'll start just by randomly picking some places 01:41:17.070 --> 01:41:19.060 to put the centers of those clusters. 01:41:19.060 --> 01:41:22.735 Maybe I have a blue cluster, a red cluster, and a green cluster. 01:41:22.735 --> 01:41:25.110 And I'm going to start with the centers of those clusters 01:41:25.110 --> 01:41:27.660 just being in these three locations here. 01:41:27.660 --> 01:41:30.090 And what k-means clustering tells us to do 01:41:30.090 --> 01:41:32.790 is once I have the centers of the clusters, 01:41:32.790 --> 01:41:39.510 assign every point to a cluster based on which cluster center it is closest to. 01:41:39.510 --> 01:41:42.990 So we end up with something like this, where all of these points 01:41:42.990 --> 01:41:47.310 are closer to the blue cluster center than any other cluster center. 01:41:47.310 --> 01:41:50.970 All of these points here are closer to the green cluster 01:41:50.970 --> 01:41:52.860 center than any other cluster center. 01:41:52.860 --> 01:41:55.410 And then these two points plus these points over here, 01:41:55.410 --> 01:42:00.370 those are all closest to the red cluster center instead. 01:42:00.370 --> 01:42:04.050 So here then is one possible assignment of all these points 01:42:04.050 --> 01:42:05.910 to three different clusters. 01:42:05.910 --> 01:42:06.895 But it's not great. 01:42:06.895 --> 01:42:10.020 It seems like in this red cluster, these points are kind of far apart, 01:42:10.020 --> 01:42:12.870 and in this green cluster, these points are kind of far apart. 01:42:12.870 --> 01:42:15.810 It might not be my ideal choice of how I would cluster 01:42:15.810 --> 01:42:17.670 these various different data points. 01:42:17.670 --> 01:42:21.960 But k-means clustering is an iterative process, so after I do this, 01:42:21.960 --> 01:42:25.890 there is a next step, which is that after I've assigned all of the points 01:42:25.890 --> 01:42:28.380 to the cluster center that they are nearest to, 01:42:28.380 --> 01:42:31.350 we are going to recenter the clusters. 01:42:31.350 --> 01:42:33.420 Meaning, take the cluster centers, these diamond 01:42:33.420 --> 01:42:37.500 shapes here, and move them to the middle or the average, 01:42:37.500 --> 01:42:41.040 effectively, of all of the points that are in that cluster. 01:42:41.040 --> 01:42:43.170 So we'll take this blue point, this blue center, 01:42:43.170 --> 01:42:46.812 and go ahead and move it to the middle or to the center of all 01:42:46.812 --> 01:42:49.020 of the points that were assigned to the blue cluster, 01:42:49.020 --> 01:42:51.070 moving it slightly to the right in this case. 01:42:51.070 --> 01:42:52.570 And we'll do the same thing for red. 01:42:52.570 --> 01:42:56.910 We'll move the cluster center to the middle of all of these points, 01:42:56.910 --> 01:42:58.560 weighted by how many points there are. 01:42:58.560 --> 01:43:01.800 There are more points over here, so the red center 01:43:01.800 --> 01:43:03.810 ends up moving a little bit further that way. 01:43:03.810 --> 01:43:06.352 And likewise for the green center, there are many more points 01:43:06.352 --> 01:43:09.000 on this side of the green center, so the green center 01:43:09.000 --> 01:43:13.240 ends up being pulled a little bit further in this direction.
01:43:13.240 --> 01:43:15.808 So we recenter all of the clusters. 01:43:15.808 --> 01:43:17.100 And then we repeat the process. 01:43:17.100 --> 01:43:21.540 We go ahead and now reassign all of the points to the cluster center 01:43:21.540 --> 01:43:23.230 that they are now closest to. 01:43:23.230 --> 01:43:25.650 And now that we've moved around the cluster centers, 01:43:25.650 --> 01:43:27.690 these cluster assignments might change. 01:43:27.690 --> 01:43:30.900 This point originally was closer to the red cluster center, 01:43:30.900 --> 01:43:34.030 but now it's actually closer to the blue cluster center. 01:43:34.030 --> 01:43:35.580 Same goes for this point as well. 01:43:35.580 --> 01:43:38.700 And these three points that were originally closer to the green cluster 01:43:38.700 --> 01:43:43.670 center are now closer to the red cluster center instead. 01:43:43.670 --> 01:43:48.380 So we can reassign which colors, or which clusters, each of these data points 01:43:48.380 --> 01:43:49.620 belongs to. 01:43:49.620 --> 01:43:51.560 And then repeat the process again, moving 01:43:51.560 --> 01:43:54.590 each of these cluster means, the middle of the clusters, 01:43:54.590 --> 01:43:59.810 to the mean, the average, of all of the points that happen to be there. 01:43:59.810 --> 01:44:01.400 And repeat the process again. 01:44:01.400 --> 01:44:04.130 Go ahead and assign each of the points to the cluster 01:44:04.130 --> 01:44:05.510 that they are closest to. 01:44:05.510 --> 01:44:08.690 So once we reach a point where we've assigned all the points 01:44:08.690 --> 01:44:12.190 to the cluster that they are nearest to, and nothing has changed, 01:44:12.190 --> 01:44:15.650 we've reached a sort of equilibrium in this situation, where no points are 01:44:15.650 --> 01:44:17.030 changing their allegiance. 01:44:17.030 --> 01:44:19.850 And as a result, we can declare this algorithm is now over. 01:44:19.850 --> 01:44:22.910 And we now have some assignment of each of these points 01:44:22.910 --> 01:44:24.230 into three different clusters. 01:44:24.230 --> 01:44:26.090 And it looks like we did a pretty good job 01:44:26.090 --> 01:44:29.960 of trying to identify which points are more similar to one another 01:44:29.960 --> 01:44:31.700 than they are to points in other groups. 01:44:31.700 --> 01:44:34.970 So we have the green cluster down here, this blue cluster here, 01:44:34.970 --> 01:44:37.650 and then this red cluster over there as well. 01:44:37.650 --> 01:44:40.490 And we did so without any access to some labels 01:44:40.490 --> 01:44:43.040 to tell us what these various different clusters were. 01:44:43.040 --> 01:44:45.890 We just used an algorithm in an unsupervised sense, 01:44:45.890 --> 01:44:48.470 without any of those labels, to figure out which 01:44:48.470 --> 01:44:50.820 points belonged to which categories. 01:44:50.820 --> 01:44:54.663 And again, there are lots of applications for this type of clustering technique. 01:44:54.663 --> 01:44:57.830 And there are many more algorithms in each of these various different fields 01:44:57.830 --> 01:45:01.310 within machine learning-- supervised, and reinforcement, and unsupervised. 01:45:01.310 --> 01:45:04.490 But those are many of the big-picture foundational ideas that 01:45:04.490 --> 01:45:06.408 underlie a lot of these techniques-- 01:45:06.408 --> 01:45:08.700 that these are the problems that we're trying to solve. 01:45:08.700 --> 01:45:11.810 And we try and solve those problems using a number of different methods.
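To make the k-means procedure just described a bit more concrete, here is a minimal sketch, not code from this lecture, of how it might be implemented for two-dimensional data, assuming each point is just an (x, y) tuple and the starting centers are picked at random from the data:

```python
import random

def k_means(points, k, max_iterations=100):
    """Sketch of k-means clustering on a list of (x, y) points."""

    def distance(p, q):
        # Squared Euclidean distance is enough for deciding which center is closest
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def mean(cluster):
        # The average (x, y) position of all points in the cluster
        return (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))

    # Start by picking k of the points at random as the initial centers
    centers = random.sample(points, k)
    assignments = None

    for _ in range(max_iterations):
        # Assign every point to the center it is closest to
        new_assignments = [
            min(range(k), key=lambda i: distance(point, centers[i]))
            for point in points
        ]

        # If no point changed its cluster, we've reached equilibrium: stop
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Recenter each cluster at the average of the points assigned to it
        for i in range(k):
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:
                centers[i] = mean(cluster)

    return assignments, centers
```

Calling something like k_means(data_points, k=3) would mirror the three-cluster example above. In practice, you might rerun this with different random starting centers, or with different values of k, since the starting positions can affect which equilibrium the algorithm settles into.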
01:45:11.810 --> 01:45:14.733 We try to take data and learn patterns in that data, 01:45:14.733 --> 01:45:17.150 whether that's trying to find neighboring data points that 01:45:17.150 --> 01:45:19.790 are similar or trying to minimize some sort of loss 01:45:19.790 --> 01:45:22.460 function, or any number of other techniques 01:45:22.460 --> 01:45:26.420 that allow us to begin to try to solve these sorts of problems. 01:45:26.420 --> 01:45:28.430 That then was a look at some of the principles 01:45:28.430 --> 01:45:30.410 that are at the foundation of modern machine 01:45:30.410 --> 01:45:33.930 learning-- this ability to take data and learn from that data 01:45:33.930 --> 01:45:37.160 so that the computer can perform a task, even if it hasn't explicitly been 01:45:37.160 --> 01:45:39.508 given instructions in order to do so. 01:45:39.508 --> 01:45:41.300 Next time, we'll continue this conversation 01:45:41.300 --> 01:45:43.842 about machine learning, looking at other techniques we can use 01:45:43.842 --> 01:45:45.780 for solving these sorts of problems. 01:45:45.780 --> 01:45:47.500 We'll see you then.