[MUSIC PLAYING]

BRIAN YU: All right, welcome back, everyone, to An Introduction to Artificial Intelligence with Python. So far in this class, we've used AI to solve a number of different problems: giving the AI instructions for how to search for a solution, or how to satisfy certain constraints, in order to find its way from some input point to some output point and solve some sort of problem. Today, we're going to turn to the world of learning, in particular the idea of machine learning, which generally refers to the idea that we are not going to give the computer explicit instructions for how to perform a task, but rather, we are going to give the computer access to information in the form of data or patterns that it can learn from, and let the computer try to figure out what those patterns are, try to understand that data, so that it can perform a task on its own.

Now, machine learning comes in a number of different forms, and it's a very wide field. So today, we'll explore some of the foundational algorithms and ideas that are behind a lot of the different areas within machine learning. And one of the most popular is the idea of supervised machine learning, or just supervised learning. Supervised learning is a particular type of task. It refers to the task where we give the computer access to a data set, where that data set consists of input-output pairs. And what we would like the computer to do is figure out some function that maps inputs to outputs.

So we have a whole bunch of data that generally consists of some kind of input: some evidence, some information that the computer will have access to. And we would like the computer, based on that input information, to predict what some output is going to be. We'll give it some data so that the computer can train its model on that data, begin to understand how this information works, and how the inputs and outputs relate to each other. Ultimately, we hope that our computer will be able to figure out some function that, given those inputs, is able to produce those outputs.
There are a couple of different tasks within supervised learning. The one we'll focus on and start with is known as classification. Classification is the problem where, if I give you a whole bunch of inputs, you need to figure out some way to map those inputs into discrete categories, where you can decide what those categories are, and it's the job of the computer to predict what those categories are going to be. That might be, for example, that I give you information about a banknote, like a US dollar, and I'm asking you to predict for me: does it belong to the category of authentic banknotes, or does it belong to the category of counterfeit banknotes? You need to categorize the input, and we want to train the computer to figure out some function to be able to do that calculation.

Another example might be the case of weather, something we've talked about a little bit so far in this class, where we would like to predict, on a given day, is it going to rain on that day, is it going to be cloudy on that day. Before, we've seen how we could do this if we give the computer all the exact probabilities: if these are the conditions, what's the probability of rain. Oftentimes, we don't have access to that information, though. But what we do have access to is a whole bunch of data. So if we wanted to be able to predict something like is it going to rain or is it not going to rain, we would give the computer historical information about days when it was raining and days when it was not raining, and ask the computer to look for patterns in that data.

So what might that data look like? Well, we could structure that data in a table like this, where for any particular day going back, we have information like that day's humidity and that day's air pressure. And then, importantly, we have a label, something where a human has said that on this particular day it was raining or it was not raining. You could fill in this table with a whole bunch of data. And what makes this what we would call a supervised learning exercise is that a human has gone in and labeled each of these data points.
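To make that table concrete, here's a minimal sketch of how such a labeled data set might be represented in Python. The specific numbers and field names are made up for illustration; they just mirror the kind of humidity/pressure/label table described above.

```python
# Each training example pairs an input (humidity, pressure) with a
# human-provided label ("rain" or "no rain"). All values are illustrative.
data = [
    {"humidity": 0.93, "pressure": 999.7, "label": "rain"},
    {"humidity": 0.49, "pressure": 1015.5, "label": "no rain"},
    {"humidity": 0.79, "pressure": 1007.1, "label": "rain"},
    {"humidity": 0.56, "pressure": 1021.8, "label": "no rain"},
]

# Separate the evidence (inputs) from the labels (outputs).
inputs = [(row["humidity"], row["pressure"]) for row in data]
labels = [row["label"] for row in data]
```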
The human has said that on this day, when these were the values for the humidity and pressure, that day was a rainy day, and this day was a not rainy day. And what we would like the computer to be able to do, then, is to figure out, given these inputs, given the humidity and the pressure, can the computer predict what label should be associated with that day? Does that day look more like it's going to be a day that rains, or does it look more like a day when it's not going to rain?

Put a little bit more mathematically, you can think of this as a function that takes two inputs, the inputs being the data points that our computer will have access to, things like humidity and pressure. So we could write a function, f, that takes as input both humidity and pressure. And then the output is going to be what category we would ascribe to these particular input points, what label we would associate with that input. We've seen a couple of example data points here, where given this value for humidity and this value for pressure, we predict is it going to rain or is it not going to rain. And that's information that we just gathered from the world. We measured on various different days what the humidity and pressure were, and we observed whether or not we saw rain on that particular day.

And this function, f, is what we would like to approximate. Now, the computer, and we humans, don't really know exactly how this function f works; it's probably quite a complex function. So what we're going to do instead is attempt to estimate it. We would like to come up with a hypothesis function, h, which is going to try to approximate what f does. We want to come up with some function h that will take the same inputs and will also produce an output, rain or no rain. And ideally, we'd like these two functions to agree as much as possible. So the goal of these supervised learning classification tasks is going to be to figure out what that function h looks like: how can we begin to estimate, given all of this information, all of this data, what category or what label should be assigned to a particular data point? So where can you begin doing this?
Well, a reasonable thing to do, especially in this situation where I have two numerical values, is to try to plot this on a graph that has two axes, an x-axis and a y-axis. In this case, we're just going to be using two numerical values as input, but these same types of ideas scale as you add more and more inputs as well. We'll be plotting things in two dimensions, but as we'll soon see, you could add more inputs and just imagine things in multiple dimensions. And while we humans have trouble conceptualizing anything really beyond three dimensions, at least visually, a computer has no problem trying to imagine things in many, many more dimensions. For a computer, each dimension is just some separate number that it's keeping track of. So it wouldn't be unreasonable for a computer to think in 10 dimensions or 100 dimensions to be able to try to solve a problem.

But for now, we've got two inputs, so we'll graph things along two axes: an x-axis, which will here represent humidity, and a y-axis, which here represents pressure. And what we might do is say, let's take all of the days that were raining, plot them on this graph, and see where they fall. Here might be all of the rainy days, where each rainy day is one of these blue dots that corresponds to a particular value for humidity and a particular value for pressure. And then I might do the same thing with the days that were not raining. So I take all the not rainy days, figure out what their values were for each of these two inputs, and go ahead and plot them on this graph as well. I've here plotted them in red. So blue here stands for a rainy day, red here stands for a not rainy day. And this, then, is the input: my computer has access to all of this input.
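Here's a quick sketch of that plotting step, assuming matplotlib is available; the (humidity, pressure) values are placeholders standing in for the real historical observations.

```python
import matplotlib.pyplot as plt

# Hypothetical observations: blue dots for rainy days, red dots for not rainy days.
rainy = [(0.93, 999.7), (0.88, 1002.3), (0.79, 1007.1)]
not_rainy = [(0.49, 1015.5), (0.56, 1021.8), (0.61, 1013.2)]

plt.scatter([h for h, p in rainy], [p for h, p in rainy], color="blue", label="rain")
plt.scatter([h for h, p in not_rainy], [p for h, p in not_rainy], color="red", label="no rain")
plt.xlabel("humidity")
plt.ylabel("pressure")
plt.legend()
plt.show()
```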
And what I would like the computer to be able to do is to train a model such that if I'm ever presented with a new input that doesn't have a label associated with it, something like this white dot here, I would like to predict, given those values for each of the two inputs, should we classify it as a blue dot, a rainy day, or should we classify it as a red dot, a not rainy day? And if you're just looking at this picture graphically, trying to say, all right, this white dot, does it look like it belongs to the blue category or does it look like it belongs to the red category, I think most people would agree that it probably belongs to the blue category. And why is that? Well, it looks like it's close to other blue dots. That's not a very formal notion, but it's the notion that we'll formalize in just a moment: because it seems to be close to this blue dot here, and nothing else is closer to it, we might say that it should be categorized as blue. It should fall into that category of, I think that day is going to be a rainy day based on that input. It might not be totally accurate, but it's a pretty good guess.

And this type of algorithm is actually a very popular and common machine learning algorithm known as nearest neighbor classification. It's an algorithm for solving these classification type problems. In nearest neighbor classification, given an input, we choose the class of the nearest data point to that input. By class, we just mean category, like rain or no rain, counterfeit or not counterfeit. And we choose the category, or the class, based on the nearest data point. So given all that data we just looked at, is the nearest data point a blue point or is it a red point? And depending on the answer to that question, we're able to make some sort of judgment. We're able to say something like, we think it's going to be blue, or we think it's going to be red. Likewise, we could apply this to other data points that we encounter as well. If suddenly this data point comes about, well, its nearest data point is red, so we would go ahead and classify this as a red point, not raining.
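Here's a minimal from-scratch sketch of that nearest neighbor idea, reusing the `inputs` and `labels` lists from the earlier sketch; using Euclidean distance is an assumption, since any reasonable distance measure would do.

```python
import math

def distance(a, b):
    # Euclidean distance between two (humidity, pressure) points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(point, inputs, labels):
    # Return the label of the single closest training example to `point`.
    closest = min(range(len(inputs)), key=lambda i: distance(point, inputs[i]))
    return labels[closest]

# For example: nearest_neighbor((0.85, 1003.0), inputs, labels)
```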
Things get a little bit trickier, though, when you look at a point like this white point over here and you ask the same sort of question: should it belong to the category of blue points, the rainy days, or should it belong to the category of red points, the not rainy days? Now, nearest neighbor classification would say the way you solve this problem is to look at whichever point it is nearest to. You look at this nearest point and say it's red, it's a not rainy day. And therefore, according to nearest neighbor classification, I would say that this unlabeled point should also be red. It should also be classified as a not rainy day.

Your intuition might say that that's a reasonable judgment to make: the closest thing is a not rainy day, so we may as well guess that it's a not rainy day. But it's probably also reasonable to look at the bigger picture of things and to say, yes, it is true that the nearest point to it was a red point, but it's surrounded by a whole bunch of other blue points. So looking at the bigger picture, there is potentially an argument to be made that this point should actually be blue. And with only this data, we actually don't know for sure. We are given some inputs, something we're trying to predict, and we don't necessarily know what the output is going to be. So in this case, which one is correct is difficult to say. But oftentimes, considering more than just a single neighbor, considering multiple neighbors, can give us a better result.

And so there's a variant on the nearest neighbor classification algorithm that is known as the k-nearest-neighbor classification algorithm, where k is some parameter, some number that we choose, for how many neighbors we are going to look at. So one-nearest-neighbor classification is what we saw before: just pick the one nearest neighbor and use that category. But k-nearest-neighbor classification, where k might be three or five or seven, meaning look at the three or five or seven closest neighbors, the closest data points to that point, works a little bit differently. In this algorithm, we're given an input, and we choose the most common class out of the k nearest data points to that input.
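And here's a sketch of the k-nearest-neighbor variant, reusing the `distance` function from the previous sketch; picking whichever class `Counter.most_common` lists first is just a simplifying way of breaking ties.

```python
from collections import Counter

def k_nearest_neighbors(point, inputs, labels, k=3):
    # Sort training examples by distance to `point` and keep the k closest.
    neighbors = sorted(range(len(inputs)), key=lambda i: distance(point, inputs[i]))[:k]
    # Each neighbor gets one vote; return the most common class among them.
    votes = Counter(labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]
```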
So if we look at the five nearest points, and three of them say it's raining and two of them say it's not raining, we'll go with the three instead of the two, because each one effectively gets one vote towards what they believe the category ought to be. And ultimately, you choose the category that has the most votes as a consequence of that.

So k-nearest-neighbor classification is a fairly straightforward one to understand intuitively: you just look at the neighbors and figure out what the answer might be. And it turns out this can work very, very well for solving a whole variety of different types of classification problems. But not every model is going to work under every situation. And so one of the things we'll take a look at today, especially in the context of supervised machine learning, is that there are a number of different approaches to machine learning, a number of different algorithms that we can apply, all solving the same type of problem, all solving some kind of classification problem where we want to take inputs and organize them into different categories. And no one algorithm is necessarily always going to be better than some other algorithm. They each have their trade-offs. And maybe, depending on the data, one type of algorithm is going to be better suited to trying to model that information than some other algorithm. So this is what a lot of machine learning research ends up being about: when you're trying to apply machine learning techniques, you're often looking not just at one particular algorithm, but trying multiple different algorithms, trying to see what is going to give you the best results for trying to predict some function that maps inputs to outputs.

So what, then, are the drawbacks of k-nearest-neighbor classification? Well, there are a couple. One might be that, in a naive approach at least, it could be fairly slow to have to go through and measure the distance between a point and every single one of these points that exist here. Now, there are ways of trying to get around that.
There are data structures that can make it quicker to find these neighbors. There are also techniques you can use to try and prune some of this data, remove some of the data points, so that you're only left with the relevant data points, just to make it a little bit easier. But ultimately, what we might like to do is come up with another way of trying to do this classification. One way of trying to do the classification was looking at the neighboring points. But another way might be to look at all of the data and see if we can come up with some decision boundary, some boundary that will separate the rainy days from the not rainy days. In the case of two dimensions, we can do that by drawing a line, for example. So what we might want to try to do is just find some line, find some separator, that divides the rainy days, the blue points over here, from the not rainy days, the red points over there.

We're now trying a different approach, in contrast with the nearest neighbor approach, which just looked at local data around the input data point that we cared about. Now what we're doing is trying to use a technique known as linear regression to find some sort of line that will separate the two halves from each other. Now, sometimes it will actually be possible to come up with some line that perfectly separates all the rainy days from the not rainy days. Realistically, though, this is probably cleaner than many data sets will actually be. Oftentimes, data is messy. There are outliers. There's random noise that happens inside of a particular system. And what we'd like to do is still be able to figure out what a line might look like. So in practice, the data will not always be linearly separable, where linearly separable refers to some data set where I can draw a line to separate the two halves of it perfectly. Instead, you might have a situation like this, where there are some rainy points that are on this side of the line and some not raining points that are on that side of the line.
And there may not be a line that perfectly separates one half of the inputs from the other half, that perfectly separates all the rainy days from the not rainy days. But we can still say that this line does a pretty good job. And we'll try to formalize a little bit later what we mean when we say something like this line does a pretty good job of trying to make that prediction. But for now, let's just say we're looking for a line that does as good a job as we can at trying to separate one category of things from another category of things.

So let's now try to formalize this a little bit more mathematically. We want to come up with some sort of function, some way we can define this line. And our inputs are things like humidity and pressure in this case. So our inputs, which we might call x1 and x2, are going to represent humidity and pressure. These are inputs that we are going to provide to our machine learning algorithm. And given those inputs, we would like for our model to be able to predict some sort of output. And we're going to predict that using our hypothesis function, which we called h. Our hypothesis function is going to take as input x1 and x2, humidity and pressure in this case. And you can imagine, if we didn't just have two inputs, if we had three or four or five inputs or more, we could have this hypothesis function take all of those as input. We'll see examples of that a little bit later as well.

And now the question is, what does this hypothesis function do? Well, it really just needs to measure: is this data point on one side of the boundary, or is it on the other side of the boundary? And how do we formalize that boundary? Well, the boundary is generally going to be a linear combination of these input variables, at least in this particular case. What we're trying to do when we say linear combination is take each of these inputs and multiply them by some number that we're going to have to figure out. We'll generally call that number a weight, for how important these variables should be in trying to determine the answer. So we weight each of these variables with some weight.
And we might add a constant to it, just to shift the function a little bit. And the result we just need to compare: is it greater than 0 or is it less than 0, to say does it belong on one side of the line or the other side of the line?

So what that mathematical expression might look like is this. We would take each of our variables, x1 and x2, and multiply them by some weight. I don't yet know what that weight is, but it's going to be some number, weight 1 and weight 2. And maybe we just want to add some other weight 0 to it, because the function might require us to shift the entire value up or down by a certain amount. And then we just compare: if we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say no rain.

So the key here, then, is that this expression is how we are going to calculate whether it's a rainy day or not. We're going to do a bunch of math where we take each of the variables, multiply them by a weight, maybe add an extra weight to it, and see if the result is greater than or equal to 0. And using the result of that expression, we're able to determine whether it's raining or not raining. This expression here is, in this case, going to refer to just some line. If you were to plot it graphically, it would just be some line. And what the line actually looks like depends upon these weights. x1 and x2 are the inputs, but these weights are really what determine the shape of that line, the slope of that line, and what that line actually looks like.

So we then would like to figure out what these weights should be. We can choose whatever weights we want, but we want to choose weights in such a way that if you pass in a rainy day's humidity and pressure, then you end up with a result that is greater than or equal to 0. And we would like it such that if we passed into our hypothesis function a not rainy day's inputs, then the output that we get should be not raining.
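Written as a sketch in code, that decision rule might look like the following; the weight values are exactly what a learning algorithm would have to discover, so here they're just parameters passed in.

```python
def classify(humidity, pressure, w0, w1, w2):
    # Linear combination of the inputs plus a constant (bias) term.
    value = w0 + w1 * humidity + w2 * pressure
    # Which side of the boundary are we on?
    return "rain" if value >= 0 else "no rain"
```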
So before we get there, let's try to formalize this a little bit more mathematically, just to get a sense for how you'll often see this if you ever go further into supervised machine learning and explore this idea. One thing is that, generally, for these categories, we'll sometimes just use the names of the categories, like rain and not rain. Often, mathematically, if we're trying to do comparisons between these things, it's easier just to deal in the world of numbers. So we could just say 1 and 0: 1 for raining, 0 for not raining. So we do all this math, and if the result is greater than or equal to 0, we'll go ahead and say our hypothesis function outputs 1, meaning raining. And otherwise, it outputs 0, meaning not raining.

And oftentimes, this type of expression will instead be expressed using vector mathematics. All a vector is, if you're not familiar with the term, is a sequence of numerical values. You could represent that in Python using a list of numerical values, or a tuple of numerical values. And here, we have a couple of sequences of numerical values. One of our vectors, one of our sequences of numerical values, is all of these individual weights: w0, w1, and w2. So we could construct what we'll call a weight vector, and we'll see why this is useful in a moment, called w, generally represented using a boldface w, that is just a sequence of these three weights: weight 0, weight 1, and weight 2.

And to be able to calculate, based on those weights, whether we think a day is raining or not raining, we're going to multiply each of those weights by one of our input variables. That w2, this weight, is going to be multiplied by input variable x2. w1 is going to be multiplied by input variable x1. And w0, well, it's not being multiplied by anything, but to make sure the vectors are the same length, and we'll see why that's useful in just a second, we'll just go ahead and say w0 is being multiplied by 1. Because you can multiply something by 1 and you end up getting the exact same number.
So in addition to the weight vector, w, we'll also have an input vector that we'll call x that has three values: 1, again, because we're just multiplying w0 by 1 eventually, and then x1 and x2. So here, then, we've represented two distinct vectors. There's a vector of weights that we need to somehow learn. The goal of our machine learning algorithm is to learn what this weight vector is supposed to be. We could choose any arbitrary set of numbers and it would produce a function that tries to predict rain or not rain, but it probably wouldn't be very good. What we want to do is come up with a good choice of these weights so that we're able to make accurate predictions. And then this input vector represents a particular input to the function, a data point for which we would like to estimate, is that day a rainy day or is that day not a rainy day? And that's going to vary just depending on what input is provided to our function, what it is that we are trying to estimate.

And then, to do the calculation, we want to calculate this expression here. And it turns out that expression is what we would call the dot product of these two vectors. The dot product of two vectors just means taking each of the terms in the vectors and multiplying them together: w0 multiplied by 1, w1 multiplied by x1, w2 multiplied by x2. And that's why these vectors need to be the same length. And then we just add all of the results together. So the dot product of w and x, our weight vector and our input vector, is just going to be w0 times 1, or just w0, plus w1 times x1, multiplying those two terms together, plus w2 times x2, multiplying those terms together.

So we have our weight vector, which we need to figure out; we need our machine learning algorithm to figure out what the weights should be. We have the input vector, representing the data point that we're trying to predict a category for, predict a label for. And we're able to do that calculation by taking this dot product, which you'll often see represented in vector form. But if you haven't seen vectors before, you can think of it as identical to just this mathematical expression: doing the multiplication, then adding the results together.
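As a sketch, that dot product is only a couple of lines of Python; representing the vectors as plain lists (rather than, say, NumPy arrays) is just a choice of convenience here, and the weight values are placeholders.

```python
def dot(w, x):
    # Dot product of two equal-length vectors: multiply term by term, then sum.
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.5, 2.0, -0.003]   # weight vector (w0, w1, w2); placeholder values
x = [1, 0.93, 999.7]     # input vector (1, humidity, pressure)
prediction = 1 if dot(w, x) >= 0 else 0   # 1 means rain, 0 means no rain
```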
And then we see whether the result is greater than or equal to 0 or not. This expression here is identical to the expression that we're calculating to see whether or not that answer is greater than or equal to 0 in this case. And so, for that reason, you'll often see the hypothesis function written as something like this: a simpler representation where the hypothesis takes as input some input vector x, some humidity and pressure for some day, and we want to predict an output like rain or no rain, or 1 or 0 if we choose to represent things numerically. And the way we do that is by taking the dot product of the weights and our input. If it's greater than or equal to 0, we'll go ahead and say the output is 1. Otherwise, the output is going to be 0.

And this hypothesis, we say, is parameterized by the weights. Depending on what weights we choose, we'll end up getting a different hypothesis. If we choose the weights randomly, we're probably not going to get a very good hypothesis function. We'll get a 1 or a 0, but it's probably not going to accurately reflect whether we think a day is going to be rainy or not rainy. But if we choose the weights right, we can often do a pretty good job of trying to estimate whether the output of the function should be a 1 or a 0.

And so the question, then, is how to figure out what these weights should be, how to be able to tune those parameters. There are a number of ways you can do that. One of the most common is known as the perceptron learning rule. We'll see more of this later, but the idea of the perceptron learning rule, and we're not going to get too deep into the mathematics, we'll mostly just introduce it more conceptually, is to say that given some data point that we would like to learn from, some data point that has an input x and an output y, where y is 1 for rain or 0 for not rain, we're going to update the weights. We'll look at the formula in just a moment, but the big-picture idea is that we can start with random weights and then learn from the data: take the data points one at a time.
And for each one of the data points, figure out, all right, what parameters do we need to change inside of the weights in order to better match that input point? And that is the value of having access to a lot of data in a supervised machine learning algorithm: you take each of the data points, maybe look at them multiple times, and constantly try to figure out whether you need to shift your weights in order to create some weight vector that is able to correctly, or more accurately, estimate what the output should be, whether we think it's going to be raining or whether we think it's not going to be raining.

So what does that weight update look like? Without going into too much of the mathematics, we're going to update each of the weights to be the result of the original weight plus some additional expression. And to understand this expression: y is what the actual output is, and hypothesis of x, the input, is what we thought the output was. So I can replace this by saying: what the actual value was, minus what our estimate was. And based on the difference between the actual value and what our estimate was, we might want to change our hypothesis, change the way that we do that estimation.

If the actual value and the estimate were the same thing, meaning we were correctly able to predict what category this data point belonged to, well, then actual value minus estimate is just going to be 0, which means this whole term on the right-hand side goes to 0, and the weight doesn't change. Weight i, where i is weight 1 or weight 2 or weight 0, just stays at weight i. And none of the weights change if we were able to correctly predict what category the input belonged to. But if our hypothesis didn't correctly predict what category the input belonged to, then maybe we need to make some changes, adjust the weights, so that we're better able to predict this kind of data point in the future. And what is the way we might do that?
Well, if the actual value was bigger than the estimate, then, and for now, we'll go ahead and assume that these inputs are positive values, we need to increase the weight in order to make the output bigger, and therefore we're more likely to get to the right actual value. And so if the actual value is bigger than the estimate, then actual value minus estimate will be a positive number, and so you can imagine we're just adding some positive number to the weight, to increase it ever so slightly. And likewise, the inverse case is true: if the actual value was less than the estimate, if the actual value was 0 but we estimated 1, meaning it actually was not raining but we predicted it was going to be raining, then we want to decrease the value of the weight, because in that case, we want to try and lower the total value of that dot product in order to make it less likely that we would predict that it would actually be raining.

So no need to get too deep into the mathematics of that, but the general idea is that every time we encounter some data point, we can adjust these weights accordingly to try and make the weights better line up with the actual data that we have access to. And you can repeat this process with data point after data point until, eventually, hopefully, your algorithm converges to some set of weights that do a pretty good job of trying to figure out whether a day is going to be rainy or not rainy.

And just as a final point about this particular equation, this value alpha here is generally what we'll call the learning rate. It's just some parameter, some number we choose, for how quickly we're actually going to be updating these weight values. If alpha is bigger, then we're going to update these weight values by a lot. And if alpha is smaller, then we'll update the weight values by less. You can choose the value of alpha depending on the problem; different values might suit the situation better or worse than others.
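Here's a rough sketch of that update, reusing the `dot` helper from the earlier sketch. The standard form of the perceptron rule scales the (actual minus estimate) difference by the corresponding input value x_i, which is what this sketch does, and the learning rate value is just an illustrative choice.

```python
def hypothesis(w, x):
    # Hard-threshold hypothesis: 1 (rain) if the dot product is non-negative.
    return 1 if dot(w, x) >= 0 else 0

def perceptron_update(w, x, y, alpha=0.1):
    # w_i <- w_i + alpha * (actual - estimate) * x_i for every weight.
    estimate = hypothesis(w, x)
    return [wi + alpha * (y - estimate) * xi for wi, xi in zip(w, x)]
```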
So after all of that, after we've done this training process, where we take all this data and, using this learning rule, look at all the pieces of data, and use each piece of data as an indication to us of whether the weights stay the same, whether we increase the weights, or whether we decrease the weights, and if so, by how much, what you end up with is effectively a threshold function. And we can look at what the threshold function looks like. On the x-axis here, we have the result of that computation: taking the weights and computing their dot product with the input. And on the y-axis, we have what the output is going to be: 0, which in this case represents not raining, and 1, which in this case represents raining.

And the way that our hypothesis function works is that it calculates this value, and if it's greater than 0, or greater than some threshold value, then we declare that it's a rainy day, and otherwise we declare that it's not a rainy day. And this, then, graphically, is what that function looks like. Initially, when the value of this dot product is small, it's not raining, it's not raining, it's not raining. But as soon as it crosses that threshold, we suddenly say, OK, now it's raining, now it's raining, now it's raining. And the way to interpret this kind of representation is that anything on this side of the line would be the category of data points where we say, yes, it's raining, and anything that falls on this side of the line are the data points where we would say it's not raining. And again, we want to choose some value for the weights that results in a function that does a pretty good job of trying to do this estimation.

But one tricky thing with this type of hard threshold is that it only leaves two possible outcomes. We plug in some data as input, and the output we get is raining or not raining. And there is no room for anything anywhere in between. And maybe that's what you want. Maybe all you want is, given some data point, you would like to be able to classify it into one of two or more of these various different categories. But it might also be the case that you care about knowing how strong that prediction is, for example.
623 00:29:37,855 --> 00:29:39,730 So if we go back to this instance here, where 624 00:29:39,730 --> 00:29:42,550 we have rainy days on this side of the line, not 625 00:29:42,550 --> 00:29:45,370 rainy days on that side of the line, you might 626 00:29:45,370 --> 00:29:48,970 imagine that let's look now at these two white data points. 627 00:29:48,970 --> 00:29:53,110 This data point here that we would like to predict a label or a category for. 628 00:29:53,110 --> 00:29:55,660 And this data point over here that we would also like 629 00:29:55,660 --> 00:29:58,510 to predict a label or a category for. 630 00:29:58,510 --> 00:30:01,885 It seems likely that you could pretty confidently say that this data point, 631 00:30:01,885 --> 00:30:03,010 that should be a rainy day. 632 00:30:03,010 --> 00:30:05,500 It seems close to the other rainy days if we're 633 00:30:05,500 --> 00:30:07,330 going by the nearest neighbor strategy. 634 00:30:07,330 --> 00:30:11,830 It's on this side of the line if we're going by the strategy of just saying 635 00:30:11,830 --> 00:30:14,140 which side of the line does it fall on by figuring out 636 00:30:14,140 --> 00:30:15,670 what those weights should be. 637 00:30:15,670 --> 00:30:18,610 And if we're using the line strategy of just which side of the line 638 00:30:18,610 --> 00:30:21,640 does it fall on, which side of this decision boundary, 639 00:30:21,640 --> 00:30:24,700 we'd also say that this point here is also 640 00:30:24,700 --> 00:30:27,580 a rainy day, because it falls on the side of the line that 641 00:30:27,580 --> 00:30:30,620 corresponds to rainy days. 642 00:30:30,620 --> 00:30:33,020 But it's likely that even in this case, we 643 00:30:33,020 --> 00:30:37,040 would know that we don't feel nearly as confident about this data point 644 00:30:37,040 --> 00:30:40,360 on the left as compared to this data point on the right. 645 00:30:40,360 --> 00:30:42,620 For this one on the right, we can feel very confident 646 00:30:42,620 --> 00:30:44,060 that, yes, it's a rainy day. 647 00:30:44,060 --> 00:30:48,460 This one, it's pretty close to the line if we're judging just by distance. 648 00:30:48,460 --> 00:30:51,260 And so you might be less sure. 649 00:30:51,260 --> 00:30:56,150 But our threshold function doesn't allow for a notion of less sure or more sure 650 00:30:56,150 --> 00:30:57,080 about something. 651 00:30:57,080 --> 00:30:59,000 It's what we would call a hard threshold. 652 00:30:59,000 --> 00:31:02,810 It's once you've crossed this line, then immediately, we say, yes, 653 00:31:02,810 --> 00:31:04,520 this is going to be a rainy day. 654 00:31:04,520 --> 00:31:07,580 Anywhere before it, we're going to say it's not a rainy day. 655 00:31:07,580 --> 00:31:10,240 And that may not be helpful in a number of cases. 656 00:31:10,240 --> 00:31:13,070 One, this is not a particularly easy function to deal with. 657 00:31:13,070 --> 00:31:15,710 If you get you get deeper into the world of machine learning 658 00:31:15,710 --> 00:31:18,830 and are trying to do things like taking derivatives of these curves, 659 00:31:18,830 --> 00:31:21,222 this type of function makes things challenging. 660 00:31:21,222 --> 00:31:23,180 But the other challenge is that we don't really 661 00:31:23,180 --> 00:31:25,013 have any notion of gradation between things. 
662 00:31:25,013 --> 00:31:28,430 We don't have a notion of, yes, this is a very strong belief 663 00:31:28,430 --> 00:31:32,600 that it's going to be raining as opposed to it's probably more likely than not 664 00:31:32,600 --> 00:31:37,100 that it's going to be raining, but maybe not totally sure about that, either. 665 00:31:37,100 --> 00:31:39,590 So what we can do by taking advantage of a technique known 666 00:31:39,590 --> 00:31:43,190 as logistic regression is instead of using this hard threshold 667 00:31:43,190 --> 00:31:46,970 type of function, we can use instead a logistic function, something 668 00:31:46,970 --> 00:31:48,890 we might call a soft threshold. 669 00:31:48,890 --> 00:31:52,070 And that's going to transform this into looking something 670 00:31:52,070 --> 00:31:55,290 a little more like this-- something that more nicely curves. 671 00:31:55,290 --> 00:31:57,800 And as a result, the possible output values 672 00:31:57,800 --> 00:32:02,030 are no longer just 0 and 1, 0 for not raining, 1 for raining. 673 00:32:02,030 --> 00:32:06,360 But you can actually get any real number of value between 0 and 1. 674 00:32:06,360 --> 00:32:10,292 That if you're way over on this side, then you get a value of 0-- 675 00:32:10,292 --> 00:32:12,750 it's not going to be raining, we're pretty sure about that. 676 00:32:12,750 --> 00:32:15,042 And if you're over on this side, you get a value of 1-- 677 00:32:15,042 --> 00:32:17,330 yes, we're very sure that it's going to be raining. 678 00:32:17,330 --> 00:32:22,460 But in between, you could get some real numbered value where a value like 0.7 679 00:32:22,460 --> 00:32:24,260 might mean we think it's going to rain. 680 00:32:24,260 --> 00:32:27,810 It's more probable that it's going to rain than not based on the data, 681 00:32:27,810 --> 00:32:32,220 but we're not as confident as some of the other data points might be. 682 00:32:32,220 --> 00:32:34,580 So one of the advantages of the soft threshold 683 00:32:34,580 --> 00:32:37,940 is that it allows us to have an output that could be some real number that 684 00:32:37,940 --> 00:32:41,450 potentially reflects some sort of probability, the likelihood that we 685 00:32:41,450 --> 00:32:46,550 think that this particular data point belongs to that particular category. 686 00:32:46,550 --> 00:32:49,250 And there are some other nice mathematical properties of that 687 00:32:49,250 --> 00:32:50,950 as well. 688 00:32:50,950 --> 00:32:53,880 So that then is two different approaches to trying to solve 689 00:32:53,880 --> 00:32:55,710 this type of classification problem. 690 00:32:55,710 --> 00:32:58,860 One is this nearest neighbor type of approach, where you just 691 00:32:58,860 --> 00:33:01,230 take a data point and look at the data points that 692 00:33:01,230 --> 00:33:05,490 are nearby to try and estimate what category we think it belongs to. 693 00:33:05,490 --> 00:33:08,310 And the other approach is the approach of saying, all right, 694 00:33:08,310 --> 00:33:10,607 let's just try and use linear regression, 695 00:33:10,607 --> 00:33:13,440 figure out what these weights should be, adjust the weights in order 696 00:33:13,440 --> 00:33:16,950 to figure out what line or what decision boundary is going 697 00:33:16,950 --> 00:33:19,708 to best separate these two categories. 
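One possible sketch of that soft threshold, using the logistic (sigmoid) function, is below; again, the weights and inputs are invented just to show the shape of the idea.

    import math

    def soft_threshold(weights, x):
        # Apply the logistic function to the dot product, giving a value between 0 and 1
        # that can be read as how confident we are that it is raining.
        z = sum(w * xi for w, xi in zip(weights, x))
        return 1 / (1 + math.exp(-z))

    weights = [0.8, -0.5]                          # hypothetical weights
    print(soft_threshold(weights, [0.9, 0.3]))     # about 0.64: probably raining, but not certain
    print(soft_threshold(weights, [3.0, -1.0]))    # about 0.95: quite confident it is raining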
698 00:33:19,708 --> 00:33:22,500 It turns out that another popular approach, a very popular approach 699 00:33:22,500 --> 00:33:24,480 if you just have a data set and you want to start 700 00:33:24,480 --> 00:33:26,190 trying to do some learning on it, is what 701 00:33:26,190 --> 00:33:27,648 we call the support vector machine. 702 00:33:27,648 --> 00:33:30,690 We're not going to go too much into the mathematics of the support vector 703 00:33:30,690 --> 00:33:32,850 machine, but we'll at least explore it graphically 704 00:33:32,850 --> 00:33:34,680 to see what it is that it looks like. 705 00:33:34,680 --> 00:33:38,220 And the idea or the motivation behind the support vector machine 706 00:33:38,220 --> 00:33:41,400 is the idea that there are actually a lot of different lines 707 00:33:41,400 --> 00:33:44,070 that we could draw, a lot of different decision boundaries 708 00:33:44,070 --> 00:33:46,430 that we could draw to separate two groups. 709 00:33:46,430 --> 00:33:49,050 So for example, I had the red data points over here 710 00:33:49,050 --> 00:33:50,700 and the blue data points over here. 711 00:33:50,700 --> 00:33:54,620 One possible line I could draw is a line like this, 712 00:33:54,620 --> 00:33:57,600 that this line here would separate the red points from the blue points. 713 00:33:57,600 --> 00:33:58,680 And it does so perfectly. 714 00:33:58,680 --> 00:34:01,080 All the red points are on one side of the line. 715 00:34:01,080 --> 00:34:03,840 All the blue points around the other side of the line. 716 00:34:03,840 --> 00:34:07,000 But this should probably make you a little bit nervous. 717 00:34:07,000 --> 00:34:08,909 If you come up with a model and the model 718 00:34:08,909 --> 00:34:10,960 comes up with a line that looks like this. 719 00:34:10,960 --> 00:34:13,650 And the reason why is that you worry about how well it's 720 00:34:13,650 --> 00:34:18,190 going to generalize to other data points that are not necessarily in the data 721 00:34:18,190 --> 00:34:19,900 set that we have access to. 722 00:34:19,900 --> 00:34:23,650 For example, if there was a point that fell right here for example, 723 00:34:23,650 --> 00:34:26,370 on the right side of the line, then based on that, 724 00:34:26,370 --> 00:34:30,210 we might want to guess that it is in fact, a red point, 725 00:34:30,210 --> 00:34:33,780 but it falls on the side of the line where instead, we would estimate 726 00:34:33,780 --> 00:34:36,380 that it's a blue point instead. 727 00:34:36,380 --> 00:34:39,750 And so based on that, this line is probably not a great choice 728 00:34:39,750 --> 00:34:43,679 just because it is so close to these various data points. 729 00:34:43,679 --> 00:34:45,810 We might instead prefer a diagonal line that 730 00:34:45,810 --> 00:34:48,719 just goes diagonally through the data set like we've seen before. 731 00:34:48,719 --> 00:34:52,000 But there too, there's a lot of diagonal lines that we could draw as well. 732 00:34:52,000 --> 00:34:54,760 For example, I could draw this diagonal line here, 733 00:34:54,760 --> 00:34:57,663 which also successfully separates all the red points 734 00:34:57,663 --> 00:34:58,830 from all of the blue points. 
735 00:34:58,830 --> 00:35:01,400 From the perspective of something like a just trying 736 00:35:01,400 --> 00:35:03,150 to figure out some setting of weights that 737 00:35:03,150 --> 00:35:05,760 allows us to predict the correct output, this line 738 00:35:05,760 --> 00:35:09,155 will predict the correct output for this particular set of data 739 00:35:09,155 --> 00:35:11,530 every single time, because the red points are on one side 740 00:35:11,530 --> 00:35:13,518 and the blue points are on the other. 741 00:35:13,518 --> 00:35:15,810 But yet again, you should probably be a little nervous. 742 00:35:15,810 --> 00:35:18,690 Because this line is so close to these red points, 743 00:35:18,690 --> 00:35:22,470 even though we're able to correctly predict on the input data 744 00:35:22,470 --> 00:35:26,070 if there was a point that fell somewhere in this general area, 745 00:35:26,070 --> 00:35:28,530 our algorithm, this model, would say that yeah, 746 00:35:28,530 --> 00:35:30,930 we think it's a blue point, when in actuality, it 747 00:35:30,930 --> 00:35:34,000 might belong to the red category instead, 748 00:35:34,000 --> 00:35:36,870 just because it looks like it's close to the other red points. 749 00:35:36,870 --> 00:35:39,700 What we really want to be able to say, given this data, 750 00:35:39,700 --> 00:35:42,330 how can you generalize this out as best as possible, is 751 00:35:42,330 --> 00:35:46,480 to come up with a line like this that seems like the intuitive line to draw. 752 00:35:46,480 --> 00:35:48,810 And the reason why it's intuitive is because it 753 00:35:48,810 --> 00:35:54,267 seems to be as far apart as possible from the red data and the blue data 754 00:35:54,267 --> 00:35:56,850 so that if we generalize a little bit and assume that maybe we 755 00:35:56,850 --> 00:35:58,933 have some points that are different from the input 756 00:35:58,933 --> 00:36:02,610 but still slightly further away, we can still say that something on this side, 757 00:36:02,610 --> 00:36:06,300 probably red, something on that side, probably blue. 758 00:36:06,300 --> 00:36:08,607 And we can make those judgments that way. 759 00:36:08,607 --> 00:36:10,440 And that is what support vector machines are 760 00:36:10,440 --> 00:36:12,660 designed to do-- they're designed to try and find 761 00:36:12,660 --> 00:36:15,720 what we call the maximum margin separator, 762 00:36:15,720 --> 00:36:17,850 where the maximum margin separator is just 763 00:36:17,850 --> 00:36:21,780 some boundary that maximizes the distance between the groups of points. 764 00:36:21,780 --> 00:36:24,990 Rather than come up with some boundary that's very close to one side 765 00:36:24,990 --> 00:36:27,840 or the other, where in the case before, we wouldn't have cared-- 766 00:36:27,840 --> 00:36:31,680 as long as we're categorizing the input well, that seems all we need to do-- 767 00:36:31,680 --> 00:36:35,790 the support vector machine will try and find this maximum margin separator, 768 00:36:35,790 --> 00:36:39,090 some way of trying to maximize that particular distance. 769 00:36:39,090 --> 00:36:42,600 And it does so by finding what we call the support vectors, which 770 00:36:42,600 --> 00:36:44,790 are the vectors that are closest to the line 771 00:36:44,790 --> 00:36:48,060 and trying to maximize the distance between the line 772 00:36:48,060 --> 00:36:49,950 and those particular points. 773 00:36:49,950 --> 00:36:51,790 And it works that way in two dimensions. 
774 00:36:51,790 --> 00:36:53,498 It also works in higher dimensions, where 775 00:36:53,498 --> 00:36:56,670 we're not looking for some line that separates the two data points, 776 00:36:56,670 --> 00:36:58,440 but instead, looking for what we generally 777 00:36:58,440 --> 00:37:02,610 call a hyperplane, some decision boundary, effectively, that 778 00:37:02,610 --> 00:37:06,150 separates one set of data from the other set of data. 779 00:37:06,150 --> 00:37:09,270 And this ability of support vector machines to work in higher dimensions 780 00:37:09,270 --> 00:37:11,670 actually has a number of other applications as well. 781 00:37:11,670 --> 00:37:14,790 But one is that it helpfully deals with cases where 782 00:37:14,790 --> 00:37:17,740 data may not be linearly separable. 783 00:37:17,740 --> 00:37:19,800 So we talked about linear separability before, 784 00:37:19,800 --> 00:37:22,890 this idea that you can take data and just draw 785 00:37:22,890 --> 00:37:25,180 a line or some linear combination of the inputs 786 00:37:25,180 --> 00:37:28,750 that allows us to perfectly separate the two sets from each other. 787 00:37:28,750 --> 00:37:32,050 There are some data sets that are not linearly separable. 788 00:37:32,050 --> 00:37:35,280 And in some, you would not even be able to find 789 00:37:35,280 --> 00:37:39,270 a good line at all that would do that kind of separation. 790 00:37:39,270 --> 00:37:42,930 Something like this, for example, where if you imagine here 791 00:37:42,930 --> 00:37:45,780 are the red points and the blue points surround them. 792 00:37:45,780 --> 00:37:50,550 If you try to find a line that divides the red points from the blue points, 793 00:37:50,550 --> 00:37:53,310 it's actually going to be difficult, if not impossible, to do-- 794 00:37:53,310 --> 00:37:55,080 that any line you choose-- 795 00:37:55,080 --> 00:37:57,510 if you draw a line here, then you've ignored 796 00:37:57,510 --> 00:38:00,450 all of these blue points that should actually be blue and not red-- 797 00:38:00,450 --> 00:38:03,240 anywhere else you draw a line, there's going to be a lot of error, 798 00:38:03,240 --> 00:38:07,560 a lot of mistakes, a lot of what we'll soon call loss to that line 799 00:38:07,560 --> 00:38:08,280 that you draw-- 800 00:38:08,280 --> 00:38:12,050 a lot of points that you're going to categorize incorrectly. 801 00:38:12,050 --> 00:38:14,610 What we really want is to be able to find a better decision 802 00:38:14,610 --> 00:38:18,570 boundary that may not be just a straight line through this two 803 00:38:18,570 --> 00:38:19,770 dimensional space. 804 00:38:19,770 --> 00:38:21,810 And what support vector machines can do is 805 00:38:21,810 --> 00:38:23,940 they can begin to operate in higher dimensions 806 00:38:23,940 --> 00:38:26,880 and be able to find some other decision boundary, 807 00:38:26,880 --> 00:38:28,860 like the circle in this case, that actually 808 00:38:28,860 --> 00:38:33,130 is able to separate one of these sets of data from the other set of data, 809 00:38:33,130 --> 00:38:33,750 a lot better. 810 00:38:33,750 --> 00:38:37,470 So oftentimes, in data sets where the data is not linearly separable, 811 00:38:37,470 --> 00:38:40,140 support vector machines, by working in higher dimensions, 812 00:38:40,140 --> 00:38:44,310 can actually figure out a way to solve that kind of problem effectively.
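Using scikit-learn, which comes up again later in the lecture, a rough sketch of a support vector machine on data shaped like that circular example might look like this; the points, labels, and kernel choice are made up for illustration.

    from sklearn.svm import SVC

    # Made-up data: a small cluster near the origin, surrounded by points around it,
    # which no single straight line can separate.
    X = [[0, 0], [0.2, 0.1], [-0.1, 0.2],          # inner points
         [2, 0], [0, 2], [-2, 0], [0, -2]]         # surrounding points
    y = ["red", "red", "red", "blue", "blue", "blue", "blue"]

    # A non-linear kernel lets the SVM find a boundary (roughly a circle here)
    # by effectively working in a higher-dimensional space.
    model = SVC(kernel="rbf")
    model.fit(X, y)
    print(model.predict([[0.1, -0.1], [1.8, 1.2]]))   # likely ['red', 'blue']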
813 00:38:44,310 --> 00:38:46,290 So that then-- three different approaches 814 00:38:46,290 --> 00:38:48,330 to trying to solve these sorts of problems. 815 00:38:48,330 --> 00:38:50,040 We're seeing support vector machines. 816 00:38:50,040 --> 00:38:53,400 We've seen trying to use linear regression and the perceptron 817 00:38:53,400 --> 00:38:56,912 learning rule to be able to figure out how to categorize inputs and outputs. 818 00:38:56,912 --> 00:38:58,620 We've seen the nearest neighbor approach. 819 00:38:58,620 --> 00:39:00,630 No one necessarily better than any other. 820 00:39:00,630 --> 00:39:03,990 Again, it's going to depend on the data set, the information you have access 821 00:39:03,990 --> 00:39:04,490 to. 822 00:39:04,490 --> 00:39:07,157 It's going to depend on what the function looks like that you're 823 00:39:07,157 --> 00:39:08,370 ultimately trying to predict. 824 00:39:08,370 --> 00:39:11,160 And this is where a lot of research and experimentation 825 00:39:11,160 --> 00:39:14,880 can be involved in trying to figure out how it is to best perform 826 00:39:14,880 --> 00:39:16,710 that kind of estimation. 827 00:39:16,710 --> 00:39:20,100 But classification is only one of the tasks that you might encounter 828 00:39:20,100 --> 00:39:23,340 and supervised machine learning, because in classification, 829 00:39:23,340 --> 00:39:26,580 what we're trying to predict is some discrete category. 830 00:39:26,580 --> 00:39:29,040 We're trying to predict red or blue, rain 831 00:39:29,040 --> 00:39:31,950 or not rain, authentic or counterfeit. 832 00:39:31,950 --> 00:39:35,500 But sometimes, what we want to predict is a real number value. 833 00:39:35,500 --> 00:39:38,880 And for that, we have a related problem, not classification, but instead 834 00:39:38,880 --> 00:39:40,500 known as regression. 835 00:39:40,500 --> 00:39:42,930 And regression is the supervised learning problem 836 00:39:42,930 --> 00:39:45,570 where we try and learn a function mapping inputs to outputs, 837 00:39:45,570 --> 00:39:46,810 same as before. 838 00:39:46,810 --> 00:39:49,800 but instead of the outputs being discrete categories-- 839 00:39:49,800 --> 00:39:51,960 things like rain or not rain-- 840 00:39:51,960 --> 00:39:54,260 in a regression problem, the output values 841 00:39:54,260 --> 00:39:57,400 are generally continuous value-- some real number 842 00:39:57,400 --> 00:39:58,650 that we would like to predict. 843 00:39:58,650 --> 00:40:00,385 This happens all the time, as well. 844 00:40:00,385 --> 00:40:02,760 You might imagine that a company might take this approach 845 00:40:02,760 --> 00:40:04,950 if it's trying to figure out, for instance, 846 00:40:04,950 --> 00:40:06,720 what the effect of its advertising is. 847 00:40:06,720 --> 00:40:10,830 Like how do advertising dollars spent translate into sales 848 00:40:10,830 --> 00:40:13,000 for the company's product, for example. 849 00:40:13,000 --> 00:40:17,070 And so they might like to try to predict some function that takes as input, 850 00:40:17,070 --> 00:40:18,780 the amount of money spent on advertising. 851 00:40:18,780 --> 00:40:20,220 And here, we're just going to use one input, 852 00:40:20,220 --> 00:40:22,650 but again, you could scale this up to many more inputs 853 00:40:22,650 --> 00:40:25,800 as well if you have a lot of different kinds of data you have access to. 
854 00:40:25,800 --> 00:40:27,540 And the goal is to learn a function-- that given 855 00:40:27,540 --> 00:40:29,423 this amount of spending on advertising, we're 856 00:40:29,423 --> 00:40:30,840 going to get this amount in sales. 857 00:40:30,840 --> 00:40:34,110 And you might judge it based on having access to a whole bunch of data-- 858 00:40:34,110 --> 00:40:37,830 like for every past month, here's how much we spent on advertising 859 00:40:37,830 --> 00:40:39,390 and here is what sales were. 860 00:40:39,390 --> 00:40:43,300 And we would like to predict some sort of hypothesis function 861 00:40:43,300 --> 00:40:46,260 that, again, given the amount spent on advertising, 862 00:40:46,260 --> 00:40:48,750 can predict in this case, some real number, 863 00:40:48,750 --> 00:40:54,840 some estimate of how much sales we expect that company to do in this month 864 00:40:54,840 --> 00:40:56,940 or in this quarter or whatever unit of time 865 00:40:56,940 --> 00:40:58,980 we're choosing to measure things in. 866 00:40:58,980 --> 00:41:01,830 And so again, the approach to solving this type of problem, 867 00:41:01,830 --> 00:41:05,920 we could try using a linear regression type approach, where we take this data, 868 00:41:05,920 --> 00:41:07,050 and we just plot it. 869 00:41:07,050 --> 00:41:09,750 On the x-axis, we have advertising dollars spent. 870 00:41:09,750 --> 00:41:11,490 On the y-axis, we have sales. 871 00:41:11,490 --> 00:41:14,160 And we might just want to try and draw a line that 872 00:41:14,160 --> 00:41:16,680 does a pretty good job of trying to estimate 873 00:41:16,680 --> 00:41:19,980 this relationship between advertising and sales. 874 00:41:19,980 --> 00:41:21,840 And in this case, unlike before, we're not 875 00:41:21,840 --> 00:41:24,848 trying to separate the data points into discrete categories. 876 00:41:24,848 --> 00:41:26,640 But instead in this case, we're just trying 877 00:41:26,640 --> 00:41:31,410 to find a line that approximates this relationship between advertising 878 00:41:31,410 --> 00:41:34,860 and sales so that if we want to figure out what the estimated sales are 879 00:41:34,860 --> 00:41:38,520 for a particular advertising budget, you just look it up in this line, 880 00:41:38,520 --> 00:41:40,560 figure out for this amount of advertising, we 881 00:41:40,560 --> 00:41:42,540 would have this amount of sales, and just 882 00:41:42,540 --> 00:41:44,367 try and make the estimate that way. 883 00:41:44,367 --> 00:41:46,200 And so you can try and come up with a line-- 884 00:41:46,200 --> 00:41:48,720 again, figuring out how to modify the weights using 885 00:41:48,720 --> 00:41:51,990 various different techniques to try and make it so that this line fits 886 00:41:51,990 --> 00:41:54,860 as well as possible. 887 00:41:54,860 --> 00:41:57,880 So with all of these approaches to trying to solve machine 888 00:41:57,880 --> 00:41:59,270 learning style problems, 889 00:41:59,270 --> 00:42:02,000 the question becomes, how do we evaluate these approaches? 890 00:42:02,000 --> 00:42:06,340 How do we evaluate the various different hypotheses that we could come up with? 891 00:42:06,340 --> 00:42:09,880 Because each of these algorithms will give us some sort of hypothesis-- 892 00:42:09,880 --> 00:42:12,710 some function that maps inputs to outputs. 893 00:42:12,710 --> 00:42:16,720 And we want to know how well does that function work.
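Before getting to evaluation, here is what that line-fitting idea might look like as a small scikit-learn sketch; the advertising and sales figures are entirely made up.

    from sklearn.linear_model import LinearRegression

    # Made-up monthly figures: advertising dollars spent, and the sales that resulted.
    advertising = [[1000], [2000], [3000], [4000], [5000]]
    sales = [9000, 12000, 16000, 19000, 23000]

    # Fit a line relating advertising to sales.
    model = LinearRegression()
    model.fit(advertising, sales)

    # Estimate sales for an advertising budget we have not seen before.
    print(model.predict([[3500]]))   # a real-numbered estimate, roughly 17,500 here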
894 00:42:16,720 --> 00:42:19,000 And you can think of evaluating these hypotheses 895 00:42:19,000 --> 00:42:23,500 and trying to get a better hypothesis as kind of like an optimization problem. 896 00:42:23,500 --> 00:42:26,580 In an optimization problem, as you recall from before, 897 00:42:26,580 --> 00:42:30,100 you are either trying to maximize some objective function 898 00:42:30,100 --> 00:42:33,160 by trying to find a global maximum. 899 00:42:33,160 --> 00:42:36,160 Or we are trying to minimize some cost function 900 00:42:36,160 --> 00:42:38,210 by trying to find some global minimum. 901 00:42:38,210 --> 00:42:41,890 And in the case of evaluating these hypotheses, one thing we might say 902 00:42:41,890 --> 00:42:45,280 is that this cost function, the thing we're trying to minimize, 903 00:42:45,280 --> 00:42:49,180 we might be trying to minimize what we would call a loss function. 904 00:42:49,180 --> 00:42:50,510 And what a loss function is-- 905 00:42:50,510 --> 00:42:53,980 it is a function that is going to estimate for us how 906 00:42:53,980 --> 00:42:56,200 poorly our function performs. 907 00:42:56,200 --> 00:42:58,240 More formally, it's like a loss of utility, 908 00:42:58,240 --> 00:43:02,740 in that whenever we predict something that is wrong, that is a loss of utility. 909 00:43:02,740 --> 00:43:06,450 That's going to add to the output of our loss function. 910 00:43:06,450 --> 00:43:08,200 And you can come up with any loss function 911 00:43:08,200 --> 00:43:11,320 that you want-- just some mathematical way of estimating given 912 00:43:11,320 --> 00:43:14,020 each of these data points, given what the actual output is, 913 00:43:14,020 --> 00:43:17,110 and given what our projected output is, our estimate, 914 00:43:17,110 --> 00:43:19,970 you could calculate some sort of numerical loss for it. 915 00:43:19,970 --> 00:43:23,140 But there are a couple of popular loss functions that are worth discussing-- 916 00:43:23,140 --> 00:43:25,210 just so that you've seen them before-- 917 00:43:25,210 --> 00:43:27,040 when it comes to discrete categories. 918 00:43:27,040 --> 00:43:30,570 Things like rain or not rain, counterfeit or not counterfeit. 919 00:43:30,570 --> 00:43:33,463 One approach is the 0-1 loss function. 920 00:43:33,463 --> 00:43:35,380 And the way that works is for each of the data 921 00:43:35,380 --> 00:43:40,180 points, our loss function takes as input what the actual output is, whether it 922 00:43:40,180 --> 00:43:44,650 was actually raining or not raining, and takes our prediction into account-- 923 00:43:44,650 --> 00:43:49,030 did we predict given this data point that it was raining or not raining. 924 00:43:49,030 --> 00:43:51,730 And if the actual value equals the prediction, 925 00:43:51,730 --> 00:43:54,580 well, then the 0-1 loss function will just say the loss is 0. 926 00:43:54,580 --> 00:43:58,900 There was no loss of utility because we were able to predict correctly. 927 00:43:58,900 --> 00:44:01,300 And otherwise, if the actual value was not 928 00:44:01,300 --> 00:44:05,260 the same thing as what we predicted, well, then in that case, our loss is 1. 929 00:44:05,260 --> 00:44:08,260 We lost something, lost some utility, because what 930 00:44:08,260 --> 00:44:12,750 we predicted was the output of the function was not what it actually was.
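Written out as Python, the 0-1 loss function is just a comparison; the days and predictions below are invented to show the bookkeeping.

    def zero_one_loss(actual, predicted):
        # No loss if the prediction matches the actual category, a loss of 1 otherwise.
        return 0 if actual == predicted else 1

    actuals     = ["rain", "rain", "no rain", "no rain"]
    predictions = ["rain", "no rain", "no rain", "rain"]

    # Total empirical loss: add up the loss for every data point.
    print(sum(zero_one_loss(a, p) for a, p in zip(actuals, predictions)))   # 2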
931 00:44:12,750 --> 00:44:14,500 And the goal then in a situation like this 932 00:44:14,500 --> 00:44:17,080 would be to come up with some hypothesis that 933 00:44:17,080 --> 00:44:21,040 minimizes the total empirical loss, the total amount that we've 934 00:44:21,040 --> 00:44:25,090 lost if you add up for all these data points what the actual output is 935 00:44:25,090 --> 00:44:28,250 and what your hypothesis would have predicted. 936 00:44:28,250 --> 00:44:30,100 So in this case, for example, if we go back 937 00:44:30,100 --> 00:44:32,530 to classifying days as raining or not raining, 938 00:44:32,530 --> 00:44:34,720 and we came up with this decision boundary, 939 00:44:34,720 --> 00:44:36,680 how would we evaluate this decision boundary-- 940 00:44:36,680 --> 00:44:40,510 how much better is it than drawing the line here or drawing the line there. 941 00:44:40,510 --> 00:44:42,790 Well, we could take each of the input data points 942 00:44:42,790 --> 00:44:45,790 and each input data point has a label-- whether it was raining 943 00:44:45,790 --> 00:44:47,230 or whether it was not raining. 944 00:44:47,230 --> 00:44:49,090 And we could compare it to the prediction-- 945 00:44:49,090 --> 00:44:51,550 whether we predicted it would be raining or not raining-- 946 00:44:51,550 --> 00:44:55,040 and assign it a numerical value as a result. 947 00:44:55,040 --> 00:44:58,323 So for example, these points over here they were all rainy days. 948 00:44:58,323 --> 00:45:00,490 And we predicted they would be raining, because they 949 00:45:00,490 --> 00:45:02,300 fall in the bottom side of the line. 950 00:45:02,300 --> 00:45:03,520 So they had a loss of 0-- 951 00:45:03,520 --> 00:45:05,500 nothing lost from those situations. 952 00:45:05,500 --> 00:45:08,140 And likewise, same is true for some of these points over here, 953 00:45:08,140 --> 00:45:10,420 where it was not raining and we predicted 954 00:45:10,420 --> 00:45:12,220 it would not be raining, either. 955 00:45:12,220 --> 00:45:16,810 Where we do have loss are points like this point here and that point there, 956 00:45:16,810 --> 00:45:20,980 where we predicted that it would not be raining, but in actuality, 957 00:45:20,980 --> 00:45:21,730 it's a blue point. 958 00:45:21,730 --> 00:45:22,810 It was raining. 959 00:45:22,810 --> 00:45:25,870 Or likewise here, we predicted that it would be raining, 960 00:45:25,870 --> 00:45:27,790 but in actuality, it's a red point-- 961 00:45:27,790 --> 00:45:29,020 it was not raining. 962 00:45:29,020 --> 00:45:31,900 And so as a result, we miscategorized these data 963 00:45:31,900 --> 00:45:34,150 points that we were trying to train on. 964 00:45:34,150 --> 00:45:36,310 And as a result, there is some loss here. 965 00:45:36,310 --> 00:45:40,930 One loss here, there, here and there, for a total loss of four, for example, 966 00:45:40,930 --> 00:45:41,893 in this case. 967 00:45:41,893 --> 00:45:43,810 And that might be how we would estimate or how 968 00:45:43,810 --> 00:45:47,860 we would say that this line is better than a line that 969 00:45:47,860 --> 00:45:51,370 goes somewhere else or a line that's further down, because this line might 970 00:45:51,370 --> 00:45:52,670 minimize the loss. 971 00:45:52,670 --> 00:45:57,100 So there is no way to do better than just these four points of loss 972 00:45:57,100 --> 00:46:01,080 if you're just drawing a straight line through our space. 973 00:46:01,080 --> 00:46:05,020 So the 0-1 loss function checks did we get it right, did we get it wrong. 
974 00:46:05,020 --> 00:46:07,630 If we got it right, the loss is 0-- nothing lost. 975 00:46:07,630 --> 00:46:11,470 If we got it wrong, then our loss function for that data point says 1, 976 00:46:11,470 --> 00:46:14,740 and we add up all of those losses across all of our data points 977 00:46:14,740 --> 00:46:17,290 to get some sort of empirical loss-- how much we 978 00:46:17,290 --> 00:46:20,420 have lost across all of these original data points 979 00:46:20,420 --> 00:46:23,500 that our algorithm had access to. 980 00:46:23,500 --> 00:46:26,550 There are other forms of loss as well that work especially well when 981 00:46:26,550 --> 00:46:28,980 we deal with more real value cases-- cases 982 00:46:28,980 --> 00:46:32,760 like the mapping between advertising budget and amount that we do in sales, 983 00:46:32,760 --> 00:46:33,720 for example. 984 00:46:33,720 --> 00:46:37,770 Because in that case, you care not just that you get the number exactly right, 985 00:46:37,770 --> 00:46:40,650 but you care how close you were to the actual value. 986 00:46:40,650 --> 00:46:44,700 If the actual value is you did $2,800 in sales 987 00:46:44,700 --> 00:46:47,910 and you predicted that you would do $2,900 in sales, 988 00:46:47,910 --> 00:46:49,330 maybe that's pretty good. 989 00:46:49,330 --> 00:46:52,380 That's much better than if you had predicted you do $1,000 in sales, 990 00:46:52,380 --> 00:46:53,620 for example. 991 00:46:53,620 --> 00:46:55,830 And so we would like our loss function to be 992 00:46:55,830 --> 00:46:58,530 able to take that into account as well. 993 00:46:58,530 --> 00:47:02,700 Take into account not just whether the actual value in the expected value 994 00:47:02,700 --> 00:47:08,490 are exactly the same, but also, take into account how far apart they were. 995 00:47:08,490 --> 00:47:12,450 And so for that one approach is what we call L1 loss. 996 00:47:12,450 --> 00:47:15,120 L1 Loss doesn't just look at whether actual and predicted 997 00:47:15,120 --> 00:47:20,400 are equal to each other, but we take the absolute value of the actual value 998 00:47:20,400 --> 00:47:22,180 minus the predicted value. 999 00:47:22,180 --> 00:47:25,860 In other words, we just ask, how far apart were the actual 1000 00:47:25,860 --> 00:47:27,120 and predicted values? 1001 00:47:27,120 --> 00:47:29,280 And we sum that up across all of the data 1002 00:47:29,280 --> 00:47:33,870 points to be able to get what our answer ultimately is. 1003 00:47:33,870 --> 00:47:36,690 So what might this actually look like for our data set? 1004 00:47:36,690 --> 00:47:38,610 Well, if we go back to this representation 1005 00:47:38,610 --> 00:47:42,720 where we had advertising along the x-axis, sales along the y-axis, 1006 00:47:42,720 --> 00:47:45,210 our line was our prediction, our estimate 1007 00:47:45,210 --> 00:47:47,520 for any given amount of advertising-- what 1008 00:47:47,520 --> 00:47:50,080 we predicted sales was going to be. 1009 00:47:50,080 --> 00:47:54,420 And our L1 loss is just how far apart vertically along the sales 1010 00:47:54,420 --> 00:47:58,050 axis our prediction was from each of the data points. 
1011 00:47:58,050 --> 00:48:00,900 So we could figure out exactly how far apart our prediction 1012 00:48:00,900 --> 00:48:02,910 was from each of the data points and figure out 1013 00:48:02,910 --> 00:48:07,920 as a result of that what our loss is overall for this particular hypothesis 1014 00:48:07,920 --> 00:48:12,090 just by adding up all of these various different individual losses for each 1015 00:48:12,090 --> 00:48:13,110 of these data points. 1016 00:48:13,110 --> 00:48:15,790 And our goal then is to try and minimize that loss-- 1017 00:48:15,790 --> 00:48:20,550 to try and come up with some line that minimizes what the utility loss is 1018 00:48:20,550 --> 00:48:23,280 by judging how far away our estimate amount of sales 1019 00:48:23,280 --> 00:48:25,800 is from the actual amount of sales. 1020 00:48:25,800 --> 00:48:28,050 And turns out there are other loss functions, as well. 1021 00:48:28,050 --> 00:48:30,780 One that's quite popular is the L2 loss. 1022 00:48:30,780 --> 00:48:33,840 The L2 loss, instead of just using the absolute value, 1023 00:48:33,840 --> 00:48:37,350 like how far away the actual value is from the predicted value, 1024 00:48:37,350 --> 00:48:40,350 it uses the square of actual minus predicted. 1025 00:48:40,350 --> 00:48:43,470 So how far apart are the actual and predicted value, and it 1026 00:48:43,470 --> 00:48:48,150 squares that value, effectively penalizing much more harshly 1027 00:48:48,150 --> 00:48:50,290 anything that is a worse prediction. 1028 00:48:50,290 --> 00:48:52,620 So you imagine if you have two data points 1029 00:48:52,620 --> 00:48:57,210 that you predict as being one value away from their actual value 1030 00:48:57,210 --> 00:48:59,490 as opposed to one data point that you predict 1031 00:48:59,490 --> 00:49:04,290 as being two away from its actual value, the L2 loss function will more 1032 00:49:04,290 --> 00:49:07,560 harshly penalize that one that is two away because it's 1033 00:49:07,560 --> 00:49:11,430 going to square however much the differences between the actual value 1034 00:49:11,430 --> 00:49:12,520 and the predicted value. 1035 00:49:12,520 --> 00:49:14,310 And depending on the situation, you might 1036 00:49:14,310 --> 00:49:17,520 want to choose a loss function depending on what you care about minimizing. 1037 00:49:17,520 --> 00:49:20,883 If you really care about minimizing the error on more outlier cases, 1038 00:49:20,883 --> 00:49:23,050 then you might want to consider something like this. 1039 00:49:23,050 --> 00:49:25,592 But if you've got a lot of outliers and you don't necessarily 1040 00:49:25,592 --> 00:49:28,632 care about modeling them, then maybe an L1 loss function is preferable, 1041 00:49:28,632 --> 00:49:30,840 but there are trade-offs here that you need to decide 1042 00:49:30,840 --> 00:49:33,630 based on a particular set of data. 1043 00:49:33,630 --> 00:49:37,020 But what you do run the risk of with any of these lost functions with anything 1044 00:49:37,020 --> 00:49:40,232 that we're trying to do is a problem known as overfitting. 1045 00:49:40,232 --> 00:49:41,940 And overfitting is a big problem that you 1046 00:49:41,940 --> 00:49:44,910 can encounter in machine learning, which happens anytime 1047 00:49:44,910 --> 00:49:49,380 a model fits too closely with a data set, and as a result, 1048 00:49:49,380 --> 00:49:51,510 fails to generalize. 
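To make the difference between those two loss functions concrete before going further into overfitting, here is a minimal sketch of L1 and L2 loss side by side, with made-up sales numbers, showing how much more harshly L2 penalizes the one large miss.

    def l1_loss(actuals, predictions):
        # Sum of absolute differences between actual and predicted values.
        return sum(abs(a - p) for a, p in zip(actuals, predictions))

    def l2_loss(actuals, predictions):
        # Sum of squared differences: larger errors are penalized much more harshly.
        return sum((a - p) ** 2 for a, p in zip(actuals, predictions))

    actual_sales    = [2800, 3100, 2500]   # made-up figures
    predicted_sales = [2900, 3000, 1500]

    print(l1_loss(actual_sales, predicted_sales))   # 1200: the three misses add up linearly
    print(l2_loss(actual_sales, predicted_sales))   # 1020000: the 1000-dollar miss dominates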
1049 00:49:51,510 --> 00:49:55,200 We would like our model to be able to accurately predict 1050 00:49:55,200 --> 00:49:59,430 data and inputs and output pairs for the data that we have access to. 1051 00:49:59,430 --> 00:50:02,220 But the reason we wanted to do so is because we 1052 00:50:02,220 --> 00:50:06,690 want our model to generalize well to data that we haven't seen before. 1053 00:50:06,690 --> 00:50:10,350 I would like to take data from the past year of whether it was raining and not 1054 00:50:10,350 --> 00:50:12,707 raining and use that data to generalize it 1055 00:50:12,707 --> 00:50:15,540 towards the future-- to say in the future, is it going to be raining 1056 00:50:15,540 --> 00:50:16,170 or not raining. 1057 00:50:16,170 --> 00:50:19,890 Or if I have a whole bunch of data on what counterfeit and not counterfeit US 1058 00:50:19,890 --> 00:50:23,640 dollar bills looked liked in the past when people have encountered them, 1059 00:50:23,640 --> 00:50:26,550 I'd like to train a computer to be able to in the future, 1060 00:50:26,550 --> 00:50:32,020 generalize to other dollar bills that I might see as well. 1061 00:50:32,020 --> 00:50:36,540 And the problem with overfitting is that if you try and tie yourself too closely 1062 00:50:36,540 --> 00:50:39,450 to the data set that you're training your model on you 1063 00:50:39,450 --> 00:50:42,043 can end up not generalizing very well. 1064 00:50:42,043 --> 00:50:43,210 So what does this look like. 1065 00:50:43,210 --> 00:50:46,750 Well, we might imagine the rainy day and not rainy day example again from here, 1066 00:50:46,750 --> 00:50:49,440 where the blue points indicate rainy days and the red points 1067 00:50:49,440 --> 00:50:50,970 indicate not rainy days. 1068 00:50:50,970 --> 00:50:53,400 And we decided that we felt pretty comfortable 1069 00:50:53,400 --> 00:50:58,230 with drawing a line like this as the decision boundary between rainy days 1070 00:50:58,230 --> 00:50:59,090 and not rainy days. 1071 00:50:59,090 --> 00:51:01,860 That we can pretty comfortably say that points on this side, 1072 00:51:01,860 --> 00:51:04,860 are more likely to be rainy days, points on that side, 1073 00:51:04,860 --> 00:51:06,870 more likely to be not rainy days. 1074 00:51:06,870 --> 00:51:11,430 But the empirical loss isn't 0 in this particular case, 1075 00:51:11,430 --> 00:51:14,200 because we didn't categorize everything perfectly. 1076 00:51:14,200 --> 00:51:17,640 There was this one outlier this one day that it wasn't raining, 1077 00:51:17,640 --> 00:51:20,580 but yet our model steel still predicts that it is raining. 1078 00:51:20,580 --> 00:51:22,710 But that doesn't necessarily mean our model is bad. 1079 00:51:22,710 --> 00:51:25,820 It just means the model isn't 100% accurate. 1080 00:51:25,820 --> 00:51:28,700 If you really wanted to try and find a hypothesis that 1081 00:51:28,700 --> 00:51:32,142 resulted in minimizing the loss, you could come up 1082 00:51:32,142 --> 00:51:33,600 with a different decision boundary. 1083 00:51:33,600 --> 00:51:37,100 It wouldn't be a line, but it would look something like this. 1084 00:51:37,100 --> 00:51:41,090 This decision boundary does separate all of the red points 1085 00:51:41,090 --> 00:51:44,780 from all of the blue points because the red points fall 1086 00:51:44,780 --> 00:51:47,390 on this side of this decision boundary, the blue points 1087 00:51:47,390 --> 00:51:49,700 fall on the other side of the decision boundary. 
1088 00:51:49,700 --> 00:51:54,530 But this, we would probably argue, is not as good of a prediction. 1089 00:51:54,530 --> 00:51:59,600 Even though it seems to be more accurate based on all of the available training 1090 00:51:59,600 --> 00:52:02,700 data that we have for training this machine learning model, 1091 00:52:02,700 --> 00:52:05,330 we might say that it's probably not going to generalize well. 1092 00:52:05,330 --> 00:52:07,730 That if there were other data points like here and there, 1093 00:52:07,730 --> 00:52:11,270 we might still want to consider those to be rainy days, because we think 1094 00:52:11,270 --> 00:52:13,770 this was probably just an outlier. 1095 00:52:13,770 --> 00:52:17,480 So if the only thing you care about is minimizing the loss on the data 1096 00:52:17,480 --> 00:52:20,340 you have available to you, you run the risk of overfitting. 1097 00:52:20,340 --> 00:52:22,520 And this can happen in the misclassification case. 1098 00:52:22,520 --> 00:52:25,460 It can also happen in the regression case, 1099 00:52:25,460 --> 00:52:28,790 that here, we predicted what we thought was a pretty good line relating 1100 00:52:28,790 --> 00:52:32,720 advertising to sales, trying to predict what sales were going to be for a given 1101 00:52:32,720 --> 00:52:33,757 amount of advertising. 1102 00:52:33,757 --> 00:52:36,590 But I could come up with a line that does a better job of predicting 1103 00:52:36,590 --> 00:52:39,680 the training data, and it would be something that looks like this, 1104 00:52:39,680 --> 00:52:42,620 just connecting all of the various different data points. 1105 00:52:42,620 --> 00:52:44,750 And now, there is no loss at all. 1106 00:52:44,750 --> 00:52:48,620 Now I've perfectly predicted given any advertising what sales are, 1107 00:52:48,620 --> 00:52:52,370 and for all the data available to me, it's going to be accurate. 1108 00:52:52,370 --> 00:52:55,010 But it's probably not going to generalize very well. 1109 00:52:55,010 --> 00:53:00,060 I have overfit my model on the training data that is available to me. 1110 00:53:00,060 --> 00:53:02,060 And so in general, we want to avoid overfitting. 1111 00:53:02,060 --> 00:53:04,880 We'd like strategies to make sure that we haven't 1112 00:53:04,880 --> 00:53:07,190 overfit our model to a particular data set. 1113 00:53:07,190 --> 00:53:09,810 And there are a number of ways that you could try to do this. 1114 00:53:09,810 --> 00:53:12,830 One way is by examining what it is that we're optimizing for. 1115 00:53:12,830 --> 00:53:17,060 In an optimization problem, all we do is we say there is some cost, 1116 00:53:17,060 --> 00:53:19,610 and I want to minimize that cost. 1117 00:53:19,610 --> 00:53:24,410 And so far, we've defined that cost function-- the cost of a hypothesis 1118 00:53:24,410 --> 00:53:28,400 just as being equal to the empirical loss of that hypothesis. 1119 00:53:28,400 --> 00:53:32,750 How far away are the actual data points, the outputs, from 1120 00:53:32,750 --> 00:53:36,400 what I predicted them to be based on that particular hypothesis. 1121 00:53:36,400 --> 00:53:38,150 And if all you're trying to do is minimize 1122 00:53:38,150 --> 00:53:40,910 cost, meaning minimizing the loss in this case, 1123 00:53:40,910 --> 00:53:44,030 then the result is going to be that you might overfit. 1124 00:53:44,030 --> 00:53:48,440 That to minimize cost, you're going to try and find a way to perfectly match 1125 00:53:48,440 --> 00:53:49,820 all of the input data.
1126 00:53:49,820 --> 00:53:54,020 And that might happen as a result of overfitting on that particular input 1127 00:53:54,020 --> 00:53:55,650 data. 1128 00:53:55,650 --> 00:53:59,660 So in order to address this, you could add something to the cost function. 1129 00:53:59,660 --> 00:54:00,880 What counts as cost? 1130 00:54:00,880 --> 00:54:04,190 Well, not just loss, but also, some measure 1131 00:54:04,190 --> 00:54:08,680 of the complexity of the hypothesis, where the complexity of the hypothesis 1132 00:54:08,680 --> 00:54:12,440 is something that you would need to define for how complicated 1133 00:54:12,440 --> 00:54:13,250 our line looks. 1134 00:54:13,250 --> 00:54:15,830 This is sort of an Occam's razor style approach, where 1135 00:54:15,830 --> 00:54:19,160 we want to give preference to a simpler decision boundary-- 1136 00:54:19,160 --> 00:54:20,990 like a straight line for example. 1137 00:54:20,990 --> 00:54:24,710 Some simpler curve as opposed to something far more complex 1138 00:54:24,710 --> 00:54:26,840 that might represent the training data better, 1139 00:54:26,840 --> 00:54:29,150 but might not generalize as well-- we'll generally 1140 00:54:29,150 --> 00:54:33,860 say that a simpler solution is probably the better solution and probably 1141 00:54:33,860 --> 00:54:38,460 the one that is more likely to generalize well to other inputs. 1142 00:54:38,460 --> 00:54:42,020 So we measure what the loss is, but we also measure the complexity. 1143 00:54:42,020 --> 00:54:45,770 And now that all gets taken into account when we consider the overall cost. 1144 00:54:45,770 --> 00:54:50,040 That yes, something might have less loss if it better predicts the training data, 1145 00:54:50,040 --> 00:54:52,940 but if it's much more complex, it still might not 1146 00:54:52,940 --> 00:54:55,230 be the best option that we have. 1147 00:54:55,230 --> 00:54:58,930 And we need to come up with some balance between loss and complexity. 1148 00:54:58,930 --> 00:55:01,180 And for that reason, you'll often see this represented 1149 00:55:01,180 --> 00:55:04,730 as multiplying the complexity by some parameter 1150 00:55:04,730 --> 00:55:08,480 that we have to choose-- parameter lambda in this case, where we're saying 1151 00:55:08,480 --> 00:55:11,300 if lambda has a greater value, then we really want 1152 00:55:11,300 --> 00:55:13,852 to penalize more complex hypotheses. 1153 00:55:13,852 --> 00:55:15,560 Whereas if lambda is smaller, we're going 1154 00:55:15,560 --> 00:55:18,470 to penalize more complex hypotheses only a little bit. 1155 00:55:18,470 --> 00:55:21,620 And it's up to the machine learning programmer 1156 00:55:21,620 --> 00:55:24,470 to decide where they want to set that value of lambda 1157 00:55:24,470 --> 00:55:28,430 for how much do I want to penalize a more complex hypothesis that 1158 00:55:28,430 --> 00:55:30,293 might fit the data a little better. 1159 00:55:30,293 --> 00:55:32,960 And again, there is no one right answer to a lot of these things; 1160 00:55:32,960 --> 00:55:36,362 depending on the data set, depending on the data you have available to you, 1161 00:55:36,362 --> 00:55:39,320 and the problem you're trying to solve, your choice of these parameters 1162 00:55:39,320 --> 00:55:39,868 may vary. 1163 00:55:39,868 --> 00:55:41,660 And you may need to experiment a little bit 1164 00:55:41,660 --> 00:55:45,800 to figure out what the right choice of that is ultimately going to be.
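As a rough sketch of that idea, one might score hypotheses with cost = loss + lambda * complexity; the data, the choice of polynomial degree as a stand-in for complexity, and the value of lambda below are all made up for illustration.

    def squared_loss(h, data):
        # Empirical loss: sum of squared errors of the hypothesis on the data.
        return sum((y - h(x)) ** 2 for x, y in data)

    def cost(h, data, degree, lam):
        # cost(h) = loss(h) + lambda * complexity(h), using degree as a crude complexity measure.
        return squared_loss(h, data) + lam * degree

    data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]   # made-up points near the line y = x

    def simple(x):
        # A plain straight line, y = x.
        return x

    def memorized(x):
        # A hypothesis that "memorizes" every training point exactly.
        return {1: 1.1, 2: 1.9, 3: 3.2, 4: 3.9}[x]

    print(cost(simple, data, degree=1, lam=1.0))      # small loss plus a small complexity penalty
    print(cost(memorized, data, degree=10, lam=1.0))  # zero loss, but the complexity penalty makes it worse overall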
1165 00:55:45,800 --> 00:55:49,100 This process then of considering not only a loss, but also 1166 00:55:49,100 --> 00:55:53,000 some measure of the complexity is known as regularization. 1167 00:55:53,000 --> 00:55:55,610 Regularization is the process of penalizing 1168 00:55:55,610 --> 00:55:58,260 a hypothesis that is more complex. 1169 00:55:58,260 --> 00:56:00,950 In order to favor a simple or hypothesis that 1170 00:56:00,950 --> 00:56:03,680 is more likely to generalize well-- more likely to be 1171 00:56:03,680 --> 00:56:08,210 able to apply to other situations that are dealing with other input points 1172 00:56:08,210 --> 00:56:11,540 unlike the ones that we've necessarily seen before. 1173 00:56:11,540 --> 00:56:15,500 So oftentimes, you'll see us add some regularizing term 1174 00:56:15,500 --> 00:56:17,570 to what we're trying to minimize it in order 1175 00:56:17,570 --> 00:56:21,220 to avoid this problem of overfitting. 1176 00:56:21,220 --> 00:56:25,780 Now another way of making sure we don't overfit is to run some experiments 1177 00:56:25,780 --> 00:56:29,950 and to see whether or not we are able to generalize our model that we've 1178 00:56:29,950 --> 00:56:33,250 created to other data sets as well. 1179 00:56:33,250 --> 00:56:35,320 And it's for that reason that oftentimes, 1180 00:56:35,320 --> 00:56:37,800 when you're doing a machine learning experiment, when you've got some data 1181 00:56:37,800 --> 00:56:40,717 and you want to try and come up with some function that predicts given 1182 00:56:40,717 --> 00:56:44,230 some input, what the output is going to be, you don't necessarily 1183 00:56:44,230 --> 00:56:48,340 want to do your training on all of the data you have available to you. 1184 00:56:48,340 --> 00:56:52,410 That you could employ a method known as holdout cross-validation. 1185 00:56:52,410 --> 00:56:55,480 Where in holdout cross-validation, we split up our data. 1186 00:56:55,480 --> 00:57:00,373 We split up our data into a training set and a testing set. 1187 00:57:00,373 --> 00:57:02,290 The training set is the set of data that we're 1188 00:57:02,290 --> 00:57:04,870 going to use to train our machine learning model. 1189 00:57:04,870 --> 00:57:07,440 And the testing set is the set of data that we 1190 00:57:07,440 --> 00:57:11,260 are going to use in order to test to see how well our machine learning 1191 00:57:11,260 --> 00:57:13,660 model actually performed. 1192 00:57:13,660 --> 00:57:15,757 So the learning happens on the training set. 1193 00:57:15,757 --> 00:57:17,590 We figure out what the parameters should be, 1194 00:57:17,590 --> 00:57:19,660 we figure out what the right model is. 1195 00:57:19,660 --> 00:57:22,360 And that we see, all right, now that we've trained the model, 1196 00:57:22,360 --> 00:57:25,960 see how well it does at predicting things and inside of the testing 1197 00:57:25,960 --> 00:57:29,342 set, some set of data that we haven't seen before. 1198 00:57:29,342 --> 00:57:32,300 And the hope then is that we're going to be able to predict the testing 1199 00:57:32,300 --> 00:57:36,460 set pretty well if we're able to generalize based on the training 1200 00:57:36,460 --> 00:57:38,028 data that's available to us. 
1201 00:57:38,028 --> 00:57:39,820 If we've overfit the training data, though, 1202 00:57:39,820 --> 00:57:43,112 and we're not able to generalize, then when we look at the testing set, 1203 00:57:43,112 --> 00:57:45,070 it's likely going to be the case that we're not 1204 00:57:45,070 --> 00:57:49,100 going to predict things from the testing set nearly as effectively. 1205 00:57:49,100 --> 00:57:51,850 So this is one method of cross-validation-- validating 1206 00:57:51,850 --> 00:57:55,270 to make sure that the work we have done is actually going to generalize 1207 00:57:55,270 --> 00:57:56,770 to other data sets as well. 1208 00:57:56,770 --> 00:57:59,620 And there are other statistical techniques we can use, as well. 1209 00:57:59,620 --> 00:58:03,520 One of the downsides of this just holdout cross-validation is if you say, 1210 00:58:03,520 --> 00:58:07,960 I just split it 50/50, I train using 50% of the data and test using 1211 00:58:07,960 --> 00:58:11,110 the other 50%, or you could choose other percentages as well, 1212 00:58:11,110 --> 00:58:15,190 is that there is a fair amount of data that I am now not using 1213 00:58:15,190 --> 00:58:19,670 to train that I might be able to get a better model as a result, for example. 1214 00:58:19,670 --> 00:58:23,500 So one approach is known as k-fold cross-validation. 1215 00:58:23,500 --> 00:58:26,830 In k-fold cross-validation, rather than just divide things 1216 00:58:26,830 --> 00:58:31,980 into two sets and run one experiment, we divide things into k different sets 1217 00:58:31,980 --> 00:58:34,810 and maybe I divide things up into 10 different sets, 1218 00:58:34,810 --> 00:58:37,270 and then run 10 different experiments. 1219 00:58:37,270 --> 00:58:40,750 So if I split up my data into 10 different sets of data, 1220 00:58:40,750 --> 00:58:44,410 then what I'll do is each time for each of my 10 experiments, 1221 00:58:44,410 --> 00:58:47,020 I will hold out one of those sets of data, 1222 00:58:47,020 --> 00:58:50,320 where I'll say, let me train my model on these nine sets, 1223 00:58:50,320 --> 00:58:54,070 and then test to see how well it predicts on set number 10. 1224 00:58:54,070 --> 00:58:57,490 And then pick another set of nine sets to train on, and then 1225 00:58:57,490 --> 00:58:59,320 test it on the other one that I held out, 1226 00:58:59,320 --> 00:59:03,550 where each time, I train the model on everything minus the one set 1227 00:59:03,550 --> 00:59:07,300 that I'm holding out, and then test to see how well our model performs 1228 00:59:07,300 --> 00:59:09,100 on the test that I did hold out. 1229 00:59:09,100 --> 00:59:12,250 And what you end up getting is 10 different results, 10 different answers 1230 00:59:12,250 --> 00:59:14,470 for how accurately our model worked. 1231 00:59:14,470 --> 00:59:16,870 And oftentimes, you can just take the average of those 10 1232 00:59:16,870 --> 00:59:21,160 to get an approximation for how well we think our model performs overall. 1233 00:59:21,160 --> 00:59:25,270 But the key idea is separating the training data from the testing data, 1234 00:59:25,270 --> 00:59:29,170 because you want to test your model on data that is different from what 1235 00:59:29,170 --> 00:59:30,273 you trained the model on. 1236 00:59:30,273 --> 00:59:32,440 Because the training, you want to avoid overfitting, 1237 00:59:32,440 --> 00:59:33,940 you want to be able to generalize. 
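With scikit-learn, a sketch of k-fold cross-validation might look like the following; the data here is made up, and cross_val_score takes care of the splitting, training, and testing for each of the k folds.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Made-up inputs and labels, standing in for something like the weather data.
    X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5], [0.5, 0.5], [4.5, 4.5]]
    y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

    model = KNeighborsClassifier(n_neighbors=1)

    # Run 5 experiments: each time train on 4/5 of the data and test on the held-out 1/5.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores)          # 5 accuracy values, one per held-out fold
    print(scores.mean())   # the average accuracy across the folds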
1238 00:59:33,940 --> 00:59:36,190 And the way you test whether you're able to generalize 1239 00:59:36,190 --> 00:59:39,580 and is by looking at some data that you haven't seen before 1240 00:59:39,580 --> 00:59:43,380 and seeing how well we're actually able to perform. 1241 00:59:43,380 --> 00:59:46,010 And so if we want to actually implement any of these techniques 1242 00:59:46,010 --> 00:59:48,980 inside of a programming language like Python, a number of ways 1243 00:59:48,980 --> 00:59:49,760 we could do that. 1244 00:59:49,760 --> 00:59:52,007 We could write this from scratch on our own, 1245 00:59:52,007 --> 00:59:53,840 but there are libraries out there that allow 1246 00:59:53,840 --> 00:59:57,290 us to take advantage of existing implementations of these algorithms-- 1247 00:59:57,290 --> 01:00:00,080 that we can use the same types of algorithms 1248 01:00:00,080 --> 01:00:01,950 in a lot of different situations. 1249 01:00:01,950 --> 01:00:03,770 And so there is a library, very popular one 1250 01:00:03,770 --> 01:00:07,240 known as scikit-learn, which allows us in Python to be 1251 01:00:07,240 --> 01:00:10,490 able to very quickly get set up with a lot of these different machine learning 1252 01:00:10,490 --> 01:00:10,990 models. 1253 01:00:10,990 --> 01:00:14,210 So this library has already written an algorithm for nearest neighbor 1254 01:00:14,210 --> 01:00:16,430 classification, for doing perceptron learning, 1255 01:00:16,430 --> 01:00:19,035 for doing a bunch of other types of inference 1256 01:00:19,035 --> 01:00:21,410 and supervised learning that we haven't yet talked about. 1257 01:00:21,410 --> 01:00:26,840 But using it, we can begin to try actually testing how these methods work 1258 01:00:26,840 --> 01:00:29,330 and how accurately they perform. 1259 01:00:29,330 --> 01:00:31,900 So let's go ahead and take a look at one approach to trying 1260 01:00:31,900 --> 01:00:33,920 to solve this type of problem. 1261 01:00:33,920 --> 01:00:37,280 All right, so I'm first going to pull up banknotes.csv, 1262 01:00:37,280 --> 01:00:40,970 which is a whole bunch of data provided by UC Irvine, which has information 1263 01:00:40,970 --> 01:00:43,160 about various different banknotes. 1264 01:00:43,160 --> 01:00:45,440 So people took pictures of various different banknotes 1265 01:00:45,440 --> 01:00:48,530 and measured various different properties of those banknotes. 1266 01:00:48,530 --> 01:00:51,350 And in particular, some human categorized each of those 1267 01:00:51,350 --> 01:00:55,810 banknotes as either a counterfeit bank note or as not counterfeit. 1268 01:00:55,810 --> 01:00:59,540 And so what you're looking at here is each row represents one banknote. 1269 01:00:59,540 --> 01:01:01,910 This is formatted as a CSV spreadsheet, where just 1270 01:01:01,910 --> 01:01:05,780 comma-separated value separating each of these various different fields. 1271 01:01:05,780 --> 01:01:08,990 We have four different input values for each 1272 01:01:08,990 --> 01:01:12,410 of these data points, just information, some measurement that 1273 01:01:12,410 --> 01:01:13,460 was made on the banknote. 1274 01:01:13,460 --> 01:01:16,340 And what those measurements exactly aren't as important as the fact 1275 01:01:16,340 --> 01:01:18,320 that we do have access to this data. 
1276 01:01:18,320 --> 01:01:21,950 But more importantly, we have access for each of these data points 1277 01:01:21,950 --> 01:01:26,210 to a label, where 0 indicates something like this was not a counterfeit bill, 1278 01:01:26,210 --> 01:01:27,920 meaning it was an authentic bill. 1279 01:01:27,920 --> 01:01:32,840 And a data point labeled 1 means that it is a counterfeit bill at least, 1280 01:01:32,840 --> 01:01:36,240 according to the human researcher who labeled this particular data. 1281 01:01:36,240 --> 01:01:38,360 So we have a whole bunch of data representing 1282 01:01:38,360 --> 01:01:40,940 a whole bunch of different data points, each of which 1283 01:01:40,940 --> 01:01:44,360 has these various different measurements that were made on that particular bill. 1284 01:01:44,360 --> 01:01:47,300 And each of which has an output value 0 or 1-- 1285 01:01:47,300 --> 01:01:53,033 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill. 1286 01:01:53,033 --> 01:01:54,950 And what we would like to do is use supervised 1287 01:01:54,950 --> 01:01:58,550 learning to begin to predict or model some sort of function 1288 01:01:58,550 --> 01:02:02,570 that can take these four values as input and predict what the output would be. 1289 01:02:02,570 --> 01:02:05,777 We want our learning algorithm to find some sort of pattern that 1290 01:02:05,777 --> 01:02:08,110 is able to predict based on these measurements something 1291 01:02:08,110 --> 01:02:10,690 that you could measure just by taking a photo of a bill-- 1292 01:02:10,690 --> 01:02:16,090 predict whether that bill is authentic or whether that bill is counterfeit. 1293 01:02:16,090 --> 01:02:17,730 And so how can we do that? 1294 01:02:17,730 --> 01:02:21,020 Well, I'm first going to open up banknotes0.py and see 1295 01:02:21,020 --> 01:02:22,410 how it is that we do this. 1296 01:02:22,410 --> 01:02:26,000 I'm first importing a lot of things from scikit-learn, 1297 01:02:26,000 --> 01:02:29,753 but importantly, I'm going to set my model equal to the perceptron 1298 01:02:29,753 --> 01:02:32,420 model, which is one of those models that we talked about before. 1299 01:02:32,420 --> 01:02:35,120 We're just going to try and figure out some setting of weights 1300 01:02:35,120 --> 01:02:38,950 that is able to divide our data into two different groups. 1301 01:02:38,950 --> 01:02:43,270 Then I'm going to go ahead and read data in from my file from banknotes.csv. 1302 01:02:43,270 --> 01:02:46,660 And basically, for every row, I'm going to separate that row 1303 01:02:46,660 --> 01:02:51,460 into the first four values of that row, which is the evidence for that row. 1304 01:02:51,460 --> 01:02:56,920 And then the label where if the final column in that row is 0, 1305 01:02:56,920 --> 01:03:00,770 the label is authentic, and otherwise, it's going to be counterfeit. 1306 01:03:00,770 --> 01:03:03,880 So I'm effectively reading data in from the CSV file, 1307 01:03:03,880 --> 01:03:07,330 dividing it into a whole bunch of rows, where each row has some evidence-- 1308 01:03:07,330 --> 01:03:11,380 those four input values that are going to be inputs to my hypothesis function. 1309 01:03:11,380 --> 01:03:14,710 And then the label, the output, whether it is authentic or counterfeit. 1310 01:03:14,710 --> 01:03:17,270 That is the thing that I am then trying to predict. 
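The file banknotes0.py itself isn't reproduced in this transcript, but a rough approximation of the reading step being described might look like the following; the column layout and label encoding follow the description above, while details such as whether the CSV has a header row are assumptions.

    import csv

    from sklearn.linear_model import Perceptron

    # The perceptron model described above.
    model = Perceptron()

    # Read the data in: four measurements per row as the evidence, then a 0/1 label,
    # where 0 is treated as authentic and 1 as counterfeit.
    # (Assumes no header row; a real file might need one skipped first.)
    data = []
    with open("banknotes.csv") as f:
        for row in csv.reader(f):
            data.append({
                "evidence": [float(cell) for cell in row[:4]],
                "label": "Authentic" if row[4] == "0" else "Counterfeit"
            })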
1311 01:03:17,270 --> 01:03:20,650 So the next step is that I would like to split up my data set into a training 1312 01:03:20,650 --> 01:03:22,990 set and a testing set-- some set of data 1313 01:03:22,990 --> 01:03:26,260 that I would like to train my machine learning model on and some set of data 1314 01:03:26,260 --> 01:03:29,500 that I would like to use to test that model, see how well it performed. 1315 01:03:29,500 --> 01:03:32,480 So what I'll do is I'll go ahead and figure out the length of the data, 1316 01:03:32,480 --> 01:03:34,150 how many data points do I have. 1317 01:03:34,150 --> 01:03:35,925 I'll go ahead and take half of them, save 1318 01:03:35,925 --> 01:03:37,550 that number as a number called holdout. 1319 01:03:37,550 --> 01:03:39,930 That is how many items I'm going to hold out from my data 1320 01:03:39,930 --> 01:03:42,400 set to save for the testing phase. 1321 01:03:42,400 --> 01:03:45,250 I'll randomly shuffle the data so it's in some random order. 1322 01:03:45,250 --> 01:03:50,440 And then I'll say my testing set will be all of the data up to the holdout. 1323 01:03:50,440 --> 01:03:54,790 So I'll hold out that many data items, and that will be my testing set. 1324 01:03:54,790 --> 01:03:58,060 My training data will be everything else-- the information 1325 01:03:58,060 --> 01:04:00,850 that I'm going to train my model on. 1326 01:04:00,850 --> 01:04:04,870 And then I'll say, I need to divide up my training data 1327 01:04:04,870 --> 01:04:06,040 into two different sets. 1328 01:04:06,040 --> 01:04:10,750 I need to divide it into my x values, where x here represents the inputs. 1329 01:04:10,750 --> 01:04:14,260 So the x values then that I'm going to train on are basically, 1330 01:04:14,260 --> 01:04:16,330 for every row in my training set, I'm going 1331 01:04:16,330 --> 01:04:19,150 to get the evidence for that row, those four values, 1332 01:04:19,150 --> 01:04:21,940 where it's basically a vector of four numbers, and 1333 01:04:21,940 --> 01:04:24,010 that is going to be all of the input. 1334 01:04:24,010 --> 01:04:27,460 And then I need the y values-- what are the outputs that I want to learn from, 1335 01:04:27,460 --> 01:04:30,567 the labels that belong to each of these various different input points. 1336 01:04:30,567 --> 01:04:33,650 Well, that's going to be the same thing for each row in the training data. 1337 01:04:33,650 --> 01:04:36,430 But this time, I take that row and get what it's labeled as, 1338 01:04:36,430 --> 01:04:38,680 whether it is authentic or counterfeit. 1339 01:04:38,680 --> 01:04:43,270 So I end up with one list of all of these vectors of my input data 1340 01:04:43,270 --> 01:04:45,790 and one list which follows the same order, 1341 01:04:45,790 --> 01:04:49,690 but has all of the labels that correspond with each of those vectors. 1342 01:04:49,690 --> 01:04:51,610 And then to train my model, which in this case 1343 01:04:51,610 --> 01:04:54,190 is just this perceptron model, I just call 1344 01:04:54,190 --> 01:04:57,370 model.fit, pass in the training data, and what 1345 01:04:57,370 --> 01:04:59,710 the labels for those training data are. 1346 01:04:59,710 --> 01:05:02,120 And scikit-learn will take care of fitting the model-- 1347 01:05:02,120 --> 01:05:04,250 will do the entire algorithm for me. 1348 01:05:04,250 --> 01:05:08,320 And then when it's done, I can then test to see how well that model performed.
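[Code sketch: an approximation of the by-hand split and training just described, reusing the `data` list from the previous sketch; the variable names are assumptions, not the course's exact code.]

    import random
    from sklearn.linear_model import Perceptron

    # `data` is the list of {"evidence": [...], "label": ...} dictionaries
    # built in the previous sketch.
    model = Perceptron()

    # Hold out half of the data points for the testing phase.
    holdout = int(0.50 * len(data))
    random.shuffle(data)
    testing = data[:holdout]
    training = data[holdout:]

    # Separate the training data into input vectors (X) and labels (y).
    X_training = [row["evidence"] for row in training]
    y_training = [row["label"] for row in training]

    # scikit-learn takes care of fitting the model to the training data.
    model.fit(X_training, y_training)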
1349 01:05:08,320 --> 01:05:10,780 So I can say, let me get all of these input 1350 01:05:10,780 --> 01:05:13,040 vectors for what I want to test on. 1351 01:05:13,040 --> 01:05:16,870 So for each row in my testing data set, go ahead and get the evidence. 1352 01:05:16,870 --> 01:05:20,470 And the y values, those are what the actual values were-- 1353 01:05:20,470 --> 01:05:24,610 for each of the rows in the testing data set, what the actual label is. 1354 01:05:24,610 --> 01:05:26,860 But then I'm going to generate some predictions. 1355 01:05:26,860 --> 01:05:29,350 I'm going to use this model and try and predict-- 1356 01:05:29,350 --> 01:05:31,480 based on the testing vectors-- 1357 01:05:31,480 --> 01:05:33,910 I want to predict what the output is. 1358 01:05:33,910 --> 01:05:38,230 And my goal then is to now compare y testing with predictions. 1359 01:05:38,230 --> 01:05:41,440 I want to see how well my predictions based on the model 1360 01:05:41,440 --> 01:05:45,310 actually reflect what the y values were, what the outputs were 1361 01:05:45,310 --> 01:05:46,540 that were actually labeled. 1362 01:05:46,540 --> 01:05:48,580 Because I now have this labeled data, I can 1363 01:05:48,580 --> 01:05:51,540 assess how well the algorithm worked. 1364 01:05:51,540 --> 01:05:55,330 And so now I can just compute how well we did. 1365 01:05:55,330 --> 01:05:58,840 This zip function basically just lets me loop through two different lists, one 1366 01:05:58,840 --> 01:06:00,500 by one at the same time. 1367 01:06:00,500 --> 01:06:04,113 So for each actual value and for each predicted value, 1368 01:06:04,113 --> 01:06:06,280 if the actual is the same thing as what I predicted, 1369 01:06:06,280 --> 01:06:08,570 I'll go ahead and increment the counter by 1. 1370 01:06:08,570 --> 01:06:11,727 Otherwise, I'll increment my incorrect counter by 1. 1371 01:06:11,727 --> 01:06:14,060 And so at the end, I can print out here are the results, 1372 01:06:14,060 --> 01:06:16,435 here's how many I got right, here's how many I got wrong. 1373 01:06:16,435 --> 01:06:19,660 And here was my overall accuracy, for example. 1374 01:06:19,660 --> 01:06:21,100 So I can go ahead and run this. 1375 01:06:21,100 --> 01:06:24,790 I can run Python banknotes0.py. 1376 01:06:24,790 --> 01:06:28,810 And it's going to train on half the data set and then test on half the data set. 1377 01:06:28,810 --> 01:06:31,090 And here are the results from my perceptron model. 1378 01:06:31,090 --> 01:06:35,540 In this case, it was able to correctly classify 679 bills 1379 01:06:35,540 --> 01:06:38,170 as either authentic or counterfeit, 1380 01:06:38,170 --> 01:06:43,580 and incorrectly classified seven of them, for an overall accuracy of close to 99% 1381 01:06:43,580 --> 01:06:44,080 accurate. 1382 01:06:44,080 --> 01:06:47,230 So on this particular data set, using this perceptron model, 1383 01:06:47,230 --> 01:06:51,310 we were able to predict very well what the output was going to be. 1384 01:06:51,310 --> 01:06:53,010 And we can try different models, too. 1385 01:06:53,010 --> 01:06:55,390 That scikit-learn makes it very easy just to swap out 1386 01:06:55,390 --> 01:06:57,970 one model for another model.
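[Code sketch: roughly what that prediction-and-counting step might look like, continuing the same sketch, before other models are swapped in.]

    # Separate the testing data the same way, then ask the model for predictions.
    X_testing = [row["evidence"] for row in testing]
    y_testing = [row["label"] for row in testing]
    predictions = model.predict(X_testing)

    # Compare each actual label against the corresponding prediction.
    correct = 0
    incorrect = 0
    for actual, predicted in zip(y_testing, predictions):
        if actual == predicted:
            correct += 1
        else:
            incorrect += 1

    print(f"Results for model {type(model).__name__}")
    print(f"Correct: {correct}")
    print(f"Incorrect: {incorrect}")
    print(f"Accuracy: {100 * correct / (correct + incorrect):.2f}%")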
1387 01:06:57,970 --> 01:07:02,710 So instead of the perceptron model, I can use a support vector machine 1388 01:07:02,710 --> 01:07:06,490 using the SVC, otherwise known as a support vector classifier, 1389 01:07:06,490 --> 01:07:08,950 using a support vector machine to classify things 1390 01:07:08,950 --> 01:07:10,820 into two different groups. 1391 01:07:10,820 --> 01:07:14,270 And now see, all right, how well does this perform. 1392 01:07:14,270 --> 01:07:17,630 And this time, we were able to correctly predict 682 1393 01:07:17,630 --> 01:07:22,280 and incorrectly predicted four, for an accuracy of 99.4%. 1394 01:07:22,280 --> 01:07:26,990 And we could even try the KNeighborsClassifier as the model 1395 01:07:26,990 --> 01:07:27,740 instead. 1396 01:07:27,740 --> 01:07:31,270 And this takes a parameter n_neighbors for how many neighbors 1397 01:07:31,270 --> 01:07:32,133 you want to look at. 1398 01:07:32,133 --> 01:07:34,550 Let's just look at one neighbor, the one nearest neighbor, 1399 01:07:34,550 --> 01:07:36,080 and use that to predict. 1400 01:07:36,080 --> 01:07:38,250 Go ahead and run this as well. 1401 01:07:38,250 --> 01:07:40,880 And it looks like, based on the KNeighborsClassifier looking 1402 01:07:40,880 --> 01:07:45,350 at just one neighbor, we were able to correctly classify 685 data points, 1403 01:07:45,350 --> 01:07:47,420 incorrectly classified one. 1404 01:07:47,420 --> 01:07:50,560 Maybe let's try three neighbors instead of just using one neighbor, 1405 01:07:50,560 --> 01:07:52,560 do more of a k-nearest-neighbors approach, where 1406 01:07:52,560 --> 01:07:55,693 I look at the three nearest neighbors and see how that performs. 1407 01:07:55,693 --> 01:07:57,610 And that one in this case seems to have gotten 1408 01:07:57,610 --> 01:08:05,330 100% of all of the predictions correct, classifying each one as either an authentic banknote 1409 01:08:05,330 --> 01:08:07,370 or as a counterfeit banknote. 1410 01:08:07,370 --> 01:08:09,470 And we could run these experiments multiple times. 1411 01:08:09,470 --> 01:08:12,082 Because I'm randomly reorganizing the data every time, 1412 01:08:12,082 --> 01:08:14,790 we're technically training these on slightly different data sets. 1413 01:08:14,790 --> 01:08:16,832 And so you might want to run multiple experiments 1414 01:08:16,832 --> 01:08:19,250 to really see how well they're actually going to perform. 1415 01:08:19,250 --> 01:08:20,963 But in short, they all perform very well. 1416 01:08:20,963 --> 01:08:23,630 And while some of them perform slightly better than others here, 1417 01:08:23,630 --> 01:08:26,240 that might not always be the case for every data set, 1418 01:08:26,240 --> 01:08:29,240 but you can begin to test now by very quickly putting together 1419 01:08:29,240 --> 01:08:31,790 these machine learning models using scikit-learn 1420 01:08:31,790 --> 01:08:33,920 to be able to train on some training set, 1421 01:08:33,920 --> 01:08:36,814 and then test on some testing set as well. 1422 01:08:36,814 --> 01:08:39,439 And this splitting up into training groups and testing groups, 1423 01:08:39,439 --> 01:08:43,279 and then testing, happens so often that scikit-learn has functions built in 1424 01:08:43,279 --> 01:08:44,160 for doing it. 1425 01:08:44,160 --> 01:08:46,100 I did it all by hand just now.
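[Code sketch: because scikit-learn gives these classifiers the same fit/predict interface, swapping models is roughly a one-line change; reassigning `model` repeatedly here is just to list the options described above.]

    from sklearn.linear_model import Perceptron
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Any one of these could serve as `model` in the sketch above;
    # everything else in the script stays the same.
    model = Perceptron()
    model = SVC()                                # support vector classifier
    model = KNeighborsClassifier(n_neighbors=1)  # 1 nearest neighbor
    model = KNeighborsClassifier(n_neighbors=3)  # 3 nearest neighbors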
1426 01:08:46,100 --> 01:08:48,770 But if we take a look at banknotes1, we take 1427 01:08:48,770 --> 01:08:52,100 advantage of some other features that exist in scikit-learn, 1428 01:08:52,100 --> 01:08:55,399 where we can really simplify a lot of our logic. 1429 01:08:55,399 --> 01:08:59,609 That there is a function built into scikit-learn called train_test_split, 1430 01:08:59,609 --> 01:09:03,140 which will automatically split data into a training group and a testing group. 1431 01:09:03,140 --> 01:09:05,536 I just have to say what proportion should 1432 01:09:05,536 --> 01:09:07,369 be in the testing group, something like 0.5, 1433 01:09:07,369 --> 01:09:10,080 half the data, inside the testing group. 1434 01:09:10,080 --> 01:09:12,380 Then I can fit the model on the training data, 1435 01:09:12,380 --> 01:09:15,859 make the predictions on the testing data, and then just count up. 1436 01:09:15,859 --> 01:09:18,830 And scikit-learn has some nice methods for just counting up 1437 01:09:18,830 --> 01:09:22,250 how many times our testing data matched the predictions, how 1438 01:09:22,250 --> 01:09:25,439 many times our testing data didn't match the predictions. 1439 01:09:25,439 --> 01:09:27,680 So very quickly, you can write programs with not all 1440 01:09:27,680 --> 01:09:30,500 that many lines of code-- it's maybe, like, 40 lines of code 1441 01:09:30,500 --> 01:09:32,540 to get through all of these predictions. 1442 01:09:32,540 --> 01:09:35,420 And then as a result, see how well we're able to do. 1443 01:09:35,420 --> 01:09:38,330 So these types of libraries can allow us without really 1444 01:09:38,330 --> 01:09:40,970 knowing the implementation details of these algorithms 1445 01:09:40,970 --> 01:09:43,939 to be able to use the algorithms in a very practical way 1446 01:09:43,939 --> 01:09:47,279 to be able to solve these types of problems. 1447 01:09:47,279 --> 01:09:49,939 So that then is supervised learning-- this task 1448 01:09:49,939 --> 01:09:52,972 of, given a whole set of data, some input-output pairs, 1449 01:09:52,972 --> 01:09:54,680 we would like to learn some function that 1450 01:09:54,680 --> 01:09:57,110 maps those inputs to those outputs. 1451 01:09:57,110 --> 01:09:59,630 But it turns out there are other forms of learning, as well. 1452 01:09:59,630 --> 01:10:02,870 And another popular type of machine learning, especially nowadays, 1453 01:10:02,870 --> 01:10:05,090 is known as reinforcement learning. 1454 01:10:05,090 --> 01:10:07,250 And the idea of reinforcement learning is 1455 01:10:07,250 --> 01:10:09,140 rather than just being given a whole data set 1456 01:10:09,140 --> 01:10:12,470 at the beginning of input-output pairs, reinforcement learning 1457 01:10:12,470 --> 01:10:14,680 is all about learning from experience. 1458 01:10:14,680 --> 01:10:17,390 And in reinforcement learning, we have an agent, whether it's 1459 01:10:17,390 --> 01:10:19,550 a physical robot that's trying to take actions 1460 01:10:19,550 --> 01:10:23,750 in the world or just some virtual agent, a program running somewhere. 1461 01:10:23,750 --> 01:10:27,560 Our agent is going to be given a set of rewards or punishments 1462 01:10:27,560 --> 01:10:29,510 in the form of numerical values, but you can 1463 01:10:29,510 --> 01:10:31,430 think of them as reward or punishment. 1464 01:10:31,430 --> 01:10:35,480 And based on that, it learns what actions to take in the future. 1465 01:10:35,480 --> 01:10:39,470 So our agent, our AI, will be put in some sort of environment.
1466 01:10:39,470 --> 01:10:42,537 It will make some actions and based on the actions that it makes, 1467 01:10:42,537 --> 01:10:43,370 it learns something. 1468 01:10:43,370 --> 01:10:45,537 It either gets a reward when it does something well, 1469 01:10:45,537 --> 01:10:47,720 or it gets a punishment when it does something poorly. 1470 01:10:47,720 --> 01:10:52,130 And it learns what to do or what not to do in the future based 1471 01:10:52,130 --> 01:10:54,980 on those individual experiences. 1472 01:10:54,980 --> 01:10:58,700 And so what this will often look like is it will often start with some agent, 1473 01:10:58,700 --> 01:11:01,070 some AI, which might again, be a physical robot-- 1474 01:11:01,070 --> 01:11:03,360 if you're imagining a physical robot moving around-- 1475 01:11:03,360 --> 01:11:05,180 but it can also just be a program. 1476 01:11:05,180 --> 01:11:08,210 And our agent is situated in their environment, 1477 01:11:08,210 --> 01:11:11,120 where the environment is where they're going to make their actions. 1478 01:11:11,120 --> 01:11:13,820 And it's what's going to give them rewards or punishments 1479 01:11:13,820 --> 01:11:16,070 for the various actions that they take. 1480 01:11:16,070 --> 01:11:19,220 So for example, the environment is going to start off 1481 01:11:19,220 --> 01:11:21,980 by putting our agent inside of a state. 1482 01:11:21,980 --> 01:11:25,640 Our agent has some state that, in a game, might be the state of the game 1483 01:11:25,640 --> 01:11:28,730 that the agent is playing; in a world that the agent is exploring, 1484 01:11:28,730 --> 01:11:31,375 it might be some position inside of a grid representing 1485 01:11:31,375 --> 01:11:32,750 the world that they're exploring. 1486 01:11:32,750 --> 01:11:35,090 But the agent is in some sort of state. 1487 01:11:35,090 --> 01:11:39,140 And in that state, the agent needs to choose to take an action. 1488 01:11:39,140 --> 01:11:41,760 The agent likely has multiple actions they can choose from, 1489 01:11:41,760 --> 01:11:43,310 but they pick an action. 1490 01:11:43,310 --> 01:11:46,310 So they take an action in a particular state. 1491 01:11:46,310 --> 01:11:51,170 And as a result of that, the agent will generally get two things in response 1492 01:11:51,170 --> 01:11:52,040 as we model them. 1493 01:11:52,040 --> 01:11:54,770 The agent gets a new state that they find themselves in. 1494 01:11:54,770 --> 01:11:57,110 After being in this state taking one action, 1495 01:11:57,110 --> 01:11:59,210 they end up in some other state. 1496 01:11:59,210 --> 01:12:02,360 And they're also given some sort of numerical reward-- 1497 01:12:02,360 --> 01:12:05,660 positive meaning reward, meaning it was a good thing. 1498 01:12:05,660 --> 01:12:08,000 Negative generally meaning they did something bad, 1499 01:12:08,000 --> 01:12:10,310 they received some sort of punishment. 1500 01:12:10,310 --> 01:12:13,190 And that is all the information the agent has. 1501 01:12:13,190 --> 01:12:15,260 It's told what state it's in. 1502 01:12:15,260 --> 01:12:17,150 It makes some sort of action. 1503 01:12:17,150 --> 01:12:19,160 And based on that, it ends up in another state, 1504 01:12:19,160 --> 01:12:21,660 and it ends up getting some particular reward. 1505 01:12:21,660 --> 01:12:24,110 And it needs to learn based on that information what 1506 01:12:24,110 --> 01:12:26,770 actions to begin to take in the future.
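[Code sketch: the state-action-reward loop just described, written out with a hypothetical environment and agent interface; initial_state, actions, step, terminal, choose_action, and update are illustrative names, not a real library API.]

    def run_episode(environment, agent):
        """One episode of agent-environment interaction (hypothetical interface)."""
        state = environment.initial_state()
        while not environment.terminal(state):
            # The agent picks one of the actions available in this state.
            action = agent.choose_action(state, environment.actions(state))

            # The environment responds with a new state and a numerical reward
            # (positive for a reward, negative for a punishment).
            new_state, reward = environment.step(state, action)

            # The agent learns from that one experience, then continues.
            agent.update(state, action, new_state, reward)
            state = new_state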
1507 01:12:26,770 --> 01:12:30,090 As you can imagine generalizing this to a lot of different situations, 1508 01:12:30,090 --> 01:12:31,685 this is oftentimes how you train an agent. 1509 01:12:31,685 --> 01:12:33,560 If you've ever seen those robots that are now 1510 01:12:33,560 --> 01:12:36,140 able to walk around sort of the way humans do, 1511 01:12:36,140 --> 01:12:39,500 it would be quite difficult to program the robot in exactly the right way 1512 01:12:39,500 --> 01:12:41,330 to get it to walk the way humans do. 1513 01:12:41,330 --> 01:12:44,330 You could instead train it through reinforcement learning-- give it 1514 01:12:44,330 --> 01:12:47,060 some sort of numerical reward every time it does something 1515 01:12:47,060 --> 01:12:50,690 good like take steps forward, and punish it every time it does something 1516 01:12:50,690 --> 01:12:52,250 bad like fall over. 1517 01:12:52,250 --> 01:12:54,080 And then let the AI just learn. 1518 01:12:54,080 --> 01:12:56,450 Based on that sequence of rewards, based on trying 1519 01:12:56,450 --> 01:12:58,610 to take various different actions, you can 1520 01:12:58,610 --> 01:13:03,200 begin to have the agent learn what to do in the future and what not to do. 1521 01:13:03,200 --> 01:13:06,530 So in order to begin to formalize this, the first thing we need to do 1522 01:13:06,530 --> 01:13:10,700 is formalize this notion of what we mean by states and actions and rewards-- 1523 01:13:10,700 --> 01:13:12,800 like what does this world look like. 1524 01:13:12,800 --> 01:13:14,990 And oftentimes, we'll formulate this world 1525 01:13:14,990 --> 01:13:17,990 as what's known as a Markov decision process. 1526 01:13:17,990 --> 01:13:21,410 Similar in spirit to Markov chains, which you might recall from before, 1527 01:13:21,410 --> 01:13:24,020 but a Markov decision process is a model that we 1528 01:13:24,020 --> 01:13:26,780 can use for decision making for an agent trying 1529 01:13:26,780 --> 01:13:28,580 to make decisions in this environment. 1530 01:13:28,580 --> 01:13:32,270 And it's a model that allows us to represent the various different states 1531 01:13:32,270 --> 01:13:35,930 that an agent can be in, the various different actions that they can take, 1532 01:13:35,930 --> 01:13:40,400 and also, what the reward is for taking one action as opposed 1533 01:13:40,400 --> 01:13:42,180 to another action. 1534 01:13:42,180 --> 01:13:44,610 So what then does that actually look like? 1535 01:13:44,610 --> 01:13:47,590 Well, if you recall a Markov chain from before, 1536 01:13:47,590 --> 01:13:50,270 a Markov chain looked a little something like this. 1537 01:13:50,270 --> 01:13:53,600 Where we had a whole bunch of these individual states, and each state 1538 01:13:53,600 --> 01:13:55,820 immediately transitioned to another state 1539 01:13:55,820 --> 01:13:57,890 based on some probability distribution. 1540 01:13:57,890 --> 01:14:01,070 We saw this in the context of the weather before, where if it was sunny, 1541 01:14:01,070 --> 01:14:03,770 we said with some probability, it will be sunny the next day. 1542 01:14:03,770 --> 01:14:06,920 With some other probability, it'll be rainy, for example. 1543 01:14:06,920 --> 01:14:09,340 But we could also imagine generalizing this. 1544 01:14:09,340 --> 01:14:11,060 It's not just sun and rain anymore. 1545 01:14:11,060 --> 01:14:14,210 We just have these states, where one state leads to another state 1546 01:14:14,210 --> 01:14:16,860 according to some probability distribution.
1547 01:14:16,860 --> 01:14:19,340 But in this original model, there was no agent 1548 01:14:19,340 --> 01:14:21,470 that had any control over this process. 1549 01:14:21,470 --> 01:14:24,740 It was just entirely probability-based, where with some probability, 1550 01:14:24,740 --> 01:14:26,540 we moved to this next state, but maybe it's 1551 01:14:26,540 --> 01:14:29,230 going to be some other state with some other probability. 1552 01:14:29,230 --> 01:14:33,320 What we'll now have is the ability for the agent in this state 1553 01:14:33,320 --> 01:14:37,190 to choose from a set of actions, where maybe instead of just one path forward, 1554 01:14:37,190 --> 01:14:39,200 they have three different choices of actions 1555 01:14:39,200 --> 01:14:41,180 that each lead them down different paths. 1556 01:14:41,180 --> 01:14:43,340 And even this is a bit of an oversimplification, 1557 01:14:43,340 --> 01:14:46,340 because in each of these states, you might imagine more branching points 1558 01:14:46,340 --> 01:14:49,200 where there are more decisions that can be taken as well. 1559 01:14:49,200 --> 01:14:53,420 So we've extended the Markov chain to say that from a state, 1560 01:14:53,420 --> 01:14:55,400 you now have available action choices. 1561 01:14:55,400 --> 01:14:57,830 And each of those actions might be associated 1562 01:14:57,830 --> 01:15:02,960 with its own probability distribution of going to various different states. 1563 01:15:02,960 --> 01:15:05,870 Then in addition, we'll add another extension, 1564 01:15:05,870 --> 01:15:08,180 where any time you move from a state taking 1565 01:15:08,180 --> 01:15:10,640 an action going into this other state, we 1566 01:15:10,640 --> 01:15:14,060 can associate a reward with that outcome, 1567 01:15:14,060 --> 01:15:17,150 saying either r is positive, meaning some positive reward, 1568 01:15:17,150 --> 01:15:20,460 or r is negative, meaning there was some sort of punishment. 1569 01:15:20,460 --> 01:15:23,510 And this then is what we'll consider to be a Markov decision process. 1570 01:15:23,510 --> 01:15:27,500 That a Markov decision process has some initial set of states in the world 1571 01:15:27,500 --> 01:15:28,670 that we can be in. 1572 01:15:28,670 --> 01:15:31,640 We have some set of actions that given a state, 1573 01:15:31,640 --> 01:15:34,430 I can say what are the actions that are available to me 1574 01:15:34,430 --> 01:15:37,730 in that state, the actions that I can choose from. 1575 01:15:37,730 --> 01:15:39,550 Then we have some transition model. 1576 01:15:39,550 --> 01:15:41,710 The transition model before just said that 1577 01:15:41,710 --> 01:15:44,740 given my current state, what is the probability that I end up 1578 01:15:44,740 --> 01:15:46,930 in that next state or this other state. 1579 01:15:46,930 --> 01:15:51,130 The transition model now has effectively two things we're conditioning on. 1580 01:15:51,130 --> 01:15:53,260 We're saying, given that I'm in this state 1581 01:15:53,260 --> 01:15:56,500 and that I take this action, what's the probability 1582 01:15:56,500 --> 01:15:59,320 that I end up in this next state? 1583 01:15:59,320 --> 01:16:02,050 Now maybe we live in a very deterministic world in this Markov 1584 01:16:02,050 --> 01:16:05,170 decision process, where given a state and given an action, 1585 01:16:05,170 --> 01:16:07,383 we know for sure what next state we'll end up in.
1586 01:16:07,383 --> 01:16:10,550 But maybe there's some randomness in the world, that when you're in a state 1587 01:16:10,550 --> 01:16:14,260 and you take an action, you might not always end up in the exact same state. 1588 01:16:14,260 --> 01:16:16,840 There might be some probabilities involved there as well. 1589 01:16:16,840 --> 01:16:21,260 The Markov decision process can handle both of those possible cases. 1590 01:16:21,260 --> 01:16:23,830 And then finally, we have a reward function, 1591 01:16:23,830 --> 01:16:26,230 generally called r, that in this case says, 1592 01:16:26,230 --> 01:16:30,340 what is the reward for being in this state, taking this action, 1593 01:16:30,340 --> 01:16:33,687 and then getting to s prime, this next state. 1594 01:16:33,687 --> 01:16:35,770 So I'm in this original state, I take this action, 1595 01:16:35,770 --> 01:16:39,768 I get to this next state, what is the reward for doing that process? 1596 01:16:39,768 --> 01:16:41,560 You can add up these rewards every time you 1597 01:16:41,560 --> 01:16:44,440 take an action to get the total amount of rewards 1598 01:16:44,440 --> 01:16:48,940 that an agent might get from interacting in a particular environment modeled 1599 01:16:48,940 --> 01:16:51,170 using this Markov decision process. 1600 01:16:51,170 --> 01:16:53,740 So what might this actually look like in practice? 1601 01:16:53,740 --> 01:16:56,290 Well, let's just create a little simulated world here 1602 01:16:56,290 --> 01:16:59,260 where I have this agent that is just trying to navigate its way-- 1603 01:16:59,260 --> 01:17:02,350 this agent is this yellow dot here like a robot in the world trying 1604 01:17:02,350 --> 01:17:04,240 to navigate its way through this grid. 1605 01:17:04,240 --> 01:17:07,140 And ultimately, it's trying to find its way to the goal. 1606 01:17:07,140 --> 01:17:11,230 And if it gets to the green goal, then it's going to get some sort of reward. 1607 01:17:11,230 --> 01:17:14,795 But then we might also have some red squares 1608 01:17:14,795 --> 01:17:17,920 that are places where you get some sort of punishment, some bad place where 1609 01:17:17,920 --> 01:17:19,480 we don't want the agent to go. 1610 01:17:19,480 --> 01:17:22,030 And if it ends up in the red square, then our agent 1611 01:17:22,030 --> 01:17:25,400 is going to get some sort of punishment as a result of that. 1612 01:17:25,400 --> 01:17:28,610 But the agent originally doesn't know all of these details. 1613 01:17:28,610 --> 01:17:31,360 It doesn't know that these states are associated with punishments, 1614 01:17:31,360 --> 01:17:34,210 but maybe it does know that the state is associated with a reward-- 1615 01:17:34,210 --> 01:17:35,200 maybe it doesn't. 1616 01:17:35,200 --> 01:17:37,750 But it just needs to sort of interact with the environment 1617 01:17:37,750 --> 01:17:41,150 to try and figure out what to do and what not to do. 1618 01:17:41,150 --> 01:17:44,377 So the first thing the agent might do is given no additional information, 1619 01:17:44,377 --> 01:17:46,210 if it doesn't know what the punishments are, 1620 01:17:46,210 --> 01:17:50,030 it doesn't know where the rewards are, it just might try and take an action. 1621 01:17:50,030 --> 01:17:52,690 And it takes an action and ends up realizing 1622 01:17:52,690 --> 01:17:54,710 that it's got some sort of punishment. 1623 01:17:54,710 --> 01:17:56,930 And so what does it learn from that experience?
1624 01:17:56,930 --> 01:18:00,550 Well, it might learn that when you're in this state, in the future 1625 01:18:00,550 --> 01:18:02,790 don't take the action "move to the right"-- 1626 01:18:02,790 --> 01:18:04,213 that that is a bad action to take. 1627 01:18:04,213 --> 01:18:06,880 That in the future, if you ever find yourself back in that state, 1628 01:18:06,880 --> 01:18:09,430 don't take this action of going to the right when 1629 01:18:09,430 --> 01:18:12,180 you're in this particular state, because that leads to punishment. 1630 01:18:12,180 --> 01:18:13,763 That might be the intuition, at least. 1631 01:18:13,763 --> 01:18:15,610 And so you could try doing other actions. 1632 01:18:15,610 --> 01:18:16,300 You move up. 1633 01:18:16,300 --> 01:18:18,508 All right, that didn't lead to any immediate rewards, 1634 01:18:18,508 --> 01:18:21,830 maybe try something else, then maybe try something else. 1635 01:18:21,830 --> 01:18:23,913 And now you found that you got another punishment. 1636 01:18:23,913 --> 01:18:25,913 And so you learn something from that experience. 1637 01:18:25,913 --> 01:18:27,850 So the next time you do this whole process, 1638 01:18:27,850 --> 01:18:30,010 you know that if you ever end up in this square, 1639 01:18:30,010 --> 01:18:33,070 you shouldn't take the down action, because being in this state 1640 01:18:33,070 --> 01:18:37,840 and taking that action ultimately leads to some sort of punishment, 1641 01:18:37,840 --> 01:18:40,070 a negative reward, in other words. 1642 01:18:40,070 --> 01:18:41,170 And this process repeats. 1643 01:18:41,170 --> 01:18:44,230 You might imagine just letting our agent explore the world, 1644 01:18:44,230 --> 01:18:48,220 learning over time what states tend to correspond with poor actions, 1645 01:18:48,220 --> 01:18:50,980 1646 01:18:50,980 --> 01:18:54,260 until eventually, if it tries enough things randomly, 1647 01:18:54,260 --> 01:18:57,640 it might find that when you get to this state, 1648 01:18:57,640 --> 01:19:01,240 if you take the up action in this state, it might find that you 1649 01:19:01,240 --> 01:19:03,160 actually get a reward from that. 1650 01:19:03,160 --> 01:19:06,820 And what it can learn from that is that if you're in this state, 1651 01:19:06,820 --> 01:19:09,595 you should take the up action, because that leads to a reward. 1652 01:19:09,595 --> 01:19:12,220 And over time, you can also learn that if you're in this state, 1653 01:19:12,220 --> 01:19:15,580 you should take the left action because that leads to this state that also 1654 01:19:15,580 --> 01:19:17,320 lets you eventually get to the reward. 1655 01:19:17,320 --> 01:19:21,160 So you begin to learn over time, not only which actions 1656 01:19:21,160 --> 01:19:25,390 are good in particular states, but also, which actions are bad, 1657 01:19:25,390 --> 01:19:27,640 such that once you know some sequence of good actions 1658 01:19:27,640 --> 01:19:30,910 that leads you to some sort of reward, our agent can just 1659 01:19:30,910 --> 01:19:34,750 follow those instructions, follow the experience that it has learned. 1660 01:19:34,750 --> 01:19:37,330 We didn't tell the agent what the goal was. 1661 01:19:37,330 --> 01:19:39,910 We didn't tell the agent where the punishments were. 1662 01:19:39,910 --> 01:19:42,760 But the agent can begin to learn from this experience 1663 01:19:42,760 --> 01:19:47,850 and learn to begin to perform these sorts of tasks better in the future.
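[Code sketch: one way a small deterministic grid world like this could be encoded as states, actions, a transition model, and a reward function; the grid size, goal cell, punishment cells, and reward values here are made up purely for illustration.]

    # States are (row, column) cells in a small grid; actions move one cell.
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    ROWS, COLS = 3, 4
    GOAL = (0, 3)                    # reaching this cell earns a reward
    PUNISHMENTS = {(1, 3), (2, 1)}   # these cells give a punishment

    def transition(state, action):
        """Deterministic transition model: where does this action lead?"""
        row, col = state
        dr, dc = ACTIONS[action]
        new_state = (row + dr, col + dc)
        # Stay in place if the move would leave the grid.
        if not (0 <= new_state[0] < ROWS and 0 <= new_state[1] < COLS):
            return state
        return new_state

    def reward(state, action, new_state):
        """Reward function R(s, a, s')."""
        if new_state == GOAL:
            return 1
        if new_state in PUNISHMENTS:
            return -1
        return 0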
1664 01:19:47,850 --> 01:19:51,370 And so let's now try to formalize this idea-- formalize the idea that we would 1665 01:19:51,370 --> 01:19:54,520 like to be able to learn: in this state, taking this action, 1666 01:19:54,520 --> 01:19:56,257 is that a good thing or a bad thing? 1667 01:19:56,257 --> 01:19:58,840 There are lots of different models for reinforcement learning. 1668 01:19:58,840 --> 01:20:00,757 We're just going to look at one of them today. 1669 01:20:00,757 --> 01:20:04,330 And the one that we're going to look at is a method known as Q learning. 1670 01:20:04,330 --> 01:20:08,010 And what Q learning is all about is about learning a function, 1671 01:20:08,010 --> 01:20:12,850 a function Q, that takes inputs s and a, where s is a state and a 1672 01:20:12,850 --> 01:20:15,130 is an action that you take in that state. 1673 01:20:15,130 --> 01:20:17,620 And what this Q function is going to do is 1674 01:20:17,620 --> 01:20:21,520 it is going to estimate the value-- how much reward will I get 1675 01:20:21,520 --> 01:20:25,960 from taking this action in this state. 1676 01:20:25,960 --> 01:20:28,870 Originally, we don't know what this Q function should be, 1677 01:20:28,870 --> 01:20:30,940 but over time, based on experience, based 1678 01:20:30,940 --> 01:20:33,590 on trying things out and seeing what the result is, 1679 01:20:33,590 --> 01:20:36,160 I would like to try and learn what Q of s, 1680 01:20:36,160 --> 01:20:39,730 a is for any particular state and any particular action 1681 01:20:39,730 --> 01:20:41,840 that I might take in that state. 1682 01:20:41,840 --> 01:20:42,880 So what is the approach? 1683 01:20:42,880 --> 01:20:44,890 Well, the approach originally is we'll start 1684 01:20:44,890 --> 01:20:49,960 with Q s, a equal to 0 for all states s and for all actions a. 1685 01:20:49,960 --> 01:20:52,270 That initially, before I've ever started anything, 1686 01:20:52,270 --> 01:20:54,880 before I've had any experiences, I don't know 1687 01:20:54,880 --> 01:20:57,820 the value of taking any action in any given state, 1688 01:20:57,820 --> 01:21:02,390 so I'm going to assume that the value is 0 all across the board. 1689 01:21:02,390 --> 01:21:06,790 But then as I interact with the world, as I experience rewards or punishments, 1690 01:21:06,790 --> 01:21:10,450 or maybe I go to a cell where I don't get either a reward or a punishment, 1691 01:21:10,450 --> 01:21:14,290 I want to somehow update my estimate of Q s, a. 1692 01:21:14,290 --> 01:21:18,670 I want to continually update my estimate of Q s, a based on the experiences, 1693 01:21:18,670 --> 01:21:20,730 and rewards, and punishments that I've received 1694 01:21:20,730 --> 01:21:24,240 such that in the future, my knowledge of what actions are good 1695 01:21:24,240 --> 01:21:26,290 in what states will be better. 1696 01:21:26,290 --> 01:21:29,280 So when we take an action and receive some sort of reward, 1697 01:21:29,280 --> 01:21:32,750 I want to estimate the new value of Q s, a. 1698 01:21:32,750 --> 01:21:35,390 And I estimate that based on a couple of different things. 1699 01:21:35,390 --> 01:21:39,090 I estimate it based on the reward that I'm getting from taking this action 1700 01:21:39,090 --> 01:21:40,830 and getting into the next state.
1701 01:21:40,830 --> 01:21:44,130 But assuming the situation isn't over, assuming 1702 01:21:44,130 --> 01:21:47,170 there are still future actions that I might take as well, 1703 01:21:47,170 --> 01:21:51,670 I also need to take into account the expected future rewards. 1704 01:21:51,670 --> 01:21:54,570 That if you imagine an agent interacting with the environment, 1705 01:21:54,570 --> 01:21:57,000 and sometimes, you'll take an action and get a reward, 1706 01:21:57,000 --> 01:21:59,970 but then you can keep taking more actions and get more rewards. 1707 01:21:59,970 --> 01:22:02,310 That these both are relevant-- both the current reward 1708 01:22:02,310 --> 01:22:05,680 I'm getting from this current step, and also, my future reward. 1709 01:22:05,680 --> 01:22:09,240 And it might be the case that I want to take a step that doesn't immediately 1710 01:22:09,240 --> 01:22:12,180 lead to a reward, because later on down the line, 1711 01:22:12,180 --> 01:22:14,650 I know it will lead to more rewards as well. 1712 01:22:14,650 --> 01:22:17,550 So there's a balancing act between current rewards 1713 01:22:17,550 --> 01:22:20,460 that the agent experiences and future rewards 1714 01:22:20,460 --> 01:22:23,900 that the agent experiences as well. 1715 01:22:23,900 --> 01:22:26,590 And then we need to update Q s, a. 1716 01:22:26,590 --> 01:22:29,620 So we estimate the value of Q s, a based on the current reward 1717 01:22:29,620 --> 01:22:31,390 and the expected future rewards. 1718 01:22:31,390 --> 01:22:33,970 And then we need to update this Q function 1719 01:22:33,970 --> 01:22:36,550 to take into account this new estimate. 1720 01:22:36,550 --> 01:22:39,340 Now as we go through this process, we'll already 1721 01:22:39,340 --> 01:22:42,160 have an estimate for what we think the value is. 1722 01:22:42,160 --> 01:22:44,140 Now we have a new estimate and then somehow we 1723 01:22:44,140 --> 01:22:46,570 need to combine these two estimates together. 1724 01:22:46,570 --> 01:22:50,110 And we'll look at more formal ways that we can actually begin to do that. 1725 01:22:50,110 --> 01:22:52,725 So to actually show you what this formula looks like, 1726 01:22:52,725 --> 01:22:54,850 here's the approach we'll take with Q-learning. 1727 01:22:54,850 --> 01:22:59,500 We're going to again start with Q of s, a being equal to 0 for all states and actions. 1728 01:22:59,500 --> 01:23:06,850 And then every time we take an action a in state s and observe a reward r, 1729 01:23:06,850 --> 01:23:11,230 we're going to update our value, our estimate for Q of s, a. 1730 01:23:11,230 --> 01:23:13,780 And the idea is that we're going to figure out 1731 01:23:13,780 --> 01:23:19,180 what the new value estimate is minus what our existing value estimate is. 1732 01:23:19,180 --> 01:23:22,780 So we have some preconceived notion for what the value is 1733 01:23:22,780 --> 01:23:24,500 for taking this action in this state. 1734 01:23:24,500 --> 01:23:28,450 Maybe our expectation is we currently think the value is 10. 1735 01:23:28,450 --> 01:23:31,540 But then we're going to estimate what we now think it's going to be. 1736 01:23:31,540 --> 01:23:34,370 Maybe the new value estimate is something like 20. 1737 01:23:34,370 --> 01:23:36,940 So there's a delta of, like, 10 that our new value 1738 01:23:36,940 --> 01:23:40,390 estimate is 10 points higher than what our current value 1739 01:23:40,390 --> 01:23:42,450 estimate happens to be. 1740 01:23:42,450 --> 01:23:44,200 And so we have a couple of options here.
1741 01:23:44,200 --> 01:23:49,360 We need to decide how much we want to adjust our current expectation of what 1742 01:23:49,360 --> 01:23:52,720 the value is of taking this action in this particular state. 1743 01:23:52,720 --> 01:23:56,620 And what that difference is-- how much we add or subtract 1744 01:23:56,620 --> 01:23:59,800 from our existing notion of how much that we expect the value to be 1745 01:23:59,800 --> 01:24:03,750 is dependent on this parameter alpha, also called the learning rate. 1746 01:24:03,750 --> 01:24:08,710 And alpha represents, in effect, how much we value new information compared 1747 01:24:08,710 --> 01:24:11,770 to how much we value old information. 1748 01:24:11,770 --> 01:24:15,500 An alpha value of 1 means we really value new information. 1749 01:24:15,500 --> 01:24:17,590 That if we have a new estimate, then it doesn't 1750 01:24:17,590 --> 01:24:19,240 matter what our old estimate is. 1751 01:24:19,240 --> 01:24:21,910 We're only going to consider our new estimate, because we always 1752 01:24:21,910 --> 01:24:25,430 just want to take into consideration our new information. 1753 01:24:25,430 --> 01:24:29,230 So the way that works is that if you imagine alpha being 1, 1754 01:24:29,230 --> 01:24:32,860 then we're taking the old value of Q s, a 1755 01:24:32,860 --> 01:24:36,910 and then adding 1 times the new value minus the old value. 1756 01:24:36,910 --> 01:24:38,990 And that just leaves us with the new value. 1757 01:24:38,990 --> 01:24:42,010 So when alpha is 1, all we take into consideration 1758 01:24:42,010 --> 01:24:44,650 is what our new estimate happens to be. 1759 01:24:44,650 --> 01:24:47,470 But over time, as we go through a lot of experiences, 1760 01:24:47,470 --> 01:24:49,570 we already have some existing information. 1761 01:24:49,570 --> 01:24:53,080 We might have tried taking this action nine times already, 1762 01:24:53,080 --> 01:24:55,090 and now we've just tried it a tenth time. 1763 01:24:55,090 --> 01:24:58,370 And we don't only want to consider this 10th experience. 1764 01:24:58,370 --> 01:25:02,020 I also want to consider the fact that my prior nine experiences, those were 1765 01:25:02,020 --> 01:25:02,710 meaningful, too. 1766 01:25:02,710 --> 01:25:05,420 And that's data I don't necessarily want to lose. 1767 01:25:05,420 --> 01:25:08,620 And so this alpha controls that decision-- controls 1768 01:25:08,620 --> 01:25:10,480 how important is the new information. 1769 01:25:10,480 --> 01:25:13,570 0 would mean ignore all the new information, 1770 01:25:13,570 --> 01:25:16,390 just keep this Q value the same. 1771 01:25:16,390 --> 01:25:20,200 1 means replace the old information entirely with the new information. 1772 01:25:20,200 --> 01:25:24,772 And somewhere in between, keep some sort of balance between these two values. 1773 01:25:24,772 --> 01:25:27,970 And we can put this equation a little bit more formally, as well. 1774 01:25:27,970 --> 01:25:30,940 The old value estimate is our old estimate 1775 01:25:30,940 --> 01:25:34,690 for what the value is of taking this action in a particular state. 1776 01:25:34,690 --> 01:25:37,270 That's just Q of s, a. 1777 01:25:37,270 --> 01:25:38,685 We have it once here. 1778 01:25:38,685 --> 01:25:40,310 And we're going to add something to it. 1779 01:25:40,310 --> 01:25:44,020 We're going to add alpha times the new value estimate minus the old value 1780 01:25:44,020 --> 01:25:44,740 estimate.
1781 01:25:44,740 --> 01:25:49,330 But the old value estimate, we just look up by calling this Q function. 1782 01:25:49,330 --> 01:25:51,310 And what then is the new value estimate? 1783 01:25:51,310 --> 01:25:53,530 Based on this experience we have just taken, 1784 01:25:53,530 --> 01:25:55,870 what is our new estimate for the value of taking 1785 01:25:55,870 --> 01:25:58,570 this action in this particular state? 1786 01:25:58,570 --> 01:26:01,090 Well, it's going to be composed of two parts. 1787 01:26:01,090 --> 01:26:04,540 It's going to be composed of what reward did I just get 1788 01:26:04,540 --> 01:26:07,060 from taking this action in this state. 1789 01:26:07,060 --> 01:26:10,390 And then it's going to be what can I expect my future rewards 1790 01:26:10,390 --> 01:26:12,670 to be from this point forward. 1791 01:26:12,670 --> 01:26:17,230 So it's going to be r, some reward I'm getting right now, 1792 01:26:17,230 --> 01:26:21,443 plus whatever I estimate I'm going to get in the future. 1793 01:26:21,443 --> 01:26:23,860 And how do I estimate what I'm going to get in the future? 1794 01:26:23,860 --> 01:26:27,030 Well, it's a bit of another call to this Q function. 1795 01:26:27,030 --> 01:26:31,360 It's going to be: take the maximum across all possible actions I could 1796 01:26:31,360 --> 01:26:35,800 take next and say, all right, of all of these possible actions I could take, 1797 01:26:35,800 --> 01:26:38,560 which one is going to have the highest reward? 1798 01:26:38,560 --> 01:26:40,840 So this then-- it looks a little bit complicated-- 1799 01:26:40,840 --> 01:26:44,710 is going to be our notion for how we're going to perform this kind of update. 1800 01:26:44,710 --> 01:26:48,070 I have some estimate, some old estimate, for what 1801 01:26:48,070 --> 01:26:51,040 the value is of taking this action in the state, 1802 01:26:51,040 --> 01:26:53,920 and I'm going to update it based on new information. 1803 01:26:53,920 --> 01:26:56,200 That I experienced some reward, I predict 1804 01:26:56,200 --> 01:26:58,240 what my future reward is going to be. 1805 01:26:58,240 --> 01:27:01,510 And using that, I update what I estimate the reward 1806 01:27:01,510 --> 01:27:04,720 will be for taking this action in this particular state. 1807 01:27:04,720 --> 01:27:07,720 And there are other additions you might make to this algorithm, as well. 1808 01:27:07,720 --> 01:27:10,270 Sometimes, it might not be the case that you want to weight 1809 01:27:10,270 --> 01:27:12,820 future rewards equally to current rewards. 1810 01:27:12,820 --> 01:27:17,402 Maybe you want an agent that values, like, reward now over reward later. 1811 01:27:17,402 --> 01:27:19,360 And so sometimes, you can even add another term 1812 01:27:19,360 --> 01:27:23,050 in here or some other parameter, where you discount future rewards 1813 01:27:23,050 --> 01:27:26,270 and say future rewards are not as valuable as rewards 1814 01:27:26,270 --> 01:27:28,720 immediately-- that getting reward in the current time step 1815 01:27:28,720 --> 01:27:31,542 is better than waiting a year and getting rewards later. 1816 01:27:31,542 --> 01:27:33,250 But that's something up to the programmer 1817 01:27:33,250 --> 01:27:36,290 to decide what that parameter ought to be. 1818 01:27:36,290 --> 01:27:39,100 But the big picture idea of this entire formula 1819 01:27:39,100 --> 01:27:42,670 is to say that every time we experience some new reward, 1820 01:27:42,670 --> 01:27:43,900 we take that into account.
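[Code sketch: the update just described, with Q stored as a dictionary keyed by (state, action); alpha is the learning rate, and the optional gamma is the discount on future rewards mentioned above. The names and default values here are illustrative.]

    def update_q(Q, state, action, new_state, new_state_actions, reward,
                 alpha=0.5, gamma=1.0):
        """One Q-learning update: old estimate + alpha * (new estimate - old estimate)."""
        old_estimate = Q.get((state, action), 0)

        # Best estimated value obtainable from the next state onward,
        # looking across the actions available in that next state.
        best_future = max(
            (Q.get((new_state, a), 0) for a in new_state_actions),
            default=0,
        )

        # New value estimate: the reward just received, plus (discounted) future rewards.
        new_estimate = reward + gamma * best_future

        Q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)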
1821 01:27:43,900 --> 01:27:47,830 We update our estimate of how good is this action. 1822 01:27:47,830 --> 01:27:51,070 And then in the future, we can make decisions based on that algorithm. 1823 01:27:51,070 --> 01:27:54,010 Once we have some good estimate for every state 1824 01:27:54,010 --> 01:27:57,670 and for every action, what the value is of taking that action, 1825 01:27:57,670 --> 01:28:01,960 then we can do something like implement a greedy decision-making policy. 1826 01:28:01,960 --> 01:28:05,200 That if I am in a state and I want to know what actions should 1827 01:28:05,200 --> 01:28:09,880 I take in that state, then I consider for all of my possible actions, 1828 01:28:09,880 --> 01:28:12,670 what is the value of Q s, a. 1829 01:28:12,670 --> 01:28:16,150 What is my estimated value of taking that action in that state. 1830 01:28:16,150 --> 01:28:18,850 And I will just pick the action that has the highest 1831 01:28:18,850 --> 01:28:22,520 value after I evaluate that expression. 1832 01:28:22,520 --> 01:28:24,730 So I pick the action that has the highest value. 1833 01:28:24,730 --> 01:28:27,188 And based on that, that tells me what action I should take. 1834 01:28:27,188 --> 01:28:30,880 At any given state that I'm in, I can just greedily say across 1835 01:28:30,880 --> 01:28:35,050 all my actions, this action gives me the highest expected value, 1836 01:28:35,050 --> 01:28:38,200 and so I'll go ahead and choose that action as the action 1837 01:28:38,200 --> 01:28:40,430 that I take as well. 1838 01:28:40,430 --> 01:28:43,270 But there is a downside to this kind of approach. 1839 01:28:43,270 --> 01:28:45,880 And the downside comes up in a situation like this, 1840 01:28:45,880 --> 01:28:51,170 where we know that there is some solution that gets me to the reward 1841 01:28:51,170 --> 01:28:53,610 and our agent has been able to figure that out. 1842 01:28:53,610 --> 01:28:57,050 But it might not necessarily be the best way or the fastest way. 1843 01:28:57,050 --> 01:28:59,850 If the agent is allowed to explore a little bit more, 1844 01:28:59,850 --> 01:29:02,810 you might find that it can get the reward faster by taking 1845 01:29:02,810 --> 01:29:06,830 some other route instead, by going through this particular path that 1846 01:29:06,830 --> 01:29:11,360 is a faster way to get to that ultimate goal. 1847 01:29:11,360 --> 01:29:14,730 And maybe we would like for the agent to be able to figure that out as well. 1848 01:29:14,730 --> 01:29:16,940 But if the agent always takes the actions 1849 01:29:16,940 --> 01:29:20,960 that it knows to be best, when it gets to this particular square, 1850 01:29:20,960 --> 01:29:23,180 it doesn't know that this is a good action, 1851 01:29:23,180 --> 01:29:24,770 because it's never really tried it. 1852 01:29:24,770 --> 01:29:29,040 But it knows that going down eventually leads its way to this reward. 1853 01:29:29,040 --> 01:29:32,270 So it might learn in the future that it should just always take this route, 1854 01:29:32,270 --> 01:29:36,770 and it's never going to explore and go along that other route instead. 1855 01:29:36,770 --> 01:29:39,170 So in reinforcement learning, there's this tension 1856 01:29:39,170 --> 01:29:42,570 between exploration and exploitation. 1857 01:29:42,570 --> 01:29:45,710 And exploitation generally refers to using knowledge 1858 01:29:45,710 --> 01:29:47,360 that the AI already has.
1859 01:29:47,360 --> 01:29:50,630 The AI already knows that this is a move that leads to reward, 1860 01:29:50,630 --> 01:29:52,520 so it'll go ahead and use that move. 1861 01:29:52,520 --> 01:29:56,390 And exploration is all about exploring other actions 1862 01:29:56,390 --> 01:29:58,820 that we may not have explored as thoroughly before, 1863 01:29:58,820 --> 01:30:01,970 because maybe one of these actions, even if I don't know anything about it, 1864 01:30:01,970 --> 01:30:07,300 might lead to better rewards faster or more rewards in the future. 1865 01:30:07,300 --> 01:30:10,160 And so an agent that only ever exploits information 1866 01:30:10,160 --> 01:30:12,840 and never explores might be able to get reward, 1867 01:30:12,840 --> 01:30:16,250 but it might not maximize its rewards, because it doesn't know what 1868 01:30:16,250 --> 01:30:17,840 other possibilities are out there-- 1869 01:30:17,840 --> 01:30:19,850 possibilities that it would only know about 1870 01:30:19,850 --> 01:30:23,010 by taking advantage of exploration. 1871 01:30:23,010 --> 01:30:24,710 And so how can we try and address this? 1872 01:30:24,710 --> 01:30:28,550 Well, one possible solution is known as the epsilon-greedy algorithm, 1873 01:30:28,550 --> 01:30:33,050 where we set epsilon equal to how often we want to just make a random move. 1874 01:30:33,050 --> 01:30:36,750 Where occasionally, we will just make a random move in order to say, 1875 01:30:36,750 --> 01:30:40,350 let's try to explore and see what happens. 1876 01:30:40,350 --> 01:30:45,250 And then the logic of the algorithm will be: with probability 1 minus epsilon, 1877 01:30:45,250 --> 01:30:47,830 choose the estimated best move. 1878 01:30:47,830 --> 01:30:50,620 In a greedy case, we'd always choose the best move. 1879 01:30:50,620 --> 01:30:55,360 But in epsilon-greedy, we're most of the time going to choose the estimated best move, 1880 01:30:55,360 --> 01:30:57,520 1881 01:30:57,520 --> 01:31:00,100 but sometimes, with probability epsilon, we're 1882 01:31:00,100 --> 01:31:03,160 going to choose a random move instead. 1883 01:31:03,160 --> 01:31:06,400 So every time we're faced with the ability to take an action, sometimes, 1884 01:31:06,400 --> 01:31:07,930 we're going to choose the best move. 1885 01:31:07,930 --> 01:31:10,900 Sometimes, we're just going to choose a random move. 1886 01:31:10,900 --> 01:31:12,940 So this type of algorithm then can be quite 1887 01:31:12,940 --> 01:31:16,720 powerful in a reinforcement learning context by not always just choosing 1888 01:31:16,720 --> 01:31:20,620 the best possible move right now, but sometimes, especially early on, 1889 01:31:20,620 --> 01:31:22,600 allowing yourself to make random moves that 1890 01:31:22,600 --> 01:31:26,560 allow you to explore various different possible states and actions more. 1891 01:31:26,560 --> 01:31:30,850 And maybe over time, you might decrease your value of epsilon, more and more 1892 01:31:30,850 --> 01:31:32,720 often choosing the best move after you are 1893 01:31:32,720 --> 01:31:37,820 more confident that you've explored what all of the possibilities actually are. 1894 01:31:37,820 --> 01:31:39,320 So we can put this into practice. 1895 01:31:39,320 --> 01:31:41,830 And one very common application of reinforcement learning 1896 01:31:41,830 --> 01:31:43,000 is in game playing.
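[Code sketch: before turning to game playing, a minimal epsilon-greedy action choice, again assuming Q is a dictionary keyed by (state, action).]

    import random

    def choose_action(Q, state, actions, epsilon=0.1):
        """Epsilon-greedy: usually exploit the best-known action, sometimes explore."""
        if random.random() < epsilon:
            # Explore: pick a random available action.
            return random.choice(list(actions))
        # Exploit: pick the action with the highest estimated value in this state.
        return max(actions, key=lambda a: Q.get((state, a), 0))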
1897 01:31:43,000 --> 01:31:45,400 If you want to teach an agent how to play a game, 1898 01:31:45,400 --> 01:31:48,350 you just let the agent play the game a whole bunch. 1899 01:31:48,350 --> 01:31:51,220 And then the reward signal happens at the end of the game. 1900 01:31:51,220 --> 01:31:55,780 When the game is over, if our AI won the game, it gets a reward of, like, 1, 1901 01:31:55,780 --> 01:31:56,730 for example. 1902 01:31:56,730 --> 01:32:00,100 And if it lost the game, it gets a reward of negative 1. 1903 01:32:00,100 --> 01:32:03,108 And from that, it begins to learn what actions are good 1904 01:32:03,108 --> 01:32:04,150 and what actions are bad. 1905 01:32:04,150 --> 01:32:06,610 You don't have to tell the AI what's good and what's bad, 1906 01:32:06,610 --> 01:32:08,920 but the AI figures it out based on that reward. 1907 01:32:08,920 --> 01:32:10,450 Winning the game is some signal. 1908 01:32:10,450 --> 01:32:12,040 Losing the game is some signal. 1909 01:32:12,040 --> 01:32:14,320 And based on all of that, it begins to figure out 1910 01:32:14,320 --> 01:32:16,985 what decisions it should actually make. 1911 01:32:16,985 --> 01:32:19,360 So one very simple game, which you may have played before, 1912 01:32:19,360 --> 01:32:20,870 is a game called Nim. 1913 01:32:20,870 --> 01:32:23,433 And in the game of Nim, you've got a whole bunch of objects 1914 01:32:23,433 --> 01:32:25,600 in a whole bunch of different piles, where here I've 1915 01:32:25,600 --> 01:32:27,940 represented each pile as an individual row. 1916 01:32:27,940 --> 01:32:31,040 So you've got one object in the first pile, three in the second pile, five 1917 01:32:31,040 --> 01:32:33,370 in the third pile, seven in the fourth pile. 1918 01:32:33,370 --> 01:32:36,010 And the game of Nim is a two-player game where players 1919 01:32:36,010 --> 01:32:38,980 take turns removing objects from piles. 1920 01:32:38,980 --> 01:32:41,230 And the rule is that on any given turn, you 1921 01:32:41,230 --> 01:32:43,690 are allowed to remove as many objects as you 1922 01:32:43,690 --> 01:32:47,162 want from any one of these piles, any one of these rows. 1923 01:32:47,162 --> 01:32:49,120 You have to remove at least one object, but you 1924 01:32:49,120 --> 01:32:53,890 can remove as many as you want from exactly one of the piles. 1925 01:32:53,890 --> 01:32:57,720 And whoever takes the last object loses. 1926 01:32:57,720 --> 01:33:01,630 So player 1 might, like, remove four from this pile here. 1927 01:33:01,630 --> 01:33:04,790 Player 2 might remove four from this pile here. 1928 01:33:04,790 --> 01:33:08,140 So now we've got four piles left: one, three, one, and three. Player 1 1929 01:33:08,140 --> 01:33:11,020 might remove, you know, the entirety of the second pile. 1930 01:33:11,020 --> 01:33:16,930 Player 2, if they're being strategic, might remove two from the third pile. 1931 01:33:16,930 --> 01:33:20,140 Now we've got three piles left, each with one object left. 1932 01:33:20,140 --> 01:33:22,420 Player 1 might remove one from one pile. 1933 01:33:22,420 --> 01:33:24,750 Player 2 removes one from the other pile. 1934 01:33:24,750 --> 01:33:27,610 And now player 1 is left with choosing this one 1935 01:33:27,610 --> 01:33:31,740 object from the last pile, at which point, player 1 loses the game. 1936 01:33:31,740 --> 01:33:32,980 So fairly simple game. 1937 01:33:32,980 --> 01:33:34,240 Piles of objects. 1938 01:33:34,240 --> 01:33:37,300 Any turn, you choose how many objects to remove from the pile.
1939 01:33:37,300 --> 01:33:40,330 Whoever removes the last object loses. 1940 01:33:40,330 --> 01:33:42,730 And this is the type of game you can encode into an AI 1941 01:33:42,730 --> 01:33:46,540 fairly easily, because the states are really just four numbers. 1942 01:33:46,540 --> 01:33:50,120 Every state is just how many objects in each of the four piles. 1943 01:33:50,120 --> 01:33:52,690 And the actions are things like how many am I 1944 01:33:52,690 --> 01:33:56,110 going to remove from each one of these individual piles. 1945 01:33:56,110 --> 01:33:58,090 And the reward happens at the end. 1946 01:33:58,090 --> 01:34:01,000 That if you were the player that had to remove the last object, 1947 01:34:01,000 --> 01:34:02,890 then you get some sort of punishment. 1948 01:34:02,890 --> 01:34:06,430 But if you were not, and the other player had to remove the last object, 1949 01:34:06,430 --> 01:34:08,830 well then you get some sort of reward. 1950 01:34:08,830 --> 01:34:11,680 So we could actually try and show a demonstration of this-- 1951 01:34:11,680 --> 01:34:15,570 that I have implemented an AI to play the game of Nim. 1952 01:34:15,570 --> 01:34:17,320 All right, so here, what we're going to do 1953 01:34:17,320 --> 01:34:22,210 is create an AI as a result of training the AI on some number of games 1954 01:34:22,210 --> 01:34:24,160 that the AI is going to play against itself. 1955 01:34:24,160 --> 01:34:27,060 Where the idea is the AI will play games against itself, 1956 01:34:27,060 --> 01:34:30,800 learn from each of those experiences, and learn what to do in the future. 1957 01:34:30,800 --> 01:34:33,340 And then I, the human, will play against the AI. 1958 01:34:33,340 --> 01:34:35,380 So initially, we'll say train zero times, 1959 01:34:35,380 --> 01:34:39,190 meaning we're not going to let the AI play any practice games against itself 1960 01:34:39,190 --> 01:34:41,110 in order to learn from its experiences. 1961 01:34:41,110 --> 01:34:43,750 We're just going to see how well it plays. 1962 01:34:43,750 --> 01:34:45,460 And it looks like there are four piles. 1963 01:34:45,460 --> 01:34:48,550 I can choose how many I remove from any one of the piles. 1964 01:34:48,550 --> 01:34:53,960 So maybe from pile three, I will remove five objects, for example. 1965 01:34:53,960 --> 01:34:57,340 So now AI chose to take one item from pile zero. 1966 01:34:57,340 --> 01:35:00,170 So I'm left with these piles now, for example. 1967 01:35:00,170 --> 01:35:02,500 And so here, I could choose maybe to say I 1968 01:35:02,500 --> 01:35:08,900 would like to remove them from pile two, all five of them, for example. 1969 01:35:08,900 --> 01:35:11,290 And so AI chose to take two away from pile one. 1970 01:35:11,290 --> 01:35:15,560 Now I'm left with one pile that has one object, one pile that has two objects. 1971 01:35:15,560 --> 01:35:19,070 So from pile three, I will remove two objects. 1972 01:35:19,070 --> 01:35:22,400 And now I've left the AI with no choice but to take that last one. 1973 01:35:22,400 --> 01:35:24,648 And so the game is over and I was able to win. 1974 01:35:24,648 --> 01:35:27,190 But I did so because the AI was really just playing randomly. 1975 01:35:27,190 --> 01:35:30,130 It didn't have any prior experience that it was using in order 1976 01:35:30,130 --> 01:35:32,410 to make these sorts of judgments. 1977 01:35:32,410 --> 01:35:36,040 Now let the AI train itself on, like, 10,000 games. 
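[Code sketch: one way Nim states and actions could be represented for Q-learning; this is an illustration, not the actual course implementation. Training would then amount to playing many games of the AI against itself, applying a Q-learning update after each move, with a reward of 1 for winning and negative 1 for losing, as described in the narration.]

    # A state is just how many objects are in each pile, e.g. (1, 3, 5, 7).
    # An action is a pair (pile, count): remove `count` objects from pile `pile`.

    def available_actions(piles):
        """All legal (pile, count) moves from this state."""
        actions = set()
        for i, pile in enumerate(piles):
            for count in range(1, pile + 1):
                actions.add((i, count))
        return actions

    def apply_action(piles, action):
        """Return the new state after removing `count` objects from pile `pile`."""
        pile, count = action
        new_piles = list(piles)
        new_piles[pile] -= count
        return tuple(new_piles)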
1978 01:35:36,040 --> 01:35:40,000 I'm going to let the AI play 10,000 games of Nim against itself. 1979 01:35:40,000 --> 01:35:43,180 Every time it wins or loses, it's going to learn from that experience 1980 01:35:43,180 --> 01:35:46,840 and learn in the future what to do and what not to do. 1981 01:35:46,840 --> 01:35:49,300 So here then, I'll go ahead and run this again. 1982 01:35:49,300 --> 01:35:52,300 And now you see the AI running through a whole bunch of training games-- 1983 01:35:52,300 --> 01:35:54,730 10,000 training games against itself. 1984 01:35:54,730 --> 01:35:57,530 And now it's going to let me make these sorts of decisions. 1985 01:35:57,530 --> 01:35:59,620 So now I'm going to play against the AI. 1986 01:35:59,620 --> 01:36:02,788 Maybe I'll remove one from pile 3. 1987 01:36:02,788 --> 01:36:05,830 And the AI took everything from pile three, so I'm left with three piles. 1988 01:36:05,830 --> 01:36:11,680 And I'll go ahead and from pile two, maybe remove three items. 1989 01:36:11,680 --> 01:36:14,340 And the AI removes one item from pile zero. 1990 01:36:14,340 --> 01:36:17,560 I'm left with two piles, each of which has two items in it. 1991 01:36:17,560 --> 01:36:21,160 I'll remove one from pile one, I guess. 1992 01:36:21,160 --> 01:36:24,610 And the AI took two from pile two, leaving me with no choice 1993 01:36:24,610 --> 01:36:27,490 but to take one away from pile one. 1994 01:36:27,490 --> 01:36:31,690 So it seems like after playing 10,000 games of Nim against itself, 1995 01:36:31,690 --> 01:36:34,270 the AI has learned something about what states 1996 01:36:34,270 --> 01:36:38,020 and what actions tend to be good, and has begun to learn some sort of pattern 1997 01:36:38,020 --> 01:36:40,780 for how to predict what actions are going to be good 1998 01:36:40,780 --> 01:36:44,290 and what actions are going to be bad in any given state. 1999 01:36:44,290 --> 01:36:47,590 So reinforcement learning can be a very powerful technique for achieving 2000 01:36:47,590 --> 01:36:49,150 these sorts of game playing Agents-- 2001 01:36:49,150 --> 01:36:52,300 Agents that are able to play a game well just by learning 2002 01:36:52,300 --> 01:36:54,970 from experience, whether that's playing against other people 2003 01:36:54,970 --> 01:36:57,310 or by playing against itself and learning from those 2004 01:36:57,310 --> 01:36:59,030 experiences, as well. 2005 01:36:59,030 --> 01:37:01,900 Now Nim is a bit of an easy game to use reinforcement learning 2006 01:37:01,900 --> 01:37:04,120 for because there are so few states. 2007 01:37:04,120 --> 01:37:07,120 There are only states that are as many as how many different objects are 2008 01:37:07,120 --> 01:37:09,190 in each of these various different piles. 2009 01:37:09,190 --> 01:37:11,200 You might imagine that it's going to be harder 2010 01:37:11,200 --> 01:37:14,990 if you think of a game like chess or games where there are many, 2011 01:37:14,990 --> 01:37:18,430 many more states and many, many more actions that you can imagine taking, 2012 01:37:18,430 --> 01:37:21,490 where it's not going to be as easy to learn for every state 2013 01:37:21,490 --> 01:37:24,720 and for every action, what the value is going to be. 2014 01:37:24,720 --> 01:37:27,130 So oftentimes in that case, we can't necessarily 2015 01:37:27,130 --> 01:37:31,060 learn exactly what the value is for every state and for every action, 2016 01:37:31,060 --> 01:37:32,257 but we can approximate it. 2017 01:37:32,257 --> 01:37:34,090 So much as we saw with [? 
min and max, ?] we 2018 01:37:34,090 --> 01:37:37,240 could use a death limiting approach to stop calculating 2019 01:37:37,240 --> 01:37:39,910 at a certain point in time, we can do a similar type 2020 01:37:39,910 --> 01:37:42,730 of approximation known as function approximation 2021 01:37:42,730 --> 01:37:47,530 in a reinforcement learning context, where instead of learning a value of Q 2022 01:37:47,530 --> 01:37:50,260 for every state and every action, we just 2023 01:37:50,260 --> 01:37:53,260 have some function that estimates what the value is 2024 01:37:53,260 --> 01:37:56,770 for taking this action in this particular state that might be based 2025 01:37:56,770 --> 01:38:02,390 on various different features of the state that the agent happens to be in. 2026 01:38:02,390 --> 01:38:04,480 Where you might have to choose what those features 2027 01:38:04,480 --> 01:38:07,990 actually are, but you can begin to learn some patterns that 2028 01:38:07,990 --> 01:38:11,763 generalize beyond one specific state and one specific action 2029 01:38:11,763 --> 01:38:13,930 that you can begin to learn if certain features tend 2030 01:38:13,930 --> 01:38:15,580 to be good things or bad things. 2031 01:38:15,580 --> 01:38:18,850 Reinforcement learning can allow you using a very similar mechanism 2032 01:38:18,850 --> 01:38:20,800 to generalize beyond one particular state 2033 01:38:20,800 --> 01:38:24,220 and say if this other state looks kind of like this state, 2034 01:38:24,220 --> 01:38:27,130 then maybe the similar types of actions that worked in one state 2035 01:38:27,130 --> 01:38:30,150 will also work in another state as well. 2036 01:38:30,150 --> 01:38:32,410 And so this type of approach can be quite helpful 2037 01:38:32,410 --> 01:38:34,750 as you begin to deal with reinforcement learning that 2038 01:38:34,750 --> 01:38:37,870 exists in larger and larger state spaces, where it's just not 2039 01:38:37,870 --> 01:38:43,240 feasible to explore all of the possible states that could actually exist. 2040 01:38:43,240 --> 01:38:46,540 So there then are two of the main categories of reinforcement learning. 2041 01:38:46,540 --> 01:38:49,110 Supervised learning, where you have labeled input and output 2042 01:38:49,110 --> 01:38:52,920 pairs, and reinforcement learning, where an agent learns from rewards 2043 01:38:52,920 --> 01:38:54,540 or punishments that it receives. 2044 01:38:54,540 --> 01:38:56,700 The third major category of machine learning 2045 01:38:56,700 --> 01:39:00,450 that we'll just touch on briefly is known as unsupervised learning. 2046 01:39:00,450 --> 01:39:03,540 And unsupervised learning happens when we have data 2047 01:39:03,540 --> 01:39:06,330 without any additional feedback, without labels. 2048 01:39:06,330 --> 01:39:09,810 That in the supervised learning case, all of our data had labels. 2049 01:39:09,810 --> 01:39:13,560 We labeled a data point with whether that was a rainy day or not rainy day. 2050 01:39:13,560 --> 01:39:16,590 And using those labels, we were able to infer what the pattern was. 2051 01:39:16,590 --> 01:39:20,220 Where we labeled data as a counterfeit banknote or not a counterfeit, 2052 01:39:20,220 --> 01:39:23,760 and using those labels, we were able to draw inferences and patterns 2053 01:39:23,760 --> 01:39:27,870 to figure out what does a banknote look like versus not. 
2054 01:39:27,870 --> 01:39:32,410 In unsupervised learning, we don't have any access to any of those labels, 2055 01:39:32,410 --> 01:39:35,500 but we still would like to learn some of those patterns. 2056 01:39:35,500 --> 01:39:39,000 And one of the tasks that you might want to perform in unsupervised learning 2057 01:39:39,000 --> 01:39:42,600 is something like clustering, where clustering is just the task of given 2058 01:39:42,600 --> 01:39:47,310 some set of objects organized into distinct clusters, groups of objects 2059 01:39:47,310 --> 01:39:49,330 that are similar to one another. 2060 01:39:49,330 --> 01:39:51,510 And there's lots of applications for clustering. 2061 01:39:51,510 --> 01:39:53,340 It comes up in genetic research, where you 2062 01:39:53,340 --> 01:39:55,740 might have a whole bunch of different genes, 2063 01:39:55,740 --> 01:39:57,900 and you want to cluster them into similar genes 2064 01:39:57,900 --> 01:40:01,530 if you're trying to analyze it across a population or across species. 2065 01:40:01,530 --> 01:40:04,530 It comes up in an image, if you want to take all the pixels of an image, 2066 01:40:04,530 --> 01:40:06,570 cluster them into different parts of the image. 2067 01:40:06,570 --> 01:40:09,060 Comes up a lot up in market research if you 2068 01:40:09,060 --> 01:40:11,340 want to divide your consumers into different groups 2069 01:40:11,340 --> 01:40:14,160 so you know which groups to target with certain types of product 2070 01:40:14,160 --> 01:40:15,690 advertisements, for example. 2071 01:40:15,690 --> 01:40:18,840 And a number of other contexts as well in which clustering 2072 01:40:18,840 --> 01:40:20,310 can be very applicable. 2073 01:40:20,310 --> 01:40:24,900 One technique for clustering is an algorithm known as k-means clustering. 2074 01:40:24,900 --> 01:40:27,270 And what k-means clustering is going to do 2075 01:40:27,270 --> 01:40:31,830 it is going to divide all of our data points into k different clusters, 2076 01:40:31,830 --> 01:40:35,700 and it's going to do so by repeating this process of assigning points 2077 01:40:35,700 --> 01:40:39,720 to clusters, and then moving around those clusters centers. 2078 01:40:39,720 --> 01:40:43,680 We're going to define a cluster by its center, the middle of the cluster, 2079 01:40:43,680 --> 01:40:46,830 and then assign points to that cluster based on which 2080 01:40:46,830 --> 01:40:49,430 center is closest to that point. 2081 01:40:49,430 --> 01:40:51,660 And I'll show you an example of that now. 2082 01:40:51,660 --> 01:40:55,020 Here, for example, I have a whole bunch of unlabeled data-- 2083 01:40:55,020 --> 01:40:58,920 just various data points that are in some sort of graphical space. 2084 01:40:58,920 --> 01:41:02,610 And I would like to group them into various different clusters. 2085 01:41:02,610 --> 01:41:04,500 But I don't know how to do that originally. 2086 01:41:04,500 --> 01:41:07,500 And let's say I want to assign three clusters to this group. 2087 01:41:07,500 --> 01:41:10,500 And you have to choose how many clusters you want in k-means clustering, 2088 01:41:10,500 --> 01:41:13,860 but you could try multiple and see how well those values perform. 2089 01:41:13,860 --> 01:41:17,070 But I'll start just by randomly picking some places 2090 01:41:17,070 --> 01:41:19,060 to put the centers of those clusters. 2091 01:41:19,060 --> 01:41:22,735 That maybe I have a blue cluster, a red cluster, and a green cluster. 
2092 01:41:22,735 --> 01:41:25,110 And I'm going to start with the centers of those clusters 2093 01:41:25,110 --> 01:41:27,660 just being in these three locations here. 2094 01:41:27,660 --> 01:41:30,090 And what k-means clustering tells us to do 2095 01:41:30,090 --> 01:41:32,790 is once I have the centers of the clusters, 2096 01:41:32,790 --> 01:41:39,510 assign every point to a cluster based on which cluster center it is closest to. 2097 01:41:39,510 --> 01:41:42,990 So we end up with something like this, where all of these points 2098 01:41:42,990 --> 01:41:47,310 are closer to the blue cluster center than any other cluster center. 2099 01:41:47,310 --> 01:41:50,970 All of these points here are closer to the green cluster 2100 01:41:50,970 --> 01:41:52,860 center than any other cluster center. 2101 01:41:52,860 --> 01:41:55,410 And then these two points plus these points over here, 2102 01:41:55,410 --> 01:42:00,370 those are all closest to the red cluster center instead. 2103 01:42:00,370 --> 01:42:04,050 So here then is one possible assignment, all these points, 2104 01:42:04,050 --> 01:42:05,910 to three different clusters. 2105 01:42:05,910 --> 01:42:06,895 But it's not great. 2106 01:42:06,895 --> 01:42:10,020 That it seems like in this red cluster, these points are kind of far apart, 2107 01:42:10,020 --> 01:42:12,870 in this green cluster, these points are kind of far apart. 2108 01:42:12,870 --> 01:42:15,810 It might not be my ideal choice of how I would cluster 2109 01:42:15,810 --> 01:42:17,670 these various different data points. 2110 01:42:17,670 --> 01:42:21,960 But k-means clustering is an iterative process, that after I do this, 2111 01:42:21,960 --> 01:42:25,890 there is a next step, which is that after I've assigned all of the points 2112 01:42:25,890 --> 01:42:28,380 to the cluster center that it is nearest to, 2113 01:42:28,380 --> 01:42:31,350 we are going to recenter the clusters. 2114 01:42:31,350 --> 01:42:33,420 Meaning take the cluster centers, these diamond 2115 01:42:33,420 --> 01:42:37,500 shapes here, and move them to the middle or the average, 2116 01:42:37,500 --> 01:42:41,040 effectively, of all of the points that are in that cluster. 2117 01:42:41,040 --> 01:42:43,170 So we'll take this blue point, this blue center, 2118 01:42:43,170 --> 01:42:46,812 and go ahead and move it to the middle or to the center of all 2119 01:42:46,812 --> 01:42:49,020 of the points that were assigned to the blue cluster, 2120 01:42:49,020 --> 01:42:51,070 moving it slightly to the right in this case. 2121 01:42:51,070 --> 01:42:52,570 And we'll do the same thing for red. 2122 01:42:52,570 --> 01:42:56,910 We'll move the cluster center to the middle of all of these points, 2123 01:42:56,910 --> 01:42:58,560 weighted by how many points there are. 2124 01:42:58,560 --> 01:43:01,800 There are more points over here, so the red center 2125 01:43:01,800 --> 01:43:03,810 and moving a little bit further that way. 2126 01:43:03,810 --> 01:43:06,352 And likewise for the green center, there are many more points 2127 01:43:06,352 --> 01:43:09,000 on this side of the green center, so the green center 2128 01:43:09,000 --> 01:43:13,240 ends up being pulled a little bit further in this direction. 2129 01:43:13,240 --> 01:43:15,808 So we recenter all of the clusters. 2130 01:43:15,808 --> 01:43:17,100 And then we repeat the process. 
2131 01:43:17,100 --> 01:43:21,540 We go ahead and now reassign all of the points to the cluster center 2132 01:43:21,540 --> 01:43:23,230 that they are now closest to. 2133 01:43:23,230 --> 01:43:25,650 And now that we've moved around the cluster centers, 2134 01:43:25,650 --> 01:43:27,690 these cluster assignments might change. 2135 01:43:27,690 --> 01:43:30,900 That this point originally was closer to the red cluster center, 2136 01:43:30,900 --> 01:43:34,030 but now it's actually closer to the blue cluster center. 2137 01:43:34,030 --> 01:43:35,580 Same goes for this point as well. 2138 01:43:35,580 --> 01:43:38,700 And these three points that were originally closer to the green cluster 2139 01:43:38,700 --> 01:43:43,670 center are now closer to the red cluster center instead. 2140 01:43:43,670 --> 01:43:48,380 So we can reassign what colors or which clusters each of these data points 2141 01:43:48,380 --> 01:43:49,620 belongs to. 2142 01:43:49,620 --> 01:43:51,560 And then repeat the process again, moving 2143 01:43:51,560 --> 01:43:54,590 each of these cluster means, the middle of the clusters, 2144 01:43:54,590 --> 01:43:59,810 to the mean, the average, of all of the other points that happen to be there. 2145 01:43:59,810 --> 01:44:01,400 And repeat the process again. 2146 01:44:01,400 --> 01:44:04,130 Go ahead and assign each of the points to the cluster 2147 01:44:04,130 --> 01:44:05,510 that they are closest to. 2148 01:44:05,510 --> 01:44:08,690 So once we reach a point where we've assigned all the points to clusters, 2149 01:44:08,690 --> 01:44:12,190 to the cluster that they are nearest to, and nothing changed, 2150 01:44:12,190 --> 01:44:15,650 we've reached a sort of equilibrium in this situation, where no points are 2151 01:44:15,650 --> 01:44:17,030 changing their allegiance. 2152 01:44:17,030 --> 01:44:19,850 And as a result, we can declare this algorithm is now over. 2153 01:44:19,850 --> 01:44:22,910 And we now have some assignment of each of these points 2154 01:44:22,910 --> 01:44:24,230 into three different clusters. 2155 01:44:24,230 --> 01:44:26,090 And it looks like we did a pretty good job 2156 01:44:26,090 --> 01:44:29,960 of trying to identify which points are more similar to one another 2157 01:44:29,960 --> 01:44:31,700 than they are two points in other groups. 2158 01:44:31,700 --> 01:44:34,970 So we have the green cluster down here, this blue cluster here, 2159 01:44:34,970 --> 01:44:37,650 and then this red cluster over there as well. 2160 01:44:37,650 --> 01:44:40,490 And we did so without any access to some labels 2161 01:44:40,490 --> 01:44:43,040 to tell us what these various different clusters were. 2162 01:44:43,040 --> 01:44:45,890 We just used an algorithm in an unsupervised sentence 2163 01:44:45,890 --> 01:44:48,470 without any of those labels to figure out which 2164 01:44:48,470 --> 01:44:50,820 points belonged to which categories. 2165 01:44:50,820 --> 01:44:54,663 And again, lots of applications for this type of clustering technique. 2166 01:44:54,663 --> 01:44:57,830 And there are many more algorithms in each of these various different fields 2167 01:44:57,830 --> 01:45:01,310 within machine learning-- supervised, and reinforcement, and unsupervised. 2168 01:45:01,310 --> 01:45:04,490 But those are many of the big picture foundational ideas that 2169 01:45:04,490 --> 01:45:06,408 underlie a lot of these techniques-- 2170 01:45:06,408 --> 01:45:08,700 that these are the problems that we're trying to solve. 
2171 01:45:08,700 --> 01:45:11,810 And we try and solve those problems using a number of different methods. 2172 01:45:11,810 --> 01:45:14,733 Of trying to take data and learn patterns in that data 2173 01:45:14,733 --> 01:45:17,150 whether that's trying to find neighboring data points that 2174 01:45:17,150 --> 01:45:19,790 are similar or trying to minimize some sort of loss 2175 01:45:19,790 --> 01:45:22,460 function, where any number of other techniques 2176 01:45:22,460 --> 01:45:26,420 that allow us to begin to try to solve these sorts of problems. 2177 01:45:26,420 --> 01:45:28,430 That then was a look at some of the principles 2178 01:45:28,430 --> 01:45:30,410 that are at the foundation of modern machine 2179 01:45:30,410 --> 01:45:33,930 learning-- this ability to take data and learn from that data 2180 01:45:33,930 --> 01:45:37,160 so that the computer can perform a task, even if they haven't explicitly been 2181 01:45:37,160 --> 01:45:39,508 given instructions in order to do so. 2182 01:45:39,508 --> 01:45:41,300 Next time, we'll continue this conversation 2183 01:45:41,300 --> 01:45:43,842 about machine learning looking at other techniques we can use 2184 01:45:43,842 --> 01:45:45,780 for solving these sorts of problems. 2185 01:45:45,780 --> 01:45:47,500 We'll see you then. 2186 01:45:47,500 --> 01:45:48,000