[MUSIC PLAYING]

SPEAKER 1: All right. Welcome back, everyone, to an introduction to Artificial Intelligence with Python. Now last time, we took a look at machine learning-- a set of techniques that computers can use in order to take a set of data and learn some patterns inside of that data, learn how to perform a task, even if we, the programmers, didn't give the computer explicit instructions for how to perform that task. Today, we transition to one of the most popular techniques and tools within machine learning, that of neural networks. And neural networks were inspired as early as the 1940s by researchers who were thinking about how it is that humans learn, studying neuroscience and the human brain, and trying to see whether or not we can apply those same ideas to computers as well, and model computer learning off of human learning.

So how is the brain structured? Well, very simply put, the brain consists of a whole bunch of neurons, and those neurons are connected to one another and communicate with one another in some way. In particular, if you think about the structure of a biological neural network-- something like this-- there are a couple of key properties that scientists observed. One was that these neurons are connected to each other and receive electrical signals from one another, that one neuron can propagate electrical signals to another neuron. And another point is that neurons process those input signals and then can be activated, that a neuron becomes activated at a certain point and then can propagate further signals onto neurons in the future.

And so the question then became, could we take this biological idea of how it is that humans learn-- with brains and with neurons-- and apply that to a machine as well, in effect designing an artificial neural network, or an ANN, which will be a mathematical model for learning that is inspired by these biological neural networks? And what artificial neural networks will allow us to do is they will first be able to model some sort of mathematical function.
Every time you look at a neural network, which we'll see more of later today, each one of them is really just some mathematical function that is mapping certain inputs to particular outputs based on the structure of the network, that depending on where we place particular units inside of this neural network, that's going to determine how it is that the network is going to function. And in particular, artificial neural networks are going to lend themselves to a way that we can learn what the network's parameters should be. We'll see more on that in just a moment. But in effect, we want a model such that it is easy for us to write some code that allows the network to figure out how to model the right mathematical function, given a particular set of input data.

So in order to create our artificial neural network, instead of using biological neurons, we're just going to use what we're going to call units-- units inside of a neural network-- which we can represent kind of like a node in a graph, which will here be represented just by a blue circle like this. And these artificial units-- these artificial neurons-- can be connected to one another. So here, for instance, we have two units that are connected by this edge inside of this graph, effectively. And so what we're going to do now is think of this idea as some sort of mapping from inputs to outputs, that we have one unit that is connected to another unit, where we might think of this side as the input and that side as the output.

And what we're trying to do then is to figure out how to solve a problem, how to model some sort of mathematical function. And this might take the form of something we saw last time, which was something like, we have certain inputs like variables x1 and x2, and given those inputs, we want to perform some sort of task-- a task like predicting whether or not it's going to rain. And ideally, we'd like some way, given these inputs x1 and x2, which stand for some sort of variables to do with the weather, to be able to predict, in this case, a Boolean classification-- is it going to rain, or is it not going to rain? And we did this last time by way of a mathematical function.
We defined some function h for our hypothesis function that took as input x1 and x2-- the two inputs that we cared about processing-- in order to determine whether we thought it was going to rain or whether we thought it was not going to rain. The question then becomes, what does this hypothesis function do in order to make that determination? And we decided last time to use a linear combination of these input variables to determine what the output should be. So our hypothesis function was equal to something like this: weight 0, plus weight 1 times x1, plus weight 2 times x2.

So what's going on here is that x1 and x2-- those are input variables, the inputs to this hypothesis function-- and each of those input variables is being multiplied by some weight, which is just some number. So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2, and we have this additional weight-- weight 0-- that doesn't get multiplied by an input variable at all, that just serves to either move the function's value up or move the function's value down. You can think of this as a weight that's just multiplied by some dummy value, like the number 1; since it's multiplied by 1, its value isn't really changed by any input at all. Or sometimes you'll see in the literature, people call this variable weight 0 a "bias," so that you can think of these variables as slightly different: we have weights that are multiplied by the inputs, and we separately add some bias to the result as well. You'll hear both of those terminologies used when people talk about neural networks and machine learning.

So in effect, what we've done here is that in order to define a hypothesis function, we just need to decide and figure out what these weights should be, to determine what values to multiply by our inputs to get some sort of result. Of course, at the end of this, what we need to do is make some sort of classification like raining or not raining, and to do that, we use some sort of function to define some sort of threshold. And so we saw, for instance, the step function, which is defined as 1 if the result of multiplying the weights by the inputs is at least 0, and otherwise as 0.
You can think of this line down the middle-- it's kind of like a dotted line. Effectively, it stays at 0 all the way up to one point, and then the function steps-- or jumps up-- to 1. So it's 0 before it reaches some threshold, and then it's 1 after it reaches a particular threshold. And so this was one way we could define what we'll come to call an "activation function," a function that determines when it is that this output becomes active-- changes to a 1 instead of being a 0.

But we also saw that if we didn't just want a purely binary classification, if we didn't want purely 1 or 0, but we wanted to allow for some in-between real number values, we could use a different function. And there are a number of choices, but the one that we looked at was the logistic sigmoid function, which has sort of an S-shaped curve, where we could represent the output as a probability-- maybe somewhere in between, the probability of rain is something like 0.5, and maybe a little bit later the probability of rain is 0.8-- and so rather than just have a binary classification of 0 or 1, we can allow for numbers that are in between as well. And it turns out there are many other different types of activation functions, where an activation function just takes the result of multiplying the inputs by the weights and adding that bias, and then figures out what the actual output should be. Another popular one is the rectified linear unit, otherwise known as ReLU, and the way that works is that it just takes its input and takes the maximum of that input and 0. So if it's positive, it remains unchanged, but if it's negative, it goes ahead and levels out at 0. And there are other activation functions that we can choose as well.

But in short, each of these activation functions you can just think of as a function that gets applied to the result of all of this computation: we take some function g and apply it to the result of all of that calculation. And this then is what we saw last time-- a way of defining some hypothesis function that takes in inputs, calculates some linear combination of those inputs, and then passes it through some sort of activation function to get our output.
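To make that concrete, here is a minimal sketch in Python of that hypothesis function and the activation functions just described; the function names and the example weights here are illustrative, not code from the lecture itself.

```python
import math

def step(z):
    # 1 once the weighted sum reaches the threshold of 0, otherwise 0
    return 1 if z >= 0 else 0

def sigmoid(z):
    # logistic sigmoid: an S-shaped curve producing a value between 0 and 1
    return 1 / (1 + math.exp(-z))

def relu(z):
    # rectified linear unit: the maximum of the input and 0
    return max(0, z)

def hypothesis(x1, x2, w0, w1, w2, activation=step):
    # linear combination of the inputs, passed through an activation function g
    return activation(w0 + w1 * x1 + w2 * x2)

# With made-up weights, the same weighted sum can feed different activations:
print(hypothesis(3.0, 1.5, w0=-4, w1=1, w2=1))                      # step: 1
print(hypothesis(3.0, 1.5, w0=-4, w1=1, w2=1, activation=sigmoid))  # about 0.62
```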
And this actually turns out to be the model for the simplest of neural networks, where we're going to instead represent this mathematical idea graphically, by using a structure like this. Here then is a neural network that has two inputs. We can think of this as x1 and this as x2. And then one output, which you can think of as classifying whether or not we think it's going to rain or not rain, for example, in this particular instance.

And so how exactly does this model work? Well, each of these two inputs represents one of our input variables-- x1 and x2. And notice that these inputs are connected to this output via these edges, which are going to be defined by their weights. So these edges each have a weight associated with them-- weight 1 and weight 2-- and then this output unit, what it's going to do is calculate an output based on those inputs and based on those weights. This output unit is going to multiply all the inputs by their weights, add in this bias term, which you can think of as an extra w0 term that gets added into it, and then pass it through an activation function. So this then is just a graphical way of representing the same idea we saw last time, just mathematically. And we're going to call this a very simple neural network.

And we'd like for this neural network to be able to learn how to calculate some function, that we want some function for the neural network to learn, and the neural network is going to learn what the values of w0, w1, and w2 should be, and what the activation function should be, in order to get the result that we would expect.

So we can actually take a look at an example of this. What then is a very simple function that we might calculate? Well, if we recall back from when we were looking at propositional logic, one of the simplest functions we looked at was something like the or function, which takes two inputs-- x and y-- and outputs 1, otherwise known as true, if either one of the inputs, or both of them, are 1, and outputs a 0 if both of the inputs are 0, or false. So this then is the or function.
And this was the truth table for the or function-- that as long as either of the inputs is 1, the output of the function is 1, and the only case where the output is 0 is where both of the inputs are 0. So the question is, how could we take this and train a neural network to be able to learn this particular function? What would those weights look like?

Well, we could do something like this. Here's our neural network, and I'll propose that in order to calculate the or function, we're going to use a value of 1 for each of the weights, and we'll use a bias of negative 1, and then we'll just use this step function as our activation function.

How then does this work? Well, if I wanted to calculate something like 0 or 0, which we know to be 0, because false or false is false, then what are we going to do? Well, our output unit is going to calculate this input multiplied by the weight. 0 times 1, that's 0. Same thing here. 0 times 1, that's 0. And we'll add to that the bias, minus 1. So that'll give us a result of negative 1. If we plot that on our activation function-- negative 1 is here-- it's before the threshold, which means the output is 0; it's only 1 after the threshold. Since negative 1 is before the threshold, the output that this unit provides is going to be 0. And that's what we would expect it to be, that 0 or 0 should be 0.

What if instead we had had 1 or 0, where this is the number 1? Well, in this case, in order to calculate what the output is going to be, we again have to do this weighted sum. 1 times 1, that's 1. 0 times 1, that's 0. The sum of that so far is 1. Add negative 1 to that, and we get the value 0. And if we plot 0 on the step function, 0 ends up being here-- it's just at the threshold-- and so the output here is going to be 1, because the output of 1 or 0, that's 1. So that's what we would expect as well.

And just for one more example, if I had 1 or 1, what would the result be? Well, 1 times 1 is 1. 1 times 1 is 1. The sum of those is 2.
I add the bias term to that, and I get the number 1. 1 plotted on this graph is way over there. That's well beyond the threshold. And so this output is going to be 1 as well. The output is always 0 or 1, depending on whether or not we're past the threshold. And this neural network then models the or function-- a very simple function, definitely-- but it still is able to model it correctly. If I give it the inputs, it will tell me what x1 or x2 happens to be.

And you could imagine trying to do this for other functions as well-- a function like the and function, for instance, which takes two inputs and calculates whether both x and y are true. So if x is 1 and y is 1, then the output of x and y is 1, but in all of the other cases, the output is 0. How could we model that inside of a neural network as well? Well, it turns out we could do it in the same way, except instead of negative 1 as the bias, we can use negative 2 as the bias instead.

What does that end up looking like? Well, if I had 1 and 1, that should be 1, because true and true is equal to true. Well, I take 1 times 1. That's 1. 1 times 1 is 1. I've got a total sum of 2 so far. Now I add the bias of negative 2, and I get the value 0. And 0, when I plot it on the activation function, is just past that threshold, and so the output is going to be 1. But if I had any other input, for example like 1 and 0, well, the weighted sum of these is 1 plus 0. It's going to be 1. Minus 2 is going to give us negative 1, and negative 1 is not past that threshold, and so the output is going to be 0.

So those then are some very simple functions that we can model using a neural network that has two inputs and one output, where our goal is to be able to figure out what those weights should be in order to determine what the output should be. And you could imagine generalizing this to calculate more complex functions as well, that maybe given the humidity and the pressure, we want to calculate what's the probability that it's going to rain, for example.
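As a rough check in Python, a single unit with the step activation and the weights and biases proposed above does reproduce the or and and truth tables; this is just a sketch of the arithmetic walked through above.

```python
def unit(x1, x2, w1, w2, bias):
    # a single unit: weighted sum of the inputs plus the bias,
    # passed through the step activation function
    total = w1 * x1 + w2 * x2 + bias
    return 1 if total >= 0 else 0

# OR: weights of 1 and a bias of -1, as proposed above
assert unit(0, 0, 1, 1, -1) == 0   # 0 + 0 - 1 = -1, before the threshold
assert unit(1, 0, 1, 1, -1) == 1   # 1 + 0 - 1 =  0, at the threshold
assert unit(1, 1, 1, 1, -1) == 1   # 1 + 1 - 1 =  1, past the threshold

# AND: the same weights of 1, but a bias of -2
assert unit(1, 1, 1, 1, -2) == 1   # 1 + 1 - 2 =  0
assert unit(1, 0, 1, 1, -2) == 0   # 1 + 0 - 2 = -1
```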
Or you might want to do a regression-style problem, where given some amount of advertising and given what month it is, maybe, we want to predict what our expected sales are going to be for that particular month. So you could imagine these inputs and outputs being different as well.

And it turns out that in some problems, we're not just going to have two inputs, and the nice thing about these neural networks is that we can compose multiple units together-- make our networks more complex-- just by adding more units into this particular neural network. So the network we've been looking at has two inputs and one output. But we could just as easily say, let's go ahead and have three inputs in there, or have even more inputs, where we could arbitrarily decide, however many inputs there are to our problem, all going to be used in calculating some sort of output that we care about figuring out the value of.

How then does the math work for figuring out that output? Well, it's going to work in a very similar way. In the case of two inputs, we had two weights indicated by these edges, and we multiplied the weights by the numbers, adding this bias term, and we'll do the same thing in the other cases as well. If I have three inputs, you'll imagine multiplying each of these three inputs by each of these weights. If I had five inputs instead, we're going to do the same thing. Here, I'm saying sum up, from i equals 1 to 5, xi multiplied by weight i. So take each of the five input variables, multiply them by their corresponding weight, and then add the bias to that. So this would be a case where there are five inputs into this neural network, for example. But there could be arbitrarily many nodes that we want inside of this neural network, where each time we're just going to sum up all of those input variables multiplied by their weights, and then add the bias term at the very end. And so this allows us to be able to represent problems that have even more inputs, just by growing the size of our neural network.

Now, the next question we might ask is a question about how it is that we train these neural networks.
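In code, that generalization to any number of inputs might look like the following sketch; the particular inputs and weights are made up purely for illustration.

```python
def weighted_sum(inputs, weights, bias):
    # multiply each input by its corresponding weight, then add the bias
    return bias + sum(w * x for w, x in zip(weights, inputs))

# e.g., five inputs and five weights (illustrative values)
value = weighted_sum(inputs=[3.0, 1.5, 0.0, 2.0, 4.0],
                     weights=[0.2, -0.1, 0.5, 0.3, 0.1],
                     bias=-1.0)
```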
In the case of the or function and the and function, they were simple enough functions that I could just tell you what the weights should be, and you could probably reason through it yourself what the weights should be in order to calculate the output that you want. But in general, with functions like predicting sales or predicting whether or not it's going to rain, these are much trickier functions to be able to figure out. We would like the computer to have some mechanism of calculating what it is that the weights should be-- how it is to set the weights-- so that our neural network is able to accurately model the function that we care about trying to estimate.

And it turns out that the strategy for doing this, inspired by the domain of calculus, is a technique called gradient descent. And what gradient descent is, is an algorithm for minimizing loss when you're training a neural network. And recall that loss refers to how bad our hypothesis function happens to be, that we can define certain loss functions-- and we saw some examples of loss functions last time-- that just give us a number for any particular hypothesis, saying how poorly does it model the data? How many examples does it get wrong? How is it worse or less bad as compared to other hypothesis functions that we might define?

And this loss function is just a mathematical function, and when you have a mathematical function, in calculus, what you can do is calculate something known as the gradient, which you can think of as like a slope: the direction the loss function is moving at any particular point. And what it's going to tell us is in which direction we should be moving these weights in order to minimize the amount of loss.

And so generally speaking-- we won't get into the calculus of it-- the high-level idea for gradient descent is going to look something like this. If we want to train a neural network, we'll go ahead and start just by choosing the weights randomly. Just pick random weights for all of the weights in the neural network.
And then we'll use the input data that we have access to in order to train the network, in order to figure out what the weights should actually be. So we'll repeat this process again and again. The first step is we're going to calculate the gradient based on all of the data points. So we'll look at all the data and figure out what the gradient is at the place where we currently are-- for the current setting of the weights-- which tells us in which direction we should move the weights in order to minimize the total amount of loss and make our solution better. And once we've calculated that gradient-- which direction we should move in the loss function-- well, then we can just update those weights according to the gradient, take a small step in that direction, in order to try to make our solution a little bit better. And the size of the step that we take, that's going to vary, and you can choose that when you're training a particular neural network.

But in short, the idea is going to be: take all of the data points, figure out based on those data points in what direction the weights should move, and then move the weights one small step in that direction. And if you repeat that process over and over again, adjusting the weights a little bit at a time based on all the data points, eventually you should end up with a pretty good solution to this sort of problem. At least that's what we would hope to happen.

Now as you look at this algorithm, a good question to ask anytime you're analyzing an algorithm is, what is going to be the expensive part of doing the calculation? What's going to take a lot of work-- what is going to be expensive to calculate?
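As a very high-level sketch of that loop in Python, assuming we already have some gradient(weights, X, y) function that computes the gradient of the loss over all of the data points (that function, and the names here, are assumptions for illustration, not the lecture's code):

```python
import numpy as np

def gradient_descent(X, y, gradient, n_weights, learning_rate=0.01, epochs=1000):
    # start with a random choice of weights
    weights = np.random.randn(n_weights)
    for _ in range(epochs):
        # calculate the gradient of the loss based on all of the data points
        grad = gradient(weights, X, y)
        # take a small step in the direction that reduces the loss
        weights = weights - learning_rate * grad
    return weights
```

The learning_rate parameter here plays the role of the step size mentioned above: how far we move the weights on each update.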
And in particular, in the case of gradient descent, the really expensive part is this "all data points" part right here: having to take all of the data points and use them to figure out what the gradient is at this particular setting of all of the weights. Because odds are, in a big machine learning problem, where you're trying to solve a big problem with a lot of data, you have a lot of data points to calculate with, and figuring out the gradient based on all of those data points is going to be expensive. And you'll have to do it many times; you'll likely repeat this process again and again and again, going through all the data points, taking one small step over and over, as you try and figure out what the optimal setting of those weights happens to be.

It turns out that we would ideally like to be able to train our neural networks faster, to be able to more quickly converge to some sort of solution that is going to be a good solution to the problem. So in that case, there are alternatives to just standard gradient descent, which looks at all of the data points at once. We can employ a method like stochastic gradient descent, which will just randomly choose one data point at a time to calculate the gradient based on, instead of calculating it based on all of the data points. So the idea there is that we have some setting of the weights, we pick a data point, and based on that one data point we figure out in which direction we should move all of the weights, and move the weights in that small direction; then take another data point and do that again, and repeat this process again and again, maybe looking at each of the data points multiple times, but each time only using one data point to calculate the gradient-- to calculate which direction we should move in.

Now, just using one data point instead of all of the data points probably gives us a less accurate estimate of what the gradient actually is. But on the plus side, it's going to be much faster to calculate: we can much more quickly calculate what the gradient is based on one data point, instead of calculating it based on all of the data points and having to do all of that computational work again and again.
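A sketch of that variation, again assuming a helper gradient_one(weights, x, y) that computes the gradient from a single data point (an illustrative assumption, not the lecture's code):

```python
import numpy as np

def stochastic_gradient_descent(X, y, gradient_one, n_weights,
                                learning_rate=0.01, epochs=10):
    weights = np.random.randn(n_weights)
    for _ in range(epochs):
        # visit the data points in a random order, one at a time
        for i in np.random.permutation(len(X)):
            # gradient estimated from just this single data point
            grad = gradient_one(weights, X[i], y[i])
            weights = weights - learning_rate * grad
    return weights
```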
So there are trade-offs here between looking at all of the data points and just looking at one data point. And it turns out that a middle ground-- and this is also quite popular-- is a technique called mini-batch gradient descent, where the idea there is, instead of looking at all of the data versus just a single point, we instead divide our dataset up into small batches-- groups of data points-- where you can decide how big a particular batch is. But in short, you're just going to look at a small number of points at any given time, hopefully getting a more accurate estimate of the gradient than a single point would give, but also not requiring all of the computational effort needed to look at every single one of these data points (a short sketch of this in code appears at the end of this passage).

So gradient descent then is this technique that we can use in order to train these neural networks, in order to figure out what the setting of all of these weights should be, if we want some way to try and get an accurate notion of how it is that this function should work-- some way of modeling how to transform the inputs into particular outputs.

So far, the networks that we've taken a look at have all been structured similar to this. We have some number of inputs-- maybe two or three or five or more-- and then we have one output that is just predicting rain or no rain, or just predicting one particular value. But often in machine learning problems, we don't just care about one output. We might care about an output that has multiple different values associated with it. So in the same way that we could take a neural network and add units to the input layer, we can likewise add outputs to the output layer as well. Instead of just one output, you could imagine we have two outputs, or we could have four outputs, for example, where in each case, as we add more inputs or add more outputs, if we want to keep this network fully connected between these two layers, we just need to add more weights, so that now each of these input nodes has four weights associated with each of the four outputs, and that's true for each of these various different input nodes.
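Returning to the mini-batch idea mentioned above, here is a rough sketch; it assumes X and y are NumPy arrays and that a gradient(weights, X_batch, y_batch) helper is available, both of which are illustrative assumptions.

```python
import numpy as np

def minibatch_gradient_descent(X, y, gradient, n_weights,
                               batch_size=32, learning_rate=0.01, epochs=10):
    weights = np.random.randn(n_weights)
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        # process the data a small batch at a time
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad = gradient(weights, X[batch], y[batch])
            weights = weights - learning_rate * grad
    return weights
```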
So as we add nodes, we add more weights in order to make sure that each of the inputs can somehow be connected to each of the outputs, so that each output value can be calculated based on what the value of the input happens to be.

So what might a case be where we want multiple different output values? Well, you might consider that in the case of weather prediction, for example, we might not just care whether it's raining or not raining. There might be multiple different categories that we would like to categorize the weather into. With just a single output variable, we can do a binary classification, like rain or no rain, for instance-- 1 or 0-- but it doesn't allow us to do much more than that. With multiple output variables, I might be able to use each one to predict something a little different. Maybe I want to categorize the weather into one of four different categories, something like, is it going to be raining or sunny or cloudy or snowy, and I now have four output variables that can be used to represent maybe the probability that it is raining, as opposed to sunny, as opposed to cloudy, as opposed to snowy.

How then would this neural network work? Well, we have some input variables that represent some data that we have collected about the weather. Each of those inputs gets multiplied by each of these various different weights. We have more multiplications to do, but these are fairly quick mathematical operations to perform. And then, after passing them through some sort of activation function at the outputs, we end up getting some sort of number, where that number, you might imagine, you can interpret as a probability-- a probability that it is one category, as opposed to another category. So here we're saying that based on the inputs, we think there is a 10% chance that it's raining, a 60% chance that it's sunny, a 20% chance that it's cloudy, and a 10% chance that it's snowy.
And given that output, if these represent a probability distribution, well, then you could just pick whichever one has the highest value-- in this case, sunny-- and say that, well, most likely, we think that this categorization of inputs means that the output should be sunny, and that is what we would expect the weather to be in this particular instance.

So this allows us to do this sort of multi-class classification, where instead of just having a binary classification-- 1 or 0-- we can have as many different categories as we want, and we can have our neural network output these probabilities over which categories are more likely than other categories, and using that data, we're able to draw some sort of inference on what it is that we should do.

So this was sort of the idea of supervised machine learning. I can give this neural network a whole bunch of data-- a whole bunch of input data-- corresponding to some label, some output data-- like we know that it was raining on this day, we know that it was sunny on that day-- and using all of that data, the algorithm can use gradient descent to figure out what all of the weights should be in order to create some sort of model that hopefully allows us a way to predict what we think the weather is going to be.

But neural networks have a lot of other applications as well. You can imagine applying the same sort of idea to a reinforcement learning sort of example as well. You'll remember that in reinforcement learning, what we wanted to do is train some sort of agent to learn what action to take depending on what state it currently happens to be in. So depending on the current state of the world, we wanted the agent to pick from one of the actions that is available to it.
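As a rough sketch of a layer with several outputs, each output gets its own weights and bias, and some final step turns the raw values into a probability distribution over the categories. The softmax function used below is one common way to do that last step (the lecture doesn't name a specific one here), and all of the numbers are placeholders.

```python
import numpy as np

def multi_output_layer(inputs, weights, biases):
    # weights has shape (n_outputs, n_inputs); biases has shape (n_outputs,)
    raw = weights @ inputs + biases
    # softmax: turn the raw values into probabilities that sum to 1
    exps = np.exp(raw - np.max(raw))   # subtract the max for numerical stability
    return exps / exps.sum()

categories = ["rain", "sun", "cloudy", "snow"]
inputs = np.array([0.3, 0.6, 0.2])                       # made-up weather data
probabilities = multi_output_layer(inputs,
                                   np.random.randn(4, 3),
                                   np.random.randn(4))
prediction = categories[int(np.argmax(probabilities))]   # most likely category
```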
And you might model that by having each of these input variables represent some information about the state-- some data about what state our agent is currently in-- and then the output, for example, could be each of the various different actions that our agent could take-- action 1, 2, 3, and 4. And you might imagine that this network would work in the same way, that based on these particular inputs we go ahead and calculate values for each of these outputs, and those outputs could model which action is better than other actions, and we could just choose, based on looking at those outputs, which actions we should take.

And so these neural networks are very broadly applicable, that all they're really doing is modeling some mathematical function. So anything that we can frame as a mathematical function-- something like classifying inputs into various different categories, or figuring out based on some input state what action we should take-- these are all mathematical functions that we could attempt to model by taking advantage of this neural network structure, and in particular, taking advantage of this technique, gradient descent, that we can use in order to figure out what the weights should be in order to do this sort of calculation.

Now how is it that you would go about training a neural network that has multiple outputs instead of just one? Well, with just a single output, we could see what the output for that value should be, and then update all of the weights that correspond to it. And when we have multiple outputs, at least in this particular case, we can really think of this as four separate neural networks, that really we just have one network here that has these three inputs, corresponding with these three weights, corresponding to this one output value. And the same thing is true for this output value. This output value effectively defines yet another neural network that has these same three inputs, but a different set of weights that correspond to this output. And likewise, this output has its own set of weights as well, and the same thing for the fourth output too.
And so if you wanted to train a neural network that had four outputs instead of just one, in this case where the inputs are directly connected to the outputs, you could really think of this as just training four independent neural networks. We know what the outputs for each of these four should be based on our input data, and using that data, we can begin to figure out what all of these individual weights should be, and maybe there's an additional step at the end to make sure that we turn these values into a probability distribution, such that we can interpret which one is better than another, or more likely than another as a category, or something like that.

So this then seems like it does a pretty good job of taking inputs and trying to predict what outputs should be, and we'll see some real examples of this in just a moment as well. But it's important then to think about what the limitations of this sort of approach are, of just taking some linear combination of inputs and passing it into some sort of activation function. And it turns out that when we do this in the case of binary classification-- trying to predict, does it belong to one category or another-- we can only predict things that are linearly separable, because we're taking a linear combination of inputs and using that to define some decision boundary or threshold. What we get is a situation where, if we have this set of data, we can find a line that linearly separates the red points from the blue points. But a single unit that is making a binary classification, otherwise known as a perceptron, can't deal with a situation like this, where-- we've seen this type of situation before-- there is no straight line that just goes straight through the data that will divide the red points away from the blue points. It's a more complex decision boundary. The decision boundary somehow needs to capture the things inside of the circle, and there isn't really a line that will allow us to deal with that.
So this is the limitation of the perceptron-- these units that just make these binary decisions based on their inputs-- that a single perceptron is only capable of learning a linearly separable decision boundary. All it can do is define a line. And sure, it can give us probabilities based on how close to that decision boundary we are, but it can only really decide based on a linear decision boundary. And so this doesn't seem like it's going to generalize well to situations where real-world data is involved, because real-world data often isn't linearly separable. It often isn't the case that we can just draw a line through the data and be able to divide it up into multiple groups.

So what then is the solution to this? Well, what was proposed was the idea of a multilayer neural network. So far, all of the neural networks we've seen have had a set of inputs and a set of outputs, and the inputs are connected directly to those outputs. But a multilayer neural network is an artificial neural network that still has an input layer and an output layer, but also has one or more hidden layers in between-- other layers of artificial neurons, or units, that are going to calculate their own values as well. So instead of a neural network that looks like this, with three inputs and one output, you might imagine, in the middle here, injecting a hidden layer-- something like this. This is a hidden layer that has four nodes. You could choose how many nodes or units end up going into the hidden layer, and you can have multiple hidden layers as well.

And so now each of these inputs isn't directly connected to the output. Each of the inputs is connected to this hidden layer, and then all of the nodes in the hidden layer are connected to the one output. And so this is just another step that we can take towards calculating more complex functions. Each of these hidden units will calculate its output value, otherwise known as its activation, based on a linear combination of all the inputs.
And once we have values for all of these nodes, as opposed to this just being the output, we do the same thing again-- calculate the output for this node, based on multiplying each of the values for these units by their weights as well. So in effect, the way this works is that we start with inputs. They get multiplied by weights in order to calculate values for the hidden nodes. Those get multiplied by weights in order to figure out what the ultimate output is going to be.

And the advantage of layering things like this is it gives us an ability to model more complex functions, that instead of just having a single decision boundary-- a single line dividing the red points from the blue points-- each of these hidden nodes can learn a different decision boundary, and we can combine those decision boundaries to figure out what the ultimate output is going to be. And as we begin to imagine more complex situations, you could imagine each of these nodes learning some useful property, or learning some useful feature, of all of the inputs, and somehow learning how to combine those features together in order to get the output that we actually want.

Now the natural question, when we begin to look at this, is to ask, how do we train a neural network that has hidden layers inside of it? And this turns out to initially be a bit of a tricky question, because in the input data we are given, we are given values for all of the inputs, and we're given what the value of the output should be-- what the category is, for example-- but the input data doesn't tell us what the values for all of these nodes in the middle should be. So we don't know how far off each of these nodes actually is, because we're only given data for the inputs and the outputs. The reason this is called the hidden layer is because the data that is made available to us doesn't tell us what the values for all of these intermediate nodes should actually be.
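Before turning to training, here is a sketch of that forward calculation in code-- inputs multiplied by weights to get the hidden activations, and those multiplied by weights to get the output. The sigmoid activation and the random placeholder weights are illustrative choices; training would determine the real weight values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(inputs, w_hidden, b_hidden, w_output, b_output):
    # each hidden unit computes its activation from a linear
    # combination of all the inputs
    hidden = sigmoid(w_hidden @ inputs + b_hidden)
    # the output unit then does the same thing with the hidden activations
    return sigmoid(w_output @ hidden + b_output)

# three inputs, a hidden layer of four units, one output
output = forward(np.array([0.5, 0.1, 0.9]),
                 np.random.randn(4, 3), np.random.randn(4),
                 np.random.randn(4), 0.0)
```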
682 00:32:58,530 --> 00:33:03,020 And so the strategy people came up with was to say that if you know what 683 00:33:03,020 --> 00:33:07,010 the error, or the loss, is on the output node, well, 684 00:33:07,010 --> 00:33:10,280 then based on what these weights are-- if one of these weights is higher than 685 00:33:10,280 --> 00:33:11,000 another-- 686 00:33:11,000 --> 00:33:16,670 you can calculate an estimate for how much the error from this node 687 00:33:16,670 --> 00:33:20,492 was due to this part of the hidden layer, or this part, 688 00:33:20,492 --> 00:33:23,450 or this part, based on the values of these weights, 689 00:33:23,450 --> 00:33:26,480 in effect saying that, based on the error from the output, 690 00:33:26,480 --> 00:33:29,690 I can backpropagate the error and figure out 691 00:33:29,690 --> 00:33:34,207 an estimate for what the error is for each of these nodes in the hidden layer as well. 692 00:33:34,207 --> 00:33:37,290 And there's some more calculus here that we won't get into the details of, 693 00:33:37,290 --> 00:33:40,550 but the idea of this algorithm is known as backpropagation. 694 00:33:40,550 --> 00:33:42,770 It's an algorithm for training a neural network 695 00:33:42,770 --> 00:33:44,930 with multiple different hidden layers. 696 00:33:44,930 --> 00:33:47,000 And the idea for this-- the pseudocode for it-- 697 00:33:47,000 --> 00:33:50,690 will again be, if we want to run gradient descent with backpropagation, 698 00:33:50,690 --> 00:33:54,050 we'll start with a random choice of weights as we did before, 699 00:33:54,050 --> 00:33:57,540 and now we'll go ahead and repeat the training process again and again. 700 00:33:57,540 --> 00:33:59,810 But what we're going to do each time is now 701 00:33:59,810 --> 00:34:02,720 we're going to calculate the error for the output layer first. 702 00:34:02,720 --> 00:34:05,940 We know the output and what it should be, and we know what we calculated, 703 00:34:05,940 --> 00:34:08,389 so we figure out what the error there is. 704 00:34:08,389 --> 00:34:11,060 But then we're going to repeat, for every layer, 705 00:34:11,060 --> 00:34:13,963 starting with the output layer, moving back into the hidden layer, 706 00:34:13,963 --> 00:34:16,880 then the hidden layer before that if there are multiple hidden layers, 707 00:34:16,880 --> 00:34:19,219 going back all the way to the very first hidden layer, 708 00:34:19,219 --> 00:34:23,750 assuming there are multiple, we're going to propagate the error back one layer-- 709 00:34:23,750 --> 00:34:25,520 whatever the error was from the output-- 710 00:34:25,520 --> 00:34:28,550 figure out what the error should be a layer before that, based on what 711 00:34:28,550 --> 00:34:30,630 the values of those weights are. 712 00:34:30,630 --> 00:34:33,697 And then we can update those weights. 713 00:34:33,697 --> 00:34:35,780 So graphically, the way you might think about this 714 00:34:35,780 --> 00:34:37,460 is that we first start with the output. 715 00:34:37,460 --> 00:34:39,080 We know what the output should be. 716 00:34:39,080 --> 00:34:40,497 We know what output we calculated. 717 00:34:40,497 --> 00:34:42,497 And based on that, we can figure out, all right, 718 00:34:42,497 --> 00:34:45,020 how do we need to update those weights, backpropagating 719 00:34:45,020 --> 00:34:47,330 the error to these nodes. 720 00:34:47,330 --> 00:34:50,290 And using that, we can figure out how we should update these weights.
721 00:34:50,290 --> 00:34:52,415 And you might imagine if there are multiple layers, 722 00:34:52,415 --> 00:34:54,500 we could repeat this process again and again 723 00:34:54,500 --> 00:34:58,427 to begin to figure out how all of these weights should be updated. 724 00:34:58,427 --> 00:35:00,260 And this backpropagation algorithm is really 725 00:35:00,260 --> 00:35:03,080 the key algorithm that makes neural networks possible, 726 00:35:03,080 --> 00:35:06,510 and makes it possible to take these multi-level structures 727 00:35:06,510 --> 00:35:09,020 and be able to train those structures, depending 728 00:35:09,020 --> 00:35:12,380 on what the values of these weights are in order to figure out 729 00:35:12,380 --> 00:35:15,290 how it is that we should go about updating those weights in order 730 00:35:15,290 --> 00:35:19,370 to create some function that is able to minimize the total amount of loss, 731 00:35:19,370 --> 00:35:22,910 to figure out some good setting of the weights that will take the inputs 732 00:35:22,910 --> 00:35:26,360 and translate it into the output that we expect. 733 00:35:26,360 --> 00:35:29,165 And this works, as we said, not just for a single hidden layer, 734 00:35:29,165 --> 00:35:32,210 but you can imagine multiple hidden layers, where each hidden layer-- 735 00:35:32,210 --> 00:35:34,490 we just defined however many nodes we want-- 736 00:35:34,490 --> 00:35:36,470 where each of the nodes in one layer, we can 737 00:35:36,470 --> 00:35:40,010 connect to the nodes in the next layer, defining more and more complex 738 00:35:40,010 --> 00:35:45,190 networks that are able to model more and more complex types of functions. 739 00:35:45,190 --> 00:35:49,100 And so this type of network is what we might call a deep neural network, part 740 00:35:49,100 --> 00:35:52,098 of a larger family of deep learning algorithms, 741 00:35:52,098 --> 00:35:53,390 if you've ever heard that term. 742 00:35:53,390 --> 00:35:57,620 And all deep learning is about is it's using multiple layers to be 743 00:35:57,620 --> 00:36:01,130 able to predict and be able to model higher-level features inside 744 00:36:01,130 --> 00:36:03,910 of the input, to be able to figure out what the output should be. 745 00:36:03,910 --> 00:36:06,410 And so the deep neural network is just a neural network that 746 00:36:06,410 --> 00:36:09,230 has multiple of these hidden layers, where we start at the input, 747 00:36:09,230 --> 00:36:12,500 calculate values for this layer, then this layer, then this layer, 748 00:36:12,500 --> 00:36:14,460 and then ultimately get an output. 749 00:36:14,460 --> 00:36:17,600 And this allows us to be able to model more and more sophisticated 750 00:36:17,600 --> 00:36:20,030 types of functions, that each of these layers 751 00:36:20,030 --> 00:36:22,710 can calculate something a little bit different. 752 00:36:22,710 --> 00:36:27,290 And we can combine that information to figure out what the output should be. 753 00:36:27,290 --> 00:36:29,840 Of course, as with any situation of machine learning, 754 00:36:29,840 --> 00:36:32,330 as we begin to make our models more and more complex, 755 00:36:32,330 --> 00:36:35,920 to model more and more complex functions, the risk we run 756 00:36:35,920 --> 00:36:37,670 is something like overfitting. 
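As a minimal sketch of what that gradient-descent-with-backpropagation loop can look like for a network with a single hidden layer and sigmoid activations: the XOR dataset, the layer sizes, the learning rate, and the number of iterations here are illustrative choices, and the derivative factors are exactly the calculus the lecture skips over.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A tiny dataset chosen for illustration (XOR is not linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # start with random weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 1.0

for _ in range(5000):
    # Forward pass: inputs -> hidden layer -> output.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Error at the output layer (we know what the output should have been).
    output_error = (output - y) * output * (1 - output)

    # Propagate that error back one layer, through the weights W2.
    hidden_error = (output_error @ W2.T) * hidden * (1 - hidden)

    # Update the weights, one layer at a time.
    W2 -= lr * hidden.T @ output_error
    b2 -= lr * output_error.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ hidden_error
    b1 -= lr * hidden_error.sum(axis=0, keepdims=True)

print(output.round(3))   # after training, close to the desired 0, 1, 1, 0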
757 00:36:37,670 --> 00:36:39,620 And we talked about overfitting last time 758 00:36:39,620 --> 00:36:44,210 in the context of training our models to be 759 00:36:44,210 --> 00:36:47,510 able to learn some sort of decision boundary, where overfitting happens 760 00:36:47,510 --> 00:36:51,300 when we fit too closely to the training data, and as a result, 761 00:36:51,300 --> 00:36:54,990 we don't generalize well to other situations as well. 762 00:36:54,990 --> 00:36:59,000 And one of the risks we run with a far more complex neural network that 763 00:36:59,000 --> 00:37:01,070 has many, many different nodes is that we 764 00:37:01,070 --> 00:37:03,200 might overfit based on the input data; we 765 00:37:03,200 --> 00:37:07,310 might grow over-reliant on certain nodes to calculate things just purely based 766 00:37:07,310 --> 00:37:12,180 on the input data in a way that doesn't allow us to generalize very well to the output. 767 00:37:12,180 --> 00:37:15,190 And there are a number of strategies for dealing with overfitting, 768 00:37:15,190 --> 00:37:18,010 but one of the most popular in the context of neural networks 769 00:37:18,010 --> 00:37:19,900 is a technique known as dropout. 770 00:37:19,900 --> 00:37:23,410 And what dropout does is, when we're training the neural network, 771 00:37:23,410 --> 00:37:26,740 temporarily remove units-- 772 00:37:26,740 --> 00:37:28,900 temporarily remove these artificial neurons 773 00:37:28,900 --> 00:37:32,080 from our network, chosen at random-- and the goal here 774 00:37:32,080 --> 00:37:35,120 is to prevent over-reliance on certain units. 775 00:37:35,120 --> 00:37:37,060 So what generally happens in overfitting is 776 00:37:37,060 --> 00:37:40,660 that we begin to over-rely on certain units inside the neural network 777 00:37:40,660 --> 00:37:43,600 to be able to tell us how to interpret the input data. 778 00:37:43,600 --> 00:37:46,900 What dropout will do is randomly remove some of these units 779 00:37:46,900 --> 00:37:50,260 in order to reduce the chance that we over-rely on certain units, 780 00:37:50,260 --> 00:37:52,630 to make our neural network more robust, to be 781 00:37:52,630 --> 00:37:56,740 able to handle the situation even when we just drop out particular neurons 782 00:37:56,740 --> 00:37:58,140 entirely. 783 00:37:58,140 --> 00:38:00,850 So the way that might work is we have a network like this, 784 00:38:00,850 --> 00:38:03,010 and as we're training it, when we go about trying 785 00:38:03,010 --> 00:38:04,870 to update the weights the first time, we'll 786 00:38:04,870 --> 00:38:08,350 just randomly pick some percentage of the nodes to drop out of the network. 787 00:38:08,350 --> 00:38:10,280 It's as if those nodes aren't there at all. 788 00:38:10,280 --> 00:38:13,490 It's as if the weights associated with those nodes aren't there at all. 789 00:38:13,490 --> 00:38:14,930 And we'll train in this way. 790 00:38:14,930 --> 00:38:17,200 Then the next time we update the weights, we'll pick a different set 791 00:38:17,200 --> 00:38:20,050 and just go ahead and train that way, and then again randomly choose 792 00:38:20,050 --> 00:38:23,360 and train with other nodes that have been dropped out as well.
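A minimal sketch of that idea for a single training update: mask each hidden unit's activation with a random keep-or-drop choice, so dropped units (and, in effect, their weights) contribute nothing on this update. The activations and the 0.5 drop rate are made-up numbers for illustration; real libraries also rescale the surviving activations, which this sketch leaves out.

import numpy as np

rng = np.random.default_rng()
hidden = np.array([0.2, 0.9, 0.5, 0.7])     # activations of four hidden units
keep = rng.random(hidden.shape) > 0.5        # drop each unit with probability 0.5
hidden_dropped = hidden * keep               # dropped units contribute nothing
print(hidden_dropped)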
793 00:38:23,360 --> 00:38:25,990 And the goal of that is that after the training process, 794 00:38:25,990 --> 00:38:29,308 if you train by dropping out random nodes inside of this neural network, 795 00:38:29,308 --> 00:38:32,350 you hopefully end up with a network that's a little bit more robust, that 796 00:38:32,350 --> 00:38:35,620 doesn't rely too heavily on any one particular node, 797 00:38:35,620 --> 00:38:40,420 but more generally learns how to approximate a function in general. 798 00:38:40,420 --> 00:38:42,790 So that then is a look at some of these techniques 799 00:38:42,790 --> 00:38:46,390 that we can use in order to implement a neural network, to get 800 00:38:46,390 --> 00:38:49,060 at the idea of taking this input, passing it 801 00:38:49,060 --> 00:38:51,160 through these various different layers, in order 802 00:38:51,160 --> 00:38:52,870 to produce some sort of output. 803 00:38:52,870 --> 00:38:55,870 And what we'd like to do now is take those ideas and put them into code. 804 00:38:55,870 --> 00:38:58,537 And to do that, there are a number of different machine learning 805 00:38:58,537 --> 00:39:01,120 libraries-- neural network libraries-- that we can use that 806 00:39:01,120 --> 00:39:05,560 allow us to get access to someone's implementation of backpropagation 807 00:39:05,560 --> 00:39:07,210 and all of these hidden layers. 808 00:39:07,210 --> 00:39:09,370 And one of the most popular, developed by Google, 809 00:39:09,370 --> 00:39:11,440 is known as TensorFlow, a library that we 810 00:39:11,440 --> 00:39:13,930 can use for quickly creating neural networks 811 00:39:13,930 --> 00:39:16,780 and modeling them and running them on some sample data 812 00:39:16,780 --> 00:39:18,730 to see what the output is going to be. 813 00:39:18,730 --> 00:39:20,690 And before we actually start writing code, 814 00:39:20,690 --> 00:39:23,380 we'll go ahead and take a look at TensorFlow's Playground, which 815 00:39:23,380 --> 00:39:25,422 will be an opportunity for us just to play around 816 00:39:25,422 --> 00:39:28,180 with this idea of neural networks in different layers, 817 00:39:28,180 --> 00:39:31,660 just to get a sense for what it is that we can do by taking advantage 818 00:39:31,660 --> 00:39:33,950 of a neural networks. 819 00:39:33,950 --> 00:39:37,360 So let's go ahead and go into TensorFlow's Playground, which you can 820 00:39:37,360 --> 00:39:39,670 go to by visiting that URL from before. 821 00:39:39,670 --> 00:39:43,480 And what we're going to do now is we're going to try and learn the decision 822 00:39:43,480 --> 00:39:46,240 boundary for this particular output. 823 00:39:46,240 --> 00:39:49,710 I want to learn to separate the orange points from the blue points, 824 00:39:49,710 --> 00:39:52,090 and I'd like to learn some sort of setting of weights 825 00:39:52,090 --> 00:39:56,690 inside of a neural network that will be able to separate those from each other. 826 00:39:56,690 --> 00:39:58,960 The features we have access to, our input data, 827 00:39:58,960 --> 00:40:03,590 are the x value and the y value, so the two values along each of the two axes. 828 00:40:03,590 --> 00:40:06,340 And what I'll do now is I can set particular parameters, like what 829 00:40:06,340 --> 00:40:09,490 activation function I would like to use, and I'll just go ahead 830 00:40:09,490 --> 00:40:12,720 and press Play and see what happens. 
831 00:40:12,720 --> 00:40:16,560 And what happens here is that you'll see that just by using these two input 832 00:40:16,560 --> 00:40:20,590 features-- the x value and the y value, with no hidden layers-- 833 00:40:20,590 --> 00:40:24,450 just take the input, x and y values, and figure out what the decision boundary 834 00:40:24,450 --> 00:40:24,990 is-- 835 00:40:24,990 --> 00:40:27,600 our neural network learns pretty quickly that in order 836 00:40:27,600 --> 00:40:30,150 to divide these two points, we should just use this line. 837 00:40:30,150 --> 00:40:34,193 This line acts as the decision boundary that separates this group of points 838 00:40:34,193 --> 00:40:36,360 from that group of points, and it does it very well. 839 00:40:36,360 --> 00:40:38,160 You can see up here what the loss is. 840 00:40:38,160 --> 00:40:40,320 The training loss is zero, meaning we were 841 00:40:40,320 --> 00:40:44,640 able to perfectly model separating these two points from each other inside 842 00:40:44,640 --> 00:40:46,380 of our training data. 843 00:40:46,380 --> 00:40:50,610 So this was a fairly simple case of trying to apply a neural network, 844 00:40:50,610 --> 00:40:54,630 because the data is very clean it's very nicely linearly separable. 845 00:40:54,630 --> 00:40:58,810 We can just draw a line that separates all of those points from each other. 846 00:40:58,810 --> 00:41:00,900 Let's now consider a more complex case. 847 00:41:00,900 --> 00:41:03,390 So I'll go ahead and pause the simulation, 848 00:41:03,390 --> 00:41:06,570 and we'll go ahead and look at this data set here. 849 00:41:06,570 --> 00:41:09,030 This data set is a little bit more complex now. 850 00:41:09,030 --> 00:41:11,280 In this data set, we still have blue and orange points 851 00:41:11,280 --> 00:41:13,140 that we'd like to separate from each other, 852 00:41:13,140 --> 00:41:15,150 but there is no single line that we can draw 853 00:41:15,150 --> 00:41:17,400 that is going to be able to figure out how to separate 854 00:41:17,400 --> 00:41:21,480 the blue from the orange, because the blue is located in these two quadrants 855 00:41:21,480 --> 00:41:23,640 and the orange is located here and here. 856 00:41:23,640 --> 00:41:26,890 It's a more complex function to be able to learn. 857 00:41:26,890 --> 00:41:30,660 So let's see what happens if we just try and predict based on those inputs-- 858 00:41:30,660 --> 00:41:34,080 the x- and y-coordinates-- what the output should be. 859 00:41:34,080 --> 00:41:38,220 Press Play, and what you'll notice is that we're not really able 860 00:41:38,220 --> 00:41:40,530 to draw much of a conclusion, that we're not 861 00:41:40,530 --> 00:41:42,900 able to very cleanly see how we should divide 862 00:41:42,900 --> 00:41:46,170 the orange points from the blue points, and you don't 863 00:41:46,170 --> 00:41:48,760 see a very clean separation there. 864 00:41:48,760 --> 00:41:53,050 So it seems like we don't have enough sophistication inside of our network 865 00:41:53,050 --> 00:41:55,910 to be able to model something that is that complex. 866 00:41:55,910 --> 00:41:58,540 We need a better model for this neural network. 867 00:41:58,540 --> 00:42:01,730 And I'll do that by adding a hidden layer. 868 00:42:01,730 --> 00:42:04,700 So now I have the hidden layer that has two neurons inside of it. 
869 00:42:04,700 --> 00:42:09,000 So I have two inputs that then go to two neurons inside of a hidden layer 870 00:42:09,000 --> 00:42:14,260 that then go to our output, and now I'll press Play, and what you'll notice here 871 00:42:14,260 --> 00:42:16,570 is that we're able to do slightly better. 872 00:42:16,570 --> 00:42:19,420 We're able to now say, all right, these points are definitely blue. 873 00:42:19,420 --> 00:42:21,370 These points are definitely orange. 874 00:42:21,370 --> 00:42:24,432 We're still struggling a little bit with these points up here though, 875 00:42:24,432 --> 00:42:26,140 and what we can do is we can see for each 876 00:42:26,140 --> 00:42:28,660 of these hidden neurons what is it exactly 877 00:42:28,660 --> 00:42:30,460 that these hidden neurons are doing. 878 00:42:30,460 --> 00:42:33,850 Each hidden neuron is learning its own decision boundary, 879 00:42:33,850 --> 00:42:35,590 and we can see what that boundary is. 880 00:42:35,590 --> 00:42:38,350 This first neuron is learning, all right, 881 00:42:38,350 --> 00:42:41,440 this line that seems to separate some of the blue points 882 00:42:41,440 --> 00:42:43,510 from the rest of the points. 883 00:42:43,510 --> 00:42:45,983 This other hidden neuron is learning another line 884 00:42:45,983 --> 00:42:48,400 that seems to be separating the orange points in the lower 885 00:42:48,400 --> 00:42:50,420 right from the rest of the points. 886 00:42:50,420 --> 00:42:52,720 So that's why we're able to sort of figure out 887 00:42:52,720 --> 00:42:55,900 these two areas in the bottom region, but we're still not 888 00:42:55,900 --> 00:42:59,090 able to perfectly classify all of the points. 889 00:42:59,090 --> 00:43:01,760 So let's go ahead and add another neuron-- 890 00:43:01,760 --> 00:43:04,900 now we've got three neurons inside of our hidden layer-- 891 00:43:04,900 --> 00:43:07,020 and see what we're able to learn now. 892 00:43:07,020 --> 00:43:07,520 All right. 893 00:43:07,520 --> 00:43:09,440 Well, now we seem to be doing a better job 894 00:43:09,440 --> 00:43:11,990 by learning three different decision boundaries, which 895 00:43:11,990 --> 00:43:14,540 each of the three neurons inside of our hidden layer 896 00:43:14,540 --> 00:43:18,352 were able to much better figure out how to separate these blue points 897 00:43:18,352 --> 00:43:19,310 from the orange points. 898 00:43:19,310 --> 00:43:22,340 And you can see what each of these hidden neurons is learning. 899 00:43:22,340 --> 00:43:25,220 Each one is learning a slightly different decision boundary, 900 00:43:25,220 --> 00:43:27,860 and then we're combining those decision boundaries together 901 00:43:27,860 --> 00:43:30,770 to figure out what the overall output should be. 902 00:43:30,770 --> 00:43:34,390 And we can try it one more time by adding a fourth neuron there 903 00:43:34,390 --> 00:43:35,930 and try learning that. 904 00:43:35,930 --> 00:43:37,798 And it seems like now we can do even better 905 00:43:37,798 --> 00:43:40,340 at trying to separate the blue points from the orange points, 906 00:43:40,340 --> 00:43:43,280 but we were only able to do this by adding a hidden layer, 907 00:43:43,280 --> 00:43:46,160 by adding some layer that is learning some other boundaries, 908 00:43:46,160 --> 00:43:49,070 and combining those boundaries to determine the output. 
909 00:43:49,070 --> 00:43:51,980 And the strength-- the size and thickness of these lines-- 910 00:43:51,980 --> 00:43:55,790 indicates how high these weights are, how important each of these inputs 911 00:43:55,790 --> 00:43:59,050 is, for making this sort of calculation. 912 00:43:59,050 --> 00:44:01,730 And we can do maybe one more simulation. 913 00:44:01,730 --> 00:44:04,960 Let's go ahead and try this on a data set that looks like this. 914 00:44:04,960 --> 00:44:06,668 Go ahead and get rid of the hidden layer. 915 00:44:06,668 --> 00:44:08,710 Here now we're trying to separate the blue points 916 00:44:08,710 --> 00:44:11,830 from the orange points, where all the blue points are located, again, 917 00:44:11,830 --> 00:44:13,700 inside of a circle, effectively. 918 00:44:13,700 --> 00:44:16,130 So we're not going to be able to learn a line. 919 00:44:16,130 --> 00:44:17,920 Notice I press Play, and we're really not 920 00:44:17,920 --> 00:44:20,240 able to draw any sort of classification at all, 921 00:44:20,240 --> 00:44:22,420 because there is no line that cleanly separates 922 00:44:22,420 --> 00:44:25,570 the blue points from the orange points. 923 00:44:25,570 --> 00:44:29,350 So let's try to solve this by introducing a hidden layer. 924 00:44:29,350 --> 00:44:31,307 I'll go ahead and press Play. 925 00:44:31,307 --> 00:44:31,890 And all right. 926 00:44:31,890 --> 00:44:33,793 With two neurons in a hidden layer, we're 927 00:44:33,793 --> 00:44:36,210 able to do a little better, because we effectively learned 928 00:44:36,210 --> 00:44:37,627 two different decision boundaries. 929 00:44:37,627 --> 00:44:40,380 We learned this line here, and we learned this line 930 00:44:40,380 --> 00:44:41,760 on the right-hand side. 931 00:44:41,760 --> 00:44:43,890 And right now, we're just saying, all right, well, if it's in-between, 932 00:44:43,890 --> 00:44:46,473 we'll call it blue, and if it's outside, we'll call it orange. 933 00:44:46,473 --> 00:44:49,150 So, not great, but certainly better than before. 934 00:44:49,150 --> 00:44:52,620 We're learning one decision boundary and another, and based on those, 935 00:44:52,620 --> 00:44:55,690 we can figure out what the output should be. 936 00:44:55,690 --> 00:45:00,770 But let's now go ahead and add a third neuron and see what happens now. 937 00:45:00,770 --> 00:45:02,150 I go ahead and train it. 938 00:45:02,150 --> 00:45:04,878 And now, using three different decision boundaries 939 00:45:04,878 --> 00:45:06,920 that are learned by each of these hidden neurons, 940 00:45:06,920 --> 00:45:09,800 we're able to much more accurately model this distinction 941 00:45:09,800 --> 00:45:11,840 between blue points and orange points. 942 00:45:11,840 --> 00:45:14,750 We're able to figure out, maybe with these three decision boundaries, 943 00:45:14,750 --> 00:45:18,530 combining them together, you can imagine figuring out what the output should be 944 00:45:18,530 --> 00:45:20,908 and how to make that sort of classification. 945 00:45:20,908 --> 00:45:22,700 And so the goal here is just to get a sense 946 00:45:22,700 --> 00:45:25,670 for how having more neurons in these hidden layers 947 00:45:25,670 --> 00:45:28,490 allows us to learn more structure in the data, 948 00:45:28,490 --> 00:45:31,400 allows us to figure out what the relevant and important decision 949 00:45:31,400 --> 00:45:32,360 boundaries are.
950 00:45:32,360 --> 00:45:34,365 And then using this backpropagation algorithm, 951 00:45:34,365 --> 00:45:36,740 we're able to figure out what the values of these weights 952 00:45:36,740 --> 00:45:39,290 should be in order to train this network to be 953 00:45:39,290 --> 00:45:44,240 able to classify one category of points away from another category of points 954 00:45:44,240 --> 00:45:45,228 instead. 955 00:45:45,228 --> 00:45:48,020 And this is ultimately what we're going to be trying to do whenever 956 00:45:48,020 --> 00:45:50,970 we're training a neural network. 957 00:45:50,970 --> 00:45:53,300 So let's go ahead and actually see an example of this. 958 00:45:53,300 --> 00:45:57,020 You'll recall from last time that we had this banknotes file that 959 00:45:57,020 --> 00:46:00,080 included information about counterfeit banknotes as opposed 960 00:46:00,080 --> 00:46:04,670 to authentic banknotes, where it had four different values for each banknote 961 00:46:04,670 --> 00:46:07,640 and then a categorization of whether that banknote is considered 962 00:46:07,640 --> 00:46:10,280 to be authentic or a counterfeit note. 963 00:46:10,280 --> 00:46:13,880 And what I wanted to do was, based on that input information, 964 00:46:13,880 --> 00:46:15,830 figure out some function that could calculate 965 00:46:15,830 --> 00:46:19,250 what category it belonged to. 966 00:46:19,250 --> 00:46:21,590 And what I've written here in banknotes.py 967 00:46:21,590 --> 00:46:25,340 is a neural network that will learn just that, a network that learns, 968 00:46:25,340 --> 00:46:27,320 based on all of the input, whether or not 969 00:46:27,320 --> 00:46:31,790 we should categorize a banknote as authentic or as counterfeit. 970 00:46:31,790 --> 00:46:34,250 The first step is the same as what we saw from last time. 971 00:46:34,250 --> 00:46:38,130 I'm really just reading the data in and getting it into an appropriate format. 972 00:46:38,130 --> 00:46:41,690 And so this is where more of the Python code you write on your own 973 00:46:41,690 --> 00:46:43,820 comes in, in terms of manipulating this data, 974 00:46:43,820 --> 00:46:46,010 massaging the data into a format that will 975 00:46:46,010 --> 00:46:48,290 be understood by a machine learning library 976 00:46:48,290 --> 00:46:50,890 like scikit-learn or like TensorFlow. 977 00:46:50,890 --> 00:46:54,710 And so here I separate it into a training and a testing set. 978 00:46:54,710 --> 00:46:59,030 And now what I'm doing down below is I'm creating a neural network. 979 00:46:59,030 --> 00:47:01,490 Here I'm using tf, which stands for TensorFlow. 980 00:47:01,490 --> 00:47:04,385 Up above I said, import tensorflow as tf. 981 00:47:04,385 --> 00:47:06,720 So you have just an abbreviation that we'll often use, 982 00:47:06,720 --> 00:47:09,178 so we don't need to write out TensorFlow every time we want 983 00:47:09,178 --> 00:47:11,570 to use anything inside of the library. 984 00:47:11,570 --> 00:47:13,910 I'm using tf.keras. 985 00:47:13,910 --> 00:47:16,340 Keras is an API, a set of functions that we 986 00:47:16,340 --> 00:47:20,748 can use in order to manipulate neural networks inside of TensorFlow, 987 00:47:20,748 --> 00:47:22,790 and it turns out there are other machine learning 988 00:47:22,790 --> 00:47:25,442 libraries that also use the Keras API.
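Before the model itself, the data-preparation step just described might look something like the sketch below: read each row of a CSV, keep the four numeric values as the evidence and the last column as the label, then split into a training set and a testing set. The file name, the header row, the column layout, and the use of scikit-learn's train_test_split are assumptions made for illustration; this is not the actual banknotes.py source.

import csv
import numpy as np
from sklearn.model_selection import train_test_split

evidence, labels = [], []
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)                          # skip the header row (assumed present)
    for row in reader:
        evidence.append([float(value) for value in row[:4]])   # four inputs
        labels.append(1 if row[4] == "1" else 0)               # counterfeit or not

# Separate the data into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    np.array(evidence), np.array(labels), test_size=0.4
)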
989 00:47:25,442 --> 00:47:27,650 But here, I'm saying, all right, go ahead and give me 990 00:47:27,650 --> 00:47:31,220 a model that is a sequential model-- a sequential neural network-- 991 00:47:31,220 --> 00:47:33,750 meaning one layer after another. 992 00:47:33,750 --> 00:47:37,700 And now I'm going to add to that model what layers I want inside 993 00:47:37,700 --> 00:47:38,910 of my neural network. 994 00:47:38,910 --> 00:47:40,820 So here I'm saying, model.add. 995 00:47:40,820 --> 00:47:43,160 Go ahead and add a dense layer-- 996 00:47:43,160 --> 00:47:45,530 and when we say a dense layer, we mean a layer where 997 00:47:45,530 --> 00:47:48,290 each of the nodes inside of the layer 998 00:47:48,290 --> 00:47:50,970 is going to be connected to each node from the previous layer, 999 00:47:50,970 --> 00:47:54,460 so we have a densely connected layer. 1000 00:47:54,460 --> 00:47:56,910 This layer is going to have eight units inside of it. 1001 00:47:56,910 --> 00:48:00,090 So it's going to be a hidden layer inside of a neural network with eight 1002 00:48:00,090 --> 00:48:02,460 different units, eight artificial neurons, each of which 1003 00:48:02,460 --> 00:48:03,830 might learn something different. 1004 00:48:03,830 --> 00:48:05,760 And I just sort of chose eight arbitrarily. 1005 00:48:05,760 --> 00:48:09,510 You could choose a different number of hidden nodes inside of the layer. 1006 00:48:09,510 --> 00:48:12,270 And as we saw before, depending on the number of units 1007 00:48:12,270 --> 00:48:15,240 there are inside of your hidden layer, more units 1008 00:48:15,240 --> 00:48:17,170 means you can learn more complex functions, 1009 00:48:17,170 --> 00:48:20,340 so maybe you can more accurately model the training data, 1010 00:48:20,340 --> 00:48:21,450 but it comes at a cost. 1011 00:48:21,450 --> 00:48:24,480 More units means more weights that you need to figure out how to update, 1012 00:48:24,480 --> 00:48:27,030 so it might be more expensive to do that calculation. 1013 00:48:27,030 --> 00:48:30,900 And you also run the risk of overfitting on the data if you have too many units 1014 00:48:30,900 --> 00:48:33,420 and you learn to just overfit on the training data. 1015 00:48:33,420 --> 00:48:34,390 That's not good either. 1016 00:48:34,390 --> 00:48:36,848 So there is a balance, and there's often a testing process, 1017 00:48:36,848 --> 00:48:40,350 where you'll train on some data and maybe validate how well you're 1018 00:48:40,350 --> 00:48:41,970 doing on a separate set of data-- 1019 00:48:41,970 --> 00:48:45,555 often called a validation set-- to see, all right, which setting of parameters, 1020 00:48:45,555 --> 00:48:47,430 how many layers should I have, how many units 1021 00:48:47,430 --> 00:48:49,230 should be in each layer, which one of those 1022 00:48:49,230 --> 00:48:51,450 performs the best on the validation set? 1023 00:48:51,450 --> 00:48:55,410 So you can do some testing to figure out what these so-called hyperparameters 1024 00:48:55,410 --> 00:48:57,600 should be equal to. 1025 00:48:57,600 --> 00:49:02,010 Next I specify what the input_shape is, meaning what does my input look like? 1026 00:49:02,010 --> 00:49:04,560 My input has four values, and so the input shape 1027 00:49:04,560 --> 00:49:07,650 is just 4, because we have four inputs. 1028 00:49:07,650 --> 00:49:09,960 And then I specify what the activation function is. 1029 00:49:09,960 --> 00:49:12,043 And the activation function, again, we can choose.
1030 00:49:12,043 --> 00:49:14,160 There are a number of different activation functions. 1031 00:49:14,160 --> 00:49:17,940 Here I'm using relu, which you might recall from earlier. 1032 00:49:17,940 --> 00:49:20,410 And then I'll add an output layer. 1033 00:49:20,410 --> 00:49:21,660 So I have my hidden layer. 1034 00:49:21,660 --> 00:49:23,820 Now I'm adding one more layer that will just 1035 00:49:23,820 --> 00:49:26,700 have one unit, because all I want to do is predict something 1036 00:49:26,700 --> 00:49:29,350 like counterfeit bill or authentic bill. 1037 00:49:29,350 --> 00:49:31,050 So I just need a single unit. 1038 00:49:31,050 --> 00:49:33,240 And the activation function I'm going to use here 1039 00:49:33,240 --> 00:49:35,370 is that sigmoid activation function, which 1040 00:49:35,370 --> 00:49:39,300 again was that S-shaped curve that just gives us a probability: 1041 00:49:39,300 --> 00:49:43,380 what is the probability that this is a counterfeit bill as opposed 1042 00:49:43,380 --> 00:49:45,150 to an authentic bill? 1043 00:49:45,150 --> 00:49:48,750 So that then is the structure of my neural network-- a sequential neural 1044 00:49:48,750 --> 00:49:52,200 network that has one hidden layer with eight units inside of it, 1045 00:49:52,200 --> 00:49:55,760 and then one output layer that just has a single unit inside of it. 1046 00:49:55,760 --> 00:49:57,510 And I can choose how many units there are. 1047 00:49:57,510 --> 00:49:59,670 I can choose the activation function. 1048 00:49:59,670 --> 00:50:02,970 Then I'm going to compile this model. 1049 00:50:02,970 --> 00:50:06,718 TensorFlow gives you a choice of how you would like to optimize the weights-- 1050 00:50:06,718 --> 00:50:09,010 there are various different algorithms for doing that-- 1051 00:50:09,010 --> 00:50:11,135 what type of loss function you want to use-- again, 1052 00:50:11,135 --> 00:50:12,840 many different options for doing that-- 1053 00:50:12,840 --> 00:50:14,880 and then how I want to evaluate my model. 1054 00:50:14,880 --> 00:50:16,050 Well, I care about accuracy. 1055 00:50:16,050 --> 00:50:20,670 I care about how many of my points I am able to classify correctly 1056 00:50:20,670 --> 00:50:23,330 versus not correctly as counterfeit or not counterfeit, 1057 00:50:23,330 --> 00:50:28,650 and I would like it to report to me how accurately my model is performing. 1058 00:50:28,650 --> 00:50:31,110 Then, now that I've defined that model, I 1059 00:50:31,110 --> 00:50:34,260 call model.fit to say, go ahead and train the model. 1060 00:50:34,260 --> 00:50:38,230 Train it on all the training data, plus all of the training labels-- 1061 00:50:38,230 --> 00:50:41,100 so labels for each of those pieces of training data-- 1062 00:50:41,100 --> 00:50:43,860 and I'm saying run it for 20 epochs, meaning go ahead 1063 00:50:43,860 --> 00:50:46,830 and go through each of these training points 20 times effectively, 1064 00:50:46,830 --> 00:50:50,220 go through the data 20 times and keep trying to update the weights. 1065 00:50:50,220 --> 00:50:52,440 If I did it for more, I could train for even longer 1066 00:50:52,440 --> 00:50:55,050 and maybe get a more accurate result. But then 1067 00:50:55,050 --> 00:50:58,380 after I fit it on all the data, I'll go ahead and just test it.
1068 00:50:58,380 --> 00:51:01,050 I'll evaluate my model using model.evaluate, 1069 00:51:01,050 --> 00:51:03,480 built into TensorFlow, that is just going to tell me, 1070 00:51:03,480 --> 00:51:05,907 how well do I perform on the testing data? 1071 00:51:05,907 --> 00:51:07,740 So ultimately, this is just going to give me 1072 00:51:07,740 --> 00:51:13,150 some numbers that tell me how well we did in this particular case. 1073 00:51:13,150 --> 00:51:15,300 So now what I'm going to do is go into banknotes 1074 00:51:15,300 --> 00:51:17,697 and go ahead and run banknotes.py. 1075 00:51:17,697 --> 00:51:19,530 And what's going to happen now is it's going 1076 00:51:19,530 --> 00:51:21,630 to read in all of that training data. 1077 00:51:21,630 --> 00:51:24,600 It's going to generate a neural network with all my inputs, 1078 00:51:24,600 --> 00:51:27,750 my eight hidden units inside my hidden layer, 1079 00:51:27,750 --> 00:51:30,630 and then an output unit, and now what it's doing is it's training. 1080 00:51:30,630 --> 00:51:32,880 It's training 20 times, and each time, you 1081 00:51:32,880 --> 00:51:35,940 can see how my accuracy is increasing on my training data. 1082 00:51:35,940 --> 00:51:38,950 It starts off, the very first time, not very accurate, 1083 00:51:38,950 --> 00:51:42,660 though better than random; something like 79% of the time, 1084 00:51:42,660 --> 00:51:45,730 it's able to accurately classify one bill from another. 1085 00:51:45,730 --> 00:51:49,350 But as I keep training, notice this accuracy value improves and improves 1086 00:51:49,350 --> 00:51:52,590 and improves, until after I've trained through all of the data points 1087 00:51:52,590 --> 00:51:59,220 20 times, it looks like my accuracy is above 99% on the training data. 1088 00:51:59,220 --> 00:52:02,530 And here's where I tested it on a whole bunch of testing data. 1089 00:52:02,530 --> 00:52:07,170 And it looks like in this case, I was also about 99.8% accurate. 1090 00:52:07,170 --> 00:52:09,970 So just using that, I was able to generate a neural network that 1091 00:52:09,970 --> 00:52:12,490 can detect counterfeit bills from authentic bills 1092 00:52:12,490 --> 00:52:16,030 based on this input data 99.8% of the time, at least 1093 00:52:16,030 --> 00:52:17,700 based on this particular testing data. 1094 00:52:17,700 --> 00:52:19,450 And I might want to test it with more data 1095 00:52:19,450 --> 00:52:21,890 as well, just to be confident about that. 1096 00:52:21,890 --> 00:52:24,743 But this is really the value of using a machine learning library 1097 00:52:24,743 --> 00:52:27,160 like TensorFlow, and there are others available for Python 1098 00:52:27,160 --> 00:52:30,040 and other languages as well, but all I have to do 1099 00:52:30,040 --> 00:52:33,400 is define the structure of the network and define the data 1100 00:52:33,400 --> 00:52:36,120 that I'm going to pass into the network, and then 1101 00:52:36,120 --> 00:52:38,560 TensorFlow runs the backpropagation algorithm 1102 00:52:38,560 --> 00:52:40,780 for learning what all of those weights should be, 1103 00:52:40,780 --> 00:52:44,410 for figuring out how to train this neural network to be able to, 1104 00:52:44,410 --> 00:52:48,070 as accurately as possible, figure out what the output values should 1105 00:52:48,070 --> 00:52:50,610 be there as well.
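Putting those pieces together, a sketch consistent with the structure just described looks like the following. The optimizer and loss function are plausible choices rather than a quote of the actual banknotes.py source, and X_train, y_train, X_test, y_test stand for the arrays produced by the data-preparation sketch earlier.

import tensorflow as tf

# A sequential network: one dense hidden layer of eight relu units over four
# inputs, and a single sigmoid output unit.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Choose how to optimize the weights and which loss to use, and report accuracy.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for 20 passes over the training data, then test on the testing data.
model.fit(X_train, y_train, epochs=20)
model.evaluate(X_test, y_test, verbose=2)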
1106 00:52:50,610 --> 00:52:55,130 And so this then was a look at what it is that neural networks can do, just 1107 00:52:55,130 --> 00:52:58,380 using these sequences of layer after layer after layer, 1108 00:52:58,380 --> 00:53:01,970 and you can begin to imagine applying these to much more general problems. 1109 00:53:01,970 --> 00:53:05,690 And one big problem in computing, and artificial intelligence more generally, 1110 00:53:05,690 --> 00:53:08,000 is the problem of computer vision. 1111 00:53:08,000 --> 00:53:10,580 Computer vision is all about computational methods 1112 00:53:10,580 --> 00:53:14,313 for analyzing and understanding images, that you might have pictures 1113 00:53:14,313 --> 00:53:16,730 that you want the computer to figure out how to deal with, 1114 00:53:16,730 --> 00:53:19,910 how to process those images, and figure out how to produce 1115 00:53:19,910 --> 00:53:21,710 some sort of useful result out of this. 1116 00:53:21,710 --> 00:53:24,140 You've seen this in the context of social media websites 1117 00:53:24,140 --> 00:53:27,093 that are able to look at a photo that contains a whole bunch of faces, 1118 00:53:27,093 --> 00:53:29,260 and it's able to figure out what's a picture of whom 1119 00:53:29,260 --> 00:53:32,060 and label those and tag them with appropriate people. 1120 00:53:32,060 --> 00:53:34,130 This is becoming increasingly relevant as we 1121 00:53:34,130 --> 00:53:36,600 begin to discuss self-driving cars. 1122 00:53:36,600 --> 00:53:38,360 These cars now have cameras, and we would 1123 00:53:38,360 --> 00:53:40,940 like for the computer to have some sort of algorithm that 1124 00:53:40,940 --> 00:53:43,490 looks at the images and figures out, what 1125 00:53:43,490 --> 00:53:47,940 color is the light, what cars are around us and in what direction, for example. 1126 00:53:47,940 --> 00:53:50,810 And so computer vision is all about taking an image 1127 00:53:50,810 --> 00:53:53,000 and figuring out what sort of computation-- 1128 00:53:53,000 --> 00:53:55,640 what sort of calculation-- we can do with that image. 1129 00:53:55,640 --> 00:53:59,480 It's also relevant in the context of something like handwriting recognition. 1130 00:53:59,480 --> 00:54:02,540 This, what you're looking at, is an example of the MNIST dataset-- 1131 00:54:02,540 --> 00:54:04,700 it's a big dataset just of handwritten digits-- 1132 00:54:04,700 --> 00:54:08,840 that we could use to, ideally, try and figure out how to predict, 1133 00:54:08,840 --> 00:54:12,380 given someone's handwriting, given a photo of a digit that they have drawn, 1134 00:54:12,380 --> 00:54:17,180 can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example. 1135 00:54:17,180 --> 00:54:19,850 So this sort of handwriting recognition is yet another task 1136 00:54:19,850 --> 00:54:23,300 that we might want to use computer vision tasks and tools to be 1137 00:54:23,300 --> 00:54:24,480 able to apply it towards. 1138 00:54:24,480 --> 00:54:27,470 This might be a task that we might care about. 1139 00:54:27,470 --> 00:54:30,140 So how then can we use neural networks to be 1140 00:54:30,140 --> 00:54:31,850 able to solve a problem like this? 1141 00:54:31,850 --> 00:54:34,340 Well, neural networks rely upon some sort of input, 1142 00:54:34,340 --> 00:54:36,350 where that input is just numerical data. 1143 00:54:36,350 --> 00:54:38,630 We have a whole bunch of units, where each one of them 1144 00:54:38,630 --> 00:54:40,820 just represents some sort of number. 
1145 00:54:40,820 --> 00:54:43,670 And so in the context of something like handwriting recognition, 1146 00:54:43,670 --> 00:54:45,920 or in the context of just an image, you might 1147 00:54:45,920 --> 00:54:50,240 imagine that an image is really just a grid of pixels, a grid of dots, 1148 00:54:50,240 --> 00:54:53,660 where each dot has some sort of color, and in the context 1149 00:54:53,660 --> 00:54:55,520 of something like handwriting recognition, 1150 00:54:55,520 --> 00:54:57,478 you might imagine that if you just fill in each 1151 00:54:57,478 --> 00:55:00,740 of these dots in a particular way, you can generate a 2 or an 8, 1152 00:55:00,740 --> 00:55:05,420 for example, based on which dots happen to be shaded in and which dots are not. 1153 00:55:05,420 --> 00:55:09,140 And we can represent each of these pixel values just using numbers. 1154 00:55:09,140 --> 00:55:14,220 So for a particular pixel, for example, 0 might represent entirely black. 1155 00:55:14,220 --> 00:55:16,060 Depending on how you're representing color, 1156 00:55:16,060 --> 00:55:20,740 it's often common to represent color values on a 0-to-255 range, 1157 00:55:20,740 --> 00:55:24,890 so that you can represent a color using eight bits for a particular value, 1158 00:55:24,890 --> 00:55:27,240 like how much white is in the image? 1159 00:55:27,240 --> 00:55:32,180 So 0 might represent all black, 255 might represent entirely white 1160 00:55:32,180 --> 00:55:35,870 as a pixel, and somewhere in between might represent some shade of gray, 1161 00:55:35,870 --> 00:55:36,890 for example. 1162 00:55:36,890 --> 00:55:40,250 But you might imagine not just having a single slider that determines how much 1163 00:55:40,250 --> 00:55:42,920 white is in the image, but if you had a color image, 1164 00:55:42,920 --> 00:55:45,870 you might imagine three different numerical values-- a red, green, 1165 00:55:45,870 --> 00:55:46,820 and blue value-- 1166 00:55:46,820 --> 00:55:49,490 where the red value controls how much red is in the image, 1167 00:55:49,490 --> 00:55:52,520 we have one value for controlling how much green is in the pixel, 1168 00:55:52,520 --> 00:55:55,290 and one value for how much blue is in the pixel as well. 1169 00:55:55,290 --> 00:55:58,970 And depending on how it is that you set these values of red, green, and blue, 1170 00:55:58,970 --> 00:56:00,840 you can get a different color. 1171 00:56:00,840 --> 00:56:04,460 And so any pixel can really be represented in this case 1172 00:56:04,460 --> 00:56:06,050 by three numerical values-- 1173 00:56:06,050 --> 00:56:09,510 a red value, a green value, and a blue value. 1174 00:56:09,510 --> 00:56:11,450 And if you take a whole bunch of these pixels, 1175 00:56:11,450 --> 00:56:15,230 assemble them together inside of a grid of pixels, then 1176 00:56:15,230 --> 00:56:17,760 you really just have a whole bunch of numerical values 1177 00:56:17,760 --> 00:56:21,863 that you can use in order to perform some sort of prediction task. 1178 00:56:21,863 --> 00:56:24,530 And so what you might imagine doing is using the same techniques 1179 00:56:24,530 --> 00:56:25,790 we talked about before. 1180 00:56:25,790 --> 00:56:30,890 Just design a neural network with a lot of inputs, that for each of the pixels, 1181 00:56:30,890 --> 00:56:34,070 we might have one or three different inputs in the case of a color image-- 1182 00:56:34,070 --> 00:56:38,240 a different input-- that is just connected to a deep neural network, 1183 00:56:38,240 --> 00:56:38,830 for example. 
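As a small aside on that grid-of-numbers idea, here is a sketch using Pillow (the Python image library mentioned a bit later) of reading those numbers directly from an image; the file name is just an example, not a file from the lecture.

from PIL import Image

# In grayscale ("L") mode, each pixel is one number from 0 (black) to 255 (white).
image = Image.open("digit.png").convert("L")
print(image.size)                # the width and height of the grid of pixels
print(image.getpixel((0, 0)))    # the value of the top-left pixel

# In "RGB" mode, each pixel is instead a (red, green, blue) triple, each 0-255.
color = Image.open("digit.png").convert("RGB")
print(color.getpixel((0, 0)))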
1184 00:56:38,830 --> 00:56:40,880 And this deep neural network might take all 1185 00:56:40,880 --> 00:56:45,700 of the pixels inside of the image of what digit a person drew, 1186 00:56:45,700 --> 00:56:49,910 and the output might be like 10 neurons that classify it as a 0 or a 1 1187 00:56:49,910 --> 00:56:55,620 or 2 or 3, or just tells us in some way what that digit happens to be. 1188 00:56:55,620 --> 00:56:57,910 Now there are a couple of drawbacks to this approach. 1189 00:56:57,910 --> 00:57:01,540 The first drawback to the approach is just the size of this input array, 1190 00:57:01,540 --> 00:57:03,422 that we have a whole bunch of inputs. 1191 00:57:03,422 --> 00:57:05,880 If we have a big image, that is a lot of different channels 1192 00:57:05,880 --> 00:57:08,790 we're looking at-- a lot of inputs, and therefore, a lot of weights 1193 00:57:08,790 --> 00:57:10,690 that we have to calculate. 1194 00:57:10,690 --> 00:57:14,420 And a second problem is the fact that by flattening everything 1195 00:57:14,420 --> 00:57:16,760 into just the structure of all the pixels, 1196 00:57:16,760 --> 00:57:20,720 we've lost access to a lot of the information about the structure 1197 00:57:20,720 --> 00:57:22,670 of the image that's relevant, that really, 1198 00:57:22,670 --> 00:57:25,040 when a person looks at an image, they're looking 1199 00:57:25,040 --> 00:57:26,667 at particular features of that image. 1200 00:57:26,667 --> 00:57:27,750 They're looking at curves. 1201 00:57:27,750 --> 00:57:28,610 They're looking at shapes. 1202 00:57:28,610 --> 00:57:30,470 They're looking at what things they can identify 1203 00:57:30,470 --> 00:57:33,387 in different regions of the image, and maybe put those things together 1204 00:57:33,387 --> 00:57:36,950 in order to get a better picture of what the overall image was about. 1205 00:57:36,950 --> 00:57:40,940 And by just turning it into pixel values for each of the pixels, 1206 00:57:40,940 --> 00:57:43,230 sure, you might be able to learn that structure, 1207 00:57:43,230 --> 00:57:45,360 but it might be challenging in order to do so. 1208 00:57:45,360 --> 00:57:48,890 It might be helpful to take advantage of the fact that you can use properties 1209 00:57:48,890 --> 00:57:52,190 of the image itself-- the fact that it's structured in a particular way-- 1210 00:57:52,190 --> 00:57:56,150 to be able to improve the way that we learn based on that image too. 1211 00:57:56,150 --> 00:57:59,210 So in order to figure out how we can train our neural networks to better 1212 00:57:59,210 --> 00:58:02,510 be able to deal with images, we'll introduce a couple of ideas-- 1213 00:58:02,510 --> 00:58:06,350 a couple of algorithms-- that we can apply that allow us to take the images 1214 00:58:06,350 --> 00:58:09,630 and extract some useful information out of that image. 1215 00:58:09,630 --> 00:58:13,430 And the first idea we'll introduce is the notion of image convolution. 1216 00:58:13,430 --> 00:58:16,940 And what an image convolution is all about is it's about filtering an image, 1217 00:58:16,940 --> 00:58:20,330 sort of extracting useful or relevant features out of the image. 1218 00:58:20,330 --> 00:58:25,220 And the way we do that is by applying a particular filter that basically combines 1219 00:58:25,220 --> 00:58:28,700 the value for every pixel with the values for all of its neighboring 1220 00:58:28,700 --> 00:58:29,780 pixels.
1221 00:58:29,780 --> 00:58:32,750 According to some sort of kernel matrix, which we'll see in a moment, 1222 00:58:32,750 --> 00:58:36,390 it's going to allow us to weight these pixels in various different ways. 1223 00:58:36,390 --> 00:58:38,300 And the goal of image convolution then is 1224 00:58:38,300 --> 00:58:41,720 to extract some sort of interesting or useful features out of an image, 1225 00:58:41,720 --> 00:58:45,080 to be able to take a pixel, and based on its neighboring pixels, 1226 00:58:45,080 --> 00:58:48,260 maybe predict some sort of valuable information, something 1227 00:58:48,260 --> 00:58:50,870 like taking a pixel and looking at its neighboring pixels, 1228 00:58:50,870 --> 00:58:52,310 you might be able to predict whether or not 1229 00:58:52,310 --> 00:58:54,143 there's some sort of curve inside the image, 1230 00:58:54,143 --> 00:58:57,200 or whether it's forming the outline of a particular line or a shape, 1231 00:58:57,200 --> 00:59:00,050 for example, and that might be useful if you're 1232 00:59:00,050 --> 00:59:02,600 trying to use all of these various different features 1233 00:59:02,600 --> 00:59:06,840 to combine them to say something meaningful about an image as a whole. 1234 00:59:06,840 --> 00:59:08,840 So how then does image convolution work? 1235 00:59:08,840 --> 00:59:11,870 Well, we start with a kernel matrix, and the kernel matrix 1236 00:59:11,870 --> 00:59:13,160 looks something like this. 1237 00:59:13,160 --> 00:59:15,260 And the idea of this is that given a pixel-- 1238 00:59:15,260 --> 00:59:16,820 that would be the middle pixel-- 1239 00:59:16,820 --> 00:59:21,200 we're going to multiply each of the neighboring pixels by these values 1240 00:59:21,200 --> 00:59:25,362 in order to get some sort of result by summing up all of the numbers together. 1241 00:59:25,362 --> 00:59:28,070 So if I take this kernel, which you can think of is like a filter 1242 00:59:28,070 --> 00:59:30,020 that I'm going to apply to the image. 1243 00:59:30,020 --> 00:59:32,090 And let's say that I take this image. 1244 00:59:32,090 --> 00:59:33,800 This is a four-by-four image. 1245 00:59:33,800 --> 00:59:37,250 We'll think of it as just a black and white image, where each one is just 1246 00:59:37,250 --> 00:59:41,550 a single pixel value, so somewhere between 0 and 255, for example. 1247 00:59:41,550 --> 00:59:44,450 So we have a whole bunch of individual pixel values like this, 1248 00:59:44,450 --> 00:59:47,450 and what I'd like to do is apply this kernel-- 1249 00:59:47,450 --> 00:59:49,280 this filter, so to speak-- 1250 00:59:49,280 --> 00:59:50,485 to this image. 1251 00:59:50,485 --> 00:59:53,360 And the way I'll do that is, all right, the kernel is three-by-three. 1252 00:59:53,360 --> 00:59:56,940 So you can imagine a five-by-five kernel or a larger kernel too. 1253 00:59:56,940 --> 01:00:01,460 And I'll take it and just first apply it to the first three-by-three section 1254 01:00:01,460 --> 01:00:02,480 of the image. 1255 01:00:02,480 --> 01:00:05,270 And what I'll do is I'll take each of these pixel values 1256 01:00:05,270 --> 01:00:08,930 and multiply it by its corresponding value in the filter matrix 1257 01:00:08,930 --> 01:00:11,970 and add all of the results together. 1258 01:00:11,970 --> 01:00:19,040 So here, for example, I'll say 10 times 0, plus 20, times negative 1, plus 30, 1259 01:00:19,040 --> 01:00:22,408 times 0, so on and so forth, doing all of this calculation. 
1260 01:00:22,408 --> 01:00:24,200 And at the end, if I take all these values, 1261 01:00:24,200 --> 01:00:26,990 multiply them by their corresponding value in the kernel, 1262 01:00:26,990 --> 01:00:30,410 and add the results together, for this particular set of nine pixels, 1263 01:00:30,410 --> 01:00:33,540 I get the value of 10, for example. 1264 01:00:33,540 --> 01:00:38,600 And then what I'll do is I'll slide this three-by-three grid effectively over. 1265 01:00:38,600 --> 01:00:43,220 Slide the kernel by one to look at the next three-by-three section. 1266 01:00:43,220 --> 01:00:45,330 And here I'm just sliding it over by one pixel, 1267 01:00:45,330 --> 01:00:46,970 but you might imagine a different slide length, 1268 01:00:46,970 --> 01:00:49,760 or maybe you jump by multiple pixels at a time if you really wanted to. 1269 01:00:49,760 --> 01:00:51,110 You have different options here. 1270 01:00:51,110 --> 01:00:54,650 But here I'm just sliding over, looking at the next three-by-three section. 1271 01:00:54,650 --> 01:00:59,450 And I'll do the same math: 20 times 0, plus 30 times negative 1, plus 40 1272 01:00:59,450 --> 01:01:03,950 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5. 1273 01:01:03,950 --> 01:01:05,990 And what I end up getting is the number 20. 1274 01:01:05,990 --> 01:01:09,260 Then you can imagine shifting over to this one, doing the same thing, 1275 01:01:09,260 --> 01:01:11,510 calculating something like the number 40, for example, 1276 01:01:11,510 --> 01:01:15,670 and then doing the same thing here and calculating a value there as well. 1277 01:01:15,670 --> 01:01:19,350 And so what we have now is what we'll call a feature map. 1278 01:01:19,350 --> 01:01:22,340 We have taken this kernel, applied it to each 1279 01:01:22,340 --> 01:01:25,040 of these various different regions, and what we get 1280 01:01:25,040 --> 01:01:29,505 is some representation of a filtered version of that image. 1281 01:01:29,505 --> 01:01:32,630 And so to give a more concrete example of why it is that this kind of thing 1282 01:01:32,630 --> 01:01:35,360 could be useful, let's take this kernel matrix, 1283 01:01:35,360 --> 01:01:39,080 for example, which is quite a famous one, that has an 8 in the middle, 1284 01:01:39,080 --> 01:01:42,380 and then all of the neighboring pixels get a negative 1. 1285 01:01:42,380 --> 01:01:44,420 And let's imagine we wanted to apply that 1286 01:01:44,420 --> 01:01:48,020 to a three-by-three part of an image that looks like this, 1287 01:01:48,020 --> 01:01:50,160 where all the values are the same. 1288 01:01:50,160 --> 01:01:52,310 They're all 20, for instance. 1289 01:01:52,310 --> 01:01:56,240 Well, in this case, if you do 20 times 8, and then subtract 20, 1290 01:01:56,240 --> 01:01:58,910 subtract 20, subtract 20, for each of the eight neighbors, 1291 01:01:58,910 --> 01:02:02,130 well, the result of that is you just get that expression, 1292 01:02:02,130 --> 01:02:03,440 which comes out to be 0. 1293 01:02:03,440 --> 01:02:07,250 You multiply 20 by 8, but then you subtract 20 eight times 1294 01:02:07,250 --> 01:02:08,960 according to that particular kernel. 1295 01:02:08,960 --> 01:02:11,150 The result of all of that is just 0. 1296 01:02:11,150 --> 01:02:15,170 So the takeaway here is that when a lot of the pixels are the same value, 1297 01:02:15,170 --> 01:02:18,050 we end up getting a value close to 0.
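Here is a minimal NumPy sketch of that sliding-kernel process for a four-by-four image and a three-by-three kernel, producing a two-by-two feature map. The kernel is the one whose arithmetic was just narrated; the pixel values in the image are made-up numbers for illustration.

import numpy as np

image = np.array([
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [20, 30, 40, 50],
])
kernel = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
])

feature_map = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = image[i:i + 3, j:j + 3]              # one 3-by-3 region of pixels
        feature_map[i, j] = np.sum(window * kernel)   # weight each pixel and add them up

print(feature_map)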
1298 01:02:18,050 --> 01:02:21,440 If, though, we had something like this, 20s along this first row, 1299 01:02:21,440 --> 01:02:24,470 then 50s in the second row, and 50s in the third row, well, 1300 01:02:24,470 --> 01:02:26,530 then when you do this same kind of math-- 1301 01:02:26,530 --> 01:02:29,930 20 times negative 1, 20 times negative 1, so on and so forth-- 1302 01:02:29,930 --> 01:02:34,530 then I get a higher value-- a value like 90, in this particular case. 1303 01:02:34,530 --> 01:02:37,520 And so the more general idea here is that 1304 01:02:37,520 --> 01:02:40,520 by applying this kernel-- negative 1s, 8 in the middle, 1305 01:02:40,520 --> 01:02:45,800 and then negative 1s-- what I get is, when this middle value is very 1306 01:02:45,800 --> 01:02:47,960 different from the neighboring values-- 1307 01:02:47,960 --> 01:02:50,240 like 50 is greater than these 20s-- 1308 01:02:50,240 --> 01:02:53,150 then you'll end up with a value higher than 0. 1309 01:02:53,150 --> 01:02:55,490 If this number is higher than its neighbors, 1310 01:02:55,490 --> 01:02:59,240 you end up getting a bigger output, but if this value is the same as all 1311 01:02:59,240 --> 01:03:02,660 of its neighbors, then you get a lower output, something like 0. 1312 01:03:02,660 --> 01:03:04,580 And it turns out that this sort of filter 1313 01:03:04,580 --> 01:03:08,440 can therefore be used in something like detecting edges in an image, 1314 01:03:08,440 --> 01:03:11,870 or detecting the boundaries between various different objects 1315 01:03:11,870 --> 01:03:12,890 inside of an image. 1316 01:03:12,890 --> 01:03:15,950 I might use a filter like this, which is able to tell 1317 01:03:15,950 --> 01:03:19,970 whether the value of this pixel is different from the values 1318 01:03:19,970 --> 01:03:23,630 of the neighboring pixels-- if it's greater than the values of the pixels 1319 01:03:23,630 --> 01:03:25,390 that happen to surround it. 1320 01:03:25,390 --> 01:03:28,250 And so we can use this in terms of image filtering. 1321 01:03:28,250 --> 01:03:30,290 And so I'll show you an example of that. 1322 01:03:30,290 --> 01:03:38,150 I have here, in filter.py, a file that uses Python's image library, or PIL, 1323 01:03:38,150 --> 01:03:40,160 to do some image filtering. 1324 01:03:40,160 --> 01:03:41,840 I go ahead and open an image. 1325 01:03:41,840 --> 01:03:45,102 And then all I'm going to do is apply a kernel to that image. 1326 01:03:45,102 --> 01:03:47,810 It's going to be a three-by-three kernel, the same kind of kernel 1327 01:03:47,810 --> 01:03:49,390 we saw before. 1328 01:03:49,390 --> 01:03:50,790 And here is the kernel. 1329 01:03:50,790 --> 01:03:53,312 This is just a list representation of the same matrix 1330 01:03:53,312 --> 01:03:55,020 that I showed you a moment ago, with its 1331 01:03:55,020 --> 01:03:56,900 negative 1, negative 1, negative 1. 1332 01:03:56,900 --> 01:03:59,750 The second row is negative 1, 8, negative 1. 1333 01:03:59,750 --> 01:04:01,880 The third row is all negative 1s. 1334 01:04:01,880 --> 01:04:06,670 And then at the end, I'm going to go ahead and show the filtered image. 1335 01:04:06,670 --> 01:04:12,340 So if, for example, I go into the convolution directory 1336 01:04:12,340 --> 01:04:15,300 and I open up an image like bridge.png, this 1337 01:04:15,300 --> 01:04:21,270 is what an input image might look like, just an image of a bridge over a river.
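A sketch of a filter.py along the lines just described, using Pillow's ImageFilter.Kernel, might look like this. It is a reconstruction for illustration, not a quote of the original source file: open an image, apply the three-by-three edge-detection kernel (negative 1s around an 8), and show the filtered result.

import sys
from PIL import Image, ImageFilter

# Open the image given on the command line and make sure it's in RGB mode.
image = Image.open(sys.argv[1]).convert("RGB")

# Apply the edge-detection kernel to the image.
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1
))

filtered.show()

You might run a script like this as, for example, python filter.py bridge.png.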
1338 01:04:21,270 --> 01:04:26,360 Now I'm going to go ahead and run this filter program on the bridge. 1339 01:04:26,360 --> 01:04:28,820 And what I get is this image here. 1340 01:04:28,820 --> 01:04:32,000 Just by taking the original image and applying that filter 1341 01:04:32,000 --> 01:04:35,000 to each three-by-three grid, I've extracted 1342 01:04:35,000 --> 01:04:38,390 all of the boundaries, all of the edges inside the image that separate 1343 01:04:38,390 --> 01:04:40,110 one part of the image from another. 1344 01:04:40,110 --> 01:04:42,740 So here I've got a representation of boundaries 1345 01:04:42,740 --> 01:04:45,040 between particular parts of the image. 1346 01:04:45,040 --> 01:04:47,600 And you might imagine that if a machine learning algorithm is 1347 01:04:47,600 --> 01:04:50,780 trying to learn like what an image is of, a filter like this 1348 01:04:50,780 --> 01:04:51,860 could be pretty useful. 1349 01:04:51,860 --> 01:04:55,400 Maybe the machine learning algorithm doesn't care about all 1350 01:04:55,400 --> 01:04:57,200 of the details of the image. 1351 01:04:57,200 --> 01:04:59,210 It just cares about certain useful features. 1352 01:04:59,210 --> 01:05:01,370 It cares about particular shapes that are 1353 01:05:01,370 --> 01:05:04,020 able to help it determine that based on the image, 1354 01:05:04,020 --> 01:05:06,540 this is going to be a bridge, for example. 1355 01:05:06,540 --> 01:05:08,840 And so this type of idea of image convolution 1356 01:05:08,840 --> 01:05:11,570 can allow us to apply filters to images that 1357 01:05:11,570 --> 01:05:15,970 allow us to extract useful results out of those images-- taking an image 1358 01:05:15,970 --> 01:05:18,640 and extracting its edges, for example. 1359 01:05:18,640 --> 01:05:20,480 You might imagine many other filters that 1360 01:05:20,480 --> 01:05:23,820 could be applied to an image that are able to extract particular values as 1361 01:05:23,820 --> 01:05:24,320 well. 1362 01:05:24,320 --> 01:05:27,620 And a filter might have separate kernels for the red values, the green values, 1363 01:05:27,620 --> 01:05:30,140 and the blue values that are all summed together at the end, 1364 01:05:30,140 --> 01:05:32,750 such that you could have particular filters looking for, 1365 01:05:32,750 --> 01:05:34,457 is there red in this part of the image? 1366 01:05:34,457 --> 01:05:36,290 Are there green in other parts of the image? 1367 01:05:36,290 --> 01:05:39,800 You can begin to assemble these relevant and useful filters that are 1368 01:05:39,800 --> 01:05:43,050 able to do these calculations as well. 1369 01:05:43,050 --> 01:05:45,990 So that then was the idea of image convolution-- applying 1370 01:05:45,990 --> 01:05:48,990 some sort of filter to an image to be able to extract 1371 01:05:48,990 --> 01:05:51,480 some useful features out of that image. 1372 01:05:51,480 --> 01:05:54,600 But all the while, these images are still pretty big. 1373 01:05:54,600 --> 01:05:56,730 There's a lot of pixels involved in the image. 1374 01:05:56,730 --> 01:05:59,310 And realistically speaking, if you've got a really big image, 1375 01:05:59,310 --> 01:06:01,030 that poses a couple of problems. 
1376 01:06:01,030 --> 01:06:03,810 One, it means a lot of input going into the neural network, 1377 01:06:03,810 --> 01:06:07,050 but two, it also means that we really have 1378 01:06:07,050 --> 01:06:11,715 to care about what's in each particular pixel, whereas realistically we often, 1379 01:06:11,715 --> 01:06:13,590 if you're looking at an image, you don't care 1380 01:06:13,590 --> 01:06:16,030 whether it's something is in one particular pixel 1381 01:06:16,030 --> 01:06:18,030 versus the pixel immediately to the right of it. 1382 01:06:18,030 --> 01:06:19,598 They're pretty close together. 1383 01:06:19,598 --> 01:06:21,390 You really just care about whether there is 1384 01:06:21,390 --> 01:06:24,450 a particular feature in some region of the image, 1385 01:06:24,450 --> 01:06:28,300 and maybe you don't care about exactly which pixel it happens to be. 1386 01:06:28,300 --> 01:06:30,660 And so there's a technique we can use known as pooling. 1387 01:06:30,660 --> 01:06:34,650 And what pooling is, is it means reducing the size of an input 1388 01:06:34,650 --> 01:06:37,340 by sampling from regions inside of the input. 1389 01:06:37,340 --> 01:06:40,890 So we're going to take a big image and turn it into a smaller image 1390 01:06:40,890 --> 01:06:41,880 by using pooling. 1391 01:06:41,880 --> 01:06:44,550 And in particular, one of the most popular types of pooling 1392 01:06:44,550 --> 01:06:45,870 is called max-pooling. 1393 01:06:45,870 --> 01:06:50,550 And what max-pooling does is it pools just by choosing the maximum value 1394 01:06:50,550 --> 01:06:52,390 in a particular region. 1395 01:06:52,390 --> 01:06:55,470 So, for example, let's imagine I had this four-by-four image, 1396 01:06:55,470 --> 01:06:57,360 but I wanted to reduce its dimensions. 1397 01:06:57,360 --> 01:07:01,310 I wanted to make an a smaller image, so that I have fewer inputs to work with. 1398 01:07:01,310 --> 01:07:05,070 Well, what I could do is I could apply a two-by-two max 1399 01:07:05,070 --> 01:07:07,410 pool, where the idea would be that I'm going 1400 01:07:07,410 --> 01:07:09,990 to first look at this two-by-two region and say, what 1401 01:07:09,990 --> 01:07:11,940 is the maximum value in that region? 1402 01:07:11,940 --> 01:07:13,290 Well, it's the number 50. 1403 01:07:13,290 --> 01:07:15,353 So we'll go ahead and just use the number 50. 1404 01:07:15,353 --> 01:07:17,270 And then we'll look at this two-by-two region. 1405 01:07:17,270 --> 01:07:18,940 What is the maximum value here? 1406 01:07:18,940 --> 01:07:19,740 110. 1407 01:07:19,740 --> 01:07:21,210 So that's going to be my value. 1408 01:07:21,210 --> 01:07:23,420 Likewise here, the maximum value looks like 20. 1409 01:07:23,420 --> 01:07:24,710 Go ahead and put that there. 1410 01:07:24,710 --> 01:07:27,030 Then for this last region, the maximum value 1411 01:07:27,030 --> 01:07:29,510 was 40, so we'll go ahead and use that. 1412 01:07:29,510 --> 01:07:33,290 And what I have now is a smaller representation 1413 01:07:33,290 --> 01:07:36,260 of this same original image that I obtained just 1414 01:07:36,260 --> 01:07:40,680 by picking the maximum value from each of these regions. 1415 01:07:40,680 --> 01:07:43,880 So again, the advantages here are now I only 1416 01:07:43,880 --> 01:07:46,730 have to deal with a two-by-two input instead of a four-by-four, 1417 01:07:46,730 --> 01:07:49,910 and you can imagine shrinking the size of an image even more. 
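Here is a minimal sketch of that two-by-two max-pooling step in NumPy (the pixel values are illustrative, not the exact ones on the slide):

    import numpy as np

    image = np.array([[10,  50,  20,  10],
                      [20,  30,  10, 110],
                      [10,  10,  20,  20],
                      [40,  20,  10,  10]])

    # Take the maximum value from each non-overlapping 2x2 region
    pooled = np.zeros((2, 2), dtype=int)
    for i in range(2):
        for j in range(2):
            region = image[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            pooled[i, j] = region.max()

    print(pooled)   # [[ 50 110]
                    #  [ 40  20]]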
1418 01:07:49,910 --> 01:07:52,880 But in addition to that, I'm now able to make 1419 01:07:52,880 --> 01:07:57,500 my analysis independent of whether a particular value was 1420 01:07:57,500 --> 01:07:59,030 in this pixel or this pixel. 1421 01:07:59,030 --> 01:08:01,490 I don't care if the 50 was here or here. 1422 01:08:01,490 --> 01:08:03,980 As long as it was generally in this region, 1423 01:08:03,980 --> 01:08:06,000 I'll still get access to that value. 1424 01:08:06,000 --> 01:08:10,190 So it makes our algorithms a little bit more robust as well. 1425 01:08:10,190 --> 01:08:11,750 So that then is pooling-- 1426 01:08:11,750 --> 01:08:13,940 taking the size of the image and reducing it 1427 01:08:13,940 --> 01:08:18,390 a little bit by just sampling from particular regions inside of the image. 1428 01:08:18,390 --> 01:08:22,310 And now we can put all of these ideas together-- pooling, image convolution, 1429 01:08:22,310 --> 01:08:26,060 neural networks-- all together into another type of neural network called 1430 01:08:26,060 --> 01:08:30,500 a convolutional neural network, or a CNN, which is a neural network that 1431 01:08:30,500 --> 01:08:35,479 uses this convolution step, usually in the context of analyzing an image, 1432 01:08:35,479 --> 01:08:36,752 for example. 1433 01:08:36,752 --> 01:08:39,710 And so the way that a convolutional neural own network works is that we 1434 01:08:39,710 --> 01:08:43,189 start with some sort of input image-- some grid of pixels-- 1435 01:08:43,189 --> 01:08:46,580 but rather than immediately put that into the neural network layers 1436 01:08:46,580 --> 01:08:50,120 that we've seen before, we'll start by applying a convolution step, where 1437 01:08:50,120 --> 01:08:54,170 the convolution step involves applying a number of different image filters 1438 01:08:54,170 --> 01:08:56,689 to our original image in order to get what 1439 01:08:56,689 --> 01:09:00,750 we call a feature map, the result of applying some filter to an image. 1440 01:09:00,750 --> 01:09:02,750 And we could do this once, but in general, we'll 1441 01:09:02,750 --> 01:09:06,020 do this multiple times getting a whole bunch of different feature 1442 01:09:06,020 --> 01:09:09,859 maps, each of which might extract some different relevant feature out 1443 01:09:09,859 --> 01:09:12,710 of the image, some different important characteristic of the image 1444 01:09:12,710 --> 01:09:16,760 that we might care about using in order to calculate what the result should be. 1445 01:09:16,760 --> 01:09:19,790 And in the same way to when we train neural networks, 1446 01:09:19,790 --> 01:09:23,270 we can train neural networks to learn the weights between particular units 1447 01:09:23,270 --> 01:09:24,770 inside of the neural networks. 1448 01:09:24,770 --> 01:09:28,160 We can also train neural networks to learn what those filters should be-- 1449 01:09:28,160 --> 01:09:30,170 what the values of the filters should be-- 1450 01:09:30,170 --> 01:09:33,620 in order to get the most useful, most relevant information out 1451 01:09:33,620 --> 01:09:37,069 of the original image just by figuring out what setting of those filter 1452 01:09:37,069 --> 01:09:39,380 values-- the values inside of that kernel-- 1453 01:09:39,380 --> 01:09:44,060 results in minimizing the loss function and minimizing how poorly 1454 01:09:44,060 --> 01:09:48,200 our hypothesis actually performs in figuring out the classification 1455 01:09:48,200 --> 01:09:50,720 of a particular image, for example. 
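To make concrete the point that the filter values themselves are learned, here is a small sketch in TensorFlow/Keras (the 28-by-28 grayscale input shape is an assumption carried over from the handwriting example later on): the kernel of a convolutional layer is just a trainable weight tensor that gradient descent adjusts along with everything else.

    import tensorflow as tf

    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1))
    ])

    # The layer's weights are ordinary trainable parameters: a (3, 3, 1, 32) kernel
    # tensor holding 32 different 3x3 filters, plus one bias value per filter
    kernel, bias = model.layers[0].get_weights()
    print(kernel.shape, bias.shape)   # (3, 3, 1, 32) (32,)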
1456 01:09:50,720 --> 01:09:52,880 So we first apply this convolution step. 1457 01:09:52,880 --> 01:09:55,520 Get a whole bunch of these various different feature maps. 1458 01:09:55,520 --> 01:09:57,450 But these feature maps are quite large. 1459 01:09:57,450 --> 01:10:00,200 There is a lot of pixel values that happen to be here. 1460 01:10:00,200 --> 01:10:03,440 And so a logical next step to take is a pooling step, 1461 01:10:03,440 --> 01:10:06,800 where we reduce the size of these images by using max-pooling, 1462 01:10:06,800 --> 01:10:10,360 for example, extracting the maximum value from any particular region. 1463 01:10:10,360 --> 01:10:12,110 There are other pooling methods that exist 1464 01:10:12,110 --> 01:10:13,610 as well, depending on the situation. 1465 01:10:13,610 --> 01:10:15,800 You could use something like average-pooling, 1466 01:10:15,800 --> 01:10:18,230 where instead of taking the maximum value from a region, 1467 01:10:18,230 --> 01:10:22,010 you take the average value from a region, which has it uses as well. 1468 01:10:22,010 --> 01:10:26,030 But in effect, what pooling will do is it will take these feature maps 1469 01:10:26,030 --> 01:10:28,190 and reduce their dimensions, so that we end up 1470 01:10:28,190 --> 01:10:30,677 with smaller grids with fewer pixels. 1471 01:10:30,677 --> 01:10:33,010 And this then is going to be easier for us to deal with. 1472 01:10:33,010 --> 01:10:35,600 It's going to mean fewer inputs that we have to worry about, 1473 01:10:35,600 --> 01:10:38,900 and it's also going to mean we're more resilient, more robust, 1474 01:10:38,900 --> 01:10:42,510 against potential movements of particular values just by one pixel, 1475 01:10:42,510 --> 01:10:46,280 when ultimately, we really don't care about those one pixel differences that 1476 01:10:46,280 --> 01:10:49,020 might arise in the original image. 1477 01:10:49,020 --> 01:10:52,700 Now after we've done this pooling step, now we have a whole bunch of values 1478 01:10:52,700 --> 01:10:55,260 that we can then flatten out and just put 1479 01:10:55,260 --> 01:10:57,310 into a more traditional neural network. 1480 01:10:57,310 --> 01:10:59,060 So we go ahead and flatten it, and then we 1481 01:10:59,060 --> 01:11:01,010 end up with a traditional neural network that 1482 01:11:01,010 --> 01:11:05,210 has one input for each of these values in each of these resulting feature 1483 01:11:05,210 --> 01:11:10,130 maps after we do the convolution and after we do the pooling step. 1484 01:11:10,130 --> 01:11:13,460 And so this then is the general structure of a convolutional network. 1485 01:11:13,460 --> 01:11:15,980 We begin with the image, apply convolution, 1486 01:11:15,980 --> 01:11:18,800 apply pooling, flatten the results, and then put that 1487 01:11:18,800 --> 01:11:22,190 into a more traditional neural network that might itself have hidden layers. 1488 01:11:22,190 --> 01:11:24,290 You can have deep convolutional networks that 1489 01:11:24,290 --> 01:11:28,490 have hidden layers in between this flattened layer and the eventual output 1490 01:11:28,490 --> 01:11:32,220 to be able to calculate various different features of those values. 
1491 01:11:32,220 --> 01:11:36,030 But this then can help us to be able to use convolution and pooling, 1492 01:11:36,030 --> 01:11:38,480 to use our knowledge about the structure of an image, 1493 01:11:38,480 --> 01:11:42,020 to be able to get better results, to be able to train our networks faster 1494 01:11:42,020 --> 01:11:46,080 in order to better capture particular parts of the image. 1495 01:11:46,080 --> 01:11:49,370 And there's no reason necessarily why you can only use these steps once. 1496 01:11:49,370 --> 01:11:53,570 In fact, in practice, you'll often use convolution and pooling multiple times 1497 01:11:53,570 --> 01:11:55,170 in multiple different steps. 1498 01:11:55,170 --> 01:11:58,310 So what you might imagine doing is starting with an image, 1499 01:11:58,310 --> 01:12:00,980 first applying convolution to get a whole bunch of maps, 1500 01:12:00,980 --> 01:12:04,070 then applying pooling, then applying convolution again, 1501 01:12:04,070 --> 01:12:06,760 because these maps are still pretty big. 1502 01:12:06,760 --> 01:12:10,330 You can apply convolution to try and extract relevant features 1503 01:12:10,330 --> 01:12:13,120 out of this result. Then take those results, 1504 01:12:13,120 --> 01:12:16,570 apply pooling in order to reduce their dimensions, and then take that 1505 01:12:16,570 --> 01:12:19,900 and feed it into a neural network that maybe has fewer inputs. 1506 01:12:19,900 --> 01:12:22,810 So here, I have two different convolution and pooling steps. 1507 01:12:22,810 --> 01:12:25,540 I do convolution and pooling once, and then I 1508 01:12:25,540 --> 01:12:29,380 do convolution and pooling a second time, each time extracting 1509 01:12:29,380 --> 01:12:32,200 useful features from the layer before it, each time using 1510 01:12:32,200 --> 01:12:36,010 pooling to reduce the dimensions of what you're ultimately looking at. 1511 01:12:36,010 --> 01:12:39,880 And the goal now of this sort of model is that in each of these steps, 1512 01:12:39,880 --> 01:12:43,090 you can begin to learn different types of features 1513 01:12:43,090 --> 01:12:45,430 of the original image, that maybe in the first step 1514 01:12:45,430 --> 01:12:49,180 you learn very low-level features, just learn and look for features like edges 1515 01:12:49,180 --> 01:12:53,770 and curves and shapes, because based on pixels in their neighboring values, 1516 01:12:53,770 --> 01:12:55,937 you can figure out, all right, what are the edges? 1517 01:12:55,937 --> 01:12:56,770 What are the curves? 1518 01:12:56,770 --> 01:12:59,810 What are the various different shapes that might be present there? 1519 01:12:59,810 --> 01:13:02,470 But then once you have a mapping that just represents 1520 01:13:02,470 --> 01:13:04,930 where the edges and curves and shapes happen to be, 1521 01:13:04,930 --> 01:13:07,120 you can imagine applying the same sort of process 1522 01:13:07,120 --> 01:13:10,480 again to begin to look for higher-level features-- look for objects, 1523 01:13:10,480 --> 01:13:13,450 maybe look for people's eyes in facial recognition, 1524 01:13:13,450 --> 01:13:17,020 for example, maybe look at more complex shapes like the curves 1525 01:13:17,020 --> 01:13:20,470 on a particular number if you're trying to recognize a digit in a handwriting 1526 01:13:20,470 --> 01:13:22,375 recognition sort of scenario. 
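As a sketch of what that repeated structure might look like in code (TensorFlow/Keras, with hypothetical layer sizes), two rounds of convolution and pooling could be stacked before flattening like this:

    import tensorflow as tf

    model = tf.keras.models.Sequential([
        # First round: learn low-level features like edges and curves
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

        # Second round: learn higher-level features from the pooled feature maps
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

        # Flatten and hand the result to a traditional neural network
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax")
    ])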
1527 01:13:22,375 --> 01:13:24,250 And then after all of that, now that you have 1528 01:13:24,250 --> 01:13:27,227 these results that represent these higher-level features, 1529 01:13:27,227 --> 01:13:29,560 you can pass them into a neural network, which is really 1530 01:13:29,560 --> 01:13:33,430 just a deep neural network that looks like this, where you might imagine 1531 01:13:33,430 --> 01:13:37,120 making a binary classification, or classifying into multiple categories, 1532 01:13:37,120 --> 01:13:42,130 or performing various different tasks on this sort of model. 1533 01:13:42,130 --> 01:13:45,340 So convolutional neural networks can be quite powerful and quite popular 1534 01:13:45,340 --> 01:13:47,383 when it comes to trying to analyze images. 1535 01:13:47,383 --> 01:13:48,550 We don't strictly need them. 1536 01:13:48,550 --> 01:13:52,780 We could have just used a vanilla neural network that just operates with layer 1537 01:13:52,780 --> 01:13:54,318 after layer as we've seen before. 1538 01:13:54,318 --> 01:13:56,110 But these convolutional neural networks can 1539 01:13:56,110 --> 01:13:58,675 be quite helpful, in particular, because of the way they 1540 01:13:58,675 --> 01:14:00,550 model the way a human might look at an image, 1541 01:14:00,550 --> 01:14:03,040 that instead of a human looking at every single pixel 1542 01:14:03,040 --> 01:14:06,428 simultaneously and trying to involve all of them by multiplying them together, 1543 01:14:06,428 --> 01:14:08,470 you might imagine that what convolution is really 1544 01:14:08,470 --> 01:14:11,860 doing is looking at various different regions of the image 1545 01:14:11,860 --> 01:14:14,770 and extracting relevant information and features out 1546 01:14:14,770 --> 01:14:17,410 of those parts of the image the same way that a human might 1547 01:14:17,410 --> 01:14:20,950 have visual receptors that are looking at particular parts of what they see, 1548 01:14:20,950 --> 01:14:23,440 and using those, combining them, to figure out 1549 01:14:23,440 --> 01:14:28,140 what meaning they can draw from all of those various different inputs. 1550 01:14:28,140 --> 01:14:31,480 And so you might imagine applying this to a situation like handwriting 1551 01:14:31,480 --> 01:14:32,500 recognition. 1552 01:14:32,500 --> 01:14:35,050 So we'll go ahead and see an example of that now. 1553 01:14:35,050 --> 01:14:37,705 I'll go ahead and open up handwriting.py. 1554 01:14:37,705 --> 01:14:41,800 Again, what we do here is we first import TensorFlow. 1555 01:14:41,800 --> 01:14:45,430 And then, TensorFlow, it turns out, has a few datasets 1556 01:14:45,430 --> 01:14:47,440 that are built in-- built into the library 1557 01:14:47,440 --> 01:14:49,120 that you can just immediately access. 1558 01:14:49,120 --> 01:14:51,910 And one of the most famous datasets in machine learning 1559 01:14:51,910 --> 01:14:55,720 is the MNIST dataset, which is just a dataset of a whole bunch of samples 1560 01:14:55,720 --> 01:14:57,310 of people's handwritten digits. 1561 01:14:57,310 --> 01:14:59,980 I showed you a slide of that a little while ago. 1562 01:14:59,980 --> 01:15:03,010 And what we can do is just immediately access that dataset, 1563 01:15:03,010 --> 01:15:06,520 which is built into the library, so that if I want to do something like train 1564 01:15:06,520 --> 01:15:10,810 on a whole bunch of digits, I can just use the dataset that is provided to me. 
1565 01:15:10,810 --> 01:15:14,170 Of course, if I had my own dataset of handwritten images, 1566 01:15:14,170 --> 01:15:15,640 I can apply the same idea. 1567 01:15:15,640 --> 01:15:19,620 I'd first just need to take those images and turn them into an array of pixels, 1568 01:15:19,620 --> 01:15:22,120 because that's the way that these are going to be formatted. 1569 01:15:22,120 --> 01:15:24,037 They're going to be formatted as, effectively, 1570 01:15:24,037 --> 01:15:26,770 an array of individual pixels. 1571 01:15:26,770 --> 01:15:29,330 And now there's a bit of reshaping I need to do, 1572 01:15:29,330 --> 01:15:31,640 just turning the data into a format that I can put 1573 01:15:31,640 --> 01:15:33,360 into my convolutional neural network. 1574 01:15:33,360 --> 01:15:37,970 So this is doing things like taking all the values and dividing them by 255. 1575 01:15:37,970 --> 01:15:41,700 If you remember, these color values tend to range from 0 to 255. 1576 01:15:41,700 --> 01:15:45,110 So I can divide them by 255, just to put them into a 0-to-1 range, 1577 01:15:45,110 --> 01:15:48,320 which might be a little bit easier to train on. 1578 01:15:48,320 --> 01:15:51,140 And then doing various other modifications to the data, just 1579 01:15:51,140 --> 01:15:53,270 to get it into a nice usable format. 1580 01:15:53,270 --> 01:15:55,670 But here's the interesting and important part. 1581 01:15:55,670 --> 01:15:59,920 Here is where I create the convolutional neural network-- the CNN-- 1582 01:15:59,920 --> 01:16:02,970 where here I'm saying, go ahead and use a sequential model. 1583 01:16:02,970 --> 01:16:06,570 And whereas before I could use model.add to say add a layer, add a layer, add a layer, 1584 01:16:06,570 --> 01:16:08,570 another way I could define it is just by passing 1585 01:16:08,570 --> 01:16:12,860 as input to the sequential neural network a list of all of the layers 1586 01:16:12,860 --> 01:16:14,750 that I want. 1587 01:16:14,750 --> 01:16:17,642 And so here, the very first layer in my model 1588 01:16:17,642 --> 01:16:19,350 is a convolutional layer, where I'm first 1589 01:16:19,350 --> 01:16:22,050 going to apply convolution to my image. 1590 01:16:22,050 --> 01:16:26,520 I'm going to use 13 different filters, so my model is going to learn-- 1591 01:16:26,520 --> 01:16:28,680 32, rather-- 32 different filters that I would 1592 01:16:28,680 --> 01:16:31,920 like to learn on the input image, where each filter is 1593 01:16:31,920 --> 01:16:33,950 going to be a three-by-three kernel. 1594 01:16:33,950 --> 01:16:36,010 So we saw those three-by-three kernels before, 1595 01:16:36,010 --> 01:16:39,270 where we could multiply each value in a three-by-three grid by the corresponding value in the kernel 1596 01:16:39,270 --> 01:16:41,620 and add all the results together. 1597 01:16:41,620 --> 01:16:46,300 So here I'm going to learn 32 of these different three-by-three filters. 1598 01:16:46,300 --> 01:16:48,740 I can again specify my activation function. 1599 01:16:48,740 --> 01:16:51,320 And I specify what my input shape is. 1600 01:16:51,320 --> 01:16:53,630 My input shape in the banknotes case was just 4. 1601 01:16:53,630 --> 01:16:55,130 I had four inputs. 1602 01:16:55,130 --> 01:17:00,502 My input shape here is going to be 28, comma, 28, comma 1, because for each 1603 01:17:00,502 --> 01:17:02,210 of these handwritten digits, it turns out 1604 01:17:02,210 --> 01:17:05,060 that's how the MNIST dataset organizes its data. 1605 01:17:05,060 --> 01:17:07,740 Each image is a 28-by-28 pixel grid.
1606 01:17:07,740 --> 01:17:11,690 They're going to be a 28-by-28 pixel grid, and each one of those images only 1607 01:17:11,690 --> 01:17:13,387 has one channel value. 1608 01:17:13,387 --> 01:17:15,470 These handwritten digits are just black and white, 1609 01:17:15,470 --> 01:17:17,960 so it's just a single color value representing 1610 01:17:17,960 --> 01:17:19,450 how much black or how much white. 1611 01:17:19,450 --> 01:17:22,700 You might imagine that in a color image, if you were doing this sort of thing, 1612 01:17:22,700 --> 01:17:24,710 you might have three different channels-- a red, 1613 01:17:24,710 --> 01:17:26,600 a green, and a blue channel, for example. 1614 01:17:26,600 --> 01:17:30,020 But in the case of just handwriting recognition and recognizing a digit, 1615 01:17:30,020 --> 01:17:33,640 we're just going to use a single value for shaded-in in or not shaded-in, 1616 01:17:33,640 --> 01:17:37,270 and it might range, but it's just a single color value. 1617 01:17:37,270 --> 01:17:40,800 And that then is the very first layer of our neural network, 1618 01:17:40,800 --> 01:17:43,327 a convolutional layer that will take the input 1619 01:17:43,327 --> 01:17:45,160 and learn a whole bunch of different filters 1620 01:17:45,160 --> 01:17:49,356 that we can apply to the input to extract meaningful features. 1621 01:17:49,356 --> 01:17:52,900 The next step is going to be a max-pooling layer, also built 1622 01:17:52,900 --> 01:17:55,060 right into TensorFlow, where this is going 1623 01:17:55,060 --> 01:17:58,840 to be a layer that is going to use a pool size of two by two, 1624 01:17:58,840 --> 01:18:01,830 meaning we're going to look at two-by-two regions inside of the image, 1625 01:18:01,830 --> 01:18:03,910 and just extract the maximum value. 1626 01:18:03,910 --> 01:18:06,050 Again, we've seen why this can be helpful. 1627 01:18:06,050 --> 01:18:09,040 It'll help to reduce the size of our input. 1628 01:18:09,040 --> 01:18:12,130 Once we've done that, we'll go ahead and flatten all of the units just 1629 01:18:12,130 --> 01:18:14,500 into a single layer that we can then pass 1630 01:18:14,500 --> 01:18:16,300 into the rest of the neural network. 1631 01:18:16,300 --> 01:18:18,970 And now, here's the rest of the whole network. 1632 01:18:18,970 --> 01:18:22,790 Here, I'm saying, let's add a hidden layer to my neural network with 128 1633 01:18:22,790 --> 01:18:26,560 units-- so a whole bunch of hidden units inside of the hidden layer-- 1634 01:18:26,560 --> 01:18:30,117 and just to prevent overfitting, I can add a dropout to that-- say, 1635 01:18:30,117 --> 01:18:30,700 you know what? 1636 01:18:30,700 --> 01:18:34,630 When you're training, randomly drop out half from this hidden layer, 1637 01:18:34,630 --> 01:18:38,200 just to make sure we don't become over-reliant on any particular node. 1638 01:18:38,200 --> 01:18:41,560 We begin to really generalize and stop ourselves from overfitting. 1639 01:18:41,560 --> 01:18:44,380 So TensorFlow allows us, just by adding a single line, 1640 01:18:44,380 --> 01:18:47,650 to add dropout into our model as well, such that when it's training, 1641 01:18:47,650 --> 01:18:50,080 it will perform this dropout step in order 1642 01:18:50,080 --> 01:18:54,640 to help make sure that we don't overfit on this particular data. 1643 01:18:54,640 --> 01:18:57,620 And then finally, I add an output layer. 
1644 01:18:57,620 --> 01:18:59,980 The output layer is going to have 10 units, one 1645 01:18:59,980 --> 01:19:03,310 for each category, that I would like to classify digits into, 1646 01:19:03,310 --> 01:19:06,230 so 0 through 9, 10 different categories. 1647 01:19:06,230 --> 01:19:08,700 And the activation function I'm going to use here 1648 01:19:08,700 --> 01:19:11,720 is called the softmax activation function. 1649 01:19:11,720 --> 01:19:14,450 And in short, what the softmax activation function is going to do 1650 01:19:14,450 --> 01:19:16,510 is it's going to take the output and turn it 1651 01:19:16,510 --> 01:19:18,440 into a probability distribution. 1652 01:19:18,440 --> 01:19:20,330 So ultimately, it's going to tell me, what 1653 01:19:20,330 --> 01:19:24,910 did we estimate the probability is that this is a 2 versus a 3 versus a 4, 1654 01:19:24,910 --> 01:19:29,180 and so it will turn it into that probability distribution for me. 1655 01:19:29,180 --> 01:19:31,390 Next up, I'll go ahead and compile my model 1656 01:19:31,390 --> 01:19:34,420 and fit it on all of my training data. 1657 01:19:34,420 --> 01:19:38,530 And then I can evaluate how well the neural network performs. 1658 01:19:38,530 --> 01:19:40,540 And then I've added to my Python program, 1659 01:19:40,540 --> 01:19:43,430 if I've provided a command line argument, like the name of a file, 1660 01:19:43,430 --> 01:19:46,300 I'm going to go ahead and save the model to a file. 1661 01:19:46,300 --> 01:19:47,900 And so this can be quite useful too. 1662 01:19:47,900 --> 01:19:49,608 Once you've done the training step, which 1663 01:19:49,608 --> 01:19:51,970 could take some time, in terms of taking all the time-- 1664 01:19:51,970 --> 01:19:55,510 going through the data; running backpropagation with gradient descent; 1665 01:19:55,510 --> 01:19:57,790 to be able to say, all right, how should we adjust 1666 01:19:57,790 --> 01:19:59,540 the weight to this particular model-- 1667 01:19:59,540 --> 01:20:01,600 you end up calculating values for these weights, 1668 01:20:01,600 --> 01:20:03,790 calculating values for these filters, and you'd 1669 01:20:03,790 --> 01:20:06,560 like to remember that information, so you can use it later. 1670 01:20:06,560 --> 01:20:10,223 And so TensorFlow allows us to just save a model to a file, 1671 01:20:10,223 --> 01:20:12,640 such that later if we want to use the model we've learned, 1672 01:20:12,640 --> 01:20:16,030 use the weights that we've learned, to make some sort of new prediction 1673 01:20:16,030 --> 01:20:19,550 we can just use the model that already exists. 1674 01:20:19,550 --> 01:20:22,570 So what we're doing here is after we've done all the calculation, 1675 01:20:22,570 --> 01:20:26,050 we go ahead and save the model to a file, such 1676 01:20:26,050 --> 01:20:28,220 that we can use it a little bit later. 1677 01:20:28,220 --> 01:20:35,837 So for example, if I go into digits, I'm going to run handwriting.py. 1678 01:20:35,837 --> 01:20:36,920 I won't save it this time. 1679 01:20:36,920 --> 01:20:39,135 We'll just run it and go ahead and see what happens. 1680 01:20:39,135 --> 01:20:41,260 What will happen is we need to go through the model 1681 01:20:41,260 --> 01:20:44,710 in order to train on all of these samples of handwritten digits. 
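Putting the walkthrough together, handwriting.py might look roughly like the sketch below. It is a reconstruction from the description above rather than the verbatim file: the 32 filters, 3x3 kernel, 2x2 pool, 128 hidden units, 0.5 dropout, and 10 passes through the data all come from the lecture, while the optimizer and loss choices are assumptions.

    import sys
    import tensorflow as tf

    # Use the MNIST handwriting dataset built into TensorFlow
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Scale pixel values into the 0-to-1 range and add a single channel dimension
    x_train = x_train.reshape(-1, 28, 28, 1) / 255.0
    x_test = x_test.reshape(-1, 28, 28, 1) / 255.0

    # One-hot encode the labels 0 through 9
    y_train = tf.keras.utils.to_categorical(y_train)
    y_test = tf.keras.utils.to_categorical(y_test)

    # Convolution, then pooling, then a traditional neural network
    model = tf.keras.models.Sequential([

        # Convolutional layer: learn 32 filters, each a 3x3 kernel
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),

        # Max-pooling layer with a 2x2 pool size
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

        # Flatten the units into a single layer
        tf.keras.layers.Flatten(),

        # Hidden layer, with dropout to help prevent overfitting
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),

        # Output layer: one unit per digit, softmax for a probability distribution
        tf.keras.layers.Dense(10, activation="softmax")
    ])

    # Compile and train the neural network, going through the data 10 times
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=10)

    # Evaluate performance on the held-out test set
    model.evaluate(x_test, y_test, verbose=2)

    # Optionally save the trained model to a file named on the command line
    if len(sys.argv) == 2:
        model.save(sys.argv[1])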
1682 01:20:44,710 --> 01:20:47,500 So the MNIST dataset gives us thousands and thousands 1683 01:20:47,500 --> 01:20:50,050 of sample handwritten digits in the same format 1684 01:20:50,050 --> 01:20:51,800 that we can use in order to train. 1685 01:20:51,800 --> 01:20:54,363 And so now what you're seeing is this training process, 1686 01:20:54,363 --> 01:20:56,530 and unlike the banknotes case, where there was much, 1687 01:20:56,530 --> 01:20:58,160 much fewer data points-- 1688 01:20:58,160 --> 01:20:59,680 the data was very, very simple-- 1689 01:20:59,680 --> 01:21:03,110 here, the data is more complex, and this training process takes time. 1690 01:21:03,110 --> 01:21:06,040 And so this is another one of those cases where 1691 01:21:06,040 --> 01:21:09,472 when training neural networks, this is why computational power is 1692 01:21:09,472 --> 01:21:11,680 so important, that oftentimes, you see people wanting 1693 01:21:11,680 --> 01:21:15,070 to use a sophisticated GPUs in order to more efficiently be 1694 01:21:15,070 --> 01:21:18,040 able to do this sort of neural network we're training. 1695 01:21:18,040 --> 01:21:20,870 It also speaks to the reason why more data can be helpful. 1696 01:21:20,870 --> 01:21:23,260 The more sample data points you have, the better 1697 01:21:23,260 --> 01:21:25,040 you can begin to do this training. 1698 01:21:25,040 --> 01:21:28,060 So here we're going through 60,000 different samples 1699 01:21:28,060 --> 01:21:29,400 of handwritten digits. 1700 01:21:29,400 --> 01:21:31,820 And I said that we're going to go through them 10 times. 1701 01:21:31,820 --> 01:21:34,780 So we're going to go through the dataset 10 times, training each time, 1702 01:21:34,780 --> 01:21:37,360 hopefully improving upon our weights with every time 1703 01:21:37,360 --> 01:21:38,900 we run through this dataset. 1704 01:21:38,900 --> 01:21:41,770 And we can see over here on the right what the accuracy is 1705 01:21:41,770 --> 01:21:44,860 each time we go ahead and run this model, that the first time, 1706 01:21:44,860 --> 01:21:48,310 it looks like we got an accuracy of about 92% of the digits 1707 01:21:48,310 --> 01:21:50,320 correct based on this training set. 1708 01:21:50,320 --> 01:21:53,310 We increased that to 96% or 97%. 1709 01:21:53,310 --> 01:21:56,110 And every time we run this, we're going to see, 1710 01:21:56,110 --> 01:21:59,290 hopefully, the accuracy improve, as we continue to try and use 1711 01:21:59,290 --> 01:22:02,440 that gradient descent, that process of trying to run the algorithm 1712 01:22:02,440 --> 01:22:06,400 to minimize the loss that we get in order to more accurately predict 1713 01:22:06,400 --> 01:22:07,840 what the output should be. 1714 01:22:07,840 --> 01:22:11,210 And what this process is doing is it's learning not only the weights, 1715 01:22:11,210 --> 01:22:13,660 but it's learning the features to use-- the kernel 1716 01:22:13,660 --> 01:22:16,840 matrix to use-- when performing that convolution step, because this 1717 01:22:16,840 --> 01:22:19,570 is a convolutional neural network, where I'm first performing 1718 01:22:19,570 --> 01:22:23,380 those convolutions, and then doing the more traditional neural network 1719 01:22:23,380 --> 01:22:24,260 structure. 1720 01:22:24,260 --> 01:22:28,250 This is going to learn all of those individual steps as well. 
1721 01:22:28,250 --> 01:22:31,770 So here, we see the TensorFlow provides me with some very nice output, telling 1722 01:22:31,770 --> 01:22:34,960 me about how many seconds are left with each of these training runs, 1723 01:22:34,960 --> 01:22:37,610 that allows me to see just how well we're doing. 1724 01:22:37,610 --> 01:22:39,970 So we'll go ahead and see how this network performs. 1725 01:22:39,970 --> 01:22:42,520 It looks like we've gone through the dataset seven times. 1726 01:22:42,520 --> 01:22:45,162 We're going through an eighth time now. 1727 01:22:45,162 --> 01:22:47,120 And at this point, the accuracy is pretty high. 1728 01:22:47,120 --> 01:22:50,950 We saw we went from 92% up to 97%. 1729 01:22:50,950 --> 01:22:52,370 Now it looks like 98%. 1730 01:22:52,370 --> 01:22:55,120 And at this point, it seems like things are starting to level out. 1731 01:22:55,120 --> 01:22:57,550 There's probably a limit to how accurate we can ultimately 1732 01:22:57,550 --> 01:22:59,615 be without running the risk of overfitting. 1733 01:22:59,615 --> 01:23:02,740 Of course, with enough nodes, you could just memorize the input and overfit 1734 01:23:02,740 --> 01:23:03,600 upon them. 1735 01:23:03,600 --> 01:23:07,400 But we'd like to avoid doing that and dropout will help us with this. 1736 01:23:07,400 --> 01:23:12,560 But now, we see we're almost done finishing our training step. 1737 01:23:12,560 --> 01:23:13,950 We're at 55,000. 1738 01:23:13,950 --> 01:23:14,450 All right. 1739 01:23:14,450 --> 01:23:16,280 We've finished training, and now it's going 1740 01:23:16,280 --> 01:23:18,920 to go ahead and test for us on 10,000 samples. 1741 01:23:18,920 --> 01:23:23,630 And it looks like on the testing set, we were 98.8% accurate. 1742 01:23:23,630 --> 01:23:25,640 So we ended up doing pretty well, it seems, 1743 01:23:25,640 --> 01:23:28,940 on this testing set to see how accurately can 1744 01:23:28,940 --> 01:23:31,980 we predict these handwritten digits. 1745 01:23:31,980 --> 01:23:34,590 And so what we could do then is actually test it out. 1746 01:23:34,590 --> 01:23:38,490 I've written a program called recognition.py using PyGame. 1747 01:23:38,490 --> 01:23:40,350 If you pass it a model that's been trained, 1748 01:23:40,350 --> 01:23:44,843 and I pre-trained an example model using this input data, what we can do 1749 01:23:44,843 --> 01:23:46,760 is see whether or not we've been able to train 1750 01:23:46,760 --> 01:23:50,510 this convolutional neural network to be able to predict handwriting, 1751 01:23:50,510 --> 01:23:51,050 for example. 1752 01:23:51,050 --> 01:23:54,080 So I can try just like drawing a handwritten digit. 1753 01:23:54,080 --> 01:23:58,130 I'll go ahead and draw like the number 2, for example. 1754 01:23:58,130 --> 01:23:59,295 So there's my number 2. 1755 01:23:59,295 --> 01:24:00,170 Again, this is messy. 1756 01:24:00,170 --> 01:24:03,170 If you tried to imagine how would you write a program with just like ifs 1757 01:24:03,170 --> 01:24:05,390 and thens to be able to do this sort of calculation, 1758 01:24:05,390 --> 01:24:06,830 it would be tricky to do so. 1759 01:24:06,830 --> 01:24:08,810 But here, I'll press Classify, and all right. 1760 01:24:08,810 --> 01:24:11,330 It seems it was able to correctly classify that what I drew 1761 01:24:11,330 --> 01:24:12,383 was the number 2. 1762 01:24:12,383 --> 01:24:13,550 We'll go ahead and reset it. 1763 01:24:13,550 --> 01:24:14,092 Try it again. 
1764 01:24:14,092 --> 01:24:16,710 We'll draw like an 8, for example. 1765 01:24:16,710 --> 01:24:19,040 So here is an 8. 1766 01:24:19,040 --> 01:24:20,197 I'll press Classify. 1767 01:24:20,197 --> 01:24:20,780 And all right. 1768 01:24:20,780 --> 01:24:23,693 It predicts that the digit that I drew was an 8. 1769 01:24:23,693 --> 01:24:25,610 And the key here is this really begins to show 1770 01:24:25,610 --> 01:24:28,640 the power of what the neural network is doing, somehow looking 1771 01:24:28,640 --> 01:24:31,190 at various different features of these different pixels, 1772 01:24:31,190 --> 01:24:33,560 figuring out what the relevant features are, 1773 01:24:33,560 --> 01:24:36,350 and figuring out how to combine them to get a classification. 1774 01:24:36,350 --> 01:24:40,340 And this would be a difficult task to provide explicit instructions 1775 01:24:40,340 --> 01:24:43,580 to the computer on how to do, like to use a whole bunch of if-thens 1776 01:24:43,580 --> 01:24:46,220 to process all of these pixel values to figure out 1777 01:24:46,220 --> 01:24:48,800 what the handwritten digit is, because everyone is going to draw 1778 01:24:48,800 --> 01:24:50,180 their 8 a little bit differently. 1779 01:24:50,180 --> 01:24:52,680 If I drew the 8 again, it would look a little bit different. 1780 01:24:52,680 --> 01:24:55,460 And yet ideally, we want to train a network to be robust 1781 01:24:55,460 --> 01:24:59,360 enough so that it begins to learn these patterns on its own. 1782 01:24:59,360 --> 01:25:02,040 All I said was, here is the structure of the network, 1783 01:25:02,040 --> 01:25:04,610 and here is the data on which to train the network, 1784 01:25:04,610 --> 01:25:06,620 and the network learning algorithm just tries 1785 01:25:06,620 --> 01:25:08,960 to figure out what is the optimal set of weights, 1786 01:25:08,960 --> 01:25:11,210 what is the optimal set of filters to use, 1787 01:25:11,210 --> 01:25:13,520 in order to be able to accurately classify 1788 01:25:13,520 --> 01:25:16,030 a digit into one category or another. 1789 01:25:16,030 --> 01:25:20,850 That's going to show the power of these convolutional neural networks. 1790 01:25:20,850 --> 01:25:25,280 And so that then was a look at how we can use convolutional neural networks 1791 01:25:25,280 --> 01:25:30,320 to begin to solve problems with regards to computer vision, the ability to take 1792 01:25:30,320 --> 01:25:32,015 an image and begin to analyze it. 1793 01:25:32,015 --> 01:25:33,890 And so this is the type of analysis you might 1794 01:25:33,890 --> 01:25:36,710 imagine that's happening in self-driving cars that 1795 01:25:36,710 --> 01:25:40,910 are able to figure out what filters to apply to an image to understand what it 1796 01:25:40,910 --> 01:25:44,300 is that the computer is looking at, or the same type of idea that 1797 01:25:44,300 --> 01:25:46,760 might be applied to facial recognition in social media 1798 01:25:46,760 --> 01:25:50,600 to be able to determine how to recognize faces in an image as well. 1799 01:25:50,600 --> 01:25:53,180 You can imagine a neural network that, instead of classifying 1800 01:25:53,180 --> 01:25:58,310 into one of 10 different digits, could instead classify like, is this person A 1801 01:25:58,310 --> 01:26:01,730 or is this person B, trying to tell those people apart just based 1802 01:26:01,730 --> 01:26:03,807 on convolution.
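Going back to recognition.py for a moment: the classification step presumably boils down to something like the sketch below, which loads the saved model and asks it for a probability distribution over the ten digits. The model filename and the conversion of the PyGame drawing into a 28-by-28 grid of values are stand-ins here, not details from the actual file.

    import numpy as np
    import tensorflow as tf

    # Load the model that handwriting.py saved earlier (filename is an assumption)
    model = tf.keras.models.load_model("model.h5")

    # Stand-in for the drawn digit: a 28x28 grid of values between 0 and 1
    pixels = np.zeros((28, 28))

    # The model expects a batch of 28x28x1 images, so add those dimensions
    prediction = model.predict(pixels.reshape(1, 28, 28, 1))

    # The softmax output is a probability distribution over the digits 0 through 9
    print("Predicted digit:", prediction.argmax())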
1803 01:26:03,807 --> 01:26:06,890 And so now what we'll take a look at is yet another type of neural network 1804 01:26:06,890 --> 01:26:09,290 that can be quite popular for certain types of tasks. 1805 01:26:09,290 --> 01:26:13,160 But to do so, we'll try to generalize and think about our neural network 1806 01:26:13,160 --> 01:26:16,920 a little bit more abstractly, that here we have a sample deep neural network, 1807 01:26:16,920 --> 01:26:20,150 where we have this input layer, a whole bunch of different hidden layers 1808 01:26:20,150 --> 01:26:22,850 that are performing certain types of calculations, 1809 01:26:22,850 --> 01:26:26,090 and then an output layer here that just generates some sort of output 1810 01:26:26,090 --> 01:26:28,370 that we care about calculating. 1811 01:26:28,370 --> 01:26:32,780 But we could imagine representing this a little more simply, like this. 1812 01:26:32,780 --> 01:26:36,110 Here is just a more abstract representation of our neural network. 1813 01:26:36,110 --> 01:26:37,490 We have some input. 1814 01:26:37,490 --> 01:26:41,090 That might be like a vector of a whole bunch of different values as our input. 1815 01:26:41,090 --> 01:26:43,390 That gets passed into a network to perform 1816 01:26:43,390 --> 01:26:46,190 some sort of calculation or computation, and that network 1817 01:26:46,190 --> 01:26:48,350 produces some sort of output. 1818 01:26:48,350 --> 01:26:50,043 That output might be a single value. 1819 01:26:50,043 --> 01:26:51,960 It might be a whole bunch of different values. 1820 01:26:51,960 --> 01:26:54,960 But this is the general structure of the neural network that we've seen. 1821 01:26:54,960 --> 01:26:58,250 There is some sort of input that gets fed into the network, 1822 01:26:58,250 --> 01:27:02,210 and using that input, the network calculates what the output should be. 1823 01:27:02,210 --> 01:27:04,730 And this sort of model for a neural network 1824 01:27:04,730 --> 01:27:07,790 is what we might call a feed-forward neural network. 1825 01:27:07,790 --> 01:27:11,760 Feed-forward neural networks have connections only in one direction; 1826 01:27:11,760 --> 01:27:14,390 they move from one layer to the next layer to the layer 1827 01:27:14,390 --> 01:27:18,530 after that, such that the inputs pass through various different hidden layers 1828 01:27:18,530 --> 01:27:21,560 and then ultimately produce some sort of output. 1829 01:27:21,560 --> 01:27:24,963 So feed-forward neural networks are very helpful for solving 1830 01:27:24,963 --> 01:27:27,380 these types of classification problems that we saw before. 1831 01:27:27,380 --> 01:27:28,760 We have a whole bunch of input. 1832 01:27:28,760 --> 01:27:30,885 We want to learn what setting of weights will allow 1833 01:27:30,885 --> 01:27:32,717 us to calculate the output effectively. 1834 01:27:32,717 --> 01:27:35,300 But there are some limitations on feed-forward neural networks 1835 01:27:35,300 --> 01:27:36,425 that we'll see in a moment. 1836 01:27:36,425 --> 01:27:39,350 In particular, the input needs to be of a fixed shape, 1837 01:27:39,350 --> 01:27:41,932 like a fixed number of neurons in the input layer, 1838 01:27:41,932 --> 01:27:43,640 and there's a fixed shape for the output, 1839 01:27:43,640 --> 01:27:46,670 like a fixed number of neurons in the output layer, 1840 01:27:46,670 --> 01:27:49,340 and that has some limitations of its own.
1841 01:27:49,340 --> 01:27:51,457 And a possible solution to this-- 1842 01:27:51,457 --> 01:27:53,540 and we'll see examples of the types of problems we 1843 01:27:53,540 --> 01:27:55,190 can solve for this in just the second-- 1844 01:27:55,190 --> 01:27:58,065 is instead of just a feed-forward neural network where there are only 1845 01:27:58,065 --> 01:28:01,070 connections in one direction, from left to right effectively, 1846 01:28:01,070 --> 01:28:05,390 across the network, we can also imagine a recurrent neural network, 1847 01:28:05,390 --> 01:28:07,460 where a recurrent neural network generates 1848 01:28:07,460 --> 01:28:13,680 output that gets fed back into itself as input for future runs of that network. 1849 01:28:13,680 --> 01:28:15,800 So whereas in a traditional neural network, 1850 01:28:15,800 --> 01:28:19,850 we have inputs that get fed into the network that get fed into the output, 1851 01:28:19,850 --> 01:28:23,150 and the only thing that determines the output is based on the original input 1852 01:28:23,150 --> 01:28:26,780 and based on the calculation we do inside of the network itself, 1853 01:28:26,780 --> 01:28:29,780 this goes in contrast with a recurrent neural network, 1854 01:28:29,780 --> 01:28:32,450 where in a recurrent neural network, you can imagine output 1855 01:28:32,450 --> 01:28:35,810 from the network feeding back to itself into the network 1856 01:28:35,810 --> 01:28:39,590 again as input for the next time that you do the calculations 1857 01:28:39,590 --> 01:28:41,090 inside of the network. 1858 01:28:41,090 --> 01:28:45,890 What this allows is it allows the network to maintain some sort of state, 1859 01:28:45,890 --> 01:28:48,290 to store some sort of information that can 1860 01:28:48,290 --> 01:28:51,930 be used on future runs of the network. 1861 01:28:51,930 --> 01:28:54,170 Previously, the network just defined some weights, 1862 01:28:54,170 --> 01:28:56,990 and we passed inputs through the network, and it generated outputs, 1863 01:28:56,990 --> 01:29:00,710 but the network wasn't saving any information based on those inputs 1864 01:29:00,710 --> 01:29:04,103 to be able to remember for future iterations or for future runs. 1865 01:29:04,103 --> 01:29:06,020 What a recurrent neural network will let us do 1866 01:29:06,020 --> 01:29:08,270 is let the network store information that 1867 01:29:08,270 --> 01:29:12,470 gets passed back in as input to the network again the next time we try 1868 01:29:12,470 --> 01:29:14,370 and perform some sort of action. 1869 01:29:14,370 --> 01:29:18,990 And this is particularly helpful when dealing with sequences of data. 1870 01:29:18,990 --> 01:29:21,620 So we'll see a real-world example of this right now actually. 1871 01:29:21,620 --> 01:29:25,880 Microsoft has developed an AI known as the CaptionBot, 1872 01:29:25,880 --> 01:29:28,370 and what the CaptionBot does is it says, I 1873 01:29:28,370 --> 01:29:30,500 can understand the content of any photograph, 1874 01:29:30,500 --> 01:29:32,583 and I'll try to describe it as well as any human. 1875 01:29:32,583 --> 01:29:35,000 I'll analyze your photo, but I won't store it or share it. 1876 01:29:35,000 --> 01:29:38,090 And so what Microsoft CaptionBot seems to be claiming to do 1877 01:29:38,090 --> 01:29:41,630 is it can take an image and figure out what's in the image 1878 01:29:41,630 --> 01:29:44,460 and just give us a caption to describe it. 1879 01:29:44,460 --> 01:29:45,470 So let's try it out. 
1880 01:29:45,470 --> 01:29:48,255 Here, for example, is an image of Harvard Square 1881 01:29:48,255 --> 01:29:51,380 and some people walking in front of one of the buildings at Harvard Square. 1882 01:29:51,380 --> 01:29:53,720 I'll go ahead and take the URL for that image, 1883 01:29:53,720 --> 01:29:57,520 and I'll paste it into CaptionBot, then just press Go. 1884 01:29:57,520 --> 01:30:01,460 So CaptionBot is analyzing the image, and then it says, 1885 01:30:01,460 --> 01:30:03,920 I think it's a group of people walking in front 1886 01:30:03,920 --> 01:30:05,510 of a building, which seems amazing. 1887 01:30:05,510 --> 01:30:09,590 The AI is able to look at this image and figure out what's in the image. 1888 01:30:09,590 --> 01:30:11,510 And the important thing to recognize here 1889 01:30:11,510 --> 01:30:13,910 is that this is no longer just a classification task. 1890 01:30:13,910 --> 01:30:17,350 We saw being able to classify images with a convolutional neural network, 1891 01:30:17,350 --> 01:30:21,680 where the job was to take the images and then figure out, is it a 0, or a 1, 1892 01:30:21,680 --> 01:30:24,740 or a 2; or is it this person's face or that person's face? 1893 01:30:24,740 --> 01:30:28,160 What seems to be happening here is the input is an image, 1894 01:30:28,160 --> 01:30:31,190 and we know how to get networks to take input of images, 1895 01:30:31,190 --> 01:30:33,320 but the output is text. 1896 01:30:33,320 --> 01:30:34,010 It's a sentence. 1897 01:30:34,010 --> 01:30:38,410 It's a phrase, like "a group of people walking in front of a building." 1898 01:30:38,410 --> 01:30:41,420 And this would seem to pose a challenge for our more traditional 1899 01:30:41,420 --> 01:30:44,450 feed-forward neural networks, for the reason being 1900 01:30:44,450 --> 01:30:47,540 that in traditional neural networks, we just 1901 01:30:47,540 --> 01:30:50,670 have a fixed-size input and a fixed-size output. 1902 01:30:50,670 --> 01:30:53,930 There are a certain number of neurons in the input to our neural network 1903 01:30:53,930 --> 01:30:56,580 and a certain number of outputs for our neural network, 1904 01:30:56,580 --> 01:30:58,763 and then some calculation that goes on in between. 1905 01:30:58,763 --> 01:30:59,930 But the size of the inputs-- 1906 01:30:59,930 --> 01:31:03,030 the number of values in the input and the number of values in the output-- 1907 01:31:03,030 --> 01:31:07,775 those are always going to be fixed based on the structure of the neural network, 1908 01:31:07,775 --> 01:31:10,400 and that makes it difficult to imagine how a neural network can 1909 01:31:10,400 --> 01:31:12,440 take an image like this and say, you know, 1910 01:31:12,440 --> 01:31:14,840 it's a group of people walking in front of the building, 1911 01:31:14,840 --> 01:31:17,360 because the output is text. 1912 01:31:17,360 --> 01:31:19,580 It's a sequence of words. 1913 01:31:19,580 --> 01:31:23,120 Now it might be possible for a neural network to output one word. 1914 01:31:23,120 --> 01:31:25,610 One word you could represent as a vector of values, 1915 01:31:25,610 --> 01:31:27,350 and you can imagine ways of doing that. 1916 01:31:27,350 --> 01:31:29,517 And next time, we'll talk a little bit more about AI 1917 01:31:29,517 --> 01:31:31,950 as it relates to language and language processing.
1918 01:31:31,950 --> 01:31:34,290 But a sequence of words is much more challenging, 1919 01:31:34,290 --> 01:31:36,080 because depending on the image, you might 1920 01:31:36,080 --> 01:31:38,510 imagine the output is a different number of words. 1921 01:31:38,510 --> 01:31:41,120 We could have sequences of different lengths, 1922 01:31:41,120 --> 01:31:45,310 and somehow we still want to be able to generate the appropriate output. 1923 01:31:45,310 --> 01:31:49,250 And so the strategy here is to use a recurrent neural network, 1924 01:31:49,250 --> 01:31:52,790 a neural network that can feed its own output back into itself 1925 01:31:52,790 --> 01:31:55,020 as input for the next time. 1926 01:31:55,020 --> 01:31:59,810 And this allows us to do what we call a one-to-many relationship for inputs 1927 01:31:59,810 --> 01:32:02,720 to outputs, that in vanilla, more traditional neural networks-- 1928 01:32:02,720 --> 01:32:05,840 these are what we consider to be one-to-one neural networks-- 1929 01:32:05,840 --> 01:32:10,370 you pass in one set of values as input, you get one vector of values 1930 01:32:10,370 --> 01:32:12,080 as the output-- 1931 01:32:12,080 --> 01:32:14,750 but in this case, we want to pass in one value as input-- 1932 01:32:14,750 --> 01:32:17,840 the image-- and we want to get a sequence-- many values-- 1933 01:32:17,840 --> 01:32:22,190 as output, where each value is like one of these words that gets produced 1934 01:32:22,190 --> 01:32:24,460 by this particular algorithm. 1935 01:32:24,460 --> 01:32:26,960 And so the way we might do this is we might imagine starting 1936 01:32:26,960 --> 01:32:30,175 by providing input the image into our neural network, 1937 01:32:30,175 --> 01:32:32,300 and the neural network is going to generate output, 1938 01:32:32,300 --> 01:32:34,730 but the output is not going to be the whole sequence of words, 1939 01:32:34,730 --> 01:32:37,022 because we can't represent the whole sequence of words. 1940 01:32:37,022 --> 01:32:39,650 I'm using just a fixed set of neurons. 1941 01:32:39,650 --> 01:32:42,760 Instead, the output is just going to be the first word. 1942 01:32:42,760 --> 01:32:44,510 We're going to train the network to output 1943 01:32:44,510 --> 01:32:46,500 what the first word of the caption should be. 1944 01:32:46,500 --> 01:32:48,500 And you could imagine that Microsoft has trained 1945 01:32:48,500 --> 01:32:52,250 to this by running a whole bunch of training samples through the AI, 1946 01:32:52,250 --> 01:32:55,400 giving it a whole bunch of pictures and what the appropriate caption was, 1947 01:32:55,400 --> 01:32:58,520 and having the AI begin to learn from that. 1948 01:32:58,520 --> 01:33:00,830 But now, because the network generates output 1949 01:33:00,830 --> 01:33:03,020 that can be fed back into itself, you can 1950 01:33:03,020 --> 01:33:06,830 imagine the output of the network being fed back into the same network-- 1951 01:33:06,830 --> 01:33:10,400 this here looks like a separate network, but it's really the same network that's 1952 01:33:10,400 --> 01:33:12,170 just getting different input-- 1953 01:33:12,170 --> 01:33:16,340 that this network's output gets fed back into itself, 1954 01:33:16,340 --> 01:33:18,440 but it's going to generate another output, 1955 01:33:18,440 --> 01:33:22,910 and that other output is going to be like the second word in the caption. 
1956 01:33:22,910 --> 01:33:25,220 And this recurrent neural network then, this network 1957 01:33:25,220 --> 01:33:27,470 is going to generate other output that can be fed back 1958 01:33:27,470 --> 01:33:30,470 into itself to generate yet another word, fed back 1959 01:33:30,470 --> 01:33:32,420 into itself to generate another word. 1960 01:33:32,420 --> 01:33:35,150 And so recurrent neural networks allow us to represent 1961 01:33:35,150 --> 01:33:37,610 this sort of one-to-many structure. 1962 01:33:37,610 --> 01:33:40,370 You provide one image as input, and the neural network 1963 01:33:40,370 --> 01:33:43,160 can pass data into the next run of the network, 1964 01:33:43,160 --> 01:33:46,940 and then again and again, such that you could run the network multiple times, 1965 01:33:46,940 --> 01:33:52,398 each time generating a different output, still based on that original input. 1966 01:33:52,398 --> 01:33:54,190 And this is where recurrent neural networks 1967 01:33:54,190 --> 01:33:58,880 become particularly useful when dealing with sequences of inputs or outputs. 1968 01:33:58,880 --> 01:34:02,110 My output is a sequence of words, and since I can't very easily 1969 01:34:02,110 --> 01:34:04,690 represent outputting an entire sequence of words, 1970 01:34:04,690 --> 01:34:07,900 I'll instead output that sequence one word at a time, 1971 01:34:07,900 --> 01:34:10,240 by allowing my network to pass information 1972 01:34:10,240 --> 01:34:13,420 about what still needs to be said about the photo 1973 01:34:13,420 --> 01:34:15,655 into the next stage of running the networks. 1974 01:34:15,655 --> 01:34:17,530 So you could run the network multiple times-- 1975 01:34:17,530 --> 01:34:19,450 the same network with the same weights-- 1976 01:34:19,450 --> 01:34:23,260 just getting different input each time, first getting input from the image, 1977 01:34:23,260 --> 01:34:25,990 and then getting input from the network itself, 1978 01:34:25,990 --> 01:34:28,630 as additional information about what additionally 1979 01:34:28,630 --> 01:34:32,660 needs to be given in a particular caption, for example. 1980 01:34:32,660 --> 01:34:35,080 So this then is a one-to-many many relationship 1981 01:34:35,080 --> 01:34:36,760 inside of a recurrent neural network. 1982 01:34:36,760 --> 01:34:38,718 But it turns out there are other models that we 1983 01:34:38,718 --> 01:34:42,280 can use-- other ways we can try and use recurrent neural networks-- to be 1984 01:34:42,280 --> 01:34:45,490 able to represent data that might be stored in other forms as well. 1985 01:34:45,490 --> 01:34:48,640 We saw how we could use neural networks in order to analyze images, 1986 01:34:48,640 --> 01:34:51,802 in the context of convolutional neural networks that take an image, 1987 01:34:51,802 --> 01:34:54,010 figure out various different properties of the image, 1988 01:34:54,010 --> 01:34:57,410 and are able to draw some sort of conclusion based on that. 1989 01:34:57,410 --> 01:34:59,650 But you might imagine that something like YouTube, 1990 01:34:59,650 --> 01:35:02,730 they need to be able to do a lot of learning based on video. 1991 01:35:02,730 --> 01:35:04,480 They need to look through videos to detect 1992 01:35:04,480 --> 01:35:06,557 if there are copyright violations, or they 1993 01:35:06,557 --> 01:35:08,890 need to be able to look through videos to maybe identify 1994 01:35:08,890 --> 01:35:12,400 what particular items are inside of the video, for example. 
1995 01:35:12,400 --> 01:35:14,950 And video, you might imagine, is much more difficult 1996 01:35:14,950 --> 01:35:18,610 to provide as input to a neural network, because whereas with an image 1997 01:35:18,610 --> 01:35:22,520 you can just treat each pixel as a different value, videos are sequences. 1998 01:35:22,520 --> 01:35:26,388 They're sequences of images, and each sequence might be a different length, 1999 01:35:26,388 --> 01:35:28,180 and so it might be challenging to represent 2000 01:35:28,180 --> 01:35:31,120 that entire video as a single vector of values 2001 01:35:31,120 --> 01:35:34,070 that you could pass in to a neural network. 2002 01:35:34,070 --> 01:35:36,340 And so here too, recurrent neural networks 2003 01:35:36,340 --> 01:35:40,060 can be a valuable solution for trying to solve this type of problem. 2004 01:35:40,060 --> 01:35:44,150 So instead of just passing in a single input into our neural network, 2005 01:35:44,150 --> 01:35:47,170 we could pass in the input one frame at a time, you might imagine, 2006 01:35:47,170 --> 01:35:51,460 first taking the first frame of the video, passing it into the network, 2007 01:35:51,460 --> 01:35:54,280 and then maybe not having the network output anything at all yet. 2008 01:35:54,280 --> 01:35:58,870 Let it take in another input, and this time, pass it into the network, 2009 01:35:58,870 --> 01:36:01,750 but the network gets information from the last time 2010 01:36:01,750 --> 01:36:03,760 we provided an input into the network. 2011 01:36:03,760 --> 01:36:06,220 Then we pass in a third input and then a fourth input, 2012 01:36:06,220 --> 01:36:09,970 where each time, the network gets the most recent input, 2013 01:36:09,970 --> 01:36:12,850 like each frame of the video, but it also 2014 01:36:12,850 --> 01:36:16,940 gets information the network processed from all of the previous iterations. 2015 01:36:16,940 --> 01:36:19,360 So on frame number four, you end up getting 2016 01:36:19,360 --> 01:36:22,750 the input for frame number four, plus information the network has 2017 01:36:22,750 --> 01:36:25,630 calculated from the first three frames. 2018 01:36:25,630 --> 01:36:28,780 And using all of that data combined, this recurrent neural network 2019 01:36:28,780 --> 01:36:32,920 can begin to learn how to extract patterns from a sequence of data 2020 01:36:32,920 --> 01:36:33,730 as well. 2021 01:36:33,730 --> 01:36:35,730 And so you might imagine, if you want to classify 2022 01:36:35,730 --> 01:36:37,570 a video into a number of different genres, 2023 01:36:37,570 --> 01:36:40,990 like an educational video, or a music video, or other types of videos, 2024 01:36:40,990 --> 01:36:43,180 that's a classification task, where you want 2025 01:36:43,180 --> 01:36:45,820 to take as input each of the frames of the video, 2026 01:36:45,820 --> 01:36:48,440 and you want to output something like what genre it is-- 2027 01:36:48,440 --> 01:36:51,853 what category it happens to belong to. 2028 01:36:51,853 --> 01:36:53,770 And you can imagine doing this sort of thing-- 2029 01:36:53,770 --> 01:36:56,310 this sort of many-to-one learning-- 2030 01:36:56,310 --> 01:36:58,630 anytime your input is a sequence. 2031 01:36:58,630 --> 01:37:01,718 And so the input is a sequence in the context of a video. 2032 01:37:01,718 --> 01:37:04,510 It could also be in the context of text, like if someone has typed a message, 2033 01:37:04,510 --> 01:37:06,640 and you want to be able to categorize that message, 2034 01:37:06,640 --> 01:37:09,220 like if you're trying to take a movie review 2035 01:37:09,220 --> 01:37:12,850 and trying to classify it as a positive review or a negative review. 2036 01:37:12,850 --> 01:37:15,460 That input is a sequence of words, and the output 2037 01:37:15,460 --> 01:37:18,060 is a classification-- positive or negative. 2038 01:37:18,060 --> 01:37:20,170 There too, a recurrent neural network might 2039 01:37:20,170 --> 01:37:22,780 be helpful for analyzing sequences of words, 2040 01:37:22,780 --> 01:37:25,875 and they're quite popular when it comes to dealing with language. 2041 01:37:25,875 --> 01:37:27,950 It could even be used for spoken language 2042 01:37:27,950 --> 01:37:31,250 as well, since spoken language is an audio waveform that 2043 01:37:31,250 --> 01:37:34,460 can be segmented into distinct chunks, and each of those 2044 01:37:34,460 --> 01:37:37,760 can be passed in as an input into a recurrent neural network 2045 01:37:37,760 --> 01:37:40,380 to be able to classify someone's voice, for instance, 2046 01:37:40,380 --> 01:37:43,160 if you want to do voice recognition, to say is this one person 2047 01:37:43,160 --> 01:37:44,260 or is this another? 2048 01:37:44,260 --> 01:37:48,310 These are also cases where you might want this many-to-one architecture 2049 01:37:48,310 --> 01:37:50,897 for a recurrent neural network.
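Here is a minimal, hypothetical sketch of that many-to-one pattern in Python. The frame-feature size, the genre labels, and the weights are all invented, and the weights are random rather than trained, but the shape of the computation is the point: every frame updates a single hidden state, and only after the last frame does the network produce one classification. The same loop shape would apply to the words of a movie review or the chunks of an audio waveform.

import numpy as np

rng = np.random.default_rng(1)

FRAME_FEATURES = 16   # hypothetical size of one frame's feature vector
HIDDEN = 8
GENRES = ["educational", "music", "other"]

# Random weights stand in for weights learned from labeled videos.
W_in = rng.normal(size=(FRAME_FEATURES, HIDDEN))
W_h = rng.normal(size=(HIDDEN, HIDDEN))
W_out = rng.normal(size=(HIDDEN, len(GENRES)))

def classify_video(frames):
    """frames: a list of per-frame feature vectors; the list can be any length."""
    hidden = np.zeros(HIDDEN)
    for frame in frames:
        # Each step sees the newest frame plus what was computed from all earlier ones.
        hidden = np.tanh(frame @ W_in + hidden @ W_h)
    scores = hidden @ W_out        # one output for the whole sequence
    return GENRES[int(np.argmax(scores))]

# A fake 30-frame video; with untrained weights the predicted genre is arbitrary.
video = [rng.normal(size=FRAME_FEATURES) for _ in range(30)]
print(classify_video(video))

The design choice to notice is that no output is produced until the loop over the sequence is finished, which is what distinguishes this many-to-one arrangement from the one-to-many captioning loop above.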
2050 01:37:50,897 --> 01:37:52,980 And then, as one final problem just to take a look 2051 01:37:52,980 --> 01:37:55,860 at in terms of what we can do with these sorts of networks, 2052 01:37:55,860 --> 01:37:57,870 imagine what Google Translate is doing. 2053 01:37:57,870 --> 01:38:01,620 So what Google Translate is doing is it's taking some text written in one 2054 01:38:01,620 --> 01:38:05,850 language and converting it into text written in some other language, 2055 01:38:05,850 --> 01:38:09,090 for example, where now this input is a sequence of data-- 2056 01:38:09,090 --> 01:38:10,770 it's a sequence of words-- 2057 01:38:10,770 --> 01:38:13,210 and the output is a sequence of words as well. 2058 01:38:13,210 --> 01:38:14,440 It's also a sequence. 2059 01:38:14,440 --> 01:38:17,340 So here, we want effectively a many-to-many relationship. 2060 01:38:17,340 --> 01:38:21,330 Our input is a sequence, and our output is a sequence as well. 2061 01:38:21,330 --> 01:38:25,350 And it's not quite going to work to just say, take each word in the input 2062 01:38:25,350 --> 01:38:28,620 and translate it into a word in the output, 2063 01:38:28,620 --> 01:38:31,823 because ultimately, different languages put their words in different orders, 2064 01:38:31,823 --> 01:38:33,990 and maybe one language uses two words for something, 2065 01:38:33,990 --> 01:38:36,130 whereas another language only uses one. 2066 01:38:36,130 --> 01:38:40,970 So we really want some way to take this information-- that's the input-- 2067 01:38:40,970 --> 01:38:45,730 encode it somehow, and use that encoding to generate what the output ultimately 2068 01:38:45,730 --> 01:38:46,230 should be.
2069 01:38:46,230 --> 01:38:48,105 And one of the big advancements 2070 01:38:48,105 --> 01:38:50,700 in automated translation technology has been the ability 2071 01:38:50,700 --> 01:38:54,570 to use neural networks to do this, instead of older, more traditional methods, 2072 01:38:54,570 --> 01:38:56,820 and this has improved accuracy dramatically. 2073 01:38:56,820 --> 01:38:59,070 And the way you might imagine doing this is, again, 2074 01:38:59,070 --> 01:39:03,030 using a recurrent neural network with multiple inputs and multiple outputs. 2075 01:39:03,030 --> 01:39:04,590 We start by passing in all the input. 2076 01:39:04,590 --> 01:39:06,143 Input goes into the network. 2077 01:39:06,143 --> 01:39:08,310 Another input, like another word, goes into the network, 2078 01:39:08,310 --> 01:39:12,030 and we do this multiple times, like once for each word in the input 2079 01:39:12,030 --> 01:39:13,530 that I'm trying to translate. 2080 01:39:13,530 --> 01:39:16,800 And only after all of that is done does the network 2081 01:39:16,800 --> 01:39:19,950 start to generate output, like the first word of the translated sentence, 2082 01:39:19,950 --> 01:39:23,060 and the next word of the translated sentence, and so on and so forth, 2083 01:39:23,060 --> 01:39:26,100 where each time the network passes information 2084 01:39:26,100 --> 01:39:31,200 to itself by allowing for this model of giving some sort of state 2085 01:39:31,200 --> 01:39:33,960 from one run of the network to the next run, 2086 01:39:33,960 --> 01:39:36,120 assembling information about all the inputs, 2087 01:39:36,120 --> 01:39:39,780 and then passing along information about which part of the output 2088 01:39:39,780 --> 01:39:40,987 to generate next. 2089 01:39:40,987 --> 01:39:43,320 And there are a number of different types of these sorts 2090 01:39:43,320 --> 01:39:44,890 of recurrent neural networks. 2091 01:39:44,890 --> 01:39:48,060 One of the most popular is known as the long short-term memory neural 2092 01:39:48,060 --> 01:39:50,190 network, otherwise known as an LSTM. 2093 01:39:50,190 --> 01:39:53,303 But in general, these types of networks can be very, very powerful 2094 01:39:53,303 --> 01:39:55,470 whenever we're dealing with sequences, whether those 2095 01:39:55,470 --> 01:39:59,400 are sequences of images or especially sequences of words when it comes 2096 01:39:59,400 --> 01:40:02,370 to dealing with natural language.
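As a last sketch, here is the rough shape of that many-to-many, encode-then-decode process in Python. The source and target vocabularies, the sizes, and the weights are all invented for illustration, and a real translation system would use learned weights and LSTM-style cells rather than a single tanh update, but the structure is the same: read every input word first, building up a state, and only then start emitting output words one at a time until an end token is produced.

import numpy as np

rng = np.random.default_rng(2)

SRC = ["il", "fait", "beau", "<end>"]                 # hypothetical source vocabulary
TGT = ["<start>", "the", "weather", "is", "nice", "<end>"]
HIDDEN = 8

# Random weights stand in for weights learned from paired sentences.
enc_in = rng.normal(size=(len(SRC), HIDDEN))          # source word -> hidden update
dec_in = rng.normal(size=(len(TGT), HIDDEN))          # previous output word -> hidden update
W_h = rng.normal(size=(HIDDEN, HIDDEN))
W_out = rng.normal(size=(HIDDEN, len(TGT)))

def translate(source_words, max_words=10):
    # Encoder: consume every input word, producing no output yet.
    hidden = np.zeros(HIDDEN)
    for w in source_words:
        hidden = np.tanh(enc_in[SRC.index(w)] + hidden @ W_h)
    # Decoder: generate output words one at a time from that encoded state,
    # feeding each generated word back in as the next input.
    word, out = TGT.index("<start>"), []
    for _ in range(max_words):
        hidden = np.tanh(dec_in[word] + hidden @ W_h)
        word = int(np.argmax(hidden @ W_out))
        if TGT[word] == "<end>":
            break
        out.append(TGT[word])
    return " ".join(out)

# Untrained, so the output words are arbitrary; the point is the two-phase loop.
print(translate(["il", "fait", "beau", "<end>"]))

The hidden state carried from the encoding loop into the decoding loop is what plays the role of the "encoding" described above: it summarizes the whole input sentence so the output can come out in a different order and a different length.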
2097 01:40:02,370 --> 01:40:06,090 So those then were just some of the different types of neural networks 2098 01:40:06,090 --> 01:40:08,590 that can be used to do all sorts of different computations, 2099 01:40:08,590 --> 01:40:10,830 and these are incredibly versatile tools that 2100 01:40:10,830 --> 01:40:12,930 can be applied to a number of different domains. 2101 01:40:12,930 --> 01:40:16,300 We only looked at a couple of the most popular types of neural networks-- 2102 01:40:16,300 --> 01:40:18,570 the more traditional feed-forward neural networks, 2103 01:40:18,570 --> 01:40:21,573 convolutional neural networks, and recurrent neural networks. 2104 01:40:21,573 --> 01:40:22,990 But there are other types as well. 2105 01:40:22,990 --> 01:40:25,907 There are adversarial networks, where networks compete with each other 2106 01:40:25,907 --> 01:40:28,890 to try and be able to generate new types of data, 2107 01:40:28,890 --> 01:40:32,370 as well as other networks that can solve other tasks based on how they happen 2108 01:40:32,370 --> 01:40:34,510 to be structured and adapted. 2109 01:40:34,510 --> 01:40:36,810 And these are very powerful tools in machine learning, 2110 01:40:36,810 --> 01:40:40,578 being able to very easily learn based on some set of input data 2111 01:40:40,578 --> 01:40:42,870 and to be able to therefore figure out how to calculate 2112 01:40:42,870 --> 01:40:45,210 some function from inputs to outputs. 2113 01:40:45,210 --> 01:40:48,600 Whether it's mapping input to some sort of classification, like analyzing an image 2114 01:40:48,600 --> 01:40:50,910 and getting a digit, or machine translation, where 2115 01:40:50,910 --> 01:40:53,670 the input is in one language and the output is in another, 2116 01:40:53,670 --> 01:40:58,080 these tools have a lot of applications for machine learning more generally. 2117 01:40:58,080 --> 01:41:00,360 Next time, we'll look at machine learning and AI 2118 01:41:00,360 --> 01:41:02,633 in particular in the context of natural language. 2119 01:41:02,633 --> 01:41:04,800 We talked a little bit about this today, but we'll be looking 2120 01:41:04,800 --> 01:41:08,520 at how it is that our AI can begin to understand natural language 2121 01:41:08,520 --> 01:41:11,640 and can begin to be able to analyze and do useful tasks with 2122 01:41:11,640 --> 01:41:13,740 regard to human language, which turns out 2123 01:41:13,740 --> 01:41:15,880 to be a challenging and interesting task. 2124 01:41:15,880 --> 01:41:18,110 So we'll see you next time. 2125 01:41:18,110 --> 01:41:19,000