[MUSIC PLAYING]

SPEAKER 1: All right. Welcome back, everyone, to an Introduction to Artificial Intelligence with Python. Last time, we took a look at machine learning: a set of techniques that computers can use to take a set of data, learn some patterns inside of that data, and learn how to perform a task, even if we, the programmers, didn't give the computer explicit instructions for how to perform that task.

Today, we transition to one of the most popular techniques and tools within machine learning: neural networks. Neural networks were inspired as early as the 1940s by researchers who were thinking about how it is that humans learn, studying neuroscience and the human brain, and trying to see whether or not we could apply those same ideas to computers as well, modeling computer learning off of human learning.

So how is the brain structured? Very simply put, the brain consists of a whole bunch of neurons, and those neurons are connected to one another and communicate with one another in some way. In particular, if you think about the structure of a biological neural network, something like this, there are a couple of key properties that scientists observed. One is that these neurons are connected to each other and receive electrical signals from one another: one neuron can propagate electrical signals to another neuron. Another is that neurons process those input signals and can then be activated: a neuron becomes activated at a certain point and can then propagate further signals on to other neurons.

And so the question became: could we take this biological idea of how humans learn, with brains and with neurons, and apply it to a machine as well, in effect designing an artificial neural network, or ANN, which will be a mathematical model for learning inspired by these biological neural networks? What artificial neural networks allow us to do is, first of all, model some sort of mathematical function.
Every neural network we'll look at today is really just some mathematical function that maps certain inputs to particular outputs based on the structure of the network: depending on where we place particular units inside the neural network, that is going to determine how the network functions. In particular, artificial neural networks lend themselves to a way that we can learn what the network's parameters should be. We'll see more on that in just a moment, but in effect, we want a model such that it is easy for us to write some code that allows the network to figure out how to model the right mathematical function, given a particular set of input data.

So in order to create our artificial neural network, instead of using biological neurons, we're just going to use what we'll call units, units inside of a neural network, which we can represent kind of like a node in a graph, shown here as a blue circle. And these artificial units, these artificial neurons, can be connected to one another. So here, for instance, we have two units that are connected by an edge inside of this graph, effectively.

What we're going to do now is think of this as some sort of mapping from inputs to outputs: we have one unit that is connected to another unit, and we might think of this side as the input and that side as the output. What we're trying to do, then, is figure out how to solve a problem, how to model some sort of mathematical function. This might take the form of something we saw last time: we have certain inputs, like variables x1 and x2, and given those inputs, we want to perform some sort of task, a task like predicting whether or not it's going to rain. Ideally, we'd like some way, given these inputs x1 and x2, which stand for some variables to do with the weather, to predict, in this case, a Boolean classification: is it going to rain, or is it not going to rain? And we did this last time by way of a mathematical function.
We defined some function h, our hypothesis function, that took as input x1 and x2, the two inputs we cared about processing, in order to determine whether we thought it was going to rain or not. The question then becomes: what does this hypothesis function do in order to make that determination? We decided last time to use a linear combination of these input variables to determine what the output should be. So our hypothesis function was equal to something like this: h(x1, x2) = w0 + w1*x1 + w2*x2, that is, weight 0, plus weight 1 times x1, plus weight 2 times x2.

What's going on here is that x1 and x2 are the input variables, the inputs to this hypothesis function, and each of those input variables is multiplied by some weight, which is just some number. So x1 is multiplied by weight 1, x2 is multiplied by weight 2, and we have this additional weight, weight 0, that doesn't get multiplied by an input variable at all; it just serves to move the function's value up or down. You can think of it either as a weight that's multiplied by some dummy value, like the number 1, so that effectively it isn't multiplied by anything, or, as you'll sometimes see in the literature, you can call this value weight 0 a "bias," so that you think of these as slightly different: we have weights that are multiplied by the inputs, and we separately add some bias to the result as well. You'll hear both of those terminologies used when people talk about neural networks and machine learning.

So in effect, in order to define a hypothesis function, we just need to decide and figure out what these weights should be, to determine what values to multiply by our inputs to get some sort of result. Of course, at the end of this, we need to make some sort of classification, like raining or not raining, and to do that, we use some function to define a threshold. And so we saw, for instance, the step function, which is defined as 1 if the result of multiplying the weights by the inputs is at least 0, and otherwise as 0. You can picture it with a threshold, like a dotted line, down the middle.
Effectively, it stays at 0 all the way up to one point, and then the function steps, or jumps up, to 1. So it's 0 before it reaches some threshold, and it's 1 after it reaches that threshold. And this was one way we could define what we'll come to call an activation function: a function that determines when it is that this output becomes active, changing to a 1 instead of being a 0.

But we also saw that if we didn't just want a purely binary classification, if we didn't want purely 1 or 0 but wanted to allow for some in-between real number values, we could use a different function. There are a number of choices, but the one we looked at was the logistic sigmoid function, which has an S-shaped curve, and which lets us represent the output as a probability: maybe the probability of rain is somewhere in between, something like 0.5, and maybe a little bit later the probability of rain is 0.8. So rather than just having a binary classification of 0 or 1, we can allow for numbers that are in between as well.

And it turns out there are many other types of activation functions, where an activation function just takes the result of multiplying the weights by the inputs and adding the bias, and figures out what the actual output should be. Another popular one is the rectified linear unit, otherwise known as ReLU, and the way that works is that it takes its input and returns the maximum of that input and 0. So if the input is positive, it remains unchanged, but if it's negative, it levels out at 0. And there are other activation functions we could choose as well. In short, you can think of each of these activation functions as a function g that gets applied to the result of all of this computation: we take some function g and apply it to the result of that calculation.

This, then, is what we saw last time: a way of defining some hypothesis function that takes inputs, calculates some linear combination of those inputs, and then passes it through some sort of activation function to get our output. And this actually turns out to be the model for the simplest of neural networks. We're going to represent this mathematical idea graphically, using a structure like this.
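Before turning to that graphical picture, here is a minimal sketch of these pieces in Python. None of this is the course's own code, and the function names are purely illustrative; it is just the weighted sum and the three activation functions written out.

```python
import math

def weighted_sum(inputs, weights, bias):
    """The linear combination described above: w0 + w1*x1 + w2*x2 + ..."""
    return bias + sum(x * w for x, w in zip(inputs, weights))

def step(z):
    """Step activation: 1 once the weighted sum reaches the threshold of 0, else 0."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Logistic sigmoid: an S-shaped curve giving a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def relu(z):
    """Rectified linear unit: the maximum of the input and 0."""
    return max(0, z)

def hypothesis(inputs, weights, bias, activation=step):
    """Some activation function g applied to the weighted sum."""
    return activation(weighted_sum(inputs, weights, bias))
```

Calling hypothesis([x1, x2], [w1, w2], w0, activation=sigmoid) would give a probability-like value rather than a hard 0 or 1, which matches the distinction drawn above.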
Here, then, is a neural network that has two inputs. We can think of this one as x1 and this one as x2. And then we have one output, which you can think of as classifying whether or not we think it's going to rain or not rain, for example, in this particular instance.

So how exactly does this model work? Well, each of these two inputs represents one of our input variables, x1 and x2. And notice that these inputs are connected to this output via edges, each of which has a weight associated with it, weight 1 and weight 2. This output unit is then going to calculate an output based on those inputs and those weights: it's going to multiply each input by its weight, add in the bias term, which you can think of as an extra w0 term that gets added in, and then pass the result through an activation function. So this is just a graphical way of representing the same idea we saw last time mathematically, and we'll call this a very simple neural network.

And we'd like for this neural network to be able to learn how to calculate some function: we want some function for the neural network to learn, and the neural network is going to learn what the values of w0, w1, and w2 should be, and what the activation function should be, in order to get the result that we would expect.

So we can take a look at an example of this. What is a very simple function that we might calculate? Well, if we recall back to when we were looking at propositional logic, one of the simplest functions we looked at was the Or function, which takes two inputs, x and y, and outputs 1 (otherwise known as true) if either one of the inputs, or both of them, is 1, and outputs 0 (false) if both of the inputs are 0. So this, then, is the Or function, and this was its truth table: as long as either of the inputs is 1, the output of the function is 1, and the only case where the output is 0 is where both of the inputs are 0. So the question is, how could we take this and train a neural network to be able to learn this particular function? What would those weights look like?
Well, we could do something like this. Here's our neural network, and I'll propose that in order to calculate the Or function, we're going to use a value of 1 for each of the weights, we'll use a bias of negative 1, and we'll use the step function as our activation function.

How does this work? Well, if I wanted to calculate something like 0 or 0, which we know to be 0, because false or false is false, what are we going to do? Our output unit is going to multiply each input by its weight: 0 times 1 is 0, and the same thing here, 0 times 1 is 0. We add to that the bias, minus 1, which gives us a result of negative 1. If we plot that on our activation function, negative 1 is before the threshold, and the output is 0 before the threshold; it's only 1 after the threshold. Since negative 1 is before the threshold, the output that this unit provides is going to be 0. And that's what we would expect: 0 or 0 should be 0.

What if instead we had 1 or 0, where this first input is the number 1? In this case, to calculate the output, we again do the weighted sum: 1 times 1 is 1, and 0 times 1 is 0, so the sum so far is 1. Add negative 1 to that, and we get 0. If we plot 0 on the step function, it lands just at the threshold, and so the output here is going to be 1, because 1 or 0 is 1. So that's what we would expect as well.

And for one more example, if I had 1 or 1, what would the result be? Well, 1 times 1 is 1, and 1 times 1 is 1, so the sum of those is 2. I add the bias term to that and get the number 1. Plotted on this graph, 1 is well beyond the threshold, and so this output is going to be 1 as well. The output is always 0 or 1, depending on whether or not we're past the threshold. And this neural network, then, models the Or function. It's a very simple function, certainly, but the network is able to model it correctly.
If I give it the inputs, it will tell me what x1 or x2 happens to be. And you could imagine trying to do this for other functions as well, a function like the And function, for instance, which takes two inputs and calculates whether both x and y are true. So if x is 1 and y is 1, then the output of x and y is 1, but in all of the other cases, the output is 0. How could we model that inside of a neural network as well? Well, it turns out we can do it in the same way, except instead of negative 1 as the bias, we use negative 2 as the bias instead.

What does that end up looking like? Well, if I had 1 and 1, the output should be 1, because true and true is true. I take 1 times 1, that's 1; 1 times 1 is 1; so I have a total sum of 2 so far. Now I add the bias of negative 2, and I get the value 0. And 0, when I plot it on the activation function, is just past the threshold, and so the output is going to be 1. But if I had any other input, for example 1 and 0, the weighted sum of these is 1 plus 0, which is 1. Minus 2 gives us negative 1, and negative 1 is not past the threshold, so the output is going to be 0.

So those, then, are some very simple functions that we can model using a neural network that has two inputs and one output, where our goal is to figure out what those weights should be in order to determine what the output should be. And you could imagine generalizing this to calculate more complex functions as well: maybe, given the humidity and the pressure, we want to calculate the probability that it's going to rain, for example. Or you might want to do a regression-style problem where, given some amount of advertising and given what month it is, maybe, we want to predict what our expected sales are going to be for that particular month. So you could imagine these inputs and outputs being different as well.

And it turns out that in some problems we're not just going to have two inputs, and the nice thing about these neural networks is that we can compose multiple units together and make our networks more complex just by adding more units into the network. So the network we've been looking at has two inputs and one output.
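Pulling the Or and And examples together, here is a minimal sketch of that two-input, one-output unit in Python. The weights and biases are the ones proposed above; the function names and the loop are illustrative, not taken from the lecture.

```python
def unit(inputs, weights, bias):
    """A single unit: weighted sum of the inputs plus the bias, passed through the step function."""
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= 0 else 0

def logical_or(x1, x2):
    # Weights of 1 and 1 with a bias of -1 reproduce the Or truth table.
    return unit([x1, x2], [1, 1], bias=-1)

def logical_and(x1, x2):
    # The same weights with a bias of -2 reproduce the And truth table.
    return unit([x1, x2], [1, 1], bias=-2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "or:", logical_or(a, b), "and:", logical_and(a, b))
```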
But we could just as easily say, let's go ahead and have three inputs in there, or even more inputs, where we could arbitrarily decide however many inputs there are to our problem, all of them going into calculating some sort of output whose value we care about figuring out.

How, then, does the math work for figuring out that output? Well, it works in a very similar way. In the case of two inputs, we had two weights, indicated by these edges, and we multiplied the weights by the inputs and added the bias term, and we'll do the same thing in the other cases as well. If I have three inputs, you can imagine multiplying each of those three inputs by each of the three weights. If I had five inputs instead, we do the same thing: here I'm saying sum up, for i from 1 to 5, xi multiplied by weight i. So take each of the five input variables, multiply them by their corresponding weights, and then add the bias to that. This would be a case where there are five inputs into the neural network, for example, but there could be arbitrarily many nodes in the input to the neural network, where each time we just sum up all of the input variables multiplied by their weights and then add the bias term at the very end. And so this allows us to represent problems that have even more inputs, just by growing the size of our neural network.

Now, the next question we might ask is how it is that we train these neural networks. In the case of the Or function and the And function, they were simple enough functions that I could just tell you what the weights should be, and you could probably reason through yourself what the weights should be in order to calculate the output that you want. But in general, with functions like predicting sales or predicting whether or not it's going to rain, these are much trickier functions to figure out. We would like the computer to have some mechanism for calculating what the weights should be, how to set the weights, so that our neural network is able to accurately model the function that we care about trying to estimate.
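For reference, that summation over any number of inputs is just a dot product plus a bias. A small sketch using NumPy, where the particular numbers are invented purely for illustration:

```python
import numpy as np

def weighted_sum(x, w, bias):
    """Sum over i of x_i times w_i, plus the bias term w_0: a dot product plus a constant."""
    return np.dot(x, w) + bias

# Five inputs, five corresponding weights, and a single bias term, as in the example above.
x = np.array([1.0, 0.0, 3.0, 2.0, 5.0])
w = np.array([0.2, -0.1, 0.5, 0.0, 0.3])
print(weighted_sum(x, w, bias=-1.0))
```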
It turns out that the strategy for figuring out those weights, inspired by the domain of calculus, is a technique called gradient descent. Gradient descent is an algorithm for minimizing loss when you're training a neural network. Recall that loss refers to how bad our hypothesis function happens to be: we can define certain loss functions, and we saw some examples of loss functions last time, that just give us a number for any particular hypothesis, saying how poorly it models the data. How many examples does it get wrong? How much better or worse is it compared to other hypothesis functions that we might define?

This loss function is just a mathematical function, and when you have a mathematical function, in calculus, you can calculate something known as the gradient, which you can think of as being like a slope: the direction the loss function is moving at any particular point. And what it's going to tell us is in which direction we should be moving these weights in order to minimize the amount of loss.

So generally speaking, and we won't get into the calculus of it, the high-level idea for gradient descent looks something like this. If we want to train a neural network, we start by choosing the weights randomly: just pick random weights for all of the weights in the neural network. Then we use the input data that we have access to in order to train the network and figure out what the weights should actually be. So we repeat this process again and again. The first step is to calculate the gradient based on all of the data points: we look at all the data and figure out what the gradient is at the place where we currently are, for the current setting of the weights, meaning in which direction we should move the weights in order to minimize the total amount of loss and make our solution better. And once we've calculated that gradient, which direction we should move in the loss function, we can update the weights according to the gradient: take a small step in the direction indicated by the gradient, in order to try to make our solution a little bit better.
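As a concrete sketch of that loop, here is gradient descent applied to a tiny model of the form y = w*x + b with a squared-error loss, chosen only so that the gradient can be written out by hand. This is not the course's code, and the data points are made up.

```python
import random

def gradient_descent(data, learning_rate=0.01, epochs=5000):
    """Start with random weights, then repeatedly compute the gradient of the loss
    over ALL data points and move each weight a small step in the direction
    that reduces the loss."""
    w = random.uniform(-1, 1)
    b = random.uniform(-1, 1)
    n = len(data)
    for _ in range(epochs):
        # Gradient based on every data point (this is the expensive part).
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Take one small step; the step size is the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Points that lie roughly on y = 3x + 1; the loop should recover w close to 3 and b close to 1.
data = [(0, 1.0), (1, 4.1), (2, 6.9), (3, 10.2)]
print(gradient_descent(data))
```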
And the size of the step that we take is going to vary; you can choose it when you're training a particular neural network. But in short, the idea is going to be: take all of the data points, figure out, based on those data points, in what direction the weights should move, and then move the weights one small step in that direction. And if you repeat that process over and over again, adjusting the weights a little bit at a time based on all the data points, eventually you should end up with a pretty good solution to the problem you're trying to solve. At least, that's what we would hope to happen.

Now, as you look at this algorithm, a good question to ask anytime you're analyzing an algorithm is: what is going to be the expensive part of the calculation? What's going to take a lot of work to compute? And in particular, in the case of gradient descent, the really expensive part is this "all data points" part right here: having to take all of the data points and use them to figure out what the gradient is at this particular setting of all of the weights. Because odds are, in a big machine learning problem, where you're trying to solve a big problem with a lot of data, you have a lot of data points to calculate with, and figuring out the gradient based on all of those data points is going to be expensive. And you'll have to do it many times: you'll likely repeat this process again and again and again, going through all the data points, taking one small step over and over, as you try to figure out what the optimal setting of those weights happens to be.

It turns out that we would ideally like to be able to train our neural networks faster, to more quickly converge to some sort of solution that is going to be a good solution to the problem. So in that case, there are alternatives to standard gradient descent, which looks at all of the data points at once. We can employ a method like stochastic gradient descent, which will randomly choose just one data point at a time to calculate the gradient, instead of calculating it based on all of the data points.
So the idea there is that we have some setting of the weights, we pick a data point, and based on that one data point we figure out in which direction we should move all of the weights, and move the weights in that small direction. Then we take another data point and do that again, and repeat this process again and again, maybe looking at each of the data points multiple times, but each time only using one data point to calculate the gradient and decide which direction to move in.

Now, using just one data point instead of all of the data points probably gives us a less accurate estimate of what the gradient actually is. But on the plus side, it's going to be much faster to calculate: we can much more quickly calculate the gradient based on one data point, instead of calculating it based on all of the data points and having to do all of that computational work again and again.

So there are trade-offs here between looking at all of the data points and looking at just one data point. And it turns out that a middle ground, which is also quite popular, is a technique called mini-batch gradient descent, where the idea is that instead of looking at all of the data versus just a single point, we divide our dataset up into small batches, groups of data points, where you can decide how big a particular batch is. In short, you're just going to look at a small number of points at any given time, hopefully getting a more accurate estimate of the gradient than a single point would give, but also not requiring all of the computational effort needed to look at every single one of the data points.

So gradient descent, then, is this technique that we can use to train these neural networks, to figure out what the setting of all of these weights should be, if we want some way to get an accurate notion of how this function should work, some way of modeling how to transform the inputs into particular outputs.

So far, the networks that we've taken a look at have all been structured like this: we have some number of inputs, maybe two or three or five or more, and then we have one output that is just predicting rain or no rain, or predicting one particular value.
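Before moving on, here is how the stochastic and mini-batch variants would change the earlier gradient descent sketch; only the points used to estimate the gradient differ. As before, the tiny y = w*x + b model and the data are purely illustrative.

```python
import random

def minibatch_gradient_descent(data, batch_size=2, learning_rate=0.01, epochs=5000):
    """Each step estimates the gradient from a small random batch of points.
    With batch_size=1 this is stochastic gradient descent; with
    batch_size=len(data) it is ordinary (full-batch) gradient descent."""
    w = random.uniform(-1, 1)
    b = random.uniform(-1, 1)
    for _ in range(epochs):
        batch = random.sample(data, batch_size)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

data = [(0, 1.0), (1, 4.1), (2, 6.9), (3, 10.2)]
print(minibatch_gradient_descent(data))
```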
But often in machine learning problems, we don't just care about one output. We might care about an output that has multiple different values associated with it. So in the same way that we could take a neural network and add units to the input layer, we can likewise add outputs to the output layer as well. Instead of just one output, you could imagine we have two outputs, or we could have four outputs, for example, where in each case, as we add more inputs or more outputs, if we want to keep this network fully connected between these two layers, we just need to add more weights: now each of these input nodes has four weights associated with it, one for each of the four outputs, and that's true for each of the various input nodes. So as we add nodes, we add more weights, to make sure that each of the inputs is somehow connected to each of the outputs, so that each output value can be calculated based on what the values of the inputs happen to be.

So what might a case be where we want multiple different output values? Well, you might consider that in the case of weather prediction, for example, we might not just care whether it's raining or not raining. There might be multiple different categories of weather that we would like to sort the weather into. With just a single output variable, we can do a binary classification, like rain or no rain, 1 or 0, but it doesn't allow us to do much more than that. With multiple output variables, I might be able to use each one to predict something a little different. Maybe I want to categorize the weather into one of four different categories, something like: is it going to be raining, or sunny, or cloudy, or snowy? And I now have four output variables that can be used to represent, maybe, the probability that it is raining, as opposed to sunny, as opposed to cloudy, as opposed to snowy.

How, then, would this neural network work? Well, we have some input variables that represent some data that we have collected about the weather. Each of those inputs gets multiplied by each of these various different weights. We have more multiplications to do, but these are fairly quick mathematical operations to perform.
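Those multiplications amount to a small matrix-vector product: with three inputs and four outputs, a fully connected layer holds three times four weights plus one bias per output. A minimal sketch, with made-up numbers and random weights just to show the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

x = np.array([0.6, 0.3, 0.9])        # three input values about the weather (invented)
W = rng.uniform(-1, 1, (3, 4))       # one weight for every input-output pair: a 3 x 4 matrix
b = rng.uniform(-1, 1, 4)            # one bias per output unit

outputs = sigmoid(x @ W + b)         # each output: its own weighted sum of the same inputs, through g
print(outputs)                       # four numbers, one per weather category
```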
And then what we get, after passing those sums through some sort of activation function at the outputs, is some set of numbers, where each number, you might imagine, can be interpreted as a probability, like the probability that it is one category as opposed to another category. So here we're saying that, based on the inputs, we think there is a 10% chance that it's raining, a 60% chance that it's sunny, a 20% chance that it's cloudy, and a 10% chance that it's snowy. And given that output, if these represent a probability distribution, then you could just pick whichever one has the highest value, in this case sunny, and say that, most likely, this set of inputs means the output should be sunny, and that is what we would expect the weather to be in this particular instance.

So this allows us to do these sorts of multi-class classifications, where instead of just having a binary classification, 1 or 0, we can have as many different categories as we want, and we can have our neural network output these probabilities over which categories are more likely than others. And using that data, we're able to draw some sort of inference about what it is that we should do.

So this was the idea of supervised machine learning: I can give this neural network a whole bunch of input data corresponding to some label, some output data, like we know that it was raining on this day and we know that it was sunny on that day. And using all of that data, the algorithm can use gradient descent to figure out what all of the weights should be in order to create some sort of model that hopefully gives us a way to predict what we think the weather is going to be.

But neural networks have a lot of other applications as well. You can imagine applying the same sort of idea to a reinforcement learning example. You'll remember that in reinforcement learning, what we wanted to do is train some sort of agent to learn what action to take depending on what state it currently happens to be in. So depending on the current state of the world, we wanted the agent to pick from one of the actions that are available to it.
And you might model that by having each of these input variables represent some information about the state, some data about what state our agent is currently in, and then the outputs, for example, could be each of the various actions that our agent could take: action 1, 2, 3, and 4. And you might imagine that this network would work in the same way: based on these particular inputs, we calculate values for each of these outputs, and those outputs could model which actions are better than other actions, and we could just choose, based on looking at those outputs, which action we should take.

And so these neural networks are very broadly applicable. All they're really doing is modeling some mathematical function. So anything that we can frame as a mathematical function, something like classifying inputs into various different categories, or figuring out, based on some input state, what action we should take, these are all mathematical functions that we could attempt to model by taking advantage of this neural network structure, and in particular by taking advantage of this technique, gradient descent, which we can use to figure out what the weights should be in order to do this sort of calculation.

Now, how is it that you would go about training a neural network that has multiple outputs instead of just one? Well, with just a single output, we could see what the output for that value should be, and then update all of the weights that corresponded to it. And when we have multiple outputs, at least in this particular case, we can really think of this as four separate neural networks: we just have one network here that has these three inputs, corresponding to these three weights, corresponding to this one output value. And the same thing is true for this output value: it effectively defines yet another neural network that has the same three inputs but a different set of weights that correspond to this output. And likewise, this output has its own set of weights as well, and the same thing for the fourth output too.
And so if you wanted to train a neural network that had four outputs instead of just one, in this case where the inputs are directly connected to the outputs, you could really think of this as training four independent neural networks. We know what the outputs for each of these four should be based on our input data, and using that data, we can begin to figure out what all of these individual weights should be. And maybe there's an additional step at the end to turn these values into a probability distribution, such that we can interpret which one is better or more likely than another as a category, or something like that.

So this seems like it does a pretty good job of taking inputs and trying to predict what the outputs should be, and we'll see some real examples of this in just a moment as well. But it's important, then, to think about what the limitations of this sort of approach are, of just taking some linear combination of inputs and passing it into some sort of activation function.

It turns out that when we do this in the case of binary classification, trying to predict whether something belongs to one category or another, we can only predict things that are linearly separable, because we're taking a linear combination of inputs and using that to define some decision boundary or threshold. What we get, then, is a situation where, if we have this set of data, we can find a line that linearly separates the red points from the blue points. But a single unit that is making a binary classification, otherwise known as a perceptron, can't deal with a situation like this, where, as we've seen before, there is no straight line that goes through the data and divides the red points from the blue points. It's a more complex decision boundary: the decision boundary somehow needs to capture the things inside of this circle, and there isn't really a single line that will allow us to deal with that.

So this is the limitation of the perceptron, these units that just make binary decisions based on their inputs: a single perceptron is only capable of learning a linearly separable decision boundary. All it can do is define a line.
Sure, it can give us probabilities based on how close to that decision boundary we are, but it can only really decide based on a linear decision boundary. And so this doesn't seem like it's going to generalize well to situations where real-world data is involved, because real-world data often isn't linearly separable. It often isn't the case that we can just draw a line through the data and divide it up into multiple groups.

So what, then, is the solution to this? Well, what was proposed was the idea of a multilayer neural network. So far, all of the neural networks we've seen have had a set of inputs and a set of outputs, with the inputs connected directly to those outputs. But a multilayer neural network is an artificial neural network that still has an input layer and an output layer, but also has one or more hidden layers in between: other layers of artificial neurons, or units, that are going to calculate their own values as well.

So instead of a neural network that looks like this, with three inputs and one output, you might imagine injecting a hidden layer in the middle, something like this. This is a hidden layer that has four nodes. You could choose how many nodes or units end up going into the hidden layer, and you can have multiple hidden layers as well. And so now each of the inputs isn't directly connected to the output. Each of the inputs is connected to this hidden layer, and then all of the nodes in the hidden layer are connected to the one output.

And so this is just another step that we can take toward calculating more complex functions. Each of these hidden units will calculate its output value, otherwise known as its activation, based on a linear combination of all the inputs. And once we have values for all of these hidden nodes, as opposed to those just being the output, we do the same thing again: calculate the output for this final node by multiplying each of the values of these hidden units by their weights as well. So in effect, the way this works is that we start with the inputs; they get multiplied by weights in order to calculate values for the hidden nodes; and those get multiplied by weights in order to figure out what the ultimate output is going to be.
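A minimal sketch of that two-step forward pass, with three inputs, a hidden layer of four units, and one output; the weights here are random simply to show the shapes of the computation, and none of this is the course's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)

x = np.array([0.6, 0.3, 0.9])        # three input values (invented)
W1 = rng.uniform(-1, 1, (3, 4))      # weights from the 3 inputs to the 4 hidden units
b1 = rng.uniform(-1, 1, 4)           # one bias per hidden unit
W2 = rng.uniform(-1, 1, (4, 1))      # weights from the 4 hidden units to the single output
b2 = rng.uniform(-1, 1, 1)           # bias for the output unit

hidden = sigmoid(x @ W1 + b1)        # each hidden unit: a weighted sum of the inputs, through g
output = sigmoid(hidden @ W2 + b2)   # the output: a weighted sum of the hidden activations, through g
print(hidden, output)
```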
And the advantage of layering things like this is that it gives us the ability to model more complex functions: instead of just having a single decision boundary, a single line dividing the red points from the blue points, each of these hidden nodes can learn a different decision boundary, and we can combine those decision boundaries to figure out what the ultimate output is going to be. And as we begin to imagine more complex situations, you could imagine each of these hidden nodes learning some useful property, some useful feature of all of the inputs, and the network somehow learning how to combine those features together in order to get the output that we actually want.

Now, the natural question, when we begin to look at this, is: how do we train a neural network that has hidden layers inside of it? And this initially turns out to be a bit of a tricky question, because the input data we are given consists of values for all of the inputs and what the value of the output should be, what the category is, for example, but the data doesn't tell us what the values for all of these hidden nodes should be. So we don't know how far off each of these hidden nodes actually is, because we're only given data for the inputs and the outputs. The reason this is called the hidden layer is that the data made available to us doesn't tell us what the values of these intermediate nodes should actually be.

And so the strategy people came up with was to say that if you know what the error, or the loss, is on the output node, then, based on what these weights are (if one of these weights is higher than another), you can calculate an estimate for how much of the error from the output was due to this part of the hidden layer, or this part, or this part, based on the values of those weights. In effect, based on the error from the output, I can backpropagate the error and figure out an estimate for what the error is at each node in the hidden layer as well. There's some more calculus here that we won't get into the details of, but this idea is known as backpropagation. It's an algorithm for training a neural network with one or more hidden layers.
00:33:44.930 --> 00:33:47.000 And the idea for this-- the pseudocode for it-- 00:33:47.000 --> 00:33:50.690 will again be, if we want to run gradient descent with backpropagation, 00:33:50.690 --> 00:33:54.050 we'll start with a random choice of weights as we did before, 00:33:54.050 --> 00:33:57.540 and now we'll go ahead and repeat the training process again and again. 00:33:57.540 --> 00:33:59.810 But what we're going to do each time is now 00:33:59.810 --> 00:34:02.720 we're going to calculate the error for the output layer first. 00:34:02.720 --> 00:34:05.940 We know the output and what it should be, and we know what we calculated, 00:34:05.940 --> 00:34:08.389 so we figure out what the error there is. 00:34:08.389 --> 00:34:11.060 But then we're going to repeat, for every layer, 00:34:11.060 --> 00:34:13.963 starting with the output layer, moving back into the hidden layer, 00:34:13.963 --> 00:34:16.880 then the hidden layer before that if there are multiple hidden layers, 00:34:16.880 --> 00:34:19.219 going back all the way to the very first hidden layer, 00:34:19.219 --> 00:34:23.750 assuming there are multiple, we're going to propagate the error back one layer-- 00:34:23.750 --> 00:34:25.520 whatever the error was from the output-- 00:34:25.520 --> 00:34:28.550 figure out what the error should be a layer before that based on what 00:34:28.550 --> 00:34:30.630 the values of those weights are. 00:34:30.630 --> 00:34:33.697 And then we can update those weights. 00:34:33.697 --> 00:34:35.780 So graphically, the way you might think about this 00:34:35.780 --> 00:34:37.460 is that we first start with the output. 00:34:37.460 --> 00:34:39.080 We know what the output should be. 00:34:39.080 --> 00:34:40.497 We know what output we calculated. 00:34:40.497 --> 00:34:42.497 And based on that, we can figure out, all right, 00:34:42.497 --> 00:34:45.020 how do we need to update those weights, backpropagating 00:34:45.020 --> 00:34:47.330 the error to these nodes. 00:34:47.330 --> 00:34:50.290 And using that, we can figure out how we should update these weights. 00:34:50.290 --> 00:34:52.415 And you might imagine if there are multiple layers, 00:34:52.415 --> 00:34:54.500 we could repeat this process again and again 00:34:54.500 --> 00:34:58.427 to begin to figure out how all of these weights should be updated. 00:34:58.427 --> 00:35:00.260 And this backpropagation algorithm is really 00:35:00.260 --> 00:35:03.080 the key algorithm that makes neural networks possible, 00:35:03.080 --> 00:35:06.510 and makes it possible to take these multi-level structures 00:35:06.510 --> 00:35:09.020 and be able to train those structures, depending 00:35:09.020 --> 00:35:12.380 on what the values of these weights are in order to figure out 00:35:12.380 --> 00:35:15.290 how it is that we should go about updating those weights in order 00:35:15.290 --> 00:35:19.370 to create some function that is able to minimize the total amount of loss, 00:35:19.370 --> 00:35:22.910 to figure out some good setting of the weights that will take the inputs 00:35:22.910 --> 00:35:26.360 and translate it into the output that we expect. 
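As a rough illustration of that pseudocode, here is a toy sketch of gradient descent with backpropagation for a network with one hidden layer, assuming sigmoid activations and a squared-error loss; it is a hand-rolled version of the idea, not the TensorFlow implementation used later in the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, hidden_units=4, epochs=10000, lr=1.0, seed=0):
    """Gradient descent with backpropagation for one hidden layer (squared-error loss)."""
    rng = np.random.default_rng(seed)
    # Start with a random choice of weights (and zero biases)
    W1, b1 = rng.normal(size=(X.shape[1], hidden_units)), np.zeros(hidden_units)
    W2, b2 = rng.normal(size=(hidden_units, 1)), np.zeros(1)

    for _ in range(epochs):
        # Forward pass: inputs -> hidden layer -> output
        hidden = sigmoid(X @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)

        # Error for the output layer (loss gradient times sigmoid derivative)
        output_error = (output - y) * output * (1 - output)
        # Propagate the error back one layer, weighted by the outgoing weights
        hidden_error = (output_error @ W2.T) * hidden * (1 - hidden)

        # Update weights and biases in the direction that reduces the loss
        W2 -= lr * hidden.T @ output_error / len(X)
        b2 -= lr * output_error.mean(axis=0)
        W1 -= lr * X.T @ hidden_error / len(X)
        b1 -= lr * hidden_error.mean(axis=0)
    return W1, b1, W2, b2

# Toy data: XOR, which no single linear decision boundary can model
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train(X, y)
# Typically ends up close to [0, 1, 1, 0] after training
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```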
00:35:26.360 --> 00:35:29.165 And this works, as we said, not just for a single hidden layer, 00:35:29.165 --> 00:35:32.210 but you can imagine multiple hidden layers, where each hidden layer-- 00:35:32.210 --> 00:35:34.490 we just defined however many nodes we want-- 00:35:34.490 --> 00:35:36.470 where each of the nodes in one layer, we can 00:35:36.470 --> 00:35:40.010 connect to the nodes in the next layer, defining more and more complex 00:35:40.010 --> 00:35:45.190 networks that are able to model more and more complex types of functions. 00:35:45.190 --> 00:35:49.100 And so this type of network is what we might call a deep neural network, part 00:35:49.100 --> 00:35:52.098 of a larger family of deep learning algorithms, 00:35:52.098 --> 00:35:53.390 if you've ever heard that term. 00:35:53.390 --> 00:35:57.620 And all deep learning is about is it's using multiple layers to be 00:35:57.620 --> 00:36:01.130 able to predict and be able to model higher-level features inside 00:36:01.130 --> 00:36:03.910 of the input, to be able to figure out what the output should be. 00:36:03.910 --> 00:36:06.410 And so the deep neural network is just a neural network that 00:36:06.410 --> 00:36:09.230 has multiple of these hidden layers, where we start at the input, 00:36:09.230 --> 00:36:12.500 calculate values for this layer, then this layer, then this layer, 00:36:12.500 --> 00:36:14.460 and then ultimately get an output. 00:36:14.460 --> 00:36:17.600 And this allows us to be able to model more and more sophisticated 00:36:17.600 --> 00:36:20.030 types of functions, that each of these layers 00:36:20.030 --> 00:36:22.710 can calculate something a little bit different. 00:36:22.710 --> 00:36:27.290 And we can combine that information to figure out what the output should be. 00:36:27.290 --> 00:36:29.840 Of course, as with any situation of machine learning, 00:36:29.840 --> 00:36:32.330 as we begin to make our models more and more complex, 00:36:32.330 --> 00:36:35.920 to model more and more complex functions, the risk we run 00:36:35.920 --> 00:36:37.670 is something like overfitting. 00:36:37.670 --> 00:36:39.620 And we talked about overfitting last time 00:36:39.620 --> 00:36:44.210 in the context of overfitting based on when we were training our models to be 00:36:44.210 --> 00:36:47.510 able to learn some sort of decision boundary, where overfitting happens 00:36:47.510 --> 00:36:51.300 when we fit too closely to the training data, and as a result, 00:36:51.300 --> 00:36:54.990 we don't generalize well to other situations as well. 00:36:54.990 --> 00:36:59.000 And one of the risks we run with a far more complex neural network that 00:36:59.000 --> 00:37:01.070 has many, many different nodes is that we 00:37:01.070 --> 00:37:03.200 might overfit based on the input data; we 00:37:03.200 --> 00:37:07.310 might grow over-reliant on certain nodes to calculate things just purely based 00:37:07.310 --> 00:37:12.180 on the input data that doesn't allow us to generalize very well to the output. 00:37:12.180 --> 00:37:15.190 And there are a number of strategies for dealing with overfitting, 00:37:15.190 --> 00:37:18.010 but one of the most popular in the context of neural networks 00:37:18.010 --> 00:37:19.900 is a technique known as dropout. 
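The only change a deep network makes to the earlier sketch is that the forward pass repeats layer after layer, with one layer's activations becoming the next layer's inputs. A tiny sketch with made-up layer sizes and weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, layers):
    """Feed the input through each layer in turn: one layer's activations
    become the next layer's inputs."""
    for W in layers:
        x = sigmoid(W @ x)
    return x

rng = np.random.default_rng(0)
# Made-up weights for a 3 -> 4 -> 4 -> 1 network (sizes chosen arbitrarily)
layers = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
print(forward(np.array([0.5, 0.1, 0.9]), layers))
```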
00:37:19.900 --> 00:37:23.410 And what dropout does is it when we're training the neural network, what we'll 00:37:23.410 --> 00:37:26.740 do in dropout, is temporarily remove units, 00:37:26.740 --> 00:37:28.900 temporarily remove these artificial neurons 00:37:28.900 --> 00:37:32.080 from our network, chosen at random, and the goal here 00:37:32.080 --> 00:37:35.120 is to prevent over-reliance on certain units. 00:37:35.120 --> 00:37:37.060 So what generally happens in overfitting is 00:37:37.060 --> 00:37:40.660 that we begin to over-rely on certain units inside the neural network 00:37:40.660 --> 00:37:43.600 to be able to tell us how to interpret the input data. 00:37:43.600 --> 00:37:46.900 What dropout will do is randomly remove some of these units 00:37:46.900 --> 00:37:50.260 in order to reduce the chance that we over-rely on certain units, 00:37:50.260 --> 00:37:52.630 to make our neural network more robust, to be 00:37:52.630 --> 00:37:56.740 able to handle the situations even when we just drop out particular neurons 00:37:56.740 --> 00:37:58.140 entirely. 00:37:58.140 --> 00:38:00.850 So the way that might work is we have a network like this, 00:38:00.850 --> 00:38:03.010 and as we're training it, when we go about trying 00:38:03.010 --> 00:38:04.870 to update the weights the first time, we'll 00:38:04.870 --> 00:38:08.350 just randomly pick some percentage of the nodes to drop out of the network. 00:38:08.350 --> 00:38:10.280 It's as if those nodes aren't there at all. 00:38:10.280 --> 00:38:13.490 It's as if the weights associated with those nodes aren't there at all. 00:38:13.490 --> 00:38:14.930 And we'll train in this way. 00:38:14.930 --> 00:38:17.200 Then the next time we update the weights, we'll pick a different set 00:38:17.200 --> 00:38:20.050 and just go ahead and train that way, and then again randomly choose 00:38:20.050 --> 00:38:23.360 and train with other nodes that have been dropped that as well. 00:38:23.360 --> 00:38:25.990 And the goal of that is that after the training process, 00:38:25.990 --> 00:38:29.308 if you train by dropping out random nodes inside of this neural network, 00:38:29.308 --> 00:38:32.350 you hopefully end up with a network that's a little bit more robust, that 00:38:32.350 --> 00:38:35.620 doesn't rely too heavily on any one particular node, 00:38:35.620 --> 00:38:40.420 but more generally learns how to approximate a function in general. 00:38:40.420 --> 00:38:42.790 So that then is a look at some of these techniques 00:38:42.790 --> 00:38:46.390 that we can use in order to implement a neural network, to get 00:38:46.390 --> 00:38:49.060 at the idea of taking this input, passing it 00:38:49.060 --> 00:38:51.160 through these various different layers, in order 00:38:51.160 --> 00:38:52.870 to produce some sort of output. 00:38:52.870 --> 00:38:55.870 And what we'd like to do now is take those ideas and put them into code. 00:38:55.870 --> 00:38:58.537 And to do that, there are a number of different machine learning 00:38:58.537 --> 00:39:01.120 libraries-- neural network libraries-- that we can use that 00:39:01.120 --> 00:39:05.560 allow us to get access to someone's implementation of backpropagation 00:39:05.560 --> 00:39:07.210 and all of these hidden layers. 
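A minimal sketch of that random removal, assuming a made-up vector of hidden activations and a 50% dropout rate (the rate is just an example value). In practice, libraries like the one introduced next provide this as a built-in layer, such as tf.keras.layers.Dropout, so you rarely write the mask yourself:

```python
import numpy as np

rng = np.random.default_rng(0)
dropout_rate = 0.5                                   # fraction of units to remove each update
hidden = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.8])    # made-up hidden-unit activations

# On each training update, pick a random subset of units to temporarily remove
keep = rng.random(hidden.shape) >= dropout_rate
print(keep)             # which units survive this particular update (True = kept)
print(hidden * keep)    # dropped-out units contribute nothing to this update
```

On the next update a different random subset would be kept, which is exactly the "pick a different set each time" behavior described above.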
00:39:07.210 --> 00:39:09.370 And one of the most popular, developed by Google, 00:39:09.370 --> 00:39:11.440 is known as TensorFlow, a library that we 00:39:11.440 --> 00:39:13.930 can use for quickly creating neural networks 00:39:13.930 --> 00:39:16.780 and modeling them and running them on some sample data 00:39:16.780 --> 00:39:18.730 to see what the output is going to be. 00:39:18.730 --> 00:39:20.690 And before we actually start writing code, 00:39:20.690 --> 00:39:23.380 we'll go ahead and take a look at TensorFlow's Playground, which 00:39:23.380 --> 00:39:25.422 will be an opportunity for us just to play around 00:39:25.422 --> 00:39:28.180 with this idea of neural networks in different layers, 00:39:28.180 --> 00:39:31.660 just to get a sense for what it is that we can do by taking advantage 00:39:31.660 --> 00:39:33.950 of a neural networks. 00:39:33.950 --> 00:39:37.360 So let's go ahead and go into TensorFlow's Playground, which you can 00:39:37.360 --> 00:39:39.670 go to by visiting that URL from before. 00:39:39.670 --> 00:39:43.480 And what we're going to do now is we're going to try and learn the decision 00:39:43.480 --> 00:39:46.240 boundary for this particular output. 00:39:46.240 --> 00:39:49.710 I want to learn to separate the orange points from the blue points, 00:39:49.710 --> 00:39:52.090 and I'd like to learn some sort of setting of weights 00:39:52.090 --> 00:39:56.690 inside of a neural network that will be able to separate those from each other. 00:39:56.690 --> 00:39:58.960 The features we have access to, our input data, 00:39:58.960 --> 00:40:03.590 are the x value and the y value, so the two values along each of the two axes. 00:40:03.590 --> 00:40:06.340 And what I'll do now is I can set particular parameters, like what 00:40:06.340 --> 00:40:09.490 activation function I would like to use, and I'll just go ahead 00:40:09.490 --> 00:40:12.720 and press Play and see what happens. 00:40:12.720 --> 00:40:16.560 And what happens here is that you'll see that just by using these two input 00:40:16.560 --> 00:40:20.590 features-- the x value and the y value, with no hidden layers-- 00:40:20.590 --> 00:40:24.450 just take the input, x and y values, and figure out what the decision boundary 00:40:24.450 --> 00:40:24.990 is-- 00:40:24.990 --> 00:40:27.600 our neural network learns pretty quickly that in order 00:40:27.600 --> 00:40:30.150 to divide these two points, we should just use this line. 00:40:30.150 --> 00:40:34.193 This line acts as the decision boundary that separates this group of points 00:40:34.193 --> 00:40:36.360 from that group of points, and it does it very well. 00:40:36.360 --> 00:40:38.160 You can see up here what the loss is. 00:40:38.160 --> 00:40:40.320 The training loss is zero, meaning we were 00:40:40.320 --> 00:40:44.640 able to perfectly model separating these two points from each other inside 00:40:44.640 --> 00:40:46.380 of our training data. 00:40:46.380 --> 00:40:50.610 So this was a fairly simple case of trying to apply a neural network, 00:40:50.610 --> 00:40:54.630 because the data is very clean it's very nicely linearly separable. 00:40:54.630 --> 00:40:58.810 We can just draw a line that separates all of those points from each other. 00:40:58.810 --> 00:41:00.900 Let's now consider a more complex case. 00:41:00.900 --> 00:41:03.390 So I'll go ahead and pause the simulation, 00:41:03.390 --> 00:41:06.570 and we'll go ahead and look at this data set here. 00:41:06.570 --> 00:41:09.030 This data set is a little bit more complex now. 
00:41:09.030 --> 00:41:11.280 In this data set, we still have blue and orange points 00:41:11.280 --> 00:41:13.140 that we'd like to separate from each other, 00:41:13.140 --> 00:41:15.150 but there is no single line that we can draw 00:41:15.150 --> 00:41:17.400 that is going to be able to figure out how to separate 00:41:17.400 --> 00:41:21.480 the blue from the orange, because the blue is located in these two quadrants 00:41:21.480 --> 00:41:23.640 and the orange is located here and here. 00:41:23.640 --> 00:41:26.890 It's a more complex function to be able to learn. 00:41:26.890 --> 00:41:30.660 So let's see what happens if we just try and predict based on those inputs-- 00:41:30.660 --> 00:41:34.080 the x- and y-coordinates-- what the output should be. 00:41:34.080 --> 00:41:38.220 Press Play, and what you'll notice is that we're not really able 00:41:38.220 --> 00:41:40.530 to draw much of a conclusion, that we're not 00:41:40.530 --> 00:41:42.900 able to very cleanly see how we should divide 00:41:42.900 --> 00:41:46.170 the orange points from the blue points, and you don't 00:41:46.170 --> 00:41:48.760 see a very clean separation there. 00:41:48.760 --> 00:41:53.050 So it seems like we don't have enough sophistication inside of our network 00:41:53.050 --> 00:41:55.910 to be able to model something that is that complex. 00:41:55.910 --> 00:41:58.540 We need a better model for this neural network. 00:41:58.540 --> 00:42:01.730 And I'll do that by adding a hidden layer. 00:42:01.730 --> 00:42:04.700 So now I have the hidden layer that has two neurons inside of it. 00:42:04.700 --> 00:42:09.000 So I have two inputs that then go to two neurons inside of a hidden layer 00:42:09.000 --> 00:42:14.260 that then go to our output, and now I'll press Play, and what you'll notice here 00:42:14.260 --> 00:42:16.570 is that we're able to do slightly better. 00:42:16.570 --> 00:42:19.420 We're able to now say, all right, these points are definitely blue. 00:42:19.420 --> 00:42:21.370 These points are definitely orange. 00:42:21.370 --> 00:42:24.432 We're still struggling a little bit with these points up here though, 00:42:24.432 --> 00:42:26.140 and what we can do is we can see for each 00:42:26.140 --> 00:42:28.660 of these hidden neurons what is it exactly 00:42:28.660 --> 00:42:30.460 that these hidden neurons are doing. 00:42:30.460 --> 00:42:33.850 Each hidden neuron is learning its own decision boundary, 00:42:33.850 --> 00:42:35.590 and we can see what that boundary is. 00:42:35.590 --> 00:42:38.350 This first neuron is learning, all right, 00:42:38.350 --> 00:42:41.440 this line that seems to separate some of the blue points 00:42:41.440 --> 00:42:43.510 from the rest of the points. 00:42:43.510 --> 00:42:45.983 This other hidden neuron is learning another line 00:42:45.983 --> 00:42:48.400 that seems to be separating the orange points in the lower 00:42:48.400 --> 00:42:50.420 right from the rest of the points. 00:42:50.420 --> 00:42:52.720 So that's why we're able to sort of figure out 00:42:52.720 --> 00:42:55.900 these two areas in the bottom region, but we're still not 00:42:55.900 --> 00:42:59.090 able to perfectly classify all of the points. 00:42:59.090 --> 00:43:01.760 So let's go ahead and add another neuron-- 00:43:01.760 --> 00:43:04.900 now we've got three neurons inside of our hidden layer-- 00:43:04.900 --> 00:43:07.020 and see what we're able to learn now. 00:43:07.020 --> 00:43:07.520 All right. 
00:43:07.520 --> 00:43:09.440 Well, now we seem to be doing a better job 00:43:09.440 --> 00:43:11.990 by learning three different decision boundaries, which 00:43:11.990 --> 00:43:14.540 each of the three neurons inside of our hidden layer 00:43:14.540 --> 00:43:18.352 were able to much better figure out how to separate these blue points 00:43:18.352 --> 00:43:19.310 from the orange points. 00:43:19.310 --> 00:43:22.340 And you can see what each of these hidden neurons is learning. 00:43:22.340 --> 00:43:25.220 Each one is learning a slightly different decision boundary, 00:43:25.220 --> 00:43:27.860 and then we're combining those decision boundaries together 00:43:27.860 --> 00:43:30.770 to figure out what the overall output should be. 00:43:30.770 --> 00:43:34.390 And we can try it one more time by adding a fourth neuron there 00:43:34.390 --> 00:43:35.930 and try learning that. 00:43:35.930 --> 00:43:37.798 And it seems like now we can do even better 00:43:37.798 --> 00:43:40.340 at trying to separate the blue points from the orange points, 00:43:40.340 --> 00:43:43.280 but we were only able to do this by adding a hidden layer, 00:43:43.280 --> 00:43:46.160 by adding some layer that is learning some other boundaries, 00:43:46.160 --> 00:43:49.070 and combining those boundaries to determine the output. 00:43:49.070 --> 00:43:51.980 And the strength-- the size and thickness of these lines-- 00:43:51.980 --> 00:43:55.790 and indicate how high these weights are, how important each of these inputs 00:43:55.790 --> 00:43:59.050 is, for making this sort of calculation. 00:43:59.050 --> 00:44:01.730 And we can do maybe one more simulation. 00:44:01.730 --> 00:44:04.960 Let's go ahead and try this on a data set that looks like this. 00:44:04.960 --> 00:44:06.668 Go ahead and get rid of the hidden layer. 00:44:06.668 --> 00:44:08.710 Here now we're trying to separate the blue points 00:44:08.710 --> 00:44:11.830 from the orange points, where all the blue points are located, again, 00:44:11.830 --> 00:44:13.700 inside of a circle, effectively. 00:44:13.700 --> 00:44:16.130 So we're not going to be able to learn a line. 00:44:16.130 --> 00:44:17.920 Notice I press Play, and we're really not 00:44:17.920 --> 00:44:20.240 able to draw any sort of classification at all, 00:44:20.240 --> 00:44:22.420 because there is no line that cleanly separates 00:44:22.420 --> 00:44:25.570 the blue points from the orange points. 00:44:25.570 --> 00:44:29.350 So let's try to solve this by introducing a hidden layer. 00:44:29.350 --> 00:44:31.307 I'll go ahead and press Play. 00:44:31.307 --> 00:44:31.890 And all right. 00:44:31.890 --> 00:44:33.793 With two neurons and a hidden layer, we're 00:44:33.793 --> 00:44:36.210 able to do a little better, because we effectively learned 00:44:36.210 --> 00:44:37.627 two different decision boundaries. 00:44:37.627 --> 00:44:40.380 We learned this line here, and we learned this line 00:44:40.380 --> 00:44:41.760 on the right-hand side. 00:44:41.760 --> 00:44:43.890 And right now, we're just saying, all right, well, if it's in-between, 00:44:43.890 --> 00:44:46.473 we'll call it blue, and if it's outside, we'll call it orange. 00:44:46.473 --> 00:44:49.150 So, not great, but certainly better than before. 00:44:49.150 --> 00:44:52.620 We're learning one decision boundary and another, and based on those, 00:44:52.620 --> 00:44:55.690 we can figure out what the output should be. 00:44:55.690 --> 00:45:00.770 But let's now go ahead and add a third neuron and see what happens now. 
00:45:00.770 --> 00:45:02.150 I go ahead and train it. 00:45:02.150 --> 00:45:04.878 And now, using three different decision boundaries 00:45:04.878 --> 00:45:06.920 that are learned by each of these hidden neurons, 00:45:06.920 --> 00:45:09.800 we're able to much more accurately model this distinction 00:45:09.800 --> 00:45:11.840 between blue points and orange points. 00:45:11.840 --> 00:45:14.750 We're able to figure out, maybe with these three decision boundaries, 00:45:14.750 --> 00:45:18.530 combining them together, you can imagine figuring out what the output should be 00:45:18.530 --> 00:45:20.908 and how to make that sort of classification. 00:45:20.908 --> 00:45:22.700 And so the goal here is just to get a sense 00:45:22.700 --> 00:45:25.670 for having more neurons in these hidden layers that 00:45:25.670 --> 00:45:28.490 allows us to learn more structure in the data, 00:45:28.490 --> 00:45:31.400 allows us to figure out what the relevant and important decision 00:45:31.400 --> 00:45:32.360 boundaries are. 00:45:32.360 --> 00:45:34.365 And then using this backpropagation algorithm, 00:45:34.365 --> 00:45:36.740 we're able to figure out what the values of these weights 00:45:36.740 --> 00:45:39.290 should be in order to train this network to be 00:45:39.290 --> 00:45:44.240 able to classify one category of points away from another category of points 00:45:44.240 --> 00:45:45.228 instead. 00:45:45.228 --> 00:45:48.020 And this is ultimately what we're going to be trying to do whenever 00:45:48.020 --> 00:45:50.970 we're training a neural network. 00:45:50.970 --> 00:45:53.300 So let's go ahead and actually see an example of this. 00:45:53.300 --> 00:45:57.020 You'll recall from last time that we had this banknotes file that 00:45:57.020 --> 00:46:00.080 included information about counterfeit banknotes as opposed 00:46:00.080 --> 00:46:04.670 to authentic banknotes, where it had four different values for each banknote 00:46:04.670 --> 00:46:07.640 and then a categorization of whether that bank note is considered 00:46:07.640 --> 00:46:10.280 to be authentic or a counterfeit note. 00:46:10.280 --> 00:46:13.880 And what I wanted to do was, based on that input information, 00:46:13.880 --> 00:46:15.830 figure out some function that could calculate 00:46:15.830 --> 00:46:19.250 based on the input information what category it belonged to. 00:46:19.250 --> 00:46:21.590 And what I've written here in banknotes.py 00:46:21.590 --> 00:46:25.340 is a neural network that we'll learn just that, a network that learns, 00:46:25.340 --> 00:46:27.320 based on all of the input, whether or not 00:46:27.320 --> 00:46:31.790 we should categorize a banknote as authentic or as counterfeit. 00:46:31.790 --> 00:46:34.250 The first step is the same as what we saw from last time. 00:46:34.250 --> 00:46:38.130 I'm really just reading the data in and getting it into an appropriate format. 00:46:38.130 --> 00:46:41.690 And so this is where more of the writing Python code on your own 00:46:41.690 --> 00:46:43.820 comes in terms of manipulating this data, 00:46:43.820 --> 00:46:46.010 massaging the data into a format that will 00:46:46.010 --> 00:46:48.290 be understood by a machine learning library 00:46:48.290 --> 00:46:50.890 like scikit-learn or like TensorFlow. 00:46:50.890 --> 00:46:54.710 And so here I separate it into a training and a testing set. 00:46:54.710 --> 00:46:59.030 And now what I'm doing down below is I'm creating a neural network. 
00:46:59.030 --> 00:47:01.490 Here I'm using tf, which stands for TensorFlow. 00:47:01.490 --> 00:47:04.385 Up above I said, import TensorFlow as tf. 00:47:04.385 --> 00:47:06.720 So you have just an abbreviation that we'll often use, 00:47:06.720 --> 00:47:09.178 so we don't need to write out TensorFlow every time we want 00:47:09.178 --> 00:47:11.570 to use anything inside of the library. 00:47:11.570 --> 00:47:13.910 I'm using tf.keras. 00:47:13.910 --> 00:47:16.340 Keras is an API, a set of functions that we 00:47:16.340 --> 00:47:20.748 can use in order to manipulate neural networks inside of TensorFlow, 00:47:20.748 --> 00:47:22.790 and it turns out there are other machine learning 00:47:22.790 --> 00:47:25.442 libraries that also use the Keras API. 00:47:25.442 --> 00:47:27.650 But here, I'm saying, all right, go ahead and give me 00:47:27.650 --> 00:47:31.220 a model that is a sequential model-- a sequential neural network-- 00:47:31.220 --> 00:47:33.750 meaning one layer after another. 00:47:33.750 --> 00:47:37.700 And now I'm going to add to that model what layers I want inside 00:47:37.700 --> 00:47:38.910 of my neural network. 00:47:38.910 --> 00:47:40.820 So here I'm saying, model.add. 00:47:40.820 --> 00:47:43.160 Go ahead and add a dense layer-- 00:47:43.160 --> 00:47:45.530 and when we say a dense layer, we mean a layer that 00:47:45.530 --> 00:47:48.290 is just each of the nodes inside of the layer 00:47:48.290 --> 00:47:50.970 is going to be connected to each node from the previous layer, 00:47:50.970 --> 00:47:54.460 so we have a densely connected layer. 00:47:54.460 --> 00:47:56.910 This layer is going to have eight units inside of it. 00:47:56.910 --> 00:48:00.090 So it's going to be a hidden layer inside of a neural network with eight 00:48:00.090 --> 00:48:02.460 different units, eight artificial neurons, each of which 00:48:02.460 --> 00:48:03.830 might learn something different. 00:48:03.830 --> 00:48:05.760 And I just sort of chose eight arbitrarily. 00:48:05.760 --> 00:48:09.510 You could choose a different number of hidden nodes inside of the layer. 00:48:09.510 --> 00:48:12.270 And as we saw before, depending on the number of units 00:48:12.270 --> 00:48:15.240 there are inside of your hidden layer, more units 00:48:15.240 --> 00:48:17.170 means you can learn more complex functions, 00:48:17.170 --> 00:48:20.340 so maybe you can more accurately model the training data, 00:48:20.340 --> 00:48:21.450 but it comes at a cost. 00:48:21.450 --> 00:48:24.480 More units means more weights that you need to figure out how to update, 00:48:24.480 --> 00:48:27.030 so it might be more expensive to do that calculation. 00:48:27.030 --> 00:48:30.900 And you also run the risk of overfitting on the data if you have too many units, 00:48:30.900 --> 00:48:33.420 and you learn to just overfit on the training data. 00:48:33.420 --> 00:48:34.390 That's not good either. 00:48:34.390 --> 00:48:36.848 So there is a balance, and there's often a testing process, 00:48:36.848 --> 00:48:40.350 where you'll train on some data and maybe validate how well you're 00:48:40.350 --> 00:48:41.970 doing on a separate set of data-- 00:48:41.970 --> 00:48:45.555 often called a validation set-- to see, all right, which setting of parameters, 00:48:45.555 --> 00:48:47.430 how many layers should I have, how many units 00:48:47.430 --> 00:48:49.230 should be in each layer, which one of those 00:48:49.230 --> 00:48:51.450 performs the best on the validation set?
00:48:51.450 --> 00:48:55.410 So you can do some testing to figure out what these hyperparameters, so-called, 00:48:55.410 --> 00:48:57.600 should be equal to. 00:48:57.600 --> 00:49:02.010 Next I specify what the input_shape is, meaning what does my input look like? 00:49:02.010 --> 00:49:04.560 My input has four values, and so the input shape 00:49:04.560 --> 00:49:07.650 is just 4, because we have four inputs. 00:49:07.650 --> 00:49:09.960 And then I specify what the activation function is. 00:49:09.960 --> 00:49:12.043 And the activation function, again, we can choose. 00:49:12.043 --> 00:49:14.160 There a number of different activation functions. 00:49:14.160 --> 00:49:17.940 Here I'm using relu, which you might recall from earlier. 00:49:17.940 --> 00:49:20.410 And then I'll add an output layer. 00:49:20.410 --> 00:49:21.660 So I have my hidden layer. 00:49:21.660 --> 00:49:23.820 Now I'm adding one more layer that will just 00:49:23.820 --> 00:49:26.700 have one unit, because all I want to do is predict something 00:49:26.700 --> 00:49:29.350 like counterfeit bill or authentic bill. 00:49:29.350 --> 00:49:31.050 So I just need a single unit. 00:49:31.050 --> 00:49:33.240 And the activation function I'm going to use here 00:49:33.240 --> 00:49:35.370 is that sigmoid activation function, which 00:49:35.370 --> 00:49:39.300 again was that S-shaped curve that just gave us like a probability of, 00:49:39.300 --> 00:49:43.380 what is the probability that this is a counterfeit bill as opposed 00:49:43.380 --> 00:49:45.150 to an authentic bill? 00:49:45.150 --> 00:49:48.750 So that then is the structure of my neural network-- sequential neural 00:49:48.750 --> 00:49:52.200 network that has one hidden layer with eight units inside of it, 00:49:52.200 --> 00:49:55.760 and then one output layer that just has a single unit inside of it. 00:49:55.760 --> 00:49:57.510 And I can choose how many units there are. 00:49:57.510 --> 00:49:59.670 I can choose the activation function. 00:49:59.670 --> 00:50:02.970 Then I'm going to compile this model. 00:50:02.970 --> 00:50:06.718 TensorFlow gives you a choice of how you would like to optimize the weights-- 00:50:06.718 --> 00:50:09.010 there are various different algorithms for doing that-- 00:50:09.010 --> 00:50:11.135 what type of loss function you want to use-- again, 00:50:11.135 --> 00:50:12.840 many different options for doing that-- 00:50:12.840 --> 00:50:14.880 and then how I want to evaluate my model. 00:50:14.880 --> 00:50:16.050 Well, I care about accuracy. 00:50:16.050 --> 00:50:20.670 I care about how many of my points am I able to classify correctly 00:50:20.670 --> 00:50:23.330 versus not correctly of counterfeit or not counterfeit, 00:50:23.330 --> 00:50:28.650 and I would like it to report to me how accurate my model is performing. 00:50:28.650 --> 00:50:31.110 Then, now that I've defined that model, I 00:50:31.110 --> 00:50:34.260 call model.fit to say, go ahead and train the model. 00:50:34.260 --> 00:50:38.230 Train it on all the training data, plus all of the training labels-- 00:50:38.230 --> 00:50:41.100 so labels for each of those pieces of training data-- 00:50:41.100 --> 00:50:43.860 and I'm saying run it for 20 epochs, meaning go ahead 00:50:43.860 --> 00:50:46.830 and go through each of these training points 20 times effectively, 00:50:46.830 --> 00:50:50.220 go through the data 20 times and keep trying to update the weights. 
00:50:50.220 --> 00:50:52.440 If I did it for more, I could train for even longer 00:50:52.440 --> 00:50:55.050 and maybe get a more accurate result. But then 00:50:55.050 --> 00:50:58.380 after I fit it on all the data, I'll go ahead and just test it. 00:50:58.380 --> 00:51:01.050 I'll evaluate my model using model.evaluate, 00:51:01.050 --> 00:51:03.480 built into TensorFlow, that is just going to tell me, 00:51:03.480 --> 00:51:05.907 how well do I perform on the testing data? 00:51:05.907 --> 00:51:07.740 So ultimately, this is just going to give me 00:51:07.740 --> 00:51:13.150 some numbers that tell me how well we did in this particular case. 00:51:13.150 --> 00:51:15.300 So now what I'm going to do is go into banknotes 00:51:15.300 --> 00:51:17.697 and go ahead and run banknotes.py. 00:51:17.697 --> 00:51:19.530 And what's going to happen now is it's going 00:51:19.530 --> 00:51:21.630 to read in all of that training data. 00:51:21.630 --> 00:51:24.600 It's going to generate a neural network with all my inputs, 00:51:24.600 --> 00:51:27.750 my eight hidden layers, or eight hidden units inside my layer, 00:51:27.750 --> 00:51:30.630 and then an output unit, and now what it's doing is it's training. 00:51:30.630 --> 00:51:32.880 It's training 20 times, and each time, you 00:51:32.880 --> 00:51:35.940 can see how my accuracy is increasing on my training data. 00:51:35.940 --> 00:51:38.950 It starts off, the very first time, not very accurate, 00:51:38.950 --> 00:51:42.660 though better than random, something like 79% of the time, 00:51:42.660 --> 00:51:45.730 it's able to accurately classify one bill from another. 00:51:45.730 --> 00:51:49.350 But as I keep training, notice this accuracy value improves and improves 00:51:49.350 --> 00:51:52.590 and improves, until after I've trained through all of the data points 00:51:52.590 --> 00:51:59.220 20 times, it looks like my accuracy is above 99% on the training data. 00:51:59.220 --> 00:52:02.530 And here's where I tested it on a whole bunch of testing data. 00:52:02.530 --> 00:52:07.170 And it looks like in this case, I was also like 99.8% accurate. 00:52:07.170 --> 00:52:09.970 So just using that, I was able to generate a neural network that 00:52:09.970 --> 00:52:12.490 can detect counterfeit bills from authentic bills 00:52:12.490 --> 00:52:16.030 based on this input data 99.8% of the time, at least 00:52:16.030 --> 00:52:17.700 based on this particular testing data. 00:52:17.700 --> 00:52:19.450 And I might want to test it with more data 00:52:19.450 --> 00:52:21.890 as well, just to be confident about that. 00:52:21.890 --> 00:52:24.743 But this is really the value of using a machine learning library 00:52:24.743 --> 00:52:27.160 like TensorFlow, and there are others available for Python 00:52:27.160 --> 00:52:30.040 and other languages as well, but all I have to do 00:52:30.040 --> 00:52:33.400 is define the structure of the network and define the data 00:52:33.400 --> 00:52:36.120 that I'm going to pass into the network, and then 00:52:36.120 --> 00:52:38.560 TensorFlow runs the backpropagation algorithm 00:52:38.560 --> 00:52:40.780 for learning what all of those weights should be, 00:52:40.780 --> 00:52:44.410 for figuring out how to train this neural network to be able to, 00:52:44.410 --> 00:52:48.070 as accurately as possible, figure out what the output values should 00:52:48.070 --> 00:52:50.610 be there as well.
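Condensing that walkthrough into code, a sketch of the kind of model described might look like the following. The stand-in data, the variable names, and the choice of the "adam" optimizer and "binary_crossentropy" loss are assumptions for illustration, not necessarily what banknotes.py itself uses:

```python
import numpy as np
import tensorflow as tf

# Stand-in data shaped like the banknote features: four numeric values per
# sample and a 0/1 label (authentic vs. counterfeit). In the real program
# these would come from reading the CSV file and splitting it into
# training and testing sets.
rng = np.random.default_rng(0)
X_training, y_training = rng.normal(size=(1000, 4)), rng.integers(0, 2, size=1000)
X_testing, y_testing = rng.normal(size=(300, 4)), rng.integers(0, 2, size=300)

# Sequential network: one densely connected hidden layer of 8 units,
# then a single sigmoid output unit giving a probability of "counterfeit"
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Pick an optimizer and a loss function, and report accuracy while training
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for 20 passes (epochs) over the training data, then evaluate on the test set
model.fit(X_training, y_training, epochs=20)
model.evaluate(X_testing, y_testing, verbose=2)
```

With this random stand-in data the accuracy will hover around chance; with the real banknote data it climbs over the 20 epochs as described above.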
00:52:50.610 --> 00:52:55.130 And so this then was a look at what it is that neural networks can do, just 00:52:55.130 --> 00:52:58.380 using these sequences of layer after layer after layer, 00:52:58.380 --> 00:53:01.970 and you can begin to imagine applying these to much more general problems. 00:53:01.970 --> 00:53:05.690 And one big problem in computing, and artificial intelligence more generally, 00:53:05.690 --> 00:53:08.000 is the problem of computer vision. 00:53:08.000 --> 00:53:10.580 Computer vision is all about computational methods 00:53:10.580 --> 00:53:14.313 for analyzing and understanding images, that you might have pictures 00:53:14.313 --> 00:53:16.730 that you want the computer to figure out how to deal with, 00:53:16.730 --> 00:53:19.910 how to process those images, and figure out how to produce 00:53:19.910 --> 00:53:21.710 some sort of useful result out of this. 00:53:21.710 --> 00:53:24.140 You've seen this in the context of social media websites 00:53:24.140 --> 00:53:27.093 that are able to look at a photo that contains a whole bunch of faces, 00:53:27.093 --> 00:53:29.260 and it's able to figure out what's a picture of whom 00:53:29.260 --> 00:53:32.060 and label those and tag them with appropriate people. 00:53:32.060 --> 00:53:34.130 This is becoming increasingly relevant as we 00:53:34.130 --> 00:53:36.600 begin to discuss self-driving cars. 00:53:36.600 --> 00:53:38.360 These cars now have cameras, and we would 00:53:38.360 --> 00:53:40.940 like for the computer to have some sort of algorithm that 00:53:40.940 --> 00:53:43.490 looks at the images and figures out, what 00:53:43.490 --> 00:53:47.940 color is the light, what cars are around us and in what direction, for example. 00:53:47.940 --> 00:53:50.810 And so computer vision is all about taking an image 00:53:50.810 --> 00:53:53.000 and figuring out what sort of computation-- 00:53:53.000 --> 00:53:55.640 what sort of calculation-- we can do with that image. 00:53:55.640 --> 00:53:59.480 It's also relevant in the context of something like handwriting recognition. 00:53:59.480 --> 00:54:02.540 This, what you're looking at, is an example of the MNIST dataset-- 00:54:02.540 --> 00:54:04.700 it's a big dataset just of handwritten digits-- 00:54:04.700 --> 00:54:08.840 that we could use to, ideally, try and figure out how to predict, 00:54:08.840 --> 00:54:12.380 given someone's handwriting, given a photo of a digit that they have drawn, 00:54:12.380 --> 00:54:17.180 can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example. 00:54:17.180 --> 00:54:19.850 So this sort of handwriting recognition is yet another task 00:54:19.850 --> 00:54:23.300 that we might want to use computer vision tasks and tools to be 00:54:23.300 --> 00:54:24.480 able to apply it towards. 00:54:24.480 --> 00:54:27.470 This might be a task that we might care about. 00:54:27.470 --> 00:54:30.140 So how then can we use neural networks to be 00:54:30.140 --> 00:54:31.850 able to solve a problem like this? 00:54:31.850 --> 00:54:34.340 Well, neural networks rely upon some sort of input, 00:54:34.340 --> 00:54:36.350 where that input is just numerical data. 00:54:36.350 --> 00:54:38.630 We have a whole bunch of units, where each one of them 00:54:38.630 --> 00:54:40.820 just represents some sort of number. 
00:54:40.820 --> 00:54:43.670 And so in the context of something like handwriting recognition, 00:54:43.670 --> 00:54:45.920 or in the context of just an image, you might 00:54:45.920 --> 00:54:50.240 imagine that an image is really just a grid of pixels, a grid of dots, 00:54:50.240 --> 00:54:53.660 where each dot has some sort of color, and in the context 00:54:53.660 --> 00:54:55.520 of something like handwriting recognition, 00:54:55.520 --> 00:54:57.478 you might imagine that if you just fill in each 00:54:57.478 --> 00:55:00.740 of these dots in a particular way, you can generate a 2 or an 8, 00:55:00.740 --> 00:55:05.420 for example, based on which dots happen to be shaded in and which dots are not. 00:55:05.420 --> 00:55:09.140 And we can represent each of these pixel values just using numbers. 00:55:09.140 --> 00:55:14.220 So for a particular pixel, for example, 0 might represent entirely black. 00:55:14.220 --> 00:55:16.060 Depending on how you're representing color, 00:55:16.060 --> 00:55:20.740 it's often common to represent color values on a 0-to-255 range, 00:55:20.740 --> 00:55:24.890 so that you can represent a color using eight bits for a particular value, 00:55:24.890 --> 00:55:27.240 like how much white is in the image? 00:55:27.240 --> 00:55:32.180 So 0 might represent all black, 255 might represent entirely white 00:55:32.180 --> 00:55:35.870 as a pixel, and somewhere in between might represent some shade of gray, 00:55:35.870 --> 00:55:36.890 for example. 00:55:36.890 --> 00:55:40.250 But you might imagine not just having a single slider that determines how much 00:55:40.250 --> 00:55:42.920 white is in the image, but if you had a color image, 00:55:42.920 --> 00:55:45.870 you might imagine three different numerical values-- a red, green, 00:55:45.870 --> 00:55:46.820 and blue value-- 00:55:46.820 --> 00:55:49.490 where the red value controls how much red is in the image, 00:55:49.490 --> 00:55:52.520 we have one value for controlling how much green is in the pixel, 00:55:52.520 --> 00:55:55.290 and one value for how much blue is in the pixel as well. 00:55:55.290 --> 00:55:58.970 And depending on how it is that you set these values of red, green, and blue, 00:55:58.970 --> 00:56:00.840 you can get a different color. 00:56:00.840 --> 00:56:04.460 And so any pixel can really be represented in this case 00:56:04.460 --> 00:56:06.050 by three numerical values-- 00:56:06.050 --> 00:56:09.510 a red value, a green value, and a blue value. 00:56:09.510 --> 00:56:11.450 And if you take a whole bunch of these pixels, 00:56:11.450 --> 00:56:15.230 assemble them together inside of a grid of pixels, then 00:56:15.230 --> 00:56:17.760 you really just have a whole bunch of numerical values 00:56:17.760 --> 00:56:21.863 that you can use in order to perform some sort of prediction task. 00:56:21.863 --> 00:56:24.530 And so what you might imagine doing is using the same techniques 00:56:24.530 --> 00:56:25.790 we talked about before. 00:56:25.790 --> 00:56:30.890 Just design a neural network with a lot of inputs, that for each of the pixels, 00:56:30.890 --> 00:56:34.070 we might have one or three different inputs in the case of a color image-- 00:56:34.070 --> 00:56:38.240 a different input-- that is just connected to a deep neural network, 00:56:38.240 --> 00:56:38.830 for example. 
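As a quick sketch of what those numbers look like in code, here is an example using the Pillow and NumPy libraries; the filename is a placeholder for any image you have on hand:

```python
import numpy as np
from PIL import Image

# Open an image and view it as a grid of numbers (placeholder filename)
image = Image.open("image.png").convert("RGB")
pixels = np.array(image)

print(pixels.shape)   # (height, width, 3): a red, green, and blue value per pixel
print(pixels.dtype)   # uint8: each channel is an 8-bit value from 0 to 255
print(pixels[0, 0])   # the top-left pixel, e.g. [0 0 0] for black, [255 255 255] for white

# Flattening the grid into one long vector is what a plain network of
# input units would receive
print(pixels.reshape(-1).shape)
```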
00:56:38.830 --> 00:56:40.880 And this deep neural network might take all 00:56:40.880 --> 00:56:45.700 of the pixels inside of the image of what digit a person drew, 00:56:45.700 --> 00:56:49.910 and the output might be like 10 neurons that classify it as a 0 or a 1 00:56:49.910 --> 00:56:55.620 or 2 or 3, or just tells us in some way what that digit happens to be. 00:56:55.620 --> 00:56:57.910 Now there are a couple of drawbacks to this approach. 00:56:57.910 --> 00:57:01.540 The first drawback to the approach is just the size of this input array, 00:57:01.540 --> 00:57:03.422 that we have a whole bunch of inputs. 00:57:03.422 --> 00:57:05.880 If we have a big image, that is a lot of different channels 00:57:05.880 --> 00:57:08.790 we're looking at-- a lot of inputs, and therefore, a lot of weights 00:57:08.790 --> 00:57:10.690 that we have to calculate. 00:57:10.690 --> 00:57:14.420 And a second problem is the fact that by flattening everything 00:57:14.420 --> 00:57:16.760 into just the structure of all the pixels, 00:57:16.760 --> 00:57:20.720 we've lost access to a lot of the information about the structure 00:57:20.720 --> 00:57:22.670 of the image that's relevant, that really, 00:57:22.670 --> 00:57:25.040 when a person looks at an image, they're looking 00:57:25.040 --> 00:57:26.667 at particular features of that image. 00:57:26.667 --> 00:57:27.750 They're looking at curves. 00:57:27.750 --> 00:57:28.610 They're looking at shapes. 00:57:28.610 --> 00:57:30.470 They're looking at what things can you identify 00:57:30.470 --> 00:57:33.387 in different regions of the image, and maybe put those things together 00:57:33.387 --> 00:57:36.950 in order to get a better picture of what the overall image was about. 00:57:36.950 --> 00:57:40.940 And by just turning it into a pixel values for each of the pixels, 00:57:40.940 --> 00:57:43.230 sure, you might be able to learn that structure, 00:57:43.230 --> 00:57:45.360 but it might be challenging in order to do so. 00:57:45.360 --> 00:57:48.890 It might be helpful to take advantage of the fact that you can use properties 00:57:48.890 --> 00:57:52.190 of the image itself-- the fact that it's structured in a particular way-- 00:57:52.190 --> 00:57:56.150 to be able to improve the way that we learn based on that image too. 00:57:56.150 --> 00:57:59.210 So in order to figure out how we can train our neural networks to better 00:57:59.210 --> 00:58:02.510 be able to deal with images, we'll introduce a couple of ideas-- 00:58:02.510 --> 00:58:06.350 a couple of algorithms-- that we can apply that allow us to take the images 00:58:06.350 --> 00:58:09.630 and extract some useful information out of that image. 00:58:09.630 --> 00:58:13.430 And the first idea we'll introduce is the notion of image convolution. 00:58:13.430 --> 00:58:16.940 And what an image convolution is all about is it's about filtering an image, 00:58:16.940 --> 00:58:20.330 sort of extracting useful or relevant features out of the image. 00:58:20.330 --> 00:58:25.220 And the way we do that is by applying a particular filter that basically adds 00:58:25.220 --> 00:58:28.700 the value for every pixel with the values for all of the neighboring 00:58:28.700 --> 00:58:29.780 pixels to it. 00:58:29.780 --> 00:58:32.750 According to some sort of kernel matrix, which we'll see in a moment, 00:58:32.750 --> 00:58:36.390 it's going to allow us to weight these pixels in various different ways. 
00:58:36.390 --> 00:58:38.300 And the goal of image convolution then is 00:58:38.300 --> 00:58:41.720 to extract some sort of interesting or useful features out of an image, 00:58:41.720 --> 00:58:45.080 to be able to take a pixel, and based on its neighboring pixels, 00:58:45.080 --> 00:58:48.260 maybe predict some sort of valuable information, something 00:58:48.260 --> 00:58:50.870 like taking a pixel and looking at its neighboring pixels, 00:58:50.870 --> 00:58:52.310 you might be able to predict whether or not 00:58:52.310 --> 00:58:54.143 there's some sort of curve inside the image, 00:58:54.143 --> 00:58:57.200 or whether it's forming the outline of a particular line or a shape, 00:58:57.200 --> 00:59:00.050 for example, and that might be useful if you're 00:59:00.050 --> 00:59:02.600 trying to use all of these various different features 00:59:02.600 --> 00:59:06.840 to combine them to say something meaningful about an image as a whole. 00:59:06.840 --> 00:59:08.840 So how then does image convolution work? 00:59:08.840 --> 00:59:11.870 Well, we start with a kernel matrix, and the kernel matrix 00:59:11.870 --> 00:59:13.160 looks something like this. 00:59:13.160 --> 00:59:15.260 And the idea of this is that given a pixel-- 00:59:15.260 --> 00:59:16.820 that would be the middle pixel-- 00:59:16.820 --> 00:59:21.200 we're going to multiply each of the neighboring pixels by these values 00:59:21.200 --> 00:59:25.362 in order to get some sort of result by summing up all of the numbers together. 00:59:25.362 --> 00:59:28.070 So if I take this kernel, which you can think of is like a filter 00:59:28.070 --> 00:59:30.020 that I'm going to apply to the image. 00:59:30.020 --> 00:59:32.090 And let's say that I take this image. 00:59:32.090 --> 00:59:33.800 This is a four-by-four image. 00:59:33.800 --> 00:59:37.250 We'll think of it as just a black and white image, where each one is just 00:59:37.250 --> 00:59:41.550 a single pixel value, so somewhere between 0 and 255, for example. 00:59:41.550 --> 00:59:44.450 So we have a whole bunch of individual pixel values like this, 00:59:44.450 --> 00:59:47.450 and what I'd like to do is apply this kernel-- 00:59:47.450 --> 00:59:49.280 this filter, so to speak-- 00:59:49.280 --> 00:59:50.485 to this image. 00:59:50.485 --> 00:59:53.360 And the way I'll do that is, all right, the kernel is three-by-three. 00:59:53.360 --> 00:59:56.940 So you can imagine a five-by-five kernel or a larger kernel too. 00:59:56.940 --> 01:00:01.460 And I'll take it and just first apply it to the first three-by-three section 01:00:01.460 --> 01:00:02.480 of the image. 01:00:02.480 --> 01:00:05.270 And what I'll do is I'll take each of these pixel values 01:00:05.270 --> 01:00:08.930 and multiply it by its corresponding value in the filter matrix 01:00:08.930 --> 01:00:11.970 and add all of the results together. 01:00:11.970 --> 01:00:19.040 So here, for example, I'll say 10 times 0, plus 20, times negative 1, plus 30, 01:00:19.040 --> 01:00:22.408 times 0, so on and so forth, doing all of this calculation. 01:00:22.408 --> 01:00:24.200 And at the end, if I take all these values, 01:00:24.200 --> 01:00:26.990 multiply them by their corresponding value in the kernel, 01:00:26.990 --> 01:00:30.410 add the results together, for this particular set of nine pixels, 01:00:30.410 --> 01:00:33.540 I get the value of 10 for example. 01:00:33.540 --> 01:00:38.600 And then what I'll do is I'll slide this three-by-three grid effectively over. 
01:00:38.600 --> 01:00:43.220 Slide the kernel by one to look at the next three-by-three section. 01:00:43.220 --> 01:00:45.330 And here I'm just sliding it over by one pixel, 01:00:45.330 --> 01:00:46.970 but you might imagine a different slide length, 01:00:46.970 --> 01:00:49.760 or maybe I jump by multiple pixels at a time if you really wanted to. 01:00:49.760 --> 01:00:51.110 You have different options here. 01:00:51.110 --> 01:00:54.650 But here I'm just sliding over, looking at the next three-by-three section. 01:00:54.650 --> 01:00:59.450 And I'll do the same math 20 times 0, plus 30, times a negative 1, plus 40, 01:00:59.450 --> 01:01:03.950 times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5. 01:01:03.950 --> 01:01:05.990 And what I end up getting is the number 20. 01:01:05.990 --> 01:01:09.260 Then you can imagine shifting over to this one, doing the same thing, 01:01:09.260 --> 01:01:11.510 calculating like the number 40, for example, 01:01:11.510 --> 01:01:15.670 and then doing the same thing here and calculating a value there as well. 01:01:15.670 --> 01:01:19.350 And so what we have now is what we'll call a feature map. 01:01:19.350 --> 01:01:22.340 We have taken this kernel, applied it to each 01:01:22.340 --> 01:01:25.040 of these various different regions, and what we get 01:01:25.040 --> 01:01:29.505 is some representation of a filtered version of that image. 01:01:29.505 --> 01:01:32.630 And so to give a more concrete example of why it is that this kind of thing 01:01:32.630 --> 01:01:35.360 could be useful, let's take this kernel matrix, 01:01:35.360 --> 01:01:39.080 for example, which is quite a famous one, that has an 8 in the middle 01:01:39.080 --> 01:01:42.380 and then all of the neighboring pixels that get a negative 1. 01:01:42.380 --> 01:01:44.420 And let's imagine we wanted to apply that 01:01:44.420 --> 01:01:48.020 to a three-by-three part of an image that looks like this, 01:01:48.020 --> 01:01:50.160 where all the values are the same. 01:01:50.160 --> 01:01:52.310 They're all 20, for instance. 01:01:52.310 --> 01:01:56.240 Well, in this case, if you do 20 times 8, and then subtract 20, 01:01:56.240 --> 01:01:58.910 subtract 20, subtract 20, for each of the eight neighbors, 01:01:58.910 --> 01:02:02.130 well, the result of that is you just get that expression, 01:02:02.130 --> 01:02:03.440 which comes out to be 0. 01:02:03.440 --> 01:02:07.250 You multiply 20 by 8, but then you subtracted 20 eight times 01:02:07.250 --> 01:02:08.960 according to that particular kernel. 01:02:08.960 --> 01:02:11.150 The result of all of that is just 0. 01:02:11.150 --> 01:02:15.170 So the takeaway here is that when a lot of the pixels are the same value, 01:02:15.170 --> 01:02:18.050 we end up getting a value close to 0. 01:02:18.050 --> 01:02:21.440 If, though, we had something like this, 20s along this first row, 01:02:21.440 --> 01:02:24.470 then 50s in the second row, and 50s in the third row, well, 01:02:24.470 --> 01:02:26.530 then when you do this same kind of math-- 01:02:26.530 --> 01:02:29.930 20 times negative 1, 20 times negative 1, so on and so forth-- 01:02:29.930 --> 01:02:34.530 then I get a higher value-- a value like 90, in this particular case.
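Here is a NumPy sketch of that sliding computation. The kernel shown (0s in the corners, negative 1s on the sides, 5 in the middle) and the 4-by-4 pixel values are assumptions chosen to be consistent with the arithmetic quoted above, reproducing the 10, 20, and 40 from the walkthrough:

```python
import numpy as np

# Example 4x4 grayscale image and 3x3 kernel (values assumed for illustration,
# chosen to match the arithmetic in the walkthrough above)
image = np.array([
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [20, 30, 40, 50],
])
kernel = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
])

def convolve(image, kernel):
    """Slide the kernel over the image one pixel at a time and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]           # the 3x3 patch under the kernel
            feature_map[i, j] = np.sum(region * kernel)  # multiply element-wise, then add up
    return feature_map

print(convolve(image, kernel))   # [[10 20]
                                 #  [40 50]]
```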
01:02:34.530 --> 01:02:37.520 And so the more general idea here is that 01:02:37.520 --> 01:02:40.520 by applying this kernel, negative 1s, 8 in the middle, 01:02:40.520 --> 01:02:45.800 and then negative 1s, what I get is when this middle value is very 01:02:45.800 --> 01:02:47.960 different from the neighboring values-- 01:02:47.960 --> 01:02:50.240 like 50 is greater than these 20s-- 01:02:50.240 --> 01:02:53.150 then you'll end up with a value higher than 0. 01:02:53.150 --> 01:02:55.490 Like if this number is higher than its neighbors, 01:02:55.490 --> 01:02:59.240 you end up getting a bigger output, but if this value is the same as all 01:02:59.240 --> 01:03:02.660 of its neighbors, then you get a lower output, something like 0. 01:03:02.660 --> 01:03:04.580 And it turns out that this sort of filter 01:03:04.580 --> 01:03:08.440 can therefore be used in something like detecting edges in an image, 01:03:08.440 --> 01:03:11.870 or want to detect like the boundaries between various different objects 01:03:11.870 --> 01:03:12.890 inside of an image. 01:03:12.890 --> 01:03:15.950 I might use a filter like this, which is able to tell 01:03:15.950 --> 01:03:19.970 whether the value of this pixel is different from the values 01:03:19.970 --> 01:03:23.630 of the neighboring pixel-- if it's like greater than the values of the pixels 01:03:23.630 --> 01:03:25.390 that happened to surround it. 01:03:25.390 --> 01:03:28.250 And so we can use this in terms of image filtering. 01:03:28.250 --> 01:03:30.290 And so I'll show you an example of that. 01:03:30.290 --> 01:03:38.150 I have here, in filter.py, a file that uses Python's image library, or PIL, 01:03:38.150 --> 01:03:40.160 to do some image filtering. 01:03:40.160 --> 01:03:41.840 I go ahead and open an image. 01:03:41.840 --> 01:03:45.102 And then all I'm going to do is apply a kernel to that image. 01:03:45.102 --> 01:03:47.810 It's going to be a three-by-three kernel, the same kind of kernel 01:03:47.810 --> 01:03:49.390 we saw before. 01:03:49.390 --> 01:03:50.790 And here is the kernel. 01:03:50.790 --> 01:03:53.312 This is just a list representation of the same matrix 01:03:53.312 --> 01:03:55.020 that I showed you a moment ago, with it's 01:03:55.020 --> 01:03:56.900 negative 1, negative 1, negative 1. 01:03:56.900 --> 01:03:59.750 The second row is negative 1, 8, negative 1. 01:03:59.750 --> 01:04:01.880 The third row is all negative 1s. 01:04:01.880 --> 01:04:06.670 And then at the end, I'm going to go ahead and show the filtered image. 01:04:06.670 --> 01:04:12.340 So if, for example, I go into convolution directory 01:04:12.340 --> 01:04:15.300 and I open up an image like bridge.png, this 01:04:15.300 --> 01:04:21.270 is what an input image might look like, just an image of a bridge over a river. 01:04:21.270 --> 01:04:26.360 Now I'm going to go ahead and run this filter program on the bridge. 01:04:26.360 --> 01:04:28.820 And what I get is this image here. 01:04:28.820 --> 01:04:32.000 Just by taking the original image and applying that filter 01:04:32.000 --> 01:04:35.000 to each three-by-three grid, I've extracted 01:04:35.000 --> 01:04:38.390 all of the boundaries, all of the edges inside the image that separate 01:04:38.390 --> 01:04:40.110 one part of the image from another. 01:04:40.110 --> 01:04:42.740 So here I've got a representation of boundaries 01:04:42.740 --> 01:04:45.040 between particular parts of the image. 
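The contents of filter.py aren't shown in full here, but a minimal version of what's being described, using Pillow's built-in ImageFilter.Kernel, might look like this:

```python
import sys
from PIL import Image, ImageFilter

# Open the image given on the command line (e.g. bridge.png)
image = Image.open(sys.argv[1]).convert("RGB")

# Apply the edge-detection kernel described above: negative 1s everywhere, 8 in the middle
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1   # leave the weighted sum as-is rather than rescaling it
))

# Show the filtered image, with the edges highlighted
filtered.show()
```

Run as, for example, `python filter.py bridge.png` to produce the edge-highlighted output described above.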
01:04:45.040 --> 01:04:47.600 And you might imagine that if a machine learning algorithm is 01:04:47.600 --> 01:04:50.780 trying to learn like what an image is of, a filter like this 01:04:50.780 --> 01:04:51.860 could be pretty useful. 01:04:51.860 --> 01:04:55.400 Maybe the machine learning algorithm doesn't care about all 01:04:55.400 --> 01:04:57.200 of the details of the image. 01:04:57.200 --> 01:04:59.210 It just cares about certain useful features. 01:04:59.210 --> 01:05:01.370 It cares about particular shapes that are 01:05:01.370 --> 01:05:04.020 able to help it determine that based on the image, 01:05:04.020 --> 01:05:06.540 this is going to be a bridge, for example. 01:05:06.540 --> 01:05:08.840 And so this type of idea of image convolution 01:05:08.840 --> 01:05:11.570 can allow us to apply filters to images that 01:05:11.570 --> 01:05:15.970 allow us to extract useful results out of those images-- taking an image 01:05:15.970 --> 01:05:18.640 and extracting its edges, for example. 01:05:18.640 --> 01:05:20.480 You might imagine many other filters that 01:05:20.480 --> 01:05:23.820 could be applied to an image that are able to extract particular values as 01:05:23.820 --> 01:05:24.320 well. 01:05:24.320 --> 01:05:27.620 And a filter might have separate kernels for the red values, the green values, 01:05:27.620 --> 01:05:30.140 and the blue values that are all summed together at the end, 01:05:30.140 --> 01:05:32.750 such that you could have particular filters looking for, 01:05:32.750 --> 01:05:34.457 is there red in this part of the image? 01:05:34.457 --> 01:05:36.290 Are there green in other parts of the image? 01:05:36.290 --> 01:05:39.800 You can begin to assemble these relevant and useful filters that are 01:05:39.800 --> 01:05:43.050 able to do these calculations as well. 01:05:43.050 --> 01:05:45.990 So that then was the idea of image convolution-- applying 01:05:45.990 --> 01:05:48.990 some sort of filter to an image to be able to extract 01:05:48.990 --> 01:05:51.480 some useful features out of that image. 01:05:51.480 --> 01:05:54.600 But all the while, these images are still pretty big. 01:05:54.600 --> 01:05:56.730 There's a lot of pixels involved in the image. 01:05:56.730 --> 01:05:59.310 And realistically speaking, if you've got a really big image, 01:05:59.310 --> 01:06:01.030 that poses a couple of problems. 01:06:01.030 --> 01:06:03.810 One, it means a lot of input going into the neural network, 01:06:03.810 --> 01:06:07.050 but two, it also means that we really have 01:06:07.050 --> 01:06:11.715 to care about what's in each particular pixel, whereas realistically we often, 01:06:11.715 --> 01:06:13.590 if you're looking at an image, you don't care 01:06:13.590 --> 01:06:16.030 whether it's something is in one particular pixel 01:06:16.030 --> 01:06:18.030 versus the pixel immediately to the right of it. 01:06:18.030 --> 01:06:19.598 They're pretty close together. 01:06:19.598 --> 01:06:21.390 You really just care about whether there is 01:06:21.390 --> 01:06:24.450 a particular feature in some region of the image, 01:06:24.450 --> 01:06:28.300 and maybe you don't care about exactly which pixel it happens to be. 01:06:28.300 --> 01:06:30.660 And so there's a technique we can use known as pooling. 01:06:30.660 --> 01:06:34.650 And what pooling is, is it means reducing the size of an input 01:06:34.650 --> 01:06:37.340 by sampling from regions inside of the input. 
01:06:37.340 --> 01:06:40.890 So we're going to take a big image and turn it into a smaller image 01:06:40.890 --> 01:06:41.880 by using pooling. 01:06:41.880 --> 01:06:44.550 And in particular, one of the most popular types of pooling 01:06:44.550 --> 01:06:45.870 is called max-pooling. 01:06:45.870 --> 01:06:50.550 And what max-pooling does is it pools just by choosing the maximum value 01:06:50.550 --> 01:06:52.390 in a particular region. 01:06:52.390 --> 01:06:55.470 So, for example, let's imagine I had this four-by-four image, 01:06:55.470 --> 01:06:57.360 but I wanted to reduce its dimensions. 01:06:57.360 --> 01:07:01.310 I wanted to make a smaller image, so that I have fewer inputs to work with. 01:07:01.310 --> 01:07:05.070 Well, what I could do is I could apply a two-by-two max 01:07:05.070 --> 01:07:07.410 pool, where the idea would be that I'm going 01:07:07.410 --> 01:07:09.990 to first look at this two-by-two region and say, what 01:07:09.990 --> 01:07:11.940 is the maximum value in that region? 01:07:11.940 --> 01:07:13.290 Well, it's the number 50. 01:07:13.290 --> 01:07:15.353 So we'll go ahead and just use the number 50. 01:07:15.353 --> 01:07:17.270 And then we'll look at this two-by-two region. 01:07:17.270 --> 01:07:18.940 What is the maximum value here? 01:07:18.940 --> 01:07:19.740 110. 01:07:19.740 --> 01:07:21.210 So that's going to be my value. 01:07:21.210 --> 01:07:23.420 Likewise here, the maximum value looks like 20. 01:07:23.420 --> 01:07:24.710 Go ahead and put that there. 01:07:24.710 --> 01:07:27.030 Then for this last region, the maximum value 01:07:27.030 --> 01:07:29.510 was 40, so we'll go ahead and use that. 01:07:29.510 --> 01:07:33.290 And what I have now is a smaller representation 01:07:33.290 --> 01:07:36.260 of this same original image that I obtained just 01:07:36.260 --> 01:07:40.680 by picking the maximum value from each of these regions. 01:07:40.680 --> 01:07:43.880 So again, the advantages here are now I only 01:07:43.880 --> 01:07:46.730 have to deal with a two-by-two input instead of a four-by-four, 01:07:46.730 --> 01:07:49.910 and you can imagine shrinking the size of an image even more. 01:07:49.910 --> 01:07:52.880 But in addition to that, I'm now able to make 01:07:52.880 --> 01:07:57.500 my analysis independent of whether a particular value was 01:07:57.500 --> 01:07:59.030 in this pixel or this pixel. 01:07:59.030 --> 01:08:01.490 I don't care if the 50 was here or here. 01:08:01.490 --> 01:08:03.980 As long as it was generally in this region, 01:08:03.980 --> 01:08:06.000 I'll still get access to that value. 01:08:06.000 --> 01:08:10.190 So it makes our algorithms a little bit more robust as well. 01:08:10.190 --> 01:08:11.750 So that then is pooling-- 01:08:11.750 --> 01:08:13.940 taking the size of the image and reducing it 01:08:13.940 --> 01:08:18.390 a little bit by just sampling from particular regions inside of the image. 01:08:18.390 --> 01:08:22.310 And now we can put all of these ideas together-- pooling, image convolution, 01:08:22.310 --> 01:08:26.060 neural networks-- all together into another type of neural network called 01:08:26.060 --> 01:08:30.500 a convolutional neural network, or a CNN, which is a neural network that 01:08:30.500 --> 01:08:35.479 uses this convolution step, usually in the context of analyzing an image, 01:08:35.479 --> 01:08:36.752 for example.
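As a quick illustration of that two-by-two max-pooling step, here is a small hand-rolled sketch in plain Python; the grid values are made up, chosen only so that the four regional maxima come out to the 50, 110, 20, and 40 from the example above.

# 2x2 max-pooling over a 4x4 grid of pixel values
image = [
    [10, 20, 30, 110],
    [50, 15, 25,  40],
    [20, 10, 40,  15],
    [ 5, 20, 30,  20],
]

pooled = []
for i in range(0, len(image), 2):          # step down two rows at a time
    row = []
    for j in range(0, len(image[i]), 2):   # step across two columns at a time
        region = [image[i][j],     image[i][j + 1],
                  image[i + 1][j], image[i + 1][j + 1]]
        row.append(max(region))            # keep only the maximum value in the region
    pooled.append(row)

print(pooled)  # [[50, 110], [20, 40]] -- one value per 2x2 region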
01:08:36.752 --> 01:08:39.710 And so the way that a convolutional neural network works is that we 01:08:39.710 --> 01:08:43.189 start with some sort of input image-- some grid of pixels-- 01:08:43.189 --> 01:08:46.580 but rather than immediately put that into the neural network layers 01:08:46.580 --> 01:08:50.120 that we've seen before, we'll start by applying a convolution step, where 01:08:50.120 --> 01:08:54.170 the convolution step involves applying a number of different image filters 01:08:54.170 --> 01:08:56.689 to our original image in order to get what 01:08:56.689 --> 01:09:00.750 we call a feature map, the result of applying some filter to an image. 01:09:00.750 --> 01:09:02.750 And we could do this once, but in general, we'll 01:09:02.750 --> 01:09:06.020 do this multiple times getting a whole bunch of different feature 01:09:06.020 --> 01:09:09.859 maps, each of which might extract some different relevant feature out 01:09:09.859 --> 01:09:12.710 of the image, some different important characteristic of the image 01:09:12.710 --> 01:09:16.760 that we might care about using in order to calculate what the result should be. 01:09:16.760 --> 01:09:19.790 And in the same way that when we train neural networks, 01:09:19.790 --> 01:09:23.270 we can train neural networks to learn the weights between particular units 01:09:23.270 --> 01:09:24.770 inside of the neural networks, 01:09:24.770 --> 01:09:28.160 we can also train neural networks to learn what those filters should be-- 01:09:28.160 --> 01:09:30.170 what the values of the filters should be-- 01:09:30.170 --> 01:09:33.620 in order to get the most useful, most relevant information out 01:09:33.620 --> 01:09:37.069 of the original image just by figuring out what setting of those filter 01:09:37.069 --> 01:09:39.380 values-- the values inside of that kernel-- 01:09:39.380 --> 01:09:44.060 results in minimizing the loss function and minimizing how poorly 01:09:44.060 --> 01:09:48.200 our hypothesis actually performs in figuring out the classification 01:09:48.200 --> 01:09:50.720 of a particular image, for example. 01:09:50.720 --> 01:09:52.880 So we first apply this convolution step. 01:09:52.880 --> 01:09:55.520 Get a whole bunch of these various different feature maps. 01:09:55.520 --> 01:09:57.450 But these feature maps are quite large. 01:09:57.450 --> 01:10:00.200 There are a lot of pixel values that happen to be here. 01:10:00.200 --> 01:10:03.440 And so a logical next step to take is a pooling step, 01:10:03.440 --> 01:10:06.800 where we reduce the size of these images by using max-pooling, 01:10:06.800 --> 01:10:10.360 for example, extracting the maximum value from any particular region. 01:10:10.360 --> 01:10:12.110 There are other pooling methods that exist 01:10:12.110 --> 01:10:13.610 as well, depending on the situation. 01:10:13.610 --> 01:10:15.800 You could use something like average-pooling, 01:10:15.800 --> 01:10:18.230 where instead of taking the maximum value from a region, 01:10:18.230 --> 01:10:22.010 you take the average value from a region, which has its uses as well. 01:10:22.010 --> 01:10:26.030 But in effect, what pooling will do is it will take these feature maps 01:10:26.030 --> 01:10:28.190 and reduce their dimensions, so that we end up 01:10:28.190 --> 01:10:30.677 with smaller grids with fewer pixels. 01:10:30.677 --> 01:10:33.010 And this then is going to be easier for us to deal with.
01:10:33.010 --> 01:10:35.600 It's going to mean fewer inputs that we have to worry about, 01:10:35.600 --> 01:10:38.900 and it's also going to mean we're more resilient, more robust, 01:10:38.900 --> 01:10:42.510 against potential movements of particular values just by one pixel, 01:10:42.510 --> 01:10:46.280 when ultimately, we really don't care about those one pixel differences that 01:10:46.280 --> 01:10:49.020 might arise in the original image. 01:10:49.020 --> 01:10:52.700 Now after we've done this pooling step, now we have a whole bunch of values 01:10:52.700 --> 01:10:55.260 that we can then flatten out and just put 01:10:55.260 --> 01:10:57.310 into a more traditional neural network. 01:10:57.310 --> 01:10:59.060 So we go ahead and flatten it, and then we 01:10:59.060 --> 01:11:01.010 end up with a traditional neural network that 01:11:01.010 --> 01:11:05.210 has one input for each of these values in each of these resulting feature 01:11:05.210 --> 01:11:10.130 maps after we do the convolution and after we do the pooling step. 01:11:10.130 --> 01:11:13.460 And so this then is the general structure of a convolutional network. 01:11:13.460 --> 01:11:15.980 We begin with the image, apply convolution, 01:11:15.980 --> 01:11:18.800 apply pooling, flatten the results, and then put that 01:11:18.800 --> 01:11:22.190 into a more traditional neural network that might itself have hidden layers. 01:11:22.190 --> 01:11:24.290 You can have deep convolutional networks that 01:11:24.290 --> 01:11:28.490 have hidden layers in between this flattened layer and the eventual output 01:11:28.490 --> 01:11:32.220 to be able to calculate various different features of those values. 01:11:32.220 --> 01:11:36.030 But this then can help us to be able to use convolution and pooling, 01:11:36.030 --> 01:11:38.480 to use our knowledge about the structure of an image, 01:11:38.480 --> 01:11:42.020 to be able to get better results, to be able to train our networks faster 01:11:42.020 --> 01:11:46.080 in order to better capture particular parts of the image. 01:11:46.080 --> 01:11:49.370 And there's no reason necessarily why you can only use these steps once. 01:11:49.370 --> 01:11:53.570 In fact, in practice, you'll often use convolution and pooling multiple times 01:11:53.570 --> 01:11:55.170 in multiple different steps. 01:11:55.170 --> 01:11:58.310 So what you might imagine doing is starting with an image, 01:11:58.310 --> 01:12:00.980 first applying convolution to get a whole bunch of maps, 01:12:00.980 --> 01:12:04.070 then applying pooling, then applying convolution again, 01:12:04.070 --> 01:12:06.760 because these maps are still pretty big. 01:12:06.760 --> 01:12:10.330 You can apply convolution to try and extract relevant features 01:12:10.330 --> 01:12:13.120 out of this result. Then take those results, 01:12:13.120 --> 01:12:16.570 apply pooling in order to reduce their dimensions, and then take that 01:12:16.570 --> 01:12:19.900 and feed it into a neural network that maybe has fewer inputs. 01:12:19.900 --> 01:12:22.810 So here, I have two different convolution and pooling steps. 01:12:22.810 --> 01:12:25.540 I do convolution and pooling once, and then I 01:12:25.540 --> 01:12:29.380 do convolution and pooling a second time, each time extracting 01:12:29.380 --> 01:12:32.200 useful features from the layer before it, each time using 01:12:32.200 --> 01:12:36.010 pooling to reduce the dimensions of what you're ultimately looking at. 
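A hedged sketch of what that repeated convolution-and-pooling structure could look like using TensorFlow's Keras layers; the input shape, filter counts, and layer sizes here are illustrative assumptions rather than values from the lecture.

import tensorflow as tf

# Two rounds of convolution and pooling, then flatten into a traditional network
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Each round shrinks the feature maps:
# 64x64 -> conv -> 62x62 -> pool -> 31x31 -> conv -> 29x29 -> pool -> 14x14
model.summary()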
01:12:36.010 --> 01:12:39.880 And the goal now of this sort of model is that in each of these steps, 01:12:39.880 --> 01:12:43.090 you can begin to learn different types of features 01:12:43.090 --> 01:12:45.430 of the original image, that maybe in the first step 01:12:45.430 --> 01:12:49.180 you learn very low-level features, just learn and look for features like edges 01:12:49.180 --> 01:12:53.770 and curves and shapes, because based on pixels in their neighboring values, 01:12:53.770 --> 01:12:55.937 you can figure out, all right, what are the edges? 01:12:55.937 --> 01:12:56.770 What are the curves? 01:12:56.770 --> 01:12:59.810 What are the various different shapes that might be present there? 01:12:59.810 --> 01:13:02.470 But then once you have a mapping that just represents 01:13:02.470 --> 01:13:04.930 where the edges and curves and shapes happen to be, 01:13:04.930 --> 01:13:07.120 you can imagine applying the same sort of process 01:13:07.120 --> 01:13:10.480 again to begin to look for higher-level features-- look for objects, 01:13:10.480 --> 01:13:13.450 maybe look for people's eyes in facial recognition, 01:13:13.450 --> 01:13:17.020 for example, maybe look at more complex shapes like the curves 01:13:17.020 --> 01:13:20.470 on a particular number if you're trying to recognize a digit in a handwriting 01:13:20.470 --> 01:13:22.375 recognition sort of scenario. 01:13:22.375 --> 01:13:24.250 And then after all of that, now that you have 01:13:24.250 --> 01:13:27.227 these results that represent these higher-level features, 01:13:27.227 --> 01:13:29.560 you can pass them into a neural network, which is really 01:13:29.560 --> 01:13:33.430 just a deep neural network that looks like this, where you might imagine 01:13:33.430 --> 01:13:37.120 making a binary classification, or classifying into multiple categories, 01:13:37.120 --> 01:13:42.130 or performing various different tasks on this sort of model. 01:13:42.130 --> 01:13:45.340 So convolutional neural networks can be quite powerful and quite popular 01:13:45.340 --> 01:13:47.383 when it comes to trying to analyze images. 01:13:47.383 --> 01:13:48.550 We don't strictly need them. 01:13:48.550 --> 01:13:52.780 We could have just used a vanilla neural network that just operates with layer 01:13:52.780 --> 01:13:54.318 after layer as we've seen before. 01:13:54.318 --> 01:13:56.110 But these convolutional neural networks can 01:13:56.110 --> 01:13:58.675 be quite helpful, in particular, because of the way they 01:13:58.675 --> 01:14:00.550 model the way a human might look at an image, 01:14:00.550 --> 01:14:03.040 that instead of a human looking at every single pixel 01:14:03.040 --> 01:14:06.428 simultaneously and trying to involve all of them by multiplying them together, 01:14:06.428 --> 01:14:08.470 you might imagine that what convolution is really 01:14:08.470 --> 01:14:11.860 doing is looking at various different regions of the image 01:14:11.860 --> 01:14:14.770 and extracting relevant information and features out 01:14:14.770 --> 01:14:17.410 of those parts of the image the same way that a human might 01:14:17.410 --> 01:14:20.950 have visual receptors that are looking at particular parts of what they see, 01:14:20.950 --> 01:14:23.440 and using those, combining them, to figure out 01:14:23.440 --> 01:14:28.140 what meaning they can draw from all of those various different inputs. 01:14:28.140 --> 01:14:31.480 And so you might imagine applying this to a situation like handwriting 01:14:31.480 --> 01:14:32.500 recognition. 
01:14:32.500 --> 01:14:35.050 So we'll go ahead and see an example of that now. 01:14:35.050 --> 01:14:37.705 I'll go ahead and open up handwriting.py. 01:14:37.705 --> 01:14:41.800 Again, what we do here is we first import TensorFlow. 01:14:41.800 --> 01:14:45.430 And then, TensorFlow, it turns out, has a few datasets 01:14:45.430 --> 01:14:47.440 that are built in-- built into the library 01:14:47.440 --> 01:14:49.120 that you can just immediately access. 01:14:49.120 --> 01:14:51.910 And one of the most famous datasets in machine learning 01:14:51.910 --> 01:14:55.720 is the MNIST dataset, which is just a dataset of a whole bunch of samples 01:14:55.720 --> 01:14:57.310 of people's handwritten digits. 01:14:57.310 --> 01:14:59.980 I showed you a slide of that a little while ago. 01:14:59.980 --> 01:15:03.010 And what we can do is just immediately access that dataset, 01:15:03.010 --> 01:15:06.520 which is built into the library, so that if I want to do something like train 01:15:06.520 --> 01:15:10.810 on a whole bunch of digits, I can just use the dataset that is provided to me. 01:15:10.810 --> 01:15:14.170 Of course, if I had my own dataset of handwritten images, 01:15:14.170 --> 01:15:15.640 I can apply the same idea. 01:15:15.640 --> 01:15:19.620 I'd first just need to take those images and turn them into an array of pixels, 01:15:19.620 --> 01:15:22.120 because that's the way that these are going to be formatted. 01:15:22.120 --> 01:15:24.037 They're going to be formatted as, effectively, 01:15:24.037 --> 01:15:26.770 an array of individual pixels. 01:15:26.770 --> 01:15:29.330 And now there's a bit of reshaping I need to do, 01:15:29.330 --> 01:15:31.640 just turning the data into a format that I can put 01:15:31.640 --> 01:15:33.360 into my convolutional neural network. 01:15:33.360 --> 01:15:37.970 So this is doing things like taking all the values and dividing them by 255. 01:15:37.970 --> 01:15:41.700 If you remember, these color values tend to range from 0 to 255. 01:15:41.700 --> 01:15:45.110 So I can divide them by 255, just to put them into a 0-to-1 range, 01:15:45.110 --> 01:15:48.320 which might be a little bit easier to train on . 01:15:48.320 --> 01:15:51.140 And then doing various other modifications to the data, just 01:15:51.140 --> 01:15:53.270 to get it into a nice usable format. 01:15:53.270 --> 01:15:55.670 But here's the interesting and important part. 01:15:55.670 --> 01:15:59.920 Here is where I create the convolutional neural network-- the CNN-- 01:15:59.920 --> 01:16:02.970 where here I'm saying, go ahead and use a sequential model. 01:16:02.970 --> 01:16:06.570 And before I could use model.add to say add a layer, add a layer, add a layer, 01:16:06.570 --> 01:16:08.570 another way I could define it is just by passing 01:16:08.570 --> 01:16:12.860 as input to the sequential neural network a list of all of the layers 01:16:12.860 --> 01:16:14.750 that I want. 01:16:14.750 --> 01:16:17.642 And so here, the very first layer in my model 01:16:17.642 --> 01:16:19.350 is a convolutional layer, where I'm first 01:16:19.350 --> 01:16:22.050 going to apply convolution to my image. 01:16:22.050 --> 01:16:26.520 I'm going to use 13 different filters, so my model is going to learn-- 01:16:26.520 --> 01:16:28.680 32, rather-- 32 different filters that I would 01:16:28.680 --> 01:16:31.920 like to learn on the input image, where each filter is 01:16:31.920 --> 01:16:33.950 going to be a three-by-three kernel. 
01:16:33.950 --> 01:16:36.010 So we saw those three-by-three kernels before, 01:16:36.010 --> 01:16:39.270 where we could multiply each value in a three-by-three grid by a value 01:16:39.270 --> 01:16:41.620 in the kernel and add all the results together. 01:16:41.620 --> 01:16:46.300 So here I'm going to learn 32 of these different three-by-three filters. 01:16:46.300 --> 01:16:48.740 I can again specify my activation function. 01:16:48.740 --> 01:16:51.320 And I specify what my input shape is. 01:16:51.320 --> 01:16:53.630 My input shape in the banknotes case was just 4. 01:16:53.630 --> 01:16:55.130 I had four inputs. 01:16:55.130 --> 01:17:00.502 My input shape here is going to be 28, comma, 28, comma 1, because for each 01:17:00.502 --> 01:17:02.210 of these handwritten digits, it turns out 01:17:02.210 --> 01:17:05.060 that the MNIST dataset organizes its data the same way: 01:17:05.060 --> 01:17:07.740 each image is a 28-by-28 pixel grid, 01:17:07.740 --> 01:17:11.690 and each one of those images only 01:17:11.690 --> 01:17:13.387 has one channel value. 01:17:13.387 --> 01:17:15.470 These handwritten digits are just black and white, 01:17:15.470 --> 01:17:17.960 so it's just a single color value representing 01:17:17.960 --> 01:17:19.450 how much black or how much white. 01:17:19.450 --> 01:17:22.700 You might imagine that in a color image, if you were doing this sort of thing, 01:17:22.700 --> 01:17:24.710 you might have three different channels-- a red, 01:17:24.710 --> 01:17:26.600 a green, and a blue channel, for example. 01:17:26.600 --> 01:17:30.020 But in the case of just handwriting recognition and recognizing a digit, 01:17:30.020 --> 01:17:33.640 we're just going to use a single value for shaded-in or not shaded-in, 01:17:33.640 --> 01:17:37.270 and it might range, but it's just a single color value. 01:17:37.270 --> 01:17:40.800 And that then is the very first layer of our neural network, 01:17:40.800 --> 01:17:43.327 a convolutional layer that will take the input 01:17:43.327 --> 01:17:45.160 and learn a whole bunch of different filters 01:17:45.160 --> 01:17:49.356 that we can apply to the input to extract meaningful features. 01:17:49.356 --> 01:17:52.900 The next step is going to be a max-pooling layer, also built 01:17:52.900 --> 01:17:55.060 right into TensorFlow, where this is going 01:17:55.060 --> 01:17:58.840 to be a layer that is going to use a pool size of two by two, 01:17:58.840 --> 01:18:01.830 meaning we're going to look at two-by-two regions inside of the image, 01:18:01.830 --> 01:18:03.910 and just extract the maximum value. 01:18:03.910 --> 01:18:06.050 Again, we've seen why this can be helpful. 01:18:06.050 --> 01:18:09.040 It'll help to reduce the size of our input. 01:18:09.040 --> 01:18:12.130 Once we've done that, we'll go ahead and flatten all of the units just 01:18:12.130 --> 01:18:14.500 into a single layer that we can then pass 01:18:14.500 --> 01:18:16.300 into the rest of the neural network. 01:18:16.300 --> 01:18:18.970 And now, here's the rest of the whole network. 01:18:18.970 --> 01:18:22.790 Here, I'm saying, let's add a hidden layer to my neural network with 128 01:18:22.790 --> 01:18:26.560 units-- so a whole bunch of hidden units inside of the hidden layer-- 01:18:26.560 --> 01:18:30.117 and just to prevent overfitting, I can add a dropout to that-- say, 01:18:30.117 --> 01:18:30.700 you know what?
01:18:30.700 --> 01:18:34.630 When you're training, randomly drop out half from this hidden layer, 01:18:34.630 --> 01:18:38.200 just to make sure we don't become over-reliant on any particular node. 01:18:38.200 --> 01:18:41.560 We begin to really generalize and stop ourselves from overfitting. 01:18:41.560 --> 01:18:44.380 So TensorFlow allows us, just by adding a single line, 01:18:44.380 --> 01:18:47.650 to add dropout into our model as well, such that when it's training, 01:18:47.650 --> 01:18:50.080 it will perform this dropout step in order 01:18:50.080 --> 01:18:54.640 to help make sure that we don't overfit on this particular data. 01:18:54.640 --> 01:18:57.620 And then finally, I add an output layer. 01:18:57.620 --> 01:18:59.980 The output layer is going to have 10 units, one 01:18:59.980 --> 01:19:03.310 for each category, that I would like to classify digits into, 01:19:03.310 --> 01:19:06.230 so 0 through 9, 10 different categories. 01:19:06.230 --> 01:19:08.700 And the activation function I'm going to use here 01:19:08.700 --> 01:19:11.720 is called the softmax activation function. 01:19:11.720 --> 01:19:14.450 And in short, what the softmax activation function is going to do 01:19:14.450 --> 01:19:16.510 is it's going to take the output and turn it 01:19:16.510 --> 01:19:18.440 into a probability distribution. 01:19:18.440 --> 01:19:20.330 So ultimately, it's going to tell me, what 01:19:20.330 --> 01:19:24.910 did we estimate the probability is that this is a 2 versus a 3 versus a 4, 01:19:24.910 --> 01:19:29.180 and so it will turn it into that probability distribution for me. 01:19:29.180 --> 01:19:31.390 Next up, I'll go ahead and compile my model 01:19:31.390 --> 01:19:34.420 and fit it on all of my training data. 01:19:34.420 --> 01:19:38.530 And then I can evaluate how well the neural network performs. 01:19:38.530 --> 01:19:40.540 And then I've added to my Python program, 01:19:40.540 --> 01:19:43.430 if I've provided a command line argument, like the name of a file, 01:19:43.430 --> 01:19:46.300 I'm going to go ahead and save the model to a file. 01:19:46.300 --> 01:19:47.900 And so this can be quite useful too. 01:19:47.900 --> 01:19:49.608 Once you've done the training step, which 01:19:49.608 --> 01:19:51.970 could take some time, in terms of taking all the time-- 01:19:51.970 --> 01:19:55.510 going through the data; running backpropagation with gradient descent; 01:19:55.510 --> 01:19:57.790 to be able to say, all right, how should we adjust 01:19:57.790 --> 01:19:59.540 the weight to this particular model-- 01:19:59.540 --> 01:20:01.600 you end up calculating values for these weights, 01:20:01.600 --> 01:20:03.790 calculating values for these filters, and you'd 01:20:03.790 --> 01:20:06.560 like to remember that information, so you can use it later. 01:20:06.560 --> 01:20:10.223 And so TensorFlow allows us to just save a model to a file, 01:20:10.223 --> 01:20:12.640 such that later if we want to use the model we've learned, 01:20:12.640 --> 01:20:16.030 use the weights that we've learned, to make some sort of new prediction 01:20:16.030 --> 01:20:19.550 we can just use the model that already exists. 01:20:19.550 --> 01:20:22.570 So what we're doing here is after we've done all the calculation, 01:20:22.570 --> 01:20:26.050 we go ahead and save the model to a file, such 01:20:26.050 --> 01:20:28.220 that we can use it a little bit later. 01:20:28.220 --> 01:20:35.837 So for example, if I go into digits, I'm going to run handwriting.py. 
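Before running it, here is a hedged reconstruction of what a handwriting.py along these lines might look like, with the layer sizes following the description above (32 three-by-three filters, two-by-two max-pooling, a 128-unit hidden layer with 0.5 dropout, and a 10-unit softmax output); the details of the actual file may differ.

import sys
import tensorflow as tf

# Use the MNIST handwriting dataset, built into TensorFlow
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel values into the 0-to-1 range and one-hot encode the labels
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

# Reshape so each image is 28x28 with a single channel
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

# Create a convolutional neural network
model = tf.keras.models.Sequential([
    # Convolutional layer: learn 32 filters, each a 3x3 kernel
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Max-pooling layer, using a 2x2 pool size
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the units
    tf.keras.layers.Flatten(),
    # Hidden layer, with dropout to help prevent overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    # Output layer: one unit per digit, softmax for a probability distribution
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Compile, train, and evaluate the model
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)

# Optionally save the learned weights and filters to a file named on the command line
if len(sys.argv) == 2:
    model.save(sys.argv[1])
    print(f"Model saved to {sys.argv[1]}.")

Once saved this way, a program like recognition.py could later call tf.keras.models.load_model on that file and feed a 28-by-28 grid of pixel values into model.predict to get back the probability distribution over digits.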
01:20:35.837 --> 01:20:36.920 I won't save it this time. 01:20:36.920 --> 01:20:39.135 We'll just run it and go ahead and see what happens. 01:20:39.135 --> 01:20:41.260 What will happen is we need to go through the model 01:20:41.260 --> 01:20:44.710 in order to train on all of these samples of handwritten digits. 01:20:44.710 --> 01:20:47.500 So the MNIST dataset gives us thousands and thousands 01:20:47.500 --> 01:20:50.050 of sample handwritten digits in the same format 01:20:50.050 --> 01:20:51.800 that we can use in order to train. 01:20:51.800 --> 01:20:54.363 And so now what you're seeing is this training process, 01:20:54.363 --> 01:20:56.530 and unlike the banknotes case, where there were many, 01:20:56.530 --> 01:20:58.160 many fewer data points-- 01:20:58.160 --> 01:20:59.680 the data was very, very simple-- 01:20:59.680 --> 01:21:03.110 here, the data is more complex, and this training process takes time. 01:21:03.110 --> 01:21:06.040 And so this is another one of those cases that shows why, 01:21:06.040 --> 01:21:09.472 when training neural networks, computational power is 01:21:09.472 --> 01:21:11.680 so important, and why oftentimes you see people wanting 01:21:11.680 --> 01:21:15.070 to use sophisticated GPUs in order to more efficiently be 01:21:15.070 --> 01:21:18.040 able to do this sort of neural network training. 01:21:18.040 --> 01:21:20.870 It also speaks to the reason why more data can be helpful. 01:21:20.870 --> 01:21:23.260 The more sample data points you have, the better 01:21:23.260 --> 01:21:25.040 you can begin to do this training. 01:21:25.040 --> 01:21:28.060 So here we're going through 60,000 different samples 01:21:28.060 --> 01:21:29.400 of handwritten digits. 01:21:29.400 --> 01:21:31.820 And I said that we're going to go through them 10 times. 01:21:31.820 --> 01:21:34.780 So we're going to go through the dataset 10 times, training each time, 01:21:34.780 --> 01:21:37.360 hopefully improving upon our weights every time 01:21:37.360 --> 01:21:38.900 we run through this dataset. 01:21:38.900 --> 01:21:41.770 And we can see over here on the right what the accuracy is 01:21:41.770 --> 01:21:44.860 each time we go ahead and run this model, that the first time, 01:21:44.860 --> 01:21:48.310 it looks like we got an accuracy of about 92% of the digits 01:21:48.310 --> 01:21:50.320 correct based on this training set. 01:21:50.320 --> 01:21:53.310 We increased that to 96% or 97%. 01:21:53.310 --> 01:21:56.110 And every time we run this, we're going to see, 01:21:56.110 --> 01:21:59.290 hopefully, the accuracy improve, as we continue to try and use 01:21:59.290 --> 01:22:02.440 that gradient descent, that process of trying to run the algorithm 01:22:02.440 --> 01:22:06.400 to minimize the loss that we get in order to more accurately predict 01:22:06.400 --> 01:22:07.840 what the output should be. 01:22:07.840 --> 01:22:11.210 And what this process is doing is it's learning not only the weights, 01:22:11.210 --> 01:22:13.660 but it's learning the features to use-- the kernel 01:22:13.660 --> 01:22:16.840 matrix to use-- when performing that convolution step, because this 01:22:16.840 --> 01:22:19.570 is a convolutional neural network, where I'm first performing 01:22:19.570 --> 01:22:23.380 those convolutions, and then doing the more traditional neural network 01:22:23.380 --> 01:22:24.260 structure. 01:22:24.260 --> 01:22:28.250 This is going to learn all of those individual steps as well.
01:22:28.250 --> 01:22:31.770 So here, we see the TensorFlow provides me with some very nice output, telling 01:22:31.770 --> 01:22:34.960 me about how many seconds are left with each of these training runs, 01:22:34.960 --> 01:22:37.610 that allows me to see just how well we're doing. 01:22:37.610 --> 01:22:39.970 So we'll go ahead and see how this network performs. 01:22:39.970 --> 01:22:42.520 It looks like we've gone through the dataset seven times. 01:22:42.520 --> 01:22:45.162 We're going through an eighth time now. 01:22:45.162 --> 01:22:47.120 And at this point, the accuracy is pretty high. 01:22:47.120 --> 01:22:50.950 We saw we went from 92% up to 97%. 01:22:50.950 --> 01:22:52.370 Now it looks like 98%. 01:22:52.370 --> 01:22:55.120 And at this point, it seems like things are starting to level out. 01:22:55.120 --> 01:22:57.550 There's probably a limit to how accurate we can ultimately 01:22:57.550 --> 01:22:59.615 be without running the risk of overfitting. 01:22:59.615 --> 01:23:02.740 Of course, with enough nodes, you could just memorize the input and overfit 01:23:02.740 --> 01:23:03.600 upon them. 01:23:03.600 --> 01:23:07.400 But we'd like to avoid doing that and dropout will help us with this. 01:23:07.400 --> 01:23:12.560 But now, we see we're almost done finishing our training step. 01:23:12.560 --> 01:23:13.950 We're at 55,000. 01:23:13.950 --> 01:23:14.450 All right. 01:23:14.450 --> 01:23:16.280 We've finished training, and now it's going 01:23:16.280 --> 01:23:18.920 to go ahead and test for us on 10,000 samples. 01:23:18.920 --> 01:23:23.630 And it looks like on the testing set, we were 98.8% accurate. 01:23:23.630 --> 01:23:25.640 So we ended up doing pretty well, it seems, 01:23:25.640 --> 01:23:28.940 on this testing set to see how accurately can 01:23:28.940 --> 01:23:31.980 we predict these handwritten digits. 01:23:31.980 --> 01:23:34.590 And so what we could do then is actually test it out. 01:23:34.590 --> 01:23:38.490 I've written a program called recognition.py using PyGame. 01:23:38.490 --> 01:23:40.350 If you pass it a model that's been trained, 01:23:40.350 --> 01:23:44.843 and I pre-trained an example model using this input data, what we can do 01:23:44.843 --> 01:23:46.760 is see whether or not we've been able to train 01:23:46.760 --> 01:23:50.510 this convolutional neural network to be able to predict handwriting, 01:23:50.510 --> 01:23:51.050 for example. 01:23:51.050 --> 01:23:54.080 So I can try just like drawing a handwritten digit. 01:23:54.080 --> 01:23:58.130 I'll go ahead and draw like the number 2, for example. 01:23:58.130 --> 01:23:59.295 So there's my number 2. 01:23:59.295 --> 01:24:00.170 Again, this is messy. 01:24:00.170 --> 01:24:03.170 If you tried to imagine how would you write a program with just like ifs 01:24:03.170 --> 01:24:05.390 and thens to be able to do this sort of calculation, 01:24:05.390 --> 01:24:06.830 it would be tricky to do so. 01:24:06.830 --> 01:24:08.810 But here, I'll press Classify, and all right. 01:24:08.810 --> 01:24:11.330 It seems it was able to correctly classify that what I drew 01:24:11.330 --> 01:24:12.383 was the number 2. 01:24:12.383 --> 01:24:13.550 We'll go ahead and reset it. 01:24:13.550 --> 01:24:14.092 Try it again. 01:24:14.092 --> 01:24:16.710 We'll draw like an 8, for example. 01:24:16.710 --> 01:24:19.040 So here is an 8. 01:24:19.040 --> 01:24:20.197 I'll press Classify. 01:24:20.197 --> 01:24:20.780 And all right. 
01:24:20.780 --> 01:24:23.693 It predicts that the digit that I drew was an 8. 01:24:23.693 --> 01:24:25.610 And the key here is this really begins to show 01:24:25.610 --> 01:24:28.640 the power of what the neural network is doing, somehow looking 01:24:28.640 --> 01:24:31.190 at various different features of these different pixels, 01:24:31.190 --> 01:24:33.560 figuring out what the relevant features are, 01:24:33.560 --> 01:24:36.350 and figuring out how to combine them to get a classification. 01:24:36.350 --> 01:24:40.340 And this would be a difficult task to provide explicit instructions 01:24:40.340 --> 01:24:43.580 to the computer on how to do, like to use a whole bunch of if-thens 01:24:43.580 --> 01:24:46.220 to process all of these pixel values to figure out 01:24:46.220 --> 01:24:48.800 what the handwritten digit is, like everyone is going to draw 01:24:48.800 --> 01:24:50.180 their 8 a little bit differently. 01:24:50.180 --> 01:24:52.680 If I drew the 8 again, it would look a little bit different. 01:24:52.680 --> 01:24:55.460 And yet ideally, we want to train a network to be robust 01:24:55.460 --> 01:24:59.360 enough so that it begins to learn these patterns on its own. 01:24:59.360 --> 01:25:02.040 All I said was, here is the structure of the network, 01:25:02.040 --> 01:25:04.610 and here is the data on which to train the network, 01:25:04.610 --> 01:25:06.620 and the network learning algorithm just tries 01:25:06.620 --> 01:25:08.960 to figure out what is the optimal set of weights, 01:25:08.960 --> 01:25:11.210 what is the optimal set of filters to use, 01:25:11.210 --> 01:25:13.520 in order to be able to accurately classify 01:25:13.520 --> 01:25:16.030 a digit into one category or another. 01:25:16.030 --> 01:25:20.850 That's going to show the power of these convolutional neural networks. 01:25:20.850 --> 01:25:25.280 And so that then was a look at how we can use convolutional neural networks 01:25:25.280 --> 01:25:30.320 to begin to solve problems with regards to computer vision, the ability to take 01:25:30.320 --> 01:25:32.015 an image and begin to analyze it. 01:25:32.015 --> 01:25:33.890 And so this is the type of analysis you might 01:25:33.890 --> 01:25:36.710 imagine that's happening in self-driving cars that 01:25:36.710 --> 01:25:40.910 are able to figure out what filters to apply to an image to understand what it 01:25:40.910 --> 01:25:44.300 is that the computer is looking at, or the same type of idea that 01:25:44.300 --> 01:25:46.760 might be applied to facial recognition in social media 01:25:46.760 --> 01:25:50.600 to be able to determine how to recognize faces in an image as well. 01:25:50.600 --> 01:25:53.180 You can imagine a neural network that, instead of classifying 01:25:53.180 --> 01:25:58.310 into one of 10 different digits, could instead classify like, is this person A 01:25:58.310 --> 01:26:01.730 or is this person B, trying to tell those people apart just based 01:26:01.730 --> 01:26:03.807 on convolution. 01:26:03.807 --> 01:26:06.890 And so now what we'll take a look at is yet another type of neural network 01:26:06.890 --> 01:26:09.290 that can be quite popular for certain types of tasks.
01:26:09.290 --> 01:26:13.160 But to do so, we'll try to generalize and think about our neural network 01:26:13.160 --> 01:26:16.920 a little bit more abstractly, that here we have a sample deep neural network, 01:26:16.920 --> 01:26:20.150 where we have this input layer, a whole bunch of different hidden layers 01:26:20.150 --> 01:26:22.850 that are performing certain types of calculations, 01:26:22.850 --> 01:26:26.090 and then an output layer here that just generates some sort of output 01:26:26.090 --> 01:26:28.370 that we care about calculating. 01:26:28.370 --> 01:26:32.780 But we could imagine representing this a little more simply, like this. 01:26:32.780 --> 01:26:36.110 Here is just a more abstract representation of our neural network. 01:26:36.110 --> 01:26:37.490 We have some input. 01:26:37.490 --> 01:26:41.090 That might be like a vector of a whole bunch of different values as our input. 01:26:41.090 --> 01:26:43.390 That gets passed into a network to perform 01:26:43.390 --> 01:26:46.190 some sort of calculation or computation, and that network 01:26:46.190 --> 01:26:48.350 produces some sort of output. 01:26:48.350 --> 01:26:50.043 That output might be a single value. 01:26:50.043 --> 01:26:51.960 It might be a whole bunch of different values. 01:26:51.960 --> 01:26:54.960 But this is the general structure of the neural network that we've seen. 01:26:54.960 --> 01:26:58.250 There is some sort of input that gets fed into the network, 01:26:58.250 --> 01:27:02.210 and using that input, the network calculates what the output should be. 01:27:02.210 --> 01:27:04.730 And this sort of model for a neural network 01:27:04.730 --> 01:27:07.790 is what we might call a feed-forward neural network. 01:27:07.790 --> 01:27:11.760 Feed-forward neural networks have connections only in one direction; 01:27:11.760 --> 01:27:14.390 they move from one layer to the next layer to the layer 01:27:14.390 --> 01:27:18.530 after that, such that the inputs pass through various different hidden layers 01:27:18.530 --> 01:27:21.560 and then ultimately produce some sort of output. 01:27:21.560 --> 01:27:24.963 So feed-forward neural networks are very helpful for solving 01:27:24.963 --> 01:27:27.380 these types of classification problems that we saw before. 01:27:27.380 --> 01:27:28.760 We have a whole bunch of input. 01:27:28.760 --> 01:27:30.885 We want to learn what setting of weights will allow 01:27:30.885 --> 01:27:32.717 us to calculate the output effectively. 01:27:32.717 --> 01:27:35.300 But there are some limitations on feed-forward neural networks 01:27:35.300 --> 01:27:36.425 that we'll see in a moment. 01:27:36.425 --> 01:27:39.350 In particular, the input needs to be of a fixed shape, 01:27:39.350 --> 01:27:41.932 like a fixed number of neurons are in the input layer, 01:27:41.932 --> 01:27:43.640 and there's a fixed shape for the output, 01:27:43.640 --> 01:27:46.670 like a fixed number of neurons in the output layer, 01:27:46.670 --> 01:27:49.340 and that has some limitations of its own.
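To make that fixed-shape point concrete, here is a toy feed-forward model in Keras that has to commit to its input and output sizes up front; the sizes are arbitrary and just for illustration.

import tensorflow as tf

# A feed-forward network: connections flow in one direction only, and the
# number of inputs and outputs is fixed when the model is defined
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),  # always exactly 4 inputs
    tf.keras.layers.Dense(1, activation="sigmoid"),                 # always exactly 1 output
])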
01:27:49.340 --> 01:27:51.457 And a possible solution to this-- 01:27:51.457 --> 01:27:53.540 and we'll see examples of the types of problems we 01:27:53.540 --> 01:27:55.190 can solve for this in just the second-- 01:27:55.190 --> 01:27:58.065 is instead of just a feed-forward neural network where there are only 01:27:58.065 --> 01:28:01.070 connections in one direction, from left to right effectively, 01:28:01.070 --> 01:28:05.390 across the network, we can also imagine a recurrent neural network, 01:28:05.390 --> 01:28:07.460 where a recurrent neural network generates 01:28:07.460 --> 01:28:13.680 output that gets fed back into itself as input for future runs of that network. 01:28:13.680 --> 01:28:15.800 So whereas in a traditional neural network, 01:28:15.800 --> 01:28:19.850 we have inputs that get fed into the network that get fed into the output, 01:28:19.850 --> 01:28:23.150 and the only thing that determines the output is based on the original input 01:28:23.150 --> 01:28:26.780 and based on the calculation we do inside of the network itself, 01:28:26.780 --> 01:28:29.780 this goes in contrast with a recurrent neural network, 01:28:29.780 --> 01:28:32.450 where in a recurrent neural network, you can imagine output 01:28:32.450 --> 01:28:35.810 from the network feeding back to itself into the network 01:28:35.810 --> 01:28:39.590 again as input for the next time that you do the calculations 01:28:39.590 --> 01:28:41.090 inside of the network. 01:28:41.090 --> 01:28:45.890 What this allows is it allows the network to maintain some sort of state, 01:28:45.890 --> 01:28:48.290 to store some sort of information that can 01:28:48.290 --> 01:28:51.930 be used on future runs of the network. 01:28:51.930 --> 01:28:54.170 Previously, the network just defined some weights, 01:28:54.170 --> 01:28:56.990 and we passed inputs through the network, and it generated outputs, 01:28:56.990 --> 01:29:00.710 but the network wasn't saving any information based on those inputs 01:29:00.710 --> 01:29:04.103 to be able to remember for future iterations or for future runs. 01:29:04.103 --> 01:29:06.020 What a recurrent neural network will let us do 01:29:06.020 --> 01:29:08.270 is let the network store information that 01:29:08.270 --> 01:29:12.470 gets passed back in as input to the network again the next time we try 01:29:12.470 --> 01:29:14.370 and perform some sort of action. 01:29:14.370 --> 01:29:18.990 And this is particularly helpful when dealing with sequences of data. 01:29:18.990 --> 01:29:21.620 So we'll see a real-world example of this right now actually. 01:29:21.620 --> 01:29:25.880 Microsoft has developed an AI known as the CaptionBot, 01:29:25.880 --> 01:29:28.370 and what the CaptionBot does is it says, I 01:29:28.370 --> 01:29:30.500 can understand the content of any photograph, 01:29:30.500 --> 01:29:32.583 and I'll try to describe it as well as any human. 01:29:32.583 --> 01:29:35.000 I'll analyze your photo, but I won't store it or share it. 01:29:35.000 --> 01:29:38.090 And so what Microsoft CaptionBot seems to be claiming to do 01:29:38.090 --> 01:29:41.630 is it can take an image and figure out what's in the image 01:29:41.630 --> 01:29:44.460 and just give us a caption to describe it. 01:29:44.460 --> 01:29:45.470 So let's try it out. 01:29:45.470 --> 01:29:48.255 Here, for example, is an image of Harvard Square 01:29:48.255 --> 01:29:51.380 and some people walking in front of one of the buildings at Harvard Square. 
01:29:51.380 --> 01:29:53.720 I'll go ahead and take the URL for that image, 01:29:53.720 --> 01:29:57.520 and I'll paste it into CaptionBot, then just press Go. 01:29:57.520 --> 01:30:01.460 So CaptionBot is analyzing the image, and then it says, 01:30:01.460 --> 01:30:03.920 I think it's a group of people walking in front 01:30:03.920 --> 01:30:05.510 of a building, which seems amazing. 01:30:05.510 --> 01:30:09.590 The AI is able to look at this image and figure out what's in the image. 01:30:09.590 --> 01:30:11.510 And the important thing to recognize here 01:30:11.510 --> 01:30:13.910 is that this is no longer just a classification task. 01:30:13.910 --> 01:30:17.350 We saw being able to classify images with a convolutional neural network, 01:30:17.350 --> 01:30:21.680 where the job was to take the images and then figure out, is it a 0, or a 1, 01:30:21.680 --> 01:30:24.740 or a 2; or is that this person's face or that person's face? 01:30:24.740 --> 01:30:28.160 What seems to be happening here is the input is an image, 01:30:28.160 --> 01:30:31.190 and we know how to get networks to take input of images, 01:30:31.190 --> 01:30:33.320 but the output is text. 01:30:33.320 --> 01:30:34.010 It's a sentence. 01:30:34.010 --> 01:30:38.410 It's a phrase, like "a group of people walking in front of a building." 01:30:38.410 --> 01:30:41.420 And this would seem to pose a challenge for our more traditional 01:30:41.420 --> 01:30:44.450 feed-forward neural networks, for the reason being 01:30:44.450 --> 01:30:47.540 that in traditional neural networks, we just 01:30:47.540 --> 01:30:50.670 have a fixed-size input and a fixed-size output. 01:30:50.670 --> 01:30:53.930 There are a certain number of neurons in the input to our neural network 01:30:53.930 --> 01:30:56.580 and a certain number of outputs for our neural network, 01:30:56.580 --> 01:30:58.763 and then some calculation that goes on in between. 01:30:58.763 --> 01:30:59.930 But the size of the inputs-- 01:30:59.930 --> 01:31:03.030 the number of values in the input and the number of values in the output-- 01:31:03.030 --> 01:31:07.775 those are always going to be fixed based on the structure of the neural network, 01:31:07.775 --> 01:31:10.400 and that makes it difficult to imagine how a neural network can 01:31:10.400 --> 01:31:12.440 take an image like this and say, you know, 01:31:12.440 --> 01:31:14.840 it's a group of people walking in front of the building, 01:31:14.840 --> 01:31:17.360 because the output is text. 01:31:17.360 --> 01:31:19.580 It's a sequence of words. 01:31:19.580 --> 01:31:23.120 Now it might be possible for a neural network to output one word. 01:31:23.120 --> 01:31:25.610 One word you could represent as a vector of values, 01:31:25.610 --> 01:31:27.350 and you can imagine ways of doing that. 01:31:27.350 --> 01:31:29.517 And next time, we'll talk a little bit more about AI 01:31:29.517 --> 01:31:31.950 as it relates to language and language processing. 01:31:31.950 --> 01:31:34.290 But a sequence of words is much more challenging, 01:31:34.290 --> 01:31:36.080 because depending on the image, you might 01:31:36.080 --> 01:31:38.510 imagine the output is a different number of words. 01:31:38.510 --> 01:31:41.120 We could have sequences of different lengths, 01:31:41.120 --> 01:31:45.310 and somehow we still want to be able to generate the appropriate output.
01:31:45.310 --> 01:31:49.250 And so the strategy here is to use a recurrent neural network, 01:31:49.250 --> 01:31:52.790 a neural network that can feed its own output back into itself 01:31:52.790 --> 01:31:55.020 as input for the next time. 01:31:55.020 --> 01:31:59.810 And this allows us to do what we call a one-to-many relationship for inputs 01:31:59.810 --> 01:32:02.720 to outputs, that in vanilla, more traditional neural networks-- 01:32:02.720 --> 01:32:05.840 these are what we consider to be one-to-one neural networks-- 01:32:05.840 --> 01:32:10.370 you pass in one set of values as input, you get one vector of values 01:32:10.370 --> 01:32:12.080 as the output-- 01:32:12.080 --> 01:32:14.750 but in this case, we want to pass in one value as input-- 01:32:14.750 --> 01:32:17.840 the image-- and we want to get a sequence-- many values-- 01:32:17.840 --> 01:32:22.190 as output, where each value is like one of these words that gets produced 01:32:22.190 --> 01:32:24.460 by this particular algorithm. 01:32:24.460 --> 01:32:26.960 And so the way we might do this is we might imagine starting 01:32:26.960 --> 01:32:30.175 by providing input the image into our neural network, 01:32:30.175 --> 01:32:32.300 and the neural network is going to generate output, 01:32:32.300 --> 01:32:34.730 but the output is not going to be the whole sequence of words, 01:32:34.730 --> 01:32:37.022 because we can't represent the whole sequence of words. 01:32:37.022 --> 01:32:39.650 I'm using just a fixed set of neurons. 01:32:39.650 --> 01:32:42.760 Instead, the output is just going to be the first word. 01:32:42.760 --> 01:32:44.510 We're going to train the network to output 01:32:44.510 --> 01:32:46.500 what the first word of the caption should be. 01:32:46.500 --> 01:32:48.500 And you could imagine that Microsoft has trained 01:32:48.500 --> 01:32:52.250 to this by running a whole bunch of training samples through the AI, 01:32:52.250 --> 01:32:55.400 giving it a whole bunch of pictures and what the appropriate caption was, 01:32:55.400 --> 01:32:58.520 and having the AI begin to learn from that. 01:32:58.520 --> 01:33:00.830 But now, because the network generates output 01:33:00.830 --> 01:33:03.020 that can be fed back into itself, you can 01:33:03.020 --> 01:33:06.830 imagine the output of the network being fed back into the same network-- 01:33:06.830 --> 01:33:10.400 this here looks like a separate network, but it's really the same network that's 01:33:10.400 --> 01:33:12.170 just getting different input-- 01:33:12.170 --> 01:33:16.340 that this network's output gets fed back into itself, 01:33:16.340 --> 01:33:18.440 but it's going to generate another output, 01:33:18.440 --> 01:33:22.910 and that other output is going to be like the second word in the caption. 01:33:22.910 --> 01:33:25.220 And this recurrent neural network then, this network 01:33:25.220 --> 01:33:27.470 is going to generate other output that can be fed back 01:33:27.470 --> 01:33:30.470 into itself to generate yet another word, fed back 01:33:30.470 --> 01:33:32.420 into itself to generate another word. 01:33:32.420 --> 01:33:35.150 And so recurrent neural networks allow us to represent 01:33:35.150 --> 01:33:37.610 this sort of one-to-many structure. 
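As a toy sketch of that feedback structure (not Microsoft's actual system), you might picture something like the loop below, where network_step is a hypothetical stand-in for one run of a trained recurrent network: it takes the current state plus an input and returns an updated state and one output word.

def network_step(state, x):
    # Stand-in for one run of the network: in a real system the new state and
    # the output word would be computed from learned weights, not a fixed list
    words = ["a", "group", "of", "people", "walking", "<end>"]
    word = words[min(state, len(words) - 1)]
    return state + 1, word

def caption(image_features):
    # First run: the input is the image itself
    state, word = network_step(0, image_features)
    words = [word]
    # Later runs: feed the network's own output back in as its next input
    while word != "<end>" and len(words) < 20:
        state, word = network_step(state, word)
        words.append(word)
    return " ".join(w for w in words if w != "<end>")

print(caption(image_features=[0.2, 0.7, 0.1]))  # made-up feature vector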
01:33:37.610 --> 01:33:40.370 You provide one image as input, and the neural network 01:33:40.370 --> 01:33:43.160 can pass data into the next run of the network, 01:33:43.160 --> 01:33:46.940 and then again and again, such that you could run the network multiple times, 01:33:46.940 --> 01:33:52.398 each time generating a different output, still based on that original input. 01:33:52.398 --> 01:33:54.190 And this is where recurrent neural networks 01:33:54.190 --> 01:33:58.880 become particularly useful when dealing with sequences of inputs or outputs. 01:33:58.880 --> 01:34:02.110 My output is a sequence of words, and since I can't very easily 01:34:02.110 --> 01:34:04.690 represent outputting an entire sequence of words, 01:34:04.690 --> 01:34:07.900 I'll instead output that sequence one word at a time, 01:34:07.900 --> 01:34:10.240 by allowing my network to pass information 01:34:10.240 --> 01:34:13.420 about what still needs to be said about the photo 01:34:13.420 --> 01:34:15.655 into the next stage of running the network. 01:34:15.655 --> 01:34:17.530 So you could run the network multiple times-- 01:34:17.530 --> 01:34:19.450 the same network with the same weights-- 01:34:19.450 --> 01:34:23.260 just getting different input each time, first getting input from the image, 01:34:23.260 --> 01:34:25.990 and then getting input from the network itself, 01:34:25.990 --> 01:34:28.630 as additional information about what additionally 01:34:28.630 --> 01:34:32.660 needs to be given in a particular caption, for example. 01:34:32.660 --> 01:34:35.080 So this then is a one-to-many relationship 01:34:35.080 --> 01:34:36.760 inside of a recurrent neural network. 01:34:36.760 --> 01:34:38.718 But it turns out there are other models that we 01:34:38.718 --> 01:34:42.280 can use-- other ways we can try and use recurrent neural networks-- to be 01:34:42.280 --> 01:34:45.490 able to represent data that might be stored in other forms as well. 01:34:45.490 --> 01:34:48.640 We saw how we could use neural networks in order to analyze images, 01:34:48.640 --> 01:34:51.802 in the context of convolutional neural networks that take an image, 01:34:51.802 --> 01:34:54.010 figure out various different properties of the image, 01:34:54.010 --> 01:34:57.410 and are able to draw some sort of conclusion based on that. 01:34:57.410 --> 01:34:59.650 But you might imagine that something like YouTube, 01:34:59.650 --> 01:35:02.730 they need to be able to do a lot of learning based on video. 01:35:02.730 --> 01:35:04.480 They need to look through videos to detect 01:35:04.480 --> 01:35:06.557 if there are copyright violations, or they 01:35:06.557 --> 01:35:08.890 need to be able to look through videos to maybe identify 01:35:08.890 --> 01:35:12.400 what particular items are inside of the video, for example. 01:35:12.400 --> 01:35:14.950 And video, you might imagine, is much more difficult 01:35:14.950 --> 01:35:18.610 to put in as input to a neural network, because whereas an image 01:35:18.610 --> 01:35:22.520 you can just treat each pixel as a different value, videos are sequences. 01:35:22.520 --> 01:35:26.388 They're sequences of images, and each sequence might be a different length, 01:35:26.388 --> 01:35:28.180 and so it might be challenging to represent 01:35:28.180 --> 01:35:31.120 that entire video as a single vector of values 01:35:31.120 --> 01:35:34.070 that you could pass in to a neural network.
01:35:34.070 --> 01:35:36.340 And so here too, recurrent neural networks 01:35:36.340 --> 01:35:40.060 can be a valuable solution for trying to solve this type of problem. 01:35:40.060 --> 01:35:44.150 Then instead of just passing in a single input into our neural network, 01:35:44.150 --> 01:35:47.170 we could pass in the input one frame at a time, you might imagine, 01:35:47.170 --> 01:35:51.460 first taking the first frame of the video, passing it into the network, 01:35:51.460 --> 01:35:54.280 and then maybe not having the network output anything at all yet. 01:35:54.280 --> 01:35:58.870 Let it take in another input, and this time, pass it into the network, 01:35:58.870 --> 01:36:01.750 but the network gets information from the last time 01:36:01.750 --> 01:36:03.760 we provided an input into the network. 01:36:03.760 --> 01:36:06.220 Then we pass in a third input and then a fourth input, 01:36:06.220 --> 01:36:09.970 where each time, the network gets the most recent input, 01:36:09.970 --> 01:36:12.850 like each frame of the video, but it also 01:36:12.850 --> 01:36:16.940 gets information the network processed from all of the previous iterations. 01:36:16.940 --> 01:36:19.360 So on frame number four, you end up getting 01:36:19.360 --> 01:36:22.750 the input for frame number four, plus information the network has 01:36:22.750 --> 01:36:25.630 calculated from the first three frames. 01:36:25.630 --> 01:36:28.780 And using all of that data combined, this recurrent neural network 01:36:28.780 --> 01:36:32.920 can begin to learn how to extract patterns from a sequence of data 01:36:32.920 --> 01:36:33.730 as well. 01:36:33.730 --> 01:36:35.730 And so you might imagine if you want to classify 01:36:35.730 --> 01:36:37.570 a video into a number of different genres, 01:36:37.570 --> 01:36:40.990 like an educational video, or a music video, or different types of videos. 01:36:40.990 --> 01:36:43.180 That's a classification task, where you want 01:36:43.180 --> 01:36:45.820 to take as input each of the frames of the video, 01:36:45.820 --> 01:36:48.440 and you want to output something like what it is 01:36:48.440 --> 01:36:51.853 and what category it happens to belong to. 01:36:51.853 --> 01:36:53.770 And you can imagine doing this sort of thing-- 01:36:53.770 --> 01:36:56.310 this sort of many-to-one learning-- 01:36:56.310 --> 01:36:58.630 anytime your input is a sequence. 01:36:58.630 --> 01:37:01.718 And so the input is a sequence in the context of a video. 01:37:01.718 --> 01:37:04.510 It could be in the context of like, if someone has typed a message, 01:37:04.510 --> 01:37:06.640 and you want to be able to categorize that message, 01:37:06.640 --> 01:37:09.220 like if you're trying to take a movie review 01:37:09.220 --> 01:37:12.850 and trying to classify it as is it a positive review or a negative review. 01:37:12.850 --> 01:37:15.460 That input is a sequence of words, and the output 01:37:15.460 --> 01:37:18.060 is a classification-- positive or negative. 01:37:18.060 --> 01:37:20.170 There too, a recurrent neural network might 01:37:20.170 --> 01:37:22.780 be helpful for analyzing sequences of words, 01:37:22.780 --> 01:37:25.875 and they're quite popular when it comes to dealing with language.
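And a hedged Keras sketch of that many-to-one idea might look like the following, reading in a sequence of feature vectors (say, one per frame or one per word) and producing a single classification; the sizes and the choice of a simple recurrent layer are illustrative assumptions.

import tensorflow as tf

# Many-to-one: take in a whole sequence, output one classification
model = tf.keras.models.Sequential([
    # Each element of the sequence is a 64-value feature vector;
    # None lets the sequence itself be any length
    tf.keras.layers.SimpleRNN(32, input_shape=(None, 64)),
    # One output unit per category (say, video genres), as a probability distribution
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()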
01:37:25.875 --> 01:37:27.950 It could even be used for spoken language 01:37:27.950 --> 01:37:31.250 as well, that spoken language is an audio waveform that 01:37:31.250 --> 01:37:34.460 can be segmented into distinct chunks, and each of those 01:37:34.460 --> 01:37:37.760 can be passed in as an input into a recurrent neural network 01:37:37.760 --> 01:37:40.380 to be able to classify someone's voice, for instance, 01:37:40.380 --> 01:37:43.160 if you want to do voice recognition, to say is this one person 01:37:43.160 --> 01:37:44.260 or is this another? 01:37:44.260 --> 01:37:48.310 These are also cases where you might want this many-to-one architecture 01:37:48.310 --> 01:37:50.897 for a recurrent neural network. 01:37:50.897 --> 01:37:52.980 And then as one final problem, just to take a look 01:37:52.980 --> 01:37:55.860 at in terms of what we can do with these sorts of networks, 01:37:55.860 --> 01:37:57.870 imagine what Google Translate is doing. 01:37:57.870 --> 01:38:01.620 So what Google Translate is doing is it's taking some text written in one 01:38:01.620 --> 01:38:05.850 language and converting it into text written in some other language, 01:38:05.850 --> 01:38:09.090 for example, where now this input is a sequence of data-- 01:38:09.090 --> 01:38:10.770 it's a sequence of words-- 01:38:10.770 --> 01:38:13.210 and the output is a sequence of words as well. 01:38:13.210 --> 01:38:14.440 It's also a sequence. 01:38:14.440 --> 01:38:17.340 So here, we want effectively like a many-to-many relationship. 01:38:17.340 --> 01:38:21.330 Our input is a sequence, and our output is a sequence as well. 01:38:21.330 --> 01:38:25.350 And it's not quite going to work to just say, take each word in the input 01:38:25.350 --> 01:38:28.620 and translate it into a word in the output, 01:38:28.620 --> 01:38:31.823 because ultimately, different languages put their words in different orders, 01:38:31.823 --> 01:38:33.990 and maybe one language uses two words for something, 01:38:33.990 --> 01:38:36.130 whereas another language only uses one. 01:38:36.130 --> 01:38:40.970 So we really want some way to take this information-- that's the input-- 01:38:40.970 --> 01:38:45.730 encode it somehow, and use that encoding to generate what the output ultimately 01:38:45.730 --> 01:38:46.230 should be. 01:38:46.230 --> 01:38:48.105 And one of the big advancements 01:38:48.105 --> 01:38:50.700 in automated translation technology has been the ability 01:38:50.700 --> 01:38:54.570 to use neural networks to do this, instead of older, more traditional methods, 01:38:54.570 --> 01:38:56.820 and this has improved accuracy dramatically. 01:38:56.820 --> 01:38:59.070 And the way you might imagine doing this is, again, 01:38:59.070 --> 01:39:03.030 using a recurrent neural network with multiple inputs and multiple outputs. 01:39:03.030 --> 01:39:04.590 We start by passing in all the input. 01:39:04.590 --> 01:39:06.143 Input goes into the network. 01:39:06.143 --> 01:39:08.310 Another input, like another word, goes into the network, 01:39:08.310 --> 01:39:12.030 and we do this multiple times, like once for each word in the input 01:39:12.030 --> 01:39:13.530 that I'm trying to translate.
01:39:13.530 --> 01:39:16.800 And only after all of that is done does the network now 01:39:16.800 --> 01:39:19.950 start to generate output, like the first word of the translated sentence, 01:39:19.950 --> 01:39:23.060 and the next word of the translated sentence, so on and so forth, 01:39:23.060 --> 01:39:26.100 where each time the network passes information 01:39:26.100 --> 01:39:31.200 to itself by allowing for this model of giving some sort of state 01:39:31.200 --> 01:39:33.960 from one run of the network to the next run, 01:39:33.960 --> 01:39:36.120 assembling information about all the inputs, 01:39:36.120 --> 01:39:39.780 and then passing along information about which part of the output 01:39:39.780 --> 01:39:40.987 to generate next. 01:39:40.987 --> 01:39:43.320 And there are a number of different types of these sorts 01:39:43.320 --> 01:39:44.890 of recurrent neural networks. 01:39:44.890 --> 01:39:48.060 One of the most popular is known as the long short-term memory neural 01:39:48.060 --> 01:39:50.190 network, otherwise known as an LSTM. 01:39:50.190 --> 01:39:53.303 But in general, these types of networks can be very, very powerful 01:39:53.303 --> 01:39:55.470 whenever we're dealing with sequences, whether those 01:39:55.470 --> 01:39:59.400 are sequences of images or especially sequences of words when it comes 01:39:59.400 --> 01:40:02.370 to dealing with natural language. 01:40:02.370 --> 01:40:06.090 So those, then, were just some of the different types of neural networks 01:40:06.090 --> 01:40:08.590 that can be used to do all sorts of different computations, 01:40:08.590 --> 01:40:10.830 and these are incredibly versatile tools that 01:40:10.830 --> 01:40:12.930 can be applied to a number of different domains. 01:40:12.930 --> 01:40:16.300 We only looked at a couple of the most popular types of neural networks-- 01:40:16.300 --> 01:40:18.570 the more traditional feed-forward neural networks, 01:40:18.570 --> 01:40:21.573 convolutional neural networks, and recurrent neural networks. 01:40:21.573 --> 01:40:22.990 But there are other types as well. 01:40:22.990 --> 01:40:25.907 There are adversarial networks, where networks compete with each other 01:40:25.907 --> 01:40:28.890 to try to be able to generate new types of data, 01:40:28.890 --> 01:40:32.370 as well as other networks that can solve other tasks, depending on what they happen 01:40:32.370 --> 01:40:34.510 to be structured and adapted for. 01:40:34.510 --> 01:40:36.810 And these are very powerful tools in machine learning: 01:40:36.810 --> 01:40:40.578 they're able to learn based on some set of input data 01:40:40.578 --> 01:40:42.870 and therefore figure out how to calculate 01:40:42.870 --> 01:40:45.210 some function from inputs to outputs. 01:40:45.210 --> 01:40:48.600 Whether it's mapping input to some sort of classification, like analyzing an image 01:40:48.600 --> 01:40:50.910 and getting a digit, or machine translation, where 01:40:50.910 --> 01:40:53.670 the input is in one language and the output is in another, 01:40:53.670 --> 01:40:58.080 these tools have a lot of applications for machine learning more generally. 01:40:58.080 --> 01:41:00.360 Next time, we'll look at machine learning and AI 01:41:00.360 --> 01:41:02.633 in particular in the context of natural language.
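[EDITOR'S NOTE: As a rough sketch of that encode-then-generate idea, here is one way an LSTM encoder and decoder could be wired together, assuming TensorFlow/Keras is available. This is not code from the lecture, and the vocabulary sizes, embedding dimension, and layer width are placeholder values.]

```python
import tensorflow as tf

# Hypothetical vocabulary sizes and dimensions, just for illustration.
src_vocab, tgt_vocab, embed_dim, units = 5000, 6000, 64, 128

# Encoder: read the whole source sentence and keep only the final LSTM state,
# which serves as the encoding of everything that was passed in.
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="source_tokens")
x = tf.keras.layers.Embedding(src_vocab, embed_dim)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(x)

# Decoder: start from the encoder's state and generate the translated sequence,
# with each step also conditioned on the previously generated word.
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="target_tokens")
y = tf.keras.layers.Embedding(tgt_vocab, embed_dim)(decoder_inputs)
y = tf.keras.layers.LSTM(units, return_sequences=True)(y, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(tgt_vocab, activation="softmax")(y)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

At translation time, the idea is the same as described above: the encoder first consumes the entire input sentence, and only then does the decoder begin producing the output, one word per step, carrying its state from each step to the next.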
01:41:02.633 --> 01:41:04.800 We talked a little bit about this today, but next time we'll be looking 01:41:04.800 --> 01:41:08.520 at how it is that our AI can begin to understand natural language 01:41:08.520 --> 01:41:11.640 and can begin to analyze and do useful tasks with 01:41:11.640 --> 01:41:13.740 regard to human language, which turns out 01:41:13.740 --> 01:41:15.880 to be a challenging and interesting problem. 01:41:15.880 --> 01:41:18.110 So we'll see you next time.