[MUSIC PLAYING] SPEAKER 1: All right. Welcome back, everyone, to an introduction to Artificial Intelligence with Python. Now last time, we took a look at machine learning-- a set of techniques that computers can use in order to take a set of data and learn some patterns inside of that data, learn how to perform a task, even if we, the programmers, didn't give the computer explicit instructions for how to perform that task. 

Today, we transition to one of the most popular techniques and tools within machine learning: that of neural networks. Neural networks were inspired as early as the 1940s by researchers who were thinking about how it is that humans learn, studying neuroscience and the human brain, and trying to see whether or not we can apply those same ideas to computers as well, and model computer learning off of human learning.

So how is the brain structured? Well, very simply put, the brain consists of a whole bunch of neurons, and those neurons are connected to one another and communicate with one another in some way. In particular, if you think about the structure of a biological neural network-- something like this-- there are a couple of key properties that scientists observed. One was that these neurons are connected to each other and receive electrical signals from one another, that one neuron can propagate electrical signals to another neuron. And another point is that neurons process those input signals, and then can be activated, that a neuron becomes activated at a certain point, and then can propagate further signals onto neurons in the future. 

And so the question then became, could we take this biological idea of how it is that humans learn-- with brains and with neurons-- and apply that to a machine as well, in effect, designing an artificial neural network, or an ANN, which will be a mathematical model for learning that is inspired by these biological neural networks? 

And what artificial neural networks will allow us to do is they will first be able to model some sort of mathematical function. Every time you look at a neural network, which we'll see more of later today, each one of them is really just some mathematical function that is mapping certain inputs to particular outputs, based on the structure of the network, that depending on where we place particular units inside of this neural network, that's going to determine how it is that the network is going to function. And in particular, artificial neural networks are going to lend themselves to a way that we can learn what the network's parameters should be. We'll see more on that in just a moment. But in effect we want to model, such that it is easy for us to be able to write some code that allows for the network to be able to figure out how to model the right mathematical function, given a particular set of input data. 

So in order to create our artificial neural network, instead of using biological neurons, we're just going to use what we're going to call units-- units inside of a neural network-- which we can represent kind of like a node in a graph, which will here be represented just by a blue circle like this. And these artificial units-- these artificial neurons-- can be connected to one another. So here, for instance, we have two units that are connected by this edge inside of this graph, effectively. 

And so what we're going to do now is think of this idea as some sort of mapping from inputs to outputs, that we have one unit that is connected to another unit, that we might think of this side as the input and that side as the output.

And what we're trying to do then is to figure out how to solve a problem, how to model some sort of mathematical function. And this might take the form of something we saw last time, which was something like, we have certain inputs like variables x1 and x2, and given those inputs, we want to perform some sort of task-- a task like predicting whether or not it's going to rain. 

And ideally, we'd like some way, given these inputs x1 and x2, which stand for some sort of variables to do with the weather, we would like to be able to predict, in this case, a Boolean classification-- is it going to rain, or is it not going to rain? And we did this last time by way of a mathematical function. We defined some function h for our hypothesis function that took as input x1 and x2-- the two inputs that we cared about processing-- in order to determine whether we thought it was going to rain, or whether we thought it was not going to rain. 

The question then becomes, what does this hypothesis function do in order to make that determination? And we decided last time to use a linear combination of these input variables to determine what the output should be. So our hypothesis function was equal to something like this: weight 0 plus weight 1 times x1 plus weight 2 times x2. 

So what's going on here is that x1 and x2-- those are input variables-- the inputs to this hypothesis function-- and each of those input variables is being multiplied by some weight, which is just some number. So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2, and we have this additional weight-- weight 0-- that doesn't get multiplied by an input variable at all, that just serves to either move the function's value up or move the function's value down. You can think of this as a weight that's just multiplied by some dummy value, like the number 1, so that in effect it's not multiplied by anything. Or sometimes you'll see in the literature, people call this variable weight 0 a "bias," so that you can think of these variables as slightly different. We have weights that are multiplied by the input and we separately add some bias to the result as well. You'll hear both of those terminologies used when people talk about neural networks and machine learning.
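
As a rough sketch of the linear combination just described, the weighted sum can be written as a small Python function. The function name and the particular weight values are hypothetical, chosen only for illustration; nothing here has been learned from data.

```python
def linear_combination(x1, x2, w0, w1, w2):
    """Return w0 + w1*x1 + w2*x2, where w0 plays the role of the bias."""
    return w0 + w1 * x1 + w2 * x2


# Hypothetical weights and inputs, picked only to show the arithmetic.
result = linear_combination(x1=2.0, x2=3.0, w0=-1.0, w1=0.5, w2=0.25)
print(result)  # -1.0 + 0.5*2.0 + 0.25*3.0 = 0.75
```

Changing `w0` alone shifts the result up or down without touching how the inputs are weighted, which is the role the bias plays.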

So in effect, what we've done here is that in order to define a hypothesis function, we just need to decide and figure out what these weights should be, to determine what values to multiply by our inputs to get some sort of result. Of course, at the end of this, what we need to do is make some sort of classification like raining or not raining, and to do that, we use some sort of function to define some sort of threshold. 

And so we saw, for instance, the step function, which is defined as 1 if the result of multiplying the weights by the inputs is at least 0; otherwise as 0. You can think of this line down the middle-- it's kind of like a dotted line. Effectively, it stays at 0 all the way up to one point, and then the function steps-- or jumps up-- to 1. So it's zero before it reaches some threshold, and then it's 1 after it reaches a particular threshold. And so this was one way we could define what we'll come to call an "activation function," a function that determines when it is that this output becomes active-- changes to a 1 instead of being a 0. 

But we also saw that if we didn't just want a purely binary classification, if we didn't want purely 1 or 0, but we wanted to allow for some in-between real number values, we could use a different function. And there are a number of choices, but the one that we looked at was the logistic sigmoid function that has sort of an S-shaped curve, where we could represent this as a probability-- that may be somewhere in between the probability of rain of something like 0.5, and maybe a little bit later the probability of rain is 0.8-- and so rather than just have a binary classification of 0 or 1, we can allow for numbers that are in between as well. 

And it turns out there are many other different types of activation functions, where an activation function just takes the output of multiplying the weights by the inputs and adding that bias, and then figures out what the actual output should be. Another popular one is the rectified linear unit, otherwise known as ReLU, and the way that works is that it just takes an input and takes the maximum of that input and 0. So if it's positive, it remains unchanged, but if it's negative, it goes ahead and levels out at 0. And there are other activation functions that we can choose as well.
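
The three activation functions mentioned so far can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are assumptions, not names used in the lecture's own code.

```python
import math


def step(z):
    """Step function: 1 once z reaches the threshold of 0, otherwise 0."""
    return 1 if z >= 0 else 0


def sigmoid(z):
    """Logistic sigmoid: an S-shaped curve with values between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def relu(z):
    """Rectified linear unit: the maximum of the input and 0."""
    return max(0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, step(z), sigmoid(z), relu(z))
```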

But in short, each of these activation functions, you can just think of as a function that gets applied to the result of all of this computation. We take some function g and apply it to the result of all of that calculation. And this then is what we saw last time-- the way of defining some hypothesis function that takes on inputs, calculates some linear combination of those inputs, and then passes it through some sort of activation function to get our output. 

And this actually turns out to be the model for the simplest of neural networks, that we're going to instead represent this mathematical idea graphically, by using a structure like this. Here then is a neural network that has two inputs. We can think of this as x1 and this as x2. And then one output, which you can think of classifying whether or not we think it's going to rain or not rain, for example, in this particular instance. 

And so how exactly does this model work? Well, each of these two inputs represents one of our input variables-- x1 and x2. And notice that these inputs are connected to this output via these edges, which are going to be defined by their weights. So these edges each have a weight associated with them-- weight 1 and weight 2-- and then this output unit, what it's going to do is it is going to calculate an output based on those inputs and based on those weights. This output unit is going to multiply all the inputs by their weights, add in this bias term, which you can think of as an extra w0 term that gets added into it, and then we pass it through an activation function. 

So this then is just a graphical way of representing the same idea we saw last time, just mathematically. And we're going to call this a very simple neural network. And we'd like for this neural network to be able to learn how to calculate some function, that we want some function for the neural network to learn, and the neural network is going to learn what should the values of w0, w1, and w2 be. What should the activation function be in order to get the result that we would expect? 

So we can actually take a look at an example of this. What then is a very simple function that we might calculate? Well, if we recall back from when we were looking at propositional logic, one of the simplest functions we looked at was something like the or function, that takes two inputs-- x and y-- and outputs 1, otherwise known as true, if either one of the inputs, or both of them, are 1, and outputs a 0 if both of the inputs are 0, or false. So this then is the or function. And this was the truth table for the or function-- that as long as either of the inputs is 1, the output of the function is 1, and the only case where the output is 0 is where both of the inputs are 0.

So the question is, how could we take this and train a neural network to be able to learn this particular function? What would those weights look like? Well, we could do something like this. Here's our neural network, and I'll propose that in order to calculate the or function, we're going to use a value of 1 for each of the weights, and we'll use a bias of negative 1, and then we'll just use this step function as our activation function. 

How then does this work? Well, if I wanted to calculate something like 0 or 0, which we know to be 0, because false or false is false, then what are we going to do? Well, our output unit is going to calculate this input multiplied by the weight. 0 times 1, that's 0. Same thing here. 0 times 1, that's 0. And we'll add to that the bias, minus 1. So that'll give us some result of negative 1. If we plot that on our activation function-- negative 1 is here-- it's before the threshold. The output is always either 0 or 1, and it's only 1 at or after the threshold. Since negative 1 is before the threshold, the output that this unit provides is going to be 0. And that's what we would expect it to be, that 0 or 0 should be 0.

What if instead we had 1 or 0, where this is the number 1? Well, in this case, in order to calculate what the output is going to be, we again have to do this weighted sum. 1 times 1, that's 1. 0 times 1, that's 0. Sum of that so far is 1. Add negative 1 to that. Well, then the result is 0. And if we plot 0 on the step function, 0 ends up being here-- it's just at the threshold-- and so the output here is going to be 1, because the output of 1 or 0, that's 1. So that's what we would expect as well.

And just for one more example, if I had 1 or 1, what would the result be? Well 1 times 1 is 1. 1 times 1 is 1. The sum of those is 2. I add the bias term to that. I get the number 1. 1 plotted on this graph is way over there. That's well beyond the threshold. And so this output is going to be 1 as well. The output is always 0 or 1, depending on whether or not we're past the threshold. And this neural network then models the or function-- a very simple function, definitely-- but it still is able to model it correctly. If I give it the inputs, it will tell me what x1 or x2 happens to be. 
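
The walkthrough above can be checked with a short Python sketch: a single unit with both weights set to 1, a bias of negative 1, and the step function as its activation. The function names are hypothetical; the weights and bias are the ones proposed in the lecture.

```python
def step(z):
    """Step activation: 1 when z is at or past the threshold of 0."""
    return 1 if z >= 0 else 0


def or_network(x1, x2):
    """Single unit with weights 1 and 1 and a bias of -1."""
    weighted_sum = 1 * x1 + 1 * x2 + (-1)
    return step(weighted_sum)


# Print the full truth table for the or function.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, "or", x2, "=", or_network(x1, x2))
```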

And you could imagine trying to do this for other functions as well-- a function like the and function, for instance, that takes two inputs and calculates whether both x and y are true. So if x is 1 and y is 1, then the output of x and y is 1, but in all of the other cases, the output is 0. 

How could we model that inside of a neural network as well? Well, it turns out we could do it in the same way, except instead of negative 1 as the bias, we can use negative 2 as the bias instead. What does that end up looking like? Well, if I had 1 and 1, that should be 1, because 1, true and true, is equal to true. Well, I take 1 times 1. That's 1. 1 times 1 is 1. I get a total sum of 2 so far. Now I add the bias of negative 2, and I get the value 0. And 0, when I plot it on the activation function, is just at that threshold. And so the output is going to be 1.

But if I had any other input, for example, like 1 and 0, well, the weighted sum of these is 1 plus 0. It's going to be 1. Minus 2 is going to give us negative 1, and negative 1 is not past that threshold, and so the output is going to be zero. 
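
The same kind of sketch works for the and function: the only change from the or example is the bias, which is now negative 2. As before, the function names are hypothetical.

```python
def step(z):
    """Step activation: 1 when z is at or past the threshold of 0."""
    return 1 if z >= 0 else 0


def and_network(x1, x2):
    """Single unit with weights 1 and 1 and a bias of -2."""
    weighted_sum = 1 * x1 + 1 * x2 + (-2)
    return step(weighted_sum)


# Print the full truth table for the and function.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, "and", x2, "=", and_network(x1, x2))
```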

So those then are some very simple functions that we can model using a neural network, that has two inputs and one output, where our goal is to be able to figure out what those weights should be in order to determine what the output should be. And you could imagine generalizing this to calculate more complex functions as well, that maybe given the humidity and the pressure, we want to calculate what's the probability that it's going to rain, for example. Or you might want to do a regression-style problem, where given some amount of advertising and given what month it is maybe, we want to predict what our expected sales are going to be for that particular month. So you could imagine these inputs and outputs being different as well. 

And it turns out that in some problems, we're not just going to have two inputs, and the nice thing about these neural networks is that we can compose multiple units together-- make our networks more complex-- just by adding more units into this particular neural network. So the network we've been looking at has two inputs and one output. But we could just as easily say, let's go ahead and have three inputs in there, or have even more inputs, where we could arbitrarily decide, however many inputs there are to our problem, all going to be calculating some sort of output that we care about figuring out the value of. 

How then does the math work for figuring out that output? Well, it's going to work in a very similar way. In the case of two inputs, we had two weights indicated by these edges, and we multiplied the weights by the numbers, adding this bias term, and we'll do the same thing in the other cases as well. If I have three inputs, you'll imagine multiplying each of these three inputs by each of these weights. If I had five inputs instead, we're going to do the same thing. Here, I'm saying sum up from 1 to 5, xi multiplied by weight i. So take each of the five input variables, multiply them by their corresponding weight, and then add the bias to that. So this would be a case where there are five inputs into this neural network, for example. But there could be arbitrarily many nodes that we want inside of this neural network, where each time we're just going to sum up all of those input variables multiplied by the weight, and then add the bias term at the very end.
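
A minimal sketch of that sum for any number of inputs might look like the following. The function name and the example numbers are hypothetical; the only thing taken from the lecture is the formula itself, the sum of each input times its weight, plus the bias.

```python
def weighted_sum(inputs, weights, bias):
    """Sum each input multiplied by its corresponding weight, then add the bias."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Five hypothetical inputs and weights, as in the five-input case.
inputs = [1, 2, 3, 4, 5]
weights = [1, 0, -1, 0, 2]
print(weighted_sum(inputs, weights, bias=0.5))  # 1 + 0 - 3 + 0 + 10 + 0.5 = 8.5
```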

And so this allows us to be able to represent problems that have even more inputs, just by growing the size of our neural network. Now, the next question we might ask is a question about how it is that we train these neural networks. In the case of the or function and the and function, they were simple enough functions that I could just tell you here what the weights should be, and you could probably reason through it yourself what the weights should be in order to calculate the output that you want.

But in general, with functions like predicting sales or predicting whether or not it's going to rain, these are much trickier functions to be able to figure out. We would like the computer to have some mechanism of calculating what it is that the weights should be-- how it is to set the weights-- so that our neural network is able to accurately model the function that we care about trying to estimate. 

And it turns out that the strategy for doing this, inspired by the domain of calculus, is a technique called gradient descent. And what gradient descent is, it is an algorithm for minimizing loss when you're training a neural network. And recall that loss refers to how bad our hypothesis function happens to be, that we can define certain loss functions, and we saw some examples of loss functions last time that just give us a number for any particular hypothesis, saying how poorly does it model the data? How many examples does it get wrong? How are they worse or less bad as compared to other hypothesis functions that we might define? And this loss function is just a mathematical function, and when you have a mathematical function, in calculus, what you could do is calculate something known as the gradient, which you can think of as something like a slope. It's the direction the loss function is moving at any particular point. And what it's going to tell us is in which direction should we be moving these weights in order to minimize the amount of loss?

And so generally speaking-- we won't get into the calculus of it-- but the high-level idea for gradient descent is going to look something like this. If we want to train a neural network, we'll go ahead and start just by choosing the weights randomly. Just pick random weights for all of the weights in the neural network. And then we'll use the input data that we have access to in order to train the network in order to figure out what the weights should actually be. So we'll repeat this process again and again. 

The first step is we're going to calculate the gradient based on all of the data points. So we'll look at all the data and figure out what the gradient is at the place where we currently are-- for the current setting of the weights-- which tells us in which direction we should move the weights in order to minimize the total amount of loss, in order to make our solution better. And once we've calculated that gradient-- which direction we should move in the loss function-- well, then we can just update those weights according to the gradient, take a small step in the direction the gradient indicates in order to try to make our solution a little bit better. And the size of the step that we take, that's going to vary, and you can choose that when you're training a particular neural network.

But in short, the idea is going to be take all of the data points, figure out based on those data points in what direction the weights should move, and then move the weights one small step in that direction. And if you repeat that process over and over again, adjusting the weights a little bit at a time based on all the data points, eventually, you should end up with a pretty good solution to trying to solve this sort of problem. At least that's what we would hope to happen. 
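
To make that loop concrete, here is a minimal sketch of gradient descent in plain Python. Several things in it are assumptions made only for illustration and do not come from the lecture: the model is a single weight and bias with no activation function, the loss is mean squared error, the data are generated from the line y = 2x + 1, and the learning rate and step count are arbitrary choices that happen to work for this data.

```python
import random


def gradient_descent(data, learning_rate=0.05, steps=2000, seed=0):
    """Fit y = w*x + b by repeatedly stepping against the gradient of the loss."""
    rng = random.Random(seed)
    # Start with random weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(data)
    for _ in range(steps):
        # Gradient of the mean squared error, computed from ALL data points.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Take a small step in the direction that reduces the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data drawn from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = gradient_descent(data)
print(round(w, 3), round(b, 3))
```

The `learning_rate` parameter is the size of the step mentioned above; too large a value can make the updates overshoot, and too small a value makes training slow.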

Now as you look at this algorithm, a good question to ask anytime you're analyzing an algorithm is, what is going to be the expensive part of doing the calculation? What's going to take a lot of work to try to figure out? What is going to be expensive to calculate? And in particular, in the case of gradient descent, the really expensive part is this all data points part right here, having to take all of the data points and using all of those data points to figure out what the gradient is at this particular setting of all of the weights, because odds are, in a big machine learning problem where you're trying to solve a big problem with a lot of data, you have a lot of data points to calculate over, and figuring out the gradient based on all of those data points is going to be expensive. And you'll have to do it many times. You'll likely repeat this process again and again and again, going through all the data points, taking one small step over and over, as you try and figure out what the optimal setting of those weights happens to be.

It turns out that we would ideally like to be able to train our neural networks faster to be able to more quickly converge to some sort of solution that is going to be a good solution to the problem. So in that case, there are alternatives to just standard gradient descent, which looks at all of the data points at once. We can employ a method like stochastic gradient descent, which will randomly just choose one data point at a time to calculate the gradient based on, instead of calculating it based on all of the data points. So the idea there is that we have some setting of the weights, we pick a data point, and based on that one data point, we figure out in which direction should we move all of the weights, and move the weights in that small direction, then take another data point and do that again, and repeat this process again and again, maybe looking at each of the data points multiple times, but each time, only using one data point to calculate the gradient to calculate which direction we should move in. 

Now just using one data point instead of all of the data points probably gives us a less accurate estimate of what the gradient actually is. But on the plus side, it's going to be much faster to be able to calculate, that we can much more quickly calculate what the gradient is, based on one data point, instead of calculating based on all of the data points and having to do all of that computational work again and again. So there are trade-offs here between looking at all of the data points and just looking at one data point. 
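
A sketch of the stochastic variant, under the same illustrative assumptions as a simple linear model with squared-error loss (neither of which the lecture specifies), differs only in that each update looks at one randomly chosen data point. The learning rate, step count, and data are hypothetical.

```python
import random


def stochastic_gradient_descent(data, learning_rate=0.01, steps=10000, seed=0):
    """Fit y = w*x + b, estimating the gradient from one random point per step."""
    rng = random.Random(seed)
    # Start with random weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        # Pick a single data point at random.
        x, y = rng.choice(data)
        error = (w * x + b) - y
        # Gradient of the squared error for just this one point.
        w -= learning_rate * 2 * error * x
        b -= learning_rate * 2 * error
    return w, b


# Hypothetical training data drawn from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = stochastic_gradient_descent(data)
print(round(w, 3), round(b, 3))
```

Each step here costs the same regardless of how many data points there are, which is the trade-off described above: cheaper updates in exchange for a noisier estimate of the gradient.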

And it turns out that a middle ground-- and this is also quite popular-- is a technique called mini-batch gradient descent, where the idea there is instead of looking at all of the data versus just a single point, we instead divide our dataset up into small batches-- groups of data points-- where you can decide how big a particular batch is, but in short, you're just going to look at a small number of points at any given time, hopefully getting a more accurate estimate of the gradient, but also not requiring all of the computational effort needed to look at every single one of these data points.
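
The mini-batch idea can be sketched in the same illustrative setting: shuffle the data, cut it into small batches, and take one step per batch. The linear model, squared-error loss, batch size, learning rate, and number of passes over the data are all assumptions made for this example.

```python
import random


def minibatch_gradient_descent(data, batch_size=2, learning_rate=0.01,
                               epochs=3000, seed=0):
    """Fit y = w*x + b, estimating the gradient from a small batch per step."""
    rng = random.Random(seed)
    # Start with random weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    points = list(data)
    for _ in range(epochs):
        # Shuffle so that the batches differ from one pass to the next.
        rng.shuffle(points)
        for start in range(0, len(points), batch_size):
            batch = points[start:start + batch_size]
            # Average gradient of the squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical training data drawn from y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = minibatch_gradient_descent(data)
print(round(w, 3), round(b, 3))
```

Setting `batch_size` to 1 gives a form of stochastic gradient descent, and setting it to the size of the dataset gives standard gradient descent over all data points.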

So gradient descent then is this technique that we can use in order to train these neural networks in order to figure out what the setting of all of these weights should be, if we want some way to try and get an accurate notion of how it is that this function should work, some way of modeling how to transform the inputs into particular outputs. 

So far, the networks that we've taken a look at have all been structured similar to this. We have some number of inputs-- maybe two or three or five or more-- and then we have one output that is just predicting like rain or no rain, or just predicting one particular value. But often in machine learning problems, we don't just care about one output. We might care about an output that has multiple different values associated with it. 

So in the same way that we could take a neural network and add units to the input layer, we can likewise add outputs to the output layer as well. Instead of just one output, you could imagine we have two outputs, or we could have like four outputs, for example, where in each case, as we add more inputs or add more outputs, if we want to keep this network fully connected between these two layers, we just need to add more weights, that now each of these input nodes have four weights associated with each of the four outputs, and that's true for each of these various different input nodes. 

So as we add nodes, we add more weights in order to make sure that each of the inputs can somehow be connected to each of the outputs, so that each output value can be calculated based on what the value of the input happens to be. 
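
One way to picture those extra weights is as a table with one row per output and one column per input. The sketch below computes the weighted sums for a fully connected layer with three inputs and four outputs; the function name and every number in it are hypothetical, and no activation function is applied.

```python
def layer_outputs(inputs, weights, biases):
    """Compute one weighted sum per output unit in a fully connected layer.

    weights[j][i] is the weight on the edge from input i to output j.
    """
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


# Three inputs and four outputs means 3 * 4 = 12 weights, plus one bias per output.
inputs = [1, 2, 3]
weights = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]
biases = [0, 0, 0, 1]
print(layer_outputs(inputs, weights, biases))  # [1, 2, 3, 7]
```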

So what might a case be where we want multiple different output values? Well, you might consider that in the case of weather predicting, for example, we might not just care whether it's raining or not raining. There might be multiple different categories of weather that we would like to categorize the weather into. With just a single output variable, we can do a binary classification, like rain or no rain, for instance-- 1 or 0-- but it doesn't allow us to do much more than that. 

With multiple output variables, I might be able to use each one to predict something a little different. Maybe I want to categorize the weather into one of four different categories, something like, is it going to be raining or sunny or cloudy or snowy, and I now have four output variables that can be used to represent maybe the probability that it is raining, as opposed to sunny, as opposed to cloudy, or as opposed to snowy. 

How then would this neural network work? Well, we have some input variables that represent some data that we have collected about the weather. Each of those inputs gets multiplied by each of these various different weights. We have more multiplications to do, but these are fairly quick mathematical operations to perform. And then what we get is after passing them through some sort of activation function in the outputs, we end up getting some sort of number, where that number, you might imagine, you can interpret as like a probability, like a probability that it is one category, as opposed to another category. 

So here we're saying that based on the inputs, we think there is a 10% chance that it's raining, a 60% chance that it's sunny, a 20% chance that it's cloudy, and a 10% chance that it's snowy. And given that output, if these represent a probability distribution, well, then you could just pick whichever one has the highest value-- in this case, sunny-- and say that, well, most likely, we think that this categorization of inputs means that the output should be sunny, and that is what we would expect the weather to be in this particular instance.
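
Picking the most likely category from those four outputs is a one-line operation in Python. The dictionary below simply restates the probabilities from the example; the variable names are hypothetical.

```python
# Output probabilities from the example: one value per weather category.
probabilities = {"raining": 0.1, "sunny": 0.6, "cloudy": 0.2, "snowy": 0.1}

# Choose the category whose output value is highest.
prediction = max(probabilities, key=probabilities.get)
print(prediction)  # sunny
```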

So this allows us to do these sorts of multi-class classifications, where instead of just having a binary classification-- 1 or 0-- we can have as many different categories as we want, and we can have our neural network output these probabilities over which categories are more likely than other categories, and using that data, we're able to draw some sort of inference on what it is that we should do.

So this was sort of the idea of supervised machine learning. I can give this neural network a whole bunch of data-- whole bunch of input data-- corresponding to some label, some output data-- like we know that it was raining on this day, we know that it was sunny on that day-- and using all of that data, the algorithm can use gradient descent to figure out what all of the weights should be in order to create some sort of model that hopefully allows us a way to predict what we think the weather is going to be. 

But neural networks have a lot of other applications as well. You can imagine applying the same sort of idea to a reinforcement learning sort of example as well. Well, you remember that in reinforcement learning, what we wanted to do is train some sort of agent to learn what action to take depending on what state they currently happen to be in. So depending on the current state of the world, we wanted the agent to pick one of the actions that is available to them.

And you might model that by having each of these input variables represent some information about the state-- some data about what state our agent is currently in-- and then the output, for example, could be each of the various different actions that our agent could take-- action 1, 2, 3, and 4, and you might imagine that this network would work in the same way, that based on these particular inputs we go ahead and calculate values for each of these outputs, and those outputs could model which action is better than other actions, and we could just choose, based on looking at those outputs, which actions we should take. 

And so these neural networks are very broadly applicable, that all they're really doing is modeling some mathematical function. So anything that we can frame as a mathematical function, something like classifying inputs into various different categories, or figuring out based on some input state what action we should take-- these are all mathematical functions that we could attempt to model by taking advantage of this neural network structure, and in particular, taking advantage of this technique, gradient descent, that we can use in order to figure out what the weights should be in order to do this sort of calculation. 

Now how is it that you would go about training a neural network that has multiple outputs instead of just one? Well, with just a single output, we could see what the output for that value should be, and then you update all of the weights that corresponded to it. And when we have multiple outputs, at least in this particular case, we can really think of this as four separate neural networks, that really we just have one network here that has these three inputs, corresponding with these three weights, corresponding to this one output value. And the same thing is true for this output value. This output value effectively defines yet another neural network that has these same three inputs, but a different set of weights that correspond to this output. And likewise, this output has its own set of weights as well, and the same thing for the fourth output too. 

And so if you wanted to train a neural network that had four outputs instead of just one, in this case where the inputs are directly connected to the outputs, you could really think of this as just training four independent neural networks. We know what the outputs for each of these four should be based on our input data, and using that data, we can begin to figure out what all of these individual weights should be, and maybe there's an additional step at the end to make sure that we turn these values into a probability distribution, such that we can interpret which one is better than another or more likely than another as a category or something like that.
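
The lecture does not say here how that final step is done. One common choice, offered only as an assumption, is the softmax function, which turns any list of output values into positive numbers that sum to 1 while keeping the largest value the largest. A minimal sketch, with hypothetical output values:

```python
import math


def softmax(values):
    """Turn arbitrary output values into a probability distribution."""
    # Subtracting the maximum first avoids overflow for large values.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw values for four outputs.
raw_outputs = [1.0, 3.0, 2.0, 0.5]
distribution = softmax(raw_outputs)
print([round(p, 3) for p in distribution])
print(sum(distribution))
```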

So this then seems like it does a pretty good job of taking inputs and trying to predict what outputs should be, and we'll see some real examples of this in just a moment as well. But it's important then to think about what the limitations of this sort of approach are, of just taking some linear combination of inputs and passing it into some sort of activation function. And it turns out that when we do this in the case of binary classification-- I'm trying to predict like does it belong to one category or another-- we can only predict things that are linearly separable, because we're taking a linear combination of inputs and using that to define some decision boundary or threshold. Then what we get is a situation where if we have this set of data, we can predict a line that separates linearly the red points from the blue points.

But a single unit that is making a binary classification, otherwise known as a perceptron, can't deal with a situation like this, where-- we've seen this type of situation before-- where there is no straight line that just goes straight through the data that will divide the red points away from the blue points. It's a more complex decision boundary. The decision boundary somehow needs to capture the things inside of the circle, and there isn't really a line that will allow us to deal with that. So this is the limitation of the perceptron-- these units that just make these binary decisions based on their inputs-- that a single perceptron is only capable of learning a linearly separable decision boundary. All it can do is define a line. And sure, it can give us probabilities based on how close to that decision boundary we are, but it can only really decide based on a linear decision boundary. 

And so this doesn't seem like it's going to generalize well to situations where real-world data is involved, because real-world data often isn't linearly separable. It often isn't the case that we can just draw a line through the data and be able to divide it up into multiple groups. 
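One way to see this limitation for yourself is with a small experiment. What follows is a rough sketch, not the lecture's code, of a single perceptron trained with the classic perceptron learning rule. The two tiny data sets (AND, which a line can separate, and XOR, which no line can) are chosen here just for illustration.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum reaches the threshold."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else 0


def train_perceptron(data, epochs=100, learning_rate=1.0):
    """Perceptron learning rule: nudge the weights whenever a point is misclassified."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def accuracy(data, weights, bias):
    correct = sum(1 for x, target in data if predict(weights, bias, x) == target)
    return correct / len(data)


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND_DATA, *train_perceptron(AND_DATA))
xor_accuracy = accuracy(XOR_DATA, *train_perceptron(XOR_DATA))
print("AND:", and_accuracy, "XOR:", xor_accuracy)
```

The perceptron gets every AND point right, but on XOR it can never classify all four points correctly, because any single line leaves at least one point on the wrong side.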

So what then is the solution to this? Well, what was proposed was the idea of a multilayer neural network, that so far, all of the neural networks we've seen have had a set of inputs and a set of outputs, and the inputs are connected to those outputs. But in a multi-layer neural network, this is going to be an artificial neural network that has an input layer still, it has an output layer, but also has one or more hidden layers in between-- other layers of artificial neurons, or units, that are going to calculate their own values as well. 

So instead of a neural network that looks like this, with three inputs and one output, you might imagine, in the middle here, injecting a hidden layer-- something like this. This is a hidden layer that has four nodes. You could choose how many nodes or units end up going into the hidden layer, and you could have multiple hidden layers as well. And so now each of these inputs isn't directly connected to the output. Each of the inputs is connected to this hidden layer, and then all of the nodes in the hidden layer, those are connected to the one output. 

And so this is just another step that we can take towards calculating more complex functions. Each of these hidden units will calculate its output value, otherwise known as its activation, based on a linear combination of all the inputs. And once we have values for all of these nodes, as opposed to this just being the output, we do the same thing again-- calculate the output for this node, based on multiplying each of the values for these units by their weights as well. 

So in effect, the way this works is that we start with inputs. They get multiplied by weights in order to calculate values for the hidden nodes. Those get multiplied by weights in order to figure out what the ultimate output is going to be. And the advantage of layering things like this is it gives us an ability to model more complex functions, that instead of just having a single decision boundary-- a single line dividing the red points from the blue points-- each of these hidden nodes can learn a different decision boundary, and we can combine those decision boundaries to figure out what the ultimate output is going to be. And as we begin to imagine more complex situations, you could imagine each of these nodes learning some useful property or learning some useful feature of all of the inputs and somehow learning how to combine those features together in order to get the output that we actually want. 
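Here is a small sketch of that forward pass, under some assumptions of my own: the weights are picked by hand rather than learned, each unit uses a simple step function as its activation, and the task is XOR, which no single line can separate. One hidden unit learns an "at least one input is on" boundary, the other learns a "both inputs are on" boundary, and the output unit combines the two.

```python
def step(value):
    """Step activation: 1 once the weighted sum reaches 0, otherwise 0."""
    return 1 if value >= 0 else 0


def layer(inputs, weight_rows, biases):
    """Each unit takes a linear combination of the inputs, then applies the activation."""
    return [
        step(bias + sum(w * x for w, x in zip(row, inputs)))
        for row, bias in zip(weight_rows, biases)
    ]


# Hand-picked (hypothetical) weights, not learned ones.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [-0.5, -1.5]   # unit 0 acts like OR, unit 1 acts like AND
OUTPUT_WEIGHTS = [[1.0, -1.0]]
OUTPUT_BIASES = [-0.5]         # fires when OR is on but AND is off


def network(inputs):
    hidden = layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]


for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, network(point))
```

Neither hidden unit solves the problem on its own, since each one is still just a line, but combining their two decision boundaries gives the output we want.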

Now the natural question, when we begin to look at this now, is to ask the question of, how do we train a neural network that has hidden layers inside of it? And this turns out to initially be a bit of a tricky question, because the input data we are given is we are given values for all of the inputs, and we're given what the value of the output should be-- what the category is, for example-- but the input data doesn't tell us what the values for all of these nodes should be. So we don't know how far off each of these nodes actually is, because we're only given data for the inputs and the outputs. The reason this is called the hidden layer is because the data that is made available to us doesn't tell us what the values for all of these intermediate nodes should actually be. 

And so the strategy people came up with was to say that if you know what the error or the loss is on the output node, well, then based on what these weights are-- if one of these weights is higher than another-- you can calculate an estimate for how much the error from this node was due to this part of the hidden node, or this part of the hidden layer, or this part of the hidden layer, based on the values of these weights, in effect saying, that based on the error from the output, I can backpropagate the error and figure out an estimate for what the error is for each of these nodes in the hidden layer as well. 

And there's some more calculus here that we won't get into the details of, but the idea of this algorithm is known as backpropagation. It's an algorithm for training a neural network with multiple different hidden layers. And the idea for this-- the pseudocode for it-- will again be, if we want to run gradient descent with backpropagation, we'll start with a random choice of weights as we did before, and now we'll go ahead and repeat the training process again and again. But what we're going to do each time is now we're going to calculate the error for the output layer first. We know the output and what it should be, and we know what we calculated, so we figure out what the error there is. But then we're going to repeat, for every layer, starting with the output layer, moving back into the hidden layer, then the hidden layer before that if there are multiple hidden layers, going back all the way to the very first hidden layer, assuming there are multiple, we're going to propagate the error back one layer-- whatever the error was from the output-- figure out what the error should be a layer before that based on what the values of those weights are. And then we can update those weights. 
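To give a feel for what that pseudocode could look like once the calculus is filled in, here is a minimal sketch in plain Python. It is not the lecture's code, and it makes several assumptions: a single hidden layer, sigmoid activations everywhere, a squared-error loss, and XOR as toy data. The names `init_params`, `train`, and `total_loss` are just illustrative.

```python
import copy
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_inputs, n_hidden, seed=0):
    """Start with a random choice of weights."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return w_hidden, [0.0] * n_hidden, w_out, 0.0


def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(b_out + sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def total_loss(data, params):
    return sum((target - forward(x, *params)[1]) ** 2 for x, target in data)


def train(data, params, epochs=5000, learning_rate=0.5):
    w_hidden, b_hidden, w_out, b_out = copy.deepcopy(params)
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
            # 1. Calculate the error for the output layer.
            delta_out = (output - target) * output * (1 - output)
            # 2. Propagate that error back one layer, scaled by the weights.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                            for j in range(len(hidden))]
            # 3. Update the weights using those error estimates.
            for j in range(len(hidden)):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                b_hidden[j] -= learning_rate * delta_hidden[j]
                for i in range(len(x)):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
            b_out -= learning_rate * delta_out
    return w_hidden, b_hidden, w_out, b_out


XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
start = init_params(n_inputs=2, n_hidden=4)
trained = train(XOR_DATA, start)
print("loss before:", total_loss(XOR_DATA, start))
print("loss after: ", total_loss(XOR_DATA, trained))
```

With more hidden layers, step 2 would simply repeat, each layer's error estimate being computed from the layer after it.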

So graphically, the way you might think about this is that we first start with the output. We know what the output should be. We know what output we calculated. And based on that, we can figure out, all right, how do we need to update those weights, backpropagating the error to these nodes. And using that, we can figure out how we should update these weights. And you might imagine if there are multiple layers, we could repeat this process again and again to begin to figure out how all of these weights should be updated. 

And this backpropagation algorithm is really the key algorithm that makes neural networks possible, and makes it possible to take these multi-level structures and be able to train those structures, depending on what the values of these weights are in order to figure out how it is that we should go about updating those weights in order to create some function that is able to minimize the total amount of loss, to figure out some good setting of the weights that will take the inputs and translate it into the output that we expect. 

And this works, as we said, not just for a single hidden layer, but you can imagine multiple hidden layers, where each hidden layer-- we just defined however many nodes we want-- where each of the nodes in one layer, we can connect to the nodes in the next layer, defining more and more complex networks that are able to model more and more complex types of functions. 

And so this type of network is what we might call a deep neural network, part of a larger family of deep learning algorithms, if you've ever heard that term. And all deep learning is about is it's using multiple layers to be able to predict and be able to model higher-level features inside of the input, to be able to figure out what the output should be. And so the deep neural network is just a neural network that has multiple of these hidden layers, where we start at the input, calculate values for this layer, then this layer, then this layer, and then ultimately get an output. 

And this allows us to be able to model more and more sophisticated types of functions, that each of these layers can calculate something a little bit different. And we can combine that information to figure out what the output should be. Of course, as with any situation of machine learning, as we begin to make our models more and more complex, to model more and more complex functions, the risk we run is something like overfitting. And we talked about overfitting last time in the context of overfitting based on when we were training our models to be able to learn some sort of decision boundary, where overfitting happens when we fit too closely to the training data, and as a result, we don't generalize well to other situations as well. 

And one of the risks we run with a far more complex neural network that has many, many different nodes is that we might overfit based on the input data; we might grow over-reliant on certain nodes to calculate things just purely based on the input data that doesn't allow us to generalize very well to the output. 

And there are a number of strategies for dealing with overfitting, but one of the most popular in the context of neural networks is a technique known as dropout. And what dropout does is, when we're training the neural network, temporarily remove units, temporarily remove these artificial neurons from our network, chosen at random, and the goal here is to prevent over-reliance on certain units. So what generally happens in overfitting is that we begin to over-rely on certain units inside the neural network to be able to tell us how to interpret the input data. What dropout will do is randomly remove some of these units in order to reduce the chance that we over-rely on certain units, to make our neural network more robust, to be able to handle the situations even when we just drop out particular neurons entirely. 

So the way that might work is we have a network like this, and as we're training it, when we go about trying to update the weights the first time, we'll just randomly pick some percentage of the nodes to drop out of the network. It's as if those nodes aren't there at all. It's as if the weights associated with those nodes aren't there at all. And we'll train in this way. Then the next time we update the weights, we'll pick a different set and just go ahead and train that way, and then again randomly choose and train with other nodes that have been dropped out as well. And the goal of that is that after the training process, if you train by dropping out random nodes inside of this neural network, you hopefully end up with a network that's a little bit more robust, that doesn't rely too heavily on any one particular node, but more generally learns how to approximate a function in general. 
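The core of that idea fits in a few lines. This is only a sketch under my own assumptions: `dropout_mask` and `apply_dropout` are made-up helper names, the 50% drop rate is arbitrary, and the activations are placeholder values. Library implementations typically also rescale the surviving activations by 1 / (1 - rate), which is left out here to keep the picture simple.

```python
import random


def dropout_mask(n_units, drop_rate, rng):
    """Pick, at random, which units stay (1) and which are dropped (0)."""
    return [0 if rng.random() < drop_rate else 1 for _ in range(n_units)]


def apply_dropout(activations, mask):
    """A dropped unit contributes nothing, as if it weren't there at all."""
    return [a * m for a, m in zip(activations, mask)]


rng = random.Random(0)
hidden_activations = [0.9, 0.1, 0.4, 0.7, 0.3, 0.8, 0.5, 0.2]  # placeholder values

# A fresh mask is drawn for every weight update, so different units drop each time.
for update in range(3):
    mask = dropout_mask(len(hidden_activations), drop_rate=0.5, rng=rng)
    print(update, apply_dropout(hidden_activations, mask))
```

Dropout is only applied while training; when the network is actually used to make predictions, all of the units are kept.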

So that then is a look at some of these techniques that we can use in order to implement a neural network, to get at the idea of taking this input, passing it through these various different layers, in order to produce some sort of output. And what we'd like to do now is take those ideas and put them into code. And to do that, there are a number of different machine learning libraries-- neural network libraries-- that we can use that allow us to get access to someone's implementation of backpropagation and all of these hidden layers. 

And one of the most popular, developed by Google, is known as TensorFlow, a library that we can use for quickly creating neural networks and modeling them and running them on some sample data to see what the output is going to be. And before we actually start writing code, we'll go ahead and take a look at TensorFlow's Playground, which will be an opportunity for us just to play around with this idea of neural networks in different layers, just to get a sense for what it is that we can do by taking advantage of neural networks. 

So let's go ahead and go into TensorFlow's Playground, which you can go to by visiting that URL from before. And what we're going to do now is we're going to try and learn the decision boundary for this particular output. I want to learn to separate the orange points from the blue points, and I'd like to learn some sort of setting of weights inside of a neural network that will be able to separate those from each other. The features we have access to, our input data, are the x value and the y value, so the two values along each of the two axes. And what I'll do now is I can set particular parameters, like what activation function I would like to use, and I'll just go ahead and press Play and see what happens. 

And what happens here is that you'll see that just by using these two input features-- the x value and the y value, with no hidden layers-- just take the input, x and y values, and figure out what the decision boundary is-- our neural network learns pretty quickly that in order to divide these two points, we should just use this line. This line acts as the decision boundary that separates this group of points from that group of points, and it does it very well. You can see up here what the loss is. The training loss is zero, meaning we were able to perfectly model separating these two points from each other inside of our training data. 

So this was a fairly simple case of trying to apply a neural network, because the data is very clean. It's very nicely linearly separable. We can just draw a line that separates all of those points from each other. 

Let's now consider a more complex case. So I'll go ahead and pause the simulation, and we'll go ahead and look at this data set here. This data set is a little bit more complex now. In this data set, we still have blue and orange points that we'd like to separate from each other, but there is no single line that we can draw that is going to be able to figure out how to separate the blue from the orange, because the blue is located in these two quadrants and the orange is located here and here. It's a more complex function to be able to learn. 

So let's see what happens if we just try and predict based on those inputs-- the x- and y-coordinates-- what the output should be. Press Play, and what you'll notice is that we're not really able to draw much of a conclusion, that we're not able to very cleanly see how we should divide the orange points from the blue points, and you don't see a very clean separation there. So it seems like we don't have enough sophistication inside of our network to be able to model something that is that complex. We need a better model for this neural network. 

And I'll do that by adding a hidden layer. So now I have the hidden layer that has two neurons inside of it. So I have two inputs that then go to two neurons inside of a hidden layer that then go to our output, and now I'll press Play, and what you'll notice here is that we're able to do slightly better. We're able to now say, all right, these points are definitely blue. These points are definitely orange. We're still struggling a little bit with these points up here though, and what we can do is we can see for each of these hidden neurons what is it exactly that these hidden neurons are doing. 

Each hidden neuron is learning its own decision boundary, and we can see what that boundary is. This first neuron is learning, all right, this line that seems to separate some of the blue points from the rest of the points. This other hidden neuron is learning another line that seems to be separating the orange points in the lower right from the rest of the points. So that's why we're able to sort of figure out these two areas in the bottom region, but we're still not able to perfectly classify all of the points. 

So let's go ahead and add another neuron-- now we've got three neurons inside of our hidden layer-- and see what we're able to learn now. All right. Well, now we seem to be doing a better job. By learning three different decision boundaries, with each of the three neurons inside of our hidden layer, we're able to much better figure out how to separate these blue points from the orange points. And you can see what each of these hidden neurons is learning. Each one is learning a slightly different decision boundary, and then we're combining those decision boundaries together to figure out what the overall output should be. 

And we can try it one more time by adding a fourth neuron there and try learning that. And it seems like now we can do even better at trying to separate the blue points from the orange points, but we were only able to do this by adding a hidden layer, by adding some layer that is learning some other boundaries, and combining those boundaries to determine the output. And the size and thickness of these lines indicate how high these weights are, how important each of these inputs is, for making this sort of calculation. 

And we can do maybe one more simulation. Let's go ahead and try this on a data set that looks like this. Go ahead and get rid of the hidden layer. Here now we're trying to separate the blue points from the orange points, where all the blue points are located, again, inside of a circle, effectively. So we're not going to be able to learn a line. Notice I press Play, and we're really not able to draw any sort of classification at all, because there is no line that cleanly separates the blue points from the orange points. 

So let's try to solve this by introducing a hidden layer. I'll go ahead and press Play. And all right. With two neurons and a hidden layer, we're able to do a little better, because we effectively learned two different decision boundaries. We learned this line here, and we learned this line on the right-hand side. And right now, we're just saying, all right, well, if it's in-between, we'll call it blue, and if it's outside, we'll call it orange. So, not great, but certainly better than before. We're learning one decision boundary and another, and based on those, we can figure out what the output should be. 

But let's now go ahead and add a third neuron and see what happens now. I go ahead and train it. And now, using three different decision boundaries that are learned by each of these hidden neurons, we're able to much more accurately model this distinction between blue points and orange points. We're able to figure out, maybe with these three decision boundaries, combining them together, you can imagine figuring out what the output should be and how to make that sort of classification. 

And so the goal here is just to get a sense for having more neurons in these hidden layers that allows us to learn more structure in the data, allows us to figure out what the relevant and important decision boundaries are. And then using this backpropagation algorithm, we're able to figure out what the values of these weights should be in order to train this network to be able to classify one category of points away from another category of points instead. And this is ultimately what we're going to be trying to do whenever we're training a neural network. 

So let's go ahead and actually see an example of this. You'll recall from last time that we had this banknotes file that included information about counterfeit banknotes as opposed to authentic banknotes, where it had four different values for each banknote and then a categorization of whether that bank note is considered to be authentic or a counterfeit note. And what I wanted to do was, based on that input information, figure out some function that could calculate based on the input information what category it belonged to. 

And what I've written here in banknotes.py is a neural network that will learn just that, a network that learns, based on all of the input, whether or not we should categorize a banknote as authentic or as counterfeit. The first step is the same as what we saw from last time. I'm really just reading the data in and getting it into an appropriate format. And so this is where more of the writing Python code on your own comes in, in terms of manipulating this data, massaging the data into a format that will be understood by a machine learning library like scikit-learn or like TensorFlow. 

And so here I separate it into a training and a testing set. And now what I'm doing down below is I'm creating a neural network. Here I'm using tf, which stands for TensorFlow. Up above I said, import tensorflow as tf. So you have just an abbreviation that we'll often use, so we don't need to write out TensorFlow every time we want to use anything inside of the library. I'm using tf.keras. Keras is an API, a set of functions that we can use in order to manipulate neural networks inside of TensorFlow, and it turns out there are other machine learning libraries that also use the Keras API. 

But here, I'm saying, all right, go ahead and give me a model that is a sequential model-- a sequential neural network-- meaning one layer after another. And now I'm going to add to that model what layers I want inside of my neural network. So here I'm saying, model.add. Go ahead and add a dense layer-- and when we say a dense layer, we mean a layer where each of the nodes inside of the layer is going to be connected to each of the nodes from the previous layer, so we have a densely connected layer. This layer is going to have eight units inside of it. So it's going to be a hidden layer inside of a neural network with eight different units, eight artificial neurons, each of which might learn something different. And I just sort of chose eight arbitrarily. You could choose a different number of hidden nodes inside of the layer. 

And as we saw before, depending on the number of units there are inside of your hidden layer, more units means you can learn more complex functions, so maybe you can more accurately model the training data, but it comes at a cost. More units means more weights that you need to figure out how to update, so it might be more expensive to do that calculation. And you also run the risk of overfitting on the data if you have too many units, and you learn to just overfit on the training data. That's not good either. So there is a balance, and there's often a testing process, where you'll train on some data and maybe validate how well you're doing on a separate set of data-- often called a validation set-- to see, all right, which setting of parameters, how many layers should I have, how many units should be in each layer, which one of those performs the best on the validation set? So you can do some testing to figure out what these hyperparameters, so-called, should be equal to. 

Next I specify what the input_shape is, meaning what does my input look like? My input has four values, and so the input shape is just 4, because we have four inputs. And then I specify what the activation function is. And the activation function, again, we can choose. There are a number of different activation functions. Here I'm using relu, which you might recall from earlier. And then I'll add an output layer. So I have my hidden layer. 

Now I'm adding one more layer that will just have one unit, because all I want to do is predict something like counterfeit bill or authentic bill. So I just need a single unit. And the activation function I'm going to use here is that sigmoid activation function, which again was that S-shaped curve that just gave us like a probability of, what is the probability that this is a counterfeit bill as opposed to an authentic bill? So that then is the structure of my neural network-- sequential neural network that has one hidden layer with eight units inside of it, and then one output layer that just has a single unit inside of it. And I can choose how many units there are. I can choose the activation function. 

Then I'm going to compile this model. TensorFlow gives you a choice of how you would like to optimize the weights-- there are various different algorithms for doing that-- what type of loss function you want to use-- again, many different options for doing that-- and then how I want to evaluate my model. Well, I care about accuracy. I care about how many of my points am I able to classify correctly versus not correctly of counterfeit or not counterfeit, and I would like it to report to me how accurate my model is performing. 

Then, now that I've defined that model, I call model.fit to say, go ahead and train the model. Train it on all the training data, plus all of the training labels-- so labels for each of those pieces of training data-- and I'm saying run it for 20 epochs, meaning go ahead and go through each of these training points 20 times effectively, go through the data 20 times and keep trying to update the weights. If I did it for more, I could train for even longer and maybe get a more accurate result. But then after I fit it on all the data, I'll go ahead and just test it. I'll evaluate my model using model.evaluate, built into TensorFlow, that is just going to tell me, how well do I perform on the testing data? So ultimately, this is just going to give me some numbers that tell me how well we did in this particular case. 
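Putting those pieces together, the model described above might look something like the following. Treat this as a reconstruction from the description rather than the actual contents of banknotes.py: the layer sizes, activations, accuracy metric, and 20 epochs come from the walkthrough, but the `"adam"` optimizer and `"binary_crossentropy"` loss are assumptions on my part, since the walkthrough only says that an optimizer and a loss function get chosen. The function and variable names are also just illustrative.

```python
HIDDEN_UNITS = 8      # chosen arbitrarily, as in the walkthrough
INPUT_FEATURES = 4    # four values per banknote
EPOCHS = 20


def build_model():
    # Imported here so this file can still be loaded where TensorFlow isn't installed.
    import tensorflow as tf

    # A sequential model: one layer after another.
    model = tf.keras.models.Sequential()

    # Hidden layer: 8 densely connected units with ReLU activation.
    model.add(tf.keras.layers.Dense(
        HIDDEN_UNITS, input_shape=(INPUT_FEATURES,), activation="relu"
    ))

    # Output layer: a single unit with sigmoid activation.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

    # Optimizer and loss are assumed; accuracy is the metric the walkthrough asks for.
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


def train_and_evaluate(model, X_training, y_training, X_testing, y_testing):
    # Go through the training data 20 times, updating the weights as we go.
    model.fit(X_training, y_training, epochs=EPOCHS)
    # Report how well the trained model does on data it hasn't seen.
    return model.evaluate(X_testing, y_testing, verbose=2)
```

Here `X_training` and `X_testing` would be arrays with four numbers per row, and `y_training` and `y_testing` the matching 0-or-1 labels, produced by whatever code reads and splits the banknotes data.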

So now what I'm going to do is go into banknotes and go ahead and run banknotes.py. And what's going to happen now is it's going to read in all of that training data. It's going to generate a neural network with all my inputs, my eight hidden layers, or eight hidden units inside my layer, and then an output unit, and now what it's doing is it's training. It's training 20 times, and each time, you can see how my accuracy is increasing on my training data. It starts off, the very first time, not very accurate, though better than random, something like 79% of the time, it's able to accurately classify one bill from another. But as I keep training, notice this accuracy value improves and improves and improves, until after I've trained through all of the data points 20 times, it looks like my accuracy is above 99% on the training data. And here's where I tested it on a whole bunch of testing data. And it looks like in this case, I was also like 99.8% accurate. 

So just using that, I was able to generate a neural network that can detect counterfeit bills from authentic bills based on this input data 99.8% of the time, at least based on this particular testing data. And I might want to test it with more data as well, just to be confident about that. But this is really the value of using a machine learning library like TensorFlow, and there are others available for Python and other languages as well, but all I have to do is define the structure of the network and define the data that I'm going to pass into the network, and then TensorFlow runs the backpropagation algorithm for learning what all of those weights should be, for figuring out how to train this neural network to be able to, as accurately as possible, figure out what the output values should be there as well. 

And so this then was a look at what it is that neural networks can do, just using these sequences of layer after layer after layer, and you can begin to imagine applying these to much more general problems. 

And one big problem in computing, and artificial intelligence more generally, is the problem of computer vision. Computer vision is all about computational methods for analyzing and understanding images, that you might have pictures that you want the computer to figure out how to deal with, how to process those images, and figure out how to produce some sort of useful result out of this. You've seen this in the context of social media websites that are able to look at a photo that contains a whole bunch of faces, and it's able to figure out what's a picture of whom and label those and tag them with appropriate people. 

This is becoming increasingly relevant as we begin to discuss self-driving cars. These cars now have cameras, and we would like for the computer to have some sort of algorithm that looks at the images and figures out, what color is the light, what cars are around us and in what direction, for example. And so computer vision is all about taking an image and figuring out what sort of computation-- what sort of calculation-- we can do with that image. 

It's also relevant in the context of something like handwriting recognition. This, what you're looking at, is an example of the MNIST dataset-- it's a big dataset just of handwritten digits-- that we could use to, ideally, try and figure out how to predict, given someone's handwriting, given a photo of a digit that they have drawn, can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example. So this sort of handwriting recognition is yet another task that we might want to use computer vision tasks and tools to be able to apply it towards. This might be a task that we might care about. 

So how then can we use neural networks to be able to solve a problem like this? Well, neural networks rely upon some sort of input, where that input is just numerical data. We have a whole bunch of units, where each one of them just represents some sort of number. And so in the context of something like handwriting recognition, or in the context of just an image, you might imagine that an image is really just a grid of pixels, a grid of dots, where each dot has some sort of color, and in the context of something like handwriting recognition, you might imagine that if you just fill in each of these dots in a particular way, you can generate a 2 or an 8, for example, based on which dots happen to be shaded in and which dots are not. 

And we can represent each of these pixel values just using numbers. So for a particular pixel, for example, 0 might represent entirely black. Depending on how you're representing color, it's often common to represent color values on a 0-to-255 range, so that you can represent a color using eight bits for a particular value, like how much white is in the image? So 0 might represent all black, 255 might represent entirely white as a pixel, and somewhere in between might represent some shade of gray, for example. 

But you might imagine not just having a single slider that determines how much white is in the image, but if you had a color image, you might imagine three different numerical values-- a red, green, and blue value-- where the red value controls how much red is in the image, we have one value for controlling how much green is in the pixel, and one value for how much blue is in the pixel as well. And depending on how it is that you set these values of red, green, and blue, you can get a different color. And so any pixel can really be represented in this case by three numerical values-- a red value, a green value, and a blue value. And if you take a whole bunch of these pixels, assemble them together inside of a grid of pixels, then you really just have a whole bunch of numerical values that you can use in order to perform some sort of prediction task. 

And so what you might imagine doing is using the same techniques we talked about before. Just design a neural network with a lot of inputs, that for each of the pixels, we might have one or three different inputs in the case of a color image-- a different input-- that is just connected to a deep neural network, for example. And this deep neural network might take all of the pixels inside of the image of what digit a person drew, and the output might be like 10 neurons that classify it as a 0 or a 1 or 2 or 3, or just tells us in some way what that digit happens to be. 

Now there are a couple of drawbacks to this approach. The first drawback to the approach is just the size of this input array, that we have a whole bunch of inputs. If we have a big image, that is a lot of different channels we're looking at-- a lot of inputs, and therefore, a lot of weights that we have to calculate. 
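To get a feel for the numbers involved, here is a quick back-of-the-envelope sketch. The `flatten` and `count_weights` helpers are made up for illustration, and the hidden layer of 128 units is an arbitrary choice; the 28-by-28 size is that of the MNIST digit images.

```python
def flatten(image):
    """Turn a grid of (red, green, blue) pixels into one long list of numbers."""
    return [value for row in image for pixel in row for value in pixel]


def count_weights(layer_sizes):
    """Weights in a fully connected network: every unit connects to every unit in the next layer."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# A tiny 2x2 color image: each pixel is three values from 0 to 255.
tiny_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
inputs = flatten(tiny_image)
print(len(inputs))  # 2 * 2 * 3 = 12 input values

# A 28x28 grayscale digit, a hypothetical hidden layer of 128 units, 10 outputs.
mnist_weights = count_weights([28 * 28, 128, 10])
print(mnist_weights)  # 784 * 128 + 128 * 10 = 101632

# A 1000x1000 color photo feeding the same hypothetical hidden layer.
photo_weights = count_weights([1000 * 1000 * 3, 128, 10])
print(photo_weights)
```

Even the small digit image needs on the order of a hundred thousand weights, and the count grows in direct proportion to the number of pixels.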

And a second problem is the fact that by flattening everything into just the structure of all the pixels, we've lost access to a lot of the information about the structure of the image that's relevant, that really, when a person looks at an image, they're looking at particular features of that image. They're looking at curves. They're looking at shapes. They're looking at what things can you identify in different regions of the image, and maybe put those things together in order to get a better picture of what the overall image was about. And by just turning it into pixel values for each of the pixels, sure, you might be able to learn that structure, but it might be challenging to do so. It might be helpful to take advantage of the fact that you can use properties of the image itself-- the fact that it's structured in a particular way-- to be able to improve the way that we learn based on that image too. 

So in order to figure out how we can train our neural networks to better be able to deal with images, we'll introduce a couple of ideas-- a couple of algorithms-- that we can apply that allow us to take the images and extract some useful information out of that image. And the first idea we'll introduce is the notion of image convolution. And what an image convolution is all about is it's about filtering an image, sort of extracting useful or relevant features out of the image. And the way we do that is by applying a particular filter that basically adds the value for every pixel with the values for all of the neighboring pixels to it, according to some sort of kernel matrix, which we'll see in a moment, that's going to allow us to weight these pixels in various different ways. 

And the goal of image convolution then is to extract some sort of interesting or useful features out of an image, to be able to take a pixel, and based on its neighboring pixels, maybe predict some sort of valuable information, something like taking a pixel and looking at its neighboring pixels, you might be able to predict whether or not there's some sort of curve inside the image, or whether it's forming the outline of a particular line or a shape, for example, and that might be useful if you're trying to use all of these various different features to combine them to say something meaningful about an image as a whole. 

So how then does image convolution work? Well, we start with a kernel matrix, and the kernel matrix looks something like this. And the idea of this is that given a pixel-- that would be the middle pixel-- we're going to multiply each of the neighboring pixels by these values in order to get some sort of result by summing up all of the numbers together. So I take this kernel, which you can think of as like a filter that I'm going to apply to the image. 

And let's say that I take this image. This is a four-by-four image. We'll think of it as just a black and white image, where each one is just a single pixel value, so somewhere between 0 and 255, for example. So we have a whole bunch of individual pixel values like this, and what I'd like to do is apply this kernel-- this filter, so to speak-- to this image. 

And the way I'll do that is, all right, the kernel is three-by-three. So you can imagine a five-by-five kernel or a larger kernel too. And I'll take it and just first apply it to the first three-by-three section of the image. And what I'll do is I'll take each of these pixel values and multiply it by its corresponding value in the filter matrix and add all of the results together. So here, for example, I'll say 10 times 0, plus 20, times negative 1, plus 30, times 0, so on and so forth, doing all of this calculation. And at the end, if I take all these values, multiply them by their corresponding value in the kernel, add the results together, for this particular set of nine pixels, I get the value of 10 for example. 

And then what I'll do is I'll slide this three-by-three grid effectively over. Slide the kernel by one to look at the next three-by-three section. And here I'm just sliding it over by one pixel, but you might imagine a different slide length, or maybe I jump by multiple pixels at a time if you really wanted to. You have different options here. But here I'm just sliding over, looking at the next three-by-three section. And I'll do the same math 20 times 0, plus 30, times a negative 1, plus 40, times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5. And what I end up getting is the number 20. Then you can imagine shifting over to this one, doing the same thing, calculating like the number 40, for example, and then doing the same thing here and calculating a value there as well. And so what we have now is what we'll call a feature map. We have taken this kernel, applied it to each of these various different regions, and what we get is some representation of a filtered version of that image. 
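The sliding calculation above can be sketched in a few lines of plain Python. The transcript only reads out some of the numbers, so the full four-by-four image and the three-by-three kernel below are assumptions, chosen to be consistent with the products spoken aloud (10 times 0, plus 20 times negative 1, and so on); the function name `convolve` is made up for this sketch.

```python
def convolve(image, kernel):
    """Slide a square kernel over the image one pixel at a time (no padding)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    feature_map = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Multiply each pixel by its corresponding kernel value and sum.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Assumed pixel values, consistent with the numbers read out in the lecture.
image = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [20, 30, 40, 50],
]
kernel = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]

feature_map = convolve(image, kernel)
print(feature_map)  # [[10, 20], [40, 50]]
```

With these assumed values, the first three entries match the 10, 20, and 40 mentioned above, and the last region works out to 50.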

And so to give a more concrete example of why it is that this kind of thing could be useful, let's take this kernel matrix, for example, which is quite a famous one, that has an 8 in the middle and then all of the neighboring pixels that get a negative 1. And let's imagine we wanted to apply that to a three-by-three part of an image that looks like this, where all the values are the same. They're all 20, for instance. 

Well, in this case, if you do 20 times 8, and then subtract 20, subtract 20, subtract 20, for each of the eight neighbors, well, the result of that is you just get that expression, which comes out to be 0. You multiply 20 by 8, but then you subtract 20 eight times, according to that particular kernel. The result of all of that is just 0. So the takeaway here is that when a lot of the pixels are the same value, we end up getting a value close to 0. 

If, though, we had something like this, 20s along this first row, then 50s in the second row, and 50s in the third row, well, then when you do this same kind of math-- 20 times negative 1, 20 times negative 1, so on and so forth-- then I get a higher value-- a value like 90, in this particular case. 

And so the more general idea here is that by applying this kernel, negative 1s, 8 in the middle, and then negative 1s, what I get is when this middle value is very different from the neighboring values-- like 50 is greater than these 20s-- then you'll end up with a value higher than 0. Like if this number is higher than its neighbors, you end up getting a bigger output, but if this value is the same as all of its neighbors, then you get a lower output, something like 0. 
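Those two calculations are easy to check directly. This is just a small sketch of the arithmetic, using the two three-by-three patches described above; the names `EDGE_KERNEL` and `apply_kernel` are made up for the example.

```python
EDGE_KERNEL = [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1],
]


def apply_kernel(patch, kernel):
    """Multiply a 3x3 patch by a 3x3 kernel element by element and sum."""
    return sum(
        patch[i][j] * kernel[i][j]
        for i in range(3)
        for j in range(3)
    )


# Every pixel has the same value: the middle matches its neighbors.
flat = [
    [20, 20, 20],
    [20, 20, 20],
    [20, 20, 20],
]

# A row of 20s above two rows of 50s: the middle sits on a boundary.
edge = [
    [20, 20, 20],
    [50, 50, 50],
    [50, 50, 50],
]

print(apply_kernel(flat, EDGE_KERNEL))  # 0
print(apply_kernel(edge, EDGE_KERNEL))  # 90
```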

And it turns out that this sort of filter can therefore be used in something like detecting edges in an image. If I want to detect like the boundaries between various different objects inside of an image, I might use a filter like this, which is able to tell whether the value of this pixel is different from the values of the neighboring pixels-- if it's like greater than the values of the pixels that happened to surround it. 

And so we can use this in terms of image filtering. And so I'll show you an example of that. I have here, in filter.py, a file that uses the Python Imaging Library, or PIL, to do some image filtering. I go ahead and open an image. And then all I'm going to do is apply a kernel to that image. It's going to be a three-by-three kernel, the same kind of kernel we saw before. And here is the kernel. This is just a list representation of the same matrix that I showed you a moment ago. The first row is negative 1, negative 1, negative 1. The second row is negative 1, 8, negative 1. The third row is all negative 1s. And then at the end, I'm going to go ahead and show the filtered image. 
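A program along those lines might look roughly like the following. This is a sketch rather than the actual filter.py: it assumes the Pillow package (the maintained version of PIL) is installed, and the function name `detect_edges` and the file name in the usage comment are placeholders.

```python
from PIL import Image, ImageFilter

# The 3x3 kernel as a flat list, row by row.
EDGE_KERNEL = [
    -1, -1, -1,
    -1, 8, -1,
    -1, -1, -1,
]


def detect_edges(image):
    """Return a new image filtered with the 3x3 edge-detection kernel."""
    return image.filter(ImageFilter.Kernel(
        size=(3, 3),
        kernel=EDGE_KERNEL,
        scale=1,  # the kernel sums to 0, so set the divisor explicitly
    ))


# Hypothetical usage with an image file on disk:
#     filtered = detect_edges(Image.open("bridge.png").convert("RGB"))
#     filtered.show()
```

Passing `scale=1` matters here, because Pillow otherwise divides by the sum of the kernel values, which for this kernel is 0.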

So if, for example, I go into convolution directory and I open up an image like bridge.png, this is what an input image might look like, just an image of a bridge over a river. Now I'm going to go ahead and run this filter program on the bridge. And what I get is this image here. Just by taking the original image and applying that filter to each three-by-three grid, I've extracted all of the boundaries, all of the edges inside the image that separate one part of the image from another. 

So here I've got a representation of boundaries between particular parts of the image. And you might imagine that if a machine learning algorithm is trying to learn like what an image is of, a filter like this could be pretty useful. Maybe the machine learning algorithm doesn't care about all of the details of the image. It just cares about certain useful features. It cares about particular shapes that are able to help it determine that based on the image, this is going to be a bridge, for example. And so this type of idea of image convolution can allow us to apply filters to images that allow us to extract useful results out of those images-- taking an image and extracting its edges, for example. 

You might imagine many other filters that could be applied to an image that are able to extract particular values as well. And a filter might have separate kernels for the red values, the green values, and the blue values that are all summed together at the end, such that you could have particular filters looking for, is there red in this part of the image? Is there green in other parts of the image? You can begin to assemble these relevant and useful filters that are able to do these calculations as well. So that then was the idea of image convolution-- applying some sort of filter to an image to be able to extract some useful features out of that image. 

But all the while, these images are still pretty big. There's a lot of pixels involved in the image. And realistically speaking, if you've got a really big image, that poses a couple of problems. One, it means a lot of input going into the neural network, but two, it also means that we really have to care about what's in each particular pixel, whereas realistically, if you're looking at an image, you often don't care whether something is in one particular pixel versus the pixel immediately to the right of it. They're pretty close together. You really just care about whether there is a particular feature in some region of the image, and maybe you don't care about exactly which pixel it happens to be. 

And so there's a technique we can use known as pooling. And what pooling is, is it means reducing the size of an input by sampling from regions inside of the input. So we're going to take a big image and turn it into a smaller image by using pooling. And in particular, one of the most popular types of pooling is called max-pooling. And what max-pooling does is it pools just by choosing the maximum value in a particular region. 

So, for example, let's imagine I had this four-by-four image, but I wanted to reduce its dimensions. I wanted to make it a smaller image, so that I have fewer inputs to work with. Well, what I could do is I could apply a two-by-two max pool, where the idea would be that I'm going to first look at this two-by-two region and say, what is the maximum value in that region? Well, it's the number 50. So we'll go ahead and just use the number 50. 

And then we'll look at this two-by-two region. What is the maximum value here? 110. So that's going to be my value. Likewise here, the maximum value looks like 20. Go ahead and put that there. Then for this last region, the maximum value was 40, so we'll go ahead and use that. And what I have now is a smaller representation of this same original image that I obtained just by picking the maximum value from each of these regions. 
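The same max-pooling step can be sketched in plain Python. The transcript only gives the four maximum values (50, 110, 20, and 40), so the other pixel values in the grid below are assumptions filled in for illustration, and `max_pool` is a made-up helper name.

```python
def max_pool(image, size):
    """Keep the largest value from each non-overlapping size-by-size region."""
    pooled = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            region = [
                image[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(region))
        pooled.append(row)
    return pooled


# Assumed 4x4 image whose regional maximums match the lecture's example.
image = [
    [30, 40, 80, 90],
    [20, 50, 100, 110],
    [0, 10, 20, 30],
    [10, 20, 40, 30],
]

print(max_pool(image, 2))  # [[50, 110], [20, 40]]
```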

So again, the advantages here are now I only have to deal with a two-by-two input instead of a four-by-four, and you can imagine shrinking the size of an image even more. But in addition to that, I'm now able to make my analysis independent of whether a particular value was in this pixel or this pixel. I don't care if the 50 was here or here. As long as it was generally in this region, I'll still get access to that value. So it makes our algorithms a little bit more robust as well. So that then is pooling-- taking the size of the image and reducing it a little bit by just sampling from particular regions inside of the image. 

And now we can put all of these ideas together-- pooling, image convolution, neural networks-- all together into another type of neural network called a convolutional neural network, or a CNN, which is a neural network that uses this convolution step, usually in the context of analyzing an image, for example. 

And so the way that a convolutional neural network works is that we start with some sort of input image-- some grid of pixels-- but rather than immediately put that into the neural network layers that we've seen before, we'll start by applying a convolution step, where the convolution step involves applying a number of different image filters to our original image in order to get what we call a feature map, the result of applying some filter to an image. And we could do this once, but in general, we'll do this multiple times getting a whole bunch of different feature maps, each of which might extract some different relevant feature out of the image, some different important characteristic of the image that we might care about using in order to calculate what the result should be. 

And in the same way that, when we train neural networks, we can train them to learn the weights between particular units inside of the neural networks, we can also train neural networks to learn what those filters should be-- what the values of the filters should be-- in order to get the most useful, most relevant information out of the original image just by figuring out what setting of those filter values-- the values inside of that kernel-- results in minimizing the loss function and minimizing how poorly our hypothesis actually performs in figuring out the classification of a particular image, for example. 

So we first apply this convolution step. Get a whole bunch of these various different feature maps. But these feature maps are quite large. There are a lot of pixel values that happen to be here. And so a logical next step to take is a pooling step, where we reduce the size of these images by using max-pooling, for example, extracting the maximum value from any particular region. There are other pooling methods that exist as well, depending on the situation. You could use something like average-pooling, where instead of taking the maximum value from a region, you take the average value from a region, which has its uses as well. 
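The only difference between these pooling methods is how each region gets summarized, which a small sketch can make concrete. The feature map values here are invented for illustration, and `pool` and `average` are hypothetical helper names.

```python
def pool(image, size, reduce):
    """Summarize each non-overlapping size-by-size region with `reduce`."""
    pooled = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            region = [
                image[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce(region))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Made-up feature map values.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 0, 9, 1],
    [4, 0, 1, 1],
]

print(pool(feature_map, 2, max))      # [[7, 4], [4, 9]]
print(pool(feature_map, 2, average))  # [[4.0, 2.0], [1.0, 3.0]]
```

In the bottom-right region, for instance, max-pooling keeps the single strong value of 9, while average-pooling smooths it down to 3.0.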

But in effect, what pooling will do is it will take these feature maps and reduce their dimensions, so that we end up with smaller grids with fewer pixels. And this then is going to be easier for us to deal with. It's going to mean fewer inputs that we have to worry about, and it's also going to mean we're more resilient, more robust, against potential movements of particular values just by one pixel, when ultimately, we really don't care about those one pixel differences that might arise in the original image. 

Now after we've done this pooling step, now we have a whole bunch of values that we can then flatten out and just put into a more traditional neural network. So we go ahead and flatten it, and then we end up with a traditional neural network that has one input for each of these values in each of these resulting feature maps after we do the convolution and after we do the pooling step. 

And so this then is the general structure of a convolutional network. We begin with the image, apply convolution, apply pooling, flatten the results, and then put that into a more traditional neural network that might itself have hidden layers. You can have deep convolutional networks that have hidden layers in between this flattened layer and the eventual output to be able to calculate various different features of those values. But this then can help us to be able to use convolution and pooling, to use our knowledge about the structure of an image, to be able to get better results, to be able to train our networks faster in order to better capture particular parts of the image. 

And there's no reason necessarily why you can only use these steps once. In fact, in practice, you'll often use convolution and pooling multiple times in multiple different steps. So what you might imagine doing is starting with an image, first applying convolution to get a whole bunch of maps, then applying pooling, then applying convolution again, because these maps are still pretty big. You can apply convolution to try and extract relevant features out of this result. Then take those results, apply pooling in order to reduce their dimensions, and then take that and feed it into a neural network that maybe has fewer inputs. 
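To see how the grid shrinks as these steps are repeated, here is a small sketch that only tracks the width of a square image. It assumes three-by-three kernels slid one pixel at a time with no padding, two-by-two pooling, and a 28-pixel-wide starting image; all of those are assumptions for illustration.

```python
def conv_output(size, kernel=3):
    """Width after convolution with no padding, sliding one pixel at a time."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Width after non-overlapping pooling (leftover pixels are dropped)."""
    return size // pool


size = 28  # assumed starting width of a square image
shapes = [size]
for _ in range(2):  # two rounds of convolution followed by pooling
    size = conv_output(size)
    shapes.append(size)
    size = pool_output(size)
    shapes.append(size)

print(shapes)  # [28, 26, 13, 11, 5]
```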

So here, I have two different convolution and pooling steps. I do convolution and pooling once, and then I do convolution and pooling a second time, each time extracting useful features from the layer before it, each time using pooling to reduce the dimensions of what you're ultimately looking at. And the goal now of this sort of model is that in each of these steps, you can begin to learn different types of features of the original image, that maybe in the first step you learn very low-level features, just learn and look for features like edges and curves and shapes, because based on pixels in their neighboring values, you can figure out, all right, what are the edges? What are the curves? What are the various different shapes that might be present there? 

But then once you have a mapping that just represents where the edges and curves and shapes happen to be, you can imagine applying the same sort of process again to begin to look for higher-level features-- look for objects, maybe look for people's eyes in facial recognition, for example, maybe look at more complex shapes like the curves on a particular number if you're trying to recognize a digit in a handwriting recognition sort of scenario. 

And then after all of that, now that you have these results that represent these higher-level features, you can pass them into a neural network, which is really just a deep neural network that looks like this, where you might imagine making a binary classification, or classifying into multiple categories, or performing various different tasks on this sort of model. 

So convolutional neural networks can be quite powerful and quite popular when it comes to trying to analyze images. We don't strictly need them. We could have just used a vanilla neural network that just operates with layer after layer as we've seen before. 

But these convolutional neural networks can be quite helpful, in particular, because of the way they model the way a human might look at an image, that instead of a human looking at every single pixel simultaneously and trying to involve all of them by multiplying them together, you might imagine that what convolution is really doing is looking at various different regions of the image and extracting relevant information and features out of those parts of the image the same way that a human might have visual receptors that are looking at particular parts of what they see, and using those, combining them, to figure out what meaning they can draw from all of those various different inputs. 

And so you might imagine applying this to a situation like handwriting recognition. So we'll go ahead and see an example of that now. I'll go ahead and open up handwriting.py. Again, what we do here is we first import TensorFlow. And then, TensorFlow, it turns out, has a few datasets that are built in-- built into the library that you can just immediately access. 

And one of the most famous datasets in machine learning is the MNIST dataset, which is just a dataset of a whole bunch of samples of people's handwritten digits. I showed you a slide of that a little while ago. And what we can do is just immediately access that dataset, which is built into the library, so that if I want to do something like train on a whole bunch of digits, I can just use the dataset that is provided to me. 

Of course, if I had my own dataset of handwritten images, I can apply the same idea. I'd first just need to take those images and turn them into an array of pixels, because that's the way that these are going to be formatted. They're going to be formatted as, effectively, an array of individual pixels. 

And now there's a bit of reshaping I need to do, just turning the data into a format that I can put into my convolutional neural network. So this is doing things like taking all the values and dividing them by 255. If you remember, these color values tend to range from 0 to 255. So I can divide them by 255, just to put them into a 0-to-1 range, which might be a little bit easier to train on. And then doing various other modifications to the data, just to get it into a nice usable format. 
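As a rough sketch of that kind of preparation, the following uses NumPy on a handful of randomly generated stand-in images rather than the real dataset. The array names and the choice of five fake images are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real data: 5 fake grayscale images with pixel values 0..255.
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Scale pixel values from the 0-to-255 range into the 0-to-1 range.
scaled = images / 255.0

# Add a trailing channel dimension: (5, 28, 28) -> (5, 28, 28, 1).
x = scaled.reshape(scaled.shape[0], 28, 28, 1)

print(x.shape, x.min(), x.max())
```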

But here's the interesting and important part. Here is where I create the convolutional neural network-- the CNN-- where here I'm saying, go ahead and use a sequential model. And before I could use model.add to say add a layer, add a layer, add a layer, another way I could define it is just by passing as input to the sequential neural network a list of all of the layers that I want. 

And so here, the very first layer in my model is a convolutional layer, where I'm first going to apply convolution to my image. My model is going to learn 32 different filters on the input image, where each filter is going to be a three-by-three kernel. So we saw those three-by-three kernels before, where we could multiply each value in a three-by-three grid by a value in the kernel and add all the results together. So here I'm going to learn 32 different of these three-by-three filters. 

I can again specify my activation function. And I specify what my input shape is. My input shape in the banknotes case was just 4. I had four inputs. My input shape here is going to be 28, comma, 28, comma 1, because for each of these handwritten digits, it turns out that the MNIST dataset organizes its data so that each image is a 28-by-28 pixel grid, and each one of those images only has one channel value. These handwritten digits are just black and white, so it's just a single color value representing how much black or how much white. You might imagine that in a color image, if you were doing this sort of thing, you might have three different channels-- a red, a green, and a blue channel, for example. But in the case of just handwriting recognition and recognizing a digit, we're just going to use a single value for shaded-in or not shaded-in, and it might range, but it's just a single color value. 

And that then is the very first layer of our neural network, a convolutional layer that will take the input and learn a whole bunch of different filters that we can apply to the input to extract meaningful features. 

The next step is going to be a max-pooling layer, also built right into TensorFlow, where this is going to be a layer that is going to use a pool size of two by two, meaning we're going to look at two-by-two regions inside of the image, and just extract the maximum value. Again, we've seen why this can be helpful. It'll help to reduce the size of our input. Once we've done that, we'll go ahead and flatten all of the units just into a single layer that we can then pass into the rest of the neural network. 

And now, here's the rest of the whole network. Here, I'm saying, let's add a hidden layer to my neural network with 128 units-- so a whole bunch of hidden units inside of the hidden layer-- and just to prevent overfitting, I can add a dropout to that-- say, you know what? When you're training, randomly drop out half of the nodes from this hidden layer, just to make sure we don't become over-reliant on any particular node. We begin to really generalize and stop ourselves from overfitting. So TensorFlow allows us, just by adding a single line, to add dropout into our model as well, such that when it's training, it will perform this dropout step in order to help make sure that we don't overfit on this particular data. 
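To make the dropout idea a little more concrete, here is a plain-Python sketch of what happens to one layer's values during a single training step. The hidden values are invented, and `dropout` is a made-up helper, not TensorFlow's implementation. One detail that goes beyond the lecture: the surviving values are scaled up by 1 / (1 - rate), which is how Keras's Dropout layer keeps the expected total roughly the same.

```python
import random


def dropout(values, rate, rng):
    """Zero out each value with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical outputs of eight hidden units.
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4, 0.8, 0.3]

dropped = dropout(hidden, 0.5, rng)
print(dropped)
```

This only happens while training; when the trained model makes predictions, all of the units are used.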

And then finally, I add an output layer. The output layer is going to have 10 units, one for each category, that I would like to classify digits into, so 0 through 9, 10 different categories. 

And the activation function I'm going to use here is called the softmax activation function. And in short, what the softmax activation function is going to do is it's going to take the output and turn it into a probability distribution. So ultimately, it's going to tell me, what did we estimate the probability is that this is a 2 versus a 3 versus a 4, and so it will turn it into that probability distribution for me. 
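As a minimal sketch of what softmax computes, the following turns ten raw output scores into ten probabilities. The scores are invented for illustration, with the entry for the digit 2 made the largest.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score avoids overflow without changing the result.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for the digits 0 through 9.
scores = [0.1, 1.2, 4.0, 2.5, 0.3, 0.0, 0.2, 1.0, 0.7, 0.4]

probabilities = softmax(scores)
predicted_digit = probabilities.index(max(probabilities))
print(predicted_digit, round(sum(probabilities), 6))
```

Larger scores get larger probabilities, and the values always add up to 1, so each one can be read as an estimate of how likely that digit is.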

Next up, I'll go ahead and compile my model and fit it on all of my training data. And then I can evaluate how well the neural network performs. And then I've added to my Python program, if I've provided a command line argument, like the name of a file, I'm going to go ahead and save the model to a file. And so this can be quite useful too. Once you've done the training step, which could take some time-- going through the data, running backpropagation with gradient descent to be able to say, all right, how should we adjust the weights of this particular model-- you end up calculating values for these weights, calculating values for these filters, and you'd like to remember that information, so you can use it later. And so TensorFlow allows us to just save a model to a file, such that later if we want to use the model we've learned, use the weights that we've learned, to make some sort of new prediction we can just use the model that already exists. 
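Putting the layers described above together, a model along these lines might be sketched as follows. This is not the actual handwriting.py: it assumes TensorFlow is installed, and the "relu" activations, the "adam" optimizer, the categorical cross-entropy loss, and the use of a separate Input layer for the input shape are all assumptions, since the lecture doesn't spell them out here. The variable names in the closing comment are placeholders.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Each image is a 28x28 grid with a single channel value.
    tf.keras.Input(shape=(28, 28, 1)),

    # Convolutional layer: learn 32 filters, each a 3x3 kernel.
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),

    # Max-pooling layer using 2x2 regions.
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

    # Flatten all of the units into a single layer.
    tf.keras.layers.Flatten(),

    # Hidden layer with 128 units, with dropout of half the units.
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),

    # Output layer: one unit per digit, as a probability distribution.
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

model.summary()

# With prepared data in hand, training, evaluating, and saving would look
# roughly like this (x_train, y_train, x_test, y_test are placeholders):
#     model.fit(x_train, y_train, epochs=10)
#     model.evaluate(x_test, y_test, verbose=2)
#     model.save("model.keras")
```

The summary shows where the weights live: the 32 small filters account for only a few hundred of them, while the connections from the flattened feature maps into the 128 hidden units account for most of the rest.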

So what we're doing here is after we've done all the calculation, we go ahead and save the model to a file, such that we can use it a little bit later. So for example, if I go into digits, I'm going to run handwriting.py. I won't save it this time. We'll just run it and go ahead and see what happens. 

What will happen is we need to go through the model in order to train on all of these samples of handwritten digits. So the MNIST dataset gives us thousands and thousands of sample handwritten digits in the same format that we can use in order to train. And so now what you're seeing is this training process, and unlike the banknotes case, where there were many, many fewer data points-- the data was very, very simple-- here, the data is more complex, and this training process takes time. And so this is another one of those cases where, when training neural networks, this is why computational power is so important, that oftentimes, you see people wanting to use sophisticated GPUs in order to more efficiently be able to do this sort of neural network training. It also speaks to the reason why more data can be helpful. The more sample data points you have, the better you can begin to do this training. 

So here we're going through 60,000 different samples of handwritten digits. And I said that we're going to go through them 10 times. So we're going to go through the dataset 10 times, training each time, hopefully improving upon our weights with every time we run through this dataset. And we can see over here on the right what the accuracy is each time we go ahead and run this model, that the first time, it looks like we got an accuracy of about 92% of the digits correct based on this training set. We increased that to 96% or 97%. And every time we run this, we're going to see, hopefully, the accuracy improve, as we continue to try and use that gradient descent, that process of trying to run the algorithm to minimize the loss that we get in order to more accurately predict what the output should be. 

And what this process is doing is it's learning not only the weights, but it's learning the features to use-- the kernel matrix to use-- when performing that convolution step, because this is a convolutional neural network, where I'm first performing those convolutions, and then doing the more traditional neural network structure. This is going to learn all of those individual steps as well. 

So here, we see that TensorFlow provides me with some very nice output, telling me about how many seconds are left with each of these training runs, that allows me to see just how well we're doing. So we'll go ahead and see how this network performs. It looks like we've gone through the dataset seven times. We're going through an eighth time now. 

And at this point, the accuracy is pretty high. We saw we went from 92% up to 97%. Now it looks like 98%. And at this point, it seems like things are starting to level out. There's probably a limit to how accurate we can ultimately be without running the risk of overfitting. Of course, with enough nodes, you could just memorize the input and overfit upon them. But we'd like to avoid doing that and dropout will help us with this. 

But now, we see we're almost done finishing our training step. We're at 55,000. All right. We've finished training, and now it's going to go ahead and test for us on 10,000 samples. And it looks like on the testing set, we were 98.8% accurate. So we ended up doing pretty well, it seems, on this testing set to see how accurately can we predict these handwritten digits. 

And so what we could do then is actually test it out. I've written a program called recognition.py using PyGame. If you pass it a model that's been trained, and I pre-trained an example model using this input data, what we can do is see whether or not we've been able to train this convolutional neural network to be able to predict handwriting, for example. 

So I can try just like drawing a handwritten digit. I'll go ahead and draw like the number 2, for example. So there's my number 2. Again, this is messy. If you tried to imagine how would you write a program with just like ifs and thens to be able to do this sort of calculation, it would be tricky to do so. But here, I'll press Classify, and all right. It seems it was able to correctly classify that what I drew was the number 2. 

We'll go ahead and reset it. Try it again. We'll draw like an 8, for example. So here is an 8. I'll press Classify. And all right. It predicts that the digit that I drew was an 8. 

And the key here is this really begins to show the power of what the neural network is doing, somehow looking at various different features of these different pixels, figuring out what the relevant features are, and figuring out how to combine them to get a classification. 

And this would be a difficult task to provide explicit instructions to the computer on how to do, like to use a whole bunch of if-thens to process all of these pixel values to figure out what the handwritten digit is, because everyone is going to draw their 8 a little bit differently. If I drew the 8 again, it would look a little bit different. And yet ideally, we want to train a network to be robust enough so that it begins to learn these patterns on its own. All I said was, here is the structure of the network, and here is the data on which to train the network, and the network learning algorithm just tries to figure out what is the optimal set of weights, what is the optimal set of filters to use, in order to be able to accurately classify a digit into one category or another. That's going to show the power of these convolutional neural networks. 

And so that then was a look at how we can use convolutional neural networks to begin to solve problems with regards to computer vision, the ability to take an image and begin to analyze it. And so this is the type of analysis you might imagine that's happening in self-driving cars that are able to figure out what filters to apply to an image to understand what it is that the computer is looking at, or the same type of idea that might be applied to facial recognition in social media to be able to determine how to recognize faces in an image as well. You can imagine a neural network that, instead of classifying into one of 10 different digits, could instead classify like, is this person A or is this person B, trying to tell those people apart just based on convolution. 

And so now what we'll take a look at is yet another type of neural network that can be quite popular for certain types of tasks. But to do so, we'll try to generalize and think about our neural network a little bit more abstractly, that here we have a sample deep neural network, where we have this input layer, a whole bunch of different hidden layers that are performing certain types of calculations, and then an output layer here that just generates some sort of output that we care about calculating. 

But we could imagine representing this a little more simply, like this. Here is just a more abstract representation of our neural network. We have some input. That might be like a vector of a whole bunch of different values as our input. That gets passed into a network to perform some sort of calculation or computation, and that network produces some sort of output. That output might be a single value. It might be a whole bunch of different values. But this is the general structure of the neural network that we've seen. There is some sort of input that gets fed into the network, and using that input, the network calculates what the output should be. 

And this sort of model for a neural network is what we might call a feed-forward neural network. Feed-forward neural networks have connections only in one direction; they move from one layer to the next layer to the layer after that, such that the inputs pass through various different hidden layers and then ultimately produce some sort of output. 
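To make the one-direction idea concrete, here is a minimal sketch of a feed-forward pass in plain Python. The layer sizes, the hand-picked weights, and the choice of a sigmoid activation are all assumptions made for illustration; they are not taken from the lecture's own code.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then a sigmoid."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(1 / (1 + math.exp(-total)))
    return outputs

def feed_forward(inputs, layers):
    """Pass values through each layer in order; nothing ever flows backward."""
    values = inputs
    for weights, biases in layers:
        values = dense(values, weights, biases)
    return values

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output unit.
layers = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]

output = feed_forward([1.0, 2.0], layers)
print(output)
```

Notice that the number of inputs and outputs is baked into the shapes of the weight lists, and that running the same input twice always gives the same output, since nothing is remembered between runs.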

So feed-forward neural networks are very helpful for solving these types of classification problems that we saw before. We have a whole bunch of input. We want to learn what setting of weights will allow us to calculate the output effectively. But there are some limitations on feed-forward neural networks that we'll see in a moment. In particular, the input needs to be of a fixed shape, like a fixed number of neurons are in the input layer, and there's a fixed shape for the output, like a fixed number of neurons in the output layer, and that has some limitations of its own. 

And a possible solution to this-- and we'll see examples of the types of problems we can solve for this in just a second-- is instead of just a feed-forward neural network where there are only connections in one direction, from left to right effectively, across the network, we can also imagine a recurrent neural network, where a recurrent neural network generates output that gets fed back into itself as input for future runs of that network. 

So whereas in a traditional neural network, we have inputs that get fed into the network that get fed into the output, and the only thing that determines the output is based on the original input and based on the calculation we do inside of the network itself, this goes in contrast with a recurrent neural network, where in a recurrent neural network, you can imagine output from the network feeding back to itself into the network again as input for the next time that you do the calculations inside of the network. 

What this allows is it allows the network to maintain some sort of state, to store some sort of information that can be used on future runs of the network. Previously, the network just defined some weights, and we passed inputs through the network, and it generated outputs, but the network wasn't saving any information based on those inputs to be able to remember for future iterations or for future runs. What a recurrent neural network will let us do is let the network store information that gets passed back in as input to the network again the next time we try and perform some sort of action. And this is particularly helpful when dealing with sequences of data. 
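As a rough sketch of what "maintaining state" can look like, the function below computes a new state from the current input and the previous state. The tanh activation and the specific weight values are assumptions chosen for illustration, not something specified in the lecture.

```python
import math

def rnn_step(x, state, w_in, w_state, bias):
    """Combine the current input with the previous state to get a new state."""
    new_state = []
    for i in range(len(state)):
        total = bias[i]
        total += sum(w * v for w, v in zip(w_in[i], x))
        total += sum(w * s for w, s in zip(w_state[i], state))
        new_state.append(math.tanh(total))
    return new_state

# Hypothetical, untrained weights for a 2-input, 2-unit recurrent layer.
W_IN = [[0.6, -0.4], [0.2, 0.9]]
W_STATE = [[0.5, 0.1], [-0.3, 0.7]]
BIAS = [0.0, 0.0]

# The same input [0, 1], seen with no history...
fresh = rnn_step([0.0, 1.0], [0.0, 0.0], W_IN, W_STATE, BIAS)

# ...and seen after the network has already processed [1, 0].
remembered = rnn_step([1.0, 0.0], [0.0, 0.0], W_IN, W_STATE, BIAS)
after_history = rnn_step([0.0, 1.0], remembered, W_IN, W_STATE, BIAS)

print(fresh)
print(after_history)
```

The two printed results differ even though the final input is identical, because the second run also received the state carried over from the earlier input.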

So we'll see a real-world example of this right now actually. Microsoft has developed an AI known as the CaptionBot, and what the CaptionBot does is it says, I can understand the content of any photograph, and I'll try to describe it as well as any human. I'll analyze your photo, but I won't store it or share it. And so what Microsoft CaptionBot seems to be claiming to do is it can take an image and figure out what's in the image and just give us a caption to describe it. 

So let's try it out. Here, for example, is an image of Harvard Square and some people walking in front of one of the buildings at Harvard Square. I'll go ahead and take the URL for that image, and I'll paste it into CaptionBot, then just press Go. So CaptionBot is analyzing the image, and then it says, I think it's a group of people walking in front of a building, which seems amazing. The AI is able to look at this image and figure out what's in the image. 

And the important thing to recognize here is that this is no longer just a classification task. We saw being able to classify images with a convolutional neural network, where the job was to take the images and then figure out, is it a 0, or a 1, or a 2; or is that this person's face or that person's face? What seems to be happening here is the input is an image, and we know how to get networks to take input of images, but the output is text. It's a sentence. It's a phrase, like "a group of people walking in front of a building." 

And this would seem to pose a challenge for our more traditional feed-forward neural networks, for the reason being that in traditional neural networks, we just have a fixed-size input and a fixed-size output. There are a certain number of neurons in the input to our neural network and a certain number of outputs for our neural network, and then some calculation that goes on in between. But the size of the inputs-- the number of values in the input and the number of values in the output-- those are always going to be fixed based on the structure of the neural network, and that makes it difficult to imagine how a neural network can take an image like this and say, you know, it's a group of people walking in front of the building, because the output is text. It's a sequence of words. 

Now it might be possible for a neural network to output one word. One word, you could represent as a vector of values, and you can imagine ways of doing that. And next time, we'll talk a little bit more about AI as it relates to language and language processing. But a sequence of words is much more challenging, because depending on the image, you might imagine the output is a different number of words. We could have sequences of different lengths, and somehow we still want to be able to generate the appropriate output. 
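One simple way to represent a single word as a vector of values is a one-hot encoding over a fixed vocabulary, sketched below. The tiny vocabulary and the helper names are hypothetical, and this is only one of several possible representations.

```python
# Hypothetical fixed vocabulary; a real system would have many more words.
VOCABULARY = ["a", "building", "group", "of", "people", "walking"]

def one_hot(word):
    """Vector with a 1 in the word's position and 0 everywhere else."""
    return [1 if w == word else 0 for w in VOCABULARY]

def word_from_scores(scores):
    """Pick the word whose output unit has the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return VOCABULARY[best]

print(one_hot("group"))
print(word_from_scores([0.1, 0.05, 0.6, 0.1, 0.1, 0.05]))
```

With this scheme, a network with one output unit per vocabulary word can emit exactly one word per run, which is why a variable-length sentence still needs something more.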

And so the strategy here is to use a recurrent neural network, a neural network that can feed its own output back into itself as input for the next time. And this allows us to do what we call a one-to-many relationship for inputs to outputs, that in vanilla, more traditional neural networks-- these are what we consider to be one-to-one neural networks-- you pass in one set of values as input, you get one vector of values as the output-- but in this case, we want to pass in one value as input-- the image-- and we want to get a sequence-- many values-- as output, where each value is like one of these words that gets produced by this particular algorithm. 

And so the way we might do this is we might imagine starting by providing the image as input into our neural network, and the neural network is going to generate output, but the output is not going to be the whole sequence of words, because we can't represent the whole sequence of words using just a fixed set of neurons. Instead, the output is just going to be the first word. We're going to train the network to output what the first word of the caption should be. And you could imagine that Microsoft has trained this by running a whole bunch of training samples through the AI, giving it a whole bunch of pictures and what the appropriate caption was, and having the AI begin to learn from that. 

But now, because the network generates output that can be fed back into itself, you can imagine the output of the network being fed back into the same network-- this here looks like a separate network, but it's really the same network that's just getting different input-- that this network's output gets fed back into itself, but it's going to generate another output, and that other output is going to be like the second word in the caption. And this recurrent neural network then, this network is going to generate other output that can be fed back into itself to generate yet another word, fed back into itself to generate another word. 

And so recurrent neural networks allow us to represent this sort of one-to-many structure. You provide one image as input, and the neural network can pass data into the next run of the network, and then again and again, such that you could run the network multiple times, each time generating a different output, still based on that original input. And this is where recurrent neural networks become particularly useful when dealing with sequences of inputs or outputs. My output is a sequence of words, and since I can't very easily represent outputting an entire sequence of words, I'll instead output that sequence one word at a time, by allowing my network to pass information about what still needs to be said about the photo into the next stage of running the networks. So you could run the network multiple times-- the same network with the same weights-- just getting different input each time, first getting input from the image, and then getting input from the network itself, as additional information about what additionally needs to be given in a particular caption, for example. 
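The control flow of this one-to-many generation can be sketched as a loop that runs the same step function repeatedly, carrying state from one run to the next. In the sketch below, a lookup table stands in for a trained network, so the table, the `"<end>"` marker, and the function names are all hypothetical; this shows only the shape of the loop, not how CaptionBot actually works.

```python
# Hypothetical stand-in for what a trained network would have learned.
CAPTION_TABLE = {
    "street_scene": ["a", "group", "of", "people", "walking", "<end>"],
}

def network_step(state):
    """One run of the 'network': emit one word and an updated state."""
    image, position = state
    word = CAPTION_TABLE[image][position]
    return word, (image, position + 1)

def generate_caption(image, max_words=20):
    state = (image, 0)  # first run: the input comes from the image
    words = []
    for _ in range(max_words):
        # Later runs: the input is the state passed along by the previous run.
        word, state = network_step(state)
        if word == "<end>":
            break
        words.append(word)
    return words

print(" ".join(generate_caption("street_scene")))
```

The same function is called on every iteration, which mirrors the idea of one network with one set of weights being run many times, producing as many words as the caption happens to need.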

So this then is a one-to-many relationship inside of a recurrent neural network. But it turns out there are other models that we can use-- other ways we can try and use recurrent neural networks-- to be able to represent data that might be stored in other forms as well. We saw how we could use neural networks in order to analyze images, in the context of convolutional neural networks that take an image, figure out various different properties of the image, and are able to draw some sort of conclusion based on that. 

But you might imagine that something like YouTube, they need to be able to do a lot of learning based on video. They need to look through videos to detect if there are copyright violations, or they need to be able to look through videos to maybe identify what particular items are inside of the video, for example. And video, you might imagine, is much more difficult to put in as input to a neural network, because whereas with an image you can just treat each pixel as a different value, videos are sequences. They're sequences of images, and each sequence might be a different length, and so it might be challenging to represent that entire video as a single vector of values that you could pass in to a neural network. 

And so here too, recurrent neural networks can be a valuable solution for trying to solve this type of problem. Instead of just passing in a single input into our neural network, we could pass in the input one frame at a time, you might imagine, first taking the first frame of the video, passing it into the network, and then maybe not having the network output anything at all yet. Let it take in another input, and this time, pass it into the network, but the network gets information from the last time we provided an input into the network. Then we pass in a third input and then a fourth input, where each time, what the network gets is the most recent input, like each frame of the video, but it also gets information the network processed from all of the previous iterations. 

So on frame number four, you end up getting the input for frame number four, plus information the network has calculated from the first three frames. And using all of that data combined, this recurrent neural network can begin to learn how to extract patterns from a sequence of data as well. 

And so you might imagine if you want to classify a video into a number of different genres, like an educational video, or a music video, or different types of videos. That's a classification task, where you want to take input each of the frames of the video, and you want to output something like what it is and what category that it happens to belong to. And you can imagine doing this sort of thing-- this sort of many-to-one learning-- anytime your input is a sequence. 

And so input is a sequence in the context of a video. It could be in the context of like, if someone has typed a message, and you want to be able to categorize that message, like if you're trying to take a movie review and trying to classify it as is it a positive review or a negative review. That input is a sequence of words, and the output is a classification-- positive or negative. There too, a recurrent neural network might be helpful for analyzing sequences of words, and they're quite popular when it comes to dealing with language. 
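Here is a minimal sketch of that many-to-one pattern for the movie review case: the loop reads one word at a time, updates a single running state, and produces a classification only at the end. The word values and weights are hand-picked assumptions standing in for what training would normally learn.

```python
import math

# Hypothetical per-word inputs; a trained model would learn these.
WORD_VALUES = {"great": 1.0, "loved": 0.8, "boring": -0.9, "terrible": -1.0}

def classify_review(words, w_in=1.0, w_state=0.8):
    state = 0.0
    for word in words:
        x = WORD_VALUES.get(word, 0.0)  # unknown words contribute nothing
        # The new state depends on this word and on everything seen so far.
        state = math.tanh(w_in * x + w_state * state)
    # Only after the whole sequence is read do we produce one output.
    return "positive" if state > 0 else "negative"

print(classify_review("i loved it great movie".split()))
print(classify_review("terrible and boring".split()))
```

The two reviews have different lengths, yet the same function handles both, because the input is consumed one element at a time rather than as one fixed-size vector.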

It could even be used for spoken language as well, that spoken language is an audio waveform that can be segmented into distinct chunks, and each of those can be passed in as an input into a recurrent neural network to be able to classify someone's voice, for instance, if you want to do voice recognition, to say is this one person or is this another? Here are also cases where you might want this many-to-one architecture for a recurrent neural network. 

And then as one final problem, just to take a look at in terms of what we can do, with these sorts of networks, imagine what Google Translate is doing. So what Google Translate is doing is it's taking some text written in one language and converting it into text written in some other language, for example, where now this input is a sequence of data-- it's a sequence of words-- and the output is a sequence of words as well. It's also a sequence. 

So here, we want effectively like a many-to-many relationship. Our input is a sequence, and our output is a sequence as well. And it's not quite going to work to just say, take each word in the input and translate it into a word in the output, because ultimately, different languages put their words in different orders, and maybe one language uses two words for something, whereas another language only uses one. So we really want some way to take this information-- that's input-- encode it somehow, and use that encoding to generate what the output ultimately should be. And this has been one of the big advancements in automated translation technology is the ability to use neural networks to do this, instead of older, more traditional methods, and this has improved accuracy dramatically. 

And the way you might imagine doing this is, again, using a recurrent neural network with multiple inputs and multiple outputs. We start by passing in all the input. Input goes into the network. Another input, like another word, goes into the network, and we do this multiple times, like once for each word in the input that I'm trying to translate. And only after all of that is done, does the network now start to generate output, like the first word of the translated sentence, and the next word of the translated sentence, so on and so forth, where each time the network passes information to itself by allowing for this model of giving some sort of state from one run in the network to the next run, assembling information about all the inputs, and then passing in information about which part of the output to generate next. 
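The read-everything-first, then-generate structure can be sketched with two loops that share a state. The numbers below stand in for words, the weights are untrained and hand-picked, and the decoder is told how many outputs to produce; all of these are simplifying assumptions, since a real translator would learn its weights and decide for itself when to stop.

```python
import math

def step(x, state, w_in, w_state):
    """One run of a recurrent unit: new state from input and old state."""
    return math.tanh(w_in * x + w_state * state)

def encode(inputs, w_in=0.5, w_state=0.9):
    """Read every input in order, producing no output, only a final state."""
    state = 0.0
    for x in inputs:
        state = step(x, state, w_in, w_state)
    return state

def decode(state, num_outputs, w_in=0.7, w_state=0.6, w_out=1.5):
    """Generate outputs one at a time, feeding each output back in."""
    outputs = []
    previous = 0.0  # stands in for a "start of sentence" input
    for _ in range(num_outputs):
        state = step(previous, state, w_in, w_state)
        previous = w_out * state  # this run's output, used on the next run
        outputs.append(previous)
    return outputs

summary = encode([0.2, -0.4, 0.9])          # three inputs in...
translated = decode(summary, num_outputs=5)  # ...five outputs out
print(translated)
```

Here the input and output lengths are independent of each other, and because the state is built up step by step, feeding the same inputs in a different order leads to a different encoding.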

And there are a number of different types of these sorts of recurrent neural networks. One of the most popular is known as the long short-term memory neural network, otherwise known as LSTM. But in general, these types of networks can be very, very powerful whenever we're dealing with sequences, whether those are sequences of images or especially sequences of words when it comes to dealing with natural language. 

So that then were just some of the different types of neural networks that can be used to do all sorts of different computations, and these are incredibly versatile tools that can be applied to a number of different domains. We only looked at a couple of the most popular types of neural networks-- the more traditional feed-forward neural networks, convolutional neural networks, and recurrent neural networks. 

But there are other types as well. There are adversarial networks, where networks compete with each other to try and be able to generate new types of data, as well as other networks that can solve other tasks based on what they happen to be structured and adapted for. And these are very powerful tools in machine learning, for being able to very easily learn based on some set of input data and to be able to therefore figure out how to calculate some function, from inputs to outputs. Whether it's input to some sort of classification, like analyzing an image and getting a digit, or machine translation where the input is in one language and the output is in another, these tools have a lot of applications for machine learning more generally. 

Next time, we'll look at machine learning and AI in particular in the context of natural language. We talked a little bit about this today, but looking at how it is that our AI can begin to understand natural language and can begin to be able to analyze and do useful tasks with regards to human language, which turns out to be a challenging and interesting task. So we'll see you next time.