[MUSIC PLAYING] BRIAN YU: All right, welcome back, everyone, to an introduction to Artificial Intelligence with Python. Now so far in this class, we've used AI to solve a number of different problems-- giving the AI instructions for how to search for a solution or how to satisfy certain constraints in order to find its way from some input point to some output point in order to solve some sort of problem. Today, we're going to turn to the world of learning, in particular the idea of machine learning, which generally refers to the idea where we are not going to give the computer explicit instructions for how to perform a task, but rather, we are going to give the computer access to information in the form of data or patterns that it can learn from, and let the computer try and figure out what those patterns are-- try and understand that data to be able to perform a task on its own. Now machine learning comes in a number of different forms and it's a very wide field. So today, we'll explore some of the foundational algorithms and ideas that are behind a lot of the different areas within machine learning. And one of the most popular is the idea of supervised machine learning, or just supervised learning. And supervised learning is a particular type of task. It refers to the task where we give the computer access to a data set, where that data set consists of input-output pairs. And what we would like the computer to do is we would like our AI to be able to figure out some function that maps inputs to outputs. So we have a whole bunch of data that generally consists of some kind of input-- some evidence, some information that the computer will have access to. And we would like the computer, based on that input information, to predict what some output is going to be. And we'll give it some data that the computer can train its model on, to begin to understand how it is that this information works, and how it is that the inputs and outputs relate to each other. But ultimately, we hope that our computer will be able to figure out some function that, given those inputs, is able to get those outputs. There are a couple of different tasks within supervised learning; the one we'll focus on and start with is known as classification. And classification is the problem where if I give you a whole bunch of inputs, you need to figure out some way to map those inputs into discrete categories, where you can decide what those categories are. And it's the job of the computer to predict what those categories are going to be. So that might be, for example, I give you information about a banknote, like a US dollar, and I'm asking you to predict for me, does it belong to the category of authentic banknotes or does it belong to the category of counterfeit banknotes? You need to categorize the input. And we want to train the computer to figure out some function to be able to do that calculation. Another example might be the case of weather, something we've talked about a little bit so far in this class, where we would like to predict on a given day, is it going to rain on that day, is it going to be cloudy on that day. And before, we've seen how we could do this if we give the computer all the exact probabilities for, you know, if these are the conditions, what's the probability of rain. Oftentimes, though, we don't have access to that information. But what we do have access to is a whole bunch of data.
So if we wanted to be able to predict something like is it going to rain or is it not going to rain, we would give the computer historical information about days when it was raining and days when it was not raining, and ask the computer to look for patterns in that data. So what might that data look like? Well, we could structure that data in a table like this. This might be what our table looks like, where are for any particular day going back, we have information about like that day's humidity, that day's air pressure. And then importantly, we have a label-- something where the human has said that on this particular day, it was raining or it was not raining. So you could fill in this table with a whole bunch of data. And what makes this what we would call a supervised learning exercise is that a human has gone in and labeled each of these data points. Said that on this day, when these were the values for the humidity and pressure, that day was a rainy day and this day was a not rainy day. And what we would like the computer to be able to do then is to be able to figure out, given these inputs, given the humidity and the pressure, can the computer predict what label should be associated with that day. Does that day look more like it's going to be a day that rains or does it look more like a day when it's not going to rain. Put a little bit more mathematically, you can think of this as a function that takes two inputs-- the inputs being the data points that our computer will have access to-- things like humidity and pressure. So we could write a function, f, that takes as input both humidity and pressure. And then the output is going to be what category we would ascribe to these particular input points-- what label we would associate with that input. So we've seen a couple of example data points here, where given this value for humidity and this value for pressure, we predict is it going to rain or is it not going to rain. And that's information that we just gathered from the world. We measured on various different days what the humidity and pressure were. We observed whether or not we saw rain or no rain on that particular day. And this function, f, is what we would like to approximate. Now the computer and we humans don't really know exactly how this function f works-- it's probably quite a complex function. So what we're going to do instead is attempt to estimate it. We would like to come up with a hypothesis function, h, which is going to try to approximate what f does. We want to come up with some function h that will also take the same inputs and we'll also produce an output, rain or no rain. And ideally, we'd like these two functions to agree on as much as possible. So the goal then of these supervised learning classification tasks is going to be to figure out what does that function h look like. How can we begin to estimate, given all of this information, all of this data, what category or what label should be assigned to a particular data point. So where can you begin doing this? Well, a reasonable thing to do, especially in this situation-- I have two numerical values-- is I could try to plot this on a graph that has two axes-- an x-axis and the y-axis. And in this case, we're just going to be using two numerical values as input, but these same types of ideas at scale as you add more and more inputs as well. We'll be plotting things in two dimensions, but as we'll soon see, you could add more inputs and just imagine things in multiple dimensions. 
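As a quick aside before plotting anything, here's a minimal sketch of how a labeled data set like the one just described might be represented in Python. The humidity and pressure values are invented, and the hypothesis shown is just a placeholder, not something actually learned from weather data:

```python
# A small, invented labeled data set: each example pairs an input
# (humidity, pressure) with a human-provided label.
data = [
    ((0.93, 0.96), "Rain"),
    ((0.49, 1.02), "No Rain"),
    ((0.79, 0.98), "Rain"),
    ((0.55, 1.03), "No Rain"),
]

# The true function f mapping (humidity, pressure) to a label is unknown;
# the goal is to learn a hypothesis h that approximates it. This placeholder
# h is made up -- a real one would be learned from the data.
def h(humidity, pressure):
    return "Rain" if humidity > 0.7 else "No Rain"

for (humidity, pressure), label in data:
    print(humidity, pressure, label, "-> predicted:", h(humidity, pressure))
```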
And while we humans have trouble conceptualizing anything really beyond three dimensions, at least visually, a computer has no problem with trying to imagine things in many, many more dimensions. For a computer, each dimension is just some separate number that it's keeping track of. So it wouldn't be unreasonable for a computer to think in 10 dimensions or 100 dimensions to be able to try to solve a problem. But for now, we've got two inputs, so we'll graph things along two axes-- an x-axis, which will here represent humidity, and a y-axis, which here represents pressure. And what we might do is say, let's take all of the days that were raining, and just try to plot them on this graph, and see where they fall on this graph. And here might be all of the rainy days, where each rainy day is one of these blue dots here that corresponds to a particular value for humidity and a particular value for pressure. And then I might do the same thing with the days that were not raining. So I take all the not rainy days, figure out what their values were for each of these two inputs, and go ahead and plot them on this graph as well. And I've here plotted them in red. So blue here stands for a rainy day, red here stands for a not rainy day. And this then is the input-- that my computer has access to all of this input. And what I would like the computer to be able to do is to train a model such that if I'm ever presented with a new input that doesn't have a label associated with it, something like this white dot here, I would like to predict, given those values for each of the two inputs, should we classify it as a blue dot, a rainy day, or should we classify it as a red dot, a not rainy day. And if you're just looking at this picture graphically, trying to say, all right, this white dot, does it look like it belongs to the blue category or does it look like it belongs to the red category, I think most people would agree that it probably belongs to the blue category. And why is that? Well, it looks like it's close to other blue dots. And that's not a very formal notion, but it's the notion that we'll formalize in just a moment-- that because it seems to be close to, like, this blue dot here-- nothing else is closer to it-- then we might say that it should be categorized as blue. It should fall into that category of, I think that day is going to be a rainy day based on that input. It might not be totally accurate, but it's a pretty good guess. And this type of algorithm is actually a very popular and common machine learning algorithm known as nearest neighbor classification. It's an algorithm for solving these classification type problems. And nearest neighbor classification performs this algorithm: given an input, it will choose the class of the nearest data point to that input. By class, we just here mean category, like rain or no rain, counterfeit or not counterfeit. And we choose the category or the class based on the nearest data point. So given all that data we just looked at, is the nearest data point a blue point or is it a red point? And depending on the answer to that question, we're able to make some sort of judgment. We're able to say something like, we think it's going to be blue or we think it's going to be red. So likewise, we could apply this to other data points that we encounter as well. If suddenly this data point comes about, well, its nearest data point is red, so we would go ahead and classify this as a red point, not raining.
Things get a little bit trickier, though, when you look at a point like this white point over here, and you ask the same sort of question-- should it belong to the category of blue points, the rainy days? Or should it belong to the category of red points, the not rainy days? Now nearest neighbor classification would say the way you solve this problem is to look at which point it is nearest to. You look at that nearest point and see it's red-- it's a not rainy day. And therefore, according to nearest neighbor classification, I would say that this unlabeled point, that should also be red. It should also be classified as a not rainy day. Your intuition might think that that's a reasonable judgment to make-- that the closest thing is a not rainy day, so we may as well guess that it's a not rainy day. But it's probably also reasonable to look at the bigger picture of things and to say, yes, it is true that the nearest point to it was a red point, but it's surrounded by a whole bunch of other blue points. So looking at the bigger picture, there is potentially an argument to be made that this point should actually be blue. And with only this data, we actually don't know for sure. We are given some inputs, something we're trying to predict, and we don't necessarily know what the output is going to be. So in this case, which one is correct is difficult to say. But oftentimes, considering more than just a single neighbor, considering multiple neighbors, can sometimes give us a better result. And so there's a variant on the nearest neighbor classification algorithm that is known as the k-nearest-neighbor classification algorithm, where k is some parameter, some number that we choose, for how many neighbors are we going to look at. So one-nearest-neighbor classification is what we saw before: just pick the one nearest neighbor and use that category. But k-nearest-neighbor classification, where k might be three or five or seven-- to say, look at the three or five or seven closest neighbors, the closest data points to that point-- works a little bit differently. In this algorithm, we're given an input, and we choose the most common class out of the k nearest data points to that input. So if we look at the five nearest points, and three of them say it's raining and two of them say it's not raining, we'll go with the three instead of the two, because each one effectively gets one vote towards what they believe the category ought to be. And ultimately, you choose the category that has the most votes as a consequence of that. So k-nearest-neighbor classification-- fairly straightforward one to understand intuitively. You just look at the neighbors and figure out what the answer might be. And it turns out this can work very, very well for solving a whole variety of different types of classification problems. But not every model is going to work under every situation. And so one of the things we'll take a look at today, especially in the context of supervised machine learning, is that there are a number of different approaches to machine learning-- a number of different algorithms that we can apply, all solving the same type of problem, all solving some kind of classification problem, where we want to take inputs and organize them into different categories. And no one algorithm is necessarily always going to be better than some other algorithm. They each have their trade-offs.
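Here's a minimal sketch of what k-nearest-neighbor classification might look like in code; the data points and the choice of k are made up for illustration:

```python
from collections import Counter
import math

def knn_classify(point, data, k=3):
    # Sort the labeled points by Euclidean distance from the query point.
    neighbors = sorted(data, key=lambda item: math.dist(point, item[0]))
    # Let the k closest points vote, and return the most common label.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Invented (humidity, pressure) points with labels.
data = [
    ((0.90, 0.95), "Rain"), ((0.85, 0.97), "Rain"), ((0.80, 0.96), "Rain"),
    ((0.50, 1.02), "No Rain"), ((0.45, 1.03), "No Rain"),
]
print(knn_classify((0.82, 0.96), data, k=3))  # "Rain"
```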
And maybe depending on the data, one type of algorithm is going to be better suited to trying to model that information than some other algorithm. And so this is what a lot of machine learning research ends up being about-- that when you're trying to apply machine learning techniques, you're often looking not just at one particular algorithm, but trying multiple different algorithms, trying to see what is going to give you the best results for trying to predict some function that maps inputs to outputs. So what then are the drawbacks of k-nearest-neighbor classification? Well, there are a couple. One might be that, in a naive approach at least, it could be fairly slow to have to go through and measure the distance between a point and every single one of these points that exist here. Now there are ways of trying to get around that. There are data structures that can help make it quicker to find these neighbors. There are also techniques you can use to try and prune some of this data, remove some of the data points, so that you're only left with the relevant data points, just to make it a little bit easier. But ultimately, what we might like to do is come up with another way of trying to do this classification. And one way of trying to do the classification was looking at what are the neighboring points. But another way might be to try to look at all of the data and see if we can come up with some decision boundary-- some boundary that will separate the rainy days from the not rainy days. In the case of two dimensions, we can do that by drawing a line, for example. So what we might want to try to do is just find some line, find some separator, that divides the rainy days, the blue points over here, from the not rainy days, the red points over there. We're now trying a different approach, in contrast with the nearest neighbor approach, which just looked at local data around the input data point that we cared about. Now what we're doing is trying to use a technique known as linear regression to find some sort of line that will separate the two halves from each other. Now, sometimes it will actually be possible to come up with some line that perfectly separates all the rainy days from the not rainy days. Realistically, though, this is probably cleaner than many data sets will actually be. Oftentimes, data is messy. There are outliers. There's random noise that happens inside of a particular system. And what we'd like to do is still be able to figure out what a line might look like. So in practice, the data will not always be linearly separable, where linearly separable refers to some data set where I can draw a line just to separate the two halves of it perfectly. Instead, you might have a situation like this, where there are some rainy points that are on this side of the line and some not raining points that are on that side of the line. And there may not be a line that perfectly separates one half of the inputs from the other half-- that perfectly separates all the rainy days from the not rainy days. But we can still say that this line does a pretty good job. And we'll try to formalize a little bit later what we mean when we say something like this line does a pretty good job of trying to make that prediction. But for now, let's just say we're looking for a line that does as good of a job as we can at trying to separate one category of things from another category of things. So let's now try to formalize this a little bit more mathematically.
We want to come up with some sort of function, some way we can define this line. And our inputs are things like humidity and pressure in this case. So our inputs we might call x1, which is going to represent humidity, and x2, which is going to represent pressure. These are inputs that we are going to provide to our machine learning algorithm. And given those inputs, we would like for our model to be able to predict some sort of output. And we're going to predict that using our hypothesis function, which we called h. Our hypothesis function is going to take as input x1 and x2, humidity and pressure in this case. And you can imagine if we didn't just have two inputs-- we had three or four or five inputs or more-- we could have this hypothesis function take all of those as input. And we'll see examples of that a little bit later as well. And now the question is, what does this hypothesis function do? Well, it really just needs to measure, is this data point on one side of the boundary or is it on the other side of the boundary? And how do we formalize that boundary? Well, the boundary is generally going to be a linear combination of these input variables, at least in this particular case. So what we're trying to do when we say linear combination is take each of these inputs and multiply them by some number that we're going to have to figure out. We'll generally call that number a weight, for how important should these variables be in trying to determine the answer. So we weight each of these variables with some weight. And we might add like a constant to it just to try and make the function a little bit different. And the result we just need to compare-- is it greater than 0 or is it less than 0, to say does it belong on one side of the line or the other side of the line? And so what that mathematical expression might look like is this. We would take each of the variables, x1 and x2, and multiply them by some weight. I don't yet know what that weight is, but it's going to be some number, weight 1 and weight 2. And maybe we just want to add some other weight 0 to it, because the function might require us to shift the entire value up or down by a certain amount. And then we just compare: if we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say no rain. So the key here then is that this expression is how we are going to calculate whether it's a rainy day or not. We're going to do a bunch of math where we take each of the variables, multiply them by a weight, maybe add an extra weight to it, and see if the result is greater than or equal to 0. And using the result of that expression, we're able to determine whether it's raining or not raining. This expression here is, in this case, going to refer to just some line. If you were to plot that graphically, it would just be some line. And what the line actually looks like depends upon these weights. x1 and x2 are the inputs, but these weights are really what determine the shape of that line, the slope of that line, and what that line actually looks like. So we then would like to figure out what these weights should be. We can choose whatever weights we want, but we want to choose weights in such a way that if you pass in a rainy day's humidity and pressure, then you end up with a result that is greater than or equal to 0. And we would like it such that if we passed into our hypothesis function a not rainy day's inputs, then the output that we get should be not raining.
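As a rough sketch in code, that linear hypothesis might look like this, where the particular weight values are arbitrary placeholders rather than weights that have actually been learned:

```python
# Arbitrary placeholder weights, not learned from any real data:
# w0 is the constant term; w1 and w2 weight humidity and pressure.
w0, w1, w2 = -5.0, 2.0, 3.0

def hypothesis(humidity, pressure):
    # Which side of the line w0 + w1*x1 + w2*x2 = 0 does this point fall on?
    if w0 + w1 * humidity + w2 * pressure >= 0:
        return "Rain"
    return "No Rain"

print(hypothesis(0.9, 1.0))
```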
So before we get there, let's try and formalize this a little bit more mathematically, just to get a sense for how it is that you'll often see this if you ever go further into supervised machine learning and explore this idea. One thing is that generally for these categories, we'll sometimes just use the names of the categories, like rain and not rain. Often, mathematically, if we're trying to do comparisons between these things, it's easier just to deal in the world of numbers. So we could just say 1 and 0-- 1 for raining, 0 for not raining. So we do all this math. And if the result is greater than or equal to 0, we'll go ahead and say our hypothesis function outputs 1, meaning raining. And otherwise, it outputs 0, meaning not raining. And oftentimes, this type of expression we'll instead express using vector mathematics. And all a vector is, if you're not familiar with the term, is a sequence of numerical values. You could represent that in Python using, like, a list of numerical values or a tuple of numerical values. And here, we have a couple of sequences of numerical values. One of our vectors, one of our sequences of numerical values, is all of these individual weights-- w0, w1 and w2. So we could construct what we'll call a weight vector-- and we'll see why this is useful in a moment-- called w, generally represented using a boldface w, that is just a sequence of these three weights-- weight 0, weight 1, and weight 2. And to be able to calculate, based on those weights, whether we think a day is raining or not raining, we're going to multiply each of those weights by one of our input variables. That w2, this weight, is going to be multiplied by input variable x2. w1 is going to be multiplied by input variable x1. And w0-- well, it's not being multiplied by anything, but to make sure the vectors are the same length-- and we'll see why that's useful in just a second-- we'll just go ahead and say w0 is being multiplied by 1. Because you can multiply something by 1 and you end up getting the exact same number. So in addition to the weight vector, w, we'll also have an input vector that we'll call x that has three values-- 1, again, because we're just multiplying w0 by 1 eventually, and then x1 and x2. So here then, we've represented two distinct vectors-- a vector of weights that we need to somehow learn. The goal of our machine learning algorithm is to learn what this weight vector is supposed to be. We could choose any arbitrary set of numbers and it would produce a function that tries to predict rain or not rain, but it probably wouldn't be very good. What we want to do is come up with a good choice of these weights so that we're able to do the accurate predictions. And then this input vector represents a particular input to the function, a data point for which we would like to estimate, is that day a rainy day or is that day a not rainy day. And that's going to vary just depending on what input is provided to our function, what it is that we are trying to estimate. And then to do the calculation, we want to calculate this expression here. And it turns out that expression is what we would call the dot product of these two vectors. The dot product of two vectors just means taking each of the terms in the vectors and multiplying them together: w0 multiplied by 1, w1 multiplied by x1, w2 multiplied by x2. And that's why these vectors need to be the same length. And then we just add all of the results together.
So the dot product of w and x, our weight vector and our input vector, that's just going to be w0 times 1, or just w0, plus w1 times x1, multiplying these two terms together, plus w2 times x2, multiplying those terms together. So we have our weight vector, which we need to figure out. We need our machine learning algorithm to figure out what the weights should be. We have the input vector representing the data point that we're trying to predict a category for, predict a label for. And we're able to do that calculation by taking this dot product, which you'll often see represented in vector form-- but if you haven't seen vectors before, you can think of it as identical to just this mathematical expression: doing the multiplication, adding the results together, and then seeing whether the result is greater than or equal to 0 or not. This expression here is identical to the expression that we're calculating to see whether or not that answer is greater than or equal to 0 in this case. And so for that reason, you'll often see the hypothesis function written as something like this-- a simpler representation where the hypothesis takes as input some input vector x, some humidity and pressure for some day. And we want to predict an output like rain or no rain, or 1 or 0 if we choose to represent things numerically. And the way we do that is by taking the dot product of the weights and our input. If it's greater than or equal to 0, we'll go ahead and say the output is 1. Otherwise, the output is going to be 0. And this hypothesis, we say, is parameterized by the weights. Depending on what weights we choose, we'll end up getting a different hypothesis. If we choose the weights randomly, we're probably not going to get a very good hypothesis function. We'll get a 1 or a 0, but it's probably not accurately going to reflect whether we think a day is going to be rainy or not rainy. But if we choose the weights right, we can often do a pretty good job of trying to estimate whether we think the output of the function should be a 1 or a 0. And so the question then is how to figure out what these weights should be-- how to be able to tune those parameters. And there are a number of ways you can do that. One of the most common is known as the perceptron learning rule. And we'll see more of this later. But the idea of the perceptron learning rule-- and we're not going to get too deep into the mathematics, we'll mostly just introduce it more conceptually-- is to say that given some data point that we would like to learn from, some data point that has an input x and an output y, where y is like 1 for rain or 0 for not rain, then we're going to update the weights. And we'll look at the formula in just a moment. But the big picture idea is that we can start with random weights but then learn from the data. Like, take the data points one at a time, and for each one of the data points, figure out, all right, what parameters do we need to change inside of the weights in order to better match that input point. And so that is the value of having access to a lot of data in a supervised machine learning algorithm-- that you take each of the data points, maybe look at them multiple times, and constantly try and figure out whether you need to shift your weights in order to create some weight vector that is able to correctly, or more accurately, estimate what the output should be-- whether we think it's going to be raining or whether we think it's not going to be raining.
So what does that weight update look like? Without going into too much of the mathematics, we're going to update each of the weights to be the result of the original weight plus some additional expression. And to understand this expression: y, well, y is what the actual output is. And hypothesis of x, the input-- that's going to be what our estimate of the output was. And so I can replace this by saying what the actual value was minus what our estimate was. And based on the difference between the actual value and what our estimate was, we might want to change our hypothesis, change the way that we do that estimation. If the actual value and the estimate were the same thing, meaning we were correctly able to predict what category this data point belonged to, well, then actual value minus estimate, that's just going to be 0, which means this whole term on the right-hand side goes to 0. And the weight doesn't change. Weight i, where i is weight 1 or weight 2 or weight 0-- weight i just stays at weight i. And none of the weights change if we were able to correctly predict what category the input belonged to. But if our hypothesis didn't correctly predict what category the input belonged to, well, then maybe we need to make some changes-- adjust the weights so that we're better able to predict this kind of data point in the future. And what is the way we might do that? Well, if the actual value was bigger than the estimate-- and for now, we'll go ahead and assume that these are positive values-- if the actual value is bigger than the estimate, that means we need to increase the weight in order to make it such that the output is bigger, and therefore, we're more likely to get to the right actual value. And so if the actual value is bigger than the estimate, then actual value minus estimate, that'll be a positive number. And so you imagine we're just adding some positive number to the weight just to increase it ever so slightly. And likewise, the inverse case is true-- that if the actual value was less than the estimate-- the actual value was 0, but we estimated 1, meaning it actually was not raining, but we predicted it was going to be raining-- then we want to decrease the value of the weight, because then in that case, we want to try and lower the total value of computing that dot product, in order to make it less likely that we would predict that it would actually be raining. So no need to get too deep into the mathematics of that. But the general idea is that every time we encounter some data point, we can adjust these weights accordingly to try and make the weights better line up with the actual data that we have access to. And you can repeat this process with data point after data point until eventually, hopefully, your algorithm converges to some set of weights that do a pretty good job of trying to figure out whether a day is going to be rainy or not rainy. And just as a final point about this particular equation, this value alpha here is generally what we'll call the learning rate. It's just some parameter, some number we choose, for how quickly we're actually going to be updating these weight values. If alpha is bigger, then we're going to update these weight values by a lot. And if alpha is smaller, then we'll update the weight values by less. And you can choose the value of alpha depending on the problem; different values might suit the situation better or worse than others.
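Putting those pieces together, here's a minimal sketch of the perceptron update just described, in its standard form: each weight i becomes weight i plus alpha times (actual minus estimate) times the corresponding input. The training data, initial weights, and learning rate are made-up placeholders:

```python
def perceptron_train(data, alpha=0.1, epochs=100):
    # data is a list of ((x1, x2), y) pairs, where y is 1 for rain, 0 for no rain.
    w = [0.0, 0.0, 0.0]                     # start with some initial weights
    for _ in range(epochs):                 # look at the data points repeatedly
        for (x1, x2), y in data:
            x = [1, x1, x2]                 # the leading 1 pairs with the constant weight w0
            estimate = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            # Update rule: w_i = w_i + alpha * (actual - estimate) * x_i
            w = [wi + alpha * (y - estimate) * xi for wi, xi in zip(w, x)]
    return w

# Invented training data: (humidity, pressure) -> 1 (rain) or 0 (no rain).
data = [((0.90, 0.95), 1), ((0.85, 0.97), 1), ((0.50, 1.02), 0), ((0.45, 1.03), 0)]
print(perceptron_train(data))
```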
So after all of that, after we've done this training process-- take all this data, and using this learning rule, look at all the pieces of data, and use each piece of data as an indication to us of do the weights stay the same, do we increase the weights, do we decrease the weights, and if so, by how much-- what you end up with is effectively a threshold function. And we can look at what the threshold function looks like like this. On the x-axis here, we have the output of that function: taking the dot product of the weights with the input. And on the y-axis, we have what the output is going to be: 0, which in this case represented not raining, and 1, which in this case represented raining. And the way that our hypothesis function works is it calculates this value. And if it's greater than 0 or greater than some threshold value, then we declare that it's a rainy day. And otherwise, we declare that it's a not rainy day. And this then, graphically, is what that function looks like. That initially, when the value of this dot product is small, it's not raining, it's not raining, it's not raining-- but as soon as it crosses that threshold, we suddenly say, OK, now it's raining, now it's raining, now it's raining. And the way to interpret this kind of representation is that anything on this side of the line, that would be the category of data points where we say, yes, it's raining. Anything that falls on this side of the line are the data points where we would say it's not raining. And again, we want to choose some value for the weights that results in a function that does a pretty good job of trying to do this estimation. But one tricky thing with this type of hard threshold is that it only leaves two possible outcomes. We plug in some data as input, and the output we get is raining or not raining. And there is no room for anything anywhere in between. And maybe that's what you want. Maybe all you want is, given some data point, you would like to be able to classify it into one of two or more of these various different categories. But it might also be the case that you care about knowing how strong that prediction is, for example. So if we go back to this instance here, where we have rainy days on this side of the line, not rainy days on that side of the line, let's look now at these two white data points: this data point here that we would like to predict a label or a category for, and this data point over here that we would also like to predict a label or a category for. It seems likely that you could pretty confidently say that this data point, that should be a rainy day. It seems close to the other rainy days if we're going by the nearest neighbor strategy. It's on this side of the line if we're going by the strategy of just saying which side of the line does it fall on, by figuring out what those weights should be. And if we're using the line strategy of just which side of the line does it fall on, which side of this decision boundary, we'd also say that this point here is also a rainy day, because it falls on the side of the line that corresponds to rainy days. But it's likely that even in this case, we would know that we don't feel nearly as confident about this data point on the left as compared to this data point on the right. For this one on the right, we can feel very confident that, yes, it's a rainy day. This one, it's pretty close to the line if we're judging just by distance. And so you might be less sure.
But our threshold function doesn't allow for a notion of less sure or more sure about something. It's what we would call a hard threshold. Once you've crossed this line, then immediately, we say, yes, this is going to be a rainy day. Anywhere before it, we're going to say it's not a rainy day. And that may not be helpful in a number of cases. One, this is not a particularly easy function to deal with. If you get deeper into the world of machine learning and are trying to do things like taking derivatives of these curves, this type of function makes things challenging. But the other challenge is that we don't really have any notion of gradation between things. We don't have a notion of, yes, this is a very strong belief that it's going to be raining, as opposed to, it's probably more likely than not that it's going to be raining, but maybe not totally sure about that, either. So what we can do, by taking advantage of a technique known as logistic regression, is instead of using this hard threshold type of function, we can use instead a logistic function, something we might call a soft threshold. And that's going to transform this into looking something a little more like this-- something that more nicely curves. And as a result, the possible output values are no longer just 0 and 1, 0 for not raining, 1 for raining. But you can actually get any real numbered value between 0 and 1. That if you're way over on this side, then you get a value of 0-- it's not going to be raining, we're pretty sure about that. And if you're over on this side, you get a value of 1-- yes, we're very sure that it's going to be raining. But in between, you could get some real numbered value, where a value like 0.7 might mean we think it's going to rain. It's more probable that it's going to rain than not, based on the data, but we're not as confident as some of the other data points might be. So one of the advantages of the soft threshold is that it allows us to have an output that could be some real number that potentially reflects some sort of probability, the likelihood that we think that this particular data point belongs to that particular category. And there are some other nice mathematical properties of that as well. So that then is two different approaches to trying to solve this type of classification problem. One is this nearest neighbor type of approach, where you just take a data point and look at the data points that are nearby to try and estimate what category we think it belongs to. And the other approach is the approach of saying, all right, let's just try and use linear regression, figure out what these weights should be, adjust the weights in order to figure out what line or what decision boundary is going to best separate these two categories. It turns out that another popular approach, a very popular approach if you just have a data set and you want to start trying to do some learning on it, is what we call the support vector machine. We're not going to go too much into the mathematics of the support vector machine, but we'll at least explore it graphically to see what it is that it looks like. And the idea or the motivation behind the support vector machine is the idea that there are actually a lot of different lines that we could draw, a lot of different decision boundaries that we could draw to separate two groups. So, for example, say I had the red data points over here and the blue data points over here.
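Before looking at those different possible lines, here's a minimal sketch of the soft threshold idea: the logistic (sigmoid) function turns the dot product into a number between 0 and 1 that can be read as something like a confidence that it will rain. The weights here are arbitrary placeholders:

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real number into the range (0, 1).
    return 1 / (1 + math.exp(-z))

def soft_hypothesis(w, x):
    # Instead of a hard 0-or-1 answer, output a confidence between 0 and 1.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

w = [-5.0, 2.0, 3.0]          # arbitrary placeholder weights
x = [1, 0.9, 1.0]             # (1, humidity, pressure)
print(soft_hypothesis(w, x))  # a value between 0 and 1; closer to 1 means more likely rain
```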
One possible line I could draw is a line like this, that this line here would separate the red points from the blue points. And it does so perfectly. All the red points are on one side of the line. All the blue points around the other side of the line. But this should probably make you a little bit nervous. If you come up with a model and the model comes up with a line that looks like this. And the reason why is that you worry about how well it's going to generalize to other data points that are not necessarily in the data set that we have access to. For example, if there was a point that fell right here for example, on the right side of the line, then based on that, we might want to guess that it is in fact, a red point, but it falls on the side of the line where instead, we would estimate that it's a blue point instead. And so based on that, this line is probably not a great choice just because it is so close to these various data points. We might instead prefer a diagonal line that just goes diagonally through the data set like we've seen before. But there too, there's a lot of diagonal lines that we could draw as well. For example, I could draw this diagonal line here, which also successfully separates all the red points from all of the blue points. From the perspective of something like a just trying to figure out some setting of weights that allows us to predict the correct output, this line will predict the correct output for this particular set of data every single time, because the red points are on one side and the blue points are on the other. But yet again, you should probably be a little nervous. Because this line is so close to these red points, even though we're able to correctly predict on the input data if there was a point that fell somewhere in this general area, our algorithm, this model, would say that yeah, we think it's a blue point, when in actuality, it might belong to the red category instead, just because it looks like it's close to the other red points. What we really want to be able to say, given this data, how can you generalize this out as best as possible, is to come up with a line like this that seems like the intuitive line to draw. And the reason why it's intuitive is because it seems to be as far apart as possible from the red data and the blue data so that if we generalize a little bit and assume that maybe we have some points that are different from the input but still slightly further away, we can still say that something on this side, probably red, something on that side, probably blue. And we can make those judgments that way. And that is what support vector machines are designed to do-- they're designed to try and find what we call the maximum margin separator, where the maximum margin separator is just some boundary that maximizes the distance between the groups of points. Rather than come up with some boundary that's very close to one side or the other, where in the case before, we wouldn't have cared-- as long as we're categorizing the input well, that seems all we need to do-- the support vector machine will try and find this maximum margin separator, some way of trying to maximize that particular distance. And it does so by finding what we call the support vectors, which are the vectors that are closest to the line and trying to maximize the distance between the line and those particular points. And it works that way in two dimensions. 
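If you want to experiment with a support vector machine in practice, libraries like scikit-learn provide one. This is a minimal sketch on made-up data, assuming scikit-learn is installed:

```python
from sklearn.svm import SVC

# Made-up (humidity, pressure) inputs with labels: 1 = rain, 0 = no rain.
X = [[0.90, 0.95], [0.85, 0.97], [0.80, 0.96],
     [0.50, 1.02], [0.45, 1.03], [0.55, 1.01]]
y = [1, 1, 1, 0, 0, 0]

# kernel="linear" asks for a maximum margin separating line;
# other kernels, like the default "rbf", allow non-linear boundaries.
model = SVC(kernel="linear")
model.fit(X, y)

print(model.predict([[0.87, 0.96], [0.48, 1.02]]))  # labels for two new points
```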
It also works in higher dimensions, where we're not looking for some line that separates the two data points, but instead, looking for what we generally call a hyperplane, some decision boundary, effectively, that separates one set of data from the other set of data. And this ability of support vector machines to work in higher dimensions actually has a number of other applications as well. But one is that it helpfully deals with cases where data may not be linearly separable. So we talked about linear separability before, this idea that you can take data and just draw a line, or some linear combination of the inputs, that allows us to perfectly separate the two sets from each other. There are some data sets that are not linearly separable, and some where you would not be able to find a good line at all that would do that kind of separation. Something like this, for example, where you imagine here are the red points, and the blue points surround them. If you try to find a line that divides the red points from the blue points, it's actually going to be difficult, if not impossible, to do. Any line you choose-- if you draw a line here, then you've ignored all of these blue points that should actually be blue and not red-- anywhere else you draw a line, there's going to be a lot of error, a lot of mistakes, a lot of what we'll soon call loss to that line that you draw-- a lot of points that you're going to categorize incorrectly. What we really want is to be able to find a better decision boundary that may not be just a straight line through this two-dimensional space. And what support vector machines can do is they can begin to operate in higher dimensions and be able to find some other decision boundary, like the circle in this case, that actually is able to separate one of these sets of data from the other set of data a lot better. So oftentimes, in data sets where the data is not linearly separable, support vector machines, by working in higher dimensions, can actually figure out a way to solve that kind of problem effectively. So that then-- three different approaches to trying to solve these sorts of problems. We've seen support vector machines. We've seen trying to use linear regression and the perceptron learning rule to be able to figure out how to categorize inputs and outputs. We've seen the nearest neighbor approach. No one is necessarily better than any other. Again, it's going to depend on the data set, the information you have access to. It's going to depend on what the function looks like that you're ultimately trying to predict. And this is where a lot of research and experimentation can be involved in trying to figure out how best to perform that kind of estimation. But classification is only one of the tasks that you might encounter in supervised machine learning, because in classification, what we're trying to predict is some discrete category. We're trying to predict red or blue, rain or not rain, authentic or counterfeit. But sometimes, what we want to predict is a real numbered value. And for that, we have a related problem, not classification, but instead known as regression. And regression is the supervised learning problem where we try and learn a function mapping inputs to outputs, same as before. But instead of the outputs being discrete categories-- things like rain or not rain-- in a regression problem, the output values are generally continuous values-- some real number that we would like to predict. This happens all the time, as well.
You might imagine that a company might take this approach if it's trying to figure out, for instance, what the effect of its advertising is. Like, how do advertising dollars spent translate into sales for the company's product, for example. And so they might like to try to predict some function that takes as input the amount of money spent on advertising. And here, we're just going to use one input, but again, you could scale this up to many more inputs as well if you have a lot of different kinds of data you have access to. And the goal is to learn a function-- that given this amount of spending on advertising, we're going to get this amount in sales. And you might judge it based on having access to a whole bunch of data-- like for every past month, here's how much we spent on advertising and here is what sales were. And we would like to come up with some sort of hypothesis function that, again, given the amount spent on advertising, can predict, in this case, some real number, some estimate of how much sales we expect that company to do in this month or in this quarter or whatever unit of time we're choosing to measure things in. And so again, for the approach to solving this type of problem, we could try using a linear regression type approach, where we take this data and we just plot it. On the x-axis, we have advertising dollars spent. On the y-axis, we have sales. And we might just want to try and draw a line that does a pretty good job of trying to estimate this relationship between advertising and sales. And in this case, unlike before, we're not trying to separate the data points into discrete categories. But instead, in this case, we're just trying to find a line that approximates this relationship between advertising and sales, so that if we want to figure out what the estimated sales are for a particular advertising budget, you just look it up in this line, figure out, for this amount of advertising, we would have this amount of sales, and just try and make the estimate that way. And so you can try and come up with a line-- again, figuring out how to modify the weights using various different techniques to try and make it so that this line fits as well as possible. So with all of these approaches to trying to solve machine learning style problems, the question becomes, how do we evaluate these approaches? How do we evaluate the various different hypotheses that we could come up with? Because each of these algorithms will give us some sort of hypothesis-- some function that maps inputs to outputs-- and we want to know, how well does that function work? And you can think of evaluating these hypotheses, and trying to get a better hypothesis, as kind of like an optimization problem. In an optimization problem, as you recall from before, you are either trying to maximize some objective function by trying to find a global maximum, or trying to minimize some cost function by trying to find some global minimum. And in the case of evaluating these hypotheses, one thing we might say is that this cost function, the thing we're trying to minimize, we might be trying to minimize what we would call a loss function. And what a loss function is-- it is a function that is going to estimate for us how poorly our function performs. More formally, it's like a loss of utility, whereby whenever we predict something that is wrong, that is a loss of utility. That's going to add to the output of our loss function.
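Stepping back to the advertising-and-sales line for a moment: a line like that could be fit with a library such as scikit-learn, assuming it's installed. This is a minimal sketch with invented dollar figures:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising dollars spent -> sales.
X = [[1000], [2000], [3000], [4000], [5000]]   # advertising (one input feature)
y = [10500, 19800, 31000, 40200, 50100]        # sales

model = LinearRegression()
model.fit(X, y)

# Predict sales for a new advertising budget.
print(model.predict([[3500]]))
```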
And you can come up with any loss function that you want-- just some mathematical way of estimating, given each of these data points, given what the actual output is and given what our projected output is, our estimate, you could calculate some sort of numerical loss for it. But there are a couple of popular loss functions that are worth discussing, just so that you've seen them before. When it comes to discrete categories, things like rain or not rain, counterfeit or not counterfeit, one approach is the 0-1 loss function. And the way that works is, for each of the data points, our loss function takes as input what the actual output is, whether it was actually raining or not raining, and takes our prediction into account-- did we predict, given this data point, that it was raining or not raining. And if the actual value equals the prediction, well, then the 0-1 loss function will just say the loss is 0. There was no loss of utility, because we were able to predict correctly. And otherwise, if the actual value was not the same thing as what we predicted, well, then in that case, our loss is 1. We lost something, lost some utility, because what we predicted was the output of the function was not what it actually was. And the goal then in a situation like this would be to come up with some hypothesis that minimizes the total empirical loss, the total amount that we've lost, if you add up for all these data points what the actual output is and what your hypothesis would have predicted. So in this case, for example, if we go back to classifying days as raining or not raining, and we came up with this decision boundary, how would we evaluate this decision boundary-- how much better is it than drawing the line here or drawing the line there? Well, we could take each of the input data points, and each input data point has a label-- whether it was raining or whether it was not raining. And we could compare it to the prediction-- whether we predicted it would be raining or not raining-- and assign it a numerical value as a result. So for example, these points over here, they were all rainy days, and we predicted they would be raining, because they fall on the bottom side of the line. So they had a loss of 0-- nothing lost from those situations. And likewise, the same is true for some of these points over here, where it was not raining and we predicted it would not be raining, either. Where we do have loss are points like this point here and that point there, where we predicted that it would not be raining, but in actuality, it's a blue point-- it was raining. Or likewise here, we predicted that it would be raining, but in actuality, it's a red point-- it was not raining. And so as a result, we miscategorized these data points that we were trying to train on, and as a result, there is some loss here. One loss here, there, here and there, for a total loss of four, for example, in this case. And that might be how we would estimate, or how we would say, that this line is better than a line that goes somewhere else or a line that's further down, because this line might minimize the loss. So there is no way to do better than just these four points of loss if you're just drawing a straight line through our space. So the 0-1 loss function checks, did we get it right, did we get it wrong? If we got it right, the loss is 0-- nothing lost.
If we got it wrong, then our loss function for that data point says 1, and we add up all of those losses across all of our data points to get some sort of empirical loss-- how much we have lost across all of these original data points that our algorithm had access to. There are other forms of loss as well that work especially well when we deal with more real value cases-- cases like the mapping between advertising budget and amount that we do in sales, for example. Because in that case, you care not just that you get the number exactly right, but you care how close you were to the actual value. If the actual value is you did $2,800 in sales and you predicted that you would do $2,900 in sales, maybe that's pretty good. That's much better than if you had predicted you do $1,000 in sales, for example. And so we would like our loss function to be able to take that into account as well. Take into account not just whether the actual value in the expected value are exactly the same, but also, take into account how far apart they were. And so for that one approach is what we call L1 loss. L1 Loss doesn't just look at whether actual and predicted are equal to each other, but we take the absolute value of the actual value minus the predicted value. In other words, we just ask, how far apart were the actual and predicted values? And we sum that up across all of the data points to be able to get what our answer ultimately is. So what might this actually look like for our data set? Well, if we go back to this representation where we had advertising along the x-axis, sales along the y-axis, our line was our prediction, our estimate for any given amount of advertising-- what we predicted sales was going to be. And our L1 loss is just how far apart vertically along the sales axis our prediction was from each of the data points. So we could figure out exactly how far apart our prediction was from each of the data points and figure out as a result of that what our loss is overall for this particular hypothesis just by adding up all of these various different individual losses for each of these data points. And our goal then is to try and minimize that loss-- to try and come up with some line that minimizes what the utility loss is by judging how far away our estimate amount of sales is from the actual amount of sales. And turns out there are other loss functions, as well. One that's quite popular is the L2 loss. The L2 loss, instead of just using the absolute value, like how far away the actual value is from the predicted value, it uses the square of actual minus predicted. So how far apart are the actual and predicted value, and it squares that value, effectively penalizing much more harshly anything that is a worse prediction. So you imagine if you have two data points that you predict as being one value away from their actual value as opposed to one data point that you predict as being two away from its actual value, the L2 loss function will more harshly penalize that one that is two away because it's going to square however much the differences between the actual value and the predicted value. And depending on the situation, you might want to choose a loss function depending on what you care about minimizing. If you really care about minimizing the error on more outlier cases, then you might want to consider something like this. 
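Written as code, the three loss functions just described might look like the following sketch, where actual and predicted are parallel lists of output values:

```python
def zero_one_loss(actual, predicted):
    # 1 for every prediction we got wrong, 0 for every one we got right.
    return sum(0 if a == p else 1 for a, p in zip(actual, predicted))

def l1_loss(actual, predicted):
    # Sum of absolute differences: how far off were we, in total?
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def l2_loss(actual, predicted):
    # Sum of squared differences: penalizes larger errors more harshly.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

print(l1_loss([2800, 1900], [2900, 1000]))   # 100 + 900 = 1000
print(l2_loss([2800, 1900], [2900, 1000]))   # 100**2 + 900**2 = 820000
```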
But if you've got a lot of outliers and you don't necessarily care about modeling them, then maybe an L1 loss function is preferable. But there are trade-offs here that you need to decide based on a particular set of data. And what you do run the risk of, with any of these loss functions, with anything that we're trying to do, is a problem known as overfitting. And overfitting is a big problem that you can encounter in machine learning, which happens anytime a model fits too closely with a data set and, as a result, fails to generalize. We would like our model to be able to accurately predict the input and output pairs for the data that we have access to. But the reason we want to do so is because we want our model to generalize well to data that we haven't seen before. I would like to take data from the past year of whether it was raining and not raining and use that data to generalize it towards the future-- to say, in the future, is it going to be raining or not raining. Or if I have a whole bunch of data on what counterfeit and not counterfeit US dollar bills looked like in the past when people have encountered them, I'd like to train a computer to be able to, in the future, generalize to other dollar bills that I might see as well. And the problem with overfitting is that if you try and tie yourself too closely to the data set that you're training your model on, you can end up not generalizing very well. So what does this look like? Well, we might imagine the rainy day and not rainy day example again from here, where the blue points indicate rainy days and the red points indicate not rainy days. And we decided that we felt pretty comfortable with drawing a line like this as the decision boundary between rainy days and not rainy days. That we can pretty comfortably say that points on this side are more likely to be rainy days, points on that side, more likely to be not rainy days. But the empirical loss isn't 0 in this particular case, because we didn't categorize everything perfectly. There was this one outlier, this one day when it wasn't raining, but yet our model still predicts that it is raining. But that doesn't necessarily mean our model is bad. It just means the model isn't 100% accurate. If you really wanted to try and find a hypothesis that resulted in minimizing the loss, you could come up with a different decision boundary. It wouldn't be a line, but it would look something like this. This decision boundary does separate all of the red points from all of the blue points, because the red points fall on this side of this decision boundary, the blue points fall on the other side of the decision boundary. But this, we would probably argue, is not as good of a prediction. Even though it seems to be more accurate based on all of the available training data that we have for training this machine learning model, we might say that it's probably not going to generalize well. That if there were other data points like here and there, we might still want to consider those to be rainy days, because we think this was probably just an outlier. So if the only thing you care about is minimizing the loss on the data you have available to you, you run the risk of overfitting. And this can happen in the classification case. It can also happen in the regression case, that here, we predicted what we thought was a pretty good line relating advertising to sales, trying to predict what sales were going to be for a given amount of advertising.
But I could come up with a line that does a better job of predicting the training data, and it would be something that looks like this, just connecting all of the various different data points. And now, there is no loss at all. Now I've perfectly predicted, given any advertising, what sales are, and for all the data available to me, it's going to be accurate. But it's probably not going to generalize very well. I have overfit my model on the training data that is available to me. And so in general, we want to avoid overfitting. We'd like strategies to make sure that we haven't overfit our model to a particular data set. And there are a number of ways that you could try to do this. One way is by examining what it is that we're optimizing for. In an optimization problem, all we do is we say there is some cost, and I want to minimize that cost. And so far, we've defined that cost function-- the cost of a hypothesis-- just as being equal to the empirical loss of that hypothesis. How far away are the actual data points-- the outputs-- from what I predicted them to be based on that particular hypothesis? And if all you're trying to do is minimize cost, meaning minimizing the loss in this case, then the result is going to be that you might overfit. That to minimize cost, you're going to try and find a way to perfectly match all of the input data. And that might happen as a result of overfitting on that particular input data. So in order to address this, you could add something to the cost function. What counts as cost? Well, not just loss, but also, some measure of the complexity of the hypothesis, where the complexity of the hypothesis is something that you would need to define-- a measure of how complicated our line looks, for example. This is sort of an Occam's razor style approach, where we want to give preference to a simpler decision boundary-- like a straight line, for example, some simpler curve-- as opposed to something far more complex that might represent the training data better, but might not generalize as well. We'll generally say that a simpler solution is probably the better solution and probably the one that is more likely to generalize well to other inputs. So we measure what the loss is, but we also measure the complexity. And now that all gets taken into account when we consider the overall cost. That yes, something might have less loss if it better predicts the training data, but if it's much more complex, it still might not be the best option that we have. And we need to come up with some balance between loss and complexity. And for that reason, you'll often see this represented as multiplying the complexity by some parameter that we have to choose-- parameter lambda in this case, where we're saying if lambda has a greater value, then we really want to penalize more complex hypotheses. Whereas if lambda is smaller, we're only going to penalize more complex hypotheses a little bit. And it's up to the machine learning programmer to decide where they want to set that value of lambda for how much do I want to penalize a more complex hypothesis that might fit the data a little better. And again, there is no one right answer to a lot of these things. Depending on the data set, depending on the data you have available to you, and the problem you're trying to solve, your choice of these parameters may vary. And you may need to experiment a little bit to figure out what the right choice of that is ultimately going to be.
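To make that cost idea concrete, here is a minimal sketch, assuming (purely for illustration) a hypothesis represented as polynomial coefficients, L2 loss on some made-up (x, y) data points, and the sum of squared coefficients as one possible measure of complexity; lam stands in for lambda, which is a reserved word in Python:

    # A minimal sketch of cost = loss + lambda * complexity for a hypothesis
    # represented as polynomial coefficients (all numbers here are made up)
    def predict(coeffs, x):
        # Evaluate the polynomial hypothesis at x
        return sum(c * x ** i for i, c in enumerate(coeffs))

    def cost(coeffs, data, lam):
        # L2 loss on the training data
        loss = sum((y - predict(coeffs, x)) ** 2 for x, y in data)
        # One possible complexity measure: the sum of squared coefficients
        complexity = sum(c ** 2 for c in coeffs)
        return loss + lam * complexity

    data = [(1, 2.1), (2, 3.9), (3, 6.2)]
    print(cost([0.0, 2.0], data, lam=0.1))       # a simple, roughly linear hypothesis
    print(cost([0.0, 1.0, 0.3], data, lam=0.1))  # a more complex hypothesis

Raising lam shifts the balance toward simpler hypotheses; lowering it shifts the balance toward whatever fits the training data best.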
This process then of considering not only the loss, but also some measure of the complexity, is known as regularization. Regularization is the process of penalizing a hypothesis that is more complex in order to favor a simpler hypothesis that is more likely to generalize well-- more likely to be able to apply to other situations dealing with input points that we haven't necessarily seen before. So oftentimes, you'll see us add some regularizing term to what we're trying to minimize in order to avoid this problem of overfitting. Now another way of making sure we don't overfit is to run some experiments and to see whether or not we are able to generalize our model that we've created to other data sets as well. And it's for that reason that oftentimes, when you're doing a machine learning experiment, when you've got some data and you want to try and come up with some function that predicts, given some input, what the output is going to be, you don't necessarily want to do your training on all of the data you have available to you. Instead, you could employ a method known as holdout cross-validation, where we split up our data into a training set and a testing set. The training set is the set of data that we're going to use to train our machine learning model. And the testing set is the set of data that we are going to use in order to test to see how well our machine learning model actually performed. So the learning happens on the training set. We figure out what the parameters should be, we figure out what the right model is. And then we see, all right, now that we've trained the model, how well does it do at predicting things inside of the testing set, some set of data that it hasn't seen before. And the hope then is that we're going to be able to predict the testing set pretty well if we're able to generalize based on the training data that's available to us. If we've overfit the training data, though, and we're not able to generalize, then when we look at the testing set, it's likely going to be the case that we're not going to predict things from the testing set nearly as effectively. So this is one method of cross-validation-- validating to make sure that the work we have done is actually going to generalize to other data sets as well. And there are other statistical techniques we can use, as well. One of the downsides of this holdout cross-validation is that if I just split it 50/50-- I train using 50% of the data and test using the other 50%, or you could choose other percentages as well-- there is a fair amount of data that I am now not using to train, and I might be able to get a better model if I were, for example. So one approach is known as k-fold cross-validation. In k-fold cross-validation, rather than just divide things into two sets and run one experiment, we divide things into k different sets-- maybe I divide things up into 10 different sets-- and then run 10 different experiments. So if I split up my data into 10 different sets of data, then what I'll do is, each time, for each of my 10 experiments, I will hold out one of those sets of data, where I'll say, let me train my model on these nine sets, and then test to see how well it predicts on set number 10.
And then I pick another nine sets to train on, and then test it on the other one that I held out, where each time, I train the model on everything minus the one set that I'm holding out, and then test to see how well our model performs on the set that I did hold out. And what you end up getting is 10 different results, 10 different answers for how accurately our model worked. And oftentimes, you can just take the average of those 10 to get an approximation for how well we think our model performs overall. But the key idea is separating the training data from the testing data, because you want to test your model on data that is different from what you trained the model on. Because in training, you want to avoid overfitting, you want to be able to generalize. And the way you test whether you're able to generalize is by looking at some data that you haven't seen before and seeing how well you're actually able to perform. And so if we want to actually implement any of these techniques inside of a programming language like Python, there are a number of ways we could do that. We could write this from scratch on our own, but there are libraries out there that allow us to take advantage of existing implementations of these algorithms-- that we can use the same types of algorithms in a lot of different situations. And so there is a library, a very popular one known as scikit-learn, which allows us in Python to be able to very quickly get set up with a lot of these different machine learning models. This library has already written an algorithm for nearest neighbor classification, for doing perceptron learning, for doing a bunch of other types of inference and supervised learning that we haven't yet talked about. But using it, we can begin to try actually testing how these methods work and how accurately they perform. So let's go ahead and take a look at one approach to trying to solve this type of problem. All right, so I'm first going to pull up banknotes.csv, which is a whole bunch of data provided by UC Irvine, which has information about various different banknotes. So people took pictures of various different banknotes and measured various different properties of those banknotes. And in particular, some human categorized each of those banknotes as either a counterfeit banknote or as not counterfeit. And so what you're looking at here is each row represents one banknote. This is formatted as a CSV spreadsheet-- comma-separated values-- with commas separating each of these various different fields. We have four different input values for each of these data points, just information, some measurement that was made on the banknote. And what those measurements exactly are isn't as important as the fact that we do have access to this data. But more importantly, we have access for each of these data points to a label, where 0 indicates something like this was not a counterfeit bill, meaning it was an authentic bill. And a data point labeled 1 means that it is a counterfeit bill, at least according to the human researcher who labeled this particular data. So we have a whole bunch of data representing a whole bunch of different data points, each of which has these various different measurements that were made on that particular bill. And each of which has an output value 0 or 1-- 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill.
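Incidentally, the k-fold procedure described a moment ago is something scikit-learn can also run for us. Here is a small sketch using cross_val_score on made-up feature vectors and labels (the data, the choice of model, and the use of five folds are all just for illustration):

    # A sketch of k-fold cross-validation with scikit-learn on hypothetical data
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Made-up feature vectors and labels, purely for illustration
    X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.15, 0.25],
         [0.85, 0.75], [0.3, 0.2], [0.7, 0.8], [0.05, 0.1], [0.95, 0.85]]
    y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

    model = KNeighborsClassifier(n_neighbors=1)

    # cv=5 splits the data into 5 folds, trains on 4, tests on the 5th, and
    # rotates which fold is held out; `scores` holds one accuracy per fold
    scores = cross_val_score(model, X, y, cv=5)
    print(scores, scores.mean())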
And what we would like to do is use supervised learning to begin to predict or model some sort of function that can take these four values as input and predict what the output would be. We want our learning algorithm to find some sort of pattern that is able to predict, based on these measurements-- something that you could measure just by taking a photo of a bill-- whether that bill is authentic or whether that bill is counterfeit. And so how can we do that? Well, I'm first going to open up banknotes0.py and see how it is that we do this. I'm first importing a lot of things from scikit-learn, but importantly, I'm going to set my model equal to the perceptron model, which is one of those models that we talked about before. We're just going to try and figure out some setting of weights that is able to divide our data into two different groups. Then I'm going to go ahead and read data in from my file, banknotes.csv. And basically, for every row, I'm going to separate that row into the first four values of that row, which is the evidence for that row, and then the label, where if the final column in that row is 0, the label is authentic, and otherwise, it's going to be counterfeit. So I'm effectively reading data in from the CSV file, dividing it into a whole bunch of rows, where each row has some evidence-- those four input values that are going to be inputs to my hypothesis function-- and then the label, the output, whether it is authentic or counterfeit. That is the thing that I am then trying to predict. So the next step is that I would like to split up my data set into a training set and a testing set-- some set of data that I would like to train my machine learning model on and some set of data that I would like to use to test that model, see how well it performed. So what I'll do is I'll go ahead and figure out the length of the data, how many data points do I have. I'll go ahead and take half of them, and save that number as a number called holdout. That is how many items I'm going to hold out from my data set to save for the testing phase. I'll randomly shuffle the data so it's in some random order. And then I'll say my testing set will be all of the data up to the holdout. So I'll hold out that many data items, and that will be my testing set. My training data will be everything else-- the information that I'm going to train my model on. And then I'll say, I need to divide up my training data into two different sets. I need to divide it into my x values, where x here represents the inputs. So the x values that I'm going to train on are, basically, for every row in my training set, the evidence for that row-- those four values, effectively a vector of four numbers-- and that is going to be all of the input. And then I need the y values-- what are the outputs that I want to learn from, the labels that belong to each of these various different input points. Well, that's going to be the same thing for each row in the training data, but this time, I take that row and get what it's labeled as, whether it is authentic or counterfeit. So I end up with one list of all of these vectors of my input data and one list which follows the same order, but has all of the labels that correspond with each of those vectors. And then to train my model, which in this case is just this perceptron model, I just call model.fit, passing in the training data and what the labels for those training data are.
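Put together, those steps might look something like the following sketch. This is a reconstruction of what was just described, not the exact banknotes0.py file, and it assumes the CSV has four measurement columns, a final label column, and a header row:

    # A sketch of reading banknotes.csv, holding out half the data for testing,
    # and fitting a Perceptron, along the lines described above
    import csv
    import random
    from sklearn.linear_model import Perceptron

    model = Perceptron()

    # Read the data in, keeping the four measurements and the human-assigned label
    with open("banknotes.csv") as f:
        reader = csv.reader(f)
        next(reader)  # assuming the file has a header row
        data = []
        for row in reader:
            data.append({
                "evidence": [float(cell) for cell in row[:4]],
                "label": "Authentic" if row[4] == "0" else "Counterfeit"
            })

    # Hold out half the data for testing
    holdout = int(0.5 * len(data))
    random.shuffle(data)
    testing = data[:holdout]
    training = data[holdout:]

    # Train the model on the evidence and labels from the training set
    X_training = [row["evidence"] for row in training]
    y_training = [row["label"] for row in training]
    model.fit(X_training, y_training)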
And scikit-learn will take care of fitting the model-- will do the entire algorithm for me. And then when it's done, I can then test to see how well that model performed. So I can say, let me get all of these input vectors for what I want to test on. So for each row in my testing data set, go ahead and get the evidence. And the y values, those are what the actual values were-- for each of the rows in the testing data set, what the actual label is. But then I'm going to generate some predictions. I'm going to use this model and try and predict, based on the testing vectors-- I want to predict what the output is. And my goal then is to now compare y testing with predictions. I want to see how well my predictions based on the model actually reflect what the y values were, what the outputs were actually labeled as. Because I now have this labeled data, I can assess how well the algorithm worked. And so now I can just compute how well we did. This zip function basically just lets me look through two different lists, one by one, at the same time. So for each actual value and for each predicted value, if the actual is the same thing as what I predicted, I'll go ahead and increment the correct counter by 1. Otherwise, I'll increment my incorrect counter by 1. And so at the end, I can print out, here are the results, here's how many I got right, here's how many I got wrong. And here was my overall accuracy, for example. So I can go ahead and run this. I can run python banknotes0.py. And it's going to train on half the data set and then test on half the data set. And here are the results from my perceptron model. In this case, it was able to correctly classify 679 bills as either authentic or counterfeit, and incorrectly classified seven of them, for an overall accuracy of close to 99%. So on this particular data set, using this perceptron model, we were able to predict very well what the output was going to be. And we can try different models, too. That scikit-learn makes it very easy just to swap out one model for another model. So instead of the perceptron model, I can use the support vector machine using the SVC, otherwise known as a support vector classifier-- using a support vector machine to classify things into two different groups. And now see, all right, how well does this perform. And this time, we were able to correctly predict 682 and incorrectly predicted four, for an accuracy of 99.4%. And we could even try the KNeighborsClassifier as the model instead. And this takes a parameter, n_neighbors, for how many neighbors you want to look at. Let's just look at one neighbor, the one nearest neighbor, and use that to predict. Go ahead and run this as well. And it looks like, based on the KNeighborsClassifier looking at just one neighbor, we were able to correctly classify 685 data points and incorrectly classified one. Maybe let's try three neighbors instead of just using one neighbor, do more of a k-nearest-neighbors approach, where I look at the three nearest neighbors and see how that performs. And that one, in this case, seems to have gotten 100% of all of the predictions correct, describing each as either an authentic banknote or a counterfeit banknote. And we could run these experiments multiple times. Because I'm randomly reorganizing the data every time, we're technically training these on slightly different data sets. And so you might want to run multiple experiments to really see how well they're actually going to perform. But in short, they all perform very well.
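Continuing that same sketch, the testing phase just described might look like this (again a reconstruction under the earlier assumptions, not the exact file):

    # Continuing the sketch above: make predictions on the held-out testing set
    # and count how many were classified correctly
    X_testing = [row["evidence"] for row in testing]
    y_testing = [row["label"] for row in testing]
    predictions = model.predict(X_testing)

    correct = 0
    incorrect = 0
    for actual, predicted in zip(y_testing, predictions):
        if actual == predicted:
            correct += 1
        else:
            incorrect += 1

    print(f"Correct: {correct}")
    print(f"Incorrect: {incorrect}")
    print(f"Accuracy: {100 * correct / (correct + incorrect):.2f}%")

    # Swapping models only means changing the earlier `model = ...` line, e.g.:
    #   from sklearn.svm import SVC
    #   model = SVC()
    # or:
    #   from sklearn.neighbors import KNeighborsClassifier
    #   model = KNeighborsClassifier(n_neighbors=3)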
And while some of them perform slightly better than others here, that might not always be the case for every data set. But you can begin to test now, by very quickly putting together these machine learning models using scikit-learn, to be able to train on some training set and then test on some testing set as well. And this splitting up into training groups and testing groups happens so often that scikit-learn has functions built in for doing it. I did it all by hand just now. But if we take a look at banknotes1, we take advantage of some other features that exist in scikit-learn, where we can really simplify a lot of our logic. There is a function built into scikit-learn called train_test_split, which will automatically split data into a training group and a testing group. I just have to say what proportion should be in the testing group-- something like 0.5, half the data, inside the testing group. Then I can fit the model on the training data, make the predictions on the testing data, and then just count up. And scikit-learn has some nice methods for just counting up how many times our testing data matched the predictions, how many times our testing data didn't match the predictions. So very quickly, you can write programs with not all that many lines of code-- it's maybe, like, 40 lines of code-- to get through all of these predictions, and then as a result, see how well we're able to do. So these types of libraries can allow us, without really knowing the implementation details of these algorithms, to be able to use the algorithms in a very practical way to be able to solve these types of problems. So that then is supervised learning-- this task of, given a whole set of data, some input-output pairs, learning some function that maps those inputs to those outputs. But it turns out there are other forms of learning, as well. And another popular type of machine learning, especially nowadays, is known as reinforcement learning. And the idea of reinforcement learning is rather than just being given a whole data set at the beginning of input-output pairs, reinforcement learning is all about learning from experience. In reinforcement learning, our agent-- whether it's a physical robot that's trying to take actions in the world or just some virtual agent that is a program running somewhere-- is going to be given a set of rewards or punishments in the form of numerical values, but you can think of them as reward or punishment. And based on that, it learns what actions to take in the future. Our agent, our AI, will be put in some sort of environment. It will take some actions, and based on the actions that it takes, it learns something. It either gets a reward when it does something well, or it gets a punishment when it does something poorly. And it learns what to do or what not to do in the future based on those individual experiences. And so what this will often look like is it will often start with some agent, some AI, which might again be a physical robot-- if you're imagining a physical robot moving around-- but it can also just be a program. And our agent is situated in their environment, where the environment is where they're going to make their actions. And it's what's going to give them rewards or punishments for the various actions that they take. So for example, the environment is going to start off by putting our agent inside of a state.
Our agent has some state that, in a game, might be the state of the game that the agent is playing; in a world that the agent is exploring, it might be some position inside of a grid representing that world. But the agent is in some sort of state. And in that state, the agent needs to choose to take an action. The agent likely has multiple actions they can choose from, but they pick an action. So they take an action in a particular state. And as a result of that, the agent will generally get two things in response, as we model them. The agent gets a new state that they find themselves in. After being in this state taking one action, they end up in some other state. And they're also given some sort of numerical reward-- positive meaning reward, meaning it was a good thing; negative generally meaning they did something bad, they received some sort of punishment. And that is all the information the agent has. It's told what state it's in. It makes some sort of action. And based on that, it ends up in another state, and it ends up getting some particular reward. And it needs to learn, based on that information, what actions to begin to take in the future. And you can imagine generalizing this to a lot of different situations; this is oftentimes how you train robots, for example. If you've ever seen those robots that are now able to walk around sort of the way humans do, it would be quite difficult to program the robot in exactly the right way to get it to walk the way humans do. You could instead train it through reinforcement learning-- give it some sort of numerical reward every time it does something good, like taking steps forward, and punish it every time it does something bad, like falling over. And then let the AI just learn. Based on that sequence of rewards, based on trying to take various different actions, you can begin to have the agent learn what to do in the future and what not to do. So in order to begin to formalize this, the first thing we need to do is formalize this notion of what we mean by states and actions and rewards-- what does this world look like? And oftentimes, we'll formulate this world as what's known as a Markov decision process. Similar in spirit to Markov chains, which you might recall from before, a Markov decision process is a model that we can use for decision making-- for an agent trying to make decisions in this environment. And it's a model that allows us to represent the various different states that an agent can be in, the various different actions that they can take, and also, what the reward is for taking one action as opposed to another action. So what then does that actually look like? Well, if you recall a Markov chain from before, a Markov chain looked a little something like this, where we had a whole bunch of these individual states, and each state immediately transitioned to another state based on some probability distribution. We saw this in the context of the weather before, where if it was sunny, we said with some probability, it will be sunny the next day; with some other probability, it'll be rainy, for example. But we could also imagine generalizing this. It's not just sun and rain anymore. We just have these states, where one state leads to another state according to some probability distribution. But in this original model, there was no agent that had any control over this process.
It was just entirely probability-based, where with some probability, we moved to this next state, but maybe it's going to be some other state with some other probability. What we'll now have is the ability for the agent in this state to choose from a set of actions, where maybe instead of just one path forward, they have three different choices of actions that each lead them down different paths. And even this is a bit of an oversimplification, because in each of these states, you might imagine more branching points, where there are more decisions that can be taken as well. So we've extended the Markov chain to say that from a state, you now have available action choices. And each of those actions might be associated with its own probability distribution of going to various different states. Then in addition, we'll add another extension, where any time you move from a state taking an action going into this other state, we can associate a reward with that outcome, saying either r is positive, meaning some positive reward, or r is negative, meaning there was some sort of punishment. And this then is what we'll consider to be a Markov decision process. A Markov decision process has some set of states-- states in the world that we can be in. We have some set of actions, where given a state, I can say what are the actions that are available to me in that state-- actions that I can choose from. Then we have some transition model. The transition model before just said that given my current state, what is the probability that I end up in that next state or this other state. The transition model now has effectively two things we're conditioning on. We're saying, given that I'm in this state and that I take this action, what's the probability that I end up in this next state? Now maybe we live in a very deterministic world in this Markov decision process, where given a state and given an action, we know for sure what next state we'll end up in. But maybe there's some randomness in the world, where when you're in a state and you take an action, you might not always end up in the exact same state. There might be some probabilities involved there as well. The Markov decision process can handle both of those possible cases. And then finally, we have a reward function, generally called r, that in this case says, what is the reward for being in this state, taking this action, and then getting to s prime, this next state. So I'm in this original state, I take this action, I get to this next state-- what is the reward for doing that process? You can add up these rewards every time you take an action to get the total amount of rewards that an agent might get from interacting in a particular environment modeled using this Markov decision process. So what might this actually look like in practice? Well, let's just create a little simulated world here where I have this agent that is just trying to navigate its way-- this agent is this yellow dot here, like a robot in the world, trying to navigate its way through this grid. And ultimately, it's trying to find its way to the goal. And if it gets to the green goal, then it's going to get some sort of reward. But then we might also have some red squares that are places where you get some sort of punishment, some bad place where we don't want the agent to go. And if it ends up in the red square, then our agent is going to get some sort of punishment as a result of that. But the agent originally doesn't know all of these details.
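Just to make those pieces concrete, here is a tiny, made-up MDP in the spirit of that grid world, written out as plain Python data structures. The state names, probabilities, and rewards are all hypothetical, purely for illustration:

    # A hypothetical toy Markov decision process: a set of states, the actions
    # available in each state, a transition model P(next state | state, action),
    # and a reward R(state, action, next state)
    states = ["start", "middle", "goal", "pit"]

    actions = {
        "start": ["right"],
        "middle": ["up", "down"],
    }

    # transition[(state, action)] maps each possible next state to its probability
    transition = {
        ("start", "right"): {"middle": 1.0},
        ("middle", "up"):   {"goal": 0.8, "pit": 0.2},
        ("middle", "down"): {"pit": 1.0},
    }

    # reward[(state, action, next_state)]
    reward = {
        ("start", "right", "middle"): 0,
        ("middle", "up", "goal"): 1,
        ("middle", "up", "pit"): -1,
        ("middle", "down", "pit"): -1,
    }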
It doesn't know that these states are associated with punishments, but maybe it does know that this state is associated with a reward-- or maybe it doesn't. It just needs to sort of interact with the environment to try and figure out what to do and what not to do. So the first thing the agent might do is, given no additional information-- it doesn't know what the punishments are, it doesn't know where the rewards are-- just try and take an action. And it takes an action and ends up realizing that it's got some sort of punishment. And so what does it learn from that experience? Well, it might learn that when you're in this state, in the future, don't take the action move to the right-- that that is a bad action to take. That in the future, if you ever find yourself back in this state, don't take this action of going to the right when you're in this particular state, because that leads to punishment. That might be the intuition, at least. And so you could try doing other actions. You move up. All right, that didn't lead to any immediate rewards; maybe try something else, then maybe try something else. And now you've found that you got another punishment. And so you learn something from that experience. So the next time you do this whole process, you know that if you ever end up in this square, you shouldn't take the down action, because being in this state and taking that action ultimately leads to some sort of punishment, a negative reward, in other words. And this process repeats. You might imagine just letting our agent explore the world, learning over time which actions tend to be poor in which states, until eventually, if it tries enough things randomly, it might find that when you get to this state, if you take the up action, you actually get a reward from that. And what it can learn from that is that if you're in this state, you should take the up action, because that leads to a reward. And over time, you can also learn that if you're in this state, you should take the left action, because that leads to this state that also lets you eventually get to the reward. So you begin to learn over time not only which actions are good in particular states, but also, which actions are bad, such that once you know some sequence of good actions that leads you to some sort of reward, our agent can just follow those instructions, follow the experience that it has learned. We didn't tell the agent what the goal was. We didn't tell the agent where the punishments were. But the agent can begin to learn from this experience and learn to begin to perform these sorts of tasks better in the future. And so let's now try to formalize this idea-- formalize the idea that we would like to be able to learn, in this state, taking this action, is that a good thing or a bad thing. There are lots of different models for reinforcement learning. We're just going to look at one of them today. And the one that we're going to look at is a method known as Q-learning. And Q-learning is all about learning a function, a function Q, that takes inputs s and a, where s is a state and a is an action that you take in that state. And what this Q function is going to do is it is going to estimate the value-- how much reward will I get from taking this action in this state?
Originally, we don't know what this Q function should be, but over time, based on experience, based on trying things out and seeing what the result is, I would like to try and learn what Q(s, a) is for any particular state and any particular action that I might take in that state. So what is the approach? Well, the approach originally is we'll start with Q(s, a) equal to 0 for all states s and for all actions a. That initially, before I've ever started anything, before I've had any experiences, I don't know the value of taking any action in any given state, so I'm going to assume that the value is 0 all across the board. But then as I interact with the world, as I experience rewards or punishments, or maybe I go to a cell where I don't get either a reward or a punishment, I want to somehow update my estimate of Q(s, a). I want to continually update my estimate of Q(s, a) based on the experiences, and rewards, and punishments that I've received, such that in the future, my knowledge of which actions are good in which states will be better. So when we take an action and receive some sort of reward, I want to estimate the new value of Q(s, a). And I estimate that based on a couple of different things. I estimate it based on the reward that I'm getting from taking this action and getting into the next state. But assuming the situation isn't over, assuming there are still future actions that I might take as well, I also need to take into account the expected future rewards. That if you imagine an agent interacting with the environment, sometimes you'll take an action and get a reward, but then you can keep taking more actions and get more rewards. That these both are relevant-- both the current reward I'm getting from this current step, and also, my future rewards. And it might be the case that I want to take a step that doesn't immediately lead to a reward, because later on down the line, I know it will lead to more rewards as well. So there's a balancing act between current rewards that the agent experiences and future rewards that the agent experiences as well. And then we need to update Q(s, a). So we estimate the value of Q(s, a) based on the current reward and the expected future rewards. And then we need to update this Q function to take into account this new estimate. Now as we go through this process, we'll already have an estimate for what we think the value is. Now we have a new estimate, and then somehow we need to combine these two estimates together. And we'll look at more formal ways that we can actually begin to do that. So to actually show you what this formula looks like, here's the approach we'll take with Q-learning. We're going to again start with Q(s, a) being equal to 0 for all states and all actions. And then every time we take an action a in state s and observe a reward r, we're going to update our value, our estimate for Q(s, a). And the idea is that we're going to figure out what the new value estimate is minus what our existing value estimate is. So we have some preconceived notion for what the value is for taking this action in this state. Maybe our expectation is we currently think the value is 10. But then we're going to estimate what we now think it's going to be. Maybe the new value estimate is something like 20. So there's a delta of, like, 10-- our new value estimate is 10 points higher than what our current value estimate happens to be. And so we have a couple of options here.
We need to decide how much we want to adjust our current expectation of what the value is of taking this action in this particular state. And what that difference is-- how much we add or subtract from our existing notion of how much we expect the value to be-- is dependent on this parameter alpha, also called the learning rate. And alpha represents, in effect, how much we value new information compared to how much we value old information. An alpha value of 1 means we really value new information. That if we have a new estimate, then it doesn't matter what our old estimate is. We're only going to consider our new estimate, because we always just want to take into consideration our new information. So the way that works is that if you imagine alpha being 1, then we're taking the old value of Q(s, a) and then adding 1 times the new value minus the old value. And that just leaves us with the new value. So when alpha is 1, all we take into consideration is what our new estimate happens to be. But over time, as we go through a lot of experiences, we already have some existing information. We might have tried taking this action nine times already, and now we've just tried it a tenth time. And we don't only want to consider this tenth experience. I also want to consider the fact that my prior nine experiences were meaningful, too. And that's data I don't necessarily want to lose. And so this alpha controls that decision-- controls how important the new information is. An alpha of 0 would mean ignore all the new information-- just keep this Q value the same. An alpha of 1 means replace the old information entirely with the new information. And somewhere in between, keep some sort of balance between these two values. And we can put this equation a little bit more formally, as well. The old value estimate is our old estimate for what the value is of taking this action in a particular state. That's just Q(s, a). We have it once here. And we're going to add something to it. We're going to add alpha times the new value estimate minus the old value estimate. But the old value estimate, we just look up by calling this Q function. And what then is the new value estimate? Based on this experience we have just taken, what is our new estimate for the value of taking this action in this particular state? Well, it's going to be composed of two parts. It's going to be composed of what reward did I just get from taking this action in this state, and then what can I expect my future rewards to be from this point forward. So it's going to be r, some reward I'm getting right now, plus whatever I estimate I'm going to get in the future. And how do I estimate what I'm going to get in the future? Well, it's a bit of another call to this Q function. It's going to be the maximum across all possible actions I could take next, saying, all right, of all of these possible actions I could take, which one is going to have the highest reward? So this then-- it looks a little bit complicated-- is going to be our notion for how we're going to perform this kind of update. I have some estimate, some old estimate, for what the value is of taking this action in this state, and I'm going to update it based on new information. That I experienced some reward, I predict what my future reward is going to be. And using that, I update what I estimate the reward will be for taking this action in this particular state. And there are other additions you might make to this algorithm, as well.
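Written out, that update is Q(s, a) <- Q(s, a) + alpha * ((r + max over a' of Q(s', a')) - Q(s, a)), where s' is the new state we ended up in. Here is a minimal sketch of it in code, assuming Q is stored as a dictionary keyed by (state, action) pairs and that actions(state) is some hypothetical helper listing the actions available in a state:

    # A sketch of the Q-learning update described above. Missing entries in the
    # Q dictionary count as 0, alpha is the learning rate, and actions(state)
    # is a hypothetical helper returning the actions available in a state.
    def update_q(Q, state, action, new_state, reward, alpha, actions):
        old_estimate = Q.get((state, action), 0)

        # The best value we currently expect from the state we ended up in
        future_rewards = max(
            (Q.get((new_state, a), 0) for a in actions(new_state)),
            default=0
        )

        # New estimate = current reward plus expected future rewards
        new_estimate = reward + future_rewards

        # Blend old and new estimates according to the learning rate alpha
        Q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)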
Sometimes, it might not be the case that you want to weight future rewards equally to current rewards. Maybe you want an agent that values reward now over reward later. And so sometimes, you can even add another term in here, some other parameter, where you discount future rewards and say future rewards are not as valuable as rewards immediately-- that getting reward in the current time step is better than waiting a year and getting rewards later. But that's up to the programmer to decide what that parameter ought to be. But the big picture idea of this entire formula is to say that every time we experience some new reward, we take that into account. We update our estimate of how good this action is. And then in the future, we can make decisions based on that algorithm. Once we have some good estimate for every state and for every action of what the value is of taking that action, then we can do something like implement a greedy decision-making policy. That if I am in a state and I want to know what action I should take in that state, then I consider, for all of my possible actions, what is the value of Q(s, a)-- what is my estimated value of taking that action in that state-- and I will just pick the action that has the highest value after I evaluate that expression. So I pick the action that has the highest value, and based on that, that tells me what action I should take. At any given state that I'm in, I can just greedily say, across all my actions, this action gives me the highest expected value, and so I'll go ahead and choose that action as the action that I take as well. But there is a downside to this kind of approach. And the downside comes up in a situation like this, where we know that there is some solution that gets me to the reward, and our agent has been able to figure that out. But it might not necessarily be the best way or the fastest way. If the agent is allowed to explore a little bit more, it might find that it can get to the reward faster by taking some other route instead-- by going through this particular path that is a faster way to get to that ultimate goal. And maybe we would like for the agent to be able to figure that out as well. But if the agent always takes the actions that it knows to be best, when it gets to this particular square, it doesn't know that this is a good action, because it's never really tried it. But it knows that going down eventually leads its way to this reward. So it might learn in the future that it should just always take this route, and it's never going to explore and go along that other route instead. So in reinforcement learning, there's this tension between exploration and exploitation. And exploitation generally refers to using knowledge that the AI already has. The AI already knows that this is a move that leads to reward, so it'll go ahead and use that move. And exploration is all about exploring other actions that we may not have explored as thoroughly before, because maybe one of these actions, even if I don't know anything about it, might lead to better rewards faster or more rewards in the future. And so an agent that only ever exploits information and never explores might be able to get reward, but it might not maximize its rewards, because it doesn't know what other possibilities are out there-- possibilities that it would only know about by taking advantage of exploration. And so how can we try and address this?
Well, one possible solution is known as the epsilon-greedy algorithm, where we set epsilon equal to how often we want to just make a random move, where occasionally, we will just make a random move in order to say, let's try to explore and see what happens. And then the logic of the algorithm will be: with probability 1 minus epsilon, choose the estimated best move. In a greedy case, we'd always choose the best move. But in epsilon-greedy, we're most of the time going to choose the best move, but sometimes, with probability epsilon, we're going to choose a random move instead. So every time we're faced with the ability to take an action, sometimes we're going to choose the best move, and sometimes we're just going to choose a random move. So this type of algorithm then can be quite powerful in a reinforcement learning context, by not always just choosing the best possible move right now, but sometimes, especially early on, allowing yourself to make random moves that allow you to explore various different possible states and actions more. And maybe over time, you might decrease your value of epsilon, more and more often choosing the best move after you are more confident that you've explored what all of the possibilities actually are. So we can put this into practice. And one very common application of reinforcement learning is in game playing. That if you want to teach an agent how to play a game, you just let the agent play the game a whole bunch. And then the reward signal happens at the end of the game. When the game is over, if our AI won the game, it gets a reward of, like, 1, for example. And if it lost the game, it gets a reward of negative 1. And from that, it begins to learn what actions are good and what actions are bad. You don't have to tell the AI what's good and what's bad, but the AI figures it out based on that reward. Winning the game is some signal. Losing the game is some signal. And based on all of that, it begins to figure out what decisions it should actually make. So one very simple game, which you may have played before, is a game called Nim. And in the game of Nim, you've got a whole bunch of objects in a whole bunch of different piles, where here I've represented each pile as an individual row. So you've got one object in the first pile, three in the second pile, five in the third pile, seven in the fourth pile. And the game of Nim is a two-player game where players take turns removing objects from piles. And the rule is that on any given turn, you are allowed to remove as many objects as you want from any one of these piles, any one of these rows. You have to remove at least one object, but you can remove as many as you want from exactly one of the piles. And whoever takes the last object loses. So player 1 might remove four from this pile here. Player 2 might remove four from this pile here. So now we've got four piles left-- one, three, one, and three. Player 1 might remove, you know, the entirety of the second pile. Player 2, if they're being strategic, might remove two from the third pile. Now we've got three piles left, each with one object left. Player 1 might remove one from one pile. Player 2 removes one from the other pile. And now player 1 is left with choosing this one object from the last pile, at which point player 1 loses the game. So fairly simple game. Piles of objects. On any turn, you choose how many objects to remove from one pile. Whoever removes the last object loses.
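A sketch of that epsilon-greedy choice, reusing the same hypothetical Q dictionary and actions(state) helper from the update sketch above:

    # A sketch of epsilon-greedy action selection: with probability epsilon we
    # explore (random action); otherwise we exploit (highest estimated Q-value)
    import random

    def choose_action(Q, state, actions, epsilon):
        available = actions(state)
        if random.random() < epsilon:
            return random.choice(available)
        return max(available, key=lambda a: Q.get((state, a), 0))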
And this is the type of game you can encode into an AI fairly easily, because the states are really just four numbers. Every state is just how many objects are in each of the four piles. And the actions are things like how many objects am I going to remove from which one of these individual piles. And the reward happens at the end. That if you were the player that had to remove the last object, then you get some sort of punishment. But if you were not, and the other player had to remove the last object, well, then you get some sort of reward. So we could actually try and show a demonstration of this-- that I have implemented an AI to play the game of Nim. All right, so here, what we're going to do is create an AI as a result of training the AI on some number of games that the AI is going to play against itself. Where the idea is the AI will play games against itself, learn from each of those experiences, and learn what to do in the future. And then I, the human, will play against the AI. So initially, we'll say train zero times, meaning we're not going to let the AI play any practice games against itself in order to learn from its experiences. We're just going to see how well it plays. And it looks like there are four piles. I can choose how many I remove from any one of the piles. So maybe from pile three, I will remove five objects, for example. And now the AI chose to take one item from pile zero. So I'm left with these piles now, for example. And so here, I could choose maybe to say I would like to remove from pile two all five of them, for example. And so the AI chose to take two away from pile one. Now I'm left with one pile that has one object, one pile that has two objects. So from pile three, I will remove two objects. And now I've left the AI with no choice but to take that last one. And so the game is over, and I was able to win. But I did so because the AI was really just playing randomly. It didn't have any prior experience that it was using in order to make these sorts of judgments. Now let the AI train itself on, like, 10,000 games. I'm going to let the AI play 10,000 games of Nim against itself. Every time it wins or loses, it's going to learn from that experience and learn in the future what to do and what not to do. So here then, I'll go ahead and run this again. And now you see the AI running through a whole bunch of training games-- 10,000 training games against itself. And now it's going to let me make these sorts of decisions. So now I'm going to play against the AI. Maybe I'll remove one from pile three. And the AI took everything from pile three, so I'm left with three piles. And I'll go ahead and, from pile two, maybe remove three items. And the AI removes one item from pile zero. I'm left with two piles, each of which has two items in it. I'll remove one from pile one, I guess. And the AI took two from pile two, leaving me with no choice but to take one away from pile one. So it seems like after playing 10,000 games of Nim against itself, the AI has learned something about what states and what actions tend to be good, and has begun to learn some sort of pattern for how to predict what actions are going to be good and what actions are going to be bad in any given state. So reinforcement learning can be a very powerful technique for achieving these sorts of game-playing agents-- agents that are able to play a game well just by learning from experience, whether that's playing against other people or by playing against itself and learning from those experiences, as well.
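The self-play training just described could be wired together roughly like this. This is not the actual implementation used in the demo, just one plausible sketch that reuses the hypothetical update_q and choose_action functions from the earlier sketches, with simple helpers for the game itself:

    # A rough sketch of Q-learning through self-play for Nim. The state is a
    # tuple of pile sizes; an action is (pile index, number of objects removed).
    def available_actions(piles):
        return [(i, n) for i, pile in enumerate(piles) for n in range(1, pile + 1)]

    def make_move(piles, action):
        i, n = action
        piles = list(piles)
        piles[i] -= n
        return tuple(piles)

    def train(games=10000, alpha=0.5, epsilon=0.1):
        Q = {}
        for _ in range(games):
            state = (1, 3, 5, 7)
            last = {0: None, 1: None}   # each player's most recent (state, action)
            player = 0
            while True:
                action = choose_action(Q, state, available_actions, epsilon)
                last[player] = (state, action)
                new_state = make_move(state, action)
                if all(pile == 0 for pile in new_state):
                    # The current player took the last object and loses; the
                    # opponent's previous move therefore led to a win.
                    update_q(Q, state, action, new_state, -1, alpha, available_actions)
                    s, a = last[1 - player]
                    update_q(Q, s, a, new_state, 1, alpha, available_actions)
                    break
                if last[1 - player] is not None:
                    # Game continues: no reward yet for the opponent's last move
                    s, a = last[1 - player]
                    update_q(Q, s, a, new_state, 0, alpha, available_actions)
                state = new_state
                player = 1 - player
        return Q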
Now Nim is a bit of an easy game to use reinforcement learning for, because there are so few states. There are only as many states as there are combinations of how many objects are in each of these various different piles. You might imagine that it's going to be harder if you think of a game like chess, or games where there are many, many more states and many, many more actions that you can imagine taking, where it's not going to be as easy to learn for every state and for every action what the value is going to be. So oftentimes in that case, we can't necessarily learn exactly what the value is for every state and for every action, but we can approximate it. So much as we saw with minimax, where we could use a depth-limiting approach to stop calculating at a certain point, we can do a similar type of approximation known as function approximation in a reinforcement learning context, where instead of learning a value of Q for every state and every action, we just have some function that estimates what the value is for taking this action in this particular state, which might be based on various different features of the state that the agent happens to be in. You might have to choose what those features actually are, but you can begin to learn some patterns that generalize beyond one specific state and one specific action-- you can begin to learn whether certain features tend to be good things or bad things. Reinforcement learning can allow you, using a very similar mechanism, to generalize beyond one particular state and say, if this other state looks kind of like this state, then maybe similar types of actions that worked in one state will also work in another state as well. And so this type of approach can be quite helpful as you begin to deal with reinforcement learning that exists in larger and larger state spaces, where it's just not feasible to explore all of the possible states that could actually exist. So those then are two of the main categories of machine learning: supervised learning, where you have labeled input-output pairs, and reinforcement learning, where an agent learns from rewards or punishments that it receives. The third major category of machine learning that we'll just touch on briefly is known as unsupervised learning. And unsupervised learning happens when we have data without any additional feedback, without labels. That in the supervised learning case, all of our data had labels. We labeled a data point with whether that was a rainy day or not rainy day, and using those labels, we were able to infer what the pattern was. Or we labeled data as a counterfeit banknote or not a counterfeit, and using those labels, we were able to draw inferences and patterns to figure out what a counterfeit banknote looks like versus not. In unsupervised learning, we don't have any access to any of those labels, but we still would like to learn some of those patterns. And one of the tasks that you might want to perform in unsupervised learning is something like clustering, where clustering is just the task of, given some set of objects, organizing them into distinct clusters-- groups of objects that are similar to one another. And there are lots of applications for clustering. It comes up in genetic research, where you might have a whole bunch of different genes, and you want to cluster them into similar genes if you're trying to analyze them across a population or across species.
It comes up with images, if you want to take all the pixels of an image and cluster them into different parts of the image. It comes up a lot in market research, if you want to divide your consumers into different groups so you know which groups to target with certain types of product advertisements, for example. And there are a number of other contexts as well in which clustering can be very applicable. One technique for clustering is an algorithm known as k-means clustering. And what k-means clustering is going to do is divide all of our data points into k different clusters, and it's going to do so by repeating this process of assigning points to clusters and then moving around those cluster centers. We're going to define a cluster by its center, the middle of the cluster, and then assign points to that cluster based on which center is closest to that point. And I'll show you an example of that now. Here, for example, I have a whole bunch of unlabeled data-- just various data points that are in some sort of graphical space. And I would like to group them into various different clusters. But I don't know how to do that originally. And let's say I want to assign three clusters to this group. You have to choose how many clusters you want in k-means clustering, but you could try multiple and see how well those values perform. But I'll start just by randomly picking some places to put the centers of those clusters. That maybe I have a blue cluster, a red cluster, and a green cluster. And I'm going to start with the centers of those clusters just being in these three locations here. And what k-means clustering tells us to do is, once I have the centers of the clusters, assign every point to a cluster based on which cluster center it is closest to. So we end up with something like this, where all of these points are closer to the blue cluster center than any other cluster center. All of these points here are closer to the green cluster center than any other cluster center. And then these two points plus these points over here, those are all closest to the red cluster center instead. So here then is one possible assignment of all these points to three different clusters. But it's not great-- it seems like in this red cluster, these points are kind of far apart, and in this green cluster, these points are kind of far apart. It might not be my ideal choice of how I would cluster these various different data points. But k-means clustering is an iterative process. After I do this, there is a next step, which is that after I've assigned all of the points to the cluster center that they are nearest to, we are going to recenter the clusters. Meaning take the cluster centers, these diamond shapes here, and move them to the middle, or the average, effectively, of all of the points that are in that cluster. So we'll take this blue point, this blue center, and go ahead and move it to the middle, or to the center, of all of the points that were assigned to the blue cluster, moving it slightly to the right in this case. And we'll do the same thing for red. We'll move the cluster center to the middle of all of these points, weighted by how many points there are. There are more points over here, so the red center ends up moving a little bit further that way. And likewise for the green center: there are many more points on this side of the green center, so the green center ends up being pulled a little bit further in this direction. So we recenter all of the clusters. And then we repeat the process.
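That assign-and-recenter loop could be sketched in just a few lines of code. Here it runs on made-up two-dimensional points with k set to 2, and for simplicity it loops a fixed number of times rather than checking whether the assignments have stopped changing:

    # A sketch of the k-means loop described above: assign each point to its
    # nearest center, then move each center to the mean of its assigned points
    import math
    import random

    def kmeans(points, k, iterations=10):
        centers = random.sample(points, k)  # start with random points as centers
        for _ in range(iterations):
            # Assignment step: each point joins the cluster whose center is nearest
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
                clusters[nearest].append(p)
            # Recenter step: move each center to the mean of its assigned points
            for i, cluster in enumerate(clusters):
                if cluster:
                    centers[i] = (
                        sum(x for x, _ in cluster) / len(cluster),
                        sum(y for _, y in cluster) / len(cluster),
                    )
        return centers, clusters

    points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
    print(kmeans(points, k=2))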
We go ahead and now reassign all of the points to the cluster center that they are now closest to. And now that we've moved around the cluster centers, these cluster assignments might change. That this point originally was closer to the red cluster center, but now it's actually closer to the blue cluster center. Same goes for this point as well. And these three points that were originally closer to the green cluster center are now closer to the red cluster center instead. So we can reassign what colors, or which clusters, each of these data points belongs to. And then repeat the process again, moving each of these cluster means, the middle of the clusters, to the mean, the average, of all of the points that happen to be there. And repeat the process again. Go ahead and assign each of the points to the cluster that they are closest to. So once we reach a point where we've assigned all the points to the cluster that they are nearest to, and nothing changed, we've reached a sort of equilibrium in this situation, where no points are changing their allegiance. And as a result, we can declare this algorithm is now over. And we now have some assignment of each of these points into three different clusters. And it looks like we did a pretty good job of trying to identify which points are more similar to one another than they are to points in other groups. So we have the green cluster down here, this blue cluster here, and then this red cluster over there as well. And we did so without any access to some labels to tell us what these various different clusters were. We just used an algorithm in an unsupervised sense, without any of those labels, to figure out which points belonged to which categories. And again, there are lots of applications for this type of clustering technique. And there are many more algorithms in each of these various different fields within machine learning-- supervised, and reinforcement, and unsupervised. But those are many of the big picture foundational ideas that underlie a lot of these techniques-- that these are the problems that we're trying to solve, and we try and solve those problems using a number of different methods: trying to take data and learn patterns in that data, whether that's trying to find neighboring data points that are similar, trying to minimize some sort of loss function, or any number of other techniques that allow us to begin to try to solve these sorts of problems. That then was a look at some of the principles that are at the foundation of modern machine learning-- this ability to take data and learn from that data so that the computer can perform a task, even if it hasn't explicitly been given instructions in order to do so. Next time, we'll continue this conversation about machine learning, looking at other techniques we can use for solving these sorts of problems. We'll see you then.