WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.470 [MUSIC PLAYING] 00:00:17.784 --> 00:00:19.620 BRIAN YU: All right, welcome back, everyone, 00:00:19.620 --> 00:00:22.500 to an introduction to Artificial Intelligence with Python. 00:00:22.500 --> 00:00:24.990 Now so far in this class, we've used AI to solve 00:00:24.990 --> 00:00:28.140 a number of different problems-- giving the AI instructions for how 00:00:28.140 --> 00:00:31.590 to search for a solution or how to satisfy certain constraints in order 00:00:31.590 --> 00:00:34.560 to find its way from some input point to some output point 00:00:34.560 --> 00:00:36.725 in order to solve some sort of problem. 00:00:36.725 --> 00:00:38.850 Today, we're going to turn to the world of learning 00:00:38.850 --> 00:00:40.890 in particular the idea of machine learning 00:00:40.890 --> 00:00:43.080 which generally refers to the idea where we are not 00:00:43.080 --> 00:00:47.160 going to give the computer explicit instructions for how to perform a task, 00:00:47.160 --> 00:00:50.040 but rather, we are going to give the computer access to information 00:00:50.040 --> 00:00:52.980 in the form of data or patterns that it can learn from, and let 00:00:52.980 --> 00:00:56.730 the computer try and figure out what those patterns are-- try and understand 00:00:56.730 --> 00:00:59.455 that data to be able to perform a task on its own. 00:00:59.455 --> 00:01:01.830 Now machine learning comes in a number of different forms 00:01:01.830 --> 00:01:03.180 and it's a very wide field. 00:01:03.180 --> 00:01:06.240 So today, we'll explore some of the foundational algorithms 00:01:06.240 --> 00:01:09.810 and ideas that are behind a lot of the different areas within machine 00:01:09.810 --> 00:01:10.480 learning. 00:01:10.480 --> 00:01:13.980 And one of the most popular is the idea of supervised machine 00:01:13.980 --> 00:01:15.840 learning or just supervised learning. 00:01:15.840 --> 00:01:18.540 And supervised learning is a particular type of task. 00:01:18.540 --> 00:01:22.620 It refers to the task where we give the computer access to a data set, 00:01:22.620 --> 00:01:26.320 where that data set consists of input-output pairs. 00:01:26.320 --> 00:01:28.080 And what we would like the computer to do 00:01:28.080 --> 00:01:31.620 is we would like our AI to be able to figure out some function 00:01:31.620 --> 00:01:34.327 that maps inputs to outputs. 00:01:34.327 --> 00:01:36.660 So we have a whole bunch of data that generally consists 00:01:36.660 --> 00:01:39.030 of some kind of input-- some evidence, some information 00:01:39.030 --> 00:01:40.890 that the computer will have access to. 00:01:40.890 --> 00:01:43.770 And we would like the computer, based on that input information, 00:01:43.770 --> 00:01:47.100 to predict what some output is going to be. 00:01:47.100 --> 00:01:50.295 And we'll give it some data so that the computer can train its model on 00:01:50.295 --> 00:01:53.310 to begin to understand how it is that this information works, 00:01:53.310 --> 00:01:56.640 and how it is that the inputs and outputs relate to each other. 00:01:56.640 --> 00:01:58.380 But ultimately, we hope that our computer 00:01:58.380 --> 00:02:01.470 will be able to figure out some function that given those inputs, 00:02:01.470 --> 00:02:03.720 is able to get those outputs. 
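To make that concrete, a data set of input-output pairs can be pictured as a list of labeled examples, as in the short Python sketch below. The measurements and category names are invented purely for illustration, and the placeholder hypothesis function is a stand-in for the function we would eventually want the computer to learn.

# A supervised learning data set: each example pairs an input
# (a tuple of measurements) with the output label a human assigned to it.
# Every value here is made up, just to show the shape of the data.
data = [
    ((0.93, 0.77), "category_a"),
    ((0.49, 0.11), "category_b"),
    ((0.88, 0.65), "category_a"),
    ((0.52, 0.20), "category_b"),
]

def hypothesis(inputs):
    # A placeholder: a learned hypothesis would map inputs to outputs
    # in a way that agrees with the labeled data as often as possible.
    return "category_a"

# Fraction of examples where the placeholder agrees with the true label.
accuracy = sum(hypothesis(x) == y for x, y in data) / len(data)
print(accuracy)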
00:02:03.720 --> 00:02:06.670 There are a couple of different tasks within supervised learning, 00:02:06.670 --> 00:02:09.900 the one we'll focus on and start with is known as classification. 00:02:09.900 --> 00:02:14.310 And classification is the problem where if I give you a whole bunch of inputs, 00:02:14.310 --> 00:02:16.740 you need to figure out some way to map those inputs 00:02:16.740 --> 00:02:20.700 into discrete categories, where you can decide what those categories are. 00:02:20.700 --> 00:02:22.867 And it's the job of the computer to predict 00:02:22.867 --> 00:02:24.450 what those categories are going to be. 00:02:24.450 --> 00:02:26.370 So that might be, for example, I give you 00:02:26.370 --> 00:02:28.980 information about a banknote like a US dollar 00:02:28.980 --> 00:02:31.290 and I'm asking you to predict for me doesn't belong 00:02:31.290 --> 00:02:33.930 to the category of authentic bank notes or does 00:02:33.930 --> 00:02:36.450 it belong to the category of counterfeit banknotes. 00:02:36.450 --> 00:02:38.160 You need to categorize the input. 00:02:38.160 --> 00:02:40.920 And we want to train the computer to figure out some function 00:02:40.920 --> 00:02:43.230 to be able to do that calculation. 00:02:43.230 --> 00:02:45.780 Another example might be the case of weather, something 00:02:45.780 --> 00:02:48.822 we've talked about a little bit so far in this class, where we would like 00:02:48.822 --> 00:02:52.020 to predict on a given day is it going to rain on that day, 00:02:52.020 --> 00:02:53.700 is it going to be cloudy on that day. 00:02:53.700 --> 00:02:55.533 And before, we've seen how we could do this, 00:02:55.533 --> 00:02:59.190 if we really give the computer all the exact probabilities for, you know, 00:02:59.190 --> 00:03:01.860 if these are the conditions, what's the probability of rain, 00:03:01.860 --> 00:03:04.590 oftentimes, we don't have access to that information, though. 00:03:04.590 --> 00:03:07.410 But what we do have access to is a whole bunch of data. 00:03:07.410 --> 00:03:10.410 So if we wanted to be able to predict something like is it going to rain 00:03:10.410 --> 00:03:12.810 or is it not going to rain, we would give the computer 00:03:12.810 --> 00:03:15.840 historical information about days when it was raining 00:03:15.840 --> 00:03:18.660 and days when it was not raining, and ask the computer 00:03:18.660 --> 00:03:21.240 to look for patterns in that data. 00:03:21.240 --> 00:03:22.960 So what might that data look like? 00:03:22.960 --> 00:03:25.380 Well, we could structure that data in a table like this. 00:03:25.380 --> 00:03:28.800 This might be what our table looks like, where are for any particular day going 00:03:28.800 --> 00:03:32.400 back, we have information about like that day's humidity, that day's air 00:03:32.400 --> 00:03:33.180 pressure. 00:03:33.180 --> 00:03:35.160 And then importantly, we have a label-- 00:03:35.160 --> 00:03:39.060 something where the human has said that on this particular day, it was raining 00:03:39.060 --> 00:03:40.070 or it was not raining. 00:03:40.070 --> 00:03:42.720 So you could fill in this table with a whole bunch of data. 00:03:42.720 --> 00:03:46.350 And what makes this what we would call a supervised learning exercise 00:03:46.350 --> 00:03:49.440 is that a human has gone in and labeled each of these data points. 
00:03:49.440 --> 00:03:52.260 Said that on this day, when these were the values for the humidity 00:03:52.260 --> 00:03:56.330 and pressure, that day was a rainy day and this day was a not rainy day. 00:03:56.330 --> 00:03:58.830 And what we would like the computer to be able to do then 00:03:58.830 --> 00:04:01.560 is to be able to figure out, given these inputs, given 00:04:01.560 --> 00:04:05.430 the humidity and the pressure, can the computer predict what label 00:04:05.430 --> 00:04:06.945 should be associated with that day. 00:04:06.945 --> 00:04:08.820 Does that day look more like it's going to be 00:04:08.820 --> 00:04:13.650 a day that rains or does it look more like a day when it's not going to rain. 00:04:13.650 --> 00:04:15.630 Put a little bit more mathematically, you 00:04:15.630 --> 00:04:18.928 can think of this as a function that takes two inputs-- 00:04:18.928 --> 00:04:21.720 the inputs being the data points that our computer will have access 00:04:21.720 --> 00:04:23.610 to-- things like humidity and pressure. 00:04:23.610 --> 00:04:25.500 So we could write a function, f, that takes 00:04:25.500 --> 00:04:27.730 as input both humidity and pressure. 00:04:27.730 --> 00:04:31.170 And then the output is going to be what category 00:04:31.170 --> 00:04:34.590 we would ascribe to these particular input points-- what label 00:04:34.590 --> 00:04:36.135 we would associate with that input. 00:04:36.135 --> 00:04:38.010 So we've seen a couple of example data points 00:04:38.010 --> 00:04:41.250 here, where given this value for humidity and this value for pressure, 00:04:41.250 --> 00:04:44.610 we predict is it going to rain or is it not going to rain. 00:04:44.610 --> 00:04:47.610 And that's information that we just gathered from the world. 00:04:47.610 --> 00:04:51.060 We measured on various different days what the humidity and pressure were. 00:04:51.060 --> 00:04:55.170 We observed whether or not we saw rain or no rain on that particular day. 00:04:55.170 --> 00:04:58.950 And this function, f, is what we would like to approximate. 00:04:58.950 --> 00:05:00.900 Now the computer and we humans don't really 00:05:00.900 --> 00:05:04.960 know exactly how this function f works-- it's probably quite a complex function. 00:05:04.960 --> 00:05:08.040 So what we're going to do instead is attempt to estimate it. 00:05:08.040 --> 00:05:12.240 We would like to come up with a hypothesis function, h, 00:05:12.240 --> 00:05:15.430 which is going to try to approximate what f does. 00:05:15.430 --> 00:05:19.260 We want to come up with some function h that will also take the same inputs 00:05:19.260 --> 00:05:22.800 and we'll also produce an output, rain or no rain. 00:05:22.800 --> 00:05:25.950 And ideally, we'd like these two functions to agree on as much 00:05:25.950 --> 00:05:27.030 as possible. 00:05:27.030 --> 00:05:30.810 So the goal then of these supervised learning classification tasks 00:05:30.810 --> 00:05:34.060 is going to be to figure out what does that function h look like. 00:05:34.060 --> 00:05:37.860 How can we begin to estimate, given all of this information, all of this data, 00:05:37.860 --> 00:05:42.330 what category or what label should be assigned to a particular data point. 00:05:42.330 --> 00:05:44.457 So where can you begin doing this? 
00:05:44.457 --> 00:05:47.040 Well, a reasonable thing to do, especially in this situation-- 00:05:47.040 --> 00:05:48.780 I have two numerical values-- 00:05:48.780 --> 00:05:53.460 is I could try to plot this on a graph that has two axes-- an x-axis 00:05:53.460 --> 00:05:54.332 and the y-axis. 00:05:54.332 --> 00:05:57.540 And in this case, we're just going to be using two numerical values as input, 00:05:57.540 --> 00:05:59.795 but these same types of ideas at scale as you 00:05:59.795 --> 00:06:01.170 add more and more inputs as well. 00:06:01.170 --> 00:06:04.260 We'll be plotting things in two dimensions, but as we'll soon see, 00:06:04.260 --> 00:06:07.930 you could add more inputs and just imagine things in multiple dimensions. 00:06:07.930 --> 00:06:10.920 And while we humans have trouble conceptualizing anything 00:06:10.920 --> 00:06:13.410 really beyond three dimensions, at least visually, 00:06:13.410 --> 00:06:15.600 a computer has no problem with trying to imagine 00:06:15.600 --> 00:06:17.280 things and many, many more dimensions. 00:06:17.280 --> 00:06:20.343 That for a computer, each dimension is just some separate number 00:06:20.343 --> 00:06:21.510 that are just keeping track. 00:06:21.510 --> 00:06:23.385 So it wouldn't be unreasonable for a computer 00:06:23.385 --> 00:06:25.890 to think in 10 dimensions or 100 dimensions 00:06:25.890 --> 00:06:28.000 to be able to try to solve a problem. 00:06:28.000 --> 00:06:31.200 But for now, we've got two inputs, so we'll graph things along two axes-- 00:06:31.200 --> 00:06:33.840 an x-axis, which will here represent humidity, 00:06:33.840 --> 00:06:36.580 and a y-axis, which here represents pressure. 00:06:36.580 --> 00:06:40.530 And what we might do is say, let's take all of the days that were raining, 00:06:40.530 --> 00:06:44.130 and just try to plop them on this graph, and see where they fall on this graph. 00:06:44.130 --> 00:06:46.590 And here might be all of the rainy days, where 00:06:46.590 --> 00:06:49.020 each rainy day is one of these blue dots here 00:06:49.020 --> 00:06:51.710 that corresponds to a particular value for humidity 00:06:51.710 --> 00:06:53.422 and a particular value for pressure. 00:06:53.422 --> 00:06:56.380 And then I might do the same thing with the days that were not raining. 00:06:56.380 --> 00:06:58.380 So I take all the not rainy days, figure out 00:06:58.380 --> 00:07:01.020 what their values were for each of these two inputs, 00:07:01.020 --> 00:07:03.743 and go ahead and plot them on this graph as well. 00:07:03.743 --> 00:07:05.160 And I've here plotted them in red. 00:07:05.160 --> 00:07:09.870 So blue here stands for a rainy day, red here stands for a not rainy day. 00:07:09.870 --> 00:07:11.340 And this then is the input-- 00:07:11.340 --> 00:07:14.160 that my computer has access to all of this input. 00:07:14.160 --> 00:07:18.270 And what I would like the computer to be able to do is to train a model such 00:07:18.270 --> 00:07:21.120 that if I'm ever presented with a new input that doesn't have 00:07:21.120 --> 00:07:25.170 a label associated with it, something like this white dot here, 00:07:25.170 --> 00:07:28.530 I would like to predict given those values for each of the two inputs, 00:07:28.530 --> 00:07:31.860 should we classify it as a blue dot, a rainy day, 00:07:31.860 --> 00:07:34.985 or should we classify it as a red dot, a not rainy day. 
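A plot along those lines could be sketched with matplotlib as follows. The humidity and pressure readings are made-up numbers chosen only so the two labels form rough clusters, and the single unlabeled point stands in for the white dot described above.

import matplotlib.pyplot as plt

# Made-up (humidity, pressure) readings for labeled days.
rain = [(0.81, 0.99), (0.85, 1.00), (0.78, 0.98), (0.88, 0.97)]
no_rain = [(0.45, 1.03), (0.50, 1.04), (0.40, 1.05), (0.55, 1.02)]
new_day = (0.80, 0.985)  # an unlabeled day we would like to classify

plt.scatter([h for h, p in rain], [p for h, p in rain], color="blue", label="rain")
plt.scatter([h for h, p in no_rain], [p for h, p in no_rain], color="red", label="no rain")
plt.scatter(*new_day, color="white", edgecolors="black", label="unlabeled")
plt.xlabel("humidity")
plt.ylabel("pressure")
plt.legend()
plt.show()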
00:07:34.985 --> 00:07:37.860 And if you're just looking at this picture graphically trying to say, 00:07:37.860 --> 00:07:41.100 all right, this white dot, does it look like it belongs to the blue category 00:07:41.100 --> 00:07:43.560 or does it look like it belongs to the red category, 00:07:43.560 --> 00:07:47.400 I think most people would agree that it probably belongs to the blue category. 00:07:47.400 --> 00:07:48.180 And why is that? 00:07:48.180 --> 00:07:51.967 Well, it looks like it's close to other blue dots. 00:07:51.967 --> 00:07:54.300 And that's not a very formal notion, but it's the notion 00:07:54.300 --> 00:07:56.100 that we'll formalize it in just a moment-- 00:07:56.100 --> 00:07:59.190 that because it seems to be close to, like, this blue dot here, like, 00:07:59.190 --> 00:08:01.560 nothing else it's closer to it, than we might say 00:08:01.560 --> 00:08:03.720 that it should be categorized as blue. 00:08:03.720 --> 00:08:06.030 It should fall into that category of, I think 00:08:06.030 --> 00:08:08.760 that day is going to be a rainy day based on that input. 00:08:08.760 --> 00:08:11.880 It might not be totally accurate, but it's a pretty good guess. 00:08:11.880 --> 00:08:15.090 And this type of algorithm is actually a very popular and common machine 00:08:15.090 --> 00:08:18.750 learning algorithm known as nearest neighbor classification. 00:08:18.750 --> 00:08:21.810 It's an algorithm for solving these classification type problems. 00:08:21.810 --> 00:08:25.540 And in nearest neighbor classification, it's going to perform this algorithm. 00:08:25.540 --> 00:08:29.700 What it will do is, given an input, it will choose the class of the nearest 00:08:29.700 --> 00:08:31.650 data point to that input. 00:08:31.650 --> 00:08:34.890 By class, we just here mean category, like rain or no rain, 00:08:34.890 --> 00:08:36.840 counterfeit or not counterfeit. 00:08:36.840 --> 00:08:41.799 And we choose the category or the class based on the nearest data point. 00:08:41.799 --> 00:08:43.559 So given all that data we just looked at, 00:08:43.559 --> 00:08:47.070 is the nearest data point a blue point or is that a red point. 00:08:47.070 --> 00:08:49.780 And depending on the answer to that question, 00:08:49.780 --> 00:08:51.660 we were able to make some sort of judgment. 00:08:51.660 --> 00:08:54.420 We were able to say something like, we think it's going to be blue 00:08:54.420 --> 00:08:56.262 or we think it's going to be red. 00:08:56.262 --> 00:08:58.470 So likewise, we could apply this to other data points 00:08:58.470 --> 00:08:59.850 that we encounter as well. 00:08:59.850 --> 00:09:03.840 If suddenly, this data point comes about, well, it's nearest data is red, 00:09:03.840 --> 00:09:07.320 so we would go ahead and classify this as a red point, not raining. 00:09:07.320 --> 00:09:09.540 Things get a little bit trickier, though, 00:09:09.540 --> 00:09:12.543 when you look at a point like this white point over here, 00:09:12.543 --> 00:09:14.460 and you ask the same sort of question-- should 00:09:14.460 --> 00:09:17.700 it belong to the category of blue points, the rainy days? 00:09:17.700 --> 00:09:21.870 Or should it belong to the category of red points, the not rainy days? 00:09:21.870 --> 00:09:24.240 Now nearest neighbor classification would 00:09:24.240 --> 00:09:26.820 say the way you solve this problem is look at which point 00:09:26.820 --> 00:09:28.190 it is nearest to that point. 
00:09:28.190 --> 00:09:31.440 You look at this nearest point and say it's red-- it's a not rainy day. 00:09:31.440 --> 00:09:34.200 And therefore, according to nearest neighbor classification, 00:09:34.200 --> 00:09:37.470 I would say that this unlabeled point, that should also be red. 00:09:37.470 --> 00:09:40.800 It should also be classified as a not rainy day. 00:09:40.800 --> 00:09:43.800 But your intuition might think that that's a reasonable judgment 00:09:43.800 --> 00:09:47.190 to make-- that the closest thing is a not rainy day, so may as well guess 00:09:47.190 --> 00:09:48.540 that it's not rainy day. 00:09:48.540 --> 00:09:51.680 But it's probably also reasonable to look at the bigger picture of things 00:09:51.680 --> 00:09:56.670 and to say, yes, it is true, that the nearest point to it was a red point, 00:09:56.670 --> 00:10:00.060 but it's surrounded by a whole bunch of other blue points. 00:10:00.060 --> 00:10:02.310 So looking at the bigger picture, there is potentially 00:10:02.310 --> 00:10:06.235 an argument to be made that this point should actually be blue. 00:10:06.235 --> 00:10:08.610 And with only this data, we actually don't know for sure. 00:10:08.610 --> 00:10:11.310 We are given some inputs, something we're trying to predict, 00:10:11.310 --> 00:10:14.410 and we don't necessarily know what the output is going to be. 00:10:14.410 --> 00:10:17.500 So in this case, which one is correct is difficult to say. 00:10:17.500 --> 00:10:21.280 But oftentimes, considering more than just a single neighbor, considering 00:10:21.280 --> 00:10:25.300 multiple neighbors, can sometimes give us a better result. 00:10:25.300 --> 00:10:27.400 And so there's a variant on the nearest neighbor 00:10:27.400 --> 00:10:31.330 a classification algorithm that is known as the k-nearest-neighbor 00:10:31.330 --> 00:10:34.150 classification algorithm, where k is some parameter, 00:10:34.150 --> 00:10:38.090 some number that we choose for how many neighbors are we going to look at. 00:10:38.090 --> 00:10:41.440 So one nearest neighbor classification is what we saw before. 00:10:41.440 --> 00:10:44.770 Just pick the one nearest neighbor and use that category. 00:10:44.770 --> 00:10:46.780 But with k-nearest-neighbor classification, 00:10:46.780 --> 00:10:49.150 where k might be three or five or seven-- 00:10:49.150 --> 00:10:52.000 to say look at the three or five or seven closest 00:10:52.000 --> 00:10:55.930 neighbors, closest data points to that point, works a little bit differently. 00:10:55.930 --> 00:10:57.760 This algorithm, we're given an input. 00:10:57.760 --> 00:11:02.690 Choose the most common class out of the k nearest data points to that input. 00:11:02.690 --> 00:11:06.010 So if we look at the five nearest points, and three of them 00:11:06.010 --> 00:11:08.380 say it's raining and two of them say it's not raining, 00:11:08.380 --> 00:11:10.810 we'll go with the three instead of the two, 00:11:10.810 --> 00:11:14.080 because each one effectively gets one vote towards what 00:11:14.080 --> 00:11:16.450 they believe the category ought to be. 00:11:16.450 --> 00:11:18.850 And ultimately, you choose the category that 00:11:18.850 --> 00:11:21.540 has the most votes as a consequence of that. 00:11:21.540 --> 00:11:24.700 So k-nearest-neighbor classification-- fairly straightforward one 00:11:24.700 --> 00:11:25.960 to understand intuitively. 00:11:25.960 --> 00:11:29.030 You just look at the neighbors and figure out what the answer might be. 
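A minimal from-scratch sketch of k-nearest-neighbor classification might look like the following. The Euclidean distance metric and the tiny made-up data set are assumptions chosen for illustration, not anything prescribed above.

import math
from collections import Counter

def knn_classify(data, point, k=3):
    # Sort the labeled examples by their Euclidean distance from the query point.
    neighbors = sorted(data, key=lambda example: math.dist(example[0], point))
    # Each of the k nearest examples casts one vote for its own label;
    # the label with the most votes is the prediction.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Made-up (humidity, pressure) readings with human-assigned labels.
data = [
    ((0.81, 0.99), "rain"), ((0.85, 1.00), "rain"), ((0.78, 0.98), "rain"),
    ((0.45, 1.03), "no rain"), ((0.50, 1.04), "no rain"), ((0.40, 1.05), "no rain"),
]

print(knn_classify(data, (0.80, 0.985), k=1))  # single nearest neighbor
print(knn_classify(data, (0.80, 0.985), k=3))  # majority vote among three neighbors

With k set to 1 this reduces to the single-nearest-neighbor rule from before; raising k trades sensitivity to one stray neighbor for a broader vote.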
00:11:29.030 --> 00:11:30.880 And it turns out this can work very, very 00:11:30.880 --> 00:11:34.480 well for solving a whole variety of different types of classification 00:11:34.480 --> 00:11:35.440 problems. 00:11:35.440 --> 00:11:38.240 But not every model is going to work under every situation. 00:11:38.240 --> 00:11:40.865 And so one of the things we'll take a look at today, especially 00:11:40.865 --> 00:11:42.740 in the context of supervised machine learning 00:11:42.740 --> 00:11:45.740 is that there are a number of different approaches to machine learning-- 00:11:45.740 --> 00:11:47.830 a number of different algorithms that we can apply 00:11:47.830 --> 00:11:49.690 all solving the same type of problem. 00:11:49.690 --> 00:11:52.780 All solving some kind of classification problem, where 00:11:52.780 --> 00:11:56.170 we want to take inputs and organize it into different categories 00:11:56.170 --> 00:11:58.570 And no one algorithm isn't necessarily always going 00:11:58.570 --> 00:12:00.387 to be better than some other algorithm. 00:12:00.387 --> 00:12:01.720 They each have their trade-offs. 00:12:01.720 --> 00:12:04.540 And maybe depending on the data, one type of algorithm 00:12:04.540 --> 00:12:07.420 is going to be better-suited to trying to model that information 00:12:07.420 --> 00:12:08.870 than some other algorithm. 00:12:08.870 --> 00:12:10.720 And so this is what a lot of machine learning research 00:12:10.720 --> 00:12:13.090 ends up being about-- that when you're trying to apply machine learning 00:12:13.090 --> 00:12:16.480 techniques, you're often looking not just at one particular algorithm, 00:12:16.480 --> 00:12:18.250 but trying multiple different algorithms, 00:12:18.250 --> 00:12:20.770 trying to see what is going to give you the best 00:12:20.770 --> 00:12:25.790 results for trying to predict some function that maps inputs to outputs. 00:12:25.790 --> 00:12:29.410 So what then are the drawbacks of k-nearest-neighbor classification? 00:12:29.410 --> 00:12:30.640 Well, there are a couple. 00:12:30.640 --> 00:12:33.070 One might be that in a naive approach at least, 00:12:33.070 --> 00:12:35.642 it could be fairly slow to have to go through and measure 00:12:35.642 --> 00:12:38.350 the distance between a point and every single one of these points 00:12:38.350 --> 00:12:39.310 that exist here. 00:12:39.310 --> 00:12:40.810 Now there are ways of trying to get around that. 00:12:40.810 --> 00:12:43.435 There are data structures that can help to make it more quickly 00:12:43.435 --> 00:12:45.260 to be able to find these neighbors. 00:12:45.260 --> 00:12:48.490 There are also techniques you can use to try and prune some of this data, 00:12:48.490 --> 00:12:50.680 remove some of the data points so that you're only 00:12:50.680 --> 00:12:54.470 left with the relevant data points just to make it a little bit easier. 00:12:54.470 --> 00:12:56.890 But ultimately, what we might like to do is come up 00:12:56.890 --> 00:13:00.370 with another way of trying to do this classification. 00:13:00.370 --> 00:13:02.560 And one way of trying to do the classification 00:13:02.560 --> 00:13:04.750 was looking at what are the neighboring points. 00:13:04.750 --> 00:13:08.200 But another way might be to try to look at all of the data 00:13:08.200 --> 00:13:11.440 and see if we can come up with some decision boundary-- 00:13:11.440 --> 00:13:15.860 some boundary that will separate the rainy days from the not rainy days. 
00:13:15.860 --> 00:13:19.640 In the case of two dimensions, we can do that by drawing a line, for example. 00:13:19.640 --> 00:13:22.900 So what we might want to try to do is just find some line, 00:13:22.900 --> 00:13:27.520 find some separator that divides the rainy days, the blue points over here, 00:13:27.520 --> 00:13:30.040 from the not rainy days, the red points over there. 00:13:30.040 --> 00:13:32.680 We're now trying a different approach in contrast 00:13:32.680 --> 00:13:34.570 with the nearest neighbor approach, which 00:13:34.570 --> 00:13:38.380 just looked at local data around the input data point that we cared about. 00:13:38.380 --> 00:13:42.160 Now what we're doing is trying to use a technique known as linear regression 00:13:42.160 --> 00:13:45.140 to find some sort of line that will separate the two 00:13:45.140 --> 00:13:46.858 halves from each other. 00:13:46.858 --> 00:13:49.150 Now, sometimes, it will actually be possible to come up 00:13:49.150 --> 00:13:52.630 with some line that perfectly separates all the rainy days from the not 00:13:52.630 --> 00:13:53.590 rainy days. 00:13:53.590 --> 00:13:56.140 Realistically, though, this is probably cleaner 00:13:56.140 --> 00:13:58.050 than many data sets will actually be. 00:13:58.050 --> 00:13:59.600 Oftentimes, data is messy. 00:13:59.600 --> 00:14:00.470 There are outliers. 00:14:00.470 --> 00:14:03.850 There's random noise that happens inside of a particular system. 00:14:03.850 --> 00:14:06.220 And what we'd like to do is still be able to figure out 00:14:06.220 --> 00:14:07.640 what a line might look like. 00:14:07.640 --> 00:14:12.000 So in practice, the data will not always be linearly separable, 00:14:12.000 --> 00:14:14.740 where linearly separable refers to some data set 00:14:14.740 --> 00:14:18.673 where I can draw a line just to separate the two halves of it perfectly. 00:14:18.673 --> 00:14:20.590 Instead, you might have a situation like this, 00:14:20.590 --> 00:14:24.352 where there are some rainy points that are on this side of the line and some 00:14:24.352 --> 00:14:26.560 not raining points that are on that side of the line. 00:14:26.560 --> 00:14:30.880 And there may not be a line that perfectly separates 00:14:30.880 --> 00:14:34.270 what path of the inputs from the other half-- that perfectly separates all 00:14:34.270 --> 00:14:36.490 the rainy days from the not rainy days. 00:14:36.490 --> 00:14:40.090 But we can still say that this line does a pretty good job. 00:14:40.090 --> 00:14:42.080 And we'll try to formalize a little bit later. 00:14:42.080 --> 00:14:44.200 What we mean when we say something like this line 00:14:44.200 --> 00:14:46.932 does a pretty good job of trying to make that prediction. 00:14:46.932 --> 00:14:48.640 But for now, let's just say we're looking 00:14:48.640 --> 00:14:51.690 for a line that does as good of a job as we can 00:14:51.690 --> 00:14:56.750 at trying to separate one category of things from another category of things. 00:14:56.750 --> 00:15:00.160 So let's now try to formalize this a little bit more mathematically. 00:15:00.160 --> 00:15:02.110 We want to come up with some sort of function, 00:15:02.110 --> 00:15:04.300 some way we can define this line. 00:15:04.300 --> 00:15:08.830 And our inputs are things like humidity and pressure in this case. 00:15:08.830 --> 00:15:13.840 So our inputs we might call x1 is going to be our represent humidity and x2 00:15:13.840 --> 00:15:15.500 is going to represent pressure. 
00:15:15.500 --> 00:15:18.500 These are inputs that we are going to provide to our machine learning 00:15:18.500 --> 00:15:19.250 algorithm. 00:15:19.250 --> 00:15:21.860 And given those inputs, we would like for our model 00:15:21.860 --> 00:15:24.260 to be able to predict some sort of output. 00:15:24.260 --> 00:15:27.260 And we're going to predict that using our hypothesis function, 00:15:27.260 --> 00:15:28.610 which we called h. 00:15:28.610 --> 00:15:33.680 Our hypothesis function is going to take as input, x1 and x2, humidity 00:15:33.680 --> 00:15:34.890 and pressure in this case. 00:15:34.890 --> 00:15:36.740 And you can imagine if we didn't just have two inputs-- 00:15:36.740 --> 00:15:38.870 we had three or four or five inputs or more-- 00:15:38.870 --> 00:15:42.260 we could have this hypothesis function take all of those as input. 00:15:42.260 --> 00:15:45.550 And we'll see examples of that a little bit later as well. 00:15:45.550 --> 00:15:49.470 And now the question is, what does this hypothesis function do? 00:15:49.470 --> 00:15:53.210 Well, it really just needs to measure is this data 00:15:53.210 --> 00:15:58.640 point on one side of the boundary or is it on the other side of the boundary? 00:15:58.640 --> 00:16:00.620 And how do we formalize that boundary? 00:16:00.620 --> 00:16:02.990 Well, the boundary is generally going to be 00:16:02.990 --> 00:16:06.780 a linear combination of these input variables, 00:16:06.780 --> 00:16:08.250 at least in this particular case. 00:16:08.250 --> 00:16:10.910 So what we're trying to do when we say linear combination 00:16:10.910 --> 00:16:13.925 is take each of these inputs and multiply them by some number 00:16:13.925 --> 00:16:15.550 that we're going to have to figure out. 00:16:15.550 --> 00:16:18.650 We'll generally call that number a weight for how important 00:16:18.650 --> 00:16:21.800 should these variables be in trying to determine the answer. 00:16:21.800 --> 00:16:24.440 So weight each of these variables with some weight. 00:16:24.440 --> 00:16:27.440 And we might add like a constant to it just to try and make the function 00:16:27.440 --> 00:16:28.610 a little bit different. 00:16:28.610 --> 00:16:30.280 And the result we just need to compare-- 00:16:30.280 --> 00:16:33.350 is it greater than 0 or is it less than 0 to say 00:16:33.350 --> 00:16:37.280 doesn't belong on one side of the line or the other side of the line. 00:16:37.280 --> 00:16:41.060 And so what that mathematical expression might look like is this. 00:16:41.060 --> 00:16:45.773 We would take each of my variables, x1 and x2, multiply them by some weight. 00:16:45.773 --> 00:16:47.690 I don't yet know what that weight is, but it's 00:16:47.690 --> 00:16:50.720 going to be some number, weight 1 and weight 2. 00:16:50.720 --> 00:16:53.570 And maybe we just want to add some other weight 0 to it 00:16:53.570 --> 00:16:56.870 because the function might require us to shift the entire value up 00:16:56.870 --> 00:16:58.610 or down by a certain amount. 00:16:58.610 --> 00:16:59.840 And then we just compare. 00:16:59.840 --> 00:17:02.870 If we do all this math, is a greater than or equal to 0. 00:17:02.870 --> 00:17:05.810 If so, we might categorize that data point as a rainy day. 00:17:05.810 --> 00:17:09.440 And otherwise, we might say no rain. 00:17:09.440 --> 00:17:12.260 So the key here then is that this expression 00:17:12.260 --> 00:17:15.619 is how we are going to calculate whether it's a rainy day or not. 
00:17:15.619 --> 00:17:18.690 We're going to do a bunch of math where we take each of the variables, 00:17:18.690 --> 00:17:21.650 multiply them by a weight, maybe add an extra weight to it, 00:17:21.650 --> 00:17:24.079 see if the result is greater than or equal to 0. 00:17:24.079 --> 00:17:26.240 And using that result of that expression, 00:17:26.240 --> 00:17:29.660 we're able to determine whether it's raining or not raining. 00:17:29.660 --> 00:17:33.080 This expression here is in this case going to refer to just some line. 00:17:33.080 --> 00:17:36.170 If you were to plot that graphically, it would just be some line. 00:17:36.170 --> 00:17:40.320 And what the line actually looks like depends upon these weights. 00:17:40.320 --> 00:17:42.710 x1 and x2 are the inputs, but these weights 00:17:42.710 --> 00:17:46.220 are really what determine the shape of that line, the slope of that line, 00:17:46.220 --> 00:17:49.190 and what that line actually looks like. 00:17:49.190 --> 00:17:52.330 So we then would like to figure out what these weights should be. 00:17:52.330 --> 00:17:54.520 We can choose whatever weights we want, but we 00:17:54.520 --> 00:17:58.330 want to choose weights in such a way that if you pass in a rainy day's 00:17:58.330 --> 00:18:00.850 humidity and pressure, then you end up with a result that 00:18:00.850 --> 00:18:02.440 is greater than or equal to 0. 00:18:02.440 --> 00:18:05.690 And we would like it such that if we passed into our hypothesis function 00:18:05.690 --> 00:18:09.040 a not rainy day's inputs, then the output that we get 00:18:09.040 --> 00:18:11.043 should be not raining. 00:18:11.043 --> 00:18:13.960 So before we get there, let's try and formalize this a little bit more 00:18:13.960 --> 00:18:17.350 mathematically just to get a sense for how it is that you'll often see this 00:18:17.350 --> 00:18:21.560 if you ever go further into supervised machine learning and explore this idea. 00:18:21.560 --> 00:18:23.660 One thing is that generally for these categories, 00:18:23.660 --> 00:18:27.420 we'll sometimes just use the names of the categories like rain and not rain. 00:18:27.420 --> 00:18:30.670 Often, mathematically, if we're trying to do comparisons between these things, 00:18:30.670 --> 00:18:33.110 it's easier just to deal in the world of numbers. 00:18:33.110 --> 00:18:35.530 So we could just say 1 and 0-- 00:18:35.530 --> 00:18:37.820 1 for raining, 0 for not raining. 00:18:37.820 --> 00:18:39.040 So we do all this math. 00:18:39.040 --> 00:18:41.500 And if the result is greater than or equal to 0, 00:18:41.500 --> 00:18:45.160 we'll go ahead and say our hypothesis function outputs 1, meaning raining. 00:18:45.160 --> 00:18:48.220 And otherwise, it outputs 0, meaning not raining. 00:18:48.220 --> 00:18:52.090 And oftentimes, this type of expression will instead 00:18:52.090 --> 00:18:54.890 be expressed using vector mathematics. 00:18:54.890 --> 00:18:57.390 And all a vector is, if you're not familiar with the term, 00:18:57.390 --> 00:19:00.400 is that it refers to a sequence of numerical values. 00:19:00.400 --> 00:19:03.580 You could represent that in Python using, like, a list of numerical values 00:19:03.580 --> 00:19:06.370 or a tuple of numerical values. 00:19:06.370 --> 00:19:10.160 And here, we have a couple of sequences of numerical values.
00:19:10.160 --> 00:19:13.300 One of our vectors, one of our sequences of numerical values, 00:19:13.300 --> 00:19:15.130 are all of these individual weights-- 00:19:15.130 --> 00:19:18.340 w0, w1 and w2. 00:19:18.340 --> 00:19:21.400 So we could construct what we'll call a weight vector 00:19:21.400 --> 00:19:23.860 and we'll see why this is useful in a moment called w, 00:19:23.860 --> 00:19:26.838 generally represented using a boldface w, that 00:19:26.838 --> 00:19:28.630 is just a sequence of these three weights-- 00:19:28.630 --> 00:19:31.690 weight 0, weight 1, and weight 2. 00:19:31.690 --> 00:19:33.820 And to be able to calculate based on those weights 00:19:33.820 --> 00:19:37.570 whether we think a day is raining or not raining, 00:19:37.570 --> 00:19:42.490 we're going to multiply each of those weights by one of our input variables. 00:19:42.490 --> 00:19:46.630 That w2, this weight, is going to be multiplied by input variable x2. 00:19:46.630 --> 00:19:49.780 w1 is going to be multiplied by input variable x1. 00:19:49.780 --> 00:19:53.097 And w0-- well, it's not being multiplied by anything, 00:19:53.097 --> 00:19:55.180 but to make sure the vectors are the same length-- 00:19:55.180 --> 00:19:57.263 and we'll see why that's useful in just a second-- 00:19:57.263 --> 00:20:01.077 we'll just go ahead and say w0 is being multiplied by 1. 00:20:01.077 --> 00:20:03.160 Because you can multiply by something by 1 and you 00:20:03.160 --> 00:20:05.090 end up getting the exact same number. 00:20:05.090 --> 00:20:07.570 So in addition to the weight vector, w, we'll 00:20:07.570 --> 00:20:11.890 also have an input vector that we'll call x that has three values-- 00:20:11.890 --> 00:20:18.160 1, again, because we're just multiplying w0 by 1 eventually, and then x1 and x2. 00:20:18.160 --> 00:20:21.850 So here then, we've represented two distinct vectors-- a vector of weights 00:20:21.850 --> 00:20:23.590 that we need to somehow learn. 00:20:23.590 --> 00:20:25.690 The goal of our machine learning algorithm 00:20:25.690 --> 00:20:28.310 is to learn what this weight vector is supposed to be. 00:20:28.310 --> 00:20:30.502 We could choose any arbitrary set of numbers 00:20:30.502 --> 00:20:33.460 and it would produce a function that tries to predict rain or not rain, 00:20:33.460 --> 00:20:35.140 but it probably wouldn't be very good. 00:20:35.140 --> 00:20:39.190 What we want to do is come up with a good choice of these weights 00:20:39.190 --> 00:20:41.980 so that we're able to do the accurate predictions. 00:20:41.980 --> 00:20:45.790 And then this input vector represents a particular input 00:20:45.790 --> 00:20:48.970 to the function, a data point for which we would like to estimate, 00:20:48.970 --> 00:20:52.442 is that day a rainy day or is that day not rainy day. 00:20:52.442 --> 00:20:54.400 And that's going to vary just depending on what 00:20:54.400 --> 00:20:58.300 input is provided to our function, what it is that we are trying to estimate. 00:20:58.300 --> 00:21:02.390 And then to do the calculation, we want to calculate this expression here. 00:21:02.390 --> 00:21:04.900 And it turns out that expression is what we would call 00:21:04.900 --> 00:21:07.390 the dot product of these two vectors. 00:21:07.390 --> 00:21:10.480 The dot product of two vectors just means 00:21:10.480 --> 00:21:13.630 taking each of the terms and the vectors and multiplying them together, 00:21:13.630 --> 00:21:18.860 w0 multiplied by 1, w1 multiplied it by x1, w2 multiply it by x2. 
00:21:18.860 --> 00:21:21.400 And that's why these vectors need to be the same length. 00:21:21.400 --> 00:21:24.400 And then we just add all of the results together. 00:21:24.400 --> 00:21:30.040 So the dot product of w and x, our weight vector and our input vector, 00:21:30.040 --> 00:21:35.770 that's just going to be w0 times 1, or just w0 plus w1 times x1, 00:21:35.770 --> 00:21:40.570 multiplying these two terms together, plus w2 times x2, 00:21:40.570 --> 00:21:42.773 multiplying those statements together. 00:21:42.773 --> 00:21:45.190 So we have our weight vector, which we need to figure out. 00:21:45.190 --> 00:21:47.020 We need our machine learning algorithm to figure out 00:21:47.020 --> 00:21:48.280 what the weights should be. 00:21:48.280 --> 00:21:51.370 We have the input vector representing the data point 00:21:51.370 --> 00:21:54.610 that we're trying to predict a category for, predict a label for. 00:21:54.610 --> 00:21:58.270 And we're able to do that calculation by taking this dot product, which you'll 00:21:58.270 --> 00:22:00.260 often see represented in vector form-- 00:22:00.260 --> 00:22:02.052 but if you haven't seen vectors before, you 00:22:02.052 --> 00:22:04.810 can think of it as identical to just this mathematical expression. 00:22:04.810 --> 00:22:08.170 Just doing the multiplication, adding the results together. 00:22:08.170 --> 00:22:11.440 And then seeing whether the result is greater than or equal to 0 or not. 00:22:11.440 --> 00:22:14.530 This expression here is identical to the expression 00:22:14.530 --> 00:22:16.810 that we're calculating to see whether or not 00:22:16.810 --> 00:22:21.340 that answer is greater than or equal to 0 in this case. 00:22:21.340 --> 00:22:24.370 And so for that reason, you'll often see the hypothesis function 00:22:24.370 --> 00:22:25.980 written as something like this-- 00:22:25.980 --> 00:22:29.800 a simpler representation where the hypothesis takes as input 00:22:29.800 --> 00:22:34.060 some input vector x, some humidity and pressure for some day. 00:22:34.060 --> 00:22:37.810 And we want to predict an output like rain or no rain or 1 or 0 00:22:37.810 --> 00:22:40.440 if we choose to represent things numerically. 00:22:40.440 --> 00:22:42.970 And the way we do that is by taking the dot 00:22:42.970 --> 00:22:45.612 product of the weights and our input. 00:22:45.612 --> 00:22:47.320 If it's greater than or equal to 0, we'll 00:22:47.320 --> 00:22:49.150 go ahead and save the output is 1. 00:22:49.150 --> 00:22:52.030 Otherwise, the output is going to be 0. 00:22:52.030 --> 00:22:56.170 And this hypothesis we say is parameterized by the weights. 00:22:56.170 --> 00:22:58.330 Depending on what weights we choose, we'll 00:22:58.330 --> 00:23:00.372 end up getting a different hypothesis. 00:23:00.372 --> 00:23:02.830 If we choose the weights randomly, we're probably not going 00:23:02.830 --> 00:23:04.455 to get a very good hypothesis function. 00:23:04.455 --> 00:23:06.730 We'll get a 1 or a 0, but it's probably not 00:23:06.730 --> 00:23:09.160 accurately going to reflect whether we think a day is 00:23:09.160 --> 00:23:11.350 going to be rainy or not rainy. 00:23:11.350 --> 00:23:13.930 But if we choose the weights right, we can often 00:23:13.930 --> 00:23:16.390 do a pretty good job of trying to estimate 00:23:16.390 --> 00:23:20.880 whether we think the output of the function should be a 1 or a 0. 
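In code, that hypothesis is just a dot product followed by a comparison; a minimal sketch is below. The particular weight values are made up, since choosing good weights is exactly the problem discussed next.

def dot(w, x):
    # Multiply corresponding terms of two equal-length vectors and add the results.
    return sum(wi * xi for wi, xi in zip(w, x))

def hypothesis(weights, x1, x2):
    # Pair the inputs with a leading 1 so that w0 has something to multiply by,
    # then output 1 (rain) if the dot product is at least 0, and 0 (no rain) otherwise.
    x = (1, x1, x2)
    return 1 if dot(weights, x) >= 0 else 0

weights = (-3.0, 2.5, 1.2)  # made-up values for w0, w1, w2
print(hypothesis(weights, x1=0.9, x2=1.0))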
00:23:20.880 --> 00:23:23.170 And so the question then is how to figure out 00:23:23.170 --> 00:23:27.042 what these weights should be-- how to be able to tune those parameters. 00:23:27.042 --> 00:23:29.000 And there are a number of ways you can do that. 00:23:29.000 --> 00:23:32.800 One of the most common is known as the perceptron learning rule. 00:23:32.800 --> 00:23:34.303 And we'll see more of this later. 00:23:34.303 --> 00:23:36.220 But the idea of the perceptron learning rule-- 00:23:36.220 --> 00:23:37.960 and we're not going to get too deep into the mathematics, 00:23:37.960 --> 00:23:40.570 we'll mostly just introduce it more conceptually-- is 00:23:40.570 --> 00:23:44.770 to say that given some data point that we would like to learn from, 00:23:44.770 --> 00:23:50.140 some data point that has an input x and an output y, where y is like 1 for rain 00:23:50.140 --> 00:23:53.052 or 0 for not rain, then we're going to update the weights. 00:23:53.052 --> 00:23:55.010 And we'll look at the formula in just a moment. 00:23:55.010 --> 00:23:59.020 But the big picture idea is that we can start with random weights 00:23:59.020 --> 00:24:00.610 but then learn from the data. 00:24:00.610 --> 00:24:02.900 Like, take the data points one at a time. 00:24:02.900 --> 00:24:05.370 And for each one of the data points figure out, 00:24:05.370 --> 00:24:09.370 all right, what parameters do we need to change inside of the weights 00:24:09.370 --> 00:24:12.050 in order to better match that input point. 00:24:12.050 --> 00:24:13.930 And so that is the value of having access 00:24:13.930 --> 00:24:15.940 to a lot of data in the supervised machine 00:24:15.940 --> 00:24:18.520 learning algorithm-- is that you take each of the data points 00:24:18.520 --> 00:24:22.310 and maybe look at them multiple times and constantly try and figure out 00:24:22.310 --> 00:24:24.490 whether you need to shift your weights in order 00:24:24.490 --> 00:24:29.260 to better create some weight vector that is able to correctly or more accurately 00:24:29.260 --> 00:24:31.300 try to estimate what the output should be. 00:24:31.300 --> 00:24:33.970 Whether we think it's going to be raining or whether we think 00:24:33.970 --> 00:24:35.760 it's not going to be raining. 00:24:35.760 --> 00:24:37.510 So what does that weight update look like? 00:24:37.510 --> 00:24:39.468 Without going into too much of the mathematics, 00:24:39.468 --> 00:24:41.410 we're going to update each of the weights 00:24:41.410 --> 00:24:46.480 to be the result of the original weight plus some additional expression. 00:24:46.480 --> 00:24:48.820 And to understand this expression y-- 00:24:48.820 --> 00:24:51.820 well, y is what the actual output is. 00:24:51.820 --> 00:24:57.280 And hypothesis of x, the input, that's going to be what we thought the output 00:24:57.280 --> 00:24:58.100 was. 00:24:58.100 --> 00:25:01.510 And so I can replace this by saying what the actual value was 00:25:01.510 --> 00:25:03.820 minus what our estimate was. 00:25:03.820 --> 00:25:08.440 And based on the difference between the actual value and what our estimate was, 00:25:08.440 --> 00:25:11.170 we might want to change our hypothesis, change the way 00:25:11.170 --> 00:25:13.300 that we do that estimation.
00:25:13.300 --> 00:25:15.850 If the actual value and the estimate were the same thing, 00:25:15.850 --> 00:25:18.910 meaning we were correctly able to predict what category this data 00:25:18.910 --> 00:25:21.280 point belonged to, well, then actual value 00:25:21.280 --> 00:25:23.560 minus estimate, that's just going to be 0, 00:25:23.560 --> 00:25:26.440 which means this whole term on the right hand side goes to be 0. 00:25:26.440 --> 00:25:27.820 And the weight doesn't change. 00:25:27.820 --> 00:25:31.180 Weight i, where i is weight 1 or weight 2 or weight 0, 00:25:31.180 --> 00:25:33.520 weight i just stays at weight i. 00:25:33.520 --> 00:25:36.880 And none of the weights change if we were able to correctly predict 00:25:36.880 --> 00:25:39.250 what category the input belonged to. 00:25:39.250 --> 00:25:42.010 But if our hypothesis didn't correctly predict 00:25:42.010 --> 00:25:44.980 what category the input belonged to, then maybe 00:25:44.980 --> 00:25:46.900 then we need to make some changes-- 00:25:46.900 --> 00:25:50.950 adjust the weights so that we're better able to predict this kind of data point 00:25:50.950 --> 00:25:51.850 in the future. 00:25:51.850 --> 00:25:54.140 And what is the way we might do that? 00:25:54.140 --> 00:25:58.870 Well, if the actual value was bigger than the estimate, then-- and for now, 00:25:58.870 --> 00:26:02.170 we'll go ahead and assume that these is are positive values-- 00:26:02.170 --> 00:26:04.720 if the actual value is bigger than the estimate, that 00:26:04.720 --> 00:26:08.050 means we need to increase the weight in order to make it such 00:26:08.050 --> 00:26:10.090 that the output is bigger and therefore, we're 00:26:10.090 --> 00:26:13.168 more likely to get to the right actual value. 00:26:13.168 --> 00:26:15.460 And so if the actual value is bigger than the estimate, 00:26:15.460 --> 00:26:18.292 then actual value minus estimate, that'll be a positive number. 00:26:18.292 --> 00:26:21.250 And so you imagine we're just adding some positive number to the weight 00:26:21.250 --> 00:26:23.740 just to increase it ever so slightly. 00:26:23.740 --> 00:26:25.600 And likewise, the inverse case is true-- 00:26:25.600 --> 00:26:30.460 that if the actual value was less than the estimate, the actual value was 0, 00:26:30.460 --> 00:26:33.460 but we estimated 1, meaning it actually was not raining, 00:26:33.460 --> 00:26:35.740 but we predicted it was going to be raining, 00:26:35.740 --> 00:26:39.250 then we want to decrease the value of the weight, because then in that case, 00:26:39.250 --> 00:26:43.600 we want to try and lower the total value of computing that dot product in order 00:26:43.600 --> 00:26:46.690 to make it less likely that we would predict that it would actually 00:26:46.690 --> 00:26:48.010 be raining. 00:26:48.010 --> 00:26:50.840 So no need to get too deep into the mathematics of that. 00:26:50.840 --> 00:26:53.920 But the general idea is that every time we encounter some data point, 00:26:53.920 --> 00:26:57.340 we can adjust these weights accordingly to try and make the weights better 00:26:57.340 --> 00:27:00.850 line up with the actual data that we have access to. 00:27:00.850 --> 00:27:03.610 And you can repeat this process with data point after data point 00:27:03.610 --> 00:27:05.680 until eventually, hopefully, your algorithm 00:27:05.680 --> 00:27:09.430 converges to some set of weights that do a pretty good job of trying 00:27:09.430 --> 00:27:13.040 to figure out whether a day is going to be rainy or not rainy. 
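A rough sketch of that training loop is below. It applies the update just described, wi becomes wi plus alpha times (actual value minus estimate) times xi, where alpha is the step size discussed next; the labeled points are made-up readings with 1 standing for rain and 0 for not rain.

def train(data, alpha=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]          # starting weights; random values would also work
    for _ in range(epochs):      # look at the data points multiple times
        for (x1, x2), y in data:
            x = (1, x1, x2)      # the leading 1 pairs with w0
            estimate = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            # If the estimate matches the label, (y - estimate) is 0 and nothing
            # changes; otherwise each weight is nudged up or down in proportion
            # to its input.
            for i in range(len(w)):
                w[i] += alpha * (y - estimate) * x[i]
    return w

# Made-up (humidity, pressure) readings, labeled 1 for rain and 0 for not rain.
data = [((0.81, 0.99), 1), ((0.85, 1.00), 1), ((0.45, 1.03), 0), ((0.40, 1.05), 0)]
print(train(data))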
00:27:13.040 --> 00:27:15.700 And just as a final point about this particular equation, 00:27:15.700 --> 00:27:19.470 this value alpha here is generally what we'll call the learning rate. 00:27:19.470 --> 00:27:22.120 It's just some parameter, some number we choose it 00:27:22.120 --> 00:27:25.690 for how quickly we're actually going to be updating these weight values. 00:27:25.690 --> 00:27:27.460 That if alpha is bigger, than we're going 00:27:27.460 --> 00:27:29.380 to update these weight values by a lot. 00:27:29.380 --> 00:27:32.480 And if alpha is smaller, then we'll update the weight values by less. 00:27:32.480 --> 00:27:35.200 And you can choose the value of alpha depending on the problem, 00:27:35.200 --> 00:27:39.960 different values might suit the situation better or worse than others. 00:27:39.960 --> 00:27:43.150 So after all of that, after we've done this training process, 00:27:43.150 --> 00:27:45.880 take all this data, and using this learning rule, 00:27:45.880 --> 00:27:50.240 look at all the pieces of data, and use each piece of data as an indication 00:27:50.240 --> 00:27:53.030 to us of do the weights stay the same, do we increase the weights, 00:27:53.030 --> 00:27:55.940 do we decrease the weights, and if so, by how much, 00:27:55.940 --> 00:28:00.110 what you end up with is effectively a threshold function. 00:28:00.110 --> 00:28:03.230 And we can look at what the threshold function looks like like this. 00:28:03.230 --> 00:28:05.870 On the x-axis here, we have the output of that function. 00:28:05.870 --> 00:28:10.150 Taking the weights taking, the dot product of it with the input. 00:28:10.150 --> 00:28:12.980 And on the y-axis, we have what the output is going to be. 00:28:12.980 --> 00:28:17.420 0, which in this case represented like not raining, and 1, which in this case, 00:28:17.420 --> 00:28:18.950 represented raining. 00:28:18.950 --> 00:28:23.570 And the way that our hypothesis function works is it calculates this value. 00:28:23.570 --> 00:28:27.380 And if it's greater than 0 or greater than some threshold value, 00:28:27.380 --> 00:28:29.480 then we declare that it's a rainy day. 00:28:29.480 --> 00:28:32.330 And otherwise, we declare that it's not rainy day. 00:28:32.330 --> 00:28:35.660 And this then graphically is what that function looks like. 00:28:35.660 --> 00:28:38.600 That Initially, when the value of this dot product is small-- 00:28:38.600 --> 00:28:40.850 it's not raining, it's not raining, it's not raining-- 00:28:40.850 --> 00:28:43.910 but as soon as it crosses that threshold, we suddenly say, 00:28:43.910 --> 00:28:46.700 OK, now it's raining, now it's raining, now it's raining. 00:28:46.700 --> 00:28:49.250 And the way to interpret this kind of representation 00:28:49.250 --> 00:28:51.680 is that anything on this side of the line, that 00:28:51.680 --> 00:28:55.010 would be the category of data points where we say yes, it's raining. 00:28:55.010 --> 00:28:56.960 Anything that falls on this side of the line 00:28:56.960 --> 00:28:59.510 are the data points where we would say it's not raining. 00:28:59.510 --> 00:29:02.120 And again, we want to choose some value for the weights that 00:29:02.120 --> 00:29:04.910 results in a function that does a pretty good job of trying 00:29:04.910 --> 00:29:07.280 to do this estimation. 00:29:07.280 --> 00:29:11.090 But one tricky thing with this type of hard threshold 00:29:11.090 --> 00:29:14.300 is that it only leaves two possible outcomes. 
00:29:14.300 --> 00:29:16.850 We plug-in some data as input. 00:29:16.850 --> 00:29:20.100 And the output we get is raining or not raining. 00:29:20.100 --> 00:29:22.978 And there is no room for it anywhere in between. 00:29:22.978 --> 00:29:24.270 And maybe that's what you want. 00:29:24.270 --> 00:29:26.510 Maybe all you want is given some data point, 00:29:26.510 --> 00:29:29.600 you would like to be able to classify it into one or two or more 00:29:29.600 --> 00:29:31.990 of these various different categories. 00:29:31.990 --> 00:29:34.270 But it might also be the case that you care 00:29:34.270 --> 00:29:37.855 about knowing how strong that prediction is, for example. 00:29:37.855 --> 00:29:39.730 So if we go back to this instance here, where 00:29:39.730 --> 00:29:42.550 we have rainy days on this side of the line, not 00:29:42.550 --> 00:29:45.370 rainy days on that side of the line, you might 00:29:45.370 --> 00:29:48.970 imagine that let's look now at these two white data points. 00:29:48.970 --> 00:29:53.110 This data point here that we would like to predict a label or a category for. 00:29:53.110 --> 00:29:55.660 And this data point over here that we would also like 00:29:55.660 --> 00:29:58.510 to predict a label or a category for. 00:29:58.510 --> 00:30:01.885 It seems likely that you could pretty confidently say that this data point, 00:30:01.885 --> 00:30:03.010 that should be a rainy day. 00:30:03.010 --> 00:30:05.500 It seems close to the other rainy days if we're 00:30:05.500 --> 00:30:07.330 going by the nearest neighbor strategy. 00:30:07.330 --> 00:30:11.830 It's on this side of the line if we're going by the strategy of just saying 00:30:11.830 --> 00:30:14.140 which side of the line does it fall on by figuring out 00:30:14.140 --> 00:30:15.670 what those weights should be. 00:30:15.670 --> 00:30:18.610 And if we're using the line strategy of just which side of the line 00:30:18.610 --> 00:30:21.640 does it fall on, which side of this decision boundary, 00:30:21.640 --> 00:30:24.700 we'd also say that this point here is also 00:30:24.700 --> 00:30:27.580 a rainy day, because it falls on the side of the line that 00:30:27.580 --> 00:30:30.620 corresponds to rainy days. 00:30:30.620 --> 00:30:33.020 But it's likely that even in this case, we 00:30:33.020 --> 00:30:37.040 would know that we don't feel nearly as confident about this data point 00:30:37.040 --> 00:30:40.360 on the left as compared to this data point on the right. 00:30:40.360 --> 00:30:42.620 For this one on the right, we can feel very confident 00:30:42.620 --> 00:30:44.060 that, yes, it's a rainy day. 00:30:44.060 --> 00:30:48.460 This one, it's pretty close to the line if we're judging just by distance. 00:30:48.460 --> 00:30:51.260 And so you might be less sure. 00:30:51.260 --> 00:30:56.150 But our threshold function doesn't allow for a notion of less sure or more sure 00:30:56.150 --> 00:30:57.080 about something. 00:30:57.080 --> 00:30:59.000 It's what we would call a hard threshold. 00:30:59.000 --> 00:31:02.810 It's once you've crossed this line, then immediately, we say, yes, 00:31:02.810 --> 00:31:04.520 this is going to be a rainy day. 00:31:04.520 --> 00:31:07.580 Anywhere before it, we're going to say it's not a rainy day. 00:31:07.580 --> 00:31:10.240 And that may not be helpful in a number of cases. 00:31:10.240 --> 00:31:13.070 One, this is not a particularly easy function to deal with. 
00:31:13.070 --> 00:31:15.710 If you get you get deeper into the world of machine learning 00:31:15.710 --> 00:31:18.830 and are trying to do things like taking derivatives of these curves, 00:31:18.830 --> 00:31:21.222 this type of function makes things challenging. 00:31:21.222 --> 00:31:23.180 But the other challenge is that we don't really 00:31:23.180 --> 00:31:25.013 have any notion of gradation between things. 00:31:25.013 --> 00:31:28.430 We don't have a notion of, yes, this is a very strong belief 00:31:28.430 --> 00:31:32.600 that it's going to be raining as opposed to it's probably more likely than not 00:31:32.600 --> 00:31:37.100 that it's going to be raining, but maybe not totally sure about that, either. 00:31:37.100 --> 00:31:39.590 So what we can do by taking advantage of a technique known 00:31:39.590 --> 00:31:43.190 as logistic regression is instead of using this hard threshold 00:31:43.190 --> 00:31:46.970 type of function, we can use instead a logistic function, something 00:31:46.970 --> 00:31:48.890 we might call a soft threshold. 00:31:48.890 --> 00:31:52.070 And that's going to transform this into looking something 00:31:52.070 --> 00:31:55.290 a little more like this-- something that more nicely curves. 00:31:55.290 --> 00:31:57.800 And as a result, the possible output values 00:31:57.800 --> 00:32:02.030 are no longer just 0 and 1, 0 for not raining, 1 for raining. 00:32:02.030 --> 00:32:06.360 But you can actually get any real number of value between 0 and 1. 00:32:06.360 --> 00:32:10.292 That if you're way over on this side, then you get a value of 0-- 00:32:10.292 --> 00:32:12.750 it's not going to be raining, we're pretty sure about that. 00:32:12.750 --> 00:32:15.042 And if you're over on this side, you get a value of 1-- 00:32:15.042 --> 00:32:17.330 yes, we're very sure that it's going to be raining. 00:32:17.330 --> 00:32:22.460 But in between, you could get some real numbered value where a value like 0.7 00:32:22.460 --> 00:32:24.260 might mean we think it's going to rain. 00:32:24.260 --> 00:32:27.810 It's more probable that it's going to rain than not based on the data, 00:32:27.810 --> 00:32:32.220 but we're not as confident as some of the other data points might be. 00:32:32.220 --> 00:32:34.580 So one of the advantages of the soft threshold 00:32:34.580 --> 00:32:37.940 is that it allows us to have an output that could be some real number that 00:32:37.940 --> 00:32:41.450 potentially reflects some sort of probability, the likelihood that we 00:32:41.450 --> 00:32:46.550 think that this particular data point belongs to that particular category. 00:32:46.550 --> 00:32:49.250 And there are some other nice mathematical properties of that 00:32:49.250 --> 00:32:50.950 as well. 00:32:50.950 --> 00:32:53.880 So that then is two different approaches to trying to solve 00:32:53.880 --> 00:32:55.710 this type of classification problem. 00:32:55.710 --> 00:32:58.860 One is this nearest neighbor type of approach, where you just 00:32:58.860 --> 00:33:01.230 take a data point and look at the data points that 00:33:01.230 --> 00:33:05.490 are nearby to try and estimate what category we think it belongs to. 
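Going back to that soft threshold for a moment: it is typically the logistic (sigmoid) function, and the minimal sketch below, with made-up weights, shows how the same dot product becomes a value between 0 and 1 rather than a hard 0 or 1.

import math

def soft_hypothesis(weights, x1, x2):
    # The same linear combination as before, squashed through the logistic
    # function so the output lands somewhere between 0 and 1.
    score = sum(wi * xi for wi, xi in zip(weights, (1, x1, x2)))
    return 1 / (1 + math.exp(-score))

weights = (-30.0, 25.0, 12.0)               # made-up values; larger weights sharpen the curve
print(soft_hypothesis(weights, 0.90, 1.0))  # far onto the rainy side: output near 1
print(soft_hypothesis(weights, 0.72, 1.0))  # near the boundary: output near 0.5
print(soft_hypothesis(weights, 0.50, 1.0))  # far onto the not-rainy side: output near 0

Read that way, the output behaves like a degree of confidence, which is exactly what the hard threshold could not express.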
00:33:05.490 --> 00:33:08.310 And the other approach is the approach of saying, all right, 00:33:08.310 --> 00:33:10.607 let's just try and use linear regression, 00:33:10.607 --> 00:33:13.440 figure out what these weights should be, adjust the weights in order 00:33:13.440 --> 00:33:16.950 to figure out what line or what decision boundary is going 00:33:16.950 --> 00:33:19.708 to best separate these two categories. 00:33:19.708 --> 00:33:22.500 It turns out that another popular approach, a very popular approach 00:33:22.500 --> 00:33:24.480 if you just have a data set and you want to start 00:33:24.480 --> 00:33:26.190 trying to do some learning on it, is what 00:33:26.190 --> 00:33:27.648 we call the support vector machine. 00:33:27.648 --> 00:33:30.690 We're not going to go too much into the mathematics of the support vector 00:33:30.690 --> 00:33:32.850 machine, but we'll at least explore it graphically 00:33:32.850 --> 00:33:34.680 to see what it is that it looks like. 00:33:34.680 --> 00:33:38.220 And the idea or the motivation behind the support vector machine 00:33:38.220 --> 00:33:41.400 is the idea that there are actually a lot of different lines 00:33:41.400 --> 00:33:44.070 that we could draw, a lot of different decision boundaries 00:33:44.070 --> 00:33:46.430 that we could draw to separate two groups. 00:33:46.430 --> 00:33:49.050 So for example, I had the red data points over here 00:33:49.050 --> 00:33:50.700 and the blue data points over here. 00:33:50.700 --> 00:33:54.620 One possible line I could draw is a line like this, 00:33:54.620 --> 00:33:57.600 that this line here would separate the red points from the blue points. 00:33:57.600 --> 00:33:58.680 And it does so perfectly. 00:33:58.680 --> 00:34:01.080 All the red points are on one side of the line. 00:34:01.080 --> 00:34:03.840 All the blue points around the other side of the line. 00:34:03.840 --> 00:34:07.000 But this should probably make you a little bit nervous. 00:34:07.000 --> 00:34:08.909 If you come up with a model and the model 00:34:08.909 --> 00:34:10.960 comes up with a line that looks like this. 00:34:10.960 --> 00:34:13.650 And the reason why is that you worry about how well it's 00:34:13.650 --> 00:34:18.190 going to generalize to other data points that are not necessarily in the data 00:34:18.190 --> 00:34:19.900 set that we have access to. 00:34:19.900 --> 00:34:23.650 For example, if there was a point that fell right here for example, 00:34:23.650 --> 00:34:26.370 on the right side of the line, then based on that, 00:34:26.370 --> 00:34:30.210 we might want to guess that it is in fact, a red point, 00:34:30.210 --> 00:34:33.780 but it falls on the side of the line where instead, we would estimate 00:34:33.780 --> 00:34:36.380 that it's a blue point instead. 00:34:36.380 --> 00:34:39.750 And so based on that, this line is probably not a great choice 00:34:39.750 --> 00:34:43.679 just because it is so close to these various data points. 00:34:43.679 --> 00:34:45.810 We might instead prefer a diagonal line that 00:34:45.810 --> 00:34:48.719 just goes diagonally through the data set like we've seen before. 00:34:48.719 --> 00:34:52.000 But there too, there's a lot of diagonal lines that we could draw as well. 00:34:52.000 --> 00:34:54.760 For example, I could draw this diagonal line here, 00:34:54.760 --> 00:34:57.663 which also successfully separates all the red points 00:34:57.663 --> 00:34:58.830 from all of the blue points. 
00:34:58.830 --> 00:35:01.400 From the perspective of something like a just trying 00:35:01.400 --> 00:35:03.150 to figure out some setting of weights that 00:35:03.150 --> 00:35:05.760 allows us to predict the correct output, this line 00:35:05.760 --> 00:35:09.155 will predict the correct output for this particular set of data 00:35:09.155 --> 00:35:11.530 every single time, because the red points are on one side 00:35:11.530 --> 00:35:13.518 and the blue points are on the other. 00:35:13.518 --> 00:35:15.810 But yet again, you should probably be a little nervous. 00:35:15.810 --> 00:35:18.690 Because this line is so close to these red points, 00:35:18.690 --> 00:35:22.470 even though we're able to correctly predict on the input data 00:35:22.470 --> 00:35:26.070 if there was a point that fell somewhere in this general area, 00:35:26.070 --> 00:35:28.530 our algorithm, this model, would say that yeah, 00:35:28.530 --> 00:35:30.930 we think it's a blue point, when in actuality, it 00:35:30.930 --> 00:35:34.000 might belong to the red category instead, 00:35:34.000 --> 00:35:36.870 just because it looks like it's close to the other red points. 00:35:36.870 --> 00:35:39.700 What we really want to be able to say, given this data, 00:35:39.700 --> 00:35:42.330 how can you generalize this out as best as possible, is 00:35:42.330 --> 00:35:46.480 to come up with a line like this that seems like the intuitive line to draw. 00:35:46.480 --> 00:35:48.810 And the reason why it's intuitive is because it 00:35:48.810 --> 00:35:54.267 seems to be as far apart as possible from the red data and the blue data 00:35:54.267 --> 00:35:56.850 so that if we generalize a little bit and assume that maybe we 00:35:56.850 --> 00:35:58.933 have some points that are different from the input 00:35:58.933 --> 00:36:02.610 but still slightly further away, we can still say that something on this side, 00:36:02.610 --> 00:36:06.300 probably red, something on that side, probably blue. 00:36:06.300 --> 00:36:08.607 And we can make those judgments that way. 00:36:08.607 --> 00:36:10.440 And that is what support vector machines are 00:36:10.440 --> 00:36:12.660 designed to do-- they're designed to try and find 00:36:12.660 --> 00:36:15.720 what we call the maximum margin separator, 00:36:15.720 --> 00:36:17.850 where the maximum margin separator is just 00:36:17.850 --> 00:36:21.780 some boundary that maximizes the distance between the groups of points. 00:36:21.780 --> 00:36:24.990 Rather than come up with some boundary that's very close to one side 00:36:24.990 --> 00:36:27.840 or the other, where in the case before, we wouldn't have cared-- 00:36:27.840 --> 00:36:31.680 as long as we're categorizing the input well, that seems all we need to do-- 00:36:31.680 --> 00:36:35.790 the support vector machine will try and find this maximum margin separator, 00:36:35.790 --> 00:36:39.090 some way of trying to maximize that particular distance. 00:36:39.090 --> 00:36:42.600 And it does so by finding what we call the support vectors, which 00:36:42.600 --> 00:36:44.790 are the vectors that are closest to the line 00:36:44.790 --> 00:36:48.060 and trying to maximize the distance between the line 00:36:48.060 --> 00:36:49.950 and those particular points. 00:36:49.950 --> 00:36:51.790 And it works that way in two dimensions. 
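As a rough sketch of that idea, assuming scikit-learn is available, an SVC with a linear kernel will look for exactly this kind of maximum margin separator on a small set of made-up points:

    from sklearn import svm

    # Made-up 2-D points: one cluster in the lower left, another in the upper right.
    X = [[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]]
    y = [0, 0, 0, 1, 1, 1]

    # A linear-kernel support vector classifier finds the boundary that keeps
    # the largest possible distance from the points closest to it.
    model = svm.SVC(kernel="linear")
    model.fit(X, y)

    print(model.support_vectors_)            # the support vectors nearest the boundary
    print(model.predict([[2, 2], [5, 4]]))   # classify two new points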
00:36:51.790 --> 00:36:53.498 It also works in higher dimensions, where 00:36:53.498 --> 00:36:56.670 we're not looking for some line that separates the two data points, 00:36:56.670 --> 00:36:58.440 but instead, looking for what we generally 00:36:58.440 --> 00:37:02.610 call a hyperplane, some decision boundary, effectively, that 00:37:02.610 --> 00:37:06.150 separates one set of data from the other set of data. 00:37:06.150 --> 00:37:09.270 And this ability of support vector machines to work in higher dimensions 00:37:09.270 --> 00:37:11.670 actually has a number of other applications as well. 00:37:11.670 --> 00:37:14.790 But one is that it helpfully deals with cases where 00:37:14.790 --> 00:37:17.740 data may not be linearly separable. 00:37:17.740 --> 00:37:19.800 So we talked about linear separability before, 00:37:19.800 --> 00:37:22.890 this idea that you can take data and just draw 00:37:22.890 --> 00:37:25.180 a line or some linear combination of the inputs 00:37:25.180 --> 00:37:28.750 that allows us to perfectly separate the two sets from each other. 00:37:28.750 --> 00:37:32.050 There are some data sets that are not linearly separable. 00:37:32.050 --> 00:37:35.280 And for some, you would not even be able to find 00:37:35.280 --> 00:37:39.270 a good line at all that would come close to doing that kind of separation. 00:37:39.270 --> 00:37:42.930 Something like this, for example, where if you imagine here 00:37:42.930 --> 00:37:45.780 are the red points and the blue points surround them. 00:37:45.780 --> 00:37:50.550 If you try to find a line that divides the red points from the blue points, 00:37:50.550 --> 00:37:53.310 it's actually going to be difficult, if not impossible, to do-- 00:37:53.310 --> 00:37:55.080 that any line you choose-- 00:37:55.080 --> 00:37:57.510 if you draw a line here, then you're ignoring 00:37:57.510 --> 00:38:00.450 all of these blue points that should actually be blue and not red-- 00:38:00.450 --> 00:38:03.240 anywhere else you draw a line, there's going to be a lot of error, 00:38:03.240 --> 00:38:07.560 a lot of mistakes, a lot of what we'll soon call loss for that line 00:38:07.560 --> 00:38:08.280 that you draw-- 00:38:08.280 --> 00:38:12.050 a lot of points that you're going to categorize incorrectly. 00:38:12.050 --> 00:38:14.610 What we really want is to be able to find a better decision 00:38:14.610 --> 00:38:18.570 boundary that may not be just a straight line through this two 00:38:18.570 --> 00:38:19.770 dimensional space. 00:38:19.770 --> 00:38:21.810 And what support vector machines can do is 00:38:21.810 --> 00:38:23.940 they can begin to operate in higher dimensions 00:38:23.940 --> 00:38:26.880 and be able to find some other decision boundary, 00:38:26.880 --> 00:38:28.860 like the circle in this case, that actually 00:38:28.860 --> 00:38:33.130 is able to separate one of these sets of data from the other set of data, 00:38:33.130 --> 00:38:33.750 a lot better. 00:38:33.750 --> 00:38:37.470 So oftentimes, in data sets where the data is not linearly separable, 00:38:37.470 --> 00:38:40.140 support vector machines, by working in higher dimensions, 00:38:40.140 --> 00:38:44.310 can actually figure out a way to solve that kind of problem effectively. 00:38:44.310 --> 00:38:46.290 So that then-- three different approaches 00:38:46.290 --> 00:38:48.330 to trying to solve these sorts of problems. 00:38:48.330 --> 00:38:50.040 We've seen support vector machines. 
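A minimal sketch of how a support vector machine can handle data that is not linearly separable, assuming scikit-learn and its make_circles helper for the made-up ring-shaped data:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Made-up data: an inner cluster of one class surrounded by a ring of the other.
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    # No straight line separates these two groups in two dimensions,
    # but the "rbf" kernel lets the SVC find a circular decision boundary.
    model = SVC(kernel="rbf")
    model.fit(X, y)

    print(model.score(X, y))   # fraction of these points the boundary classifies correctly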
00:38:50.040 --> 00:38:53.400 We've seen trying to use linear regression and the perceptron 00:38:53.400 --> 00:38:56.912 learning rule to be able to figure out how to categorize inputs and outputs. 00:38:56.912 --> 00:38:58.620 We've seen the nearest neighbor approach. 00:38:58.620 --> 00:39:00.630 No one necessarily better than any other. 00:39:00.630 --> 00:39:03.990 Again, it's going to depend on the data set, the information you have access 00:39:03.990 --> 00:39:04.490 to. 00:39:04.490 --> 00:39:07.157 It's going to depend on what the function looks like that you're 00:39:07.157 --> 00:39:08.370 ultimately trying to predict. 00:39:08.370 --> 00:39:11.160 And this is where a lot of research and experimentation 00:39:11.160 --> 00:39:14.880 can be involved in trying to figure out how it is to best perform 00:39:14.880 --> 00:39:16.710 that kind of estimation. 00:39:16.710 --> 00:39:20.100 But classification is only one of the tasks that you might encounter 00:39:20.100 --> 00:39:23.340 and supervised machine learning, because in classification, 00:39:23.340 --> 00:39:26.580 what we're trying to predict is some discrete category. 00:39:26.580 --> 00:39:29.040 We're trying to predict red or blue, rain 00:39:29.040 --> 00:39:31.950 or not rain, authentic or counterfeit. 00:39:31.950 --> 00:39:35.500 But sometimes, what we want to predict is a real number value. 00:39:35.500 --> 00:39:38.880 And for that, we have a related problem, not classification, but instead 00:39:38.880 --> 00:39:40.500 known as regression. 00:39:40.500 --> 00:39:42.930 And regression is the supervised learning problem 00:39:42.930 --> 00:39:45.570 where we try and learn a function mapping inputs to outputs, 00:39:45.570 --> 00:39:46.810 same as before. 00:39:46.810 --> 00:39:49.800 but instead of the outputs being discrete categories-- 00:39:49.800 --> 00:39:51.960 things like rain or not rain-- 00:39:51.960 --> 00:39:54.260 in a regression problem, the output values 00:39:54.260 --> 00:39:57.400 are generally continuous value-- some real number 00:39:57.400 --> 00:39:58.650 that we would like to predict. 00:39:58.650 --> 00:40:00.385 This happens all the time, as well. 00:40:00.385 --> 00:40:02.760 You might imagine that a company might take this approach 00:40:02.760 --> 00:40:04.950 if it's trying to figure out, for instance, 00:40:04.950 --> 00:40:06.720 what the effect of its advertising is. 00:40:06.720 --> 00:40:10.830 Like how do advertising dollars spent translate into sales 00:40:10.830 --> 00:40:13.000 for the company's product, for example. 00:40:13.000 --> 00:40:17.070 And so they might like to try to predict some function that takes as input, 00:40:17.070 --> 00:40:18.780 the amount of money spent on advertising. 00:40:18.780 --> 00:40:20.220 And here, we're just going to use one input, 00:40:20.220 --> 00:40:22.650 but again, you could scale this up to many more inputs 00:40:22.650 --> 00:40:25.800 as well if you have a lot of different kinds of data you have access to. 00:40:25.800 --> 00:40:27.540 And the goal is to learn to function-- that given 00:40:27.540 --> 00:40:29.423 this amount of spending on advertising, we're 00:40:29.423 --> 00:40:30.840 going to get this amount in sales. 00:40:30.840 --> 00:40:34.110 And you might judge it based on having access to a whole bunch of data-- 00:40:34.110 --> 00:40:37.830 like for every past month, here's how much we spent on advertising 00:40:37.830 --> 00:40:39.390 and here is what sales were. 
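To make that concrete, here is a minimal sketch of the regression idea, assuming scikit-learn's LinearRegression and with the advertising and sales numbers made up for illustration:

    from sklearn.linear_model import LinearRegression

    # Made-up monthly data: advertising dollars spent and the sales that followed.
    ad_spending = [[500], [1000], [1500], [2000], [2500], [3000]]
    sales = [1400, 2100, 2600, 3400, 3900, 4500]

    # Fit a line relating advertising to sales.
    model = LinearRegression()
    model.fit(ad_spending, sales)

    # Read the estimate off that line for a new advertising budget.
    print(model.coef_, model.intercept_)   # the learned weight and bias of the line
    print(model.predict([[1800]]))         # estimated sales for $1,800 of advertising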
00:40:39.390 --> 00:40:43.300 And we would like to predict some sort of hypothesis function 00:40:43.300 --> 00:40:46.260 that, again, given the amount spent on advertising, 00:40:46.260 --> 00:40:48.750 can predict in this case, some real number, 00:40:48.750 --> 00:40:54.840 some no estimate of how much sales we expect that company to do in this month 00:40:54.840 --> 00:40:56.940 or in this quarter or whatever unit of time 00:40:56.940 --> 00:40:58.980 we're choosing to measure things in. 00:40:58.980 --> 00:41:01.830 And so again, the approach to solving this type of problem, 00:41:01.830 --> 00:41:05.920 we could try using a linear regression type approach, where we take this data, 00:41:05.920 --> 00:41:07.050 and we just plot it. 00:41:07.050 --> 00:41:09.750 On the x-axis, we have advertising dollars spent. 00:41:09.750 --> 00:41:11.490 On the y-axis, we have sales. 00:41:11.490 --> 00:41:14.160 And we might just want to try and draw a line that 00:41:14.160 --> 00:41:16.680 does a pretty good job of trying to estimate 00:41:16.680 --> 00:41:19.980 this relationship between advertising and sales. 00:41:19.980 --> 00:41:21.840 And in this case, unlike before, we're not 00:41:21.840 --> 00:41:24.848 trying to separate the data points into discrete categories. 00:41:24.848 --> 00:41:26.640 But instead in this case, we're just trying 00:41:26.640 --> 00:41:31.410 to find a line that approximates this relationship between advertising 00:41:31.410 --> 00:41:34.860 and sales so that if we want to figure out what the estimated sales are 00:41:34.860 --> 00:41:38.520 for a particular advertising budget, you just look it up in this line, 00:41:38.520 --> 00:41:40.560 figure out for this amount of advertising, we 00:41:40.560 --> 00:41:42.540 would have this amount of sales, and just 00:41:42.540 --> 00:41:44.367 try and make the estimate that way. 00:41:44.367 --> 00:41:46.200 And so you can try and come up with a line-- 00:41:46.200 --> 00:41:48.720 again, figuring out how to modify the weights using 00:41:48.720 --> 00:41:51.990 various different techniques to try and make it so that this line fits 00:41:51.990 --> 00:41:54.860 as well as possible. 00:41:54.860 --> 00:41:57.880 So with all of these approaches to trying to solve machine 00:41:57.880 --> 00:41:59.270 learning style problems. 00:41:59.270 --> 00:42:02.000 The question becomes, how do we evaluate these approaches? 00:42:02.000 --> 00:42:06.340 How do we evaluate the various different hypotheses that we could come up with? 00:42:06.340 --> 00:42:09.880 Because each of these algorithms will give us some sort of hypothesis-- 00:42:09.880 --> 00:42:12.710 some function that maps inputs to outputs. 00:42:12.710 --> 00:42:16.720 And we want to know how well does that function work. 00:42:16.720 --> 00:42:19.000 And you can think of evaluating these hypotheses 00:42:19.000 --> 00:42:23.500 and trying to get a better hypothesis as kind of like an optimization problem. 00:42:23.500 --> 00:42:26.580 In an optimization problem, as you recall from before, 00:42:26.580 --> 00:42:30.100 you are either trying to maximize some objective function 00:42:30.100 --> 00:42:33.160 by trying to find a global maximum. 00:42:33.160 --> 00:42:36.160 Or we were trying to minimize some cost function 00:42:36.160 --> 00:42:38.210 by trying to find some global minimum. 
00:42:38.210 --> 00:42:41.890 And in the case of evaluating these hypotheses, one thing we might say 00:42:41.890 --> 00:42:45.280 is that this cost function, the thing we're trying to minimize, 00:42:45.280 --> 00:42:49.180 we might be trying to minimize what we would call a loss function. 00:42:49.180 --> 00:42:50.510 And what a loss function is-- 00:42:50.510 --> 00:42:53.980 it is a function that is going to estimate for us how 00:42:53.980 --> 00:42:56.200 poorly our function performs. 00:42:56.200 --> 00:42:58.240 More formally, it's like a loss of utility, 00:42:58.240 --> 00:43:02.740 in that whenever we predict something that is wrong, that is a loss of utility. 00:43:02.740 --> 00:43:06.450 That's going to add to the output of our loss function. 00:43:06.450 --> 00:43:08.200 And you can come up with any loss function 00:43:08.200 --> 00:43:11.320 that you want-- just some mathematical way of estimating given 00:43:11.320 --> 00:43:14.020 each of these data points, given what the actual output is, 00:43:14.020 --> 00:43:17.110 and given what our projected output is, our estimate, 00:43:17.110 --> 00:43:19.970 you could calculate some sort of numerical loss for it. 00:43:19.970 --> 00:43:23.140 But there are a couple of popular loss functions that are worth discussing-- 00:43:23.140 --> 00:43:25.210 just so that you've seen them before-- 00:43:25.210 --> 00:43:27.040 when it comes to discrete categories. 00:43:27.040 --> 00:43:30.570 Things like rain or not rain, counterfeit or not counterfeit. 00:43:30.570 --> 00:43:33.463 One approach is the 0-1 loss function. 00:43:33.463 --> 00:43:35.380 And the way that works is for each of the data 00:43:35.380 --> 00:43:40.180 points our loss function takes as input, what the actual output is, whether it 00:43:40.180 --> 00:43:44.650 was actually raining or not raining, and takes our prediction into account-- 00:43:44.650 --> 00:43:49.030 did we predict given this data point that it was raining or not raining. 00:43:49.030 --> 00:43:51.730 And if the actual value equals the prediction, 00:43:51.730 --> 00:43:54.580 well, then the 0-1 loss function will just say the loss is 0. 00:43:54.580 --> 00:43:58.900 There was no loss of utility because we were able to predict correctly. 00:43:58.900 --> 00:44:01.300 And otherwise, if the actual value was not 00:44:01.300 --> 00:44:05.260 the same thing as what we predicted, well, then in that case, our loss is 1. 00:44:05.260 --> 00:44:08.260 We lost something, lost some utility, because what 00:44:08.260 --> 00:44:12.750 we predicted as the output of the function was not what it actually was. 00:44:12.750 --> 00:44:14.500 And the goal then in a situation like this 00:44:14.500 --> 00:44:17.080 would be to come up with some hypothesis that 00:44:17.080 --> 00:44:21.040 minimizes the total empirical loss, the total amount that we've 00:44:21.040 --> 00:44:25.090 lost if you add up for all these data points what the actual output is 00:44:25.090 --> 00:44:28.250 and what your hypothesis would have predicted. 00:44:28.250 --> 00:44:30.100 So in this case, for example, if we go back 00:44:30.100 --> 00:44:32.530 to classifying days as raining or not raining, 00:44:32.530 --> 00:44:34.720 and we came up with this decision boundary, 00:44:34.720 --> 00:44:36.680 how would we evaluate this decision boundary-- 00:44:36.680 --> 00:44:40.510 how much better is it than drawing the line here or drawing the line there. 
00:44:40.510 --> 00:44:42.790 Well, we could take each of the input data points 00:44:42.790 --> 00:44:45.790 and each input data point has a label-- whether it was raining 00:44:45.790 --> 00:44:47.230 or whether it was not raining. 00:44:47.230 --> 00:44:49.090 And we could compare it to the prediction-- 00:44:49.090 --> 00:44:51.550 whether we predicted it would be raining or not raining-- 00:44:51.550 --> 00:44:55.040 and assign it a numerical value as a result. 00:44:55.040 --> 00:44:58.323 So for example, these points over here they were all rainy days. 00:44:58.323 --> 00:45:00.490 And we predicted they would be raining, because they 00:45:00.490 --> 00:45:02.300 fall in the bottom side of the line. 00:45:02.300 --> 00:45:03.520 So they had a loss of 0-- 00:45:03.520 --> 00:45:05.500 nothing lost from those situations. 00:45:05.500 --> 00:45:08.140 And likewise, same is true for some of these points over here, 00:45:08.140 --> 00:45:10.420 where it was not raining and we predicted 00:45:10.420 --> 00:45:12.220 it would not be raining, either. 00:45:12.220 --> 00:45:16.810 Where we do have loss are points like this point here and that point there, 00:45:16.810 --> 00:45:20.980 where we predicted that it would not be raining, but in actuality, 00:45:20.980 --> 00:45:21.730 it's a blue point. 00:45:21.730 --> 00:45:22.810 It was raining. 00:45:22.810 --> 00:45:25.870 Or likewise here, we predicted that it would be raining, 00:45:25.870 --> 00:45:27.790 but in actuality, it's a red point-- 00:45:27.790 --> 00:45:29.020 it was not raining. 00:45:29.020 --> 00:45:31.900 And so as a result, we miscategorized these data 00:45:31.900 --> 00:45:34.150 points that we were trying to train on. 00:45:34.150 --> 00:45:36.310 And as a result, there is some loss here. 00:45:36.310 --> 00:45:40.930 One loss here, there, here and there, for a total loss of four, for example, 00:45:40.930 --> 00:45:41.893 in this case. 00:45:41.893 --> 00:45:43.810 And that might be how we would estimate or how 00:45:43.810 --> 00:45:47.860 we would say that this line is better than a line that 00:45:47.860 --> 00:45:51.370 goes somewhere else or a line that's further down, because this line might 00:45:51.370 --> 00:45:52.670 minimize the loss. 00:45:52.670 --> 00:45:57.100 So there is no way to do better than just these four points of loss 00:45:57.100 --> 00:46:01.080 if you're just drawing a straight line through our space. 00:46:01.080 --> 00:46:05.020 So the 0-1 loss function checks did we get it right, did we get it wrong. 00:46:05.020 --> 00:46:07.630 If we got it right, the loss is 0-- nothing lost. 00:46:07.630 --> 00:46:11.470 If we got it wrong, then our loss function for that data point says 1, 00:46:11.470 --> 00:46:14.740 and we add up all of those losses across all of our data points 00:46:14.740 --> 00:46:17.290 to get some sort of empirical loss-- how much we 00:46:17.290 --> 00:46:20.420 have lost across all of these original data points 00:46:20.420 --> 00:46:23.500 that our algorithm had access to. 00:46:23.500 --> 00:46:26.550 There are other forms of loss as well that work especially well when 00:46:26.550 --> 00:46:28.980 we deal with more real value cases-- cases 00:46:28.980 --> 00:46:32.760 like the mapping between advertising budget and amount that we do in sales, 00:46:32.760 --> 00:46:33.720 for example. 
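A minimal sketch of that 0-1 loss, with hypothetical labels chosen so that, as in the example above, four points are misclassified:

    def zero_one_loss(actuals, predictions):
        """Add 1 to the loss for every data point whose prediction differs from its actual label."""
        return sum(0 if actual == predicted else 1
                   for actual, predicted in zip(actuals, predictions))

    # Hypothetical labels for eight days and what a hypothesis predicted for them.
    actual    = ["rain", "rain", "rain", "no rain", "no rain", "no rain", "rain", "no rain"]
    predicted = ["rain", "rain", "no rain", "no rain", "no rain", "rain", "no rain", "rain"]

    print(zero_one_loss(actual, predicted))   # 4 -- one point of loss per misclassified day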
00:46:33.720 --> 00:46:37.770 Because in that case, you care not just that you get the number exactly right, 00:46:37.770 --> 00:46:40.650 but you care how close you were to the actual value. 00:46:40.650 --> 00:46:44.700 If the actual value is you did $2,800 in sales 00:46:44.700 --> 00:46:47.910 and you predicted that you would do $2,900 in sales, 00:46:47.910 --> 00:46:49.330 maybe that's pretty good. 00:46:49.330 --> 00:46:52.380 That's much better than if you had predicted you do $1,000 in sales, 00:46:52.380 --> 00:46:53.620 for example. 00:46:53.620 --> 00:46:55.830 And so we would like our loss function to be 00:46:55.830 --> 00:46:58.530 able to take that into account as well. 00:46:58.530 --> 00:47:02.700 Take into account not just whether the actual value in the expected value 00:47:02.700 --> 00:47:08.490 are exactly the same, but also, take into account how far apart they were. 00:47:08.490 --> 00:47:12.450 And so for that one approach is what we call L1 loss. 00:47:12.450 --> 00:47:15.120 L1 Loss doesn't just look at whether actual and predicted 00:47:15.120 --> 00:47:20.400 are equal to each other, but we take the absolute value of the actual value 00:47:20.400 --> 00:47:22.180 minus the predicted value. 00:47:22.180 --> 00:47:25.860 In other words, we just ask, how far apart were the actual 00:47:25.860 --> 00:47:27.120 and predicted values? 00:47:27.120 --> 00:47:29.280 And we sum that up across all of the data 00:47:29.280 --> 00:47:33.870 points to be able to get what our answer ultimately is. 00:47:33.870 --> 00:47:36.690 So what might this actually look like for our data set? 00:47:36.690 --> 00:47:38.610 Well, if we go back to this representation 00:47:38.610 --> 00:47:42.720 where we had advertising along the x-axis, sales along the y-axis, 00:47:42.720 --> 00:47:45.210 our line was our prediction, our estimate 00:47:45.210 --> 00:47:47.520 for any given amount of advertising-- what 00:47:47.520 --> 00:47:50.080 we predicted sales was going to be. 00:47:50.080 --> 00:47:54.420 And our L1 loss is just how far apart vertically along the sales 00:47:54.420 --> 00:47:58.050 axis our prediction was from each of the data points. 00:47:58.050 --> 00:48:00.900 So we could figure out exactly how far apart our prediction 00:48:00.900 --> 00:48:02.910 was from each of the data points and figure out 00:48:02.910 --> 00:48:07.920 as a result of that what our loss is overall for this particular hypothesis 00:48:07.920 --> 00:48:12.090 just by adding up all of these various different individual losses for each 00:48:12.090 --> 00:48:13.110 of these data points. 00:48:13.110 --> 00:48:15.790 And our goal then is to try and minimize that loss-- 00:48:15.790 --> 00:48:20.550 to try and come up with some line that minimizes what the utility loss is 00:48:20.550 --> 00:48:23.280 by judging how far away our estimate amount of sales 00:48:23.280 --> 00:48:25.800 is from the actual amount of sales. 00:48:25.800 --> 00:48:28.050 And turns out there are other loss functions, as well. 00:48:28.050 --> 00:48:30.780 One that's quite popular is the L2 loss. 00:48:30.780 --> 00:48:33.840 The L2 loss, instead of just using the absolute value, 00:48:33.840 --> 00:48:37.350 like how far away the actual value is from the predicted value, 00:48:37.350 --> 00:48:40.350 it uses the square of actual minus predicted. 
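And a similarly minimal sketch of the L1 loss just described, together with the squared L2 variant being introduced here, using made-up sales figures:

    def l1_loss(actuals, predictions):
        """Sum of how far apart each actual and predicted value are."""
        return sum(abs(actual - predicted)
                   for actual, predicted in zip(actuals, predictions))

    def l2_loss(actuals, predictions):
        """Like L1 loss, but squaring each difference, so larger misses are penalized more harshly."""
        return sum((actual - predicted) ** 2
                   for actual, predicted in zip(actuals, predictions))

    # Made-up figures: predicting $2,900 when the actual was $2,800 costs far less
    # than a prediction of $1,000 would have.
    actual = [2800, 3100]
    predicted = [2900, 3000]

    print(l1_loss(actual, predicted))   # 200
    print(l2_loss(actual, predicted))   # 20000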
00:48:40.350 --> 00:48:43.470 So how far apart are the actual and predicted value, and it 00:48:43.470 --> 00:48:48.150 squares that value, effectively penalizing much more harshly 00:48:48.150 --> 00:48:50.290 anything that is a worse prediction. 00:48:50.290 --> 00:48:52.620 So you imagine if you have two data points 00:48:52.620 --> 00:48:57.210 that you predict as being one value away from their actual value 00:48:57.210 --> 00:48:59.490 as opposed to one data point that you predict 00:48:59.490 --> 00:49:04.290 as being two away from its actual value, the L2 loss function will more 00:49:04.290 --> 00:49:07.560 harshly penalize that one that is two away because it's 00:49:07.560 --> 00:49:11.430 going to square however much the difference is between the actual value 00:49:11.430 --> 00:49:12.520 and the predicted value. 00:49:12.520 --> 00:49:14.310 And depending on the situation, you might 00:49:14.310 --> 00:49:17.520 want to choose a loss function depending on what you care about minimizing. 00:49:17.520 --> 00:49:20.883 If you really care about minimizing the error on more outlier cases, 00:49:20.883 --> 00:49:23.050 then you might want to consider something like this. 00:49:23.050 --> 00:49:25.592 But if you've got a lot of outliers and you don't necessarily 00:49:25.592 --> 00:49:28.632 care about modeling them, then maybe an L1 loss function is preferable, 00:49:28.632 --> 00:49:30.840 but there are trade-offs here that you need to decide 00:49:30.840 --> 00:49:33.630 based on a particular set of data. 00:49:33.630 --> 00:49:37.020 But what you do run the risk of with any of these loss functions with anything 00:49:37.020 --> 00:49:40.232 that we're trying to do is a problem known as overfitting. 00:49:40.232 --> 00:49:41.940 And overfitting is a big problem that you 00:49:41.940 --> 00:49:44.910 can encounter in machine learning, which happens anytime 00:49:44.910 --> 00:49:49.380 a model fits too closely with a data set, and as a result, 00:49:49.380 --> 00:49:51.510 fails to generalize. 00:49:51.510 --> 00:49:55.200 We would like our model to be able to accurately predict 00:49:55.200 --> 00:49:59.430 the input-output pairs for the data that we have access to. 00:49:59.430 --> 00:50:02.220 But the reason we wanted to do so is because we 00:50:02.220 --> 00:50:06.690 want our model to generalize well to data that we haven't seen before. 00:50:06.690 --> 00:50:10.350 I would like to take data from the past year of whether it was raining or not 00:50:10.350 --> 00:50:12.707 raining and use that data to generalize it 00:50:12.707 --> 00:50:15.540 towards the future-- to say in the future, is it going to be raining 00:50:15.540 --> 00:50:16.170 or not raining. 00:50:16.170 --> 00:50:19.890 Or if I have a whole bunch of data on what counterfeit and not counterfeit US 00:50:19.890 --> 00:50:23.640 dollar bills looked like in the past when people have encountered them, 00:50:23.640 --> 00:50:26.550 I'd like to train a computer to be able to, in the future, 00:50:26.550 --> 00:50:32.020 generalize to other dollar bills that I might see as well. 00:50:32.020 --> 00:50:36.540 And the problem with overfitting is that if you try and tie yourself too closely 00:50:36.540 --> 00:50:39.450 to the data set that you're training your model on, you 00:50:39.450 --> 00:50:42.043 can end up not generalizing very well. 00:50:42.043 --> 00:50:43.210 So what does this look like? 
00:50:43.210 --> 00:50:46.750 Well, we might imagine the rainy day and not rainy day example again from here, 00:50:46.750 --> 00:50:49.440 where the blue points indicate rainy days and the red points 00:50:49.440 --> 00:50:50.970 indicate not rainy days. 00:50:50.970 --> 00:50:53.400 And we decided that we felt pretty comfortable 00:50:53.400 --> 00:50:58.230 with drawing a line like this as the decision boundary between rainy days 00:50:58.230 --> 00:50:59.090 and not rainy days. 00:50:59.090 --> 00:51:01.860 That we can pretty comfortably say that points on this side 00:51:01.860 --> 00:51:04.860 are more likely to be rainy days, points on that side, 00:51:04.860 --> 00:51:06.870 more likely to be not rainy days. 00:51:06.870 --> 00:51:11.430 But the empirical loss isn't 0 in this particular case, 00:51:11.430 --> 00:51:14.200 because we didn't categorize everything perfectly. 00:51:14.200 --> 00:51:17.640 There was this one outlier, this one day where it wasn't raining, 00:51:17.640 --> 00:51:20.580 but yet our model still predicts that it is raining. 00:51:20.580 --> 00:51:22.710 But that doesn't necessarily mean our model is bad. 00:51:22.710 --> 00:51:25.820 It just means the model isn't 100% accurate. 00:51:25.820 --> 00:51:28.700 If you really wanted to try and find a hypothesis that 00:51:28.700 --> 00:51:32.142 resulted in minimizing the loss, you could come up 00:51:32.142 --> 00:51:33.600 with a different decision boundary. 00:51:33.600 --> 00:51:37.100 It wouldn't be a line, but it would look something like this. 00:51:37.100 --> 00:51:41.090 This decision boundary does separate all of the red points 00:51:41.090 --> 00:51:44.780 from all of the blue points because the red points fall 00:51:44.780 --> 00:51:47.390 on this side of this decision boundary, the blue points 00:51:47.390 --> 00:51:49.700 fall on the other side of the decision boundary. 00:51:49.700 --> 00:51:54.530 But this, we would probably argue, is not as good of a prediction. 00:51:54.530 --> 00:51:59.600 Even though it seems to be more accurate based on all of the available training 00:51:59.600 --> 00:52:02.700 data that we have for training this machine learning model, 00:52:02.700 --> 00:52:05.330 we might say that it's probably not going to generalize well. 00:52:05.330 --> 00:52:07.730 That if there were other data points like here and there, 00:52:07.730 --> 00:52:11.270 we might still want to consider those to be rainy days, because we think 00:52:11.270 --> 00:52:13.770 this was probably just an outlier. 00:52:13.770 --> 00:52:17.480 So if the only thing you care about is minimizing the loss on the data 00:52:17.480 --> 00:52:20.340 you have available to you, you run the risk of overfitting. 00:52:20.340 --> 00:52:22.520 And this can happen in the classification case. 00:52:22.520 --> 00:52:25.460 It can also happen in the regression case, 00:52:25.460 --> 00:52:28.790 that here, we predicted what we thought was a pretty good line relating 00:52:28.790 --> 00:52:32.720 advertising to sales, trying to predict what sales were going to be for a given 00:52:32.720 --> 00:52:33.757 amount of advertising. 00:52:33.757 --> 00:52:36.590 But I could come up with a line that does a better job of predicting 00:52:36.590 --> 00:52:39.680 the training data, and it would be something that looks like this, 00:52:39.680 --> 00:52:42.620 just connecting all of the various different data points. 00:52:42.620 --> 00:52:44.750 And now, there is no loss at all. 
00:52:44.750 --> 00:52:48.620 Now I've perfectly predicted given any advertising what sales are, 00:52:48.620 --> 00:52:52.370 and for all the data available to me, it's going to be accurate. 00:52:52.370 --> 00:52:55.010 But it's probably not going to generalize very well. 00:52:55.010 --> 00:53:00.060 I have overfit my model on the training data that is available to me. 00:53:00.060 --> 00:53:02.060 And so in general, we want to avoid overfitting. 00:53:02.060 --> 00:53:04.880 We'd like strategies to make sure that we haven't 00:53:04.880 --> 00:53:07.190 overfit our model to a particular data set. 00:53:07.190 --> 00:53:09.810 And there are a number of ways that you could try to do this. 00:53:09.810 --> 00:53:12.830 One way is by examining what it is that we're optimizing for. 00:53:12.830 --> 00:53:17.060 In an optimization problem, all we do is we say there is some cost, 00:53:17.060 --> 00:53:19.610 and I want to minimize that cost. 00:53:19.610 --> 00:53:24.410 And so far, we've defined that cost function-- the cost of a hypothesis 00:53:24.410 --> 00:53:28.400 just as being equal to the empirical loss of that hypothesis. 00:53:28.400 --> 00:53:32.750 How far away are the actual data points, the outputs, from 00:53:32.750 --> 00:53:36.400 what I predicted them to be based on that particular hypothesis. 00:53:36.400 --> 00:53:38.150 And if all you're trying to do is minimize 00:53:38.150 --> 00:53:40.910 cost, meaning minimizing the loss in this case, 00:53:40.910 --> 00:53:44.030 then the result is going to be that you might overfit. 00:53:44.030 --> 00:53:48.440 That to minimize cost, you're going to try and find a way to perfectly match 00:53:48.440 --> 00:53:49.820 all of the input data. 00:53:49.820 --> 00:53:54.020 And that might happen as a result of overfitting on that particular input 00:53:54.020 --> 00:53:55.650 data. 00:53:55.650 --> 00:53:59.660 So in order to address this, you could add something to the cost function. 00:53:59.660 --> 00:54:00.880 What counts as cost? 00:54:00.880 --> 00:54:04.190 Well, not just loss, but also, some measure 00:54:04.190 --> 00:54:08.680 of the complexity of the hypothesis, where the complexity of the hypothesis 00:54:08.680 --> 00:54:12.440 is something that you would need to define for how complicated 00:54:12.440 --> 00:54:13.250 does our line look. 00:54:13.250 --> 00:54:15.830 This is sort of an Occam's razor style approach, where 00:54:15.830 --> 00:54:19.160 we want to give preference to a simpler decision boundary-- 00:54:19.160 --> 00:54:20.990 like a straight line, for example, 00:54:20.990 --> 00:54:24.710 some simpler curve as opposed to something far more complex 00:54:24.710 --> 00:54:26.840 that might represent the training data better, 00:54:26.840 --> 00:54:29.150 but might not generalize as well. We'll generally 00:54:29.150 --> 00:54:33.860 say that a simpler solution is probably the better solution and probably 00:54:33.860 --> 00:54:38.460 the one that is more likely to generalize well to other inputs. 00:54:38.460 --> 00:54:42.020 So we measure what the loss is, but we also measure the complexity. 00:54:42.020 --> 00:54:45.770 And now that all gets taken into account when we consider the overall cost. 00:54:45.770 --> 00:54:50.040 That yes, something might have less loss if it better predicts the training data, 00:54:50.040 --> 00:54:52.940 but if it's much more complex, it still might not 00:54:52.940 --> 00:54:55.230 be the best option that we have. 
00:54:55.230 --> 00:54:58.930 And we need to come up with some balance between loss and complexity. 00:54:58.930 --> 00:55:01.180 And for that reason, you'll often see this represented 00:55:01.180 --> 00:55:04.730 as multiplying the complexity by some parameter 00:55:04.730 --> 00:55:08.480 that we have to choose-- parameter lambda in this case, where we're saying 00:55:08.480 --> 00:55:11.300 if lambda has a greater value, then we really want 00:55:11.300 --> 00:55:13.852 to penalize more complex hypotheses. 00:55:13.852 --> 00:55:15.560 Whereas if lambda is smaller, we're only going 00:55:15.560 --> 00:55:18.470 to penalize more complex hypotheses a little bit. 00:55:18.470 --> 00:55:21.620 And it's up to the machine learning programmer 00:55:21.620 --> 00:55:24.470 to decide where they want to set that value of lambda 00:55:24.470 --> 00:55:28.430 for how much do I want to penalize a more complex hypothesis that 00:55:28.430 --> 00:55:30.293 might fit the data a little better. 00:55:30.293 --> 00:55:32.960 And again, there is no one right answer to a lot of these things. 00:55:32.960 --> 00:55:36.362 Depending on the data set, depending on the data you have available to you, 00:55:36.362 --> 00:55:39.320 and the problem you're trying to solve, your choice of these parameters 00:55:39.320 --> 00:55:39.868 may vary. 00:55:39.868 --> 00:55:41.660 And you may need to experiment a little bit 00:55:41.660 --> 00:55:45.800 to figure out what the right choice of that is ultimately going to be. 00:55:45.800 --> 00:55:49.100 This process then of considering not only the loss, but also 00:55:49.100 --> 00:55:53.000 some measure of the complexity is known as regularization. 00:55:53.000 --> 00:55:55.610 Regularization is the process of penalizing 00:55:55.610 --> 00:55:58.260 a hypothesis that is more complex 00:55:58.260 --> 00:56:00.950 in order to favor a simpler hypothesis that 00:56:00.950 --> 00:56:03.680 is more likely to generalize well-- more likely to be 00:56:03.680 --> 00:56:08.210 able to apply to other situations that are dealing with other input points 00:56:08.210 --> 00:56:11.540 unlike the ones that we've necessarily seen before. 00:56:11.540 --> 00:56:15.500 So oftentimes, you'll see us add some regularizing term 00:56:15.500 --> 00:56:17.570 to what we're trying to minimize in order 00:56:17.570 --> 00:56:21.220 to avoid this problem of overfitting. 00:56:21.220 --> 00:56:25.780 Now another way of making sure we don't overfit is to run some experiments 00:56:25.780 --> 00:56:29.950 and to see whether or not we are able to generalize our model that we've 00:56:29.950 --> 00:56:33.250 created to other data sets as well. 00:56:33.250 --> 00:56:35.320 And it's for that reason that oftentimes, 00:56:35.320 --> 00:56:37.800 when you're doing a machine learning experiment, when you've got some data 00:56:37.800 --> 00:56:40.717 and you want to try and come up with some function that predicts given 00:56:40.717 --> 00:56:44.230 some input, what the output is going to be, you don't necessarily 00:56:44.230 --> 00:56:48.340 want to do your training on all of the data you have available to you. 00:56:48.340 --> 00:56:52.410 That you could employ a method known as holdout cross-validation. 00:56:52.410 --> 00:56:55.480 Where in holdout cross-validation, we split up our data. 00:56:55.480 --> 00:57:00.373 We split up our data into a training set and a testing set. 
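As a rough numerical sketch of the regularized cost just described, cost = loss + lambda * complexity, with all of the numbers made up purely to show how the lambda parameter shifts the balance:

    def cost(loss, complexity, lam):
        """Regularized cost of a hypothesis: its loss plus lambda times its complexity."""
        return loss + lam * complexity

    # A very complex boundary that fits the training data perfectly (loss 0)
    # versus a simple straight line that misses a few points (loss 4).
    print(cost(loss=0, complexity=10, lam=1.0))   # 10.0 for the complex hypothesis
    print(cost(loss=4, complexity=1, lam=1.0))    # 5.0 for the simpler one, so it wins

    # A smaller lambda penalizes complexity less, so the complex hypothesis can win instead.
    print(cost(loss=0, complexity=10, lam=0.1))   # 1.0
    print(cost(loss=4, complexity=1, lam=0.1))    # 4.1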
00:57:00.373 --> 00:57:02.290 The training set is the set of data that we're 00:57:02.290 --> 00:57:04.870 going to use to train our machine learning model. 00:57:04.870 --> 00:57:07.440 And the testing set is the set of data that we 00:57:07.440 --> 00:57:11.260 are going to use in order to test to see how well our machine learning 00:57:11.260 --> 00:57:13.660 model actually performed. 00:57:13.660 --> 00:57:15.757 So the learning happens on the training set. 00:57:15.757 --> 00:57:17.590 We figure out what the parameters should be, 00:57:17.590 --> 00:57:19.660 we figure out what the right model is. 00:57:19.660 --> 00:57:22.360 And that we see, all right, now that we've trained the model, 00:57:22.360 --> 00:57:25.960 see how well it does at predicting things and inside of the testing 00:57:25.960 --> 00:57:29.342 set, some set of data that we haven't seen before. 00:57:29.342 --> 00:57:32.300 And the hope then is that we're going to be able to predict the testing 00:57:32.300 --> 00:57:36.460 set pretty well if we're able to generalize based on the training 00:57:36.460 --> 00:57:38.028 data that's available to us. 00:57:38.028 --> 00:57:39.820 If we've overfit the training data, though, 00:57:39.820 --> 00:57:43.112 and we're not able to generalize, then when we look at the testing set, 00:57:43.112 --> 00:57:45.070 it's likely going to be the case that we're not 00:57:45.070 --> 00:57:49.100 going to predict things from the testing set nearly as effectively. 00:57:49.100 --> 00:57:51.850 So this is one method of cross-validation-- validating 00:57:51.850 --> 00:57:55.270 to make sure that the work we have done is actually going to generalize 00:57:55.270 --> 00:57:56.770 to other data sets as well. 00:57:56.770 --> 00:57:59.620 And there are other statistical techniques we can use, as well. 00:57:59.620 --> 00:58:03.520 One of the downsides of this just holdout cross-validation is if you say, 00:58:03.520 --> 00:58:07.960 I just split it 50/50, I train using 50% of the data and test using 00:58:07.960 --> 00:58:11.110 the other 50%, or you could choose other percentages as well, 00:58:11.110 --> 00:58:15.190 is that there is a fair amount of data that I am now not using 00:58:15.190 --> 00:58:19.670 to train that I might be able to get a better model as a result, for example. 00:58:19.670 --> 00:58:23.500 So one approach is known as k-fold cross-validation. 00:58:23.500 --> 00:58:26.830 In k-fold cross-validation, rather than just divide things 00:58:26.830 --> 00:58:31.980 into two sets and run one experiment, we divide things into k different sets 00:58:31.980 --> 00:58:34.810 and maybe I divide things up into 10 different sets, 00:58:34.810 --> 00:58:37.270 and then run 10 different experiments. 00:58:37.270 --> 00:58:40.750 So if I split up my data into 10 different sets of data, 00:58:40.750 --> 00:58:44.410 then what I'll do is each time for each of my 10 experiments, 00:58:44.410 --> 00:58:47.020 I will hold out one of those sets of data, 00:58:47.020 --> 00:58:50.320 where I'll say, let me train my model on these nine sets, 00:58:50.320 --> 00:58:54.070 and then test to see how well it predicts on set number 10. 
00:58:54.070 --> 00:58:57.490 And then pick another set of nine sets to train on, and then 00:58:57.490 --> 00:58:59.320 test it on the other one that I held out, 00:58:59.320 --> 00:59:03.550 where each time, I train the model on everything minus the one set 00:59:03.550 --> 00:59:07.300 that I'm holding out, and then test to see how well our model performs 00:59:07.300 --> 00:59:09.100 on the test that I did hold out. 00:59:09.100 --> 00:59:12.250 And what you end up getting is 10 different results, 10 different answers 00:59:12.250 --> 00:59:14.470 for how accurately our model worked. 00:59:14.470 --> 00:59:16.870 And oftentimes, you can just take the average of those 10 00:59:16.870 --> 00:59:21.160 to get an approximation for how well we think our model performs overall. 00:59:21.160 --> 00:59:25.270 But the key idea is separating the training data from the testing data, 00:59:25.270 --> 00:59:29.170 because you want to test your model on data that is different from what 00:59:29.170 --> 00:59:30.273 you trained the model on. 00:59:30.273 --> 00:59:32.440 Because the training, you want to avoid overfitting, 00:59:32.440 --> 00:59:33.940 you want to be able to generalize. 00:59:33.940 --> 00:59:36.190 And the way you test whether you're able to generalize 00:59:36.190 --> 00:59:39.580 and is by looking at some data that you haven't seen before 00:59:39.580 --> 00:59:43.380 and seeing how well we're actually able to perform. 00:59:43.380 --> 00:59:46.010 And so if we want to actually implement any of these techniques 00:59:46.010 --> 00:59:48.980 inside of a programming language like Python, a number of ways 00:59:48.980 --> 00:59:49.760 we could do that. 00:59:49.760 --> 00:59:52.007 We could write this from scratch on our own, 00:59:52.007 --> 00:59:53.840 but there are libraries out there that allow 00:59:53.840 --> 00:59:57.290 us to take advantage of existing implementations of these algorithms-- 00:59:57.290 --> 01:00:00.080 that we can use the same types of algorithms 01:00:00.080 --> 01:00:01.950 in a lot of different situations. 01:00:01.950 --> 01:00:03.770 And so there is a library, very popular one 01:00:03.770 --> 01:00:07.240 known as scikit-learn, which allows us in Python to be 01:00:07.240 --> 01:00:10.490 able to very quickly get set up with a lot of these different machine learning 01:00:10.490 --> 01:00:10.990 models. 01:00:10.990 --> 01:00:14.210 So this library has already written an algorithm for nearest neighbor 01:00:14.210 --> 01:00:16.430 classification, for doing perceptron learning, 01:00:16.430 --> 01:00:19.035 for doing a bunch of other types of inference 01:00:19.035 --> 01:00:21.410 and supervised learning that we haven't yet talked about. 01:00:21.410 --> 01:00:26.840 But using it, we can begin to try actually testing how these methods work 01:00:26.840 --> 01:00:29.330 and how accurately they perform. 01:00:29.330 --> 01:00:31.900 So let's go ahead and take a look at one approach to trying 01:00:31.900 --> 01:00:33.920 to solve this type of problem. 01:00:33.920 --> 01:00:37.280 All right, so I'm first going to pull up banknotes.csv, 01:00:37.280 --> 01:00:40.970 which is a whole bunch of data provided by UC Irvine, which has information 01:00:40.970 --> 01:00:43.160 about various different banknotes. 01:00:43.160 --> 01:00:45.440 So people took pictures of various different banknotes 01:00:45.440 --> 01:00:48.530 and measured various different properties of those banknotes. 
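A minimal sketch of that k-fold idea, assuming scikit-learn's cross_val_score helper and a made-up stand-in data set (the lecture itself does not show this particular code):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in labeled data, generated just for illustration.
    X, y = make_classification(n_samples=500, n_features=4, random_state=0)

    # 10-fold cross-validation: train on nine folds, test on the held-out tenth,
    # and repeat ten times, producing ten separate accuracy scores.
    model = KNeighborsClassifier(n_neighbors=3)
    scores = cross_val_score(model, X, y, cv=10)

    print(scores)          # one accuracy per fold
    print(scores.mean())   # the average, as an overall estimate of performance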
01:00:48.530 --> 01:00:51.350 And in particular, some human categorized each of those 01:00:51.350 --> 01:00:55.810 banknotes as either a counterfeit bank note or as not counterfeit. 01:00:55.810 --> 01:00:59.540 And so what you're looking at here is each row represents one banknote. 01:00:59.540 --> 01:01:01.910 This is formatted as a CSV spreadsheet, where just 01:01:01.910 --> 01:01:05.780 comma-separated value separating each of these various different fields. 01:01:05.780 --> 01:01:08.990 We have four different input values for each 01:01:08.990 --> 01:01:12.410 of these data points, just information, some measurement that 01:01:12.410 --> 01:01:13.460 was made on the banknote. 01:01:13.460 --> 01:01:16.340 And what those measurements exactly aren't as important as the fact 01:01:16.340 --> 01:01:18.320 that we do have access to this data. 01:01:18.320 --> 01:01:21.950 But more importantly, we have access for each of these data points 01:01:21.950 --> 01:01:26.210 to a label, where 0 indicates something like this was not a counterfeit bill, 01:01:26.210 --> 01:01:27.920 meaning it was an authentic bill. 01:01:27.920 --> 01:01:32.840 And a data point labeled 1 means that it is a counterfeit bill at least, 01:01:32.840 --> 01:01:36.240 according to the human researcher who labeled this particular data. 01:01:36.240 --> 01:01:38.360 So we have a whole bunch of data representing 01:01:38.360 --> 01:01:40.940 a whole bunch of different data points, each of which 01:01:40.940 --> 01:01:44.360 has these various different measurements that were made on that particular bill. 01:01:44.360 --> 01:01:47.300 And each of which has an output value 0 or 1-- 01:01:47.300 --> 01:01:53.033 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill. 01:01:53.033 --> 01:01:54.950 And what we would like to do is use supervised 01:01:54.950 --> 01:01:58.550 learning to begin to predict or model some sort of function 01:01:58.550 --> 01:02:02.570 that can take these four values as input and predict what the output would be. 01:02:02.570 --> 01:02:05.777 We want our learning algorithm to find some sort of pattern that 01:02:05.777 --> 01:02:08.110 is able to predict based on these measurements something 01:02:08.110 --> 01:02:10.690 that you could measure just by taking a photo of a bill-- 01:02:10.690 --> 01:02:16.090 predict whether that bill is authentic or whether that bill is counterfeit. 01:02:16.090 --> 01:02:17.730 And so how can we do that? 01:02:17.730 --> 01:02:21.020 Well, I'm first going to open up banknotes0.py and see 01:02:21.020 --> 01:02:22.410 how it is that we do this. 01:02:22.410 --> 01:02:26.000 I'm first importing a lot of things from scikit-learn, 01:02:26.000 --> 01:02:29.753 but importantly, I'm going to set my model equal to the perceptron 01:02:29.753 --> 01:02:32.420 model, which is one of those models that we talked about before. 01:02:32.420 --> 01:02:35.120 We're just going to try and figure out some setting of weights 01:02:35.120 --> 01:02:38.950 that is able to divide our data into two different groups. 01:02:38.950 --> 01:02:43.270 Then I'm going to go ahead and read data in from my file from banknotes.csv. 01:02:43.270 --> 01:02:46.660 And basically, for every row, I'm going to separate that row 01:02:46.660 --> 01:02:51.460 into the first four values of that row, which is the evidence for that row. 
01:02:51.460 --> 01:02:56.920 And then the label where if the final column in that row is 0, 01:02:56.920 --> 01:03:00.770 the label is authentic, and otherwise, it's going to be counterfeit. 01:03:00.770 --> 01:03:03.880 So I'm effectively reading data in from the CSV file, 01:03:03.880 --> 01:03:07.330 dividing it into a whole bunch of rows, where each row has some evidence-- 01:03:07.330 --> 01:03:11.380 those four input values that are going to be inputs to my hypothesis function. 01:03:11.380 --> 01:03:14.710 And then the label, the output, whether it is authentic or counterfeit. 01:03:14.710 --> 01:03:17.270 That is the thing that I am then trying to predict. 01:03:17.270 --> 01:03:20.650 So the next step is that I would like to split up my data set into a training 01:03:20.650 --> 01:03:22.990 set and a testing set-- some set of data 01:03:22.990 --> 01:03:26.260 that I would like to train my machine learning model on and some set of data 01:03:26.260 --> 01:03:29.500 that I would like to use to test that model, see how well it performed. 01:03:29.500 --> 01:03:32.480 So what I'll do is I'll go ahead and figure out the length of the data, 01:03:32.480 --> 01:03:34.150 how many data points do I have. 01:03:34.150 --> 01:03:35.925 I'll go ahead and take half of them, save 01:03:35.925 --> 01:03:37.550 that number as a number called holdout. 01:03:37.550 --> 01:03:39.930 That is how many items I'm going to hold out from my data 01:03:39.930 --> 01:03:42.400 set to save for the testing phase. 01:03:42.400 --> 01:03:45.250 I'll randomly shuffle the data so it's in some random order. 01:03:45.250 --> 01:03:50.440 And then I'll say my testing set will be all of the data up to the holdout. 01:03:50.440 --> 01:03:54.790 So I'll hold out that many data items, and that will be my testing set. 01:03:54.790 --> 01:03:58.060 My training data will be everything else-- the information 01:03:58.060 --> 01:04:00.850 that I'm going to train my model on. 01:04:00.850 --> 01:04:04.870 And then I'll say, I need to divide up my training data 01:04:04.870 --> 01:04:06.040 into two different sets. 01:04:06.040 --> 01:04:10.750 I need to divide it into my x values, where x here represents the inputs. 01:04:10.750 --> 01:04:14.260 So for the x values that I'm going to train on, basically 01:04:14.260 --> 01:04:16.330 for every row in my training set, I'm going 01:04:16.330 --> 01:04:19.150 to get the evidence for that row, those four values, 01:04:19.150 --> 01:04:21.940 where it's basically a vector of four numbers, where 01:04:21.940 --> 01:04:24.010 that is going to be all of the input. 01:04:24.010 --> 01:04:27.460 And then I need the y values-- what are the outputs that I want to learn from, 01:04:27.460 --> 01:04:30.567 the labels that belong to each of these various different input points. 01:04:30.567 --> 01:04:33.650 Well, that's going to be the same thing for each row in the training data. 01:04:33.650 --> 01:04:36.430 But this time, I take that row and get what it's labeled as, 01:04:36.430 --> 01:04:38.680 whether it is authentic or counterfeit. 01:04:38.680 --> 01:04:43.270 So I end up with one list of all of these vectors of my input data 01:04:43.270 --> 01:04:45.790 and one list which follows the same order, 01:04:45.790 --> 01:04:49.690 but has all of the labels that correspond with each of those vectors. 
01:04:49.690 --> 01:04:51.610 And then to train my model, which in this case 01:04:51.610 --> 01:04:54.190 is just this perceptron model, I just called 01:04:54.190 --> 01:04:57.370 model.fit, pass in the training data, and what 01:04:57.370 --> 01:04:59.710 the labels for those training data are. 01:04:59.710 --> 01:05:02.120 And scikit-learn will take care of fitting the model-- 01:05:02.120 --> 01:05:04.250 will do the entire algorithm for me. 01:05:04.250 --> 01:05:08.320 And then when it's done, I can then test to see how well that model performed. 01:05:08.320 --> 01:05:10.780 So I can say, let me get all of these input 01:05:10.780 --> 01:05:13.040 vectors for what I want to test on. 01:05:13.040 --> 01:05:16.870 So for each row in my testing data set, go ahead and get the evidence. 01:05:16.870 --> 01:05:20.470 And the y values, those are what the actual values were-- 01:05:20.470 --> 01:05:24.610 for each of the rows in the testing data set, what the actual label is. 01:05:24.610 --> 01:05:26.860 But then I'm going to generate some predictions. 01:05:26.860 --> 01:05:29.350 I'm going to use this model and try and predict-- 01:05:29.350 --> 01:05:31.480 based on the testing vectors-- 01:05:31.480 --> 01:05:33.910 I want to predict what the output is. 01:05:33.910 --> 01:05:38.230 And my goal then is to now compare y testing with predictions. 01:05:38.230 --> 01:05:41.440 I want to see how well my predictions based on the model 01:05:41.440 --> 01:05:45.310 actually reflect what the y values were, what the output is 01:05:45.310 --> 01:05:46.540 that were actually labeled. 01:05:46.540 --> 01:05:48.580 Because I now have this label data, I can 01:05:48.580 --> 01:05:51.540 assess how well the algorithm worked. 01:05:51.540 --> 01:05:55.330 And so now I can just compute how well we did. 01:05:55.330 --> 01:05:58.840 This zip function basically just lets me look through two different lists, one 01:05:58.840 --> 01:06:00.500 by one at the same time. 01:06:00.500 --> 01:06:04.113 So for each actual value and for each predicted value, 01:06:04.113 --> 01:06:06.280 if the actual is the same thing as what I predicted, 01:06:06.280 --> 01:06:08.570 I'll go ahead and increment the counter by 1. 01:06:08.570 --> 01:06:11.727 Otherwise, I'll increment my incorrect counter by 1. 01:06:11.727 --> 01:06:14.060 And so at the end, I can print out here are the results, 01:06:14.060 --> 01:06:16.435 here's how many I got right, here's how many I got wrong. 01:06:16.435 --> 01:06:19.660 And here was my overall accuracy, for example. 01:06:19.660 --> 01:06:21.100 So I can go ahead and run this. 01:06:21.100 --> 01:06:24.790 I can run Python banknotes0.py. 01:06:24.790 --> 01:06:28.810 And it's going to train on half the data set and then test on half the data set. 01:06:28.810 --> 01:06:31.090 And here the results from my perceptron model. 01:06:31.090 --> 01:06:35.540 In this case, it correctly was able to classify 679 bills 01:06:35.540 --> 01:06:38.170 as correctly either authentic or counterfeit, 01:06:38.170 --> 01:06:43.580 and incorrectly classified seven of them for an overall accuracy of close to 99% 01:06:43.580 --> 01:06:44.080 accurate. 01:06:44.080 --> 01:06:47.230 So on this particular data set, using this perceptron model, 01:06:47.230 --> 01:06:51.310 we were able to predict very well what the output was going to be. 01:06:51.310 --> 01:06:53.010 And we can try different models, too. 
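The transcript describes banknotes0.py without showing it, so here is a rough sketch of what a script along those lines might look like, based purely on the description above; the header row handling and exact file layout are assumptions, and the real course file may differ:

    import csv
    import random

    from sklearn.linear_model import Perceptron

    model = Perceptron()

    # Read banknotes.csv: four measurements per row, then a 0/1 label in the last column.
    with open("banknotes.csv") as f:
        reader = csv.reader(f)
        next(reader)  # assumption: skip a header row
        data = []
        for row in reader:
            data.append({
                "evidence": [float(cell) for cell in row[:4]],
                "label": "Authentic" if row[4] == "0" else "Counterfeit"
            })

    # Hold out half of the data for testing; train on the rest.
    holdout = int(0.5 * len(data))
    random.shuffle(data)
    testing = data[:holdout]
    training = data[holdout:]

    # Train the model on the training set's inputs and labels.
    X_training = [row["evidence"] for row in training]
    y_training = [row["label"] for row in training]
    model.fit(X_training, y_training)

    # Make predictions on the testing set and compare them to the actual labels.
    X_testing = [row["evidence"] for row in testing]
    y_testing = [row["label"] for row in testing]
    predictions = model.predict(X_testing)

    correct = sum(1 for actual, predicted in zip(y_testing, predictions)
                  if actual == predicted)
    incorrect = len(y_testing) - correct

    print(f"Results for model {type(model).__name__}")
    print(f"Correct: {correct}")
    print(f"Incorrect: {incorrect}")
    print(f"Accuracy: {100 * correct / len(y_testing):.2f}%")

Swapping in a different model, as described next, would only mean changing the model = Perceptron() line to something like svm.SVC() or KNeighborsClassifier(n_neighbors=1).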
01:06:53.010 --> 01:06:55.390 And scikit-learn makes it very easy just to swap out 01:06:55.390 --> 01:06:57.970 one model for another model. 01:06:57.970 --> 01:07:02.710 So instead of the perceptron model, I can use a support vector machine 01:07:02.710 --> 01:07:06.490 using the SVC, otherwise known as a support vector classifier, 01:07:06.490 --> 01:07:08.950 using a support vector machine to classify things 01:07:08.950 --> 01:07:10.820 into two different groups. 01:07:10.820 --> 01:07:14.270 And now we can see, all right, how well does this perform. 01:07:14.270 --> 01:07:17.630 And this time, we were able to correctly predict 682 01:07:17.630 --> 01:07:22.280 and incorrectly predicted four, for an accuracy of 99.4%. 01:07:22.280 --> 01:07:26.990 And we could even try the KNeighborsClassifier as the model 01:07:26.990 --> 01:07:27.740 instead. 01:07:27.740 --> 01:07:31.270 And this takes a parameter n_neighbors for how many neighbors 01:07:31.270 --> 01:07:32.133 you want to look at. 01:07:32.133 --> 01:07:34.550 Let's just look at one neighbor, the one nearest neighbor, 01:07:34.550 --> 01:07:36.080 and use that to predict. 01:07:36.080 --> 01:07:38.250 Go ahead and run this as well. 01:07:38.250 --> 01:07:40.880 And it looks like, based on the KNeighborsClassifier looking 01:07:40.880 --> 01:07:45.350 at just one neighbor, we were able to correctly classify 685 data points, 01:07:45.350 --> 01:07:47.420 and incorrectly classified one. 01:07:47.420 --> 01:07:50.560 Maybe let's try three neighbors instead of just using one neighbor, 01:07:50.560 --> 01:07:52.560 do more of a k-nearest-neighbors approach, where 01:07:52.560 --> 01:07:55.693 I look at the three nearest neighbors and see how that performs. 01:07:55.693 --> 01:07:57.610 And that one in this case seems to have gotten 01:07:57.610 --> 01:08:05.330 100% of all of the predictions correctly described as either authentic banknotes 01:08:05.330 --> 01:08:07.370 or as counterfeit banknotes. 01:08:07.370 --> 01:08:09.470 And we could run these experiments multiple times. 01:08:09.470 --> 01:08:12.082 Because I'm randomly reorganizing the data every time, 01:08:12.082 --> 01:08:14.790 we're technically training these on slightly different data sets. 01:08:14.790 --> 01:08:16.832 And so you might want to run multiple experiments 01:08:16.832 --> 01:08:19.250 to really see how well they're actually going to perform. 01:08:19.250 --> 01:08:20.963 But in short, they all perform very well. 01:08:20.963 --> 01:08:23.630 And while some of them perform slightly better than others here, 01:08:23.630 --> 01:08:26.240 that might not always be the case for every data set, 01:08:26.240 --> 01:08:29.240 but you can begin to test now by very quickly putting together 01:08:29.240 --> 01:08:31.790 these machine learning models using scikit-learn 01:08:31.790 --> 01:08:33.920 to be able to train on some training set, 01:08:33.920 --> 01:08:36.814 and then test on some testing set as well. 01:08:36.814 --> 01:08:39.439 And this splitting up into training groups and testing groups, 01:08:39.439 --> 01:08:43.279 and then training and testing, happens so often that scikit-learn has functions built in 01:08:43.279 --> 01:08:44.160 for doing it. 01:08:44.160 --> 01:08:46.100 I did it all by hand just now. 01:08:46.100 --> 01:08:48.770 But if we take a look at banknotes1, we take 01:08:48.770 --> 01:08:52.100 advantage of some other features that exist in scikit-learn, 01:08:52.100 --> 01:08:55.399 where we can really simplify a lot of our logic.
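As a rough sketch of the simplified banknotes1-style approach described next, the model can be any of the classifiers mentioned above, and scikit-learn's train_test_split helper can replace the manual shuffling and slicing; the file name and layout are again assumptions for illustration:

# Sketch: swap in any of the models discussed above and let
# scikit-learn handle the train/test split.
import csv

from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Pick whichever model to experiment with.
model = Perceptron()
# model = SVC()
# model = KNeighborsClassifier(n_neighbors=3)

# Read data in from file (same assumed banknotes.csv layout as before).
evidence, labels = [], []
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # assumes a header row
    for row in reader:
        evidence.append([float(cell) for cell in row[:4]])
        labels.append("Authentic" if row[4] == "0" else "Counterfeit")

# Split into training and testing groups, half and half.
X_training, X_testing, y_training, y_testing = train_test_split(
    evidence, labels, test_size=0.5
)

# Fit the model on the training data and predict on the testing data.
model.fit(X_training, y_training)
predictions = model.predict(X_testing)

# Count up how often the predictions matched the actual labels.
correct = (predictions == y_testing).sum()
incorrect = (predictions != y_testing).sum()
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / len(predictions):.2f}%")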
01:08:55.399 --> 01:08:59.609 That there is a function built into scikit-learn called train_test_split, 01:08:59.609 --> 01:09:03.140 which will automatically split data into a training group and a testing group. 01:09:03.140 --> 01:09:05.536 I just have to say what proportion should 01:09:05.536 --> 01:09:07.369 be in the testing group, something like 0.5, 01:09:07.369 --> 01:09:10.080 half the data, inside the testing group. 01:09:10.080 --> 01:09:12.380 Then I can fit the model on the training data, 01:09:12.380 --> 01:09:15.859 make the predictions on the testing data, and then just count up. 01:09:15.859 --> 01:09:18.830 And scikit-learn has some nice methods for just counting up 01:09:18.830 --> 01:09:22.250 how many times our testing data matched the predictions, how 01:09:22.250 --> 01:09:25.439 many times our testing data didn't match the predictions. 01:09:25.439 --> 01:09:27.680 So very quickly, you can write programs with not all 01:09:27.680 --> 01:09:30.500 that many lines of code-- it's maybe, like, 40 lines of code 01:09:30.500 --> 01:09:32.540 to get through all of these predictions. 01:09:32.540 --> 01:09:35.420 And then as a result, see how well we're able to do. 01:09:35.420 --> 01:09:38.330 So these types of libraries can allow us, without really 01:09:38.330 --> 01:09:40.970 knowing the implementation details of these algorithms, 01:09:40.970 --> 01:09:43.939 to be able to use the algorithms in a very practical way 01:09:43.939 --> 01:09:47.279 to be able to solve these types of problems. 01:09:47.279 --> 01:09:49.939 So that then is supervised learning-- this task 01:09:49.939 --> 01:09:52.972 of, given a whole set of data, some input-output pairs, 01:09:52.972 --> 01:09:54.680 we would like to learn some function that 01:09:54.680 --> 01:09:57.110 maps those inputs to those outputs. 01:09:57.110 --> 01:09:59.630 But it turns out there are other forms of learning, as well. 01:09:59.630 --> 01:10:02.870 And another popular type of machine learning, especially nowadays, 01:10:02.870 --> 01:10:05.090 is known as reinforcement learning. 01:10:05.090 --> 01:10:07.250 And the idea of reinforcement learning is, 01:10:07.250 --> 01:10:09.140 rather than just being given a whole data set 01:10:09.140 --> 01:10:12.470 of input-output pairs at the beginning, reinforcement learning 01:10:12.470 --> 01:10:14.680 is all about learning from experience. 01:10:14.680 --> 01:10:17.390 In reinforcement learning, our agent, whether it's 01:10:17.390 --> 01:10:19.550 like a physical robot that's trying to make actions 01:10:19.550 --> 01:10:23.750 in the world or just some virtual agent that is a program running somewhere-- 01:10:23.750 --> 01:10:27.560 our agent is going to be given a set of rewards or punishments 01:10:27.560 --> 01:10:29.510 in the form of numerical values, but you can 01:10:29.510 --> 01:10:31.430 think of them as reward or punishment. 01:10:31.430 --> 01:10:35.480 And based on that, it learns what actions to take in the future. 01:10:35.480 --> 01:10:39.470 That is, our agent, our AI, will be put in some sort of environment. 01:10:39.470 --> 01:10:42.537 It will make some actions, and based on the actions that it makes, 01:10:42.537 --> 01:10:43.370 it learns something. 01:10:43.370 --> 01:10:45.537 It either gets a reward when it does something well, 01:10:45.537 --> 01:10:47.720 or it gets a punishment when it does something poorly.
01:10:47.720 --> 01:10:52.130 And it learns what to do or what not to do in the future based 01:10:52.130 --> 01:10:54.980 on those individual experiences. 01:10:54.980 --> 01:10:58.700 And so what this will often look like is it will often start with some agent, 01:10:58.700 --> 01:11:01.070 some AI, which might again, be a physical robot-- 01:11:01.070 --> 01:11:03.360 if you're imagining a physical robot moving around-- 01:11:03.360 --> 01:11:05.180 but it can also just be a program. 01:11:05.180 --> 01:11:08.210 And our agent is situated in their environment, 01:11:08.210 --> 01:11:11.120 where the environment is where they're going to make their actions. 01:11:11.120 --> 01:11:13.820 And it's what's going to give them rewards or punishments 01:11:13.820 --> 01:11:16.070 for various actions that they're in. 01:11:16.070 --> 01:11:19.220 So for example, the environment is going to start off 01:11:19.220 --> 01:11:21.980 by putting our agent inside of a state. 01:11:21.980 --> 01:11:25.640 Our agent has some state that in a game might be the state of the game 01:11:25.640 --> 01:11:28.730 that the agent is playing, in a world that the agent is exploring. 01:11:28.730 --> 01:11:31.375 Might be some position inside of a grid representing 01:11:31.375 --> 01:11:32.750 the world that they're exploring. 01:11:32.750 --> 01:11:35.090 But the agent is in some sort of state. 01:11:35.090 --> 01:11:39.140 And in that state, the agent needs to choose to take an action. 01:11:39.140 --> 01:11:41.760 The agent likely has multiple actions they can choose from, 01:11:41.760 --> 01:11:43.310 but they pick an action. 01:11:43.310 --> 01:11:46.310 So they take an action in a particular state. 01:11:46.310 --> 01:11:51.170 And as a result of that, the agent will generally get two things in response 01:11:51.170 --> 01:11:52.040 as we model them. 01:11:52.040 --> 01:11:54.770 The agent gets a new state that they find themselves in. 01:11:54.770 --> 01:11:57.110 After being in this state taking one action, 01:11:57.110 --> 01:11:59.210 they end up in some other state. 01:11:59.210 --> 01:12:02.360 And they're also given some sort of numerical reward-- 01:12:02.360 --> 01:12:05.660 positive meaning reward, meaning it was a good thing. 01:12:05.660 --> 01:12:08.000 Negative generally meaning they did something bad, 01:12:08.000 --> 01:12:10.310 they received some sort of punishment. 01:12:10.310 --> 01:12:13.190 And that is all the information the agent has. 01:12:13.190 --> 01:12:15.260 It's told what state it's in. 01:12:15.260 --> 01:12:17.150 It makes some sort of action. 01:12:17.150 --> 01:12:19.160 And based on that, it ends up in another state, 01:12:19.160 --> 01:12:21.660 and it ends up getting some particular reward. 01:12:21.660 --> 01:12:24.110 And it needs to learn based on that information what 01:12:24.110 --> 01:12:26.770 actions to begin to take in the future. 01:12:26.770 --> 01:12:30.090 As you can imagine generalizing this to a lot of different situations, 01:12:30.090 --> 01:12:31.685 this is oftentimes how you train. 01:12:31.685 --> 01:12:33.560 If you've ever seen those robots that are now 01:12:33.560 --> 01:12:36.140 able to walk around sort of the way humans do, 01:12:36.140 --> 01:12:39.500 it would be quite difficult to program the robot in exactly the right way 01:12:39.500 --> 01:12:41.330 to get it to walk the way humans do. 
01:12:41.330 --> 01:12:44.330 You could instead of train it through reinforcement learning-- give it 01:12:44.330 --> 01:12:47.060 some sort of numerical reward every time it does something 01:12:47.060 --> 01:12:50.690 good like take steps forward, and punish it every time it does something 01:12:50.690 --> 01:12:52.250 bad like fall over. 01:12:52.250 --> 01:12:54.080 And then let the AI just learn. 01:12:54.080 --> 01:12:56.450 Based on that sequence of rewards, based on trying 01:12:56.450 --> 01:12:58.610 to take various different actions, you can 01:12:58.610 --> 01:13:03.200 begin to have the agent learn what to do in the future and what not to do. 01:13:03.200 --> 01:13:06.530 So in order to begin to formalize this, the first thing we need to do 01:13:06.530 --> 01:13:10.700 is formalize this notion of what we mean about states and actions and rewards-- 01:13:10.700 --> 01:13:12.800 like what does this world look like. 01:13:12.800 --> 01:13:14.990 And oftentimes, we'll formulate this world 01:13:14.990 --> 01:13:17.990 as what's known as a Markov decision process. 01:13:17.990 --> 01:13:21.410 Similar in spirit to Markov chains, which you might recall from before, 01:13:21.410 --> 01:13:24.020 but a Markov decision process is a model that we 01:13:24.020 --> 01:13:26.780 can use for decision making for an agent trying 01:13:26.780 --> 01:13:28.580 to make decisions in this environment. 01:13:28.580 --> 01:13:32.270 And it's a model that allows us to represent the various different states 01:13:32.270 --> 01:13:35.930 that an agent can be in, the various different actions that they can take, 01:13:35.930 --> 01:13:40.400 and also, what the reward is for taking one action as opposed 01:13:40.400 --> 01:13:42.180 to another action. 01:13:42.180 --> 01:13:44.610 So what then does that actually look like? 01:13:44.610 --> 01:13:47.590 Well, if you recall, a Markov chain from before, 01:13:47.590 --> 01:13:50.270 a Markov chain looked a little something like this. 01:13:50.270 --> 01:13:53.600 Where we had a whole bunch of these individual states, and each state 01:13:53.600 --> 01:13:55.820 immediately transitioned to another state 01:13:55.820 --> 01:13:57.890 based on some probability distribution. 01:13:57.890 --> 01:14:01.070 We saw this in the context of the weather before, where if it was sunny, 01:14:01.070 --> 01:14:03.770 we said with some probability, it will be sunny the next day. 01:14:03.770 --> 01:14:06.920 With some other probability, it'll be rainy, for example. 01:14:06.920 --> 01:14:09.340 But we could also imagine generalizing this. 01:14:09.340 --> 01:14:11.060 It's not just sun and rain anymore. 01:14:11.060 --> 01:14:14.210 We just have these states, where one state leads to another state 01:14:14.210 --> 01:14:16.860 according to some probability distribution. 01:14:16.860 --> 01:14:19.340 But in this original model, there was no agent 01:14:19.340 --> 01:14:21.470 that had any control over this process. 01:14:21.470 --> 01:14:24.740 It was just entirely probability-based, where with some probability, 01:14:24.740 --> 01:14:26.540 we moved to this next state, but maybe it's 01:14:26.540 --> 01:14:29.230 going to be some other state with some other probability. 
01:14:29.230 --> 01:14:33.320 What we'll now have is the ability for the agent in this state 01:14:33.320 --> 01:14:37.190 to choose from a set of actions, where maybe instead of just one path forward, 01:14:37.190 --> 01:14:39.200 they have three different choices of actions 01:14:39.200 --> 01:14:41.180 that each lead them down different paths. 01:14:41.180 --> 01:14:43.340 And even this is a bit of an oversimplification, 01:14:43.340 --> 01:14:46.340 because in each of these states, you might imagine more branching points 01:14:46.340 --> 01:14:49.200 were there more decisions that can be taken as well. 01:14:49.200 --> 01:14:53.420 So we've extended the Markov chain to say that from a state, 01:14:53.420 --> 01:14:55.400 you now have available action choices. 01:14:55.400 --> 01:14:57.830 And each of those actions might be associated 01:14:57.830 --> 01:15:02.960 with its own probability distribution of going to various different states. 01:15:02.960 --> 01:15:05.870 Then in addition, we'll add another extension, 01:15:05.870 --> 01:15:08.180 where any time you move from a state taking 01:15:08.180 --> 01:15:10.640 an action going into this other state, we 01:15:10.640 --> 01:15:14.060 can associate a reward with that outcome, 01:15:14.060 --> 01:15:17.150 saying either r is positive, meaning some positive reward, 01:15:17.150 --> 01:15:20.460 or r is negative, meaning there were some sort of punishment. 01:15:20.460 --> 01:15:23.510 And this then is what we'll consider to be a Markov decision process. 01:15:23.510 --> 01:15:27.500 That a Markov decision process has some initial set of states in the world 01:15:27.500 --> 01:15:28.670 that we can be in. 01:15:28.670 --> 01:15:31.640 We have some set of actions that given a state, 01:15:31.640 --> 01:15:34.430 I can say what are the actions that are available to me 01:15:34.430 --> 01:15:37.730 in that state, an action that I can choose from. 01:15:37.730 --> 01:15:39.550 Then we have some transition model. 01:15:39.550 --> 01:15:41.710 The transition model before just said that 01:15:41.710 --> 01:15:44.740 given my current state, what is the probability that I end up 01:15:44.740 --> 01:15:46.930 in that next state or this other state. 01:15:46.930 --> 01:15:51.130 The transition model now has effectively two things we're conditioning on. 01:15:51.130 --> 01:15:53.260 We're saying, given that I'm in this state 01:15:53.260 --> 01:15:56.500 and that I take this action, what's the probability 01:15:56.500 --> 01:15:59.320 that I end up in this next state? 01:15:59.320 --> 01:16:02.050 Now maybe we live in a very deterministic world in this Markov 01:16:02.050 --> 01:16:05.170 decision process, where given a state and given an action, 01:16:05.170 --> 01:16:07.383 we know for sure what next state we'll end up in. 01:16:07.383 --> 01:16:10.550 But maybe there's some randomness in the world that when you take in a state 01:16:10.550 --> 01:16:14.260 and you take an action, you might not always end up in the exact same state. 01:16:14.260 --> 01:16:16.840 There might be some probabilities involved there as well. 01:16:16.840 --> 01:16:21.260 The Markov decision process can handle both of those possible cases. 01:16:21.260 --> 01:16:23.830 And then finally, we have a reward function, 01:16:23.830 --> 01:16:26.230 generally called r, that in this case says, 01:16:26.230 --> 01:16:30.340 what is the reward for being in this state, taking this action, 01:16:30.340 --> 01:16:33.687 and then getting to s prime, this next state. 
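To make the pieces of a Markov decision process concrete, here is one small way such a model could be represented in code; the tiny set of states, the transition probabilities, and the reward numbers are purely illustrative assumptions, not taken from the lecture's own examples:

# Sketch of a Markov decision process as plain Python data.

# States: cells of a tiny made-up world.
states = ["A", "B", "C", "D"]

# Actions available in each state.
actions = {
    "A": ["right", "down"],
    "B": ["left", "down"],
    "C": ["right", "up"],
    "D": [],  # assume D is a terminal goal state
}

# Transition model: P(s' | s, a), a distribution over next states
# given the current state and the chosen action.
transition = {
    ("A", "right"): {"B": 0.9, "A": 0.1},  # usually succeeds, sometimes slips
    ("A", "down"):  {"C": 1.0},
    ("B", "left"):  {"A": 1.0},
    ("B", "down"):  {"D": 1.0},
    ("C", "right"): {"D": 1.0},
    ("C", "up"):    {"A": 1.0},
}

# Reward function: R(s, a, s'), the reward for taking action a in state s
# and ending up in state s'.
reward = {
    ("B", "down", "D"): 1,     # reaching the goal is rewarded
    ("C", "right", "D"): 1,
    ("A", "right", "B"): 0,
    ("A", "right", "A"): 0,
    ("A", "down", "C"): 0,
    ("B", "left", "A"): 0,
    ("C", "up", "A"): 0,
}

With a representation like this, taking an action amounts to sampling a next state from the transition distribution and looking up the corresponding reward.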
01:16:33.687 --> 01:16:35.770 So I'm in this original state, I take this action, 01:16:35.770 --> 01:16:39.768 I get to this next state, what is the reward for doing that process? 01:16:39.768 --> 01:16:41.560 You can add up these rewards every time you 01:16:41.560 --> 01:16:44.440 take an action to get the total amount of rewards 01:16:44.440 --> 01:16:48.940 that an agent might get from interacting in a particular environment modeled 01:16:48.940 --> 01:16:51.170 using this Markov decision process. 01:16:51.170 --> 01:16:53.740 So what might this actually look like in practice? 01:16:53.740 --> 01:16:56.290 Well, let's just create a little simulated world here 01:16:56.290 --> 01:16:59.260 where I have this agent that is just trying to navigate its way-- 01:16:59.260 --> 01:17:02.350 this agent is this yellow dot here like a robot in the world trying 01:17:02.350 --> 01:17:04.240 to navigate its way through this grid. 01:17:04.240 --> 01:17:07.140 And ultimately, it's trying to find its way to the goal. 01:17:07.140 --> 01:17:11.230 And if it gets to the green goal, then it's going to get some sort of reward. 01:17:11.230 --> 01:17:14.795 But then we might also have some red squares 01:17:14.795 --> 01:17:17.920 that are places where you get some sort of punishment, some bad place where 01:17:17.920 --> 01:17:19.480 we don't want the agent to go. 01:17:19.480 --> 01:17:22.030 And if it ends up in the red square, then our agent 01:17:22.030 --> 01:17:25.400 is going to get some sort of punishment as a result of that. 01:17:25.400 --> 01:17:28.610 But the agent that originally doesn't know all of these details. 01:17:28.610 --> 01:17:31.360 It doesn't know that these states are associated with punishments, 01:17:31.360 --> 01:17:34.210 but maybe it does know that the state is associated with a reward-- 01:17:34.210 --> 01:17:35.200 maybe it doesn't. 01:17:35.200 --> 01:17:37.750 But it just needs to sort of interact with the environment 01:17:37.750 --> 01:17:41.150 to try and figure out what to do and what not to do. 01:17:41.150 --> 01:17:44.377 So the first thing the agent might do is given no additional information, 01:17:44.377 --> 01:17:46.210 if it doesn't know what the punishments are, 01:17:46.210 --> 01:17:50.030 it doesn't know where the rewards are, it just might try and take an action. 01:17:50.030 --> 01:17:52.690 And it takes an action and ends up realizing 01:17:52.690 --> 01:17:54.710 that he's got some sort of punishment. 01:17:54.710 --> 01:17:56.930 And so what does it learn from that experience? 01:17:56.930 --> 01:18:00.550 Well, it might learn that when you're in this state in the future 01:18:00.550 --> 01:18:02.790 don't take the action, move to the right-- 01:18:02.790 --> 01:18:04.213 that that is a bad action to take. 01:18:04.213 --> 01:18:06.880 That in the future, if you ever find yourself back in the state, 01:18:06.880 --> 01:18:09.430 don't take this action of going to the right when 01:18:09.430 --> 01:18:12.180 you're in this particular state, because that leads to punishment. 01:18:12.180 --> 01:18:13.763 That might be the intuition, at least. 01:18:13.763 --> 01:18:15.610 And so you could try doing other actions. 01:18:15.610 --> 01:18:16.300 You move up. 01:18:16.300 --> 01:18:18.508 All right, that didn't lead to any immediate rewards, 01:18:18.508 --> 01:18:21.830 maybe try something else, then maybe try something else. 01:18:21.830 --> 01:18:23.913 And now you found that you got another punishment. 
01:18:23.913 --> 01:18:25.913 And so you learn something from that experience. 01:18:25.913 --> 01:18:27.850 So the next time you do this whole process, 01:18:27.850 --> 01:18:30.010 you know that if you ever end up in this square, 01:18:30.010 --> 01:18:33.070 you shouldn't take the down action, because being in this state 01:18:33.070 --> 01:18:37.840 and taking that action ultimately leads to some sort of punishment, 01:18:37.840 --> 01:18:40.070 a negative reward, in other words. 01:18:40.070 --> 01:18:41.170 And this process repeats. 01:18:41.170 --> 01:18:44.230 You might imagine just letting our agent explore the world, 01:18:44.230 --> 01:18:48.220 learning over time which states and actions tend to 01:18:48.220 --> 01:18:50.980 correspond with poor outcomes, 01:18:50.980 --> 01:18:54.260 until eventually, if it tries enough things randomly, 01:18:54.260 --> 01:18:57.640 it might find that when you get to this state, 01:18:57.640 --> 01:19:01.240 if you take the up action in this state, it might find you 01:19:01.240 --> 01:19:03.160 actually get a reward from that. 01:19:03.160 --> 01:19:06.820 And what it can learn from that is that if you're in this state, 01:19:06.820 --> 01:19:09.595 you should take the up action, because that leads to a reward. 01:19:09.595 --> 01:19:12.220 And over time, you can also learn that if you're in this state, 01:19:12.220 --> 01:19:15.580 you should take the left action, because that leads to this state that also 01:19:15.580 --> 01:19:17.320 lets you eventually get to the reward. 01:19:17.320 --> 01:19:21.160 So you begin to learn over time, not only which actions 01:19:21.160 --> 01:19:25.390 are good in particular states, but also, which actions are bad, 01:19:25.390 --> 01:19:27.640 such that once you know some sequence of good actions 01:19:27.640 --> 01:19:30.910 that leads you to some sort of reward, our agent can just 01:19:30.910 --> 01:19:34.750 follow those instructions, follow the experience that it has learned. 01:19:34.750 --> 01:19:37.330 We didn't tell the agent what the goal was. 01:19:37.330 --> 01:19:39.910 We didn't tell the agent where the punishments were. 01:19:39.910 --> 01:19:42.760 But the agent can begin to learn from this experience 01:19:42.760 --> 01:19:47.850 and learn to begin to perform these sorts of tasks better in the future. 01:19:47.850 --> 01:19:51.370 And so let's now try to formalize this idea-- formalize the idea that we would 01:19:51.370 --> 01:19:54.520 like to be able to learn whether, in this state, taking this action 01:19:54.520 --> 01:19:56.257 is a good thing or a bad thing. 01:19:56.257 --> 01:19:58.840 There are lots of different models for reinforcement learning. 01:19:58.840 --> 01:20:00.757 We're just going to look at one of them today. 01:20:00.757 --> 01:20:04.330 And the one that we're going to look at is a method known as Q-learning. 01:20:04.330 --> 01:20:08.010 And what Q-learning is all about is learning a function, 01:20:08.010 --> 01:20:12.850 a function Q, that takes inputs s and a, where s is a state and a 01:20:12.850 --> 01:20:15.130 is an action that you take in that state. 01:20:15.130 --> 01:20:17.620 And what this Q function is going to do is 01:20:17.620 --> 01:20:21.520 it is going to estimate the value-- how much reward will I get 01:20:21.520 --> 01:20:25.960 from taking this action in this state.
01:20:25.960 --> 01:20:28.870 Originally, we don't know what this Q function should be, 01:20:28.870 --> 01:20:30.940 but over time, based on experience, based 01:20:30.940 --> 01:20:33.590 on trying things out and seeing what the result is, 01:20:33.590 --> 01:20:36.160 I would like to try and learn what Q of s, 01:20:36.160 --> 01:20:39.730 a is for any particular state and any particular action 01:20:39.730 --> 01:20:41.840 that I might take in that state. 01:20:41.840 --> 01:20:42.880 So what is the approach? 01:20:42.880 --> 01:20:44.890 Well, the approach originally is we'll start 01:20:44.890 --> 01:20:49.960 with Q s, a equal to 0 for all states s and for all actions a. 01:20:49.960 --> 01:20:52.270 That initially, before I've ever started anything, 01:20:52.270 --> 01:20:54.880 before I've had any experiences, I don't know 01:20:54.880 --> 01:20:57.820 the value of taking any action in any given state, 01:20:57.820 --> 01:21:02.390 so I'm going to assume that the value is 0 all across the board. 01:21:02.390 --> 01:21:06.790 But then as I interact with the world, as I experience rewards or punishments, 01:21:06.790 --> 01:21:10.450 or maybe I go to a cell where I don't get either a reward or a punishment, 01:21:10.450 --> 01:21:14.290 I want to somehow update my estimate of Q s, a. 01:21:14.290 --> 01:21:18.670 I want to continually update my estimate of Q s, a based on the experiences, 01:21:18.670 --> 01:21:20.730 and rewards, and punishments that I've received, 01:21:20.730 --> 01:21:24.240 such that in the future, my knowledge of what actions are good 01:21:24.240 --> 01:21:26.290 in what states will be better. 01:21:26.290 --> 01:21:29.280 So when we take an action and receive some sort of reward, 01:21:29.280 --> 01:21:32.750 I want to estimate the new value of Q s, a. 01:21:32.750 --> 01:21:35.390 And I estimate that based on a couple of different things. 01:21:35.390 --> 01:21:39.090 I estimate it based on the reward that I'm getting from taking this action 01:21:39.090 --> 01:21:40.830 and getting into the next state. 01:21:40.830 --> 01:21:44.130 But assuming the situation isn't over, assuming 01:21:44.130 --> 01:21:47.170 there are still future actions that I might take as well, 01:21:47.170 --> 01:21:51.670 I also need to take into account the expected future rewards. 01:21:51.670 --> 01:21:54.570 That if you imagine an agent interacting with the environment, 01:21:54.570 --> 01:21:57.000 sometimes, you'll take an action and get a reward, 01:21:57.000 --> 01:21:59.970 but then you can keep taking more actions and get more rewards. 01:21:59.970 --> 01:22:02.310 And these both are relevant-- both the current reward 01:22:02.310 --> 01:22:05.680 I'm getting from this current step, and also, my future reward. 01:22:05.680 --> 01:22:09.240 And it might be the case that I want to take a step that doesn't immediately 01:22:09.240 --> 01:22:12.180 lead to a reward, because later on down the line, 01:22:12.180 --> 01:22:14.650 I know it will lead to more rewards as well. 01:22:14.650 --> 01:22:17.550 So there's a balancing act between current rewards 01:22:17.550 --> 01:22:20.460 that the agent experiences and future rewards 01:22:20.460 --> 01:22:23.900 that the agent experiences as well. 01:22:23.900 --> 01:22:26.590 And then we need to update Q s, a. 01:22:26.590 --> 01:22:29.620 So we estimate the value of Q s, a based on the current reward 01:22:29.620 --> 01:22:31.390 and the expected future rewards.
01:22:31.390 --> 01:22:33.970 And then we need to update this Q function 01:22:33.970 --> 01:22:36.550 to take into account this new estimate. 01:22:36.550 --> 01:22:39.340 Now as we go through this process, we'll already 01:22:39.340 --> 01:22:42.160 have an estimate for what we think the value is. 01:22:42.160 --> 01:22:44.140 Now we have a new estimate and then somehow we 01:22:44.140 --> 01:22:46.570 need to combine these two estimates together. 01:22:46.570 --> 01:22:50.110 And we'll look at more formal ways that we can actually begin to do that. 01:22:50.110 --> 01:22:52.725 So to actually show you what this formula looks like, 01:22:52.725 --> 01:22:54.850 here's the approach we'll take with you Q-learning. 01:22:54.850 --> 01:22:59.500 We're going to again start with Q of s, a being equal to 0 for all states. 01:22:59.500 --> 01:23:06.850 And then every time we take an action a in state s and observe a reward r, 01:23:06.850 --> 01:23:11.230 we're going to update our value, our estimate for Q of s, a. 01:23:11.230 --> 01:23:13.780 And the idea is that we're going to figure out 01:23:13.780 --> 01:23:19.180 what the new value estimate is minus what our existing value estimate is. 01:23:19.180 --> 01:23:22.780 So we have some preconceived notion for what the value is 01:23:22.780 --> 01:23:24.500 for taking this action in this state. 01:23:24.500 --> 01:23:28.450 Maybe our expectation is we currently think the value is 10. 01:23:28.450 --> 01:23:31.540 But then we're going to estimate what we now think it's going to be. 01:23:31.540 --> 01:23:34.370 Maybe the new value estimate is something like 20. 01:23:34.370 --> 01:23:36.940 So there's a delta of, like, 10 that our new value 01:23:36.940 --> 01:23:40.390 estimate is 10 points higher than what our current value 01:23:40.390 --> 01:23:42.450 estimate happens to be. 01:23:42.450 --> 01:23:44.200 And so we have a couple of options here. 01:23:44.200 --> 01:23:49.360 We need to decide how much we want to adjust our current expectation of what 01:23:49.360 --> 01:23:52.720 the value is of taking this action in this particular state. 01:23:52.720 --> 01:23:56.620 And what that difference is-- how much we add or subtract 01:23:56.620 --> 01:23:59.800 from our existing notion of how much that we expect the value to be 01:23:59.800 --> 01:24:03.750 is dependent on this parameter alpha, also called the learning rate. 01:24:03.750 --> 01:24:08.710 And alpha represents in effect, how much we value new information compared 01:24:08.710 --> 01:24:11.770 to how much we value old information. 01:24:11.770 --> 01:24:15.500 And alpha value of 1 means we really value new information. 01:24:15.500 --> 01:24:17.590 That if we have a new estimate, then it doesn't 01:24:17.590 --> 01:24:19.240 matter what our old estimate is. 01:24:19.240 --> 01:24:21.910 We're only going to consider our new estimate, because we always 01:24:21.910 --> 01:24:25.430 just want to take into consideration our new information. 01:24:25.430 --> 01:24:29.230 So the way that works is that if you imagine alpha being 1, 01:24:29.230 --> 01:24:32.860 then we're taking the old value of Q s, a 01:24:32.860 --> 01:24:36.910 and then adding 1 times the new value minus the old value. 01:24:36.910 --> 01:24:38.990 And that just leaves us with the new value. 01:24:38.990 --> 01:24:42.010 So when alpha is 1, all we take into consideration 01:24:42.010 --> 01:24:44.650 is what our new estimate happens to be. 
01:24:44.650 --> 01:24:47.470 But over time, as we go through a lot of experiences, 01:24:47.470 --> 01:24:49.570 we already have some existing information. 01:24:49.570 --> 01:24:53.080 We might have tried taking this action nine times already, 01:24:53.080 --> 01:24:55.090 and now we just try to do a tenth time. 01:24:55.090 --> 01:24:58.370 And we don't only want to consider this 10th experience. 01:24:58.370 --> 01:25:02.020 I also want to consider the fact that my prior 9 experiences, those were 01:25:02.020 --> 01:25:02.710 meaningful, too. 01:25:02.710 --> 01:25:05.420 And that's data I don't necessarily want to lose them. 01:25:05.420 --> 01:25:08.620 And so this alpha controls that decision-- controls 01:25:08.620 --> 01:25:10.480 how important is the new information. 01:25:10.480 --> 01:25:13.570 0 would mean ignore all the new information, 01:25:13.570 --> 01:25:16.390 just keep this Q value the same. 01:25:16.390 --> 01:25:20.200 1 that means replace the old information entirely with the new information. 01:25:20.200 --> 01:25:24.772 And somewhere in between, keep some sort of balance between these two values. 01:25:24.772 --> 01:25:27.970 And we can put this equation a little bit more formally, as well. 01:25:27.970 --> 01:25:30.940 The old value estimate is our old estimate 01:25:30.940 --> 01:25:34.690 for what the value is of taking this action in a particular state. 01:25:34.690 --> 01:25:37.270 That's just Q of s, a. 01:25:37.270 --> 01:25:38.685 We have it once here. 01:25:38.685 --> 01:25:40.310 And we're going to add something to it. 01:25:40.310 --> 01:25:44.020 We're going to add alpha times the new value estimate minus the old value 01:25:44.020 --> 01:25:44.740 estimate. 01:25:44.740 --> 01:25:49.330 But the old value estimate, we just look up by calling this Q function. 01:25:49.330 --> 01:25:51.310 And what then is the new value estimate? 01:25:51.310 --> 01:25:53.530 Based on this experience we have just taken, 01:25:53.530 --> 01:25:55.870 what is our new estimate for the value of taking 01:25:55.870 --> 01:25:58.570 this action in this particular state? 01:25:58.570 --> 01:26:01.090 Well, it's going to be composed of two parts. 01:26:01.090 --> 01:26:04.540 It's going to be composed of what reward did I just get 01:26:04.540 --> 01:26:07.060 from taking this action in this state. 01:26:07.060 --> 01:26:10.390 And then it's going to be what can I expect my future rewards 01:26:10.390 --> 01:26:12.670 to be from this point forward. 01:26:12.670 --> 01:26:17.230 So it's going to be r, some reward I'm getting right now, 01:26:17.230 --> 01:26:21.443 plus whatever I estimate I'm going to get in the future. 01:26:21.443 --> 01:26:23.860 And how do I estimate what I'm going to get in the future? 01:26:23.860 --> 01:26:27.030 Well, it's a bit of another call to this Q function. 01:26:27.030 --> 01:26:31.360 It's going to be take the maximum across all possible actions I could 01:26:31.360 --> 01:26:35.800 take next and say, all right, of all of these possible actions I could take, 01:26:35.800 --> 01:26:38.560 which one is going to have the highest reward? 01:26:38.560 --> 01:26:40.840 So this then-- looks a little bit complicated-- 01:26:40.840 --> 01:26:44.710 is going to be our notion for how we're going to perform this kind of update. 
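Putting the update just described into code, a sketch might look like the following, where the Q values live in a dictionary keyed by (state, action) pairs; the dictionary representation and the default of 0 for unseen pairs are assumptions for illustration:

# Sketch of the Q-learning update:
# Q(s, a) <- Q(s, a) + alpha * ((r + max over a' of Q(s', a')) - Q(s, a))

Q = {}  # Q values start at 0 for every (state, action) pair not yet seen

def get_q(state, action):
    # Look up Q(s, a), defaulting to 0 for unseen pairs.
    return Q.get((state, action), 0)

def best_future_reward(state, available_actions):
    # Estimate of future rewards: the best Q value over actions in this state.
    if not available_actions:
        return 0  # terminal state, no future rewards
    return max(get_q(state, action) for action in available_actions)

def update(state, action, next_state, reward, next_actions, alpha=0.5):
    # Apply one Q-learning update after observing a single transition.
    old_estimate = get_q(state, action)
    new_estimate = reward + best_future_reward(next_state, next_actions)
    Q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)

With alpha equal to 1, this keeps only the new estimate; with alpha equal to 0, it ignores new information entirely, matching the discussion of the learning rate above.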
01:26:44.710 --> 01:26:48.070 I have some estimate, some old estimate, for what 01:26:48.070 --> 01:26:51.040 the value is of taking this action in the state, 01:26:51.040 --> 01:26:53.920 and I'm going to update it based on new information. 01:26:53.920 --> 01:26:56.200 That I experienced some reward, I predict 01:26:56.200 --> 01:26:58.240 what my future reward is going to be. 01:26:58.240 --> 01:27:01.510 And using that, I update what I estimate the reward 01:27:01.510 --> 01:27:04.720 will be for taking this action in this particular state. 01:27:04.720 --> 01:27:07.720 And there are other additions you might make to this algorithm, as well. 01:27:07.720 --> 01:27:10.270 Sometimes, it might not be the case that future rewards, 01:27:10.270 --> 01:27:12.820 you want to weight equally to current rewards. 01:27:12.820 --> 01:27:17.402 Maybe you want an agent that values like reward now over reward later. 01:27:17.402 --> 01:27:19.360 And so sometimes, you can even add another term 01:27:19.360 --> 01:27:23.050 in here or some other parameter, where you discount future rewards 01:27:23.050 --> 01:27:26.270 and say future rewards are not as valuable as rewards 01:27:26.270 --> 01:27:28.720 immediately-- that getting reward in the current time step 01:27:28.720 --> 01:27:31.542 is better than waiting a year and getting rewards later. 01:27:31.542 --> 01:27:33.250 But that's something up to the programmer 01:27:33.250 --> 01:27:36.290 to decide what that parameter ought to be. 01:27:36.290 --> 01:27:39.100 But the big picture idea of this entire formula 01:27:39.100 --> 01:27:42.670 is to say that every time we experience some new reward, 01:27:42.670 --> 01:27:43.900 we take that into account. 01:27:43.900 --> 01:27:47.830 We update our estimate of how good is this action. 01:27:47.830 --> 01:27:51.070 And then in the future, we can make decisions based on that algorithm. 01:27:51.070 --> 01:27:54.010 Once we have some good estimate for every state 01:27:54.010 --> 01:27:57.670 and for every action, what the value is of taking that action, 01:27:57.670 --> 01:28:01.960 then we can do something like implement a greedy decision making policy. 01:28:01.960 --> 01:28:05.200 That if I am in a state and I want to know what actions should 01:28:05.200 --> 01:28:09.880 I take in that state, then I consider for all of my possible actions, 01:28:09.880 --> 01:28:12.670 what is the value of Q s, a. 01:28:12.670 --> 01:28:16.150 What is my estimated value of taking that action in that state. 01:28:16.150 --> 01:28:18.850 And I will just pick the action that has the highest 01:28:18.850 --> 01:28:22.520 value after I evaluate that expression. 01:28:22.520 --> 01:28:24.730 So I pick the action that has the highest value. 01:28:24.730 --> 01:28:27.188 And based on that, that tells me what action I should take. 01:28:27.188 --> 01:28:30.880 At any given state that I'm in, I can just greedily say across 01:28:30.880 --> 01:28:35.050 all my actions, this action gives me the highest expected value, 01:28:35.050 --> 01:28:38.200 and so I'll go ahead and choose that action as the action 01:28:38.200 --> 01:28:40.430 that I take as well. 01:28:40.430 --> 01:28:43.270 But there is a downside to this kind of approach. 01:28:43.270 --> 01:28:45.880 And the downside comes up in a situation like this, 01:28:45.880 --> 01:28:51.170 where we know that there is some solution that gets me to the reward 01:28:51.170 --> 01:28:53.610 and our agent has been able to figure that out. 
01:28:53.610 --> 01:28:57.050 But it might not necessarily be the best way or the fastest way. 01:28:57.050 --> 01:28:59.850 If the agent is allowed to explore a little bit more, 01:28:59.850 --> 01:29:02.810 you might find that it can get the reward faster by taking 01:29:02.810 --> 01:29:06.830 some other route instead, by going through this particular path that 01:29:06.830 --> 01:29:11.360 is a faster way to get to that ultimate goal. 01:29:11.360 --> 01:29:14.730 And maybe we would like for the agent to be able to figure that out as well. 01:29:14.730 --> 01:29:16.940 But if the agent always takes the actions 01:29:16.940 --> 01:29:20.960 that it knows to be best, when it gets to this particular square, 01:29:20.960 --> 01:29:23.180 it doesn't know that this is a good action, 01:29:23.180 --> 01:29:24.770 because it's never really tried it. 01:29:24.770 --> 01:29:29.040 But it knows that going down eventually leads its way to this reward. 01:29:29.040 --> 01:29:32.270 So it might learn in the future that it should just always take this route, 01:29:32.270 --> 01:29:36.770 and it's never going to explore and go along that other route instead. 01:29:36.770 --> 01:29:39.170 So in reinforcement learning, there's this tension 01:29:39.170 --> 01:29:42.570 between exploration and exploitation. 01:29:42.570 --> 01:29:45.710 And exploitation generally refers to using knowledge 01:29:45.710 --> 01:29:47.360 that the AI already has. 01:29:47.360 --> 01:29:50.630 The AI already knows that this is a move that leads to reward, 01:29:50.630 --> 01:29:52.520 so it'll go ahead and use that move. 01:29:52.520 --> 01:29:56.390 And exploration is all about exploring other actions 01:29:56.390 --> 01:29:58.820 that we may not have explored as thoroughly before, 01:29:58.820 --> 01:30:01.970 because maybe one of these actions, even if I don't know anything about it, 01:30:01.970 --> 01:30:07.300 might lead to better rewards faster or more rewards in the future. 01:30:07.300 --> 01:30:10.160 And so an agent that only ever exploits information 01:30:10.160 --> 01:30:12.840 and never explores might be able to get reward, 01:30:12.840 --> 01:30:16.250 but it might not maximize its rewards, because it doesn't know what 01:30:16.250 --> 01:30:17.840 other possibilities are out there-- 01:30:17.840 --> 01:30:19.850 possibilities that it would only know about 01:30:19.850 --> 01:30:23.010 by taking advantage of exploration. 01:30:23.010 --> 01:30:24.710 And so how can we try and address this? 01:30:24.710 --> 01:30:28.550 Well, one possible solution is known as the epsilon-greedy algorithm, 01:30:28.550 --> 01:30:33.050 where we set epsilon equal to how often we want to just make a random move. 01:30:33.050 --> 01:30:36.750 Where occasionally, we will just make a random move in order to say, 01:30:36.750 --> 01:30:40.350 let's try to explore and see what happens. 01:30:40.350 --> 01:30:45.250 And then the logic of the algorithm will be, with probability 1 minus epsilon, 01:30:45.250 --> 01:30:47.830 choose the estimated best move. 01:30:47.830 --> 01:30:50.620 In a greedy case, we'd always choose the best move. 01:30:50.620 --> 01:30:55.360 But in epsilon-greedy, we're most of the time 01:30:55.360 --> 01:30:57.520 going to choose the best move, 01:30:57.520 --> 01:31:00.100 but sometimes, with probability epsilon, we're 01:31:00.100 --> 01:31:03.160 going to choose a random move instead.
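Here is a small sketch of how that epsilon-greedy choice might be written, assuming Q is a dictionary mapping (state, action) pairs to estimated values as in the earlier sketch:

import random

def choose_action(Q, state, available_actions, epsilon=0.1):
    # With probability epsilon, explore by picking a random action;
    # otherwise exploit by picking the action with the highest Q value.
    if random.random() < epsilon:
        return random.choice(available_actions)
    return max(available_actions, key=lambda action: Q.get((state, action), 0))

Setting epsilon to 0 recovers the purely greedy behavior, and decreasing epsilon over time is one common way to explore early and exploit later.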
01:31:03.160 --> 01:31:06.400 So every time we're faced with the ability to take an action, sometimes, 01:31:06.400 --> 01:31:07.930 we're going to choose the best move. 01:31:07.930 --> 01:31:10.900 Sometimes, we're just going to choose a random move. 01:31:10.900 --> 01:31:12.940 So this type of algorithm then can be quite 01:31:12.940 --> 01:31:16.720 powerful in a reinforcement learning context by not always just choosing 01:31:16.720 --> 01:31:20.620 the best possible move right now, but sometimes, especially early on, 01:31:20.620 --> 01:31:22.600 allowing yourself to make random moves that 01:31:22.600 --> 01:31:26.560 allow you to explore various different possible states and actions more. 01:31:26.560 --> 01:31:30.850 And maybe over time, you might decrease your value of epsilon, more and more 01:31:30.850 --> 01:31:32.720 often choosing the best mover after you are 01:31:32.720 --> 01:31:37.820 more confident that you've explored what all of the possibilities actually are. 01:31:37.820 --> 01:31:39.320 So we can put this into practice. 01:31:39.320 --> 01:31:41.830 And one very common application of reinforcement learning 01:31:41.830 --> 01:31:43.000 is in game playing. 01:31:43.000 --> 01:31:45.400 That if you want to teach an agent how to play a game, 01:31:45.400 --> 01:31:48.350 you just let the agent play the game a whole bunch. 01:31:48.350 --> 01:31:51.220 And then the reward signal happens at the end of the game. 01:31:51.220 --> 01:31:55.780 When the game is over, if our AI won the game, it gets a reward of like, 1, 01:31:55.780 --> 01:31:56.730 for example. 01:31:56.730 --> 01:32:00.100 And if it lost the game, it gets a reward of negative 1. 01:32:00.100 --> 01:32:03.108 And from that, it begins to learn what actions are good 01:32:03.108 --> 01:32:04.150 and what actions are bad. 01:32:04.150 --> 01:32:06.610 You don't have to tell the AI what's good and what's bad, 01:32:06.610 --> 01:32:08.920 but the AI figures it out based on that reward. 01:32:08.920 --> 01:32:10.450 Winning the game is some signal. 01:32:10.450 --> 01:32:12.040 Losing the game is some signal. 01:32:12.040 --> 01:32:14.320 And based on all of that, it begins to figure out 01:32:14.320 --> 01:32:16.985 what decisions it should actually make. 01:32:16.985 --> 01:32:19.360 So one very simple game, which you may have played before 01:32:19.360 --> 01:32:20.870 is a game called Nim. 01:32:20.870 --> 01:32:23.433 And in the game of Nim, you've got a whole bunch of objects 01:32:23.433 --> 01:32:25.600 in a whole bunch of different piles, where here I've 01:32:25.600 --> 01:32:27.940 represented each pile as an individual row. 01:32:27.940 --> 01:32:31.040 So you've got one object in the first pile, three in the second pile, five 01:32:31.040 --> 01:32:33.370 and the third pile, seven in the fourth pile. 01:32:33.370 --> 01:32:36.010 And the game of Nim is a two player game where players 01:32:36.010 --> 01:32:38.980 take turns removing objects from piles. 01:32:38.980 --> 01:32:41.230 And the rule is that on any given turn, you 01:32:41.230 --> 01:32:43.690 are allowed to remove as many objects as you 01:32:43.690 --> 01:32:47.162 want from any one of these piles, any one of these rows. 01:32:47.162 --> 01:32:49.120 You have to remove at least one object, but you 01:32:49.120 --> 01:32:53.890 can remove as many as you want from exactly one of the piles. 01:32:53.890 --> 01:32:57.720 And whoever takes the last object loses. 
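As a rough sketch of how these rules could be encoded (the representation here is an assumption for illustration, not necessarily the lecture's actual Nim implementation), the state can just be a list of pile sizes and an action a pair of (pile index, count):

# Sketch of the game state and legal moves in Nim.

initial_piles = [1, 3, 5, 7]  # how many objects remain in each pile

def available_actions(piles):
    # All legal moves: remove at least one object from exactly one pile.
    # Each action is a (pile_index, count) pair.
    return [(i, n) for i, pile in enumerate(piles) for n in range(1, pile + 1)]

def apply_action(piles, action):
    # Return the new pile sizes after removing count objects from pile i.
    i, count = action
    new_piles = piles.copy()
    new_piles[i] -= count
    return new_piles

def game_over(piles):
    # The game ends when no objects remain; whoever took the last object loses.
    return all(pile == 0 for pile in piles)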
01:32:57.720 --> 01:33:01.630 So player 1 might like remove four from this pile here. 01:33:01.630 --> 01:33:04.790 Player 2 might remove four from this pile here. 01:33:04.790 --> 01:33:08.140 So now we've got four piles left, one, three, one, and three Player 1 01:33:08.140 --> 01:33:11.020 might remove you know the entirety of the second pile. 01:33:11.020 --> 01:33:16.930 Player 2, if they're being strategic, might remove two from the third pile. 01:33:16.930 --> 01:33:20.140 Now we've got three piles left each with one object left. 01:33:20.140 --> 01:33:22.420 Player 1 might remove one from one pile. 01:33:22.420 --> 01:33:24.750 Player 2 removes one from the other pile. 01:33:24.750 --> 01:33:27.610 And now player 1 is left with choosing this one 01:33:27.610 --> 01:33:31.740 object from the last pile, at which point, player 1 loses the game. 01:33:31.740 --> 01:33:32.980 So fairly simple game. 01:33:32.980 --> 01:33:34.240 Piles of objects. 01:33:34.240 --> 01:33:37.300 Any turn, you choose how many objects to remove from the pile. 01:33:37.300 --> 01:33:40.330 Whoever removes the last object loses. 01:33:40.330 --> 01:33:42.730 And this is the type of game you can encode into an AI 01:33:42.730 --> 01:33:46.540 fairly easily, because the states are really just four numbers. 01:33:46.540 --> 01:33:50.120 Every state is just how many objects in each of the four piles. 01:33:50.120 --> 01:33:52.690 And the actions are things like how many am I 01:33:52.690 --> 01:33:56.110 going to remove from each one of these individual piles. 01:33:56.110 --> 01:33:58.090 And the reward happens at the end. 01:33:58.090 --> 01:34:01.000 That if you were the player that had to remove the last object, 01:34:01.000 --> 01:34:02.890 then you get some sort of punishment. 01:34:02.890 --> 01:34:06.430 But if you were not, and the other player had to remove the last object, 01:34:06.430 --> 01:34:08.830 well then you get some sort of reward. 01:34:08.830 --> 01:34:11.680 So we could actually try and show a demonstration of this-- 01:34:11.680 --> 01:34:15.570 that I have implemented an AI to play the game of Nim. 01:34:15.570 --> 01:34:17.320 All right, so here, what we're going to do 01:34:17.320 --> 01:34:22.210 is create an AI as a result of training the AI on some number of games 01:34:22.210 --> 01:34:24.160 that the AI is going to play against itself. 01:34:24.160 --> 01:34:27.060 Where the idea is the AI will play games against itself, 01:34:27.060 --> 01:34:30.800 learn from each of those experiences, and learn what to do in the future. 01:34:30.800 --> 01:34:33.340 And then I, the human, will play against the AI. 01:34:33.340 --> 01:34:35.380 So initially, we'll say train zero times, 01:34:35.380 --> 01:34:39.190 meaning we're not going to let the AI play any practice games against itself 01:34:39.190 --> 01:34:41.110 in order to learn from its experiences. 01:34:41.110 --> 01:34:43.750 We're just going to see how well it plays. 01:34:43.750 --> 01:34:45.460 And it looks like there are four piles. 01:34:45.460 --> 01:34:48.550 I can choose how many I remove from any one of the piles. 01:34:48.550 --> 01:34:53.960 So maybe from pile three, I will remove five objects, for example. 01:34:53.960 --> 01:34:57.340 So now AI chose to take one item from pile zero. 01:34:57.340 --> 01:35:00.170 So I'm left with these piles now, for example. 
01:35:00.170 --> 01:35:02.500 And so here, I could choose maybe to say I 01:35:02.500 --> 01:35:08.900 would like to remove them from pile two, all five of them, for example. 01:35:08.900 --> 01:35:11.290 And so AI chose to take two away from pile one. 01:35:11.290 --> 01:35:15.560 Now I'm left with one pile that has one object, one pile that has two objects. 01:35:15.560 --> 01:35:19.070 So from pile three, I will remove two objects. 01:35:19.070 --> 01:35:22.400 And now I've left the AI with no choice but to take that last one. 01:35:22.400 --> 01:35:24.648 And so the game is over and I was able to win. 01:35:24.648 --> 01:35:27.190 But I did so because the AI was really just playing randomly. 01:35:27.190 --> 01:35:30.130 It didn't have any prior experience that it was using in order 01:35:30.130 --> 01:35:32.410 to make these sorts of judgments. 01:35:32.410 --> 01:35:36.040 Now let the AI train itself on, like, 10,000 games. 01:35:36.040 --> 01:35:40.000 I'm going to let the AI play 10,000 games of Nim against itself. 01:35:40.000 --> 01:35:43.180 Every time it wins or loses, it's going to learn from that experience 01:35:43.180 --> 01:35:46.840 and learn in the future what to do and what not to do. 01:35:46.840 --> 01:35:49.300 So here then, I'll go ahead and run this again. 01:35:49.300 --> 01:35:52.300 And now you see the AI running through a whole bunch of training games-- 01:35:52.300 --> 01:35:54.730 10,000 training games against itself. 01:35:54.730 --> 01:35:57.530 And now it's going to let me make these sorts of decisions. 01:35:57.530 --> 01:35:59.620 So now I'm going to play against the AI. 01:35:59.620 --> 01:36:02.788 Maybe I'll remove one from pile 3. 01:36:02.788 --> 01:36:05.830 And the AI took everything from pile three, so I'm left with three piles. 01:36:05.830 --> 01:36:11.680 And I'll go ahead and from pile two, maybe remove three items. 01:36:11.680 --> 01:36:14.340 And the AI removes one item from pile zero. 01:36:14.340 --> 01:36:17.560 I'm left with two piles, each of which has two items in it. 01:36:17.560 --> 01:36:21.160 I'll remove one from pile one, I guess. 01:36:21.160 --> 01:36:24.610 And the AI took two from pile two, leaving me with no choice 01:36:24.610 --> 01:36:27.490 but to take one away from pile one. 01:36:27.490 --> 01:36:31.690 So it seems like after playing 10,000 games of Nim against itself, 01:36:31.690 --> 01:36:34.270 the AI has learned something about what states 01:36:34.270 --> 01:36:38.020 and what actions tend to be good, and has begun to learn some sort of pattern 01:36:38.020 --> 01:36:40.780 for how to predict what actions are going to be good 01:36:40.780 --> 01:36:44.290 and what actions are going to be bad in any given state. 01:36:44.290 --> 01:36:47.590 So reinforcement learning can be a very powerful technique for achieving 01:36:47.590 --> 01:36:49.150 these sorts of game playing Agents-- 01:36:49.150 --> 01:36:52.300 Agents that are able to play a game well just by learning 01:36:52.300 --> 01:36:54.970 from experience, whether that's playing against other people 01:36:54.970 --> 01:36:57.310 or by playing against itself and learning from those 01:36:57.310 --> 01:36:59.030 experiences, as well. 01:36:59.030 --> 01:37:01.900 Now Nim is a bit of an easy game to use reinforcement learning 01:37:01.900 --> 01:37:04.120 for because there are so few states. 
01:37:04.120 --> 01:37:07.120 There are only as many states as there are combinations of how many objects are 01:37:07.120 --> 01:37:09.190 in each of these various different piles. 01:37:09.190 --> 01:37:11.200 You might imagine that it's going to be harder 01:37:11.200 --> 01:37:14.990 if you think of a game like chess or games where there are many, 01:37:14.990 --> 01:37:18.430 many more states and many, many more actions that you can imagine taking, 01:37:18.430 --> 01:37:21.490 where it's not going to be as easy to learn for every state 01:37:21.490 --> 01:37:24.720 and for every action, what the value is going to be. 01:37:24.720 --> 01:37:27.130 So oftentimes in that case, we can't necessarily 01:37:27.130 --> 01:37:31.060 learn exactly what the value is for every state and for every action, 01:37:31.060 --> 01:37:32.257 but we can approximate it. 01:37:32.257 --> 01:37:34.090 So much as we saw with minimax, where we 01:37:34.090 --> 01:37:37.240 could use a depth-limiting approach to stop calculating 01:37:37.240 --> 01:37:39.910 at a certain point in time, we can do a similar type 01:37:39.910 --> 01:37:42.730 of approximation known as function approximation 01:37:42.730 --> 01:37:47.530 in a reinforcement learning context, where instead of learning a value of Q 01:37:47.530 --> 01:37:50.260 for every state and every action, we just 01:37:50.260 --> 01:37:53.260 have some function that estimates what the value is 01:37:53.260 --> 01:37:56.770 for taking this action in this particular state that might be based 01:37:56.770 --> 01:38:02.390 on various different features of the state that the agent happens to be in. 01:38:02.390 --> 01:38:04.480 Where you might have to choose what those features 01:38:04.480 --> 01:38:07.990 actually are, but you can begin to learn some patterns that 01:38:07.990 --> 01:38:11.763 generalize beyond one specific state and one specific action, 01:38:11.763 --> 01:38:13.930 that you can begin to learn if certain features tend 01:38:13.930 --> 01:38:15.580 to be good things or bad things. 01:38:15.580 --> 01:38:18.850 Reinforcement learning can allow you, using a very similar mechanism, 01:38:18.850 --> 01:38:20.800 to generalize beyond one particular state 01:38:20.800 --> 01:38:24.220 and say, if this other state looks kind of like this state, 01:38:24.220 --> 01:38:27.130 then maybe the similar types of actions that worked in one state 01:38:27.130 --> 01:38:30.150 will also work in another state as well. 01:38:30.150 --> 01:38:32.410 And so this type of approach can be quite helpful 01:38:32.410 --> 01:38:34.750 as you begin to deal with reinforcement learning that 01:38:34.750 --> 01:38:37.870 exists in larger and larger state spaces, where it's just not 01:38:37.870 --> 01:38:43.240 feasible to explore all of the possible states that could actually exist. 01:38:43.240 --> 01:38:46.540 So those then are two of the main categories of machine learning. 01:38:46.540 --> 01:38:49.110 Supervised learning, where you have labeled input and output 01:38:49.110 --> 01:38:52.920 pairs, and reinforcement learning, where an agent learns from rewards 01:38:52.920 --> 01:38:54.540 or punishments that it receives. 01:38:54.540 --> 01:38:56.700 The third major category of machine learning 01:38:56.700 --> 01:39:00.450 that we'll just touch on briefly is known as unsupervised learning. 01:39:00.450 --> 01:39:03.540 And unsupervised learning happens when we have data 01:39:03.540 --> 01:39:06.330 without any additional feedback, without labels.
01:39:06.330 --> 01:39:09.810 That in the supervised learning case, all of our data had labels. 01:39:09.810 --> 01:39:13.560 We labeled a data point with whether that was a rainy day or not rainy day. 01:39:13.560 --> 01:39:16.590 And using those labels, we were able to infer what the pattern was. 01:39:16.590 --> 01:39:20.220 Or we labeled data as a counterfeit banknote or not a counterfeit, 01:39:20.220 --> 01:39:23.760 and using those labels, we were able to draw inferences and patterns 01:39:23.760 --> 01:39:27.870 to figure out what does a banknote look like versus not. 01:39:27.870 --> 01:39:32.410 In unsupervised learning, we don't have any access to any of those labels, 01:39:32.410 --> 01:39:35.500 but we still would like to learn some of those patterns. 01:39:35.500 --> 01:39:39.000 And one of the tasks that you might want to perform in unsupervised learning 01:39:39.000 --> 01:39:42.600 is something like clustering, where clustering is just the task of, given 01:39:42.600 --> 01:39:47.310 some set of objects, organizing them into distinct clusters, groups of objects 01:39:47.310 --> 01:39:49.330 that are similar to one another. 01:39:49.330 --> 01:39:51.510 And there are lots of applications for clustering. 01:39:51.510 --> 01:39:53.340 It comes up in genetic research, where you 01:39:53.340 --> 01:39:55.740 might have a whole bunch of different genes, 01:39:55.740 --> 01:39:57.900 and you want to cluster them into similar genes 01:39:57.900 --> 01:40:01.530 if you're trying to analyze them across a population or across species. 01:40:01.530 --> 01:40:04.530 It comes up with images, if you want to take all the pixels of an image 01:40:04.530 --> 01:40:06.570 and cluster them into different parts of the image. 01:40:06.570 --> 01:40:09.060 It comes up a lot in market research, if you 01:40:09.060 --> 01:40:11.340 want to divide your consumers into different groups 01:40:11.340 --> 01:40:14.160 so you know which groups to target with certain types of product 01:40:14.160 --> 01:40:15.690 advertisements, for example. 01:40:15.690 --> 01:40:18.840 And there are a number of other contexts as well in which clustering 01:40:18.840 --> 01:40:20.310 can be very applicable. 01:40:20.310 --> 01:40:24.900 One technique for clustering is an algorithm known as k-means clustering. 01:40:24.900 --> 01:40:27.270 And what k-means clustering is going to do 01:40:27.270 --> 01:40:31.830 is divide all of our data points into k different clusters, 01:40:31.830 --> 01:40:35.700 and it's going to do so by repeating this process of assigning points 01:40:35.700 --> 01:40:39.720 to clusters, and then moving around those cluster centers. 01:40:39.720 --> 01:40:43.680 We're going to define a cluster by its center, the middle of the cluster, 01:40:43.680 --> 01:40:46.830 and then assign points to that cluster based on which 01:40:46.830 --> 01:40:49.430 center is closest to that point. 01:40:49.430 --> 01:40:51.660 And I'll show you an example of that now. 01:40:51.660 --> 01:40:55.020 Here, for example, I have a whole bunch of unlabeled data-- 01:40:55.020 --> 01:40:58.920 just various data points that are in some sort of graphical space. 01:40:58.920 --> 01:41:02.610 And I would like to group them into various different clusters. 01:41:02.610 --> 01:41:04.500 But I don't know how to do that originally. 01:41:04.500 --> 01:41:07.500 And let's say I want to assign three clusters to this group.
01:41:07.500 --> 01:41:10.500 And you have to choose how many clusters you want in k-means clustering, 01:41:10.500 --> 01:41:13.860 but you could try multiple values and see how well each of them performs. 01:41:13.860 --> 01:41:17.070 But I'll start just by randomly picking some places 01:41:17.070 --> 01:41:19.060 to put the centers of those clusters. 01:41:19.060 --> 01:41:22.735 Maybe I have a blue cluster, a red cluster, and a green cluster. 01:41:22.735 --> 01:41:25.110 And I'm going to start with the centers of those clusters 01:41:25.110 --> 01:41:27.660 just being in these three locations here. 01:41:27.660 --> 01:41:30.090 And what k-means clustering tells us to do 01:41:30.090 --> 01:41:32.790 is once I have the centers of the clusters, 01:41:32.790 --> 01:41:39.510 assign every point to a cluster based on which cluster center it is closest to. 01:41:39.510 --> 01:41:42.990 So we end up with something like this, where all of these points 01:41:42.990 --> 01:41:47.310 are closer to the blue cluster center than any other cluster center. 01:41:47.310 --> 01:41:50.970 All of these points here are closer to the green cluster 01:41:50.970 --> 01:41:52.860 center than any other cluster center. 01:41:52.860 --> 01:41:55.410 And then these two points plus these points over here, 01:41:55.410 --> 01:42:00.370 those are all closest to the red cluster center instead. 01:42:00.370 --> 01:42:04.050 So here then is one possible assignment of all these points 01:42:04.050 --> 01:42:05.910 to three different clusters. 01:42:05.910 --> 01:42:06.895 But it's not great. 01:42:06.895 --> 01:42:10.020 It seems like in this red cluster, these points are kind of far apart, 01:42:10.020 --> 01:42:12.870 and in this green cluster, these points are kind of far apart. 01:42:12.870 --> 01:42:15.810 It might not be my ideal choice of how I would cluster 01:42:15.810 --> 01:42:17.670 these various different data points. 01:42:17.670 --> 01:42:21.960 But k-means clustering is an iterative process, so after I do this, 01:42:21.960 --> 01:42:25.890 there is a next step, which is that after I've assigned all of the points 01:42:25.890 --> 01:42:28.380 to the cluster center that they are nearest to, 01:42:28.380 --> 01:42:31.350 we are going to recenter the clusters. 01:42:31.350 --> 01:42:33.420 Meaning, take the cluster centers, these diamond 01:42:33.420 --> 01:42:37.500 shapes here, and move them to the middle or the average, 01:42:37.500 --> 01:42:41.040 effectively, of all of the points that are in that cluster. 01:42:41.040 --> 01:42:43.170 So we'll take this blue point, this blue center, 01:42:43.170 --> 01:42:46.812 and go ahead and move it to the middle or to the center of all 01:42:46.812 --> 01:42:49.020 of the points that were assigned to the blue cluster, 01:42:49.020 --> 01:42:51.070 moving it slightly to the right in this case. 01:42:51.070 --> 01:42:52.570 And we'll do the same thing for red. 01:42:52.570 --> 01:42:56.910 We'll move the cluster center to the middle of all of these points, 01:42:56.910 --> 01:42:58.560 weighted by how many points there are. 01:42:58.560 --> 01:43:01.800 There are more points over here, so the red center 01:43:01.800 --> 01:43:03.810 ends up moving a little bit further that way. 01:43:03.810 --> 01:43:06.352 And likewise for the green center, there are many more points 01:43:06.352 --> 01:43:09.000 on this side of the green center, so the green center 01:43:09.000 --> 01:43:13.240 ends up being pulled a little bit further in this direction.
01:43:13.240 --> 01:43:15.808 So we recenter all of the clusters. 01:43:15.808 --> 01:43:17.100 And then we repeat the process. 01:43:17.100 --> 01:43:21.540 We go ahead and now reassign all of the points to the cluster center 01:43:21.540 --> 01:43:23.230 that they are now closest to. 01:43:23.230 --> 01:43:25.650 And now that we've moved around the cluster centers, 01:43:25.650 --> 01:43:27.690 these cluster assignments might change. 01:43:27.690 --> 01:43:30.900 This point originally was closer to the red cluster center, 01:43:30.900 --> 01:43:34.030 but now it's actually closer to the blue cluster center. 01:43:34.030 --> 01:43:35.580 Same goes for this point as well. 01:43:35.580 --> 01:43:38.700 And these three points that were originally closer to the green cluster 01:43:38.700 --> 01:43:43.670 center are now closer to the red cluster center instead. 01:43:43.670 --> 01:43:48.380 So we can reassign which colors, or which clusters, each of these data points 01:43:48.380 --> 01:43:49.620 belongs to. 01:43:49.620 --> 01:43:51.560 And then repeat the process again, moving 01:43:51.560 --> 01:43:54.590 each of these cluster means, the middle of the clusters, 01:43:54.590 --> 01:43:59.810 to the mean, the average, of all of the points that happen to be there. 01:43:59.810 --> 01:44:01.400 And repeat the process again. 01:44:01.400 --> 01:44:04.130 Go ahead and assign each of the points to the cluster 01:44:04.130 --> 01:44:05.510 that they are closest to. 01:44:05.510 --> 01:44:08.690 So once we reach a point where we've assigned all the points 01:44:08.690 --> 01:44:12.190 to the cluster that they are nearest to, and nothing has changed, 01:44:12.190 --> 01:44:15.650 we've reached a sort of equilibrium in this situation, where no points are 01:44:15.650 --> 01:44:17.030 changing their allegiance. 01:44:17.030 --> 01:44:19.850 And as a result, we can declare this algorithm is now over. 01:44:19.850 --> 01:44:22.910 And we now have some assignment of each of these points 01:44:22.910 --> 01:44:24.230 into three different clusters. 01:44:24.230 --> 01:44:26.090 And it looks like we did a pretty good job 01:44:26.090 --> 01:44:29.960 of trying to identify which points are more similar to one another 01:44:29.960 --> 01:44:31.700 than they are to points in other groups. 01:44:31.700 --> 01:44:34.970 So we have the green cluster down here, this blue cluster here, 01:44:34.970 --> 01:44:37.650 and then this red cluster over there as well. 01:44:37.650 --> 01:44:40.490 And we did so without any access to some labels 01:44:40.490 --> 01:44:43.040 to tell us what these various different clusters were. 01:44:43.040 --> 01:44:45.890 We just used an algorithm in an unsupervised sense, 01:44:45.890 --> 01:44:48.470 without any of those labels, to figure out which 01:44:48.470 --> 01:44:50.820 points belonged to which categories. 01:44:50.820 --> 01:44:54.663 And again, there are lots of applications for this type of clustering technique. 01:44:54.663 --> 01:44:57.830 And there are many more algorithms in each of these various different fields 01:44:57.830 --> 01:45:01.310 within machine learning-- supervised, and reinforcement, and unsupervised. 01:45:01.310 --> 01:45:04.490 But those are many of the big-picture foundational ideas that 01:45:04.490 --> 01:45:06.408 underlie a lot of these techniques-- 01:45:06.408 --> 01:45:08.700 that these are the problems that we're trying to solve. 01:45:08.700 --> 01:45:11.810 And we try and solve those problems using a number of different methods.
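To make the k-means procedure just described a bit more concrete, here is a minimal sketch, not code from this lecture, of how it might be implemented for two-dimensional data, assuming each point is just an (x, y) tuple and the starting centers are picked at random from the data:

```python
import random

def k_means(points, k, max_iterations=100):
    """Sketch of k-means clustering on a list of (x, y) points."""

    def distance(p, q):
        # Squared Euclidean distance is enough for deciding which center is closest
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def mean(cluster):
        # The average (x, y) position of all points in the cluster
        return (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))

    # Start by picking k of the points at random as the initial centers
    centers = random.sample(points, k)
    assignments = None

    for _ in range(max_iterations):
        # Assign every point to the center it is closest to
        new_assignments = [
            min(range(k), key=lambda i: distance(point, centers[i]))
            for point in points
        ]

        # If no point changed its cluster, we've reached equilibrium: stop
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Recenter each cluster at the average of the points assigned to it
        for i in range(k):
            cluster = [p for p, a in zip(points, assignments) if a == i]
            if cluster:
                centers[i] = mean(cluster)

    return assignments, centers
```

Calling something like k_means(data_points, k=3) would mirror the three-cluster example above. In practice, you might rerun this with different random starting centers, or with different values of k, since the starting positions can affect which equilibrium the algorithm settles into.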
01:45:11.810 --> 01:45:14.733 We try to take data and learn patterns in that data, 01:45:14.733 --> 01:45:17.150 whether that's trying to find neighboring data points that 01:45:17.150 --> 01:45:19.790 are similar or trying to minimize some sort of loss 01:45:19.790 --> 01:45:22.460 function, or any number of other techniques 01:45:22.460 --> 01:45:26.420 that allow us to begin to try to solve these sorts of problems. 01:45:26.420 --> 01:45:28.430 That then was a look at some of the principles 01:45:28.430 --> 01:45:30.410 that are at the foundation of modern machine 01:45:30.410 --> 01:45:33.930 learning-- this ability to take data and learn from that data 01:45:33.930 --> 01:45:37.160 so that the computer can perform a task, even if it hasn't explicitly been 01:45:37.160 --> 01:45:39.508 given instructions in order to do so. 01:45:39.508 --> 01:45:41.300 Next time, we'll continue this conversation 01:45:41.300 --> 01:45:43.842 about machine learning, looking at other techniques we can use 01:45:43.842 --> 01:45:45.780 for solving these sorts of problems. 01:45:45.780 --> 01:45:47.500 We'll see you then.