[MUSIC PLAYING]

BRIAN YU: All right, welcome back, everyone, to An Introduction to Artificial Intelligence with Python. So far in this class, we've used AI to solve a number of different problems: giving the AI instructions for how to search for a solution, or how to satisfy certain constraints, in order to find its way from some input point to some output point and solve some sort of problem. Today, we're going to turn to the world of learning, in particular the idea of machine learning, which generally refers to the idea that we are not going to give the computer explicit instructions for how to perform a task, but rather, we are going to give the computer access to information in the form of data or patterns that it can learn from, and let the computer try to figure out what those patterns are, try to understand that data, so that it can perform a task on its own.

Now, machine learning comes in a number of different forms, and it's a very wide field. So today, we'll explore some of the foundational algorithms and ideas that are behind a lot of the different areas within machine learning. And one of the most popular is the idea of supervised machine learning, or just supervised learning. Supervised learning is a particular type of task. It refers to the task where we give the computer access to a data set, where that data set consists of input-output pairs. And what we would like the computer to do is figure out some function that maps inputs to outputs.

So we have a whole bunch of data that generally consists of some kind of input: some evidence, some information that the computer will have access to. And we would like the computer, based on that input information, to predict what some output is going to be. We'll give it some data so that the computer can train its model on that data, begin to understand how this information works, and how the inputs and outputs relate to each other. Ultimately, we hope that our computer will be able to figure out some function that, given those inputs, is able to produce those outputs.
There are a couple of different tasks within supervised learning. The one we'll focus on and start with is known as classification. Classification is the problem where, if I give you a whole bunch of inputs, you need to figure out some way to map those inputs into discrete categories, where you can decide what those categories are, and it's the job of the computer to predict what those categories are going to be. That might be, for example, that I give you information about a banknote, like a US dollar, and I'm asking you to predict for me: does it belong to the category of authentic banknotes, or does it belong to the category of counterfeit banknotes? You need to categorize the input, and we want to train the computer to figure out some function to be able to do that calculation.

Another example might be the case of weather, something we've talked about a little bit so far in this class, where we would like to predict, on a given day, is it going to rain on that day, is it going to be cloudy on that day. Before, we've seen how we could do this if we give the computer all the exact probabilities: if these are the conditions, what's the probability of rain. Oftentimes, we don't have access to that information, though. But what we do have access to is a whole bunch of data. So if we wanted to be able to predict something like is it going to rain or is it not going to rain, we would give the computer historical information about days when it was raining and days when it was not raining, and ask the computer to look for patterns in that data.

So what might that data look like? Well, we could structure that data in a table like this, where for any particular day going back, we have information like that day's humidity and that day's air pressure. And then, importantly, we have a label, something where a human has said that on this particular day it was raining or it was not raining. You could fill in this table with a whole bunch of data. And what makes this what we would call a supervised learning exercise is that a human has gone in and labeled each of these data points.
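To make that table concrete, here's a minimal sketch of how such a labeled data set might be represented in Python. The specific numbers and field names are made up for illustration; they just mirror the kind of humidity/pressure/label table described above.

```python
# Each training example pairs an input (humidity, pressure) with a
# human-provided label ("rain" or "no rain"). All values are illustrative.
data = [
    {"humidity": 0.93, "pressure": 999.7, "label": "rain"},
    {"humidity": 0.49, "pressure": 1015.5, "label": "no rain"},
    {"humidity": 0.79, "pressure": 1007.1, "label": "rain"},
    {"humidity": 0.56, "pressure": 1021.8, "label": "no rain"},
]

# Separate the evidence (inputs) from the labels (outputs).
inputs = [(row["humidity"], row["pressure"]) for row in data]
labels = [row["label"] for row in data]
```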
The human has said that on this day, when these were the values for the humidity and pressure, that day was a rainy day, and this day was a not rainy day. And what we would like the computer to be able to do, then, is to figure out, given these inputs, given the humidity and the pressure, can the computer predict what label should be associated with that day? Does that day look more like it's going to be a day that rains, or does it look more like a day when it's not going to rain?

Put a little bit more mathematically, you can think of this as a function that takes two inputs, the inputs being the data points that our computer will have access to, things like humidity and pressure. So we could write a function, f, that takes as input both humidity and pressure. And then the output is going to be what category we would ascribe to these particular input points, what label we would associate with that input. We've seen a couple of example data points here, where given this value for humidity and this value for pressure, we predict is it going to rain or is it not going to rain. And that's information that we just gathered from the world. We measured on various different days what the humidity and pressure were, and we observed whether or not we saw rain on that particular day.

And this function, f, is what we would like to approximate. Now, the computer, and we humans, don't really know exactly how this function f works; it's probably quite a complex function. So what we're going to do instead is attempt to estimate it. We would like to come up with a hypothesis function, h, which is going to try to approximate what f does. We want to come up with some function h that will take the same inputs and will also produce an output, rain or no rain. And ideally, we'd like these two functions to agree as much as possible. So the goal of these supervised learning classification tasks is going to be to figure out what that function h looks like: how can we begin to estimate, given all of this information, all of this data, what category or what label should be assigned to a particular data point? So where can you begin doing this?
Well, a reasonable thing to do, especially in this situation where I have two numerical values, is to try to plot this on a graph that has two axes, an x-axis and a y-axis. In this case, we're just going to be using two numerical values as input, but these same types of ideas scale as you add more and more inputs as well. We'll be plotting things in two dimensions, but as we'll soon see, you could add more inputs and just imagine things in multiple dimensions. And while we humans have trouble conceptualizing anything really beyond three dimensions, at least visually, a computer has no problem trying to imagine things in many, many more dimensions. For a computer, each dimension is just some separate number that it's keeping track of. So it wouldn't be unreasonable for a computer to think in 10 dimensions or 100 dimensions to be able to try to solve a problem.

But for now, we've got two inputs, so we'll graph things along two axes: an x-axis, which will here represent humidity, and a y-axis, which here represents pressure. And what we might do is say, let's take all of the days that were raining, plot them on this graph, and see where they fall. Here might be all of the rainy days, where each rainy day is one of these blue dots that corresponds to a particular value for humidity and a particular value for pressure. And then I might do the same thing with the days that were not raining. So I take all the not rainy days, figure out what their values were for each of these two inputs, and go ahead and plot them on this graph as well. I've here plotted them in red. So blue here stands for a rainy day, red here stands for a not rainy day. And this, then, is the input: my computer has access to all of this input.
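Here's a quick sketch of that plotting step, assuming matplotlib is available; the (humidity, pressure) values are placeholders standing in for the real historical observations.

```python
import matplotlib.pyplot as plt

# Hypothetical observations: blue dots for rainy days, red dots for not rainy days.
rainy = [(0.93, 999.7), (0.88, 1002.3), (0.79, 1007.1)]
not_rainy = [(0.49, 1015.5), (0.56, 1021.8), (0.61, 1013.2)]

plt.scatter([h for h, p in rainy], [p for h, p in rainy], color="blue", label="rain")
plt.scatter([h for h, p in not_rainy], [p for h, p in not_rainy], color="red", label="no rain")
plt.xlabel("humidity")
plt.ylabel("pressure")
plt.legend()
plt.show()
```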
And what I would like the computer to be able to do is to train a model such that if I'm ever presented with a new input that doesn't have a label associated with it, something like this white dot here, I would like to predict, given those values for each of the two inputs, should we classify it as a blue dot, a rainy day, or should we classify it as a red dot, a not rainy day? And if you're just looking at this picture graphically, trying to say, all right, this white dot, does it look like it belongs to the blue category or does it look like it belongs to the red category, I think most people would agree that it probably belongs to the blue category. And why is that? Well, it looks like it's close to other blue dots. That's not a very formal notion, but it's the notion that we'll formalize in just a moment: because it seems to be close to this blue dot here, and nothing else is closer to it, we might say that it should be categorized as blue. It should fall into that category of, I think that day is going to be a rainy day based on that input. It might not be totally accurate, but it's a pretty good guess.

And this type of algorithm is actually a very popular and common machine learning algorithm known as nearest neighbor classification. It's an algorithm for solving these classification type problems. In nearest neighbor classification, given an input, we choose the class of the nearest data point to that input. By class, we just mean category, like rain or no rain, counterfeit or not counterfeit. And we choose the category, or the class, based on the nearest data point. So given all that data we just looked at, is the nearest data point a blue point or is it a red point? And depending on the answer to that question, we're able to make some sort of judgment. We're able to say something like, we think it's going to be blue, or we think it's going to be red. Likewise, we could apply this to other data points that we encounter as well. If suddenly this data point comes about, well, its nearest data point is red, so we would go ahead and classify this as a red point, not raining.
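Here's a minimal from-scratch sketch of that nearest neighbor idea, reusing the `inputs` and `labels` lists from the earlier sketch; using Euclidean distance is an assumption, since any reasonable distance measure would do.

```python
import math

def distance(a, b):
    # Euclidean distance between two (humidity, pressure) points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(point, inputs, labels):
    # Return the label of the single closest training example to `point`.
    closest = min(range(len(inputs)), key=lambda i: distance(point, inputs[i]))
    return labels[closest]

# For example: nearest_neighbor((0.85, 1003.0), inputs, labels)
```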
Things get a little bit trickier, though, when you look at a point like this white point over here and you ask the same sort of question: should it belong to the category of blue points, the rainy days, or should it belong to the category of red points, the not rainy days? Now, nearest neighbor classification would say the way you solve this problem is to look at whichever point it is nearest to. You look at this nearest point and say it's red, it's a not rainy day. And therefore, according to nearest neighbor classification, I would say that this unlabeled point should also be red. It should also be classified as a not rainy day.

Your intuition might say that that's a reasonable judgment to make: the closest thing is a not rainy day, so we may as well guess that it's a not rainy day. But it's probably also reasonable to look at the bigger picture of things and to say, yes, it is true that the nearest point to it was a red point, but it's surrounded by a whole bunch of other blue points. So looking at the bigger picture, there is potentially an argument to be made that this point should actually be blue. And with only this data, we actually don't know for sure. We are given some inputs, something we're trying to predict, and we don't necessarily know what the output is going to be. So in this case, which one is correct is difficult to say. But oftentimes, considering more than just a single neighbor, considering multiple neighbors, can give us a better result.

And so there's a variant on the nearest neighbor classification algorithm that is known as the k-nearest-neighbor classification algorithm, where k is some parameter, some number that we choose, for how many neighbors we are going to look at. So one-nearest-neighbor classification is what we saw before: just pick the one nearest neighbor and use that category. But k-nearest-neighbor classification, where k might be three or five or seven, meaning look at the three or five or seven closest neighbors, the closest data points to that point, works a little bit differently. In this algorithm, we're given an input, and we choose the most common class out of the k nearest data points to that input.
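And here's a sketch of the k-nearest-neighbor variant, reusing the `distance` function from the previous sketch; picking whichever class `Counter.most_common` lists first is just a simplifying way of breaking ties.

```python
from collections import Counter

def k_nearest_neighbors(point, inputs, labels, k=3):
    # Sort training examples by distance to `point` and keep the k closest.
    neighbors = sorted(range(len(inputs)), key=lambda i: distance(point, inputs[i]))[:k]
    # Each neighbor gets one vote; return the most common class among them.
    votes = Counter(labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]
```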
So if we look at the five nearest points, and three of them say it's raining and two of them say it's not raining, we'll go with the three instead of the two, because each one effectively gets one vote towards what they believe the category ought to be. And ultimately, you choose the category that has the most votes as a consequence of that.

So k-nearest-neighbor classification is a fairly straightforward one to understand intuitively: you just look at the neighbors and figure out what the answer might be. And it turns out this can work very, very well for solving a whole variety of different types of classification problems. But not every model is going to work under every situation. And so one of the things we'll take a look at today, especially in the context of supervised machine learning, is that there are a number of different approaches to machine learning, a number of different algorithms that we can apply, all solving the same type of problem, all solving some kind of classification problem where we want to take inputs and organize them into different categories. And no one algorithm is necessarily always going to be better than some other algorithm. They each have their trade-offs. And maybe, depending on the data, one type of algorithm is going to be better suited to trying to model that information than some other algorithm. So this is what a lot of machine learning research ends up being about: when you're trying to apply machine learning techniques, you're often looking not just at one particular algorithm, but trying multiple different algorithms, trying to see what is going to give you the best results for trying to predict some function that maps inputs to outputs.

So what, then, are the drawbacks of k-nearest-neighbor classification? Well, there are a couple. One might be that, in a naive approach at least, it could be fairly slow to have to go through and measure the distance between a point and every single one of these points that exist here. Now, there are ways of trying to get around that.
There are data structures that can make it quicker to find these neighbors. There are also techniques you can use to try and prune some of this data, remove some of the data points, so that you're only left with the relevant data points, just to make it a little bit easier. But ultimately, what we might like to do is come up with another way of trying to do this classification. One way of trying to do the classification was looking at the neighboring points. But another way might be to look at all of the data and see if we can come up with some decision boundary, some boundary that will separate the rainy days from the not rainy days. In the case of two dimensions, we can do that by drawing a line, for example. So what we might want to try to do is just find some line, find some separator, that divides the rainy days, the blue points over here, from the not rainy days, the red points over there.

We're now trying a different approach, in contrast with the nearest neighbor approach, which just looked at local data around the input data point that we cared about. Now what we're doing is trying to use a technique known as linear regression to find some sort of line that will separate the two halves from each other. Now, sometimes it will actually be possible to come up with some line that perfectly separates all the rainy days from the not rainy days. Realistically, though, this is probably cleaner than many data sets will actually be. Oftentimes, data is messy. There are outliers. There's random noise that happens inside of a particular system. And what we'd like to do is still be able to figure out what a line might look like. So in practice, the data will not always be linearly separable, where linearly separable refers to some data set where I can draw a line to separate the two halves of it perfectly. Instead, you might have a situation like this, where there are some rainy points that are on this side of the line and some not raining points that are on that side of the line.
And there may not be a line that perfectly separates one half of the inputs from the other half, that perfectly separates all the rainy days from the not rainy days. But we can still say that this line does a pretty good job. And we'll try to formalize a little bit later what we mean when we say something like this line does a pretty good job of trying to make that prediction. But for now, let's just say we're looking for a line that does as good a job as we can at trying to separate one category of things from another category of things.

So let's now try to formalize this a little bit more mathematically. We want to come up with some sort of function, some way we can define this line. And our inputs are things like humidity and pressure in this case. So our inputs, which we might call x1 and x2, are going to represent humidity and pressure. These are inputs that we are going to provide to our machine learning algorithm. And given those inputs, we would like for our model to be able to predict some sort of output. And we're going to predict that using our hypothesis function, which we called h. Our hypothesis function is going to take as input x1 and x2, humidity and pressure in this case. And you can imagine, if we didn't just have two inputs, if we had three or four or five inputs or more, we could have this hypothesis function take all of those as input. We'll see examples of that a little bit later as well.

And now the question is, what does this hypothesis function do? Well, it really just needs to measure: is this data point on one side of the boundary, or is it on the other side of the boundary? And how do we formalize that boundary? Well, the boundary is generally going to be a linear combination of these input variables, at least in this particular case. What we're trying to do when we say linear combination is take each of these inputs and multiply them by some number that we're going to have to figure out. We'll generally call that number a weight, for how important these variables should be in trying to determine the answer. So we weight each of these variables with some weight.
And we might add a constant to it, just to shift the function a little bit. And the result we just need to compare: is it greater than 0 or is it less than 0, to say does it belong on one side of the line or the other side of the line?

So what that mathematical expression might look like is this. We would take each of our variables, x1 and x2, and multiply them by some weight. I don't yet know what that weight is, but it's going to be some number, weight 1 and weight 2. And maybe we just want to add some other weight 0 to it, because the function might require us to shift the entire value up or down by a certain amount. And then we just compare: if we do all this math, is it greater than or equal to 0? If so, we might categorize that data point as a rainy day. And otherwise, we might say no rain.

So the key here, then, is that this expression is how we are going to calculate whether it's a rainy day or not. We're going to do a bunch of math where we take each of the variables, multiply them by a weight, maybe add an extra weight to it, and see if the result is greater than or equal to 0. And using the result of that expression, we're able to determine whether it's raining or not raining. This expression here is, in this case, going to refer to just some line. If you were to plot it graphically, it would just be some line. And what the line actually looks like depends upon these weights. x1 and x2 are the inputs, but these weights are really what determine the shape of that line, the slope of that line, and what that line actually looks like.

So we then would like to figure out what these weights should be. We can choose whatever weights we want, but we want to choose weights in such a way that if you pass in a rainy day's humidity and pressure, then you end up with a result that is greater than or equal to 0. And we would like it such that if we passed into our hypothesis function a not rainy day's inputs, then the output that we get should be not raining.
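Written as a sketch in code, that decision rule might look like the following; the weight values are exactly what a learning algorithm would have to discover, so here they're just parameters passed in.

```python
def classify(humidity, pressure, w0, w1, w2):
    # Linear combination of the inputs plus a constant (bias) term.
    value = w0 + w1 * humidity + w2 * pressure
    # Which side of the boundary are we on?
    return "rain" if value >= 0 else "no rain"
```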
So before we get there, let's try to formalize this a little bit more mathematically, just to get a sense for how you'll often see this if you ever go further into supervised machine learning and explore this idea. One thing is that, generally, for these categories, we'll sometimes just use the names of the categories, like rain and not rain. Often, mathematically, if we're trying to do comparisons between these things, it's easier just to deal in the world of numbers. So we could just say 1 and 0: 1 for raining, 0 for not raining. So we do all this math, and if the result is greater than or equal to 0, we'll go ahead and say our hypothesis function outputs 1, meaning raining. And otherwise, it outputs 0, meaning not raining.

And oftentimes, this type of expression will instead be expressed using vector mathematics. All a vector is, if you're not familiar with the term, is a sequence of numerical values. You could represent that in Python using a list of numerical values, or a tuple of numerical values. And here, we have a couple of sequences of numerical values. One of our vectors, one of our sequences of numerical values, is all of these individual weights: w0, w1, and w2. So we could construct what we'll call a weight vector, and we'll see why this is useful in a moment, called w, generally represented using a boldface w, that is just a sequence of these three weights: weight 0, weight 1, and weight 2.

And to be able to calculate, based on those weights, whether we think a day is raining or not raining, we're going to multiply each of those weights by one of our input variables. That w2, this weight, is going to be multiplied by input variable x2. w1 is going to be multiplied by input variable x1. And w0, well, it's not being multiplied by anything, but to make sure the vectors are the same length, and we'll see why that's useful in just a second, we'll just go ahead and say w0 is being multiplied by 1. Because you can multiply something by 1 and you end up getting the exact same number.
So in addition to the weight vector, w, we'll also have an input vector that we'll call x that has three values: 1, again, because we're just multiplying w0 by 1 eventually, and then x1 and x2. So here, then, we've represented two distinct vectors. There's a vector of weights that we need to somehow learn. The goal of our machine learning algorithm is to learn what this weight vector is supposed to be. We could choose any arbitrary set of numbers and it would produce a function that tries to predict rain or not rain, but it probably wouldn't be very good. What we want to do is come up with a good choice of these weights so that we're able to make accurate predictions. And then this input vector represents a particular input to the function, a data point for which we would like to estimate, is that day a rainy day or is that day not a rainy day? And that's going to vary just depending on what input is provided to our function, what it is that we are trying to estimate.

And then, to do the calculation, we want to calculate this expression here. And it turns out that expression is what we would call the dot product of these two vectors. The dot product of two vectors just means taking each of the terms in the vectors and multiplying them together: w0 multiplied by 1, w1 multiplied by x1, w2 multiplied by x2. And that's why these vectors need to be the same length. And then we just add all of the results together. So the dot product of w and x, our weight vector and our input vector, is just going to be w0 times 1, or just w0, plus w1 times x1, multiplying those two terms together, plus w2 times x2, multiplying those terms together.

So we have our weight vector, which we need to figure out; we need our machine learning algorithm to figure out what the weights should be. We have the input vector, representing the data point that we're trying to predict a category for, predict a label for. And we're able to do that calculation by taking this dot product, which you'll often see represented in vector form. But if you haven't seen vectors before, you can think of it as identical to just this mathematical expression: doing the multiplication, then adding the results together.
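As a sketch, that dot product is only a couple of lines of Python; representing the vectors as plain lists (rather than, say, NumPy arrays) is just a choice of convenience here, and the weight values are placeholders.

```python
def dot(w, x):
    # Dot product of two equal-length vectors: multiply term by term, then sum.
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.5, 2.0, -0.003]   # weight vector (w0, w1, w2); placeholder values
x = [1, 0.93, 999.7]     # input vector (1, humidity, pressure)
prediction = 1 if dot(w, x) >= 0 else 0   # 1 means rain, 0 means no rain
```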
And then we see whether the result is greater than or equal to 0 or not. This expression here is identical to the expression that we're calculating to see whether or not that answer is greater than or equal to 0 in this case. And so, for that reason, you'll often see the hypothesis function written as something like this: a simpler representation where the hypothesis takes as input some input vector x, some humidity and pressure for some day, and we want to predict an output like rain or no rain, or 1 or 0 if we choose to represent things numerically. And the way we do that is by taking the dot product of the weights and our input. If it's greater than or equal to 0, we'll go ahead and say the output is 1. Otherwise, the output is going to be 0.

And this hypothesis, we say, is parameterized by the weights. Depending on what weights we choose, we'll end up getting a different hypothesis. If we choose the weights randomly, we're probably not going to get a very good hypothesis function. We'll get a 1 or a 0, but it's probably not going to accurately reflect whether we think a day is going to be rainy or not rainy. But if we choose the weights right, we can often do a pretty good job of trying to estimate whether the output of the function should be a 1 or a 0.

And so the question, then, is how to figure out what these weights should be, how to be able to tune those parameters. There are a number of ways you can do that. One of the most common is known as the perceptron learning rule. We'll see more of this later, but the idea of the perceptron learning rule, and we're not going to get too deep into the mathematics, we'll mostly just introduce it more conceptually, is to say that given some data point that we would like to learn from, some data point that has an input x and an output y, where y is 1 for rain or 0 for not rain, we're going to update the weights. We'll look at the formula in just a moment, but the big-picture idea is that we can start with random weights and then learn from the data: take the data points one at a time.
And for each one of the data points, figure out, all right, what parameters do we need to change inside of the weights in order to better match that input point? And that is the value of having access to a lot of data in a supervised machine learning algorithm: you take each of the data points, maybe look at them multiple times, and constantly try to figure out whether you need to shift your weights in order to create some weight vector that is able to correctly, or more accurately, estimate what the output should be, whether we think it's going to be raining or whether we think it's not going to be raining.

So what does that weight update look like? Without going into too much of the mathematics, we're going to update each of the weights to be the result of the original weight plus some additional expression. And to understand this expression: y is what the actual output is, and hypothesis of x, the input, is what we thought the output was. So I can replace this by saying: what the actual value was, minus what our estimate was. And based on the difference between the actual value and what our estimate was, we might want to change our hypothesis, change the way that we do that estimation.

If the actual value and the estimate were the same thing, meaning we were correctly able to predict what category this data point belonged to, well, then actual value minus estimate is just going to be 0, which means this whole term on the right-hand side goes to 0, and the weight doesn't change. Weight i, where i is weight 1 or weight 2 or weight 0, just stays at weight i. And none of the weights change if we were able to correctly predict what category the input belonged to. But if our hypothesis didn't correctly predict what category the input belonged to, then maybe we need to make some changes, adjust the weights, so that we're better able to predict this kind of data point in the future. And what is the way we might do that?
Well, if the actual value was bigger than the estimate, then, and for now, we'll go ahead and assume that these inputs are positive values, we need to increase the weight in order to make the output bigger, and therefore we're more likely to get to the right actual value. And so if the actual value is bigger than the estimate, then actual value minus estimate will be a positive number, and so you can imagine we're just adding some positive number to the weight, to increase it ever so slightly. And likewise, the inverse case is true: if the actual value was less than the estimate, if the actual value was 0 but we estimated 1, meaning it actually was not raining but we predicted it was going to be raining, then we want to decrease the value of the weight, because in that case, we want to try and lower the total value of that dot product in order to make it less likely that we would predict that it would actually be raining.

So no need to get too deep into the mathematics of that, but the general idea is that every time we encounter some data point, we can adjust these weights accordingly to try and make the weights better line up with the actual data that we have access to. And you can repeat this process with data point after data point until, eventually, hopefully, your algorithm converges to some set of weights that do a pretty good job of trying to figure out whether a day is going to be rainy or not rainy.

And just as a final point about this particular equation, this value alpha here is generally what we'll call the learning rate. It's just some parameter, some number we choose, for how quickly we're actually going to be updating these weight values. If alpha is bigger, then we're going to update these weight values by a lot. And if alpha is smaller, then we'll update the weight values by less. You can choose the value of alpha depending on the problem; different values might suit the situation better or worse than others.
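Here's a rough sketch of that update, reusing the `dot` helper from the earlier sketch. The standard form of the perceptron rule scales the (actual minus estimate) difference by the corresponding input value x_i, which is what this sketch does, and the learning rate value is just an illustrative choice.

```python
def hypothesis(w, x):
    # Hard-threshold hypothesis: 1 (rain) if the dot product is non-negative.
    return 1 if dot(w, x) >= 0 else 0

def perceptron_update(w, x, y, alpha=0.1):
    # w_i <- w_i + alpha * (actual - estimate) * x_i for every weight.
    estimate = hypothesis(w, x)
    return [wi + alpha * (y - estimate) * xi for wi, xi in zip(w, x)]
```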
So after all of that, after we've done this training process, where we take all this data and, using this learning rule, look at all the pieces of data, and use each piece of data as an indication to us of whether the weights stay the same, whether we increase the weights, or whether we decrease the weights, and if so, by how much, what you end up with is effectively a threshold function. And we can look at what the threshold function looks like. On the x-axis here, we have the result of that computation: taking the weights and computing their dot product with the input. And on the y-axis, we have what the output is going to be: 0, which in this case represents not raining, and 1, which in this case represents raining.

And the way that our hypothesis function works is that it calculates this value, and if it's greater than 0, or greater than some threshold value, then we declare that it's a rainy day, and otherwise we declare that it's not a rainy day. And this, then, graphically, is what that function looks like. Initially, when the value of this dot product is small, it's not raining, it's not raining, it's not raining. But as soon as it crosses that threshold, we suddenly say, OK, now it's raining, now it's raining, now it's raining. And the way to interpret this kind of representation is that anything on this side of the line would be the category of data points where we say, yes, it's raining, and anything that falls on this side of the line are the data points where we would say it's not raining. And again, we want to choose some value for the weights that results in a function that does a pretty good job of trying to do this estimation.

But one tricky thing with this type of hard threshold is that it only leaves two possible outcomes. We plug in some data as input, and the output we get is raining or not raining. And there is no room for anything anywhere in between. And maybe that's what you want. Maybe all you want is, given some data point, you would like to be able to classify it into one of two or more of these various different categories. But it might also be the case that you care about knowing how strong that prediction is, for example.
623 00:29:37,855 --> 00:29:39,730 So if we go back to this instance here, where 624 00:29:39,730 --> 00:29:42,550 we have rainy days on this side of the line, not 625 00:29:42,550 --> 00:29:45,370 rainy days on that side of the line, you might 626 00:29:45,370 --> 00:29:48,970 imagine that let's look now at these two white data points. 627 00:29:48,970 --> 00:29:53,110 This data point here that we would like to predict a label or a category for. 628 00:29:53,110 --> 00:29:55,660 And this data point over here that we would also like 629 00:29:55,660 --> 00:29:58,510 to predict a label or a category for. 630 00:29:58,510 --> 00:30:01,885 It seems likely that you could pretty confidently say that this data point, 631 00:30:01,885 --> 00:30:03,010 that should be a rainy day. 632 00:30:03,010 --> 00:30:05,500 It seems close to the other rainy days if we're 633 00:30:05,500 --> 00:30:07,330 going by the nearest neighbor strategy. 634 00:30:07,330 --> 00:30:11,830 It's on this side of the line if we're going by the strategy of just saying 635 00:30:11,830 --> 00:30:14,140 which side of the line does it fall on by figuring out 636 00:30:14,140 --> 00:30:15,670 what those weights should be. 637 00:30:15,670 --> 00:30:18,610 And if we're using the line strategy of just which side of the line 638 00:30:18,610 --> 00:30:21,640 does it fall on, which side of this decision boundary, 639 00:30:21,640 --> 00:30:24,700 we'd also say that this point here is also 640 00:30:24,700 --> 00:30:27,580 a rainy day, because it falls on the side of the line that 641 00:30:27,580 --> 00:30:30,620 corresponds to rainy days. 642 00:30:30,620 --> 00:30:33,020 But it's likely that even in this case, we 643 00:30:33,020 --> 00:30:37,040 would know that we don't feel nearly as confident about this data point 644 00:30:37,040 --> 00:30:40,360 on the left as compared to this data point on the right. 645 00:30:40,360 --> 00:30:42,620 For this one on the right, we can feel very confident 646 00:30:42,620 --> 00:30:44,060 that, yes, it's a rainy day. 647 00:30:44,060 --> 00:30:48,460 This one, it's pretty close to the line if we're judging just by distance. 648 00:30:48,460 --> 00:30:51,260 And so you might be less sure. 649 00:30:51,260 --> 00:30:56,150 But our threshold function doesn't allow for a notion of less sure or more sure 650 00:30:56,150 --> 00:30:57,080 about something. 651 00:30:57,080 --> 00:30:59,000 It's what we would call a hard threshold. 652 00:30:59,000 --> 00:31:02,810 It's once you've crossed this line, then immediately, we say, yes, 653 00:31:02,810 --> 00:31:04,520 this is going to be a rainy day. 654 00:31:04,520 --> 00:31:07,580 Anywhere before it, we're going to say it's not a rainy day. 655 00:31:07,580 --> 00:31:10,240 And that may not be helpful in a number of cases. 656 00:31:10,240 --> 00:31:13,070 One, this is not a particularly easy function to deal with. 657 00:31:13,070 --> 00:31:15,710 If you get you get deeper into the world of machine learning 658 00:31:15,710 --> 00:31:18,830 and are trying to do things like taking derivatives of these curves, 659 00:31:18,830 --> 00:31:21,222 this type of function makes things challenging. 660 00:31:21,222 --> 00:31:23,180 But the other challenge is that we don't really 661 00:31:23,180 --> 00:31:25,013 have any notion of gradation between things. 
662 00:31:25,013 --> 00:31:28,430 We don't have a notion of, yes, this is a very strong belief 663 00:31:28,430 --> 00:31:32,600 that it's going to be raining as opposed to it's probably more likely than not 664 00:31:32,600 --> 00:31:37,100 that it's going to be raining, but maybe not totally sure about that, either. 665 00:31:37,100 --> 00:31:39,590 So what we can do by taking advantage of a technique known 666 00:31:39,590 --> 00:31:43,190 as logistic regression is instead of using this hard threshold 667 00:31:43,190 --> 00:31:46,970 type of function, we can use instead a logistic function, something 668 00:31:46,970 --> 00:31:48,890 we might call a soft threshold. 669 00:31:48,890 --> 00:31:52,070 And that's going to transform this into looking something 670 00:31:52,070 --> 00:31:55,290 a little more like this-- something that more nicely curves. 671 00:31:55,290 --> 00:31:57,800 And as a result, the possible output values 672 00:31:57,800 --> 00:32:02,030 are no longer just 0 and 1, 0 for not raining, 1 for raining. 673 00:32:02,030 --> 00:32:06,360 But you can actually get any real number of value between 0 and 1. 674 00:32:06,360 --> 00:32:10,292 That if you're way over on this side, then you get a value of 0-- 675 00:32:10,292 --> 00:32:12,750 it's not going to be raining, we're pretty sure about that. 676 00:32:12,750 --> 00:32:15,042 And if you're over on this side, you get a value of 1-- 677 00:32:15,042 --> 00:32:17,330 yes, we're very sure that it's going to be raining. 678 00:32:17,330 --> 00:32:22,460 But in between, you could get some real numbered value where a value like 0.7 679 00:32:22,460 --> 00:32:24,260 might mean we think it's going to rain. 680 00:32:24,260 --> 00:32:27,810 It's more probable that it's going to rain than not based on the data, 681 00:32:27,810 --> 00:32:32,220 but we're not as confident as some of the other data points might be. 682 00:32:32,220 --> 00:32:34,580 So one of the advantages of the soft threshold 683 00:32:34,580 --> 00:32:37,940 is that it allows us to have an output that could be some real number that 684 00:32:37,940 --> 00:32:41,450 potentially reflects some sort of probability, the likelihood that we 685 00:32:41,450 --> 00:32:46,550 think that this particular data point belongs to that particular category. 686 00:32:46,550 --> 00:32:49,250 And there are some other nice mathematical properties of that 687 00:32:49,250 --> 00:32:50,950 as well. 688 00:32:50,950 --> 00:32:53,880 So that then is two different approaches to trying to solve 689 00:32:53,880 --> 00:32:55,710 this type of classification problem. 690 00:32:55,710 --> 00:32:58,860 One is this nearest neighbor type of approach, where you just 691 00:32:58,860 --> 00:33:01,230 take a data point and look at the data points that 692 00:33:01,230 --> 00:33:05,490 are nearby to try and estimate what category we think it belongs to. 693 00:33:05,490 --> 00:33:08,310 And the other approach is the approach of saying, all right, 694 00:33:08,310 --> 00:33:10,607 let's just try and use linear regression, 695 00:33:10,607 --> 00:33:13,440 figure out what these weights should be, adjust the weights in order 696 00:33:13,440 --> 00:33:16,950 to figure out what line or what decision boundary is going 697 00:33:16,950 --> 00:33:19,708 to best separate these two categories. 
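One possible sketch of that soft threshold, using the logistic (sigmoid) function, is below; again, the weights and inputs are invented just to show the shape of the idea.

    import math

    def soft_threshold(weights, x):
        # Apply the logistic function to the dot product, giving a value between 0 and 1
        # that can be read as how confident we are that it is raining.
        z = sum(w * xi for w, xi in zip(weights, x))
        return 1 / (1 + math.exp(-z))

    weights = [0.8, -0.5]                          # hypothetical weights
    print(soft_threshold(weights, [0.9, 0.3]))     # about 0.64: probably raining, but not certain
    print(soft_threshold(weights, [3.0, -1.0]))    # about 0.95: quite confident it is raining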
698 00:33:19,708 --> 00:33:22,500 It turns out that another popular approach, a very popular approach 699 00:33:22,500 --> 00:33:24,480 if you just have a data set and you want to start 700 00:33:24,480 --> 00:33:26,190 trying to do some learning on it, is what 701 00:33:26,190 --> 00:33:27,648 we call the support vector machine. 702 00:33:27,648 --> 00:33:30,690 We're not going to go too much into the mathematics of the support vector 703 00:33:30,690 --> 00:33:32,850 machine, but we'll at least explore it graphically 704 00:33:32,850 --> 00:33:34,680 to see what it is that it looks like. 705 00:33:34,680 --> 00:33:38,220 And the idea or the motivation behind the support vector machine 706 00:33:38,220 --> 00:33:41,400 is the idea that there are actually a lot of different lines 707 00:33:41,400 --> 00:33:44,070 that we could draw, a lot of different decision boundaries 708 00:33:44,070 --> 00:33:46,430 that we could draw to separate two groups. 709 00:33:46,430 --> 00:33:49,050 So for example, I had the red data points over here 710 00:33:49,050 --> 00:33:50,700 and the blue data points over here. 711 00:33:50,700 --> 00:33:54,620 One possible line I could draw is a line like this, 712 00:33:54,620 --> 00:33:57,600 that this line here would separate the red points from the blue points. 713 00:33:57,600 --> 00:33:58,680 And it does so perfectly. 714 00:33:58,680 --> 00:34:01,080 All the red points are on one side of the line. 715 00:34:01,080 --> 00:34:03,840 All the blue points around the other side of the line. 716 00:34:03,840 --> 00:34:07,000 But this should probably make you a little bit nervous. 717 00:34:07,000 --> 00:34:08,909 If you come up with a model and the model 718 00:34:08,909 --> 00:34:10,960 comes up with a line that looks like this. 719 00:34:10,960 --> 00:34:13,650 And the reason why is that you worry about how well it's 720 00:34:13,650 --> 00:34:18,190 going to generalize to other data points that are not necessarily in the data 721 00:34:18,190 --> 00:34:19,900 set that we have access to. 722 00:34:19,900 --> 00:34:23,650 For example, if there was a point that fell right here for example, 723 00:34:23,650 --> 00:34:26,370 on the right side of the line, then based on that, 724 00:34:26,370 --> 00:34:30,210 we might want to guess that it is in fact, a red point, 725 00:34:30,210 --> 00:34:33,780 but it falls on the side of the line where instead, we would estimate 726 00:34:33,780 --> 00:34:36,380 that it's a blue point instead. 727 00:34:36,380 --> 00:34:39,750 And so based on that, this line is probably not a great choice 728 00:34:39,750 --> 00:34:43,679 just because it is so close to these various data points. 729 00:34:43,679 --> 00:34:45,810 We might instead prefer a diagonal line that 730 00:34:45,810 --> 00:34:48,719 just goes diagonally through the data set like we've seen before. 731 00:34:48,719 --> 00:34:52,000 But there too, there's a lot of diagonal lines that we could draw as well. 732 00:34:52,000 --> 00:34:54,760 For example, I could draw this diagonal line here, 733 00:34:54,760 --> 00:34:57,663 which also successfully separates all the red points 734 00:34:57,663 --> 00:34:58,830 from all of the blue points. 
735 00:34:58,830 --> 00:35:01,400 From the perspective of something like a just trying 736 00:35:01,400 --> 00:35:03,150 to figure out some setting of weights that 737 00:35:03,150 --> 00:35:05,760 allows us to predict the correct output, this line 738 00:35:05,760 --> 00:35:09,155 will predict the correct output for this particular set of data 739 00:35:09,155 --> 00:35:11,530 every single time, because the red points are on one side 740 00:35:11,530 --> 00:35:13,518 and the blue points are on the other. 741 00:35:13,518 --> 00:35:15,810 But yet again, you should probably be a little nervous. 742 00:35:15,810 --> 00:35:18,690 Because this line is so close to these red points, 743 00:35:18,690 --> 00:35:22,470 even though we're able to correctly predict on the input data 744 00:35:22,470 --> 00:35:26,070 if there was a point that fell somewhere in this general area, 745 00:35:26,070 --> 00:35:28,530 our algorithm, this model, would say that yeah, 746 00:35:28,530 --> 00:35:30,930 we think it's a blue point, when in actuality, it 747 00:35:30,930 --> 00:35:34,000 might belong to the red category instead, 748 00:35:34,000 --> 00:35:36,870 just because it looks like it's close to the other red points. 749 00:35:36,870 --> 00:35:39,700 What we really want to be able to say, given this data, 750 00:35:39,700 --> 00:35:42,330 how can you generalize this out as best as possible, is 751 00:35:42,330 --> 00:35:46,480 to come up with a line like this that seems like the intuitive line to draw. 752 00:35:46,480 --> 00:35:48,810 And the reason why it's intuitive is because it 753 00:35:48,810 --> 00:35:54,267 seems to be as far apart as possible from the red data and the blue data 754 00:35:54,267 --> 00:35:56,850 so that if we generalize a little bit and assume that maybe we 755 00:35:56,850 --> 00:35:58,933 have some points that are different from the input 756 00:35:58,933 --> 00:36:02,610 but still slightly further away, we can still say that something on this side, 757 00:36:02,610 --> 00:36:06,300 probably red, something on that side, probably blue. 758 00:36:06,300 --> 00:36:08,607 And we can make those judgments that way. 759 00:36:08,607 --> 00:36:10,440 And that is what support vector machines are 760 00:36:10,440 --> 00:36:12,660 designed to do-- they're designed to try and find 761 00:36:12,660 --> 00:36:15,720 what we call the maximum margin separator, 762 00:36:15,720 --> 00:36:17,850 where the maximum margin separator is just 763 00:36:17,850 --> 00:36:21,780 some boundary that maximizes the distance between the groups of points. 764 00:36:21,780 --> 00:36:24,990 Rather than come up with some boundary that's very close to one side 765 00:36:24,990 --> 00:36:27,840 or the other, where in the case before, we wouldn't have cared-- 766 00:36:27,840 --> 00:36:31,680 as long as we're categorizing the input well, that seems all we need to do-- 767 00:36:31,680 --> 00:36:35,790 the support vector machine will try and find this maximum margin separator, 768 00:36:35,790 --> 00:36:39,090 some way of trying to maximize that particular distance. 769 00:36:39,090 --> 00:36:42,600 And it does so by finding what we call the support vectors, which 770 00:36:42,600 --> 00:36:44,790 are the vectors that are closest to the line 771 00:36:44,790 --> 00:36:48,060 and trying to maximize the distance between the line 772 00:36:48,060 --> 00:36:49,950 and those particular points. 773 00:36:49,950 --> 00:36:51,790 And it works that way in two dimensions. 
774 00:36:51,790 --> 00:36:53,498 It also works in higher dimensions, where 775 00:36:53,498 --> 00:36:56,670 we're not looking for some line that separates the two data points, 776 00:36:56,670 --> 00:36:58,440 but instead, looking for what we generally 777 00:36:58,440 --> 00:37:02,610 call a hyperplane, some decision boundary, effectively, that 778 00:37:02,610 --> 00:37:06,150 separates one set of data from the other set of data. 779 00:37:06,150 --> 00:37:09,270 And this ability of support vector machines to work in higher dimensions 780 00:37:09,270 --> 00:37:11,670 actually has a number of other applications as well. 781 00:37:11,670 --> 00:37:14,790 But one is that it helpfully deals with cases where 782 00:37:14,790 --> 00:37:17,740 data may not be linearly separable. 783 00:37:17,740 --> 00:37:19,800 So we talked about linear separability before, 784 00:37:19,800 --> 00:37:22,890 this idea that you can take data and just draw 785 00:37:22,890 --> 00:37:25,180 a line or some linear combination of the inputs 786 00:37:25,180 --> 00:37:28,750 that allows us to perfectly separate the two sets from each other. 787 00:37:28,750 --> 00:37:32,050 There are some data sets that are not linearly separable. 788 00:37:32,050 --> 00:37:35,280 And in some, you would not even be able to find 789 00:37:35,280 --> 00:37:39,270 a good line at all that would do that kind of separation. 790 00:37:39,270 --> 00:37:42,930 Something like this, for example, where if you imagine here 791 00:37:42,930 --> 00:37:45,780 are the red points and the blue points surround them. 792 00:37:45,780 --> 00:37:50,550 If you try to find a line that divides the red points from the blue points, 793 00:37:50,550 --> 00:37:53,310 it's actually going to be difficult, if not impossible, to do-- 794 00:37:53,310 --> 00:37:55,080 that any line you choose-- 795 00:37:55,080 --> 00:37:57,510 if you draw a line here, then you've ignored 796 00:37:57,510 --> 00:38:00,450 all of these blue points that should actually be blue and not red-- 797 00:38:00,450 --> 00:38:03,240 anywhere else you draw a line, there's going to be a lot of error, 798 00:38:03,240 --> 00:38:07,560 a lot of mistakes, a lot of what we'll soon call loss to that line 799 00:38:07,560 --> 00:38:08,280 that you draw-- 800 00:38:08,280 --> 00:38:12,050 a lot of points that you're going to categorize incorrectly. 801 00:38:12,050 --> 00:38:14,610 What we really want is to be able to find a better decision 802 00:38:14,610 --> 00:38:18,570 boundary that may not be just a straight line through this two 803 00:38:18,570 --> 00:38:19,770 dimensional space. 804 00:38:19,770 --> 00:38:21,810 And what support vector machines can do is 805 00:38:21,810 --> 00:38:23,940 they can begin to operate in higher dimensions 806 00:38:23,940 --> 00:38:26,880 and be able to find some other decision boundary, 807 00:38:26,880 --> 00:38:28,860 like the circle in this case, that actually 808 00:38:28,860 --> 00:38:33,130 is able to separate one of these sets of data from the other set of data, 809 00:38:33,130 --> 00:38:33,750 a lot better. 810 00:38:33,750 --> 00:38:37,470 So oftentimes, in data sets where the data is not linearly separable, 811 00:38:37,470 --> 00:38:40,140 support vector machines, by working in higher dimensions, 812 00:38:40,140 --> 00:38:44,310 can actually figure out a way to solve that kind of problem effectively.
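Using scikit-learn, which comes up again later in the lecture, a rough sketch of a support vector machine on data shaped like that circular example might look like this; the points, labels, and kernel choice are made up for illustration.

    from sklearn.svm import SVC

    # Made-up data: a small cluster near the origin, surrounded by points around it,
    # which no single straight line can separate.
    X = [[0, 0], [0.2, 0.1], [-0.1, 0.2],          # inner points
         [2, 0], [0, 2], [-2, 0], [0, -2]]         # surrounding points
    y = ["red", "red", "red", "blue", "blue", "blue", "blue"]

    # A non-linear kernel lets the SVM find a boundary (roughly a circle here)
    # by effectively working in a higher-dimensional space.
    model = SVC(kernel="rbf")
    model.fit(X, y)
    print(model.predict([[0.1, -0.1], [1.8, 1.2]]))   # likely ['red', 'blue']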
813 00:38:44,310 --> 00:38:46,290 So that then-- three different approaches 814 00:38:46,290 --> 00:38:48,330 to trying to solve these sorts of problems. 815 00:38:48,330 --> 00:38:50,040 We're seeing support vector machines. 816 00:38:50,040 --> 00:38:53,400 We've seen trying to use linear regression and the perceptron 817 00:38:53,400 --> 00:38:56,912 learning rule to be able to figure out how to categorize inputs and outputs. 818 00:38:56,912 --> 00:38:58,620 We've seen the nearest neighbor approach. 819 00:38:58,620 --> 00:39:00,630 No one necessarily better than any other. 820 00:39:00,630 --> 00:39:03,990 Again, it's going to depend on the data set, the information you have access 821 00:39:03,990 --> 00:39:04,490 to. 822 00:39:04,490 --> 00:39:07,157 It's going to depend on what the function looks like that you're 823 00:39:07,157 --> 00:39:08,370 ultimately trying to predict. 824 00:39:08,370 --> 00:39:11,160 And this is where a lot of research and experimentation 825 00:39:11,160 --> 00:39:14,880 can be involved in trying to figure out how it is to best perform 826 00:39:14,880 --> 00:39:16,710 that kind of estimation. 827 00:39:16,710 --> 00:39:20,100 But classification is only one of the tasks that you might encounter 828 00:39:20,100 --> 00:39:23,340 and supervised machine learning, because in classification, 829 00:39:23,340 --> 00:39:26,580 what we're trying to predict is some discrete category. 830 00:39:26,580 --> 00:39:29,040 We're trying to predict red or blue, rain 831 00:39:29,040 --> 00:39:31,950 or not rain, authentic or counterfeit. 832 00:39:31,950 --> 00:39:35,500 But sometimes, what we want to predict is a real number value. 833 00:39:35,500 --> 00:39:38,880 And for that, we have a related problem, not classification, but instead 834 00:39:38,880 --> 00:39:40,500 known as regression. 835 00:39:40,500 --> 00:39:42,930 And regression is the supervised learning problem 836 00:39:42,930 --> 00:39:45,570 where we try and learn a function mapping inputs to outputs, 837 00:39:45,570 --> 00:39:46,810 same as before. 838 00:39:46,810 --> 00:39:49,800 but instead of the outputs being discrete categories-- 839 00:39:49,800 --> 00:39:51,960 things like rain or not rain-- 840 00:39:51,960 --> 00:39:54,260 in a regression problem, the output values 841 00:39:54,260 --> 00:39:57,400 are generally continuous value-- some real number 842 00:39:57,400 --> 00:39:58,650 that we would like to predict. 843 00:39:58,650 --> 00:40:00,385 This happens all the time, as well. 844 00:40:00,385 --> 00:40:02,760 You might imagine that a company might take this approach 845 00:40:02,760 --> 00:40:04,950 if it's trying to figure out, for instance, 846 00:40:04,950 --> 00:40:06,720 what the effect of its advertising is. 847 00:40:06,720 --> 00:40:10,830 Like how do advertising dollars spent translate into sales 848 00:40:10,830 --> 00:40:13,000 for the company's product, for example. 849 00:40:13,000 --> 00:40:17,070 And so they might like to try to predict some function that takes as input, 850 00:40:17,070 --> 00:40:18,780 the amount of money spent on advertising. 851 00:40:18,780 --> 00:40:20,220 And here, we're just going to use one input, 852 00:40:20,220 --> 00:40:22,650 but again, you could scale this up to many more inputs 853 00:40:22,650 --> 00:40:25,800 as well if you have a lot of different kinds of data you have access to. 
854 00:40:25,800 --> 00:40:27,540 And the goal is to learn a function-- that given 855 00:40:27,540 --> 00:40:29,423 this amount of spending on advertising, we're 856 00:40:29,423 --> 00:40:30,840 going to get this amount in sales. 857 00:40:30,840 --> 00:40:34,110 And you might judge it based on having access to a whole bunch of data-- 858 00:40:34,110 --> 00:40:37,830 like for every past month, here's how much we spent on advertising 859 00:40:37,830 --> 00:40:39,390 and here is what sales were. 860 00:40:39,390 --> 00:40:43,300 And we would like to predict some sort of hypothesis function 861 00:40:43,300 --> 00:40:46,260 that, again, given the amount spent on advertising, 862 00:40:46,260 --> 00:40:48,750 can predict in this case, some real number, 863 00:40:48,750 --> 00:40:54,840 some estimate of how much sales we expect that company to do in this month 864 00:40:54,840 --> 00:40:56,940 or in this quarter or whatever unit of time 865 00:40:56,940 --> 00:40:58,980 we're choosing to measure things in. 866 00:40:58,980 --> 00:41:01,830 And so again, the approach to solving this type of problem, 867 00:41:01,830 --> 00:41:05,920 we could try using a linear regression type approach, where we take this data, 868 00:41:05,920 --> 00:41:07,050 and we just plot it. 869 00:41:07,050 --> 00:41:09,750 On the x-axis, we have advertising dollars spent. 870 00:41:09,750 --> 00:41:11,490 On the y-axis, we have sales. 871 00:41:11,490 --> 00:41:14,160 And we might just want to try and draw a line that 872 00:41:14,160 --> 00:41:16,680 does a pretty good job of trying to estimate 873 00:41:16,680 --> 00:41:19,980 this relationship between advertising and sales. 874 00:41:19,980 --> 00:41:21,840 And in this case, unlike before, we're not 875 00:41:21,840 --> 00:41:24,848 trying to separate the data points into discrete categories. 876 00:41:24,848 --> 00:41:26,640 But instead in this case, we're just trying 877 00:41:26,640 --> 00:41:31,410 to find a line that approximates this relationship between advertising 878 00:41:31,410 --> 00:41:34,860 and sales so that if we want to figure out what the estimated sales are 879 00:41:34,860 --> 00:41:38,520 for a particular advertising budget, you just look it up in this line, 880 00:41:38,520 --> 00:41:40,560 figure out for this amount of advertising, we 881 00:41:40,560 --> 00:41:42,540 would have this amount of sales, and just 882 00:41:42,540 --> 00:41:44,367 try and make the estimate that way. 883 00:41:44,367 --> 00:41:46,200 And so you can try and come up with a line-- 884 00:41:46,200 --> 00:41:48,720 again, figuring out how to modify the weights using 885 00:41:48,720 --> 00:41:51,990 various different techniques to try and make it so that this line fits 886 00:41:51,990 --> 00:41:54,860 as well as possible. 887 00:41:54,860 --> 00:41:57,880 So with all of these approaches to trying to solve machine 888 00:41:57,880 --> 00:41:59,270 learning style problems, 889 00:41:59,270 --> 00:42:02,000 the question becomes, how do we evaluate these approaches? 890 00:42:02,000 --> 00:42:06,340 How do we evaluate the various different hypotheses that we could come up with? 891 00:42:06,340 --> 00:42:09,880 Because each of these algorithms will give us some sort of hypothesis-- 892 00:42:09,880 --> 00:42:12,710 some function that maps inputs to outputs. 893 00:42:12,710 --> 00:42:16,720 And we want to know how well does that function work.
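Before getting to evaluation, here is what that line-fitting idea might look like as a small scikit-learn sketch; the advertising and sales figures are entirely made up.

    from sklearn.linear_model import LinearRegression

    # Made-up monthly figures: advertising dollars spent, and the sales that resulted.
    advertising = [[1000], [2000], [3000], [4000], [5000]]
    sales = [9000, 12000, 16000, 19000, 23000]

    # Fit a line relating advertising to sales.
    model = LinearRegression()
    model.fit(advertising, sales)

    # Estimate sales for an advertising budget we have not seen before.
    print(model.predict([[3500]]))   # a real-numbered estimate, roughly 17,500 here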
894 00:42:16,720 --> 00:42:19,000 And you can think of evaluating these hypotheses 895 00:42:19,000 --> 00:42:23,500 and trying to get a better hypothesis as kind of like an optimization problem. 896 00:42:23,500 --> 00:42:26,580 In an optimization problem, as you recall from before, 897 00:42:26,580 --> 00:42:30,100 you are either trying to maximize some objective function 898 00:42:30,100 --> 00:42:33,160 by trying to find a global maximum. 899 00:42:33,160 --> 00:42:36,160 Or we are trying to minimize some cost function 900 00:42:36,160 --> 00:42:38,210 by trying to find some global minimum. 901 00:42:38,210 --> 00:42:41,890 And in the case of evaluating these hypotheses, one thing we might say 902 00:42:41,890 --> 00:42:45,280 is that this cost function, the thing we're trying to minimize, 903 00:42:45,280 --> 00:42:49,180 we might be trying to minimize what we would call a loss function. 904 00:42:49,180 --> 00:42:50,510 And what a loss function is-- 905 00:42:50,510 --> 00:42:53,980 it is a function that is going to estimate for us how 906 00:42:53,980 --> 00:42:56,200 poorly our function performs. 907 00:42:56,200 --> 00:42:58,240 More formally, it's like a loss of utility, 908 00:42:58,240 --> 00:43:02,740 in that whenever we predict something that is wrong, that is a loss of utility. 909 00:43:02,740 --> 00:43:06,450 That's going to add to the output of our loss function. 910 00:43:06,450 --> 00:43:08,200 And you can come up with any loss function 911 00:43:08,200 --> 00:43:11,320 that you want-- just some mathematical way of estimating given 912 00:43:11,320 --> 00:43:14,020 each of these data points, given what the actual output is, 913 00:43:14,020 --> 00:43:17,110 and given what our projected output is, our estimate, 914 00:43:17,110 --> 00:43:19,970 you could calculate some sort of numerical loss for it. 915 00:43:19,970 --> 00:43:23,140 But there are a couple of popular loss functions that are worth discussing-- 916 00:43:23,140 --> 00:43:25,210 just so that you've seen them before-- 917 00:43:25,210 --> 00:43:27,040 when it comes to discrete categories. 918 00:43:27,040 --> 00:43:30,570 Things like rain or not rain, counterfeit or not counterfeit. 919 00:43:30,570 --> 00:43:33,463 One approach is the 0-1 loss function. 920 00:43:33,463 --> 00:43:35,380 And the way that works is for each of the data 921 00:43:35,380 --> 00:43:40,180 points, our loss function takes as input what the actual output is, whether it 922 00:43:40,180 --> 00:43:44,650 was actually raining or not raining, and takes our prediction into account-- 923 00:43:44,650 --> 00:43:49,030 did we predict given this data point that it was raining or not raining. 924 00:43:49,030 --> 00:43:51,730 And if the actual value equals the prediction, 925 00:43:51,730 --> 00:43:54,580 well, then the 0-1 loss function will just say the loss is 0. 926 00:43:54,580 --> 00:43:58,900 There was no loss of utility because we were able to predict correctly. 927 00:43:58,900 --> 00:44:01,300 And otherwise, if the actual value was not 928 00:44:01,300 --> 00:44:05,260 the same thing as what we predicted, well, then in that case, our loss is 1. 929 00:44:05,260 --> 00:44:08,260 We lost something, lost some utility, because what 930 00:44:08,260 --> 00:44:12,750 we predicted was the output of the function was not what it actually was.
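Written out as Python, the 0-1 loss function is just a comparison; the days and predictions below are invented to show the bookkeeping.

    def zero_one_loss(actual, predicted):
        # No loss if the prediction matches the actual category, a loss of 1 otherwise.
        return 0 if actual == predicted else 1

    actuals     = ["rain", "rain", "no rain", "no rain"]
    predictions = ["rain", "no rain", "no rain", "rain"]

    # Total empirical loss: add up the loss for every data point.
    print(sum(zero_one_loss(a, p) for a, p in zip(actuals, predictions)))   # 2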
931 00:44:12,750 --> 00:44:14,500 And the goal then in a situation like this 932 00:44:14,500 --> 00:44:17,080 would be to come up with some hypothesis that 933 00:44:17,080 --> 00:44:21,040 minimizes the total empirical loss, the total amount that we've 934 00:44:21,040 --> 00:44:25,090 lost if you add up for all these data points what the actual output is 935 00:44:25,090 --> 00:44:28,250 and what your hypothesis would have predicted. 936 00:44:28,250 --> 00:44:30,100 So in this case, for example, if we go back 937 00:44:30,100 --> 00:44:32,530 to classifying days as raining or not raining, 938 00:44:32,530 --> 00:44:34,720 and we came up with this decision boundary, 939 00:44:34,720 --> 00:44:36,680 how would we evaluate this decision boundary-- 940 00:44:36,680 --> 00:44:40,510 how much better is it than drawing the line here or drawing the line there. 941 00:44:40,510 --> 00:44:42,790 Well, we could take each of the input data points 942 00:44:42,790 --> 00:44:45,790 and each input data point has a label-- whether it was raining 943 00:44:45,790 --> 00:44:47,230 or whether it was not raining. 944 00:44:47,230 --> 00:44:49,090 And we could compare it to the prediction-- 945 00:44:49,090 --> 00:44:51,550 whether we predicted it would be raining or not raining-- 946 00:44:51,550 --> 00:44:55,040 and assign it a numerical value as a result. 947 00:44:55,040 --> 00:44:58,323 So for example, these points over here they were all rainy days. 948 00:44:58,323 --> 00:45:00,490 And we predicted they would be raining, because they 949 00:45:00,490 --> 00:45:02,300 fall in the bottom side of the line. 950 00:45:02,300 --> 00:45:03,520 So they had a loss of 0-- 951 00:45:03,520 --> 00:45:05,500 nothing lost from those situations. 952 00:45:05,500 --> 00:45:08,140 And likewise, same is true for some of these points over here, 953 00:45:08,140 --> 00:45:10,420 where it was not raining and we predicted 954 00:45:10,420 --> 00:45:12,220 it would not be raining, either. 955 00:45:12,220 --> 00:45:16,810 Where we do have loss are points like this point here and that point there, 956 00:45:16,810 --> 00:45:20,980 where we predicted that it would not be raining, but in actuality, 957 00:45:20,980 --> 00:45:21,730 it's a blue point. 958 00:45:21,730 --> 00:45:22,810 It was raining. 959 00:45:22,810 --> 00:45:25,870 Or likewise here, we predicted that it would be raining, 960 00:45:25,870 --> 00:45:27,790 but in actuality, it's a red point-- 961 00:45:27,790 --> 00:45:29,020 it was not raining. 962 00:45:29,020 --> 00:45:31,900 And so as a result, we miscategorized these data 963 00:45:31,900 --> 00:45:34,150 points that we were trying to train on. 964 00:45:34,150 --> 00:45:36,310 And as a result, there is some loss here. 965 00:45:36,310 --> 00:45:40,930 One loss here, there, here and there, for a total loss of four, for example, 966 00:45:40,930 --> 00:45:41,893 in this case. 967 00:45:41,893 --> 00:45:43,810 And that might be how we would estimate or how 968 00:45:43,810 --> 00:45:47,860 we would say that this line is better than a line that 969 00:45:47,860 --> 00:45:51,370 goes somewhere else or a line that's further down, because this line might 970 00:45:51,370 --> 00:45:52,670 minimize the loss. 971 00:45:52,670 --> 00:45:57,100 So there is no way to do better than just these four points of loss 972 00:45:57,100 --> 00:46:01,080 if you're just drawing a straight line through our space. 973 00:46:01,080 --> 00:46:05,020 So the 0-1 loss function checks did we get it right, did we get it wrong. 
974 00:46:05,020 --> 00:46:07,630 If we got it right, the loss is 0-- nothing lost. 975 00:46:07,630 --> 00:46:11,470 If we got it wrong, then our loss function for that data point says 1, 976 00:46:11,470 --> 00:46:14,740 and we add up all of those losses across all of our data points 977 00:46:14,740 --> 00:46:17,290 to get some sort of empirical loss-- how much we 978 00:46:17,290 --> 00:46:20,420 have lost across all of these original data points 979 00:46:20,420 --> 00:46:23,500 that our algorithm had access to. 980 00:46:23,500 --> 00:46:26,550 There are other forms of loss as well that work especially well when 981 00:46:26,550 --> 00:46:28,980 we deal with more real value cases-- cases 982 00:46:28,980 --> 00:46:32,760 like the mapping between advertising budget and amount that we do in sales, 983 00:46:32,760 --> 00:46:33,720 for example. 984 00:46:33,720 --> 00:46:37,770 Because in that case, you care not just that you get the number exactly right, 985 00:46:37,770 --> 00:46:40,650 but you care how close you were to the actual value. 986 00:46:40,650 --> 00:46:44,700 If the actual value is you did $2,800 in sales 987 00:46:44,700 --> 00:46:47,910 and you predicted that you would do $2,900 in sales, 988 00:46:47,910 --> 00:46:49,330 maybe that's pretty good. 989 00:46:49,330 --> 00:46:52,380 That's much better than if you had predicted you do $1,000 in sales, 990 00:46:52,380 --> 00:46:53,620 for example. 991 00:46:53,620 --> 00:46:55,830 And so we would like our loss function to be 992 00:46:55,830 --> 00:46:58,530 able to take that into account as well. 993 00:46:58,530 --> 00:47:02,700 Take into account not just whether the actual value in the expected value 994 00:47:02,700 --> 00:47:08,490 are exactly the same, but also, take into account how far apart they were. 995 00:47:08,490 --> 00:47:12,450 And so for that one approach is what we call L1 loss. 996 00:47:12,450 --> 00:47:15,120 L1 Loss doesn't just look at whether actual and predicted 997 00:47:15,120 --> 00:47:20,400 are equal to each other, but we take the absolute value of the actual value 998 00:47:20,400 --> 00:47:22,180 minus the predicted value. 999 00:47:22,180 --> 00:47:25,860 In other words, we just ask, how far apart were the actual 1000 00:47:25,860 --> 00:47:27,120 and predicted values? 1001 00:47:27,120 --> 00:47:29,280 And we sum that up across all of the data 1002 00:47:29,280 --> 00:47:33,870 points to be able to get what our answer ultimately is. 1003 00:47:33,870 --> 00:47:36,690 So what might this actually look like for our data set? 1004 00:47:36,690 --> 00:47:38,610 Well, if we go back to this representation 1005 00:47:38,610 --> 00:47:42,720 where we had advertising along the x-axis, sales along the y-axis, 1006 00:47:42,720 --> 00:47:45,210 our line was our prediction, our estimate 1007 00:47:45,210 --> 00:47:47,520 for any given amount of advertising-- what 1008 00:47:47,520 --> 00:47:50,080 we predicted sales was going to be. 1009 00:47:50,080 --> 00:47:54,420 And our L1 loss is just how far apart vertically along the sales 1010 00:47:54,420 --> 00:47:58,050 axis our prediction was from each of the data points. 
1011 00:47:58,050 --> 00:48:00,900 So we could figure out exactly how far apart our prediction 1012 00:48:00,900 --> 00:48:02,910 was from each of the data points and figure out 1013 00:48:02,910 --> 00:48:07,920 as a result of that what our loss is overall for this particular hypothesis 1014 00:48:07,920 --> 00:48:12,090 just by adding up all of these various different individual losses for each 1015 00:48:12,090 --> 00:48:13,110 of these data points. 1016 00:48:13,110 --> 00:48:15,790 And our goal then is to try and minimize that loss-- 1017 00:48:15,790 --> 00:48:20,550 to try and come up with some line that minimizes what the utility loss is 1018 00:48:20,550 --> 00:48:23,280 by judging how far away our estimate amount of sales 1019 00:48:23,280 --> 00:48:25,800 is from the actual amount of sales. 1020 00:48:25,800 --> 00:48:28,050 And turns out there are other loss functions, as well. 1021 00:48:28,050 --> 00:48:30,780 One that's quite popular is the L2 loss. 1022 00:48:30,780 --> 00:48:33,840 The L2 loss, instead of just using the absolute value, 1023 00:48:33,840 --> 00:48:37,350 like how far away the actual value is from the predicted value, 1024 00:48:37,350 --> 00:48:40,350 it uses the square of actual minus predicted. 1025 00:48:40,350 --> 00:48:43,470 So how far apart are the actual and predicted value, and it 1026 00:48:43,470 --> 00:48:48,150 squares that value, effectively penalizing much more harshly 1027 00:48:48,150 --> 00:48:50,290 anything that is a worse prediction. 1028 00:48:50,290 --> 00:48:52,620 So you imagine if you have two data points 1029 00:48:52,620 --> 00:48:57,210 that you predict as being one value away from their actual value 1030 00:48:57,210 --> 00:48:59,490 as opposed to one data point that you predict 1031 00:48:59,490 --> 00:49:04,290 as being two away from its actual value, the L2 loss function will more 1032 00:49:04,290 --> 00:49:07,560 harshly penalize that one that is two away because it's 1033 00:49:07,560 --> 00:49:11,430 going to square however much the differences between the actual value 1034 00:49:11,430 --> 00:49:12,520 and the predicted value. 1035 00:49:12,520 --> 00:49:14,310 And depending on the situation, you might 1036 00:49:14,310 --> 00:49:17,520 want to choose a loss function depending on what you care about minimizing. 1037 00:49:17,520 --> 00:49:20,883 If you really care about minimizing the error on more outlier cases, 1038 00:49:20,883 --> 00:49:23,050 then you might want to consider something like this. 1039 00:49:23,050 --> 00:49:25,592 But if you've got a lot of outliers and you don't necessarily 1040 00:49:25,592 --> 00:49:28,632 care about modeling them, then maybe an L1 loss function is preferable, 1041 00:49:28,632 --> 00:49:30,840 but there are trade-offs here that you need to decide 1042 00:49:30,840 --> 00:49:33,630 based on a particular set of data. 1043 00:49:33,630 --> 00:49:37,020 But what you do run the risk of with any of these lost functions with anything 1044 00:49:37,020 --> 00:49:40,232 that we're trying to do is a problem known as overfitting. 1045 00:49:40,232 --> 00:49:41,940 And overfitting is a big problem that you 1046 00:49:41,940 --> 00:49:44,910 can encounter in machine learning, which happens anytime 1047 00:49:44,910 --> 00:49:49,380 a model fits too closely with a data set, and as a result, 1048 00:49:49,380 --> 00:49:51,510 fails to generalize. 
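To make the difference between those two loss functions concrete before going further into overfitting, here is a minimal sketch of L1 and L2 loss side by side, with made-up sales numbers, showing how much more harshly L2 penalizes the one large miss.

    def l1_loss(actuals, predictions):
        # Sum of absolute differences between actual and predicted values.
        return sum(abs(a - p) for a, p in zip(actuals, predictions))

    def l2_loss(actuals, predictions):
        # Sum of squared differences: larger errors are penalized much more harshly.
        return sum((a - p) ** 2 for a, p in zip(actuals, predictions))

    actual_sales    = [2800, 3100, 2500]   # made-up figures
    predicted_sales = [2900, 3000, 1500]

    print(l1_loss(actual_sales, predicted_sales))   # 1200: the three misses add up linearly
    print(l2_loss(actual_sales, predicted_sales))   # 1020000: the 1000-dollar miss dominates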
1049 00:49:51,510 --> 00:49:55,200 We would like our model to be able to accurately predict 1050 00:49:55,200 --> 00:49:59,430 data and inputs and output pairs for the data that we have access to. 1051 00:49:59,430 --> 00:50:02,220 But the reason we wanted to do so is because we 1052 00:50:02,220 --> 00:50:06,690 want our model to generalize well to data that we haven't seen before. 1053 00:50:06,690 --> 00:50:10,350 I would like to take data from the past year of whether it was raining and not 1054 00:50:10,350 --> 00:50:12,707 raining and use that data to generalize it 1055 00:50:12,707 --> 00:50:15,540 towards the future-- to say in the future, is it going to be raining 1056 00:50:15,540 --> 00:50:16,170 or not raining. 1057 00:50:16,170 --> 00:50:19,890 Or if I have a whole bunch of data on what counterfeit and not counterfeit US 1058 00:50:19,890 --> 00:50:23,640 dollar bills looked liked in the past when people have encountered them, 1059 00:50:23,640 --> 00:50:26,550 I'd like to train a computer to be able to in the future, 1060 00:50:26,550 --> 00:50:32,020 generalize to other dollar bills that I might see as well. 1061 00:50:32,020 --> 00:50:36,540 And the problem with overfitting is that if you try and tie yourself too closely 1062 00:50:36,540 --> 00:50:39,450 to the data set that you're training your model on you 1063 00:50:39,450 --> 00:50:42,043 can end up not generalizing very well. 1064 00:50:42,043 --> 00:50:43,210 So what does this look like. 1065 00:50:43,210 --> 00:50:46,750 Well, we might imagine the rainy day and not rainy day example again from here, 1066 00:50:46,750 --> 00:50:49,440 where the blue points indicate rainy days and the red points 1067 00:50:49,440 --> 00:50:50,970 indicate not rainy days. 1068 00:50:50,970 --> 00:50:53,400 And we decided that we felt pretty comfortable 1069 00:50:53,400 --> 00:50:58,230 with drawing a line like this as the decision boundary between rainy days 1070 00:50:58,230 --> 00:50:59,090 and not rainy days. 1071 00:50:59,090 --> 00:51:01,860 That we can pretty comfortably say that points on this side, 1072 00:51:01,860 --> 00:51:04,860 are more likely to be rainy days, points on that side, 1073 00:51:04,860 --> 00:51:06,870 more likely to be not rainy days. 1074 00:51:06,870 --> 00:51:11,430 But the empirical loss isn't 0 in this particular case, 1075 00:51:11,430 --> 00:51:14,200 because we didn't categorize everything perfectly. 1076 00:51:14,200 --> 00:51:17,640 There was this one outlier this one day that it wasn't raining, 1077 00:51:17,640 --> 00:51:20,580 but yet our model steel still predicts that it is raining. 1078 00:51:20,580 --> 00:51:22,710 But that doesn't necessarily mean our model is bad. 1079 00:51:22,710 --> 00:51:25,820 It just means the model isn't 100% accurate. 1080 00:51:25,820 --> 00:51:28,700 If you really wanted to try and find a hypothesis that 1081 00:51:28,700 --> 00:51:32,142 resulted in minimizing the loss, you could come up 1082 00:51:32,142 --> 00:51:33,600 with a different decision boundary. 1083 00:51:33,600 --> 00:51:37,100 It wouldn't be a line, but it would look something like this. 1084 00:51:37,100 --> 00:51:41,090 This decision boundary does separate all of the red points 1085 00:51:41,090 --> 00:51:44,780 from all of the blue points because the red points fall 1086 00:51:44,780 --> 00:51:47,390 on this side of this decision boundary, the blue points 1087 00:51:47,390 --> 00:51:49,700 fall on the other side of the decision boundary. 
1088 00:51:49,700 --> 00:51:54,530 But this, we would probably argue, is not as good of a prediction. 1089 00:51:54,530 --> 00:51:59,600 Even though it seems to be more accurate based on all of the available training 1090 00:51:59,600 --> 00:52:02,700 data that we have for training this machine learning model, 1091 00:52:02,700 --> 00:52:05,330 we might say that it's probably not going to generalize well. 1092 00:52:05,330 --> 00:52:07,730 That if there were other data points like here and there, 1093 00:52:07,730 --> 00:52:11,270 we might still want to consider those to be rainy days, because we think 1094 00:52:11,270 --> 00:52:13,770 this was probably just an outlier. 1095 00:52:13,770 --> 00:52:17,480 So if the only thing you care about is minimizing the loss on the data 1096 00:52:17,480 --> 00:52:20,340 you have available to you, you run the risk of overfitting. 1097 00:52:20,340 --> 00:52:22,520 And this can happen in the misclassification case. 1098 00:52:22,520 --> 00:52:25,460 It can also happen in the regression case, 1099 00:52:25,460 --> 00:52:28,790 that here, we predicted what we thought was a pretty good line relating 1100 00:52:28,790 --> 00:52:32,720 advertising to sales, trying to predict what sales were going to be for a given 1101 00:52:32,720 --> 00:52:33,757 amount of advertising. 1102 00:52:33,757 --> 00:52:36,590 But I could come up with a line that does a better job of predicting 1103 00:52:36,590 --> 00:52:39,680 the training data, and it would be something that looks like this, 1104 00:52:39,680 --> 00:52:42,620 just connecting all of the various different data points. 1105 00:52:42,620 --> 00:52:44,750 And now, there is no loss at all. 1106 00:52:44,750 --> 00:52:48,620 Now I've perfectly predicted given any advertising what sales are, 1107 00:52:48,620 --> 00:52:52,370 and for all the data available to me, it's going to be accurate. 1108 00:52:52,370 --> 00:52:55,010 But it's probably not going to generalize very well. 1109 00:52:55,010 --> 00:53:00,060 I have overfit my model on the training data that is available to me. 1110 00:53:00,060 --> 00:53:02,060 And so in general, we want to avoid overfitting. 1111 00:53:02,060 --> 00:53:04,880 We'd like strategies to make sure that we haven't 1112 00:53:04,880 --> 00:53:07,190 overfit our model to a particular data set. 1113 00:53:07,190 --> 00:53:09,810 And there are a number of ways that you could try to do this. 1114 00:53:09,810 --> 00:53:12,830 One way is by examining what it is that we're optimizing for. 1115 00:53:12,830 --> 00:53:17,060 In an optimization problem, all we do is we say there is some cost, 1116 00:53:17,060 --> 00:53:19,610 and I want to minimize that cost. 1117 00:53:19,610 --> 00:53:24,410 And so far, we've defined that cost function-- the cost of a hypothesis 1118 00:53:24,410 --> 00:53:28,400 just as being equal to the empirical loss of that hypothesis. 1119 00:53:28,400 --> 00:53:32,750 How far away are the actual data points, the outputs, from 1120 00:53:32,750 --> 00:53:36,400 what I predicted them to be based on that particular hypothesis. 1121 00:53:36,400 --> 00:53:38,150 And if all you're trying to do is minimize 1122 00:53:38,150 --> 00:53:40,910 cost, meaning minimizing the loss in this case, 1123 00:53:40,910 --> 00:53:44,030 then the result is going to be that you might overfit. 1124 00:53:44,030 --> 00:53:48,440 That to minimize cost, you're going to try and find a way to perfectly match 1125 00:53:48,440 --> 00:53:49,820 all of the input data.
1126 00:53:49,820 --> 00:53:54,020 And that might happen as a result of overfitting on that particular input 1127 00:53:54,020 --> 00:53:55,650 data. 1128 00:53:55,650 --> 00:53:59,660 So in order to address this, you could add something to the cost function. 1129 00:53:59,660 --> 00:54:00,880 What counts as cost? 1130 00:54:00,880 --> 00:54:04,190 Well, not just loss, but also, some measure 1131 00:54:04,190 --> 00:54:08,680 of the complexity of the hypothesis, where the complexity of the hypothesis 1132 00:54:08,680 --> 00:54:12,440 is something that you would need to define for how complicated 1133 00:54:12,440 --> 00:54:13,250 our line looks. 1134 00:54:13,250 --> 00:54:15,830 This is sort of an Occam's razor style approach, where 1135 00:54:15,830 --> 00:54:19,160 we want to give preference to a simpler decision boundary-- 1136 00:54:19,160 --> 00:54:20,990 like a straight line for example. 1137 00:54:20,990 --> 00:54:24,710 Some simpler curve as opposed to something far more complex 1138 00:54:24,710 --> 00:54:26,840 that might represent the training data better, 1139 00:54:26,840 --> 00:54:29,150 but might not generalize as well-- we'll generally 1140 00:54:29,150 --> 00:54:33,860 say that a simpler solution is probably the better solution and probably 1141 00:54:33,860 --> 00:54:38,460 the one that is more likely to generalize well to other inputs. 1142 00:54:38,460 --> 00:54:42,020 So we measure what the loss is, but we also measure the complexity. 1143 00:54:42,020 --> 00:54:45,770 And now that all gets taken into account when we consider the overall cost. 1144 00:54:45,770 --> 00:54:50,040 That yes, something might have less loss if it better predicts the training data, 1145 00:54:50,040 --> 00:54:52,940 but if it's much more complex, it still might not 1146 00:54:52,940 --> 00:54:55,230 be the best option that we have. 1147 00:54:55,230 --> 00:54:58,930 And we need to come up with some balance between loss and complexity. 1148 00:54:58,930 --> 00:55:01,180 And for that reason, you'll often see this represented 1149 00:55:01,180 --> 00:55:04,730 as multiplying the complexity by some parameter 1150 00:55:04,730 --> 00:55:08,480 that we have to choose-- parameter lambda in this case, where we're saying 1151 00:55:08,480 --> 00:55:11,300 if lambda has a greater value, then we really want 1152 00:55:11,300 --> 00:55:13,852 to penalize more complex hypotheses. 1153 00:55:13,852 --> 00:55:15,560 Whereas if lambda is smaller, we're going 1154 00:55:15,560 --> 00:55:18,470 to penalize more complex hypotheses only a little bit. 1155 00:55:18,470 --> 00:55:21,620 And it's up to the machine learning programmer 1156 00:55:21,620 --> 00:55:24,470 to decide where they want to set that value of lambda 1157 00:55:24,470 --> 00:55:28,430 for how much do I want to penalize a more complex hypothesis that 1158 00:55:28,430 --> 00:55:30,293 might fit the data a little better. 1159 00:55:30,293 --> 00:55:32,960 And again, there is no one right answer to a lot of these things; 1160 00:55:32,960 --> 00:55:36,362 depending on the data set, depending on the data you have available to you, 1161 00:55:36,362 --> 00:55:39,320 and the problem you're trying to solve, your choice of these parameters 1162 00:55:39,320 --> 00:55:39,868 may vary. 1163 00:55:39,868 --> 00:55:41,660 And you may need to experiment a little bit 1164 00:55:41,660 --> 00:55:45,800 to figure out what the right choice of that is ultimately going to be.
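As a rough sketch of that idea, one might score hypotheses with cost = loss + lambda * complexity; the data, the choice of polynomial degree as a stand-in for complexity, and the value of lambda below are all made up for illustration.

    def squared_loss(h, data):
        # Empirical loss: sum of squared errors of the hypothesis on the data.
        return sum((y - h(x)) ** 2 for x, y in data)

    def cost(h, data, degree, lam):
        # cost(h) = loss(h) + lambda * complexity(h), using degree as a crude complexity measure.
        return squared_loss(h, data) + lam * degree

    data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]   # made-up points near the line y = x

    def simple(x):
        # A plain straight line, y = x.
        return x

    def memorized(x):
        # A hypothesis that "memorizes" every training point exactly.
        return {1: 1.1, 2: 1.9, 3: 3.2, 4: 3.9}[x]

    print(cost(simple, data, degree=1, lam=1.0))      # small loss plus a small complexity penalty
    print(cost(memorized, data, degree=10, lam=1.0))  # zero loss, but the complexity penalty makes it worse overall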
1165 00:55:45,800 --> 00:55:49,100 This process then of considering not only a loss, but also 1166 00:55:49,100 --> 00:55:53,000 some measure of the complexity is known as regularization. 1167 00:55:53,000 --> 00:55:55,610 Regularization is the process of penalizing 1168 00:55:55,610 --> 00:55:58,260 a hypothesis that is more complex. 1169 00:55:58,260 --> 00:56:00,950 In order to favor a simple or hypothesis that 1170 00:56:00,950 --> 00:56:03,680 is more likely to generalize well-- more likely to be 1171 00:56:03,680 --> 00:56:08,210 able to apply to other situations that are dealing with other input points 1172 00:56:08,210 --> 00:56:11,540 unlike the ones that we've necessarily seen before. 1173 00:56:11,540 --> 00:56:15,500 So oftentimes, you'll see us add some regularizing term 1174 00:56:15,500 --> 00:56:17,570 to what we're trying to minimize it in order 1175 00:56:17,570 --> 00:56:21,220 to avoid this problem of overfitting. 1176 00:56:21,220 --> 00:56:25,780 Now another way of making sure we don't overfit is to run some experiments 1177 00:56:25,780 --> 00:56:29,950 and to see whether or not we are able to generalize our model that we've 1178 00:56:29,950 --> 00:56:33,250 created to other data sets as well. 1179 00:56:33,250 --> 00:56:35,320 And it's for that reason that oftentimes, 1180 00:56:35,320 --> 00:56:37,800 when you're doing a machine learning experiment, when you've got some data 1181 00:56:37,800 --> 00:56:40,717 and you want to try and come up with some function that predicts given 1182 00:56:40,717 --> 00:56:44,230 some input, what the output is going to be, you don't necessarily 1183 00:56:44,230 --> 00:56:48,340 want to do your training on all of the data you have available to you. 1184 00:56:48,340 --> 00:56:52,410 That you could employ a method known as holdout cross-validation. 1185 00:56:52,410 --> 00:56:55,480 Where in holdout cross-validation, we split up our data. 1186 00:56:55,480 --> 00:57:00,373 We split up our data into a training set and a testing set. 1187 00:57:00,373 --> 00:57:02,290 The training set is the set of data that we're 1188 00:57:02,290 --> 00:57:04,870 going to use to train our machine learning model. 1189 00:57:04,870 --> 00:57:07,440 And the testing set is the set of data that we 1190 00:57:07,440 --> 00:57:11,260 are going to use in order to test to see how well our machine learning 1191 00:57:11,260 --> 00:57:13,660 model actually performed. 1192 00:57:13,660 --> 00:57:15,757 So the learning happens on the training set. 1193 00:57:15,757 --> 00:57:17,590 We figure out what the parameters should be, 1194 00:57:17,590 --> 00:57:19,660 we figure out what the right model is. 1195 00:57:19,660 --> 00:57:22,360 And that we see, all right, now that we've trained the model, 1196 00:57:22,360 --> 00:57:25,960 see how well it does at predicting things and inside of the testing 1197 00:57:25,960 --> 00:57:29,342 set, some set of data that we haven't seen before. 1198 00:57:29,342 --> 00:57:32,300 And the hope then is that we're going to be able to predict the testing 1199 00:57:32,300 --> 00:57:36,460 set pretty well if we're able to generalize based on the training 1200 00:57:36,460 --> 00:57:38,028 data that's available to us. 
1201 00:57:38,028 --> 00:57:39,820 If we've overfit the training data, though, 1202 00:57:39,820 --> 00:57:43,112 and we're not able to generalize, then when we look at the testing set, 1203 00:57:43,112 --> 00:57:45,070 it's likely going to be the case that we're not 1204 00:57:45,070 --> 00:57:49,100 going to predict things from the testing set nearly as effectively. 1205 00:57:49,100 --> 00:57:51,850 So this is one method of cross-validation-- validating 1206 00:57:51,850 --> 00:57:55,270 to make sure that the work we have done is actually going to generalize 1207 00:57:55,270 --> 00:57:56,770 to other data sets as well. 1208 00:57:56,770 --> 00:57:59,620 And there are other statistical techniques we can use, as well. 1209 00:57:59,620 --> 00:58:03,520 One of the downsides of this just holdout cross-validation is if you say, 1210 00:58:03,520 --> 00:58:07,960 I just split it 50/50, I train using 50% of the data and test using 1211 00:58:07,960 --> 00:58:11,110 the other 50%, or you could choose other percentages as well, 1212 00:58:11,110 --> 00:58:15,190 is that there is a fair amount of data that I am now not using 1213 00:58:15,190 --> 00:58:19,670 to train that I might be able to get a better model as a result, for example. 1214 00:58:19,670 --> 00:58:23,500 So one approach is known as k-fold cross-validation. 1215 00:58:23,500 --> 00:58:26,830 In k-fold cross-validation, rather than just divide things 1216 00:58:26,830 --> 00:58:31,980 into two sets and run one experiment, we divide things into k different sets 1217 00:58:31,980 --> 00:58:34,810 and maybe I divide things up into 10 different sets, 1218 00:58:34,810 --> 00:58:37,270 and then run 10 different experiments. 1219 00:58:37,270 --> 00:58:40,750 So if I split up my data into 10 different sets of data, 1220 00:58:40,750 --> 00:58:44,410 then what I'll do is each time for each of my 10 experiments, 1221 00:58:44,410 --> 00:58:47,020 I will hold out one of those sets of data, 1222 00:58:47,020 --> 00:58:50,320 where I'll say, let me train my model on these nine sets, 1223 00:58:50,320 --> 00:58:54,070 and then test to see how well it predicts on set number 10. 1224 00:58:54,070 --> 00:58:57,490 And then pick another set of nine sets to train on, and then 1225 00:58:57,490 --> 00:58:59,320 test it on the other one that I held out, 1226 00:58:59,320 --> 00:59:03,550 where each time, I train the model on everything minus the one set 1227 00:59:03,550 --> 00:59:07,300 that I'm holding out, and then test to see how well our model performs 1228 00:59:07,300 --> 00:59:09,100 on the test that I did hold out. 1229 00:59:09,100 --> 00:59:12,250 And what you end up getting is 10 different results, 10 different answers 1230 00:59:12,250 --> 00:59:14,470 for how accurately our model worked. 1231 00:59:14,470 --> 00:59:16,870 And oftentimes, you can just take the average of those 10 1232 00:59:16,870 --> 00:59:21,160 to get an approximation for how well we think our model performs overall. 1233 00:59:21,160 --> 00:59:25,270 But the key idea is separating the training data from the testing data, 1234 00:59:25,270 --> 00:59:29,170 because you want to test your model on data that is different from what 1235 00:59:29,170 --> 00:59:30,273 you trained the model on. 1236 00:59:30,273 --> 00:59:32,440 Because the training, you want to avoid overfitting, 1237 00:59:32,440 --> 00:59:33,940 you want to be able to generalize. 
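With scikit-learn, a sketch of k-fold cross-validation might look like the following; the data here is made up, and cross_val_score takes care of the splitting, training, and testing for each of the k folds.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Made-up inputs and labels, standing in for something like the weather data.
    X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5], [0.5, 0.5], [4.5, 4.5]]
    y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

    model = KNeighborsClassifier(n_neighbors=1)

    # Run 5 experiments: each time train on 4/5 of the data and test on the held-out 1/5.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores)          # 5 accuracy values, one per held-out fold
    print(scores.mean())   # the average accuracy across the folds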
1238 00:59:33,940 --> 00:59:36,190 And the way you test whether you're able to generalize 1239 00:59:36,190 --> 00:59:39,580 and is by looking at some data that you haven't seen before 1240 00:59:39,580 --> 00:59:43,380 and seeing how well we're actually able to perform. 1241 00:59:43,380 --> 00:59:46,010 And so if we want to actually implement any of these techniques 1242 00:59:46,010 --> 00:59:48,980 inside of a programming language like Python, a number of ways 1243 00:59:48,980 --> 00:59:49,760 we could do that. 1244 00:59:49,760 --> 00:59:52,007 We could write this from scratch on our own, 1245 00:59:52,007 --> 00:59:53,840 but there are libraries out there that allow 1246 00:59:53,840 --> 00:59:57,290 us to take advantage of existing implementations of these algorithms-- 1247 00:59:57,290 --> 01:00:00,080 that we can use the same types of algorithms 1248 01:00:00,080 --> 01:00:01,950 in a lot of different situations. 1249 01:00:01,950 --> 01:00:03,770 And so there is a library, very popular one 1250 01:00:03,770 --> 01:00:07,240 known as scikit-learn, which allows us in Python to be 1251 01:00:07,240 --> 01:00:10,490 able to very quickly get set up with a lot of these different machine learning 1252 01:00:10,490 --> 01:00:10,990 models. 1253 01:00:10,990 --> 01:00:14,210 So this library has already written an algorithm for nearest neighbor 1254 01:00:14,210 --> 01:00:16,430 classification, for doing perceptron learning, 1255 01:00:16,430 --> 01:00:19,035 for doing a bunch of other types of inference 1256 01:00:19,035 --> 01:00:21,410 and supervised learning that we haven't yet talked about. 1257 01:00:21,410 --> 01:00:26,840 But using it, we can begin to try actually testing how these methods work 1258 01:00:26,840 --> 01:00:29,330 and how accurately they perform. 1259 01:00:29,330 --> 01:00:31,900 So let's go ahead and take a look at one approach to trying 1260 01:00:31,900 --> 01:00:33,920 to solve this type of problem. 1261 01:00:33,920 --> 01:00:37,280 All right, so I'm first going to pull up banknotes.csv, 1262 01:00:37,280 --> 01:00:40,970 which is a whole bunch of data provided by UC Irvine, which has information 1263 01:00:40,970 --> 01:00:43,160 about various different banknotes. 1264 01:00:43,160 --> 01:00:45,440 So people took pictures of various different banknotes 1265 01:00:45,440 --> 01:00:48,530 and measured various different properties of those banknotes. 1266 01:00:48,530 --> 01:00:51,350 And in particular, some human categorized each of those 1267 01:00:51,350 --> 01:00:55,810 banknotes as either a counterfeit bank note or as not counterfeit. 1268 01:00:55,810 --> 01:00:59,540 And so what you're looking at here is each row represents one banknote. 1269 01:00:59,540 --> 01:01:01,910 This is formatted as a CSV spreadsheet, where just 1270 01:01:01,910 --> 01:01:05,780 comma-separated value separating each of these various different fields. 1271 01:01:05,780 --> 01:01:08,990 We have four different input values for each 1272 01:01:08,990 --> 01:01:12,410 of these data points, just information, some measurement that 1273 01:01:12,410 --> 01:01:13,460 was made on the banknote. 1274 01:01:13,460 --> 01:01:16,340 And what those measurements exactly aren't as important as the fact 1275 01:01:16,340 --> 01:01:18,320 that we do have access to this data. 
1276 01:01:18,320 --> 01:01:21,950 But more importantly, we have access for each of these data points 1277 01:01:21,950 --> 01:01:26,210 to a label, where 0 indicates something like this was not a counterfeit bill, 1278 01:01:26,210 --> 01:01:27,920 meaning it was an authentic bill. 1279 01:01:27,920 --> 01:01:32,840 And a data point labeled 1 means that it is a counterfeit bill at least, 1280 01:01:32,840 --> 01:01:36,240 according to the human researcher who labeled this particular data. 1281 01:01:36,240 --> 01:01:38,360 So we have a whole bunch of data representing 1282 01:01:38,360 --> 01:01:40,940 a whole bunch of different data points, each of which 1283 01:01:40,940 --> 01:01:44,360 has these various different measurements that were made on that particular bill. 1284 01:01:44,360 --> 01:01:47,300 And each of which has an output value 0 or 1-- 1285 01:01:47,300 --> 01:01:53,033 0 meaning it was a genuine bill, 1 meaning it was a counterfeit bill. 1286 01:01:53,033 --> 01:01:54,950 And what we would like to do is use supervised 1287 01:01:54,950 --> 01:01:58,550 learning to begin to predict or model some sort of function 1288 01:01:58,550 --> 01:02:02,570 that can take these four values as input and predict what the output would be. 1289 01:02:02,570 --> 01:02:05,777 We want our learning algorithm to find some sort of pattern that 1290 01:02:05,777 --> 01:02:08,110 is able to predict based on these measurements something 1291 01:02:08,110 --> 01:02:10,690 that you could measure just by taking a photo of a bill-- 1292 01:02:10,690 --> 01:02:16,090 predict whether that bill is authentic or whether that bill is counterfeit. 1293 01:02:16,090 --> 01:02:17,730 And so how can we do that? 1294 01:02:17,730 --> 01:02:21,020 Well, I'm first going to open up banknotes0.py and see 1295 01:02:21,020 --> 01:02:22,410 how it is that we do this. 1296 01:02:22,410 --> 01:02:26,000 I'm first importing a lot of things from scikit-learn, 1297 01:02:26,000 --> 01:02:29,753 but importantly, I'm going to set my model equal to the perceptron 1298 01:02:29,753 --> 01:02:32,420 model, which is one of those models that we talked about before. 1299 01:02:32,420 --> 01:02:35,120 We're just going to try and figure out some setting of weights 1300 01:02:35,120 --> 01:02:38,950 that is able to divide our data into two different groups. 1301 01:02:38,950 --> 01:02:43,270 Then I'm going to go ahead and read data in from my file from banknotes.csv. 1302 01:02:43,270 --> 01:02:46,660 And basically, for every row, I'm going to separate that row 1303 01:02:46,660 --> 01:02:51,460 into the first four values of that row, which is the evidence for that row. 1304 01:02:51,460 --> 01:02:56,920 And then the label where if the final column in that row is 0, 1305 01:02:56,920 --> 01:03:00,770 the label is authentic, and otherwise, it's going to be counterfeit. 1306 01:03:00,770 --> 01:03:03,880 So I'm effectively reading data in from the CSV file, 1307 01:03:03,880 --> 01:03:07,330 dividing it into a whole bunch of rows, where each row has some evidence-- 1308 01:03:07,330 --> 01:03:11,380 those four input values that are going to be inputs to my hypothesis function. 1309 01:03:11,380 --> 01:03:14,710 And then the label, the output, whether it is authentic or counterfeit. 1310 01:03:14,710 --> 01:03:17,270 That is the thing that I am then trying to predict. 
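The file banknotes0.py itself isn't reproduced in this transcript, but a rough approximation of the reading step being described might look like the following; the column layout and label encoding follow the description above, while details such as whether the CSV has a header row are assumptions.

    import csv

    from sklearn.linear_model import Perceptron

    # The perceptron model described above.
    model = Perceptron()

    # Read the data in: four measurements per row as the evidence, then a 0/1 label,
    # where 0 is treated as authentic and 1 as counterfeit.
    # (Assumes no header row; a real file might need one skipped first.)
    data = []
    with open("banknotes.csv") as f:
        for row in csv.reader(f):
            data.append({
                "evidence": [float(cell) for cell in row[:4]],
                "label": "Authentic" if row[4] == "0" else "Counterfeit"
            })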
1311 01:03:17,270 --> 01:03:20,650 So the next step is that I would like to split up my data set into a training 1312 01:03:20,650 --> 01:03:22,990 set and a testing set-- some set of data 1313 01:03:22,990 --> 01:03:26,260 that I would like to train my machine learning model on and some set of data 1314 01:03:26,260 --> 01:03:29,500 that I would like to use to test that model, see how well it performed. 1315 01:03:29,500 --> 01:03:32,480 So what I'll do is I'll go ahead and figure out the length of the data, 1316 01:03:32,480 --> 01:03:34,150 how many data points do I have. 1317 01:03:34,150 --> 01:03:35,925 I'll go ahead and take half of them, save 1318 01:03:35,925 --> 01:03:37,550 that number as a number called holdout. 1319 01:03:37,550 --> 01:03:39,930 That is how many items I'm going to hold out from my data 1320 01:03:39,930 --> 01:03:42,400 set to save for the testing phase. 1321 01:03:42,400 --> 01:03:45,250 I'll randomly shuffle the data so it's in some random order. 1322 01:03:45,250 --> 01:03:50,440 And then I'll say my testing set will be all of the data up to the holdout. 1323 01:03:50,440 --> 01:03:54,790 So I'll hold out that many data items, and that will be my testing set. 1324 01:03:54,790 --> 01:03:58,060 My training data will be everything else-- the information 1325 01:03:58,060 --> 01:04:00,850 that I'm going to train my model on. 1326 01:04:00,850 --> 01:04:04,870 And then I'll say, I need to divide up my training data 1327 01:04:04,870 --> 01:04:06,040 into two different sets. 1328 01:04:06,040 --> 01:04:10,750 I need to divide it into my x values, where x here represents the inputs. 1329 01:04:10,750 --> 01:04:14,260 So the x values then that I'm going to train on are basically, 1330 01:04:14,260 --> 01:04:16,330 for every row in my training set, I'm going 1331 01:04:16,330 --> 01:04:19,150 to get the evidence for that row, those four values, 1332 01:04:19,150 --> 01:04:21,940 where it's basically a vector of four numbers, and 1333 01:04:21,940 --> 01:04:24,010 that is going to be all of the input. 1334 01:04:24,010 --> 01:04:27,460 And then I need the y values-- what are the outputs that I want to learn from, 1335 01:04:27,460 --> 01:04:30,567 the labels that belong to each of these various different input points. 1336 01:04:30,567 --> 01:04:33,650 Well, that's going to be the same thing for each row in the training data. 1337 01:04:33,650 --> 01:04:36,430 But this time, I take that row and get what it's labeled as, 1338 01:04:36,430 --> 01:04:38,680 whether it is authentic or counterfeit. 1339 01:04:38,680 --> 01:04:43,270 So I end up with one list of all of these vectors of my input data 1340 01:04:43,270 --> 01:04:45,790 and one list which follows the same order, 1341 01:04:45,790 --> 01:04:49,690 but has all of the labels that correspond with each of those vectors. 1342 01:04:49,690 --> 01:04:51,610 And then to train my model, which in this case 1343 01:04:51,610 --> 01:04:54,190 is just this perceptron model, I just call 1344 01:04:54,190 --> 01:04:57,370 model.fit, pass in the training data, and what 1345 01:04:57,370 --> 01:04:59,710 the labels for those training data are. 1346 01:04:59,710 --> 01:05:02,120 And scikit-learn will take care of fitting the model-- 1347 01:05:02,120 --> 01:05:04,250 will do the entire algorithm for me. 1348 01:05:04,250 --> 01:05:08,320 And then when it's done, I can then test to see how well that model performed.
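[Code sketch: an approximation of the by-hand split and training just described, reusing the `data` list from the previous sketch; the variable names are assumptions, not the course's exact code.]

    import random
    from sklearn.linear_model import Perceptron

    # `data` is the list of {"evidence": [...], "label": ...} dictionaries
    # built in the previous sketch.
    model = Perceptron()

    # Hold out half of the data points for the testing phase.
    holdout = int(0.50 * len(data))
    random.shuffle(data)
    testing = data[:holdout]
    training = data[holdout:]

    # Separate the training data into input vectors (X) and labels (y).
    X_training = [row["evidence"] for row in training]
    y_training = [row["label"] for row in training]

    # scikit-learn takes care of fitting the model to the training data.
    model.fit(X_training, y_training)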
1349 01:05:08,320 --> 01:05:10,780 So I can say, let me get all of these input 1350 01:05:10,780 --> 01:05:13,040 vectors for what I want to test on. 1351 01:05:13,040 --> 01:05:16,870 So for each row in my testing data set, go ahead and get the evidence. 1352 01:05:16,870 --> 01:05:20,470 And the y values, those are what the actual values were-- 1353 01:05:20,470 --> 01:05:24,610 for each of the rows in the testing data set, what the actual label is. 1354 01:05:24,610 --> 01:05:26,860 But then I'm going to generate some predictions. 1355 01:05:26,860 --> 01:05:29,350 I'm going to use this model and try and predict-- 1356 01:05:29,350 --> 01:05:31,480 based on the testing vectors-- 1357 01:05:31,480 --> 01:05:33,910 I want to predict what the output is. 1358 01:05:33,910 --> 01:05:38,230 And my goal then is to now compare y testing with predictions. 1359 01:05:38,230 --> 01:05:41,440 I want to see how well my predictions based on the model 1360 01:05:41,440 --> 01:05:45,310 actually reflect what the y values were, what the outputs were 1361 01:05:45,310 --> 01:05:46,540 that were actually labeled. 1362 01:05:46,540 --> 01:05:48,580 Because I now have this labeled data, I can 1363 01:05:48,580 --> 01:05:51,540 assess how well the algorithm worked. 1364 01:05:51,540 --> 01:05:55,330 And so now I can just compute how well we did. 1365 01:05:55,330 --> 01:05:58,840 This zip function basically just lets me loop through two different lists, one 1366 01:05:58,840 --> 01:06:00,500 by one at the same time. 1367 01:06:00,500 --> 01:06:04,113 So for each actual value and for each predicted value, 1368 01:06:04,113 --> 01:06:06,280 if the actual is the same thing as what I predicted, 1369 01:06:06,280 --> 01:06:08,570 I'll go ahead and increment the counter by 1. 1370 01:06:08,570 --> 01:06:11,727 Otherwise, I'll increment my incorrect counter by 1. 1371 01:06:11,727 --> 01:06:14,060 And so at the end, I can print out here are the results, 1372 01:06:14,060 --> 01:06:16,435 here's how many I got right, here's how many I got wrong. 1373 01:06:16,435 --> 01:06:19,660 And here was my overall accuracy, for example. 1374 01:06:19,660 --> 01:06:21,100 So I can go ahead and run this. 1375 01:06:21,100 --> 01:06:24,790 I can run Python banknotes0.py. 1376 01:06:24,790 --> 01:06:28,810 And it's going to train on half the data set and then test on half the data set. 1377 01:06:28,810 --> 01:06:31,090 And here are the results from my perceptron model. 1378 01:06:31,090 --> 01:06:35,540 In this case, it was able to correctly classify 679 bills 1379 01:06:35,540 --> 01:06:38,170 as either authentic or counterfeit, 1380 01:06:38,170 --> 01:06:43,580 and incorrectly classified seven of them, for an overall accuracy of close to 99% 1381 01:06:43,580 --> 01:06:44,080 accurate. 1382 01:06:44,080 --> 01:06:47,230 So on this particular data set, using this perceptron model, 1383 01:06:47,230 --> 01:06:51,310 we were able to predict very well what the output was going to be. 1384 01:06:51,310 --> 01:06:53,010 And we can try different models, too. 1385 01:06:53,010 --> 01:06:55,390 That scikit-learn makes it very easy just to swap out 1386 01:06:55,390 --> 01:06:57,970 one model for another model.
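[Code sketch: roughly what that prediction-and-counting step might look like, continuing the same sketch, before other models are swapped in.]

    # Separate the testing data the same way, then ask the model for predictions.
    X_testing = [row["evidence"] for row in testing]
    y_testing = [row["label"] for row in testing]
    predictions = model.predict(X_testing)

    # Compare each actual label against the corresponding prediction.
    correct = 0
    incorrect = 0
    for actual, predicted in zip(y_testing, predictions):
        if actual == predicted:
            correct += 1
        else:
            incorrect += 1

    print(f"Results for model {type(model).__name__}")
    print(f"Correct: {correct}")
    print(f"Incorrect: {incorrect}")
    print(f"Accuracy: {100 * correct / (correct + incorrect):.2f}%")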
1387 01:06:57,970 --> 01:07:02,710 So instead of the perceptron model, I can use a support vector machine 1388 01:07:02,710 --> 01:07:06,490 using the SVC, otherwise known as a support vector classifier, 1389 01:07:06,490 --> 01:07:08,950 using a support vector machine to classify things 1390 01:07:08,950 --> 01:07:10,820 into two different groups. 1391 01:07:10,820 --> 01:07:14,270 And now see, all right, how well does this perform. 1392 01:07:14,270 --> 01:07:17,630 And this time, we were able to correctly predict 682 1393 01:07:17,630 --> 01:07:22,280 and incorrectly predicted four, for an accuracy of 99.4%. 1394 01:07:22,280 --> 01:07:26,990 And we could even try the KNeighborsClassifier as the model 1395 01:07:26,990 --> 01:07:27,740 instead. 1396 01:07:27,740 --> 01:07:31,270 And this takes a parameter n_neighbors for how many neighbors 1397 01:07:31,270 --> 01:07:32,133 you want to look at. 1398 01:07:32,133 --> 01:07:34,550 Let's just look at one neighbor, the one nearest neighbor, 1399 01:07:34,550 --> 01:07:36,080 and use that to predict. 1400 01:07:36,080 --> 01:07:38,250 Go ahead and run this as well. 1401 01:07:38,250 --> 01:07:40,880 And it looks like, based on the KNeighborsClassifier looking 1402 01:07:40,880 --> 01:07:45,350 at just one neighbor, we were able to correctly classify 685 data points, 1403 01:07:45,350 --> 01:07:47,420 incorrectly classified one. 1404 01:07:47,420 --> 01:07:50,560 Maybe let's try three neighbors instead of just using one neighbor, 1405 01:07:50,560 --> 01:07:52,560 do more of a k-nearest-neighbors approach, where 1406 01:07:52,560 --> 01:07:55,693 I look at the three nearest neighbors and see how that performs. 1407 01:07:55,693 --> 01:07:57,610 And that one in this case seems to have gotten 1408 01:07:57,610 --> 01:08:05,330 100% of all of the predictions correct, classifying each one as either an authentic banknote 1409 01:08:05,330 --> 01:08:07,370 or as a counterfeit banknote. 1410 01:08:07,370 --> 01:08:09,470 And we could run these experiments multiple times. 1411 01:08:09,470 --> 01:08:12,082 Because I'm randomly reorganizing the data every time, 1412 01:08:12,082 --> 01:08:14,790 we're technically training these on slightly different data sets. 1413 01:08:14,790 --> 01:08:16,832 And so you might want to run multiple experiments 1414 01:08:16,832 --> 01:08:19,250 to really see how well they're actually going to perform. 1415 01:08:19,250 --> 01:08:20,963 But in short, they all perform very well. 1416 01:08:20,963 --> 01:08:23,630 And while some of them perform slightly better than others here, 1417 01:08:23,630 --> 01:08:26,240 that might not always be the case for every data set, 1418 01:08:26,240 --> 01:08:29,240 but you can begin to test now by very quickly putting together 1419 01:08:29,240 --> 01:08:31,790 these machine learning models using scikit-learn 1420 01:08:31,790 --> 01:08:33,920 to be able to train on some training set, 1421 01:08:33,920 --> 01:08:36,814 and then test on some testing set as well. 1422 01:08:36,814 --> 01:08:39,439 And this splitting up into training groups and testing groups, 1423 01:08:39,439 --> 01:08:43,279 and then testing, happens so often that scikit-learn has functions built in 1424 01:08:43,279 --> 01:08:44,160 for doing it. 1425 01:08:44,160 --> 01:08:46,100 I did it all by hand just now.
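[Code sketch: because scikit-learn gives these classifiers the same fit/predict interface, swapping models is roughly a one-line change; reassigning `model` repeatedly here is just to list the options described above.]

    from sklearn.linear_model import Perceptron
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Any one of these could serve as `model` in the sketch above;
    # everything else in the script stays the same.
    model = Perceptron()
    model = SVC()                                # support vector classifier
    model = KNeighborsClassifier(n_neighbors=1)  # 1 nearest neighbor
    model = KNeighborsClassifier(n_neighbors=3)  # 3 nearest neighbors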
1426 01:08:46,100 --> 01:08:48,770 But if we take a look at banknotes1, we take 1427 01:08:48,770 --> 01:08:52,100 advantage of some other features that exist in scikit-learn, 1428 01:08:52,100 --> 01:08:55,399 where we can really simplify a lot of our logic. 1429 01:08:55,399 --> 01:08:59,609 That there is a function built into scikit-learn called train_test_split, 1430 01:08:59,609 --> 01:09:03,140 which will automatically split data into a training group and a testing group. 1431 01:09:03,140 --> 01:09:05,536 I just have to say what proportion should 1432 01:09:05,536 --> 01:09:07,369 be in the testing group, something like 0.5, 1433 01:09:07,369 --> 01:09:10,080 half the data, inside the testing group. 1434 01:09:10,080 --> 01:09:12,380 Then I can fit the model on the training data, 1435 01:09:12,380 --> 01:09:15,859 make the predictions on the testing data, and then just count up. 1436 01:09:15,859 --> 01:09:18,830 And scikit-learn has some nice methods for just counting up 1437 01:09:18,830 --> 01:09:22,250 how many times our testing data matched the predictions, how 1438 01:09:22,250 --> 01:09:25,439 many times our testing data didn't match the predictions. 1439 01:09:25,439 --> 01:09:27,680 So very quickly, you can write programs with not all 1440 01:09:27,680 --> 01:09:30,500 that many lines of code-- it's maybe, like, 40 lines of code 1441 01:09:30,500 --> 01:09:32,540 to get through all of these predictions. 1442 01:09:32,540 --> 01:09:35,420 And then as a result, see how well we're able to do. 1443 01:09:35,420 --> 01:09:38,330 So these types of libraries can allow us without really 1444 01:09:38,330 --> 01:09:40,970 knowing the implementation details of these algorithms 1445 01:09:40,970 --> 01:09:43,939 to be able to use the algorithms in a very practical way 1446 01:09:43,939 --> 01:09:47,279 to be able to solve these types of problems. 1447 01:09:47,279 --> 01:09:49,939 So that then is supervised learning-- this task 1448 01:09:49,939 --> 01:09:52,972 of, given a whole set of data, some input-output pairs, 1449 01:09:52,972 --> 01:09:54,680 we would like to learn some function that 1450 01:09:54,680 --> 01:09:57,110 maps those inputs to those outputs. 1451 01:09:57,110 --> 01:09:59,630 But it turns out there are other forms of learning, as well. 1452 01:09:59,630 --> 01:10:02,870 And another popular type of machine learning, especially nowadays, 1453 01:10:02,870 --> 01:10:05,090 is known as reinforcement learning. 1454 01:10:05,090 --> 01:10:07,250 And the idea of reinforcement learning is 1455 01:10:07,250 --> 01:10:09,140 rather than just being given a whole data set 1456 01:10:09,140 --> 01:10:12,470 at the beginning of input-output pairs, reinforcement learning 1457 01:10:12,470 --> 01:10:14,680 is all about learning from experience. 1458 01:10:14,680 --> 01:10:17,390 And in reinforcement learning, we have an agent, whether it's 1459 01:10:17,390 --> 01:10:19,550 a physical robot that's trying to take actions 1460 01:10:19,550 --> 01:10:23,750 in the world or just some virtual agent, a program running somewhere. 1461 01:10:23,750 --> 01:10:27,560 Our agent is going to be given a set of rewards or punishments 1462 01:10:27,560 --> 01:10:29,510 in the form of numerical values, but you can 1463 01:10:29,510 --> 01:10:31,430 think of them as reward or punishment. 1464 01:10:31,430 --> 01:10:35,480 And based on that, it learns what actions to take in the future. 1465 01:10:35,480 --> 01:10:39,470 So our agent, our AI, will be put in some sort of environment.
1466 01:10:39,470 --> 01:10:42,537 It will make some actions and based on the actions that it makes, 1467 01:10:42,537 --> 01:10:43,370 it learns something. 1468 01:10:43,370 --> 01:10:45,537 It either gets a reward when it does something well, 1469 01:10:45,537 --> 01:10:47,720 or it gets a punishment when it does something poorly. 1470 01:10:47,720 --> 01:10:52,130 And it learns what to do or what not to do in the future based 1471 01:10:52,130 --> 01:10:54,980 on those individual experiences. 1472 01:10:54,980 --> 01:10:58,700 And so what this will often look like is it will often start with some agent, 1473 01:10:58,700 --> 01:11:01,070 some AI, which might again, be a physical robot-- 1474 01:11:01,070 --> 01:11:03,360 if you're imagining a physical robot moving around-- 1475 01:11:03,360 --> 01:11:05,180 but it can also just be a program. 1476 01:11:05,180 --> 01:11:08,210 And our agent is situated in their environment, 1477 01:11:08,210 --> 01:11:11,120 where the environment is where they're going to make their actions. 1478 01:11:11,120 --> 01:11:13,820 And it's what's going to give them rewards or punishments 1479 01:11:13,820 --> 01:11:16,070 for the various actions that they take. 1480 01:11:16,070 --> 01:11:19,220 So for example, the environment is going to start off 1481 01:11:19,220 --> 01:11:21,980 by putting our agent inside of a state. 1482 01:11:21,980 --> 01:11:25,640 Our agent has some state that, in a game, might be the state of the game 1483 01:11:25,640 --> 01:11:28,730 that the agent is playing; in a world that the agent is exploring, 1484 01:11:28,730 --> 01:11:31,375 it might be some position inside of a grid representing 1485 01:11:31,375 --> 01:11:32,750 the world that they're exploring. 1486 01:11:32,750 --> 01:11:35,090 But the agent is in some sort of state. 1487 01:11:35,090 --> 01:11:39,140 And in that state, the agent needs to choose to take an action. 1488 01:11:39,140 --> 01:11:41,760 The agent likely has multiple actions they can choose from, 1489 01:11:41,760 --> 01:11:43,310 but they pick an action. 1490 01:11:43,310 --> 01:11:46,310 So they take an action in a particular state. 1491 01:11:46,310 --> 01:11:51,170 And as a result of that, the agent will generally get two things in response 1492 01:11:51,170 --> 01:11:52,040 as we model them. 1493 01:11:52,040 --> 01:11:54,770 The agent gets a new state that they find themselves in. 1494 01:11:54,770 --> 01:11:57,110 After being in this state taking one action, 1495 01:11:57,110 --> 01:11:59,210 they end up in some other state. 1496 01:11:59,210 --> 01:12:02,360 And they're also given some sort of numerical reward-- 1497 01:12:02,360 --> 01:12:05,660 positive meaning reward, meaning it was a good thing. 1498 01:12:05,660 --> 01:12:08,000 Negative generally meaning they did something bad, 1499 01:12:08,000 --> 01:12:10,310 they received some sort of punishment. 1500 01:12:10,310 --> 01:12:13,190 And that is all the information the agent has. 1501 01:12:13,190 --> 01:12:15,260 It's told what state it's in. 1502 01:12:15,260 --> 01:12:17,150 It makes some sort of action. 1503 01:12:17,150 --> 01:12:19,160 And based on that, it ends up in another state, 1504 01:12:19,160 --> 01:12:21,660 and it ends up getting some particular reward. 1505 01:12:21,660 --> 01:12:24,110 And it needs to learn based on that information what 1506 01:12:24,110 --> 01:12:26,770 actions to begin to take in the future.
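[Code sketch: the state-action-reward loop just described, written out with a hypothetical environment and agent interface; initial_state, actions, step, terminal, choose_action, and update are illustrative names, not a real library API.]

    def run_episode(environment, agent):
        """One episode of agent-environment interaction (hypothetical interface)."""
        state = environment.initial_state()
        while not environment.terminal(state):
            # The agent picks one of the actions available in this state.
            action = agent.choose_action(state, environment.actions(state))

            # The environment responds with a new state and a numerical reward
            # (positive for a reward, negative for a punishment).
            new_state, reward = environment.step(state, action)

            # The agent learns from that one experience, then continues.
            agent.update(state, action, new_state, reward)
            state = new_state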
1507 01:12:26,770 --> 01:12:30,090 As you can imagine generalizing this to a lot of different situations, 1508 01:12:30,090 --> 01:12:31,685 this is oftentimes how you train an agent. 1509 01:12:31,685 --> 01:12:33,560 If you've ever seen those robots that are now 1510 01:12:33,560 --> 01:12:36,140 able to walk around sort of the way humans do, 1511 01:12:36,140 --> 01:12:39,500 it would be quite difficult to program the robot in exactly the right way 1512 01:12:39,500 --> 01:12:41,330 to get it to walk the way humans do. 1513 01:12:41,330 --> 01:12:44,330 You could instead train it through reinforcement learning-- give it 1514 01:12:44,330 --> 01:12:47,060 some sort of numerical reward every time it does something 1515 01:12:47,060 --> 01:12:50,690 good like take steps forward, and punish it every time it does something 1516 01:12:50,690 --> 01:12:52,250 bad like fall over. 1517 01:12:52,250 --> 01:12:54,080 And then let the AI just learn. 1518 01:12:54,080 --> 01:12:56,450 Based on that sequence of rewards, based on trying 1519 01:12:56,450 --> 01:12:58,610 to take various different actions, you can 1520 01:12:58,610 --> 01:13:03,200 begin to have the agent learn what to do in the future and what not to do. 1521 01:13:03,200 --> 01:13:06,530 So in order to begin to formalize this, the first thing we need to do 1522 01:13:06,530 --> 01:13:10,700 is formalize this notion of what we mean by states and actions and rewards-- 1523 01:13:10,700 --> 01:13:12,800 like what does this world look like. 1524 01:13:12,800 --> 01:13:14,990 And oftentimes, we'll formulate this world 1525 01:13:14,990 --> 01:13:17,990 as what's known as a Markov decision process. 1526 01:13:17,990 --> 01:13:21,410 Similar in spirit to Markov chains, which you might recall from before, 1527 01:13:21,410 --> 01:13:24,020 but a Markov decision process is a model that we 1528 01:13:24,020 --> 01:13:26,780 can use for decision making for an agent trying 1529 01:13:26,780 --> 01:13:28,580 to make decisions in this environment. 1530 01:13:28,580 --> 01:13:32,270 And it's a model that allows us to represent the various different states 1531 01:13:32,270 --> 01:13:35,930 that an agent can be in, the various different actions that they can take, 1532 01:13:35,930 --> 01:13:40,400 and also, what the reward is for taking one action as opposed 1533 01:13:40,400 --> 01:13:42,180 to another action. 1534 01:13:42,180 --> 01:13:44,610 So what then does that actually look like? 1535 01:13:44,610 --> 01:13:47,590 Well, if you recall a Markov chain from before, 1536 01:13:47,590 --> 01:13:50,270 a Markov chain looked a little something like this. 1537 01:13:50,270 --> 01:13:53,600 Where we had a whole bunch of these individual states, and each state 1538 01:13:53,600 --> 01:13:55,820 immediately transitioned to another state 1539 01:13:55,820 --> 01:13:57,890 based on some probability distribution. 1540 01:13:57,890 --> 01:14:01,070 We saw this in the context of the weather before, where if it was sunny, 1541 01:14:01,070 --> 01:14:03,770 we said with some probability, it will be sunny the next day. 1542 01:14:03,770 --> 01:14:06,920 With some other probability, it'll be rainy, for example. 1543 01:14:06,920 --> 01:14:09,340 But we could also imagine generalizing this. 1544 01:14:09,340 --> 01:14:11,060 It's not just sun and rain anymore. 1545 01:14:11,060 --> 01:14:14,210 We just have these states, where one state leads to another state 1546 01:14:14,210 --> 01:14:16,860 according to some probability distribution.
1547 01:14:16,860 --> 01:14:19,340 But in this original model, there was no agent 1548 01:14:19,340 --> 01:14:21,470 that had any control over this process. 1549 01:14:21,470 --> 01:14:24,740 It was just entirely probability-based, where with some probability, 1550 01:14:24,740 --> 01:14:26,540 we moved to this next state, but maybe it's 1551 01:14:26,540 --> 01:14:29,230 going to be some other state with some other probability. 1552 01:14:29,230 --> 01:14:33,320 What we'll now have is the ability for the agent in this state 1553 01:14:33,320 --> 01:14:37,190 to choose from a set of actions, where maybe instead of just one path forward, 1554 01:14:37,190 --> 01:14:39,200 they have three different choices of actions 1555 01:14:39,200 --> 01:14:41,180 that each lead them down different paths. 1556 01:14:41,180 --> 01:14:43,340 And even this is a bit of an oversimplification, 1557 01:14:43,340 --> 01:14:46,340 because in each of these states, you might imagine more branching points 1558 01:14:46,340 --> 01:14:49,200 where there are more decisions that can be taken as well. 1559 01:14:49,200 --> 01:14:53,420 So we've extended the Markov chain to say that from a state, 1560 01:14:53,420 --> 01:14:55,400 you now have available action choices. 1561 01:14:55,400 --> 01:14:57,830 And each of those actions might be associated 1562 01:14:57,830 --> 01:15:02,960 with its own probability distribution of going to various different states. 1563 01:15:02,960 --> 01:15:05,870 Then in addition, we'll add another extension, 1564 01:15:05,870 --> 01:15:08,180 where any time you move from a state taking 1565 01:15:08,180 --> 01:15:10,640 an action going into this other state, we 1566 01:15:10,640 --> 01:15:14,060 can associate a reward with that outcome, 1567 01:15:14,060 --> 01:15:17,150 saying either r is positive, meaning some positive reward, 1568 01:15:17,150 --> 01:15:20,460 or r is negative, meaning there was some sort of punishment. 1569 01:15:20,460 --> 01:15:23,510 And this then is what we'll consider to be a Markov decision process. 1570 01:15:23,510 --> 01:15:27,500 That a Markov decision process has some initial set of states in the world 1571 01:15:27,500 --> 01:15:28,670 that we can be in. 1572 01:15:28,670 --> 01:15:31,640 We have some set of actions that given a state, 1573 01:15:31,640 --> 01:15:34,430 I can say what are the actions that are available to me 1574 01:15:34,430 --> 01:15:37,730 in that state, the actions that I can choose from. 1575 01:15:37,730 --> 01:15:39,550 Then we have some transition model. 1576 01:15:39,550 --> 01:15:41,710 The transition model before just said that 1577 01:15:41,710 --> 01:15:44,740 given my current state, what is the probability that I end up 1578 01:15:44,740 --> 01:15:46,930 in that next state or this other state. 1579 01:15:46,930 --> 01:15:51,130 The transition model now has effectively two things we're conditioning on. 1580 01:15:51,130 --> 01:15:53,260 We're saying, given that I'm in this state 1581 01:15:53,260 --> 01:15:56,500 and that I take this action, what's the probability 1582 01:15:56,500 --> 01:15:59,320 that I end up in this next state? 1583 01:15:59,320 --> 01:16:02,050 Now maybe we live in a very deterministic world in this Markov 1584 01:16:02,050 --> 01:16:05,170 decision process, where given a state and given an action, 1585 01:16:05,170 --> 01:16:07,383 we know for sure what next state we'll end up in.
1586 01:16:07,383 --> 01:16:10,550 But maybe there's some randomness in the world, that when you're in a state 1587 01:16:10,550 --> 01:16:14,260 and you take an action, you might not always end up in the exact same state. 1588 01:16:14,260 --> 01:16:16,840 There might be some probabilities involved there as well. 1589 01:16:16,840 --> 01:16:21,260 The Markov decision process can handle both of those possible cases. 1590 01:16:21,260 --> 01:16:23,830 And then finally, we have a reward function, 1591 01:16:23,830 --> 01:16:26,230 generally called r, that in this case says, 1592 01:16:26,230 --> 01:16:30,340 what is the reward for being in this state, taking this action, 1593 01:16:30,340 --> 01:16:33,687 and then getting to s prime, this next state. 1594 01:16:33,687 --> 01:16:35,770 So I'm in this original state, I take this action, 1595 01:16:35,770 --> 01:16:39,768 I get to this next state, what is the reward for doing that process? 1596 01:16:39,768 --> 01:16:41,560 You can add up these rewards every time you 1597 01:16:41,560 --> 01:16:44,440 take an action to get the total amount of rewards 1598 01:16:44,440 --> 01:16:48,940 that an agent might get from interacting in a particular environment modeled 1599 01:16:48,940 --> 01:16:51,170 using this Markov decision process. 1600 01:16:51,170 --> 01:16:53,740 So what might this actually look like in practice? 1601 01:16:53,740 --> 01:16:56,290 Well, let's just create a little simulated world here 1602 01:16:56,290 --> 01:16:59,260 where I have this agent that is just trying to navigate its way-- 1603 01:16:59,260 --> 01:17:02,350 this agent is this yellow dot here like a robot in the world trying 1604 01:17:02,350 --> 01:17:04,240 to navigate its way through this grid. 1605 01:17:04,240 --> 01:17:07,140 And ultimately, it's trying to find its way to the goal. 1606 01:17:07,140 --> 01:17:11,230 And if it gets to the green goal, then it's going to get some sort of reward. 1607 01:17:11,230 --> 01:17:14,795 But then we might also have some red squares 1608 01:17:14,795 --> 01:17:17,920 that are places where you get some sort of punishment, some bad place where 1609 01:17:17,920 --> 01:17:19,480 we don't want the agent to go. 1610 01:17:19,480 --> 01:17:22,030 And if it ends up in the red square, then our agent 1611 01:17:22,030 --> 01:17:25,400 is going to get some sort of punishment as a result of that. 1612 01:17:25,400 --> 01:17:28,610 But the agent originally doesn't know all of these details. 1613 01:17:28,610 --> 01:17:31,360 It doesn't know that these states are associated with punishments, 1614 01:17:31,360 --> 01:17:34,210 but maybe it does know that the state is associated with a reward-- 1615 01:17:34,210 --> 01:17:35,200 maybe it doesn't. 1616 01:17:35,200 --> 01:17:37,750 But it just needs to sort of interact with the environment 1617 01:17:37,750 --> 01:17:41,150 to try and figure out what to do and what not to do. 1618 01:17:41,150 --> 01:17:44,377 So the first thing the agent might do is given no additional information, 1619 01:17:44,377 --> 01:17:46,210 if it doesn't know what the punishments are, 1620 01:17:46,210 --> 01:17:50,030 it doesn't know where the rewards are, it just might try and take an action. 1621 01:17:50,030 --> 01:17:52,690 And it takes an action and ends up realizing 1622 01:17:52,690 --> 01:17:54,710 that it's got some sort of punishment. 1623 01:17:54,710 --> 01:17:56,930 And so what does it learn from that experience?
1624 01:17:56,930 --> 01:18:00,550 Well, it might learn that when you're in this state, in the future 1625 01:18:00,550 --> 01:18:02,790 don't take the action "move to the right"-- 1626 01:18:02,790 --> 01:18:04,213 that that is a bad action to take. 1627 01:18:04,213 --> 01:18:06,880 That in the future, if you ever find yourself back in that state, 1628 01:18:06,880 --> 01:18:09,430 don't take this action of going to the right when 1629 01:18:09,430 --> 01:18:12,180 you're in this particular state, because that leads to punishment. 1630 01:18:12,180 --> 01:18:13,763 That might be the intuition, at least. 1631 01:18:13,763 --> 01:18:15,610 And so you could try doing other actions. 1632 01:18:15,610 --> 01:18:16,300 You move up. 1633 01:18:16,300 --> 01:18:18,508 All right, that didn't lead to any immediate rewards, 1634 01:18:18,508 --> 01:18:21,830 maybe try something else, then maybe try something else. 1635 01:18:21,830 --> 01:18:23,913 And now you found that you got another punishment. 1636 01:18:23,913 --> 01:18:25,913 And so you learn something from that experience. 1637 01:18:25,913 --> 01:18:27,850 So the next time you do this whole process, 1638 01:18:27,850 --> 01:18:30,010 you know that if you ever end up in this square, 1639 01:18:30,010 --> 01:18:33,070 you shouldn't take the down action, because being in this state 1640 01:18:33,070 --> 01:18:37,840 and taking that action ultimately leads to some sort of punishment, 1641 01:18:37,840 --> 01:18:40,070 a negative reward, in other words. 1642 01:18:40,070 --> 01:18:41,170 And this process repeats. 1643 01:18:41,170 --> 01:18:44,230 You might imagine just letting our agent explore the world, 1644 01:18:44,230 --> 01:18:48,220 learning over time what states tend to correspond with poor actions, 1645 01:18:48,220 --> 01:18:50,980 1646 01:18:50,980 --> 01:18:54,260 until eventually, if it tries enough things randomly, 1647 01:18:54,260 --> 01:18:57,640 it might find that when you get to this state, 1648 01:18:57,640 --> 01:19:01,240 if you take the up action in this state, it might find that you 1649 01:19:01,240 --> 01:19:03,160 actually get a reward from that. 1650 01:19:03,160 --> 01:19:06,820 And what it can learn from that is that if you're in this state, 1651 01:19:06,820 --> 01:19:09,595 you should take the up action, because that leads to a reward. 1652 01:19:09,595 --> 01:19:12,220 And over time, you can also learn that if you're in this state, 1653 01:19:12,220 --> 01:19:15,580 you should take the left action because that leads to this state that also 1654 01:19:15,580 --> 01:19:17,320 lets you eventually get to the reward. 1655 01:19:17,320 --> 01:19:21,160 So you begin to learn over time, not only which actions 1656 01:19:21,160 --> 01:19:25,390 are good in particular states, but also, which actions are bad, 1657 01:19:25,390 --> 01:19:27,640 such that once you know some sequence of good actions 1658 01:19:27,640 --> 01:19:30,910 that leads you to some sort of reward, our agent can just 1659 01:19:30,910 --> 01:19:34,750 follow those instructions, follow the experience that it has learned. 1660 01:19:34,750 --> 01:19:37,330 We didn't tell the agent what the goal was. 1661 01:19:37,330 --> 01:19:39,910 We didn't tell the agent where the punishments were. 1662 01:19:39,910 --> 01:19:42,760 But the agent can begin to learn from this experience 1663 01:19:42,760 --> 01:19:47,850 and learn to begin to perform these sorts of tasks better in the future.
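[Code sketch: one way a small deterministic grid world like this could be encoded as states, actions, a transition model, and a reward function; the grid size, goal cell, punishment cells, and reward values here are made up purely for illustration.]

    # States are (row, column) cells in a small grid; actions move one cell.
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    ROWS, COLS = 3, 4
    GOAL = (0, 3)                    # reaching this cell earns a reward
    PUNISHMENTS = {(1, 3), (2, 1)}   # these cells give a punishment

    def transition(state, action):
        """Deterministic transition model: where does this action lead?"""
        row, col = state
        dr, dc = ACTIONS[action]
        new_state = (row + dr, col + dc)
        # Stay in place if the move would leave the grid.
        if not (0 <= new_state[0] < ROWS and 0 <= new_state[1] < COLS):
            return state
        return new_state

    def reward(state, action, new_state):
        """Reward function R(s, a, s')."""
        if new_state == GOAL:
            return 1
        if new_state in PUNISHMENTS:
            return -1
        return 0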
1664 01:19:47,850 --> 01:19:51,370 And so let's now try to formalize this idea-- formalize the idea that we would 1665 01:19:51,370 --> 01:19:54,520 like to be able to learn: in this state, taking this action, 1666 01:19:54,520 --> 01:19:56,257 is that a good thing or a bad thing? 1667 01:19:56,257 --> 01:19:58,840 There are lots of different models for reinforcement learning. 1668 01:19:58,840 --> 01:20:00,757 We're just going to look at one of them today. 1669 01:20:00,757 --> 01:20:04,330 And the one that we're going to look at is a method known as Q learning. 1670 01:20:04,330 --> 01:20:08,010 And what Q learning is all about is about learning a function, 1671 01:20:08,010 --> 01:20:12,850 a function Q, that takes inputs s and a, where s is a state and a 1672 01:20:12,850 --> 01:20:15,130 is an action that you take in that state. 1673 01:20:15,130 --> 01:20:17,620 And what this Q function is going to do is 1674 01:20:17,620 --> 01:20:21,520 it is going to estimate the value-- how much reward will I get 1675 01:20:21,520 --> 01:20:25,960 from taking this action in this state. 1676 01:20:25,960 --> 01:20:28,870 Originally, we don't know what this Q function should be, 1677 01:20:28,870 --> 01:20:30,940 but over time, based on experience, based 1678 01:20:30,940 --> 01:20:33,590 on trying things out and seeing what the result is, 1679 01:20:33,590 --> 01:20:36,160 I would like to try and learn what Q of s, 1680 01:20:36,160 --> 01:20:39,730 a is for any particular state and any particular action 1681 01:20:39,730 --> 01:20:41,840 that I might take in that state. 1682 01:20:41,840 --> 01:20:42,880 So what is the approach? 1683 01:20:42,880 --> 01:20:44,890 Well, the approach originally is we'll start 1684 01:20:44,890 --> 01:20:49,960 with Q s, a equal to 0 for all states s and for all actions a. 1685 01:20:49,960 --> 01:20:52,270 That initially, before I've ever started anything, 1686 01:20:52,270 --> 01:20:54,880 before I've had any experiences, I don't know 1687 01:20:54,880 --> 01:20:57,820 the value of taking any action in any given state, 1688 01:20:57,820 --> 01:21:02,390 so I'm going to assume that the value is 0 all across the board. 1689 01:21:02,390 --> 01:21:06,790 But then as I interact with the world, as I experience rewards or punishments, 1690 01:21:06,790 --> 01:21:10,450 or maybe I go to a cell where I don't get either a reward or a punishment, 1691 01:21:10,450 --> 01:21:14,290 I want to somehow update my estimate of Q s, a. 1692 01:21:14,290 --> 01:21:18,670 I want to continually update my estimate of Q s, a based on the experiences, 1693 01:21:18,670 --> 01:21:20,730 and rewards, and punishments that I've received 1694 01:21:20,730 --> 01:21:24,240 such that in the future, my knowledge of what actions are good 1695 01:21:24,240 --> 01:21:26,290 in what states will be better. 1696 01:21:26,290 --> 01:21:29,280 So when we take an action and receive some sort of reward, 1697 01:21:29,280 --> 01:21:32,750 I want to estimate the new value of Q s, a. 1698 01:21:32,750 --> 01:21:35,390 And I estimate that based on a couple of different things. 1699 01:21:35,390 --> 01:21:39,090 I estimate it based on the reward that I'm getting from taking this action 1700 01:21:39,090 --> 01:21:40,830 and getting into the next state.
1701 01:21:40,830 --> 01:21:44,130 But assuming the situation isn't over, assuming 1702 01:21:44,130 --> 01:21:47,170 there are still future actions that I might take as well, 1703 01:21:47,170 --> 01:21:51,670 I also need to take into account the expected future rewards. 1704 01:21:51,670 --> 01:21:54,570 That if you imagine an agent interacting with the environment, 1705 01:21:54,570 --> 01:21:57,000 and sometimes, you'll take an action and get a reward, 1706 01:21:57,000 --> 01:21:59,970 but then you can keep taking more actions and get more rewards. 1707 01:21:59,970 --> 01:22:02,310 That these both are relevant-- both the current reward 1708 01:22:02,310 --> 01:22:05,680 I'm getting from this current step, and also, my future reward. 1709 01:22:05,680 --> 01:22:09,240 And it might be the case that I want to take a step that doesn't immediately 1710 01:22:09,240 --> 01:22:12,180 lead to a reward, because later on down the line, 1711 01:22:12,180 --> 01:22:14,650 I know it will lead to more rewards as well. 1712 01:22:14,650 --> 01:22:17,550 So there's a balancing act between current rewards 1713 01:22:17,550 --> 01:22:20,460 that the agent experiences and future rewards 1714 01:22:20,460 --> 01:22:23,900 that the agent experiences as well. 1715 01:22:23,900 --> 01:22:26,590 And then we need to update Q s, a. 1716 01:22:26,590 --> 01:22:29,620 So we estimate the value of Q s, a based on the current reward 1717 01:22:29,620 --> 01:22:31,390 and the expected future rewards. 1718 01:22:31,390 --> 01:22:33,970 And then we need to update this Q function 1719 01:22:33,970 --> 01:22:36,550 to take into account this new estimate. 1720 01:22:36,550 --> 01:22:39,340 Now as we go through this process, we'll already 1721 01:22:39,340 --> 01:22:42,160 have an estimate for what we think the value is. 1722 01:22:42,160 --> 01:22:44,140 Now we have a new estimate and then somehow we 1723 01:22:44,140 --> 01:22:46,570 need to combine these two estimates together. 1724 01:22:46,570 --> 01:22:50,110 And we'll look at more formal ways that we can actually begin to do that. 1725 01:22:50,110 --> 01:22:52,725 So to actually show you what this formula looks like, 1726 01:22:52,725 --> 01:22:54,850 here's the approach we'll take with Q-learning. 1727 01:22:54,850 --> 01:22:59,500 We're going to again start with Q of s, a being equal to 0 for all states and actions. 1728 01:22:59,500 --> 01:23:06,850 And then every time we take an action a in state s and observe a reward r, 1729 01:23:06,850 --> 01:23:11,230 we're going to update our value, our estimate for Q of s, a. 1730 01:23:11,230 --> 01:23:13,780 And the idea is that we're going to figure out 1731 01:23:13,780 --> 01:23:19,180 what the new value estimate is minus what our existing value estimate is. 1732 01:23:19,180 --> 01:23:22,780 So we have some preconceived notion for what the value is 1733 01:23:22,780 --> 01:23:24,500 for taking this action in this state. 1734 01:23:24,500 --> 01:23:28,450 Maybe our expectation is we currently think the value is 10. 1735 01:23:28,450 --> 01:23:31,540 But then we're going to estimate what we now think it's going to be. 1736 01:23:31,540 --> 01:23:34,370 Maybe the new value estimate is something like 20. 1737 01:23:34,370 --> 01:23:36,940 So there's a delta of, like, 10 that our new value 1738 01:23:36,940 --> 01:23:40,390 estimate is 10 points higher than what our current value 1739 01:23:40,390 --> 01:23:42,450 estimate happens to be. 1740 01:23:42,450 --> 01:23:44,200 And so we have a couple of options here.
1741 01:23:44,200 --> 01:23:49,360 We need to decide how much we want to adjust our current expectation of what 1742 01:23:49,360 --> 01:23:52,720 the value is of taking this action in this particular state. 1743 01:23:52,720 --> 01:23:56,620 And what that difference is-- how much we add or subtract 1744 01:23:56,620 --> 01:23:59,800 from our existing notion of how much that we expect the value to be 1745 01:23:59,800 --> 01:24:03,750 is dependent on this parameter alpha, also called the learning rate. 1746 01:24:03,750 --> 01:24:08,710 And alpha represents, in effect, how much we value new information compared 1747 01:24:08,710 --> 01:24:11,770 to how much we value old information. 1748 01:24:11,770 --> 01:24:15,500 An alpha value of 1 means we really value new information. 1749 01:24:15,500 --> 01:24:17,590 That if we have a new estimate, then it doesn't 1750 01:24:17,590 --> 01:24:19,240 matter what our old estimate is. 1751 01:24:19,240 --> 01:24:21,910 We're only going to consider our new estimate, because we always 1752 01:24:21,910 --> 01:24:25,430 just want to take into consideration our new information. 1753 01:24:25,430 --> 01:24:29,230 So the way that works is that if you imagine alpha being 1, 1754 01:24:29,230 --> 01:24:32,860 then we're taking the old value of Q s, a 1755 01:24:32,860 --> 01:24:36,910 and then adding 1 times the new value minus the old value. 1756 01:24:36,910 --> 01:24:38,990 And that just leaves us with the new value. 1757 01:24:38,990 --> 01:24:42,010 So when alpha is 1, all we take into consideration 1758 01:24:42,010 --> 01:24:44,650 is what our new estimate happens to be. 1759 01:24:44,650 --> 01:24:47,470 But over time, as we go through a lot of experiences, 1760 01:24:47,470 --> 01:24:49,570 we already have some existing information. 1761 01:24:49,570 --> 01:24:53,080 We might have tried taking this action nine times already, 1762 01:24:53,080 --> 01:24:55,090 and now we've just tried it a tenth time. 1763 01:24:55,090 --> 01:24:58,370 And we don't only want to consider this 10th experience. 1764 01:24:58,370 --> 01:25:02,020 I also want to consider the fact that my prior nine experiences, those were 1765 01:25:02,020 --> 01:25:02,710 meaningful, too. 1766 01:25:02,710 --> 01:25:05,420 And that's data I don't necessarily want to lose. 1767 01:25:05,420 --> 01:25:08,620 And so this alpha controls that decision-- controls 1768 01:25:08,620 --> 01:25:10,480 how important is the new information. 1769 01:25:10,480 --> 01:25:13,570 0 would mean ignore all the new information, 1770 01:25:13,570 --> 01:25:16,390 just keep this Q value the same. 1771 01:25:16,390 --> 01:25:20,200 1 means replace the old information entirely with the new information. 1772 01:25:20,200 --> 01:25:24,772 And somewhere in between, keep some sort of balance between these two values. 1773 01:25:24,772 --> 01:25:27,970 And we can put this equation a little bit more formally, as well. 1774 01:25:27,970 --> 01:25:30,940 The old value estimate is our old estimate 1775 01:25:30,940 --> 01:25:34,690 for what the value is of taking this action in a particular state. 1776 01:25:34,690 --> 01:25:37,270 That's just Q of s, a. 1777 01:25:37,270 --> 01:25:38,685 We have it once here. 1778 01:25:38,685 --> 01:25:40,310 And we're going to add something to it. 1779 01:25:40,310 --> 01:25:44,020 We're going to add alpha times the new value estimate minus the old value 1780 01:25:44,020 --> 01:25:44,740 estimate.
1781 01:25:44,740 --> 01:25:49,330 But the old value estimate, we just look up by calling this Q function. 1782 01:25:49,330 --> 01:25:51,310 And what then is the new value estimate? 1783 01:25:51,310 --> 01:25:53,530 Based on this experience we have just taken, 1784 01:25:53,530 --> 01:25:55,870 what is our new estimate for the value of taking 1785 01:25:55,870 --> 01:25:58,570 this action in this particular state? 1786 01:25:58,570 --> 01:26:01,090 Well, it's going to be composed of two parts. 1787 01:26:01,090 --> 01:26:04,540 It's going to be composed of what reward did I just get 1788 01:26:04,540 --> 01:26:07,060 from taking this action in this state. 1789 01:26:07,060 --> 01:26:10,390 And then it's going to be what can I expect my future rewards 1790 01:26:10,390 --> 01:26:12,670 to be from this point forward. 1791 01:26:12,670 --> 01:26:17,230 So it's going to be r, some reward I'm getting right now, 1792 01:26:17,230 --> 01:26:21,443 plus whatever I estimate I'm going to get in the future. 1793 01:26:21,443 --> 01:26:23,860 And how do I estimate what I'm going to get in the future? 1794 01:26:23,860 --> 01:26:27,030 Well, it's a bit of another call to this Q function. 1795 01:26:27,030 --> 01:26:31,360 It's going to be: take the maximum across all possible actions I could 1796 01:26:31,360 --> 01:26:35,800 take next and say, all right, of all of these possible actions I could take, 1797 01:26:35,800 --> 01:26:38,560 which one is going to have the highest reward? 1798 01:26:38,560 --> 01:26:40,840 So this then-- it looks a little bit complicated-- 1799 01:26:40,840 --> 01:26:44,710 is going to be our notion for how we're going to perform this kind of update. 1800 01:26:44,710 --> 01:26:48,070 I have some estimate, some old estimate, for what 1801 01:26:48,070 --> 01:26:51,040 the value is of taking this action in the state, 1802 01:26:51,040 --> 01:26:53,920 and I'm going to update it based on new information. 1803 01:26:53,920 --> 01:26:56,200 That I experienced some reward, I predict 1804 01:26:56,200 --> 01:26:58,240 what my future reward is going to be. 1805 01:26:58,240 --> 01:27:01,510 And using that, I update what I estimate the reward 1806 01:27:01,510 --> 01:27:04,720 will be for taking this action in this particular state. 1807 01:27:04,720 --> 01:27:07,720 And there are other additions you might make to this algorithm, as well. 1808 01:27:07,720 --> 01:27:10,270 Sometimes, it might not be the case that you want to weight 1809 01:27:10,270 --> 01:27:12,820 future rewards equally to current rewards. 1810 01:27:12,820 --> 01:27:17,402 Maybe you want an agent that values, like, reward now over reward later. 1811 01:27:17,402 --> 01:27:19,360 And so sometimes, you can even add another term 1812 01:27:19,360 --> 01:27:23,050 in here or some other parameter, where you discount future rewards 1813 01:27:23,050 --> 01:27:26,270 and say future rewards are not as valuable as rewards 1814 01:27:26,270 --> 01:27:28,720 immediately-- that getting reward in the current time step 1815 01:27:28,720 --> 01:27:31,542 is better than waiting a year and getting rewards later. 1816 01:27:31,542 --> 01:27:33,250 But that's something up to the programmer 1817 01:27:33,250 --> 01:27:36,290 to decide what that parameter ought to be. 1818 01:27:36,290 --> 01:27:39,100 But the big picture idea of this entire formula 1819 01:27:39,100 --> 01:27:42,670 is to say that every time we experience some new reward, 1820 01:27:42,670 --> 01:27:43,900 we take that into account.
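[Code sketch: the update just described, with Q stored as a dictionary keyed by (state, action); alpha is the learning rate, and the optional gamma is the discount on future rewards mentioned above. The names and default values here are illustrative.]

    def update_q(Q, state, action, new_state, new_state_actions, reward,
                 alpha=0.5, gamma=1.0):
        """One Q-learning update: old estimate + alpha * (new estimate - old estimate)."""
        old_estimate = Q.get((state, action), 0)

        # Best estimated value obtainable from the next state onward,
        # looking across the actions available in that next state.
        best_future = max(
            (Q.get((new_state, a), 0) for a in new_state_actions),
            default=0,
        )

        # New value estimate: the reward just received, plus (discounted) future rewards.
        new_estimate = reward + gamma * best_future

        Q[(state, action)] = old_estimate + alpha * (new_estimate - old_estimate)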
1821 01:27:43,900 --> 01:27:47,830 We update our estimate of how good is this action. 1822 01:27:47,830 --> 01:27:51,070 And then in the future, we can make decisions based on that algorithm. 1823 01:27:51,070 --> 01:27:54,010 Once we have some good estimate for every state 1824 01:27:54,010 --> 01:27:57,670 and for every action, what the value is of taking that action, 1825 01:27:57,670 --> 01:28:01,960 then we can do something like implement a greedy decision-making policy. 1826 01:28:01,960 --> 01:28:05,200 That if I am in a state and I want to know what actions should 1827 01:28:05,200 --> 01:28:09,880 I take in that state, then I consider for all of my possible actions, 1828 01:28:09,880 --> 01:28:12,670 what is the value of Q s, a. 1829 01:28:12,670 --> 01:28:16,150 What is my estimated value of taking that action in that state. 1830 01:28:16,150 --> 01:28:18,850 And I will just pick the action that has the highest 1831 01:28:18,850 --> 01:28:22,520 value after I evaluate that expression. 1832 01:28:22,520 --> 01:28:24,730 So I pick the action that has the highest value. 1833 01:28:24,730 --> 01:28:27,188 And based on that, that tells me what action I should take. 1834 01:28:27,188 --> 01:28:30,880 At any given state that I'm in, I can just greedily say across 1835 01:28:30,880 --> 01:28:35,050 all my actions, this action gives me the highest expected value, 1836 01:28:35,050 --> 01:28:38,200 and so I'll go ahead and choose that action as the action 1837 01:28:38,200 --> 01:28:40,430 that I take as well. 1838 01:28:40,430 --> 01:28:43,270 But there is a downside to this kind of approach. 1839 01:28:43,270 --> 01:28:45,880 And the downside comes up in a situation like this, 1840 01:28:45,880 --> 01:28:51,170 where we know that there is some solution that gets me to the reward 1841 01:28:51,170 --> 01:28:53,610 and our agent has been able to figure that out. 1842 01:28:53,610 --> 01:28:57,050 But it might not necessarily be the best way or the fastest way. 1843 01:28:57,050 --> 01:28:59,850 If the agent is allowed to explore a little bit more, 1844 01:28:59,850 --> 01:29:02,810 you might find that it can get the reward faster by taking 1845 01:29:02,810 --> 01:29:06,830 some other route instead, by going through this particular path that 1846 01:29:06,830 --> 01:29:11,360 is a faster way to get to that ultimate goal. 1847 01:29:11,360 --> 01:29:14,730 And maybe we would like for the agent to be able to figure that out as well. 1848 01:29:14,730 --> 01:29:16,940 But if the agent always takes the actions 1849 01:29:16,940 --> 01:29:20,960 that it knows to be best, when it gets to this particular square, 1850 01:29:20,960 --> 01:29:23,180 it doesn't know that this is a good action, 1851 01:29:23,180 --> 01:29:24,770 because it's never really tried it. 1852 01:29:24,770 --> 01:29:29,040 But it knows that going down eventually leads its way to this reward. 1853 01:29:29,040 --> 01:29:32,270 So it might learn in the future that it should just always take this route, 1854 01:29:32,270 --> 01:29:36,770 and it's never going to explore and go along that other route instead. 1855 01:29:36,770 --> 01:29:39,170 So in reinforcement learning, there's this tension 1856 01:29:39,170 --> 01:29:42,570 between exploration and exploitation. 1857 01:29:42,570 --> 01:29:45,710 And exploitation generally refers to using knowledge 1858 01:29:45,710 --> 01:29:47,360 that the AI already has.
1859 01:29:47,360 --> 01:29:50,630 The AI already knows that this is a move that leads to reward, 1860 01:29:50,630 --> 01:29:52,520 so it'll go ahead and use that move. 1861 01:29:52,520 --> 01:29:56,390 And exploration is all about exploring other actions 1862 01:29:56,390 --> 01:29:58,820 that we may not have explored as thoroughly before, 1863 01:29:58,820 --> 01:30:01,970 because maybe one of these actions, even if I don't know anything about it, 1864 01:30:01,970 --> 01:30:07,300 might lead to better rewards faster or more rewards in the future. 1865 01:30:07,300 --> 01:30:10,160 And so an agent that only ever exploits information 1866 01:30:10,160 --> 01:30:12,840 and never explores might be able to get reward, 1867 01:30:12,840 --> 01:30:16,250 but it might not maximize its rewards, because it doesn't know what 1868 01:30:16,250 --> 01:30:17,840 other possibilities are out there-- 1869 01:30:17,840 --> 01:30:19,850 possibilities that it would only know about 1870 01:30:19,850 --> 01:30:23,010 by taking advantage of exploration. 1871 01:30:23,010 --> 01:30:24,710 And so how can we try and address this? 1872 01:30:24,710 --> 01:30:28,550 Well, one possible solution is known as the epsilon-greedy algorithm, 1873 01:30:28,550 --> 01:30:33,050 where we set epsilon equal to how often we want to just make a random move. 1874 01:30:33,050 --> 01:30:36,750 Where occasionally, we will just make a random move in order to say, 1875 01:30:36,750 --> 01:30:40,350 let's try to explore and see what happens. 1876 01:30:40,350 --> 01:30:45,250 And then the logic of the algorithm will be: with probability 1 minus epsilon, 1877 01:30:45,250 --> 01:30:47,830 choose the estimated best move. 1878 01:30:47,830 --> 01:30:50,620 In a greedy case, we'd always choose the best move. 1879 01:30:50,620 --> 01:30:55,360 But in epsilon-greedy, we're most of the time going to choose the estimated best move, 1880 01:30:55,360 --> 01:30:57,520 1881 01:30:57,520 --> 01:31:00,100 but sometimes, with probability epsilon, we're 1882 01:31:00,100 --> 01:31:03,160 going to choose a random move instead. 1883 01:31:03,160 --> 01:31:06,400 So every time we're faced with the ability to take an action, sometimes, 1884 01:31:06,400 --> 01:31:07,930 we're going to choose the best move. 1885 01:31:07,930 --> 01:31:10,900 Sometimes, we're just going to choose a random move. 1886 01:31:10,900 --> 01:31:12,940 So this type of algorithm then can be quite 1887 01:31:12,940 --> 01:31:16,720 powerful in a reinforcement learning context by not always just choosing 1888 01:31:16,720 --> 01:31:20,620 the best possible move right now, but sometimes, especially early on, 1889 01:31:20,620 --> 01:31:22,600 allowing yourself to make random moves that 1890 01:31:22,600 --> 01:31:26,560 allow you to explore various different possible states and actions more. 1891 01:31:26,560 --> 01:31:30,850 And maybe over time, you might decrease your value of epsilon, more and more 1892 01:31:30,850 --> 01:31:32,720 often choosing the best move after you are 1893 01:31:32,720 --> 01:31:37,820 more confident that you've explored what all of the possibilities actually are. 1894 01:31:37,820 --> 01:31:39,320 So we can put this into practice. 1895 01:31:39,320 --> 01:31:41,830 And one very common application of reinforcement learning 1896 01:31:41,830 --> 01:31:43,000 is in game playing.
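[Code sketch: before turning to game playing, a minimal epsilon-greedy action choice, again assuming Q is a dictionary keyed by (state, action).]

    import random

    def choose_action(Q, state, actions, epsilon=0.1):
        """Epsilon-greedy: usually exploit the best-known action, sometimes explore."""
        if random.random() < epsilon:
            # Explore: pick a random available action.
            return random.choice(list(actions))
        # Exploit: pick the action with the highest estimated value in this state.
        return max(actions, key=lambda a: Q.get((state, a), 0))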
1897 01:31:43,000 --> 01:31:45,400 If you want to teach an agent how to play a game, 1898 01:31:45,400 --> 01:31:48,350 you just let the agent play the game a whole bunch. 1899 01:31:48,350 --> 01:31:51,220 And then the reward signal happens at the end of the game. 1900 01:31:51,220 --> 01:31:55,780 When the game is over, if our AI won the game, it gets a reward of, like, 1, 1901 01:31:55,780 --> 01:31:56,730 for example. 1902 01:31:56,730 --> 01:32:00,100 And if it lost the game, it gets a reward of negative 1. 1903 01:32:00,100 --> 01:32:03,108 And from that, it begins to learn what actions are good 1904 01:32:03,108 --> 01:32:04,150 and what actions are bad. 1905 01:32:04,150 --> 01:32:06,610 You don't have to tell the AI what's good and what's bad, 1906 01:32:06,610 --> 01:32:08,920 but the AI figures it out based on that reward. 1907 01:32:08,920 --> 01:32:10,450 Winning the game is some signal. 1908 01:32:10,450 --> 01:32:12,040 Losing the game is some signal. 1909 01:32:12,040 --> 01:32:14,320 And based on all of that, it begins to figure out 1910 01:32:14,320 --> 01:32:16,985 what decisions it should actually make. 1911 01:32:16,985 --> 01:32:19,360 So one very simple game, which you may have played before, 1912 01:32:19,360 --> 01:32:20,870 is a game called Nim. 1913 01:32:20,870 --> 01:32:23,433 And in the game of Nim, you've got a whole bunch of objects 1914 01:32:23,433 --> 01:32:25,600 in a whole bunch of different piles, where here I've 1915 01:32:25,600 --> 01:32:27,940 represented each pile as an individual row. 1916 01:32:27,940 --> 01:32:31,040 So you've got one object in the first pile, three in the second pile, five 1917 01:32:31,040 --> 01:32:33,370 in the third pile, seven in the fourth pile. 1918 01:32:33,370 --> 01:32:36,010 And the game of Nim is a two-player game where players 1919 01:32:36,010 --> 01:32:38,980 take turns removing objects from piles. 1920 01:32:38,980 --> 01:32:41,230 And the rule is that on any given turn, you 1921 01:32:41,230 --> 01:32:43,690 are allowed to remove as many objects as you 1922 01:32:43,690 --> 01:32:47,162 want from any one of these piles, any one of these rows. 1923 01:32:47,162 --> 01:32:49,120 You have to remove at least one object, but you 1924 01:32:49,120 --> 01:32:53,890 can remove as many as you want from exactly one of the piles. 1925 01:32:53,890 --> 01:32:57,720 And whoever takes the last object loses. 1926 01:32:57,720 --> 01:33:01,630 So player 1 might, like, remove four from this pile here. 1927 01:33:01,630 --> 01:33:04,790 Player 2 might remove four from this pile here. 1928 01:33:04,790 --> 01:33:08,140 So now we've got four piles left: one, three, one, and three. Player 1 1929 01:33:08,140 --> 01:33:11,020 might remove, you know, the entirety of the second pile. 1930 01:33:11,020 --> 01:33:16,930 Player 2, if they're being strategic, might remove two from the third pile. 1931 01:33:16,930 --> 01:33:20,140 Now we've got three piles left, each with one object left. 1932 01:33:20,140 --> 01:33:22,420 Player 1 might remove one from one pile. 1933 01:33:22,420 --> 01:33:24,750 Player 2 removes one from the other pile. 1934 01:33:24,750 --> 01:33:27,610 And now player 1 is left with choosing this one 1935 01:33:27,610 --> 01:33:31,740 object from the last pile, at which point, player 1 loses the game. 1936 01:33:31,740 --> 01:33:32,980 So fairly simple game. 1937 01:33:32,980 --> 01:33:34,240 Piles of objects. 1938 01:33:34,240 --> 01:33:37,300 Any turn, you choose how many objects to remove from the pile.
1939 01:33:37,300 --> 01:33:40,330 Whoever removes the last object loses. 1940 01:33:40,330 --> 01:33:42,730 And this is the type of game you can encode into an AI 1941 01:33:42,730 --> 01:33:46,540 fairly easily, because the states are really just four numbers. 1942 01:33:46,540 --> 01:33:50,120 Every state is just how many objects in each of the four piles. 1943 01:33:50,120 --> 01:33:52,690 And the actions are things like how many am I 1944 01:33:52,690 --> 01:33:56,110 going to remove from each one of these individual piles. 1945 01:33:56,110 --> 01:33:58,090 And the reward happens at the end. 1946 01:33:58,090 --> 01:34:01,000 That if you were the player that had to remove the last object, 1947 01:34:01,000 --> 01:34:02,890 then you get some sort of punishment. 1948 01:34:02,890 --> 01:34:06,430 But if you were not, and the other player had to remove the last object, 1949 01:34:06,430 --> 01:34:08,830 well then you get some sort of reward. 1950 01:34:08,830 --> 01:34:11,680 So we could actually try and show a demonstration of this-- 1951 01:34:11,680 --> 01:34:15,570 that I have implemented an AI to play the game of Nim. 1952 01:34:15,570 --> 01:34:17,320 All right, so here, what we're going to do 1953 01:34:17,320 --> 01:34:22,210 is create an AI as a result of training the AI on some number of games 1954 01:34:22,210 --> 01:34:24,160 that the AI is going to play against itself. 1955 01:34:24,160 --> 01:34:27,060 Where the idea is the AI will play games against itself, 1956 01:34:27,060 --> 01:34:30,800 learn from each of those experiences, and learn what to do in the future. 1957 01:34:30,800 --> 01:34:33,340 And then I, the human, will play against the AI. 1958 01:34:33,340 --> 01:34:35,380 So initially, we'll say train zero times, 1959 01:34:35,380 --> 01:34:39,190 meaning we're not going to let the AI play any practice games against itself 1960 01:34:39,190 --> 01:34:41,110 in order to learn from its experiences. 1961 01:34:41,110 --> 01:34:43,750 We're just going to see how well it plays. 1962 01:34:43,750 --> 01:34:45,460 And it looks like there are four piles. 1963 01:34:45,460 --> 01:34:48,550 I can choose how many I remove from any one of the piles. 1964 01:34:48,550 --> 01:34:53,960 So maybe from pile three, I will remove five objects, for example. 1965 01:34:53,960 --> 01:34:57,340 So now AI chose to take one item from pile zero. 1966 01:34:57,340 --> 01:35:00,170 So I'm left with these piles now, for example. 1967 01:35:00,170 --> 01:35:02,500 And so here, I could choose maybe to say I 1968 01:35:02,500 --> 01:35:08,900 would like to remove them from pile two, all five of them, for example. 1969 01:35:08,900 --> 01:35:11,290 And so AI chose to take two away from pile one. 1970 01:35:11,290 --> 01:35:15,560 Now I'm left with one pile that has one object, one pile that has two objects. 1971 01:35:15,560 --> 01:35:19,070 So from pile three, I will remove two objects. 1972 01:35:19,070 --> 01:35:22,400 And now I've left the AI with no choice but to take that last one. 1973 01:35:22,400 --> 01:35:24,648 And so the game is over and I was able to win. 1974 01:35:24,648 --> 01:35:27,190 But I did so because the AI was really just playing randomly. 1975 01:35:27,190 --> 01:35:30,130 It didn't have any prior experience that it was using in order 1976 01:35:30,130 --> 01:35:32,410 to make these sorts of judgments. 1977 01:35:32,410 --> 01:35:36,040 Now let the AI train itself on, like, 10,000 games. 
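[Code sketch: one way Nim states and actions could be represented for Q-learning; this is an illustration, not the actual course implementation. Training would then amount to playing many games of the AI against itself, applying a Q-learning update after each move, with a reward of 1 for winning and negative 1 for losing, as described in the narration.]

    # A state is just how many objects are in each pile, e.g. (1, 3, 5, 7).
    # An action is a pair (pile, count): remove `count` objects from pile `pile`.

    def available_actions(piles):
        """All legal (pile, count) moves from this state."""
        actions = set()
        for i, pile in enumerate(piles):
            for count in range(1, pile + 1):
                actions.add((i, count))
        return actions

    def apply_action(piles, action):
        """Return the new state after removing `count` objects from pile `pile`."""
        pile, count = action
        new_piles = list(piles)
        new_piles[pile] -= count
        return tuple(new_piles)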
1978 01:35:36,040 --> 01:35:40,000 I'm going to let the AI play 10,000 games of Nim against itself. 1979 01:35:40,000 --> 01:35:43,180 Every time it wins or loses, it's going to learn from that experience 1980 01:35:43,180 --> 01:35:46,840 and learn in the future what to do and what not to do. 1981 01:35:46,840 --> 01:35:49,300 So here then, I'll go ahead and run this again. 1982 01:35:49,300 --> 01:35:52,300 And now you see the AI running through a whole bunch of training games-- 1983 01:35:52,300 --> 01:35:54,730 10,000 training games against itself. 1984 01:35:54,730 --> 01:35:57,530 And now it's going to let me make these sorts of decisions. 1985 01:35:57,530 --> 01:35:59,620 So now I'm going to play against the AI. 1986 01:35:59,620 --> 01:36:02,788 Maybe I'll remove one from pile 3. 1987 01:36:02,788 --> 01:36:05,830 And the AI took everything from pile three, so I'm left with three piles. 1988 01:36:05,830 --> 01:36:11,680 And I'll go ahead and from pile two, maybe remove three items. 1989 01:36:11,680 --> 01:36:14,340 And the AI removes one item from pile zero. 1990 01:36:14,340 --> 01:36:17,560 I'm left with two piles, each of which has two items in it. 1991 01:36:17,560 --> 01:36:21,160 I'll remove one from pile one, I guess. 1992 01:36:21,160 --> 01:36:24,610 And the AI took two from pile two, leaving me with no choice 1993 01:36:24,610 --> 01:36:27,490 but to take one away from pile one. 1994 01:36:27,490 --> 01:36:31,690 So it seems like after playing 10,000 games of Nim against itself, 1995 01:36:31,690 --> 01:36:34,270 the AI has learned something about what states 1996 01:36:34,270 --> 01:36:38,020 and what actions tend to be good, and has begun to learn some sort of pattern 1997 01:36:38,020 --> 01:36:40,780 for how to predict what actions are going to be good 1998 01:36:40,780 --> 01:36:44,290 and what actions are going to be bad in any given state. 1999 01:36:44,290 --> 01:36:47,590 So reinforcement learning can be a very powerful technique for achieving 2000 01:36:47,590 --> 01:36:49,150 these sorts of game playing Agents-- 2001 01:36:49,150 --> 01:36:52,300 Agents that are able to play a game well just by learning 2002 01:36:52,300 --> 01:36:54,970 from experience, whether that's playing against other people 2003 01:36:54,970 --> 01:36:57,310 or by playing against itself and learning from those 2004 01:36:57,310 --> 01:36:59,030 experiences, as well. 2005 01:36:59,030 --> 01:37:01,900 Now Nim is a bit of an easy game to use reinforcement learning 2006 01:37:01,900 --> 01:37:04,120 for because there are so few states. 2007 01:37:04,120 --> 01:37:07,120 There are only states that are as many as how many different objects are 2008 01:37:07,120 --> 01:37:09,190 in each of these various different piles. 2009 01:37:09,190 --> 01:37:11,200 You might imagine that it's going to be harder 2010 01:37:11,200 --> 01:37:14,990 if you think of a game like chess or games where there are many, 2011 01:37:14,990 --> 01:37:18,430 many more states and many, many more actions that you can imagine taking, 2012 01:37:18,430 --> 01:37:21,490 where it's not going to be as easy to learn for every state 2013 01:37:21,490 --> 01:37:24,720 and for every action, what the value is going to be. 2014 01:37:24,720 --> 01:37:27,130 So oftentimes in that case, we can't necessarily 2015 01:37:27,130 --> 01:37:31,060 learn exactly what the value is for every state and for every action, 2016 01:37:31,060 --> 01:37:32,257 but we can approximate it. 2017 01:37:32,257 --> 01:37:34,090 So much as we saw with [? 
min and max, ?] we 2018 01:37:34,090 --> 01:37:37,240 could use a death limiting approach to stop calculating 2019 01:37:37,240 --> 01:37:39,910 at a certain point in time, we can do a similar type 2020 01:37:39,910 --> 01:37:42,730 of approximation known as function approximation 2021 01:37:42,730 --> 01:37:47,530 in a reinforcement learning context, where instead of learning a value of Q 2022 01:37:47,530 --> 01:37:50,260 for every state and every action, we just 2023 01:37:50,260 --> 01:37:53,260 have some function that estimates what the value is 2024 01:37:53,260 --> 01:37:56,770 for taking this action in this particular state that might be based 2025 01:37:56,770 --> 01:38:02,390 on various different features of the state that the agent happens to be in. 2026 01:38:02,390 --> 01:38:04,480 Where you might have to choose what those features 2027 01:38:04,480 --> 01:38:07,990 actually are, but you can begin to learn some patterns that 2028 01:38:07,990 --> 01:38:11,763 generalize beyond one specific state and one specific action 2029 01:38:11,763 --> 01:38:13,930 that you can begin to learn if certain features tend 2030 01:38:13,930 --> 01:38:15,580 to be good things or bad things. 2031 01:38:15,580 --> 01:38:18,850 Reinforcement learning can allow you using a very similar mechanism 2032 01:38:18,850 --> 01:38:20,800 to generalize beyond one particular state 2033 01:38:20,800 --> 01:38:24,220 and say if this other state looks kind of like this state, 2034 01:38:24,220 --> 01:38:27,130 then maybe the similar types of actions that worked in one state 2035 01:38:27,130 --> 01:38:30,150 will also work in another state as well. 2036 01:38:30,150 --> 01:38:32,410 And so this type of approach can be quite helpful 2037 01:38:32,410 --> 01:38:34,750 as you begin to deal with reinforcement learning that 2038 01:38:34,750 --> 01:38:37,870 exists in larger and larger state spaces, where it's just not 2039 01:38:37,870 --> 01:38:43,240 feasible to explore all of the possible states that could actually exist. 2040 01:38:43,240 --> 01:38:46,540 So there then are two of the main categories of reinforcement learning. 2041 01:38:46,540 --> 01:38:49,110 Supervised learning, where you have labeled input and output 2042 01:38:49,110 --> 01:38:52,920 pairs, and reinforcement learning, where an agent learns from rewards 2043 01:38:52,920 --> 01:38:54,540 or punishments that it receives. 2044 01:38:54,540 --> 01:38:56,700 The third major category of machine learning 2045 01:38:56,700 --> 01:39:00,450 that we'll just touch on briefly is known as unsupervised learning. 2046 01:39:00,450 --> 01:39:03,540 And unsupervised learning happens when we have data 2047 01:39:03,540 --> 01:39:06,330 without any additional feedback, without labels. 2048 01:39:06,330 --> 01:39:09,810 That in the supervised learning case, all of our data had labels. 2049 01:39:09,810 --> 01:39:13,560 We labeled a data point with whether that was a rainy day or not rainy day. 2050 01:39:13,560 --> 01:39:16,590 And using those labels, we were able to infer what the pattern was. 2051 01:39:16,590 --> 01:39:20,220 Where we labeled data as a counterfeit banknote or not a counterfeit, 2052 01:39:20,220 --> 01:39:23,760 and using those labels, we were able to draw inferences and patterns 2053 01:39:23,760 --> 01:39:27,870 to figure out what does a banknote look like versus not. 
2054 01:39:27,870 --> 01:39:32,410 In unsupervised learning, we don't have any access to any of those labels, 2055 01:39:32,410 --> 01:39:35,500 but we still would like to learn some of those patterns. 2056 01:39:35,500 --> 01:39:39,000 And one of the tasks that you might want to perform in unsupervised learning 2057 01:39:39,000 --> 01:39:42,600 is something like clustering, where clustering is just the task of given 2058 01:39:42,600 --> 01:39:47,310 some set of objects organized into distinct clusters, groups of objects 2059 01:39:47,310 --> 01:39:49,330 that are similar to one another. 2060 01:39:49,330 --> 01:39:51,510 And there's lots of applications for clustering. 2061 01:39:51,510 --> 01:39:53,340 It comes up in genetic research, where you 2062 01:39:53,340 --> 01:39:55,740 might have a whole bunch of different genes, 2063 01:39:55,740 --> 01:39:57,900 and you want to cluster them into similar genes 2064 01:39:57,900 --> 01:40:01,530 if you're trying to analyze it across a population or across species. 2065 01:40:01,530 --> 01:40:04,530 It comes up in an image, if you want to take all the pixels of an image, 2066 01:40:04,530 --> 01:40:06,570 cluster them into different parts of the image. 2067 01:40:06,570 --> 01:40:09,060 Comes up a lot up in market research if you 2068 01:40:09,060 --> 01:40:11,340 want to divide your consumers into different groups 2069 01:40:11,340 --> 01:40:14,160 so you know which groups to target with certain types of product 2070 01:40:14,160 --> 01:40:15,690 advertisements, for example. 2071 01:40:15,690 --> 01:40:18,840 And a number of other contexts as well in which clustering 2072 01:40:18,840 --> 01:40:20,310 can be very applicable. 2073 01:40:20,310 --> 01:40:24,900 One technique for clustering is an algorithm known as k-means clustering. 2074 01:40:24,900 --> 01:40:27,270 And what k-means clustering is going to do 2075 01:40:27,270 --> 01:40:31,830 it is going to divide all of our data points into k different clusters, 2076 01:40:31,830 --> 01:40:35,700 and it's going to do so by repeating this process of assigning points 2077 01:40:35,700 --> 01:40:39,720 to clusters, and then moving around those clusters centers. 2078 01:40:39,720 --> 01:40:43,680 We're going to define a cluster by its center, the middle of the cluster, 2079 01:40:43,680 --> 01:40:46,830 and then assign points to that cluster based on which 2080 01:40:46,830 --> 01:40:49,430 center is closest to that point. 2081 01:40:49,430 --> 01:40:51,660 And I'll show you an example of that now. 2082 01:40:51,660 --> 01:40:55,020 Here, for example, I have a whole bunch of unlabeled data-- 2083 01:40:55,020 --> 01:40:58,920 just various data points that are in some sort of graphical space. 2084 01:40:58,920 --> 01:41:02,610 And I would like to group them into various different clusters. 2085 01:41:02,610 --> 01:41:04,500 But I don't know how to do that originally. 2086 01:41:04,500 --> 01:41:07,500 And let's say I want to assign three clusters to this group. 2087 01:41:07,500 --> 01:41:10,500 And you have to choose how many clusters you want in k-means clustering, 2088 01:41:10,500 --> 01:41:13,860 but you could try multiple and see how well those values perform. 2089 01:41:13,860 --> 01:41:17,070 But I'll start just by randomly picking some places 2090 01:41:17,070 --> 01:41:19,060 to put the centers of those clusters. 2091 01:41:19,060 --> 01:41:22,735 That maybe I have a blue cluster, a red cluster, and a green cluster. 
2092 01:41:22,735 --> 01:41:25,110 And I'm going to start with the centers of those clusters 2093 01:41:25,110 --> 01:41:27,660 just being in these three locations here. 2094 01:41:27,660 --> 01:41:30,090 And what k-means clustering tells us to do 2095 01:41:30,090 --> 01:41:32,790 is once I have the centers of the clusters, 2096 01:41:32,790 --> 01:41:39,510 assign every point to a cluster based on which cluster center it is closest to. 2097 01:41:39,510 --> 01:41:42,990 So we end up with something like this, where all of these points 2098 01:41:42,990 --> 01:41:47,310 are closer to the blue cluster center than any other cluster center. 2099 01:41:47,310 --> 01:41:50,970 All of these points here are closer to the green cluster 2100 01:41:50,970 --> 01:41:52,860 center than any other cluster center. 2101 01:41:52,860 --> 01:41:55,410 And then these two points plus these points over here, 2102 01:41:55,410 --> 01:42:00,370 those are all closest to the red cluster center instead. 2103 01:42:00,370 --> 01:42:04,050 So here then is one possible assignment, all these points, 2104 01:42:04,050 --> 01:42:05,910 to three different clusters. 2105 01:42:05,910 --> 01:42:06,895 But it's not great. 2106 01:42:06,895 --> 01:42:10,020 That it seems like in this red cluster, these points are kind of far apart, 2107 01:42:10,020 --> 01:42:12,870 in this green cluster, these points are kind of far apart. 2108 01:42:12,870 --> 01:42:15,810 It might not be my ideal choice of how I would cluster 2109 01:42:15,810 --> 01:42:17,670 these various different data points. 2110 01:42:17,670 --> 01:42:21,960 But k-means clustering is an iterative process, that after I do this, 2111 01:42:21,960 --> 01:42:25,890 there is a next step, which is that after I've assigned all of the points 2112 01:42:25,890 --> 01:42:28,380 to the cluster center that it is nearest to, 2113 01:42:28,380 --> 01:42:31,350 we are going to recenter the clusters. 2114 01:42:31,350 --> 01:42:33,420 Meaning take the cluster centers, these diamond 2115 01:42:33,420 --> 01:42:37,500 shapes here, and move them to the middle or the average, 2116 01:42:37,500 --> 01:42:41,040 effectively, of all of the points that are in that cluster. 2117 01:42:41,040 --> 01:42:43,170 So we'll take this blue point, this blue center, 2118 01:42:43,170 --> 01:42:46,812 and go ahead and move it to the middle or to the center of all 2119 01:42:46,812 --> 01:42:49,020 of the points that were assigned to the blue cluster, 2120 01:42:49,020 --> 01:42:51,070 moving it slightly to the right in this case. 2121 01:42:51,070 --> 01:42:52,570 And we'll do the same thing for red. 2122 01:42:52,570 --> 01:42:56,910 We'll move the cluster center to the middle of all of these points, 2123 01:42:56,910 --> 01:42:58,560 weighted by how many points there are. 2124 01:42:58,560 --> 01:43:01,800 There are more points over here, so the red center 2125 01:43:01,800 --> 01:43:03,810 and moving a little bit further that way. 2126 01:43:03,810 --> 01:43:06,352 And likewise for the green center, there are many more points 2127 01:43:06,352 --> 01:43:09,000 on this side of the green center, so the green center 2128 01:43:09,000 --> 01:43:13,240 ends up being pulled a little bit further in this direction. 2129 01:43:13,240 --> 01:43:15,808 So we recenter all of the clusters. 2130 01:43:15,808 --> 01:43:17,100 And then we repeat the process. 
2131 01:43:17,100 --> 01:43:21,540 We go ahead and now reassign all of the points to the cluster center 2132 01:43:21,540 --> 01:43:23,230 that they are now closest to. 2133 01:43:23,230 --> 01:43:25,650 And now that we've moved around the cluster centers, 2134 01:43:25,650 --> 01:43:27,690 these cluster assignments might change. 2135 01:43:27,690 --> 01:43:30,900 That this point originally was closer to the red cluster center, 2136 01:43:30,900 --> 01:43:34,030 but now it's actually closer to the blue cluster center. 2137 01:43:34,030 --> 01:43:35,580 Same goes for this point as well. 2138 01:43:35,580 --> 01:43:38,700 And these three points that were originally closer to the green cluster 2139 01:43:38,700 --> 01:43:43,670 center are now closer to the red cluster center instead. 2140 01:43:43,670 --> 01:43:48,380 So we can reassign what colors or which clusters each of these data points 2141 01:43:48,380 --> 01:43:49,620 belongs to. 2142 01:43:49,620 --> 01:43:51,560 And then repeat the process again, moving 2143 01:43:51,560 --> 01:43:54,590 each of these cluster means, the middle of the clusters, 2144 01:43:54,590 --> 01:43:59,810 to the mean, the average, of all of the other points that happen to be there. 2145 01:43:59,810 --> 01:44:01,400 And repeat the process again. 2146 01:44:01,400 --> 01:44:04,130 Go ahead and assign each of the points to the cluster 2147 01:44:04,130 --> 01:44:05,510 that they are closest to. 2148 01:44:05,510 --> 01:44:08,690 So once we reach a point where we've assigned all the points to clusters, 2149 01:44:08,690 --> 01:44:12,190 to the cluster that they are nearest to, and nothing changed, 2150 01:44:12,190 --> 01:44:15,650 we've reached a sort of equilibrium in this situation, where no points are 2151 01:44:15,650 --> 01:44:17,030 changing their allegiance. 2152 01:44:17,030 --> 01:44:19,850 And as a result, we can declare this algorithm is now over. 2153 01:44:19,850 --> 01:44:22,910 And we now have some assignment of each of these points 2154 01:44:22,910 --> 01:44:24,230 into three different clusters. 2155 01:44:24,230 --> 01:44:26,090 And it looks like we did a pretty good job 2156 01:44:26,090 --> 01:44:29,960 of trying to identify which points are more similar to one another 2157 01:44:29,960 --> 01:44:31,700 than they are two points in other groups. 2158 01:44:31,700 --> 01:44:34,970 So we have the green cluster down here, this blue cluster here, 2159 01:44:34,970 --> 01:44:37,650 and then this red cluster over there as well. 2160 01:44:37,650 --> 01:44:40,490 And we did so without any access to some labels 2161 01:44:40,490 --> 01:44:43,040 to tell us what these various different clusters were. 2162 01:44:43,040 --> 01:44:45,890 We just used an algorithm in an unsupervised sentence 2163 01:44:45,890 --> 01:44:48,470 without any of those labels to figure out which 2164 01:44:48,470 --> 01:44:50,820 points belonged to which categories. 2165 01:44:50,820 --> 01:44:54,663 And again, lots of applications for this type of clustering technique. 2166 01:44:54,663 --> 01:44:57,830 And there are many more algorithms in each of these various different fields 2167 01:44:57,830 --> 01:45:01,310 within machine learning-- supervised, and reinforcement, and unsupervised. 2168 01:45:01,310 --> 01:45:04,490 But those are many of the big picture foundational ideas that 2169 01:45:04,490 --> 01:45:06,408 underlie a lot of these techniques-- 2170 01:45:06,408 --> 01:45:08,700 that these are the problems that we're trying to solve. 
2171 01:45:08,700 --> 01:45:11,810 And we try and solve those problems using a number of different methods. 2172 01:45:11,810 --> 01:45:14,733 Of trying to take data and learn patterns in that data 2173 01:45:14,733 --> 01:45:17,150 whether that's trying to find neighboring data points that 2174 01:45:17,150 --> 01:45:19,790 are similar or trying to minimize some sort of loss 2175 01:45:19,790 --> 01:45:22,460 function, where any number of other techniques 2176 01:45:22,460 --> 01:45:26,420 that allow us to begin to try to solve these sorts of problems. 2177 01:45:26,420 --> 01:45:28,430 That then was a look at some of the principles 2178 01:45:28,430 --> 01:45:30,410 that are at the foundation of modern machine 2179 01:45:30,410 --> 01:45:33,930 learning-- this ability to take data and learn from that data 2180 01:45:33,930 --> 01:45:37,160 so that the computer can perform a task, even if they haven't explicitly been 2181 01:45:37,160 --> 01:45:39,508 given instructions in order to do so. 2182 01:45:39,508 --> 01:45:41,300 Next time, we'll continue this conversation 2183 01:45:41,300 --> 01:45:43,842 about machine learning looking at other techniques we can use 2184 01:45:43,842 --> 01:45:45,780 for solving these sorts of problems. 2185 01:45:45,780 --> 01:45:47,500 We'll see you then. 2186 01:45:47,500 --> 01:45:48,000