[VIDEO PLAYBACK]
[MUSIC PLAYING]
[END PLAYBACK]

DAVID J. MALAN: All right. This is CS50, and this is Yale University. Welcome to week seven.

So this class marks a transition between the first part of the course, where we have been learning about the magical world of C-- an extremely powerful low-level programming language that has allowed us to solve many problems indeed-- and a second part of the course, where we move towards higher-level programming languages, such as Python. In fact, this transition had already begun last week, when we learned about HTML, CSS, and the like. And we left off, really, by commenting on the fact that this new programming language, Python, can conveniently be used to write something like the back end of google.com or facebook.com or any web service, really, that accepts some parameters, parses them, possibly looks something up in a database, possibly stores something in a database, and gets back to the user with dynamic output.

So before we get to that end and see how Python can indeed be used to write the back end of a web server, it is instructive to see how Python can be used as a tool, really, to do data analysis, much like a data scientist would do. And this is what we are going to see today, diving into the magical world of machine learning.

But first of all, what is machine learning, really? So this is a funny sequence of vignettes that I found online on pythonprogramming.net. And they represent the stereotypical life of programmers and researchers in machine learning. So let's see, on the top left here. Well, society thinks that they are creating hordes of robots, possibly with the idea of conquering the world. Right? Friends think that they're hanging out with robots, really. Now, parents-- their parents think that programmers in machine learning spend most of their time in data centers with no windows, apparently. What about other programmers? Well, they might think that programmers do fancy mathematics. And what about themselves?
Well, they typically think that they're involved with some fancy visualization, data analysis. But at the end of the day, what they really do-- and here we are-- is use Python, and not just use Python to implement some algorithm, but really use Python to import some algorithm, as we are seeing here.

So we do not know Python yet. But this line of code looks extremely readable, doesn't it? There is English there. It says "from sklearn"-- we don't know what this is yet-- "import"-- I already mentioned svm, support vector machine. It's a function to run a machine-learning algorithm. So we don't know Python yet, but we're already able to decipher, more or less, what is going on. And indeed, sklearn, as we will see, is a so-called module in Python-- a library, if you wish, in C terms-- from which we are importing an algorithm, a function.

Now, this line exemplifies a characteristic feature of Python-- namely, its readability. And it is often the case that Python code is referred to as being similar to pseudocode, precisely for the fact that it allows us to express very powerful ideas with a couple of lines that are extremely readable. Now, this very characteristic of Python, together with our familiarity, at this point in the course, with C, is what will allow us today to quickly see how Python can be used as a data processing tool to implement and run machine learning algorithms.

Well, so back to the question-- what is machine learning, really? So as the previous sequence of vignettes was suggesting, in popular culture, at least, machine learning is often associated with the world of AI, artificial intelligence. Typically, we think of machines in terms of robots. And literally, there are countless science fiction movies about this theme, typically representing some robots turning evil against humanity, or along these lines. Indeed, this is a modern interpretation of an old fear dating back, possibly, to Frankenstein in the 1800s and beyond. But really, if we think of the way machine learning is having an impact on our lives on a day-to-day basis, it's not necessarily related to robots, per se.
It has more to do with the following set of applications that we indeed use on a daily basis. So let's just think of search engines, for instance, that allow us to look for whatever we might feel like on the World Wide Web, and in a matter of literally milliseconds get back an ordered list of results based on the query that we enter. Think about image recognition, the possibility to catalog, to search for an image based on its subject. Think about speech recognition software. These days, we can all talk to our phones and ask, what's the weather like, or show me the movies around me. Or finally, just to round out these four sets of applications I mentioned, the world of natural language processing-- the amazing ability to translate a document of text from one language to another in real time, or the ability to, say, infer the meaning, the semantics of a document with no human intervention.

So these are just a few of the applications that are indeed backed by machine learning algorithms. And here, by the word machine, we really mean an algorithm running, most likely, in the cloud in a data center. And today, we use these applications on a daily basis. And so, have you ever wondered what they're all about? How can we get started thinking about designing one of these applications? And this is what we're going to learn today. So in particular, we will be focusing on two applications-- image recognition and natural language processing.

But before that, let's go back to week zero. So we have seen this diagram early on in the course. It represents a general-purpose algorithm. The black box is an algorithm that takes some input from the left-hand side. It processes it, and it delivers the user an output. Now, the class of applications that we're going to see today very much fits into this framework. Just think about image recognition, for instance. Well, we might want to have an algorithm that takes as input images-- say here an image of a horse, an image of a car-- and is capable of realizing, of recognizing, that there is indeed a horse or a car in that image, and gets back to us with strings, such as "this is a horse."
Or, well, in the world of natural language processing, think about an algorithm where we could do something like the following-- say I want to pass as input to this algorithm one of my favorite novels, 1984 by George Orwell. So this is just a part of it-- "BIG BROTHER IS WATCHING YOU." But say that we really want the entire book to be fed as input to this algorithm. We want the algorithm to be able to recognize the semantics of the book. So we want the algorithm, with no human intervention, to be able to get back to us and tell us, look, Patrick, this book is about politics. It's about propaganda. It's about privacy.

So in both of these two applications-- image recognition and natural language processing-- it is already clear that what the machine learning algorithm is doing is trying to infer some hidden structure in the input, right? And if we phrase the problem like that, well, we have seen something like that earlier on in the course-- in fact, many examples of a scenario where we are given an input with some hidden structure, and we want to design-- you have done that in problem set four-- an algorithm that can decipher, that can crack the input, and give us an output. So this is Whodunit. And per the specification, we were given an image with some red noise. And you were told to design an algorithm to get back the hidden message.

So in this respect, you might think that applications such as image recognition seem to share a similar behavior, don't they? But there is a key difference. Can anyone spot the difference? We're getting started.

So one difference in the picture is that instead of just one image, there are two images. And indeed, that is somewhat to the point, in the sense that in problem set four, we were given an image with a hidden message, indeed. But we were told the structure of the hidden message. In other words, we were told that the image had a lot of red noise, and we were told that in order to decipher, to really find the hidden message, we should lower the intensity of the red pixels and possibly raise the intensity of the other pixels.
But in machine learning applications, this is not typically what we have in mind. We want an algorithm that can work with any sort of image we may want to feed it with. And this is what brings us to one of the key differences in the class of applications that we are going to see-- namely, the fact that it is the algorithm itself that is trying to find the hidden structure in the input. And the way it can do this is by having access to what is called training data. In other words, if we want to design an algorithm that is capable of telling us, Patrick, look, this is a horse, then in order to work, this algorithm needs to have access to a lot of images of horses. And based on these images, it is able to figure out the hidden structure, so that once you feed it an image of a horse again, it can say, this is a horse.

And this is what we are going to see today, starting precisely from the example of image classification. So what is the example we're going to play with? It's the following-- suppose that we are given a data set of handwritten digits. So there are 10 digits, from zero to nine. And for each of these digits, we are given a collection of handwritten images representing that digit. So for instance, for digit zero, we are given many of these images. And each of them shows a different way of writing the digit zero. So what we set as our goal for today is to actually design, in this case, an algorithm that can take an image of digit zero or an image of digit six or any other digit and, indeed, get back to us with the string "0" or "6" and so on. And the way we can do that, as we will see, is by having access to so-called training data, and inferring the hidden structure behind the images we want to use.

But before we get to talk about images, let us just abstract a little bit. And let us imagine that we live in a one-dimensional world, a Lineland, if you wish, where the only thing that we can see is a straight line. Say we go to school, and the first thing that we are told on day one of school is, OK, I see a point here. The teacher is telling us, well, look, Patrick. This is number zero.
Then we go again, day two of school. We see another point. The teacher is telling us, this represents a zero. Day three, we see another point. It's the only thing that we can see in a one-dimensional world. And the teacher is telling us, well, this is a six. And so on. So day four, we see another point. This represents number zero. And so on. So this represents number six. And this represents number six.

So in other words, we have been exposed to a labeled training set of data-- points and the associated labels corresponding to the digits they represent. So say now that we are presented, the next day of school, with a so-called test point over here. And we are asked, what should this number be? What should this point represent? Which digit? Anyone?

Number six, indeed. And congratulations, because this is the first example. This is the first machine learning algorithm that we are going to discuss today, easy as it sounds. So there is a lot going on. First of all, just appreciate the fact that I haven't told you anything about the structure. I just presented you with a set of points. I didn't tell you anything about the game. It's not like in problem set four, where we were told about the specifics of the image to crack. In this case, I was just presenting you with points. And then at a certain moment, I was asking you, OK, what is this point? And as you have guessed, well, this should represent the number six. Why? Well, because somehow, we have figured out that the points representing number zero are grouped on one side of the line, and the points representing number six are grouped on another side of the line. And so in our head, what we are doing is looking at the closest point among the ones that we have been exposed to previously-- the point with the minimal distance, the so-called nearest neighbor-- and just labeling the new point that we're given-- the test point-- with the label of the closest point we've been exposed to.
So this algorithm is called the nearest neighbor classifier. And indeed, this is what we're going to use to classify handwritten digits. But before we get to that, let's assume that we get promoted. And so in the second year of school, we now see the world in two dimensions, much like a Flatland. And we are exposed to a set of points-- again, a labeled set of points. So this point represents number zero. This is number six, number six, number zero, number six, number zero, six again. And finally, we're presented with a test point. So any guess on what this test point should represent?

It's the number zero. And indeed, it's the same reasoning, just done in two dimensions-- nearest neighbor classification. So in a way, this is very much intuitive. And what we are left with is the question of, OK, can we map an image into a point in a space? Because if we were able to do that-- if we could take an image of a digit and just interpret that image as a point in a space-- then we could repeat exactly the same procedure I told you about. In this case, there are no points; there are images in this abstract space. But we have access to a labeled training set. So we know what these digits represent. And then when we are asked, OK, what is this new digit? Well, we apply the same logic, right?

And this is what we are going to see. Indeed, the world can be seen as having more than one or two or three dimensions. And much of machine learning is really about interpreting the data we are given as points in some sort of high-dimensional world. And in making this jump, we might feel a little bit like the character in one of my favorite novels-- namely, Flatland by Edwin Abbott Abbott. This is the [INAUDIBLE] of the first edition in 1884. The plot of Flatland is the following-- the story describes a two-dimensional world occupied by geometric figures. The narrator is a square named A Square, who guides the reader through some of the implications of life in two dimensions.
On New Year's Eve, A Square dreams about a visit to a one-dimensional world, Lineland, inhabited by lustrous points, in which he attempts to convince the realm's monarch of a second dimension, but he's unable to do so. Following his vision, A Square is himself visited by a three-dimensional sphere named A Sphere, which he cannot comprehend until he sees Spaceland, a three-dimensional world. So many movies have been produced on this story line. Let's see a trailer of one of those.

[VIDEO PLAYBACK]

-Imagine a vast plane, a world of only two dimensions, on which triangles, squares, pentagons, and other figures live and move freely about.

-Configuration makes the man.

-Get to your squarical! Now!

-You're only a square.

-Thanks, brother.

-They know nothing of our three-dimensional world.

-Such a notion is, of course, absurd and, furthermore, illegal!

-But that's all about to change.

-Where did you come from?

-I come from space, the third dimension.

-No, no, no, no, no, no, no, no! No! You're not serious!

-Based on the beloved novel by Edwin A. Abbott. Tonight, our world faces a grave threat.

[END PLAYBACK]

DAVID J. MALAN: Right. And then it goes on. I love it. It's one of my favorites. You should watch the movie.

And so indeed, are we ready to go beyond Lineland, Flatland, Spaceland? So let's do it. So here I've just represented what Lineland looks like. Just one line, right? The only things that we can see in one dimension are really points. And here I drew two points for us. What we have here is a coordinate system, just to be able to measure distances, so to speak. So in this case, this point is at location one. This is what this one represents. And this point is at location four. So this is the picture in one dimension, Lineland. Flatland we have seen-- a two-dimensional world. Indeed, we can visualize points here.
In this case, each point is represented by two coordinates. We have the horizontal coordinate and the vertical coordinate. So coordinate 1, horizontal, and 2, vertical. And so on for the point 4, 4. Now we go to Spaceland-- indeed, a three-dimensional world. And even here, it's pretty easy to visualize what points really are. There are three coordinates, and so we can simply refer to a point by its associated coordinates.

So can we go beyond? Indeed. Indeed we can. It's not that easy to draw points or the like in higher dimensions. But we can indeed think of a point, say, in four dimensions as being referred to by a set of coordinates-- say, 0, 1, 4, 5. Isn't it? So OK, the reference axes there don't mean much. We're no longer in three dimensions. Still, they just represent a sort of abstract space. And indeed, I cannot draw four axes here, but I can definitely think of a point indexed by four coordinates as a point living in a four-dimensional world. And so on. If we want a point in a five-dimensional world, well, here we are.

And so we can go back to the idea: can we map an image from this data set we have access to to a point in a higher-dimensional space? And as we have seen in problem set four, images are just a collection of pixels-- in this case, for the smiley, just a collection of zeroes and ones, zeroes being associated with the color white, ones with the color black. And in the data set that we are playing with, each image is simply an eight-by-eight array of pixels, whereby each pixel in this case is a number between 0 and 16. So in some sense, we have eight by eight-- that's 64. So each image is really a point in a 64-dimensional space, isn't it?

So we can really use this idea and interpret images as points in this 64-dimensional space. And now we can indeed run the nearest neighbor classifier we have seen previously. Namely, we are in this 64-dimensional space. We see labeled images that are presented to us. We are exposed to them, if you wish. This represents a six, a six.
This represents a zero. And so on, until we are presented with a test point again. Fine. We know how to deal with it. We simply assign to this test point the label of the training point that is closest to it. So the only thing that is missing here is a notion of distance, isn't it? So let's see how we can go about thinking about this.

So in a one-dimensional world, it's pretty easy to compute distances. Indeed, the distance between these two points is simply what? This line, right? So it's simply 4 minus 1, which is 3. In a two-dimensional world, it's a little bit more complicated, but we are still able to do it. If this is the distance we want to have access to, well, we have the old Pythagorean theorem. So we first compute this distance-- the difference between the horizontal coordinates of the two points, which is 4 minus 1, so 3. Then we square it. And we add the vertical distance between the two points, which is 4 minus 2-- 2. And then we square it. And then we take the square root. Isn't it? So in this case, well, it will be the square root of 13.

And even in a three-dimensional world, it's the same idea. In order to get the distance between two points, we can simply work coordinate-wise. So we take the difference between the first coordinates, which is a 3, and square it; plus the difference between the second coordinates, which is a 2, squared; plus the difference between the third coordinates, which is a 3, squared again. And we take the square root of all this. So we indeed have a formula to compute distances. And this formula doesn't just hold in one-, two-, or three-dimensional space. It holds in however many dimensions we may want to work in.

And so we have all the ingredients to run the nearest neighbor classifier at this point. So just to give you an idea of what these distances look like, say we want the distance between these two images, thought of as points in this 64-dimensional space. Well, in this case, if we apply this formula coordinate-wise, we get a distance of 31.98.
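To make that general formula concrete before we compare more image distances, here is a minimal sketch of the same coordinate-wise computation in Python; it is written with plain lists so it works in any number of dimensions, and the function name is just for illustration:

    import math

    def euclidean_distance(x, y):
        # Square the coordinate-wise differences, sum them, take the square root.
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    # The two-dimensional example from above: points (1, 2) and (4, 4).
    print(euclidean_distance([1, 2], [4, 4]))   # sqrt(3**2 + 2**2) = sqrt(13), about 3.61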
Say now we want to consider the distance between an image representing digit zero and an image representing digit six. Well, we do the math, and we find 45.97, which is indeed bigger than what we had previously. Previously, we had about 31 as the distance between two images representing the same digit. So that should be smaller than the distance between two images representing different digits.

So at this point, we are ready to see some Python code. We are going to actually implement the nearest neighbor classifier I just described to you in abstract terms, using this data set. Again, for each digit, we have access to a collection of different images representing that digit. So let's get to see some Python.

First of all, let me show you what Python is. We can go to the CS50 IDE and simply fire up the Python interpreter by typing python. And here we are. We're inside the interpreter, so to speak. At this point, we can run Python code. So let's see. We can write x = 3, then y = 5, and then x + y equals-- guess what? 8. Amazing, isn't it? Coming from the world of C, this is totally amazing. Many things are happening here. First, there is no need to declare variables whatsoever. I didn't write int x = 3. I simply wrote x = 3. And in fact, we might write something like x = 'a' and y = 'b'. And guess what-- x + y is equal to the string 'ab'. So in Python, there is no difference between single quotes and double quotes, unlike in C. And indeed, we do not need to declare the type of a variable.

Another key difference with respect to C is the fact that it's immediate. There is no clang or make or the like. There is no need for us to call the compiler. I was simply running the code, and the interpreter was in fact interpreting, running, the code line by line. So indeed, there is a compiler behind the scenes, but we do not need to get involved with it. This is one of the beauties of Python.
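For reference, here is a rough sketch of that interpreter session written as a short script (Python 3 syntax assumed; the letters 'a' and 'b' are the string example as corrected above), with comments noting what each step prints:

    # Dynamic typing: no type declarations, no separate compilation step.
    x = 3
    y = 5
    print(x + y)      # 8

    x = 'a'
    y = 'b'
    print(x + y)      # ab -- the same + operator now concatenates strings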
And in fact, coming from the world of C, we can read Python code fairly easily at this point. Now, the syntax is a little bit different. So let's see what a for loop would be: for i in [3, 5, 7]: print(i). And ta-da, this is a for loop. So there are a few syntax differences, right? First of all, it is more like a for-each loop, where we loop through each element in this array, if you wish. And then there is a colon, which we haven't seen before. More interestingly, there are no brackets, no curly braces. So how does Python know that the block of code we want to execute inside the for loop is the line print(i)? Well, it goes by indentation. So if I were to repeat the line of code that I just presented to you, but without the indentation here, you would see an error. So in Python, there is no need for curly braces. It's simply a matter of the indentation you use. And it is good style to use indentation even in C code, so why the curly braces after all?

So OK, I could carry on like this and show you some more Python code within the CS50 IDE. But for the sake of exposition, let me actually close this and go to the following way of presenting code to you. So these are called markdown-style notebooks. The idea is that while the code is indeed run line by line, as Python does, this way of presenting the material allows me to group lines of code together. So here it is, what I was writing earlier in the CS50 IDE-- the same lines of code, the same lines of code again when it comes to manipulating strings, the for loop that I presented to you, and so on. So as we see again, we don't know the syntax yet, but we can read off what is happening, coming from C. This is an if statement, for instance. Again, there are syntax differences. Indeed, there are no braces here. There is the colon, and there is the indentation. But we can really decipher what is going on.
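As a rough illustration of those syntax differences, here is a small hypothetical snippet showing a for loop and an if statement in Python, with the rough C equivalents noted in comments (Python 3 syntax assumed):

    # In C, roughly: for (int i = 0; i < n; i++) { printf("%i\n", values[i]); }
    for i in [3, 5, 7]:
        print(i)           # the indented line is the body of the loop; no braces needed

    x = 3
    # In C, roughly: if (x < 5) { ... } else { ... }
    if x < 5:
        print("small")
    else:
        print("big")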
So this is one of the beauties of Python, and we are going to rely on this beauty today to actually parse, almost line by line, some blocks of code that will allow us to quickly jump into the action and see some meaningful, cool output from machine learning algorithms. So let's do that. We go to the world of supervised learning, namely image recognition.

But before we get to that, what is supervised learning? Applications in machine learning typically fit into either supervised learning or unsupervised learning. Today we are going to see both. The example of image recognition fits into the category of supervised learning, as we have access to a labeled data set, as I mentioned and stressed earlier. So we are presented with a set of points or images with a label associated with each of them. Because we have these labels, we're doing supervised learning.

So let's start with points, and let's see how Python can be used in Flatland, if you wish. Now, Python has many built-in data types and functions. Indeed, we are going to see much more of this next week. But when it comes to data processing, data-science types of applications, people typically rely on external data structures and functions. And so this is what we are going to do today. And this is the line of code that I'm writing here. I'm importing two modules, as they're called-- libraries, if you wish, in the world of C. The first module is numpy. It's one of the main modules for scientific computing. The second module is matplotlib. It will allow us to easily plot graphs. And as for the third line of code, don't pay attention to it. It is simply required in this notebook style to tell the notebook to print whatever graph we are producing inline with the output.

So one of the cool things about Python is that it is an extremely popular language. And being popular, in computer science, is really important, one reason being that there is a lot of code out there that we can simply go and import. If you wish, numpy and matplotlib are along these lines. Someone else has written libraries, modules of code, that we can easily import.
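Concretely, the import cell being described probably looks something like the following sketch; the short aliases np and plt are the conventional ones, and the last line is IPython notebook "magic" rather than standard Python, so it only matters inside a notebook:

    import numpy as np                # scientific computing: arrays and vectorized math
    import matplotlib.pyplot as plt   # plotting
    %matplotlib inline                # notebook-only: render figures inline with the output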
So indeed, we do need to install these modules before we can import them. But for the sake of today, just ignore the fact that we need to install them. So let's just assume that we can easily import these modules with the lines of code that I am presenting to you.

OK, let us create a training data set in Flatland. So here is the code. The first line creates a numpy array, meaning an array within this numpy scientific module. So it's very simple. We're creating an array of six points. Each point is a two-dimensional point, so it is indexed by two coordinates. And in the second line, instead, we are creating, in this case, a built-in Python list, as we will see. But it's simply an array of strings, if you wish. And each element in the array Y_train is simply a color, the name of a color. We have three "red" strings and three "blue" ones.

So shortly, we will plot this collection of points. But before we get to do that, let me just show you some cool features of the Python syntax. So X_train is an array of six points, each of them being a two-dimensional vector, if you wish. But you can also interpret this array as being a two-dimensional array. And so we can use the syntax that I present to you here. So the print function simply prints whatever is inside it. And with this line of code, X_train[5, 0], what we are doing is taking the fifth element in the array of points. Python is a zero-indexed language, much like C, so we start counting from zero. And so in this case, the fifth point is the last one in this array-- namely, (7, 6). So we take the fifth point, the last one in that array, and we print its zeroth coordinate. And if we do that, the interpreter outputs 7, which is indeed the horizontal coordinate, if you wish, of the last point in the collection. Now, we can do the same while changing from the horizontal coordinate to the vertical one. And so we get 6, which is here.

So one other key feature of Python-- we are going to get to see a little bit of it today-- is the so-called slicing syntax.
The slicing syntax is a convenient way to extract a collection of elements from an array, as in this case. So what we are doing here-- the first line of code is saying, OK, Python, just look at all the points in the array, take the horizontal coordinate, the zeroth coordinate, and print all the elements in the horizontal coordinate. Indeed, if you were to compare, we have a 1, a 2, a 3, a 5.5. These are all the first, the horizontal, components of each point. And the same goes for the second component.

So this is one of the uses of this slicing syntax. And what we can use this syntax for is to conveniently plot these points. So now we are using-- OK, plt.figure is simply saying, Python, just expect that I'm going to draw a figure. And the last line as well, the .show, is simply saying, just output the plot. So the magic, really, everything, is happening in the line in between-- the scatter function that is contained, again, in the matplotlib module we are importing. This scatter function takes two arrays: an array of horizontal coordinates, if you wish, which we have access to through the slicing syntax, and an array of vertical coordinates. And then we have s=170-- simply the size, if you wish, of these points-- and the color. This is one of the beauties of Python, again. We are using the slicing syntax again, this time on the label array. Recall that Y_train was an array of colors. So every time we are plotting a point, we are also taking the corresponding color from the array Y_train.

So if we do that, just appreciate that with essentially one line of code, we have a plot here. And there are indeed six points. Three of them are red. Three of them are blue. So this represents our so-called labeled training set. Instead of having digits zero and six, now we have the colors red and blue. The idea is the same. So what we can do now is add a so-called test point, say at location (3, 4). So why don't we plot it? We use the same lines of code as before to plot. We add this new line with the new test point. And we have it plotted with the color green.
And this is the output. Flatland. So now, what we want to do is run this nearest neighbor classifier. And we know how, right? We simply look at the point that is closest to the green point here. And we assign to the green point either the color red or the color blue, depending on whatever color its nearest neighbor has-- in this case, red.

So in order to do that, we need to define a notion of distance. Well, we know what the distance should look like. We have the mathematical formula. Let's write it down in Python. And so you get to see how we can define functions in Python. So here, def-- and again, this is resembling pseudocode, right? def stands for define. Just appreciate this. So we define a function that we call dist, which takes two points and returns the following.

So let me just parse for you precisely what we are doing here. The line of code that we want to understand is the following, where we take two points. These could be, in this case, two-dimensional points, each of them. But later on, we are going to 64-dimensional points. And it returns the Euclidean distance between them. So let's parse this. Let's assume that we have two points. y is the test point (3, 4), and x is one of the training points. And we want to see what this line above is doing with respect to these inputs.

So first of all, we can take the difference. And the way Python thinks about this is taking the difference coordinate-wise. This is, again, one of the magic properties of the Python language-- namely, we can apply a function to the entire vector. It's called vectorization. So whenever there are vectors-- in this case, two-dimensional points or the like, or arrays-- we can apply a function to each element of the array. So in this case, if we take the difference, Python will automatically take the difference of each coordinate at once. So the difference between 1 and 3 is indeed minus 2. And the difference between 1 and 4 is indeed minus 3. So now we can take the power of two-- we can square it.
And again, what Python is doing with this code is simply working coordinate-wise. So it's taking minus 2 and squaring it. That's 4. It's taking minus 3 and squaring it. That's 9. And so on. Now we can use a function that comes with the numpy module that simply sums the two coordinates. So 4 plus 9 is indeed 13. And then we can take the square root of it. So this is what this single line of code is doing. Just appreciate the beauty behind that. And indeed, we can define a function that simply returns the distance between any two points.

And now, for each point in the training set, we compute the distance with respect to the single test point. So what we are doing here: there are six points in the training set, three red and three blue. So we are taking the distance from the green point to each of these six points in the training set. And this is what this block of code is doing. The len function simply returns the number of points in the X_train array. Then we initialize an array of zeros, with six zeros. And then this is simply a for loop that computes the distance, using the dist function we defined, for all six pairs of points. And this is the output that we print with the print function. So this is an array where each element represents the distance of a training point with respect to the test point.

Then, to complete our classifier, what we can do is just choose the point that has the smallest distance-- in this case, the second one, as we can see. The distance is 1.8 here. And we can simply print the color associated with that point. And indeed, if we go back to the picture here, we see that point number two is the closest to the green dot. And so here it is. So just appreciate the beauty. With literally three or four lines of code, we can run a machine learning algorithm. In fact, using some more Pythonic syntax, as it is called, we can get this down to even fewer lines of code. So OK, this is the case for points.
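Putting the Flatland pieces together, here is a minimal sketch of the whole example as just described. The exact training coordinates are assumptions made up for illustration-- only a few of them, such as (1, 1), (7, 6), and the test point (3, 4), are mentioned in the lecture-- and the Pythonic shortcuts alluded to above are left out for clarity:

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative labeled training set: coordinates are assumed, labels are colors.
    X_train = np.array([[1, 1], [2, 2.5], [3, 1.2], [5.5, 6], [6, 9], [7, 6]])
    Y_train = ['red', 'red', 'red', 'blue', 'blue', 'blue']
    X_test = np.array([3, 4])

    # Plot the labeled training points and the green test point.
    plt.figure()
    plt.scatter(X_train[:, 0], X_train[:, 1], s=170, color=Y_train)
    plt.scatter(X_test[0], X_test[1], s=170, color='green')
    plt.show()

    # Euclidean distance, computed coordinate-wise (vectorized).
    def dist(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    # Distance from the test point to every training point.
    num = len(X_train)
    distance = np.zeros(num)
    for i in range(num):
        distance[i] = dist(X_train[i], X_test)
    print(distance)

    # The nearest neighbor's label is our prediction.
    print(Y_train[np.argmin(distance)])   # 'red' for these assumed coordinates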
723 00:43:37,380 --> 00:43:41,160 But let's go back to the image classification example. 724 00:43:41,160 --> 00:43:45,300 Let us see how we can exactly take the precise lines of code 725 00:43:45,300 --> 00:43:51,130 that I showed to you and apply it to images in the digit data set. 726 00:43:51,130 --> 00:43:53,300 So this is what we are doing. 727 00:43:53,300 --> 00:43:57,080 We are importing from the module sklearn-- 728 00:43:57,080 --> 00:44:01,990 it's a common module in machine learning for Python. 729 00:44:01,990 --> 00:44:06,110 We are importing a data set that is the digit data set. 730 00:44:06,110 --> 00:44:07,440 We call it digits. 731 00:44:07,440 --> 00:44:13,420 So the digit data set contains 1,797 images in that. 732 00:44:13,420 --> 00:44:16,990 And there are multiple structures within this database. 733 00:44:16,990 --> 00:44:21,610 In particular, there are two arrays-- an array of images, which is called 734 00:44:21,610 --> 00:44:29,610 digits.images, an array of labels, which is called digits.target. 735 00:44:29,610 --> 00:44:30,250 So let us see. 736 00:44:30,250 --> 00:44:37,700 Let us print the first element in the array digits.images. 737 00:44:37,700 --> 00:44:40,150 So this is what it contains. 738 00:44:40,150 --> 00:44:45,490 It is an eight by eight collection of pixels that indeed represents 739 00:44:45,490 --> 00:44:50,360 the number zero, as we can see by actually 740 00:44:50,360 --> 00:44:53,420 plotting with that line of code that I showed to you, 741 00:44:53,420 --> 00:45:03,440 quickly mapping each pixel from 0 to 16 included to an intensity of black. 742 00:45:03,440 --> 00:45:06,160 So indeed, we can realize that, well, this is a zero. 743 00:45:06,160 --> 00:45:10,810 There are a few zeroes in the middle. 744 00:45:10,810 --> 00:45:16,110 And indeed, if we plot it with this line of code, we indeed get number zero. 745 00:45:16,110 --> 00:45:23,830 So this is just the first element indexed by zero in the data set. 746 00:45:23,830 --> 00:45:26,660 And we can indeed also plot the true label 747 00:45:26,660 --> 00:45:29,315 that counts with the digits.target. 748 00:45:29,315 --> 00:45:31,460 And it is a zero. 749 00:45:31,460 --> 00:45:33,080 So this is the data set. 750 00:45:33,080 --> 00:45:38,550 And what we want to do is to run the nearest neighbor classifier 751 00:45:38,550 --> 00:45:40,680 to this data. 752 00:45:40,680 --> 00:45:44,180 So in particular, what we are doing-- so let's see. 753 00:45:44,180 --> 00:45:48,420 This data set, again, something like this. 754 00:45:48,420 --> 00:45:50,870 What we are doing, we are saying, OK. 755 00:45:50,870 --> 00:45:53,910 Let us consider just a subset of the database. 756 00:45:53,910 --> 00:45:56,850 So let us take 10 images. 757 00:45:56,850 --> 00:46:01,940 And let us consider these 10 images as being the training data 758 00:46:01,940 --> 00:46:03,570 that we have access to. 759 00:46:03,570 --> 00:46:06,110 Potentially, we could have access to the entire database. 760 00:46:06,110 --> 00:46:07,380 But somehow, we want to split. 761 00:46:07,380 --> 00:46:09,850 And this is a common feature in machine learning, 762 00:46:09,850 --> 00:46:13,410 to split the original data set into multiple subsets 763 00:46:13,410 --> 00:46:18,050 in order to test the performance of the algorithm you are implementing. 764 00:46:18,050 --> 00:46:21,250 So this is what we are doing. 
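As a rough sketch of what was just described, loading and inspecting the digits data set with scikit-learn, and carving out the first 10 images as a training set, might look like this; the plotting call is one reasonable choice rather than necessarily the lecture's exact line.

import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

print(len(digits.images))   # 1,797 images in the data set
print(digits.images[0])     # an 8x8 grid of pixel intensities from 0 to 16
print(digits.target[0])     # the true label of the first image

# Map each intensity to a shade of gray and display the first digit.
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()

# Keep only the first 10 images (indices 0 to 9) as the training set.
X_train = digits.images[0:10]
Y_train = digits.target[0:10]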
765 00:46:21,250 --> 00:46:28,890 By selecting 10 images, we essentially have this picture, 766 00:46:28,890 --> 00:46:36,270 where each point in this 64-dimensional space represents an image of a digit. 767 00:46:36,270 --> 00:46:39,780 And the yellow label here represents the true label 768 00:46:39,780 --> 00:46:43,040 we have access to in this training set. 769 00:46:43,040 --> 00:46:45,360 So now what we can do, we can say, OK. 770 00:46:45,360 --> 00:46:47,700 Let us take a test point. 771 00:46:47,700 --> 00:46:51,477 So say that-- take a test point here. 772 00:46:51,477 --> 00:46:55,133 773 00:46:55,133 --> 00:46:59,840 And Let us assume that we only have access to the image of this test 774 00:46:59,840 --> 00:47:01,920 point, which is indeed a three. 775 00:47:01,920 --> 00:47:06,380 And we want to test the performance of our machine learning algorithm 776 00:47:06,380 --> 00:47:08,040 to classify this point. 777 00:47:08,040 --> 00:47:12,330 So indeed, this is the line of code that allows us to only take 778 00:47:12,330 --> 00:47:15,380 10 images out of the original data set. 779 00:47:15,380 --> 00:47:18,500 I'm using, again, the slicing syntax, but in this case, 780 00:47:18,500 --> 00:47:23,480 a little bit differently in the sense that we are selecting elements 781 00:47:23,480 --> 00:47:27,730 from zero included to 10 excluded. 782 00:47:27,730 --> 00:47:30,680 This is per the specification of the slicing syntax. 783 00:47:30,680 --> 00:47:35,260 The right, outmost element is excluded. 784 00:47:35,260 --> 00:47:39,860 So indeed, if we use this syntax, we can extract 10 images from it, 785 00:47:39,860 --> 00:47:42,540 precisely like in the picture there. 786 00:47:42,540 --> 00:47:44,810 And then we can create a test image. 787 00:47:44,810 --> 00:47:48,600 We can choose a random number here in the remaining part of the dataset, 788 00:47:48,600 --> 00:47:54,460 say the image corresponding to 345. 789 00:47:54,460 --> 00:47:58,390 Indeed, we can plot it with the same line of code I presented to you. 790 00:47:58,390 --> 00:48:00,930 It is indeed a three. 791 00:48:00,930 --> 00:48:04,310 But this is easy to the human eyes. 792 00:48:04,310 --> 00:48:08,080 So we want to see how good of a performance 793 00:48:08,080 --> 00:48:11,810 we can get by applying the nearest neighbor classifier. 794 00:48:11,810 --> 00:48:15,420 And now the lines of code that I'm going to present to you now 795 00:48:15,420 --> 00:48:18,200 are precisely the same lines of code that I 796 00:48:18,200 --> 00:48:21,780 presented to you earlier in Flatland. 797 00:48:21,780 --> 00:48:23,010 So let's see. 798 00:48:23,010 --> 00:48:25,260 This is all together. 799 00:48:25,260 --> 00:48:29,650 And indeed, we get that the classifier is returning number three. 800 00:48:29,650 --> 00:48:31,920 So the classifier, what it's doing, again, 801 00:48:31,920 --> 00:48:37,710 is computing all the distances between these test points 802 00:48:37,710 --> 00:48:40,690 and all the other points in the training sets. 803 00:48:40,690 --> 00:48:45,630 And it's choosing the point that has the closest distance, the nearest neighbor. 804 00:48:45,630 --> 00:48:51,550 And it is assigned the same label of this point-- and in this case, 805 00:48:51,550 --> 00:48:53,580 indeed, correct. 
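Here is a sketch of that classification step, reusing the dist function from the Flatland example. It works on digits.data, scikit-learn's already-flattened 64-dimensional version of the same images; whether the lecture flattened digits.images itself or used digits.data is an assumption.

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Training set: the first 10 images and their labels, as 64-dimensional points.
X_train = digits.data[0:10]
Y_train = digits.target[0:10]

# Test point: image number 345, which lies outside the training slice.
x_test = digits.data[345]

# Nearest neighbor classification, exactly as in Flatland.
distances = np.array([dist(x, x_test) for x in X_train])
print(Y_train[np.argmin(distances)])   # the predicted label
print(digits.target[345])              # the true label, for comparison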
806 00:48:53,580 --> 00:48:57,260 So it should come as a surprise that, indeed, such a simple algorithm-- 807 00:48:57,260 --> 00:48:59,390 two, three lines of code in Python to implement 808 00:48:59,390 --> 00:49:04,980 it-- allows us to get such a good result. But how good of a result 809 00:49:04,980 --> 00:49:06,250 is this, really? 810 00:49:06,250 --> 00:49:08,400 Well, we can test it. 811 00:49:08,400 --> 00:49:10,810 And indeed, we can plot the true solution. 812 00:49:10,810 --> 00:49:14,170 We do have access to the true label, which is indeed a three. 813 00:49:14,170 --> 00:49:19,520 So let us test how well we are doing with 100 test images. 814 00:49:19,520 --> 00:49:22,110 So what we are doing, instead of just testing 815 00:49:22,110 --> 00:49:25,040 the performance of our algorithm with a single image, 816 00:49:25,040 --> 00:49:29,530 let us consider a set of 100 images. 817 00:49:29,530 --> 00:49:36,750 And let us count how many mistakes the algorithm we just implemented makes. 818 00:49:36,750 --> 00:49:39,220 So if we run this code-- I won't parse it for you. 819 00:49:39,220 --> 00:49:44,530 It's simply, again, starting from a number of errors 820 00:49:44,530 --> 00:49:45,180 equal to 0. 821 00:49:45,180 --> 00:49:47,940 And then there is a counter here. 822 00:49:47,940 --> 00:49:53,910 It adds 1 every time the algorithm outputs something 823 00:49:53,910 --> 00:49:55,460 that is different from the truth. 824 00:49:55,460 --> 00:50:02,580 So if we run this algorithm over a set of 100 test images, 825 00:50:02,580 --> 00:50:06,840 we get that we commit 37 errors. 826 00:50:06,840 --> 00:50:14,180 So we get 63 correct answers out of 100, which is pretty good, really, 827 00:50:14,180 --> 00:50:16,628 for such a simple algorithm, isn't it? 828 00:50:16,628 --> 00:50:19,440 829 00:50:19,440 --> 00:50:23,670 But indeed, much like the way humans learn, 830 00:50:23,670 --> 00:50:26,360 machine learning algorithms also 831 00:50:26,360 --> 00:50:31,490 get better with the amount of training data they have access to. 832 00:50:31,490 --> 00:50:35,050 So in this case, we have just chosen a subset 833 00:50:35,050 --> 00:50:40,680 of the original database-- just 10 images. 834 00:50:40,680 --> 00:50:47,760 So what we might try to do is to take a training set which is much bigger 835 00:50:47,760 --> 00:50:51,790 and see how well the algorithm is doing with that. 836 00:50:51,790 --> 00:50:54,460 So we can do that. 837 00:50:54,460 --> 00:50:58,210 We indeed enlarge the training set. 838 00:50:58,210 --> 00:51:01,060 Before, it was from 0 to 10 excluded. 839 00:51:01,060 --> 00:51:04,005 Now it is from 0 to 1,000 excluded. 840 00:51:04,005 --> 00:51:07,500 So it has 1,000 images. 841 00:51:07,500 --> 00:51:12,430 We can run exactly the same code as before over 100 test images. 842 00:51:12,430 --> 00:51:16,190 And this time, look-- only three mistakes. 843 00:51:16,190 --> 00:51:17,840 It's rather surprising, isn't it? 844 00:51:17,840 --> 00:51:19,560 I mean, such a simple algorithm. 845 00:51:19,560 --> 00:51:22,460 I described it to you starting from Lineland. 846 00:51:22,460 --> 00:51:25,430 And basically, the idea was there. 847 00:51:25,430 --> 00:51:26,950 Now it was a matter of coding it up. 848 00:51:26,950 --> 00:51:29,860 There was a notion of a point in a higher-dimensional space. 849 00:51:29,860 --> 00:51:31,240 There was a notion of a distance.
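A sketch of that evaluation loop, under the same assumptions as the previous snippets, might look like the following; taking the 100 test images from index 1,000 onward is an assumption, chosen only so the test images never overlap the training slice.

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def nearest_neighbor(x_test, X_train, Y_train):
    distances = np.array([dist(x, x_test) for x in X_train])
    return Y_train[np.argmin(distances)]

def count_errors(n_train, n_test=100):
    X_train = digits.data[0:n_train]
    Y_train = digits.target[0:n_train]
    errors = 0
    # Start from zero errors and add 1 whenever the prediction
    # differs from the true label.
    for i in range(1000, 1000 + n_test):
        if nearest_neighbor(digits.data[i], X_train, Y_train) != digits.target[i]:
            errors += 1
    return errors

print(count_errors(10))      # training on 10 images: many mistakes
print(count_errors(1000))    # training on 1,000 images: far fewer mistakes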
850 00:51:31,240 --> 00:51:37,420 But once we figured that out and we coded it up in Python, 851 00:51:37,420 --> 00:51:39,950 Python doesn't care about the dimension, as we saw. 852 00:51:39,950 --> 00:51:43,480 The same distance function that works with two-dimensional points 853 00:51:43,480 --> 00:51:46,960 equally works with 64-dimensional points and higher. 854 00:51:46,960 --> 00:51:57,240 And so this is what we achieve-- 97% correctness 855 00:51:57,240 --> 00:52:01,800 with, really, five lines of code. 856 00:52:01,800 --> 00:52:08,170 So the question is, what if we try the very same algorithm 857 00:52:08,170 --> 00:52:12,870 I just presented to you on a database that looks more 858 00:52:12,870 --> 00:52:17,360 like what we would like to try it with? 859 00:52:17,360 --> 00:52:18,855 So this is a popular database. 860 00:52:18,855 --> 00:52:21,840 It's called CIFAR-10. 861 00:52:21,840 --> 00:52:25,590 It is, again, the same idea-- it is a labeled database 862 00:52:25,590 --> 00:52:30,200 that contains, in this case, really, tens of thousands or more 863 00:52:30,200 --> 00:52:32,060 of images with a label. 864 00:52:32,060 --> 00:52:34,750 So we indeed have 10 labels, as before. 865 00:52:34,750 --> 00:52:38,990 But now, instead of the labels being numbers from 0 to 9, 866 00:52:38,990 --> 00:52:46,720 the labels are something like airplanes, automobiles, birds, dogs, and so on. 867 00:52:46,720 --> 00:52:49,140 So this is just one of the data sets that you can find. 868 00:52:49,140 --> 00:52:51,904 And indeed, there are websites such as kaggle.com 869 00:52:51,904 --> 00:52:57,410 that host sort of competitions where machine learning researchers 870 00:52:57,410 --> 00:53:02,660 and programmers try out their algorithms, and there are challenges going on. 871 00:53:02,660 --> 00:53:07,210 So this data set was popular a couple of years ago for one of these challenges. 872 00:53:07,210 --> 00:53:11,770 A typical challenge could last anything in between two, three months 873 00:53:11,770 --> 00:53:14,570 to even longer-- a year. 874 00:53:14,570 --> 00:53:17,520 So it turns out that if we run the nearest neighbor 875 00:53:17,520 --> 00:53:23,680 classifier on this new set of images, the performance is 30%. 876 00:53:23,680 --> 00:53:27,620 So it is still much better than random guessing. 877 00:53:27,620 --> 00:53:29,550 After all, there are 10 categories. 878 00:53:29,550 --> 00:53:34,340 So you might suspect that just by random guessing, you get 10% correct. 879 00:53:34,340 --> 00:53:35,510 Indeed, you get 30%. 880 00:53:35,510 --> 00:53:38,480 But it's not what we would like, is it? 881 00:53:38,480 --> 00:53:42,250 And in fact, there are more advanced algorithms that 882 00:53:42,250 --> 00:53:44,660 do something a little bit different. 883 00:53:44,660 --> 00:53:49,750 But let us see first what could be an issue with the algorithm 884 00:53:49,750 --> 00:53:51,390 that we just ran. 885 00:53:51,390 --> 00:53:56,320 So this is a training set for the category zero in the previous data set. 886 00:53:56,320 --> 00:53:59,420 And these are a few elements, a few images, 887 00:53:59,420 --> 00:54:03,930 from the category horse in the new data set. 888 00:54:03,930 --> 00:54:07,830 So one difference that pops to the eye immediately 889 00:54:07,830 --> 00:54:10,930 is that these are color pictures. 890 00:54:10,930 --> 00:54:11,940 And indeed, they are. 891 00:54:11,940 --> 00:54:17,700 So, in fact, instead of being eight by eight pixels, they are 32 by 32.
892 00:54:17,700 --> 00:54:21,550 So still rather small pictures, but now each pixel 893 00:54:21,550 --> 00:54:28,040 is indeed a triple-- RGB, as we saw-- where each coordinate contains 894 00:54:28,040 --> 00:54:30,952 a number from 0 to 255. 895 00:54:30,952 --> 00:54:34,320 So in some sense, we are in a higher dimensional space. 896 00:54:34,320 --> 00:54:35,520 It's not just 64. 897 00:54:35,520 --> 00:54:40,340 But there is another key difference, isn't there? 898 00:54:40,340 --> 00:54:42,690 Anyone? 899 00:54:42,690 --> 00:54:43,190 Please. 900 00:54:43,190 --> 00:54:44,958 AUDIENCE: The image can be rotated. 901 00:54:44,958 --> 00:54:49,104 So you can have a horse that's facing one way or [INAUDIBLE]. 902 00:54:49,104 --> 00:54:50,020 DAVID J. MALAN: Great. 903 00:54:50,020 --> 00:54:54,490 This is definitely one of the issues here-- viewpoint variation. 904 00:54:54,490 --> 00:54:59,750 So we indeed have a sequence of pictures representing horses. 905 00:54:59,750 --> 00:55:04,670 But the horses are taken from different angles, poses, right? 906 00:55:04,670 --> 00:55:06,680 And in fact, there are all sorts of issues here. 907 00:55:06,680 --> 00:55:10,370 There are viewpoint variations, illumination conditions, 908 00:55:10,370 --> 00:55:15,080 scale variation, deformation, occlusions, and the like. 909 00:55:15,080 --> 00:55:19,530 And this is what is making image recognition a really tough challenge 910 00:55:19,530 --> 00:55:21,070 for machine learning algorithms. 911 00:55:21,070 --> 00:55:25,410 Now, what more sophisticated algorithms do 912 00:55:25,410 --> 00:55:30,740 is not just interpreting images as collections of pixels, per se. 913 00:55:30,740 --> 00:55:35,280 914 00:55:35,280 --> 00:55:37,120 But they work on a higher level somehow. 915 00:55:37,120 --> 00:55:40,760 So they group pixels together. 916 00:55:40,760 --> 00:55:44,140 Instead of looking at one pixel at a time, what is happening, 917 00:55:44,140 --> 00:55:47,360 they sort of try to extrapolate, to abstract 918 00:55:47,360 --> 00:55:49,802 some higher-level features of the image. 919 00:55:49,802 --> 00:55:50,760 And this is an example. 920 00:55:50,760 --> 00:55:53,556 So the digit zero can indeed be represented 921 00:55:53,556 --> 00:55:56,700 as having the following four features, which 922 00:55:56,700 --> 00:56:00,870 are four arches at different angles. 923 00:56:00,870 --> 00:56:03,870 And indeed, we can go higher in the hierarchy. 924 00:56:03,870 --> 00:56:09,830 So the state-of-the-art algorithm for this class of applications, and not only this one, 925 00:56:09,830 --> 00:56:12,650 is called deep learning. 926 00:56:12,650 --> 00:56:15,220 These algorithms work basically like I just described. 927 00:56:15,220 --> 00:56:18,160 So instead of working with the pixels themselves, 928 00:56:18,160 --> 00:56:22,270 as at the bottom of this image, they try to group pixels together. 929 00:56:22,270 --> 00:56:26,570 And they try to find patterns when you group a few pixels together. 930 00:56:26,570 --> 00:56:32,930 And the first layer of patterns they can extrapolate are edges, for instance. 931 00:56:32,930 --> 00:56:36,540 And then from there, you can go another step up. 932 00:56:36,540 --> 00:56:41,030 So by grouping edges together, you can come up 933 00:56:41,030 --> 00:56:48,260 with objects such as eyes or noses or mouths and the like. 934 00:56:48,260 --> 00:56:54,000 And then grouping this set of objects again, 935 00:56:54,000 --> 00:56:57,390 you can get something like a face.
936 00:56:57,390 --> 00:56:58,890 So this is indeed the logic. 937 00:56:58,890 --> 00:57:03,260 Deep learning is really a game changer. 938 00:57:03,260 --> 00:57:06,980 The idea behind this type of technology is not new, in fact. 939 00:57:06,980 --> 00:57:11,370 It relies on so-called neural networks that were invented 940 00:57:11,370 --> 00:57:14,010 in the '80s or probably even earlier. 941 00:57:14,010 --> 00:57:17,910 But it's only until recently-- literally four years ago, 942 00:57:17,910 --> 00:57:21,500 five-- that researchers have been able to use this technology 943 00:57:21,500 --> 00:57:23,900 to really achieve amazing results. 944 00:57:23,900 --> 00:57:28,310 And now this technology is everywhere, not just for image recognition. 945 00:57:28,310 --> 00:57:31,770 Google's search engine uses deep learning. 946 00:57:31,770 --> 00:57:33,790 Just name one-- Facebook. 947 00:57:33,790 --> 00:57:39,840 The tagging feature for pictures in Facebook is based on deep learning. 948 00:57:39,840 --> 00:57:44,340 Speech recognition software-- Siri, Cortana, OK Google-- 949 00:57:44,340 --> 00:57:48,330 that one doesn't have a fancy name yet-- they are all based on deep learning. 950 00:57:48,330 --> 00:57:51,090 And this is really a game changer. 951 00:57:51,090 --> 00:57:54,630 It has brought a 10% or more increase in the performance 952 00:57:54,630 --> 00:58:00,170 of [INAUDIBLE] algorithm in a matter of literally one jump. 953 00:58:00,170 --> 00:58:03,480 Something like that was unseen in the field. 954 00:58:03,480 --> 00:58:08,710 And indeed, deep learning is beyond the scope of this class. 955 00:58:08,710 --> 00:58:11,860 In fact, you really need a lot of computational power 956 00:58:11,860 --> 00:58:14,830 in order to run deep learning algorithms. 957 00:58:14,830 --> 00:58:17,320 Not just that, you need tons of data out there. 958 00:58:17,320 --> 00:58:19,400 And this is why in the '80s, they couldn't 959 00:58:19,400 --> 00:58:22,030 run-- the theory was there, but first of all, 960 00:58:22,030 --> 00:58:24,400 we didn't have access to the amount of data 961 00:58:24,400 --> 00:58:27,600 we do have access to today, thanks to the world wide web. 962 00:58:27,600 --> 00:58:30,330 And back then, we didn't have the processing capabilities 963 00:58:30,330 --> 00:58:31,690 that we have now. 964 00:58:31,690 --> 00:58:35,880 So while deep learning algorithms are beyond what we can actually 965 00:58:35,880 --> 00:58:41,420 run in the CS50 IDE, we can indeed use some tool sets 966 00:58:41,420 --> 00:58:44,150 that are out there that are based on deep learning. 967 00:58:44,150 --> 00:58:47,130 So this is one example-- TensorFlow by Google. 968 00:58:47,130 --> 00:58:52,460 So what you can do, you can actually download within Python, 969 00:58:52,460 --> 00:58:57,240 for instance, a trained algorithm for you. 970 00:58:57,240 --> 00:59:01,050 So without us trying to train an algorithm ourselves, 971 00:59:01,050 --> 00:59:03,900 as we have been doing with the nearest neighbor classified, 972 00:59:03,900 --> 00:59:09,690 we can download a file of 100 megabytes or more for doing image recognition. 973 00:59:09,690 --> 00:59:14,790 Another example is the DeepDream generator. 974 00:59:14,790 --> 00:59:17,250 So it turns out, as I mentioned, this type of algorithm, 975 00:59:17,250 --> 00:59:22,990 they are able to figure out patterns in the input that we feed them with. 
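As a flavor of what "downloading a trained algorithm" can look like today, here is a minimal sketch using the pre-trained networks bundled with TensorFlow's Keras API, rather than the exact tool shown in lecture; it assumes TensorFlow is installed and that a hypothetical local file horse.jpg exists.

import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Download a network that was already trained on millions of labeled images.
model = MobileNetV2(weights="imagenet")

# Load one image and preprocess it to the input size the network expects.
img = image.load_img("horse.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Print the top three predicted labels with their confidence scores.
print(decode_predictions(model.predict(x), top=3)[0])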
976 00:59:22,990 --> 00:59:27,620 So what we can do with the DeepDream generator, 977 00:59:27,620 --> 00:59:30,100 we can go online, upload whatever picture 978 00:59:30,100 --> 00:59:34,550 we might like-- a picture of ourselves-- then upload a picture of one 979 00:59:34,550 --> 00:59:37,210 of our favorite paintings. 980 00:59:37,210 --> 00:59:39,850 And the algorithm is capable of recognizing the painting 981 00:59:39,850 --> 00:59:46,730 style behind that painting and applying it to the original image we uploaded. 982 00:59:46,730 --> 00:59:50,290 And this is why it's called DeepDream, because apparently dreams work 983 00:59:50,290 --> 00:59:53,430 by mixing together images. 984 00:59:53,430 --> 00:59:57,580 So if you apply this type of technology to that database 985 00:59:57,580 --> 01:00:00,960 I showed to you earlier, we get an amazing performance 986 01:00:00,960 --> 01:00:05,390 of 95%, which is indeed close to what the human eye can 987 01:00:05,390 --> 01:00:07,820 achieve, in this case. 988 01:00:07,820 --> 01:00:12,000 So the question is, is 95% enough? 989 01:00:12,000 --> 01:00:14,970 Well, it really depends on the application. 990 01:00:14,970 --> 01:00:18,920 And just imagine an example-- self-driving cars. 991 01:00:18,920 --> 01:00:20,310 As you know, it's a hot area. 992 01:00:20,310 --> 01:00:21,960 There are a lot of players out there. 993 01:00:21,960 --> 01:00:25,690 Tesla is one of the first players that brought to the market, 994 01:00:25,690 --> 01:00:29,690 really, autopilot-like features. 995 01:00:29,690 --> 01:00:33,250 So the autopilot feature by Tesla is allowing the car 996 01:00:33,250 --> 01:00:36,500 to shift lanes on the highway automatically, 997 01:00:36,500 --> 01:00:41,920 to speed up or slow down based on the traffic, and so much more. 998 01:00:41,920 --> 01:00:46,400 Now, this is just an assistant, and the company is making it clear 999 01:00:46,400 --> 01:00:49,100 that someone should always be in control of the car. 1000 01:00:49,100 --> 01:00:52,320 But it is indeed providing a lot of help. 1001 01:00:52,320 --> 01:00:58,930 And it turns out that the autopilot feature in cars like Teslas 1002 01:00:58,930 --> 01:01:03,580 does rely on a variety of technologies, such as GPS, 1003 01:01:03,580 --> 01:01:06,380 ultrasonic sensors, radars, and so on. 1004 01:01:06,380 --> 01:01:09,770 But they also do rely on forward-facing cameras 1005 01:01:09,770 --> 01:01:12,230 with image recognition software. 1006 01:01:12,230 --> 01:01:17,600 And indeed, they use these so-called deep learning technologies these days. 1007 01:01:17,600 --> 01:01:18,480 What is the problem? 1008 01:01:18,480 --> 01:01:21,300 Well, let us see a video. 1009 01:01:21,300 --> 01:01:25,029 1010 01:01:25,029 --> 01:01:28,508 [VIDEO PLAYBACK] 1011 01:01:28,508 --> 01:01:33,975 [MUSIC PLAYING] 1012 01:01:33,975 --> 01:02:45,207 1013 01:02:45,207 --> 01:02:46,572 [END PLAYBACK] 1014 01:02:46,572 --> 01:02:49,280 DAVID J. MALAN: So indeed, the reason an investigation is going on-- 1015 01:02:49,280 --> 01:02:52,060 and let me actually close this. 1016 01:02:52,060 --> 01:02:56,240 And as we saw, a driver was on a highway in Florida. 1017 01:02:56,240 --> 01:02:58,980 He was using the autopilot feature, apparently 1018 01:02:58,980 --> 01:03:03,870 relying almost exclusively on it, when a tractor trailer drove perpendicular 1019 01:03:03,870 --> 01:03:04,800 to it.
1020 01:03:04,800 --> 01:03:08,820 And if you read, from tesla.com, a statement that the company has released 1021 01:03:08,820 --> 01:03:12,700 after the accident, which happened a few months ago, 1022 01:03:12,700 --> 01:03:15,080 I read-- "Neither Autopilot nor the driver 1023 01:03:15,080 --> 01:03:20,220 noticed the white side of the tractor trailer against a brightly lit sky, 1024 01:03:20,220 --> 01:03:22,810 so the brake was not applied." 1025 01:03:22,810 --> 01:03:26,390 So it is indeed an issue with image recognition. 1026 01:03:26,390 --> 01:03:31,030 Apparently, the color of the trailer was whitish, white. 1027 01:03:31,030 --> 01:03:34,730 And so against a brightly lit sky, the algorithm, 1028 01:03:34,730 --> 01:03:41,290 although it performs something like 95% of the time correctly, 1029 01:03:41,290 --> 01:03:43,880 had some challenges. 1030 01:03:43,880 --> 01:03:45,600 So these are a few of the challenges. 1031 01:03:45,600 --> 01:03:51,130 And, in fact, the [INAUDIBLE] will be much interested in this respect. 1032 01:03:51,130 --> 01:03:54,440 Applying this type of technology for self-driving cars 1033 01:03:54,440 --> 01:03:59,910 will bring a lot of interesting questions in all fields-- 1034 01:03:59,910 --> 01:04:04,550 not just computer science, of course, but politics with policies, ethics, 1035 01:04:04,550 --> 01:04:06,880 philosophy, and the like. 1036 01:04:06,880 --> 01:04:07,450 All right. 1037 01:04:07,450 --> 01:04:09,710 That was it for image recognition. 1038 01:04:09,710 --> 01:04:16,570 So let's now just move to the next application-- text clustering. 1039 01:04:16,570 --> 01:04:22,449 So we are going to go a little bit faster about that. 1040 01:04:22,449 --> 01:04:24,490 The application we have in mind is the following. 1041 01:04:24,490 --> 01:04:28,460 Say that I want to design an algorithm that takes as an input 1042 01:04:28,460 --> 01:04:33,050 the following list of movies-- in fact, not just the movie title, 1043 01:04:33,050 --> 01:04:36,690 but the IMDB synopsis for the movies. 1044 01:04:36,690 --> 01:04:40,950 So the movies are Robin Hood, The Matrix, The King's Speech, Aladdin, 1045 01:04:40,950 --> 01:04:41,770 and so on. 1046 01:04:41,770 --> 01:04:45,900 And I want to design an algorithm that, just solely based on these inputs, 1047 01:04:45,900 --> 01:04:49,120 is capable to cluster, as it is called, to group 1048 01:04:49,120 --> 01:04:52,000 these movies into two categories. 1049 01:04:52,000 --> 01:04:58,140 So if we stare at the list of movies, it might be evident, 1050 01:04:58,140 --> 01:05:02,350 if you like, that the clustering that we expect is something like that, 1051 01:05:02,350 --> 01:05:04,600 where we have a Beautiful Mind, The Matrix, The King's 1052 01:05:04,600 --> 01:05:11,590 Speech in one group and Robin Hood, Aladdin, Finding Nemo in another group. 1053 01:05:11,590 --> 01:05:14,690 And indeed, if I were to ask you, just group this list of movies 1054 01:05:14,690 --> 01:05:18,300 into two groups, most likely, this would have been the answer. 1055 01:05:18,300 --> 01:05:20,960 But your answer would have been based on something 1056 01:05:20,960 --> 01:05:28,160 different than what the machine will do, as we see, most likely. 1057 01:05:28,160 --> 01:05:31,450 In fact, you might say, OK, these are the two categories. 1058 01:05:31,450 --> 01:05:36,600 Because we know from before that, in a way, Robin Hood, Aladdin, 1059 01:05:36,600 --> 01:05:40,260 Finding Nemo are really more Disney-like movies, right? 
1060 01:05:40,260 --> 01:05:46,580 They're for kids, if you wish, whereas the other are more action-type movies. 1061 01:05:46,580 --> 01:05:49,810 But again, this way of categorizing, clustering movies 1062 01:05:49,810 --> 01:05:55,690 is based on some sort of human learning, whereas the machine has only access 1063 01:05:55,690 --> 01:05:58,900 to the synopsis of the movies. 1064 01:05:58,900 --> 01:06:04,740 So let's see what will happen if we indeed try to run an algorithm 1065 01:06:04,740 --> 01:06:06,670 to do this clustering. 1066 01:06:06,670 --> 01:06:09,040 So as before, we try to abstract. 1067 01:06:09,040 --> 01:06:12,440 In this case, the set of applications we are discussing now 1068 01:06:12,440 --> 01:06:16,370 is called unsupervised learning because contrary 1069 01:06:16,370 --> 01:06:19,300 to what happened before, where we were presented a list of data, 1070 01:06:19,300 --> 01:06:23,990 a list of points, a list of images with a label associated to it, this time, 1071 01:06:23,990 --> 01:06:27,410 we are simply presented with a set of points. 1072 01:06:27,410 --> 01:06:29,540 And here it is in Lineland. 1073 01:06:29,540 --> 01:06:33,910 This is what we will see-- just a set of seven points. 1074 01:06:33,910 --> 01:06:39,220 And so if you are asked, OK, just split this set of points into two groups, 1075 01:06:39,220 --> 01:06:43,660 well, we have an easy way of doing that. 1076 01:06:43,660 --> 01:06:47,890 And once again, I haven't told you anything about the structure 1077 01:06:47,890 --> 01:06:48,880 to be inferred. 1078 01:06:48,880 --> 01:06:53,400 I was just presenting to you with a set of data points and asking to you, 1079 01:06:53,400 --> 01:07:00,150 just split this group of points into k equals 2 categories, groups. 1080 01:07:00,150 --> 01:07:05,990 So as before, this is indeed a well-known machine learning algorithm. 1081 01:07:05,990 --> 01:07:08,100 It's called K-means. 1082 01:07:08,100 --> 01:07:10,950 K there represents the number of clusters 1083 01:07:10,950 --> 01:07:15,860 we want the algorithm to divide the original data into. 1084 01:07:15,860 --> 01:07:20,000 And as before, from one dimension, we can go to two dimensions. 1085 01:07:20,000 --> 01:07:24,960 And if I ask you, OK, split this group of points into two groups, easy. 1086 01:07:24,960 --> 01:07:27,540 That's what K-mean is doing. 1087 01:07:27,540 --> 01:07:29,730 What about text now? 1088 01:07:29,730 --> 01:07:34,370 So before, we somehow had an easy way of mapping 1089 01:07:34,370 --> 01:07:43,070 an image, a collection of pixels, into a point in a higher-dimensional space. 1090 01:07:43,070 --> 01:07:47,410 How can we go about thinking-- doing something like that with text? 1091 01:07:47,410 --> 01:07:48,780 It's not that easy, isn't it? 1092 01:07:48,780 --> 01:07:52,100 1093 01:07:52,100 --> 01:07:54,130 Because if we were able to do, if we were 1094 01:07:54,130 --> 01:07:59,680 able to interpret each of these synopses as a point in a space, then 1095 01:07:59,680 --> 01:08:02,210 most likely, the picture will look like that. 1096 01:08:02,210 --> 01:08:06,490 And if I ask you to divide this group of movies into two categories, 1097 01:08:06,490 --> 01:08:08,640 that will be your answer. 1098 01:08:08,640 --> 01:08:16,090 So indeed, let's see how we can map some text input into a collection, 1099 01:08:16,090 --> 01:08:18,899 into a point in a higher-dimensional space. 
1100 01:08:18,899 --> 01:08:22,500 So in order to do so, let me just step back for a moment 1101 01:08:22,500 --> 01:08:27,910 and describe the following, easier type of example, where as an input, 1102 01:08:27,910 --> 01:08:29,560 we have four strings. 1103 01:08:29,560 --> 01:08:30,380 So let's read. 1104 01:08:30,380 --> 01:08:32,760 First string is, "I love CS50. 1105 01:08:32,760 --> 01:08:36,000 Staff is awesome, awesome, awesome." 1106 01:08:36,000 --> 01:08:39,760 String A. String B-- "I have a dog and a cat." 1107 01:08:39,760 --> 01:08:42,210 String C-- "Best of CS50? 1108 01:08:42,210 --> 01:08:43,450 Staff. 1109 01:08:43,450 --> 01:08:44,600 And cakes. 1110 01:08:44,600 --> 01:08:45,100 OK. 1111 01:08:45,100 --> 01:08:46,772 CS50 staff." 1112 01:08:46,772 --> 01:08:50,390 String D-- "My dog keeps chasing my cat. 1113 01:08:50,390 --> 01:08:51,870 Dogs." 1114 01:08:51,870 --> 01:08:52,370 OK. 1115 01:08:52,370 --> 01:08:54,810 Say these are the four strings, and we ask, OK, 1116 01:08:54,810 --> 01:08:57,859 let's split it into two groups. 1117 01:08:57,859 --> 01:09:01,370 Most likely, what you will guess is that one cluster, one group 1118 01:09:01,370 --> 01:09:05,210 should include string A and string C. And the other cluster 1119 01:09:05,210 --> 01:09:10,899 should include string B and string D, based on the semantics, on the meaning. 1120 01:09:10,899 --> 01:09:13,350 So how can we do this? 1121 01:09:13,350 --> 01:09:17,029 And indeed, if this a representation in high-dimensional space, 1122 01:09:17,029 --> 01:09:20,010 this is what we will get. 1123 01:09:20,010 --> 01:09:22,240 So the missing step is really this mapping, right? 1124 01:09:22,240 --> 01:09:27,189 Mapping each of these four strings into a point in a higher-dimensional space. 1125 01:09:27,189 --> 01:09:32,540 And this is what we can do with the following interpretation. 1126 01:09:32,540 --> 01:09:36,585 So here, we have the four strings. 1127 01:09:36,585 --> 01:09:39,710 The first thing that we can do is look at the vocabulary used in these four 1128 01:09:39,710 --> 01:09:45,470 strings-- namely, extract the words that is used in each of the strings. 1129 01:09:45,470 --> 01:09:50,950 So if we do not consider so-called stop words-- namely, 1130 01:09:50,950 --> 01:09:55,310 words such as "I," "is," "a," "and," and the like, 1131 01:09:55,310 --> 01:10:01,330 which do not provide much meaning, this is the dictionary that we will extract. 1132 01:10:01,330 --> 01:10:05,540 There is the word "awesome," "best," "cakes," "cats," and so on. 1133 01:10:05,540 --> 01:10:07,700 And now if we look at each of the strings, 1134 01:10:07,700 --> 01:10:15,040 we can indeed map each string into a numerical point 1135 01:10:15,040 --> 01:10:19,320 by using the so-called bags of words interpretation, by which each string is 1136 01:10:19,320 --> 01:10:22,110 simply represented by the word count. 1137 01:10:22,110 --> 01:10:25,620 So let's see, for instance, the first string. 1138 01:10:25,620 --> 01:10:29,360 The word "awesome" is used three times. 1139 01:10:29,360 --> 01:10:31,560 And that's why we have a three there. 1140 01:10:31,560 --> 01:10:35,610 The word "CS50" in lowercase is used once. 1141 01:10:35,610 --> 01:10:38,420 The word "love" is used once, again. 1142 01:10:38,420 --> 01:10:40,890 And "staff" is used once, again. 1143 01:10:40,890 --> 01:10:44,830 Again, we do not consider the stop words such as "I" and "is." 
1144 01:10:44,830 --> 01:10:48,330 So on-- so the second string can also be represented 1145 01:10:48,330 --> 01:10:52,310 by a numerical vector as this. 1146 01:10:52,310 --> 01:10:58,232 And indeed, there are 12 words in this dictionary. 1147 01:10:58,232 --> 01:11:00,190 So we can indeed think of each of these strings 1148 01:11:00,190 --> 01:11:05,680 as being a point in a 12-dimensional space, isn't it? 1149 01:11:05,680 --> 01:11:06,860 So not so simple. 1150 01:11:06,860 --> 01:11:08,880 This is a first great step. 1151 01:11:08,880 --> 01:11:13,370 But we should also normalize by the length of the string, if you wish, 1152 01:11:13,370 --> 01:11:15,880 just because if a string is very long, then 1153 01:11:15,880 --> 01:11:20,230 it's more likely that simply the rough counts will be higher. 1154 01:11:20,230 --> 01:11:26,740 What we can do easily is just divide each numerical vector 1155 01:11:26,740 --> 01:11:30,060 by the total amount of words in that string. 1156 01:11:30,060 --> 01:11:32,780 So we get a so-called frequency matrix. 1157 01:11:32,780 --> 01:11:33,380 So here it is. 1158 01:11:33,380 --> 01:11:36,980 We have a way to map a string into a point in a high-dimensional space, 1159 01:11:36,980 --> 01:11:39,150 a 12-dimensional space. 1160 01:11:39,150 --> 01:11:42,900 And what we can do, we can apply this algorithm K-means. 1161 01:11:42,900 --> 01:11:47,550 So let me just show you quickly how this can be done with Python 1162 01:11:47,550 --> 01:11:50,350 in the realm of unsupervised learning. 1163 01:11:50,350 --> 01:11:53,600 So now we will move much faster than before. 1164 01:11:53,600 --> 01:11:57,090 In this case, we we're importing the same modules as before, 1165 01:11:57,090 --> 01:12:03,700 numpy and matplot, and creating-- in this case, in the world of Flatland-- 1166 01:12:03,700 --> 01:12:08,560 we're creating an array of seven points, which I here 1167 01:12:08,560 --> 01:12:13,320 plot with the same exact lines of code as before. 1168 01:12:13,320 --> 01:12:16,620 But in this case, instead of us implementing the machine 1169 01:12:16,620 --> 01:12:21,660 learning algorithm from scratch, we can do what stereotypically, at least, 1170 01:12:21,660 --> 01:12:25,100 most machine learning programmer or researcher will do, 1171 01:12:25,100 --> 01:12:30,360 which is importing the K-means algorithms from an external module. 1172 01:12:30,360 --> 01:12:34,670 So if we imported this algorithm, the details of how the algorithm works 1173 01:12:34,670 --> 01:12:37,010 is beyond the scope of this class. 1174 01:12:37,010 --> 01:12:39,210 But it wouldn't be that difficult, in fact. 1175 01:12:39,210 --> 01:12:41,670 But we can reasonably run this algorithm and say, OK, 1176 01:12:41,670 --> 01:12:47,130 algorithm, cluster this group of points into two groups. k equals 2. 1177 01:12:47,130 --> 01:12:52,500 If we do that-- I won't pass line-by-line what is happening-- simply 1178 01:12:52,500 --> 01:12:57,990 running the algorithm with k equals 2, these would be the output. 1179 01:12:57,990 --> 01:13:02,600 So indeed, the algorithm is capable of figuring out the two groups of points 1180 01:13:02,600 --> 01:13:05,050 based on their distance. 1181 01:13:05,050 --> 01:13:07,390 So we can also run the algorithm with k equals 3. 1182 01:13:07,390 --> 01:13:11,330 After all, we are deciding the number of groups to be created. 
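A minimal sketch of that K-means run on points might look like this; the seven coordinates below are hypothetical, since the lecture's exact array is not in the transcript.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Seven hypothetical two-dimensional points.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8]],
             dtype=float)

# Ask the imported algorithm for k = 2 groups.
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)            # which cluster each point belongs to
print(kmeans.cluster_centers_)   # the center of mass of each cluster

# Plot the points colored by cluster, with a cross at each cluster center.
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", s=100)
plt.show()

Changing n_clusters from 2 to 3, or to 7, reproduces the other runs described next.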
1183 01:13:11,330 --> 01:13:14,290 So if we run with k equals 3, the same lines of code 1184 01:13:14,290 --> 01:13:18,640 as before-- I'm changing k from 2 to 3-- we get three groups. 1185 01:13:18,640 --> 01:13:19,480 And so on. 1186 01:13:19,480 --> 01:13:22,620 With k equals 7, there are seven points. 1187 01:13:22,620 --> 01:13:23,660 And here it is. 1188 01:13:23,660 --> 01:13:28,320 So the crosses that are there present in the plot 1189 01:13:28,320 --> 01:13:30,520 are simply, if you wish, the center of mass, 1190 01:13:30,520 --> 01:13:34,710 center of gravity of the group-- simply the middle of the group. 1191 01:13:34,710 --> 01:13:38,460 So this is what we can do easily with points in the world of Flatland. 1192 01:13:38,460 --> 01:13:44,370 And we can, in fact, move, very much like we have done now, 1193 01:13:44,370 --> 01:13:46,150 to the world of documents. 1194 01:13:46,150 --> 01:13:52,940 And so this is the collection of strings I presented to you. 1195 01:13:52,940 --> 01:13:57,500 This is the bags of words matrix that we can easily 1196 01:13:57,500 --> 01:14:04,440 construct using some function from the external module sklearn. 1197 01:14:04,440 --> 01:14:07,870 Again, I won't spend much detail on parsing this code. 1198 01:14:07,870 --> 01:14:11,810 I want you to appreciate the fact that really, in a few lines of code, 1199 01:14:11,810 --> 01:14:16,260 we can get something like this running in Python. 1200 01:14:16,260 --> 01:14:20,390 So we can have a look at the dictionary, is what I presented to you 1201 01:14:20,390 --> 01:14:25,390 earlier-- "awesome," "best," "cakes," and the like. 1202 01:14:25,390 --> 01:14:29,210 We can get to the frequency matrix, as before. 1203 01:14:29,210 --> 01:14:32,380 And we can indeed run K-means with k equals 2. 1204 01:14:32,380 --> 01:14:37,280 And if we do that, we indeed have the output that we expect, 1205 01:14:37,280 --> 01:14:39,550 meaning the algorithm is capable of figuring out 1206 01:14:39,550 --> 01:14:43,450 that the two clusters should be divided by the words "dog," 1207 01:14:43,450 --> 01:14:49,210 "cat," "keeps," and the words "awesome," "staff", and "CS50." 1208 01:14:49,210 --> 01:14:52,310 So this is per the simple example of strings. 1209 01:14:52,310 --> 01:14:55,010 We can go to the more interesting example 1210 01:14:55,010 --> 01:14:58,620 of movies with the IMDB synopsis. 1211 01:14:58,620 --> 01:15:00,910 Run precisely the same line of code. 1212 01:15:00,910 --> 01:15:04,850 Now, the inputs for the algorithm is the following list 1213 01:15:04,850 --> 01:15:12,606 of movies with their title and the synopsis from IMDB. 1214 01:15:12,606 --> 01:15:17,310 And we can easily import this Google spreadsheet 1215 01:15:17,310 --> 01:15:19,960 into Python using the pandas modules. 1216 01:15:19,960 --> 01:15:22,190 Again, I won't spend much time to it. 1217 01:15:22,190 --> 01:15:25,940 I want to get to the punch line. 1218 01:15:25,940 --> 01:15:30,170 This is the line of code to import these data sets. 1219 01:15:30,170 --> 01:15:33,300 Here is, indeed, if we printed these frames [INAUDIBLE] Python, 1220 01:15:33,300 --> 01:15:37,060 it's the same table as in the Google spreadsheet. 1221 01:15:37,060 --> 01:15:41,250 And from this time on, we can precisely take the same code 1222 01:15:41,250 --> 01:15:45,620 that we applied earlier in the easier example in this case. 
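For reference, that easier four-string example, end to end, might look like the sketch below; CountVectorizer with English stop words is assumed to be an acceptable stand-in for whichever scikit-learn function the lecture actually used to build the word-count matrix.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

strings = [
    "I love CS50. Staff is awesome, awesome, awesome.",
    "I have a dog and a cat.",
    "Best of CS50? Staff. And cakes. OK. CS50 staff.",
    "My dog keeps chasing my cat. Dogs.",
]

# Build the bag-of-words count matrix, dropping English stop words
# such as "I", "is", "a", and "and".
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(strings).toarray()
print(vectorizer.get_feature_names_out())   # the extracted dictionary

# Normalize each row by its total word count to get the frequency matrix.
freq = counts / counts.sum(axis=1, keepdims=True)

# Cluster the four strings into k = 2 groups.
kmeans = KMeans(n_clusters=2, n_init=10).fit(freq)
print(kmeans.labels_)   # e.g. strings A and C in one group, B and D in the other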
1223 01:15:45,620 --> 01:15:53,760 And if we do so, just by creating the frequency matrix, 1224 01:15:53,760 --> 01:15:58,600 running K-means with k equals 2, we get the following output. 1225 01:15:58,600 --> 01:16:04,210 So indeed, the algorithm is capable of figuring out that one class of movies 1226 01:16:04,210 --> 01:16:09,620 should be including movies such as The King's Speech, Frozen Aladdin, 1227 01:16:09,620 --> 01:16:11,770 Cinderella, Robin Hood, and the like. 1228 01:16:11,770 --> 01:16:17,350 So The King's Speech, we wouldn't really expect that, right? 1229 01:16:17,350 --> 01:16:21,080 Frozen, Aladdin, Cinderella, Robin Hood-- OK, kids' movies. 1230 01:16:21,080 --> 01:16:22,870 But The King's Speech? 1231 01:16:22,870 --> 01:16:24,960 Well, let's hear the other cluster. 1232 01:16:24,960 --> 01:16:28,000 The other cluster would be Mad Max, The Matrix, No Country For Old 1233 01:16:28,000 --> 01:16:30,130 Men, and the like. 1234 01:16:30,130 --> 01:16:33,530 So the way the algorithm is thinking about it when we map it 1235 01:16:33,530 --> 01:16:38,270 to this higher-dimensional space is by grouping movies together 1236 01:16:38,270 --> 01:16:40,910 based on the count words, as we see. 1237 01:16:40,910 --> 01:16:44,920 And so the reason why it's taking The King's Speech 1238 01:16:44,920 --> 01:16:48,640 in the same group as Frozen, Aladdin, Robin Hood and so on 1239 01:16:48,640 --> 01:16:52,400 is because of words such as "king," "prince," "duke." 1240 01:16:52,400 --> 01:16:56,420 So this is the machine learning behind-- this what the machine is doing. 1241 01:16:56,420 --> 01:16:59,020 Again, it might sound a little bit counterintuitive, 1242 01:16:59,020 --> 01:17:01,230 coming from a human learning point of view. 1243 01:17:01,230 --> 01:17:07,520 But the machine has only access to those inputs that I showed to you earlier. 1244 01:17:07,520 --> 01:17:11,510 So let's just wrap up. 1245 01:17:11,510 --> 01:17:12,820 This was Python. 1246 01:17:12,820 --> 01:17:17,240 We have seen, indeed, two important applications in the real world-- 1247 01:17:17,240 --> 01:17:23,470 so the image recognition application and the text clustering application. 1248 01:17:23,470 --> 01:17:29,570 These are just two applications out of countlessly many applications that 1249 01:17:29,570 --> 01:17:32,050 are changing our life on a daily basis. 1250 01:17:32,050 --> 01:17:38,116 And in fact-- well, at this point, just to mention something more in this. 1251 01:17:38,116 --> 01:17:40,300 This should ring a bell at this point in the class. 1252 01:17:40,300 --> 01:17:44,210 It is, indeed, the pyramid in Mario. 1253 01:17:44,210 --> 01:17:48,160 And indeed, we can run a machine learning algorithm 1254 01:17:48,160 --> 01:17:52,100 to play video games such as Mario. 1255 01:17:52,100 --> 01:17:53,120 The way we do that? 1256 01:17:53,120 --> 01:17:58,640 Well, we can have as a training set, we can have human players play Mario. 1257 01:17:58,640 --> 01:18:01,585 And we can have an algorithm watching them and learning from them. 1258 01:18:01,585 --> 01:18:07,430 Or we can have an algorithm watching another algorithm watching playing. 1259 01:18:07,430 --> 01:18:13,050 And this is what indeed does happen in the case of Go. 1260 01:18:13,050 --> 01:18:17,110 So Go is a popular chess-board-like game. 1261 01:18:17,110 --> 01:18:18,090 It's much like chess. 1262 01:18:18,090 --> 01:18:23,790 But now the number of combinations are astonishingly large. 
1263 01:18:23,790 --> 01:18:26,760 It's more than the number of atoms in the universe. 1264 01:18:26,760 --> 01:18:30,690 So playing this game has always been considered sort of hard. 1265 01:18:30,690 --> 01:18:34,000 And it has been believed for a long time to be 1266 01:18:34,000 --> 01:18:38,840 outside of the reach of modern applications, modern machine 1267 01:18:38,840 --> 01:18:40,160 learning algorithms. 1268 01:18:40,160 --> 01:18:43,070 This was the picture up to a few months ago, 1269 01:18:43,070 --> 01:18:48,980 in March 2016, when an algorithm made by researchers at Google 1270 01:18:48,980 --> 01:18:52,280 played against one of the world champions at this game. 1271 01:18:52,280 --> 01:18:58,720 And here he is, the world champion, world master of Go, Lee Sedol. 1272 01:18:58,720 --> 01:19:03,230 So before a series of five games against the machine, 1273 01:19:03,230 --> 01:19:06,860 Lee released a statement claiming that he would have expected 1274 01:19:06,860 --> 01:19:09,730 to win something like four to one. 1275 01:19:09,730 --> 01:19:12,470 Indeed, four to one was the final outcome, 1276 01:19:12,470 --> 01:19:14,160 but it was the other way around. 1277 01:19:14,160 --> 01:19:19,700 So the machine won, four games to one. 1278 01:19:19,700 --> 01:19:23,120 And what is really amazing-- this is perceived, again, as a game-changer. 1279 01:19:23,120 --> 01:19:29,950 Deep learning algorithms are behind this technology. 1280 01:19:29,950 --> 01:19:32,550 So much attention has been drawn to the game 1281 01:19:32,550 --> 01:19:39,700 not just because the machine won, in fact, four out of five, but because 1282 01:19:39,700 --> 01:19:44,690 during the games, really, new plays were made by the algorithm. 1283 01:19:44,690 --> 01:19:49,070 And the algorithm was trained not just by observing human masters 1284 01:19:49,070 --> 01:19:51,130 playing the game Go. 1285 01:19:51,130 --> 01:19:55,830 But it was also trained by looking at itself-- at the algorithm itself-- 1286 01:19:55,830 --> 01:19:59,550 playing Go against another algorithm. 1287 01:19:59,550 --> 01:20:04,330 So in having access to this set of training data, 1288 01:20:04,330 --> 01:20:07,790 the algorithm simply came up with new, astonishing moves 1289 01:20:07,790 --> 01:20:14,290 that not even commentators of the game were able to comprehend. 1290 01:20:14,290 --> 01:20:17,350 And so this is where we will leave off, by watching 1291 01:20:17,350 --> 01:20:20,080 some of the reactions of these commentators 1292 01:20:20,080 --> 01:20:23,800 while trying to interpret the moves made by the machine. 1293 01:20:23,800 --> 01:20:30,540 CS50 will be back next week with more on Python and on web servers. 1294 01:20:30,540 --> 01:20:33,620 1295 01:20:33,620 --> 01:20:34,762 [APPLAUSE] 1296 01:20:34,762 --> 01:20:35,428 [VIDEO PLAYBACK] 1297 01:20:35,428 --> 01:20:40,830 -And this is what [INAUDIBLE] from the Google team was talking about, 1298 01:20:40,830 --> 01:20:45,975 is this kind of evaluation, value-- 1299 01:20:45,975 --> 01:20:48,570 1300 01:20:48,570 --> 01:20:51,250 -That's a very surprising move. 1301 01:20:51,250 --> 01:20:56,562 -I thought it was a mistake. 1302 01:20:56,562 --> 01:20:58,660 -Well, I thought it was a quick miss. 1303 01:20:58,660 --> 01:21:00,220 But-- 1304 01:21:00,220 --> 01:21:02,780 -If it were online Go, we'd call it a clicko. 1305 01:21:02,780 --> 01:21:06,252 -Yeah, it's a very strange-- something like this would be a more normal move.
1306 01:21:06,252 --> 01:21:09,350 1307 01:21:09,350 --> 01:21:12,309 -OK, you're going to have to-- so do you have to think about this? 1308 01:21:12,309 --> 01:21:13,850 -This would be kind of a normal move. 1309 01:21:13,850 --> 01:21:16,410 And locally, white would answer here. 1310 01:21:16,410 --> 01:21:16,910 -Sure. 1311 01:21:16,910 --> 01:21:17,390 -But-- 1312 01:21:17,390 --> 01:21:17,710 [END PLAYBACK] 1313 01:21:17,710 --> 01:21:18,918 DAVID J. MALAN: Thanks again. 1314 01:21:18,918 --> 01:21:22,960 1315 01:21:22,960 --> 01:21:23,900 [VIDEO PLAYBACK] 1316 01:21:23,900 --> 01:21:28,120 -All those freshman were, like, stoked for the first lecture, you know? 1317 01:21:28,120 --> 01:21:32,760 Like cake, candy, swag, candy. 1318 01:21:32,760 --> 01:21:35,447 They, like, got their own DJ. 1319 01:21:35,447 --> 01:21:37,280 -But the first lecture didn't go as planned. 1320 01:21:37,280 --> 01:21:41,200 1321 01:21:41,200 --> 01:21:46,124 -Malan, he had that phone book ready to, like, tear its heart out. 1322 01:21:46,124 --> 01:21:49,290 -Well, you might first open the phone book roughly to the middle, look down, 1323 01:21:49,290 --> 01:21:49,790 and-- 1324 01:21:49,790 --> 01:21:59,060 1325 01:21:59,060 --> 01:22:02,172 -And then the dude just let it drop. 1326 01:22:02,172 --> 01:22:05,364 [MUSIC PLAYING] 1327 01:22:05,364 --> 01:22:20,070 1328 01:22:20,070 --> 01:22:20,570 -Hmm. 1329 01:22:20,570 --> 01:22:23,930 1330 01:22:23,930 --> 01:22:25,860 -Mind if I bum one of those? 1331 01:22:25,860 --> 01:22:26,820 -Oh, sure. 1332 01:22:26,820 --> 01:22:32,110 1333 01:22:32,110 --> 01:22:35,625 -Dude, there was something in those pages. 1334 01:22:35,625 --> 01:22:39,130 This was like nothing I'd ever seen before on the YouTubes. 1335 01:22:39,130 --> 01:22:40,870 -Was there any mention of Rosebud? 1336 01:22:40,870 --> 01:22:42,115 -Rosebud? 1337 01:22:42,115 --> 01:22:44,110 Is that, like, a programming language? 1338 01:22:44,110 --> 01:22:47,410 1339 01:22:47,410 --> 01:22:48,310 -[EXHALES] 1340 01:22:48,310 --> 01:22:55,210 1341 01:22:55,210 --> 01:22:56,460 [END PLAYBACK]