KEVIN XU: Hey everyone! Welcome to our introduction to ML seminar. I'm Kevin Xu, a sophomore at the college, studying computer science and physics.

ZAD CHIN: I'm Zad, a sophomore at Harvard College, studying computer science and math.

KEVIN XU: And we're so excited that you could join us here today. We know that we're in the middle of final exam week, and so everybody's a little stressed. So, hopefully, we can give an interesting, and hopefully fun, presentation about what to look forward to as you start implementing your final projects. Just some logistics first: we will have some set points where we'll take a few questions from the audience. So feel free to type them into the chat as you think of them, and somebody will read them to us when we pause for questions. Other than that, let's get straight into it.

Right, so, of course, our seminar is about machine learning, or ML. And the first question, which is not so obvious, is: what is machine learning? I'm sure a lot of you have heard about the developments in this field, a lot about neural networks, and perhaps reinforcement learning.
And a popular topic, of course, is game playing, where computers have solved games through these complicated machine learning algorithms and neural networks. So we just want to give a quick overview of what exactly falls under machine learning. This is actually a very broad category that includes a lot of different ideas and fields. There are three main ones: unsupervised learning, supervised learning, and reinforcement learning. And a lot of the neural networks that you think about when you think about machine learning, and reinforcement learning, fall under one of these categories. But, at the end of the day, ML is about taking a lot of data and having the computer, or an algorithm, or a program, process the important parts of that data: recognize the patterns in the data, and attempt to use them to generalize to a bigger data set that you might be given. So Zad is now going to present a little bit more about what exactly these subfields of ML are.

ZAD CHIN: So the first thing we want to know about is supervised learning. What is supervised learning, right?
Given some labeled data -- say, for example, we have a data set of cat and dog pictures -- how can a machine learn to predict labels that generalize to unseen data? What we do is take the data set of cat and dog pictures, feed it into a machine learning model, and let the model learn by itself how to differentiate between cats and dogs. Then we have a testing data set -- maybe cats and dogs, or maybe other images -- and the machine will label them: oh, this is a cat, or this is a dog, or this is neither a cat nor a dog. So that's supervised learning: we actually label the data so the machine can learn from it.

The next big idea in machine learning is unsupervised learning, where the machine recognizes patterns in the data by itself. We don't label anything -- we don't say this is a dog, or a cat -- we just show all the images to the model, and the machine will decide: this feature kind of resembles a cat, so this is a cat; and this feature kind of resembles a dog, so it's a dog. So that's the difference between unsupervised learning and supervised learning.
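The supervised workflow Zad describes -- fit on labeled examples, then predict labels for unseen data -- can be sketched in a few lines. The features and labels below are made up for illustration (toy numeric stand-ins for the "cat" vs. "dog" images in the talk), and the choice of classifier is ours, not the seminar's.

```python
# A minimal sketch of supervised learning with scikit-learn.
# Labels: 0 = "cat", 1 = "dog". Each row is two made-up numeric features.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[4.0, 30.0], [5.0, 35.0], [60.0, 25.0], [70.0, 28.0]]
y_train = [0, 0, 1, 1]              # the labels the machine learns from

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)         # "supervised": the model sees the labels

# Unseen data: the model generalizes from what it learned
pred = model.predict([[4.5, 32.0], [65.0, 26.0]])
print(pred)                         # [0 1]: first looks like a cat, second like a dog
```

Unsupervised learning would drop `y_train` entirely and let a method like clustering group the rows on its own.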
And most of the time, unsupervised learning includes things like clustering -- for example, in very high-dimensional ICU data: who is most likely to get readmitted, and what features separate the people who will be readmitted from those who won't? So that's the difference between supervised and unsupervised learning. Next, Kevin is going to talk more about what a neural network is, and why it's important.

KEVIN XU: So, as Zad mentioned, a lot of the time you're dealing with data sets that are hugely multidimensional. All that means is there are a lot of different categories you could consider -- in an election, for example, the population, the demographics, how likely the other candidates are to win, et cetera. And you have all of these categories of data that you need to somehow compile together to get some reasonable kind of prediction as a result. And this is where neural networks shine. If you continue researching this field, and perhaps continue with this kind of topic, you're probably going to see this kind of picture a lot. And at first it looks very incomprehensible -- just a giant graph with a bunch of nodes and lines.
But this is actually just a very brief picture of what a neural network is. On the left, you can consider these as input nodes, on the right you have output nodes, and in the middle you have hidden nodes. But this doesn't really tell you anything about what it does. So you can think of this giant network -- however you might imagine it -- as just a black box function. This black box function takes in your input data, the different categories, your multidimensional input data, and then it spits out output data that should be good enough for a prediction.

So, in the most simplified case, let's talk about a game, such as tic-tac-toe. Your input data might just be the state that you're currently in: which squares are filled, whose turn it is, et cetera. And your output data might just be one single number, a heuristic between zero and one. All this number tells you is how good your current board position is. Obviously, if you can make three in a row on your next turn, your heuristic should be very good -- it should be one. And if you're going to lose on the next turn, it should be zero. But, obviously, not every state falls under one of those cases.
So there's a continuous range of numbers between zero and one that your black box function should be able to accurately predict from the input data. And that's the job of everything in the middle, right? How do these variables relate, what function or functions should be applied to them in the middle of processing, et cetera. So the overall goal of a neural network is to design this middle function: how do you get the computer to find these patterns and create this black box function in a reasonable manner? And this is something that's really hard for humans to do, because there's so much data. So we give it to the machine instead.

ZAD CHIN: So, the general outline of a machine learning project, or machine learning research, for example: we start with data collection. There are a lot of different ways to collect data. Some significant sources are scraping it from websites, API calls, or readily available data sets such as Kaggle or the UCI ML repository.
But do take note that raw data scraped from API calls, or even websites, may take a long time to clean up, especially when there are a lot of empty rows and such. The next step is normally data exploration. Initial data exploration is normally very powerful: it gives you great insight into what features to take in, what features to drop, and whether your data is biased. And understanding the dimensions of the data is actually very important. The third step is, of course, choosing an ML model. Choose a simple model to start with -- first ask what type of problem you want to solve, right? We'll go deeper into that in an example later, but some questions to consider: is it a regression or a classification problem? And should you use a supervised or an unsupervised machine learning model? There are a number of models you can choose from. And the fourth step, after you choose a model and code it out with all the libraries: you get the result, and you want to test it out. What is the accuracy score of your model, what is the precision, what is the recall score of your model?
And how can we improve the model -- by fine-tuning the parameters, or even by using more sophisticated models? Say, for example, you started with logistic regression; maybe you can move on to neural networks. So, I think we can take a few questions if you guys have any. If not, we can move on to the examples that we have.

SPEAKER: OK, there's one question, from Prateek: "Is there any difference between data science and machine learning?"

ZAD CHIN: So--

KEVIN XU: Yeah, do you want to take this, Zad?

ZAD CHIN: Yeah, sure. I think data science is more-- I think machine learning is under data science. You can correct me if I'm wrong. I think data scientists also use machine learning models to help them analyze data. So, basically, what data science means is that we try to understand the patterns, or what useful information a huge amount of data can tell us, right? So I feel like machine learning can be a very good stepping stone into data science.
And there are also statistics as well -- pure statistics, such as the chi-square test, which is not really machine learning but is also part of data science.

KEVIN XU: Right, just echoing what Zad said: ML is a technique. Or you can think of it as a methodology for approaching the study of how you can extrapolate from data. And there's a lot that falls under machine learning, but there are also a lot of things you can do that are not machine learning, right? So I hope that answers your question.

SPEAKER: OK, no other questions, you can continue.

KEVIN XU: Zad will have to [? click. ?]

ZAD CHIN: So you guys can go to this link if you want to make a copy of the notebook yourself and follow along as we present: tinyurl.com/ML-notebook-1. We will jump straight to the notebook so we can show you. There are a few sections here, if you want, in this table of contents. And I'll close it, so we can have a better view of [? it. ?] So let's get started. We talked a lot about how to start with machine learning research.
So, in this particular notebook, we will use the Iris data set, one of the most famous data sets in machine learning. For supervised learning, we will try two different models: k-nearest neighbors and logistic regression. And in the second part, we will jump into reinforcement learning with a different example. So, as we talked about, the ML development process includes data collection, train-test split, identifying the correct ML problem and ML model to use, evaluating the performance of the algorithm, and trying different models to retrain and retest.

So let's get started. The first thing you want to do is, of course, import the necessary libraries: Pandas, NumPy, a plotting library, the train-test split utility. So you go ahead and run this. To run a cell in a Jupyter Notebook, or Google Colab, you just press Shift-Enter. And the next thing we do is import the data set that we talked about, the Iris data set. This DataFrame.head call is to print out -- to give you an idea of what the DataFrame actually looks like.
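These first notebook cells -- import the libraries, load Iris, call head -- look roughly like the following. This is a sketch: it uses scikit-learn's bundled copy of the Iris data, whereas the seminar's notebook may load it from a file or URL, and the variable names are ours.

```python
# Load the Iris data set into a Pandas DataFrame and peek at it.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target          # 0, 1, 2: the three iris species

print(df.head())                    # first five rows: sepal/petal measurements plus target
print(df.shape)                     # (150, 5): 150 rows, five columns
```

Running `df.head()` is the quick sanity check Zad describes: it confirms the columns and a few sample rows before any modeling.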
So we can see that if we run this, there is sepal length and sepal width, then petal length and petal width, and there's target. Target is the three different labels that we have, which are the three different kinds of iris. And the next thing we want to do is explore, for example, the shape of the DataFrame. Now we know that there are 150 rows and five different columns -- here we can see the five columns and the 150 rows. So, first of all, the first thing we want to do is actually explore and analyze the data. It's important to do initial data exploration because the data might be biased or noisy, and exploration shows the relationships between the different features and the target, which helps us better train our ML model. I put some links here that you can use later. But we will start by putting up a pie chart of the different targets: setosa, virginica, and versicolor, the different types of iris. You can see that each of them is around 33%, represented very equally and balanced in the labeled data set. And this plots the counts of the different labels -- this axis is the count, and this is the width.
So you can see that most of the sepal widths, for example, fall in the 3.0 to 3.5 range, and they're fairly evenly, or normally, distributed. And this is more of a graph to see how the different lengths and widths vary according to species. I took some time on this, and I think it's very useful: this is basically pairwise plotting of how the features and the target influence each other. It takes some time to plot, but it's worth it. So, for example, look at this: we can see that petal length has a very big influence on the target. You can see that target zero has really, really significantly smaller petal lengths, as opposed to target two, which has a very clearly distinct feature. And moving forward -- you know how sometimes, when you look at a graph, it's not intuitive? So a very, very good way to build that intuition -- to calculate and see the relationships between different features -- is to plot a heat map. You can see that when we plot a heat map, there is a high correlation between petal length and target -- this is what we want, the label -- petal length and target, and petal width and target.
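The heat map in the notebook is a picture of the feature correlation matrix, and you can see the same numbers directly without plotting. A sketch, again assuming scikit-learn's bundled Iris data as a stand-in for the notebook's own DataFrame:

```python
# Compute the correlation matrix behind the notebook's heat map.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

corr = df.corr()
# The "target" column shows how strongly each feature tracks the species:
# petal length and petal width correlate very strongly, sepal width much less.
print(corr["target"].round(2))
```

With a plotting library like seaborn, `heatmap(corr)` would render this same matrix as the colored grid shown in the talk.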
So, before we pass the data into the machine learning model, we can also drop -- which means delete -- the columns of sepal width and length, because they don't really correlate with the target that we actually care about. The next thing we want to do is the train-test split. You can read more about the importance of the train-test split here, but we will just move on. So, basically, when we have a data set, we can't just put everything into the training sample. We need to split it -- maybe a 70/30 split, which is 70% training and 30% testing, or an 80/20 split, which is 80% training and 20% testing. It depends on the size of your data set: some people go with 70/30, and if you have a huge data set you can do an 80/20 split. So, here, because we don't have a huge data set -- we only have 150 rows -- I decided to go with a 70/30 split, which you can see from the test size of 0.3. And after splitting, we get 105 rows in the training data set and 45 rows in the testing data set. We can also check whether we actually split it correctly, and this is a very important step.
So, now, we have five different columns, right? We have sepal length, sepal width, petal length, and petal width. And we have target, which is what we want to predict. So we need to split the training data set into x and y, where these are called the features, and this is the target that we want. And these are the features in the testing data set, and this is its target. So we split that. All of this is actually Pandas, if this notation seems really, really weird to you; it is very well documented in the Pandas DataFrame documentation. And this is still head, to see what it looks like after I split it: you can see that train x now has four things, the four different features that we want to train on, and this is its attribute, which corresponds to those features.

So, next, what you want to do is identify the correct problem and ML model. This problem is a classification problem. So, for example, if you are provided with several input variables, a classification model will actually try to predict the class of that data.
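The 70/30 split and the features/target separation described above can be done in one call with scikit-learn's `train_test_split`. A sketch on the bundled Iris data; the variable names and the fixed `random_state` are our assumptions, not necessarily the notebook's:

```python
# 70/30 train-test split: features (x) and target (y) separated in one call.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

train_x, test_x, train_y, test_y = train_test_split(
    df[iris.feature_names],         # the four feature columns
    df["target"],                   # the label we want to predict
    test_size=0.3,                  # 30% held out for testing
    random_state=42)                # fixed seed so the split is reproducible

print(len(train_x), len(test_x))    # 105 training rows, 45 testing rows
```

Checking the lengths afterwards is exactly the "did we split correctly" step Zad mentions.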
Say, for example, this is a classification problem because we are given the features -- sepal length, sepal width, petal length, and petal width -- and we want to find the target, which is which iris species it is. So that's a classification problem. As opposed to a regression problem, where, given certain features, you want to predict a number, say the petal length or petal width -- that's a regression model. You can read more about it here; I'm not going through it for time's sake.

And before we move on to test out the model, I want to talk about evaluation metrics, so we can keep running and see them. So, before we dive into evaluation metrics, we need to understand what true positives, true negatives, false positives, and false negatives are. A true positive is an outcome where the model correctly predicts the positive class. Say, for example, you have a one, and the model correctly predicts the one class. And, similarly, a true negative is an outcome where the model correctly predicts the negative class. Let me give you an example. So, basically, say we have a data set of cat pictures and not-cat pictures. A true positive is when the model correctly predicts that a cat picture is a cat.
A true negative is when the model correctly predicts that a non-cat picture is not a cat. So that's the difference between a true positive and a true negative. A false positive is an outcome where the model incorrectly predicts the positive class. Say, for example, I feed in a picture that is not a cat, but my model predicts it as a cat. That's a false positive. And a false negative is an outcome where the model incorrectly predicts the negative class. That's when I feed in a cat picture, but the model says this is not a cat. So that's a false negative.

So in terms of the different kinds of metrics that we use to judge the performance of an ML model, we have accuracy, recall, and precision. And we also have another, called the ROC, which is the receiver operating characteristic curve, and the AUC, the area under that curve, which is most often used as opposed to the raw ROC. So accuracy basically means the fraction of predictions the model got right. Say, let's take the example of the cat and non-cat data set, right? Out of 150 predictions, for example, if my model predicts 96% of the pictures correctly, that's the accuracy.
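The four outcomes just defined can be counted by hand for a toy "cat or not" run. The labels below are made up purely for illustration (1 = cat, the positive class; 0 = not a cat):

```python
# Counting true/false positives/negatives for a toy cat-vs-not-cat example.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth labels
predicted = [1, 1, 0, 0, 1, 0, 1, 0]   # what the model said

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # cat, predicted cat
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # non-cat, predicted non-cat
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # non-cat, predicted cat
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # cat, predicted non-cat

accuracy = (tp + tn) / len(actual)     # fraction of predictions that were right
print(tp, tn, fp, fn, accuracy)        # 3 3 1 1 0.75
```

From these same four counts you also get recall, TP / (TP + FN), and precision, TP / (TP + FP), which come up next.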
And we also have recall, which is what proportion of actual positives were identified correctly. Mathematically, it's defined as true positives over true positives plus false negatives. And precision is true positives over true positives plus false positives -- that is, what proportion of positive identifications were actually correct. And I thought it might be a bit overwhelming to hear all of this -- true positive, true negative -- it's like, oh my god, this is so much. So I included a lot of different resources that you can go and read. I think they are pretty good resources where you can learn more and get accustomed to all of these terms.

And, so, let's dive into the first supervised machine learning model we're going to talk about, which is k-nearest neighbors. The k in k-nearest neighbors is the number of nearest neighbors we wish to take the vote from. Let's do an example, right? So this is a dot I want to label as either blue or red. With k equal to three -- which is this solid-line circle -- you can see that we have two red triangles but one blue square.
381 00:19:38,350 --> 00:19:43,050 So, in this case, this dot will be classified by the KNN label 382 00:19:43,050 --> 00:19:45,600 as the red triangle. 383 00:19:45,600 --> 00:19:47,680 And what about if it's larger, right? 384 00:19:47,680 --> 00:19:50,060 So what about a k value of five? 385 00:19:50,060 --> 00:19:52,270 So it would be the dotted line circle. 386 00:19:52,270 --> 00:19:54,300 So you can see that in this dotted line circle, 387 00:19:54,300 --> 00:20:02,700 we can see that if we classify this green dot, we have three blue squares, 388 00:20:02,700 --> 00:20:04,360 but we have two red triangles. 389 00:20:04,360 --> 00:20:07,980 So this green dot, based on this dotted line circle, 390 00:20:07,980 --> 00:20:12,450 will be classified as a blue square. 391 00:20:12,450 --> 00:20:18,110 So we can see that actually choosing a k value in the KNN algorithm 392 00:20:18,110 --> 00:20:18,860 is very important. 393 00:20:18,860 --> 00:20:21,110 And that's where all the time consuming parts come in, 394 00:20:21,110 --> 00:20:24,560 because we want to choose the correct model that doesn't overfit or underfit 395 00:20:24,560 --> 00:20:26,090 the data. 396 00:20:26,090 --> 00:20:30,650 So some of the ways I can do it is plot a graph of accuracy versus k value, 397 00:20:30,650 --> 00:20:33,020 or a graph of error rate versus k value. 398 00:20:33,020 --> 00:20:37,430 But, most of the time, we just test with a random one and we move on from there. 399 00:20:37,430 --> 00:20:40,708 And there are also different distance metrics that we can use. 400 00:20:40,708 --> 00:20:43,250 If we learn maths we also learn about the Euclidean distance, 401 00:20:43,250 --> 00:20:46,760 whereby it's like, draw a triangle, and the straight line across is the Euclidean distance. 402 00:20:46,760 --> 00:20:48,530 Or we have the Manhattan distance, whereby 403 00:20:48,530 --> 00:20:52,430 it's like the distance from a point to a point walking along a grid of streets. 
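The voting idea just described can be sketched in a few lines of Python; the points below are made up to loosely mirror the slide (they are not the actual figure), so that k of three and k of five give different answers.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Label `query` by majority vote among its k nearest labeled points."""
    # Rank the labeled points by Euclidean distance to the query point.
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in ranked[:k])  # tally the k nearest labels
    return votes.most_common(1)[0][0]

# Made-up points loosely mirroring the slide: red triangles vs. blue squares.
points = [(1, 0), (0, 1.1), (1.2, 0), (0, 1.3), (0.99, 0.99), (3, 3)]
labels = ["red", "red", "blue", "blue", "blue", "red"]

print(knn_predict(points, labels, (0, 0), 3))  # two reds, one blue nearby -> "red"
print(knn_predict(points, labels, (0, 0), 5))  # three blues, two reds -> "blue"
```

Just as on the slide, the same query flips from red to blue when k grows, which is why choosing k matters.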
404 00:20:52,430 --> 00:20:57,530 So the KNN resources that I found really, really helpful in explaining it 405 00:20:57,530 --> 00:20:58,980 will be here as well. 406 00:20:58,980 --> 00:21:01,790 So, without further ado, let's get started with model testing. 407 00:21:01,790 --> 00:21:05,210 So I use the sklearn library to-- 408 00:21:05,210 --> 00:21:09,830 I pass in my number of neighbors, which is three, to the KNN classifier, 409 00:21:09,830 --> 00:21:11,670 and I train it. 410 00:21:11,670 --> 00:21:14,420 And after that I used-- this is the prediction-- after I train it, 411 00:21:14,420 --> 00:21:17,930 I just used the model to predict my testing data set, 412 00:21:17,930 --> 00:21:21,750 to see how accurate it is on the data set itself. 413 00:21:21,750 --> 00:21:27,230 So, after I run it, you can see that the accuracy score was around 93%. 414 00:21:27,230 --> 00:21:30,980 That means that of all the positive-- 415 00:21:30,980 --> 00:21:36,480 of all of the predictions that the ML model made, 93% of them are correct. 416 00:21:36,480 --> 00:21:39,110 And what about the precision score? 417 00:21:39,110 --> 00:21:43,340 We got 0.9476, around 95%. 418 00:21:43,340 --> 00:21:46,040 That means that out of the total positive predictions, 419 00:21:46,040 --> 00:21:48,380 95% of them are actually correct. 420 00:21:48,380 --> 00:21:49,400 That's pretty high. 421 00:21:49,400 --> 00:21:52,190 And the recall of the KNN is around 93%. 422 00:21:52,190 --> 00:21:54,290 I'm so sorry, I will change this later. 423 00:21:54,290 --> 00:21:58,610 But, maybe, let's test around with different kinds of neighbor values. 424 00:21:58,610 --> 00:22:03,600 Let's do about 10, which is a much bigger neighborhood. 425 00:22:03,600 --> 00:22:06,600 We can see that it actually affects the-- 426 00:22:06,600 --> 00:22:09,270 I think, just now, we have 93% accuracy. 
427 00:22:09,270 --> 00:22:13,500 But now, we can see that if we fit it with 10 neighbors, which 428 00:22:13,500 --> 00:22:18,344 is increasing the radius of the circle, you can see that the-- 429 00:22:18,344 --> 00:22:20,680 both the precision, and accuracy, and recall score 430 00:22:20,680 --> 00:22:24,190 actually increased from around 93% to 95%. 431 00:22:24,190 --> 00:22:28,000 So that's why constantly testing and evaluating the model 432 00:22:28,000 --> 00:22:29,540 is actually very important. 433 00:22:29,540 --> 00:22:34,670 So, let's move on to another one, which is logistic regression. 434 00:22:34,670 --> 00:22:38,200 Logistic regression is mainly based on this graph, 435 00:22:38,200 --> 00:22:44,060 and it is fit using MLE, maximum likelihood estimation. 436 00:22:44,060 --> 00:22:48,340 Giving an example would be, say, for example, I was given a bunch of data 437 00:22:48,340 --> 00:22:52,120 and I was trying to predict whether someone is COVID positive 438 00:22:52,120 --> 00:22:53,630 or COVID negative. 439 00:22:53,630 --> 00:22:57,640 So, what we will do is, we will plot a graph like this, a sigmoid curve. 440 00:22:57,640 --> 00:23:00,820 And then, say the data was here, right, and I'll be like, 441 00:23:00,820 --> 00:23:04,900 oh, because this data was above 0.5, more than half of this curve, 442 00:23:04,900 --> 00:23:06,880 I would classify it as one, COVID positive. 443 00:23:06,880 --> 00:23:10,310 But if it's here, we'll classify it as COVID negative. 444 00:23:10,310 --> 00:23:13,030 So that's how this model generally works, 445 00:23:13,030 --> 00:23:15,850 is that depending on where the data points are, 446 00:23:15,850 --> 00:23:20,970 we classify whether it's positive or negative based on where it is. 447 00:23:20,970 --> 00:23:24,050 So, in this case, I also tried to run a logistic regression 448 00:23:24,050 --> 00:23:26,760 model on the Iris data set itself. 
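The thresholding step being described can be sketched like this; the weights below are hypothetical stand-ins for the parameters that MLE would actually fit from the data.

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1), read as a probability.
    return 1 / (1 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Label x as 1 (say, COVID positive) if the sigmoid output crosses the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Hypothetical fitted parameters; in practice MLE finds w and b from the data.
w, b = 1.2, -3.0
print(predict(1.0, w, b))  # low on the curve  -> 0
print(predict(4.0, w, b))  # high on the curve -> 1
```

In sklearn, `LogisticRegression` does the fitting and thresholding for you; this sketch only shows why a point low on the curve gets labeled negative and a point high on the curve positive.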
449 00:23:26,760 --> 00:23:30,440 And I also include a few resources where you can 450 00:23:30,440 --> 00:23:32,630 understand logistic regression more. 451 00:23:32,630 --> 00:23:41,962 So if we run the logistic regression I think it's actually-- 452 00:23:41,962 --> 00:23:43,730 [INAUDIBLE] 453 00:23:43,730 --> 00:23:45,860 So, if you run the logistic regression, you 454 00:23:45,860 --> 00:23:50,690 can see that the accuracy score is actually less than the one 455 00:23:50,690 --> 00:23:51,710 that we had before. 456 00:23:51,710 --> 00:24:00,680 We had 95% accuracy with the KNN model, but we only have around 93% 457 00:24:00,680 --> 00:24:02,870 with logistic regression. 458 00:24:02,870 --> 00:24:05,870 And that's how we actually, so-- 459 00:24:05,870 --> 00:24:08,630 alternatively, you can use another approach or ML 460 00:24:08,630 --> 00:24:10,400 model that we suggested here-- 461 00:24:10,400 --> 00:24:11,930 sorry to keep jumping-- 462 00:24:11,930 --> 00:24:15,290 which is a decision tree, support vector machine, or neural network. 463 00:24:15,290 --> 00:24:18,000 I remember when I ran it with a support vector machine, 464 00:24:18,000 --> 00:24:20,840 it had really good accuracy and prediction, 465 00:24:20,840 --> 00:24:24,750 while a neural network is also a pretty good way to try to classify it. 466 00:24:24,750 --> 00:24:27,740 So that's all for me, you can move on to-- 467 00:24:27,740 --> 00:24:33,560 KEVIN XU: Yeah, so, great, Zad, a wonderful job 468 00:24:33,560 --> 00:24:38,550 of highlighting how you can use a regression, 469 00:24:38,550 --> 00:24:46,062 or you can use ML to take a data set and attempt to predict future elements 470 00:24:46,062 --> 00:24:47,770 that may be part of this data set, right? 471 00:24:47,770 --> 00:24:50,580 And, so, this is one of the core pieces of ML. 472 00:24:50,580 --> 00:24:54,220 And, don't be worried if you couldn't follow all of that. 
473 00:24:54,220 --> 00:24:58,950 Or, you don't really understand how the syntax for the code works. 474 00:24:58,950 --> 00:25:01,740 It's a matter of learning new libraries, and machine learning 475 00:25:01,740 --> 00:25:05,730 is very heavily dependent on previously written libraries. 476 00:25:05,730 --> 00:25:11,500 It's a lot of work to develop your own algorithm to-- for the machine learning 477 00:25:11,500 --> 00:25:12,390 to take place. 478 00:25:12,390 --> 00:25:14,970 And, so, a lot of the case it's the-- 479 00:25:14,970 --> 00:25:18,720 a lot of the time it's the case of seeing how your data looks, 480 00:25:18,720 --> 00:25:21,150 and then trying to choose the best model. 481 00:25:21,150 --> 00:25:24,510 Which is a sequence of algorithms that make this black box function, 482 00:25:24,510 --> 00:25:29,050 as we talked about earlier, to actually get accurate predictions. 483 00:25:29,050 --> 00:25:34,140 So, yeah, feel free to ask questions-- or pose questions about this. 484 00:25:34,140 --> 00:25:38,700 But I will be going on to kind of the flip side of machine learning. 485 00:25:38,700 --> 00:25:44,640 Instead of using previous data to predict elements of the same data set, 486 00:25:44,640 --> 00:25:50,670 we're going to start talking about how can we get the machine to calculate 487 00:25:50,670 --> 00:25:53,820 or like-- we're going to start talking about games, and reinforcement 488 00:25:53,820 --> 00:25:54,340 learning. 489 00:25:54,340 --> 00:25:57,870 And, so, one of the most popular uses for reinforcement learning 490 00:25:57,870 --> 00:26:00,660 is, of course, trying to solve games, right? 491 00:26:00,660 --> 00:26:06,780 So you have a game, and you want to build an artificial intelligence, 492 00:26:06,780 --> 00:26:10,620 or an AI, that best wins your game, or that 493 00:26:10,620 --> 00:26:14,470 always plays the best move possible. 
494 00:26:14,470 --> 00:26:21,300 And this is actually loaded with theory about how data sets work, how the game 495 00:26:21,300 --> 00:26:25,140 itself works, and there's a lot of math and logic behind it. 496 00:26:25,140 --> 00:26:29,940 But, at the end of the day, the idea is, given what you know about the game, 497 00:26:29,940 --> 00:26:33,300 can you get your computer to train on the game such 498 00:26:33,300 --> 00:26:39,850 that it always has a pretty good idea of what the next best move is? 499 00:26:39,850 --> 00:26:43,680 And, so, I've built just a really silly game here. 500 00:26:43,680 --> 00:26:48,690 We don't have to worry too much about the structure of the game itself, 501 00:26:48,690 --> 00:26:52,290 but in general, essentially, when you start the game 502 00:26:52,290 --> 00:26:53,997 you're presented with two doors. 503 00:26:53,997 --> 00:26:55,830 You choose one of the doors, and then you're 504 00:26:55,830 --> 00:26:57,330 presented with another two doors. 505 00:26:57,330 --> 00:27:01,020 And you choose one of the doors again, and behind that door, 506 00:27:01,020 --> 00:27:05,970 there's a randomly generated value between zero 507 00:27:05,970 --> 00:27:08,290 and a certain number that corresponds with the door. 508 00:27:08,290 --> 00:27:11,730 So if you think about this kind of tree-like structure, 509 00:27:11,730 --> 00:27:14,730 you have two, and then two more, so there's four total doors at the end. 510 00:27:14,730 --> 00:27:16,855 And each one of them has a number assigned to them. 511 00:27:16,855 --> 00:27:21,970 Perhaps like nine-- or in this case, I think, three, nine, one, and 20. 512 00:27:21,970 --> 00:27:26,470 All right, so obviously, the door that is associated with 20 513 00:27:26,470 --> 00:27:29,770 is going to, on average, give you better points, or reward, 514 00:27:29,770 --> 00:27:33,880 or whatever this point system works out as, right? 
515 00:27:33,880 --> 00:27:36,440 Than the door that has a one associated with it. 516 00:27:36,440 --> 00:27:39,310 And, so, the question is, can you get the computer 517 00:27:39,310 --> 00:27:43,210 to realize which door is the best, simply 518 00:27:43,210 --> 00:27:45,760 by playing the game a bunch of times? 519 00:27:45,760 --> 00:27:48,550 And this is the concept of reinforcement learning. 520 00:27:48,550 --> 00:27:50,330 You just play it a bunch of times. 521 00:27:50,330 --> 00:27:54,220 And for things that turn out well, as in you've got a lot of points, 522 00:27:54,220 --> 00:27:56,890 and well, yeah, you've got a lot of points, 523 00:27:56,890 --> 00:28:00,790 then the computer should be more likely to choose that option in the future. 524 00:28:00,790 --> 00:28:03,430 And for doors that you didn't get a lot of points, 525 00:28:03,430 --> 00:28:06,890 you could be less likely to choose that door in the future. 526 00:28:06,890 --> 00:28:11,890 So we don't have to worry too much about the overall structure of this code, 527 00:28:11,890 --> 00:28:16,120 but right now, I have it set up such that the computer just simply 528 00:28:16,120 --> 00:28:17,950 plays random moves every time. 529 00:28:17,950 --> 00:28:22,540 So it goes through one of the two initial doors, with equal probability, 530 00:28:22,540 --> 00:28:25,730 and then it chooses one of the next two doors with equal probability. 531 00:28:25,730 --> 00:28:29,080 So you expect that the expected value is going 532 00:28:29,080 --> 00:28:32,950 to be pretty low, because these aren't in general pretty-- 533 00:28:32,950 --> 00:28:33,790 or, yeah. 534 00:28:33,790 --> 00:28:37,180 So it won't be the highest that it could possibly be, right? 535 00:28:37,180 --> 00:28:45,880 So if we run this code, we see we got an expected value about 4.125. 536 00:28:45,880 --> 00:28:50,843 So, on average, the computer is scoring four points, right? 
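The random-play baseline can be sketched like this, assuming the door structure just described: two doors, then two more, with the payout behind each final door drawn uniformly between zero and its maximum of 3, 9, 1, or 20.

```python
import random

# Assumed door structure from the demo: two doors, then two more, and the
# payout behind each final door is uniform between 0 and its maximum
# (3, 9, 1, or 20).
LEAF_MAX = [3, 9, 1, 20]

def play_random():
    """Play one game, choosing every door uniformly at random."""
    first = random.randint(0, 1)   # pick one of the two first doors
    second = random.randint(0, 1)  # then one of its two child doors
    return random.uniform(0, LEAF_MAX[2 * first + second])

random.seed(0)
games = [play_random() for _ in range(100_000)]
print(sum(games) / len(games))  # near (1.5 + 4.5 + 0.5 + 10) / 4 = 4.125
```

Each final door pays half its maximum on average, so uniformly random play averages (1.5 + 4.5 + 0.5 + 10) / 4 = 4.125 points, which matches the roughly four points the demo reports.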
537 00:28:50,843 --> 00:28:53,260 And, so, this is definitely not the best we can do, right? 538 00:28:53,260 --> 00:28:54,880 This is a totally random-- 539 00:28:54,880 --> 00:28:56,620 the computer is playing totally randomly, 540 00:28:56,620 --> 00:28:59,590 like this is completely stupid for the computer to do. 541 00:28:59,590 --> 00:29:04,600 So we want to attempt to get it to figure out which doors are better. 542 00:29:04,600 --> 00:29:06,790 And this actually doesn't have to-- 543 00:29:06,790 --> 00:29:10,400 this-- to implement reinforcement learning, a lot of the time neural 544 00:29:10,400 --> 00:29:12,400 networking and other aspects of machine learning 545 00:29:12,400 --> 00:29:15,320 are incorporated with reinforcement learning. 546 00:29:15,320 --> 00:29:21,280 So you take in the data, and you pass it through another regression, 547 00:29:21,280 --> 00:29:25,790 or prediction, and that also helps you find out which move is best. 548 00:29:25,790 --> 00:29:28,180 But, in this case, because the game is so simple, 549 00:29:28,180 --> 00:29:31,740 you can simply just hard code the training in. 550 00:29:31,740 --> 00:29:36,640 And, so, if we take a look here, and this will be a little brief, 551 00:29:36,640 --> 00:29:39,140 so if you want to ask about the logic, go ahead in the chat, 552 00:29:39,140 --> 00:29:41,600 but I'll try not to waste too much time trying to explain 553 00:29:41,600 --> 00:29:43,950 what every line of code does. 554 00:29:43,950 --> 00:29:50,122 So for every move you take, you consider the state that the move resides in. 555 00:29:50,122 --> 00:29:53,330 So, the current state of the board, so, which doors you have in front of you, 556 00:29:53,330 --> 00:29:55,350 and which possible doors you can go through. 557 00:29:55,350 --> 00:29:58,308 So, if you start at the beginning, you have the first two doors, right? 558 00:29:58,308 --> 00:30:02,360 So, your state is just at the beginning, and you have two possible choices. 
559 00:30:02,360 --> 00:30:08,540 And, so, if you label every state in the system with a heuristic value that is-- 560 00:30:08,540 --> 00:30:11,690 that basically tells you the goodness of that state, 561 00:30:11,690 --> 00:30:15,920 like how desirable do you are-- how desirable is it to be in that state 562 00:30:15,920 --> 00:30:18,350 if the goal is to accumulate points. 563 00:30:18,350 --> 00:30:21,260 Then, what you can do is, have the computer simply 564 00:30:21,260 --> 00:30:25,580 go through this, the states, that have the highest goodness value, right? 565 00:30:25,580 --> 00:30:28,310 So how do we actually calculate this goodness value? 566 00:30:28,310 --> 00:30:30,690 Well, we just play the game a bunch of times. 567 00:30:30,690 --> 00:30:36,140 So for every time you play the game, you're keeping an internal track, 568 00:30:36,140 --> 00:30:40,850 or the computer is keeping an internal track, of what the current state 569 00:30:40,850 --> 00:30:42,350 heuristic value is. 570 00:30:42,350 --> 00:30:46,370 And then, it makes a move based on what it thinks 571 00:30:46,370 --> 00:30:48,870 is the best move in this case. 572 00:30:48,870 --> 00:30:57,310 So, here, once we make the move, we then find out what the next state is, 573 00:30:57,310 --> 00:31:00,580 and how good the next state is as a result of the move. 574 00:31:00,580 --> 00:31:05,380 So using this accumulation, we can kind of backtrack and see 575 00:31:05,380 --> 00:31:08,470 how good was the current state that we were in before we made the move, 576 00:31:08,470 --> 00:31:10,610 and how good of a move did we make? 577 00:31:10,610 --> 00:31:13,330 And, so, of course, by incrementing these, right, you 578 00:31:13,330 --> 00:31:17,740 increase the probability to make good moves and you increase-- you decrease, 579 00:31:17,740 --> 00:31:20,270 relatively, the probability to make bad moves. 
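The demo's training code is not shown on screen, but a heuristic table of the kind described could be learned along these lines; the state names, update rule, learning rate, and exploration rate here are illustrative assumptions rather than the actual implementation.

```python
import random

# Illustrative sketch of learning a "goodness" table for the door game;
# the payouts (3, 9, 1, 20) and all parameters below are assumptions.
LEAF_MAX = [3, 9, 1, 20]
Q = {}              # learned goodness of each (state, door) pair
EPSILON = 0.1       # how often to explore a random door
ALPHA = 0.05        # how strongly each game nudges the values

def choose(state):
    # Mostly take the door with the highest learned goodness,
    # but occasionally explore a random one.
    if random.random() < EPSILON:
        return random.randint(0, 1)
    return max((0, 1), key=lambda door: Q.get((state, door), 0.0))

random.seed(1)
for _ in range(10_000):                      # play the game many times
    first = choose("start")
    second = choose(("after", first))
    reward = random.uniform(0, LEAF_MAX[2 * first + second])
    # Backtrack: nudge every (state, door) we visited toward the reward it led to.
    for key in [("start", first), (("after", first), second)]:
        Q[key] = Q.get(key, 0.0) + ALPHA * (reward - Q.get(key, 0.0))

best_first = max((0, 1), key=lambda d: Q.get(("start", d), 0.0))
best_second = max((0, 1), key=lambda d: Q.get((("after", best_first), d), 0.0))
print(best_first, best_second)  # the greedy path should reach the 20-point door
```

After training, the greedy path through the table heads for the door whose payouts averaged highest, just as the demo's learned dictionary ends up choosing the 20-point door almost every time.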
580 00:31:20,270 --> 00:31:23,810 581 00:31:23,810 --> 00:31:33,790 And, so, if we throw this training in that calculates the heuristic model, 582 00:31:33,790 --> 00:31:37,430 and we recompile, we see we got an expected value of about 10, 583 00:31:37,430 --> 00:31:40,340 which is about twice as big as we had previously. 584 00:31:40,340 --> 00:31:42,760 So the computer has gotten way better at this game. 585 00:31:42,760 --> 00:31:45,462 And if you actually take a look into the data structure, which 586 00:31:45,462 --> 00:31:47,920 I won't at the moment, because it looks really complicated, 587 00:31:47,920 --> 00:31:50,620 and well, it's just a big dictionary, and I don't think 588 00:31:50,620 --> 00:31:53,380 it will help you understand what is going on here. 589 00:31:53,380 --> 00:31:56,020 If you actually take a look, you'll see that the end result 590 00:31:56,020 --> 00:32:00,700 of the training, so the training, we played the game 10,000 times. 591 00:32:00,700 --> 00:32:07,810 And we evaluated the goodness of every move for those 10,000 games. 592 00:32:07,810 --> 00:32:10,060 And you'll find that after this training, 593 00:32:10,060 --> 00:32:16,150 the probability of choosing the door that leads to this 20 door, 594 00:32:16,150 --> 00:32:20,350 sorry I went a little far, so that leads to this 20 door at the end 595 00:32:20,350 --> 00:32:21,670 is almost one. 596 00:32:21,670 --> 00:32:24,170 It's like, 0.9997 or something like that. 597 00:32:24,170 --> 00:32:27,460 And, so, the computer has basically figured out, 598 00:32:27,460 --> 00:32:33,745 without us telling the computer anything except for the game, right, the game 599 00:32:33,745 --> 00:32:40,070 state, and what the results are, how to beat this game. 600 00:32:40,070 --> 00:32:43,550 And, so, this is the goal of reinforcement learning. 
601 00:32:43,550 --> 00:32:49,060 And I think this is a very interesting thing, that has-- 602 00:32:49,060 --> 00:32:52,300 there's a lot of application in chess, and go, 603 00:32:52,300 --> 00:32:57,700 and this is the basic core idea of how people have solved these games. 604 00:32:57,700 --> 00:33:02,830 You want to make good moves, is what it boils down to. 605 00:33:02,830 --> 00:33:05,110 Which sounds very simple, but in execution can 606 00:33:05,110 --> 00:33:09,190 be more complicated than it seems. 607 00:33:09,190 --> 00:33:09,690 Let's go-- 608 00:33:09,690 --> 00:33:12,782 SPEAKER: Kevin, and Zad, can we take a couple of questions now? 609 00:33:12,782 --> 00:33:13,740 Oh perfect timing, yay. 610 00:33:13,740 --> 00:33:16,140 KEVIN XU: --that's what we're planning on doing right now. 611 00:33:16,140 --> 00:33:17,473 SPEAKER: There we go, all right. 612 00:33:17,473 --> 00:33:22,140 There have been several since the last little break there. 613 00:33:22,140 --> 00:33:26,910 OK, from Angela, "Was Person of Interest a realistic example 614 00:33:26,910 --> 00:33:29,940 of machine learning?" 615 00:33:29,940 --> 00:33:32,020 KEVIN XU: Person of Interest? 616 00:33:32,020 --> 00:33:34,660 ZAD CHIN: I'm sorry, can you repeat the question? 617 00:33:34,660 --> 00:33:37,600 SPEAKER: That's the question, it was probably about the earlier-- 618 00:33:37,600 --> 00:33:40,090 maybe Angela, you can write back in the chat? 619 00:33:40,090 --> 00:33:43,690 I'm going to just keep going on and continue answering, 620 00:33:43,690 --> 00:33:46,720 but if you want to clarify that in the chat. 621 00:33:46,720 --> 00:33:50,680 From Maria, "Do you ever feel like machine learning sometimes 622 00:33:50,680 --> 00:33:56,400 has very serious consequences, like in elections for example?" 
623 00:33:56,400 --> 00:34:00,270 ZAD CHIN: Yes, we do feel like, in terms of machine learning, 624 00:34:00,270 --> 00:34:02,760 the impact of machine learning model, especially, 625 00:34:02,760 --> 00:34:04,470 I mean it can be both bad and good. 626 00:34:04,470 --> 00:34:06,730 That's why, at the last slide of our slides, 627 00:34:06,730 --> 00:34:09,210 we also talk about how machine learning actually 628 00:34:09,210 --> 00:34:15,635 generates deepfake, or like, privacy intrusion, the alignment problem. 629 00:34:15,635 --> 00:34:17,760 So we actually include some machine learning ethics 630 00:34:17,760 --> 00:34:20,080 that we hope to actually share with you as well. 631 00:34:20,080 --> 00:34:25,179 But generally, in terms of machine learning in an election, 632 00:34:25,179 --> 00:34:26,520 I do think that that's true. 633 00:34:26,520 --> 00:34:30,840 Because a lot of the time, advertisement model in like Facebook, or Google, 634 00:34:30,840 --> 00:34:34,050 they use a lot of machine learning model to predict the kind of person 635 00:34:34,050 --> 00:34:36,403 that you are, and recommend-- 636 00:34:36,403 --> 00:34:38,820 recommender system is actually a part of machine learning. 637 00:34:38,820 --> 00:34:41,440 It's a very huge research topic in machine learning. 638 00:34:41,440 --> 00:34:46,199 And, so, if you want to know more it will be recommender system at Google 639 00:34:46,199 --> 00:34:47,909 or Facebook, I would-- 640 00:34:47,909 --> 00:34:52,440 SPEAKER: OK, and then we have another question about cybersecurity. 641 00:34:52,440 --> 00:34:53,730 So is that-- 642 00:34:53,730 --> 00:34:56,820 I think we can leave that for later, I think you have some slides on that. 643 00:34:56,820 --> 00:34:58,530 Is that correct, Zad? 644 00:34:58,530 --> 00:34:59,390 Yeah, OK. 645 00:34:59,390 --> 00:35:00,528 ZAD CHIN: Yeah, we do. 
646 00:35:00,528 --> 00:35:02,070 SPEAKER: All right, so, let me just-- 647 00:35:02,070 --> 00:35:06,750 this is from James, "How does a computer or machine recognize objects 648 00:35:06,750 --> 00:35:09,330 by itself in unsupervised learning?" 649 00:35:09,330 --> 00:35:11,825 650 00:35:11,825 --> 00:35:12,700 ZAD CHIN: It depends. 651 00:35:12,700 --> 00:35:17,240 So for example, if you say like, a data point, for example, currently 652 00:35:17,240 --> 00:35:19,610 I'm doing research on the ICU data set. 653 00:35:19,610 --> 00:35:24,650 We have a lot of features, and then, so what we do is we pass in the features, 654 00:35:24,650 --> 00:35:29,840 and then it will be represented in like, let me give you a simple example. 655 00:35:29,840 --> 00:35:32,630 Say, for example, we represent-- we have x and y features. 656 00:35:32,630 --> 00:35:34,190 And we represent it on-- 657 00:35:34,190 --> 00:35:37,238 we just plot, like we just put the points in. 658 00:35:37,238 --> 00:35:40,530 And then how the machine learning knows is that it tries to cluster the points. 659 00:35:40,530 --> 00:35:45,110 One of the very good ways to actually know, like in unsupervised learning, 660 00:35:45,110 --> 00:35:46,340 is clustering. 661 00:35:46,340 --> 00:35:48,620 So basically, say, for example, I took a point, right? 662 00:35:48,620 --> 00:35:51,830 Like what is the neighbor of the point, and how it should relate, 663 00:35:51,830 --> 00:35:53,260 how strongly it actually relates. 664 00:35:53,260 --> 00:35:57,170 So, maybe like, for example, the points are very separated, 665 00:35:57,170 --> 00:36:01,980 or the points are like a block, or the points are not related at all. 666 00:36:01,980 --> 00:36:06,620 So, basically, one of the very good ways of unsupervised learning 667 00:36:06,620 --> 00:36:10,310 is also clustering, that's like for discrete data sets. 
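The clustering idea Zad describes can be sketched with a tiny k-means loop; the points and the naive initialization below are made up for illustration, and are not the ICU data set mentioned.

```python
import math

def kmeans(points, k, iters=10):
    """Tiny k-means sketch: group unlabeled points purely by nearness."""
    centers = list(points[:k])                   # naive start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to its nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):        # move each center to its cluster's mean
            if cl:
                centers[i] = tuple(sum(coord) / len(cl) for coord in zip(*cl))
    return centers, clusters

# Two obvious blobs of unlabeled points; no labels are ever given.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
centers, clusters = kmeans(points, 2)
print([len(cl) for cl in clusters])  # the two blobs are found: [4, 4]
```

Nothing ever tells the algorithm which group a point belongs to; structure emerges purely from the distances, which is the point of unsupervised learning.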
668 00:36:10,310 --> 00:36:13,220 If you say for an image data set, for example, most 669 00:36:13,220 --> 00:36:16,880 of the time we use something called a convolutional neural network, which 670 00:36:16,880 --> 00:36:19,840 is CNN, which basically passes through a lot of filters. 671 00:36:19,840 --> 00:36:23,540 And I think in CS50, we also have this pset 672 00:36:23,540 --> 00:36:25,190 where we talk about edge detection. 673 00:36:25,190 --> 00:36:27,290 I think it's under the-- 674 00:36:27,290 --> 00:36:28,610 it's one of the psets. 675 00:36:28,610 --> 00:36:33,500 And that's edge detection, whereby we draw out the edges themselves. 676 00:36:33,500 --> 00:36:35,480 And that's also a part of the CNN, whereby 677 00:36:35,480 --> 00:36:38,540 we try to pass through different filters and networks. 678 00:36:38,540 --> 00:36:42,060 And then, we draw the edges to recognize what is in the image itself. 679 00:36:42,060 --> 00:36:46,880 So if you are interested in knowing about how machine learning actually 680 00:36:46,880 --> 00:36:49,430 understands images, I would recommend CNNs. 681 00:36:49,430 --> 00:36:52,370 If you're about-- if you like normal, unsupervised learning, 682 00:36:52,370 --> 00:36:55,970 there are a few here, like clustering, or autoencoders, 683 00:36:55,970 --> 00:36:58,500 which are a part of neural networks. 684 00:36:58,500 --> 00:37:01,710 SPEAKER: OK, why don't we take two more, and then we'll continue again. 685 00:37:01,710 --> 00:37:04,800 So, Amir asks, "Which step is preferred first? 686 00:37:04,800 --> 00:37:09,232 Data analysis or machine learning?" 687 00:37:09,232 --> 00:37:10,690 ZAD CHIN: I would-- yeah, go ahead. 688 00:37:10,690 --> 00:37:14,980 KEVIN XU: So, this is very context-dependent at times. 689 00:37:14,980 --> 00:37:20,080 But in general, you don't want to take some data set that you just gathered 690 00:37:20,080 --> 00:37:23,500 and immediately throw it into machine learning. 
691 00:37:23,500 --> 00:37:29,770 Because, when you take data, in the real world, as opposed to generated data, 692 00:37:29,770 --> 00:37:31,250 there's a lot of noise. 693 00:37:31,250 --> 00:37:33,970 There's a lot of inconsistencies, there are 694 00:37:33,970 --> 00:37:37,100 a lot of things that can go wrong, just when you take normal data. 695 00:37:37,100 --> 00:37:42,040 And, so, if you just throw something random into an ML program, 696 00:37:42,040 --> 00:37:45,140 it will not necessarily get you the results you want. 697 00:37:45,140 --> 00:37:48,910 And most of the time it won't, because there's so much noise in the data 698 00:37:48,910 --> 00:37:52,960 that it's really difficult to identify the patterns. 699 00:37:52,960 --> 00:37:57,820 And, so, generally, when you're designing this ML-- 700 00:37:57,820 --> 00:38:04,060 type of sequence of steps, you want to at least screen your data before you 701 00:38:04,060 --> 00:38:06,010 throw it into any type of program. 702 00:38:06,010 --> 00:38:10,300 That way, you can catch early on if things are going to go wrong, right? 703 00:38:10,300 --> 00:38:12,550 And like, in the worst case, you just might 704 00:38:12,550 --> 00:38:16,410 have to retake the entire data set because it's not valuable, right? 705 00:38:16,410 --> 00:38:20,230 And, so, it's kind of like a screening process, at least the initial data 706 00:38:20,230 --> 00:38:24,358 analysis, before you can actually try and be productive with that data set. 707 00:38:24,358 --> 00:38:26,650 ZAD CHIN: Adding on to that, I feel like it's very important 708 00:38:26,650 --> 00:38:27,832 to do data analysis first. 709 00:38:27,832 --> 00:38:30,790 This is because say, for example, you get data of like COVID positive 710 00:38:30,790 --> 00:38:32,320 and COVID negative patients. 711 00:38:32,320 --> 00:38:34,512 And actually it is very dangerous. 
712 00:38:34,512 --> 00:38:36,220 So, for example, if your data set is very 713 00:38:36,220 --> 00:38:38,950 biased toward COVID negative persons, and you just 714 00:38:38,950 --> 00:38:41,800 pass it to a logistic regression model, then the model would 715 00:38:41,800 --> 00:38:45,670 be like, oh, since 98% of the people are actually COVID negative, 716 00:38:45,670 --> 00:38:48,790 right, then I can just predict, oh nine-- 717 00:38:48,790 --> 00:38:53,050 out of the test examples I gave, I will just predict everybody as negative. 718 00:38:53,050 --> 00:38:56,920 So I still get a 98% accuracy, which is actually very, very dangerous. 719 00:38:56,920 --> 00:38:57,940 And that's also-- 720 00:38:57,940 --> 00:39:00,995 why you want to know how many of each label we have, 721 00:39:00,995 --> 00:39:03,370 before we actually pass it to the machine learning model. 722 00:39:03,370 --> 00:39:05,350 Because these kinds of biases can happen. 723 00:39:05,350 --> 00:39:10,578 Machines can be just like, oh, because 98% of people are actually negative, 724 00:39:10,578 --> 00:39:12,370 so I can just predict everyone is negative. 725 00:39:12,370 --> 00:39:14,980 And I get a 98% accuracy, right? 726 00:39:14,980 --> 00:39:18,040 So it's very, very important to do the data analysis 727 00:39:18,040 --> 00:39:20,320 before you actually train the data, especially 728 00:39:20,320 --> 00:39:24,040 on this kind of highly biased data itself. 729 00:39:24,040 --> 00:39:26,830 SPEAKER: OK, and the last question, from Victor, "Do all machine 730 00:39:26,830 --> 00:39:28,870 learning models use neutral-- 731 00:39:28,870 --> 00:39:32,438 neural networks?" 732 00:39:32,438 --> 00:39:32,980 ZAD CHIN: No. 733 00:39:32,980 --> 00:39:33,772 KEVIN XU: Yeah, no. 
734 00:39:33,772 --> 00:39:35,770 So neural networks-- 735 00:39:35,770 --> 00:39:38,440 and convolutional neural networks, these two things 736 00:39:38,440 --> 00:39:43,540 are the same thing, but specialized-- are a specific subset 737 00:39:43,540 --> 00:39:45,680 of the machine learning that you can do. 738 00:39:45,680 --> 00:39:48,430 And, so, I think you probably asked this before I showed 739 00:39:48,430 --> 00:39:50,200 the reinforcement learning example. 740 00:39:50,200 --> 00:39:53,440 But you don't have to use neural networks at all when you're 741 00:39:53,440 --> 00:39:57,210 trying to get the machine to learn. 742 00:39:57,210 --> 00:40:02,400 It is very-- neural networks are very powerful, because you're 743 00:40:02,400 --> 00:40:07,680 able to take a lot of data and have the computer generate 744 00:40:07,680 --> 00:40:09,720 the relationships between the data. 745 00:40:09,720 --> 00:40:14,610 But it's not always necessary to use a neural network when 746 00:40:14,610 --> 00:40:18,390 you're trying to learn from a data set, or have the computer learn from a data 747 00:40:18,390 --> 00:40:20,940 set, as you saw in reinforcement learning, right? 748 00:40:20,940 --> 00:40:26,430 You can simply have the computer attempt to make its own conclusions based 749 00:40:26,430 --> 00:40:32,640 on what state, or what type of data you give it, and what it wants to achieve. 750 00:40:32,640 --> 00:40:36,320 751 00:40:36,320 --> 00:40:37,280 ZAD CHIN: Right. 752 00:40:37,280 --> 00:40:41,520 KEVIN XU: So, yeah, we'll take some more questions at the end. 753 00:40:41,520 --> 00:40:44,720 So yeah, feel free to continue posting them in the chat. 
754 00:40:44,720 --> 00:40:50,150 But, for now, since we've probably hit you with a lot, and things we can do, 755 00:40:50,150 --> 00:40:52,550 and things we can't do, and possible things, 756 00:40:52,550 --> 00:40:56,120 we want to give some tips on what is reasonable to consider 757 00:40:56,120 --> 00:41:00,300 if you're actually attempting this-- to implement this in a CS50 final project. 758 00:41:00,300 --> 00:41:03,980 And, so, Zad has made this great graph here 759 00:41:03,980 --> 00:41:06,710 that goes kind of in difficulty level from the left to right. 760 00:41:06,710 --> 00:41:09,260 And, so, at the very left you have supervised learning 761 00:41:09,260 --> 00:41:12,260 and unsupervised learning, which is-- 762 00:41:12,260 --> 00:41:16,880 requires some effort, but not huge amounts of dedication. 763 00:41:16,880 --> 00:41:19,170 Although, this is always context-dependent as well. 764 00:41:19,170 --> 00:41:23,480 And then, reinforcement learning will probably take more time, simply 765 00:41:23,480 --> 00:41:26,600 because you not only have to provide the data, 766 00:41:26,600 --> 00:41:29,180 but you often have to build an infrastructure that 767 00:41:29,180 --> 00:41:30,420 can interpret the data. 768 00:41:30,420 --> 00:41:35,540 So, in the case of the game, right, you have to build something that actually-- 769 00:41:35,540 --> 00:41:38,090 you have to actually build the game into Python. 770 00:41:38,090 --> 00:41:42,170 Where the game has to take in an input state somehow, 771 00:41:42,170 --> 00:41:45,770 and it has to return to you the new state and the results. 772 00:41:45,770 --> 00:41:49,010 And, so, there's all this extra infrastructure 773 00:41:49,010 --> 00:41:52,040 that you need before you can actually run any ML, 774 00:41:52,040 --> 00:41:55,760 and sometimes this takes longer than running the actual ML. 
775 00:41:55,760 --> 00:41:58,190 I actually spent longer trying to get this infrastructure 776 00:41:58,190 --> 00:42:01,950 to work than actually implementing the reinforcement learning. 777 00:42:01,950 --> 00:42:04,160 So this is like very-- 778 00:42:04,160 --> 00:42:07,160 something to be cautious of, if you're interested in doing something 779 00:42:07,160 --> 00:42:09,440 like solving a game. 780 00:42:09,440 --> 00:42:13,550 But, of course, it is certainly doable if you 781 00:42:13,550 --> 00:42:16,800 are willing to put in the extra time to do it. 782 00:42:16,800 --> 00:42:20,090 And then, with convolutional neural networks, and deep learning, 783 00:42:20,090 --> 00:42:24,500 and some of the more advanced stuff, we caution against it 784 00:42:24,500 --> 00:42:27,530 unless you are very familiar with this kind of construct. 785 00:42:27,530 --> 00:42:32,000 And it requires some-- quite a bit of in-depth knowledge. 786 00:42:32,000 --> 00:42:35,700 So you don't really have to worry about that. 787 00:42:35,700 --> 00:42:39,290 And, so, quickly, I just want to mention how you can actually 788 00:42:39,290 --> 00:42:40,430 implement these things. 789 00:42:40,430 --> 00:42:45,710 So we used Google Colab, which runs a Jupyter notebook, which 790 00:42:45,710 --> 00:42:48,630 is a Python interpreter that goes cell by cell. 791 00:42:48,630 --> 00:42:52,500 So the nice thing about Google Colab is that, well, there's two nice things. 792 00:42:52,500 --> 00:42:54,570 One is that it interprets the code cell by cell. 793 00:42:54,570 --> 00:42:57,320 So you can change a cell without having to rerun 794 00:42:57,320 --> 00:42:59,660 the script from the top down, which is great 795 00:42:59,660 --> 00:43:04,220 if your things take forever to run, as is often the case in machine learning. 796 00:43:04,220 --> 00:43:08,490 And the other nice thing about Google Colab is that it's cloud computed.
797 00:43:08,490 --> 00:43:11,540 So there's a GPU on the server end that does all this, 798 00:43:11,540 --> 00:43:14,670 and then returns it to you over the web. 799 00:43:14,670 --> 00:43:17,120 And, so, you won't break your machine 800 00:43:17,120 --> 00:43:21,380 trying to process like 10,000 images. 801 00:43:21,380 --> 00:43:25,010 But, on the other hand, if you feel comfortable running it on your device, 802 00:43:25,010 --> 00:43:26,900 it's definitely doable. 803 00:43:26,900 --> 00:43:29,720 I ran the reinforcement learning code just fine 804 00:43:29,720 --> 00:43:33,120 in just a terminal on my device. 805 00:43:33,120 --> 00:43:35,570 And that-- those are the kind of things that don't really 806 00:43:35,570 --> 00:43:37,220 take too much processing. 807 00:43:37,220 --> 00:43:42,230 So, yeah, just keep that in mind when you are thinking 808 00:43:42,230 --> 00:43:46,840 about how to actually implement this. 809 00:43:46,840 --> 00:43:49,530 ZAD CHIN: So, next, I will talk about the useful Python libraries 810 00:43:49,530 --> 00:43:50,790 that we-- 811 00:43:50,790 --> 00:43:53,380 that we normally use for machine learning. 812 00:43:53,380 --> 00:43:55,360 The first one we will go into is data. 813 00:43:55,360 --> 00:43:56,760 How do we get data, right? 814 00:43:56,760 --> 00:43:59,610 So, in terms of mining data from online, we 815 00:43:59,610 --> 00:44:03,458 can go for BeautifulSoup, which is kind of like a scraping library 816 00:44:03,458 --> 00:44:04,500 that someone wrote. 817 00:44:04,500 --> 00:44:06,390 It's all linked, so you guys can press that, 818 00:44:06,390 --> 00:44:09,450 and the slides are available on the CS50 website. 819 00:44:09,450 --> 00:44:13,860 And, also, Scrapy, which is a very good scraping library.
820 00:44:13,860 --> 00:44:18,550 And in terms of, like, built-in data sources that are very well documented, 821 00:44:18,550 --> 00:44:21,360 I would recommend Kaggle, it's a Google platform 822 00:44:21,360 --> 00:44:23,100 with a lot of machine learning data. 823 00:44:23,100 --> 00:44:27,480 And UCI, the University of California Irvine machine learning repo-- 824 00:44:27,480 --> 00:44:29,920 also a lot of data sets available. 825 00:44:29,920 --> 00:44:32,490 And you can also scrape from websites or API calls, 826 00:44:32,490 --> 00:44:35,910 with BeautifulSoup or Scrapy, or even your own API call. 827 00:44:35,910 --> 00:44:39,390 And, for data pre-processing, I'm a very huge fan of Pandas. 828 00:44:39,390 --> 00:44:42,090 So I highly, highly recommend using Pandas, 829 00:44:42,090 --> 00:44:45,390 like loading your CSV into Pandas, and just working from there. 830 00:44:45,390 --> 00:44:47,560 It's actually much more efficient and useful. 831 00:44:47,560 --> 00:44:51,540 And there's also NumPy and SciPy, which are also, like, things that you normally 832 00:44:51,540 --> 00:44:54,040 use in terms of calculating mean, median, 833 00:44:54,040 --> 00:44:56,580 and a lot of very useful functions to analyze 834 00:44:56,580 --> 00:44:59,850 the data or pre-process the data. 835 00:44:59,850 --> 00:45:03,920 So the next thing in Python libraries is visualization. 836 00:45:03,920 --> 00:45:08,150 The most simple, basic visualization library that is available online is 837 00:45:08,150 --> 00:45:11,030 Matplotlib, very useful, super well documented, 838 00:45:11,030 --> 00:45:13,370 a lot of examples online that you can use. 839 00:45:13,370 --> 00:45:16,100 Seaborn is a better kind of visualization, 840 00:45:16,100 --> 00:45:19,760 where you can choose your own color map, for a better kind of visualization.
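The Pandas workflow Zad recommends above might look like this minimal sketch; the column names are invented for illustration, and a real project would call pd.read_csv on an actual file.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("your_file.csv") -- a tiny inline CSV.
csv_text = """age,weight,label
34,70,0
51,82,1
29,65,0
45,77,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Quick looks at the data before any modeling.
print(df.head())
print(df.describe())

# NumPy-style summary statistics on a column.
print(np.mean(df["age"]))    # 39.75
print(np.median(df["age"]))  # 39.5
```

From here the DataFrame can be filtered, cleaned, and handed directly to a modeling library.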
841 00:45:19,760 --> 00:45:23,000 The best visualization that I can actually think of, maybe not the best, 842 00:45:23,000 --> 00:45:23,930 but like-- 843 00:45:23,930 --> 00:45:26,030 Plotly is a very highly visual-- 844 00:45:26,030 --> 00:45:31,460 very engaging, highly visual, very nice visualization library 845 00:45:31,460 --> 00:45:33,510 that Python has. 846 00:45:33,510 --> 00:45:37,220 So, in terms of ML models, there are a lot of built-in libraries. 847 00:45:37,220 --> 00:45:39,590 You can also build your own ML model from scratch, 848 00:45:39,590 --> 00:45:44,190 in Python, which actually increases your understanding of the model itself. 849 00:45:44,190 --> 00:45:47,840 But if you want to save time, if you just want to use the model itself, 850 00:45:47,840 --> 00:45:49,490 there are a lot of built-in libraries. 851 00:45:49,490 --> 00:45:52,820 Say, for example, sklearn is a very good library for beginners, 852 00:45:52,820 --> 00:45:56,240 there are various classification, regression, and clustering algorithms 853 00:45:56,240 --> 00:45:57,620 that we can just use. 854 00:45:57,620 --> 00:46:02,110 And there is also TensorFlow, which is, I think, a Google kind of like-- 855 00:46:02,110 --> 00:46:05,650 a Google deep neural network Python library. 856 00:46:05,650 --> 00:46:08,350 It's used for various tasks in training and inference, 857 00:46:08,350 --> 00:46:09,700 mostly on deep neural networks. 858 00:46:09,700 --> 00:46:13,690 And they have pretty good documentation online and on YouTube and really 859 00:46:13,690 --> 00:46:15,190 good tutorials. 860 00:46:15,190 --> 00:46:18,640 We also have PyTorch, which is by Facebook, and also runs on neural networks, computer 861 00:46:18,640 --> 00:46:20,410 vision and NLP. 862 00:46:20,410 --> 00:46:23,140 And we have Keras, which runs on top of TensorFlow. 863 00:46:23,140 --> 00:46:26,140 And it's primarily for developing and evaluating deep learning 864 00:46:26,140 --> 00:46:26,765 models as well.
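The fit/predict pattern that sklearn offers for the algorithms mentioned above can be shown with a tiny made-up data set; the numbers are chosen only to make the two classes obvious.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: the label is 1 when the single feature value is "large".
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# K-nearest neighbors: classify a point by majority vote of its 3 neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[2.5], [10.5]]))  # -> [0 1]
```

Swapping in LogisticRegression, KMeans, or any other sklearn estimator follows the same fit/predict shape, which is what makes the library so beginner-friendly.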
865 00:46:26,765 --> 00:46:28,932 Those are really, really useful libraries-- 866 00:46:28,932 --> 00:46:31,480 and really well documented, there are a lot of tutorials 867 00:46:31,480 --> 00:46:33,970 online that we actually recommend, and they're also 868 00:46:33,970 --> 00:46:36,700 linked so you guys can actually have a look too. 869 00:46:36,700 --> 00:46:40,270 In terms of ML resources and support from CS50 and beyond, 870 00:46:40,270 --> 00:46:43,330 we actually recommend you go on Ed if you have any bugs 871 00:46:43,330 --> 00:46:44,920 that you don't know how to fix. 872 00:46:44,920 --> 00:46:48,410 The team-- the CS50 team is really, really happy to help you on Ed. 873 00:46:48,410 --> 00:46:51,460 And you also have the CS50 Intro to AI classes 874 00:46:51,460 --> 00:46:53,830 online too, so you guys can have a look at those. 875 00:46:53,830 --> 00:46:58,040 There's also a lot of Python library documentation online, and I highly, 876 00:46:58,040 --> 00:47:00,400 highly recommend you look over the examples 877 00:47:00,400 --> 00:47:02,270 before you actually start coding. 878 00:47:02,270 --> 00:47:05,470 Kaggle is a place where all the data scientists go. 879 00:47:05,470 --> 00:47:10,810 There are data sets, there are also example codes that you can try out, 880 00:47:10,810 --> 00:47:12,820 and also competitions that you can join. 881 00:47:12,820 --> 00:47:15,160 Very good place to learn about data science, and more 882 00:47:15,160 --> 00:47:16,580 about machine learning. 883 00:47:16,580 --> 00:47:19,570 And there are a few blogs that we found very useful as tools. 884 00:47:19,570 --> 00:47:22,675 The data science blog is on Medium, and the Machine Learning Mastery blog 885 00:47:22,675 --> 00:47:24,550 is a free blog that's really, really helpful. 886 00:47:24,550 --> 00:47:27,920 Of course, we have our favorites, Stack Overflow and GitHub, really, 887 00:47:27,920 --> 00:47:31,400 really good ML resources and support there as well.
888 00:47:31,400 --> 00:47:34,390 So let's talk about ML ethics, which we actually 889 00:47:34,390 --> 00:47:37,390 see a lot of questions about. 890 00:47:37,390 --> 00:47:39,700 Machine learning can do so much good stuff, right? 891 00:47:39,700 --> 00:47:42,263 Like it is used in health care, education, anywhere 892 00:47:42,263 --> 00:47:43,180 that you can think of. 893 00:47:43,180 --> 00:47:47,450 But, at the same time, it has its own, like, dangers itself. 894 00:47:47,450 --> 00:47:50,650 So, for example, one of the things that we see the most is deepfakes. 895 00:47:50,650 --> 00:47:54,670 And I think Malan actually did a deepfake example video in last year's 896 00:47:54,670 --> 00:47:56,290 CS50, which is actually very exciting. 897 00:47:56,290 --> 00:47:58,480 I highly, highly recommend you guys watch it, 898 00:47:58,480 --> 00:48:01,330 I actually linked it here so you guys can watch it later. 899 00:48:01,330 --> 00:48:05,480 And another thing about machine learning is that it's actually a black box. 900 00:48:05,480 --> 00:48:08,080 So, like, interpretability in machine learning 901 00:48:08,080 --> 00:48:11,530 is a huge topic that a lot of machine learning practitioners 902 00:48:11,530 --> 00:48:12,650 are actually talking about. 903 00:48:12,650 --> 00:48:16,480 So one of them is Professor Finale Doshi-Velez, a professor at Harvard. 904 00:48:16,480 --> 00:48:20,500 She actually does a lot of stuff about interpretability in AI 905 00:48:20,500 --> 00:48:22,010 and in healthcare as well. 906 00:48:22,010 --> 00:48:25,350 So I highly recommend you watch the TED Talk with her, if you want to. 907 00:48:25,350 --> 00:48:27,100 And the other thing about machine learning 908 00:48:27,100 --> 00:48:29,660 is AI biases and fairness, we heard a lot about it.
909 00:48:29,660 --> 00:48:32,560 So there is this specific course by MIT that talks 910 00:48:32,560 --> 00:48:35,590 about AI biases and fairness, really well documented video, 911 00:48:35,590 --> 00:48:37,030 highly recommend. 912 00:48:37,030 --> 00:48:41,832 And, like you guys saw, we have a lot of fairness, transparency, 913 00:48:41,832 --> 00:48:44,040 and privacy issues that are related to machine learning. 914 00:48:44,040 --> 00:48:46,990 And there's this book that is really good that I noted here as well. 915 00:48:46,990 --> 00:48:49,590 It's called The Alignment Problem, by Brian Christian at UC Berkeley. 916 00:48:49,590 --> 00:48:52,620 It's about how we can align machine learning with our human values. 917 00:48:52,620 --> 00:48:55,770 All of this stuff is linked, so you have more-- 918 00:48:55,770 --> 00:48:59,070 if you'd like to know more about machine learning ethics 919 00:48:59,070 --> 00:49:01,440 and how it actually can be dangerous. 920 00:49:01,440 --> 00:49:03,750 But I'm not saying I don't support it-- 921 00:49:03,750 --> 00:49:06,210 machine learning is very helpful, but we need to be mindful 922 00:49:06,210 --> 00:49:09,580 that it can be perilous as well. 923 00:49:09,580 --> 00:49:13,770 So, that's all from us, and we will take some questions. 924 00:49:13,770 --> 00:49:15,330 And, yeah. 925 00:49:15,330 --> 00:49:16,830 SPEAKER: OK, wonderful. 926 00:49:16,830 --> 00:49:20,970 So, let's go to Doris, "Is there a need for a bigger capacity of laptop 927 00:49:20,970 --> 00:49:21,960 for machine learning? 928 00:49:21,960 --> 00:49:24,750 I'm using an old Mac." 929 00:49:24,750 --> 00:49:27,400 KEVIN XU: This is actually a great question. 930 00:49:27,400 --> 00:49:32,280 This is truly dependent on the data sets that you're working with.
931 00:49:32,280 --> 00:49:34,860 A lot of the times, at least in the universities, 932 00:49:34,860 --> 00:49:38,160 right, when they're dealing with huge amounts of data, 933 00:49:38,160 --> 00:49:40,150 they have to process this on the cluster. 934 00:49:40,150 --> 00:49:43,500 Which is a cluster of server computers that will-- 935 00:49:43,500 --> 00:49:46,680 that are like way higher tech than anything 936 00:49:46,680 --> 00:49:48,300 you could purchase individually. 937 00:49:48,300 --> 00:49:51,210 But, for the sake of implementing a small ML 938 00:49:51,210 --> 00:49:53,550 project, for the sake of learning about ML, 939 00:49:53,550 --> 00:49:58,560 or for the sake of doing a fun small project, such as like attempting 940 00:49:58,560 --> 00:50:01,750 to write an image pro-- like, recognition 941 00:50:01,750 --> 00:50:04,200 set, with just a bunch of images, this is something 942 00:50:04,200 --> 00:50:08,730 that is very doable on most machines. 943 00:50:08,730 --> 00:50:12,456 Of course, it really depends on like-- 944 00:50:12,456 --> 00:50:15,960 yeah, it will be software-- or hardware dependent here, 945 00:50:15,960 --> 00:50:23,360 but I would say that, short of having very, very old hardware, 946 00:50:23,360 --> 00:50:26,510 you should be able to at least get the script working. 947 00:50:26,510 --> 00:50:30,950 But, of course, if it doesn't, cloud computing is always an option. 948 00:50:30,950 --> 00:50:34,370 Google Colab is free, which is great, so it's-- 949 00:50:34,370 --> 00:50:38,880 and it's honestly not any different from just accessing the [? IDE. ?] 950 00:50:38,880 --> 00:50:42,200 And, so, we really recommend that if you are 951 00:50:42,200 --> 00:50:45,740 interested in things that will run really slowly, that you 952 00:50:45,740 --> 00:50:48,600 look into the cloud on Google Colab. 953 00:50:48,600 --> 00:50:52,410 SPEAKER: OK, [? Aviral, ?]
"Is it possible 954 00:50:52,410 --> 00:50:56,700 to use two types of different algorithms at the same time 955 00:50:56,700 --> 00:51:01,330 to increase the accuracy of the model?" 956 00:51:01,330 --> 00:51:04,320 ZAD CHIN: It's not about using two models at the same time. 957 00:51:04,320 --> 00:51:05,775 I think it is-- 958 00:51:05,775 --> 00:51:08,310 normally, like, we actually use different models, 959 00:51:08,310 --> 00:51:11,250 and different models have their own strengths and weaknesses, I would say. 960 00:51:11,250 --> 00:51:14,980 So, for example, like KNN, it might overfit. 961 00:51:14,980 --> 00:51:18,570 But, then, I think, for example, like mo-- the example that you 962 00:51:18,570 --> 00:51:21,120 have, KNN versus logistic regression, it's 963 00:51:21,120 --> 00:51:23,290 not like we can integrate the two together. 964 00:51:23,290 --> 00:51:27,450 I think both KNN and logistic regression have their own advantages 965 00:51:27,450 --> 00:51:28,600 and disadvantages. 966 00:51:28,600 --> 00:51:33,980 The idea is to test it against different kinds of ML models 967 00:51:33,980 --> 00:51:35,620 that are available out there. 968 00:51:35,620 --> 00:51:40,510 And to increase-- to see which ML model performs better on the current data 969 00:51:40,510 --> 00:51:41,010 set. 970 00:51:41,010 --> 00:51:43,590 So for example you can see that KNN performed really 971 00:51:43,590 --> 00:51:45,450 well on the data set that you had just now, 972 00:51:45,450 --> 00:51:48,927 but that doesn't mean that KNN will always perform better on other data 973 00:51:48,927 --> 00:51:50,010 sets that you are testing. 974 00:51:50,010 --> 00:51:52,552 So it's actually highly recommended to test on different data sets. 975 00:51:52,552 --> 00:51:56,190 But on the idea of integrating two machine learning models together 976 00:51:56,190 --> 00:51:59,400 to get higher accuracy, it might be possible, 977 00:51:59,400 --> 00:52:01,750 but I'm not really sure about it.
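Testing several models against the same data, as Zad suggests above, might look like this sketch. The data is synthetic, and cross-validation (not covered in the seminar) is used here as one common, assumed way to compare models fairly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: the class is decided by the sum of the features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X.sum(axis=1) > 0).astype(int)

models = {
    "logistic regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation: average accuracy over five train/test splits.
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

Whichever model scores best on this data is not guaranteed to win on a different data set, which is exactly the point Zad makes.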
978 00:52:01,750 --> 00:52:05,850 KEVIN XU: I would say the close analogy is that if you can recall-- well 979 00:52:05,850 --> 00:52:07,470 actually, let's just go back to it. 980 00:52:07,470 --> 00:52:11,160 If you can recall the nodes that we have, right? 981 00:52:11,160 --> 00:52:14,790 You can think of each set of nodes as like, a different transformation 982 00:52:14,790 --> 00:52:15,970 that you do to the data. 983 00:52:15,970 --> 00:52:20,980 So what is often the case is that your data will go through many, many layers, 984 00:52:20,980 --> 00:52:23,370 especially for things that are very complicated. 985 00:52:23,370 --> 00:52:26,370 And, so, this is not necessarily applying two different machine learning 986 00:52:26,370 --> 00:52:28,740 models, but you are-- 987 00:52:28,740 --> 00:52:31,210 this kind of transformation is very iterative. 988 00:52:31,210 --> 00:52:34,590 And, so, this is like-- 989 00:52:34,590 --> 00:52:37,950 we call them layers, so like you pass the data through one layer, 990 00:52:37,950 --> 00:52:40,980 and then you pass it through another layer, iteratively, until you get 991 00:52:40,980 --> 00:52:42,970 to this end output layer. 992 00:52:42,970 --> 00:52:46,440 So this actually goes into the specific of how the function works, 993 00:52:46,440 --> 00:52:50,550 but you will often have to make multiple layers 994 00:52:50,550 --> 00:52:56,250 to get good results for your data set, because the data is always complicated. 995 00:52:56,250 --> 00:52:57,300 SPEAKER: This is, an I-- 996 00:52:57,300 --> 00:53:00,450 please excuse my pronunciation here, [? Effie ?] 997 00:53:00,450 --> 00:53:03,330 writes, "Does ML help in cybersecurity?" 998 00:53:03,330 --> 00:53:07,330 999 00:53:07,330 --> 00:53:09,997 ZAD CHIN: Wait, the question is, "Does ML affect cyber security? 1000 00:53:09,997 --> 00:53:10,913 KEVIN XU: Or does it-- 1001 00:53:10,913 --> 00:53:12,810 SPEAKER: "Does ML help in cybersecurity?" 
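The layer-by-layer picture Kevin describes above can be sketched with plain NumPy; the weights and layer sizes below are arbitrary, and a real network would learn them from data.

```python
import numpy as np

def relu(x):
    # A common nonlinearity applied between layers.
    return np.maximum(0, x)

rng = np.random.default_rng(42)
x = rng.normal(size=(1, 4))   # one input example with 4 features

W1 = rng.normal(size=(4, 3))  # layer 1: 4 features -> 3 hidden units
W2 = rng.normal(size=(3, 2))  # layer 2: 3 hidden units -> 2 outputs

h = relu(x @ W1)              # first transformation of the data
out = h @ W2                  # second transformation: the output layer
print(out.shape)              # (1, 2)
```

Each layer is just a transformation applied to the previous layer's output, and deep networks stack many of these in sequence.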
1002 00:53:12,810 --> 00:53:14,780 ZAD CHIN: Oh, I think it does. 1003 00:53:14,780 --> 00:53:17,560 So, for example, I think I heard a lot of information 1004 00:53:17,560 --> 00:53:22,070 about how banks actually use ML to detect fraudulent transactions. 1005 00:53:22,070 --> 00:53:24,250 So in that kind of way, there's a lot 1006 00:53:24,250 --> 00:53:29,350 of ways whereby ML actually helps in trying to prevent cybersecurity attacks 1007 00:53:29,350 --> 00:53:33,040 or, like, detect cybersecurity attacks, especially in a large company. 1008 00:53:33,040 --> 00:53:35,290 Because it's actually really good at trying 1009 00:53:35,290 --> 00:53:39,370 to find patterns, or unusual patterns, in a huge data set. 1010 00:53:39,370 --> 00:53:43,690 So I definitely think that it has a very huge application in cybersecurity 1011 00:53:43,690 --> 00:53:44,688 itself. 1012 00:53:44,688 --> 00:53:46,480 KEVIN XU: Yeah, and if there's one takeaway 1013 00:53:46,480 --> 00:53:51,010 that we want you to have from this, it's that what 1014 00:53:51,010 --> 00:53:53,650 ML is really good at doing is taking a lot of data 1015 00:53:53,650 --> 00:53:56,360 and finding connections between those data points. 1016 00:53:56,360 --> 00:54:00,820 So anything you can think of that needs to process a huge amount of data, 1017 00:54:00,820 --> 00:54:04,120 you can almost always apply some form of ML. 1018 00:54:04,120 --> 00:54:08,210 If the data is meaningful, of course. 1019 00:54:08,210 --> 00:54:11,710 So in terms of cybersecurity, like actually 1020 00:54:11,710 --> 00:54:14,870 building firewalls and those kinds of things, 1021 00:54:14,870 --> 00:54:16,960 that doesn't necessarily have as much 1022 00:54:16,960 --> 00:54:19,120 to do with processing huge amounts of data.
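The "finding unusual patterns in a huge data set" idea that Zad describes for fraud detection can be sketched with scikit-learn's IsolationForest, one possible tool for it; the transaction amounts below are made up.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Normal transactions cluster around $50; two huge ones look suspicious.
normal = rng.normal(loc=50, scale=10, size=(198, 1))
fraud = np.array([[5000.0], [7500.0]])
X = np.vstack([normal, fraud])

# Expect roughly 1% of transactions to be anomalous.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 marks suspected outliers, 1 marks normal

print(labels[-2:])  # labels for the two huge transactions
```

A real fraud system would use many more features than the amount alone, but the pattern is the same: learn what "normal" looks like, then flag what doesn't fit.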
1023 00:54:19,120 --> 00:54:23,108 But, as Zad said, right, with like, fraudulent account 1024 00:54:23,108 --> 00:54:25,150 accesses, and things like that, that is something 1025 00:54:25,150 --> 00:54:28,500 that is very applicable to ML. 1026 00:54:28,500 --> 00:54:29,610 SPEAKER: OK. 1027 00:54:29,610 --> 00:54:34,200 [? Madhi ?] asks, "What is the AI application in video games nowadays? 1028 00:54:34,200 --> 00:54:40,040 And as a machine learning developer, can I work in the game field?" 1029 00:54:40,040 --> 00:54:42,400 KEVIN XU: Oh yeah, so, I can take this question. 1030 00:54:42,400 --> 00:54:47,880 Yeah, so, I don't know if you've seen, but the team, OpenAI, 1031 00:54:47,880 --> 00:54:52,710 which I think is an independent research lab, or maybe-- 1032 00:54:52,710 --> 00:54:53,960 ZAD CHIN: I think it was co-founded by Elon Musk. 1033 00:54:53,960 --> 00:54:55,418 KEVIN XU: Right, yeah. 1034 00:54:55,418 --> 00:54:59,040 So OpenAI is a team that has been developing machine learning 1035 00:54:59,040 --> 00:55:01,290 game software for quite a while. 1036 00:55:01,290 --> 00:55:06,420 And a couple of years ago, they released, like, big news-- Dota 2, which 1037 00:55:06,420 --> 00:55:12,120 is one of the popular MOBA games-- they managed 1038 00:55:12,120 --> 00:55:14,720 to write an AI that beat a team of professionals. 1039 00:55:14,720 --> 00:55:18,300 So this is totally something that is doable 1040 00:55:18,300 --> 00:55:25,853 currently, and I recommend you check out OpenAI-- 1041 00:55:25,853 --> 00:55:27,270 the stuff that they've been doing. 1042 00:55:27,270 --> 00:55:32,258 It's like, really cool and it's definitely very marketable 1043 00:55:32,258 --> 00:55:34,050 if you're interested in that kind of stuff.
1044 00:55:34,050 --> 00:55:37,183 ZAD CHIN: I think in terms of games, like, DeepMind also created AlphaGo, 1045 00:55:37,183 --> 00:55:38,100 which is like one of-- 1046 00:55:38,100 --> 00:55:41,310 I don't know whether that's considered a video game, but it's a game. 1047 00:55:41,310 --> 00:55:45,150 It was just super huge at the time, because it beat the world's best Go 1048 00:55:45,150 --> 00:55:46,500 player on planet Earth. 1049 00:55:46,500 --> 00:55:48,180 It was like AlphaGo versus human. 1050 00:55:48,180 --> 00:55:51,960 There's also a documentary about it, called AlphaGo. 1051 00:55:51,960 --> 00:55:53,440 Yeah, highly recommended. 1052 00:55:53,440 --> 00:55:53,970 Super-- 1053 00:55:53,970 --> 00:55:59,750 KEVIN XU: Yes, so, furthermore, like, in terms of not just video games, 1054 00:55:59,750 --> 00:56:03,140 people are still developing things to-- 1055 00:56:03,140 --> 00:56:09,240 just test what we can use this for, in terms of not just these-- 1056 00:56:09,240 --> 00:56:11,750 so, games have different classifications, 1057 00:56:11,750 --> 00:56:15,750 and Carnegie Mellon actually beat top professionals at poker a few years ago. 1058 00:56:15,750 --> 00:56:20,370 Which is crazy, because in poker, you don't always have all the information. 1059 00:56:20,370 --> 00:56:22,460 So there's so much more extrapolation that you 1060 00:56:22,460 --> 00:56:25,020 need to do from a given subset of information. 1061 00:56:25,020 --> 00:56:30,050 And, so, it's actually very interesting how far machine learning 1062 00:56:30,050 --> 00:56:34,640 has come, from being able to take such a comparatively small set of data, 1063 00:56:34,640 --> 00:56:40,540 and be able to generalize it to big things. 1064 00:56:40,540 --> 00:56:43,270 SPEAKER: OK, and [? Yashvi ?] asks, "Just as you mentioned, 1065 00:56:43,270 --> 00:56:46,030 we trained the computer for the game 1,000 times.
1066 00:56:46,030 --> 00:56:50,027 Was that the training part or the testing part from train-test?" 1067 00:56:50,027 --> 00:56:51,110 KEVIN XU: Ah, yeah, great. 1068 00:56:51,110 --> 00:56:56,740 So, when we did reinforcement learning, the training over 1,000 times 1069 00:56:56,740 --> 00:57:00,520 was simply to train the computer. 1070 00:57:00,520 --> 00:57:04,090 When you're reinforcement learning, your testing part, 1071 00:57:04,090 --> 00:57:07,090 you don't really have a test set, as opposed 1072 00:57:07,090 --> 00:57:11,020 to when you have a big data set of images where you can divide 80% of them 1073 00:57:11,020 --> 00:57:15,070 to train it, and then verify that your algorithm, or your regression 1074 00:57:15,070 --> 00:57:16,990 is correct with the remainder 20%. 1075 00:57:16,990 --> 00:57:21,400 In the case of the game, you just have-- like, once you've trained your AI, 1076 00:57:21,400 --> 00:57:22,960 you just have your AI play the game. 1077 00:57:22,960 --> 00:57:25,660 In which case, this is your testing phase. 1078 00:57:25,660 --> 00:57:27,070 It's like, did the AI win? 1079 00:57:27,070 --> 00:57:31,280 Like in our case, it was like, how many points did the AI get on average? 1080 00:57:31,280 --> 00:57:35,080 And, so, that is the testing phase, which doesn't actually 1081 00:57:35,080 --> 00:57:38,750 use a data set but rather just uses the game itself. 1082 00:57:38,750 --> 00:57:39,950 SPEAKER: OK. 1083 00:57:39,950 --> 00:57:43,940 "What is the--" This is from [? Madhi ?] again, "What is the application--" 1084 00:57:43,940 --> 00:57:47,250 Oh, no I already asked that one, sorry about that. 1085 00:57:47,250 --> 00:57:53,210 OK from HW, "Besides Google TensorFlow, what other ML platforms 1086 00:57:53,210 --> 00:57:54,080 should we look into? 1087 00:57:54,080 --> 00:57:57,320 Anything from IBM or Microsoft?" 1088 00:57:57,320 --> 00:58:00,080 I think you answered that one, right? 
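The 80/20 train/test split Kevin describes above for supervised learning looks like this with scikit-learn; the data is a stand-in for a real labeled set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for a big labeled data set: 100 samples, 5 features each.
X = np.arange(500).reshape(100, 5)
y = np.arange(100) % 2

# Hold out 20% of the data to verify the trained model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))  # 80 20
```

In reinforcement learning there is no such held-out set; as Kevin says, "testing" is just letting the trained agent play and scoring how it does.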
1089 00:58:00,080 --> 00:58:00,880 KEVIN XU: It's-- 1090 00:58:00,880 --> 00:58:02,105 OK, so with the-- 1091 00:58:02,105 --> 00:58:07,100 the thing about libraries, like, choosing the best library 1092 00:58:07,100 --> 00:58:08,550 can be very hard. 1093 00:58:08,550 --> 00:58:11,810 And this is, like, something that requires knowledge, and, like, 1094 00:58:11,810 --> 00:58:14,970 requires you to just jump into ML and see what works and what doesn't. 1095 00:58:14,970 --> 00:58:20,360 So, what we recommend is that you don't concentrate so much, 1096 00:58:20,360 --> 00:58:23,930 right now, on optimizing everything. 1097 00:58:23,930 --> 00:58:27,600 You want to get something that works with the knowledge and the skills 1098 00:58:27,600 --> 00:58:31,400 that you have first, before you can improve and try to optimize 1099 00:58:31,400 --> 00:58:34,890 to get algorithms that are very good for your data set, and things like that. 1100 00:58:34,890 --> 00:58:36,443 So-- 1101 00:58:36,443 --> 00:58:38,360 ZAD CHIN: I think the other thing about trying 1102 00:58:38,360 --> 00:58:42,020 to find an ML library is to find whether it is well documented, 1103 00:58:42,020 --> 00:58:45,080 or whether actually a lot of people have tried it as well. 1104 00:58:45,080 --> 00:58:47,780 I personally don't know of any library-- 1105 00:58:47,780 --> 00:58:51,410 maybe I'm just not aware, but I don't really know of any library 1106 00:58:51,410 --> 00:58:52,970 from, like, IBM and Microsoft. 1107 00:58:52,970 --> 00:58:56,960 But I think IBM and Microsoft have really strong research teams, 1108 00:58:56,960 --> 00:58:58,697 which I really admire as well. 1109 00:58:58,697 --> 00:59:01,530 But in terms of TensorFlow, it's really, really well documented. 1110 00:59:01,530 --> 00:59:05,005 There are a lot of examples out there, a lot of Stack Overflow posts about it, 1111 00:59:05,005 --> 00:59:06,380 which is actually very important.
1112 00:59:06,380 --> 00:59:08,517 Because when you actually face a bottleneck, right, 1113 00:59:08,517 --> 00:59:11,600 you need someone to help you, someone who understands your data set. 1114 00:59:11,600 --> 00:59:14,540 And everybody is using it, and it's really well documented. 1115 00:59:14,540 --> 00:59:17,870 So I would recommend TensorFlow if you are beginning, 1116 00:59:17,870 --> 00:59:19,970 but, like, if you want to be more sophisticated, 1117 00:59:19,970 --> 00:59:22,380 you can start to build your own neural network. 1118 00:59:22,380 --> 00:59:25,760 And I think that would be the most sophisticated thing you could do. 1119 00:59:25,760 --> 00:59:28,490 KEVIN XU: And of course, we don't expect you guys, in CS50, 1120 00:59:28,490 --> 00:59:29,910 to build your own right now. 1121 00:59:29,910 --> 00:59:33,410 But, in the future, just as a consideration, 1122 00:59:33,410 --> 00:59:37,280 there's a lot of statistical theory that goes into what machine learning is. 1123 00:59:37,280 --> 00:59:40,730 And, so, there are Harvard classes, or MIT classes, 1124 00:59:40,730 --> 00:59:45,900 that you can take that are just totally on ML and applying it. 1125 00:59:45,900 --> 00:59:48,480 And if you are interested in talking about that, 1126 00:59:48,480 --> 00:59:52,135 feel free to reach out to one of us, and we'd be happy to chat with you. 1127 00:59:52,135 --> 00:59:52,760 ZAD CHIN: Yeah. 1128 00:59:52,760 --> 00:59:56,960 And, so, before we actually end, please take one minute to actually fill out 1129 00:59:56,960 --> 01:00:00,890 this feedback form, which is tinyurl.com/ML50-feedback. 1130 01:00:00,890 --> 01:00:03,865 And we really thank you so much for actually coming, 1131 01:00:03,865 --> 01:00:06,740 it's really late, or really early, I don't know what your time zone is.
1132 01:00:06,740 --> 01:00:08,490 But thank you so much for actually coming, 1133 01:00:08,490 --> 01:00:12,448 we actually enjoyed this great seminar, and we are very excited to share this. 1134 01:00:12,448 --> 01:00:14,240 And we hope that you learned something too. 1135 01:00:14,240 --> 01:00:14,740 So-- 1136 01:00:14,740 --> 01:00:17,150 1137 01:00:17,150 --> 01:00:20,540 KEVIN XU: So, yeah, I think we'll be ending the recording now, 1138 01:00:20,540 --> 01:00:26,200 but might stay a little bit after to answer any further questions. 1139 01:00:26,200 --> 01:00:27,468