[MUSIC PLAYING]

BRIAN YU: All right, welcome back, everyone, to an Introduction to Artificial Intelligence with Python. Last time we took a look at how the AI inside of our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages, and the idea was that we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to derive new pieces of information via inference — to be able to take some information and deduce some additional conclusions based on the information that it already knew for sure.

But in reality, when we think about computers and we think about AI, very rarely are our machines going to be able to know things for sure. Oftentimes there's going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where they might believe something with some probability — as we'll soon discuss what probability is all about and what it means — but not entirely for certain. And we want to use the information the machine has some knowledge about, even if it doesn't have perfect knowledge, to still be able to make inferences and draw conclusions.

So you might imagine, for example, in the context of a robot that has some sensors and is exploring some environment, it might not know exactly where it is or exactly what's around it, but it does have access to some data that can allow it to draw inferences with some probability: there's some likelihood that one thing is true or another. Or you can imagine contexts where there is a little bit more randomness and uncertainty, something like predicting the weather, where you might not be able to know tomorrow's weather with 100% certainty, but you can probably infer with some probability what tomorrow's weather is going to be based on today's weather, yesterday's weather, and other data that you might have access to as well.

And so oftentimes we can distill this in terms of possible events that might happen and what the likelihood of those events is. This comes up a lot in games, for example, where there's an element of chance inside of those games.
So you imagine rolling the dice. You're not sure exactly what the die roll is going to be, but you know it's going to be one of the possibilities from one to six, for example.

And so here, now, we introduce the idea of probability theory. What we'll take a look at today is beginning by looking at the mathematical foundations of probability theory, getting an understanding of some of the key concepts within probability, and then diving into how we can use probability and the ideas that we look at mathematically to represent some ideas in terms of models that we can put into our computers, in order to program an AI that is able to use information about probability to draw inferences — to make some judgments about the world with some probability or likelihood of being true.

So probability ultimately boils down to the idea that there are possible worlds, which we represent here using the little Greek letter omega. The idea of a possible world is that, when I roll a die, there are six possible worlds that could result from it: I can roll a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and each of those is a possible world. Each of those possible worlds has some probability of being true — the probability that I do roll a 1 or a 2 or a 3 or something else. And we represent that probability using the capital letter P and then, in parentheses, what it is that we want the probability of. So this right here would be the probability of some possible world, as represented by the little letter omega.

Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between zero and one, inclusive. The smallest value any probability can have is the number zero, which represents an impossible event — something like rolling a die and getting a seven. If the die only has numbers one through six, the event that I roll a seven is impossible, so it would have probability zero. And on the other end of the spectrum, probability can range all the way up to the number one, meaning an event is certain to happen — that I roll a die and the number is less than 10, for example.
That is an event that is guaranteed to happen if the only sides on my die are one through six, for instance. And probabilities can range through any real number in between these two values, where, generally speaking, a higher value for the probability means an event is more likely to take place, and a lower value for the probability means the event is less likely to take place.

The other key rule for probability looks a little bit like this. This sigma notation, if you haven't seen it before, refers to summation — the idea that we're going to be adding up a whole sequence of values. And this sigma notation is going to come up a couple of times today, because as we deal with probability, oftentimes we're adding up a whole bunch of individual values or individual probabilities to get some other value. What this notation means is that if I sum over all of the possible worlds omega that are in big Omega, which represents the set of all the possible worlds — meaning I take every world in the set of possible worlds and add up all of their probabilities — what I ultimately get is the number one. So if I take all the possible worlds and add up each of their probabilities, I should get the number one at the end, meaning all probabilities just need to sum to one.

So, for example, if you imagine I have a fair die with numbers one through six and I roll the die, each one of these rolls has an equal probability of taking place, and the probability is one over six. Each of these probabilities is between zero and one — zero meaning impossible and one meaning certain — and if you add up all of these probabilities for all of the possible worlds, you get the number one.

And we can represent any one of those probabilities like this: the probability that we roll the number two, for example, is just one over six. Every six times we roll the die, we'd expect that one time, for instance, the die might come up as a two. Its probability is not certain, but it's a little more than nothing, for instance.

And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little bit more complex.
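To make those two axioms concrete in the course's language, Python, here is a minimal sketch; the name `worlds` and the use of `Fraction` are just illustrative choices, not anything from the lecture itself:

```python
from fractions import Fraction

# The six possible worlds for one roll of a fair die, each with probability 1/6.
worlds = {omega: Fraction(1, 6) for omega in range(1, 7)}

# Axiom 1: every probability value lies between 0 and 1, inclusive.
assert all(0 <= p <= 1 for p in worlds.values())

# Axiom 2: summing P(omega) over every world omega in the set yields exactly 1.
assert sum(worlds.values()) == 1

print(worlds[2])  # P(roll = 2) -> 1/6
```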
Let's imagine now that we're not just dealing with a single die, but that we have two dice, for example — a red die here and a blue die there — and I care not just about what the individual roll is, but about the sum of the two rolls. In this case, the sum of the two rolls is the number three. How do I begin to reason about what the probabilities look like if, instead of having one die, I now have two dice?

Well, what we might imagine is that we could first consider: what are all of the possible worlds? In this case, all of the possible worlds are just every combination of the red and blue die that I could come up with. The red die could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and for each of those possibilities, the blue die, likewise, could also be a 1 or 2 or 3 or 4 or 5 or 6. And it just so happens that, in this particular case, each of these possible combinations is equally likely — all of these various possible worlds are equally likely.

That's not always going to be the case. As you imagine more complex models that we could try to build and things that we could try to represent in the real world, it's probably not going to be the case that every single possible world is always equally likely. But in the case of fair dice, where on any given die roll any one number has just as good a chance of coming up as any other number, we can consider all of these possible worlds to be equally likely.

But even though all of the possible worlds are equally likely, that doesn't necessarily mean that their sums are equally likely. So if we consider what the sum of each of these pairs is — 1 plus 1, that's a 2; 2 plus 1 is a 3 — and consider for each of these possible pairs of numbers what their sum ultimately is, we can notice that there are some patterns here, and it's not the case that every number comes up equally likely. If you consider seven, for example — what's the probability that when I roll two dice their sum is seven? — there are several ways this can happen. There are six possible worlds where the sum is seven: it could be a one and a six, or a two and a five, or a three and a four, a four and a three, and so forth.
But if you instead consider the probability that I roll two dice and the sum of those two die rolls is 12, for example — well, looking at this diagram, there's only one possible world in which that can happen, and that's the possible world where both the red die and the blue die come up as sixes to give us the sum total of 12.

So based on just taking a look at this diagram, we see that some of these probabilities are different. The probability that the sum is a seven must be greater than the probability that the sum is a 12. And we can represent that even more formally by saying, OK, the probability that we sum to 12 is one out of 36. Out of the 36 equally likely possible worlds — six squared, because we have six options for the red die and six options for the blue die — only one of them sums to 12. Whereas, on the other hand, for the probability that two dice rolls sum up to the number seven, out of those 36 possible worlds there were six worlds where the sum was seven, and so we get six over 36, which we can simplify as a fraction to just one over six.

So here, now, we're able to represent these different ideas of probability, representing some events that might be more likely and other events that are less likely as well. And these sorts of judgments, where we're figuring out, just in the abstract, what is the probability that this thing takes place, are generally known as unconditional probabilities: some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence, without knowing any additional information. If I roll a die, what's the chance it comes up as a two? Or if I roll two dice, what's the chance that the sum of those two die rolls is a seven?
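Before moving on, a quick sanity check of those two numbers: a short Python sketch that enumerates the 36 equally likely worlds and counts the favorable ones (the helper name `p_sum` is just illustrative):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely possible worlds for a red die and a blue die.
worlds = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Unconditional probability that the two dice sum to `target`."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6  -- six of the 36 worlds sum to seven
print(p_sum(12))  # 1/36 -- only the world where both dice are sixes
```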
But usually, when we're thinking about probability — especially when we're thinking about training an AI to intelligently know something about the world and make predictions based on that information — it's not unconditional probability that our AI is dealing with but, rather, conditional probability: probability where, rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works.

So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us. What does this look like? Well, it looks like this in terms of notation. We're going to represent conditional probability as the probability of a, then a vertical bar, and then b. The way to read this is that the thing on the left-hand side of the vertical bar is what we want the probability of — here, I want the probability that a is true, that it is the event that actually does take place. And on the right side of the vertical bar is our evidence, the information that we already know for certain about the world — for example, that b is true. So the way to read this entire expression is: what is the probability of a given b, the probability that a is true given that we already know that b is true?

And this type of judgment — conditional probability, the probability of one thing given some other fact — comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract: what is the chance that today it rains? But usually we have some additional evidence. I know for certain that it rained yesterday, and so I would like to calculate the probability that it rains today given that I know that it rained yesterday.

Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic conditions: whether or not traffic conditions change might change the probability that this route is actually the optimal route. Or you might imagine, in a medical context, that I want to know the probability that a patient has a particular disease given the results of some tests that have been performed on that patient. I have some evidence, the results of that test, and I would like to know the probability that the patient has a particular disease.
So this notion of conditional probability comes up everywhere as we begin to think about what we would like to reason about — being able to reason a little more intelligently by taking into account evidence that we already have. We're more able to get an accurate result for the likelihood that someone has this disease if we know this evidence, the results of the test, as opposed to just calculating the unconditional probability — asking what the probability is that they have the disease without any evidence to back up our result one way or the other.

So now that we've got this idea of what conditional probability is, the next question we have to ask is: all right, how do we calculate conditional probability? How do we figure out, mathematically, if I have an expression like this, how to get a number from it? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this: the probability of a given b — the probability that a is true given that we know that b is true — is equal to a fraction: the probability that a and b are both true, divided by just the probability that b is true.

And the way to intuitively think about this is that if I want to know the probability that a is true given that b is true, I want to consider all the ways they could both be true — but the only worlds that I care about are the worlds where b is already true. I can ignore all the cases where b isn't true, because those aren't relevant to my ultimate computation; they're not relevant to what it is that I want to get information about.

So let's take a look at an example. Let's go back to that example of rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that I roll two dice and they sum to 12 is one out of 36, because out of the 36 possible worlds that I might care about, in only one of them is the sum of those two dice 12 — it's only when red is six and blue is also six. But let's say now that I have some additional information. I now want to know: what is the probability that the two dice sum to 12 given that I know that the red die was a six?
So I already have some evidence. I already know the red die is a six. I don't know what the blue die is; that information isn't given to me in this expression. But given the fact that I know that the red die rolled a six, what is the probability that we sum to 12?

And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities — all of the possible combinations of the red die being one through six and the blue die being one through six. And I might consider, first: what is the probability of my evidence, my b variable, where I want to know the probability that the red die is a six? Well, the probability that the red die is a six is just one out of six. So those six worlds where the red die is a six are really the only worlds that I care about here now. All the rest of them are irrelevant to my calculation, because I already have this evidence that the red die was a six, so I don't need to care about all of the other possibilities that could result.

So now, in addition to the probability that the red die rolled a six, the other piece of information I need to know in order to calculate this conditional probability is the probability that both of my variables, a and b, are true: the probability that the red die is a six and the dice sum to 12. So what is the probability that both of these things happen? Well, it only happens in one out of these 36 cases, and it's the case where both the red and the blue die are equal to six. And so this probability is equal to one over 36.

And so to get the conditional probability that the sum is 12 given that I know that the red die is equal to six, I just divide these two values: 1/36 divided by 1/6 gives us the probability of 1/6. Given that I know that the red die rolled a value of six, the probability that the sum of the two dice is 12 is also one over six. And that probably makes intuitive sense to you, too, because if the red die is a six, the only way for me to get to a 12 is if the blue die also rolls a six, and we know that the probability of the blue die rolling a six is one over six.
So in this case, the conditional probability seems fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both of these events take place is an idea that's going to come up again and again. This is the definition, now, of conditional probability, and we're going to use that definition as we think about probability more generally to be able to draw conclusions about the world. This, again, is that formula: the probability of a given b is equal to the probability that a and b take place, divided by the probability of b.

And you'll see this formula sometimes written in a couple of different ways. You could imagine, algebraically, multiplying both sides of this equation by the probability of b to get rid of the fraction, and you'll get an expression like this: the probability of a and b is just the probability of b times the probability of a given b. Or — since a and b, in this expression, are interchangeable; a and b is the same thing as b and a — you could equivalently represent the probability of a and b as the probability of a times the probability of b given a, just switching all of the a's and b's. These three are all equivalent ways of representing what joint probability means. And so you'll sometimes see all of these equations, and they might be useful to you as you begin to reason about probability and to think about what values might be taking place in the real world.

Now, sometimes when we deal with probability, we don't just care about a Boolean event — did this happen or did this not happen? Sometimes we might want the ability to represent variable values in a probability space, where some variable might take on multiple different possible values. And in probability theory, we call such a variable a random variable: just some variable that has some domain of values that it can take on. So what do I mean by this? Well, I might have a random variable that is just called Roll, for example, that has six possible values.
Roll is my variable, and the possible values — the domain of values that it can take on — are 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same, but for other random variables that might not be the case. For example, I might have a random variable to represent the weather, where the domain of values it could take on are things like sun or cloudy or rainy or windy or snowy, and each of those might have a different probability. I care about knowing the probability that the weather equals sun or that the weather equals clouds, for instance, and I might like to do some mathematical calculations based on that information.

Other random variables might be something like traffic: what are the odds that there is no traffic, light traffic, or heavy traffic? Traffic, in this case, is my random variable, and the values that that random variable can take on are either none or light or heavy. And I, the person doing these calculations — I, the person encoding these random variables into my computer — need to make the decision as to what these possible values actually are. You might imagine, for example, that if I care about whether or not I make it to a flight on time, my flight has a couple of possible values that it could take on: my flight could be on time, my flight could be delayed, or my flight could be canceled. So Flight, in this case, is my random variable, and these are the values that it can take on.

And often I'll want to know something about the probability that my random variable takes on each of those possible values. This is what we then call a probability distribution. A probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, for example, my probability distribution might look something like this. My probability distribution says the probability that the random variable Flight is equal to the value "on time" is 0.6 — or, put in more human-friendly terms, the likelihood that my flight is on time is 60%, for example. And in this case, the probability that my flight is delayed is 30%.
The probability that my flight is canceled is 10%, or 0.1. And if you sum up all of these possible values, the sum is going to be 1. If you take all of the possible worlds — here, my three possible worlds for the value of the random variable Flight — and add them all up together, the result needs to be the number one, per that axiom of probability theory that we've discussed before.

So this now is one way of representing this probability distribution for the random variable Flight. Sometimes you'll see it represented a little bit more concisely, since this is pretty verbose for really just trying to express three possible values. And so often you'll instead see this same idea represented using a vector. All a vector is is a sequence of values — as opposed to just a single value, I might have multiple values. And so I could instead represent this idea this way: bold P — a larger P — generally meaning the probability distribution of this variable Flight, is equal to this vector represented in angle brackets. The probability distribution is 0.6, 0.3, and 0.1, and I would just have to know that this probability distribution is in the order of on time, delayed, and canceled to know how to interpret this vector: the first value in the vector is the probability that my flight is on time, the second value is the probability that my flight is delayed, and the third value is the probability that my flight is canceled.

And so this is just an alternate, more concise way of representing the same idea. Oftentimes you'll see us just talk about a probability distribution over a random variable, and whenever we talk about that, what we're really doing is trying to figure out the probabilities of each of the possible values that that random variable can take on. This notation is just a little bit more succinct, even though it can sometimes be a little confusing depending on the context in which you see it. So we'll start to look at examples where we use this sort of notation to describe probability and to describe events that might take place.
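One plausible way to mirror both representations in Python is a sketch like this, with a plain dictionary for the verbose form and a tuple for the ordered vector form — both just illustrative choices:

```python
# Explicit, verbose representation: each value in the domain of the
# random variable Flight mapped to its probability.
flight_distribution = {
    "on time": 0.6,
    "delayed": 0.3,
    "canceled": 0.1,
}

# The probabilities over the whole domain must sum to 1.
assert abs(sum(flight_distribution.values()) - 1.0) < 1e-9

# Concise vector form: meaningful only if we agree on the order
# <on time, delayed, canceled>.
P_flight = (0.6, 0.3, 0.1)
```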
A couple of other important ideas to know with regard to probability theory. One is this idea of independence, and independence refers to the idea that knowledge of one event doesn't influence the probability of another event. So, for example, in the context of my two dice rolls with the red die and the blue die, those two events — the roll of the red die and the roll of the blue die — are independent. Knowing the result of the red die doesn't change the probabilities for the blue die; it doesn't give me any additional information about what the value of the blue die is ultimately going to be.

But that's not always going to be the case. You might imagine that in the case of weather, something like clouds and rain, those are probably not independent: if it is cloudy, that might increase the probability that later in the day it's going to rain. So some information informs some other event or some other random variable. Independence refers to the idea that one event doesn't influence the other, and if they're not independent, then there might be some relationship.

So mathematically, formally, what does independence actually mean? Well, recall this formula from before: the probability of a and b is the probability of a times the probability of b given a. And the more intuitive way to think about this is that to know how likely it is that a and b both happen, let's first figure out the likelihood that a happens, and then, given that we know that a happens, figure out the likelihood that b happens, and multiply those two things together. But if a and b were independent — meaning knowing a doesn't change anything about the likelihood that b is true — then the probability of b given a, the probability that b is true given that I know a is true, shouldn't really be any different; a shouldn't influence b at all. So the probability of b given a is really just the probability of b, if it is true that a and b are independent. And so this right here is one definition of what it means for a and b to be independent: the probability of a and b is just the probability of a times the probability of b.
Any time you find two events a and b where this relationship holds, you can say that a and b are independent. So an example of that might be the dice that we were taking a look at before. Here, if I wanted the probability of red being a six and blue being a six, that's just the probability that red is a six multiplied by the probability that blue is a six — each of those one over six, and their product equal to one over 36. So I can say that these two events are independent.

What wouldn't be independent would be a case like this: the probability that the red die rolls a six and the red die rolls a four. If you just naively took red die six, red die four — well, if I'm only rolling the die once, you might imagine the naive approach is to say that each of these has a probability of one over six, so multiplying them together gives a probability of one over 36. But, of course, if you're only rolling the red die once, there's no way you could get two different values for the red die. It couldn't be both a six and a four, so the probability should be zero. If you were to multiply the probability of red six times the probability of red four, that would equal one over 36 — but, of course, that's not true, because we know that there is no way, probability zero, that when we roll the red die once we get both a six and a four, because only one of those possibilities can actually be the result.

And so we can say that the event that the red roll is six and the event that the red roll is four are not independent. If I know that the red roll is a six, I know that the red roll cannot possibly be a four, so these things are not independent. And instead, if I wanted to calculate the probability, I would need to use conditional probability, as in the regular definition of the probability of two events taking place. And the probability of this, now — well, the probability of the red roll being a six, that's one over six. But what's the probability that the roll is a four given that the roll is a six? Well, this is just zero, because there's no way for the red roll to be a four given that we already know the red roll is a six. And so if we do all that multiplication, the value we get is the number zero.
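As a sketch of that distinction, this checks the independence definition P(a and b) = P(a) · P(b) against both pairs of events, using the same enumerated worlds as before:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def p(event):
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

# Independent: knowing the red die tells us nothing about the blue die,
# so P(red=6 and blue=6) = P(red=6) * P(blue=6) = 1/36.
assert p(lambda w: w[0] == 6 and w[1] == 6) \
    == p(lambda w: w[0] == 6) * p(lambda w: w[1] == 6)

# Not independent: one roll can't be both a six and a four, so the joint
# probability is 0 -- not the naive 1/6 * 1/6 = 1/36.
assert p(lambda w: w[0] == 6 and w[0] == 4) == 0
```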
So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way. And this gets us to one of the most important rules in probability theory, which is known as Bayes' rule. It turns out that just using the information we've already learned about probability and applying a little bit of algebra, we can actually derive Bayes' rule for ourselves. And it's a very important rule when it comes to inference and thinking about probability in the context of what a computer can do, or what a mathematician could do, by having access to information about probability.

So let's go back to these equations to derive Bayes' rule ourselves. We know the probability of a and b — the likelihood that a and b take place — is the likelihood of b times the likelihood of a given that we know that b is already true. And likewise, the probability of a and b is the probability of a times the probability of b given that we know that a is already true. This is a symmetric relationship: the order doesn't matter, and a and b and b and a mean the same thing, so in these equations we can just swap out a and b to represent the exact same idea.

So we know that these two equations are already true — we've seen that already — and now let's just do a little bit of algebraic manipulation. Both of these expressions on the right-hand side are equal to the probability of a and b. So what I can do is take these two expressions on the right-hand side and set them equal to each other: if they're both equal to the probability of a and b, then they both must be equal to each other. So the probability of a times the probability of b given a is equal to the probability of b times the probability of a given b. And now all we're going to do is a little bit of division: I'm going to divide both sides by P of a, and now I get what is Bayes' rule. The probability of b given a is equal to the probability of b times the probability of a given b, divided by the probability of a.
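Written out as a single chain of equations, the derivation just described is:

```latex
P(a)\,P(b \mid a) \;=\; P(a \land b) \;=\; P(b)\,P(a \mid b)
\quad\Longrightarrow\quad
P(b \mid a) \;=\; \frac{P(b)\,P(a \mid b)}{P(a)}
```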
And sometimes in Bayes' rule you'll see the order of these two arguments switched: instead of b times a given b, it'll be a given b times b. That ultimately doesn't matter, because in multiplication you can switch the order of the two things you're multiplying without changing the result. But this here right now is the most common formulation of Bayes' rule: the probability of b given a is equal to the probability of a given b, times the probability of b, divided by the probability of a.

And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability — the conditional probability of b given a — using knowledge about the probability of a given b, the reverse of that conditional probability. So let's first do a little bit of an example with this, just to see how we might use it, and then explore what this means a little bit more generally.

So we're going to construct a situation where I have some information. There are two events that I care about: the idea that it's cloudy in the morning and the idea that it is rainy in the afternoon. Those are two different possible events that could take place — cloudy in the morning, or the AM, and rainy in the PM. And what I care about is: given clouds in the morning, what is the probability of rain in the afternoon? That's a reasonable question I might ask. In the morning, I look outside, or an AI's camera looks outside, and sees that there are clouds in the morning, and we want to figure out the probability that in the afternoon there is going to be rain.

Of course, in the abstract, we don't have access to this kind of information, but we can use data to begin to try and figure this out. So let's imagine, now, that I have access to some pieces of information. I have access to the idea that 80% of rainy afternoons start out with a cloudy morning. And you might imagine that I could have gathered this data just by looking at data over a period of time: I know that 80% of the time, when it's raining in the afternoon, it was cloudy that morning.
I also know that 40% of days have cloudy mornings, and I also know that 10% of days have rainy afternoons. And now, using this information, I would like to figure out: given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds, and I can do that, in particular, using the fact I just described: if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So using the reverse conditional probability, I can figure that out.

Expressed in terms of Bayes' rule, this is what that would look like: the probability of rain given clouds is the probability of clouds given rain, times the probability of rain, divided by the probability of clouds. Here I'm just substituting in the values of a and b from the equation for Bayes' rule from before. And then I can just do the math. I have this information: I know that 80% of the time, if it was raining in the afternoon, then there were clouds in the morning — so 0.8 here. The probability of rain is 0.1, because 10% of days had rainy afternoons, and 40% of days had cloudy mornings. I do the math — 0.8 times 0.1, divided by 0.4 — and I can figure out that the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case.

And this, now, is an application of Bayes' rule: the idea that using one conditional probability, we can get the reverse conditional probability. And this is often useful when one of the conditional probabilities might be easier for us to know about or easier for us to have data about; using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, more generally, if we know the probability of some visible effect — some effect that we can see and observe — given some unknown cause that we're not sure about, well, then we can calculate the probability of that unknown cause given the visible effect.
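A direct transcription of that computation into Python might look like this; `bayes` is a hypothetical helper name for illustration, not a library function:

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

# P(rain in the PM | clouds in the AM), given the three facts above:
#   P(clouds | rain) = 0.8, P(rain) = 0.1, P(clouds) = 0.4
print(bayes(0.8, 0.1, 0.4))  # 0.2
```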
So what might that look like? Well, in the context of medicine, for example, I might know the probability of some medical test result given a disease. Like, I know that if someone has a disease, then x percent of the time the medical test result will show up as this, for instance. And using that information, I can then calculate: given that I know the medical test result, what is the likelihood that someone has the disease? The first piece of information is the one that is usually easier to know, easier to immediately have access to data for, and the second is the information that I actually want to calculate.

Or I might want to know, for example, that some percentage of counterfeit bills have blurry text around the edges, because counterfeit printers aren't nearly as good at printing text precisely. So I have some information: given that something is a counterfeit bill, x percent of counterfeit bills have blurry text, for example. And using that information, I can calculate some piece of information that I might actually want to know: given that I know there's blurry text on a bill, what is the probability that that bill is counterfeit? So given one conditional probability, I can calculate the other conditional probability as well.

And so now we've taken a look at a couple of different types of probability. We've looked at unconditional probability, where I just look at the probability of an event occurring given no additional evidence that I might have access to, and we've also looked at conditional probability, where I have some sort of evidence and would like to use that evidence to calculate some other probability as well.

The other kind of probability that will be important for us to think about is joint probability, and this is when we're considering the likelihood of multiple different events simultaneously. And so what do we mean by this? Well, for example, I might have probability distributions that look a little something like this: I want to know the probability distribution of clouds in the morning, and that distribution looks like this. Forty percent of the time, C, which is my random variable here, is equal to cloudy, and 60% of the time it's not cloudy. So here is just a simple probability distribution that is effectively telling me that 40% of the time it's cloudy.
00:34:37.710 --> 00:34:41.219 I might also have a probability distribution for rain in the afternoon 00:34:41.219 --> 00:34:44.670 where 10% of the time, or with probability 0.1, 00:34:44.670 --> 00:34:48.600 it is raining in the afternoon and with probability 0.9 00:34:48.600 --> 00:34:51.090 it is not raining in the afternoon. 00:34:51.090 --> 00:34:54.580 And using just these two pieces of information, 00:34:54.580 --> 00:34:57.540 I don't actually have a whole lot of information about how these two 00:34:57.540 --> 00:34:59.980 variables relate to each other. 00:34:59.980 --> 00:35:02.940 But I could if I had access to their joint probability, 00:35:02.940 --> 00:35:05.550 meaning for every combination of these two things-- 00:35:05.550 --> 00:35:09.330 meaning morning cloudy and afternoon rain, morning cloudy and afternoon 00:35:09.330 --> 00:35:12.960 not rain, morning not cloudy and afternoon rain, and morning 00:35:12.960 --> 00:35:15.150 not cloudy and afternoon not raining-- 00:35:15.150 --> 00:35:17.700 if I had access to values for each of those four, 00:35:17.700 --> 00:35:20.340 I'd have more information-- so information that'd 00:35:20.340 --> 00:35:22.390 be organized in a table like this. 00:35:22.390 --> 00:35:25.690 And this, rather than just a probability distribution, 00:35:25.690 --> 00:35:27.970 is a joint probability distribution. 00:35:27.970 --> 00:35:31.090 It tells me the probability distribution of each 00:35:31.090 --> 00:35:34.930 of the possible combinations of values that these random variables 00:35:34.930 --> 00:35:36.160 can take on. 00:35:36.160 --> 00:35:39.640 So if I want to know, what is the probability that on any given day 00:35:39.640 --> 00:35:42.400 it is both cloudy and rainy, well, I would say, 00:35:42.400 --> 00:35:45.100 all right, we're looking at cases where it is cloudy 00:35:45.100 --> 00:35:48.460 and cases where it is raining and the intersection of those two, 00:35:48.460 --> 00:35:51.310 that row and that column, is 0.08. 00:35:51.310 --> 00:35:55.210 So that is the probability that it is both cloudy and rainy 00:35:55.210 --> 00:35:57.070 using that information. 00:35:57.070 --> 00:36:00.010 And using this joint probability table, 00:36:00.010 --> 00:36:02.260 using these joint probabilities, I can 00:36:02.260 --> 00:36:04.930 begin to draw other pieces of information 00:36:04.930 --> 00:36:07.420 about things like conditional probability. 00:36:07.420 --> 00:36:11.890 So I might ask a question like, what is the probability distribution of clouds 00:36:11.890 --> 00:36:14.470 given that I know that it is raining, meaning 00:36:14.470 --> 00:36:16.660 I know for sure that it's raining. 00:36:16.660 --> 00:36:19.780 Tell me the probability distribution over whether it's cloudy 00:36:19.780 --> 00:36:22.720 or not given that I know already that it is, in fact, raining. 00:36:22.720 --> 00:36:25.480 And here I'm using C to stand for that random variable. 00:36:25.480 --> 00:36:28.030 I'm looking for a distribution, meaning the answer to this 00:36:28.030 --> 00:36:29.860 is not going to be a single value.
00:36:29.860 --> 00:36:33.760 It's going to be two values, a vector of two values where the first value is 00:36:33.760 --> 00:36:37.960 the probability of clouds, the second value is the probability that it is not cloudy, 00:36:37.960 --> 00:36:40.240 but the sum of those two values is going to be one, 00:36:40.240 --> 00:36:42.470 because when you add up the probabilities of all 00:36:42.470 --> 00:36:47.190 of the possible worlds, the result that you get must be the number one. 00:36:47.190 --> 00:36:50.740 And, well, what do we know about how to calculate a conditional probability? 00:36:50.740 --> 00:36:56.590 Well, we know that the probability of a given b is the probability of a and b 00:36:56.590 --> 00:36:59.320 divided by the probability of b. 00:36:59.320 --> 00:37:00.740 So what does this mean? 00:37:00.740 --> 00:37:03.610 Well, it means that I can calculate the probability of clouds 00:37:03.610 --> 00:37:08.260 given that it's raining as the probability of clouds 00:37:08.260 --> 00:37:11.230 and raining divided by the probability of rain. 00:37:11.230 --> 00:37:15.220 And this comma here for the probability distribution of clouds and rain, 00:37:15.220 --> 00:37:17.710 this comma sort of stands in for the word "and." 00:37:17.710 --> 00:37:21.460 You'll sort of see the logical operator AND and the comma used interchangeably. 00:37:21.460 --> 00:37:24.550 This means the probability distribution over the clouds 00:37:24.550 --> 00:37:29.382 and knowing the fact that it is raining divided by the probability of rain. 00:37:29.382 --> 00:37:31.840 And the interesting thing to note here and what we'll often 00:37:31.840 --> 00:37:34.210 do in order to simplify our mathematics is 00:37:34.210 --> 00:37:38.260 that dividing by the probability of rain, the probability of rain 00:37:38.260 --> 00:37:40.150 here is just some numerical constant. 00:37:40.150 --> 00:37:40.900 It is some number. 00:37:40.900 --> 00:37:43.780 Dividing by probability of rain is just dividing 00:37:43.780 --> 00:37:46.090 by some constant or, in other words, multiplying 00:37:46.090 --> 00:37:48.100 by the inverse of that constant. 00:37:48.100 --> 00:37:50.620 And it turns out that oftentimes we can just 00:37:50.620 --> 00:37:53.230 not worry about what the exact value of this is 00:37:53.230 --> 00:37:56.370 and just know that it is, in fact, a constant value, 00:37:56.370 --> 00:37:57.620 and we'll see why in a moment. 00:37:57.620 --> 00:38:01.390 So instead of expressing this as this joint probability divided 00:38:01.390 --> 00:38:06.790 by the probability of rain, sometimes we'll just represent it as alpha times 00:38:06.790 --> 00:38:10.830 the numerator here, the probability distribution of C, this variable, 00:38:10.830 --> 00:38:13.370 and that we know that it is raining, for instance. 00:38:13.370 --> 00:38:16.600 So all we've done here is said this value of one 00:38:16.600 --> 00:38:19.840 over the probability of rain, that's really just a constant that we're 00:38:19.840 --> 00:38:23.140 going to multiply by at the end, equivalent to dividing by the probability of rain. 00:38:23.140 --> 00:38:26.770 We'll just call it alpha for now and deal with it a little bit later.
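Concretely, if the joint table from above is represented as a Python dictionary, that division is just a lookup and a divide (this layout is only an illustrative sketch):

```python
# Joint distribution over (morning clouds, afternoon rain)
joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,
}

# P(rain) comes from the joint table itself: 0.08 + 0.02 = 0.1
p_rain = joint[("cloudy", "rain")] + joint[("not cloudy", "rain")]

# P(cloudy | rain) = P(cloudy, rain) / P(rain)
print(joint[("cloudy", "rain")] / p_rain)  # approximately 0.8
```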
00:38:26.770 --> 00:38:30.130 But the key idea here now-- and this is an idea that's going to come up again-- 00:38:30.130 --> 00:38:34.390 is that the conditional distribution of C given rain 00:38:34.390 --> 00:38:38.200 is proportional to, meaning just some factor multiplied by, 00:38:38.200 --> 00:38:42.580 the joint probability of C and rain being true. 00:38:42.580 --> 00:38:44.030 And so how do we figure this out? 00:38:44.030 --> 00:38:46.720 Well, this is going to be the probability that it is cloudy 00:38:46.720 --> 00:38:50.200 and it's raining, which is 0.08, and the probability that it's not 00:38:50.200 --> 00:38:53.350 cloudy and it's raining, which is 0.02. 00:38:53.350 --> 00:38:55.180 And so we get alpha times-- 00:38:55.180 --> 00:38:58.060 here now is that probability distribution. 00:38:58.060 --> 00:39:00.370 0.08 is clouds and rain. 00:39:00.370 --> 00:39:04.210 0.02 is not cloudy and rain. 00:39:04.210 --> 00:39:08.260 But, of course, 0.08 and 0.02 don't sum up to the number one. 00:39:08.260 --> 00:39:10.780 And we know that in a probability distribution, 00:39:10.780 --> 00:39:13.030 if you consider all of the possible values, 00:39:13.030 --> 00:39:15.730 they must sum up to a probability of one. 00:39:15.730 --> 00:39:20.350 And so we know that we just need to figure out some constant to normalize, 00:39:20.350 --> 00:39:23.830 so to speak, these values, something we can multiply or divide by 00:39:23.830 --> 00:39:26.600 to get it so that all of these probabilities sum up to one. 00:39:26.600 --> 00:39:29.390 And it turns out that if we multiply both numbers by 10, 00:39:29.390 --> 00:39:32.290 then we can get that result of 0.8 and 0.2. 00:39:32.290 --> 00:39:34.990 The proportions are still equivalent, but now 0.8 00:39:34.990 --> 00:39:38.750 plus 0.2, those sum up to the number 1. 00:39:38.750 --> 00:39:41.080 So take a look at this and see if you can understand, 00:39:41.080 --> 00:39:43.870 step by step, how it is we're getting from one point to another. 00:39:43.870 --> 00:39:48.190 But the key idea here is that by using the joint probabilities, 00:39:48.190 --> 00:39:52.480 these probabilities that it is both cloudy and rainy and that it is not 00:39:52.480 --> 00:39:56.740 cloudy and rainy, I can take that information and figure out 00:39:56.740 --> 00:39:59.800 the conditional probability-- given that it's raining, 00:39:59.800 --> 00:40:02.320 what is the chance that it's cloudy versus not cloudy-- 00:40:02.320 --> 00:40:06.740 just by multiplying by some normalization constant, so to speak. 00:40:06.740 --> 00:40:08.860 And this is what a computer can begin to use 00:40:08.860 --> 00:40:12.130 to be able to interact with these various different types 00:40:12.130 --> 00:40:13.207 of probabilities. 00:40:13.207 --> 00:40:15.790 And it turns out there are a number of other probability rules 00:40:15.790 --> 00:40:19.570 that are going to be useful to us as we begin to explore how we can actually 00:40:19.570 --> 00:40:22.860 use this information to encode into our computers 00:40:22.860 --> 00:40:27.030 some more complex analysis that we might want to do about probability 00:40:27.030 --> 00:40:30.793 and distributions and random variables that we might be interacting with. 00:40:30.793 --> 00:40:33.210 So here are a couple of those important probability rules. 00:40:33.210 --> 00:40:35.850 One of the simplest rules is just this negation rule. 00:40:35.850 --> 00:40:39.420 What is the probability of not event a?
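Before getting to that rule, here is what the normalization step above might look like in code, a small sketch of the alpha idea:

```python
def normalize(values):
    """Scale a list of probabilities so they sum to 1; alpha is the scaling constant."""
    alpha = 1 / sum(values)
    return [alpha * v for v in values]

# P(C | rain) is proportional to the joint values [P(cloudy, rain), P(not cloudy, rain)]
print(normalize([0.08, 0.02]))  # approximately [0.8, 0.2]
```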
00:40:39.420 --> 00:40:41.970 So a is an event that has some probability, 00:40:41.970 --> 00:40:45.840 and I would like to know, what is the probability that a does not occur? 00:40:45.840 --> 00:40:50.340 And it turns out it's just one minus P of a, which makes sense 00:40:50.340 --> 00:40:52.470 because if those are the two possible cases, 00:40:52.470 --> 00:40:56.770 either a happens or a doesn't happen, then when you add up those two cases, 00:40:56.770 --> 00:41:02.970 you must get one, which means P of not a must just be one minus P of a 00:41:02.970 --> 00:41:06.930 because P of a and P of not a must sum up to the number one. 00:41:06.930 --> 00:41:10.050 They must include all of the possible cases. 00:41:10.050 --> 00:41:14.010 We've seen an expression for calculating the probability of a and b. 00:41:14.010 --> 00:41:18.180 We might also reasonably want to calculate the probability of a or b. 00:41:18.180 --> 00:41:21.480 What is the probability that one thing happens or another thing happens? 00:41:21.480 --> 00:41:23.550 So for example, I might want to calculate, 00:41:23.550 --> 00:41:26.010 what is the probability that if I roll two dice, 00:41:26.010 --> 00:41:29.970 a red die and a blue die, what is the likelihood that a is a six or b 00:41:29.970 --> 00:41:31.860 is a six, one or the other? 00:41:31.860 --> 00:41:34.860 And what you might imagine you could do-- and the wrong way to approach it-- 00:41:34.860 --> 00:41:38.810 would be just to say, all right, well, a comes up as a six, 00:41:38.810 --> 00:41:41.727 the red die comes up as a six with probability one over six. 00:41:41.727 --> 00:41:42.810 The same for the blue die. 00:41:42.810 --> 00:41:44.070 It's also one over six. 00:41:44.070 --> 00:41:47.520 Add them together and you get 2/6, otherwise known as 1/3. 00:41:47.520 --> 00:41:50.820 But this suffers from the problem of over counting, 00:41:50.820 --> 00:41:54.330 that we've double counted the case where both a and b, both 00:41:54.330 --> 00:41:57.690 the red die and the blue die, both come up as a six, 00:41:57.690 --> 00:41:59.780 and I've counted that instance twice. 00:41:59.780 --> 00:42:02.070 So to resolve this, the actual expression 00:42:02.070 --> 00:42:05.100 for calculating the probability of a or b 00:42:05.100 --> 00:42:08.070 uses what we call the inclusion-exclusion formula. 00:42:08.070 --> 00:42:11.510 So I take the probability of a, add it to the probability of b. 00:42:11.510 --> 00:42:12.900 That's all the same as before. 00:42:12.900 --> 00:42:16.440 But then I need to exclude the cases that I've double counted. 00:42:16.440 --> 00:42:21.930 So I subtract from that the probability of a and b, and that 00:42:21.930 --> 00:42:23.520 gets me the result for a or b. 00:42:23.520 --> 00:42:27.348 I consider all the cases where a is true and all the cases where b is true. 00:42:27.348 --> 00:42:29.640 And if you imagine this is like a Venn diagram of cases 00:42:29.640 --> 00:42:31.830 where a is true, cases where b is true, I just 00:42:31.830 --> 00:42:34.500 need to subtract out the middle to get rid of the cases 00:42:34.500 --> 00:42:37.860 that I have over counted by double counting them inside of both 00:42:37.860 --> 00:42:41.520 of these individual expressions. 00:42:41.520 --> 00:42:43.530 One other rule that's going to be quite helpful 00:42:43.530 --> 00:42:45.770 is a rule called marginalization.
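Before moving on, here is the two-dice calculation as a quick check, assuming the dice are fair and independent:

```python
# Inclusion-exclusion: P(a or b) = P(a) + P(b) - P(a and b)
p_red_six = 1 / 6
p_blue_six = 1 / 6
p_both_six = (1 / 6) * (1 / 6)  # independent dice

p_either_six = p_red_six + p_blue_six - p_both_six
print(p_either_six)  # 11/36, about 0.306 -- not the over-counted 1/3
```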
00:42:45.770 --> 00:42:47.880 So marginalization is answering the question 00:42:47.880 --> 00:42:52.350 of how do I figure out the probability of a using some other variable that I 00:42:52.350 --> 00:42:53.970 might have access to, like b? 00:42:53.970 --> 00:42:56.190 Even if I don't know additional information about it, 00:42:56.190 --> 00:43:00.270 I know that b, some event, can have two possible states. 00:43:00.270 --> 00:43:05.080 Either b happens or b doesn't happen, assuming it's a Boolean, true or false. 00:43:05.080 --> 00:43:07.500 And well, what that means is that for me to be 00:43:07.500 --> 00:43:11.130 able to calculate the probability of a, there are only two cases. 00:43:11.130 --> 00:43:15.930 Either a happens and b happens or a happens and b doesn't happen. 00:43:15.930 --> 00:43:19.200 And those are two disjoint cases, meaning they can't both happen together-- 00:43:19.200 --> 00:43:21.480 either b happens or b doesn't happen. 00:43:21.480 --> 00:43:23.640 They're disjoint or separate cases. 00:43:23.640 --> 00:43:28.140 And so I can figure out the probability of a just by adding up those two cases. 00:43:28.140 --> 00:43:31.770 The probability that a is true is the probability 00:43:31.770 --> 00:43:35.640 that a and b is true plus the probability that a is true 00:43:35.640 --> 00:43:36.810 and b isn't true. 00:43:36.810 --> 00:43:40.123 So by marginalizing, I've looked at the two possible cases 00:43:40.123 --> 00:43:41.040 that might take place. 00:43:41.040 --> 00:43:44.120 Either b happens or b doesn't happen. 00:43:44.120 --> 00:43:47.610 And in either of those cases, I look at, what's the probability that a happens, 00:43:47.610 --> 00:43:50.430 and if I add those together, well, then I get the probability 00:43:50.430 --> 00:43:52.710 that a happens as a whole. 00:43:52.710 --> 00:43:54.030 So take a look at that rule. 00:43:54.030 --> 00:43:57.120 It doesn't matter what b is or how it's related to a. 00:43:57.120 --> 00:43:59.580 So long as I know these joint distributions, 00:43:59.580 --> 00:44:02.280 I can figure out the overall probability of a. 00:44:02.280 --> 00:44:05.130 And this can be a useful way, if I have a joint distribution, 00:44:05.130 --> 00:44:08.550 like the joint distribution of a and b, to just figure out 00:44:08.550 --> 00:44:11.320 some unconditional probability, like the probability of a, 00:44:11.320 --> 00:44:14.520 and we'll see examples of this soon, as well. 00:44:14.520 --> 00:44:17.460 Now, sometimes these might not just be variables 00:44:17.460 --> 00:44:21.160 that are events that either happened or they didn't happen, like b is here. 00:44:21.160 --> 00:44:23.850 They might be some broader probability distribution where 00:44:23.850 --> 00:44:25.800 there are multiple possible values. 00:44:25.800 --> 00:44:28.710 And so here, in order to use this marginalization rule, 00:44:28.710 --> 00:44:34.290 I need to sum up not just over b and not b, but for all of the possible values 00:44:34.290 --> 00:44:36.610 that the other random variable could take on. 00:44:36.610 --> 00:44:39.360 And so here we'll see a version of this rule for random variables, 00:44:39.360 --> 00:44:41.610 and it's going to include that summation notation 00:44:41.610 --> 00:44:46.270 to indicate that I'm summing up, adding up, a whole bunch of individual values. 00:44:46.270 --> 00:44:47.092 So here's the rule.
00:44:47.092 --> 00:44:49.050 It looks a lot more complicated, but it's actually 00:44:49.050 --> 00:44:51.330 equivalent, exactly the same rule. 00:44:51.330 --> 00:44:55.500 What I'm saying here is that if I have two random variables, one called x 00:44:55.500 --> 00:45:01.380 and one called y, well, the probability that x is equal to some value x sub i-- 00:45:01.380 --> 00:45:04.170 this is just some value that this variable takes on-- 00:45:04.170 --> 00:45:05.520 how do I figure it out? 00:45:05.520 --> 00:45:08.760 Well, I'm going to sum up over j, where j 00:45:08.760 --> 00:45:13.380 is going to range over all of the possible values that y can take on. 00:45:13.380 --> 00:45:18.558 Well, let's look at the probability that x equals xi and y equals yj. 00:45:18.558 --> 00:45:20.600 So the exact same rule-- the only difference here 00:45:20.600 --> 00:45:23.360 is now I'm summing up over all of the possible values 00:45:23.360 --> 00:45:27.420 that y can take on, saying let's add up all of those possible cases 00:45:27.420 --> 00:45:31.100 and look at this joint distribution, this joint probability 00:45:31.100 --> 00:45:35.990 that x takes on the value I care about together with all of the possible values for y. 00:45:35.990 --> 00:45:40.910 And if I add all those up, then I can get this unconditional probability 00:45:40.910 --> 00:45:46.397 of what x is equal to, whether or not x is equal to some value x sub i. 00:45:46.397 --> 00:45:48.230 So let's take a look at this rule because it 00:45:48.230 --> 00:45:49.688 does look a little bit complicated. 00:45:49.688 --> 00:45:51.650 Let's try and put a concrete example to it. 00:45:51.650 --> 00:45:54.470 Here, again, is that same joint distribution from before. 00:45:54.470 --> 00:45:58.460 I have cloudy, not cloudy, rainy, not rainy. 00:45:58.460 --> 00:46:00.830 And maybe I want to access some variable. 00:46:00.830 --> 00:46:04.790 I want to know, what is the probability that it is cloudy? 00:46:04.790 --> 00:46:08.550 Well, marginalization says that if I have this joint distribution 00:46:08.550 --> 00:46:12.140 and I want to know, what is the probability that it is cloudy, well, 00:46:12.140 --> 00:46:15.650 I need to consider the other variable, the variable that's not here, 00:46:15.650 --> 00:46:17.060 the idea that it's rainy. 00:46:17.060 --> 00:46:20.780 And I consider the two cases, either it's raining or it's not raining, 00:46:20.780 --> 00:46:24.410 and I just sum up the values for each of those possibilities. 00:46:24.410 --> 00:46:27.380 In other words, the probability that it is cloudy 00:46:27.380 --> 00:46:31.110 is equal to the sum of the probability that it's cloudy 00:46:31.110 --> 00:46:38.090 and it's raining and the probability that it's cloudy and it is not raining. 00:46:38.090 --> 00:46:40.460 And so these, now, are values that I have access to. 00:46:40.460 --> 00:46:44.840 These are values that are just inside of this joint probability table. 00:46:44.840 --> 00:46:47.990 What is the probability that it is both cloudy and rainy? 00:46:47.990 --> 00:46:51.350 Well, it's just the intersection of these two here, which is 0.08, 00:46:51.350 --> 00:46:54.590 and the probability that it's cloudy and not raining is-- all right, 00:46:54.590 --> 00:46:56.480 here's cloudy, here's not raining-- 00:46:56.480 --> 00:46:58.000 it's 0.32. 00:46:58.000 --> 00:47:02.630 So it's 0.08 plus 0.32, which is just equal to 0.4. 00:47:02.630 --> 00:47:06.840 That is the unconditional probability that it is, in fact, cloudy.
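Using the same dictionary representation of the joint table sketched earlier, marginalization is just a sum over the values of the other variable:

```python
joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,
}

# P(C = cloudy) = sum over all values r of Rain of P(C = cloudy, Rain = r)
p_cloudy = sum(p for (c, r), p in joint.items() if c == "cloudy")
print(p_cloudy)  # 0.08 + 0.32 = 0.4
```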
00:47:06.840 --> 00:47:09.530 And so marginalization gives us a way to go 00:47:09.530 --> 00:47:13.360 from these joint distributions to just some individual probability 00:47:13.360 --> 00:47:14.430 that I might care about. 00:47:14.430 --> 00:47:17.222 And you'll see a little bit later why it is that we care about that 00:47:17.222 --> 00:47:19.370 and why that's actually useful to us as we 00:47:19.370 --> 00:47:21.885 begin doing some of these calculations. 00:47:21.885 --> 00:47:25.010 The last rule we'll take a look at before transitioning into something a little 00:47:25.010 --> 00:47:27.200 bit different is this rule of conditioning-- 00:47:27.200 --> 00:47:31.070 very similar to the marginalization rule, but it says that, again, 00:47:31.070 --> 00:47:32.600 if I have two events a and b-- 00:47:32.600 --> 00:47:35.810 but instead of having access to their joint probabilities, 00:47:35.810 --> 00:47:38.180 I have access to their conditional probabilities, 00:47:38.180 --> 00:47:39.920 how they relate to each other. 00:47:39.920 --> 00:47:43.700 Well, again, if I want to know the probability that a happens and I know 00:47:43.700 --> 00:47:47.960 that there's some other variable b, either b happens or b doesn't happen, 00:47:47.960 --> 00:47:50.660 and so I can say that the probability of a 00:47:50.660 --> 00:47:54.920 is the probability of a given b times the probability of b, 00:47:54.920 --> 00:47:57.470 meaning b happened, and given that I know b happened, 00:47:57.470 --> 00:47:59.480 what's the likelihood that a happened? 00:47:59.480 --> 00:48:02.480 And then I consider the other case, that b didn't happen. 00:48:02.480 --> 00:48:05.360 So here is the probability that b didn't happen, 00:48:05.360 --> 00:48:07.880 and here's the probability that a happens given 00:48:07.880 --> 00:48:09.890 that I know that b didn't happen. 00:48:09.890 --> 00:48:13.820 And this is really the equivalent rule, just using conditional probability 00:48:13.820 --> 00:48:16.190 instead of joint probability where I'm saying, 00:48:16.190 --> 00:48:19.790 let's look at both of these two cases and condition on b. 00:48:19.790 --> 00:48:23.480 Look at the case where b happens and look at the case where b doesn't happen 00:48:23.480 --> 00:48:26.560 and look at what probabilities I get as a result. 00:48:26.560 --> 00:48:28.598 And just as in the case of marginalization 00:48:28.598 --> 00:48:30.890 where there was an equivalent rule for random variables 00:48:30.890 --> 00:48:34.850 that could take on multiple possible values in a domain of possible values, 00:48:34.850 --> 00:48:37.530 here, too, conditioning has the same equivalent rule. 00:48:37.530 --> 00:48:41.590 Again, there's a summation to mean I'm summing over all of the possible values 00:48:41.590 --> 00:48:44.070 that some random variable y could take on. 00:48:44.070 --> 00:48:48.140 But if I want to know, what is the probability that x takes on this value, 00:48:48.140 --> 00:48:50.870 then I'm going to sum up over all the values j 00:48:50.870 --> 00:48:53.420 that y could take on and say, all right, what's 00:48:53.420 --> 00:48:56.870 the chance that y takes on that value, yj, and multiply it 00:48:56.870 --> 00:49:00.830 by the conditional probability that x takes on this value given 00:49:00.830 --> 00:49:03.180 that y took on that value yj-- 00:49:03.180 --> 00:49:06.470 so equivalent rule just using conditional probabilities 00:49:06.470 --> 00:49:08.120 instead of joint probabilities.
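The conditioning rule looks like this in code; the numbers here are purely illustrative, not values from the lecture:

```python
# Conditioning: P(a) = P(a | b) * P(b) + P(a | not b) * P(not b)
p_b = 0.1               # e.g., probability of rain (illustrative)
p_a_given_b = 0.2       # e.g., probability the train is delayed given rain
p_a_given_not_b = 0.05  # probability the train is delayed given no rain

p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
print(p_a)  # 0.2 * 0.1 + 0.05 * 0.9 = 0.065
```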
00:49:08.120 --> 00:49:10.790 And using the equation we know about joint probabilities, 00:49:10.790 --> 00:49:13.748 we can translate between these two. 00:49:13.748 --> 00:49:15.790 All right, we've seen a whole lot of mathematics, 00:49:15.790 --> 00:49:18.110 and we've just sort of laid the mathematical foundation. 00:49:18.110 --> 00:49:20.777 And no need to worry if you haven't seen probability in too much 00:49:20.777 --> 00:49:22.370 detail up until this point. 00:49:22.370 --> 00:49:24.500 These are sort of the foundations of the ideas 00:49:24.500 --> 00:49:27.560 that are going to come up as we begin to explore how we can now 00:49:27.560 --> 00:49:31.820 take these ideas from probability and begin to apply them to represent 00:49:31.820 --> 00:49:35.120 something inside of our computer, something inside of the AI agent 00:49:35.120 --> 00:49:39.280 we're trying to design that is able to represent information and probabilities 00:49:39.280 --> 00:49:42.600 and the likelihoods between various different events. 00:49:42.600 --> 00:49:45.020 So there are a number of different probabilistic models 00:49:45.020 --> 00:49:48.290 that we can generate, but the first of the models we're going to talk about 00:49:48.290 --> 00:49:50.600 are what are known as Bayesian networks. 00:49:50.600 --> 00:49:52.670 And a Bayesian network is just going to be 00:49:52.670 --> 00:49:56.090 some network of random variables, connected random variables, 00:49:56.090 --> 00:49:58.850 that are going to represent the dependence 00:49:58.850 --> 00:50:00.260 between these random variables. 00:50:00.260 --> 00:50:03.498 And odds are most random variables in this world 00:50:03.498 --> 00:50:05.540 are not independent of each other, that there's 00:50:05.540 --> 00:50:08.840 some relationship between things that are happening that we care about. 00:50:08.840 --> 00:50:12.200 If it is raining today, that might increase the likelihood 00:50:12.200 --> 00:50:14.750 that my flight or my train gets delayed, for example. 00:50:14.750 --> 00:50:17.610 There is some dependence between these random variables, 00:50:17.610 --> 00:50:22.420 and a Bayesian network is going to be able to capture those dependencies. 00:50:22.420 --> 00:50:23.770 So what is a Bayesian network? 00:50:23.770 --> 00:50:26.430 What is its actual structure, and how does it work? 00:50:26.430 --> 00:50:29.230 Well, a Bayesian network is going to be a directed graph. 00:50:29.230 --> 00:50:31.170 And again, we've seen directed graphs before. 00:50:31.170 --> 00:50:34.170 They are individual nodes with arrows or edges 00:50:34.170 --> 00:50:38.897 that connect one node to another node, pointing in a particular direction. 00:50:38.897 --> 00:50:40.980 And so this directed graph is going to have nodes, 00:50:40.980 --> 00:50:43.860 as well, where each node in this directed graph 00:50:43.860 --> 00:50:47.850 is going to represent a random variable, something like the weather or something 00:50:47.850 --> 00:50:51.340 like whether my train was on time or delayed. 00:50:51.340 --> 00:50:54.780 And we're going to have an arrow from a node x to a node y 00:50:54.780 --> 00:50:57.435 to mean that x is a parent of y. 00:50:57.435 --> 00:50:58.560 So that'll be our notation. 00:50:58.560 --> 00:51:02.940 If there's an arrow from x to y, x is going to be considered a parent of y.
00:51:02.940 --> 00:51:06.360 And the reason that's important is because each of these nodes 00:51:06.360 --> 00:51:09.180 is going to have a probability distribution that we're 00:51:09.180 --> 00:51:13.140 going to store along with it, which is the distribution of x given 00:51:13.140 --> 00:51:16.520 some evidence, given the parents of x. 00:51:16.520 --> 00:51:18.480 So the way to more intuitively think about this 00:51:18.480 --> 00:51:22.260 is the parents are going to be thought of as sort of causes for some effect 00:51:22.260 --> 00:51:24.720 that we're going to observe. 00:51:24.720 --> 00:51:27.780 And so let's take a look at an actual example of a Bayesian network 00:51:27.780 --> 00:51:30.270 and think about the types of logic that might be involved 00:51:30.270 --> 00:51:32.070 in reasoning about that network. 00:51:32.070 --> 00:51:35.580 Let's imagine, for a moment, that I have an appointment out of town 00:51:35.580 --> 00:51:38.510 and I need to take a train in order to get to that appointment. 00:51:38.510 --> 00:51:40.260 So what are the things I might care about? 00:51:40.260 --> 00:51:42.620 Well, I care about getting to my appointment on time. 00:51:42.620 --> 00:51:44.370 Either I make it to my appointment and I'm 00:51:44.370 --> 00:51:46.710 able to attend it or I miss the appointment. 00:51:46.710 --> 00:51:49.440 And you might imagine that that's influenced by the train, 00:51:49.440 --> 00:51:54.000 that the train is either on time or it's delayed, for example. 00:51:54.000 --> 00:51:56.370 But that train itself is also influenced. 00:51:56.370 --> 00:52:00.030 Whether the train is on time or not depends maybe on the rain. 00:52:00.030 --> 00:52:00.822 Is there no rain? 00:52:00.822 --> 00:52:01.530 Is it light rain? 00:52:01.530 --> 00:52:02.737 Is there heavy rain? 00:52:02.737 --> 00:52:05.070 And it might also be influenced by other variables, too. 00:52:05.070 --> 00:52:07.050 It might be influenced, as well, by whether 00:52:07.050 --> 00:52:09.608 or not there's maintenance on the train track, for example. 00:52:09.608 --> 00:52:11.400 If there is maintenance on the train track, 00:52:11.400 --> 00:52:15.660 that probably increases the likelihood that my train is delayed. 00:52:15.660 --> 00:52:19.680 And so we can represent all of these ideas using a Bayesian network that 00:52:19.680 --> 00:52:21.360 looks a little something like this. 00:52:21.360 --> 00:52:25.440 Here I have four nodes representing four random variables 00:52:25.440 --> 00:52:26.970 that I would like to keep track of. 00:52:26.970 --> 00:52:29.190 I have one random variable called Rain that 00:52:29.190 --> 00:52:34.080 can take on three possible values in its domain, either none or light or heavy 00:52:34.080 --> 00:52:36.348 for no rain, light rain, or heavy rain. 00:52:36.348 --> 00:52:38.640 I have a variable called Maintenance for whether or not 00:52:38.640 --> 00:52:42.030 there is maintenance on the train track, which has two possible values, just 00:52:42.030 --> 00:52:42.960 either yes or no. 00:52:42.960 --> 00:52:46.355 Either there is maintenance or there is no maintenance happening on the track. 00:52:46.355 --> 00:52:49.230 Then I have a random variable for the train indicating whether 00:52:49.230 --> 00:52:50.490 the train was on time or not. 00:52:50.490 --> 00:52:53.850 That random variable has two possible values in its domain. 00:52:53.850 --> 00:52:57.730 The train is either on time or the train is delayed.
00:52:57.730 --> 00:52:59.803 And then, finally, I have a random variable 00:52:59.803 --> 00:53:01.470 for whether I make it to my appointment. 00:53:01.470 --> 00:53:04.950 For my appointment down here, I have a random variable called Appointment 00:53:04.950 --> 00:53:09.420 that itself has two possible values, attend and miss. 00:53:09.420 --> 00:53:10.920 And so here are the possible values. 00:53:10.920 --> 00:53:12.960 Here are my four nodes, each of which represents 00:53:12.960 --> 00:53:17.160 a random variable, each of which has a domain of possible values 00:53:17.160 --> 00:53:18.500 that it can take on. 00:53:18.500 --> 00:53:21.980 And the arrows, the edges pointing from one node to another, 00:53:21.980 --> 00:53:26.250 encode some notion of dependence inside of this graph, 00:53:26.250 --> 00:53:28.830 that whether I make it to my appointment or not 00:53:28.830 --> 00:53:32.650 is dependent upon whether the train is on time or delayed. 00:53:32.650 --> 00:53:36.390 And whether the train is on time or delayed is dependent on two things, 00:53:36.390 --> 00:53:38.910 given by the two arrows pointing at this node. 00:53:38.910 --> 00:53:42.350 It is dependent on whether or not there was maintenance on the train track, 00:53:42.350 --> 00:53:45.240 and it is also dependent upon whether or not 00:53:45.240 --> 00:53:47.675 it was raining, or whether it is raining. 00:53:47.675 --> 00:53:49.800 And just to make things a little complicated, let's 00:53:49.800 --> 00:53:53.280 say, as well, that whether or not there's maintenance on the track, 00:53:53.280 --> 00:53:55.260 this too might be influenced by the rain. 00:53:55.260 --> 00:53:57.178 Then if there's heavier rain, well, maybe it's 00:53:57.178 --> 00:53:59.970 less likely that there's going to be maintenance on the train track 00:53:59.970 --> 00:54:02.010 that day because they're more likely to want 00:54:02.010 --> 00:54:05.500 to do maintenance on the track on days when it's not raining, for example. 00:54:05.500 --> 00:54:08.350 And so these nodes might have different relationships between them. 00:54:08.350 --> 00:54:10.770 But the idea is that we can come up with a probability 00:54:10.770 --> 00:54:16.370 distribution for any of these nodes based only upon its parents. 00:54:16.370 --> 00:54:20.158 And so let's look node by node at what this probability distribution might 00:54:20.158 --> 00:54:20.950 actually look like. 00:54:20.950 --> 00:54:24.150 And we'll go ahead and begin with this root node, this Rain node here, which 00:54:24.150 --> 00:54:27.630 is at the top and has no arrows pointing into it, 00:54:27.630 --> 00:54:30.510 which means its probability distribution is not 00:54:30.510 --> 00:54:32.410 going to be a conditional distribution. 00:54:32.410 --> 00:54:33.870 It's not based on anything. 00:54:33.870 --> 00:54:38.250 I just have some probability distribution over the possible values 00:54:38.250 --> 00:54:40.520 for the Rain random variable. 00:54:40.520 --> 00:54:43.590 And that distribution might look a little something like this. 00:54:43.590 --> 00:54:46.170 None, light, and heavy-- each have a possible value. 00:54:46.170 --> 00:54:48.300 Here I'm saying the likelihood of no rain 00:54:48.300 --> 00:54:53.790 is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. 00:54:53.790 --> 00:54:58.440 So here is a probability distribution for this root node in this Bayesian 00:54:58.440 --> 00:54:59.770 network. 
00:54:59.770 --> 00:55:03.000 And let's now consider the next node in the network, Maintenance. 00:55:03.000 --> 00:55:05.140 Track maintenance is yes or no. 00:55:05.140 --> 00:55:07.530 And the general idea of what this distribution 00:55:07.530 --> 00:55:09.660 is going to encode, at least in this story, 00:55:09.660 --> 00:55:13.308 is the idea that the heavier the rain is, the less likely 00:55:13.308 --> 00:55:15.600 it is that there's going to be maintenance on the track 00:55:15.600 --> 00:55:18.017 because the people that are doing maintenance on the track 00:55:18.017 --> 00:55:21.190 probably want to wait until a day when it's not as rainy in order to do 00:55:21.190 --> 00:55:23.000 the track maintenance, for example. 00:55:23.000 --> 00:55:25.480 And so what might that probability distribution look like? 00:55:25.480 --> 00:55:28.180 Well, this now is going to be a conditional probability 00:55:28.180 --> 00:55:31.600 distribution, that here are the three possible values for the Rain 00:55:31.600 --> 00:55:34.840 random variable, which I'm here just going to abbreviate to R, either 00:55:34.840 --> 00:55:37.490 no rain, light rain, or heavy rain. 00:55:37.490 --> 00:55:41.590 And for each of those possible values, either there is yes track maintenance 00:55:41.590 --> 00:55:46.120 or no track maintenance, and those have probabilities associated with them, 00:55:46.120 --> 00:55:50.650 and I see here that if it is not raining, 00:55:50.650 --> 00:55:53.620 then there is a probability of 0.4 that there's track maintenance 00:55:53.620 --> 00:55:56.350 and a probability of 0.6 that there isn't. 00:55:56.350 --> 00:55:59.200 But if there's heavy rain, then here the chance 00:55:59.200 --> 00:56:02.020 that there is track maintenance is 0.1 and the chance 00:56:02.020 --> 00:56:04.430 that there is not track maintenance is 0.9. 00:56:04.430 --> 00:56:08.230 Each of these rows is going to sum up to one because each of these 00:56:08.230 --> 00:56:10.930 represents a different value of whether or not 00:56:10.930 --> 00:56:14.710 it's raining, the three possible values that that random variable can take on, 00:56:14.710 --> 00:56:18.160 and each is associated with its own probability distribution. 00:56:18.160 --> 00:56:22.450 That is ultimately all going to add up to the number one. 00:56:22.450 --> 00:56:26.290 So there is our distribution for this random variable called Maintenance 00:56:26.290 --> 00:56:30.110 about whether or not there is maintenance on the train track. 00:56:30.110 --> 00:56:32.050 And now let's consider the next variable. 00:56:32.050 --> 00:56:34.210 Here we have a node inside of our Bayesian network 00:56:34.210 --> 00:56:38.570 called Train that has two possible values, on time and delayed. 00:56:38.570 --> 00:56:42.160 And this node is going to be dependent upon the two nodes that 00:56:42.160 --> 00:56:45.040 are pointing towards it, that whether the train is on time 00:56:45.040 --> 00:56:48.872 or delayed depends on whether or not there is track maintenance, 00:56:48.872 --> 00:56:50.830 and it depends on whether or not there is rain, 00:56:50.830 --> 00:56:55.610 that heavier rain probably means it's more likely that my train is delayed. 00:56:55.610 --> 00:56:58.270 And if there is track maintenance, that also 00:56:58.270 --> 00:57:02.360 probably means it's more likely that my train is delayed as well.
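One way to picture the two distributions so far is as plain Python tables; note that the light-rain row below is a placeholder, since those two numbers aren't read out in the lecture:

```python
# Unconditional distribution for the root node Rain
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# Conditional distribution for Maintenance given Rain; the "light" row
# (0.2 / 0.8) is a placeholder to make the table complete.
p_maintenance_given_rain = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}

# Each row conditions on one value of Rain, so each row sums to 1
for row in p_maintenance_given_rain.values():
    assert abs(sum(row.values()) - 1) < 1e-9
```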
00:57:02.360 --> 00:57:05.350 And so you could construct a larger probability distribution, 00:57:05.350 --> 00:57:07.720 a conditional probability distribution, that 00:57:07.720 --> 00:57:11.530 instead of conditioning on just one variable, as was the case here, 00:57:11.530 --> 00:57:14.380 is now conditioning on two variables, conditioning 00:57:14.380 --> 00:57:19.270 both on rain, represented by R, and on maintenance, represented by M. 00:57:19.270 --> 00:57:23.040 Again, each of these rows has two values that sum up to the number one, 00:57:23.040 --> 00:57:27.310 one for whether the train is on time, one for whether the train is delayed. 00:57:27.310 --> 00:57:29.260 And here I can say something like, all right, 00:57:29.260 --> 00:57:32.950 if I know there was light rain and track maintenance-- well, OK, 00:57:32.950 --> 00:57:36.490 that would be R is light and M is yes-- 00:57:36.490 --> 00:57:40.210 well, then there is a probability of 0.6 that my train is on time 00:57:40.210 --> 00:57:43.540 and a probability of 0.4 that the train is delayed. 00:57:43.540 --> 00:57:47.770 And you can imagine gathering this data just by looking at real-world data, 00:57:47.770 --> 00:57:50.970 looking at data about, all right, if I knew that it was light rain 00:57:50.970 --> 00:57:52.720 and there was track maintenance, how often 00:57:52.720 --> 00:57:54.400 was a train delayed or not delayed, and you 00:57:54.400 --> 00:57:55.930 could begin to construct this thing. 00:57:55.930 --> 00:57:58.060 But the interesting thing is, intelligently, 00:57:58.060 --> 00:57:59.812 being able to try to figure out, how might 00:57:59.812 --> 00:58:01.270 you go about ordering these things? 00:58:01.270 --> 00:58:06.730 What things might influence other nodes inside of this Bayesian network? 00:58:06.730 --> 00:58:08.860 And the last thing I care about is whether or not 00:58:08.860 --> 00:58:10.870 I make it to my appointment. 00:58:10.870 --> 00:58:13.210 So did I attend or miss the appointment? 00:58:13.210 --> 00:58:16.180 And ultimately, whether I attend or miss the appointment, 00:58:16.180 --> 00:58:19.552 it is influenced by track maintenance because it's indirectly this idea 00:58:19.552 --> 00:58:21.760 that, all right, if there is track maintenance, well, 00:58:21.760 --> 00:58:23.450 then my train might more likely be delayed, 00:58:23.450 --> 00:58:25.325 and if my train is more likely to be delayed, 00:58:25.325 --> 00:58:27.280 then I'm more likely to miss my appointment. 00:58:27.280 --> 00:58:29.650 But what we encode in this Bayesian network 00:58:29.650 --> 00:58:32.820 are just what we might consider to be more direct relationships. 00:58:32.820 --> 00:58:35.710 So the train has a direct influence on the appointment. 00:58:35.710 --> 00:58:38.710 And given that I know whether the train is on time or delayed, 00:58:38.710 --> 00:58:40.540 knowing whether there's track maintenance 00:58:40.540 --> 00:58:44.550 isn't going to give me any additional information that I didn't already have, 00:58:44.550 --> 00:58:48.070 that if I know train, these other nodes that are up above 00:58:48.070 --> 00:58:51.150 aren't really going to influence the result. 00:58:51.150 --> 00:58:54.910 And so here we might represent it using another conditional probability 00:58:54.910 --> 00:58:57.430 distribution that looks a little something like this, that 00:58:57.430 --> 00:59:00.160 train can take on two possible values. 00:59:00.160 --> 00:59:02.740 Either my train is on time or my train is delayed.
00:59:02.740 --> 00:59:04.510 And for each of those two possible values, 00:59:04.510 --> 00:59:06.803 I have a distribution for what are the odds 00:59:06.803 --> 00:59:09.220 that I'm able to attend the meeting, and what are the odds 00:59:09.220 --> 00:59:10.090 that I miss the meeting? 00:59:10.090 --> 00:59:12.010 And obviously, if my train is on time, I'm 00:59:12.010 --> 00:59:14.130 much more likely to be able to attend the meeting 00:59:14.130 --> 00:59:16.600 than if my train is delayed, in which case 00:59:16.600 --> 00:59:19.500 I'm more likely to miss that meeting. 00:59:19.500 --> 00:59:21.790 So all of these nodes put altogether here 00:59:21.790 --> 00:59:25.330 represent this Bayesian network, this network of random variables 00:59:25.330 --> 00:59:27.730 whose values I ultimately care about and that 00:59:27.730 --> 00:59:30.380 have some sort of relationship between them, 00:59:30.380 --> 00:59:33.670 some sort of dependence where these arrows from one node to another 00:59:33.670 --> 00:59:37.960 indicate some dependence, that I can calculate the probability of some node 00:59:37.960 --> 00:59:41.870 given the parents that happen to exist there. 00:59:41.870 --> 00:59:45.340 So now that we've been able to describe the structure of this Bayesian network 00:59:45.340 --> 00:59:47.680 and the relationships between each of these nodes, 00:59:47.680 --> 00:59:51.070 by associating each of the nodes in the network with a probability 00:59:51.070 --> 00:59:53.980 distribution, whether that's an unconditional probability 00:59:53.980 --> 00:59:56.200 distribution in the case of this root node here, 00:59:56.200 --> 00:59:59.630 like Rain, or a conditional probability distribution 00:59:59.630 --> 01:00:02.380 in the case of all of the other nodes whose probabilities are 01:00:02.380 --> 01:00:05.000 dependent upon the values of their parents, 01:00:05.000 --> 01:00:09.160 we can begin to do some computation and calculation using the information 01:00:09.160 --> 01:00:10.490 inside of those tables. 01:00:10.490 --> 01:00:12.310 So let's imagine, for example, that I just 01:00:12.310 --> 01:00:15.910 wanted to compute something simple, like the probability of light rain. 01:00:15.910 --> 01:00:18.130 How would I get the probability of light rain? 01:00:18.130 --> 01:00:21.370 Well, light rain-- rain here is a root node. 01:00:21.370 --> 01:00:23.770 And so if I wanted to calculate that probability, 01:00:23.770 --> 01:00:26.740 I could just look at the probability distribution for rain 01:00:26.740 --> 01:00:29.800 and extract from it the probability of light rain. 01:00:29.800 --> 01:00:33.220 It's just a single value that I already have access to. 01:00:33.220 --> 01:00:35.410 But we could also imagine wanting to compute 01:00:35.410 --> 01:00:39.100 more complex joint probabilities, like the probability 01:00:39.100 --> 01:00:42.710 that there is light rain and also no track maintenance. 01:00:42.710 --> 01:00:47.440 This is a joint probability of two values, light rain and no track 01:00:47.440 --> 01:00:48.293 maintenance. 01:00:48.293 --> 01:00:51.460 And the way I might do that is first by starting by saying, all right, well, 01:00:51.460 --> 01:00:54.100 let me get the probability of light rain, but now 01:00:54.100 --> 01:00:57.160 I also want the probability of no track maintenance. 01:00:57.160 --> 01:01:01.630 But, of course, this node is dependent upon the value of rain.
01:01:01.630 --> 01:01:05.350 So what I really want is the probability of no track maintenance given 01:01:05.350 --> 01:01:07.540 that I know that there was light rain. 01:01:07.540 --> 01:01:10.450 And so the expression for calculating this idea 01:01:10.450 --> 01:01:13.870 that the probability of light rain and no track maintenance 01:01:13.870 --> 01:01:17.680 is really just the probability of light rain and the probability 01:01:17.680 --> 01:01:21.250 that there is no track maintenance given that I know that there already 01:01:21.250 --> 01:01:22.210 is light rain. 01:01:22.210 --> 01:01:25.540 So I take the unconditional probability of light rain, 01:01:25.540 --> 01:01:30.160 multiply it by the conditional probability of no track maintenance 01:01:30.160 --> 01:01:32.550 given that I know there is light rain. 01:01:32.550 --> 01:01:35.770 And you can continue to do this again and again for every variable 01:01:35.770 --> 01:01:38.378 that you want to add into this joint probability 01:01:38.378 --> 01:01:39.670 that I might want to calculate. 01:01:39.670 --> 01:01:42.400 If I wanted to know the probability of light rain 01:01:42.400 --> 01:01:45.100 and no track maintenance and a delayed train, 01:01:45.100 --> 01:01:48.850 well, that's going to be the probability of light rain multiplied 01:01:48.850 --> 01:01:50.950 by the probability of no track maintenance 01:01:50.950 --> 01:01:56.218 given light rain multiplied by the probability of a delayed train given 01:01:56.218 --> 01:01:59.260 light rain and no track maintenance, because whether the train is on time 01:01:59.260 --> 01:02:03.190 or delayed is dependent upon both of these other two variables, 01:02:03.190 --> 01:02:05.290 and so I have two pieces of evidence that 01:02:05.290 --> 01:02:08.860 go into the calculation of that conditional probability. 01:02:08.860 --> 01:02:11.470 And each of these three values is just a value 01:02:11.470 --> 01:02:15.640 that I can look up by looking at one of these individual probability 01:02:15.640 --> 01:02:20.140 distributions that is encoded into my Bayesian network. 01:02:20.140 --> 01:02:23.410 And if I wanted a joint probability over all four of the variables, 01:02:23.410 --> 01:02:25.900 something like the probability of light rain 01:02:25.900 --> 01:02:30.130 and no track maintenance and a delayed train and I missed my appointment, 01:02:30.130 --> 01:02:32.890 well, that's going to be multiplying four different values, one 01:02:32.890 --> 01:02:34.870 from each of these individual nodes. 01:02:34.870 --> 01:02:36.970 It's going to be the probability of light rain, 01:02:36.970 --> 01:02:39.370 then of no track maintenance given light rain, 01:02:39.370 --> 01:02:42.882 then of a delayed train given light rain and no track maintenance. 01:02:42.882 --> 01:02:46.090 And then, finally, for this node here for whether I make it to my appointment 01:02:46.090 --> 01:02:50.770 or not, it's not dependent upon these two variables given that I know 01:02:50.770 --> 01:02:52.270 whether or not the train is on time. 01:02:52.270 --> 01:02:55.030 I only need to care about the conditional probability 01:02:55.030 --> 01:03:00.160 that I miss my appointment given that the train happens to be delayed. 01:03:00.160 --> 01:03:04.120 And so that's represented here by four probabilities, each of which 01:03:04.120 --> 01:03:07.420 is located inside of one of these probability distributions 01:03:07.420 --> 01:03:11.092 for each of the nodes, all multiplied together. 
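A sketch of that four-factor product in Python; apart from the rain distribution, most of the table entries below are placeholders rather than values stated in the lecture:

```python
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# Placeholder conditional tables, keyed by the values of each node's parents
p_maintenance = {"light": {"yes": 0.2, "no": 0.8}}
p_train = {("light", "no"): {"on time": 0.7, "delayed": 0.3}}
p_appointment = {"delayed": {"attend": 0.6, "miss": 0.4}}

# P(light, no maintenance, delayed, miss): each node's probability
# conditioned only on its parents, all multiplied together
p = (p_rain["light"]
     * p_maintenance["light"]["no"]
     * p_train[("light", "no")]["delayed"]
     * p_appointment["delayed"]["miss"])
print(p)  # 0.2 * 0.8 * 0.3 * 0.4 = 0.0192
```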
01:03:11.092 --> 01:03:13.300 And so I can take a variable like that and figure out 01:03:13.300 --> 01:03:15.910 what the joint probability is by multiplying 01:03:15.910 --> 01:03:18.280 a whole bunch of these individual probabilities 01:03:18.280 --> 01:03:19.990 from the Bayesian network. 01:03:19.990 --> 01:03:23.110 But, of course, just as with last time where what I really wanted to do 01:03:23.110 --> 01:03:25.463 was to be able to get new pieces of information, 01:03:25.463 --> 01:03:28.630 here, too, this is what we're going to want to do with our Bayesian network. 01:03:28.630 --> 01:03:31.720 In the context of knowledge, we talked about the problem of inference. 01:03:31.720 --> 01:03:34.210 Given things that I know to be true, can I 01:03:34.210 --> 01:03:38.020 draw conclusions, make deductions about other facts about the world 01:03:38.020 --> 01:03:40.270 that I also know to be true? 01:03:40.270 --> 01:03:44.170 And what we're going to do now is apply the same sort of idea to probability. 01:03:44.170 --> 01:03:46.960 Using information about which I have some knowledge, 01:03:46.960 --> 01:03:49.510 whether some evidence or some probabilities, can 01:03:49.510 --> 01:03:52.360 I figure out not other variables for certain, 01:03:52.360 --> 01:03:55.750 but can I figure out the probabilities of other variables taking 01:03:55.750 --> 01:03:57.160 on particular values? 01:03:57.160 --> 01:04:00.160 And so here we introduce the problem of inference 01:04:00.160 --> 01:04:03.970 in a probabilistic setting in a case where variables might not necessarily 01:04:03.970 --> 01:04:06.760 be true for sure, but they might be random variables 01:04:06.760 --> 01:04:10.640 that take on different values with some probability. 01:04:10.640 --> 01:04:13.780 So how do we formally define what exactly this inference problem actually 01:04:13.780 --> 01:04:14.500 is? 01:04:14.500 --> 01:04:17.350 Well, the inference problem has a couple of parts to it. 01:04:17.350 --> 01:04:20.140 We have some query, some variable x that we 01:04:20.140 --> 01:04:21.730 want to compute the distribution for. 01:04:21.730 --> 01:04:24.880 Maybe I want the probability that I missed my train 01:04:24.880 --> 01:04:29.500 or I want the probability that there is track maintenance, something 01:04:29.500 --> 01:04:31.570 that I want information about. 01:04:31.570 --> 01:04:33.437 And then I have some evidence variables. 01:04:33.437 --> 01:04:35.020 Maybe it's just one piece of evidence. 01:04:35.020 --> 01:04:36.760 Maybe it's multiple pieces of evidence. 01:04:36.760 --> 01:04:40.600 But I've observed certain variables for some sort of event. 01:04:40.600 --> 01:04:43.772 So for example, I might have observed that it is raining. 01:04:43.772 --> 01:04:44.980 This is evidence that I have. 01:04:44.980 --> 01:04:47.933 I know that there is light rain or I know that there is heavy rain, 01:04:47.933 --> 01:04:49.100 and that is evidence I have. 01:04:49.100 --> 01:04:52.750 And using that evidence, I want to know, what is the probability 01:04:52.750 --> 01:04:55.430 that my train is delayed, for example? 01:04:55.430 --> 01:04:58.480 And that is a query that I might want to ask based on this evidence. 01:04:58.480 --> 01:05:00.700 So I have a query, some variable, evidence, 01:05:00.700 --> 01:05:03.280 which are some other variables that I have observed inside 01:05:03.280 --> 01:05:05.260 of my Bayesian network, and of course that 01:05:05.260 --> 01:05:08.110 does leave some hidden variables, y. 
01:05:08.110 --> 01:05:11.380 These are variables that are not evidence variables and not 01:05:11.380 --> 01:05:12.550 query variables. 01:05:12.550 --> 01:05:16.090 So you might imagine in the case where I know whether or not it's raining 01:05:16.090 --> 01:05:19.930 and I want to know whether my train is going to be delayed or not, 01:05:19.930 --> 01:05:23.380 the hidden variable, the thing I don't have access to, is something like, 01:05:23.380 --> 01:05:25.130 is there maintenance on the track, or am I 01:05:25.130 --> 01:05:27.380 going to make or not make my appointment, for example? 01:05:27.380 --> 01:05:29.410 These are variables that I don't have access to. 01:05:29.410 --> 01:05:32.680 They're hidden because they're not things I observed, 01:05:32.680 --> 01:05:35.100 and they're also not the query, the thing that I'm asking. 01:05:35.100 --> 01:05:37.480 And so ultimately what we want to calculate 01:05:37.480 --> 01:05:41.650 is I want to know the probability distribution of x given 01:05:41.650 --> 01:05:42.970 e, the event that I observed. 01:05:42.970 --> 01:05:46.150 So given that I observed some event, I observed that it is raining, 01:05:46.150 --> 01:05:49.960 I would like to know, what is the distribution over the possible values 01:05:49.960 --> 01:05:51.640 of the Train random variable? 01:05:51.640 --> 01:05:52.630 Is it on time? 01:05:52.630 --> 01:05:53.440 Is it delayed? 01:05:53.440 --> 01:05:55.750 What is the likelihood it's going to be there? 01:05:55.750 --> 01:05:58.720 And it turns out we can do this calculation just using 01:05:58.720 --> 01:06:02.410 a lot of the probability rules that we've already seen in action. 01:06:02.410 --> 01:06:04.870 And ultimately, we're going to take a look at the math 01:06:04.870 --> 01:06:07.150 at a little bit of a high level, at an abstract level, 01:06:07.150 --> 01:06:09.370 but ultimately we can allow computers and programming 01:06:09.370 --> 01:06:12.610 libraries that already exist to begin to do some of this math for us. 01:06:12.610 --> 01:06:15.810 But it's good to get a general sense for what's actually happening when 01:06:15.810 --> 01:06:18.010 this inference process takes place. 01:06:18.010 --> 01:06:21.190 Let's imagine, for example, that I want to compute the probability 01:06:21.190 --> 01:06:24.430 distribution of the Appointment random variable 01:06:24.430 --> 01:06:28.510 given some evidence, given that I know that there was light rain and no track 01:06:28.510 --> 01:06:29.260 maintenance. 01:06:29.260 --> 01:06:32.830 So there's my evidence, these two variables that I observed the value of. 01:06:32.830 --> 01:06:34.630 I observe the value of rain. 01:06:34.630 --> 01:06:35.920 I know there's light rain. 01:06:35.920 --> 01:06:38.830 And I know that there is no track maintenance going on today. 01:06:38.830 --> 01:06:42.820 And what I care about knowing, my query, is this random variable Appointment. 01:06:42.820 --> 01:06:46.008 I want to know the distribution of this random variable Appointment. 01:06:46.008 --> 01:06:47.800 What is the chance that I am able to attend 01:06:47.800 --> 01:06:50.560 my appointment, what is the chance that I miss my appointment 01:06:50.560 --> 01:06:52.360 given this evidence? 01:06:52.360 --> 01:06:55.870 And the hidden variable, the information that I don't have access to, 01:06:55.870 --> 01:06:57.190 is this variable Train. 
01:06:57.190 --> 01:07:00.040 This is information that is not part of the evidence that I see, 01:07:00.040 --> 01:07:01.660 not something that I observe. 01:07:01.660 --> 01:07:05.050 But it is also not the query that I am asking for. 01:07:05.050 --> 01:07:07.460 And so what might this inference procedure look like? 01:07:07.460 --> 01:07:10.810 Well, if you recall back from when we were defining conditional probability 01:07:10.810 --> 01:07:13.270 and doing math with conditional probabilities, 01:07:13.270 --> 01:07:15.940 we know that a conditional probability is 01:07:15.940 --> 01:07:19.030 proportional to the joint probability. 01:07:19.030 --> 01:07:23.050 And we remember this by recalling that the probability of a given b 01:07:23.050 --> 01:07:25.930 is just some constant factor alpha multiplied 01:07:25.930 --> 01:07:27.583 by the probability of a and b. 01:07:27.583 --> 01:07:29.500 That constant factor alpha turns up because you're 01:07:29.500 --> 01:07:32.620 dividing by the probability of b, but the important thing 01:07:32.620 --> 01:07:34.930 is that it's just some constant multiplied 01:07:34.930 --> 01:07:37.450 by the joint distribution, the probability 01:07:37.450 --> 01:07:40.070 that all of these individual things happen. 01:07:40.070 --> 01:07:42.610 So in this case, I can take the probability 01:07:42.610 --> 01:07:47.380 of the Appointment random variable given light rain and no track maintenance 01:07:47.380 --> 01:07:51.070 and say that is just going to be proportional to, some constant alpha 01:07:51.070 --> 01:07:54.700 multiplied by, the joint probability, the probability of a particular value 01:07:54.700 --> 01:08:00.410 for the appointment random variable, and light rain and no track maintenance. 01:08:00.410 --> 01:08:02.980 Well, all right, how do I calculate this, probability 01:08:02.980 --> 01:08:05.350 of appointment and light rain and no track maintenance, 01:08:05.350 --> 01:08:07.480 when what I really care about is knowing-- 01:08:07.480 --> 01:08:11.260 I need all four of these values to be able to calculate a joint distribution 01:08:11.260 --> 01:08:13.990 across everything, because, then, a particular appointment 01:08:13.990 --> 01:08:16.420 depends upon the value of train. 01:08:16.420 --> 01:08:18.399 Well, in order to do that, here I can begin 01:08:18.399 --> 01:08:21.430 to use that marginalization trick, that there are only 01:08:21.430 --> 01:08:24.640 two ways I can get any configuration of an appointment, light rain, 01:08:24.640 --> 01:08:25.859 and no track maintenance. 01:08:25.859 --> 01:08:28.120 Either this particular setting of variables 01:08:28.120 --> 01:08:33.130 happens and the train is on time or this particular setting of variables happens 01:08:33.130 --> 01:08:34.180 and the train is delayed. 01:08:34.180 --> 01:08:37.520 Those are two possible cases that I would want to consider. 01:08:37.520 --> 01:08:40.149 And if I add those two cases up, well, then I 01:08:40.149 --> 01:08:44.859 get the result just by adding up all of the possibilities for the hidden 01:08:44.859 --> 01:08:46.990 variable, or variables if there are multiple. 01:08:46.990 --> 01:08:49.090 But since there's only one hidden variable here, 01:08:49.090 --> 01:08:53.229 Train, all I need to do is iterate over all the possible values for that hidden 01:08:53.229 --> 01:08:56.600 variable Train and add up their probabilities.
01:08:56.600 --> 01:08:59.529 So this probability expression here becomes 01:08:59.529 --> 01:09:02.890 the probability distribution over Appointment, light rain, no track maintenance, and the train 01:09:02.890 --> 01:09:06.010 is on time, and the probability distribution 01:09:06.010 --> 01:09:10.120 over the Appointment, light rain, no track maintenance, and the train 01:09:10.120 --> 01:09:11.660 is delayed, for example. 01:09:11.660 --> 01:09:15.597 So I take both of the possible values for train, go ahead and add them up. 01:09:15.597 --> 01:09:16.180 These are just 01:09:16.180 --> 01:09:18.722 joint probabilities that we saw earlier how to calculate just 01:09:18.722 --> 01:09:22.120 by going parent, parent, parent, parent and calculating those probabilities 01:09:22.120 --> 01:09:23.615 and multiplying them together. 01:09:23.615 --> 01:09:26.740 And then you'll need to normalize them at the end, speaking at a high level, 01:09:26.740 --> 01:09:29.920 to make sure that everything adds up to the number one. 01:09:29.920 --> 01:09:32.229 So the formula for how you do this, in a process known 01:09:32.229 --> 01:09:35.223 as inference by enumeration, looks a little bit complicated, 01:09:35.223 --> 01:09:36.640 but ultimately it looks like this. 01:09:36.640 --> 01:09:39.550 And let's now try to distill what it is that all of these symbols 01:09:39.550 --> 01:09:40.420 actually mean. 01:09:40.420 --> 01:09:41.410 Let's start here. 01:09:41.410 --> 01:09:46.029 What I care about knowing is the probability of x, my query variable, 01:09:46.029 --> 01:09:48.370 given some sort of evidence. 01:09:48.370 --> 01:09:50.410 What do I know about conditional probabilities? 01:09:50.410 --> 01:09:55.030 Well, a conditional probability is proportional to the joint probability. 01:09:55.030 --> 01:09:57.850 So we had some alpha, some normalizing constant, 01:09:57.850 --> 01:10:01.840 multiplied by this joint probability of x and evidence. 01:10:01.840 --> 01:10:03.410 And how do I calculate that? 01:10:03.410 --> 01:10:05.980 Well, to do that, I'm going to marginalize over 01:10:05.980 --> 01:10:07.420 all of the hidden variables. 01:10:07.420 --> 01:10:10.450 All the variables that I don't directly observe the values for, 01:10:10.450 --> 01:10:13.390 I'm basically going to iterate over all of the possibilities 01:10:13.390 --> 01:10:16.040 that it could happen and just sum them all up. 01:10:16.040 --> 01:10:19.270 And so I can translate this into a sum over all y, which 01:10:19.270 --> 01:10:22.450 ranges over all the possible hidden variables and the values 01:10:22.450 --> 01:10:27.250 that they could take on, and adds up all of those possible individual 01:10:27.250 --> 01:10:28.300 probabilities. 01:10:28.300 --> 01:10:32.195 And that is going to allow me to do this process of inference by enumeration. 01:10:32.195 --> 01:10:34.570 And ultimately, it's pretty annoying if we as humans have 01:10:34.570 --> 01:10:36.713 to do all of this math for ourselves. 01:10:36.713 --> 01:10:39.880 But it turns out this is where computers and AI can be particularly helpful, 01:10:39.880 --> 01:10:43.360 that we can program a computer to understand a Bayesian network to be 01:10:43.360 --> 01:10:45.610 able to understand these inference procedures 01:10:45.610 --> 01:10:47.560 and to be able to do these calculations. 01:10:47.560 --> 01:10:49.390 And using the information you've seen here, 01:10:49.390 --> 01:10:52.150 you could implement a Bayesian network from scratch yourself.
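For reference, the slide's formula isn't reproduced in this transcript, but based on the description it is the standard inference-by-enumeration identity, where X is the query variable, e is the observed evidence, y ranges over all possible values of the hidden variables, and alpha is the normalizing constant:

```
P(X \mid e) = \alpha \, P(X, e) = \alpha \sum_{y} P(X, e, y)
```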
01:10:52.150 --> 01:10:54.733 But it turns out there are a lot of libraries, especially written 01:10:54.733 --> 01:10:56.650 in Python, that allow us to make it easier 01:10:56.650 --> 01:10:58.780 to do this sort of probabilistic inference 01:10:58.780 --> 01:11:01.788 to be able to take a Bayesian network and do these sorts of calculations 01:11:01.788 --> 01:11:04.830 so that you don't need to know and understand all of the underlying math, 01:11:04.830 --> 01:11:07.372 though it's helpful to have a general sense for how it works. 01:11:07.372 --> 01:11:10.330 But you just need to be able to describe the structure of the network 01:11:10.330 --> 01:11:14.350 and make queries in order to be able to produce the result. 01:11:14.350 --> 01:11:17.050 And so let's take a look at an example of that right now. 01:11:17.050 --> 01:11:19.420 It turns out that there are a lot of possible libraries 01:11:19.420 --> 01:11:21.803 that exist in Python for doing this sort of inference. 01:11:21.803 --> 01:11:24.220 It doesn't matter too much which specific library you use. 01:11:24.220 --> 01:11:26.330 They all behave in fairly similar ways. 01:11:26.330 --> 01:11:29.170 But the library I'm going to use here is one known as pomegranate. 01:11:29.170 --> 01:11:33.820 And here inside of model.py, I have defined a Bayesian network 01:11:33.820 --> 01:11:38.070 just using the structure and the syntax that the pomegranate library expects. 01:11:38.070 --> 01:11:40.930 And what I'm effectively doing is just, in Python, 01:11:40.930 --> 01:11:44.740 creating nodes to represent each of the nodes of the Bayesian network 01:11:44.740 --> 01:11:47.060 that you saw me describe a moment ago. 01:11:47.060 --> 01:11:49.750 So here on line four, after I've imported pomegranate, 01:11:49.750 --> 01:11:52.540 I'm defining a variable called rain that is going to represent 01:11:52.540 --> 01:11:55.990 a node inside of my Bayesian network. 01:11:55.990 --> 01:11:59.530 It's going to be a node that follows this distribution where 01:11:59.530 --> 01:12:01.030 there are three possible values-- 01:12:01.030 --> 01:12:03.970 none for no rain, light for light rain, heavy for heavy rain. 01:12:03.970 --> 01:12:07.180 And these are the probabilities of each of those taking place. 01:12:07.180 --> 01:12:13.630 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. 01:12:13.630 --> 01:12:15.760 Then, after that, we go to the next variable, 01:12:15.760 --> 01:12:18.400 the variable for track maintenance, for example, which 01:12:18.400 --> 01:12:20.990 is dependent upon that rain variable. 01:12:20.990 --> 01:12:23.890 And this, instead of being an unconditional distribution, 01:12:23.890 --> 01:12:27.370 is a conditional distribution, as indicated by a conditional probability 01:12:27.370 --> 01:12:28.430 table here. 01:12:28.430 --> 01:12:33.790 And the idea is that this is conditional on the distribution of rain. 01:12:33.790 --> 01:12:36.700 So if there is no rain, then the chance that there is yes 01:12:36.700 --> 01:12:38.370 track maintenance is 0.4. 01:12:38.370 --> 01:12:41.720 If there's no rain, the chance that there is no track maintenance is 0.6. 01:12:41.720 --> 01:12:43.720 Likewise, for light rain, I have a distribution. 01:12:43.720 --> 01:12:45.760 For heavy rain, I have a distribution, as well.
01:12:45.760 --> 01:12:48.130 But I'm effectively encoding the same information 01:12:48.130 --> 01:12:50.110 you saw represented graphically a moment ago, 01:12:50.110 --> 01:12:53.110 but I'm telling this Python program that the maintenance 01:12:53.110 --> 01:12:57.640 node obeys this particular conditional probability distribution. 01:12:57.640 --> 01:13:01.090 And we do the same thing for the other random variables, as well. 01:13:01.090 --> 01:13:06.310 Train was a node inside my network whose distribution was a conditional probability 01:13:06.310 --> 01:13:08.050 table with two parents. 01:13:08.050 --> 01:13:11.380 It was dependent not only on rain, but also on track maintenance. 01:13:11.380 --> 01:13:15.310 And so here I'm saying something like, given that there is no rain and yes 01:13:15.310 --> 01:13:19.630 track maintenance, the probability that my train is on time is 0.8, 01:13:19.630 --> 01:13:22.240 and the probability that it's delayed is 0.2. 01:13:22.240 --> 01:13:24.220 And likewise, I can do the same thing for all 01:13:24.220 --> 01:13:28.330 of the other possible values of the parents of the train node 01:13:28.330 --> 01:13:32.800 inside of my Bayesian network by saying, for all of those possible values, 01:13:32.800 --> 01:13:36.350 here is the distribution that the train node should follow. 01:13:36.350 --> 01:13:38.710 And I do the same thing for an appointment 01:13:38.710 --> 01:13:41.830 based on the distribution of the variable Train. 01:13:41.830 --> 01:13:45.340 Then, at the end, what I do is actually construct this network 01:13:45.340 --> 01:13:47.860 by describing what the states of the network are 01:13:47.860 --> 01:13:50.660 and by adding edges between the dependent nodes. 01:13:50.660 --> 01:13:53.110 So I create a new Bayesian network, add states to it-- 01:13:53.110 --> 01:13:56.650 one for rain, one for maintenance, one for train, one for the appointment-- 01:13:56.650 --> 01:14:00.460 and then I add edges connecting the related pieces. 01:14:00.460 --> 01:14:04.570 Rain has an arrow to maintenance because rain influences track maintenance, 01:14:04.570 --> 01:14:08.530 rain also influences the train, maintenance also influences the train, 01:14:08.530 --> 01:14:11.140 and train influences whether I make it to my appointment, 01:14:11.140 --> 01:14:14.800 and calling bake just finalizes the model and does some additional computation. 01:14:14.800 --> 01:14:18.250 So the specific syntax of this is not really the important part. 01:14:18.250 --> 01:14:20.980 Pomegranate just happens to be one of several different libraries 01:14:20.980 --> 01:14:22.990 that can all be used for similar purposes, 01:14:22.990 --> 01:14:26.170 and you could describe and define a library for yourself 01:14:26.170 --> 01:14:28.010 that implemented similar things. 01:14:28.010 --> 01:14:30.430 But the key idea here is that someone can 01:14:30.430 --> 01:14:33.220 design a library for a general Bayesian network that 01:14:33.220 --> 01:14:35.680 has nodes that are based upon its parents, 01:14:35.680 --> 01:14:39.190 and then all a programmer needs to do, using one of those libraries, 01:14:39.190 --> 01:14:43.420 is to define what those nodes and what those probability distributions are, 01:14:43.420 --> 01:14:47.000 and we can begin to do some interesting logic based on it.
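For reference, here is a sketch of what that model.py might look like. It assumes the classic pomegranate 0.x API used in the lecture (newer pomegranate releases changed this interface), and the conditional probability table rows that the transcript doesn't spell out are filled in with illustrative values rather than the lecture's exact numbers:

```python
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

# Rain has no parents: an unconditional distribution (values from the lecture).
rain = Node(DiscreteDistribution({
    "none": 0.7, "light": 0.2, "heavy": 0.1
}), name="rain")

# Track maintenance is conditional on rain. The "none" row matches the
# lecture (0.4 / 0.6); the other rows are illustrative assumptions.
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4], ["none", "no", 0.6],
    ["light", "yes", 0.2], ["light", "no", 0.8],
    ["heavy", "yes", 0.1], ["heavy", "no", 0.9],
], [rain.distribution]), name="maintenance")

# Train is conditional on both rain and maintenance. The ("none", "yes")
# row matches the lecture (0.8 / 0.2); the other rows are illustrative.
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8], ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9], ["none", "no", "delayed", 0.1],
    ["light", "yes", "on time", 0.6], ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7], ["light", "no", "delayed", 0.3],
    ["heavy", "yes", "on time", 0.4], ["heavy", "yes", "delayed", 0.6],
    ["heavy", "no", "on time", 0.5], ["heavy", "no", "delayed", 0.5],
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment depends only on train (0.9/0.1 and 0.6/0.4 per the lecture).
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9], ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6], ["delayed", "miss", 0.4],
], [train.distribution]), name="appointment")

# Assemble the network: add the states, then an edge from each parent
# to each of its children, and finalize with bake.
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)
model.bake()
```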
01:14:47.000 --> 01:14:50.200 So let's try doing that conditional or joint probability 01:14:50.200 --> 01:14:56.800 calculation that we saw done by hand before by going into likelihood.py 01:14:56.800 --> 01:15:00.340 where here I'm importing the model that I just defined a moment ago. 01:15:00.340 --> 01:15:03.100 And here I'd just like to calculate model.probability, 01:15:03.100 --> 01:15:06.320 which calculates the probability for a given observation, 01:15:06.320 --> 01:15:10.270 and I'd like to calculate the probability of no rain, 01:15:10.270 --> 01:15:13.330 no track maintenance, my train is on time, 01:15:13.330 --> 01:15:14.950 and I'm able to attend the meeting-- 01:15:14.950 --> 01:15:16.870 so sort of the optimal scenario, that there's 01:15:16.870 --> 01:15:20.162 no rain and no maintenance on the track, my train is on time, 01:15:20.162 --> 01:15:21.620 and I'm able to attend the meeting. 01:15:21.620 --> 01:15:25.020 What is the probability that all of that actually happens? 01:15:25.020 --> 01:15:26.900 And I can calculate that using the library 01:15:26.900 --> 01:15:28.700 and just print out its probability. 01:15:28.700 --> 01:15:32.780 And so I'll go ahead and run python likelihood.py, 01:15:32.780 --> 01:15:37.190 and I see that, OK, the probability is about 0.34. 01:15:37.190 --> 01:15:40.850 So about a third of the time, everything goes right for me, in this case-- 01:15:40.850 --> 01:15:43.190 no rain, no track maintenance, train is on time, 01:15:43.190 --> 01:15:45.032 and I'm able to attend the meeting. 01:15:45.032 --> 01:15:47.990 But I could experiment with this, try and calculate other probabilities 01:15:47.990 --> 01:15:48.650 as well. 01:15:48.650 --> 01:15:51.860 What's the probability that everything goes right up until the train 01:15:51.860 --> 01:15:57.020 but I still miss my meeting-- so no rain, no track maintenance, train 01:15:57.020 --> 01:15:59.690 is on time, but I miss the appointment. 01:15:59.690 --> 01:16:04.680 Let's calculate that probability, and that has a probability of about 0.04. 01:16:04.680 --> 01:16:07.643 So about 4% of the time the train will be on time, 01:16:07.643 --> 01:16:09.560 there won't be any rain, no track maintenance, 01:16:09.560 --> 01:16:12.420 and yet I'll still miss the meeting. 01:16:12.420 --> 01:16:14.780 And so this is really just an implementation 01:16:14.780 --> 01:16:17.900 of the calculation of the joint probabilities that we did before. 01:16:17.900 --> 01:16:20.150 What this library is likely doing is first 01:16:20.150 --> 01:16:23.600 figuring out the probability of no rain, then figuring 01:16:23.600 --> 01:16:26.030 out the probability of no track maintenance 01:16:26.030 --> 01:16:28.580 given no rain, then the probability that my train is 01:16:28.580 --> 01:16:31.760 on time given both of these values, and then the probability 01:16:31.760 --> 01:16:35.930 that I miss my appointment given that I know that the train was on time. 01:16:35.930 --> 01:16:39.070 So this, again, is the calculation of that joint probability. 01:16:39.070 --> 01:16:42.320 And it turns out we can also begin to have our computer solve inference problems, 01:16:42.320 --> 01:16:45.980 as well, to begin to infer, based on information, evidence 01:16:45.980 --> 01:16:51.000 that we see, what is the likelihood of other variables also being true?
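A minimal likelihood.py in that spirit, again assuming the pomegranate 0.x API and the model sketched above, might look like:

```python
from model import model

# Probability of one complete assignment of all four variables, listed in
# the order the states were added: rain, maintenance, train, appointment.
probability = model.probability([["none", "no", "on time", "attend"]])

print(probability)  # roughly 0.34 with the lecture's numbers
```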
01:16:51.000 --> 01:16:54.740 So let's go into inference.py, for example, where here I'm, 01:16:54.740 --> 01:16:57.110 again, importing that exact same model from before, 01:16:57.110 --> 01:16:59.300 importing all the nodes and all the edges 01:16:59.300 --> 01:17:03.300 and the probability distribution that is encoded there, as well. 01:17:03.300 --> 01:17:06.320 And now there's a function for doing some sort of prediction. 01:17:06.320 --> 01:17:10.760 And here, into this model, I pass in the evidence that I observe. 01:17:10.760 --> 01:17:14.750 So here I've encoded into this Python program the evidence 01:17:14.750 --> 01:17:15.770 that I have observed. 01:17:15.770 --> 01:17:18.950 I have observed the fact that the train is delayed, 01:17:18.950 --> 01:17:22.190 and that is the value for one of the four random variables 01:17:22.190 --> 01:17:24.140 inside of this Bayesian network. 01:17:24.140 --> 01:17:26.210 And using that information, I would like to be 01:17:26.210 --> 01:17:29.270 able to draw inferences and figure out conclusions 01:17:29.270 --> 01:17:31.875 about the values of the other random variables 01:17:31.875 --> 01:17:33.500 that are inside of my Bayesian network. 01:17:33.500 --> 01:17:36.240 I would like to make predictions about everything else. 01:17:36.240 --> 01:17:40.340 So all of the actual computational logic is happening in just these three lines 01:17:40.340 --> 01:17:42.260 where I'm making this call to this prediction. 01:17:42.260 --> 01:17:45.830 Down below, I'm just iterating over all of the states and all the predictions 01:17:45.830 --> 01:17:49.860 and just printing them out so that we can visually see what the results are. 01:17:49.860 --> 01:17:51.980 But let's find out, given the train is delayed, 01:17:51.980 --> 01:17:56.210 what can I predict about the values of the other random variables? 01:17:56.210 --> 01:17:59.021 Let's go ahead and run python inference.py. 01:17:59.021 --> 01:18:00.005 I run that. 01:18:00.005 --> 01:18:01.880 And all right, here is the result that I get. 01:18:01.880 --> 01:18:04.640 Given the fact that I know that the train is delayed-- 01:18:04.640 --> 01:18:06.770 this is evidence that I have observed-- 01:18:06.770 --> 01:18:10.490 well, given that, there is a 45% chance or a 46% chance 01:18:10.490 --> 01:18:12.520 that there was no rain, a 31% chance there 01:18:12.520 --> 01:18:15.230 was light rain, and a 23% chance there was heavy rain, 01:18:15.230 --> 01:18:17.712 and I can see a probability distribution over track maintenance 01:18:17.712 --> 01:18:19.670 and a probability distribution over whether I'm 01:18:19.670 --> 01:18:22.130 able to attend or miss my appointment. 01:18:22.130 --> 01:18:23.990 Now, we know that whether I attend or miss 01:18:23.990 --> 01:18:27.715 the appointment, that is only dependent upon the train being delayed 01:18:27.715 --> 01:18:28.340 or not delayed. 01:18:28.340 --> 01:18:30.540 It shouldn't depend on anything else. 01:18:30.540 --> 01:18:34.610 So let's imagine, for example, that I knew that there was heavy rain. 01:18:34.610 --> 01:18:38.620 That shouldn't affect the distribution for making the appointment. 01:18:38.620 --> 01:18:41.360 And indeed, if I go up here and add some evidence, 01:18:41.360 --> 01:18:44.128 say that I know that the value of rain is heavy-- 01:18:44.128 --> 01:18:45.920 that is evidence that I now have access to. 01:18:45.920 --> 01:18:47.420 I now have two pieces of evidence.
01:18:47.420 --> 01:18:51.950 I know that the rain is heavy, and I know that my train is delayed. 01:18:51.950 --> 01:18:55.550 I can calculate the probability by running this inference procedure again 01:18:55.550 --> 01:18:57.090 and seeing the result. 01:18:57.090 --> 01:18:58.340 I know that the rain is heavy. 01:18:58.340 --> 01:18:59.840 I know my train is delayed. 01:18:59.840 --> 01:19:02.990 The probability distribution for track maintenance changed. 01:19:02.990 --> 01:19:05.130 Given that I know that there is heavy rain, 01:19:05.130 --> 01:19:08.750 now it's more likely that there is no track maintenance, 88% as 01:19:08.750 --> 01:19:12.250 opposed to 64% from here before. 01:19:12.250 --> 01:19:16.040 And now what is the probability that I make the appointment? 01:19:16.040 --> 01:19:17.480 Well, that's the same as before. 01:19:17.480 --> 01:19:21.100 It's still going to be attend the appointment with probability 0.6, 01:19:21.100 --> 01:19:23.450 miss the appointment with probability 0.4, 01:19:23.450 --> 01:19:27.290 because it was only dependent upon whether or not my train was on time 01:19:27.290 --> 01:19:28.260 or delayed. 01:19:28.260 --> 01:19:31.610 And so this here is implementing that idea of that inference algorithm 01:19:31.610 --> 01:19:34.130 to be able to figure out, based on the evidence 01:19:34.130 --> 01:19:37.970 that I have, what can we infer about the values of the other variables that 01:19:37.970 --> 01:19:39.050 exist as well? 01:19:39.050 --> 01:19:42.890 So inference by enumeration is one way of doing this inference procedure, 01:19:42.890 --> 01:19:46.730 just looping over all of the values the hidden variables could take on 01:19:46.730 --> 01:19:49.460 and figuring out what the probability is. 01:19:49.460 --> 01:19:52.010 Now, it turns out this is not particularly efficient, 01:19:52.010 --> 01:19:56.180 and there are definitely optimizations you can make by avoiding repeated work 01:19:56.180 --> 01:19:59.030 if you're calculating the same sort of probability multiple times. 01:19:59.030 --> 01:20:02.570 There are ways of optimizing the program to avoid having to recalculate 01:20:02.570 --> 01:20:04.640 the same probabilities again and again. 01:20:04.640 --> 01:20:06.980 But even then, as the number of variables 01:20:06.980 --> 01:20:10.220 gets large, as the number of possible values those variables could take on 01:20:10.220 --> 01:20:12.110 gets large, we're going to start to have to do 01:20:12.110 --> 01:20:14.600 a lot of computation, a lot of calculation, 01:20:14.600 --> 01:20:16.190 to be able to do this inference. 01:20:16.190 --> 01:20:18.150 And at that point, it might start to become 01:20:18.150 --> 01:20:20.250 unreasonable in terms of the amount of time 01:20:20.250 --> 01:20:24.615 that it would take to be able to do this sort of exact inference. 01:20:24.615 --> 01:20:26.490 And it's for that reason that oftentimes when 01:20:26.490 --> 01:20:29.970 it comes towards probability and things we're not entirely sure about, 01:20:29.970 --> 01:20:32.280 we don't always care about doing exact inference 01:20:32.280 --> 01:20:35.040 and knowing exactly what the probability is.
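Before moving on from exact inference, here is a sketch of what that inference.py might look like, under the same pomegranate 0.x assumptions; predict_proba performs the inference given a dictionary of observed evidence:

```python
from model import model

# Evidence that we have observed: the train is delayed.
predictions = model.predict_proba({"train": "delayed"})

# Print the inferred distribution for every node in the network.
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        # Evidence variables come back fixed to their observed value.
        print(f"{node.name}: {prediction}")
    else:
        print(f"{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability:.4f}")
```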
01:20:35.040 --> 01:20:37.560 But if we can approximate the inference procedure, 01:20:37.560 --> 01:20:41.570 do some sort of approximate inference, that can be pretty good as well, 01:20:41.570 --> 01:20:43.550 that if I don't know the exact probability 01:20:43.550 --> 01:20:45.510 but I have a general sense for the probability, 01:20:45.510 --> 01:20:49.200 one that I can get increasingly accurate with more time, that's probably 01:20:49.200 --> 01:20:53.620 pretty good, especially if I can get that to happen even faster. 01:20:53.620 --> 01:20:57.930 So how could I do approximate inference inside of a Bayesian network? 01:20:57.930 --> 01:21:00.480 Well, one method is through a procedure known as sampling. 01:21:00.480 --> 01:21:04.980 In the process of sampling, I'm going to take a sample of all of the variables 01:21:04.980 --> 01:21:06.840 inside of this Bayesian network here. 01:21:06.840 --> 01:21:08.280 And how am I going to sample? 01:21:08.280 --> 01:21:12.240 Well, I'm going to sample one of the values from each of these nodes 01:21:12.240 --> 01:21:14.560 according to their probability distribution. 01:21:14.560 --> 01:21:16.560 So how might I take a sample of all these nodes? 01:21:16.560 --> 01:21:17.430 Well, I'll start at the root. 01:21:17.430 --> 01:21:18.450 I'll start with rain. 01:21:18.450 --> 01:21:21.060 Here's the distribution for rain, and I'll go ahead 01:21:21.060 --> 01:21:23.880 and, using a random number generator or something like it, 01:21:23.880 --> 01:21:25.770 randomly pick one of these three values. 01:21:25.770 --> 01:21:29.730 I'll pick none with probability 0.7, light with probability 0.2, 01:21:29.730 --> 01:21:31.440 and heavy with probability 0.1. 01:21:31.440 --> 01:21:34.770 So I'll randomly just pick one of them according to that distribution, 01:21:34.770 --> 01:21:37.780 and maybe, in this case, I pick none, for example. 01:21:37.780 --> 01:21:39.780 Then I do the same thing for the other variable. 01:21:39.780 --> 01:21:42.410 Maintenance also has a probability distribution. 01:21:42.410 --> 01:21:44.070 And I am going to sample-- 01:21:44.070 --> 01:21:46.470 now, there are three probability distributions here, 01:21:46.470 --> 01:21:49.050 but I'm only going to sample from this first row 01:21:49.050 --> 01:21:53.950 here because I've observed already in my sample that the value of rain is none. 01:21:53.950 --> 01:21:54.450 So, 01:21:54.450 --> 01:21:58.295 given that rain is none, I'm going to sample from this distribution to say, 01:21:58.295 --> 01:22:00.420 all right, what should the value of maintenance be? 01:22:00.420 --> 01:22:02.753 And in this case, maintenance is going to be, let's just 01:22:02.753 --> 01:22:06.570 say, yes, which happens 40% of the time in the event that there is no rain, 01:22:06.570 --> 01:22:07.603 for example. 01:22:07.603 --> 01:22:10.020 And we'll sample all of the rest of the nodes in this way, 01:22:10.020 --> 01:22:12.840 as well, that I want to sample from the train distribution, 01:22:12.840 --> 01:22:17.040 and I'll sample from this first row here where there is no rain, 01:22:17.040 --> 01:22:18.570 but there is track maintenance. 01:22:18.570 --> 01:22:21.980 And I'll sample: 80% of the time, I'll say the train is on time. 01:22:21.980 --> 01:22:24.463 20% of the time, I'll say the train is delayed. 01:22:24.463 --> 01:22:27.630 And finally, we'll do the same thing for whether I make it to my appointment 01:22:27.630 --> 01:22:27.890 or not.
01:22:27.890 --> 01:22:29.490 Did I attend or miss the appointment? 01:22:29.490 --> 01:22:32.700 We'll sample based on this distribution and maybe say that in this case 01:22:32.700 --> 01:22:36.150 I attend the appointment, which happens 90% of the time when 01:22:36.150 --> 01:22:38.730 the train is actually on time. 01:22:38.730 --> 01:22:42.900 So by going through these nodes, I can very quickly just do some sampling 01:22:42.900 --> 01:22:45.720 and get a sample of the possible values that 01:22:45.720 --> 01:22:48.990 could come up from going through this entire Bayesian network 01:22:48.990 --> 01:22:51.540 according to those probability distributions. 01:22:51.540 --> 01:22:54.360 And where this becomes powerful is if I do this not once, 01:22:54.360 --> 01:22:57.100 but I do this thousands or tens of thousands of times 01:22:57.100 --> 01:23:00.400 and generate a whole bunch of samples, all using this distribution. 01:23:00.400 --> 01:23:01.410 I get different samples. 01:23:01.410 --> 01:23:02.820 Maybe some of them are the same. 01:23:02.820 --> 01:23:07.800 But I get a value for each of the possible variables that could come up. 01:23:07.800 --> 01:23:10.620 And so then, if I'm ever faced with a question, a question like, 01:23:10.620 --> 01:23:13.860 what is the probability that the train is on time, 01:23:13.860 --> 01:23:15.900 you could do an exact inference procedure. 01:23:15.900 --> 01:23:18.630 This is no different than the inference problem we had before 01:23:18.630 --> 01:23:21.780 where I could just marginalize, look at all the possible other values 01:23:21.780 --> 01:23:24.390 of the variables and do the computation of inference 01:23:24.390 --> 01:23:28.200 by enumeration to find out this probability exactly. 01:23:28.200 --> 01:23:31.710 But I could also, if I don't care about the exact probability, just sample it. 01:23:31.710 --> 01:23:33.150 Approximate it to get close. 01:23:33.150 --> 01:23:35.040 And this is a powerful tool in AI where we 01:23:35.040 --> 01:23:38.790 don't need to be right 100% of the time or we don't need to be exactly right. 01:23:38.790 --> 01:23:41.130 If we just need to be right with some probability, 01:23:41.130 --> 01:23:44.290 we can often do so more effectively, more efficiently. 01:23:44.290 --> 01:23:46.920 And so here, now, are all of those possible samples. 01:23:46.920 --> 01:23:50.390 I'll sort of highlight the ones where the train is on time. 01:23:50.390 --> 01:23:52.620 I'm ignoring the ones where the train is delayed. 01:23:52.620 --> 01:23:55.350 And in this case, six out of eight 01:23:55.350 --> 01:23:57.690 of the samples have the train arriving on time. 01:23:57.690 --> 01:24:01.320 And so maybe, in this case, I can say that, in six out of eight cases, 01:24:01.320 --> 01:24:03.458 that's the likelihood that the train is on time. 01:24:03.458 --> 01:24:06.000 And with eight samples, that might not be a great prediction. 01:24:06.000 --> 01:24:08.520 But if I had thousands upon thousands of samples, 01:24:08.520 --> 01:24:11.580 then this could be a much better inference procedure 01:24:11.580 --> 01:24:13.680 to be able to do these sorts of calculations. 01:24:13.680 --> 01:24:17.310 So this is a direct sampling method to just do a bunch of samples 01:24:17.310 --> 01:24:21.210 and then figure out what the probability of some event is. 01:24:21.210 --> 01:24:24.400 Now, this from before was an unconditional probability. 01:24:24.400 --> 01:24:27.447 What is the probability that the train is on time?
01:24:27.447 --> 01:24:30.030 And I did that by looking at all the samples and figuring out, 01:24:30.030 --> 01:24:32.372 right here, the ones where the train is on time. 01:24:32.372 --> 01:24:34.080 But sometimes what I'll want to calculate 01:24:34.080 --> 01:24:38.387 is not an unconditional probability, but rather a conditional probability, 01:24:38.387 --> 01:24:40.470 something like, what is the probability that there 01:24:40.470 --> 01:24:45.010 is light rain given that the train is on time, something to that effect. 01:24:45.010 --> 01:24:50.060 And to do that kind of calculation, well, what I might do is here 01:24:50.060 --> 01:24:52.140 are all the samples that I have, and I want 01:24:52.140 --> 01:24:54.720 to calculate a probability distribution given 01:24:54.720 --> 01:24:57.368 that I know that the train is on time. 01:24:57.368 --> 01:24:59.910 So to be able to do that, I can kind of look at the two cases 01:24:59.910 --> 01:25:03.630 where the train was delayed and ignore or reject them, 01:25:03.630 --> 01:25:07.762 sort of exclude them from the possible samples that I'm considering. 01:25:07.762 --> 01:25:09.720 And now I want to look at these remaining cases 01:25:09.720 --> 01:25:11.130 where the train is on time. 01:25:11.130 --> 01:25:13.860 Here are the cases where there is light rain. 01:25:13.860 --> 01:25:16.850 And now I say, OK, these are two out of the six possible cases. 01:25:16.850 --> 01:25:19.580 That can give me an approximation for the probability 01:25:19.580 --> 01:25:23.440 of light rain given the fact that I know the train was on time. 01:25:23.440 --> 01:25:25.700 And I did that in almost exactly the same way 01:25:25.700 --> 01:25:28.660 just by adding an additional step, by saying that, 01:25:28.660 --> 01:25:30.470 all right, when I take each sample, let me 01:25:30.470 --> 01:25:34.460 reject all of the samples that don't match my evidence 01:25:34.460 --> 01:25:37.250 and only consider the samples that do match 01:25:37.250 --> 01:25:39.920 what it is that I have in my evidence that I want 01:25:39.920 --> 01:25:42.020 to make some sort of calculation about. 01:25:42.020 --> 01:25:45.920 And it turns out, using the libraries that we've had for Bayesian networks, 01:25:45.920 --> 01:25:48.740 we can begin to implement this same sort of idea, 01:25:48.740 --> 01:25:51.890 implement rejection sampling, which is what this method is called, 01:25:51.890 --> 01:25:55.850 to be able to figure out some probability, not via direct inference, 01:25:55.850 --> 01:25:57.980 but instead by sampling. 01:25:57.980 --> 01:26:00.290 So what I have here is a program called sample.py-- 01:26:00.290 --> 01:26:02.180 imports the exact same model. 01:26:02.180 --> 01:26:05.490 And what I define first is a program to generate a sample. 01:26:05.490 --> 01:26:09.088 And the way I generate a sample is just by looping over all of the states. 01:26:09.088 --> 01:26:10.880 The states need to be in some sort of order 01:26:10.880 --> 01:26:12.797 to make sure I'm looping in the correct order. 01:26:12.797 --> 01:26:16.010 But effectively, if it is a conditional distribution, 01:26:16.010 --> 01:26:18.410 I'm going to sample based on the parents. 01:26:18.410 --> 01:26:21.240 And otherwise, I'm just going to directly sample the variable, 01:26:21.240 --> 01:26:25.040 like rain, which has no parents-- it's just an unconditional distribution-- 01:26:25.040 --> 01:26:28.640 and keep track of all those parent samples and return the final sample. 
01:26:28.640 --> 01:26:31.290 The exact syntax of this, again, is not particularly important. 01:26:31.290 --> 01:26:33.290 It just happens to be part of the implementation 01:26:33.290 --> 01:26:35.820 details of this particular library. 01:26:35.820 --> 01:26:38.270 The interesting logic is done below. 01:26:38.270 --> 01:26:40.820 Now that I have the ability to generate a sample, 01:26:40.820 --> 01:26:45.020 if I want to know the distribution of the appointment random variable given 01:26:45.020 --> 01:26:48.680 that the train is delayed, well, then I can begin to do calculations like this. 01:26:48.680 --> 01:26:52.430 Let me take 10,000 samples and assemble all my results 01:26:52.430 --> 01:26:53.810 in this list called data. 01:26:53.810 --> 01:26:57.140 I'll go ahead and loop n times-- in this case, 10,000 times. 01:26:57.140 --> 01:27:01.670 I'll generate a sample, and I want to know the distribution of appointment 01:27:01.670 --> 01:27:03.410 given that the train is delayed. 01:27:03.410 --> 01:27:05.900 So according to rejection sampling, I'm only 01:27:05.900 --> 01:27:08.210 going to consider samples where the train is delayed. 01:27:08.210 --> 01:27:11.552 If the train's not delayed, I'm not going to consider those values at all. 01:27:11.552 --> 01:27:13.760 So I'm going to say, all right, if I take the sample, 01:27:13.760 --> 01:27:16.290 look at the value of the train random variable, 01:27:16.290 --> 01:27:19.670 if the train is delayed, well, let me go ahead and add to my data 01:27:19.670 --> 01:27:23.000 that I'm collecting the value of the appointment random variable 01:27:23.000 --> 01:27:25.520 that it took on in this particular sample. 01:27:25.520 --> 01:27:28.610 So I'm only considering the samples where the train is delayed 01:27:28.610 --> 01:27:31.010 and, for each of those samples, considering 01:27:31.010 --> 01:27:32.870 what the value of appointment is. 01:27:32.870 --> 01:27:35.570 And then at the end, I'm using a Python class called Counter, 01:27:35.570 --> 01:27:37.580 which quickly counts up all the values inside 01:27:37.580 --> 01:27:40.100 of a data set so I can take this list of data 01:27:40.100 --> 01:27:44.000 and figure out how many times was my appointment made, 01:27:44.000 --> 01:27:47.360 and how many times was my appointment missed? 01:27:47.360 --> 01:27:49.610 And so this here, with just a couple of lines of code, 01:27:49.610 --> 01:27:53.080 is an implementation of rejection sampling. 01:27:53.080 --> 01:27:58.170 And I can run it by going ahead and running python sample.py. 01:27:58.170 --> 01:28:00.230 And when I do that, here is the result I get. 01:28:00.230 --> 01:28:02.150 This is the result of the counter. 01:28:02.150 --> 01:28:05.750 1,251 times I was able to attend the meeting, 01:28:05.750 --> 01:28:08.900 and 856 times I missed the meeting. 01:28:08.900 --> 01:28:11.550 And you can imagine, by doing more and more samples, 01:28:11.550 --> 01:28:14.480 I'll be able to get a better and better, more accurate result. 01:28:14.480 --> 01:28:16.070 And this is a randomized process. 01:28:16.070 --> 01:28:18.895 It's going to be an approximation of the probability.
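Pieced together, a sample.py along the lines just described might look like this, again assuming the pomegranate 0.x API and the model sketched earlier; generate_sample walks the states in order, sampling each node conditioned on the values already drawn for its parents:

```python
from collections import Counter

import pomegranate

from model import model

def generate_sample():
    # Mapping of each random variable's name to the value sampled for it.
    sample = {}
    # Mapping of each distribution to its sampled value, so that children
    # can condition on the values already drawn for their parents.
    parents = {}
    for state in model.states:
        if isinstance(state.distribution, pomegranate.ConditionalProbabilityTable):
            sample[state.name] = state.distribution.sample(parent_values=parents)
        else:
            sample[state.name] = state.distribution.sample()
        parents[state.distribution] = sample[state.name]
    return sample

# Rejection sampling: estimate the distribution of Appointment given that
# the train is delayed, keeping only the samples that match the evidence.
N = 10000
data = []
for i in range(N):
    sample = generate_sample()
    if sample["train"] == "delayed":
        data.append(sample["appointment"])
print(Counter(data))
```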
01:28:18.895 --> 01:28:21.770 If I run it a different time, you'll notice the numbers are similar-- 01:28:21.770 --> 01:28:25.460 1,272 and 905-- but they're not identical 01:28:25.460 --> 01:28:28.250 because there's some randomization, some likelihood that things 01:28:28.250 --> 01:28:31.730 might be higher or lower, and so this is why we generally want to try and use 01:28:31.730 --> 01:28:35.360 more samples so that we can have a greater amount of confidence 01:28:35.360 --> 01:28:37.760 in our result, be more sure about the result 01:28:37.760 --> 01:28:41.240 that we're getting of whether or not it accurately reflects or represents 01:28:41.240 --> 01:28:43.940 the actual underlying probabilities that are 01:28:43.940 --> 01:28:47.130 inherent inside of this distribution. 01:28:47.130 --> 01:28:50.057 And so this, then, was an instance of rejection sampling. 01:28:50.057 --> 01:28:52.640 And it turns out, there are a number of other sampling methods 01:28:52.640 --> 01:28:55.070 that you could use to begin to try to sample. 01:28:55.070 --> 01:28:57.530 One problem that rejection sampling has is 01:28:57.530 --> 01:29:02.480 that if the evidence you're looking for is a fairly unlikely event, well, 01:29:02.480 --> 01:29:04.610 you're going to be rejecting a lot of samples. 01:29:04.610 --> 01:29:08.490 Like, if I'm looking for the probability of x given some evidence e, 01:29:08.490 --> 01:29:12.680 if e is very unlikely to occur-- like, occurs maybe once every 1,000 times-- 01:29:12.680 --> 01:29:16.040 then I'm only going to be considering one out of every 1,000 samples 01:29:16.040 --> 01:29:18.798 that I do, which is a pretty inefficient method for trying 01:29:18.798 --> 01:29:20.090 to do this sort of calculation. 01:29:20.090 --> 01:29:23.600 I'm throwing away a lot of samples, and it takes computational effort 01:29:23.600 --> 01:29:25.640 to be able to generate those samples, so I'd 01:29:25.640 --> 01:29:27.480 like to not have to do something like that. 01:29:27.480 --> 01:29:30.230 So there are other sampling methods that can try and address this. 01:29:30.230 --> 01:29:33.680 One such sampling method is called likelihood weighting. 01:29:33.680 --> 01:29:36.920 In likelihood weighting, we follow a slightly different procedure, 01:29:36.920 --> 01:29:39.740 and the goal is to avoid needing to throw out 01:29:39.740 --> 01:29:42.590 samples that didn't match the evidence. 01:29:42.590 --> 01:29:46.760 And so what we'll do is we'll start by fixing the values for the evidence 01:29:46.760 --> 01:29:47.300 variables. 01:29:47.300 --> 01:29:49.430 Rather than sample everything, we're going 01:29:49.430 --> 01:29:53.970 to fix the values of the evidence variables and not sample those. 01:29:53.970 --> 01:29:57.650 Then we're going to sample all the other non-evidence variables in the same way, 01:29:57.650 --> 01:30:01.010 just using the Bayesian network, looking at the probability distributions, 01:30:01.010 --> 01:30:04.040 sampling all the non-evidence variables. 01:30:04.040 --> 01:30:08.450 But then what we need to do is weight each sample by its likelihood. 01:30:08.450 --> 01:30:10.520 If our evidence is really unlikely, we want 01:30:10.520 --> 01:30:14.210 to make sure that we've taken into account, how likely was the evidence 01:30:14.210 --> 01:30:16.410 to actually show up in the sample? 
01:30:16.410 --> 01:30:18.590 If I have a sample where the evidence was much more 01:30:18.590 --> 01:30:20.720 likely to show up than another sample, then I 01:30:20.720 --> 01:30:23.060 want to weight the more likely one higher. 01:30:23.060 --> 01:30:25.490 So we're going to weight each sample by its likelihood 01:30:25.490 --> 01:30:29.480 where likelihood is just defined as the probability of all of the evidence. 01:30:29.480 --> 01:30:32.090 Given all the evidence we have, what is the probability 01:30:32.090 --> 01:30:34.640 that it would happen in that particular sample? 01:30:34.640 --> 01:30:37.250 So before, all of our samples were weighted equally. 01:30:37.250 --> 01:30:40.970 They all had a weight of one when we were calculating the overall average. 01:30:40.970 --> 01:30:42.980 In this case, we're going to weight each sample, 01:30:42.980 --> 01:30:46.220 multiply each sample by its likelihood in order 01:30:46.220 --> 01:30:49.252 to get the more accurate distribution. 01:30:49.252 --> 01:30:50.460 So what would this look like? 01:30:50.460 --> 01:30:54.170 Well, if I asked the same question, what is the probability of light rain given 01:30:54.170 --> 01:30:57.050 that the train is on time, when I do the sampling procedure 01:30:57.050 --> 01:30:59.780 and start by trying to sample, I'm going to start 01:30:59.780 --> 01:31:01.520 by fixing the evidence variable. 01:31:01.520 --> 01:31:04.640 I'm already going to have in my sample the train is on time. 01:31:04.640 --> 01:31:06.860 That way, I don't have to throw out anything. 01:31:06.860 --> 01:31:10.610 I'm only sampling things where I know the value of the variables that 01:31:10.610 --> 01:31:13.790 are my evidence are what I expect them to be. 01:31:13.790 --> 01:31:16.310 So I'll go ahead and sample from rain, and maybe this time I 01:31:16.310 --> 01:31:18.318 sample light rain instead of no rain. 01:31:18.318 --> 01:31:21.110 Then I'll sample from track maintenance and say maybe, yes, there's 01:31:21.110 --> 01:31:22.100 track maintenance. 01:31:22.100 --> 01:31:25.190 Then for train, well, I've already fixed it in place. 01:31:25.190 --> 01:31:29.360 Train was an evidence variable, so I'm not going to bother sampling again. 01:31:29.360 --> 01:31:30.820 I'll just go ahead and move on. 01:31:30.820 --> 01:31:35.280 I'll move on to appointment and go ahead and sample from appointment as well. 01:31:35.280 --> 01:31:37.040 So now I've generated a sample. 01:31:37.040 --> 01:31:40.190 I've generated a sample by fixing this evidence variable 01:31:40.190 --> 01:31:42.310 and sampling the other three. 01:31:42.310 --> 01:31:44.390 And the last step is now weighting the sample. 01:31:44.390 --> 01:31:45.920 How much weight should it have? 01:31:45.920 --> 01:31:50.090 And the weight is based on how probable is it that the train was actually 01:31:50.090 --> 01:31:52.560 on time, this evidence actually happened, 01:31:52.560 --> 01:31:55.460 given the values of these other variables, light rain and the fact 01:31:55.460 --> 01:31:57.620 that, yes, there was track maintenance? 01:31:57.620 --> 01:32:00.260 Well, to do that, I can just go back to the train variable 01:32:00.260 --> 01:32:02.900 and say, all right, if there was light rain and track 01:32:02.900 --> 01:32:05.060 maintenance, the likelihood of my evidence, 01:32:05.060 --> 01:32:08.570 the likelihood that my train was on time, is 0.6. 01:32:08.570 --> 01:32:13.250 And so this particular sample would have a weight of 0.6. 
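Likelihood weighting isn't implemented in the lecture's code, but a minimal self-contained sketch of the idea looks like this. The probability tables below reuse the numbers the transcript states and fill the remaining rows with illustrative values; the Appointment node is skipped because this query only concerns rain:

```python
import random

# P(Rain): values from the lecture.
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
# P(maintenance = yes | rain): the "none" row is from the lecture,
# the other rows are illustrative assumptions.
P_maint_yes = {"none": 0.4, "light": 0.2, "heavy": 0.1}
# P(train = on time | rain, maintenance): ("none", "yes") and
# ("light", "yes") match the lecture; other rows are illustrative.
P_on_time = {
    ("none", "yes"): 0.8, ("none", "no"): 0.9,
    ("light", "yes"): 0.6, ("light", "no"): 0.7,
    ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5,
}

def weighted_sample():
    """One likelihood-weighted sample with the evidence Train = on time fixed."""
    # Sample the non-evidence variables as usual.
    rain = random.choices(list(P_rain), weights=list(P_rain.values()))[0]
    maintenance = "yes" if random.random() < P_maint_yes[rain] else "no"
    # Don't sample Train; instead, weight the sample by the probability
    # that the evidence (train on time) occurs given the sampled parents.
    weight = P_on_time[(rain, maintenance)]
    return rain, weight

# Estimate P(rain = light | train on time) as a weighted proportion.
total_weight = light_weight = 0.0
for _ in range(100000):
    rain, weight = weighted_sample()
    total_weight += weight
    if rain == "light":
        light_weight += weight
print(light_weight / total_weight)
```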
01:32:13.250 --> 01:32:15.740 And I could repeat the sampling procedure again and again. 01:32:15.740 --> 01:32:18.140 Each time, every sample would be given a weight 01:32:18.140 --> 01:32:22.928 according to the probability of the evidence that I see associated with it. 01:32:22.928 --> 01:32:25.970 And there are other sampling methods that exist, as well, but all of them 01:32:25.970 --> 01:32:27.845 are designed to try and get at the same idea, 01:32:27.845 --> 01:32:30.950 to approximate the inference procedure of figuring out 01:32:30.950 --> 01:32:33.540 the value of a variable. 01:32:33.540 --> 01:32:35.570 So we've now dealt with probability as it 01:32:35.570 --> 01:32:38.840 pertains to particular variables that have these discrete values. 01:32:38.840 --> 01:32:40.910 But what we haven't really considered is how 01:32:40.910 --> 01:32:44.300 values might change over time, that we've considered something 01:32:44.300 --> 01:32:47.870 like a variable for rain where rain can take on values of none or light 01:32:47.870 --> 01:32:50.600 rain or heavy rain, but, in practice, usually when 01:32:50.600 --> 01:32:54.950 we consider values for variables like rain, we like to consider 01:32:54.950 --> 01:32:58.020 how the values of these variables change over time. 01:32:58.020 --> 01:33:02.040 How do we deal with uncertainty over a period of time? 01:33:02.040 --> 01:33:04.590 This can come up in the context of weather, for example-- 01:33:04.590 --> 01:33:06.830 if I have sunny days and I have rainy days. 01:33:06.830 --> 01:33:11.450 And I'd like to know not just what is the probability that it's raining now, 01:33:11.450 --> 01:33:14.210 but what is the probability that it rains tomorrow or the day 01:33:14.210 --> 01:33:15.838 after that or the day after that? 01:33:15.838 --> 01:33:17.630 And so to do this, we're going to introduce 01:33:17.630 --> 01:33:19.440 a slightly different kind of model. 01:33:19.440 --> 01:33:23.300 But here we're going to have a random variable, not just one for the weather, 01:33:23.300 --> 01:33:25.643 but for every possible time step. 01:33:25.643 --> 01:33:27.560 And you can define time step however you like. 01:33:27.560 --> 01:33:30.680 A simple way is just to use days as your time step. 01:33:30.680 --> 01:33:34.220 And so we can define a variable called x sub t, which 01:33:34.220 --> 01:33:36.620 is going to be the weather at time t. 01:33:36.620 --> 01:33:39.350 So x sub zero might be the weather on day zero, 01:33:39.350 --> 01:33:42.400 x sub one might be the weather on day one, so on and so forth, 01:33:42.400 --> 01:33:45.022 x sub two is the weather on day two. 01:33:45.022 --> 01:33:46.730 But as you can imagine, if we start to do 01:33:46.730 --> 01:33:48.740 this over longer and longer periods of time, 01:33:48.740 --> 01:33:51.282 there's an incredible amount of data that might go into this. 01:33:51.282 --> 01:33:53.960 If you're keeping track of data about the weather for a year, 01:33:53.960 --> 01:33:57.240 now suddenly you might be trying to predict the weather tomorrow given 01:33:57.240 --> 01:34:00.620 365 days of previous pieces of evidence, and that's 01:34:00.620 --> 01:34:03.530 a lot of evidence to have to deal with and manipulate and calculate. 01:34:03.530 --> 01:34:06.410 Probably nobody knows what the exact conditional probability 01:34:06.410 --> 01:34:10.070 distribution is for all of those combinations of variables.
01:34:10.070 --> 01:34:13.070 And so when we're trying to do this inference inside of a computer, when 01:34:13.070 --> 01:34:16.640 we're trying to reasonably do this sort of analysis, 01:34:16.640 --> 01:34:19.053 it's helpful to make some simplifying assumptions, 01:34:19.053 --> 01:34:21.470 some assumptions about the problem that we can just assume 01:34:21.470 --> 01:34:23.930 are true to make our lives a little bit easier. 01:34:23.930 --> 01:34:26.270 Even if they're not totally accurate assumptions, 01:34:26.270 --> 01:34:28.703 if they're close to accurate or approximate, 01:34:28.703 --> 01:34:29.870 they're usually pretty good. 01:34:29.870 --> 01:34:33.350 And the assumption we're going to make is called the Markov assumption, 01:34:33.350 --> 01:34:38.210 which is the assumption that the current state depends only on a finite fixed 01:34:38.210 --> 01:34:40.220 number of previous states. 01:34:40.220 --> 01:34:44.210 So the current day's weather depends not on all of the previous day's weather 01:34:44.210 --> 01:34:47.150 for all of history, but the current day's weather I 01:34:47.150 --> 01:34:49.758 can predict just based on yesterday's weather 01:34:49.758 --> 01:34:52.550 or just based on the last two days' weather or the last three days' 01:34:52.550 --> 01:34:53.050 weather. 01:34:53.050 --> 01:34:57.620 But oftentimes, we're going to deal with just the one previous state helping 01:34:57.620 --> 01:34:59.720 to predict this current state. 01:34:59.720 --> 01:35:01.970 And by putting a whole bunch of these random variables 01:35:01.970 --> 01:35:04.400 together, using this Markov assumption, we 01:35:04.400 --> 01:35:08.090 can create what's called a Markov chain where a Markov chain is just 01:35:08.090 --> 01:35:11.960 some sequence of random variables where each of the variables' distributions 01:35:11.960 --> 01:35:13.772 follows that Markov assumption. 01:35:13.772 --> 01:35:16.480 And so we'll do an example of this where the Markov assumption is 01:35:16.480 --> 01:35:17.590 I can predict the weather. 01:35:17.590 --> 01:35:19.050 Is it sunny or rainy? 01:35:19.050 --> 01:35:21.520 And we'll just consider those two possibilities for now, 01:35:21.520 --> 01:35:23.395 even though there are other types of weather. 01:35:23.395 --> 01:35:26.650 But I can predict each day's weather just on the prior day's weather. 01:35:26.650 --> 01:35:30.430 Using today's weather, I can come up with a probability distribution 01:35:30.430 --> 01:35:31.825 for tomorrow's weather. 01:35:31.825 --> 01:35:33.700 And here's what this weather might look like. 01:35:33.700 --> 01:35:37.030 It's formatted in terms of a matrix, as you might describe it, 01:35:37.030 --> 01:35:41.410 as sort of rows and columns of values where on the left-hand side 01:35:41.410 --> 01:35:45.850 I have today's weather, represented by the variable x sub t. 01:35:45.850 --> 01:35:48.730 And then over here in the columns, I have tomorrow's weather, 01:35:48.730 --> 01:35:54.790 represented by the variable x sub t plus one, t plus one day's weather instead. 01:35:54.790 --> 01:35:58.990 And what this matrix is saying is if today is sunny, 01:35:58.990 --> 01:36:02.440 well, then, it's more likely than not that tomorrow is also sunny. 01:36:02.440 --> 01:36:05.990 Oftentimes the weather stays consistent for multiple days in a row.
01:36:05.990 --> 01:36:08.200 And for example, let's say that if today is sunny, 01:36:08.200 --> 01:36:12.820 our model says that tomorrow, with probability 0.8, it will also be sunny, 01:36:12.820 --> 01:36:15.610 and with probability 0.2 it will be raining. 01:36:15.610 --> 01:36:19.245 And likewise, if today is raining, then it's 01:36:19.245 --> 01:36:21.370 more likely than not that tomorrow is also raining. 01:36:21.370 --> 01:36:23.620 With probability 0.7, it'll be raining. 01:36:23.620 --> 01:36:26.710 With probability 0.3, it will be sunny. 01:36:26.710 --> 01:36:28.840 So this matrix, this description of how it 01:36:28.840 --> 01:36:32.290 is we transition from one state to the next state, 01:36:32.290 --> 01:36:34.540 is what we're going to call the transition model. 01:36:34.540 --> 01:36:37.030 And using the transition model, you can begin 01:36:37.030 --> 01:36:41.770 to construct this Markov chain by just predicting, given today's weather, 01:36:41.770 --> 01:36:44.020 what's the likelihood of tomorrow's weather happening? 01:36:44.020 --> 01:36:46.930 And you can imagine doing a similar sampling 01:36:46.930 --> 01:36:49.660 procedure where you take this information, 01:36:49.660 --> 01:36:51.940 you sample what tomorrow's weather is going to be, 01:36:51.940 --> 01:36:53.980 using that you sample the next day's weather, 01:36:53.980 --> 01:36:58.390 and the result of that is you can form this Markov chain: x zero, 01:36:58.390 --> 01:37:01.120 at time day zero, is sunny, the next day is sunny, 01:37:01.120 --> 01:37:04.240 maybe the next day it changes to raining, then raining, then raining. 01:37:04.240 --> 01:37:06.910 And the pattern that this Markov chain follows, 01:37:06.910 --> 01:37:08.890 given the distribution that we had access to, 01:37:08.890 --> 01:37:11.850 this transition model here, is that when it's sunny, 01:37:11.850 --> 01:37:13.600 it tends to stay sunny for a little while. 01:37:13.600 --> 01:37:16.100 The next couple days tend to be sunny too. 01:37:16.100 --> 01:37:19.735 And when it's raining, it tends to be raining as well. 01:37:19.735 --> 01:37:21.860 And so you get a Markov chain that looks like this. 01:37:21.860 --> 01:37:23.193 And you can do analysis on this. 01:37:23.193 --> 01:37:25.630 You can say, given that today is raining, 01:37:25.630 --> 01:37:27.790 what is the probability that tomorrow it's raining, 01:37:27.790 --> 01:37:29.770 or you can begin to ask probability questions, 01:37:29.770 --> 01:37:33.970 like what is the probability of this sequence of five values-- sun, sun, 01:37:33.970 --> 01:37:35.200 rain, rain, rain-- 01:37:35.200 --> 01:37:37.610 and answer those sorts of questions too. 01:37:37.610 --> 01:37:40.780 And it turns out there are, again, many Python libraries for interacting 01:37:40.780 --> 01:37:44.620 with models like this of probabilities that have distributions 01:37:44.620 --> 01:37:47.440 and random variables that are based on previous variables 01:37:47.440 --> 01:37:49.720 according to this Markov assumption. 01:37:49.720 --> 01:37:53.090 And pomegranate, too, has ways of dealing with these sorts of variables. 01:37:53.090 --> 01:37:59.800 So I'll go ahead and go into the chain directory 01:37:59.800 --> 01:38:02.590 where I have some information about Markov chains. 01:38:02.590 --> 01:38:05.770 And here I've defined a file called model.py where 01:38:05.770 --> 01:38:08.320 I've defined a Markov chain in a very similar syntax.
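A sketch of what that chain/model.py might look like, using the MarkovChain class from the pomegranate 0.x releases:

```python
from pomegranate import (ConditionalProbabilityTable, DiscreteDistribution,
                         MarkovChain)

# Starting distribution: 50/50 between sun and rain.
start = DiscreteDistribution({"sun": 0.5, "rain": 0.5})

# Transition model: tomorrow's weather conditioned on today's,
# the same matrix described above.
transitions = ConditionalProbabilityTable([
    ["sun", "sun", 0.8], ["sun", "rain", 0.2],
    ["rain", "sun", 0.3], ["rain", "rain", 0.7],
], [start])

model = MarkovChain([start, transitions])

# Sample 50 states from the chain: 50 days' worth of weather.
print(model.sample(50))
```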
01:38:08.320 --> 01:38:11.080 And again, the exact syntax doesn't matter so much as the idea 01:38:11.080 --> 01:38:14.410 that I'm encoding this information into a Python program 01:38:14.410 --> 01:38:17.290 so that the program has access to these distributions. 01:38:17.290 --> 01:38:19.930 I've here defined some starting distributions. 01:38:19.930 --> 01:38:23.020 So every Markov model begins at some point in time, 01:38:23.020 --> 01:38:25.120 and I need to give it some starting distribution. 01:38:25.120 --> 01:38:27.078 And so we'll just say, you know what, to start, 01:38:27.078 --> 01:38:29.380 you can pick 50/50 between sunny and rainy. 01:38:29.380 --> 01:38:33.370 We'll say it's sunny 50% of the time, rainy 50% of the time. 01:38:33.370 --> 01:38:36.430 And then down below, I've here defined the transition model, 01:38:36.430 --> 01:38:39.770 how it is that I transition from one day to the next. 01:38:39.770 --> 01:38:42.520 And here I've encoded that exact same matrix from before, 01:38:42.520 --> 01:38:45.210 that if it was sunny today, then with probability 0.8 01:38:45.210 --> 01:38:47.650 it will be sunny tomorrow, and it will be raining tomorrow 01:38:47.650 --> 01:38:49.540 with probability 0.2. 01:38:49.540 --> 01:38:54.540 And I likewise have another distribution for if it was raining today instead. 01:38:54.540 --> 01:38:56.980 And so that alone defines the Markov model. 01:38:56.980 --> 01:38:59.410 You can begin to answer questions using that model. 01:38:59.410 --> 01:39:02.680 But one thing I'll just do is sample from the Markov chain. 01:39:02.680 --> 01:39:06.130 And it turns out there is a method built into this Markov chain library that 01:39:06.130 --> 01:39:08.440 allows me to sample 50 states from the chain, 01:39:08.440 --> 01:39:13.000 basically just simulating 50 instances of weather. 01:39:13.000 --> 01:39:18.290 And so let me go ahead and run this, python model.py. 01:39:18.290 --> 01:39:22.570 And when I run it, what I get is it is going to sample from this Markov chain 01:39:22.570 --> 01:39:26.498 50 states, 50 days worth of weather that it's just going to randomly sample. 01:39:26.498 --> 01:39:29.290 And you can imagine sampling many times to be able to get more data 01:39:29.290 --> 01:39:30.820 to be able to do more analysis. 01:39:30.820 --> 01:39:33.580 But here, for example, it's sunny two days 01:39:33.580 --> 01:39:37.360 in a row, rainy a whole bunch of days in a row before it changes back to sun. 01:39:37.360 --> 01:39:41.170 And so you get this model that follows the distribution that we originally 01:39:41.170 --> 01:39:43.960 described, that follows the distribution of sunny days 01:39:43.960 --> 01:39:49.780 tend to lead to more sunny days, rainy days tend to lead to more rainy days. 01:39:49.780 --> 01:39:52.060 And that, then, is the Markov model. 01:39:52.060 --> 01:39:56.260 And Markov models rely on us knowing the values of these individual states. 01:39:56.260 --> 01:40:00.490 I know that today is sunny or that today is rainy, and using that information, 01:40:00.490 --> 01:40:04.660 I can draw some sort of inference about what tomorrow is going to be like. 01:40:04.660 --> 01:40:07.130 But in practice, this often isn't the case. 01:40:07.130 --> 01:40:09.310 It often isn't the case that I know for certain 01:40:09.310 --> 01:40:11.620 what the exact state of the world is.
01:40:11.620 --> 01:40:14.710 Oftentimes the exact state of the world is unknown, 01:40:14.710 --> 01:40:18.480 but I'm able to somehow sense some information about that state: 01:40:18.480 --> 01:40:22.385 a robot or an AI doesn't have exact knowledge about the world around it, 01:40:22.385 --> 01:40:24.510 but it has some sort of sensor, whether that sensor 01:40:24.510 --> 01:40:27.240 is a camera or sensors that detect distance 01:40:27.240 --> 01:40:30.300 or just a microphone that is sensing audio, for example. 01:40:30.300 --> 01:40:33.990 It is sensing data, and that data is somehow 01:40:33.990 --> 01:40:36.930 related to the state of the world, even if our AI doesn't actually 01:40:36.930 --> 01:40:41.100 know what the underlying true state of the world 01:40:41.100 --> 01:40:42.730 actually is. 01:40:42.730 --> 01:40:45.480 And for that, we need to get into the world of sensor models, 01:40:45.480 --> 01:40:48.420 the way of describing how it is that we relate 01:40:48.420 --> 01:40:51.600 the hidden state, the underlying true state of the world, 01:40:51.600 --> 01:40:56.880 to the observation, what it is that the AI knows or the AI has access 01:40:56.880 --> 01:40:58.810 to. 01:40:58.810 --> 01:41:02.880 And so for example, a hidden state might be a robot's position. 01:41:02.880 --> 01:41:05.650 If a robot is exploring new, uncharted territory, 01:41:05.650 --> 01:41:08.580 the robot likely doesn't know exactly where it is. 01:41:08.580 --> 01:41:10.000 But it does have an observation. 01:41:10.000 --> 01:41:12.510 It has robot sensor data where it can sense 01:41:12.510 --> 01:41:16.560 how far away are possible obstacles around it, and using that information, 01:41:16.560 --> 01:41:19.230 using the observed information that it has, 01:41:19.230 --> 01:41:22.290 it can infer something about the hidden state, 01:41:22.290 --> 01:41:26.220 because what the true hidden state is influences those observations. 01:41:26.220 --> 01:41:29.370 Whatever the robot's true position is 01:41:29.370 --> 01:41:33.420 has some effect upon the sensor data the robot is able to collect, 01:41:33.420 --> 01:41:36.330 even if the robot doesn't actually know for certain 01:41:36.330 --> 01:41:39.090 what its true position is. 01:41:39.090 --> 01:41:42.300 Likewise, if you think about a voice recognition or a speech recognition 01:41:42.300 --> 01:41:47.600 program that listens to you and is able to respond to you, something like Alexa 01:41:47.600 --> 01:41:50.830 or what Apple and Google are doing with their voice recognition as well, 01:41:50.830 --> 01:41:54.090 that you might imagine that the hidden state, the underlying state, 01:41:54.090 --> 01:41:55.740 is what words are actually spoken. 01:41:55.740 --> 01:41:58.290 The true nature of the world contains you 01:41:58.290 --> 01:42:00.270 saying a particular sequence of words. 01:42:00.270 --> 01:42:04.380 But your phone or your smart home device doesn't know for sure 01:42:04.380 --> 01:42:05.940 exactly what words you said. 01:42:05.940 --> 01:42:11.100 The only observation that the AI has access to is some audio wave forms.
01:42:11.100 --> 01:42:13.710 And those audio waveforms are, of course, dependent 01:42:13.710 --> 01:42:16.110 upon this hidden state, and you can infer, 01:42:16.110 --> 01:42:20.520 based on those audio waveforms, what the words spoken likely were, 01:42:20.520 --> 01:42:23.490 but you might not know with 100% certainty what 01:42:23.490 --> 01:42:25.330 that hidden state actually is. 01:42:25.330 --> 01:42:27.630 And it might be a task to try and predict: 01:42:27.630 --> 01:42:30.300 given this observation, given these audio waveforms, 01:42:30.300 --> 01:42:34.142 can you figure out what the actual words spoken are? 01:42:34.142 --> 01:42:35.850 Likewise, you might imagine a website. 01:42:35.850 --> 01:42:38.490 True user engagement might be information you don't directly 01:42:38.490 --> 01:42:41.880 have access to, but you can observe data, like website or app 01:42:41.880 --> 01:42:44.220 analytics about how often this button was clicked 01:42:44.220 --> 01:42:47.220 or how often people are interacting with a page in a particular way. 01:42:47.220 --> 01:42:51.190 And you can use that to infer things about your users as well. 01:42:51.190 --> 01:42:54.968 So this type of problem comes up all the time when we're dealing with AI 01:42:54.968 --> 01:42:56.760 and trying to infer things about the world, 01:42:56.760 --> 01:43:00.750 that often AI doesn't really know the hidden true state of the world. 01:43:00.750 --> 01:43:03.930 All that AI has access to is some observation 01:43:03.930 --> 01:43:07.440 that is related to the hidden true state, but it's not direct. 01:43:07.440 --> 01:43:08.790 There might be some noise there. 01:43:08.790 --> 01:43:10.985 The audio waveform might have some additional noise 01:43:10.985 --> 01:43:12.360 that might be difficult to parse. 01:43:12.360 --> 01:43:14.910 The sensor data might not be exactly correct. 01:43:14.910 --> 01:43:16.860 There's some noise that might not allow you 01:43:16.860 --> 01:43:19.560 to conclude with certainty what the hidden state is, but can 01:43:19.560 --> 01:43:22.100 allow you to infer what it might be. 01:43:22.100 --> 01:43:24.348 And so the simple example we'll take a look at here 01:43:24.348 --> 01:43:27.390 is imagining the hidden state as the weather, whether it's sunny or rainy, 01:43:27.390 --> 01:43:31.530 and imagine you are programming an AI inside of a building that maybe 01:43:31.530 --> 01:43:34.710 has access to just a camera inside the building, 01:43:34.710 --> 01:43:37.890 and all you have access to is an observation as to 01:43:37.890 --> 01:43:41.790 whether or not employees are bringing an umbrella into the building. 01:43:41.790 --> 01:43:44.290 You can detect whether someone is carrying an umbrella or not, 01:43:44.290 --> 01:43:47.700 and so you might have an observation as to whether or not an umbrella is 01:43:47.700 --> 01:43:49.320 brought into the building. 01:43:49.320 --> 01:43:51.690 And using that information, you want to predict 01:43:51.690 --> 01:43:53.790 whether it's sunny or rainy, even if you don't 01:43:53.790 --> 01:43:55.877 know what the underlying weather is. 01:43:55.877 --> 01:43:57.960 So the underlying weather might be sunny or rainy. 01:43:57.960 --> 01:44:01.462 And if it's raining, obviously people are more likely to bring an umbrella. 01:44:01.462 --> 01:44:03.420 And so whether or not people bring an umbrella, 01:44:03.420 --> 01:44:06.773 your observation tells you something about the hidden state.
01:44:06.773 --> 01:44:08.940 And of course, this is a bit of a contrived example, 01:44:08.940 --> 01:44:11.370 but the idea here is to think about this more broadly: 01:44:11.370 --> 01:44:14.370 more generally, any time you observe something, 01:44:14.370 --> 01:44:18.025 that observation has to do with some underlying hidden state. 01:44:18.025 --> 01:44:21.150 And so to try and model this type of idea where we have these hidden states 01:44:21.150 --> 01:44:24.180 and observations, rather than just use a Markov model, which 01:44:24.180 --> 01:44:26.160 has state, state, state, state, each of which 01:44:26.160 --> 01:44:29.700 is connected by that transition matrix that we described before, 01:44:29.700 --> 01:44:32.640 we're going to use what we call a hidden Markov model-- 01:44:32.640 --> 01:44:34.740 very similar to a Markov model, but this is 01:44:34.740 --> 01:44:37.920 going to allow us to model a system that has hidden states 01:44:37.920 --> 01:44:41.520 that we don't directly observe along with some observed event 01:44:41.520 --> 01:44:43.740 that we do actually see. 01:44:43.740 --> 01:44:45.720 And so in addition to that transition model 01:44:45.720 --> 01:44:48.780 that we still need of saying, given the underlying state of the world, 01:44:48.780 --> 01:44:52.440 if it's sunny or rainy, what's the probability of tomorrow's weather, 01:44:52.440 --> 01:44:56.310 we also need another model that, given some state, is 01:44:56.310 --> 01:44:58.500 going to give us an observation: green, 01:44:58.500 --> 01:45:01.440 yes, someone brings an umbrella into the office, or red, 01:45:01.440 --> 01:45:03.930 no, nobody brings umbrellas into the office. 01:45:03.930 --> 01:45:06.772 And so the observation might be that if it's sunny, 01:45:06.772 --> 01:45:09.480 then odds are nobody is going to bring an umbrella to the office. 01:45:09.480 --> 01:45:11.760 But maybe some people are just being cautious 01:45:11.760 --> 01:45:14.490 and they do bring an umbrella to the office anyways. 01:45:14.490 --> 01:45:17.725 And if it's raining, then with much higher probability 01:45:17.725 --> 01:45:20.100 people are going to bring umbrellas into the office. 01:45:20.100 --> 01:45:23.280 But maybe, if the rain was unexpected, people didn't bring an umbrella, 01:45:23.280 --> 01:45:25.990 and so there might be some other probability as well. 01:45:25.990 --> 01:45:28.860 So using the observations, you can begin to predict, 01:45:28.860 --> 01:45:32.070 with reasonable likelihood, what the underlying state is, 01:45:32.070 --> 01:45:35.440 even if you don't actually get to observe the underlying state, 01:45:35.440 --> 01:45:39.030 if you don't get to see what the hidden state is actually equal to. 01:45:39.030 --> 01:45:41.540 This here we'll often call the sensor model. 01:45:41.540 --> 01:45:44.280 It's also often called the emission probabilities, 01:45:44.280 --> 01:45:48.120 because the underlying state emits some sort of emission 01:45:48.120 --> 01:45:49.660 that you then observe. 01:45:49.660 --> 01:45:53.220 And so that can be another way of describing that same idea.
01:45:53.220 --> 01:45:55.860 And the sensor Markov assumption that we're going to use 01:45:55.860 --> 01:45:59.340 is this assumption that the evidence variable, the thing we observe, 01:45:59.340 --> 01:46:03.480 the emission that gets produced, depends only on the corresponding state, 01:46:03.480 --> 01:46:06.620 meaning I can predict whether or not people will bring umbrellas 01:46:06.620 --> 01:46:11.310 based entirely on whether it is sunny or rainy today. 01:46:11.310 --> 01:46:13.950 Of course, again, this assumption might not hold in practice; 01:46:13.950 --> 01:46:15.458 in practice, 01:46:15.458 --> 01:46:17.250 whether or not people bring umbrellas might 01:46:17.250 --> 01:46:20.042 depend not just on today's weather, but also on yesterday's weather 01:46:20.042 --> 01:46:20.910 and the day before. 01:46:20.910 --> 01:46:23.100 But for simplification purposes, it can be 01:46:23.100 --> 01:46:25.920 helpful to apply this sort of assumption just 01:46:25.920 --> 01:46:29.130 to allow us to be able to reason about these probabilities a little more 01:46:29.130 --> 01:46:30.130 easily. 01:46:30.130 --> 01:46:34.770 And if we're able to approximate it, we can still often get a very good answer. 01:46:34.770 --> 01:46:37.710 And so what these hidden Markov models end up looking like is a little 01:46:37.710 --> 01:46:41.730 something like this, where now, rather than just have one chain of states-- 01:46:41.730 --> 01:46:43.860 like sun, sun, rain, rain, rain-- 01:46:43.860 --> 01:46:49.650 we instead have this upper level, which is the underlying state of the world, 01:46:49.650 --> 01:46:53.070 is it sunny or is it rainy, and those are connected by that transition 01:46:53.070 --> 01:46:54.690 matrix we described before. 01:46:54.690 --> 01:46:57.510 But each of these states produces an emission, 01:46:57.510 --> 01:47:01.590 produces an observation that I see, that on this day it was sunny 01:47:01.590 --> 01:47:04.917 and people didn't bring umbrellas, and on this day it was sunny 01:47:04.917 --> 01:47:07.500 but people did bring umbrellas, and on this day it was raining 01:47:07.500 --> 01:47:09.960 and people did bring umbrellas, and so on and so forth. 01:47:09.960 --> 01:47:12.930 And so each of these underlying states, represented 01:47:12.930 --> 01:47:16.740 by x sub t, for t equals 0, 1, 2, and so on and so forth, 01:47:16.740 --> 01:47:19.450 produces some sort of observation or emission, 01:47:19.450 --> 01:47:20.950 which is what the E stands for-- 01:47:20.950 --> 01:47:25.700 E sub 0, E sub 1, E sub 2, so on and so forth. 01:47:25.700 --> 01:47:28.893 And so this, too, is a way of trying to represent this idea. 01:47:28.893 --> 01:47:31.560 And what you want to think about is that these underlying states 01:47:31.560 --> 01:47:35.790 are the true nature of the world, the robot's position as it moves over time, 01:47:35.790 --> 01:47:39.030 and that produces some sort of sensor data that might be observed, 01:47:39.030 --> 01:47:41.490 or what people are actually saying, using 01:47:41.490 --> 01:47:45.390 the emission data of what audio waveforms you detect in order to process 01:47:45.390 --> 01:47:47.330 that data and try and figure it out. 01:47:47.330 --> 01:47:49.830 And there are a number of possible tasks that you might want 01:47:49.830 --> 01:47:52.150 to do given this kind of information.
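(Putting the two assumptions together, the Markov assumption for transitions and the sensor Markov assumption for emissions, the joint probability of a sequence of hidden states and emissions factors as follows. This equation isn't shown in the lecture, but it is the standard consequence of combining those two assumptions:)

$$P(x_0, \dots, x_T, e_0, \dots, e_T) = P(x_0)\,P(e_0 \mid x_0)\,\prod_{t=1}^{T} P(x_t \mid x_{t-1})\,P(e_t \mid x_t)$$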
01:47:52.150 --> 01:47:54.750 And one of the simplest is trying to infer something 01:47:54.750 --> 01:47:58.560 about the future or the past or about these sorts of hidden states 01:47:58.560 --> 01:47:59.580 that might exist. 01:47:59.580 --> 01:48:01.310 And so the tasks that you'll often see-- 01:48:01.310 --> 01:48:03.893 and we're not going to go into the mathematics of these tasks, 01:48:03.893 --> 01:48:07.020 but they're all based on this same idea of conditional probabilities 01:48:07.020 --> 01:48:09.990 and using the probability distributions we have 01:48:09.990 --> 01:48:12.180 to draw these sorts of conclusions. 01:48:12.180 --> 01:48:16.410 One task is called filtering, which is, given observations from the start 01:48:16.410 --> 01:48:20.310 until now, calculate the distribution for the current state, 01:48:20.310 --> 01:48:23.520 meaning given information from the beginning of time 01:48:23.520 --> 01:48:26.610 until now, on which days did people bring an umbrella 01:48:26.610 --> 01:48:28.770 or not bring an umbrella, can I calculate 01:48:28.770 --> 01:48:32.280 the probability of the current state: today, is it sunny 01:48:32.280 --> 01:48:33.570 or is it raining? 01:48:33.570 --> 01:48:35.670 Another task that might be possible is prediction, 01:48:35.670 --> 01:48:37.320 which is looking towards the future. 01:48:37.320 --> 01:48:39.690 Given observations about people bringing umbrellas 01:48:39.690 --> 01:48:43.350 from the beginning of when we started counting time until now, 01:48:43.350 --> 01:48:47.710 can I figure out the distribution for whether tomorrow is sunny or raining? 01:48:47.710 --> 01:48:51.240 And you can also go backwards as well, via smoothing, where I can say, 01:48:51.240 --> 01:48:54.810 given observations from start until now, calculate the distributions 01:48:54.810 --> 01:48:56.460 for some past state. 01:48:56.460 --> 01:49:00.090 I know that people brought umbrellas yesterday and people brought 01:49:00.090 --> 01:49:03.780 umbrellas today, and so given two days' worth of data of people bringing umbrellas, 01:49:03.780 --> 01:49:06.713 what's the probability that yesterday it was raining? 01:49:06.713 --> 01:49:08.880 And the fact that I know that people brought umbrellas today, 01:49:08.880 --> 01:49:11.160 that might inform that inference as well. 01:49:11.160 --> 01:49:13.740 It might influence those probabilities. 01:49:13.740 --> 01:49:17.340 And there's also a most likely explanation task, 01:49:17.340 --> 01:49:19.510 in addition to other tasks that might exist as well, 01:49:19.510 --> 01:49:21.750 which is, given observations 01:49:21.750 --> 01:49:25.920 from the start up until now, figuring out the most likely sequence of states. 01:49:25.920 --> 01:49:28.528 And this is what we're going to take a look at now, this idea 01:49:28.528 --> 01:49:30.570 that if I have all these observations-- umbrella, 01:49:30.570 --> 01:49:32.790 no umbrella, umbrella, no umbrella-- can I 01:49:32.790 --> 01:49:36.990 calculate the most likely sequence of states, sun, rain, sun, rain, and whatnot, that 01:49:36.990 --> 01:49:41.610 actually represented the true weather that would produce these observations?
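(In symbols, with X sub t for the hidden state at time t and e for the evidence observed so far, notation assumed here to match the diagram described above, the four tasks are:)

filtering: compute $P(X_t \mid e_{0:t})$
prediction: compute $P(X_{t+k} \mid e_{0:t})$ for some $k > 0$
smoothing: compute $P(X_k \mid e_{0:t})$ for some $k < t$
most likely explanation: compute $\arg\max_{x_{0:t}} P(x_{0:t} \mid e_{0:t})$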
01:49:41.610 --> 01:49:43.590 And this is quite common when you're trying 01:49:43.590 --> 01:49:46.530 to do something like voice recognition, for example, that you have 01:49:46.530 --> 01:49:49.830 these emissions of audio waveforms and you would like to calculate, 01:49:49.830 --> 01:49:52.260 based on all of the observations that you have, 01:49:52.260 --> 01:49:54.750 what is the most likely sequence of actual words 01:49:54.750 --> 01:49:59.100 or syllables or sounds that the user actually made when they were speaking 01:49:59.100 --> 01:50:01.230 to this particular device, or other tasks that 01:50:01.230 --> 01:50:03.740 might come up in that context as well. 01:50:03.740 --> 01:50:07.800 And so we can try this out by going ahead and going into the HMM 01:50:07.800 --> 01:50:11.790 directory, HMM for Hidden Markov Model. 01:50:11.790 --> 01:50:17.350 And here what I've done is I've defined a model that first 01:50:17.350 --> 01:50:22.410 defines my possible states, sun and rain, along with their emission 01:50:22.410 --> 01:50:25.690 probabilities, the observation model or the emission model, 01:50:25.690 --> 01:50:30.310 where here, given that I know that it's sunny, the probability that I 01:50:30.310 --> 01:50:32.590 see people bring an umbrella is 0.2. 01:50:32.590 --> 01:50:35.470 The probability of no umbrella is 0.8. 01:50:35.470 --> 01:50:37.288 And likewise, if it's raining, then people 01:50:37.288 --> 01:50:38.830 are more likely to bring an umbrella. 01:50:38.830 --> 01:50:40.630 Umbrella has a probability of 0.9. 01:50:40.630 --> 01:50:42.580 No umbrella has a probability of 0.1. 01:50:42.580 --> 01:50:47.350 So the actual underlying hidden states, those states are sun and rain. 01:50:47.350 --> 01:50:50.500 But the things that I observe, the observations that I can see, 01:50:50.500 --> 01:50:56.270 are either umbrella or no umbrella as the things that I observe as a result. 01:50:56.270 --> 01:51:00.730 To this, then, I also need to add a transition matrix, same as before, 01:51:00.730 --> 01:51:04.540 saying that if today is sunny, then tomorrow is more likely to be sunny, 01:51:04.540 --> 01:51:07.770 and if today is rainy, then tomorrow is more likely to be raining. 01:51:07.770 --> 01:51:10.130 As with before, I give it some starting probabilities, 01:51:10.130 --> 01:51:14.050 saying, at first, 50/50 chance for whether it's sunny or rainy, 01:51:14.050 --> 01:51:17.570 and then I can create the model based on that information. 01:51:17.570 --> 01:51:19.990 Again, the exact syntax of this is not so important 01:51:19.990 --> 01:51:23.770 so much as it is the data that I am now encoding into a program, such 01:51:23.770 --> 01:51:27.350 that now I can begin to do some inference. 01:51:27.350 --> 01:51:31.270 So I can give my program, for example, a list of observations-- 01:51:31.270 --> 01:51:34.420 umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth, 01:51:34.420 --> 01:51:35.478 no umbrella, no umbrella. 01:51:35.478 --> 01:51:37.270 And I would like 01:51:37.270 --> 01:51:41.110 to figure out the most likely explanation for these observations. 01:51:41.110 --> 01:51:42.640 What is likely? 01:51:42.640 --> 01:51:43.660 Was it rain, rain, rain? 01:51:43.660 --> 01:51:46.720 Or is it more likely that one of those days was actually sunny 01:51:46.720 --> 01:51:48.742 and then it switched back to being rainy? 01:51:48.742 --> 01:51:50.200 And that's an interesting question.
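(As before, the library isn't named in this transcript, so here is the same data laid out as plain Python dictionaries, a sketch rather than the demo's actual code. The transition values are assumed to match the earlier Markov chain example, and since the full observation list isn't read out, the one below just follows the pattern described and is illustrative only.)

```python
# Emission (sensor) model: P(observation | hidden state)
emissions = {
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

# Transition model: P(tomorrow's weather | today's weather)
# (values assumed to match the earlier Markov chain example)
transitions = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

# Starting distribution: 50/50 between sunny and rainy
start = {"sun": 0.5, "rain": 0.5}

# Observed data, following the pattern described (illustrative only)
observations = [
    "umbrella", "umbrella", "no umbrella", "umbrella",
    "umbrella", "umbrella", "no umbrella", "no umbrella",
]
```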
01:51:50.200 --> 01:51:52.360 We might not be sure, because it might just 01:51:52.360 --> 01:51:56.410 be that it just so happened on a rainy day people decided not to bring 01:51:56.410 --> 01:52:00.580 an umbrella, or it could be that it switched from rainy to sunny back 01:52:00.580 --> 01:52:04.450 to rainy, which doesn't seem too likely, but it certainly could happen. 01:52:04.450 --> 01:52:07.060 And using the data we give to the Hidden Markov Model, 01:52:07.060 --> 01:52:10.620 our model can begin to predict these answers, can begin to figure it out. 01:52:10.620 --> 01:52:13.750 So we're going to go ahead and just predict these observations. 01:52:13.750 --> 01:52:15.750 And then for each of those predictions, go ahead 01:52:15.750 --> 01:52:17.292 and print out what the prediction is. 01:52:17.292 --> 01:52:19.780 And this library just so happens to have a function called 01:52:19.780 --> 01:52:23.142 predict that does this prediction process for me. 01:52:23.142 --> 01:52:28.270 So I run Python sequence.py, and the result I get is this. 01:52:28.270 --> 01:52:31.450 This is the prediction, based on the observations, of what 01:52:31.450 --> 01:52:34.750 all of those states are likely to be, and it's likely to be rain, then rain. 01:52:34.750 --> 01:52:36.625 In this case, it thinks that what most likely 01:52:36.625 --> 01:52:39.940 happened is that it was sunny for a day and then went back to being rainy. 01:52:39.940 --> 01:52:42.700 But in different situations, if it was rainy for longer, maybe, 01:52:42.700 --> 01:52:44.750 or if the probabilities were slightly different, 01:52:44.750 --> 01:52:48.190 you might imagine that it's more likely that it was rainy all the way through, 01:52:48.190 --> 01:52:53.250 and it just so happened on one rainy day people decided not to bring umbrellas. 01:52:53.250 --> 01:52:55.750 And so here, too, Python libraries can begin 01:52:55.750 --> 01:52:58.730 to allow for this sort of inference procedure. 01:52:58.730 --> 01:53:02.410 And by taking what we know and by putting it in terms of these tasks 01:53:02.410 --> 01:53:06.310 that already exist, these general tasks that work with Hidden Markov Models, 01:53:06.310 --> 01:53:10.540 any time we can take an idea and formulate it as a Hidden Markov Model, 01:53:10.540 --> 01:53:12.550 formulate it as something that has hidden 01:53:12.550 --> 01:53:15.700 states and observed emissions that result from those states, 01:53:15.700 --> 01:53:17.830 then we can take advantage of these algorithms that 01:53:17.830 --> 01:53:21.740 are known to exist for trying to do this sort of inference. 01:53:21.740 --> 01:53:25.720 So now we've seen a couple of ways that AI can begin to deal with uncertainty. 01:53:25.720 --> 01:53:28.840 We've taken a look at probability and how we can use probability 01:53:28.840 --> 01:53:32.200 to describe numerically things that are more likely or less 01:53:32.200 --> 01:53:34.990 likely to happen than other events or other variables.
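(The predict function here is computing the most likely explanation. One standard way to do that computation yourself is the Viterbi algorithm; the sketch below reuses the start, transitions, emissions, and observations dictionaries from the sketch above, and it is an illustration of the idea, not this library's actual implementation.)

```python
import math

def most_likely_explanation(observations):
    """Viterbi: most probable hidden-state sequence for the observations."""
    states = list(start)
    # best[s] = log-probability of the best path ending in state s;
    # logs avoid underflow when multiplying many small probabilities
    best = {s: math.log(start[s]) + math.log(emissions[s][observations[0]])
            for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s2 in states:
            # Choose the predecessor state that maximizes the path score
            s1 = max(states,
                     key=lambda s: best[s] + math.log(transitions[s][s2]))
            new_best[s2] = (best[s1] + math.log(transitions[s1][s2])
                            + math.log(emissions[s2][obs]))
            new_paths[s2] = paths[s1] + [s2]
        best, paths = new_best, new_paths
    # Follow the best-scoring final state back through its stored path
    return paths[max(states, key=lambda s: best[s])]

print(most_likely_explanation(observations))
```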
01:53:34.990 --> 01:53:37.750 And using that information, we can begin to construct 01:53:37.750 --> 01:53:40.810 these standard types of models, things like Bayesian networks 01:53:40.810 --> 01:53:43.180 and Markov chains and Hidden Markov Models, 01:53:43.180 --> 01:53:47.110 that all allow us to be able to describe how particular events relate 01:53:47.110 --> 01:53:49.900 to other events or how the values of particular variables 01:53:49.900 --> 01:53:53.050 relate to other variables, not for certain, but with some sort 01:53:53.050 --> 01:53:54.550 of probability distribution. 01:53:54.550 --> 01:53:57.970 And by formulating things in terms of these models that already exist, 01:53:57.970 --> 01:54:00.160 we can take advantage of Python libraries 01:54:00.160 --> 01:54:02.950 that implement these sorts of models already and allow us just 01:54:02.950 --> 01:54:06.880 to be able to use them to produce some sort of resulting effect. 01:54:06.880 --> 01:54:08.890 So all of this, then, allows our AI to begin 01:54:08.890 --> 01:54:11.290 to deal with these sorts of uncertain problems, 01:54:11.290 --> 01:54:13.720 so that our AI doesn't need to know things for certain 01:54:13.720 --> 01:54:17.080 but can make inferences about things it doesn't know for sure. 01:54:17.080 --> 01:54:19.930 Next time, we'll take a look at additional types of problems 01:54:19.930 --> 01:54:22.870 that we can solve by taking advantage of AI-related algorithms 01:54:22.870 --> 01:54:26.140 even beyond the world of the types of problems we've already explored. 01:54:26.140 --> 01:54:28.230 We'll see you next time.