[MUSIC PLAYING]

BRIAN YU: All right, welcome back, everyone, to an Introduction to Artificial Intelligence with Python. Last time we took a look at how it is that AI inside of our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages, and the idea was we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to derive new pieces of information via inference, to take some information and deduce some additional conclusions based on the information that it already knew for sure.

But in reality, when we think about computers and we think about AI, very rarely are our machines going to be able to know things for sure. Oftentimes there's going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where they might believe something with some probability, as we'll soon discuss, but not entirely for certain. And we want to use the information it has some knowledge about, even if it doesn't have perfect knowledge, to still be able to make inferences, still be able to draw conclusions.

So you might imagine, for example, in the context of a robot that has some sensors and is exploring some environment, it might not know exactly where it is or exactly what's around it, but it does have access to some data that can allow it to draw inferences with some probability: there's some likelihood that one thing is true or another. Or you can imagine contexts where there is a little more randomness and uncertainty, something like predicting the weather, where you might not be able to know tomorrow's weather with 100% certainty, but you can probably infer with some probability what tomorrow's weather is going to be based on today's weather and yesterday's weather and other data that you might have access to as well. And so oftentimes we can distill this in terms of just possible events that might happen and what the likelihood of those events is.
This comes up a lot in games, for example, where there's an element of chance inside of those games. So imagine rolling a die: you're not sure exactly what the roll is going to be, but you know it's going to be one of the possibilities from one to six.

And so here, now, we introduce the idea of probability theory. What we'll take a look at today begins with the mathematical foundations of probability theory, getting an understanding of some of the key concepts within probability, and then dives into how we can use those mathematical ideas to represent models that we can put into our computers in order to program an AI that is able to use information about probability to draw inferences, to make some judgments about the world with some probability or likelihood of being true.

So probability ultimately boils down to the idea that there are possible worlds, which we represent here using the little Greek letter ω (omega). The idea of a possible world is that, when I roll a die, there are six possible worlds that could result from it: I can roll a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and each of those is a possible world, and each of those possible worlds has some probability of being true, the probability that I do roll a 1 or a 2 or a 3 or something else. We represent that probability using the capital letter P and then, in parentheses, what it is that we want the probability of. So P(ω) would be the probability of some possible world as represented by the little letter omega.

Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between zero and one, inclusive. The smallest value any probability can have is the number zero, which represents an impossible event, something like rolling a die and getting a seven. If the die only has numbers one through six, the event that I roll a seven is impossible, so it would have probability zero.
On the other end of the spectrum, probability can range all the way up to the positive number one, meaning an event is certain to happen, like rolling a die and getting a number less than 10. That is an event guaranteed to happen if the only sides on my die are one through six. And probabilities can take on any real number in between these two values, where, generally speaking, a higher value for the probability means an event is more likely to take place and a lower value means the event is less likely to take place.

The other key rule for probability looks a little bit like this. This sigma notation, if you haven't seen it before, refers to summation, the idea that we're going to be adding up a whole sequence of values. The sigma notation is going to come up a couple of times today, because as we deal with probability, oftentimes we're adding up a whole bunch of individual probabilities to get some other value. What this notation means is that if I sum over all of the possible worlds ω in big Ω, which represents the set of all the possible worlds, meaning I take every world in the set of possible worlds and add up all of their probabilities, what I ultimately get is the number one: the sum over all ω in Ω of P(ω) equals 1. So if I take all the possible worlds and add up each of their probabilities, I should get the number one at the end, meaning all the probabilities just need to sum to one.

So, for example, if you imagine I have a fair die with numbers one through six and I roll the die, each one of these rolls has an equal probability of taking place, and that probability is one over six. Each of these probabilities is between zero and one, zero meaning impossible and one meaning certain. And if you add up all of these probabilities for all of the possible worlds, you get the number one. And we can represent any one of those probabilities like this: the probability that we roll the number two, for example, is just one over six.
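Just to make those two axioms concrete, here's a minimal sketch in Python (my own illustration, not code from the course) that checks both of them for a single fair die:

```python
from fractions import Fraction

# The set of all possible worlds Omega for one fair six-sided die.
worlds = [1, 2, 3, 4, 5, 6]

# A fair die assigns every world the same probability, 1/6.
P = {w: Fraction(1, 6) for w in worlds}

# Axiom 1: every probability lies between zero and one, inclusive.
assert all(0 <= p <= 1 for p in P.values())

# Axiom 2: the probabilities of all possible worlds sum to one.
assert sum(P.values()) == 1

print(P[2])  # probability of rolling a two: 1/6
```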
For every six times we roll the die, we'd expect that one time the die might come up as a two. It's not certain, but it's a little more than nothing.

And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little more complex. Let's imagine now that we're not just dealing with a single die, but we have two dice. I have a red die here and a blue die there, and I care not just about what the individual roll is, but about the sum of the two rolls. In this case, the sum of the two rolls is the number three. How do I begin to reason about what the probability looks like if, instead of having one die, I now have two dice?

Well, we could first consider: what are all of the possible worlds? In this case, all of the possible worlds are just every combination of the red and blue die that I could come up with. The red die could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and for each of those possibilities, the blue die, likewise, could also be a 1 or a 2 or a 3 or a 4 or a 5 or a 6.

And it just so happens that, in this particular case, each of these possible combinations is equally likely. That's not always going to be the case. As you imagine more complex models that we could try to build to represent the real world, it's probably not going to be the case that every single possible world is equally likely. But in the case of fair dice, where on any given roll any one number has just as good a chance of coming up as any other, we can consider all of these possible worlds to be equally likely.

But even though all of the possible worlds are equally likely, that doesn't necessarily mean that their sums are equally likely. So if we consider the sum for each of these pairs: 1 plus 1 is a 2,
2 plus 1 is a 3, and so on. If we consider for each of these possible pairs of numbers what their sum ultimately is, we can notice that there are some patterns here, and it's not the case that every sum comes up equally often. Consider seven, for example: what's the probability that when I roll two dice their sum is seven? There are several ways this can happen. There are six possible worlds where the sum is seven: it could be a one and a six, or a two and a five, or a three and a four, a four and a three, and so forth. But if you instead consider the probability that I roll two dice and the sum of those two rolls is 12, well, looking at this diagram, there's only one possible world in which that can happen, and that's the possible world where the red die and the blue die both come up as sixes to give us the sum total of 12.

So based on just taking a look at this diagram, we see that some of these probabilities are different. The probability that the sum is a seven must be greater than the probability that the sum is a 12. And we can represent that more formally by saying, OK, the probability that we sum to 12 is one out of 36. Out of the 36 equally likely possible worlds (six squared, because we have six options for the red die and six options for the blue die), only one of them sums to 12. Whereas, on the other hand, for the probability that the two dice sum up to the number seven, out of those 36 possible worlds there were six worlds where the sum was seven, and so we get six over 36, which we can simplify as a fraction to just one over six.

So here, now, we're able to represent these different ideas of probability, representing some events that might be more likely and other events that are less likely as well. And these sorts of judgments, where we're figuring out, just in the abstract, the probability that some event takes place, are generally known as unconditional probabilities: some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence, without knowing any additional information.
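If you wanted to verify those dice-sum numbers, a short sketch (again my own illustration) can enumerate all 36 possible worlds and count:

```python
from fractions import Fraction
from itertools import product

# Every possible world: a (red, blue) pair of rolls.
worlds = list(product(range(1, 7), repeat=2))
assert len(worlds) == 36  # six options for red times six for blue

def p_sum(target):
    """Unconditional probability that the two rolls sum to target."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6, since six of the 36 worlds sum to seven
print(p_sum(12))  # 1/36, since only (6, 6) sums to twelve
```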
For example: if I roll a die, what's the chance it comes up as a two? Or if I roll two dice, what's the chance that the sum of those two rolls is a seven?

But usually when we're thinking about probability, especially when we're thinking about training an AI to intelligently know something about the world and make predictions based on that information, it's not unconditional probability that our AI is dealing with but, rather, conditional probability: probability where, rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works. So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us.

So what does this look like? Well, in terms of notation, we're going to represent conditional probability as the probability of a, then a vertical bar, and then b: P(a | b). The way to read this is that the thing on the left-hand side of the vertical bar is what we want the probability of. Here, I want the probability that a is true, that it is the event that actually does take place. On the right side of the vertical bar is our evidence, the information that we already know for certain about the world: for example, that b is true. So the way to read this entire expression is: what is the probability of a given b, the probability that a is true given that we already know that b is true?

And this type of judgment, conditional probability, the probability of one thing given some other fact, comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract: what is the chance that today it rains? But usually we have some additional evidence.
I know for certain that it rained yesterday, so I would like to calculate the probability that it rains today given that I know that it rained yesterday. Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic conditions: whether or not traffic conditions change might change the probability that this route is actually the optimal route. Or, in a medical context, I might want to know the probability that a patient has a particular disease given the results of some tests that have been performed on that patient. I have some evidence, the results of that test, and I would like to know the probability that the patient has a particular disease.

So this notion of conditional probability comes up everywhere as we begin to think about what we would like to reason about, reasoning a little more intelligently by taking into account evidence that we already have. We're more able to get an accurate result for the likelihood that someone has this disease if we know the results of the test, as opposed to just calculating the unconditional probability that they have the disease, without any evidence to back up our result one way or the other.

So now that we've got this idea of what conditional probability is, the next question is: how do we calculate it? How do we figure out, mathematically, if I have an expression like this, how do I get a number from it? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this: the probability of a given b, the probability that a is true given that we know that b is true, is equal to a fraction, the probability that a and b are both true divided by just the probability that b is true.
And the way to intuitively think about this is that if I want to know the probability that a is true given that b is true, I want to consider all the ways they could both be true, but the only worlds that I care about are the worlds where b is already true. I can ignore all the cases where b isn't true, because those aren't relevant to my ultimate computation; they're not relevant to what I want to get information about.

So let's take a look at an example. Let's go back to rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that two dice sum to 12 is one out of 36, because out of the 36 possible worlds, in only one of them is the sum of those two dice 12: it's only when red is six and blue is also six.

But let's say now that I have some additional information, and I want to know: what is the probability that the two dice sum to 12 given that I know the red die was a six? So I already have some evidence. I already know the red die is a six. I don't know what the blue die is; that information isn't given to me in this expression. But given the fact that I know the red die rolled a six, what is the probability that we sum to 12?

And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities, all of the possible combinations of the red die being one through six and the blue die being one through six. I might consider, first, the probability of my evidence, my b variable: what is the probability that the red die is a six? Well, that probability is just one out of six. So those worlds, one out of six of them, are really the only worlds I care about here now. All the rest are irrelevant to my calculation, because I already have this evidence that the red die was a six, so I don't need to care about all of the other possibilities that could result.
So now, in addition to the probability that the red die rolled a six, the other piece of information I need in order to calculate this conditional probability is the probability that both of my events, a and b, are true: the probability that the red die is a six and the two dice sum to 12. Well, that only happens in one of these 36 cases, the case where both the red and the blue die are equal to six. And so this probability is equal to one over 36.

And so to get the conditional probability that the sum is 12 given that I know that the red die is equal to six, I just divide these two values, and 1/36 divided by 1/6 gives us the probability of 1/6. Given that I know that the red die rolled a value of six, the probability that the sum of the two dice is 12 is also one over six. And that probably makes intuitive sense, too, because if the red die is a six, the only way for me to get to a 12 is if the blue die also rolls a six, and we know that the probability of the blue die rolling a six is one over six.

So in this case, the conditional probability seems fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both of the events take place is an idea that's going to come up again and again. This is the definition, now, of conditional probability, and we're going to use that definition as we think about probability more generally to be able to draw conclusions about the world.
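Here's that definition as a quick sketch (an illustration of mine, not the lecture's code), computing the same conditional probability by enumeration:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # all (red, blue) rolls

# P(b): the probability that the red die is a six.
b_worlds = [w for w in worlds if w[0] == 6]
p_b = Fraction(len(b_worlds), len(worlds))      # 1/6

# P(a and b): the red die is a six AND the two dice sum to 12.
ab_worlds = [w for w in worlds if w[0] == 6 and sum(w) == 12]
p_ab = Fraction(len(ab_worlds), len(worlds))    # 1/36

# The definition: P(a | b) = P(a and b) / P(b).
print(p_ab / p_b)  # 1/6
```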
This, again, is that formula: the probability of a given b is equal to the probability that a and b take place divided by the probability of b. And you'll see this formula sometimes written in a couple of different ways. You could imagine, algebraically, multiplying both sides of this equation by the probability of b to get rid of the fraction, and you'll get an expression like this: the probability of a and b is just the probability of b times the probability of a given b. Or, since a and b in this expression are interchangeable (a and b is the same thing as b and a), you could equivalently represent the probability of a and b as the probability of a times the probability of b given a, just switching all of the a's and b's. These three are all equivalent ways of representing what joint probability means. And so you'll sometimes see all of these equations, and they might be useful to you as you begin to reason about probability and to think about what values might be taking place in the real world.

Now, sometimes when we deal with probability, we don't just care about a Boolean event: did this happen or did this not happen? Sometimes we might want the ability to represent variable values in a probability space, where some variable might take on multiple different possible values. In probability theory, we call such a variable a random variable. A random variable is just some variable in probability theory that has some domain of values that it can take on.

So what do I mean by this? Well, I might have a random variable that is just called Roll, for example, that has six possible values. Roll is my variable, and the possible values, the domain of values that it can take on, are 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same, but for other random variables that might not be the case. For example, I might have a random variable to represent the weather, where the domain of values it could take on are things like sun or cloudy or rainy or windy or snowy, and each of those might have a different probability. I care about knowing the probability that the weather equals sun, or that the weather equals clouds, for instance, and I might like to do some mathematical calculations based on that information. Other random variables might be something like traffic.
What are the odds that there is no traffic, or light traffic, or heavy traffic? Traffic, in this case, is my random variable, and the values that it can take on are none, light, or heavy. And I, the person doing these calculations, the person encoding these random variables into my computer, need to make the decision as to what these possible values actually are.

You might imagine, for example, that if I care about whether or not I make it to a flight on time, my flight has a couple of possible values that it could take on. My flight could be on time, my flight could be delayed, or my flight could be canceled. So Flight, in this case, is my random variable, and those are the values it can take on.

And often I'll want to know the probability that my random variable takes on each of those possible values. This is what we then call a probability distribution. A probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, my probability distribution might look something like this. My probability distribution says the probability that the random variable Flight is equal to the value on time is 0.6, or, put into more human-friendly terms, the likelihood that my flight is on time is 60%. In this case, the probability that my flight is delayed is 30%, and the probability that my flight is canceled is 10%, or 0.1. And if you sum up all of these possible values, the sum is going to be 1: take all of the possible worlds, here my three possible worlds for the value of the random variable Flight, add them all up together, and the result needs to be the number one, per that axiom of probability theory that we discussed before. So this is one way of representing the probability distribution for the random variable Flight.
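One way we might encode that distribution in Python (a hypothetical sketch of my own; the name flight and the dictionary layout are not from the course) looks like this:

```python
# Probability distribution for the random variable Flight,
# mapping each value in its domain to a probability.
flight = {
    "on time": 0.6,
    "delayed": 0.3,
    "canceled": 0.1,
}

# The probabilities over the whole domain must sum to one
# (compared with a tolerance, since these are floats).
assert abs(sum(flight.values()) - 1.0) < 1e-9

print(flight["on time"])  # 0.6
```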
Sometimes, though, you'll see it represented a little more concisely, since this is pretty verbose for really just trying to express three possible values. Often you'll instead see this same idea represented using a vector. All a vector is is a sequence of values: as opposed to just a single value, I might have multiple values. And so I could instead represent this idea this way: bold P, a larger P, generally meaning the probability distribution of this variable Flight, is equal to this vector represented in angle brackets. The probability distribution is 0.6, 0.3, and 0.1, and I would just have to know that this probability distribution is in the order of on time, delayed, and canceled to know how to interpret the vector: the first value in the vector is the probability that my flight is on time, the second value is the probability that my flight is delayed, and the third value is the probability that my flight is canceled.

And so this is just an alternate, more succinct way of representing the same idea. Oftentimes you'll see us just talk about a probability distribution over a random variable, and whenever we talk about that, what we're really doing is trying to figure out the probabilities of each of the possible values that that random variable can take on. This notation is just a little more succinct, even though it can sometimes be a little confusing depending on the context in which you see it. So we'll start to look at examples where we use this sort of notation to describe probability and to describe events that might take place.

A couple of other important ideas to know with regard to probability theory: one is this idea of independence. Independence refers to the idea that the knowledge of one event doesn't influence the probability of another event. So, for example, in the context of my two dice rolls, where I had the red die and the blue die, those two events, the red die's roll and the blue die's roll, are independent. Knowing the result of the red die doesn't change the probabilities for the blue die. It doesn't give me any additional information about what the value of the blue die is ultimately going to be.
But that's not always going to be the case. You might imagine that in the case of weather, something like clouds and rain, those are probably not independent: if it is cloudy, that might increase the probability that later in the day it's going to rain. So some information informs some other event or some other random variable. Independence refers to the idea that one event doesn't influence the other, and if they're not independent, then there might be some relationship.

So mathematically, formally, what does independence actually mean? Well, recall this formula from before: the probability of a and b is the probability of a times the probability of b given a. The intuitive way to think about this is that to know how likely it is that a and b both happen, let's first figure out the likelihood that a happens, and then, given that we know that a happens, figure out the likelihood that b happens, and multiply those two things together.

But if a and b were independent, meaning knowing a doesn't change anything about the likelihood that b is true, then the probability of b given a, meaning the probability that b is true given that I know a is true, shouldn't really depend on a at all: a shouldn't influence b. So the probability of b given a is really just the probability of b, if it is true that a and b are independent. And so this gives us one definition of what it means for a and b to be independent: the probability of a and b is just the probability of a times the probability of b. Any time you find two events a and b where this relationship holds, you can say that a and b are independent.

So an example of that might be the dice that we were taking a look at before. Here, if I wanted the probability of red being a six and blue being a six, well, that's just the probability that red is a six multiplied by the probability that blue is a six: one over six times one over six, which is equal to one over 36. So I can say that these two events are independent.
So this, for example, has a probability of one over 36, as we talked about before. But what wouldn't be independent would be a case like this: the probability that the red die rolls a six and the red die rolls a four. If you just naively took red die six, red die four, you might imagine the naive approach is to say, well, each of these has a probability of one over six, so multiply them together and the probability is one over 36. But, of course, if you're only rolling the red die once, there's no way you could get two different values for the red die. It couldn't both be a six and a four, so the probability should be zero. If you were to multiply the probability of red six times the probability of red four, that would equal one over 36, but that's not right, because we know there is no way, probability zero, that when we roll the red die once we get both a six and a four; only one of those possibilities can actually be the result.

And so we can say that the event that the red roll is six and the event that the red roll is four are not independent. If I know that the red roll is a six, I know that the red roll cannot possibly be a four, so these things are not independent. Instead, if I wanted to calculate the probability, I would need to use conditional probability, as in the regular definition of the probability of two events taking place. And the probability of this, now: the probability of the red roll being a six, that's one over six. But what's the probability that the roll is a four given that the roll is a six? Well, that's just zero, because there's no way for the red roll to be a four given that we already know the red roll is a six. And so, if we do all that multiplication, we get the number zero.

So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way.
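Both of those cases, the independent pair and the non-independent pair, can be checked with a small sketch (again my own illustration, not the lecture's code):

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # all (red, blue) rolls

def prob(event):
    """Probability that a predicate over (red, blue) worlds holds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

# Independent: red is six, blue is six.
p_red6 = prob(lambda w: w[0] == 6)                 # 1/6
p_blue6 = prob(lambda w: w[1] == 6)                # 1/6
p_both = prob(lambda w: w[0] == 6 and w[1] == 6)   # 1/36
assert p_both == p_red6 * p_blue6  # the independence relationship holds

# Not independent: red is six AND red is four (on one roll).
p_red4 = prob(lambda w: w[0] == 4)                      # 1/6
p_impossible = prob(lambda w: w[0] == 6 and w[0] == 4)  # 0
assert p_impossible != p_red6 * p_red4  # the naive product 1/36 is wrong
```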
And this gets us to one of the most important rules in probability theory, which is known as Bayes' rule. It turns out that just using the information we've already learned about probability and applying a little bit of algebra, we can actually derive Bayes' rule for ourselves. And it's a very important rule when it comes to inference and thinking about probability in the context of what a computer, or a mathematician, can do by having access to information about probability.

So let's go back to these equations to derive Bayes' rule ourselves. We know the probability of a and b, the likelihood that a and b both take place, is the likelihood of b times the likelihood of a given that we know that b is already true. And likewise, the probability of a and b is the probability of a times the probability of b given that we know that a is already true. This is a symmetric relationship where the order doesn't matter: a and b means the same thing as b and a. So in these equations, we can just swap a and b to represent the exact same idea.

So we know that these two equations are already true; we've seen that already. Now let's just do a little bit of algebraic manipulation. Both of these expressions on the right-hand side are equal to the probability of a and b. So what I can do is take those two expressions and set them equal to each other: if they're both equal to the probability of a and b, then they must be equal to each other. So the probability of a times the probability of b given a is equal to the probability of b times the probability of a given b. And now all we're going to do is a little bit of division: I divide both sides by P(a), and now I get what is Bayes' rule. The probability of b given a is equal to the probability of b times the probability of a given b divided by the probability of a.
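Written out compactly, that derivation is:

```latex
P(a \wedge b) = P(b)\,P(a \mid b)
  \quad\text{and}\quad
P(a \wedge b) = P(a)\,P(b \mid a)
\;\Longrightarrow\;
P(b \mid a) = \frac{P(b)\,P(a \mid b)}{P(a)}
```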
Sometimes in Bayes' rule you'll see the order of the two terms in the numerator switched: instead of the probability of b times the probability of a given b, it'll be the probability of a given b times the probability of b. That ultimately doesn't matter, because in multiplication you can switch the order of the two things you're multiplying and it doesn't change the result. But this here is the most common formulation of Bayes' rule: the probability of b given a is equal to the probability of a given b times the probability of b, divided by the probability of a.

And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability, the conditional probability of b given a, using knowledge about the probability of a given b, the reverse of that conditional probability. So let's first do a little bit of an example with this, just to see how we might use it, and then explore what it means a little more generally.

We're going to construct a situation where I have some information. There are two events that I care about: the idea that it's cloudy in the morning, and the idea that it is rainy in the afternoon. Those are two different possible events that could take place: cloudy in the AM, rainy in the PM. And what I care about is: given clouds in the morning, what is the probability of rain in the afternoon? That's a reasonable question I might ask. In the morning, I look outside, or an AI's camera looks outside, and sees that there are clouds, and we want to figure out the probability that in the afternoon there is going to be rain. Of course, in the abstract, we don't have direct access to this kind of information, but we can use data to begin to try to figure it out.

So let's imagine, now, that I have access to some pieces of information. I know that 80% of rainy afternoons start out with a cloudy morning, and you might imagine that I could have gathered this data just by looking at observations over a stretch of time: 80% of the time, when it's raining in the afternoon, it was cloudy that morning. I also know that 40% of days have cloudy mornings, and I also know that 10% of days have rainy afternoons.
And now, using this information, I would like to figure out: given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds, and I can do that, in particular, using this fact: if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So, using the reverse conditional probability, I can figure that out.

Expressed in terms of Bayes' rule, this is what that looks like: the probability of rain given clouds is the probability of clouds given rain, times the probability of rain, divided by the probability of clouds. Here I'm just substituting in for the values of a and b from Bayes' rule before. And then I can just do the math. I have this information: I know that 80% of the time, if it was raining, there were clouds in the morning, so 0.8 here; the probability of rain is 0.1, because 10% of days have rainy afternoons; and the probability of clouds is 0.4, because 40% of days have cloudy mornings. I do the math and I can figure out the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case.

And this, now, is an application of Bayes' rule: the idea that using one conditional probability, we can get the reverse conditional probability. This is often useful when one of the conditional probabilities might be easier for us to know about or easier for us to have data about, and using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, more generally, if we know the probability of some visible effect, some effect that we can see and observe, given some unknown cause that we're not sure about, then we can calculate the probability of that unknown cause given the visible effect.
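That clouds-and-rain calculation is tiny in code. Here's a sketch of mine, with the lecture's three numbers plugged in:

```python
from fractions import Fraction

def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

p_clouds_given_rain = Fraction(8, 10)  # 80% of rainy afternoons start cloudy
p_rain = Fraction(1, 10)               # 10% of days have rainy afternoons
p_clouds = Fraction(4, 10)             # 40% of days have cloudy mornings

# P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
print(bayes(p_clouds_given_rain, p_rain, p_clouds))  # 1/5, i.e. 0.2
```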
657 00:32:30,520 --> 00:32:32,520 Well, in the context of medicine, for example, 658 00:32:32,520 --> 00:32:37,440 I might know the probability of some medical test result given a disease. 659 00:32:37,440 --> 00:32:41,520 Like, I know that if someone has a disease, then x percent of the time 660 00:32:41,520 --> 00:32:44,340 the medical test result will show up as this, for instance. 661 00:32:44,340 --> 00:32:47,100 And using that information, then I can calculate, 662 00:32:47,100 --> 00:32:50,430 given that I know the medical test 663 00:32:50,430 --> 00:32:53,590 result, what is the likelihood that someone has the disease? 664 00:32:53,590 --> 00:32:56,970 This is the piece of information that is usually easier to know, easier 665 00:32:56,970 --> 00:32:59,130 to immediately have access to data for. 666 00:32:59,130 --> 00:33:02,687 And this is the information that I actually want to calculate. 667 00:33:02,687 --> 00:33:04,270 Or I might want to know, for example-- 668 00:33:04,270 --> 00:33:08,400 I might know that some percentage of counterfeit bills 669 00:33:08,400 --> 00:33:11,670 have blurry text around the edges, because counterfeit printers 670 00:33:11,670 --> 00:33:13,950 aren't nearly as good at printing text precisely. 671 00:33:13,950 --> 00:33:16,380 So I have some information about given that something 672 00:33:16,380 --> 00:33:20,550 is a counterfeit bill, x percent of counterfeit bills have blurry text, 673 00:33:20,550 --> 00:33:21,510 for example. 674 00:33:21,510 --> 00:33:24,840 And using that information, then I can calculate some piece of information 675 00:33:24,840 --> 00:33:27,360 that I might want to know, like, given that I 676 00:33:27,360 --> 00:33:31,980 know there's blurry text on a bill, what is the probability that that bill is 677 00:33:31,980 --> 00:33:32,580 counterfeit? 678 00:33:32,580 --> 00:33:34,980 So given one conditional probability, I can 679 00:33:34,980 --> 00:33:39,363 calculate the other conditional probability as well. 680 00:33:39,363 --> 00:33:41,280 And so now we've taken a look at a couple 681 00:33:41,280 --> 00:33:42,990 of different types of probability. 682 00:33:42,990 --> 00:33:45,210 We've looked at unconditional probability 683 00:33:45,210 --> 00:33:48,300 where I just look at what is the probability of this event occurring 684 00:33:48,300 --> 00:33:51,390 given no additional evidence that I might have access to, 685 00:33:51,390 --> 00:33:53,940 and we've also looked at conditional probability 686 00:33:53,940 --> 00:33:57,570 where I have some sort of evidence, and I would like to, using that evidence, 687 00:33:57,570 --> 00:34:00,847 be able to calculate some other probability as well. 688 00:34:00,847 --> 00:34:03,930 The other kind of probability that will be important for us to think about 689 00:34:03,930 --> 00:34:06,360 is joint probability, and this is when we're 690 00:34:06,360 --> 00:34:11,250 considering the likelihood of multiple different events simultaneously. 691 00:34:11,250 --> 00:34:12,580 And so what do we mean by this? 692 00:34:12,580 --> 00:34:15,534 Well, for example, I might have probability distributions 693 00:34:15,534 --> 00:34:18,659 that look a little something like this, like I want to know the probability 694 00:34:18,659 --> 00:34:22,800 distribution of clouds in the morning, and that distribution looks like this.
695 00:34:22,800 --> 00:34:26,460 40% of the time, C, which is my random variable here, 696 00:34:26,460 --> 00:34:31,060 is equal to cloudy, and 60% of the time it's not cloudy. 697 00:34:31,060 --> 00:34:33,420 So here is just a simple probability distribution 698 00:34:33,420 --> 00:34:37,710 that is effectively telling me that 40% of the time it's cloudy. 699 00:34:37,710 --> 00:34:41,219 I might also have a probability distribution for rain in the afternoon 700 00:34:41,219 --> 00:34:44,670 where 10% of the time, or with probability 0.1, 701 00:34:44,670 --> 00:34:48,600 it is raining in the afternoon and with probability 0.9 702 00:34:48,600 --> 00:34:51,090 it is not raining in the afternoon. 703 00:34:51,090 --> 00:34:54,580 And using just these two pieces of information, 704 00:34:54,580 --> 00:34:57,540 I don't actually have a whole lot of information about how these two 705 00:34:57,540 --> 00:34:59,980 variables relate to each other. 706 00:34:59,980 --> 00:35:02,940 But I could if I had access to their joint probability, 707 00:35:02,940 --> 00:35:05,550 meaning for every combination of these two things-- 708 00:35:05,550 --> 00:35:09,330 meaning morning cloudy and afternoon rain, morning cloudy and afternoon 709 00:35:09,330 --> 00:35:12,960 not rain, morning not cloudy and afternoon rain, and morning 710 00:35:12,960 --> 00:35:15,150 not cloudy and afternoon not raining-- 711 00:35:15,150 --> 00:35:17,700 if I had access to values for each of those four, 712 00:35:17,700 --> 00:35:20,340 I'd have more information-- so information that'd 713 00:35:20,340 --> 00:35:22,390 be organized in a table like this. 714 00:35:22,390 --> 00:35:25,690 And this, rather than just a probability distribution, 715 00:35:25,690 --> 00:35:27,970 is a joint probability distribution. 716 00:35:27,970 --> 00:35:31,090 It tells me the probability distribution of each 717 00:35:31,090 --> 00:35:34,930 of the possible combinations of values that these random variables 718 00:35:34,930 --> 00:35:36,160 can take on. 719 00:35:36,160 --> 00:35:39,640 So if I want to know, what is the probability that on any given day 720 00:35:39,640 --> 00:35:42,400 it is both cloudy and rainy, well, I would say, 721 00:35:42,400 --> 00:35:45,100 all right, we're looking at cases where it is cloudy 722 00:35:45,100 --> 00:35:48,460 and cases where it is raining and the intersection of those two, 723 00:35:48,460 --> 00:35:51,310 that row and that column, is 0.08. 724 00:35:51,310 --> 00:35:55,210 So that is the probability that it is both cloudy and rainy 725 00:35:55,210 --> 00:35:57,070 using that information. 726 00:35:57,070 --> 00:36:00,010 And using this table, 727 00:36:00,010 --> 00:36:02,260 this joint probability table, I can 728 00:36:02,260 --> 00:36:04,930 begin to draw other pieces of information 729 00:36:04,930 --> 00:36:07,420 about things like conditional probability. 730 00:36:07,420 --> 00:36:11,890 So I might ask a question like, what is the probability distribution of clouds 731 00:36:11,890 --> 00:36:14,470 given that I know that it is raining, meaning 732 00:36:14,470 --> 00:36:16,660 I know for sure that it's raining. 733 00:36:16,660 --> 00:36:19,780 Tell me the probability distribution over whether it's cloudy 734 00:36:19,780 --> 00:36:22,720 or not given that I know already that it is, in fact, raining. 735 00:36:22,720 --> 00:36:25,480 And here I'm using C to stand for that random variable.
736 00:36:25,480 --> 00:36:28,030 I'm looking for a distribution, meaning the answer to this 737 00:36:28,030 --> 00:36:29,860 is not going to be a single value. 738 00:36:29,860 --> 00:36:33,760 It's going to be two values, a vector of two values where the first value is 739 00:36:33,760 --> 00:36:37,960 probability of clouds, the second value is probability that it is not cloudy, 740 00:36:37,960 --> 00:36:40,240 but the sum of those two values is going to be one, 741 00:36:40,240 --> 00:36:42,470 because when you add up the probabilities of all 742 00:36:42,470 --> 00:36:47,190 of the possible worlds, the result that you get must be the number one. 743 00:36:47,190 --> 00:36:50,740 And, well, what do we know about how to calculate a conditional probability? 744 00:36:50,740 --> 00:36:56,590 Well, we know that the probability of a given b is the probability of a and b 745 00:36:56,590 --> 00:36:59,320 divided by the probability of b. 746 00:36:59,320 --> 00:37:00,740 So what does this mean? 747 00:37:00,740 --> 00:37:03,610 Well, it means that I can calculate the probability of clouds 748 00:37:03,610 --> 00:37:08,260 given that it's raining as the probability of clouds 749 00:37:08,260 --> 00:37:11,230 and raining divided by the probability of rain. 750 00:37:11,230 --> 00:37:15,220 And this comma here for the probability distribution of clouds and rain, 751 00:37:15,220 --> 00:37:17,710 this comma sort of stands in for the word "and." 752 00:37:17,710 --> 00:37:21,460 You'll sort of see the logical operator AND and the comma used interchangeably. 753 00:37:21,460 --> 00:37:24,550 This means the probability distribution over the clouds 754 00:37:24,550 --> 00:37:29,382 and knowing the fact that it is raining divided by the probability of rain. 755 00:37:29,382 --> 00:37:31,840 And the interesting thing to note here and what we'll often 756 00:37:31,840 --> 00:37:34,210 do in order to simplify our mathematics is 757 00:37:34,210 --> 00:37:38,260 that dividing by the probability of rain, the probability of rain 758 00:37:38,260 --> 00:37:40,150 here is just some numerical constant. 759 00:37:40,150 --> 00:37:40,900 It is some number. 760 00:37:40,900 --> 00:37:43,780 Dividing by probability of rain is just dividing 761 00:37:43,780 --> 00:37:46,090 by some constant or, in other words, multiplying 762 00:37:46,090 --> 00:37:48,100 by the inverse of that constant. 763 00:37:48,100 --> 00:37:50,620 And it turns out that oftentimes we can just 764 00:37:50,620 --> 00:37:53,230 not worry about what the exact value of this is 765 00:37:53,230 --> 00:37:56,370 and just know that it is, in fact, a constant value, 766 00:37:56,370 --> 00:37:57,620 and we'll see why in a moment. 767 00:37:57,620 --> 00:38:01,390 So instead of expressing this as this joint probability divided 768 00:38:01,390 --> 00:38:06,790 by the probability of rain, sometimes we'll just represent it as alpha times 769 00:38:06,790 --> 00:38:10,830 the numerator here, the probability distribution of C, this variable, 770 00:38:10,830 --> 00:38:13,370 and that we know that it is raining, for instance. 771 00:38:13,370 --> 00:38:16,600 So all we've done here is said this value of one 772 00:38:16,600 --> 00:38:19,840 over the probability of rain, that's really just a constant that we're 773 00:38:19,840 --> 00:38:23,140 going to divide by or equivalently multiply by the inverse of at the end. 774 00:38:23,140 --> 00:38:26,770 We'll just call it alpha for now and deal with it a little bit later. 
775 00:38:26,770 --> 00:38:30,130 But the key idea here now-- and this is an idea that's going to come up again-- 776 00:38:30,130 --> 00:38:34,390 is that the conditional distribution of C given rain 777 00:38:34,390 --> 00:38:38,200 is proportional to, meaning just some factor multiplied by, 778 00:38:38,200 --> 00:38:42,580 the joint probability of C and rain being true. 779 00:38:42,580 --> 00:38:44,030 And so how do we figure this out? 780 00:38:44,030 --> 00:38:46,720 Well, this is going to be the probability that it is cloudy 781 00:38:46,720 --> 00:38:50,200 and it's raining, which is 0.08, and the probability that it's not 782 00:38:50,200 --> 00:38:53,350 cloudy and it's raining, which is 0.02. 783 00:38:53,350 --> 00:38:55,180 And so we get alpha times-- 784 00:38:55,180 --> 00:38:58,060 here now is that probability distribution. 785 00:38:58,060 --> 00:39:00,370 0.08 is clouds and rain. 786 00:39:00,370 --> 00:39:04,210 0.02 is not cloudy and rain. 787 00:39:04,210 --> 00:39:08,260 But, of course, 0.08 and 0.02 don't sum up to the number one. 788 00:39:08,260 --> 00:39:10,780 And we know that in a probability distribution, 789 00:39:10,780 --> 00:39:13,030 if you consider all of the possible values, 790 00:39:13,030 --> 00:39:15,730 they must sum up to a probability of one. 791 00:39:15,730 --> 00:39:20,350 And so we know that we just need to figure out some constant to normalize, 792 00:39:20,350 --> 00:39:23,830 so to speak, these values, something we can multiply or divide by 793 00:39:23,830 --> 00:39:26,600 to get it so that all of these probabilities sum up to one. 794 00:39:26,600 --> 00:39:29,390 And it turns out that if we multiply both numbers by 10, 795 00:39:29,390 --> 00:39:32,290 then we can get that result of 0.8 and 0.2. 796 00:39:32,290 --> 00:39:34,990 The proportions are still equivalent, but now 0.8 797 00:39:34,990 --> 00:39:38,750 plus 0.2, those sum up to the number 1. 798 00:39:38,750 --> 00:39:41,080 So take a look at this and see if you can understand, 799 00:39:41,080 --> 00:39:43,870 step by step, how it is we're getting from one point to another. 800 00:39:43,870 --> 00:39:48,190 But the key idea here is that by using the joint probabilities, 801 00:39:48,190 --> 00:39:52,480 these probabilities that it is both cloudy and rainy and that it is not 802 00:39:52,480 --> 00:39:56,740 cloudy and rainy, I can take that information and figure out 803 00:39:56,740 --> 00:39:59,800 the conditional probability-- given that it's raining, 804 00:39:59,800 --> 00:40:02,320 what is the chance that it's cloudy versus not cloudy-- 805 00:40:02,320 --> 00:40:06,740 just by multiplying by some normalization constant, so to speak. 806 00:40:06,740 --> 00:40:08,860 And this is what a computer can begin to use 807 00:40:08,860 --> 00:40:12,130 to be able to interact with these various different types 808 00:40:12,130 --> 00:40:13,207 of probabilities. 809 00:40:13,207 --> 00:40:15,790 And it turns out there are a number of other probability rules 810 00:40:15,790 --> 00:40:19,570 that are going to be useful to us as we begin to explore how we can actually 811 00:40:19,570 --> 00:40:22,860 use this information to encode into our computers 812 00:40:22,860 --> 00:40:27,030 some more complex analysis that we might want to do about probability 813 00:40:27,030 --> 00:40:30,793 and distributions and random variables that we might be interacting with. 814 00:40:30,793 --> 00:40:33,210 So here are a couple of those important probability rules.
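First, though, here's that normalization step from just above written out as a minimal Python sketch, using the two joint probabilities from the table:

# Joint probabilities from the table: P(cloudy, rain) and P(not cloudy, rain)
joint = {"cloudy": 0.08, "not cloudy": 0.02}

# alpha is 1 divided by the sum of the values (that sum is P(rain)),
# so multiplying by alpha makes the distribution sum to one
alpha = 1 / sum(joint.values())
conditional = {c: alpha * p for c, p in joint.items()}
print(conditional)  # {'cloudy': 0.8, 'not cloudy': 0.2}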
815 00:40:33,210 --> 00:40:35,850 One of the simplest rules is just this negation rule. 816 00:40:35,850 --> 00:40:39,420 What is the probability of not event a? 817 00:40:39,420 --> 00:40:41,970 So a is an event that has some probability, 818 00:40:41,970 --> 00:40:45,840 and I would like to know, what is the probability that a does not occur? 819 00:40:45,840 --> 00:40:50,340 And it turns out it's just one minus P of a, which makes sense 820 00:40:50,340 --> 00:40:52,470 because if those are the two possible cases, 821 00:40:52,470 --> 00:40:56,770 either a happens or a doesn't happen, then when you add up those two cases, 822 00:40:56,770 --> 00:41:02,970 you must get one, which means P of not a must just be one minus P of a 823 00:41:02,970 --> 00:41:06,930 because P of a and P of not a must sum up to the number one. 824 00:41:06,930 --> 00:41:10,050 They must include all of the possible cases. 825 00:41:10,050 --> 00:41:14,010 We've seen an expression for calculating the probability of a and b. 826 00:41:14,010 --> 00:41:18,180 We might also reasonably want to calculate the probability of a or b. 827 00:41:18,180 --> 00:41:21,480 What is the probability that one thing happens or another thing happens? 828 00:41:21,480 --> 00:41:23,550 So for example, I might want to calculate, 829 00:41:23,550 --> 00:41:26,010 what is the probability that if I roll two dice, 830 00:41:26,010 --> 00:41:29,970 a red die and a blue die, what is the likelihood that a is a six or b 831 00:41:29,970 --> 00:41:31,860 is a six, one or the other? 832 00:41:31,860 --> 00:41:34,860 And what you might imagine you could do and the wrong way to approach it 833 00:41:34,860 --> 00:41:38,810 would be just to say, all right, well, a comes up as a six, 834 00:41:38,810 --> 00:41:41,727 the red die comes up as a six with probability one over six. 835 00:41:41,727 --> 00:41:42,810 The same for the blue die. 836 00:41:42,810 --> 00:41:44,070 It's also one over six. 837 00:41:44,070 --> 00:41:47,520 Add them together and you get 2/6, otherwise known as 1/3. 838 00:41:47,520 --> 00:41:50,820 But this suffers from the problem of over counting, 839 00:41:50,820 --> 00:41:54,330 that we've double counted the case where both a and b, both 840 00:41:54,330 --> 00:41:57,690 the red die and the blue die, both come up as a six roll, 841 00:41:57,690 --> 00:41:59,780 and I've counted that instance twice. 842 00:41:59,780 --> 00:42:02,070 So to resolve this, the actual expression 843 00:42:02,070 --> 00:42:05,100 for calculating the probability of a or b 844 00:42:05,100 --> 00:42:08,070 uses what we call the inclusion-exclusion formula. 845 00:42:08,070 --> 00:42:11,510 So I take the probability of a, add it to the probability of b. 846 00:42:11,510 --> 00:42:12,900 That's all same as before. 847 00:42:12,900 --> 00:42:16,440 But then I need to exclude the cases that I've double counted. 848 00:42:16,440 --> 00:42:21,930 So I subtract from that the probability of a and b, and that 849 00:42:21,930 --> 00:42:23,520 gets me the result for a or b. 850 00:42:23,520 --> 00:42:27,348 I consider all the cases where a is true and all the cases where b is true. 
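As a quick sketch of both of these rules in Python, here's the two-dice example, assuming the red and blue dice are independent so that P(a and b) is just the product of the two:

from fractions import Fraction

# Negation: P(not a) = 1 - P(a)
p_a = Fraction(1, 6)   # red die comes up six
p_b = Fraction(1, 6)   # blue die comes up six
p_not_a = 1 - p_a      # 5/6

# Inclusion-exclusion: P(a or b) = P(a) + P(b) - P(a and b)
p_a_and_b = p_a * p_b  # independent dice, so 1/36
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)        # 11/36, a bit less than the (wrong) 2/6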
851 00:42:27,348 --> 00:42:29,640 And if you imagine this is like a Venn diagram of cases 852 00:42:29,640 --> 00:42:31,830 where a is true, cases where b is true, I just 853 00:42:31,830 --> 00:42:34,500 need to subtract out the middle to get rid of the cases 854 00:42:34,500 --> 00:42:37,860 that I have over counted by double counting them inside of both 855 00:42:37,860 --> 00:42:41,520 of these individual expressions. 856 00:42:41,520 --> 00:42:43,530 One other rule that's going to be quite helpful 857 00:42:43,530 --> 00:42:45,770 is a rule called marginalization. 858 00:42:45,770 --> 00:42:47,880 So marginalization is answering the question 859 00:42:47,880 --> 00:42:52,350 of how do I figure out the probability of a using some other variable that I 860 00:42:52,350 --> 00:42:53,970 might have access to, like b? 861 00:42:53,970 --> 00:42:56,190 Even if I don't know additional information about it, 862 00:42:56,190 --> 00:43:00,270 I know that b, some event, can have two possible states. 863 00:43:00,270 --> 00:43:05,080 Either b happens or b doesn't happen, assuming it's a Boolean, true or false. 864 00:43:05,080 --> 00:43:07,500 And well, what that means is that for me to be 865 00:43:07,500 --> 00:43:11,130 able to calculate the probability of a, there are only two cases. 866 00:43:11,130 --> 00:43:15,930 Either a happens and b happens or a happens and b doesn't happen. 867 00:43:15,930 --> 00:43:19,200 And those are two disjoint cases, meaning they can't both happen together-- 868 00:43:19,200 --> 00:43:21,480 either b happens or b doesn't happen. 869 00:43:21,480 --> 00:43:23,640 They're disjoint or separate cases. 870 00:43:23,640 --> 00:43:28,140 And so I can figure out the probability of a just by adding up those two cases. 871 00:43:28,140 --> 00:43:31,770 The probability that a is true is the probability 872 00:43:31,770 --> 00:43:35,640 that a and b is true plus the probability that a is true 873 00:43:35,640 --> 00:43:36,810 and b isn't true. 874 00:43:36,810 --> 00:43:40,123 So by marginalizing, I've looked at the two possible cases 875 00:43:40,123 --> 00:43:41,040 that might take place. 876 00:43:41,040 --> 00:43:44,120 Either b happens or b doesn't happen. 877 00:43:44,120 --> 00:43:47,610 And in either of those cases, I look at, what's the probability that a happens, 878 00:43:47,610 --> 00:43:50,430 and if I add those together, well, then I get the probability 879 00:43:50,430 --> 00:43:52,710 that a happens as a whole. 880 00:43:52,710 --> 00:43:54,030 So take a look at that rule. 881 00:43:54,030 --> 00:43:57,120 It doesn't matter what b is or how it's related to a. 882 00:43:57,120 --> 00:43:59,580 So long as I know these joint distributions, 883 00:43:59,580 --> 00:44:02,280 I can figure out the overall probability of a. 884 00:44:02,280 --> 00:44:05,130 And this can be a useful way, if I have a joint distribution, 885 00:44:05,130 --> 00:44:08,550 like the joint distribution of a and b, to just figure out 886 00:44:08,550 --> 00:44:11,320 some unconditional probability, like the probability of a, 887 00:44:11,320 --> 00:44:14,520 and we'll see examples of this soon, as well. 888 00:44:14,520 --> 00:44:17,460 Now, sometimes these might not just be variables 889 00:44:17,460 --> 00:44:21,160 that are events that either happened or they didn't happen, like b is here. 890 00:44:21,160 --> 00:44:23,850 They might be some broader probability distribution where 891 00:44:23,850 --> 00:44:25,800 there are multiple possible values.
892 00:44:25,800 --> 00:44:28,710 And so here, in order to use this marginalization rule, 893 00:44:28,710 --> 00:44:34,290 I need to sum up not just over b and not b, but for all of the possible values 894 00:44:34,290 --> 00:44:36,610 that the other random variable could take on. 895 00:44:36,610 --> 00:44:39,360 And so here we'll see a version of this rule for random variables, 896 00:44:39,360 --> 00:44:41,610 and it's going to include that summation notation 897 00:44:41,610 --> 00:44:46,270 to indicate that I'm summing up, adding up, a whole bunch of individual values. 898 00:44:46,270 --> 00:44:47,092 So here's the rule. 899 00:44:47,092 --> 00:44:49,050 Looks a lot more complicated, but it's actually 900 00:44:49,050 --> 00:44:51,330 the equivalent, exactly the same rule. 901 00:44:51,330 --> 00:44:55,500 What I'm saying here is that if I have two random variables, one called x 902 00:44:55,500 --> 00:45:01,380 and one called y, well, the probability that x is equal to some value x sub i-- 903 00:45:01,380 --> 00:45:04,170 this is just some value that this variable takes on-- 904 00:45:04,170 --> 00:45:05,520 how do I figure it out? 905 00:45:05,520 --> 00:45:08,760 Well, I'm going to sum up over j, where j 906 00:45:08,760 --> 00:45:13,380 is going to range over all of the possible values that y can take on. 907 00:45:13,380 --> 00:45:18,558 Well, let's look at the probability that x equals xi and y equals yj. 908 00:45:18,558 --> 00:45:20,600 So the exact same rule-- the only difference here 909 00:45:20,600 --> 00:45:23,360 is now I'm summing up over all of the possible values 910 00:45:23,360 --> 00:45:27,420 that y can take on, saying let's add up all of those possible cases 911 00:45:27,420 --> 00:45:31,100 and look at this joint distribution, this joint probability 912 00:45:31,100 --> 00:45:35,990 that x takes on the value I care about along with each of the possible values for y. 913 00:45:35,990 --> 00:45:40,910 And if I add all those up, then I can get this unconditional probability 914 00:45:40,910 --> 00:45:46,397 of what x is equal to, whether or not x is equal to some value x sub i. 915 00:45:46,397 --> 00:45:48,230 So let's take a look at this rule because it 916 00:45:48,230 --> 00:45:49,688 does look a little bit complicated. 917 00:45:49,688 --> 00:45:51,650 Let's try and put a concrete example to it. 918 00:45:51,650 --> 00:45:54,470 Here, again, is that same joint distribution from before. 919 00:45:54,470 --> 00:45:58,460 I have cloudy, not cloudy, rainy, not rainy. 920 00:45:58,460 --> 00:46:00,830 And maybe I want to access some variable. 921 00:46:00,830 --> 00:46:04,790 I want to know, what is the probability that it is cloudy? 922 00:46:04,790 --> 00:46:08,550 Well, marginalization says that if I have this joint distribution 923 00:46:08,550 --> 00:46:12,140 and I want to know, what is the probability that it is cloudy, well, 924 00:46:12,140 --> 00:46:15,650 I need to consider the other variable, the variable that's not here, 925 00:46:15,650 --> 00:46:17,060 the idea that it's rainy. 926 00:46:17,060 --> 00:46:20,780 And I consider the two cases, either it's raining or it's not raining, 927 00:46:20,780 --> 00:46:24,410 and I just sum up the values for each of those possibilities. 928 00:46:24,410 --> 00:46:27,380 In other words, the probability that it is cloudy 929 00:46:27,380 --> 00:46:31,110 is equal to the sum of the probability that it's cloudy 930 00:46:31,110 --> 00:46:38,090 and it's raining and the probability that it's cloudy and it is not raining.
931 00:46:38,090 --> 00:46:40,460 And so these, now, are values that I have access to. 932 00:46:40,460 --> 00:46:44,840 These are values that are just inside of this joint probability table. 933 00:46:44,840 --> 00:46:47,990 What is the probability that it is both cloudy and rainy? 934 00:46:47,990 --> 00:46:51,350 Well, it's just the intersection of these two here, which is 0.08, 935 00:46:51,350 --> 00:46:54,590 and the probability that it's cloudy and not raining is-- all right, 936 00:46:54,590 --> 00:46:56,480 here's cloudy, here's not raining-- 937 00:46:56,480 --> 00:46:58,000 it's 0.32. 938 00:46:58,000 --> 00:47:02,630 So it's 0.08 plus 0.32, which just gives us 0.4. 939 00:47:02,630 --> 00:47:06,840 That is the unconditional probability that it is, in fact, cloudy. 940 00:47:06,840 --> 00:47:09,530 And so marginalization gives us a way to go 941 00:47:09,530 --> 00:47:13,360 from these joint distributions to just some individual probability 942 00:47:13,360 --> 00:47:14,430 that I might care about. 943 00:47:14,430 --> 00:47:17,222 And you'll see a little bit later why it is that we care about that 944 00:47:17,222 --> 00:47:19,370 and why that's actually useful to us as we 945 00:47:19,370 --> 00:47:21,885 begin doing some of these calculations. 946 00:47:21,885 --> 00:47:25,010 The last rule we'll take a look at before transitioning into something a little 947 00:47:25,010 --> 00:47:27,200 bit different is this rule of conditioning-- 948 00:47:27,200 --> 00:47:31,070 very similar to the marginalization rule, but it says that, again, 949 00:47:31,070 --> 00:47:32,600 if I have two events a and b-- 950 00:47:32,600 --> 00:47:35,810 but instead of having access to their joint probabilities, 951 00:47:35,810 --> 00:47:38,180 I have access to their conditional probabilities, 952 00:47:38,180 --> 00:47:39,920 how they relate to each other. 953 00:47:39,920 --> 00:47:43,700 Well, again, if I want to know the probability that a happens and I know 954 00:47:43,700 --> 00:47:47,960 that there's some other variable b, either b happens or b doesn't happen, 955 00:47:47,960 --> 00:47:50,660 and so I can say that the probability of a 956 00:47:50,660 --> 00:47:54,920 is the probability of a given b times the probability of b, 957 00:47:54,920 --> 00:47:57,470 meaning b happened, and given that I know b happened, 958 00:47:57,470 --> 00:47:59,480 what's the likelihood that a happened? 959 00:47:59,480 --> 00:48:02,480 And then I consider the other case, that b didn't happen. 960 00:48:02,480 --> 00:48:05,360 So here is the probability that b didn't happen, 961 00:48:05,360 --> 00:48:07,880 and here's the probability that a happens given 962 00:48:07,880 --> 00:48:09,890 that I know that b didn't happen. 963 00:48:09,890 --> 00:48:13,820 And this is really the equivalent rule, just using conditional probability 964 00:48:13,820 --> 00:48:16,190 instead of joint probability where I'm saying, 965 00:48:16,190 --> 00:48:19,790 let's look at both of these two cases and condition on b. 966 00:48:19,790 --> 00:48:23,480 Look at the case where b happens and look at the case where b doesn't happen 967 00:48:23,480 --> 00:48:26,560 and look at what probabilities I get as a result.
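Here's a small Python sketch of both rules using that same table; the fourth joint value, 0.58, isn't stated above but follows because all four entries must sum to one:

# The joint distribution of C (cloudy?) and R (rainy?) from the table
joint = {
    ("cloudy", "rain"): 0.08,     ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02, ("not cloudy", "no rain"): 0.58,  # inferred
}

# Marginalization: P(cloudy) = P(cloudy, rain) + P(cloudy, no rain)
p_cloudy = sum(p for (c, r), p in joint.items() if c == "cloudy")
print(p_cloudy)  # 0.4

# Conditioning gives the same answer:
# P(cloudy) = P(cloudy | rain) P(rain) + P(cloudy | no rain) P(no rain)
p_rain = 0.1
p_cloudy_given_rain = joint[("cloudy", "rain")] / p_rain            # 0.8
p_cloudy_given_no_rain = joint[("cloudy", "no rain")] / (1 - p_rain)
print(p_cloudy_given_rain * p_rain
      + p_cloudy_given_no_rain * (1 - p_rain))  # 0.4, up to floating-point rounding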
968 00:48:26,560 --> 00:48:28,598 And just as in the case of marginalization 969 00:48:28,598 --> 00:48:30,890 where there was an equivalent rule for random variables 970 00:48:30,890 --> 00:48:34,850 that could take on multiple possible values in a domain of possible values, 971 00:48:34,850 --> 00:48:37,530 here, too, conditioning has the same equivalent rule. 972 00:48:37,530 --> 00:48:41,590 Again, there's a summation to mean I'm summing over all of the possible values 973 00:48:41,590 --> 00:48:44,070 that some random variable y could take on. 974 00:48:44,070 --> 00:48:48,140 But if I want to know, what is the probability that x takes on this value, 975 00:48:48,140 --> 00:48:50,870 then I'm going to sum up over all the values j 976 00:48:50,870 --> 00:48:53,420 that y could take on and say, all right, what's 977 00:48:53,420 --> 00:48:56,870 the chance that y takes on that value, yj, and multiply it 978 00:48:56,870 --> 00:49:00,830 by the conditional probability that x takes on this value given 979 00:49:00,830 --> 00:49:03,180 that y took on that value yj-- 980 00:49:03,180 --> 00:49:06,470 so equivalent rule just using conditional probabilities 981 00:49:06,470 --> 00:49:08,120 instead of joint probabilities. 982 00:49:08,120 --> 00:49:10,790 And using the equation we know about joint probabilities, 983 00:49:10,790 --> 00:49:13,748 we can translate between these two. 984 00:49:13,748 --> 00:49:15,790 All right, we've seen a whole lot of mathematics, 985 00:49:15,790 --> 00:49:18,110 and we've just sort of laid the foundation for mathematics. 986 00:49:18,110 --> 00:49:20,777 And no need to worry if you haven't seen probability in too much 987 00:49:20,777 --> 00:49:22,370 detail up until this point. 988 00:49:22,370 --> 00:49:24,500 These are sort of the foundations of the ideas 989 00:49:24,500 --> 00:49:27,560 that are going to come up as we begin to explore how we can now 990 00:49:27,560 --> 00:49:31,820 take these ideas from probability and begin to apply them to represent 991 00:49:31,820 --> 00:49:35,120 something inside of our computer, something inside of the AI agent 992 00:49:35,120 --> 00:49:39,280 we're trying to design that is able to represent information and probabilities 993 00:49:39,280 --> 00:49:42,600 and the likelihoods between various different events. 994 00:49:42,600 --> 00:49:45,020 So there are a number of different probabilistic models 995 00:49:45,020 --> 00:49:48,290 that we can generate, but the first of the models we're going to talk about 996 00:49:48,290 --> 00:49:50,600 are what are known as Bayesian networks. 997 00:49:50,600 --> 00:49:52,670 And a Bayesian network is just going to be 998 00:49:52,670 --> 00:49:56,090 some network of random variables, connected random variables, 999 00:49:56,090 --> 00:49:58,850 that are going to represent the dependence 1000 00:49:58,850 --> 00:50:00,260 between these random variables. 1001 00:50:00,260 --> 00:50:03,498 And odds are most random variables in this world 1002 00:50:03,498 --> 00:50:05,540 are not independent from each other, that there's 1003 00:50:05,540 --> 00:50:08,840 some relationship between things that are happening that we care about. 1004 00:50:08,840 --> 00:50:12,200 If it is raining today, that might increase the likelihood 1005 00:50:12,200 --> 00:50:14,750 that my flight or my train gets delayed, for example. 
1006 00:50:14,750 --> 00:50:17,610 There is some dependence between these random variables, 1007 00:50:17,610 --> 00:50:22,420 and a Bayesian network is going to be able to capture those dependencies. 1008 00:50:22,420 --> 00:50:23,770 So what is a Bayesian network? 1009 00:50:23,770 --> 00:50:26,430 What is its actual structure, and how does it work? 1010 00:50:26,430 --> 00:50:29,230 Well, a Bayesian network is going to be a directed graph. 1011 00:50:29,230 --> 00:50:31,170 And again, we've seen directed graphs before. 1012 00:50:31,170 --> 00:50:34,170 They are individual nodes with arrows or edges 1013 00:50:34,170 --> 00:50:38,897 that connect one node to another node, pointing in a particular direction. 1014 00:50:38,897 --> 00:50:40,980 And so this directed graph is going to have nodes, 1015 00:50:40,980 --> 00:50:43,860 as well, where each node in this directed graph 1016 00:50:43,860 --> 00:50:47,850 is going to represent a random variable, something like the weather or something 1017 00:50:47,850 --> 00:50:51,340 like whether my train was on time or delayed. 1018 00:50:51,340 --> 00:50:54,780 And we're going to have an arrow from a node x to a node y 1019 00:50:54,780 --> 00:50:57,435 to mean that x is a parent of y. 1020 00:50:57,435 --> 00:50:58,560 So that'll be our notation. 1021 00:50:58,560 --> 00:51:02,940 If there's an arrow from x to y, x is going to be considered a parent of y. 1022 00:51:02,940 --> 00:51:06,360 And the reason that's important is because each of these nodes 1023 00:51:06,360 --> 00:51:09,180 is going to have a probability distribution that we're 1024 00:51:09,180 --> 00:51:13,140 going to store along with it, which is the distribution of x given 1025 00:51:13,140 --> 00:51:16,520 some evidence, given the parents of x. 1026 00:51:16,520 --> 00:51:18,480 So the way to more intuitively think about this 1027 00:51:18,480 --> 00:51:22,260 is the parents are going to be thought of as sort of causes for some effect 1028 00:51:22,260 --> 00:51:24,720 that we're going to observe. 1029 00:51:24,720 --> 00:51:27,780 And so let's take a look at an actual example of a Bayesian network 1030 00:51:27,780 --> 00:51:30,270 and think about the types of logic that might be involved 1031 00:51:30,270 --> 00:51:32,070 in reasoning about that network. 1032 00:51:32,070 --> 00:51:35,580 Let's imagine, for a moment, that I have an appointment out of town 1033 00:51:35,580 --> 00:51:38,510 and I need to take a train in order to get to that appointment. 1034 00:51:38,510 --> 00:51:40,260 So what are the things I might care about? 1035 00:51:40,260 --> 00:51:42,620 Well, I care about getting to my appointment on time. 1036 00:51:42,620 --> 00:51:44,370 Either I make it to my appointment and I'm 1037 00:51:44,370 --> 00:51:46,710 able to attend it or I miss the appointment. 1038 00:51:46,710 --> 00:51:49,440 And you might imagine that that's influenced by the train, 1039 00:51:49,440 --> 00:51:54,000 that the train is either on time or it's delayed, for example. 1040 00:51:54,000 --> 00:51:56,370 But that train itself is also influenced. 1041 00:51:56,370 --> 00:52:00,030 Whether the train is on time or not depends maybe on the rain. 1042 00:52:00,030 --> 00:52:00,822 Is there no rain? 1043 00:52:00,822 --> 00:52:01,530 Is it light rain? 1044 00:52:01,530 --> 00:52:02,737 Is there heavy rain? 1045 00:52:02,737 --> 00:52:05,070 And it might also be influenced by other variables, too. 
1046 00:52:05,070 --> 00:52:07,050 It might be influenced, as well, by whether 1047 00:52:07,050 --> 00:52:09,608 or not there's maintenance on the train track, for example. 1048 00:52:09,608 --> 00:52:11,400 If there is maintenance on the train track, 1049 00:52:11,400 --> 00:52:15,660 that probably increases the likelihood that my train is delayed. 1050 00:52:15,660 --> 00:52:19,680 And so we can represent all of these ideas using a Bayesian network that 1051 00:52:19,680 --> 00:52:21,360 looks a little something like this. 1052 00:52:21,360 --> 00:52:25,440 Here I have four nodes representing four random variables 1053 00:52:25,440 --> 00:52:26,970 that I would like to keep track of. 1054 00:52:26,970 --> 00:52:29,190 I have one random variable called Rain that 1055 00:52:29,190 --> 00:52:34,080 can take on three possible values in its domain, either none or light or heavy 1056 00:52:34,080 --> 00:52:36,348 for no rain, light rain, or heavy rain. 1057 00:52:36,348 --> 00:52:38,640 I have a variable called Maintenance for whether or not 1058 00:52:38,640 --> 00:52:42,030 there is maintenance on the train track, which has two possible values, just 1059 00:52:42,030 --> 00:52:42,960 either yes or no. 1060 00:52:42,960 --> 00:52:46,355 Either there is maintenance or there is no maintenance happening on the track. 1061 00:52:46,355 --> 00:52:49,230 Then I have a random variable for the train indicating whether 1062 00:52:49,230 --> 00:52:50,490 the train was on time or not. 1063 00:52:50,490 --> 00:52:53,850 That random variable has two possible values in its domain. 1064 00:52:53,850 --> 00:52:57,730 The train is either on time or the train is delayed. 1065 00:52:57,730 --> 00:52:59,803 And then, finally, I have a random variable 1066 00:52:59,803 --> 00:53:01,470 for whether I make it to my appointment. 1067 00:53:01,470 --> 00:53:04,950 For my appointment down here, I have a random variable called Appointment 1068 00:53:04,950 --> 00:53:09,420 that itself has two possible values, attend and miss. 1069 00:53:09,420 --> 00:53:10,920 And so here are the possible values. 1070 00:53:10,920 --> 00:53:12,960 Here are my four nodes, each of which represents 1071 00:53:12,960 --> 00:53:17,160 a random variable, each of which has a domain of possible values 1072 00:53:17,160 --> 00:53:18,500 that it can take on. 1073 00:53:18,500 --> 00:53:21,980 And the arrows, the edges pointing from one node to another, 1074 00:53:21,980 --> 00:53:26,250 encode some notion of dependence inside of this graph, 1075 00:53:26,250 --> 00:53:28,830 that whether I make it to my appointment or not 1076 00:53:28,830 --> 00:53:32,650 is dependent upon whether the train is on time or delayed. 1077 00:53:32,650 --> 00:53:36,390 And whether the train is on time or delayed is dependent on two things, 1078 00:53:36,390 --> 00:53:38,910 given by the two arrows pointing at this node. 1079 00:53:38,910 --> 00:53:42,350 It is dependent on whether or not there was maintenance on the train track, 1080 00:53:42,350 --> 00:53:45,240 and it is also dependent upon whether or not 1081 00:53:45,240 --> 00:53:47,675 it is raining. 1082 00:53:47,675 --> 00:53:49,800 And just to make things a little complicated, let's 1083 00:53:49,800 --> 00:53:53,280 say, as well, that whether or not there's maintenance on the track, 1084 00:53:53,280 --> 00:53:55,260 this too might be influenced by the rain.
1085 00:53:55,260 --> 00:53:57,178 Then if there's heavier rain, well, maybe it's 1086 00:53:57,178 --> 00:53:59,970 less likely that there's going to be maintenance on the train track 1087 00:53:59,970 --> 00:54:02,010 that day because they're more likely to want 1088 00:54:02,010 --> 00:54:05,500 to do maintenance on the track on days when it's not raining, for example. 1089 00:54:05,500 --> 00:54:08,350 And so these nodes might have different relationships between them. 1090 00:54:08,350 --> 00:54:10,770 But the idea is that we can come up with a probability 1091 00:54:10,770 --> 00:54:16,370 distribution for any of these nodes based only upon its parents. 1092 00:54:16,370 --> 00:54:20,158 And so let's look node by node at what this probability distribution might 1093 00:54:20,158 --> 00:54:20,950 actually look like. 1094 00:54:20,950 --> 00:54:24,150 And we'll go ahead and begin with this root node, this Rain node here, which 1095 00:54:24,150 --> 00:54:27,630 is at the top and has no arrows pointing into it, 1096 00:54:27,630 --> 00:54:30,510 which means its probability distribution is not 1097 00:54:30,510 --> 00:54:32,410 going to be a conditional distribution. 1098 00:54:32,410 --> 00:54:33,870 It's not based on anything. 1099 00:54:33,870 --> 00:54:38,250 I just have some probability distribution over the possible values 1100 00:54:38,250 --> 00:54:40,520 for the Rain random variable. 1101 00:54:40,520 --> 00:54:43,590 And that distribution might look a little something like this. 1102 00:54:43,590 --> 00:54:46,170 None, light, and heavy-- each have a possible value. 1103 00:54:46,170 --> 00:54:48,300 Here I'm saying the likelihood of no rain 1104 00:54:48,300 --> 00:54:53,790 is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. 1105 00:54:53,790 --> 00:54:58,440 So here is a probability distribution for this root node in this Bayesian 1106 00:54:58,440 --> 00:54:59,770 network. 1107 00:54:59,770 --> 00:55:03,000 And let's now consider the next node in the network, Maintenance. 1108 00:55:03,000 --> 00:55:05,140 Track maintenance is yes or no. 1109 00:55:05,140 --> 00:55:07,530 And the general idea of what this distribution 1110 00:55:07,530 --> 00:55:09,660 is going to encode, at least in this story, 1111 00:55:09,660 --> 00:55:13,308 is the idea that the heavier the rain is, the less likely 1112 00:55:13,308 --> 00:55:15,600 it is that there's going to be maintenance on the track 1113 00:55:15,600 --> 00:55:18,017 because the people that are doing maintenance on the track 1114 00:55:18,017 --> 00:55:21,190 probably want to wait until a day when it's not as rainy in order to do 1115 00:55:21,190 --> 00:55:23,000 the track maintenance, for example. 1116 00:55:23,000 --> 00:55:25,480 And so what might that probability distribution look like? 1117 00:55:25,480 --> 00:55:28,180 Well, this now is going to be a conditional probability 1118 00:55:28,180 --> 00:55:31,600 distribution, that here are the three possible values for the Rain 1119 00:55:31,600 --> 00:55:34,840 random variable, which I'm here just going to abbreviate to R, either 1120 00:55:34,840 --> 00:55:37,490 no rain, light rain, or heavy rain. 
1121 00:55:37,490 --> 00:55:41,590 And for each of those possible values, either there is yes track maintenance 1122 00:55:41,590 --> 00:55:46,120 or no track maintenance, and those have probabilities associated with them, 1123 00:55:46,120 --> 00:55:50,650 that I see here that if it is not raining, 1124 00:55:50,650 --> 00:55:53,620 then there is a probability of 0.4 that there's track maintenance 1125 00:55:53,620 --> 00:55:56,350 and a probability of 0.6 that there isn't. 1126 00:55:56,350 --> 00:55:59,200 But if there's heavy rain, then here the chance 1127 00:55:59,200 --> 00:56:02,020 that there is track maintenance is 0.1 and the chance 1128 00:56:02,020 --> 00:56:04,430 that there is not track maintenance is 0.9. 1129 00:56:04,430 --> 00:56:08,230 Each of these rows is going to sum up to one because each of these 1130 00:56:08,230 --> 00:56:10,930 represent different values of whether or not 1131 00:56:10,930 --> 00:56:14,710 it's raining, the three possible values that that random variable can take on, 1132 00:56:14,710 --> 00:56:18,160 and each is associated with its own probability distribution. 1133 00:56:18,160 --> 00:56:22,450 That is ultimately all going to add up to the number one. 1134 00:56:22,450 --> 00:56:26,290 So that there is our distribution for this random variable called Maintenance 1135 00:56:26,290 --> 00:56:30,110 about whether or not there is maintenance on the train track. 1136 00:56:30,110 --> 00:56:32,050 And now let's consider the next variable. 1137 00:56:32,050 --> 00:56:34,210 Here we have a node inside of our Bayesian network 1138 00:56:34,210 --> 00:56:38,570 called Train that has two possible values, on time and delayed. 1139 00:56:38,570 --> 00:56:42,160 And this node is going to be dependent upon the two nodes that 1140 00:56:42,160 --> 00:56:45,040 are pointing towards it, that whether the train is on time 1141 00:56:45,040 --> 00:56:48,872 or delayed depends on whether or not there is track maintenance, 1142 00:56:48,872 --> 00:56:50,830 and it depends on whether or not there is rain, 1143 00:56:50,830 --> 00:56:55,610 that heavier rain probably means more likely that my train is delayed. 1144 00:56:55,610 --> 00:56:58,270 And if there is track maintenance, that also 1145 00:56:58,270 --> 00:57:02,360 probably means it's more likely that my train is delayed as well. 1146 00:57:02,360 --> 00:57:05,350 And so you could construct a larger probability distribution, 1147 00:57:05,350 --> 00:57:07,720 a conditional probability distribution, that 1148 00:57:07,720 --> 00:57:11,530 instead of conditioning on just one variable, as was the case here, 1149 00:57:11,530 --> 00:57:14,380 is now conditioning on two variables, conditioning 1150 00:57:14,380 --> 00:57:19,270 both on rain, represented by R, and on maintenance, represented by M. 1151 00:57:19,270 --> 00:57:23,040 Again, each of these rows has two values that sum up to the number one, 1152 00:57:23,040 --> 00:57:27,310 one for whether the train is on time, one for whether the train is delayed. 1153 00:57:27,310 --> 00:57:29,260 And here I can say something like, all right, 1154 00:57:29,260 --> 00:57:32,950 if I know there was light rain and track maintenance-- well, OK, 1155 00:57:32,950 --> 00:57:36,490 that would be R is light and M is yes-- 1156 00:57:36,490 --> 00:57:40,210 well, then there is a probability of 0.6 that my train is on time 1157 00:57:40,210 --> 00:57:43,540 and a probability of 0.4 the train is delayed.
1158 00:57:43,540 --> 00:57:47,770 And you can imagine gathering this data just by looking at real-world data, 1159 00:57:47,770 --> 00:57:50,970 looking at data about, all right, if I knew that it was light rain 1160 00:57:50,970 --> 00:57:52,720 and there was track maintenance, how often 1161 00:57:52,720 --> 00:57:54,400 was a train delayed or not delayed, and you 1162 00:57:54,400 --> 00:57:55,930 could begin to construct this thing. 1163 00:57:55,930 --> 00:57:58,060 But the interesting thing is, intelligently, 1164 00:57:58,060 --> 00:57:59,812 being able to try to figure out, how might 1165 00:57:59,812 --> 00:58:01,270 you go about ordering these things? 1166 00:58:01,270 --> 00:58:06,730 What things might influence other nodes inside of this Bayesian network? 1167 00:58:06,730 --> 00:58:08,860 And the last thing I care about is whether or not 1168 00:58:08,860 --> 00:58:10,870 I make it to my appointment. 1169 00:58:10,870 --> 00:58:13,210 So did I attend or miss the appointment? 1170 00:58:13,210 --> 00:58:16,180 And ultimately, whether I attend or miss the appointment, 1171 00:58:16,180 --> 00:58:19,552 it is influenced by track maintenance because it's indirectly this idea 1172 00:58:19,552 --> 00:58:21,760 that, all right, if there is track maintenance, well, 1173 00:58:21,760 --> 00:58:23,450 then my train might more likely be delayed, 1174 00:58:23,450 --> 00:58:25,325 and if my train is more likely to be delayed, 1175 00:58:25,325 --> 00:58:27,280 then I'm more likely to miss my appointment. 1176 00:58:27,280 --> 00:58:29,650 But what we encode in this Bayesian network 1177 00:58:29,650 --> 00:58:32,820 are just what we might consider to be more direct relationships. 1178 00:58:32,820 --> 00:58:35,710 So the train has a direct influence on the appointment. 1179 00:58:35,710 --> 00:58:38,710 And given that I know whether the train is on time or delayed, 1180 00:58:38,710 --> 00:58:40,540 knowing whether there's track maintenance 1181 00:58:40,540 --> 00:58:44,550 isn't going to give me any additional information that I didn't already have, 1182 00:58:44,550 --> 00:58:48,070 that if I know Train, these other nodes that are up above 1183 00:58:48,070 --> 00:58:51,150 aren't really going to influence the result. 1184 00:58:51,150 --> 00:58:54,910 And so here we might represent it using another conditional probability 1185 00:58:54,910 --> 00:58:57,430 distribution that looks a little something like this, that 1186 00:58:57,430 --> 00:59:00,160 train can take on two possible values. 1187 00:59:00,160 --> 00:59:02,740 Either my train is on time or my train is delayed. 1188 00:59:02,740 --> 00:59:04,510 And for each of those two possible values, 1189 00:59:04,510 --> 00:59:06,803 I have a distribution for what are the odds 1190 00:59:06,803 --> 00:59:09,220 that I'm able to attend the meeting, and what are the odds 1191 00:59:09,220 --> 00:59:10,090 that I miss the meeting? 1192 00:59:10,090 --> 00:59:12,010 And obviously, if my train is on time, I'm 1193 00:59:12,010 --> 00:59:14,130 much more likely to be able to attend the meeting 1194 00:59:14,130 --> 00:59:16,600 than if my train is delayed, in which case 1195 00:59:16,600 --> 00:59:19,500 I'm more likely to miss that meeting.
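As a rough sketch, these distributions might be stored in Python as plain dictionaries. Only the numbers mentioned above come from the lecture's tables; the entries marked as illustrative are placeholders filled in so that each row sums to one:

# P(Rain), the unconditional distribution for the root node
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# P(Maintenance | Rain); the "light" row is illustrative
P_maintenance = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # illustrative
    "heavy": {"yes": 0.1, "no": 0.9},
}

# P(Train | Rain, Maintenance); only the ("light", "yes") row was given above,
# and the remaining (rain, maintenance) combinations are elided here
P_train = {
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # illustrative
}

# P(Appointment | Train); both rows are illustrative
P_appointment = {
    "on time": {"attend": 0.9, "miss": 0.1},  # illustrative
    "delayed": {"attend": 0.6, "miss": 0.4},  # illustrative
}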
1196 00:59:19,500 --> 00:59:21,790 So all of these nodes put all together here 1197 00:59:21,790 --> 00:59:25,330 represent this Bayesian network, this network of random variables 1198 00:59:25,330 --> 00:59:27,730 whose values I ultimately care about and that 1199 00:59:27,730 --> 00:59:30,380 have some sort of relationship between them, 1200 00:59:30,380 --> 00:59:33,670 some sort of dependence where these arrows from one node to another 1201 00:59:33,670 --> 00:59:37,960 indicate some dependence, that I can calculate the probability of some node 1202 00:59:37,960 --> 00:59:41,870 given the parents that happen to exist there. 1203 00:59:41,870 --> 00:59:45,340 So now that we've been able to describe the structure of this Bayesian network 1204 00:59:45,340 --> 00:59:47,680 and the relationships between each of these nodes, 1205 00:59:47,680 --> 00:59:51,070 by associating each of the nodes in the network with a probability 1206 00:59:51,070 --> 00:59:53,980 distribution, whether that's an unconditional probability 1207 00:59:53,980 --> 00:59:56,200 distribution in the case of this root node here, 1208 00:59:56,200 --> 00:59:59,630 like Rain, or a conditional probability distribution, 1209 00:59:59,630 --> 01:00:02,380 in the case of all of the other nodes whose probabilities are 1210 01:00:02,380 --> 01:00:05,000 dependent upon the values of their parents, 1211 01:00:05,000 --> 01:00:09,160 we can begin to do some computation and calculation using the information 1212 01:00:09,160 --> 01:00:10,490 inside of those tables. 1213 01:00:10,490 --> 01:00:12,310 So let's imagine, for example, that I just 1214 01:00:12,310 --> 01:00:15,910 wanted to compute something simple, like the probability of light rain. 1215 01:00:15,910 --> 01:00:18,130 How would I get the probability of light rain? 1216 01:00:18,130 --> 01:00:21,370 Well, light rain-- rain here is a root node. 1217 01:00:21,370 --> 01:00:23,770 And so if I wanted to calculate that probability, 1218 01:00:23,770 --> 01:00:26,740 I could just look at the probability distribution for rain 1219 01:00:26,740 --> 01:00:29,800 and extract from it the probability of light rain. 1220 01:00:29,800 --> 01:00:33,220 It's just a single value that I already have access to. 1221 01:00:33,220 --> 01:00:35,410 But we could also imagine wanting to compute 1222 01:00:35,410 --> 01:00:39,100 more complex joint probabilities, like the probability 1223 01:00:39,100 --> 01:00:42,710 that there is light rain and also no track maintenance. 1224 01:00:42,710 --> 01:00:47,440 This is a joint probability of two values, light rain and no track 1225 01:00:47,440 --> 01:00:48,293 maintenance. 1226 01:00:48,293 --> 01:00:51,460 And the way I might do that is first by starting by saying, all right, well, 1227 01:00:51,460 --> 01:00:54,100 let me get the probability of light rain, but now 1228 01:00:54,100 --> 01:00:57,160 I also want the probability of no track maintenance. 1229 01:00:57,160 --> 01:01:01,630 But, of course, this node is dependent upon the value of rain. 1230 01:01:01,630 --> 01:01:05,350 So what I really want is the probability of no track maintenance given 1231 01:01:05,350 --> 01:01:07,540 that I know that there was light rain.
1232 01:01:07,540 --> 01:01:10,450 And so the expression for calculating this idea 1233 01:01:10,450 --> 01:01:13,870 that the probability of light rain and no track maintenance 1234 01:01:13,870 --> 01:01:17,680 is really just the probability of light rain and the probability 1235 01:01:17,680 --> 01:01:21,250 that there is no track maintenance given that I know that there already 1236 01:01:21,250 --> 01:01:22,210 is light rain. 1237 01:01:22,210 --> 01:01:25,540 So I take the unconditional probability of light rain, 1238 01:01:25,540 --> 01:01:30,160 multiply it by the conditional probability of no track maintenance 1239 01:01:30,160 --> 01:01:32,550 given that I know there is light rain. 1240 01:01:32,550 --> 01:01:35,770 And you can continue to do this again and again for every variable 1241 01:01:35,770 --> 01:01:38,378 that you want to add into this joint probability 1242 01:01:38,378 --> 01:01:39,670 that I might want to calculate. 1243 01:01:39,670 --> 01:01:42,400 If I wanted to know the probability of light rain 1244 01:01:42,400 --> 01:01:45,100 and no track maintenance and a delayed train, 1245 01:01:45,100 --> 01:01:48,850 well, that's going to be the probability of light rain multiplied 1246 01:01:48,850 --> 01:01:50,950 by the probability of no track maintenance 1247 01:01:50,950 --> 01:01:56,218 given light rain multiplied by the probability of a delayed train given 1248 01:01:56,218 --> 01:01:59,260 light rain and no track maintenance, because whether the train is on time 1249 01:01:59,260 --> 01:02:03,190 or delayed is dependent upon both of these other two variables, 1250 01:02:03,190 --> 01:02:05,290 and so I have two pieces of evidence that 1251 01:02:05,290 --> 01:02:08,860 go into the calculation of that conditional probability. 1252 01:02:08,860 --> 01:02:11,470 And each of these three values is just a value 1253 01:02:11,470 --> 01:02:15,640 that I can look up by looking at one of these individual probability 1254 01:02:15,640 --> 01:02:20,140 distributions that is encoded into my Bayesian network. 1255 01:02:20,140 --> 01:02:23,410 And if I wanted a joint probability over all four of the variables, 1256 01:02:23,410 --> 01:02:25,900 something like the probability of light rain 1257 01:02:25,900 --> 01:02:30,130 and no track maintenance and a delayed train and I missed my appointment, 1258 01:02:30,130 --> 01:02:32,890 well, that's going to be multiplying four different values, one 1259 01:02:32,890 --> 01:02:34,870 from each of these individual nodes. 1260 01:02:34,870 --> 01:02:36,970 It's going to be the probability of light rain, 1261 01:02:36,970 --> 01:02:39,370 then of no track maintenance given light rain, 1262 01:02:39,370 --> 01:02:42,882 then of a delayed train given light rain and no track maintenance. 1263 01:02:42,882 --> 01:02:46,090 And then, finally, for this node here for whether I make it to my appointment 1264 01:02:46,090 --> 01:02:50,770 or not, it's not dependent upon these two variables given that I know 1265 01:02:50,770 --> 01:02:52,270 whether or not the train is on time. 1266 01:02:52,270 --> 01:02:55,030 I only need to care about the conditional probability 1267 01:02:55,030 --> 01:03:00,160 that I miss my appointment given that the train happens to be delayed. 
1268 01:03:00,160 --> 01:03:04,120 And so that's represented here by four probabilities, each of which 1269 01:03:04,120 --> 01:03:07,420 is located inside of one of these probability distributions 1270 01:03:07,420 --> 01:03:11,092 for each of the nodes, all multiplied together. 1271 01:03:11,092 --> 01:03:13,300 And so I can take a variable like that and figure out 1272 01:03:13,300 --> 01:03:15,910 what the joint probability is by multiplying 1273 01:03:15,910 --> 01:03:18,280 a whole bunch of these individual probabilities 1274 01:03:18,280 --> 01:03:19,990 from the Bayesian network. 1275 01:03:19,990 --> 01:03:23,110 But, of course, just as with last time where what I really wanted to do 1276 01:03:23,110 --> 01:03:25,463 was to be able to get new pieces of information, 1277 01:03:25,463 --> 01:03:28,630 here, too, this is what we're going to want to do with our Bayesian network. 1278 01:03:28,630 --> 01:03:31,720 In the context of knowledge, we talked about the problem of inference. 1279 01:03:31,720 --> 01:03:34,210 Given things that I know to be true, can I 1280 01:03:34,210 --> 01:03:38,020 draw conclusions, make deductions about other facts about the world 1281 01:03:38,020 --> 01:03:40,270 that I also know to be true? 1282 01:03:40,270 --> 01:03:44,170 And what we're going to do now is apply the same sort of idea to probability. 1283 01:03:44,170 --> 01:03:46,960 Using information about which I have some knowledge, 1284 01:03:46,960 --> 01:03:49,510 whether some evidence or some probabilities, can 1285 01:03:49,510 --> 01:03:52,360 I figure out not other variables for certain, 1286 01:03:52,360 --> 01:03:55,750 but can I figure out the probabilities of other variables taking 1287 01:03:55,750 --> 01:03:57,160 on particular values? 1288 01:03:57,160 --> 01:04:00,160 And so here we introduce the problem of inference 1289 01:04:00,160 --> 01:04:03,970 in a probabilistic setting in a case where variables might not necessarily 1290 01:04:03,970 --> 01:04:06,760 be true for sure, but they might be random variables 1291 01:04:06,760 --> 01:04:10,640 that take on different values with some probability. 1292 01:04:10,640 --> 01:04:13,780 So how do we formally define what exactly this inference problem actually 1293 01:04:13,780 --> 01:04:14,500 is? 1294 01:04:14,500 --> 01:04:17,350 Well, the inference problem has a couple of parts to it. 1295 01:04:17,350 --> 01:04:20,140 We have some query, some variable x that we 1296 01:04:20,140 --> 01:04:21,730 want to compute the distribution for. 1297 01:04:21,730 --> 01:04:24,880 Maybe I want the probability that I missed my train 1298 01:04:24,880 --> 01:04:29,500 or I want the probability that there is track maintenance, something 1299 01:04:29,500 --> 01:04:31,570 that I want information about. 1300 01:04:31,570 --> 01:04:33,437 And then I have some evidence variables. 1301 01:04:33,437 --> 01:04:35,020 Maybe it's just one piece of evidence. 1302 01:04:35,020 --> 01:04:36,760 Maybe it's multiple pieces of evidence. 1303 01:04:36,760 --> 01:04:40,600 But I've observed certain variables for some sort of event. 1304 01:04:40,600 --> 01:04:43,772 So for example, I might have observed that it is raining. 1305 01:04:43,772 --> 01:04:44,980 This is evidence that I have. 1306 01:04:44,980 --> 01:04:47,933 I know that there is light rain or I know that there is heavy rain, 1307 01:04:47,933 --> 01:04:49,100 and that is evidence I have. 
1308 01:04:49,100 --> 01:04:52,750 And using that evidence, I want to know, what is the probability 1309 01:04:52,750 --> 01:04:55,430 that my train is delayed, for example? 1310 01:04:55,430 --> 01:04:58,480 And that is a query that I might want to ask based on this evidence. 1311 01:04:58,480 --> 01:05:00,700 So I have a query, some variable, evidence, 1312 01:05:00,700 --> 01:05:03,280 which are some other variables that I have observed inside 1313 01:05:03,280 --> 01:05:05,260 of my Bayesian network, and of course that 1314 01:05:05,260 --> 01:05:08,110 does leave some hidden variables, y. 1315 01:05:08,110 --> 01:05:11,380 These are variables that are not evidence variables and not 1316 01:05:11,380 --> 01:05:12,550 query variables. 1317 01:05:12,550 --> 01:05:16,090 So you might imagine in the case where I know whether or not it's raining 1318 01:05:16,090 --> 01:05:19,930 and I want to know whether my train is going to be delayed or not, 1319 01:05:19,930 --> 01:05:23,380 the hidden variable, the thing I don't have access to, is something like, 1320 01:05:23,380 --> 01:05:25,130 is there maintenance on the track, or am I 1321 01:05:25,130 --> 01:05:27,380 going to make or not make my appointment, for example? 1322 01:05:27,380 --> 01:05:29,410 These are variables that I don't have access to. 1323 01:05:29,410 --> 01:05:32,680 They're hidden because they're not things I observed, 1324 01:05:32,680 --> 01:05:35,100 and they're also not the query, the thing that I'm asking. 1325 01:05:35,100 --> 01:05:37,480 And so ultimately what we want to calculate 1326 01:05:37,480 --> 01:05:41,650 is I want to know the probability distribution of x given 1327 01:05:41,650 --> 01:05:42,970 e, the event that I observed. 1328 01:05:42,970 --> 01:05:46,150 So given that I observed some event, I observed that it is raining, 1329 01:05:46,150 --> 01:05:49,960 I would like to know, what is the distribution over the possible values 1330 01:05:49,960 --> 01:05:51,640 of the Train random variable? 1331 01:05:51,640 --> 01:05:52,630 Is it on time? 1332 01:05:52,630 --> 01:05:53,440 Is it delayed? 1333 01:05:53,440 --> 01:05:55,750 What is the likelihood it's going to be there? 1334 01:05:55,750 --> 01:05:58,720 And it turns out we can do this calculation just using 1335 01:05:58,720 --> 01:06:02,410 a lot of the probability rules that we've already seen in action. 1336 01:06:02,410 --> 01:06:04,870 And ultimately, we're going to take a look at the math 1337 01:06:04,870 --> 01:06:07,150 at a little bit of a high level, at an abstract level, 1338 01:06:07,150 --> 01:06:09,370 but ultimately we can allow computers and programming 1339 01:06:09,370 --> 01:06:12,610 libraries that already exist to begin to do some of this math for us. 1340 01:06:12,610 --> 01:06:15,810 But it's good to get a general sense for what's actually happening when 1341 01:06:15,810 --> 01:06:18,010 this inference process takes place. 1342 01:06:18,010 --> 01:06:21,190 Let's imagine, for example, that I want to compute the probability 1343 01:06:21,190 --> 01:06:24,430 distribution of the Appointment random variable 1344 01:06:24,430 --> 01:06:28,510 given some evidence, given that I know that there was light rain and no track 1345 01:06:28,510 --> 01:06:29,260 maintenance. 1346 01:06:29,260 --> 01:06:32,830 So there's my evidence, these two variables that I observed the value of. 1347 01:06:32,830 --> 01:06:34,630 I observe the value of rain. 1348 01:06:34,630 --> 01:06:35,920 I know there's light rain. 
1349 01:06:35,920 --> 01:06:38,830 And I know that there is no track maintenance going on today. 1350 01:06:38,830 --> 01:06:42,820 And what I care about knowing, my query, is this random variable Appointment. 1351 01:06:42,820 --> 01:06:46,008 I want to know the distribution of this random variable Appointment. 1352 01:06:46,008 --> 01:06:47,800 What is the chance that I am able to attend 1353 01:06:47,800 --> 01:06:50,560 my appointment, what is the chance that I miss my appointment 1354 01:06:50,560 --> 01:06:52,360 given this evidence? 1355 01:06:52,360 --> 01:06:55,870 And the hidden variable, the information that I don't have access to, 1356 01:06:55,870 --> 01:06:57,190 is this variable Train. 1357 01:06:57,190 --> 01:07:00,040 This is information that is not part of the evidence that I see, 1358 01:07:00,040 --> 01:07:01,660 not something that I observe. 1359 01:07:01,660 --> 01:07:05,050 But it is also not the query that I am asking for. 1360 01:07:05,050 --> 01:07:07,460 And so what might this inference procedure look like? 1361 01:07:07,460 --> 01:07:10,810 Well, if you recall back from when we were defining conditional probability 1362 01:07:10,810 --> 01:07:13,270 and doing math with conditional probabilities, 1363 01:07:13,270 --> 01:07:15,940 we know that a conditional probability is 1364 01:07:15,940 --> 01:07:19,030 proportional to the joint probability. 1365 01:07:19,030 --> 01:07:23,050 And we remember this by recalling that the probability of a given b 1366 01:07:23,050 --> 01:07:25,930 is just some constant factor alpha multiplied 1367 01:07:25,930 --> 01:07:27,583 by the probability of a and b. 1368 01:07:27,583 --> 01:07:29,500 That constant factor alpha comes from 1369 01:07:29,500 --> 01:07:32,620 dividing by the probability of b, but the important thing 1370 01:07:32,620 --> 01:07:34,930 is that it's just some constant multiplied 1371 01:07:34,930 --> 01:07:37,450 by the joint distribution, the probability 1372 01:07:37,450 --> 01:07:40,070 that all of these individual things happen. 1373 01:07:40,070 --> 01:07:42,610 So in this case, I can take the probability 1374 01:07:42,610 --> 01:07:47,380 of the Appointment random variable given light rain and no track maintenance 1375 01:07:47,380 --> 01:07:51,070 and say that is just going to be proportional, some constant alpha, 1376 01:07:51,070 --> 01:07:54,700 multiplied by the joint probability, the probability of a particular value 1377 01:07:54,700 --> 01:08:00,410 for the appointment random variable, and light rain and no track maintenance. 1378 01:08:00,410 --> 01:08:02,980 Well, all right, how do I calculate this probability 1379 01:08:02,980 --> 01:08:05,350 of appointment and light rain and no track maintenance, 1380 01:08:05,350 --> 01:08:07,480 when what I really need are all four of these variables 1381 01:08:07,480 --> 01:08:11,260 to be able to calculate a joint distribution 1382 01:08:11,260 --> 01:08:13,990 across everything, because, then, a particular appointment 1383 01:08:13,990 --> 01:08:16,420 depends upon the value of train. 1384 01:08:16,420 --> 01:08:18,399 Well, in order to do that, here I can begin 1385 01:08:18,399 --> 01:08:21,430 to use that marginalization trick, that there are only 1386 01:08:21,430 --> 01:08:24,640 two ways I can get any configuration of an appointment, light rain, 1387 01:08:24,640 --> 01:08:25,859 and no track maintenance.
1388 01:08:25,859 --> 01:08:28,120 Either this particular setting of variables 1389 01:08:28,120 --> 01:08:33,130 happens and the train is on time or this particular setting of variables happens 1390 01:08:33,130 --> 01:08:34,180 and the train is delayed. 1391 01:08:34,180 --> 01:08:37,520 Those are two possible cases that I would want to consider. 1392 01:08:37,520 --> 01:08:40,149 And if I add those two cases up, well, then I 1393 01:08:40,149 --> 01:08:44,859 get the result just by adding up all of the possibilities for the hidden 1394 01:08:44,859 --> 01:08:46,990 variable, or variables if there are multiple. 1395 01:08:46,990 --> 01:08:49,090 But since there's only one hidden variable here, 1396 01:08:49,090 --> 01:08:53,229 Train, all I need to do is iterate over all the possible values for that hidden 1397 01:08:53,229 --> 01:08:56,600 variable Train and add up their probabilities. 1398 01:08:56,600 --> 01:08:59,529 So this probability expression here becomes 1399 01:08:59,529 --> 01:09:02,890 the probability distribution over Appointment, light rain, no track maintenance, and the train 1400 01:09:02,890 --> 01:09:06,010 is on time, plus the probability distribution 1401 01:09:06,010 --> 01:09:10,120 over Appointment, light rain, no track maintenance, and the train 1402 01:09:10,120 --> 01:09:11,660 is delayed, for example. 1403 01:09:11,660 --> 01:09:15,597 So I take both of the possible values for train, go ahead and add them up. 1404 01:09:15,597 --> 01:09:16,180 These are just 1405 01:09:16,180 --> 01:09:18,722 joint probabilities that we saw earlier how to calculate, just 1406 01:09:18,722 --> 01:09:22,120 by going parent, parent, parent, parent and calculating those probabilities 1407 01:09:22,120 --> 01:09:23,615 and multiplying them together. 1408 01:09:23,615 --> 01:09:26,740 And then you'll need to normalize them at the end, speaking at a high level, 1409 01:09:26,740 --> 01:09:29,920 to make sure that everything adds up to the number one. 1410 01:09:29,920 --> 01:09:32,229 So the formula for how you do this, in a process known 1411 01:09:32,229 --> 01:09:35,223 as inference by enumeration, looks a little bit complicated, 1412 01:09:35,223 --> 01:09:36,640 but ultimately it looks like this. 1413 01:09:36,640 --> 01:09:39,550 And let's now try to distill what it is that all of these symbols 1414 01:09:39,550 --> 01:09:40,420 actually mean. 1415 01:09:40,420 --> 01:09:41,410 Let's start here. 1416 01:09:41,410 --> 01:09:46,029 What I care about knowing is the probability of x, my query variable, 1417 01:09:46,029 --> 01:09:48,370 given some sort of evidence. 1418 01:09:48,370 --> 01:09:50,410 What do I know about conditional probabilities? 1419 01:09:50,410 --> 01:09:55,030 Well, a conditional probability is proportional to the joint probability. 1420 01:09:55,030 --> 01:09:57,850 So we had some alpha, some normalizing constant, 1421 01:09:57,850 --> 01:10:01,840 multiplied by this joint probability of x and evidence. 1422 01:10:01,840 --> 01:10:03,410 And how do I calculate that? 1423 01:10:03,410 --> 01:10:05,980 Well, to do that, I'm going to marginalize over 1424 01:10:05,980 --> 01:10:07,420 all of the hidden variables. 1425 01:10:07,420 --> 01:10:10,450 All the variables that I don't directly observe the values for, 1426 01:10:10,450 --> 01:10:13,390 I'm basically going to iterate over all of the possibilities 1427 01:10:13,390 --> 01:10:16,040 that could happen and just sum them all up.
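Written out, the rule being described here is:

$$P(X \mid e) = \alpha\, P(X, e) = \alpha \sum_{y} P(X, e, y)$$

where $X$ is the query variable, $e$ is the observed evidence, $y$ ranges over all values of the hidden variables, and $\alpha$ (equal to $1 / P(e)$) is the normalizing constant that makes the resulting distribution sum to one.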
1428 01:10:16,040 --> 01:10:19,270 And so I can translate this into a sum over all y, which 1429 01:10:19,270 --> 01:10:22,450 ranges over all the possible hidden variables and the values 1430 01:10:22,450 --> 01:10:27,250 that they could take on, and adds up all of those possible individual 1431 01:10:27,250 --> 01:10:28,300 probabilities. 1432 01:10:28,300 --> 01:10:32,195 And that is going to allow me to do this process of inference by enumeration. 1433 01:10:32,195 --> 01:10:34,570 And ultimately, it's pretty annoying if we as humans have 1434 01:10:34,570 --> 01:10:36,713 to do all of this math for ourselves. 1435 01:10:36,713 --> 01:10:39,880 But it turns out this is where computers and AI can be particularly helpful, 1436 01:10:39,880 --> 01:10:43,360 that we can program a computer to understand a Bayesian network to be 1437 01:10:43,360 --> 01:10:45,610 able to understand these inference procedures 1438 01:10:45,610 --> 01:10:47,560 and to be able to do these calculations. 1439 01:10:47,560 --> 01:10:49,390 And using the information you've seen here, 1440 01:10:49,390 --> 01:10:52,150 you could implement a Bayesian network from scratch yourself. 1441 01:10:52,150 --> 01:10:54,733 But it turns out there are a lot of libraries, especially written 1442 01:10:54,733 --> 01:10:56,650 in Python, that allow us to make it easier 1443 01:10:56,650 --> 01:10:58,780 to do this sort of probabilistic inference 1444 01:10:58,780 --> 01:11:01,788 to be able to take a Bayesian network and do these sorts of calculations 1445 01:11:01,788 --> 01:11:04,830 so that you don't need to know and understand all of the underlying math, 1446 01:11:04,830 --> 01:11:07,372 though it's helpful to have a general sense for how it works. 1447 01:11:07,372 --> 01:11:10,330 But you just need to be able to describe the structure of the network 1448 01:11:10,330 --> 01:11:14,350 and make queries in order to be able to produce the result. 1449 01:11:14,350 --> 01:11:17,050 And so let's take a look at an example of that right now. 1450 01:11:17,050 --> 01:11:19,420 It turns out that there are a lot of possible libraries 1451 01:11:19,420 --> 01:11:21,803 that exist in Python for doing this sort of inference. 1452 01:11:21,803 --> 01:11:24,220 It doesn't matter too much which specific library you use. 1453 01:11:24,220 --> 01:11:26,330 They all behave in fairly similar ways. 1454 01:11:26,330 --> 01:11:29,170 But the library I'm going to use here is one known as pomegranate. 1455 01:11:29,170 --> 01:11:33,820 And here inside of model.py, I have defined a Bayesian network 1456 01:11:33,820 --> 01:11:38,070 just using the structure and the syntax that the pomegranate library expects. 1457 01:11:38,070 --> 01:11:40,930 And what I'm effectively doing is just, in Python, 1458 01:11:40,930 --> 01:11:44,740 creating nodes to represent each of the nodes of the Bayesian network 1459 01:11:44,740 --> 01:11:47,060 that you saw me describe a moment ago. 1460 01:11:47,060 --> 01:11:49,750 So here on line four, after I've imported pomegranate, 1461 01:11:49,750 --> 01:11:52,540 I'm defining a variable called rain that is going to represent 1462 01:11:52,540 --> 01:11:55,990 a node inside of my Bayesian network. 1463 01:11:55,990 --> 01:11:59,530 It's going to be a node that follows this distribution where 1464 01:11:59,530 --> 01:12:01,030 there are three possible values-- 1465 01:12:01,030 --> 01:12:03,970 none for no rain, light for light rain, heavy for heavy rain.
1466 01:12:03,970 --> 01:12:07,180 And these are the probabilities of each of those taking place. 1467 01:12:07,180 --> 01:12:13,630 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. 1468 01:12:13,630 --> 01:12:15,760 Then, after that, we go to the next variable, 1469 01:12:15,760 --> 01:12:18,400 the variable for track maintenance, for example, which 1470 01:12:18,400 --> 01:12:20,990 is dependent upon that rain variable. 1471 01:12:20,990 --> 01:12:23,890 And this, instead of being an unconditional distribution, 1472 01:12:23,890 --> 01:12:27,370 is a conditional distribution, as indicated by a conditional probability 1473 01:12:27,370 --> 01:12:28,430 table here. 1474 01:12:28,430 --> 01:12:33,790 And the idea is that this is conditional on the distribution of rain. 1475 01:12:33,790 --> 01:12:36,700 So if there is no rain, then the chance that there is 1476 01:12:36,700 --> 01:12:38,370 track maintenance is 0.4. 1477 01:12:38,370 --> 01:12:41,720 If there's no rain, the chance that there is no track maintenance is 0.6. 1478 01:12:41,720 --> 01:12:43,720 Likewise, for light rain, I have a distribution. 1479 01:12:43,720 --> 01:12:45,760 For heavy rain, I have a distribution, as well. 1480 01:12:45,760 --> 01:12:48,130 But I'm effectively encoding the same information 1481 01:12:48,130 --> 01:12:50,110 you saw represented graphically a moment ago, 1482 01:12:50,110 --> 01:12:53,110 but I'm telling this Python program that the maintenance 1483 01:12:53,110 --> 01:12:57,640 node obeys this particular conditional probability distribution. 1484 01:12:57,640 --> 01:13:01,090 And we do the same thing for the other random variables, as well. 1485 01:13:01,090 --> 01:13:06,310 Train was a node inside my network whose distribution was a conditional probability 1486 01:13:06,310 --> 01:13:08,050 table with two parents. 1487 01:13:08,050 --> 01:13:11,380 It was dependent not only on rain, but also on track maintenance. 1488 01:13:11,380 --> 01:13:15,310 And so here I'm saying something like, given that there is no rain and yes 1489 01:13:15,310 --> 01:13:19,630 track maintenance, the probability that my train is on time is 0.8, 1490 01:13:19,630 --> 01:13:22,240 and the probability that it's delayed is 0.2. 1491 01:13:22,240 --> 01:13:24,220 And likewise, I can do the same thing for all 1492 01:13:24,220 --> 01:13:28,330 of the other possible values of the parents of the train node 1493 01:13:28,330 --> 01:13:32,800 inside of my Bayesian network by saying, for all of those possible values, 1494 01:13:32,800 --> 01:13:36,350 here is the distribution that the train node should follow. 1495 01:13:36,350 --> 01:13:38,710 And I do the same thing for an appointment 1496 01:13:38,710 --> 01:13:41,830 based on the distribution of the variable Train. 1497 01:13:41,830 --> 01:13:45,340 Then, at the end, what I do is actually construct this network 1498 01:13:45,340 --> 01:13:47,860 by describing what the states of the network are 1499 01:13:47,860 --> 01:13:50,660 and by adding edges between the dependent nodes. 1500 01:13:50,660 --> 01:13:53,110 So I create a new Bayesian network, add states to it-- 1501 01:13:53,110 --> 01:13:56,650 one for rain, one for maintenance, one for train, one for the appointment-- 1502 01:13:56,650 --> 01:14:00,460 and then I add edges connecting the related pieces.
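For reference, here is a sketch of what such a model.py might look like, written against the older pomegranate 0.x API that this walkthrough appears to use. The probability values not read out in the transcript (the light- and heavy-rain maintenance rows and most of the train rows) are illustrative placeholders, not necessarily the lecture's actual numbers:

```python
# model.py -- a sketch, assuming the older pomegranate 0.x API
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

# Rain has no parents: an unconditional distribution
rain = Node(DiscreteDistribution({
    "none": 0.7, "light": 0.2, "heavy": 0.1
}), name="rain")

# Track maintenance is conditional on rain
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4], ["none", "no", 0.6],
    ["light", "yes", 0.2], ["light", "no", 0.8],  # illustrative values
    ["heavy", "yes", 0.1], ["heavy", "no", 0.9],  # illustrative values
], [rain.distribution]), name="maintenance")

# Train is conditional on both rain and maintenance
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8], ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9], ["none", "no", "delayed", 0.1],      # illustrative
    ["light", "yes", "on time", 0.6], ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7], ["light", "no", "delayed", 0.3],    # illustrative
    ["heavy", "yes", "on time", 0.4], ["heavy", "yes", "delayed", 0.6],  # illustrative
    ["heavy", "no", "on time", 0.5], ["heavy", "no", "delayed", 0.5],    # illustrative
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment is conditional on the train alone
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9], ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6], ["delayed", "miss", 0.4],
], [train.distribution]), name="appointment")

# Assemble the network: add the states, then the edges, then bake
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)
model.bake()
```

The likelihood.py query described next would then be a single call, something like model.probability([["none", "no", "on time", "attend"]]).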
1503 01:14:00,460 --> 01:14:04,570 Rain has an arrow to maintenance because rain influences track maintenance, 1504 01:14:04,570 --> 01:14:08,530 rain also influences the train, maintenance also influences the train, 1505 01:14:08,530 --> 01:14:11,140 and train influences whether I make it to my appointment, 1506 01:14:11,140 --> 01:14:14,800 and calling bake just finalizes the model and does some additional computation. 1507 01:14:14,800 --> 01:14:18,250 So the specific syntax of this is not really the important part. 1508 01:14:18,250 --> 01:14:20,980 Pomegranate just happens to be one of several different libraries 1509 01:14:20,980 --> 01:14:22,990 that can all be used for similar purposes, 1510 01:14:22,990 --> 01:14:26,170 and you could describe and define a library for yourself 1511 01:14:26,170 --> 01:14:28,010 that implemented similar things. 1512 01:14:28,010 --> 01:14:30,430 But the key idea here is that someone can 1513 01:14:30,430 --> 01:14:33,220 design a library for a general Bayesian network that 1514 01:14:33,220 --> 01:14:35,680 has nodes that are each based upon their parents, 1515 01:14:35,680 --> 01:14:39,190 and then all a programmer needs to do, using one of those libraries, 1516 01:14:39,190 --> 01:14:43,420 is to define what those nodes and what those probability distributions are, 1517 01:14:43,420 --> 01:14:47,000 and we can begin to do some interesting logic based on it. 1518 01:14:47,000 --> 01:14:50,200 So let's try doing that conditional or joint probability 1519 01:14:50,200 --> 01:14:56,800 calculation that we did by hand before by going into likelihood.py 1520 01:14:56,800 --> 01:15:00,340 where here I'm importing the model that I just defined a moment ago. 1521 01:15:00,340 --> 01:15:03,100 And here I'd just like to calculate model.probability, 1522 01:15:03,100 --> 01:15:06,320 which calculates the probability for a given observation, 1523 01:15:06,320 --> 01:15:10,270 and I'd like to calculate the probability of no rain, 1524 01:15:10,270 --> 01:15:13,330 no track maintenance, my train is on time, 1525 01:15:13,330 --> 01:15:14,950 and I'm able to attend the meeting-- 1526 01:15:14,950 --> 01:15:16,870 so sort of the optimal scenario, that there's 1527 01:15:16,870 --> 01:15:20,162 no rain and no maintenance on the track, my train is on time, 1528 01:15:20,162 --> 01:15:21,620 and I'm able to attend the meeting. 1529 01:15:21,620 --> 01:15:25,020 What is the probability that all of that actually happens? 1530 01:15:25,020 --> 01:15:26,900 And I can calculate that using the library 1531 01:15:26,900 --> 01:15:28,700 and just print out its probability. 1532 01:15:28,700 --> 01:15:32,780 And so I'll go ahead and run python likelihood.py, 1533 01:15:32,780 --> 01:15:37,190 and I see that, OK, the probability is about 0.34. 1534 01:15:37,190 --> 01:15:40,850 So about a third of the time, everything goes right for me, in this case-- 1535 01:15:40,850 --> 01:15:43,190 no rain, no track maintenance, train is on time, 1536 01:15:43,190 --> 01:15:45,032 and I'm able to attend the meeting. 1537 01:15:45,032 --> 01:15:47,990 But I could experiment with this, try and calculate other probabilities 1538 01:15:47,990 --> 01:15:48,650 as well. 1539 01:15:48,650 --> 01:15:51,860 What's the probability that everything goes right up until the train 1540 01:15:51,860 --> 01:15:57,020 but I still miss my meeting-- so no rain, no track maintenance, train 1541 01:15:57,020 --> 01:15:59,690 is on time, but I miss the appointment.
1542 01:15:59,690 --> 01:16:04,680 Let's calculate that probability, and that has a probability of about 0.04. 1543 01:16:04,680 --> 01:16:07,643 So about 4% of the time the train will be on time, 1544 01:16:07,643 --> 01:16:09,560 there won't be any rain, no track maintenance, 1545 01:16:09,560 --> 01:16:12,420 and yet I'll still miss the meeting. 1546 01:16:12,420 --> 01:16:14,780 And so this is really just an implementation 1547 01:16:14,780 --> 01:16:17,900 of the calculation of the joint probabilities that we did before. 1548 01:16:17,900 --> 01:16:20,150 What this library is likely doing is first 1549 01:16:20,150 --> 01:16:23,600 figuring out the probability of no rain, then figuring 1550 01:16:23,600 --> 01:16:26,030 out the probability of no track maintenance 1551 01:16:26,030 --> 01:16:28,580 given no rain, then the probability that my train is 1552 01:16:28,580 --> 01:16:31,760 on time given both of these values, and then the probability 1553 01:16:31,760 --> 01:16:35,930 that I miss my appointment given that I know that the train was on time. 1554 01:16:35,930 --> 01:16:39,070 So this, again, is the calculation of that joint probability. 1555 01:16:39,070 --> 01:16:42,320 And it turns out we can also begin to have our computer solve inference problems, 1556 01:16:42,320 --> 01:16:45,980 as well, to begin to infer, based on information, evidence 1557 01:16:45,980 --> 01:16:51,000 that we see, what is the likelihood of other variables also being true? 1558 01:16:51,000 --> 01:16:54,740 So let's go into inference.py, for example, where here I'm, 1559 01:16:54,740 --> 01:16:57,110 again, importing that exact same model from before, 1560 01:16:57,110 --> 01:16:59,300 importing all the nodes and all the edges 1561 01:16:59,300 --> 01:17:03,300 and the probability distribution that is encoded there, as well. 1562 01:17:03,300 --> 01:17:06,320 And now there's a function for doing some sort of prediction. 1563 01:17:06,320 --> 01:17:10,760 And here, into this model, I pass in the evidence that I observe. 1564 01:17:10,760 --> 01:17:14,750 So here I've encoded into this Python program the evidence 1565 01:17:14,750 --> 01:17:15,770 that I have observed. 1566 01:17:15,770 --> 01:17:18,950 I have observed the fact that the train is delayed, 1567 01:17:18,950 --> 01:17:22,190 and that is the value for one of the four random variables 1568 01:17:22,190 --> 01:17:24,140 inside of this Bayesian network. 1569 01:17:24,140 --> 01:17:26,210 And using that information, I would like to be 1570 01:17:26,210 --> 01:17:29,270 able to draw inferences 1571 01:17:29,270 --> 01:17:31,875 about the values of the other random variables 1572 01:17:31,875 --> 01:17:33,500 that are inside of my Bayesian network. 1573 01:17:33,500 --> 01:17:36,240 I would like to make predictions about everything else. 1574 01:17:36,240 --> 01:17:40,340 So all of the actual computational logic is happening in just these three lines 1575 01:17:40,340 --> 01:17:42,260 where I'm making this call to this prediction. 1576 01:17:42,260 --> 01:17:45,830 Down below, I'm just iterating over all of the states and all the predictions 1577 01:17:45,830 --> 01:17:49,860 and just printing them out so that we can visually see what the results are. 1578 01:17:49,860 --> 01:17:51,980 But let's find out, given the train is delayed, 1579 01:17:51,980 --> 01:17:56,210 what can I predict about the values of the other random variables? 1580 01:17:56,210 --> 01:17:59,021 Let's go ahead and run python inference.py.
1581 01:17:59,021 --> 01:18:00,005 I run that. 1582 01:18:00,005 --> 01:18:01,880 And all right, here is the result that I get. 1583 01:18:01,880 --> 01:18:04,640 Given the fact that I know that the train is delayed-- 1584 01:18:04,640 --> 01:18:06,770 this is evidence that I have observed-- 1585 01:18:06,770 --> 01:18:10,490 well, I can see that there is about a 46% chance 1586 01:18:10,490 --> 01:18:12,520 that there was no rain, a 31% chance there 1587 01:18:12,520 --> 01:18:15,230 was light rain, a 23% chance there was heavy rain, 1588 01:18:15,230 --> 01:18:17,712 and I can see a probability distribution over track maintenance 1589 01:18:17,712 --> 01:18:19,670 and a probability distribution over whether I'm 1590 01:18:19,670 --> 01:18:22,130 able to attend or miss my appointment. 1591 01:18:22,130 --> 01:18:23,990 Now, we know that whether I attend or miss 1592 01:18:23,990 --> 01:18:27,715 the appointment, that is only dependent upon the train being delayed 1593 01:18:27,715 --> 01:18:28,340 or not delayed. 1594 01:18:28,340 --> 01:18:30,540 It shouldn't depend on anything else. 1595 01:18:30,540 --> 01:18:34,610 So let's imagine, for example, that I knew that there was heavy rain. 1596 01:18:34,610 --> 01:18:38,620 That shouldn't affect the distribution for making the appointment. 1597 01:18:38,620 --> 01:18:41,360 And indeed, if I go up here and add some evidence, 1598 01:18:41,360 --> 01:18:44,128 say that I know that the value of rain is heavy-- 1599 01:18:44,128 --> 01:18:45,920 that is evidence that I now have access to. 1600 01:18:45,920 --> 01:18:47,420 I now have two pieces of evidence. 1601 01:18:47,420 --> 01:18:51,950 I know that the rain is heavy, and I know that my train is delayed. 1602 01:18:51,950 --> 01:18:55,550 I can calculate the probability by running this inference procedure again 1603 01:18:55,550 --> 01:18:57,090 and seeing the result. 1604 01:18:57,090 --> 01:18:58,340 I know that the rain is heavy. 1605 01:18:58,340 --> 01:18:59,840 I know my train is delayed. 1606 01:18:59,840 --> 01:19:02,990 The probability distribution for track maintenance changed. 1607 01:19:02,990 --> 01:19:05,130 Given that I know that there is heavy rain, 1608 01:19:05,130 --> 01:19:08,750 now it's more likely that there is no track maintenance, 88% as 1609 01:19:08,750 --> 01:19:12,250 opposed to 64% from before. 1610 01:19:12,250 --> 01:19:16,040 And now what is the probability that I make the appointment? 1611 01:19:16,040 --> 01:19:17,480 Well, that's the same as before. 1612 01:19:17,480 --> 01:19:21,100 It's still going to be attend the appointment with probability 0.6, 1613 01:19:21,100 --> 01:19:23,450 miss the appointment with probability 0.4, 1614 01:19:23,450 --> 01:19:27,290 because it was only dependent upon whether or not my train was on time 1615 01:19:27,290 --> 01:19:28,260 or delayed. 1616 01:19:28,260 --> 01:19:31,610 And so this here is implementing the idea of that inference algorithm 1617 01:19:31,610 --> 01:19:34,130 to be able to figure out, based on the evidence 1618 01:19:34,130 --> 01:19:37,970 that I have, what can we infer about the values of the other variables that 1619 01:19:37,970 --> 01:19:39,050 exist as well? 1620 01:19:39,050 --> 01:19:42,890 So inference by enumeration is one way of doing this inference procedure, 1621 01:19:42,890 --> 01:19:46,730 just looping over all of the values the hidden variables could take on 1622 01:19:46,730 --> 01:19:49,460 and figuring out what the probability is.
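For reference, the inference.py logic just described might look roughly like this, again assuming the older pomegranate 0.x API and the model sketched earlier:

```python
# inference.py -- a sketch, assuming the older pomegranate 0.x API
from model import model  # the Bayesian network sketched earlier

# Evidence: we have observed that the train is delayed
predictions = model.predict_proba({
    "train": "delayed"
})

# Print a distribution for each node; nodes fixed by the evidence
# come back as plain strings rather than distributions
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        print(f"{node.name}: {prediction}")
    else:
        print(f"{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability:.4f}")
```

Adding the second piece of evidence from the walkthrough above would just mean passing {"train": "delayed", "rain": "heavy"} instead.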
1623 01:19:49,460 --> 01:19:52,010 Now, it turns out this is not particularly efficient, 1624 01:19:52,010 --> 01:19:56,180 and there are definitely optimizations you can make by avoiding repeated work 1625 01:19:56,180 --> 01:19:59,030 if you're calculating the same sort of probability multiple times. 1626 01:19:59,030 --> 01:20:02,570 There are ways of optimizing the program to avoid having to recalculate 1627 01:20:02,570 --> 01:20:04,640 the same probabilities again and again. 1628 01:20:04,640 --> 01:20:06,980 But even then, as the number of variables 1629 01:20:06,980 --> 01:20:10,220 gets large, as the number of possible values those variables could take on 1630 01:20:10,220 --> 01:20:12,110 gets large, we're going to start to have to do 1631 01:20:12,110 --> 01:20:14,600 a lot of computation, a lot of calculation, 1632 01:20:14,600 --> 01:20:16,190 to be able to do this inference. 1633 01:20:16,190 --> 01:20:18,150 And at that point, it might start to become 1634 01:20:18,150 --> 01:20:20,250 unreasonable in terms of the amount of time 1635 01:20:20,250 --> 01:20:24,615 that it would take to be able to do this sort of exact inference. 1636 01:20:24,615 --> 01:20:26,490 And it's for that reason that oftentimes when 1637 01:20:26,490 --> 01:20:29,970 it comes to probability and things we're not entirely sure about, 1638 01:20:29,970 --> 01:20:32,280 we don't always care about doing exact inference 1639 01:20:32,280 --> 01:20:35,040 and knowing exactly what the probability is. 1640 01:20:35,040 --> 01:20:37,560 But if we can approximate the inference procedure, 1641 01:20:37,560 --> 01:20:41,570 do some sort of approximate inference, that can be pretty good as well-- 1642 01:20:41,570 --> 01:20:43,550 if I don't know the exact probability 1643 01:20:43,550 --> 01:20:45,510 but I have a general sense for the probability, 1644 01:20:45,510 --> 01:20:49,200 one that can get increasingly accurate with more time, that's probably 1645 01:20:49,200 --> 01:20:53,620 pretty good, especially if I can get it to happen even faster. 1646 01:20:53,620 --> 01:20:57,930 So how could I do approximate inference inside of a Bayesian network? 1647 01:20:57,930 --> 01:21:00,480 Well, one method is through a procedure known as sampling. 1648 01:21:00,480 --> 01:21:04,980 In the process of sampling, I'm going to take a sample of all of the variables 1649 01:21:04,980 --> 01:21:06,840 inside of this Bayesian network here. 1650 01:21:06,840 --> 01:21:08,280 And how am I going to sample? 1651 01:21:08,280 --> 01:21:12,240 Well, I'm going to sample one of the values from each of these nodes 1652 01:21:12,240 --> 01:21:14,560 according to their probability distribution. 1653 01:21:14,560 --> 01:21:16,560 So how might I take a sample of all these nodes? 1654 01:21:16,560 --> 01:21:17,430 Well, I'll start at the root. 1655 01:21:17,430 --> 01:21:18,450 I'll start with rain. 1656 01:21:18,450 --> 01:21:21,060 Here's the distribution for rain, and I'll go ahead 1657 01:21:21,060 --> 01:21:23,880 and, using a random number generator or something like it, 1658 01:21:23,880 --> 01:21:25,770 randomly pick one of these three values. 1659 01:21:25,770 --> 01:21:29,730 I'll pick none with probability 0.7, light with probability 0.2, 1660 01:21:29,730 --> 01:21:31,440 and heavy with probability 0.1. 1661 01:21:31,440 --> 01:21:34,770 So I'll randomly just pick one of them according to that distribution, 1662 01:21:34,770 --> 01:21:37,780 and maybe, in this case, I pick none, for example.
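A single draw like that is easy to sketch in Python; one plausible way, using the rain distribution above:

```python
import random

# Randomly pick a value for the rain node according to its distribution
value = random.choices(
    ["none", "light", "heavy"],
    weights=[0.7, 0.2, 0.1],
)[0]
print(value)  # e.g. "none"
```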
1663 01:21:37,780 --> 01:21:39,780 Then I do the same thing for the other variable. 1664 01:21:39,780 --> 01:21:42,410 Maintenance also has a probability distribution. 1665 01:21:42,410 --> 01:21:44,070 And I am going to sample-- 1666 01:21:44,070 --> 01:21:46,470 now, there are three probability distributions here, 1667 01:21:46,470 --> 01:21:49,050 but I'm only going to sample from this first row 1668 01:21:49,050 --> 01:21:53,950 here because I've observed already in my sample that the value of rain is none. 1669 01:21:53,950 --> 01:21:54,450 So, 1670 01:21:54,450 --> 01:21:58,295 given that rain is none, I'm going to sample from this distribution to say, 1671 01:21:58,295 --> 01:22:00,420 all right, what should the value of maintenance be? 1672 01:22:00,420 --> 01:22:02,753 And in this case, maintenance is going to be, let's just 1673 01:22:02,753 --> 01:22:06,570 say, yes, which happens 40% of the time in the event that there is no rain, 1674 01:22:06,570 --> 01:22:07,603 for example. 1675 01:22:07,603 --> 01:22:10,020 And we'll sample all of the rest of the nodes in this way, 1676 01:22:10,020 --> 01:22:12,840 as well: I want to sample from the train distribution, 1677 01:22:12,840 --> 01:22:17,040 and I'll sample from this first row here where there is no rain, 1678 01:22:17,040 --> 01:22:18,570 but there is track maintenance. 1679 01:22:18,570 --> 01:22:21,980 And I'll sample: 80% of the time, I'll say the train is on time. 1680 01:22:21,980 --> 01:22:24,463 20% of the time, I'll say the train is delayed. 1681 01:22:24,463 --> 01:22:27,630 And finally, we'll do the same thing for whether I make it to my appointment 1682 01:22:27,630 --> 01:22:27,890 or not. 1683 01:22:27,890 --> 01:22:29,490 Did I attend or miss the appointment? 1684 01:22:29,490 --> 01:22:32,700 We'll sample based on this distribution and maybe say that in this case 1685 01:22:32,700 --> 01:22:36,150 I attend the appointment, which happens 90% of the time when 1686 01:22:36,150 --> 01:22:38,730 the train is actually on time. 1687 01:22:38,730 --> 01:22:42,900 So by going through these nodes, I can very quickly just do some sampling 1688 01:22:42,900 --> 01:22:45,720 and get a sample of the possible values that 1689 01:22:45,720 --> 01:22:48,990 could come up from going through this entire Bayesian network 1690 01:22:48,990 --> 01:22:51,540 according to those probability distributions. 1691 01:22:51,540 --> 01:22:54,360 And where this becomes powerful is if I do this not once, 1692 01:22:54,360 --> 01:22:57,100 but I do this thousands or tens of thousands of times 1693 01:22:57,100 --> 01:23:00,400 and generate a whole bunch of samples, all using this distribution. 1694 01:23:00,400 --> 01:23:01,410 I get different samples. 1695 01:23:01,410 --> 01:23:02,820 Maybe some of them are the same. 1696 01:23:02,820 --> 01:23:07,800 But I get a value for each of the possible variables that could come up. 1697 01:23:07,800 --> 01:23:10,620 And so then, if I'm ever faced with a question, a question like, 1698 01:23:10,620 --> 01:23:13,860 what is the probability that the train is on time, 1699 01:23:13,860 --> 01:23:15,900 you could do an exact inference procedure.
1700 01:23:15,900 --> 01:23:18,630 This is no different than the inference problem we had before 1701 01:23:18,630 --> 01:23:21,780 where I could just marginalize, look at all the possible other values 1702 01:23:21,780 --> 01:23:24,390 of the variables and do the computation of inference 1703 01:23:24,390 --> 01:23:28,200 by enumeration to find out this probability exactly. 1704 01:23:28,200 --> 01:23:31,710 But I could also, if I don't care about the exact probability, just sample it. 1705 01:23:31,710 --> 01:23:33,150 Approximate it to get close. 1706 01:23:33,150 --> 01:23:35,040 And this is a powerful tool in AI where we 1707 01:23:35,040 --> 01:23:38,790 don't need to be right 100% of the time or we don't need to be exactly right. 1708 01:23:38,790 --> 01:23:41,130 If we just need to be right with some probability, 1709 01:23:41,130 --> 01:23:44,290 we can often do so more effectively, more efficiently. 1710 01:23:44,290 --> 01:23:46,920 And so here, now, are all of those possible samples. 1711 01:23:46,920 --> 01:23:50,390 I'll sort of highlight the ones where the train is on time. 1712 01:23:50,390 --> 01:23:52,620 I'm ignoring the ones where the train is delayed. 1713 01:23:52,620 --> 01:23:55,350 And in this case, six out of eight 1714 01:23:55,350 --> 01:23:57,690 of the samples have the train arriving on time. 1715 01:23:57,690 --> 01:24:01,320 And so maybe, in this case, I can say that, in six out of eight cases, 1716 01:24:01,320 --> 01:24:03,458 that's the likelihood that the train is on time. 1717 01:24:03,458 --> 01:24:06,000 And with eight samples, that might not be a great prediction. 1718 01:24:06,000 --> 01:24:08,520 But if I had thousands upon thousands of samples, 1719 01:24:08,520 --> 01:24:11,580 then this could be a much better inference procedure 1720 01:24:11,580 --> 01:24:13,680 to be able to do these sorts of calculations. 1721 01:24:13,680 --> 01:24:17,310 So this is a direct sampling method to just do a bunch of samples 1722 01:24:17,310 --> 01:24:21,210 and then figure out what the probability of some event is. 1723 01:24:21,210 --> 01:24:24,400 Now, this from before was an unconditional probability. 1724 01:24:24,400 --> 01:24:27,447 What is the probability that the train is on time? 1725 01:24:27,447 --> 01:24:30,030 And I did that by looking at all the samples and figuring out, 1726 01:24:30,030 --> 01:24:32,372 right here, the ones where the train is on time. 1727 01:24:32,372 --> 01:24:34,080 But sometimes what I'll want to calculate 1728 01:24:34,080 --> 01:24:38,387 is not an unconditional probability, but rather a conditional probability, 1729 01:24:38,387 --> 01:24:40,470 something like, what is the probability that there 1730 01:24:40,470 --> 01:24:45,010 is light rain given that the train is on time, something to that effect. 1731 01:24:45,010 --> 01:24:50,060 And to do that kind of calculation, well, what I might do is here 1732 01:24:50,060 --> 01:24:52,140 are all the samples that I have, and I want 1733 01:24:52,140 --> 01:24:54,720 to calculate a probability distribution given 1734 01:24:54,720 --> 01:24:57,368 that I know that the train is on time. 1735 01:24:57,368 --> 01:24:59,910 So to be able to do that, I can kind of look at the two cases 1736 01:24:59,910 --> 01:25:03,630 where the train was delayed and ignore or reject them, 1737 01:25:03,630 --> 01:25:07,762 sort of exclude them from the possible samples that I'm considering.
1738 01:25:07,762 --> 01:25:09,720 And now I want to look at these remaining cases 1739 01:25:09,720 --> 01:25:11,130 where the train is on time. 1740 01:25:11,130 --> 01:25:13,860 Here are the cases where there is light rain. 1741 01:25:13,860 --> 01:25:16,850 And now I say, OK, these are two out of the six possible cases. 1742 01:25:16,850 --> 01:25:19,580 That can give me an approximation for the probability 1743 01:25:19,580 --> 01:25:23,440 of light rain given the fact that I know the train was on time. 1744 01:25:23,440 --> 01:25:25,700 And I did that in almost exactly the same way 1745 01:25:25,700 --> 01:25:28,660 just by adding an additional step, by saying that, 1746 01:25:28,660 --> 01:25:30,470 all right, when I take each sample, let me 1747 01:25:30,470 --> 01:25:34,460 reject all of the samples that don't match my evidence 1748 01:25:34,460 --> 01:25:37,250 and only consider the samples that do match 1749 01:25:37,250 --> 01:25:39,920 what it is that I have in my evidence that I want 1750 01:25:39,920 --> 01:25:42,020 to make some sort of calculation about. 1751 01:25:42,020 --> 01:25:45,920 And it turns out, using the libraries that we've had for Bayesian networks, 1752 01:25:45,920 --> 01:25:48,740 we can begin to implement this same sort of idea, 1753 01:25:48,740 --> 01:25:51,890 implement rejection sampling, which is what this method is called, 1754 01:25:51,890 --> 01:25:55,850 to be able to figure out some probability, not via direct inference, 1755 01:25:55,850 --> 01:25:57,980 but instead by sampling. 1756 01:25:57,980 --> 01:26:00,290 So what I have here is a program called sample.py, which 1757 01:26:00,290 --> 01:26:02,180 imports the exact same model. 1758 01:26:02,180 --> 01:26:05,490 And what I define first is a function to generate a sample. 1759 01:26:05,490 --> 01:26:09,088 And the way I generate a sample is just by looping over all of the states. 1760 01:26:09,088 --> 01:26:10,880 The states need to be in some sort of order 1761 01:26:10,880 --> 01:26:12,797 to make sure I'm looping in the correct order. 1762 01:26:12,797 --> 01:26:16,010 But effectively, if it is a conditional distribution, 1763 01:26:16,010 --> 01:26:18,410 I'm going to sample based on the parents. 1764 01:26:18,410 --> 01:26:21,240 And otherwise, I'm just going to directly sample the variable, 1765 01:26:21,240 --> 01:26:25,040 like rain, which has no parents-- it's just an unconditional distribution-- 1766 01:26:25,040 --> 01:26:28,640 and keep track of all those parent samples and return the final sample. 1767 01:26:28,640 --> 01:26:31,290 The exact syntax of this, again, is not particularly important. 1768 01:26:31,290 --> 01:26:33,290 It just happens to be part of the implementation 1769 01:26:33,290 --> 01:26:35,820 details of this particular library. 1770 01:26:35,820 --> 01:26:38,270 The interesting logic is done below. 1771 01:26:38,270 --> 01:26:40,820 Now that I have the ability to generate a sample, 1772 01:26:40,820 --> 01:26:45,020 if I want to know the distribution of the appointment random variable given 1773 01:26:45,020 --> 01:26:48,680 that the train is delayed, well, then I can begin to do calculations like this. 1774 01:26:48,680 --> 01:26:52,430 Let me take 10,000 samples and assemble all my results 1775 01:26:52,430 --> 01:26:53,810 in this list called data. 1776 01:26:53,810 --> 01:26:57,140 I'll go ahead and loop n times-- in this case, 10,000 times.
1777 01:26:57,140 --> 01:27:01,670 I'll generate a sample, and I want to know the distribution of appointment 1778 01:27:01,670 --> 01:27:03,410 given that the train is delayed. 1779 01:27:03,410 --> 01:27:05,900 So according to rejection sampling, I'm only 1780 01:27:05,900 --> 01:27:08,210 going to consider samples where the train is delayed. 1781 01:27:08,210 --> 01:27:11,552 If the train's not delayed, I'm not going to consider those values at all. 1782 01:27:11,552 --> 01:27:13,760 So I'm going to say, all right, if I take the sample, 1783 01:27:13,760 --> 01:27:16,290 look at the value of the train random variable, 1784 01:27:16,290 --> 01:27:19,670 if the train is delayed, well, let me go ahead and add to my data 1785 01:27:19,670 --> 01:27:23,000 that I'm collecting the value of the appointment random variable 1786 01:27:23,000 --> 01:27:25,520 that it took on in this particular sample. 1787 01:27:25,520 --> 01:27:28,610 So I'm only considering the samples where the train is delayed 1788 01:27:28,610 --> 01:27:31,010 and, for each of those samples, considering 1789 01:27:31,010 --> 01:27:32,870 what the value of appointment is. 1790 01:27:32,870 --> 01:27:35,570 And then at the end, I'm using a Python class called Counter, 1791 01:27:35,570 --> 01:27:37,580 which quickly counts up all the values inside 1792 01:27:37,580 --> 01:27:40,100 of a data set so I can take this list of data 1793 01:27:40,100 --> 01:27:44,000 and figure out how many times was my appointment made, 1794 01:27:44,000 --> 01:27:47,360 and how many times was my appointment missed? 1795 01:27:47,360 --> 01:27:49,610 And so this here, with just a couple of lines of code, 1796 01:27:49,610 --> 01:27:53,080 is an implementation of rejection sampling. 1797 01:27:53,080 --> 01:27:58,170 And I can run it by going ahead and running python sample.py. 1798 01:27:58,170 --> 01:28:00,230 And when I do that, here is the result I get. 1799 01:28:00,230 --> 01:28:02,150 This is the result of the counter. 1800 01:28:02,150 --> 01:28:05,750 1,251 times I was able to attend the meeting, 1801 01:28:05,750 --> 01:28:08,900 and 856 times I missed the meeting. 1802 01:28:08,900 --> 01:28:11,550 And you can imagine, by doing more and more samples, 1803 01:28:11,550 --> 01:28:14,480 I'll be able to get a better and better, more accurate result. 1804 01:28:14,480 --> 01:28:16,070 And this is a randomized process. 1805 01:28:16,070 --> 01:28:18,895 It's going to be an approximation of the probability. 1806 01:28:18,895 --> 01:28:21,770 If I run it a different time, you'll notice the numbers are similar-- 1807 01:28:21,770 --> 01:28:25,460 1,272 and 905-- but they're not identical 1808 01:28:25,460 --> 01:28:28,250 because there's some randomization, some likelihood that things 1809 01:28:28,250 --> 01:28:31,730 might be higher or lower, and so this is why we generally want to try and use 1810 01:28:31,730 --> 01:28:35,360 more samples, so that we can have a greater amount of confidence 1811 01:28:35,360 --> 01:28:37,760 in our result and be more sure 1812 01:28:37,760 --> 01:28:41,240 that the result we're getting accurately reflects 1813 01:28:41,240 --> 01:28:43,940 the actual underlying probabilities that are 1814 01:28:43,940 --> 01:28:47,130 inherent inside of this distribution. 1815 01:28:47,130 --> 01:28:50,057 And so this, then, was an instance of rejection sampling.
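As a self-contained sketch of the same idea, here is a library-free version of rejection sampling over this network. The probability rows marked illustrative are not read out in the transcript, so those particular numbers are placeholders:

```python
import random
from collections import Counter

# Distributions for the lecture's network; rows marked
# "illustrative" are placeholders not given in the transcript.
P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINTENANCE = {  # P(maintenance | rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # illustrative
    "heavy": {"yes": 0.1, "no": 0.9},  # illustrative
}
P_TRAIN = {  # P(train | rain, maintenance)
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},    # illustrative
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # illustrative
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # illustrative
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},   # illustrative
}
P_APPOINTMENT = {  # P(appointment | train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def draw(distribution):
    """Sample one value from a {value: probability} dict."""
    values = list(distribution)
    weights = list(distribution.values())
    return random.choices(values, weights=weights)[0]

def generate_sample():
    """Sample every node, parents first, as in the walkthrough above."""
    rain = draw(P_RAIN)
    maintenance = draw(P_MAINTENANCE[rain])
    train = draw(P_TRAIN[(rain, maintenance)])
    appointment = draw(P_APPOINTMENT[train])
    return {"rain": rain, "maintenance": maintenance,
            "train": train, "appointment": appointment}

# Rejection sampling: estimate P(appointment | train = delayed)
# by keeping only the samples that match the evidence.
data = []
for _ in range(10_000):
    sample = generate_sample()
    if sample["train"] == "delayed":
        data.append(sample["appointment"])

print(Counter(data))  # e.g. Counter({'attend': 1251, 'miss': 856})
```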
1816 01:28:50,057 --> 01:28:52,640 And it turns out, there are a number of other sampling methods 1817 01:28:52,640 --> 01:28:55,070 that you could use to begin to try to sample. 1818 01:28:55,070 --> 01:28:57,530 One problem that rejection sampling has is 1819 01:28:57,530 --> 01:29:02,480 that if the evidence you're looking for is a fairly unlikely event, well, 1820 01:29:02,480 --> 01:29:04,610 you're going to be rejecting a lot of samples. 1821 01:29:04,610 --> 01:29:08,490 Like, if I'm looking for the probability of x given some evidence e, 1822 01:29:08,490 --> 01:29:12,680 if e is very unlikely to occur-- like, occurs maybe once every 1,000 times-- 1823 01:29:12,680 --> 01:29:16,040 then I'm only going to be considering one out of every 1,000 samples 1824 01:29:16,040 --> 01:29:18,798 that I do, which is a pretty inefficient method for trying 1825 01:29:18,798 --> 01:29:20,090 to do this sort of calculation. 1826 01:29:20,090 --> 01:29:23,600 I'm throwing away a lot of samples, and it takes computational effort 1827 01:29:23,600 --> 01:29:25,640 to be able to generate those samples, so I'd 1828 01:29:25,640 --> 01:29:27,480 like to not have to do something like that. 1829 01:29:27,480 --> 01:29:30,230 So there are other sampling methods that can try and address this. 1830 01:29:30,230 --> 01:29:33,680 One such sampling method is called likelihood weighting. 1831 01:29:33,680 --> 01:29:36,920 In likelihood weighting, we follow a slightly different procedure, 1832 01:29:36,920 --> 01:29:39,740 and the goal is to avoid needing to throw out 1833 01:29:39,740 --> 01:29:42,590 samples that didn't match the evidence. 1834 01:29:42,590 --> 01:29:46,760 And so what we'll do is we'll start by fixing the values for the evidence 1835 01:29:46,760 --> 01:29:47,300 variables. 1836 01:29:47,300 --> 01:29:49,430 Rather than sample everything, we're going 1837 01:29:49,430 --> 01:29:53,970 to fix the values of the evidence variables and not sample those. 1838 01:29:53,970 --> 01:29:57,650 Then we're going to sample all the other non-evidence variables in the same way, 1839 01:29:57,650 --> 01:30:01,010 just using the Bayesian network, looking at the probability distributions, 1840 01:30:01,010 --> 01:30:04,040 sampling all the non-evidence variables. 1841 01:30:04,040 --> 01:30:08,450 But then what we need to do is weight each sample by its likelihood. 1842 01:30:08,450 --> 01:30:10,520 If our evidence is really unlikely, we want 1843 01:30:10,520 --> 01:30:14,210 to make sure that we've taken into account, how likely was the evidence 1844 01:30:14,210 --> 01:30:16,410 to actually show up in the sample? 1845 01:30:16,410 --> 01:30:18,590 If I have a sample where the evidence was much more 1846 01:30:18,590 --> 01:30:20,720 likely to show up than another sample, then I 1847 01:30:20,720 --> 01:30:23,060 want to weight the more likely one higher. 1848 01:30:23,060 --> 01:30:25,490 So we're going to weight each sample by its likelihood 1849 01:30:25,490 --> 01:30:29,480 where likelihood is just defined as the probability of all of the evidence. 1850 01:30:29,480 --> 01:30:32,090 Given all the evidence we have, what is the probability 1851 01:30:32,090 --> 01:30:34,640 that it would happen in that particular sample? 1852 01:30:34,640 --> 01:30:37,250 So before, all of our samples were weighted equally. 1853 01:30:37,250 --> 01:30:40,970 They all had a weight of one when we were calculating the overall average.
1854 01:30:40,970 --> 01:30:42,980 In this case, we're going to weight each sample, 1855 01:30:42,980 --> 01:30:46,220 multiply each sample by its likelihood in order 1856 01:30:46,220 --> 01:30:49,252 to get the more accurate distribution. 1857 01:30:49,252 --> 01:30:50,460 So what would this look like? 1858 01:30:50,460 --> 01:30:54,170 Well, if I asked the same question, what is the probability of light rain given 1859 01:30:54,170 --> 01:30:57,050 that the train is on time, when I do the sampling procedure 1860 01:30:57,050 --> 01:30:59,780 and start by trying to sample, I'm going to start 1861 01:30:59,780 --> 01:31:01,520 by fixing the evidence variable. 1862 01:31:01,520 --> 01:31:04,640 I'm already going to have in my sample the train is on time. 1863 01:31:04,640 --> 01:31:06,860 That way, I don't have to throw out anything. 1864 01:31:06,860 --> 01:31:10,610 I'm only sampling things where I know the value of the variables that 1865 01:31:10,610 --> 01:31:13,790 are my evidence are what I expect them to be. 1866 01:31:13,790 --> 01:31:16,310 So I'll go ahead and sample from rain, and maybe this time I 1867 01:31:16,310 --> 01:31:18,318 sample light rain instead of no rain. 1868 01:31:18,318 --> 01:31:21,110 Then I'll sample from track maintenance and say maybe, yes, there's 1869 01:31:21,110 --> 01:31:22,100 track maintenance. 1870 01:31:22,100 --> 01:31:25,190 Then for train, well, I've already fixed it in place. 1871 01:31:25,190 --> 01:31:29,360 Train was an evidence variable, so I'm not going to bother sampling again. 1872 01:31:29,360 --> 01:31:30,820 I'll just go ahead and move on. 1873 01:31:30,820 --> 01:31:35,280 I'll move on to appointment and go ahead and sample from appointment as well. 1874 01:31:35,280 --> 01:31:37,040 So now I've generated a sample. 1875 01:31:37,040 --> 01:31:40,190 I've generated a sample by fixing this evidence variable 1876 01:31:40,190 --> 01:31:42,310 and sampling the other three. 1877 01:31:42,310 --> 01:31:44,390 And the last step is now weighting the sample. 1878 01:31:44,390 --> 01:31:45,920 How much weight should it have? 1879 01:31:45,920 --> 01:31:50,090 And the weight is based on how probable is it that the train was actually 1880 01:31:50,090 --> 01:31:52,560 on time, this evidence actually happened, 1881 01:31:52,560 --> 01:31:55,460 given the values of these other variables, light rain and the fact 1882 01:31:55,460 --> 01:31:57,620 that, yes, there was track maintenance? 1883 01:31:57,620 --> 01:32:00,260 Well, to do that, I can just go back to the train variable 1884 01:32:00,260 --> 01:32:02,900 and say, all right, if there was light rain and track 1885 01:32:02,900 --> 01:32:05,060 maintenance, the likelihood of my evidence, 1886 01:32:05,060 --> 01:32:08,570 the likelihood that my train was on time, is 0.6. 1887 01:32:08,570 --> 01:32:13,250 And so this particular sample would have a weight of 0.6. 1888 01:32:13,250 --> 01:32:15,740 And I could repeat the sampling procedure again and again. 1889 01:32:15,740 --> 01:32:18,140 Each time, every sample would be given a weight 1890 01:32:18,140 --> 01:32:22,928 according to the probability of the evidence that I see associated with it. 1891 01:32:22,928 --> 01:32:25,970 And there are other sampling methods that exist, as well, but all of them 1892 01:32:25,970 --> 01:32:27,845 are designed to try and get at the same idea, 1893 01:32:27,845 --> 01:32:30,950 to approximate the inference procedure of figuring out 1894 01:32:30,950 --> 01:32:33,540 the value of a variable. 
1895 01:32:33,540 --> 01:32:35,570 So we've now dealt with probability as it 1896 01:32:35,570 --> 01:32:38,840 pertains to particular variables that have these discrete values. 1897 01:32:38,840 --> 01:32:40,910 But what we haven't really considered is how 1898 01:32:40,910 --> 01:32:44,300 values might change over time. We've considered something 1899 01:32:44,300 --> 01:32:47,870 like a variable for rain where rain can take on values of none or light 1900 01:32:47,870 --> 01:32:50,600 rain or heavy rain, but, in practice, usually when 1901 01:32:50,600 --> 01:32:54,950 we consider values for variables like rain, we like to consider 1902 01:32:54,950 --> 01:32:58,020 how the values of these variables change over time. 1903 01:32:58,020 --> 01:33:02,040 What do we deal with when we're dealing with uncertainty over a period of time? 1904 01:33:02,040 --> 01:33:04,590 This can come up in the context of weather, for example-- 1905 01:33:04,590 --> 01:33:06,830 if I have sunny days and I have rainy days. 1906 01:33:06,830 --> 01:33:11,450 And I'd like to know not just what is the probability that it's raining now, 1907 01:33:11,450 --> 01:33:14,210 but what is the probability that it rains tomorrow or the day 1908 01:33:14,210 --> 01:33:15,838 after that or the day after that? 1909 01:33:15,838 --> 01:33:17,630 And so to do this, we're going to introduce 1910 01:33:17,630 --> 01:33:19,440 a slightly different kind of model. 1911 01:33:19,440 --> 01:33:23,300 But here we're going to have a random variable, not just one for the weather, 1912 01:33:23,300 --> 01:33:25,643 but for every possible time step. 1913 01:33:25,643 --> 01:33:27,560 And you can define time step however you like. 1914 01:33:27,560 --> 01:33:30,680 A simple way is just to use days as your time step. 1915 01:33:30,680 --> 01:33:34,220 And so we can define a variable called x sub t, which 1916 01:33:34,220 --> 01:33:36,620 is going to be the weather at time t. 1917 01:33:36,620 --> 01:33:39,350 So x sub zero might be the weather on day zero, 1918 01:33:39,350 --> 01:33:42,400 x sub one might be the weather on day one, so on and so forth, 1919 01:33:42,400 --> 01:33:45,022 x sub two is the weather on day two. 1920 01:33:45,022 --> 01:33:46,730 But as you can imagine, if we start to do 1921 01:33:46,730 --> 01:33:48,740 this over longer and longer periods of time, 1922 01:33:48,740 --> 01:33:51,282 there's an incredible amount of data that might go into this. 1923 01:33:51,282 --> 01:33:53,960 If you're keeping track of data about the weather for a year, 1924 01:33:53,960 --> 01:33:57,240 now suddenly you might be trying to predict the weather tomorrow given 1925 01:33:57,240 --> 01:34:00,620 365 days of previous pieces of evidence, and that's 1926 01:34:00,620 --> 01:34:03,530 a lot of evidence to have to deal with and manipulate and calculate. 1927 01:34:03,530 --> 01:34:06,410 Probably nobody knows what the exact conditional probability 1928 01:34:06,410 --> 01:34:10,070 distribution is for all of those combinations of variables. 1929 01:34:10,070 --> 01:34:13,070 And so when we're trying to do this inference inside of a computer, when 1930 01:34:13,070 --> 01:34:16,640 we're trying to reasonably do this sort of analysis, 1931 01:34:16,640 --> 01:34:19,053 it's helpful to make some simplifying assumptions, 1932 01:34:19,053 --> 01:34:21,470 some assumptions about the problem that we can just assume 1933 01:34:21,470 --> 01:34:23,930 are true to make our lives a little bit easier.
1934 01:34:23,930 --> 01:34:26,270 Even if they're not totally accurate assumptions, 1935 01:34:26,270 --> 01:34:28,703 if they're close to accurate or approximate, 1936 01:34:28,703 --> 01:34:29,870 they're usually pretty good. 1937 01:34:29,870 --> 01:34:33,350 And the assumption we're going to make is called the Markov assumption, 1938 01:34:33,350 --> 01:34:38,210 which is the assumption that the current state depends only on a finite fixed 1939 01:34:38,210 --> 01:34:40,220 number of previous states. 1940 01:34:40,220 --> 01:34:44,210 So the current day's weather depends not on all of the previous days' weather 1941 01:34:44,210 --> 01:34:47,150 for all of history, but the current day's weather I 1942 01:34:47,150 --> 01:34:49,758 can predict just based on yesterday's weather 1943 01:34:49,758 --> 01:34:52,550 or just based on the last two days' weather or the last three days' 1944 01:34:52,550 --> 01:34:53,050 weather. 1945 01:34:53,050 --> 01:34:57,620 But oftentimes, we're going to deal with just the one previous state helping 1946 01:34:57,620 --> 01:34:59,720 to predict the current state. 1947 01:34:59,720 --> 01:35:01,970 And by putting a whole bunch of these random variables 1948 01:35:01,970 --> 01:35:04,400 together, using this Markov assumption, we 1949 01:35:04,400 --> 01:35:08,090 can create what's called a Markov chain where a Markov chain is just 1950 01:35:08,090 --> 01:35:11,960 some sequence of random variables where each variable's distribution 1951 01:35:11,960 --> 01:35:13,772 follows that Markov assumption. 1952 01:35:13,772 --> 01:35:16,480 And so we'll do an example of this where the Markov assumption is that 1953 01:35:16,480 --> 01:35:17,590 I can predict the weather. 1954 01:35:17,590 --> 01:35:19,050 Is it sunny or rainy? 1955 01:35:19,050 --> 01:35:21,520 And we'll just consider those two possibilities for now, 1956 01:35:21,520 --> 01:35:23,395 even though there are other types of weather. 1957 01:35:23,395 --> 01:35:26,650 But I can predict each day's weather just on the prior day's weather. 1958 01:35:26,650 --> 01:35:30,430 Using today's weather, I can come up with a probability distribution 1959 01:35:30,430 --> 01:35:31,825 for tomorrow's weather. 1960 01:35:31,825 --> 01:35:33,700 And here's what this might look like. 1961 01:35:33,700 --> 01:35:37,030 It's formatted in terms of a matrix, as you might describe it, 1962 01:35:37,030 --> 01:35:41,410 as sort of rows and columns of values where on the left-hand side 1963 01:35:41,410 --> 01:35:45,850 I have today's weather, represented by the variable x sub t. 1964 01:35:45,850 --> 01:35:48,730 And then over here in the columns, I have tomorrow's weather, 1965 01:35:48,730 --> 01:35:54,790 represented by the variable x sub t plus one, the weather on day t plus one. 1966 01:35:54,790 --> 01:35:58,990 And what this matrix is saying is if today is sunny, 1967 01:35:58,990 --> 01:36:02,440 well, then, it's more likely than not that tomorrow is also sunny. 1968 01:36:02,440 --> 01:36:05,990 Oftentimes the weather stays consistent for multiple days in a row. 1969 01:36:05,990 --> 01:36:08,200 And for example, let's say that if today is sunny, 1970 01:36:08,200 --> 01:36:12,820 our model says that tomorrow, with probability 0.8, it will also be sunny, 1971 01:36:12,820 --> 01:36:15,610 and with probability 0.2 it will be raining. 1972 01:36:15,610 --> 01:36:19,245 And likewise, if today is raining, then it's 1973 01:36:19,245 --> 01:36:21,370 more likely than not that tomorrow is also raining.
1974 01:36:21,370 --> 01:36:23,620 With probability 0.7, it'll be raining. 1975 01:36:23,620 --> 01:36:26,710 With probability 0.3, it will be sunny. 1976 01:36:26,710 --> 01:36:28,840 So this matrix, this description of how it 1977 01:36:28,840 --> 01:36:32,290 is we transition from one state to the next state, 1978 01:36:32,290 --> 01:36:34,540 is what we're going to call the transition model. 1979 01:36:34,540 --> 01:36:37,030 And using the transition model, you can begin 1980 01:36:37,030 --> 01:36:41,770 to construct this Markov chain by just predicting, given today's weather, 1981 01:36:41,770 --> 01:36:44,020 what's the likelihood of tomorrow's weather happening? 1982 01:36:44,020 --> 01:36:46,930 And you can imagine doing a similar sampling 1983 01:36:46,930 --> 01:36:49,660 procedure where you take this information, 1984 01:36:49,660 --> 01:36:51,940 you sample what tomorrow's weather is going to be, 1985 01:36:51,940 --> 01:36:53,980 using that you sample the next day's weather, 1986 01:36:53,980 --> 01:36:58,390 and the result of that is you can form this Markov chain where x zero, 1987 01:36:58,390 --> 01:37:01,120 the weather on day zero, is sunny, the next day is sunny, 1988 01:37:01,120 --> 01:37:04,240 maybe the next day it changes to raining, then raining, then raining. 1989 01:37:04,240 --> 01:37:06,910 And the pattern that this Markov chain follows, 1990 01:37:06,910 --> 01:37:08,890 given the distribution that we had access to, 1991 01:37:08,890 --> 01:37:11,850 this transition model here, is that when it's sunny, 1992 01:37:11,850 --> 01:37:13,600 it tends to stay sunny for a little while. 1993 01:37:13,600 --> 01:37:16,100 The next couple days tend to be sunny too. 1994 01:37:16,100 --> 01:37:19,735 And when it's raining, it tends to stay raining as well. 1995 01:37:19,735 --> 01:37:21,860 And so you get a Markov chain that looks like this. 1996 01:37:21,860 --> 01:37:23,193 And you can do analysis on this. 1997 01:37:23,193 --> 01:37:25,630 You can say, given that today is raining, 1998 01:37:25,630 --> 01:37:27,790 what is the probability that tomorrow it's raining, 1999 01:37:27,790 --> 01:37:29,770 or you can begin to ask probability questions, 2000 01:37:29,770 --> 01:37:33,970 like what is the probability of this sequence of five values-- sun, sun, 2001 01:37:33,970 --> 01:37:35,200 rain, rain, rain-- 2002 01:37:35,200 --> 01:37:37,610 and answer those sorts of questions too. 2003 01:37:37,610 --> 01:37:40,780 And it turns out there are, again, many Python libraries for interacting 2004 01:37:40,780 --> 01:37:44,620 with models like this-- models of probability with distributions 2005 01:37:44,620 --> 01:37:47,440 and random variables that depend on previous variables 2006 01:37:47,440 --> 01:37:49,720 according to this Markov assumption. 2007 01:37:49,720 --> 01:37:53,090 And pomegranate, too, has ways of dealing with these sorts of variables. 2008 01:37:53,090 --> 01:37:59,800 So I'll go ahead and go into the chain directory 2009 01:37:59,800 --> 01:38:02,590 where I have some information about Markov chains. 2010 01:38:02,590 --> 01:38:05,770 And here I've defined a file called model.py where 2011 01:38:05,770 --> 01:38:08,320 I've defined, in a very similar syntax, this Markov chain. 2012 01:38:08,320 --> 01:38:11,080 And again, the exact syntax doesn't matter so much as the idea 2013 01:38:11,080 --> 01:38:14,410 that I'm encoding this information into a Python program 2014 01:38:14,410 --> 01:38:17,290 so that the program has access to these distributions.
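For reference, a minimal sketch of what a model.py like this might look like, assuming pomegranate's older 0.x API (DiscreteDistribution, ConditionalProbabilityTable, and MarkovChain); treat the exact class names and signatures as assumptions about that library version, since the lecture stresses that the particular syntax matters less than the data being encoded:

from pomegranate import DiscreteDistribution, ConditionalProbabilityTable, MarkovChain

# Starting distribution: 50/50 between sunny and rainy
start = DiscreteDistribution({
    "sun": 0.5,
    "rain": 0.5
})

# Transition model: P(tomorrow's weather | today's weather),
# encoding the same matrix described above
transitions = ConditionalProbabilityTable([
    ["sun", "sun", 0.8],
    ["sun", "rain", 0.2],
    ["rain", "sun", 0.3],
    ["rain", "rain", 0.7]
], [start])

# A Markov chain is just the starting distribution plus the transition model
model = MarkovChain([start, transitions])

# Sample 50 states from the chain: 50 days of simulated weather
print(model.sample(50))

As a sanity check on a chain like this, the probability of one particular sequence, say sun, sun, rain, rain, rain, is just the starting probability times the transition probabilities along the way: 0.5 × 0.8 × 0.2 × 0.7 × 0.7 = 0.0392.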
2015 01:38:17,290 --> 01:38:19,930 I've here defined some starting distributions. 2016 01:38:19,930 --> 01:38:23,020 So every Markov model begins at some point in time, 2017 01:38:23,020 --> 01:38:25,120 and I need to give it some starting distribution. 2018 01:38:25,120 --> 01:38:27,078 And so we'll just say, you know what, to start, 2019 01:38:27,078 --> 01:38:29,380 you can pick 50/50 between sunny and rainy. 2020 01:38:29,380 --> 01:38:33,370 We'll say it's sunny 50% of the time, rainy 50% of the time. 2021 01:38:33,370 --> 01:38:36,430 And then down below, I've here defined the transition model, 2022 01:38:36,430 --> 01:38:39,770 how it is that I transition from one day to the next. 2023 01:38:39,770 --> 01:38:42,520 And here I've encoded that exact same matrix from before, 2024 01:38:42,520 --> 01:38:45,210 that if it was sunny today, then with probability 0.8 2025 01:38:45,210 --> 01:38:47,650 it will be sunny tomorrow, and it will be raining tomorrow 2026 01:38:47,650 --> 01:38:49,540 with probability 0.2. 2027 01:38:49,540 --> 01:38:54,540 And I likewise have another distribution for if it was raining today instead. 2028 01:38:54,540 --> 01:38:56,980 And so that alone defines the Markov model. 2029 01:38:56,980 --> 01:38:59,410 You can begin to answer questions using that model. 2030 01:38:59,410 --> 01:39:02,680 But one thing I'll just do is sample from the Markov chain. 2031 01:39:02,680 --> 01:39:06,130 And it turns out there is a method built into this Markov chain library that 2032 01:39:06,130 --> 01:39:08,440 allows me to sample 50 states from the chain, 2033 01:39:08,440 --> 01:39:13,000 basically just simulating 50 instances of weather. 2034 01:39:13,000 --> 01:39:18,290 And so let me go ahead and run this: python model.py. 2035 01:39:18,290 --> 01:39:22,570 And when I run it, what I get is it is going to sample from this Markov chain 2036 01:39:22,570 --> 01:39:26,498 50 states, 50 days' worth of weather that it's just going to randomly sample. 2037 01:39:26,498 --> 01:39:29,290 And you can imagine sampling many times to be able to get more data 2038 01:39:29,290 --> 01:39:30,820 to be able to do more analysis. 2039 01:39:30,820 --> 01:39:33,580 But here, for example, it's sunny two days 2040 01:39:33,580 --> 01:39:37,360 in a row, rainy a whole bunch of days in a row before it changes back to sun. 2041 01:39:37,360 --> 01:39:41,170 And so you get this model that follows the distribution that we originally 2042 01:39:41,170 --> 01:39:43,960 described: sunny days 2043 01:39:43,960 --> 01:39:49,780 tend to lead to more sunny days, and rainy days tend to lead to more rainy days. 2044 01:39:49,780 --> 01:39:52,060 And that, then, is the Markov model. 2045 01:39:52,060 --> 01:39:56,260 And Markov models rely on us knowing the values of these individual states. 2046 01:39:56,260 --> 01:40:00,490 I know that today is sunny or that today is rainy, and using that information, 2047 01:40:00,490 --> 01:40:04,660 I can draw some sort of inference about what tomorrow is going to be like. 2048 01:40:04,660 --> 01:40:07,130 But in practice, this often isn't the case. 2049 01:40:07,130 --> 01:40:09,310 It often isn't the case that I know for certain 2050 01:40:09,310 --> 01:40:11,620 what the exact state of the world is.
2051 01:40:11,620 --> 01:40:14,710 Oftentimes the state of the world isn't exactly known, 2052 01:40:14,710 --> 01:40:18,480 but I'm able to somehow sense some information about that state: 2053 01:40:18,480 --> 01:40:22,385 a robot or an AI doesn't have exact knowledge about the world around it, 2054 01:40:22,385 --> 01:40:24,510 but it has some sort of sensor, whether that sensor 2055 01:40:24,510 --> 01:40:27,240 is a camera or sensors that detect distance 2056 01:40:27,240 --> 01:40:30,300 or just a microphone that is sensing audio, for example. 2057 01:40:30,300 --> 01:40:33,990 It is sensing data, and that data is somehow 2058 01:40:33,990 --> 01:40:36,930 related to the state of the world even if our AI doesn't actually 2059 01:40:36,930 --> 01:40:41,100 know what the underlying true state of the world 2060 01:40:41,100 --> 01:40:42,730 actually is. 2061 01:40:42,730 --> 01:40:45,480 And for that, we need to get into the world of sensor models, 2062 01:40:45,480 --> 01:40:48,420 a way of describing how 2063 01:40:48,420 --> 01:40:51,600 the hidden state, the underlying true state of the world, 2064 01:40:51,600 --> 01:40:56,880 relates to the observation, what it is that the AI knows or the AI has access 2065 01:40:56,880 --> 01:40:58,810 to. 2066 01:40:58,810 --> 01:41:02,880 And so for example, a hidden state might be a robot's position. 2067 01:41:02,880 --> 01:41:05,650 If a robot is exploring new, uncharted territory, 2068 01:41:05,650 --> 01:41:08,580 the robot likely doesn't know exactly where it is. 2069 01:41:08,580 --> 01:41:10,000 But it does have an observation. 2070 01:41:10,000 --> 01:41:12,510 It has sensor data where it can sense 2071 01:41:12,510 --> 01:41:16,560 how far away possible obstacles around it are, and using that information, 2072 01:41:16,560 --> 01:41:19,230 using the observed information that it has, 2073 01:41:19,230 --> 01:41:22,290 it can infer something about the hidden state, 2074 01:41:22,290 --> 01:41:26,220 because what the true hidden state is influences those observations. 2075 01:41:26,220 --> 01:41:29,370 The robot's true position affects, 2076 01:41:29,370 --> 01:41:33,420 or has some effect upon, the sensor data the robot is able to collect, 2077 01:41:33,420 --> 01:41:36,330 even if the robot doesn't actually know for certain 2078 01:41:36,330 --> 01:41:39,090 what its true position is. 2079 01:41:39,090 --> 01:41:42,300 Likewise, if you think about a voice recognition or a speech recognition 2080 01:41:42,300 --> 01:41:47,600 program that listens to you and is able to respond to you, something like Alexa 2081 01:41:47,600 --> 01:41:50,830 or what Apple and Google are doing with their voice recognition as well, 2082 01:41:50,830 --> 01:41:54,090 you might imagine that the hidden state, the underlying state, 2083 01:41:54,090 --> 01:41:55,740 is what words are actually spoken. 2084 01:41:55,740 --> 01:41:58,290 The true nature of the world contains you 2085 01:41:58,290 --> 01:42:00,270 saying a particular sequence of words. 2086 01:42:00,270 --> 01:42:04,380 But your phone or your smart home device doesn't know for sure 2087 01:42:04,380 --> 01:42:05,940 exactly what words you said. 2088 01:42:05,940 --> 01:42:11,100 The only observation that the AI has access to is some audio waveforms.
2089 01:42:11,100 --> 01:42:13,710 And those audio waveforms are, of course, dependent 2090 01:42:13,710 --> 01:42:16,110 upon this hidden state, and you can infer, 2091 01:42:16,110 --> 01:42:20,520 based on those audio waveforms, what the words spoken likely were, 2092 01:42:20,520 --> 01:42:23,490 but you might not know with 100% certainty what 2093 01:42:23,490 --> 01:42:25,330 that hidden state actually is. 2094 01:42:25,330 --> 01:42:27,630 And it might be a task to try and predict: 2095 01:42:27,630 --> 01:42:30,300 given this observation, given these audio waveforms, 2096 01:42:30,300 --> 01:42:34,142 can you figure out what the actual words spoken are? 2097 01:42:34,142 --> 01:42:35,850 Likewise, you might imagine this on a website. 2098 01:42:35,850 --> 01:42:38,490 True user engagement might be information you don't directly 2099 01:42:38,490 --> 01:42:41,880 have access to, but you can observe data, like website or app 2100 01:42:41,880 --> 01:42:44,220 analytics about how often this button was clicked 2101 01:42:44,220 --> 01:42:47,220 or how often people are interacting with a page in a particular way. 2102 01:42:47,220 --> 01:42:51,190 And you can use that to infer things about your users as well. 2103 01:42:51,190 --> 01:42:54,968 So this type of problem comes up all the time when we're dealing with AI 2104 01:42:54,968 --> 01:42:56,760 and trying to infer things about the world, 2105 01:42:56,760 --> 01:43:00,750 that often the AI doesn't really know the hidden true state of the world. 2106 01:43:00,750 --> 01:43:03,930 All that the AI has access to is some observation 2107 01:43:03,930 --> 01:43:07,440 that is related to the hidden true state, but it's not direct. 2108 01:43:07,440 --> 01:43:08,790 There might be some noise there. 2109 01:43:08,790 --> 01:43:10,985 The audio waveform might have some additional noise 2110 01:43:10,985 --> 01:43:12,360 that might be difficult to parse. 2111 01:43:12,360 --> 01:43:14,910 The sensor data might not be exactly correct. 2112 01:43:14,910 --> 01:43:16,860 There's some noise that might not allow you 2113 01:43:16,860 --> 01:43:19,560 to conclude with certainty what the hidden state is, but can 2114 01:43:19,560 --> 01:43:22,100 allow you to infer what it might be. 2115 01:43:22,100 --> 01:43:24,348 And so the simple example we'll take a look at here 2116 01:43:24,348 --> 01:43:27,390 is imagining the hidden state as the weather, whether it's sunny or rainy, 2117 01:43:27,390 --> 01:43:31,530 and imagine you are programming an AI inside of a building that maybe 2118 01:43:31,530 --> 01:43:34,710 has access to just a camera inside the building, 2119 01:43:34,710 --> 01:43:37,890 and all you have access to is an observation as to 2120 01:43:37,890 --> 01:43:41,790 whether or not employees are bringing an umbrella into the building. 2121 01:43:41,790 --> 01:43:44,290 You can detect whether there's an umbrella or not, 2122 01:43:44,290 --> 01:43:47,700 and so you might have an observation as to whether or not an umbrella is 2123 01:43:47,700 --> 01:43:49,320 brought into the building. 2124 01:43:49,320 --> 01:43:51,690 And using that information, you want to predict 2125 01:43:51,690 --> 01:43:53,790 whether it's sunny or rainy, even if you don't 2126 01:43:53,790 --> 01:43:55,877 know what the underlying weather is. 2127 01:43:55,877 --> 01:43:57,960 So the underlying weather might be sunny or rainy. 2128 01:43:57,960 --> 01:44:01,462 And if it's raining, obviously people are more likely to bring an umbrella.
2129 01:44:01,462 --> 01:44:03,420 And so whether or not people bring an umbrella, 2130 01:44:03,420 --> 01:44:06,773 your observation tells you something about the hidden state. 2131 01:44:06,773 --> 01:44:08,940 And of course, this is a bit of a contrived example, 2132 01:44:08,940 --> 01:44:11,370 but the idea here is to think about this more 2133 01:44:11,370 --> 01:44:14,370 generally: any time you observe something, 2134 01:44:14,370 --> 01:44:18,025 that observation has to do with some underlying hidden state. 2135 01:44:18,025 --> 01:44:21,150 And so to try and model this type of idea where we have these hidden states 2136 01:44:21,150 --> 01:44:24,180 and observations, rather than just use a Markov model, which 2137 01:44:24,180 --> 01:44:26,160 has state, state, state, state, each of which 2138 01:44:26,160 --> 01:44:29,700 is connected by that transition matrix that we described before, 2139 01:44:29,700 --> 01:44:32,640 we're going to use what we call a hidden Markov model-- 2140 01:44:32,640 --> 01:44:34,740 very similar to a Markov model, but this is 2141 01:44:34,740 --> 01:44:37,920 going to allow us to model a system that has hidden states 2142 01:44:37,920 --> 01:44:41,520 that we don't directly observe along with some observed event 2143 01:44:41,520 --> 01:44:43,740 that we do actually see. 2144 01:44:43,740 --> 01:44:45,720 And so in addition to that transition model 2145 01:44:45,720 --> 01:44:48,780 that we still need, saying, given the underlying state of the world, 2146 01:44:48,780 --> 01:44:52,440 sunny or rainy, what's the probability of tomorrow's weather, 2147 01:44:52,440 --> 01:44:56,310 we also need another model that, given some state, is 2148 01:44:56,310 --> 01:44:58,500 going to give us an observation: green, 2149 01:44:58,500 --> 01:45:01,440 yes, someone brings an umbrella into the office, or red, 2150 01:45:01,440 --> 01:45:03,930 no, nobody brings umbrellas into the office. 2151 01:45:03,930 --> 01:45:06,772 And so the observation might be that if it's sunny, 2152 01:45:06,772 --> 01:45:09,480 then odds are nobody is going to bring an umbrella to the office. 2153 01:45:09,480 --> 01:45:11,760 But maybe some people are just being cautious 2154 01:45:11,760 --> 01:45:14,490 and they do bring an umbrella to the office anyways. 2155 01:45:14,490 --> 01:45:17,725 And if it's raining, then with much higher probability 2156 01:45:17,725 --> 01:45:20,100 people are going to bring umbrellas into the office. 2157 01:45:20,100 --> 01:45:23,280 But maybe, if the rain was unexpected, some people didn't bring an umbrella, 2158 01:45:23,280 --> 01:45:25,990 and so that outcome has some probability as well. 2159 01:45:25,990 --> 01:45:28,860 So using the observations, you can begin to predict, 2160 01:45:28,860 --> 01:45:32,070 with reasonable likelihood, what the underlying state is 2161 01:45:32,070 --> 01:45:35,440 even if you don't actually get to observe the underlying state, 2162 01:45:35,440 --> 01:45:39,030 if you don't get to see what the hidden state is actually equal to. 2163 01:45:39,030 --> 01:45:41,540 This here we'll often call the sensor model. 2164 01:45:41,540 --> 01:45:44,280 It's also often called the emission probabilities 2165 01:45:44,280 --> 01:45:48,120 because the state, the underlying state, emits some sort of emission 2166 01:45:48,120 --> 01:45:49,660 that you then observe. 2167 01:45:49,660 --> 01:45:53,220 And so that can be another way of describing that same idea.
2168 01:45:53,220 --> 01:45:55,860 And the sensor Markov assumption that we're going to use 2169 01:45:55,860 --> 01:45:59,340 is this assumption that the evidence variable, the thing we observe, 2170 01:45:59,340 --> 01:46:03,480 the emission that gets produced, depends only on the corresponding state, 2171 01:46:03,480 --> 01:46:06,620 meaning I can predict whether or not people will bring umbrellas 2172 01:46:06,620 --> 01:46:11,310 based entirely on whether it is sunny or rainy today. 2173 01:46:11,310 --> 01:46:13,950 Of course, again, this assumption might not hold in practice-- 2174 01:46:13,950 --> 01:46:15,458 in practice, 2175 01:46:15,458 --> 01:46:17,250 whether or not people bring umbrellas might 2176 01:46:17,250 --> 01:46:20,042 depend not just on today's weather, but also on yesterday's weather 2177 01:46:20,042 --> 01:46:20,910 and the day before. 2178 01:46:20,910 --> 01:46:23,100 But for simplification purposes, it can be 2179 01:46:23,100 --> 01:46:25,920 helpful to apply this sort of assumption just 2180 01:46:25,920 --> 01:46:29,130 to allow us to be able to reason about these probabilities a little more 2181 01:46:29,130 --> 01:46:30,130 easily. 2182 01:46:30,130 --> 01:46:34,770 And if we're able to approximate it, we can still often get a very good answer. 2183 01:46:34,770 --> 01:46:37,710 And so what these hidden Markov models end up looking like is a little 2184 01:46:37,710 --> 01:46:41,730 something like this, where now, rather than just have one chain of states-- 2185 01:46:41,730 --> 01:46:43,860 like, sun, sun, rain, rain, rain-- 2186 01:46:43,860 --> 01:46:49,650 we instead have this upper level, which is the underlying state of the world, 2187 01:46:49,650 --> 01:46:53,070 is it sunny or is it rainy, and those are connected by that transition 2188 01:46:53,070 --> 01:46:54,690 matrix we described before. 2189 01:46:54,690 --> 01:46:57,510 But each of these states produces an emission, 2190 01:46:57,510 --> 01:47:01,590 produces an observation that I see, that on this day it was sunny, 2191 01:47:01,590 --> 01:47:04,917 and people didn't bring umbrellas, and on this day it was sunny, 2192 01:47:04,917 --> 01:47:07,500 but people did bring umbrellas, and on this day it was raining 2193 01:47:07,500 --> 01:47:09,960 and people did bring umbrellas, and so on and so forth. 2194 01:47:09,960 --> 01:47:12,930 And so each of these underlying states, represented 2195 01:47:12,930 --> 01:47:16,740 by x sub t, for t equals 0, 1, 2, and so on and so forth, 2196 01:47:16,740 --> 01:47:19,450 produces some sort of observation or emission, 2197 01:47:19,450 --> 01:47:20,950 which is what the E stands for-- 2198 01:47:20,950 --> 01:47:25,700 E sub 0, E sub 1, E sub 2, so on and so forth. 2199 01:47:25,700 --> 01:47:28,893 And so this, too, is a way of trying to represent this idea. 2200 01:47:28,893 --> 01:47:31,560 And what you want to think about is that these underlying states 2201 01:47:31,560 --> 01:47:35,790 are the true nature of the world, the robot's position as it moves over time, 2202 01:47:35,790 --> 01:47:39,030 and that produces some sort of sensor data that might be observed, 2203 01:47:39,030 --> 01:47:41,490 or what people are actually saying, where you use 2204 01:47:41,490 --> 01:47:45,390 the emission data, the audio waveforms you detect, in order to process 2205 01:47:45,390 --> 01:47:47,330 that data and try and figure it out.
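To see how an emission model ties an observation back to the hidden state, here is a small self-contained sketch in plain Python of a single Bayes' rule update; the 0.2 and 0.9 emission probabilities are the same illustrative numbers used in the demo later on:

# Emission (sensor) model: P(observation | hidden state)
emissions = {
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

# Belief about the hidden state before seeing anything: 50/50
belief = {"sun": 0.5, "rain": 0.5}

# Observe one emission and update the belief with Bayes' rule
observation = "umbrella"
unnormalized = {state: emissions[state][observation] * p
                for state, p in belief.items()}
total = sum(unnormalized.values())
posterior = {state: p / total for state, p in unnormalized.items()}

print(posterior)  # roughly {'sun': 0.18, 'rain': 0.82}

Seeing a single umbrella moves the belief from 50/50 to roughly 18/82 in favor of rain, which is exactly the kind of inference a hidden Markov model automates across a whole sequence of days.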
2206 01:47:47,330 --> 01:47:49,830 And there are a number of possible tasks that you might want 2207 01:47:49,830 --> 01:47:52,150 to do given this kind of information. 2208 01:47:52,150 --> 01:47:54,750 And one of the simplest is trying to infer something 2209 01:47:54,750 --> 01:47:58,560 about the future or the past or about these sorts of hidden states 2210 01:47:58,560 --> 01:47:59,580 that might exist. 2211 01:47:59,580 --> 01:48:01,310 And so the tasks that you'll often see-- 2212 01:48:01,310 --> 01:48:03,893 and we're not going to go into the mathematics of these tasks, 2213 01:48:03,893 --> 01:48:07,020 but they're all based on this same idea of conditional probabilities 2214 01:48:07,020 --> 01:48:09,990 and using the probability distributions we have 2215 01:48:09,990 --> 01:48:12,180 to draw these sorts of conclusions. 2216 01:48:12,180 --> 01:48:16,410 One task is called filtering, which is, given observations from the start 2217 01:48:16,410 --> 01:48:20,310 until now, calculate the distribution for the current state, 2218 01:48:20,310 --> 01:48:23,520 meaning given information from the beginning of time 2219 01:48:23,520 --> 01:48:26,610 until now about which days people brought an umbrella 2220 01:48:26,610 --> 01:48:28,770 or didn't bring an umbrella, can I calculate 2221 01:48:28,770 --> 01:48:32,280 the probability of the current state: today, is it sunny 2222 01:48:32,280 --> 01:48:33,570 or is it raining? 2223 01:48:33,570 --> 01:48:35,670 Another task that might be possible is prediction, 2224 01:48:35,670 --> 01:48:37,320 which is looking towards the future. 2225 01:48:37,320 --> 01:48:39,690 Given observations about people bringing umbrellas 2226 01:48:39,690 --> 01:48:43,350 from the beginning of when we started counting time until now, 2227 01:48:43,350 --> 01:48:47,710 can I figure out the distribution for tomorrow: is it sunny or is it raining? 2228 01:48:47,710 --> 01:48:51,240 And you can also go backwards, as well, with smoothing, where I can say, 2229 01:48:51,240 --> 01:48:54,810 given observations from start until now, calculate the distributions 2230 01:48:54,810 --> 01:48:56,460 for some past state. 2231 01:48:56,460 --> 01:49:00,090 I know that yesterday people brought umbrellas and today people brought 2232 01:49:00,090 --> 01:49:03,780 umbrellas, and so given two days' worth of data of people bringing umbrellas, 2233 01:49:03,780 --> 01:49:06,713 what's the probability that yesterday it was raining? 2234 01:49:06,713 --> 01:49:08,880 And the fact that I know people brought umbrellas today 2235 01:49:08,880 --> 01:49:11,160 might inform that inference as well. 2236 01:49:11,160 --> 01:49:13,740 It might influence those probabilities.
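To make the filtering task just described concrete, here is a rough sketch of the forward algorithm in plain Python, using the same illustrative transition and emission numbers as the rest of this example; it tracks the distribution over the current hidden state as each day's observation arrives (this is one common convention for the forward pass, not the only one):

# Transition model: P(tomorrow's weather | today's weather)
transitions = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

# Emission model: P(observation | hidden state)
emissions = {
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

def forward(observations, prior=None):
    """Filtering: distribution over the current state given
    all observations from the start until now."""
    belief = dict(prior) if prior else {"sun": 0.5, "rain": 0.5}
    for observation in observations:
        # Predict: push the current belief through the transition model
        predicted = {
            s2: sum(belief[s1] * transitions[s1][s2] for s1 in belief)
            for s2 in belief
        }
        # Update: weight each state by its emission probability, then normalize
        unnormalized = {s: emissions[s][observation] * p
                        for s, p in predicted.items()}
        total = sum(unnormalized.values())
        belief = {s: p / total for s, p in unnormalized.items()}
    return belief

# How likely is it that today is rainy, given three days of observations?
print(forward(["umbrella", "umbrella", "no umbrella"]))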
2237 01:49:13,740 --> 01:49:17,340 And there's also a most likely explanation task, 2238 01:49:17,340 --> 01:49:19,510 in addition to other tasks that might exist as well, 2239 01:49:19,510 --> 01:49:21,750 which is, given observations 2240 01:49:21,750 --> 01:49:25,920 from the start up until now, figuring out the most likely sequence of states, 2241 01:49:25,920 --> 01:49:28,528 and this is what we're going to take a look at now, this idea 2242 01:49:28,528 --> 01:49:30,570 that if I have all these observations-- umbrella, 2243 01:49:30,570 --> 01:49:32,790 no umbrella, umbrella, no umbrella-- can I 2244 01:49:32,790 --> 01:49:36,990 calculate the most likely states of sun, rain, sun, rain, and whatnot that 2245 01:49:36,990 --> 01:49:41,610 actually represented the true weather that would produce these observations? 2246 01:49:41,610 --> 01:49:43,590 And this is quite common when you're trying 2247 01:49:43,590 --> 01:49:46,530 to do something like voice recognition, for example, that you have 2248 01:49:46,530 --> 01:49:49,830 these emissions of audio waveforms and you would like to calculate, 2249 01:49:49,830 --> 01:49:52,260 based on all of the observations that you have, 2250 01:49:52,260 --> 01:49:54,750 what is the most likely sequence of actual words 2251 01:49:54,750 --> 01:49:59,100 or syllables or sounds that the user actually made when they were speaking 2252 01:49:59,100 --> 01:50:01,230 to this particular device, or other tasks that 2253 01:50:01,230 --> 01:50:03,740 might come up in that context as well. 2254 01:50:03,740 --> 01:50:07,800 And so we can try this out by going ahead and going into the HMM 2255 01:50:07,800 --> 01:50:11,790 directory, HMM for Hidden Markov Model. 2256 01:50:11,790 --> 01:50:17,350 And here what I've done is I've defined a model where this model first 2257 01:50:17,350 --> 01:50:22,410 defines my possible states, sun and rain, along with their emission 2258 01:50:22,410 --> 01:50:25,690 probabilities, the observation model or the emission model, 2259 01:50:25,690 --> 01:50:30,310 where here, given that I know that it's sunny, the probability that I 2260 01:50:30,310 --> 01:50:32,590 see people bring an umbrella is 0.2. 2261 01:50:32,590 --> 01:50:35,470 The probability of no umbrella is 0.8. 2262 01:50:35,470 --> 01:50:37,288 And likewise, if it's raining, then people 2263 01:50:37,288 --> 01:50:38,830 are more likely to bring an umbrella. 2264 01:50:38,830 --> 01:50:40,630 Umbrella has a probability of 0.9. 2265 01:50:40,630 --> 01:50:42,580 No umbrella has a probability of 0.1. 2266 01:50:42,580 --> 01:50:47,350 So the actual underlying hidden states, those states are sun and rain. 2267 01:50:47,350 --> 01:50:50,500 But the things that I observe, the observations that I can see, 2268 01:50:50,500 --> 01:50:56,270 are either umbrella or no umbrella. 2269 01:50:56,270 --> 01:51:00,730 To this, then, I also need to add a transition matrix, same as before, 2270 01:51:00,730 --> 01:51:04,540 saying that if today is sunny, then tomorrow is more likely to be sunny, 2271 01:51:04,540 --> 01:51:07,770 and if today is rainy, then tomorrow is more likely to be raining. 2272 01:51:07,770 --> 01:51:10,130 As with before, I give it some starting probabilities, 2273 01:51:10,130 --> 01:51:14,050 saying, at first, 50/50 chance for whether it's sunny or rainy, 2274 01:51:14,050 --> 01:51:17,570 and then I can create the model based on that information.
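For reference, a sketch of what that model definition might look like, again assuming pomegranate's older 0.x API (DiscreteDistribution, HiddenMarkovModel.from_matrix, and bake are that version's names; treat the exact calls as assumptions):

import numpy
from pomegranate import DiscreteDistribution, HiddenMarkovModel

# Emission probabilities: what each hidden state looks like from the outside
sun = DiscreteDistribution({"umbrella": 0.2, "no umbrella": 0.8})
rain = DiscreteDistribution({"umbrella": 0.9, "no umbrella": 0.1})
states = [sun, rain]

# Transition matrix: P(tomorrow's weather | today's weather), same as before
transitions = numpy.array([
    [0.8, 0.2],  # from sun
    [0.3, 0.7],  # from rain
])

# Starting probabilities: 50/50 between sunny and rainy
starts = numpy.array([0.5, 0.5])

# Create the hidden Markov model from these pieces
model = HiddenMarkovModel.from_matrix(
    transitions, states, starts,
    state_names=["sun", "rain"]
)
model.bake()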
2275 01:51:17,570 --> 01:51:19,990 Again, the exact syntax of this is not so important; 2276 01:51:19,990 --> 01:51:23,770 what matters is the data that I am now encoding into a program, such 2277 01:51:23,770 --> 01:51:27,350 that now I can begin to do some inference. 2278 01:51:27,350 --> 01:51:31,270 So I can give my program, for example, a list of observations-- 2279 01:51:31,270 --> 01:51:34,420 umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth, 2280 01:51:34,420 --> 01:51:35,478 no umbrella, no umbrella. 2281 01:51:35,478 --> 01:51:37,270 And I would like to calculate, I would like 2282 01:51:37,270 --> 01:51:41,110 to figure out, the most likely explanation for these observations. 2283 01:51:41,110 --> 01:51:42,640 What is likely? 2284 01:51:42,640 --> 01:51:43,660 Was it rain the whole time? 2285 01:51:43,660 --> 01:51:46,720 Or is it more likely that the no-umbrella day was actually sunny 2286 01:51:46,720 --> 01:51:48,742 and then it switched back to being rainy? 2287 01:51:48,742 --> 01:51:50,200 And that's an interesting question. 2288 01:51:50,200 --> 01:51:52,360 We might not be sure because it might just 2289 01:51:52,360 --> 01:51:56,410 be that it just so happened on this rainy day people decided not to bring 2290 01:51:56,410 --> 01:52:00,580 an umbrella or it could be that it switched from rainy to sunny back 2291 01:52:00,580 --> 01:52:04,450 to rainy, which doesn't seem too likely, but it certainly could happen. 2292 01:52:04,450 --> 01:52:07,060 And using the data we give to the Hidden Markov Model, 2293 01:52:07,060 --> 01:52:10,620 our model can begin to predict these answers, can begin to figure it out. 2294 01:52:10,620 --> 01:52:13,750 So we're going to go ahead and just predict these observations. 2295 01:52:13,750 --> 01:52:15,750 And then for each of those predictions, go ahead 2296 01:52:15,750 --> 01:52:17,292 and print out what the prediction is. 2297 01:52:17,292 --> 01:52:19,780 And this library just so happens to have a function called 2298 01:52:19,780 --> 01:52:23,142 predict that does this prediction process for me. 2299 01:52:23,142 --> 01:52:28,270 So I run python sequence.py, and the result I get is this. 2300 01:52:28,270 --> 01:52:31,450 This is the prediction, based on the observations, of what 2301 01:52:31,450 --> 01:52:34,750 all of those states are likely to be, and it's likely to be rain, then rain. 2302 01:52:34,750 --> 01:52:36,625 In this case, it thinks that what most likely 2303 01:52:36,625 --> 01:52:39,940 happened is that it was sunny for a day and then went back to being rainy. 2304 01:52:39,940 --> 01:52:42,700 But in different situations, if it was rainy for longer, maybe, 2305 01:52:42,700 --> 01:52:44,750 or if the probabilities were slightly different, 2306 01:52:44,750 --> 01:52:48,190 you might imagine that it's more likely that it was rainy all the way through, 2307 01:52:48,190 --> 01:52:53,250 and it just so happened on one rainy day people decided not to bring umbrellas. 2308 01:52:53,250 --> 01:52:55,750 And so here, too, Python libraries can begin 2309 01:52:55,750 --> 01:52:58,730 to allow for this sort of inference procedure.
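And a sketch of what that sequence.py might look like under the same assumptions about the library, where predict is taken to return indices into model.states that can be mapped back to state names (again, an assumption about pomegranate's 0.x behavior, and the particular list of observations here is just an example):

from model import model

# Observed emissions: on which days people brought umbrellas
observations = [
    "umbrella", "umbrella", "no umbrella", "umbrella",
    "umbrella", "umbrella", "umbrella", "no umbrella", "no umbrella",
]

# Most likely underlying weather for each observed day
predictions = model.predict(observations)
for prediction in predictions:
    print(model.states[prediction].name)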
2310 01:52:58,730 --> 01:53:02,410 And by taking what we know and by putting it in terms of these tasks 2311 01:53:02,410 --> 01:53:06,310 that already exist, these general tasks that work with Hidden Markov Models, 2312 01:53:06,310 --> 01:53:10,540 any time we can take an idea and formulate it as a Hidden Markov Model, 2313 01:53:10,540 --> 01:53:12,550 formulate it as something that has hidden 2314 01:53:12,550 --> 01:53:15,700 states and observed emissions that result from those states, 2315 01:53:15,700 --> 01:53:17,830 then we can take advantage of these algorithms that 2316 01:53:17,830 --> 01:53:21,740 are known to exist for trying to do this sort of inference. 2317 01:53:21,740 --> 01:53:25,720 So now we've seen a couple of ways that AI can begin to deal with uncertainty. 2318 01:53:25,720 --> 01:53:28,840 We've taken a look at probability and how we can use probability 2319 01:53:28,840 --> 01:53:32,200 to describe numerically things that are likely or more likely or less 2320 01:53:32,200 --> 01:53:34,990 likely to happen than other events or other variables. 2321 01:53:34,990 --> 01:53:37,750 And using that information, we can begin to construct 2322 01:53:37,750 --> 01:53:40,810 these standard types of models, things like Bayesian networks 2323 01:53:40,810 --> 01:53:43,180 and Markov chains and Hidden Markov Models, 2324 01:53:43,180 --> 01:53:47,110 that all allow us to be able to describe how particular events relate 2325 01:53:47,110 --> 01:53:49,900 to other events or how the values of particular variables 2326 01:53:49,900 --> 01:53:53,050 relate to other variables, not for certain, but with some sort 2327 01:53:53,050 --> 01:53:54,550 of probability distribution. 2328 01:53:54,550 --> 01:53:57,970 And by formulating things in terms of these models that already exist, 2329 01:53:57,970 --> 01:54:00,160 we can take advantage of Python libraries 2330 01:54:00,160 --> 01:54:02,950 that implement these sorts of models already and allow us just 2331 01:54:02,950 --> 01:54:06,880 to be able to use them to produce some sort of resulting effect. 2332 01:54:06,880 --> 01:54:08,890 So all of this then allows our AI to begin 2333 01:54:08,890 --> 01:54:11,290 to deal with these sorts of uncertain problems 2334 01:54:11,290 --> 01:54:13,720 so that our AI doesn't need to know things for certain 2335 01:54:13,720 --> 01:54:17,080 but can make inferences based on information it isn't certain about. 2336 01:54:17,080 --> 01:54:19,930 Next time, we'll take a look at additional types of problems 2337 01:54:19,930 --> 01:54:22,870 that we can solve by taking advantage of AI-related algorithms 2338 01:54:22,870 --> 01:54:26,140 even beyond the world of the types of problems we've already explored. 2339 01:54:26,140 --> 01:54:28,230 We'll see you next time. 2340 01:54:28,230 --> 01:54:29,000