[MUSIC PLAYING]

BRIAN YU: All right, welcome back, everyone, to an Introduction to Artificial Intelligence with Python. Last time we took a look at how it is that AI inside of our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages, and the idea was we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to derive new pieces of information via inference, to take some information and deduce some additional conclusions based on the information that it already knew for sure.

But in reality, when we think about computers and we think about AI, very rarely are our machines going to be able to know things for sure. Oftentimes there's going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where they might believe something with some probability, as we'll soon discuss, but not entirely for certain. And we want to use the information it has some knowledge about, even if it doesn't have perfect knowledge, to still be able to make inferences, still be able to draw conclusions.

So you might imagine, for example, in the context of a robot that has some sensors and is exploring some environment, it might not know exactly where it is or exactly what's around it, but it does have access to some data that can allow it to draw inferences with some probability: there's some likelihood that one thing is true or another. Or you can imagine contexts where there is a little more randomness and uncertainty, something like predicting the weather, where you might not be able to know tomorrow's weather with 100% certainty, but you can probably infer with some probability what tomorrow's weather is going to be based on today's weather and yesterday's weather and other data that you might have access to as well. And so oftentimes we can distill this in terms of just possible events that might happen and what the likelihood of those events is.
This comes up a lot in games, for example, where there's an element of chance inside of those games. So imagine rolling a die: you're not sure exactly what the roll is going to be, but you know it's going to be one of the possibilities from one to six.

And so here, now, we introduce the idea of probability theory. What we'll take a look at today begins with the mathematical foundations of probability theory, getting an understanding of some of the key concepts within probability, and then dives into how we can use those mathematical ideas to represent models that we can put into our computers in order to program an AI that is able to use information about probability to draw inferences, to make some judgments about the world with some probability or likelihood of being true.

So probability ultimately boils down to the idea that there are possible worlds, which we represent here using the little Greek letter ω (omega). The idea of a possible world is that, when I roll a die, there are six possible worlds that could result from it: I can roll a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and each of those is a possible world, and each of those possible worlds has some probability of being true, the probability that I do roll a 1 or a 2 or a 3 or something else. We represent that probability using the capital letter P and then, in parentheses, what it is that we want the probability of. So P(ω) would be the probability of some possible world as represented by the little letter omega.

Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between zero and one, inclusive. The smallest value any probability can have is the number zero, which represents an impossible event, something like rolling a die and getting a seven. If the die only has numbers one through six, the event that I roll a seven is impossible, so it would have probability zero.
On the other end of the spectrum, probability can range all the way up to the positive number one, meaning an event is certain to happen, like rolling a die and getting a number less than 10. That is an event guaranteed to happen if the only sides on my die are one through six. And probabilities can take on any real number in between these two values, where, generally speaking, a higher value for the probability means an event is more likely to take place and a lower value means the event is less likely to take place.

The other key rule for probability looks a little bit like this. This sigma notation, if you haven't seen it before, refers to summation, the idea that we're going to be adding up a whole sequence of values. The sigma notation is going to come up a couple of times today, because as we deal with probability, oftentimes we're adding up a whole bunch of individual probabilities to get some other value. What this notation means is that if I sum over all of the possible worlds ω in big Ω, which represents the set of all the possible worlds, meaning I take every world in the set of possible worlds and add up all of their probabilities, what I ultimately get is the number one: the sum over all ω in Ω of P(ω) equals 1. So if I take all the possible worlds and add up each of their probabilities, I should get the number one at the end, meaning all the probabilities just need to sum to one.

So, for example, if you imagine I have a fair die with numbers one through six and I roll the die, each one of these rolls has an equal probability of taking place, and that probability is one over six. Each of these probabilities is between zero and one, zero meaning impossible and one meaning certain. And if you add up all of these probabilities for all of the possible worlds, you get the number one. And we can represent any one of those probabilities like this: the probability that we roll the number two, for example, is just one over six.
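Just to make those two axioms concrete, here's a minimal sketch in Python (my own illustration, not code from the course) that checks both of them for a single fair die:

```python
from fractions import Fraction

# The set of all possible worlds Omega for one fair six-sided die.
worlds = [1, 2, 3, 4, 5, 6]

# A fair die assigns every world the same probability, 1/6.
P = {w: Fraction(1, 6) for w in worlds}

# Axiom 1: every probability lies between zero and one, inclusive.
assert all(0 <= p <= 1 for p in P.values())

# Axiom 2: the probabilities of all possible worlds sum to one.
assert sum(P.values()) == 1

print(P[2])  # probability of rolling a two: 1/6
```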
For every six times we roll the die, we'd expect that one time the die might come up as a two. It's not certain, but it's a little more than nothing.

And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little more complex. Let's imagine now that we're not just dealing with a single die, but we have two dice. I have a red die here and a blue die there, and I care not just about what the individual roll is, but about the sum of the two rolls. In this case, the sum of the two rolls is the number three. How do I begin to reason about what the probability looks like if, instead of having one die, I now have two dice?

Well, we could first consider: what are all of the possible worlds? In this case, all of the possible worlds are just every combination of the red and blue die that I could come up with. The red die could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and for each of those possibilities, the blue die, likewise, could also be a 1 or a 2 or a 3 or a 4 or a 5 or a 6.

And it just so happens that, in this particular case, each of these possible combinations is equally likely. That's not always going to be the case. As you imagine more complex models that we could try to build to represent the real world, it's probably not going to be the case that every single possible world is equally likely. But in the case of fair dice, where on any given roll any one number has just as good a chance of coming up as any other, we can consider all of these possible worlds to be equally likely.

But even though all of the possible worlds are equally likely, that doesn't necessarily mean that their sums are equally likely. So if we consider the sum for each of these pairs: 1 plus 1 is a 2,
2 plus 1 is a 3, and so on. If we consider for each of these possible pairs of numbers what their sum ultimately is, we can notice that there are some patterns here, and it's not the case that every sum comes up equally often. Consider seven, for example: what's the probability that when I roll two dice their sum is seven? There are several ways this can happen. There are six possible worlds where the sum is seven: it could be a one and a six, or a two and a five, or a three and a four, a four and a three, and so forth. But if you instead consider the probability that I roll two dice and the sum of those two rolls is 12, well, looking at this diagram, there's only one possible world in which that can happen, and that's the possible world where the red die and the blue die both come up as sixes to give us the sum total of 12.

So based on just taking a look at this diagram, we see that some of these probabilities are different. The probability that the sum is a seven must be greater than the probability that the sum is a 12. And we can represent that more formally by saying, OK, the probability that we sum to 12 is one out of 36. Out of the 36 equally likely possible worlds (six squared, because we have six options for the red die and six options for the blue die), only one of them sums to 12. Whereas, on the other hand, for the probability that the two dice sum up to the number seven, out of those 36 possible worlds there were six worlds where the sum was seven, and so we get six over 36, which we can simplify as a fraction to just one over six.

So here, now, we're able to represent these different ideas of probability, representing some events that might be more likely and other events that are less likely as well. And these sorts of judgments, where we're figuring out, just in the abstract, the probability that some event takes place, are generally known as unconditional probabilities: some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence, without knowing any additional information.
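If you wanted to verify those dice-sum numbers, a short sketch (again my own illustration) can enumerate all 36 possible worlds and count:

```python
from fractions import Fraction
from itertools import product

# Every possible world: a (red, blue) pair of rolls.
worlds = list(product(range(1, 7), repeat=2))
assert len(worlds) == 36  # six options for red times six for blue

def p_sum(target):
    """Unconditional probability that the two rolls sum to target."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6, since six of the 36 worlds sum to seven
print(p_sum(12))  # 1/36, since only (6, 6) sums to twelve
```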
For example: if I roll a die, what's the chance it comes up as a two? Or if I roll two dice, what's the chance that the sum of those two rolls is a seven?

But usually when we're thinking about probability, especially when we're thinking about training an AI to intelligently know something about the world and make predictions based on that information, it's not unconditional probability that our AI is dealing with but, rather, conditional probability: probability where, rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works. So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us.

So what does this look like? Well, in terms of notation, we're going to represent conditional probability as the probability of a, then a vertical bar, and then b: P(a | b). The way to read this is that the thing on the left-hand side of the vertical bar is what we want the probability of. Here, I want the probability that a is true, that it is the event that actually does take place. On the right side of the vertical bar is our evidence, the information that we already know for certain about the world: for example, that b is true. So the way to read this entire expression is: what is the probability of a given b, the probability that a is true given that we already know that b is true?

And this type of judgment, conditional probability, the probability of one thing given some other fact, comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract: what is the chance that today it rains? But usually we have some additional evidence.
I know for certain that it rained yesterday, so I would like to calculate the probability that it rains today given that I know that it rained yesterday. Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic conditions: whether or not traffic conditions change might change the probability that this route is actually the optimal route. Or, in a medical context, I might want to know the probability that a patient has a particular disease given the results of some tests that have been performed on that patient. I have some evidence, the results of that test, and I would like to know the probability that the patient has a particular disease.

So this notion of conditional probability comes up everywhere as we begin to think about what we would like to reason about, reasoning a little more intelligently by taking into account evidence that we already have. We're more able to get an accurate result for the likelihood that someone has this disease if we know the results of the test, as opposed to just calculating the unconditional probability that they have the disease, without any evidence to back up our result one way or the other.

So now that we've got this idea of what conditional probability is, the next question is: how do we calculate it? How do we figure out, mathematically, if I have an expression like this, how do I get a number from it? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this: the probability of a given b, the probability that a is true given that we know that b is true, is equal to a fraction, the probability that a and b are both true divided by just the probability that b is true.
And the way to intuitively think about this is that if I want to know the probability that a is true given that b is true, I want to consider all the ways they could both be true, but the only worlds that I care about are the worlds where b is already true. I can ignore all the cases where b isn't true, because those aren't relevant to my ultimate computation; they're not relevant to what I want to get information about.

So let's take a look at an example. Let's go back to rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that two dice sum to 12 is one out of 36, because out of the 36 possible worlds, in only one of them is the sum of those two dice 12: it's only when red is six and blue is also six.

But let's say now that I have some additional information, and I want to know: what is the probability that the two dice sum to 12 given that I know the red die was a six? So I already have some evidence. I already know the red die is a six. I don't know what the blue die is; that information isn't given to me in this expression. But given the fact that I know the red die rolled a six, what is the probability that we sum to 12?

And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities, all of the possible combinations of the red die being one through six and the blue die being one through six. I might consider, first, the probability of my evidence, my b variable: what is the probability that the red die is a six? Well, that probability is just one out of six. So those worlds, one out of six of them, are really the only worlds I care about here now. All the rest are irrelevant to my calculation, because I already have this evidence that the red die was a six, so I don't need to care about all of the other possibilities that could result.
So now, in addition to the probability that the red die rolled a six, the other piece of information I need in order to calculate this conditional probability is the probability that both of my events, a and b, are true: the probability that the red die is a six and the two dice sum to 12. Well, that only happens in one of these 36 cases, the case where both the red and the blue die are equal to six. And so this probability is equal to one over 36.

And so to get the conditional probability that the sum is 12 given that I know that the red die is equal to six, I just divide these two values, and 1/36 divided by 1/6 gives us the probability of 1/6. Given that I know that the red die rolled a value of six, the probability that the sum of the two dice is 12 is also one over six. And that probably makes intuitive sense, too, because if the red die is a six, the only way for me to get to a 12 is if the blue die also rolls a six, and we know that the probability of the blue die rolling a six is one over six.

So in this case, the conditional probability seems fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both of the events take place is an idea that's going to come up again and again. This is the definition, now, of conditional probability, and we're going to use that definition as we think about probability more generally to be able to draw conclusions about the world.
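Here's that definition as a quick sketch (an illustration of mine, not the lecture's code), computing the same conditional probability by enumeration:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # all (red, blue) rolls

# P(b): the probability that the red die is a six.
b_worlds = [w for w in worlds if w[0] == 6]
p_b = Fraction(len(b_worlds), len(worlds))      # 1/6

# P(a and b): the red die is a six AND the two dice sum to 12.
ab_worlds = [w for w in worlds if w[0] == 6 and sum(w) == 12]
p_ab = Fraction(len(ab_worlds), len(worlds))    # 1/36

# The definition: P(a | b) = P(a and b) / P(b).
print(p_ab / p_b)  # 1/6
```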
This, again, is that formula: the probability of a given b is equal to the probability that a and b take place divided by the probability of b. And you'll see this formula sometimes written in a couple of different ways. You could imagine, algebraically, multiplying both sides of this equation by the probability of b to get rid of the fraction, and you'll get an expression like this: the probability of a and b is just the probability of b times the probability of a given b. Or, since a and b in this expression are interchangeable (a and b is the same thing as b and a), you could equivalently represent the probability of a and b as the probability of a times the probability of b given a, just switching all of the a's and b's. These three are all equivalent ways of representing what joint probability means. And so you'll sometimes see all of these equations, and they might be useful to you as you begin to reason about probability and to think about what values might be taking place in the real world.

Now, sometimes when we deal with probability, we don't just care about a Boolean event: did this happen or did this not happen? Sometimes we might want the ability to represent variable values in a probability space, where some variable might take on multiple different possible values. In probability theory, we call such a variable a random variable. A random variable is just some variable in probability theory that has some domain of values that it can take on.

So what do I mean by this? Well, I might have a random variable that is just called Roll, for example, that has six possible values. Roll is my variable, and the possible values, the domain of values that it can take on, are 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same, but for other random variables that might not be the case. For example, I might have a random variable to represent the weather, where the domain of values it could take on are things like sun or cloudy or rainy or windy or snowy, and each of those might have a different probability. I care about knowing the probability that the weather equals sun, or that the weather equals clouds, for instance, and I might like to do some mathematical calculations based on that information. Other random variables might be something like traffic.
What are the odds that there is no traffic, or light traffic, or heavy traffic? Traffic, in this case, is my random variable, and the values that it can take on are none, light, or heavy. And I, the person doing these calculations, the person encoding these random variables into my computer, need to make the decision as to what these possible values actually are.

You might imagine, for example, that if I care about whether or not I make it to a flight on time, my flight has a couple of possible values that it could take on. My flight could be on time, my flight could be delayed, or my flight could be canceled. So Flight, in this case, is my random variable, and those are the values it can take on.

And often I'll want to know the probability that my random variable takes on each of those possible values. This is what we then call a probability distribution. A probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, my probability distribution might look something like this. My probability distribution says the probability that the random variable Flight is equal to the value on time is 0.6, or, put into more human-friendly terms, the likelihood that my flight is on time is 60%. In this case, the probability that my flight is delayed is 30%, and the probability that my flight is canceled is 10%, or 0.1. And if you sum up all of these possible values, the sum is going to be 1: take all of the possible worlds, here my three possible worlds for the value of the random variable Flight, add them all up together, and the result needs to be the number one, per that axiom of probability theory that we discussed before. So this is one way of representing the probability distribution for the random variable Flight.
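One way we might encode that distribution in Python (a hypothetical sketch of my own; the name flight and the dictionary layout are not from the course) looks like this:

```python
# Probability distribution for the random variable Flight,
# mapping each value in its domain to a probability.
flight = {
    "on time": 0.6,
    "delayed": 0.3,
    "canceled": 0.1,
}

# The probabilities over the whole domain must sum to one
# (compared with a tolerance, since these are floats).
assert abs(sum(flight.values()) - 1.0) < 1e-9

print(flight["on time"])  # 0.6
```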
Sometimes, though, you'll see it represented a little more concisely, since this is pretty verbose for really just trying to express three possible values. Often you'll instead see this same idea represented using a vector. All a vector is is a sequence of values: as opposed to just a single value, I might have multiple values. And so I could instead represent this idea this way: bold P, a larger P, generally meaning the probability distribution of this variable Flight, is equal to this vector represented in angle brackets. The probability distribution is 0.6, 0.3, and 0.1, and I would just have to know that this probability distribution is in the order of on time, delayed, and canceled to know how to interpret the vector: the first value in the vector is the probability that my flight is on time, the second value is the probability that my flight is delayed, and the third value is the probability that my flight is canceled.

And so this is just an alternate, more succinct way of representing the same idea. Oftentimes you'll see us just talk about a probability distribution over a random variable, and whenever we talk about that, what we're really doing is trying to figure out the probabilities of each of the possible values that that random variable can take on. This notation is just a little more succinct, even though it can sometimes be a little confusing depending on the context in which you see it. So we'll start to look at examples where we use this sort of notation to describe probability and to describe events that might take place.

A couple of other important ideas to know with regard to probability theory: one is this idea of independence. Independence refers to the idea that the knowledge of one event doesn't influence the probability of another event. So, for example, in the context of my two dice rolls, where I had the red die and the blue die, those two events, the red die's roll and the blue die's roll, are independent. Knowing the result of the red die doesn't change the probabilities for the blue die. It doesn't give me any additional information about what the value of the blue die is ultimately going to be.
But that's not always going to be the case. You might imagine that in the case of weather, something like clouds and rain, those are probably not independent: if it is cloudy, that might increase the probability that later in the day it's going to rain. So some information informs some other event or some other random variable. Independence refers to the idea that one event doesn't influence the other, and if they're not independent, then there might be some relationship.

So mathematically, formally, what does independence actually mean? Well, recall this formula from before: the probability of a and b is the probability of a times the probability of b given a. The intuitive way to think about this is that to know how likely it is that a and b both happen, let's first figure out the likelihood that a happens, and then, given that we know that a happens, figure out the likelihood that b happens, and multiply those two things together.

But if a and b were independent, meaning knowing a doesn't change anything about the likelihood that b is true, then the probability of b given a, meaning the probability that b is true given that I know a is true, shouldn't really depend on a at all: a shouldn't influence b. So the probability of b given a is really just the probability of b, if it is true that a and b are independent. And so this gives us one definition of what it means for a and b to be independent: the probability of a and b is just the probability of a times the probability of b. Any time you find two events a and b where this relationship holds, you can say that a and b are independent.

So an example of that might be the dice that we were taking a look at before. Here, if I wanted the probability of red being a six and blue being a six, well, that's just the probability that red is a six multiplied by the probability that blue is a six: one over six times one over six, which is equal to one over 36. So I can say that these two events are independent.
So this, for example, has a probability of one over 36, as we talked about before. But what wouldn't be independent would be a case like this: the probability that the red die rolls a six and the red die rolls a four. If you just naively took red die six, red die four, you might imagine the naive approach is to say, well, each of these has a probability of one over six, so multiply them together and the probability is one over 36. But, of course, if you're only rolling the red die once, there's no way you could get two different values for the red die. It couldn't both be a six and a four, so the probability should be zero. If you were to multiply the probability of red six times the probability of red four, that would equal one over 36, but that's not right, because we know there is no way, probability zero, that when we roll the red die once we get both a six and a four; only one of those possibilities can actually be the result.

And so we can say that the event that the red roll is six and the event that the red roll is four are not independent. If I know that the red roll is a six, I know that the red roll cannot possibly be a four, so these things are not independent. Instead, if I wanted to calculate the probability, I would need to use conditional probability, as in the regular definition of the probability of two events taking place. And the probability of this, now: the probability of the red roll being a six, that's one over six. But what's the probability that the roll is a four given that the roll is a six? Well, that's just zero, because there's no way for the red roll to be a four given that we already know the red roll is a six. And so, if we do all that multiplication, we get the number zero.

So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way.
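Both of those cases, the independent pair and the non-independent pair, can be checked with a small sketch (again my own illustration, not the lecture's code):

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # all (red, blue) rolls

def prob(event):
    """Probability that a predicate over (red, blue) worlds holds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

# Independent: red is six, blue is six.
p_red6 = prob(lambda w: w[0] == 6)                 # 1/6
p_blue6 = prob(lambda w: w[1] == 6)                # 1/6
p_both = prob(lambda w: w[0] == 6 and w[1] == 6)   # 1/36
assert p_both == p_red6 * p_blue6  # the independence relationship holds

# Not independent: red is six AND red is four (on one roll).
p_red4 = prob(lambda w: w[0] == 4)                      # 1/6
p_impossible = prob(lambda w: w[0] == 6 and w[0] == 4)  # 0
assert p_impossible != p_red6 * p_red4  # the naive product 1/36 is wrong
```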
And this gets us to one of the most important rules in probability theory, which is known as Bayes' rule. It turns out that just using the information we've already learned about probability and applying a little bit of algebra, we can actually derive Bayes' rule for ourselves. And it's a very important rule when it comes to inference and thinking about probability in the context of what a computer, or a mathematician, can do by having access to information about probability.

So let's go back to these equations to derive Bayes' rule ourselves. We know the probability of a and b, the likelihood that a and b both take place, is the likelihood of b times the likelihood of a given that we know that b is already true. And likewise, the probability of a and b is the probability of a times the probability of b given that we know that a is already true. This is a symmetric relationship where the order doesn't matter: a and b means the same thing as b and a. So in these equations, we can just swap a and b to represent the exact same idea.

So we know that these two equations are already true; we've seen that already. Now let's just do a little bit of algebraic manipulation. Both of these expressions on the right-hand side are equal to the probability of a and b. So what I can do is take those two expressions and set them equal to each other: if they're both equal to the probability of a and b, then they must be equal to each other. So the probability of a times the probability of b given a is equal to the probability of b times the probability of a given b. And now all we're going to do is a little bit of division: I divide both sides by P(a), and now I get what is Bayes' rule. The probability of b given a is equal to the probability of b times the probability of a given b divided by the probability of a.
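Written out compactly, that derivation is:

```latex
P(a \wedge b) = P(b)\,P(a \mid b)
  \quad\text{and}\quad
P(a \wedge b) = P(a)\,P(b \mid a)
\;\Longrightarrow\;
P(b \mid a) = \frac{P(b)\,P(a \mid b)}{P(a)}
```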
Sometimes in Bayes' rule you'll see the order of the two terms in the numerator switched: instead of the probability of b times the probability of a given b, it'll be the probability of a given b times the probability of b. That ultimately doesn't matter, because in multiplication you can switch the order of the two things you're multiplying and it doesn't change the result. But this here is the most common formulation of Bayes' rule: the probability of b given a is equal to the probability of a given b times the probability of b, divided by the probability of a.

And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability, the conditional probability of b given a, using knowledge about the probability of a given b, the reverse of that conditional probability. So let's first do a little bit of an example with this, just to see how we might use it, and then explore what it means a little more generally.

We're going to construct a situation where I have some information. There are two events that I care about: the idea that it's cloudy in the morning, and the idea that it is rainy in the afternoon. Those are two different possible events that could take place: cloudy in the AM, rainy in the PM. And what I care about is: given clouds in the morning, what is the probability of rain in the afternoon? That's a reasonable question I might ask. In the morning, I look outside, or an AI's camera looks outside, and sees that there are clouds, and we want to figure out the probability that in the afternoon there is going to be rain. Of course, in the abstract, we don't have direct access to this kind of information, but we can use data to begin to try to figure it out.

So let's imagine, now, that I have access to some pieces of information. I know that 80% of rainy afternoons start out with a cloudy morning, and you might imagine that I could have gathered this data just by looking at observations over a stretch of time: 80% of the time, when it's raining in the afternoon, it was cloudy that morning. I also know that 40% of days have cloudy mornings, and I also know that 10% of days have rainy afternoons.
And now, using this information, I would like to figure out: given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds, and I can do that, in particular, using this fact: if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So, using the reverse conditional probability, I can figure that out.

Expressed in terms of Bayes' rule, this is what that looks like: the probability of rain given clouds is the probability of clouds given rain, times the probability of rain, divided by the probability of clouds. Here I'm just substituting in for the values of a and b from Bayes' rule before. And then I can just do the math. I have this information: I know that 80% of the time, if it was raining, there were clouds in the morning, so 0.8 here; the probability of rain is 0.1, because 10% of days have rainy afternoons; and the probability of clouds is 0.4, because 40% of days have cloudy mornings. I do the math and I can figure out the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case.

And this, now, is an application of Bayes' rule: the idea that using one conditional probability, we can get the reverse conditional probability. This is often useful when one of the conditional probabilities might be easier for us to know about or easier for us to have data about, and using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, more generally, if we know the probability of some visible effect, some effect that we can see and observe, given some unknown cause that we're not sure about, then we can calculate the probability of that unknown cause given the visible effect.
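That clouds-and-rain calculation is tiny in code. Here's a sketch of mine, with the lecture's three numbers plugged in:

```python
from fractions import Fraction

def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

p_clouds_given_rain = Fraction(8, 10)  # 80% of rainy afternoons start cloudy
p_rain = Fraction(1, 10)               # 10% of days have rainy afternoons
p_clouds = Fraction(4, 10)             # 40% of days have cloudy mornings

# P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
print(bayes(p_clouds_given_rain, p_rain, p_clouds))  # 1/5, i.e. 0.2
```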
657 00:32:30,520 --> 00:32:32,520 Well, in the context of medicine, for example, 658 00:32:32,520 --> 00:32:37,440 I might know the probability of some medical test result given a disease. 659 00:32:37,440 --> 00:32:41,520 Like, I know that if someone has a disease, then x percent of the time 660 00:32:41,520 --> 00:32:44,340 the medical test result will show up as this, for instance. 661 00:32:44,340 --> 00:32:47,100 And using that information, then I can calculate, 662 00:32:47,100 --> 00:32:50,430 given that I know the medical test 663 00:32:50,430 --> 00:32:53,590 result, what is the likelihood that someone has the disease? 664 00:32:53,590 --> 00:32:56,970 This is the piece of information that is usually easier to know, easier 665 00:32:56,970 --> 00:32:59,130 to immediately have access to data for. 666 00:32:59,130 --> 00:33:02,687 And this is the information that I actually want to calculate. 667 00:33:02,687 --> 00:33:04,270 Or I might want to know, for example-- 668 00:33:04,270 --> 00:33:08,400 I might know that some percentage of counterfeit bills 669 00:33:08,400 --> 00:33:11,670 have blurry text around the edges, because counterfeit printers 670 00:33:11,670 --> 00:33:13,950 aren't nearly as good at printing text precisely. 671 00:33:13,950 --> 00:33:16,380 So I have some information about given that something 672 00:33:16,380 --> 00:33:20,550 is a counterfeit bill, x percent of counterfeit bills have blurry text, 673 00:33:20,550 --> 00:33:21,510 for example. 674 00:33:21,510 --> 00:33:24,840 And using that information, then I can calculate some piece of information 675 00:33:24,840 --> 00:33:27,360 that I might want to know, like, given that I 676 00:33:27,360 --> 00:33:31,980 know there's blurry text on a bill, what is the probability that that bill is 677 00:33:31,980 --> 00:33:32,580 counterfeit? 678 00:33:32,580 --> 00:33:34,980 So given one conditional probability, I can 679 00:33:34,980 --> 00:33:39,363 calculate the other conditional probability as well. 680 00:33:39,363 --> 00:33:41,280 And so now we've taken a look at a couple 681 00:33:41,280 --> 00:33:42,990 of different types of probability. 682 00:33:42,990 --> 00:33:45,210 We've looked at unconditional probability 683 00:33:45,210 --> 00:33:48,300 where I just look at what is the probability of this event occurring 684 00:33:48,300 --> 00:33:51,390 given no additional evidence that I might have access to, 685 00:33:51,390 --> 00:33:53,940 and we've also looked at conditional probability 686 00:33:53,940 --> 00:33:57,570 where I have some sort of evidence, and I would like to, using that evidence, 687 00:33:57,570 --> 00:34:00,847 be able to calculate some other probability as well. 688 00:34:00,847 --> 00:34:03,930 The other kind of probability that will be important for us to think about 689 00:34:03,930 --> 00:34:06,360 is joint probability, and this is when we're 690 00:34:06,360 --> 00:34:11,250 considering the likelihood of multiple different events simultaneously. 691 00:34:11,250 --> 00:34:12,580 And so what do we mean by this? 692 00:34:12,580 --> 00:34:15,534 Well, for example, I might have probability distributions 693 00:34:15,534 --> 00:34:18,659 that look a little something like this, like I want to know the probability 694 00:34:18,659 --> 00:34:22,800 distribution of clouds in the morning, and that distribution looks like this.
695 00:34:22,800 --> 00:34:26,460 40% of the time, C, which is my random variable here, 696 00:34:26,460 --> 00:34:31,060 is equal to cloudy, and 60% of the time it's not cloudy. 697 00:34:31,060 --> 00:34:33,420 So here is just a simple probability distribution 698 00:34:33,420 --> 00:34:37,710 that is effectively telling me that 40% of the time it's cloudy. 699 00:34:37,710 --> 00:34:41,219 I might also have a probability distribution for rain in the afternoon 700 00:34:41,219 --> 00:34:44,670 where 10% of the time, or with probability 0.1, 701 00:34:44,670 --> 00:34:48,600 it is raining in the afternoon and with probability 0.9 702 00:34:48,600 --> 00:34:51,090 it is not raining in the afternoon. 703 00:34:51,090 --> 00:34:54,580 And using just these two pieces of information, 704 00:34:54,580 --> 00:34:57,540 I don't actually have a whole lot of information about how these two 705 00:34:57,540 --> 00:34:59,980 variables relate to each other. 706 00:34:59,980 --> 00:35:02,940 But I could if I had access to their joint probability, 707 00:35:02,940 --> 00:35:05,550 meaning for every combination of these two things-- 708 00:35:05,550 --> 00:35:09,330 meaning morning cloudy and afternoon rain, morning cloudy and afternoon 709 00:35:09,330 --> 00:35:12,960 not rain, morning not cloudy and afternoon rain, and morning 710 00:35:12,960 --> 00:35:15,150 not cloudy and afternoon not raining-- 711 00:35:15,150 --> 00:35:17,700 if I had access to values for each of those four, 712 00:35:17,700 --> 00:35:20,340 I'd have more information-- so information that'd 713 00:35:20,340 --> 00:35:22,390 be organized in a table like this. 714 00:35:22,390 --> 00:35:25,690 And this, rather than just a probability distribution, 715 00:35:25,690 --> 00:35:27,970 is a joint probability distribution. 716 00:35:27,970 --> 00:35:31,090 It tells me the probability distribution of each 717 00:35:31,090 --> 00:35:34,930 of the possible combinations of values that these random variables 718 00:35:34,930 --> 00:35:36,160 can take on. 719 00:35:36,160 --> 00:35:39,640 So if I want to know, what is the probability that on any given day 720 00:35:39,640 --> 00:35:42,400 it is both cloudy and rainy, well, I would say, 721 00:35:42,400 --> 00:35:45,100 all right, we're looking at cases where it is cloudy 722 00:35:45,100 --> 00:35:48,460 and cases where it is raining and the intersection of those two, 723 00:35:48,460 --> 00:35:51,310 that row and that column, is 0.08. 724 00:35:51,310 --> 00:35:55,210 So that is the probability that it is both cloudy and rainy 725 00:35:55,210 --> 00:35:57,070 using that information. 726 00:35:57,070 --> 00:36:00,010 And using this table, 727 00:36:00,010 --> 00:36:02,260 this joint probability table, I can 728 00:36:02,260 --> 00:36:04,930 begin to draw other pieces of information 729 00:36:04,930 --> 00:36:07,420 about things like conditional probability. 730 00:36:07,420 --> 00:36:11,890 So I might ask a question like, what is the probability distribution of clouds 731 00:36:11,890 --> 00:36:14,470 given that I know that it is raining, meaning 732 00:36:14,470 --> 00:36:16,660 I know for sure that it's raining. 733 00:36:16,660 --> 00:36:19,780 Tell me the probability distribution over whether it's cloudy 734 00:36:19,780 --> 00:36:22,720 or not given that I know already that it is, in fact, raining. 735 00:36:22,720 --> 00:36:25,480 And here I'm using C to stand for that random variable.
736 00:36:25,480 --> 00:36:28,030 I'm looking for a distribution, meaning the answer to this 737 00:36:28,030 --> 00:36:29,860 is not going to be a single value. 738 00:36:29,860 --> 00:36:33,760 It's going to be two values, a vector of two values where the first value is 739 00:36:33,760 --> 00:36:37,960 probability of clouds, the second value is probability that it is not cloudy, 740 00:36:37,960 --> 00:36:40,240 but the sum of those two values is going to be one, 741 00:36:40,240 --> 00:36:42,470 because when you add up the probabilities of all 742 00:36:42,470 --> 00:36:47,190 of the possible worlds, the result that you get must be the number one. 743 00:36:47,190 --> 00:36:50,740 And, well, what do we know about how to calculate a conditional probability? 744 00:36:50,740 --> 00:36:56,590 Well, we know that the probability of a given b is the probability of a and b 745 00:36:56,590 --> 00:36:59,320 divided by the probability of b. 746 00:36:59,320 --> 00:37:00,740 So what does this mean? 747 00:37:00,740 --> 00:37:03,610 Well, it means that I can calculate the probability of clouds 748 00:37:03,610 --> 00:37:08,260 given that it's raining as the probability of clouds 749 00:37:08,260 --> 00:37:11,230 and raining divided by the probability of rain. 750 00:37:11,230 --> 00:37:15,220 And this comma here for the probability distribution of clouds and rain, 751 00:37:15,220 --> 00:37:17,710 this comma sort of stands in for the word "and." 752 00:37:17,710 --> 00:37:21,460 You'll sort of see the logical operator AND and the comma used interchangeably. 753 00:37:21,460 --> 00:37:24,550 This means the probability distribution over the clouds 754 00:37:24,550 --> 00:37:29,382 and knowing the fact that it is raining divided by the probability of rain. 755 00:37:29,382 --> 00:37:31,840 And the interesting thing to note here and what we'll often 756 00:37:31,840 --> 00:37:34,210 do in order to simplify our mathematics is 757 00:37:34,210 --> 00:37:38,260 that dividing by the probability of rain, the probability of rain 758 00:37:38,260 --> 00:37:40,150 here is just some numerical constant. 759 00:37:40,150 --> 00:37:40,900 It is some number. 760 00:37:40,900 --> 00:37:43,780 Dividing by probability of rain is just dividing 761 00:37:43,780 --> 00:37:46,090 by some constant or, in other words, multiplying 762 00:37:46,090 --> 00:37:48,100 by the inverse of that constant. 763 00:37:48,100 --> 00:37:50,620 And it turns out that oftentimes we can just 764 00:37:50,620 --> 00:37:53,230 not worry about what the exact value of this is 765 00:37:53,230 --> 00:37:56,370 and just know that it is, in fact, a constant value, 766 00:37:56,370 --> 00:37:57,620 and we'll see why in a moment. 767 00:37:57,620 --> 00:38:01,390 So instead of expressing this as this joint probability divided 768 00:38:01,390 --> 00:38:06,790 by the probability of rain, sometimes we'll just represent it as alpha times 769 00:38:06,790 --> 00:38:10,830 the numerator here, the probability distribution of C, this variable, 770 00:38:10,830 --> 00:38:13,370 and that we know that it is raining, for instance. 771 00:38:13,370 --> 00:38:16,600 So all we've done here is said this value of one 772 00:38:16,600 --> 00:38:19,840 over the probability of rain, that's really just a constant that we're 773 00:38:19,840 --> 00:38:23,140 going to divide by or equivalently multiply by the inverse of at the end. 774 00:38:23,140 --> 00:38:26,770 We'll just call it alpha for now and deal with it a little bit later. 
775 00:38:26,770 --> 00:38:30,130 But the key idea here now-- and this is an idea that's going to come up again-- 776 00:38:30,130 --> 00:38:34,390 is that the conditional distribution of C given rain 777 00:38:34,390 --> 00:38:38,200 is proportional to, meaning just some factor multiplied by, 778 00:38:38,200 --> 00:38:42,580 the joint probability of C and rain being true. 779 00:38:42,580 --> 00:38:44,030 And so how do we figure this out? 780 00:38:44,030 --> 00:38:46,720 Well, this is going to be the probability that it is cloudy 781 00:38:46,720 --> 00:38:50,200 and it's raining, which is 0.08, and the probability that it's not 782 00:38:50,200 --> 00:38:53,350 cloudy and it's raining, which is 0.02. 783 00:38:53,350 --> 00:38:55,180 And so we get alpha times-- 784 00:38:55,180 --> 00:38:58,060 here now is that probability distribution. 785 00:38:58,060 --> 00:39:00,370 0.08 is clouds and rain. 786 00:39:00,370 --> 00:39:04,210 0.02 is not cloudy and rain. 787 00:39:04,210 --> 00:39:08,260 But, of course, 0.08 and 0.02 don't sum up to the number one. 788 00:39:08,260 --> 00:39:10,780 And we know that in a probability distribution, 789 00:39:10,780 --> 00:39:13,030 if you consider all of the possible values, 790 00:39:13,030 --> 00:39:15,730 they must sum up to a probability of one. 791 00:39:15,730 --> 00:39:20,350 And so we know that we just need to figure out some constant to normalize, 792 00:39:20,350 --> 00:39:23,830 so to speak, these values, something we can multiply or divide by 793 00:39:23,830 --> 00:39:26,600 to get it so that all of these probabilities sum up to one. 794 00:39:26,600 --> 00:39:29,390 And it turns out that if we multiply both numbers by 10, 795 00:39:29,390 --> 00:39:32,290 then we can get that result of 0.8 and 0.2. 796 00:39:32,290 --> 00:39:34,990 The proportions are still equivalent, but now 0.8 797 00:39:34,990 --> 00:39:38,750 plus 0.2, those sum up to the number 1. 798 00:39:38,750 --> 00:39:41,080 So take a look at this and see if you can understand, 799 00:39:41,080 --> 00:39:43,870 step by step, how it is we're getting from one point to another. 800 00:39:43,870 --> 00:39:48,190 But the key idea here is that by using the joint probabilities, 801 00:39:48,190 --> 00:39:52,480 these probabilities that it is both cloudy and rainy and that it is not 802 00:39:52,480 --> 00:39:56,740 cloudy and rainy, I can take that information and figure out 803 00:39:56,740 --> 00:39:59,800 the conditional probability-- given that it's raining, 804 00:39:59,800 --> 00:40:02,320 what is the chance that it's cloudy versus not cloudy-- 805 00:40:02,320 --> 00:40:06,740 just by multiplying by some normalization constant, so to speak. 806 00:40:06,740 --> 00:40:08,860 And this is what a computer can begin to use 807 00:40:08,860 --> 00:40:12,130 to be able to interact with these various different types 808 00:40:12,130 --> 00:40:13,207 of probabilities. 809 00:40:13,207 --> 00:40:15,790 And it turns out there are a number of other probability rules 810 00:40:15,790 --> 00:40:19,570 that are going to be useful to us as we begin to explore how we can actually 811 00:40:19,570 --> 00:40:22,860 use this information to encode into our computers 812 00:40:22,860 --> 00:40:27,030 some more complex analysis that we might want to do about probability 813 00:40:27,030 --> 00:40:30,793 and distributions and random variables that we might be interacting with. 814 00:40:30,793 --> 00:40:33,210 So here are a couple of those important probability rules.
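First, though, here's that normalization step from just above written out as a minimal Python sketch, using the two joint probabilities from the table:

# Joint probabilities from the table: P(cloudy, rain) and P(not cloudy, rain)
joint = {"cloudy": 0.08, "not cloudy": 0.02}

# alpha is 1 divided by the sum of the values (that sum is P(rain)),
# so multiplying by alpha makes the distribution sum to one
alpha = 1 / sum(joint.values())
conditional = {c: alpha * p for c, p in joint.items()}
print(conditional)  # {'cloudy': 0.8, 'not cloudy': 0.2}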
815 00:40:33,210 --> 00:40:35,850 One of the simplest rules is just this negation rule. 816 00:40:35,850 --> 00:40:39,420 What is the probability of not event a? 817 00:40:39,420 --> 00:40:41,970 So a is an event that has some probability, 818 00:40:41,970 --> 00:40:45,840 and I would like to know, what is the probability that a does not occur? 819 00:40:45,840 --> 00:40:50,340 And it turns out it's just one minus P of a, which makes sense 820 00:40:50,340 --> 00:40:52,470 because if those are the two possible cases, 821 00:40:52,470 --> 00:40:56,770 either a happens or a doesn't happen, then when you add up those two cases, 822 00:40:56,770 --> 00:41:02,970 you must get one, which means P of not a must just be one minus P of a 823 00:41:02,970 --> 00:41:06,930 because P of a and P of not a must sum up to the number one. 824 00:41:06,930 --> 00:41:10,050 They must include all of the possible cases. 825 00:41:10,050 --> 00:41:14,010 We've seen an expression for calculating the probability of a and b. 826 00:41:14,010 --> 00:41:18,180 We might also reasonably want to calculate the probability of a or b. 827 00:41:18,180 --> 00:41:21,480 What is the probability that one thing happens or another thing happens? 828 00:41:21,480 --> 00:41:23,550 So for example, I might want to calculate, 829 00:41:23,550 --> 00:41:26,010 what is the probability that if I roll two dice, 830 00:41:26,010 --> 00:41:29,970 a red die and a blue die, what is the likelihood that a is a six or b 831 00:41:29,970 --> 00:41:31,860 is a six, one or the other? 832 00:41:31,860 --> 00:41:34,860 And what you might imagine you could do and the wrong way to approach it 833 00:41:34,860 --> 00:41:38,810 would be just to say, all right, well, a comes up as a six, 834 00:41:38,810 --> 00:41:41,727 the red die comes up as a six with probability one over six. 835 00:41:41,727 --> 00:41:42,810 The same for the blue die. 836 00:41:42,810 --> 00:41:44,070 It's also one over six. 837 00:41:44,070 --> 00:41:47,520 Add them together and you get 2/6, otherwise known as 1/3. 838 00:41:47,520 --> 00:41:50,820 But this suffers from the problem of over counting, 839 00:41:50,820 --> 00:41:54,330 that we've double counted the case where both a and b, both 840 00:41:54,330 --> 00:41:57,690 the red die and the blue die, both come up as a six roll, 841 00:41:57,690 --> 00:41:59,780 and I've counted that instance twice. 842 00:41:59,780 --> 00:42:02,070 So to resolve this, the actual expression 843 00:42:02,070 --> 00:42:05,100 for calculating the probability of a or b 844 00:42:05,100 --> 00:42:08,070 uses what we call the inclusion-exclusion formula. 845 00:42:08,070 --> 00:42:11,510 So I take the probability of a, add it to the probability of b. 846 00:42:11,510 --> 00:42:12,900 That's all same as before. 847 00:42:12,900 --> 00:42:16,440 But then I need to exclude the cases that I've double counted. 848 00:42:16,440 --> 00:42:21,930 So I subtract from that the probability of a and b, and that 849 00:42:21,930 --> 00:42:23,520 gets me the result for a or b. 850 00:42:23,520 --> 00:42:27,348 I consider all the cases where a is true and all the cases where b is true. 
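As a quick sketch of both of these rules in Python, here's the two-dice example, assuming the red and blue dice are independent so that P(a and b) is just the product of the two:

from fractions import Fraction

# Negation: P(not a) = 1 - P(a)
p_a = Fraction(1, 6)   # red die comes up six
p_b = Fraction(1, 6)   # blue die comes up six
p_not_a = 1 - p_a      # 5/6

# Inclusion-exclusion: P(a or b) = P(a) + P(b) - P(a and b)
p_a_and_b = p_a * p_b  # independent dice, so 1/36
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)        # 11/36, a bit less than the (wrong) 2/6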
851 00:42:27,348 --> 00:42:29,640 And if you imagine this is like a Venn diagram of cases 852 00:42:29,640 --> 00:42:31,830 where a is true, cases where b is true, I just 853 00:42:31,830 --> 00:42:34,500 need to subtract out the middle to get rid of the cases 854 00:42:34,500 --> 00:42:37,860 that I have over counted by double counting them inside of both 855 00:42:37,860 --> 00:42:41,520 of these individual expressions. 856 00:42:41,520 --> 00:42:43,530 One other rule that's going to be quite helpful 857 00:42:43,530 --> 00:42:45,770 is a rule called marginalization. 858 00:42:45,770 --> 00:42:47,880 So marginalization is answering the question 859 00:42:47,880 --> 00:42:52,350 of how do I figure out the probability of a using some other variable that I 860 00:42:52,350 --> 00:42:53,970 might have access to, like b? 861 00:42:53,970 --> 00:42:56,190 Even if I don't know additional information about it, 862 00:42:56,190 --> 00:43:00,270 I know that b, some event, can have two possible states. 863 00:43:00,270 --> 00:43:05,080 Either b happens or b doesn't happen, assuming it's a Boolean, true or false. 864 00:43:05,080 --> 00:43:07,500 And well, what that means is that for me to be 865 00:43:07,500 --> 00:43:11,130 able to calculate the probability of a, there are only two cases. 866 00:43:11,130 --> 00:43:15,930 Either a happens and b happens or a happens and b doesn't happen. 867 00:43:15,930 --> 00:43:19,200 And those are two disjoint cases, meaning they can't both happen together-- 868 00:43:19,200 --> 00:43:21,480 either b happens or b doesn't happen. 869 00:43:21,480 --> 00:43:23,640 They're disjoint or separate cases. 870 00:43:23,640 --> 00:43:28,140 And so I can figure out the probability of a just by adding up those two cases. 871 00:43:28,140 --> 00:43:31,770 The probability that a is true is the probability 872 00:43:31,770 --> 00:43:35,640 that a and b is true plus the probability that a is true 873 00:43:35,640 --> 00:43:36,810 and b isn't true. 874 00:43:36,810 --> 00:43:40,123 So by marginalizing, I've looked at the two possible cases 875 00:43:40,123 --> 00:43:41,040 that might take place. 876 00:43:41,040 --> 00:43:44,120 Either b happens or b doesn't happen. 877 00:43:44,120 --> 00:43:47,610 And in either of those cases, I look at, what's the probability that a happens, 878 00:43:47,610 --> 00:43:50,430 and if I add those together, well, then I get the probability 879 00:43:50,430 --> 00:43:52,710 that a happens as a whole. 880 00:43:52,710 --> 00:43:54,030 So take a look at that rule. 881 00:43:54,030 --> 00:43:57,120 It doesn't matter what b is or how it's related to a. 882 00:43:57,120 --> 00:43:59,580 So long as I know these joint distributions, 883 00:43:59,580 --> 00:44:02,280 I can figure out the overall probability of a. 884 00:44:02,280 --> 00:44:05,130 And this can be a useful way, if I have a joint distribution, 885 00:44:05,130 --> 00:44:08,550 like the joint distribution of a and b, to just figure out 886 00:44:08,550 --> 00:44:11,320 some unconditional probability, like the probability of a, 887 00:44:11,320 --> 00:44:14,520 and we'll see examples of this soon, as well. 888 00:44:14,520 --> 00:44:17,460 Now, sometimes these might not just be variables 889 00:44:17,460 --> 00:44:21,160 that are events that either happened or they didn't happen, like b is here. 890 00:44:21,160 --> 00:44:23,850 They might be some broader probability distribution where 891 00:44:23,850 --> 00:44:25,800 there are multiple possible values.
892 00:44:25,800 --> 00:44:28,710 And so here, in order to use this marginalization rule, 893 00:44:28,710 --> 00:44:34,290 I need to sum up not just over b and not b, but for all of the possible values 894 00:44:34,290 --> 00:44:36,610 that the other random variable could take on. 895 00:44:36,610 --> 00:44:39,360 And so here we'll see a version of this rule for random variables, 896 00:44:39,360 --> 00:44:41,610 and it's going to include that summation notation 897 00:44:41,610 --> 00:44:46,270 to indicate that I'm summing up, adding up, a whole bunch of individual values. 898 00:44:46,270 --> 00:44:47,092 So here's the rule. 899 00:44:47,092 --> 00:44:49,050 Looks a lot more complicated, but it's actually 900 00:44:49,050 --> 00:44:51,330 the equivalent, exactly the same rule. 901 00:44:51,330 --> 00:44:55,500 What I'm saying here is that if I have two random variables, one called x 902 00:44:55,500 --> 00:45:01,380 and one called y, well, the probability that x is equal to some value x sub i-- 903 00:45:01,380 --> 00:45:04,170 this is just some value that this variable takes on-- 904 00:45:04,170 --> 00:45:05,520 how do I figure it out? 905 00:45:05,520 --> 00:45:08,760 Well, I'm going to sum up over j, where j 906 00:45:08,760 --> 00:45:13,380 is going to range over all of the possible values that y can take on. 907 00:45:13,380 --> 00:45:18,558 Well, let's look at the probability that x equals xi and y equals yj. 908 00:45:18,558 --> 00:45:20,600 So the exact same rule-- the only difference here 909 00:45:20,600 --> 00:45:23,360 is now I'm summing up over all of the possible values 910 00:45:23,360 --> 00:45:27,420 that y can take on, saying let's add up all of those possible cases 911 00:45:27,420 --> 00:45:31,100 and look at this joint distribution, this joint probability 912 00:45:31,100 --> 00:45:35,990 that x takes on the value I care about along with each of the possible values for y. 913 00:45:35,990 --> 00:45:40,910 And if I add all those up, then I can get this unconditional probability 914 00:45:40,910 --> 00:45:46,397 of what x is equal to, whether or not x is equal to some value x sub i. 915 00:45:46,397 --> 00:45:48,230 So let's take a look at this rule because it 916 00:45:48,230 --> 00:45:49,688 does look a little bit complicated. 917 00:45:49,688 --> 00:45:51,650 Let's try and put a concrete example to it. 918 00:45:51,650 --> 00:45:54,470 Here, again, is that same joint distribution from before. 919 00:45:54,470 --> 00:45:58,460 I have cloudy, not cloudy, rainy, not rainy. 920 00:45:58,460 --> 00:46:00,830 And maybe I want to access some variable. 921 00:46:00,830 --> 00:46:04,790 I want to know, what is the probability that it is cloudy? 922 00:46:04,790 --> 00:46:08,550 Well, marginalization says that if I have this joint distribution 923 00:46:08,550 --> 00:46:12,140 and I want to know, what is the probability that it is cloudy, well, 924 00:46:12,140 --> 00:46:15,650 I need to consider the other variable, the variable that's not here, 925 00:46:15,650 --> 00:46:17,060 the idea that it's rainy. 926 00:46:17,060 --> 00:46:20,780 And I consider the two cases, either it's raining or it's not raining, 927 00:46:20,780 --> 00:46:24,410 and I just sum up the values for each of those possibilities. 928 00:46:24,410 --> 00:46:27,380 In other words, the probability that it is cloudy 929 00:46:27,380 --> 00:46:31,110 is equal to the sum of the probability that it's cloudy 930 00:46:31,110 --> 00:46:38,090 and it's raining and the probability that it's cloudy and it is not raining.
931 00:46:38,090 --> 00:46:40,460 And so these, now, are values that I have access to. 932 00:46:40,460 --> 00:46:44,840 These are values that are just inside of this joint probability table. 933 00:46:44,840 --> 00:46:47,990 What is the probability that it is both cloudy and rainy? 934 00:46:47,990 --> 00:46:51,350 Well, it's just the intersection of these two here, which is 0.08, 935 00:46:51,350 --> 00:46:54,590 and the probability that it's cloudy and not raining is-- all right, 936 00:46:54,590 --> 00:46:56,480 here's cloudy, here's not raining-- 937 00:46:56,480 --> 00:46:58,000 it's 0.32. 938 00:46:58,000 --> 00:47:02,630 So it's 0.08 plus 0.32, which just gives us 0.4. 939 00:47:02,630 --> 00:47:06,840 That is the unconditional probability that it is, in fact, cloudy. 940 00:47:06,840 --> 00:47:09,530 And so marginalization gives us a way to go 941 00:47:09,530 --> 00:47:13,360 from these joint distributions to just some individual probability 942 00:47:13,360 --> 00:47:14,430 that I might care about. 943 00:47:14,430 --> 00:47:17,222 And you'll see a little bit later why it is that we care about that 944 00:47:17,222 --> 00:47:19,370 and why that's actually useful to us as we 945 00:47:19,370 --> 00:47:21,885 begin doing some of these calculations. 946 00:47:21,885 --> 00:47:25,010 The last rule we'll take a look at before transitioning into something a little 947 00:47:25,010 --> 00:47:27,200 bit different is this rule of conditioning-- 948 00:47:27,200 --> 00:47:31,070 very similar to the marginalization rule, but it says that, again, 949 00:47:31,070 --> 00:47:32,600 if I have two events a and b-- 950 00:47:32,600 --> 00:47:35,810 but instead of having access to their joint probabilities, 951 00:47:35,810 --> 00:47:38,180 I have access to their conditional probabilities, 952 00:47:38,180 --> 00:47:39,920 how they relate to each other. 953 00:47:39,920 --> 00:47:43,700 Well, again, if I want to know the probability that a happens and I know 954 00:47:43,700 --> 00:47:47,960 that there's some other variable b, either b happens or b doesn't happen, 955 00:47:47,960 --> 00:47:50,660 and so I can say that the probability of a 956 00:47:50,660 --> 00:47:54,920 is the probability of a given b times the probability of b, 957 00:47:54,920 --> 00:47:57,470 meaning b happened, and given that I know b happened, 958 00:47:57,470 --> 00:47:59,480 what's the likelihood that a happened? 959 00:47:59,480 --> 00:48:02,480 And then I consider the other case, that b didn't happen. 960 00:48:02,480 --> 00:48:05,360 So here is the probability that b didn't happen, 961 00:48:05,360 --> 00:48:07,880 and here's the probability that a happens given 962 00:48:07,880 --> 00:48:09,890 that I know that b didn't happen. 963 00:48:09,890 --> 00:48:13,820 And this is really the equivalent rule, just using conditional probability 964 00:48:13,820 --> 00:48:16,190 instead of joint probability where I'm saying, 965 00:48:16,190 --> 00:48:19,790 let's look at both of these two cases and condition on b. 966 00:48:19,790 --> 00:48:23,480 Look at the case where b happens and look at the case where b doesn't happen 967 00:48:23,480 --> 00:48:26,560 and look at what probabilities I get as a result.
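Here's a small Python sketch of both rules using that same table; the fourth joint value, 0.58, isn't stated above but follows because all four entries must sum to one:

# The joint distribution of C (cloudy?) and R (rainy?) from the table
joint = {
    ("cloudy", "rain"): 0.08,     ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02, ("not cloudy", "no rain"): 0.58,  # inferred
}

# Marginalization: P(cloudy) = P(cloudy, rain) + P(cloudy, no rain)
p_cloudy = sum(p for (c, r), p in joint.items() if c == "cloudy")
print(p_cloudy)  # 0.4

# Conditioning gives the same answer:
# P(cloudy) = P(cloudy | rain) P(rain) + P(cloudy | no rain) P(no rain)
p_rain = 0.1
p_cloudy_given_rain = joint[("cloudy", "rain")] / p_rain            # 0.8
p_cloudy_given_no_rain = joint[("cloudy", "no rain")] / (1 - p_rain)
print(p_cloudy_given_rain * p_rain
      + p_cloudy_given_no_rain * (1 - p_rain))  # 0.4, up to floating-point rounding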
968 00:48:26,560 --> 00:48:28,598 And just as in the case of marginalization 969 00:48:28,598 --> 00:48:30,890 where there was an equivalent rule for random variables 970 00:48:30,890 --> 00:48:34,850 that could take on multiple possible values in a domain of possible values, 971 00:48:34,850 --> 00:48:37,530 here, too, conditioning has the same equivalent rule. 972 00:48:37,530 --> 00:48:41,590 Again, there's a summation to mean I'm summing over all of the possible values 973 00:48:41,590 --> 00:48:44,070 that some random variable y could take on. 974 00:48:44,070 --> 00:48:48,140 But if I want to know, what is the probability that x takes on this value, 975 00:48:48,140 --> 00:48:50,870 then I'm going to sum up over all the values j 976 00:48:50,870 --> 00:48:53,420 that y could take on and say, all right, what's 977 00:48:53,420 --> 00:48:56,870 the chance that y takes on that value, yj, and multiply it 978 00:48:56,870 --> 00:49:00,830 by the conditional probability that x takes on this value given 979 00:49:00,830 --> 00:49:03,180 that y took on that value yj-- 980 00:49:03,180 --> 00:49:06,470 so equivalent rule just using conditional probabilities 981 00:49:06,470 --> 00:49:08,120 instead of joint probabilities. 982 00:49:08,120 --> 00:49:10,790 And using the equation we know about joint probabilities, 983 00:49:10,790 --> 00:49:13,748 we can translate between these two. 984 00:49:13,748 --> 00:49:15,790 All right, we've seen a whole lot of mathematics, 985 00:49:15,790 --> 00:49:18,110 and we've just sort of laid the foundation for mathematics. 986 00:49:18,110 --> 00:49:20,777 And no need to worry if you haven't seen probability in too much 987 00:49:20,777 --> 00:49:22,370 detail up until this point. 988 00:49:22,370 --> 00:49:24,500 These are sort of the foundations of the ideas 989 00:49:24,500 --> 00:49:27,560 that are going to come up as we begin to explore how we can now 990 00:49:27,560 --> 00:49:31,820 take these ideas from probability and begin to apply them to represent 991 00:49:31,820 --> 00:49:35,120 something inside of our computer, something inside of the AI agent 992 00:49:35,120 --> 00:49:39,280 we're trying to design that is able to represent information and probabilities 993 00:49:39,280 --> 00:49:42,600 and the likelihoods between various different events. 994 00:49:42,600 --> 00:49:45,020 So there are a number of different probabilistic models 995 00:49:45,020 --> 00:49:48,290 that we can generate, but the first of the models we're going to talk about 996 00:49:48,290 --> 00:49:50,600 are what are known as Bayesian networks. 997 00:49:50,600 --> 00:49:52,670 And a Bayesian network is just going to be 998 00:49:52,670 --> 00:49:56,090 some network of random variables, connected random variables, 999 00:49:56,090 --> 00:49:58,850 that are going to represent the dependence 1000 00:49:58,850 --> 00:50:00,260 between these random variables. 1001 00:50:00,260 --> 00:50:03,498 And odds are most random variables in this world 1002 00:50:03,498 --> 00:50:05,540 are not independent from each other, that there's 1003 00:50:05,540 --> 00:50:08,840 some relationship between things that are happening that we care about. 1004 00:50:08,840 --> 00:50:12,200 If it is raining today, that might increase the likelihood 1005 00:50:12,200 --> 00:50:14,750 that my flight or my train gets delayed, for example. 
1006 00:50:14,750 --> 00:50:17,610 There is some dependence between these random variables, 1007 00:50:17,610 --> 00:50:22,420 and a Bayesian network is going to be able to capture those dependencies. 1008 00:50:22,420 --> 00:50:23,770 So what is a Bayesian network? 1009 00:50:23,770 --> 00:50:26,430 What is its actual structure, and how does it work? 1010 00:50:26,430 --> 00:50:29,230 Well, a Bayesian network is going to be a directed graph. 1011 00:50:29,230 --> 00:50:31,170 And again, we've seen directed graphs before. 1012 00:50:31,170 --> 00:50:34,170 They are individual nodes with arrows or edges 1013 00:50:34,170 --> 00:50:38,897 that connect one node to another node, pointing in a particular direction. 1014 00:50:38,897 --> 00:50:40,980 And so this directed graph is going to have nodes, 1015 00:50:40,980 --> 00:50:43,860 as well, where each node in this directed graph 1016 00:50:43,860 --> 00:50:47,850 is going to represent a random variable, something like the weather or something 1017 00:50:47,850 --> 00:50:51,340 like whether my train was on time or delayed. 1018 00:50:51,340 --> 00:50:54,780 And we're going to have an arrow from a node x to a node y 1019 00:50:54,780 --> 00:50:57,435 to mean that x is a parent of y. 1020 00:50:57,435 --> 00:50:58,560 So that'll be our notation. 1021 00:50:58,560 --> 00:51:02,940 If there's an arrow from x to y, x is going to be considered a parent of y. 1022 00:51:02,940 --> 00:51:06,360 And the reason that's important is because each of these nodes 1023 00:51:06,360 --> 00:51:09,180 is going to have a probability distribution that we're 1024 00:51:09,180 --> 00:51:13,140 going to store along with it, which is the distribution of x given 1025 00:51:13,140 --> 00:51:16,520 some evidence, given the parents of x. 1026 00:51:16,520 --> 00:51:18,480 So the way to more intuitively think about this 1027 00:51:18,480 --> 00:51:22,260 is the parents are going to be thought of as sort of causes for some effect 1028 00:51:22,260 --> 00:51:24,720 that we're going to observe. 1029 00:51:24,720 --> 00:51:27,780 And so let's take a look at an actual example of a Bayesian network 1030 00:51:27,780 --> 00:51:30,270 and think about the types of logic that might be involved 1031 00:51:30,270 --> 00:51:32,070 in reasoning about that network. 1032 00:51:32,070 --> 00:51:35,580 Let's imagine, for a moment, that I have an appointment out of town 1033 00:51:35,580 --> 00:51:38,510 and I need to take a train in order to get to that appointment. 1034 00:51:38,510 --> 00:51:40,260 So what are the things I might care about? 1035 00:51:40,260 --> 00:51:42,620 Well, I care about getting to my appointment on time. 1036 00:51:42,620 --> 00:51:44,370 Either I make it to my appointment and I'm 1037 00:51:44,370 --> 00:51:46,710 able to attend it or I miss the appointment. 1038 00:51:46,710 --> 00:51:49,440 And you might imagine that that's influenced by the train, 1039 00:51:49,440 --> 00:51:54,000 that the train is either on time or it's delayed, for example. 1040 00:51:54,000 --> 00:51:56,370 But that train itself is also influenced. 1041 00:51:56,370 --> 00:52:00,030 Whether the train is on time or not depends maybe on the rain. 1042 00:52:00,030 --> 00:52:00,822 Is there no rain? 1043 00:52:00,822 --> 00:52:01,530 Is it light rain? 1044 00:52:01,530 --> 00:52:02,737 Is there heavy rain? 1045 00:52:02,737 --> 00:52:05,070 And it might also be influenced by other variables, too. 
1046 00:52:05,070 --> 00:52:07,050 It might be influenced, as well, by whether 1047 00:52:07,050 --> 00:52:09,608 or not there's maintenance on the train track, for example. 1048 00:52:09,608 --> 00:52:11,400 If there is maintenance on the train track, 1049 00:52:11,400 --> 00:52:15,660 that probably increases the likelihood that my train is delayed. 1050 00:52:15,660 --> 00:52:19,680 And so we can represent all of these ideas using a Bayesian network that 1051 00:52:19,680 --> 00:52:21,360 looks a little something like this. 1052 00:52:21,360 --> 00:52:25,440 Here I have four nodes representing four random variables 1053 00:52:25,440 --> 00:52:26,970 that I would like to keep track of. 1054 00:52:26,970 --> 00:52:29,190 I have one random variable called Rain that 1055 00:52:29,190 --> 00:52:34,080 can take on three possible values in its domain, either none or light or heavy 1056 00:52:34,080 --> 00:52:36,348 for no rain, light rain, or heavy rain. 1057 00:52:36,348 --> 00:52:38,640 I have a variable called Maintenance for whether or not 1058 00:52:38,640 --> 00:52:42,030 there is maintenance on the train track, which has two possible values, just 1059 00:52:42,030 --> 00:52:42,960 either yes or no. 1060 00:52:42,960 --> 00:52:46,355 Either there is maintenance or there is no maintenance happening on the track. 1061 00:52:46,355 --> 00:52:49,230 Then I have a random variable for the train indicating whether 1062 00:52:49,230 --> 00:52:50,490 the train was on time or not. 1063 00:52:50,490 --> 00:52:53,850 That random variable has two possible values in its domain. 1064 00:52:53,850 --> 00:52:57,730 The train is either on time or the train is delayed. 1065 00:52:57,730 --> 00:52:59,803 And then, finally, I have a random variable 1066 00:52:59,803 --> 00:53:01,470 for whether I make it to my appointment. 1067 00:53:01,470 --> 00:53:04,950 For my appointment down here, I have a random variable called Appointment 1068 00:53:04,950 --> 00:53:09,420 that itself has two possible values, attend and miss. 1069 00:53:09,420 --> 00:53:10,920 And so here are the possible values. 1070 00:53:10,920 --> 00:53:12,960 Here are my four nodes, each of which represents 1071 00:53:12,960 --> 00:53:17,160 a random variable, each of which has a domain of possible values 1072 00:53:17,160 --> 00:53:18,500 that it can take on. 1073 00:53:18,500 --> 00:53:21,980 And the arrows, the edges pointing from one node to another, 1074 00:53:21,980 --> 00:53:26,250 encode some notion of dependence inside of this graph, 1075 00:53:26,250 --> 00:53:28,830 that whether I make it to my appointment or not 1076 00:53:28,830 --> 00:53:32,650 is dependent upon whether the train is on time or delayed. 1077 00:53:32,650 --> 00:53:36,390 And whether the train is on time or delayed is dependent on two things, 1078 00:53:36,390 --> 00:53:38,910 given by the two arrows pointing at this node. 1079 00:53:38,910 --> 00:53:42,350 It is dependent on whether or not there was maintenance on the train track, 1080 00:53:42,350 --> 00:53:45,240 and it is also dependent upon whether or not 1081 00:53:45,240 --> 00:53:47,675 it is raining. 1082 00:53:47,675 --> 00:53:49,800 And just to make things a little complicated, let's 1083 00:53:49,800 --> 00:53:53,280 say, as well, that whether or not there's maintenance on the track, 1084 00:53:53,280 --> 00:53:55,260 this too might be influenced by the rain.
1085 00:53:55,260 --> 00:53:57,178 Then if there's heavier rain, well, maybe it's 1086 00:53:57,178 --> 00:53:59,970 less likely that there's going to be maintenance on the train track 1087 00:53:59,970 --> 00:54:02,010 that day because they're more likely to want 1088 00:54:02,010 --> 00:54:05,500 to do maintenance on the track on days when it's not raining, for example. 1089 00:54:05,500 --> 00:54:08,350 And so these nodes might have different relationships between them. 1090 00:54:08,350 --> 00:54:10,770 But the idea is that we can come up with a probability 1091 00:54:10,770 --> 00:54:16,370 distribution for any of these nodes based only upon its parents. 1092 00:54:16,370 --> 00:54:20,158 And so let's look node by node at what this probability distribution might 1093 00:54:20,158 --> 00:54:20,950 actually look like. 1094 00:54:20,950 --> 00:54:24,150 And we'll go ahead and begin with this root node, this Rain node here, which 1095 00:54:24,150 --> 00:54:27,630 is at the top and has no arrows pointing into it, 1096 00:54:27,630 --> 00:54:30,510 which means its probability distribution is not 1097 00:54:30,510 --> 00:54:32,410 going to be a conditional distribution. 1098 00:54:32,410 --> 00:54:33,870 It's not based on anything. 1099 00:54:33,870 --> 00:54:38,250 I just have some probability distribution over the possible values 1100 00:54:38,250 --> 00:54:40,520 for the Rain random variable. 1101 00:54:40,520 --> 00:54:43,590 And that distribution might look a little something like this. 1102 00:54:43,590 --> 00:54:46,170 None, light, and heavy-- each have a possible value. 1103 00:54:46,170 --> 00:54:48,300 Here I'm saying the likelihood of no rain 1104 00:54:48,300 --> 00:54:53,790 is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. 1105 00:54:53,790 --> 00:54:58,440 So here is a probability distribution for this root node in this Bayesian 1106 00:54:58,440 --> 00:54:59,770 network. 1107 00:54:59,770 --> 00:55:03,000 And let's now consider the next node in the network, Maintenance. 1108 00:55:03,000 --> 00:55:05,140 Track maintenance is yes or no. 1109 00:55:05,140 --> 00:55:07,530 And the general idea of what this distribution 1110 00:55:07,530 --> 00:55:09,660 is going to encode, at least in this story, 1111 00:55:09,660 --> 00:55:13,308 is the idea that the heavier the rain is, the less likely 1112 00:55:13,308 --> 00:55:15,600 it is that there's going to be maintenance on the track 1113 00:55:15,600 --> 00:55:18,017 because the people that are doing maintenance on the track 1114 00:55:18,017 --> 00:55:21,190 probably want to wait until a day when it's not as rainy in order to do 1115 00:55:21,190 --> 00:55:23,000 the track maintenance, for example. 1116 00:55:23,000 --> 00:55:25,480 And so what might that probability distribution look like? 1117 00:55:25,480 --> 00:55:28,180 Well, this now is going to be a conditional probability 1118 00:55:28,180 --> 00:55:31,600 distribution, that here are the three possible values for the Rain 1119 00:55:31,600 --> 00:55:34,840 random variable, which I'm here just going to abbreviate to R, either 1120 00:55:34,840 --> 00:55:37,490 no rain, light rain, or heavy rain. 
1121 00:55:37,490 --> 00:55:41,590 And for each of those possible values, either there is yes track maintenance 1122 00:55:41,590 --> 00:55:46,120 or no track maintenance, and those have probabilities associated with them, 1123 00:55:46,120 --> 00:55:50,650 that I see here that if it is not raining, 1124 00:55:50,650 --> 00:55:53,620 then there is a probability of 0.4 that there's track maintenance 1125 00:55:53,620 --> 00:55:56,350 and a probability of 0.6 that there isn't. 1126 00:55:56,350 --> 00:55:59,200 But if there's heavy rain, then here the chance 1127 00:55:59,200 --> 00:56:02,020 that there is track maintenance is 0.1 and the chance 1128 00:56:02,020 --> 00:56:04,430 that there is not track maintenance is 0.9. 1129 00:56:04,430 --> 00:56:08,230 Each of these rows is going to sum up to one because each of these 1130 00:56:08,230 --> 00:56:10,930 represent different values of whether or not 1131 00:56:10,930 --> 00:56:14,710 it's raining, the three possible values that that random variable can take on, 1132 00:56:14,710 --> 00:56:18,160 and each is associated with its own probability distribution. 1133 00:56:18,160 --> 00:56:22,450 That is ultimately all going to add up to the number one. 1134 00:56:22,450 --> 00:56:26,290 So that there is our distribution for this random variable called Maintenance 1135 00:56:26,290 --> 00:56:30,110 about whether or not there is maintenance on the train track. 1136 00:56:30,110 --> 00:56:32,050 And now let's consider the next variable. 1137 00:56:32,050 --> 00:56:34,210 Here we have a node inside of our Bayesian network 1138 00:56:34,210 --> 00:56:38,570 called Train that has two possible values, on time and delayed. 1139 00:56:38,570 --> 00:56:42,160 And this node is going to be dependent upon the two nodes that 1140 00:56:42,160 --> 00:56:45,040 are pointing towards it, that whether the train is on time 1141 00:56:45,040 --> 00:56:48,872 or delayed depends on whether or not there is track maintenance, 1142 00:56:48,872 --> 00:56:50,830 and it depends on whether or not there is rain, 1143 00:56:50,830 --> 00:56:55,610 that heavier rain probably means more likely that my train is delayed. 1144 00:56:55,610 --> 00:56:58,270 And if there is track maintenance, that also 1145 00:56:58,270 --> 00:57:02,360 probably means it's more likely that my train is delayed as well. 1146 00:57:02,360 --> 00:57:05,350 And so you could construct a larger probability distribution, 1147 00:57:05,350 --> 00:57:07,720 a conditional probability distribution, that 1148 00:57:07,720 --> 00:57:11,530 instead of conditioning on just one variable, as was the case here, 1149 00:57:11,530 --> 00:57:14,380 is now conditioning on two variables, conditioning 1150 00:57:14,380 --> 00:57:19,270 both on rain, represented by R, and on maintenance, represented by M. 1151 00:57:19,270 --> 00:57:23,040 Again, each of these rows has two values that sum up to the number one, 1152 00:57:23,040 --> 00:57:27,310 one for whether the train is on time, one for whether the train is delayed. 1153 00:57:27,310 --> 00:57:29,260 And here I can say something like, all right, 1154 00:57:29,260 --> 00:57:32,950 if I know there was light rain and track maintenance-- well, OK, 1155 00:57:32,950 --> 00:57:36,490 that would be R is light and M is yes-- 1156 00:57:36,490 --> 00:57:40,210 well, then there is a probability of 0.6 that my train is on time 1157 00:57:40,210 --> 00:57:43,540 and a probability of 0.4 the train is delayed.
1158 00:57:43,540 --> 00:57:47,770 And you can imagine gathering this data just by looking at real-world data, 1159 00:57:47,770 --> 00:57:50,970 looking at data about, all right, if I knew that it was light rain 1160 00:57:50,970 --> 00:57:52,720 and there was track maintenance, how often 1161 00:57:52,720 --> 00:57:54,400 was a train delayed or not delayed, and you 1162 00:57:54,400 --> 00:57:55,930 could begin to construct this thing. 1163 00:57:55,930 --> 00:57:58,060 But the interesting thing is, intelligently, 1164 00:57:58,060 --> 00:57:59,812 being able to try to figure out, how might 1165 00:57:59,812 --> 00:58:01,270 you go about ordering these things? 1166 00:58:01,270 --> 00:58:06,730 What things might influence other nodes inside of this Bayesian network? 1167 00:58:06,730 --> 00:58:08,860 And the last thing I care about is whether or not 1168 00:58:08,860 --> 00:58:10,870 I make it to my appointment. 1169 00:58:10,870 --> 00:58:13,210 So did I attend or miss the appointment? 1170 00:58:13,210 --> 00:58:16,180 And ultimately, whether I attend or miss the appointment, 1171 00:58:16,180 --> 00:58:19,552 it is influenced by track maintenance because it's indirectly this idea 1172 00:58:19,552 --> 00:58:21,760 that, all right, if there is track maintenance, well, 1173 00:58:21,760 --> 00:58:23,450 then my train might more likely be delayed, 1174 00:58:23,450 --> 00:58:25,325 and if my train is more likely to be delayed, 1175 00:58:25,325 --> 00:58:27,280 then I'm more likely to miss my appointment. 1176 00:58:27,280 --> 00:58:29,650 But what we encode in this Bayesian network 1177 00:58:29,650 --> 00:58:32,820 are just what we might consider to be more direct relationships. 1178 00:58:32,820 --> 00:58:35,710 So the train has a direct influence on the appointment. 1179 00:58:35,710 --> 00:58:38,710 And given that I know whether the train is on time or delayed, 1180 00:58:38,710 --> 00:58:40,540 knowing whether there's track maintenance 1181 00:58:40,540 --> 00:58:44,550 isn't going to give me any additional information that I didn't already have, 1182 00:58:44,550 --> 00:58:48,070 that if I know Train, these other nodes that are up above 1183 00:58:48,070 --> 00:58:51,150 aren't really going to influence the result. 1184 00:58:51,150 --> 00:58:54,910 And so here we might represent it using another conditional probability 1185 00:58:54,910 --> 00:58:57,430 distribution that looks a little something like this, that 1186 00:58:57,430 --> 00:59:00,160 train can take on two possible values. 1187 00:59:00,160 --> 00:59:02,740 Either my train is on time or my train is delayed. 1188 00:59:02,740 --> 00:59:04,510 And for each of those two possible values, 1189 00:59:04,510 --> 00:59:06,803 I have a distribution for what are the odds 1190 00:59:06,803 --> 00:59:09,220 that I'm able to attend the meeting, and what are the odds 1191 00:59:09,220 --> 00:59:10,090 that I miss the meeting? 1192 00:59:10,090 --> 00:59:12,010 And obviously, if my train is on time, I'm 1193 00:59:12,010 --> 00:59:14,130 much more likely to be able to attend the meeting 1194 00:59:14,130 --> 00:59:16,600 than if my train is delayed, in which case 1195 00:59:16,600 --> 00:59:19,500 I'm more likely to miss that meeting.
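As a rough sketch, these distributions might be stored in Python as plain dictionaries. Only the numbers mentioned above come from the lecture's tables; the entries marked as illustrative are placeholders filled in so that each row sums to one:

# P(Rain), the unconditional distribution for the root node
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# P(Maintenance | Rain); the "light" row is illustrative
P_maintenance = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # illustrative
    "heavy": {"yes": 0.1, "no": 0.9},
}

# P(Train | Rain, Maintenance); only the ("light", "yes") row was given above,
# and the remaining (rain, maintenance) combinations are elided here
P_train = {
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"):  {"on time": 0.7, "delayed": 0.3},  # illustrative
}

# P(Appointment | Train); both rows are illustrative
P_appointment = {
    "on time": {"attend": 0.9, "miss": 0.1},  # illustrative
    "delayed": {"attend": 0.6, "miss": 0.4},  # illustrative
}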
1196 00:59:19,500 --> 00:59:21,790 So all of these nodes put all together here 1197 00:59:21,790 --> 00:59:25,330 represent this Bayesian network, this network of random variables 1198 00:59:25,330 --> 00:59:27,730 whose values I ultimately care about and that 1199 00:59:27,730 --> 00:59:30,380 have some sort of relationship between them, 1200 00:59:30,380 --> 00:59:33,670 some sort of dependence where these arrows from one node to another 1201 00:59:33,670 --> 00:59:37,960 indicate some dependence, that I can calculate the probability of some node 1202 00:59:37,960 --> 00:59:41,870 given the parents that happen to exist there. 1203 00:59:41,870 --> 00:59:45,340 So now that we've been able to describe the structure of this Bayesian network 1204 00:59:45,340 --> 00:59:47,680 and the relationships between each of these nodes, 1205 00:59:47,680 --> 00:59:51,070 by associating each of the nodes in the network with a probability 1206 00:59:51,070 --> 00:59:53,980 distribution, whether that's an unconditional probability 1207 00:59:53,980 --> 00:59:56,200 distribution in the case of this root node here, 1208 00:59:56,200 --> 00:59:59,630 like Rain, or a conditional probability distribution, 1209 00:59:59,630 --> 01:00:02,380 in the case of all of the other nodes whose probabilities are 1210 01:00:02,380 --> 01:00:05,000 dependent upon the values of their parents, 1211 01:00:05,000 --> 01:00:09,160 we can begin to do some computation and calculation using the information 1212 01:00:09,160 --> 01:00:10,490 inside of those tables. 1213 01:00:10,490 --> 01:00:12,310 So let's imagine, for example, that I just 1214 01:00:12,310 --> 01:00:15,910 wanted to compute something simple, like the probability of light rain. 1215 01:00:15,910 --> 01:00:18,130 How would I get the probability of light rain? 1216 01:00:18,130 --> 01:00:21,370 Well, light rain-- rain here is a root node. 1217 01:00:21,370 --> 01:00:23,770 And so if I wanted to calculate that probability, 1218 01:00:23,770 --> 01:00:26,740 I could just look at the probability distribution for rain 1219 01:00:26,740 --> 01:00:29,800 and extract from it the probability of light rain. 1220 01:00:29,800 --> 01:00:33,220 It's just a single value that I already have access to. 1221 01:00:33,220 --> 01:00:35,410 But we could also imagine wanting to compute 1222 01:00:35,410 --> 01:00:39,100 more complex joint probabilities, like the probability 1223 01:00:39,100 --> 01:00:42,710 that there is light rain and also no track maintenance. 1224 01:00:42,710 --> 01:00:47,440 This is a joint probability of two values, light rain and no track 1225 01:00:47,440 --> 01:00:48,293 maintenance. 1226 01:00:48,293 --> 01:00:51,460 And the way I might do that is first by starting by saying, all right, well, 1227 01:00:51,460 --> 01:00:54,100 let me get the probability of light rain, but now 1228 01:00:54,100 --> 01:00:57,160 I also want the probability of no track maintenance. 1229 01:00:57,160 --> 01:01:01,630 But, of course, this node is dependent upon the value of rain. 1230 01:01:01,630 --> 01:01:05,350 So what I really want is the probability of no track maintenance given 1231 01:01:05,350 --> 01:01:07,540 that I know that there was light rain.
1232 01:01:07,540 --> 01:01:10,450 And so the expression for calculating this idea 1233 01:01:10,450 --> 01:01:13,870 that the probability of light rain and no track maintenance 1234 01:01:13,870 --> 01:01:17,680 is really just the probability of light rain and the probability 1235 01:01:17,680 --> 01:01:21,250 that there is no track maintenance given that I know that there already 1236 01:01:21,250 --> 01:01:22,210 is light rain. 1237 01:01:22,210 --> 01:01:25,540 So I take the unconditional probability of light rain, 1238 01:01:25,540 --> 01:01:30,160 multiply it by the conditional probability of no track maintenance 1239 01:01:30,160 --> 01:01:32,550 given that I know there is light rain. 1240 01:01:32,550 --> 01:01:35,770 And you can continue to do this again and again for every variable 1241 01:01:35,770 --> 01:01:38,378 that you want to add into this joint probability 1242 01:01:38,378 --> 01:01:39,670 that I might want to calculate. 1243 01:01:39,670 --> 01:01:42,400 If I wanted to know the probability of light rain 1244 01:01:42,400 --> 01:01:45,100 and no track maintenance and a delayed train, 1245 01:01:45,100 --> 01:01:48,850 well, that's going to be the probability of light rain multiplied 1246 01:01:48,850 --> 01:01:50,950 by the probability of no track maintenance 1247 01:01:50,950 --> 01:01:56,218 given light rain multiplied by the probability of a delayed train given 1248 01:01:56,218 --> 01:01:59,260 light rain and no track maintenance, because whether the train is on time 1249 01:01:59,260 --> 01:02:03,190 or delayed is dependent upon both of these other two variables, 1250 01:02:03,190 --> 01:02:05,290 and so I have two pieces of evidence that 1251 01:02:05,290 --> 01:02:08,860 go into the calculation of that conditional probability. 1252 01:02:08,860 --> 01:02:11,470 And each of these three values is just a value 1253 01:02:11,470 --> 01:02:15,640 that I can look up by looking at one of these individual probability 1254 01:02:15,640 --> 01:02:20,140 distributions that is encoded into my Bayesian network. 1255 01:02:20,140 --> 01:02:23,410 And if I wanted a joint probability over all four of the variables, 1256 01:02:23,410 --> 01:02:25,900 something like the probability of light rain 1257 01:02:25,900 --> 01:02:30,130 and no track maintenance and a delayed train and I missed my appointment, 1258 01:02:30,130 --> 01:02:32,890 well, that's going to be multiplying four different values, one 1259 01:02:32,890 --> 01:02:34,870 from each of these individual nodes. 1260 01:02:34,870 --> 01:02:36,970 It's going to be the probability of light rain, 1261 01:02:36,970 --> 01:02:39,370 then of no track maintenance given light rain, 1262 01:02:39,370 --> 01:02:42,882 then of a delayed train given light rain and no track maintenance. 1263 01:02:42,882 --> 01:02:46,090 And then, finally, for this node here for whether I make it to my appointment 1264 01:02:46,090 --> 01:02:50,770 or not, it's not dependent upon these two variables given that I know 1265 01:02:50,770 --> 01:02:52,270 whether or not the train is on time. 1266 01:02:52,270 --> 01:02:55,030 I only need to care about the conditional probability 1267 01:02:55,030 --> 01:03:00,160 that I miss my appointment given that the train happens to be delayed. 
1268 01:03:00,160 --> 01:03:04,120 And so that's represented here by four probabilities, each of which 1269 01:03:04,120 --> 01:03:07,420 is located inside of one of these probability distributions 1270 01:03:07,420 --> 01:03:11,092 for each of the nodes, all multiplied together. 1271 01:03:11,092 --> 01:03:13,300 And so I can take a variable like that and figure out 1272 01:03:13,300 --> 01:03:15,910 what the joint probability is by multiplying 1273 01:03:15,910 --> 01:03:18,280 a whole bunch of these individual probabilities 1274 01:03:18,280 --> 01:03:19,990 from the Bayesian network. 1275 01:03:19,990 --> 01:03:23,110 But, of course, just as with last time where what I really wanted to do 1276 01:03:23,110 --> 01:03:25,463 was to be able to get new pieces of information, 1277 01:03:25,463 --> 01:03:28,630 here, too, this is what we're going to want to do with our Bayesian network. 1278 01:03:28,630 --> 01:03:31,720 In the context of knowledge, we talked about the problem of inference. 1279 01:03:31,720 --> 01:03:34,210 Given things that I know to be true, can I 1280 01:03:34,210 --> 01:03:38,020 draw conclusions, make deductions about other facts about the world 1281 01:03:38,020 --> 01:03:40,270 that I also know to be true? 1282 01:03:40,270 --> 01:03:44,170 And what we're going to do now is apply the same sort of idea to probability. 1283 01:03:44,170 --> 01:03:46,960 Using information about which I have some knowledge, 1284 01:03:46,960 --> 01:03:49,510 whether some evidence or some probabilities, can 1285 01:03:49,510 --> 01:03:52,360 I figure out not other variables for certain, 1286 01:03:52,360 --> 01:03:55,750 but can I figure out the probabilities of other variables taking 1287 01:03:55,750 --> 01:03:57,160 on particular values? 1288 01:03:57,160 --> 01:04:00,160 And so here we introduce the problem of inference 1289 01:04:00,160 --> 01:04:03,970 in a probabilistic setting in a case where variables might not necessarily 1290 01:04:03,970 --> 01:04:06,760 be true for sure, but they might be random variables 1291 01:04:06,760 --> 01:04:10,640 that take on different values with some probability. 1292 01:04:10,640 --> 01:04:13,780 So how do we formally define what exactly this inference problem actually 1293 01:04:13,780 --> 01:04:14,500 is? 1294 01:04:14,500 --> 01:04:17,350 Well, the inference problem has a couple of parts to it. 1295 01:04:17,350 --> 01:04:20,140 We have some query, some variable x that we 1296 01:04:20,140 --> 01:04:21,730 want to compute the distribution for. 1297 01:04:21,730 --> 01:04:24,880 Maybe I want the probability that I missed my train 1298 01:04:24,880 --> 01:04:29,500 or I want the probability that there is track maintenance, something 1299 01:04:29,500 --> 01:04:31,570 that I want information about. 1300 01:04:31,570 --> 01:04:33,437 And then I have some evidence variables. 1301 01:04:33,437 --> 01:04:35,020 Maybe it's just one piece of evidence. 1302 01:04:35,020 --> 01:04:36,760 Maybe it's multiple pieces of evidence. 1303 01:04:36,760 --> 01:04:40,600 But I've observed certain variables for some sort of event. 1304 01:04:40,600 --> 01:04:43,772 So for example, I might have observed that it is raining. 1305 01:04:43,772 --> 01:04:44,980 This is evidence that I have. 1306 01:04:44,980 --> 01:04:47,933 I know that there is light rain or I know that there is heavy rain, 1307 01:04:47,933 --> 01:04:49,100 and that is evidence I have. 
1308 01:04:49,100 --> 01:04:52,750 And using that evidence, I want to know, what is the probability 1309 01:04:52,750 --> 01:04:55,430 that my train is delayed, for example? 1310 01:04:55,430 --> 01:04:58,480 And that is a query that I might want to ask based on this evidence. 1311 01:04:58,480 --> 01:05:00,700 So I have a query, some variable, evidence, 1312 01:05:00,700 --> 01:05:03,280 which are some other variables that I have observed inside 1313 01:05:03,280 --> 01:05:05,260 of my Bayesian network, and of course that 1314 01:05:05,260 --> 01:05:08,110 does leave some hidden variables, y. 1315 01:05:08,110 --> 01:05:11,380 These are variables that are not evidence variables and not 1316 01:05:11,380 --> 01:05:12,550 query variables. 1317 01:05:12,550 --> 01:05:16,090 So you might imagine in the case where I know whether or not it's raining 1318 01:05:16,090 --> 01:05:19,930 and I want to know whether my train is going to be delayed or not, 1319 01:05:19,930 --> 01:05:23,380 the hidden variable, the thing I don't have access to, is something like, 1320 01:05:23,380 --> 01:05:25,130 is there maintenance on the track, or am I 1321 01:05:25,130 --> 01:05:27,380 going to make or not make my appointment, for example? 1322 01:05:27,380 --> 01:05:29,410 These are variables that I don't have access to. 1323 01:05:29,410 --> 01:05:32,680 They're hidden because they're not things I observed, 1324 01:05:32,680 --> 01:05:35,100 and they're also not the query, the thing that I'm asking. 1325 01:05:35,100 --> 01:05:37,480 And so ultimately what we want to calculate 1326 01:05:37,480 --> 01:05:41,650 is I want to know the probability distribution of x given 1327 01:05:41,650 --> 01:05:42,970 e, the event that I observed. 1328 01:05:42,970 --> 01:05:46,150 So given that I observed some event, I observed that it is raining, 1329 01:05:46,150 --> 01:05:49,960 I would like to know, what is the distribution over the possible values 1330 01:05:49,960 --> 01:05:51,640 of the Train random variable? 1331 01:05:51,640 --> 01:05:52,630 Is it on time? 1332 01:05:52,630 --> 01:05:53,440 Is it delayed? 1333 01:05:53,440 --> 01:05:55,750 What is the likelihood it's going to be there? 1334 01:05:55,750 --> 01:05:58,720 And it turns out we can do this calculation just using 1335 01:05:58,720 --> 01:06:02,410 a lot of the probability rules that we've already seen in action. 1336 01:06:02,410 --> 01:06:04,870 And ultimately, we're going to take a look at the math 1337 01:06:04,870 --> 01:06:07,150 at a little bit of a high level, at an abstract level, 1338 01:06:07,150 --> 01:06:09,370 but ultimately we can allow computers and programming 1339 01:06:09,370 --> 01:06:12,610 libraries that already exist to begin to do some of this math for us. 1340 01:06:12,610 --> 01:06:15,810 But it's good to get a general sense for what's actually happening when 1341 01:06:15,810 --> 01:06:18,010 this inference process takes place. 1342 01:06:18,010 --> 01:06:21,190 Let's imagine, for example, that I want to compute the probability 1343 01:06:21,190 --> 01:06:24,430 distribution of the Appointment random variable 1344 01:06:24,430 --> 01:06:28,510 given some evidence, given that I know that there was light rain and no track 1345 01:06:28,510 --> 01:06:29,260 maintenance. 1346 01:06:29,260 --> 01:06:32,830 So there's my evidence, these two variables that I observed the value of. 1347 01:06:32,830 --> 01:06:34,630 I observe the value of rain. 1348 01:06:34,630 --> 01:06:35,920 I know there's light rain. 
1349 01:06:35,920 --> 01:06:38,830 And I know that there is no track maintenance going on today. 1350 01:06:38,830 --> 01:06:42,820 And what I care about knowing, my query, is this random variable Appointment. 1351 01:06:42,820 --> 01:06:46,008 I want to know the distribution of this random variable Appointment. 1352 01:06:46,008 --> 01:06:47,800 What is the chance that I am able to attend 1353 01:06:47,800 --> 01:06:50,560 my appointment, what is the chance that I miss my appointment 1354 01:06:50,560 --> 01:06:52,360 given this evidence? 1355 01:06:52,360 --> 01:06:55,870 And the hidden variable, the information that I don't have access to, 1356 01:06:55,870 --> 01:06:57,190 is this variable Train. 1357 01:06:57,190 --> 01:07:00,040 This is information that is not part of the evidence that I see, 1358 01:07:00,040 --> 01:07:01,660 not something that I observe. 1359 01:07:01,660 --> 01:07:05,050 But it is also not the query that I am asking for. 1360 01:07:05,050 --> 01:07:07,460 And so what might this inference procedure look like? 1361 01:07:07,460 --> 01:07:10,810 Well, if you recall back from when we were defining conditional probability 1362 01:07:10,810 --> 01:07:13,270 and doing math with conditional probabilities, 1363 01:07:13,270 --> 01:07:15,940 we know that a conditional probability is 1364 01:07:15,940 --> 01:07:19,030 proportional to the joint probability. 1365 01:07:19,030 --> 01:07:23,050 And we remember this by recalling that the probability of a given b 1366 01:07:23,050 --> 01:07:25,930 is just some constant factor alpha multiplied 1367 01:07:25,930 --> 01:07:27,583 by the probability of a and b. 1368 01:07:27,583 --> 01:07:29,500 That constant factor alpha comes from 1369 01:07:29,500 --> 01:07:32,620 dividing by the probability of b, but the important thing 1370 01:07:32,620 --> 01:07:34,930 is that it's just some constant multiplied 1371 01:07:34,930 --> 01:07:37,450 by the joint distribution, the probability 1372 01:07:37,450 --> 01:07:40,070 that all of these individual things happen. 1373 01:07:40,070 --> 01:07:42,610 So in this case, I can take the probability 1374 01:07:42,610 --> 01:07:47,380 of the Appointment random variable given light rain and no track maintenance 1375 01:07:47,380 --> 01:07:51,070 and say that is just going to be proportional, some constant alpha, 1376 01:07:51,070 --> 01:07:54,700 multiplied by the joint probability, the probability of a particular value 1377 01:07:54,700 --> 01:08:00,410 for the appointment random variable, and light rain and no track maintenance. 1378 01:08:00,410 --> 01:08:02,980 Well, all right, how do I calculate this probability 1379 01:08:02,980 --> 01:08:05,350 of appointment and light rain and no track maintenance, 1380 01:08:05,350 --> 01:08:07,480 when what I really need are all four of these variables 1381 01:08:07,480 --> 01:08:11,260 to be able to calculate a joint distribution 1382 01:08:11,260 --> 01:08:13,990 across everything, because, then, a particular appointment 1383 01:08:13,990 --> 01:08:16,420 depends upon the value of train. 1384 01:08:16,420 --> 01:08:18,399 Well, in order to do that, here I can begin 1385 01:08:18,399 --> 01:08:21,430 to use that marginalization trick, that there are only 1386 01:08:21,430 --> 01:08:24,640 two ways I can get any configuration of an appointment, light rain, 1387 01:08:24,640 --> 01:08:25,859 and no track maintenance.
1388 01:08:25,859 --> 01:08:28,120 Either this particular setting of variables 1389 01:08:28,120 --> 01:08:33,130 happens and the train is on time or this particular setting of variables happens 1390 01:08:33,130 --> 01:08:34,180 and the train is delayed. 1391 01:08:34,180 --> 01:08:37,520 Those are two possible cases that I would want to consider. 1392 01:08:37,520 --> 01:08:40,149 And if I add those two cases up, well, then I 1393 01:08:40,149 --> 01:08:44,859 get the result just by adding up all of the possibilities for the hidden 1394 01:08:44,859 --> 01:08:46,990 variable, or variables if there are multiple. 1395 01:08:46,990 --> 01:08:49,090 But since there's only one hidden variable here, 1396 01:08:49,090 --> 01:08:53,229 Train, all I need to do is iterate over all the possible values for that hidden 1397 01:08:53,229 --> 01:08:56,600 variable Train and add up their probabilities. 1398 01:08:56,600 --> 01:08:59,529 So this probability expression here becomes 1399 01:08:59,529 --> 01:09:02,890 the probability distribution over Appointment, light rain, no track maintenance, and the train 1400 01:09:02,890 --> 01:09:06,010 is on time, plus the probability distribution 1401 01:09:06,010 --> 01:09:10,120 over Appointment, light rain, no track maintenance, and the train 1402 01:09:10,120 --> 01:09:11,660 is delayed, for example. 1403 01:09:11,660 --> 01:09:15,597 So I take both of the possible values for train, go ahead and add them up. 1404 01:09:15,597 --> 01:09:16,180 These are just 1405 01:09:16,180 --> 01:09:18,722 joint probabilities that we saw earlier how to calculate, just 1406 01:09:18,722 --> 01:09:22,120 by going parent, parent, parent, parent and calculating those probabilities 1407 01:09:22,120 --> 01:09:23,615 and multiplying them together. 1408 01:09:23,615 --> 01:09:26,740 And then you'll need to normalize them at the end, speaking at a high level, 1409 01:09:26,740 --> 01:09:29,920 to make sure that everything adds up to the number one. 1410 01:09:29,920 --> 01:09:32,229 So the formula for how you do this, in a process known 1411 01:09:32,229 --> 01:09:35,223 as inference by enumeration, looks a little bit complicated, 1412 01:09:35,223 --> 01:09:36,640 but ultimately it looks like this. 1413 01:09:36,640 --> 01:09:39,550 And let's now try to distill what it is that all of these symbols 1414 01:09:39,550 --> 01:09:40,420 actually mean. 1415 01:09:40,420 --> 01:09:41,410 Let's start here. 1416 01:09:41,410 --> 01:09:46,029 What I care about knowing is the probability of x, my query variable, 1417 01:09:46,029 --> 01:09:48,370 given some sort of evidence. 1418 01:09:48,370 --> 01:09:50,410 What do I know about conditional probabilities? 1419 01:09:50,410 --> 01:09:55,030 Well, a conditional probability is proportional to the joint probability. 1420 01:09:55,030 --> 01:09:57,850 So we had some alpha, some normalizing constant, 1421 01:09:57,850 --> 01:10:01,840 multiplied by this joint probability of x and evidence. 1422 01:10:01,840 --> 01:10:03,410 And how do I calculate that? 1423 01:10:03,410 --> 01:10:05,980 Well, to do that, I'm going to marginalize over 1424 01:10:05,980 --> 01:10:07,420 all of the hidden variables. 1425 01:10:07,420 --> 01:10:10,450 All the variables that I don't directly observe the values for, 1426 01:10:10,450 --> 01:10:13,390 I'm basically going to iterate over all of the possibilities 1427 01:10:13,390 --> 01:10:16,040 that could happen and just sum them all up.
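Written out, the rule being described here is:

$$P(X \mid e) = \alpha\, P(X, e) = \alpha \sum_{y} P(X, e, y)$$

where $X$ is the query variable, $e$ is the observed evidence, $y$ ranges over all values of the hidden variables, and $\alpha$ (equal to $1 / P(e)$) is the normalizing constant that makes the resulting distribution sum to one.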
1428 01:10:16,040 --> 01:10:19,270 And so I can translate this into a sum over all y, which 1429 01:10:19,270 --> 01:10:22,450 ranges over all the possible hidden variables and the values 1430 01:10:22,450 --> 01:10:27,250 that they could take on, and adds up all of those possible individual 1431 01:10:27,250 --> 01:10:28,300 probabilities. 1432 01:10:28,300 --> 01:10:32,195 And that is going to allow me to do this process of inference by enumeration. 1433 01:10:32,195 --> 01:10:34,570 And ultimately, it's pretty annoying if we as humans have 1434 01:10:34,570 --> 01:10:36,713 to do all of this math for ourselves. 1435 01:10:36,713 --> 01:10:39,880 But it turns out this is where computers and AI can be particularly helpful, 1436 01:10:39,880 --> 01:10:43,360 that we can program a computer to understand a Bayesian network to be 1437 01:10:43,360 --> 01:10:45,610 able to understand these inference procedures 1438 01:10:45,610 --> 01:10:47,560 and to be able to do these calculations. 1439 01:10:47,560 --> 01:10:49,390 And using the information you've seen here, 1440 01:10:49,390 --> 01:10:52,150 you could implement a Bayesian network from scratch yourself. 1441 01:10:52,150 --> 01:10:54,733 But it turns out there are a lot of libraries, especially written 1442 01:10:54,733 --> 01:10:56,650 in Python, that allow us to make it easier 1443 01:10:56,650 --> 01:10:58,780 to do this sort of probabilistic inference 1444 01:10:58,780 --> 01:11:01,788 to be able to take a Bayesian network and do these sorts of calculations 1445 01:11:01,788 --> 01:11:04,830 so that you don't need to know and understand all of the underlying math, 1446 01:11:04,830 --> 01:11:07,372 though it's helpful to have a general sense for how it works. 1447 01:11:07,372 --> 01:11:10,330 But you just need to be able to describe the structure of the network 1448 01:11:10,330 --> 01:11:14,350 and make queries in order to be able to produce the result. 1449 01:11:14,350 --> 01:11:17,050 And so let's take a look at an example of that right now. 1450 01:11:17,050 --> 01:11:19,420 It turns out that there are a lot of possible libraries 1451 01:11:19,420 --> 01:11:21,803 that exist in Python for doing this sort of inference. 1452 01:11:21,803 --> 01:11:24,220 It doesn't matter too much which specific library you use. 1453 01:11:24,220 --> 01:11:26,330 They all behave in fairly similar ways. 1454 01:11:26,330 --> 01:11:29,170 But the library I'm going to use here is one known as pomegranate. 1455 01:11:29,170 --> 01:11:33,820 And here inside of model.py, I have defined a Bayesian network 1456 01:11:33,820 --> 01:11:38,070 just using the structure and the syntax that the pomegranate library expects. 1457 01:11:38,070 --> 01:11:40,930 And what I'm effectively doing is just, in Python, 1458 01:11:40,930 --> 01:11:44,740 creating nodes to represent each of the nodes of the Bayesian network 1459 01:11:44,740 --> 01:11:47,060 that you saw me describe a moment ago. 1460 01:11:47,060 --> 01:11:49,750 So here on line four, after I've imported pomegranate, 1461 01:11:49,750 --> 01:11:52,540 I'm defining a variable called rain that is going to represent 1462 01:11:52,540 --> 01:11:55,990 a node inside of my Bayesian network. 1463 01:11:55,990 --> 01:11:59,530 It's going to be a node that follows this distribution where 1464 01:11:59,530 --> 01:12:01,030 there are three possible values-- 1465 01:12:01,030 --> 01:12:03,970 none for no rain, light for light rain, heavy for heavy rain.
1466 01:12:03,970 --> 01:12:07,180 And these are the probabilities of each of those taking place. 1467 01:12:07,180 --> 01:12:13,630 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. 1468 01:12:13,630 --> 01:12:15,760 Then, after that, we go to the next variable, 1469 01:12:15,760 --> 01:12:18,400 the variable for track maintenance, for example, which 1470 01:12:18,400 --> 01:12:20,990 is dependent upon that rain variable. 1471 01:12:20,990 --> 01:12:23,890 And this, instead of being an unconditional distribution, 1472 01:12:23,890 --> 01:12:27,370 is a conditional distribution, as indicated by a conditional probability 1473 01:12:27,370 --> 01:12:28,430 table here. 1474 01:12:28,430 --> 01:12:33,790 And the idea is that this is conditional on the distribution of rain. 1475 01:12:33,790 --> 01:12:36,700 So if there is no rain, then the chance that there is 1476 01:12:36,700 --> 01:12:38,370 track maintenance is 0.4. 1477 01:12:38,370 --> 01:12:41,720 If there's no rain, the chance that there is no track maintenance is 0.6. 1478 01:12:41,720 --> 01:12:43,720 Likewise, for light rain, I have a distribution. 1479 01:12:43,720 --> 01:12:45,760 For heavy rain, I have a distribution, as well. 1480 01:12:45,760 --> 01:12:48,130 But I'm effectively encoding the same information 1481 01:12:48,130 --> 01:12:50,110 you saw represented graphically a moment ago, 1482 01:12:50,110 --> 01:12:53,110 but I'm telling this Python program that the maintenance 1483 01:12:53,110 --> 01:12:57,640 node obeys this particular conditional probability distribution. 1484 01:12:57,640 --> 01:13:01,090 And we do the same thing for the other random variables, as well. 1485 01:13:01,090 --> 01:13:06,310 Train was a node inside my network whose distribution was a conditional probability 1486 01:13:06,310 --> 01:13:08,050 table with two parents. 1487 01:13:08,050 --> 01:13:11,380 It was dependent not only on rain, but also on track maintenance. 1488 01:13:11,380 --> 01:13:15,310 And so here I'm saying something like, given that there is no rain and yes 1489 01:13:15,310 --> 01:13:19,630 track maintenance, the probability that my train is on time is 0.8, 1490 01:13:19,630 --> 01:13:22,240 and the probability that it's delayed is 0.2. 1491 01:13:22,240 --> 01:13:24,220 And likewise, I can do the same thing for all 1492 01:13:24,220 --> 01:13:28,330 of the other possible values of the parents of the train node 1493 01:13:28,330 --> 01:13:32,800 inside of my Bayesian network by saying, for all of those possible values, 1494 01:13:32,800 --> 01:13:36,350 here is the distribution that the train node should follow. 1495 01:13:36,350 --> 01:13:38,710 And I do the same thing for an appointment 1496 01:13:38,710 --> 01:13:41,830 based on the distribution of the variable Train. 1497 01:13:41,830 --> 01:13:45,340 Then, at the end, what I do is actually construct this network 1498 01:13:45,340 --> 01:13:47,860 by describing what the states of the network are 1499 01:13:47,860 --> 01:13:50,660 and by adding edges between the dependent nodes. 1500 01:13:50,660 --> 01:13:53,110 So I create a new Bayesian network, add states to it-- 1501 01:13:53,110 --> 01:13:56,650 one for rain, one for maintenance, one for train, one for the appointment-- 1502 01:13:56,650 --> 01:14:00,460 and then I add edges connecting the related pieces.
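For reference, here is a sketch of what such a model.py might look like, written against the older pomegranate 0.x API that this walkthrough appears to use. The probability values not read out in the transcript (the light- and heavy-rain maintenance rows and most of the train rows) are illustrative placeholders, not necessarily the lecture's actual numbers:

```python
# model.py -- a sketch, assuming the older pomegranate 0.x API
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

# Rain has no parents: an unconditional distribution
rain = Node(DiscreteDistribution({
    "none": 0.7, "light": 0.2, "heavy": 0.1
}), name="rain")

# Track maintenance is conditional on rain
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4], ["none", "no", 0.6],
    ["light", "yes", 0.2], ["light", "no", 0.8],  # illustrative values
    ["heavy", "yes", 0.1], ["heavy", "no", 0.9],  # illustrative values
], [rain.distribution]), name="maintenance")

# Train is conditional on both rain and maintenance
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8], ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9], ["none", "no", "delayed", 0.1],      # illustrative
    ["light", "yes", "on time", 0.6], ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7], ["light", "no", "delayed", 0.3],    # illustrative
    ["heavy", "yes", "on time", 0.4], ["heavy", "yes", "delayed", 0.6],  # illustrative
    ["heavy", "no", "on time", 0.5], ["heavy", "no", "delayed", 0.5],    # illustrative
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment is conditional on the train alone
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9], ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6], ["delayed", "miss", 0.4],
], [train.distribution]), name="appointment")

# Assemble the network: add the states, then the edges, then bake
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)
model.bake()
```

The likelihood.py query described next would then be a single call, something like model.probability([["none", "no", "on time", "attend"]]).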
1503 01:14:00,460 --> 01:14:04,570 Rain has an arrow to maintenance because rain influences track maintenance, 1504 01:14:04,570 --> 01:14:08,530 rain also influences the train, maintenance also influences the train, 1505 01:14:08,530 --> 01:14:11,140 and train influences whether I make it to my appointment, 1506 01:14:11,140 --> 01:14:14,800 and calling bake just finalizes the model and does some additional computation. 1507 01:14:14,800 --> 01:14:18,250 So the specific syntax of this is not really the important part. 1508 01:14:18,250 --> 01:14:20,980 Pomegranate just happens to be one of several different libraries 1509 01:14:20,980 --> 01:14:22,990 that can all be used for similar purposes, 1510 01:14:22,990 --> 01:14:26,170 and you could describe and define a library for yourself 1511 01:14:26,170 --> 01:14:28,010 that implemented similar things. 1512 01:14:28,010 --> 01:14:30,430 But the key idea here is that someone can 1513 01:14:30,430 --> 01:14:33,220 design a library for a general Bayesian network that 1514 01:14:33,220 --> 01:14:35,680 has nodes that are each based upon their parents, 1515 01:14:35,680 --> 01:14:39,190 and then all a programmer needs to do, using one of those libraries, 1516 01:14:39,190 --> 01:14:43,420 is to define what those nodes and what those probability distributions are, 1517 01:14:43,420 --> 01:14:47,000 and we can begin to do some interesting logic based on it. 1518 01:14:47,000 --> 01:14:50,200 So let's try doing that conditional or joint probability 1519 01:14:50,200 --> 01:14:56,800 calculation that we did by hand before by going into likelihood.py 1520 01:14:56,800 --> 01:15:00,340 where here I'm importing the model that I just defined a moment ago. 1521 01:15:00,340 --> 01:15:03,100 And here I'd just like to calculate model.probability, 1522 01:15:03,100 --> 01:15:06,320 which calculates the probability for a given observation, 1523 01:15:06,320 --> 01:15:10,270 and I'd like to calculate the probability of no rain, 1524 01:15:10,270 --> 01:15:13,330 no track maintenance, my train is on time, 1525 01:15:13,330 --> 01:15:14,950 and I'm able to attend the meeting-- 1526 01:15:14,950 --> 01:15:16,870 so sort of the optimal scenario, that there's 1527 01:15:16,870 --> 01:15:20,162 no rain and no maintenance on the track, my train is on time, 1528 01:15:20,162 --> 01:15:21,620 and I'm able to attend the meeting. 1529 01:15:21,620 --> 01:15:25,020 What is the probability that all of that actually happens? 1530 01:15:25,020 --> 01:15:26,900 And I can calculate that using the library 1531 01:15:26,900 --> 01:15:28,700 and just print out its probability. 1532 01:15:28,700 --> 01:15:32,780 And so I'll go ahead and run python likelihood.py, 1533 01:15:32,780 --> 01:15:37,190 and I see that, OK, the probability is about 0.34. 1534 01:15:37,190 --> 01:15:40,850 So about a third of the time, everything goes right for me, in this case-- 1535 01:15:40,850 --> 01:15:43,190 no rain, no track maintenance, train is on time, 1536 01:15:43,190 --> 01:15:45,032 and I'm able to attend the meeting. 1537 01:15:45,032 --> 01:15:47,990 But I could experiment with this, try and calculate other probabilities 1538 01:15:47,990 --> 01:15:48,650 as well. 1539 01:15:48,650 --> 01:15:51,860 What's the probability that everything goes right up until the train 1540 01:15:51,860 --> 01:15:57,020 but I still miss my meeting-- so no rain, no track maintenance, train 1541 01:15:57,020 --> 01:15:59,690 is on time, but I miss the appointment.
1542 01:15:59,690 --> 01:16:04,680 Let's calculate that probability, and that has a probability of about 0.04. 1543 01:16:04,680 --> 01:16:07,643 So about 4% of the time the train will be on time, 1544 01:16:07,643 --> 01:16:09,560 there won't be any rain, no track maintenance, 1545 01:16:09,560 --> 01:16:12,420 and yet I'll still miss the meeting. 1546 01:16:12,420 --> 01:16:14,780 And so this is really just an implementation 1547 01:16:14,780 --> 01:16:17,900 of the calculation of the joint probabilities that we did before. 1548 01:16:17,900 --> 01:16:20,150 What this library is likely doing is first 1549 01:16:20,150 --> 01:16:23,600 figuring out the probability of no rain, then figuring 1550 01:16:23,600 --> 01:16:26,030 out the probability of no track maintenance 1551 01:16:26,030 --> 01:16:28,580 given no rain, then the probability that my train is 1552 01:16:28,580 --> 01:16:31,760 on time given both of these values, and then the probability 1553 01:16:31,760 --> 01:16:35,930 that I miss my appointment given that I know that the train was on time. 1554 01:16:35,930 --> 01:16:39,070 So this, again, is the calculation of that joint probability. 1555 01:16:39,070 --> 01:16:42,320 And it turns out we can also begin to have our computer solve inference problems, 1556 01:16:42,320 --> 01:16:45,980 as well, to begin to infer, based on information, evidence 1557 01:16:45,980 --> 01:16:51,000 that we see, what is the likelihood of other variables also being true? 1558 01:16:51,000 --> 01:16:54,740 So let's go into inference.py, for example, where here I'm, 1559 01:16:54,740 --> 01:16:57,110 again, importing that exact same model from before, 1560 01:16:57,110 --> 01:16:59,300 importing all the nodes and all the edges 1561 01:16:59,300 --> 01:17:03,300 and the probability distribution that is encoded there, as well. 1562 01:17:03,300 --> 01:17:06,320 And now there's a function for doing some sort of prediction. 1563 01:17:06,320 --> 01:17:10,760 And here, into this model, I pass in the evidence that I observe. 1564 01:17:10,760 --> 01:17:14,750 So here I've encoded into this Python program the evidence 1565 01:17:14,750 --> 01:17:15,770 that I have observed. 1566 01:17:15,770 --> 01:17:18,950 I have observed the fact that the train is delayed, 1567 01:17:18,950 --> 01:17:22,190 and that is the value for one of the four random variables 1568 01:17:22,190 --> 01:17:24,140 inside of this Bayesian network. 1569 01:17:24,140 --> 01:17:26,210 And using that information, I would like to be 1570 01:17:26,210 --> 01:17:29,270 able to draw inferences 1571 01:17:29,270 --> 01:17:31,875 about the values of the other random variables 1572 01:17:31,875 --> 01:17:33,500 that are inside of my Bayesian network. 1573 01:17:33,500 --> 01:17:36,240 I would like to make predictions about everything else. 1574 01:17:36,240 --> 01:17:40,340 So all of the actual computational logic is happening in just these three lines 1575 01:17:40,340 --> 01:17:42,260 where I'm making this call to this prediction. 1576 01:17:42,260 --> 01:17:45,830 Down below, I'm just iterating over all of the states and all the predictions 1577 01:17:45,830 --> 01:17:49,860 and just printing them out so that we can visually see what the results are. 1578 01:17:49,860 --> 01:17:51,980 But let's find out, given the train is delayed, 1579 01:17:51,980 --> 01:17:56,210 what can I predict about the values of the other random variables? 1580 01:17:56,210 --> 01:17:59,021 Let's go ahead and run python inference.py.
1581 01:17:59,021 --> 01:18:00,005 I run that. 1582 01:18:00,005 --> 01:18:01,880 And all right, here is the result that I get. 1583 01:18:01,880 --> 01:18:04,640 Given the fact that I know that the train is delayed-- 1584 01:18:04,640 --> 01:18:06,770 this is evidence that I have observed-- 1585 01:18:06,770 --> 01:18:10,490 well, I can see that there is about a 46% chance 1586 01:18:10,490 --> 01:18:12,520 that there was no rain, a 31% chance there 1587 01:18:12,520 --> 01:18:15,230 was light rain, a 23% chance there was heavy rain, 1588 01:18:15,230 --> 01:18:17,712 and I can see a probability distribution over track maintenance 1589 01:18:17,712 --> 01:18:19,670 and a probability distribution over whether I'm 1590 01:18:19,670 --> 01:18:22,130 able to attend or miss my appointment. 1591 01:18:22,130 --> 01:18:23,990 Now, we know that whether I attend or miss 1592 01:18:23,990 --> 01:18:27,715 the appointment, that is only dependent upon the train being delayed 1593 01:18:27,715 --> 01:18:28,340 or not delayed. 1594 01:18:28,340 --> 01:18:30,540 It shouldn't depend on anything else. 1595 01:18:30,540 --> 01:18:34,610 So let's imagine, for example, that I knew that there was heavy rain. 1596 01:18:34,610 --> 01:18:38,620 That shouldn't affect the distribution for making the appointment. 1597 01:18:38,620 --> 01:18:41,360 And indeed, if I go up here and add some evidence, 1598 01:18:41,360 --> 01:18:44,128 say that I know that the value of rain is heavy-- 1599 01:18:44,128 --> 01:18:45,920 that is evidence that I now have access to. 1600 01:18:45,920 --> 01:18:47,420 I now have two pieces of evidence. 1601 01:18:47,420 --> 01:18:51,950 I know that the rain is heavy, and I know that my train is delayed. 1602 01:18:51,950 --> 01:18:55,550 I can calculate the probability by running this inference procedure again 1603 01:18:55,550 --> 01:18:57,090 and seeing the result. 1604 01:18:57,090 --> 01:18:58,340 I know that the rain is heavy. 1605 01:18:58,340 --> 01:18:59,840 I know my train is delayed. 1606 01:18:59,840 --> 01:19:02,990 The probability distribution for track maintenance changed. 1607 01:19:02,990 --> 01:19:05,130 Given that I know that there is heavy rain, 1608 01:19:05,130 --> 01:19:08,750 now it's more likely that there is no track maintenance, 88% as 1609 01:19:08,750 --> 01:19:12,250 opposed to 64% from before. 1610 01:19:12,250 --> 01:19:16,040 And now what is the probability that I make the appointment? 1611 01:19:16,040 --> 01:19:17,480 Well, that's the same as before. 1612 01:19:17,480 --> 01:19:21,100 It's still going to be attend the appointment with probability 0.6, 1613 01:19:21,100 --> 01:19:23,450 miss the appointment with probability 0.4, 1614 01:19:23,450 --> 01:19:27,290 because it was only dependent upon whether or not my train was on time 1615 01:19:27,290 --> 01:19:28,260 or delayed. 1616 01:19:28,260 --> 01:19:31,610 And so this here is implementing the idea of that inference algorithm 1617 01:19:31,610 --> 01:19:34,130 to be able to figure out, based on the evidence 1618 01:19:34,130 --> 01:19:37,970 that I have, what can we infer about the values of the other variables that 1619 01:19:37,970 --> 01:19:39,050 exist as well? 1620 01:19:39,050 --> 01:19:42,890 So inference by enumeration is one way of doing this inference procedure, 1621 01:19:42,890 --> 01:19:46,730 just looping over all of the values the hidden variables could take on 1622 01:19:46,730 --> 01:19:49,460 and figuring out what the probability is.
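For reference, the inference.py logic just described might look roughly like this, again assuming the older pomegranate 0.x API and the model sketched earlier:

```python
# inference.py -- a sketch, assuming the older pomegranate 0.x API
from model import model  # the Bayesian network sketched earlier

# Evidence: we have observed that the train is delayed
predictions = model.predict_proba({
    "train": "delayed"
})

# Print a distribution for each node; nodes fixed by the evidence
# come back as plain strings rather than distributions
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        print(f"{node.name}: {prediction}")
    else:
        print(f"{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability:.4f}")
```

Adding the second piece of evidence from the walkthrough above would just mean passing {"train": "delayed", "rain": "heavy"} instead.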
1623 01:19:49,460 --> 01:19:52,010 Now, it turns out this is not particularly efficient, 1624 01:19:52,010 --> 01:19:56,180 and there are definitely optimizations you can make by avoiding repeated work 1625 01:19:56,180 --> 01:19:59,030 if you're calculating the same sort of probability multiple times. 1626 01:19:59,030 --> 01:20:02,570 There are ways of optimizing the program to avoid having to recalculate 1627 01:20:02,570 --> 01:20:04,640 the same probabilities again and again. 1628 01:20:04,640 --> 01:20:06,980 But even then, as the number of variables 1629 01:20:06,980 --> 01:20:10,220 gets large, as the number of possible values those variables could take on 1630 01:20:10,220 --> 01:20:12,110 gets large, we're going to start to have to do 1631 01:20:12,110 --> 01:20:14,600 a lot of computation, a lot of calculation, 1632 01:20:14,600 --> 01:20:16,190 to be able to do this inference. 1633 01:20:16,190 --> 01:20:18,150 And at that point, it might start to become 1634 01:20:18,150 --> 01:20:20,250 unreasonable in terms of the amount of time 1635 01:20:20,250 --> 01:20:24,615 that it would take to be able to do this sort of exact inference. 1636 01:20:24,615 --> 01:20:26,490 And it's for that reason that oftentimes when 1637 01:20:26,490 --> 01:20:29,970 it comes to probability and things we're not entirely sure about, 1638 01:20:29,970 --> 01:20:32,280 we don't always care about doing exact inference 1639 01:20:32,280 --> 01:20:35,040 and knowing exactly what the probability is. 1640 01:20:35,040 --> 01:20:37,560 But if we can approximate the inference procedure, 1641 01:20:37,560 --> 01:20:41,570 do some sort of approximate inference, that can be pretty good as well-- 1642 01:20:41,570 --> 01:20:43,550 if I don't know the exact probability 1643 01:20:43,550 --> 01:20:45,510 but I have a general sense for the probability, 1644 01:20:45,510 --> 01:20:49,200 one that can get increasingly accurate with more time, that's probably 1645 01:20:49,200 --> 01:20:53,620 pretty good, especially if I can get it to happen even faster. 1646 01:20:53,620 --> 01:20:57,930 So how could I do approximate inference inside of a Bayesian network? 1647 01:20:57,930 --> 01:21:00,480 Well, one method is through a procedure known as sampling. 1648 01:21:00,480 --> 01:21:04,980 In the process of sampling, I'm going to take a sample of all of the variables 1649 01:21:04,980 --> 01:21:06,840 inside of this Bayesian network here. 1650 01:21:06,840 --> 01:21:08,280 And how am I going to sample? 1651 01:21:08,280 --> 01:21:12,240 Well, I'm going to sample one of the values from each of these nodes 1652 01:21:12,240 --> 01:21:14,560 according to their probability distribution. 1653 01:21:14,560 --> 01:21:16,560 So how might I take a sample of all these nodes? 1654 01:21:16,560 --> 01:21:17,430 Well, I'll start at the root. 1655 01:21:17,430 --> 01:21:18,450 I'll start with rain. 1656 01:21:18,450 --> 01:21:21,060 Here's the distribution for rain, and I'll go ahead 1657 01:21:21,060 --> 01:21:23,880 and, using a random number generator or something like it, 1658 01:21:23,880 --> 01:21:25,770 randomly pick one of these three values. 1659 01:21:25,770 --> 01:21:29,730 I'll pick none with probability 0.7, light with probability 0.2, 1660 01:21:29,730 --> 01:21:31,440 and heavy with probability 0.1. 1661 01:21:31,440 --> 01:21:34,770 So I'll randomly just pick one of them according to that distribution, 1662 01:21:34,770 --> 01:21:37,780 and maybe, in this case, I pick none, for example.
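A single draw like that is easy to sketch in Python; one plausible way, using the rain distribution above:

```python
import random

# Randomly pick a value for the rain node according to its distribution
value = random.choices(
    ["none", "light", "heavy"],
    weights=[0.7, 0.2, 0.1],
)[0]
print(value)  # e.g. "none"
```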
1663 01:21:37,780 --> 01:21:39,780 Then I do the same thing for the other variable. 1664 01:21:39,780 --> 01:21:42,410 Maintenance also has a probability distribution. 1665 01:21:42,410 --> 01:21:44,070 And I am going to sample-- 1666 01:21:44,070 --> 01:21:46,470 now, there are three probability distributions here, 1667 01:21:46,470 --> 01:21:49,050 but I'm only going to sample from this first row 1668 01:21:49,050 --> 01:21:53,950 here because I've observed already in my sample that the value of rain is none. 1669 01:21:53,950 --> 01:21:54,450 So, 1670 01:21:54,450 --> 01:21:58,295 given that rain is none, I'm going to sample from this distribution to say, 1671 01:21:58,295 --> 01:22:00,420 all right, what should the value of maintenance be? 1672 01:22:00,420 --> 01:22:02,753 And in this case, maintenance is going to be, let's just 1673 01:22:02,753 --> 01:22:06,570 say, yes, which happens 40% of the time in the event that there is no rain, 1674 01:22:06,570 --> 01:22:07,603 for example. 1675 01:22:07,603 --> 01:22:10,020 And we'll sample all of the rest of the nodes in this way, 1676 01:22:10,020 --> 01:22:12,840 as well: I want to sample from the train distribution, 1677 01:22:12,840 --> 01:22:17,040 and I'll sample from this first row here where there is no rain, 1678 01:22:17,040 --> 01:22:18,570 but there is track maintenance. 1679 01:22:18,570 --> 01:22:21,980 And I'll sample: 80% of the time, I'll say the train is on time. 1680 01:22:21,980 --> 01:22:24,463 20% of the time, I'll say the train is delayed. 1681 01:22:24,463 --> 01:22:27,630 And finally, we'll do the same thing for whether I make it to my appointment 1682 01:22:27,630 --> 01:22:27,890 or not. 1683 01:22:27,890 --> 01:22:29,490 Did I attend or miss the appointment? 1684 01:22:29,490 --> 01:22:32,700 We'll sample based on this distribution and maybe say that in this case 1685 01:22:32,700 --> 01:22:36,150 I attend the appointment, which happens 90% of the time when 1686 01:22:36,150 --> 01:22:38,730 the train is actually on time. 1687 01:22:38,730 --> 01:22:42,900 So by going through these nodes, I can very quickly just do some sampling 1688 01:22:42,900 --> 01:22:45,720 and get a sample of the possible values that 1689 01:22:45,720 --> 01:22:48,990 could come up from going through this entire Bayesian network 1690 01:22:48,990 --> 01:22:51,540 according to those probability distributions. 1691 01:22:51,540 --> 01:22:54,360 And where this becomes powerful is if I do this not once, 1692 01:22:54,360 --> 01:22:57,100 but I do this thousands or tens of thousands of times 1693 01:22:57,100 --> 01:23:00,400 and generate a whole bunch of samples, all using this distribution. 1694 01:23:00,400 --> 01:23:01,410 I get different samples. 1695 01:23:01,410 --> 01:23:02,820 Maybe some of them are the same. 1696 01:23:02,820 --> 01:23:07,800 But I get a value for each of the possible variables that could come up. 1697 01:23:07,800 --> 01:23:10,620 And so then, if I'm ever faced with a question, a question like, 1698 01:23:10,620 --> 01:23:13,860 what is the probability that the train is on time, 1699 01:23:13,860 --> 01:23:15,900 you could do an exact inference procedure.
1700 01:23:15,900 --> 01:23:18,630 This is no different than the inference problem we had before 1701 01:23:18,630 --> 01:23:21,780 where I could just marginalize, look at all the possible other values 1702 01:23:21,780 --> 01:23:24,390 of the variables and do the computation of inference 1703 01:23:24,390 --> 01:23:28,200 by enumeration to find out this probability exactly. 1704 01:23:28,200 --> 01:23:31,710 But I could also, if I don't care about the exact probability, just sample it. 1705 01:23:31,710 --> 01:23:33,150 Approximate it to get close. 1706 01:23:33,150 --> 01:23:35,040 And this is a powerful tool in AI where we 1707 01:23:35,040 --> 01:23:38,790 don't need to be right 100% of the time or we don't need to be exactly right. 1708 01:23:38,790 --> 01:23:41,130 If we just need to be right with some probability, 1709 01:23:41,130 --> 01:23:44,290 we can often do so more effectively, more efficiently. 1710 01:23:44,290 --> 01:23:46,920 And so here, now, are all of those possible samples. 1711 01:23:46,920 --> 01:23:50,390 I'll sort of highlight the ones where the train is on time. 1712 01:23:50,390 --> 01:23:52,620 I'm ignoring the ones where the train is delayed. 1713 01:23:52,620 --> 01:23:55,350 And in this case, six out of eight 1714 01:23:55,350 --> 01:23:57,690 of the samples have the train arriving on time. 1715 01:23:57,690 --> 01:24:01,320 And so maybe, in this case, I can say that, in six out of eight cases, 1716 01:24:01,320 --> 01:24:03,458 that's the likelihood that the train is on time. 1717 01:24:03,458 --> 01:24:06,000 And with eight samples, that might not be a great prediction. 1718 01:24:06,000 --> 01:24:08,520 But if I had thousands upon thousands of samples, 1719 01:24:08,520 --> 01:24:11,580 then this could be a much better inference procedure 1720 01:24:11,580 --> 01:24:13,680 to be able to do these sorts of calculations. 1721 01:24:13,680 --> 01:24:17,310 So this is a direct sampling method to just do a bunch of samples 1722 01:24:17,310 --> 01:24:21,210 and then figure out what the probability of some event is. 1723 01:24:21,210 --> 01:24:24,400 Now, this from before was an unconditional probability. 1724 01:24:24,400 --> 01:24:27,447 What is the probability that the train is on time? 1725 01:24:27,447 --> 01:24:30,030 And I did that by looking at all the samples and figuring out, 1726 01:24:30,030 --> 01:24:32,372 right here, the ones where the train is on time. 1727 01:24:32,372 --> 01:24:34,080 But sometimes what I'll want to calculate 1728 01:24:34,080 --> 01:24:38,387 is not an unconditional probability, but rather a conditional probability, 1729 01:24:38,387 --> 01:24:40,470 something like, what is the probability that there 1730 01:24:40,470 --> 01:24:45,010 is light rain given that the train is on time, something to that effect. 1731 01:24:45,010 --> 01:24:50,060 And to do that kind of calculation, well, what I might do is here 1732 01:24:50,060 --> 01:24:52,140 are all the samples that I have, and I want 1733 01:24:52,140 --> 01:24:54,720 to calculate a probability distribution given 1734 01:24:54,720 --> 01:24:57,368 that I know that the train is on time. 1735 01:24:57,368 --> 01:24:59,910 So to be able to do that, I can kind of look at the two cases 1736 01:24:59,910 --> 01:25:03,630 where the train was delayed and ignore or reject them, 1737 01:25:03,630 --> 01:25:07,762 sort of exclude them from the possible samples that I'm considering.
1738 01:25:07,762 --> 01:25:09,720 And now I want to look at these remaining cases 1739 01:25:09,720 --> 01:25:11,130 where the train is on time. 1740 01:25:11,130 --> 01:25:13,860 Here are the cases where there is light rain. 1741 01:25:13,860 --> 01:25:16,850 And now I say, OK, these are two out of the six possible cases. 1742 01:25:16,850 --> 01:25:19,580 That can give me an approximation for the probability 1743 01:25:19,580 --> 01:25:23,440 of light rain given the fact that I know the train was on time. 1744 01:25:23,440 --> 01:25:25,700 And I did that in almost exactly the same way 1745 01:25:25,700 --> 01:25:28,660 just by adding an additional step, by saying that, 1746 01:25:28,660 --> 01:25:30,470 all right, when I take each sample, let me 1747 01:25:30,470 --> 01:25:34,460 reject all of the samples that don't match my evidence 1748 01:25:34,460 --> 01:25:37,250 and only consider the samples that do match 1749 01:25:37,250 --> 01:25:39,920 what it is that I have in my evidence that I want 1750 01:25:39,920 --> 01:25:42,020 to make some sort of calculation about. 1751 01:25:42,020 --> 01:25:45,920 And it turns out, using the libraries that we've had for Bayesian networks, 1752 01:25:45,920 --> 01:25:48,740 we can begin to implement this same sort of idea, 1753 01:25:48,740 --> 01:25:51,890 implement rejection sampling, which is what this method is called, 1754 01:25:51,890 --> 01:25:55,850 to be able to figure out some probability, not via direct inference, 1755 01:25:55,850 --> 01:25:57,980 but instead by sampling. 1756 01:25:57,980 --> 01:26:00,290 So what I have here is a program called sample.py, which 1757 01:26:00,290 --> 01:26:02,180 imports the exact same model. 1758 01:26:02,180 --> 01:26:05,490 And what I define first is a function to generate a sample. 1759 01:26:05,490 --> 01:26:09,088 And the way I generate a sample is just by looping over all of the states. 1760 01:26:09,088 --> 01:26:10,880 The states need to be in some sort of order 1761 01:26:10,880 --> 01:26:12,797 to make sure I'm looping in the correct order. 1762 01:26:12,797 --> 01:26:16,010 But effectively, if it is a conditional distribution, 1763 01:26:16,010 --> 01:26:18,410 I'm going to sample based on the parents. 1764 01:26:18,410 --> 01:26:21,240 And otherwise, I'm just going to directly sample the variable, 1765 01:26:21,240 --> 01:26:25,040 like rain, which has no parents-- it's just an unconditional distribution-- 1766 01:26:25,040 --> 01:26:28,640 and keep track of all those parent samples and return the final sample. 1767 01:26:28,640 --> 01:26:31,290 The exact syntax of this, again, is not particularly important. 1768 01:26:31,290 --> 01:26:33,290 It just happens to be part of the implementation 1769 01:26:33,290 --> 01:26:35,820 details of this particular library. 1770 01:26:35,820 --> 01:26:38,270 The interesting logic is done below. 1771 01:26:38,270 --> 01:26:40,820 Now that I have the ability to generate a sample, 1772 01:26:40,820 --> 01:26:45,020 if I want to know the distribution of the appointment random variable given 1773 01:26:45,020 --> 01:26:48,680 that the train is delayed, well, then I can begin to do calculations like this. 1774 01:26:48,680 --> 01:26:52,430 Let me take 10,000 samples and assemble all my results 1775 01:26:52,430 --> 01:26:53,810 in this list called data. 1776 01:26:53,810 --> 01:26:57,140 I'll go ahead and loop n times-- in this case, 10,000 times.
1777 01:26:57,140 --> 01:27:01,670 I'll generate a sample, and I want to know the distribution of appointment 1778 01:27:01,670 --> 01:27:03,410 given that the train is delayed. 1779 01:27:03,410 --> 01:27:05,900 So according to rejection sampling, I'm only 1780 01:27:05,900 --> 01:27:08,210 going to consider samples where the train is delayed. 1781 01:27:08,210 --> 01:27:11,552 If the train's not delayed, I'm not going to consider those values at all. 1782 01:27:11,552 --> 01:27:13,760 So I'm going to say, all right, if I take the sample, 1783 01:27:13,760 --> 01:27:16,290 look at the value of the train random variable, 1784 01:27:16,290 --> 01:27:19,670 if the train is delayed, well, let me go ahead and add to my data 1785 01:27:19,670 --> 01:27:23,000 that I'm collecting the value of the appointment random variable 1786 01:27:23,000 --> 01:27:25,520 that it took on in this particular sample. 1787 01:27:25,520 --> 01:27:28,610 So I'm only considering the samples where the train is delayed 1788 01:27:28,610 --> 01:27:31,010 and, for each of those samples, considering 1789 01:27:31,010 --> 01:27:32,870 what the value of appointment is. 1790 01:27:32,870 --> 01:27:35,570 And then at the end, I'm using a Python class called Counter, 1791 01:27:35,570 --> 01:27:37,580 which quickly counts up all the values inside 1792 01:27:37,580 --> 01:27:40,100 of a data set so I can take this list of data 1793 01:27:40,100 --> 01:27:44,000 and figure out how many times was my appointment made, 1794 01:27:44,000 --> 01:27:47,360 and how many times was my appointment missed? 1795 01:27:47,360 --> 01:27:49,610 And so this here, with just a couple of lines of code, 1796 01:27:49,610 --> 01:27:53,080 is an implementation of rejection sampling. 1797 01:27:53,080 --> 01:27:58,170 And I can run it by going ahead and running python sample.py. 1798 01:27:58,170 --> 01:28:00,230 And when I do that, here is the result I get. 1799 01:28:00,230 --> 01:28:02,150 This is the result of the counter. 1800 01:28:02,150 --> 01:28:05,750 1,251 times I was able to attend the meeting, 1801 01:28:05,750 --> 01:28:08,900 and 856 times I missed the meeting. 1802 01:28:08,900 --> 01:28:11,550 And you can imagine, by doing more and more samples, 1803 01:28:11,550 --> 01:28:14,480 I'll be able to get a better and better, more accurate result. 1804 01:28:14,480 --> 01:28:16,070 And this is a randomized process. 1805 01:28:16,070 --> 01:28:18,895 It's going to be an approximation of the probability. 1806 01:28:18,895 --> 01:28:21,770 If I run it a different time, you'll notice the numbers are similar-- 1807 01:28:21,770 --> 01:28:25,460 1,272 and 905-- but they're not identical 1808 01:28:25,460 --> 01:28:28,250 because there's some randomization, some likelihood that things 1809 01:28:28,250 --> 01:28:31,730 might be higher or lower, and so this is why we generally want to try and use 1810 01:28:31,730 --> 01:28:35,360 more samples, so that we can have a greater amount of confidence 1811 01:28:35,360 --> 01:28:37,760 in our result and be more sure 1812 01:28:37,760 --> 01:28:41,240 that the result we're getting accurately reflects 1813 01:28:41,240 --> 01:28:43,940 the actual underlying probabilities that are 1814 01:28:43,940 --> 01:28:47,130 inherent inside of this distribution. 1815 01:28:47,130 --> 01:28:50,057 And so this, then, was an instance of rejection sampling.
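As a self-contained sketch of the same idea, here is a library-free version of rejection sampling over this network. The probability rows marked illustrative are not read out in the transcript, so those particular numbers are placeholders:

```python
import random
from collections import Counter

# Distributions for the lecture's network; rows marked
# "illustrative" are placeholders not given in the transcript.
P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINTENANCE = {  # P(maintenance | rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},  # illustrative
    "heavy": {"yes": 0.1, "no": 0.9},  # illustrative
}
P_TRAIN = {  # P(train | rain, maintenance)
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},    # illustrative
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},   # illustrative
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},  # illustrative
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},   # illustrative
}
P_APPOINTMENT = {  # P(appointment | train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def draw(distribution):
    """Sample one value from a {value: probability} dict."""
    values = list(distribution)
    weights = list(distribution.values())
    return random.choices(values, weights=weights)[0]

def generate_sample():
    """Sample every node, parents first, as in the walkthrough above."""
    rain = draw(P_RAIN)
    maintenance = draw(P_MAINTENANCE[rain])
    train = draw(P_TRAIN[(rain, maintenance)])
    appointment = draw(P_APPOINTMENT[train])
    return {"rain": rain, "maintenance": maintenance,
            "train": train, "appointment": appointment}

# Rejection sampling: estimate P(appointment | train = delayed)
# by keeping only the samples that match the evidence.
data = []
for _ in range(10_000):
    sample = generate_sample()
    if sample["train"] == "delayed":
        data.append(sample["appointment"])

print(Counter(data))  # e.g. Counter({'attend': 1251, 'miss': 856})
```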
1816 01:28:50,057 --> 01:28:52,640 And it turns out, there are a number of other sampling methods 1817 01:28:52,640 --> 01:28:55,070 that you could use to begin to try to sample. 1818 01:28:55,070 --> 01:28:57,530 One problem that rejection sampling has is 1819 01:28:57,530 --> 01:29:02,480 that if the evidence you're looking for is a fairly unlikely event, well, 1820 01:29:02,480 --> 01:29:04,610 you're going to be rejecting a lot of samples. 1821 01:29:04,610 --> 01:29:08,490 Like, if I'm looking for the probability of x given some evidence e, 1822 01:29:08,490 --> 01:29:12,680 if e is very unlikely to occur-- like, occurs maybe once every 1,000 times-- 1823 01:29:12,680 --> 01:29:16,040 then I'm only going to be considering one out of every 1,000 samples 1824 01:29:16,040 --> 01:29:18,798 that I do, which is a pretty inefficient method for trying 1825 01:29:18,798 --> 01:29:20,090 to do this sort of calculation. 1826 01:29:20,090 --> 01:29:23,600 I'm throwing away a lot of samples, and it takes computational effort 1827 01:29:23,600 --> 01:29:25,640 to be able to generate those samples, so I'd 1828 01:29:25,640 --> 01:29:27,480 like to not have to do something like that. 1829 01:29:27,480 --> 01:29:30,230 So there are other sampling methods that can try and address this. 1830 01:29:30,230 --> 01:29:33,680 One such sampling method is called likelihood weighting. 1831 01:29:33,680 --> 01:29:36,920 In likelihood weighting, we follow a slightly different procedure, 1832 01:29:36,920 --> 01:29:39,740 and the goal is to avoid needing to throw out 1833 01:29:39,740 --> 01:29:42,590 samples that didn't match the evidence. 1834 01:29:42,590 --> 01:29:46,760 And so what we'll do is we'll start by fixing the values for the evidence 1835 01:29:46,760 --> 01:29:47,300 variables. 1836 01:29:47,300 --> 01:29:49,430 Rather than sample everything, we're going 1837 01:29:49,430 --> 01:29:53,970 to fix the values of the evidence variables and not sample those. 1838 01:29:53,970 --> 01:29:57,650 Then we're going to sample all the other non-evidence variables in the same way, 1839 01:29:57,650 --> 01:30:01,010 just using the Bayesian network, looking at the probability distributions, 1840 01:30:01,010 --> 01:30:04,040 sampling all the non-evidence variables. 1841 01:30:04,040 --> 01:30:08,450 But then what we need to do is weight each sample by its likelihood. 1842 01:30:08,450 --> 01:30:10,520 If our evidence is really unlikely, we want 1843 01:30:10,520 --> 01:30:14,210 to make sure that we've taken into account, how likely was the evidence 1844 01:30:14,210 --> 01:30:16,410 to actually show up in the sample? 1845 01:30:16,410 --> 01:30:18,590 If I have a sample where the evidence was much more 1846 01:30:18,590 --> 01:30:20,720 likely to show up than another sample, then I 1847 01:30:20,720 --> 01:30:23,060 want to weight the more likely one higher. 1848 01:30:23,060 --> 01:30:25,490 So we're going to weight each sample by its likelihood 1849 01:30:25,490 --> 01:30:29,480 where likelihood is just defined as the probability of all of the evidence. 1850 01:30:29,480 --> 01:30:32,090 Given all the evidence we have, what is the probability 1851 01:30:32,090 --> 01:30:34,640 that it would happen in that particular sample? 1852 01:30:34,640 --> 01:30:37,250 So before, all of our samples were weighted equally. 1853 01:30:37,250 --> 01:30:40,970 They all had a weight of one when we were calculating the overall average.
1854 01:30:40,970 --> 01:30:42,980 In this case, we're going to weight each sample, 1855 01:30:42,980 --> 01:30:46,220 multiply each sample by its likelihood in order 1856 01:30:46,220 --> 01:30:49,252 to get the more accurate distribution. 1857 01:30:49,252 --> 01:30:50,460 So what would this look like? 1858 01:30:50,460 --> 01:30:54,170 Well, if I asked the same question, what is the probability of light rain given 1859 01:30:54,170 --> 01:30:57,050 that the train is on time, when I do the sampling procedure 1860 01:30:57,050 --> 01:30:59,780 and start by trying to sample, I'm going to start 1861 01:30:59,780 --> 01:31:01,520 by fixing the evidence variable. 1862 01:31:01,520 --> 01:31:04,640 I'm already going to have in my sample the train is on time. 1863 01:31:04,640 --> 01:31:06,860 That way, I don't have to throw out anything. 1864 01:31:06,860 --> 01:31:10,610 I'm only sampling things where I know the value of the variables that 1865 01:31:10,610 --> 01:31:13,790 are my evidence are what I expect them to be. 1866 01:31:13,790 --> 01:31:16,310 So I'll go ahead and sample from rain, and maybe this time I 1867 01:31:16,310 --> 01:31:18,318 sample light rain instead of no rain. 1868 01:31:18,318 --> 01:31:21,110 Then I'll sample from track maintenance and say maybe, yes, there's 1869 01:31:21,110 --> 01:31:22,100 track maintenance. 1870 01:31:22,100 --> 01:31:25,190 Then for train, well, I've already fixed it in place. 1871 01:31:25,190 --> 01:31:29,360 Train was an evidence variable, so I'm not going to bother sampling again. 1872 01:31:29,360 --> 01:31:30,820 I'll just go ahead and move on. 1873 01:31:30,820 --> 01:31:35,280 I'll move on to appointment and go ahead and sample from appointment as well. 1874 01:31:35,280 --> 01:31:37,040 So now I've generated a sample. 1875 01:31:37,040 --> 01:31:40,190 I've generated a sample by fixing this evidence variable 1876 01:31:40,190 --> 01:31:42,310 and sampling the other three. 1877 01:31:42,310 --> 01:31:44,390 And the last step is now weighting the sample. 1878 01:31:44,390 --> 01:31:45,920 How much weight should it have? 1879 01:31:45,920 --> 01:31:50,090 And the weight is based on how probable is it that the train was actually 1880 01:31:50,090 --> 01:31:52,560 on time, this evidence actually happened, 1881 01:31:52,560 --> 01:31:55,460 given the values of these other variables, light rain and the fact 1882 01:31:55,460 --> 01:31:57,620 that, yes, there was track maintenance? 1883 01:31:57,620 --> 01:32:00,260 Well, to do that, I can just go back to the train variable 1884 01:32:00,260 --> 01:32:02,900 and say, all right, if there was light rain and track 1885 01:32:02,900 --> 01:32:05,060 maintenance, the likelihood of my evidence, 1886 01:32:05,060 --> 01:32:08,570 the likelihood that my train was on time, is 0.6. 1887 01:32:08,570 --> 01:32:13,250 And so this particular sample would have a weight of 0.6. 1888 01:32:13,250 --> 01:32:15,740 And I could repeat the sampling procedure again and again. 1889 01:32:15,740 --> 01:32:18,140 Each time, every sample would be given a weight 1890 01:32:18,140 --> 01:32:22,928 according to the probability of the evidence that I see associated with it. 1891 01:32:22,928 --> 01:32:25,970 And there are other sampling methods that exist, as well, but all of them 1892 01:32:25,970 --> 01:32:27,845 are designed to try and get at the same idea, 1893 01:32:27,845 --> 01:32:30,950 to approximate the inference procedure of figuring out 1894 01:32:30,950 --> 01:32:33,540 the value of a variable. 
1895 01:32:33,540 --> 01:32:35,570 So we've now dealt with probability as it 1896 01:32:35,570 --> 01:32:38,840 pertains to particular variables that have these discrete values. 1897 01:32:38,840 --> 01:32:40,910 But what we haven't really considered is how 1898 01:32:40,910 --> 01:32:44,300 values might change over time. We've considered something 1899 01:32:44,300 --> 01:32:47,870 like a variable for rain where rain can take on values of none or light 1900 01:32:47,870 --> 01:32:50,600 rain or heavy rain, but, in practice, usually when 1901 01:32:50,600 --> 01:32:54,950 we consider values for variables like rain, we like to consider 1902 01:32:54,950 --> 01:32:58,020 how the values of these variables change over time. 1903 01:32:58,020 --> 01:33:02,040 What do we deal with when we're dealing with uncertainty over a period of time? 1904 01:33:02,040 --> 01:33:04,590 This can come up in the context of weather, for example-- 1905 01:33:04,590 --> 01:33:06,830 if I have sunny days and I have rainy days. 1906 01:33:06,830 --> 01:33:11,450 And I'd like to know not just what is the probability that it's raining now, 1907 01:33:11,450 --> 01:33:14,210 but what is the probability that it rains tomorrow or the day 1908 01:33:14,210 --> 01:33:15,838 after that or the day after that? 1909 01:33:15,838 --> 01:33:17,630 And so to do this, we're going to introduce 1910 01:33:17,630 --> 01:33:19,440 a slightly different kind of model. 1911 01:33:19,440 --> 01:33:23,300 But here we're going to have a random variable, not just one for the weather, 1912 01:33:23,300 --> 01:33:25,643 but for every possible time step. 1913 01:33:25,643 --> 01:33:27,560 And you can define time step however you like. 1914 01:33:27,560 --> 01:33:30,680 A simple way is just to use days as your time step. 1915 01:33:30,680 --> 01:33:34,220 And so we can define a variable called x sub t, which 1916 01:33:34,220 --> 01:33:36,620 is going to be the weather at time t. 1917 01:33:36,620 --> 01:33:39,350 So x sub zero might be the weather on day zero, 1918 01:33:39,350 --> 01:33:42,400 x sub one might be the weather on day one, so on and so forth, 1919 01:33:42,400 --> 01:33:45,022 x sub two is the weather on day two. 1920 01:33:45,022 --> 01:33:46,730 But as you can imagine, if we start to do 1921 01:33:46,730 --> 01:33:48,740 this over longer and longer periods of time, 1922 01:33:48,740 --> 01:33:51,282 there's an incredible amount of data that might go into this. 1923 01:33:51,282 --> 01:33:53,960 If you're keeping track of data about the weather for a year, 1924 01:33:53,960 --> 01:33:57,240 now suddenly you might be trying to predict the weather tomorrow given 1925 01:33:57,240 --> 01:34:00,620 365 days of previous pieces of evidence, and that's 1926 01:34:00,620 --> 01:34:03,530 a lot of evidence to have to deal with and manipulate and calculate. 1927 01:34:03,530 --> 01:34:06,410 Probably nobody knows what the exact conditional probability 1928 01:34:06,410 --> 01:34:10,070 distribution is for all of those combinations of variables. 1929 01:34:10,070 --> 01:34:13,070 And so when we're trying to do this inference inside of a computer, when 1930 01:34:13,070 --> 01:34:16,640 we're trying to reasonably do this sort of analysis, 1931 01:34:16,640 --> 01:34:19,053 it's helpful to make some simplifying assumptions, 1932 01:34:19,053 --> 01:34:21,470 some assumptions about the problem that we can just assume 1933 01:34:21,470 --> 01:34:23,930 are true to make our lives a little bit easier.
1934 01:34:23,930 --> 01:34:26,270 Even if they're not totally accurate assumptions, 1935 01:34:26,270 --> 01:34:28,703 if they're close to accurate or approximate, 1936 01:34:28,703 --> 01:34:29,870 they're usually pretty good. 1937 01:34:29,870 --> 01:34:33,350 And the assumption we're going to make is called the Markov assumption, 1938 01:34:33,350 --> 01:34:38,210 which is the assumption that the current state depends only on a finite fixed 1939 01:34:38,210 --> 01:34:40,220 number of previous states. 1940 01:34:40,220 --> 01:34:44,210 So the current day's weather depends not on all of the previous days' weather 1941 01:34:44,210 --> 01:34:47,150 for all of history, but the current day's weather I 1942 01:34:47,150 --> 01:34:49,758 can predict just based on yesterday's weather 1943 01:34:49,758 --> 01:34:52,550 or just based on the last two days' weather or the last three days' 1944 01:34:52,550 --> 01:34:53,050 weather. 1945 01:34:53,050 --> 01:34:57,620 But oftentimes, we're going to deal with just the one previous state helping 1946 01:34:57,620 --> 01:34:59,720 to predict the current state. 1947 01:34:59,720 --> 01:35:01,970 And by putting a whole bunch of these random variables 1948 01:35:01,970 --> 01:35:04,400 together, using this Markov assumption, we 1949 01:35:04,400 --> 01:35:08,090 can create what's called a Markov chain where a Markov chain is just 1950 01:35:08,090 --> 01:35:11,960 some sequence of random variables where each variable's distribution 1951 01:35:11,960 --> 01:35:13,772 follows that Markov assumption. 1952 01:35:13,772 --> 01:35:16,480 And so we'll do an example of this where the Markov assumption is that 1953 01:35:16,480 --> 01:35:17,590 I can predict the weather. 1954 01:35:17,590 --> 01:35:19,050 Is it sunny or rainy? 1955 01:35:19,050 --> 01:35:21,520 And we'll just consider those two possibilities for now, 1956 01:35:21,520 --> 01:35:23,395 even though there are other types of weather. 1957 01:35:23,395 --> 01:35:26,650 But I can predict each day's weather just on the prior day's weather. 1958 01:35:26,650 --> 01:35:30,430 Using today's weather, I can come up with a probability distribution 1959 01:35:30,430 --> 01:35:31,825 for tomorrow's weather. 1960 01:35:31,825 --> 01:35:33,700 And here's what this might look like. 1961 01:35:33,700 --> 01:35:37,030 It's formatted in terms of a matrix, as you might describe it, 1962 01:35:37,030 --> 01:35:41,410 as sort of rows and columns of values where on the left-hand side 1963 01:35:41,410 --> 01:35:45,850 I have today's weather, represented by the variable x sub t. 1964 01:35:45,850 --> 01:35:48,730 And then over here in the columns, I have tomorrow's weather, 1965 01:35:48,730 --> 01:35:54,790 represented by the variable x sub t plus one, the weather on day t plus one. 1966 01:35:54,790 --> 01:35:58,990 And what this matrix is saying is if today is sunny, 1967 01:35:58,990 --> 01:36:02,440 well, then, it's more likely than not that tomorrow is also sunny. 1968 01:36:02,440 --> 01:36:05,990 Oftentimes the weather stays consistent for multiple days in a row. 1969 01:36:05,990 --> 01:36:08,200 And for example, let's say that if today is sunny, 1970 01:36:08,200 --> 01:36:12,820 our model says that tomorrow, with probability 0.8, it will also be sunny, 1971 01:36:12,820 --> 01:36:15,610 and with probability 0.2 it will be raining. 1972 01:36:15,610 --> 01:36:19,245 And likewise, if today is raining, then it's 1973 01:36:19,245 --> 01:36:21,370 more likely than not that tomorrow is also raining.
1974 01:36:21,370 --> 01:36:23,620 With probability 0.7, it'll be raining. 1975 01:36:23,620 --> 01:36:26,710 With probability 0.3, it will be sunny. 1976 01:36:26,710 --> 01:36:28,840 So this matrix, this description of how it 1977 01:36:28,840 --> 01:36:32,290 is we transition from one state to the next state, 1978 01:36:32,290 --> 01:36:34,540 is what we're going to call the transition model. 1979 01:36:34,540 --> 01:36:37,030 And using the transition model, you can begin 1980 01:36:37,030 --> 01:36:41,770 to construct this Markov chain by just predicting, given today's weather, 1981 01:36:41,770 --> 01:36:44,020 what's the likelihood of tomorrow's weather happening? 1982 01:36:44,020 --> 01:36:46,930 And you can imagine doing a similar sampling 1983 01:36:46,930 --> 01:36:49,660 procedure where you take this information, 1984 01:36:49,660 --> 01:36:51,940 you sample what tomorrow's weather is going to be, 1985 01:36:51,940 --> 01:36:53,980 using that you sample the next day's weather, 1986 01:36:53,980 --> 01:36:58,390 and the result of that is you can form this Markov chain where x zero, 1987 01:36:58,390 --> 01:37:01,120 the weather on day zero, is sunny, the next day is sunny, 1988 01:37:01,120 --> 01:37:04,240 maybe the next day it changes to raining, then raining, then raining. 1989 01:37:04,240 --> 01:37:06,910 And the pattern that this Markov chain follows, 1990 01:37:06,910 --> 01:37:08,890 given the distribution that we had access to, 1991 01:37:08,890 --> 01:37:11,850 this transition model here, is that when it's sunny, 1992 01:37:11,850 --> 01:37:13,600 it tends to stay sunny for a little while. 1993 01:37:13,600 --> 01:37:16,100 The next couple days tend to be sunny too. 1994 01:37:16,100 --> 01:37:19,735 And when it's raining, it tends to stay raining as well. 1995 01:37:19,735 --> 01:37:21,860 And so you get a Markov chain that looks like this. 1996 01:37:21,860 --> 01:37:23,193 And you can do analysis on this. 1997 01:37:23,193 --> 01:37:25,630 You can say, given that today is raining, 1998 01:37:25,630 --> 01:37:27,790 what is the probability that tomorrow it's raining, 1999 01:37:27,790 --> 01:37:29,770 or you can begin to ask probability questions, 2000 01:37:29,770 --> 01:37:33,970 like what is the probability of this sequence of five values-- sun, sun, 2001 01:37:33,970 --> 01:37:35,200 rain, rain, rain-- 2002 01:37:35,200 --> 01:37:37,610 and answer those sorts of questions too. 2003 01:37:37,610 --> 01:37:40,780 And it turns out there are, again, many Python libraries for interacting 2004 01:37:40,780 --> 01:37:44,620 with models like this-- models of probability with distributions 2005 01:37:44,620 --> 01:37:47,440 and random variables that depend on previous variables 2006 01:37:47,440 --> 01:37:49,720 according to this Markov assumption. 2007 01:37:49,720 --> 01:37:53,090 And pomegranate, too, has ways of dealing with these sorts of variables. 2008 01:37:53,090 --> 01:37:59,800 So I'll go ahead and go into the chain directory 2009 01:37:59,800 --> 01:38:02,590 where I have some information about Markov chains. 2010 01:38:02,590 --> 01:38:05,770 And here I've defined a file called model.py where 2011 01:38:05,770 --> 01:38:08,320 I've defined, in a very similar syntax, this Markov chain. 2012 01:38:08,320 --> 01:38:11,080 And again, the exact syntax doesn't matter so much as the idea 2013 01:38:11,080 --> 01:38:14,410 that I'm encoding this information into a Python program 2014 01:38:14,410 --> 01:38:17,290 so that the program has access to these distributions.
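For reference, a minimal sketch of what a model.py like this might look like, assuming pomegranate's older 0.x API (DiscreteDistribution, ConditionalProbabilityTable, and MarkovChain); treat the exact class names and signatures as assumptions about that library version, since the lecture stresses that the particular syntax matters less than the data being encoded:

from pomegranate import DiscreteDistribution, ConditionalProbabilityTable, MarkovChain

# Starting distribution: 50/50 between sunny and rainy
start = DiscreteDistribution({
    "sun": 0.5,
    "rain": 0.5
})

# Transition model: P(tomorrow's weather | today's weather),
# encoding the same matrix described above
transitions = ConditionalProbabilityTable([
    ["sun", "sun", 0.8],
    ["sun", "rain", 0.2],
    ["rain", "sun", 0.3],
    ["rain", "rain", 0.7]
], [start])

# A Markov chain is just the starting distribution plus the transition model
model = MarkovChain([start, transitions])

# Sample 50 states from the chain: 50 days of simulated weather
print(model.sample(50))

As a sanity check on a chain like this, the probability of one particular sequence, say sun, sun, rain, rain, rain, is just the starting probability times the transition probabilities along the way: 0.5 × 0.8 × 0.2 × 0.7 × 0.7 = 0.0392.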
2015 01:38:17,290 --> 01:38:19,930 I've here defined some starting distributions. 2016 01:38:19,930 --> 01:38:23,020 So every Markov model begins at some point in time, 2017 01:38:23,020 --> 01:38:25,120 and I need to give it some starting distribution. 2018 01:38:25,120 --> 01:38:27,078 And so we'll just say, you know what, to start, 2019 01:38:27,078 --> 01:38:29,380 you can pick 50/50 between sunny and rainy. 2020 01:38:29,380 --> 01:38:33,370 We'll say it's sunny 50% of the time, rainy 50% of the time. 2021 01:38:33,370 --> 01:38:36,430 And then down below, I've here defined the transition model, 2022 01:38:36,430 --> 01:38:39,770 how it is that I transition from one day to the next. 2023 01:38:39,770 --> 01:38:42,520 And here I've encoded that exact same matrix from before, 2024 01:38:42,520 --> 01:38:45,210 that if it was sunny today, then with probability 0.8 2025 01:38:45,210 --> 01:38:47,650 it will be sunny tomorrow, and it will be raining tomorrow 2026 01:38:47,650 --> 01:38:49,540 with probability 0.2. 2027 01:38:49,540 --> 01:38:54,540 And I likewise have another distribution for if it was raining today instead. 2028 01:38:54,540 --> 01:38:56,980 And so that alone defines the Markov model. 2029 01:38:56,980 --> 01:38:59,410 You can begin to answer questions using that model. 2030 01:38:59,410 --> 01:39:02,680 But one thing I'll just do is sample from the Markov chain. 2031 01:39:02,680 --> 01:39:06,130 And it turns out there is a method built into this Markov chain library that 2032 01:39:06,130 --> 01:39:08,440 allows me to sample 50 states from the chain, 2033 01:39:08,440 --> 01:39:13,000 basically just simulating 50 instances of weather. 2034 01:39:13,000 --> 01:39:18,290 And so let me go ahead and run this: python model.py. 2035 01:39:18,290 --> 01:39:22,570 And when I run it, what I get is it is going to sample from this Markov chain 2036 01:39:22,570 --> 01:39:26,498 50 states, 50 days' worth of weather that it's just going to randomly sample. 2037 01:39:26,498 --> 01:39:29,290 And you can imagine sampling many times to be able to get more data 2038 01:39:29,290 --> 01:39:30,820 to be able to do more analysis. 2039 01:39:30,820 --> 01:39:33,580 But here, for example, it's sunny two days 2040 01:39:33,580 --> 01:39:37,360 in a row, rainy a whole bunch of days in a row before it changes back to sun. 2041 01:39:37,360 --> 01:39:41,170 And so you get this model that follows the distribution that we originally 2042 01:39:41,170 --> 01:39:43,960 described: sunny days 2043 01:39:43,960 --> 01:39:49,780 tend to lead to more sunny days, and rainy days tend to lead to more rainy days. 2044 01:39:49,780 --> 01:39:52,060 And that, then, is the Markov model. 2045 01:39:52,060 --> 01:39:56,260 And Markov models rely on us knowing the values of these individual states. 2046 01:39:56,260 --> 01:40:00,490 I know that today is sunny or that today is rainy, and using that information, 2047 01:40:00,490 --> 01:40:04,660 I can draw some sort of inference about what tomorrow is going to be like. 2048 01:40:04,660 --> 01:40:07,130 But in practice, this often isn't the case. 2049 01:40:07,130 --> 01:40:09,310 It often isn't the case that I know for certain 2050 01:40:09,310 --> 01:40:11,620 what the exact state of the world is.
2051 01:40:11,620 --> 01:40:14,710 Oftentimes the state of the world isn't exactly known, 2052 01:40:14,710 --> 01:40:18,480 but I'm able to somehow sense some information about that state: 2053 01:40:18,480 --> 01:40:22,385 a robot or an AI doesn't have exact knowledge about the world around it, 2054 01:40:22,385 --> 01:40:24,510 but it has some sort of sensor, whether that sensor 2055 01:40:24,510 --> 01:40:27,240 is a camera or sensors that detect distance 2056 01:40:27,240 --> 01:40:30,300 or just a microphone that is sensing audio, for example. 2057 01:40:30,300 --> 01:40:33,990 It is sensing data, and that data is somehow 2058 01:40:33,990 --> 01:40:36,930 related to the state of the world even if our AI doesn't actually 2059 01:40:36,930 --> 01:40:41,100 know what the underlying true state of the world 2060 01:40:41,100 --> 01:40:42,730 actually is. 2061 01:40:42,730 --> 01:40:45,480 And for that, we need to get into the world of sensor models, 2062 01:40:45,480 --> 01:40:48,420 a way of describing how 2063 01:40:48,420 --> 01:40:51,600 the hidden state, the underlying true state of the world, 2064 01:40:51,600 --> 01:40:56,880 relates to the observation, what it is that the AI knows or the AI has access 2065 01:40:56,880 --> 01:40:58,810 to. 2066 01:40:58,810 --> 01:41:02,880 And so for example, a hidden state might be a robot's position. 2067 01:41:02,880 --> 01:41:05,650 If a robot is exploring new, uncharted territory, 2068 01:41:05,650 --> 01:41:08,580 the robot likely doesn't know exactly where it is. 2069 01:41:08,580 --> 01:41:10,000 But it does have an observation. 2070 01:41:10,000 --> 01:41:12,510 It has sensor data where it can sense 2071 01:41:12,510 --> 01:41:16,560 how far away possible obstacles around it are, and using that information, 2072 01:41:16,560 --> 01:41:19,230 using the observed information that it has, 2073 01:41:19,230 --> 01:41:22,290 it can infer something about the hidden state, 2074 01:41:22,290 --> 01:41:26,220 because what the true hidden state is influences those observations. 2075 01:41:26,220 --> 01:41:29,370 The robot's true position affects, 2076 01:41:29,370 --> 01:41:33,420 or has some effect upon, the sensor data the robot is able to collect, 2077 01:41:33,420 --> 01:41:36,330 even if the robot doesn't actually know for certain 2078 01:41:36,330 --> 01:41:39,090 what its true position is. 2079 01:41:39,090 --> 01:41:42,300 Likewise, if you think about a voice recognition or a speech recognition 2080 01:41:42,300 --> 01:41:47,600 program that listens to you and is able to respond to you, something like Alexa 2081 01:41:47,600 --> 01:41:50,830 or what Apple and Google are doing with their voice recognition as well, 2082 01:41:50,830 --> 01:41:54,090 you might imagine that the hidden state, the underlying state, 2083 01:41:54,090 --> 01:41:55,740 is what words are actually spoken. 2084 01:41:55,740 --> 01:41:58,290 The true nature of the world contains you 2085 01:41:58,290 --> 01:42:00,270 saying a particular sequence of words. 2086 01:42:00,270 --> 01:42:04,380 But your phone or your smart home device doesn't know for sure 2087 01:42:04,380 --> 01:42:05,940 exactly what words you said. 2088 01:42:05,940 --> 01:42:11,100 The only observation that the AI has access to is some audio waveforms.
2089 01:42:11,100 --> 01:42:13,710 And those audio waveforms are, of course, dependent 2090 01:42:13,710 --> 01:42:16,110 upon this hidden state, and you can infer, 2091 01:42:16,110 --> 01:42:20,520 based on those audio waveforms, what the words spoken likely were, 2092 01:42:20,520 --> 01:42:23,490 but you might not know with 100% certainty what 2093 01:42:23,490 --> 01:42:25,330 that hidden state actually is. 2094 01:42:25,330 --> 01:42:27,630 And it might be a task to try and predict: 2095 01:42:27,630 --> 01:42:30,300 given this observation, given these audio waveforms, 2096 01:42:30,300 --> 01:42:34,142 can you figure out what the actual words spoken are? 2097 01:42:34,142 --> 01:42:35,850 Likewise, you might imagine this on a website. 2098 01:42:35,850 --> 01:42:38,490 True user engagement might be information you don't directly 2099 01:42:38,490 --> 01:42:41,880 have access to, but you can observe data, like website or app 2100 01:42:41,880 --> 01:42:44,220 analytics about how often this button was clicked 2101 01:42:44,220 --> 01:42:47,220 or how often people are interacting with a page in a particular way. 2102 01:42:47,220 --> 01:42:51,190 And you can use that to infer things about your users as well. 2103 01:42:51,190 --> 01:42:54,968 So this type of problem comes up all the time when we're dealing with AI 2104 01:42:54,968 --> 01:42:56,760 and trying to infer things about the world, 2105 01:42:56,760 --> 01:43:00,750 that often the AI doesn't really know the hidden true state of the world. 2106 01:43:00,750 --> 01:43:03,930 All that the AI has access to is some observation 2107 01:43:03,930 --> 01:43:07,440 that is related to the hidden true state, but it's not direct. 2108 01:43:07,440 --> 01:43:08,790 There might be some noise there. 2109 01:43:08,790 --> 01:43:10,985 The audio waveform might have some additional noise 2110 01:43:10,985 --> 01:43:12,360 that might be difficult to parse. 2111 01:43:12,360 --> 01:43:14,910 The sensor data might not be exactly correct. 2112 01:43:14,910 --> 01:43:16,860 There's some noise that might not allow you 2113 01:43:16,860 --> 01:43:19,560 to conclude with certainty what the hidden state is, but can 2114 01:43:19,560 --> 01:43:22,100 allow you to infer what it might be. 2115 01:43:22,100 --> 01:43:24,348 And so the simple example we'll take a look at here 2116 01:43:24,348 --> 01:43:27,390 is imagining the hidden state as the weather, whether it's sunny or rainy, 2117 01:43:27,390 --> 01:43:31,530 and imagine you are programming an AI inside of a building that maybe 2118 01:43:31,530 --> 01:43:34,710 has access to just a camera inside the building, 2119 01:43:34,710 --> 01:43:37,890 and all you have access to is an observation as to 2120 01:43:37,890 --> 01:43:41,790 whether or not employees are bringing an umbrella into the building. 2121 01:43:41,790 --> 01:43:44,290 You can detect whether there's an umbrella or not, 2122 01:43:44,290 --> 01:43:47,700 and so you might have an observation as to whether or not an umbrella is 2123 01:43:47,700 --> 01:43:49,320 brought into the building. 2124 01:43:49,320 --> 01:43:51,690 And using that information, you want to predict 2125 01:43:51,690 --> 01:43:53,790 whether it's sunny or rainy, even if you don't 2126 01:43:53,790 --> 01:43:55,877 know what the underlying weather is. 2127 01:43:55,877 --> 01:43:57,960 So the underlying weather might be sunny or rainy. 2128 01:43:57,960 --> 01:44:01,462 And if it's raining, obviously people are more likely to bring an umbrella.
2129 01:44:01,462 --> 01:44:03,420 And so whether or not people bring an umbrella, 2130 01:44:03,420 --> 01:44:06,773 your observation tells you something about the hidden state. 2131 01:44:06,773 --> 01:44:08,940 And of course, this is a bit of a contrived example, 2132 01:44:08,940 --> 01:44:11,370 but the idea here is to think about this more 2133 01:44:11,370 --> 01:44:14,370 generally: any time you observe something, 2134 01:44:14,370 --> 01:44:18,025 that observation has to do with some underlying hidden state. 2135 01:44:18,025 --> 01:44:21,150 And so to try and model this type of idea where we have these hidden states 2136 01:44:21,150 --> 01:44:24,180 and observations, rather than just use a Markov model, which 2137 01:44:24,180 --> 01:44:26,160 has state, state, state, state, each of which 2138 01:44:26,160 --> 01:44:29,700 is connected by that transition matrix that we described before, 2139 01:44:29,700 --> 01:44:32,640 we're going to use what we call a hidden Markov model-- 2140 01:44:32,640 --> 01:44:34,740 very similar to a Markov model, but this is 2141 01:44:34,740 --> 01:44:37,920 going to allow us to model a system that has hidden states 2142 01:44:37,920 --> 01:44:41,520 that we don't directly observe along with some observed event 2143 01:44:41,520 --> 01:44:43,740 that we do actually see. 2144 01:44:43,740 --> 01:44:45,720 And so in addition to that transition model 2145 01:44:45,720 --> 01:44:48,780 that we still need, saying, given the underlying state of the world, 2146 01:44:48,780 --> 01:44:52,440 sunny or rainy, what's the probability of tomorrow's weather, 2147 01:44:52,440 --> 01:44:56,310 we also need another model that, given some state, is 2148 01:44:56,310 --> 01:44:58,500 going to give us an observation: green, 2149 01:44:58,500 --> 01:45:01,440 yes, someone brings an umbrella into the office, or red, 2150 01:45:01,440 --> 01:45:03,930 no, nobody brings umbrellas into the office. 2151 01:45:03,930 --> 01:45:06,772 And so the observation might be that if it's sunny, 2152 01:45:06,772 --> 01:45:09,480 then odds are nobody is going to bring an umbrella to the office. 2153 01:45:09,480 --> 01:45:11,760 But maybe some people are just being cautious 2154 01:45:11,760 --> 01:45:14,490 and they do bring an umbrella to the office anyways. 2155 01:45:14,490 --> 01:45:17,725 And if it's raining, then with much higher probability 2156 01:45:17,725 --> 01:45:20,100 people are going to bring umbrellas into the office. 2157 01:45:20,100 --> 01:45:23,280 But maybe, if the rain was unexpected, some people didn't bring an umbrella, 2158 01:45:23,280 --> 01:45:25,990 and so that outcome has some probability as well. 2159 01:45:25,990 --> 01:45:28,860 So using the observations, you can begin to predict, 2160 01:45:28,860 --> 01:45:32,070 with reasonable likelihood, what the underlying state is 2161 01:45:32,070 --> 01:45:35,440 even if you don't actually get to observe the underlying state, 2162 01:45:35,440 --> 01:45:39,030 if you don't get to see what the hidden state is actually equal to. 2163 01:45:39,030 --> 01:45:41,540 This here we'll often call the sensor model. 2164 01:45:41,540 --> 01:45:44,280 It's also often called the emission probabilities 2165 01:45:44,280 --> 01:45:48,120 because the state, the underlying state, emits some sort of emission 2166 01:45:48,120 --> 01:45:49,660 that you then observe. 2167 01:45:49,660 --> 01:45:53,220 And so that can be another way of describing that same idea.
2168 01:45:53,220 --> 01:45:55,860 And the sensor Markov assumption that we're going to use 2169 01:45:55,860 --> 01:45:59,340 is this assumption that the evidence variable, the thing we observe, 2170 01:45:59,340 --> 01:46:03,480 the emission that gets produced, depends only on the corresponding state, 2171 01:46:03,480 --> 01:46:06,620 meaning I can predict whether or not people will bring umbrellas 2172 01:46:06,620 --> 01:46:11,310 based entirely on whether it is sunny or rainy today. 2173 01:46:11,310 --> 01:46:13,950 Of course, again, this assumption might not hold in practice-- 2174 01:46:13,950 --> 01:46:15,458 in practice, 2175 01:46:15,458 --> 01:46:17,250 whether or not people bring umbrellas might 2176 01:46:17,250 --> 01:46:20,042 depend not just on today's weather, but also on yesterday's weather 2177 01:46:20,042 --> 01:46:20,910 and the day before. 2178 01:46:20,910 --> 01:46:23,100 But for simplification purposes, it can be 2179 01:46:23,100 --> 01:46:25,920 helpful to apply this sort of assumption just 2180 01:46:25,920 --> 01:46:29,130 to allow us to be able to reason about these probabilities a little more 2181 01:46:29,130 --> 01:46:30,130 easily. 2182 01:46:30,130 --> 01:46:34,770 And if we're able to approximate it, we can still often get a very good answer. 2183 01:46:34,770 --> 01:46:37,710 And so what these hidden Markov models end up looking like is a little 2184 01:46:37,710 --> 01:46:41,730 something like this, where now, rather than just have one chain of states-- 2185 01:46:41,730 --> 01:46:43,860 like, sun, sun, rain, rain, rain-- 2186 01:46:43,860 --> 01:46:49,650 we instead have this upper level, which is the underlying state of the world, 2187 01:46:49,650 --> 01:46:53,070 is it sunny or is it rainy, and those are connected by that transition 2188 01:46:53,070 --> 01:46:54,690 matrix we described before. 2189 01:46:54,690 --> 01:46:57,510 But each of these states produces an emission, 2190 01:46:57,510 --> 01:47:01,590 produces an observation that I see, that on this day it was sunny, 2191 01:47:01,590 --> 01:47:04,917 and people didn't bring umbrellas, and on this day it was sunny, 2192 01:47:04,917 --> 01:47:07,500 but people did bring umbrellas, and on this day it was raining 2193 01:47:07,500 --> 01:47:09,960 and people did bring umbrellas, and so on and so forth. 2194 01:47:09,960 --> 01:47:12,930 And so each of these underlying states, represented 2195 01:47:12,930 --> 01:47:16,740 by x sub t, for t equals 0, 1, 2, and so on and so forth, 2196 01:47:16,740 --> 01:47:19,450 produces some sort of observation or emission, 2197 01:47:19,450 --> 01:47:20,950 which is what the E stands for-- 2198 01:47:20,950 --> 01:47:25,700 E sub 0, E sub 1, E sub 2, so on and so forth. 2199 01:47:25,700 --> 01:47:28,893 And so this, too, is a way of trying to represent this idea. 2200 01:47:28,893 --> 01:47:31,560 And what you want to think about is that these underlying states 2201 01:47:31,560 --> 01:47:35,790 are the true nature of the world, the robot's position as it moves over time, 2202 01:47:35,790 --> 01:47:39,030 and that produces some sort of sensor data that might be observed, 2203 01:47:39,030 --> 01:47:41,490 or what people are actually saying, where you use 2204 01:47:41,490 --> 01:47:45,390 the emission data, the audio waveforms you detect, in order to process 2205 01:47:45,390 --> 01:47:47,330 that data and try and figure it out.
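To see how an emission model ties an observation back to the hidden state, here is a small self-contained sketch in plain Python of a single Bayes' rule update; the 0.2 and 0.9 emission probabilities are the same illustrative numbers used in the demo later on:

# Emission (sensor) model: P(observation | hidden state)
emissions = {
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

# Belief about the hidden state before seeing anything: 50/50
belief = {"sun": 0.5, "rain": 0.5}

# Observe one emission and update the belief with Bayes' rule
observation = "umbrella"
unnormalized = {state: emissions[state][observation] * p
                for state, p in belief.items()}
total = sum(unnormalized.values())
posterior = {state: p / total for state, p in unnormalized.items()}

print(posterior)  # roughly {'sun': 0.18, 'rain': 0.82}

Seeing a single umbrella moves the belief from 50/50 to roughly 18/82 in favor of rain, which is exactly the kind of inference a hidden Markov model automates across a whole sequence of days.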
2206 01:47:47,330 --> 01:47:49,830 And there are a number of possible tasks that you might want 2207 01:47:49,830 --> 01:47:52,150 to do given this kind of information. 2208 01:47:52,150 --> 01:47:54,750 And one of the simplest is trying to infer something 2209 01:47:54,750 --> 01:47:58,560 about the future or the past or about these sorts of hidden states 2210 01:47:58,560 --> 01:47:59,580 that might exist. 2211 01:47:59,580 --> 01:48:01,310 And so the tasks that you'll often see-- 2212 01:48:01,310 --> 01:48:03,893 and we're not going to go into the mathematics of these tasks, 2213 01:48:03,893 --> 01:48:07,020 but they're all based on this same idea of conditional probabilities 2214 01:48:07,020 --> 01:48:09,990 and using the probability distributions we have 2215 01:48:09,990 --> 01:48:12,180 to draw these sorts of conclusions. 2216 01:48:12,180 --> 01:48:16,410 One task is called filtering, which is, given observations from the start 2217 01:48:16,410 --> 01:48:20,310 until now, calculate the distribution for the current state, 2218 01:48:20,310 --> 01:48:23,520 meaning given information from the beginning of time 2219 01:48:23,520 --> 01:48:26,610 until now about which days people brought an umbrella 2220 01:48:26,610 --> 01:48:28,770 or didn't bring an umbrella, can I calculate 2221 01:48:28,770 --> 01:48:32,280 the probability of the current state: today, is it sunny 2222 01:48:32,280 --> 01:48:33,570 or is it raining? 2223 01:48:33,570 --> 01:48:35,670 Another task that might be possible is prediction, 2224 01:48:35,670 --> 01:48:37,320 which is looking towards the future. 2225 01:48:37,320 --> 01:48:39,690 Given observations about people bringing umbrellas 2226 01:48:39,690 --> 01:48:43,350 from the beginning of when we started counting time until now, 2227 01:48:43,350 --> 01:48:47,710 can I figure out the distribution for tomorrow: is it sunny or is it raining? 2228 01:48:47,710 --> 01:48:51,240 And you can also go backwards, as well, with smoothing, where I can say, 2229 01:48:51,240 --> 01:48:54,810 given observations from start until now, calculate the distributions 2230 01:48:54,810 --> 01:48:56,460 for some past state. 2231 01:48:56,460 --> 01:49:00,090 I know that yesterday people brought umbrellas and today people brought 2232 01:49:00,090 --> 01:49:03,780 umbrellas, and so given two days' worth of data of people bringing umbrellas, 2233 01:49:03,780 --> 01:49:06,713 what's the probability that yesterday it was raining? 2234 01:49:06,713 --> 01:49:08,880 And the fact that I know people brought umbrellas today 2235 01:49:08,880 --> 01:49:11,160 might inform that inference as well. 2236 01:49:11,160 --> 01:49:13,740 It might influence those probabilities.
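To make the filtering task just described concrete, here is a rough sketch of the forward algorithm in plain Python, using the same illustrative transition and emission numbers as the rest of this example; it tracks the distribution over the current hidden state as each day's observation arrives (this is one common convention for the forward pass, not the only one):

# Transition model: P(tomorrow's weather | today's weather)
transitions = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

# Emission model: P(observation | hidden state)
emissions = {
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

def forward(observations, prior=None):
    """Filtering: distribution over the current state given
    all observations from the start until now."""
    belief = dict(prior) if prior else {"sun": 0.5, "rain": 0.5}
    for observation in observations:
        # Predict: push the current belief through the transition model
        predicted = {
            s2: sum(belief[s1] * transitions[s1][s2] for s1 in belief)
            for s2 in belief
        }
        # Update: weight each state by its emission probability, then normalize
        unnormalized = {s: emissions[s][observation] * p
                        for s, p in predicted.items()}
        total = sum(unnormalized.values())
        belief = {s: p / total for s, p in unnormalized.items()}
    return belief

# How likely is it that today is rainy, given three days of observations?
print(forward(["umbrella", "umbrella", "no umbrella"]))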
2237 01:49:13,740 --> 01:49:17,340 And there's also a most likely explanation task, 2238 01:49:17,340 --> 01:49:19,510 in addition to other tasks that might exist as well, 2239 01:49:19,510 --> 01:49:21,750 which is, given observations 2240 01:49:21,750 --> 01:49:25,920 from the start up until now, figuring out the most likely sequence of states, 2241 01:49:25,920 --> 01:49:28,528 and this is what we're going to take a look at now, this idea 2242 01:49:28,528 --> 01:49:30,570 that if I have all these observations-- umbrella, 2243 01:49:30,570 --> 01:49:32,790 no umbrella, umbrella, no umbrella-- can I 2244 01:49:32,790 --> 01:49:36,990 calculate the most likely states of sun, rain, sun, rain, and whatnot that 2245 01:49:36,990 --> 01:49:41,610 actually represented the true weather that would produce these observations? 2246 01:49:41,610 --> 01:49:43,590 And this is quite common when you're trying 2247 01:49:43,590 --> 01:49:46,530 to do something like voice recognition, for example, that you have 2248 01:49:46,530 --> 01:49:49,830 these emissions of audio waveforms and you would like to calculate, 2249 01:49:49,830 --> 01:49:52,260 based on all of the observations that you have, 2250 01:49:52,260 --> 01:49:54,750 what is the most likely sequence of actual words 2251 01:49:54,750 --> 01:49:59,100 or syllables or sounds that the user actually made when they were speaking 2252 01:49:59,100 --> 01:50:01,230 to this particular device, or other tasks that 2253 01:50:01,230 --> 01:50:03,740 might come up in that context as well. 2254 01:50:03,740 --> 01:50:07,800 And so we can try this out by going ahead and going into the HMM 2255 01:50:07,800 --> 01:50:11,790 directory, HMM for Hidden Markov Model. 2256 01:50:11,790 --> 01:50:17,350 And here what I've done is I've defined a model where this model first 2257 01:50:17,350 --> 01:50:22,410 defines my possible states, sun and rain, along with their emission 2258 01:50:22,410 --> 01:50:25,690 probabilities, the observation model or the emission model, 2259 01:50:25,690 --> 01:50:30,310 where here, given that I know that it's sunny, the probability that I 2260 01:50:30,310 --> 01:50:32,590 see people bring an umbrella is 0.2. 2261 01:50:32,590 --> 01:50:35,470 The probability of no umbrella is 0.8. 2262 01:50:35,470 --> 01:50:37,288 And likewise, if it's raining, then people 2263 01:50:37,288 --> 01:50:38,830 are more likely to bring an umbrella. 2264 01:50:38,830 --> 01:50:40,630 Umbrella has a probability of 0.9. 2265 01:50:40,630 --> 01:50:42,580 No umbrella has a probability of 0.1. 2266 01:50:42,580 --> 01:50:47,350 So the actual underlying hidden states, those states are sun and rain. 2267 01:50:47,350 --> 01:50:50,500 But the things that I observe, the observations that I can see, 2268 01:50:50,500 --> 01:50:56,270 are either umbrella or no umbrella. 2269 01:50:56,270 --> 01:51:00,730 To this, then, I also need to add a transition matrix, same as before, 2270 01:51:00,730 --> 01:51:04,540 saying that if today is sunny, then tomorrow is more likely to be sunny, 2271 01:51:04,540 --> 01:51:07,770 and if today is rainy, then tomorrow is more likely to be raining. 2272 01:51:07,770 --> 01:51:10,130 As with before, I give it some starting probabilities, 2273 01:51:10,130 --> 01:51:14,050 saying, at first, 50/50 chance for whether it's sunny or rainy, 2274 01:51:14,050 --> 01:51:17,570 and then I can create the model based on that information.
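For reference, a sketch of what that model definition might look like, again assuming pomegranate's older 0.x API (DiscreteDistribution, HiddenMarkovModel.from_matrix, and bake are that version's names; treat the exact calls as assumptions):

import numpy
from pomegranate import DiscreteDistribution, HiddenMarkovModel

# Emission probabilities: what each hidden state looks like from the outside
sun = DiscreteDistribution({"umbrella": 0.2, "no umbrella": 0.8})
rain = DiscreteDistribution({"umbrella": 0.9, "no umbrella": 0.1})
states = [sun, rain]

# Transition matrix: P(tomorrow's weather | today's weather), same as before
transitions = numpy.array([
    [0.8, 0.2],  # from sun
    [0.3, 0.7],  # from rain
])

# Starting probabilities: 50/50 between sunny and rainy
starts = numpy.array([0.5, 0.5])

# Create the hidden Markov model from these pieces
model = HiddenMarkovModel.from_matrix(
    transitions, states, starts,
    state_names=["sun", "rain"]
)
model.bake()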
2275 01:51:17,570 --> 01:51:19,990 Again, the exact syntax of this is not so important; 2276 01:51:19,990 --> 01:51:23,770 what matters is the data that I am now encoding into a program, such 2277 01:51:23,770 --> 01:51:27,350 that now I can begin to do some inference. 2278 01:51:27,350 --> 01:51:31,270 So I can give my program, for example, a list of observations-- 2279 01:51:31,270 --> 01:51:34,420 umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth, 2280 01:51:34,420 --> 01:51:35,478 no umbrella, no umbrella. 2281 01:51:35,478 --> 01:51:37,270 And I would like to calculate, I would like 2282 01:51:37,270 --> 01:51:41,110 to figure out, the most likely explanation for these observations. 2283 01:51:41,110 --> 01:51:42,640 What is likely? 2284 01:51:42,640 --> 01:51:43,660 Was it rain the whole time? 2285 01:51:43,660 --> 01:51:46,720 Or is it more likely that the no-umbrella day was actually sunny 2286 01:51:46,720 --> 01:51:48,742 and then it switched back to being rainy? 2287 01:51:48,742 --> 01:51:50,200 And that's an interesting question. 2288 01:51:50,200 --> 01:51:52,360 We might not be sure because it might just 2289 01:51:52,360 --> 01:51:56,410 be that it just so happened on this rainy day people decided not to bring 2290 01:51:56,410 --> 01:52:00,580 an umbrella or it could be that it switched from rainy to sunny back 2291 01:52:00,580 --> 01:52:04,450 to rainy, which doesn't seem too likely, but it certainly could happen. 2292 01:52:04,450 --> 01:52:07,060 And using the data we give to the Hidden Markov Model, 2293 01:52:07,060 --> 01:52:10,620 our model can begin to predict these answers, can begin to figure it out. 2294 01:52:10,620 --> 01:52:13,750 So we're going to go ahead and just predict these observations. 2295 01:52:13,750 --> 01:52:15,750 And then for each of those predictions, go ahead 2296 01:52:15,750 --> 01:52:17,292 and print out what the prediction is. 2297 01:52:17,292 --> 01:52:19,780 And this library just so happens to have a function called 2298 01:52:19,780 --> 01:52:23,142 predict that does this prediction process for me. 2299 01:52:23,142 --> 01:52:28,270 So I run python sequence.py, and the result I get is this. 2300 01:52:28,270 --> 01:52:31,450 This is the prediction, based on the observations, of what 2301 01:52:31,450 --> 01:52:34,750 all of those states are likely to be, and it's likely to be rain, then rain. 2302 01:52:34,750 --> 01:52:36,625 In this case, it thinks that what most likely 2303 01:52:36,625 --> 01:52:39,940 happened is that it was sunny for a day and then went back to being rainy. 2304 01:52:39,940 --> 01:52:42,700 But in different situations, if it was rainy for longer, maybe, 2305 01:52:42,700 --> 01:52:44,750 or if the probabilities were slightly different, 2306 01:52:44,750 --> 01:52:48,190 you might imagine that it's more likely that it was rainy all the way through, 2307 01:52:48,190 --> 01:52:53,250 and it just so happened on one rainy day people decided not to bring umbrellas. 2308 01:52:53,250 --> 01:52:55,750 And so here, too, Python libraries can begin 2309 01:52:55,750 --> 01:52:58,730 to allow for this sort of inference procedure.
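And a sketch of what that sequence.py might look like under the same assumptions about the library, where predict is taken to return indices into model.states that can be mapped back to state names (again, an assumption about pomegranate's 0.x behavior, and the particular list of observations here is just an example):

from model import model

# Observed emissions: on which days people brought umbrellas
observations = [
    "umbrella", "umbrella", "no umbrella", "umbrella",
    "umbrella", "umbrella", "umbrella", "no umbrella", "no umbrella",
]

# Most likely underlying weather for each observed day
predictions = model.predict(observations)
for prediction in predictions:
    print(model.states[prediction].name)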
2310 01:52:58,730 --> 01:53:02,410 And by taking what we know and by putting it in terms of these tasks 2311 01:53:02,410 --> 01:53:06,310 that already exist, these general tasks that work with Hidden Markov Models, 2312 01:53:06,310 --> 01:53:10,540 any time we can take an idea and formulate it as a Hidden Markov Model, 2313 01:53:10,540 --> 01:53:12,550 formulate it as something that has hidden 2314 01:53:12,550 --> 01:53:15,700 states and observed emissions that result from those states, 2315 01:53:15,700 --> 01:53:17,830 then we can take advantage of these algorithms that 2316 01:53:17,830 --> 01:53:21,740 are known to exist for trying to do this sort of inference. 2317 01:53:21,740 --> 01:53:25,720 So now we've seen a couple of ways that AI can begin to deal with uncertainty. 2318 01:53:25,720 --> 01:53:28,840 We've taken a look at probability and how we can use probability 2319 01:53:28,840 --> 01:53:32,200 to describe numerically things that are likely or more likely or less 2320 01:53:32,200 --> 01:53:34,990 likely to happen than other events or other variables. 2321 01:53:34,990 --> 01:53:37,750 And using that information, we can begin to construct 2322 01:53:37,750 --> 01:53:40,810 these standard types of models, things like Bayesian networks 2323 01:53:40,810 --> 01:53:43,180 and Markov chains and Hidden Markov Models, 2324 01:53:43,180 --> 01:53:47,110 that all allow us to be able to describe how particular events relate 2325 01:53:47,110 --> 01:53:49,900 to other events or how the values of particular variables 2326 01:53:49,900 --> 01:53:53,050 relate to other variables, not for certain, but with some sort 2327 01:53:53,050 --> 01:53:54,550 of probability distribution. 2328 01:53:54,550 --> 01:53:57,970 And by formulating things in terms of these models that already exist, 2329 01:53:57,970 --> 01:54:00,160 we can take advantage of Python libraries 2330 01:54:00,160 --> 01:54:02,950 that implement these sorts of models already and allow us just 2331 01:54:02,950 --> 01:54:06,880 to be able to use them to produce some sort of resulting effect. 2332 01:54:06,880 --> 01:54:08,890 So all of this then allows our AI to begin 2333 01:54:08,890 --> 01:54:11,290 to deal with these sorts of uncertain problems 2334 01:54:11,290 --> 01:54:13,720 so that our AI doesn't need to know things for certain 2335 01:54:13,720 --> 01:54:17,080 but can make inferences based on information it isn't certain about. 2336 01:54:17,080 --> 01:54:19,930 Next time, we'll take a look at additional types of problems 2337 01:54:19,930 --> 01:54:22,870 that we can solve by taking advantage of AI-related algorithms 2338 01:54:22,870 --> 01:54:26,140 even beyond the world of the types of problems we've already explored. 2339 01:54:26,140 --> 01:54:28,230 We'll see you next time. 2340 01:54:28,230 --> 01:54:29,000