[MUSIC PLAYING]

BRIAN YU: All right, welcome back, everyone, to an Introduction to Artificial Intelligence with Python. Last time we took a look at how the AI inside of our computers can represent knowledge. We represented that knowledge in the form of logical sentences in a variety of different logical languages, and the idea was that we wanted our AI to be able to represent knowledge or information and somehow use those pieces of information to derive new pieces of information via inference — to be able to take some information and deduce some additional conclusions based on the information that it already knew for sure.

But in reality, when we think about computers and we think about AI, very rarely are our machines going to be able to know things for sure. Oftentimes there's going to be some amount of uncertainty in the information that our AIs or our computers are dealing with, where they might believe something with some probability — as we'll soon discuss what probability is all about and what it means — but not entirely for certain. And we want to use the information the machine has some knowledge about, even if it doesn't have perfect knowledge, to still be able to make inferences and draw conclusions.

So you might imagine, for example, in the context of a robot that has some sensors and is exploring some environment, it might not know exactly where it is or exactly what's around it, but it does have access to some data that can allow it to draw inferences with some probability: there's some likelihood that one thing is true or another. Or you can imagine contexts where there is a little bit more randomness and uncertainty, something like predicting the weather, where you might not be able to know tomorrow's weather with 100% certainty, but you can probably infer with some probability what tomorrow's weather is going to be based on today's weather, yesterday's weather, and other data that you might have access to as well.

And so oftentimes we can distill this in terms of possible events that might happen and what the likelihood of those events is. This comes up a lot in games, for example, where there's an element of chance inside of those games.
So you imagine rolling the dice. You're not sure exactly what the die roll is going to be, but you know it's going to be one of the possibilities from one to six, for example.

And so here, now, we introduce the idea of probability theory. What we'll take a look at today is beginning by looking at the mathematical foundations of probability theory, getting an understanding of some of the key concepts within probability, and then diving into how we can use probability and the ideas that we look at mathematically to represent some ideas in terms of models that we can put into our computers, in order to program an AI that is able to use information about probability to draw inferences — to make some judgments about the world with some probability or likelihood of being true.

So probability ultimately boils down to the idea that there are possible worlds, which we represent here using the little Greek letter omega. The idea of a possible world is that, when I roll a die, there are six possible worlds that could result from it: I can roll a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and each of those is a possible world. Each of those possible worlds has some probability of being true — the probability that I do roll a 1 or a 2 or a 3 or something else. And we represent that probability using the capital letter P and then, in parentheses, what it is that we want the probability of. So this right here would be the probability of some possible world, as represented by the little letter omega.

Now, there are a couple of basic axioms of probability that become relevant as we consider how we deal with probability and how we think about it. First and foremost, every probability value must range between zero and one, inclusive. The smallest value any probability can have is the number zero, which represents an impossible event — something like rolling a die and getting a seven. If the die only has numbers one through six, the event that I roll a seven is impossible, so it would have probability zero. And on the other end of the spectrum, probability can range all the way up to the number one, meaning an event is certain to happen — that I roll a die and the number is less than 10, for example.
That is an event that is guaranteed to happen if the only sides on my die are one through six, for instance. And probabilities can range through any real number in between these two values, where, generally speaking, a higher value for the probability means an event is more likely to take place, and a lower value for the probability means the event is less likely to take place.

The other key rule for probability looks a little bit like this. This sigma notation, if you haven't seen it before, refers to summation — the idea that we're going to be adding up a whole sequence of values. And this sigma notation is going to come up a couple of times today, because as we deal with probability, oftentimes we're adding up a whole bunch of individual values or individual probabilities to get some other value. What this notation means is that if I sum over all of the possible worlds omega that are in big Omega, which represents the set of all the possible worlds — meaning I take every world in the set of possible worlds and add up all of their probabilities — what I ultimately get is the number one. So if I take all the possible worlds and add up each of their probabilities, I should get the number one at the end, meaning all probabilities just need to sum to one.

So, for example, if you imagine I have a fair die with numbers one through six and I roll the die, each one of these rolls has an equal probability of taking place, and the probability is one over six. Each of these probabilities is between zero and one — zero meaning impossible and one meaning certain — and if you add up all of these probabilities for all of the possible worlds, you get the number one.

And we can represent any one of those probabilities like this: the probability that we roll the number two, for example, is just one over six. Every six times we roll the die, we'd expect that one time, for instance, the die might come up as a two. Its probability is not certain, but it's a little more than nothing, for instance.

And so this is all fairly straightforward for just a single die. But things get more interesting as our models of the world get a little bit more complex.
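To make those two axioms concrete in the course's language, Python, here is a minimal sketch; the name `worlds` and the use of `Fraction` are just illustrative choices, not anything from the lecture itself:

```python
from fractions import Fraction

# The six possible worlds for one roll of a fair die, each with probability 1/6.
worlds = {omega: Fraction(1, 6) for omega in range(1, 7)}

# Axiom 1: every probability value lies between 0 and 1, inclusive.
assert all(0 <= p <= 1 for p in worlds.values())

# Axiom 2: summing P(omega) over every world omega in the set yields exactly 1.
assert sum(worlds.values()) == 1

print(worlds[2])  # P(roll = 2) -> 1/6
```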
Let's imagine now that we're not just dealing with a single die, but that we have two dice, for example — a red die here and a blue die there — and I care not just about what the individual roll is, but about the sum of the two rolls. In this case, the sum of the two rolls is the number three. How do I begin to reason about what the probabilities look like if, instead of having one die, I now have two dice?

Well, what we might imagine is that we could first consider: what are all of the possible worlds? In this case, all of the possible worlds are just every combination of the red and blue die that I could come up with. The red die could be a 1 or a 2 or a 3 or a 4 or a 5 or a 6, and for each of those possibilities, the blue die, likewise, could also be a 1 or 2 or 3 or 4 or 5 or 6. And it just so happens that, in this particular case, each of these possible combinations is equally likely — all of these various possible worlds are equally likely.

That's not always going to be the case. As you imagine more complex models that we could try to build and things that we could try to represent in the real world, it's probably not going to be the case that every single possible world is always equally likely. But in the case of fair dice, where on any given die roll any one number has just as good a chance of coming up as any other number, we can consider all of these possible worlds to be equally likely.

But even though all of the possible worlds are equally likely, that doesn't necessarily mean that their sums are equally likely. So if we consider what the sum of each of these pairs is — 1 plus 1, that's a 2; 2 plus 1 is a 3 — and consider for each of these possible pairs of numbers what their sum ultimately is, we can notice that there are some patterns here, and it's not the case that every number comes up equally likely. If you consider seven, for example — what's the probability that when I roll two dice their sum is seven? — there are several ways this can happen. There are six possible worlds where the sum is seven: it could be a one and a six, or a two and a five, or a three and a four, a four and a three, and so forth.
But if you instead consider the probability that I roll two dice and the sum of those two die rolls is 12, for example — well, looking at this diagram, there's only one possible world in which that can happen, and that's the possible world where both the red die and the blue die come up as sixes to give us the sum total of 12.

So based on just taking a look at this diagram, we see that some of these probabilities are different. The probability that the sum is a seven must be greater than the probability that the sum is a 12. And we can represent that even more formally by saying, OK, the probability that we sum to 12 is one out of 36. Out of the 36 equally likely possible worlds — six squared, because we have six options for the red die and six options for the blue die — only one of them sums to 12. Whereas, on the other hand, for the probability that two dice rolls sum up to the number seven, out of those 36 possible worlds there were six worlds where the sum was seven, and so we get six over 36, which we can simplify as a fraction to just one over six.

So here, now, we're able to represent these different ideas of probability, representing some events that might be more likely and other events that are less likely as well. And these sorts of judgments, where we're figuring out, just in the abstract, what is the probability that this thing takes place, are generally known as unconditional probabilities: some degree of belief we have in some proposition, some fact about the world, in the absence of any other evidence, without knowing any additional information. If I roll a die, what's the chance it comes up as a two? Or if I roll two dice, what's the chance that the sum of those two die rolls is a seven?
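Before moving on, a quick sanity check of those two numbers: a short Python sketch that enumerates the 36 equally likely worlds and counts the favorable ones (the helper name `p_sum` is just illustrative):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely possible worlds for a red die and a blue die.
worlds = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Unconditional probability that the two dice sum to `target`."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6  -- six of the 36 worlds sum to seven
print(p_sum(12))  # 1/36 -- only the world where both dice are sixes
```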
But usually, when we're thinking about probability — especially when we're thinking about training an AI to intelligently know something about the world and make predictions based on that information — it's not unconditional probability that our AI is dealing with but, rather, conditional probability: probability where, rather than having no original knowledge, we have some initial knowledge about the world and how the world actually works.

So conditional probability is the degree of belief in a proposition given some evidence that has already been revealed to us. What does this look like? Well, it looks like this in terms of notation. We're going to represent conditional probability as the probability of a, then a vertical bar, and then b. The way to read this is that the thing on the left-hand side of the vertical bar is what we want the probability of — here, I want the probability that a is true, that it is the event that actually does take place. And on the right side of the vertical bar is our evidence, the information that we already know for certain about the world — for example, that b is true. So the way to read this entire expression is: what is the probability of a given b, the probability that a is true given that we already know that b is true?

And this type of judgment — conditional probability, the probability of one thing given some other fact — comes up quite a lot when we think about the types of calculations we might want our AI to be able to do. For example, we might care about the probability of rain today given that we know that it rained yesterday. We could think about the probability of rain today just in the abstract: what is the chance that today it rains? But usually we have some additional evidence. I know for certain that it rained yesterday, and so I would like to calculate the probability that it rains today given that I know that it rained yesterday.

Or you might imagine that I want to know the probability that my optimal route to my destination changes given the current traffic conditions: whether or not traffic conditions change might change the probability that this route is actually the optimal route. Or you might imagine, in a medical context, that I want to know the probability that a patient has a particular disease given the results of some tests that have been performed on that patient. I have some evidence, the results of that test, and I would like to know the probability that the patient has a particular disease.
So this notion of conditional probability comes up everywhere as we begin to think about what we would like to reason about — being able to reason a little more intelligently by taking into account evidence that we already have. We're more able to get an accurate result for the likelihood that someone has this disease if we know this evidence, the results of the test, as opposed to just calculating the unconditional probability — asking what the probability is that they have the disease without any evidence to back up our result one way or the other.

So now that we've got this idea of what conditional probability is, the next question we have to ask is: all right, how do we calculate conditional probability? How do we figure out, mathematically, if I have an expression like this, how to get a number from it? What does conditional probability actually mean? Well, the formula for conditional probability looks a little something like this: the probability of a given b — the probability that a is true given that we know that b is true — is equal to a fraction: the probability that a and b are both true, divided by just the probability that b is true.

And the way to intuitively think about this is that if I want to know the probability that a is true given that b is true, I want to consider all the ways they could both be true — but the only worlds that I care about are the worlds where b is already true. I can ignore all the cases where b isn't true, because those aren't relevant to my ultimate computation; they're not relevant to what it is that I want to get information about.

So let's take a look at an example. Let's go back to that example of rolling two dice and the idea that those two dice might sum up to the number 12. We discussed earlier that the unconditional probability that I roll two dice and they sum to 12 is one out of 36, because out of the 36 possible worlds that I might care about, in only one of them is the sum of those two dice 12 — it's only when red is six and blue is also six. But let's say now that I have some additional information. I now want to know: what is the probability that the two dice sum to 12 given that I know that the red die was a six?
So I already have some evidence. I already know the red die is a six. I don't know what the blue die is; that information isn't given to me in this expression. But given the fact that I know that the red die rolled a six, what is the probability that we sum to 12?

And so we can begin to do the math using that expression from before. Here, again, are all of the possibilities — all of the possible combinations of the red die being one through six and the blue die being one through six. And I might consider, first: what is the probability of my evidence, my b variable, where I want to know the probability that the red die is a six? Well, the probability that the red die is a six is just one out of six. So those six worlds where the red die is a six are really the only worlds that I care about here now. All the rest of them are irrelevant to my calculation, because I already have this evidence that the red die was a six, so I don't need to care about all of the other possibilities that could result.

So now, in addition to the probability that the red die rolled a six, the other piece of information I need to know in order to calculate this conditional probability is the probability that both of my variables, a and b, are true: the probability that the red die is a six and the dice sum to 12. So what is the probability that both of these things happen? Well, it only happens in one out of these 36 cases, and it's the case where both the red and the blue die are equal to six. And so this probability is equal to one over 36.

And so to get the conditional probability that the sum is 12 given that I know that the red die is equal to six, I just divide these two values: 1/36 divided by 1/6 gives us the probability of 1/6. Given that I know that the red die rolled a value of six, the probability that the sum of the two dice is 12 is also one over six. And that probably makes intuitive sense to you, too, because if the red die is a six, the only way for me to get to a 12 is if the blue die also rolls a six, and we know that the probability of the blue die rolling a six is one over six.
So in this case, the conditional probability seems fairly straightforward. But this idea of calculating a conditional probability by looking at the probability that both of these events take place is an idea that's going to come up again and again. This is the definition, now, of conditional probability, and we're going to use that definition as we think about probability more generally to be able to draw conclusions about the world. This, again, is that formula: the probability of a given b is equal to the probability that a and b take place, divided by the probability of b.

And you'll see this formula sometimes written in a couple of different ways. You could imagine, algebraically, multiplying both sides of this equation by the probability of b to get rid of the fraction, and you'll get an expression like this: the probability of a and b is just the probability of b times the probability of a given b. Or — since a and b, in this expression, are interchangeable; a and b is the same thing as b and a — you could equivalently represent the probability of a and b as the probability of a times the probability of b given a, just switching all of the a's and b's. These three are all equivalent ways of representing what joint probability means. And so you'll sometimes see all of these equations, and they might be useful to you as you begin to reason about probability and to think about what values might be taking place in the real world.

Now, sometimes when we deal with probability, we don't just care about a Boolean event — did this happen or did this not happen? Sometimes we might want the ability to represent variable values in a probability space, where some variable might take on multiple different possible values. And in probability theory, we call such a variable a random variable: just some variable that has some domain of values that it can take on. So what do I mean by this? Well, I might have a random variable that is just called Roll, for example, that has six possible values.
Roll is my variable, and the possible values — the domain of values that it can take on — are 1, 2, 3, 4, 5, and 6. And I might like to know the probability of each. In this case, they happen to all be the same, but for other random variables that might not be the case. For example, I might have a random variable to represent the weather, where the domain of values it could take on are things like sun or cloudy or rainy or windy or snowy, and each of those might have a different probability. I care about knowing the probability that the weather equals sun or that the weather equals clouds, for instance, and I might like to do some mathematical calculations based on that information.

Other random variables might be something like traffic: what are the odds that there is no traffic, light traffic, or heavy traffic? Traffic, in this case, is my random variable, and the values that that random variable can take on are either none or light or heavy. And I, the person doing these calculations — I, the person encoding these random variables into my computer — need to make the decision as to what these possible values actually are. You might imagine, for example, that if I care about whether or not I make it to a flight on time, my flight has a couple of possible values that it could take on: my flight could be on time, my flight could be delayed, or my flight could be canceled. So Flight, in this case, is my random variable, and these are the values that it can take on.

And often I'll want to know something about the probability that my random variable takes on each of those possible values. This is what we then call a probability distribution. A probability distribution takes a random variable and gives me the probability for each of the possible values in its domain. So in the case of this flight, for example, my probability distribution might look something like this. My probability distribution says the probability that the random variable Flight is equal to the value "on time" is 0.6 — or, put in more human-friendly terms, the likelihood that my flight is on time is 60%, for example. And in this case, the probability that my flight is delayed is 30%.
The probability that my flight is canceled is 10%, or 0.1. And if you sum up all of these possible values, the sum is going to be 1. If you take all of the possible worlds — here, my three possible worlds for the value of the random variable Flight — and add them all up together, the result needs to be the number one, per that axiom of probability theory that we've discussed before.

So this now is one way of representing this probability distribution for the random variable Flight. Sometimes you'll see it represented a little bit more concisely, since this is pretty verbose for really just trying to express three possible values. And so often you'll instead see this same idea represented using a vector. All a vector is is a sequence of values — as opposed to just a single value, I might have multiple values. And so I could instead represent this idea this way: bold P — a larger P — generally meaning the probability distribution of this variable Flight, is equal to this vector represented in angle brackets. The probability distribution is 0.6, 0.3, and 0.1, and I would just have to know that this probability distribution is in the order of on time, delayed, and canceled to know how to interpret this vector: the first value in the vector is the probability that my flight is on time, the second value is the probability that my flight is delayed, and the third value is the probability that my flight is canceled.

And so this is just an alternate, more concise way of representing the same idea. Oftentimes you'll see us just talk about a probability distribution over a random variable, and whenever we talk about that, what we're really doing is trying to figure out the probabilities of each of the possible values that that random variable can take on. This notation is just a little bit more succinct, even though it can sometimes be a little confusing depending on the context in which you see it. So we'll start to look at examples where we use this sort of notation to describe probability and to describe events that might take place.
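One plausible way to mirror both representations in Python is a sketch like this, with a plain dictionary for the verbose form and a tuple for the ordered vector form — both just illustrative choices:

```python
# Explicit, verbose representation: each value in the domain of the
# random variable Flight mapped to its probability.
flight_distribution = {
    "on time": 0.6,
    "delayed": 0.3,
    "canceled": 0.1,
}

# The probabilities over the whole domain must sum to 1.
assert abs(sum(flight_distribution.values()) - 1.0) < 1e-9

# Concise vector form: meaningful only if we agree on the order
# <on time, delayed, canceled>.
P_flight = (0.6, 0.3, 0.1)
```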
A couple of other important ideas to know with regard to probability theory. One is this idea of independence, and independence refers to the idea that knowledge of one event doesn't influence the probability of another event. So, for example, in the context of my two dice rolls with the red die and the blue die, those two events — the roll of the red die and the roll of the blue die — are independent. Knowing the result of the red die doesn't change the probabilities for the blue die; it doesn't give me any additional information about what the value of the blue die is ultimately going to be.

But that's not always going to be the case. You might imagine that in the case of weather, something like clouds and rain, those are probably not independent: if it is cloudy, that might increase the probability that later in the day it's going to rain. So some information informs some other event or some other random variable. Independence refers to the idea that one event doesn't influence the other, and if they're not independent, then there might be some relationship.

So mathematically, formally, what does independence actually mean? Well, recall this formula from before: the probability of a and b is the probability of a times the probability of b given a. And the more intuitive way to think about this is that to know how likely it is that a and b both happen, let's first figure out the likelihood that a happens, and then, given that we know that a happens, figure out the likelihood that b happens, and multiply those two things together. But if a and b were independent — meaning knowing a doesn't change anything about the likelihood that b is true — then the probability of b given a, the probability that b is true given that I know a is true, shouldn't really be any different; a shouldn't influence b at all. So the probability of b given a is really just the probability of b, if it is true that a and b are independent. And so this right here is one definition of what it means for a and b to be independent: the probability of a and b is just the probability of a times the probability of b.
Any time you find two events a and b where this relationship holds, you can say that a and b are independent. So an example of that might be the dice that we were taking a look at before. Here, if I wanted the probability of red being a six and blue being a six, that's just the probability that red is a six multiplied by the probability that blue is a six — each of those one over six, and their product equal to one over 36. So I can say that these two events are independent.

What wouldn't be independent would be a case like this: the probability that the red die rolls a six and the red die rolls a four. If you just naively took red die six, red die four — well, if I'm only rolling the die once, you might imagine the naive approach is to say that each of these has a probability of one over six, so multiplying them together gives a probability of one over 36. But, of course, if you're only rolling the red die once, there's no way you could get two different values for the red die. It couldn't be both a six and a four, so the probability should be zero. If you were to multiply the probability of red six times the probability of red four, that would equal one over 36 — but, of course, that's not true, because we know that there is no way, probability zero, that when we roll the red die once we get both a six and a four, because only one of those possibilities can actually be the result.

And so we can say that the event that the red roll is six and the event that the red roll is four are not independent. If I know that the red roll is a six, I know that the red roll cannot possibly be a four, so these things are not independent. And instead, if I wanted to calculate the probability, I would need to use conditional probability, as in the regular definition of the probability of two events taking place. And the probability of this, now — well, the probability of the red roll being a six, that's one over six. But what's the probability that the roll is a four given that the roll is a six? Well, this is just zero, because there's no way for the red roll to be a four given that we already know the red roll is a six. And so if we do all that multiplication, the value we get is the number zero.
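As a sketch of that distinction, this checks the independence definition P(a and b) = P(a) · P(b) against both pairs of events, using the same enumerated worlds as before:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def p(event):
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

# Independent: knowing the red die tells us nothing about the blue die,
# so P(red=6 and blue=6) = P(red=6) * P(blue=6) = 1/36.
assert p(lambda w: w[0] == 6 and w[1] == 6) \
    == p(lambda w: w[0] == 6) * p(lambda w: w[1] == 6)

# Not independent: one roll can't be both a six and a four, so the joint
# probability is 0 -- not the naive 1/6 * 1/6 = 1/36.
assert p(lambda w: w[0] == 6 and w[0] == 4) == 0
```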
So this idea of conditional probability is going to come up again and again, especially as we begin to reason about multiple different random variables that might be interacting with each other in some way. And this gets us to one of the most important rules in probability theory, which is known as Bayes' rule. It turns out that just using the information we've already learned about probability and applying a little bit of algebra, we can actually derive Bayes' rule for ourselves. And it's a very important rule when it comes to inference and thinking about probability in the context of what a computer can do, or what a mathematician could do, by having access to information about probability.

So let's go back to these equations to derive Bayes' rule ourselves. We know the probability of a and b — the likelihood that a and b take place — is the likelihood of b times the likelihood of a given that we know that b is already true. And likewise, the probability of a and b is the probability of a times the probability of b given that we know that a is already true. This is a symmetric relationship: the order doesn't matter, and a and b and b and a mean the same thing, so in these equations we can just swap out a and b to represent the exact same idea.

So we know that these two equations are already true — we've seen that already — and now let's just do a little bit of algebraic manipulation. Both of these expressions on the right-hand side are equal to the probability of a and b. So what I can do is take these two expressions on the right-hand side and set them equal to each other: if they're both equal to the probability of a and b, then they both must be equal to each other. So the probability of a times the probability of b given a is equal to the probability of b times the probability of a given b. And now all we're going to do is a little bit of division: I'm going to divide both sides by P of a, and now I get what is Bayes' rule. The probability of b given a is equal to the probability of b times the probability of a given b, divided by the probability of a.
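Written out as a single chain of equations, the derivation just described is:

```latex
P(a)\,P(b \mid a) \;=\; P(a \land b) \;=\; P(b)\,P(a \mid b)
\quad\Longrightarrow\quad
P(b \mid a) \;=\; \frac{P(b)\,P(a \mid b)}{P(a)}
```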
And sometimes in Bayes' rule you'll see the order of these two arguments switched: instead of b times a given b, it'll be a given b times b. That ultimately doesn't matter, because in multiplication you can switch the order of the two things you're multiplying without changing the result. But this here right now is the most common formulation of Bayes' rule: the probability of b given a is equal to the probability of a given b, times the probability of b, divided by the probability of a.

And this rule, it turns out, is really important when it comes to trying to infer things about the world, because it means you can express one conditional probability — the conditional probability of b given a — using knowledge about the probability of a given b, the reverse of that conditional probability. So let's first do a little bit of an example with this, just to see how we might use it, and then explore what this means a little bit more generally.

So we're going to construct a situation where I have some information. There are two events that I care about: the idea that it's cloudy in the morning and the idea that it is rainy in the afternoon. Those are two different possible events that could take place — cloudy in the morning, or the AM, and rainy in the PM. And what I care about is: given clouds in the morning, what is the probability of rain in the afternoon? That's a reasonable question I might ask. In the morning, I look outside, or an AI's camera looks outside, and sees that there are clouds in the morning, and we want to figure out the probability that in the afternoon there is going to be rain.

Of course, in the abstract, we don't have access to this kind of information, but we can use data to begin to try and figure this out. So let's imagine, now, that I have access to some pieces of information. I have access to the idea that 80% of rainy afternoons start out with a cloudy morning. And you might imagine that I could have gathered this data just by looking at data over a period of time: I know that 80% of the time, when it's raining in the afternoon, it was cloudy that morning.
I also know that 40% of days have cloudy mornings, and I also know that 10% of days have rainy afternoons. And now, using this information, I would like to figure out: given clouds in the morning, what is the probability that it rains in the afternoon? I want to know the probability of afternoon rain given morning clouds, and I can do that, in particular, using the fact I just described: if I know that 80% of rainy afternoons start with cloudy mornings, then I know the probability of cloudy mornings given rainy afternoons. So using the reverse conditional probability, I can figure that out.

Expressed in terms of Bayes' rule, this is what that would look like: the probability of rain given clouds is the probability of clouds given rain, times the probability of rain, divided by the probability of clouds. Here I'm just substituting in the values of a and b from the equation for Bayes' rule from before. And then I can just do the math. I have this information: I know that 80% of the time, if it was raining in the afternoon, then there were clouds in the morning — so 0.8 here. The probability of rain is 0.1, because 10% of days had rainy afternoons, and 40% of days had cloudy mornings. I do the math — 0.8 times 0.1, divided by 0.4 — and I can figure out that the answer is 0.2. So the probability that it rains in the afternoon, given that it was cloudy in the morning, is 0.2 in this case.

And this, now, is an application of Bayes' rule: the idea that using one conditional probability, we can get the reverse conditional probability. And this is often useful when one of the conditional probabilities might be easier for us to know about or easier for us to have data about; using that information, we can calculate the other conditional probability. So what does this look like? Well, it means that knowing the probability of cloudy mornings given rainy afternoons, we can calculate the probability of rainy afternoons given cloudy mornings. Or, more generally, if we know the probability of some visible effect — some effect that we can see and observe — given some unknown cause that we're not sure about, well, then we can calculate the probability of that unknown cause given the visible effect.
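A direct transcription of that computation into Python might look like this; `bayes` is a hypothetical helper name for illustration, not a library function:

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

# P(rain in the PM | clouds in the AM), given the three facts above:
#   P(clouds | rain) = 0.8, P(rain) = 0.1, P(clouds) = 0.4
print(bayes(0.8, 0.1, 0.4))  # 0.2
```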
So what might that look like? Well, in the context of medicine, for example, I might know the probability of some medical test result given a disease. Like, I know that if someone has a disease, then x percent of the time the medical test result will show up as this, for instance. And using that information, I can then calculate: given that I know the medical test result, what is the likelihood that someone has the disease? The first piece of information is the one that is usually easier to know, easier to immediately have access to data for, and the second is the information that I actually want to calculate.

Or I might want to know, for example, that some percentage of counterfeit bills have blurry text around the edges, because counterfeit printers aren't nearly as good at printing text precisely. So I have some information: given that something is a counterfeit bill, x percent of counterfeit bills have blurry text, for example. And using that information, I can calculate some piece of information that I might actually want to know: given that I know there's blurry text on a bill, what is the probability that that bill is counterfeit? So given one conditional probability, I can calculate the other conditional probability as well.

And so now we've taken a look at a couple of different types of probability. We've looked at unconditional probability, where I just look at the probability of an event occurring given no additional evidence that I might have access to, and we've also looked at conditional probability, where I have some sort of evidence and would like to use that evidence to calculate some other probability as well.

The other kind of probability that will be important for us to think about is joint probability, and this is when we're considering the likelihood of multiple different events simultaneously. And so what do we mean by this? Well, for example, I might have probability distributions that look a little something like this: I want to know the probability distribution of clouds in the morning, and that distribution looks like this. Forty percent of the time, C, which is my random variable here, is equal to cloudy, and 60% of the time it's not cloudy. So here is just a simple probability distribution that is effectively telling me that 40% of the time it's cloudy.
00:34:37.710 --> 00:34:41.219 I might also have a probability distribution for rain in the afternoon 00:34:41.219 --> 00:34:44.670 where 10% of the time, or with probability 0.1, 00:34:44.670 --> 00:34:48.600 it is raining in the afternoon and with probability 0.9 00:34:48.600 --> 00:34:51.090 it is not raining in the afternoon. 00:34:51.090 --> 00:34:54.580 And using just these two pieces of information, 00:34:54.580 --> 00:34:57.540 I don't actually have a whole lot of information about how these two 00:34:57.540 --> 00:34:59.980 variables relate to each other. 00:34:59.980 --> 00:35:02.940 But I could if I had access to their joint probability, 00:35:02.940 --> 00:35:05.550 meaning for every combination of these two things-- 00:35:05.550 --> 00:35:09.330 meaning morning cloudy and afternoon rain, morning cloudy and afternoon 00:35:09.330 --> 00:35:12.960 not rain, morning not cloudy and afternoon rain, and morning 00:35:12.960 --> 00:35:15.150 not cloudy and afternoon not raining-- 00:35:15.150 --> 00:35:17.700 if I had access to values for each of those four, 00:35:17.700 --> 00:35:20.340 I'd have more information-- so information that'd 00:35:20.340 --> 00:35:22.390 be organized in a table like this. 00:35:22.390 --> 00:35:25.690 And this, rather than just a probability distribution, 00:35:25.690 --> 00:35:27.970 is a joint probability distribution. 00:35:27.970 --> 00:35:31.090 It tells me the probability distribution of each 00:35:31.090 --> 00:35:34.930 of the possible combinations of values that these random variables 00:35:34.930 --> 00:35:36.160 can take on. 00:35:36.160 --> 00:35:39.640 So if I want to know, what is the probability that on any given day 00:35:39.640 --> 00:35:42.400 it is both cloudy and rainy, well, I would say, 00:35:42.400 --> 00:35:45.100 all right, we're looking at cases where it is cloudy 00:35:45.100 --> 00:35:48.460 and cases where it is raining and the intersection of those two, 00:35:48.460 --> 00:35:51.310 that row and that column, is 0.08. 00:35:51.310 --> 00:35:55.210 So that is the probability that it is both cloudy and rainy 00:35:55.210 --> 00:35:57.070 using that information. 00:35:57.070 --> 00:36:00.010 And using this joint probability table, 00:36:00.010 --> 00:36:02.260 using these joint probabilities, I can 00:36:02.260 --> 00:36:04.930 begin to draw other pieces of information 00:36:04.930 --> 00:36:07.420 about things like conditional probability. 00:36:07.420 --> 00:36:11.890 So I might ask a question like, what is the probability distribution of clouds 00:36:11.890 --> 00:36:14.470 given that I know that it is raining, meaning 00:36:14.470 --> 00:36:16.660 I know for sure that it's raining. 00:36:16.660 --> 00:36:19.780 Tell me the probability distribution over whether it's cloudy 00:36:19.780 --> 00:36:22.720 or not given that I know already that it is, in fact, raining. 00:36:22.720 --> 00:36:25.480 And here I'm using C to stand for that random variable. 00:36:25.480 --> 00:36:28.030 I'm looking for a distribution, meaning the answer to this 00:36:28.030 --> 00:36:29.860 is not going to be a single value.
00:36:29.860 --> 00:36:33.760 It's going to be two values, a vector of two values where the first value is 00:36:33.760 --> 00:36:37.960 the probability of clouds, the second value is the probability that it is not cloudy, 00:36:37.960 --> 00:36:40.240 but the sum of those two values is going to be one, 00:36:40.240 --> 00:36:42.470 because when you add up the probabilities of all 00:36:42.470 --> 00:36:47.190 of the possible worlds, the result that you get must be the number one. 00:36:47.190 --> 00:36:50.740 And, well, what do we know about how to calculate a conditional probability? 00:36:50.740 --> 00:36:56.590 Well, we know that the probability of a given b is the probability of a and b 00:36:56.590 --> 00:36:59.320 divided by the probability of b. 00:36:59.320 --> 00:37:00.740 So what does this mean? 00:37:00.740 --> 00:37:03.610 Well, it means that I can calculate the probability of clouds 00:37:03.610 --> 00:37:08.260 given that it's raining as the probability of clouds 00:37:08.260 --> 00:37:11.230 and raining divided by the probability of rain. 00:37:11.230 --> 00:37:15.220 And this comma here for the probability distribution of clouds and rain, 00:37:15.220 --> 00:37:17.710 this comma sort of stands in for the word "and." 00:37:17.710 --> 00:37:21.460 You'll sort of see the logical operator AND and the comma used interchangeably. 00:37:21.460 --> 00:37:24.550 This means the probability distribution over the clouds 00:37:24.550 --> 00:37:29.382 and knowing the fact that it is raining divided by the probability of rain. 00:37:29.382 --> 00:37:31.840 And the interesting thing to note here and what we'll often 00:37:31.840 --> 00:37:34.210 do in order to simplify our mathematics is 00:37:34.210 --> 00:37:38.260 that dividing by the probability of rain, the probability of rain 00:37:38.260 --> 00:37:40.150 here is just some numerical constant. 00:37:40.150 --> 00:37:40.900 It is some number. 00:37:40.900 --> 00:37:43.780 Dividing by probability of rain is just dividing 00:37:43.780 --> 00:37:46.090 by some constant or, in other words, multiplying 00:37:46.090 --> 00:37:48.100 by the inverse of that constant. 00:37:48.100 --> 00:37:50.620 And it turns out that oftentimes we can just 00:37:50.620 --> 00:37:53.230 not worry about what the exact value of this is 00:37:53.230 --> 00:37:56.370 and just know that it is, in fact, a constant value, 00:37:56.370 --> 00:37:57.620 and we'll see why in a moment. 00:37:57.620 --> 00:38:01.390 So instead of expressing this as this joint probability divided 00:38:01.390 --> 00:38:06.790 by the probability of rain, sometimes we'll just represent it as alpha times 00:38:06.790 --> 00:38:10.830 the numerator here, the probability distribution of C, this variable, 00:38:10.830 --> 00:38:13.370 and that we know that it is raining, for instance. 00:38:13.370 --> 00:38:16.600 So all we've done here is said this value of one 00:38:16.600 --> 00:38:19.840 over the probability of rain, that's really just a constant that we're 00:38:19.840 --> 00:38:23.140 going to multiply by at the end, equivalent to dividing by the probability of rain. 00:38:23.140 --> 00:38:26.770 We'll just call it alpha for now and deal with it a little bit later.
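Concretely, if the joint table from above is represented as a Python dictionary, that division is just a lookup and a divide (this layout is only an illustrative sketch):

```python
# Joint distribution over (morning clouds, afternoon rain)
joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,
}

# P(rain) comes from the joint table itself: 0.08 + 0.02 = 0.1
p_rain = joint[("cloudy", "rain")] + joint[("not cloudy", "rain")]

# P(cloudy | rain) = P(cloudy, rain) / P(rain)
print(joint[("cloudy", "rain")] / p_rain)  # approximately 0.8
```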
00:38:26.770 --> 00:38:30.130 But the key idea here now-- and this is an idea that's going to come up again-- 00:38:30.130 --> 00:38:34.390 is that the conditional distribution of C given rain 00:38:34.390 --> 00:38:38.200 is proportional to, meaning just some factor multiplied by, 00:38:38.200 --> 00:38:42.580 the joint probability of C and rain being true. 00:38:42.580 --> 00:38:44.030 And so how do we figure this out? 00:38:44.030 --> 00:38:46.720 Well, this is going to be the probability that it is cloudy 00:38:46.720 --> 00:38:50.200 and it's raining, which is 0.08, and the probability that it's not 00:38:50.200 --> 00:38:53.350 cloudy and it's raining, which is 0.02. 00:38:53.350 --> 00:38:55.180 And so we get alpha times-- 00:38:55.180 --> 00:38:58.060 here now is that probability distribution. 00:38:58.060 --> 00:39:00.370 0.08 is clouds and rain. 00:39:00.370 --> 00:39:04.210 0.02 is not cloudy and rain. 00:39:04.210 --> 00:39:08.260 But, of course, 0.08 and 0.02 don't sum up to the number one. 00:39:08.260 --> 00:39:10.780 And we know that in a probability distribution, 00:39:10.780 --> 00:39:13.030 if you consider all of the possible values, 00:39:13.030 --> 00:39:15.730 they must sum up to a probability of one. 00:39:15.730 --> 00:39:20.350 And so we know that we just need to figure out some constant to normalize, 00:39:20.350 --> 00:39:23.830 so to speak, these values, something we can multiply or divide by 00:39:23.830 --> 00:39:26.600 to get it so that all of these probabilities sum up to one. 00:39:26.600 --> 00:39:29.390 And it turns out that if we multiply both numbers by 10, 00:39:29.390 --> 00:39:32.290 then we can get that result of 0.8 and 0.2. 00:39:32.290 --> 00:39:34.990 The proportions are still equivalent, but now 0.8 00:39:34.990 --> 00:39:38.750 plus 0.2, those sum up to the number 1. 00:39:38.750 --> 00:39:41.080 So take a look at this and see if you can understand, 00:39:41.080 --> 00:39:43.870 step by step, how it is we're getting from one point to another. 00:39:43.870 --> 00:39:48.190 But the key idea here is that by using the joint probabilities, 00:39:48.190 --> 00:39:52.480 these probabilities that it is both cloudy and rainy and that it is not 00:39:52.480 --> 00:39:56.740 cloudy and rainy, I can take that information and figure out 00:39:56.740 --> 00:39:59.800 the conditional probability-- given that it's raining, 00:39:59.800 --> 00:40:02.320 what is the chance that it's cloudy versus not cloudy-- 00:40:02.320 --> 00:40:06.740 just by multiplying by some normalization constant, so to speak. 00:40:06.740 --> 00:40:08.860 And this is what a computer can begin to use 00:40:08.860 --> 00:40:12.130 to be able to interact with these various different types 00:40:12.130 --> 00:40:13.207 of probabilities. 00:40:13.207 --> 00:40:15.790 And it turns out there are a number of other probability rules 00:40:15.790 --> 00:40:19.570 that are going to be useful to us as we begin to explore how we can actually 00:40:19.570 --> 00:40:22.860 use this information to encode into our computers 00:40:22.860 --> 00:40:27.030 some more complex analysis that we might want to do about probability 00:40:27.030 --> 00:40:30.793 and distributions and random variables that we might be interacting with. 00:40:30.793 --> 00:40:33.210 So here are a couple of those important probability rules. 00:40:33.210 --> 00:40:35.850 One of the simplest rules is just this negation rule. 00:40:35.850 --> 00:40:39.420 What is the probability of not event a?
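Before getting to that rule, here is what the normalization step above might look like in code, a small sketch of the alpha idea:

```python
def normalize(values):
    """Scale a list of probabilities so they sum to 1; alpha is the scaling constant."""
    alpha = 1 / sum(values)
    return [alpha * v for v in values]

# P(C | rain) is proportional to the joint values [P(cloudy, rain), P(not cloudy, rain)]
print(normalize([0.08, 0.02]))  # approximately [0.8, 0.2]
```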
00:40:39.420 --> 00:40:41.970 So a is an event that has some probability, 00:40:41.970 --> 00:40:45.840 and I would like to know, what is the probability that a does not occur? 00:40:45.840 --> 00:40:50.340 And it turns out it's just one minus P of a, which makes sense 00:40:50.340 --> 00:40:52.470 because if those are the two possible cases, 00:40:52.470 --> 00:40:56.770 either a happens or a doesn't happen, then when you add up those two cases, 00:40:56.770 --> 00:41:02.970 you must get one, which means P of not a must just be one minus P of a 00:41:02.970 --> 00:41:06.930 because P of a and P of not a must sum up to the number one. 00:41:06.930 --> 00:41:10.050 They must include all of the possible cases. 00:41:10.050 --> 00:41:14.010 We've seen an expression for calculating the probability of a and b. 00:41:14.010 --> 00:41:18.180 We might also reasonably want to calculate the probability of a or b. 00:41:18.180 --> 00:41:21.480 What is the probability that one thing happens or another thing happens? 00:41:21.480 --> 00:41:23.550 So for example, I might want to calculate, 00:41:23.550 --> 00:41:26.010 what is the probability that if I roll two dice, 00:41:26.010 --> 00:41:29.970 a red die and a blue die, what is the likelihood that a is a six or b 00:41:29.970 --> 00:41:31.860 is a six, one or the other? 00:41:31.860 --> 00:41:34.860 And what you might imagine you could do-- and the wrong way to approach it-- 00:41:34.860 --> 00:41:38.810 would be just to say, all right, well, a comes up as a six, 00:41:38.810 --> 00:41:41.727 the red die comes up as a six with probability one over six. 00:41:41.727 --> 00:41:42.810 The same for the blue die. 00:41:42.810 --> 00:41:44.070 It's also one over six. 00:41:44.070 --> 00:41:47.520 Add them together and you get 2/6, otherwise known as 1/3. 00:41:47.520 --> 00:41:50.820 But this suffers from the problem of over counting, 00:41:50.820 --> 00:41:54.330 that we've double counted the case where both a and b, both 00:41:54.330 --> 00:41:57.690 the red die and the blue die, both come up as a six, 00:41:57.690 --> 00:41:59.780 and I've counted that instance twice. 00:41:59.780 --> 00:42:02.070 So to resolve this, the actual expression 00:42:02.070 --> 00:42:05.100 for calculating the probability of a or b 00:42:05.100 --> 00:42:08.070 uses what we call the inclusion-exclusion formula. 00:42:08.070 --> 00:42:11.510 So I take the probability of a, add it to the probability of b. 00:42:11.510 --> 00:42:12.900 That's all the same as before. 00:42:12.900 --> 00:42:16.440 But then I need to exclude the cases that I've double counted. 00:42:16.440 --> 00:42:21.930 So I subtract from that the probability of a and b, and that 00:42:21.930 --> 00:42:23.520 gets me the result for a or b. 00:42:23.520 --> 00:42:27.348 I consider all the cases where a is true and all the cases where b is true. 00:42:27.348 --> 00:42:29.640 And if you imagine this is like a Venn diagram of cases 00:42:29.640 --> 00:42:31.830 where a is true, cases where b is true, I just 00:42:31.830 --> 00:42:34.500 need to subtract out the middle to get rid of the cases 00:42:34.500 --> 00:42:37.860 that I have over counted by double counting them inside of both 00:42:37.860 --> 00:42:41.520 of these individual expressions. 00:42:41.520 --> 00:42:43.530 One other rule that's going to be quite helpful 00:42:43.530 --> 00:42:45.770 is a rule called marginalization.
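Before moving on, here is the two-dice calculation as a quick check, assuming the dice are fair and independent:

```python
# Inclusion-exclusion: P(a or b) = P(a) + P(b) - P(a and b)
p_red_six = 1 / 6
p_blue_six = 1 / 6
p_both_six = (1 / 6) * (1 / 6)  # independent dice

p_either_six = p_red_six + p_blue_six - p_both_six
print(p_either_six)  # 11/36, about 0.306 -- not the over-counted 1/3
```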
00:42:45.770 --> 00:42:47.880 So marginalization is answering the question 00:42:47.880 --> 00:42:52.350 of how do I figure out the probability of a using some other variable that I 00:42:52.350 --> 00:42:53.970 might have access to, like b? 00:42:53.970 --> 00:42:56.190 Even if I don't know additional information about it, 00:42:56.190 --> 00:43:00.270 I know that b, some event, can have two possible states. 00:43:00.270 --> 00:43:05.080 Either b happens or b doesn't happen, assuming it's a Boolean, true or false. 00:43:05.080 --> 00:43:07.500 And well, what that means is that for me to be 00:43:07.500 --> 00:43:11.130 able to calculate the probability of a, there are only two cases. 00:43:11.130 --> 00:43:15.930 Either a happens and b happens or a happens and b doesn't happen. 00:43:15.930 --> 00:43:19.200 And those are two disjoint cases, meaning they can't both happen together-- 00:43:19.200 --> 00:43:21.480 either b happens or b doesn't happen. 00:43:21.480 --> 00:43:23.640 They're disjoint or separate cases. 00:43:23.640 --> 00:43:28.140 And so I can figure out the probability of a just by adding up those two cases. 00:43:28.140 --> 00:43:31.770 The probability that a is true is the probability 00:43:31.770 --> 00:43:35.640 that a and b is true plus the probability that a is true 00:43:35.640 --> 00:43:36.810 and b isn't true. 00:43:36.810 --> 00:43:40.123 So by marginalizing, I've looked at the two possible cases 00:43:40.123 --> 00:43:41.040 that might take place. 00:43:41.040 --> 00:43:44.120 Either b happens or b doesn't happen. 00:43:44.120 --> 00:43:47.610 And in either of those cases, I look at, what's the probability that a happens, 00:43:47.610 --> 00:43:50.430 and if I add those together, well, then I get the probability 00:43:50.430 --> 00:43:52.710 that a happens as a whole. 00:43:52.710 --> 00:43:54.030 So take a look at that rule. 00:43:54.030 --> 00:43:57.120 It doesn't matter what b is or how it's related to a. 00:43:57.120 --> 00:43:59.580 So long as I know these joint distributions, 00:43:59.580 --> 00:44:02.280 I can figure out the overall probability of a. 00:44:02.280 --> 00:44:05.130 And this can be a useful way, if I have a joint distribution, 00:44:05.130 --> 00:44:08.550 like the joint distribution of a and b, to just figure out 00:44:08.550 --> 00:44:11.320 some unconditional probability, like the probability of a, 00:44:11.320 --> 00:44:14.520 and we'll see examples of this soon, as well. 00:44:14.520 --> 00:44:17.460 Now, sometimes these might not just be variables 00:44:17.460 --> 00:44:21.160 that are events that either happened or they didn't happen, like b is here. 00:44:21.160 --> 00:44:23.850 They might be some broader probability distribution where 00:44:23.850 --> 00:44:25.800 there are multiple possible values. 00:44:25.800 --> 00:44:28.710 And so here, in order to use this marginalization rule, 00:44:28.710 --> 00:44:34.290 I need to sum up not just over b and not b, but for all of the possible values 00:44:34.290 --> 00:44:36.610 that the other random variable could take on. 00:44:36.610 --> 00:44:39.360 And so here we'll see a version of this rule for random variables, 00:44:39.360 --> 00:44:41.610 and it's going to include that summation notation 00:44:41.610 --> 00:44:46.270 to indicate that I'm summing up, adding up, a whole bunch of individual values. 00:44:46.270 --> 00:44:47.092 So here's the rule.
00:44:47.092 --> 00:44:49.050 It looks a lot more complicated, but it's actually 00:44:49.050 --> 00:44:51.330 equivalent, exactly the same rule. 00:44:51.330 --> 00:44:55.500 What I'm saying here is that if I have two random variables, one called x 00:44:55.500 --> 00:45:01.380 and one called y, well, the probability that x is equal to some value x sub i-- 00:45:01.380 --> 00:45:04.170 this is just some value that this variable takes on-- 00:45:04.170 --> 00:45:05.520 how do I figure it out? 00:45:05.520 --> 00:45:08.760 Well, I'm going to sum up over j, where j 00:45:08.760 --> 00:45:13.380 is going to range over all of the possible values that y can take on. 00:45:13.380 --> 00:45:18.558 Well, let's look at the probability that x equals xi and y equals yj. 00:45:18.558 --> 00:45:20.600 So the exact same rule-- the only difference here 00:45:20.600 --> 00:45:23.360 is now I'm summing up over all of the possible values 00:45:23.360 --> 00:45:27.420 that y can take on, saying let's add up all of those possible cases 00:45:27.420 --> 00:45:31.100 and look at this joint distribution, this joint probability 00:45:31.100 --> 00:45:35.990 that x takes on the value I care about together with all of the possible values for y. 00:45:35.990 --> 00:45:40.910 And if I add all those up, then I can get this unconditional probability 00:45:40.910 --> 00:45:46.397 of what x is equal to, whether or not x is equal to some value x sub i. 00:45:46.397 --> 00:45:48.230 So let's take a look at this rule because it 00:45:48.230 --> 00:45:49.688 does look a little bit complicated. 00:45:49.688 --> 00:45:51.650 Let's try and put a concrete example to it. 00:45:51.650 --> 00:45:54.470 Here, again, is that same joint distribution from before. 00:45:54.470 --> 00:45:58.460 I have cloudy, not cloudy, rainy, not rainy. 00:45:58.460 --> 00:46:00.830 And maybe I want to access some variable. 00:46:00.830 --> 00:46:04.790 I want to know, what is the probability that it is cloudy? 00:46:04.790 --> 00:46:08.550 Well, marginalization says that if I have this joint distribution 00:46:08.550 --> 00:46:12.140 and I want to know, what is the probability that it is cloudy, well, 00:46:12.140 --> 00:46:15.650 I need to consider the other variable, the variable that's not here, 00:46:15.650 --> 00:46:17.060 the idea that it's rainy. 00:46:17.060 --> 00:46:20.780 And I consider the two cases, either it's raining or it's not raining, 00:46:20.780 --> 00:46:24.410 and I just sum up the values for each of those possibilities. 00:46:24.410 --> 00:46:27.380 In other words, the probability that it is cloudy 00:46:27.380 --> 00:46:31.110 is equal to the sum of the probability that it's cloudy 00:46:31.110 --> 00:46:38.090 and it's raining and the probability that it's cloudy and it is not raining. 00:46:38.090 --> 00:46:40.460 And so these, now, are values that I have access to. 00:46:40.460 --> 00:46:44.840 These are values that are just inside of this joint probability table. 00:46:44.840 --> 00:46:47.990 What is the probability that it is both cloudy and rainy? 00:46:47.990 --> 00:46:51.350 Well, it's just the intersection of these two here, which is 0.08, 00:46:51.350 --> 00:46:54.590 and the probability that it's cloudy and not raining is-- all right, 00:46:54.590 --> 00:46:56.480 here's cloudy, here's not raining-- 00:46:56.480 --> 00:46:58.000 it's 0.32. 00:46:58.000 --> 00:47:02.630 So it's 0.08 plus 0.32, which is just equal to 0.4. 00:47:02.630 --> 00:47:06.840 That is the unconditional probability that it is, in fact, cloudy.
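Using the same dictionary representation of the joint table sketched earlier, marginalization is just a sum over the values of the other variable:

```python
joint = {
    ("cloudy", "rain"): 0.08,
    ("cloudy", "no rain"): 0.32,
    ("not cloudy", "rain"): 0.02,
    ("not cloudy", "no rain"): 0.58,
}

# P(C = cloudy) = sum over all values r of Rain of P(C = cloudy, Rain = r)
p_cloudy = sum(p for (c, r), p in joint.items() if c == "cloudy")
print(p_cloudy)  # 0.08 + 0.32 = 0.4
```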
00:47:06.840 --> 00:47:09.530 And so marginalization gives us a way to go 00:47:09.530 --> 00:47:13.360 from these joint distributions to just some individual probability 00:47:13.360 --> 00:47:14.430 that I might care about. 00:47:14.430 --> 00:47:17.222 And you'll see a little bit later why it is that we care about that 00:47:17.222 --> 00:47:19.370 and why that's actually useful to us as we 00:47:19.370 --> 00:47:21.885 begin doing some of these calculations. 00:47:21.885 --> 00:47:25.010 The last rule we'll take a look at before transitioning into something a little 00:47:25.010 --> 00:47:27.200 bit different is this rule of conditioning-- 00:47:27.200 --> 00:47:31.070 very similar to the marginalization rule, but it says that, again, 00:47:31.070 --> 00:47:32.600 if I have two events a and b-- 00:47:32.600 --> 00:47:35.810 but instead of having access to their joint probabilities, 00:47:35.810 --> 00:47:38.180 I have access to their conditional probabilities, 00:47:38.180 --> 00:47:39.920 how they relate to each other. 00:47:39.920 --> 00:47:43.700 Well, again, if I want to know the probability that a happens and I know 00:47:43.700 --> 00:47:47.960 that there's some other variable b, either b happens or b doesn't happen, 00:47:47.960 --> 00:47:50.660 and so I can say that the probability of a 00:47:50.660 --> 00:47:54.920 is the probability of a given b times the probability of b, 00:47:54.920 --> 00:47:57.470 meaning b happened, and given that I know b happened, 00:47:57.470 --> 00:47:59.480 what's the likelihood that a happened? 00:47:59.480 --> 00:48:02.480 And then I consider the other case, that b didn't happen. 00:48:02.480 --> 00:48:05.360 So here is the probability that b didn't happen, 00:48:05.360 --> 00:48:07.880 and here's the probability that a happens given 00:48:07.880 --> 00:48:09.890 that I know that b didn't happen. 00:48:09.890 --> 00:48:13.820 And this is really the equivalent rule, just using conditional probability 00:48:13.820 --> 00:48:16.190 instead of joint probability where I'm saying, 00:48:16.190 --> 00:48:19.790 let's look at both of these two cases and condition on b. 00:48:19.790 --> 00:48:23.480 Look at the case where b happens and look at the case where b doesn't happen 00:48:23.480 --> 00:48:26.560 and look at what probabilities I get as a result. 00:48:26.560 --> 00:48:28.598 And just as in the case of marginalization 00:48:28.598 --> 00:48:30.890 where there was an equivalent rule for random variables 00:48:30.890 --> 00:48:34.850 that could take on multiple possible values in a domain of possible values, 00:48:34.850 --> 00:48:37.530 here, too, conditioning has the same equivalent rule. 00:48:37.530 --> 00:48:41.590 Again, there's a summation to mean I'm summing over all of the possible values 00:48:41.590 --> 00:48:44.070 that some random variable y could take on. 00:48:44.070 --> 00:48:48.140 But if I want to know, what is the probability that x takes on this value, 00:48:48.140 --> 00:48:50.870 then I'm going to sum up over all the values j 00:48:50.870 --> 00:48:53.420 that y could take on and say, all right, what's 00:48:53.420 --> 00:48:56.870 the chance that y takes on that value, yj, and multiply it 00:48:56.870 --> 00:49:00.830 by the conditional probability that x takes on this value given 00:49:00.830 --> 00:49:03.180 that y took on that value yj-- 00:49:03.180 --> 00:49:06.470 so equivalent rule just using conditional probabilities 00:49:06.470 --> 00:49:08.120 instead of joint probabilities.
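The conditioning rule looks like this in code; the numbers here are purely illustrative, not values from the lecture:

```python
# Conditioning: P(a) = P(a | b) * P(b) + P(a | not b) * P(not b)
p_b = 0.1               # e.g., probability of rain (illustrative)
p_a_given_b = 0.2       # e.g., probability the train is delayed given rain
p_a_given_not_b = 0.05  # probability the train is delayed given no rain

p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
print(p_a)  # 0.2 * 0.1 + 0.05 * 0.9 = 0.065
```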
00:49:08.120 --> 00:49:10.790 And using the equation we know about joint probabilities, 00:49:10.790 --> 00:49:13.748 we can translate between these two. 00:49:13.748 --> 00:49:15.790 All right, we've seen a whole lot of mathematics, 00:49:15.790 --> 00:49:18.110 and we've just sort of laid the mathematical foundation. 00:49:18.110 --> 00:49:20.777 And no need to worry if you haven't seen probability in too much 00:49:20.777 --> 00:49:22.370 detail up until this point. 00:49:22.370 --> 00:49:24.500 These are sort of the foundations of the ideas 00:49:24.500 --> 00:49:27.560 that are going to come up as we begin to explore how we can now 00:49:27.560 --> 00:49:31.820 take these ideas from probability and begin to apply them to represent 00:49:31.820 --> 00:49:35.120 something inside of our computer, something inside of the AI agent 00:49:35.120 --> 00:49:39.280 we're trying to design that is able to represent information and probabilities 00:49:39.280 --> 00:49:42.600 and the likelihoods between various different events. 00:49:42.600 --> 00:49:45.020 So there are a number of different probabilistic models 00:49:45.020 --> 00:49:48.290 that we can generate, but the first of the models we're going to talk about 00:49:48.290 --> 00:49:50.600 are what are known as Bayesian networks. 00:49:50.600 --> 00:49:52.670 And a Bayesian network is just going to be 00:49:52.670 --> 00:49:56.090 some network of random variables, connected random variables, 00:49:56.090 --> 00:49:58.850 that are going to represent the dependence 00:49:58.850 --> 00:50:00.260 between these random variables. 00:50:00.260 --> 00:50:03.498 And odds are most random variables in this world 00:50:03.498 --> 00:50:05.540 are not independent of each other, that there's 00:50:05.540 --> 00:50:08.840 some relationship between things that are happening that we care about. 00:50:08.840 --> 00:50:12.200 If it is raining today, that might increase the likelihood 00:50:12.200 --> 00:50:14.750 that my flight or my train gets delayed, for example. 00:50:14.750 --> 00:50:17.610 There is some dependence between these random variables, 00:50:17.610 --> 00:50:22.420 and a Bayesian network is going to be able to capture those dependencies. 00:50:22.420 --> 00:50:23.770 So what is a Bayesian network? 00:50:23.770 --> 00:50:26.430 What is its actual structure, and how does it work? 00:50:26.430 --> 00:50:29.230 Well, a Bayesian network is going to be a directed graph. 00:50:29.230 --> 00:50:31.170 And again, we've seen directed graphs before. 00:50:31.170 --> 00:50:34.170 They are individual nodes with arrows or edges 00:50:34.170 --> 00:50:38.897 that connect one node to another node, pointing in a particular direction. 00:50:38.897 --> 00:50:40.980 And so this directed graph is going to have nodes, 00:50:40.980 --> 00:50:43.860 as well, where each node in this directed graph 00:50:43.860 --> 00:50:47.850 is going to represent a random variable, something like the weather or something 00:50:47.850 --> 00:50:51.340 like whether my train was on time or delayed. 00:50:51.340 --> 00:50:54.780 And we're going to have an arrow from a node x to a node y 00:50:54.780 --> 00:50:57.435 to mean that x is a parent of y. 00:50:57.435 --> 00:50:58.560 So that'll be our notation. 00:50:58.560 --> 00:51:02.940 If there's an arrow from x to y, x is going to be considered a parent of y.
00:51:02.940 --> 00:51:06.360 And the reason that's important is because each of these nodes 00:51:06.360 --> 00:51:09.180 is going to have a probability distribution that we're 00:51:09.180 --> 00:51:13.140 going to store along with it, which is the distribution of x given 00:51:13.140 --> 00:51:16.520 some evidence, given the parents of x. 00:51:16.520 --> 00:51:18.480 So the way to more intuitively think about this 00:51:18.480 --> 00:51:22.260 is the parents are going to be thought of as sort of causes for some effect 00:51:22.260 --> 00:51:24.720 that we're going to observe. 00:51:24.720 --> 00:51:27.780 And so let's take a look at an actual example of a Bayesian network 00:51:27.780 --> 00:51:30.270 and think about the types of logic that might be involved 00:51:30.270 --> 00:51:32.070 in reasoning about that network. 00:51:32.070 --> 00:51:35.580 Let's imagine, for a moment, that I have an appointment out of town 00:51:35.580 --> 00:51:38.510 and I need to take a train in order to get to that appointment. 00:51:38.510 --> 00:51:40.260 So what are the things I might care about? 00:51:40.260 --> 00:51:42.620 Well, I care about getting to my appointment on time. 00:51:42.620 --> 00:51:44.370 Either I make it to my appointment and I'm 00:51:44.370 --> 00:51:46.710 able to attend it or I miss the appointment. 00:51:46.710 --> 00:51:49.440 And you might imagine that that's influenced by the train, 00:51:49.440 --> 00:51:54.000 that the train is either on time or it's delayed, for example. 00:51:54.000 --> 00:51:56.370 But that train itself is also influenced. 00:51:56.370 --> 00:52:00.030 Whether the train is on time or not depends maybe on the rain. 00:52:00.030 --> 00:52:00.822 Is there no rain? 00:52:00.822 --> 00:52:01.530 Is it light rain? 00:52:01.530 --> 00:52:02.737 Is there heavy rain? 00:52:02.737 --> 00:52:05.070 And it might also be influenced by other variables, too. 00:52:05.070 --> 00:52:07.050 It might be influenced, as well, by whether 00:52:07.050 --> 00:52:09.608 or not there's maintenance on the train track, for example. 00:52:09.608 --> 00:52:11.400 If there is maintenance on the train track, 00:52:11.400 --> 00:52:15.660 that probably increases the likelihood that my train is delayed. 00:52:15.660 --> 00:52:19.680 And so we can represent all of these ideas using a Bayesian network that 00:52:19.680 --> 00:52:21.360 looks a little something like this. 00:52:21.360 --> 00:52:25.440 Here I have four nodes representing four random variables 00:52:25.440 --> 00:52:26.970 that I would like to keep track of. 00:52:26.970 --> 00:52:29.190 I have one random variable called Rain that 00:52:29.190 --> 00:52:34.080 can take on three possible values in its domain, either none or light or heavy 00:52:34.080 --> 00:52:36.348 for no rain, light rain, or heavy rain. 00:52:36.348 --> 00:52:38.640 I have a variable called Maintenance for whether or not 00:52:38.640 --> 00:52:42.030 there is maintenance on the train track, which has two possible values, just 00:52:42.030 --> 00:52:42.960 either yes or no. 00:52:42.960 --> 00:52:46.355 Either there is maintenance or there is no maintenance happening on the track. 00:52:46.355 --> 00:52:49.230 Then I have a random variable for the train indicating whether 00:52:49.230 --> 00:52:50.490 the train was on time or not. 00:52:50.490 --> 00:52:53.850 That random variable has two possible values in its domain. 00:52:53.850 --> 00:52:57.730 The train is either on time or the train is delayed.
00:52:57.730 --> 00:52:59.803 And then, finally, I have a random variable 00:52:59.803 --> 00:53:01.470 for whether I make it to my appointment. 00:53:01.470 --> 00:53:04.950 For my appointment down here, I have a random variable called Appointment 00:53:04.950 --> 00:53:09.420 that itself has two possible values, attend and miss. 00:53:09.420 --> 00:53:10.920 And so here are the possible values. 00:53:10.920 --> 00:53:12.960 Here are my four nodes, each of which represents 00:53:12.960 --> 00:53:17.160 a random variable, each of which has a domain of possible values 00:53:17.160 --> 00:53:18.500 that it can take on. 00:53:18.500 --> 00:53:21.980 And the arrows, the edges pointing from one node to another, 00:53:21.980 --> 00:53:26.250 encode some notion of dependence inside of this graph, 00:53:26.250 --> 00:53:28.830 that whether I make it to my appointment or not 00:53:28.830 --> 00:53:32.650 is dependent upon whether the train is on time or delayed. 00:53:32.650 --> 00:53:36.390 And whether the train is on time or delayed is dependent on two things, 00:53:36.390 --> 00:53:38.910 given by the two arrows pointing at this node. 00:53:38.910 --> 00:53:42.350 It is dependent on whether or not there was maintenance on the train track, 00:53:42.350 --> 00:53:45.240 and it is also dependent upon whether or not 00:53:45.240 --> 00:53:47.675 it was raining, or whether it is raining. 00:53:47.675 --> 00:53:49.800 And just to make things a little complicated, let's 00:53:49.800 --> 00:53:53.280 say, as well, that whether or not there's maintenance on the track, 00:53:53.280 --> 00:53:55.260 this too might be influenced by the rain. 00:53:55.260 --> 00:53:57.178 Then if there's heavier rain, well, maybe it's 00:53:57.178 --> 00:53:59.970 less likely that there's going to be maintenance on the train track 00:53:59.970 --> 00:54:02.010 that day because they're more likely to want 00:54:02.010 --> 00:54:05.500 to do maintenance on the track on days when it's not raining, for example. 00:54:05.500 --> 00:54:08.350 And so these nodes might have different relationships between them. 00:54:08.350 --> 00:54:10.770 But the idea is that we can come up with a probability 00:54:10.770 --> 00:54:16.370 distribution for any of these nodes based only upon its parents. 00:54:16.370 --> 00:54:20.158 And so let's look node by node at what this probability distribution might 00:54:20.158 --> 00:54:20.950 actually look like. 00:54:20.950 --> 00:54:24.150 And we'll go ahead and begin with this root node, this Rain node here, which 00:54:24.150 --> 00:54:27.630 is at the top and has no arrows pointing into it, 00:54:27.630 --> 00:54:30.510 which means its probability distribution is not 00:54:30.510 --> 00:54:32.410 going to be a conditional distribution. 00:54:32.410 --> 00:54:33.870 It's not based on anything. 00:54:33.870 --> 00:54:38.250 I just have some probability distribution over the possible values 00:54:38.250 --> 00:54:40.520 for the Rain random variable. 00:54:40.520 --> 00:54:43.590 And that distribution might look a little something like this. 00:54:43.590 --> 00:54:46.170 None, light, and heavy-- each have a possible value. 00:54:46.170 --> 00:54:48.300 Here I'm saying the likelihood of no rain 00:54:48.300 --> 00:54:53.790 is 0.7, of light rain is 0.2, of heavy rain is 0.1, for example. 00:54:53.790 --> 00:54:58.440 So here is a probability distribution for this root node in this Bayesian 00:54:58.440 --> 00:54:59.770 network. 
00:54:59.770 --> 00:55:03.000 And let's now consider the next node in the network, Maintenance. 00:55:03.000 --> 00:55:05.140 Track maintenance is yes or no. 00:55:05.140 --> 00:55:07.530 And the general idea of what this distribution 00:55:07.530 --> 00:55:09.660 is going to encode, at least in this story, 00:55:09.660 --> 00:55:13.308 is the idea that the heavier the rain is, the less likely 00:55:13.308 --> 00:55:15.600 it is that there's going to be maintenance on the track 00:55:15.600 --> 00:55:18.017 because the people that are doing maintenance on the track 00:55:18.017 --> 00:55:21.190 probably want to wait until a day when it's not as rainy in order to do 00:55:21.190 --> 00:55:23.000 the track maintenance, for example. 00:55:23.000 --> 00:55:25.480 And so what might that probability distribution look like? 00:55:25.480 --> 00:55:28.180 Well, this now is going to be a conditional probability 00:55:28.180 --> 00:55:31.600 distribution, that here are the three possible values for the Rain 00:55:31.600 --> 00:55:34.840 random variable, which I'm here just going to abbreviate to R, either 00:55:34.840 --> 00:55:37.490 no rain, light rain, or heavy rain. 00:55:37.490 --> 00:55:41.590 And for each of those possible values, either there is yes track maintenance 00:55:41.590 --> 00:55:46.120 or no track maintenance, and those have probabilities associated with them, 00:55:46.120 --> 00:55:50.650 and I see here that if it is not raining, 00:55:50.650 --> 00:55:53.620 then there is a probability of 0.4 that there's track maintenance 00:55:53.620 --> 00:55:56.350 and a probability of 0.6 that there isn't. 00:55:56.350 --> 00:55:59.200 But if there's heavy rain, then here the chance 00:55:59.200 --> 00:56:02.020 that there is track maintenance is 0.1 and the chance 00:56:02.020 --> 00:56:04.430 that there is not track maintenance is 0.9. 00:56:04.430 --> 00:56:08.230 Each of these rows is going to sum up to one because each of these 00:56:08.230 --> 00:56:10.930 represents a different value of whether or not 00:56:10.930 --> 00:56:14.710 it's raining, the three possible values that that random variable can take on, 00:56:14.710 --> 00:56:18.160 and each is associated with its own probability distribution. 00:56:18.160 --> 00:56:22.450 That is ultimately all going to add up to the number one. 00:56:22.450 --> 00:56:26.290 So there is our distribution for this random variable called Maintenance 00:56:26.290 --> 00:56:30.110 about whether or not there is maintenance on the train track. 00:56:30.110 --> 00:56:32.050 And now let's consider the next variable. 00:56:32.050 --> 00:56:34.210 Here we have a node inside of our Bayesian network 00:56:34.210 --> 00:56:38.570 called Train that has two possible values, on time and delayed. 00:56:38.570 --> 00:56:42.160 And this node is going to be dependent upon the two nodes that 00:56:42.160 --> 00:56:45.040 are pointing towards it, that whether the train is on time 00:56:45.040 --> 00:56:48.872 or delayed depends on whether or not there is track maintenance, 00:56:48.872 --> 00:56:50.830 and it depends on whether or not there is rain, 00:56:50.830 --> 00:56:55.610 that heavier rain probably means it's more likely that my train is delayed. 00:56:55.610 --> 00:56:58.270 And if there is track maintenance, that also 00:56:58.270 --> 00:57:02.360 probably means it's more likely that my train is delayed as well.
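One way to picture the two distributions so far is as plain Python tables; note that the light-rain row below is a placeholder, since those two numbers aren't read out in the lecture:

```python
# Unconditional distribution for the root node Rain
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# Conditional distribution for Maintenance given Rain; the "light" row
# (0.2 / 0.8) is a placeholder to make the table complete.
p_maintenance_given_rain = {
    "none":  {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}

# Each row conditions on one value of Rain, so each row sums to 1
for row in p_maintenance_given_rain.values():
    assert abs(sum(row.values()) - 1) < 1e-9
```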
00:57:02.360 --> 00:57:05.350 And so you could construct a larger probability distribution, 00:57:05.350 --> 00:57:07.720 a conditional probability distribution, that 00:57:07.720 --> 00:57:11.530 instead of conditioning on just one variable, as was the case here, 00:57:11.530 --> 00:57:14.380 is now conditioning on two variables, conditioning 00:57:14.380 --> 00:57:19.270 both on rain, represented by R, and on maintenance, represented by M. 00:57:19.270 --> 00:57:23.040 Again, each of these rows has two values that sum up to the number one, 00:57:23.040 --> 00:57:27.310 one for whether the train is on time, one for whether the train is delayed. 00:57:27.310 --> 00:57:29.260 And here I can say something like, all right, 00:57:29.260 --> 00:57:32.950 if I know there was light rain and track maintenance-- well, OK, 00:57:32.950 --> 00:57:36.490 that would be R is light and M is yes-- 00:57:36.490 --> 00:57:40.210 well, then there is a probability of 0.6 that my train is on time 00:57:40.210 --> 00:57:43.540 and a probability of 0.4 that the train is delayed. 00:57:43.540 --> 00:57:47.770 And you can imagine gathering this data just by looking at real-world data, 00:57:47.770 --> 00:57:50.970 looking at data about, all right, if I knew that it was light rain 00:57:50.970 --> 00:57:52.720 and there was track maintenance, how often 00:57:52.720 --> 00:57:54.400 was a train delayed or not delayed, and you 00:57:54.400 --> 00:57:55.930 could begin to construct this thing. 00:57:55.930 --> 00:57:58.060 But the interesting thing is, intelligently, 00:57:58.060 --> 00:57:59.812 being able to try to figure out, how might 00:57:59.812 --> 00:58:01.270 you go about ordering these things? 00:58:01.270 --> 00:58:06.730 What things might influence other nodes inside of this Bayesian network? 00:58:06.730 --> 00:58:08.860 And the last thing I care about is whether or not 00:58:08.860 --> 00:58:10.870 I make it to my appointment. 00:58:10.870 --> 00:58:13.210 So did I attend or miss the appointment? 00:58:13.210 --> 00:58:16.180 And ultimately, whether I attend or miss the appointment, 00:58:16.180 --> 00:58:19.552 it is influenced by track maintenance because it's indirectly this idea 00:58:19.552 --> 00:58:21.760 that, all right, if there is track maintenance, well, 00:58:21.760 --> 00:58:23.450 then my train might more likely be delayed, 00:58:23.450 --> 00:58:25.325 and if my train is more likely to be delayed, 00:58:25.325 --> 00:58:27.280 then I'm more likely to miss my appointment. 00:58:27.280 --> 00:58:29.650 But what we encode in this Bayesian network 00:58:29.650 --> 00:58:32.820 are just what we might consider to be more direct relationships. 00:58:32.820 --> 00:58:35.710 So the train has a direct influence on the appointment. 00:58:35.710 --> 00:58:38.710 And given that I know whether the train is on time or delayed, 00:58:38.710 --> 00:58:40.540 knowing whether there's track maintenance 00:58:40.540 --> 00:58:44.550 isn't going to give me any additional information that I didn't already have, 00:58:44.550 --> 00:58:48.070 that if I know train, these other nodes that are up above 00:58:48.070 --> 00:58:51.150 aren't really going to influence the result. 00:58:51.150 --> 00:58:54.910 And so here we might represent it using another conditional probability 00:58:54.910 --> 00:58:57.430 distribution that looks a little something like this, that 00:58:57.430 --> 00:59:00.160 train can take on two possible values. 00:59:00.160 --> 00:59:02.740 Either my train is on time or my train is delayed.
00:59:02.740 --> 00:59:04.510 And for each of those two possible values, 00:59:04.510 --> 00:59:06.803 I have a distribution for what are the odds 00:59:06.803 --> 00:59:09.220 that I'm able to attend the meeting, and what are the odds 00:59:09.220 --> 00:59:10.090 that I miss the meeting? 00:59:10.090 --> 00:59:12.010 And obviously, if my train is on time, I'm 00:59:12.010 --> 00:59:14.130 much more likely to be able to attend the meeting 00:59:14.130 --> 00:59:16.600 than if my train is delayed, in which case 00:59:16.600 --> 00:59:19.500 I'm more likely to miss that meeting. 00:59:19.500 --> 00:59:21.790 So all of these nodes put altogether here 00:59:21.790 --> 00:59:25.330 represent this Bayesian network, this network of random variables 00:59:25.330 --> 00:59:27.730 whose values I ultimately care about and that 00:59:27.730 --> 00:59:30.380 have some sort of relationship between them, 00:59:30.380 --> 00:59:33.670 some sort of dependence where these arrows from one node to another 00:59:33.670 --> 00:59:37.960 indicate some dependence, that I can calculate the probability of some node 00:59:37.960 --> 00:59:41.870 given the parents that happen to exist there. 00:59:41.870 --> 00:59:45.340 So now that we've been able to describe the structure of this Bayesian network 00:59:45.340 --> 00:59:47.680 and the relationships between each of these nodes, 00:59:47.680 --> 00:59:51.070 by associating each of the nodes in the network with a probability 00:59:51.070 --> 00:59:53.980 distribution, whether that's an unconditional probability 00:59:53.980 --> 00:59:56.200 distribution in the case of this root node here, 00:59:56.200 --> 00:59:59.630 like Rain, or a conditional probability distribution 00:59:59.630 --> 01:00:02.380 in the case of all of the other nodes whose probabilities are 01:00:02.380 --> 01:00:05.000 dependent upon the values of their parents, 01:00:05.000 --> 01:00:09.160 we can begin to do some computation and calculation using the information 01:00:09.160 --> 01:00:10.490 inside of those tables. 01:00:10.490 --> 01:00:12.310 So let's imagine, for example, that I just 01:00:12.310 --> 01:00:15.910 wanted to compute something simple, like the probability of light rain. 01:00:15.910 --> 01:00:18.130 How would I get the probability of light rain? 01:00:18.130 --> 01:00:21.370 Well, light rain-- rain here is a root node. 01:00:21.370 --> 01:00:23.770 And so if I wanted to calculate that probability, 01:00:23.770 --> 01:00:26.740 I could just look at the probability distribution for rain 01:00:26.740 --> 01:00:29.800 and extract from it the probability of light rain. 01:00:29.800 --> 01:00:33.220 It's just a single value that I already have access to. 01:00:33.220 --> 01:00:35.410 But we could also imagine wanting to compute 01:00:35.410 --> 01:00:39.100 more complex joint probabilities, like the probability 01:00:39.100 --> 01:00:42.710 that there is light rain and also no track maintenance. 01:00:42.710 --> 01:00:47.440 This is a joint probability of two values, light rain and no track 01:00:47.440 --> 01:00:48.293 maintenance. 01:00:48.293 --> 01:00:51.460 And the way I might do that is first by starting by saying, all right, well, 01:00:51.460 --> 01:00:54.100 let me get the probability of light rain, but now 01:00:54.100 --> 01:00:57.160 I also want the probability of no track maintenance. 01:00:57.160 --> 01:01:01.630 But, of course, this node is dependent upon the value of rain.
01:01:01.630 --> 01:01:05.350 So what I really want is the probability of no track maintenance given 01:01:05.350 --> 01:01:07.540 that I know that there was light rain. 01:01:07.540 --> 01:01:10.450 And so the expression for calculating this idea 01:01:10.450 --> 01:01:13.870 that the probability of light rain and no track maintenance 01:01:13.870 --> 01:01:17.680 is really just the probability of light rain and the probability 01:01:17.680 --> 01:01:21.250 that there is no track maintenance given that I know that there already 01:01:21.250 --> 01:01:22.210 is light rain. 01:01:22.210 --> 01:01:25.540 So I take the unconditional probability of light rain, 01:01:25.540 --> 01:01:30.160 multiply it by the conditional probability of no track maintenance 01:01:30.160 --> 01:01:32.550 given that I know there is light rain. 01:01:32.550 --> 01:01:35.770 And you can continue to do this again and again for every variable 01:01:35.770 --> 01:01:38.378 that you want to add into this joint probability 01:01:38.378 --> 01:01:39.670 that I might want to calculate. 01:01:39.670 --> 01:01:42.400 If I wanted to know the probability of light rain 01:01:42.400 --> 01:01:45.100 and no track maintenance and a delayed train, 01:01:45.100 --> 01:01:48.850 well, that's going to be the probability of light rain multiplied 01:01:48.850 --> 01:01:50.950 by the probability of no track maintenance 01:01:50.950 --> 01:01:56.218 given light rain multiplied by the probability of a delayed train given 01:01:56.218 --> 01:01:59.260 light rain and no track maintenance, because whether the train is on time 01:01:59.260 --> 01:02:03.190 or delayed is dependent upon both of these other two variables, 01:02:03.190 --> 01:02:05.290 and so I have two pieces of evidence that 01:02:05.290 --> 01:02:08.860 go into the calculation of that conditional probability. 01:02:08.860 --> 01:02:11.470 And each of these three values is just a value 01:02:11.470 --> 01:02:15.640 that I can look up by looking at one of these individual probability 01:02:15.640 --> 01:02:20.140 distributions that is encoded into my Bayesian network. 01:02:20.140 --> 01:02:23.410 And if I wanted a joint probability over all four of the variables, 01:02:23.410 --> 01:02:25.900 something like the probability of light rain 01:02:25.900 --> 01:02:30.130 and no track maintenance and a delayed train and I missed my appointment, 01:02:30.130 --> 01:02:32.890 well, that's going to be multiplying four different values, one 01:02:32.890 --> 01:02:34.870 from each of these individual nodes. 01:02:34.870 --> 01:02:36.970 It's going to be the probability of light rain, 01:02:36.970 --> 01:02:39.370 then of no track maintenance given light rain, 01:02:39.370 --> 01:02:42.882 then of a delayed train given light rain and no track maintenance. 01:02:42.882 --> 01:02:46.090 And then, finally, for this node here for whether I make it to my appointment 01:02:46.090 --> 01:02:50.770 or not, it's not dependent upon these two variables given that I know 01:02:50.770 --> 01:02:52.270 whether or not the train is on time. 01:02:52.270 --> 01:02:55.030 I only need to care about the conditional probability 01:02:55.030 --> 01:03:00.160 that I miss my appointment given that the train happens to be delayed. 01:03:00.160 --> 01:03:04.120 And so that's represented here by four probabilities, each of which 01:03:04.120 --> 01:03:07.420 is located inside of one of these probability distributions 01:03:07.420 --> 01:03:11.092 for each of the nodes, all multiplied together. 
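A sketch of that four-factor product in Python; apart from the rain distribution, most of the table entries below are placeholders rather than values stated in the lecture:

```python
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}

# Placeholder conditional tables, keyed by the values of each node's parents
p_maintenance = {"light": {"yes": 0.2, "no": 0.8}}
p_train = {("light", "no"): {"on time": 0.7, "delayed": 0.3}}
p_appointment = {"delayed": {"attend": 0.6, "miss": 0.4}}

# P(light, no maintenance, delayed, miss): each node's probability
# conditioned only on its parents, all multiplied together
p = (p_rain["light"]
     * p_maintenance["light"]["no"]
     * p_train[("light", "no")]["delayed"]
     * p_appointment["delayed"]["miss"])
print(p)  # 0.2 * 0.8 * 0.3 * 0.4 = 0.0192
```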
01:03:11.092 --> 01:03:13.300 And so I can take a variable like that and figure out 01:03:13.300 --> 01:03:15.910 what the joint probability is by multiplying 01:03:15.910 --> 01:03:18.280 a whole bunch of these individual probabilities 01:03:18.280 --> 01:03:19.990 from the Bayesian network. 01:03:19.990 --> 01:03:23.110 But, of course, just as with last time where what I really wanted to do 01:03:23.110 --> 01:03:25.463 was to be able to get new pieces of information, 01:03:25.463 --> 01:03:28.630 here, too, this is what we're going to want to do with our Bayesian network. 01:03:28.630 --> 01:03:31.720 In the context of knowledge, we talked about the problem of inference. 01:03:31.720 --> 01:03:34.210 Given things that I know to be true, can I 01:03:34.210 --> 01:03:38.020 draw conclusions, make deductions about other facts about the world 01:03:38.020 --> 01:03:40.270 that I also know to be true? 01:03:40.270 --> 01:03:44.170 And what we're going to do now is apply the same sort of idea to probability. 01:03:44.170 --> 01:03:46.960 Using information about which I have some knowledge, 01:03:46.960 --> 01:03:49.510 whether some evidence or some probabilities, can 01:03:49.510 --> 01:03:52.360 I figure out not other variables for certain, 01:03:52.360 --> 01:03:55.750 but can I figure out the probabilities of other variables taking 01:03:55.750 --> 01:03:57.160 on particular values? 01:03:57.160 --> 01:04:00.160 And so here we introduce the problem of inference 01:04:00.160 --> 01:04:03.970 in a probabilistic setting in a case where variables might not necessarily 01:04:03.970 --> 01:04:06.760 be true for sure, but they might be random variables 01:04:06.760 --> 01:04:10.640 that take on different values with some probability. 01:04:10.640 --> 01:04:13.780 So how do we formally define what exactly this inference problem actually 01:04:13.780 --> 01:04:14.500 is? 01:04:14.500 --> 01:04:17.350 Well, the inference problem has a couple of parts to it. 01:04:17.350 --> 01:04:20.140 We have some query, some variable x that we 01:04:20.140 --> 01:04:21.730 want to compute the distribution for. 01:04:21.730 --> 01:04:24.880 Maybe I want the probability that I missed my train 01:04:24.880 --> 01:04:29.500 or I want the probability that there is track maintenance, something 01:04:29.500 --> 01:04:31.570 that I want information about. 01:04:31.570 --> 01:04:33.437 And then I have some evidence variables. 01:04:33.437 --> 01:04:35.020 Maybe it's just one piece of evidence. 01:04:35.020 --> 01:04:36.760 Maybe it's multiple pieces of evidence. 01:04:36.760 --> 01:04:40.600 But I've observed certain variables for some sort of event. 01:04:40.600 --> 01:04:43.772 So for example, I might have observed that it is raining. 01:04:43.772 --> 01:04:44.980 This is evidence that I have. 01:04:44.980 --> 01:04:47.933 I know that there is light rain or I know that there is heavy rain, 01:04:47.933 --> 01:04:49.100 and that is evidence I have. 01:04:49.100 --> 01:04:52.750 And using that evidence, I want to know, what is the probability 01:04:52.750 --> 01:04:55.430 that my train is delayed, for example? 01:04:55.430 --> 01:04:58.480 And that is a query that I might want to ask based on this evidence. 01:04:58.480 --> 01:05:00.700 So I have a query, some variable, evidence, 01:05:00.700 --> 01:05:03.280 which are some other variables that I have observed inside 01:05:03.280 --> 01:05:05.260 of my Bayesian network, and of course that 01:05:05.260 --> 01:05:08.110 does leave some hidden variables, y. 
01:05:08.110 --> 01:05:11.380 These are variables that are not evidence variables and not 01:05:11.380 --> 01:05:12.550 query variables. 01:05:12.550 --> 01:05:16.090 So you might imagine in the case where I know whether or not it's raining 01:05:16.090 --> 01:05:19.930 and I want to know whether my train is going to be delayed or not, 01:05:19.930 --> 01:05:23.380 the hidden variable, the thing I don't have access to, is something like, 01:05:23.380 --> 01:05:25.130 is there maintenance on the track, or am I 01:05:25.130 --> 01:05:27.380 going to make or not make my appointment, for example? 01:05:27.380 --> 01:05:29.410 These are variables that I don't have access to. 01:05:29.410 --> 01:05:32.680 They're hidden because they're not things I observed, 01:05:32.680 --> 01:05:35.100 and they're also not the query, the thing that I'm asking. 01:05:35.100 --> 01:05:37.480 And so ultimately what we want to calculate 01:05:37.480 --> 01:05:41.650 is I want to know the probability distribution of x given 01:05:41.650 --> 01:05:42.970 e, the event that I observed. 01:05:42.970 --> 01:05:46.150 So given that I observed some event, I observed that it is raining, 01:05:46.150 --> 01:05:49.960 I would like to know, what is the distribution over the possible values 01:05:49.960 --> 01:05:51.640 of the Train random variable? 01:05:51.640 --> 01:05:52.630 Is it on time? 01:05:52.630 --> 01:05:53.440 Is it delayed? 01:05:53.440 --> 01:05:55.750 What is the likelihood it's going to be there? 01:05:55.750 --> 01:05:58.720 And it turns out we can do this calculation just using 01:05:58.720 --> 01:06:02.410 a lot of the probability rules that we've already seen in action. 01:06:02.410 --> 01:06:04.870 And ultimately, we're going to take a look at the math 01:06:04.870 --> 01:06:07.150 at a little bit of a high level, at an abstract level, 01:06:07.150 --> 01:06:09.370 but ultimately we can allow computers and programming 01:06:09.370 --> 01:06:12.610 libraries that already exist to begin to do some of this math for us. 01:06:12.610 --> 01:06:15.810 But it's good to get a general sense for what's actually happening when 01:06:15.810 --> 01:06:18.010 this inference process takes place. 01:06:18.010 --> 01:06:21.190 Let's imagine, for example, that I want to compute the probability 01:06:21.190 --> 01:06:24.430 distribution of the Appointment random variable 01:06:24.430 --> 01:06:28.510 given some evidence, given that I know that there was light rain and no track 01:06:28.510 --> 01:06:29.260 maintenance. 01:06:29.260 --> 01:06:32.830 So there's my evidence, these two variables that I observed the value of. 01:06:32.830 --> 01:06:34.630 I observe the value of rain. 01:06:34.630 --> 01:06:35.920 I know there's light rain. 01:06:35.920 --> 01:06:38.830 And I know that there is no track maintenance going on today. 01:06:38.830 --> 01:06:42.820 And what I care about knowing, my query, is this random variable Appointment. 01:06:42.820 --> 01:06:46.008 I want to know the distribution of this random variable Appointment. 01:06:46.008 --> 01:06:47.800 What is the chance that I am able to attend 01:06:47.800 --> 01:06:50.560 my appointment, what is the chance that I miss my appointment 01:06:50.560 --> 01:06:52.360 given this evidence? 01:06:52.360 --> 01:06:55.870 And the hidden variable, the information that I don't have access to, 01:06:55.870 --> 01:06:57.190 is this variable Train. 
01:06:57.190 --> 01:07:00.040 This is information that is not part of the evidence that I see, 01:07:00.040 --> 01:07:01.660 not something that I observe. 01:07:01.660 --> 01:07:05.050 But it is also not the query that I am asking for. 01:07:05.050 --> 01:07:07.460 And so what might this inference procedure look like? 01:07:07.460 --> 01:07:10.810 Well, if you recall back from when we were defining conditional probability 01:07:10.810 --> 01:07:13.270 and doing math with conditional probabilities, 01:07:13.270 --> 01:07:15.940 we know that a conditional probability is 01:07:15.940 --> 01:07:19.030 proportional to the joint probability. 01:07:19.030 --> 01:07:23.050 And we remember this by recalling that the probability of a given b 01:07:23.050 --> 01:07:25.930 is just some constant factor alpha multiplied 01:07:25.930 --> 01:07:27.583 by the probability of a and b. 01:07:27.583 --> 01:07:29.500 That constant factor alpha turns up because you're 01:07:29.500 --> 01:07:32.620 dividing by the probability of b, but the important thing 01:07:32.620 --> 01:07:34.930 is that it's just some constant multiplied 01:07:34.930 --> 01:07:37.450 by the joint distribution, the probability 01:07:37.450 --> 01:07:40.070 that all of these individual things happen. 01:07:40.070 --> 01:07:42.610 So in this case, I can take the probability 01:07:42.610 --> 01:07:47.380 of the Appointment random variable given light rain and no track maintenance 01:07:47.380 --> 01:07:51.070 and say that is just going to be proportional to, some constant alpha 01:07:51.070 --> 01:07:54.700 multiplied by, the joint probability, the probability of a particular value 01:07:54.700 --> 01:08:00.410 for the appointment random variable, and light rain and no track maintenance. 01:08:00.410 --> 01:08:02.980 Well, all right, how do I calculate this, probability 01:08:02.980 --> 01:08:05.350 of appointment and light rain and no track maintenance, 01:08:05.350 --> 01:08:07.480 when what I really care about is knowing-- 01:08:07.480 --> 01:08:11.260 I need all four of these values to be able to calculate a joint distribution 01:08:11.260 --> 01:08:13.990 across everything, because, then, a particular appointment 01:08:13.990 --> 01:08:16.420 depends upon the value of train. 01:08:16.420 --> 01:08:18.399 Well, in order to do that, here I can begin 01:08:18.399 --> 01:08:21.430 to use that marginalization trick, that there are only 01:08:21.430 --> 01:08:24.640 two ways I can get any configuration of an appointment, light rain, 01:08:24.640 --> 01:08:25.859 and no track maintenance. 01:08:25.859 --> 01:08:28.120 Either this particular setting of variables 01:08:28.120 --> 01:08:33.130 happens and the train is on time or this particular setting of variables happens 01:08:33.130 --> 01:08:34.180 and the train is delayed. 01:08:34.180 --> 01:08:37.520 Those are two possible cases that I would want to consider. 01:08:37.520 --> 01:08:40.149 And if I add those two cases up, well, then I 01:08:40.149 --> 01:08:44.859 get the result just by adding up all of the possibilities for the hidden 01:08:44.859 --> 01:08:46.990 variable, or variables if there are multiple. 01:08:46.990 --> 01:08:49.090 But since there's only one hidden variable here, 01:08:49.090 --> 01:08:53.229 Train, all I need to do is iterate over all the possible values for that hidden 01:08:53.229 --> 01:08:56.600 variable Train and add up their probabilities.
01:08:56.600 --> 01:08:59.529 So this probability expression here becomes 01:08:59.529 --> 01:09:02.890 the probability distribution over Appointment, light rain, no track maintenance, and the train 01:09:02.890 --> 01:09:06.010 is on time, and the probability distribution 01:09:06.010 --> 01:09:10.120 over the Appointment, light rain, no track maintenance, and the train 01:09:10.120 --> 01:09:11.660 is delayed, for example. 01:09:11.660 --> 01:09:15.597 So I take both of the possible values for train, go ahead and add them up. 01:09:15.597 --> 01:09:16.180 These are just 01:09:16.180 --> 01:09:18.722 joint probabilities that we saw earlier how to calculate just 01:09:18.722 --> 01:09:22.120 by going parent, parent, parent, parent and calculating those probabilities 01:09:22.120 --> 01:09:23.615 and multiplying them together. 01:09:23.615 --> 01:09:26.740 And then you'll need to normalize them at the end, speaking at a high level, 01:09:26.740 --> 01:09:29.920 to make sure that everything adds up to the number one. 01:09:29.920 --> 01:09:32.229 So the formula for how you do this, in a process known 01:09:32.229 --> 01:09:35.223 as inference by enumeration, looks a little bit complicated, 01:09:35.223 --> 01:09:36.640 but ultimately it looks like this. 01:09:36.640 --> 01:09:39.550 And let's now try to distill what it is that all of these symbols 01:09:39.550 --> 01:09:40.420 actually mean. 01:09:40.420 --> 01:09:41.410 Let's start here. 01:09:41.410 --> 01:09:46.029 What I care about knowing is the probability of x, my query variable, 01:09:46.029 --> 01:09:48.370 given some sort of evidence. 01:09:48.370 --> 01:09:50.410 What do I know about conditional probabilities? 01:09:50.410 --> 01:09:55.030 Well, a conditional probability is proportional to the joint probability. 01:09:55.030 --> 01:09:57.850 So we had some alpha, some normalizing constant, 01:09:57.850 --> 01:10:01.840 multiplied by this joint probability of x and evidence. 01:10:01.840 --> 01:10:03.410 And how do I calculate that? 01:10:03.410 --> 01:10:05.980 Well, to do that, I'm going to marginalize over 01:10:05.980 --> 01:10:07.420 all of the hidden variables. 01:10:07.420 --> 01:10:10.450 All the variables that I don't directly observe the values for, 01:10:10.450 --> 01:10:13.390 I'm basically going to iterate over all of the possibilities 01:10:13.390 --> 01:10:16.040 that it could happen and just sum them all up. 01:10:16.040 --> 01:10:19.270 And so I can translate this into a sum over all y, which 01:10:19.270 --> 01:10:22.450 ranges over all the possible hidden variables and the values 01:10:22.450 --> 01:10:27.250 that they could take on, and adds up all of those possible individual 01:10:27.250 --> 01:10:28.300 probabilities. 01:10:28.300 --> 01:10:32.195 And that is going to allow me to do this process of inference by enumeration. 01:10:32.195 --> 01:10:34.570 And ultimately, it's pretty annoying if we as humans have 01:10:34.570 --> 01:10:36.713 to do all of this math for ourselves. 01:10:36.713 --> 01:10:39.880 But it turns out this is where computers and AI can be particularly helpful, 01:10:39.880 --> 01:10:43.360 that we can program a computer to understand a Bayesian network to be 01:10:43.360 --> 01:10:45.610 able to understand these inference procedures 01:10:45.610 --> 01:10:47.560 and to be able to do these calculations. 01:10:47.560 --> 01:10:49.390 And using the information you've seen here, 01:10:49.390 --> 01:10:52.150 you could implement a Bayesian network from scratch yourself.
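For reference, the slide's formula isn't reproduced in this transcript, but based on the description it is the standard inference-by-enumeration identity, where X is the query variable, e is the observed evidence, y ranges over all possible values of the hidden variables, and alpha is the normalizing constant:

```
P(X \mid e) = \alpha \, P(X, e) = \alpha \sum_{y} P(X, e, y)
```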
01:10:52.150 --> 01:10:54.733 But it turns out there are a lot of libraries, especially written 01:10:54.733 --> 01:10:56.650 in Python, that allow us to make it easier 01:10:56.650 --> 01:10:58.780 to do this sort of probabilistic inference 01:10:58.780 --> 01:11:01.788 to be able to take a Bayesian network and do these sorts of calculations 01:11:01.788 --> 01:11:04.830 so that you don't need to know and understand all of the underlying math, 01:11:04.830 --> 01:11:07.372 though it's helpful to have a general sense for how it works. 01:11:07.372 --> 01:11:10.330 But you just need to be able to describe the structure of the network 01:11:10.330 --> 01:11:14.350 and make queries in order to be able to produce the result. 01:11:14.350 --> 01:11:17.050 And so let's take a look at an example of that right now. 01:11:17.050 --> 01:11:19.420 It turns out that there are a lot of possible libraries 01:11:19.420 --> 01:11:21.803 that exist in Python for doing this sort of inference. 01:11:21.803 --> 01:11:24.220 It doesn't matter too much which specific library you use. 01:11:24.220 --> 01:11:26.330 They all behave in fairly similar ways. 01:11:26.330 --> 01:11:29.170 But the library I'm going to use here is one known as pomegranate. 01:11:29.170 --> 01:11:33.820 And here inside of model.py, I have defined a Bayesian network 01:11:33.820 --> 01:11:38.070 just using the structure and the syntax that the pomegranate library expects. 01:11:38.070 --> 01:11:40.930 And what I'm effectively doing is just, in Python, 01:11:40.930 --> 01:11:44.740 creating nodes to represent each of the nodes of the Bayesian network 01:11:44.740 --> 01:11:47.060 that you saw me describe a moment ago. 01:11:47.060 --> 01:11:49.750 So here on line four, after I've imported pomegranate, 01:11:49.750 --> 01:11:52.540 I'm defining a variable called rain that is going to represent 01:11:52.540 --> 01:11:55.990 a node inside of my Bayesian network. 01:11:55.990 --> 01:11:59.530 It's going to be a node that follows this distribution where 01:11:59.530 --> 01:12:01.030 there are three possible values-- 01:12:01.030 --> 01:12:03.970 none for no rain, light for light rain, heavy for heavy rain. 01:12:03.970 --> 01:12:07.180 And these are the probabilities of each of those taking place. 01:12:07.180 --> 01:12:13.630 0.7 is the likelihood of no rain, 0.2 for light rain, 0.1 for heavy rain. 01:12:13.630 --> 01:12:15.760 Then, after that, we go to the next variable, 01:12:15.760 --> 01:12:18.400 the variable for track maintenance, for example, which 01:12:18.400 --> 01:12:20.990 is dependent upon that rain variable. 01:12:20.990 --> 01:12:23.890 And this, instead of being an unconditional distribution, 01:12:23.890 --> 01:12:27.370 is a conditional distribution, as indicated by a conditional probability 01:12:27.370 --> 01:12:28.430 table here. 01:12:28.430 --> 01:12:33.790 And the idea is that this is conditional on the distribution of rain. 01:12:33.790 --> 01:12:36.700 So if there is no rain, then the chance that there is yes 01:12:36.700 --> 01:12:38.370 track maintenance is 0.4. 01:12:38.370 --> 01:12:41.720 If there's no rain, the chance that there is no track maintenance is 0.6. 01:12:41.720 --> 01:12:43.720 Likewise, for light rain, I have a distribution. 01:12:43.720 --> 01:12:45.760 For heavy rain, I have a distribution, as well.
01:12:45.760 --> 01:12:48.130 But I'm effectively encoding the same information 01:12:48.130 --> 01:12:50.110 you saw represented graphically a moment ago, 01:12:50.110 --> 01:12:53.110 but I'm telling this Python program that the maintenance 01:12:53.110 --> 01:12:57.640 node obeys this particular conditional probability distribution. 01:12:57.640 --> 01:13:01.090 And we do the same thing for the other random variables, as well. 01:13:01.090 --> 01:13:06.310 Train was a node inside my network whose distribution was a conditional probability 01:13:06.310 --> 01:13:08.050 table with two parents. 01:13:08.050 --> 01:13:11.380 It was dependent not only on rain, but also on track maintenance. 01:13:11.380 --> 01:13:15.310 And so here I'm saying something like, given that there is no rain and yes 01:13:15.310 --> 01:13:19.630 track maintenance, the probability that my train is on time is 0.8, 01:13:19.630 --> 01:13:22.240 and the probability that it's delayed is 0.2. 01:13:22.240 --> 01:13:24.220 And likewise, I can do the same thing for all 01:13:24.220 --> 01:13:28.330 of the other possible values of the parents of the train node 01:13:28.330 --> 01:13:32.800 inside of my Bayesian network by saying, for all of those possible values, 01:13:32.800 --> 01:13:36.350 here is the distribution that the train node should follow. 01:13:36.350 --> 01:13:38.710 And I do the same thing for an appointment 01:13:38.710 --> 01:13:41.830 based on the distribution of the variable Train. 01:13:41.830 --> 01:13:45.340 Then, at the end, what I do is actually construct this network 01:13:45.340 --> 01:13:47.860 by describing what the states of the network are 01:13:47.860 --> 01:13:50.660 and by adding edges between the dependent nodes. 01:13:50.660 --> 01:13:53.110 So I create a new Bayesian network, add states to it-- 01:13:53.110 --> 01:13:56.650 one for rain, one for maintenance, one for train, one for the appointment-- 01:13:56.650 --> 01:14:00.460 and then I add edges connecting the related pieces. 01:14:00.460 --> 01:14:04.570 Rain has an arrow to maintenance because rain influences track maintenance, 01:14:04.570 --> 01:14:08.530 rain also influences the train, maintenance also influences the train, 01:14:08.530 --> 01:14:11.140 and train influences whether I make it to my appointment, 01:14:11.140 --> 01:14:14.800 and calling bake just finalizes the model and does some additional computation. 01:14:14.800 --> 01:14:18.250 So the specific syntax of this is not really the important part. 01:14:18.250 --> 01:14:20.980 Pomegranate just happens to be one of several different libraries 01:14:20.980 --> 01:14:22.990 that can all be used for similar purposes, 01:14:22.990 --> 01:14:26.170 and you could describe and define a library for yourself 01:14:26.170 --> 01:14:28.010 that implemented similar things. 01:14:28.010 --> 01:14:30.430 But the key idea here is that someone can 01:14:30.430 --> 01:14:33.220 design a library for a general Bayesian network that 01:14:33.220 --> 01:14:35.680 has nodes that are based upon its parents, 01:14:35.680 --> 01:14:39.190 and then all a programmer needs to do, using one of those libraries, 01:14:39.190 --> 01:14:43.420 is to define what those nodes and what those probability distributions are, 01:14:43.420 --> 01:14:47.000 and we can begin to do some interesting logic based on it.
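For reference, here is a sketch of what that model.py might look like. It assumes the classic pomegranate 0.x API used in the lecture (newer pomegranate releases changed this interface), and the conditional probability table rows that the transcript doesn't spell out are filled in with illustrative values rather than the lecture's exact numbers:

```python
from pomegranate import (BayesianNetwork, ConditionalProbabilityTable,
                         DiscreteDistribution, Node)

# Rain has no parents: an unconditional distribution (values from the lecture).
rain = Node(DiscreteDistribution({
    "none": 0.7, "light": 0.2, "heavy": 0.1
}), name="rain")

# Track maintenance is conditional on rain. The "none" row matches the
# lecture (0.4 / 0.6); the other rows are illustrative assumptions.
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4], ["none", "no", 0.6],
    ["light", "yes", 0.2], ["light", "no", 0.8],
    ["heavy", "yes", 0.1], ["heavy", "no", 0.9],
], [rain.distribution]), name="maintenance")

# Train is conditional on both rain and maintenance. The ("none", "yes")
# row matches the lecture (0.8 / 0.2); the other rows are illustrative.
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8], ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9], ["none", "no", "delayed", 0.1],
    ["light", "yes", "on time", 0.6], ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7], ["light", "no", "delayed", 0.3],
    ["heavy", "yes", "on time", 0.4], ["heavy", "yes", "delayed", 0.6],
    ["heavy", "no", "on time", 0.5], ["heavy", "no", "delayed", 0.5],
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment depends only on train (0.9/0.1 and 0.6/0.4 per the lecture).
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9], ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6], ["delayed", "miss", 0.4],
], [train.distribution]), name="appointment")

# Assemble the network: add the states, then an edge from each parent
# to each of its children, and finalize with bake.
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)
model.bake()
```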
01:14:47.000 --> 01:14:50.200 So let's try doing that conditional or joint probability 01:14:50.200 --> 01:14:56.800 calculation that we saw done by hand before by going into likelihood.py 01:14:56.800 --> 01:15:00.340 where here I'm importing the model that I just defined a moment ago. 01:15:00.340 --> 01:15:03.100 And here I'd just like to calculate model.probability, 01:15:03.100 --> 01:15:06.320 which calculates the probability for a given observation, 01:15:06.320 --> 01:15:10.270 and I'd like to calculate the probability of no rain, 01:15:10.270 --> 01:15:13.330 no track maintenance, my train is on time, 01:15:13.330 --> 01:15:14.950 and I'm able to attend the meeting-- 01:15:14.950 --> 01:15:16.870 so sort of the optimal scenario, that there's 01:15:16.870 --> 01:15:20.162 no rain and no maintenance on the track, my train is on time, 01:15:20.162 --> 01:15:21.620 and I'm able to attend the meeting. 01:15:21.620 --> 01:15:25.020 What is the probability that all of that actually happens? 01:15:25.020 --> 01:15:26.900 And I can calculate that using the library 01:15:26.900 --> 01:15:28.700 and just print out its probability. 01:15:28.700 --> 01:15:32.780 And so I'll go ahead and run python likelihood.py, 01:15:32.780 --> 01:15:37.190 and I see that, OK, the probability is about 0.34. 01:15:37.190 --> 01:15:40.850 So about a third of the time, everything goes right for me, in this case-- 01:15:40.850 --> 01:15:43.190 no rain, no track maintenance, train is on time, 01:15:43.190 --> 01:15:45.032 and I'm able to attend the meeting. 01:15:45.032 --> 01:15:47.990 But I could experiment with this, try and calculate other probabilities 01:15:47.990 --> 01:15:48.650 as well. 01:15:48.650 --> 01:15:51.860 What's the probability that everything goes right up until the train 01:15:51.860 --> 01:15:57.020 but I still miss my meeting-- so no rain, no track maintenance, train 01:15:57.020 --> 01:15:59.690 is on time, but I miss the appointment. 01:15:59.690 --> 01:16:04.680 Let's calculate that probability, and that has a probability of about 0.04. 01:16:04.680 --> 01:16:07.643 So about 4% of the time the train will be on time, 01:16:07.643 --> 01:16:09.560 there won't be any rain, no track maintenance, 01:16:09.560 --> 01:16:12.420 and yet I'll still miss the meeting. 01:16:12.420 --> 01:16:14.780 And so this is really just an implementation 01:16:14.780 --> 01:16:17.900 of the calculation of the joint probabilities that we did before. 01:16:17.900 --> 01:16:20.150 What this library is likely doing is first 01:16:20.150 --> 01:16:23.600 figuring out the probability of no rain, then figuring 01:16:23.600 --> 01:16:26.030 out the probability of no track maintenance 01:16:26.030 --> 01:16:28.580 given no rain, then the probability that my train is 01:16:28.580 --> 01:16:31.760 on time given both of these values, and then the probability 01:16:31.760 --> 01:16:35.930 that I miss my appointment given that I know that the train was on time. 01:16:35.930 --> 01:16:39.070 So this, again, is the calculation of that joint probability. 01:16:39.070 --> 01:16:42.320 And it turns out we can also begin to have our computer solve inference problems, 01:16:42.320 --> 01:16:45.980 as well, to begin to infer, based on information, evidence 01:16:45.980 --> 01:16:51.000 that we see, what is the likelihood of other variables also being true?
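A minimal likelihood.py in that spirit, again assuming the pomegranate 0.x API and the model sketched above, might look like:

```python
from model import model

# Probability of one complete assignment of all four variables, listed in
# the order the states were added: rain, maintenance, train, appointment.
probability = model.probability([["none", "no", "on time", "attend"]])

print(probability)  # roughly 0.34 with the lecture's numbers
```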
01:16:51.000 --> 01:16:54.740 So let's go into inference.py, for example, where here I'm, 01:16:54.740 --> 01:16:57.110 again, importing that exact same model from before, 01:16:57.110 --> 01:16:59.300 importing all the nodes and all the edges 01:16:59.300 --> 01:17:03.300 and the probability distribution that is encoded there, as well. 01:17:03.300 --> 01:17:06.320 And now there's a function for doing some sort of prediction. 01:17:06.320 --> 01:17:10.760 And here, into this model, I pass in the evidence that I observe. 01:17:10.760 --> 01:17:14.750 So here I've encoded into this Python program the evidence 01:17:14.750 --> 01:17:15.770 that I have observed. 01:17:15.770 --> 01:17:18.950 I have observed the fact that the train is delayed, 01:17:18.950 --> 01:17:22.190 and that is the value for one of the four random variables 01:17:22.190 --> 01:17:24.140 inside of this Bayesian network. 01:17:24.140 --> 01:17:26.210 And using that information, I would like to be 01:17:26.210 --> 01:17:29.270 able to draw inferences and figure out conclusions 01:17:29.270 --> 01:17:31.875 about the values of the other random variables 01:17:31.875 --> 01:17:33.500 that are inside of my Bayesian network. 01:17:33.500 --> 01:17:36.240 I would like to make predictions about everything else. 01:17:36.240 --> 01:17:40.340 So all of the actual computational logic is happening in just these three lines 01:17:40.340 --> 01:17:42.260 where I'm making this call to this prediction. 01:17:42.260 --> 01:17:45.830 Down below, I'm just iterating over all of the states and all the predictions 01:17:45.830 --> 01:17:49.860 and just printing them out so that we can visually see what the results are. 01:17:49.860 --> 01:17:51.980 But let's find out, given the train is delayed, 01:17:51.980 --> 01:17:56.210 what can I predict about the values of the other random variables? 01:17:56.210 --> 01:17:59.021 Let's go ahead and run python inference.py. 01:17:59.021 --> 01:18:00.005 I run that. 01:18:00.005 --> 01:18:01.880 And all right, here is the result that I get. 01:18:01.880 --> 01:18:04.640 Given the fact that I know that the train is delayed-- 01:18:04.640 --> 01:18:06.770 this is evidence that I have observed-- 01:18:06.770 --> 01:18:10.490 well, given that, there is a 45% chance or a 46% chance 01:18:10.490 --> 01:18:12.520 that there was no rain, a 31% chance there 01:18:12.520 --> 01:18:15.230 was light rain, and a 23% chance there was heavy rain, 01:18:15.230 --> 01:18:17.712 and I can see a probability distribution over track maintenance 01:18:17.712 --> 01:18:19.670 and a probability distribution over whether I'm 01:18:19.670 --> 01:18:22.130 able to attend or miss my appointment. 01:18:22.130 --> 01:18:23.990 Now, we know that whether I attend or miss 01:18:23.990 --> 01:18:27.715 the appointment, that is only dependent upon the train being delayed 01:18:27.715 --> 01:18:28.340 or not delayed. 01:18:28.340 --> 01:18:30.540 It shouldn't depend on anything else. 01:18:30.540 --> 01:18:34.610 So let's imagine, for example, that I knew that there was heavy rain. 01:18:34.610 --> 01:18:38.620 That shouldn't affect the distribution for making the appointment. 01:18:38.620 --> 01:18:41.360 And indeed, if I go up here and add some evidence, 01:18:41.360 --> 01:18:44.128 say that I know that the value of rain is heavy-- 01:18:44.128 --> 01:18:45.920 that is evidence that I now have access to. 01:18:45.920 --> 01:18:47.420 I now have two pieces of evidence.
01:18:47.420 --> 01:18:51.950 I know that the rain is heavy, and I know that my train is delayed. 01:18:51.950 --> 01:18:55.550 I can calculate the probability by running this inference procedure again 01:18:55.550 --> 01:18:57.090 and seeing the result. 01:18:57.090 --> 01:18:58.340 I know that the rain is heavy. 01:18:58.340 --> 01:18:59.840 I know my train is delayed. 01:18:59.840 --> 01:19:02.990 The probability distribution for track maintenance changed. 01:19:02.990 --> 01:19:05.130 Given that I know that there is heavy rain, 01:19:05.130 --> 01:19:08.750 now it's more likely that there is no track maintenance, 88% as 01:19:08.750 --> 01:19:12.250 opposed to 64% from here before. 01:19:12.250 --> 01:19:16.040 And now what is the probability that I make the appointment? 01:19:16.040 --> 01:19:17.480 Well, that's the same as before. 01:19:17.480 --> 01:19:21.100 It's still going to be attend the appointment with probability 0.6, 01:19:21.100 --> 01:19:23.450 miss the appointment with probability 0.4, 01:19:23.450 --> 01:19:27.290 because it was only dependent upon whether or not my train was on time 01:19:27.290 --> 01:19:28.260 or delayed. 01:19:28.260 --> 01:19:31.610 And so this here is implementing that idea of that inference algorithm 01:19:31.610 --> 01:19:34.130 to be able to figure out, based on the evidence 01:19:34.130 --> 01:19:37.970 that I have, what can we infer about the values of the other variables that 01:19:37.970 --> 01:19:39.050 exist as well? 01:19:39.050 --> 01:19:42.890 So inference by enumeration is one way of doing this inference procedure, 01:19:42.890 --> 01:19:46.730 just looping over all of the values the hidden variables could take on 01:19:46.730 --> 01:19:49.460 and figuring out what the probability is. 01:19:49.460 --> 01:19:52.010 Now, it turns out this is not particularly efficient, 01:19:52.010 --> 01:19:56.180 and there are definitely optimizations you can make by avoiding repeated work 01:19:56.180 --> 01:19:59.030 if you're calculating the same sort of probability multiple times. 01:19:59.030 --> 01:20:02.570 There are ways of optimizing the program to avoid having to recalculate 01:20:02.570 --> 01:20:04.640 the same probabilities again and again. 01:20:04.640 --> 01:20:06.980 But even then, as the number of variables 01:20:06.980 --> 01:20:10.220 gets large, as the number of possible values those variables could take on 01:20:10.220 --> 01:20:12.110 gets large, we're going to start to have to do 01:20:12.110 --> 01:20:14.600 a lot of computation, a lot of calculation, 01:20:14.600 --> 01:20:16.190 to be able to do this inference. 01:20:16.190 --> 01:20:18.150 And at that point, it might start to become 01:20:18.150 --> 01:20:20.250 unreasonable in terms of the amount of time 01:20:20.250 --> 01:20:24.615 that it would take to be able to do this sort of exact inference. 01:20:24.615 --> 01:20:26.490 And it's for that reason that oftentimes when 01:20:26.490 --> 01:20:29.970 it comes towards probability and things we're not entirely sure about, 01:20:29.970 --> 01:20:32.280 we don't always care about doing exact inference 01:20:32.280 --> 01:20:35.040 and knowing exactly what the probability is.
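Before moving on from exact inference, here is a sketch of what that inference.py might look like, under the same pomegranate 0.x assumptions; predict_proba performs the inference given a dictionary of observed evidence:

```python
from model import model

# Evidence that we have observed: the train is delayed.
predictions = model.predict_proba({"train": "delayed"})

# Print the inferred distribution for every node in the network.
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        # Evidence variables come back fixed to their observed value.
        print(f"{node.name}: {prediction}")
    else:
        print(f"{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability:.4f}")
```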
01:20:35.040 --> 01:20:37.560 But if we can approximate the inference procedure, 01:20:37.560 --> 01:20:41.570 do some sort of approximate inference, that can be pretty good as well, 01:20:41.570 --> 01:20:43.550 that if I don't know the exact probability 01:20:43.550 --> 01:20:45.510 but I have a general sense for the probability, 01:20:45.510 --> 01:20:49.200 one that I can get increasingly accurate with more time, that's probably 01:20:49.200 --> 01:20:53.620 pretty good, especially if I can get that to happen even faster. 01:20:53.620 --> 01:20:57.930 So how could I do approximate inference inside of a Bayesian network? 01:20:57.930 --> 01:21:00.480 Well, one method is through a procedure known as sampling. 01:21:00.480 --> 01:21:04.980 In the process of sampling, I'm going to take a sample of all of the variables 01:21:04.980 --> 01:21:06.840 inside of this Bayesian network here. 01:21:06.840 --> 01:21:08.280 And how am I going to sample? 01:21:08.280 --> 01:21:12.240 Well, I'm going to sample one of the values from each of these nodes 01:21:12.240 --> 01:21:14.560 according to their probability distribution. 01:21:14.560 --> 01:21:16.560 So how might I take a sample of all these nodes? 01:21:16.560 --> 01:21:17.430 Well, I'll start at the root. 01:21:17.430 --> 01:21:18.450 I'll start with rain. 01:21:18.450 --> 01:21:21.060 Here's the distribution for rain, and I'll go ahead 01:21:21.060 --> 01:21:23.880 and, using a random number generator or something like it, 01:21:23.880 --> 01:21:25.770 randomly pick one of these three values. 01:21:25.770 --> 01:21:29.730 I'll pick none with probability 0.7, light with probability 0.2, 01:21:29.730 --> 01:21:31.440 and heavy with probability 0.1. 01:21:31.440 --> 01:21:34.770 So I'll randomly just pick one of them according to that distribution, 01:21:34.770 --> 01:21:37.780 and maybe, in this case, I pick none, for example. 01:21:37.780 --> 01:21:39.780 Then I do the same thing for the other variable. 01:21:39.780 --> 01:21:42.410 Maintenance also has a probability distribution. 01:21:42.410 --> 01:21:44.070 And I am going to sample-- 01:21:44.070 --> 01:21:46.470 now, there are three probability distributions here, 01:21:46.470 --> 01:21:49.050 but I'm only going to sample from this first row 01:21:49.050 --> 01:21:53.950 here because I've observed already in my sample that the value of rain is none. 01:21:53.950 --> 01:21:54.450 So, 01:21:54.450 --> 01:21:58.295 given that rain is none, I'm going to sample from this distribution to say, 01:21:58.295 --> 01:22:00.420 all right, what should the value of maintenance be? 01:22:00.420 --> 01:22:02.753 And in this case, maintenance is going to be, let's just 01:22:02.753 --> 01:22:06.570 say, yes, which happens 40% of the time in the event that there is no rain, 01:22:06.570 --> 01:22:07.603 for example. 01:22:07.603 --> 01:22:10.020 And we'll sample all of the rest of the nodes in this way, 01:22:10.020 --> 01:22:12.840 as well, that I want to sample from the train distribution, 01:22:12.840 --> 01:22:17.040 and I'll sample from this first row here where there is no rain, 01:22:17.040 --> 01:22:18.570 but there is track maintenance. 01:22:18.570 --> 01:22:21.980 And I'll sample: 80% of the time, I'll say the train is on time. 01:22:21.980 --> 01:22:24.463 20% of the time, I'll say the train is delayed. 01:22:24.463 --> 01:22:27.630 And finally, we'll do the same thing for whether I make it to my appointment 01:22:27.630 --> 01:22:27.890 or not.
01:22:27.890 --> 01:22:29.490 Did I attend or miss the appointment? 01:22:29.490 --> 01:22:32.700 We'll sample based on this distribution and maybe say that in this case 01:22:32.700 --> 01:22:36.150 I attend the appointment, which happens 90% of the time when 01:22:36.150 --> 01:22:38.730 the train is actually on time. 01:22:38.730 --> 01:22:42.900 So by going through these nodes, I can very quickly just do some sampling 01:22:42.900 --> 01:22:45.720 and get a sample of the possible values that 01:22:45.720 --> 01:22:48.990 could come up from going through this entire Bayesian network 01:22:48.990 --> 01:22:51.540 according to those probability distributions. 01:22:51.540 --> 01:22:54.360 And where this becomes powerful is if I do this not once, 01:22:54.360 --> 01:22:57.100 but I do this thousands or tens of thousands of times 01:22:57.100 --> 01:23:00.400 and generate a whole bunch of samples, all using this distribution. 01:23:00.400 --> 01:23:01.410 I get different samples. 01:23:01.410 --> 01:23:02.820 Maybe some of them are the same. 01:23:02.820 --> 01:23:07.800 But I get a value for each of the possible variables that could come up. 01:23:07.800 --> 01:23:10.620 And so then, if I'm ever faced with a question, a question like, 01:23:10.620 --> 01:23:13.860 what is the probability that the train is on time, 01:23:13.860 --> 01:23:15.900 you could do an exact inference procedure. 01:23:15.900 --> 01:23:18.630 This is no different than the inference problem we had before 01:23:18.630 --> 01:23:21.780 where I could just marginalize, look at all the possible other values 01:23:21.780 --> 01:23:24.390 of the variables and do the computation of inference 01:23:24.390 --> 01:23:28.200 by enumeration to find out this probability exactly. 01:23:28.200 --> 01:23:31.710 But I could also, if I don't care about the exact probability, just sample it. 01:23:31.710 --> 01:23:33.150 Approximate it to get close. 01:23:33.150 --> 01:23:35.040 And this is a powerful tool in AI where we 01:23:35.040 --> 01:23:38.790 don't need to be right 100% of the time or we don't need to be exactly right. 01:23:38.790 --> 01:23:41.130 If we just need to be right with some probability, 01:23:41.130 --> 01:23:44.290 we can often do so more effectively, more efficiently. 01:23:44.290 --> 01:23:46.920 And so here, now, are all of those possible samples. 01:23:46.920 --> 01:23:50.390 I'll sort of highlight the ones where the train is on time. 01:23:50.390 --> 01:23:52.620 I'm ignoring the ones where the train is delayed. 01:23:52.620 --> 01:23:55.350 And in this case, six out of eight 01:23:55.350 --> 01:23:57.690 of the samples have the train arriving on time. 01:23:57.690 --> 01:24:01.320 And so maybe, in this case, I can say that, in six out of eight cases, 01:24:01.320 --> 01:24:03.458 that's the likelihood that the train is on time. 01:24:03.458 --> 01:24:06.000 And with eight samples, that might not be a great prediction. 01:24:06.000 --> 01:24:08.520 But if I had thousands upon thousands of samples, 01:24:08.520 --> 01:24:11.580 then this could be a much better inference procedure 01:24:11.580 --> 01:24:13.680 to be able to do these sorts of calculations. 01:24:13.680 --> 01:24:17.310 So this is a direct sampling method to just do a bunch of samples 01:24:17.310 --> 01:24:21.210 and then figure out what the probability of some event is. 01:24:21.210 --> 01:24:24.400 Now, this from before was an unconditional probability. 01:24:24.400 --> 01:24:27.447 What is the probability that the train is on time?
01:24:27.447 --> 01:24:30.030 And I did that by looking at all the samples and figuring out, 01:24:30.030 --> 01:24:32.372 right here, the ones where the train is on time. 01:24:32.372 --> 01:24:34.080 But sometimes what I'll want to calculate 01:24:34.080 --> 01:24:38.387 is not an unconditional probability, but rather a conditional probability, 01:24:38.387 --> 01:24:40.470 something like, what is the probability that there 01:24:40.470 --> 01:24:45.010 is light rain given that the train is on time, something to that effect. 01:24:45.010 --> 01:24:50.060 And to do that kind of calculation, well, what I might do is here 01:24:50.060 --> 01:24:52.140 are all the samples that I have, and I want 01:24:52.140 --> 01:24:54.720 to calculate a probability distribution given 01:24:54.720 --> 01:24:57.368 that I know that the train is on time. 01:24:57.368 --> 01:24:59.910 So to be able to do that, I can kind of look at the two cases 01:24:59.910 --> 01:25:03.630 where the train was delayed and ignore or reject them, 01:25:03.630 --> 01:25:07.762 sort of exclude them from the possible samples that I'm considering. 01:25:07.762 --> 01:25:09.720 And now I want to look at these remaining cases 01:25:09.720 --> 01:25:11.130 where the train is on time. 01:25:11.130 --> 01:25:13.860 Here are the cases where there is light rain. 01:25:13.860 --> 01:25:16.850 And now I say, OK, these are two out of the six possible cases. 01:25:16.850 --> 01:25:19.580 That can give me an approximation for the probability 01:25:19.580 --> 01:25:23.440 of light rain given the fact that I know the train was on time. 01:25:23.440 --> 01:25:25.700 And I did that in almost exactly the same way 01:25:25.700 --> 01:25:28.660 just by adding an additional step, by saying that, 01:25:28.660 --> 01:25:30.470 all right, when I take each sample, let me 01:25:30.470 --> 01:25:34.460 reject all of the samples that don't match my evidence 01:25:34.460 --> 01:25:37.250 and only consider the samples that do match 01:25:37.250 --> 01:25:39.920 what it is that I have in my evidence that I want 01:25:39.920 --> 01:25:42.020 to make some sort of calculation about. 01:25:42.020 --> 01:25:45.920 And it turns out, using the libraries that we've had for Bayesian networks, 01:25:45.920 --> 01:25:48.740 we can begin to implement this same sort of idea, 01:25:48.740 --> 01:25:51.890 implement rejection sampling, which is what this method is called, 01:25:51.890 --> 01:25:55.850 to be able to figure out some probability, not via direct inference, 01:25:55.850 --> 01:25:57.980 but instead by sampling. 01:25:57.980 --> 01:26:00.290 So what I have here is a program called sample.py-- 01:26:00.290 --> 01:26:02.180 imports the exact same model. 01:26:02.180 --> 01:26:05.490 And what I define first is a program to generate a sample. 01:26:05.490 --> 01:26:09.088 And the way I generate a sample is just by looping over all of the states. 01:26:09.088 --> 01:26:10.880 The states need to be in some sort of order 01:26:10.880 --> 01:26:12.797 to make sure I'm looping in the correct order. 01:26:12.797 --> 01:26:16.010 But effectively, if it is a conditional distribution, 01:26:16.010 --> 01:26:18.410 I'm going to sample based on the parents. 01:26:18.410 --> 01:26:21.240 And otherwise, I'm just going to directly sample the variable, 01:26:21.240 --> 01:26:25.040 like rain, which has no parents-- it's just an unconditional distribution-- 01:26:25.040 --> 01:26:28.640 and keep track of all those parent samples and return the final sample. 
01:26:28.640 --> 01:26:31.290 The exact syntax of this, again, is not particularly important. 01:26:31.290 --> 01:26:33.290 It just happens to be part of the implementation 01:26:33.290 --> 01:26:35.820 details of this particular library. 01:26:35.820 --> 01:26:38.270 The interesting logic is done below. 01:26:38.270 --> 01:26:40.820 Now that I have the ability to generate a sample, 01:26:40.820 --> 01:26:45.020 if I want to know the distribution of the appointment random variable given 01:26:45.020 --> 01:26:48.680 that the train is delayed, well, then I can begin to do calculations like this. 01:26:48.680 --> 01:26:52.430 Let me take 10,000 samples and assemble all my results 01:26:52.430 --> 01:26:53.810 in this list called data. 01:26:53.810 --> 01:26:57.140 I'll go ahead and loop n times-- in this case, 10,000 times. 01:26:57.140 --> 01:27:01.670 I'll generate a sample, and I want to know the distribution of appointment 01:27:01.670 --> 01:27:03.410 given that the train is delayed. 01:27:03.410 --> 01:27:05.900 So according to rejection sampling, I'm only 01:27:05.900 --> 01:27:08.210 going to consider samples where the train is delayed. 01:27:08.210 --> 01:27:11.552 If the train's not delayed, I'm not going to consider those values at all. 01:27:11.552 --> 01:27:13.760 So I'm going to say, all right, if I take the sample, 01:27:13.760 --> 01:27:16.290 look at the value of the train random variable, 01:27:16.290 --> 01:27:19.670 if the train is delayed, well, let me go ahead and add to my data 01:27:19.670 --> 01:27:23.000 that I'm collecting the value of the appointment random variable 01:27:23.000 --> 01:27:25.520 that it took on in this particular sample. 01:27:25.520 --> 01:27:28.610 So I'm only considering the samples where the train is delayed 01:27:28.610 --> 01:27:31.010 and, for each of those samples, considering 01:27:31.010 --> 01:27:32.870 what the value of appointment is. 01:27:32.870 --> 01:27:35.570 And then at the end, I'm using a Python class called Counter, 01:27:35.570 --> 01:27:37.580 which quickly counts up all the values inside 01:27:37.580 --> 01:27:40.100 of a data set so I can take this list of data 01:27:40.100 --> 01:27:44.000 and figure out how many times was my appointment made, 01:27:44.000 --> 01:27:47.360 and how many times was my appointment missed? 01:27:47.360 --> 01:27:49.610 And so this here, with just a couple of lines of code, 01:27:49.610 --> 01:27:53.080 is an implementation of rejection sampling. 01:27:53.080 --> 01:27:58.170 And I can run it by going ahead and running python sample.py. 01:27:58.170 --> 01:28:00.230 And when I do that, here is the result I get. 01:28:00.230 --> 01:28:02.150 This is the result of the counter. 01:28:02.150 --> 01:28:05.750 1,251 times I was able to attend the meeting, 01:28:05.750 --> 01:28:08.900 and 856 times I missed the meeting. 01:28:08.900 --> 01:28:11.550 And you can imagine, by doing more and more samples, 01:28:11.550 --> 01:28:14.480 I'll be able to get a better and better, more accurate result. 01:28:14.480 --> 01:28:16.070 And this is a randomized process. 01:28:16.070 --> 01:28:18.895 It's going to be an approximation of the probability.
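Pieced together, a sample.py along the lines just described might look like this, again assuming the pomegranate 0.x API and the model sketched earlier; generate_sample walks the states in order, sampling each node conditioned on the values already drawn for its parents:

```python
from collections import Counter

import pomegranate

from model import model

def generate_sample():
    # Mapping of each random variable's name to the value sampled for it.
    sample = {}
    # Mapping of each distribution to its sampled value, so that children
    # can condition on the values already drawn for their parents.
    parents = {}
    for state in model.states:
        if isinstance(state.distribution, pomegranate.ConditionalProbabilityTable):
            sample[state.name] = state.distribution.sample(parent_values=parents)
        else:
            sample[state.name] = state.distribution.sample()
        parents[state.distribution] = sample[state.name]
    return sample

# Rejection sampling: estimate the distribution of Appointment given that
# the train is delayed, keeping only the samples that match the evidence.
N = 10000
data = []
for i in range(N):
    sample = generate_sample()
    if sample["train"] == "delayed":
        data.append(sample["appointment"])
print(Counter(data))
```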
01:28:18.895 --> 01:28:21.770 If I run it a different time, you'll notice the numbers are similar-- 01:28:21.770 --> 01:28:25.460 1,272 and 905-- but they're not identical 01:28:25.460 --> 01:28:28.250 because there's some randomization, some likelihood that things 01:28:28.250 --> 01:28:31.730 might be higher or lower, and so this is why we generally want to try and use 01:28:31.730 --> 01:28:35.360 more samples so that we can have a greater amount of confidence 01:28:35.360 --> 01:28:37.760 in our result, be more sure about the result 01:28:37.760 --> 01:28:41.240 that we're getting of whether or not it accurately reflects or represents 01:28:41.240 --> 01:28:43.940 the actual underlying probabilities that are 01:28:43.940 --> 01:28:47.130 inherent inside of this distribution. 01:28:47.130 --> 01:28:50.057 And so this, then, was an instance of rejection sampling. 01:28:50.057 --> 01:28:52.640 And it turns out, there are a number of other sampling methods 01:28:52.640 --> 01:28:55.070 that you could use to begin to try to sample. 01:28:55.070 --> 01:28:57.530 One problem that rejection sampling has is 01:28:57.530 --> 01:29:02.480 that if the evidence you're looking for is a fairly unlikely event, well, 01:29:02.480 --> 01:29:04.610 you're going to be rejecting a lot of samples. 01:29:04.610 --> 01:29:08.490 Like, if I'm looking for the probability of x given some evidence e, 01:29:08.490 --> 01:29:12.680 if e is very unlikely to occur-- like, occurs maybe once every 1,000 times-- 01:29:12.680 --> 01:29:16.040 then I'm only going to be considering one out of every 1,000 samples 01:29:16.040 --> 01:29:18.798 that I do, which is a pretty inefficient method for trying 01:29:18.798 --> 01:29:20.090 to do this sort of calculation. 01:29:20.090 --> 01:29:23.600 I'm throwing away a lot of samples, and it takes computational effort 01:29:23.600 --> 01:29:25.640 to be able to generate those samples, so I'd 01:29:25.640 --> 01:29:27.480 like to not have to do something like that. 01:29:27.480 --> 01:29:30.230 So there are other sampling methods that can try and address this. 01:29:30.230 --> 01:29:33.680 One such sampling method is called likelihood weighting. 01:29:33.680 --> 01:29:36.920 In likelihood weighting, we follow a slightly different procedure, 01:29:36.920 --> 01:29:39.740 and the goal is to avoid needing to throw out 01:29:39.740 --> 01:29:42.590 samples that didn't match the evidence. 01:29:42.590 --> 01:29:46.760 And so what we'll do is we'll start by fixing the values for the evidence 01:29:46.760 --> 01:29:47.300 variables. 01:29:47.300 --> 01:29:49.430 Rather than sample everything, we're going 01:29:49.430 --> 01:29:53.970 to fix the values of the evidence variables and not sample those. 01:29:53.970 --> 01:29:57.650 Then we're going to sample all the other non-evidence variables in the same way, 01:29:57.650 --> 01:30:01.010 just using the Bayesian network, looking at the probability distributions, 01:30:01.010 --> 01:30:04.040 sampling all the non-evidence variables. 01:30:04.040 --> 01:30:08.450 But then what we need to do is weight each sample by its likelihood. 01:30:08.450 --> 01:30:10.520 If our evidence is really unlikely, we want 01:30:10.520 --> 01:30:14.210 to make sure that we've taken into account, how likely was the evidence 01:30:14.210 --> 01:30:16.410 to actually show up in the sample? 
01:30:16.410 --> 01:30:18.590 If I have a sample where the evidence was much more 01:30:18.590 --> 01:30:20.720 likely to show up than another sample, then I 01:30:20.720 --> 01:30:23.060 want to weight the more likely one higher. 01:30:23.060 --> 01:30:25.490 So we're going to weight each sample by its likelihood 01:30:25.490 --> 01:30:29.480 where likelihood is just defined as the probability of all of the evidence. 01:30:29.480 --> 01:30:32.090 Given all the evidence we have, what is the probability 01:30:32.090 --> 01:30:34.640 that it would happen in that particular sample? 01:30:34.640 --> 01:30:37.250 So before, all of our samples were weighted equally. 01:30:37.250 --> 01:30:40.970 They all had a weight of one when we were calculating the overall average. 01:30:40.970 --> 01:30:42.980 In this case, we're going to weight each sample, 01:30:42.980 --> 01:30:46.220 multiply each sample by its likelihood in order 01:30:46.220 --> 01:30:49.252 to get the more accurate distribution. 01:30:49.252 --> 01:30:50.460 So what would this look like? 01:30:50.460 --> 01:30:54.170 Well, if I asked the same question, what is the probability of light rain given 01:30:54.170 --> 01:30:57.050 that the train is on time, when I do the sampling procedure 01:30:57.050 --> 01:30:59.780 and start by trying to sample, I'm going to start 01:30:59.780 --> 01:31:01.520 by fixing the evidence variable. 01:31:01.520 --> 01:31:04.640 I'm already going to have in my sample the train is on time. 01:31:04.640 --> 01:31:06.860 That way, I don't have to throw out anything. 01:31:06.860 --> 01:31:10.610 I'm only sampling things where I know the value of the variables that 01:31:10.610 --> 01:31:13.790 are my evidence are what I expect them to be. 01:31:13.790 --> 01:31:16.310 So I'll go ahead and sample from rain, and maybe this time I 01:31:16.310 --> 01:31:18.318 sample light rain instead of no rain. 01:31:18.318 --> 01:31:21.110 Then I'll sample from track maintenance and say maybe, yes, there's 01:31:21.110 --> 01:31:22.100 track maintenance. 01:31:22.100 --> 01:31:25.190 Then for train, well, I've already fixed it in place. 01:31:25.190 --> 01:31:29.360 Train was an evidence variable, so I'm not going to bother sampling again. 01:31:29.360 --> 01:31:30.820 I'll just go ahead and move on. 01:31:30.820 --> 01:31:35.280 I'll move on to appointment and go ahead and sample from appointment as well. 01:31:35.280 --> 01:31:37.040 So now I've generated a sample. 01:31:37.040 --> 01:31:40.190 I've generated a sample by fixing this evidence variable 01:31:40.190 --> 01:31:42.310 and sampling the other three. 01:31:42.310 --> 01:31:44.390 And the last step is now weighting the sample. 01:31:44.390 --> 01:31:45.920 How much weight should it have? 01:31:45.920 --> 01:31:50.090 And the weight is based on how probable is it that the train was actually 01:31:50.090 --> 01:31:52.560 on time, this evidence actually happened, 01:31:52.560 --> 01:31:55.460 given the values of these other variables, light rain and the fact 01:31:55.460 --> 01:31:57.620 that, yes, there was track maintenance? 01:31:57.620 --> 01:32:00.260 Well, to do that, I can just go back to the train variable 01:32:00.260 --> 01:32:02.900 and say, all right, if there was light rain and track 01:32:02.900 --> 01:32:05.060 maintenance, the likelihood of my evidence, 01:32:05.060 --> 01:32:08.570 the likelihood that my train was on time, is 0.6. 01:32:08.570 --> 01:32:13.250 And so this particular sample would have a weight of 0.6. 
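Likelihood weighting isn't implemented in the lecture's code, but a minimal self-contained sketch of the idea looks like this. The probability tables below reuse the numbers the transcript states and fill the remaining rows with illustrative values; the Appointment node is skipped because this query only concerns rain:

```python
import random

# P(Rain): values from the lecture.
P_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
# P(maintenance = yes | rain): the "none" row is from the lecture,
# the other rows are illustrative assumptions.
P_maint_yes = {"none": 0.4, "light": 0.2, "heavy": 0.1}
# P(train = on time | rain, maintenance): ("none", "yes") and
# ("light", "yes") match the lecture; other rows are illustrative.
P_on_time = {
    ("none", "yes"): 0.8, ("none", "no"): 0.9,
    ("light", "yes"): 0.6, ("light", "no"): 0.7,
    ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5,
}

def weighted_sample():
    """One likelihood-weighted sample with the evidence Train = on time fixed."""
    # Sample the non-evidence variables as usual.
    rain = random.choices(list(P_rain), weights=list(P_rain.values()))[0]
    maintenance = "yes" if random.random() < P_maint_yes[rain] else "no"
    # Don't sample Train; instead, weight the sample by the probability
    # that the evidence (train on time) occurs given the sampled parents.
    weight = P_on_time[(rain, maintenance)]
    return rain, weight

# Estimate P(rain = light | train on time) as a weighted proportion.
total_weight = light_weight = 0.0
for _ in range(100000):
    rain, weight = weighted_sample()
    total_weight += weight
    if rain == "light":
        light_weight += weight
print(light_weight / total_weight)
```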
01:32:13.250 --> 01:32:15.740 And I could repeat the sampling procedure again and again. 01:32:15.740 --> 01:32:18.140 Each time, every sample would be given a weight 01:32:18.140 --> 01:32:22.928 according to the probability of the evidence that I see associated with it. 01:32:22.928 --> 01:32:25.970 And there are other sampling methods that exist, as well, but all of them 01:32:25.970 --> 01:32:27.845 are designed to try and get at the same idea, 01:32:27.845 --> 01:32:30.950 to approximate the inference procedure of figuring out 01:32:30.950 --> 01:32:33.540 the value of a variable. 01:32:33.540 --> 01:32:35.570 So we've now dealt with probability as it 01:32:35.570 --> 01:32:38.840 pertains to particular variables that have these discrete values. 01:32:38.840 --> 01:32:40.910 But what we haven't really considered is how 01:32:40.910 --> 01:32:44.300 values might change over time, that we've considered something 01:32:44.300 --> 01:32:47.870 like a variable for rain where rain can take on values of none or light 01:32:47.870 --> 01:32:50.600 rain or heavy rain, but, in practice, usually when 01:32:50.600 --> 01:32:54.950 we consider values for variables like rain, we like to consider 01:32:54.950 --> 01:32:58.020 how the values of these variables change over time. 01:32:58.020 --> 01:33:02.040 How do we deal with uncertainty over a period of time? 01:33:02.040 --> 01:33:04.590 This can come up in the context of weather, for example-- 01:33:04.590 --> 01:33:06.830 if I have sunny days and I have rainy days. 01:33:06.830 --> 01:33:11.450 And I'd like to know not just what is the probability that it's raining now, 01:33:11.450 --> 01:33:14.210 but what is the probability that it rains tomorrow or the day 01:33:14.210 --> 01:33:15.838 after that or the day after that? 01:33:15.838 --> 01:33:17.630 And so to do this, we're going to introduce 01:33:17.630 --> 01:33:19.440 a slightly different kind of model. 01:33:19.440 --> 01:33:23.300 But here we're going to have a random variable, not just one for the weather, 01:33:23.300 --> 01:33:25.643 but for every possible time step. 01:33:25.643 --> 01:33:27.560 And you can define time step however you like. 01:33:27.560 --> 01:33:30.680 A simple way is just to use days as your time step. 01:33:30.680 --> 01:33:34.220 And so we can define a variable called x sub t, which 01:33:34.220 --> 01:33:36.620 is going to be the weather at time t. 01:33:36.620 --> 01:33:39.350 So x sub zero might be the weather on day zero, 01:33:39.350 --> 01:33:42.400 x sub one might be the weather on day one, so on and so forth, 01:33:42.400 --> 01:33:45.022 x sub two is the weather on day two. 01:33:45.022 --> 01:33:46.730 But as you can imagine, if we start to do 01:33:46.730 --> 01:33:48.740 this over longer and longer periods of time, 01:33:48.740 --> 01:33:51.282 there's an incredible amount of data that might go into this. 01:33:51.282 --> 01:33:53.960 If you're keeping track of data about the weather for a year, 01:33:53.960 --> 01:33:57.240 now suddenly you might be trying to predict the weather tomorrow given 01:33:57.240 --> 01:34:00.620 365 days of previous pieces of evidence, and that's 01:34:00.620 --> 01:34:03.530 a lot of evidence to have to deal with and manipulate and calculate. 01:34:03.530 --> 01:34:06.410 Probably nobody knows what the exact conditional probability 01:34:06.410 --> 01:34:10.070 distribution is for all of those combinations of variables.
01:34:10.070 --> 01:34:13.070 And so when we're trying to do this inference inside of a computer, when 01:34:13.070 --> 01:34:16.640 we're trying to reasonably do this sort of analysis, 01:34:16.640 --> 01:34:19.053 it's helpful to make some simplifying assumptions, 01:34:19.053 --> 01:34:21.470 some assumptions about the problem that we can just assume 01:34:21.470 --> 01:34:23.930 are true to make our lives a little bit easier. 01:34:23.930 --> 01:34:26.270 Even if they're not totally accurate assumptions, 01:34:26.270 --> 01:34:28.703 if they're close to accurate or approximate, 01:34:28.703 --> 01:34:29.870 they're usually pretty good. 01:34:29.870 --> 01:34:33.350 And the assumption we're going to make is called the Markov assumption, 01:34:33.350 --> 01:34:38.210 which is the assumption that the current state depends only on a finite fixed 01:34:38.210 --> 01:34:40.220 number of previous states. 01:34:40.220 --> 01:34:44.210 So the current day's weather depends not on all of the previous day's weather 01:34:44.210 --> 01:34:47.150 for all of history, but the current day's weather I 01:34:47.150 --> 01:34:49.758 can predict just based on yesterday's weather 01:34:49.758 --> 01:34:52.550 or just based on the last two days' weather or the last three days' 01:34:52.550 --> 01:34:53.050 weather. 01:34:53.050 --> 01:34:57.620 But oftentimes, we're going to deal with just the one previous state helping 01:34:57.620 --> 01:34:59.720 to predict this current state. 01:34:59.720 --> 01:35:01.970 And by putting a whole bunch of these random variables 01:35:01.970 --> 01:35:04.400 together, using this Markov assumption, we 01:35:04.400 --> 01:35:08.090 can create what's called a Markov chain where a Markov chain is just 01:35:08.090 --> 01:35:11.960 some sequence of random variables where each of the variables' distributions 01:35:11.960 --> 01:35:13.772 follows that Markov assumption. 01:35:13.772 --> 01:35:16.480 And so we'll do an example of this where the Markov assumption is 01:35:16.480 --> 01:35:17.590 I can predict the weather. 01:35:17.590 --> 01:35:19.050 Is it sunny or rainy? 01:35:19.050 --> 01:35:21.520 And we'll just consider those two possibilities for now, 01:35:21.520 --> 01:35:23.395 even though there are other types of weather. 01:35:23.395 --> 01:35:26.650 But I can predict each day's weather just on the prior day's weather. 01:35:26.650 --> 01:35:30.430 Using today's weather, I can come up with a probability distribution 01:35:30.430 --> 01:35:31.825 for tomorrow's weather. 01:35:31.825 --> 01:35:33.700 And here's what this weather might look like. 01:35:33.700 --> 01:35:37.030 It's formatted in terms of a matrix, as you might describe it, 01:35:37.030 --> 01:35:41.410 as sort of rows and columns of values where on the left-hand side 01:35:41.410 --> 01:35:45.850 I have today's weather, represented by the variable x sub t. 01:35:45.850 --> 01:35:48.730 And then over here in the columns, I have tomorrow's weather, 01:35:48.730 --> 01:35:54.790 represented by the variable x sub t plus one, t plus one day's weather instead. 01:35:54.790 --> 01:35:58.990 And what this matrix is saying is if today is sunny, 01:35:58.990 --> 01:36:02.440 well, then, it's more likely than not that tomorrow is also sunny. 01:36:02.440 --> 01:36:05.990 Oftentimes the weather stays consistent for multiple days in a row.
01:36:05.990 --> 01:36:08.200 And for example, let's say that if today is sunny, 01:36:08.200 --> 01:36:12.820 our model says that tomorrow, with probability 0.8, it will also be sunny, 01:36:12.820 --> 01:36:15.610 and with probability 0.2 it will be raining. 01:36:15.610 --> 01:36:19.245 And likewise, if today is raining, then it's 01:36:19.245 --> 01:36:21.370 more likely than not that tomorrow is also raining. 01:36:21.370 --> 01:36:23.620 With probability 0.7, it'll be raining. 01:36:23.620 --> 01:36:26.710 With probability 0.3, it will be sunny. 01:36:26.710 --> 01:36:28.840 So this matrix, this description of how it 01:36:28.840 --> 01:36:32.290 is we transition from one state to the next state, 01:36:32.290 --> 01:36:34.540 is what we're going to call the transition model. 01:36:34.540 --> 01:36:37.030 And using the transition model, you can begin 01:36:37.030 --> 01:36:41.770 to construct this Markov chain by just predicting, given today's weather, 01:36:41.770 --> 01:36:44.020 what's the likelihood of tomorrow's weather happening? 01:36:44.020 --> 01:36:46.930 And you can imagine doing a similar sampling 01:36:46.930 --> 01:36:49.660 procedure where you take this information, 01:36:49.660 --> 01:36:51.940 you sample what tomorrow's weather is going to be, 01:36:51.940 --> 01:36:53.980 using that you sample the next day's weather, 01:36:53.980 --> 01:36:58.390 and the result of that is you can form this Markov chain: x zero, 01:36:58.390 --> 01:37:01.120 at time day zero, is sunny, the next day is sunny, 01:37:01.120 --> 01:37:04.240 maybe the next day it changes to raining, then raining, then raining. 01:37:04.240 --> 01:37:06.910 And the pattern that this Markov chain follows, 01:37:06.910 --> 01:37:08.890 given the distribution that we had access to, 01:37:08.890 --> 01:37:11.850 this transition model here, is that when it's sunny, 01:37:11.850 --> 01:37:13.600 it tends to stay sunny for a little while. 01:37:13.600 --> 01:37:16.100 The next couple days tend to be sunny too. 01:37:16.100 --> 01:37:19.735 And when it's raining, it tends to be raining as well. 01:37:19.735 --> 01:37:21.860 And so you get a Markov chain that looks like this. 01:37:21.860 --> 01:37:23.193 And you can do analysis on this. 01:37:23.193 --> 01:37:25.630 You can say, given that today is raining, 01:37:25.630 --> 01:37:27.790 what is the probability that tomorrow it's raining, 01:37:27.790 --> 01:37:29.770 or you can begin to ask probability questions, 01:37:29.770 --> 01:37:33.970 like what is the probability of this sequence of five values-- sun, sun, 01:37:33.970 --> 01:37:35.200 rain, rain, rain-- 01:37:35.200 --> 01:37:37.610 and answer those sorts of questions too. 01:37:37.610 --> 01:37:40.780 And it turns out there are, again, many Python libraries for interacting 01:37:40.780 --> 01:37:44.620 with models like this of probabilities that have distributions 01:37:44.620 --> 01:37:47.440 and random variables that are based on previous variables 01:37:47.440 --> 01:37:49.720 according to this Markov assumption. 01:37:49.720 --> 01:37:53.090 And pomegranate, too, has ways of dealing with these sorts of variables. 01:37:53.090 --> 01:37:59.800 So I'll go ahead and go into the chain directory 01:37:59.800 --> 01:38:02.590 where I have some information about Markov chains. 01:38:02.590 --> 01:38:05.770 And here I've defined a file called model.py where 01:38:05.770 --> 01:38:08.320 I've defined a Markov chain in a very similar syntax.
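A sketch of what that chain/model.py might look like, using the MarkovChain class from the pomegranate 0.x releases:

```python
from pomegranate import (ConditionalProbabilityTable, DiscreteDistribution,
                         MarkovChain)

# Starting distribution: 50/50 between sun and rain.
start = DiscreteDistribution({"sun": 0.5, "rain": 0.5})

# Transition model: tomorrow's weather conditioned on today's,
# the same matrix described above.
transitions = ConditionalProbabilityTable([
    ["sun", "sun", 0.8], ["sun", "rain", 0.2],
    ["rain", "sun", 0.3], ["rain", "rain", 0.7],
], [start])

model = MarkovChain([start, transitions])

# Sample 50 states from the chain: 50 days' worth of weather.
print(model.sample(50))
```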
01:38:08.320 --> 01:38:11.080 And again, the exact syntax doesn't matter so much as the idea 01:38:11.080 --> 01:38:14.410 that I'm encoding this information into a Python program 01:38:14.410 --> 01:38:17.290 so that the program has access to these distributions. 01:38:17.290 --> 01:38:19.930 I've here defined some starting distributions. 01:38:19.930 --> 01:38:23.020 So every Markov model begins at some point in time, 01:38:23.020 --> 01:38:25.120 and I need to give it some starting distribution. 01:38:25.120 --> 01:38:27.078 And so we'll just say, you know what, to start, 01:38:27.078 --> 01:38:29.380 you can pick 50/50 between sunny and rainy. 01:38:29.380 --> 01:38:33.370 We'll say it's sunny 50% of the time, rainy 50% of the time. 01:38:33.370 --> 01:38:36.430 And then down below, I've here defined the transition model, 01:38:36.430 --> 01:38:39.770 how it is that I transition from one day to the next. 01:38:39.770 --> 01:38:42.520 And here I've encoded that exact same matrix from before, 01:38:42.520 --> 01:38:45.210 that if it was sunny today, then with probability 0.8 01:38:45.210 --> 01:38:47.650 it will be sunny tomorrow, and it will be raining tomorrow 01:38:47.650 --> 01:38:49.540 with probability 0.2. 01:38:49.540 --> 01:38:54.540 And I likewise have another distribution for if it was raining today instead. 01:38:54.540 --> 01:38:56.980 And so that alone defines the Markov model. 01:38:56.980 --> 01:38:59.410 You can begin to answer questions using that model. 01:38:59.410 --> 01:39:02.680 But one thing I'll just do is sample from the Markov chain. 01:39:02.680 --> 01:39:06.130 And it turns out there is a method built into this Markov chain library that 01:39:06.130 --> 01:39:08.440 allows me to sample 50 states from the chain, 01:39:08.440 --> 01:39:13.000 basically just simulating 50 instances of weather. 01:39:13.000 --> 01:39:18.290 And so let me go ahead and run this, python model.py. 01:39:18.290 --> 01:39:22.570 And when I run it, what I get is it is going to sample from this Markov chain 01:39:22.570 --> 01:39:26.498 50 states, 50 days worth of weather that it's just going to randomly sample. 01:39:26.498 --> 01:39:29.290 And you can imagine sampling many times to be able to get more data 01:39:29.290 --> 01:39:30.820 to be able to do more analysis. 01:39:30.820 --> 01:39:33.580 But here, for example, it's sunny two days 01:39:33.580 --> 01:39:37.360 in a row, rainy a whole bunch of days in a row before it changes back to sun. 01:39:37.360 --> 01:39:41.170 And so you get this model that follows the distribution that we originally 01:39:41.170 --> 01:39:43.960 described, that follows the distribution of sunny days 01:39:43.960 --> 01:39:49.780 tend to lead to more sunny days, rainy days tend to lead to more rainy days. 01:39:49.780 --> 01:39:52.060 And that, then, is the Markov model. 01:39:52.060 --> 01:39:56.260 And Markov models rely on us knowing the values of these individual states. 01:39:56.260 --> 01:40:00.490 I know that today is sunny or that today is rainy, and using that information, 01:40:00.490 --> 01:40:04.660 I can draw some sort of inference about what tomorrow is going to be like. 01:40:04.660 --> 01:40:07.130 But in practice, this often isn't the case. 01:40:07.130 --> 01:40:09.310 It often isn't the case that I know for certain 01:40:09.310 --> 01:40:11.620 what the exact state of the world is.
01:40:11.620 --> 01:40:14.710 Oftentimes the exact state of the world is unknown, 01:40:14.710 --> 01:40:18.480 but I'm able to somehow sense some information about that state: 01:40:18.480 --> 01:40:22.385 a robot or an AI doesn't have exact knowledge about the world around it, 01:40:22.385 --> 01:40:24.510 but it has some sort of sensor, whether that sensor 01:40:24.510 --> 01:40:27.240 is a camera or sensors that detect distance 01:40:27.240 --> 01:40:30.300 or just a microphone that is sensing audio, for example. 01:40:30.300 --> 01:40:33.990 It is sensing data, and that data is somehow 01:40:33.990 --> 01:40:36.930 related to the state of the world, even if our AI doesn't actually 01:40:36.930 --> 01:40:41.100 know what the underlying true state of the world 01:40:41.100 --> 01:40:42.730 actually is. 01:40:42.730 --> 01:40:45.480 And for that, we need to get into the world of sensor models, 01:40:45.480 --> 01:40:48.420 the way of describing how it is that we relate 01:40:48.420 --> 01:40:51.600 the hidden state, the underlying true state of the world, 01:40:51.600 --> 01:40:56.880 to the observation, what it is that the AI knows or the AI has access 01:40:56.880 --> 01:40:58.810 to. 01:40:58.810 --> 01:41:02.880 And so for example, a hidden state might be a robot's position. 01:41:02.880 --> 01:41:05.650 If a robot is exploring new, uncharted territory, 01:41:05.650 --> 01:41:08.580 the robot likely doesn't know exactly where it is. 01:41:08.580 --> 01:41:10.000 But it does have an observation. 01:41:10.000 --> 01:41:12.510 It has robot sensor data where it can sense 01:41:12.510 --> 01:41:16.560 how far away are possible obstacles around it, and using that information, 01:41:16.560 --> 01:41:19.230 using the observed information that it has, 01:41:19.230 --> 01:41:22.290 it can infer something about the hidden state, 01:41:22.290 --> 01:41:26.220 because what the true hidden state is influences those observations. 01:41:26.220 --> 01:41:29.370 Whatever the robot's true position is 01:41:29.370 --> 01:41:33.420 has some effect upon the sensor data the robot is able to collect, 01:41:33.420 --> 01:41:36.330 even if the robot doesn't actually know for certain 01:41:36.330 --> 01:41:39.090 what its true position is. 01:41:39.090 --> 01:41:42.300 Likewise, if you think about a voice recognition or a speech recognition 01:41:42.300 --> 01:41:47.600 program that listens to you and is able to respond to you, something like Alexa 01:41:47.600 --> 01:41:50.830 or what Apple and Google are doing with their voice recognition as well, 01:41:50.830 --> 01:41:54.090 that you might imagine that the hidden state, the underlying state, 01:41:54.090 --> 01:41:55.740 is what words are actually spoken. 01:41:55.740 --> 01:41:58.290 The true nature of the world contains you 01:41:58.290 --> 01:42:00.270 saying a particular sequence of words. 01:42:00.270 --> 01:42:04.380 But your phone or your smart home device doesn't know for sure 01:42:04.380 --> 01:42:05.940 exactly what words you said. 01:42:05.940 --> 01:42:11.100 The only observation that the AI has access to is some audio wave forms.
01:42:11.100 --> 01:42:13.710 And those audio waveforms are, of course, dependent 01:42:13.710 --> 01:42:16.110 upon this hidden state, and you can infer, 01:42:16.110 --> 01:42:20.520 based on those audio waveforms, what the words spoken likely were, 01:42:20.520 --> 01:42:23.490 but you might not know with 100% certainty what 01:42:23.490 --> 01:42:25.330 that hidden state actually is. 01:42:25.330 --> 01:42:27.630 And it might be a task to try and predict: 01:42:27.630 --> 01:42:30.300 given this observation, given these audio waveforms, 01:42:30.300 --> 01:42:34.142 can you figure out what the actual words spoken are? 01:42:34.142 --> 01:42:35.850 Likewise, you might imagine a website. 01:42:35.850 --> 01:42:38.490 True user engagement might be information you don't directly 01:42:38.490 --> 01:42:41.880 have access to, but you can observe data, like website or app 01:42:41.880 --> 01:42:44.220 analytics about how often this button was clicked 01:42:44.220 --> 01:42:47.220 or how often people are interacting with a page in a particular way. 01:42:47.220 --> 01:42:51.190 And you can use that to infer things about your users as well. 01:42:51.190 --> 01:42:54.968 So this type of problem comes up all the time when we're dealing with AI 01:42:54.968 --> 01:42:56.760 and trying to infer things about the world, 01:42:56.760 --> 01:43:00.750 that often AI doesn't really know the hidden true state of the world. 01:43:00.750 --> 01:43:03.930 All that AI has access to is some observation 01:43:03.930 --> 01:43:07.440 that is related to the hidden true state, but it's not direct. 01:43:07.440 --> 01:43:08.790 There might be some noise there. 01:43:08.790 --> 01:43:10.985 The audio waveform might have some additional noise 01:43:10.985 --> 01:43:12.360 that might be difficult to parse. 01:43:12.360 --> 01:43:14.910 The sensor data might not be exactly correct. 01:43:14.910 --> 01:43:16.860 There's some noise that might not allow you 01:43:16.860 --> 01:43:19.560 to conclude with certainty what the hidden state is, but can 01:43:19.560 --> 01:43:22.100 allow you to infer what it might be. 01:43:22.100 --> 01:43:24.348 And so the simple example we'll take a look at here 01:43:24.348 --> 01:43:27.390 is imagining the hidden state as the weather, whether it's sunny or rainy, 01:43:27.390 --> 01:43:31.530 and imagine you are programming an AI inside of a building that maybe 01:43:31.530 --> 01:43:34.710 has access to just a camera inside the building, 01:43:34.710 --> 01:43:37.890 and all you have access to is an observation as to 01:43:37.890 --> 01:43:41.790 whether or not employees are bringing an umbrella into the building. 01:43:41.790 --> 01:43:44.290 You can detect whether someone is carrying an umbrella or not, 01:43:44.290 --> 01:43:47.700 and so you might have an observation as to whether or not an umbrella is 01:43:47.700 --> 01:43:49.320 brought into the building. 01:43:49.320 --> 01:43:51.690 And using that information, you want to predict 01:43:51.690 --> 01:43:53.790 whether it's sunny or rainy, even if you don't 01:43:53.790 --> 01:43:55.877 know what the underlying weather is. 01:43:55.877 --> 01:43:57.960 So the underlying weather might be sunny or rainy. 01:43:57.960 --> 01:44:01.462 And if it's raining, obviously people are more likely to bring an umbrella. 01:44:01.462 --> 01:44:03.420 And so whether or not people bring an umbrella, 01:44:03.420 --> 01:44:06.773 your observation tells you something about the hidden state.
01:44:06.773 --> 01:44:08.940 And of course, this is a bit of a contrived example, 01:44:08.940 --> 01:44:11.370 but the idea here is to think about this more broadly: 01:44:11.370 --> 01:44:14.370 more generally, any time you observe something, 01:44:14.370 --> 01:44:18.025 that observation has to do with some underlying hidden state. 01:44:18.025 --> 01:44:21.150 And so to try and model this type of idea where we have these hidden states 01:44:21.150 --> 01:44:24.180 and observations, rather than just use a Markov model, which 01:44:24.180 --> 01:44:26.160 has state, state, state, state, each of which 01:44:26.160 --> 01:44:29.700 is connected by that transition matrix that we described before, 01:44:29.700 --> 01:44:32.640 we're going to use what we call a hidden Markov model-- 01:44:32.640 --> 01:44:34.740 very similar to a Markov model, but this is 01:44:34.740 --> 01:44:37.920 going to allow us to model a system that has hidden states 01:44:37.920 --> 01:44:41.520 that we don't directly observe along with some observed event 01:44:41.520 --> 01:44:43.740 that we do actually see. 01:44:43.740 --> 01:44:45.720 And so in addition to that transition model 01:44:45.720 --> 01:44:48.780 that we still need of saying, given the underlying state of the world, 01:44:48.780 --> 01:44:52.440 if it's sunny or rainy, what's the probability of tomorrow's weather, 01:44:52.440 --> 01:44:56.310 we also need another model that, given some state, is 01:44:56.310 --> 01:44:58.500 going to give us an observation: green, 01:44:58.500 --> 01:45:01.440 yes, someone brings an umbrella into the office, or red, 01:45:01.440 --> 01:45:03.930 no, nobody brings umbrellas into the office. 01:45:03.930 --> 01:45:06.772 And so the observation might be that if it's sunny, 01:45:06.772 --> 01:45:09.480 then odds are nobody is going to bring an umbrella to the office. 01:45:09.480 --> 01:45:11.760 But maybe some people are just being cautious 01:45:11.760 --> 01:45:14.490 and they do bring an umbrella to the office anyways. 01:45:14.490 --> 01:45:17.725 And if it's raining, then with much higher probability 01:45:17.725 --> 01:45:20.100 people are going to bring umbrellas into the office. 01:45:20.100 --> 01:45:23.280 But maybe, if the rain was unexpected, people didn't bring an umbrella, 01:45:23.280 --> 01:45:25.990 and so there might be some other probability as well. 01:45:25.990 --> 01:45:28.860 So using the observations, you can begin to predict, 01:45:28.860 --> 01:45:32.070 with reasonable likelihood, what the underlying state is, 01:45:32.070 --> 01:45:35.440 even if you don't actually get to observe the underlying state, 01:45:35.440 --> 01:45:39.030 if you don't get to see what the hidden state is actually equal to. 01:45:39.030 --> 01:45:41.540 This here we'll often call the sensor model. 01:45:41.540 --> 01:45:44.280 It's also often called the emission probabilities, 01:45:44.280 --> 01:45:48.120 because the underlying state emits some sort of emission 01:45:48.120 --> 01:45:49.660 that you then observe. 01:45:49.660 --> 01:45:53.220 And so that can be another way of describing that same idea.
01:45:53.220 --> 01:45:55.860 And the sensor Markov assumption that we're going to use 01:45:55.860 --> 01:45:59.340 is this assumption that the evidence variable, the thing we observe, 01:45:59.340 --> 01:46:03.480 the emission that gets produced, depends only on the corresponding state, 01:46:03.480 --> 01:46:06.620 meaning I can predict whether or not people will bring umbrellas 01:46:06.620 --> 01:46:11.310 based entirely on whether it is sunny or rainy today. 01:46:11.310 --> 01:46:13.950 Of course, again, this assumption might not hold in practice; 01:46:13.950 --> 01:46:15.458 in practice, 01:46:15.458 --> 01:46:17.250 whether or not people bring umbrellas might 01:46:17.250 --> 01:46:20.042 depend not just on today's weather, but also on yesterday's weather 01:46:20.042 --> 01:46:20.910 and the day before. 01:46:20.910 --> 01:46:23.100 But for simplification purposes, it can be 01:46:23.100 --> 01:46:25.920 helpful to apply this sort of assumption just 01:46:25.920 --> 01:46:29.130 to allow us to be able to reason about these probabilities a little more 01:46:29.130 --> 01:46:30.130 easily. 01:46:30.130 --> 01:46:34.770 And if we're able to approximate it, we can still often get a very good answer. 01:46:34.770 --> 01:46:37.710 And so what these hidden Markov models end up looking like is a little 01:46:37.710 --> 01:46:41.730 something like this, where now, rather than just have one chain of states-- 01:46:41.730 --> 01:46:43.860 like sun, sun, rain, rain, rain-- 01:46:43.860 --> 01:46:49.650 we instead have this upper level, which is the underlying state of the world, 01:46:49.650 --> 01:46:53.070 is it sunny or is it rainy, and those are connected by that transition 01:46:53.070 --> 01:46:54.690 matrix we described before. 01:46:54.690 --> 01:46:57.510 But each of these states produces an emission, 01:46:57.510 --> 01:47:01.590 produces an observation that I see, that on this day it was sunny 01:47:01.590 --> 01:47:04.917 and people didn't bring umbrellas, and on this day it was sunny 01:47:04.917 --> 01:47:07.500 but people did bring umbrellas, and on this day it was raining 01:47:07.500 --> 01:47:09.960 and people did bring umbrellas, and so on and so forth. 01:47:09.960 --> 01:47:12.930 And so each of these underlying states, represented 01:47:12.930 --> 01:47:16.740 by x sub t, for t equals 0, 1, 2, and so on and so forth, 01:47:16.740 --> 01:47:19.450 produces some sort of observation or emission, 01:47:19.450 --> 01:47:20.950 which is what the E stands for-- 01:47:20.950 --> 01:47:25.700 E sub 0, E sub 1, E sub 2, so on and so forth. 01:47:25.700 --> 01:47:28.893 And so this, too, is a way of trying to represent this idea. 01:47:28.893 --> 01:47:31.560 And what you want to think about is that these underlying states 01:47:31.560 --> 01:47:35.790 are the true nature of the world, the robot's position as it moves over time, 01:47:35.790 --> 01:47:39.030 and that produces some sort of sensor data that might be observed, 01:47:39.030 --> 01:47:41.490 or what people are actually saying, using 01:47:41.490 --> 01:47:45.390 the emission data of what audio waveforms you detect in order to process 01:47:45.390 --> 01:47:47.330 that data and try and figure it out. 01:47:47.330 --> 01:47:49.830 And there are a number of possible tasks that you might want 01:47:49.830 --> 01:47:52.150 to do given this kind of information.
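(Putting the two assumptions together, the Markov assumption for transitions and the sensor Markov assumption for emissions, the joint probability of a sequence of hidden states and emissions factors as follows. This equation isn't shown in the lecture, but it is the standard consequence of combining those two assumptions:)

$$P(x_0, \dots, x_T, e_0, \dots, e_T) = P(x_0)\,P(e_0 \mid x_0)\,\prod_{t=1}^{T} P(x_t \mid x_{t-1})\,P(e_t \mid x_t)$$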
01:47:52.150 --> 01:47:54.750 And one of the simplest is trying to infer something 01:47:54.750 --> 01:47:58.560 about the future or the past or about these sorts of hidden states 01:47:58.560 --> 01:47:59.580 that might exist. 01:47:59.580 --> 01:48:01.310 And so the tasks that you'll often see-- 01:48:01.310 --> 01:48:03.893 and we're not going to go into the mathematics of these tasks, 01:48:03.893 --> 01:48:07.020 but they're all based on this same idea of conditional probabilities 01:48:07.020 --> 01:48:09.990 and using the probability distributions we have 01:48:09.990 --> 01:48:12.180 to draw these sorts of conclusions. 01:48:12.180 --> 01:48:16.410 One task is called filtering, which is, given observations from the start 01:48:16.410 --> 01:48:20.310 until now, calculate the distribution for the current state, 01:48:20.310 --> 01:48:23.520 meaning given information from the beginning of time 01:48:23.520 --> 01:48:26.610 until now, on which days did people bring an umbrella 01:48:26.610 --> 01:48:28.770 or not bring an umbrella, can I calculate 01:48:28.770 --> 01:48:32.280 the probability of the current state: today, is it sunny 01:48:32.280 --> 01:48:33.570 or is it raining? 01:48:33.570 --> 01:48:35.670 Another task that might be possible is prediction, 01:48:35.670 --> 01:48:37.320 which is looking towards the future. 01:48:37.320 --> 01:48:39.690 Given observations about people bringing umbrellas 01:48:39.690 --> 01:48:43.350 from the beginning of when we started counting time until now, 01:48:43.350 --> 01:48:47.710 can I figure out the distribution for whether tomorrow is sunny or raining? 01:48:47.710 --> 01:48:51.240 And you can also go backwards as well, via smoothing, where I can say, 01:48:51.240 --> 01:48:54.810 given observations from start until now, calculate the distributions 01:48:54.810 --> 01:48:56.460 for some past state. 01:48:56.460 --> 01:49:00.090 I know that people brought umbrellas yesterday and people brought 01:49:00.090 --> 01:49:03.780 umbrellas today, and so given two days' worth of data of people bringing umbrellas, 01:49:03.780 --> 01:49:06.713 what's the probability that yesterday it was raining? 01:49:06.713 --> 01:49:08.880 And the fact that I know that people brought umbrellas today, 01:49:08.880 --> 01:49:11.160 that might inform that inference as well. 01:49:11.160 --> 01:49:13.740 It might influence those probabilities. 01:49:13.740 --> 01:49:17.340 And there's also a most likely explanation task, 01:49:17.340 --> 01:49:19.510 in addition to other tasks that might exist as well, 01:49:19.510 --> 01:49:21.750 which is, given observations 01:49:21.750 --> 01:49:25.920 from the start up until now, figuring out the most likely sequence of states. 01:49:25.920 --> 01:49:28.528 And this is what we're going to take a look at now, this idea 01:49:28.528 --> 01:49:30.570 that if I have all these observations-- umbrella, 01:49:30.570 --> 01:49:32.790 no umbrella, umbrella, no umbrella-- can I 01:49:32.790 --> 01:49:36.990 calculate the most likely sequence of states, sun, rain, sun, rain, and whatnot, that 01:49:36.990 --> 01:49:41.610 actually represented the true weather that would produce these observations?
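(In symbols, with X sub t for the hidden state at time t and e for the evidence observed so far, notation assumed here to match the diagram described above, the four tasks are:)

filtering: compute $P(X_t \mid e_{0:t})$
prediction: compute $P(X_{t+k} \mid e_{0:t})$ for some $k > 0$
smoothing: compute $P(X_k \mid e_{0:t})$ for some $k < t$
most likely explanation: compute $\arg\max_{x_{0:t}} P(x_{0:t} \mid e_{0:t})$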
01:49:41.610 --> 01:49:43.590 And this is quite common when you're trying 01:49:43.590 --> 01:49:46.530 to do something like voice recognition, for example, that you have 01:49:46.530 --> 01:49:49.830 these emissions of audio waveforms and you would like to calculate, 01:49:49.830 --> 01:49:52.260 based on all of the observations that you have, 01:49:52.260 --> 01:49:54.750 what is the most likely sequence of actual words 01:49:54.750 --> 01:49:59.100 or syllables or sounds that the user actually made when they were speaking 01:49:59.100 --> 01:50:01.230 to this particular device, or other tasks that 01:50:01.230 --> 01:50:03.740 might come up in that context as well. 01:50:03.740 --> 01:50:07.800 And so we can try this out by going ahead and going into the HMM 01:50:07.800 --> 01:50:11.790 directory, HMM for Hidden Markov Model. 01:50:11.790 --> 01:50:17.350 And here what I've done is I've defined a model that first 01:50:17.350 --> 01:50:22.410 defines my possible states, sun and rain, along with their emission 01:50:22.410 --> 01:50:25.690 probabilities, the observation model or the emission model, 01:50:25.690 --> 01:50:30.310 where here, given that I know that it's sunny, the probability that I 01:50:30.310 --> 01:50:32.590 see people bring an umbrella is 0.2. 01:50:32.590 --> 01:50:35.470 The probability of no umbrella is 0.8. 01:50:35.470 --> 01:50:37.288 And likewise, if it's raining, then people 01:50:37.288 --> 01:50:38.830 are more likely to bring an umbrella. 01:50:38.830 --> 01:50:40.630 Umbrella has a probability of 0.9. 01:50:40.630 --> 01:50:42.580 No umbrella has a probability of 0.1. 01:50:42.580 --> 01:50:47.350 So the actual underlying hidden states, those states are sun and rain. 01:50:47.350 --> 01:50:50.500 But the things that I observe, the observations that I can see, 01:50:50.500 --> 01:50:56.270 are either umbrella or no umbrella as the things that I observe as a result. 01:50:56.270 --> 01:51:00.730 To this, then, I also need to add a transition matrix, same as before, 01:51:00.730 --> 01:51:04.540 saying that if today is sunny, then tomorrow is more likely to be sunny, 01:51:04.540 --> 01:51:07.770 and if today is rainy, then tomorrow is more likely to be raining. 01:51:07.770 --> 01:51:10.130 As with before, I give it some starting probabilities, 01:51:10.130 --> 01:51:14.050 saying, at first, 50/50 chance for whether it's sunny or rainy, 01:51:14.050 --> 01:51:17.570 and then I can create the model based on that information. 01:51:17.570 --> 01:51:19.990 Again, the exact syntax of this is not so important 01:51:19.990 --> 01:51:23.770 so much as it is the data that I am now encoding into a program, such 01:51:23.770 --> 01:51:27.350 that now I can begin to do some inference. 01:51:27.350 --> 01:51:31.270 So I can give my program, for example, a list of observations-- 01:51:31.270 --> 01:51:34.420 umbrella, umbrella, no umbrella, umbrella, umbrella, so on and so forth, 01:51:34.420 --> 01:51:35.478 no umbrella, no umbrella. 01:51:35.478 --> 01:51:37.270 And I would like 01:51:37.270 --> 01:51:41.110 to figure out the most likely explanation for these observations. 01:51:41.110 --> 01:51:42.640 What is likely? 01:51:42.640 --> 01:51:43.660 Was it rain, rain, rain? 01:51:43.660 --> 01:51:46.720 Or is it more likely that one of those days was actually sunny 01:51:46.720 --> 01:51:48.742 and then it switched back to being rainy? 01:51:48.742 --> 01:51:50.200 And that's an interesting question.
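(As before, the library isn't named in this transcript, so here is the same data laid out as plain Python dictionaries, a sketch rather than the demo's actual code. The transition values are assumed to match the earlier Markov chain example, and since the full observation list isn't read out, the one below just follows the pattern described and is illustrative only.)

```python
# Emission (sensor) model: P(observation | hidden state)
emissions = {
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

# Transition model: P(tomorrow's weather | today's weather)
# (values assumed to match the earlier Markov chain example)
transitions = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

# Starting distribution: 50/50 between sunny and rainy
start = {"sun": 0.5, "rain": 0.5}

# Observed data, following the pattern described (illustrative only)
observations = [
    "umbrella", "umbrella", "no umbrella", "umbrella",
    "umbrella", "umbrella", "no umbrella", "no umbrella",
]
```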
01:51:50.200 --> 01:51:52.360 We might not be sure, because it might just 01:51:52.360 --> 01:51:56.410 be that it just so happened on a rainy day people decided not to bring 01:51:56.410 --> 01:52:00.580 an umbrella, or it could be that it switched from rainy to sunny back 01:52:00.580 --> 01:52:04.450 to rainy, which doesn't seem too likely, but it certainly could happen. 01:52:04.450 --> 01:52:07.060 And using the data we give to the Hidden Markov Model, 01:52:07.060 --> 01:52:10.620 our model can begin to predict these answers, can begin to figure it out. 01:52:10.620 --> 01:52:13.750 So we're going to go ahead and just predict these observations. 01:52:13.750 --> 01:52:15.750 And then for each of those predictions, go ahead 01:52:15.750 --> 01:52:17.292 and print out what the prediction is. 01:52:17.292 --> 01:52:19.780 And this library just so happens to have a function called 01:52:19.780 --> 01:52:23.142 predict that does this prediction process for me. 01:52:23.142 --> 01:52:28.270 So I run Python sequence.py, and the result I get is this. 01:52:28.270 --> 01:52:31.450 This is the prediction, based on the observations, of what 01:52:31.450 --> 01:52:34.750 all of those states are likely to be, and it's likely to be rain, then rain. 01:52:34.750 --> 01:52:36.625 In this case, it thinks that what most likely 01:52:36.625 --> 01:52:39.940 happened is that it was sunny for a day and then went back to being rainy. 01:52:39.940 --> 01:52:42.700 But in different situations, if it was rainy for longer, maybe, 01:52:42.700 --> 01:52:44.750 or if the probabilities were slightly different, 01:52:44.750 --> 01:52:48.190 you might imagine that it's more likely that it was rainy all the way through, 01:52:48.190 --> 01:52:53.250 and it just so happened on one rainy day people decided not to bring umbrellas. 01:52:53.250 --> 01:52:55.750 And so here, too, Python libraries can begin 01:52:55.750 --> 01:52:58.730 to allow for this sort of inference procedure. 01:52:58.730 --> 01:53:02.410 And by taking what we know and by putting it in terms of these tasks 01:53:02.410 --> 01:53:06.310 that already exist, these general tasks that work with Hidden Markov Models, 01:53:06.310 --> 01:53:10.540 any time we can take an idea and formulate it as a Hidden Markov Model, 01:53:10.540 --> 01:53:12.550 formulate it as something that has hidden 01:53:12.550 --> 01:53:15.700 states and observed emissions that result from those states, 01:53:15.700 --> 01:53:17.830 then we can take advantage of these algorithms that 01:53:17.830 --> 01:53:21.740 are known to exist for trying to do this sort of inference. 01:53:21.740 --> 01:53:25.720 So now we've seen a couple of ways that AI can begin to deal with uncertainty. 01:53:25.720 --> 01:53:28.840 We've taken a look at probability and how we can use probability 01:53:28.840 --> 01:53:32.200 to describe numerically things that are more likely or less 01:53:32.200 --> 01:53:34.990 likely to happen than other events or other variables.
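(The predict function here is computing the most likely explanation. One standard way to do that computation yourself is the Viterbi algorithm; the sketch below reuses the start, transitions, emissions, and observations dictionaries from the sketch above, and it is an illustration of the idea, not this library's actual implementation.)

```python
import math

def most_likely_explanation(observations):
    """Viterbi: most probable hidden-state sequence for the observations."""
    states = list(start)
    # best[s] = log-probability of the best path ending in state s;
    # logs avoid underflow when multiplying many small probabilities
    best = {s: math.log(start[s]) + math.log(emissions[s][observations[0]])
            for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s2 in states:
            # Choose the predecessor state that maximizes the path score
            s1 = max(states,
                     key=lambda s: best[s] + math.log(transitions[s][s2]))
            new_best[s2] = (best[s1] + math.log(transitions[s1][s2])
                            + math.log(emissions[s2][obs]))
            new_paths[s2] = paths[s1] + [s2]
        best, paths = new_best, new_paths
    # Follow the best-scoring final state back through its stored path
    return paths[max(states, key=lambda s: best[s])]

print(most_likely_explanation(observations))
```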
01:53:34.990 --> 01:53:37.750 And using that information, we can begin to construct 01:53:37.750 --> 01:53:40.810 these standard types of models, things like Bayesian networks 01:53:40.810 --> 01:53:43.180 and Markov chains and Hidden Markov Models, 01:53:43.180 --> 01:53:47.110 that all allow us to be able to describe how particular events relate 01:53:47.110 --> 01:53:49.900 to other events or how the values of particular variables 01:53:49.900 --> 01:53:53.050 relate to other variables, not for certain, but with some sort 01:53:53.050 --> 01:53:54.550 of probability distribution. 01:53:54.550 --> 01:53:57.970 And by formulating things in terms of these models that already exist, 01:53:57.970 --> 01:54:00.160 we can take advantage of Python libraries 01:54:00.160 --> 01:54:02.950 that implement these sorts of models already and allow us just 01:54:02.950 --> 01:54:06.880 to be able to use them to produce some sort of resulting effect. 01:54:06.880 --> 01:54:08.890 So all of this, then, allows our AI to begin 01:54:08.890 --> 01:54:11.290 to deal with these sorts of uncertain problems, 01:54:11.290 --> 01:54:13.720 so that our AI doesn't need to know things for certain 01:54:13.720 --> 01:54:17.080 but can make inferences about things it doesn't know for sure. 01:54:17.080 --> 01:54:19.930 Next time, we'll take a look at additional types of problems 01:54:19.930 --> 01:54:22.870 that we can solve by taking advantage of AI-related algorithms 01:54:22.870 --> 01:54:26.140 even beyond the world of the types of problems we've already explored. 01:54:26.140 --> 01:54:28.230 We'll see you next time.